2017-04-03, 06:34   #1
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

2×11×17² Posts
Trouble restarting large job

I ran -nc2 on a 64G machine, and want to transfer the matrix and checkpoint to a faster machine with 32G memory to do the actual linear algebra.

But on two separate machines, with two attempts per machine (including restarting from an earlier checkpoint), I get something like

Code:
Fri Mar 31 20:51:19 2017  commencing Lanczos iteration (6 threads)
Fri Mar 31 20:51:19 2017  memory use: 16253.3 MB
Fri Mar 31 20:51:20 2017  restarting at iteration 626 (dim = 39615)
Fri Mar 31 20:53:58 2017  linear algebra at 0.1%, ETA 1106h20m
Fri Mar 31 20:55:29 2017  error: corrupt state, please restart from checkpoint
I've tried restarting on the 64G machine I was using originally, and that ran for eight hours without giving such a message.
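To see how long a run survives before it fails, the timestamps in a log like the one above can be diffed. A minimal Python sketch, assuming every line starts with a 24-character timestamp as in the excerpt and that day and month names parse under an English locale; the function names are made up for illustration and are not part of msieve:

```python
from datetime import datetime

# Log excerpt copied from the post above.
LOG = """\
Fri Mar 31 20:51:19 2017  commencing Lanczos iteration (6 threads)
Fri Mar 31 20:51:19 2017  memory use: 16253.3 MB
Fri Mar 31 20:51:20 2017  restarting at iteration 626 (dim = 39615)
Fri Mar 31 20:53:58 2017  linear algebra at 0.1%, ETA 1106h20m
Fri Mar 31 20:55:29 2017  error: corrupt state, please restart from checkpoint
"""


def parse_line(line):
    """Split one log line into (timestamp, message)."""
    stamp, message = line[:24], line[24:].strip()
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"), message


def seconds_until_corruption(log_text):
    """Seconds from 'commencing Lanczos iteration' to the first
    'corrupt state' error, or None if the error never appears."""
    start = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        when, message = parse_line(line)
        if message.startswith("commencing Lanczos iteration"):
            start = when
        elif "corrupt state" in message and start is not None:
            return (when - start).total_seconds()
    return None


print(seconds_until_corruption(LOG))
```

For the excerpt above this gives 250 seconds, so the error shows up a little over four minutes after the restart.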
2017-04-03, 20:42   #2
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

2×2,861 Posts

Is it trying to use a different block size on the 32gb machines?
2017-04-03, 22:32   #3
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

2·11·17² Posts

Quote:
Originally Posted by henryzz
Is it trying to use a different block size on the 32gb machines?
Different superblock size, though same block size:

Code:
tractor (64G)
Sun Apr  2 22:07:29 2017  using block size 8192 and superblock size 983040 for processor cache size 10240 kB
pumpkin (32G, i7-4930K)
Fri Mar 31 20:46:40 2017  sparse part has weight 4557683460 (118.90/col)
Fri Mar 31 20:46:40 2017  using block size 8192 and superblock size 1179648 for processor cache size 12288 kB
butternut (32G, i7-5820K)
Fri Mar 31 20:08:38 2017  using block size 8192 and superblock size 1179648 for processor cache size 12288 kB
Fri Mar 31 20:13:16 2017  commencing Lanczos iteration (6 threads)
I'm currently redoing the processing on butternut in the hope that I can run the whole job there, but RelProcTime is about 26 hours, so I will not have results immediately.
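The three log lines above are consistent with the superblock size simply scaling with the detected cache size. A small Python check, using only the numbers printed in those logs; the factor it finds is an observation from these three lines, not something taken from msieve's source or documentation:

```python
# (cache size in kB, block size, superblock size) as printed in the logs above.
OBSERVED = {
    "tractor": (10240, 8192, 983040),
    "pumpkin": (12288, 8192, 1179648),
    "butternut": (12288, 8192, 1179648),
}


def superblock_ratios(observed):
    """For each host, return (superblock / cache_kB, superblock / block)."""
    ratios = {}
    for host, (cache_kb, block, superblock) in observed.items():
        ratios[host] = (superblock / cache_kb, superblock / block)
    return ratios


for host, (per_cache_kb, blocks) in superblock_ratios(OBSERVED).items():
    print(f"{host}: superblock = {per_cache_kb:g} x cache kB = {blocks:g} blocks")
```

All three hosts give a superblock size of 96 times the cache size in kB, so the 10240 kB cache on tractor and the 12288 kB caches on pumpkin and butternut account for the difference. Whether the checkpoint depends on the superblock size is not something these logs settle.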

Last fiddled with by fivemack on 2017-04-03 at 22:33
2018-01-03, 22:56   #4
Xyzzy

"Mike"
Aug 2002

7,699 Posts

We are experiencing the same error when starting a job.

We have tried a binary we compiled ourself and one that is from someone else that is known to work.

We have also tried several different target densities.

FWIW, we ran msieve.dat through "remdups" prior to starting the job.

Code:
Wed Jan  3 13:14:14 2018  commencing linear algebra
Wed Jan  3 13:14:40 2018  read 23031042 cycles
Wed Jan  3 13:15:14 2018  cycles contain 64089728 unique relations
Wed Jan  3 13:49:41 2018  read 64089728 relations
Wed Jan  3 13:52:03 2018  using 20 quadratic characters above 4294917296
Wed Jan  3 13:57:39 2018  building initial matrix
Wed Jan  3 14:09:31 2018  memory use: 8680.3 MB
Wed Jan  3 14:09:40 2018  read 23031042 cycles
Wed Jan  3 14:09:43 2018  matrix is 23026809 x 23031042 (7496.0 MB) with weight 2179834698 (94.65/col)
Wed Jan  3 14:09:43 2018  sparse part has weight 1711677031 (74.32/col)
Wed Jan  3 14:16:50 2018  filtering completed in 3 passes
Wed Jan  3 14:16:54 2018  matrix is 22923906 x 22924106 (7476.6 MB) with weight 2173903730 (94.83/col)
Wed Jan  3 14:16:54 2018  sparse part has weight 1707775293 (74.50/col)
Wed Jan  3 14:17:48 2018  matrix starts at (0, 0)
Wed Jan  3 14:17:51 2018  matrix is 22923906 x 22924106 (7476.6 MB) with weight 2173903730 (94.83/col)
Wed Jan  3 14:17:51 2018  sparse part has weight 1707775293 (74.50/col)
Wed Jan  3 14:17:51 2018  saving the first 48 matrix rows for later
Wed Jan  3 14:17:55 2018  matrix includes 64 packed rows
Wed Jan  3 14:17:57 2018  matrix is 22923858 x 22924106 (7142.7 MB) with weight 1764421261 (76.97/col)
Wed Jan  3 14:17:57 2018  sparse part has weight 1643180859 (71.68/col)
Wed Jan  3 14:17:58 2018  using block size 8192 and superblock size 294912 for processor cache size 3072 kB
Wed Jan  3 14:19:57 2018  commencing Lanczos iteration (2 threads)
Wed Jan  3 14:19:57 2018  memory use: 6034.5 MB
Wed Jan  3 14:23:49 2018  linear algebra at 0.0%, ETA 936h37m
Wed Jan  3 14:25:03 2018  checkpointing every 30000 dimensions
Wed Jan  3 16:46:03 2018  error: corrupt state, please restart from checkpoint
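Since target density came up, note that the "(…/col)" figure on each "matrix is" line is just the matrix weight divided by the number of columns. A minimal Python sketch that checks this against the lines above; the regular expression is written against this excerpt only and is an assumption about the log format, and the function name is made up for illustration:

```python
import re

# Matches lines such as:
#   matrix is 22923858 x 22924106 (7142.7 MB) with weight 1764421261 (76.97/col)
MATRIX_LINE = re.compile(
    r"matrix is (\d+) x (\d+) \(([\d.]+) MB\) with weight (\d+) \(([\d.]+)/col\)"
)


def density_from_line(line):
    """Return (reported, computed) weight per column, or None if the
    line is not a 'matrix is ...' line."""
    m = MATRIX_LINE.search(line)
    if m is None:
        return None
    cols, weight = int(m[2]), int(m[4])
    return float(m[5]), weight / cols


line = ("Wed Jan  3 14:17:57 2018  matrix is 22923858 x 22924106 (7142.7 MB) "
        "with weight 1764421261 (76.97/col)")
reported, computed = density_from_line(line)
print(f"reported {reported:.2f}/col, computed {computed:.2f}/col")
```

The same check agrees with the earlier "matrix is" lines in this log as well.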
Attached file: msieve.log (173.7 KB)
2018-01-04, 01:13   #5
RichD

Sep 2008
Kansas

3094₁₀ Posts

If you are already into Block Lanczos, one option to try (back up the folder first) is:
Code:
./msieve -v -t 2 -ncr skip_matbuild=1 -nc3
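Backing up the job folder before trying this can be scripted. A minimal Python sketch, assuming the whole job (relations, matrix, checkpoint) lives in one directory; `backup_job_dir` and the `.bak-` naming are made up for illustration and have nothing to do with msieve itself:

```python
import shutil
import time
from pathlib import Path


def backup_job_dir(job_dir, suffix=None):
    """Copy job_dir to a sibling directory named '<name>.bak-<suffix>'.

    The suffix defaults to the current local time. Returns the path of
    the copy. The original directory is left untouched.
    """
    job_dir = Path(job_dir).resolve()
    if not job_dir.is_dir():
        raise NotADirectoryError(str(job_dir))
    if suffix is None:
        suffix = time.strftime("%Y%m%d-%H%M%S")
    dest = job_dir.with_name(f"{job_dir.name}.bak-{suffix}")
    # copytree refuses to overwrite an existing destination, so an
    # earlier backup with the same suffix is never clobbered.
    shutil.copytree(job_dir, dest)
    return dest


# Example (hypothetical path):
#   backup_job_dir("/data/nfs-job")
```

Copying rather than moving keeps the original files in place, so the command above can be run directly afterwards; for a matrix of several GB the copy needs that much free disk space again.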
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.