mersenneforum.org gpu poly search error

2010-11-01, 01:31   #1
bdodson

Jun 2005
lehigh.edu

2^10 Posts

gpu poly search error

I've been getting np2 hanging, sometimes a day or more with no output at all --- this for the c187 --- once in 4M, and a second time in 5M. After some time fiddling to see which stage1 report(s) was causing the problem, it turns out that one of the -np1's reported an error on the stdout file from the range "searching leading coefficients from 4000001 to 4400000"
Code:
error generating or reading NFS polynomials
perhaps with a file server error, as the file was corrupt (grep reports "binary file" without locating anything; empty missing lines at the start of the file ...). I hadn't actually looked at the file, until I found the line below in msieve.dat.m; and then found a second one for the other range "from 5000001 to 5400000":
Code:
4007700 24015536 295219382270877590927 801983503937356382677653357465991274
5084748 25083344 280739665478577317867 765040171045381367403062334481129902
in a file that's supposed to have just three fields, (a5, p, m)'s. The 4M report file hung on more than one line, although many/most of the other lines were OK. The 5M file didn't report any error.

Maybe -np2 should check to see that the msieve.dat.m line is properly formatted? Losing a few lines (of 1000s, 10000s) isn't a problem; it's the hanging, and not knowing that something's gone wrong to know to go on to the rest of the valid reports that's the trouble. Unless these stage1 reports indicate a problem in the code?

-Bruce

Last fiddled with by bdodson on 2010-11-01 at 01:33 Reason: typo
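
The format check asked for above could look roughly like the following. This is only a minimal sketch and not msieve code: the names count_decimal_fields and is_valid_stage1_line are hypothetical, and it assumes each field is a plain decimal integer (an optional leading minus sign is tolerated).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of msieve.
   Returns the number of whitespace-separated fields in `line` when every
   field is an optionally signed decimal integer, or -1 as soon as a field
   contains anything else. */
int count_decimal_fields(const char *line)
{
    const char *s = line;
    int fields = 0;

    assert(line != NULL);
    for (;;) {
        const char *digits;

        while (isspace((unsigned char)*s))   /* skip separators */
            s++;
        if (*s == '\0')
            return fields;
        if (*s == '-')                       /* optional sign */
            s++;
        digits = s;
        while (isdigit((unsigned char)*s))
            s++;
        if (s == digits)                     /* no digits at all */
            return -1;
        if (*s != '\0' && !isspace((unsigned char)*s))
            return -1;                       /* junk inside the field */
        fields++;
    }
}

/* A stage 1 hit is accepted only if it has exactly three fields (a5, p, m). */
int is_valid_stage1_line(const char *line)
{
    return count_decimal_fields(line) == 3;
}
```

A reader for msieve.dat.m could call is_valid_stage1_line() on each line, print a warning for the ones that fail, and move on to the next hit instead of handing a four-field line to stage 2.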
2010-11-01, 01:56   #2
jrk

May 2008

3·5·73 Posts

Quote:
 Originally Posted by bdodson
turns out that one of the -np1's reported an error on the stdout file from the range "searching leading coefficients from 4000001 to 4400000"
Code:
error generating or reading NFS polynomials
This happens when msieve doesn't have a complete polynomial when it terminates NFS, and so will always happen when you run -np1, i.e. it is harmless.

Quote:
 Originally Posted by bdodson
perhaps with a file server err, as the file was corrupt (grep reports "binary file" without locating anything; empty missing lines at the start of the file ...). I hadn't actually looked at the file, until I found the line below in msieve.dat.m; and then found a second one for the other range "from 5000001 to 5400000":
Code:
4007700 24015536 295219382270877590927 801983503937356382677653357465991274
5084748 25083344 280739665478577317867 765040171045381367403062334481129902
in a file that's supposed to have just three fields, (a5, p, m)'s.
That looks like some kind of file corruption. FYI, here's the code in msieve which writes the .dat.m file:

Code:
/*------------------------------------------------------------------*/
static void stage1_callback_log(mpz_t high_coeff, mpz_t p, mpz_t m,
                                double coeff_bound, void *extra) {

    FILE *mfile = (FILE *)extra;
    gmp_fprintf(mfile, "%Zd %Zd %Zd\n",
                high_coeff, p, m);
    fflush(mfile);
}
It just prints three gmp integers, so I wonder how you got four? The second number in your lines looks like it doesn't belong.

2010-11-01, 02:04   #3
jasonp
Tribal Bullet

Oct 2004

2×29×61 Posts

Do you have multiple poly search processes writing to the same file? That could cause the problems you're seeing; specifying a different argument to '-s' (if you are not doing so now, or running an msieve binary from different directories) will cause output from different GPUs to go to different output files; otherwise I'd suspect a filesystem problem that's making file writes collide.
2010-11-01, 15:11   #4
bdodson

Jun 2005
lehigh.edu

2^10 Posts

Quote:
 Originally Posted by jasonp Do you have multiple poly search processes writing to the same file? That could cause the problems you're seeing; specifying a different argument to '-s' (if you are not doing so now, or running an msieve binary from different directories) will cause output from different GPUs to go to different output files; otherwise I'd suspect a filesystem problem that's making file writes collide.
No, in this case the cards were writing into different directories; and the
-np2's also in different directories than the -np1's. I suppose I could check
for disk errors by "sort -gk4 msieve.dat.m". Turns out that I missed one of
the 5M's
Code:
5000040 282950932555811249513 767572566639277931962886857122963054
5000040 282988014873105079573 767572566770261762635560319792058762
5000040 283489882496584278539 767572566780564045348713900359963743
...
5141820 303759607119684153587 763292097447179536382655903394573001
5141820 303874311272843118707 763292097967737416866712224883219685
5094360 25093256 290781228927362427487 764742170064234883654754500369055132
5084748 25083344 280739665478577317867 765040171045381367403062334481129902
Ah; maybe that accounts for all of the inputs that hang, here's 4M
Code:
4000260 264511855886585219909 802595085782688477156868964611859014
...
4128540 277554032589933354403 797544346807989836260266516658633381
4128540 277625813552862465761 797544346932719528504349552877501128
4007700 24015536 295219382270877590927 801983503937356382677653357465991274
4011384 24005864 265672669931552184923 802370401984067760730681286443632763
4010040 24005540 266427315108384809443 802383381517071143811386449187655371
-Bruce

2010-11-01, 15:44   #5
jrk

May 2008

447_16 Posts

With those corrupted lines, here's where it's getting stuck:

gnfs/poly/stage2/stage2.c in pol_expand():
Code:
mpz_tdiv_q_2exp(c->gmp_help1, gmp_d, (mp_limb_t)1);
for (i = 0; i < degree; i++) {
    while (mpz_cmpabs(c->gmp_a[i], c->gmp_help1) > 0) {
        if (mpz_sgn(c->gmp_a[i]) < 0) {
            mpz_add(c->gmp_a[i], c->gmp_a[i], gmp_d);
            mpz_sub(c->gmp_a[i+1], c->gmp_a[i+1], gmp_p);
        }
        else {
            mpz_sub(c->gmp_a[i], c->gmp_a[i], gmp_d);
            mpz_add(c->gmp_a[i+1], c->gmp_a[i+1], gmp_p);
        }
    }
}
At i==4, the while loop keeps going forever.
2010-11-01, 17:43   #6
jasonp
Tribal Bullet

Oct 2004

6722_8 Posts

Argh, that while() loop should do two or three iterations at most...
2010-11-01, 22:35   #7
Random Poster

Dec 2008

179 Posts

Quote:
 Originally Posted by bdodson
Code:
5000040 282950932555811249513 767572566639277931962886857122963054
5000040 282988014873105079573 767572566770261762635560319792058762
5000040 283489882496584278539 767572566780564045348713900359963743
...
5141820 303759607119684153587 763292097447179536382655903394573001
5141820 303874311272843118707 763292097967737416866712224883219685
5094360 25093256 290781228927362427487 764742170064234883654754500369055132
5084748 25083344 280739665478577317867 765040171045381367403062334481129902
Code:
4000260 264511855886585219909 802595085782688477156868964611859014
...
4128540 277554032589933354403 797544346807989836260266516658633381
4128540 277625813552862465761 797544346932719528504349552877501128
4007700 24015536 295219382270877590927 801983503937356382677653357465991274
4011384 24005864 265672669931552184923 802370401984067760730681286443632763
4010040 24005540 266427315108384809443 802383381517071143811386449187655371
Removing 9 characters from the beginning of those offending lines leaves what appear to be valid lines, so it looks like gmp_fprintf sometimes writes just 9 characters instead of the whole string. Maybe you could gmp_sprintf to a buffer, check the contents of the buffer (print a warning and discard the buffer if the check fails), and then fwrite the buffer to the file; this should work around the bug if it's in gmp's formatting code (which I think is more likely than a bug in the operating system's file writing code).
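
The format, check, then write sequence described above could be sketched as follows. This is an assumption-laden illustration, not msieve code: the three values are passed in as decimal strings and formatted with plain snprintf so the example stands alone, whereas the real callback would presumably use gmp_snprintf with "%Zd". The names looks_sane and write_hit_checked and the 512-byte buffer are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Cheap sanity check on a formatted hit: the buffer must end in a newline
   and everything before it must be digits plus exactly two spaces. */
int looks_sane(const char *buf, size_t len)
{
    size_t i;
    int spaces = 0;

    if (len == 0 || buf[len - 1] != '\n')
        return 0;
    for (i = 0; i + 1 < len; i++) {
        if (buf[i] == ' ')
            spaces++;
        else if (!isdigit((unsigned char)buf[i]))
            return 0;
    }
    return spaces == 2;
}

/* Formats one hit into a local buffer, checks it, and hands the whole line
   to stdio in a single fwrite call. Returns 0 on success, -1 if the line
   was discarded or the write failed. */
int write_hit_checked(FILE *mfile, const char *a5, const char *p,
                      const char *m)
{
    char buf[512];   /* hypothetical size, ample for the lines shown here */
    int n;

    assert(mfile != NULL);
    n = snprintf(buf, sizeof buf, "%s %s %s\n", a5, p, m);
    if (n < 0 || (size_t)n >= sizeof buf || !looks_sane(buf, (size_t)n)) {
        fprintf(stderr, "warning: discarding malformed stage 1 hit\n");
        return -1;
    }
    if (fwrite(buf, 1, (size_t)n, mfile) != (size_t)n)
        return -1;
    return fflush(mfile) == 0 ? 0 : -1;
}
```

This only guards against a bad result from the formatting step. It does not make the write atomic, so it would not help if several processes were appending to the same file.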

2010-11-09, 15:47   #8
bdodson

Jun 2005
lehigh.edu

400_16 Posts

Quote:
 Originally Posted by Random Poster Removing 9 characters from the beginning of those offending lines leaves what appear to be valid lines, so it looks like gmp_fprintf sometimes writes just 9 characters instead of the whole string. Maybe you could gmp_sprintf to a buffer, check the contents of the buffer (print a warning and discard the buffer if the check fails), and then fwrite the buffer to the file; this should work around the bug if it's in gmp's formatting code (which I think is more likely than a bug in the operating system's file writing code).
Ooops; here's a new winner
Code:
150672 148717065853967295793 1546349397151620 148003488673044184871 1544410841803411039998407965507924884
with sort -gk4 showing
Code:
162060 151898518516855821613 1523978967286095124294544200586784087
162060 157845105824749120277 1523978967289425984216744549275830446
150672 148717065853967295793 1546349397151620 148003488673044184871 1544410841803411039998407965507924884
151164 13151032 128683849570608945163 1545611518056124233627904175463785373
This was with an alternate to the main code, the "special_q" version. -Bruce

(I'm not sure which hung. Both occur after the last stage1 hit that
ran with a stage2 report; the one with 4 fields (of 3!) just shortly
after the new one with 5 fields (of 3 ...).)

2010-11-09, 17:49   #9
Batalov

"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

3×5×17×37 Posts

It is probably not always 9 chars. A couple of strings collide in a random place like

XXXXXXXX XXXXXXXXXXXXyyyyyy yyyyyyyyyyyyyyyy yyyyyyyyyyyyyyyyyyyyyyyyy

For this last one, the proper blue string seems to be

150672 148717065853967295793 1546349397| 151620 148003488673044184871 1544410841803411039998407965507924884

The red line should have its tail somewhere as a line with just one field, and could probably be rescued too.

Instead of sort -gk4, try awk 'NF!=3'
2010-11-09, 19:16   #10
EdH

"Ed Hall"
Dec 2009
Adirondack Mtns

2^3·3·5·31 Posts

Is it possible that the fflush(mfile) is happening prior to the full completion of writing a line? Perhaps inserting a brief delay would show. . .
2010-11-09, 19:46   #11
Batalov

"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

3·5·17·37 Posts

Yeah, that's what Random Poster said a while ago. But he also said (I think) a deeper thing, one that I tend to agree with: that this is not necessarily this application's fault, but instead the fault of either gmp or the system libc. A similar (but not exactly the same) thing happened to Prime95 with printing some invalid factors with repeated digit patterns (which could hint at a bad memory alloc, but the margins of this message are too narrow to elaborate), and that defect was also OS-specific. I am tempted to look at Prime95's source and see if he simply wrote around the library bug in disgust.

Is libgmp linked statically in this particular binary that emits errors?

Last fiddled with by Batalov on 2010-11-09 at 19:49 Reason: narrow, naroow, tpyos... blegh
