mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Mlucas (https://www.mersenneforum.org/forumdisplay.php?f=118)
-   -   Mlucas v19 available (https://www.mersenneforum.org/showthread.php?t=24990)

ewmayer 2019-12-01 23:09

Mlucas v19 available
 
[url=http://www.mersenneforum.org/mayer/README.html]Mlucas v19 has gone live[/url]. Use this thread to report bugs, build issues, and for any other related discussion.

kriesel 2019-12-03 01:19

[LEFT]Haven't tried it yet, but congrats on getting it out.
[/LEFT]

ewmayer 2019-12-03 21:58

1 Attachment(s)
[QUOTE=kriesel;531891][LEFT]Haven't tried it yet, but congrats on getting it out.
[/LEFT][/QUOTE]

Thanks. Meanwhile I have discovered a bug related to the new PRP-handling logic, of the kind I expected would be shaken out by further testing. This one specifically affects exponents really close to an FFT-length breakover point - I discovered it when I fired up a first-time PRP test of M96365419, which is very close to the 5120K-FFT exponent limit. It turns out the Gerbicz-check-related breaking of the usual checkpointing interval into multiple smaller subintervals (at the end of each of which we update the G-checkproduct) breaks the is-roundoff-error-reproducible-on-retry logic. It's a simple fix; I just uploaded updated versions of the release tarball and ARM prebuilt binaries, but folks who previously built and are running the Dec 1 code snapshot can use the simpler expedient of an incremental-rebuild-and-relink of the single attached sourcefile.

ewmayer 2019-12-05 21:51

An interesting subtheme re. the newly-added PRP assignment-type support and the Gerbicz check ... shortly after the initial v19 release, I got e-mail from George about the importance of adding redundancy to the G-checking mechanism:
[quote]On Dec 4, 2019, at 7:57 AM, George Woltman wrote:

You are basically examining the code looking for any place a one-bit error could doom your result. This is most likely to occur right after a Gerbicz compare. In my first implementation, after the compare succeeded I threw away the equal value and only had the one value in memory (and wrote a save file with one value). If a cosmic ray hit during that time period, the end result would be wrong.

I now keep the compared (and equal) value in memory. One starts the next Gerbicz comparison value, the other continues the PRP exponentiation. Similarly when the computation ends, I generate a residue from both values and check that the res64s match.[/quote]
My reply follows.

Well, let's review how my code does things, say, starting from a post-interrupt savefile-read:

1. Read PRP residue into array a[], accumulated G-checkproduct into b[]. Both of these residues are written to savefiles together with associated full-residue checksums - I use the Selfridge-Hurwitz residues (full-length residue mod (2^35-1) and mod (2^36-1)) for that - and the checksums compared with those recomputed during the read-from-file.

2. Do an iteration interval leading up to the next savefile update, 10k or 100k mod-squarings of a[]. Every 1000 squarings update b[] *= a[]. The initial b[] is in pure-integer form; on subsequent mul-by-a[] updates the result is left in the partially-fwd-FFTed form returned by the carry step, i.e. fwd-weighted and initial-fwd-FFT-pass done.

3. On final G-checkproduct update of the current iteration interval, 1000 iterations before the next savefile write, save a copy of the current G-checkproduct b[] in a third array c[], before doing the usual G-checkproduct update b[] *= a[].

4. At end of the current iteration interval, prior to writing savefiles, do 1000 mod-squarings of c[] and compare the result to b[]. If mismatch, no savefiles written, instead roll back to last 'good' G-checkproduct data, which in my current first implementation means the previous multiple of 1M iterations.

So during the above, the G-checkproduct accumulator b[] is vulnerable to a 1-bit error, of the kind which would not show up, say, via a roundoff error during the ensuing *= a[] FFT-mul update.

So, what to do? Since the b[] data are kept in partially-fwd-FFTed form for most of the iteration interval, the Selfridge-Hurwitz (or similar CRC-style) checksums can't be easily computed from that. I think the easiest thing would be, every time I do an update b[] *= a[], do a memcpy to save a separate copy of the result, and compare that vs b[] prior to each update of the latter.

[followup e-mail a few hours later] Additional thoughts:

We are essentially trying to guard against a "false G-check failure", in the sense that the G-check might fail not because the PRP-residue array a[] had gotten corrupted but rather because the G-checkproduct accumulator b[] had. So every time we update b[] (or read it from a savefile) we also make a copy c[] = b[], and prior to each b[] *= a[] update we check that b == c. OK, but if at some point we find b != c, how can we tell which of the 2 is the good one? Obvious answer is to compute some kind of whole-array checksum at every update. Since post-update b[] may be in some kind of partially-FFTed state (that is the case for my code) the checksum needs to not assume integer data - perhaps simply treat the floats in b[] as integer bitfields. Would something as simple as computing a mod-2^64 sum of the uint64-reinterpretation-casted elements of b[] suffice, do you think? Further, any such checksum will be a much smaller bit-corruption target than b[], but to be safe one should probably make at least 2 further copies of *it*. Call our 3 redundant checksums s1,s2,s3; then the attendant logic would look something like this:

[code]// Mod-2^64 sum of elements of double-float array a[], treated as uint64 bitfields:
uint64 sum64(double a[], int n) {
	int i;
	uint64 sum = 0ull;
	for(i = 0; i < n; i++)
		sum += *(uint64 *)(a+i); // Type-punning cast of a[i]
	return sum;
}
// Simple majority-vote consensus:
uint64 consensus_checksum(uint64 s1, uint64 s2, uint64 s3) {
	if(s1 == s2) return s1;
	if(s1 == s3) return s1;
	if(s2 == s3) return s2;
	return 0ull;
}

int n; // FFT length in doubles
double a[], b[], c[]; // a[] is PRP residue; b,c are redundant copies of G-checkproduct array
uint64 s1,s2,s3; // Triply-redundant whole-array checksum on b,c-arrays
...
[bunch of mod-squaring updates of a[]]
// Prior to each b[]-update, check integrity of array data:
if(b[] != c[]) { // Houston, we have a problem
	s1 = consensus_checksum(s1,s2,s3);
	if(s1 == sum64(b,n)) // b-data good
		/* no-op */ ;
	else if(s1 == sum64(c,n)) // c-data good, copy back into b
		b[] = c[];
	else // Catastrophic data corruption
		[roll back to last-good G-check savefile]
}
b[] *= a[]; // G-checkproduct update
s1 = s2 = s3 = sum64(b,n); // Triply-redundant whole-array checksum update
c[] = b[]; // Make a copy
[/code]
And if that is an effective anti-corruption strategy, the obvious question is, why not apply it to the main residue array a[] itself? Likely performance impact is one issue - the cost of making a copy of a[] and of updating the whole-array checksum at each iteration, while O(n) and thus certainly smaller than that of an FFT-mod-squaring, is likely going to be nontrivial, a few percent I would guess.

kriesel 2019-12-05 23:24

After the dust settles, an update on the Mlucas save file format description to final v18, and to v19 PRP would be appreciated. For your convenience, [url]https://www.mersenneforum.org/showpost.php?p=489491&postcount=2[/url]

ewmayer 2019-12-06 02:25

[QUOTE=kriesel;532137]After the dust settles, an update on the Mlucas save file format description to final v18, and to v19 PRP would be appreciated. For your convenience, [url]https://www.mersenneforum.org/showpost.php?p=489491&postcount=2[/url][/QUOTE]

Simple: PRP-test savefiles tack on an additional copy of the last 4 items in the 'current Mlucas file format' list - the full-length residue byte-array (this one holding the accumulated Gerbicz checkproduct) and the 3 associated checksums totaling a further 18 bytes. Thus, where an LL-savefile read reads one such residue+checksum data quartet, then 3 bytes for the FFT-length-in-Kdoubles the code was using at time of savefile write (*note* your quote in post #2 needs to change that from 4 to 3 bytes), then 8 bytes for the circular shift to be applied to the (shift-removed) savefile residue, a PRP-savefile read follows those reads with another read of a residue+checksum data quartet.

kriesel 2019-12-06 07:18

[QUOTE=ewmayer;532150]Simple, PRP test savefiles tack on an additional version of the last 4 items in the 'current Mlucas file format' list - full-length residue byte-array (this one holding the accumulated Gerbicz checkproduct) and the 3 associated checksums totaling a further 18 bytes. Thus, where an LL-savefile read reads one such residue+checksum data quartet, 3 bytes for FFT-length-in-Kdoubles which the code was using at time of savefile write (*note* your quote in post #2 needs to change that from 4 to 3 bytes), 8 bytes for circular-shift to be applied to the (shift-removed) savefile residue, a PRP-savefile read follows those reads with another read of a residue+checksum data quartet.[/QUOTE]Thanks; [url]https://www.mersenneforum.org/showpost.php?p=489491&postcount=2[/url] is updated and extended.

ewmayer 2019-12-06 20:13

[QUOTE=kriesel;532174]Thanks; [url]https://www.mersenneforum.org/showpost.php?p=489491&postcount=2[/url] is updated and extended.[/QUOTE]

FYI, the "master reference" for savefile format is, as always, the actual code - the relevant functions are read|write_ppm1_savefiles in the Mlucas.c source. ('ppm1' is short for 'Primality-test and P-1' ... the latter as yet unsupported, but we remain ever-optimistic ... the 2-input FFT-modmul support added in v19 for Gerbicz-checking will help in that regard, since p-1 stage 2 needs that capability.)

ewmayer 2020-01-03 03:05

2 Attachment(s)
[b]***Patch *** 03 Jan 2020:[/b] This patch adds one functionality-related item, namely adding redundancy to the PRP-test Gerbicz-check mechanism to prevent data corruption in the G-check residue from causing a "false Gerbicz-check failure", i.e. a failure not due to a corrupted PRP-test residue itself. This more or less follows the schema laid out in post #4.

I have also patched another logic bug related to roundoff-error-retry: this one occasionally caused the run to switch to the next-larger FFT length when encountering a reproducible roundoff error, rather than first retrying at the current FFT length but with a shorter carry-chain recurrence computation for the DWT weights. Not fatal, just suboptimal in terms of CPU usage.

[b]NOTE ALSO[/b] that I hit a Primenet-server-side bug on 31 Dec when I used the primenet.py script to submit my first batch of v19 LL-test results (my previous v19 submissions were all PRP-test ones). The server code was incorrectly expecting a Prime95-style checksum as part of such results lines.

The really nasty part was that I almost missed it. Until now, the primenet.py script grepped the page resulting from each attempted result-line submission for "Error code"; if it found that, it emitted a user-visible echo of the error message, and the attempted submission line was not copied to the results_sent.txt file for archiving. In this case - I only saw this after retrying one of the submits via the manual test webform - there was "Error" on the returned page, but it was not followed by "code", so the script treated the submissions as successful. I only noticed the problem when I checked the exponent-status page for one of the expos and saw no result had been registered.

James Heinrich has fixed the server-side issue, and to be safe I've tweaked the primenet.py script to grep for just "Error". If you used the script to submit any v19 LL-test results (PRP tests were being correctly handled at both ends) prior to the current patch, please delete the corresponding lines from your results_sent.txt file and retry submitting using the patched primenet.py file. To be safe, check the exponent status at mersenne.org to make sure your results appear there.

I just uploaded updated versions of the release tarball and ARM prebuilt binaries, but folks who previously built and are running the Dec 3 code snapshot can use the simpler expedient of an incremental-rebuild-and-relink of the attached Mlucas.c sourcefile. The also-attached tweaked primenet.py file - matching the updated one in the release tarball - is not strictly necessary now that James has made the above-described server-side bugfix, but better safe than sorry, I say.

Jumba 2020-01-16 14:14

Version 19.0 error
 
I'm getting the following error after getting to the 100% mark:

ERROR: at line 2313 of file ../src/Mlucas.c
Assertion failed: After short-div, R != 0 (mod B)

Nothing has been written to the results.txt file since I started the run a week ago. I can restart the process, and it resumes from just before the end, but still spits out the same error after a couple minutes.

ewmayer 2020-01-16 20:15

[QUOTE=Jumba;535213]I'm getting the following error after getting to the 100% mark:

ERROR: at line 2313 of file ../src/Mlucas.c
Assertion failed: After short-div, R != 0 (mod B)

Nothing has been written to the results.txt file since I started the run a week ago. I can restart the process, and it resumes from just before the end, but still spits out the same error after a couple minutes.[/QUOTE]

That looks like you've found a bug in the PRP-residue postprocessing code ... could you upload your p[exponent] savefile to Dropbox or similar site so I can download it and re-run the final few (whatever) iterations within a debug session? PM me the resulting download location and the worktodo.ini file entry.

In the meantime, if you've not already done so, I suggest you switch the top 2 entries in worktodo.ini and start on the next assignment. By the time that finishes you can grab a bug-patched version of the code, which should allow you to successfully complete your above run.

Oh, your data should be fine, like I said this appears to strictly be a postprocessing bug.

