mersenneforum.org Gerbicz PRP status (was: New machine running PRP. No doublecheck default?)

 2019-02-05, 01:00 #1 Runtime Error   Sep 2017 USA 2^3·5^2 Posts Gerbicz PRP status (was: New machine running PRP. No doublecheck default?) I recently started a new-to-GIMPS machine on first-time PRP tests. On the assignment rules page, I have it set so that the machine should do one matching double-check yearly. New LLing machines start out with a double check. However, this machine jumped right into a first-time PRP test. I understand that the Gerbicz error check for PRPs is very reliable. But wouldn't it be prudent for each PRPing machine to still do an occasional double check? Or is the error check just that good? Thanks!
 2019-02-05, 03:29 #2 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 2^4×3^2×53 Posts Your observation is correct. In theory, the Gerbicz error-check is so good that an undetected error is virtually impossible. Thus, if your machine is not quite stable, you should see some error messages during the test.
 2019-02-05, 03:35 #3 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 2^4·3^2·53 Posts That said, prime95's implementation of the Gerbicz error check and recovery is flawed somehow. I'm investigating now. It will be fixed in version 29.6.
2019-02-05, 04:07   #4
Runtime Error

Sep 2017
USA

2^3×5^2 Posts

Interesting. Thank you.

Quote:
 Originally Posted by Prime95 That said, prime95's implementation of the Gerbicz error check and recovery is flawed somehow. I'm investigating now. It will be fixed in version 29.6.
Are current PRP tests considered reliable? Would you recommend going back to LLing for now? Thanks in advance!

 2019-02-05, 04:20 #5 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 2^4×3^2×53 Posts PRP tests are more reliable than LL. Carry on. I'll post more when I've finished debugging.
 2019-02-05, 18:53 #6 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 16720₈ Posts Gerbicz PRP investigation thus far:
1) A bug was fixed on Aug. 20, 2018 where a Gerbicz check could erroneously succeed (both values contained an invalid floating point value like INF or NaN). For testing, I tweaked the gwnum code to spit out bad values randomly 2% of the time.
2) If one set GerbiczCompareInterval=100 in prime.txt, then the automatic adjusting code could eventually reduce the compare interval to zero, which resulted in no error checking. In v29.6 the interval will not be allowed to get below 16.
3) If roundoff checking is enabled, there is a bug recovering from intermediate files that will eventually roll back the PRP test to the beginning. I haven't fixed this yet.
Currently, I've tested a PRP of 19937 and 44497 successfully. Simon Cunningham's failed test of M79075979 remains unexplained.
Last fiddled with by Prime95 on 2019-02-05 at 18:58
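A minimal sketch of fixes 1) and 2), for illustration only. This is not prime95's code: `MIN_COMPARE_INTERVAL`, `adjust_interval`, `gerbicz_values_match`, the halving/doubling policy, and the upper bound are all assumptions. Only the floor of 16 and the INF/NaN failure mode come from the post.

```python
import math
from typing import Sequence

MIN_COMPARE_INTERVAL = 16  # floor mentioned for v29.6; the constant name is hypothetical


def adjust_interval(current: int, had_error: bool, maximum: int = 1000) -> int:
    """Shrink the compare interval after an error, grow it otherwise (assumed policy)."""
    proposed = current // 2 if had_error else current * 2
    # Clamp so that repeated shrinking can never reach zero, which would disable checking.
    return max(MIN_COMPARE_INTERVAL, min(proposed, maximum))


def gerbicz_values_match(lhs: Sequence[float], rhs: Sequence[float]) -> bool:
    """Compare two float vectors, treating any non-finite element as a failed check."""
    if len(lhs) != len(rhs):
        return False
    if not all(math.isfinite(x) for x in lhs):
        return False  # INF/NaN must never count as agreement
    if not all(math.isfinite(x) for x in rhs):
        return False
    return list(lhs) == list(rhs)
```

With this clamp, an interval of 100 that is halved over and over settles at 16 instead of drifting to zero.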
2019-02-06, 09:06   #7
GP2

Sep 2003

5×11×47 Posts

Quote:
 Originally Posted by Prime95 2) If one set GerbiczCompareInterval=100 in prime.txt, then the automatic adjusting code could eventually reduce the compare interval to zero, which resulted in no error checking. [...] Simon Cunningham's failed test of M79075979 remains unexplained.
Are we sure these two things are unrelated?

Say, a memory corruption bug that zeroed out the compare interval?

Instead of using zero as the value that means no error checking, maybe it should be some specific randomly chosen magic 64-bit constant.

And similar for any other variable that could lead to error checking being turned off.
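The idea could be sketched like this. Everything here is hypothetical: the constant `CHECKING_DISABLED`, its value, the function names, and the upper bound on a sane interval are assumptions made for illustration, not anything in prime95.

```python
# Hypothetical sentinel: an arbitrary 64-bit pattern that stray zeroing cannot produce.
CHECKING_DISABLED = 0x9E3779B97F4A7C15

MAX_SANE_INTERVAL = 1_000_000  # assumed upper bound, for illustration only


def checking_enabled(compare_interval: int) -> bool:
    """Error checking is off only when the value is exactly the sentinel."""
    return compare_interval != CHECKING_DISABLED


def validate_interval(compare_interval: int) -> int:
    """Return the value if it is the sentinel or a sane interval, else raise."""
    if compare_interval == CHECKING_DISABLED:
        return compare_interval
    if not 1 <= compare_interval <= MAX_SANE_INTERVAL:
        # A zeroed or otherwise corrupted value is reported instead of
        # silently turning the error check off.
        raise ValueError(f"corrupt compare interval: {compare_interval}")
    return compare_interval
```

Under this scheme a zeroed variable is flagged as corruption rather than read as "checking off".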

 2019-02-07, 21:50 #8 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 16720₈ Posts Another PRP bug fixed. The routine that calculated interim and final residues was not checking the error code from converting the FFT data to binary. The subsequent rotating of the binary data (to undo the shift count) could corrupt memory. Triggering this bug always caused a crash for me -- not an incorrect final result. I do not think this is related to Simon's problem.
 2019-02-09, 16:57 #9 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 2^4×3^2×53 Posts In my review of the code I believe the biggest vulnerability is at the start of each Gerbicz block. At that point in time there is only one gwnum value sitting in memory. If there is an error reading that value then the final result will be incorrect and undetected. I believe gpuowl found a way around this vulnerability. Time to dig through preda's forum messages.
2019-02-09, 19:47   #10
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13224₈ Posts

Quote:
 Originally Posted by Prime95 In my review of the code I believe the biggest vulnerability is at the start of each Gerbicz block. At that point in time there is only one gwnum value sitting in memory. If there is an error reading that value then the final result will be incorrect and undetected. I believe gpuowl found a way around this vulnerability. Time to dig through preda's forum messages.
I remember reading something about copying data between gpu and cpu and back again and comparing the copied copy to the original to validate the copy.
A->B->C; compare A, C, to check B arrived without error, or detectable error anyway.
My notes on the gpuowl development thread say: post 727, preda, re GEC mechanism (GPU-CPU-GPU copy).

I did something related long ago to check for disk read/write error rate.
Large file A->B, many iterations of copy B->C->D->B, compare A, B.
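A rough sketch of that copy-chain test, assuming ordinary files in one working directory. The function name `copy_chain_matches` and the file names are made up for illustration.

```python
import filecmp
import os
import shutil


def copy_chain_matches(original: str, workdir: str, rounds: int = 10) -> bool:
    """Copy A->B, cycle B->C->D->B `rounds` times, then compare A with B byte by byte."""
    b, c, d = (os.path.join(workdir, name) for name in ("B", "C", "D"))
    shutil.copyfile(original, b)
    for _ in range(rounds):
        shutil.copyfile(b, c)
        shutil.copyfile(c, d)
        shutil.copyfile(d, b)
    # shallow=False forces a content comparison instead of comparing os.stat() data.
    return filecmp.cmp(original, b, shallow=False)
```

On a modern OS the copies may be served from the page cache, so a test like this only exercises the disk if the file is much larger than RAM or caching is bypassed.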

Last fiddled with by kriesel on 2019-02-09 at 19:58

2019-02-09, 23:56   #11
preda

"Mihai Preda"
Apr 2015

1,373 Posts

Quote:
 Originally Posted by Prime95 In my review of the code I believe the biggest vulnerability is at the start of each Gerbicz block. At that point in time there is only one gwnum value sitting in memory. If there is an error reading that value then the final result will be incorrect and undetected. I believe gpuowl found a way around this vulnerability. Time to dig through preda's forum messages.
I do have two buffers (two bignums) in use at all times. I call them "Data" and "Check". They are initialized either at the very beginning of a PRP test, or when loading from a savefile.

As a schematic pseudocode, for blockSize L=1000, doing the check every L2=L^2=1M iterations (but the check can be done at any multiple of L, not only L^2), and Base is 3:
Code:
[init]
Data:=3
Check:=3   // same start as Data, so that Check^(2^L) * 3 == Check * Data holds from the first block

[one block]

repeat L times: Data:=Data^2
if is-time-to-check:
    Tmp:=Check
    repeat L times: Tmp:=Tmp^2
    Tmp:=Tmp * 3
    Check:=Check * Data
    OK:= (Tmp == Check)
else: // don't check yet
    Check:=Check * Data

[repeat one block]
Thus at all times I have two inter-redundant buffers, Data and Check, and a bitflip in either should be detected.
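A small runnable simulation of this two-buffer scheme, as a sketch under stated assumptions rather than gpuowl's code. It uses Python integers modulo n instead of FFT buffers, checks after every block, and starts Check at the base value (the choice that makes Check^(2^L) * 3 == Check * Data hold from the first block). The function name `run_blocks` and the `flip_at` parameter are hypothetical.

```python
def run_blocks(n: int, block_size: int, blocks: int, base: int = 3, flip_at: int = -1) -> bool:
    """Run Gerbicz blocks mod n, checking after each one; True if every check passes.

    If flip_at >= 0, one bit of Data is flipped right after that many squarings.
    """
    data = base % n
    check = base % n  # assumption: Check starts equal to the base
    iteration = 0
    for _ in range(blocks):
        for _ in range(block_size):
            data = data * data % n
            iteration += 1
            if iteration == flip_at:
                data ^= 1 << 5  # simulated hardware bit flip in the Data buffer
        tmp = check
        for _ in range(block_size):
            tmp = tmp * tmp % n
        tmp = tmp * base % n
        check = check * data % n
        if tmp != check:
            return False  # the two buffers disagree: roll back to a good savefile
    return True
```

For example, `run_blocks(2**127 - 1, 10, 5)` passes every check, while `run_blocks(2**127 - 1, 10, 5, flip_at=40)` returns False because the flipped Data no longer matches what Check predicts.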

The round-tripping trick I use to work around data corruption during the transfer GPU<->CPU is:
0. initially data is GPU-side
1. read data from GPU to CPU
2. write what was just read back to GPU (from CPU)
3. do the check on GPU.

If the check succeeds, then I'm confident that I have good data CPU-side.

An analogy without the GPU would be:
1. write data from RAM to disk (to savefile)
2. read data back from savefile
3. do the check based on what was read from disk.
if the check succeeds, then data is likely good on disk.
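The disk analogy might look like the sketch below. The savefile layout (two hex lines) and the names `gerbicz_ok` and `save_and_verify` are invented for illustration. Here `check` is the Check value from before the current block's update, and the Gerbicz identity is evaluated only on what was read back.

```python
import os


def gerbicz_ok(data: int, check: int, n: int, block_size: int, base: int = 3) -> bool:
    """True if Check^(2^L) * base == Check * Data (mod n), with L = block_size."""
    return pow(check, 1 << block_size, n) * base % n == check * data % n


def save_and_verify(data: int, check: int, path: str, n: int, block_size: int) -> bool:
    """Write both buffers to a savefile, read them back, and check the read-back copies."""
    with open(path, "w") as f:
        f.write(f"{data:x}\n{check:x}\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to the device
    with open(path) as f:
        data_back = int(f.readline(), 16)
        check_back = int(f.readline(), 16)
    # The verdict depends only on the values that came back from the file.
    return gerbicz_ok(data_back, check_back, n, block_size)
```

The read-back may still be served from the OS cache rather than the device, so a pass shows the savefile contents are consistent but does not by itself prove the physical write.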

Last fiddled with by preda on 2019-02-10 at 00:00
