mersenneforum.org  

Old 2019-02-05, 01:00   #1
Runtime Error
 
Sep 2017
USA

2^3·5^2 Posts
Gerbicz PRP status (was: New machine running PRP. No doublecheck default?)

I recently started a new-to-GIMPS machine on first-time PRP tests. On the assignment rules page, I have it set so that the machine should do one matching double-check yearly. New LL machines start out with a double check. However, this machine jumped right into a first-time PRP test.

I understand that the Gerbicz error-check for PRPs is very reliable. But wouldn't it be prudent for each PRPing machine to still do an occasional double check? Or is the error check just that good?

Thanks!
Old 2019-02-05, 03:29   #2
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2^4×3^2×53 Posts

Your observation is correct.

In theory, the Gerbicz error-check is so good that an undetected error is virtually impossible. Thus, if your machine is not quite stable, you should see some error messages during the test.
Old 2019-02-05, 03:35   #3
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2^4·3^2·53 Posts

That said, prime95's implementation of the Gerbicz error check and recovery is flawed somehow. I'm investigating now. It will be fixed in version 29.6.
Old 2019-02-05, 04:07   #4
Runtime Error
 
Sep 2017
USA

2^3×5^2 Posts

Interesting. Thank you.

Quote:
Originally Posted by Prime95
That said, prime95's implementation of the Gerbicz error check and recovery is flawed somehow. I'm investigating now. It will be fixed in version 29.6.
Are current PRP tests considered reliable? Would you recommend going back to LLing for now? Thanks in advance!
Old 2019-02-05, 04:20   #5
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2^4×3^2×53 Posts

PRP tests are more reliable than LL. Carry on. I'll post more when I've finished debugging.
Old 2019-02-05, 18:53   #6
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

16720_8 Posts

Gerbicz PRP investigation thus far:

1) A bug was fixed on Aug. 20, 2018 where a Gerbicz check could erroneously succeed (both values contained an invalid floating point value like INF or NaN).

For testing, I tweaked the gwnum code to spit out bad values randomly 2% of the time.

2) If one set GerbiczCompareInterval=100 in prime.txt, then the automatic adjusting code could eventually reduce the compare interval to zero, which resulted in no error checking. In v29.6 the interval will not be allowed to get below 16.

3) If roundoff checking is enabled, there is a bug recovering from intermediate files that will eventually roll back the PRP test to the beginning. I haven't fixed this yet.
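
A minimal sketch of the guards described in items 1 and 2, not prime95's actual code. The halving and growth rule, the function names, and the upper bound of 1000 are assumptions made for illustration; only the floor of 16 and the INF/NaN problem come from the post.

```python
import math

MIN_COMPARE_INTERVAL = 16  # floor stated in the post for v29.6


def adjust_compare_interval(current, had_error,
                            minimum=MIN_COMPARE_INTERVAL, maximum=1000):
    """Hypothetical auto-adjust rule: halve after an error, grow ~10% otherwise.

    Whatever the rule proposes, the result is clamped so it can never
    reach zero (which would silently disable error checking).
    """
    if had_error:
        proposed = current // 2
    else:
        proposed = current + max(1, current // 10)
    return max(minimum, min(maximum, proposed))


def values_match(a, b):
    """Compare two floating-point check values.

    Non-finite inputs (INF or NaN) are treated as a failed check, so a
    pair of invalid values can never count as a successful comparison.
    """
    if not (math.isfinite(a) and math.isfinite(b)):
        return False
    return a == b
```

With this rule, an interval that starts at 100 and sees repeated errors goes 50, 25, 16, 16, ... and stays at the floor.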

So far, I've successfully run PRP tests of exponents 19937 and 44497.

Simon Cunningham's failed test of M79075979 remains unexplained.

Last fiddled with by Prime95 on 2019-02-05 at 18:58
Old 2019-02-06, 09:06   #7
GP2
 
Sep 2003

5×11×47 Posts

Quote:
Originally Posted by Prime95
2) If one set GerbiczCompareInterval=100 in prime.txt, then the automatic adjusting code could eventually reduce the compare interval to zero, which resulted in no error checking.

[...]

Simon Cunningham's failed test of M79075979 remains unexplained.
Are we sure these two things are unrelated?

Say, a memory corruption bug that zeroed out the compare interval?

Instead of using zero as the value that means no error checking, maybe it should be some specific randomly chosen magic 64-bit constant.

And similarly for any other variable that could lead to error checking being turned off.
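
A rough sketch of what this suggestion could look like. The constant, the names, and the decision to raise on zero are all hypothetical; the thread does not say how prime95 represents a disabled check.

```python
# Arbitrary 64-bit pattern standing in for "error checking deliberately off".
# Hypothetical value: a stray memory wipe to zero cannot produce it by accident.
CHECKING_DISABLED = 0x9E3779B97F4A7C15


def checking_enabled(compare_interval):
    """Return True if Gerbicz checking should run with this interval.

    Only the exact magic constant disables checking. Zero or a negative
    value is treated as corruption rather than as "off".
    """
    if compare_interval == CHECKING_DISABLED:
        return False
    if compare_interval <= 0:
        raise ValueError("compare interval is corrupted: %r" % compare_interval)
    return True
```

The design point is that the "off" state becomes something that has to be set on purpose, while a zeroed variable is reported instead of obeyed.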
Old 2019-02-07, 21:50   #8
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

16720_8 Posts

Another PRP bug fixed. The routine that calculated interim and final residues was not checking the error code from converting the FFT data to binary. The subsequent rotating of the binary data (to undo the shift count) could corrupt memory.
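
To illustrate the kind of fix described here, a small sketch under stated assumptions: `convert` is a hypothetical stand-in for the FFT-to-binary conversion and returns an `(error_code, value)` pair, and the number being tested is 2^p - 1, so that undoing a shift count is a rotation of a p-bit value. None of these names are gwnum's.

```python
def rotate_right(value, shift, p):
    """Rotate a p-bit value right by `shift` bits.

    Modulo 2**p - 1 this divides by 2**shift, which undoes a shift count.
    """
    shift %= p
    mask = (1 << p) - 1
    return ((value >> shift) | (value << (p - shift))) & mask


def unshifted_residue(convert, shift, p):
    """Convert to binary, then undo the shift only if conversion succeeded."""
    error_code, shifted = convert()
    if error_code != 0:
        # Bail out before touching the buffer: its contents are not valid.
        raise RuntimeError("conversion failed with error code %d" % error_code)
    return rotate_right(shifted, shift, p)
```

For example, with p = 7 the value 100 shifted by 3 bits is 100·8 mod 127 = 38, and `rotate_right(38, 3, 7)` gives back 100.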

Triggering this bug always caused a crash for me -- not an incorrect final result.

I do not think this is related to Simon's problem.
Old 2019-02-09, 16:57   #9
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2^4×3^2×53 Posts

In my review of the code I believe the biggest vulnerability is at the start of each Gerbicz block. At that point in time there is only one gwnum value sitting in memory. If there is an error reading that value then the final result will be incorrect and undetected.

I believe gpuowl found a way around this vulnerability. Time to dig through preda's forum messages.
Old 2019-02-09, 19:47   #10
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13224_8 Posts

Quote:
Originally Posted by Prime95
In my review of the code I believe the biggest vulnerability is at the start of each Gerbicz block. At that point in time there is only one gwnum value sitting in memory. If there is an error reading that value then the final result will be incorrect and undetected.

I believe gpuowl found a way around this vulnerability. Time to dig through preda's forum messages.
I remember reading something about copying data between gpu and cpu and back again and comparing the copied copy to the original to validate the copy.
A->B->C; compare A, C, to check B arrived without error, or detectable error anyway.
My notes on the gpuowl development thread point to post 727, by preda, on the GEC mechanism (GPU-CPU-GPU copy):
http://www.mersenneforum.org/showthread.php?t=22204

I did something related long ago to check for disk read/write error rate.
Large file A->B, many iterations of copy B->C->D->B, compare A, B.
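
The disk test described above could be sketched like this. File names and the `rounds` parameter are made up for the example, and on a real system the OS file cache may serve the copies from RAM, so this is an illustration of the idea rather than a reliable disk tester.

```python
import filecmp
import os
import shutil


def copy_chain_ok(source, workdir, rounds=10):
    """Copy A->B, cycle B->C->D->B `rounds` times, then compare A with B.

    Any corruption introduced by one of the copies is carried along the
    chain and shows up as a byte-level difference at the end.
    """
    b, c, d = (os.path.join(workdir, name) for name in ("B", "C", "D"))
    shutil.copyfile(source, b)
    for _ in range(rounds):
        shutil.copyfile(b, c)
        shutil.copyfile(c, d)
        shutil.copyfile(d, b)
    # shallow=False forces a content comparison instead of a stat() check.
    return filecmp.cmp(source, b, shallow=False)
```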

Last fiddled with by kriesel on 2019-02-09 at 19:58
Old 2019-02-09, 23:56   #11
preda
 
"Mihai Preda"
Apr 2015

1,373 Posts

Quote:
Originally Posted by Prime95
In my review of the code I believe the biggest vulnerability is at the start of each Gerbicz block. At that point in time there is only one gwnum value sitting in memory. If there is an error reading that value then the final result will be incorrect and undetected.

I believe gpuowl found a way around this vulnerability. Time to dig through preda's forum messages.
I do have two buffers (two bignums) in use at all times. I call them "Data" and "Check". They are initialized either at the very beginning of a PRP test, or when loading from a savefile.

As schematic pseudocode, for block size L=1000, doing the check every L2=L^2=1M iterations (the check can be done at any multiple of L, not only L^2), with Base 3:
Code:
[init]
Data:=3
Check:=3


[one block]

repeat L times: Data:=Data^2
if is-time-to-check:
    Tmp:=Check
    repeat L times: Tmp:=Tmp^2
    Tmp:=Tmp * 3
    Check:=Check * Data
    OK:= (Tmp == Check)
else: // don't check yet
    Check:=Check * Data

[repeat one block]
Thus at all times I have two inter-redundant buffers, Data and Check, and a bitflip in either should be detected.
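
A small runnable sketch of the block scheme above, using Python integers in place of GPU bignums. The function name, its parameters, and the `corrupt_block` fault injector are inventions for the example, not gpuowl code. Note that the sketch starts `check` at the base value; that is what makes the comparison Check_old^(2^L) · 3 == Check_old · Data hold from the first block on.

```python
def gerbicz_blocks(modulus, num_blocks, block_size,
                   check_every=1, base=3, corrupt_block=None):
    """Run `num_blocks` blocks of squarings with a Gerbicz-style check.

    Returns the block number at which a mismatch was detected, or None
    if every check passed. `corrupt_block` flips one bit of Data at the
    end of that block to simulate a hardware error.
    """
    data = base % modulus
    check = base % modulus
    for block in range(1, num_blocks + 1):
        for _ in range(block_size):
            data = data * data % modulus
        if block == corrupt_block:
            data ^= 1  # simulated bit flip
        old_check = check
        check = check * data % modulus
        if block % check_every == 0:
            # Tmp := old Check squared L times, then multiplied by the base.
            tmp = pow(old_check, 1 << block_size, modulus) * base % modulus
            if tmp != check:
                return block
    return None


M127 = (1 << 127) - 1
print(gerbicz_blocks(M127, 6, 10, check_every=3))                   # no error
print(gerbicz_blocks(M127, 6, 10, check_every=3, corrupt_block=2))  # caught later
```

The second call corrupts Data in a block that is not itself checked; the mismatch still surfaces at the next scheduled check because the bad value has been folded into Check.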

The round-tripping trick I use to work around data corruption during the transfer GPU<->CPU is:
0. initially data is GPU-side
1. read data to CPU
2. write what was just read back to GPU (from CPU)
3. do the check on GPU.

If the check succeeds, then I'm confident that I have good data CPU-side.

An analogy without the GPU would be:
1. write data from RAM to disk (to savefile)
2. read data back from savefile
3. do the check based on what was read from disk.
If the check succeeds, then the data on disk is likely good.
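
The savefile analogy can be sketched as follows. This is an illustration only: the JSON layout, the function names, and the consistency test Check^(2^L) · Base == Check · Data^(2^L) (which assumes Check was started at the base value and that the state is saved at a block boundary) are assumptions of this example, not gpuowl's file format or code.

```python
import json


def state_is_consistent(data, check, modulus, block_size, base=3):
    """Test a saved (Data, Check) pair by advancing one block virtually."""
    lhs = pow(check, 1 << block_size, modulus) * base % modulus
    rhs = check * pow(data, 1 << block_size, modulus) % modulus
    return lhs == rhs


def save_and_verify(path, data, check, modulus, block_size):
    """Write the state, read it back, and check what was actually read."""
    with open(path, "w") as f:
        json.dump({"data": hex(data), "check": hex(check)}, f)
    with open(path) as f:
        loaded = json.load(f)
    # The check runs on the read-back copy, never on the in-memory values.
    return state_is_consistent(int(loaded["data"], 16),
                               int(loaded["check"], 16),
                               modulus, block_size)
```

The important detail is the last step: the values handed to the check come from the file, so a passing check says something about the bytes on disk and not only about RAM.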

Last fiddled with by preda on 2019-02-10 at 00:00

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.