#78 |
"Forget I exist"
Jul 2009
Dartmouth NS
8450₁₀ Posts |
https://maths-people.anu.edu.au/~bre..._ACCMCC_10.pdf pretty easy to find once you know what to search for.
#79 | |
Serpentine Vermin Jar
Jul 2014
2×13×131 Posts |
In fact, when I started and there were thousands of pending mismatches (now only a handful), I used those general trends to guess the winner and loser.

On the other hand, there are the oddball machines that started out great and then got worse over time, to the point where everything they turned in was junk. Maybe some memory module degraded, or it got dusty in there and heat started causing issues, or who knows what.

There are also systems where the bad results came and went in waves. Maybe they were trying out some overclocking and ran into issues so they dialed it back, then tried again later on. Could have been dozens of other things too, I'm sure. People living in hot seasonal climates, turning in bad results only in summertime? Maybe.

A classic example of that is a particular system by the user Robert_SoCal that currently has 220 bad results, 117 good ones, and still 519 unknown. When I break that one down by year and month, it's clear that it had good months and bad months. From 2012 to the beginning of 2015, it did terribly; I think over 50% of the results were bad. From April 2015 onwards, it started to improve.

I, and others, have cherry-picked its newer 2015+ exponents to see whether the bad trend continued, but when I look at its history, its last bad result came in April 2015 and we've verified about 36 in the months up to Dec 2016. Still, there are a lot of unverified exponents out there, 25-35 per month, and only 1-5 that we've verified in each of those months. We may have got lucky and happened to verify the exponents it did right.

Of course in this case, this is his "Manual Testing" cpu, so the results are probably being pasted in from different systems over the years, all CUDALucas from 2.04 beta up to 2.05.1.
#80 | |
Sep 2003
A1E₁₆ Posts |
I mostly do double-checks of strategic exponents. That is, I try to identify exponents which have a high likelihood of having an incorrect first-time check and then perform double checks on them. So far, I've had almost 600 mismatches, where my double-check result differed from the first-time check (and all subsequently confirmed on triple check, as expected since I use servers with ECC memory).

So I spend time trying to look for patterns, and... it turns out, it's hard to come up with any general rules. The split into two distinct categories, which you posit, doesn't really exist in practice. There are machines with 10% error rates, with 20% error rates, with every kind of error rate under the sun. There are machines where erroneous results are strongly correlated with a nonzero mprime error code and other machines which produce erroneous results without setting any mprime error codes. There are some machines whose erroneous results are concentrated in certain calendar months and others where they are not.

But let's consider a machine with a 50% error rate. Glass half empty or half full? 50% of its results are good. So if we plot a probability histogram of the number of errors in LL tests, there is a peak at n=0, where P(n)=0.50... so what does the rest of the histogram look like for n > 0? Most likely it is monotonically decreasing, so there will be a sizeable number of LL tests where there is only one error. So even with a machine with such a high error rate, the Jacobi check with rewind to the last savefile will produce non-negligible benefits.

And from empirical observation there are many machines with much lower error rates, at least for mprime. I don't know if the error characteristics of GPU-based programs will differ from CPU-based programs like mprime.

Last fiddled with by GP2 on 2017-08-11 at 03:02
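As a rough illustration of the histogram argument above, here is a small sketch that assumes, purely for illustration, that the number of errors per LL test follows a Poisson distribution. The post does not claim this model; the names `poisson_pmf` and `lam` are made up for the example. The rate is chosen so that P(0) = 0.5, matching the hypothetical 50% machine.

```python
import math


def poisson_pmf(n, lam):
    """Probability of exactly n errors in one LL test under an assumed Poisson model."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


# Assumption: choose the rate so that half of all tests are error-free,
# i.e. exp(-lam) = 0.5, which gives lam = ln 2.
lam = math.log(2)

for n in range(5):
    print(n, round(poisson_pmf(n, lam), 4))

# Share of the bad tests (n >= 1) that contain exactly one error.
single_error_share = poisson_pmf(1, lam) / (1 - poisson_pmf(0, lam))
print("share of bad tests with exactly one error:", round(single_error_share, 3))
```

Under this assumed model the histogram is strictly decreasing, P(1) is about 0.35, and roughly 69% of the bad tests contain exactly one error. A real machine may of course behave quite differently.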
#81 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²·3·643 Posts |
Re allocating bad memory to lock it out, sorry if I was unclear before. I think the key is to do it at real physical addresses and have it persist long enough, and that may be why in Linux it was a kernel driver implementation. I was speaking of CUDA or OpenCL calls permanently allocating bad physical GPU memory to take it out of circulation, not OS-managed general-purpose virtual memory in the system RAM DIMMs.

Perhaps checking GPU RAM could be done in a fast startup-time memory test: keep allocated, and don't use, the blocks that test bad. That overhead would only need to be paid at startup for cards that have already tested bad in some memory blocks. (Does the OS virtualize and page out gpu memory while a gpu program is running!?) In my memory testing of one GPU card, via CUDALucas, the range of 25MB blocks that failed was quite stable from run to run (blocks 23-40 out of ~58).

The CUDA memory model article at https://www.3dgep.com/cuda-memory-mo...A_Memory_Types does not contain the string "virtual" (for whatever that is worth). http://www.seas.upenn.edu/~cis565/LECTURES/Lecture3.pdf slide 7 contains "virtual memory -does not exist". There may be persistence issues with the approach.

Re double checks: little is lost until the next software package comes along that hasn't yet implemented nonzero offsets (as was once the case with practically everything, including CUDALucas and prime95), or gpuOwl use becomes widespread, and one LL test is done by gpuOwl and the other LL test is done by gpuOwl or another zero-offset-only program. Lots of results submitted with zero offset sets the stage for such an issue later.
#82 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1E24₁₆ Posts |
Running CUDA code, it is common that excessive roundoff error is detected and corrected by a restart from the last save and a retry, or by resetting the device and restarting from the last checkpoint. That happens even in runs whose results were later verified. Too much of it may result in a bad residue; some of it is no problem.

No interim residue or final residue is guaranteed to be correct. Passing the Jacobi test or any other error check is consistent with a higher probability of correctness, to that point, and that's as much as we can hope for.

Please consider giving users of your software the choice of continuing after a recovery from the last believed-good save file if the Jacobi check indicates recent iterations went wrong, rather than abandoning all the previous work on that exponent run.
#83 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1E24₁₆ Posts |
#84 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²·3·643 Posts |
And, it's already benefited the project considerably, by motivating your inquiry into reliable running, thereby bringing the Jacobi test into play. I wouldn't run code that _required_ running at half speed.

Last fiddled with by kriesel on 2017-08-11 at 06:20
#85 |
"Mihai Preda"
Apr 2015
1,451 Posts |
It doesn't *require* it, it's just an option for the user that suspects the hardware is not reliable yet wants to squeeze LL from it (and that'd be some pretty solid LL). It's a trade-off -- admittedly an expensive one.
The timing as is (4.5 ms / double iteration) isn't terrible. Now if I could just halve that.. :) |
#86 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²×3×643 Posts |
Keep up the good work. |
#87 |
"Mihai Preda"
Apr 2015
1,451 Posts |
After consideration and input from this thread, this is the approach I ended up using in gpuOwL for the Jacobi check:
- On startup (from the beginning or from a savefile), establish a "good Jacobi" point that will be used if a rollback is needed. When starting from a savefile, this involves running one Jacobi check at the very beginning to verify that the savefile passes Jacobi (if it doesn't, it won't start).
- On every Jacobi check, either move the "good Jacobi" point forward if the check passes, or roll back to the most recent rollback point if the check fails.

The rollback point is kept in RAM, so no file read is involved in rolling back (which keeps the implementation simpler). One more Jacobi check is done at the end (after the last iteration), with the same behavior.
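The rollback scheme described above can be sketched in Python as follows. This is a minimal illustration, not gpuOwL's actual code. It relies on the LL Jacobi invariant: starting from s_0 = 4, the Jacobi symbol of (s_i - 2) over M_p equals -1 for every iteration i >= 1. The names `jacobi`, `ll_with_jacobi`, `check_every`, `fault` and `one_shot_fault` are hypothetical, and starting from a savefile is not modelled.

```python
def jacobi(a, n):
    """Jacobi symbol (a|n) for odd n > 0, by the standard binary algorithm."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def ll_with_jacobi(p, check_every=10, fault=None):
    """Lucas-Lehmer test of M_p with periodic Jacobi checks and in-RAM rollback.

    fault(i, s) is a hypothetical hook that may corrupt the residue after
    iteration i. Returns (is_prime, number_of_rollbacks). A fault that repeats
    on every retry would make this sketch loop forever.
    """
    m = (1 << p) - 1
    s, i = 4, 0
    good_s, good_i = s, i          # "good Jacobi" point, kept in RAM
    rollbacks = 0
    last = p - 2                   # LL needs p-2 squarings
    while i < last:
        s = (s * s - 2) % m
        i += 1
        if fault is not None:
            s = fault(i, s)
        if i % check_every == 0 or i == last:
            if jacobi(s - 2, m) == -1:
                good_s, good_i = s, i      # check passed: move the point forward
            else:
                s, i = good_s, good_i      # check failed: roll back
                rollbacks += 1
    return s == 0, rollbacks


def one_shot_fault(at, value):
    """Hypothetical fault injector: overwrite the residue once, at iteration `at`."""
    fired = []

    def fault(i, s):
        if i == at and not fired:
            fired.append(True)
            return value
        return s

    return fault


print(ll_with_jacobi(61))
print(ll_with_jacobi(61, 10, one_shot_fault(23, 6)))
```

The injected value 6 is chosen because it is guaranteed to break the invariant; a random corruption would only be caught by the Jacobi check about half of the time, so the check lowers the error rate but does not eliminate it.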
#88 |
"Robert Gerbicz"
Oct 2005
Hungary
2×19×43 Posts |
The reliable error-checking idea I posted here, http://mersenneforum.org/showthread.php?t=22510, works for Mersenne numbers as well! Why not do a Fermat pseudoprime test for base 3, but (for p>2) in the equivalent form res=3^(2^p) mod mp, where mp=2^p-1? If res is 9 (more precisely, 9 mod mp), then mp is a probable prime (prp). Exactly the same error-checking trick that worked for Proth numbers works here: with about 0.1% overhead we could get a rock-solid test. Computing res should take about the same time as an LL test: start with 3 and do p modular squarings (there is no need to subtract 2). For those primes p that pass this test we should then run a Lucas-Lehmer test, to prove that mp is "really" prime.

The primes p for which we would need to do an LLT (quick PARI/GP test up to p=25000):

Code:
forprime(p=2,25000,q=2^p-1;if(Mod(3,q)^(q+1)==Mod(9,q),print1(p",")))
2,3,5,7,13,17,19,31,61,89,107,127,521,607,1279,2203,2281,3217,4253,4423,9689,9941,11213,19937,21701,23209,

though it is likely that there is not even a single composite prp (base-3 pseudoprime) among the Mersenne numbers.

Slightly modified test code, but the heart of the algorithm is the same as for Proth numbers (note that here lift(u0)=3 is smallish, so we would not need to store it):

Code:
\\ myrand(r,N) returns s randomly from [0,N) such that s!=0 and s!=r (r and s are in Z_N).
myrand(r,N)={local(tmp);while(1,tmp=random(N);if(tmp!=0&&lift(tmp+r)!=0,return(tmp+r)))}

\\ We test the Mersenne number mp=2^p-1 (where p is prime), using L for the error checking.
\\ An error is made in the i-th squaring with 50% chance if errpos[i]!=0 (if we return to
\\ the same i multiple times, each choice is independent of the choices already made).
\\ If printmsg!=0, we print out some additional info.
\\ The return value is 3^(mp+1) mod mp; for a prp the return value is 9 (more precisely 9 mod mp).
\\ If p is not prime or errpos is too short, the return value is -1.
prpmersenne(p,L,errpos,printmsg=1)={
  if(isprime(p)==0,if(printmsg,print("p is not prime."));return(-1));
  if(length(errpos)<p,if(printmsg,print("The errpos array's length should be at least p"));return(-1));
  mp=2^p-1; numerr=0; L2=L^2;
  u0=Mod(3,mp);
  prev_d=u0; saved_u=u0; saved_d=u0; saved_i=0;
  i=0;res=u0;
  while(i<p,i+=1;
    res=res^2;
    if(errpos[i]&&random(2),res=myrand(res,mp));
    if(i%L==0,d=prev_d*res;set_d=0;
      if(i%L2==0||(i%L==0&&i+L>=p),
        if(d!=u0*prev_d^(2^L),
          numerr+=1;if(printmsg,print("Found error at iteration=",i,", roll back to iteration=",saved_i));
          i=saved_i;res=saved_u;prev_d=saved_d;set_d=1,
          saved_i=i;saved_u=res;saved_d=d));
      if(!set_d,prev_d=d)));
  if(printmsg,print1("m",p);if(res==Mod(9,mp),print(" is prp."),print(" is composite."));
    print("Number of errors (corrected and) detected=",numerr));
  return(lift(res))}

Example run with p=61 and L=4, with possible errors at iterations 21, 23 and 45 (i=21,23 are in the same L2=16 block).

Code:
p=61;errpos=vector(p,i,0);errpos[21]=1;errpos[23]=1;errpos[45]=1;
cnt=0;for(h=1,50,cnt+=(prpmersenne(p,4,errpos,1)==9);print());cnt

Code:
Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=4 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=11 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=5 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. 
Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=14 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=8 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. 
Number of errors (corrected and) detected=10 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=5 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=3 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. 
Number of errors (corrected and) detected=3 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=4 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=4 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=5 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=9 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. 
Number of errors (corrected and) detected=8 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=3 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=6 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=6 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. 
Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=9 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=11 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=8 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. 
Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=7 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=8 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=4 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. 
Number of errors (corrected and) detected=4 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=11
%5 = 50

The same test with the much smaller mp=2^17-1 and L=2:

Code:
p=17;errpos=vector(p,i,0);errpos[9]=1;errpos[11]=1;
sum(h=1,10^6,prpmersenne(p,2,errpos,0)==9)
%7 = 999988

That is, only 12 of the 10^6 runs returned a wrong result, an error rate of less than 2/mp. Of course, for a true run use errpos=vector(p,i,0) (a zero array); it is used above only to insert false residues into the squaring computations. For largish Mersenne computations use L=2000 (or say L=1000), depending on how much overhead you allow.
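For readers wondering why the test `d!=u0*prev_d^(2^L)` in the code above detects errors, here is a short derivation. The symbols u_i and d_k are notation introduced here for illustration (they correspond to `res` after i squarings and to `d` after the k-th block of L squarings), and the cost estimate is only rough, counting one modular multiplication as about one squaring.

```latex
u_i \equiv 3^{2^i} \pmod{M_p}, \qquad
d_0 = u_0, \qquad
d_k = d_{k-1}\, u_{kL} = \prod_{j=0}^{k} u_{jL}.

\text{Since } u_{jL} = u_{(j-1)L}^{\,2^L} \text{ for } j \ge 1:

d_k = u_0 \prod_{j=1}^{k} u_{jL}
    = u_0 \prod_{j=0}^{k-1} u_{jL}^{\,2^L}
    = u_0 \, d_{k-1}^{\,2^L} \pmod{M_p}.

\text{Approximate overhead: }
\underbrace{\tfrac{1}{L}}_{\text{one multiplication per } L \text{ squarings}}
+ \underbrace{\tfrac{L}{L^2}}_{L \text{ extra squarings per } L^2 \text{ iterations}}
= \tfrac{2}{L}, \quad \text{i.e. } 0.1\% \text{ for } L = 2000.
```

If any squaring inside the current L2 block went wrong, the product on the left no longer matches the right-hand side except with very small probability, which is why the run rolls back to the last saved block boundary.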