2017-08-11, 02:06  #78
"Forget I exist"
Jul 2009
Dumbassville
2^{6}×131 Posts 
https://mathspeople.anu.edu.au/~bre..._ACCMCC_10.pdf pretty easy to find once you know what to search for.

2017-08-11, 02:30  #79
Serpentine Vermin Jar
Jul 2014
CDF_{16} Posts 
Quote:
In fact, when I started and there were thousands of pending mismatches (now only a handful), I used those general trends to guess the winner and loser. On the other hand, there are the oddball machines that started out great and then got worse over time, to the point where everything they turned in was junk. Maybe some memory module degraded, or it got dusty in there and heat started causing issues, or who knows what. There are also systems where the bad results came and went in waves. Maybe they were trying out some overclocking and ran into issues so they dialed it back, then tried again later on. Could have been dozens of other things too, I'm sure. People living in hot seasonal climates, turning in bad results only in summer time? Maybe.

A classic example of that is a particular system by the user Robert_SoCal that currently has 220 bad results, 117 good ones, and still 519 unknown. When I break that one down by year & month, it's clear that it had good months and bad months. From 2012 to the beginning of 2015, it did terribly; I think over 50% of the results were bad. From April 2015 onwards, it started to improve. I, and others, have cherry-picked its newer 2015+ exponents to see if the bad trend continued or not, but when I look at its history, its last bad result came in April 2015 and we've verified about 36 in the months up to Dec 2016. Still, there are a lot of unverified exponents out there, 25-35 per month, and only 1-5 that we've verified in each of those months. We may have got lucky and happened to verify the exponents it did right. Of course in this case, this is his "Manual Testing" cpu, so the results are probably being pasted in from different systems over the years, all CUDALucas from 2.04 beta up to 2.05.1.

2017-08-11, 02:56  #80
Sep 2003
A18_{16} Posts 
Quote:
I mostly do double-checks of strategic exponents. That is, I try to identify exponents which have a high likelihood of having an incorrect first-time check and then perform double-checks on them. So far, I've had almost 600 mismatches, where my double-check result differed from the first-time check (and all subsequently confirmed on triple check, as expected since I use servers with ECC memory). So I spend time trying to look for patterns, and... it turns out, it's hard to come up with any general rules.

The split into two distinct categories, which you posit, doesn't really exist in practice. There are machines with 10% error rates, with 20% error rates, with every kind of error rate under the sun. There are machines where erroneous results are strongly correlated with a nonzero mprime error code and other machines which produce erroneous results without setting any mprime error codes. There are some machines whose erroneous results are concentrated in certain calendar months and others where they are not.

But let's consider a machine with a 50% error rate. Glass half empty or half full? 50% of its results are good. So if we plot a probability histogram of the number of errors in LL tests, there is a peak at n=0, where P(n)=0.50... so what does the rest of the histogram look like for n > 0? Most likely it is monotonically decreasing, so there will be a sizeable number of LL tests where there is only one error. So even with a machine with such a high error rate, the Jacobi check with rewind to the last savefile will produce non-negligible benefits. And from empirical observation there are many machines with much lower error rates, at least for mprime. I don't know if the error characteristics of GPU-based programs will differ from CPU-based programs like mprime.

Last fiddled with by GP2 on 2017-08-11 at 03:02
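To put rough numbers on the histogram argument above, here is a minimal sketch. It assumes, purely for illustration, that errors in one LL test are independent and Poisson-distributed; the thread gives no such model, and the function name is hypothetical. Under that assumption P(0)=0.50 fixes the rate at ln 2 errors per test, and the histogram is indeed monotonically decreasing.

```python
import math

def poisson_error_histogram(p_clean, n_max):
    """P(n errors in one LL test) for n = 0..n_max.

    ASSUMPTION (not from the thread): errors are independent and Poisson
    distributed, so the rate follows from P(0) = exp(-rate) = p_clean.
    """
    rate = -math.log(p_clean)
    return [math.exp(-rate) * rate ** n / math.factorial(n)
            for n in range(n_max + 1)]

# A machine where half of all results are bad.
hist = poisson_error_histogram(0.5, 4)
for n, prob in enumerate(hist):
    print(f"P({n} errors) = {prob:.3f}")
```

Under this assumed model roughly 35% of all tests (about two thirds of the bad ones) contain exactly one error, which is the case where a single rewind to the last good savefile can rescue the run.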

2017-08-11, 03:39  #81
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3^{2}×7×83 Posts 
Quote:
Re allocating bad memory to lock it out: sorry if I was unclear before. I think the key is to do it at real physical addresses, and have it persist long enough, and that may be why in Linux it was a kernel driver implementation. I was speaking of CUDA or OpenCL calls permanently allocating bad physical GPU memory to take it out of circulation, not OS-managed general-purpose virtual memory in the system RAM DIMMs. Perhaps checking GPU RAM could be done in a fast startup-time memory test; keep allocated and don't use the blocks that test bad. That overhead would only need to be paid at startup for cards that have already tested bad in some memory blocks. (Does the OS virtualize and page out GPU memory while a GPU program is running!?) In my memory testing of one GPU card, via CUDALucas, the range of 25MB blocks that failed was quite stable from run to run (blocks 23-40 out of ~58).

The CUDA memory model article at https://www.3dgep.com/cudamemorymo...A_Memory_Types does not contain the string "virtual" (for whatever that is worth). http://www.seas.upenn.edu/~cis565/LECTURES/Lecture3.pdf slide 7 contains "virtual memory does not exist". There may be persistence issues with the approach.

Re double checks: little is lost until the next software package comes along that hasn't yet implemented nonzero offsets (as was once the case with practically everything, including CUDALucas and prime95), or gpuOwl use becomes widespread, and one LL test is done by gpuOwl and the other LL test is done by gpuOwl or other zero-offset-only software. Lots of results submitted with zero offset sets the stage for such an issue later.

2017-08-11, 05:36  #82
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3^{2}×7×83 Posts 
Quote:
When running CUDA code, it is common that excessive roundoff error is detected and corrected by a restart from the last save and a retry, or by resetting the device and restarting from the last checkpoint. This happens even in runs whose results were later verified. Too much of it may result in a bad residue; some of it is no problem. No interim residue or final residue is guaranteed to be correct. Passing the Jacobi test or any other error check is consistent with a higher probability of correctness up to that point, and that's as much as we can hope for. Please consider giving users of your software the choice of continuing after a recovery from the last believed-good save file if the Jacobi check indicates recent iterations went wrong, rather than abandoning all the previous work on that exponent run.

2017-08-11, 06:14  #83
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5229_{10} Posts 
Quote:


2017-08-11, 06:19  #84
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3^{2}×7×83 Posts 
Quote:
And, it's already benefited the project considerably, by motivating your inquiry into reliable running, thereby bringing the Jacobi test into play. I wouldn't run code that _required_ running at half speed.
Last fiddled with by kriesel on 2017-08-11 at 06:20

2017-08-11, 07:07  #85
"Mihai Preda"
Apr 2015
2^{3}×3^{2}×19 Posts 
It doesn't *require* it; it's just an option for the user who suspects the hardware is not reliable yet wants to squeeze LL from it (and that'd be some pretty solid LL). It's a trade-off, admittedly an expensive one.
The timing as is (4.5 ms / double iteration) isn't terrible. Now if I could just halve that.. :) 
2017-08-11, 14:00  #86
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3^{2}×7×83 Posts 
Quote:
Keep up the good work. 

2017-08-12, 23:01  #87
"Mihai Preda"
Apr 2015
2^{3}·3^{2}·19 Posts 
After consideration and input from this thread, this is the approach I ended up using in gpuOwL for the Jacobi check:
- On startup (from the beginning or from a savefile), establish a "good Jacobi" point that will be used if a rollback is needed. When starting from a savefile, this involves running one Jacobi check at the very beginning to verify that the savefile passes Jacobi (if it doesn't, it won't start).
- On every Jacobi check, either move the "good Jacobi" point forward if the check passes, or roll back to the most recent rollback point if the check fails. The rollback point is kept in RAM, so no file read is involved in rolling back (which makes for a simpler implementation).
- One more Jacobi check is done at the end (after the last iteration), with the same behavior.
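The rollback scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not gpuOwL's code: it relies on the property that, with the usual seed s_0=4, the Jacobi symbol of (s_i - 2) modulo M_p is -1 for every i >= 1, and the names `ll_with_jacobi`, `check_every` and `inject` are hypothetical. The `inject` dictionary simulates a one-shot hardware fault by overwriting the residue at a given iteration. Since the seed itself does not satisfy the -1 condition, the sketch simply trusts the starting point instead of checking it.

```python
def jacobi(a, n):
    """Jacobi symbol (a/n) for odd positive n."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

def ll_with_jacobi(p, check_every, inject=None):
    """LL test of 2^p-1 with periodic Jacobi checks and in-RAM rollback.

    inject (hypothetical test hook): {iteration: wrong_residue}, each fault
    is applied once. Returns (final residue, number of rollbacks).
    """
    mp = (1 << p) - 1
    s, i = 4, 0
    good_i, good_s = 0, 4          # "good Jacobi" point, kept in RAM
    rollbacks = 0
    pending = dict(inject or {})
    while i < p - 2:
        s = (s * s - 2) % mp
        i += 1
        if i in pending:
            s = pending.pop(i)     # simulated one-shot fault
        if i % check_every == 0 or i == p - 2:
            if jacobi(s - 2, mp) == -1:
                good_i, good_s = i, s      # move the good point forward
            else:
                i, s = good_i, good_s      # roll back and redo
                rollbacks += 1
    return s, rollbacks

print(ll_with_jacobi(61, 8))               # clean run
print(ll_with_jacobi(61, 8, {21: 7}))      # fault injected at iteration 21
```

The injected value 7 was picked because it happens to flip the Jacobi symbol for p=61, so the fault is caught at the next check. A random fault flips the symbol only about half the time, which is why the Jacobi check catches roughly 50% of errors rather than all of them.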
2017-08-13, 12:13  #88
"Robert Gerbicz"
Oct 2005
Hungary
7·211 Posts 
The reliable error-checking idea I posted here, http://mersenneforum.org/showthread.php?t=22510, works for Mersenne numbers as well! Why not do a Fermat pseudoprime test to base 3, but (for p>2) in the equivalent form res=3^(2^p) mod mp, where mp=2^p-1? If res is 9 (more precisely, 9 mod mp), then mp is a probable prime (prp). Exactly the same error-checking trick that worked for Proth numbers works here too; with about 0.1% overhead we could get a rock-solid test. Getting res takes about the same time as an LL test: we start with 3 and do p modular squarings (we don't need to subtract 2). For those primes p that pass this test we should then run a Lucas-Lehmer test, to prove that mp is "really" prime.
The primes p for which we would need an LLT (quick PARI/GP test up to p=25000):
Code:
forprime(p=2,25000,q=2^p-1;if(Mod(3,q)^(q+1)==Mod(9,q),print1(p",")))
2,3,5,7,13,17,19,31,61,89,107,127,521,607,1279,2203,2281,3217,4253,4423,9689,9941,11213,19937,21701,23209,
though it is likely that there is not even a single composite prp (pseudoprime) among the Mersenne numbers.
Slightly modified test code, but the heart of the algorithm is the same as for Proth numbers (note that here lift(u0)=3 is smallish, so we would not need to store it):
Code:
\\ myrand(r,N) returns a random s from [0,N) such that s!=0 and s!=r (r and s are in Z_N).
myrand(r,N)={local(tmp);while(1,tmp=random(N);if(tmp!=0&&lift(tmp+r)!=0,return(tmp+r)))}

\\ We test the Mersenne number mp=2^p-1 (where p is prime); L is used for the error checking.
\\ An error is made in the i-th squaring with 50% chance if errpos[i]!=0
\\ (if we return to the same i multiple times, the choice is made independently each time).
\\ If printmsg!=0, some additional info is printed.
\\ The return value is 3^(mp+1) mod mp; for a prp the return value is 9 (more precisely, 9 mod mp).
\\ If p is not prime or errpos is too short, the return value is -1.
prpmersenne(p,L,errpos,printmsg=1)={
  if(isprime(p)==0,if(printmsg,print("p is not prime."));return(-1));
  if(length(errpos)<p,if(printmsg,print("The errpos array's length should be at least p"));return(-1));
  mp=2^p-1; numerr=0; L2=L^2; u0=Mod(3,mp);
  prev_d=u0; saved_u=u0; saved_d=u0; saved_i=0;
  i=0;res=u0;
  while(i<p,
    i+=1;
    res=res^2;
    if(errpos[i]&&random(2),res=myrand(res,mp));
    if(i%L==0,
      d=prev_d*res;set_d=0;
      if(i%L2==0||(i%L==0&&i+L>=p),
        if(d!=u0*prev_d^(2^L),
          numerr+=1;
          if(printmsg,print("Found error at iteration=",i,", roll back to iteration=",saved_i));
          i=saved_i;res=saved_u;prev_d=saved_d;set_d=1,
          saved_i=i;saved_u=res;saved_d=d));
      if(!set_d,prev_d=d)));
  if(printmsg,print1("m",p);if(res==Mod(9,mp),print(" is prp."),print(" is composite."));
    print("Number of errors (corrected and) detected=",numerr));
  return(lift(res))}
An example run (i=21 and i=23 are in the same L2=16 block):
Code:
p=61;errpos=vector(p,i,0);errpos[21]=1;errpos[23]=1;errpos[45]=1;
cnt=0;for(h=1,50,cnt+=(prpmersenne(p,4,errpos,1)==9);print());cnt
Code:
Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=4 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=11 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=5 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. 
Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=14 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=8 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. 
Number of errors (corrected and) detected=10 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=5 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=3 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. 
Number of errors (corrected and) detected=3 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=4 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=4 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=5 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=9 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. 
Number of errors (corrected and) detected=8 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=3 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=6 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=6 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. 
Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=9 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=11 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=1 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=8 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. 
Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. Number of errors (corrected and) detected=2 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=7 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=8 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=4 m61 is prp. Number of errors (corrected and) detected=0 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 m61 is prp. 
Number of errors (corrected and) detected=4 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=32, roll back to iteration=16 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 Found error at iteration=48, roll back to iteration=32 m61 is prp. Number of errors (corrected and) detected=11
%5 = 50
The same test with the much smaller mp=2^17-1 and L=2:
Code:
p=17;errpos=vector(p,i,0);errpos[9]=1;errpos[11]=1;
sum(h=1,10^6,prpmersenne(p,2,errpos,0)==9)
%7 = 999988
That is an error rate of less than 2/mp. Of course, for a true run use errpos=vector(p,i,0) (a zero array); it is used above only to insert false residues into the squaring computations. For largish Mersenne computations use L=2000 (or, say, L=1000), depending on how much overhead you allow.
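For readers without PARI/GP, here is a minimal Python sketch of the same check-and-rollback loop. It follows the structure of prpmersenne above, with one deliberate difference that is my own assumption for illustration: instead of random errors driven by errpos, a hypothetical `faults` dictionary overwrites the residue with a fixed wrong value exactly once per listed iteration, so the run is reproducible. The function name `prp_mersenne` is likewise hypothetical.

```python
def prp_mersenne(p, L, faults=None):
    """Base-3 PRP test of 2^p-1 with the product check every L*L iterations.

    faults (hypothetical test hook): {iteration: wrong_residue}, each fault
    is applied once. Returns (3^(2^p) mod mp, number of rollbacks); the
    residue is 9 when mp is a probable prime.
    """
    mp = (1 << p) - 1
    u0 = 3
    L2 = L * L
    pending = dict(faults or {})
    res = prev_d = u0
    saved_i, saved_u, saved_d = 0, u0, u0
    i = rollbacks = 0
    while i < p:
        i += 1
        res = res * res % mp
        if i in pending:
            res = pending.pop(i) % mp          # simulated one-shot fault
        if i % L == 0:
            d = prev_d * res % mp              # running product of residues
            if i % L2 == 0 or i + L >= p:
                # d must equal u0 * prev_d^(2^L) if all squarings were right
                if d != u0 * pow(prev_d, 1 << L, mp) % mp:
                    rollbacks += 1
                    i, res, prev_d = saved_i, saved_u, saved_d
                    continue
                saved_i, saved_u, saved_d = i, res, d
            prev_d = d
    return res, rollbacks

print(prp_mersenne(61, 4))                         # clean run
print(prp_mersenne(61, 4, {21: 1, 23: 1, 45: 1}))  # faults at the thread's positions
```

With the faults placed at the same iterations as in the example above (21 and 23 share one L2=16 block, 45 is in a later one), the sketch rolls back twice and still ends on the residue 9. As in the original code, squarings after the final check point (here only iteration 61) are not covered by a check.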