2017-07-24, 01:24  #1 
"Mihai Preda"
Apr 2015
4B0_{16} Posts 
Getting reliable LL from unreliable hardware
It appears one of my GPUs recently became less reliable than before: once in a while (about every 12 hours) I get "Error is too large; retrying", with the retry producing a different, plausible-looking result, and it keeps going from there.
This got me thinking about how to make better use of unreliable hardware. Let's say the probability of getting a correct result in any one iteration is p; then the probability of a correct result after N iterations is p^N, which is approximated by 1 - N*(1 - p) when N*(1 - p) is small (close to 0). In short, the probability of a wrong LL result grows roughly linearly with the number of iterations N. Even generally reliable hardware gets into trouble as N grows. As an example, a GPU which produces 80% correct results for a 75M exponent would produce only about 40% correct results for a 300M exponent (because 0.8^4 ≈ 0.41), or less.

Last fiddled with by kladner on 2018-06-14 at 02:35 
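The scaling argument can be checked numerically; the figures below are just the example numbers from the post (80% correct at a 75M exponent), nothing measured:

```python
# If one iteration is correct with probability p, a run of N iterations
# is correct with probability p**N, approximated by 1 - N*(1 - p) when
# N*(1 - p) is small.

N1, N2 = 75_000_000, 300_000_000

# Per-iteration reliability implied by "80% correct at a 75M exponent":
p = 0.8 ** (1 / N1)

print(p ** N2)           # exact scaling: equals 0.8**4 ~ 0.41
print(1 - N2 * (1 - p))  # linear approximation; poor here, since
                         # N2*(1 - p) ~ 0.89 is not small
```

Note the linear approximation only holds while N*(1 - p) stays close to 0; at a 4x larger exponent it already breaks down, which is exactly why long runs are so exposed.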
2017-07-24, 01:34  #2  
"Forget I exist"
Jul 2009
Dumbassville
20300_{8} Posts 
Quote:


2017-07-24, 01:51  #3 
"Mihai Preda"
Apr 2015
2^{4}×3×5^{2} Posts 
The classical way to "validate" an LL result is the double-check. If two independent LLs produce the same result, it is extremely unlikely that the result is wrong (because the space of LL results is huge, even the space of 64-bit residues is huge, and, assuming a mostly uniform distribution of wrong results over this space, the probability of two erroneous LLs matching "by chance" is very small).
But what if my GPU, for some big exponent range, displays a reliability of 20%? Then most of the results would be wrong. Even if later disproved by double-checks, I would call the work of this GPU useless or even negative.

The situation changes radically if the GPU itself applies iterative double-checking. For example, it would double-check every iteration, at every step along the way. The probability of an individual iteration being correct is extremely high (e.g. 0.99999998 for the previous example of 20% reliability at an 80M exponent). If the results of running the iteration twice [with different offsets] match, then we are "sure" the iteration result is correct. Thus from a "bad" GPU we get extremely reliable LL results.

I would argue such a result, let's call it "iteratively self-double-checked", is almost as strong as an independent double-check. It does take twice the work, though in this aspect it's no different from a double-checked LL (twice the work as well).

Last fiddled with by preda on 2017-07-24 at 02:02 
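A toy sketch of the iterative self-double-check idea: each LL step is computed twice and retried until the two copies agree. This is illustrative Python only, not actual GPU/GIMPS code; `flaky_step` and its random-error model are invented for the demo, and the exponents are tiny so it runs instantly.

```python
import random

def flaky_step(s, m, err_prob, rng):
    # One LL iteration s -> s^2 - 2 (mod m), with a simulated hardware
    # error injected with probability err_prob.
    y = (s * s - 2) % m
    if rng.random() < err_prob:
        y = (y + rng.randrange(1, m)) % m
    return y

def checked_step(s, m, err_prob, rng):
    # Compute the iteration twice; retry until both copies agree.
    while True:
        a = flaky_step(s, m, err_prob, rng)
        b = flaky_step(s, m, err_prob, rng)
        if a == b:
            return a

def ll_test(p, err_prob=0.0, seed=0):
    # Lucas-Lehmer: 2^p - 1 is prime iff s_{p-2} == 0, with s_0 = 4.
    m = 2**p - 1
    rng = random.Random(seed)
    s = 4
    for _ in range(p - 2):
        s = checked_step(s, m, err_prob, rng)
    return s == 0

print(ll_test(13, err_prob=0.3))  # M13 = 8191 is prime
print(ll_test(11, err_prob=0.3))  # M11 = 2047 = 23 * 89 is not
```

Even with 30% of individual steps corrupted, the per-step agreement check lets the run recover, which is the point of the post: per-iteration checking turns an unreliable device into a source of trustworthy residues (at twice the cost per iteration).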
2017-07-24, 06:39  #4 
∂^{2}ω=0
Sep 2002
República de California
11518_{10} Posts 
Based on much personal experience with this sort of thing, 2 side-by-side runs with different shifts or slightly differing FFT lengths, proceeding at as close to the same speed as possible and saving uniquely-named checkpoint files every (say) 10M iterations, is the way to go. But from the perspective of the project as a whole:
[1] That is only marginally better in terms of avoiding wasted cycles on runs which have gone off the rails than the current scheme, based on the assumption of an overall low error rate. From the perspective of nailing a single LL test result with minimal cycle wastage, though, the above is good: if a daily check reveals the 2 runs have diverged, stop 'em both and restart from whichever 10M-iter (or whatever; on your hardware every 1M-iter makes more sense) persistent checkpoint file was deposited before the point of divergence, after making sure said file matches between both runs. Hopefully on retry both runs will now agree past the previous point of divergence.

[2] The major drawback from the project perspective, however, is that it relies on the user being honest. Not a problem if the user claims to have found a prime: then we just insist on a copy of the last written checkpoint file and rerun the small number of iterations from that to the end; if it comes up "prime" we proceed to a full formal independent DC. But let's say someone wants to hurdle up the Top Producers list just for bragging rights and starts submitting faked-up "double checks" of this kind; if we accepted them we could easily miss a prime.

I use the above 2-side-by-side-runs method in my Fermat number testing, but the difference there is that I would never think of publishing a primality-test result gotten via this method without also making the full set of interim CP files available, enabling a rapid "parallel" triple-check method whereby multiple machines can run the individual 10M-iter intervals simultaneously, each one checking whether its result after 10M iters agrees with the next such deposited CP file, as described here. 
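The rollback bookkeeping for the two side-by-side runs could look something like this toy sketch. It is hypothetical Python with deliberately tiny numbers (a checkpoint every 4 iterations instead of every 10M, an invented error model, and a small exponent); on divergence at a checkpoint, both runs restart from the last checkpoint at which their residues matched.

```python
import random

CHECK_EVERY = 4  # checkpoint interval; the post suggests 10M iterations

def noisy_step(s, m, err_prob, rng):
    # One LL iteration s -> s^2 - 2 (mod m), with a simulated error.
    y = (s * s - 2) % m
    if rng.random() < err_prob:
        y = (y + rng.randrange(1, m)) % m
    return y

def dual_run(p, err_prob=0.0, seed=0):
    m = 2**p - 1
    rng = random.Random(seed)
    n_iters = p - 2
    it, a, b = 0, 4, 4           # two independent runs of the same test
    ck_it, ck_res = 0, 4         # last checkpoint where both runs agreed
    while it < n_iters:
        a = noisy_step(a, m, err_prob, rng)
        b = noisy_step(b, m, err_prob, rng)
        it += 1
        if it % CHECK_EVERY == 0 or it == n_iters:
            if a == b:           # runs agree: deposit a new checkpoint
                ck_it, ck_res = it, a
            else:                # diverged: roll both back and retry
                it, a, b = ck_it, ck_res, ck_res
    return a                     # final residue; 0 <=> 2^p - 1 is prime

print(dual_run(13, err_prob=0.2) == 0)  # M13 is prime
```

Compared with per-iteration checking, this only compares residues at checkpoint boundaries, so a whole interval is redone on divergence; the trade-off between interval length and wasted work is what the "1M-iter on your hardware" remark is about.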
2017-07-24, 21:20  #5 
∂^{2}ω=0
Sep 2002
República de California
2·13·443 Posts 

2017-07-25, 17:32  #6  
Serpentine Vermin Jar
Jul 2014
3274_{10} Posts 
Quote:
In theory though, yeah, it makes perfect sense to do double-checking along the way, especially if you're doing a huge exponent like those 600M+ results, where he did a verifying run alongside and (presumably) compared residues along the way. If they diverged, you roll both back to the last place they matched and then resume. You do save cycles there because you're catching the error without having to run through the whole thing, waiting for a DC, getting a mismatch, doing a triple (or even more) check, etc.

I still go by my general approximation of a 5% bad result rate, so if you were able to do side-by-side runs, you could effectively increase the throughput of the entire project (first and double-check, not just first-time milestones) by 5%. 

2017-07-25, 18:17  #7  
"/X\(‘‘)/X\"
Jan 2013
2×1,429 Posts 
Quote:


2017-07-25, 21:09  #8  
∂^{2}ω=0
Sep 2002
República de California
2CFE_{16} Posts 
Quote:
As Mark notes, as long as the overall error rate remains reasonably low, the potential savings are simply unlikely to be worth the effort. 

2017-07-26, 06:03  #9  
"David"
Jul 2015
Ohio
1005_{8} Posts 
Quote:


2017-07-26, 09:07  #10  
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
13132_{8} Posts 
Quote:


2017-07-26, 10:04  #11 
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2×3×5×191 Posts 
