mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2017-07-24, 01:24   #1
preda
 
 
"Mihai Preda"
Apr 2015

4B0₁₆ Posts

Getting reliable LL from unreliable hardware

It appears one of my GPUs recently became less reliable than before -- once in a while (about every 12 hours) I get "Error is too large; retrying", with the retry producing a different, plausible-looking result, and the test keeps going from there.

This got me thinking about how to make better use of unreliable hardware.

Let's say
- the probability of getting a correct result in any one iteration is "p"; then
- the probability of getting a correct result after N iterations is p^N,
- which is approximately 1 - N*(1 - p) when N*(1 - p) is small (close to 0).

In short, the probability of a wrong LL result grows roughly linearly with the number of iterations N. Even generally reliable hardware gets into trouble as N grows.

As an example, a GPU which produces 80% correct results for a 75M exponent would produce only about 40% correct results for a 300M exponent (because 0.8^4 ≈ 0.41), or fewer.

Last fiddled with by kladner on 2018-06-14 at 02:35
Old 2017-07-24, 01:34   #2
science_man_88
 
 
"Forget I exist"
Jul 2009
Dumbassville

20300₈ Posts

Quote:
Originally Posted by preda View Post
It appears one of my GPUs recently became less reliable than before -- once in a while (about every 12hours) I get "Error is too large; retrying", with the retry producing a different, plausible-looking result, and it keeps going from there.

This got me thinking about how to make better use of unreliable hardware.

Let's say
- the probability to get a correct result in any one iteration is "p", then
- the probability to get a correct result after N iterations is p^N
- which is approximated with 1 - N*(1 - p) when N*(1-p) is small (close to 0).

In short, the probability to have a wrong LL result grows linearly with the number of iterations N. Even generally reliable hardware gets into trouble as N grows.

As an example, a GPU which produces 80% correct for a 75M exponent, would produce about 40% correct for a 300M exponent (because 0.8**4 == 0.4), or less.
how many quadratic residues are there mod the value ? that's related to p.
Old 2017-07-24, 01:51   #3
preda
 
 
"Mihai Preda"
Apr 2015

2⁴×3×5² Posts

The classical way to "validate" an LL result is the double check. If two independent LL runs produce the same result, it is extremely unlikely that the result is wrong: the space of LL results is huge (even the space of 64-bit residues is huge), and assuming a mostly uniform distribution of wrong results over this space, the probability of two erroneous LL runs matching by chance is very small.

But what if my GPU, for some big exponent range, displays a reliability of 20%? Then most of the results would be wrong. Even if they are later disproved by double checks, I would call the work of this GPU useless or even negative.

The situation changes radically if the GPU itself applies iterative double checking, i.e. it double-checks every single iteration along the way.

The probability of an individual iteration being correct is extremely high (e.g. 0.99999998 for the previous example of 20% reliability at an 80M exponent). If the results of running the iteration twice [with different offsets] match, then we are "sure" the iteration result is correct.

Thus from a "bad" GPU we get extremely reliable LL results. I would argue such a result, let's call it "iteratively self-double-checked", is almost as strong as an independent double check. It does take twice the work, though in this respect it's no different from a double-checked LL (twice the work as well).
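The claim above can be sketched numerically, assuming errors in the two copies of an iteration are independent and that a silent failure requires the two wrong 64-bit residues to collide (the 20%/80M figures are from the post; the collision model is an assumption):

```python
# Per-iteration reliability implied by a 20% whole-run success rate at an
# 80M exponent, and the chance an error survives per-iteration doubling.

n = 80_000_000
p_run = 0.20
p_iter = p_run ** (1 / n)           # ≈ 0.99999998, as in the post
print(p_iter)

# With every iteration run twice, an error slips through only if both
# copies are wrong AND their 64-bit residues happen to match by chance.
q = 1 - p_iter                      # per-iteration error rate, ~2e-8
p_undetected = q * q * 2.0 ** -64   # both wrong, colliding residues

# First-order approximation of 1 - (1 - p_undetected)^n; the exact
# expression underflows to 1 - 1 = 0 in double precision.
p_lifetime = n * p_undetected
print(p_lifetime)                   # astronomically small (~1e-27)
```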

Last fiddled with by preda on 2017-07-24 at 02:02
Old 2017-07-24, 06:39   #4
ewmayer
2ω=0
 
 
Sep 2002
República de California

11518₁₀ Posts

Based on much personal experience with this sort of thing, 2 side-by-side runs with different shits or slightly differing FFT lengths, proceeding at as close to the same speed as possible and saving uniquely-named checkpoint files every (say) 10Miter is the way to go. But from the perspective of the project as a whole:

[1] That is only marginally better than the current scheme in terms of avoiding wasted cycles on runs which have gone off the rails, given the assumption of an overall low error rate. From the perspective of nailing a single LL test result with minimal cycle wastage, though, the above is good: if a daily check reveals the 2 runs have diverged, stop them both and restart from whichever 10Miter (or whatever; on your hardware every 1Miter makes more sense) persistent checkpoint file was deposited before the point of divergence, after making sure said file matches between both runs. Hopefully on retry both runs will then agree past the previous point of divergence.

[2] The major drawback from the project perspective, however, is that it relies on the user being honest. That is not a problem if the user claims to have found a prime: then we just insist on a copy of the last-written checkpoint file and rerun the small number of iterations from there to the end, and if it comes up "prime" we proceed to a full formal independent DC. But let's say someone wants to vault up the Top Producers list just for bragging rights and starts submitting faked-up "double checks" of this kind; if we accepted them, we could easily miss a prime.

I use the above 2-side-by-side runs method in my Fermat number testing, but the difference there is that I would never think of publishing a primality-test result gotten via this method without also making the full set of interim CP files available, enabling a rapid "parallel" triple-check method whereby multiple machines can run the individual 10M-iter intervals simultaneously, each one checking if its result after 10Miters agrees with the next such deposited CP file, as described here.
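The roll-back rule described above (restart both runs from the last checkpoint at which they still agree) can be sketched as follows; the function name and residue data are hypothetical, for illustration only:

```python
# Two side-by-side runs deposit Res64 values at fixed checkpoint
# intervals; on divergence, both roll back to the latest checkpoint
# where the residues still agree.

def last_matching_checkpoint(residues_a, residues_b):
    """residues_*: dict mapping iteration -> Res64 hex string.
    Returns the highest iteration at which both runs agree, or None."""
    common = sorted(set(residues_a) & set(residues_b))
    best = None
    for it in common:
        if residues_a[it] == residues_b[it]:
            best = it
        else:
            break  # first divergence; every later checkpoint is suspect
    return best

# Hypothetical 10M-iteration checkpoints; run B glitches after 20Miter.
a = {10_000_000: "ab12", 20_000_000: "cd34", 30_000_000: "ef56"}
b = {10_000_000: "ab12", 20_000_000: "cd34", 30_000_000: "0000"}
print(last_matching_checkpoint(a, b))  # 20000000
```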
Old 2017-07-24, 21:20   #5
ewmayer
2ω=0
 
 
Sep 2002
República de California

2·13·443 Posts

Quote:
Originally Posted by ewmayer View Post
Based on much personal experience with this sort of thing, 2 side-by-side runs with different shits
Ha - should read "shifts", but funny the way it, um, came out.
Old 2017-07-25, 17:32   #6
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

3274₁₀ Posts

Quote:
Originally Posted by ewmayer View Post
...

[2] The major drawback from the project perspective, however, is that it relies on the user being honest. Not a problem if the user claims to have found a prime - then we just insist on a copy of the last written checkpoint file and rerun the small number of iterations from that to the end, if it comes up "prime" we proceed to a full formal independent DC. But let's say someone wants to hurdle up the Top Producers list just for bragging rights and starts submitting faked-up "double checks" of this kind - if we accepted them we could easily miss a prime.
...
Yeah, that's the real kicker. If there were some way to shortcut the independent verification, so that some neutral party could run a few iterations and compare against something other than (or in addition to) the final residue, then maybe. I can't think of anything off the top of my head, and I feel pretty sure this type of idea has been brought up before, but I'm too lazy to search the forum right now.

In theory though, yeah, it makes perfect sense to do double-checking along the way, especially if you're doing a huge exponent like those 600M+ results, where a verifying run was done alongside and (presumably) residues were compared along the way. If they diverged, then you roll both back to the last place they matched and resume.

You do save cycles there because you're catching the error without having to run through the whole thing, waiting for a DC, getting a mismatch, doing a triple (or even more) check, etc.

I still go by my general approximation of a 5% bad result rate, so if you were able to do side-by-side runs, you could effectively increase the throughput of the entire project (first and double-check, not just first-time milestones) by 5%.
Old 2017-07-25, 18:17   #7
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013

2×1,429 Posts

Quote:
Originally Posted by Madpoo View Post
You do save cycles there because you're catching the error without having to run through the whole thing, waiting for a DC, getting a mismatch, doing a triple (or even more) check, etc.

I still go by my general approximation of a 5% bad result rate, so if you were able to do side-by-side runs, you could effectively increase the throughput of the entire project (first and double-check, not just first-time milestones) by 5%.
If 5% are bad, and we do a triple check 5% of the time, that's only 5/205 or 2.4% work saved, no?
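The 5/205 figure can be verified with a quick sketch, normalizing to 100 first-time tests (the cost model, one extra triple check per bad result, is an assumption read off the posts above):

```python
# Work units under the current scheme, per 100 first-time tests.
first_tests = 100
double_checks = 100           # current scheme: every test gets a DC
triple_checks = 5             # 5% bad results each force a third run
current_work = first_tests + double_checks + triple_checks  # 205 units

# Self-checked side-by-side runs would avoid only the triple checks.
saved_fraction = triple_checks / current_work
print(round(100 * saved_fraction, 1))  # ≈ 2.4 (%)
```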
Old 2017-07-25, 21:09   #8
ewmayer
2ω=0
 
 
Sep 2002
República de California

2CFE₁₆ Posts

Quote:
Originally Posted by Madpoo View Post
In theory though, yeah, it makes perfect sense to do double-checking along the way, especially if you're doing a huge exponent like those 600M+ results, where he did a verifying run alongside and (presumably) compared residues along the way. If they diverged then you roll both back to the last place they matched and then resume.
It's also easy in theory to envision how one might do this in practice: now the first-time tester's interim checkpoint files (say every 10Miter) get deposited on the Primenet server, and the DCer's Res64s at those same iterations are diffed against those, etc. But that would require a massive increase in both storage capacity at the server end and comms bandwidth between the users' machines and the server. The low-bandwidth alternative would be to let the interim CP files remain on the users' machines and have the server direct the restart-both-runs-from-last-matching-CP mechanism, but I shudder to think of the effort needed to support that functionality, not to mention the myriad real-world reasons why it would be inherently fragile.

As Mark notes, as long as the overall error rate remains reasonably low, the potential savings is simply unlikely to be worth the effort.
Old 2017-07-26, 06:03   #9
airsquirrels
 
 
"David"
Jul 2015
Ohio

1005₈ Posts

Quote:
Originally Posted by ewmayer View Post
It' also easy in theory to envision how one might do this in practice - now, the first-time-tester's interim checkpoint files (say every 10Miter) get deposited on the Primenet server, and the DCer's Res64s at those same iterations diffed against those, etc. But that would require a massive increase in both storage capacity at the server end and comms-bandwidth between the users' machines and the server. The low-bandwidth alternative would be to let the interim CP files remain on the users' machines and then have the server direct the restart-both-runs-from-last-matching-CP mechanism, but I shudder to think of the effort needed to support that functionality, not to mention the myriad real-world reasons why it would be inherently fragile.

As Mark notes, as long as the overall error rate remains reasonably low, the potential savings is simply unlikely to be worth the effort.
I did a whole bunch of typing about this in some thread around here. I actually volunteered to do some work to try to get such a system up and running, but there wasn't much interest. One important note: 5% or so could be saved from not needing to do DCs, but the real advantage is the ability for users to contribute 'small' work chunks. A massive amount of the GIMPS cycles spent are wasted. If all those OCers doing burn-ins were actually contributing 100,000 or so validated iterations that another user could continue, well, that would be something. It could also encourage more users to contribute just a few cycles here and there.
Old 2017-07-26, 09:07   #10
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT/BST)

13132₈ Posts

Quote:
Originally Posted by airsquirrels View Post
I did a whole bunch of typing about this in some thread around here. I actually volunteered to do some work to try and get such a system up and running, but there wasn’t much interest. One important note - 5%+/- could be saved from not needing to do DCs, but the real advantage is the ability for users to contribute ‘small’ work chunks. A massive amount of the GIMPS cycles spent are wasted. If all those OCers doing burn ins were actually contributing 100,000 or so validated iterations that another user could continue, well - that would be something. It could also encourage more users to contribute just a few cycles here and there.
I suppose torture tests could actually be doing useful doublechecks.
Old 2017-07-26, 10:04   #11
retina
Undefined
 
 
"The unspeakable one"
Jun 2006
My evil lair

2×3×5×191 Posts

Quote:
Originally Posted by henryzz View Post
I suppose torture tests could actually be doing useful doublechecks.
Torture tests should only be run on known, verified results. So torture tests need to be redundant, and are therefore useless in terms of progressing the "goal" forward.