mersenneforum.org LL GPU error-detection side computation

2017-05-02, 23:54 #1
preda
"Mihai Preda"
Apr 2015
1375 Posts

LL GPU error-detection side computation

Here I'm explaining how I see LL error detection on the GPU -- maybe somebody has helpful ideas/tips in that direction.

Doing LL on the GPU is somewhat unreliable, because some GPUs occasionally manifest transient hardware errors. This problem is compounded by the long iteration chain needed for large LL exponents. And, for a first-time LL, there is no way to cheaply validate the final result -- basically the only way to validate an LL is to do it once again as a double-check.

That's why I would very much like to have an "error detection" mechanism for the LL iteration on the GPU. This would validate (with some probability) each iteration step of the LL. When an error is detected, the step can be re-run until correct; thus "error detection" is easily turned into "error correction" for transient errors.

Below, E is the exponent and N is the FFT size. We can see the LL iteration as repeated application of this LLstep() function:

function LLstep(BigInt state) { return (state^2 - 2) mod (2^E - 1); }

where "state" is a big integer, formed by N integer "words" or digits (together representing E bits).

I see ideal error detection like this: have a function checksum() that takes the LL state (N words) and computes a small value (less than 64 bits) that represents a form of "checksum" or "abstraction" or "characteristic" or "reduction" of the LL state, with properties discussed below, e.g. checksum(BigInt state) -> int checkState. Let's call the small characteristic computed from the LL state "checkState".
We must have, in addition to the LLstep() shown above, a similar iteration function checkStep() on the checkState:

int checkStep(int checkState);

This checkStep() performs a function similar to LLstep(), but on the abstracted value checkState, i.e.:

checkStep(checksum(state)) == checksum(LLstep(state))

Now, the core of error detection lies in checking, after every iteration, that:

checksum(LLstep(state)) == checkStep(checksum(state))

Normally LLstep() is executed on the GPU, while checkStep() is executed on the CPU. In other words, keep the LL state on the GPU and keep checkState on the CPU. On each iteration:
- update the LL state on the GPU (using LLstep()) and update checkState on the CPU (using checkStep())
- compute checksum(LL state), either on the GPU or the CPU, and compare it with the CPU's checkState for equality.

The difficulty lies in finding a pair checksum() & checkStep() that is compatible with LLstep() above.
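As an illustration of the algebra being asked for (not a solution: for a real candidate 2^E - 1 no factor is known, or the test would be moot), here is a toy Python sketch using a composite 2^E - 1. When M is a known divisor of 2^E - 1, taking checksum(state) = state mod M makes a valid checkStep() exist, because reducing mod 2^E - 1 and then mod M is the same as reducing mod M directly:

```python
# Toy sketch (assumption: E chosen so 2^E - 1 has a known small factor M;
# here E = 11, 2^11 - 1 = 2047 = 23 * 89).
E = 11
Mp = 2**E - 1
M = 23                         # known divisor of Mp (toy assumption)

def ll_step(state):            # the LLstep() from the post
    return (state * state - 2) % Mp

def checksum(state):           # small abstraction of the full state
    return state % M

def check_step(c):             # valid only because M divides 2^E - 1
    return (c * c - 2) % M

state = 4
check = checksum(state)
for _ in range(50):
    state = ll_step(state)     # "GPU" side: full state
    check = check_step(check)  # "CPU" side: small mirror value
    assert checksum(state) == check   # per-iteration error check
```

The sketch only demonstrates the required commuting property checkStep(checksum(s)) == checksum(LLstep(s)); the open problem in the post is making such a checksum computable from the unpropagated word representation on the GPU.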
2017-05-03, 00:28   #2
science_man_88

"Forget I exist"
Jul 2009
Dumbassville

2^6·131 Posts

Quote:
 Originally Posted by preda Here I'm explaining how I see LL error-detection on the GPU -- maybe somebody has helpful ideas/tips in that direction. [...]
Well, there are a few values of the full residue that are known to show the test cannot be on track, and the test could exit early if it hits one of them prior to the final iteration. 2^p-3 is -2 mod 2^p-1, and (-2)^2 - 2 = 4 - 2 = 2, which then stays at 2, since 2^2 - 2 is also 2. 2^p-2 (aka -1 mod 2^p-1) leads to itself: (-1)^2 - 2 = 1 - 2 = -1. And 1 leads to -1: 1^2 - 2 = -1. So there are 4 values (maybe more I don't know about) that lead to a loop or a repeat if the testing proceeds normally from them.
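Those fixed points and cycles are easy to verify numerically; a quick Python check (the exponent p is an arbitrary choice):

```python
# Numeric check of the fixed points / cycles of x -> x^2 - 2 mod 2^p - 1
p = 31
Mp = 2**p - 1
step = lambda x: (x * x - 2) % Mp

assert step(Mp - 2) == 2        # -2 squares to 2 ...
assert step(2) == 2             # ... and 2 is a fixed point
assert step(Mp - 1) == Mp - 1   # -1 maps to itself
assert step(1) == Mp - 1        # 1 falls into the -1 fixed point
```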

Last fiddled with by science_man_88 on 2017-05-03 at 00:29

2017-05-03, 00:53 #3
preda
"Mihai Preda"
Apr 2015
5^3·11 Posts

1. The SUMINP != SUMOUT check is error detection, but it only covers the FFT convolution. It does not cover:
- the multiply from words (ints) to doubles
- the inverse multiply from doubles to words
- double rounding
- carry propagation
Thus I'd estimate it covers about 50% of the "error surface".

2. "Max rounding error" also covers only the FFT convolution (with the same limitations as SUMINP != SUMOUT), and IMO it is weaker. "Rounding error" becomes even weaker for error detection when the normal rounding-error rate increases (e.g. big words nearing 19 bits), when random flips in the FFT would fail to trigger a jump in the rounding error. "Rounding error" has the advantage of being easy and cheap on the GPU.

3. An error-detection mechanism that covers carry propagation must compute the checksum irrespective of how the carries are distributed. In other words, it must compute the same result before carry propagation as after it. An example: word-0 has 19 bits. A carry from word-0 to word-1 decreases word-0 by 2^19 and increases word-1 by 1. The checksum must compute the same value in the two cases (carry moved or not).

The first thing that comes to mind is a modulo checksum: choose an integer M for the modulo:

sum = 0
for i = N-1 downto 0:
    sum = ((sum << wordLength[i]) + word[i]) modulo M

This has the nice property of abstracting the carries. OTOH the function checkStep() which would "simulate" the LL iteration on this checksum does not exist. This is because LLstep() does the modulo (2^E - 1), and our checksum lost essential information for that by doing the modulo M.

4. My idea for how to fix the problem above (3) was this: instead of choosing an integer modulo M, choose a modulo value that is compatible with 2^E - 1. Compatible means "divides". Of course that'd be hard to do with an integer, but it's easy if we extend to an irrational (floating-point) modulo.
For example, choose M such that M^N == 2^E - 1, which means approximately:

bitsPerWord = E / N
M = 2^bitsPerWord

which produces M^N = 2^E (so M is irrational, slightly below 2^(E/N)). This has the very nice property:

checkStep(checkState) == (checkState^2 - 2) modulo M

The problem comes from the limited precision in computing the real (irrational) modulo M, coupled with the amplification of error that comes from the shifting involved in checksum() (required by the carry abstraction). Thus IMO this also doesn't work.

5. Other ideas?
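The carry-abstraction property in point 3 can be demonstrated with a small Python sketch of the modulo checksum (uniform 19-bit words and the particular M are my illustration choices): because the Horner loop computes the represented integer mod M, any pending carry gives the same checksum as the propagated form.

```python
# Toy sketch of the point-3 modulo checksum, uniform 19-bit words.
W = 19
M = (1 << 31) - 1           # arbitrary checksum modulus for illustration

def word_checksum(words):
    s = 0
    for w in reversed(words):          # i = N-1 downto 0
        s = ((s << W) + w) % M         # sum = (sum << wordLength) + word, mod M
    return s

# The same big integer, before and after propagating a carry out of word 0:
pending = [(1 << 19) + 5, 7, 3]        # word 0 holds an unpropagated carry
settled = [5, 8, 3]                    # carry moved into word 1
assert word_checksum(pending) == word_checksum(settled)
```

As the post notes, this checksum abstracts the carries nicely, but no checkStep() exists for it, since the reduction mod M discards the information that LLstep()'s reduction mod 2^E - 1 needs.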
2017-05-03, 01:03   #4
preda

"Mihai Preda"
Apr 2015

5^3·11 Posts

Quote:
 Originally Posted by science_man_88 well there's a few values of the full residue that are known to not be prime and the test could exit 0 if prior to the final iteration [...]
Yes, but checking that the LL state is not one of a few "bad values" is a very weak check, unless there's an argument showing that these bad values are "attractors" over the whole LL state space (which I think they aren't).

It is a weak check because the space of "error LL states" is huge (about 2^E) and the test only catches a few of them.

2017-05-03, 01:29 #5
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
16776 (octal) Posts

Here's a random idea off the top of my head (read: may be worthless).

Let's assume most GPU errors are caused by memory problems, not incorrect floating-point computations. How about computing a checksum of all the FFT data as it is written to GPU memory? Then compute a checksum as it is read back in from GPU memory. Compare, and raise an error if they ever differ.

The pros: it is very simple to implement, and it works for both the FFT process and the carry-propagation process. The cons: it will not catch arithmetic errors.
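A host-side toy model of this write/read checksum idea (the class name and the use of CRC32 are my illustration choices, not code from any actual client):

```python
import zlib

class CheckedBuffer:
    """Checksum data when written; verify the checksum when read back."""
    def __init__(self, data: bytes):
        self.data = bytearray(data)       # stands in for a GPU buffer
        self.crc = zlib.crc32(data)       # checksum computed on write

    def read(self) -> bytes:
        # recompute on read; a mismatch means memory changed under us
        if zlib.crc32(bytes(self.data)) != self.crc:
            raise RuntimeError("memory corruption detected")
        return bytes(self.data)

buf = CheckedBuffer(b"fft words")
assert buf.read() == b"fft words"         # clean read passes

buf.data[3] ^= 1                          # simulate a flipped bit in memory
try:
    buf.read()
    caught = False
except RuntimeError:
    caught = True
assert caught                             # the flip is detected on read
```

As the post says, this covers data sitting in (and moving through) memory, but an arithmetic error introduced between the read-checksum and the next write-checksum passes undetected.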
 2017-05-03, 04:21 #6 kladner     "Kieren" Jul 2011 In My Own Galaxy! 2×3×1,693 Posts Keeping watch on the memory would be a very good idea, at least based on my experiences with CUDALucas on a few different cards.
2017-05-03, 05:11   #7
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

2^4×613 Posts

Quote:
 Originally Posted by Prime95 Let's assume most GPU errors are caused by memory problems not incorrect floating-point computations.
Which is 101% true. There was a time in the past (see my former posts) when there was a market for memory chips with defects -- and that was the graphics-card market. Think of it like this: for a chip with 500 megabits on it, or even one gigabit, a defect rate of even 1 ppm (parts per million) will result in 500 to 1000 bits that cannot be either set or reset. Even when you make screws and nuts, some of them have defects; there is no perfect production line. And that was at a time when 10 ppm was kind of top technology that not many foundries could achieve. Nowadays, they just build more memory inside each chip and use intelligent relocation of the bad bits. They actually relocate full rows and columns, for just a single bad bit, to be able to give you error-free chips. Which matters for exact calculus.

But every manufacturer used to have its own yields -- not 100% of the chips coming out of the foundry were good. Some of the bad parts were beyond hope, so they were scrapped, moved to the grinding machines, transformed to powder, and then, here you are, into the gold-recovering line, etc. But some others were still usable, assuming you could live with one (or a few) bad bit(s). And guess what: nobody would care, or even see, if one pixel on your screen can only display 8 million colors, or 4, or 2, instead of the intended 16 million. The human eye anyhow can only see a few thousand, and only talented artists and painters can distinguish like ten thousand colors or so. Professional monitor manufacturers say they use 8 bits of every fundamental color (red, green, blue) for a total of 16,777,216 colors (24 bits), but actually only the 6 most significant lines are connected to the LCD, making only 262,144 colors physically on screen (18 bits). Therefore, memories with only a few bits affected were sold very cheaply to graphics-card manufacturers, and used by them, of course. You may have 32-bit color in your OS, but your cheap monitor may not be able to show you more than 5-6-5, with some TMED to enhance the number of colors.

Nowadays, some still use them. If your card is not functioning properly for calculus, but has no flaw for videos and games, there is a higher probability that the culprit is not the ALUs inside (arithmetical units, whatever those may be), but the memory. Some bits are just so stubborn: they will not flip when you say so, but only when they want to...

Last fiddled with by LaurV on 2017-05-03 at 05:34

2017-05-03, 07:09 #8
preda
"Mihai Preda"
Apr 2015
5^3×11 Posts

In the LL test, the buffers do not move around; they are allocated once at the beginning of the test and then keep the same position in global memory. So the repeated iterations hit the same bytes again and again, and a stuck bit would be visible "from the start" if the LL buffers hit it. I would not be so concerned about a consistent error that manifests itself every time. What's more difficult is the error that hits, let's say, about once in 24h; if you repeat, it's not there anymore.

About memory vs. computation: there is global memory, then the caches (which likely use different technology from global memory), and there are the registers (VGPRs), which are also a form of "cache memory". Now, if a bit-flip hits a VGPR, it would be hard to argue whether that's memory or compute. OTOH I have no idea about the probability distribution of errors among these different elements. Let's hope mostly only global memory is affected :)
2017-05-03, 13:08   #9
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

2^4×613 Posts

Quote:
 Originally Posted by preda Let's hope mostly only global memory is affected :)
You are totally right, and that is why we do DC. BTW, how about implementing a shift? This is more important (and easier) than the new FFTs. Right now, we cannot LL and DC the same expos with gpuOwl, due to the fact that the test is "too" deterministic (a software error in the FFT will repeat itself on subsequent tests).

So, the first step would be to implement a random shift, like P95 and cudaLucas have.

Here I just realized that my LL+DC is futile, as I am doing one with clLucas and one with gpuOwl, and neither of them has shifts. Anyhow, Madpoo will do a TC, so I am not concerned.

And as long as we are at it, can you implement a -l (or -L, i.e. list) command-line switch, to list/enumerate the available devices? A device-selection switch is not of much use if we are so ignorant that we do not know which GPU-able thingies we have in our computer, is it? Like GPU-Z does, listing all the available devices. This would not be difficult to implement. Maybe I have a 5000-GHz-days toy inside the box and have no idea it is there...

2017-05-03, 13:34   #10
preda

"Mihai Preda"
Apr 2015

5^3·11 Posts

Quote:
 Originally Posted by LaurV You are totally right, and that is why we do DC. BTW, how about implementing a shift? , this is more important (and easier) than the new FFTs. Right now, we can not LL and DC the same expos with gpuOwl, due to the fact that the test is "too" deterministic (a software error in the FFT will repeat itself on the subsequent tests).
I thought a bit about the "offset". It's not exactly free to implement, mostly in code complexity and coding effort. It also offers no protection against transient hardware errors (which are now my main concern). Also, I don't exactly see a big benefit; let me explain why: it may be a good idea to DC on a different "platform" from the LL, not just with a different offset. I also agree that a DC should in general not be done "at the same time" as an LL, because the information benefit of a DC is much less than that of an LL; thus a DC in the same exponent range as the LL front is kind of a waste. Doing the DC, let's say, a year or more behind makes more sense IMO.

Validating between clLucas and gpuOwL (for checking software correctness, not for DC) is still quite strong, because the two have completely different FFT implementations, with different trigonometric tables, etc., which is visible in the different rounding-error results. I would expect the two to have different rounding too, and different carry propagation, but I'd need to check clLucas' source to be sure of that.

You can always double-check exponents that have been LLd on a CPU; that way you get software validation plus a useful DC, but you don't get an "early abort" if gpuOwL is going awry.

Quote:
 Originally Posted by LaurV And as long as we are at it, can you implement a -l (or -L, i.e. list) command line switch, to list/enumerate the available devices?
Already there, see -h or --help.

2017-05-03, 15:24 #11
LaurV
Romulan Interpreter
"name field"
Jun 2011
Thailand
2^4·613 Posts

Well, we disagree about the shift. First, it is important, to protect against software errors. Second, it is (almost) free. The only affected parts are the start, where you begin with "4 shifted by some random x bits", and the "-2" step, where you instead subtract "-2 shifted" each time, and then "shift some more" the "-2 shifted" value. All the FFT and all the rest of the code stay the same. Here is an example
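The shifted iteration described above can be sketched in a few lines of Python (a minimal model, not gpuOwl code): squaring a residue shifted left by s bits yields one shifted by 2s bits, and since 2^p ≡ 1 (mod 2^p - 1) the shift count is kept mod p, so the "-2" must be shifted by the doubled amount on each iteration.

```python
p = 13                    # 2^13 - 1 = 8191 is a Mersenne prime
Mp = 2**p - 1

a = 4                     # plain LL state
s = 5                     # arbitrary initial shift count
b = (4 << s) % Mp         # shifted LL state

for _ in range(p - 2):
    a = (a * a - 2) % Mp
    s = (2 * s) % p                 # the shift doubles at each squaring
    b = (b * b - (2 << s)) % Mp     # subtract the correspondingly shifted "-2"
    # un-shifting b (rotating right by s bits) recovers the plain state
    assert (b * pow(2, p - s, Mp)) % Mp == a

assert a == 0             # M13 is prime, so the LL residue is 0
```

The point of the random shift is that the two runs exercise different bit positions and FFT data, so a deterministic software error no longer reproduces identically in the LL and the DC.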

