Thread: GpuOwl 7.x
2020-10-11, 08:20   #82
"Mihai Preda"
Apr 2015

I've been blessed with a flaky GPU

My broken GPU generates about 20 errors per PRP test. Those are handled just fine by the error check for PRP, but the situation is different for P-1.

I tried to harden P1 (the first stage of P-1). I reimplemented the "fold" step to make it less exposed to GPU errors. (The "fold" operation is at the core of the new P1 implementation: it aggregates the many P1 buffers into a single residue, which is used for saving or at the end of P1.) I also added the Jacobi check for P1, which seems to work nicely with the new P1 algorithm.

While there, I also added residue==0 detection for PRP, which was a long-requested feature. This simply flags any res64==0 as suspicious and does an early check when possible.

I added back a more frequent res64 display in the log (now that I read it more frequently from the GPU side); I do not expect this to affect performance.

P1: Jacobi check

A Jacobi check is done at the start (on load) of P1, and every 1M iterations while P1 is ongoing. The Jacobi check is a CPU operation, about as slow as a GCD (let's say 35s CPU at the wavefront).
*If* the Jacobi check fails (which is rare), a rather tricky rollback over savefiles is attempted, to start anew from an earlier point that is hopefully not affected (we'll see what bugs hide in there).
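
For illustration, here is a minimal Python sketch of what a Jacobi check on a P1 residue can look like. It is not taken from GpuOwl's source, and the post does not say exactly which quantity GpuOwl tests. The sketch assumes the residue is a power of 3 modulo M_p = 2^p - 1, i.e. 3^E mod M_p. For odd p >= 3, (3 | M_p) = -1, so the expected symbol is (-1)^E. The function names are made up for the example.

```python
def jacobi(a, n):
    """Jacobi symbol (a | n) for odd n > 0, standard binary algorithm."""
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    a %= n
    result = 1
    while a != 0:
        # Pull out factors of 2 from a; (2 | n) = -1 iff n = 3 or 5 mod 8.
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        # Quadratic reciprocity: swap, flip sign if both are 3 mod 4.
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def expected_symbol(exponent):
    # Assumption: residue = 3^exponent mod M_p, with p odd and p >= 3.
    # Then (3 | M_p) = -1, hence (3^E | M_p) = (-1)^E.
    return 1 if exponent % 2 == 0 else -1


def p1_residue_plausible(residue, exponent, p):
    """True if the residue's Jacobi symbol matches the expected one."""
    m = (1 << p) - 1
    return jacobi(residue, m) == expected_symbol(exponent)
```

With the toy exponent p = 11 and E = 2310, `p1_residue_plausible(pow(3, E, 2**11 - 1), E, 11)` holds, while a wrong residue such as 3 fails the check. A check of this kind cannot catch everything: a random corruption changes the symbol only about half the time.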

P2: What about the error-hardening of P2 (i.e. second stage of P-1)?
I think the Jacobi check is not applicable to P2; OTOH I also think that computation errors are not critical in P2. I'll explain why.

For P1 we need the precisely-correct final P1 result; even the slightest error during P1 would make both the P1 GCD and the whole of P2 useless.

The situation is different though (IMO) for P2. An error during P2 would only affect the prime factors that happened to be P2-accumulated between the last P2 GCD and the location of the error. The follow-up multiplications (that take place after the error) are not affected by the P2 error. Thus, in P2, an error "erases" just a few stage-two primes (depending on how often the P2 GCD is done), without the "catastrophic" consequence of nullifying the whole P2. There is one exception -- if the P2 accumulator becomes zero (due to an error) it would remain stuck there -- but this special value can be detected and handled (e.g. by resetting the accumulator to 1).
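
The locality argument above can be tried out on a toy model. The Python sketch below is only an illustration under stated assumptions, not GpuOwl's P2 code: it multiplies a list of small "terms" into an accumulator modulo a small composite n, takes a GCD every `gcd_every` terms, and restarts the accumulator at 1 after each GCD. A GPU error is simulated by overwriting the accumulator at one index. All names and parameters are hypothetical.

```python
from math import gcd


def p2_accumulate(terms, n, gcd_every, corrupt_at=None, corrupt_value=0,
                  reset_zero=True):
    """Toy stage-2 loop; returns the set of proper factors of n found."""
    acc = 1
    found = set()
    for i, t in enumerate(terms):
        acc = acc * t % n
        if i == corrupt_at:
            acc = corrupt_value % n      # simulated GPU error
        if acc == 0 and reset_zero:
            acc = 1                      # zero would otherwise stay stuck
        if (i + 1) % gcd_every == 0 or i == len(terms) - 1:
            g = gcd(acc, n)
            if 1 < g < n:
                found.add(g)
            acc = 1                      # assumption: restart after each GCD
    return found


n = 101 * 103
early = [2, 3, 5, 7, 202, 11, 13, 17]    # 202 = 2 * 101 sits at index 4
late = [2, 3, 5, 7, 11, 13, 202, 17]     # 202 sits at index 6

no_error = p2_accumulate(early, n, 4)
other_interval = p2_accumulate(early, n, 4, corrupt_at=1, corrupt_value=1234)
error_after_term = p2_accumulate(early, n, 4, corrupt_at=5, corrupt_value=1234)
error_before_term = p2_accumulate(late, n, 4, corrupt_at=5, corrupt_value=1234)
zero_with_reset = p2_accumulate(late, n, 4, corrupt_at=5, corrupt_value=0)
zero_stuck = p2_accumulate(late, n, 4, corrupt_at=5, corrupt_value=0,
                           reset_zero=False)
```

In this toy run the factor 101 is lost only in `error_after_term` (the error lands after the useful term but before the next GCD) and in `zero_stuck` (zero accumulator, no reset). In the other four cases 101 is still found.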