mersenneforum.org gpuOwL: a tiny OpenCL "Mersenne primality testing" implementation
 Register FAQ Search Today's Posts Mark Forums Read

2020-02-13, 21:04   #1849
ewmayer
2ω=0

Sep 2002
República de California

2·3·13·137 Posts

Quote:
 Originally Posted by preda OK I checked myself, the problem with M15000031 is that by default it gets a too small FFT size for P-1 (it's at the border). If FFT size is manually increased the factor is found. I'll keep an eye on improving the default FFT size.
Shouldn't the too-small-FFT-size manifest via excessive-fractional-parts (a.k.a. roundoff errors) detected during the round-and-carry step?

2020-02-13, 21:21   #1850
mrh

"mrh"
Oct 2018
Temecula, ca

53 Posts

Quote:
 Originally Posted by preda OK I checked myself, the problem with M15000031 is that by default it gets a too small FFT size for P-1 (it's at the border). If FFT size is manually increased the factor is found. I'll keep an eye on improving the default FFT size.
Ah, thats good to know. I didn't think of that.

 2020-02-13, 22:04 #1851 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest C8916 Posts Are empty worktodo lines (just a newline) not allowed? Code: 2020-02-13 19:19:40 colab2-TeslaT4 {"exponent":"10000831", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-145-g6146b6d-dirty"}, "timestamp":"2020-02-13 19:19:40 UTC", "user":"kriesel", "computer":"colab2-TeslaT4", "fft-length":524288, "B1":30000, "B2":500000, "factors":["646560662529991467527"]} 2020-02-13 19:19:40 colab2-TeslaT4 worktodo.txt line ignored: "" terminate called after throwing an instance of 'char const*' Last fiddled with by kriesel on 2020-02-13 at 22:04
2020-02-14, 07:19   #1852
preda

"Mihai Preda"
Apr 2015

2×11×41 Posts

Quote:
 Originally Posted by ewmayer Shouldn't the too-small-FFT-size manifest via excessive-fractional-parts (a.k.a. roundoff errors) detected during the round-and-carry step?
There is no excessive-fractional-parts detection in the round-and-carry. It does have overhead, and for PRP it isn't needed as the GEC provides better cover. This does leave P-1 unprotected.

2020-02-14, 17:21   #1853
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1100100010012 Posts

Quote:
 Originally Posted by preda There is no excessive-fractional-parts detection in the round-and-carry. It does have overhead, and for PRP it isn't needed as the GEC provides better cover. This does leave P-1 unprotected.
Gpuowl P-1 needs error detection. While for production wavefront running on fast gpus the individual run times are short, P-1 run times in the upper reaches of exponent are not (~4 days each on a Tesla P100 or Radeon VII near 109), and the cost of a missed factor due to P-1 computational error is lost factors and needless primality tests.

Error check possibilities include (in random order):
1. the afore-mentioned excessive-fractional-parts detection in the round-and-carry
2. res64 check for problem values in stage 1 (0 is probably bad, check the full res. I've had this occur with a recent commit of Gpuowl and unsuitable-for-the-fft-length -use options. https://mersenneforum.org/showpost.p...postcount=1838) 1 occurs in CUDAPm1. And see #7
3. res64 check repeat same value in stage 1. And see #7
4. Jacobi checks in selected portions of the computation, perhaps as an option since it is expensive, requiring both computing the correct Jacobi and the actual. There is a whole thread about that, at https://mersenneforum.org/showthread...ghlight=jacobi
5. screening -use options for suitability for the selected fft length (and this applies also to PRP)
6. initial-powering correctness check (a variation for P-1 of https://www.mersenneforum.org/showpo...72&postcount=9) This covers a broader range of erroneous results than merely detecting zero result as in #2 above, but only in the relatively few iterations before the modulo kicks in. This will quickly detect very seriously pathological runs and may detect less drastic issues due to too-small fft length or inappropriate -use option or seriously malfunctioning hardware where the powering is failing within the initial dozens of iterations, by turning the initial iterations into a fast very low cost inline self-test.
7. statistical checks on stage 1 res64, and perhaps on other measures (see #3 of https://www.mersenneforum.org/showpo...1&postcount=10)
Quote:
 In some pathological cases as little as 5 interim residues may be enough to identify an issue
The statistical check detects 0 res64, repeating res64, short-period cyclic patterns, and a lot more, although it detects them a bit later than specific checks would.
8. automatic built-in generic short self-test for the same fft length, run with the same -use options as the P-1 worktodo run will use, immediately before the start or resume of the P-1 computation. This is a variation of https://www.mersenneforum.org/showpo...2&postcount=14
9. "Gerbicz refers to a possible check on P-1, when there's a known factor, which is a very considerable restriction, in http://www.mersenneforum.org/showpos...&postcount=252" https://mersenneforum.org/showthread.php?t=23467
What's the P-1 factoring error rate? We don't know. https://mersenneforum.org/showpost.p...37&postcount=3
Are there more P-1 error check possibilities?

Last fiddled with by kriesel on 2020-02-14 at 18:06

2020-02-14, 21:44   #1854
ewmayer
2ω=0

Sep 2002
República de California

101001101111102 Posts

Quote:
 Originally Posted by preda There is no excessive-fractional-parts detection in the round-and-carry. It does have overhead, and for PRP it isn't needed as the GEC provides better cover. This does leave P-1 unprotected.
Does that imply that you had implemented per-output ROE checking at some point? If so, what kind of performance hit did you see?

I'm also just beginning to delve into the underlying code here, so you can answer this more quickly: what is the underlying hardware instruction set used by your code, and does it include the needed round() instruction? If so, what is the hardware latency and pipelineability of that, and how do you expect it to compare to the old coders' trick (developed before IEEE floating-point standardization and widespread use of dedicated round() instructions) of rnd(x) = (x + c) - c, where c = 0.75*2^[#significand bits in a floating datum] needing just an add and a sub?

If the % hit with-ROE-checking is significant even after choosing the best of the above options, how difficult would it be to deploy a special round-and-carry-with-ROE-checking routine, which would be invoked only during p-1 testing (and any other future modmul sequence for which a Gerbicz-style check is unavailable)? In order to gauge how many PRP tests the addition of ROE checking here would save, we need some data re. missed p-1 factors - perhaps there could be a dedicated near-term QA effort comparing factors found for a decently large representative set of expos, we could compare factors found by trying each expo twice using the same stage bounds:

[1] Using gpuOwl with default settings, i.e. defualt FFT length and no ROE checking;

[2] Using gpuOwl with next-larger-than-default FFT length, or - probably better - Prime95/mprime with default FFT param and same p-1 stage bounds as were used in [1].

Or perhaps there are already some stats in hand here, based on early-DCs of first-time PRP tests? Or do those PRP-DC runs skip the p-1 step?

2020-02-15, 00:19   #1855
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

62118 Posts

Quote:
 Originally Posted by ewmayer If the % hit with-ROE-checking is significant even after choosing the best of the above options, how difficult would it be to deploy a special round-and-carry-with-ROE-checking routine, which would be invoked only during p-1 testing (and any other future modmul sequence for which a Gerbicz-style check is unavailable)?
Such as the contemplated return of LL to gpuowl, for instance.
Quote:
 In order to gauge how many PRP tests the addition of ROE checking here would save, we need some data re. missed p-1 factors - perhaps there could be a dedicated near-term QA effort comparing factors found for a decently large representative set of expos, we could compare factors found by trying each expo twice using the same stage bounds: [1] Using gpuOwl with default settings, i.e. default FFT length and no ROE checking; [2] Using gpuOwl with next-larger-than-default FFT length, or - probably better - Prime95/mprime with default FFT param and same p-1 stage bounds as were used in [1]. Or perhaps there are already some stats in hand here, based on early-DCs of first-time PRP tests? Or do those PRP-DC runs skip the p-1 step?
Madpoo may be able to dig up stats on those, if/when he has the time and inclination. To ordinary users, it's not clear what software was used to produce P-1 results on the server. One approach is to slam all the CUDAPm1 selftest candidates through a gpuowl install. However, above 432M, that's testing gpuowl for finding factors it already found. Also the sample size is too small. Best bet seems to me to determine whether prime95 finds factors that gpuowl does not, by running gpuowl on a suitably sized set of prime95-found-factors exponents.

Last fiddled with by kriesel on 2020-02-15 at 00:20

2020-02-15, 06:33   #1856
preda

"Mihai Preda"
Apr 2015

2×11×41 Posts

Quote:
 Originally Posted by kriesel Gpuowl P-1 needs error detection.
I'm thinking of a way to use GEC with P-1 first-stage, which would not be a waste if somebody is planning to continue with PRP on the same exponent (if a factor is not found).

The idea is to use right-to-left binary exponentiation, which can use to a large degree the error check. The residue thus computed can be saved and used to start the PRP from this point on.

(right now P-1 first-state uses left-to-right binary exponentiation, which is more efficient but can't use the error check).

https://en.wikipedia.org/wiki/Modular_exponentiation

 2020-02-15, 07:33 #1857 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 11010000100112 Posts @Ernst: The GCN timings doc is here -- https://github.com/CLRX/CLRX-mirror/wiki/GcnTimings ROE error checking would be slower but it would be useful debugging option to sanity check FFT length selections.
2020-02-15, 16:05   #1858
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3,209 Posts

Quote:
 Originally Posted by kriesel It appears from nvidia-smi output at session start, that since Colab gpus are on headless linux VMs, the initial occupied gpu ram is 0. T4 0/15079 MiB P100 0/16280 MiB K80 ?
P4 0/7611 MiB
Stay tuned for K80, haven't got one in a while.

2020-02-15, 17:25   #1859
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

62118 Posts

Quote:
 Originally Posted by preda I'm thinking of a way to use GEC with P-1 first-stage, which would not be a waste if somebody is planning to continue with PRP on the same exponent (if a factor is not found). The idea is to use right-to-left binary exponentiation, which can use to a large degree the error check. The residue thus computed can be saved and used to start the PRP from this point on. (right now P-1 first-state uses left-to-right binary exponentiation, which is more efficient but can't use the error check). https://en.wikipedia.org/wiki/Modular_exponentiation
Interesting. I had thought from https://www.mersenneforum.org/showth...624#post470624 and https://www.mersenneforum.org/showpo...&postcount=245 that a known factor was required to apply the GEC to P-1 or LL. Left-to-right exponentiation was also an obstacle for applying the Jacobi symbol check to much of P-1, along with the additional check effort of computing both correct Jacobi and actual Jacobi for P-1 progress to a given point.
If the performance hit when applied to P-1 stage 1 is not too bad, it would be a tremendous advance, since P-1 is currently by its nature quite thin on error checks compared to primality testing.

Last fiddled with by kriesel on 2020-02-15 at 17:39

 Similar Threads Thread Thread Starter Forum Replies Last Post Bdot GPU Computing 1583 2020-02-20 18:25 xx005fs GPU Computing 0 2019-07-26 21:37 lukerichards Software 8 2018-01-24 22:30 mathPuzzles Math 8 2017-04-21 07:21 CRGreathouse Computer Science & Computational Number Theory 18 2013-06-08 19:12

All times are UTC. The time now is 23:07.

Sun Feb 23 23:07:11 UTC 2020 up 23 days, 17:39, 2 users, load averages: 2.23, 2.32, 2.43