mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl

Reply
 
Thread Tools
Old 2020-02-13, 21:04   #1849
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
Rep├║blica de California

11,423 Posts
Default

Quote:
Originally Posted by preda View Post
OK I checked myself, the problem with M15000031 is that by default it gets a too small FFT size for P-1 (it's at the border). If FFT size is manually increased the factor is found. I'll keep an eye on improving the default FFT size.
Shouldn't the too-small-FFT-size manifest via excessive-fractional-parts (a.k.a. roundoff errors) detected during the round-and-carry step?
ewmayer is offline   Reply With Quote
Old 2020-02-13, 21:21   #1850
mrh
 
"mrh"
Oct 2018
Temecula, ca

5×11 Posts
Default

Quote:
Originally Posted by preda View Post
OK I checked myself, the problem with M15000031 is that by default it gets a too small FFT size for P-1 (it's at the border). If FFT size is manually increased the factor is found. I'll keep an eye on improving the default FFT size.
Ah, thats good to know. I didn't think of that.
mrh is offline   Reply With Quote
Old 2020-02-13, 22:04   #1851
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·13·163 Posts
Default

Are empty worktodo lines (just a newline) not allowed?
Code:
2020-02-13 19:19:40 colab2-TeslaT4 {"exponent":"10000831", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-145-g6146b6d-dirty"}, "timestamp":"2020-02-13 19:19:40 UTC", "user":"kriesel", "computer":"colab2-TeslaT4", "fft-length":524288, "B1":30000, "B2":500000, "factors":["646560662529991467527"]} 
2020-02-13 19:19:40 colab2-TeslaT4 worktodo.txt line ignored: ""
terminate called after throwing an instance of 'char const*'

Last fiddled with by kriesel on 2020-02-13 at 22:04
kriesel is offline   Reply With Quote
Old 2020-02-14, 07:19   #1852
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

3×5×7×11 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Shouldn't the too-small-FFT-size manifest via excessive-fractional-parts (a.k.a. roundoff errors) detected during the round-and-carry step?
There is no excessive-fractional-parts detection in the round-and-carry. It does have overhead, and for PRP it isn't needed as the GEC provides better cover. This does leave P-1 unprotected.
preda is online now   Reply With Quote
Old 2020-02-14, 17:21   #1853
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·13·163 Posts
Thumbs up

Quote:
Originally Posted by preda View Post
There is no excessive-fractional-parts detection in the round-and-carry. It does have overhead, and for PRP it isn't needed as the GEC provides better cover. This does leave P-1 unprotected.
Gpuowl P-1 needs error detection. While for production wavefront running on fast gpus the individual run times are short, P-1 run times in the upper reaches of exponent are not (~4 days each on a Tesla P100 or Radeon VII near 109), and the cost of a missed factor due to P-1 computational error is lost factors and needless primality tests.

Error check possibilities include (in random order):
  1. the afore-mentioned excessive-fractional-parts detection in the round-and-carry
  2. res64 check for problem values in stage 1 (0 is probably bad, check the full res. I've had this occur with a recent commit of Gpuowl and unsuitable-for-the-fft-length -use options. https://mersenneforum.org/showpost.p...postcount=1838) 1 occurs in CUDAPm1. And see #7
  3. res64 check repeat same value in stage 1. And see #7
  4. Jacobi checks in selected portions of the computation, perhaps as an option since it is expensive, requiring both computing the correct Jacobi and the actual. There is a whole thread about that, at https://mersenneforum.org/showthread...ghlight=jacobi
  5. screening -use options for suitability for the selected fft length (and this applies also to PRP)
  6. initial-powering correctness check (a variation for P-1 of https://www.mersenneforum.org/showpo...72&postcount=9) This covers a broader range of erroneous results than merely detecting zero result as in #2 above, but only in the relatively few iterations before the modulo kicks in. This will quickly detect very seriously pathological runs and may detect less drastic issues due to too-small fft length or inappropriate -use option or seriously malfunctioning hardware where the powering is failing within the initial dozens of iterations, by turning the initial iterations into a fast very low cost inline self-test.
  7. statistical checks on stage 1 res64, and perhaps on other measures (see #3 of https://www.mersenneforum.org/showpo...1&postcount=10)
    Quote:
    In some pathological cases as little as 5 interim residues may be enough to identify an issue
    The statistical check detects 0 res64, repeating res64, short-period cyclic patterns, and a lot more, although it detects them a bit later than specific checks would.
  8. automatic built-in generic short self-test for the same fft length, run with the same -use options as the P-1 worktodo run will use, immediately before the start or resume of the P-1 computation. This is a variation of https://www.mersenneforum.org/showpo...2&postcount=14
  9. "Gerbicz refers to a possible check on P-1, when there's a known factor, which is a very considerable restriction, in http://www.mersenneforum.org/showpos...&postcount=252" https://mersenneforum.org/showthread.php?t=23467
What's the P-1 factoring error rate? We don't know. https://mersenneforum.org/showpost.p...37&postcount=3
Are there more P-1 error check possibilities?

Last fiddled with by kriesel on 2020-02-14 at 18:06
kriesel is offline   Reply With Quote
Old 2020-02-14, 21:44   #1854
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
Rep├║blica de California

11,423 Posts
Default

Quote:
Originally Posted by preda View Post
There is no excessive-fractional-parts detection in the round-and-carry. It does have overhead, and for PRP it isn't needed as the GEC provides better cover. This does leave P-1 unprotected.
Does that imply that you had implemented per-output ROE checking at some point? If so, what kind of performance hit did you see?

I'm also just beginning to delve into the underlying code here, so you can answer this more quickly: what is the underlying hardware instruction set used by your code, and does it include the needed round() instruction? If so, what is the hardware latency and pipelineability of that, and how do you expect it to compare to the old coders' trick (developed before IEEE floating-point standardization and widespread use of dedicated round() instructions) of rnd(x) = (x + c) - c, where c = 0.75*2^[#significand bits in a floating datum] needing just an add and a sub?

If the % hit with-ROE-checking is significant even after choosing the best of the above options, how difficult would it be to deploy a special round-and-carry-with-ROE-checking routine, which would be invoked only during p-1 testing (and any other future modmul sequence for which a Gerbicz-style check is unavailable)? In order to gauge how many PRP tests the addition of ROE checking here would save, we need some data re. missed p-1 factors - perhaps there could be a dedicated near-term QA effort comparing factors found for a decently large representative set of expos, we could compare factors found by trying each expo twice using the same stage bounds:

[1] Using gpuOwl with default settings, i.e. defualt FFT length and no ROE checking;

[2] Using gpuOwl with next-larger-than-default FFT length, or - probably better - Prime95/mprime with default FFT param and same p-1 stage bounds as were used in [1].

Or perhaps there are already some stats in hand here, based on early-DCs of first-time PRP tests? Or do those PRP-DC runs skip the p-1 step?
ewmayer is offline   Reply With Quote
Old 2020-02-15, 00:19   #1855
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×13×163 Posts
Default

Quote:
Originally Posted by ewmayer View Post
If the % hit with-ROE-checking is significant even after choosing the best of the above options, how difficult would it be to deploy a special round-and-carry-with-ROE-checking routine, which would be invoked only during p-1 testing (and any other future modmul sequence for which a Gerbicz-style check is unavailable)?
Such as the contemplated return of LL to gpuowl, for instance.
Quote:
In order to gauge how many PRP tests the addition of ROE checking here would save, we need some data re. missed p-1 factors - perhaps there could be a dedicated near-term QA effort comparing factors found for a decently large representative set of expos, we could compare factors found by trying each expo twice using the same stage bounds:

[1] Using gpuOwl with default settings, i.e. default FFT length and no ROE checking;

[2] Using gpuOwl with next-larger-than-default FFT length, or - probably better - Prime95/mprime with default FFT param and same p-1 stage bounds as were used in [1].

Or perhaps there are already some stats in hand here, based on early-DCs of first-time PRP tests? Or do those PRP-DC runs skip the p-1 step?
Madpoo may be able to dig up stats on those, if/when he has the time and inclination. To ordinary users, it's not clear what software was used to produce P-1 results on the server. One approach is to slam all the CUDAPm1 selftest candidates through a gpuowl install. However, above 432M, that's testing gpuowl for finding factors it already found. Also the sample size is too small. Best bet seems to me to determine whether prime95 finds factors that gpuowl does not, by running gpuowl on a suitably sized set of prime95-found-factors exponents.

Last fiddled with by kriesel on 2020-02-15 at 00:20
kriesel is offline   Reply With Quote
Old 2020-02-15, 06:33   #1856
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

3×5×7×11 Posts
Default

Quote:
Originally Posted by kriesel View Post
Gpuowl P-1 needs error detection.
I'm thinking of a way to use GEC with P-1 first-stage, which would not be a waste if somebody is planning to continue with PRP on the same exponent (if a factor is not found).

The idea is to use right-to-left binary exponentiation, which can use to a large degree the error check. The residue thus computed can be saved and used to start the PRP from this point on.

(right now P-1 first-state uses left-to-right binary exponentiation, which is more efficient but can't use the error check).

https://en.wikipedia.org/wiki/Modular_exponentiation
preda is online now   Reply With Quote
Old 2020-02-15, 07:33   #1857
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

22·1,759 Posts
Default

@Ernst: The GCN timings doc is here -- https://github.com/CLRX/CLRX-mirror/wiki/GcnTimings

ROE error checking would be slower but it would be useful debugging option to sanity check FFT length selections.
Prime95 is online now   Reply With Quote
Old 2020-02-15, 16:05   #1858
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

108E16 Posts
Default

Quote:
Originally Posted by kriesel View Post
It appears from nvidia-smi output at session start, that since Colab gpus are on headless linux VMs, the initial occupied gpu ram is 0.
T4 0/15079 MiB
P100 0/16280 MiB
K80 ?
P4 0/7611 MiB
Stay tuned for K80, haven't got one in a while.
kriesel is offline   Reply With Quote
Old 2020-02-15, 17:25   #1859
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×13×163 Posts
Default

Quote:
Originally Posted by preda View Post
I'm thinking of a way to use GEC with P-1 first-stage, which would not be a waste if somebody is planning to continue with PRP on the same exponent (if a factor is not found).

The idea is to use right-to-left binary exponentiation, which can use to a large degree the error check. The residue thus computed can be saved and used to start the PRP from this point on.

(right now P-1 first-state uses left-to-right binary exponentiation, which is more efficient but can't use the error check).

https://en.wikipedia.org/wiki/Modular_exponentiation
Interesting. I had thought from https://www.mersenneforum.org/showth...624#post470624 and https://www.mersenneforum.org/showpo...&postcount=245 that a known factor was required to apply the GEC to P-1 or LL. Left-to-right exponentiation was also an obstacle for applying the Jacobi symbol check to much of P-1, along with the additional check effort of computing both correct Jacobi and actual Jacobi for P-1 progress to a given point.
If the performance hit when applied to P-1 stage 1 is not too bad, it would be a tremendous advance, since P-1 is currently by its nature quite thin on error checks compared to primality testing.

Last fiddled with by kriesel on 2020-02-15 at 17:39
kriesel is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
mfakto: an OpenCL program for Mersenne prefactoring Bdot GPU Computing 1622 2020-08-12 14:39
GPUOWL AMD Windows OpenCL issues xx005fs GpuOwl 0 2019-07-26 21:37
Testing an expression for primality 1260 Software 17 2015-08-28 01:35
Testing Mersenne cofactors for primality? CRGreathouse Computer Science & Computational Number Theory 18 2013-06-08 19:12
Primality-testing program with multiple types of moduli (PFGW-related) Unregistered Information & Answers 4 2006-10-04 22:38

All times are UTC. The time now is 05:18.

Thu Aug 13 05:18:43 UTC 2020 up 1:54, 0 users, load averages: 1.18, 1.22, 1.25

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.