mersenneforum.org

2020-09-26, 04:10   #1   preda ("Mihai Preda", Apr 2015)

GpuOwl 7.x

A heads-up about upcoming changes; each will be discussed in more detail. (GpuOwl 7.x is not yet available.)

- removed LL
- removed standalone P-1
- merged PRP and P-1
- reworked P-1 second-stage
- changed savefile version and naming scheme
- changed savefile cleanup

Planned:
- rework the proof generation
- migrate to proof v2
- rework PRP savefiles towards merging with the proof checkpoints
- some SP experiments

A git branch will be created for the current 6.x for people who feel strongly about the "removed" elements.

2020-09-26, 04:41   #2   preda

Merged PRP and P-1 first-stage

I'll explain briefly why this is a good thing compared to the "classic" P-1 first-stage.

1. Classic P-1 first-stage

P-1 first stage is conceptually simple:
a) given the bound B1, compute powerSmooth(B1), which is roughly the product of all primes under B1, with the small primes taken to a higher power than 1 (each prime p <= B1 contributes its largest power p^k <= B1)
b) compute X = 3^powerSmooth(B1) (mod Mp)
c) compute gcd(X - 1, Mp), hoping to come up with a factor
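
A minimal sketch of steps a)-c) in Python (not GpuOwl code; powerSmooth is taken here as lcm(1..B1), and sympy's primerange is used only as a convenient prime sieve):

Code:
from math import gcd
from sympy import primerange  # any prime sieve works here

def power_smooth(b1):
    # lcm(1..b1): every prime p <= b1 raised to its largest power p^k <= b1
    e = 1
    for p in primerange(2, b1 + 1):
        pk = p
        while pk * p <= b1:
            pk *= p
        e *= pk
    return e

def p1_first_stage(mp, b1):
    x = pow(3, power_smooth(b1), mp)  # step b): 3^powerSmooth(B1) mod Mp
    return gcd(x - 1, mp)             # step c): hope for a nontrivial factor

# Toy run: 223 divides M37 = 2^37 - 1, and 223 - 1 = 2 * 3 * 37 is
# 40-smooth, so stage 1 with B1 = 40 should report the factor 223.
print(p1_first_stage(2**37 - 1, 40))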

The costly operation is computing 3^powerSmooth(B1). In the classical setup, its cost is log2(powerSmooth(B1)) "squarings". (i.e. the number of bits in powerSmooth(B1)).

log2(powerSmooth(B1)) ~= 1.442 * B1
(BTW, 1.442 ~= 1/ln(2); the estimate follows from the prime number theorem, which gives ln(powerSmooth(B1)) ~= B1)

For example, doing first-stage with B1=1M costs about 1.4M "PRP iterations".
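
A quick numerical check of that estimate, reusing power_smooth() from the sketch above (B1 = 10^5 keeps the runtime short):

Code:
b1 = 10**5
bits = power_smooth(b1).bit_length()  # ~ log2(powerSmooth(B1))
print(bits / b1)                      # ~1.44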

2. Merged with PRP

During a normal PRP test we compute 3^(2^k), iterating over k. We can reuse these values (the normal PRP iterations) for a cheaper P-1 first stage -- Robert Gerbicz showed a nice way to aggregate them into the computation of 3^powerSmooth(B1). The new algorithm uses a number N of "cache" buffers (thus, a lot of memory). The cost of the P-1 first stage (in iterations) *on top of the PRP* becomes:

B1 * 1.442 / (log2(N) + 2)

where N is the number of buffers. Compared to the "classic" P-1 FS, this is a cost reduction by a factor of log2(N) + 2. For example, with about 256 buffers, that's a 10x reduction in cost.

The catch is that this tiny cost is "on top of" the PRP cost.

Let's make clear what that means:
if somebody wants to run the PRP anyway, then the additional cost of P-1 FS is really tiny and that's great. If somebody does not care for the PRP and wants to do *only* the P-1 FS, then there is no cost reduction at all compared to the "classic" P-1.

classicCost = B1 * 1.442
reduction factor F = 1 / (log2(N) + 2), where N is the number of buffers
cost excluding the PRP iterations = classicCost * F
cost including the PRP iterations = classicCost * (1 + F)

When F=1/10 as in the example (N=256), the cost "on top" of PRP is only 0.1 * classic, while the cost including the PRP component is 1.1 * classic (a slight *increase*).
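
The same arithmetic as a small sketch (the function and names are mine for illustration):

Code:
from math import log2

def p1_fs_costs(b1, n_buffers):
    classic = 1.442 * b1           # classic stand-alone first stage, in iterations
    f = 1 / (log2(n_buffers) + 2)  # reduction factor F from the N cache buffers
    return classic, classic * f, classic * (1 + f)

# B1=1M, N=256: ~1.44M iterations classic, ~0.14M on top of the PRP,
# ~1.59M with the PRP component included.
print(p1_fs_costs(10**6, 256))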

Aside from these cost considerations, one little advantage of the new algorithm is that it is partially protected by the GEC error check ("partially" means that only the PRP iterations themselves are protected, but this is still an improvement compared to the "classic" P-1 FS, which is not protected at all).

2020-09-26, 05:14   #3   preda

Merged PRP and P-1 FS

Another way to look at running P-1 *only* (without PRP) in the new setup:

Let's say we run to a high bound B1=5M; then log2(powerSmooth(B1)) ~= 7M.

So running that much P-1 FS also involves running the first 7M iterations of the normal PRP test. If, after completing the P-1 FS, the tester decides *not* to continue the PRP, the already-done 7M PRP iterations are lost.

That's why running "standalone" P-1 in the new setup is inefficient, and that's why I'm not planning to offer it -- even though I imagine some users will be disappointed that the beloved "standalone P-1" is gone. But hopefully higher bounds and more factors will come out of it :)

2020-09-26, 06:38   #4   retina ("The unspeakable one")

Quote:
Originally Posted by preda
- removed LL
Won't we need this to verify primes?

2020-09-26, 09:40   #5   preda

Quote:
Originally Posted by retina
Won't we need this to verify primes?
It's enough to check out the "v6" branch https://github.com/preda/gpuowl/tree/v6, build (scons or make), and we're ready to go for an LL prime verification.

PS: what's missing is the prime to verify...

2020-09-26, 12:06   #6   firejuggler

Does this mean that, say, one could P-1 a 17M exponent to a B1 of about 11M in the same time it takes to run the PRP? What about the buffer size?

2020-09-26, 12:29   #7   preda

Quote:
Originally Posted by firejuggler
Does this mean that, say, one could P-1 a 17M exponent to a B1 of about 11M in the same time it takes to run the PRP? What about the buffer size?
I would not P-1 a 17M exponent at this point, because the wavefront (i.e. exponents that haven't had either P-1 or PRP done already) is at 100M.

For a 100M exponent the workflow is:
- choose a rather large B1. Let's say B1=8M.
- choose a B2. Let's say B2=50M. Note that the ratio B2/B1 is now lower, under 10, because B1 is so large.
- start the PRP.

For the first 8M * 1.44 iterations, let's say 12M iterations, the PRP runs in parallel with the P-1 FS. For this it uses as much memory as allowed by -maxAlloc. For example, for 256 buffers of 22MB each, about 6GB is needed. (The number of buffers does not need to jump in powers of two; any number will do.)

The overhead of the P-1 FS is about 1/10 of the PRP iterations it runs in parallel with. So, over 12M iterations, the cost of P-1 FS would be about 1.2M iterations.

Once the PRP reaches the 12M mark, it runs the first-stage GCD. It also pauses the PRP and starts the P-1 second stage. The second stage also runs one or more GCDs, and if no factor is found in any of those GCDs, the PRP continues from iteration 12M onward.

Let's consider a GPU with only 4GB of RAM, where let's say 3GB is allocated for P-1 FS buffers. At 22MB/buffer that accommodates about 128 buffers, so the overhead ratio in this case becomes 1/9 (i.e. 1/(2 + log2(128))) instead of 1/10, because of the lower number of buffers.
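
That back-of-the-envelope arithmetic as a sketch (the 22MB/buffer figure for a ~100M exponent is taken from this post; the function name is mine):

Code:
from math import log2

def p1_fs_overhead(alloc_gb, buf_mb=22.0):
    # buffers that fit in the allowed memory, and the resulting P-1
    # first-stage overhead as a fraction of the parallel PRP iterations
    n = int(alloc_gb * 1024 / buf_mb)
    return n, 1 / (log2(n) + 2)

print(p1_fs_overhead(3.0))  # ~139 buffers -> overhead ~1/9, as above
print(p1_fs_overhead(6.0))  # ~279 buffers -> overhead ~1/10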

2020-09-26, 16:11   #8   firejuggler

My intent was to run a P-1 with a B1 as close as possible to the exponent tested, and I wondered whether the buffer size for the first stage was somehow manageable.
Let's say I want to do a P-1 of a 100M exponent with a B1 of 50M. What would be the size of each buffer?

From what I understand, the PRP will go up to ~80M iterations, finish the first stage of P-1, then do the second stage of P-1, then do the last ~20M iterations of the PRP (or stop if a factor is found).

2020-09-26, 19:41   #9   firejuggler

I mean, since stage 1 is free given enough memory. The problem is how the memory needs scale.

2020-09-26, 20:41   #10   preda

Quote:
Originally Posted by firejuggler
My intent was to run a P-1 with a B1 as close as possible to the exponent tested, and I wondered whether the buffer size for the first stage was somehow manageable.
Let's say I want to do a P-1 of a 100M exponent with a B1 of 50M. What would be the size of each buffer?

From what I understand, the PRP will go up to ~80M iterations, finish the first stage of P-1, then do the second stage of P-1, then do the last ~20M iterations of the PRP (or stop if a factor is found).
Stage 1, even if cheaper, is not free. If you run it over the full PRP length (which, for a 100M exponent, would mean running to B1=70M), you still pay 10% of the full PRP just for the P-1, which is not nothing. What's more, the PRP savings when a factor is found in P-1 become smaller as you advance towards the end of the PRP.

The memory requirements do not grow with the length of the P-1. They do grow, linearly, with the size of the exponent (because the size of a "buffer" is proportional to the exponent).
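
A sketch of that scaling, anchored to the ~22MB/buffer at 100M mentioned earlier in the thread (the linear model and function name are mine):

Code:
def buffer_mb(exponent):
    # per-buffer size, scaled linearly from ~22MB at a 100M exponent
    return 22.0 * exponent / 100e6

for p in (17e6, 100e6, 332e6):
    mb = buffer_mb(p)
    print(f"p={p/1e6:.0f}M: {mb:.1f} MB/buffer, {256*mb/1024:.1f} GB for 256 buffers")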

P-1 second stage is still more efficient (faster) than first stage, thus it's still worth doing the second stage at the end. If you want to cover the range up to 100M, I'd split it into B1,B2 = (12M,100M) or (15M,100M). If you want to do B1=100M, you should follow up with B2=400M or such. Use the P-1 probability calculator to check whether such very high bounds are worth the effort.

2020-09-28, 00:30   #11   Aramis Wyler ("Bill Staffen")

Thank you for your work on this new version, Preda - version 6 is running great and I don't think anyone would blame you for letting it rest a bit.


The concept of combining P-1 FS with the PRP is very interesting to me, but your response to Firejuggler makes me think that maybe I understand the costs of FS vs SS P-1 even less than I had realized.


I understand that you said that P-1 SS is still more efficient/faster than P-1 FS, but is it more efficient than this new cost, with F=1/10, or even less with, say, the 16GB video cards on Colab? With an F=1/12 (or whatever that would be with ~15GiB worth of buffers), would it make sense to run what we'd otherwise consider rather bizarre bounds, like B1=20000000 and B2=40000000? Maybe not even bother with the second stage?
