mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl

Reply
 
Thread Tools
Old 2017-04-19, 12:41   #1
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

24·83 Posts
Post gpuOwL: an OpenCL program for Mersenne primality testing

Dear Mersenners,
it is my pleasure to [pre]announce a new GPU LL checker, gpuOwL. It is implemented in OpenCL, targeting mainly AMD GPUs (this is the hardware that I use / test on).

The name gpuOwL brings together the "O" from OpenCL, the L from Lucas-Lehmer, and the obvious "gpu", evoking a long-thinking being that stays active at night (that would be my desktop :)

Now on to the technical details.

I implemented this from scratch, based mainly on the excellent article "Discrete Weighted Transforms and Large Integer Arithmetic" by Crandall & Fagin, http://www.faginfamily.net/barry/Pap...Transforms.pdf .

The main motivation while writing this program was exploring how fast such a computation (LL) can be done on AMD GPUs -- I wanted to get a "fastest" implementation. The secondary goal was getting an elegant, beautiful implementation --
simple/straightforward and small code (the code is indeed quite small, right now totaling ~1K LOC).

The code is also self-contained, without dependencies on external libraries (such as an FFT library). The only requirements are a C++ compiler and an OpenCL implementation. (I developed and use the program on Linux, with AMDGPU-Pro for OpenCL).

The source-code is available on github: https://github.com/preda/gpuowl

The program is still under development. Probably it still has some bugs, especially in the recent code, that are being worked out. OTOH the basic FFT and LL logic seems to be sound.

Usage:
- compile with either make (Makefile) or with the "build.sh" script, or even straight command-line:

c++ -O2 gpuowl.cpp -ogpuowl -lOpenCL -L/path/to/lib/opencl
e.g. "-L/opt/amdgpu-pro/lib/x86_64-linux-gnu"

- get LL assignments for exponents around 67-77M and put them into worktodo.txt
- the program prints progress report (every 20k iterations), and after a longer time like
a couple of days outputs to results.txt
- it automatically saves progress to a save-exponent.bin file, and can be stopped at any time, will resume on restart.

The pluses:
+ it is fast. For exponents in the vicinity of 77M, 2.1 ms/iteration on FuryX, 2.4 ms/iter on R9-390x
+ the code is supposedly easy to understand (or will be so when I finish an accompanying write-up explaining the tricky bits and the maths). In any case, it is quite small.
+ it works on AMD GPUs :)
+ prints residue and error info on every log step.

The minuses:
- gpuOwL in the present incarnation supports a single POT FFT size, 4M. This makes it able to handle exponents "up to 78M", and probably optimally above 70M. That means, for smaller exponents a smaller FFT would be faster; while for larger exponents the floating-point errors are getting too large and require a larger FFT size.
- does not use shiftcount (may be added in the future)
- may have rough edges (is still in development)

The tricks:
"transposed convolution":
getting "under the hood" of the FFT allowed some performance gains. A traditional convolution based on the "matrix FFT algorithm" (as in AMD's clFFT) proceeds with these steps:
1. transpose
2. sub-FFT
3. transpose
4. sub-FFT
5. transpose
6. Squaring
7. transpose
8. sub-FFT
9. transpose
10. sub-FFT
11. transpose

(1-5 is forward-FFT, 7-11 is inverse-FFT)

OTOH in gpuOwL I store the data in a transposed representation, which gets rid of step 1 (and 11) above. The squaring can be done just as well over the transposed form, thus steps 5 and 7 (the transpose before and after the squaring) aren't needed. Thus the steps remaining in gpuOwL are:
1. sub-FFT
2. transpose
3. sub-FFT
4. Squaring
5. sub-FFT
6. transpose
7. sub-FFT

Another trick (which AFAIK is used by CUDALucas and clLucas as well) is mapping a real-FFT into a complex-FFT of half size, which halves the FFT size.

Another small trick is using a probabilistic argument for limiting the carry propagation, which simplifies a bit the carry code.

Another small trick is computing the various pre-computed constants (such as the FFT coefficients and the A, A^-1 vectors for the DWT) in quad-precision ("long double") before conversion to double precision, which pushes the error bound a tiny bit further.

I'll be looking for problems to fix. I hope you'll enjoy!
cheers,
Mihai

Last fiddled with by xilman on 2020-04-16 at 18:39
preda is offline   Reply With Quote
Old 2017-04-19, 14:18   #2
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

893210 Posts
Default

Brava băiatu', felicitări și la mai multe de astea.

Întrebări:

1. Speed comparison with clLucas?
2. Any windows build?

Mersic.
LaurV is offline   Reply With Quote
Old 2017-04-19, 20:03   #3
VictordeHolland
 
VictordeHolland's Avatar
 
"Victor de Hollander"
Aug 2011
the Netherlands

100100101102 Posts
Default

First of all nice job!

Quote:
Originally Posted by LaurV View Post
2. Any windows build?
Code:
Victor@PCVICTOR MINGW64 /home/gpuowl-master
$ make
g++ -O2 -std=c++11 gpuowl.cpp -ogpuowl -L/home/OpenCL -lOpenCL
In file included from gpuowl.cpp:4:0:
clwrap.h: In member function 'void Program::compile(Context&, const char*, const char*)':
clwrap.h:56:22: error: too many arguments to function 'int mkdir(const char*)'
     mkdir("isa", 0777);
                      ^
In file included from C:/msys64/mingw64/x86_64-w64-mingw32/include/sys/stat.h:14:0,
                 from clwrap.h:7,
                 from gpuowl.cpp:4:
C:/msys64/mingw64/x86_64-w64-mingw32/include/io.h:271:15: note: declared here
   int __cdecl mkdir (const char *) __MINGW_ATTRIB_DEPRECATED_MSVC2005;
               ^
gpuowl.cpp: In function 'double* genBigTrig(int, int)':
gpuowl.cpp:51:17: error: 'M_PIl' was not declared in this scope
   auto base = - M_PIl / (W * H / 2);
                 ^
gpuowl.cpp: In function 'double* genSin(int, int)':
gpuowl.cpp:70:17: error: 'M_PIl' was not declared in this scope
   auto base = - M_PIl / (W * H);
                 ^
gpuowl.cpp: In function 'double* smallTrigBlock(int, int, double*)':
gpuowl.cpp:86:22: error: 'M_PIl' was not declared in this scope
       auto angle = - M_PIl * line * col / (W * H / 2);
                      ^
gpuowl.cpp: In function 'bool checkPrime(int, bool*, u64*)':
gpuowl.cpp:369:9: error: 'uint' does not name a type
   const uint zero = 0;
         ^
gpuowl.cpp:370:75: error: 'zero' was not declared in this scope
   Buf bufErr  (c, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(int), &zero);
                                                                           ^
gpuowl.cpp:402:5: error: 'uint' was not declared in this scope
     uint rawErr = 0;
     ^
gpuowl.cpp:403:43: error: 'rawErr' was not declared in this scope
     q.read( false, bufErr, sizeof(uint), &rawErr);
                                           ^
gpuowl.cpp: In function 'int main(int, char**)':
gpuowl.cpp:422:30: error: 'setlinebuf' was not declared in this scope
   if (logf) { setlinebuf(logf); }
                              ^
Makefile:2: recept voor doel 'gpuowl' is mislukt
make: *** [gpuowl] Fout 1
MINGW64 says no, but that is probably just from my incompetence
C:\msys64\home\OpenCL has a copy of the files from C:\Program Files (x86)\AMD APP SDK\2.9\lib\x86_64
which has the libOpenCL.a file

the errors are just abracadabra to me .
VictordeHolland is offline   Reply With Quote
Old 2017-04-19, 20:13   #4
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

24×83 Posts
Default

Salut :)

Sorry, I don't have a windows build myself -- no access to windows :). But maybe somebody else could prepare one.

For similar reasons I don't have a straight speed comparison with clLucas -- I couldn't build it yet -- it depends on SDKApplication.hpp which are part of AMDAPP SDK (that I'm not using), and also openclsdkdefs.mk and openclsdkrules.mk. I do think these build problems can be overcome, but I didn't put the effort in yet.

OTOH as both programs print time-per-iteration, such comparison should be easy to do by anybody who gets to build both programs.

I would bet mine is faster :), but by how much..?
preda is offline   Reply With Quote
Old 2017-04-19, 20:42   #5
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

24×83 Posts
Default

I just fixed some of those errors for mingw. Would you try again?
preda is offline   Reply With Quote
Old 2017-04-19, 20:52   #6
msft
 
msft's Avatar
 
Jul 2009
Tokyo

2×5×61 Posts
Default

Great!!!
I hope release Open Source License.
msft is offline   Reply With Quote
Old 2017-04-19, 21:10   #7
Mark Rose
 
Mark Rose's Avatar
 
"/X\(‘-‘)/X\"
Jan 2013

2·31·47 Posts
Default

AirSquirrels had run some cLucas benchmarks.

A Fury X oc'ed to 1100,500:

M75002911 gave 3.9026 ms/iter


A Fury X with no oc:

M75002911 gave 4.0754 ms/iter


And a R9 390X:

M75002911 gave 4.3873 ms/iter

Last fiddled with by Mark Rose on 2017-04-19 at 21:15
Mark Rose is online now   Reply With Quote
Old 2017-04-19, 21:24   #8
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

101001100002 Posts
Default

Yep, licensed under GPL v3 (added License on github).
preda is offline   Reply With Quote
Old 2017-04-19, 21:27   #9
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

24·83 Posts
Default

Nice. Let's say 1.8 times faster than clLucas then :)
preda is offline   Reply With Quote
Old 2017-04-19, 22:09   #10
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

10058 Posts
Default

I will pull this down and run some benchmark comparisons on the same system.
airsquirrels is offline   Reply With Quote
Old 2017-04-19, 22:56   #11
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

10058 Posts
Default

Pretty fast for hand rolled code (clFFT has had a lot of resources put into it by AMD), but I'm definitely not seeing the performance levels indicated above. Still slower than clLucas 1.04. Anything I should check? This is a FuryX

Good news is, the residues match for the first chunk of iterations.

Code:
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
LL of 75002911 at iteration 0
FFT 1024*2048 (4M words, 17.88 bits per word)
OpenCL compile: 1106 ms
setup: 1638 ms
00020000 / 75002911 [0.03%], ms/iter: 5.413, ETA: 4d 16:45; 777b6635d6b78b75 error 0.125 (max 0.125)
00040000 / 75002911 [0.05%], ms/iter: 5.422, ETA: 4d 16:54; b9fc5678347cad9f error 0.140625 (max 0.140625)
00060000 / 75002911 [0.08%], ms/iter: 5.413, ETA: 4d 16:40; e7fab5c1f11d0f39 error 0.125 (max 0.140625)
00080000 / 75002911 [0.11%], ms/iter: 5.423, ETA: 4d 16:52; 76a6fb920dd95b71 error 0.140625 (max 0.140625)
Code:
Continuing work from a partial result of M75002911 fft length = 4096K iteration = 1001
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1875 (0:36 real, 3.5486 ms/iter, ETA 73:55:06)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9315 ms/iter, ETA 81:53:04)
Iteration 30000 M( 75002911 )C, 0x0f0c343e5174fa89, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9205 ms/iter, ETA 81:38:41)
Iteration 40000 M( 75002911 )C, 0xb9fc5678347cad9f, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9280 ms/iter, ETA 81:47:22)
Iteration 50000 M( 75002911 )C, 0x992a088a20504a90, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9407 ms/iter, ETA 82:02:32)
Iteration 60000 M( 75002911 )C, 0xe7fab5c1f11d0f39, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9230 ms/iter, ETA 81:39:51)
Iteration 70000 M( 75002911 )C, 0x89386b82336fc06d, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9320 ms/iter, ETA 81:50:26)
Iteration 80000 M( 75002911 )C, 0x76a6fb920dd95b71, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9236 ms/iter, ETA 81:39:18)
airsquirrels is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
mfakto: an OpenCL program for Mersenne prefactoring Bdot GPU Computing 1657 2020-10-27 01:23
GPUOWL AMD Windows OpenCL issues xx005fs GpuOwl 0 2019-07-26 21:37
Testing an expression for primality 1260 Software 17 2015-08-28 01:35
Testing Mersenne cofactors for primality? CRGreathouse Computer Science & Computational Number Theory 18 2013-06-08 19:12
Primality-testing program with multiple types of moduli (PFGW-related) Unregistered Information & Answers 4 2006-10-04 22:38

All times are UTC. The time now is 23:52.

Sun Nov 29 23:52:29 UTC 2020 up 80 days, 21:03, 3 users, load averages: 1.30, 1.31, 1.34

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.