mersenneforum.org

gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

VictordeHolland 2017-04-23 16:07

[QUOTE=kladner;457310]If I may say, as a spectator, and a non coder, it amazes me to watch this birth process. The cooperation and involvement by several parties is impressive. Seeing this play out is one of the big pay-offs for hanging out on this forum.[/QUOTE]
Preda deserves all credit for the coding. I'm just trying to compile it for Win64 and reporting the errors I get.

kladner 2017-04-23 17:49

[QUOTE=VictordeHolland;457329]Preda deserves all credit for the coding. I'm just trying to compile it for Win64 and reporting the errors I get.[/QUOTE]
The coding is the "prime" accomplishment <G>, but others taking an interest is like peer review. Both are needed, at least in most cases.

VictordeHolland 2017-04-23 19:53

Success here too, compiling it with MINGW64 for Windows. gpuOwL is [U]faster[/U] and has [U]slightly lower error rates[/U] on my AMD HD7950 in the limited testing so far.

[B]gpuOwL v0.1[/B]
[quote]
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2079.5)
LL FFT 4096K (1024*2048*2) of 76000021 (18.12 bits/word) at iteration 0
OpenCL setup: 656 ms
00020000 / 76000021 [0.03%], ms/iter: 5.618, ETA: 4d 22:35; 000000009c13c393 error 0.185102 (max 0.185102)
00040000 / 76000021 [0.05%], ms/iter: 5.621, ETA: 4d 22:37; 00000000f08b90ef error 0.178388 (max 0.185102)
00060000 / 76000021 [0.08%], ms/iter: 5.627, ETA: 4d 22:42; 00000000d68504f2 error 0.186264 (max 0.186264)
00080000 / 76000021 [0.11%], ms/iter: 5.621, ETA: 4d 22:32; 00000000e93a873a error 0.191096 (max 0.191096)
00100000 / 76000021 [0.13%], ms/iter: 5.609, ETA: 4d 22:15; 0000000035a87d3f error 0.198382 (max 0.198382)
[/quote]
[B]clLucas v1.04[/B]
[quote]C:\clLucas_x64_1.04>clLucas_x64

Platform 0 : Advanced Micro Devices, Inc.
Warning: Couldn't parse ini file option SixStepFFT; using default: off
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

CL_DEVICE_NAME Tahiti
CL_DEVICE_VENDOR Advanced Micro Devices, Inc.
CL_DEVICE_VERSION OpenCL 1.2 AMD-APP (2079.5)
CL_DRIVER_VERSION 2079.5 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS 28
CL_DEVICE_MAX_CLOCK_FREQUENCY 900
CL_DEVICE_GLOBAL_MEM_SIZE 3221225472
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1

<FFT tests>

Starting M76000021 fft length = 4096K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.13540, max error = 0.18750
Iteration 200, average error = 0.16145, max error = 0.18750
Iteration 300, average error = 0.17013, max error = 0.18750
Iteration 400, average error = 0.17448, max error = 0.18750
Iteration 500, average error = 0.17724, max error = 0.20313
Iteration 600, average error = 0.18155, max error = 0.20313
Iteration 700, average error = 0.18463, max error = 0.20313
Iteration 800, average error = 0.18876, max error = 0.21875
Iteration 900, average error = 0.19209, max error = 0.21875
Iteration 1000, average error = 0.19474 < 0.25 (max error = 0.21875), continuing test.
Iteration 20000 M( 76000021 )C, 0x27d2e8539c13c393, n = 4096K, clLucas v1.04 err = 0.2188 (2:01 real, 6.0576 ms/iter, ETA 127:50:52)
Iteration 40000 M( 76000021 )C, 0xbdeced3ff08b90ef, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0910 ms/iter, ETA 128:31:14)
Iteration 60000 M( 76000021 )C, 0x7c3f23c5d68504f2, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0879 ms/iter, ETA 128:25:15)
Iteration 80000 M( 76000021 )C, 0x02b75b4ce93a873a, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0903 ms/iter, ETA 128:26:11)
Iteration 100000 M( 76000021 )C, 0xa04bf1ff35a87d3f, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0895 ms/iter, ETA 128:23:10)
[/quote]
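For a quick cross-check of the two logs above, here is a minimal Python sketch. All numbers are copied from the logs; the helper names are made up for illustration. It checks whether the residues agree in their low 32 bits (gpuOwL v0.1 prints the upper half as zeros), estimates the relative speed from the logged ms/iter, and recomputes the bits-per-word figure from the exponent and the 4096K FFT size.

```python
# Values copied from the gpuOwL v0.1 and clLucas v1.04 logs above
# (M76000021, 4096K FFT, HD7950). Helper names are illustrative only.
GPUOWL = {  # iteration -> (residue as printed, ms/iter)
    20000: ("000000009c13c393", 5.618),
    40000: ("00000000f08b90ef", 5.621),
    60000: ("00000000d68504f2", 5.627),
    80000: ("00000000e93a873a", 5.621),
    100000: ("0000000035a87d3f", 5.609),
}
CLLUCAS = {
    20000: ("27d2e8539c13c393", 6.0576),
    40000: ("bdeced3ff08b90ef", 6.0910),
    60000: ("7c3f23c5d68504f2", 6.0879),
    80000: ("02b75b4ce93a873a", 6.0903),
    100000: ("a04bf1ff35a87d3f", 6.0895),
}


def low32(hex_residue):
    """Low 32 bits of a residue given as a hex string."""
    return int(hex_residue, 16) & 0xFFFFFFFF


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Compare the residues at every iteration that both programs logged.
residues_agree = all(
    low32(GPUOWL[i][0]) == low32(CLLUCAS[i][0]) for i in GPUOWL
)
# Ratio above 1 means gpuOwL needs less time per iteration.
speedup = mean(t for _, t in CLLUCAS.values()) / mean(t for _, t in GPUOWL.values())
# Exponent bits spread over 4096K FFT words.
bits_per_word = 76000021 / (4096 * 1024)

print("low 32 bits agree at every logged iteration:", residues_agree)
print("clLucas ms/iter divided by gpuOwL ms/iter: %.3f" % speedup)
print("bits per word: %.2f" % bits_per_word)
```

On these five samples gpuOwL comes out roughly 8% faster per iteration, and the low 32 bits of the residues agree at every logged iteration. Five samples on one card is of course a very small benchmark.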

VictordeHolland 2017-04-23 19:58

'Guide' for compiling it on Windows with msys64 + MINGW64
1. Prerequisites: msys64 with MINGW64 installed, plus a text editor (I prefer notepad++).
2. Download and install the AMD APP SDK 3.0 from [URL]http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/[/URL]
3. Download the gpuowl code from [URL]https://github.com/preda/gpuowl[/URL] (easiest is to download the entire repository as a .zip).
4. Extract the .zip (preferably to somewhere you can navigate to easily from msys64, for instance msys64\home\gpuowl).
5. Open the Makefile (in the gpuowl folder) with the text editor (notepad++).
6. Edit the path after the "-L" argument so that it points to the OpenCL SDK install path.
The standard path is C:\Program Files (x86)\AMD APP SDK\3.0\lib\x86_64
I copied the contents of that folder to msys64\home\OpenCL\SDK3 for easier referencing, since I always get confused when referencing paths with spaces and brackets.
In my case the makefile contains:
[code]
gpuowl: gpuowl.cpp clwrap.h tinycl.h
	g++ -O2 -std=c++14 gpuowl.cpp -ogpuowl -L\C\msys64\home\OpenCL\SDK3 -lOpenCL
[/code]
Don't forget to save the changes ;).

6.1 [OPTIONAL] If you don't have an OpenCL 2.0 device, edit the clwrap.h file and on line 89 change "-cl-std=CL2.0" to "-cl-std=CL1.2".

7. Start the msys64/MINGW64 shell and navigate to the gpuowl folder. If you put the folder in the /home directory of msys64, you can get there by typing "cd .." to reach the /home directory and then "cd yourgpuowlfoldername".
8. Run 'make'.

It should look something like this so far:
[code]MINGW64 ~
$ cd ..

MINGW64 /home
$ cd gpuOwlv0.1/

MINGW64 /home/gpuOwlv0.1
$ make
g++ -O2 -std=c++14 gpuowl.cpp -ogpuowl -L\C\msys64\home\OpenCL\SDK3 -lOpenCL

MINGW64 /home/gpuOwlv0.1
$
[/code]
9. Assuming you didn't get any errors, you should now have a gpuowl.exe in the gpuowl directory.
9.1 [OPTIONAL] If you wish to move the gpuowl directory somewhere else, you probably need to copy these 3 .dlls into the directory:
libgcc_s_seh-1.dll
libstdc++-6.dll
libwinpthread-1.dll
10. Create a worktodo.txt and start testing.

[B]Please note gpuOwl is [U]not[/U] production ready.[/B]

LaurV 2017-04-25 04:41

Can someone post or PM me a windoze exe? [edit: x64]. We still have trouble compiling it (the same troubles as above - but the troubles are most probably related to our ignorance with the tools).

We are going to give it a test too - we own an old "XFX HD7970 GHz" here.

P.S. we fully understand that it is not "production ready" yet, but if it is faster than clLucas [B][U]and[/U][/B] it is giving out the same DC residue, for sure we will report it to PrimeNet and get some fast credits! :razz:

VictordeHolland 2017-04-25 11:42

[QUOTE=LaurV;457451]Can someone post or PM me a windoze exe? [edit: x64]. We still have trouble compiling it (the same troubles as above - but the troubles are most probably related to our ignorance with the tools).

We are going to give it a test too - we own an old "XFX HD7970 GHz" here.

P.S. we fully understand that it is not "production ready" yet, but if it is faster than clLucas [B][U]and[/U][/B] it is giving out the same DC residue, for sure we will report it to PrimeNet and get some fast credits! :razz:[/QUOTE]
I've got a HD7950, so I'm hoping my executable should work for you too.
I'll send it when I get home tonight.

LaurV 2017-04-25 15:32

[QUOTE=VictordeHolland;457343] (I prefer notepad++)
5. <snip> texteditor (notepad++)
[/QUOTE]

The [URL="https://notepad-plus-plus.org/news/notepad-7.3.3-fix-cia-hacking-issue.html"]CIA hacked[/URL] stuff, huh? :razz:

(I know why I always used pn2... haha)

VictordeHolland 2017-04-25 16:48

[QUOTE=LaurV;457485]The [URL="https://notepad-plus-plus.org/news/notepad-7.3.3-fix-cia-hacking-issue.html"]CIA hacked[/URL] stuff, huh? :razz:

(I know why I always used pn2... haha)[/QUOTE]
Windows itself probably has multiple/dozens? of unpatched zero-days only known to the three letter agencies. So I wouldn't lose any sleep over it :smile:.

preda 2017-04-26 01:51

Thanks for the MinGW compilation, and the screenshots! The screenshots show there's an error in printing the residue (leading digits being 0) -- that's hopefully fixed now (not a big deal, the problem was just 'cosmetic').

A small update on where gpuOwL is, and what I'm planning on next.
I was really worried by the results of some of my own testing -- the LL was failing on known primes (24036583). Thus I decided to do some more serious testing to find the cause of the error.

After all this investigation, my conclusion is that it's not a software bug, but the GPU very rarely producing an erroneous result. This is disconcerting, and I'd really like to have a way to detect such problems.

The LL involves two distinct computations. One is FFT-Square-IFFT, the second is "round-to-int + Carry-propagation". An error can occur in either of these, and these are the detection mechanisms that I know of:

- evaluating the "max rounding error" that occurs when rounding-to-int, after the IFFT and before the carry-propagation. This is cheap to compute on the GPU, thus is always on (and printed on every logstep). This rounding error brings two pieces of information: 1. whether the FFT size is big enough for the chosen exponent, and 2. whether something went completely wrong with the FFT/IFFT. I plan to add provisions in the code for detecting a sudden jump in the rounding error (which may indicate FFT error), and re-run the last batch in that situation to check for consistent results.

- evaluating the SUMINP / SUMOUT of the FFT (which is done by the CPU prime95). This is not implemented, because it seems (to me) expensive to do on the GPU. This check would provide very good detection for FFT/IFFT errors, but no protection against rounding and carry-propagation errors.

- using "offset". This changes the values fed to the FFT/IFFT, and again protects against FFT errors (but without detecting them "in real time").

As is, there is no check that I know of that covers the carry-propagation. If an error takes place in that part, it would not be detected by either the max-rounding-error check or the SUMINP/SUMOUT check. I would be interested in finding out about a GPU-cheap way to check that the integer digits of the modulo-convolution done by LL are not completely haywire.
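
To make the "round-to-int + carry-propagation" step and the max-rounding-error check concrete, here is a minimal Python sketch. It is not gpuOwL's code: it assumes fixed-width words instead of the variable word sizes a real Mersenne FFT uses, replaces the FFT with a slow direct cyclic convolution, and works modulo 2^64 - 1 (which is not a Mersenne prime, but the wrap-around carry works the same way). All function names and the 1.5x jump threshold are hypothetical.

```python
def cyclic_square(words):
    """Cyclic self-convolution in floating point.

    A slow stand-in for FFT -> pointwise square -> inverse FFT.
    """
    n = len(words)
    out = [0.0] * n
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            out[(i + j) % n] += float(a) * float(b)
    return out


def round_and_carry(values, bits):
    """Round each word to an integer, then propagate carries.

    Returns (words, max_err), where max_err is the largest distance between
    a float and the integer it was rounded to. The carry out of the top word
    wraps to word 0, because 2**(n*bits) == 1 modulo 2**(n*bits) - 1.
    """
    max_err = 0.0
    words = []
    for v in values:
        r = round(v)
        max_err = max(max_err, abs(v - r))
        words.append(r)
    base = 1 << bits
    carry = 0
    while True:  # go around again until no carry is left over
        for i in range(len(words)):
            total = words[i] + carry
            words[i] = total % base
            carry = total // base
        if carry == 0:
            return words, max_err


def to_words(x, n, bits):
    return [(x >> (bits * i)) & ((1 << bits) - 1) for i in range(n)]


def from_words(words, bits):
    return sum(w << (bits * i) for i, w in enumerate(words))


def is_sudden_jump(previous_max, new_err, factor=1.5):
    """Hypothetical heuristic: flag an error well above anything seen before."""
    return previous_max > 0 and new_err > factor * previous_max


N, BITS = 8, 8
M = (1 << (N * BITS)) - 1
x = 0x123456789ABCDEF0
expected = (x * x) % M
exact = cyclic_square(to_words(x, N, BITS))

# Clean run: the floats are exact integers here, so the error is 0.
words, err = round_and_carry(exact, BITS)
ok_clean = from_words(words, BITS) % M == expected

# Simulated FFT inaccuracy of 0.25 on one word: still rounds correctly.
small = list(exact)
small[3] += 0.25
words_small, err_small = round_and_carry(small, BITS)
ok_small = from_words(words_small, BITS) % M == expected

# Simulated inaccuracy of 0.75: rounds to the wrong integer.
big = list(exact)
big[3] += 0.75
words_big, err_big = round_and_carry(big, BITS)
ok_big = from_words(words_big, BITS) % M == expected

# Apply the jump heuristic to the max errors from the gpuOwL log earlier in the thread.
logged = [0.185102, 0.178388, 0.186264, 0.191096, 0.198382]
running_max, jumps = 0.0, []
for e in logged:
    jumps.append(is_sudden_jump(running_max, e))
    running_max = max(running_max, e)

print(ok_clean, err, ok_small, err_small, ok_big, err_big, jumps)
```

Note that the 0.25 and 0.75 perturbations both report a rounding error of 0.25, but only the first one still gives the right product. That is why the observed error has to stay well below 0.5 for the check to mean anything. As described above, nothing in this check would notice a mistake made inside the carry loop itself.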

Development plan:
- implement "offset", and measure performance impact. If impact is small, leave it always-on.
- check for sudden jumps in rounding-error, and automatically re-try in that situation. May help detect too-overclocked GPUs (but not always).
- add a simple self-test that runs on known primes and compares residues with a pre-saved residue list, to detect obvious errors. (To detect more subtle errors, a good but expensive way is to run to completion on known primes and check for a 0 residue, or to double-check validated results.)
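
A minimal Python sketch of the first and third items of the plan, using plain big integers instead of an FFT. It assumes that "offset" means keeping the LL value multiplied by a power of two (a bit shift) that doubles with every squaring; the function names and the tiny exponents are chosen only for illustration, and a real self-test would use much larger exponents and truncated interim residues.

```python
# Small exponents p for which 2**p - 1 is a known Mersenne prime.
KNOWN_PRIME_EXPONENTS = [3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127, 521, 607]
# Pre-saved final residue for a composite Mersenne number (M11 = 2047).
KNOWN_RESIDUES = {11: 1736}
# Further exponents whose Mersenne number is composite.
COMPOSITE_EXPONENTS = [23, 29]


def ll_residue(p, shift=0):
    """Lucas-Lehmer residue of 2**p - 1 for an odd prime p (0 means prime).

    The running value is stored multiplied by 2**shift. Squaring doubles the
    shift, and because 2**p == 1 modulo 2**p - 1 the shift can be reduced
    modulo p. The shift is removed again before returning.
    """
    m = (1 << p) - 1
    s = (4 << shift) % m
    for _ in range(p - 2):
        shift = (2 * shift) % p
        s = (s * s - (2 << shift)) % m
    return (s << ((p - shift) % p)) % m


def self_test():
    """Return True if all known results are reproduced, with and without offset."""
    for p in KNOWN_PRIME_EXPONENTS:
        if ll_residue(p) != 0:
            return False
    for p, expected in KNOWN_RESIDUES.items():
        if ll_residue(p) != expected:
            return False
    for p in COMPOSITE_EXPONENTS:
        if ll_residue(p) == 0:
            return False
    # The offset changes the intermediate values but not the final residue.
    for p in (11, 127, 521):
        for shift in (1, 5, p - 1):
            if ll_residue(p, shift) != ll_residue(p):
                return False
    return True


print("self-test passed:", self_test())
```

A run with an offset feeds different bit patterns through the squaring, so two runs of the same exponent with different offsets that end in the same residue give some extra confidence, although, as noted above, only after the fact.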

Still missing:
- ability to select specific GPU in a multi-GPU system (right now, simply uses the first GPU)
- get some ISA dumps (produced with "-cl -save-temps") and analyze to investigate the low performance reported
- add ability to dump compiled binary (for OpenCLs that do not offer "-save-temps")

science_man_88 2017-04-26 02:06

[QUOTE=preda;457524]<snip>
I was really worried by the results of some of my own testing -- the LL was failing on known primes (24036583).
<snip>[/QUOTE]
I'm not even really that involved in GIMPS, but is it returning 0xFFFFF... (hex for all 1s in binary)? That was my first thought. If so, it may be returning the Mersenne number itself (or a multiple) instead of 0. I believe something similar happened in Prime95 at one point (unless my memory of what I read is foggy).

preda 2017-04-26 03:07

No, it's not all-1s. I ran 24036583 twice, and the second time the result was correct (0). I tracked the difference between the two runs by comparing the residues, and at some point around 13% the residues diverged. This means that in the first run an error occurred at that point. Given that the software is supposed to be deterministic (produce identical bits every time), this could be explained by the hardware behaving funny.
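
The residue comparison described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the small exponent 521 instead of 24036583, simulates the hardware error by flipping one bit at a chosen iteration, and logs the low 64 bits of the residue every 50 iterations. The function names and parameters are hypothetical.

```python
def ll_checkpoints(p, every, flip_at=None):
    """Run the LL iteration for 2**p - 1 and log low-64-bit residues.

    If flip_at is given, the lowest bit is flipped after that iteration
    to simulate a one-off hardware error.
    """
    m = (1 << p) - 1
    s = 4
    log = {}
    for i in range(1, p - 1):
        s = (s * s - 2) % m
        if i == flip_at:
            s ^= 1  # simulated error, not something the real program does
        if i % every == 0:
            log[i] = s & 0xFFFFFFFFFFFFFFFF
    return log


def first_divergence(run_a, run_b):
    """Return (last matching checkpoint, first differing checkpoint) or None."""
    last_good = None
    for it in sorted(run_a.keys() & run_b.keys()):
        if run_a[it] != run_b[it]:
            return last_good, it
        last_good = it
    return None


good = ll_checkpoints(521, every=50)
bad = ll_checkpoints(521, every=50, flip_at=70)

print("good vs good:", first_divergence(good, good))
print("good vs bad: ", first_divergence(good, bad))
```

Because the computation is deterministic, every checkpoint before the error matches and every checkpoint after it differs, so the error is bracketed between the last matching and the first differing checkpoint. Re-running just that stretch from a saved state would show which of the two runs was the faulty one.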

