mersenneforum.org


preda 2017-04-19 12:41

gpuOwL: an OpenCL program for Mersenne primality testing
 
Dear Mersenners,
it is my pleasure to [pre]announce a new GPU LL checker, gpuOwL. It is implemented in OpenCL, targeting mainly AMD GPUs (this is the hardware that I use / test on).

The name gpuOwL brings together the "O" from OpenCL, the L from Lucas-Lehmer, and the obvious "gpu", evoking a long-thinking being that stays active at night (that would be my desktop :)

Now on to the technical details.

I implemented this from scratch, based mainly on the excellent article "Discrete Weighted Transforms and Large Integer Arithmetic" by Crandall & Fagin, [url]http://www.faginfamily.net/barry/Papers/Discrete%20Weighted%20Transforms.pdf[/url] .
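
For readers new to the test, here is a minimal sketch of the Lucas-Lehmer recurrence itself, written with Python's built-in big integers. It is the textbook form and not gpuOwL's code: gpuOwL does the same squaring modulo 2^p - 1 with a floating-point weighted transform on the GPU. The function name `lucas_lehmer` is made up for this illustration.

```python
def lucas_lehmer(p):
    """Return True if the Mersenne number 2**p - 1 is prime (p must be an odd prime)."""
    m = (1 << p) - 1
    s = 4
    # p - 2 rounds of "square, subtract 2, reduce mod 2**p - 1"
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    # Exponents of small Mersenne primes among a few odd primes
    print([p for p in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31) if lucas_lehmer(p)])
```

For exponents in the 70M range each squaring involves numbers of tens of millions of bits, which is why the squaring step is done with an FFT.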

The main motivation while writing this program was exploring how fast such a computation (LL) can be done on AMD GPUs -- I wanted to get a "fastest" implementation. The secondary goal was getting an elegant, beautiful implementation -- simple/straightforward and small code (the code is indeed quite small, right now totaling ~1K LOC).

The code is also self-contained, without dependencies on external libraries (such as an FFT library). The only requirements are a C++ compiler and an OpenCL implementation. (I developed and use the program on Linux, with AMDGPU-Pro for OpenCL).

The source code is available on GitHub: [url]https://github.com/preda/gpuowl[/url]

The program is still under development. Probably it still has some bugs, especially in the recent code, that are being worked out. OTOH the basic FFT and LL logic seems to be sound.

Usage:
- compile with either make (Makefile) or with the "build.sh" script, or even straight command-line:

c++ -O2 gpuowl.cpp -ogpuowl -lOpenCL -L/path/to/lib/opencl
e.g. "-L/opt/amdgpu-pro/lib/x86_64-linux-gnu"

- get LL assignments for exponents around 67-77M and put them into worktodo.txt
- the program prints a progress report every 20k iterations and, after a longer time (a couple of days), writes the outcome to results.txt
- it automatically saves progress to a save-exponent.bin file, so it can be stopped at any time and will resume on restart.

The pluses:
+ it is fast. For exponents in the vicinity of 77M, 2.1 ms/iteration on FuryX, 2.4 ms/iter on R9-390x
+ the code is supposedly easy to understand (or will be so when I finish an accompanying write-up explaining the tricky bits and the maths). In any case, it is quite small.
+ it works on AMD GPUs :)
+ prints residue and error info on every log step.

The minuses:
- gpuOwL in its present incarnation supports a single FFT size, 4M, which is a power of two (POT). This lets it handle exponents "up to 78M", and it is probably optimal only above 70M. For smaller exponents a smaller FFT would be faster, while for larger exponents the floating-point errors become too large and a larger FFT size is required.
- does not use shiftcount (may be added in the future)
- may have rough edges (is still in development)
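
To make the FFT-size limit concrete, here is a small sketch of the "bits per word" figure: the exponent divided by the number of FFT words. It assumes "4M" means 4 * 1024 * 1024 words; the helper name `bits_per_word` is only for illustration. At 78M this comes to roughly 18.6 bits per word, which is about where the rounding errors stop leaving enough safety margin.

```python
def bits_per_word(exponent, fft_words=4 * 1024 * 1024):
    """Average number of bits of the residue stored in each FFT word."""
    return exponent / fft_words


if __name__ == "__main__":
    for exponent in (70000000, 75002911, 76000021, 78000000):
        print(exponent, round(bits_per_word(exponent), 2))
```

The more bits each word has to carry, the closer the floating-point rounding error gets to 0.5, at which point rounding to the nearest integer is no longer reliable.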

The tricks:
"transposed convolution":
getting "under the hood" of the FFT allowed some performance gains. A traditional convolution based on the "matrix FFT algorithm" (as in AMD's clFFT) proceeds with these steps:
1. transpose
2. sub-FFT
3. transpose
4. sub-FFT
5. transpose
6. Squaring
7. transpose
8. sub-FFT
9. transpose
10. sub-FFT
11. transpose

(1-5 is forward-FFT, 7-11 is inverse-FFT)

OTOH in gpuOwL I store the data in a transposed representation, which gets rid of step 1 (and 11) above. The squaring can be done just as well over the transposed form, thus steps 5 and 7 (the transpose before and after the squaring) aren't needed. Thus the steps remaining in gpuOwL are:
1. sub-FFT
2. transpose
3. sub-FFT
4. Squaring
5. sub-FFT
6. transpose
7. sub-FFT
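
A toy sketch of why the transposes around the squaring (steps 5 and 7 of the traditional sequence) can be dropped: squaring is element-wise, so it does not care in which order the frequency bins are stored. The code uses a naive O(n^2) DFT instead of a real FFT and does the transposes explicitly only to show that they cancel; all names here are illustrative and nothing is taken from gpuOwL's kernels.

```python
import cmath


def dft(x, sign):
    """Naive O(n^2) DFT; sign=-1 is the forward transform, sign=+1 the unscaled inverse."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def transpose(flat, rows, cols):
    """Treat flat as a row-major rows x cols matrix and return its transpose, also flat."""
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]


def square_natural(x):
    """Cyclic self-convolution of x with the spectrum kept in natural order."""
    spectrum = dft(x, -1)
    squared = [v * v for v in spectrum]
    return [v / len(x) for v in dft(squared, +1)]


def square_transposed(x, rows, cols):
    """Same computation, but the squaring is applied to the transposed spectrum."""
    spectrum_t = transpose(dft(x, -1), rows, cols)
    squared_t = [v * v for v in spectrum_t]        # element-wise, so the order is irrelevant
    squared = transpose(squared_t, cols, rows)     # undo the transpose, only for comparison
    return [v / len(x) for v in dft(squared, +1)]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    a = square_natural(data)
    b = square_transposed(data, 2, 4)
    print(max(abs(u - v) for u, v in zip(a, b)))
```

In a real implementation the inverse transform would consume the transposed layout directly, so neither transpose is ever executed.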

Another trick (which AFAIK is used by CUDALucas and clLucas as well) is mapping a real FFT onto a complex FFT of half the size, which halves the transform length.

Another small trick is using a probabilistic argument for limiting the carry propagation, which simplifies the carry code a bit.

Another small trick is computing the various pre-computed constants (such as the FFT coefficients and the A, A^-1 vectors for the DWT) in extended precision ("long double", which is wider than double on x86 Linux but not true quad precision) before conversion to double precision, which pushes the error bound a tiny bit further.
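
To show what the A and A^-1 vectors are for, here is a toy sketch of one weighted-transform squaring modulo 2^p - 1, following the variable-base scheme of the Crandall & Fagin paper, together with the maximum rounding error it produces. Assumptions: a naive O(n^2) DFT stands in for the FFT, Python floats are plain doubles (so the higher-precision precomputation is not reproduced), Python big integers absorb the carries, and every function name is made up for this illustration.

```python
import cmath
from fractions import Fraction


def dft(x, sign):
    """Naive O(n^2) DFT; sign=-1 is forward, sign=+1 the unscaled inverse."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def bit_offsets(p, n):
    """Word j holds bits [ceil(p*j/n), ceil(p*(j+1)/n)) of a p-bit number."""
    return [-((-p * j) // n) for j in range(n + 1)]


def weights(p, n):
    """DWT weights a_j = 2**(ceil(p*j/n) - p*j/n); each lies in [1, 2)."""
    offsets = bit_offsets(p, n)
    return [2.0 ** float(offsets[j] - Fraction(p * j, n)) for j in range(n)]


def dwt_square(x, p, n):
    """Return (x*x mod 2**p - 1, max rounding error) using an n-word weighted transform."""
    offsets = bit_offsets(p, n)
    a = weights(p, n)
    words = [(x >> offsets[j]) & ((1 << (offsets[j + 1] - offsets[j])) - 1)
             for j in range(n)]
    spectrum = dft([w * aj for w, aj in zip(words, a)], -1)
    raw = dft([v * v for v in spectrum], +1)
    result, max_err = 0, 0.0
    for j in range(n):
        value = raw[j].real / (n * a[j])      # undo the DFT scaling and the weight
        nearest = round(value)
        max_err = max(max_err, abs(value - nearest))
        result += nearest << offsets[j]       # big integers take care of the carries
    return result % ((1 << p) - 1), max_err


def ll_with_dwt(p, n):
    """Lucas-Lehmer test built on dwt_square; returns (is_prime, worst rounding error)."""
    s, worst = 4, 0.0
    for _ in range(p - 2):
        sq, err = dwt_square(s, p, n)
        s = (sq - 2) % ((1 << p) - 1)
        worst = max(worst, err)
    return s == 0, worst


if __name__ == "__main__":
    print(ll_with_dwt(31, 8))
```

With only a few bits per word the rounding error here is tiny; it grows as more bits are packed into each word, which is the effect that more accurate constants help to push back.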

I'll be looking for problems to fix. I hope you'll enjoy!
cheers,
Mihai

LaurV 2017-04-19 14:18

Well done, lad! Congratulations, and here's to many more like this.

Questions:

1. Speed comparison with clLucas?
2. Any windows build?

Thanks.

VictordeHolland 2017-04-19 20:03

First of all nice job!

[QUOTE=LaurV;457039]
2. Any windows build?
[/QUOTE]
[code]
Victor@PCVICTOR MINGW64 /home/gpuowl-master
$ make
g++ -O2 -std=c++11 gpuowl.cpp -ogpuowl -L/home/OpenCL -lOpenCL
In file included from gpuowl.cpp:4:0:
clwrap.h: In member function 'void Program::compile(Context&, const char*, const char*)':
clwrap.h:56:22: error: too many arguments to function 'int mkdir(const char*)'
mkdir("isa", 0777);
^
In file included from C:/msys64/mingw64/x86_64-w64-mingw32/include/sys/stat.h:14:0,
from clwrap.h:7,
from gpuowl.cpp:4:
C:/msys64/mingw64/x86_64-w64-mingw32/include/io.h:271:15: note: declared here
int __cdecl mkdir (const char *) __MINGW_ATTRIB_DEPRECATED_MSVC2005;
^
gpuowl.cpp: In function 'double* genBigTrig(int, int)':
gpuowl.cpp:51:17: error: 'M_PIl' was not declared in this scope
auto base = - M_PIl / (W * H / 2);
^
gpuowl.cpp: In function 'double* genSin(int, int)':
gpuowl.cpp:70:17: error: 'M_PIl' was not declared in this scope
auto base = - M_PIl / (W * H);
^
gpuowl.cpp: In function 'double* smallTrigBlock(int, int, double*)':
gpuowl.cpp:86:22: error: 'M_PIl' was not declared in this scope
auto angle = - M_PIl * line * col / (W * H / 2);
^
gpuowl.cpp: In function 'bool checkPrime(int, bool*, u64*)':
gpuowl.cpp:369:9: error: 'uint' does not name a type
const uint zero = 0;
^
gpuowl.cpp:370:75: error: 'zero' was not declared in this scope
Buf bufErr (c, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(int), &zero);
^
gpuowl.cpp:402:5: error: 'uint' was not declared in this scope
uint rawErr = 0;
^
gpuowl.cpp:403:43: error: 'rawErr' was not declared in this scope
q.read( false, bufErr, sizeof(uint), &rawErr);
^
gpuowl.cpp: In function 'int main(int, char**)':
gpuowl.cpp:422:30: error: 'setlinebuf' was not declared in this scope
if (logf) { setlinebuf(logf); }
^
Makefile:2: recipe for target 'gpuowl' failed
make: *** [gpuowl] Error 1

[/code]
MINGW64 says no, but that is probably just from my incompetence :razz:
C:\msys64\home\OpenCL has a copy of the files from C:\Program Files (x86)\AMD APP SDK\2.9\lib\x86_64
which has the libOpenCL.a file

the errors are just abracadabra to me :razz:.

preda 2017-04-19 20:13

Hi :)

Sorry, I don't have a windows build myself -- no access to windows :). But maybe somebody else could prepare one.

For similar reasons I don't have a straight speed comparison with clLucas -- I couldn't build it yet. It depends on SDKApplication.hpp, which is part of the AMD APP SDK (that I'm not using), and also on openclsdkdefs.mk and openclsdkrules.mk. I do think these build problems can be overcome, but I didn't put the effort in yet.

OTOH as both programs print time-per-iteration, such a comparison should be easy for anybody who manages to build both programs.

I would bet mine is faster :), but by how much..?

preda 2017-04-19 20:42

I just fixed some of those errors for mingw. Would you try again?

msft 2017-04-19 20:52

Great!!!
I hope you release it under an open-source license.

Mark Rose 2017-04-19 21:10

AirSquirrels had run some clLucas [url=http://www.mersenneforum.org/showpost.php?p=406918&postcount=320]benchmarks[/url].

A Fury X oc'ed to 1100,500:

M75002911 gave 3.9026 ms/iter


A Fury X with no oc:

M75002911 gave 4.0754 ms/iter


And a R9 390X:

M75002911 gave 4.3873 ms/iter

preda 2017-04-19 21:24

Yep, licensed under GPL v3 (added License on github).

preda 2017-04-19 21:27

Nice. Let's say 1.8 times faster than clLucas then :)

airsquirrels 2017-04-19 22:09

I will pull this down and run some benchmark comparisons on the same system.

airsquirrels 2017-04-19 22:56

Pretty fast for hand rolled code (clFFT has had a lot of resources put into it by AMD), but I'm definitely not seeing the performance levels indicated above. Still slower than clLucas 1.04. Anything I should check? This is a FuryX

Good news is, the residues match for the first chunk of iterations.

[CODE]gpuOwL v0.1 GPU Lucas-Lehmer primality checker
LL of 75002911 at iteration 0
FFT 1024*2048 (4M words, 17.88 bits per word)
OpenCL compile: 1106 ms
setup: 1638 ms
00020000 / 75002911 [0.03%], ms/iter: 5.413, ETA: 4d 16:45; 777b6635d6b78b75 error 0.125 (max 0.125)
00040000 / 75002911 [0.05%], ms/iter: 5.422, ETA: 4d 16:54; b9fc5678347cad9f error 0.140625 (max 0.140625)
00060000 / 75002911 [0.08%], ms/iter: 5.413, ETA: 4d 16:40; e7fab5c1f11d0f39 error 0.125 (max 0.140625)
00080000 / 75002911 [0.11%], ms/iter: 5.423, ETA: 4d 16:52; 76a6fb920dd95b71 error 0.140625 (max 0.140625)[/CODE]

[CODE]Continuing work from a partial result of M75002911 fft length = 4096K iteration = 1001
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1875 (0:36 real, 3.5486 ms/iter, ETA 73:55:06)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9315 ms/iter, ETA 81:53:04)
Iteration 30000 M( 75002911 )C, 0x0f0c343e5174fa89, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9205 ms/iter, ETA 81:38:41)
Iteration 40000 M( 75002911 )C, 0xb9fc5678347cad9f, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9280 ms/iter, ETA 81:47:22)
Iteration 50000 M( 75002911 )C, 0x992a088a20504a90, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9407 ms/iter, ETA 82:02:32)
Iteration 60000 M( 75002911 )C, 0xe7fab5c1f11d0f39, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9230 ms/iter, ETA 81:39:51)
Iteration 70000 M( 75002911 )C, 0x89386b82336fc06d, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9320 ms/iter, ETA 81:50:26)
Iteration 80000 M( 75002911 )C, 0x76a6fb920dd95b71, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9236 ms/iter, ETA 81:39:18)[/CODE]
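
A small sketch for automating the "do the residues match?" check between two such logs. The regular expressions are written against the exact line formats shown above and are an assumption about them; other versions of either program may print differently. All names are illustrative.

```python
import re

# gpuOwL v0.1 style: "00020000 / 75002911 [0.03%], ms/iter: 5.413, ETA: ...; <residue> error ..."
GPUOWL_LINE = re.compile(r"^(\d+) / \d+ .*; ([0-9a-f]{16}) error")
# clLucas v1.04 style: "Iteration 20000 M( 75002911 )C, 0x<residue>, n = 4096K, ..."
CLLUCAS_LINE = re.compile(r"^Iteration (\d+) M\( \d+ \)C, 0x([0-9a-f]{16}),")


def residues(log_text, pattern):
    """Map iteration number -> 64-bit residue (hex string) for every matching log line."""
    found = {}
    for line in log_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            found[int(match.group(1))] = match.group(2)
    return found


def mismatches(gpuowl_log, cllucas_log):
    """Iterations present in both logs whose residues differ."""
    a = residues(gpuowl_log, GPUOWL_LINE)
    b = residues(cllucas_log, CLLUCAS_LINE)
    return [it for it in sorted(a.keys() & b.keys()) if a[it] != b[it]]


if __name__ == "__main__":
    owl = "00020000 / 75002911 [0.03%], ms/iter: 5.413, ETA: 4d 16:45; 777b6635d6b78b75 error 0.125 (max 0.125)"
    cl = "Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9315 ms/iter, ETA 81:53:04)"
    print(mismatches(owl, cl))
```

An empty list means every iteration logged by both programs agrees; iterations that only one program logged are ignored.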

preda 2017-04-20 01:08

Something is definitely unexpected with the performance you see. I'm using AMDGPU-Pro as the OpenCL compiler (both the most recent, 17.10, and the previous 16.xx produced similar performance). What OpenCL driver do you use? Is the GPU in a good PCIe slot (no transfer bottleneck)?

I would dump the .isa compiled kernels and look there for diffs (you can pass "-save-temps" in clwrap.h, it's there in a comment, and send me the .isa). Or I'll add an option to enable that as an argument.

tului 2017-04-20 03:22

Sam Harris would approve of the name. They're perfectly good after all.

kracker 2017-04-22 01:08

Been tinkering around with it on windows.. got this after compiling:

[code]
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
LL of 76000021 at iteration 0
FFT 1024*2048 (4M words, 18.12 bits per word)
log An invalid option was specified.

error -11
Assertion failed!
Program: C:\Users\Back\Desktop\gpuowl-0832c6d\gpuowl-0832c6d.exe
File: clwrap.h, Line 66

Expression: false

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
[/code]

Also, it might be better for the program to accept any assignment ID. I tried N/A and also some random gibberish, and it didn't accept either. It probably needs a minimum number of characters.

EDIT: Pulled and recompiled from the latest commit..

[code]
OpenCL Compilation error -11, log:
An invalid option was specified.
[/code]

VictordeHolland 2017-04-22 09:10

Same here, compiles without errors on mingw64, but also
error -11
Assertion failed!
clwrap.h, Line 66

preda 2017-04-22 11:37

Kracker, Victor: thanks for trying the compile! As I only compiled with one CL implementation myself, I was not aware of these problems.

Now, would you try again with a freshly checked-out version? I simplified the CL options.

If it still doesn't pass the CL compiler, you can try deleting -cl-fast-relaxed-math -cl-std=CL2.0 from clwrap.h line 89 (but hopefully you won't have to do this).

Concerning the AID in worktodo.txt, try a single 0, like this:
Test=0,24036583,75,1

Otherwise, what should pass in any case is 32 hex digits, e.g. a 0 repeated 32 times.
But Test=N/A would not pass because N/A is not a valid hexadecimal string.
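
A minimal sketch of the rule just described, for anyone preparing a worktodo.txt by hand: the assignment ID is accepted if it is a single 0 or exactly 32 hex digits. This mirrors the description only and is not gpuOwL's actual parser; the function name and the decision to ignore the trailing fields are assumptions made for the example.

```python
import re

# Accepted assignment IDs: a single "0" or exactly 32 hexadecimal digits
AID_PATTERN = re.compile(r"^(0|[0-9A-Fa-f]{32})$")


def parse_worktodo_line(line):
    """Parse 'Test=<AID>,<exponent>,...' and return (aid, exponent), or None if rejected."""
    line = line.strip()
    if not line.startswith("Test="):
        return None
    fields = line[len("Test="):].split(",")
    if len(fields) < 2:
        return None
    aid, exponent = fields[0], fields[1]
    if not AID_PATTERN.match(aid) or not exponent.isdigit():
        return None
    # Any further fields (the "75,1" in the example) are ignored by this sketch
    return aid, int(exponent)


if __name__ == "__main__":
    print(parse_worktodo_line("Test=0,24036583,75,1"))
    print(parse_worktodo_line("Test=N/A,24036583,75,1"))
```
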

- Mihai

preda 2017-04-22 11:44

@airsquirrels: could you run a fresh checked-out version with command line args "-cl -save-temps", e.g.: ./gpuowl -cl -save-temps

This passes "-save-temps" to the CL compiler, which should produce a dump of the GCN ISA. If that works, you should see a file like "_temp_1_Fiji.isa". Could you send that .isa file to me, to see if the reason for the perf degradation is in poor generated ISA code (like too many VGPRs used).

thanks,
Mihai

VictordeHolland 2017-04-22 13:10

Hi,

I tried with the default "-cl-fast-relaxed-math" and "-cl-opt-disable" change in line 89 in the clwrap.h but still got some errors.

I forgot to mention, but my card is a HD7950, which only supports OpenCL [U]1.2[/U]
[URL]https://en.wikipedia.org/wiki/Radeon_HD_7000_Series[/URL]

At least gpuOwl detects it is a Tahiti OpenCL 1.2 AMD-APP 2079.5 device :).

Attached the _temp_0_Tahiti.cl that was created. (I had to rename it to .txt or else I couldn't upload it to the forum.)

[code]
C:\msys64\home\gpuowl>gpuowl
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2079.5)
OpenCL compilation error -11, log:
"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 15: error: attributes may not appear here
double2 _O mul(double2 u, double a, double b) { return (double2) { u.x * a - u.y * b, u.x * b + u.y * a}; }
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 16: error: attributes may not appear here
double2 _O mul(double2 u, double2 v) { return mul(u, v.x, v.y); }
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 59: error: attributes may not appear here
void _O shuffle(local double *lds, double2 *u, uint n, uint f) {
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 83: error: variable with automatic storage duration cannot be stored in the named address space
local double lds[1024];
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 86: error: identifier "lds" is undefined
shuffle(lds, u, 4, 64);
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 102: error: variable with automatic storage duration cannot be stored in the named address space
local double lds[2048];
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 105: error: identifier "lds" is undefined
shuffle(lds, u, 8, 32);
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 365: error: variable with automatic storage duration cannot be stored in the named address space
local double lds[4096];
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 372: error: identifier "lds" is undefined
lds[l * 64 + (c + l) % 64] = ((double *)(u + i))[b];
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 378: error: identifier "lds" is undefined
((double *)(u + i))[b] = lds[l * 64 + (c + l) % 64];
^

10 errors detected in the compilation of "C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl".
Frontend phase failed compilation.
[/code]
Here is what I get from the clinfo command
[code]
clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (2079.5)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices


Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon HD 7900 Series
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 28
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 900Mhz
Address bits: 32
Max memory allocation: 2214174021
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 3221225472
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 000007FED5DF5188
Name: Tahiti
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2079.5 (VM)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2079.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event


Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 1002h
Board name:
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 8
Preferred vector width double: 4
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 3300Mhz
Address bits: 64
Max memory allocation: 4289850368
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 32768
Global memory size: 17159401472
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 4289850368
Max global variable size: 1879048192
Max global variable preferred total size: 1879048192
Max read/write image args: 64
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 1
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 310
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 000007FED5DF5188
Name: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
Vendor: GenuineIntel
Device OpenCL C version: OpenCL C 1.2
Driver version: 2079.5 (sse2,avx)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2079.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_spir cl_khr_gl_event
[/code]

preda 2017-04-23 00:17

OK, I tried to fix these CL compilation errors as well (please retry).

OTOH I need to investigate a bug that appears to be present -- I would not recommend doing any serious LL with gpuOwL right now, I need to validate it a bit more first.

kracker 2017-04-23 01:07

Recompiled.. upon launch I got:
[code]
OpenCL compilation error -11, log:
An invalid option was specified.
[/code]

I removed -cl-std=CL2.0 from clwrap.h and it appears to be working... will see how it goes.

kracker 2017-04-23 01:55

Very impressive! It actually is slightly faster on my low-end HD7770 (I will try it on my R9 285 with OCL 2.0 capability when I have time) and also has better error numbers. The residues also seem to be matching with clLucas.

kladner 2017-04-23 08:06

If I may say, as a spectator, and a non coder, it amazes me to watch this birth process. The cooperation and involvement by several parties is impressive. Seeing this play out is one of the big pay-offs for hanging out on this forum.

VictordeHolland 2017-04-23 16:07

[QUOTE=kladner;457310]If I may say, as a spectator, and a non coder, it amazes me to watch this birth process. The cooperation and involvement by several parties is impressive. Seeing this play out is one of the big pay-offs for hanging out on this forum.[/QUOTE]
Preda deserves all credit for the coding. I'm just trying to compile it for Win64 and reporting the errors I get.

kladner 2017-04-23 17:49

[QUOTE=VictordeHolland;457329]Preda deserves all credit for the coding. I'm just trying to compile it for Win64 and reporting the errors I get.[/QUOTE]
The coding is the "prime" accomplishment <G>, but others taking an interest is like peer review. Both are needed, at least in most cases.

VictordeHolland 2017-04-23 19:53

Success here too with compiling it with MINGW64 for Windows. gpuOwL is [U]faster[/U] and has [U]slightly lower error rates[/U] on my AMD HD7950 in the limited testing so far.

[B]gpuOwL v0.1[/B]
[quote]
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2079.5)
LL FFT 4096K (1024*2048*2) of 76000021 (18.12 bits/word) at iteration 0
OpenCL setup: 656 ms
00020000 / 76000021 [0.03%], ms/iter: 5.618, ETA: 4d 22:35; 000000009c13c393 error 0.185102 (max 0.185102)
00040000 / 76000021 [0.05%], ms/iter: 5.621, ETA: 4d 22:37; 00000000f08b90ef error 0.178388 (max 0.185102)
00060000 / 76000021 [0.08%], ms/iter: 5.627, ETA: 4d 22:42; 00000000d68504f2 error 0.186264 (max 0.186264)
00080000 / 76000021 [0.11%], ms/iter: 5.621, ETA: 4d 22:32; 00000000e93a873a error 0.191096 (max 0.191096)
00100000 / 76000021 [0.13%], ms/iter: 5.609, ETA: 4d 22:15; 0000000035a87d3f error 0.198382 (max 0.198382)
[/quote]
[B]clLucas v1.04[/B]
[quote]C:\clLucas_x64_1.04>clLucas_x64

Platform 0 : Advanced Micro Devices, Inc.
Warning: Couldn't parse ini file option SixStepFFT; using default: off
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

CL_DEVICE_NAME Tahiti
CL_DEVICE_VENDOR Advanced Micro Devices, Inc.
CL_DEVICE_VERSION OpenCL 1.2 AMD-APP (2079.5)
CL_DRIVER_VERSION 2079.5 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS 28
CL_DEVICE_MAX_CLOCK_FREQUENCY 900
CL_DEVICE_GLOBAL_MEM_SIZE 3221225472
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1

<FFT tests>

Starting M76000021 fft length = 4096K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.13540, max error = 0.18750
Iteration 200, average error = 0.16145, max error = 0.18750
Iteration 300, average error = 0.17013, max error = 0.18750
Iteration 400, average error = 0.17448, max error = 0.18750
Iteration 500, average error = 0.17724, max error = 0.20313
Iteration 600, average error = 0.18155, max error = 0.20313
Iteration 700, average error = 0.18463, max error = 0.20313
Iteration 800, average error = 0.18876, max error = 0.21875
Iteration 900, average error = 0.19209, max error = 0.21875
Iteration 1000, average error = 0.19474 < 0.25 (max error = 0.21875), continuing test.
Iteration 20000 M( 76000021 )C, 0x27d2e8539c13c393, n = 4096K, clLucas v1.04 err = 0.2188 (2:01 real, 6.0576 ms/iter, ETA 127:50:52)
Iteration 40000 M( 76000021 )C, 0xbdeced3ff08b90ef, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0910 ms/iter, ETA 128:31:14)
Iteration 60000 M( 76000021 )C, 0x7c3f23c5d68504f2, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0879 ms/iter, ETA 128:25:15)
Iteration 80000 M( 76000021 )C, 0x02b75b4ce93a873a, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0903 ms/iter, ETA 128:26:11)
Iteration 100000 M( 76000021 )C, 0xa04bf1ff35a87d3f, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0895 ms/iter, ETA 128:23:10)
[/quote]

VictordeHolland 2017-04-23 19:58

'Guide' for compiling it on Windows with msys64+MINGW64
1. Install msys64 with MINGW64. You will also need a text editor (I prefer notepad++).
2. Download and install AMD SDK 3.0 from [URL]http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/[/URL]
3. Download the gpuowl code from [URL]https://github.com/preda/gpuowl[/URL] (easiest is to download the entire repository as a .zip)
4. Extract the .zip (preferably to somewhere you can navigate to easily with msys64, for instance msys64\home\gpuowl)
5. Open the Makefile (in the gpuowl folder) with the text editor (notepad++)
6. Edit the path after the "-L" argument to point to the OpenCL SDK install path.
The standard path is C:\Program Files (x86)\AMD APP SDK\3.0\lib\x86_64
I copied the folder contents to msys64\home\OpenCL\SDK3 for easier referencing, since I always get confused by paths with spaces and brackets.
In my case the Makefile contains:
[code]
gpuowl: gpuowl.cpp clwrap.h tinycl.h
g++ -O2 -std=c++14 gpuowl.cpp -ogpuowl -L\C\msys64\home\OpenCL\SDK3 -lOpenCL
[/code]
Don't forget to save the changes ;).

6.1 [OPTIONAL] If you don't have an OpenCL 2.0 device, edit the clwrap.h file and on line 89 change "-cl-std=CL2.0" to "-cl-std=CL1.2"

7. Start the msys64/MINGW64 shell and navigate to the gpuowl folder. If you put the folder in the /home directory of msys64, you can get there by typing "cd .." to reach the home directory and then "cd yourgpuowlfoldername"
8. Run 'make'

Should look something like this so far
[code]MINGW64 ~
$ cd ..

MINGW64 /home
$ cd gpuOwlv0.1/

MINGW64 /home/gpuOwlv0.1
$make
g++ -O2 -std=c++14 gpuowl.cpp -ogpuowl -L\C\msys64\home\OpenCL\SDK3 -lOpenCL

MINGW64 /home/gpuOwlv0.1
$
[/code]
9. Assuming you didn't get any errors, you should now have a gpuowl.exe in the gpuowl directory.
9.1 [OPTIONAL] If you wish to move the gpuowl directory somewhere else, you probably need to copy these 3 .dlls into the directory:
libgcc_s_seh-1.dll
libstdc++-6.dll
libwinpthread-1.dll
10. Create a worktodo.txt and start testing.

[B]Please note gpuOwl is [U]not[/U] production ready.[/B]

LaurV 2017-04-25 04:41

Can someone post or PM me a windoze exe? [edit: x64]. We still have trouble compiling it (the same troubles as above - but the troubles are most probably related to our ignorance with the tools).

We are going to give it a test too - we own an old "XFX HD7970 GHz" here.

P.S. we fully understand that it is not "production ready" yet, but if it is faster than clLucas [B][U]and[/U][/B] it is giving out the same DC residue, for sure we will report it to PrimeNet and get some fast credits! :razz:

VictordeHolland 2017-04-25 11:42

I've got a HD7950, so I'm hoping my executable should work for you too.
I'll send it when I get home tonight.

LaurV 2017-04-25 15:32

[QUOTE=VictordeHolland;457343] (I prefer notepad++)
5. <snip> texteditor (notepad++)
[/QUOTE]

The [URL="https://notepad-plus-plus.org/news/notepad-7.3.3-fix-cia-hacking-issue.html"]CIA hacked[/URL] stuff, huh? :razz:

(I know why I always used pn2... haha)

VictordeHolland 2017-04-25 16:48

[QUOTE=LaurV;457485]The [URL="https://notepad-plus-plus.org/news/notepad-7.3.3-fix-cia-hacking-issue.html"]CIA hacked[/URL] stuff, huh? :razz:

(I know why I always used pn2... haha)[/QUOTE]
Windows itself probably has multiple/dozens? of unpatched zero-days only known to the three letter agencies. So I wouldn't lose any sleep over it :smile:.

preda 2017-04-26 01:51

Thanks for the MinGW compilation, and the screenshots! The screenshots show there's an error in printing the residue (leading digits being 0) -- that's hopefully fixed now (not a big deal, the problem was just 'cosmetic').

A small update on where gpuOwL is, and what I'm planning on next.
I was really worried by the results of some of my own testing -- the LL was failing on known primes (24036583). Thus I decided to do some more serious testing to find the cause of the error.

But after all this investigation, my conclusion was that it's not a software bug, but the GPU producing.. an erroneous result very rarely. This is disconcerting, and I'd really like to have a way to detect such problems.

The LL involves two distinct computations. One is FFT-Square-IFFT, the second is "round-to-int + Carry-propagation". An error can occur in either of these, and these are the detection mechanisms that I know of:

- evaluating the "max rounding error" that occurs when rounding-to-int, after the IFFT and before the carry-propagation. This is cheap to compute on the GPU, thus is always on (and printed on every logstep). This rounding error brings two pieces of information: 1. whether the FFT size is big enough for the chosen exponent, and 2. whether something went completely wrong with the FFT/IFFT. I plan to add provisions in the code for detecting a sudden jump in the rounding error (which may indicate FFT error), and re-run the last batch in that situation to check for consistent results.

- evaluating the SUMINP / SUMOUT of the FFT (which is done by the CPU prime95). This is not implemented, because it seems (to me) expensive to do on the GPU. This check would have provided very good detection for FFT/IFFT errors, but no protection against rounding&carry-propagation errors.

- using "offset". This changes the values fed to the FFT/IFFT, and again protects against FFT errors (but without detecting them "in real time").

As is, there is no check that I know of that covers the carry-propagation. If an error takes place in that part, it would not be detected by either the max-rounding-error check or the SUMINP/SUMOUT check. I would be interested in finding out about a GPU-cheap way to check that the integer digits of the modulo-convolution done by LL are not completely haywire.
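A minimal sketch of the planned "sudden jump" check on the per-logstep rounding error, written in Python for readability. This is not gpuOwL code: the function name and both thresholds (`ratio`, `hard_limit`) are arbitrary assumptions chosen only for illustration.

```python
def is_sudden_jump(history, current, ratio=1.5, hard_limit=0.45):
    """Decide whether the latest max rounding error looks suspicious.

    history: max rounding errors from earlier logsteps (may be empty).
    current: max rounding error of the logstep that just finished.
    ratio, hard_limit: illustrative thresholds, not values from gpuOwL.
    """
    if current >= hard_limit:
        # Too close to (or beyond) 0.5: rounding can no longer be trusted.
        return True
    if not history:
        return False
    # Flag a value far above anything seen so far for this exponent.
    return current > ratio * max(history)


# Errors of the size seen in a healthy 4096K run, then a hypothetical spike.
seen = [0.199, 0.203, 0.209]
print(is_sudden_jump(seen, 0.206))  # normal fluctuation
print(is_sudden_jump(seen, 0.40))   # would trigger a re-run of the last batch
```

A caller would re-run the last batch from the previous checkpoint when this returns true and compare the two residues. As noted above, such a check says nothing about errors in the carry-propagation step.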

Development plan:
- implement "offset", and measure performance impact. If impact is small, leave it always-on.
- check for sudden jumps in rounding-error, and automatically re-try in that situation. May help detect too-overclocked GPUs (but not always).
- add some simple self-test, which would run on known primes and compare residues with a pre-saved residue list, to detect obvious errors. (to detect more subtle errors, a good but expensive way is to run to completion on known primes (and check 0 residue), or double-check validated results).

Still missing:
- ability to select specific GPU in a multi-GPU system (right now, simply uses the first GPU)
- get some ISA dumps (produced with "-cl -save-temps") and analyze to investigate the low performance reported
- add ability to dump compiled binary (for OpenCLs that do not offer "-save-temps")

science_man_88 2017-04-26 02:06

[QUOTE=preda;457524]Thanks for the MinGW compilation, and the screenshots! The screenshots show there's an error in printing the residue (leading digits being 0) -- that's hopefully fixed now (not a big deal, the problem was just 'cosmetic').

A small update on where gpuOwL is, and what I'm planning on next.
I was really worried by the results of some of my own testing -- the LL was failing on known primes (24036583). Thus I decided to do some more serious testing to find the cause of the error.

But after all this investigation, my conclusion was that it's not a software bug, but the GPU producing.. an erroneous result very rarely. This is disconcerting, and I'd really like to have a way to detect such problems.

The LL involves two distinct computations. One is FFT-Square-IFFT, the second is "round-to-int + Carry-propagation". An error can occur in either of these, and these are the detection mechanisms that I know of:

- evaluating the "max rounding error" that occurs when rounding-to-int, after the IFFT and before the carry-propagation. This is cheap to compute on the GPU, thus is always on (and printed on every logstep). This rounding error brings two pieces of information: 1. whether the FFT size is big enough for the chosen exponent, and 2. whether something went completely wrong with the FFT/IFFT. I plan to add provisions in the code for detecting a sudden jump in the rounding error (which may indicate FFT error), and re-run the last batch in that situation to check for consistent results.

- evaluating the SUMINP / SUMOUT of the FFT (which is done by the CPU prime95). This is not implemented, because it seems (to me) expensive to do on the GPU. This check would have provided very good detection for FFT/IFFT errors, but no protection against rounding&carry-propagation errors.

- using "offset". This changes the values fed to the FFT/IFFT, and again protects against FFT errors (but without detecting them "in real time").

As is, there is no check that I know of that covers the carry-propagation. If an error takes place in that part, it would not be detected by either the max-rounding-error check or the SUMINP/SUMOUT check. I would be interested in finding out about a GPU-cheap way to check that the integer digits of the modulo-convolution done by LL are not completely haywire.

Development plan:
- implement "offset", and measure performance impact. If impact is small, leave it always-on.
- check for sudden jumps in rounding-error, and automatically re-try in that situation. May help detect too-overclocked GPUs (but not always).
- add some simple self-test, which would run on known primes and compare residues with a pre-saved residue list, to detect obvious errors. (to detect more subtle errors, a good but expensive way is to run to completion on known primes (and check 0 residue), or double-check validated results).

Still missing:
- ability to select specific GPU in a multi-GPU system (right now, simply uses the first GPU)
- get some ISA dumps (produced with "-cl -save-temps") and analyze to investigate the low performance reported
- add ability to dump compiled binary (for OpenCLs that do not offer "-save-temps")[/QUOTE]
I'm not even really that involved in GIMPS, but is it returning 0xFFFFF... (hex for all 1's in binary)? That was my first thought. If so, it may be returning the Mersenne number itself (or a multiple) instead of 0. I believe something similar happened in Prime95 at one point (unless my memory of what I read is foggy).

preda 2017-04-26 03:07

No, it's not all-1s. I ran 24036583 twice, the second time the result was correct (0). I tracked the difference between the two runs by comparing the residues, and at some point around 13% the residues diverged. It means that in the first run an error occurred at that point. Given that the software is supposed to be deterministic (produce identical bits every time), this could be explained by the hardware behaving funny.
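The comparison described above can be sketched in a few lines of Python, assuming each run's logged residues have already been collected into a dictionary keyed by iteration number. The function name and the residue values are made up for illustration.

```python
def first_divergence(run_a, run_b):
    """Return the first iteration at which two runs report different residues.

    run_a, run_b: dicts mapping iteration number -> residue hex string.
    Only iterations logged by both runs are compared. Returns None when
    every common iteration agrees.
    """
    for iteration in sorted(run_a.keys() & run_b.keys()):
        if run_a[iteration].lower() != run_b[iteration].lower():
            return iteration
    return None


# Made-up residues, for illustration only.
good = {20000: "1a2b3c4d5e6f7081", 40000: "0f1e2d3c4b5a6978", 60000: "aabbccddeeff0011"}
bad = {20000: "1a2b3c4d5e6f7081", 40000: "0f1e2d3c4b5a6978", 60000: "aabbccddeeff0012"}
print(first_divergence(good, bad))
```

The error itself happened somewhere between the last matching logstep and the first diverging one, so only that batch needs a closer look.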

Mark Rose 2017-04-26 03:27

I believe I've read on the forum that some users doing GPU LL had to underclock the memory to get an accurate result.

kladner 2017-04-26 04:20

[QUOTE=Mark Rose;457528]I believe I've read on the forum that some users doing GPU LL had to underclock the memory to get an accurate result.[/QUOTE]
That was definitely my experience. Until I slowed the memory clock, my gtx 460 and 580 cards could not complete both self-tests.

I have yet to really work at getting CUDALucas to run on the GTX 1060. Early self-test runs blew up in seconds. It's been a while, so I don't remember the details that well. With an i7 Skylake turning out a DC every 25 hours, I found it much more productive to keep the GPUs on TF.

preda 2017-04-26 08:08

Exponent range supported by gpuOwL
 
gpuOwL now only supports FFT 4096K. This allows LL in the range about 35M - 77M.

But all the exponents under ~ 40.8M have been double-checked, thus there's not much interest there.

Using FFT 4096K for exponents under about 65M may be a waste, because faster FFT-sizes are available. Thus gpuOwL is probably best used for first-time and double-checks in the range 70M - 77M.

I recommend starting with at least a couple of double-checks (to validate correct function) before doing any first-time LL.

The results, found in "results.txt", are in a format that can be directly submitted on the "Manual Testing" webpage.

Note: on github in the "fft2m" branch, [url]https://github.com/preda/gpuowl/tree/fft2m[/url]
there is an implementation for FFT 2048K as well, in case anybody is particularly interested in those small sizes (probably most useful for testing, or as sample code).

About larger FFT sizes: for now I'm only looking toward supporting POT (power-of-two) FFTs. But I think there's no interest in 8M or 16M FFTs (for LL), because there's plenty of first-time LL to do under 78M.

For world-record tests (exponents > 332M), the smallest POT that can handle them is 32M, which is also overkill (in my estimation, 32M FFT may handle exponents up to 550M). So it seems that world-record tests are better handled by a non-POT FFT. In addition, it would probably not be such a good idea to spend a big amount of time (on huge exponents) with a new program with limited testing.
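The ranges above come down to how many bits of the exponent each FFT word has to carry. A small Python sketch of that arithmetic, assuming "4096K" means 4096 × 1024 words (the helper name is hypothetical):

```python
def bits_per_word(exponent, fft_size_k):
    """Average number of bits stored per FFT word.

    fft_size_k is the FFT size in units of 1024 words (4096 for "4096K").
    """
    return exponent / (fft_size_k * 1024)


cases = [
    (77002759, 4096),       # near the top of the 4096K range
    (76453229, 4096),
    (332_000_000, 32768),   # smallest world-record exponent on a 32M FFT
    (550_000_000, 32768),   # estimated upper limit for a 32M FFT
]
for exponent, fft_k in cases:
    print(f"{exponent} @ {fft_k}K: {bits_per_word(exponent, fft_k):.2f} bits/word")
```

A 332M exponent on a 32M FFT uses only about 10 bits per word, well below the 16 or so bits that the 550M estimate corresponds to, which is why the POT size is called overkill there.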

preda 2017-04-26 08:59

gpuOwL stop / resume
 
gpuOwL writes on every logstep (20k iterations) a checkpoint to a file save-N.bin
(moving the previous file to save-N.old).

The program can be safely stopped/killed at any time. Upon restart, it will look for a checkpoint file for the given exponent, and continue from there if found.

The checkpoint file starts with a human-readable header, like this:
LL1 42643801 160000 1024 2048 0

With the values meaning:
file-signature, exponent, iteration, width, height, offset
followed by a newline, a ctrl-Z character, and a binary dump of the words.

These save files can be safely moved around. If they are deleted or moved away, the progress is lost and the program starts again from iteration 0.
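For anyone writing tools around these files, here is a hedged Python sketch of reading the header described above. It assumes the layout is exactly as stated (one text line, a newline, a ctrl-Z byte, then the binary words); the class and function names are hypothetical and not part of gpuOwL.

```python
from typing import NamedTuple


class CheckpointHeader(NamedTuple):
    signature: str
    exponent: int
    iteration: int
    width: int
    height: int
    offset: int


def parse_header(data: bytes) -> CheckpointHeader:
    """Parse the human-readable header at the start of a checkpoint file."""
    line, newline, rest = data.partition(b"\n")
    if not newline:
        raise ValueError("no newline after header line")
    if not rest.startswith(b"\x1a"):
        raise ValueError("expected a ctrl-Z byte after the header line")
    fields = line.decode("ascii").split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 header fields, got {len(fields)}")
    return CheckpointHeader(fields[0], *(int(f) for f in fields[1:]))


# The example header from above, followed by a few dummy payload bytes.
sample = b"LL1 42643801 160000 1024 2048 0\n\x1a" + bytes(8)
print(parse_header(sample))
```

Reading only the first few dozen bytes of a save file is enough to learn which exponent and iteration it belongs to, without touching the binary dump.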

LaurV 2017-04-26 14:46

Haha, nice avatar :razz:

To the subject: As promised, Victor sent me his build. I gave up doing mine, I found out I have some old tools and no time to renew them, but I will resume the trials as soon as time allows.

Let's first start by getting an assignment in 77M, to avoid wasting precious cycles, as the Owl only knows the 4096K FFT. For a start we got M77002759. Good. For a comparison, we tried to give it a run with clLucas first, to see what we are fighting against. As we didn't use this machine for testing for a while, we had first an unsuccessful struggle to convince clLucas to stick with the FFT size. When we do not specify the FFT size in the command line, he works for a long while, deciding which FFT is the best (it starts much lower), and every time ends with a "wrong" one, i.e. above 4096K.

We gave up, after he decided to get a too big error, and increase the FFT, regardless of what we were doing. Score, 1-0 for clLucas against us. The point is that the next FFT that he wants to use is about half-speed compared with the POT one. This is easy to see when he prints the test lines in the beginning, every 100 iterations, the text lines come less often (half speed) after he increases the FFT. This is visible, like two seconds per line, against one second per line before. Grrrr... We decided to forget the things, shot the dead horse, and get a new, [U]smaller[/U], assignment. This time we got M76453229, and clLucas happily decided not to increase the FFT. Gooooood.....

Then we did the same run with gpuOwl, and we decided to do both runs just to see the difference. [B]gpuOwl is indeed faster, but we have to complain about it zerorizing half of the residue [/B](of course, this is just a printing bug, we assume, or maybe a compilation bug).

[CODE]e:\99 - Prime\clLucas>cllucas_x64 -c 2000 -threads 256 -f 4194304 -s backups

Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

CL_DEVICE_NAME Tahiti
CL_DEVICE_VENDOR Advanced Micro Devices, Inc.
CL_DEVICE_VERSION OpenCL 1.2 AMD-APP (2348.3)
CL_DRIVER_VERSION 2348.3
CL_DEVICE_MAX_COMPUTE_UNITS 32
CL_DEVICE_MAX_CLOCK_FREQUENCY 1050
CL_DEVICE_GLOBAL_MEM_SIZE 3221225472
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1

mkdir: cannot create directory `backups': File exists
Starting M77002759 fft length = 4096K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.16452, max error = 0.21875
Iteration 200, average error = 0.19164, max error = 0.21875
Iteration 300, average error = 0.20540, max error = 0.23438
Iteration 400, average error = 0.21530, max error = 0.25000
Iteration 500, average error = 0.22243, max error = 0.28125
Iteration 600, average error = 0.23223, max error = 0.28125
Iteration 700, average error = 0.23923, max error = 0.28125
Iteration 800, average error = 0.24449, max error = 0.28125
Iteration 900, average error = 0.24857, max error = 0.28125
Iteration 1000, average error = 0.25181 >= 0.25 (max error = 0.28125), increasing FFT length and restarting.
Starting M77002759 fft length = 4480K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.02343, max error = 0.03125
Iteration 200, average error = 0.02738, max error = 0.03320
Iteration 300, average error = 0.02932, max error = 0.03320
Iteration 400, average error = 0.03035, max error = 0.03516
Iteration 500, average error = 0.03131, max error = 0.03516
Iteration 600, average error = 0.03195, max error = 0.03516
Iteration 700, average error = 0.03241, max error = 0.03516
Iteration 800, average error = 0.03276, max error = 0.03516
Iteration 900, average error = 0.03302, max error = 0.03516
Iteration 1000, average error = 0.03323 < 0.25 (max error = 0.03516), continuing test.
Iteration 2000 M( 77002759 )C, 0x9a2f030ffaeda2c4, n = 4480K, clLucas v1.04 err = 0.0371 (0:28 real, 14.0000 ms/iter, ETA 299:26:41)
Iteration 4000 M( 77002759 )C, 0xed3b849574c96289, n = 4480K, clLucas v1.04 err = 0.0371 (0:27 real, 13.9316 ms/iter, ETA 297:58:24)
Iteration 6000 M( 77002759 )C, 0x6d71868cfc75973d, n = 4480K, clLucas v1.04 err = 0.0371 (0:28 real, 13.9408 ms/iter, ETA 298:09:45)
Unknown signal caught, writing checkpoint. Estimated time spent so far: 1:50

e:\99 - Prime\clLucas>cllucas_x64 -c 2000 -threads 256 -f 4194304 -s backups

Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

CL_DEVICE_NAME Tahiti
CL_DEVICE_VENDOR Advanced Micro Devices, Inc.
CL_DEVICE_VERSION OpenCL 1.2 AMD-APP (2348.3)
CL_DRIVER_VERSION 2348.3
CL_DEVICE_MAX_COMPUTE_UNITS 32
CL_DEVICE_MAX_CLOCK_FREQUENCY 1050
CL_DEVICE_GLOBAL_MEM_SIZE 3221225472
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1

mkdir: cannot create directory `backups': File exists
Starting M76453229 fft length = 4096K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.14852, max error = 0.21875
Iteration 200, average error = 0.18363, max error = 0.21875
Iteration 300, average error = 0.19534, max error = 0.21875
Iteration 400, average error = 0.20119, max error = 0.21875
Iteration 500, average error = 0.20470, max error = 0.21875
Iteration 600, average error = 0.20704, max error = 0.21875
Iteration 700, average error = 0.20872, max error = 0.21875
Iteration 800, average error = 0.20997, max error = 0.21875
Iteration 900, average error = 0.21095, max error = 0.21875
Iteration 1000, average error = 0.21172 < 0.25 (max error = 0.21875), continuing test.
Iteration 2000 M( 76453229 )C, 0xbb1d6624a3ab7bf8, n = 4096K, clLucas v1.04 err = 0.2188 (0:12 real, 5.7752 ms/iter, ETA 122:38:34)
Iteration 4000 M( 76453229 )C, 0xbaefb39c3b82c9d1, n = 4096K, clLucas v1.04 err = 0.2188 (0:12 real, 5.7400 ms/iter, ETA 121:53:32)
Iteration 6000 M( 76453229 )C, 0x580c90a32431aeea, n = 4096K, clLucas v1.04 err = 0.2188 (0:11 real, 5.7350 ms/iter, ETA 121:46:58)
Iteration 8000 M( 76453229 )C, 0x034cc7c190b474a6, n = 4096K, clLucas v1.04 err = 0.2188 (0:12 real, 5.7300 ms/iter, ETA 121:40:24)
Iteration 10000 M( 76453229 )C, 0x40e22bdb628637bd, n = 4096K, clLucas v1.04 err = 0.2188 (0:11 real, 5.7100 ms/iter, ETA 121:14:44)
Unknown signal caught, writing checkpoint. Estimated time spent so far: 1:02

e:\99 - Prime\clLucas>cd ..\gpuowl

e:\99 - Prime\gpuOwl>gpuowl -logstep 2000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 77002759 (18.36 bits/word) at iteration 0
OpenCL setup: 960 ms
00002000 / 77002759 [0.00%], ms/iter: 4.765, ETA: 4d 05:55; 00000000faeda2c4 error 0.238075 (max 0.238075)
00004000 / 77002759 [0.01%], ms/iter: 4.750, ETA: 4d 05:36; 0000000074c96289 error 0.236749 (max 0.238075)
00006000 / 77002759 [0.01%], ms/iter: 4.745, ETA: 4d 05:29; 00000000fc75973d error 0.234828 (max 0.238075)
00008000 / 77002759 [0.01%], ms/iter: 4.745, ETA: 4d 05:29; 000000001f37b1fb error 0.24425 (max 0.24425)
00010000 / 77002759 [0.01%], ms/iter: 4.745, ETA: 4d 05:29; 00000000b0f55ab1 error 0.235088 (max 0.24425)
^C
[COLOR=Orange][changing the lines' order in worktodo]

[/COLOR]e:\99 - Prime\gpuOwl>gpuowl -logstep 2000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 0
OpenCL setup: 1000 ms
00002000 / 76453229 [0.00%], ms/iter: 4.770, ETA: 4d 05:18; 00000000a3ab7bf8 error 0.199004 (max 0.199004)
00004000 / 76453229 [0.01%], ms/iter: 4.745, ETA: 4d 04:46; 000000003b82c9d1 error 0.20329 (max 0.20329)
00006000 / 76453229 [0.01%], ms/iter: 4.740, ETA: 4d 04:39; 000000002431aeea error 0.208951 (max 0.208951)
00008000 / 76453229 [0.01%], ms/iter: 4.750, ETA: 4d 04:52; 0000000090b474a6 error 0.206401 (max 0.208951)
00010000 / 76453229 [0.01%], ms/iter: 4.745, ETA: 4d 04:45; 00000000628637bd error 0.203467 (max 0.208951)
^C
e:\99 - Prime\gpuOwl>
[/CODE]Note that 4 days, 5 hours, is 101 hours. The difference in time between the two gpuOwl runs seems normal as there are fewer iterations to do, in spite of the fact that the iterations themselves take the same time (as the same FFT is used).

This is about a 20% speed increase, for this build, and this card.
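The ETA and speed-up figures can be checked from the ms/iter values in the logs above. A small Python sketch, using the last logged line of each M76453229 run (the helper name is hypothetical):

```python
def eta_hours(exponent, iterations_done, ms_per_iter):
    """Hours left, assuming about one iteration per bit of the exponent."""
    remaining = exponent - iterations_done
    return remaining * ms_per_iter / 1000 / 3600


# M76453229 at iteration 10000, timings taken from the two logs.
cllucas = eta_hours(76453229, 10000, 5.71)
gpuowl = eta_hours(76453229, 10000, 4.745)

print(f"clLucas: {cllucas:.1f} h")
print(f"gpuOwl:  {gpuowl:.1f} h")
print(f"speed ratio: {5.71 / 4.745:.2f}")
```

This gives roughly 121 hours against 101 hours, a ratio of about 1.2, consistent with the ETAs both programs print.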

Next step would be to try to compile our own version, and if (or when) the zerorizing error is fixed, to finish these tests and compare the residues with what the Titan/cudaLucas gives. If success, we will report both as LL and DC. Yes, we know this will infuriate MadPoo which will have to triple check :razz:

LaurV 2017-04-26 15:34

Additional "complaints", besides the fact that the little thief is stealing my hex digits of the residue:

1. When something like this happens: (yes, we can force it, by giving unrealistic work to the card at the same time it is doing gpuOwl):
[CODE]Error 4 is too large!00030000 / 76453229 [0.04%], ms/iter: 5.561, ETA: 4d 22:03; 00000000eb2493b3 error 4 (max 4)[/CODE], then the error should be saved in the file too and carried along to the next restart (use a byte in the header, or so), or the program should exit and resume from an earlier saved file. Otherwise, (first) it is wasting precious time (yes, the residue is wrong; the correct one ends in 2BBB1710, and it continues with the wrong calculation, wasting cycles in vain) and (second) the user can restart the program (and the error is lost), so he never knows he has problems with his hardware. (most of us use some batch file like
[CODE]:loop
gpuOwl
goto loop[/CODE]to run the tools, because sometimes they crash, or the card crash, and the calculus has to resume, and not wait until I come from work in the evening. In this case, gpuOwl will continue with wrong residue, and the error will not be carried on after restart, so the user will have no idea (who's reading the logs? :smile:)

and 2., because we are here, please do not delete the partial residue files. Let them there, and use the iteration number [U]and the residue[/U], as part of the file name (cudaLucas style). The user can delete them manually if he wants. This is useful when we compare files and residues between different runs, different cards, different programs, as they all have different file structures, different shift counts, etc - i.e. the two files, one produced by cudaLucas, and one by clLucas, are not the same inside, but if they are both called "s76453229.30001.9d0222732bbb1710.txt" (real file name here!), then I know that both programs are doing fine, and my batch file can automatically parse the two backup folders and kill the programs and resume from a previous iteration (by renaming the files to cxxxxx and txxxxx, see cudaLucas), [U]without looking inside of the files[/U] (I am not interested in internal structure/kitchen) so we don't lose any time walking paths that start with wrong residues and backtracking those paths.

and 3. save those residue files in a sub-folder, call it backup or so (later it can be given in an ini file) so they can be handled in bulk, deleted, etc, without inadvertently deleting the exe file or some library in the folder the program is running from.
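As an illustration of point 2, this is roughly what a helper script could do with such file names, never opening the files themselves. The sketch is in Python; the pattern is inferred from the single file name quoted above (one letter, exponent, iteration, 16 hex digits, ".txt") and is an assumption, as are the function names and every residue except the quoted one.

```python
import re

# Assumed pattern, inferred from "s76453229.30001.9d0222732bbb1710.txt".
NAME_RE = re.compile(r"^[a-z](\d+)\.(\d+)\.([0-9a-f]{16})\.txt$", re.IGNORECASE)


def parse_name(name):
    """Return (exponent, iteration, residue) or None if the name does not match."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3).lower()


def last_agreed_iteration(names_a, names_b):
    """Highest iteration at which both folders hold the same exponent and residue."""
    a = {p for p in map(parse_name, names_a) if p}
    b = {p for p in map(parse_name, names_b) if p}
    return max((iteration for _, iteration, _ in a & b), default=None)


# Two hypothetical backup folders; only the 30001 name in folder_a is real.
folder_a = [
    "s76453229.10001.1111111111111111.txt",
    "s76453229.20001.2222222222222222.txt",
    "s76453229.30001.9d0222732bbb1710.txt",
]
folder_b = [
    "s76453229.10001.1111111111111111.txt",
    "s76453229.20001.2222222222222222.txt",
    "s76453229.30001.00000000eb2493b3.txt",
]
print(last_agreed_iteration(folder_a, folder_b))
```

Both programs would then be stopped and restarted from the checkpoint at the returned iteration.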

then we are good to go...

LaurV 2017-04-26 16:02

On the bright side, we timed the little owl again, this time with a stopwatch in hand, to see whether it cheats when displaying ETAs. It does not. :razz: (But you have to agree that it is possible :razz: and we are paranoid by formation, nothing personal. One can produce an LL test program that says it tests a 77M exponent in 55 hours, but you let it run and run, and after 55 hours it is only one third done and needs another 110 hours for the other two thirds, in spite of the fact that it now shows only 33 hours to go. Not the case for gpuOwl, but you get the point. So, we made sure.)

Using it, will make our card from ~42 GHz Days per Day (it is indeed its score at this range, despite James' site [URL="http://www.mersenne.ca/cudalucas.php?model=481"]saying here[/URL] that it only scores between 35 and 40) into a 20% faster card, as we have seen above, which is a bit over 50 GHzD/D, at parity with some "good" gtx 1070 or quadro plex, not talking about the boost it will give to lots of "fury" cards that are already at 50 or higher.

This matches the displayed time, because the 77M exponent we tested is worth 217 GHzDays, and the ETA was (roughly) 4 days 5 hours, i.e. 4.2 days, which makes 217 / 4.2, or a bit over 50 GHzDays per Day.

Yay!

Therefore, if you can find the time, resources, motivation, whatever, to continue improving it, you will make a lot of people here happy....

Two thumbs up...
:tu: :tu:

Time for bed, midnight here (almost)...

kracker 2017-04-26 17:30

[QUOTE=VictordeHolland;457268]Hi,
I forgot to mention, but my card is a HD7950, which only supports OpenCL[U]1.2[/U]
[URL]https://en.wikipedia.org/wiki/Radeon_HD_7000_Series[/URL]

At least gpuOwl detects it is a Tahiti OpenCL 1.2 AMD-APP 2079.5 device :).

...
[/QUOTE]

Funny enough, although the HD7950 doesn't support OpenCL 1.2, the R9 280 which is a rebrand of the 7950 with a higher clock speed [I][URL="https://en.wikipedia.org/wiki/AMD_Radeon_Rx_200_series"]does[/URL][/I]... just driver things I guess....

VictordeHolland 2017-04-26 18:52

[QUOTE=kracker;457563]Funny enough, although the HD7950 doesn't support OpenCL 1.2, the R9 280 which is a rebrand of the 7950 with a higher clock speed [I][URL="https://en.wikipedia.org/wiki/AMD_Radeon_Rx_200_series"]does[/URL][/I]... just driver things I guess....[/QUOTE]
The Wikipedia pages (they could be wrong of course) state:

GCN 1st gen supports OpenCL1.2:
Tahiti chips (HD79xx, HD89xx, R9 280(X))

GCN 2nd gen supports OpenCL2.0, used in
Bonaire chip (HD7790, HD8770, R7 260X, R7 360)
Hawaii chip (R9 290(X) , R9 390(X))

GCN 3rd gen also supports OpenCL2.0:
Tonga (R9 285, R9 380(X))
Fiji (R9 Fury, Nano X)

Seems highly plausible that the OpenCL support is dependent on the GCN generation.

preda 2017-04-26 23:05

@LaurV, that's a detailed analysis!

It seems your build was not fresh though: the "error too large" not stopping is already fixed. The zeroed residue is also "maybe" fixed already. If you can get a fresh build, I'll know if the residue is indeed now printed correctly (or look more into that if not).

I still don't understand how you got "4" for error.. it's supposed to go only up to 0.5..

I'll think about the structure of the save-files. I probably need to look a bit at what CUDALucas does there. But, to keep *all* the old checkpoints around? -- each file is 16MB. If you get 4000 of those, that'd be 64GB, probably too much.

airsquirrels 2017-04-27 06:14

I should have some time this weekend to do ISA dumps and try upgrading drivers / APP SDK to new versions on one of my FuryX Systems.

Also, most consumer cards do occasionally have errors. I have seen them less often on AMD cards than NVIDIA, but they do happen. If it is helpful, I have a system with 3x W9100s in it which have ECC memory and (ideally) do not exhibit hardware errors (100s of double checks agree). If you setup to select the GPU I can run a few double check exponents on those cards to check stability.

preda 2017-04-27 10:36

Observed reliability
 
Some information from my testing so far:

2 double-checks in 77M range were matches.

(the following are all known primes)

24036583 failed once, successful once.
42643801 failed twice, successful twice.
All these were successful on first run: 20996011, 25964951, 30402457, 32582657, 37156667, 74207281, 57885161.

So the empirical error rate is about 20% (quite high IMO).
The error rate may be affected by the temperature of the GPU.
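The 20% figure follows from counting every run listed above, including the two double-checks. A short Python tally (the dictionary labels are only descriptive):

```python
# (failed runs, successful runs) for each entry in the list above.
runs = {
    "two 77M double-checks": (0, 2),
    "M24036583": (1, 1),
    "M42643801": (2, 2),
    "seven other known primes": (0, 7),
}

failed = sum(f for f, _ in runs.values())
total = sum(f + s for f, s in runs.values())
error_rate = failed / total

print(f"{failed} failed out of {total} runs: {error_rate:.0%}")
```

With only 15 runs the estimate is rough, so the true rate could be noticeably higher or lower.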

Up to now I was unable to find a cause for these errors in software. Especially that the software is supposed to be deterministic (produce the same result every time, without variation), and yet the results for the same exponent vary (meaning, either my assumption of determinism is wrong, or hardware errors are involved).

Anyway, such a high error rate is bad news. An erroneous result becomes more likely as the computation length increases (higher exponents).

It would be great to have a way to validate the result on every LL iteration ("error detection"). If a wrong iteration can be detected, it can simply be re-tried until correct, and the ghost of unreliable hardware goes away. (of course, there'd be some cost involved in error detection)

preda 2017-04-27 10:39

[QUOTE=airsquirrels;457640]I should have some time this weekend to do ISA dumps and try upgrading drivers / APP SDK to new versions on one of my FuryX Systems.

Also, most consumer cards do occasionally have errors. I have seen them less often on AMD cards than NVIDIA, but they do happen. If it is helpful, I have a system with 3x W9100s in it which have ECC memory and (ideally) do not exhibit hardware errors (100s of double checks agree). If you setup to select the GPU I can run a few double check exponents on those cards to check stability.[/QUOTE]

Yep I'll try to add GPU selection. I don't have a multi-gpu system to test, but maybe it'll "just work" :)

LaurV 2017-04-27 12:55

[QUOTE=preda;457610]But, to keep *all* the old checkpoints around? -- each file is 16MB. If you get 4000 of those, that'd be 64GB, probably too much.[/QUOTE]
Yes please :rolleyes:. With a checkpoint set to 500k or 1M, you have at most 160 files for some 80M exponent. Even with 100K, you have 10 per million, or 800 per test, which is about 13 gig (but this never happens, as they will be cleared by matching with the parallel run; this is not the job of gpuOwl, but of the additional tools). BTW, cudaLucas has two different counters, one for displaying on screen and one for saving checkpoint files, and both are interactive (you can decrement/increment them by pressing t/T, etc).
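The storage figures on both sides of this argument are easy to tabulate. A Python sketch, assuming an 80M exponent, the 16MB per checkpoint quoted above, and decimal gigabytes (the function name is hypothetical):

```python
def checkpoint_storage(exponent, step, file_mb=16):
    """Return (number of checkpoint files, total size in GB) for one full test."""
    files = exponent // step
    return files, files * file_mb / 1000


for step in (20_000, 100_000, 500_000, 1_000_000):
    files, gigabytes = checkpoint_storage(80_000_000, step)
    print(f"every {step:>9} iterations: {files:>4} files, {gigabytes:6.2f} GB")
```

At the default 20k logstep this reproduces the 4000 files and 64GB mentioned earlier, while a 100K step needs 800 files and 12.8 GB.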

But yes, ALL residues have to be kept, if the user chooses so, until he (the user) decides what to do with them.

After a LL/DC match, one can (manually or automatically, from the batch file, tool, etc) delete them, or whatever.

But there are many situations when one may need them.

One may have the surprise that, doing a DC, a non-matching residue pops up; then he/she runs it again and gets the same non-match, with all residues matching all the way, and compares them with cudaLucas or P95 -- that is how you find a hidden bug, in either gpuOwl, or cudaLucas, or (why not?) P95.

For example.

preda 2017-04-27 14:36

@LaurV, I understand that you want a track of the residues (the 64-bit values), that makes sense. Right now the residue sequence is saved to gpuowl.log (but, arguably, the tools may not know how to parse that?).

But do you also need to store the full 16MB checkpoints? the full checkpoint is only needed to allow a re-start from that point. Do you want to re-start from arbitrary points in time, or just need a full track of the residues?

I would like the default behavior to be friendly for a non-expert user (i.e. not fill his storage by default). I'll think about it.

wombatman 2017-04-27 15:01

Why not keep only the two most recent checkpoints (so there's a backup in case the most recent is corrupted or something)? Since the residues are being kept elsewhere, it doesn't seem necessary to retain all the checkpoint files.

preda 2017-04-27 15:27

[QUOTE=wombatman;457678]Why not keep only the two most recent checkpoints (so there's a backup in case the most recent is corrupted or something)? Since the residues are being kept elsewhere, it doesn't seem necessary to retain all the checkpoint files.[/QUOTE]
Yes, right now only two most recent are kept, in save-N.bin and save-N.old. But I'm open to changing that, just need a bit more thinking to what.

preda 2017-04-27 15:36

[QUOTE=preda;457654]Yep I'll try to add GPU selection. I don't have a multi-gpu system to test, but maybe it'll "just work" :)[/QUOTE]
I've added a command line option to select a specific device (see --help for list of devices).

I've tested by playing with running on the CPU. Surprisingly, it worked, but it was sooo slow.. something like 100 times slower than mprime. Which shows both what a good implementation mprime has, and what a poor compiler OpenCL/cpu has.

ewmayer 2017-04-27 21:19

[QUOTE=LaurV;457550]Additional "complaints", besides the fact that the little thief is stealing my hex digits of the residue:

1. When something like this happens: (yes, we can force it, by giving unrealistic work to the card at the same time it is doing gpuOwl):
[CODE]Error 4 is too large!00030000 / 76453229 [0.04%], ms/iter: 5.561, ETA: 4d 22:03; 00000000eb2493b3 error 4 (max 4)[/CODE], then the error should be saved in the file too and carried along to the next restart (use a byte in the header, or so), or the program should exit and resume from an earlier saved file. Otherwise, (first) it is wasting precious time (yes, the residue is wrong; the correct one ends in 2BBB1710, and it continues with the wrong calculation, wasting cycles in vain)[/QUOTE]
Preda, does your code check fractional errors for every convolution output word on every iteration, or not?

preda 2017-04-27 23:50

[QUOTE=ewmayer;457715]Preda, does your code check fractional errors for every convolution output word on every iteration, or not?[/QUOTE]
Yes, it takes into account the rounding error from *every* double-to-long that is done.

This works like this:
On every iteration, a rounding error for every word is computed, and a global maximum of that is updated.

On every logstep (20000), this global maximum is read, printed, and reset to 0.
Thus, while the rounding error is computed on every iteration, it is only visible in aggregated (max) form on every logstep.

But any error in rounding at any point should be caught.
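
The bookkeeping described above can be sketched roughly as follows. This is plain Python rather than the program's OpenCL, and the names `rounding_error`, `ErrorTracker`, `round_words` and `read_and_reset` are hypothetical; only the logic (a per-word error, a running maximum, read and reset at each logstep) follows the description.

```python
def rounding_error(x: float) -> float:
    """Distance from x to the nearest integer, in the range 0.0 .. 0.5."""
    return abs(x - round(x))


class ErrorTracker:
    """Keeps the largest rounding error seen since the last reset."""

    def __init__(self) -> None:
        self.max_err = 0.0

    def round_words(self, words):
        """Round every word to an integer, updating the running maximum."""
        out = []
        for w in words:
            self.max_err = max(self.max_err, rounding_error(w))
            out.append(round(w))
        return out

    def read_and_reset(self) -> float:
        """Called once per logstep: return the maximum and reset it to 0."""
        err, self.max_err = self.max_err, 0.0
        return err
```

An error close to 0.5 means a word was rounded almost arbitrarily, which is why a large value at any single point matters even though only the aggregated maximum is printed.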

LaurV 2017-04-28 00:29

1 Attachment(s)
[QUOTE=preda;457674]But do you also need to store the full 16MB checkpoints? the full checkpoint is only needed to allow a re-start from that point. Do you want to re-start from arbitrary points in time, or just need a full track of the residues?
[/QUOTE]

yes, and yes.

We had all this "argument" for cudaLucas in the past, and I won't be very disappointed to win it again... :razz:

[QUOTE=wombatman;457678]Why not keep only the two most recent checkpoints[/QUOTE]

Not useful. The program is doing that right now. But the two files are usually both wrong, with 50% chance, unless the error happens in the very early stage of the test. When you test with two cards in parallel, one is usually faster and gets ahead (by more than two checkpoints) by the time the other spits out a residue which does not match. At that point, you have a 50% chance that the error is from the slower card, and a 50% chance that... you have to completely restart the other test from scratch?

@Mihai: you can keep the interface as it is, simple for the normal user, keep the log and everything (very useful, but ok, for other programs you can - and I have to do it sometimes - redirect the output to a file) but please provide a "-save" switch (same as cudaLucas, you can call it whatever) which will save all the checkpoints to a "backup" folder. See cudaLucas' .ini file.

Like this:
[ATTACH]15998[/ATTACH]

(mind that these files are even double in size - with today's disks I wouldn't mind. Or, I can increase the number of iterations between checkpoints, to save space)

LaurV 2017-04-28 12:52

Grr.. replying to myself...

We have a bigger problem... Scratch the part with "the program is doing this right now". It does it only when you start it; then it resumes from .bin instead of .new, therefore the progress of the last run is always lost, and you resume from the point where the [U]former[/U] run (the one before last) left off.

[CODE]07610000 / 76453229 [9.95%], ms/iter: 4.653, ETA: 3d 16:58; 00000000c8e14255 error 0.207659 (max 0.256673)
07620000 / 76453229 [9.97%], ms/iter: 4.668, ETA: 3d 17:15; 00000000eca854d1 error 0.211343 (max 0.256673)
07630000 / 76453229 [9.98%], ms/iter: 4.674, ETA: 3d 17:21; 000000003f13c300 error 0.23212 (max 0.256673)
^C
e:\99 - Prime\gpuOwl>gpuowl -logstep 10000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 710000
OpenCL setup: 1092 ms
00720000 / 76453229 [0.94%], ms/iter: 4.668, ETA: 4d 02:12; 000000007c15e2ce error 0.205605 (max 0.205605)
^C
[/CODE]Edit: the workaround would be to manually copy the .new file to .bin after interruption (or before resuming). Scrap the .bin, or rename it. Here the file header (where you can see the iteration number in clear) actually helps a lot! That was a very, very good idea!!!

preda 2017-04-28 14:05

LaurV, the save procedure is supposed to work like this:
1. write checkpoint to save-N.new
2. rename save-N.bin to save-N.old
3. rename save-N.new to save-N.bin

If these steps complete correctly, you should see no .new file, only a .bin and a .old. Maybe... (?) you interrupted (^C) before it finished writing the save-N.new and doing the renames. If that's the case, the correct behavior is to start from .bin, which at least is correct, while the partially-written .new may be half-garbage.

Anyway, the writing and renaming is pretty fast, it's a bit surprising if you interrupted that in the middle. I wonder if something else is taking place. Do you always see the .new file? (you shouldn't).

Also, maybe you should get a fresh build (or is the residue still not fixed?)

VictordeHolland 2017-04-28 15:02

[QUOTE=preda;457777]LaurV, the save procedure is supposed to work like this:
1. write checkpoint to save-N.new
2. rename save-N.bin to save-N.old
3. rename save-N.new to save-N.bin

If these steps complete correctly, you should see no .new file, only a .bin and a .old. Maybe... (?) you interrupted (^C) before it finished writing the save-N.new and doing the renames. If that's the case, the correct behavior is to start from .bin, which at least is correct, while the partially-written .new may be half-garbage.

Anyway, the writing and renaming is pretty fast, it's a bit surprising if you interrupted that in the middle. I wonder if something else is taking place. Do you always see the .new file? (you shouldn't).
[/QUOTE]
When I start gpuOwl on an exponent the first time, it first creates the .bin and .old files and uses/renames them as intended (step 1,2,3). Then I [control+c] to stop. When I restart, gpuOwl resumes from the .bin (as intended) and creates a .new file when it checkpoints, but doesn't rename the .bin to .old and .new to .bin (steps 2 and 3). In other words, it keeps [U]overwriting[/U] the .new file when resumed from a checkpoint.

[quote]Also, maybe you should get a fresh build (or is the residue still not fixed?)[/quote]I'll try a rebuild tonight.

VictordeHolland 2017-04-28 16:07

1 Attachment(s)
I had to edit clwrap.h so it would build with OpenCL 1.2. (I noticed you changed the code to check the OpenCL version, but it stops/closes with error -11 before trying to compile for a version lower than 2.0. Might be a MINGW/Windows/OpenCL thing.)

This build has the changes up to and including 2017-04-27
[code]
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2079.5)
LL FFT 4096K (1024*2048*2) of 20996011 (5.01 bits/word) at iteration 0
OpenCL setup: 702 ms
00020000 / 20996011 [0.10%], ms/iter: 5.647, ETA: 1d 08:54; d0456dd0d24132a4 error 3.72529e-009 (max 3.72529e-009)
00040000 / 20996011 [0.19%], ms/iter: 5.652, ETA: 1d 08:54; 7ad6db44e3c09980 error 3.72529e-009 (max 3.72529e-009)
00060000 / 20996011 [0.29%], ms/iter: 5.645, ETA: 1d 08:50; 151aeac2ef5d7d56 error 3.72529e-009 (max 3.72529e-009)
00080000 / 20996011 [0.38%], ms/iter: 5.646, ETA: 1d 08:48; a225288c032c08c3 error 3.72529e-009 (max 3.72529e-009)
00100000 / 20996011 [0.48%], ms/iter: 5.645, ETA: 1d 08:46; 988b9ccffadb977c error 4.65661e-009 (max 4.65661e-009)
Error jump by 25.00%, doing a consistency check.
00100000 / 20996011 [0.48%], ms/iter: 5.648, ETA: 1d 08:47; 988b9ccffadb977c error 4.65661e-009 (max 4.65661e-009)
Consistency checked OK, continuing.
00120000 / 20996011 [0.57%], ms/iter: 5.639, ETA: 1d 08:42; 61ec00e975266565 error 3.72529e-009 (max 4.65661e-009)
00140000 / 20996011 [0.67%], ms/iter: 5.646, ETA: 1d 08:42; a251cbe5b6f9fc4c error 3.72529e-009 (max 4.65661e-009)
00160000 / 20996011 [0.76%], ms/iter: 5.650, ETA: 1d 08:42; fb45cd2cdf21ae51 error 3.72529e-009 (max 4.65661e-009)
00180000 / 20996011 [0.86%], ms/iter: 5.645, ETA: 1d 08:38; 884ef2fc91d40df7 error 3.72529e-009 (max 4.65661e-009)
00200000 / 20996011 [0.95%], ms/iter: 5.639, ETA: 1d 08:35; f21190923cc9a293 error 3.72529e-009 (max 4.65661e-009)
00220000 / 20996011 [1.05%], ms/iter: 5.635, ETA: 1d 08:31; 0b197c09290d0af9 error 3.72529e-009 (max 4.65661e-009)
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2079.5)
LL FFT 4096K (1024*2048*2) of 20996011 (5.01 bits/word) at iteration 220000
OpenCL setup: 764 ms
00240000 / 20996011 [1.14%], ms/iter: 5.659, ETA: 1d 08:38; f59e2a1db0e708e6 error 3.72529e-009 (max 3.72529e-009)
00260000 / 20996011 [1.24%], ms/iter: 5.665, ETA: 1d 08:38; 17c25e23bccefca6 error 3.72529e-009 (max 3.72529e-009)[/code]I noticed you implemented extra error checking when error jumps by more than 25% :smile: .
The display of the residues is fixed, but on resumes I still get a .new checkpoint file that is not renamed to .bin (nor the .bin to .old). So I have to be careful to rename the .new to .bin myself before restarting.

I think most of us would prefer something like this:
save-20996011-00100000.bin
save-20996011-00120000.bin
save-20996011-00140000.bin
save-20996011-00160000.bin
(or an option to specify how often it writes a checkpoint).

Edit:
Attached the "-cl -save-temps" from my HD7950

VictordeHolland 2017-04-28 18:36

[code]C:\GPUOWLv0.1>gpuowl.exe --help
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Command line options:
-cl <CL compiler options>
e.g. -cl -save-temps or -cl -save-temps=prefix or -cl -save-temps=folder/
to save the compiled ISA
-logstep <N> : to log every <n> iterations (default 20000)
-device <N> : select specific device among:
0 : Tahiti; OpenCL 1.2 AMD-APP (2079.5)
1 : Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz; OpenCL 1.2 AMD-APP (2079.5)

C:\GPUOWLv0.1>gpuowl.exe -device 0
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2079.5)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 20996011 (5.01 bits/word) at iteration 0
OpenCL setup: 729 ms

C:\GPUOWLv0.1>gpuowl.exe -device 1
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz; OpenCL 1.2 AMD-APP (2079.5)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 20996011 (5.01 bits/word) at iteration 0
OpenCL setup: 728 ms
[/code]The -device command seems to be working (I only have 1 GPU installed, the other is the CPU which is also OpenCL capable).

[code]
C:\GPUOWLv0.1>gpuowl.exe -device 1 -logstep 100
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz; OpenCL 1.2 AMD-APP (2079.5)
LL FFT 4096K (1024*2048*2) of 20996011 (5.01 bits/word) at iteration 0
OpenCL setup: 646 ms
00000100 / 20996011 [0.00%], ms/iter: 693.350, ETA: 168d 11:45; 4e146021da95925d error 3.72529e-009 (max 3.72529e-009)[/code]
It seems to be working on the CPU, albeit a little slow :mike::mike: (which was to be expected).
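
For scale, the ETA column appears to be just the remaining iterations multiplied by the time per iteration. A small sketch, where the function name `eta` is hypothetical and truncating to whole minutes is an assumption about how the program rounds:

```python
def eta(exponent: int, iteration: int, ms_per_iter: float) -> str:
    """Format the remaining time as 'Nd HH:MM', truncated to whole minutes."""
    seconds = int((exponent - iteration) * ms_per_iter / 1000)
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    return f"{days}d {hours:02d}:{rest // 60:02d}"


# CPU line above: iteration 100 of 20996011 at 693.350 ms/iter
print(eta(20996011, 100, 693.350))
# GPU line from the earlier log: iteration 20000 at 5.647 ms/iter
print(eta(20996011, 20000, 5.647))
```

Because the printed ms/iter is itself rounded, a result recomputed this way can differ from the logged ETA by a minute.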

LaurV 2017-04-29 06:37

[QUOTE=preda;457777]
2. rename save-N.bin to save-N.old
[/QUOTE]
No, it does not. Some OSes, or OS policy settings, will not let you do this unless you [U]explicitly[/U] delete the .old file first. This is where the process may fail. Therefore yes, I always see 3 files, .new, .bin, and .old; the first one changes every time a new line is written on the screen, while the other two stay as they were when the test started (resumed) and never change. Ex:

[CODE]
.new: "LL1 76453229 10420000 1024 2048 0"
.bin: "LL1 76453229 920000 1024 2048 0"
.old: "LL1 76453229 910000 1024 2048 0"
[/CODE]When it resumes, if I didn't manually tweak the files (deleting the last 2 and renaming .new to .bin), then the last 10M iterations are gone (this happened a few times already until we realized; we are slower in this part of the world :razz:, this exponent should have been finished by now!). This is what Victor describes too.
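
Since the header is plain text, picking the right file can be scripted. A minimal sketch, assuming the six fields are a format tag, the exponent, the iteration, and the two FFT dimensions (1024 and 2048 match the "1024*2048*2" in the log); the meaning of the last field is not stated anywhere in this thread, so it is kept as `extra`. The names `Header` and `parse_header` are hypothetical.

```python
from typing import NamedTuple


class Header(NamedTuple):
    exponent: int
    iteration: int
    width: int
    height: int
    extra: int  # meaning not stated in the thread; kept as-is


def parse_header(line: str) -> Header:
    """Parse a header line such as 'LL1 76453229 10420000 1024 2048 0'."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "LL1":
        raise ValueError(f"unrecognized checkpoint header: {line!r}")
    return Header(*(int(f) for f in fields[1:]))


# The three headers from the example above, keyed by file extension.
headers = {
    ".new": "LL1 76453229 10420000 1024 2048 0",
    ".bin": "LL1 76453229 920000 1024 2048 0",
    ".old": "LL1 76453229 910000 1024 2048 0",
}
# Pick the file that is furthest along.
newest = max(headers, key=lambda ext: parse_header(headers[ext]).iteration)
```

Comparing the iteration as an integer avoids the trap of comparing "920000" and "10420000" as strings.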

Right now I got a new build from Victor, I will give it a try.

ttyl

LaurV 2017-04-29 06:53

Well, I should have written ttys.

[LIST][*]1. stealing the hex digits from the residue has been fixed, now the residues appear right. The little thief was properly punished.[/LIST][LIST][*]2. stopping after error - good! (told you I can make this card give errors, just by giving it other crap to do, from other programs). Now assuming that the "restarting" problem is solved properly, and all checkpoint files are kept, we can deal properly (by external batch) with a card that gives errors.[/LIST][CODE]10410000 / 76453229 [13.62%], ms/iter: 4.912, ETA: 3d 18:07; 00000000474e3953 error 0.208918 (max 0.256673)
10420000 / 76453229 [13.63%], ms/iter: 4.917, ETA: 3d 18:11; 00000000f52f0fd0 error 0.217105 (max 0.256673)
^C
e:\99 - Prime\gpuOwl>gpuowl -logstep 10000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2348.3)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 10420000
OpenCL setup: 1090 ms
10430000 / 76453229 [13.64%], ms/iter: 4.723, ETA: 3d 14:37; 6bc4105d793b06df error 0.211732 (max 0.211732)
10440000 / 76453229 [13.66%], ms/iter: 4.719, ETA: 3d 14:32; 3eede0f0d1b058fa error 4 (max 4)
Error 4 is way too large, stopping.

Bye

e:\99 - Prime\gpuOwl>[/CODE][LIST][*] 3. The resuming problem is not solved. After the story above, the .new file shows 10440000, but the other two still show 10420000 (they were both manually named from the former .new before starting the test). In the case above, the last run (here fortunately only 10k iterations) is lost, because there is no 10430000 residue file anywhere (the .new was overwritten by 10440000), and the 10440000 file/residue is obviously wrong. Imagine if those were 5 million iterations, instead of 10k.[/LIST]

LaurV 2017-04-29 07:28

Now, it pissed me off - this program is so nice and so fast that it would be a pity for it to be overshadowed by such a little thing. Therefore I "solved" it, in my own way :razz:

(rename this to "start_owl.bat" or whatever)

[CODE]@echo off

if exist *.new (
del /q *.old
ren *.bin *.old
ren *.new *.bin
)

start /b gpuOwl -logstep 10000

:loop1

if exist *.new (
del /q *.old
ren *.bin *.old
ren *.new *.bin
) else (
timeout /t 2 /nobreak > nul
)
goto loop1
[/CODE]He he he... Next step is to parse the screen output every time the rename file is done, extract the residue from the printed line, and write a copy of the .new in a "./backup" folder, with the name "76453229.10480000.3aa7f82516517620.ckp", and I won't bother you anymore :razz:

Or parse the file and extract the residue from the FFT stored there, hehe...

Just kidding. Picking on you, because I know this is extremely easy to solve directly in your source code. But the batch works properly. (edit: or not? now because of /b of the start command, you need to use ctrl+break to stop the program, as ctrl+c will only stop the batch, hehe, see help for start command... well.. such is life... give with one hand, take with the other...)

(we have nothing to do on a Saturday afternoon...)

preda 2017-04-29 09:15

Thanks Laur and Victor for the feedback -- I was out for the day, thus slow action.

It seems indeed the problem is with rename on Windows, which fails if destination exists.
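
For illustration, here is the three-step save written with a rename that replaces an existing destination. This is a Python sketch, not the program's C++ code: `save_checkpoint` and the file prefix are hypothetical names, and `os.replace` stands in for whatever overwrite-capable rename the real fix ends up using.

```python
import os


def save_checkpoint(prefix: str, data: bytes) -> None:
    """Write prefix.new, then rotate .bin -> .old and .new -> .bin."""
    new, cur, old = prefix + ".new", prefix + ".bin", prefix + ".old"

    # 1. Write the full checkpoint to the temporary name first.
    with open(new, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # 2. Keep the previous checkpoint as a backup. os.replace overwrites
    #    an existing destination on both Windows and POSIX.
    if os.path.exists(cur):
        os.replace(cur, old)

    # 3. Promote the new checkpoint.
    os.replace(new, cur)
```

If the process is interrupted during step 1, the .bin is untouched and still valid; a leftover .new can then be ignored or deleted on the next start.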

Concerning the file naming scheme, what do you think about this:
- create a sub-folder for each exponent (e.g. "77000001/")
- in that subfolder, store the checkpoints
- in files named s<exponent>-<iteration>-<residue>.owl

On start, the desired exponent is obtained from worktodo.txt. The folder for that exponent is scanned, the file with the largest iteration value is selected, and the run starts from there.

Probably a command option would be needed to specify a different iteration to start from (other than the last).

A few potential problems:
- I need to list the folder to find the most recent iteration,
- There can be different iteration number (or exponent) in the file name, and inside the file. Which one to use, or report error? (e.g. because the user renamed the file). (i.e., this problem is created by duplicating information in the file and in the file name)
- specifying the start point requires command option (while before was just moving some checkpoint file to save-N.bin)
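
To make the proposal concrete, a rough Python sketch of the naming and of the "largest iteration" scan. Everything here is an assumption layered on the scheme above: the zero-padded iteration, the 16-hex-digit residue, and the names `checkpoint_name` and `latest_checkpoint` are all hypothetical.

```python
import re

# s<exponent>-<iteration>-<residue>.owl, residue assumed to be 16 hex digits
NAME_RE = re.compile(r"^s(\d+)-(\d+)-([0-9a-f]{16})\.owl$")


def checkpoint_name(exponent: int, iteration: int, residue: int) -> str:
    """Build a file name following the proposed scheme."""
    return f"s{exponent}-{iteration:08d}-{residue:016x}.owl"


def latest_checkpoint(names, exponent: int):
    """Return the name with the largest iteration for this exponent, or None."""
    best = None
    for name in names:
        m = NAME_RE.match(name)
        if m and int(m.group(1)) == exponent:
            iteration = int(m.group(2))
            if best is None or iteration > best[0]:
                best = (iteration, name)
    return best[1] if best else None
```

In practice `names` would come from listing the exponent's folder (e.g. `os.listdir`). The iteration is compared as a number, so the result does not depend on whether file names are zero-padded. The sketch does not address the second problem in the list: it trusts the file name and never checks it against the header inside the file.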

LaurV 2017-04-29 13:06

[QUOTE=preda;457867]
- I need to list the folder to find the most recent iteration[/QUOTE]
You can keep exactly the things you are doing now (or well, intended to) with the .new, .bin and .old files. That is quite ok. Additionally, every time you do the renaming trick, just make a copy of the .new file (that becomes .bin) into a ./backup folder. And for the copy that you put in the backup, use the name "exponent.iteration.residue.txt"; that is all we need. (edit: do not fill the iteration number with zeros to the left, for the file name. For the screen it is ok as it is, zero-filled, it looks nicer) The program should resume from the .bin file, as it is doing now (or is supposed to be doing now). Do not waste your time and resources on implementing the searching through folders and files, resuming from the newest, etc. This may be detrimental in some situations where we do not need to resume from the "newest" (it may be a wrong one, or assumed wrong due to mismatching residues with a parallel run).
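
What is being asked for amounts to something like the following Python sketch. The function name `backup_checkpoint` and its parameters are hypothetical; the only parts taken from the request are the backup folder and the "exponent.iteration.residue.txt" name without zero-filling.

```python
import os
import shutil


def backup_checkpoint(bin_path: str, backup_dir: str,
                      exponent: int, iteration: int, residue_hex: str) -> str:
    """Copy the current checkpoint to backup_dir/exponent.iteration.residue.txt."""
    os.makedirs(backup_dir, exist_ok=True)
    # The iteration is deliberately not zero-filled in the file name.
    name = f"{exponent}.{iteration}.{residue_hex}.txt"
    dest = os.path.join(backup_dir, name)
    shutil.copyfile(bin_path, dest)
    return dest
```

The program would call this right after promoting .new to .bin, and would still always resume from the .bin; choosing which backup to put back is left to the user.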

cudaLucas and clLucas do as I described. They use cXXX and tXXX files (in the same folder where the program is running) instead of .bin and .old, but the idea is the same. Additionally, you can select a different iteration step for printing on screen and for saving checkpoints, but that is not mandatory for this step. Also, you can select by command line switch and/or .ini file option whether you need to save files or not. But that is already too much to request.

It is the user who needs to take care of putting the proper files (.bin actually) in the folder when he wants to resume from a certain point. The program does not need to know about this. Do not waste time making it "too intelligent"; you cannot cover all situations anyhow, and some guys (like me hehe) will always be not fully satisfied :razz:

Creating a subfolder for each exponent, as you suggested, it is a good idea, it can be helpful, but it is not mandatory.

Thanks a billion for your support! You will make a lot of AMD cards users very happy here!

ATH 2017-04-29 16:18

You can "rename" in Windows with:
move /y save-N.bin save-N.old
move /y save-N.new save-N.bin

LaurV 2017-04-30 14:26

1 Attachment(s)
To make my point: a real situation where all the residue files are needed.

An example where two consecutive errors may render both .bin and .old wrong. When this error happened, I backed up the "old" file to keep it from being overwritten, then I resumed from the bin, as normal. The residues continued to be wrong (they didn't match the list I had generated previously with cudaLucas). So the .bin was useless. This of course overwrote the .old file, but I had the backup. I renamed the backup to .bin and started again, and the residues continued to be wrong, which proved that the .old file was also corrupted, and useless, because of the strange, double, interruption.

[CODE]17780000 / 76453229 [23.26%], ms/iter: 4.662, ETA: 3d 03:59; 392743fdc2a2f383 error 0.207586 (max 0.240115)
17790000 / 76453229 [23.27%], ms/iter: 4.662, ETA: 3d 03:58; 5703098adab6543a error 0.210609 (max 0.240115)
17800000 / 76453229 [23.28%], ms/iter: 4.659, ETA: 3d 03:54; e1d44791a972db41 error 0.209469 (max 0.240115)
17810000 / 76453229 [23.30%], ms/iter: 4.662, ETA: 3d 03:57; [COLOR=Blue]9f05c0542334aab9 [/COLOR]error 0.207559 (max 0.240115)
17820000 / 76453229 [23.31%], ms/iter: 4.657, ETA: 3d 03:51; [COLOR=Red]44a524f8c4f2bd14 [/COLOR]error 0.5 (max 0.5)
Error jump by 135.36%, doing a consistency check.
17820000 / 76453229 [23.31%], ms/iter: 4.658, ETA: 3d 03:52; [COLOR=Red]07aff04f32a7490c [/COLOR]error 0.5 (max 0.5)
Consistency check FAILED, something is wrong, stopping.

Bye
Terminate batch job (Y/N)? y

e:\99 - Prime\gpuOwl>start_gpuowl
A subdirectory or file backups already exists.
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2348.3)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 17820000
OpenCL setup: 1100 ms
17830000 / 76453229 [23.32%], ms/iter: 4.680, ETA: 3d 04:13; [COLOR=Red]7665ef2bb6cf4b56 [/COLOR]error 0.217532 (max 0.217532)
17840000 / 76453229 [23.33%], ms/iter: 4.681, ETA: 3d 04:13; [COLOR=Red]4be1026dbc904b6b [/COLOR]error 0.218731 (max 0.218731)
Terminate batch job (Y/N)? y

e:\99 - Prime\gpuOwl>start_gpuowl
A subdirectory or file backups already exists.
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2348.3)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 17820000
OpenCL setup: 1100 ms
17830000 / 76453229 [23.32%], ms/iter: 4.680, ETA: 3d 04:13; [COLOR=Red]effe0eb1c009784b [/COLOR]error 0.211787 (max 0.211787)
17840000 / 76453229 [23.33%], ms/iter: 4.681, ETA: 3d 04:13; [COLOR=Red]4835c363b7e08f9a [/COLOR]error 0.214877 (max 0.214877)
Terminate batch job (Y/N)? y

e:\99 - Prime\gpuOwl>[/CODE](Edit: coloring: red is bad, blue is good)

[B]In this case, all the work from the beginning would have been lost, and in need to be redone from scratch, as both .bin and .old would have been corrupted.[/B]

Fortunately, my batch file "evolved" :razz: in the meantime: a "set n=0" and a "md backups" were added to its init part (you see the harmless error about the folder already existing, in subsequent runs), then a "set /a n+=1" and a "copy /b *.bin backups\*.%n%.txt >nul" were added into the loop, besides increasing the timeout a bit. As a result, now I have a folder full of former residue files, looking like:
[ATTACH]16026[/ATTACH]

One such file is saved every few cycles of renaming the new to bin (depending on the timeout). Note that I didn't bother with the residue and iteration number, but those would be needed for direct comparison with files generated by another program, in a parallel run.

Anyhow. The fact that each file has a header which says the iteration number, was very helpful in renaming the right one to .new and moving this .new out of the folder (into gpuowl's folder), therefore I could resume, and this time got the right residues (matching the CL run):

[CODE]e:\99 - Prime\gpuOwl>start_gpuowl
A subdirectory or file backups already exists.
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2348.3)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 17700000
OpenCL setup: 1100 ms
17710000 / 76453229 [23.16%], ms/iter: 4.667, ETA: 3d 04:09; 21b1dd8eba653c95 error 0.240115(max 0.240115)
17720000 / 76453229 [23.18%], ms/iter: 4.669, ETA: 3d 04:10; 36dda9b0bd9b17f8 error 0.216413 (max 0.240115)
17730000 / 76453229 [23.19%], ms/iter: 4.667, ETA: 3d 04:08; 5d4b7b286715202c error 0.205101 (max 0.240115)
17740000 / 76453229 [23.20%], ms/iter: 4.668, ETA: 3d 04:08; 11a52d8cfe87e44b error 0.215326 (max 0.240115)
17750000 / 76453229 [23.22%], ms/iter: 4.665, ETA: 3d 04:04; 9c18d1c83a6d7c31 error 0.236088 (max 0.240115)
17760000 / 76453229 [23.23%], ms/iter: 4.665, ETA: 3d 04:03; 2d0eb7c0f7e6c41d error 0.205023 (max 0.240115)
17770000 / 76453229 [23.24%], ms/iter: 4.668, ETA: 3d 04:06; 3130dbc0017fe758 error 0.20842 (max 0.240115)
17780000 / 76453229 [23.26%], ms/iter: 4.668, ETA: 3d 04:05; 392743fdc2a2f383 error 0.207586 (max 0.240115)
17790000 / 76453229 [23.27%], ms/iter: 4.667, ETA: 3d 04:03; 5703098adab6543a error 0.210609 (max 0.240115)
17800000 / 76453229 [23.28%], ms/iter: 4.665, ETA: 3d 04:00; e1d44791a972db41 error 0.209469 (max 0.240115)
17810000 / 76453229 [23.30%], ms/iter: 4.662, ETA: 3d 03:57; [COLOR=Blue]9f05c0542334aab9 [/COLOR]error 0.207559 (max 0.240115)
17820000 / 76453229 [23.31%], ms/iter: 4.660, ETA: 3d 03:54; [COLOR=Blue]b8f79ec2fb82747f [/COLOR]error 0.200293 (max 0.240115)
17830000 / 76453229 [23.32%], ms/iter: 4.660, ETA: 3d 03:53; [COLOR=Blue]c024704b64d23208 [/COLOR]error 0.208741 (max 0.240115)
17840000 / 76453229 [23.33%], ms/iter: 4.658, ETA: 3d 03:50; [COLOR=Blue]90698abad62aab00 [/COLOR]error 0.214988 (max 0.240115)
17850000 / 76453229 [23.35%], ms/iter: 4.661, ETA: 3d 03:52; 0ae501f3988937e8 error 0.222933 (max 0.240115)
[/CODE]

LaurV 2017-04-30 15:00

To be clear, the error is caused by the hardware, not by the program. The program does a wonderful job in catching the error!

My quest is (first) with properly resuming after an error or a ctrl+c, and (second) with having saved all the residue files, and having them properly named (CL style).

airsquirrels 2017-04-30 22:56

[QUOTE=airsquirrels;457072]Pretty fast for hand rolled code (clFFT has had a lot of resources put into it by AMD), but I'm definitely not seeing the performance levels indicated above. Still slower than clLucas 1.04. Anything I should check? This is a FuryX

Good news is, the residues match for the first chunk of iterations.

[CODE]gpuOwL v0.1 GPU Lucas-Lehmer primality checker
LL of 75002911 at iteration 0
FFT 1024*2048 (4M words, 17.88 bits per word)
OpenCL compile: 1106 ms
setup: 1638 ms
00020000 / 75002911 [0.03%], ms/iter: 5.413, ETA: 4d 16:45; 777b6635d6b78b75 error 0.125 (max 0.125)
00040000 / 75002911 [0.05%], ms/iter: 5.422, ETA: 4d 16:54; b9fc5678347cad9f error 0.140625 (max 0.140625)
00060000 / 75002911 [0.08%], ms/iter: 5.413, ETA: 4d 16:40; e7fab5c1f11d0f39 error 0.125 (max 0.140625)
00080000 / 75002911 [0.11%], ms/iter: 5.423, ETA: 4d 16:52; 76a6fb920dd95b71 error 0.140625 (max 0.140625)[/CODE]

[CODE]Continuing work from a partial result of M75002911 fft length = 4096K iteration = 1001
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1875 (0:36 real, 3.5486 ms/iter, ETA 73:55:06)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9315 ms/iter, ETA 81:53:04)
Iteration 30000 M( 75002911 )C, 0x0f0c343e5174fa89, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9205 ms/iter, ETA 81:38:41)
Iteration 40000 M( 75002911 )C, 0xb9fc5678347cad9f, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9280 ms/iter, ETA 81:47:22)
Iteration 50000 M( 75002911 )C, 0x992a088a20504a90, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9407 ms/iter, ETA 82:02:32)
Iteration 60000 M( 75002911 )C, 0xe7fab5c1f11d0f39, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9230 ms/iter, ETA 81:39:51)
Iteration 70000 M( 75002911 )C, 0x89386b82336fc06d, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9320 ms/iter, ETA 81:50:26)
Iteration 80000 M( 75002911 )C, 0x76a6fb920dd95b71, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9236 ms/iter, ETA 81:39:18)[/CODE][/QUOTE]


I've set up a different Ubuntu 16.30 based system with the latest AMDGPU-PRO driver 17.10 on a FuryX. In this case, timing is incredibly improved.

[CODE]gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Fiji; OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 75002911 (17.88 bits/word) at iteration 0
OpenCL setup: 2419 ms
00020000 / 75002911 [0.03%], ms/iter: 2.400, ETA: 2d 02:00; 777b6635d6b78b75 error 0.125 (max 0.125)
00040000 / 75002911 [0.05%], ms/iter: 2.432, ETA: 2d 02:38; b9fc5678347cad9f error 0.125 (max 0.125)
00060000 / 75002911 [0.08%], ms/iter: 2.432, ETA: 2d 02:38; e7fab5c1f11d0f39 error 0.125 (max 0.125)[/CODE]

I also ran clLucas on the FuryX with the same system/driver with clFFT 2.12.2, which did not perform as well. Interestingly, the older driver on the previous system was faster for clLucas.
[CODE]....
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1914 (0:48 real, 4.7985 ms/iter, ETA 99:57:18)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1914 (0:48 real, 4.8387 ms/iter, ETA 100:46:48)
....
[/CODE]

I also tested clLucas vs. gpuOwl on an RX480 in the same Ubuntu/AMDGPU-PRO system, which yielded very good numbers:
[CODE]
(clLucas)
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1914 (0:52 real, 5.1695 ms/iter, ETA 107:40:57)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1914 (0:52 real, 5.1787 ms/iter, ETA 107:51:42)
Iteration 30000 M( 75002911 )C, 0x0f0c343e5174fa89, n = 4096K, clLucas v1.04 err = 0.1914 (0:52 real, 5.2092 ms/iter, ETA 108:28:54)

(gpuOwl)
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Ellesmere; OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 75002911 (17.88 bits/word) at iteration 0
OpenCL setup: 2419 ms
00020000 / 75002911 [0.03%], ms/iter: 3.677, ETA: 3d 04:35; 777b6635d6b78b75 error 0.125 (max 0.125)
00040000 / 75002911 [0.05%], ms/iter: 3.691, ETA: 3d 04:51; b9fc5678347cad9f error 0.125 (max 0.125)
[/CODE]

Finally, here are numbers for a W9100 (Hawaii) using the 15.2 fglrx driver and AMD APPSDK 3.0:
[CODE]
(W9100 - gpuOwl)
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Hawaii; OpenCL 2.0 AMD-APP (1912.5)
LL FFT 4096K (1024*2048*2) of 75002911 (17.88 bits/word) at iteration 0
OpenCL setup: 888 ms
00020000 / 75002911 [0.03%], ms/iter: 3.180, ETA: 2d 18:14; 777b6635d6b78b75 error 0.140625 (max 0.140625)
00040000 / 75002911 [0.05%], ms/iter: 3.138, ETA: 2d 17:21; b9fc5678347cad9f error 0.132812 (max 0.140625)

(W9100 - clLucas)
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1914 (0:47 real, 4.7813 ms/iter, ETA 99:35:46)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1914 (0:48 real, 4.7354 ms/iter, ETA 98:37:42)
[/CODE]


The doubled performance is pretty amazing - now we just need more FFT sizes :)

preda 2017-05-01 05:14

[QUOTE=LaurV;457959]To be clear, the error is caused by the hardware, not by the program. The program does a wonderful job in catching the error!

My quest is (first) with properly resuming after an error or a ctrl+c, and (second) with having saved all the residue files, and having them properly named (CL style).[/QUOTE]
LaurV, I just updated gpuOwL on github with these new changes:

- persist checkpoint every "savestep" (new command line argument, defaulting to 500 * logstep).
- use new name format for persist checkpoints (but with final extension .ll)
- use new checkpoint format. The human-readable info is now at the end. Can be printed nicely with:
"tail -1 c<N>.ll" (i.e. use "tail" to print only the very last line, which is the human-readable part).
- in general, use file naming in the style of CUDALucas but with .ll extension

There may be bugs/problems with these new things, looking for feedback :)

Not done yet: no sub-folders.

preda 2017-05-01 05:17

And a couple of other fixes:
- add a trivial checksum, to catch partially-written checkpoints.
- correctly handle multiple OpenCL "platforms" (discover all the devices in some multi-device setups)
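
The thread does not say what the checksum actually is, so the following Python sketch only illustrates the general idea with a made-up one: a 32-bit byte sum appended to the payload. The names `checksum`, `pack` and `unpack` are hypothetical.

```python
import struct


def checksum(payload: bytes) -> int:
    """A deliberately trivial checksum: the byte sum modulo 2**32."""
    return sum(payload) & 0xFFFFFFFF


def pack(payload: bytes) -> bytes:
    """Append the checksum as a little-endian 32-bit word."""
    return payload + struct.pack("<I", checksum(payload))


def unpack(blob: bytes) -> bytes:
    """Return the payload, or raise if the stored checksum does not match."""
    if len(blob) < 4:
        raise ValueError("checkpoint too short")
    payload = blob[:-4]
    (stored,) = struct.unpack("<I", blob[-4:])
    if checksum(payload) != stored:
        raise ValueError("checksum mismatch: truncated or corrupted checkpoint")
    return payload
```

A sum like this is enough to reject most half-written files, but it is not a strong integrity check: it cannot detect reordered bytes, for example.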

preda 2017-05-01 05:25

[QUOTE=airsquirrels;457975]I've setup a different Ubuntu 16.30 based system with the latest AMDGPU-PRO driver 17.10 on a FuryX. In this case, timing is incredibly improved.
[/QUOTE]
Nice, that sort of answers the question ("where did the performance go? -- ask the OpenCL compiler") and saves me the effort to perf-debug the ISA dumps.

[QUOTE=airsquirrels;457975]
The doubled performance is pretty amazing - now we just need more FFT sizes :)[/QUOTE]

OK, what FFT sizes do you need? (and why?)

preda 2017-05-01 05:32

BTW, did you notice the improved error margin as well? Not a huge deal, but it does extend a bit the exponent range available for a given FFT size (which 'radically' changes the cost for the exponents 'on the border' that now become included in the lower, power-of-two FFT).
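
As a quick check on what "exponent range for a given FFT size" means in numbers: the bits/word figure printed in the logs matches the exponent divided by the number of words, with the word count read off the "1024*2048*2" in the log line. A small Python sketch (the function name `bits_per_word` is hypothetical):

```python
def bits_per_word(exponent: int, width: int = 1024, height: int = 2048) -> float:
    """Average number of bits of the Mersenne number stored per FFT word."""
    words = width * height * 2  # 4096K words for the current FFT size
    return exponent / words


for p in (76453229, 75002911, 20996011):
    print(p, f"{bits_per_word(p):.2f}")
```

The more bits per word, the larger the rounding error, so a smaller error at the same bits/word leaves room for somewhat larger exponents before a bigger FFT is needed.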

LaurV 2017-05-01 11:54

Yes, we said in the very first post that some 77M exponents can be done with the 4096K FFT, in spite of the fact that c*Lucas wants more, and we appreciate this!

Short question: Does the new file format imply that I cannot resume from the old format? (If so, then I will have to wait to finish 76453229 first before playing with the new version, sorry. You do not have to do anything in this direction; whatever format you choose for the future, it is ok with us.)

Mark Rose 2017-05-01 14:37

[QUOTE=preda;457998]OK, what FFT sizes do you need? (and why?)[/QUOTE]

Well DC is basically beyond 2048K at this time, mostly in the 2560K range, but starting to reach into the 3072K range. LL, as you have discovered, is in the 4096K and 4608K range.

But it really comes down to what exact FFT sizes will be fastest for the hardware.

henryzz 2017-05-01 16:04

There are many people who would appreciate a fast fft modulo k*2^n-1 for the LLR test for the GPU

Mark Rose 2017-05-01 16:31

[QUOTE=henryzz;458021]There are many people who would appreciate a fast fft modulo k*2^n-1 for the LLR test for the GPU[/QUOTE]

K*2^n+1 would also be nice.

rogue 2017-05-01 16:42

[QUOTE=henryzz;458021]There are many people who would appreciate a fast fft modulo k*2^n-1 for the LLR test for the GPU[/QUOTE]

To be more specific: k*b^n+1 and k*b^n-1

preda 2017-05-01 17:20

[QUOTE=LaurV;458012]
Short question: Does the new format of the file imply that I cannot resume from the old format? (If so, then I will have to wait to finish 76453229 first before playing with the new version, sorry. You do not have to do anything in this direction; whatever format you choose for the future, it is OK with us.)[/QUOTE]

No, the program still reads the old format from save-N.bin if the new file (cN.ll) is not there. So it should be possible to "switch format" in the middle.
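
The fallback described above could look roughly like this. It is only a sketch: the file names `cN.ll` and `save-N.bin` come from the post, but treating N as the exponent in both names and the helper name `find_checkpoint` are assumptions.

```python
from pathlib import Path
from typing import Optional

def find_checkpoint(exponent: int, directory: Path) -> Optional[Path]:
    """Prefer the new-format file; fall back to the old one if it is absent."""
    new_file = directory / f"c{exponent}.ll"       # new format
    old_file = directory / f"save-{exponent}.bin"  # old format (N assumed to be the exponent)
    if new_file.exists():
        return new_file
    if old_file.exists():
        return old_file
    return None  # nothing to resume from: start at iteration 0
```

With this ordering, a run started under the old version is picked up from `save-N.bin` once, and later restarts use the new file as soon as it has been written.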

airsquirrels 2017-05-01 23:37

I could load gpuOwl on all of my AMD systems (with the latest driver) and concentrate on lots of DC in the 4096k range.

How many exponents that need D.C. would be covered more efficiently by this implementation vs clLucas? That's an interesting question.

Personally I'd love a mid range option to continue working on the small end of the D.C. backlog. And a big option for 100M digits ;)

Mark Rose 2017-05-02 00:24

[QUOTE=airsquirrels;458060]I could load gpuOwl on all of my AMD systems (with the latest driver) and concentrate on lots of DC in the 4096k range.

How many exponents that need D.C. would be covered more efficiently by this implementation vs clLucas? That's an interesting question.

Personally I'd love a mid range option to continue working on the small end of the D.C. backlog. And a big option for 100M digits ;)[/QUOTE]

Between a 3584K and 4096K FFT (68M to 78M), there are approximately 158K exponents needing a DC, 25K awaiting LL, 6.4K assigned LL, and 0.5K assigned DC.

airsquirrels 2017-05-02 00:28

I have continued to look at the performance discrepancy on my older Debian Jessie systems that are stuck with the fglrx driver. I've noticed for one that the fglrx driver is advertising OpenCL 2.0?

Testing with a nice low 70000141, all 4096K:

[CODE]
Fiji; OpenCL 2.0 AMD-APP (1912.5) (Catalyst 15.12) is seeing 2.255 ms/iter
<clLucas: 3.713 ms/iter> (gpuOwl 1.6x faster)
Fiji; OpenCL 2.0 AMD-APP (1800.5) (Catalyst 15.7, old bad one) is seeing 5.513 ms/iter
<clLucas: 3.972 ms/iter> (gpuOwl at 72% of clLucas speed)
Fiji; OpenCL 1.2 AMD-APP (2348.3) (AMDGPU 17.10) is seeing 2.42ms/iter
<clLucas: 5.16ms/iter> (gpuOwl 2.1x faster)
[/CODE]

Sounds like for me I'll be sticking with the fglrx 15.12 / Debian 8 system for the best times on both applications with the Fury X. All residues matched.

preda 2017-05-02 00:47

[QUOTE=airsquirrels;458064]I have continued to look at the performance discrepancy on my older Debian Jessie systems that are stuck with the fglrx driver. I've noticed for one that the fglrx driver is advertising OpenCL 2.0?

Testing with a nice low 70000141, all 4096K:

[CODE]
Fiji; OpenCL 2.0 AMD-APP (1912.5) (Catalyst 15.12) is seeing 2.255 ms/iter
<clLucas: 3.713 ms/iter> (gpuOwl 1.6x faster)
Fiji; OpenCL 2.0 AMD-APP (1800.5) (Catalyst 15.7, old bad one) is seeing 5.513 ms/iter
<clLucas: 3.972 ms/iter> (gpuOwl at 72% of clLucas speed)
Fiji; OpenCL 1.2 AMD-APP (2348.3) (AMDGPU 17.10) is seeing 2.42ms/iter
<clLucas: 5.16ms/iter> (gpuOwl 2.1x faster)
[/CODE]

Sounds like for me I'll be sticking with the fglrx 15.12 / Debian 8 system for the best times on both applications with the Fury X. All residues matched.[/QUOTE]

Could you please report the iteration time on the best OpenCL setup on FuryX standard (not-overclocked) for something around 76M, e.g. 76008281. For this particular exponent I see 2.125ms, I'm curious how fglrx 15.12 compares. (I use amdgpu 17.10 on Ubuntu 16.10)

Note that the iteration time *decreases* a bit as the exponent grows (while staying at the same FFT size). This is because the carry propagation step takes longer for smaller exponents (because the word size is smaller and the carry spans more words). But overall the carry propagation time is a small percentage of the total, thus the impact of this is small.
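
To make the word-size argument concrete, here is a small sketch of the average bits per FFT word. The helper name `bits_per_word` is made up for illustration, and it assumes "4096K" means 4096 * 1024 words.

```python
def bits_per_word(exponent: int, fft_size_k: int) -> float:
    """Average number of bits of the p-bit residue stored in each FFT word."""
    words = fft_size_k * 1024  # assumption: "4096K" means 4096 * 1024 words
    return exponent / words
```

For 71561261 at 4096K this gives about 17.06 bits/word, which agrees with the figure gpuOwL prints at startup in the GTX 1080 log later in the thread. A 76M exponent at the same FFT size lands near 18.1 bits/word, so a carry of a given magnitude is absorbed in fewer words.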

airsquirrels 2017-05-02 00:49

I went through some old conversations I had with Madpoo:

4M FFT is essentially between 73.18M and 77.99M for Prime95 (sweet spot)

I went through the months of performance logs from my AMD systems to get a good baseline for FFT size performance (in GhzDay/Day):

In my Mersenne.org "Dashboard" I use this generalized formula to quickly estimate GhzDays:

[CODE]llcredit=0.0246*(($exponent/1000000)-35)^2 + 3.2416*(($exponent/1000000)-35) + 41.369
ghzDayDay=(86400000/($exponent*$msPerIter)) * $llcredit[/CODE]
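
The same formula as a small Python sketch, with the constants copied from the formula above; the function names `ll_credit` and `ghz_days_per_day` are only illustrative.

```python
def ll_credit(exponent: int) -> float:
    """Estimated GhzDays of credit for one LL test of the given exponent."""
    x = exponent / 1_000_000 - 35
    return 0.0246 * x**2 + 3.2416 * x + 41.369

def ghz_days_per_day(exponent: int, ms_per_iter: float) -> float:
    """Estimated throughput: (tests completed per day) * (credit per test)."""
    ms_per_day = 86_400_000
    tests_per_day = ms_per_day / (exponent * ms_per_iter)
    return tests_per_day * ll_credit(exponent)
```

For a 40M exponent at 2.25 ms/iter this gives a credit of 58.192 and about 55.864 GhzDay/Day, matching the first row of the extrapolation table further down in this post.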

AMD Cards (clFFT, Fury X):
[CODE]FFT      Avg      Mode   Min    Max
2048K)   56.5443  56.47  48.20  60.82
2240K)   10.9852  10.86  10.74  11.36
2304K)   47.5053  47.07  35.99  54.34
2400K)   43.8843  44.25  39.54  62.68
2560K)   51.0052  50.79   2.05  54.06
2688K)   25.1341  25.14  25.02  25.21
2880K)   44.7719  45.83  42.34  46.20
3072K)   58.9334  59.04  56.81  60.31
3200K)   49.976   49.95  49.34  50.46
3360K)   32.7874  32.87  31.96  33.07
3456K)   55.2102  55.22  54.77  55.55
3840K)   44.1085  43.38  42.62  46.06
4000K)   50.9889  51.05  45.95  51.40
4096K)   60.9276  61.06  45.05  61.70
4480K)   30.3824  30.41  29.09  30.52[/CODE]

NVIDIA (cuFFT, Titan)
[CODE]FFT      Avg      Mode   Min    Max
2048K)   76.6284  78.29  45.63  89.37
2160K)   66.4163  65.90  49.10  78.65
2304K)   69.0988  71.78  54.58  79.31
2352K)   64.5668  64.64  63.75  64.66
2592K)   76.4363  75.27  59.89  89.20
2880K)   72.355   71.03  69.94  73.97
3024K)   75.0943  78.21  63.86  81.43
3584K)   60.9709  58.75  57.07  65.83
4096K)   80.1439  82.72  67.85  91.17
4320K)   66.5236  63.43  57.70  76.42
19208K)  41.36    41.40  41.02  41.40
[/CODE]

Since the ms/iter on my Fury X with the fglrx 15.12 driver is a constant 2.25ms/iter @ 4096K, I can extrapolate this chart using the formula:
[CODE]Exp (M)  ms/iter  LL Credit  GhzDay/Day
40       2.25      58.192    55.86432
42       2.25      65.2656   59.67140571
44       2.25      72.536    63.30414545
46       2.25      80.0032   66.78528
48       2.25      87.6672   70.13376
50       2.25      95.528    73.365504
52       2.25     103.5856   76.49398154
54       2.25     111.84     79.53066667
56       2.25     120.2912   82.48539429
58       2.25     128.9392   85.36664276
60       2.25     137.784    88.18176
62       2.25     146.8256   90.93714581
64       2.25     156.064    93.6384
66       2.25     165.4992   96.29044364
68       2.25     175.1312   98.89761882
70       2.25     184.96    101.4637714
72       2.25     194.9856  103.99232
74       2.25     205.208   106.4863135
76       2.25     215.6272  108.94848
78       2.25     226.2432  111.3812677[/CODE]

From my perspective, that means that even using 4096K all the way down to the DC line, gpuOwl will outperform clLucas.

airsquirrels 2017-05-02 00:59

[QUOTE=preda;458071]Could you please report the iteration time on the best OpenCL setup on FuryX standard (not-overclocked) for something around 76M, e.g. 76008281. For this particular exponent I see 2.125ms, I'm curious how fglrx 15.12 compares. (I use amdgpu 17.10 on Ubuntu 16.10)

Note that the iteration time *decreases* a bit as the exponent grows (while staying at the same FFT size). This is because the carry propagation step takes longer for smaller exponents (because the word size is smaller and the carry spans more words). But overall the carry propagation time is a small percentage of the total, thus the impact of this is small.[/QUOTE]

Are you able to test 76008281? I get an error and quit due to an error rate of 0.5 with the current code.

Using 75000143 with my best 15.12 driver I get 2.36ms/iter, which is just slightly slower than the 70M range.

airsquirrels 2017-05-02 01:59

One more reply to myself. I tested this theory on a 7GPU system with a simple patch to gpuOwl to match the output format of clLucas (so it could drop-into my management scripts):

Original workload, GPU 1 using gpuOwl:
[CODE]FFTSize: 4096K Exponent: 42424699 (0.31%) Error: 0.0000 ms: 2.2720 eta: 26:41:34
Card 1 (gpuOwl AMD Radeon (TM) R9 Fury Series - 31.00C, 100% Load [1050/1050], M42424699 using 4096K) GhzDay: 59.87
FFTSize: 2304K Exponent: 42446867 (31.35%) Error: 0.1885 ms: 2.4882 eta: 20:08:25
Card 2 (AMD Radeon (TM) R9 Fury Series - 32.00C, 100% Load [1050/1050], M42446867 using 2304K) GhzDay: 54.71
FFTSize: 2304K Exponent: 42495623 (30.44%) Error: 0.1875 ms: 2.4890 eta: 20:26:13
Card 3 (AMD Radeon (TM) R9 Fury Series - 33.00C, 100% Load [1050/1050], M42495623 using 2304K) GhzDay: 54.77
FFTSize: 2560K Exponent: 42852191 (63.55%) Error: 0.0208 ms: 2.8335 eta: 12:17:38
Card 4 (AMD Radeon (TM) R9 Fury Series - 34.00C, 100% Load [1050/1050], M42852191 using 2560K) GhzDay: 48.63
FFTSize: 4480K Exponent: 78920381 (76.28%) Error: 0.1064 ms: 9.1494 eta: 47:34:36
Card 5 (AMD Radeon (TM) R9 Fury Series - 38.00C, 100% Load [1050/1050], M78920381 using 4480K) GhzDay: 27.66
FFTSize: 4480K Exponent: 78920419 (89.19%) Error: 0.0996 ms: 7.9857 eta: 18:55:17
Card 6 (AMD Radeon (TM) R9 Fury Series - 35.00C, 100% Load [1050/1050], M78920419 using 4480K) GhzDay: 31.69
FFTSize: 4480K Exponent: 78920497 (89.19%) Error: 0.1016 ms: 7.9951 eta: 18:56:38
Card 7 (AMD Radeon (TM) R9 Fury Series - 33.00C, 100% Load [1050/1050], M78920497 using 4480K) GhzDay: 31.66
Total GhzDay(7 cards): 308.99[/CODE]

Original workload, GPUs 1-4 using gpuOwl (due to larger LL tests on 5, 6, 7). Note that even with 4096K the times are still better:
[CODE]FFTSize: 4096K Exponent: 42424699 (0.26%) Error: 0.0000 ms: 2.2747 eta: 26:44:13
Card 1 (gpuOwl AMD Radeon (TM) R9 Fury Series - 31.00C, 0% Load [1050/1050], M42424699 using 4096K) GhzDay: 59.80
FFTSize: 4096K Exponent: 42446867 (0.07%) Error: 0.0000 ms: 2.2731 eta: 26:46:58
Card 2 (gpuOwl AMD Radeon (TM) R9 Fury Series - 33.00C, 100% Load [1050/1050], M42446867 using 4096K) GhzDay: 59.88
FFTSize: 4096K Exponent: 42495623 (0.07%) Error: 0.0000 ms: 2.2734 eta: 26:49:01
Card 3 (gpuOwl AMD Radeon (TM) R9 Fury Series - 34.00C, 100% Load [1050/1050], M42495623 using 4096K) GhzDay: 59.96
FFTSize: 4096K Exponent: 42852191 (0.07%) Error: 0.0000 ms: 2.2751 eta: 27:03:45
Card 4 (gpuOwl AMD Radeon (TM) R9 Fury Series - 35.00C, 100% Load [1050/1050], M42852191 using 4096K) GhzDay: 60.56
FFTSize: 4480K Exponent: 78920381 (76.34%) Error: 0.0840 ms: 8.0626 eta: 41:48:49
Card 5 (AMD Radeon (TM) R9 Fury Series - 38.00C, 100% Load [1050/1050], M78920381 using 4480K) GhzDay: 31.39
FFTSize: 4480K Exponent: 78920419 (89.26%) Error: 0.0742 ms: 6.9726 eta: 16:25:27
Card 6 (AMD Radeon (TM) R9 Fury Series - 35.00C, 100% Load [1050/1050], M78920419 using 4480K) GhzDay: 36.30
FFTSize: 4480K Exponent: 78920497 (89.26%) Error: 0.1016 ms: 7.9935 eta: 18:49:44
Card 7 (AMD Radeon (TM) R9 Fury Series - 33.00C, 100% Load [1050/1050], M78920497 using 4480K) GhzDay: 31.66
Total GhzDay(7 cards): 339.55[/CODE]

New workload, all assignments in the 73M range and GPU1-6 on gpuOwl, GPU7 with the remaining 78M assignments:
[CODE]FFTSize: 4096K Exponent: 73001809 (0.03%) Error: 0.0625 ms: 2.2571 eta: 45:45:27
Card 1 (gpuOwl AMD Radeon (TM) R9 Fury Series - 31.00C, 100% Load [1050/1050], M73001809 using 4096K) GhzDay: 104.91
FFTSize: 4096K Exponent: 73001989 (0.03%) Error: 0.0625 ms: 2.2615 eta: 45:50:49
Card 2 (gpuOwl AMD Radeon (TM) R9 Fury Series - 33.00C, 100% Load [1050/1050], M73001989 using 4096K) GhzDay: 104.71
FFTSize: 4096K Exponent: 73002113 (0.03%) Error: 0.0703 ms: 2.2623 eta: 45:51:47
Card 3 (gpuOwl AMD Radeon (TM) R9 Fury Series - 34.00C, 100% Load [1050/1050], M73002113 using 4096K) GhzDay: 104.67
FFTSize: 4096K Exponent: 73002211 (0.03%) Error: 0.0625 ms: 2.2624 eta: 45:51:55
Card 4 (gpuOwl AMD Radeon (TM) R9 Fury Series - 36.00C, 0% Load [1050/1050], M73002211 using 4096K) GhzDay: 104.67
FFTSize: 4096K Exponent: 73001413 (0.03%) Error: 0.0625 ms: 2.2628 eta: 45:52:22
Card 5 (gpuOwl AMD Radeon (TM) R9 Fury Series - 40.00C, 100% Load [1050/1050], M73001413 using 4096K) GhzDay: 104.65
FFTSize: 4096K Exponent: 73001603 (0.03%) Error: 0.0625 ms: 2.2595 eta: 45:48:22
Card 6 (gpuOwl AMD Radeon (TM) R9 Fury Series - 36.00C, 100% Load [1050/1050], M73001603 using 4096K) GhzDay: 104.80
FFTSize: 4480K Exponent: 78920497 (89.34%) Error: 0.1016 ms: 8.0108 eta: 18:42:50
Card 7 (AMD Radeon (TM) R9 Fury Series - 33.00C, 100% Load [1050/1050], M78920497 using 4480K) GhzDay: 31.60
Total GhzDay(7 cards): 660.01[/CODE]

Compared to an 8 card Titan Black (Air) system on 4096K 73M exponents:
[CODE]Card 1 (GeForce GTX TITAN Black - 78.00C, 100% Load [862/1202]@247.13W/250.00W, M73002467 using 4096K) GhzDay: 91.14
FFTSize: 4096K Exponent: 73004003 (0.01%) Error: 0.07422 ms: 2.7951 eta: 2:08:40:28
Card 2 (GeForce GTX TITAN - 79.00C, 100% Load [758/1254]@207.82W/250.00W, M73004003 using 4096K) GhzDay: 84.72
FFTSize: 4096K Exponent: 73002749 (0.02%) Error: 0.07812 ms: 2.5954 eta: 2:04:29:18
Card 3 (GeForce GTX TITAN Black - 86.00C, 100% Load [901/1280]@249.31W/250.00W, M73002749 using 4096K) GhzDay: 91.24
FFTSize: 4096K Exponent: 73003157 (0.02%) Error: 0.07812 ms: 2.6037 eta: 2:04:49:09
Card 4 (GeForce GTX TITAN Black - 79.00C, 100% Load [862/1202]@247.68W/250.00W, M73003157 using 4096K) GhzDay: 90.95
FFTSize: 4096K Exponent: 73003547 (0.02%) Error: 0.07812 ms: 2.6015 eta: 2:04:45:14
Card 5 (GeForce GTX TITAN Black - 77.00C, 99% Load [862/1202]@249.90W/250.00W, M73003547 using 4096K) GhzDay: 91.03
FFTSize: 4096K Exponent: 73003741 (0.02%) Error: 0.07812 ms: 2.6205 eta: 2:04:59:23
Card 6 (GeForce GTX TITAN Black - 76.00C, 100% Load [862/1202]@249.87W/250.00W, M73003741 using 4096K) GhzDay: 90.37
FFTSize: 4096K Exponent: 73003859 (0.02%) Error: 0.07812 ms: 2.5973 eta: 2:04:37:29
Card 7 (GeForce GTX TITAN Black - 77.00C, 100% Load [862/1202]@248.60W/250.00W, M73003859 using 4096K) GhzDay: 91.17
FFTSize: 4096K Exponent: 73003939 (0.01%) Error: 0.07031 ms: 2.6021 eta: 2:04:45:41
Card 8 (GeForce GTX TITAN Black - 75.00C, 100% Load [888/1202]@248.09W/250.00W, M73003939 using 4096K) GhzDay: 91.01
Total GhzDay(8 cards): 721.63[/CODE]

LaurV 2017-05-02 04:23

Hey, let the titans apart, you are comparing apples and watermelons :razz:

By the way, long ago you said you would send me some damaged Titans, which, if I could repair them, I could use for myself. I even offered to pay for shipping. Any news? Could you repair them by yourself? Did you give up? Throw them away? (that would be very bad of you! :rant:)

preda 2017-05-02 07:20

[QUOTE=airsquirrels;458075]Are you able to test 76008281? I get an error and quit due to an error rate of 0.5 with the current code.
[/QUOTE]

76M is small enough for 4096K FFT, an error of 0.5 is not normal. This is what I see:

54460000 / 76008281 [71.65%], ms/iter: 2.124, ETA: 0d 12:43; 34e5c50b53ce4ce4 error 0.21875 (max 0.21875)
54480000 / 76008281 [71.68%], ms/iter: 2.128, ETA: 0d 12:43; 817ee6b4419a5303 error 0.1875 (max 0.21875)
54500000 / 76008281 [71.70%], ms/iter: 2.126, ETA: 0d 12:42; bb7a5ac252b61a9e error 0.1875 (max 0.21875)

preda 2017-05-02 07:26

@airsquirrels : Impressive hardware! Do you have a description somewhere of your hardware setup? (e.g. what motherboard, how are the GPUs connected and cooled, pictures, power use etc).

30C is such a low temperature, how do you cool? or was that only on startup?

airsquirrels 2017-05-02 12:10

[QUOTE=preda;458097]@airsquirrels : Impressive hardware! Do you have a description somewhere of your hardware setup? (e.g. what motherboard, how are the GPUs connected and cooled, pictures, power use etc).

30C is such a low temperature, how do you cool? or was that only on startup?[/QUOTE]

There is a thread somewhere around here detailing the setup. The liquid is on a large multi-system loop with industrial pumps and a large external heat exchanger, so I target 35-45 °C for loop temp. The 7-8 GPU systems are all based on SuperMicro 4027/4028 8GPU chassis.

@laurv - I still have the Titans (and one or two more I'm afraid), however the last few months have been so busy I have not had a chance to bundle up and ship them. They are still tagged for you just waiting for a slow day!

@preda I very much appreciate your work on this. I did a good bit of performance evaluation work on clLucas a few months back and theorized several times that the method of kernel calls in clFFT could be combined into a single kernel or more efficient form that was about 2x faster, however I never found time to do the work. Do you think you could add a power of two 16K FFT option easily just for some tests? I presume that would be easier than implementing efficient mixed radix FFTs.

kracker 2017-05-02 23:04

FYI: with -Wall passed to g++ I'm getting this warning, not sure if it's anything (only bringing it up because you had -Werror in the makefile):
[code]
gpuowl.cpp: In function 'void doLog(int, int, float, float, double, u64)':
gpuowl.cpp:285:93: warning: unknown conversion type character 'l' in format [-Wformat=]
k, E, k * percent, msPerIter, days, hours, mins, (unsigned long long) res, err, maxErr);
^
gpuowl.cpp:285:93: warning: format '%g' expects argument of type 'double', but argument 9 has type 'u64 {aka long long unsigned int}' [-Wformat=]
gpuowl.cpp:285:93: warning: too many arguments for format [-Wformat-extra-args]
[/code]

preda 2017-05-03 00:00

[QUOTE=kracker;458148]FYI: with -Wall passed to g++ I'm getting this warning, not sure if it's anything(only bringing it up because you had -Werror in the makefile)
[code]
gpuowl.cpp: In function 'void doLog(int, int, float, float, double, u64)':
gpuowl.cpp:285:93: warning: unknown conversion type character 'l' in format [-Wformat=]
k, E, k * percent, msPerIter, days, hours, mins, (unsigned long long) res, err, maxErr);
^
gpuowl.cpp:285:93: warning: format '%g' expects argument of type 'double', but argument 9 has type 'u64 {aka long long unsigned int}' [-Wformat=]
gpuowl.cpp:285:93: warning: too many arguments for format [-Wformat-extra-args]
[/code][/QUOTE]
Feel free to drop -Werror to get the compilation going. (I only added it because in general it's useful to check every warning).

In this case, it seems gcc does not like %llx in printf() ("long long unsigned"). printf() may still execute that correctly though, try.

preda 2017-05-03 00:18

[QUOTE=airsquirrels;458111]
@preda I very much appreciate your work on this. I did a good bit of performance evaluation work on clLucas a few months back and theorized several times that the method of kernel calls in clFFT could be combined into a single kernel or more efficient form that was about 2x faster, however I never found time to do the work. Do you think you could add a power of two 16K FFT option easily just for some tests? I presume that would be easier than implementing efficient mixed radix FFTs.
[/QUOTE]

POT FFTs are clearly easier. Do you want 16K or 16M? (why 16K, so small?). In fact for 16K I wonder how many bits/word would work with float (single precision).

I just saw now your thread about LL implementation -- I didn't see it earlier sorry. I was also thinking initially about merging kernels for performance, but that didn't work well for reasons of VGPR (register) pressure, which is a major limit on GCN ISA. Keeping the kernels "small" reduces VGPR usage, allowing more workgroups to run at the same time.

In fact, I initially tried to implement Nussbaumer convolution, which does not need any floating point (integers only) and uses fewer multiplications. I stopped when I became convinced that it'd still be slower than classical LL (double precision FFT) on GPUs. (This is because Nussbaumer is more memory-intense.)

The main optimization in gpuOwL IMO is using a transposed representation of the data matrix (what I call "transposed convolution"), which fits very nicely with the GPU memory access pattern. This saves two transposition steps in both the direct and inverse FFT. The transposed representation is also good for the parallel carry propagation.

airsquirrels 2017-05-03 01:58

[QUOTE=preda;458154]POT FFTs are clearly easier. Do you want 16K or 16M? (why 16K, so small?). In fact for 16K I wonder how many bits/word would work with float (single precision).

I just saw now your thread about LL implementation -- I didn't see it earlier sorry. I was also thinking initially about merging kernels for performance, but that didn't work well for reasons of VGPR (register) pressure, which is a major limit on GCN ISA. Keeping the kernels "small" reduces VGPR usage, allowing more workgroups to run at the same time.

In fact, I initially tried to implement Nussbaumer convolution, which does not need any floating point (integers only) and uses fewer multiplications. I stopped when I became convinced that it'd still be slower than classical LL (double precision FFT) on GPUs. (This is because Nussbaumer is more memory-intense.)

The main optimization in gpuOwL IMO is using a transposed representation of the data matrix (what I call "transposed convolution"), which fits very nicely with the GPU memory access pattern. This saves two transposition steps in both the direct and inverse FFT. The transposed representation is also good for the parallel carry propagation.[/QUOTE]

You are correct - 16M, although 8M is good too! I am curious if you have given any thought to the difficulty of mixed-radix FFT sizes at this point.

Your transpose representation solves one of the big problems with the default runtime generated kernels in clFFT. The other problem was a post processing step needed to reload all of the values from memory that were just available in registers to convert them back to integers for carry prop., which was most of the bottleneck.

I did get some good advice from Mr Prime95 himself regarding the carry step - it is not necessary to complete the entire carry chain. You only need to carry enough to reduce your word sizes back to the point where they stay within the error bounds of another FFT and squaring step.
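
A simplified sketch of that idea: one parallel carry pass that shrinks the words without resolving the whole carry chain. It assumes plain integers and a uniform word size of `bits` bits (the real DWT uses variable word sizes), and the names `partial_carry` and `value` are hypothetical; this is not Prime95's or gpuOwL's actual kernel.

```python
def partial_carry(words, bits):
    """One parallel carry pass over a cyclic word array.

    The array represents sum(words[i] * 2**(bits*i)) modulo 2**(bits*n) - 1.
    """
    half = 1 << (bits - 1)
    n = len(words)
    # Every carry is computed from the original words, so all words can be
    # processed independently (no sequential carry chain).
    carries = [(w + half) >> bits for w in words]  # floor((w + half) / 2**bits)
    reduced = [w - (c << bits) for w, c in zip(words, carries)]  # in [-half, half)
    # Each word receives its lower neighbour's carry; the top carry wraps to
    # word 0 because 2**(bits*n) == 1 modulo 2**(bits*n) - 1.
    return [reduced[i] + carries[i - 1] for i in range(n)]

def value(words, bits):
    """The residue represented by the word array."""
    modulus = (1 << (bits * len(words))) - 1
    return sum(w << (bits * i) for i, w in enumerate(words)) % modulus
```

After one pass the words are not necessarily back in the canonical range, since a received carry can push a word slightly outside it, but they are far smaller than before and the represented residue is unchanged. Whether that is small enough for the next squaring depends on the error bounds of the FFT in use.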

I also intend to test your openCL code on the Nvidia system and see if it runs and how it performs vs cuFFT.

airsquirrels 2017-05-03 02:28

Here is a quick test on a GTX 1080. Residues match; however, gpuOwl runs at 4.53 ms/iter vs 3.55 for CUDALucas. Still impressively close given that it was optimized for AMD cards.

I did another quick test on a Titan Black with double precision boost on, and gpuOwl was around 5.5ms vs 2.5 from CUDALucas. Interestingly, they were closer with double precision boost off, suggesting that cuFFT is perhaps better able to take advantage of the faster compute, while gpuOwl's compiled code is memory bound.

[CODE]
gpuowl -logstep 5000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
GeForce GTX 1080; OpenCL 1.2 CUDA
Will log every 5000 iterations, and persist checkpoint every 2500000 iterations.
Falling back to CL1.x compilation (error -11)
Checkpoint file 'c71561261.ll' not found. You can use 't71561261.ll'.
LL FFT 4096K (1024*2048*2) of 71561261 (17.06 bits/word) at iteration 0
OpenCL setup: 1395 ms
00005000 / 71561261 [0.01%], ms/iter: 4.530, ETA: 3d 18:03; b40dd71dc9998cfd error 0.0390625 (max 0.0390625)
00010000 / 71561261 [0.01%], ms/iter: 4.538, ETA: 3d 18:12; 9421fec94352d8fd error 0.0390625 (max 0.0390625)
00015000 / 71561261 [0.02%], ms/iter: 4.545, ETA: 3d 18:20; 7ff289450308f24f error 0.0390625 (max 0.0390625)
00020000 / 71561261 [0.03%], ms/iter: 4.557, ETA: 3d 18:33; 02729de7028e2114 error 0.0390625 (max 0.0390625)
[/CODE]

[CODE]
CUDALucas -f 4096k 71561261

------- DEVICE 0 -------
name GeForce GTX 1080
Compatibility 6.1
clockRate (MHz) 1733
memClockRate (MHz) 5005
totalGlobalMem 8507555840
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 20
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

Using threads: square 256, splice 128.
Starting M71561261 fft length = 4096K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| May 02 22:30:22 | M71561261 10000 0x9421fec94352d8fd | 4096K 0.04688 3.5572 35.57s | 2:22:42:05 0.01% |
| May 02 22:30:57 | M71561261 20000 0x02729de7028e2114 | 4096K 0.05078 3.5814 35.81s | 2:22:55:55 0.02% |
[/CODE]

preda 2017-05-03 08:29

[QUOTE=henryzz;458021]There are many people who would appreciate a fast fft modulo k*2^n-1 for the LLR test for the GPU[/QUOTE]
Please allow me a few days to think about it (I'm quite new to LLR).

henryzz 2017-05-03 10:37

[QUOTE=preda;458185]Please allow me a few days to think about it (I'm quite new to LLR).[/QUOTE]
The LLR test would be nice as it means the results will be comparable with other programs for k*2^n-1. There are also Proth primality tests for k*2^n+1.
More generally, a Fermat PRP test would be fine for k*b^n+-1. The proof can be done on a CPU.


The FFT is the harder thing to get right though.
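
For reference, the Fermat PRP test itself is tiny once modular arithmetic is available. Here is a sketch using Python's built-in big integers; the function name `fermat_prp` and the default base 3 are illustrative choices, not taken from any existing program.

```python
def fermat_prp(k: int, b: int, n: int, c: int, base: int = 3) -> bool:
    """Fermat probable-prime test for N = k*b**n + c, with c = +1 or -1."""
    N = k * b**n + c
    # A prime N (not dividing base) always satisfies base**(N-1) == 1 (mod N).
    return pow(base, N - 1, N) == 1
```

Passing the test only makes N a probable prime, so a separate proof is still needed. On a GPU, the modular exponentiation is where the fast FFT multiplication modulo k*b^n+-1 comes in.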

LaurV 2017-05-03 13:15

1. Shifting (more important and easier to implement than other FFT sizes - a step toward making gpuOwl production-ready).
2. Some command line switch to enumerate the existing devices (like GPU-Z is doing). (not important, but useful, some of us have no idea what treasures are hidden in our computer boxes... :razz:)


All times are UTC.