mersenneforum.org  

2020-02-02, 05:16   #1816
LaurV
Romulan Interpreter

Jun 2011
Thailand

2²·7·11·29 Posts

Quote:
Originally Posted by kriesel
CUDALucas has not had any significant development in years, so naturally has fallen behind.

This is a bit misleading.

First, CudaLucas was never intended to run on AMD cards. For native CUDA/Nvidia cards, it is still faster than anything else. At least, for everything I run in my rigs, old cards (like the 580 and the classic/black Titans) and new cards (like the 1080Ti and 2080Ti) included.

Second, there is "almost nothing" to improve in CudaLucas (well, there are some minor things, that's why the quotes, but the big picture won't change much). This toy is just a "square, subtract 2, repeat" tool, which uses Nvidia's CUDA FFT library (cuFFT) to do the squaring. These libraries, indeed, fell behind, as you said. They have not been updated by Nvidia for ages, and if we could convince them to make (or make by ourselves) a cuFFT library a hundred times faster than the current one, all CL would need would be a recompilation. For the owl, Preda made the libraries from scratch, and they are well tuned for OpenCL, but Nvidia cards are not so good at emulating OpenCL; they are faster when native CUDA is used.
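
For readers unfamiliar with the "square, subtract 2, repeat" loop, here is a minimal Python sketch of the Lucas-Lehmer test. It is not CUDALucas code: Python's built-in big-integer multiplication stands in for the FFT-based squaring, and the function name `lucas_lehmer` is only illustrative.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime, for an odd prime exponent p."""
    m = (1 << p) - 1          # the Mersenne number being tested
    s = 4                     # standard starting value
    for _ in range(p - 2):    # p - 2 iterations in total
        s = (s * s - 2) % m   # square, subtract 2, reduce mod 2**p - 1
    return s == 0             # prime exactly when the final residue is 0


if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13):
        print(p, lucas_lehmer(p))
```

In a real GPU program the line `s * s` is where essentially all the time goes, which is why the speed of the FFT library dominates.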

Last fiddled with by LaurV on 2020-02-02 at 05:22
2020-02-02, 05:26   #1817
xx005fs

"Eric"
Jan 2018
USA

211 Posts

Quote:
Originally Posted by LaurV
First, CudaLucas was never intended to run on AMD cards. For native CUDA/Nvidia cards, it is still faster than anything else.

This statement is a bit misleading, since with the new updates gpuowl has become significantly more efficient in its memory bandwidth usage. I am seeing significant speedups on GPUs with a high DP ratio, like the K80, P100, V100, and Titan V. There is indeed not much difference for the GTX and RTX cards, since most of them are DP-bound rather than memory-bound.

Last fiddled with by xx005fs on 2020-02-02 at 05:26
2020-02-02, 07:43   #1818
nomead

"Sam Laur"
Dec 2018
Turku, Finland

23×41 Posts

Quote:
Originally Posted by LaurV
First, CudaLucas was never intended to run on AMD cards. For native CUDA/Nvidia cards, it is still faster than anything else. At least, for everything I run in my rigs, old cards (like the 580 and the classic/black Titans) and new cards (like the 1080Ti and 2080Ti) included.

Nope, on my RTX 2080 at least, the current version of gpuowl is about 20-30% faster than CUDALucas, varying a bit from one FFT size to another. The big improvement came at the beginning of December 2019, and smaller optimizations have accumulated since then, so if you tested gpuowl before that, please test again.
2020-02-02, 07:58   #1819
LaurV
Romulan Interpreter

Jun 2011
Thailand

2²×7×11×29 Posts

You may be totally right... We haven't moved to such fancy new things yet.
Edit @nomead, crosspost, I was replying to xx, but what you say is really tempting, BRB soon

Last fiddled with by LaurV on 2020-02-02 at 08:00
2020-02-02, 08:20   #1820
xx005fs

"Eric"
Jan 2018
USA

211 Posts

Quote:
Originally Posted by nomead
Nope, on my RTX 2080 at least, the current version of gpuowl is about 20-30% faster than CUDALucas, varying a bit from one FFT size to another. The big improvement came at the beginning of December 2019, and smaller optimizations have accumulated since then, so if you tested gpuowl before that, please test again.

Interesting. I saw only around a 5% improvement going from CUDALucas to gpuowl on my 1070. Did the RTX series get a DP ratio higher than 1/32?
2020-02-02, 20:33   #1821
preda

"Mihai Preda"
Apr 2015

2⁴·83 Posts

It seems that your OpenCL compiler does not like __attribute__((opencl_unroll_hint(1))). To work around that, simply pass "-use UNROLL_ALL" (and none of the other UNROLL_ options), or, if running on an Nvidia card, don't pass any UNROLL option at all.
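
When a build fails like this, the useful part is the list of compiler diagnostics buried in the log. Below is a minimal Python sketch for pulling them out. It assumes the `<kernel>:LINE:COL: error: MESSAGE` layout seen in the log quoted in this post; the function name `kernel_errors` is hypothetical and not part of gpuowl.

```python
import re

# Assumed layout, taken from the quoted log: "<kernel>:1386:3: error: message"
KERNEL_ERROR = re.compile(r"<kernel>:(\d+):(\d+): error: (.*)")


def kernel_errors(log_text):
    """Return (line, column, message) for each OpenCL kernel compile error."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in KERNEL_ERROR.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "2020-02-01 23:55:14 GeForce GTX 1660-0 <kernel>:1386:3: "
        "error: expected identifier or '('\n"
        "  for (i32 s = 4; s >= 0; s -= 2) {\n"
        "  ^\n"
        "<kernel>:1394:3: error: expected identifier or '('\n"
    )
    for line, col, msg in kernel_errors(sample):
        print(line, col, msg)
```

If every reported line is a `for` loop, as here, that points at whatever precedes the loop (in this case the unroll attribute) rather than at the loop itself.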

Quote:
Originally Posted by JCoveiro
Thanks!

But first, I just want to say that there is a bug in the program.

I'm using gpuowl v6.11-134-g1e0ce1d.

#####################################

Running the batch outputs the following errors:

Error#1
Running the Windows batch file at:
2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_NONE
outputs some errors and after the following:
2020-02-01 23:55:14 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

Error#2
Running the Windows batch file at:
2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_WIDTH
outputs some errors and after the following:
2020-02-01 23:55:15 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

Error#3
Running the Windows batch file at:
2020-02-01 23:55:15 config: -time -iters 10000 -use NO_ASM,UNROLL_HEIGHT
outputs some errors and after the following:
2020-02-01 23:55:15 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

Error#4

Running the Windows batch file at:
2020-02-01 23:55:15 config: -time -iters 10000 -use NO_ASM,UNROLL_MIDDLEMUL1
outputs some errors and after the following:
2020-02-01 23:55:16 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

Error#5
Running the Windows batch file at:
2020-02-01 23:55:16 config: -time -iters 10000 -use NO_ASM,UNROLL_MIDDLEMUL2
outputs some errors and after the following:
2020-02-01 23:55:16 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

#####################################

Here are some more details on Error#1:

Code:
2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_NONE 
2020-02-01 23:55:14 device 0, unique id ''
2020-02-01 23:55:14 GeForce GTX 1660-0 99753809 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.30 bits/word
2020-02-01 23:55:14 GeForce GTX 1660-0 OpenCL args "-DEXP=99753809u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xd.064531a6f6b48p-3 -DIWEIGHT_STEP=0x9.d3e00e7c301p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DUNROLL_NONE=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-02-01 23:55:14 GeForce GTX 1660-0 OpenCL compilation error -11 (args -DEXP=99753809u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xd.064531a6f6b48p-3 -DIWEIGHT_STEP=0x9.d3e00e7c301p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DUNROLL_NONE=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DNO_ASM=1)
2020-02-01 23:55:14 GeForce GTX 1660-0 <kernel>:1386:3: error: expected identifier or '('
  for (i32 s = 4; s >= 0; s -= 2) {
  ^
<kernel>:1394:3: error: expected identifier or '('
  for (i32 s = 4; s >= 0; s -= 2) {
  ^
<kernel>:1404:3: error: expected identifier or '('
  for (i32 s = 3; s >= 0; s -= 3) {
  ^
<kernel>:1412:3: error: expected identifier or '('
  for (i32 s = 3; s >= 0; s -= 3) {
  ^
<kernel>:1422:3: error: expected identifier or '('
  for (i32 s = 6; s >= 0; s -= 2) {
  ^
<kernel>:1430:3: error: expected identifier or '('
  for (i32 s = 6; s >= 0; s -= 2) {
  ^
<kernel>:1440:3: error: expected identifier or '('
  for (i32 s = 6; s >= 0; s -= 3) {
  ^
<kernel>:1448:3: error: expected identifier or '('
  for (i32 s = 6; s >= 0; s -= 3) {
  ^
<kernel>:1458:3: error: expected identifier or '('
  for (i32 s = 5; s >= 2; s -= 3) {
  ^
<kernel>:1502:3: error: expected identifier or '('
  for (i32 s = 5; s >= 2; s -= 3) {
  ^
<kernel>:2478:3: error: expected identifier or '('
  for (i32 i = 0; i < MIDDLE; ++i) {
  ^

2020-02-01 23:55:14 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build
2020-02-01 23:55:14 GeForce GTX 1660-0 Bye
2020-02-02, 20:34   #1822
preda

"Mihai Preda"
Apr 2015

2⁴×83 Posts

As the error says, you can't use "WORKINGOUT4" with that FFT size.

Did you try running the program without any -use options? Does that work?
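
As a side note on reading these logs, the "FFT 2304K" and "bits/word" figures can be reproduced from the -DWIDTH, -DSMALL_HEIGHT and -DMIDDLE values on the "OpenCL args" line. The sketch below assumes the relation FFT size = 2 × WIDTH × SMALL_HEIGHT × MIDDLE, which is inferred from the numbers in the quoted logs rather than taken from the gpuowl source; the function names are hypothetical.

```python
def fft_size_words(width, small_height, middle):
    """FFT length in words, assuming size = 2 * WIDTH * SMALL_HEIGHT * MIDDLE."""
    return 2 * width * small_height * middle


def bits_per_word(exponent, width, small_height, middle):
    """Average number of bits of the exponent stored in each FFT word."""
    return exponent / fft_size_words(width, small_height, middle)


if __name__ == "__main__":
    # Values from the log for exponent 43112609 (WIDTH=64, SMALL_HEIGHT=2048, MIDDLE=9)
    size = fft_size_words(64, 2048, 9)
    print(size // 1024, "K")
    print(round(bits_per_word(43112609, 64, 2048, 9), 2))
```

The same arithmetic applied to the earlier log (exponent 99753809, WIDTH=1024, SMALL_HEIGHT=256, MIDDLE=11) matches the "FFT 5632K" and "17.30 bits/word" reported there.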

Quote:
Originally Posted by JCoveiro
I have found another bug, while trying to test M47 (a lower exponent).

Code:
2020-02-02 01:36:38 gpuowl v6.11-134-g1e0ce1d
2020-02-02 01:36:38 Note: not found 'config.txt'
2020-02-02 01:36:38 config: -use UNROLL_ALL,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE,CARRY64,FANCYMIDDLEMUL1,LESS_ACCURATE
2020-02-02 01:36:38 device 0, unique id ''
2020-02-02 01:36:38 GeForce GTX 1660-0 43112609 FFT 2304K: Width 8x8, Height 256x8, Middle 9; 18.27 bits/word
2020-02-02 01:36:39 GeForce GTX 1660-0 OpenCL args "-DEXP=43112609u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.3ca600d8f455p-3 -DIWEIGHT_STEP=0x9.ab80a96f8aeap-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DCARRY64=1 -DFANCYMIDDLEMUL1=1 -DLESS_ACCURATE=1 -DT2_SHUFFLE=1 -DUNROLL_ALL=1 -DWORKINGIN4=1 -DWORKINGOUT4=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-02-02 01:36:39 GeForce GTX 1660-0 OpenCL compilation error -11 (args -DEXP=43112609u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.3ca600d8f455p-3 -DIWEIGHT_STEP=0x9.ab80a96f8aeap-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DCARRY64=1 -DFANCYMIDDLEMUL1=1 -DLESS_ACCURATE=1 -DT2_SHUFFLE=1 -DUNROLL_ALL=1 -DWORKINGIN4=1 -DWORKINGOUT4=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DNO_ASM=1)
2020-02-02 01:36:39 GeForce GTX 1660-0 <kernel>:2009:2: error: WORKINGOUT4 not compatible with this FFT size
#error WORKINGOUT4 not compatible with this FFT size
 ^

2020-02-02 01:36:39 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build
2020-02-02 01:36:39 GeForce GTX 1660-0 Bye
2020-02-02, 20:43   #1823
JCoveiro

"Jorge Coveiro"
Nov 2006
Moura, Portugal

2610 Posts

Quote:
Originally Posted by preda
As the error says, you can't use "WORKINGOUT4" with that FFT size.

Did you try running the program without any -use options? Does that work?

Yes. It runs without -use options.
I was just testing the "optimized settings" for Nvidia cards, but it seems that I can't use WORKINGOUT4.

Going to test again and publish the results for the GTX1660.

Last fiddled with by JCoveiro on 2020-02-02 at 20:45
2020-02-06, 09:29   #1824
wfgarnett3

"William Garnett III"
Oct 2002
Bensalem, PA

2·43 Posts

Quote:
Originally Posted by LaurV
First, CudaLucas was never intended to run on AMD cards. For native CUDA/Nvidia cards, it is still faster than anything else.

Same here -- CUDALucas is faster than gpuOwL on my EVGA GeForce 1050 2GB and also doesn't slow down Prime95 running concurrently on the CPU.

However, even though iteration times are a couple of milliseconds slower on gpuOwL than on CUDALucas (plus a couple of milliseconds of slowdown for Prime95 if it is running too), since gpuOwL eliminates the need for a double-check, it is the overall time saver over CUDALucas for me.

I only did one PRP double-check with gpuOwL and I occasionally do LL double-checks with CUDALucas.

Since the 1/32 double-precision ratio is terrible I mostly stick with Trial Factoring using mfaktc.

Last fiddled with by wfgarnett3 on 2020-02-06 at 09:34
2020-02-06, 14:08   #1825
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7×673 Posts

Quote:
Originally Posted by wfgarnett3
since gpuOwL eliminates the need for a double-check

But it doesn't. There is a PRP DC work type for good reasons:
1) errors may occur outside the code that the GEC covers, both in the software and in the manual reporting process, and some have already been confirmed to occur;
2) PRP DC guards against someone forging PRP first-test submissions;
3) the PRP GEC itself has a very low error rate, but not zero. Gerbicz himself has given error rate estimates.
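
To make point 1 concrete, here is a simplified Python sketch of a base-3 PRP test with a Gerbicz-style check. It is an illustration under stated assumptions, not gpuowl's implementation: the block size, the function name `prp_with_gerbicz`, and the `corrupt_at` error-injection hook are all hypothetical. The check relies on the identity d_new = u_0 · d_prev^(2^block) mod N, where d is the running product of the residues saved at block boundaries.

```python
def prp_with_gerbicz(p, block=4, corrupt_at=None):
    """Base-3 PRP test of 2**p - 1 (odd prime p) with a simplified Gerbicz check.

    corrupt_at is a hypothetical hook: the iteration at which to inject an error.
    """
    n = (1 << p) - 1
    base = 3
    u = base % n                  # u_0
    d = u                         # running product u_0 * u_block * u_2block * ...
    for i in range(1, p + 1):
        u = u * u % n             # u_i = u_(i-1)^2 mod n
        if i == corrupt_at:
            u = (u + 1) % n       # simulate a hardware error in the residue
        if i % block == 0:
            d_prev, d = d, d * u % n
            # Invariant when no error occurred: d == u_0 * d_prev^(2^block) mod n
            if d != base * pow(d_prev, 1 << block, n) % n:
                raise RuntimeError(f"Gerbicz check failed at iteration {i}")
    # Note: iterations after the last full block are NOT verified in this sketch.
    return u == 9 % n             # 3^(2^p) == 9 (mod n) whenever n is prime


if __name__ == "__main__":
    print(prp_with_gerbicz(13))   # 2**13 - 1 = 8191 is prime
    print(prp_with_gerbicz(11))   # 2**11 - 1 = 2047 = 23 * 89
```

The unverified tail in this sketch, and everything that happens to the result after the function returns (formatting, saving, reporting), are examples of places where an error would pass unnoticed by the check, which is the kind of gap a double-check closes.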

Quote:
Since the 1/32 double-precision ratio is terrible I mostly stick with Trial Factoring using mfaktc.

Good choice.
2020-02-06, 17:01   #1826
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·673 Posts

CUDALucas still has its place:
1) it is faster than gpuowl on a few GPU models;
2) it will run on older NVIDIA GPUs that are entirely incapable of running gpuowl, because they don't support the OpenCL level that gpuowl requires;
3) relatively current gpuowl versions don't do LL, so they can't do LLDC (although gpuowl v0.5 and v0.6 can, with a 4M FFT).

It would be great if CUDALucas had the Jacobi check.

All times are UTC. The time now is 12:08.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.