mersenneforum.org gpuOwL: an OpenCL program for Mersenne primality testing

 2017-04-26, 03:27 #34 Mark Rose     "/X\(‘-‘)/X\" Jan 2013 3·977 Posts I believe I've read on the forum that some users doing GPU LL had to underclock the memory to get an accurate result.
2017-04-26, 04:20   #35

"Kieren"
Jul 2011
In My Own Galaxy!

2·3·1,693 Posts

Quote:
 Originally Posted by Mark Rose I believe I've read on the forum that some users doing GPU LL had to underclock the memory to get an accurate result.
That was definitely my experience. Until I slowed the memory clock, my gtx 460 and 580 cards could not complete both self-tests.

I have yet to really work at getting CUDALucas to run on the GTX 1060. Early self-test runs blew up in seconds. It's been a while, so I don't remember the details that well. With an i7 Skylake turning out a DC every 25 hours, I found it much more productive to keep the GPUs on TF.

 2017-04-26, 08:08 #36 preda "Mihai Preda" Apr 2015 54B₁₆ Posts

Exponent range supported by gpuOwL

gpuOwL now only supports FFT 4096K. This allows LL in the range of about 35M - 77M. But all the exponents under ~40.8M have been double-checked, so there's not much interest there. Using FFT 4096K for exponents under about 65M may be a waste, because faster FFT sizes are available. Thus gpuOwL is probably best used for first-time tests and double-checks in the range 70M - 77M.

I would recommend starting with at least a couple of double-checks (to validate correct function) before doing any first-time LL. The results, found in "results.txt", are in a format that can be directly submitted on the "Manual Testing" webpage.

Note: on github, in the "fft2m" branch, https://github.com/preda/gpuowl/tree/fft2m there is an implementation for FFT 2048K as well, in case anybody is particularly interested in those small sizes (probably most useful for testing, or as sample code).

About larger FFT sizes: for now I'm only looking toward supporting POT (power-of-two) FFTs. But I think there's no interest in 8M or 16M FFTs (for LL), because there's plenty of first-time LL to do under 78M. For world-record tests (exponents > 332M), the smallest POT that can handle them is 32M, which is also overkill (in my estimation, a 32M FFT may handle exponents up to 550M). So it seems that world-record tests are better handled by a non-POT FFT. In addition, it would probably not be such a good idea to spend a large amount of time (on huge exponents) with a new program that has seen limited testing.
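The exponent limits quoted above come down to how many bits each FFT word has to carry. Here is a minimal sketch, assuming the bits/word figure is simply the exponent divided by the number of FFT words; that assumption reproduces the "18.36 bits/word" gpuOwL prints for 77002759 in post #38. The function name is made up for illustration and is not part of gpuOwL.

```python
def bits_per_word(exponent: int, fft_k: int) -> float:
    """Average number of bits of the residue carried by each FFT word.

    fft_k is the FFT size in "K" units, e.g. 4096 for "FFT 4096K".
    Assumption: one word per FFT element, so words = fft_k * 1024.
    """
    words = fft_k * 1024
    return exponent / words


if __name__ == "__main__":
    # The two exponents tested later in this thread, both on FFT 4096K.
    for exponent in (77002759, 76453229):
        print(exponent, f"{bits_per_word(exponent, 4096):.2f} bits/word")

    # The estimated 550M ceiling for a 32M FFT (32M = 32 * 1024 K).
    print(550_000_000, f"{bits_per_word(550_000_000, 32 * 1024):.2f} bits/word")
```

The 32M case comes out at a lower bits/word than the 77M case on 4096K, which fits the usual expectation that larger FFTs can hold fewer bits per word before round-off error gets too large.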
 2017-04-26, 08:59 #37 preda "Mihai Preda" Apr 2015 5·271 Posts

gpuOwL stop / resume

On every logstep (20k iterations), gpuOwL writes a checkpoint to a file save-N.bin (moving the previous file to save-N.old). The program can be safely stopped/killed at any time. Upon restart, it will look for a checkpoint file for the given exponent, and continue from there if found.

The checkpoint file starts with a human-readable header, like this:

LL1 42643801 160000 1024 2048 0

with the values meaning: file-signature, exponent, iteration, width, height, offset. The header is followed by a newline, a ctrl-Z character, and a binary dump of the words.

These save files can be safely moved around. If the file is deleted or moved away, the progress is lost and the program starts again from iteration 0.
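For anyone who wants to inspect a save file without running the program, the header described above can be read with a few lines of Python. This is only a sketch based on the description in this post: the class and function names are invented here, and the layout of the binary payload is not specified in the thread, so it is returned untouched.

```python
from typing import NamedTuple, Tuple


class CheckpointHeader(NamedTuple):
    signature: str
    exponent: int
    iteration: int
    width: int
    height: int
    offset: int


def parse_checkpoint(data: bytes) -> Tuple[CheckpointHeader, bytes]:
    """Split a save file into its text header and the raw binary payload."""
    line, newline, rest = data.partition(b"\n")
    if not newline:
        raise ValueError("no newline found after the header line")
    if rest[:1] != b"\x1a":  # ctrl-Z marks the end of the text part
        raise ValueError("missing ctrl-Z after the header line")
    fields = line.decode("ascii").split()
    if len(fields) != 6 or fields[0] != "LL1":
        raise ValueError(f"unexpected header: {line!r}")
    header = CheckpointHeader(fields[0], *(int(f) for f in fields[1:]))
    return header, rest[1:]


if __name__ == "__main__":
    # Header from the post, followed by a dummy 8-byte payload.
    sample = b"LL1 42643801 160000 1024 2048 0\n\x1a" + bytes(8)
    header, payload = parse_checkpoint(sample)
    print(header)
    print(len(payload), "payload bytes")
```

To try it on a real checkpoint, read the file in binary mode and pass the bytes to `parse_checkpoint`.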
 2017-04-26, 14:46 #38 LaurV Romulan Interpreter Jun 2011 Thailand 10010010100011₂ Posts

Haha, nice avatar.

To the subject: As promised, Victor sent me his build. I gave up on doing mine: I found out I have some old tools and no time to renew them, but I will resume the trials as soon as time allows.

Let's start by getting an assignment in 77M, to avoid wasting precious cycles, as the Owl only knows the 4096K FFT. For a start we got M77002759. Good.

For comparison, we tried to give it a run with clLucas first, to see what we are fighting against. As we hadn't used this machine for testing for a while, we first had an unsuccessful struggle to convince clLucas to stick with the FFT size. When we do not specify the FFT size on the command line, it works for a long while deciding which FFT is best (it starts much lower), and every time ends with a "wrong" one, i.e. above 4096K. We gave up after it decided the error was too big and increased the FFT, regardless of what we were doing. Score: 1-0 for clLucas against us.

The point is that the next FFT it wants to use runs at about half the speed of the POT one. This is easy to see when it prints the test lines at the beginning, every 100 iterations: the lines come less often after it increases the FFT, about two seconds per line against one second per line before. Grrrr...

We decided to forget about it, shoot the dead horse, and get a new, smaller assignment. This time we got M76453229, and clLucas happily decided not to increase the FFT. Gooooood.....

Then we did the same run with gpuOwl, and we decided to run both exponents, just to see the difference. gpuOwl is indeed faster, but we have to complain about it zeroing half of the residue (of course, this is just a printing bug, we assume, or maybe a compilation bug).

Code:
e:\99 - Prime\clLucas>cllucas_x64 -c 2000 -threads 256 -f 4194304 -s backups
Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti
Build Options are : -D KHR_DP_EXTENSION
CL_DEVICE_NAME Tahiti
CL_DEVICE_VENDOR Advanced Micro Devices, Inc.
CL_DEVICE_VERSION OpenCL 1.2 AMD-APP (2348.3)
CL_DRIVER_VERSION 2348.3
CL_DEVICE_MAX_COMPUTE_UNITS 32
CL_DEVICE_MAX_CLOCK_FREQUENCY 1050
CL_DEVICE_GLOBAL_MEM_SIZE 3221225472
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1
mkdir: cannot create directory backups': File exists
Starting M77002759 fft length = 4096K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.16452, max error = 0.21875
Iteration 200, average error = 0.19164, max error = 0.21875
Iteration 300, average error = 0.20540, max error = 0.23438
Iteration 400, average error = 0.21530, max error = 0.25000
Iteration 500, average error = 0.22243, max error = 0.28125
Iteration 600, average error = 0.23223, max error = 0.28125
Iteration 700, average error = 0.23923, max error = 0.28125
Iteration 800, average error = 0.24449, max error = 0.28125
Iteration 900, average error = 0.24857, max error = 0.28125
Iteration 1000, average error = 0.25181 >= 0.25 (max error = 0.28125), increasing FFT length and restarting.
Starting M77002759 fft length = 4480K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.02343, max error = 0.03125
Iteration 200, average error = 0.02738, max error = 0.03320
Iteration 300, average error = 0.02932, max error = 0.03320
Iteration 400, average error = 0.03035, max error = 0.03516
Iteration 500, average error = 0.03131, max error = 0.03516
Iteration 600, average error = 0.03195, max error = 0.03516
Iteration 700, average error = 0.03241, max error = 0.03516
Iteration 800, average error = 0.03276, max error = 0.03516
Iteration 900, average error = 0.03302, max error = 0.03516
Iteration 1000, average error = 0.03323 < 0.25 (max error = 0.03516), continuing test.
Iteration 2000 M( 77002759 )C, 0x9a2f030ffaeda2c4, n = 4480K, clLucas v1.04 err = 0.0371 (0:28 real, 14.0000 ms/iter, ETA 299:26:41)
Iteration 4000 M( 77002759 )C, 0xed3b849574c96289, n = 4480K, clLucas v1.04 err = 0.0371 (0:27 real, 13.9316 ms/iter, ETA 297:58:24)
Iteration 6000 M( 77002759 )C, 0x6d71868cfc75973d, n = 4480K, clLucas v1.04 err = 0.0371 (0:28 real, 13.9408 ms/iter, ETA 298:09:45)
Unknown signal caught, writing checkpoint.
Estimated time spent so far: 1:50
e:\99 - Prime\clLucas>cllucas_x64 -c 2000 -threads 256 -f 4194304 -s backups
Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti
Build Options are : -D KHR_DP_EXTENSION
CL_DEVICE_NAME Tahiti
CL_DEVICE_VENDOR Advanced Micro Devices, Inc.
CL_DEVICE_VERSION OpenCL 1.2 AMD-APP (2348.3)
CL_DRIVER_VERSION 2348.3
CL_DEVICE_MAX_COMPUTE_UNITS 32
CL_DEVICE_MAX_CLOCK_FREQUENCY 1050
CL_DEVICE_GLOBAL_MEM_SIZE 3221225472
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1
mkdir: cannot create directory backups': File exists
Starting M76453229 fft length = 4096K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.14852, max error = 0.21875
Iteration 200, average error = 0.18363, max error = 0.21875
Iteration 300, average error = 0.19534, max error = 0.21875
Iteration 400, average error = 0.20119, max error = 0.21875
Iteration 500, average error = 0.20470, max error = 0.21875
Iteration 600, average error = 0.20704, max error = 0.21875
Iteration 700, average error = 0.20872, max error = 0.21875
Iteration 800, average error = 0.20997, max error = 0.21875
Iteration 900, average error = 0.21095, max error = 0.21875
Iteration 1000, average error = 0.21172 < 0.25 (max error = 0.21875), continuing test.
Iteration 2000 M( 76453229 )C, 0xbb1d6624a3ab7bf8, n = 4096K, clLucas v1.04 err = 0.2188 (0:12 real, 5.7752 ms/iter, ETA 122:38:34)
Iteration 4000 M( 76453229 )C, 0xbaefb39c3b82c9d1, n = 4096K, clLucas v1.04 err = 0.2188 (0:12 real, 5.7400 ms/iter, ETA 121:53:32)
Iteration 6000 M( 76453229 )C, 0x580c90a32431aeea, n = 4096K, clLucas v1.04 err = 0.2188 (0:11 real, 5.7350 ms/iter, ETA 121:46:58)
Iteration 8000 M( 76453229 )C, 0x034cc7c190b474a6, n = 4096K, clLucas v1.04 err = 0.2188 (0:12 real, 5.7300 ms/iter, ETA 121:40:24)
Iteration 10000 M( 76453229 )C, 0x40e22bdb628637bd, n = 4096K, clLucas v1.04 err = 0.2188 (0:11 real, 5.7100 ms/iter, ETA 121:14:44)
Unknown signal caught, writing checkpoint.
Estimated time spent so far: 1:02
e:\99 - Prime\clLucas>cd ..\gpuowl
e:\99 - Prime\gpuOwl>gpuowl -logstep 2000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 77002759 (18.36 bits/word) at iteration 0
OpenCL setup: 960 ms
00002000 / 77002759 [0.00%], ms/iter: 4.765, ETA: 4d 05:55; 00000000faeda2c4 error 0.238075 (max 0.238075)
00004000 / 77002759 [0.01%], ms/iter: 4.750, ETA: 4d 05:36; 0000000074c96289 error 0.236749 (max 0.238075)
00006000 / 77002759 [0.01%], ms/iter: 4.745, ETA: 4d 05:29; 00000000fc75973d error 0.234828 (max 0.238075)
00008000 / 77002759 [0.01%], ms/iter: 4.745, ETA: 4d 05:29; 000000001f37b1fb error 0.24425 (max 0.24425)
00010000 / 77002759 [0.01%], ms/iter: 4.745, ETA: 4d 05:29; 00000000b0f55ab1 error 0.235088 (max 0.24425)
^C
[changing the lines' order in worktodo]
e:\99 - Prime\gpuOwl>gpuowl -logstep 2000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 0
OpenCL setup: 1000 ms
00002000 / 76453229 [0.00%], ms/iter: 4.770, ETA: 4d 05:18; 00000000a3ab7bf8 error 0.199004 (max 0.199004)
00004000 / 76453229 [0.01%], ms/iter: 4.745, ETA: 4d 04:46; 000000003b82c9d1 error 0.20329 (max 0.20329)
00006000 / 76453229 [0.01%], ms/iter: 4.740, ETA: 4d 04:39; 000000002431aeea error 0.208951 (max 0.208951)
00008000 / 76453229 [0.01%], ms/iter: 4.750, ETA: 4d 04:52; 0000000090b474a6 error 0.206401 (max 0.208951)
00010000 / 76453229 [0.01%], ms/iter: 4.745, ETA: 4d 04:45; 00000000628637bd error 0.203467 (max 0.208951)
^C
e:\99 - Prime\gpuOwl>

Note that 4 days, 5 hours, is 101 hours. The difference in time between the two gpuOwl runs seems normal, as there are fewer iterations to do, even though the iterations themselves take the same time (as the same FFT is used). Compared with clLucas's roughly 121 hours for M76453229, this is about a 20% speed increase, for this build and this card.
Next step would be to try to compile our own version and, if (or when) the zeroing error is fixed, to finish these tests and compare the residues with what the Titan/cudaLucas gives. If successful, we will report both as LL and DC. Yes, we know this will infuriate MadPoo, who will have to triple-check.

Last fiddled with by LaurV on 2017-04-26 at 14:50
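The ETA and speed-up figures in the post above can be checked from the ms/iter values in the logs. A minimal sketch, assuming the ETA is just the remaining iteration count times the current ms/iter (the thread does not say how either program actually computes it); the function name is made up for illustration.

```python
def eta_hours(exponent: int, iteration: int, ms_per_iter: float) -> float:
    """Estimated hours left, assuming one LL iteration per unit of exponent
    and a constant time per iteration."""
    remaining = exponent - iteration
    return remaining * ms_per_iter / 1000.0 / 3600.0


if __name__ == "__main__":
    # Values copied from the logs at iteration 10000 of M76453229.
    gpuowl = eta_hours(76453229, 10000, 4.745)
    cllucas = eta_hours(76453229, 10000, 5.7100)
    print(f"gpuOwL : {gpuowl:6.1f} h")
    print(f"clLucas: {cllucas:6.1f} h")
    print(f"clLucas/gpuOwL time ratio: {cllucas / gpuowl:.2f}")
```

Both estimates land within a minute or so of the ETAs the two programs print, and the ratio is where the "about 20%" comes from.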
 2017-04-26, 15:34 #39 LaurV Romulan Interpreter Jun 2011 Thailand 83×113 Posts

Additional "complaints", besides the fact that the little thief is stealing the hex digits of my residue:

1. When something like this happens (yes, we can force it, by giving unrealistic work to the card at the same time it is running gpuOwl):

Code:
Error 4 is too large!00030000 / 76453229 [0.04%], ms/iter: 5.561, ETA: 4d 22:03; 00000000eb2493b3 error 4 (max 4)

then the error should be saved in the file too and carried along at the next restart (use a byte in the header, or so), or the program should exit and resume from an earlier saved file. Otherwise, first, it is wasting precious time (yes, the residue is wrong: the correct one ends in 2BBB1710, and it continues with the wrong calculation, wasting cycles in vain), and second, the user can restart the program (and the error is lost), so he never knows he has problems with his hardware. Most of us use some batch file like

Code:
:loop
gpuOwl
goto loop

to run the tools, because sometimes they crash, or the card crashes, and the calculation has to resume without waiting until I come home from work in the evening. In this case, gpuOwl will continue with the wrong residue, and the error will not be carried over after the restart, so the user will have no idea (who's reading the logs?).

2. Since we are here, please do not delete the partial residue files. Leave them there, and use the iteration number and the residue as part of the file name (cudaLucas style). The user can delete them manually if he wants. This is useful when we compare files and residues between different runs, different cards, different programs, as they all have different file structures, different shift counts, etc. The two files, one produced by cudaLucas and one by clLucas, are not the same inside, but if they are both called "s76453229.30001.9d0222732bbb1710.txt" (real file name here!), then I know that both programs are doing fine. If the names differ, my batch file can automatically parse the two backup folders, kill the programs, and resume from a previous iteration (by renaming the files to cxxxxx and txxxxx, see cudaLucas), without looking inside the files (I am not interested in the internal structure/kitchen). That way we don't lose any time walking paths that start with wrong residues and then backtracking.

3. Save those residues in a sub-folder, call it backup or so (later it can be given in an ini file), so they can be handled in bulk, deleted, etc., without inadvertently deleting the exe file or some library in the folder the program is running in.

Then we are good to go...

Last fiddled with by LaurV on 2017-04-26 at 16:33
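The folder comparison described in point 2 could look roughly like the sketch below. It is written in Python rather than as a batch file, and it assumes the name pattern s<exponent>.<iteration>.<16 hex digits>.txt, inferred from the single real file name quoted in the post. The function names and the second pair of file names in the demo are made up.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Pattern inferred from "s76453229.30001.9d0222732bbb1710.txt".
NAME_RE = re.compile(r"^s(\d+)\.(\d+)\.([0-9a-fA-F]{16})\.txt$")


def parse_name(name: str) -> Optional[Tuple[int, int, str]]:
    """Return (exponent, iteration, residue) or None if the name does not match."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3).lower()


def residues_by_iteration(names: Iterable[str], exponent: int) -> Dict[int, str]:
    """Map iteration -> residue for all backup files of one exponent."""
    out: Dict[int, str] = {}
    for name in names:
        parsed = parse_name(name)
        if parsed is not None and parsed[0] == exponent:
            out[parsed[1]] = parsed[2]
    return out


def last_agreement(names_a: Iterable[str], names_b: Iterable[str],
                   exponent: int) -> Optional[int]:
    """Latest iteration at which both runs have a backup with the same residue,
    stopping at the first iteration where they disagree."""
    a = residues_by_iteration(names_a, exponent)
    b = residues_by_iteration(names_b, exponent)
    agreed = None
    for iteration in sorted(set(a) & set(b)):
        if a[iteration] != b[iteration]:
            break
        agreed = iteration
    return agreed


if __name__ == "__main__":
    run_a = ["s76453229.30001.9d0222732bbb1710.txt",
             "s76453229.60001.aaaaaaaaaaaaaaaa.txt"]  # second name is invented
    run_b = ["s76453229.30001.9d0222732bbb1710.txt",
             "s76453229.60001.bbbbbbbbbbbbbbbb.txt"]  # second name is invented
    print("resume both runs from iteration", last_agreement(run_a, run_b, 76453229))
```

In practice the two name lists would come from `os.listdir()` on the two backup folders. The kill-and-rename step is left out because it depends on how each program names the checkpoint it resumes from.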
 2017-04-26, 16:02 #40 LaurV Romulan Interpreter Jun 2011 Thailand 24A3₁₆ Posts

On the bright side, we timed the little owl again, this time with a stopwatch in hand, to see whether it cheats when displaying ETAs. It does not. (You have to agree that it is possible, and we are paranoid by training, nothing personal: we can produce an LL test program that says it tests a 77M exponent in 55 hours, but you let it run and run, and after 55 hours it is only one third done, and it will take another 110 hours to do the other two thirds, in spite of the fact that it now shows only 33 hours to go. Not the case for gpuOwl, but you get the point.) So, we made sure.

Using it will turn our card from ~42 GHz-Days per Day (that is indeed its score at this range, despite James' site saying that it only scores between 35 and 40) into a 20% faster card, as we have seen above, which is a bit over 50 GHzD/D, on par with some "good" gtx 1070 or quadro plex, not to mention the boost it will give to lots of "fury" cards that are already at 50 or higher. This matches the displayed time: the 77M exponent we tested is worth exactly 217 GHzDays, and the ETA was (roughly) 4 days 5 hours, i.e. 4.2 days, which makes a bit over 51 GHzDays per Day. Yay!

Therefore, if you can find the time, resources, motivation, whatever, to continue improving it, you will make a lot of people here happy.... Two thumbs up...

Time for bed, midnight here (almost)...

Last fiddled with by LaurV on 2017-04-26 at 16:04
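The throughput estimate above is credit divided by run time, and the two ways of getting at it in the post can be put side by side. A small sketch using only the numbers quoted there (217 GHz-days of credit, an ETA of about 4 days 5 hours, and a ~42 GHzD/D baseline with a 20% speed-up); the function name is invented for illustration.

```python
def ghz_days_per_day(credit_ghz_days: float, days: float, hours: float = 0.0) -> float:
    """Throughput in GHz-days per day for a test worth `credit_ghz_days`
    that takes `days` days plus `hours` hours of wall-clock time."""
    return credit_ghz_days / (days + hours / 24.0)


if __name__ == "__main__":
    from_eta = ghz_days_per_day(217, 4, 5)  # credit / displayed ETA
    from_speedup = 42 * 1.20                # old score times the 20% gain
    print(f"from ETA     : {from_eta:.1f} GHzD/D")
    print(f"from speed-up: {from_speedup:.1f} GHzD/D")
```

The two estimates differ by about one GHzD/D, which is within the slack of the "~42" and "about 20%" figures they start from.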
2017-04-26, 17:30   #41
kracker

"Mr. Meeseeks"
Jan 2012
California, USA

3²·241 Posts

Quote:
 Originally Posted by VictordeHolland Hi, I forgot to mention, but my card is a HD7950, which only supports OpenCL1.2 https://en.wikipedia.org/wiki/Radeon_HD_7000_Series At least gpuOwl detects it is a Tahiti OpenCL 1.2 AMD-APP 2079.5 device :). ...
Funny enough, although the HD7950 doesn't support OpenCL 1.2, the R9 280 which is a rebrand of the 7950 with a higher clock speed does... just driver things I guess....

Last fiddled with by kracker on 2017-04-26 at 17:30

2017-04-26, 18:52   #42
VictordeHolland

"Victor de Hollander"
Aug 2011
the Netherlands

2230₈ Posts

Quote:
 Originally Posted by kracker Funny enough, although the HD7950 doesn't support OpenCL 1.2, the R9 280 which is a rebrand of the 7950 with a higher clock speed does... just driver things I guess....

GCN 1st gen supports OpenCL1.2:
Tahiti chips (HD79xx, HD89xx, R9 280(X))

GCN 2nd gen supports OpenCL2.0, used in:
Bonaire chip (HD7790, HD8770, R7 260X, R7 360)
Hawaii chip (R9 290(X) , R9 390(X))

GCN 3rd gen also supports OpenCL2.0:
Tonga (R9 285, R9 380(X))
Fiji (R9 Fury (X), R9 Nano)

Seems highly plausible that the OpenCL support is dependent on the GCN generation.

 2017-04-26, 23:05 #43 preda "Mihai Preda" Apr 2015 5×271 Posts

@LaurV, that's a detailed analysis!

It seems your build was not fresh though: the "error too large" not stopping is already fixed. The zeroed residue is also "maybe" fixed already. If you can get a fresh build, I'll know if the residue is indeed now printed correctly (or look more into that if not).

I still don't understand how you got "4" for error.. it's supposed to go only up to 0.5..

I'll think about the structure of the save-files. I probably need to look a bit at what CUDALucas does there. But, to keep *all* the old checkpoints around? -- each file is 16MB. If you get 4000 of those, that'd be 64GB, probably too much.
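The storage estimate above can be reproduced from figures given earlier in the thread: one checkpoint per logstep of 20k iterations, and a 4096K-word FFT. The sketch assumes 4 bytes per saved word, which is only inferred from the quoted 16MB file size and is not stated anywhere in the thread; the function name is made up.

```python
def checkpoint_storage(exponent: int, logstep: int, fft_words: int,
                       bytes_per_word: int = 4):
    """Return (number of checkpoints, bytes per checkpoint, total bytes)
    if every checkpoint of a full LL test were kept.

    bytes_per_word=4 is an assumption inferred from the 16MB file size.
    """
    count = exponent // logstep        # one checkpoint per logstep
    size = fft_words * bytes_per_word  # the small text header is ignored
    return count, size, count * size


if __name__ == "__main__":
    count, size, total = checkpoint_storage(77002759, 20000, 4096 * 1024)
    print(f"{count} checkpoints of {size / 2**20:.0f} MiB each")
    print(f"total: {total / 1e9:.1f} GB")
```

For a 77M exponent this gives a little under 4000 files, so the round "4000 of those, 64GB" in the post is the right order of magnitude. A larger logstep would shrink the total in proportion.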
 2017-04-27, 06:14 #44 airsquirrels "David" Jul 2015 Ohio 11×47 Posts

I should have some time this weekend to do ISA dumps and try upgrading drivers / APP SDK to new versions on one of my FuryX systems.

Also, most consumer cards do occasionally have errors. I have seen them less often on AMD cards than NVIDIA, but they do happen. If it is helpful, I have a system with 3x W9100s in it, which have ECC memory and (ideally) do not exhibit hardware errors (100s of double checks agree). If you add a way to select the GPU, I can run a few double-check exponents on those cards to check stability.
