mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2017-04-05, 19:57   #2586
flashjh
"Jerry", Nov 2011, Vancouver, WA
10001100011₂ Posts

Quote:
Originally Posted by kriesel
Hi,

Thanks for a quick reply. I am more interested in updated Windows binaries...
What video card are you running CUDALucas on? I can build you an updated version for now and then work to incorporate more changes and update all binaries soon.

Quote:
Originally Posted by LaurV
does it get any faster? Otherwise we are happy with the current version we use, and we don't want to fix it (i.e. upgrade) as long as it works...
Only faster if CUDA 8 is faster on your system
Old 2017-04-05, 21:51   #2587
kriesel
"TF79LL86GIMPS96gpu17", Mar 2017, US midwest
37·127 Posts
Re: Which version to update first? (or at all?)

Quote:
Originally Posted by flashjh
Hello all,

From what I can see the updates from post 2464 were not incorporated on sourceforge, so they're not in any of the code now.

It's been over a year since all that discussion went on, are there still issues with residues?

Either way, I have the code updated, along with some miscellaneous changes. The biggest change is that CUDA 8 does not support 32-bit builds of CUDALucas, nor compute capability < 2.0.

What versions is everyone using now? I don't mind getting set up to compile versions < 8, but I don't want to do it if no one is using them anymore.

So, let me know what architecture everyone needs and I'll make it happen
I'm curious what those miscellaneous code changes might be, or rather what feature changes they implement.
There are still issues with bad intermediate residues. That cost me days of run time early on.
I looked some more at ATH's benchmarks varying CUDA version and fft length (post 2535, page 231); see the updated attachment. Granted, it's risky to read much into slight differences, but I'm using the data that's available, and it should be pretty good since he ran 20 iterations. V8 has most of the slowest timings in that table and very few of the fastest. V6.5 has the most fastest timings, most of them in the top 1% for an fft length, and none of the slowest. That's for that card and that system; the driver version is unknown, or at least not in my notes. ATH's benchmarks are a comparison among 64-bit builds.
Benchmarks reported in post 2534 showed a speed advantage for 32-bit with driver v373, which would put V8 at an additional speed disadvantage. That was for a smallish Mersenne prime, ~2.97M, so a compact fft length. That test indicated V8 was a little slower than V4.2, and ATH found V4.2 too slow to include in his benchmarks.
My faster cards (maybe I should say less slow) are compute capability 2.0, which you state V8 does not support, but the driver timeout issue does. :P They also appear to be about to drop off the bottom of the list of supported products as NVIDIA continues to add new card support.
So other things being equal, which they never are, I think 32-bit V6.5 would be a pretty good candidate.
I'm in the process of doing benchmark timings for my old cards, versus version, analogous to what ATH has done, but got distracted by some hardware issues on other systems. Also, the driver timeout issue is derailing my benchmarking on one card type, so I'm preparing to downgrade considerably in driver version and start that one over.
The driver timeout issue seems to get worse as I step up the driver version on that card. Or maybe it's a time trend. It is not temperature.
I'm thinking of doing benchmark timing versus driver level for my cards too. Has anyone reported or seen a noticeable effect of that?
Thanks!
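The fastest/slowest tally described above can be sketched as follows. This is illustrative Python with made-up timings, not ATH's actual data, and `fastest_slowest_counts` is a hypothetical helper, not part of CUDALucas.

```python
# Given per-version iteration timings (ms) at each fft length, count how
# often each CUDA toolkit version is the fastest and the slowest.
# The sample numbers below are invented for illustration.

def fastest_slowest_counts(timings):
    """timings: {version: {fft_length: ms}}; returns {version: (n_fastest, n_slowest)}."""
    counts = {v: [0, 0] for v in timings}
    ffts = next(iter(timings.values())).keys()
    for fft in ffts:
        by_version = {v: t[fft] for v, t in timings.items()}
        best = min(by_version, key=by_version.get)
        worst = max(by_version, key=by_version.get)
        counts[best][0] += 1   # fastest at this fft length
        counts[worst][1] += 1  # slowest at this fft length
    return {v: tuple(c) for v, c in counts.items()}

sample = {
    "6.5": {1024: 5.01, 1152: 5.60, 1280: 6.31},
    "8.0": {1024: 5.20, 1152: 5.75, 1280: 6.52},
}
print(fastest_slowest_counts(sample))
```

Run over a full table like ATH's, a tally like this makes the "V8 slowest, V6.5 fastest" pattern easy to quantify.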
Attached Thumbnails: titan-black-cuda-timings.png (63.3 KB)

Last fiddled with by kriesel on 2017-04-05 at 21:55
Old 2017-04-05, 22:11   #2588
kriesel
"TF79LL86GIMPS96gpu17", Mar 2017, US midwest
37·127 Posts

Quote:
Originally Posted by flashjh
What video card are you running CUDALucas on? I can build you an updated version for now and then work to incorporate more changes and update all binaries soon.
Sounds great. Quadro 2000 and GeForce GTX480 for now. More follows.

Last fiddled with by kriesel on 2017-04-05 at 22:25 Reason: awful formatting, replacing with separate attachment in following post
Old 2017-04-05, 22:21   #2589
kriesel
"TF79LL86GIMPS96gpu17", Mar 2017, US midwest
37·127 Posts

Quote:
Originally Posted by flashjh
What video card are you running CUDALucas on? I can build you an updated version for now and then work to incorporate more changes and update all binaries soon.
I'm running a Quadro 2000 and a GeForce GTX480 currently, and contemplating adding some other models later. Running 1024 threads on either reliably causes bad residues for current exponents, and for some, though not the initial, -r checks. That left me with the question of how to know which thread counts or other parameters are reasonable for a particular GPU type. Is there a better way than simply testing to determine good parameters? Or are both of these GPUs somehow defective? Perhaps threadbench could be modified to check for and flag the occurrence of pathological values after completing its individual timing loops, and then exclude the flagged combinations from selection as the optimum.
Attached Files: tuning and odd residues.txt (7.3 KB)
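One way the "which thread counts are reasonable" question might be screened, sketched in Python under the assumption that some harness runs the self-test at each candidate thread count and collects the reported residues. The residue strings and thread counts below are made up; nothing here is existing CUDALucas code.

```python
# Keep only thread counts whose self-test residue matches a known-good
# reference value. KNOWN_GOOD is a hypothetical placeholder residue.

KNOWN_GOOD = "0x2c26b46b68ffc68f"  # hypothetical reference residue

def usable_thread_counts(results):
    """results: {threads: residue string}; returns the sorted thread
    counts whose self-test residue matches the reference."""
    return sorted(t for t, r in results.items() if r == KNOWN_GOOD)

observed = {
    128: "0x2c26b46b68ffc68f",
    256: "0x2c26b46b68ffc68f",
    1024: "0xfffffffffffffffd",  # the repeating bad residue reported above
}
print(usable_thread_counts(observed))
```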

Last fiddled with by kriesel on 2017-04-05 at 22:37
Old 2017-04-05, 22:44   #2590
kriesel
"TF79LL86GIMPS96gpu17", Mar 2017, US midwest
37·127 Posts
Ambitious fft length crash on Quadro 2000

Quote:
Originally Posted by flashjh
What video card are you running CUDALucas on? I can build you an updated version for now and then work to incorporate more changes and update all binaries soon.
On a Quadro 2000, testing the limits of fft length crashes the program without completing any benchmarks, as far as I can tell. This is repeatable on the only Quadro 2000 I've run it on so far.

$ cudalucas2.05.1-cuda4.2-windows-x64 -cufftbench 1 65536 5 >>bigfftbench.txt
CUDALucas.cu(1055) : cudaSafeCall() Runtime API error 2: out of memory.

On a GeForce GTX480, it will run many fft lengths and output timings to stdout, then terminate before reaching 65536 or producing the fft lengths file; at least it does on mine. Scaling the maximum back to what it reached on stdout then produces a file.

The Quadro 2000 has 1 GB of VRAM; the GTX480 has 1.5 GB.
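A back-of-envelope estimate suggests why the crash is plausible, assuming the benchmark's fft lengths are in units of K (multiples of 1024 double-precision points, as CUDALucas reports them) and guessing at three working buffers; the real CUDALucas allocation pattern may differ.

```python
# Rough VRAM footprint for an fft length given in K, assuming 8-byte
# double-precision elements and an assumed number of working buffers.
# This is only an order-of-magnitude check, not CUDALucas's actual
# allocation logic.

def fft_vram_mb(fft_k, buffers=3, bytes_per_elem=8):
    """Estimated memory in MB for an fft of fft_k * 1024 points."""
    return fft_k * 1024 * bytes_per_elem * buffers / (1024 * 1024)

print(fft_vram_mb(65536))              # 1536.0 MB with three buffers
print(fft_vram_mb(65536, buffers=1))   # 512.0 MB for even a single buffer
```

At a 65536K fft length even this rough estimate is well beyond the Quadro 2000's 1 GB, so an out-of-memory failure near the top of that range is unsurprising.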

Last fiddled with by kriesel on 2017-04-05 at 22:48
Old 2017-04-07, 05:56   #2591
kriesel
"TF79LL86GIMPS96gpu17", Mar 2017, US midwest
37·127 Posts
Anomalous 1024-square-threads timing examples

Examples of bad 1024-thread timings (this occurs on both the Quadro 2000, compute capability 2.1, and the GeForce GTX480, CC 2.0; the timings are implausibly small, get selected, and then produce bad residues such as a repeating 0xfffffffffffffffd). Above a 1024K fft length, the 1024-thread timings are more than a factor of two faster than for any other thread count. (If I recall correctly, at very short fft lengths the difference disappears, and at large fft lengths it becomes even more dramatic. So for lengths currently useful for first-time testing or double checking, a bad 1024-thread timing could easily be screened out by a modified program.)
Attached Thumbnails: 1024s as fraction of functional minimum timings.png (50.2 KB)
Attached Files: pathological 1024 thread timings.txt (6.2 KB)
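The screening suggested above could look something like this hypothetical post-processing step: flag any thread count whose timing is less than half the best of the other thread counts at the same fft length. The timings (in ms) are illustrative.

```python
# Flag implausibly fast thread-count timings so they can be excluded
# from selection as the optimum. A timing below ratio * (minimum of the
# other thread counts) is treated as suspect.

def suspect_thread_counts(timings, ratio=0.5):
    """timings: {threads: ms}; returns the thread counts whose timing is
    below ratio times the minimum of all the other timings."""
    flagged = []
    for t, ms in timings.items():
        others = [m for k, m in timings.items() if k != t]
        if others and ms < ratio * min(others):
            flagged.append(t)
    return flagged

bench = {128: 11.9, 256: 11.4, 512: 11.6, 1024: 4.2}  # 1024 is anomalous
print(suspect_thread_counts(bench))
```

The factor-of-two gap reported above makes a threshold like this easy to apply without risking false positives on legitimately fast configurations.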

Last fiddled with by kriesel on 2017-04-07 at 06:11
Old 2017-04-07, 06:05   #2592
kriesel
"TF79LL86GIMPS96gpu17", Mar 2017, US midwest
37·127 Posts
Some threadbench combinations timed twice, minimum missed
Default Some threadbench combinations timed twice, minimum missed

threadbench could be accelerated a bit by benchmarking each squaring-and-slice combination once, rather than timing one combination per fft length twice.

When a combination is timed twice, the second run is sometimes slower than the first and apparently replaces the first timing. The difference can be large enough that the parameters with the minimum time are not the ones stored as the combination to use. The individual time savings are small. In the attached file, the minimum timing per fft length is marked with an *.
Attached Files: not selecting minimum iteration time.txt (3.4 KB)
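A sketch of the proposed change, in Python rather than the actual CUDA code: memoize each (fft length, threads) combination so it is timed exactly once, and keep the true minimum per fft length instead of letting a slower repeat overwrite an earlier, faster run. `timing` stands in for the real benchmark loop.

```python
# Benchmark each (fft, threads) combination once and select the true
# minimum per fft length. Duplicate combinations in the input reuse the
# cached timing rather than being re-timed.

def best_per_fft(combos, timing):
    """combos: iterable of (fft, threads); timing(fft, threads) -> ms.
    Returns {fft: (threads, ms)} for the minimum timing per fft length."""
    seen = {}
    best = {}
    for fft, thr in combos:
        if (fft, thr) not in seen:          # never time a combination twice
            seen[(fft, thr)] = timing(fft, thr)
        ms = seen[(fft, thr)]
        if fft not in best or ms < best[fft][1]:
            best[fft] = (thr, ms)
    return best

fake = {(1024, 128): 6.0, (1024, 256): 5.5, (1152, 128): 6.4, (1152, 256): 6.8}
result = best_per_fft(list(fake) + list(fake), lambda f, t: fake[(f, t)])
print(result)
```

Because the cache returns the first measurement, a noisy second pass can never displace the timing that was actually observed to be the minimum.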

Last fiddled with by kriesel on 2017-04-07 at 06:12
Old 2017-04-09, 21:31   #2593
flashjh
"Jerry", Nov 2011, Vancouver, WA
1,123 Posts

For now, what version do you want compiled for your tests with the new code?
Old 2017-04-16, 04:09   #2594
kriesel
"TF79LL86GIMPS96gpu17", Mar 2017, US midwest
4699₁₀ Posts
Next build

Quote:
Originally Posted by flashjh
For now, what version do you want compiled for your tests with the new code?
I think 32-bit V6.5 would be a pretty good candidate.
Old 2017-04-17, 04:46   #2595
flashjh
"Jerry", Nov 2011, Vancouver, WA
1,123 Posts
CUDALucas 2.06beta

I incorporated the code changes listed and made some modifications to the supporting code -- no changes to any of the math. I'll upload the code tomorrow; it's getting late.

2.06beta is here

Lib files are here, if you need them

I have a 1050 Ti and was able to test all versions. I changed the way it compiles, because with the old way it would not run any version on my 1050 Ti except CUDA 8. When I switched to a 940M I was able to get >= 6.0 to run. Any version below the one that worked on my cards *always* produced 0 or 2 results during the self-test. NOTE: you might see a small delay on first startup of each CUDA version now, due to JIT compilation, but only if the binary doesn't include code for your GPU. So, it's working now, but I need a lot of testing done.

Anyone who was having issues with the bad residues before, please test these versions and let me know if you're able to make it give you bad results.

Everyone, let me know what you find that needs to be fixed and what you would like changed.

~Cheers

Last fiddled with by flashjh on 2017-04-17 at 04:53 Reason: fix cuda version
Old 2017-04-19, 02:52   #2596
kriesel
"TF79LL86GIMPS96gpu17", Mar 2017, US midwest
37·127 Posts
Feature request

Quote:
Originally Posted by flashjh
I incorporated the code changes listed and made some modifications to the supporting code -- no changes to any of the math. [...] Everyone, let me know what you find that needs to be fixed and what you would like changed.
For benchmarking, and more generally for keeping the output of various versions and systems straight, it would be helpful if, in addition to the GPU card information, the header output of -info, -fftbench, and -threadbench, the GPU fft-length and threads output files, and general program operation included the following items: 1) CUDALucas version; 2) CUDA level; 3) if possible, NVIDIA driver version; 4) operating system; 5) 32- or 64-bit; 6) system name.

Incorporate sanity checks into thread benchmarking: check for and flag the occurrence of pathological values after completing the individual timing loops, then exclude the flagged combinations from selection as the optimum.

Add to fft.txt a runtime-estimate column for the maximum exponent per fft length.

Add checks that the card at least meets the compute capability required, and that the driver supports the CUDA level CUDALucas was compiled for.

In prime95, a results line is tagged with the program version ID in just a few characters, like We4. I'd like to see something like that added to CUDALucas too (maybe at the far right, in case someone has a program that parses or pattern-matches the results lines).
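A minimal Python sketch of a header covering items 1-6 above. The CUDALucas version, CUDA level, and driver string are placeholder parameters (a real build would fill them in from its compile-time constants and the CUDA driver API); the OS, bitness, and host name come from the standard library.

```python
# Assemble a run header identifying the program build and the system,
# so benchmark and results files from different machines stay
# distinguishable. Version/CUDA/driver values are placeholders.
import platform
import struct

def run_header(program_version="2.06beta", cuda_level="8.0", driver="?"):
    return "\n".join([
        f"CUDALucas {program_version} | CUDA {cuda_level} | driver {driver}",
        f"OS: {platform.system()} {platform.release()}",
        f"bits: {struct.calcsize('P') * 8}",   # 32- vs 64-bit build of Python here
        f"host: {platform.node()}",
    ])

print(run_header())
```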

Last fiddled with by kriesel on 2017-04-19 at 02:58


Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.