#2586 | ||
"Jerry"
Nov 2011
Vancouver, WA
1,123 Posts |
Quote:
Quote:
#2587 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3·5·17·19 Posts |
Quote:
There are still issues with bad intermediate residues. That cost me days of run time early on.
I looked some more at ATH's benchmarks varying CUDA version and fft length (post 2535, page 231). See the updated attachment. Granted, it's risky comparing for slight differences, but I'm using the data that's available. It should be pretty good since he ran 20 iterations. V8 has most of the slowest timings in that table, and very few of the fastest timings. V6.5 has the most fastest timings, the most in the top 1% for an fft length, and none of the slowest. That is for that card and that system; the driver version is unknown, or at least not in my notes.
ATH's benchmarks are a comparison among 64-bit builds. Benchmarks reported in post 2534 showed a speed advantage for 32-bit, for driver v 373. That would put V8 at an additional speed disadvantage. That was for a smallish Mersenne prime ~2.97M, so some compact fft length. That test indicated V8 was a little slower than V4.2. ATH found V4.2 too slow to include in his.
My faster cards (maybe I should say less-slow) are compute capability 2.0, which you state V8 does not support, but the driver timeout issue does :P They also appear to be about to drop off the bottom of the list of supported products as NVIDIA continues to add new card support. So other things being equal, which they never are, I think 32-bit V6.5 would be a pretty good candidate.
I'm in the process of doing benchmark timings for my old cards versus version, analogous to what ATH has done, but got distracted by some hardware issues on other systems. The driver timeout issue is also derailing my benchmarking on one card type, so I'm preparing to downgrade considerably on driver version and start that one over. The driver timeout issue seems to be getting worse as I step up in version on that card. Or maybe it's a time trend. It is not temperature.
I'm thinking of doing benchmark timing versus driver level for my cards too. Has anyone reported or seen a noticeable effect of that? Thanks!
Last fiddled with by kriesel on 2017-04-05 at 21:55
#2588 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1001011101101₂ Posts |
Sounds great. Quadro 2000 and GeForce GTX480 for now. More follows.
Last fiddled with by kriesel on 2017-04-05 at 22:25 Reason: awful formatting, replacing with separate attachment in following post |
#2589 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3·5·17·19 Posts |
I'm running a Quadro 2000 and a GeForce GTX480 currently, and contemplating adding some other models later. Running with 1024 threads on either card reliably causes bad residues for current exponents, and for some of the -r checks, though not the initial ones. This left me with the question of how to know which thread counts or other parameters are reasonable for a particular GPU type. Is there a better way than simply testing to determine good parameters? Or are both of these GPUs defective somehow? Perhaps threadbench could be modified to check for and flag pathological values after completing its individual timing loops, and then exclude the flagged ones from selection as the optimum.
Last fiddled with by kriesel on 2017-04-05 at 22:37
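A minimal sketch of the selection step proposed above, written in Python for brevity rather than in CUDALucas's own source language. The names `pick_threads` and `is_sane` are hypothetical and not part of CUDALucas; the timings are made up. The idea is only that the fastest candidate is accepted if it also passes some independent sanity check.

```python
from typing import Callable, Dict, Optional


def pick_threads(timings_ms: Dict[int, float],
                 is_sane: Callable[[int], bool]) -> Optional[int]:
    """Return the fastest thread count that passes the sanity check.

    timings_ms maps a thread count to its measured time per iteration.
    is_sane is a caller-supplied check (hypothetical), for example a
    short run against a known-good residue.
    """
    # Walk the candidates from fastest to slowest; the first sane one wins.
    for threads in sorted(timings_ms, key=timings_ms.get):
        if is_sane(threads):
            return threads
    return None  # every candidate was flagged


# Made-up timings for one fft length; 1024 is "fastest" but flagged as bad.
example = {128: 3.10, 256: 2.95, 512: 3.02, 1024: 1.20}
best = pick_threads(example, is_sane=lambda t: t != 1024)
print(best)  # 256
```

Passing the check in as a callback keeps the selection logic separate from whatever test is eventually judged reliable enough to flag a setting.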
#2590 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
4845₁₀ Posts |
Quote:
$ cudalucas2.05.1-cuda4.2-windows-x64 -cufftbench 1 65536 5 >>bigfftbench.txt
CUDALucas.cu(1055) : cudaSafeCall() Runtime API error 2: out of memory.
On a GeForce GTX480, it will run many fft lengths and output timings to stdout, then terminate before reaching 65536 or producing the fft lengths file. At least on mine. Scaling the maximum back to what it reached on stdout then produces a file. The Quadro 2000 has 1GB VRAM; the GTX480 has 1.5GB.
Last fiddled with by kriesel on 2017-04-05 at 22:48
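The scaling back described above was done by hand from the stdout listing. As a rough alternative, here is a sketch in Python that simply halves the maximum until a run succeeds. The helper `largest_working_max` and the `run_bench` callback are hypothetical; the 16384 limit in the stand-in runner is invented for the example and is not a measured value for any card.

```python
from typing import Callable, Optional


def largest_working_max(run_bench: Callable[[int], bool],
                        start_k: int, floor_k: int = 1) -> Optional[int]:
    """Halve the upper fft bound until run_bench reports success.

    run_bench(max_k) should return True if the benchmark completed
    with that upper bound and False if it failed (e.g. out of memory).
    """
    max_k = start_k
    while max_k >= floor_k:
        if run_bench(max_k):
            return max_k
        max_k //= 2  # back off and retry with a smaller maximum
    return None


# Stand-in runner: pretends anything above 16384 runs out of memory.
def fake_run(max_k: int) -> bool:
    return max_k <= 16384


print(largest_working_max(fake_run, 65536))  # 16384
```

A real `run_bench` would have to launch the benchmark with `-cufftbench 1 <max_k> 5` and decide whether it succeeded; how the program signals failure to the caller is an assumption that would need checking.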
#2591 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
11355₈ Posts |
Examples of bad 1024-thread timings. This occurs on both the Quadro 2000 (compute capability 2.1) and the GeForce GTX480 (CC 2.0): the 1024-thread setting produces the minimum timings, gets selected, and then produces bad residues such as a repeating 0xfffffffffffffffd. Above 1024k fft length, the 1024-thread timings are more than a factor of two faster than for any other thread number. (If I recall correctly, at very short fft lengths the difference disappears, and at large fft lengths it becomes even more dramatic. So for the lengths currently useful for first time testing or double checking, a bad 1024-thread timing could easily be screened for by a modified program.)
Last fiddled with by kriesel on 2017-04-07 at 06:11
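The factor-of-two observation suggests a simple screen. The sketch below is Python, not CUDALucas code; `flag_suspicious` is a hypothetical name, the default ratio of 2.0 is taken from the description above rather than from any tuning, and the timings are made up.

```python
from typing import Dict, Set


def flag_suspicious(timings_ms: Dict[int, float], ratio: float = 2.0) -> Set[int]:
    """Flag thread counts that are implausibly faster than all the others.

    A thread count is flagged when its time, multiplied by `ratio`,
    is still below the best time of every other thread count.
    """
    flagged = set()
    for threads, t in timings_ms.items():
        others = [v for k, v in timings_ms.items() if k != threads]
        if others and t * ratio < min(others):
            flagged.add(threads)
    return flagged


# Made-up timings for one fft length above 1024k.
bad = {128: 3.10, 256: 2.95, 512: 3.02, 1024: 1.20}
ok = {128: 3.10, 256: 2.95, 512: 3.02, 1024: 2.90}
print(flag_suspicious(bad))  # {1024}
print(flag_suspicious(ok))   # set()
```

A screen like this only catches the dramatic case; since the difference reportedly disappears at very short fft lengths, it would not replace a residue check there.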
#2592 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3×5×17×19 Posts |
threadbench could be accelerated a bit by benchmarking each squaring and slice combination once, rather than running one combination per fft length twice.
When a combination is timed twice, the second run is sometimes slower than the first and apparently replaces the first timing. The difference can be large enough that the parameters with the true minimum time are not the ones selected for storage as the combination to use. Individual time savings are small. Minimum timing per fft length is marked with an *.
Last fiddled with by kriesel on 2017-04-07 at 06:12
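If a combination does get timed more than once, keeping the minimum instead of the latest value avoids the overwrite problem. A small Python sketch, with a hypothetical `record_timing` helper and invented numbers; the key is assumed to be a (squaring threads, slice threads) pair.

```python
from typing import Dict, Tuple

Combo = Tuple[int, int]  # (squaring threads, slice threads), assumed layout


def record_timing(best: Dict[Combo, float], combo: Combo, elapsed_ms: float) -> None:
    """Store elapsed_ms for combo only if it beats the recorded timing."""
    if combo not in best or elapsed_ms < best[combo]:
        best[combo] = elapsed_ms


best: Dict[Combo, float] = {}
# The last entry repeats the first combination with a slower second run.
runs = [((256, 128), 2.95), ((512, 128), 3.02), ((256, 128), 3.20)]
for combo, ms in runs:
    record_timing(best, combo, ms)

fastest = min(best, key=best.get)
print(fastest, best[fastest])  # (256, 128) 2.95
```

With a last-write-wins table the repeated run would have stored 3.20 for (256, 128), and (512, 128) would have been chosen instead.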
#2593 |
"Jerry"
Nov 2011
Vancouver, WA
1,123 Posts |
For now, what version do you want compiled for your tests with the new code?
#2594 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3×5×17×19 Posts |
#2595 |
"Jerry"
Nov 2011
Vancouver, WA
1,123 Posts |
I incorporated the code changes listed and made some modifications to the supporting code -- no changes to any of the math. I'll upload the code tomorrow; it's getting late.
2.06beta is here. Lib files are here, if you need them.
I have a 1050ti and was able to test all versions. I changed the way it compiles because, when I used the old way, it would not run any version on my 1050ti except for CUDA 8. When I switched to a 940M I was able to get >=6.0 to run. Any version below the one that worked on my cards *always* caused 0 or 2 results during self-test.
NOTE: You might see a small delay on the first startup of each CUDA version now, due to JIT, but only if it doesn't find code for your GPU.
So, now it's working, but I need a lot of testing done. Anyone who was having issues with the bad residues before, please test these versions and let me know if you're able to make it give you bad results. Everyone, let me know what you find that needs to be fixed and what you would like changed.
~Cheers
Last fiddled with by flashjh on 2017-04-17 at 04:53 Reason: fix cuda version
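Results of 0 or 2, and the repeating 0xfffffffffffffffd reported earlier in the thread, are the values the Lucas-Lehmer recurrence s -> s^2 - 2 (mod 2^p - 1) settles on once the squaring step yields zero. The Python sketch below is plain integer arithmetic showing that, not a diagnosis of what the GPU kernels actually did; `ll_step` and the broken `square` stand-in are hypothetical.

```python
MASK64 = (1 << 64) - 1  # residues are reported as the low 64 bits


def ll_step(s: int, p: int, square=lambda x: x * x) -> int:
    """One Lucas-Lehmer iteration modulo the Mersenne number 2**p - 1."""
    m = (1 << p) - 1
    return (square(s) - 2) % m


p = 127  # any exponent >= 64 gives the same low 64 bits below

# Case 1: a squaring step that always returns zero (simulated failure).
stuck = ll_step(4, p, square=lambda x: 0)
print(hex(stuck & MASK64))  # 0xfffffffffffffffd

# Case 2: correct arithmetic once the value has collapsed to zero.
s, seq = 0, []
for _ in range(3):
    s = ll_step(s, p)
    seq.append(hex(s & MASK64))
print(seq)  # ['0xfffffffffffffffd', '0x2', '0x2']
```

In the first case every iteration gives 2^p - 3, whose low 64 bits are 0xfffffffffffffffd; in the second the value reaches 2 and stays there, since 2^2 - 2 = 2.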
#2596 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3×5×17×19 Posts |
Quote:
Incorporate thread benchmarking sanity checks: check for and flag pathological values after threadbench completes its individual timing loops, and then exclude the flagged ones from selection as the optimum.
Add a runtime estimate column for the maximum exponent per fft length to fft.txt.
Add checks that the card at least meets the compute capability required, and that the driver supports the CUDA level that CUDALucas was compiled for.
In prime95, a results line is tagged with the program version ID in just a few characters, like We4:. I'd like to see something like that added to CUDALucas too. (Maybe at the far right, in case someone has a program that parses or pattern matches the results lines.)
Last fiddled with by kriesel on 2017-04-19 at 02:58
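The compute capability and driver checks requested above amount to two comparisons. A Python sketch of the logic only: `check_requirements` is a hypothetical helper, the minimum capability of 3.0 in the example is invented, and version numbers are assumed to use the CUDA runtime's integer convention of 1000 * major + 10 * minor (8000 for 8.0, 6050 for 6.5).

```python
from typing import List, Tuple


def fmt_cuda(version: int) -> str:
    """Render a CUDA version integer such as 6050 as '6.5'."""
    return f"{version // 1000}.{(version % 1000) // 10}"


def check_requirements(device_cc: Tuple[int, int], min_cc: Tuple[int, int],
                       driver_cuda: int, built_cuda: int) -> List[str]:
    """Return a list of human-readable problems; empty means OK."""
    problems = []
    # Tuples compare element by element, so (2, 1) < (3, 0) as intended.
    if device_cc < min_cc:
        problems.append(
            f"card is compute capability {device_cc[0]}.{device_cc[1]}, "
            f"build needs {min_cc[0]}.{min_cc[1]} or higher")
    if driver_cuda < built_cuda:
        problems.append(
            f"driver supports CUDA {fmt_cuda(driver_cuda)}, "
            f"build needs {fmt_cuda(built_cuda)}")
    return problems


# Invented example: a CC 2.0 card with a CUDA 6.5 driver and a CUDA 8.0 build.
for line in check_requirements((2, 0), (3, 0), 6050, 8000):
    print(line)
```

Reporting both problems at startup, rather than failing later with bad residues, is the point of the request; where the required minimum comes from for a given build is left open here.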
Thread | Thread Starter | Forum | Replies | Last Post |
Don't DC/LL them with CudaLucas | LaurV | Data | 131 | 2017-05-02 18:41 |
CUDALucas / cuFFT Performance on CUDA 7 / 7.5 / 8 | Brain | GPU Computing | 13 | 2016-02-19 15:53 |
CUDALucas: which binary to use? | Karl M Johnson | GPU Computing | 15 | 2015-10-13 04:44 |
settings for cudaLucas | fairsky | GPU Computing | 11 | 2013-11-03 02:08 |
Trying to run CUDALucas on Windows 8 CP | Rodrigo | GPU Computing | 12 | 2012-03-07 23:20 |