mersenneforum.org CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW)

2017-07-28, 02:25   #2619
kladner

"Kieren"
Jul 2011
In My Own Galaxy!

27A9₁₆ Posts

Quote:
 Originally Posted by kriesel This compilation is based on mostly my own running and testing since February on Windows, with some info from the forums mixed in. Please chime in with linux experience or in general. The absence of fft lengths greater than 8192k in the -r self test option seems like a priority item. Perhaps a separate -rbig or -r 2 option, with 1000 iterations for the big fft lengths >8192k?
What is the limit with -r 1 ?
EDIT: I don't have an active setup for CuLu, so I can't answer the question. I think I am correct that the '-r' argument is equivalent to '-r 0'. The higher-level self-test is '-r 1'.

Last fiddled with by kladner on 2017-07-28 at 02:29

2017-07-28, 14:48   #2620
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13·373 Posts
-r 1 vs -r 0 vs -r (none), and variations among GPU models, CUDA level, or other factors

Quote:
 Originally Posted by kladner What is the limit with -r 1 ? EDIT: I don't have an active setup for CuLu, so I can't answer the question. I think I am correct, that the '-r" argument is equivalent to '-r 0' . The higher level self-test is '-r 1' .
Yes.

-r n runs the short (n = 0) or long (n = 1) version of the self-test.

The table I posted lists, at item 28, the fft lengths run for -r 1 on a GTX 1060 3GB. That list was obtained from a -r 1 run made after fft benchmarking and threads benchmarking. The maximum fft length it ran was 8192K, as listed there.

An earlier run of cudalucas2.06beta-cuda5.0-windows-x64.exe -d %dev% -r 1 >>clstart.txt
on the same GTX 1060 3GB, made before fft or threads benchmarking, ran the following residue checks (fft lengths in K; a somewhat different, more extensive list):
1, 2, 4, 8, 10, 14, 16, 18, 32, 36, 42, 48, 56, 60, 64, 70, 80, 96, 112, 120, 128, 144, 160, 162, 168, 180, 192, 224, 256, 288, 320, 324, 336, 360, 384, 392, 400, 448, 512, 576, 640, 648, 672, 720, 768, 784, 800, 864, 896, 1024, 1152, 1176, 1296, 1344, 1440, 1568, 1600, 1728, 1792, 2048, 2304, 2592, 2688, 2880, 3136, 3200, 3584, 4096, 4608, 4704, 5184, 5600, 5760, 6048, 6272, 6400, 6480, 7168, 7776, 8064, 8192
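An aside (my own observation, not from the CUDALucas documentation): every length in these lists factors into only 2, 3, 5, and 7, the radices cuFFT handles efficiently. A quick sketch to check that property:

```python
def is_7smooth(n):
    """True if n factors into only 2, 3, 5 and 7."""
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1

# A representative subset of the fft lengths (in K) listed above:
lengths = [10, 14, 70, 162, 392, 1176, 4704, 5600, 6272, 7776, 8064, 8192]
print(all(is_7smooth(n) for n in lengths))  # True
print(is_7smooth(11))  # False: 11 is not a supported radix
```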

I run something like the following (the version varies; usually now the 2.06beta May build with a higher CUDA level, and the maximum possible memtest width):

cudalucas2.05.1-cuda4.2-windows-x64 -memtest 116 10 >>clstart.txt
cudalucas2.05.1-cuda4.2-windows-x64 -r 1 >>clstart.txt
cudalucas2.05.1-cuda4.2-windows-x64 -cufftbench 1 65536 5 >>clstart.txt
rem suppress 1024 thread value in threadbench since it causes problems with my GTX480s or Quadro 2000s
CUDALucas2.05.1-cuda4.2-windows-x64 -threadbench 1 65536 5 4 >>clstart.txt
cudalucas2.05.1-cuda4.2-windows-x64 6972593 >>clstart.txt

on any GPU I install or relocate. (Sometimes the 65536 must be reduced, and sometimes the threadbench mask allows 1024 threads, both depending on the GPU model.)

On a GTX 480, cudalucas2.05.1-cuda4.2-windows-x64 -r 1 >>clstart.txt produced the following assortment of fft lengths, again _before_ fft or threads benchmarking were done. More lengths ran in total, but none above 8192K.

1, 2, 4, 8, 10, 14, 16, 18, 32, 36, 42, 48, 56, 60, 64, 70, 80, 96, 112, 120, 128, 144, 160, 162, 168, 180, 192, 224, 256, 288, 320, 324, 336, 360, 384, 392, 400, 448, 512, 576, 640, 648, 672, 720, 768, 784, 800, 864, 896, 1024, 1152, 1296, 1440, 1568, 1600, 1728, 1792, 2048, 2304, 2592, 2688, 2880, 3136, 3200, 3456, 3600, 4096, 4608, 4704, 5184, 5600, 5760, 6048, 6480, 7168, 8192

From a GTX 1070 before fft or threads benchmarking (May 2.06beta, CUDA 6.0, x64):
1, 2, 4, 8, 10, 14, 16, 18, 32, 36, 42, 48, 56, 60, 64, 70, 80, 96, 112, 120, 128, 144, 160, 162, 168, 180, 192, 224, 256, 288, 320, 324, 336, 360, 384, 392, 400, 448, 512, 576, 640, 648, 672, 720, 768, 784, 800, 864, 896, 1024, 1152, 1176, 1296, 1344, 1440, 1568, 1600, 1728, 1792, 2048, 2304, 2592, 2688, 2880, 3136, 3200, 3584, 4096, 4608, 4704*, 5120, 5184, 5600, 5760, 6048, 6272, 6400, 6480, 7168, 7776, 8064, 8192

* 4704 appeared not to actually run:
Using threads: square 256, splice 128.
Starting self test M86845813 fft length = 4704K
Using threads: square 256, splice 128.
Starting self test M86845813 fft length = 5120K
Iteration 10000 / 86845813, 0x88220ac98093b65c, 5120K, CUDALucas v2.06beta, error = 0.04102, real: 1:05, 6.5254 ms/iter
This residue is correct.

Not completing a length is rare.
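For context on what "This residue is correct" means: the self test compares the low 64 bits of the Lucas-Lehmer interim value (like the 0x88220ac98093b65c above) against a stored known-good value for that exponent and iteration. The arithmetic can be sketched with plain Python integers (CUDALucas gets the same numbers far faster via floating-point FFT squaring):

```python
def ll_residue(p, iterations):
    """Low 64 bits of the Lucas-Lehmer value s after `iterations`
    steps of s -> s^2 - 2 (mod 2^p - 1), starting from s = 4."""
    m = (1 << p) - 1
    s = 4
    for _ in range(iterations):
        s = (s * s - 2) % m
    return s & 0xFFFFFFFFFFFFFFFF

# After p - 2 iterations the residue is 0 exactly when 2^p - 1 is prime:
print(hex(ll_residue(127, 125)))  # 0x0: M127 is prime
print(ll_residue(11, 9) != 0)     # True: M11 = 23 * 89 is composite
```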

More variations on the same GTX 1060 3GB follow.

V2.06beta 32-bit CUDA 6.5 -r 0 (a rare successful 32-bit run on this card)
4, 8, 16, 64, 72, 160, 360, 720, 1134, 1296, 1440, 1600, 1728, 2048, 2304, 3136

V2.06beta 64-bit CUDA 6.5 -r 0
4, 8, 16, 64, 72, 160, 360, 720, 1134, 1296, 1440, 1600, 1728, 2048, 2304, 3136

V2.06beta 64-bit CUDA 6.5 -r (neither 0 nor 1 specified)
4, 8, 16, 64, 72, 160, 360, 720, 1134, 1296, 1440, 1600, 1728, 2048, 2304, 3136

Your statement that -r (no switch value specified) is equivalent to -r 0 (short residue test) seems to be confirmed.

My startup scripts all use -r 1 (long test). Item 28 in the table was about -r 1 results. None of the <-r, -r 0, -r 1> tests, in any run (of dozens) I've reviewed, ever exceeded fft length 8192K. -r 2 is not a legal input and is not accepted by the May 2.06beta.

2017-07-28, 14:56   #2621
kladner

"Kieren"
Jul 2011
In My Own Galaxy!

11×13×71 Posts

Sorry. I did not look closely enough at the information provided.
2017-07-28, 18:52   #2622
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1001011110001₂ Posts

Quote:
 Originally Posted by kladner Sorry. I did not look closely enough at the information provided.
It's ok.

It turns out that by looking at it some more, I noticed and learned more myself, so it's all good.

2017-07-28, 20:46   #2623
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4849₁₀ Posts

self test residues limit

Examining the CUDALucas 2.06beta May 5 build source code confirms that the maximum exponent for which there is a self-test residue is 149,447,533, corresponding to the 8192K maximum fft length.

Last fiddled with by kriesel on 2017-07-28 at 20:46
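That 149,447,533 cap lines up with a rough rule of thumb: an LL test at fft length N stores about p/N bits of the number in each double-precision word, and something near 18 bits/word is about the practical limit. A sketch (the ceiling figure is my approximation, not from the CUDALucas source):

```python
def bits_per_word(exponent, fft_len_k):
    """Average bits of the tested number packed into each FFT word (p / N)."""
    return exponent / (fft_len_k * 1024)

# The largest self-test exponent at the largest self-test fft length:
print(round(bits_per_word(149_447_533, 8192), 2))  # 17.82
```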
2017-07-29, 00:24   #2624
storm5510
Random Account

Aug 2009
U.S.A.

70E₁₆ Posts

I notice a lot of your tests were done with CUDA 6.5. I am using CUDA 8; my current version of mfaktc requires it. The best time I've gotten out of CuLu 2.06 is around 3.8 ms/iter on my GTX 480. To get that, I have to leave the threads/splice values at their defaults of 256 and 128. That setting is problematic because I get frequent resets. Lowering the threads/splice values increases the time to roughly 4.2 ms/iter, but it seems better behaved at the lower settings. The 0.4 ms difference is small enough not to be an issue. All this is for exponents in the 41M range.
2017-07-29, 20:27   #2625
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1001011110001₂ Posts
cuda levels

Quote:
 Originally Posted by storm5510 I notice a lot of your tests were done with CUDA 6.5. I am using CUDA 8. My current version of mfaktc requires it. The best times I've gotten out of CuLu 2.06 is around 3.8 ms/iter on my GTX-480. To get that, I have to leave the threads/splice set at their default values of 256 and 128. It is problematic at this setting because I get frequent resets. Lowering the threads/splice values increases the time to 4.2 ms/iter, roughly. However, it seems more well-behaved at lower settings. The difference is only 0.4 ms, which is not an issue since the difference is so very small. All this is for exponents in the 41M range.
I frequently run CUDALucas2.06beta-CUDA6.0-Windows-x64.exe or versions near that because they have done well in my benchmark testing.
I've often seen the CUDA 8.0 version (and 4.2) significantly slower in careful benchmark testing; it also depends on the GPU model and exponent size. A few percent slower is significant to me: it's the same as losing a day or more of throughput per month, more than a week per year, or running one of a dozen GPUs at half speed.

There's a difference between the maximum CUDA level the NVIDIA driver supports, the minimum level a given CUDALucas, CUDAPm1, or mfaktc build requires, and what a given level of the SDK supports. CUDALucas2.06beta-CUDA6.0-Windows-x64.exe, for example, can run with any driver that supports CUDA 6.0 or above, including the latest that supports CUDA 8, but not an old driver that supports only up to CUDA 5.5 or lower.

With a driver installed that supports up to CUDA 8, one can run any version of CUDALucas with a minimum requirement from 4.0 through 8.0 (I've run the experiment by benchmarking all of 2.06beta 4.0 through 8 on the same driver version) and pick the CUDA level that gives the best speed within accuracy limits for the GPU and exponents at the time. (There are some card, CUDA, and fft length combinations that are not as dependable.) The driver's versatility on CUDA level is a good thing, in that it allows running mfaktc requiring 8, CUDALucas fastest at 5.5, and CUDAPm1 fastest at some other level, all on the same system with the same single driver installation.
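The driver/app rule above condenses to a one-line check, sketched here (the version tuples are illustrative values, not queried from any NVIDIA API):

```python
def can_run(app_min_cuda, driver_max_cuda):
    """An app built against minimum CUDA level X runs under a driver
    supporting up to level Y exactly when X <= Y."""
    return app_min_cuda <= driver_max_cuda

# With a CUDA 8.0-capable driver, a CUDA 6.0 CUDALucas build and a
# CUDA 8.0 mfaktc build can share the same single driver install:
driver = (8, 0)
print(can_run((6, 0), driver), can_run((8, 0), driver))  # True True
print(can_run((6, 0), (5, 5)))  # False: driver too old for the build
```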

Recently I visited the CUDA Wikipedia page and saw that the CUDA 9 SDK will drop support for compute capability 2.x cards, which includes the older Quadros (2000, 4000) and the GTX 480, all the way up through the GTX 500s and 600s.
The CUDA 6.5 SDK is the last to support older compute capability 1.3 cards like the GTX 290.
https://en.wikipedia.org/wiki/CUDA#GPUs_supported

The versions of mfaktc I found online when I was looking months ago require CUDA 6.5 or up, not 8.0 minimum. http://www.mersennewiki.org/index.php/Mfaktc lists lots of choices, built against CUDA 4.2, 6.5, or 8.0. I haven't the time right now to benchmark the assortment of mfaktc versions.

2017-07-30, 18:49   #2626
storm5510
Random Account

Aug 2009
U.S.A.

70E₁₆ Posts

Quote:
 Originally Posted by kriesel ...Recently I visited the CUDA wikipedia page and saw that CUDA 9 SDK will drop support for compute capability 2.x cards, which includes older Quadros (2000, 4000), and GTX480; all the way up through the GTX500s and 600s. CUDA6.5 SDK is the last to support older compute capability 1.3 cards like the GTX290. https://en.wikipedia.org/wiki/CUDA#GPUs_supported The versions of Mfaktc I found online when I was looking months ago require CUDA 6.5 or up, not 8.0 minimum. http://www.mersennewiki.org/index.php/Mfaktc lists lots of choices, and CUDA 4.2, 6.5 or 8.0. I haven't the time right now to benchmark the assortment of Mfaktc versions.
I get my drivers from NVIDIA's support pages, so they're always the latest ones. I've never had any experience with anything below 8. As for the GTX 480 I have, its time is limited. I occasionally browse around to see what is available, and where. I would 'like' to have something that will get me away from the resets in CuLu.

2017-08-07, 15:50   #2627
storm5510
Random Account

Aug 2009
U.S.A.

11100001110₂ Posts

I had to modify the batch file shown in post 2610:
Code:
@echo off
set count=0
set program=cudalucas
:loop
TITLE %program% current reset count = %count%
set /a count+=1
echo %count% >> log.txt
echo %count%
%program%.exe
if %count%==50 goto end
goto loop
:end
del log.txt
If the worktodo.txt file contains no assignment, the batch goes into a continuous loop; I found the count this morning at over 700,000. I added the lines in bold (the count handling and the exit check). With the count limit set to 50, it loops for about a second before it drops out to the prompt. Of course, this value can be set to whatever one desires.
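The same bounded restart idea can be sketched in Python (the `./cudalucas` path and the cap of 50 are placeholders for your own binary and limit):

```python
import subprocess

def restart_loop(program="./cudalucas", max_restarts=50):
    """Relaunch the worker after every exit (crash, driver reset, or an
    empty worktodo.txt), but stop after max_restarts so an empty work
    file can't spin the loop forever."""
    count = 0
    while count < max_restarts:
        count += 1
        print(f"restart {count} of {max_restarts}")
        try:
            subprocess.run([program])
        except FileNotFoundError:
            pass  # binary missing; keep counting so the loop stays bounded
    return count
```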
2017-08-09, 17:44   #2628
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13×373 Posts

Quote:
 Originally Posted by storm5510 I had to modify the batch file shown in post 2610: Code: @echo off set count=0 set program=cudalucas :loop TITLE %program% current reset count = %count% set /a count+=1 echo %count% >> log.txt echo %count% %program%.exe if %count%==50 goto end goto loop :end del log.txt If the worktodo.txt file contains no assignment, then the batch goes into a continuous loop. I found the count this morning at over 700,000. I added the lines in bold. With the count value of 50, it loops about a second before it drops out to the prompt. Of course, this value can be set to whatever one desires.
I experimented with increasing time delays between batch loop iterations, as well as a loop count limit of 30. (Think what a mess unbounded loop iterations make of output redirected to a log file...) The increased time delay had no discernible effect on the NVIDIA driver timeout issue.

2017-08-14, 14:33   #2629
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1001011110001₂ Posts
cudalucas bug and wish list update

Here is today's version of the list I am maintaining. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have.
Attached Files
 cudalucas bug and wishlist table.pdf (66.1 KB, 119 views)

