mersenneforum.org  

2017-12-06, 21:01   #2652
TheJudger
"Oliver"
Mar 2005
Germany
1107₁₀ Posts

Fast ist phun!

CUDA 9.0, CUDA driver 384.98, CUDALucas 2.05.1 (SVN rev. 99)

Benchmark FFT sizes './CUDALucas -cufftbench 2048 32768 20'
Code:
Device              Tesla V100-PCIE-16GB
Compatibility       7.0
clockRate (MHz)     1380
memClockRate (MHz)  877

  fft    max exp  ms/iter
 2048   38492887   0.3978
 2187   41047411   0.5123
 2304   43194913   0.5183
 2401   44973503   0.5293
 2500   46787207   0.5429
 2592   48471289   0.5460
 2744   51250889   0.5997
 3136   58404433   0.6361
 3200   59570449   0.6514
 3456   64229677   0.7015
 4096   75846319   0.7591
 4375   80897867   0.9595
 4608   85111207   0.9649
 5184   95507747   1.0124
 5488  100984691   1.1235
 6272  115080019   1.2037
 6400  117377567   1.2445
 6561  120266023   1.3328
 6912  126558077   1.3391
 8000  146019329   1.5105
 8192  149447533   1.5230
 8575  156280961   1.8316
10368  188188471   1.9362
10976  198980129   2.1451
11907  215480183   2.3303
12544  226753511   2.3331
12800  231280639   2.3830
13824  249369863   2.5663
16384  294471259   2.9531
16807  301908293   3.3334
16875  303103441   3.5138
18225  326810201   3.7274
20736  370806323   3.7880
21952  392070229   4.2109
25088  446794913   4.5286
27783  493705637   5.5610
32000  566915989   5.8087
32768  580225813   5.8343

And timing a 100M-digit exponent with './CUDALucas 332192879'
Code:
Starting M332192879 fft length = 20736K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Dec 06  21:52:36  | M332192879     10000  0xa19043095e213f4c  | 20736K  0.01758   3.8055   38.05s  |  14:15:09:00   0.00%  |
|  Dec 06  21:53:14  | M332192879     20000  0xcb7bc66ac81b24be  | 20736K  0.01709   3.8051   38.05s  |  14:15:07:16   0.00%  |
|  Dec 06  21:53:52  | M332192879     30000  0x38e4cc517de8fda3  | 20736K  0.01758   3.8051   38.05s  |  14:15:06:19   0.00%  |

Power consumption (board power reported by 'nvidia-smi') is around 145 W while running the LL test of M332192879.

Oliver
2017-12-07, 17:07   #2653
Luis
Oct 2014
Bari, Italy
478 Posts

So roughly 351 h × 145 W ≈ 50.9 kWh is the amount of energy consumed for the full test.
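
As a rough check of that figure, here is a minimal Python sketch. It assumes the board draws a constant 145 W for the whole run and that every iteration takes the 3.8051 ms reported in the log above; the function name `ll_energy_kwh` is made up for illustration.

```python
def ll_energy_kwh(exponent: int, ms_per_iter: float, board_watts: float) -> float:
    """Estimate the energy of a full LL test, assuming constant power draw."""
    # An LL test of M(p) needs p - 2 squarings; the "- 2" is negligible here.
    iterations = exponent - 2
    hours = iterations * ms_per_iter / 1000.0 / 3600.0
    return hours * board_watts / 1000.0  # Wh -> kWh


run_hours = (332192879 - 2) * 3.8051 / 1000.0 / 3600.0
energy = ll_energy_kwh(332192879, 3.8051, 145.0)
print(f"{run_hours:.1f} h at 145 W is about {energy:.1f} kWh")
```

This only covers board power as reported by 'nvidia-smi'; the host system's draw is not included.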
2017-12-07, 17:08   #2654
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
4633₁₀ Posts

multiple instances effect on performance (win some lose some)

To follow up on http://www.mersenneforum.org/showpos...postcount=2649: I have been testing several combinations of applications (among CUDALucas, CUDAPm1, Mfaktc) run simultaneously on several GPU models. Preliminary results, per GPU and application combination, range from a throughput reduction of a few percent to a throughput increase of over thirteen percent.

Throughput is computed as the sum, over the instances running simultaneously on an individual GPU, of each instance's rate of progress divided by the rate benchmarked when that application was the only one running on that GPU. (This approach treats all run types, LL, P-1, and trial factoring, as equally valuable; what's valued is a GPU-day of that model.)

Estimated standard deviations so far are on the order of 0.2% to 0.5% for the cases I've evaluated, so the observed gains of 1-13% are statistically significant. A spot check of a benchmark was quickly repeatable to within 0.2%. The memory requirement is typically a small fraction of total GPU RAM.
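
The throughput definition above can be sketched in a few lines of Python. This is only an illustration of the formula as described; the function name and the example rates are hypothetical, not measured values.

```python
def combined_throughput(shared_rates, solo_rates):
    """Sum of per-instance progress rates, each divided by its solo benchmark rate.

    Rates can be in any unit (iterations/s, GHz-days/day, ...) as long as each
    shared rate uses the same unit as its matching solo rate.
    """
    if len(shared_rates) != len(solo_rates):
        raise ValueError("need one solo benchmark rate per running instance")
    return sum(shared / solo for shared, solo in zip(shared_rates, solo_rates))


# Hypothetical example: two instances sharing one GPU.
# Instance A runs at 55% of its solo rate, instance B at 58% of its own.
total = combined_throughput([5.5, 29.0], [10.0, 50.0])
print(f"relative throughput: {total:.2f}")  # above 1.0 means a net gain
```

A result of 1.00 would mean that running the instances together is worth exactly one GPU-day per day, the same as running any one of them alone.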

Last fiddled with by kriesel on 2017-12-07 at 17:20
2017-12-23, 19:44   #2655
TheJudger
"Oliver"
Mar 2005
Germany
3³×41 Posts

CUDA 9.1, CUDA driver 387.34, CUDALucas 2.05.1 (SVN rev. 99)

Updated P100-16GiB benchmark (older CUDA 8 benchmarks are here and here).

Benchmark FFT sizes './CUDALucas -cufftbench 2048 32768 20'
Code:
Device              Tesla P100-PCIE-16GB
Compatibility       6.0
clockRate (MHz)     1328
memClockRate (MHz)  715

  fft    max exp  ms/iter
 2048   38492887   0.5972
 2187   41047411   0.7118
 2304   43194913   0.7301
 2401   44973503   0.7656
 2592   48471289   0.7971
 2744   51250889   0.8863
 3136   58404433   0.9482
 3200   59570449   0.9733
 3456   64229677   1.0467
 3584   66556463   1.1321
 4096   75846319   1.1423
 4608   85111207   1.4124
 5184   95507747   1.4988
 5488  100984691   1.6450
 6272  115080019   1.8127
 6400  117377567   1.8730
 6561  120266023   1.9556
 6912  126558077   2.0301
 7776  142017539   2.2474
 8192  149447533   2.2688
 8575  156280961   2.6593
 9261  168504209   2.8483
10368  188188471   2.9439
10976  198980129   3.1604
12544  226753511   3.5621
12800  231280639   3.6567
13824  249369863   3.9843
15552  279831199   4.4018
16384  294471259   4.5018
16807  301908293   5.1300
16875  303103441   5.5609
18225  326810201   5.7337
20736  370806323   5.8287
21952  392070229   6.2511
25088  446794913   7.0258
27783  493705637   8.2696
31104  551379091   8.7884
32000  566915989   9.0541
32768  580225813   9.0641

And timing a 100M-digit exponent with './CUDALucas 332192879'
Code:
Starting M332192879 fft length = 20736K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Dec 23  20:37:12  | M332192879     10000  0xa19043095e213f4c  | 20736K  0.01758   5.8218   58.21s  |  22:09:12:09   0.00%  |
|  Dec 23  20:38:10  | M332192879     20000  0xcb7bc66ac81b24be  | 20736K  0.01709   5.8218   58.21s  |  22:09:11:04   0.00%  |
|  Dec 23  20:39:08  | M332192879     30000  0x38e4cc517de8fda3  | 20736K  0.01855   5.8249   58.24s  |  22:09:15:49   0.00%  |
Oliver
2017-12-23, 23:40   #2656
ET_
Banned
"Luigi"
Aug 2002
Team Italia
11240₈ Posts

22 days for a 100.000.000 digits number?
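
The ETA shown in the log can be reproduced from the reported ms/It, and the digit count from the exponent. The following is a small Python sketch under the assumption that the iteration time stays constant for the whole run; the helper names are made up for illustration.

```python
import math


def ll_eta_days(exponent: int, ms_per_iter: float) -> float:
    """Days needed for a full LL test at a constant iteration time."""
    iterations = exponent - 2  # an LL test of M(p) needs p - 2 squarings
    return iterations * ms_per_iter / 1000.0 / 86400.0


def mersenne_digits(exponent: int) -> int:
    """Decimal digits of 2**p - 1, which is floor(p * log10(2)) + 1."""
    return math.floor(exponent * math.log10(2)) + 1


p = 332192879
p100_days = ll_eta_days(p, 5.8218)  # ms/It from the P100 log
v100_days = ll_eta_days(p, 3.8051)  # ms/It from the V100 log
print(f"P100: {p100_days:.1f} days, V100: {v100_days:.1f} days")
print(f"M{p} has {mersenne_digits(p)} decimal digits")
```

With the timings posted in this thread, the same test comes out at a bit over 22 days on the P100 and a bit under 15 days on the V100.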
2017-12-24, 00:01   #2657
TheJudger
"Oliver"
Mar 2005
Germany
3³×41 Posts

Hi Luigi,

Quote:
Originally Posted by ET_
22 days for a 100.000.000 digits number?

Yes, but look at this

Oliver

Last fiddled with by TheJudger on 2017-12-24 at 00:01
2018-01-08, 03:41   #2658
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
41×113 Posts

Updated bug and wish list

Quote:
Originally Posted by kriesel
Here is today's version of the list I am maintaining. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have.

After a few months and holidays, here's an updated version.
Attached Files
File Type: pdf cudalucas bug and wishlist table.pdf (88.9 KB, 158 views)
2018-01-08, 03:52   #2659
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
4633₁₀ Posts

CUDALucas runtime scaling

The attachment is based on actual timed exponents on a GTX480 clocked at 701 MHz. Run times on a GTX1070 are about 70% of those; that is, what takes the 480 ten days takes the 1070 a week.
Attached Files
File Type: pdf cudalucas ll test run time scaling.pdf (15.4 KB, 78 views)
2018-01-08, 07:24   #2660
LaurV
Romulan Interpreter
Jun 2011
Thailand
2²×7×317 Posts

Quote:
Originally Posted by kriesel
After a few months and holidays, here's an updated version.

I had a quick read. Some items I didn't understand (I need more time to read them properly, I am in a hurry now), but point 10 seems to be not actually true. What you see is an effect of the save file storing the time when the test was started. The program computes the remaining time as "how long you have worked on it" divided by "how many iterations you did", multiplied by "how many iterations you still have", and that gives a date in the future. You will see the same effect if you interrupt your work for a while (days) and resume on the same computer. I remember a past discussion where we argued whether the interruption time should be counted (i.e. averaged into the calculation), and it seems to me that it is better to include it. Even if you take one picosecond per iteration, if you spent one week on half of the test (for whatever reason, including interruptions), it looks normal to me that you will spend another week on the other half. So on your new computer the ETA doesn't immediately reflect the faster iteration time, but it will "catch up" soon, as the iteration count grows.

The other way, displaying the ETA as "number of remaining iterations" multiplied by "iteration time", gives an immediate result when you move the test to a faster toy, but the ETA will be very, VERY jumpy, because the iteration time varies a lot with how busy your computer is. Some of us use our computers for other activities too. So it is not "reliable". Some kind of averaging with past values (either SMA or EMA) needs to be done to avoid the jumpy ETA, and you will still see "no effect" when you move the test until the main period of the MA (moving average) has passed. Of course, it would be nice to have an option, in the ini file for example, to choose an averaging period: something like 255 would be the current method (just an example), and something like 0 would be "no averaging" (jumpy). But I feel we are requesting too much already.
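
To make the three ETA styles discussed above concrete, here is a hedged Python sketch. It is not CUDALucas code: the function and class names are invented, and mapping an "averaging period" to an EMA smoothing factor of 2 / (period + 1) is an assumption (the usual EMA convention), not something the program is known to do.

```python
def eta_cumulative(elapsed_s: float, done: int, total: int) -> float:
    """Remaining seconds from the whole-run average, interruptions included."""
    return elapsed_s / done * (total - done)


def eta_instant(last_ms_per_iter: float, done: int, total: int) -> float:
    """Remaining seconds from the latest iteration time only (jumpy)."""
    return last_ms_per_iter / 1000.0 * (total - done)


class EmaEta:
    """Remaining seconds from an exponential moving average of ms/iter."""

    def __init__(self, period: int):
        # Assumed convention: period 0 means no averaging (alpha = 1).
        self.alpha = 2.0 / (period + 1) if period > 0 else 1.0
        self.avg_ms = None

    def update(self, ms_per_iter: float, done: int, total: int) -> float:
        if self.avg_ms is None:
            self.avg_ms = ms_per_iter
        else:
            self.avg_ms += self.alpha * (ms_per_iter - self.avg_ms)
        return self.avg_ms / 1000.0 * (total - done)


# Hypothetical case: half of a test took one week, then it moves to a faster card.
week = 7 * 86400.0
print(eta_cumulative(week, 500_000, 1_000_000) / 86400.0)  # still about a week
print(eta_instant(2.0, 500_000, 1_000_000))                # seconds, reacts at once
```

The cumulative estimate keeps predicting another week right after the move, while the instantaneous one reacts immediately but follows every fluctuation; the EMA sits in between, depending on the chosen period.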
2018-01-28, 12:35   #2661
wfgarnett3
"William Garnett III"
Oct 2002
Bensalem, PA
2·43 Posts

EVGA GeForce GTX 1050 (2GB GDDR5)

CUDALucas2.05.1-CUDA8.0-Windows-x64.exe

GeForce 1050 CUDALucas benchmarks below followed by Intel i3-4150 Prime95 benchmarks for comparison

Quote:
Device              GeForce GTX 1050
Compatibility       6.1
clockRate (MHz)     1531
memClockRate (MHz)  3504

  fft    max exp  ms/iter
 1024   19535569   3.1435
 1080   20580341   3.6334
 1134   21586693   3.7268
 1152   21921901   3.7988
 1296   24599717   4.0779
 1323   25101101   4.5591
 1350   25602229   4.7156
 1440   27271147   4.8550
 1458   27604673   5.0420
 1568   29640913   5.0514
 1600   30232693   5.1335
 1728   32597297   5.5383
 1792   33778141   6.0359
 2048   38492887   6.2727
 2304   43194913   7.2045
 2352   44075249   8.4636
 2592   48471289   8.4958
 2688   50227213   9.5387
 2700   50446621   9.9943
 2916   54392209  10.0159
 3024   56362639  10.4233
 3136   58404433  10.4462
 3200   59570449  11.2744
 3240   60298969  11.5216
 3402   63247511  11.9492
 3584   66556463  12.3054
 3600   66847171  13.0066
 4096   75846319  13.0730
 4608   85111207  15.4038
 4800   88579669  17.4774
 5184   95507747  17.5136
 5376   98967641  19.3875
 5600  103000823  20.1571
 5760  105879517  20.5611
 5832  107174381  20.6808
 6144  112781477  21.5606
 6272  115080019  22.1675
 6912  126558077  23.4366
 7168  131142761  25.0548
 7200  131715607  25.6965
 8192  149447533  26.9798

Quote:
Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz
CPU speed: 3491.95 MHz, 2 hyperthreaded cores
CPU features: Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 3 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64

Timing FFTs using 2 threads on 2 cores.
Best time for 1792K FFT length: 4.751 ms., avg: 4.932 ms.
Best time for 1920K FFT length: 5.255 ms., avg: 5.418 ms.
Best time for 2016K FFT length: 5.469 ms., avg: 5.605 ms.
Best time for 2048K FFT length: 5.513 ms., avg: 5.574 ms.
Best time for 2304K FFT length: 6.246 ms., avg: 6.298 ms.
Best time for 2400K FFT length: 6.436 ms., avg: 6.484 ms.
Best time for 2560K FFT length: 6.825 ms., avg: 6.991 ms.
Best time for 2688K FFT length: 7.409 ms., avg: 7.502 ms.
Best time for 2880K FFT length: 7.736 ms., avg: 7.801 ms.
Best time for 3072K FFT length: 8.351 ms., avg: 8.448 ms.
Best time for 3200K FFT length: 8.811 ms., avg: 8.946 ms.
Best time for 3360K FFT length: 9.705 ms., avg: 9.879 ms.
Best time for 3456K FFT length: 9.940 ms., avg: 10.082 ms.
Best time for 3584K FFT length: 10.128 ms., avg: 10.220 ms.
Best time for 3840K FFT length: 10.919 ms., avg: 11.034 ms.
Best time for 4096K FFT length: 13.515 ms., avg: 13.819 ms.
Best time for 4480K FFT length: 12.547 ms., avg: 12.789 ms.
Best time for 4608K FFT length: 12.952 ms., avg: 13.141 ms.
Best time for 4800K FFT length: 13.462 ms., avg: 13.636 ms.
Best time for 5120K FFT length: 14.454 ms., avg: 14.626 ms.
Best time for 5376K FFT length: 15.308 ms., avg: 15.433 ms.
Best time for 5760K FFT length: 16.797 ms., avg: 16.957 ms.
Best time for 6144K FFT length: 17.702 ms., avg: 17.988 ms.
Best time for 6400K FFT length: 18.452 ms., avg: 18.641 ms.
Best time for 6720K FFT length: 20.265 ms., avg: 20.463 ms.
Best time for 6912K FFT length: 20.733 ms., avg: 21.296 ms.
Best time for 7168K FFT length: 22.067 ms., avg: 24.565 ms.
Best time for 7680K FFT length: 22.115 ms., avg: 22.333 ms.
Best time for 8064K FFT length: 24.796 ms., avg: 25.473 ms.
Best time for 8192K FFT length: 26.976 ms., avg: 28.400 ms.
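
For a quick side-by-side view, the two listings can be compared at FFT lengths that appear in both. The Python sketch below copies a handful of values from the tables above (GPU ms/iter, and Prime95 "best time" with 2 threads on 2 cores); it is a rough comparison only, since the two programs and their benchmark methods differ.

```python
# ms per iteration, keyed by FFT length in K, copied from the listings above
gpu_ms = {1792: 6.0359, 2048: 6.2727, 4096: 13.0730, 6144: 21.5606, 8192: 26.9798}
cpu_ms = {1792: 4.751, 2048: 5.513, 4096: 13.515, 6144: 17.702, 8192: 26.976}


def gpu_over_cpu(gpu: dict, cpu: dict) -> dict:
    """Ratio of GPU time to CPU time at every FFT length present in both tables."""
    return {k: gpu[k] / cpu[k] for k in sorted(gpu.keys() & cpu.keys())}


ratios = gpu_over_cpu(gpu_ms, cpu_ms)
for fft_k, ratio in ratios.items():
    # A ratio above 1.0 means the GTX 1050 is slower than the i3-4150 at that length.
    print(f"{fft_k:>5}K  GPU/CPU time ratio: {ratio:.2f}")
```

On these numbers the two are in the same range, with the CPU somewhat ahead at most of the sampled lengths and roughly level at 4096K and 8192K.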

Last fiddled with by wfgarnett3 on 2018-01-28 at 13:07
2018-04-04, 17:23   #2662
Lexicographer
Mar 2018
Shenzhen, China
10010₂ Posts

Problem compiling CUDALucas for 1080 Ti under Linux

Hello!

I'm not sure this is the correct place to ask, but I've run into a problem while trying to compile the latest CUDALucas under Linux.

The problem is:

Quote:
$ make
/usr/local/cuda/bin/nvcc -O1 --generate-code arch=compute_61,code=sm_61 --compiler-options=-Wall -I/usr/local/cuda/include -c CUDALucas.cu
CUDALucas.cu(756): error: identifier "nvmlInit" is undefined

CUDALucas.cu(757): error: identifier "nvmlDevice_t" is undefined

CUDALucas.cu(758): error: identifier "nvmlDeviceGetHandleByIndex" is undefined

CUDALucas.cu(759): error: identifier "nvmlDeviceGetUUID" is undefined

CUDALucas.cu(760): error: identifier "nvmlShutdown" is undefined

It's the same if I try different versions of compute/sm.

I have CUDA Toolkit 9.1 installed.

Any suggestions, please?
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.