#12
Mar 2022
Earth
2008 Posts
@tdulcet
I used your CUDALucas script on a virtual machine and got the following benchmark!
Code:
magallan33@singletesla:~/cudalucas$ ./CUDALucas
CUDALucas v2.06 64-bit build, compiled Apr 22 2022 @ 15:03:04
binary compiled for CUDA 11.60
CUDA runtime version 11.60
CUDA driver version 11.60

---------------- DEVICE 0 ----------------
Device Name             NVIDIA A100-SXM4-40GB
ECC Support?            Enabled
Compatibility           8.0
clockRate (MHz)         1410
memClockRate (MHz)      1215
totalGlobalMem          42314694656
totalConstMem           65536
l2CacheSize             41943040
sharedMemPerBlock       49152
regsPerBlock            65536
warpSize                32
memPitch                2147483647
maxThreadsPerBlock      1024
maxThreadsPerMP         2048
multiProcessorCount     108
maxThreadsDim[3]        1024,1024,64
maxGridSize[3]          2147483647,65535,65535
textureAlignment        512
deviceOverlap           1
pciDeviceID             4
pciBusID                0

You may experience a small delay on 1st startup to due to Just-in-Time Compilation
Using threads: square 256, splice 128.
Starting M113613007 fft length = 6272K

|   Date     Time    |   Test Num     Iter        Residue       |  FFT    Error   ms/It   Time  |    ETA    Done  |
|  Apr 22 15:07:25 | M113613007  10000  0x8e9d1902569605e5 | 6272K  0.16406  0.6424  6.42s | 20:16:25  0.00% |
|  Apr 22 15:07:31 | M113613007  20000  0x3636029ba31b50d0 | 6272K  0.17188  0.6423  6.42s | 20:16:12  0.01% |
|  Apr 22 15:07:38 | M113613007  30000  0xca88b4f9805ebc37 | 6272K  0.16797  0.6423  6.42s | 20:16:03  0.02% |
|  Apr 22 15:07:44 | M113613007  40000  0x8cb45690e9278cb4 | 6272K  0.17188  0.6424  6.42s | 20:16:00  0.03% |
|  Apr 22 15:07:51 | M113613007  50000  0xa867fda0ea381be2 | 6272K  0.17188  0.6423  6.42s | 20:15:51  0.04% |

Here is the same benchmark utilizing GPUowl:

Code:
magallan33@singletesla:~/gpuowl-master$ ./gpuowl -prp 113613007
20220422 15:15:12 GpuOwl VERSION
20220422 15:15:12 GpuOwl VERSION
20220422 15:15:12 config: -user Magallan3s -cpu Magellan -maxAlloc 40000M
20220422 15:15:12 config: -prp 113613007
20220422 15:15:12 device 0, unique id ''
20220422 15:15:12 Magellan 113613007 FFT: 6M 1K:12:256 (18.06 bpw)
20220422 15:15:13 Magellan 113613007 OpenCL args "-DEXP=113613007u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP=0.92078876355848627 -DIWEIGHT_STEP=-0.47938054461158819 -DIWEIGHTS={0,-0.45791076534214703,-0.41227852333612641,-0.36280502904659512,-0.30916693173607174,-0.25101366149694165,-0.18796513798337899,-0.11960928626782931,} -DFWEIGHTS={0,0.84471473710626932,0.70148623064852611,0.56937836233036643,0.44752769654326469,0.33513783709142603,0.23147422207537149,0.13585932291445821,} -cl-std=CL2.0 -cl-finite-math-only "
20220422 15:15:14 Magellan 113613007
20220422 15:15:14 Magellan 113613007 OpenCL compilation in 1.63 s
20220422 15:15:15 Magellan 113613007 maxAlloc: 39.1 GB
20220422 15:15:15 Magellan 113613007 P1(0) 0 bits
20220422 15:15:15 Magellan 113613007 PRP starting from beginning
20220422 15:15:15 Magellan 113613007 OK         0 on-load: blockSize 400, 0000000000000003
20220422 15:15:15 Magellan 113613007 validating proof residues for power 8
20220422 15:15:15 Magellan 113613007 Proof using power 8
20220422 15:15:17 Magellan 113613007 OK       800   0.00% 420cf6918603e7e1 1036 us/it + check 0.61s + save 0.22s; ETA 1d 08:41
20220422 15:15:27 Magellan 113613007     10000 28f5eefd6236e274 1035
20220422 15:15:37 Magellan 113613007     20000 d556e5c56bf104e0 1035
20220422 15:15:47 Magellan 113613007     30000 32d4895e2a4b9a36 1035
20220422 15:15:58 Magellan 113613007     40000 23ad82f3bf8c6401 1035
20220422 15:16:08 Magellan 113613007     50000 ff32c820dd26d801 1035
20220422 15:16:09 Magellan 113613007 Stopping, please wait..
20220422 15:16:10 Magellan 113613007 OK     51200   0.05% 226f82f0f57ba2c7 1036 us/it + check 0.57s + save 0.22s; ETA 1d 08:40
20220422 15:16:10 Magellan Exiting because "stop requested"
20220422 15:16:10 Magellan Bye

Last fiddled with by Magellan3s on 2022-04-22 at 15:17
#13
Mar 2022
Earth
8016 Posts
#14
"University student"
May 2021
Beijing, China
269 Posts
Why is CUDALucas considerably faster than GPUowl here despite using a larger FFT?
#15
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7·13·89 Posts
Post 5 gpuowl A100 389 usec/it vs. post 4 gpuowl 1050 usec/it and post 12 gpuowl 1036 usec/it is a whopping ~2.7:1 speed difference. The commonality in the anomalously slow A100 timings is apparently the user, the test methodology, or the gpuowl version.

The ratio (post 12 CUDALucas 642.3 usec/it in LL) / (post 5 gpuowl 389 usec/it in PRP) ~ 1.65 is about what we would expect, based on numerous comparative benchmarks on numerous GPU models by multiple users. (But it takes 2 LL tests versus ~1.01 PRP tests with proof, so CUDALucas is at more than a triple runtime disadvantage per exponent completed.)

If the user Magellan3s would run gpuowl in the background, while doing top and lsgpu in the foreground, and share the results, it might be informative.

One possibility is that the CUDALucas instance or some other load was still running on the same GPU when gpuowl was timed. Another is that the gpuowl timing was not taken on an A100.

Another possibility is that the slow A100 timings were obtained with an old version of gpuowl that is slower than recent versions; ls -l might address that. (But the integrated P-1 is consistent with a relatively recent gpuowl version, v7.x, which is generally within ~10% of the fastest version for various fft lengths on a Radeon VII.)

Another possibility is that P-1 stage 1 timings, which is what he has reported for gpuowl in both cases, are slower than, not equivalent to, PRP-only timings. Spot-checking my own logs on a Radeon VII, I don't find much support for that, although IIRC there's a ~10% effect.

Another way to approach it is to run nvidia-smi before launching the gpuowl benchmark, to check whether the GPU is an A100 and is idle immediately before the benchmark is launched. Or run nvidia-smi in the foreground while gpuowl, and hopefully nothing else, runs on the A100 in the background. Sample nvidia-smi output on Windows:

Code:
Sat Apr 23 07:02:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 456.71       Driver Version: 456.71       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080   WDDM  | 00000000:01:00.0 Off |                  N/A |
| 64%   66C    P2   125W / 125W |    901MiB /  8192MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080   WDDM  | 00000000:03:00.0 Off |                  N/A |
| 44%   63C    P2   147W / 150W |    551MiB /  8192MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3980      C   ...0-g79ea0cc\gpuowl-win.exe      N/A    |
|    1   N/A  N/A      4940      C   ...80\mfaktc-2047-win-64.exe      N/A    |
+-----------------------------------------------------------------------------+

Last fiddled with by kriesel on 2022-04-23 at 12:24
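The speed comparisons above come down to a few ratios; here is a small Python restatement of them (the variable names are mine, the timings are the ones quoted in this thread):

```python
# Per-iteration timings quoted in this thread (microseconds/iteration).
gpuowl_fast = 389.0    # post 5, gpuowl PRP on an A100 (the fast result)
gpuowl_slow = 1036.0   # post 12, gpuowl PRP, nominally the same A100
cudalucas   = 642.3    # post 12, CUDALucas LL on the A100

# The anomaly between the two gpuowl runs on the same GPU model:
print(round(gpuowl_slow / gpuowl_fast, 2))   # 2.66 -- the "whopping ~2.7:1"

# CUDALucas LL vs. gpuowl PRP, per iteration:
print(round(cudalucas / gpuowl_fast, 2))     # 1.65

# Per completed exponent: LL needs 2 matching tests, PRP with proof ~1.01.
print(round(2 * cudalucas / (1.01 * gpuowl_fast), 2))  # 3.27 -- "more than triple"
```

This is why the per-iteration comparison understates CUDALucas's disadvantage: the double-check requirement roughly doubles its cost per exponent, while a PRP proof nearly eliminates the second test.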
#16
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1FA3₁₆ Posts
https://www.pcgamer.com/overclocking...quid-nitrogen/

Last fiddled with by kriesel on 2022-04-23 at 12:20
#17
Mar 2022
Earth
27 Posts
https://gpuspecs.com/card/nvidia-gef...x-3080-ti-12gb

My current MEMORY clock (not GPU clock) is a stable 10351 MHz - liquid nitrogen isn't required.

Last fiddled with by Magellan3s on 2022-04-23 at 14:04
#18
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7×13×89 Posts
ASUS's RTX 2080 spec page claims 14 GHz, which also appears to be a computed effective rate. https://rog.asus.com/us/graphics-car...ing-model/spec

GPU-Z v2.44.0 indicates the memory clock is running at 1700 MHz, a factor of ~8.235 lower.

Last fiddled with by kriesel on 2022-04-23 at 15:44
#19
Mar 2022
Earth
27 Posts
Ran the test again with gpuowl version 6 for M113606089 (a wavefront exponent) and got much better results.

Code:
magallan33@singletesla:~/gpuowl-6$ ./gpuowl
2022-04-23 20:32:51 gpuowl
2022-04-23 20:32:51 config: -user Magallan3s -cpu Magellan -maxAlloc 40000M
2022-04-23 20:32:51 device 0, unique id ''
2022-04-23 20:32:51 Magellan 113606089 FFT: 6M 1K:12:256 (18.06 bpw)
2022-04-23 20:32:51 Magellan Expected maximum carry32: 4CEB0000
2022-04-23 20:32:52 Magellan OpenCL args "-DEXP=113606089u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DPM1=0 -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0x1.d831959e4834fp-1 -DIWEIGHT_STEP_MINUS_1=-0x1.eb4ab6a4ec4c4p-2 -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2022-04-23 20:32:54 Magellan
2022-04-23 20:32:54 Magellan OpenCL compilation in 2.19 s
2022-04-23 20:32:54 Magellan 113606089 OK         0 loaded: blockSize 400, 0000000000000003
2022-04-23 20:32:54 Magellan validating proof residues for power 8
2022-04-23 20:32:54 Magellan Proof using power 8
2022-04-23 20:32:55 Magellan 113606089 OK       800   0.00%; 409 us/it; ETA 0d 12:55; 1c2da263c0685009 (check 0.32s)
2022-04-23 20:34:16 Magellan 113606089 OK    200000   0.18%; 406 us/it; ETA 0d 12:48; 9cf6ee38f1eaedd9 (check 0.30s)
2022-04-23 20:35:38 Magellan 113606089 OK    400000   0.35%; 406 us/it; ETA 0d 12:46; 62d3515b301b932b (check 0.31s)

https://i.ibb.co/fMS7RcR/Screenshot-...3-15-35-04.png

Last fiddled with by Magellan3s on 2022-04-23 at 20:52
#20
"Tucker Kao"
Jan 2020
Head Base M168202123
370₁₆ Posts
The memory clock and the core clock of the same GPU are two different things. The same goes for the CPU and the memory sticks installed on the motherboard.
Last fiddled with by tuckerkao on 2022-04-24 at 05:34 |
#21
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7·13·89 Posts
Clock rate typically refers to the fundamental operating frequency of some oscillator or pulse-train generator.

Clock rate is, by definition, not necessarily the same as the effective bit rate of some clocked circuitry, which may employ multiple signal lines, various encoding methods, etc. Clock rate and bit rate have different names because they are different things. At one extreme, FM radio modulates a carrier of ~100 MHz with much slower audio data of up to ~15 kHz. https://en.wikipedia.org/wiki/FM_broadcasting (A low ratio of modulation rate to carrier rate is necessary for broadcast signals because of the generation of sidebands.) At the other extreme, some digital tech modulates at a higher bit rate than the fundamental frequency. A lot of ingenuity has been applied to modulation methods. The capacity of a communication channel to carry information is a function of both its bandwidth and its signal-to-noise ratio. https://en.wikipedia.org/wiki/Channe...le_application

The RTX 3080 Ti uses GDDR6X memory, as previously stated. https://www.micron.com/products/ultr...lutions/gddr6x indicates a ~19 Gbit/second/pin data rate. Data rate is important for performance, but effective bit rate is not memory clock frequency. (GPU computing core clock rate is yet another separate parameter, and is not considered further in this post.)

GDDR6X uses PAM4 encoding. https://www.technipages.com/what-is-gddr6x https://en.wikipedia.org/wiki/GDDR6_SDRAM PAM4 uses multiple voltage levels (corresponding to the values 0 1 2 3) to represent two bits on a single signal line at the same instant in time. https://www.edn.com/the-fundamentals-of-pam4/

The maximum signal fundamental frequency in NRZ occurs for alternating 0 and 1 bits: bit rate/2. (010101... each 01 bit pair is one cycle, to minimum and to maximum, one period.) https://en.wikipedia.org/wiki/Non-return-to-zero The various references appear to show that GDDR6X puts 2 bits of modulation on a half period of a signal. That would be 4 bits per clock period if the signal line were operated at the memory clock rate.

I'm not sure where the other factor of ~two in the ratio between memory clock and effective bit rate comes from for GDDR6X. Maybe there's a clock-doubler circuit. The deviation from an integer multiple between clock frequency and effective bit rate may be due to something as simple as rounding.

Last fiddled with by kriesel on 2022-04-24 at 18:02
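As a back-of-envelope check of the factors discussed above, here is a short Python sketch. The numbers are the ones quoted in this thread; the PAM4 accounting follows the reasoning in this post and is illustrative, not a vendor spec:

```python
import math

# RTX 2080 (GDDR6): ASUS's "14 GHz" effective rate vs. GPU-Z's reported
# 1700 MHz memory clock, both quoted earlier in this thread.
effective_mhz = 14000.0
memory_clock_mhz = 1700.0
print(round(effective_mhz / memory_clock_mhz, 3))   # 8.235 -- the factor noted above

# PAM4 accounting for GDDR6X: four voltage levels encode log2(4) = 2 bits
# per symbol, and with one symbol on each half period of the signal that is
# 4 bits per period on a single line at the memory clock rate.
bits_per_symbol = math.log2(4)       # 2.0
symbols_per_period = 2               # one symbol per half period
print(bits_per_symbol * symbols_per_period)          # 4.0 bits per clock period
```

The remaining gap between that factor of 4 and the observed ratios is the unexplained ~2x discussed above; the sketch deliberately does not guess at its source.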
#22
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
14131₈ Posts
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Benchmarks | Pjetrode | Information & Answers | 3 | 2018-01-07 23:23 |
RPS benchmarks | pinhodecarlos | Riesel Prime Search | 29 | 2014-12-07 07:13 |
GPU Benchmarks | houding | Hardware | 7 | 2014-07-09 10:48 |
LLR benchmarks | Retep | Riesel Prime Search | 4 | 2008-11-06 22:15 |
Benchmarks | Vandy | Hardware | 6 | 2002-10-28 13:45 |