mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl
Old 2022-04-22, 15:09   #12
Magellan3s
 
Mar 2022
Earth

2008 Posts

@tdulcet

I used your CUDALucas script on a virtual machine and got the following benchmark!

Quote:
Originally Posted by tdulcet View Post
Colab currently use Ubuntu 18.04, but that should not matter. I would try removing the -yield argument from your config file, as that will slow things down and should not be needed for headless GPUs like the A100.



Code:
magallan33@singletesla:~/cudalucas$ ./CUDALucas

CUDALucas v2.06 64-bit build, compiled Apr 22 2022 @ 15:03:04

binary compiled for CUDA   11.60
CUDA runtime version       11.60
CUDA driver version        11.60

---------------- DEVICE 0 ----------------
Device Name               NVIDIA A100-SXM4-40GB
ECC Support?              Enabled
Compatibility             8.0
clockRate (MHz)           1410
memClockRate (MHz)        1215
totalGlobalMem            42314694656
totalConstMem             65536
l2CacheSize               41943040
sharedMemPerBlock         49152
regsPerBlock              65536
warpSize                  32
memPitch                  2147483647
maxThreadsPerBlock        1024
maxThreadsPerMP           2048
multiProcessorCount       108
maxThreadsDim[3]          1024,1024,64
maxGridSize[3]            2147483647,65535,65535
textureAlignment          512
deviceOverlap             1
pciDeviceID               4
pciBusID                  0

You may experience a small delay on 1st startup to due to Just-in-Time Compilation

Using threads: square 256, splice 128.
Starting M113613007 fft length = 6272K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Apr 22  15:07:25  | M113613007     10000  0x8e9d1902569605e5  |  6272K  0.16406   0.6424    6.42s  |     20:16:25   0.00%  |
|  Apr 22  15:07:31  | M113613007     20000  0x3636029ba31b50d0  |  6272K  0.17188   0.6423    6.42s  |     20:16:12   0.01%  |
|  Apr 22  15:07:38  | M113613007     30000  0xca88b4f9805ebc37  |  6272K  0.16797   0.6423    6.42s  |     20:16:03   0.02%  |
|  Apr 22  15:07:44  | M113613007     40000  0x8cb45690e9278cb4  |  6272K  0.17188   0.6424    6.42s  |     20:16:00   0.03%  |
|  Apr 22  15:07:51  | M113613007     50000  0xa867fda0ea381be2  |  6272K  0.17188   0.6423    6.42s  |     20:15:51   0.04%  |


Here is the same benchmark using GpuOwl:

Code:
magallan33@singletesla:~/gpuowl-master$ ./gpuowl -prp 113613007
20220422 15:15:12 GpuOwl VERSION 
20220422 15:15:12 GpuOwl VERSION 
20220422 15:15:12 config: -user Magallan3s -cpu Magellan -maxAlloc 40000M 
20220422 15:15:12 config: -prp 113613007 
20220422 15:15:12 device 0, unique id ''
20220422 15:15:12 Magellan 113613007 FFT: 6M 1K:12:256 (18.06 bpw)
20220422 15:15:13 Magellan 113613007 OpenCL args "-DEXP=113613007u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP=0.92078876355848627 -DIWEIGHT_STEP=-0.47938054461158819 -DIWEIGHTS={0,-0.45791076534214703,-0.41227852333612641,-0.36280502904659512,-0.30916693173607174,-0.25101366149694165,-0.18796513798337899,-0.11960928626782931,} -DFWEIGHTS={0,0.84471473710626932,0.70148623064852611,0.56937836233036643,0.44752769654326469,0.33513783709142603,0.23147422207537149,0.13585932291445821,}  -cl-std=CL2.0 -cl-finite-math-only "
20220422 15:15:14 Magellan 113613007 

20220422 15:15:14 Magellan 113613007 OpenCL compilation in 1.63 s
20220422 15:15:15 Magellan 113613007 maxAlloc: 39.1 GB
20220422 15:15:15 Magellan 113613007 P1(0) 0 bits
20220422 15:15:15 Magellan 113613007 PRP starting from beginning
20220422 15:15:15 Magellan 113613007 OK         0 on-load: blockSize 400, 0000000000000003
20220422 15:15:15 Magellan 113613007 validating proof residues for power 8
20220422 15:15:15 Magellan 113613007 Proof using power 8
20220422 15:15:17 Magellan 113613007 OK       800   0.00% 420cf6918603e7e1 1036 us/it + check 0.61s + save 0.22s; ETA 1d 08:41
20220422 15:15:27 Magellan 113613007     10000 28f5eefd6236e274 1035
20220422 15:15:37 Magellan 113613007     20000 d556e5c56bf104e0 1035
20220422 15:15:47 Magellan 113613007     30000 32d4895e2a4b9a36 1035
20220422 15:15:58 Magellan 113613007     40000 23ad82f3bf8c6401 1035
20220422 15:16:08 Magellan 113613007     50000 ff32c820dd26d801 1035
20220422 15:16:09 Magellan 113613007 Stopping, please wait..
20220422 15:16:10 Magellan 113613007 OK     51200   0.05% 226f82f0f57ba2c7 1036 us/it + check 0.57s + save 0.22s; ETA 1d 08:40
20220422 15:16:10 Magellan Exiting because "stop requested"
20220422 15:16:10 Magellan Bye
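As a sanity check on these logs (not output of either program), the per-iteration timings can be turned into whole-test runtimes; the `total_runtime` helper below is a hypothetical illustration, assuming a PRP or LL test of 2^p - 1 takes roughly p squaring iterations:

```python
def total_runtime(exponent, us_per_it):
    """Estimate whole-test wall time: roughly `exponent` iterations
    at `us_per_it` microseconds each."""
    seconds = exponent * us_per_it / 1e6
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    return int(days), int(hours), int(rem // 60)

# Timings from the two logs above, exponent M113613007:
print(total_runtime(113613007, 642.3))    # CUDALucas: (0, 20, 16), matching its ~20:16 ETA
print(total_runtime(113613007, 1036.0))   # gpuowl:    (1, 8, 41), matching its 1d 08:41 ETA
```

Both estimates agree with the ETAs the programs themselves print, which confirms the ETA columns are simple extrapolations of the per-iteration timing.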

Last fiddled with by Magellan3s on 2022-04-22 at 15:17
Old 2022-04-22, 19:41   #13
Magellan3s
 
Mar 2022
Earth

8016 Posts

Quote:
Originally Posted by storm5510 View Post
I read this as +200 MHz on the GPU core clock and +1,000 MHz on the memory clock. Is this an addition beyond the default settings?
I adjusted the overclock to a slightly more stable setting: GPU clock is 2100 MHz and Memory clock is 10251 MHz!
Old 2022-04-23, 04:28   #14
Zhangrc
 
"University student"
May 2021
Beijing, China

269 Posts

Why is CUDALucas, despite using a larger FFT, considerably faster than GPUowl?
Old 2022-04-23, 11:53   #15
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·13·89 Posts

Quote:
Originally Posted by Zhangrc View Post
Why is CUDALucas, despite using a larger FFT, considerably faster than GPUowl?
The timing in https://mersenneforum.org/showpost.p...45&postcount=4 is anomalously slow, per post 5.

Post 5's gpuowl A100 figure of 389 µs/it versus post 4's gpuowl 1050 µs/it and post 12's gpuowl 1036 µs/it is a whopping ~2.7:1 speed difference. The common factor in the anomalously slow A100 timings is apparently the user, the test methodology, or the gpuowl version.
The ratio (post 12 CUDALucas 642.3 µs/it in LL) / (post 5 gpuowl 389 µs/it in PRP) ≈ 1.65, which is about what we would expect based on numerous comparative benchmarks on many GPU models by multiple users. (But completing an exponent takes 2 LL tests versus ~1.01 PRP tests with proof, so CUDALucas is at more than a triple runtime disadvantage per exponent completed.)
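The ratio arithmetic above can be sketched numerically; this is a hypothetical illustration using only the timings quoted in this thread:

```python
# Per-iteration timings quoted in this thread (microseconds/iteration):
ll_us = 642.3    # CUDALucas LL on the A100, post 12
prp_us = 389.0   # gpuowl PRP on an A100, post 5

# Raw per-iteration speed ratio:
print(round(ll_us / prp_us, 2))                  # ~1.65

# Per exponent completed: LL needs a first test plus a matching
# double-check (2 full runs); PRP with proof needs only ~1.01 runs.
print(round((2 * ll_us) / (1.01 * prp_us), 2))   # ~3.27
```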

If the user Magellan3s would run gpuowl in the background while running top and lsgpu in the foreground, and share the results, it might be informative. One possibility is that the CUDALucas instance or some other load was still running on the same GPU when gpuowl was timed. Another is that the gpuowl timing was not taken on an A100. Another possibility is that the slow A100 timings were obtained with an old version of gpuowl that is slower than recent versions; ls -l might address that. (But the integrated P-1 is consistent with a relatively recent gpuowl version, v7.x, which is generally within ~10% of the fastest version for various FFT lengths on a Radeon VII.)

Another possibility is that P-1 stage 1 timings (which is what he has reported for gpuowl in both cases) are slower than, not equivalent to, PRP-only timings. Spot-checking my own logs on a Radeon VII, I don't find much support for that, although IIRC there is a ~10% effect.
Another approach is to run nvidia-smi before launching the gpuowl benchmark, to check whether the GPU is an A100 and is idle immediately before the benchmark starts. Or run nvidia-smi in the foreground while gpuowl (and hopefully nothing else) runs on the A100 in the background.
Sample nvidia-smi output on Windows:
Code:
Sat Apr 23 07:02:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 456.71       Driver Version: 456.71       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080   WDDM  | 00000000:01:00.0 Off |                  N/A |
| 64%   66C    P2   125W / 125W |    901MiB /  8192MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080   WDDM  | 00000000:03:00.0 Off |                  N/A |
| 44%   63C    P2   147W / 150W |    551MiB /  8192MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3980      C   ...0-g79ea0cc\gpuowl-win.exe    N/A      |
|    1   N/A  N/A      4940      C   ...80\mfaktc-2047-win-64.exe    N/A      |
+-----------------------------------------------------------------------------+
This tidily shows the GPU model, utilization, what is producing that utilization, and both the power draw and the power-limit setting.

Last fiddled with by kriesel on 2022-04-23 at 12:24
Old 2022-04-23, 12:18   #16
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1FA316 Posts

Quote:
Originally Posted by Magellan3s View Post
I adjusted the overclock to a slightly more stable setting: GPU clock is 2100 MHz and Memory clock is 10251 MHz!
Really? That stated memory clock is more than eight times the nominal: 10251/1188 = 8.629x. Liquid nitrogen cooling? Colder? https://www.techpowerup.com/review/e...ti-ftw3-ultra/
https://www.pcgamer.com/overclocking...quid-nitrogen/

Last fiddled with by kriesel on 2022-04-23 at 12:20
Old 2022-04-23, 14:02   #17
Magellan3s
 
Mar 2022
Earth

27 Posts

Quote:
Originally Posted by kriesel View Post
Really? That stated memory clock is more than eight times the nominal. 10251/1188 = 8.629x. Liquid nitrogen cooling? Colder? https://www.techpowerup.com/review/e...ti-ftw3-ultra/
https://www.pcgamer.com/overclocking...quid-nitrogen/
The NVIDIA 3000-series cards feature GDDR6X memory.

https://gpuspecs.com/card/nvidia-gef...x-3080-ti-12gb

My current MEMORY clock (not the GPU clock) is a stable 10351 MHz; liquid nitrogen isn't required.

Last fiddled with by Magellan3s on 2022-04-23 at 14:04
Old 2022-04-23, 15:42   #18
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7×13×89 Posts

ASUS claims 14 GHz for its RTX 2080, which also appears to be a computed effective rate. https://rog.asus.com/us/graphics-car...ing-model/spec
GPU-Z v2.44.0 indicates the memory clock running at 1700 MHz, a factor of ~8.235 lower.

Last fiddled with by kriesel on 2022-04-23 at 15:44
Old 2022-04-23, 20:51   #19
Magellan3s
 
Mar 2022
Earth

27 Posts

Quote:
Originally Posted by kriesel View Post
https://mersenneforum.org/showpost.p...45&postcount=4 The commonality in the anomalously slow A100 timings is apparently with a user or test methodology or gpuowl version.
The GpuOwl in the first benchmark with the A100 was the newest GitHub version, "master".

I ran the test again with version 6 on 113606089 (a wavefront exponent) and got much better results.


Code:
magallan33@singletesla:~/gpuowl-6$ ./gpuowl
2022-04-23 20:32:51 gpuowl 
2022-04-23 20:32:51 config: -user Magallan3s -cpu Magellan -maxAlloc 40000M 
2022-04-23 20:32:51 device 0, unique id ''
2022-04-23 20:32:51 Magellan 113606089 FFT: 6M 1K:12:256 (18.06 bpw)
2022-04-23 20:32:51 Magellan Expected maximum carry32: 4CEB0000
2022-04-23 20:32:52 Magellan OpenCL args "-DEXP=113606089u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DPM1=0 -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0x1.d831959e4834fp-1 -DIWEIGHT_STEP_MINUS_1=-0x1.eb4ab6a4ec4c4p-2  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2022-04-23 20:32:54 Magellan 

2022-04-23 20:32:54 Magellan OpenCL compilation in 2.19 s
2022-04-23 20:32:54 Magellan 113606089 OK        0 loaded: blockSize 400, 0000000000000003
2022-04-23 20:32:54 Magellan validating proof residues for power 8
2022-04-23 20:32:54 Magellan Proof using power 8
2022-04-23 20:32:55 Magellan 113606089 OK      800   0.00%;  409 us/it; ETA 0d 12:55; 1c2da263c0685009 (check 0.32s)
2022-04-23 20:34:16 Magellan 113606089 OK   200000   0.18%;  406 us/it; ETA 0d 12:48; 9cf6ee38f1eaedd9 (check 0.30s)
2022-04-23 20:35:38 Magellan 113606089 OK   400000   0.35%;  406 us/it; ETA 0d 12:46; 62d3515b301b932b (check 0.31s)
https://i.ibb.co/mJB7sxs/Screenshot-...3-15-35-04.png
https://i.ibb.co/fMS7RcR/Screenshot-...3-15-35-04.png

Last fiddled with by Magellan3s on 2022-04-23 at 20:52
Old 2022-04-24, 05:28   #20
tuckerkao
 
"Tucker Kao"
Jan 2020
Head Base M168202123

37016 Posts

The memory clock and the core clock of a GPU are two different things. The same goes for the CPU and the memory sticks installed on the motherboard.

Last fiddled with by tuckerkao on 2022-04-24 at 05:34
Old 2022-04-24, 17:58   #21
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·13·89 Posts

Clock rate typically refers to the fundamental operating frequency of some oscillator or pulse-train generator.
Clock rate is, by definition, not necessarily the same as the effective bit rate of some clocked circuitry, which may employ multiple signal lines, various encoding methods, etc.
Clock rate and bit rate have different names because they are different things.
At one extreme, FM radio modulates a ~100 MHz carrier with much slower audio data of up to ~15 kHz. https://en.wikipedia.org/wiki/FM_broadcasting (A low ratio of modulation rate to carrier rate is necessary for broadcast signals because of the generation of sidebands.)
At the other extreme, some digital technology modulates at a higher bit rate than the fundamental frequency. A lot of ingenuity has been applied to modulation methods.
The capacity of a communication channel to carry information is a function of both its bandwidth and its signal-to-noise ratio. https://en.wikipedia.org/wiki/Channe...le_application

The RTX 3080 Ti uses GDDR6X memory, as previously stated.
https://www.micron.com/products/ultr...lutions/gddr6x indicates a ~19 Gbit/second/pin data rate.
Data rate is important for performance.
But the effective bit rate is not the memory clock frequency.
(The GPU computing core clock rate is yet another separate parameter, and not considered further in this post.)
GDDR6X uses PAM4 encoding. https://www.technipages.com/what-is-gddr6x
https://en.wikipedia.org/wiki/GDDR6_SDRAM
PAM4 uses four voltage levels (corresponding to the values 0, 1, 2, 3) to represent two bits on a single signal line at the same instant in time.
https://www.edn.com/the-fundamentals-of-pam4/
In NRZ, the maximum fundamental signal frequency occurs for alternating 0 and 1 bits: bit rate / 2. (In 010101..., each 01 bit pair is one cycle to minimum and to maximum, one period.) https://en.wikipedia.org/wiki/Non-return-to-zero

The various references appear to show that GDDR6X puts 2 bits of modulation on each half period of the signal.
That would be 4 bits per clock period if the signal line were operated at the memory clock rate. I'm not sure where the other factor of ~2 in the ratio between memory clock and effective bit rate comes from for GDDR6X. Maybe there is a clock-doubler circuit.
The departure from an integer multiple between clock frequency and effective bit rate may be due to something as simple as rounding.
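The bookkeeping above can be checked against the numbers quoted in this thread (the 1700 MHz GDDR6 and 1188 MHz GDDR6X memory clocks, and the 14 and ~19 Gb/s/pin effective rates). The decomposition in the comments reflects the speculation above, not a confirmed specification:

```python
def effective_multiplier(data_rate_gbps, clock_mhz):
    """Ratio of per-pin effective data rate to the reported memory clock."""
    return data_rate_gbps * 1000.0 / clock_mhz

# RTX 2080, GDDR6: 14 Gb/s/pin effective vs the 1700 MHz clock GPU-Z reports
print(round(effective_multiplier(14.0, 1700.0), 3))   # ~8.235

# RTX 3080 Ti, GDDR6X: Micron's ~19 Gb/s/pin vs the 1188 MHz nominal clock
print(round(effective_multiplier(19.0, 1188.0), 3))   # ~15.993
# Speculative decomposition: 2 bits/symbol (PAM4) x 2 edges per cycle (DDR)
# x some further clock multiplication would be needed to reach these ratios.
```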

Last fiddled with by kriesel on 2022-04-24 at 18:02
Old 2022-04-25, 13:20   #22
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Liverpool (GMT/BST)

141318 Posts

Quote:
Originally Posted by kriesel View Post
The various references appear to show that GDDR6X puts 2 bits of modulation on a half period of a signal.
That would be 4 bits per period of clock if the signal line was operated at the memory clock rate. I'm not sure where the other factor of ~two in ratio between memory clock and effective bit rate comes from for GDDR6X. Maybe there's a clock doubler circuit.
The difference from integer multiples between clock frequency and effective bit rate may be due to something as simple as rounding.
According to https://www.anandtech.com/show/15978...idias-rtx-3090, GDDR6X would actually be better named GQDR6X. Maybe the quad data rate is the extra factor of 2.