mersenneforum.org  

2022-08-29, 21:20   #705
ryanp
Jun 2012
Boulder, CO
773₈ Posts

Quote:
Originally Posted by rogue
Any updates?

The update is that I have found combinations of -G, -g and -M that are functional -- srsieve2cl runs, and I see 100% GPU utilization in nvidia-smi -- but it is no faster (on an A100) than a 64-core CPU worker on another machine. I have no idea how to tune these, what's optimal, or whether there's some underlying bottleneck somewhere else.

Example:

Code:
$ ./srsieve2cl -P 1e15 -G 16 -g 512 -M 1000000 -n 20e6 -N 25e6 -o ferm9_20M_25M_sv1e15.txt -s 9*2^n+1
srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic for p >= 3
Creating CPU worker to use until p >= 1000000
GPU primes per worker is 14155776
Sieve started: 3 < p < 1e15 with 5000001 terms (20000000 < n < 25000000, k*2^n+1) (expecting 4840960 factors)
Sieving with single sequence c=1 logic for p >= 257
BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720
Split 1 base 2 sequence into 104 base 2^360 sequences.
Legendre summary:  Approximately 2 B needed for Legendre tables
         1 total sequences
         1 are eligible for Legendre tables
         0 are not eligible for Legendre tables
         1 have Legendre tables in memory
         0 cannot have Legendre tables in memory
         0 have Legendre tables loaded from files
         1 required building of the Legendre tables
518400 bytes used for congruent q and ladder indices
259200 bytes used for congruent qs and ladders
Creating CPU worker to use until p >= 1000000
  p=540837151, 6.587M p/sec, 4668811 factors found at 352.2 f/sec (last 1 min),
  p=9303408733, 7.126M p/sec, 4680592 factors found at 144.5 f/sec (last 2 min),
  p=21575254847, 7.304M p/sec, 4686586 factors found at 89.74 f/sec (last 3 min), 0.0% done. ETC 2022-12-06 01:21

On the 64-core CPU machine, I'm getting 7.589M p/sec. It seems like an NVIDIA A100 should be able to do far better than this.

2022-08-29, 21:30   #706
ryanp
Jun 2012
Boulder, CO
3·13² Posts

Quote:
Originally Posted by rogue
It is also possible that the default GPU it is using is not the GPU you are expecting. Start with -h to see the default GPU. You can use command line switches to change the platform and device.

That's definitely not the case here -- this is a VM with one GPU, and I see it pegged at 100% in nvidia-smi. It's just not any faster than a 64-core CPU instance.

2022-08-30, 01:55   #707
rogue
"Mark"
Apr 2003
Between here and the
2⁶·3²·13 Posts

Quote:
Originally Posted by ryanp
That's definitely not the case here -- this is a VM with one GPU, and I see it pegged at 100% in nvidia-smi. It's just not any faster than a 64-core CPU instance.

Note that you cannot directly compare sr1sieve/sr2sieve speeds with srsieve2 using p/sec. It is better to fix a range of p, e.g. up to 1e12, and see how long each program takes to complete that range.

Are you comparing srsieve2 with -W32/-W64 against srsieve2cl (with no -W)? I'm trying to understand what you are comparing when you say it is no faster.

I don't have access to such a GPU. It is very possible that you are correct, but I can't attest to that one way or another.

It is using an OpenCL version of the same algorithm used by sr1sieve and srsieve2, although it has no assembly.

I wonder if there could be limitations due to it running on a VM. Maybe the VM configuration is limiting how much GPU it can use. I haven't run on a VM, so that is just a guess.
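
As a rough illustration of the "time a fixed range" comparison suggested above, here is a minimal sketch. The helper names `time_command` and `compare` are made up for this example, and the command lines in the usage note only reuse switches that appear in this thread; adjust them to your own setup.

```python
import shlex
import subprocess
import time


def time_command(command: str) -> float:
    """Run one command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the program exits with an error,
    # so a failed run is never mistaken for a fast one.
    subprocess.run(shlex.split(command), check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def compare(commands: dict[str, str]) -> dict[str, float]:
    """Time each labelled command; every command should cover the same range of p."""
    return {label: time_command(cmd) for label, cmd in commands.items()}
```

Possible usage, assuming both binaries are in the current directory and both runs sieve the same sequence to the same -P: `compare({"cpu": "./srsieve2 -P 1e12 -W 64 -n 20e6 -N 25e6 -s 9*2^n+1", "gpu": "./srsieve2cl -P 1e12 -G 16 -g 512 -M 1000000 -n 20e6 -N 25e6 -s 9*2^n+1"})`. The elapsed seconds, not the reported p/sec, are then the numbers to compare.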

2022-09-05, 18:15   #708
pepi37
Dec 2011
After 1.58M nines:)
6F5₁₆ Posts

Quote:
Originally Posted by rogue
I don't have access to such a GPU. It is very possible that you are correct, but I can't attest to that one way or another.
It is using an OpenCL version of the same algorithm used by sr1sieve and srsieve2, although it has no assembly.

We can arrange that: I am willing to give you access to my Windows box with an RTX 2060 so you can see for yourself. As far as I can tell, srsieve2cl does not work at all. I would like the program to work, since I have a few GPUs and could then make nice sieve files by myself, but reality is different :( The mystery is even bigger since you always say that on your machine srsieve2cl is faster than srsieve2, and nobody else can reproduce that.
Just contact me via PM
Best regards

Last fiddled with by pepi37 on 2022-09-05 at 18:20

2022-09-06, 19:26   #709
pepi37
Dec 2011
After 1.58M nines:)
13×137 Posts

The results are here:


Code:
srsieve2cl.exe -P1e6 -n 4 -N 10000000  -s "4*53^n+1"
 Creating CPU worker to use until p >= 1000000
Sieve completed at p=1000003.
CPU time: 8.38 sec. (0.00 sieving) (0.93 cores) GPU time: 0.00 sec.
305747 terms written to b53_n.abcd
Primes tested: 78290.  Factors found: 7194250.  Remaining terms: 305747.  Time: 9.04 seconds.

srsieve2cl.exe -P1e7 -n 4 -N 10000000  -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Fatal Error:  Could not handle all GPU factors.  A range of p generated 36883 factors (limited to 7992).  Use -M to increase max factor density

srsieve2cl.exe -P1e7 -n 4 -N 10000000 -M 1000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Sieve completed at p=10059761.
CPU time: 10.50 sec. (0.00 sieving) (0.87 cores) GPU time: 1.78 sec.
261476 terms written to b53_n.abcd
Primes tested: 589824.  Factors found: 7238521.  Remaining terms: 261476.  Time: 12.10 seconds.

srsieve2cl.exe -P 1e11 -D 1 -G 2 -n 4 -N 10000000 -M 4000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
  p=1789474679, 1.449M p/sec, 7302068 factors found at 1.073K f/sec (last 1 min), 1.8% done. ETC 2022-09-06 22:08         
  p=4033235879, 1.570M p/sec, 7309308 factors found at 507.2 f/sec (last 2 min), 4.0% done. ETC 2022-09-06 22:02          
  p=6335693783, 1.609M p/sec, 7313108 factors found at 335.6 f/sec (last 3 min), 6.3% done. ETC 2022-09-06 21:59          
  p=8678189381, 1.629M p/sec, 7315702 factors found at 252.5 f/sec (last 4 min), 8.7% done. ETC 2022-09-06 21:58     

srsieve2cl.exe -P 1e11 -W6 -n 4 -N 10000000 -M 4000 -s "4*53^n+1"
Increasing worksize to 64000 since each chunk is tested in less than a second
Increasing worksize to 1024000 since each chunk is tested in less than a second
  p=4336204969, 3.426M p/sec, 7310008 factors found at 1.240K f/sec (last 1 min), 4.3% done. ETC 2022-09-06 21:40         
  p=9693240527, 3.656M p/sec, 7316685 factors found at 570.1 f/sec (last 2 min), 9.7% done. ETC 2022-09-06 21:38          
  p=15208974959, 3.737M p/sec, 7320231 factors found at 369.4 f/sec (last 3 min), 15.2% done. ETC 2022-09-06 21:37
With option -G2 I got 98% GPU utilization. But in the end my i5-9600K (running on 6 cores) is roughly twice as fast as my RTX 2060 (about 3.7M p/sec versus 1.6M p/sec).
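
To put a number on that comparison, here is a small sketch that pulls the rate out of the last status line of each run above and takes the ratio. The regular expression is only an assumption based on the lines shown in this post, not on any documented output format, and `parse_status` is a name made up for this example.

```python
import re

# Matches the start of a status line such as
# "p=8678189381, 1.629M p/sec, 7315702 factors found at ..."
STATUS = re.compile(r"p=(\d+),\s*([\d.]+)([KM]?) p/sec")
SCALE = {"": 1.0, "K": 1e3, "M": 1e6}


def parse_status(line: str) -> tuple[int, float]:
    """Return (current p, primes tested per second) from one status line."""
    match = STATUS.search(line)
    if match is None:
        raise ValueError(f"not a status line: {line!r}")
    p, rate, suffix = match.groups()
    return int(p), float(rate) * SCALE[suffix]


# Last status line of the -G 2 run and of the -W6 run, copied from the logs above.
gpu_line = "p=8678189381, 1.629M p/sec, 7315702 factors found at 252.5 f/sec (last 4 min), 8.7% done."
cpu_line = "p=15208974959, 3.737M p/sec, 7320231 factors found at 369.4 f/sec (last 3 min), 15.2% done."

_, gpu_rate = parse_status(gpu_line)
_, cpu_rate = parse_status(cpu_line)
print(f"CPU/GPU rate ratio: {cpu_rate / gpu_rate:.2f}")  # CPU/GPU rate ratio: 2.29
```

Both runs use the same program and the same sequence here, so the p/sec figures are at least measured the same way; timing a fixed range of p would still be the more direct test.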

Last fiddled with by pepi37 on 2022-09-06 at 19:42

2022-09-06, 21:26   #710
rogue
"Mark"
Apr 2003
Between here and the
2⁶·3²·13 Posts

Quote:
Originally Posted by pepi37
Fatal Error: Could not handle all GPU factors. A range of p generated 36883 factors (limited to 7992).

This error tells you to adjust -M. The default for -M is 100. Change it to 500 to hold all of the factors. This increases that limit by a factor of 5. This parameter is not necessary once you have sieved more deeply, as fewer p produce factors.

If you use -h, it will tell you which platform and device are the default to be used. I suspect -D1 is using the Intel Integrated GPU, which isn't going to provide you much of anything.
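
The arithmetic behind the suggested -M value can be sketched as follows. This assumes the factor limit scales linearly with -M, as described above; `min_factor_density` is a made-up helper name, not part of srsieve2cl.

```python
import math


def min_factor_density(factors_seen: int, factor_limit: int, current_m: int = 100) -> int:
    """Smallest -M that would have held factors_seen, assuming the limit grows linearly with -M."""
    return math.ceil(factors_seen * current_m / factor_limit)


# Numbers taken from the error message quoted above, with the default -M of 100.
needed = min_factor_density(factors_seen=36883, factor_limit=7992)
print(needed)  # 462, so rounding up to -M 500 leaves some headroom
```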

2022-09-06, 21:32   #711
pepi37
Dec 2011
After 1.58M nines:)
1781₁₀ Posts

Quote:
Originally Posted by rogue
This error tells you to adjust -M. The default for -M is 100. Change it to 500 to hold all of the factors. This increases that limit by a factor of 5. This parameter is not necessary once you have sieved more deeply, as fewer p produce factors.

If you use -h, it will tell you which platform and device are the default to be used. I suspect -D1 is using the Intel Integrated GPU, which isn't going to provide you much of anything.

It is the opposite: -D 0 is Intel and -D 1 is NVIDIA. I checked twice.


g:\SRSIEVE2>srsieve2cl.exe -h
srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
-h --help prints this help
-p --pmin=P0 sieve start: P0 < p (default 3)
-P --pmax=P1 sieve end: p < P1 (default 2^62)
-w --worksize=w initial primes per chunk of work (default 16000)
-W --workers=W start W workers (default 0)
-g --gpuworkgroups=g work groups per call to GPU (default 8)
-G --gpuworkers=G start G GPU workers (default 0)
-D --platform=D Use platform D instead of 0
-d --device=d Use device d instead of 0
-H --showgpudetail Show device and kernel details
List of available platforms and devices
Platform 0 is a Intel(R) Corporation Intel(R) OpenCL HD Graphics, version OpenCL 2.1
Device 0 is a Intel(R) Corporation Intel(R) UHD Graphics 630
Platform 1 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 3.0 CUDA 11.6.127
Device 0 is a NVIDIA Corporation NVIDIA GeForce RTX 2060
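
For anyone scripting this, a minimal sketch of picking the -D and -d values out of the listing above. It parses only the "Platform N is a ..." and "Device N is a ..." lines exactly as printed here; `find_device` is a made-up helper, and the line format is assumed from this one output, not from any documentation.

```python
import re

LISTING = """\
Platform 0 is a Intel(R) Corporation Intel(R) OpenCL HD Graphics, version OpenCL 2.1
Device 0 is a Intel(R) Corporation Intel(R) UHD Graphics 630
Platform 1 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 3.0 CUDA 11.6.127
Device 0 is a NVIDIA Corporation NVIDIA GeForce RTX 2060
"""


def find_device(listing: str, keyword: str) -> tuple[int, int]:
    """Return (platform, device) indices of the first device whose name contains keyword."""
    platform = None
    for line in listing.splitlines():
        match = re.match(r"\s*(Platform|Device) (\d+) is a (.*)", line)
        if match is None:
            continue
        kind, index, name = match.groups()
        if kind == "Platform":
            # Remember which platform the following Device lines belong to.
            platform = int(index)
        elif platform is not None and keyword.lower() in name.lower():
            return platform, int(index)
    raise LookupError(f"no device matching {keyword!r}")


platform, device = find_device(LISTING, "NVIDIA")
print(f"-D {platform} -d {device}")  # -D 1 -d 0
```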

Last fiddled with by pepi37 on 2022-09-06 at 21:40

2022-09-08, 07:14   #712
rebirther
Sep 2011
Germany
2×1,877 Posts

It would be great if srsieve2cl could define a max VRAM gap to be used for a calculation. The last time I tried it on an 8 GB card to find the optimum, the driver crashed several times while I was over the limit; playing with the workers, I arrived at a 7 GB limit. It would be a great help if the program could define the rest of the max workers by itself.

2022-09-08, 12:42   #713
rogue
"Mark"
Apr 2003
Between here and the
7488₁₀ Posts

Quote:
Originally Posted by rebirther
It would be great if srsieve2cl could define a max VRAM gap to be used for a calculation. The last time I tried it on an 8 GB card to find the optimum, the driver crashed several times while I was over the limit; playing with the workers, I arrived at a 7 GB limit. It would be a great help if the program could define the rest of the max workers by itself.

I do not know what you mean by "max VRAM gap" or "rest of the max workers". The only options you have at your disposal are -g, -G, and -K.

As for how much GPU memory it uses when executing a kernel, that is a good question. There are many values I can pull from the driver regarding memory utilization and some that I can compute on the fly. You can see these if you specify the -H switch. I haven't been able to determine a good way to know that a kernel will fail due to requiring too much memory.
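
Until the program can predict this itself, one workaround on NVIDIA cards is to watch memory use from outside while trying different -G and -g values. The sketch below is not something srsieve2cl does; it simply calls `nvidia-smi` with its `--query-gpu` option and parses the reply. The function names are made up for this example, and it assumes an NVIDIA driver with `nvidia-smi` on the PATH.

```python
import subprocess

# nvidia-smi prints one "used, total" row per GPU, in MiB, with these options.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' rows (MiB) into one tuple per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for current memory use; needs an NVIDIA driver installed."""
    output = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_memory(output)
```

Calling `gpu_memory()` in a second terminal while the sieve runs shows how close a given worker count gets to the card's total before the driver gives up.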

2022-09-08, 15:42   #714
ryanp
Jun 2012
Boulder, CO
3·13² Posts

Is it possible there is some bottleneck in how fast things are fed to the GPU, one that is the same for CPU workers? I don't know how else to explain an A100 being the same speed as a regular 64-core machine.

(Despite the fact that it is a VM, when running mfaktc I generally see numbers similar to those listed at https://www.mersenne.ca/mfaktc.php.)

2022-09-08, 17:55   #715
rogue
"Mark"
Apr 2003
Between here and the
1110101000000₂ Posts

Quote:
Originally Posted by ryanp
Is it possible there is some bottleneck in how fast things are fed to the GPU, one that is the same for CPU workers? I don't know how else to explain an A100 being the same speed as a regular 64-core machine.

(Despite the fact that it is a VM, when running mfaktc I generally see numbers similar to those listed at https://www.mersenne.ca/mfaktc.php.)

Moving to PM for the time being.


All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
