mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)

 ryanp 2022-08-29 21:20

The update is that I have found combinations of [C]-G[/C], [C]-g[/C] and [C]-M[/C] that are functional -- srsieve2cl runs, and I see 100% GPU utilization in [C]nvidia-smi[/C] -- but it is no faster (on an A100) than a 64-core CPU worker on another machine. I have no idea how to tune these, what's optimal, or whether there's some underlying bottleneck somewhere else.

Example:

[CODE]$ ./srsieve2cl -P 1e15 -G 16 -g 512 -M 1000000 -n 20e6 -N 25e6 -o ferm9_20M_25M_sv1e15.txt -s 9*2^n+1
srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic for p >= 3
Creating CPU worker to use until p >= 1000000
GPU primes per worker is 14155776
Sieve started: 3 < p < 1e15 with 5000001 terms (20000000 < n < 25000000, k*2^n+1) (expecting 4840960 factors)
Sieving with single sequence c=1 logic for p >= 257
BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720
Split 1 base 2 sequence into 104 base 2^360 sequences.
Legendre summary: Approximately 2 B needed for Legendre tables
1 total sequences
1 are eligible for Legendre tables
0 are not eligible for Legendre tables
1 have Legendre tables in memory
0 cannot have Legendre tables in memory
0 have Legendre tables loaded from files
1 required building of the Legendre tables
518400 bytes used for congruent q and ladder indices
259200 bytes used for congruent qs and ladders
Creating CPU worker to use until p >= 1000000
p=540837151, 6.587M p/sec, 4668811 factors found at 352.2 f/sec (last 1 min),
p=9303408733, 7.126M p/sec, 4680592 factors found at 144.5 f/sec (last 2 min),
p=21575254847, 7.304M p/sec, 4686586 factors found at 89.74 f/sec (last 3 min), 0.0% done.
ETC 2022-12-06 01:21 [/CODE]

On the 64-core CPU machine, I'm getting 7.589M p/sec. It seems like an NVIDIA A100 should be able to do far better than this.

 ryanp 2022-08-29 21:30

[QUOTE=rogue;612278]It is also possible that the default GPU it is using is not the GPU you are expecting. Start with -h to see the default GPU. You can use command line switches to change the platform and device.[/QUOTE]

That's definitely not the case here -- this is a VM with one GPU, and I see it pegged at 100% in [C]nvidia-smi[/C]. It's just not any faster than a 64-core CPU instance.

 rogue 2022-08-30 01:55

[QUOTE=ryanp;612301]That's definitely not the case here -- this is a VM with one GPU, and I see it pegged at 100% in [C]nvidia-smi[/C]. It's just not any faster than a 64-core CPU instance.[/QUOTE]

Note that you cannot directly compare sr1sieve/sr2sieve speeds with srsieve2 using p/sec. It is best to pick a range of p, e.g. up to 1e12, and see how long each program takes to complete that range.
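A minimal sketch of that kind of comparison, timing both programs over the same range of p instead of reading p/sec. The helper names and the [C]./srsieve2[/C] and [C]./srsieve2cl[/C] paths are hypothetical; the switches are the ones that appear elsewhere in this thread, and the worker settings are only examples, not recommendations.

```python
import subprocess
import time

def build_cmd(binary, pmax, worker_args, seq_args):
    """Return an argument list that sieves up to pmax with the given binary."""
    return [binary, "-P", str(pmax), *worker_args, *seq_args]

def time_range(cmd):
    """Run one sieve to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Same sequence and same upper bound for both runs; only the worker switches differ.
seq = ["-n", "20e6", "-N", "25e6", "-s", "9*2^n+1"]
cpu_cmd = build_cmd("./srsieve2", "1e12", ["-W64"], seq)
gpu_cmd = build_cmd("./srsieve2cl", "1e12", ["-G", "16", "-g", "512", "-M", "1000000"], seq)
```

Calling [C]time_range(cpu_cmd)[/C] and [C]time_range(gpu_cmd)[/C] on the respective machines then gives two wall-clock times for the same work, which is the figure to compare.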

Are you comparing srsieve2 with -W32/-W64 against srsieve2cl (with no -W)? I'm trying to understand what you are comparing when you say it is no faster.

I don't have access to such a GPU. It is very possible that you are correct, but I can't attest to that one way or another.

It is using an OpenCL version of the same algorithm used by sr1sieve and srsieve2, although it has no assembly.

I wonder if there could be limitations due to it running on a VM. Maybe the VM configuration is limiting how much GPU it can use. I haven't run on a VM, so that is just a guess.

 pepi37 2022-09-05 18:15

[QUOTE=rogue;612310]
I don't have access to such a GPU. It is very possible that you are correct, but I can't attest to that one way or another.
It is using an OpenCL version of the same algorithm used by sr1sieve and srsieve2, although it has no assembly.
[/QUOTE]

We can arrange it; I am willing to give you access to my Windows box with an RTX 2060 so you can see for yourself. As far as I can tell, srsieve2cl does not work at all. I would like the program to work, since I have a few GPUs and would then be able to make nice sieve files by myself, but the reality is different :( The mystery is even bigger since you always say that on your machine srsieve2cl is faster than srsieve2, and nobody else can reproduce that.
Just contact me via PM
Best regards

 pepi37 2022-09-06 19:26

Results are here:

[CODE]srsieve2cl.exe -P1e6 -n 4 -N 10000000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Sieve completed at p=1000003.
CPU time: 8.38 sec. (0.00 sieving) (0.93 cores) GPU time: 0.00 sec.
305747 terms written to b53_n.abcd
Primes tested: 78290. Factors found: 7194250. Remaining terms: 305747. Time: 9.04 seconds.

srsieve2cl.exe -P1e7 -n 4 -N 10000000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Fatal Error: Could not handle all GPU factors. A range of p generated 36883 factors (limited to 7992). Use -M to increase max factor density

srsieve2cl.exe -P1e7 -n 4 -N 10000000 -M 1000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Sieve completed at p=10059761.
CPU time: 10.50 sec. (0.00 sieving) (0.87 cores) GPU time: 1.78 sec.
261476 terms written to b53_n.abcd
Primes tested: 589824. Factors found: 7238521. Remaining terms: 261476. Time: 12.10 seconds.

srsieve2cl.exe -P 1e11 -D 1 -G 2 -n 4 -N 10000000 -M 4000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
p=1789474679, 1.449M p/sec, 7302068 factors found at 1.073K f/sec (last 1 min), 1.8% done. ETC 2022-09-06 22:08
p=4033235879, 1.570M p/sec, 7309308 factors found at 507.2 f/sec (last 2 min), 4.0% done. ETC 2022-09-06 22:02
p=6335693783, 1.609M p/sec, 7313108 factors found at 335.6 f/sec (last 3 min), 6.3% done. ETC 2022-09-06 21:59
p=8678189381, 1.629M p/sec, 7315702 factors found at 252.5 f/sec (last 4 min), 8.7% done. ETC 2022-09-06 21:58

srsieve2cl.exe -P 1e11 -W6 -n 4 -N 10000000 -M 4000 -s "4*53^n+1"
Increasing worksize to 64000 since each chunk is tested in less than a second
Increasing worksize to 1024000 since each chunk is tested in less than a second
p=4336204969, 3.426M p/sec, 7310008 factors found at 1.240K f/sec (last 1 min), 4.3% done. ETC 2022-09-06 21:40
p=9693240527, 3.656M p/sec, 7316685 factors found at 570.1 f/sec (last 2 min), 9.7% done. ETC 2022-09-06 21:38
p=15208974959, 3.737M p/sec, 7320231 factors found at 369.4 f/sec (last 3 min), 15.2% done. ETC 2022-09-06 21:37
[/CODE]
With option -G2 I got 98% GPU utilization. But in the end my i5-9600K (running on 6 cores) is nearly twice as fast as my RTX 2060.
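As a rough check of that comparison, the progress lines from the two 1e11 runs above can be turned into "amount of the p range covered per minute". This is only a sketch: it assumes the [C](last N min)[/C] field is the elapsed time since the start of the run and that both runs started from the same p, and the function name is made up for illustration.

```python
import re

# Last progress line copied from each run above.
gpu_log = "p=8678189381, 1.629M p/sec, 7315702 factors found at 252.5 f/sec (last 4 min), 8.7% done. ETC 2022-09-06 21:58"
cpu_log = "p=15208974959, 3.737M p/sec, 7320231 factors found at 369.4 f/sec (last 3 min), 15.2% done. ETC 2022-09-06 21:37"

PROGRESS = re.compile(r"p=(\d+),.*\(last (\d+) min\)")

def p_per_minute(log):
    """Average amount of the p range covered per minute, from the last progress line."""
    matches = list(PROGRESS.finditer(log))
    p, minutes = int(matches[-1].group(1)), int(matches[-1].group(2))
    return p / minutes

ratio = p_per_minute(cpu_log) / p_per_minute(gpu_log)
print(f"CPU run covers the range about {ratio:.2f}x as fast as the GPU run")
```

Timing both runs to completion over the full range would be more reliable than extrapolating from the first few minutes.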

 rogue 2022-09-06 21:26

[QUOTE=pepi37;612810]Fatal Error: Could not handle all GPU factors. A range of p generated 36883 factors (limited to 7992).[/QUOTE]

This error tells you to adjust -M. The default for -M is 100. Change to 500 to hold all of the factors. This increases that limit by a factor of 5. This parameter is not necessary once you have sieved more deeply as fewer p produce factors.
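A small worked example of that arithmetic, using the numbers from the error message. It assumes the factor limit scales linearly with -M (as the "factor of 5" above suggests); the function name is made up for illustration.

```python
def min_density_setting(factors_seen, current_limit, current_m=100):
    """Smallest -M that would hold factors_seen, assuming the limit is proportional to -M."""
    # Integer ceiling of factors_seen * current_m / current_limit.
    return -(-factors_seen * current_m // current_limit)

# 36883 factors were generated against a limit of 7992 at the default -M 100.
print(min_density_setting(36883, 7992))  # prints 462
```

Under that assumption anything from 462 upward would have been enough for this range, so 500 leaves a small margin.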

If you use -h, it will tell you which platform and device are the default to be used. I suspect -D1 is using the Intel Integrated GPU, which isn't going to provide you much of anything.

 pepi37 2022-09-06 21:32

[QUOTE=rogue;612815]This error tells you to adjust -M. The default for -M is 100. Change to 500 to hold all of the factors. This increases that limit by a factor of 5. This parameter is not necessary once you have sieved more deeply as fewer p produce factors.

If you use -h, it will tell you which platform and device are the default to be used. I suspect -D1 is using the Intel Integrated GPU, which isn't going to provide you much of anything.[/QUOTE]
It is the opposite: -D 0 is Intel, -D 1 is NVIDIA.
I checked twice.

[CODE]g:\SRSIEVE2>srsieve2cl.exe -h
srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
-h --help prints this help
-p --pmin=P0 sieve start: P0 < p (default 3)
-P --pmax=P1 sieve end: p < P1 (default 2^62)
-w --worksize=w initial primes per chunk of work (default 16000)
-W --workers=W start W workers (default 0)
-g --gpuworkgroups=g work groups per call to GPU (default 8)
-G --gpuworkers=G start G GPU workers (default 0)
-D --platform=D Use platform D instead of 0
-d --device=d Use device d instead of 0
-H --showgpudetail Show device and kernel details
List of available platforms and devices
Platform 0 is a Intel(R) Corporation Intel(R) OpenCL HD Graphics, version OpenCL 2.1
Device 0 is a Intel(R) Corporation Intel(R) UHD Graphics 630
Platform 1 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 3.0 CUDA 11.6.127
Device 0 is a NVIDIA Corporation NVIDIA GeForce RTX 2060[/CODE]
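For anyone who wants to pick the -D/-d values from that listing without reading it by eye, here is a short sketch that scans the platform and device lines. It assumes the listing always has the "Platform N is a ..." / "Device N is a ..." shape shown above; the function name is made up for illustration.

```python
import re

listing = """\
Platform 0 is a Intel(R) Corporation Intel(R) OpenCL HD Graphics, version OpenCL 2.1
Device 0 is a Intel(R) Corporation Intel(R) UHD Graphics 630
Platform 1 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 3.0 CUDA 11.6.127
Device 0 is a NVIDIA Corporation NVIDIA GeForce RTX 2060
"""

def find_device(text, keyword):
    """Return (platform, device) indices of the first device whose name contains keyword."""
    platform = None
    for line in text.splitlines():
        m = re.match(r"\s*(Platform|Device) (\d+) is an? (.*)", line)
        if not m:
            continue
        kind, index, name = m.group(1), int(m.group(2)), m.group(3)
        if kind == "Platform":
            platform = index  # remember which platform the following devices belong to
        elif keyword.lower() in name.lower():
            return platform, index
    return None

platform, device = find_device(listing, "RTX 2060")
print(f"-D {platform} -d {device}")  # prints "-D 1 -d 0"
```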

 rebirther 2022-09-08 07:14

It would be great for srsieve2cl to define a max VRAM gap to be used for a calculation. Last time I tried it on an 8GB card to find the optimum, the driver crashed several times while I was over the limit; playing with the workers, I arrived at a 7GB limit. It would be a great help if the program could define the rest of the max workers by itself.

 rogue 2022-09-08 12:42

[QUOTE=rebirther;612944]It would be great for srsieve2cl to define a max VRAM gap to be used for a calculation. Last time I tried it on an 8GB card to find the optimum, the driver crashed several times while I was over the limit; playing with the workers, I arrived at a 7GB limit. It would be a great help if the program could define the rest of the max workers by itself.[/QUOTE]

I do not know what you mean by "max VRAM gap" or "rest of the max workers". The only options you have at your disposal are -g, -G, and -K.

As for how much GPU memory it uses when executing a kernel, that is a good question. There are many values I can pull from the driver regarding memory utilization and some that I can compute on the fly. You can see these if you specify the -H switch. I haven't been able to determine a good way to know that a kernel will fail due to requiring too much memory.

 ryanp 2022-09-08 15:42

Is it possible there is some bottleneck for how fast things are fed to the GPU, that is the same for CPU workers? I don't know how else to explain an A100 being the same speed as a regular 64 core machine.

(Despite the fact that it is a VM, when running mfaktc I generally see similar numbers to: [url]https://www.mersenne.ca/mfaktc.php[/url]).

 rogue 2022-09-08 17:55

[QUOTE=ryanp;612962]Is it possible there is some bottleneck for how fast things are fed to the GPU, that is the same for CPU workers? I don't know how else to explain an A100 being the same speed as a regular 64 core machine.

(Despite the fact that it is a VM, when running mfaktc I generally see similar numbers to: [url]https://www.mersenne.ca/mfaktc.php[/url]).[/QUOTE]

Moving to PM for the time being.

All times are UTC.