#705 |
Jun 2012
Boulder, CO
773₈ Posts |
The update is that I have found combinations of -G, -g and -M that are functional: srsieve2cl runs, and I see 100% GPU utilization in nvidia-smi, but it is no faster (on an A100) than a 64-core CPU worker on another machine. I have no idea how to tune these flags, what's optimal, or whether there's some underlying bottleneck somewhere else.
Example: Code:
$ ./srsieve2cl -P 1e15 -G 16 -g 512 -M 1000000 -n 20e6 -N 25e6 -o ferm9_20M_25M_sv1e15.txt -s 9*2^n+1
srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic for p >= 3
Creating CPU worker to use until p >= 1000000
GPU primes per worker is 14155776
Sieve started: 3 < p < 1e15 with 5000001 terms (20000000 < n < 25000000, k*2^n+1) (expecting 4840960 factors)
Sieving with single sequence c=1 logic for p >= 257
BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720
Split 1 base 2 sequence into 104 base 2^360 sequences.
Legendre summary: Approximately 2 B needed for Legendre tables
1 total sequences
1 are eligible for Legendre tables
0 are not eligible for Legendre tables
1 have Legendre tables in memory
0 cannot have Legendre tables in memory
0 have Legendre tables loaded from files
1 required building of the Legendre tables
518400 bytes used for congruent q and ladder indices
259200 bytes used for congruent qs and ladders
Creating CPU worker to use until p >= 1000000
p=540837151, 6.587M p/sec, 4668811 factors found at 352.2 f/sec (last 1 min),
p=9303408733, 7.126M p/sec, 4680592 factors found at 144.5 f/sec (last 2 min),
p=21575254847, 7.304M p/sec, 4686586 factors found at 89.74 f/sec (last 3 min), 0.0% done. ETC 2022-12-06 01:21
#706 | |
Jun 2012
Boulder, CO
3·13² Posts |
Quote:
#707 | |
"Mark"
Apr 2003
Between here and the
2⁶·3²·13 Posts |
Quote:
Are you comparing srsieve2 with -W32/-W64 against srsieve2cl (with no -W)? I'm trying to understand what you are comparing when you say it is no faster. I don't have access to such a GPU. It is very possible that you are correct, but I can't attest to that one way or another. srsieve2cl uses an OpenCL version of the same algorithm used by sr1sieve and srsieve2, although it has no assembly. I wonder if there could be limitations due to it running on a VM. Maybe the VM configuration is limiting how much of the GPU it can use. I haven't run on a VM, so that is just a guess.
#708 | |
Dec 2011
After 1.58M nines:)
6F5₁₆ Posts |
Quote:
We can arrange it: I am willing to give you access to my Windows box with an RTX 2060 so you can see for yourself. As far as I can tell, srsieve2cl does not work at all. I would like the program to work, since I have a few GPUs and would then be able to make nice sieve files by myself, but reality is different :( The mystery is even bigger since you always say that on your machine srsieve2cl is faster than srsieve2, and nobody else can reproduce that. Just contact me via PM.
Best regards
Last fiddled with by pepi37 on 2022-09-05 at 18:20
#709 |
Dec 2011
After 1.58M nines:)
13×137 Posts |
RESULTS are here
Code:
srsieve2cl.exe -P1e6 -n 4 -N 10000000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Sieve completed at p=1000003.
CPU time: 8.38 sec. (0.00 sieving) (0.93 cores) GPU time: 0.00 sec.
305747 terms written to b53_n.abcd
Primes tested: 78290. Factors found: 7194250. Remaining terms: 305747. Time: 9.04 seconds.

srsieve2cl.exe -P1e7 -n 4 -N 10000000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Fatal Error: Could not handle all GPU factors. A range of p generated 36883 factors (limited to 7992). Use -M to increase max factor density

srsieve2cl.exe -P1e7 -n 4 -N 10000000 -M 1000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Sieve completed at p=10059761.
CPU time: 10.50 sec. (0.00 sieving) (0.87 cores) GPU time: 1.78 sec.
261476 terms written to b53_n.abcd
Primes tested: 589824. Factors found: 7238521. Remaining terms: 261476. Time: 12.10 seconds.

srsieve2cl.exe -P 1e11 -D 1 -G 2 -n 4 -N 10000000 -M 4000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
p=1789474679, 1.449M p/sec, 7302068 factors found at 1.073K f/sec (last 1 min), 1.8% done. ETC 2022-09-06 22:08
p=4033235879, 1.570M p/sec, 7309308 factors found at 507.2 f/sec (last 2 min), 4.0% done. ETC 2022-09-06 22:02
p=6335693783, 1.609M p/sec, 7313108 factors found at 335.6 f/sec (last 3 min), 6.3% done. ETC 2022-09-06 21:59
p=8678189381, 1.629M p/sec, 7315702 factors found at 252.5 f/sec (last 4 min), 8.7% done. ETC 2022-09-06 21:58

srsieve2cl.exe -P 1e11 -W6 -n 4 -N 10000000 -M 4000 -s "4*53^n+1"
Increasing worksize to 64000 since each chunk is tested in less than a second
Increasing worksize to 1024000 since each chunk is tested in less than a second
p=4336204969, 3.426M p/sec, 7310008 factors found at 1.240K f/sec (last 1 min), 4.3% done. ETC 2022-09-06 21:40
p=9693240527, 3.656M p/sec, 7316685 factors found at 570.1 f/sec (last 2 min), 9.7% done. ETC 2022-09-06 21:38
p=15208974959, 3.737M p/sec, 7320231 factors found at 369.4 f/sec (last 3 min), 15.2% done. ETC 2022-09-06 21:37

Last fiddled with by pepi37 on 2022-09-06 at 19:42
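To put a number on the GPU versus CPU comparison in the last two runs, the status lines can be parsed and their p/sec figures compared. The following is only a rough Python sketch: the helper name `primes_per_sec` is made up for illustration, the regular expression is inferred from the status lines shown in this post, and the `K` and `G` suffixes are an assumption (only `M` appears for p/sec here).

```python
import re

# Pattern inferred from the status lines in the post, e.g.
#   p=8678189381, 1.629M p/sec, 7315702 factors found at ...
STATUS_RE = re.compile(r"p=(\d+), ([\d.]+)([KMG]?) p/sec")

# Only the "M" suffix is seen for p/sec in the post; "K" and "G" are assumed.
SCALE = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9}


def primes_per_sec(line):
    """Return the reported rate in primes per second, or None for non-status lines."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return float(m.group(2)) * SCALE[m.group(3)]


# Last status line of the "-D 1 -G 2" run and of the "-W6" run, copied from the post.
gpu_line = ("p=8678189381, 1.629M p/sec, 7315702 factors found at 252.5 f/sec "
            "(last 4 min), 8.7% done. ETC 2022-09-06 21:58")
cpu_line = ("p=15208974959, 3.737M p/sec, 7320231 factors found at 369.4 f/sec "
            "(last 3 min), 15.2% done. ETC 2022-09-06 21:37")

gpu_rate = primes_per_sec(gpu_line)
cpu_rate = primes_per_sec(cpu_line)
ratio = cpu_rate / gpu_rate

print(f"GPU run: {gpu_rate:.0f} p/sec")
print(f"CPU run: {cpu_rate:.0f} p/sec")
print(f"CPU/GPU ratio: {ratio:.2f}")
```

The two rates are taken after different numbers of minutes (4 and 3), so the ratio is only a rough indication, but it shows the six CPU workers running at a bit over twice the rate of the two GPU workers in this test.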
#710 | |
"Mark"
Apr 2003
Between here and the
2⁶·3²·13 Posts |
Quote:
If you use -h, it will tell you which platform and device are used by default. I suspect -D1 is using the Intel integrated GPU, which isn't going to provide you much of anything.
#711 | |
Dec 2011
After 1.58M nines:)
1781₁₀ Posts |
Quote:
I checked twice
g:\SRSIEVE2>srsieve2cl.exe -h
srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
-h --help prints this help
-p --pmin=P0 sieve start: P0 < p (default 3)
-P --pmax=P1 sieve end: p < P1 (default 2^62)
-w --worksize=w initial primes per chunk of work (default 16000)
-W --workers=W start W workers (default 0)
-g --gpuworkgroups=g work groups per call to GPU (default 8)
-G --gpuworkers=G start G GPU workers (default 0)
-D --platform=D Use platform D instead of 0
-d --device=d Use device d instead of 0
-H --showgpudetail Show device and kernel details
List of available platforms and devices
Platform 0 is a Intel(R) Corporation Intel(R) OpenCL HD Graphics, version OpenCL 2.1
Device 0 is a Intel(R) Corporation Intel(R) UHD Graphics 630
Platform 1 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 3.0 CUDA 11.6.127
Device 0 is a NVIDIA Corporation NVIDIA GeForce RTX 2060
Last fiddled with by pepi37 on 2022-09-06 at 21:40
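For anyone with several OpenCL platforms, the -D/-d values can be read off that listing mechanically. Below is a small Python sketch that scans the "Platform N is a ..." and "Device N is a ..." lines exactly as they appear in the -h output above and returns the indices of the first device matching a keyword. The helper `find_device` is hypothetical (it is not part of srsieve2cl), and the line format is assumed to stay as shown in this post.

```python
import re

# Platform/device listing copied from the -h output in the post.
LISTING = """\
Platform 0 is a Intel(R) Corporation Intel(R) OpenCL HD Graphics, version OpenCL 2.1
Device 0 is a Intel(R) Corporation Intel(R) UHD Graphics 630
Platform 1 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 3.0 CUDA 11.6.127
Device 0 is a NVIDIA Corporation NVIDIA GeForce RTX 2060
"""


def find_device(listing, keyword):
    """Return (platform, device) for the first device whose description
    contains keyword (case-insensitive), or None if nothing matches."""
    platform = None
    for line in listing.splitlines():
        m = re.match(r"\s*Platform (\d+) is an? (.*)", line)
        if m:
            # Remember which platform the following Device lines belong to.
            platform = int(m.group(1))
            continue
        m = re.match(r"\s*Device (\d+) is an? (.*)", line)
        if m and platform is not None and keyword.lower() in m.group(2).lower():
            return platform, int(m.group(1))
    return None


platform, device = find_device(LISTING, "NVIDIA")
print(f"-D {platform} -d {device}")  # prints "-D 1 -d 0" for the listing above
```

On this machine the listing puts the Intel UHD Graphics 630 on platform 0 and the RTX 2060 on platform 1, device 0.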
#712 |
Sep 2011
Germany
2×1,877 Posts |
It would be great if srsieve2cl let you define a maximum amount of VRAM to be used for a calculation. The last time I tried it on an 8 GB card to find the optimum, the driver crashed several times when I went over the limit; by playing with the number of workers I arrived at a limit of about 7 GB. It would be a great help if the program could work out the maximum number of workers by itself.
#713 | |
"Mark"
Apr 2003
Between here and the
7488₁₀ Posts |
Quote:
As for how much GPU memory it uses when executing a kernel, that is a good question. There are many values I can pull from the driver regarding memory utilization and some that I can compute on the fly. You can see these if you specify the -H switch. I haven't been able to determine a good way to know that a kernel will fail due to requiring too much memory. |
#714 |
Jun 2012
Boulder, CO
3·13² Posts |
Is it possible that there is some bottleneck in how fast work is fed to the GPU, one that is the same for CPU workers? I don't know how else to explain an A100 being the same speed as a regular 64-core machine.
(Despite the fact that it is a VM, when running mfaktc I generally see similar numbers to: https://www.mersenne.ca/mfaktc.php). |
#715 | |
"Mark"
Apr 2003
Between here and the
1110101000000₂ Posts |
Quote: