2022-08-29, 21:20  #705
Jun 2012
Boulder, CO
773_{8} Posts 
The update is that I have found combinations of -G, -g and -M that are functional: srsieve2cl runs, and I see 100% CPU utilization in nvidia-smi, but it is no faster (on an A100) than a 64-core CPU worker on another machine. I have no idea how to tune these, what's optimal, or whether there's some underlying bottleneck somewhere else.
Example: Code:
$ ./srsieve2cl -P 1e15 -G 16 -g 512 -M 1000000 -n 20e6 -N 25e6 -o ferm9_20M_25M_sv1e15.txt -s 9*2^n+1
srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic for p >= 3
Creating CPU worker to use until p >= 1000000
GPU primes per worker is 14155776
Sieve started: 3 < p < 1e15 with 5000001 terms (20000000 < n < 25000000, k*2^n+1) (expecting 4840960 factors)
Sieving with single sequence c=1 logic for p >= 257
BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720
Split 1 base 2 sequence into 104 base 2^360 sequences.
Legendre summary: Approximately 2 B needed for Legendre tables
   1 total sequences
   1 are eligible for Legendre tables
   0 are not eligible for Legendre tables
   1 have Legendre tables in memory
   0 cannot have Legendre tables in memory
   0 have Legendre tables loaded from files
   1 required building of the Legendre tables
518400 bytes used for congruent q and ladder indices
259200 bytes used for congruent qs and ladders
Creating CPU worker to use until p >= 1000000
p=540837151, 6.587M p/sec, 4668811 factors found at 352.2 f/sec (last 1 min),
p=9303408733, 7.126M p/sec, 4680592 factors found at 144.5 f/sec (last 2 min),
p=21575254847, 7.304M p/sec, 4686586 factors found at 89.74 f/sec (last 3 min), 0.0% done. ETC 2022-12-06 01:21
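When trying different -G, -g and -M combinations, it may help to pull the p/sec figures out of the progress lines so that runs can be compared side by side. The following is a minimal sketch, not part of srsieve2cl: the names `parse_progress`, `PROGRESS` and `SCALE` are made up for illustration, and the K and G rate suffixes are an assumption (only M appears for p/sec in the output above).

```python
import re

# Matches progress reports such as
# "p=540837151, 6.587M p/sec, 4668811 factors found at 352.2 f/sec (last 1 min)".
PROGRESS = re.compile(r"p=(\d+), ([\d.]+)([KMG]?) p/sec, (\d+) factors found")

# Assumption: rates may carry a K, M or G suffix; only M is seen in the log above.
SCALE = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9}


def parse_progress(text):
    """Return (p, primes_per_second, factors_found) for each progress report."""
    return [
        (int(p), float(rate) * SCALE[unit], int(factors))
        for p, rate, unit, factors in PROGRESS.findall(text)
    ]


# Progress reports copied from the run above.
sample = (
    "p=540837151, 6.587M p/sec, 4668811 factors found at 352.2 f/sec (last 1 min), "
    "p=9303408733, 7.126M p/sec, 4680592 factors found at 144.5 f/sec (last 2 min), "
    "p=21575254847, 7.304M p/sec, 4686586 factors found at 89.74 f/sec (last 3 min)"
)
reports = parse_progress(sample)
for p, rate, factors in reports:
    print(f"p={p:>14,d}  {rate / 1e6:6.3f}M p/sec  {factors} factors")
```

Running the same parser over the log of each parameter combination gives a table of rates that can be compared directly, ideally at the same elapsed time since the rate still climbs during the first minutes.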
2022-08-29, 21:30  #706
Jun 2012
Boulder, CO
3·13^{2} Posts 
Quote:


2022-08-30, 01:55  #707
"Mark"
Apr 2003
Between here and the
2^{6}·3^{2}·13 Posts 
Quote:
Are you comparing srsieve2 with -W32/-W64 against srsieve2cl (with no -W)? I'm trying to understand what you are comparing when you say it is no faster. I don't have access to such a GPU. It is very possible that you are correct, but I can't attest to that one way or another. It is using an OpenCL version of the same algorithm used by sr1sieve and srsieve2, although it has no assembly. I wonder if there could be limitations due to it running on a VM. Maybe the VM configuration is limiting how much of the GPU it can use. I haven't run on a VM, so that is just a guess.
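For readers unfamiliar with the algorithm being referred to: the srsieve family is generally described as using a baby-step giant-step discrete logarithm to find, for each prime p, the exponents n for which p divides k*b^n+c. The sketch below is a plain, unoptimized illustration of that idea and is not the srsieve2cl kernel; the function name `find_factored_terms` is hypothetical, p is assumed to be prime, and the degenerate cases where p divides k or b are simply skipped.

```python
from math import isqrt


def find_factored_terms(k, b, c, p, n_min, n_max):
    """Return every n in [n_min, n_max] for which the prime p divides k*b^n + c."""
    if k % p == 0 or b % p == 0:
        return []  # degenerate cases are skipped in this sketch

    # p | k*b^n + c  <=>  b^n == -c/k (mod p)
    target = (-c * pow(k, -1, p)) % p

    span = n_max - n_min + 1
    m = isqrt(span - 1) + 1  # smallest m used here with m*m >= span

    # Baby steps: remember which j in [0, m) give each value of b^j mod p.
    baby = {}
    value = 1
    for j in range(m):
        baby.setdefault(value, []).append(j)
        value = value * b % p
    giant = pow(value, -1, p)  # value is now b^m, so this is b^(-m) mod p

    # Giant steps: with n = n_min + i*m + j we need
    # b^j == target * b^(-n_min) * b^(-i*m) (mod p).
    gamma = target * pow(pow(b, n_min, p), -1, p) % p
    hits = []
    for i in range(m):
        for j in baby.get(gamma, ()):
            n = n_min + i * m + j
            if n <= n_max:
                hits.append(n)
        gamma = gamma * giant % p
    return hits


# Small example in the shape of the 9*2^n+1 sieve discussed above.
print(find_factored_terms(9, 2, 1, 11, 1, 40))
```

The cost per prime is on the order of the square root of the n range rather than the range itself, which is why the work per p stays manageable even for ranges of millions of exponents.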

2022-09-05, 18:15  #708
Dec 2011
After 1.58M nines:)
6F5_{16} Posts 
Quote:
We can arrange it: I am willing to give you access to my Windows box with an RTX 2060 so you can see for yourself. As far as I can tell, srsieve2cl does not work at all. I would really like the program to work, since I have a few GPUs and would then be able to make nice sieve files by myself, but reality is different :( The mystery is even bigger since you always say that on your machine srsieve2cl is faster than srsieve2, and nobody else can reproduce that. Just contact me via PM.
Best regards
Last fiddled with by pepi37 on 2022-09-05 at 18:20

2022-09-06, 19:26  #709
Dec 2011
After 1.58M nines:)
13×137 Posts 
RESULTS are here
Code:
srsieve2cl.exe -P1e6 -n 4 -N 10000000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Sieve completed at p=1000003.
CPU time: 8.38 sec. (0.00 sieving) (0.93 cores) GPU time: 0.00 sec.
305747 terms written to b53_n.abcd
Primes tested: 78290. Factors found: 7194250. Remaining terms: 305747. Time: 9.04 seconds.

srsieve2cl.exe -P1e7 -n 4 -N 10000000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Fatal Error: Could not handle all GPU factors. A range of p generated 36883 factors (limited to 7992). Use -M to increase max factor density

srsieve2cl.exe -P1e7 -n 4 -N 10000000 -M 1000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
Sieve completed at p=10059761.
CPU time: 10.50 sec. (0.00 sieving) (0.87 cores) GPU time: 1.78 sec.
261476 terms written to b53_n.abcd
Primes tested: 589824. Factors found: 7238521. Remaining terms: 261476. Time: 12.10 seconds.

srsieve2cl.exe -P 1e11 -D 1 -G 2 -n 4 -N 10000000 -M 4000 -s "4*53^n+1"
Creating CPU worker to use until p >= 1000000
p=1789474679, 1.449M p/sec, 7302068 factors found at 1.073K f/sec (last 1 min), 1.8% done. ETC 2022-09-06 22:08
p=4033235879, 1.570M p/sec, 7309308 factors found at 507.2 f/sec (last 2 min), 4.0% done. ETC 2022-09-06 22:02
p=6335693783, 1.609M p/sec, 7313108 factors found at 335.6 f/sec (last 3 min), 6.3% done. ETC 2022-09-06 21:59
p=8678189381, 1.629M p/sec, 7315702 factors found at 252.5 f/sec (last 4 min), 8.7% done. ETC 2022-09-06 21:58

srsieve2cl.exe -P 1e11 -W6 -n 4 -N 10000000 -M 4000 -s "4*53^n+1"
Increasing worksize to 64000 since each chunk is tested in less than a second
Increasing worksize to 1024000 since each chunk is tested in less than a second
p=4336204969, 3.426M p/sec, 7310008 factors found at 1.240K f/sec (last 1 min), 4.3% done. ETC 2022-09-06 21:40
p=9693240527, 3.656M p/sec, 7316685 factors found at 570.1 f/sec (last 2 min), 9.7% done. ETC 2022-09-06 21:38
p=15208974959, 3.737M p/sec, 7320231 factors found at 369.4 f/sec (last 3 min), 15.2% done. ETC 2022-09-06 21:37
Last fiddled with by pepi37 on 2022-09-06 at 19:42
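To put a number on the gap between the two 1e11 runs, the reported p/sec rates can be compared minute by minute. This is only arithmetic on the figures posted above (the -D 1 -G 2 run against the -W6 run); the variable names are illustrative, and the comparison assumes the two runs are otherwise equivalent.

```python
# p/sec after 1, 2 and 3 minutes, copied from the two 1e11 runs above.
gpu_rates = [1.449e6, 1.570e6, 1.609e6]  # -D 1 -G 2
cpu_rates = [3.426e6, 3.656e6, 3.737e6]  # -W6

# How many times faster the six CPU workers were at each minute mark.
ratios = [cpu / gpu for cpu, gpu in zip(cpu_rates, gpu_rates)]

for minute, ratio in enumerate(ratios, start=1):
    print(f"minute {minute}: CPU workers are {ratio:.2f}x the GPU rate")
```

On these numbers the six CPU workers test primes a little more than twice as fast as the two GPU workers throughout the first three minutes.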
2022-09-06, 21:26  #710
"Mark"
Apr 2003
Between here and the
2^{6}·3^{2}·13 Posts 
Quote:
If you use -h, it will tell you which platform and device are used by default. I suspect -D1 is using the Intel integrated GPU, which isn't going to give you much of anything.

2022-09-06, 21:32  #711
Dec 2011
After 1.58M nines:)
1781_{10} Posts 
Quote:
I checked twice:
g:\SRSIEVE2>srsieve2cl.exe -h
srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
-h --help             prints this help
-p --pmin=P0          sieve start: P0 < p (default 3)
-P --pmax=P1          sieve end: p < P1 (default 2^62)
-w --worksize=w       initial primes per chunk of work (default 16000)
-W --workers=W        start W workers (default 0)
-g --gpuworkgroups=g  work groups per call to GPU (default 8)
-G --gpuworkers=G     start G GPU workers (default 0)
-D --platform=D       Use platform D instead of 0
-d --device=d         Use device d instead of 0
-H --showgpudetail    Show device and kernel details
List of available platforms and devices
Platform 0 is a Intel(R) Corporation Intel(R) OpenCL HD Graphics, version OpenCL 2.1
  Device 0 is a Intel(R) Corporation Intel(R) UHD Graphics 630
Platform 1 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 3.0 CUDA 11.6.127
  Device 0 is a NVIDIA Corporation NVIDIA GeForce RTX 2060
Last fiddled with by pepi37 on 2022-09-06 at 21:40
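Since the platform numbering depends on the installed drivers, one way to avoid guessing is to read the numbers straight from the listing. The sketch below scans text in the "Platform N is a ..." / "Device N is a ..." shape shown above and returns the first device whose description contains a keyword; `find_device` and `LISTING` are illustrative names, and the line format is assumed to match this post exactly.

```python
import re

# Platform/device listing as printed in the help output above.
LISTING = """\
Platform 0 is a Intel(R) Corporation Intel(R) OpenCL HD Graphics, version OpenCL 2.1
  Device 0 is a Intel(R) Corporation Intel(R) UHD Graphics 630
Platform 1 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 3.0 CUDA 11.6.127
  Device 0 is a NVIDIA Corporation NVIDIA GeForce RTX 2060
"""


def find_device(listing, keyword):
    """Return (platform, device) of the first device whose description mentions keyword."""
    platform = None
    for line in listing.splitlines():
        match = re.match(r"\s*Platform (\d+) is a ", line)
        if match:
            platform = int(match.group(1))  # remember the enclosing platform
            continue
        match = re.match(r"\s*Device (\d+) is a (.*)", line)
        if match and platform is not None and keyword.lower() in match.group(2).lower():
            return platform, int(match.group(1))
    return None


platform, device = find_device(LISTING, "NVIDIA")
print(f"-D {platform} -d {device}")
```

For this listing the RTX 2060 sits on platform 1, device 0, so -D 1 does select the NVIDIA card here rather than the Intel integrated GPU.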

2022-09-08, 07:14  #712
Sep 2011
Germany
2×1,877 Posts 
It would be great if srsieve2cl let you define a maximum amount of VRAM to be used for a calculation. The last time I tried it on an 8 GB card to find the optimum, the driver crashed several times when I went over the limit; by playing with the number of workers I arrived at a limit of about 7 GB. It would be a great help if the program could determine the maximum number of workers by itself.

2022-09-08, 12:42  #713
"Mark"
Apr 2003
Between here and the
7488_{10} Posts 
Quote:
As for how much GPU memory it uses when executing a kernel, that is a good question. There are many values I can pull from the driver regarding memory utilization and some that I can compute on the fly. You can see these if you specify the -H switch. I haven't been able to determine a good way to know that a kernel will fail due to requiring too much memory.
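Until the program can predict this itself, one workaround on NVIDIA cards is to watch memory use from outside while raising the worker count step by step. The sketch below relies on `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`, which prints one "used, total" line in MiB per GPU. The helper names are hypothetical, nothing here is part of srsieve2cl, and the 7168 MiB cap simply mirrors the roughly 7 GB limit reported above for an 8 GB card.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' lines (MiB) into a list of (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def over_budget(csv_text, cap_mib):
    """For each GPU, report whether used memory exceeds the chosen cap."""
    return [used > cap_mib for used, _total in parse_memory(csv_text)]


def read_gpu_memory():
    """Run nvidia-smi and return (used, total) per GPU; requires an NVIDIA driver."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)


# Example with made-up readings for an 8 GB card and a 7 GB (7168 MiB) cap.
print(over_budget("6900, 8192\n", cap_mib=7168))
print(over_budget("7300, 8192\n", cap_mib=7168))
```

Polling `read_gpu_memory()` while a run is in progress shows how close a given -G and -g setting comes to the cap, so the worker count can be backed off before the driver gives up.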

2022-09-08, 15:42  #714
Jun 2012
Boulder, CO
3·13^{2} Posts 
Is it possible there is some bottleneck in how fast things are fed to the GPU, one that applies equally to CPU workers? I don't know how else to explain an A100 being the same speed as a regular 64-core machine.
(Despite the fact that it is a VM, when running mfaktc I generally see similar numbers to: https://www.mersenne.ca/mfaktc.php). 
2022-09-08, 17:55  #715
"Mark"
Apr 2003
Between here and the
1110101000000_{2} Posts 
Quote:

