Another crash. This is with the latest SVN code.
[CODE]$ ./srsieve2 W 36 P 1e12 n 3e6 N 10e6 o ferm81_3M_10M.txt s "81*2^n+1" srsieve2 v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n (b2) Removed 1750000 algebraic factors for 81*2^n+1 of the form (3^2)*2^(n/2)3*2^((n+2)/4))+1 when n%4=2 Sieving with generic logic for p >= 3 Sieve started: 3 < p < 1e12 with 5250001 terms (3000000 < n < 10000000, k*2^n+1) (expecting 5041260 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 2 sequence into 384 base 2^720 sequences. Legendre summary: Approximately 2 B needed for Legendre tables 1 total sequences 1 are eligible for Legendre tables 0 are not eligible for Legendre tables 1 have Legendre tables in memory 0 cannot have Legendre tables in memory 0 have Legendre tables loaded from files 1 required building of the Legendre tables 518400 bytes used for congruent q and ladder indices 295200 bytes used for congruent qs and ladders Segmentation fault[/CODE] Seems to segfault no matter what value of [C]P[/C] I try. 
[QUOTE=ryanp;611708]Another crash. This is with the latest SVN code.
[CODE]$ ./srsieve2 W 36 P 1e12 n 3e6 N 10e6 o ferm81_3M_10M.txt s "81*2^n+1" srsieve2 v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n (b2) Removed 1750000 algebraic factors for 81*2^n+1 of the form (3^2)*2^(n/2)3*2^((n+2)/4))+1 when n%4=2 Sieving with generic logic for p >= 3 Sieve started: 3 < p < 1e12 with 5250001 terms (3000000 < n < 10000000, k*2^n+1) (expecting 5041260 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 2 sequence into 384 base 2^720 sequences. Legendre summary: Approximately 2 B needed for Legendre tables 1 total sequences 1 are eligible for Legendre tables 0 are not eligible for Legendre tables 1 have Legendre tables in memory 0 cannot have Legendre tables in memory 0 have Legendre tables loaded from files 1 required building of the Legendre tables 518400 bytes used for congruent q and ladder indices 295200 bytes used for congruent qs and ladders Segmentation fault[/CODE] Seems to segfault no matter what value of [C]P[/C] I try.[/QUOTE] Okay. I will look into it. Does this happen with lower values for W? I suspect that switch is behind this. 
[QUOTE=rogue;611710]Okay. I will look into it. Does this happen with lower values for W? I suspect that switch is behind this.[/QUOTE]
Yeah, with [C]W 4[/C] it seems to be working. 
I5 9600K 24 MB RAM win 10
With W 6 option [CODE]e:\MTSIEVE\MTSIEVE233>srsieve2 W 6 P 1e12 n 3e6 N 10e6 o ferm81_3M_10M.txt s "81*2^n+1" srsieve2 v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n (b2) Removed 1750000 algebraic factors for 81*2^n+1 of the form (3^2)*2^(n/2)3*2^((n+2)/4))+1 when n%4=2 Sieving with generic logic for p >= 3 Sieve started: 3 < p < 1e12 with 5250001 terms (3000000 < n < 10000000, k*2^n+1) (expecting 5041260 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 2 sequence into 384 base 2^720 sequences. Legendre summary: Approximately 2 B needed for Legendre tables 1 total sequences 1 are eligible for Legendre tables 0 are not eligible for Legendre tables 1 have Legendre tables in memory 0 cannot have Legendre tables in memory 0 have Legendre tables loaded from files 1 required building of the Legendre tables 518400 bytes used for congruent q and ladder indices 295200 bytes used for congruent qs and ladders Increasing worksize to 256000 since each chunk is tested in less than a second p=1589282789, 1.319M p/sec, 4524325 factors found at 4.361K f/sec (last 1 min), 0.2% done. ETC 20220819 10:12 CTRLC accepted. Threads will stop after sieving to 2020583611 Sieve interrupted at p=2020583611. CPU time: 416.06 sec. (0.47 sieving) (5.59 cores) 717815 terms written to ferm81_3M_10M.txt Primes tested: 99184016. Factors found: 4532186. Remaining terms: 717815. Time: 74.44 seconds.[/CODE] Linux Ryzen 3900x w 6 [CODE]srsieve2 v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n (b2) Removed 1750000 algebraic factors for 81*2^n+1 of the form (3^2)*2^(n/2)3*2^((n+2)/4))+1 when n%4=2 Sieving with generic logic for p >= 3 Sieve started: 3 < p < 1e12 with 5250001 terms (3000000 < n < 10000000, k*2^n+1) (expecting 5041260 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 2 sequence into 384 base 2^720 sequences. Legendre summary: Approximately 2 B needed for Legendre tables 1 total sequences 1 are eligible for Legendre tables 0 are not eligible for Legendre tables 1 have Legendre tables in memory 0 cannot have Legendre tables in memory 0 have Legendre tables loaded from files 1 required building of the Legendre tables 518400 bytes used for congruent q and ladder indices 295200 bytes used for congruent qs and ladders Decreasing worksize to 8000 since each chunk needs more than 5 seconds to test Increasing worksize to 128000 since each chunk is tested in less than a second corrupted doublelinked list (not small) ./a: line 1: 9550 Aborted ./srsieve2 W 6 P 1e12 n 3e6 N 10e6 o ferm81_3M_10M.txt s "81*2^n+1"[/CODE] [I][B]W 4 works ok![/B][/I] 
There are no double linked lists in the code so that is a mystery to me

That's an error message from GLIBC itself, it uses doublelinked lists internally.

[QUOTE=kruoli;611742]That's an error message from GLIBC itself, it uses doublelinked lists internally.[/QUOTE]
Why? A sparse matrix? Sincere question. 
Why they use it, I cannot say for sure (mostly because I do not enough about the interna). But I can [URL="http://github.com/bminor/glibc/blob/master/malloc/malloc.c#L1599"]link[/URL] you to the GLIBC code where the [URL="http://github.com/bminor/glibc/blob/master/malloc/malloc.c#L1617"]error message[/URL] comes from.

I do not know when I will get around to resolving the issue with srsieve2. I have a more pressing issue with srsieve2cl that is perplexing me. It was introduced when I started making the changes for Metal support. For some reason CL_KERNEL_PRIVATE_MEM_SIZE has increased dramatically even though the kernel itself has barely changed. Due to this increase the generic kernel won't run at all when one has a thousands of sequences. Once that is resolved, I will look into the other issue with srsieve2.

srsieve2 and srsieve2cl code has been updated. I have not posted new builds yet.
The segfault with srsieve2 should be fixed. The issues with srsieve2cl should also be fixed. My test with srsieve2cl is sieving over 170,000 sequences at a time on the GPU. It is not super fast, but much faster than srsieve2. 
[QUOTE=rogue;611950]srsieve2 and srsieve2cl code has been updated. I have not posted new builds yet.
The segfault with srsieve2 should be fixed. The issues with srsieve2cl should also be fixed. My test with srsieve2cl is sieving over 170,000 sequences at a time on the GPU. It is not super fast, but much faster than srsieve2.[/QUOTE] If you can please post command line as reference for further tuning on GPU Thanks 
[QUOTE=pepi37;611951]If you can please post command line as reference for further tuning on GPU
Thanks[/QUOTE] On a GPU you will want to tune mainly with K, g, and M. If you have too many sequences to sieve, use K to break them up. So if you have 6000 sequences and use K2, it will make two GPU calls with 3000 sequences each. If so many factors are found that it fills the buffer, use M to adjust the buffer size. You will want to play around with g to determine if changing it from the default value improves performance. Use srsieve2cl h to get the defaults for g and M. If starting to sieve a set of sequences, I suggest sieving to 1e6 (which will be all CPU), then starting from the .abcd file so that you don't lose the progress of sieving to 1e6. I had to use M1000 for what was testing. You might not need to change it at all. H will provide details around memory utilization. If srsieve2cl won't run in the GPU, you will likely need to adjust K higher or g smaller. I suggest adjusting K first. On various GPUs CL_KERNEL_PRIVATE_MEM_SIZE seems to top out at 500000. Above that gives an error. 
[QUOTE=rogue;611955]On a GPU you will want to tune mainly with K, g, and M. If you have too many sequences to sieve, use K to break them up. So if you have 6000 sequences and use K2, it will make two GPU calls with 3000 sequences each. If so many factors are found that it fills the buffer, use M to adjust the buffer size. You will want to play around with g to determine if changing it from the default value improves performance. Use srsieve2cl h to get the defaults for g and M.
If starting to sieve a set of sequences, I suggest sieving to 1e6 (which will be all CPU), then starting from the .abcd file so that you don't lose the progress of sieving to 1e6. I had to use M1000 for what was testing. You might not need to change it at all. H will provide details around memory utilization. If srsieve2cl won't run in the GPU, you will likely need to adjust K higher or g smaller. I suggest adjusting K first. On various GPUs CL_KERNEL_PRIVATE_MEM_SIZE seems to top out at 500000. Above that gives an error.[/QUOTE] What is your command line? Or it is top secret one? :) 
[QUOTE=pepi37;611957]What is your command line? Or it is top secret one? :)[/QUOTE]
For this it was: srsieve2cl ir63.abcd K6 M1000 This was a file presieved to 1e6. 
[QUOTE=rogue;609937]Not certain why 1.6.3 would behave differently. Can you reduce g and try again?[/QUOTE]
I don't really understand how to set these parameters. If I explicitly set [C]G 1[/C] (according to "h", the default of [C]G[/C] is 0): [CODE]$ ./srsieve2cl g 64 G 1 P 1e14 o "ferm81_3M_15M.txt" s "81*2^n+1" n 3e6 N 15e6 srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n (b2) Removed 3000000 algebraic factors for 81*2^n+1 of the form (3^2)*2^(n/2)3*2^((n+2)/4))+1 when n%4=2 Sieving with generic logic for p >= 3 GPU primes per worker is 1769472 Sieve started: 3 < p < 1e14 with 9000001 terms (3000000 < n < 15000000, k*2^n+1) (expecting 8693280 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 2 sequence into 384 base 2^720 sequences. Legendre summary: Approximately 2 B needed for Legendre tables 1 total sequences 1 are eligible for Legendre tables 0 are not eligible for Legendre tables 1 have Legendre tables in memory 0 cannot have Legendre tables in memory 0 have Legendre tables loaded from files 1 required building of the Legendre tables 518400 bytes used for congruent q and ladder indices 295200 bytes used for congruent qs and ladders OpenCL Error: Out of resources in call to clEnqueueReadBuffer argument: factorCount[/CODE] If I use [C]W 1[/C] instead, it runs, but it's extremely slow compared to the CPU srsieve2 (even with a high [C]g[/C]), and I see 0% GPU utilization according to [C]nvidiasmi[/C]: [CODE]$ ./srsieve2cl g 5120 W 1 P 1e14 o "ferm81_3M_15M.txt" s "81*2^n+1" n 3e6 N 15e6 srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n (b2) Removed 3000000 algebraic factors for 81*2^n+1 of the form (3^2)*2^(n/2)3*2^((n+2)/4))+1 when n%4=2 Sieving with generic logic for p >= 3 Sieve started: 3 < p < 1e14 with 9000001 terms (3000000 < n < 15000000, k*2^n+1) (expecting 8693280 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 2 sequence into 384 base 2^720 sequences. Legendre summary: Approximately 2 B needed for Legendre tables 1 total sequences 1 are eligible for Legendre tables 0 are not eligible for Legendre tables 1 have Legendre tables in memory 0 cannot have Legendre tables in memory 0 have Legendre tables loaded from files 1 required building of the Legendre tables 518400 bytes used for congruent q and ladder indices 295200 bytes used for congruent qs and ladders Decreasing worksize to 4000 since each chunk needs more than 5 seconds to test Increasing worksize to 16000 since each chunk is tested in less than a second Increasing worksize to 64000 since each chunk is tested in less than a second p=1020457127, 856.8K p/sec, 7729676 factors found at 5.619K f/sec (last 1 min) p=2253211963, 905.7K p/sec, 7776218 factors found at 2.643K f/sec (last 2 min), 0.0% done. ETC 20221026 09:18[/CODE] And if I pass neither [C]W[/C] nor [C]G[/C], it segfaults. [CODE]$ ./srsieve2cl g 5120 P 1e14 o "ferm81_3M_15M.txt" s "81*2^n+1" n 3e6 N 15e6 srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n (b2) Removed 3000000 algebraic factors for 81*2^n+1 of the form (3^2)*2^(n/2)3*2^((n+2)/4))+1 when n%4=2 Sieving with generic logic for p >= 3 Creating CPU worker to use until p >= 1000000 GPU primes per worker is 141557760 Sieve started: 3 < p < 1e14 with 9000001 terms (3000000 < n < 15000000, k*2^n+1) (expecting 8693280 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 2 sequence into 384 base 2^720 sequences. Legendre summary: Approximately 2 B needed for Legendre tables 1 total sequences 1 are eligible for Legendre tables 0 are not eligible for Legendre tables 1 have Legendre tables in memory 0 cannot have Legendre tables in memory 0 have Legendre tables loaded from files 1 required building of the Legendre tables 518400 bytes used for congruent q and ladder indices 295200 bytes used for congruent qs and ladders Creating CPU worker to use until p >= 1000000 Segmentation fault (core dumped)[/CODE] 
Rogue, I dont ask you, I beg you. I read this thread many times, maybe I didnot understund .
It will be very nice if you for mtsieve package [U][I][B]write detailed tutorial.[/B][/I][/U] It is in my opinion insanely continue with the development of the program if few of us (that use mtsieve package) cannot use it as it should. I never get to get some gain in speed using cl version of sieves, always is stay on CPU. Again, and again I very respect you work your mtsieve package, but without detailed tutorial it is very hard to get some nice results. Number of few switches you use sieve programs can he huge,and I can try for many hours and dont find combination I need to "enable" GPU to full speed. Thanks and keep doing this great work. This message is not a criticism in any sense, just asking that you write some kind of tutorial where you will explain correlation between switches. Best regards 
In the latest revision of the code (r201), gcc fails when working on the Sophie Germain sieve with the following:
[CODE]g++ Isieve m64 Wall DUSE_X86 std=c++11 O3 c o sophie_germain/SophieGermainWorker.o sophie_germain/SophieGermainWorker.cpp In file included from sophie_germain/SophieGermainWorker.cpp:12: sophie_germain/../core/MpArithVector.h:20:11: error: ‘size_t’ has not been declared 20  template <size_t N>  ^~~~~~ sophie_germain/../core/MpArithVector.h:24:21: error: ‘N’ was not declared in this scope 24  uint64_t _r[N];  ^ sophie_germain/../core/MpArithVector.h:27:36: error: ‘size_t’ does not name a type 27  uint64_t operator [](const size_t i) const { return _r[i]; }  ^~~~~~ sophie_germain/../core/MpArithVector.h:1:1: note: ‘size_t’ is defined in header ‘<cstddef>’; did you forget to ‘#include <cstddef>’? +++ +#include <cstddef> 1  /* MpArithVector.h  (C) Mark Rodenkirch, December 2020 sophie_germain/../core/MpArithVector.h:28:38: error: ‘size_t’ does not name a type 28  uint64_t & operator [](const size_t i) { return _r[i]; }  ^~~~~~ sophie_germain/../core/MpArithVector.h:28:38: note: ‘size_t’ is defined in header ‘<cstddef>’; did you forget to ‘#include <cstddef>’? sophie_germain/../core/MpArithVector.h:32:11: error: ‘size_t’ has not been declared 32  template <size_t N>  ^~~~~~ sophie_germain/../core/MpArithVector.h:36:21: error: ‘N’ was not declared in this scope 36  uint64_t _p[N], _q[N];  ^ sophie_germain/../core/MpArithVector.h:36:28: error: ‘N’ was not declared in this scope 36  uint64_t _p[N], _q[N];  ^ sophie_germain/../core/MpArithVector.h:37:21: error: ‘N’ was not declared in this scope 37  MpResVector<N> _one; // 2^64 mod p  ^ sophie_germain/../core/MpArithVector.h:37:22: error: template argument 1 is invalid 37  MpResVector<N> _one; // 2^64 mod p  ^ sophie_germain/../core/MpArithVector.h:38:21: error: ‘N’ was not declared in this scope 38  MpResVector<N> _r2; // (2^64)^2 mod p  ^ sophie_germain/../core/MpArithVector.h:38:22: error: template argument 1 is invalid 38  MpResVector<N> _r2; // (2^64)^2 mod p  ^ sophie_germain/../core/MpArithVector.h:77:28: error: ‘N’ was not declared in this scope 77  static MpResVector<N> zero()  ^ sophie_germain/../core/MpArithVector.h:77:29: error: template argument 1 is invalid 77  static MpResVector<N> zero()  ^ sophie_germain/../core/MpArithVector.h:84:21: error: ‘N’ was not declared in this scope 84  MpResVector<N> one() const { return _one; } // Montgomery form of 1  ^ sophie_germain/../core/MpArithVector.h:84:22: error: template argument 1 is invalid 84  MpResVector<N> one() const { return _one; } // Montgomery form of 1  ^ sophie_germain/../core/MpArithVector.h:86:20: error: ‘size_t’ has not been declared 86  uint64_t p(size_t k) const { return _p[k]; }  ^~~~~~ sophie_germain/../core/MpArithVector.h:88:61: error: ‘N’ was not declared in this scope 88  static bool at_least_one_is_equal(const MpResVector<N> & a, const MpResVector<N> & b)  ^ sophie_germain/../core/MpArithVector.h:88:62: error: template argument 1 is invalid 88  static bool at_least_one_is_equal(const MpResVector<N> & a, const MpResVector<N> & b)  ^ sophie_germain/../core/MpArithVector.h:88:87: error: ‘N’ was not declared in this scope 88  l at_least_one_is_equal(const MpResVector<N> & a, const MpResVector<N> & b)  ^ sophie_germain/../core/MpArithVector.h:88:88: error: template argument 1 is invalid 88  l at_least_one_is_equal(const MpResVector<N> & a, const MpResVector<N> & b)  ^ sophie_germain/../core/MpArithVector.h:95:21: error: ‘N’ was not declared in this scope 95  MpResVector<N> add(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:95:22: error: template argument 1 is invalid 95  MpResVector<N> add(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:95:46: error: ‘N’ was not declared in this scope 95  MpResVector<N> add(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:95:47: error: template argument 1 is invalid 95  MpResVector<N> add(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:95:72: error: ‘N’ was not declared in this scope 95  MpResVector<N> add(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:95:73: error: template argument 1 is invalid 95  MpResVector<N> add(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:106:21: error: ‘N’ was not declared in this scope 106  MpResVector<N> sub(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:106:22: error: template argument 1 is invalid 106  MpResVector<N> sub(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:106:46: error: ‘N’ was not declared in this scope 106  MpResVector<N> sub(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:106:47: error: template argument 1 is invalid 106  MpResVector<N> sub(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:106:72: error: ‘N’ was not declared in this scope 106  MpResVector<N> sub(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:106:73: error: template argument 1 is invalid 106  MpResVector<N> sub(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:117:58: error: ‘N’ was not declared in this scope 117  uint64_t mul(const uint64_t a, const MpResVector<N> & b, size_t k) const  ^ sophie_germain/../core/MpArithVector.h:117:59: error: template argument 1 is invalid 117  uint64_t mul(const uint64_t a, const MpResVector<N> & b, size_t k) const  ^ sophie_germain/../core/MpArithVector.h:117:66: error: ‘size_t’ has not been declared 117  uint64_t mul(const uint64_t a, const MpResVector<N> & b, size_t k) const  ^~~~~~ sophie_germain/../core/MpArithVector.h:122:35: error: ‘N’ was not declared in this scope 122  uint64_t mul(const MpResVector<N> & a, const MpResVector<N> & b, size_t k) const  ^ sophie_germain/../core/MpArithVector.h:122:36: error: template argument 1 is invalid 122  uint64_t mul(const MpResVector<N> & a, const MpResVector<N> & b, size_t k) const  ^ sophie_germain/../core/MpArithVector.h:122:61: error: ‘N’ was not declared in this scope 122  uint64_t mul(const MpResVector<N> & a, const MpResVector<N> & b, size_t k) const  ^ sophie_germain/../core/MpArithVector.h:122:62: error: template argument 1 is invalid 122  uint64_t mul(const MpResVector<N> & a, const MpResVector<N> & b, size_t k) const  ^ sophie_germain/../core/MpArithVector.h:122:69: error: ‘size_t’ has not been declared 122  int64_t mul(const MpResVector<N> & a, const MpResVector<N> & b, size_t k) const  ^~~~~~ sophie_germain/../core/MpArithVector.h:127:21: error: ‘N’ was not declared in this scope 127  MpResVector<N> mul(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:127:22: error: template argument 1 is invalid 127  MpResVector<N> mul(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:127:46: error: ‘N’ was not declared in this scope 127  MpResVector<N> mul(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:127:47: error: template argument 1 is invalid 127  MpResVector<N> mul(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:127:72: error: ‘N’ was not declared in this scope 127  MpResVector<N> mul(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:127:73: error: template argument 1 is invalid 127  MpResVector<N> mul(const MpResVector<N> & a, const MpResVector<N> & b) const  ^ sophie_germain/../core/MpArithVector.h:137:21: error: ‘N’ was not declared in this scope 137  MpResVector<N> pow(const MpResVector<N> & a, size_t exp) const  ^ sophie_germain/../core/MpArithVector.h:137:22: error: template argument 1 is invalid 137  MpResVector<N> pow(const MpResVector<N> & a, size_t exp) const  ^ sophie_germain/../core/MpArithVector.h:137:46: error: ‘N’ was not declared in this scope 137  MpResVector<N> pow(const MpResVector<N> & a, size_t exp) const  ^ sophie_germain/../core/MpArithVector.h:137:47: error: template argument 1 is invalid 137  MpResVector<N> pow(const MpResVector<N> & a, size_t exp) const  ^ sophie_germain/../core/MpArithVector.h:137:54: error: ‘size_t’ has not been declared 137  MpResVector<N> pow(const MpResVector<N> & a, size_t exp) const  ^~~~~~ sophie_germain/../core/MpArithVector.h:159:21: error: ‘N’ was not declared in this scope 159  MpResVector<N> nToRes(const uint64_t *n) const  ^ sophie_germain/../core/MpArithVector.h:159:22: error: template argument 1 is invalid 159  MpResVector<N> nToRes(const uint64_t *n) const  ^ sophie_germain/../core/MpArithVector.h:171:21: error: ‘N’ was not declared in this scope 171  MpResVector<N> nToRes(uint64_t n) const  ^ sophie_germain/../core/MpArithVector.h:171:22: error: template argument 1 is invalid 171  MpResVector<N> nToRes(uint64_t n) const  ^ sophie_germain/../core/MpArithVector.h:183:21: error: ‘N’ was not declared in this scope 183  MpResVector<N> nToRes(uint32_t *n) const  ^ sophie_germain/../core/MpArithVector.h:183:22: error: template argument 1 is invalid 183  MpResVector<N> nToRes(uint32_t *n) const  ^ sophie_germain/../core/MpArithVector.h:195:21: error: ‘N’ was not declared in this scope 195  MpResVector<N> resToN(const MpResVector<N> & a) const  ^ sophie_germain/../core/MpArithVector.h:195:22: error: template argument 1 is invalid 195  MpResVector<N> resToN(const MpResVector<N> & a) const  ^ sophie_germain/../core/MpArithVector.h:195:49: error: ‘N’ was not declared in this scope 195  MpResVector<N> resToN(const MpResVector<N> & a) const  ^ sophie_germain/../core/MpArithVector.h:195:50: error: template argument 1 is invalid 195  MpResVector<N> resToN(const MpResVector<N> & a) const  ^ sophie_germain/SophieGermainWorker.cpp: In member function ‘void SophieGermainWorker::TestMegaPrimeChunkSmall()’: sophie_germain/SophieGermainWorker.cpp:73:21: error: invalid conversion from ‘uint64_t*’ {aka ‘long unsigned int*’} to ‘MpArithVec’ {aka ‘int’} [fpermissive] 73  MpArithVec mp(ps);  ^~    uint64_t* {aka long unsigned int*} sophie_germain/SophieGermainWorker.cpp:75:29: error: request for member ‘nToRes’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 75  MpResVec resInvs = mp.nToRes(invs);  ^~~~~~ sophie_germain/SophieGermainWorker.cpp:76:25: error: request for member ‘pow’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 76  MpResVec res = mp.pow(resInvs, ii_N);  ^~~ sophie_germain/SophieGermainWorker.cpp:77:27: error: request for member ‘resToN’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 77  MpResVec resKs = mp.resToN(res);  ^~~~~~ sophie_germain/SophieGermainWorker.cpp:79:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 79  if (resKs[0] <= il_MaxK) RemoveTermsSmallPrime(resKs[0], true, ps[0]);  ^ sophie_germain/SophieGermainWorker.cpp:79:59: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 79  if (resKs[0] <= il_MaxK) RemoveTermsSmallPrime(resKs[0], true, ps[0]);  ^ sophie_germain/SophieGermainWorker.cpp:80:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 80  if (resKs[1] <= il_MaxK) RemoveTermsSmallPrime(resKs[1], true, ps[1]);  ^ sophie_germain/SophieGermainWorker.cpp:80:59: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 80  if (resKs[1] <= il_MaxK) RemoveTermsSmallPrime(resKs[1], true, ps[1]);  ^ sophie_germain/SophieGermainWorker.cpp:81:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 81  if (resKs[2] <= il_MaxK) RemoveTermsSmallPrime(resKs[2], true, ps[2]);  ^ sophie_germain/SophieGermainWorker.cpp:81:59: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 81  if (resKs[2] <= il_MaxK) RemoveTermsSmallPrime(resKs[2], true, ps[2]);  ^ sophie_germain/SophieGermainWorker.cpp:82:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 82  if (resKs[3] <= il_MaxK) RemoveTermsSmallPrime(resKs[3], true, ps[3]);  ^ sophie_germain/SophieGermainWorker.cpp:82:59: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 82  if (resKs[3] <= il_MaxK) RemoveTermsSmallPrime(resKs[3], true, ps[3]);  ^ sophie_germain/SophieGermainWorker.cpp:86:19: error: request for member ‘mul’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 86  res = mp.mul(res, resInvs);  ^~~ sophie_germain/SophieGermainWorker.cpp:93:22: error: request for member ‘mul’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 93  res = mp.mul(res, resInvs);  ^~~ sophie_germain/SophieGermainWorker.cpp:103:22: error: request for member ‘mul’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 103  res = mp.mul(res, mp.nToRes(invs));  ^~~ sophie_germain/SophieGermainWorker.cpp:103:34: error: request for member ‘nToRes’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 103  res = mp.mul(res, mp.nToRes(invs));  ^~~~~~ sophie_germain/SophieGermainWorker.cpp:107:18: error: request for member ‘resToN’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 107  resKs = mp.resToN(res);  ^~~~~~ sophie_germain/SophieGermainWorker.cpp:109:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 109  if (resKs[0] <= il_MaxK) RemoveTermsSmallPrime(resKs[0], false, ps[0]);  ^ sophie_germain/SophieGermainWorker.cpp:109:59: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 109  if (resKs[0] <= il_MaxK) RemoveTermsSmallPrime(resKs[0], false, ps[0]);  ^ sophie_germain/SophieGermainWorker.cpp:110:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 110  if (resKs[1] <= il_MaxK) RemoveTermsSmallPrime(resKs[1], false, ps[1]);  ^ sophie_germain/SophieGermainWorker.cpp:110:59: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 110  if (resKs[1] <= il_MaxK) RemoveTermsSmallPrime(resKs[1], false, ps[1]);  ^ sophie_germain/SophieGermainWorker.cpp:111:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 111  if (resKs[2] <= il_MaxK) RemoveTermsSmallPrime(resKs[2], false, ps[2]);  ^ sophie_germain/SophieGermainWorker.cpp:111:59: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 111  if (resKs[2] <= il_MaxK) RemoveTermsSmallPrime(resKs[2], false, ps[2]);  ^ sophie_germain/SophieGermainWorker.cpp:112:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 112  if (resKs[3] <= il_MaxK) RemoveTermsSmallPrime(resKs[3], false, ps[3]);  ^ sophie_germain/SophieGermainWorker.cpp:112:59: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 112  if (resKs[3] <= il_MaxK) RemoveTermsSmallPrime(resKs[3], false, ps[3]);  ^ sophie_germain/SophieGermainWorker.cpp: In member function ‘void SophieGermainWorker::TestMegaPrimeChunkLarge()’: sophie_germain/SophieGermainWorker.cpp:155:21: error: invalid conversion from ‘uint64_t*’ {aka ‘long unsigned int*’} to ‘MpArithVec’ {aka ‘int’} [fpermissive] 155  MpArithVec mp(ps);  ^~    uint64_t* {aka long unsigned int*} sophie_germain/SophieGermainWorker.cpp:157:29: error: request for member ‘nToRes’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 157  MpResVec resInvs = mp.nToRes(invs);  ^~~~~~ sophie_germain/SophieGermainWorker.cpp:158:25: error: request for member ‘pow’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 158  MpResVec res = mp.pow(resInvs, ii_N);  ^~~ sophie_germain/SophieGermainWorker.cpp:159:27: error: request for member ‘resToN’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 159  MpResVec resKs = mp.resToN(res);  ^~~~~~ sophie_germain/SophieGermainWorker.cpp:161:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 161  if (resKs[0] >= il_MinK && resKs[0] <= il_MaxK) RemoveTermsLargePrime(resKs[0], true, ps[0]);  ^ sophie_germain/SophieGermainWorker.cpp:161:39: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 161  if (resKs[0] >= il_MinK && resKs[0] <= il_MaxK) RemoveTermsLargePrime(resKs[0], true, ps[0]);  ^ sophie_germain/SophieGermainWorker.cpp:161:82: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 161  ] >= il_MinK && resKs[0] <= il_MaxK) RemoveTermsLargePrime(resKs[0], true, ps[0]);  ^ sophie_germain/SophieGermainWorker.cpp:162:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 162  if (resKs[1] >= il_MinK && resKs[1] <= il_MaxK) RemoveTermsLargePrime(resKs[1], true, ps[1]);  ^ sophie_germain/SophieGermainWorker.cpp:162:39: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 162  if (resKs[1] >= il_MinK && resKs[1] <= il_MaxK) RemoveTermsLargePrime(resKs[1], true, ps[1]);  ^ sophie_germain/SophieGermainWorker.cpp:162:82: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 162  ] >= il_MinK && resKs[1] <= il_MaxK) RemoveTermsLargePrime(resKs[1], true, ps[1]);  ^ sophie_germain/SophieGermainWorker.cpp:163:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 163  if (resKs[2] >= il_MinK && resKs[2] <= il_MaxK) RemoveTermsLargePrime(resKs[2], true, ps[2]);  ^ sophie_germain/SophieGermainWorker.cpp:163:39: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 163  if (resKs[2] >= il_MinK && resKs[2] <= il_MaxK) RemoveTermsLargePrime(resKs[2], true, ps[2]);  ^ sophie_germain/SophieGermainWorker.cpp:163:82: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 163  ] >= il_MinK && resKs[2] <= il_MaxK) RemoveTermsLargePrime(resKs[2], true, ps[2]);  ^ sophie_germain/SophieGermainWorker.cpp:164:16: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 164  if (resKs[3] >= il_MinK && resKs[3] <= il_MaxK) RemoveTermsLargePrime(resKs[3], true, ps[3]);  ^ sophie_germain/SophieGermainWorker.cpp:164:39: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 164  if (resKs[3] >= il_MinK && resKs[3] <= il_MaxK) RemoveTermsLargePrime(resKs[3], true, ps[3]);  ^ sophie_germain/SophieGermainWorker.cpp:164:82: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 164  ] >= il_MinK && resKs[3] <= il_MaxK) RemoveTermsLargePrime(resKs[3], true, ps[3]);  ^ sophie_germain/SophieGermainWorker.cpp:168:19: error: request for member ‘mul’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 168  res = mp.mul(res, resInvs);  ^~~ sophie_germain/SophieGermainWorker.cpp:176:19: error: request for member ‘mul’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 176  res = mp.mul(res, mp.nToRes(invs));  ^~~ sophie_germain/SophieGermainWorker.cpp:176:31: error: request for member ‘nToRes’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 176  res = mp.mul(res, mp.nToRes(invs));  ^~~~~~ sophie_germain/SophieGermainWorker.cpp:179:18: error: request for member ‘resToN’ in ‘mp’, which is of nonclass type ‘MpArithVec’ {aka ‘int’} 179  resKs = mp.resToN(res);  ^~~~~~ sophie_germain/SophieGermainWorker.cpp:181:34: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 181  RemoveTermsLargePrime(resKs[0], false, ps[0]);  ^ sophie_germain/SophieGermainWorker.cpp:182:34: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 182  RemoveTermsLargePrime(resKs[1], false, ps[1]);  ^ sophie_germain/SophieGermainWorker.cpp:183:34: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 183  RemoveTermsLargePrime(resKs[2], false, ps[2]);  ^ sophie_germain/SophieGermainWorker.cpp:184:34: error: invalid types ‘MpResVec {aka int}[int]’ for array subscript 184  RemoveTermsLargePrime(resKs[3], false, ps[3]);  ^ make: *** [makefile:217: sophie_germain/SophieGermainWorker.o] Error 1[/CODE] I believe a include <cstddef> should solve most, if not all of the errors here, unless size_t is defined elsewhere. 
ryanp,
When you specify W1, it will not use a GPU worker. W is for CPU workers and when used with a GPU enabled program will prevent it from using the GPU. My recommendation is then you are using a GPU enabled program to not specify G at all until you can successfully run without it. I will look into the segfault with g5120. I have run into a bug with srsieve2cl. I need to look into it. I suggest running without g to see if it can run successfully. Dylan14, on which platform is that occurring? As you said it might be as simple as adding a missing #include. pepi37, I understand your desire for a tutorial. I do have a [URL="https://www.mersenneforum.org/rogue/mtsieve.html"]webpage[/URL], but it is horribly out of date. It is on my todo list to update that page. For most of the sieves I would think that the basic parameters are fairly obvious. In most cases you don't even need to specify values as they will default. My recommendation is to use the defaults until you are comfortable enough to use the other parameters to see if you can get better performance. 
The error with compilation is occurring on Linux (more specifically, Arch Linux). And indeed adding the #include <cstddef> in core/MpArithVector.h fixed the issue.

[QUOTE=rogue;612060]I will look into the segfault with g5120. I have run into a bug with srsieve2cl. I need to look into it.
I suggest running without g to see if it can run successfully.[/QUOTE] Without any of [C]G[/C], [C]g[/C] or [C]W[/C]: [CODE]$ ./srsieve2cl P 1e14 o "ferm81_3M_15M.txt" s "81*2^n+1" n 3e6 N 15e6 srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n (b2) Removed 3000000 algebraic factors for 81*2^n+1 of the form (3^2)*2^(n/2)3*2^((n+2)/4))+1 when n%4=2 Sieving with generic logic for p >= 3 Creating CPU worker to use until p >= 1000000 GPU primes per worker is 221184 Sieve started: 3 < p < 1e14 with 9000001 terms (3000000 < n < 15000000, k*2^n+1) (expecting 8693280 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 2 sequence into 384 base 2^720 sequences. Legendre summary: Approximately 2 B needed for Legendre tables 1 total sequences 1 are eligible for Legendre tables 0 are not eligible for Legendre tables 1 have Legendre tables in memory 0 cannot have Legendre tables in memory 0 have Legendre tables loaded from files 1 required building of the Legendre tables 518400 bytes used for congruent q and ladder indices 295200 bytes used for congruent qs and ladders Creating CPU worker to use until p >= 1000000 double free or corruption (!prev) Aborted (core dumped)[/CODE] Here's a backtrace with gdb from a debug run: [CODE]double free or corruption (!prev) Thread 1 "srsieve2cl" received signal SIGABRT, Aborted. __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x00007ffff6cb97f1 in __GI_abort () at abort.c:79 #2 0x00007ffff6d02837 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff6e2fa7b "%s\n") at ../sysdeps/posix/libc_fatal.c:181 #3 0x00007ffff6d098ba in malloc_printerr ( str=str@entry=0x7ffff6e317a8 "double free or corruption (!prev)") at malloc.c:5342 #4 0x00007ffff6d10e5c in _int_free (have_lock=0, p=0x55555675f070, av=0x7ffff7064c40 <main_arena>) at malloc.c:4311 #5 __GI___libc_free (mem=0x55555675f080) at malloc.c:3134 #6 0x00005555555653d8 in xfree (memoryPtr=0x55555675f100) at core/main.cpp:285 #7 0x00005555555638b8 in Worker::~Worker (this=0x555555829170, __in_chrg=<optimized out>) at core/Worker.cpp:80 #8 0x000055555559b968 in AbstractWorker::~AbstractWorker ( this=0x555555829170, __in_chrg=<optimized out>) at sierpinski_riesel/AbstractWorker.h:24 #9 0x00005555555a5ece in CisOneWithOneSequenceGpuWorker::~CisOneWithOneSequenceGpuWorker (this=0x555555829170, __in_chrg=<optimized out>) at sierpinski_riesel/CisOneWithOneSequenceGpuWorker.h:26 #10 0x00005555555a5eea in CisOneWithOneSequenceGpuWorker::~CisOneWithOneSequenceGpuWorker (this=0x555555829170, __in_chrg=<optimized out>) at sierpinski_riesel/CisOneWithOneSequenceGpuWorker.h:26 #11 0x000055555555e38a in App::DeleteWorkers (this=0x5555557e6970) at core/App.cpp:678 #12 0x000055555555e209 in App::PauseSievingAndRebuild (this=0x5555557e6970) at core/App.cpp:643 #13 0x000055555555dd39 in App::Sieve (this=0x5555557e6970) at core/App.cpp:499 #14 0x000055555555d9f8 in App::Run (this=0x5555557e6970) at core/App.cpp:422 #15 0x0000555555564bef in main (argc=11, argv=0x7fffffffe3f8) at core/main.cpp:91[/CODE] [QUOTE]pepi37, I understand your desire for a tutorial. I do have a [URL="https://www.mersenneforum.org/rogue/mtsieve.html"]webpage[/URL], but it is horribly out of date. It is on my todo list to update that page. For most of the sieves I would think that the basic parameters are fairly obvious. In most cases you don't even need to specify values as they will default. My recommendation is to use the defaults until you are comfortable enough to use the other parameters to see if you can get better performance.[/QUOTE] I would agree with pepi's general sentiment for the *cl sieves. The tuning of [C]G[/C], [C]g[/C], [C]W[/C] and/or [C]K[/C] is not obvious, at least to me. I don't think a lengthy tutorial is needed, but maybe just a severalparagraph doc with "how to tune flags for the common cases"? e.g.: one GPU + one sequence being sieved. It also wasn't clear that [C]W[/C] turns off GPU sieving altogether (I'm not sure why the *cl programs would even support this flag). 
The doublefree is what I am seeing as well. Whatever the cause is, it is not obvious at this time.
By default you get only one type of worker or the other. You have to use W and G together to get both kinds of workers. Without W, G defaults to 1. I know that is a little confusing. My typical uses for the GPUenabled programs is to not set either W or G. I will use W for the CPU only programs, but I could see the use of both if one has a lot of CPU cores and and GPU isn't that much faster than a single GPU core. 
I tracked down the problem. I have committed a change to CisOneWithOneSequenceGpuWorker.

[QUOTE=rogue;612071]I tracked down the problem. I have committed a change to CisOneWithOneSequenceGpuWorker.[/QUOTE]
It finally runs. But I can't seem to tune [C]g[/C] to be any more efficient than 16 regular CPU workers, even when running [C]srsieve2cl[/C] on an NVIDIA A100... which seems quite odd. 
[QUOTE=ryanp;612078]It finally runs. But I can't seem to tune [C]g[/C] to be any more efficient than 16 regular CPU workers, even when running [C]srsieve2cl[/C] on an NVIDIA A100... which seems quite odd.[/QUOTE]
Not necessarily. You can use G2 to try two GPU workers or you can mix CPU and GPU workers, e.g. G1 W16. What is the comparative speed of one GPU worker to one CPU worker? 
Any updates?
It is also possible that the default GPU it is using is not the the GPU you are expecting. Start with h to see the default GPU. You can use command line switches to change the platform and device. 
[QUOTE=rogue;612278]Any updates?[/QUOTE]
The update is that I have found combinations of [C]G[/C], [C]g[/C] and [C]M[/C] that are functional  srsieve2cl runs, and I see 100% CPU utilization in [C]nvidiasmi[/C]  but is no faster (on an A100) than a 64core CPU worker on another machine. I have no idea how to tune these, what's optimal, or whether there's some underlying bottleneck somewhere else. Example: [CODE]$ ./srsieve2cl P 1e15 G 16 g 512 M 1000000 n 20e6 N 25e6 o ferm9_20M_25M_sv1e15.txt s 9*2^n+1 srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n Sieving with generic logic for p >= 3 Creating CPU worker to use until p >= 1000000 GPU primes per worker is 14155776 Sieve started: 3 < p < 1e15 with 5000001 terms (20000000 < n < 25000000, k*2^n+1) (expecting 4840960 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 2 sequence into 104 base 2^360 sequences. Legendre summary: Approximately 2 B needed for Legendre tables 1 total sequences 1 are eligible for Legendre tables 0 are not eligible for Legendre tables 1 have Legendre tables in memory 0 cannot have Legendre tables in memory 0 have Legendre tables loaded from files 1 required building of the Legendre tables 518400 bytes used for congruent q and ladder indices 259200 bytes used for congruent qs and ladders Creating CPU worker to use until p >= 1000000 p=540837151, 6.587M p/sec, 4668811 factors found at 352.2 f/sec (last 1 min), p=9303408733, 7.126M p/sec, 4680592 factors found at 144.5 f/sec (last 2 min), p=21575254847, 7.304M p/sec, 4686586 factors found at 89.74 f/sec (last 3 min), 0.0% done. ETC 20221206 01:21 [/CODE] On the 64core CPU machine, I'm getting 7.589M p/sec. It seems like an NVIDIA A100 should be able to do far better than this. 
[QUOTE=rogue;612278]It is also possible that the default GPU it is using is not the the GPU you are expecting. Start with h to see the default GPU. You can use command line switches to change the platform and device.[/QUOTE]
That's definitely not the case here  this is a VM with one GPU, and I see it pegged at 100% in [C]nvidiasmi[/C]. It's just not any faster than a 64core CPU instance. 
[QUOTE=ryanp;612301]That's definitely not the case here  this is a VM with one GPU, and I see it pegged at 100% in [C]nvidiasmi[/C]. It's just not any faster than a 64core CPU instance.[/QUOTE]
Note that you cannot directly compare sr1sieve/sr2sieve speeds with srsieve2 using p/sec. It is best to determine the range of p, e.g. 1e12 and see how long it takes to complete the range. Are you comparing srsieve2 with W32/W64 with srsieve2cl (with no W)? I'm trying to understand what you are comparing when you say it is no faster. I don't have access to such a GPU. It is very possible that you are correct, but I can't attest to that one way or another. It is using an OpenCL version of the same algorithm used by sr1sieve and srsieve2, although it has no assembly. I wonder if there could be limitations due to it running on a VM. Maybe the VM configuration is limiting how much GPU it can use. I haven't run on a VM, so that is just a guess. 
[QUOTE=rogue;612310]
I don't have access to such a GPU. It is very possible that you are correct, but I can't attest to that one way or another. It is using an OpenCL version of the same algorithm used by sr1sieve and srsieve2, although it has no assembly. [/QUOTE] We can arrange and I am willing to give you access to my Windows box with RTX 2060 so you can see yourself. As I can say srsieve2cl doesnot work at all.I wish and would like that program works, since I have few GPU and then I will be able to make nice sieve filers by myself, but reality is different :( Mystery is even bigger since you always say that on your machine srsieve2cl is faster then srsieve2 and nobody other can reproduce that fact. Just contact me via PM Best regards 
RESULTS are here
[CODE]srsieve2cl.exe P1e6 n 4 N 10000000 s "4*53^n+1" Creating CPU worker to use until p >= 1000000 Sieve completed at p=1000003. CPU time: 8.38 sec. (0.00 sieving) (0.93 cores) GPU time: 0.00 sec. 305747 terms written to b53_n.abcd Primes tested: 78290. Factors found: 7194250. Remaining terms: 305747. Time: 9.04 seconds. srsieve2cl.exe P1e7 n 4 N 10000000 s "4*53^n+1" Creating CPU worker to use until p >= 1000000 Fatal Error: Could not handle all GPU factors. A range of p generated 36883 factors (limited to 7992). Use M to increase max factor density srsieve2cl.exe P1e7 n 4 N 10000000 M 1000 s "4*53^n+1" Creating CPU worker to use until p >= 1000000 Sieve completed at p=10059761. CPU time: 10.50 sec. (0.00 sieving) (0.87 cores) GPU time: 1.78 sec. 261476 terms written to b53_n.abcd Primes tested: 589824. Factors found: 7238521. Remaining terms: 261476. Time: 12.10 seconds. srsieve2cl.exe P 1e11 D 1 G 2 n 4 N 10000000 M 4000 s "4*53^n+1" Creating CPU worker to use until p >= 1000000 p=1789474679, 1.449M p/sec, 7302068 factors found at 1.073K f/sec (last 1 min), 1.8% done. ETC 20220906 22:08 p=4033235879, 1.570M p/sec, 7309308 factors found at 507.2 f/sec (last 2 min), 4.0% done. ETC 20220906 22:02 p=6335693783, 1.609M p/sec, 7313108 factors found at 335.6 f/sec (last 3 min), 6.3% done. ETC 20220906 21:59 p=8678189381, 1.629M p/sec, 7315702 factors found at 252.5 f/sec (last 4 min), 8.7% done. ETC 20220906 21:58 srsieve2cl.exe P 1e11 W6 n 4 N 10000000 M 4000 s "4*53^n+1" Increasing worksize to 64000 since each chunk is tested in less than a second Increasing worksize to 1024000 since each chunk is tested in less than a second p=4336204969, 3.426M p/sec, 7310008 factors found at 1.240K f/sec (last 1 min), 4.3% done. ETC 20220906 21:40 p=9693240527, 3.656M p/sec, 7316685 factors found at 570.1 f/sec (last 2 min), 9.7% done. ETC 20220906 21:38 p=15208974959, 3.737M p/sec, 7320231 factors found at 369.4 f/sec (last 3 min), 15.2% done. ETC 20220906 21:37 [/CODE]With option G2 I got 98 GPU utilization. But at end my i59600K (running on 6 cores) is nearly double fast then my RTX 2060 
[QUOTE=pepi37;612810]Fatal Error: Could not handle all GPU factors. A range of p generated 36883 factors (limited to 7992).[/QUOTE]
This error tells you to adjust M. The default for M is 100. Change to 500 to hold all of the factors. This increases that limit by a factor of 5. This parameter is not necessary once you have sieved more deeply as fewer p produce factors. If you use h, it will tell you which platform and device are the default to be used. I suspect D1 is using the Intel Integrated GPU, which isn't going to provide you much of anything. 
[QUOTE=rogue;612815]This error tells you to adjust M. The default for M is 100. Change to 500 to hold all of the factors. This increases that limit by a factor of 5. This parameter is not necessary once you have sieved more deeply as fewer p produce factors.
If you use h, it will tell you which platform and device are the default to be used. I suspect D1 is using the Intel Integrated GPU, which isn't going to provide you much of anything.[/QUOTE] Opposite D 0 is Intel, D 1 is Nvidia I checked twice g:\SRSIEVE2>srsieve2cl.exe h srsieve2cl v1.6.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n h help prints this help p pmin=P0 sieve start: P0 < p (default 3) P pmax=P1 sieve end: p < P1 (default 2^62) w worksize=w initial primes per chunk of work (default 16000) W workers=W start W workers (default 0) g gpuworkgroups=g work groups per call to GPU (default 8) G gpuworkers=G start G GPU workers (default 0) D platform=D Use platform D instead of 0 d device=d Use device d instead of 0 H showgpudetail Show device and kernel details List of available platforms and devices Platform 0 is a Intel(R) Corporation Intel(R) OpenCL HD Graphics, version OpenCL 2.1 Device 0 is a Intel(R) Corporation Intel(R) UHD Graphics 630 Platform 1 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 3.0 CUDA 11.6.127 Device 0 is a NVIDIA Corporation NVIDIA GeForce RTX 2060 
It would be great for srsieve2cl to define a max VRAM gap to be used for a calculation, last time I have tried it on a 8GB card to find out whats the optimum, several times the driver crashes while I was over the limit, playing with the workers and got a 7GB limit. It would be a great help if the program can define the rest of the max workers by itself.

[QUOTE=rebirther;612944]It would be great for srsieve2cl to define a max VRAM gap to be used for a calculation, last time I have tried it on a 8GB card to find out whats the optimum, several times the driver crashes while I was over the limit, playing with the workers and got a 7GB limit. It would be a great help if the program can define the rest of the max workers by itself.[/QUOTE]
I do not know what you mean by "max VRAM gap" or "rest of the max workers". The only options you have at your disposal are g, G, and K. As for how much GPU memory it uses when executing a kernel, that is a good question. There are many values I can pull from the driver regarding memory utilization and some that I can compute on the fly. You can see these if you specify the H switch. I haven't been able to determine a good way to know that a kernel will fail due to requiring too much memory. 
Is it possible there is some bottleneck for how fast things are fed to the GPU, that is the same for CPU workers? I don't know how else to explain an A100 being the same speed as a regular 64 core machine.
(Despite the fact that it is a VM, when running mfaktc I generally see similar numbers to: [url]https://www.mersenne.ca/mfaktc.php[/url]). 
[QUOTE=ryanp;612962]Is it possible there is some bottleneck for how fast things are fed to the GPU, that is the same for CPU workers? I don't know how else to explain an A100 being the same speed as a regular 64 core machine.
(Despite the fact that it is a VM, when running mfaktc I generally see similar numbers to: [url]https://www.mersenne.ca/mfaktc.php[/url]).[/QUOTE] Moving to PM for the time being. 
Did my testing show to you where problem is, my cpu is still faster then RTX 2060!
Any other idea? Different setup? 
[QUOTE=pepi37;612978]Did my testing show to you where problem is, my cpu is still faster then RTX 2060!
Any other idea? Different setup?[/QUOTE] You can try a different value for g, but remember that you are comparing a single GPU to multiple cores on a CPU. Unlike some of the other sieves using the framework, sr2sievecl uses a lot more GPU memory as each thread has to maintain its own set of tables in memory. When it comes to discrete logs, you can use less memory, but then the computation time for each p varies significantly. The discrete log used by srsieve2cl uses a method that "flattens the curve" for the calculation regardless of p, but requires more memory. It might be possible to modify the algorithm to use less memory in the GPU, but that could lead to other issues. One of the worst things with the current algorithm is that there are many conditionals and the remaining loops can't really be unrolled. It would likely require a completely different algorithm to get more speed out of it. 
[QUOTE=rogue;612989]You can try a different value for g, but remember that you are comparing a single GPU to multiple cores on a CPU.
Unlike some of the other sieves using the framework, sr2sievecl uses a lot more GPU memory as each thread has to maintain its own set of tables in memory. When it comes to discrete logs, you can use less memory, but then the computation time for each p varies significantly. The discrete log used by srsieve2cl uses a method that "flattens the curve" for the calculation regardless of p, but requires more memory. It might be possible to modify the algorithm to use less memory in the GPU, but that could lead to other issues. One of the worst things with the current algorithm is that there are many conditionals and the remaining loops can't really be unrolled. It would likely require a completely different algorithm to get more speed out of it.[/QUOTE] I agree with all that but what is purpose of the opencl sieve. RTX 2060 is not most powerfull card, but it is not bad at all. I expect huge difference in speed, . And as ryanp says " I don't know how else to explain an A100 being the same speed as a regular 64 core machine." A100 is beast GPU card... 
[QUOTE=pepi37;612990]I agree with all that but what is purpose of the opencl sieve. RTX 2060 is not most powerfull card, but it is not bad at all. I expect huge difference in speed, .
And as ryanp says " I don't know how else to explain an A100 being the same speed as a regular 64 core machine." A100 is beast GPU card...[/QUOTE] If you compare a single GPU vs a single core on a CPU, that is where you see the difference in speed. A GPU core is not equivalent to a CPU core. 
[QUOTE=rogue;613005]If you compare a single GPU vs a single core on a CPU, that is where you see the difference in speed. A GPU core is not equivalent to a CPU core.[/QUOTE]
I'm not really sure how that information helps. The reality is that we haven't seemed to find any combination of params that actually make [C]srsieve2cl[/C] any faster than a simple 4 or 8core CPU run ("W"). 
It looks like that srsieve2.cl cannot run on my HD7950:
[CODE]Sieving with generic logic for p >= 1000000000 Split 27683 base 486 sequences into 27683 base 486^1 sequences. OpenCL Error: Program build failure in call to clBuildProgram "C:\Users\user\AppData\Local\Temp\OCL2224T5.cl", line 165: warning: state ment is unreachable resBM64 = mmmPowmod(resBM64, BABY_STEPS, thePrime, _q, _one); ^ "C:\Users\user\AppData\Local\Temp\OCL2224T5.cl", line 238: warning: statement is unreachable return 0; ^ Error:E013:Insufficient Private Resources![/CODE] Anyone has a testline? I think the file is too large to handle. It was running on my old win7 PC, the card has 3GB VRAM 
[QUOTE=rebirther;613056]It looks like that srsieve2.cl cannot run on my HD7950:
[CODE]Sieving with generic logic for p >= 1000000000 Split 27683 base 486 sequences into 27683 base 486^1 sequences. Error:E013:Insufficient Private Resources![/CODE] Anyone has a testline? I think the file is too large to handle. It was running on my old win7 PC, the card has 3GB VRAM[/QUOTE] I have not seen this before. You can use K to create multiple kernels. That will reduce how much memory is needed. 
[QUOTE=ryanp;613054]I'm not really sure how that information helps. The reality is that we haven't seemed to find any combination of params that actually make [C]srsieve2cl[/C] any faster than a simple 4 or 8core CPU run ("W").[/QUOTE]
You cannot make the assumption the xx cores on a GPU is equivalent to xx cores on a CPU. You can't even assume that a GPU core at yy GHZ is the same as a CPU core at yy GHZ. There are many factors that impact the speed of the GPU. The GPU is great for parallelizing tasks. How much of a speed bump you actually get is dependent upon the size of each task and how much memory each task needs. The discrete log algorithm is much larger than typical GPU tasks and requires a lot more memory than typical GPU tasks. The driver is also a factor as it is responsible for compiling the OpenCL C (in the kernel) to machine code. Some drivers are better than others at the task. Based upon my testing across various computers with GPUs, G1 is anywhere from 5x to 10x faster than W1. So if you only have 5x faster with G1 than W1 but want to use W8, then the CPU will clearly be faster, but if you use G1 W5, that will be twice as fast as W5 alone. 
[QUOTE=rogue;613061]I have not seen this before. You can use K to create multiple kernels. That will reduce how much memory is needed.[/QUOTE]
The file was too large I think, tested a sievefile with 2k's in it and it was running. The question is how many k's per sievefile can handle srsieve2cl? 
[QUOTE=rogue;613063]Based upon my testing across various computers with GPUs, G1 is anywhere from 5x to 10x faster than W1. So if you only have 5x faster with G1 than W1 but want to use W8, then the CPU will clearly be faster, but if you use G1 W5, that will be twice as fast as W5 alone.[/QUOTE]
OK, can you suggest a combination of [C]g[/C] and [C]G[/C] that will be faster than W 36? That's sort of been the question all along. 
[QUOTE=ryanp;613071]OK, can you suggest a combination of [C]g[/C] and [C]G[/C] that will be faster than W 36? That's sort of been the question all along.[/QUOTE]
I cannot provide one that is can be faster in the GPU without using CPU workers. I can only suggest W36 G1 as W36 alone will not use the GPU. It will require trial and error to determine if you need to use a higher value for G or the default for g. 
[QUOTE=rebirther;613065]The file was too large I think, tested a sievefile with 2k's in it and it was running. The question is how many k's per sievefile can handle srsieve2cl?[/QUOTE]
K will split the number of sequences. In your case you have 27683 sequences so K2 will use 13842 and 13841 sequences per kernel. K3 will split 27683 across 3 kernels. On the GPUs I have tested, I have been able to do over 6000 sequences per kernel. For you that would be K5. The max is under 10000 for my GPUs, but I don't know what the exact max it. I have observed that the higher the value for K, then the worse the overall performance will be compared to the CPU. It will require some trial and error on your part to see which performs better. 
[QUOTE=rogue;613078]K will split the number of sequences. In your case you have 27683 sequences so K2 will use 13842 and 13841 sequences per kernel. K3 will split 27683 across 3 kernels. On the GPUs I have tested, I have been able to do over 6000 sequences per kernel. For you that would be K5. The max is under 10000 for my GPUs, but I don't know what the exact max it.
I have observed that the higher the value for K, then the worse the overall performance will be compared to the CPU. It will require some trial and error on your part to see which performs better.[/QUOTE] I have tried K15, end up all with the same error with a base of 4800k's 
[QUOTE=rebirther;613091]I have tried K15, end up all with the same error with a base of 4800k's[/QUOTE]
You can try higher values for K to see at what point it starts to work. BTW, did you d/l the latest srsieve2cl.exe from sourceforge. It is not in the mtsievel.7z file. The previous one has a bug. 
[QUOTE=rogue;613105]You can try higher values for K to see at what point it starts to work.
BTW, did you d/l the latest srsieve2cl.exe from sourceforge. It is not in the mtsievel.7z file. The previous one has a bug.[/QUOTE] Yes, I have the latest version and no also not working with K10 
I ran into something interesting today. Apparently if the kernel requires a CL_KERNEL_PRIVATE_MEM_SIZE that is too large, then you the kernel will fail. On Windows you get an OpenCL error when the kernel executes, but on OS X, you get an Abort trap: 6 error. With a little debugging I see an error under the covers that implies that there isn't enough memory. Unfortunately neither of these results tell you what the real problem is.
Each GPU seems to have a different limit for how much private memory a kernel can use. For some GPUs I see that limit at 64KB. At others I see it at 512KB. When you run with H it will output this value to you. This seems to only impact srsieve2cl when sieving with multiple sequences because there are arrays in the main kernel whose size vary depends upon the number of sequences sieved. This means that using K will be necessary in order to keep CL_KERNEL_PRIVATE_MEM_SIZE under the limit. The main driver in srsieve2cl is the number of baby steps for the BSGS algorithm as this is used to determine the number of elements in the arrays. It might be possible to reduce the number of baby steps which could then increase sieving speed if one can use a lower value for K. Unfortunately I cannot find any documentation that tells me what the limit is, although it is possible that I haven't dug far enough yet. If I knew then I could adjust K automatically to avoid failures at runtime since the failures are mysterious to end users of the software. 
There's no such thing as CL_DEVICE_PRIVATE_MEM_SIZE (the kernel limit), so one can either make the assumption that it's > 16KB on most GPUs or use local/global memory for the arrays, depending on which one performs the best.

Sorry, I meant CL_KERNEL_PRIVATE_MEM_SIZE.
These arrays are local arrays in the kernel itself. Using global arrays is possible, but performance would take a hit. It would still be limited by the amount of memory available to the GPU. 
[QUOTE=rogue;615246]Sorry, I meant CL_KERNEL_PRIVATE_MEM_SIZE.
These arrays are local arrays in the kernel itself. Using global arrays is possible, but performance would take a hit. It would still be limited by the amount of memory available to the GPU.[/QUOTE] I apologize for having been vague: What I meant is that the out of resources error is due to the program's attempt to allocate more than the maximum amount of private memory, which, unlike local or global memory, cannot be queried (hence there is no CL_DEVICE_PRIVATE_MEM_SIZE). On AMD platforms, the error is likely thrown when a kernel with __private declarations is initialized after other kernels have used all available private memory and have spilled registers in global memory ([URL]https://community.amd.com/t5/archivesdiscussions/silentprivatememorysizelimit/tdp/166604[/URL]), however this behavior, much like the amount of private memory, is strictly implementationdependent. The only real solution to the crash is to minimize private memory usage either by limiting it to the arbitrary value of 16 KB per work item ([URL]https://stackoverflow.com/questions/22083507/isthereamaximumlimittoprivatememoryinopencl[/URL]) spreading the work across multiple kernels (which won't necessarily check out to 16KB per kernel and may still cause the error) or by attempting to allocate the arrays elsewhere, which as you said has the potential to make the program extremely inefficient but shouldn't produce a dramatic slowdown assuming that some register spilling is already happening. 
[QUOTE=ruckuslemur;615266]I apologize for having been vague:
What I meant is that the out of resources error is due to the program's attempt to allocate more than the maximum amount of private memory, which, unlike local or global memory, cannot be queried (hence there is no CL_DEVICE_PRIVATE_MEM_SIZE). On AMD platforms, the error is likely thrown when a kernel with __private declarations is initialized after other kernels have used all available private memory and have spilled registers in global memory ([url]https://community.amd.com/t5/archivesdiscussions/silentprivatememorysizelimit/tdp/166604[/url]), however this behavior, much like the amount of private memory, is strictly implementationdependent. The only real solution to the crash is to minimize private memory usage either by limiting it to the arbitrary value of 16 KB per work item ([url]https://stackoverflow.com/questions/22083507/isthereamaximumlimittoprivatememoryinopencl[/url]) spread across multiple kernels (which isn't necessarily 16KB per kernel and may still cause the error for the same reason) or by attempting to allocate the arrays elsewhere, which as you said has the potential to make the program extremely inefficient but shouldn't produce a dramatic slowdown assuming that some register spilling is already happening.[/QUOTE] These are the variables in the kernel that require the memory: [code] ushort h_table[HASH_SIZE]; ushort h_olist[HASH_ELEMENTS]; ulong h_BJ64[HASH_ELEMENTS+1]; ulong resBDCK64[SUBSEQUENCES]; [/code] and they exist for the entire execution of the kernel. There is nothing I can do about that. These values are set via #define statements prepended to the kernel in C++ code. HASH_ELEMENTS is the number of baby steps. HASH_SIZE is the smallest power of 2 greater than HASH_ELEMENTS. SUBSEQUENCES is computed based upon the sequences. Sometimes sequences = subsequences, but not necessarily. subsequences cannot be split across kernels. Removing the variables would require a different discrete log algorithm, but as I have stated elsewhere that can be problematic in other ways. Some discrete log algorithms require a lot less memory, but the execution time for each distinct prime varies greatly. That isn't good when we are doing discrete logs in parallel as the slowest one will dictate the speed of the application. Others are much faster, but they require a lot more memory. So we have a "middle of the road" approach that uses a little more memory, but flattens the execution time for each distinct prime. The end user can specify K and b (a new command line switch) to affect the calculation for the #defined values, but it will be trial and error (for the end user) to determine which values work and which values provide the best throughput. For example, many combinations will work, but some will yield higher p/sec than others. In most cases the default will be best. K is used to split the sequences across multiple kernels. The default is 1 which means that it will try to do the discrete log for all sequences in a single kernel execution. Using K 2 will split sequences across 2 kernels, and so on. b is used in the calculation of the number of baby steps (and thus giant steps). The default is 1.0, but by reducing it one can reduce both HASH_SIZE and HASH_ELEMENTS. By decreasing the number of baby steps one can decrease these two values thus reducing memory consumption in the kernel. giant steps are more expensive than baby steps, so reducing the number of baby steps alone is not the best approach. As I stated and based upon your response, I cannot detect in code that the kernel will fail when K and b are not good. That sucks for end users. Fortunately by using the H command line switch the user can probably figure out the maximum value for CL_KERNEL_PRIVATE_MEM_SIZE for their GPU before errors occur and adjust K and b to stay below that value. I have only seen this issue when trying to sieve thousands of sequences concurrently. I'm one of the few who has been doing that. I have sieved over 50,000 sequences at a time. It required K3 on an NVIDIA RTX A5000 and 500 KB seems to be the upper bound of CL_KERNEL_PRIVATE_MEM_SIZE before failure. On an AMD Radio Pro 5700 the upper bound for CL_KERNEL_PRIVATE_MEM_SIZE is closer to 50 KB. 
I posted srsievee2cl at sourceforge. Here is what I added:
[code] srsieve2/srsieve2cl: verison 1.6.4 Added b. This parameter is used to calculate the number of baby steps and giant steps for the discrete log. The default is 1.0. As this value increases, so does the number of baby steps while decreasing the number of giant steps. The parameter has an impact on CL_KERNEL_PRIVATE_MEM_SIZE in GPU kernels (output when using H). If that value is too high, then srsieve2cl will terminate. The reasons for termination vary depending on CPU. An example of usage would be when you are sieving thousands of sequences and have to use K. In one test I had to use K5 (the GPU has about 4 GB of memory). When using K3 b0.2 there is a nearly 40% performance boost. I am unaware (at this time) of how to detect the maximum amount of memory available to a kernel so there is no way to autoset K and b to the optimial values at runtime. It will require trial and error for the end user. The impact of changing from the default value has not been determined for CPU workers. [/code] 
[QUOTE=rogue;585589]srsieve2 does not work correctly when d > 1. I don't know how easy that will be to fix. I thought it was correct and working but is clearly not.
I do see a separate issue in factor validation when abs(c) > 1, so I will fix that.[/QUOTE] Did this issue ever get fixed? I think I'm running into it again (but possible a different issue) [CODE] $ ./srsieve2 W1 p 10 P 1e8 n 10 N 300000 s "1*9^n7" srsieve2 v1.6.4, a program to find factors of k*b^n+c numbers for fixed b and variable k and n Sieving with generic logic for p >= 10 Sieve started: 10 < p < 1e8 with 299991 terms (10 < n < 300000, k*9^n7) (expecting 262492 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 9 sequence into 132 base 9^180 sequences. Legendre summary: Approximately 2 B needed for Legendre tables ... 518400 bytes used for congruent q and ladder indices 263200 bytes used for congruent qs and ladders Fatal Error: Invalid factor: 1*9^390607 mod 1303 = 1297 [/CODE]I'm sieving "(3^n7)/2" which I think I can hack to avoid d by testing 9^n7 and disabling the division by 2 test. I'm excited that srsieve2 might be 10100x faster than my hand coded sieve. 
[QUOTE=SethTro;616604]Did this issue ever get fixed? I think I'm running into it again (but possible a different issue)
[CODE] $ ./srsieve2 W1 p 10 P 1e8 n 10 N 300000 s "1*9^n7" srsieve2 v1.6.4, a program to find factors of k*b^n+c numbers for fixed b and variable k and n Sieving with generic logic for p >= 10 Sieve started: 10 < p < 1e8 with 299991 terms (10 < n < 300000, k*9^n7) (expecting 262492 factors) Sieving with single sequence c=1 logic for p >= 257 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 Split 1 base 9 sequence into 132 base 9^180 sequences. Legendre summary: Approximately 2 B needed for Legendre tables ... 518400 bytes used for congruent q and ladder indices 263200 bytes used for congruent qs and ladders Fatal Error: Invalid factor: 1*9^390607 mod 1303 = 1297 [/CODE]I'm sieving "(3^n7)/2" which I think I can hack to avoid d by testing 9^n7 and disabling the division by 2 test. I'm excited that srsieve2 might be 10100x faster than my hand coded sieve.[/QUOTE] Sorry, but I have not. I am mainly stuck on how to handle terms that are not divisible by the divisor. For example if you had (3^n1)/5, then how do I handle terms where 3^n1 is not already divisible by 5. I think that pfgw actually tests floor((3^n1)/5). The challenge is twofold. Just because p divides 3^n1 for some value of n, it doesn't mean that p divides floor(3^n1)/5 for that same value of n. The converse is also true. p could divide floor(3^n1)/5, but does not divide 3^n1. The algorithm behind srsieve2 works by finding factors of 3^n1 so it cannot find some of the factors that would divide floor((3^n1)/5). Given the values you have for k, b, c, and d, 3^n7 is always even, so always divisible by 2. I could make a change to support sieving sequences meeting these criteria: [LIST][*]d is prime[*]d evenly divides all k*b^n+c terms[/LIST] I would then need to: [LIST][*]Remove all terms where k*b^n+c is divisible by d^2[*]Skip the discrete log where d = p.[/LIST] That would mean that all factors found by the discrete log divide both k*b^n+c and (k*b^n+c)/d. I don't know when I can complete this, but it is the best I can offer at this time. 
[QUOTE=rogue;616610]Sorry, but I have not. I am mainly stuck on how to handle terms that are not divisible by the divisor. For example if you had (3^n1)/5, then how do I handle terms where 3^n1 is not already divisible by 5. I think that pfgw actually tests floor((3^n1)/5). The challenge is twofold. Just because p divides 3^n1 for some value of n, it doesn't mean that p divides floor(3^n1)/5 for that same value of n. The converse is also true. p could divide floor(3^n1)/5, but does not divide 3^n1. The algorithm behind srsieve2 works by finding factors of 3^n1 so it cannot find some of the factors that would divide floor((3^n1)/5).
Given the values you have for k, b, c, and d, 3^n7 is always even, so always divisible by 2. I could make a change to support sieving sequences meeting these criteria: [LIST][*]d is prime[*]d evenly divides all k*b^n+c terms[/LIST] I would then need to: [LIST][*]Remove all terms where k*b^n+c is divisible by d^2[*]Skip the discrete log where d = p.[/LIST] That would mean that all factors found by the discrete log divide both k*b^n+c and (k*b^n+c)/d. I don't know when I can complete this, but it is the best I can offer at this time.[/QUOTE] I coded my own slower sieve so this is low priority, I'll also play around and see if I can hack mtsieve to work for my specific case. 
[QUOTE=rogue;616610]Sorry, but I have not. I am mainly stuck on how to handle terms that are not divisible by the divisor. For example if you had (3^n1)/5, then how do I handle terms where 3^n1 is not already divisible by 5. I think that pfgw actually tests floor((3^n1)/5). The challenge is twofold. Just because p divides 3^n1 for some value of n, it doesn't mean that p divides floor(3^n1)/5 for that same value of n. The converse is also true. p could divide floor(3^n1)/5, but does not divide 3^n1. The algorithm behind srsieve2 works by finding factors of 3^n1 so it cannot find some of the factors that would divide floor((3^n1)/5).
Given the values you have for k, b, c, and d, 3^n7 is always even, so always divisible by 2. I could make a change to support sieving sequences meeting these criteria: [LIST][*]d is prime[*]d evenly divides all k*b^n+c terms[/LIST] I would then need to: [LIST][*]Remove all terms where k*b^n+c is divisible by d^2[*]Skip the discrete log where d = p.[/LIST] That would mean that all factors found by the discrete log divide both k*b^n+c and (k*b^n+c)/d. I don't know when I can complete this, but it is the best I can offer at this time.[/QUOTE] I coded my own slower sieve so this is low priority, I'll also play around and see if I can hack mtsieve to work for my specific case (I agree with your analysis of what would need to be done) I'm surprised that pfgw takes the floor of the division. I have no idea what you should do in that case. 
Solved my issue
I found the source of my problem by compiling with `make DEBUG=yes` and using gdb.
The code was running from CisOneWithOneSequenceWorker which I eventually parsed in the symbolic domain as "C Is One" instead of just a random identifier. [CODE] void SierpinskiRieselApp::ValidateOptions(void) { seq_t *seqPtr; ib_CanUseCIsOneLogic = true; [/CODE]Critically this is called AFTER AddSequence resetting ib_CanUseCIsOneLogic [CODE] void SierpinskiRieselApp::AddSequence(uint64_t k, int64_t c, uint32_t d) { ... uint64_t absc = abs(c); if (absc != 1) ib_CanUseCIsOneLogic = false; [/CODE] I can "fix" the invalid factors issue by commenting out the assignment to true. I suspect you're the person who should figure out the correct fix. 
AddSequence() should be called before MakeSubsequences(), which where the Helper is created. Is the wrong Helper created? It should be creating GenericSequenceHelper.

[QUOTE=rogue;616633]AddSequence() should be called before MakeSubsequences(), which where the Helper is created. Is the wrong Helper created? It should be creating GenericSequenceHelper.[/QUOTE]
The flow I'm seeing is [B]core/main:main[/B] calls [B]ProcessArgs[/B] calls [B]SierpinskiRieselApp::ParseOptions[/B] calls [B]SierpinskiRieselApp::ValidateAndAddNewSequence[/B] calls [B]SierpinskiRieselApp::AddSequence[/B] where ib_CanUseCIsOneLogic set false after that [B]core/main:main[/B] calls [B]core/App:Run[/B] calls [B]SierpinskiRieselApp::ValidateOptions[/B] which resets ib_CanUseCIsOneLogic = true; then calls [B]SierpinskiRieselApp::MakeSubsequences[/B] which creates [B]CisOneWithOneSequenceHelper[/B] when it should be creating [B]GenericSequenceHelper[/B] it feels like `ib_CanUseCIsOneLogic = true;` should probably be moved to the constructor [CODE] Index: sierpinski_riesel/SierpinskiRieselApp.cpp ===================================================================  sierpinski_riesel/SierpinskiRieselApp.cpp (revision 205) +++ sierpinski_riesel/SierpinskiRieselApp.cpp (working copy) @@ 48,6 +48,8 @@ SetAppMinPrime(3); SetAppMaxPrime(PMAX_MAX_62BIT); + ib_CanUseCIsOneLogic = true; + ii_MinN = 0; ii_MaxN = 0; it_Format = FF_ABCD; @@ 233,7 +235,6 @@ void SierpinskiRieselApp::ValidateOptions(void) { seq_t *seqPtr;  ib_CanUseCIsOneLogic = true; if (it_Format == FF_UNKNOWN) FatalError("the specified file format in not valid, use A (ABC), D (ABCD), P (ABC with number_primes), or B (BOINC)"); [/CODE] 
Can you update mtsieve to make it be able to sieve the sequence (a*b^n+c)/gcd(a+c,b1) with gcd(a+c,b1) > 1 (for a>=1, b>=2, c != 0, gcd(a,c) = 1, gcd(b,c) = 1)?
e.g. (113*13^n5)/12, which has no single known prime or PRP, and is the number 9555...555 with n 5's in base 13, if such prime exists, then the smallest such prime would be a minimal prime (start with b+1) in base b = 13, see [URL="https://github.com/xayahrainie4793/quasimepndata"]this page[/URL] For this family, one would sieve the sequence 113*13^n5 for primes not dividing 12, i.e. sieve starting with the prime 5 (and ending with 10^9 or more), and initialized the list of candidates to not include n for which 2 or 3 divides (113*13^n5)/12 Most of the unsolved families in the "minimal prime problem" require this updating, e.g. [URL="https://github.com/xayahrainie4793/quasimepndata/blob/main/left19"]the unsolved families in base 19[/URL] and [URL="https://github.com/xayahrainie4793/quasimepndata/blob/main/left21"]the unsolved families in base 21[/URL] 
I see what you are saying. The line in ValidateOptions should be in the constructor. Thanks for finding it.

[QUOTE=sweety439;616673]Can you update mtsieve to make it be able to sieve the sequence (a*b^n+c)/gcd(a+c,b1) with gcd(a+c,b1) > 1 (for a>=1, b>=2, c != 0, gcd(a,c) = 1, gcd(b,c) = 1)?
e.g. (113*13^n5)/12, which has no single known prime or PRP, and is the number 9555...555 with n 5's in base 13, if such prime exists, then the smallest such prime would be a minimal prime (start with b+1) in base b = 13, see [URL="https://github.com/xayahrainie4793/quasimepndata"]this page[/URL] For this family, one would sieve the sequence 113*13^n5 for primes not dividing 12, i.e. sieve starting with the prime 5 (and ending with 10^9 or more), and initialized the list of candidates to not include n for which 2 or 3 divides (113*13^n5)/12 Most of the unsolved families in the "minimal prime problem" require this updating, e.g. [URL="https://github.com/xayahrainie4793/quasimepndata/blob/main/left19"]the unsolved families in base 19[/URL] and [URL="https://github.com/xayahrainie4793/quasimepndata/blob/main/left21"]the unsolved families in base 21[/URL][/QUOTE] The answer at this time is "no". I stated above the limitations I can add to srsieve2. What you are asking for doesn't meet the criteria that I listed above (d must be prime and d must divide k*b^n+c for all n). It looks like SethTho is going to get that working. More investigation needs to be done to support other conditions. For example I think that pfgw uses floor(), but it might use round() or ceil(). I have to figure that out before doing anything along the lines of what you are asking. In the interim, if you know C++, you can possibly investigate writing your own sieve based upon the framework. 
[QUOTE=rogue;616701]The answer at this time is "no". I stated above the limitations I can add to srsieve2. What you are asking for doesn't meet the criteria that I listed above (d must be prime and d must divide k*b^n+c for all n). It looks like SethTho is going to get that working. More investigation needs to be done to support other conditions. For example I think that pfgw uses floor(), but it might use round() or ceil(). I have to figure that out before doing anything along the lines of what you are asking.
In the interim, if you know C++, you can possibly investigate writing your own sieve based upon the framework.[/QUOTE] Then, can we use mtsieve to sieve the sequence 113*13^n5 with the primes p>=5 (and initialized the list of candidates to not include n for which 2 or 3 divides (113*13^n5)/12, i.e. only include the n == 2, 4 mod 6)? Or it will return error as every term is divisible by 12? 
[QUOTE=sweety439;616705]Then, can we use mtsieve to sieve the sequence 113*13^n5 with the primes p>=5 (and initialized the list of candidates to not include n for which 2 or 3 divides (113*13^n5)/12, i.e. only include the n == 2, 4 mod 6)? Or it will return error as every term is divisible by 12?[/QUOTE]
I cannot tell you. I haven't determined the behavior to use yet. That is much easier said than done. I don't know when I am going to look into it. 
When I wrote my quick discretelog solver I added this very useful optimization. I looked over the code and didn't find it in mtsieve.
When considering a single sequence k*b^n+c with 0 <= n <= N (e.g. in the sierpinski riesel) I solve the discrete_log(b, p, mod_inv(k) * c) = x <=> b^x = mod_inv(k, p) * c <=> p  k*b^x+c For each prime I'm hoping x <= MaxN, I noticed that when x > MaxN I don't care about the value. So I only compute babysteps/giantsteps up to min(MaxN, order_b(p1)) With MaxN = 1e6 and p near 1e9 this is around 2*sqrt(1e6)=2000 steps vs 2*sqrt(1e9)=63000 steps! Maybe this is already covered by GenericWorker::InitializeWorker when choosing r? or in some other way I don't understand? 
[QUOTE=SethTro;617162]When I wrote my quick discretelog solver I added this very useful optimization. I looked over the code and didn't find it in mtsieve.
When considering a single sequence k*b^n+c with 0 <= n <= N (e.g. in the sierpinski riesel) I solve the discrete_log(b, p, mod_inv(k) * c) = x <=> b^x = mod_inv(k, p) * c <=> p  k*b^x+c For each prime I'm hoping x <= MaxN, I noticed that when x > MaxN I don't care about the value. So I only compute babysteps/giantsteps up to min(MaxN, order_b(p1)) With MaxN = 1e6 and p near 1e9 this is around 2*sqrt(1e6)=2000 steps vs 2*sqrt(1e9)=63000 steps! Maybe this is already covered by GenericWorker::InitializeWorker when choosing r? or in some other way I don't understand?[/QUOTE] Look at FindBestQ() is computed in the Helper class. q is computed once and is independent of p. This allows the discrete log to take the same amount of time regardless of p. 
[QUOTE=rogue;617176]Look at FindBestQ() is computed in the Helper class. q is computed once and is independent of p. This allows the discrete log to take the same amount of time regardless of p.[/QUOTE]
I've only spent a couple of days on this problem so I'm missing a lot of the deeper context you have. I'd love to understand why q makes the discrete log constant time (or the same per batch?). If there's an article or paper on the subject I'd enjoy reading it. My rough understanding of srsieve is that you convert b^n (in my case 9^n) into several subsequences 9^0*(9^3)^a, 9^1*(9^3)^a, 9^2*(9^3)^a. After this some of these sub sequence will always have trivial divisors (related to the residual classes) and can entirely be dropped. 
[QUOTE=SethTro;617512]I've only spent a couple of days on this problem so I'm missing a lot of the deeper context you have. I'd love to understand why q makes the discrete log constant time (or the same per batch?). If there's an article or paper on the subject I'd enjoy reading it.
My rough understanding of srsieve is that you convert b^n (in my case 9^n) into several subsequences 9^0*(9^3)^a, 9^1*(9^3)^a, 9^2*(9^3)^a. After this some of these sub sequence will always have trivial divisors (related to the residual classes) and can entirely be dropped.[/QUOTE] The code in srsieve2/srsieve2cl originated from srsieve/sr1sieve. I was not the writer of that code. I do not recall if any of that code referenced a wiki page or other document specifying the logic behind the discrete log. 
I found a bug with HashTable that when I fix gives me a 2030% speedup (at least for srsieve2). I'm looking around but I don't see a test suite in the SVN repository how do you test changes?

[QUOTE=SethTro;617907]I found a bug with HashTable that when I fix gives me a 2030% speedup (at least for srsieve2). I'm looking around but I don't see a test suite in the SVN repository how do you test changes?[/QUOTE]
Nice! I do not have a test suite. I typically run a set of tests with the previous version and the new version to compare results. If invalid factors are found or factors are missed, then I need to dig further. What is the bug? I assume this doesn't impact the number of factors found, but impacts either putting something into the hash table or looking something up. It is possible that this impacts the OpenCL kernel as well. 
The bug is that `htable` is initializing and cleared with `0` instead of `empty_slot`.
This causes Insert to mark every entry as a probing conflict, e.g. `htable[slot] = (J  HASH_MASK1)`. On Lookup this causes additional probing. I'm working on getting your code into a git repository so I can break out my changes into smaller commits. I should have a small diff by the end of the day. 
The change is here: [url]https://github.com/sethtroisi/mtsieve/commit/e5337806c3a54ebd3657b168587ada11384c9f6e[/url]
Before [CODE] five:~/Projects/mtsievegit$ ./srsieve2_clean fA W1 p 8 P 1e8 n 100000 N 250000 s "4*9^n7" Sieve started: 8 < p < 1e8 with 150001 terms (100000 < n < 250000, k*9^n7) (expecting 133067 factors) CPU time: 18.90 sec. (0.04 sieving) (0.90 cores) Primes tested: 5761244. Factors found: 126233. Remaining terms: 23768. Time: 20.99 seconds. five:~/Projects/mtsievegit$ ./srsieve2_clean fA W8 p 8 P 1e9 n 100000 N 250000 s "4*9^n7" Sieve started: 8 < p < 1e9 with 150001 terms (100000 < n < 250000, k*9^n7) (expecting 134949 factors) CPU time: 178.19 sec. (4.04 sieving) (7.21 cores) Primes tested: 50847324. Factors found: 128864. Remaining terms: 21137. Time: 24.69 seconds. [/CODE] After [CODE] five:~/Projects/mtsievegit$ ./srsieve2 fA W1 p 8 P 1e8 n 100000 N 250000 s "4*9^n7" Sieve started: 8 < p < 1e8 with 150001 terms (100000 < n < 250000, k*9^n7) (expecting 133067 factors) CPU time: 15.23 sec. (0.04 sieving) (0.88 cores) Primes tested: 5761244. Factors found: 126233. Remaining terms: 23768. Time: 17.30 seconds. five:~/Projects/mtsievegit$ ./srsieve2 fA W8 p 8 P 1e9 n 100000 N 250000 s "4*9^n7" Sieve started: 8 < p < 1e9 with 150001 terms (100000 < n < 250000, k*9^n7) (expecting 134949 factors) CPU time: 144.05 sec. (4.36 sieving) (7.08 cores) Primes tested: 50847324. Factors found: 128864. Remaining terms: 21137. Time: 20.34 seconds. [/CODE] 18.9 / 15.23 = 24% faster 178.19 / 144.05 = 24% faster 
Looks like the removal of memset() is where the performance boost comes from. How much of an improvement will likely be compiler specific. The OpenCL code does not use memset(), so it won't gain anything. I see that you removed the other constructor. It might be used by one of the other sieves.
This is a really nice performance boost. I greatly appreciate it. Not many people have been brave enough to look at my code and offer optimizations like this. I will look into making the changes in sourceforge and getting a new release out soon. 
[QUOTE=rogue;617961]Looks like the removal of memset() is where the performance boost comes from. How much of an improvement will likely be compiler specific. The OpenCL code does not use memset(), so it won't gain anything. I see that you removed the other constructor. It might be used by one of the other sieves.
This is a really nice performance boost. I greatly appreciate it. Not many people have been brave enough to look at my code and offer optimizations like this. I will look into making the changes in sourceforge and getting a new release out soon.[/QUOTE] I don't believe removing memset changes the performance, I still have to write a value to every index. The real benefit is you go from 100% conflicts to 816% on Insert() and Lookup() I have a little more diagnostics in [url]https://github.com/sethtroisi/mtsieve/commit/408e5ce3b17f69e119985c4788511d3ab577cf8a[/url] I also fixed a segfault in OpenCL with [url]https://github.com/sethtroisi/mtsieve/commit/7fc02473dc2b5caf6d6fd658914b92914b0ac1f3[/url] 
In two quick checks I saw a 10% improvement in the openCL code by making this [URL="https://github.com/sethtroisi/mtsieve/commit/910f749dc6fb2cef14514e712ccf5adf9be33fca"]one line fix[/URL]
[CODE] for (idx=0; idx<HASH_SIZE; idx++)  h_table[idx] = 0; + h_table[idx] = HASH_ELEMENTS; [/CODE] [CODE] # Before $ ./srsieve2cl platform=1 W1 G1 fA p 1e9 P 20e8 i b9_n_1e9.abc CPU time: 56.43 sec. (1.21 sieving) (1.95 cores) GPU time: 26.96 sec. Primes tested: 47432100. Factors found: 660. Remaining terms: 20477. Time: 28.82 seconds. # After $ $ ./srsieve2cl platform=1 W1 G1 fA p 1e9 P 20e8 i b9_n_1e9.abc CPU time: 51.72 sec. (0.93 sieving) (1.96 cores) GPU time: 24.91 sec. Primes tested: 47410816. Factors found: 659. Remaining terms: 20478. Time: 26.26 seconds. [/CODE] 
[QUOTE=SethTro;617967]In two quick checks I saw a 10% improvement in the openCL code by making this [URL="https://github.com/sethtroisi/mtsieve/commit/910f749dc6fb2cef14514e712ccf5adf9be33fca"]one line fix[/URL]
[CODE] for (idx=0; idx<HASH_SIZE; idx++)  h_table[idx] = 0; + h_table[idx] = HASH_ELEMENTS; [/CODE] [CODE] # Before $ ./srsieve2cl platform=1 W1 G1 fA p 1e9 P 20e8 i b9_n_1e9.abc CPU time: 56.43 sec. (1.21 sieving) (1.95 cores) GPU time: 26.96 sec. Primes tested: 47432100. Factors found: 660. Remaining terms: 20477. Time: 28.82 seconds. # After $ $ ./srsieve2cl platform=1 W1 G1 fA p 1e9 P 20e8 i b9_n_1e9.abc CPU time: 51.72 sec. (0.93 sieving) (1.96 cores) GPU time: 24.91 sec. Primes tested: 47410816. Factors found: 659. Remaining terms: 20478. Time: 26.26 seconds. [/CODE][/QUOTE] Any idea why it missed one factor? 
I believe this has to do with mtsieve not handling the end point strictly.
[CODE] five:~/Projects/mtsievegit$ diff 4b9m7_n_20e8_clean.abc 4b9m7_n_20e8_post.abc 1c1 < ABC 4*9^$a7 // Sieved to 2000000063  > ABC 4*9^$a7 // Sieved to 2000288291 [/CODE] My guess is that the if/when you increase worksize leads to a different amount of overrun of P 
[QUOTE=SethTro;617967]
[CODE] # Before $ ./srsieve2cl platform=1 W1 G1 fA p 1e9 P 20e8 i b9_n_1e9.abc CPU time: 56.43 sec. (1.21 sieving) (1.95 cores) GPU time: 26.96 sec. Primes tested: 47432100. Factors found: 660. Remaining terms: 20477. Time: 28.82 seconds. # After $ ./srsieve2cl platform=1 W1 G1 fA p 1e9 P 20e8 i b9_n_1e9.abc CPU time: 51.72 sec. (0.93 sieving) (1.96 cores) GPU time: 24.91 sec. Primes tested: 47410816. Factors found: 659. Remaining terms: 20478. Time: 26.26 seconds. [/CODE][/QUOTE] Using a little unix magic [CODE] $ primesieve 1e9 Primes: 50847534 $ primesieve n $(expr 50847534 + 47432100) Nth prime: 2001223151 $ ./srsieve2cl platform=1 W1 G1 fA p 20e8 P 2001223151 i 4b9m7_n_20e8_clean.abc o temp.abc Primes tested: 73344. Factors found: 1. Remaining terms: 20477. Time: 0.10 seconds. $ diff 4b9m7_n_20e8_clean.abc temp.abc 1c1 < ABC 4*9^$a7 // Sieved to 2000000063  > ABC 4*9^$a7 // Sieved to 2001565729 629d628 < 104493 $ python >>> for p in range(20 * 10 ** 8 + 1, 2001288307, 2): ... if (4 * pow(9, 104493, p)  7) % p == 0: ... print(p) ... 2001059303 $ echo "4*9^1044937"  ecm pm1 14650 0 GMPECM 7.0.5dev [configured with GMP 6.2.99, enableasmredc, enableassert] [P1] Input number is 4*9^1044937 (99713 digits) Using B1=14650, B2=0, polynomial x^1, x0=2592079789 Step 1 took 38713ms ********** Factor found in step 1: 2001059303 [/CODE] So there's a factor "just" after 20e8 and one of the two runs had a worksize that overran 20e8 enough to include it while the other didn't. I looked briefly into what it would take to make the worksize variable so that the bounds could be strict but it's not pretty. It would probably make more sense to pad the last batch with the last prime a bunch of times (and verify that if that primes divides a sequence it doesn't count multiple times) 
[QUOTE=SethTro;617978]So there's a factor "just" after 20e8 and one of the two runs had a worksize that overran 20e8 enough to include it while the other didn't.
I looked briefly into what it would take to make the worksize variable so that the bounds could be strict but it's not pretty. It would probably make more sense to pad the last batch with the last prime a bunch of times (and verify that if that primes divides a sequence it doesn't count multiple times)[/QUOTE] This is quite okay and explains the discrepancy. The worksize differs between GPU and CPU so in one case the CPU got the last chunk of work in the other the GPU got it. I won't make any changes here. 
Upon review, this is for a single sequence and you have doubled the size of the hash table. Have you done any testing with multiple sequences? I am concerned that the change to GenericGpuWorker will negatively affect memory requirements for the GPU when one has hundreds or thousands of sequences.

[QUOTE=rogue;617993]Upon review, this is for a single sequence and you have doubled the size of the hash table. Have you done any testing with multiple sequences? I am concerned that the change to GenericGpuWorker will negatively affect memory requirements for the GPU when one has hundreds or thousands of sequences.[/QUOTE]
I was testing changing the hash table size but it doesn't seem to have much impact (at least on my few tests). I'm happy to test with more sequences, do you have a recommendation for how many sequences at a time I should test? and how large those sequences should be? You can test with just the one line change, or the one line change + [URL="https://github.com/sethtroisi/mtsieve/commit/0c02e047f199e7463e8eceb09431f3e9d98c7f4e"]refactor,[/URL] or with [URL="https://github.com/sethtroisi/mtsieve/compare/base...hash_table_clean"]all my hash table changes[/URL]. On my GPU (1080ti) I consistently see a +510% performance improvement. [CODE] $ ./srsieve2cl_clean platform=1 W0 G1 fA p 1e9 P 20e8 i 4b9m7_n_1e9.abc Sieve started: 1e9 < p < 2e9 with 21137 terms (100012 < n < 249988, k*9^n7) (expecting 684 factors) CPU time: 35.73 sec. (0.64 sieving) (0.98 cores) GPU time: 34.83 sec. Primes tested: 47423488. Factors found: 659. Remaining terms: 20478. Time: 36.45 seconds. Segmentation fault (core dumped) $ ./srsieve2cl platform=1 W0 G1 fA p 1e9 P 20e8 i 4b9m7_n_1e9.abc Sieve started: 1e9 < p < 2e9 with 21137 terms (100012 < n < 249988, k*9^n7) (expecting 684 factors) CPU time: 33.44 sec. (0.60 sieving) (0.98 cores) GPU time: 32.26 sec. Primes tested: 47423488. Factors found: 659. Remaining terms: 20478. Time: 33.84 seconds. [/CODE] 
Is suggest testing with 1, 10, 10, and 1000 sequences. You can generate sequences or you can use the CRUS page to find sequences that will provide more meaningful results. Use n1e5 N2e5. You can then test p1e9 p1e10. It will take longer to run, but reduces impact of other things you might be doing on your computer.
I would like to see before and after for CPU and GPU. With srsieve2, no W. With srsieve2cl, no G. Assuming all looks good, can you email me the files that you have changed. It would make it easier for me to evaluate what you have changed. Also send me the starting sequence files so that I can do my own testing. I hope that I am not asking too much from you. I do intend to incorporate your changes into the code base. I could also make you a developer on the project so that you can commit directly to sourceforge. 
[QUOTE=rogue;618008]I hope that I am not asking too much from you. I do intend to incorporate your changes into the code base.[/QUOTE]
I appreciate you communicating quickly here and being receptive to changes. [QUOTE=rogue;618008]Is suggest testing with 1, 10, 10, and 1000 sequences. You can generate sequences or you can use the CRUS page to find sequences that will provide more meaningful results. Use n1e5 N2e5. You can then test p1e9 p1e10. It will take longer to run, but reduces impact of other things you might be doing on your computer. I would like to see before and after for CPU and GPU. With srsieve2, no W. With srsieve2cl, no G. Assuming all looks good, can you email me the files that you have changed. It would make it easier for me to evaluate what you have changed. Also send me the starting sequence files so that I can do my own testing. I hope that I am not asking too much from you. I do intend to incorporate your changes into the code base. I could also make you a developer on the project so that you can commit directly to sourceforge.[/QUOTE] When i run `./srsieve2 fA p 1e8 p 2e8 n1e5 N2e5 s"118392*51^n+1" ` I get an error with [CODE]srsieve2: sierpinski_riesel/CisOneWithOneSequenceHelper.cpp:282: void CisOneWithOneSequenceHelper::MakeLadder(uint16_t*, uint32_t): Assertion `qListLen+1 < ii_BestQ' failed. [/CODE]which I don't think I caused, but is maybe because of my input? 
I suggest that you start by sieving the sequence to 1e8, then continue sieving with the output file that was generated. For example:
./srsieve2 P1e6 n1e5 N2e5 s"118392*51^n+1" followed by: ./srsieve2 P2e8 ib51_n.abcd The first execution will remove most terms. The second will provide the stats we care about. What you input might be a problem because you are overriding the starting value for p. At p=1e6 it recomputes bestQ. 
Results
3 Attachment(s)
[B]Results[/B]
I pulled down a few CRUS sequences and choose 1, 10, 100, 1000 k for random bases (I attach these files) All of this was run on a Ryzen 3900x with no other load, I tested p 1e8 to P 2e8 (to get results faster) but can test larger ranges if this doesn't convince you. [CODE] $ wc crus_seqs_rand* 1000 3000 15978 crus_seqs_rand1000.txt 100 300 1534 crus_seqs_rand100.txt 10 30 158 crus_seqs_rand10.txt 1 3 13 crus_seqs_rand1.txt $ cat crus_seqs_randX.txt  awk F", " '{ print "s\"" $1 "*" $2 "^n" $3 "\"" }'  tr '\n' ' ' > seqsX.txt $ cat seqs10.txt s"3677878*3^n1" s"6793112*3^n1" s"10463066*3^n1" s"10789522*3^n1" s"11033634*3^n1" s"16874152*3^n1" s"18137648*3^n1" s"20379336*3^n1" s"21368582*3^n1" s"29140796*3^n1" $ echo "./srsieve2 fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs10.txt)" ./srsieve2 fA p 1e8 p 2e8 n1e5 N2e5 s"3677878*3^n1" s"6793112*3^n1" s"10463066*3^n1" s"10789522*3^n1" s"11033634*3^n1" s"16874152*3^n1" s"18137648*3^n1" s"20379336*3^n1" s"21368582*3^n1" s"29140796*3^n1" $ eval "./srsieve2 fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs10.txt)" [/CODE] With 10 sequences [CODE] $ ./srsieve2_clean fA p 1e8 P 2e8 n1e5 N2e5 s"3677878*3^n1" s"6793112*3^n1" s"10463066*3^n1" s"10789522*3^n1" s"11033634*3^n1" s"16874152*3^n1" s"18137648*3^n1" s"20379336*3^n1" s"21368582*3^n1" s"29140796*3^n1" Sieve started: 1e8 < p < 2e8 with 1000010 terms (100000 < n < 200000, k*3^n1) (expecting 36264 factors) CPU time: 58.77 sec. (0.05 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 36406. Remaining terms: 963604. Time: 58.60 seconds. $ ./srsieve2 fA p 1e8 P 2e8 n1e5 N2e5 s"3677878*3^n1" s"6793112*3^n1" s"10463066*3^n1" s"10789522*3^n1" s"11033634*3^n1" s"16874152*3^n1" s"18137648*3^n1" s"20379336*3^n1" s"21368582*3^n1" s"29140796*3^n1" Sieve started: 1e8 < p < 2e8 with 1000010 terms (100000 < n < 200000, k*3^n1) (expecting 36264 factors) CPU time: 47.23 sec. (0.05 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 36406. Remaining terms: 963604. Time: 47.19 seconds. # With W8 CPU time: 62.63 sec. (0.50 sieving) (7.84 cores) vs CPU time: 48.75 sec. (0.54 sieving) (7.79 cores) [/CODE]With 100 sequences [CODE] $ eval ./srsieve2_clean fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs100.txt) Sieve started: 1e8 < p < 2e8 with 10000100 terms (100000 < n < 200000, k*31^n+1) (expecting 362645 factors) p=130433617, 26.90K p/sec, 142003 factors found at 2.321K f/sec (last 1 min), 30.4% done. ETC 20221117 13:48 p=193974931, 27.34K p/sec, 347298 factors found at 1.488K f/sec (last 1 min), 93.9% done. ETC 20221117 13:48 CPU time: 195.54 sec. (0.05 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 362592. Remaining terms: 9637508. Time: 194.95 seconds. $ eval ./srsieve2 fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs100.txt) Sieve started: 1e8 < p < 2e8 with 10000100 terms (100000 < n < 200000, k*31^n+1) (expecting 362645 factors) p=136261487, 32.20K p/sec, 165361 factors found at 2.720K f/sec (last 1 min), 36.2% done. ETC 20221117 13:52 p=173591807, 32.33K p/sec, 290747 factors found at 2.048K f/sec (last 1 min), 73.5% done. ETC 20221117 13:52 CPU time: 164.43 sec. (0.04 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 362592. Remaining terms: 9637508. Time: 163.99 seconds. # With W8 CPU time: 206.01 sec. (0.55 sieving) (7.92 cores) vs CPU time: 172.62 sec. (0.50 sieving) (7.90 cores) [/CODE] With 1000 sequences [CODE] $ eval ./srsieve2_clean W8 fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs1000.txt) p=146758463, 41.42K p/sec, 2043151 factors found at 4.210K f/sec (last 1 min), 46.7% done. ETC 20221117 13:59 p=194516929, 41.36K p/sec, 3489737 factors found at 2.971K f/sec (last 1 min), 94.5% done. ETC 20221117 13:59 CPU time: 1026.07 sec. (0.61 sieving) (7.96 cores) Primes tested: 5317484. Factors found: 3626818. Remaining terms: 96374182. Time: 128.82 seconds. $ eval ./srsieve2 W8 fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs1000.txt) p=150323629, 44.44K p/sec, 2167391 factors found at 4.458K f/sec (last 1 min), 50.3% done. ETC 20221117 14:02 CPU time: 957.16 sec. (0.71 sieving) (7.96 cores) Primes tested: 5317484. Factors found: 3626818. Remaining terms: 96374182. Time: 120.17 seconds. [/CODE] 
Better test runner script
[CODE] rm temp*.out for N in {1,10,100,100}; do cat "crus_seqs_rand${N}.txt"  awk F", " '{ print "s\"" $1 "*" $2 "^n" $3 "\"" }'  tr '\n' ' ' > "seqs${N}.txt" wc "seqs${N}.txt" echo eval ./srsieve2_clean P 1e8 n1e5 N2e5 $(cat seqs${N}.txt) o temp_${N}.in echo eval "./srsieve2_clean P 2e8 i temp_${N}.in o temp_${N}_befor1.out echo eval "./srsieve2_clean P 1e9 W8 i temp_${N}.in o temp_${N}_befor8.out echo eval "./srsieve2 P 2e8 i temp_${N}.in o temp_${N}_after1.out echo eval "./srsieve2 P 1e9 W8 i temp_${N}.in o temp_${N}_after8.out echo e "\n\n" done [/CODE] Results [CODE] [B]1 Sequence[/B] (using 10x higher bounds; 1e9, 1e10) Before CPU time: 21.90 sec. (0.26 sieving) (1.20 cores) Primes tested: 40351310. Factors found: 225. Remaining terms: 2731. Time: 18.20 seconds. CPU time: 219.52 sec. (27.57 sieving) (8.05 cores) Primes tested: 444556287. Factors found: 517. Remaining terms: 2439. Time: 27.26 seconds. After CPU time: 20.96 sec. (0.27 sieving) (1.20 cores) Primes tested: 40351310. Factors found: 225. Remaining terms: 2731. Time: 17.45 seconds. CPU time: 212.39 sec. (28.63 sieving) (8.03 cores) Primes tested: 444556287. Factors found: 517. Remaining terms: 2439. Time: 26.41 seconds. [B]10 Sequences[/B] Before CPU time: 25.28 sec. (0.06 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 757. Remaining terms: 19928. Time: 25.23 seconds. CPU time: 231.81 sec. (3.57 sieving) (7.91 cores) [B](This is 9X the work so expected that it takes longer)[/B] Primes tested: 45086080. Factors found: 2277. Remaining terms: 18408. Time: 29.30 seconds. After CPU time: 22.14 sec. (0.05 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 757. Remaining terms: 19928. Time: 22.13 seconds. CPU time: 202.03 sec. (4.27 sieving) (7.91 cores) Primes tested: 45086080. Factors found: 2277. Remaining terms: 18408. Time: 25.52 seconds. [B]100 Sequence[/B] Before CPU time: 135.06 sec. (0.05 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 21073. Remaining terms: 563114. Time: 134.57 seconds. CPU time: 1223.59 sec. (4.01 sieving) (7.96 cores) Primes tested: 45086080. Factors found: 64678. Remaining terms: 519509. Time: 153.59 seconds. After CPU time: 127.34 sec. (0.06 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 21073. Remaining terms: 563114. Time: 126.80 seconds. CPU time: 1155.34 sec. (4.17 sieving) (7.96 cores) Primes tested: 45086080. Factors found: 64678. Remaining terms: 519509. Time: 144.99 seconds. [B]1000 Sequence[/B] (sieved to 11e7 for no W, 2e8 for W8) Before CPU time: 84.85 sec. (0.00 sieving) (1.02 cores) Primes tested: 541856. Factors found: 23720. Remaining terms: 4588834. Time: 83.03 seconds. CPU time: 869.99 sec. (0.42 sieving) (7.97 cores) Primes tested: 5317484. Factors found: 167203. Remaining terms: 4445351. Time: 109.13 seconds. After CPU time: 85.95 sec. (0.00 sieving) (1.02 cores) Primes tested: 541856. Factors found: 23720. Remaining terms: 4588834. Time: 84.07 seconds. CPU time: 848.70 sec. (0.51 sieving) (7.97 cores) Primes tested: 5317484. Factors found: 167203. Remaining terms: 4445351. Time: 106.48 seconds. [/CODE] the output files are exact matches. I ran the single sequence to high bounds (1e9, 1e10). It is the only sequence not using generic logic and doesn't show a speedup. The 10/100 sequence results are faster, the 1/1000 sequence results are equal. I suspect the single sequence is because it's not using the generic logic. Maybe with 1000 sequences I'm hitting cache limits (I could profile with perf if it was important to verify that thesis) 
[QUOTE=SethTro;618023]Better test runner script...[/QUOTE]
IMO, this is an excellent example of excellent work. Test the code nine ways to Sunday before you even begin to trust it. Particularly if it is written by oneself (who, heuristically, is prone to mistakes). Only human... 8) 
@rogue:
Why when click "download" in the page [URL="https://sourceforge.net/projects/mtsieve/"]https://sourceforge.net/projects/mtsieve/[/URL], it only has "srsieve2cl.exe"? It seems that we should click [URL="https://sourceforge.net/projects/mtsieve/files/mtsieve_2.3.3.7z/download"]https://sourceforge.net/projects/mtsieve/files/mtsieve_2.3.3.7z/download[/URL] to download the full mtsieve 
[QUOTE=sweety439;618045]@rogue:
Why when click "download" in the page [URL="https://sourceforge.net/projects/mtsieve/"]https://sourceforge.net/projects/mtsieve/[/URL], it only has "srsieve2cl.exe"? It seems that we should click [URL="https://sourceforge.net/projects/mtsieve/files/mtsieve_2.3.3.7z/download"]https://sourceforge.net/projects/mtsieve/files/mtsieve_2.3.3.7z/download[/URL] to download the full mtsieve[/QUOTE] I'm not certain why it does that. After these changes are integrated I will delete any standalone exe files in sourceforge so it should d/l the 7z with all exes. 
SethPro, did you output the number of collisions vs inserts in the hash table from those tests? Maybe there are opportunities to sizing the hash table better.

[QUOTE=SethTro;617967]In two quick checks I saw a 10% improvement in the openCL code by making this [URL="https://github.com/sethtroisi/mtsieve/commit/910f749dc6fb2cef14514e712ccf5adf9be33fca"]one line fix[/URL]
[CODE] for (idx=0; idx<HASH_SIZE; idx++)  h_table[idx] = 0; + h_table[idx] = HASH_ELEMENTS; [/CODE][/QUOTE] I have confirmed the 10% improvement in the OpenCL performance. This is due to choosing a better code path when inserting into the hash table. I was sieving over 3600 at a time. I have not tested the CPU logic changes yet. 
I have still not been able to test the CPU changes as I have been busy, but I have uploaded mtsieve 2.3.5 and all executables to sourceforge. Here are the changes:
[code] framework: Added code to destuctors to free allocated memory. Performance updates to HashTable. srsieve2/srieve2cl: version 1.6.5 Fixed HashTable usage to get up to 10% better performance. [/code] 
All times are UTC. The time now is 01:18. 
Powered by vBulletin® Version 3.8.11
Copyright ©2000  2023, Jelsoft Enterprises Ltd.