[QUOTE=SethTro;617967]In two quick checks I saw a 10% improvement in the openCL code by making this [URL="https://github.com/sethtroisi/mtsieve/commit/910f749dc6fb2cef14514e712ccf5adf9be33fca"]one line fix[/URL]
[CODE] for (idx=0; idx<HASH_SIZE; idx++)  h_table[idx] = 0; + h_table[idx] = HASH_ELEMENTS; [/CODE] [CODE] # Before $ ./srsieve2cl platform=1 W1 G1 fA p 1e9 P 20e8 i b9_n_1e9.abc CPU time: 56.43 sec. (1.21 sieving) (1.95 cores) GPU time: 26.96 sec. Primes tested: 47432100. Factors found: 660. Remaining terms: 20477. Time: 28.82 seconds. # After $ $ ./srsieve2cl platform=1 W1 G1 fA p 1e9 P 20e8 i b9_n_1e9.abc CPU time: 51.72 sec. (0.93 sieving) (1.96 cores) GPU time: 24.91 sec. Primes tested: 47410816. Factors found: 659. Remaining terms: 20478. Time: 26.26 seconds. [/CODE][/QUOTE] Any idea why it missed one factor? 
I believe this has to do with mtsieve not handling the end point strictly.
[CODE] five:~/Projects/mtsievegit$ diff 4b9m7_n_20e8_clean.abc 4b9m7_n_20e8_post.abc 1c1 < ABC 4*9^$a7 // Sieved to 2000000063  > ABC 4*9^$a7 // Sieved to 2000288291 [/CODE] My guess is that the if/when you increase worksize leads to a different amount of overrun of P 
[QUOTE=SethTro;617967]
[CODE] # Before $ ./srsieve2cl platform=1 W1 G1 fA p 1e9 P 20e8 i b9_n_1e9.abc CPU time: 56.43 sec. (1.21 sieving) (1.95 cores) GPU time: 26.96 sec. Primes tested: 47432100. Factors found: 660. Remaining terms: 20477. Time: 28.82 seconds. # After $ ./srsieve2cl platform=1 W1 G1 fA p 1e9 P 20e8 i b9_n_1e9.abc CPU time: 51.72 sec. (0.93 sieving) (1.96 cores) GPU time: 24.91 sec. Primes tested: 47410816. Factors found: 659. Remaining terms: 20478. Time: 26.26 seconds. [/CODE][/QUOTE] Using a little unix magic [CODE] $ primesieve 1e9 Primes: 50847534 $ primesieve n $(expr 50847534 + 47432100) Nth prime: 2001223151 $ ./srsieve2cl platform=1 W1 G1 fA p 20e8 P 2001223151 i 4b9m7_n_20e8_clean.abc o temp.abc Primes tested: 73344. Factors found: 1. Remaining terms: 20477. Time: 0.10 seconds. $ diff 4b9m7_n_20e8_clean.abc temp.abc 1c1 < ABC 4*9^$a7 // Sieved to 2000000063  > ABC 4*9^$a7 // Sieved to 2001565729 629d628 < 104493 $ python >>> for p in range(20 * 10 ** 8 + 1, 2001288307, 2): ... if (4 * pow(9, 104493, p)  7) % p == 0: ... print(p) ... 2001059303 $ echo "4*9^1044937"  ecm pm1 14650 0 GMPECM 7.0.5dev [configured with GMP 6.2.99, enableasmredc, enableassert] [P1] Input number is 4*9^1044937 (99713 digits) Using B1=14650, B2=0, polynomial x^1, x0=2592079789 Step 1 took 38713ms ********** Factor found in step 1: 2001059303 [/CODE] So there's a factor "just" after 20e8 and one of the two runs had a worksize that overran 20e8 enough to include it while the other didn't. I looked briefly into what it would take to make the worksize variable so that the bounds could be strict but it's not pretty. It would probably make more sense to pad the last batch with the last prime a bunch of times (and verify that if that primes divides a sequence it doesn't count multiple times) 
[QUOTE=SethTro;617978]So there's a factor "just" after 20e8 and one of the two runs had a worksize that overran 20e8 enough to include it while the other didn't.
I looked briefly into what it would take to make the worksize variable so that the bounds could be strict but it's not pretty. It would probably make more sense to pad the last batch with the last prime a bunch of times (and verify that if that primes divides a sequence it doesn't count multiple times)[/QUOTE] This is quite okay and explains the discrepancy. The worksize differs between GPU and CPU so in one case the CPU got the last chunk of work in the other the GPU got it. I won't make any changes here. 
Upon review, this is for a single sequence and you have doubled the size of the hash table. Have you done any testing with multiple sequences? I am concerned that the change to GenericGpuWorker will negatively affect memory requirements for the GPU when one has hundreds or thousands of sequences.

[QUOTE=rogue;617993]Upon review, this is for a single sequence and you have doubled the size of the hash table. Have you done any testing with multiple sequences? I am concerned that the change to GenericGpuWorker will negatively affect memory requirements for the GPU when one has hundreds or thousands of sequences.[/QUOTE]
I was testing changing the hash table size but it doesn't seem to have much impact (at least on my few tests). I'm happy to test with more sequences, do you have a recommendation for how many sequences at a time I should test? and how large those sequences should be? You can test with just the one line change, or the one line change + [URL="https://github.com/sethtroisi/mtsieve/commit/0c02e047f199e7463e8eceb09431f3e9d98c7f4e"]refactor,[/URL] or with [URL="https://github.com/sethtroisi/mtsieve/compare/base...hash_table_clean"]all my hash table changes[/URL]. On my GPU (1080ti) I consistently see a +510% performance improvement. [CODE] $ ./srsieve2cl_clean platform=1 W0 G1 fA p 1e9 P 20e8 i 4b9m7_n_1e9.abc Sieve started: 1e9 < p < 2e9 with 21137 terms (100012 < n < 249988, k*9^n7) (expecting 684 factors) CPU time: 35.73 sec. (0.64 sieving) (0.98 cores) GPU time: 34.83 sec. Primes tested: 47423488. Factors found: 659. Remaining terms: 20478. Time: 36.45 seconds. Segmentation fault (core dumped) $ ./srsieve2cl platform=1 W0 G1 fA p 1e9 P 20e8 i 4b9m7_n_1e9.abc Sieve started: 1e9 < p < 2e9 with 21137 terms (100012 < n < 249988, k*9^n7) (expecting 684 factors) CPU time: 33.44 sec. (0.60 sieving) (0.98 cores) GPU time: 32.26 sec. Primes tested: 47423488. Factors found: 659. Remaining terms: 20478. Time: 33.84 seconds. [/CODE] 
Is suggest testing with 1, 10, 10, and 1000 sequences. You can generate sequences or you can use the CRUS page to find sequences that will provide more meaningful results. Use n1e5 N2e5. You can then test p1e9 p1e10. It will take longer to run, but reduces impact of other things you might be doing on your computer.
I would like to see before and after for CPU and GPU. With srsieve2, no W. With srsieve2cl, no G. Assuming all looks good, can you email me the files that you have changed. It would make it easier for me to evaluate what you have changed. Also send me the starting sequence files so that I can do my own testing. I hope that I am not asking too much from you. I do intend to incorporate your changes into the code base. I could also make you a developer on the project so that you can commit directly to sourceforge. 
[QUOTE=rogue;618008]I hope that I am not asking too much from you. I do intend to incorporate your changes into the code base.[/QUOTE]
I appreciate you communicating quickly here and being receptive to changes. [QUOTE=rogue;618008]Is suggest testing with 1, 10, 10, and 1000 sequences. You can generate sequences or you can use the CRUS page to find sequences that will provide more meaningful results. Use n1e5 N2e5. You can then test p1e9 p1e10. It will take longer to run, but reduces impact of other things you might be doing on your computer. I would like to see before and after for CPU and GPU. With srsieve2, no W. With srsieve2cl, no G. Assuming all looks good, can you email me the files that you have changed. It would make it easier for me to evaluate what you have changed. Also send me the starting sequence files so that I can do my own testing. I hope that I am not asking too much from you. I do intend to incorporate your changes into the code base. I could also make you a developer on the project so that you can commit directly to sourceforge.[/QUOTE] When i run `./srsieve2 fA p 1e8 p 2e8 n1e5 N2e5 s"118392*51^n+1" ` I get an error with [CODE]srsieve2: sierpinski_riesel/CisOneWithOneSequenceHelper.cpp:282: void CisOneWithOneSequenceHelper::MakeLadder(uint16_t*, uint32_t): Assertion `qListLen+1 < ii_BestQ' failed. [/CODE]which I don't think I caused, but is maybe because of my input? 
I suggest that you start by sieving the sequence to 1e8, then continue sieving with the output file that was generated. For example:
./srsieve2 P1e6 n1e5 N2e5 s"118392*51^n+1" followed by: ./srsieve2 P2e8 ib51_n.abcd The first execution will remove most terms. The second will provide the stats we care about. What you input might be a problem because you are overriding the starting value for p. At p=1e6 it recomputes bestQ. 
Results
3 Attachment(s)
[B]Results[/B]
I pulled down a few CRUS sequences and choose 1, 10, 100, 1000 k for random bases (I attach these files) All of this was run on a Ryzen 3900x with no other load, I tested p 1e8 to P 2e8 (to get results faster) but can test larger ranges if this doesn't convince you. [CODE] $ wc crus_seqs_rand* 1000 3000 15978 crus_seqs_rand1000.txt 100 300 1534 crus_seqs_rand100.txt 10 30 158 crus_seqs_rand10.txt 1 3 13 crus_seqs_rand1.txt $ cat crus_seqs_randX.txt  awk F", " '{ print "s\"" $1 "*" $2 "^n" $3 "\"" }'  tr '\n' ' ' > seqsX.txt $ cat seqs10.txt s"3677878*3^n1" s"6793112*3^n1" s"10463066*3^n1" s"10789522*3^n1" s"11033634*3^n1" s"16874152*3^n1" s"18137648*3^n1" s"20379336*3^n1" s"21368582*3^n1" s"29140796*3^n1" $ echo "./srsieve2 fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs10.txt)" ./srsieve2 fA p 1e8 p 2e8 n1e5 N2e5 s"3677878*3^n1" s"6793112*3^n1" s"10463066*3^n1" s"10789522*3^n1" s"11033634*3^n1" s"16874152*3^n1" s"18137648*3^n1" s"20379336*3^n1" s"21368582*3^n1" s"29140796*3^n1" $ eval "./srsieve2 fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs10.txt)" [/CODE] With 10 sequences [CODE] $ ./srsieve2_clean fA p 1e8 P 2e8 n1e5 N2e5 s"3677878*3^n1" s"6793112*3^n1" s"10463066*3^n1" s"10789522*3^n1" s"11033634*3^n1" s"16874152*3^n1" s"18137648*3^n1" s"20379336*3^n1" s"21368582*3^n1" s"29140796*3^n1" Sieve started: 1e8 < p < 2e8 with 1000010 terms (100000 < n < 200000, k*3^n1) (expecting 36264 factors) CPU time: 58.77 sec. (0.05 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 36406. Remaining terms: 963604. Time: 58.60 seconds. $ ./srsieve2 fA p 1e8 P 2e8 n1e5 N2e5 s"3677878*3^n1" s"6793112*3^n1" s"10463066*3^n1" s"10789522*3^n1" s"11033634*3^n1" s"16874152*3^n1" s"18137648*3^n1" s"20379336*3^n1" s"21368582*3^n1" s"29140796*3^n1" Sieve started: 1e8 < p < 2e8 with 1000010 terms (100000 < n < 200000, k*3^n1) (expecting 36264 factors) CPU time: 47.23 sec. (0.05 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 36406. Remaining terms: 963604. Time: 47.19 seconds. # With W8 CPU time: 62.63 sec. (0.50 sieving) (7.84 cores) vs CPU time: 48.75 sec. (0.54 sieving) (7.79 cores) [/CODE]With 100 sequences [CODE] $ eval ./srsieve2_clean fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs100.txt) Sieve started: 1e8 < p < 2e8 with 10000100 terms (100000 < n < 200000, k*31^n+1) (expecting 362645 factors) p=130433617, 26.90K p/sec, 142003 factors found at 2.321K f/sec (last 1 min), 30.4% done. ETC 20221117 13:48 p=193974931, 27.34K p/sec, 347298 factors found at 1.488K f/sec (last 1 min), 93.9% done. ETC 20221117 13:48 CPU time: 195.54 sec. (0.05 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 362592. Remaining terms: 9637508. Time: 194.95 seconds. $ eval ./srsieve2 fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs100.txt) Sieve started: 1e8 < p < 2e8 with 10000100 terms (100000 < n < 200000, k*31^n+1) (expecting 362645 factors) p=136261487, 32.20K p/sec, 165361 factors found at 2.720K f/sec (last 1 min), 36.2% done. ETC 20221117 13:52 p=173591807, 32.33K p/sec, 290747 factors found at 2.048K f/sec (last 1 min), 73.5% done. ETC 20221117 13:52 CPU time: 164.43 sec. (0.04 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 362592. Remaining terms: 9637508. Time: 163.99 seconds. # With W8 CPU time: 206.01 sec. (0.55 sieving) (7.92 cores) vs CPU time: 172.62 sec. (0.50 sieving) (7.90 cores) [/CODE] With 1000 sequences [CODE] $ eval ./srsieve2_clean W8 fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs1000.txt) p=146758463, 41.42K p/sec, 2043151 factors found at 4.210K f/sec (last 1 min), 46.7% done. ETC 20221117 13:59 p=194516929, 41.36K p/sec, 3489737 factors found at 2.971K f/sec (last 1 min), 94.5% done. ETC 20221117 13:59 CPU time: 1026.07 sec. (0.61 sieving) (7.96 cores) Primes tested: 5317484. Factors found: 3626818. Remaining terms: 96374182. Time: 128.82 seconds. $ eval ./srsieve2 W8 fA p 1e8 P 2e8 n1e5 N2e5 $(cat seqs1000.txt) p=150323629, 44.44K p/sec, 2167391 factors found at 4.458K f/sec (last 1 min), 50.3% done. ETC 20221117 14:02 CPU time: 957.16 sec. (0.71 sieving) (7.96 cores) Primes tested: 5317484. Factors found: 3626818. Remaining terms: 96374182. Time: 120.17 seconds. [/CODE] 
Better test runner script
[CODE] rm temp*.out for N in {1,10,100,100}; do cat "crus_seqs_rand${N}.txt"  awk F", " '{ print "s\"" $1 "*" $2 "^n" $3 "\"" }'  tr '\n' ' ' > "seqs${N}.txt" wc "seqs${N}.txt" echo eval ./srsieve2_clean P 1e8 n1e5 N2e5 $(cat seqs${N}.txt) o temp_${N}.in echo eval "./srsieve2_clean P 2e8 i temp_${N}.in o temp_${N}_befor1.out echo eval "./srsieve2_clean P 1e9 W8 i temp_${N}.in o temp_${N}_befor8.out echo eval "./srsieve2 P 2e8 i temp_${N}.in o temp_${N}_after1.out echo eval "./srsieve2 P 1e9 W8 i temp_${N}.in o temp_${N}_after8.out echo e "\n\n" done [/CODE] Results [CODE] [B]1 Sequence[/B] (using 10x higher bounds; 1e9, 1e10) Before CPU time: 21.90 sec. (0.26 sieving) (1.20 cores) Primes tested: 40351310. Factors found: 225. Remaining terms: 2731. Time: 18.20 seconds. CPU time: 219.52 sec. (27.57 sieving) (8.05 cores) Primes tested: 444556287. Factors found: 517. Remaining terms: 2439. Time: 27.26 seconds. After CPU time: 20.96 sec. (0.27 sieving) (1.20 cores) Primes tested: 40351310. Factors found: 225. Remaining terms: 2731. Time: 17.45 seconds. CPU time: 212.39 sec. (28.63 sieving) (8.03 cores) Primes tested: 444556287. Factors found: 517. Remaining terms: 2439. Time: 26.41 seconds. [B]10 Sequences[/B] Before CPU time: 25.28 sec. (0.06 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 757. Remaining terms: 19928. Time: 25.23 seconds. CPU time: 231.81 sec. (3.57 sieving) (7.91 cores) [B](This is 9X the work so expected that it takes longer)[/B] Primes tested: 45086080. Factors found: 2277. Remaining terms: 18408. Time: 29.30 seconds. After CPU time: 22.14 sec. (0.05 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 757. Remaining terms: 19928. Time: 22.13 seconds. CPU time: 202.03 sec. (4.27 sieving) (7.91 cores) Primes tested: 45086080. Factors found: 2277. Remaining terms: 18408. Time: 25.52 seconds. [B]100 Sequence[/B] Before CPU time: 135.06 sec. (0.05 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 21073. Remaining terms: 563114. Time: 134.57 seconds. CPU time: 1223.59 sec. (4.01 sieving) (7.96 cores) Primes tested: 45086080. Factors found: 64678. Remaining terms: 519509. Time: 153.59 seconds. After CPU time: 127.34 sec. (0.06 sieving) (1.00 cores) Primes tested: 5317484. Factors found: 21073. Remaining terms: 563114. Time: 126.80 seconds. CPU time: 1155.34 sec. (4.17 sieving) (7.96 cores) Primes tested: 45086080. Factors found: 64678. Remaining terms: 519509. Time: 144.99 seconds. [B]1000 Sequence[/B] (sieved to 11e7 for no W, 2e8 for W8) Before CPU time: 84.85 sec. (0.00 sieving) (1.02 cores) Primes tested: 541856. Factors found: 23720. Remaining terms: 4588834. Time: 83.03 seconds. CPU time: 869.99 sec. (0.42 sieving) (7.97 cores) Primes tested: 5317484. Factors found: 167203. Remaining terms: 4445351. Time: 109.13 seconds. After CPU time: 85.95 sec. (0.00 sieving) (1.02 cores) Primes tested: 541856. Factors found: 23720. Remaining terms: 4588834. Time: 84.07 seconds. CPU time: 848.70 sec. (0.51 sieving) (7.97 cores) Primes tested: 5317484. Factors found: 167203. Remaining terms: 4445351. Time: 106.48 seconds. [/CODE] the output files are exact matches. I ran the single sequence to high bounds (1e9, 1e10). It is the only sequence not using generic logic and doesn't show a speedup. The 10/100 sequence results are faster, the 1/1000 sequence results are equal. I suspect the single sequence is because it's not using the generic logic. Maybe with 1000 sequences I'm hitting cache limits (I could profile with perf if it was important to verify that thesis) 
All times are UTC. The time now is 16:22. 
Powered by vBulletin® Version 3.8.11
Copyright ©2000  2023, Jelsoft Enterprises Ltd.