mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   mtsieve (https://www.mersenneforum.org/showthread.php?t=23042)

rogue 2022-11-17 00:35

[QUOTE=SethTro;617967]In two quick checks I saw a 10% improvement in the openCL code by making this [URL="https://github.com/sethtroisi/mtsieve/commit/910f749dc6fb2cef14514e712ccf5adf9be33fca"]one line fix[/URL]

[CODE]
for (idx=0; idx<HASH_SIZE; idx++)
- h_table[idx] = 0;
+ h_table[idx] = HASH_ELEMENTS;
[/CODE]

[CODE]
# Before
$ ./srsieve2cl --platform=1 -W1 -G1 -fA -p 1e9 -P 20e8 -i b9_n_1e9.abc
CPU time: 56.43 sec. (1.21 sieving) (1.95 cores) GPU time: 26.96 sec.
Primes tested: 47432100. Factors found: 660. Remaining terms: 20477. Time: 28.82 seconds.

# After
$ $ ./srsieve2cl --platform=1 -W1 -G1 -fA -p 1e9 -P 20e8 -i b9_n_1e9.abc
CPU time: 51.72 sec. (0.93 sieving) (1.96 cores) GPU time: 24.91 sec.
Primes tested: 47410816. Factors found: 659. Remaining terms: 20478. Time: 26.26 seconds.
[/CODE][/QUOTE]

Any idea why it missed one factor?

SethTro 2022-11-17 01:06

I believe this has to do with mtsieve not handling the end point strictly.

[CODE]
five:~/Projects/mtsieve-git$ diff 4b9m7_n_20e8_clean.abc 4b9m7_n_20e8_post.abc
1c1
< ABC 4*9^$a-7 // Sieved to 2000000063
---
> ABC 4*9^$a-7 // Sieved to 2000288291
[/CODE]

My guess is that the if/when you increase worksize leads to a different amount of overrun of -P

SethTro 2022-11-17 01:47

[QUOTE=SethTro;617967]
[CODE]
# Before
$ ./srsieve2cl --platform=1 -W1 -G1 -fA -p 1e9 -P 20e8 -i b9_n_1e9.abc
CPU time: 56.43 sec. (1.21 sieving) (1.95 cores) GPU time: 26.96 sec.
Primes tested: 47432100. Factors found: 660. Remaining terms: 20477. Time: 28.82 seconds.

# After
$ ./srsieve2cl --platform=1 -W1 -G1 -fA -p 1e9 -P 20e8 -i b9_n_1e9.abc
CPU time: 51.72 sec. (0.93 sieving) (1.96 cores) GPU time: 24.91 sec.
Primes tested: 47410816. Factors found: 659. Remaining terms: 20478. Time: 26.26 seconds.
[/CODE][/QUOTE]

Using a little unix magic
[CODE]
$ primesieve 1e9
Primes: 50847534
$ primesieve -n $(expr 50847534 + 47432100)
Nth prime: 2001223151
$ ./srsieve2cl --platform=1 -W1 -G1 -fA -p 20e8 -P 2001223151 -i 4b9m7_n_20e8_clean.abc -o temp.abc
Primes tested: 73344. Factors found: 1. Remaining terms: 20477. Time: 0.10 seconds.

$ diff 4b9m7_n_20e8_clean.abc temp.abc
1c1
< ABC 4*9^$a-7 // Sieved to 2000000063
---
> ABC 4*9^$a-7 // Sieved to 2001565729
629d628
< 104493

$ python
>>> for p in range(20 * 10 ** 8 + 1, 2001288307, 2):
... if (4 * pow(9, 104493, p) - 7) % p == 0:
... print(p)
...
2001059303

$ echo "4*9^104493-7" | ecm -pm1 14650 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-assert] [P-1]
Input number is 4*9^104493-7 (99713 digits)
Using B1=14650, B2=0, polynomial x^1, x0=2592079789
Step 1 took 38713ms
********** Factor found in step 1: 2001059303
[/CODE]

So there's a factor "just" after 20e8 and one of the two runs had a worksize that overran 20e8 enough to include it while the other didn't.

I looked briefly into what it would take to make the worksize variable so that the bounds could be strict but it's not pretty. It would probably make more sense to pad the last batch with the last prime a bunch of times (and verify that if that primes divides a sequence it doesn't count multiple times)

rogue 2022-11-17 13:46

[QUOTE=SethTro;617978]So there's a factor "just" after 20e8 and one of the two runs had a worksize that overran 20e8 enough to include it while the other didn't.

I looked briefly into what it would take to make the worksize variable so that the bounds could be strict but it's not pretty. It would probably make more sense to pad the last batch with the last prime a bunch of times (and verify that if that primes divides a sequence it doesn't count multiple times)[/QUOTE]

This is quite okay and explains the discrepancy. The worksize differs between GPU and CPU so in one case the CPU got the last chunk of work in the other the GPU got it. I won't make any changes here.

rogue 2022-11-17 14:06

Upon review, this is for a single sequence and you have doubled the size of the hash table. Have you done any testing with multiple sequences? I am concerned that the change to GenericGpuWorker will negatively affect memory requirements for the GPU when one has hundreds or thousands of sequences.

SethTro 2022-11-17 17:00

[QUOTE=rogue;617993]Upon review, this is for a single sequence and you have doubled the size of the hash table. Have you done any testing with multiple sequences? I am concerned that the change to GenericGpuWorker will negatively affect memory requirements for the GPU when one has hundreds or thousands of sequences.[/QUOTE]

I was testing changing the hash table size but it doesn't seem to have much impact (at least on my few tests).

I'm happy to test with more sequences, do you have a recommendation for how many sequences at a time I should test? and how large those sequences should be?

You can test with just the one line change, or the one line change + [URL="https://github.com/sethtroisi/mtsieve/commit/0c02e047f199e7463e8eceb09431f3e9d98c7f4e"]refactor,[/URL] or with [URL="https://github.com/sethtroisi/mtsieve/compare/base...hash_table_clean"]all my hash table changes[/URL]. On my GPU (1080ti) I consistently see a +5-10% performance improvement.

[CODE]
$ ./srsieve2cl_clean --platform=1 -W0 -G1 -fA -p 1e9 -P 20e8 -i 4b9m7_n_1e9.abc
Sieve started: 1e9 < p < 2e9 with 21137 terms (100012 < n < 249988, k*9^n-7) (expecting 684 factors)
CPU time: 35.73 sec. (0.64 sieving) (0.98 cores) GPU time: 34.83 sec.
Primes tested: 47423488. Factors found: 659. Remaining terms: 20478. Time: 36.45 seconds.
Segmentation fault (core dumped)

$ ./srsieve2cl --platform=1 -W0 -G1 -fA -p 1e9 -P 20e8 -i 4b9m7_n_1e9.abc
Sieve started: 1e9 < p < 2e9 with 21137 terms (100012 < n < 249988, k*9^n-7) (expecting 684 factors)
CPU time: 33.44 sec. (0.60 sieving) (0.98 cores) GPU time: 32.26 sec.
Primes tested: 47423488. Factors found: 659. Remaining terms: 20478. Time: 33.84 seconds.
[/CODE]

rogue 2022-11-17 19:43

Is suggest testing with 1, 10, 10, and 1000 sequences. You can generate sequences or you can use the CRUS page to find sequences that will provide more meaningful results. Use -n1e5 -N2e5. You can then test -p1e9 -p1e10. It will take longer to run, but reduces impact of other things you might be doing on your computer.

I would like to see before and after for CPU and GPU. With srsieve2, no -W. With srsieve2cl, no -G.

Assuming all looks good, can you e-mail me the files that you have changed. It would make it easier for me to evaluate what you have changed. Also send me the starting sequence files so that I can do my own testing.

I hope that I am not asking too much from you. I do intend to incorporate your changes into the code base. I could also make you a developer on the project so that you can commit directly to sourceforge.

SethTro 2022-11-17 21:41

[QUOTE=rogue;618008]I hope that I am not asking too much from you. I do intend to incorporate your changes into the code base.[/QUOTE]

I appreciate you communicating quickly here and being receptive to changes.

[QUOTE=rogue;618008]Is suggest testing with 1, 10, 10, and 1000 sequences. You can generate sequences or you can use the CRUS page to find sequences that will provide more meaningful results. Use -n1e5 -N2e5. You can then test -p1e9 -p1e10. It will take longer to run, but reduces impact of other things you might be doing on your computer.

I would like to see before and after for CPU and GPU. With srsieve2, no -W. With srsieve2cl, no -G.

Assuming all looks good, can you e-mail me the files that you have changed. It would make it easier for me to evaluate what you have changed. Also send me the starting sequence files so that I can do my own testing.

I hope that I am not asking too much from you. I do intend to incorporate your changes into the code base. I could also make you a developer on the project so that you can commit directly to sourceforge.[/QUOTE]


When i run `./srsieve2 -fA -p 1e8 -p 2e8 -n1e5 -N2e5 -s"118392*51^n+1" ` I get an error with

[CODE]srsieve2: sierpinski_riesel/CisOneWithOneSequenceHelper.cpp:282: void CisOneWithOneSequenceHelper::MakeLadder(uint16_t*, uint32_t): Assertion `qListLen+1 < ii_BestQ' failed.
[/CODE]which I don't think I caused, but is maybe because of my input?

rogue 2022-11-17 21:59

I suggest that you start by sieving the sequence to 1e8, then continue sieving with the output file that was generated. For example:

./srsieve2 -P1e6 -n1e5 -N2e5 -s"118392*51^n+1"

followed by:

./srsieve2 -P2e8 -ib51_n.abcd

The first execution will remove most terms. The second will provide the stats we care about.

What you input might be a problem because you are overriding the starting value for p. At p=1e6 it re-computes bestQ.

SethTro 2022-11-17 22:05

Results
 
3 Attachment(s)
[B]Results[/B]

I pulled down a few CRUS sequences and choose 1, 10, 100, 1000 k for random bases (I attach these files)

All of this was run on a Ryzen 3900x with no other load, I tested -p 1e8 to -P 2e8 (to get results faster) but can test larger ranges if this doesn't convince you.

[CODE]
$ wc crus_seqs_rand*
1000 3000 15978 crus_seqs_rand1000.txt
100 300 1534 crus_seqs_rand100.txt
10 30 158 crus_seqs_rand10.txt
1 3 13 crus_seqs_rand1.txt

$ cat crus_seqs_randX.txt | awk -F", " '{ print "-s\"" $1 "*" $2 "^n" $3 "\"" }' | tr '\n' ' ' > seqsX.txt
$ cat seqs10.txt
-s"3677878*3^n-1" -s"6793112*3^n-1" -s"10463066*3^n-1" -s"10789522*3^n-1" -s"11033634*3^n-1" -s"16874152*3^n-1" -s"18137648*3^n-1" -s"20379336*3^n-1" -s"21368582*3^n-1" -s"29140796*3^n-1"

$ echo "./srsieve2 -fA -p 1e8 -P 2e8 -n1e5 -N2e5 $(cat seqs10.txt)"
./srsieve2 -fA -p 1e8 -p 2e8 -n1e5 -N2e5 -s"3677878*3^n-1" -s"6793112*3^n-1" -s"10463066*3^n-1" -s"10789522*3^n-1" -s"11033634*3^n-1" -s"16874152*3^n-1" -s"18137648*3^n-1" -s"20379336*3^n-1" -s"21368582*3^n-1" -s"29140796*3^n-1"
$ eval "./srsieve2 -fA -p 1e8 -P 2e8 -n1e5 -N2e5 $(cat seqs10.txt)"
[/CODE] With 10 sequences

[CODE]
$ ./srsieve2_clean -fA -p 1e8 -P 2e8 -n1e5 -N2e5 -s"3677878*3^n-1" -s"6793112*3^n-1" -s"10463066*3^n-1" -s"10789522*3^n-1" -s"11033634*3^n-1" -s"16874152*3^n-1" -s"18137648*3^n-1" -s"20379336*3^n-1" -s"21368582*3^n-1" -s"29140796*3^n-1"
Sieve started: 1e8 < p < 2e8 with 1000010 terms (100000 < n < 200000, k*3^n-1) (expecting 36264 factors)
CPU time: 58.77 sec. (0.05 sieving) (1.00 cores)
Primes tested: 5317484. Factors found: 36406. Remaining terms: 963604. Time: 58.60 seconds.

$ ./srsieve2 -fA -p 1e8 -P 2e8 -n1e5 -N2e5 -s"3677878*3^n-1" -s"6793112*3^n-1" -s"10463066*3^n-1" -s"10789522*3^n-1" -s"11033634*3^n-1" -s"16874152*3^n-1" -s"18137648*3^n-1" -s"20379336*3^n-1" -s"21368582*3^n-1" -s"29140796*3^n-1"

Sieve started: 1e8 < p < 2e8 with 1000010 terms (100000 < n < 200000, k*3^n-1) (expecting 36264 factors)
CPU time: 47.23 sec. (0.05 sieving) (1.00 cores)
Primes tested: 5317484. Factors found: 36406. Remaining terms: 963604. Time: 47.19 seconds.

# With -W8
CPU time: 62.63 sec. (0.50 sieving) (7.84 cores)
vs
CPU time: 48.75 sec. (0.54 sieving) (7.79 cores)
[/CODE]With 100 sequences

[CODE]
$ eval ./srsieve2_clean -fA -p 1e8 -P 2e8 -n1e5 -N2e5 $(cat seqs100.txt)
Sieve started: 1e8 < p < 2e8 with 10000100 terms (100000 < n < 200000, k*31^n+1) (expecting 362645 factors)
p=130433617, 26.90K p/sec, 142003 factors found at 2.321K f/sec (last 1 min), 30.4% done. ETC 2022-11-17 13:48
p=193974931, 27.34K p/sec, 347298 factors found at 1.488K f/sec (last 1 min), 93.9% done. ETC 2022-11-17 13:48
CPU time: 195.54 sec. (0.05 sieving) (1.00 cores)
Primes tested: 5317484. Factors found: 362592. Remaining terms: 9637508. Time: 194.95 seconds.

$ eval ./srsieve2 -fA -p 1e8 -P 2e8 -n1e5 -N2e5 $(cat seqs100.txt)
Sieve started: 1e8 < p < 2e8 with 10000100 terms (100000 < n < 200000, k*31^n+1) (expecting 362645 factors)
p=136261487, 32.20K p/sec, 165361 factors found at 2.720K f/sec (last 1 min), 36.2% done. ETC 2022-11-17 13:52
p=173591807, 32.33K p/sec, 290747 factors found at 2.048K f/sec (last 1 min), 73.5% done. ETC 2022-11-17 13:52
CPU time: 164.43 sec. (0.04 sieving) (1.00 cores)
Primes tested: 5317484. Factors found: 362592. Remaining terms: 9637508. Time: 163.99 seconds.

# With -W8
CPU time: 206.01 sec. (0.55 sieving) (7.92 cores)
vs
CPU time: 172.62 sec. (0.50 sieving) (7.90 cores)
[/CODE]

With 1000 sequences

[CODE]
$ eval ./srsieve2_clean -W8 -fA -p 1e8 -P 2e8 -n1e5 -N2e5 $(cat seqs1000.txt)
p=146758463, 41.42K p/sec, 2043151 factors found at 4.210K f/sec (last 1 min), 46.7% done. ETC 2022-11-17 13:59
p=194516929, 41.36K p/sec, 3489737 factors found at 2.971K f/sec (last 1 min), 94.5% done. ETC 2022-11-17 13:59
CPU time: 1026.07 sec. (0.61 sieving) (7.96 cores)
Primes tested: 5317484. Factors found: 3626818. Remaining terms: 96374182. Time: 128.82 seconds.

$ eval ./srsieve2 -W8 -fA -p 1e8 -P 2e8 -n1e5 -N2e5 $(cat seqs1000.txt)
p=150323629, 44.44K p/sec, 2167391 factors found at 4.458K f/sec (last 1 min), 50.3% done. ETC 2022-11-17 14:02
CPU time: 957.16 sec. (0.71 sieving) (7.96 cores)
Primes tested: 5317484. Factors found: 3626818. Remaining terms: 96374182. Time: 120.17 seconds.
[/CODE]

SethTro 2022-11-17 23:07

Better test runner script

[CODE]
rm temp*.out

for N in {1,10,100,100}; do
cat "crus_seqs_rand${N}.txt" | awk -F", " '{ print "-s\"" $1 "*" $2 "^n" $3 "\"" }' | tr '\n' ' ' > "seqs${N}.txt"
wc "seqs${N}.txt"
echo
eval ./srsieve2_clean -P 1e8 -n1e5 -N2e5 $(cat seqs${N}.txt) -o temp_${N}.in
echo
eval "./srsieve2_clean -P 2e8 -i temp_${N}.in -o temp_${N}_befor1.out
echo
eval "./srsieve2_clean -P 1e9 -W8 -i temp_${N}.in -o temp_${N}_befor8.out
echo
eval "./srsieve2 -P 2e8 -i temp_${N}.in -o temp_${N}_after1.out
echo
eval "./srsieve2 -P 1e9 -W8 -i temp_${N}.in -o temp_${N}_after8.out
echo -e "\n\n"
done
[/CODE]

Results

[CODE]
[B]1 Sequence[/B] (using 10x higher bounds; 1e9, 1e10)
Before
CPU time: 21.90 sec. (0.26 sieving) (1.20 cores)
Primes tested: 40351310. Factors found: 225. Remaining terms: 2731. Time: 18.20 seconds.

CPU time: 219.52 sec. (27.57 sieving) (8.05 cores)
Primes tested: 444556287. Factors found: 517. Remaining terms: 2439. Time: 27.26 seconds.

After
CPU time: 20.96 sec. (0.27 sieving) (1.20 cores)
Primes tested: 40351310. Factors found: 225. Remaining terms: 2731. Time: 17.45 seconds.

CPU time: 212.39 sec. (28.63 sieving) (8.03 cores)
Primes tested: 444556287. Factors found: 517. Remaining terms: 2439. Time: 26.41 seconds.


[B]10 Sequences[/B]
Before
CPU time: 25.28 sec. (0.06 sieving) (1.00 cores)
Primes tested: 5317484. Factors found: 757. Remaining terms: 19928. Time: 25.23 seconds.

CPU time: 231.81 sec. (3.57 sieving) (7.91 cores) [B](This is 9X the work so expected that it takes longer)[/B]
Primes tested: 45086080. Factors found: 2277. Remaining terms: 18408. Time: 29.30 seconds.

After
CPU time: 22.14 sec. (0.05 sieving) (1.00 cores)
Primes tested: 5317484. Factors found: 757. Remaining terms: 19928. Time: 22.13 seconds.

CPU time: 202.03 sec. (4.27 sieving) (7.91 cores)
Primes tested: 45086080. Factors found: 2277. Remaining terms: 18408. Time: 25.52 seconds.


[B]100 Sequence[/B]
Before
CPU time: 135.06 sec. (0.05 sieving) (1.00 cores)
Primes tested: 5317484. Factors found: 21073. Remaining terms: 563114. Time: 134.57 seconds.

CPU time: 1223.59 sec. (4.01 sieving) (7.96 cores)
Primes tested: 45086080. Factors found: 64678. Remaining terms: 519509. Time: 153.59 seconds.

After
CPU time: 127.34 sec. (0.06 sieving) (1.00 cores)
Primes tested: 5317484. Factors found: 21073. Remaining terms: 563114. Time: 126.80 seconds.

CPU time: 1155.34 sec. (4.17 sieving) (7.96 cores)
Primes tested: 45086080. Factors found: 64678. Remaining terms: 519509. Time: 144.99 seconds.


[B]1000 Sequence[/B] (sieved to 11e7 for no -W, 2e8 for -W8)
Before
CPU time: 84.85 sec. (0.00 sieving) (1.02 cores)
Primes tested: 541856. Factors found: 23720. Remaining terms: 4588834. Time: 83.03 seconds.

CPU time: 869.99 sec. (0.42 sieving) (7.97 cores)
Primes tested: 5317484. Factors found: 167203. Remaining terms: 4445351. Time: 109.13 seconds.

After
CPU time: 85.95 sec. (0.00 sieving) (1.02 cores)
Primes tested: 541856. Factors found: 23720. Remaining terms: 4588834. Time: 84.07 seconds.

CPU time: 848.70 sec. (0.51 sieving) (7.97 cores)
Primes tested: 5317484. Factors found: 167203. Remaining terms: 4445351. Time: 106.48 seconds.
[/CODE]

the output files are exact matches.

I ran the single sequence to high bounds (1e9, 1e10). It is the only sequence not using generic logic and doesn't show a speedup.

The 10/100 sequence results are faster, the 1/1000 sequence results are equal. I suspect the single sequence is because it's not using the generic logic. Maybe with 1000 sequences I'm hitting cache limits (I could profile with perf if it was important to verify that thesis)


All times are UTC. The time now is 16:22.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.