![]() |
When I wrote my quick discrete-log solver I added this very useful optimization. I looked over the code and didn't find it in mt-sieve.
When considering a single sequence k*b^n+c with 0 <= n <= N (e.g. in the sierpinski riesel) I solve the discrete_log(b, p, mod_inv(k) * -c) = x <=> b^x = mod_inv(k, p) * -c <=> p | k*b^x+c For each prime I'm hoping x <= MaxN, I noticed that when x > MaxN I don't care about the value. So I only compute baby-steps/giant-steps up to min(MaxN, order_b(p-1)) With MaxN = 1e6 and p near 1e9 this is around 2*sqrt(1e6)=2000 steps vs 2*sqrt(1e9)=63000 steps! Maybe this is already covered by GenericWorker::InitializeWorker when choosing r? or in some other way I don't understand? |
[QUOTE=SethTro;617162]When I wrote my quick discrete-log solver I added this very useful optimization. I looked over the code and didn't find it in mt-sieve.
When considering a single sequence k*b^n+c with 0 <= n <= N (e.g. in the sierpinski riesel) I solve the discrete_log(b, p, mod_inv(k) * -c) = x <=> b^x = mod_inv(k, p) * -c <=> p | k*b^x+c For each prime I'm hoping x <= MaxN, I noticed that when x > MaxN I don't care about the value. So I only compute baby-steps/giant-steps up to min(MaxN, order_b(p-1)) With MaxN = 1e6 and p near 1e9 this is around 2*sqrt(1e6)=2000 steps vs 2*sqrt(1e9)=63000 steps! Maybe this is already covered by GenericWorker::InitializeWorker when choosing r? or in some other way I don't understand?[/QUOTE] Look at FindBestQ() is computed in the Helper class. q is computed once and is independent of p. This allows the discrete log to take the same amount of time regardless of p. |
[QUOTE=rogue;617176]Look at FindBestQ() is computed in the Helper class. q is computed once and is independent of p. This allows the discrete log to take the same amount of time regardless of p.[/QUOTE]
I've only spent a couple of days on this problem so I'm missing a lot of the deeper context you have. I'd love to understand why q makes the discrete log constant time (or the same per batch?). If there's an article or paper on the subject I'd enjoy reading it. My rough understanding of srsieve is that you convert b^n (in my case 9^n) into several sub-sequences 9^0*(9^3)^a, 9^1*(9^3)^a, 9^2*(9^3)^a. After this some of these sub sequence will always have trivial divisors (related to the residual classes) and can entirely be dropped. |
[QUOTE=SethTro;617512]I've only spent a couple of days on this problem so I'm missing a lot of the deeper context you have. I'd love to understand why q makes the discrete log constant time (or the same per batch?). If there's an article or paper on the subject I'd enjoy reading it.
My rough understanding of srsieve is that you convert b^n (in my case 9^n) into several sub-sequences 9^0*(9^3)^a, 9^1*(9^3)^a, 9^2*(9^3)^a. After this some of these sub sequence will always have trivial divisors (related to the residual classes) and can entirely be dropped.[/QUOTE] The code in srsieve2/srsieve2cl originated from srsieve/sr1sieve. I was not the writer of that code. I do not recall if any of that code referenced a wiki page or other document specifying the logic behind the discrete log. |
I found a bug with HashTable that when I fix gives me a 20-30% speedup (at least for srsieve2). I'm looking around but I don't see a test suite in the SVN repository how do you test changes?
|
[QUOTE=SethTro;617907]I found a bug with HashTable that when I fix gives me a 20-30% speedup (at least for srsieve2). I'm looking around but I don't see a test suite in the SVN repository how do you test changes?[/QUOTE]
Nice! I do not have a test suite. I typically run a set of tests with the previous version and the new version to compare results. If invalid factors are found or factors are missed, then I need to dig further. What is the bug? I assume this doesn't impact the number of factors found, but impacts either putting something into the hash table or looking something up. It is possible that this impacts the OpenCL kernel as well. |
The bug is that `htable` is initializing and cleared with `0` instead of `empty_slot`.
This causes Insert to mark every entry as a probing conflict, e.g. `htable[slot] = (J | HASH_MASK1)`. On Lookup this causes additional probing. I'm working on getting your code into a git repository so I can break out my changes into smaller commits. I should have a small diff by the end of the day. |
The change is here: [url]https://github.com/sethtroisi/mtsieve/commit/e5337806c3a54ebd3657b168587ada11384c9f6e[/url]
Before [CODE] five:~/Projects/mtsieve-git$ ./srsieve2_clean -fA -W1 -p 8 -P 1e8 -n 100000 -N 250000 -s "4*9^n-7" Sieve started: 8 < p < 1e8 with 150001 terms (100000 < n < 250000, k*9^n-7) (expecting 133067 factors) CPU time: 18.90 sec. (0.04 sieving) (0.90 cores) Primes tested: 5761244. Factors found: 126233. Remaining terms: 23768. Time: 20.99 seconds. five:~/Projects/mtsieve-git$ ./srsieve2_clean -fA -W8 -p 8 -P 1e9 -n 100000 -N 250000 -s "4*9^n-7" Sieve started: 8 < p < 1e9 with 150001 terms (100000 < n < 250000, k*9^n-7) (expecting 134949 factors) CPU time: 178.19 sec. (4.04 sieving) (7.21 cores) Primes tested: 50847324. Factors found: 128864. Remaining terms: 21137. Time: 24.69 seconds. [/CODE] After [CODE] five:~/Projects/mtsieve-git$ ./srsieve2 -fA -W1 -p 8 -P 1e8 -n 100000 -N 250000 -s "4*9^n-7" Sieve started: 8 < p < 1e8 with 150001 terms (100000 < n < 250000, k*9^n-7) (expecting 133067 factors) CPU time: 15.23 sec. (0.04 sieving) (0.88 cores) Primes tested: 5761244. Factors found: 126233. Remaining terms: 23768. Time: 17.30 seconds. five:~/Projects/mtsieve-git$ ./srsieve2 -fA -W8 -p 8 -P 1e9 -n 100000 -N 250000 -s "4*9^n-7" Sieve started: 8 < p < 1e9 with 150001 terms (100000 < n < 250000, k*9^n-7) (expecting 134949 factors) CPU time: 144.05 sec. (4.36 sieving) (7.08 cores) Primes tested: 50847324. Factors found: 128864. Remaining terms: 21137. Time: 20.34 seconds. [/CODE] 18.9 / 15.23 = 24% faster 178.19 / 144.05 = 24% faster |
Looks like the removal of memset() is where the performance boost comes from. How much of an improvement will likely be compiler specific. The OpenCL code does not use memset(), so it won't gain anything. I see that you removed the other constructor. It might be used by one of the other sieves.
This is a really nice performance boost. I greatly appreciate it. Not many people have been brave enough to look at my code and offer optimizations like this. I will look into making the changes in sourceforge and getting a new release out soon. |
[QUOTE=rogue;617961]Looks like the removal of memset() is where the performance boost comes from. How much of an improvement will likely be compiler specific. The OpenCL code does not use memset(), so it won't gain anything. I see that you removed the other constructor. It might be used by one of the other sieves.
This is a really nice performance boost. I greatly appreciate it. Not many people have been brave enough to look at my code and offer optimizations like this. I will look into making the changes in sourceforge and getting a new release out soon.[/QUOTE] I don't believe removing memset changes the performance, I still have to write a value to every index. The real benefit is you go from 100% conflicts to 8-16% on Insert() and Lookup() I have a little more diagnostics in [url]https://github.com/sethtroisi/mtsieve/commit/408e5ce3b17f69e119985c4788511d3ab577cf8a[/url] I also fixed a segfault in OpenCL with [url]https://github.com/sethtroisi/mtsieve/commit/7fc02473dc2b5caf6d6fd658914b92914b0ac1f3[/url] |
In two quick checks I saw a 10% improvement in the openCL code by making this [URL="https://github.com/sethtroisi/mtsieve/commit/910f749dc6fb2cef14514e712ccf5adf9be33fca"]one line fix[/URL]
[CODE] for (idx=0; idx<HASH_SIZE; idx++) - h_table[idx] = 0; + h_table[idx] = HASH_ELEMENTS; [/CODE] [CODE] # Before $ ./srsieve2cl --platform=1 -W1 -G1 -fA -p 1e9 -P 20e8 -i b9_n_1e9.abc CPU time: 56.43 sec. (1.21 sieving) (1.95 cores) GPU time: 26.96 sec. Primes tested: 47432100. Factors found: 660. Remaining terms: 20477. Time: 28.82 seconds. # After $ $ ./srsieve2cl --platform=1 -W1 -G1 -fA -p 1e9 -P 20e8 -i b9_n_1e9.abc CPU time: 51.72 sec. (0.93 sieving) (1.96 cores) GPU time: 24.91 sec. Primes tested: 47410816. Factors found: 659. Remaining terms: 20478. Time: 26.26 seconds. [/CODE] |
All times are UTC. The time now is 04:00. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.