![]() |
![]() |
#1 |
"Mark"
Apr 2003
Between here and the
1B3216 Posts |
![]()
This thread is a placeholder for people who want to make improvements to mtsieve. This isn't about "I want this feature" (use the other thread), but about code changes to improve performance. I would expect posts to this thread to be centered around algorithms and code that implements those algorithms.
|
![]() |
![]() |
![]() |
#2 | |
Jun 2003
3·5·107 Posts |
![]() Quote:
We need two versions/logic paths of srsieve2 - one for heavy weight and large N ranges searches and one for light weight and small N range searches. These require different optimizations. One size fits all does not work here. Most projects/people need the light weight/small N range version but srsieve was designed for heavy weight/large N sequences. All the CRUS work is low weight. For a heavy weight and large N range the overhead might be justified for optimizations but not for the light weight k and smaller N range as it only slows the program down. There are 3 main optimizations/algorithms that need to be tinkered: 1. Deciding on the best Q 2. Power residue 3. Legendre symbols Note:- There is overlap between 2 and 3. Optimizations will be different when many k's are being sieved simultaneously. Optimizations will be different for 2-3k sieve and when 100k are being sieved. These require different logic paths. The program needs to have different logic optimization paths based on number of k and size of range and weight of k. #3) I think we are taking about different things. Even though I called it a hashmap - I meant more of a bitmap. Can you explain what you implemented in the past. #4) This would be easy to implement for 1-2 k and low weight/small N range as memory requirements are minimal. (I am not sure how to do this for large number of k, with large N range and heavy weight K using the BSGS on GPU). I will stick with improving srsieve2.exe first. We might be able to make other sieves faster as well (I haven't looked at the code). Anyone else has any suggestions? ![]() |
|
![]() |
![]() |
![]() |
#3 |
"Curtis"
Feb 2005
Riverside, CA
2×32×313 Posts |
![]()
It has been said before about e.g. gpuowl, but it bears repeating here that it's enjoyable to view discussions about code / algo enhancements. I like to think I learn a little more about the sieve process, and I appreciate that such things are discussed in public. a little less of a black box!
|
![]() |
![]() |
![]() |
#4 | ||||
"Mark"
Apr 2003
Between here and the
2×592 Posts |
![]() Quote:
Quote:
Quote:
Quote:
I'm thinking that a faster implementation of the HashTable class would be a good start. |
||||
![]() |
![]() |
![]() |
#5 | |
Jun 2003
3×5×107 Posts |
![]() Quote:
Legendre tables at best would reduce the number of primes being tested by 50% and are mainly suited for small k and base values. For large number of k being sieved together Legendre tables are not worth it. They increase the overhead significantly as you previously mentioned. |
|
![]() |
![]() |
![]() |
#6 | |
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
37·163 Posts |
![]() Quote:
|
|
![]() |
![]() |
![]() |
#7 | |
"Mark"
Apr 2003
Between here and the
2×592 Posts |
![]() Quote:
If I wasn't clear above, I'm hoping for others to contribute code, not just ideas. Last fiddled with by rogue on 2020-04-24 at 12:42 |
|
![]() |
![]() |
![]() |
#8 | |
Dec 2011
After milion nines:)
65716 Posts |
![]() Quote:
Rogue as we write in PM, you must fix srsieve2 to go upper then 2^52 because this limit is low. What user have from the fact srsieve2 is faster then sr1sieve if upper limit is low (2^52) Maybe you should fix this first |
|
![]() |
![]() |
![]() |
#9 | |
"Mark"
Apr 2003
Between here and the
11011001100102 Posts |
![]() Quote:
I agree that getting srsieve2 to support p > 2^52 is important. Although the code changes are not too difficult, if I want to maximize speed for p < 2^52 and p > 2^52 it becomes more of a challenge. At this time I might need to have two different builds. I would like to avoid that. For now you will have to survive using sr2sieve for p > 2^52. Last fiddled with by rogue on 2020-04-24 at 14:46 |
|
![]() |
![]() |
![]() |
#10 |
"Mark"
Apr 2003
Between here and the
154628 Posts |
![]()
I have looked at the various mulmod routines available to me in x86 ASM to determine their relative speed. The i386 and sse ASM functions used by srXsieve will not compile using -m64 and as nobody (and I mean nobody) has asked for a 32-bit build in many years, I won't consider them going forward.
In this table the first two functions are part of mtsieve. The vec_mulmod_xx functions are from srXsieve. It is a little challenging to take the numbers "as is" because the functions from mtsieve accept 4 (sse) and 32 (avx) divisors as they sieve multiple prime concurrently, but the vec_mulmod_xx functions accept only 1. Code:
sse_mulmod_4a_4b_4p 718750 ms 1796 ms (per 1e6 mulmod) vec_avx_mulmod 3750000 ms 1171 ms (per 1e6 mulmod) vec2_mulmod64_mmx 29296875 ms 610 ms (per 1e6 mulmod) vec4_mulmod64_mmx 12187500 ms 253 ms (per 1e6 mulmod) vec6_mulmod64_mmx 8015625 ms 166 ms (per 1e6 mulmod) vec8_mulmod64_mmx 6015625 ms 125 ms (per 1e6 mulmod) vec2_mulmod64_fpu 29781250 ms 620 ms (per 1e6 mulmod) vec4_mulmod64_fpu 13468750 ms 280 ms (per 1e6 mulmod) vec6_mulmod64_fpu 8875000 ms 184 ms (per 1e6 mulmod) vec8_mulmod64_fpu 6078125 ms 126 ms (per 1e6 mulmod) It might be possible to convert some of the existing sieves in the framework to use the functions from srXsieve, but I haven't looked into that yet. I will likely only include the vec8_mulmod64_fpu function in mtsieve because the others just are not as fast and because supporting multiple just adds unnecessary complexity to the framework and the vec_mulmod_mmx functions require a global variable, which mtsieve will not support as it is multi-threaded. That could probably be changed, but I don't know if it would impact performance. I have attached the code (containing a Windows exe). You can compile on your own with: gcc *mulmod*.S *fini.S avx*S test.c -m64 -O2 -o mulmod_test. I am curious as to what other CPUs output to see if these numbers are consistent. The compiled program takes 1 parameter, a number. That number is multiplied by 1e6 for the number of calls made to the function in question. The numbers above were run with the input parameter of 100 while no other CPU intensive programs were running. Last fiddled with by rogue on 2020-04-25 at 21:35 |
![]() |
![]() |
![]() |
#11 |
Random Account
Aug 2009
Not U. + S.A.
32·281 Posts |
![]()
The attached image. Same parameter, 100, and no other intensive processes running. This is my i7-7700 running Widows 10 v1903. CPU utilization was 15% running at 4.15 GHz.
|
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
mtsieve | rogue | Software | 982 | 2023-02-01 13:18 |
srsieve/sr2sieve enhancements | rogue | Software | 304 | 2021-11-06 13:51 |
LLRnet enhancements | kar_bon | No Prime Left Behind | 10 | 2008-03-28 11:21 |
TODO list and suggestions/comments/enhancements | Greenbank | Octoproth Search | 2 | 2006-12-03 17:28 |
Suggestions for future enhancements | Reboot It | Software | 16 | 2003-10-17 01:31 |