
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   mtsieve enhancements (https://www.mersenneforum.org/showthread.php?t=25486)

rogue 2020-04-23 15:16

mtsieve enhancements
 
This thread is a placeholder for people who want to make improvements to mtsieve. This isn't about "I want this feature" (use the other thread), but about code changes to improve performance. I would expect posts to this thread to be centered around algorithms and code that implements those algorithms.

Citrix 2020-04-24 02:16

[QUOTE=rogue;543533]I appreciate your feedback.

As for #1, I'm not certain if you are referring to srsieve2 or srsieve/sr1sieve/sr2sieve. If srsieve2, I'm open to suggestions. Note that I aim for ease of development/support over speed. Sacrificing up to 1 percent to achieve that is acceptable because I can typically make up for it in other areas. Also, srsieve2 is still a work in progress. You have probably noticed that it is designed to switch the helper/worker if certain conditions are met, as each helper/worker will have very different logic meant to optimize in certain ways.

As for #3, I did try to implement a hashmap, but the performance was worse. I didn't dig into determining why as the implementation I had was supposed to be one of the fastest.

As for #4, it is on my radar, but I haven't done anything yet. My focus has been on implementing the Legendre tables that sr1sieve and sr2sieve have, but the BSGS code using those tables is a hornet's nest.

I'm open to any suggestions. If you want to play around with the code and try your hand at some optimizations, I'm all for it.

To continue this, we can either move this discussion to the mtsieve thread, move it to a new thread, or take it offline via PM or e-mail. If we keep it online then other participants of this forum can add their knowledge to the discussion or can learn from it.[/QUOTE]

For #1) I was referring to all the various srXsieveY versions. They try to implement various different algorithms/optimizations that most users do not need or that unnecessarily make the programs slow. I agree with making the code as straightforward as possible.

We need two versions/logic paths of srsieve2 - one for heavy-weight searches with large N ranges and one for light-weight searches with small N ranges. These require different optimizations. One size fits all does not work here.

Most projects/people need the light-weight/small-N-range version, but srsieve was designed for heavy-weight/large-N sequences. All the CRUS work is low weight.

For a heavy-weight k and a large N range the overhead of these optimizations might be justified, but for a light-weight k and a smaller N range it only slows the program down.

There are 3 main optimizations/algorithms that need to be tinkered with:
1. Deciding on the best Q
2. Power residue
3. Legendre symbols
Note:- There is overlap between 2 and 3.

Optimizations will be different when many k's are being sieved simultaneously.
Optimizations will be different for a 2-3 k sieve and when 100 k's are being sieved. These require different logic paths.

The program needs to have different logic/optimization paths based on the number of k's, the size of the N range, and the weight of the k's.
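To make item 1 above concrete, here is a toy illustration of the n = Q*m + r decomposition that the choice of Q is based on (a standalone sketch of my own, not srsieve2's actual code). Splitting the exponent turns one sequence k*b^n+c into at most Q subsequences (k*b^r)*(b^Q)^m + c, so a larger Q means fewer terms per subsequence (and a smaller BSGS range per prime) but more subsequences to handle.

[code]
// Toy illustration of the n = Q*m + r split behind "deciding on the best Q".
// Hypothetical standalone sketch, not code from srsieve2.
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
   // hypothetical sequence parameters
   const uint32_t Q = 12, nMin = 10000, nMax = 100000;

   // count how many exponents n in [nMin, nMax] fall into each subsequence
   // (k*b^r)*(b^Q)^m + c, i.e. each residue class r = n mod Q
   std::vector<uint32_t> termCount(Q, 0);

   for (uint32_t n = nMin; n <= nMax; n++)
      termCount[n % Q]++;

   for (uint32_t r = 0; r < Q; r++)
      printf("r = %2u : %u terms of (k*b^%u)*(b^%u)^m + c\n", r, termCount[r], r, Q);

   // A larger Q shrinks the m range (fewer baby/giant steps per prime) but creates
   // more subsequences; the best Q balances the two, and residue filtering (items
   // 2 and 3) changes how many subsequences actually survive.
   return 0;
}
[/code]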

#3) I think we are talking about different things. Even though I called it a hashmap, I meant more of a bitmap. Can you explain what you implemented in the past?

#4) This would be easy to implement for 1-2 k's with low weight and a small N range, as memory requirements are minimal.
(I am not sure how to do this for a large number of k's, with a large N range and heavy-weight k's, using BSGS on a GPU.)
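To illustrate where that memory goes, here is a minimal CPU-side sketch of the baby-step giant-step search being discussed, for a single sequence k*b^n-1 and a single prime p (illustrative only; it assumes p is prime and does not divide k or b, and srsieve2's real workers are organized quite differently). The baby-step table of roughly sqrt(N range) residues is the part that grows with the range.

[code]
// Minimal BSGS sketch: report every n in [nMin, nMax] with k*b^n ≡ 1 (mod p),
// i.e. every n for which p divides k*b^n-1. Hypothetical code, not srsieve2's.
#include <cstdint>
#include <cmath>
#include <cstdio>
#include <unordered_map>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) { return (uint64_t) ((unsigned __int128) a * b % p); }

static uint64_t powmod(uint64_t a, uint64_t e, uint64_t p)
{
   uint64_t r = 1;
   for (a %= p; e > 0; e >>= 1, a = mulmod(a, a, p))
      if (e & 1) r = mulmod(r, a, p);
   return r;
}

static void bsgs(uint64_t k, uint64_t b, uint64_t p, uint32_t nMin, uint32_t nMax)
{
   uint32_t range = nMax - nMin + 1;
   uint32_t M = (uint32_t) ceil(sqrt((double) range));

   // baby steps: b^j -> j for j < M (keeps the smallest j; a real sieve must also
   // handle b of small order). This table is the memory cost mentioned above.
   std::unordered_map<uint64_t, uint32_t> baby;
   uint64_t bj = 1;
   for (uint32_t j = 0; j < M; j++, bj = mulmod(bj, b, p))
      baby.emplace(bj, j);

   // giant steps: look for b^j ≡ k^-1 * b^-(nMin + i*M) (mod p)
   uint64_t target = mulmod(powmod(k, p - 2, p), powmod(powmod(b, nMin, p), p - 2, p), p);
   uint64_t giant  = powmod(powmod(b, M, p), p - 2, p);

   for (uint32_t i = 0; i * (uint64_t) M <= range; i++, target = mulmod(target, giant, p))
   {
      auto it = baby.find(target);
      if (it != baby.end() && nMin + i * M + it->second <= nMax)
         printf("%llu | k*b^%u-1\n", (unsigned long long) p, nMin + i * M + it->second);
   }
}

int main()
{
   // hypothetical example: 13 divides 5*2^n-1 exactly when n ≡ 3 (mod 12),
   // so this should report n = 3, 15, 27 and 39
   bsgs(5, 2, 13, 1, 50);
   return 0;
}
[/code]

On a GPU, each thread (one prime per thread) presumably needs its own baby-step table, which is one reason a large N range and many subsequences make the GPU case harder.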

I will stick with improving srsieve2.exe first. We might be able to make other sieves faster as well (I haven't looked at the code).

Does anyone else have any suggestions?

:coffee:

VBCurtis 2020-04-24 03:17

[QUOTE=Citrix;543604]Does anyone else have any suggestions?

:coffee:[/QUOTE]

It has been said before about e.g. gpuowl, but it bears repeating here that it's enjoyable to view discussions about code / algo enhancements. I like to think I learn a little more about the sieve process, and I appreciate that such things are discussed in public. A little less of a black box!

rogue 2020-04-24 03:18

[QUOTE=Citrix;543604]For #1) I was referring to all the various srXsieveY versions. They try to implement various different algorithms/optimizations that most users do not need or that unnecessarily make the programs slow. I agree with making the code as straightforward as possible.[/QUOTE]

Which is why most of that is gone in srsieve2.

[QUOTE]We need two versions/logic paths of srsieve2 - one for heavy-weight searches with large N ranges and one for light-weight searches with small N ranges. These require different optimizations. One size fits all does not work here.

Most projects/people need the light-weight/small-N-range version, but srsieve was designed for heavy-weight/large-N sequences. All the CRUS work is low weight.

For a heavy-weight k and a large N range the overhead of these optimizations might be justified, but for a light-weight k and a smaller N range it only slows the program down.

There are 3 main optimizations/algorithms that need to be tinkered with:
1. Deciding on the best Q
2. Power residue
3. Legendre symbols
Note:- There is overlap between 2 and 3.
[/QUOTE]

The Legendre table is easy to generate, but for larger k (> 1e9) it can take more time and a lot of memory to build the table.
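To spell out why the cost grows with k: the value of the Jacobi symbol (k/p) depends only on p mod 4k, so the natural lookup table has one entry per residue class mod 4k. Below is a rough sketch of that idea (my own illustration; sr1sieve's actual tables are organized per subsequence, but the scaling is the same).

[code]
// Rough sketch of a Legendre (really Jacobi) symbol lookup table.
// Hypothetical code; not sr1sieve's or srsieve2's actual table layout.
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Jacobi symbol (a/n) for odd n > 0; returns -1, 0 or +1
static int jacobi(uint64_t a, uint64_t n)
{
   int s = 1;
   a %= n;
   while (a != 0)
   {
      while ((a & 1) == 0)                    // (2/n) = -1 iff n ≡ 3 or 5 (mod 8)
      {
         a >>= 1;
         if ((n & 7) == 3 || (n & 7) == 5) s = -s;
      }
      std::swap(a, n);                        // quadratic reciprocity
      if ((a & 3) == 3 && (n & 3) == 3) s = -s;
      a %= n;
   }
   return (n == 1) ? s : 0;
}

// one bit per residue class r mod 4k, set iff (k/r) = +1; for k near 1e9 this is
// about 4e9 bits (~500 MB), and this brute-force fill is also why it takes time
static std::vector<bool> buildLegendreTable(uint64_t k)
{
   std::vector<bool> table(4 * k, false);

   for (uint64_t r = 1; r < 4 * k; r += 2)    // only odd residues matter for odd primes
      if (jacobi(k, r) == 1)
         table[r] = true;

   return table;
}

int main()
{
   // tiny example: k = 19; roughly half of the odd residue classes mod 76 allow (k/p) = +1
   std::vector<bool> table = buildLegendreTable(19);

   size_t count = 0;
   for (bool bit : table) count += bit;

   printf("%zu of %zu residue classes have (k/p) = +1\n", count, table.size());
   return 0;
}
[/code]

At sieve time the lookup is just table[p % (4*k)], which is why the table only has to be built once per k.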

[QUOTE]Optimizations will be different when many k's are being sieved simultaneously.
Optimizations will be different for a 2-3 k sieve and when 100 k's are being sieved. These require different logic paths.

The program needs to have different logic/optimization paths based on the number of k's, the size of the N range, and the weight of the k's.

#3) I think we are talking about different things. Even though I called it a hashmap, I meant more of a bitmap. Can you explain what you implemented in the past?[/QUOTE]

What is implemented in srsieve2 is a C++ friendly version of what was in srsieve. It is very likely that something faster could be written. I don't recall what I coded, only that it was slower. I never dug into why.

[QUOTE]#4) This would be easy to implement for 1-2 k's with low weight and a small N range, as memory requirements are minimal.
(I am not sure how to do this for a large number of k's, with a large N range and heavy-weight k's, using BSGS on a GPU.)

I will stick with improving srsieve2.exe first. We might be able to make other sieves faster as well (I haven't looked at the code).

Does anyone else have any suggestions?[/QUOTE]

AFAIAC, srsieve2 is intended to replace srsieve, sr1sieve, and sr2sieve. It is faster than srsieve for most ranges I have worked with. It is about as fast as sr2sieve (when not using Legendre tables). I suspect that it can be faster than sr2sieve without all of the complexities of sr2sieve. Granted, I have the advantage of only targeting 64-bit CPUs and have AVX at my disposal on most modern CPUs, although AVX only gives a speed boost to certain sieves. I haven't played with AVX2 since I have one computer (at most) that supports it.

I'm thinking that a faster implementation of the HashTable class would be a good start.
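For anyone who wants to experiment: roughly speaking, that table maps a 64-bit residue (a baby step in the BSGS search) to its exponent. Below is a bare-bones open-addressing sketch to start from; the names and interface are hypothetical, not mtsieve's actual HashTable class, and the interesting tuning is in the hash/probing strategy and in reusing buffers across primes.

[code]
// Bare-bones open-addressing map from a 64-bit residue to a 32-bit exponent.
// Hypothetical sketch only; not mtsieve's HashTable interface.
#include <cstdint>
#include <cstdio>
#include <vector>

class BabyStepMap
{
public:
   // capacity is rounded up to a power of two and kept at most half full
   explicit BabyStepMap(uint32_t expectedEntries)
   {
      size = 1;
      while (size < 2 * (uint64_t) expectedEntries) size <<= 1;
      keys.assign(size, EMPTY);
      values.resize(size);
   }

   void insert(uint64_t key, uint32_t value)
   {
      uint64_t h = hash(key);
      while (keys[h] != EMPTY)                  // linear probing
         h = (h + 1) & (size - 1);
      keys[h] = key;
      values[h] = value;
   }

   // returns true and sets value if key is present
   bool find(uint64_t key, uint32_t &value) const
   {
      for (uint64_t h = hash(key); keys[h] != EMPTY; h = (h + 1) & (size - 1))
         if (keys[h] == key) { value = values[h]; return true; }
      return false;
   }

private:
   static constexpr uint64_t EMPTY = UINT64_MAX; // assumes residues are < 2^64-1

   uint64_t hash(uint64_t key) const
   {
      key *= 0x9E3779B97F4A7C15ULL;              // golden-ratio (Fibonacci) hashing
      return (key >> 32) & (size - 1);
   }

   uint64_t size;
   std::vector<uint64_t> keys;
   std::vector<uint32_t> values;
};

int main()
{
   BabyStepMap map(1000);

   map.insert(123456789ULL, 7);
   map.insert(987654321ULL, 11);

   uint32_t j;
   if (map.find(123456789ULL, j))
      printf("exponent %u\n", j);                // prints "exponent 7"

   return 0;
}
[/code]

A real replacement would also want to reuse its buffers across primes rather than allocating per prime.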

Citrix 2020-04-24 03:54

[QUOTE=rogue;543610]

The Legendre table is easy to generate, but for larger k (> 1e9) it can take more time and a lot of memory to build the table.

[/QUOTE]

For large k values, unless the k is very smooth, it might be faster to use less memory and use the power residue code (2) instead of a Legendre table.

Legendre tables at best would reduce the number of primes being tested by 50% and are mainly suited for small k and base values.

For a large number of k's being sieved together, Legendre tables are not worth it. They increase the overhead significantly, as you previously mentioned.

henryzz 2020-04-24 10:35

[QUOTE=Citrix;543614]For large k values, unless the k is very smooth, it might be faster to use less memory and use the power residue code (2) instead of a Legendre table.

Legendre tables at best would reduce the number of primes being tested by 50% and are mainly suited for small k and base values.

For a large number of k's being sieved together, Legendre tables are not worth it. They increase the overhead significantly, as you previously mentioned.[/QUOTE]

To solve this sort of issue there probably just needs to be a switch that turns off the Legendre tables, like in sr2sieve. People should be able to decide for themselves. Making it choose automatically would just get complicated.

rogue 2020-04-24 12:29

[QUOTE=henryzz;543625]To solve this sort of issue there probably just needs to be a switch that turns off the Legendre tables, like in sr2sieve. People should be able to decide for themselves. Making it choose automatically would just get complicated.[/QUOTE]

Use -x to turn off the tables in sr2sieve, but at that point you might as well compare to srsieve2 as it might be faster.

If I wasn't clear above, I'm hoping for others to contribute code, not just ideas.

pepi37 2020-04-24 13:06

[QUOTE=rogue;543630]Use -x to turn off the tables in sr2sieve, but at that point you might as well compare to srsieve2 as it might be faster.

If I wasn't clear above, I'm hoping for others to contribute code, not just ideas.[/QUOTE]


Rogue, as we discussed in PM, you must fix srsieve2 to go above 2^52, because this limit is low. What does a user gain from srsieve2 being faster than sr1sieve if the upper limit is only 2^52?


Maybe you should fix this first.

rogue 2020-04-24 14:43

[QUOTE=pepi37;543633]Rogue, as we discussed in PM, you must fix srsieve2 to go above 2^52, because this limit is low. What does a user gain from srsieve2 being faster than sr1sieve if the upper limit is only 2^52?

Maybe you should fix this first.[/QUOTE]

When not using Legendre tables, srsieve2 appears to be faster than sr2sieve for most sequences. Based upon testing I have done, sr1sieve is much faster than srsieve2, but that is expected.

I agree that getting srsieve2 to support p > 2^52 is important. Although the code changes are not too difficult, if I want to maximize speed for both p < 2^52 and p > 2^52 it becomes more of a challenge. At this time I might need to have two different builds. I would like to avoid that.

For now you will have to survive using sr2sieve for p > 2^52.
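For context, the straightforward portable path above 2^52 is a 128-bit intermediate product, along the lines of the sketch below (a hedged illustration of one possible fallback, not what srsieve2 or the srXsieve ASM actually does). The catch is that the 128-bit remainder is much slower than the specialised sub-2^52 routines, which is why a single build that is optimal on both sides of the limit is awkward.

[code]
// Portable fallback mulmod valid for any 64-bit p, using a 128-bit intermediate.
// Hypothetical sketch of one possible p > 2^52 path (gcc/clang __int128).
#include <cstdint>
#include <cstdio>

static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t p)
{
   return (uint64_t) ((unsigned __int128) a * b % p);   // the 128-bit remainder is the slow part
}

int main()
{
   uint64_t p = 2305843009213693951ULL;   // 2^61-1, well above the 2^52 limit
   printf("%llu\n", (unsigned long long) mulmod64(p - 1, p - 1, p));   // prints 1
   return 0;
}
[/code]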

rogue 2020-04-25 21:33

1 Attachment(s)
I have looked at the various mulmod routines available to me in x86 ASM to determine their relative speed. The i386 and sse ASM functions used by srXsieve will not compile using -m64, and as nobody (and I mean nobody) has asked for a 32-bit build in many years, I won't consider them going forward.

In this table the first two functions are part of mtsieve. The vec_mulmod_xx functions are from srXsieve. It is a little challenging to take the numbers "as is" because the functions from mtsieve accept 4 (sse) and 32 (avx) divisors as they sieve multiple primes concurrently, but the vec_mulmod_xx functions accept only 1.

[code]
sse_mulmod_4a_4b_4p    718750 ms   1796 ms (per 1e6 mulmod)
vec_avx_mulmod        3750000 ms   1171 ms (per 1e6 mulmod)
vec2_mulmod64_mmx    29296875 ms    610 ms (per 1e6 mulmod)
vec4_mulmod64_mmx    12187500 ms    253 ms (per 1e6 mulmod)
vec6_mulmod64_mmx     8015625 ms    166 ms (per 1e6 mulmod)
vec8_mulmod64_mmx     6015625 ms    125 ms (per 1e6 mulmod)
vec2_mulmod64_fpu    29781250 ms    620 ms (per 1e6 mulmod)
vec4_mulmod64_fpu    13468750 ms    280 ms (per 1e6 mulmod)
vec6_mulmod64_fpu     8875000 ms    184 ms (per 1e6 mulmod)
vec8_mulmod64_fpu     6078125 ms    126 ms (per 1e6 mulmod)
[/code]

The vec_mulmod_fpu functions support p up to 2^62 whereas the vec_mulmod_mmx functions support p up to 2^52.
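For anyone wondering where those limits come from: as I understand these routines, the 2^52-limited ones estimate the quotient with a 53-bit double, so p has to stay below about 2^52 for the correction step to be exact, while the fpu routines use the x87 80-bit format (64-bit mantissa), which stretches the same trick to about 2^62. Here is a rough C sketch of the underlying idea (illustrative only, not the actual ASM, which also precomputes the reciprocal once per prime):

[code]
// Rough sketch of the float-reciprocal mulmod trick used when p < 2^52.
// Illustrative only; the real routines are hand-written SSE/x87 ASM.
#include <cstdint>
#include <cstdio>

static inline uint64_t mulmod52(uint64_t a, uint64_t b, uint64_t p, double pInv /* = 1.0/p */)
{
   // estimate q = floor(a*b / p); the double product has only 53 bits of precision,
   // so the estimate can be off by a unit or two, which the corrections below absorb
   uint64_t q = (uint64_t) ((double) a * (double) b * pInv);

   // a*b and q*p both overflow 64 bits, but their difference is small, so the
   // wrapped (mod 2^64) subtraction still gives the correct signed value
   int64_t r = (int64_t) (a * b - q * p);

   while (r < 0)            r += (int64_t) p;
   while (r >= (int64_t) p) r -= (int64_t) p;

   return (uint64_t) r;
}

int main()
{
   uint64_t p = (1ULL << 51) + 129;                    // arbitrary modulus below 2^52
   uint64_t a = 1234567890123ULL, b = 987654321098ULL;

   uint64_t expected = (uint64_t) ((unsigned __int128) a * b % p);

   // the two printed values should match
   printf("%llu %llu\n", (unsigned long long) mulmod52(a, b, p, 1.0 / p),
                         (unsigned long long) expected);
   return 0;
}
[/code]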

It might be possible to convert some of the existing sieves in the framework to use the functions from srXsieve, but I haven't looked into that yet. I will likely only include the vec8_mulmod64_fpu function in mtsieve because the others are just not as fast and because supporting multiple variants adds unnecessary complexity to the framework. Also, the vec_mulmod_mmx functions require a global variable, which mtsieve will not support as it is multi-threaded. That could probably be changed, but I don't know if it would impact performance.

I have attached the code (containing a Windows exe). You can compile it on your own with:

[code]
gcc *mulmod*.S *fini.S avx*S test.c -m64 -O2 -o mulmod_test
[/code]

I am curious what other CPUs output, to see whether these numbers are consistent. The compiled program takes one parameter, a number, which is multiplied by 1e6 to get the number of calls made to the function in question. The numbers above were run with an input parameter of 100 while no other CPU-intensive programs were running.

storm5510 2020-04-25 23:47

1 Attachment(s)
The attached image shows my results. Same parameter, 100, and no other intensive processes running. This is my i7-7700 running Windows 10 v1903. CPU utilization was 15%, running at 4.15 GHz.

