mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   mtsieve enhancements (https://www.mersenneforum.org/showthread.php?t=25486)

rogue 2020-04-23 15:16

mtsieve enhancements
 
This thread is a placeholder for people who want to make improvements to mtsieve. This isn't about "I want this feature" (use the other thread), but about code changes to improve performance. I would expect posts to this thread to be centered around algorithms and code that implements those algorithms.

Citrix 2020-04-24 02:16

[QUOTE=rogue;543533]I appreciate your feedback.

As for #1, I'm not certain if you are referring to srsieve2 or srsieve/sr1sieve/sr2sieve. If srsieve2, I'm open to suggestions. Note that I aim for easy of development/support over speed. Sacrificing up to 1 percent to achieve that is acceptable because I can typically make up for it in other areas. Also for srsieve2, it is still a work in progress. You have probably noticed that it is designed to switch the helper/worker if certain conditions are met as each helper/worker will have very different logic meant to optimize in certain ways.

As for #3, I did try to implement a hashmap, but the performance was worse. I didn't dig into determining why as the implementation I had was supposed to be one of the fastest.

As for #4, it is on my radar, but I haven't don anything yet. My focus has been on implementing the Legendre tables that sr1sieve and sr2sieve have, but bsgs code using those tables a hornet's nest.

I'm open to any suggestions. If you want to play around with the code and try your hand at some optimizations, I'm all for it.

To continue this, we can either move this discussion to the mtsieve thread, to a new thread, or take offline via PM or e-mail. If we keep it online then other participants of this forum can add their knowledge to the discussion or can learn from it.[/QUOTE]

For #1) I was referring to all the various srXsieveY versions. They try to implement various different algorithms/optimizations that most users do not need or unnecessarily makes the program slow. I agree with making the code as straight forward as possible.

We need two versions/logic paths of srsieve2 - one for heavy weight and large N ranges searches and one for light weight and small N range searches. These require different optimizations. One size fits all does not work here.

Most projects/people need the light weight/small N range version but srsieve was designed for heavy weight/large N sequences. All the CRUS work is low weight.

For a heavy weight and large N range the overhead might be justified for optimizations but not for the light weight k and smaller N range as it only slows the program down.

There are 3 main optimizations/algorithms that need to be tinkered:
1. Deciding on the best Q
2. Power residue
3. Legendre symbols
Note:- There is overlap between 2 and 3.

Optimizations will be different when many k's are being sieved simultaneously.
Optimizations will be different for 2-3k sieve and when 100k are being sieved. These require different logic paths.

The program needs to have different logic optimization paths based on number of k and size of range and weight of k.

#3) I think we are taking about different things. Even though I called it a hashmap - I meant more of a bitmap. Can you explain what you implemented in the past.

#4) This would be easy to implement for 1-2 k and low weight/small N range as memory requirements are minimal.
(I am not sure how to do this for large number of k, with large N range and heavy weight K using the BSGS on GPU).

I will stick with improving srsieve2.exe first. We might be able to make other sieves faster as well (I haven't looked at the code).

Anyone else has any suggestions?

:coffee:

VBCurtis 2020-04-24 03:17

[QUOTE=Citrix;543604]Anyone else has any suggestions?

:coffee:[/QUOTE]

It has been said before about e.g. gpuowl, but it bears repeating here that it's enjoyable to view discussions about code / algo enhancements. I like to think I learn a little more about the sieve process, and I appreciate that such things are discussed in public. a little less of a black box!

rogue 2020-04-24 03:18

[QUOTE=Citrix;543604]For #1) I was referring to all the various srXsieveY versions. They try to implement various different algorithms/optimizations that most users do not need or unnecessarily makes the program slow. I agree with making the code as straight forward as possible.[/QUOTE]

Which is why most of that is gone in srsieve2.

[QUOTE]We need two versions/logic paths of srsieve2 - one for heavy weight and large N ranges searches and one for light weight and small N range searches. These require different optimizations. One size fits all does not work here.

Most projects/people need the light weight/small N range version but srsieve was designed for heavy weight/large N sequences. All the CRUS work is low weight.

For a heavy weight and large N range the overhead might be justified for optimizations but not for the light weight k and smaller N range as it only slows the program down.

There are 3 main optimizations/algorithms that need to be tinkered:
1. Deciding on the best Q
2. Power residue
3. Legendre symbols
Note:- There is overlap between 2 and 3.
[/QUOTE]

The Legendre table is easy to generate, but for larger k (> 1e9) it can take more time and a lot of memory to build the table.

[QUOTE]Optimizations will be different when many k's are being sieved simultaneously.
Optimizations will be different for 2-3k sieve and when 100k are being sieved. These require different logic paths.

The program needs to have different logic optimization paths based on number of k and size of range and weight of k.

#3) I think we are taking about different things. Even though I called it a hashmap - I meant more of a bitmap. Can you explain what you implemented in the past.[/QUOTE]

What is implemented in srsieve2 is a C++ friendly version of what was in srsieve. It is very likely that something faster could be written. I don't recall what I coded, only that it was slower. I never dug into why.

[QUOTE]#4) This would be easy to implement for 1-2 k and low weight/small N range as memory requirements are minimal.
(I am not sure how to do this for large number of k, with large N range and heavy weight K using the BSGS on GPU).

I will stick with improving srsieve2.exe first. We might be able to make other sieves faster as well (I haven't looked at the code).

Anyone else has any suggestions?[/QUOTE]

AFAIAC, srsieve2 is intended to replace srsieve, sr1sieve, and sr2sieve. It is faster than srsieve for most ranges I have worked with. It is about as fast as sr2sieve (when not using Legendre tables). I suspect that it can be faster than sr2sieve without all of the complexities of sr2sieve. Granted I have the advantage of only targeting 64-bit CPUs and have AVX at my disposal or most modern CPUs, although AVX only gives a speed boost to certain sieves. I haven't played with AVX2 since I have one computer (at most) that supports it.

I'm thinking that a faster implementation of the HashTable class would be a good start.

Citrix 2020-04-24 03:54

[QUOTE=rogue;543610]

The Legendre table is easy to generate, but for larger k (> 1e9) it can take more time and a lot of memory to build the table.

[/QUOTE]

For large k values unless the k is very smooth it might be faster to use less memory and use the power residue code (2) instead of a Legendre table.

Legendre tables at best would reduce the number of primes being tested by 50% and are mainly suited for small k and base values.

For large number of k being sieved together Legendre tables are not worth it. They increase the overhead significantly as you previously mentioned.

henryzz 2020-04-24 10:35

[QUOTE=Citrix;543614]For large k values unless the k is very smooth it might be faster to use less memory and use the power residue code (2) instead of a Legendre table.

Legendre tables at best would reduce the number of primes being tested by 50% and are mainly suited for small k and base values.

For large number of k being sieved together Legendre tables are not worth it. They increase the overhead significantly as you previously mentioned.[/QUOTE]

To solve this sort of issue there probably just needs to be a switch that turns off the legendre tables like in sr2sieve. People should be able to decide for themselves. Making it choose itself would just get complicated.

rogue 2020-04-24 12:29

[QUOTE=henryzz;543625]To solve this sort of issue there probably just needs to be a switch that turns off the legendre tables like in sr2sieve. People should be able to decide for themselves. Making it choose itself would just get complicated.[/QUOTE]

Use -x to turn off the tables in sr2sieve, but at that point you might as well compare to srsieve2 as it might be faster.

If I wasn't clear above, I'm hoping for others to contribute code, not just ideas.

pepi37 2020-04-24 13:06

[QUOTE=rogue;543630]Use -x to turn off the tables in sr2sieve, but at that point you might as well compare to srsieve2 as it might be faster.

If I wasn't clear above, I'm hoping for others to contribute code, not just ideas.[/QUOTE]


Rogue as we write in PM, you must fix srsieve2 to go upper then 2^52 because this limit is low. What user have from the fact srsieve2 is faster then sr1sieve if upper limit is low (2^52)


Maybe you should fix this first

rogue 2020-04-24 14:43

[QUOTE=pepi37;543633]Rogue as we write in PM, you must fix srsieve2 to go upper then 2^52 because this limit is low. What user have from the fact srsieve2 is faster then sr1sieve if upper limit is low (2^52)

Maybe you should fix this first[/QUOTE]

When not using Legendre tables, srsieve2 appears to be faster than sr2sieve for most sequences, Based upon testing I have done sr1sieve is much faster than srsieve2, but that is expected.

I agree that getting srsieve2 to support p > 2^52 is important. Although the code changes are not too difficult, if I want to maximize speed for p < 2^52 and p > 2^52 it becomes more of a challenge. At this time I might need to have two different builds. I would like to avoid that.

For now you will have to survive using sr2sieve for p > 2^52.

rogue 2020-04-25 21:33

1 Attachment(s)
I have looked at the various mulmod routines available to me in x86 ASM to determine their relative speed. The i386 and sse ASM functions used by srXsieve will not compile using -m64 and as nobody (and I mean nobody) has asked for a 32-bit build in many years, I won't consider them going forward.

In this table the first two functions are part of mtsieve. The vec_mulmod_xx functions are from srXsieve. It is a little challenging to take the numbers "as is" because the functions from mtsieve accept 4 (sse) and 32 (avx) divisors as they sieve multiple prime concurrently, but the vec_mulmod_xx functions accept only 1.

[code]
sse_mulmod_4a_4b_4p 718750 ms 1796 ms (per 1e6 mulmod)
vec_avx_mulmod 3750000 ms 1171 ms (per 1e6 mulmod)
vec2_mulmod64_mmx 29296875 ms 610 ms (per 1e6 mulmod)
vec4_mulmod64_mmx 12187500 ms 253 ms (per 1e6 mulmod)
vec6_mulmod64_mmx 8015625 ms 166 ms (per 1e6 mulmod)
vec8_mulmod64_mmx 6015625 ms 125 ms (per 1e6 mulmod)
vec2_mulmod64_fpu 29781250 ms 620 ms (per 1e6 mulmod)
vec4_mulmod64_fpu 13468750 ms 280 ms (per 1e6 mulmod)
vec6_mulmod64_fpu 8875000 ms 184 ms (per 1e6 mulmod)
vec8_mulmod64_fpu 6078125 ms 126 ms (per 1e6 mulmod)
[/code]

The vec_mulmod_fpu functions support p up to 2^62 whereas the vec_mulmod_mmx support p up to 2^52.

It might be possible to convert some of the existing sieves in the framework to use the functions from srXsieve, but I haven't looked into that yet. I will likely only include the vec8_mulmod64_fpu function in mtsieve because the others just are not as fast and because supporting multiple just adds unnecessary complexity to the framework and the vec_mulmod_mmx functions require a global variable, which mtsieve will not support as it is multi-threaded. That could probably be changed, but I don't know if it would impact performance.

I have attached the code (containing a Windows exe). You can compile on your own with: gcc *mulmod*.S *fini.S avx*S test.c -m64 -O2 -o mulmod_test. I am curious as to what other CPUs output to see if these numbers are consistent. The compiled program takes 1 parameter, a number. That number is multiplied by 1e6 for the number of calls made to the function in question. The numbers above were run with the input parameter of 100 while no other CPU intensive programs were running.

storm5510 2020-04-25 23:47

1 Attachment(s)
The attached image. Same parameter, 100, and no other intensive processes running. This is my i7-7700 running Widows 10 v1903. CPU utilization was 15% running at 4.15 GHz.

Citrix 2020-04-26 00:50

[QUOTE=rogue;543610]
I'm thinking that a faster implementation of the HashTable class would be a good start.[/QUOTE]

For converting the hash table to bitmap (bit index) this is what I was thinking of:-

Step 1:-
For BSGS we want to compare G[i] to B[j] for giant and baby step respectively.
Assuming there are G_n giant steps and B_n baby steps.

Then we calculate G[i] (mod p) and B[j] (mod p)

We want to store the baby step value (could be the other way around also)

Instead of comparing G[i] and B[j] we want to limit our search space by using probability. This can reduce the amount of memory space needed and search operations but can produce false positive. There will be a inverse relationship between how much memory we use and how many false positives we get.

Let G[i]=B[j] (mod p)
then [G[i]=B[j] (mod p)](mod n)
G[i]%n=GN[i]
B[j]%n=BN[j]
GN[i]=BN[j]


We will store BN[j] for comparison with GN[i]
** if n is a power of 2 (2^x) then this modular arithmetic can be done at no extra cost.

So we do not need to store the full 64 bits of G[i] and B[j]. Effectively we have reduced our search space.

We can store these in a hash table or use Step 2 below in certain cases to speed up even further.

For very small B_n range (<5-10) we can just use an array instead of hash table which should be faster. This would happen if there were too few candidates left in the sieve.

Step 2:-
For small n we can create an bit index (array of bits) to store all B[j] values. Does not work for large n values and would need to use a regular hash table.

At default all values in the array are set to zero.

Then for each BN[j] value you convert the corresponding location in the array to 1.

For each GN[i] you read the corresponding value from the array.

Complexity to add a value is O(1) and to search a value is O (1)

If you do get a positive result (=1) it suggests that G[i] might produce a factor and then you need to check G[i] against B[1..j]. This will differentiate between false positive and true factors.

This might not need to be done if there were no values corresponding to a particular G[i] in the candidate left range.

To check G[i] against B[1..j] instead of repeating B_n steps this can be done in 2*sqrt(B_n) steps by repeating the BSGS algorithm again on this subrange.

[CODE]
Bit array [n];
for (int a=0; a<n; a++) {array[a]=0;}
int b=BN[j];
array[b]=1;
int g=GN[i];
if(array[g]==1) {// possible factor found and need to double check}
[/CODE]


2) Choosing the size of n

We want to reduce the number of false positive and reduce the amount of memory used
Larger the n the fewer the number of false positive and more the amount of memory used.

Percent chances of a positive result will be (G_n*B_n*100)/n

For a 1% false positive rate n will have to be around G_n*B_n*100

The false positive rate should be adjusted to the speed up from using less memory/array structure vs extra work needed to test the false positive.

n values will quickly become very large and this is best suited for low weight candidates or small ranges being searched with a few k.

Happy5214 2020-04-26 02:53

My desktop's Core 2 doesn't have AVX, so this is from my laptop (Intel Core i3-4012Y, Kubuntu 18.04):

[CODE]sse_mulmod_4a_4b_4p 2018927 ms 5047 ms (per 1e6 mulmod)
vec_avx_mulmod 21365975 ms 6676 ms (per 1e6 mulmod)
vec2_mulmod64_mmx 82026202 ms 1708 ms (per 1e6 mulmod)
vec4_mulmod64_mmx 34685917 ms 722 ms (per 1e6 mulmod)
vec6_mulmod64_mmx 24030532 ms 500 ms (per 1e6 mulmod)
vec8_mulmod64_mmx 18026937 ms 375 ms (per 1e6 mulmod)
vec2_mulmod64_fpu 80305117 ms 1673 ms (per 1e6 mulmod)
vec4_mulmod64_fpu 38246894 ms 796 ms (per 1e6 mulmod)
vec6_mulmod64_fpu 27558843 ms 574 ms (per 1e6 mulmod)
vec8_mulmod64_fpu 18187244 ms 378 ms (per 1e6 mulmod)[/CODE]

As an aside, I had to add a definition before an include to get it to compile, so it's now
[code] #define __USE_GNU
#include <sys/resource.h>[/code]

rogue 2020-04-26 12:18

I am not concerned about memory usage on a CPU as much as I would be on a GPU, but if the lookup could be faster, that would be helpful. Please write some code so that we can see how it performs.

rogue 2020-04-26 14:16

1 Attachment(s)
I added the mtsieve function fpu_mulmod_4a_4b_4p to this. I didn't include it previously because it I didn't use it correctly which led to incorrect output. Here is the updated output:

[code]
fpu_mulmod_4a_4b_4p 890625 ms 2226 ms (per 1e6 mulmod)
sse_mulmod_4a_4b_4p 734375 ms 1835 ms (per 1e6 mulmod)
vec_avx_mulmod 3828125 ms 1196 ms (per 1e6 mulmod)
vec2_mulmod64_mmx 30625000 ms 638 ms (per 1e6 mulmod)
vec4_mulmod64_mmx 12359375 ms 257 ms (per 1e6 mulmod)
vec6_mulmod64_mmx 8187500 ms 170 ms (per 1e6 mulmod)
vec8_mulmod64_mmx 6078125 ms 126 ms (per 1e6 mulmod)
vec2_mulmod64_fpu 29906250 ms 623 ms (per 1e6 mulmod)
vec4_mulmod64_fpu 13546875 ms 282 ms (per 1e6 mulmod)
vec6_mulmod64_fpu 8968750 ms 186 ms (per 1e6 mulmod)
vec8_mulmod64_fpu 6031250 ms 125 ms (per 1e6 mulmod)
[/code]

fpu_mulmod_4a_4b_4p works like this:

[code]
a[0] = a[0] * b[0] mod p[0]
a[1] = a[1] * b[1] mod p[1]
a[2] = a[2] * b[2] mod p[2]
a[3] = a[3] * b[3] mod p[3]
[/code]

vec4_mulmod64_fpu works like this:

[code]
for (int i=0; i<count; i++)
a[0] = a[0] * b[0] mod p
a[1] = a[1] * b[1] mod p
a[2] = a[2] * b[2] mod p
a[3] = a[3] * b[3] mod p
[/code]

Now if mtsieve had a function that worked like this:

[code]
for (int i=0; i<count; i++)
a[0] = a[0] * b[0] mod p[0]
a[1] = a[1] * b[1] mod p[1]
a[2] = a[2] * b[2] mod p[2]
a[3] = a[3] * b[3] mod p[3]
[/code]

and had a speed faster than 1500 ms per 1e6 mulmod (compared to the above), that would be killer for the entire framework. Not only could that be used to give a 20% or better performance boost to most sieves, it would also allow most sieves to support sieving for p up to 2^62. vec4_mulmod64_fpu has very limited usage in the existing sieves of the framework, but the new function could be used in most, if not all of the sieves.

The sources to both of these are included in the attached and the usage of both is in the test.c file. For those of you are are not afraid of x86 assembly, I'd love you to take a crack at such a function.

storm5510 2020-04-26 15:46

1 Attachment(s)
Using your update. Same parameter as before. 100.

Citrix 2020-04-26 16:27

[QUOTE=rogue;543855]I am not concerned about memory usage on a CPU as much as I would be on a GPU, but if the lookup could be faster, that would be helpful. Please write some code so that we can see how it performs.[/QUOTE]

I am not sure how to integrate the code into mtsieve to do the comparison. The code is below:-

As I mentioned in the earlier post that it only is useful for small n range or low weight and might not work for a larger range due to memory requirements. It might not be of any practical use. It does use more memory but look up is significantly faster.

See: [url]https://docs.microsoft.com/en-us/cpp/standard-library/bitset-class?view=vs-2019[/url]
for more details on bitset.

[CODE]
#include <bitset> // built in STD library for bitmaps.

//BSGS algorithm
// For a range r=b*g for k*2^n+1
// then baby step = k*2^(1) to k*2^b -- b steps (mod p)
// giant step will be -2^(-n) to -(2^b) step 2^b --g steps (mod p)

const int g = 32; // number of giant steps
const int b = 32; // number of baby steps
const int c = 1024; // for 0.1% false positive rate. rate=1/c*100%
int n = g*b*c; //works better if n=2^m
std::bitset<n> bitmap; // by default in the constructor everything is set to zero/false

void main(void)
{
uint64_t giant[g];
uint64_t baby[b];
// replace arrays with mulmod code in mtsieve to generate giant and baby values (mod p)

for (int i = 0; i < b; i++)
{
uint32_t temp = baby[i];
temp = temp%n; // can use bit operations if n is a power of 2
bitmap.set(temp); //sets bit to 1
}
// we are done with setting up the table
// Now we will compare with the giant values
for (int j= 0; j < g; j++)
{
uint32_t temp = giant[j];
temp = temp%n; // can use bit operations if n is a power of 2
if (bitmap.test(temp))
{
// we might have a possible factor
// Possible factor will be k*2^(b*j)*2^x+1 =0 (mod p)
// Where x is between 1...b
// Need to search this range for factors
// Repeat BSGS for this smaller range.
}
}

bitmap.reset(); // sets everything to false
}
[/CODE]

Citrix 2020-04-26 19:26

[QUOTE=rogue;543860]
Now if mtsieve had a function that worked like this:

[code]
for (int i=0; i<count; i++)
a[0] = a[0] * b[0] mod p[0]
a[1] = a[1] * b[1] mod p[1]
a[2] = a[2] * b[2] mod p[2]
a[3] = a[3] * b[3] mod p[3]
[/code]
[/QUOTE]

Question for you:-
For the low weight sequences-
Assuming BSGS requires 64 steps each for a range of 4096
Would the following simpler algorithm be faster on CPU or GPU than BSGS for a 4096 range?
Multiple primes could be tested in parallel.
(Assuming we use BestQ and Legendre symbols in both algorithms).

[CODE]
void low_wt_sieve (int k, int base, int nmin, int nmax, int sign, int step, int p)
{
// checks if p divides k*base^n+-sign (over range of n with base^step size increments)

// calculate number of iterations
int number=(nmax-nmin)/step+1;
int start=(k*base^nmin)%p;
int step_exp=base^step%p;
for (int x=0; x<number; x++)
{
start=start*step_exp%p;
if (start+sign==p || start-sign==0)
{
//output factor
}
}
}
[/CODE]

Thanks.

rogue 2020-04-26 21:29

[QUOTE=Citrix;543882]Question for you:-
For the low weight sequences-
Assuming BSGS requires 64 steps each for a range of 4096
Would the following simpler algorithm be faster on CPU or GPU than BSGS for a 4096 range?
Multiple primes could be tested in parallel.
(Assuming we use BestQ and Legendre symbols in both algorithms).

[CODE]
void low_wt_sieve (int k, int base, int nmin, int nmax, int sign, int step, int p)
{
// checks if p divides k*base^n+-sign (over range of n with base^step size increments)

// calculate number of iterations
int number=(nmax-nmin)/step+1;
int start=(k*base^nmin)%p;
int step_exp=base^step%p;
for (int x=0; x<number; x++)
{
start=start*step_exp%p;
if (start+sign==p || start-sign==0)
{
//output factor
}
}
}
[/CODE][/QUOTE]

As for GPU vs CPU, I don't know the answer without running any tests. The issue with the GPU is the usage of memory. Although GPUs have a lot of memory, each "thread" typically accesses only a small portion of that. You can use global memory that all threads can access, but it is my understanding that global memory access is expensive. I only have access to AMD GPUs, so OpenCL is the only choice, in other words, no CUDA. Fortunately OpenCL can be run on NVIDIA GPUs.

As for your other post using bitset, I'll look into it further to see if it can be used.

rogue 2020-04-27 01:42

Upon reviewing the code, vec4_mulmod64_fpu works like this:

vec4_mulmod64_fpu works like this:

[code]
for (int i=0; i<count; i++)
y[0] = x[0] * b mod p
y[1] = x[1] * b mod p
y[2] = x[2] * b mod p
y[3] = x[3] * b mod p
[/code]

so it isn't a fair comparison between the two as b/p is precomputed. For the function I'm looking for, I cannot do that as b and p are variable. In doing more investigation into the performance of fpu_mulmod_4a_4b_4p, it seem that the overhead of calling the function is driving up the time to execute it but quite a bit. I think that is where I need to look. For some sieves one can also pre-compute all b as a long double (factorial and primorial for example). Since factorial is already GPU enabled, it might not benefit that sieve as much as primorial. It would probably only make sense to use a new function for p > 1e8 (or larger) because the current function will be faster at finding terms with small factors. The new function I'm thinking of would be better when the factors are further apart. Does anyone want to take a crack at it?

Does anyone know if there is an additional latency to using fmul with a memory address instead of st(x)? It doesn't appear that there is one from Agner Fog's tables, but I don't know if anyone has practical experience who can also answer the question.

rogue 2020-04-27 15:37

As I think about it more, primorial and factorial won't benefit that much because they have a specialized routine that reduces the number of asm function calls.

rogue 2020-04-27 18:21

I realized that I wasn't computing the runtimes correctly for the vec_xx functions. Here are updated numbers, which make a lot more sense now. I'm sorry that I didn't discover my mistake sooner. I assume that nobody has wasted time on this besides me.

[code]
fpu_mulmod_4a_4b_4p 859375 ms 2148 ms (per 1e6 mulmod)
sse_mulmod_4a_4b_4p 703125 ms 1757 ms (per 1e6 mulmod)
vec_avx_mulmod 3750000 ms 1171 ms (per 1e6 mulmod)
vec2_mulmod64_mmx 57687500 ms 1201 ms (per 1e6 mulmod)
vec4_mulmod64_mmx 47265625 ms 984 ms (per 1e6 mulmod)
vec6_mulmod64_mmx 44859375 ms 934 ms (per 1e6 mulmod)
vec8_mulmod64_mmx 44953125 ms 936 ms (per 1e6 mulmod)
vec2_mulmod64_fpu 58921875 ms 1227 ms (per 1e6 mulmod)
vec4_mulmod64_fpu 50531250 ms 1052 ms (per 1e6 mulmod)
vec6_mulmod64_fpu 50812500 ms 1058 ms (per 1e6 mulmod)
vec8_mulmod64_fpu 42796875 ms 891 ms (per 1e6 mulmod)
[/code]

Further analysis reveals that access to memory is a killer for performance in the asm routines.

Citrix 2020-05-27 04:26

gcwsieve enhancements
 
[CODE]

// We're trying to determine if n*b^n (mod p) = +1/-1. This requires an two
// operations, an exponentiation (b^n) and a multiplication (n). We can make a
// change to eliminate one of those operations and convert the second one into addition which // should be faster. Doing this requires us to
// compute the multiplicative inverse of b. The multiplicative inverse of
// b (which I will call B) is the number such at b*B = 1 mod p. Here is how
// the algorithm works.
//
// We start with our number:
// n*b^n (mod p) = +1/-1
// We then generate B such that b*B=1 (mod p)
// // To get then next term (which I will call N) where N>n , we need to
// check the following
// N=n+r
// N*b^N(mod p) = +1/-1
// (n+r)*b^(n+r) (mod p) = +1/-1
// We multiply both sides by B^r
// (n+r)*b^(n+r))B^r (mod p) = +B^r/-B^r
// (n+r)*b^n (mod p) = +B^r/-B^r
// n*b^n + r*b^n (mod p) = +B^r/-B^r
// We already know n*b^n and b^n are from the previous step.

// (Array 1) Now we pre-compute +-B^r for a range of r since we will be using it multiple times (1 to rmax = M*sqrt(nmax-nmin))
// (Array 2) for each n*b^n in the loop we pre-compute r*b^n for small consecutive values of r (1 to rdiffmax) – can be done by modular addition
// We can calculate each successive candidate by just adding a term from Array 2 and checking it with corresponding term from Array 1.

M for size of Array 1 can be adjusted for maximum efficiency depending on efficiency of the loops.

// once we reach rmax then we multiply last term by b^rmax and b^n by b^rmax
n*b^n + rmax*b^n (mod p)
(n+rmax)* b^n (mod p)
(n+rmax)* b^n*b^rmax(mod p)
n'*b^n'(mod p)
and also generate b^n'=b^n*b^rmax. (mod p)
// We can now repeat the loop again starting at n'.


Other suggestions
Your current code takes into account n values left mod 2 but not for other higher primes.
Code can be made faster if you only look at subsequences left for mod 6 or higher number. BestQ code from srsieve can be used here.
Depending on size of subsequence being sieved we can adjust size of the loops for maximum efficiency.

On a GPU if memory storage is a concern Array 1 can be reduced in size and Array 2 might not be worth saving. This will be still faster than current code.



[/CODE]

rogue 2020-05-27 15:41

[QUOTE=Citrix;546564][CODE]

// We're trying to determine if n*b^n (mod p) = +1/-1. This requires an two
// operations, an exponentiation (b^n) and a multiplication (n). We can make a
// change to eliminate one of those operations and convert the second one into addition which
// should be faster. Doing this requires us to
// compute the multiplicative inverse of b. The multiplicative inverse of
// b (which I will call B) is the number such at b*B = 1 mod p. Here is how
// the algorithm works.
//
// We start with our number:
// n*b^n (mod p) = +1/-1
// We then generate B such that b*B=1 (mod p)
//
// To get then next term (which I will call N) where N>n , we need to
// check the following
// N=n+r
// N*b^N(mod p) = +1/-1
// (n+r)*b^(n+r) (mod p) = +1/-1
// We multiply both sides by B^r
// (n+r)*b^(n+r))B^r (mod p) = +B^r/-B^r
// (n+r)*b^n (mod p) = +B^r/-B^r
// n*b^n + r*b^n (mod p) = +B^r/-B^r
// We already know n*b^n and b^n are from the previous step.

// (Array 1) Now we pre-compute +-B^r for a range of r since we will be using it multiple times (1 to rmax = M*sqrt(nmax-nmin))
// (Array 2) for each n*b^n in the loop we pre-compute r*b^n for small consecutive values of r (1 to rdiffmax) – can be done by modular addition
// We can calculate each successive candidate by just adding a term from Array 2 and checking it with corresponding term from Array 1.

M for size of Array 1 can be adjusted for maximum efficiency depending on efficiency of the loops.

// once we reach rmax then we multiply last term by b^rmax and b^n by b^rmax
n*b^n + rmax*b^n (mod p)
(n+rmax)* b^n (mod p)
(n+rmax)* b^n*b^rmax(mod p)
n'*b^n'(mod p)
and also generate b^n'=b^n*b^rmax. (mod p)
// We can now repeat the loop again starting at n'.

Other suggestions
Your current code takes into account n values left mod 2 but not for other higher primes.
Code can be made faster if you only look at subsequences left for mod 6 or higher number. BestQ code from srsieve can be used here.
Depending on size of subsequence being sieved we can adjust size of the loops for maximum efficiency.

On a GPU if memory storage is a concern Array 1 can be reduced in size and Array 2 might not be worth saving. This will be still faster than current code.
[/CODE][/QUOTE]

It took me a while, but I believe I see where you are going with this. Obviously as the range of n increases, so does the benefit.

Memory is definitely a concern on the GPU, but I think that something similar could be done there as well.

I will table your other suggestions because I don't understand them. Someone else (maybe you) would be interested implementing.

Citrix 2020-05-27 16:54

[QUOTE=rogue;546600]
I will table your other suggestions because I don't understand them.[/QUOTE]

I am suggesting using a few more primes, other than 2, like 3,5,7,13 etc (Covering sets)
The code below is from mtsieve for prime 2.

[CODE]

// If the base is odd, then all n must be even and thus the difference
// between any two remaining n for the base must also be even.
if (idx > 2 && ii_Base & 1)
{
if (idx & 1)
continue;

avx_set_16a(powers[idx-2]);
avx_set_16b(powers[2]);
avx_mulmod(dps, reciprocals);
avx_get_16a(powers[idx]);
}
else
{
avx_set_16a(powers[idx-1]);
avx_set_16b(powers[1]);
avx_mulmod(dps, reciprocals);
avx_get_16a(powers[idx]);
}


[/CODE]

rogue 2020-05-27 18:21

I see. This is how I have commented the non-AVX code:

[code]
// We're trying to determine if n*b^n (mod p) = +1/-1. This requires an two
// operations, an exponentiation (b^n) and a multiplcation (n). By precomputing
// certain values, we can actually replace the exponentiation and mulitplication
// with array lookups. Doing this requires us to compute the multiplicative inverse
// of b and to build two arrays. The multiplicative inverse of b (which I will call B)
// is the number such at b*B = 1 mod p.
//
// We start with our n where n=minN:
// n*b^n (mod p) = +1/-1
// -> n*b^n*B^n (mod p) = +1/-1 * B^n (mod p)
// -> n (mod p) = +1/-1 B^n (mod p)
//
// Find all distinct r staring with n=minN where n+r is the next term until n=maxN.
//
// Pre-compute B^r (mod p) for all possible r.
// Pre-compute n*b^n + r*b^n (mod p) for all possible r.
//
// For each successive term:
// (n+r)*b^(n+r) (mod p) = +1/-1
// -> (n+r)*b^(n+r)*B^r (mod p) = +/-B^r (mod p)
// -> (n+r)*b^n (mod p) = +/-B^r (mod p)
// -> n*b^n + r*b^n (mod p) = +/-B^r (mod p)
//
// Since both n*b^n + r*b^n (mod p) and +/-B^r (mod p) have been pre-cmoputed
// we just need to compare elements of two arrays.
//
// If n*b^n + r*b^n (mod p) = -B^r, then p is a factor of n*b^n+1.
// If n*b^n + r*b^n (mod p) = +B^r, then p is a factor of n*b^n-1
[/code]

I think that your suggestion for bestQ is not necessary as I build a table of "the covering set" of r infrequently.

There are now two loops for each group of primes. One to compute B^r (mod p) and n*b^n + r*b^n (mod p) and the one that detects if we have a factor. This has not been tested yet, so I have likely made mistakes along the way. Because of the need to convert so many doubles back to uint64_t, I'm not certain that an AVX version will be faster than this. That will need some testing. On the GPU side, I can make additional changes to eliminate unused elements of the left and right arrays.

[code]
while (ii_Terms[termIndex] > 0)
{
theN = ii_Terms[termIndex];

diff = prevN - theN;

for (pIdx=0; pIdx<4; pIdx++)
{
// If n*b^n + r*b^n (mod p) = -B^r, then p is a factor of n*b^n+1.
// If n*b^n + r*b^n (mod p) = +B^r, then p is a factor of n*b^n-1
if (il_LeftTerms[DIFF_INDEX(diff)+pIdx] == il_RightTerms[DIFF_INDEX(diff)+pIdx])
if (ip_CullenWoodallApp->ReportFactor(ps[pIdx], theN, -1))
VerifyFactor(ps[pIdx], theN, -1);

if (il_LeftTerms[DIFF_INDEX(diff)+pIdx] == ps[pIdx] - il_RightTerms[DIFF_INDEX(diff)+pIdx])
if (ip_CullenWoodallApp->ReportFactor(ps[pIdx], theN, +1))
VerifyFactor(ps[pIdx], theN, +1);
}

prevN = theN;
termIndex++;
};
[/code]

Citrix 2020-05-28 05:25

I am not sure if you understood the algorithm correctly.
Here is the pseudocode

[CODE]
Given a base b, nmin, nmax, prime p to sieve.

Choose a value C where C<=nmax-nmin && (nmax-nmin+1)%C==0; need to experimentally find the best C value so code is fastest. This will be constant for all primes. Can adjust nmax so it solves the second modular part.

C=0 or C=nmax-nmin are not the best solutions. The best solution is somewhere in between based on efficiency of the code. (see ***)

Then generate B such that b*B=1 (mod p)

Then create an Array Invs[] = B^0, B^1,..., B^C (mod p)
Then create an Array Invsm[] = p-B^0, p-B^1,..., p-B^C (mod p)

Then we create a loop starting at X=nmin with step C upto nmax (e.g. for(int X=nmin; X<nmax; X=X+C)) {

We generate b^X and X*b^X (in each iteration of the loop can calculate B^lastX*B^C=B^CurrentX and then multiply by X)

We create an array Array MaxDiff []= 0*b^X, 1*b^X, ...., m*b^X (mod p; modular addition)
m would be around 50 for base 2. Could be smaller or larger based on density of remaining candidates in file. Similar to size of MAX_POWERS in current code. Use mean and standard deviations to find best m value. The small values 1*b^X are rarely used and might not need them.


Now we find all remaining candidates in the sieve between n=X and X+C (read from a bitmap)

Let the first candidate be n=r1

Then we add X^b^x+MaxDiff[r1-X] (mod p) = y
Check if y==Invs[r1-X] (for Woodall) and y==Invsm[r1-X] (for Cullen) --> Print factors

Let the next candidate be n=r2

Then we add y+MaxDiff[r2-r1] (mod p) = y
Check if y==Invs[r2-X] (for Woodall) and y==Invsm[r2-X] (for Cullen) --> Print factors

and so on till we reach the last term under X+C

Need to check if difference between 2 consecutive remaining terms is greater than Size of MaxDiff then we would need an extra step (similar to current code)
}// end loop


Best Q can further help as some values in Array Invs and Invsm might not need to be generated. Need to choose C value that is a multiple of Best_Q and adjust the nmin value so it is also a multiple of the Best_Q.

*For odd bases MaxDiff only needs 1/2 the values - otherwise should not apply Best_Q to MaxDiff

Why are you converting from double to uint64_t? You can do modular addition in double type.



***Notes on calculating best C value
Work done = C/q mulmod in first loop + N_range/C*(2 mulmod + 50 add and 50 subtract for MaxDiff array) in second loop.
Where q is adjustment due to Best_Q improvements
50 add+ 50 subtract = t mulmod (base on CPU time)
Then we have C/q + (N_range* (2+t))/C needs to be the smallest so work done is fastest.

Then C= sqrt((N_range*(2+t)*q)) is the best solution.
Then Array Invs should be of size ~= sqrt(n_range)*sqrt((2+t)*q) for best results.

[/CODE]

I hope this makes more sense.

rogue 2020-05-28 14:38

Honestly I don't fully understand and can't spend the brain power on it at the moment. If you want to take a crack at writing the code per your algorithm, then go for it. I think you have the best chance of implementing the code per your vision.

Citrix 2020-05-28 20:41

I did try implementing the algorithm. There possibly is some small bug as it is not producing all the factors. I can't find it right now. I do not have a debugger for GCC to help me.

For base 2
I initially sieved range nmin=10,000 to nmax=100,000 to remove all small p (under 1M). There were ~ 3,600 candidates left.
I used this sieved file to test the following:-
The original code gives a speed of 33Kp/sec for p range 1G to 2G.
The new code gives a speed of 2.5Kp/sec for p range 1G to 2G.

Then I tested the code nmin=1,000,000 to nmax=2,000,000 from 1G to 2G (without removing any small terms).
The original code gives a speed of 128 p/sec for p range 1G to 2G.
The new code gives a speed of 804p/sec for p range 1G to 2G.

So 8 times faster but as the original code removes terms from the sieve the code would get faster and faster whereas the new algorithm would stay at almost the same speed.

For the new algorithm to be significantly faster than the old algorithm we would need to test a range of at least 100-400M.

Since no one that I know of is planning on sieving a range that large- I would hold off on further development of this code.

Citrix 2020-05-28 21:53

I am posting a simpler explanation of the algorithm here - in case some can come up with a faster implementation

[CODE]

Take all the remaining candidates in a range Nmin to Nmax
Let b be the base and B be the multiplicative inverse such that b*B = 1 mod p.

We choose a C= sqrt(Nmax-Nmin)
We then divide all n values into C different groups
In each group G_r={n1,n2,n3...,} such that n1=n2=n3=r (mod C)

We want to sieve each of these C groups separately.

For G_r group
since n1=r (mod C) then n1=k1*C+r
since n2=r (mod C) then n2=k2*C+r

The first term in this group is n1*b^n1+-1
The second term in this group is n2*b^n2+-1

We multiply all these terms by their corresponding B^(k*C) value

The first term
n1*b^n1= +-1 (mod p)
n1*b^n1*B^(k1*C)= +-B^(k1*C) (mod p)
Since we know that n1=k1*C+r we get
n1*b^r=+-B^(k1*C) (mod p)


The second term
n2*b^n2= +-1 (mod p)
n2*b^n2*B^(k2*C)= +-B^(k2*C) (mod p)
Since we know that n2=k2*C+r we get
n2*b^r=+-B^(k2*C) (mod p)

Note that B^(k*C) values will be used in all the C groups and we can save on this work instead of repeatedly calculating it.
Also note n2*b^r - n1*b^r = (k2-k1)*C*b^r
If we know C*b^r (which we can calculate for each group) we can generate all these terms by addition only.

If n1*b^r=-B^(k1*C) (mod p) then n1*b^n1+1 is divisible by p
If n1*b^r=B^(k1*C) (mod) p then n1*b^n1-1 is divisible by p

[/CODE]

rogue 2020-06-30 15:19

I see where my logic is wrong referring to what I was trying to implement. Not certain if I can resurrect.

rogue 2020-07-08 16:20

One place where the framework could benefit significantly is by switching from right-to-left exponentiation to left-to-right exponentiation. If anyone is interested in changing an ASM routine to use left-to-right exponentiation, please let me know.

Happy5214 2020-08-02 17:22

[QUOTE=rogue;550047]One place where the framework could benefit significantly is by switching from right-to-left exponentiation to left-to-right exponentiation. If anyone is interested in changing an ASM routine to use left-to-right exponentiation, please let me know.[/QUOTE]

I was thinking about doing this as an exercise in learning floating-point ASM, but the existing code appears to me to already be left-to-right, matching the algorithm in the source comments. Am I reading it wrong?

rogue 2020-08-02 18:44

[QUOTE=Happy5214;552329]I was thinking about doing this as an exercise in learning floating-point ASM, but the existing code appears to me to already be left-to-right, matching the algorithm in the source comments. Am I reading it wrong?[/QUOTE]

Sorry about that. Too many things swimming my brain.

henryzz 2020-08-03 13:50

Something worthwhile for mtsieve would be c versions of the asm functions so that it is portable to arm cpus. Arm asm versions would also be useful. One day I might do this although finding the time is an issue.

rogue 2020-08-03 14:03

[QUOTE=henryzz;552423]Something worthwhile for mtsieve would be c versions of the asm functions so that it is portable to arm cpus. Arm asm versions would also be useful. One day I might do this although finding the time is an issue.[/QUOTE]

Routines in c should be easy to write as most asm are variants of powmod or mulmod. Those are fairly each to code in c, but much harder to optimize.

It is my desire to buy an ARM based MacBook in the future and to write the asm routines for it at that time. Of course others are welcome to do that as well.

The downside of ARM is that the choices for programs to execute PRP tests is very limited.

Happy5214 2020-08-03 14:37

[QUOTE=rogue;552426]Routines in c should be easy to write as most asm are variants of powmod or mulmod. Those are fairly each to code in c, but much harder to optimize.

It is my desire to buy an ARM based MacBook in the future and to write the asm routines for it at that time. Of course others are welcome to do that as well.

The downside of ARM is that the choices for programs to execute PRP tests is very limited.[/QUOTE]
I'm planning to buy an ODROID N2+ in the coming days for the express purpose of learning and testing ARM assembly (and as a replacement for my old RPi 2 B+ that hasn't worked in years). I'll let you know when it arrives if I feel up to porting the ASM routines.

rogue 2020-08-03 14:59

[QUOTE=Happy5214;552429]I'm planning to buy an ODROID N2+ in the coming days for the express purpose of learning and testing ARM assembly (and as a replacement for my old RPi 2 B+ that hasn't worked in years). I'll let you know when it arrives if I feel up to porting the ASM routines.[/QUOTE]

Cool. I suggest that you start with fpu_mulmod function. That will likely be the easiest one to port. Most of the others can be built on top of that in one way or another. next up would by the 4x version of an fpu routine although I do not know what gains you can get on ARM by doing more than one mulmod concurrently and I don't know how many is optimal. I suspect that ARM does not have an 80-bit fpu, so it will be limited to p < 2^52. I also do not know if ARM has any vector instructions such like SSE or AVX on x86. You will notice that Worker.h has some builtin checks for AVX compatibility. You will likely need to add something similar to control ARM code paths.

henryzz 2020-08-03 15:00

The issue will be moving beyond 53 bits on non-x86.
Has Montgomery multiplication been tried in mtsieve? It wouldn't be applicable in all sieves but it may be faster for powmods.

Happy5214 2020-08-03 15:37

[QUOTE=rogue;552431]Cool. I suggest that you start with fpu_mulmod function. That will likely be the easiest one to port. Most of the others can be built on top of that in one way or another. next up would by the 4x version of an fpu routine although I do not know what gains you can get on ARM by doing more than one mulmod concurrently and I don't know how many is optimal. I suspect that ARM does not have an 80-bit fpu, so it will be limited to p < 2^52. I also do not know if ARM has any vector instructions such like SSE or AVX on x86. You will notice that Worker.h has some builtin checks for AVX compatibility. You will likely need to add something similar to control ARM code paths.[/QUOTE]

Yeah, no 80-bit floats on ARM. ARM does have NEON, which appears analogous to SSE and is available on all 64-bit ARM processors. There is a defined instruction set extension for larger vectors called Scalable Vector Extension (SVE), which provides an interface for vectors from 128-bit to 2048-bit, with the hardware register size being set at any 128-bit interval in that range. However, it doesn't appear that SVE is currently implemented in any commercially available general-purpose ARM CPU as of ~2018 (phones and SOCs included), so it's probably not worth coding at this point.

[QUOTE=henryzz;552432]The issue will be moving beyond 53 bits on non-x86.
Has Montgomery multiplication been tried in mtsieve? It wouldn't be applicable in all sieves but it may be faster for powmods.[/QUOTE]

The x86_asm_ext folder is filled with Montgomery arithmetic routines inherited from the older sieve programs.

henryzz 2020-08-03 16:09

[QUOTE=Happy5214;552438]The x86_asm_ext folder is filled with Montgomery arithmetic routines inherited from the older sieve programs.[/QUOTE]

I checked an old version of the source, not a recent version.

rogue 2020-08-03 23:23

[QUOTE=Happy5214;552438]Yeah, no 80-bit floats on ARM. ARM does have NEON, which appears analogous to SSE and is available on all 64-bit ARM processors. There is a defined instruction set extension for larger vectors called Scalable Vector Extension (SVE), which provides an interface for vectors from 128-bit to 2048-bit, with the hardware register size being set at any 128-bit interval in that range. However, it doesn't appear that SVE is currently implemented in any commercially available general-purpose ARM CPU as of ~2018 (phones and SOCs included), so it's probably not worth coding at this point.[/QUOTE]

I have to believe that the Apple ARM chips they intend to use on the new MacBooks will support SVE, but I haven't found anything to explicitly state that. It is possible that they might only reserve chips with that capability to their higher end offerings when those are switched over.

Happy5214 2020-08-12 16:59

1 Attachment(s)
The ODROID N2+ model sold out before I could get around to ordering it, so I bought the cheaper C4 instead. It hasn't shipped yet, so I'm still waiting for it. Meanwhile, I made an attempt at porting fpu_mulmod, and I think I came up with something. I've attached it in case anyone wants to test it for me. The ARM FPU registers don't form a stack like the x87 registers do, so I didn't sense a need to pre-load 1/[I]p[/I].

rogue 2020-08-12 17:44

[QUOTE=Happy5214;553447]The ODROID N2+ model sold out before I could get around to ordering it, so I bought the cheaper C4 instead. It hasn't shipped yet, so I'm still waiting for it. Meanwhile, I made an attempt at porting fpu_mulmod, and I think I came up with something. I've attached it in case anyone wants to test it for me. The ARM FPU registers don't form a stack like the x87 registers do, so I didn't sense a need to pre-load 1/[I]p[/I].[/QUOTE]

The advantage of the fpu_push() is that you only need to compute 1/p only once. Then you multiply by 1/p. This would save calls to ucvtf d2/ARGp. It really comes down to how many concurrent instructions you can execute in the FPU (pipeline). So if the fmul is waiting for the first ucvtf even if you have two or three more ucvtfs between the first ucvtf and the fmul, then this probably isn't costing you anything in performance.

What's great is that not having an FPU "stack" makes coding for the FPU simpler even if other benefits of pipelining are not available.

To get mtsieve to build one vs the other will require the sources to be placed in a new folder and a modified makefile. It will also require source changes to not compile AVX logic in the C++ source when compiled on ARM platforms. It might be as simple as an #ifdef ARM in those places.

Happy5214 2020-08-14 11:41

[QUOTE=rogue;553455]The advantage of the fpu_push() is that you only need to compute 1/p only once. Then you multiply by 1/p. This would save calls to ucvtf d2/ARGp. It really comes down to how many concurrent instructions you can execute in the FPU (pipeline). So if the fmul is waiting for the first ucvtf even if you have two or three more ucvtfs between the first ucvtf and the fmul, then this probably isn't costing you anything in performance.

What's great is that not having an FPU "stack" makes coding for the FPU simpler even if other benefits of pipelining are not available.

To get mtsieve to build one vs the other will require the sources to be placed in a new folder and a modified makefile. It will also require source changes to not compile AVX logic in the C++ source when compiled on ARM platforms. It might be as simple as an #ifdef ARM in those places.[/QUOTE]
I think you're right. On the Cortex A55 that the ODROID C4 uses, FMUL has a latency of 4 cycles and a throughput of 2 instructions/cycle, while a double-precision FDIV requires 22 cycles and a throughput of 1/19. I'll revert to my original design and stuff it in d8, which is apparently callee-saved and thus should be saved across calls.

Happy5214 2020-08-19 15:26

1 Attachment(s)
I finally got my hands on that ODROID C4. It arrived Sunday (a day early). I tried to boot it Monday, only to find that the spare wireless keyboard I had allocated to it no longer worked. Today, I managed to find a spare Ethernet cable and plug it into our router, and I was able to log in through ssh. After several first-timer coding errors, I finally have a working version of fpu_mulmod. I've attached a tarball with the new header, a remake of the push functions (which just copies the appropriate value to register d8), the two aforementioned version of fpu_mulmod (one using the 1/p value in d8 and the other using a simple fdiv by p every time), and a simple (grossly inefficient) Fermat test with a Mersenne prime using fpu_mulmod (should result in 1), along with a simple Makefile to build them.

The normal mulmod (with 1/p calculated once and stored in d8) runs in 1.52s on my ODROID, while the divp version (with the fdiv by p within mulmod on each iteration) takes 2.3s. Interestingly, the x87 version takes 1.43s on my Core 2 Quad, just showing how much progress processor technology has made.

Happy5214 2020-08-19 18:26

[QUOTE=Happy5214;554253][...], and a simple (grossly inefficient) Fermat test with a Mersenne prime [B]exponent[/B] using fpu_mulmod (should result in 1), along with a simple Makefile to build them.[/QUOTE]

Needed correction added. :redface: I was too impatient to test M31, and smaller ones were too quick.

rogue 2020-08-19 20:06

Next up would be to port some of the others. Most are just an increment harder than fpu_mulmod. fpu_mulmod_4a_4b_4p() will be hardest, but it just means that you have pre-computed 1/p for four distinct values of p. If you have enough registers that are retained between calls to the function, then that would be great.

Happy5214 2020-08-20 10:24

[QUOTE=rogue;554288]Next up would be to port some of the others. Most are just an increment harder than fpu_mulmod. fpu_mulmod_4a_4b_4p() will be hardest, but it just means that you have pre-computed 1/p for four distinct values of p. If you have enough registers that are retained between calls to the function, then that would be great.[/QUOTE]
Registers d8-d15 are all callee-saved and can be used. However, the disadvantage of not having an FPU stack is that saving the four 1/p values in d8-d11 would likely require a separate function, since it can't simply shift each value over like the x87 stack can.

I renamed the fpu_push functions to fpu_store, since they store the result in d8 rather than pushing it to a stack. Are there any issues with renaming those?

rogue 2020-08-20 12:00

[QUOTE=Happy5214;554362]Registers d8-d15 are all callee-saved and can be used. However, the disadvantage of not having an FPU stack is that saving the four 1/p values in d8-d11 would likely require a separate function, since it can't simply shift each value over like the x87 stack can.

I renamed the fpu_push functions to fpu_store, since they store the result in d8 rather than pushing it to a stack. Are there any issues with renaming those?[/QUOTE]

For now that is okay. You can change add an asm function to store 1/p for 4 p and I could migrate the other calls to it where 4 values are put onto the stack in successive calls.

Citrix 2020-08-22 02:12

Calculating best Q for srsieve2.exe
 
[QUOTE=rogue;554188] Do you know some sequences where Q > 720? It might be possible for srsieve2 to compute it, unless it would be extremely large.[/QUOTE]

Moving this from the other thread:

A faster automated way of finding the bestQ would be

1) Set BASE_MULTIPLE to the gcd of differences of all consecutive terms left
For N1, N2, N3, N4, ... left in the file
Set BASE_MULTIPLE=gcd (N2-N1, N3-N2, N4-N3, ...)
So all N can be represented as
N1=k1*BASE_MULTIPLE+c
N2=k2*BASE_MULTIPLE+c

2) Then LIMIT_BASE should 720*7*11=55440
NDIVISORS=LIMIT_BASE

3) Instead of R[n%LIMIT_BASE] = 1; we use R[k%LIMIT_BASE] = 1;
OR R[ (n/BASE_MULTIPLE) %LIMIT_BASE] = 1; where k=(n/BASE_MULTIPLE)

4) Use the current code of srsieve2 to find bestq

5) Q=bestq*BASE_MULTIPLE

rogue 2020-08-22 02:55

[QUOTE=Citrix;554557]Moving this from the other thread:

A faster automated way of finding the bestQ would be

1) Set BASE_MULTIPLE to the gcd of differences of all consecutive terms left
For N1, N2, N3, N4, ... left in the file
Set BASE_MULTIPLE=gcd (N2-N1, N3-N2, N4-N3, ...)
So all N can be represented as
N1=k1*BASE_MULTIPLE+c
N2=k2*BASE_MULTIPLE+c

2) Then LIMIT_BASE should 720*7*11=55440
NDIVISORS=LIMIT_BASE

3) Instead of R[n%LIMIT_BASE] = 1; we use R[k%LIMIT_BASE] = 1;
OR R[ (n/BASE_MULTIPLE) %LIMIT_BASE] = 1; where k=(n/BASE_MULTIPLE)

4) Use the current code of srsieve2 to find bestq

5) Q=bestq*BASE_MULTIPLE[/QUOTE]

You are welcome to play around with the code, build and test. Speed really isn't an issue since it only has to execute that code once in a great while, but if a better Q can be found (thus allowing for faster sieving), then that would be great.

Citrix 2020-08-22 03:48

I modified sr1sieve to read any Q from command line and got significant benefit for low weight series. I am trying to combine the benefit with your faster FPU/AVX code.

The problem modifying srsieve2 code is that it supports multiple k version and calculating the Q for that is very complicated.

Happy5214 2020-08-25 15:55

1 Attachment(s)
[QUOTE=rogue;550047]One place where the framework could benefit significantly is by switching from right-to-left exponentiation to left-to-right exponentiation. If anyone is interested in changing an ASM routine to use left-to-right exponentiation, please let me know.[/QUOTE]

[QUOTE=Happy5214;552329]I was thinking about doing this as an exercise in learning floating-point ASM, but the existing code appears to me to already be left-to-right, matching the algorithm in the source comments. Am I reading it wrong?[/QUOTE]

The regular x86 fpu_powmod function [i]does[/i] use right-to-left exponentiation, though the others use left-to-right. Should I rewrite the x86 version while I'm writing the ARM version?

Speaking of, the fpu_mulmod ARM ports have been finished and are attached. I managed to use the minimum amount of memory accesses (just transferring the array values to/from registers).

Here are the timings for my test programs on ARM Cortex-A55 (which do a poor man's Fermat test on the largest known Mersenne prime exponent):
[code]time ./mulmod
3^82589932 mod 82589933 = 1

real 0m1.043s
user 0m1.040s
sys 0m0.000s

time ./mulmod_iter
3^82589932 mod 82589933 = 1

real 0m0.870s
user 0m0.864s
sys 0m0.004s

time ./mulmod_iter_4a
2^82589932 mod 82589933 = 1
2^82589932 mod 82589933 = 2
2^82589932 mod 82589933 = 4
2^82589932 mod 82589933 = 8

real 0m1.737s
user 0m1.736s
sys 0m0.000s

time ./mulmod_4a_4b_4p
2^82589932 mod 82589933 = 1
3^82589932 mod 82589933 = 1
4^82589932 mod 82589933 = 1
5^82589932 mod 82589933 = 1

real 0m2.300s
user 0m2.296s
sys 0m0.000s
[/code]

This compares to the x86 FPU test of the same calculation:
[code]time ./mulmod
3^82589932 mod 82589933 = 1

real 0m1.260s
user 0m1.232s
sys 0m0.000s

time ./mulmod_iter
3^82589932 mod 82589933 = 1

real 0m1.110s
user 0m1.074s
sys 0m0.005s

time ./mulmod_iter_4a
2^82589932 mod 82589933 = 1
2^82589932 mod 82589933 = 2
2^82589932 mod 82589933 = 4
2^82589932 mod 82589933 = 8

real 0m1.284s
user 0m1.252s
sys 0m0.004s

time ./mulmod_4a_4b_4p
2^82589932 mod 82589933 = 1
3^82589932 mod 82589933 = 1
4^82589932 mod 82589933 = 1
5^82589932 mod 82589933 = 1

real 0m1.543s
user 0m1.487s
sys 0m0.005s
[/code]

Notably, the functions with a single mulmod are faster on ARM than on x86, while the multiple-mulmod functions scale much better on x86. The poor performance of multiple mulmods on ARM could be due to pipelining or slow memory accesses (the single mulmod functions are entirely contained in registers in my ARM implementation).

Citrix 2020-09-21 17:26

I am looking at BestQ code

[code]

uint32_t GenericSubsequenceHelper::RateQ(uint32_t Q, uint32_t s)
{
uint32_t baby, giant, work;

ChooseSteps(&baby, &giant, Q, s);

work = baby*BABY_WORK + s*(giant-1)*GIANT_WORK + Q*EXP_WORK + s*SUBSEQ_WORK;

return work;
}

[/code]

I understand the BABY_WORK, GIANT_WORK and SUBSEQ_WORK steps.
Can you help me understand what is the Q*EXP_WORK -- I do not see anything corresponding to this in the discrete log function?

For a large Q if only one subsequence is left - it should be significantly faster than using a lower Q but the "Q*EXP_WORK" prevents use of large Q.

Thanks.

rogue 2020-09-21 19:23

[QUOTE=Citrix;557487]I am looking at BestQ code

[code]

uint32_t GenericSubsequenceHelper::RateQ(uint32_t Q, uint32_t s)
{
uint32_t baby, giant, work;

ChooseSteps(&baby, &giant, Q, s);

work = baby*BABY_WORK + s*(giant-1)*GIANT_WORK + Q*EXP_WORK + s*SUBSEQ_WORK;

return work;
}

[/code]

I understand the BABY_WORK, GIANT_WORK and SUBSEQ_WORK steps.
Can you help me understand what is the Q*EXP_WORK -- I do not see anything corresponding to this in the discrete log function?

For a large Q if only one subsequence is left - it should be significantly faster than using a lower Q but the "Q*EXP_WORK" prevents use of large Q.[/QUOTE]

This was borrowed from srsieve. The original code was written by Geoff Reynolds.

Note that the code does not have any sr1sieve or sr2sieve enhancements. I have a ways to go to pull in the sr2sieve bsgs logic.

rob147147 2020-09-22 14:27

I think EXP_WORK relates to the additional mulmods involved in pre-calculating b^d (mod p) for all the appropriate remainder values 'd' where 0 <= d < Q.

E.g. We are rewriting the terms of a sequence k * b^n -1 as k * b^d * b^(Qm) -1 where n=Qm+d.

The idea behind Q is to identify which values 'd' are unnecessary, thus ending up with less than Q subsequences over a range 1/Q times the size of the original range, hopefully giving a performance improvement. There could be multiple terms with the same remainder 'd' so we can precalculate b^d (mod p) for those instances. My understanding from looking at the source code of sr1sieve (bsgs.c) is that it is this mulmod that we are accounting for in EXP_WORK.

Citrix 2020-09-24 03:45

[QUOTE=rob147147;557562]I think EXP_WORK relates to the additional mulmods involved in pre-calculating b^d (mod p) for all the appropriate remainder values 'd' where 0 <= d < Q.

E.g. We are rewriting the terms of a sequence k * b^n -1 as k * b^d * b^(Qm) -1 where n=Qm+d.

The idea behind Q is to identify which values 'd' are unnecessary, thus ending up with less than Q subsequences over a range 1/Q times the size of the original range, hopefully giving a performance improvement. There could be multiple terms with the same remainder 'd' so we can precalculate b^d (mod p) for those instances. My understanding from looking at the source code of sr1sieve (bsgs.c) is that it is this mulmod that we are accounting for in EXP_WORK.[/QUOTE]

Robert, I think you are correct. I came to the same conclusion when reviewing the code. Though for sr1sieve there would not be any need to calculate b^d (mod p) for particular d values as there is no candidates left for certain d. That is the whole point of choosing a Q value.

For multiple k (sr2sieve) the program should intelligently decide which b^d should be computed.

In Srsieve2 the following function would need to be corrected to make it faster for single k.

[code]
void GenericWorker::BuildTables(uint64_t *baseToUse, uint64_t *primeList, double *invp, uint64_t *bm64){
...}
[/code]

Citrix 2020-09-24 05:31

I tried to replace mulmod by powmod to avoid unnecessary multiplication. I get an error during run time. Any idea why the calculations are not being done correctly?

[CODE]
void GenericWorker::BuildTables(uint64_t *baseToUse, uint64_t *primeList, double *invp, uint64_t *bm64)
{
uint64_t inv_b[4];
uint32_t qIdx, seqIdx, ssIdx;
uint64_t umod[4], inv[4];

fpu_push_1divp(primeList[3]);
fpu_push_1divp(primeList[2]);
fpu_push_1divp(primeList[1]);
fpu_push_1divp(primeList[0]);

// Precompute 1/b^d (mod p) for 0 <= d <= Q.

bd64[0][0] = bd64[0][1] = bd64[0][2] = bd64[0][3] = 1;

bm64[0] = inv_b[0] = invmod64(baseToUse[0], primeList[0]);
bm64[1] = inv_b[1] = invmod64(baseToUse[1], primeList[1]);
bm64[2] = inv_b[2] = invmod64(baseToUse[2], primeList[2]);
bm64[3] = inv_b[3] = invmod64(baseToUse[3], primeList[3]);

/* for (qIdx=1; qIdx<ii_BestQ; qIdx++)
{
bd64[qIdx][0] = bm64[0];
bd64[qIdx][1] = bm64[1];
bd64[qIdx][2] = bm64[2];
bd64[qIdx][3] = bm64[3];

fpu_mulmod_4a_4b_4p(bm64, inv_b, primeList);
}
*/

for (seqIdx=0; seqIdx<ii_SequenceCount; seqIdx++)
{
ck64[seqIdx][0] = smod64(-ip_Sequences[seqIdx].c, primeList[0]);
ck64[seqIdx][1] = smod64(-ip_Sequences[seqIdx].c, primeList[1]);
ck64[seqIdx][2] = smod64(-ip_Sequences[seqIdx].c, primeList[2]);
ck64[seqIdx][3] = smod64(-ip_Sequences[seqIdx].c, primeList[3]);

umod[0] = umod64(ip_Sequences[seqIdx].k, primeList[0]);
umod[1] = umod64(ip_Sequences[seqIdx].k, primeList[1]);
umod[2] = umod64(ip_Sequences[seqIdx].k, primeList[2]);
umod[3] = umod64(ip_Sequences[seqIdx].k, primeList[3]);

inv[0] = invmod64(umod[0], primeList[0]);
inv[1] = invmod64(umod[1], primeList[1]);
inv[2] = invmod64(umod[2], primeList[2]);
inv[3] = invmod64(umod[3], primeList[3]);

fpu_mulmod_4a_4b_4p(ck64[seqIdx], inv, primeList);
}

// Compute -c/(k*b^d) (mod p) for each subsequence.
for (ssIdx=0; ssIdx<ii_SubsequenceCount; ssIdx++)
{
bdck64[ssIdx][0] = ck64[ip_Subsequences[ssIdx].seqIdx][0];
bdck64[ssIdx][1] = ck64[ip_Subsequences[ssIdx].seqIdx][1];
bdck64[ssIdx][2] = ck64[ip_Subsequences[ssIdx].seqIdx][2];
bdck64[ssIdx][3] = ck64[ip_Subsequences[ssIdx].seqIdx][3];


fpu_powmod_4b_1n_4p(bm64, ip_Subsequences[ssIdx].d, primeList);

//fpu_mulmod_4a_4b_4p(bdck64[ssIdx], bd64[ip_Subsequences[ssIdx].d], primeList);
fpu_mulmod_4a_4b_4p(bdck64[ssIdx], bm64, primeList);

bm64[0] = inv_b[0];
bm64[1] = inv_b[1];
bm64[2] = inv_b[2];
bm64[3] = inv_b[3];
}
fpu_powmod_4b_1n_4p(bm64, ii_BestQ, primeList);


fpu_pop();
fpu_pop();
fpu_pop();
fpu_pop();
}

[/CODE]

rogue 2020-09-24 12:44

fpu_powmod_4b_1n_4p() assumes that there is nothing on the FPU stack, but fpu_mulmod_4a_4b_4p() requires the first four entries on the stack to be preset (per fpu_push_1divp). You would have to create a version of fpu_powmod_4b_1n_4p() that supports precomputed values on the stack.

Happy5214 2020-09-26 12:15

[QUOTE=rogue;557742]fpu_powmod_4b_1n_4p() assumes that there is nothing on the FPU stack, but fpu_mulmod_4a_4b_4p() requires the first four entries on the stack to be preset (per fpu_push_1divp). You would have to create a version of fpu_powmod_4b_1n_4p() that supports precomputed values on the stack.[/QUOTE]

On that subject, I'm about to implement the ARM version of fpu_powmod_4b_1n_4p() (it'll be the only one I'll have to spill registers for :sad:). Is there a reason why fpu_mulmod_4a_4b_4p() preloads the 1/p values and fpu_powmod_4b_1n_4p() doesn't? Should I keep the current semantics with the ARM version?

I discovered when I implemented fpu_powmod() that the x87 version [I]does[/I] in fact use right-to-left exponentiation, as I noted in a previous post, so you were right on that concern earlier.

rogue 2020-09-26 13:06

[QUOTE=Happy5214;557951]On that subject, I'm about to implement the ARM version of fpu_powmod_4b_1n_4p() (it'll be the only one I'll have to spill registers for :sad:). Is there a reason why fpu_mulmod_4a_4b_4p() preloads the 1/p values and fpu_powmod_4b_1n_4p() doesn't? Should I keep the current semantics with the ARM version?

I discovered when I implemented fpu_powmod() that the x87 version [I]does[/I] in fact use right-to-left exponentiation, as I noted in a previous post, so you were right on that concern earlier.[/QUOTE]

In most cases fpu_powmod_4b_1n_4p() is called once to set up a loop and fpu_mulmod_4a_4b_4p() is called within a loop. There is a time speed bump to setting up the stack prior to the call to fpu_powmod_4b_1n_4p(), but it wouldn't benefit all sieves and most that would benefit would probably see a negligible benefit. The long term goal would then be to preset the stack (or registers for ARM) for all fpu functions.

rogue 2020-12-27 18:26

I have posted mtsieve 2.1.0 over at sourceforge. There are many changes with this release, but users of gcwsievecl, mfsieve, and mfsievecl will get the most benefit. The changes are:

[code]
2.1.0 - December 27, 2020
framework:
On Windows, switched to using gcc from msys2 instead of gcc from mingw64. -O3
gives a nice performance bump to some of the non-ASM code.

Exit after showing help when using -h swtich instead of giving fatal error.

Add logic to support validation of factors passed with -I, but not all sieves are
coded to do this validation.

On GPU builds, default -W to 0 and -G to 1. A "special" CPU worker will be created
as necessary to test ranges of p that are to small for the GPU code.

For GPU executables, add -H to show some GPU details when each kernel is created.

Improve factor rate calculation when using GPU workers.

Created FUTURE.txt for items I would like to get working.

cksieve: version 1.3
Added logic to validate factors passed with -I.

gcwsieve, gcwsievecl: version 1.4
Added logic to validate factors passed with -I.
Add -f option to gcwsieve so that it can generate ABC output that is supported by LLR.
Added -M option and fixed -S for gcwsievecl.
Improved speed of GPU code of gcwsievecl by about 25%.
The GPU code is more than 20x faster than the CPU for the same amount of work.

mfsieve, mfsievecl: version 1.7
Added logic to validate factors passed with -I.
Replaced all ASM routines with non-ASM routines based upon Montgomery mulitplication. This
provides a speed bump of at least 30% over the previous release.
Fixed the GPU code and replaced magic number logic with Montgomery multiplcation as it is
faster. The new GPU code is more then 25% faster than the old GPU code.
The GPU code is more than 20x faster than the CPU for the same amount of work.

sgsieve: version 1.2
Added logic to validate factors passed with -I.

twinsieve: version 1.3
Added logic to validate factors passed with -I.

xyyxsieve, xyyxsievecl: version 1.8
Added logic to validate factors passed with -I.
[/code]

I also add a FUTURE.txt file with what I want to do in the future. I know there are things that I have promised that are not in this list. Please let me know what I have missed:

[code]
2.0.7 - December 22, 2020
framework:
Replace AMD SDK on Windows with open source SDK for OpenCL.

Add more output for -H for GPU executables.

afsieve, afsievecl:
Replace logic with Montgomery multiplcation.
Add factor validation when using -I.

dmdsieve:
Add factor validation when using -I.

fbncsieve:
Add factor validation when using -I.

fkbnsieve:
Add factor validation when using -I.

gfndsieve:
Add factor validation when using -I.

kbbsieve:
Add factor validation when using -I.

psieve:
Add factor validation when using -I.

pixsieve:

srsieve2:
Implement sr1sieve and sr2sieve logic.
Add factor validation when using -I.
[/code]

I will probably complete most of the -I updates in the coming days and a replacement OpenCL SDK on Windows is close behind.

pepi37 2020-12-27 19:21

If you can rise upper limit of srsieve2.

rogue 2020-12-27 20:06

[QUOTE=pepi37;567476]If you can rise upper limit of srsieve2.[/QUOTE]

So you need it to go above 2^52?

pepi37 2020-12-27 21:01

[QUOTE=rogue;567479]So you need it to go above 2^52?[/QUOTE]
1e17 is good for me

ET_ 2020-12-27 21:20

Any hope to have either gfndsieve or dmdsieve on an ARM-64 platform, either with an ASM translation or with plain c routines?

Just asking for a friend of mine... :rolleyes::innocent::et_:

rogue 2020-12-27 23:10

[QUOTE=ET_;567484]Any hope to have either gfndsieve or dmdsieve on an ARM-64 platform, either with an ASM translation or with plain c routines?

Just asking for a friend of mine... :rolleyes::innocent::et_:[/QUOTE]

It might be possible, but I have no way to compile or test such code. I'm hoping that my wife will let me get an Apple laptop with the M1 CPU next year, but don't have a lot of hope for that. I believe that someone posted some ARM ASM FPU routines earlier in this thread. Someone with access to an ARM CPU could make the changes and do some testing. Fortunately gfndsieve and dmdsieve should be easy to get working if one has the necessary ARM functions. Until then I will look at producing versions that do not require ASM.

pepi37 2020-12-27 23:18

Where srsieve2 store legendre table file?


I use this command but after I stop srsieve2 I cannot find any new file



srsieve2.exe -l -L legendre .txt -P 176362230330 .....

rogue 2020-12-28 01:08

[QUOTE=pepi37;567496]Where srsieve2 store legendre table file?

I use this command but after I stop srsieve2 I cannot find any new file

srsieve2.exe -l -L legendre .txt -P 176362230330 .....[/QUOTE]

That code is not implemented yet. It is not a small task. I started a while ago, but have a ways to go. You can continue to use sr1sieve/sr2sieve for now.

Happy5214 2020-12-28 04:56

1 Attachment(s)
[QUOTE=ET_;567484]Any hope to have either gfndsieve or dmdsieve on an ARM-64 platform, either with an ASM translation or with plain c routines?

Just asking for a friend of mine... :rolleyes::innocent::et_:[/QUOTE]

[QUOTE=rogue;567495]It might be possible, but I have no way to compile or test such code. I'm hoping that my wife will let me get an Apple laptop with the M1 CPU next year, but don't have a lot of hope for that. I believe that someone posted some ARM ASM FPU routines earlier in this thread. Someone with access to an ARM CPU could make the changes and do some testing. Fortunately gfndsieve and dmdsieve should be easy to get working if one has the necessary ARM functions. Until then I will look at producing versions that do not require ASM.[/QUOTE]

That would be me and my ODROID-C4, though I apparently forgot to post a tarball of the entire set of translated ASM files (attached here). I haven't done any actual integration work with the mtsieve code yet, though I do have sample programs that check out. Looking at the two sieves in question, they appear to utilize the untranslated asm-ext routines in the App class (I only translated the x87 routines), so that would have to be dealt with.

As an aside, I briefly experimented with an Advanced SIMD/NEON implementation, but I gave up quickly when I realized that I couldn't vectorize the integer instructions, so all the instructions I thought I'd be saving would still have to be executed in some form.

rogue 2020-12-28 13:25

[QUOTE=Happy5214;567515]That would be me and my ODROID-C4, though I apparently forgot to post a tarball of the entire set of translated ASM files (attached here). I haven't done any actual integration work with the mtsieve code yet, though I do have sample programs that check out. Looking at the two sieves in question, they appear to utilize the untranslated asm-ext routines in the App class (I only translated the x87 routines), so that would have to be dealt with.

As an aside, I briefly experimented with an Advanced SIMD/NEON implementation, but I gave up quickly when I realized that I couldn't vectorize the integer instructions, so all the instructions I thought I'd be saving would still have to be executed in some form.[/QUOTE]

I forgot about those. The easiest thing for now would be to disable -x and associated logic when building on a non-x86 CPU.

rogue 2020-12-28 14:40

There is a bug in this release that affects all exes that was introduced with factor validation. I'll fix it later today.

rogue 2020-12-29 14:29

I have posted 2.1.1 over at sourceforge and removed 2.1.0. Here are the changes:

[code]
2.1.1 - December 29, 2020
framework:
Fixed an infinite loop that occurs with the special CPU worker in GPU builds.

Removed the factor validation logic added in 2.1.0 do to an issue with how I
implemented it.

gcwsieve, gcwsievecl: version 1.5
Added working factor validation logic with -I.
Improved GPU speed by another 10%. Thanks to Yves Gallot for the code.

mfsieve, mfsievecl: version 1.8
Added working factor validation logic with -I.
Fix bug triggering invalid factor in the CPU code when minN is odd with factorials.
[/code]

pepi37 2020-12-30 21:57

Did you increase upper limit in srsieve2?

rogue 2020-12-30 22:20

[QUOTE=pepi37;567775]Did you increase upper limit in srsieve2?[/QUOTE]

Funny that you ask...

I just posted 2.1.2 over at sourceforge.

[code]
2.1.2 - December 30, 2020
framework:
Retain factor counts when workers are rebuilt.

srsieve2: version 1.3
Added factor validation logic when using -I.
Switched to Montgomery mulitplication, which is about 20% faster than the previous
logic. This also gives a 20% speed bump and adds support for p up to 2^62.
[/code]

And I have this on my wish list:

[code]
Added on December 29, 2020
framework:
Add ARM support
[/code]

This is my last release for 2020.

pepi37 2020-12-30 22:23

Great! so let test it!


and first bug found


fbncsieve.exe -P 6000000000000000 -i 1100.txt -o 1100.txt -f N -W6 -O s53factodes.txt
fbncsieve v1.4, a program to find factors of k*b^n+c numbers for fixed b, n, and c and variable k
Sieve started: [COLOR=red][B]347880187459691[/B][/COLOR] < p < 6e15 with 75104 terms (1014 < k < 1999948, k*10^1100000+1) (expecting 5887 factors)
[COLOR=Red][B]p=0[/B][/COLOR], 22.89M p/sec, no factors found

rogue 2020-12-30 23:34

[QUOTE=pepi37;567778]Great! so let test it!

and first bug found

fbncsieve.exe -P 6000000000000000 -i 1100.txt -o 1100.txt -f N -W6 -O s53factodes.txt
fbncsieve v1.4, a program to find factors of k*b^n+c numbers for fixed b, n, and c and variable k
Sieve started: [COLOR=red][B]347880187459691[/B][/COLOR] < p < 6e15 with 75104 terms (1014 < k < 1999948, k*10^1100000+1) (expecting 5887 factors)
[COLOR=Red][B]p=0[/B][/COLOR], 22.89M p/sec, no factors found[/QUOTE]

Email me your input file and I’ll take a look. Did this work in a previous release?

pepi37 2020-12-30 23:38

[QUOTE=rogue;567784]Email me your input file and I’ll take a look. Did this work in a previous release?[/QUOTE]




bat file


fbncsieve.exe -p 347880187459691 -P 6000000000000000 -i 1100.txt -o 1100.txt -f N -W6 -O s53factodes.txt


first first lines of input file


347880187459691:P:1:10:1
1014 1100000
1015 1100000
1038 1100000
1075 1100000
1092 1100000
1107 1100000
1113 1100000
1131 1100000
1161 1100000
1171 1100000
1183 1100000
1191 1100000
1201 1100000
1216 1100000

rogue 2020-12-31 01:41

If you use ^C, does it save the correct value at the top of the file?

pepi37 2020-12-31 01:58

[QUOTE=rogue;567790]If you use ^C, does it save the correct value at the top of the file?[/QUOTE]


Yes

Happy5214 2020-12-31 09:33

In [c]App::GetWorkerStats[/c], what happens when [c]ip_Workers[0][/c] is an ordinary worker? Wouldn't the last iteration be out-of-bounds?

Edit: Ditto for [c]App::GetLargestPrimeTested[/c]?

rogue 2020-12-31 13:54

[QUOTE=Happy5214;567812]In [c]App::GetWorkerStats[/c], what happens when [c]ip_Workers[0][/c] is an ordinary worker? Wouldn't the last iteration be out-of-bounds?

Edit: Ditto for [c]App::GetLargestPrimeTested[/c]?[/QUOTE]

I'm not certain what you mean by "out of bounds".

Nevertheless there is a bug in those routines, but it should only impact executables with GPU workers.

rogue 2020-12-31 21:31

The issue with fbncsieve is that you have -W6. Some of the workers are not doing anything either because your machine doesn't have enough horsepower or because the workers are executing so quickly on their chunk of work that the other workers do not have an opportunity to get work.

You have two options. You can either decrease the number of workers or you can use -w to give each worker enough work to keep them busy.

The framework works in a way that the main thread is sieving for the next chunk of work. If a worker thread completes its work before the main thread completes sieving for another worker thread, then the first worker thread will get the next chunk of work.

It would require some re-thinking on my part to guarantee that all workers get work, but it could be at the expensive of overall performance when using one thread.

rogue 2021-01-01 15:11

I am working on a change to the framework to do a better job of determining the "largest prime sieved" to address this issue. Conceptually it will "ignore" workers that have done no work and workers that are waiting for work, so only currently executing workers have a "say" in the "largest prime sieved".

pepi37 2021-01-01 21:07

[QUOTE=rogue;567885]The issue with fbncsieve is that you have -W6. Some of the workers are not doing anything either because your machine doesn't have enough horsepower or because the workers are executing......[/QUOTE]


I dont think that is case. sieve file has 72000 candidates, so I dont think any worker has left without work. If sieve is done until some depth, why is problem to sieve start from that point?
Srxsieve works for many years without any problem ( even on Linux where is MT)

rogue 2021-01-01 22:41

[QUOTE=pepi37;567974]I dont think that is case. sieve file has 72000 candidates, so I dont think any worker has left without work. If sieve is done until some depth, why is problem to sieve start from that point?
Srxsieve works for many years without any problem ( even on Linux where is MT)[/QUOTE]

This isn't about the input file. This is about how each worker thread gets a chunk of primes to work on. It is fundamentally different than how srxsieve does it.

pepi37 2021-01-01 23:14

[QUOTE=rogue;567995]
You can either decrease the number of workers or you can use -w to give each worker enough work to keep them busy.[/QUOTE]
Excuse me, but what is point of have MT sieve that six core CPU doesnot have "horsepower" to feel all workers?
So if I take 6*1 instance of your sieve then my CPU will have "enough horsepower" but if I use 1*6 then it wont?
I still think that is only cosmetic bug , not real one, since
[QUOTE]e:\MTSIEVE\[COLOR=Red][B]MTSIEVE2037[/B][/COLOR]>fbncsieve.exe -p 347880187459691 -P 6000000000000000 -i 1100.txt -o 1100.txt -f N -W6 -O s53factodes.txt
fbncsieve v1.4, a program to find factors of k*b^n+c numbers for fixed b, n, and c and variable k
Sieve started: 347880187459691 < p < 6e15 with 75104 terms (1014 < k < 1999948, k*10^1100000+1) (expecting 5887 factors)
p=[B][COLOR=red]347927117723473[/COLOR][/B], 22.98M p/sec, no factors found, 0.0% done. ETC 2021-03-28 00:53[/QUOTE]


Sieve speed is same , version from build 2037 and latest one.

rogue 2021-01-02 01:54

[QUOTE=pepi37;568000]Excuse me, but what is point of have MT sieve that six core CPU doesnot have "horsepower" to feel all workers?
So if I take 6*1 instance of your sieve then my CPU will have "enough horsepower" but if I use 1*6 then it wont?
I still think that is only cosmetic bug , not real one, since

Sieve speed is same , version from build 2037 and latest one.[/QUOTE]

I'm not certain what you mean by "cosmetic bug" in terms of what you highlighted in red?

When the application starts all of the workers are started and put themselves in a state called "waiting for work". The main thread works in a loop like this:

1) find thread in "waiting for work"
2) if none are found, sleep for 10 ms.
3) if one is found, get the next 1e6 primes then tell the worker to start processing that chunk
4) return to step 1

You have 6 threads, so the assumption is that this loop is iterated 6 times and thus all workers have work. That assumption is not correct.

At some point the worker started by step 3 will finish. It will then go back to the status of "waiting for work".

What you see happening with this sieve is that the worker started by the first iteration of the loop finishes its chunk of work before the main thread has iterated 6 times.

Because of this the first worker (one that already did a chunk of 1e6 primes) will get the next chunk of 1e6 primes instead of the 5th or 6th worker.

In essence getting 6 chunks of 1e6 primes takes more time than testing a single chunk of 1e6 primes.

This is going to happen with fncsieve, fkbnsieve, gfndsieve, and dmdsieve because of how fast their worker threads can process a single chunk of work. This is why I suggested using -w to bump from 1e6 to 1e7. This should give the main thread the opportunity to feed all of the workers before any individual worker completes it chunk of work.

The only solution to this problem would be to modify the framework such that the worker threads work on a range of primes, i.e. 100e9 to 101e9, 101e9 to 102e9, etc instead of a count of primes. The downside of such a change is that each worker will be testing a variable number of primes rather than a fixed number of primes. This would not be good for GPU workers as they are most efficient using a number of primes that is a multiple of the GPU work group size.

pepi37 2021-01-02 02:08

[QUOTE=rogue;568010]I'm not certain what you mean by "cosmetic bug" in terms of what you highlighted in red?
[/QUOTE]


To explain: when I start your[COLOR=Red][B] latest released sieve then p=0 [/B][/COLOR]( and should not be zero since it doesnot start from zero , it start from header value in sieve)
And since sieve in previous version works ok then it is cosmetic bug not flaw in program itself ( also sieve speed is same)


fbncsieve.exe -p[B][COLOR=Red] 347880187459691[/COLOR][/B] -P 6000000000000000 -i 1100.txt -o 1100.txt -f N -W6[B][COLOR=SeaGreen] [COLOR=Blue]-w 1e7[/COLOR][/COLOR][/B] -O s53factodes.txt
fbncsieve v1.4, a program to find factors of k*b^n+c numbers for fixed b, n, and c and variable k
Sieve started: 347880187459691 < p < 6e15 with 75104 terms (1014 < k < 1999948, k*10^1100000+1) (expecting 5887 factors)
[COLOR=red][B] p=0[/B][/COLOR], [COLOR=blue][B]46.94M p/sec[/B][/COLOR], no factors found


But in one part you were correct , using -w 1e7 it nearly doubles speed!!!!

Happy5214 2021-01-02 07:42

[QUOTE=pepi37;568012]fbncsieve.exe -p[B][COLOR=Red] 347880187459691[/COLOR][/B] -P 6000000000000000 -i 1100.txt -o 1100.txt -f N -W6[B][COLOR=SeaGreen] [COLOR=Blue]-w 1e7[/COLOR][/COLOR][/B] -O s53factodes.txt
fbncsieve v1.4, a program to find factors of k*b^n+c numbers for fixed b, n, and c and variable k
Sieve started: 347880187459691 < p < 6e15 with 75104 terms (1014 < k < 1999948, k*10^1100000+1) (expecting 5887 factors)
[COLOR=red][B] p=0[/B][/COLOR], [COLOR=blue][B]46.94M p/sec[/B][/COLOR], no factors found


But in one part you were correct , using -w 1e7 it nearly doubles speed!!!![/QUOTE]

What is the CPU utilization with -w 1e7? I have a feeling there may still be room to up that number even more, until that p=0 finally goes away (i.e. all workers are working).

rogue 2021-01-02 14:14

[QUOTE=pepi37;568012]To explain: when I start your[COLOR=Red][B] latest released sieve then p=0 [/B][/COLOR]( and should not be zero since it doesnot start from zero , it start from header value in sieve)
And since sieve in previous version works ok then it is cosmetic bug not flaw in program itself ( also sieve speed is same)


fbncsieve.exe -p[B][COLOR=Red] 347880187459691[/COLOR][/B] -P 6000000000000000 -i 1100.txt -o 1100.txt -f N -W6[B][COLOR=SeaGreen] [COLOR=Blue]-w 1e7[/COLOR][/COLOR][/B] -O s53factodes.txt
fbncsieve v1.4, a program to find factors of k*b^n+c numbers for fixed b, n, and c and variable k
Sieve started: 347880187459691 < p < 6e15 with 75104 terms (1014 < k < 1999948, k*10^1100000+1) (expecting 5887 factors)
[COLOR=red][B] p=0[/B][/COLOR], [COLOR=blue][B]46.94M p/sec[/B][/COLOR], no factors found


But in one part you were correct , using -w 1e7 it nearly doubles speed!!!![/QUOTE]

The "p=0" should be fixed in the next release of the framework. The problem is that it tries to determine the max prime that has been used for testing, but if you have workers that have done nothing, then their max prime is 0. The change to the framework is to ignore workers that are doing nothing when determining max prime.

Happy5214 2021-01-05 09:35

I'm not able to create new sieve files using srsieve2 (SVN rev 84):
[code]$ ./srsieve2 -W 4 -n 5e4 -N 145e3 -P "1e9" -o 't17_b2.prp' -f B -s "34767*2^n-1"
srsieve2 v1.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic
Fatal Error: Expected 95001 terms, but only set 0
[/code]

rogue 2021-01-05 13:07

[QUOTE=Happy5214;568456]I'm not able to create new sieve files using srsieve2 (SVN rev 84):
[code]$ ./srsieve2 -W 4 -n 5e4 -N 145e3 -P "1e9" -o 't17_b2.prp' -f B -s "34767*2^n-1"
srsieve2 v1.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic
Fatal Error: Expected 95001 terms, but only set 0
[/code][/QUOTE]

Hmm. I cannot reproduce this on Windows or OS X. In other words, it works fine on both OSes. I assume this is linux. Does it work if you compile with debug=yes or -O2 instead of -O3?

Here is the OS X output:

[code]
./srsieve2 -W 4 -n 5e4 -N 145e3 -P "1e9" -o 't17_b2.prp' -f B -s "34767*2^n-1"
srsieve2 v1.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic
Sieve started: 3 < p < 1e9 with 95001 terms (50000 < n < 145000, k*2^n+c) (expecting 89964 factors)
Sieving with generic logic
Split 1 base 2 sequence into 10 base 2^20 sequences.
Sieve completed at p=1000000009.
Processor time: 153.19 sec. (0.53 sieving) (3.14 cores)
8181 terms written to t17_b2.prp
Primes tested: 50843824. Factors found: 86820. Remaining terms: 8181. Time: 48.66 seconds.
[/code]

Happy5214 2021-01-06 07:24

[QUOTE=rogue;568469]Hmm. I cannot reproduce this on Windows or OS X. In other words, it works fine on both OSes. I assume this is linux. Does it work if you compile with debug=yes or -O2 instead of -O3?[/QUOTE]

Yes, it was Ubuntu 20.04. It worked for both of those settings, so I tried to compile again with -O3 after a [c]make clean[/c], and that worked. Maybe it was a leftover object file? Sorry about that. :redface:

rogue 2021-01-06 13:11

[QUOTE=Happy5214;568551]Yes, it was Ubuntu 20.04. It worked for both of those settings, so I tried to compile again with -O3 after a [c]make clean[/c], and that worked. Maybe it was a leftover object file? Sorry about that. :redface:[/QUOTE]

That's okay. I'm glad your problem is resolved.

rogue 2021-01-07 16:42

I ran range of Sierpinski/Riesel sequences thru the various programs that can sieve them. This can be used as a guide to determine which program to use for your sieving purposes. Note that all tested the same input file for the same range of P:

[code]
srsieve -P2002626803 -m1e10 b25_old.abcd
srsieve 1.1.0 -- A sieve for integer sequences in n of the form k*b^n+c.
Read 1532485 terms for 81 sequences from ABCD format file `b25_old.abcd'.
Split 81 base 25 sequences into 1432 base 25^90 subsequences.
srsieve started: 300000 <= n <= 1000000, 1000950101 <= p <= 2002626803
Sieving 1000950101 <= p <= 2002626803 eliminated 49708 terms, 1482777 remain.
Wrote 1482777 terms for 81 sequences to srsieve format file `srsieve.out'.
srsieve stopped: at p=2002626803 because --pmax was reached.

These two lines are from srsieve.log:
01/07/21 08:51:03 srsieve started: 300000 <= n <= 1000000, 1000950101 <= p <= 2002626803
01/07/21 09:33:45 srsieve stopped: at p=2002626803 because --pmax was reached. -- 2562 seconds

sr2sieve -ib25_old.abcd -P2002626803 -q
sr2sieve 1.9.1 -- A sieve for multiple sequences k*b^n+/-1 or b^n+/-k.
Read 1532485 terms for 81 sequences from ABCD format file `b25_old.abcd'.
Split 81 base 25 sequences into 1432 base 25^90 subsequences.
Expecting to find factors for about 49622.16 terms in this range.
sr2sieve 1.9.1 started: 300000 <= n <= 1000000, 1000950101 <= p <= 2002626803
p=1990805857, 2188973 p/sec, 49245 factors, 98.8% done, 0 sec/factor, ETA 07 Jan 08:47
sr2sieve 1.9.1 stopped: at p=2002626803 because range is complete.
Found factors for 49708 terms in 486.927 sec. (expected about 49622.16)

sr2sieve -ib25_old.abcd -P2002626803 -q -x
sr2sieve 1.9.1 -- A sieve for multiple sequences k*b^n+/-1 or b^n+/-k.
Read 1532485 terms for 81 sequences from ABCD format file `b25_old.abcd'.
Split 81 base 25 sequences into 1432 base 25^90 subsequences.
Expecting to find factors for about 49622.16 terms in this range.
sr2sieve 1.9.1 started: 300000 <= n <= 1000000, 1000950101 <= p <= 2002626803
p=1982941577, 1086937 p/sec, 48960 factors, 98.0% done, 0 sec/factor, ETA 07 Jan 10:20
sr2sieve 1.9.1 stopped: at p=2002626803 because range is complete.
Found factors for 49708 terms in 977.900 sec. (expected about 49622.16)

srsieve2 -ib25_old.abcd -P2002626803
srsieve2 v1.3.1, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic
Split 81 base 25 sequences into 1432 base 25^90 sequences.
Sieve started: 1000950101 < p < 2e9 with 1532485 terms (300000 < n < 1000000, k*25^n+c) (expecting 49531 factors)
p=1996357801, 19.78K p/sec, 49480 factors found at 15.61 f/sec (last 1 min), 99.6% done. ETC 2021-01-07 10:16
Sieve completed at p=2000000087.
CPU time: 2319.69 sec. (0.23 sieving) (1.00 cores)
1482777 terms written to b25_n.abcd
Primes tested: 47452160. Factors found: 49708. Remaining terms: 1482777. Time: 2325.74 seconds.

srsieve2cl -ib25_old.abcd -P2e9
srsieve2cl v1.3.1, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic
Split 81 base 25 sequences into 1432 base 25^90 sequences.
GPU primes per worker is 143360
Sieve started: 1000950101 < p < 2e9 with 1532485 terms (300000 < n < 1000000, k*25^n+c) (expecting 49531 factors)
p=1925947301, 65.81K p/sec, 46892 factors found at 40.83 f/sec (last 1 min), 92.6% done. ETC 2021-01-07 08:38
Sieve completed at p=2002626803.
CPU time: 189.30 sec. (0.12 sieving) (0.26 cores) GPU time: 713.29 sec.
1482777 terms written to b25_n.abcd
Primes tested: 47452160. Factors found: 49708. Remaining terms: 1482777. Time: 723.70 seconds.
[/code]

Summarized:
[list][*]srsieve - 2562 seconds[*]sr2sieve - 487 seconds (with Legendre tables)
{*]sr2sieve - 977 seconds (w/o Legendre tables)[*]srsieve2 - 2326 seconds[*]srsieve2cl - 723 seconds[/list]
So the conclusion is this:[list][*]srsieve - only use if you do not trust srsieve2/srsieve2cl[*]sr2sieve - use if Legendre tables can be built from the sequences[*]sr2sieve - use if Legendre tables cannot be built from the sequences and you have a slow GPU[*]srsieve2 - use to start sieving sequences if you have a slow GPU then switch to sr2sieve if Legendre tables can be built from the sequences.[*]srsieve2 - use to start sieving sequences if you have too many sequences for the GPU then switch to sr2sieve if Legendre tables can be built from the sequences.[*]srsieve2cl - use to start sieving sequences if you have a fast GPU then switch to sr2sieve if Legendre tables can be built from the sequences.[/list]
"too many sequences for the GPU" is going to be dependent on GPU memory. I do not know (at this time) how to determine beforehand if the GPU can handle all of the sequences thrown at it. What I do know is that my Quadro P3200 can handle more than 5000 sequences at a time (but less than 10000) and still be at least 3x faster than srsieve2.

If you have read between the lines you will see stats for srsieve2cl. Nobody here has seen it as it is in development. It should be ready for release this weekend.

After srsieve2cl is released, you will need to do some independent testing to compare srsieve2cl to sr2sieve to see which is faster.

Maybe someone the forum could package up a script that take an input file and range of P and tell the user which program to use. That same script could compare the outputs to verify that there are no missed factors.

One last thing, srsieve2cl does not end on the exact boundary specified on the command line. That is okay. The key is that I had to use the max p from that run to set the max p for srsieve2, srsieve, and sr2sieve to ensure that I was comparing apples to apples.


All times are UTC. The time now is 23:02.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.