 mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)

 Citrix 2020-05-27 04:26

gcwsieve enhancements

[CODE]

// We're trying to determine if n*b^n (mod p) = +1/-1. This requires an two
// operations, an exponentiation (b^n) and a multiplication (n). We can make a
// change to eliminate one of those operations and convert the second one into addition which // should be faster. Doing this requires us to
// compute the multiplicative inverse of b. The multiplicative inverse of
// b (which I will call B) is the number such at b*B = 1 mod p. Here is how
// the algorithm works.
//
// n*b^n (mod p) = +1/-1
// We then generate B such that b*B=1 (mod p)
// // To get then next term (which I will call N) where N>n , we need to
// check the following
// N=n+r
// N*b^N(mod p) = +1/-1
// (n+r)*b^(n+r) (mod p) = +1/-1
// We multiply both sides by B^r
// (n+r)*b^(n+r))B^r (mod p) = +B^r/-B^r
// (n+r)*b^n (mod p) = +B^r/-B^r
// n*b^n + r*b^n (mod p) = +B^r/-B^r
// We already know n*b^n and b^n are from the previous step.

// (Array 1) Now we pre-compute +-B^r for a range of r since we will be using it multiple times (1 to rmax = M*sqrt(nmax-nmin))
// (Array 2) for each n*b^n in the loop we pre-compute r*b^n for small consecutive values of r (1 to rdiffmax) – can be done by modular addition
// We can calculate each successive candidate by just adding a term from Array 2 and checking it with corresponding term from Array 1.

M for size of Array 1 can be adjusted for maximum efficiency depending on efficiency of the loops.

// once we reach rmax then we multiply last term by b^rmax and b^n by b^rmax
n*b^n + rmax*b^n (mod p)
(n+rmax)* b^n (mod p)
(n+rmax)* b^n*b^rmax(mod p)
n'*b^n'(mod p)
and also generate b^n'=b^n*b^rmax. (mod p)
// We can now repeat the loop again starting at n'.

Other suggestions
Your current code takes into account n values left mod 2 but not for other higher primes.
Code can be made faster if you only look at subsequences left for mod 6 or higher number. BestQ code from srsieve can be used here.
Depending on size of subsequence being sieved we can adjust size of the loops for maximum efficiency.

On a GPU if memory storage is a concern Array 1 can be reduced in size and Array 2 might not be worth saving. This will be still faster than current code.

[/CODE]

 rogue 2020-05-27 15:41

[QUOTE=Citrix;546564][CODE]

// We're trying to determine if n*b^n (mod p) = +1/-1. This requires an two
// operations, an exponentiation (b^n) and a multiplication (n). We can make a
// change to eliminate one of those operations and convert the second one into addition which
// should be faster. Doing this requires us to
// compute the multiplicative inverse of b. The multiplicative inverse of
// b (which I will call B) is the number such at b*B = 1 mod p. Here is how
// the algorithm works.
//
// n*b^n (mod p) = +1/-1
// We then generate B such that b*B=1 (mod p)
//
// To get then next term (which I will call N) where N>n , we need to
// check the following
// N=n+r
// N*b^N(mod p) = +1/-1
// (n+r)*b^(n+r) (mod p) = +1/-1
// We multiply both sides by B^r
// (n+r)*b^(n+r))B^r (mod p) = +B^r/-B^r
// (n+r)*b^n (mod p) = +B^r/-B^r
// n*b^n + r*b^n (mod p) = +B^r/-B^r
// We already know n*b^n and b^n are from the previous step.

// (Array 1) Now we pre-compute +-B^r for a range of r since we will be using it multiple times (1 to rmax = M*sqrt(nmax-nmin))
// (Array 2) for each n*b^n in the loop we pre-compute r*b^n for small consecutive values of r (1 to rdiffmax) – can be done by modular addition
// We can calculate each successive candidate by just adding a term from Array 2 and checking it with corresponding term from Array 1.

M for size of Array 1 can be adjusted for maximum efficiency depending on efficiency of the loops.

// once we reach rmax then we multiply last term by b^rmax and b^n by b^rmax
n*b^n + rmax*b^n (mod p)
(n+rmax)* b^n (mod p)
(n+rmax)* b^n*b^rmax(mod p)
n'*b^n'(mod p)
and also generate b^n'=b^n*b^rmax. (mod p)
// We can now repeat the loop again starting at n'.

Other suggestions
Your current code takes into account n values left mod 2 but not for other higher primes.
Code can be made faster if you only look at subsequences left for mod 6 or higher number. BestQ code from srsieve can be used here.
Depending on size of subsequence being sieved we can adjust size of the loops for maximum efficiency.

On a GPU if memory storage is a concern Array 1 can be reduced in size and Array 2 might not be worth saving. This will be still faster than current code.
[/CODE][/QUOTE]

It took me a while, but I believe I see where you are going with this. Obviously as the range of n increases, so does the benefit.

Memory is definitely a concern on the GPU, but I think that something similar could be done there as well.

I will table your other suggestions because I don't understand them. Someone else (maybe you) would be interested implementing.

 Citrix 2020-05-27 16:54

[QUOTE=rogue;546600]
I will table your other suggestions because I don't understand them.[/QUOTE]

I am suggesting using a few more primes, other than 2, like 3,5,7,13 etc (Covering sets)
The code below is from mtsieve for prime 2.

[CODE]

// If the base is odd, then all n must be even and thus the difference
// between any two remaining n for the base must also be even.
if (idx > 2 && ii_Base & 1)
{
if (idx & 1)
continue;

avx_set_16a(powers[idx-2]);
avx_set_16b(powers);
avx_mulmod(dps, reciprocals);
avx_get_16a(powers[idx]);
}
else
{
avx_set_16a(powers[idx-1]);
avx_set_16b(powers);
avx_mulmod(dps, reciprocals);
avx_get_16a(powers[idx]);
}

[/CODE]

 rogue 2020-05-27 18:21

I see. This is how I have commented the non-AVX code:

[code]
// We're trying to determine if n*b^n (mod p) = +1/-1. This requires an two
// operations, an exponentiation (b^n) and a multiplcation (n). By precomputing
// certain values, we can actually replace the exponentiation and mulitplication
// with array lookups. Doing this requires us to compute the multiplicative inverse
// of b and to build two arrays. The multiplicative inverse of b (which I will call B)
// is the number such at b*B = 1 mod p.
//
// n*b^n (mod p) = +1/-1
// -> n*b^n*B^n (mod p) = +1/-1 * B^n (mod p)
// -> n (mod p) = +1/-1 B^n (mod p)
//
// Find all distinct r staring with n=minN where n+r is the next term until n=maxN.
//
// Pre-compute B^r (mod p) for all possible r.
// Pre-compute n*b^n + r*b^n (mod p) for all possible r.
//
// For each successive term:
// (n+r)*b^(n+r) (mod p) = +1/-1
// -> (n+r)*b^(n+r)*B^r (mod p) = +/-B^r (mod p)
// -> (n+r)*b^n (mod p) = +/-B^r (mod p)
// -> n*b^n + r*b^n (mod p) = +/-B^r (mod p)
//
// Since both n*b^n + r*b^n (mod p) and +/-B^r (mod p) have been pre-cmoputed
// we just need to compare elements of two arrays.
//
// If n*b^n + r*b^n (mod p) = -B^r, then p is a factor of n*b^n+1.
// If n*b^n + r*b^n (mod p) = +B^r, then p is a factor of n*b^n-1
[/code]

I think that your suggestion for bestQ is not necessary as I build a table of "the covering set" of r infrequently.

There are now two loops for each group of primes. One to compute B^r (mod p) and n*b^n + r*b^n (mod p) and the one that detects if we have a factor. This has not been tested yet, so I have likely made mistakes along the way. Because of the need to convert so many doubles back to uint64_t, I'm not certain that an AVX version will be faster than this. That will need some testing. On the GPU side, I can make additional changes to eliminate unused elements of the left and right arrays.

[code]
while (ii_Terms[termIndex] > 0)
{
theN = ii_Terms[termIndex];

diff = prevN - theN;

for (pIdx=0; pIdx<4; pIdx++)
{
// If n*b^n + r*b^n (mod p) = -B^r, then p is a factor of n*b^n+1.
// If n*b^n + r*b^n (mod p) = +B^r, then p is a factor of n*b^n-1
if (il_LeftTerms[DIFF_INDEX(diff)+pIdx] == il_RightTerms[DIFF_INDEX(diff)+pIdx])
if (ip_CullenWoodallApp->ReportFactor(ps[pIdx], theN, -1))
VerifyFactor(ps[pIdx], theN, -1);

if (il_LeftTerms[DIFF_INDEX(diff)+pIdx] == ps[pIdx] - il_RightTerms[DIFF_INDEX(diff)+pIdx])
if (ip_CullenWoodallApp->ReportFactor(ps[pIdx], theN, +1))
VerifyFactor(ps[pIdx], theN, +1);
}

prevN = theN;
termIndex++;
};
[/code]

 Citrix 2020-05-28 05:25

I am not sure if you understood the algorithm correctly.
Here is the pseudocode

[CODE]
Given a base b, nmin, nmax, prime p to sieve.

Choose a value C where C<=nmax-nmin && (nmax-nmin+1)%C==0; need to experimentally find the best C value so code is fastest. This will be constant for all primes. Can adjust nmax so it solves the second modular part.

C=0 or C=nmax-nmin are not the best solutions. The best solution is somewhere in between based on efficiency of the code. (see ***)

Then generate B such that b*B=1 (mod p)

Then create an Array Invs[] = B^0, B^1,..., B^C (mod p)
Then create an Array Invsm[] = p-B^0, p-B^1,..., p-B^C (mod p)

Then we create a loop starting at X=nmin with step C upto nmax (e.g. for(int X=nmin; X<nmax; X=X+C)) {

We generate b^X and X*b^X (in each iteration of the loop can calculate B^lastX*B^C=B^CurrentX and then multiply by X)

We create an array Array MaxDiff []= 0*b^X, 1*b^X, ...., m*b^X (mod p; modular addition)
m would be around 50 for base 2. Could be smaller or larger based on density of remaining candidates in file. Similar to size of MAX_POWERS in current code. Use mean and standard deviations to find best m value. The small values 1*b^X are rarely used and might not need them.

Now we find all remaining candidates in the sieve between n=X and X+C (read from a bitmap)

Let the first candidate be n=r1

Then we add X^b^x+MaxDiff[r1-X] (mod p) = y
Check if y==Invs[r1-X] (for Woodall) and y==Invsm[r1-X] (for Cullen) --> Print factors

Let the next candidate be n=r2

Then we add y+MaxDiff[r2-r1] (mod p) = y
Check if y==Invs[r2-X] (for Woodall) and y==Invsm[r2-X] (for Cullen) --> Print factors

and so on till we reach the last term under X+C

Need to check if difference between 2 consecutive remaining terms is greater than Size of MaxDiff then we would need an extra step (similar to current code)
}// end loop

Best Q can further help as some values in Array Invs and Invsm might not need to be generated. Need to choose C value that is a multiple of Best_Q and adjust the nmin value so it is also a multiple of the Best_Q.

*For odd bases MaxDiff only needs 1/2 the values - otherwise should not apply Best_Q to MaxDiff

Why are you converting from double to uint64_t? You can do modular addition in double type.

***Notes on calculating best C value
Work done = C/q mulmod in first loop + N_range/C*(2 mulmod + 50 add and 50 subtract for MaxDiff array) in second loop.
Where q is adjustment due to Best_Q improvements
50 add+ 50 subtract = t mulmod (base on CPU time)
Then we have C/q + (N_range* (2+t))/C needs to be the smallest so work done is fastest.

Then C= sqrt((N_range*(2+t)*q)) is the best solution.
Then Array Invs should be of size ~= sqrt(n_range)*sqrt((2+t)*q) for best results.

[/CODE]

I hope this makes more sense.

 rogue 2020-05-28 14:38

Honestly I don't fully understand and can't spend the brain power on it at the moment. If you want to take a crack at writing the code per your algorithm, then go for it. I think you have the best chance of implementing the code per your vision.

 Citrix 2020-05-28 20:41

I did try implementing the algorithm. There possibly is some small bug as it is not producing all the factors. I can't find it right now. I do not have a debugger for GCC to help me.

For base 2
I initially sieved range nmin=10,000 to nmax=100,000 to remove all small p (under 1M). There were ~ 3,600 candidates left.
I used this sieved file to test the following:-
The original code gives a speed of 33Kp/sec for p range 1G to 2G.
The new code gives a speed of 2.5Kp/sec for p range 1G to 2G.

Then I tested the code nmin=1,000,000 to nmax=2,000,000 from 1G to 2G (without removing any small terms).
The original code gives a speed of 128 p/sec for p range 1G to 2G.
The new code gives a speed of 804p/sec for p range 1G to 2G.

So 8 times faster but as the original code removes terms from the sieve the code would get faster and faster whereas the new algorithm would stay at almost the same speed.

For the new algorithm to be significantly faster than the old algorithm we would need to test a range of at least 100-400M.

Since no one that I know of is planning on sieving a range that large- I would hold off on further development of this code.

 Citrix 2020-05-28 21:53

I am posting a simpler explanation of the algorithm here - in case some can come up with a faster implementation

[CODE]

Take all the remaining candidates in a range Nmin to Nmax
Let b be the base and B be the multiplicative inverse such that b*B = 1 mod p.

We choose a C= sqrt(Nmax-Nmin)
We then divide all n values into C different groups
In each group G_r={n1,n2,n3...,} such that n1=n2=n3=r (mod C)

We want to sieve each of these C groups separately.

For G_r group
since n1=r (mod C) then n1=k1*C+r
since n2=r (mod C) then n2=k2*C+r

The first term in this group is n1*b^n1+-1
The second term in this group is n2*b^n2+-1

We multiply all these terms by their corresponding B^(k*C) value

The first term
n1*b^n1= +-1 (mod p)
n1*b^n1*B^(k1*C)= +-B^(k1*C) (mod p)
Since we know that n1=k1*C+r we get
n1*b^r=+-B^(k1*C) (mod p)

The second term
n2*b^n2= +-1 (mod p)
n2*b^n2*B^(k2*C)= +-B^(k2*C) (mod p)
Since we know that n2=k2*C+r we get
n2*b^r=+-B^(k2*C) (mod p)

Note that B^(k*C) values will be used in all the C groups and we can save on this work instead of repeatedly calculating it.
Also note n2*b^r - n1*b^r = (k2-k1)*C*b^r
If we know C*b^r (which we can calculate for each group) we can generate all these terms by addition only.

If n1*b^r=-B^(k1*C) (mod p) then n1*b^n1+1 is divisible by p
If n1*b^r=B^(k1*C) (mod) p then n1*b^n1-1 is divisible by p

[/CODE]

 rogue 2020-06-30 15:19

I see where my logic is wrong referring to what I was trying to implement. Not certain if I can resurrect.

 rogue 2020-07-08 16:20

One place where the framework could benefit significantly is by switching from right-to-left exponentiation to left-to-right exponentiation. If anyone is interested in changing an ASM routine to use left-to-right exponentiation, please let me know.

 Happy5214 2020-08-02 17:22

[QUOTE=rogue;550047]One place where the framework could benefit significantly is by switching from right-to-left exponentiation to left-to-right exponentiation. If anyone is interested in changing an ASM routine to use left-to-right exponentiation, please let me know.[/QUOTE]

I was thinking about doing this as an exercise in learning floating-point ASM, but the existing code appears to me to already be left-to-right, matching the algorithm in the source comments. Am I reading it wrong?

All times are UTC. The time now is 08:11.