mersenneforum.org How to optimize the sieving stage of QS?
2020-08-26, 19:30   #1
Ilya Gazman

Feb 2016

F16 Posts

How to optimize the sieving stage of QS?

My latest version of QS is moving at a fifth of the speed of light, as it is only 5 times slower than Till's version now, lol. 40% of my sieving time is spent in the loop below. Do you have an idea how to improve it?

Code:
// this loop takes 40% of the run time
for (Wheel wheel : wheels) {
    wheel.update(logs);
}

// Wheel update method
public void update(byte[] logs) {
    for (; currentPosition < logs.length; currentPosition += prime) { // currentPosition is int
        logs[currentPosition] += log;
    }
    this.currentPosition -= logs.length;
}
2020-08-26, 19:50   #2
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

5B216 Posts

Quote:
 Originally Posted by Ilya Gazman fifth of the speed of light as it only 5 times slower than Tills version now, lol.
And then msieve is faster than light?
Save logs.length out to a local variable. I don't know Java, but similar code in C++ is also slower when I call .size() inside the loop condition.

2020-08-26, 20:17   #3
bsquared

"Ben"
Feb 2007

D4C16 Posts

Quote:
 Originally Posted by Ilya Gazman
 40% of my sieving time is spent in the loop below. Do you have an idea how to improve it? [quoted code snipped]
For small-ish primes you could unroll it. log changes slowly and will be the same for 4 adjacent primes. I don't know how to re-work this code exactly, given that currentPosition, prime, log, etc. are apparently maintained by a class (?). But something like:

cp1 = currentPosition[i]
cp2 = currentPosition[i+1]
cp3 = currentPosition[i+2]
cp4 = currentPosition[i+3]

p1 = primes[i]
p2 = primes[i+1]
p3 = primes[i+2]
p4 = primes[i+3]

maxCP = MAX(cp1, cp2, cp3, cp4)
maxP = MAX(p1, p2, p3, p4)

for (; maxCP < logs.length; ) {
logs[cp1] += log;
logs[cp2] += log;
logs[cp3] += log;
logs[cp4] += log;
cp1 += p1;
cp2 += p2;
cp3 += p3;
cp4 += p4;
maxCP += maxP;
}

// now finish the last part of each range with individual loops

// finally update all positions for the next block
currentPosition[i] = cp1 - logs.length
currentPosition[i+1] = cp2 - logs.length
currentPosition[i+2] = cp3 - logs.length
currentPosition[i+3] = cp4 - logs.length
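In Java terms the same unrolling might look like the sketch below. I'm guessing at how the Wheel state would be flattened into parallel arrays; the names here are illustrative, not the OP's actual class. The maxCP/maxP trick works because maxCP >= cp_j and maxP >= p_j, so maxCP + n*maxP bounds every cp_j + n*p_j and all four indices stay in range while maxCP does.

```java
// Sketch of the 4-way unrolled sieve loop, assuming the per-wheel state has
// been flattened into parallel arrays (positions[], primes[]) for a group of
// 4 small primes that share one rounded log value.
public final class UnrolledSieve {
    // Sieve one block for the four wheels i..i+3 sharing the same log value.
    static void update4(byte[] logs, int[] positions, int[] primes, int i, byte log) {
        int len = logs.length;
        int cp1 = positions[i],     p1 = primes[i];
        int cp2 = positions[i + 1], p2 = primes[i + 1];
        int cp3 = positions[i + 2], p3 = primes[i + 2];
        int cp4 = positions[i + 3], p4 = primes[i + 3];

        // Conservative bound: maxCP + n*maxP >= cp_j + n*p_j for every j,
        // so each index is in range as long as maxCP is.
        int maxCP = Math.max(Math.max(cp1, cp2), Math.max(cp3, cp4));
        int maxP  = Math.max(Math.max(p1, p2), Math.max(p3, p4));

        while (maxCP < len) {
            logs[cp1] += log; cp1 += p1;
            logs[cp2] += log; cp2 += p2;
            logs[cp3] += log; cp3 += p3;
            logs[cp4] += log; cp4 += p4;
            maxCP += maxP;
        }
        // Finish the tail of each range individually.
        while (cp1 < len) { logs[cp1] += log; cp1 += p1; }
        while (cp2 < len) { logs[cp2] += log; cp2 += p2; }
        while (cp3 < len) { logs[cp3] += log; cp3 += p3; }
        while (cp4 < len) { logs[cp4] += log; cp4 += p4; }

        // Carry the positions into the next block, as Wheel.update does.
        positions[i]     = cp1 - len;
        positions[i + 1] = cp2 - len;
        positions[i + 2] = cp3 - len;
        positions[i + 3] = cp4 - len;
    }
}
```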

Last fiddled with by bsquared on 2020-08-26 at 20:24

2020-08-26, 20:22   #4
bsquared

"Ben"
Feb 2007

D4C16 Posts

Also, usually people initialize the sieve byte values to a cutoff:

Code:
for (i = 0; i < sieve_length; i++) {
    sieve[i] = cutoff;
}

Then sieve just like normal, but subtract instead of add. Now you can easily find locations that are ready for trial division by ANDing with a high-bit mask:

Code:
if (sieve[i] & 0x80) {
    trial_divide(i);
}

But most locations won't need trial division, so you can check and eliminate many at once by reading the sieve 8 bytes at a time as an unsigned 64-bit type:

Code:
if ((*(uint64_t *)(sieve + i) & 0x8080808080808080ULL) == 0) {
    continue;
}

Makes scanning for sieve reports much faster.
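A sketch of the same scan in Java: there are no unsigned types, but reading the sieve through a ByteBuffer and masking the sign bit of each byte works the same way. Names here are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// The sieve starts at a cutoff and log values are subtracted, so a byte that
// underflows past 0x80 marks a candidate for trial division. Reading 8 bytes
// at a time as a long lets us reject whole blocks with one test.
public final class SieveScan {
    static final long HI_BITS = 0x8080808080808080L;

    static List<Integer> findReports(byte[] sieve) {
        List<Integer> reports = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(sieve).order(ByteOrder.LITTLE_ENDIAN);
        int i = 0;
        int limit = sieve.length - 7;
        for (; i < limit; i += 8) {
            if ((buf.getLong(i) & HI_BITS) == 0) continue; // no hits in these 8 bytes
            for (int j = i; j < i + 8; j++)
                if ((sieve[j] & 0x80) != 0) reports.add(j);
        }
        for (; i < sieve.length; i++)       // tail bytes, if length % 8 != 0
            if ((sieve[i] & 0x80) != 0) reports.add(i);
        return reports;
    }
}
```

The endianness choice doesn't matter here, since the mask tests the same bit of every byte regardless of byte order.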
2020-08-26, 21:36   #5
Ilya Gazman

Feb 2016

3×5 Posts

Quote:
 I don't know how to re-work this code exactly, given that currentPosition, prime, log, etc. are apparently maintained by a class (?)
Yeah, it is maintained by the Wheel class; each prime has two Wheel instances that track its current position in the logs array.

My sieving bound M is represented by an array of size S, where S·k = M for a small value of k (I use it as a square of M); I found it more memory-efficient.
Then my sieving loop looks like this:

Code:
for (i = 0; i < k; i++) {
    for (Wheel wheel : wheels) {
        wheel.update(logs);
    }
}
I have a planned optimization where I pick S as a multiple of the first few primes; then I only need to sieve them once for each M cycle and can reset the logs to their saved values, as they don't change between the k-1 cycles of S.
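Roughly what I have in mind (the names and the root offsets here are just a sketch, not my current code): because S % p == 0 for each small prime p, its hit pattern repeats exactly every S positions, so it can be sieved once into a baseline that each cycle starts from.

```java
// Sketch of pre-sieving the first few primes once. Since S is a multiple of
// each small prime, their hit positions are identical in every length-S
// block, so each cycle can start by copying a saved baseline instead of
// re-sieving them.
public final class PresieveBaseline {
    // Build the baseline once per polynomial. roots[i] is the starting
    // offset of primes[i] in the block (hypothetical name).
    static byte[] buildBaseline(int s, int[] primes, int[] roots, byte[] plogs) {
        byte[] base = new byte[s];
        for (int i = 0; i < primes.length; i++)
            for (int cp = roots[i] % primes[i]; cp < s; cp += primes[i])
                base[cp] += plogs[i];
        return base;
    }

    // Each cycle starts from a copy of the baseline; only the larger primes
    // still need per-cycle sieving (not shown).
    static void resetLogs(byte[] logs, byte[] baseline) {
        System.arraycopy(baseline, 0, logs, 0, baseline.length);
    }
}
```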
----------------------------------------------------

Quote:
 For small-ish primes you could unroll it. log changes slowly and will be the same for 4 adjacent primes.
Perhaps I don't understand it, but how does having a common log value for several primes help with performance?

Quote:
 if ((*(uint64_t *)(sieve + i) & 0x8080808080808080ULL) == 0) { continue; }
Love it! Will try that :)

Last fiddled with by Ilya Gazman on 2020-08-26 at 21:37

2020-08-26, 21:41   #6
Ilya Gazman

Feb 2016

3×5 Posts

Quote:
 Originally Posted by R. Gerbicz And then msieve is faster than light? Save out logs.length to a parameter. Don't know Java, but similar code in c++ is also slower when I'm using .size() in a loop.
Nice! This actually helped a bit. I rewrote the update function to only use local variables.

Code:
public void update(byte[] logs) {
    int length = logs.length;
    int currentPosition = this.currentPosition;
    byte log = this.log;
    int prime = this.prime;

    while (currentPosition < length) {
        logs[currentPosition] += log;
        currentPosition += prime;
    }

    this.currentPosition = currentPosition - length;
}

2020-08-26, 22:03   #7
bsquared

"Ben"
Feb 2007

22·23·37 Posts

Quote:
 Originally Posted by Ilya Gazman I perhaps don't understand it, but how having a common log value for several primes helps with performance?
Saves registers. The goal of unrolling is to minimize loop overhead (comparing against the loop condition, jumps). But unrolling increases the code size. A sweet spot occurs when all variables in the loop can be stored in registers (there are 14 usable in x86-64). We need 4 for the primes, 4 for the positions, and one each for the sieve address, the log value, and the loop counter. If we needed 4 different log values, the compiler would probably spill to the stack, which is slower.

All this might be moot, as the bottleneck of the loop is probably memory access and not loop overhead. Unrolling 4x might actually help with memory access since some percentage of the time the writes might hit the same cache line (this is a good thing). Anyway, it might not be a big win but it might help.

