mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Factoring

Reply
 
Thread Tools
Old 2020-08-26, 19:30   #1
Ilya Gazman
 
Feb 2016

3×5 Posts
Cool How to optimize the sieving stage of QS?

My latest version of QS is moving in fifth of the speed of light as it only 5 times slower than Tills version now, lol.

40% of my sieving code is spent on the below loop. Do you have an idea how to improve it?

Code:
// that loop takes 40% of the performance
for (Wheel wheel : wheels) { 
    wheel.update(logs);
}

// Wheel update method
public void update(byte[] logs) {
    for (; currentPosition < logs.length; currentPosition += prime) { // currentPosition is int
        logs[currentPosition] += log;
    }

    this.currentPosition -= logs.length;
}
Ilya Gazman is offline   Reply With Quote
Old 2020-08-26, 19:50   #2
R. Gerbicz
 
R. Gerbicz's Avatar
 
"Robert Gerbicz"
Oct 2005
Hungary

2·709 Posts
Default

Quote:
Originally Posted by Ilya Gazman View Post
fifth of the speed of light as it only 5 times slower than Tills version now, lol.
And then msieve is faster than light?
Save out logs.length to a parameter. Don't know Java, but similar code in c++ is also slower when I'm using .size() in a loop.
R. Gerbicz is offline   Reply With Quote
Old 2020-08-26, 20:17   #3
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

2×3×557 Posts
Default

Quote:
Originally Posted by Ilya Gazman View Post
My latest version of QS is moving in fifth of the speed of light as it only 5 times slower than Tills version now, lol.

40% of my sieving code is spent on the below loop. Do you have an idea how to improve it?

Code:
// that loop takes 40% of the performance
for (Wheel wheel : wheels) { 
    wheel.update(logs);
}

// Wheel update method
public void update(byte[] logs) {
    for (; currentPosition < logs.length; currentPosition += prime) { // currentPosition is int
        logs[currentPosition] += log;
    }

    this.currentPosition -= logs.length;
}
For small-ish primes you could unroll it. log changes slowly and will be the same for 4 adjacent primes. I don't know how to to re-work this code exactly given that currentPosition, prime, log, etc., are apparently maintained by a class (?). But something like:

cp1 = currentPosition[i]
cp2 = currentPosition[i+1]
cp3 = currentPosition[i+2]
cp4 = currentPosition[i+3]

p1 = primes[i]
p2 = primes[i+1]
p3 = primes[i+2]
p4 = primes[i+3]

maxCP = MAX(cp1, cp2, cp3, cp4)
maxP = MAX(p1, p2, p3, p4)

for (; maxCP < logs.length; ) {
logs[cp1] += log;
logs[cp2] += log;
logs[cp3] += log;
logs[cp4] += log;
cp1 += p1;
cp2 += p2;
cp3 += p3;
cp4 += p4;
maxCP += maxP;
}

// now finish the last part of each range with individual loops


// finally update all positions for the next block
currentPosition[i] = cp1 - logs.length
currentPosition[i+1] = cp2 - logs.length
currentPosition[i+2] = cp3 - logs.length
currentPosition[i+3] = cp4 - logs.length

Last fiddled with by bsquared on 2020-08-26 at 20:24
bsquared is offline   Reply With Quote
Old 2020-08-26, 20:22   #4
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

2×3×557 Posts
Default

Also, usually people initialize the sieve byte values to a cutoff:

for (i=0; i < sieve_length; i++) {
sieve[i] = cutoff;
}

Then sieve just like normal, but subtract instead of add.

Now, you can easily find locations that are ready for trial division by ANDing with a hi-bit mask.

if (sieve[i] & 0x80) {
trial_divide(i)
}

But most locations won't need trial division, so you can check and eliminate many at once by overloading the sieve to an 8-byte unsigned long long type:

if (((uint64_t)sieve[i] & 0x8080808080808080ULL) == 0) {
continue;'
}

Makes scanning for sieve reports much faster.
bsquared is offline   Reply With Quote
Old 2020-08-26, 21:36   #5
Ilya Gazman
 
Feb 2016

3·5 Posts
Default

Quote:
I don't know how to to re-work this code exactly given that currentPosition, prime, log, etc., are apparently maintained by a class (?)
Yeah it is maintained by the Wheel class, each prime has two Wheel instances that track the current position in the logs array.

My sieving bound M is represented by an array of the size S where Sk = M for a small value of k(I use it as a square of M), I found it more memory efficient.
Then my sieving loop looks like this:

Code:
for (i=0;i<k;i++){
   for (Wheel wheel : wheels) { 
       wheel.update(logs);
   }
}
I got a planned optimization where I pick S as a multiplicate of the first few primes, then I only need to sieve them once for each M cycle and I reset the logs to their values as it don't change between the k-1 cycles of S.
----------------------------------------------------

Quote:
For small-ish primes you could unroll it. log changes slowly and will be the same for 4 adjacent primes.
I perhaps don't understand it, but how having a common log value for several primes helps with performance?

Quote:
if (((uint64_t)sieve[i] & 0x8080808080808080ULL) == 0) {
continue;'
}
Love it! Will try that :)

Last fiddled with by Ilya Gazman on 2020-08-26 at 21:37
Ilya Gazman is offline   Reply With Quote
Old 2020-08-26, 21:41   #6
Ilya Gazman
 
Feb 2016

11112 Posts
Default

Quote:
Originally Posted by R. Gerbicz View Post
And then msieve is faster than light?
Save out logs.length to a parameter. Don't know Java, but similar code in c++ is also slower when I'm using .size() in a loop.
Nice! This actually helped a bit. I rewrote the update function to only use local variables.

Code:
    public void update(byte[] logs) {
        int length = logs.length;
        int currentPosition = this.currentPosition;
        byte log = this.log;
        int prime = this.prime;
        while (currentPosition < length) {
            logs[currentPosition] += log;
            currentPosition += prime;
        }

        this.currentPosition = currentPosition - length;
    }
Ilya Gazman is offline   Reply With Quote
Old 2020-08-26, 22:03   #7
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

D0E16 Posts
Default

Quote:
Originally Posted by Ilya Gazman View Post
I perhaps don't understand it, but how having a common log value for several primes helps with performance?
Saves registers. The goal of unrolling is to minimize loop overhead (comparing against loop condition, jumps). But unrolling increases the code size. A sweet spot occurs if all variables in the loop can be stored in registers (there are 14 available in x86-64). We need 4 for primes, 4 for positions, one each for the sieve address, log value, and loop counter. If we needed 4 different log values then the compiler will probably utilize stack, which is slower.

All this might be moot, as the bottleneck of the loop is probably memory access and not loop overhead. Unrolling 4x might actually help with memory access since some percentage of the time the writes might hit the same cache line (this is a good thing). Anyway, it might not be a big win but it might help.
bsquared is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Bypass Sieving Stage? Sam Kennedy Factoring 1 2016-11-29 02:59
Stage 1 with mprime/prime95, stage 2 with GMP-ECM D. B. Staple Factoring 2 2007-12-14 00:21
Need help to run stage 1 and stage 2 separately jasong GMP-ECM 9 2007-10-25 22:32
Stage 1 and stage 2 tests missing Matthias C. Noc PrimeNet 5 2004-08-25 15:42
A distributed-computing project to optimize GIMPS FFT? Genetic algorithms GP2 Software 10 2003-12-09 20:41

All times are UTC. The time now is 14:01.

Sun Nov 29 14:01:43 UTC 2020 up 80 days, 11:12, 3 users, load averages: 1.37, 1.15, 1.15

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.