mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Factoring

Reply
 
Thread Tools
Old 2019-01-15, 19:37   #1
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

5×79 Posts
Default Fast Lehman implementation

Thilo Harich and me worked together for some time on Lehmans factoring method,
and we think we found some pretty improvements above the code incorporated in YaFu in file LehmanClean.c (written by Warren D. Smith).

Lets start with a definition: Given some N to factor, a simple version of Lehman's factoring algorithm looks like the following pseudo-code:

Code:
uint64 findFactor(uint64 N) {
    cbrtN = cbrt(N);
    sixthRootN = sqrt(cbrtN);

    Do trial division until cbrtN; finish if a factor is found.

    for k from 1 to cbrtN {
        float64 sqrt4kN = sqrt(4*k*N);
        uint64 aStart = ceil(sqrt4kN);
        uint64 aLimit = floor(sqrt4kN + sixthRootN / (4*sqrt(k)));
        for (uint64 a = aStart to aLimit) {
            uint64 test = a^2 - 4*k*N;
            uint64 b = sqrt(test);
            if (b*b==test) {
                // test is a square
                return gcd(a+b, N);
            }
        }
    }
    return 0; // fail;
}
Warren D.Smith's code does not deviate too far from that line, his only theoretical improvement seems to be that he exploits the congruences
a == k+1 (mod 2)
a == (k+N) (mod 4) if k is odd



Thilo and me found several other improvements:

1. Precompute sqrt(k) for all k from 1 to 2^21. Since k <= cbrt(N), this is enough for any N<=2^63.
This makes a big difference considering performance.

2. Along the same lines, we precompute 1/sqrt(k) for all k from 1 to 2^21.
This allows to replace the division sixthRootN / (4*sqrt(k)) by a multiplication sixthRootTerm * sqrt[k],
where sixthRootTerm = sixthRootN / 4.

3. Split the k loop: kTwoA = cbrtN >> 6 is the splitting point where we'ld have an a-range >= or <= 2 in the inner a-loop.
a. For k<=kTwoA, an inner a-loop is needed, and we proceed similar to Smith's implementation.
b. For k>kTwoA, the a-range of the a-loop is <=2. Exploiting the congruences mentioned above, only one a-value is possible.

Something similar has been mentioned in https://www.cambridge.org/core/servi...46788712000146, end of paragraph 6. Our implementation spends ~90% of its computing time and finds ~90% of factors in the k>kTwoA loops.

4. Arrange tested k according to k%6: k%6 is more discriminative than k%4.
k%6==0 has a bigger chance to find a factor, followed by k%6==3, so these cases are tested first.

5. In trial division, (division by p) should be replaced by (multiplication with reciprocals (1/p)).
This speeds up trial division by factor 3 approximately.



We tried to reproduce Warren D. Smith's implementation in Java and found that our Lehman_Fast implementation is about 5 times faster.
Of course, Java and C timings are hard to compare, but our improvements should have an impact in any programming language;
it is quite probable that a similar improvement can be obtained in C.

In JML (former PSIQS, see https://github.com/TilmanNeumann/java-math-library), the Lehman_Fast implementation is now the fastest
algorithm for N from 31 to 47 bit. Pollard-Rho-Brent-Montgomery takes over for Nfrom 48 to 62 bits, SquFoF is not the best algorithm at any N-size anymore.



Another interesting point is that our Lehman algorithm runs without errors until N with 60 bit.
We noticed that for N >=45 bit both a^2 and 4kN start to overflow positive long values;
nevertheless we get correct values of test = a^2 - 4kN, because if a^2 and 4kN overflow positive longs (sint64),
then the error ERR is the same in both r.h.s parts, i.e. test = (a^2+ERR) - (4kN+ERR) = a^2 - 4kN.
The error ERR is always a multiple of 2^64.
We found out that the congruence adjustments are responsible for that, but do not understand the phenomen yet.

Links to our implementation:
Fast trial division: https://github.com/TilmanNeumann/jav...63Inverse.java
Fast Lehman: https://github.com/TilmanNeumann/jav...hman_Fast.java
Till is offline   Reply With Quote
Old 2019-01-15, 20:39   #2
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

16068 Posts
Default

This is really great work. I am hoping to have some free time this year to dedicate to spend measuring the performance of various algorithms for 64-bit in C. Albeit I had the same hope last year and it didn't work out.

I did a little playing with Lehman (the Smith/Buhrow algorithm) a year ago and didn't find it faster in my very informal tests. In particular for values under 30-bit, trial division + a tune 32-bit HOLF was faster, and after that Pollard-Rho-Brent faster. That is for arbitrary inputs, while for just semiprime inputs Lehman was better for 32-38 bits.

I am leery of using largish tables, but I'd love to be able to try it for benchmarking as well as try different variations.
danaj is offline   Reply With Quote
Old 2019-01-15, 21:31   #3
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

327110 Posts
Default

Agreed, this looks really neat! I'd like to try a C port and benchmark next to other things.

One question:
In the lehmanEven and lehmanOdd functions where you have the squaritude tests, does it help at all to do a quick filter like

Code:
t2 = Qn & 31;
if (t2 == 0 || t2 == 1 || t2 == 4 ||
    t2 == 9 || t2 == 16 || t2 == 17 || t2 == 25)
before doing the Math.sqrt()?
bsquared is offline   Reply With Quote
Old 2019-01-15, 23:17   #4
ThiloHarich
 
ThiloHarich's Avatar
 
Nov 2005

32×11 Posts
Default

@bsquared We did several approaches to get around always calculating the square root.

We had several checks for certain mods in the code, but they slowed down the whole algorithm.

We only create numbers 'a' which produce number a^2 - kn = y^2 mod 8. In the original work of Lehman there was only a mod 4 argument.
Checking if a number can be a square Mod powers of 2 can be done very efficiently : Store in an boolean array if a number i is a square. Check the array for number & (2^k-1).
Unfortunately checking squares mod higher powers of 2 does not filter out much numbers. Without the checks the algorithm was faster.

Applying mod arguments for other numbers like products of small primes like 3*3*5*7 filters out much more numbers, but slows down the algorithm as well. Here we have to apply an expensive % operation. Again the algorithm was slower with the check.

This was very strange. We saved much time in precomputing sqrt(k) and storing it in an array, but calculating the int value of the square root of a^2 - k*n (magnitude n^2/3) was surprisingly fast.

We also tried different ways to calculate the square root like the quake approximation https://en.wikipedia.org/wiki/Fast_inverse_square_root followed by newton but calling sqrt was still faster.

Last fiddled with by ThiloHarich on 2019-01-15 at 23:18
ThiloHarich is offline   Reply With Quote
Old 2019-01-16, 15:12   #5
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

5×79 Posts
Default

Thanks for the good words. I hope it works out for you, too.



@Dana: We need 3 such tables, two in Lehman and one in the trial division. Currently each table is 16MB. This does not look too big to me; but if you want, it can be reduced:
* Since I know that the algorithm does not work for N>60 bit, I could reduce the table element count to 2^20 -> the table size to 8MB each.
* If you find out (like me) that there is a faster algorithm for N>=48 bit, then the table element count could be reduced to 2^16 -> the table size to 512k each.




@Ben: Java's Math.sqrt() is a native implementation taken from fdlibm. The source code says "slow but portable" but for me it feels pretty fast :-/ Here is the source code: http://hg.openjdk.java.net/jdk8/jdk8...m/src/e_sqrt.c.
Till is offline   Reply With Quote
Old 2019-01-16, 15:29   #6
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

5·79 Posts
Default

The above holds for OpenJDK. Maybe Oracle's JVM (which I am working with) is using different native code.

Last fiddled with by Till on 2019-01-16 at 15:31 Reason: clearer OpenJDK vs. Oracle Java differentiation
Till is offline   Reply With Quote
Old 2019-01-17, 22:54   #7
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3,271 Posts
Default

I ported Lehman_Fast to C and ran it though the same testbench as we did once before for squfof/rho/etc. From 32 to 60 bit semiprimes it was 100% correct and it is very fast! I measure about 3x faster than Warren's code. The crossover with brent rho is at about 51 bits on one platform (linux, icc) and about 48 bits on another (windows, mingw64 gcc).

This is with the trial division parameter set to tdiv *after* the lehman loop. If done before the loop it is much slower (many semiprime inputs are easier to factor by lehman than by tdiv up to the cuberoot). For general inputs one may want to reverse that, but for semiprimes it makes sense to do tdiv after. Oh, and I'm just using a simple division-based tdiv, not your multiply by inverse one.

Very nice work.
bsquared is offline   Reply With Quote
Old 2019-01-18, 04:39   #8
CRGreathouse
 
CRGreathouse's Avatar
 
Aug 2006

133468 Posts
Cool

Nice work!
CRGreathouse is offline   Reply With Quote
Old 2019-01-18, 06:33   #9
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

16068 Posts
Default

Quote:
Originally Posted by bsquared View Post
I measure about 3x faster than Warren's code.
Wow, that's enough to completely change where things line up. Well done indeed, Thilo and Till!
danaj is offline   Reply With Quote
Old 2019-01-18, 09:20   #10
ThiloHarich
 
ThiloHarich's Avatar
 
Nov 2005

32·11 Posts
Default

Great that we have similar performance results also in C.

Since there are still many things I have not understood I think the Lehman algorithm has even more potential.

We have discovered some structure for the values of N, k and a. We saw some nice congruences of N, k, a mod 12 or 32 for example. But there seems to be much more structure of the numbers then we have discovered so far. If we can find and use this structure it might end up in an even faster Algorithm.

Other questions / observations:

Using bigger multiplier(s) for k then 6 (like 2*3*3*5*7) results in even better performance. The factor might depend on N.

Can we have a faster check if a number is a square then calculating the expensive square root?

I also do not like the precomputing phase. We might get around the expensive calculation of sqrtt(k) with other approaches.
ThiloHarich is offline   Reply With Quote
Old 2019-01-18, 15:24   #11
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

5·79 Posts
Default

Quote:
Originally Posted by bsquared View Post
I ported Lehman_Fast to C and ran it though the same testbench as we did once before for squfof/rho/etc. From 32 to 60 bit semiprimes it was 100% correct and it is very fast! I measure about 3x faster than Warren's code. The crossover with brent rho is at about 51 bits on one platform (linux, icc) and about 48 bits on another (windows, mingw64 gcc).

This is with the trial division parameter set to tdiv *after* the lehman loop. If done before the loop it is much slower (many semiprime inputs are easier to factor by lehman than by tdiv up to the cuberoot). For general inputs one may want to reverse that, but for semiprimes it makes sense to do tdiv after. Oh, and I'm just using a simple division-based tdiv, not your multiply by inverse one.
Nice to hear :-)
The crossover point with my PollardRhoBrentMontgomeryR64Mul63 is at 47-48 bits, so very similar to your Windows result.

Typically I tested with semiprimes where up to 15% of inputs needed tdiv. There, tdiv *after* Lehman was most appropriate, too.
Till is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Do normal adults give themselves an allowance? (...to fast or not to fast - there is no question!) jasong jasong 35 2016-12-11 00:57
SQUFOF implementation alpertron Factoring 15 2010-04-12 19:16
ECM/FPGA Implementation rdotson Hardware 12 2006-03-26 22:58
need C implementation of divb(r,k) Leith Miscellaneous Math 4 2005-01-18 23:14
RSA implementation flava Programming 12 2004-10-26 03:51

All times are UTC. The time now is 19:54.

Wed Aug 5 19:54:53 UTC 2020 up 19 days, 15:41, 2 users, load averages: 1.92, 1.55, 1.53

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.