mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   Fast Lehman implementation (https://www.mersenneforum.org/showthread.php?t=24002)

Till 2019-01-15 19:37

Fast Lehman implementation
 
Thilo Harich and me worked together for some time on Lehmans factoring method,
and we think we found some pretty improvements above the code incorporated in YaFu in file LehmanClean.c (written by Warren D. Smith).

Lets start with a definition: Given some N to factor, a simple version of Lehman's factoring algorithm looks like the following pseudo-code:

[CODE]
uint64 findFactor(uint64 N) {
cbrtN = cbrt(N);
sixthRootN = sqrt(cbrtN);

Do trial division until cbrtN; finish if a factor is found.

for k from 1 to cbrtN {
float64 sqrt4kN = sqrt(4*k*N);
uint64 aStart = ceil(sqrt4kN);
uint64 aLimit = floor(sqrt4kN + sixthRootN / (4*sqrt(k)));
for (uint64 a = aStart to aLimit) {
uint64 test = a^2 - 4*k*N;
uint64 b = sqrt(test);
if (b*b==test) {
// test is a square
return gcd(a+b, N);
}
}
}
return 0; // fail;
}
[/CODE]Warren D.Smith's code does not deviate too far from that line, his only theoretical improvement seems to be that he exploits the congruences
a == k+1 (mod 2)
a == (k+N) (mod 4) if k is odd



Thilo and me found several other improvements:

1. Precompute sqrt(k) for all k from 1 to 2^21. Since k <= cbrt(N), this is enough for any N<=2^63.
This makes a big difference considering performance.

2. Along the same lines, we precompute 1/sqrt(k) for all k from 1 to 2^21.
This allows to replace the division sixthRootN / (4*sqrt(k)) by a multiplication sixthRootTerm * sqrt[k],
where sixthRootTerm = sixthRootN / 4.

3. Split the k loop: kTwoA = cbrtN >> 6 is the splitting point where we'ld have an a-range >= or <= 2 in the inner a-loop.
a. For k<=kTwoA, an inner a-loop is needed, and we proceed similar to Smith's implementation.
b. For k>kTwoA, the a-range of the a-loop is <=2. Exploiting the congruences mentioned above, only one a-value is possible.

Something similar has been mentioned in [URL]https://www.cambridge.org/core/services/aop-cambridge-core/content/view/S1446788712000146[/URL], end of paragraph 6. Our implementation spends ~90% of its computing time and finds ~90% of factors in the k>kTwoA loops.

4. Arrange tested k according to k%6: k%6 is more discriminative than k%4.
k%6==0 has a bigger chance to find a factor, followed by k%6==3, so these cases are tested first.

5. In trial division, (division by p) should be replaced by (multiplication with reciprocals (1/p)).
This speeds up trial division by factor 3 approximately.



We tried to reproduce Warren D. Smith's implementation in Java and found that our Lehman_Fast implementation is about 5 times faster.
Of course, Java and C timings are hard to compare, but our improvements should have an impact in any programming language;
it is quite probable that a similar improvement can be obtained in C.

In JML (former PSIQS, see [URL]https://github.com/TilmanNeumann/java-math-library[/URL]), the Lehman_Fast implementation is now the fastest
algorithm for N from 31 to 47 bit. Pollard-Rho-Brent-Montgomery takes over for Nfrom 48 to 62 bits, SquFoF is not the best algorithm at any N-size anymore.



Another interesting point is that our Lehman algorithm runs without errors until N with 60 bit.
We noticed that for N >=45 bit both a^2 and 4kN start to overflow positive long values;
nevertheless we get correct values of test = a^2 - 4kN, because if a^2 and 4kN overflow positive longs (sint64),
then the error ERR is the same in both r.h.s parts, i.e. test = (a^2+ERR) - (4kN+ERR) = a^2 - 4kN.
The error ERR is always a multiple of 2^64.
We found out that the congruence adjustments are responsible for that, but do not understand the phenomen yet.

Links to our implementation:
Fast trial division: [URL]https://github.com/TilmanNeumann/java-math-library/blob/master/src/de/tilman_neumann/jml/factor/tdiv/TDiv63Inverse.java[/URL]
Fast Lehman: [URL]https://github.com/TilmanNeumann/java-math-library/blob/master/src/de/tilman_neumann/jml/factor/lehman/Lehman_Fast.java[/URL]

danaj 2019-01-15 20:39

This is really great work. I am hoping to have some free time this year to dedicate to spend measuring the performance of various algorithms for 64-bit in C. Albeit I had the same hope last year and it didn't work out.

I did a little playing with Lehman (the Smith/Buhrow algorithm) a year ago and didn't find it faster in my very informal tests. In particular for values under 30-bit, trial division + a tune 32-bit HOLF was faster, and after that Pollard-Rho-Brent faster. That is for arbitrary inputs, while for just semiprime inputs Lehman was better for 32-38 bits.

I am leery of using largish tables, but I'd love to be able to try it for benchmarking as well as try different variations.

bsquared 2019-01-15 21:31

Agreed, this looks really neat! I'd like to try a C port and benchmark next to other things.

One question:
In the lehmanEven and lehmanOdd functions where you have the squaritude tests, does it help at all to do a quick filter like

[CODE]t2 = Qn & 31;
if (t2 == 0 || t2 == 1 || t2 == 4 ||
t2 == 9 || t2 == 16 || t2 == 17 || t2 == 25)[/CODE]

before doing the Math.sqrt()?

ThiloHarich 2019-01-15 23:17

@bsquared We did several approaches to get around always calculating the square root.

We had several checks for certain mods in the code, but they slowed down the whole algorithm.

We only create numbers 'a' which produce number a^2 - kn = y^2 mod 8. In the original work of Lehman there was only a mod 4 argument.
Checking if a number can be a square Mod powers of 2 can be done very efficiently : Store in an boolean array if a number i is a square. Check the array for number & (2^k-1).
Unfortunately checking squares mod higher powers of 2 does not filter out much numbers. Without the checks the algorithm was faster.

Applying mod arguments for other numbers like products of small primes like 3*3*5*7 filters out much more numbers, but slows down the algorithm as well. Here we have to apply an expensive % operation. Again the algorithm was slower with the check.

This was very strange. We saved much time in precomputing sqrt(k) and storing it in an array, but calculating the int value of the square root of a^2 - k*n (magnitude n^2/3) was surprisingly fast.

We also tried different ways to calculate the square root like the quake approximation [url]https://en.wikipedia.org/wiki/Fast_inverse_square_root[/url] followed by newton but calling sqrt was still faster.

Till 2019-01-16 15:12

Thanks for the good words. I hope it works out for you, too.



@Dana: We need 3 such tables, two in Lehman and one in the trial division. Currently each table is 16MB. This does not look too big to me; but if you want, it can be reduced:
* Since I know that the algorithm does not work for N>60 bit, I could reduce the table element count to 2^20 -> the table size to 8MB each.
* If you find out (like me) that there is a faster algorithm for N>=48 bit, then the table element count could be reduced to 2^16 -> the table size to 512k each.




@Ben: Java's Math.sqrt() is a native implementation taken from fdlibm. The source code says "slow but portable" but for me it feels pretty fast :-/ Here is the source code: [url]http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/native/java/lang/fdlibm/src/e_sqrt.c[/url].

Till 2019-01-16 15:29

The above holds for OpenJDK. Maybe Oracle's JVM (which I am working with) is using different native code.

bsquared 2019-01-17 22:54

I ported Lehman_Fast to C and ran it though the same testbench as we did once before for squfof/rho/etc. From 32 to 60 bit semiprimes it was 100% correct and it is very fast! I measure about 3x faster than Warren's code. The crossover with brent rho is at about 51 bits on one platform (linux, icc) and about 48 bits on another (windows, mingw64 gcc).

This is with the trial division parameter set to tdiv *after* the lehman loop. If done before the loop it is much slower (many semiprime inputs are easier to factor by lehman than by tdiv up to the cuberoot). For general inputs one may want to reverse that, but for semiprimes it makes sense to do tdiv after. Oh, and I'm just using a simple division-based tdiv, not your multiply by inverse one.

Very nice work.

CRGreathouse 2019-01-18 04:39

Nice work!

danaj 2019-01-18 06:33

[QUOTE=bsquared;506294]I measure about 3x faster than Warren's code.[/QUOTE]

Wow, that's enough to completely change where things line up. Well done indeed, Thilo and Till!

ThiloHarich 2019-01-18 09:20

Great that we have similar performance results also in C.

Since there are still many things I have not understood I think the Lehman algorithm has even more potential.

We have discovered some structure for the values of N, k and a. We saw some nice congruences of N, k, a mod 12 or 32 for example. But there seems to be much more structure of the numbers then we have discovered so far. If we can find and use this structure it might end up in an even faster Algorithm.

Other questions / observations:

Using bigger multiplier(s) for k then 6 (like 2*3*3*5*7) results in even better performance. The factor might depend on N.

Can we have a faster check if a number is a square then calculating the expensive square root?

I also do not like the precomputing phase. We might get around the expensive calculation of sqrtt(k) with other approaches.

Till 2019-01-18 15:24

[QUOTE=bsquared;506294]I ported Lehman_Fast to C and ran it though the same testbench as we did once before for squfof/rho/etc. From 32 to 60 bit semiprimes it was 100% correct and it is very fast! I measure about 3x faster than Warren's code. The crossover with brent rho is at about 51 bits on one platform (linux, icc) and about 48 bits on another (windows, mingw64 gcc).

This is with the trial division parameter set to tdiv *after* the lehman loop. If done before the loop it is much slower (many semiprime inputs are easier to factor by lehman than by tdiv up to the cuberoot). For general inputs one may want to reverse that, but for semiprimes it makes sense to do tdiv after. Oh, and I'm just using a simple division-based tdiv, not your multiply by inverse one.
[/QUOTE]

Nice to hear :-)
The crossover point with my PollardRhoBrentMontgomeryR64Mul63 is at 47-48 bits, so very similar to your Windows result.

Typically I tested with semiprimes where up to 15% of inputs needed tdiv. There, tdiv *after* Lehman was most appropriate, too.


All times are UTC. The time now is 21:08.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.