20190115, 19:37  #1 
"Tilman Neumann"
Jan 2016
Germany
2·3·71 Posts 
Fast Lehman implementation
Thilo Harich and me worked together for some time on Lehmans factoring method,
and we think we found some pretty improvements above the code incorporated in YaFu in file LehmanClean.c (written by Warren D. Smith). Lets start with a definition: Given some N to factor, a simple version of Lehman's factoring algorithm looks like the following pseudocode: Code:
uint64 findFactor(uint64 N) { cbrtN = cbrt(N); sixthRootN = sqrt(cbrtN); Do trial division until cbrtN; finish if a factor is found. for k from 1 to cbrtN { float64 sqrt4kN = sqrt(4*k*N); uint64 aStart = ceil(sqrt4kN); uint64 aLimit = floor(sqrt4kN + sixthRootN / (4*sqrt(k))); for (uint64 a = aStart to aLimit) { uint64 test = a^2  4*k*N; uint64 b = sqrt(test); if (b*b==test) { // test is a square return gcd(a+b, N); } } } return 0; // fail; } a == k+1 (mod 2) a == (k+N) (mod 4) if k is odd Thilo and me found several other improvements: 1. Precompute sqrt(k) for all k from 1 to 2^21. Since k <= cbrt(N), this is enough for any N<=2^63. This makes a big difference considering performance. 2. Along the same lines, we precompute 1/sqrt(k) for all k from 1 to 2^21. This allows to replace the division sixthRootN / (4*sqrt(k)) by a multiplication sixthRootTerm * sqrt[k], where sixthRootTerm = sixthRootN / 4. 3. Split the k loop: kTwoA = cbrtN >> 6 is the splitting point where we'ld have an arange >= or <= 2 in the inner aloop. a. For k<=kTwoA, an inner aloop is needed, and we proceed similar to Smith's implementation. b. For k>kTwoA, the arange of the aloop is <=2. Exploiting the congruences mentioned above, only one avalue is possible. Something similar has been mentioned in https://www.cambridge.org/core/servi...46788712000146, end of paragraph 6. Our implementation spends ~90% of its computing time and finds ~90% of factors in the k>kTwoA loops. 4. Arrange tested k according to k%6: k%6 is more discriminative than k%4. k%6==0 has a bigger chance to find a factor, followed by k%6==3, so these cases are tested first. 5. In trial division, (division by p) should be replaced by (multiplication with reciprocals (1/p)). This speeds up trial division by factor 3 approximately. We tried to reproduce Warren D. Smith's implementation in Java and found that our Lehman_Fast implementation is about 5 times faster. Of course, Java and C timings are hard to compare, but our improvements should have an impact in any programming language; it is quite probable that a similar improvement can be obtained in C. In JML (former PSIQS, see https://github.com/TilmanNeumann/javamathlibrary), the Lehman_Fast implementation is now the fastest algorithm for N from 31 to 47 bit. PollardRhoBrentMontgomery takes over for Nfrom 48 to 62 bits, SquFoF is not the best algorithm at any Nsize anymore. Another interesting point is that our Lehman algorithm runs without errors until N with 60 bit. We noticed that for N >=45 bit both a^2 and 4kN start to overflow positive long values; nevertheless we get correct values of test = a^2  4kN, because if a^2 and 4kN overflow positive longs (sint64), then the error ERR is the same in both r.h.s parts, i.e. test = (a^2+ERR)  (4kN+ERR) = a^2  4kN. The error ERR is always a multiple of 2^64. We found out that the congruence adjustments are responsible for that, but do not understand the phenomen yet. Links to our implementation: Fast trial division: https://github.com/TilmanNeumann/jav...63Inverse.java Fast Lehman: https://github.com/TilmanNeumann/jav...hman_Fast.java 
20190115, 20:39  #2 
"Dana Jacobsen"
Feb 2011
Bangkok, TH
2^{2}×227 Posts 
This is really great work. I am hoping to have some free time this year to dedicate to spend measuring the performance of various algorithms for 64bit in C. Albeit I had the same hope last year and it didn't work out.
I did a little playing with Lehman (the Smith/Buhrow algorithm) a year ago and didn't find it faster in my very informal tests. In particular for values under 30bit, trial division + a tune 32bit HOLF was faster, and after that PollardRhoBrent faster. That is for arbitrary inputs, while for just semiprime inputs Lehman was better for 3238 bits. I am leery of using largish tables, but I'd love to be able to try it for benchmarking as well as try different variations. 
20190115, 21:31  #3 
"Ben"
Feb 2007
2·5·337 Posts 
Agreed, this looks really neat! I'd like to try a C port and benchmark next to other things.
One question: In the lehmanEven and lehmanOdd functions where you have the squaritude tests, does it help at all to do a quick filter like Code:
t2 = Qn & 31; if (t2 == 0  t2 == 1  t2 == 4  t2 == 9  t2 == 16  t2 == 17  t2 == 25) 
20190115, 23:17  #4 
Nov 2005
99_{10} Posts 
@bsquared We did several approaches to get around always calculating the square root.
We had several checks for certain mods in the code, but they slowed down the whole algorithm. We only create numbers 'a' which produce number a^2  kn = y^2 mod 8. In the original work of Lehman there was only a mod 4 argument. Checking if a number can be a square Mod powers of 2 can be done very efficiently : Store in an boolean array if a number i is a square. Check the array for number & (2^k1). Unfortunately checking squares mod higher powers of 2 does not filter out much numbers. Without the checks the algorithm was faster. Applying mod arguments for other numbers like products of small primes like 3*3*5*7 filters out much more numbers, but slows down the algorithm as well. Here we have to apply an expensive % operation. Again the algorithm was slower with the check. This was very strange. We saved much time in precomputing sqrt(k) and storing it in an array, but calculating the int value of the square root of a^2  k*n (magnitude n^2/3) was surprisingly fast. We also tried different ways to calculate the square root like the quake approximation https://en.wikipedia.org/wiki/Fast_inverse_square_root followed by newton but calling sqrt was still faster. Last fiddled with by ThiloHarich on 20190115 at 23:18 
20190116, 15:12  #5 
"Tilman Neumann"
Jan 2016
Germany
2×3×71 Posts 
Thanks for the good words. I hope it works out for you, too.
@Dana: We need 3 such tables, two in Lehman and one in the trial division. Currently each table is 16MB. This does not look too big to me; but if you want, it can be reduced: * Since I know that the algorithm does not work for N>60 bit, I could reduce the table element count to 2^20 > the table size to 8MB each. * If you find out (like me) that there is a faster algorithm for N>=48 bit, then the table element count could be reduced to 2^16 > the table size to 512k each. @Ben: Java's Math.sqrt() is a native implementation taken from fdlibm. The source code says "slow but portable" but for me it feels pretty fast :/ Here is the source code: http://hg.openjdk.java.net/jdk8/jdk8...m/src/e_sqrt.c. 
20190116, 15:29  #6 
"Tilman Neumann"
Jan 2016
Germany
110101010_{2} Posts 
The above holds for OpenJDK. Maybe Oracle's JVM (which I am working with) is using different native code.
Last fiddled with by Till on 20190116 at 15:31 Reason: clearer OpenJDK vs. Oracle Java differentiation 
20190117, 22:54  #7 
"Ben"
Feb 2007
2×5×337 Posts 
I ported Lehman_Fast to C and ran it though the same testbench as we did once before for squfof/rho/etc. From 32 to 60 bit semiprimes it was 100% correct and it is very fast! I measure about 3x faster than Warren's code. The crossover with brent rho is at about 51 bits on one platform (linux, icc) and about 48 bits on another (windows, mingw64 gcc).
This is with the trial division parameter set to tdiv *after* the lehman loop. If done before the loop it is much slower (many semiprime inputs are easier to factor by lehman than by tdiv up to the cuberoot). For general inputs one may want to reverse that, but for semiprimes it makes sense to do tdiv after. Oh, and I'm just using a simple divisionbased tdiv, not your multiply by inverse one. Very nice work. 
20190118, 04:39  #8 
Aug 2006
5972_{10} Posts 
Nice work!

20190118, 06:33  #9 
"Dana Jacobsen"
Feb 2011
Bangkok, TH
908_{10} Posts 

20190118, 09:20  #10 
Nov 2005
3^{2}·11 Posts 
Great that we have similar performance results also in C.
Since there are still many things I have not understood I think the Lehman algorithm has even more potential. We have discovered some structure for the values of N, k and a. We saw some nice congruences of N, k, a mod 12 or 32 for example. But there seems to be much more structure of the numbers then we have discovered so far. If we can find and use this structure it might end up in an even faster Algorithm. Other questions / observations: Using bigger multiplier(s) for k then 6 (like 2*3*3*5*7) results in even better performance. The factor might depend on N. Can we have a faster check if a number is a square then calculating the expensive square root? I also do not like the precomputing phase. We might get around the expensive calculation of sqrtt(k) with other approaches. 
20190118, 15:24  #11  
"Tilman Neumann"
Jan 2016
Germany
2·3·71 Posts 
Quote:
The crossover point with my PollardRhoBrentMontgomeryR64Mul63 is at 4748 bits, so very similar to your Windows result. Typically I tested with semiprimes where up to 15% of inputs needed tdiv. There, tdiv *after* Lehman was most appropriate, too. 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Do normal adults give themselves an allowance? (...to fast or not to fast  there is no question!)  jasong  jasong  35  20161211 00:57 
SQUFOF implementation  alpertron  Factoring  15  20100412 19:16 
ECM/FPGA Implementation  rdotson  Hardware  12  20060326 22:58 
need C implementation of divb(r,k)  Leith  Miscellaneous Math  4  20050118 23:14 
RSA implementation  flava  Programming  12  20041026 03:51 