 mersenneforum.org Fast Lehman implementation
2019-01-18, 15:41   #12
jnml

Feb 2012
Prague, Czech Republic

3²×19 Posts Quote:
Originally Posted by ThiloHarich
@bsquared We tried several approaches to get around always calculating the square root. We had several checks for certain moduli in the code, but they slowed down the whole algorithm. We only create numbers 'a' which produce a^2 - kn = y^2 mod 8. In Lehman's original work there was only a mod 4 argument.

Checking whether a number can be a square mod a power of 2 can be done very efficiently: store in a boolean array whether a number i is a square, and check the array for number & (2^k - 1). Unfortunately, checking squares mod higher powers of 2 does not filter out many numbers. Without the checks the algorithm was faster.

Applying mod arguments for other numbers, like products of small primes such as 3*3*5*7, filters out many more numbers, but slows down the algorithm as well, since we have to apply an expensive % operation. Again the algorithm was slower with the check. This was very strange.

We saved much time by precomputing sqrt(k) and storing it in an array, but computing the integer square root of a^2 - k*n (magnitude n^(2/3)) was surprisingly fast. We also tried different ways to calculate the square root, like the Quake approximation https://en.wikipedia.org/wiki/Fast_inverse_square_root followed by Newton iterations, but calling sqrt was still faster.
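The boolean-array trick described in the quote can be sketched in Java as follows (a minimal sketch, not the thread's actual code; class and method names are illustrative). Squares can only land on a few residues mod 2^k, so a single array lookup rejects most non-squares:

```java
public class PowerOfTwoFilter {
    static final int K = 6; // filter modulo 2^K = 64
    static final boolean[] isSquareResidue = new boolean[1 << K];

    static {
        // mark every residue that a square can attain mod 2^K
        for (int i = 0; i < (1 << K); i++) {
            isSquareResidue[(i * i) & ((1 << K) - 1)] = true;
        }
    }

    /** false means n cannot be a perfect square; true means "maybe". */
    static boolean maybeSquare(long n) {
        return isSquareResidue[(int) (n & ((1 << K) - 1))];
    }
}
```

Only 12 of the 64 residues mod 64 are attainable by squares, so about 81% of uniformly random inputs are rejected, consistent with the ~18% pass rate quoted later in the thread for the mod-128 test.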
For the square test, try using modulo M24. It eliminates nearly 98% of candidates.
Computing a number modulo a Mersenne number is fast:
http://homepage.divms.uiowa.edu/~jon...d.shtml#exmod3.
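A minimal Java sketch of this suggestion (class and method names are made up): fold the number into the range [0, 2^24) by summing base-2^24 digits, then test one bit in a precomputed table of square residues mod M24:

```java
import java.util.BitSet;

public class M24SquareFilter {
    static final int M24 = (1 << 24) - 1; // 16777215

    // one bit per residue class: 2^24 bits = 2 MB
    static final BitSet squareResidues = new BitSet(M24);
    static {
        for (long i = 0; i < M24; i++) {
            squareResidues.set((int) ((i * i) % M24));
        }
    }

    /** n mod (2^24 - 1) without division; n must be >= 0. */
    static int mod16777215(long n) {
        n = (n >>> 24) + (n & M24); // now n < 2^40 + 2^24
        n = (n >>> 24) + (n & M24); // now n < 2 * M24
        return (int) (n >= M24 ? n - M24 : n);
    }

    /** false means n is certainly not a perfect square. */
    static boolean maybeSquare(long n) {
        return squareResidues.get(mod16777215(n));
    }
}
```

The folding works because 2^24 ≡ 1 (mod 2^24 − 1), so summing base-2^24 digits preserves the residue; two folds suffice for any non-negative 64-bit input.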

The bits take only 2 MB of memory, possibly half of that using pre-filtering mod 4.

2019-01-18, 15:41   #13
Till

"Tilman Neumann"
Jan 2016
Germany

2²·109 Posts

Ben, how did you implement isSquare(test = a^2 - 4kN)? I am curious whether some tricks help to speed it up in C. Can we find your source code somewhere? And it would be worth giving the "divide by multiplication with the inverse" tdiv a try.

2019-01-18, 17:07   #14
danaj

"Dana Jacobsen"
Feb 2011
Bangkok, TH

2²·227 Posts

What I have found quite fast, portable, and not needing big tables in C is something like:

Code:
static int is_perfect_square(uint64_t n)
{
    /* Step 1, reduce to 18% of inputs */
    uint32_t m = n & 127;
    if ((m*0x8bc40d7d) & (m*0xa1e2f5d1) & 0x14020a) return 0;
    /* Step 2, reduce to 7% of inputs (mod 99 reduces to 4% but is slower) */
    m = n % 240;
    if ((m*0xfa445556) & (m*0x8021feb1) & 0x614aaa0f) return 0;
    /* m = n % 99; if ((m*0x5411171d) & (m*0xe41dd1c7) & 0x80028a80) return 0; */
    /* Step 3, do the square root instead of any more rejections */
    m = isqrt(n);
    return (uint64_t)m*(uint64_t)m == n;
}
The first reduction is very cheap since it is a power of 2, and it gets rid of over 80% of the inputs. Having a fast is_perfect_square() is absolutely critical to the performance of Hart's OLF and SQUFOF, and important for Lehman as well. For each of those I found it a bit faster on x86_64 to use just the single mod-128 test. Which is better depends somewhat on the platform and library.

isqrt is a wrapper around libm's sqrt() that checks for overflow and ensures correct integer results regardless of rounding on some dodgy platform. I'd love to see something faster that is still portable and does not use large tables.

As an aside, similar filters can be made for testing for perfect cubes and higher powers.

2019-01-18, 22:43   #15
bsquared

"Ben"
Feb 2007

2·3²·191 Posts

Nothing I tried w.r.t. sqrt guarding helped improve speed.
Well, a single compare with Warren's issq1024[test & 1023] table was just as fast, but not faster. Everything else slowed it down. It may be that the speed of sqrt on modern processors has made such checks unnecessary.

On my Xeon E5-2697 processor, 100k 32-bit semiprimes are factored in 0.08 seconds. That's 800 nanoseconds per factorization, on average. Pretty cool!

2019-01-18, 22:57   #16
GP2

Sep 2003

2×5×7×37 Posts Quote:
 Originally Posted by danaj I'd love to see something faster that is still portable and not using large tables.
The GMP library has the function mpz_perfect_square_p

How does this compare in speed?   2019-01-18, 23:13   #17
ThiloHarich

Nov 2005

145₈ Posts Quote:
Originally Posted by jnml For the square test, try using modulo M24. It eliminates nearly 98% of candidates. Computing a number modulo a Mersenne number is fast: http://homepage.divms.uiowa.edu/~jon...d.shtml#exmod3. The bits take only 2 MB of memory, possibly half of that using pre-filtering mod 4.
I gave the mod 255 Mersenne argument in the link a try in Java.
Since 255 = 3 * 5 * 17, this should at least check whether the number is a square mod 5*17, since in the Lehman_Fast code we use the multiplier 2*3.

I wrote a test, and yes, the following code

Code:
private static long mod255Jones(long a) {
    /* extra folds added so the routine is correct for all non-negative
       64-bit inputs; Jones's original two-fold version assumes a < 2^32 */
    a = (a >>> 32) + (a & 0xFFFFFFFFL); /* 2^32 == 1 (mod 255) */
    a = (a >>> 16) + (a & 0xFFFF);      /* sum base 2**16 digits */
    a = (a >>>  8) + (a & 0xFF);        /* sum base 2**8 digits */
    a = (a >>>  8) + (a & 0xFF);        /* one more fold leaves a < 2*255 */
    return a < 255 ? a : a - 255;       /* also maps 255 itself to 0 */
}
computes a mod 255 about twice as fast as a % 255.
So I precomputed the squares % 255 in the boolean array squaresMod255 and added the following check to the Lehman code (only in the lehmanEven method):
Code:
// added this check
if (squaresMod255[(int) mod255Jones(test)]) {
    final long b = (long) Math.sqrt(test);
    if (b * b == test) {
        return gcdEngine.gcd(a + b, N);
    }
}
This slows down the code by a factor of 2 (= 100%)! It is even slower than checking with % 255.

Code:
if (squaresMod255[(int) (test % 255)]) {
gives only a 65% slowdown. No clue why.

More interesting might be the Mersenne number 4095 = 2^12 - 1 = 3 * 3 * 5 * 7 * 13, since each prime factor roughly halves the possible square residues.
With a % 4095 there is still a 37% slowdown. Is there also a faster way to calculate mod 4095?
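The digit-summing trick from Jones's page extends to any modulus of the form 2^k − 1, so a division-free mod 4095 is possible. A minimal sketch (not from the thread; class and method names are illustrative), exploiting 2^12 ≡ 1 (mod 4095):

```java
public class Mod4095 {
    /** n mod 4095 by summing base-2^12 digits; n must be >= 0. */
    static long mod4095(long n) {
        while (n > 0xFFF) {                   // fold until at most 12 bits remain
            n = (n >>> 12) + (n & 0xFFF);     // 2^12 == 1 (mod 4095) preserves the residue
        }
        return n == 0xFFF ? 0 : n;            // 4095 itself is congruent to 0
    }
}
```

For fixed-width inputs the loop can be unrolled into a known number of folds, as in the mod 255 routine above.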
Yes, M24 = 16777215 = 3 * 3 * 5 * 7 * 13 * 17 * 241 reduces the number of square candidates by another factor of 4.

Surprisingly, mod 24 = 3*8 causes no slowdown. We set up the 'a' in the loop such that this check will always be true, but it is interesting that the Java Just-In-Time compiler seems to recognize this fact.

It seems it is better to use multipliers for k, to increase the chance that the numbers will be squares, than to check before testing.
I have a more complex algorithm https://github.com/ThiloHarich/facto...an_Fast33.java with an additional multiplier 2*3*5; this gives some more speed in Java. Other, bigger multipliers are also possible, but they reduce the chance of finding a solution.

Last fiddled with by ThiloHarich on 2019-01-18 at 23:35

2019-01-19, 00:13   #18
bsquared

"Ben"
Feb 2007

2·3²·191 Posts Quote:
 Originally Posted by GP2 The GMP library has the function mpz_perfect_square_p How does this compare in speed?
For these tiny numbers (64 or fewer bits), it's likely to be a lot slower. We'd have to translate into an mpz_t to do the test, which would add a lot of overhead.

2019-01-19, 14:58   #19
ThiloHarich

Nov 2005

101 Posts

Lehman suggested using multipliers for k. His argument was that if a/b is closest to the ratio of the divisors of n, then using k = a*b*k' searches in that area. But there is another reason. If k = d*k', then a^2 - kn = y^2 <-> a^2 - dk'n = y^2 -> a^2 = y^2 mod d. This means that a is chosen such that test = a^2 - kn is a square mod d. So using a multiplier 2*3*5*7, for example, ensures that a^2 - kn is always a square mod 2*3*5*7, which means we do not have to check this with an expensive mod operation.

For even k we know that all a must be odd. If k is odd, there are fewer valid 'a' mod powers of 2. This is why we should first investigate even k which are multiples of small primes.

A simple version of this idea can be found here: https://github.com/ThiloHarich/facto...manSimple.java The code is rather simple and short. The algorithm is based on the Lehman algorithm and Hart's variant of it, and somehow tries to combine both. First it tries to find a solution with k = 2*3*3*5*7*k' = 630 * k'. This ensures that a = ceil(sqrt(k*n)) | 1 produces numbers a^2 - kn which are squares mod 2, 9, 5 and 7, so there is a high chance that these numbers are squares and the algorithm finds a factor. If this part of the algorithm does not find a factor, we investigate numbers for an odd k; here we choose k = 1. In each step we look into numbers of both kinds. The numbers for k = 1 are investigated very seldom; they are like an insurance to find the factor if the fast part does not find one.

Sorry for brainstorming you with algorithm ideas. All have quite similar running times. Maybe somebody has the right idea to make more progress here. I also tried looking into multiple (different) multipliers in parallel, but saw no performance gain.
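The congruence argument for multipliers can be checked numerically: since d | k, we have a^2 - kN ≡ a^2 (mod d), and a^2 mod d is a square residue by definition. A minimal Java sketch (class, method names and constants are illustrative, not from the thread):

```java
import java.util.Random;

public class MultiplierCheck {
    /** Checks that a^2 - k*N is always a square residue mod d whenever d divides k. */
    static boolean holds(int d, int trials) {
        boolean[] sq = new boolean[d];
        for (int i = 0; i < d; i++) sq[(i * i) % d] = true; // square residues mod d
        Random rnd = new Random(42);
        for (int t = 0; t < trials; t++) {
            long a = 1 + rnd.nextInt(1 << 20);
            long N = 1 + rnd.nextInt(1 << 20);
            long k = (long) d * (1 + rnd.nextInt(1000));    // k is a multiple of d
            long test = a * a - k * N;
            int r = (int) (((test % d) + d) % d);           // reduce into [0, d); test may be negative
            if (!sq[r]) return false;                        // would contradict the claim
        }
        return true;
    }
}
```

With d = 210 = 2*3*5*7 (or 630, as in the simple version linked above) no counterexample can occur, which is exactly why the mod-d square check becomes redundant.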
2019-01-19, 15:24   #20
GP2

Sep 2003

101000011110₂ Posts Quote:
 Originally Posted by bsquared For these tiny numbers (64 or fewer bits), its likely to be a lot slower. We'd have to translate into an mpz_t to do the test which will add a lot of overhead.
Hmm.

Well, maybe examining the source code might still be fruitful, I don't know.

2019-01-19, 17:00   #21
Till

"Tilman Neumann"
Jan 2016
Germany

2²·109 Posts

I did a little benchmark in Java. The following snippet (I added the addition to sum, and logged it, to keep the compiler from optimizing the loop away)

Code:
long t0 = System.currentTimeMillis();
long sum = 0;
for (int i = 0; i < 10000000; i++) {
    sum += (int) Math.sqrt(NArray[i]);
}
long t1 = System.currentTimeMillis();
LOG.info("Unguarded sqrt took " + (t1-t0) + "ms, sum=" + sum);
takes about 65 to 80 ms for NArray having 10 million elements on my 8-year-old notebook clocked at 2 to 2.8 GHz, for both 33- and 63-bit numbers. So it computes more than 125 million sqrts per second. Put the other way round: 1 sqrt, 1 add, 1 increment, 1 comparison and 1 array access take no more than 16 to 22 clock cycles. This looks like Math.sqrt() being a pretty fast hardware operation.

I also tested Dana's proposition to guard the square root, but it slows the Lehman down a little bit. The above would explain why.

Last fiddled with by Till on 2019-01-19 at 17:12 Reason: make explicit it is Math.sqrt() which seems to be implemented in hardware

2019-01-19, 17:21   #22
Till

"Tilman Neumann"
Jan 2016
Germany

2²×109 Posts

Btw., it might be possible to further speed up the C implementation using AVX/SSE. Java is doing that under the hood when it finds suitable loops.
