mersenneforum.org  

Old 2019-01-18, 15:41   #12
jnml
 
Feb 2012
Prague, Czech Republic


Quote:
Originally Posted by ThiloHarich View Post
@bsquared We tried several approaches to avoid always calculating the square root.

We had several checks for certain residues in the code, but they slowed down the whole algorithm.

We only create numbers 'a' for which a^2 - kn = y^2 can hold mod 8. In Lehman's original work there was only a mod 4 argument.
Checking whether a number can be a square mod a power of 2 can be done very efficiently: store in a boolean array whether the residue i is a square, then index the array with number & (2^k - 1).
Unfortunately, checking squares mod higher powers of 2 does not filter out many more numbers. Without the checks the algorithm was faster.

Applying mod arguments for other moduli, such as products of small primes like 3*3*5*7, filters out many more numbers, but slows down the algorithm as well, since we have to apply an expensive % operation. Again the algorithm was slower with the check.

This was very strange. We saved a lot of time by precomputing sqrt(k) and storing it in an array, yet computing the integer square root of a^2 - k*n (magnitude n^(2/3)) was surprisingly fast.

We also tried different ways to calculate the square root, like the Quake approximation https://en.wikipedia.org/wiki/Fast_inverse_square_root followed by Newton steps, but calling sqrt was still faster.
For the square test, try using the modulus M24 = 2^24 - 1. It eliminates nearly 98% of candidates.
Computing a number modulo a Mersenne number is fast:
http://homepage.divms.uiowa.edu/~jon...d.shtml#exmod3.

The bit table takes only 2 MB of memory, possibly half of that with pre-filtering mod 4.
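A sketch of this idea in Java (hypothetical class and method names, assuming nonnegative 63-bit inputs): the square-residue bitset holds 2^24 bits = 2 MB, and the reduction uses the digit-sum trick from the linked page.

```java
public final class SquareFilterM24 {
    static final long M24 = (1L << 24) - 1; // 16777215 = 3*3*5*7*13*17*241

    // Bitset with one bit per residue mod M24: 2^24 bits = 2 MB.
    static final long[] SQUARE_BITS = new long[1 << 18];
    static {
        // (M24 - x)^2 ≡ x^2 (mod M24), so looping to M24/2 covers all square residues.
        for (long x = 0; x <= M24 / 2; x++) {
            long r = (x * x) % M24;
            SQUARE_BITS[(int) (r >>> 6)] |= 1L << (r & 63);
        }
    }

    /** n mod (2^24 - 1) via digit sums, no division (n >= 0). */
    static long mod(long n) {
        n = (n >>> 48) + (n & 0xFFFFFFFFFFFFL); // 2^48 ≡ 1 (mod M24)
        n = (n >>> 24) + (n & 0xFFFFFF);        // 2^24 ≡ 1 (mod M24)
        while (n >= M24) n -= M24;              // at most two subtractions remain
        return n;
    }

    /** False: n is certainly not a square. True: run the full sqrt check. */
    static boolean maybeSquare(long n) {
        long r = mod(n);
        return (SQUARE_BITS[(int) (r >>> 6)] & (1L << (r & 63))) != 0;
    }
}
```

By the CRT, the fraction of square residues mod M24 is the product over its prime-power factors, roughly 2%, which matches the ~98% rejection rate quoted above.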
Old 2019-01-18, 15:41   #13
Till
 
"Tilman Neumann"
Jan 2016
Germany


Ben, how did you implement isSquare(test = a^2 - 4kN)? I am curious whether some tricks help to speed it up in C. Can we find your source code somewhere?

Also, it would be worth giving the "divide by multiplying with the inverse" tdiv a try.
Old 2019-01-18, 17:07   #14
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH


What I have found quite fast, portable, and not needing big tables in C is something like:

Code:
static int is_perfect_square(uint64_t n)
{
  /* Step 1, reduce to 18% of inputs */
  uint32_t m = n & 127;
  if ((m*0x8bc40d7d) & (m*0xa1e2f5d1) & 0x14020a)  return 0;
  /* Step 2, reduce to 7% of inputs (mod 99 reduces to 4% but slower) */
  m = n % 240; if ((m*0xfa445556) & (m*0x8021feb1) & 0x614aaa0f) return 0;
  /* m = n % 99; if ((m*0x5411171d) & (m*0xe41dd1c7) & 0x80028a80) return 0; */
  /* Step 3, do the square root instead of any more rejections */
  m = isqrt(n);
  return (uint64_t)m*(uint64_t)m == n;
}
The first reduction is very cheap since it is a power of 2 and gets rid of over 80% of the inputs.

Having a fast is_perfect_square() is completely critical to the performance of Hart's OLF and SQUFOF, and important for the Smith Lehman as well. For each of those I found it is a bit faster on x86_64 to use just the single mod-128 test. Which is better depends somewhat on the platform and library.

isqrt() is a wrapper around libm's sqrt() that checks for overflow and ensures correct integer results regardless of rounding on some dodgy platform.
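For comparison, a Java analogue of such a guarded integer square root might look like this (a sketch, not danaj's actual wrapper; the loops nudge the rounded double result to the exact floor):

```java
public final class Isqrt {
    /** Floor of sqrt(n) for n >= 0, correct even where Math.sqrt rounds the wrong way. */
    static long isqrt(long n) {
        if (n < 0) throw new IllegalArgumentException("negative input");
        long r = (long) Math.sqrt((double) n);
        // Compare via division so r*r can never overflow near 2^63.
        while (r > 0 && r > n / r) r--;   // estimate was one too high
        while (r + 1 <= n / (r + 1)) r++; // estimate was one too low
        return r;
    }
}
```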

I'd love to see something faster that is still portable and not using large tables.

As an aside, similar filters can be made for testing for perfect cubes and higher powers.
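As a hypothetical illustration of such a higher-power filter (not danaj's code): cubes take only 9 of the 63 residues mod 63 = 7*9, so a tiny table rejects roughly 86% of non-cubes before any cube-root computation.

```java
public final class CubeFilter {
    // x^3 mod 63 hits only 9 of 63 residues ({0,1,6} mod 7 crossed with {0,1,8} mod 9).
    static final boolean[] CUBE_MOD_63 = new boolean[63];
    static {
        for (int x = 0; x < 63; x++) {
            CUBE_MOD_63[(x * x * x) % 63] = true;
        }
    }

    /** False means n is certainly not a perfect cube (for n >= 0). */
    static boolean maybeCube(long n) {
        return CUBE_MOD_63[(int) (n % 63)];
    }
}
```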
Old 2019-01-18, 22:43   #15
bsquared
 
"Ben"
Feb 2007


Nothing I tried w.r.t. sqrt guarding helped improve speed. Well, a single compare with Warren's issq1024[test & 1023] table was just as fast, but not faster. Everything else slowed it down. It may be that the speed of sqrt on modern processors has made such checks unnecessary.
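For reference, a table in the spirit of Warren's issq1024 can be generated at startup rather than hard-coded; a sketch (assuming the table is indexed by test & 1023, as above):

```java
public final class SquareTable1024 {
    // Mark every residue that x^2 can take mod 1024.
    static final boolean[] ISSQ = new boolean[1024];
    static {
        for (int x = 0; x < 1024; x++) {
            ISSQ[(x * x) & 1023] = true;
        }
    }

    /** False means n cannot be a perfect square; true means run the sqrt check. */
    static boolean maybeSquare(long n) {
        return ISSQ[(int) (n & 1023)];
    }
}
```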

On my Xeon E5-2697 processor, 100k 32-bit semiprimes are factored in 0.08 seconds. That's 800 nanoseconds per factorization, on average. Pretty cool!
Old 2019-01-18, 22:57   #16
GP2
 
Sep 2003


Quote:
Originally Posted by danaj View Post
I'd love to see something faster that is still portable and not using large tables.
The GMP library has the function mpz_perfect_square_p

How does this compare in speed?
Old 2019-01-18, 23:13   #17
ThiloHarich
 
Nov 2005


Quote:
Originally Posted by jnml View Post
For the square test, try using the modulus M24 = 2^24 - 1. It eliminates nearly 98% of candidates.
Computing a number modulo a Mersenne number is fast:
http://homepage.divms.uiowa.edu/~jon...d.shtml#exmod3.

The bit table takes only 2 MB of memory, possibly half of that with pre-filtering mod 4.
I gave the mod 255 Mersenne trick from the link a try in Java.
Since 255 = 3 * 5 * 17, this should at least check whether the number is a square mod 5*17, given that the Lehman_Fast code uses a multiplier of 2*3.

I wrote a test, and indeed the following code

Code:
	private static long mod255Jones(long a) {
		a = (a >> 32) + (a & 0xFFFFFFFFL); /* sum base 2**32 digits; needed for 64-bit a, since 2^32 ≡ 1 (mod 255) */
		a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */
		a = (a >>  8) + (a & 0xFF);   /* sum base 2**8 digits */
		a = (a >>  8) + (a & 0xFF);   /* fold once more: the previous sum can still exceed 2*255 */
		if (a < 255) return a;
		if (a < (2 * 255)) return a - 255;
		return a - (2 * 255);
	}
computes a mod 255 about twice as fast as a % 255.
So I precomputed the squares % 255 in the boolean array squaresMod255 and added the following check to the Lehman code (only in the lehmanEven method):
Code:
			// added this check
			if (squaresMod255[(int) mod255Jones(test)]) {
				final long b = (long) Math.sqrt(test);
				if (b*b == test) {
					return gcdEngine.gcd(a+b, N);
				}
			}
This slows down the code by a factor of 2 (= 100%)! It is even slower than checking with % 255.

Code:
if (squaresMod255[(int) (test % 255)]) {
gives only a 65% slowdown. No clue why.

More interesting might be the Mersenne number 4095 = 2^12 - 1 = 3 * 3 * 5 * 7 * 13, since each prime factor further thins out the possible square residues.
With a % 4095 there is still a 37% slowdown. Is there also a faster way to calculate mod 4095?
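On the mod 4095 question: since 4095 = 2^12 - 1, the same digit-sum folding applies; a sketch for nonnegative longs (not benchmarked here):

```java
public final class Mod4095 {
    /** a mod 4095 via base-2^12 digit sums, no division (a >= 0). */
    static long mod4095(long a) {
        a = (a >>> 36) + (a & 0xFFFFFFFFFL); // 2^36 ≡ 1 (mod 4095)
        a = (a >>> 24) + (a & 0xFFFFFF);     // 2^24 ≡ 1 (mod 4095)
        a = (a >>> 12) + (a & 0xFFF);        // sum base 2^12 digits
        while (a >= 4095) a -= 4095;         // at most two subtractions remain
        return a;
    }
}
```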
Going further, M24 = 16777215 = 3*3*5*7*13*17*241 reduces the number of square candidates by another factor of 4.

Surprisingly, checking mod 24 = 3*8 causes no slowdown. We set up the 'a' in the loop such that this check is always true, but it is interesting that the Java Just-In-Time compiler seems to figure this out.

It seems better to use multipliers for k to increase the chance that the numbers are squares than to filter before testing.
I have a more complex algorithm https://github.com/ThiloHarich/facto...an_Fast33.java with an additional multiplier 2*3*5; this gives some more speed in Java. Other, bigger multipliers are also possible, but they reduce the chance of finding a solution.

Last fiddled with by ThiloHarich on 2019-01-18 at 23:35
Old 2019-01-19, 00:13   #18
bsquared
 
"Ben"
Feb 2007


Quote:
Originally Posted by GP2 View Post
The GMP library has the function mpz_perfect_square_p

How does this compare in speed?
For these tiny numbers (64 or fewer bits), it's likely to be a lot slower. We'd have to translate into an mpz_t to do the test, which would add a lot of overhead.
Old 2019-01-19, 14:58   #19
ThiloHarich
 
Nov 2005


Lehman suggests using multipliers for k.
His argument was that if a/b is closest to the ratio of the divisors of n, then using k = a*b*k' searches in that area.

But there is another reason. If k = d*k', then
a^2 - kn = y^2 <=>
a^2 - dk'n = y^2 ==>
a^2 ≡ y^2 (mod d).
This means 'a' is chosen such that test = a^2 - kn is a square mod d.
So using a multiplier 2*3*5*7, for example, ensures that a^2 - kn is always a square mod 2*3*5*7.
This means we do not have to check this with an expensive mod operation.

For k even we know that all 'a' must be odd.
If k is odd there are fewer admissible 'a' modulo powers of 2. This is why we should first investigate even k that are multiples of some small primes.
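The congruence above can be checked numerically; this hypothetical snippet verifies that whenever d divides k, the residue of a^2 - k*N mod d lies in the set of squares mod d:

```java
import java.util.Random;

public final class MultiplierResidueDemo {
    static final int D = 630; // 2 * 3^2 * 5 * 7

    /** True iff a^2 - k*N is a square residue mod D (intended for D | k). */
    static boolean residueIsSquare(long a, long k, long N) {
        boolean[] squareModD = new boolean[D];
        for (int x = 0; x < D; x++) squareModD[x * x % D] = true;
        long test = a * a - k * N; // may be negative for small a
        return squareModD[(int) Math.floorMod(test, (long) D)];
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 100000; i++) {
            long N = 1 + rnd.nextInt(1 << 20);
            long k = (long) D * (1 + rnd.nextInt(1000)); // k is a multiple of D
            long a = 1 + rnd.nextInt(1 << 20);
            if (!residueIsSquare(a, k, N)) throw new AssertionError("counterexample at i=" + i);
        }
        System.out.println("all residues of a^2 - k*N were squares mod " + D);
    }
}
```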

A simple version of this idea can be found here: https://github.com/ThiloHarich/facto...manSimple.java

The code is rather simple and short.
This algorithm is based on the Lehman algorithm and the Hart variant of it, and tries to combine both. First it looks for a solution with k = 2*3*3*5*7*k' = 630*k'. This ensures that the generated a = ceil(sqrt(k*n)) | 1 produces numbers a^2 - kn that are squares mod 2, 9, 5 and 7, so there is a high chance that these numbers are squares and the algorithm finds a factor. If this part of the algorithm does not find a factor, we investigate numbers for an odd k; here we choose k=1. In each step we look into numbers of both kinds.
The numbers for k=1 are investigated very seldom. They are like an insurance to find the factor if the fast part does not.
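A stripped-down sketch of the Hart-style half of this idea (hypothetical code, not the linked class; no multiplier and no parity tricks, and only safe for odd composite N up to roughly 31 bits so that k*N stays well inside long and double range):

```java
public final class HartSkeleton {
    /** Hart-style one-line factoring core: try k = 1, 2, ... until a^2 - k*N is a square. */
    static long findFactor(long N) {
        for (long k = 1; ; k++) {
            long a = (long) Math.ceil(Math.sqrt((double) (k * N)));
            long test = a * a - k * N;
            long b = (long) Math.sqrt((double) test);
            if (b * b == test) {
                long g = gcd(a - b, N);
                if (g > 1 && g < N) return g; // nontrivial factor found
            }
        }
    }

    static long gcd(long x, long y) {
        while (y != 0) { long t = x % y; x = y; y = t; }
        return Math.abs(x);
    }
}
```

The real Lehman_Fast/LehmanSimple classes additionally force 'a' odd for even k, use the 630 multiplier, and bound k; this skeleton only shows the test-and-gcd core.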

Sorry for bombarding you with algorithm ideas. All have quite similar running times.
Maybe somebody has the right idea to make more progress here.
I also tried looking into multiple (different) multipliers in parallel, but saw no performance gain.
Old 2019-01-19, 15:24   #20
GP2
 
Sep 2003


Quote:
Originally Posted by bsquared View Post
For these tiny numbers (64 or fewer bits), it's likely to be a lot slower. We'd have to translate into an mpz_t to do the test, which would add a lot of overhead.
Hmm.

Well, maybe examining the source code might still be fruitful, I don't know.
Old 2019-01-19, 17:00   #21
Till
 
"Tilman Neumann"
Jan 2016
Germany


I did a little benchmark in Java. The following snippet (I add each result to sum and log it, to prevent the compiler from optimizing the loop away)
Code:
        long t0 = System.currentTimeMillis();
        long sum = 0;
        for (int i=0; i<10000000; i++) {
            sum += (int) Math.sqrt(NArray[i]);
        }
        long t1 = System.currentTimeMillis();
        LOG.info("Unguarded sqrt took " + (t1-t0) + "ms, sum=" + sum);
takes about 65 to 80 ms for NArray having 10 million elements on my 8-year-old notebook clocked at 2 to 2.8 GHz, both for 33- and 63-bit numbers.
So it computes more than 125 million sqrts per second. Put the other way round: 1 sqrt, 1 add, 1 increment, 1 comparison and 1 array access take no more than 16 to 22 clock cycles in total. This suggests Math.sqrt() maps to a pretty fast hardware operation.

I also tested Dana's suggestion to guard the square root, but it slows the Lehman down a little bit. The above would explain why.

Last fiddled with by Till on 2019-01-19 at 17:12 Reason: make explicit it is Math.sqrt() which seems to be implemented in hardware
Old 2019-01-19, 17:21   #22
Till
 
"Tilman Neumann"
Jan 2016
Germany


Btw. it might be possible to further speed up the C implementation using AVX/SSE. Java is doing that under the hood when it finds suitable loops.