20190118, 15:41  #12  
Feb 2012
Prague, Czech Republ
10101011_{2} Posts 
Quote:
Computing a number modulo Mersenne number is fast: http://homepage.divms.uiowa.edu/~jon...d.shtml#exmod3. The bits take only 2MB of memrory, possibly half of that using prefiltering mod 4. 

20190118, 15:41  #13 
"Tilman Neumann"
Jan 2016
Germany
110110010_{2} Posts 
Ben, how did you implement isSquare(test = a^2  4kN)? I am curious if some tricks help to speed it up in C. Can we find your source code somewhere?
And, it will be worth to give the "divide by multiplication of inverse" tdiv a try. 
20190118, 17:07  #14 
"Dana Jacobsen"
Feb 2011
Bangkok, TH
2^{2}×227 Posts 
What I have found quite fast, portable, and not needing big tables in C is something like:
Code:
static int is_perfect_square(uint64_t n) { /* Step 1, reduce to 18% of inputs */ uint32_t m = n & 127; if ((m*0x8bc40d7d) & (m*0xa1e2f5d1) & 0x14020a) return 0; /* Step 2, reduce to 7% of inputs (mod 99 reduces to 4% but slower) */ m = n %240; if ((m*0xfa445556) & (m*0x8021feb1) & 0x614aaa0f) return 0; /* m = n % 99; if ((m*0x5411171d) & (m*0xe41dd1c7) & 0x80028a80) return 0; */ /* Step 3, do the square root instead of any more rejections */ m = isqrt(n); return (uint64_t)m*(uint64_t)m == n; } Having a fast is_perfect_square() is completely critical to the performance of Hart's OLF and SQUFOF, and important for the Smith Lehman also. For each of those I found is a bit faster on x86_64 by using just the single mod 128 test. Which is better depends somewhat on the platform and library. isqrt is a wrapper function around libm's sqrt() but checking for overflow and ensuring correct integer results regardless of rounding from some dodgy platform. I'd love to see something faster that is still portable and not using large tables. As an aside, similar filters can be made for testing for perfect cubes and higher powers. 
20190118, 22:43  #15 
"Ben"
Feb 2007
2^{2}·23·37 Posts 
Nothing I tried w.r.t. sqrt guarding helped improve speed. Well, a single compare with Warren's issq1024[test & 1023] table was just as fast, but not faster. Everything else slowed it down. It may be that the speed of sqrt on modern processors has made such checks unnecessary.
On my Xeon E52697 processor, 100k 32bit semiprimes are factored in 0.08 seconds. That's 800 nanoseconds per factorization, on average. Pretty cool! 
20190118, 22:57  #16 
Sep 2003
A1B_{16} Posts 

20190118, 23:13  #17  
Nov 2005
101 Posts 
Quote:
since 255 = 3 * 5 * 17 this should at least check if the number is a square mod 5*17, since in the Lehman_Fast code we use a multiplier 2*3. I wrote a test and yes the following code Code:
private static long mod255Jones(long a) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */ a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */ if (a < 255) return a; if (a < (2 * 255)) return a  255; return a  (2 * 255); } So I precompute the squares % 255 in the boolean array squaresMod255, added the following check to the Lehman code (only in the lehmanEven method): Code:
// added this check if (squaresMod255[(int) mod255Jones(test)]) { final long b = (long) Math.sqrt(test); if (b*b == test) { return gcdEngine.gcd(a+b, N); } } Code:
if (squaresMod255[(int) test % 255]) { More interesting might be the Mersenne number 4095 = 2^121 = 3 * 3 * 5 * 7 * 13 since each prime divides the possible squares by 2. With a % 4095 still 37% slowdown. Is there als a faster way to calculate mod 4095? Yes M24 = 16777215 = 3* 3* 5* 7*13* 17* 241 reduces the number of squares by an other factor of 4. Surprisingly mod 24 = 3*8 has no slowdown. We set up the 'a' in the loop such that this check will always be true. But interesting that the Java Just Intime Compiler seems to know of this fact. It seem like it is better to use multipliers for k to increase the chance that the numbers will be a square then checking before testing. I have a more complex algorithm https://github.com/ThiloHarich/facto...an_Fast33.java with an additional multiplier 2*3*5 this gives some more speed in java. Other bigger multipliers are also possible, but they reduce the chance of finding a solution. Last fiddled with by ThiloHarich on 20190118 at 23:35 

20190119, 00:13  #18 
"Ben"
Feb 2007
D4C_{16} Posts 

20190119, 14:58  #19 
Nov 2005
101 Posts 
Lehman suggest to use multipliers for k.
His argument was that if a/b is closest to the ratio of the divisors of n then using k = a*b*k' is looking in the area. But there is an other reason. If k = d*k' then a^2  kn = y^2 <> a^2  dk'n = y^2 <> a^2 = y^2 mod d. This means that the a is chosen such that test = a^2  kn is a square mod d. So using a multiplier 2*3*5*7 for example ensures that a^2  kn is always a square mod 2*3*5*7. This means we do not have to check this with a expensive mod operation. For k even we know that all a must be odd. If k is odd there are less 'a' mod 2^k. This is why we should first investigate in even k with are multiples of some small primes. A simple version of this idea can be found here: https://github.com/ThiloHarich/facto...manSimple.java The code is rather simple and short.. This Algorithm is based on the lehman algorithm and the Hart variant of it. It somehow tries to combine both algorithms. First it tries to find solution with a k = 2*3*3*5*7*k' = 630 * k'. This ensures that the created a = ceil (sqrt (k*n))  1 produces numbers a^2  kn which are squares mod 2,9,5 and 7. This is there is a high chance that this numbers are squares and the algorithm finds a factor. If this part of the algorithm does not find a factor we investigate in numbers for an odd k. Here we choose k=1. in each step we look into numbers of both kinds. The numbers for k=1 were investigated very seldom. They are like a insurance to find the factor, if the fast part is nit finding numbers. Sorry for brainstorming you with ideas of algorithms. All have quite similar running times. Maybe somebody has the right idea to make more progress here. I also tried to look into multiple (different) multipliers in parallel, but no performance gain. 
20190119, 15:24  #20 
Sep 2003
101000011011_{2} Posts 

20190119, 17:00  #21 
"Tilman Neumann"
Jan 2016
Germany
1B2_{16} Posts 
I did a little benchmark in Java. The following snippet (I added the addition to sum and logging it to prevent the compiler optimizing it to nothing)
Code:
long t0 = System.currentTimeMillis(); long sum = 0; for (int i=0; i<10000000; i++) { sum += (int) Math.sqrt(NArray[i]); } long t1 = System.currentTimeMillis(); LOG.info("Unguarded sqrt took " + (t1t0) + "ms, sum=" + sum); So it computes more than 125 Mio sqrts per second. The other way round: 1 sqrt, 1 add, 1 increment, 1 comparison and 1 array access do not take more than 16 to 22 clock cycles. This looks like Math.sqrt() being a pretty fast hardware operation. I also tested Danas proposition to guard the square root but it slows the Lehman down a little bit. The above would explain why. Last fiddled with by Till on 20190119 at 17:12 Reason: make explicit it is Math.sqrt() which seems to be implemented in hardware 
20190119, 17:21  #22 
"Tilman Neumann"
Jan 2016
Germany
434_{10} Posts 
Btw. it might be possible to further speed up the C implementation using AVX/SSE. Java is doing that under the hood when it finds suitable loops.

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Do normal adults give themselves an allowance? (...to fast or not to fast  there is no question!)  jasong  jasong  35  20161211 00:57 
SQUFOF implementation  alpertron  Factoring  15  20100412 19:16 
ECM/FPGA Implementation  rdotson  Hardware  12  20060326 22:58 
need C implementation of divb(r,k)  Leith  Miscellaneous Math  4  20050118 23:14 
RSA implementation  flava  Programming  12  20041026 03:51 