mersenneforum.org > Factoring Projects > Factoring

2017-10-12, 22:32   #45
arbooker ("Andrew Booker")

Quote:
Originally Posted by bsquared
This is a new one for me... do you have a reference you could point me to? I was wondering what kind of iteration was going on in your code. I had assumed there was an addmod buried in there somewhere that I wasn't smart enough to find.
With a few exceptions, Pollard rho works with any quadratic poly. Mostly people use x^2+a for some integer a. I'm using x(x+1), which when iterated starting with x=1 produces the Sylvester sequence minus 1. Completing the square, it's equivalent to the usual poly with a=1/4.

Anyway, the point is that the 63-bit mulredc works as long as the product does not exceed 127 bits. That means you can afford one unmodded addition before multiplying (so mulredc(a,b+c) is OK, but mulredc(a+b,c+d) is not).
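
For reference, here is a minimal C sketch of the 63-bit case being described: Pollard rho with f(x)=x(x+1), done in Montgomery form so that each step is a single mulredc. This is an illustrative reconstruction, not the actual factor63/yafu code; montgomery_npi, mulredc and rho_xx1 are placeholder names, and it uses a portable __int128 reduction rather than the asm routines discussed here.

Code:
#include <stdint.h>

/* -n^{-1} mod 2^64 for odd n, by Newton iteration; each step doubles the
 * number of correct low bits. */
static uint64_t montgomery_npi(uint64_t n) {
    uint64_t x = n;                 /* n*n = 1 (mod 8): 3 correct bits */
    for (int i = 0; i < 5; i++)
        x *= 2 - n * x;             /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits */
    return 0 - x;                   /* x = n^{-1} mod 2^64; negate it */
}

/* Montgomery multiplication a*b / 2^64 (mod n) for odd n < 2^63, a < n and
 * b < 2^64.  The 128-bit product then stays below 2^127, which is the
 * constraint mentioned above; a full 64-bit modulus needs more care. */
static uint64_t mulredc(uint64_t a, uint64_t b, uint64_t n, uint64_t npi) {
    unsigned __int128 t = (unsigned __int128)a * b;
    uint64_t m = (uint64_t)t * npi;
    t += (unsigned __int128)m * n;  /* low 64 bits are now zero */
    uint64_t r = (uint64_t)(t >> 64);
    return r >= n ? r - n : r;
}

static uint64_t gcd64(uint64_t a, uint64_t b) {
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* One rho attempt on an odd composite n < 2^63, iterating x -> x*(x+1) from
 * x = 1 (the Sylvester sequence minus 1).  Everything stays in Montgomery
 * form; the gcds are unaffected since gcd(2^64, n) = 1.  Returns a factor
 * of n, or n itself on failure. */
static uint64_t rho_xx1(uint64_t n) {
    uint64_t npi = montgomery_npi(n);
    uint64_t one = (0 - n) % n;     /* 2^64 mod n = Montgomery form of 1 */
    uint64_t x = one, y = one, d;
    do {
        /* x+one and y+one are plain additions: for n < 2^63 they cannot
         * reach 2^64, which is the saving described above.  At 64 bits an
         * addmod (or the tricks discussed later in the thread) is needed. */
        x = mulredc(x, x + one, n, npi);
        y = mulredc(y, y + one, n, npi);
        y = mulredc(y, y + one, n, npi);
        d = gcd64(x > y ? x - y : y - x, n);
    } while (d == 1);
    return d;
}

A production version would use Brent's cycle detection and batch the differences into a running product so that gcds are taken only occasionally, as fast implementations typically do.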
2017-10-12, 23:34   #46
danaj ("Dana Jacobsen")

Quote:
Originally Posted by arbooker
I'm experimenting with a 64-bit version using Ben's asm mulredc. One thing that surprised me is how much time is saved in factor63 by using the Sylvester sequence, which allowed me to replace one addmod by a simple addition (the line y = mulredc(y,y+one)). For 64 bits we need to use addmod (or submod, which is slightly quicker), and that single change adds more time than the savings from using asm mulredc. (On the bright side, this shows how close to the bone we're cutting in the code.) I noticed that Ben's routine handles 63-bit and 64-bit n separately, and I might just move the if check farther out so the penalty is only incurred for 64-bit numbers.
I briefly played with the asm mulmod in your code and got better results using the inlined C function (actually better than the previous #define). This may be a compiler thing, as I did not get that result when I plugged your C code into my or Ben's spbrent. Perhaps you'll have better luck.

Technically Ben wrote the 63-bit asm, then after I fruitlessly fiddled a bit and mentioned I'd like a clean 64-bit version, he wrote the 64-bit version. I put the switch inside the function, and noted that ideally for performance the caller would choose the correct one. Probably due to good branch prediction there didn't seem to be much penalty for doing it the easy way (switch inside the function).

One could of course just use the 64-bit version everywhere, but that gives a measurable performance hit, and doing the switch gets me almost the best of both.

Re the sequence x*x+a vs. x*(x+a): the latter gives me about a 5% speedup in my code and works great up to 63 bits. It does not work correctly for 64-bit inputs (where x*x+a does work, since this is with a 64-bit mulredc).

2017-10-13, 02:05   #47
bsquared ("Ben")

Quote:
Originally Posted by arbooker
With a few exceptions, Pollard rho works with any quadratic poly. Mostly people use x^2+a for some integer a. I'm using x(x+1), which when iterated starting with x=1 produces the Sylvester sequence minus 1. Completing the square, it's equivalent to the usual poly with a=1/4.

Anyway, the point is that the 63-bit mulredc works as long as the product does not exceed 127 bits. That means you can afford one unmodded addition before multiplying (so mulredc(a,b+c) is OK, but mulredc(a+b,c+d) is not).
Sweet! That's about a 10-15% speedup for me. Thanks! As long as I use the 64-bit mulredc it works there as well.
2017-10-13, 08:52   #48
arbooker ("Andrew Booker")

Quote:
Originally Posted by danaj
I briefly played with the asm mulmod in your code and got better results using the inlined C function (actually better than the previous #define).
That's weird. There's no pleasing a temperamental compiler, but it does illustrate my point that they're hard to beat when you can express your intentions clearly in C. For another illustration, I hatched a brilliant plan to save one instruction in submod by using sbb+and, only to find that it's slower. I guess that's because it's more serialized--the chip can compute the two possible outcomes in parallel and choose between them with cmovc, which is faster despite being more instructions.
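
For concreteness, here is roughly what the two approaches look like in C (a sketch with made-up names, not arbooker's actual submod):

Code:
#include <stdint.h>

/* (a-b) mod n for a,b < n.  Written this way, compilers typically compute
 * both candidates (r and r+n) and select one with cmov, as described above. */
static inline uint64_t submod_cmov(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t r = a - b;
    return (a < b) ? r + n : r;
}

/* The sbb+and idea: turn the borrow into an all-ones mask and add n & mask.
 * Possibly one instruction shorter, but sub -> mask -> and -> add is a single
 * dependency chain, which would explain why it measured slower. */
static inline uint64_t submod_mask(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t r = a - b;
    uint64_t mask = 0 - (uint64_t)(a < b);   /* 0 or all ones */
    return r + (n & mask);
}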

Quote:
Probably due to good branch prediction there didn't seem to be much penalty for doing it the easy way (switch inside the function).
Yeah, n isn't changing, so that makes sense. In my case it doesn't add much code to move the if check outside, and then I can replace y+one by addmod(y,one,n) only when needed.
2017-10-13, 09:00   #49
arbooker ("Andrew Booker")

Quote:
Originally Posted by bsquared
Sweet! That's about a 10-15% speedup for me. Thanks! As long as I use the 64-bit mulredc it works there as well.
With 64-bit inputs, all additions have to be addmods, since the computation of a+b can overflow. There is one exception to that: when x is at most n-1 you can compute mulredc(x,x+1) without risk of overflow. The problem is that 1 is not Montgomery's 1, but if you don't care about using the same polynomial every time (which is fine if we're thinking of it as a randomized algorithm and are happy to allow occasional failures) then you could get this speedup for 64-bit inputs as well.
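
As a sketch of that exception (an illustration with made-up names, not Ben's asm routine): since y < n, the sum y+1 cannot wrap, so the step needs no addmod. The portable mulredc below differs from the one in the earlier sketch because, for n >= 2^63, the straightforward __int128 formulation can overflow; here the high and low halves are handled separately. npi is -n^{-1} mod 2^64, as computed in the earlier sketch.

Code:
#include <stdint.h>

/* Montgomery multiplication a*b / 2^64 (mod n) for any odd 64-bit n, with
 * a < n and b <= 2^64 - 1.  The high/low split (rather than one __int128
 * addition) keeps the intermediate from overflowing when n >= 2^63. */
static uint64_t mulredc64(uint64_t a, uint64_t b, uint64_t n, uint64_t npi) {
    unsigned __int128 ab = (unsigned __int128)a * b;
    uint64_t tlo = (uint64_t)ab, thi = (uint64_t)(ab >> 64);
    uint64_t m = tlo * npi;
    uint64_t mnhi = (uint64_t)(((unsigned __int128)m * n) >> 64);
    /* The low halves cancel (tlo + low(m*n) == 0 mod 2^64), producing a
     * carry exactly when tlo != 0. */
    uint64_t r = thi + mnhi + (tlo != 0);
    if (r < thi || r >= n) r -= n;  /* r < thi means the sum wrapped past 2^64 */
    return r;
}

/* Rho step for a full 64-bit modulus using the literal constant 1: y < n,
 * so y+1 cannot overflow and no addmod is needed.  Because 1 is not
 * Montgomery's 1, the map iterated is x -> x*(x + 2^-64 mod n), a different
 * (but still usable) quadratic polynomial for each modulus. */
static uint64_t rho_step_plus1(uint64_t y, uint64_t n, uint64_t npi) {
    return mulredc64(y, y + 1, n, npi);
}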
2017-10-13, 12:07   #50
science_man_88 ("Forget I exist")

Quote:
Originally Posted by arbooker
which is faster despite being more instructions.
And so can unrolling loops (at least in hand-written asm, as opposed to leaving it to compiler optimization). In most asm, a loop is a conditional test plus a jump back to the top, so the same piece of code runs multiple times. If you know how many iterations you'll do, unrolling it a bit can make sense and speed things up, since jmp/jne/jle/etc. instructions can be expensive. So depending on how fast the loop body is, you may be able to repeat it a few more times without using a jump and come out ahead.
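
(For what it's worth, a toy C illustration of the kind of unrolling being described, not code from this thread:)

Code:
#include <stdint.h>
#include <stddef.h>

/* Plain loop: one compare-and-branch per element. */
uint64_t sum_simple(const uint64_t *a, size_t len) {
    uint64_t s = 0;
    for (size_t i = 0; i < len; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: one compare-and-branch per four elements, plus a short
 * cleanup loop for the leftover 0-3 elements. */
uint64_t sum_unrolled4(const uint64_t *a, size_t len) {
    uint64_t s = 0;
    size_t i = 0;
    for (; i + 4 <= len; i += 4)
        s += a[i] + a[i+1] + a[i+2] + a[i+3];
    for (; i < len; i++)
        s += a[i];
    return s;
}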
2017-10-13, 14:26   #51
bsquared ("Ben")

Quote:
Originally Posted by arbooker
With 64-bit inputs, all additions have to be addmods, since the computation of a+b can overflow. There is one exception to that: when x is at most n-1 you can compute mulredc(x,x+1) without risk of overflow. The problem is that 1 is not Montgomery's 1, but if you don't care about using the same polynomial every time (which is fine if we're thinking of it as a randomized algorithm and are happy to allow occasional failures) then you could get this speedup for 64-bit inputs as well.
I was surprised it worked as well as it did for 64-bit inputs. But I notice that as n increases past 63 bits, Montgomery's 1 decreases, because it is computed as 2^64 % n. For example, the largest 64-bit semiprime, 18446743979220271189, has Montgomery's 1 = 94489280427, a 37-bit number. So y+one won't overflow even then. This is far from a proof, but it at least goes some way toward explaining why it worked every time on 100k 64-bit semiprimes. This will only work for the polynomial constant 1, obviously.
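
A quick standalone check of that computation (using the modulus quoted above):

Code:
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t n = 18446743979220271189ULL;  /* the 64-bit semiprime quoted above */

    /* Montgomery's 1 is 2^64 mod n.  In uint64_t arithmetic, 0 - n wraps to
     * 2^64 - n, which has the same residue mod n as 2^64 itself. */
    uint64_t one = (0 - n) % n;

    /* Prints 94489280427, a 37-bit number, so y + one stays far below 2^64. */
    printf("one = %llu\n", (unsigned long long)one);
    return 0;
}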

2017-10-13, 16:04   #52
arbooker ("Andrew Booker")

Quote:
Originally Posted by bsquared
I was surprised it worked as well as it did for 64-bit inputs. But I notice that as n increases past 63 bits, Montgomery's 1 decreases, because it is computed as 2^64 % n.
Good point! In fact for n > 2^63, Montgomery's 1 is just 2^64-n, since that is already smaller than n. So x+one never overflows (it is at most (n-1) + (2^64-n) < 2^64), and mulredc(x,x+one) should work as expected even with 64-bit inputs.
2017-10-13, 16:54   #53
danaj ("Dana Jacobsen")

Quote:
Originally Posted by arbooker
Yeah, n isn't changing, so that makes sense. In my case it doesn't add much code to move the if check outside, and then I can replace y+one by addmod(y,one,n) only when needed.
I have 17 functions in my library that use Montgomery reduction, so it's not quite as simple in my case. They already have two code paths for non-Mont and Mont math though (so they will continue to work on obscure machines / compilers).

Quote:
Originally Posted by arbooker
Good point! In fact for n > 2^63, Montgomery's 1 is just 2^64-n. So x+one never overflows, and mulredc(x,x+one) should work as expected even with 64-bit inputs.
This explains why I got good results with Ben's spbrent, which uses c=1, but not with mine, which for obscure historical reasons defaults to c=3. Changing to c=1 succeeds.
2017-10-13, 22:03   #54
Dubslow ("Bunslow the Bold")

Quote:
Originally Posted by science_man_88
And so can unrolling loops (at least in hand-written asm, as opposed to leaving it to compiler optimization). In most asm, a loop is a conditional test plus a jump back to the top, so the same piece of code runs multiple times. If you know how many iterations you'll do, unrolling it a bit can make sense and speed things up, since jmp/jne/jle/etc. instructions can be expensive. So depending on how fast the loop body is, you may be able to repeat it a few more times without using a jump and come out ahead.
I suggest you read up on branch prediction. Given the level of discourse in this thread, everything here qualifies as "obvious" to the (relative) experts here.
2017-10-13, 22:31   #55
science_man_88 ("Forget I exist")

Quote:
Originally Posted by Dubslow
I suggest you read up on branch prediction. Given the level of discourse in this thread, everything here qualifies as "obvious" to the (relative) experts here.
I know such things exist (at least by name), but I was also replying to a specific statement that seemed to be a misunderstanding by arbooker. The statement suggests they think that just because a program has more instructions, it should take longer to execute.
