20171012, 22:32  #45  
"Andrew Booker"
Mar 2013
85_{10} Posts 
Quote:
Anyway, the point is that the 63bit mulredc works as long as the product does not exceed 127 bits. That means you can afford one unmodded addition before multiplying (so mulredc(a,b+c) is OK, but mulredc(a+b,c+d) is not). 

20171012, 23:34  #46  
"Dana Jacobsen"
Feb 2011
Bangkok, TH
2^{2}×227 Posts 
Quote:
Technically Ben wrote the 63bit asm, then after I fruitlessly fiddled a bit and mentioned I'd like a clean 64bit version, he wrote the 64bit version. I put the switch inside the function, and noted that ideally for performance the caller would choose the correct one. Probably due to good branch prediction there didn't seem to be much penalty for doing it the easy way (switch inside the function). One could of course just use the 64bit version but that does give a measurable performance difference and doing the switch gives me almost the best of both. Re the sequence x*x+a vs. x*(x+a) it gives me about a 5% speedup in my code and works great up to 63bit. It does not work right for 64bit inputs (where x*x+a does work since this is with a 64bit mulredc). Last fiddled with by danaj on 20171012 at 23:36 

20171013, 02:05  #47  
"Ben"
Feb 2007
3,371 Posts 
Quote:


20171013, 08:52  #48  
"Andrew Booker"
Mar 2013
5·17 Posts 
Quote:
Quote:


20171013, 09:00  #49 
"Andrew Booker"
Mar 2013
55_{16} Posts 
With 64bit inputs, all additions have to be addmods, since the computation of a+b can overflow. There is one exception to that: when x is at most n1 you can compute mulredc(x,x+1) without risk of overflow. The problem is that 1 is not Montgomery's 1, but if you don't care about using the same polynomial every time (which is fine if we're thinking of it as a randomized algorithm and are happy to allow occasional failures) then you could get this speedup for 64bit inputs as well.

20171013, 12:07  #50 
"Forget I exist"
Jul 2009
Dumbassville
20C0_{16} Posts 
and so can unrolling loops ( at least using pure asm instead of compiler optimization. in most asm, loops are if statements enclosed in a header, that can be jumped back to to create a piece of code that runs multiple times sometimes if you know how many runs you are doing of it unrolling it a bit makes sense and can speed it up as jmp/jne/jle/etc. commands can be expensive. so depending on how fast the code inside is you may be able to do it a few more times without using a jump and speed it up.

20171013, 14:26  #51  
"Ben"
Feb 2007
3,371 Posts 
Quote:
Last fiddled with by bsquared on 20171013 at 14:38 

20171013, 16:04  #52 
"Andrew Booker"
Mar 2013
55_{16} Posts 
Good point! In fact for n > 2^63, Montgomery's 1 is just 2^64n. So x+one never overflows, and mulredc(x,x+one) should work as expected even with 64bit inputs.

20171013, 16:54  #53  
"Dana Jacobsen"
Feb 2011
Bangkok, TH
908_{10} Posts 
Quote:
Quote:


20171013, 22:03  #54  
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 89<O<88
3·29·83 Posts 
Quote:


20171013, 22:31  #55 
"Forget I exist"
Jul 2009
Dumbassville
20C0_{16} Posts 
I know such things exist ( at least by name), but I also was replying to a specific statement, that seemed to be a misunderstanding by arbooker. The statement suggest, they think that just because a program has more code to it, it should take longer to execute.
Last fiddled with by science_man_88 on 20171013 at 22:41 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Factoring Mersenne numbers  paulunderwood  Miscellaneous Math  18  20170827 14:56 
Factoring Big Numbers In c#  ShridharRasal  Factoring  10  20080320 17:17 
Question about factoring code  Peter Nelson  Software  9  20050330 18:28 
Factoring Fermat Numbers  Axel Fox  Software  14  20030704 18:57 
Questions about SSE2 code and Factoring  Joe O  Software  2  20020913 23:39 