mersenneforum.org  

2009-08-24, 17:56   #1
SPWorley

Jul 2009

37₈ Posts

P95 trial division strategy

I'm experimenting with my own GPU Mersenne trial division tool... it's got its own fun challenges.

The classic way to test 2^p - 1 for divisibility by q is to compute
2^p mod q and look for a remainder of 1. It's a fast and easy power ladder.
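
For concreteness, here's a minimal sketch of that power ladder in C, for the easy case q < 2^32 so a plain 64-bit square fits (the interesting q here need multiword arithmetic):

Code:
#include <stdint.h>

/* Left-to-right binary power ladder: returns 2^p mod q.
   Sketch for q < 2^32, so r*r fits in 64 bits; the q > 2^64
   case discussed here needs a multiword representation. */
uint64_t pow2_mod(uint64_t p, uint64_t q)
{
    uint64_t r = 1;
    for (int bit = 63; bit >= 0; bit--) {
        r = (r * r) % q;          /* square */
        if ((p >> bit) & 1)
            r = (2 * r) % q;      /* multiply in the base, 2 */
    }
    return r;                     /* q divides 2^p - 1 iff r == 1 */
}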

The slow step is of course the mod q reduction. There are lots of strategies.
The important cases are when q > 2^64, so I'm using three 32-bit words as my representation.

The modular method I use for SPRP testing is Montgomery reduction, and that's what I was planning to use for trial division as well. I have that working and it's successful... though of course you always want more performance.
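
For reference, here's Montgomery reduction (REDC) in its minimal one-word form with GCC/Clang's __int128 (a textbook sketch, not the three-word 96-bit version I'm actually using):

Code:
#include <stdint.h>

/* Montgomery REDC, single-word sketch: given t < q * 2^64 (and
   q < 2^63 so the sum below can't overflow 128 bits), returns
   t * 2^-64 mod q.  qinv is the precomputed -q^-1 mod 2^64 (q odd). */
uint64_t redc(unsigned __int128 t, uint64_t q, uint64_t qinv)
{
    uint64_t m = (uint64_t)t * qinv;        /* m = t * (-q^-1) mod 2^64 */
    unsigned __int128 s = t + (unsigned __int128)m * q; /* low word cancels */
    uint64_t r = (uint64_t)(s >> 64);       /* exact: s is divisible by 2^64 */
    return (r >= q) ? r - q : r;
}
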
I've thought of using the "multiply trick" of computing 2^160/q and changing the division into a multiplication. There are also some clever tricks that use q^-1 mod 2^160, but I think these precomputes are already slower than the Montgomery method.
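
The multiply trick in one-word form, for scale (a Barrett-style reduction with a precomputed mu = floor(2^64 / q); the 2^160 version above is the same idea with wider words):

Code:
#include <stdint.h>

/* Barrett-style reduction, one-word sketch: replaces the division
   in x mod q with a multiply by mu = floor(2^64 / q), precomputed
   once per q (here q < 2^32 and x < 2^64). */
uint64_t barrett_mod(uint64_t x, uint64_t q, uint64_t mu)
{
    uint64_t qhat = (uint64_t)(((unsigned __int128)x * mu) >> 64);
    uint64_t r = x - qhat * q;   /* qhat underestimates by at most 1 */
    if (r >= q)
        r -= q;
    return r;
}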

The trial division tools in MERS use several methods depending on the size of q. But for the interesting and most common case of q > 2^62, they just jump to an arbitrary-precision library.

Prime95, however, is very interesting but also fairly opaque. Despite the many comments in factor32.asm and factor64.asm, it's hard to decipher its approach! It seems as if it computes an (approximate!) floating-point reciprocal of q, then uses that to compute how many multiples of q to subtract off. So it may be doing classic division-and-subtraction to compute its mods, boosted by a floating-point initial approximation step.

I may be very wrong about this, because the code is so beautifully tuned and expanded that it's hard to take a step back and see its strategy at a larger level. The comments in the code are well written too, but at a very low, statement-by-statement level.

I understand the sieve steps, but can someone give a higher-level description of P95's method for computing 2^p mod q?

2009-08-24, 19:25   #2
lfm

Jul 2006
Calgary

1A9₁₆ Posts

Did you read:

http://www.mersenne.org/various/math.php

??

2009-08-24, 20:06   #3
SPWorley

Jul 2009

31 Posts

Quote:
Originally Posted by lfm
Did you read: http://www.mersenne.org/various/math.php ??
Yes, indeed, but that does not describe the strategy P95 is using for its mod-square computation in the loop that finds 2^p mod q.

2009-08-24, 20:25   #4
cheesehead
"Richard B. Woods"

Aug 2002
Wisconsin USA

2²·3·599 Posts

Quote:
Originally Posted by SPWorley
It seems as if it computes an (approximate!) floating-point reciprocal of q, then uses that to compute how many multiples of q to subtract off. So it may be doing classic division-and-subtraction to compute its mods, boosted by a floating-point initial approximation step.
Compare that to Algorithm D in section 4.3.1 of Knuth's The Art of Computer Programming. I haven't done that myself, but I'd bet that there's some similarity.

2009-08-24, 20:34   #5
SPWorley

Jul 2009

1F₁₆ Posts

Quote:
Originally Posted by cheesehead
Compare that to Algorithm D in section 4.3.1 of Knuth's The Art of Computer Programming. I haven't done that myself, but I'd bet that there's some similarity.
I bet you're right... that's exactly what I was thinking of when I called it "classic division and subtraction".

But that's also noticeably slower (in my own code) than doing it with Montgomery reduction, at least for exponents above about 65536. Given the beautiful tuning of P95 (at both the algorithm and assembly-code levels), I was thinking there may be something else going on that I could learn from.

2009-08-24, 21:13   #6
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

11²×59 Posts

Quote:
Originally Posted by SPWorley
Prime95, however, is very interesting but also fairly opaque. Despite the many comments in factor32.asm and factor64.asm, it's hard to decipher its approach! It seems as if it computes an (approximate!) floating-point reciprocal of q, then uses that to compute how many multiples of q to subtract off. So it may be doing classic division-and-subtraction to compute its mods, boosted by a floating-point initial approximation step.
There are actually many different algorithms in factor32.asm - each written for a different target CPU. Also, it has been a long time since I last looked at the code, so I've forgotten quite a lot.

To compute x mod y, you take the approximate reciprocal of y, accurate to 53 bits, and compute (top bits of x) * (approx. recip) to get 32 bits of the quotient q (almost always accurate; Knuth guarantees it is accurate to within 2). Multiply q by y, subtract from x, and repeat until done.
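
In one-word C form, the idea looks roughly like this (a sketch of the concept, not the factor32.asm code; exact here for x < 2^52):

Code:
#include <stdint.h>

/* Sketch of the reciprocal step: estimate the quotient with a
   53-bit double reciprocal, subtract that many multiples of y,
   then correct.  Exact for x < 2^52; the real code chains this
   on the top bits of a multiword x, peeling ~32 quotient bits
   per pass. */
uint64_t mod_by_recip(uint64_t x, uint64_t y)
{
    double recip = 1.0 / (double)y;                 /* approx. reciprocal */
    uint64_t qhat = (uint64_t)((double)x * recip);  /* quotient estimate */
    int64_t r = (int64_t)(x - qhat * y);            /* off by at most 1 here */
    if (r < 0)  r += (int64_t)y;                    /* estimate 1 too high */
    if (r >= (int64_t)y) r -= (int64_t)y;           /* estimate 1 too low */
    return (uint64_t)r;
}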

Also, note that you can compute the reciprocal quickly without division. Just take the last reciprocal you computed (it will be real close) and use Newton's method to compute the new reciprocal.
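
That iteration is one line; each step roughly doubles the number of correct bits, so starting from a neighboring candidate's reciprocal a single step is plenty:

Code:
/* Newton's method for a reciprocal: given r ~= 1/y, the update
   r' = r * (2 - y*r) roughly doubles the correct bits.  Successive
   candidate factors are close together, so the previous reciprocal
   is an excellent starting guess. */
double refine_recip(double r, double y)
{
    return r * (2.0 - y * r);
}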

Last fiddled with by Prime95 on 2009-08-24 at 21:13

2009-08-24, 21:26   #7
ewmayer
2ω=0

Sep 2002
República de California

23103₈ Posts

Quote:
Originally Posted by SPWorley
But that's also noticeably slower (in my own code) than doing it with Montgomery reduction, at least for exponents above about 65536. Given the beautiful tuning of P95 (at both the algorithm and assembly-code levels), I was thinking there may be something else going on that I could learn from.
There probably is - but the first question I would ask is "how fast is your Montgomery-based algorithm?" Roughly how many cycles do you need per 96-bit modmul (try to separate that from trial-sieving overhead as much as you can - maybe just write a single modmul-based test timing loop that does a billion modmuls), and on which platform is that?
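
Something like this, say (modmul96() here is a hypothetical stand-in for whatever 96-bit modmul routine is being timed; chaining the results keeps the compiler honest and measures real back-to-back latency):

Code:
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Isolated modmul timing loop.  modmul96() is a placeholder for
   the routine under test (Montgomery, Barrett, reciprocal, ...);
   the dependent chain prevents the compiler from discarding the
   work and measures real latency rather than throughput. */
extern uint64_t modmul96(uint64_t x, uint64_t y);   /* hypothetical */

int main(void)
{
    const long N = 1000000000L;                     /* a billion modmuls */
    uint64_t x = 3;
    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        x = modmul96(x, x);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.2f ns per modmul (checksum %llu)\n",
           secs * 1e9 / N, (unsigned long long)x);
    return 0;
}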

Note that you should be able to get at least a 2-3x speedup on 64-bit x86-style (and most RISC) architectures (that is, ones also running under a 64-bit OS) by using the full 64x64 -> 128-bit hardware integer MUL instruction. That "wastes" some bits for the high parts of the multiword products (e.g. the 32x64-bit subproducts), but fullness-of-bitfield aesthetics is not the name of the game here.
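
With GCC/Clang-style compilers that widening multiply is available without writing assembly:

Code:
#include <stdint.h>

/* Full 64x64 -> 128-bit multiply via the compiler's __int128;
   on x86-64 this compiles to a single MUL (or MULX) instruction.
   Three of these cover a 96x96-bit product with bits to spare. */
static inline void mul64x64(uint64_t a, uint64_t b,
                            uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}
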

2009-08-24, 23:22   #8
SPWorley

Jul 2009

31 Posts

Quote:
Originally Posted by Prime95
There are actually many different algorithms in factor32.asm - each written for a different target CPU. Also, it has been a long time since I last looked at the code, so I've forgotten quite a lot.

To compute x mod y, you take the approximate reciprocal of y, accurate to 53 bits, and compute (top bits of x) * (approx. recip) to get 32 bits of the quotient q (almost always accurate; Knuth guarantees it is accurate to within 2). Multiply q by y, subtract from x, and repeat until done.

Also, note that you can compute the reciprocal quickly without division. Just take the last reciprocal you computed (it will be real close) and use Newton's method to compute the new reciprocal.
Thanks much for the authoritative answer!
I did notice the (many!) methods defined, especially for the low bit ranges below 2^64. I guess in the days of 486 processors it was a struggle even to test that high, and having custom code for different low bit ranges was useful.

I may try a couple of old-school divide approaches as well... we'll see how they compare to Montgomery.

2009-08-24, 23:26   #9
SPWorley

Jul 2009

1F₁₆ Posts

Quote:
Originally Posted by ewmayer
Note that you should be able to get at least a 2-3x speedup on 64-bit x86-style (and most RISC) architectures (that is, ones also running under a 64-bit OS) by using the full 64x64 -> 128-bit hardware integer MUL instruction. That "wastes" some bits for the high parts of the multiword products (e.g. the 32x64-bit subproducts), but fullness-of-bitfield aesthetics is not the name of the game here.
Well, the fun part is that I'm doing this on a new architecture: a GPU! And there are arguments for using a 24-bit word size... there's hardware support for using the FPU to do integer math on 24-bit words (and you can pull the full 48-bit result out). 32-bit and 64-bit words are also supported in hardware, but each is slower, as expected. It's hard to judge the fastest way to do bigint math, so my approach is to try them all and measure!
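
To see why 24-bit limbs are attractive, here's the arithmetic in plain C with doubles standing in for the GPU's fast FPU path (an illustration of the word-size tradeoff, not GPU code):

Code:
#include <stdint.h>

/* Why 24-bit limbs work on the FPU: two limbs below 2^24 multiply
   to a product below 2^48, which fits exactly in a double's 53-bit
   mantissa.  On the GPU the same shape maps onto the fast 24-bit
   multiply hardware. */
static inline uint64_t mul24x24(uint32_t a, uint32_t b)
{
    /* requires a, b < 2^24 so the product is exact */
    return (uint64_t)((double)a * (double)b);
}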