I'm experimenting with my own GPU Mersenne trial division tool.. it's got its own fun challenges.

The classic way to test for divisibility of 2^p -1 by q is to compute

2^p mod q and look for a 1 remainder. It's a fast and easy power ladder.

The slow step is of course doing the mod q step. There's lots of strategies.

The important cases are when q> 2^64, so I'm using 3 32 bit words as my representation.

The modular method I use for SPRP testing is

Montgomery reduction, and that's what I was planning to use for trial division as well. I have that working and it's successful.. though of course you always want more performance.

I've thought of using the "multiply trick" of computing 2^160/q and changing the division into a multiplication. There's also some clever tricks to use q^-1 mod 2^160, but I think that these precomputes are already slower than the Montgomery method.

The trial division tools in

MERS uses several methods depending on the size of q. But for the interesting and most common case of q>2^62, it just jumps to an arbitrary precision library.

Prime95, however, is very interesting but also fairly opaque. Despite the many comments in factor32.asm and factor64.asm, it's hard to decipher its approach! It seems as if it computes an (approximate!) floating point reciprocal to q, then uses that to compute how many multiples of q to subtract off. So this may be doing classic division and subtraction to compute its mods, boosted by using a floating point initial approximation step.

I may be very wrong about this because the code is so

**beautifully tuned **and expanded that it's hard to take a step back and see its strategy at a larger level. The comments in the code also are well written but at a very low statement level.

I understand the sieve steps, but can someone give a higher level description of P95's computational method of 2^p mod q?