View Single Post
Old 2009-08-24, 21:26   #7
ewmayer's Avatar
Sep 2002
Rep├║blica de California

3·7·19·29 Posts

Originally Posted by SPWorley View Post
But that's also noticably slower (in my own code) than doing it with Montgomery reduction, at least for exponents more than about 65556. Given the beautiful tuning of P95 (both at algorithm and assembly code levels) I was thinking there may be something else going on that I could learn from.
There probably is - but the first question I would ask is "how fast is your Montgomery-based algorithm?" Roughly how many cycles do you need per 96-bit modmul (try to separate that from trial-sieving overhead as much as you can - maybe just write a single modmul-based test timing loop that does a billion modmuls), and on which platform is that?

Note that should be able to get at least a 2-3x speedup on 64-bit x86-style (and most RISC) architectures (that is, ones running under a 64-bit OS as well) from using the full 64x64 -> 128-bit hardware integer MUL instruction. That "wastes" some bits for the high parts of the multiword products (e.g. the 32x64-bit subproducts), but fullness-of-bitfield-aesthetics is not the name of the game here.
ewmayer is offline   Reply With Quote