View Single Post
Old 2009-11-27, 02:43   #4
Tribal Bullet
jasonp's Avatar
Oct 2004

2·3·19·31 Posts

Originally Posted by TheJudger View Post
Please tell me which alternatives you're thinking about.
Maybe you've allready tried by yourself some of them on CUDA?
No, I went straight to writing the Montgomery code because that's what I know :)

The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61-bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way.
jasonp is offline   Reply With Quote