View Single Post
Old 2009-11-27, 02:43   #4
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

66418 Posts
Default

Quote:
Originally Posted by TheJudger View Post
Please tell me which alternatives you're thinking about.
Maybe you've allready tried by yourself some of them on CUDA?
No, I went straight to writing the Montgomery code because that's what I know :)

The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61-bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way.
jasonp is offline   Reply With Quote