Quote:
Originally Posted by Prime95
There are actually many different algorithms in factor32.asm  each written for a different target CPU. Also, it has been a long time since I last looked at the code, so I've forgotten quite a lot.
To compute x mod y, you take the approximate reciprocal accurate to 53bits and compute (top bits of x) * (approx. recip) to get 32 bits of the quotient (almost always accurate, Knuth guarantees it is accurate to within 2). Multiply q by y, subtract from x, repeat til done.
Also, note that you can compute the reciprocal quickly without division. Just take the last reciprocal you computed (it will be real close) and use Newton's method to compute the new reciprocal.

Thanks much for the authoritative answer!
I did notice the (many!) multiple methods defined, especially for the low bit ranges of < 2^64. I guess in the days of 486 processors, it was a struggle even to test up that high and having custom code for different low bit ranges was useful.
I may try a couple oldschool divide approaches as well... we'll see how they compare to Montgomery.