Originally Posted by cheesehead
Compare that to Algorithm D in section 4.3.1 of Knuth's The Art of Computer Programming. I haven't done that myself, but I'd bet that there's some similarity.
I bet you're right. That's exactly what I was thinking of when I called it "classic division and subtraction".

But that's also noticeably slower (in my own code) than doing it with Montgomery reduction, at least for exponents more than about 65556. Given the beautiful tuning of P95 (at both the algorithm and assembly code levels), I was thinking there may be something else going on that I could learn from.
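
For reference, here is a minimal Python sketch of the Montgomery approach mentioned above: computing 2^exp mod n with shifts and masks in place of a division at every step. It assumes n is odd and uses R = 2^k with k the bit length of n. The function names (redc, mont_pow2) are made up for illustration; this is not P95's code or my tuned version, just the general idea.

```python
def redc(t, n, k, n_prime):
    """Montgomery reduction: return t * R^-1 mod n, where R = 2**k.

    Requires 0 <= t < R * n. Uses only multiplies, a mask and a shift.
    """
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask   # m = (t mod R) * n' mod R
    u = (t + m * n) >> k                # exact division by R
    return u - n if u >= n else u


def mont_pow2(exp, n):
    """Compute 2**exp mod n for odd n by left-to-right binary exponentiation."""
    if n % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    k = n.bit_length()                  # R = 2**k > n
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r      # n * n' == -1 (mod R); needs Python 3.8+
    x = r % n                           # Montgomery form of 1
    two = (2 * r) % n                   # Montgomery form of 2
    for bit in bin(exp)[2:]:
        x = redc(x * x, n, k, n_prime)          # square
        if bit == "1":
            x = redc(x * two, n, k, n_prime)    # multiply by 2
    return redc(x, n, k, n_prime)       # convert back out of Montgomery form
```

For example, mont_pow2(11, 23) gives 1, matching the fact that 23 divides 2^11 - 1. The only true divisions (r % n and the modular inverse) happen once during setup, which is where the saving over classic divide-and-subtract would come from in a multi-precision implementation.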
