Certainly no intention of pwning anyone; this is purely research code. I was working from Crandall's original paper, with the understanding that others had gotten it to work with non-powers of two, so I really don't know all the excellent work you all have done with cudaLucas and macLucasFFTW and such. That's mainly why I posted this week...I can't finish my paper on this without mentioning other current work. If anyone would care to summarize the principal players and their programs, you'll get a grateful acknowledgment, for sure.
I'll post some timing results today or tomorrow...I've got a Friday deadline, so I'm finishing off my time trials right now.
As to whether it'll work with 1.3 cards...the implementation is pretty transparent, so it may need one or two mods but will probably work with any card that has true double precision and can run CUDA 3.2, though it does depend on the recent Fermi cards for a lot of its efficiency. Note that CUFFT has improved a lot in the most recent implementation, eliminating crippling bugs and substantially improving the non-power-of-two FFTs.
As to my credentials...no offense taken...I'm mainly an image-analysis guy, and these days teach undergrads, but I've been interested in Mersenne prime testing since 1995, when I was trying to parallelize LL for a MasPar MP-1. :) I was at Carolina in the late '90s when they were doing the original work with PixelFlow, so we were all excited about programmable graphics hardware. The obsolete Cg work from a few years back was using compiled shaders on 8800GT and 9800 cards, with my own homebrew extended-precision float-float FFTs and very baroque parallel carry-adds. Totally crazy, but perhaps y'all here might appreciate that. :)
