This is an impressive speedup. I assume there is no chance of stage 2 being ported to run on CUDA? GPU memory was too small back when the old GPU code was written, and I suspect that may still be the case when running many curves in parallel. Maybe different sections of the stage 2 range could be done in parallel instead?

Binaries for several programs have been hosted on the forum server. I would suggest messaging Xyyzy.

Does the Windows Visual Studio compilation work for this? I would need either that or CUDA working under WSL2.
