cudaOwl
I added an initial CUDA backend to gpuOwl. I expect it to be rough, buggy and not optimized yet, but it's a start.
The approach I ended up with was to use most of the same codebase, but split out two backends, OpenCL and CUDA.
[I'm thinking, should I rename the previous gpuOwl to openOwl for symmetry with cudaOwl?]
So the savefile format, and much of the logic, is shared between cudaOwl and gpuOwl.
There are some notable differences though:
 gpuOwl supports "offset extension", which means varying the offset (aka "shift") when a PRP error is encountered. Unfortunately it's not a big deal: this trick achieves only about 0.5% exponent extension for a given FFT size. It was motivated by the severe lack of FFT size choice in openOwl. (cudaOwl doesn't have "offset".)
 cudaOwl has a rich choice of FFT sizes (unlike openOwl). FFT selection is controlled with the "fft" argument, which accepts either absolute sizes such as 4096K or 4M, or delta steps from the "default" size for the exponent, such as +1 or -1.
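To make the two kinds of "fft" value concrete, here is a minimal Python sketch of how such a spec could be resolved. This is not cudaOwl's actual code: the function name parse_fft_spec, the list of available sizes, and the reading of a signed value as "that many steps up or down in the sorted size list" are all assumptions for illustration. The only thing taken from the post is that 4096K and 4M name the same size, i.e. K = 1024 and M = 1024K.

```python
K = 1024
M = 1024 * K

# Made-up list of available FFT sizes (in words), for illustration only.
EXAMPLE_SIZES = [4096 * K, 4480 * K, 5120 * K]


def parse_fft_spec(spec, default_size, sizes):
    """Resolve an "fft" spec to an FFT size in words (hypothetical helper).

    spec is either an absolute size ("4096K", "4M", or a plain number)
    or a signed delta ("+1", "-1") counted in steps from default_size
    within the sorted list of available sizes.
    """
    sizes = sorted(sizes)
    spec = spec.strip()
    if spec and spec[0] in "+-":
        # Delta form: move up or down from the default size.
        idx = sizes.index(default_size) + int(spec)
        if not 0 <= idx < len(sizes):
            raise ValueError("delta %s is out of range" % spec)
        return sizes[idx]
    # Absolute form: optional K or M suffix.
    units = {"K": K, "M": M}
    suffix = spec[-1:].upper()
    if suffix in units:
        return int(spec[:-1]) * units[suffix]
    return int(spec)
```

With the made-up list above and a default of 4480K, "+1" would pick 5120K and "-1" would pick 4096K, while "4096K" and "4M" both resolve to the same absolute size.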
A few nice things:
 it's possible to switch the savefile between CUDA/OpenCL in mid-flight.
 it's possible to change the FFT size in mid-flight.
Not so nice:
 the performance on GTX 1080 is disappointing: 5.9 ms/it at the PRP wavefront, 4480K FFT. (Thus I don't think it's such a good idea to do PRP or LL on Nvidia yet. TF is probably a better fit for the 32-bit-oriented hardware.)
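To put the 5.9 ms/it figure in perspective, here is a rough back-of-the-envelope calculation in Python. A PRP test of 2^p - 1 takes about p iterations, so the total time is roughly p times the per-iteration time. The exponent below (84,000,000) is only an assumed stand-in for "the PRP wavefront"; the post does not give one.

```python
MS_PER_ITERATION = 5.9      # from the post, GTX 1080 at 4480K FFT
FFT_WORDS = 4480 * 1024     # 4480K FFT, with K = 1024
EXPONENT = 84_000_000       # ASSUMED wavefront exponent, not from the post


def prp_days(exponent, ms_per_it):
    """Approximate days for one PRP test: about `exponent` iterations."""
    seconds = exponent * ms_per_it / 1000.0
    return seconds / 86400.0


def bits_per_word(exponent, fft_words):
    """Average number of bits of the exponent carried per FFT word."""
    return exponent / fft_words


print("about %.1f days per test" % prp_days(EXPONENT, MS_PER_ITERATION))
print("about %.2f bits/word" % bits_per_word(EXPONENT, FFT_WORDS))
```

Under that assumed exponent, this comes out to roughly 5.7 days per test on the GTX 1080.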
Last fiddled with by preda on 2018-06-25 at 10:46
