Just out of curiosity, I ran a test on an Opteron 246 (2 GHz) using the "CpuSupportsSSE2=0" switch to see whether it would run faster with SSE2 disabled, since that selects a lower FFT length. Here are the figures I got:
Code:
                FFT length   time per iteration
---------------------------------------------
SSE2 enabled:       196608             8.483 ms
SSE2 disabled:      163840             9.036 ms
This shows that the SSE2 code is faster, by about 6% per iteration, even at the higher FFT length. This behaviour should also hold for the Athlon 64s. Perhaps someone could run a test to verify...
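
For anyone who wants to compare their own numbers the same way, here is a small sketch of the arithmetic in Python. The two timings and FFT lengths are the ones from the table above; the variable names and the "time per FFT point" normalisation are just my own assumptions for illustration, not something the program reports itself.

```python
# Timings taken from the table above (Opteron 246, 2 GHz).
sse2_ms = 8.483        # SSE2 enabled, FFT length 196608
no_sse2_ms = 9.036     # SSE2 disabled, FFT length 163840

sse2_fft_len = 196608
no_sse2_fft_len = 163840

# Fraction of iteration time saved by leaving SSE2 enabled.
time_saved = (no_sse2_ms - sse2_ms) / no_sse2_ms

# The same comparison expressed as a speedup factor.
speedup = no_sse2_ms / sse2_ms

# Rough normalisation (an assumption, for illustration only):
# nanoseconds spent per FFT point, to account for the longer SSE2 FFT.
sse2_ns_per_point = sse2_ms * 1e6 / sse2_fft_len
no_sse2_ns_per_point = no_sse2_ms * 1e6 / no_sse2_fft_len

print(f"time saved per iteration: {time_saved:.1%}")
print(f"speedup factor:           {speedup:.3f}")
print(f"ns per FFT point (SSE2):    {sse2_ns_per_point:.1f}")
print(f"ns per FFT point (no SSE2): {no_sse2_ns_per_point:.1f}")
```

To check another CPU, replace the two timings (and the FFT lengths, if they differ) with your own figures.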