In general, for the same FFT size, SSE2 code will be faster than non-SSE2 code on machines that support SSE2. For the cross-over exponents, however, we are comparing a smaller non-SSE2 FFT size against a larger SSE2 FFT size. Depending on the CPU type and the actual FFT sizes, you may find one or the other faster.

A notable exception is the P4, which plain sucks at non-SSE2 code.
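
If you want to settle it for your own machine, the simplest approach is to time both configurations on a cross-over exponent and keep whichever is faster. Here is a minimal sketch of that comparison in Python. The function name `pick_faster` and the labels are made up for illustration, and the millisecond figures are placeholders, not real benchmark results; plug in per-iteration times you measured yourself.

```python
def pick_faster(timings_ms):
    """Return the (label, ms_per_iteration) pair with the lowest time.

    timings_ms maps a configuration label to its measured time per
    iteration in milliseconds.
    """
    if not timings_ms:
        raise ValueError("need at least one timing")
    return min(timings_ms.items(), key=lambda item: item[1])


# Placeholder numbers only (hypothetical, NOT measured on any real CPU).
example = {
    "non-SSE2, smaller FFT": 50.0,
    "SSE2, larger FFT": 55.0,
}

label, ms = pick_faster(example)
print(f"faster for this exponent: {label} ({ms} ms/iter)")
```

With real timings the winner can flip from one CPU type to another, which is the whole point of measuring instead of assuming.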
