If you have a Ryzen 3950X, there is also the L3 cache split to consider: each CCX of 4 cores can directly access only its own 16 MiB of L3. 4 workers (one per CCX) might be optimal for that processor, since in theory better cache utilisation means less memory bandwidth consumed to do the same work, but it could depend on FFT size. tl;dr: always benchmark.
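
To make the arithmetic concrete, here is a rough Python sketch of the one-worker-per-CCX idea. It rests on assumptions that are mine, not measured facts: physical cores are numbered contiguously per CCX (check your OS's actual topology), and the FFT working set is about 8 bytes per element with no extra tables, which real software will exceed. The helper names are made up for illustration.

```python
# Rough sketch: one worker per CCX on a 16-core part with 4-core CCXs.
# ASSUMPTION: cores 0-3 share CCX 0, cores 4-7 share CCX 1, and so on.
# Real numbering (especially with SMT enabled) varies by OS.
CORES = 16
CORES_PER_CCX = 4
L3_PER_CCX_MIB = 16


def ccx_groups(cores=CORES, per_ccx=CORES_PER_CCX):
    """Return one list of core indices per CCX (hypothetical numbering)."""
    return [list(range(start, start + per_ccx))
            for start in range(0, cores, per_ccx)]


def fft_fits_in_l3(fft_len, bytes_per_elem=8, l3_mib=L3_PER_CCX_MIB):
    """Crude check: does the raw FFT data fit in one CCX's L3?

    ASSUMPTION: 8 bytes per element and nothing else in the cache.
    """
    return fft_len * bytes_per_elem <= l3_mib * 1024 * 1024


if __name__ == "__main__":
    for worker, cores in enumerate(ccx_groups()):
        print(f"worker {worker}: cores {cores}")
    for k in (1024, 2048, 4096):
        print(f"{k}K FFT fits in one CCX's L3: {fft_fits_in_l3(k * 1024)}")
```

This only tells you where the footprint crosses the 16 MiB line on paper. Whether 4 workers actually beats 1, 2 or 8 at a given FFT size is still something to measure.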
