2020-07-18, 12:35   #5
intelfx

Jul 2020

D16 Posts

Quote:
Originally Posted by M344587487
If you have a Ryzen 3950X there is also the L3 cache split to consider (each CCX of 4 cores can directly access only 16MiB of L3). 4 workers might be optimal for that processor (in theory better cache utilisation means less memory bandwidth consumption to do the same work) but it could depend on FFT size. tl;dr always benchmark.
Yup, it's that one.

In fact, I had simply overlooked the benchmark option. Going by the benchmark results, the best throughput for 2048K FFTs (about 1840 to 1864 iter/sec) is achieved with 4 workers:

Code:
FFTlen=2048K, Type=3, Arch=4, Pass1=1024, Pass2=2048, clm=2 (16 cores, 4 workers):  2.18,  2.18,  2.17,  2.17 ms.  Throughput: 1840.17 iter/sec.
FFTlen=2048K, Type=3, Arch=4, Pass1=2048, Pass2=1024, clm=1 (16 cores, 4 workers):  2.15,  2.16,  2.14,  2.14 ms.  Throughput: 1863.96 iter/sec.
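As a sanity check on how the throughput figure relates to the per-worker timings: it looks like the sum of each worker's iteration rate (1000 / ms per iteration). Below is a minimal Python sketch under that assumption. The relationship is inferred from the numbers above rather than taken from the Prime95 documentation, and the function name is made up for illustration.

```python
def throughput_iter_per_sec(timings_ms):
    """Sum of per-worker iteration rates.

    timings_ms: per-iteration time of each worker, in milliseconds.
    Assumption: total throughput is simply the sum of 1000/ms over workers.
    """
    return sum(1000.0 / t for t in timings_ms)


# Per-worker timings from the Pass1=2048, Pass2=1024, clm=1 line above.
estimate = throughput_iter_per_sec([2.15, 2.16, 2.14, 2.14])
reported = 1863.96
print(f"estimated {estimate:.2f} vs reported {reported:.2f} iter/sec")
```

The estimate lands within a couple of iter/sec of the reported value; the small gap is presumably due to the printed timings being rounded to two decimals.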
With larger FFTs, however, 4-worker throughput falls behind 2 workers, and the gap widens as the FFT length grows (I did not run the extended benchmark here, just the defaults):

Code:
Timings for 2240K FFT length (16 cores, 2 workers):  1.43,  1.43 ms.  Throughput: 1399.54 iter/sec.
Timings for 2240K FFT length (16 cores, 4 workers):  3.39,  3.41,  3.22,  3.21 ms.  Throughput: 1209.72 iter/sec.
Timings for 2304K FFT length (16 cores, 2 workers):  1.43,  1.43 ms.  Throughput: 1397.51 iter/sec.
Timings for 2304K FFT length (16 cores, 4 workers):  4.04,  4.00,  3.66,  3.65 ms.  Throughput: 1044.27 iter/sec.
Timings for 2400K FFT length (16 cores, 2 workers):  1.53,  1.54 ms.  Throughput: 1300.87 iter/sec.
Timings for 2400K FFT length (16 cores, 4 workers):  4.52,  4.46,  4.66,  4.74 ms.  Throughput: 870.69 iter/sec.

With significantly larger FFTs, 4-worker throughput drops to less than half that of 2 workers:
Code:
Timings for 3072K FFT length (16 cores, 2 workers):  1.88,  1.91 ms.  Throughput: 1056.56 iter/sec.
Timings for 3072K FFT length (16 cores, 4 workers):  7.88,  7.97,  7.80,  7.79 ms.  Throughput: 508.94 iter/sec.
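For anyone who wants to compare worker counts across a longer results file, here is a minimal Python sketch that parses default-benchmark lines in the format shown above and reports, per FFT length, which worker count gave the highest throughput. The regular expression only assumes the line layout visible in this post, and all names (`parse_throughput`, `best_worker_count`, `LOG`) are hypothetical.

```python
import re

# Matches lines like:
# "Timings for 2240K FFT length (16 cores, 2 workers):  1.43,  1.43 ms.  Throughput: 1399.54 iter/sec."
LINE_RE = re.compile(
    r"Timings for (\d+)K FFT length \(\d+ cores?, (\d+) workers?\):"
    r".*?Throughput: ([\d.]+) iter/sec"
)


def parse_throughput(log_text):
    """Return {fft_length_in_K: {worker_count: throughput}}."""
    table = {}
    for m in LINE_RE.finditer(log_text):
        fft_k, workers, tput = int(m.group(1)), int(m.group(2)), float(m.group(3))
        table.setdefault(fft_k, {})[workers] = tput
    return table


def best_worker_count(table):
    """Return {fft_length_in_K: worker count with the highest throughput}."""
    return {fft_k: max(by_w, key=by_w.get) for fft_k, by_w in table.items()}


# Sample input: lines copied from the benchmark output in this post.
LOG = """\
Timings for 2240K FFT length (16 cores, 2 workers):  1.43,  1.43 ms.  Throughput: 1399.54 iter/sec.
Timings for 2240K FFT length (16 cores, 4 workers):  3.39,  3.41,  3.22,  3.21 ms.  Throughput: 1209.72 iter/sec.
Timings for 3072K FFT length (16 cores, 2 workers):  1.88,  1.91 ms.  Throughput: 1056.56 iter/sec.
Timings for 3072K FFT length (16 cores, 4 workers):  7.88,  7.97,  7.80,  7.79 ms.  Throughput: 508.94 iter/sec.
"""

table = parse_throughput(LOG)
for fft_k, workers in sorted(best_worker_count(table).items()):
    ratio = table[fft_k][4] / table[fft_k][2]
    print(f"{fft_k}K: best with {workers} workers, 4w/2w throughput ratio {ratio:.2f}")
```

Keeping the parsed data as a nested dictionary makes it easy to add more worker counts (1, 8, 16) later without changing the comparison code.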

Incidentally, do you happen to know how exactly I can use the extended benchmark results (i.e. the Type, Arch, Pass1, Pass2 and clm values)? Can I specify them in a config file somewhere to override the built-in values for my CPU?

Last fiddled with by intelfx on 2020-07-18 at 12:42