Running Prime95 on a 16-core CPU — one 16-threaded worker or 16 single-threaded workers?
2020-07-18, 09:34
intelfx

Jul 2020

13 Posts

Quote:
 Originally Posted by LaurV The common wisdom is that new processors will give you better output if you run only a few (1-2) workers, each using more than one thread, such that the total number of threads is not higher than the number of your physical cores. It didn't use to be like that, but with CPUs getting lots of cores, the limiting factor became memory bandwidth: 16 workers would need to exchange data for 16 tests. For example, on my system (10 cores, 20 threads, lots of cache memory) I get the best output with 2 workers, each running 5 threads. Hyper-threading is not useful for most work types; it only produces more heat, not more output.

I see. Memory throughput being the bottleneck sounds quite plausible. I'll run the benchmark, thanks.
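
For reference, the rule quoted above (total threads no higher than the number of physical cores) leaves only a handful of even splits to benchmark. A minimal sketch in Python; `worker_layouts` is a hypothetical helper, not anything Prime95 provides, and it only lists layouts that use every core exactly once:

```python
def worker_layouts(physical_cores):
    """Return (workers, threads_per_worker) pairs that use each physical core exactly once."""
    return [
        (workers, physical_cores // workers)
        for workers in range(1, physical_cores + 1)
        if physical_cores % workers == 0
    ]


# Candidate layouts for a 16-core CPU, ignoring hyper-threading.
for workers, threads in worker_layouts(16):
    print(f"{workers:2d} worker(s) x {threads:2d} thread(s) each")
```

Which of these is fastest is exactly what the benchmark has to decide.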

2020-07-18, 11:14   #4
M344587487
"Composite as Heck"

Oct 2017

23·83 Posts

If you have a Ryzen 3950X there is also the L3 cache split to consider (each CCX of 4 cores can directly access only 16MiB of L3). 4 workers might be optimal for that processor (in theory better cache utilisation means less memory bandwidth consumption to do the same work) but it could depend on FFT size. tl;dr always benchmark.
2020-07-18, 12:35
intelfx

Jul 2020

13 Posts

Quote:
 Originally Posted by M344587487 If you have a Ryzen 3950X there is also the L3 cache split to consider (each CCX of 4 cores can directly access only 16MiB of L3). 4 workers might be optimal for that processor (in theory better cache utilisation means less memory bandwidth consumption to do the same work) but it could depend on FFT size. tl;dr always benchmark.
Yup, it's that one.

In fact, I had simply overlooked the benchmark option. Judging by the benchmark results, for 2048K FFTs the absolute best throughput (roughly 1840 to 1860 iter/sec) is achieved with 4 workers:

Code:
FFTlen=2048K, Type=3, Arch=4, Pass1=1024, Pass2=2048, clm=2 (16 cores, 4 workers):  2.18,  2.18,  2.17,  2.17 ms.  Throughput: 1840.17 iter/sec.
FFTlen=2048K, Type=3, Arch=4, Pass1=2048, Pass2=1024, clm=1 (16 cores, 4 workers):  2.15,  2.16,  2.14,  2.14 ms.  Throughput: 1863.96 iter/sec.
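
The throughput figure appears to be simply the sum of the per-worker rates: a worker taking t ms per iteration contributes 1000/t iter/sec. This is an assumption inferred from the numbers above, not something checked against the Prime95 source, and the printed timings are rounded to 0.01 ms, so the result only matches to within about 0.1%. A quick check in Python:

```python
def throughput_iter_per_sec(ms_per_iter):
    """Sum the per-worker rates; a worker needing t ms per iteration does 1000/t iter/sec."""
    return sum(1000.0 / t for t in ms_per_iter)


# Per-worker timings copied from the second 2048K line above (Pass1=2048, clm=1).
four_workers = [2.15, 2.16, 2.14, 2.14]
reported = 1863.96

estimate = throughput_iter_per_sec(four_workers)
print(f"estimated {estimate:.2f} iter/sec vs reported {reported:.2f} iter/sec")
```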
With any larger FFTs, however, 4-worker throughput begins to degrade compared to 2 workers (I did not run the extended benchmark in this case, just used the defaults):

Code:
Timings for 2240K FFT length (16 cores, 2 workers):  1.43,  1.43 ms.  Throughput: 1399.54 iter/sec.
Timings for 2240K FFT length (16 cores, 4 workers):  3.39,  3.41,  3.22,  3.21 ms.  Throughput: 1209.72 iter/sec.
Timings for 2304K FFT length (16 cores, 2 workers):  1.43,  1.43 ms.  Throughput: 1397.51 iter/sec.
Timings for 2304K FFT length (16 cores, 4 workers):  4.04,  4.00,  3.66,  3.65 ms.  Throughput: 1044.27 iter/sec.
Timings for 2400K FFT length (16 cores, 2 workers):  1.53,  1.54 ms.  Throughput: 1300.87 iter/sec.
Timings for 2400K FFT length (16 cores, 4 workers):  4.52,  4.46,  4.66,  4.74 ms.  Throughput: 870.69 iter/sec.

With significantly larger FFTs, 4-worker throughput drops to less than half that of 2 workers:

Code:
Timings for 3072K FFT length (16 cores, 2 workers):  1.88,  1.91 ms.  Throughput: 1056.56 iter/sec.
Timings for 3072K FFT length (16 cores, 4 workers):  7.88,  7.97,  7.80,  7.79 ms.  Throughput: 508.94 iter/sec.
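
These numbers look consistent with the L3 split mentioned earlier in the thread. A rough back-of-the-envelope sketch in Python, under loudly stated assumptions: each FFT element is one 8-byte double, auxiliary tables and padding are ignored, the 3950X has 4 CCXs with 16 MiB of L3 each, and workers are spread evenly across CCXs so that a worker's usable L3 is the total divided by the worker count. None of this is taken from the Prime95 source.

```python
BYTES_PER_ELEMENT = 8   # assumption: one double-precision value per FFT element
L3_PER_CCX_MIB = 16     # per-CCX L3 size, as stated earlier in the thread
CCX_COUNT = 4           # assumption: 16 cores / 4 cores per CCX


def fft_footprint_mib(fft_len_k):
    """Approximate size of the FFT data for one worker, in MiB."""
    return fft_len_k * 1024 * BYTES_PER_ELEMENT / 2**20


def l3_per_worker_mib(workers):
    """L3 available to each worker if workers are spread evenly over the CCXs."""
    return L3_PER_CCX_MIB * CCX_COUNT / workers


for fft_len_k in (2048, 2240, 2304, 2400, 3072):
    need = fft_footprint_mib(fft_len_k)
    for workers in (2, 4):
        have = l3_per_worker_mib(workers)
        verdict = "fits in" if need <= have else "exceeds"
        print(f"{fft_len_k}K FFT, {workers} workers: {need:.1f} MiB {verdict} {have:.0f} MiB of L3")
```

Under these assumptions a 2048K FFT is the largest that still fits in a single CCX's L3, which would match 4 workers winning at 2048K and falling behind at every larger size. It is only a plausibility check, so the benchmark remains the real answer.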

Incidentally, do you happen to know how exactly I can use the extended benchmark results (i.e. the Type, Arch, Pass1, Pass2, clm values)? Can I specify them in a config somewhere to override the built-in values for my CPU?

Last fiddled with by intelfx on 2020-07-18 at 12:42

2020-07-18, 16:49
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

157458 Posts

Quote:
 Originally Posted by intelfx Incidentally, do you happen to know how exactly I can use the extended benchmark results (i.e. the Type, Arch, Pass1, Pass2, clm values)? Can I specify them in a config somewhere to override the built-in values for my CPU?
Well, it will happen automatically over time. Every night, prime95 will do a quick benchmark of all the different FFT implementations for the work you'll be doing in the near future and write the results to gwnum.txt. Prime95 uses that file to pick the fastest FFT implementation for your machine. Some of that info is buried in undoc.txt, but it is pretty terse.

