2020-06-30, 03:08  #1
Jun 2020
10_{8} Posts 
“Odd” P95 memory benchmark results?
I recently ran the P95 benchmark on a system I’m building. While looking at the results (see attached screenshot) I noticed something that seems odd to me, as a person with zero P95 knowledge or experience.
The first “oddity” is that hyperthreaded throughput is on average lower than non-hyperthreaded throughput (a 16% average drop in the single-worker case). The second “oddity” is that a single worker always seems to have the best throughput; the figures start dropping once the number of workers increases. That leaves me scratching my head because, lacking the knowledge, I assume that a) hyperthreading should result in higher overall throughput, not lower, and b) more workers should result in more iterations per second, not fewer. So I need help, please, answering these questions:
Am I interpreting the figures correctly? When P95 says throughput is xyz, that is the total throughput, not per thread or per worker, correct?
Am I correct in assuming that hyperthreaded figures should be higher than non-hyperthreaded ones, not lower?
Am I correct in assuming that more workers should have resulted in higher throughput, not lower?
In other words: does this seem odd to you too, or does something seem wrong?
2020-06-30, 04:52  #2
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts
9478_{10} Posts 
Prime95 is so efficiently written that the normal gains a program sees from hyperthreading don't happen. In fact, hyperthreading interferes with it.
The throughput is the total potential throughput (how much total work gets done). So each core doing its own task will get the most work done. Putting multiple cores onto a single task will get that one task done faster, but the total amount of work done will be less. There may be issues with memory bandwidth if many cores are each trying to access a lot of memory, so the actual best-performing setup might be slightly different.
2020-06-30, 05:39  #3
Jun 2003
4900_{10} Posts 

2020-06-30, 08:44  #4
"Composite as Heck"
Oct 2017
3×263 Posts 
1) As mentioned, hyperthreading (generically called SMT) should normally be disabled for P95, as P95 is more efficient at occupying the core than SMT is. Hyperthreading allows two threads to queue up work simultaneously to increase occupancy, but there is overhead. A workload like P95 fully occupies the core without the cost of this thread-juggling overhead.
2) L3 cache is shared between cores, so more workers means less cache per worker. The less cache a worker has, the higher the chance that a piece of data is evicted from cache before it gets accessed again, in which case it has to be loaded from RAM again. P95 is normally memory bound, meaning throughput is limited by how much memory bandwidth you have. The more bandwidth that is consumed transferring duplicate data, the less is available for unique data, making the memory bottleneck even worse and resulting in lower throughput.
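To put rough numbers on point 2, here is a minimal sketch of the "less cache per worker" effect. It rests on several assumptions that are not from this thread: that an "NK" FFT holds N × 1024 double-precision values of 8 bytes each, that the shared L3 is split evenly among workers, and that the L3 is 20 MiB (a made-up size). The function names are hypothetical, and the model ignores every other buffer the program uses, so treat it as an illustration only.

```python
MIB = 1024 * 1024

def fft_working_set_bytes(fft_len_k, bytes_per_element=8):
    # Assumption: an "NK" FFT holds N * 1024 double-precision values.
    return fft_len_k * 1024 * bytes_per_element

def fits_in_l3_share(fft_len_k, l3_bytes, workers):
    # Crude model: the shared L3 is divided evenly among the workers.
    return fft_working_set_bytes(fft_len_k) <= l3_bytes / workers

L3_BYTES = 20 * MIB  # hypothetical cache size, not taken from the thread

for workers in (1, 2, 4):
    share_mib = L3_BYTES / workers / MIB
    fits = fits_in_l3_share(2240, L3_BYTES, workers)
    print(f"{workers} worker(s): {share_mib:.1f} MiB each, 2240K FFT fits: {fits}")
```

Under these assumptions a 2240K FFT needs about 17.5 MiB, so it fits in the cache with one worker but not once the cache is shared between two or more. Real behaviour is messier than this, which is why benchmarking is the only reliable guide.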
2020-06-30, 12:45  #5
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts
10010100000110_{2} Posts 

2020-06-30, 13:04  #6
Undefined
"The unspeakable one"
Jun 2006
My evil lair
13741_{8} Posts 
4EvrYng: Try more options.
You tested 10 x 1 and 1 x 10. Also test 5 x 2 and 2 x 5, plus other splits like 3,3,4 and 2,2,3,3, and even options that don't use all the cores: 4,4 and 3,3,3 and 2,2,2, etc. See which of those gives you the best outcome and use it. But I don't understand why you didn't try 20 x 1 and 1 x 20 when you had SMT enabled. Your 10 cores should logically (not physically) become 20 cores with SMT on.
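The list of splits worth benchmarking can be generated instead of written out by hand. The sketch below enumerates every way to divide a number of cores among workers (the integer partitions of that number). The function name and its parameters are made up for illustration and have nothing to do with Prime95's own options.

```python
def worker_splits(cores, max_workers=None, min_threads=1):
    """Yield each way to divide `cores` among workers as a non-increasing tuple."""
    def parts(remaining, largest):
        if remaining == 0:
            yield ()
            return
        # Choose the next worker's thread count, never larger than the previous one.
        for first in range(min(remaining, largest), min_threads - 1, -1):
            for rest in parts(remaining - first, first):
                yield (first,) + rest

    for split in parts(cores, cores):
        if max_workers is None or len(split) <= max_workers:
            yield split

# Splits of 10 cores into at most 4 workers with at least 2 threads each.
for split in worker_splits(10, max_workers=4, min_threads=2):
    print(split)
```

To cover options that leave some cores idle, such as 4,4 or 3,3,3, call it again with a smaller `cores` value.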
2020-06-30, 13:24  #7
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1396_{16} Posts 
Quote:
Timings for 2240K FFT length (4 cores, 1 worker): 7.14 ms. Throughput: 140.00 iter/sec.
Timings for 2240K FFT length (4 cores, 2 workers): 9.80, 12.34 ms. Throughput: 183.03 iter/sec.
Timings for 2240K FFT length (4 cores, 4 workers): 24.74, 20.54, 18.41, 22.06 ms. Throughput: 188.76 iter/sec.
[Fri May 29 22:33:19 2020]
Timings for 2240K FFT length (4 cores hyperthreaded, 1 worker): 5.95 ms. Throughput: 168.09 iter/sec.
Timings for 2240K FFT length (4 cores hyperthreaded, 2 workers): 11.51, 11.72 ms. Throughput: 172.17 iter/sec.
Timings for 2240K FFT length (4 cores hyperthreaded, 4 workers): 46.42, 20.56, 17.51, 17.54 ms. Throughput: 184.32 iter/sec.
Timings for 11520K FFT length (4 cores, 1 worker): 25.56 ms. Throughput: 39.12 iter/sec.
Timings for 11520K FFT length (4 cores, 2 workers): 47.17, 46.74 ms. Throughput: 42.59 iter/sec.
Timings for 11520K FFT length (4 cores, 4 workers): 99.19, 98.26, 95.55, 95.97 ms. Throughput: 41.14 iter/sec.
Timings for 11520K FFT length (4 cores hyperthreaded, 1 worker): 29.73 ms. Throughput: 33.64 iter/sec.
Timings for 11520K FFT length (4 cores hyperthreaded, 2 workers): 58.22, 57.55 ms. Throughput: 34.55 iter/sec.
Timings for 11520K FFT length (4 cores hyperthreaded, 4 workers): 118.11, 115.52, 115.83, 114.87 ms. Throughput: 34.46 iter/sec.
Timings for 65536K FFT length (4 cores, 1 worker): 160.64 ms. Throughput: 6.23 iter/sec.
Timings for 65536K FFT length (4 cores, 2 workers): 340.03, 339.46 ms. Throughput: 5.89 iter/sec.
Timings for 65536K FFT length (4 cores, 4 workers): 696.01, 689.13, 692.94, 688.94 ms. Throughput: 5.78 iter/sec.
Timings for 65536K FFT length (4 cores hyperthreaded, 1 worker): 251.43 ms. Throughput: 3.98 iter/sec.
Timings for 65536K FFT length (4 cores hyperthreaded, 2 workers): 529.32, 524.69 ms. Throughput: 3.80 iter/sec.
Timings for 65536K FFT length (4 cores hyperthreaded, 4 workers): 1086.87, 1063.50, 1072.06, 1054.67 ms. Throughput: 3.74 iter/sec.
See "Effect of number of workers" and "Effect of number of workers (continued)" for extensive benchmark runs on several CPU models, analyzed and graphed.
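The throughput figures in a log like the one quoted in this post can be checked by hand: a worker that takes t ms per iteration completes 1000/t iterations per second, and the reported throughput appears to be the sum of that over all workers. Below is a minimal sketch in Python, assuming that is how the total is formed. The function names are made up for illustration and are not part of Prime95.

```python
def throughput(timings_ms):
    """Total iterations per second across workers, from per-iteration times in ms."""
    return sum(1000.0 / t for t in timings_ms)

def percent_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0

# Per-worker timings copied from the quoted 2240K lines (4 cores, no hyperthreading).
runs = {
    "1 worker": [7.14],
    "2 workers": [9.80, 12.34],
    "4 workers": [24.74, 20.54, 18.41, 22.06],
}
for label, timings in runs.items():
    print(f"{label}: {throughput(timings):.2f} iter/sec")

# Hyperthreaded vs. non-hyperthreaded throughput for 11520K with 1 worker, from the same log.
print(f"11520K, 1 worker, HT vs. no HT: {percent_change(39.12, 33.64):+.1f}%")
```

The sums land within rounding distance of the logged values (the timings in the log are themselves rounded), which supports reading "Throughput" as a total across all workers rather than a per-worker figure.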

2020-06-30, 23:23  #8
Jun 2020
2^{3} Posts 

2020-06-30, 23:37  #9
Jun 2020
2^{3} Posts 


2020-06-30, 23:58  #10
Jun 2020
2^{3} Posts 
I did. For the FFTs I tested, the figures on my machine would start dropping the moment there was more than one worker. I just didn't show all the figures in the spreadsheet I posted, in order to keep it small.
P95 offers 10 as the default, and when I enter more than 10 it does nothing, so my interpretation is that it asks how many cores you want tested, and the 'test hyperthreading' checkbox controls whether the "ht" ones are tested too.
2020-07-01, 00:26  #11
Jun 2020
10_{8} Posts 

