mersenneforum.org Benchmarks Vs Reality

2016-02-09, 16:45   #1
Fred

"Ron"
Jan 2016
Fitchburg, MA

97₁₀ Posts

Benchmarks Vs Reality

Since I'm starting to run Prime95 on multiple systems, I just want to be sure I'm understanding my observations. I've seen Madpoo note that many users are using the default of testing 4 exponents simultaneously on 4 workers, when sometimes testing 1 exponent using 4 workers/cpus would be more efficient.

On 2 systems, I first ran benchmarks, looking particularly at the Total Throughput iter/sec for the 4096K FFT section with 4 cpus on 1 worker vs 4 cpus on 2 workers and 4 cpus on 4 workers. With every test (on both systems), I was getting slightly higher numbers (more iterations per second) with 4 cpus on 4 workers. So I assumed, then, that 4 on 4 would give me the best results.

I then fired up real-life first-time LL testing (exponents in the ~76M area) on both systems using 4 cpus on 4 exponents, let it run for a while, then averaged out what I was seeing for ms/iter. Then I repeated the real-life number crunching using 4 cpus on 1 exponent. On both computers, I was seeing that the ms/iter were about 8% smaller (faster) using 4 cpus on 1 worker. For example, on one system my ms/iter for 4 cpus on 4 workers averaged 6.75, but my ms/iter for 4 cpus on 1 worker averaged 6.25.

Does it seem I'm understanding all of this correctly, and that in reality on both systems I would want to use 4 cpus on 1 exponent, since the ms/iter were lower (faster)?

Part of my confusion is not understanding the 4096K FFT part. From what I see on the cpu benchmark page, I'm assuming this indicates testing in the exponent range currently being issued for first-time LL tests.

Last fiddled with by Fred on 2016-02-09 at 16:53
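If we read Fred's two averages as effective aggregate ms/iter (1000 divided by the total iterations per second across all four cpus, which is my assumption here), the ~8% figure falls out directly. A quick sketch:

```python
# Fred's two averages, treated as effective aggregate ms/iter
# (1000 / total iterations-per-second across all four cpus):
ms_4workers = 6.75   # 4 cpus on 4 workers
ms_1worker = 6.25    # 4 cpus on 1 worker

# Throughput is inversely proportional to ms/iter, so the gain is the ratio:
gain = ms_4workers / ms_1worker - 1.0
print(f"{gain:.1%}")  # -> 8.0%
```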
2016-02-09, 16:55   #2
chalsall
If I May

"Chris Halsall"
Sep 2002

22272₈ Posts

Quote:
 Originally Posted by Fred Part of my confusion is not understanding the 4096K FFT part. From what I see on the cpu benchmark page, I'm assuming this indicates testing in the exponent range currently being issued for first time LL tests.
What you are doing is exactly correct: run empirical tests. I have found (at least under Linux) the benchmarking doesn't correlate that strongly with actual run performance.

Another thing to look into (if you have the time and inclination) is the Affinity2 settings. Core affinity is critical for optimal throughput.

Lastly, even if four cores on one DC/LL is slightly net slower than four cores on four different DC/LL tests, some want to process candidates quickly. That decision is entirely up to the owner / manager of the machine.

Edit: Just after posting I realized I hadn't actually answered your question... Yes, the FFT size is a function of the candidate being tested. And, optimal threads per worker can change based on the FFT size because of memory and cache bandwidth.

Last fiddled with by chalsall on 2016-02-09 at 16:57

2016-02-09, 17:18   #3
Fred

"Ron"
Jan 2016
Fitchburg, MA

97 Posts

Quote:
 Originally Posted by chalsall What you are doing is exactly correct: run empirical tests. I have found (at least under Linux) the benchmarking doesn't correlate that strongly with actual run performance. Another thing to look into (if you have the time and inclination) is the Affinity2 settings. Core affinity is critical for optimal throughput. Lastly, even if four cores on one DC/LL is slightly net slower than four cores on four different DC/LL tests, some want to process candidates quickly. That decision is entirely up to the owner / manager of the machine. Edit: Just after posting I realized I hadn't actually answered your question... Yes, the FFT size is a function of the candidate being tested. And, optimal threads per worker can change based on the FFT size because of memory and cache bandwidth.
Perfect! Thanks so much for that thoughtful reply. Exactly what I was looking for. I'm getting dangerously addicted to the hunt (more hardware on its way - shhhh, don't tell the wife), so I'll definitely research the Affinity2 settings you noted (I know nothing about that yet).

2016-02-09, 19:48   #4
Fred

"Ron"
Jan 2016
Fitchburg, MA

97 Posts

Hmmmm... actually, I wonder if someone can give me a jumpstart on the whole Affinity thing. On my 4-core i5s, I currently have the CPU Affinity set to "Run on any CPU". The other options on the list basically just let me specify cpu 1, 2, 3, or 4. I tried specifying a cpu (I tried all four individually), and the performance always seemed a little worse than if I just left it on "Run on any CPU". Are there other areas (such as in a config file) where I can tinker with affinity to fine-tune best performance? Or is that about it, and I should just leave it on "Run on any CPU"?
2016-02-09, 20:09   #5
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7301₁₀ Posts

Can you try the 4096K FFT benchmark with this setting in prime.txt:

Code:
BenchTime=120
I'm curious if the discrepancy is due to the short 10-second benchmark.
2016-02-09, 20:15   #6
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1C85₁₆ Posts

Also, in your real-world test, try this for both cases:

1. Start prime95, noting which iteration each worker starts on.
2. Run for one (or more) hours of wall-clock time.
3. Stop prime95 and note which iteration each worker stops on.
4. Compute the number of iterations processed per second (you'll need an accurate stopwatch!).

There is a chance of an error in the code that calculates the ms/iter.
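The final computation is just iterations advanced divided by wall-clock seconds. A tiny sketch (the iteration counts below are made up for illustration):

```python
def iters_per_second(start_iter, end_iter, elapsed_sec):
    """Iterations a worker processed per wall-clock second."""
    return (end_iter - start_iter) / elapsed_sec

# Hypothetical one-hour run where a worker advanced 566,000 iterations:
print(round(iters_per_second(1_000_000, 1_566_000, 3600), 2))  # -> 157.22
```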
2016-02-09, 20:17   #7
chalsall
If I May

"Chris Halsall"
Sep 2002

2×3×1,567 Posts

Quote:
 Originally Posted by Fred Hmmmm... actually, I wonder if someone can give me a jumpstart on the whole Affinity thing.
This is where things get "down and dirty"... And I can only speak from a Linux perspective.

You might have to hand-edit the Prime95 text configuration files (prime.txt, local.txt) for optimal affinity configuration. Perhaps Aaron (Madpoo) et al can speak about what they found to be optimal under Windows.

One quick-and-dirty thing you might try is to disable HyperThreading in your BIOS. What you _definitely_ don't want is two compute threads running on the same CPU core's hyperthreads. It doesn't matter which of the two hyperthreads is used (they're symmetrical), but because Prime95 is so heavily optimized, putting two processing threads on two competing hyperthreads slows everything down.

Another thing to look at is the Windows CPU usage monitor. When HyperThreading is enabled you want to see CPU #1 at 100% usage while #2 is at 0%, then #3 at 100% and #4 at 0%, etc.

I hope that makes sense.

2016-02-09, 22:16   #8
Fred

"Ron"
Jan 2016
Fitchburg, MA

97₁₀ Posts

Quote:
 Originally Posted by Prime95 Can you try the 4096K FFT benchmark with this setting in prime.txt: BenchTime=120 I'm curious if the discrepancy is due to the short 10 second benchmark.
Ran with the default benchtime (10 sec?) as well as with BenchTime=120 as you suggested. Results below. Although the results were slightly different, they seemed roughly the same to me. I definitely didn't see the ~8% increased throughput with 4 cpus, 1 worker that I'm seeing in real life LL testing. I'm going to run the other test you suggested (timing iterations over the course of an hour of wall clock time) and report back when that's complete.

Default Bench Time
Quote:
Timings for 4096K FFT length (1 cpu, 1 worker): 14.27 ms. Throughput: 70.06 iter/sec.
Timings for 4096K FFT length (2 cpus, 1 worker): 8.00 ms. Throughput: 125.01 iter/sec.
Timings for 4096K FFT length (2 cpus, 2 workers): 15.60, 15.56 ms. Throughput: 128.39 iter/sec.
Timings for 4096K FFT length (3 cpus, 1 worker): 6.57 ms. Throughput: 152.10 iter/sec.
Timings for 4096K FFT length (3 cpus, 3 workers): 19.39, 19.49, 19.50 ms. Throughput: 154.17 iter/sec.
Timings for 4096K FFT length (4 cpus, 1 worker): 6.33 ms. Throughput: 158.01 iter/sec.
Timings for 4096K FFT length (4 cpus, 2 workers): 12.69, 12.56 ms. Throughput: 158.43 iter/sec.
Timings for 4096K FFT length (4 cpus, 4 workers): 25.10, 25.72, 25.23, 24.63 ms. Throughput: 158.96 iter/sec.
BenchTime=120
Quote:
Timings for 4096K FFT length (1 cpu, 1 worker): 14.23 ms. Throughput: 70.29 iter/sec.
Timings for 4096K FFT length (2 cpus, 1 worker): 8.04 ms. Throughput: 124.43 iter/sec.
[Wed Feb 10 04:59:54 2016]
Timings for 4096K FFT length (2 cpus, 2 workers): 16.03, 16.13 ms. Throughput: 124.38 iter/sec.
Timings for 4096K FFT length (3 cpus, 1 worker): 6.64 ms. Throughput: 150.67 iter/sec.
Timings for 4096K FFT length (3 cpus, 3 workers): 19.69, 19.19, 19.35 ms. Throughput: 154.58 iter/sec.
[Wed Feb 10 05:05:57 2016]
Timings for 4096K FFT length (4 cpus, 1 worker): 6.33 ms. Throughput: 158.03 iter/sec.
Timings for 4096K FFT length (4 cpus, 2 workers): 12.79, 12.68 ms. Throughput: 157.03 iter/sec.
Timings for 4096K FFT length (4 cpus, 4 workers): 24.41, 25.09, 27.01, 25.24 ms. Throughput: 157.45 iter/sec.
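As a sanity check, the reported total throughput is just the sum over workers of 1000/(ms/iter). A quick sketch using the default-BenchTime figures above:

```python
def total_throughput(ms_per_iter):
    """Total iter/sec: each worker contributes 1000 / (its ms/iter)."""
    return sum(1000.0 / ms for ms in ms_per_iter)

print(round(total_throughput([6.33]), 2))                        # 4 cpus, 1 worker -> 157.98
print(round(total_throughput([25.10, 25.72, 25.23, 24.63]), 2))  # 4 cpus, 4 workers -> 158.96
```

Both values agree with the lines Prime95 printed, so the benchmark's per-worker timings and its throughput figure are internally consistent.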

Last fiddled with by Fred on 2016-02-09 at 22:17

2016-02-09, 23:35   #9
bgbeuning

Dec 2014

2²×3²×7 Posts

Quote:
 Originally Posted by Fred Part of my confusion is not understanding the 4096K FFT part.
FFT stands for Fast Fourier Transform. Most of the time in the LL test
is spent multiplying big numbers together, and an FFT is the fastest
practical way to do those huge multiplications.

The 4096K is how many double-precision floating point numbers are
used in the FFT. A double has a 52-bit mantissa and prime95 can
use about 20 of those bits, so 20 × 4096K is roughly the largest
exponent that fits. The rest of the bits in the mantissa are there
to protect against round-off errors.

(Full disclosure - lots of assuming in the above.)
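bgbeuning's rule of thumb is easy to check. A sketch (the 20-bits-per-word figure is approximate, as he notes; prime95's exact per-FFT limit varies):

```python
fft_length = 4096 * 1024   # 4096K doubles in the FFT
bits_per_word = 20         # rough average of usable bits per double

# Largest exponent the FFT can roughly handle:
print(fft_length * bits_per_word)  # -> 83886080, i.e. ~83.9M

# So the ~76M first-time-LL exponents mentioned above fit in a 4096K FFT.
```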

2016-02-10, 00:05   #10
Serpentine Vermin Jar

Jul 2014

7²×67 Posts

Quote:
 Originally Posted by chalsall You might have to hand-edit the Prime95 text configuration files (prime.txt, local.txt) for optimal affinity configuration. Perhaps Aaron (Madpoo) et al can speak about what they found to be optimal under Windows.
That's the approach I took, to avoid any possible chance that Prime95 might not be able to figure out the affinity itself using its timing method.

In essence, Prime95 will attempt to automatically figure out which 2 "cores" are really a physical/hyperthread pair.

It does this by running a calculation on two cpus at a time, and noting during which test the pair ran at half speed. Those two must share a physical core.

However, from time to time the test fails, maybe because something else was using a lot of CPU and threw off the timings. At best it can't figure it out and just falls back to the Windows default of 0,1 being a pair, 2,3 being the next pair, etc. (Linux is different). At worst, the timing may be thrown off to the point where it thinks the wrong cpus are pairs... I guess.

Here's what I do... I modified my own settings to mimic a machine with 4 physical cores, to fit the original question:

prime.txt changes:
Code:
add:
DebugAffinityScramble=2
(disables the auto affinity method and won't show any info when starting up)

local.txt changes:
Code:
add:
AffinityScramble2=02461357

Under [Worker #1] add:
Affinity=0
The scramble line is a map of the cores... as mentioned, in Windows the pairs of physical/virtual are 0,1 / 2,3 / 4,5 / 6,7.

So you're giving it a map saying cores 0,2,4,6 are the first 4 that Prime95 should look at, and cores 1,3,5,7 are the last 4 (and are basically ignored, since only a max of 4 cores is used in total).

As for the number of workers and the threads per test, just change those to suit your needs. They can be changed in the GUI as well, but I just do it in the file now that I know what they are. Just make sure the number of workers times the number of threads per test equals the number of physical cores.

e.g. 1 worker and 4 threads, or 2 workers w/ 2 threads each, etc.

You can actually mix and match the # of threads in each worker by moving that "ThreadsPerTest" line under the individual [Worker #x] entries. Worker #1 could have 3 threads and Worker #2 could have 1, if you like.

Finally, under the [Worker #1] section, the "Affinity=0" line says "the first core to use is core #0" (they start at zero).

If you wanted to have two workers with 2 threads each, the second worker would have "Affinity=2" (so that cores 0 and 1 are on worker one, cores 2 and 3 are on worker #2).
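Putting the snippets together, a local.txt for the two-workers-with-two-threads-each layout might look something like this (a sketch assembled from the lines above; exact option placement can vary, and prime.txt still gets the DebugAffinityScramble=2 line):

```ini
AffinityScramble2=02461357
ThreadsPerTest=2

[Worker #1]
Affinity=0

[Worker #2]
Affinity=2
```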

Where it got confusing for me was that the AffinityScramble2 setting is *re-defining* the core numbering. So while Windows thinks cores 0 and 1 are a physical/HT pair, by mapping the physical cores up front, as far as Prime95 is concerned cores 0 and 1 are now actually the first two physical cores.

I guess try not to overthink it...maybe that's what I did.

It'd be nice if there were a simple "don't use HT cores" button or option and the rest of that was taken care of by the program, or if it used something besides the timing method (there are OS calls available) to figure out which two cores are really part of the same physical thing.
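For what it's worth, on Linux one of those OS interfaces is exposed directly in sysfs, so no timing is needed. A minimal sketch (Linux-only; it returns an empty map on other systems):

```python
from collections import defaultdict
from pathlib import Path

def ht_sibling_groups():
    """Group Linux CPU numbers by the kernel's thread_siblings_list,
    so each group is one physical core's set of hyperthreads."""
    groups = defaultdict(list)
    base = Path("/sys/devices/system/cpu")
    for path in base.glob("cpu[0-9]*/topology/thread_siblings_list"):
        cpu_num = int(path.parts[-3][3:])          # ".../cpuN/topology/..." -> N
        groups[path.read_text().strip()].append(cpu_num)
    return {key: sorted(cpus) for key, cpus in groups.items()}

if __name__ == "__main__":
    for siblings in ht_sibling_groups().values():
        print(siblings)   # e.g. [0, 4] and [1, 5] on an HT machine
```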

In Windows, you can bring up Task Manager and set the CPU graph to show one graph per CPU instead of the default overall thing. Then when Prime95 is running you should see that you have alternating cores doing 100% (the physical cores) or nearly 0% (the HT / virtual cores), and an overall usage of about 50%.

The SysInternals tool "CoreInfo" is a simple command line thing that shows core mappings and all kinds of other stuff for Windows.

If you have more questions, ask away and I'll try to fill in some gaps.

2016-02-10, 00:42   #11
Fred

"Ron"
Jan 2016
Fitchburg, MA

97 Posts

Quote:
 Originally Posted by Prime95 Also, in your real world test, try this for both cases: Start prime95 noting which iteration each worker starts on. Run for one (or more) hours of wall clock time. Stop prime95 and note which iteration each worker stops on. Then compute the number of iterations processed per second (you'll need an accurate stopwatch!). There is a chance of an error in the code that calculates the ms/iter.
Ok, this test was really interesting. Here are the results. Note that for the second test below I subtracted 5, 10, and 15 seconds in the math for the 2nd, 3rd, and 4th workers respectively, because although it was a 1-hour test, those workers have delayed starts.

One hour test with 4 cpus on 1 worker:
Quote:
Start Iteration: 9212111
End Iteration: 9778639
9778639 - 9212111 = 566528 iterations / 3600 sec = 157.369 iter/sec
One hour test with 4 cpus on 4 workers:
Quote:
Worker1 Start Iteration: 9778961
Worker1 End Iteration: 9922509
9922509 - 9778961 = 143548 iterations / 3600 sec = 39.874 iter/sec
Worker2 Start Iteration: 5388521
Worker2 End Iteration: 5513851
5513851 - 5388521 = 125330 iterations / 3595 sec = 34.862 iter/sec
Worker3 Start Iteration: 5431745
Worker3 End Iteration: 5561479
5561479 - 5431745 = 129734 iterations / 3590 sec = 36.138 iter/sec
Worker4 Start Iteration: 5396053
Worker4 End Iteration: 5525589
5525589 - 5396053 = 129536 iterations / 3585 sec = 36.133 iter/sec
39.874 + 34.862 + 36.138 + 36.133 = 147.007 iter/sec
Conclusion: 147.007 -> 157.369 = 7.0% increased performance running as 4 cpus on 1 worker instead of 4 cpus on 4 workers. Obviously everyone's system is different, but I suspect I'm going to see similar results on my other i5 Skylake systems.
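Re-running the arithmetic straight from the raw iteration counts and elapsed times quoted above (note that worker #1's per-second figure works out to 143548/3600 ≈ 39.874, which puts the overall gain at roughly 7%):

```python
# 4 cpus on 1 worker: iterations advanced over one hour.
one_worker = 566_528 / 3600

# 4 cpus on 4 workers: each worker's iterations over its own elapsed time
# (the later workers started 5/10/15 seconds after the first).
four_workers = (143_548 / 3600 + 125_330 / 3595
                + 129_734 / 3590 + 129_536 / 3585)

print(round(one_worker, 3), round(four_workers, 3))
print(f"{one_worker / four_workers - 1:.1%}")   # ~7% faster on 1 worker
```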

What I find particularly interesting is that, looking at the benchmarks below, the benchmark was extremely accurate in what it reported for 4 cpus on 1 worker. The discrepancy between the "real" testing results and the benchmark was in the 4 cpus on 4 workers case.

Quote:
Timings for 4096K FFT length (4 cpus, 1 worker): 6.33 ms. Throughput: 158.03 iter/sec.
Timings for 4096K FFT length (4 cpus, 4 workers): 24.41, 25.09, 27.01, 25.24 ms. Throughput: 157.45 iter/sec.

Last fiddled with by Fred on 2016-02-10 at 01:24
