#1 |
"Ron"
Jan 2016
Fitchburg, MA
97 Posts |
Since I'm starting to run Prime95 on multiple systems, I just want to be sure I'm understanding my observations. I've seen Madpoo note that many users stick with the default of testing 4 exponents simultaneously on 4 workers, when sometimes testing 1 exponent with all 4 cpus on 1 worker would be more efficient.

On 2 systems, I first ran benchmarks, looking particularly at the Total Throughput iter/sec for the 4096K FFT section with 4 cpus on 1 worker vs 4 cpus on 2 workers and 4 cpus on 4 workers. With every test (on both systems), I was getting slightly higher numbers (more iterations per second) with 4 cpus on 4 workers. So I assumed that 4 on 4 would give me the best results.

I then fired up real life first time LL testing (exponents in the ~76M area) on both systems using 4 cpus on 4 exponents, let it run for a while, then averaged out what I was seeing for ms/iter. Then I repeated the real life number crunching using 4 cpus on 1 exponent. On both computers, the ms/iter were about 8% smaller (faster) using 4 cpus on 1 worker. For example, on one system my ms/iter for 4 cpus on 4 workers averaged 6.75, but my ms/iter for 4 cpus on 1 worker averaged 6.25.

Does it seem I'm understanding all of this correctly, and in reality on both systems I would want to use 4 cpus on 1 exponent, since the ms/iter were lower (faster)?

Part of my confusion is not understanding the 4096K FFT part. From what I see on the cpu benchmark page, I'm assuming this indicates testing in the exponent range currently being issued for first time LL tests.

Last fiddled with by Fred on 2016-02-09 at 16:53 |
#2 | |
If I May
"Chris Halsall"
Sep 2002
Barbados
22272₈ Posts |
Quote:
Another thing to look into (if you have the time and inclination) is the Affinity2 settings. Core affinity is critical for optimal throughput.

Lastly, even if four cores on one DC/LL is slightly net slower than four cores on four different DC/LL tests, some want to process candidates quickly. That decision is entirely up to the owner / manager of the machine.

Edit: Just after posting I realized I hadn't actually answered your question... Yes, the FFT size is a function of the candidate being tested. And, optimal threads per worker can change based on the FFT size because of memory and cache bandwidth.

Last fiddled with by chalsall on 2016-02-09 at 16:57 |
#3 | |
"Ron"
Jan 2016
Fitchburg, MA
97 Posts |
Quote:
#4 |
"Ron"
Jan 2016
Fitchburg, MA
97 Posts |
Hmmmm... actually, I wonder if someone can give me a jumpstart on the whole Affinity thing. On my 4 core i5s, I currently have the CPU Affinity set to "Run on any CPU". The other options on the list basically just let me specify cpu 1, 2, 3, or 4. I tried specifying a cpu (I tried all four individually), and the performance always seemed a little worse than if I just left it on "Run on any CPU". Are there other areas (such as in a config file) where I can tinker with affinity to fine tune best performance? Or is that about it and I should just leave it on "Run on any CPU"?
#5 |
P90 years forever!
Aug 2002
Yeehaw, FL
7301 Posts |
Can you try the 4096K FFT benchmark with this setting in prime.txt:
BenchTime=120
I'm curious if the discrepancy is due to the short 10 second benchmark. |
#6 |
P90 years forever!
Aug 2002
Yeehaw, FL
1C85₁₆ Posts |
Also, in your real world test, try this for both cases: Start prime95 noting which iteration each worker starts on. Run for one (or more) hours of wall clock time. Stop prime95 and note which iteration each worker stops on. Then compute the number of iterations processed per second (you'll need an accurate stopwatch!).
There is a chance of an error in the code that calculates the ms/iter. |
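The stopwatch measurement described above comes down to simple arithmetic. Here is a minimal Python sketch of it; the function name and all iteration counts are made up for illustration and are not taken from Prime95 or from anyone's actual run.

```python
def iterations_per_second(start_iters, stop_iters, elapsed_seconds):
    """Total throughput across all workers, in iterations per second.

    start_iters / stop_iters hold one iteration number per worker, read
    from the worker windows when the stopwatch starts and stops.
    """
    if len(start_iters) != len(stop_iters):
        raise ValueError("need one start and one stop iteration per worker")
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    done = sum(stop - start for start, stop in zip(start_iters, stop_iters))
    return done / elapsed_seconds


# Hypothetical one-hour (3600 s) runs, numbers invented for the example.
one_worker = iterations_per_second([1_000_000], [1_576_000], 3600)
four_workers = iterations_per_second([100_000] * 4, [233_200] * 4, 3600)

print(f"4 cpus on 1 worker : {one_worker:.1f} iter/sec")
print(f"4 cpus on 4 workers: {four_workers:.1f} iter/sec")
```

Summing the iterations over all workers before dividing is what makes the two configurations comparable, since a per-worker ms/iter figure means something different when four workers are running at once.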
#7 | |
If I May
"Chris Halsall"
Sep 2002
Barbados
2×3×1,567 Posts |
Quote:
You might have to hand-edit the Prime95 text configuration files (prime.txt, local.txt) for optimal affinity configuration. Perhaps Aaron (Madpoo) et al can speak about what they found to be optimal under Windows.

One quick-and-dirty thing you might try is to disable HyperThreading in your BIOS. What you _definitely_ don't want is for two threads to be running on the same CPU's hyperthreads. It doesn't matter which of the two hyperthreads is used (they're symmetrical), but because Prime95 is so optimized, having two processing threads on two competing hyperthreads means everything slows down.

Another thing to look at is the Windows CPU usage monitor. When HyperThreading is enabled you want to see CPU#1 at 100% usage, while #2 is at 0%, then #3 at 100% and #4 at 0%, etc.

I hope that makes sense. |
#8 | |||
"Ron"
Jan 2016
Fitchburg, MA
97 Posts |
Quote:
Default Bench Time Quote:
Quote:
Last fiddled with by Fred on 2016-02-09 at 22:17 |
#9 |
Dec 2014
2²×3²×7 Posts |
FFT stands for Fast Fourier Transform. Most of the time in the LL test is spent multiplying big numbers together, and using an FFT is the fastest practical way to do those multiplications.

The 4096K is how many double precision floating point numbers are used in the FFT. A double has a 52-bit mantissa and prime95 can use about 20 of those bits, so 20 * 4096K gives the approximate exponent size. The rest of the bits in the mantissa are used to protect against round off errors.

(Full disclosure - lots of assuming in the above.) |
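The arithmetic in that estimate can be checked in a few lines of Python. This is only a sketch of the post's own rule of thumb: the figure of 20 usable bits per double is the poster's rough assumption, not a documented Prime95 limit, and the variable names are illustrative.

```python
FFT_LENGTH = 4096 * 1024        # "4096K" double precision values
BITS_PER_DOUBLE = 20            # rough estimate from the post (an assumption)

# Approximate largest exponent this FFT length could carry under that estimate.
max_exponent = FFT_LENGTH * BITS_PER_DOUBLE

# Bits each double would need to hold for an exponent near 76M.
bits_needed = 76_000_000 / FFT_LENGTH

print(f"4096K x 20 bits  = {max_exponent:,}")
print(f"76M exponent uses about {bits_needed:.2f} bits per double")
```

Under that assumption the 4096K FFT tops out near 84M, which is consistent with the ~76M first time LL exponents mentioned at the start of the thread falling in the 4096K range.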
#10 | |
Serpentine Vermin Jar
Jul 2014
7²×67 Posts |
Quote:
In essence, Prime95 will attempt to automatically figure out which 2 "cores" are really a physical/hyperthread pair. It does this by running a calculation on two cpus at a time and then figuring out during which test they ran at about half speed. Those two must be a pair.

However, from time to time the test fails, maybe because something else was using a lot of CPU and threw off the timings. At best it can't figure it out and just falls back to the Windows default of 0,1 being a pair, 2,3 being the next pair, etc. (Linux is different). At worst the timing may be thrown off to where it thinks the wrong cpus are pairs... I guess.

Here's what I do... I modified my own settings to mimic a system with 4 physical cores to fit the original question:

prime.txt changes:
Code:
add:
DebugAffinityScramble=2

local.txt changes:
Code:
add:
AffinityScramble2=02461357
WorkerThreads=1
ThreadsPerTest=4

Under [Worker #1] add:
Affinity=0

So you're giving it a map saying cores 0246 are the first 4 that Prime95 should look at, and cores 1357 are the last 4 (and are basically ignored by only using a max of 4 cores total).

The worker threads and threads per test, just change those to suit your needs. Those can be changed in the GUI as well, but I just do it in the file now that I know what they are. Just make sure the # of workers times the # of threads per test is equal to the # of physical cores, e.g. 1 worker and 4 threads, or 2 workers w/ 2 threads each, etc.

You can actually mix and match the # of threads in each worker by moving that "ThreadsPerTest" line under the individual [Worker #x] entries. Worker #1 could have 3 threads and Worker #2 could have 1, if you like.

Finally, under the [Worker #1] section, the "Affinity=0" line says "the first core to use is core #0" (they start at zero). If you wanted to have two workers with 2 threads each, the second worker would have "Affinity=2" (so that cores 0 and 1 are on worker #1, cores 2 and 3 are on worker #2).

Where it got confusing for me was that the AffinityScramble2 setting is *re-defining* the core numbering. So while Windows thinks cores 0 and 1 are a physical/HT pair, by mapping the physical cores up front, as far as Prime95 is concerned cores 0 and 1 on worker #1 are actually the first two physical cores now. I guess try not to overthink it... maybe that's what I did.

It'd be nice if there was a simple "don't use HT cores" button or option and then the rest of that was taken care of by the program, or if it used something besides the timing method (there are OS calls available) to figure out which two cores are really part of the same physical thing.

In Windows, you can bring up Task Manager and set the CPU graph to show one graph per CPU instead of the default overall thing. Then when Prime95 is running you should see that you have alternating cores doing 100% (the physical cores) or nearly 0% (the HT / virtual cores), and an overall usage of about 50%.

The SysInternals tool "CoreInfo" is a simple command line thing that shows core mappings and all kinds of other stuff for Windows.

If you have more questions, ask away and I'll try to fill in some gaps. |
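To make the renumbering easier to picture, here is a small Python sketch of the mapping as it is described in this post. It is only an illustration of that description, not Prime95's actual code: the function `worker_cpus` is hypothetical, it assumes each character of the scramble string is a single digit, and it assumes each worker takes the next block of renumbered cores in order.

```python
def worker_cpus(scramble, threads_per_worker, first_core=0):
    """Return, per worker, the OS CPU numbers it would run on.

    scramble: the AffinityScramble2 string, read as
              "Prime95 core i is really OS CPU scramble[i]".
    threads_per_worker: thread count for each worker, in worker order.
    first_core: the Affinity= value of the first worker.
    """
    order = [int(ch) for ch in scramble]      # single-digit CPUs only
    if sorted(order) != list(range(len(order))):
        raise ValueError("scramble must list every CPU exactly once")
    assignments = []
    core = first_core
    for n in threads_per_worker:
        if core + n > len(order):
            raise ValueError("more threads than CPUs in the scramble map")
        assignments.append(order[core:core + n])
        core += n                             # next worker's Affinity= value
    return assignments


one_by_four = worker_cpus("02461357", [4])     # 1 worker, 4 threads
two_by_two = worker_cpus("02461357", [2, 2])   # Affinity=0 and Affinity=2

print(one_by_four)   # [[0, 2, 4, 6]]
print(two_by_two)    # [[0, 2], [4, 6]]
```

With this reading, every thread lands on an even-numbered OS CPU, which matches the alternating 100% / 0% pattern described for Task Manager when hyperthreading is enabled.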
#11 | ||||
"Ron"
Jan 2016
Fitchburg, MA
97 Posts |
Quote:
One hour test with 4 cpus on 1 worker: Quote:
Quote:
What I find particularly interesting is that, looking at the benchmarks below, the benchmark was extremely accurate in what it reported for 4 cpus on 1 worker. The discrepancy between "real" testing results and the benchmark was in the 4 cpus on 4 workers benchmark. Quote:
Last fiddled with by Fred on 2016-02-10 at 01:24 |