mersenneforum.org  

Old 2015-10-26, 12:43   #12
Mark Rose
 

I'm considering building a Skylake 6700k system with similar speed memory, so a quad core dual channel benchmark would also interest me.
Old 2015-10-27, 04:06   #13
Madpoo

Quote:
Originally Posted by LaurV
It also seems curious to me that a single 8-core worker is, in total, more productive than 8 single-threaded workers. The single worker would still need time to put the pieces of the FFT together from the 8 cores.
When I've run across this same phenomenon, my guess was that running 8 workers with 1 core each is simply exhausting the capabilities of the memory subsystem. Doing 1 worker/8 cores keeps the memory subsystem from stressing out and it actually does fairly well.
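
(If anyone wants to sanity-check that on their own box, the math is simple enough. Here's a rough Python sketch of the tally I have in mind; the per-iteration times in it are made-up placeholders, not measurements from any of my machines.)

Code:
# Hypothetical per-iteration times (ms) -- placeholders only, plug in your own
# numbers from the worker windows.
ms_per_iter_8workers_1core = 45.0   # each of 8 single-core workers, all running at once
ms_per_iter_1worker_8cores = 7.0    # one worker spread across all 8 cores

# Aggregate throughput in iterations per second for each configuration.
throughput_8x1 = 8 * (1000.0 / ms_per_iter_8workers_1core)
throughput_1x8 = 1 * (1000.0 / ms_per_iter_1worker_8cores)

print(f"8 workers x 1 core : {throughput_8x1:6.1f} iter/s total")
print(f"1 worker  x 8 cores: {throughput_1x8:6.1f} iter/s total")
print("1x8 wins" if throughput_1x8 > throughput_8x1 else "8x1 wins")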

The larger the FFT size, the better it seems to be. I think George explained it as the way it parcels out the chunks of FMA work to the various cores... some cores might be idling while others are still busy, resulting in inefficiencies. Larger FFTs seem to have more possible "chunks", so they get distributed more evenly?

Don't know if my caveman understanding of it is close or not, but trust me: with a small FFT (let's say an exponent of 30M or less), if you have 8 cores in one worker, your total CPU usage will be something like 90-95%, not 100% like it would be with an exponent in the 60M range. Go figure.

The one case where a bunch of workers running at once still does well (for me) is when the exponents are below 38M.

I can set up a system with 8 cores total (dual 4-core CPUs), and as long as each worker is doing a <38M test, they'll play nicely. Once any of them gets a bit larger, it falls apart.

With that in mind it could have more to do with the L2/L3 cache sizes?

I'd say that the code that "chunks" the parts of FMA work out to different cores should:
a) do it evenly so each core has the same amount of work
b) maybe keep each work unit small enough so it takes full advantage of the L2/L3 cache size?
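
(On (b), here's a quick back-of-the-envelope sketch in Python. The L3 size is a made-up example and the FFT-length-to-exponent mapping is from memory, so treat the numbers as illustrative only: if I remember the crossover right, exponents up to roughly 38M-39M fit in a 2048K FFT, which is about 16 MB of doubles.)

Code:
# Rough check: does an FFT's main data array fit in the shared L3 cache?
# All sizes below are illustrative assumptions, not measurements.

def fft_working_set_mb(fft_len, bytes_per_element=8):
    """Approximate size of the main FFT array of doubles, in MB."""
    return fft_len * bytes_per_element / 2**20

l3_cache_mb = 20.0   # hypothetical shared L3 on an 8-core chip
for fft_len in (2_097_152, 3_145_728, 4_194_304):   # 2048K, 3072K, 4096K FFTs
    ws = fft_working_set_mb(fft_len)
    verdict = "fits in" if ws <= l3_cache_mb else "spills out of"
    print(f"{fft_len // 1024}K FFT: ~{ws:.0f} MB working set, {verdict} a {l3_cache_mb:.0f} MB L3")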
Old 2015-10-27, 08:28   #14
LaurV

Quote:
Originally Posted by Madpoo
When I've run across this same phenomenon, my guess was that running 8 workers with 1 core each is simply exhausting the capabilities of the memory subsystem. Doing 1 worker/8 cores keeps the memory subsystem from stressing out and it actually does fairly well.
Yes, this is the usual and reasonable explanation. However, it doesn't really apply to his setup, which shows the effect very early (at 2 cores, etc.), and it doesn't seem to be memory bound at all (see how the results keep going up, almost linearly, out to 8 workers).

OTOH, what you say makes sense. I've never played with the billions of cores you have on your hands; my experience is limited to 2 and 4 cores...

Last fiddled with by LaurV on 2015-10-27 at 08:30
Old 2015-10-27, 15:17   #15
Madpoo

Quote:
Originally Posted by LaurV
OTOH, what you say makes sense. I've never played with the billions of cores you have on your hands; my experience is limited to 2 and 4 cores...
You might be able to see what I assume is a similar effect by setting up a 2-core system with two workers, each one doing a 60M-70M sized test. With both running, see how the per-iteration times look, then stop one or the other and see if the running worker improves.

Then do the same, but with one worker running a 60M-70M test and the other doing a <38M test. What I suspect you'll see is that the larger test won't suffer the same performance hit when the other worker is handling a smaller exponent as it did when both were large.

If we could figure out for sure that this is really a thing, and not just my imagination, it would help to figure out why it happens and whether it can be improved. And if that's just the way things are, then maybe there's a way to "default" Prime95 in such a setup to pick a first-time test for one worker and smaller DCs for any others on the same system?

Eventually we'd run out of <38M double-checks... but as I've found, this same thing does NOT happen on my Xeon E5-26xx v3 with DDR4 memory, so something changed the equation there; I just don't know what.
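
(For anyone who tries the experiment, this is the comparison I'd make. The numbers in the sketch are placeholders, not real timings from my systems.)

Code:
# Record ms/iter for a worker when it runs alone and when the other worker is
# also running, then compare. Placeholder numbers only.

def slowdown_pct(solo_ms_per_iter, concurrent_ms_per_iter):
    """Percent slowdown of a worker caused by the other worker running."""
    return 100.0 * (concurrent_ms_per_iter / solo_ms_per_iter - 1.0)

# Scenario 1: both workers on 60M-70M exponents.
print(f"big + big  : worker A slows by {slowdown_pct(55.0, 70.0):.0f}%")
# Scenario 2: worker A on a 60M-70M exponent, worker B on a <38M exponent.
print(f"big + small: worker A slows by {slowdown_pct(55.0, 58.0):.0f}%")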
Old 2015-10-27, 16:04   #16
Madpoo

Quote:
Originally Posted by Madpoo
You might be able to see what I assume is a similar effect by setting up a 2-core system with two workers, each one doing a 60M-70M sized test....
Oh, I forgot to mention a couple other salient facts...

On a dual CPU system, I can run two workers, each using all the cores on its own socket, and as long as they're both under 58M they're fine. If one or the other goes above 58M, it starts to degrade both of them. That's where I found I get the best results by running a sub-38M test on the other worker when the first one is doing a 58M+ test.

When it comes to running multiple workers on the same physical chip, I've seen a similar effect, but it's a bit more pronounced: if any of the workers goes higher than 40M or so, it affects all of them, so I try to keep all of them on 35M tests.

I've done that with 4 workers on a single chip and they can do a 35M-37M exponent okay. If I start a 42M test on one of them, they all run slower.

On a 6-core chip, even when all six workers are doing 35M tests, I just wasn't getting very good timings... if I stopped all but one, the remaining worker would run faster than when all 6 were going. So for anything with more than 4 cores, I just set it up to have one worker per CPU with all cores applied to that worker. Just more efficient for me. I guess I could have done a pair of workers with 3 cores each, but I wanted the flexibility of testing larger exponents without worrying about the interaction.
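
(Just to write the rule of thumb down somewhere, here's how I'd encode it; the 4-core and ~38M cutoffs are only what I've seen on my own boxes, nothing official, so adjust to taste.)

Code:
# My rule of thumb for laying out workers on one physical chip -- the cutoffs
# are assumptions from my own machines, not anything official.

def suggest_layout(cores_per_chip, max_exponent_millions):
    """Return (workers, cores_per_worker) for a single physical chip."""
    if cores_per_chip > 4:
        return 1, cores_per_chip      # 5+ cores: one wide worker per chip
    if max_exponent_millions < 38:
        return cores_per_chip, 1      # small tests coexist nicely
    return 1, cores_per_chip          # bigger tests: don't share the chip

for cores, expo in [(4, 35), (4, 42), (6, 35), (14, 70)]:
    w, c = suggest_layout(cores, expo)
    print(f"{cores}-core chip, {expo}M work: {w} worker(s) x {c} core(s) each")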

All of the above is what kind of makes me think it's not just L2/L3 size, or the hit to the main memory, but a combination of both.

I haven't tested the performance of multiple workers on the same chip of the Xeon E5-2600 v3 (14 cores), but maybe I can do that sometime. I imagine it would be pretty stressed running 14 workers at any FFT size. Maybe I should get all 28 cores of that box doing triple checks of everything in the 2M-3M range. It could probably do those just fine.
Old 2015-10-27, 20:22   #17
ATH

I just realized that my first 10M-digit test back in 2003 (Athlon M4 Thunderbird, 950 MHz) took 7 months (210 days, or 232 days including factoring + P-1), and now I can do a 33M exponent in less than 12 hours.

And after 8 months of work it turned out to be bad: http://www.mersenne.org/report_expon...3430777&full=1

Last fiddled with by ATH on 2015-10-27 at 20:22
Old 2015-10-27, 21:06   #18
ewmayer

Quote:
Originally Posted by Madpoo
When I've run across this same phenomenon, my guess was that running 8 workers with 1 core each is simply exhausting the capabilities of the memory subsystem. Doing 1 worker/8 cores keeps the memory subsystem from stressing out and it actually does fairly well....

With that in mind it could have more to do with the L2/L3 cache sizes?
I was about to make the same point: the total top-level cache size on a many-core system can hold a significant fraction of the total 'hot data' set for a single parallelized run, whereas one job per core means only a small percentage of each job's dataset fits in cache. As long as the various caches can shuttle data quickly enough during the 'cores must swap each other's data around' phases of the FFT-mul, one would in fact expect the overall throughput to be better, since much less bandwidth to main memory is needed. In that light, it is perhaps surprising that it has taken so long for this effect to manifest itself.
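
To put some illustrative numbers on that (cache and FFT sizes below are assumptions for the sake of the example, not measurements; a 4M-double FFT is roughly the transform size for a 70M+ exponent, if I have the crossovers right):

Code:
# Compare how much of the hot FFT data stays cached in each configuration.
# All sizes are assumed for illustration.

fft_mb = 32.0            # a 4M-double FFT: 4M elements * 8 bytes = 32 MB
l3_per_chip_mb = 20.0    # hypothetical shared last-level cache
cores = 8

# One parallelized job: the entire shared L3 serves a single 32 MB dataset.
frac_single_job = min(1.0, l3_per_chip_mb / fft_mb)

# One job per core: each job's 32 MB dataset effectively gets ~1/8 of the L3,
# and traffic to main memory is roughly 8x that of the single-job case.
frac_per_core_jobs = min(1.0, (l3_per_chip_mb / cores) / fft_mb)

print(f"1 job  x {cores} cores: ~{100 * frac_single_job:.0f}% of the hot data cached")
print(f"{cores} jobs x 1 core : ~{100 * frac_per_core_jobs:.0f}% of each job's hot data cached")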

I recall seeing such superlinear scaling ~10 years ago with a very early and crudely OMP-parallelized version of Mlucas, but only on one specific 64-bit RISC architecture (I forget which); this is the first time I've seen it since.