#12
"/X\(‘-‘)/X\"
Jan 2013
2×3⁴×19 Posts
I'm considering building a Skylake 6700K system with similar-speed memory, so a quad-core, dual-channel benchmark would also interest me.
#13
Serpentine Vermin Jar
Jul 2014
6471₈ Posts
Quote:
The larger the FFT size, the better it seems to be. I think George explained it as the way it parcels out the chunks of FMA work to the various workers... some workers might be idling while others are doing things, resulting in inefficiencies. Larger FFTs seem to have more possible "chunks", so they get distributed more evenly? I don't know if my caveman understanding of it is close or not, but trust me: with a small FFT (let's say an exponent of 30M or less), if you have 8 cores in one worker, your total CPU usage will be something like 90-95%, not 100% like it would be with an exponent in the 60M range. Go figure.

The exception to how well a bunch of workers running at once does still seems to be (for me) exponents below 38M. I can set up a system with 8 cores total (dual 4-core CPUs), and as long as each worker is doing a <38M test, they'll play nicely. Once any of them is a bit larger, it falls apart. With that in mind, it could have more to do with the L2/L3 cache sizes?

I'd say that the code that "chunks" the parts of the FMA work out to different cores should:
a) do it evenly, so each core has the same amount of work
b) maybe keep each work unit small enough that it takes full advantage of the L2/L3 cache size?
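To make (a) and (b) concrete, here's a toy model of that kind of even, cache-sized chunking. Everything in it (chunk size, core count, FFT lengths) is a guess for illustration, not Prime95's actual code:

```c
/* Toy model of even, cache-sized chunking of FFT work across cores.
 * All constants are illustrative assumptions, not Prime95 internals. */
#include <stdio.h>

#define L2_BYTES    (256 * 1024)                     /* assumed per-core L2  */
#define CHUNK_WORDS (L2_BYTES / sizeof(double) / 2)  /* cache-friendly chunk */

/* Split an FFT of fft_words doubles into equal chunks and report how
 * evenly they spread over the cores of one worker. */
static void plan(size_t fft_words, int cores)
{
    size_t chunks = (fft_words + CHUNK_WORDS - 1) / CHUNK_WORDS;
    size_t max    = (chunks + cores - 1) / cores;    /* busiest core's share */
    double util   = 100.0 * (double)chunks / ((double)cores * (double)max);

    printf("%8zu-word FFT: %4zu chunks on %d cores -> %5.1f%% busy\n",
           fft_words, chunks, cores, util);
}

int main(void)
{
    plan(256 * 1024, 6);    /* small FFT: few chunks, cores sit idle */
    plan(3584 * 1024, 6);   /* large FFT: remainder barely matters   */
    return 0;
}
```

With only 16 chunks over 6 cores, two cores run out of work each pass and the worker tops out near 89% busy; with 224 chunks the leftover is noise. That would fit the 90-95% CPU usage on small FFTs versus ~100% on big ones, if the real code chunks anything like this.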
#14
Romulan Interpreter
"name field"
Jun 2011
Thailand
2821₁₆ Posts
Quote:
OTOH, what you say makes sense, and I've never played with the billions of cores you have on your hands.

Last fiddled with by LaurV on 2015-10-27 at 08:30
#15
Serpentine Vermin Jar
Jul 2014
5·677 Posts
Quote:
Then do the same (i.e., after first timing two workers that are both running 60M-70M tests), but with one running a 60M-70M test and the other just doing a <38M one. What I suspect you'll see is that the larger test won't suffer the same performance hit when the other worker is handling a smaller exponent as it did when both were large.

If we could figure out for sure that this is really a thing, and not just my imagination, it would help to work out why it happens, whether it can be improved, and, if that's just the way things are, whether there's a way to "default" Prime95 in such a setup to pick a first-time test for one worker and smaller DCs for any others on the same system. Eventually we'd run out of <38M double-checks... but as I've found, this same thing does NOT exist on my Xeon E5-26xx v3 with DDR4 memory, so something changed the equation there; I just don't know what.
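If anyone wants to try the comparison, the second run's worktodo.txt would look something like this — the exponents are placeholders I made up, not real assignments:

```
[Worker #1]
Test=66362159,75,1

[Worker #2]
DoubleCheck=36419617,71,1
```

Swap the DoubleCheck line for another 60M-70M Test= entry to reproduce the first (both-large) run, and compare the per-iteration timings between the two runs.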
#16
Serpentine Vermin Jar
Jul 2014
5·677 Posts
Quote:
On a dual-CPU system, I can run two workers, each using all the cores on its own socket, and as long as they're both under 58M they're fine. If one or the other is above 58M, it starts to degrade both of them. That's where I found I get the best results by running a sub-38M test on one worker if the other is doing a 58M+ test.

When it comes to running multiple workers on the same physical chip, I've noticed a similar effect, but it's a bit more pronounced: if any of the workers is higher than 40M or so, it affects all of them, so I try to keep all of them doing a 35M test. I've done that with 4 workers on a single chip and they can do 35M-37M exponents okay. If I start a 42M test on one of them, they all run slower.

On a 6-core chip, even when they're all doing a 35M test, I just wasn't getting very good timings... if I stopped all but one, the remaining worker would run faster than when all 6 were going. So for anything with more than 4 cores, I just set it up with one worker per CPU and all cores applied to that worker. Just more efficient for me. I guess I could have done a pair of workers with 3 cores each, but I wanted the flexibility of testing larger exponents without worrying about the interaction.

All of the above is what makes me think it's not just L2/L3 size, or the hit to main memory, but a combination of both.

I haven't tested multiple workers on the same chip of the Xeon E5-2600 v3 (14 cores), but maybe I can do that sometime. I imagine it'd be pretty stressed running 14 workers at any FFT size. Maybe I should get all 28 cores of that box doing triple checks of everything in the 2M-3M range.
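Out of curiosity, here's a back-of-envelope version of the cache idea. The ~18 bits packed per FFT word and the L3 sizes are rough assumptions on my part, not measured figures:

```c
/* Rough estimate: does one worker's FFT data fit in the socket's L3?
 * ~18 bits packed per double and the L3 sizes are assumptions for
 * illustration, not Prime95 internals. */
#include <stdio.h>

static void fits_in_l3(double exponent_m, double l3_mb)
{
    const double bits_per_word = 18.0;
    double words = exponent_m * 1e6 / bits_per_word;   /* FFT doubles */
    double mb    = words * 8.0 / (1024.0 * 1024.0);    /* working set */

    printf("%3.0fM exponent -> ~%3.0f MB FFT data vs %2.0f MB L3: %s\n",
           exponent_m, mb, l3_mb, mb <= l3_mb ? "fits" : "spills to RAM");
}

int main(void)
{
    fits_in_l3(35.0, 20.0);   /* the size that plays nicely            */
    fits_in_l3(58.0, 20.0);   /* right around the per-socket threshold */
    fits_in_l3(70.0, 35.0);   /* big test on the 14-core Xeon v3       */
    return 0;
}
```

By that estimate a 58M test needs ~26 MB of FFT data, just past a typical 20-25 MB per-socket L3, while 35M (~16 MB) squeaks in, which lines up with the 58M threshold. But four ~16 MB workers on one chip can't all fit either, and they run okay, which is why I figure it's a mix of cache and memory-bandwidth pressure rather than cache size alone.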
#17
Einyen
Dec 2003
Denmark
2×17×101 Posts
I just realized that my first 10M-digit test back in 2003 (Athlon Thunderbird, 950 MHz) took 7 months (210 days, or 232 days including factoring + P-1), and now I can do a 33M exponent in less than 12 hours. That's 210 days ≈ 5,040 hours against under 12 hours: roughly a 420× speedup on a similar-sized test.
And after 8 months of work, it turned out to be bad: http://www.mersenne.org/report_expon...3430777&full=1

Last fiddled with by ATH on 2015-10-27 at 20:22
#18
∂²ω=0
Sep 2002
República de California
5×2,351 Posts
Quote:
I recall seeing such superlinear scaling ~10 years ago with a very early and crudely OMP-||ized version of Mlucas, but only on one specific 64-bit RISC architecture (I forget which). This is the first time I've seen it since.