mersenneforum.org  

Old 2015-09-07, 15:50   #1
aurashift
 
Optimal LL configuration

OK - so this thread is to figure out what the best configuration for an LL test is.

My belief is that 2 cores per worker is best for anything less than 100M on server-grade systems (a local.txt sketch for this setup follows the list below).
-It seems to halve the wall-clock time without using any more total CPU time. In fact, on some of my older HP G7s equipped with Intel Xeon E7-4870s, it seems to reduce the time from ~60 down to ~20-25 days per worker.

-For 100M exponents, I think the best option on anything 8-core and above is to use all the cores in a socket except the first one, in order to limit interrupt overhead. On 6 cores or fewer, you gain more by not leaving a core idle and living with the interrupts.

-If you're really trying to squeeze out the fastest test you can on a dual-socket system, you can use all of the cores on the first socket and one, MAYBE two, cores on the second socket (if you're running DDR4 with a faster QPI, say 9.6 GT/s).

-I don't have any 16+ core systems to test this on any longer, but I did seem to notice that for exponents less than 100M, the speed-up stopped after a certain number of cores was assigned. Those CPUs did have a lower clock speed (2.3 GHz vs 2.7 GHz).
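
For reference, here's roughly what the multi-worker setup looks like in local.txt on the v28-era prime95/mprime I'm running. Syntax is from memory, so double-check readme.txt for your version before copying:

Code:
; hypothetical local.txt fragment: one 8-core socket split into
; four independent 2-core LL workers
WorkerThreads=4
ThreadsPerTest=2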


-For desktops:
I'm noticing on quad-core systems that the best practice is to use 3 of the cores and leave one idle for system tasks. My MacBook Pros can't handle the heat of 4 running cores, and Turbo Boost throttles things.

My Surface Pro 3 i7 is a piece of shite and I don't run any tests on it any longer (lol)

I know we've touched on this in other places, but I thought this would be a good place to gather all the info here.
Old 2015-09-07, 17:39   #2
Mark Rose
 

On Ivy Bridge and Haswell with DDR3-1600, I find the fourth core does improve speed a little bit, but if hyperthreading is turned off, I'd reduce it to three cores to keep the desktop optimally responsive.
Old 2015-09-07, 19:21   #3
Madpoo

Quote:
Originally Posted by aurashift
OK - so this thread is to figure out what the best configuration for an LL test is.

My belief is that 2 cores per worker is best for anything less than 100M on server-grade systems.
-It seems to halve the wall-clock time without using any more total CPU time. In fact, on some of my older HP G7s equipped with Intel Xeon E7-4870s, it seems to reduce the time from ~60 down to ~20-25 days per worker.
Great thread topic...thanks for moving it away from the 100M benchmark thread.

On this first point, yeah, adding a 2nd core will come close to doubling the throughput of 1 core. Each additional core past that shows less improvement. For example, going from 1 to 8 cores won't make it 8x faster, but maybe 4-5x. It really depends on your memory at that point.

In my own experience again, a single worker with 8 cores is still far faster than 2 workers of 4 cores each (on the same physical chip). My theory is that the memory contention of two workers doing large FFT tasks on one chip is too much... better to run a single worker with as many cores as you can. Desktop systems with 4-core chips should be fine with this. Xeons with more than 4 cores probably have faster memory with multiple banks and interleaving and all that, but maybe not. Best to experiment with adding more cores and watching how the seconds-per-iteration value changes.

Quote:
Originally Posted by aurashift
-For 100M exponents, I think the best option on anything 8-core and above is to use all the cores in a socket except the first one, in order to limit interrupt overhead. On 6 cores or fewer, you gain more by not leaving a core idle and living with the interrupts.
Dual-CPU systems might be able to handle interrupts on either CPU, or on whichever one is affined to a particular PCI slot... I wouldn't be too concerned about this unless the system is also doing heavy I/O (disk, add-on GPU, network), which generates a lot of interrupts.

You can optionally tweak the affinity scramble to make something other than core #0 the first one... let core #0 be a "helper" core for the worker...
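
From memory, that's the AffinityScramble2 line in undoc.txt. The syntax and core numbering below are my best recollection, so treat this as hypothetical and check undoc.txt for your version:

Code:
; on an 8-core chip: prime95's logical core 0 lands on physical
; core 1, and physical core 0 is pushed to the end of the list
AffinityScramble2=12345670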

Quote:
Originally Posted by aurashift
-If you're really trying to squeeze out the fastest test you can on a dual-socket system, you can use all of the cores on the first socket and one, MAYBE two, cores on the second socket (if you're running DDR4 with a faster QPI, say 9.6 GT/s).

-I don't have any 16+ core systems to test this on any longer, but I did seem to notice that for exponents less than 100M, the speed-up stopped after a certain number of cores was assigned. Those CPUs did have a lower clock speed (2.3 GHz vs 2.7 GHz).
On DDR3 systems with slower QPI (Xeon E5 v1/v2 boxes), adding 1 more core on the other CPU showed a marginal improvement; adding cores beyond that actually made it slower. On my DDR4 systems (Xeon E5 v3 CPUs with 9.6 GT/s QPI), I was able to add 6 or 7 more cores on the other CPU of dual 14-core processors before I saw it getting worse. But adding more than 3 or 4 didn't make much improvement either. It also means you really can't run another worker on that other CPU with the remaining cores, because you're basically eating up that CPU's memory bandwidth for very little gain.

I'd only do this if I just wanted to run a single LL test very fast for some reason... maybe doing a DC on a potential prime? I hope that happens soon.

A few extra things... on a dual-CPU system, depending on the memory type and speed, interesting things happen.

On my DDR3 systems, if worker #1 on CPU #1 was doing a test on an exponent > 58M, then worker #2 on CPU #2 had to be smaller than 38M, otherwise both workers started to suffer and run slower.

On my new DDR4 system, which I haven't had time to test thoroughly, I was able to run a 70M and a 75M test on the 2 workers without either slowing down. I'm not sure where the real threshold is.

You can test it yourself: on a dual-CPU system, get your 2 workers going and note the iteration times... then stop one or the other and see if the timing on the one still running improves significantly. It will be a big change, not just a small percentage. VERY noticeable.

This same test also applies to multiple workers on a single CPU... let's say you have a quad-core desktop chip with one worker running on each core. Stop all but one and see how the timings improve. You can then do the "total throughput" calculations to find the sweet spot (a sketch of that follows below). It will also depend on the size of the exponent being tested, so I think the FFT size and the chip's L2/L3 cache play big parts in that, as well as your system RAM speed.
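
If you want to script that comparison, here's a minimal C sketch of the bookkeeping. The iteration times in it are made-up placeholders; plug in your own measured numbers:

Code:
/* Compare total LL throughput across worker layouts: run each
 * configuration, note the per-worker ms/iteration, and see which
 * layout finishes the most iterations per second overall. */
#include <stdio.h>

struct layout {
    const char *name;
    int workers;        /* simultaneous LL tests */
    double ms_per_iter; /* measured ms/iteration for each worker */
};

int main(void) {
    /* hypothetical quad-core numbers; substitute your measurements */
    struct layout l[] = {
        { "1 worker  x 4 cores", 1,  30.0 },
        { "2 workers x 2 cores", 2,  70.0 },
        { "4 workers x 1 core ", 4, 150.0 },
    };
    for (int i = 0; i < (int)(sizeof l / sizeof l[0]); i++) {
        double total = l[i].workers * (1000.0 / l[i].ms_per_iter);
        printf("%s : %.1f iterations/sec total\n", l[i].name, total);
    }
    return 0;
}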
Old 2015-09-08, 12:12   #4
nucleon
 

I know it's not the answer you are after.

But the most effective option is to grab an Nvidia Titan and make sure the double-precision (DP FP) setting is enabled. :) I've played with a few configurations, and nothing comes close.

But for pure CPUs...

Use as many non-HT CPU cores on the same CPU as you can, but tune affinity so that the primary core in prime95 isn't CPU core #0. If you have more than one CPU, run more than one test (one per socket). Grab the fastest RAM you can get hold of. And remember that clock speed scales badly: a linear increase in power consumption buys only a sub-linear increase in performance.

Optimizing throughput vs. TCO can therefore be a matter of buying lower-clocked CPUs but running more of them (rough numbers below).
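
To put made-up numbers on that: dynamic power scales roughly with the cube of clock speed (voltage has to rise with frequency), so a chip clocked 30% lower draws on the order of 0.7^3 ≈ 1/3 the dynamic power. Three such chips then deliver about 3 x 0.7 = 2.1x the throughput of one full-speed chip for roughly the same electricity. Idle power and hardware cost eat into this, but it's why low-clock/many-socket can win on TCO.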

These are my conclusions based on the equipment I've played with.

-- Craig
Old 2015-09-08, 12:25   #5
LaurV

Well, ridiculously high price aside, the best would be to grab a few Tesla K80s...
Old 2015-09-19, 21:13   #6
aurashift
 

Has anyone compared Turbo Boost on vs. off? I know it boils down to thermals, but I wondered if it changed anything.
Old 2015-09-20, 07:08   #7
Madpoo

Quote:
Originally Posted by aurashift
Has anyone compared Turbo Boost on vs. off? I know it boils down to thermals, but I wondered if it changed anything.
I haven't, exactly, but a faster clock speed seems like it would almost always be a good thing. Even in what I think is a memory-bandwidth-limited situation, faster CPU speed seems like it would still help.

The only testing I've done that comes close is when I enabled "performance mode" on my ProLiants. Doing that makes the CPUs always run at their max turbo, and it really did make a big difference in iteration times.

The flip side would be to run them in power-saving mode and maybe even enable the power cap so they run at lower clock speeds than stock, but aside from the Primenet server itself (which has to be power-capped to fit our service agreement), I've never had a reason to do that.
Old 2015-09-20, 14:17   #8
Xyzzy
 

If you are running into memory bandwidth limitations you could scale back the frequency and (possibly?) run more cores.

On our multicore system, lowering the frequency greatly reduces the CPU voltage, which greatly reduces the power drawn.

In other words, a savings in electricity and reduced heat generation.
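
(The first-order model behind that, for the curious: dynamic power is roughly P ≈ C·V²·f. Power is linear in frequency but quadratic in voltage, and since a lower frequency usually permits a lower voltage, the combined savings can approach cubic.)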

YMMV
Old 2015-09-21, 01:07   #9
aurashift
 

Until we figure out whether (and how) a CPU can run at 100% while possibly memory-bottlenecked, I'm going to keep going full speed.

Heat and power is no concern for me :)

Old 2015-09-21, 14:45   #10
Madpoo

Quote:
Originally Posted by aurashift
Until we figure out whether (and how) a CPU can run at 100% while possibly memory-bottlenecked, I'm going to keep going full speed.

Heat and power is no concern for me :)
Me too.

Recently, George mentioned that Prime95 distributes work across multiple cores by splitting each iteration's multiplication into a number of chunks. If the number of chunks isn't evenly divisible by the number of cores assigned to that worker, some cores will get more chunks than others, which essentially gives the other cores nothing to do for some fraction of each iteration.

I don't know where in the code it decides how many chunks to split into, or whether there's a reason that count couldn't be made as close as possible to a multiple of the number of threads in that worker.

The larger the exponent, apparently, the more total chunks of work there are, so even if a few cores are doing nothing, it's only for a small fraction of the time. But that still leaves room for (I'm guessing) a 2-3% improvement.

The problem was most pronounced on smaller exponents, which presumably don't have as many chunks of work. That's where I'd see the "helper" threads of a worker sitting idle maybe 20-30% of the time, with only the main core near 100%.

By "small" exponents I mean really small, like < 10M. The higher you go, the less obvious the effect. But even with a 70M exponent on a 14-core system, core 1 might be 100% but then cores 2-14 are more like 97-98%. I don't think memory is saturated because I can get more "oomph" by adding additional cores on the other CPU. If mem had been saturated I think I would have seen iteration times level out or start to get worse, like I do at a certain point when adding more cores. Of course the QPI link is also probably saturated by then.

So yeah... someone who understands the code better could probably determine whether there's any way to improve that multithreading part by distributing the workload more evenly.

Or maybe, on those larger exponents with lots of "chunks", it's already doing a fair job of that, and it's the primary core combining the results of all the chunks that eats up a little time, leaving the other cores idle?

If that were the case, then I'd wonder if there's a way to speed up the merging of the individual pieces... perhaps having pairs of chunks combined and rolled up until all the work is together as one big happy (and quite large) result, rather than putting all of that onto a single core?

For example, in a 4-core worker, core #1 would merge the chunks from cores #1 and #2, core #3 would merge those from #3 and #4, and then finally core #1 would merge those two partial results together.

I know, I'm oversimplifying what's happening under the hood, and splitting a multiplication into chunks and "merging" them together isn't as easy as I'm saying, but hopefully that made sense. (A toy version of the pairwise merge is sketched below.)
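
Just to make the shape of the idea concrete, here's a toy C version with plain doubles standing in for the big chunk results (real carry propagation is nothing this simple):

Code:
/* Pairwise ("tree") merge: a serial merge is CORES-1 sequential
 * combine steps on one core, while the tree needs only about
 * log2(CORES) rounds if each round's pairs can merge in parallel. */
#include <stdio.h>

#define CORES 4

int main(void) {
    double chunk[CORES] = { 1.0, 2.0, 3.0, 4.0 }; /* placeholder results */
    int rounds = 0;
    /* each round, core i absorbs core i+stride, halving the count */
    for (int stride = 1; stride < CORES; stride *= 2) {
        for (int i = 0; i + stride < CORES; i += 2 * stride)
            chunk[i] += chunk[i + stride]; /* stand-in for "merge" */
        rounds++;
    }
    printf("merged %.1f in %d rounds (vs %d serial steps)\n",
           chunk[0], rounds, CORES - 1);
    return 0;
}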

Old 2015-09-22, 10:14   #11
LaurV

Quote:
Originally Posted by aurashift
Until we figure out whether (and how) a CPU can run at 100% while possibly memory-bottlenecked, I'm going to keep going full speed.

Heat and power is no concern for me :)
Try "int main(){while(1);}", that will make no memory access and keep one core busy...
Run it times 4... (well, times x, one for each core... hehe)
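
Half seriously, here's the "times x" version as one program: a spinning pthread per core. Linux-flavored, and obviously a space heater rather than a useful benchmark:

Code:
/* Spawn one busy-spinning thread per online core: all ALU, no
 * memory traffic. Build with: gcc spin.c -o spin -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *spin(void *arg) {
    (void)arg;
    for (;;) ;   /* burn cycles forever, touch no memory */
}

int main(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN); /* one thread per core */
    pthread_t t;
    for (long i = 0; i < n; i++)
        pthread_create(&t, NULL, spin, NULL);
    printf("spinning on %ld cores; Ctrl-C to stop\n", n);
    pause(); /* park the main thread; spinners keep running */
    return 0;
}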
