mersenneforum.org  

Old 2021-03-15, 13:52   #1
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

3×151 Posts
What's the best configuration of mprime for dual CPUs?

I had a second CPU arrive in the post today - a Xeon Platinum 8167M, 26-core, 2.0 GHz. By many standards it is not a great CPU - the single-core performance is pretty poor. But with 26 cores, it works pretty well with multi-threaded code. It is also reasonably priced - £300 (around $417 USD) for each second-hand CPU. The better Gold or Platinum CPUs can cost serious amounts of money.

So when I get around to fitting it, I will have a pair of 26-core CPUs. What's the best way to handle this with mprime? Perhaps

1) Create 2 workers, each with 24 cores (I don't want to use all 26)
2) Run 2 completely separate instances of mprime, in two different directories. I don't know if it's possible on Linux, but on Solaris one can force a process to run only on a specific CPU.

It would be good to avoid the situation where mprime is using half the cores from one CPU, and half from another. Such a situation could be expected to slow the system down unnecessarily.
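For reference on the Linux part of option 2: the kernel does support per-process CPU affinity (the sched_setaffinity call), and Python exposes it as os.sched_setaffinity. Below is a minimal sketch of how two separately pinned processes might be started. It assumes logical CPU ids are numbered contiguously per socket, which is often not the case (check lscpu first), and the command and directory names in the commented-out launch are placeholders, not anything taken from mprime's documentation.

```python
import os
import subprocess


def socket_cpu_sets(sockets, cores_per_socket, cores_to_use):
    """Return one set of logical CPU ids per socket.

    ASSUMPTION: socket s owns ids s*cores_per_socket .. (s+1)*cores_per_socket-1.
    Real machines often number CPUs differently, so check `lscpu` first.
    """
    if cores_to_use > cores_per_socket:
        raise ValueError("cannot use more cores than a socket has")
    return [
        set(range(s * cores_per_socket, s * cores_per_socket + cores_to_use))
        for s in range(sockets)
    ]


def run_pinned(cmd, cpus, workdir):
    """Start cmd in workdir, restricted to the given CPU ids (Linux only)."""
    return subprocess.Popen(
        cmd,
        cwd=workdir,
        # Runs in the child just before exec; pid 0 means "this process".
        preexec_fn=lambda: os.sched_setaffinity(0, cpus),
    )


if __name__ == "__main__":
    # Two 26-core sockets, 24 cores used on each.
    for i, cpus in enumerate(socket_cpu_sets(2, 26, 24)):
        print(f"instance {i}: CPUs {min(cpus)}-{max(cpus)}")
        # Hypothetical launch; command and directory names are placeholders:
        # run_pinned(["./mprime"], cpus, f"mprime{i}")
```

With hyperthreading enabled the logical numbering gets more complicated, so the contiguous-id assumption should be verified before trusting the sets this produces.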

Dave

Last fiddled with by drkirkby on 2021-03-15 at 14:02 Reason: To change something - loL
Old 2021-03-15, 15:02   #2
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×29×127 Posts

From one mprime instance, benchmark for various cores/worker combinations. Prime95 can do this automatically. It does fine on dual-Xeon systems.
The optimal throughput varies depending on processor type, other hardware characteristics, and fft length, and so depends partly on whether you intend to run small exponents (PRP-CF ~11M), modest exponents (DC ~56M), wavefront exponents (~103M), 100Mdigit (~334M), or higher.

A rough rule of thumb is to divide exponent by 17-18 to get a rough estimate of fft length. (18 at lower exponents, 17 at higher exponents.) On high core count processors, the throughput-optimal configuration may have latency so long that assignments expire before completion. With real benchmark data for your hardware, you can make well-informed decisions and be comfortable with them. Some tabulated benchmark examples versus processor type are here (dual-12-core-Xeon in third attachment) and here (68 core Xeon Phi in second attachment).
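The divide-by-17-to-18 rule of thumb above can be sketched as a few lines of Python, which may help narrow down which fft lengths are worth benchmarking. This is only the rough estimate described in the post, not how prime95 selects an fft length; the function name is made up for illustration, and it assumes "K" means 1024 as in the "5120K" labels of the benchmark output.

```python
def fft_length_range_k(exponent):
    """Rough FFT length bounds in K, using exponent/18 and exponent/17.

    ASSUMPTION: "K" means 1024, as in the "5120K" benchmark labels.
    """
    low = exponent / 18 / 1024   # divisor suggested for lower exponents
    high = exponent / 17 / 1024  # divisor suggested for higher exponents
    return low, high


if __name__ == "__main__":
    examples = [
        ("DC", 56_000_000),
        ("wavefront", 103_000_000),
        ("100Mdigit", 334_000_000),
    ]
    for label, p in examples:
        low, high = fft_length_range_k(p)
        print(f"{label}: exponent {p} -> roughly {low:.0f}K to {high:.0f}K")
```

The program only offers certain fft sizes, so the result is a bracket for choosing benchmark settings rather than an exact answer.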
A word of warning; specifying the full range of fft lengths and all fft combinations or HT too can result in some VERY LONG benchmark sessions (days in some cases) to compute, and also a long time to analyze the results.
Here's sample raw prime95 benchmark results output from its results.bench.txt file for the Xeon Phi 7250, for the no-HT case, 1 FFT length:
Code:
Timings for 5120K FFT length (68 cores, 1 worker):  5.21 ms.  Throughput: 192.06 iter/sec.
Timings for 5120K FFT length (68 cores, 2 workers):  4.06,  4.10 ms.  Throughput: 490.56 iter/sec.
Timings for 5120K FFT length (68 cores, 4 workers):  6.29,  6.24,  6.11,  6.27 ms.  Throughput: 642.35 iter/sec.
Timings for 5120K FFT length (68 cores, 17 workers): 23.80, 23.52, 23.33, 23.35, 24.33, 25.03, 23.82, 24.23, 24.46, 24.29, 24.27, 23.79, 24.03, 24.00, 23.62, 23.56, 23.99 ms.  Throughput: 709.54 iter/sec.
Timings for 5120K FFT length (68 cores, 34 workers): 45.89, 48.61, 46.82, 48.42, 47.44, 46.91, 47.34, 47.23, 47.28, 47.17, 47.96, 47.35, 48.76, 47.60, 49.13, 50.82, 47.44, 47.60, 46.94, 47.03, 47.09, 47.90, 47.45, 47.97, 49.19, 47.49, 46.88, 47.98, 47.60, 49.41, 47.09, 46.66, 47.24, 47.00 ms.  Throughput: 713.54 iter/sec.
Timings for 5120K FFT length (68 cores, 68 workers): 102.78, 100.97, 101.95, 104.04, 104.27, 105.99, 103.54, 102.54, 106.91, 107.87, 102.67, 103.24, 105.16, 103.14, 105.13, 103.62, 106.74, 102.54, 103.12, 102.58, 104.21, 102.72, 105.28, 110.39, 104.62, 104.49, 104.75, 105.55, 102.29, 102.70, 105.06, 104.61, 105.65, 102.77, 103.81, 104.77, 105.76, 105.12, 104.46, 100.85, 104.77, 105.47, 103.91, 103.57, 102.52, 104.28, 105.50, 105.21, 103.34, 102.21, 101.63, 102.90, 104.37, 104.45, 102.36, 104.02, 104.70, 104.00, 107.92, 109.63, 105.08, 104.48, 102.71, 104.59, 105.93, 105.79, 104.77, 104.68 ms.  Throughput: 652.05 iter/sec.
Optimal eventual output, and fastest single completion, are very different configurations there. Prime95 defaults to 4 cores/worker = 17 workers. That is a reasonable compromise in this case between throughput (0.56% less) and latency (~half). To reach 4 qualifying successful DCs per worker quickly, I switched from 17 to 4 workers. That's also a better compromise for the larger exponents and fft lengths I'm mostly running on it currently.
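To make the throughput-versus-latency comparison above less tedious, here is a small Python sketch that parses lines in the format shown and recomputes throughput as the sum of 1000/ms over workers. The regular expression is inferred only from the sample lines in this post, so treat the line format as an assumption; other prime95 versions may print something different. The function names are hypothetical.

```python
import re

# Pattern inferred from the sample output above; an assumption, not a spec.
LINE_RE = re.compile(
    r"Timings for (\d+)K FFT length \((\d+) cores?, (\d+) workers?\):"
    r"\s*([\d., ]+?) ms\.\s*Throughput:\s*([\d.]+) iter/sec"
)


def parse_bench_line(line):
    """Parse one benchmark line into a dict, or return None if it doesn't match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "fft_k": int(m.group(1)),
        "cores": int(m.group(2)),
        "workers": int(m.group(3)),
        "timings_ms": [float(t) for t in m.group(4).split(",")],
        "throughput": float(m.group(5)),
    }


def recomputed_throughput(rec):
    """Total iterations/sec: each worker contributes 1000 / (ms per iteration)."""
    return sum(1000.0 / t for t in rec["timings_ms"])


def percent_below(value, best):
    """How far value falls short of best, as a percentage of best."""
    return (best - value) / best * 100.0


if __name__ == "__main__":
    line = ("Timings for 5120K FFT length (68 cores, 4 workers):  6.29,  6.24,"
            "  6.11,  6.27 ms.  Throughput: 642.35 iter/sec.")
    rec = parse_bench_line(line)
    mean_ms = sum(rec["timings_ms"]) / len(rec["timings_ms"])
    print(rec["workers"], "workers, mean", round(mean_ms, 2), "ms/iter")
    # 17 workers (709.54) versus the best case, 34 workers (713.54):
    print(round(percent_below(709.54, 713.54), 2), "% below best")
```

The recomputed figure differs slightly from the printed one because the timings are rounded to two decimals.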

Employing HT in prime95 rarely benefits PRP, P-1 or LL. But HT should be enabled in the BIOS, so that all cores can be used in prime95, and miscellaneous tasks run via HT without much reducing prime95 output. (I sometimes run mfactor alongside, with little effect on prime95 throughput if the # of cores used by mfactor is small compared to total core count; mostly it uses HT capability I think.)

I have not seen a need to leave some cores unused by prime95, on HT-capable processors, or otherwise. There might be a benefit on unusual-core-count processors whose counts are not very smooth. Something to test for.

Last fiddled with by kriesel on 2021-03-15 at 15:30
Old 2021-03-15, 18:10   #3
VBCurtis
 
 
"Curtis"
Feb 2005
Riverside, CA

5634₁₀ Posts

If "best" includes power efficiency, you may find a sweet spot where a socket is saturating memory completely at less than the full number of cores, say 8 cores used on a 10-core chip w/ 4 memory channels.

Cutting mprime from 10 threads to 8 in such a circumstance can maintain say 95% of mprime performance, while leaving more machine resources for "real" user work / other tasks that are not very memory hungry as well as using a bit less power.
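One way to look for that sweet spot, once benchmark numbers for several core counts are in hand, is to pick the smallest core count that still delivers a chosen fraction of the best throughput. The sketch below does that in Python; the function name is hypothetical and the numbers in the demo are placeholders invented purely for illustration, not measurements from any machine.

```python
def smallest_core_count(throughput_by_cores, keep_fraction=0.95):
    """Smallest core count whose throughput is >= keep_fraction of the best."""
    best = max(throughput_by_cores.values())
    for cores in sorted(throughput_by_cores):
        if throughput_by_cores[cores] >= keep_fraction * best:
            return cores


if __name__ == "__main__":
    # PLACEHOLDER values (iter/sec), made up for illustration only.
    # Substitute your own benchmark results.
    made_up = {4: 60.0, 6: 85.0, 8: 96.0, 10: 100.0}
    print("cores to use:", smallest_core_count(made_up))
```

The 95% threshold is just the figure mentioned above; the right value depends on how much the freed cores and saved power are worth to you.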
Old 2021-03-15, 20:11   #4
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

3×151 Posts

Quote:
Originally Posted by kriesel View Post
From one mprime instance, benchmark for various cores/worker combinations. Prime95 can do this automatically. It does fine on dual-Xeon systems.
The optimal throughput varies depending on processor type, other hardware characteristics, and fft length, and so depends partly on whether you intend to run small exponents (PRP-CF ~11M), modest exponents (DC ~56M), wavefront exponents (~103M), 100Mdigit (~334M), or higher.

A rough rule of thumb is to divide exponent by 17-18 to get a rough estimate of fft length. (18 at lower exponents, 17 at higher exponents.) On
My interest is in what you call wavefront exponents (~103M).

I know that the memory bandwidth in my machines is not good. The PC is designed for CPUs with 6 memory channels, but I believe that the 8167M has only 4 working. An attempt to put DIMMs in more than 4 memory channels results in a computer that will not power up. I do need to check if it’s a motherboard fault or a CPU issue. It’s an OEM CPU for which Intel will release no information.

The ideal situation would be to swap out the 8167M CPUs for a more common one, properly supported by Dell and Intel. Unfortunately those CPUs are ridiculously expensive - around $4000 each used.

I bought this Dell 7920 tower workstation at $1100. Unfortunately it is a bit of a money pit, as it can be upgraded to a very high specification. One can configure one on the Dell website to be over $100,000. 3 TB of RAM is quite pricey!
Quote:
Originally Posted by kriesel View Post
A word of warning; specifying the full range of fft lengths and all fft combinations or HT too can result in some VERY LONG benchmark sessions (days in some cases) to compute, and also a long time to analyze the results.
I realised that some time ago. Based on what I have
* Exponents around 103,000,000
* Two 26-core 2.0 GHz CPUs with 35.75 MB cache
* Concerns on memory bandwidth.
* 7 x 32 GB RDIMMs - not enough to present a problem with 2 CPUs as I can use 3 channels on each CPU.
can you suggest some sensible things to try with the benchmarks, without going through every possible combination of workers, threads, FFTs that will take days to run?
Old 2021-03-15, 20:39   #5
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

3×151 Posts

Quote:
Originally Posted by VBCurtis View Post
If "best" includes power efficiency, you may find a sweet spot where a socket is saturating memory completely at less than the full number of cores, say 8 cores used on a 10-core chip w/ 4 memory channels.

Cutting mprime from 10 threads to 8 in such a circumstance can maintain say 95% of mprime performance, while leaving more machine resources for "real" user work / other tasks that are not very memory hungry as well as using a bit less power.
That’s useful to know. Memory bandwidth is not currently good on this machine due to the OEM “secret” CPUs which will only let me use 4 memory channels, not the 6 these CPUs should have. At least I think so - Intel will not say anything about the CPU.

There’s a video on YouTube
https://youtu.be/jP65i_Iqml8
where the presenter says that the Dell 7920 and an HP workstation can equally claim to be the world’s most powerful workstation. Unfortunately, one can only get that power if one sinks a hell of a lot of money into the machine. I don’t have that sort of money. To make best use of the two decent CPUs I would need to add 5 more RAM modules at around $250 each new. Better CPUs cost $1000-$4000 each used. I am not even going to consider the very best CPUs.

Last fiddled with by drkirkby on 2021-03-15 at 20:41
Old 2021-03-15, 20:53   #6
paulunderwood
 
 
Sep 2002
Database er0rr

2³·3·11·17 Posts

Quote:
Originally Posted by drkirkby View Post

I realised that some time ago. Based on what I have
* Exponents around 103,000,000
* Two 26-core 2.0 GHz CPUs with 35.75 MB cache
* Concerns on memory bandwidth.
* 7 x 32 GB RDIMMs - not enough to present a problem with 2 CPUs as I can use 3 channels on each CPU.
can you suggest some sensible things to try with the benchmarks, without going through every possible combination of workers, threads, FFTs that will take days to run?
Try 2 instances of 26 cores each, one for each CPU. And then try 4 instances of 13 cores each and compare throughputs.
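The 2x26 and 4x13 suggestions are two of the splits in which every worker stays within one socket and all cores are used. A short Python sketch can list all such splits for a given machine; it simply enumerates divisors of the per-socket core count, and the function name is made up for illustration.

```python
def socket_aligned_splits(sockets, cores_per_socket):
    """List (total_workers, cores_per_worker) pairs where no worker spans
    sockets and every core on each socket is assigned to some worker."""
    splits = []
    for workers_per_socket in range(1, cores_per_socket + 1):
        if cores_per_socket % workers_per_socket == 0:
            splits.append((sockets * workers_per_socket,
                           cores_per_socket // workers_per_socket))
    return splits


if __name__ == "__main__":
    print("26 cores/socket:", socket_aligned_splits(2, 26))
    print("24 cores/socket:", socket_aligned_splits(2, 24))
```

Because 26 = 2 × 13, there are only four even splits with all cores in use, whereas limiting each socket to 24 cores gives eight, so there are more combinations to benchmark in that case.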

Last fiddled with by paulunderwood on 2021-03-15 at 21:57
Old 2021-03-15, 21:58   #8
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

705₈ Posts

Quote:
Originally Posted by paulunderwood View Post
Try 2 instances of 26 cores each, one for each CPU. And then try 4 instances of 13 cores each and compare throughputs.
When you say “two instances” do you mean 2 workers on the one process, or do you mean have two mprime processes running in different directories?
Old 2021-03-15, 22:20   #9
paulunderwood
 
 
Sep 2002
Database er0rr

2³·3·11·17 Posts

Quote:
Originally Posted by drkirkby View Post
When you say “two instances” do you mean 2 workers on the one process, or do you mean have two mprime processes running in different directories?
Hmm, two workers would do it. Then try four and compare throughput.

I think mprime and the Linux scheduler are smart enough to set the right affinities.
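Rather than taking the affinities on trust, they can be checked. The Python sketch below reads the Linux sysfs topology files (cpuN/topology/physical_package_id) to map logical CPUs to sockets, then reports which sockets a given set of allowed CPUs touches. It is Linux-only, the function names are hypothetical, and it says nothing about how mprime itself assigns threads; it only inspects what the kernel reports.

```python
import os
import re
from collections import defaultdict


def cpus_by_socket(sysfs_root="/sys/devices/system/cpu"):
    """Map physical package (socket) id -> sorted list of logical CPU ids."""
    sockets = defaultdict(list)
    for name in os.listdir(sysfs_root):
        m = re.fullmatch(r"cpu(\d+)", name)
        if m is None:
            continue  # skip entries such as "cpufreq" or "online"
        path = os.path.join(sysfs_root, name, "topology", "physical_package_id")
        try:
            with open(path) as f:
                sockets[int(f.read().strip())].append(int(m.group(1)))
        except OSError:
            continue  # offline CPUs may lack a topology directory
    return {pkg: sorted(ids) for pkg, ids in sockets.items()}


def sockets_spanned(allowed_cpus, by_socket):
    """Return the socket ids that a set of allowed CPUs touches."""
    return {pkg for pkg, ids in by_socket.items() if allowed_cpus & set(ids)}


# Example use on a Linux box (not run here):
#   layout = cpus_by_socket()
#   mask = os.sched_getaffinity(0)   # 0 = this process; pass a pid to inspect another
#   print(sockets_spanned(mask, layout))
```

Threads can carry their own affinity masks, so for a multi-threaded program it may be necessary to check individual thread ids rather than only the main process id.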

Last fiddled with by paulunderwood on 2021-03-15 at 22:21
Old 2021-03-19, 09:53   #10
RobBrown
 
Mar 2021

1₁₆ Posts

I think on such a serious issue you should contact the computer service!
Old 2021-03-19, 18:21   #11
VBCurtis
 
 
"Curtis"
Feb 2005
Riverside, CA

13002₈ Posts

We *are* the computer service.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
