#1 |
"David Kirkby"
Jan 2021
Althorne, Essex, UK
3×151 Posts |
I had a second CPU arrive in the post today: a Xeon Platinum 8167M, 26 cores, 2.0 GHz. By many standards it is not a great CPU, as its single-core performance is pretty poor. But with 26 cores, it works pretty well with multi-threaded code. It is also reasonably priced: £300 (around $417 USD) for each second-hand CPU. The better Gold or Platinum CPUs can cost serious amounts of money.
So when I get around to fitting it, I will have a pair of 26-core CPUs. What's the best way to handle this with mprime? Perhaps
1) Create 2 workers, each with 24 cores (I don't want to use all 26).
2) Run 2 completely separate instances of mprime, in two different directories.
I don't know if it's possible on Linux, but on Solaris one can force a process to run only on a specific CPU. It would be good to avoid the situation where mprime is using half the cores from one CPU and half from the other. Such a situation could be expected to slow the system down unnecessarily.
Dave
Last fiddled with by drkirkby on 2021-03-15 at 14:02 Reason: To change something - loL
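For what it's worth, Linux does allow a process to be restricted to chosen CPUs, for example with the `taskset` or `numactl` commands, or from Python with `os.sched_setaffinity`. Below is a minimal sketch of option 2, not a tested recipe. It assumes logical CPU ids are numbered contiguously per socket (0-25 on the first socket, 26-51 on the second), which is not guaranteed, so check `lscpu` first. The helper names and the directory names in the usage comment are hypothetical.

```python
import os
import subprocess


def socket_cpu_ids(socket_index, cores_per_socket, cores_to_use):
    """Return the logical CPU ids to use on one socket.

    Assumption: ids are contiguous per socket (socket 0 owns
    0..cores_per_socket-1, socket 1 the next block, and so on).
    Real machines may number CPUs differently; verify with `lscpu`.
    """
    if not 0 < cores_to_use <= cores_per_socket:
        raise ValueError("cores_to_use must be between 1 and cores_per_socket")
    first = socket_index * cores_per_socket
    return set(range(first, first + cores_to_use))


def launch_pinned(command, cpu_ids, workdir):
    """Start `command` in `workdir`, restricted to `cpu_ids` (Linux only)."""
    def pin():
        # Runs in the child process just before exec; pid 0 means "this process".
        os.sched_setaffinity(0, cpu_ids)
    return subprocess.Popen(command, cwd=workdir, preexec_fn=pin)


# Hypothetical usage: one mprime copy per socket, 24 of 26 cores each.
# launch_pinned(["./mprime"], socket_cpu_ids(0, 26, 24), "mprime-socket0")
# launch_pinned(["./mprime"], socket_cpu_ids(1, 26, 24), "mprime-socket1")
```

Pinning CPUs this way does not by itself pin memory to the local socket; `numactl` can bind both.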
#2 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×29×127 Posts |
From one mprime instance, benchmark for various cores/worker combinations. Prime95 can do this automatically. It does fine on dual-Xeon systems.
The optimal throughput varies depending on processor type, other hardware characteristics, and FFT length, and so depends partly on whether you intend to run small exponents (PRP-CF ~11M), modest exponents (DC ~56M), wavefront exponents (~103M), 100Mdigit (~334M), or higher. A rough rule of thumb is to divide the exponent by 17-18 to estimate the FFT length (18 at lower exponents, 17 at higher exponents). On high-core-count processors, the throughput-optimal configuration may have latency so long that assignments expire before completion. With real benchmark data for your hardware, you can make well-informed decisions and be comfortable with them. Some tabulated benchmark examples versus processor type are here (dual-12-core-Xeon in third attachment) and here (68 core Xeon Phi in second attachment).
A word of warning: specifying the full range of FFT lengths and all FFT combinations, or HT too, can result in some VERY LONG benchmark sessions (days in some cases), and also a long time to analyze the results.
Here's sample raw prime95 benchmark output from its results.bench.txt file for the Xeon Phi 7250, for the no-HT case, 1 FFT length:
Code:
Timings for 5120K FFT length (68 cores, 1 worker): 5.21 ms. Throughput: 192.06 iter/sec.
Timings for 5120K FFT length (68 cores, 2 workers): 4.06, 4.10 ms. Throughput: 490.56 iter/sec.
Timings for 5120K FFT length (68 cores, 4 workers): 6.29, 6.24, 6.11, 6.27 ms. Throughput: 642.35 iter/sec.
Timings for 5120K FFT length (68 cores, 17 workers): 23.80, 23.52, 23.33, 23.35, 24.33, 25.03, 23.82, 24.23, 24.46, 24.29, 24.27, 23.79, 24.03, 24.00, 23.62, 23.56, 23.99 ms. Throughput: 709.54 iter/sec.
Timings for 5120K FFT length (68 cores, 34 workers): 45.89, 48.61, 46.82, 48.42, 47.44, 46.91, 47.34, 47.23, 47.28, 47.17, 47.96, 47.35, 48.76, 47.60, 49.13, 50.82, 47.44, 47.60, 46.94, 47.03, 47.09, 47.90, 47.45, 47.97, 49.19, 47.49, 46.88, 47.98, 47.60, 49.41, 47.09, 46.66, 47.24, 47.00 ms. Throughput: 713.54 iter/sec.
Timings for 5120K FFT length (68 cores, 68 workers): 102.78, 100.97, 101.95, 104.04, 104.27, 105.99, 103.54, 102.54, 106.91, 107.87, 102.67, 103.24, 105.16, 103.14, 105.13, 103.62, 106.74, 102.54, 103.12, 102.58, 104.21, 102.72, 105.28, 110.39, 104.62, 104.49, 104.75, 105.55, 102.29, 102.70, 105.06, 104.61, 105.65, 102.77, 103.81, 104.77, 105.76, 105.12, 104.46, 100.85, 104.77, 105.47, 103.91, 103.57, 102.52, 104.28, 105.50, 105.21, 103.34, 102.21, 101.63, 102.90, 104.37, 104.45, 102.36, 104.02, 104.70, 104.00, 107.92, 109.63, 105.08, 104.48, 102.71, 104.59, 105.93, 105.79, 104.77, 104.68 ms. Throughput: 652.05 iter/sec.

Employing HT in prime95 rarely benefits PRP, P-1 or LL. But HT should be enabled in the BIOS, so that all cores can be used in prime95, and miscellaneous tasks run via HT without much reducing prime95 output. (I sometimes run mfactor alongside, with little effect on prime95 throughput if the # of cores used by mfactor is small compared to total core count; mostly it uses HT capability I think.) I have not seen a need to leave some cores unused by prime95, on HT-capable processors, or otherwise. There might be a benefit on unusual-core-count processors whose counts are not very smooth. Something to test for.
Last fiddled with by kriesel on 2021-03-15 at 15:30
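The rule of thumb and the throughput figures above can be sanity-checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, FFT length is expressed in K = 1024, and the latency estimate assumes one test needs roughly as many iterations as the exponent, using a hypothetical 90M exponent for the 5120K case. Recomputing throughput from the rounded per-worker timings gives about 490.2 iter/sec for the 2-worker line, close to but not exactly the reported 490.56.

```python
def estimate_fft_length_k(exponent, divisor):
    """Rough FFT length in K (1024) from the exponent / 17-18 rule of thumb."""
    return exponent / divisor / 1024


def throughput_iter_per_sec(ms_per_iter_by_worker):
    """Total iterations per second, given each worker's ms per iteration."""
    return sum(1000.0 / ms for ms in ms_per_iter_by_worker)


def days_per_test(exponent, ms_per_iter):
    """Approximate days for one test, assuming about `exponent` iterations."""
    return exponent * ms_per_iter / 1000.0 / 86400.0


if __name__ == "__main__":
    # Wavefront exponent (~103M): bracket the FFT length with both divisors.
    low = estimate_fft_length_k(103_000_000, 18)
    high = estimate_fft_length_k(103_000_000, 17)
    print(f"~103M exponent: roughly {low:.0f}K to {high:.0f}K FFT")

    # Timings copied from the 2-worker and 4-worker benchmark lines above.
    print(f"2 workers: {throughput_iter_per_sec([4.06, 4.10]):.1f} iter/sec")
    print(f"4 workers: {throughput_iter_per_sec([6.29, 6.24, 6.11, 6.27]):.1f} iter/sec")

    # Latency trade-off for a hypothetical 90M exponent at 5120K:
    # 1 worker at 5.21 ms/iter versus one of 68 workers at about 104 ms/iter.
    print(f"1 worker:   {days_per_test(90_000_000, 5.21):.1f} days per test")
    print(f"68 workers: {days_per_test(90_000_000, 104.0):.1f} days per test")
```

The last two lines show why the highest-throughput layout is not automatically the best choice: each individual test takes far longer when every worker has only one core.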
#3 |
"Curtis"
Feb 2005
Riverside, CA
5634₁₀ Posts |
If "best" includes power efficiency, you may find a sweet spot where a socket is saturating memory completely at less than the full number of cores, say 8 cores used on a 10-core chip w/ 4 memory channels.
Cutting mprime from 10 threads to 8 in such a circumstance can maintain, say, 95% of mprime performance, while using a bit less power and leaving more machine resources for "real" user work / other tasks that are not very memory hungry.
#4 |
"David Kirkby"
Jan 2021
Althorne, Essex, UK
3×151 Posts |
Quote:
I know that the memory bandwidth in my machine is not good. The PC is designed for CPUs with 6 memory channels, but I believe that the 8167M has only 4 working. An attempt to put DIMMs in more than 4 memory channels results in a computer that will not power up. I do need to check if it’s a motherboard fault or a CPU issue. It’s an OEM CPU for which Intel will release no information. The ideal situation would be to swap out the 8167M CPUs for a more common model, properly supported by Dell and Intel. Unfortunately those CPUs are ridiculously expensive, around $4000 each used. I bought this Dell 7920 tower workstation at $1100. Unfortunately it is a bit of a money pit, as it can be upgraded to a very high specification. One can configure one on the Dell website to cost over $100,000. 3 TB of RAM is quite pricey!
Quote:
* Exponents around 103,000,000
* Two 26-core 2.0 GHz CPUs with 35.75 MB cache
* Concerns about memory bandwidth
* 7 x 32 GB RDIMMs - not enough to present a problem with 2 CPUs as I can use 3 channels on each CPU.
Can you suggest some sensible things to try with the benchmarks, without going through every possible combination of workers, threads and FFTs, which would take days to run?
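One way to cut the benchmark list down, sketched here as an assumption rather than a recommendation from the thread, is to test only layouts in which every worker has the same number of cores and no worker spans both CPUs. The function name is hypothetical; it simply enumerates the worker counts that divide the cores on one socket evenly.

```python
def socket_aligned_layouts(sockets, cores_per_socket):
    """List (total_workers, cores_per_worker) pairs where no worker spans sockets.

    Only worker counts that divide the per-socket core count evenly are kept,
    so every worker gets the same number of cores on a single CPU.
    """
    layouts = []
    for workers_per_socket in range(1, cores_per_socket + 1):
        if cores_per_socket % workers_per_socket == 0:
            layouts.append((sockets * workers_per_socket,
                            cores_per_socket // workers_per_socket))
    return layouts


if __name__ == "__main__":
    # All 26 cores per CPU: 26 has few divisors, so few even layouts exist.
    print(socket_aligned_layouts(2, 26))
    # Using 24 of the 26 cores per CPU gives more even splits to try.
    print(socket_aligned_layouts(2, 24))
```

With 26 cores per CPU the only even splits are 2, 4, 26 or 52 workers in total, whereas 24 cores per CPU allows 2, 4, 6, 8, 12, 16, 24 or 48, which is one reason the core count's smoothness matters when choosing what to benchmark.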
#5 |
"David Kirkby"
Jan 2021
Althorne, Essex, UK
3×151 Posts |
Quote:
There’s a video on YouTube https://youtu.be/jP65i_Iqml8 where the presenter says that the Dell 7920 and an HP workstation can equally claim to be the world’s most powerful workstation. Unfortunately, one can only get that power if one sinks a hell of a lot of money into the machine. I don’t have that sort of money. To make best use of the two decent CPUs I would need to add 5 more RAM modules at around $250 each new. Better CPUs cost $1000-$4000 each used. I am not even going to consider the very best CPUs.
Last fiddled with by drkirkby on 2021-03-15 at 20:41
#6 |
Sep 2002
Database er0rr
2³·3·11·17 Posts |
Quote:
Last fiddled with by paulunderwood on 2021-03-15 at 21:57 |
#7 |
"David Kirkby"
Jan 2021
Althorne, Essex, UK
3×151 Posts |
Quote:
There’s a video on YouTube https://youtu.be/jP65i_Iqml8 where the presenter says that the Dell 7920 and an HP one can equally claim to be the world’s most powerful workstation. Unfortunately, one can only get that power if one sinks a hell of a lot of money into the machine. I don’t have that sort of money. To make best use of the two decent CPUs I would need to add 5 more RAM modules at around $250 each new. Better CPUs cost $1000-$4000 each used. I am not even going to consider the very best CPUs 😢😢😢
#8 |
"David Kirkby"
Jan 2021
Althorne, Essex, UK
705₈ Posts |
When you say “two instances”, do you mean 2 workers in the one process, or do you mean having two mprime processes running in different directories?
#9 |
Sep 2002
Database er0rr
2³·3·11·17 Posts |
Quote:
I think mprime and the Linux scheduler are smart enough to set the right affinities. Last fiddled with by paulunderwood on 2021-03-15 at 22:21 |
#10 |
Mar 2021
116 Posts |
I think on such a serious issue you should contact the computer service!
#11 |
"Curtis"
Feb 2005
Riverside, CA
13002₈ Posts |
We *are* the computer service.
Thread | Thread Starter | Forum | Replies | Last Post |
Identical PenD Dual-Core CPUs, one is 2.25x faster | NBtarheel_33 | Hardware | 5 | 2008-11-12 03:24 |
Dual P-III using mprime 23.5 vs 24.11 | Carlos | Software | 2 | 2005-08-02 19:00 |
Dual CPUs and Hyperthreading | Unregistered | Hardware | 34 | 2004-09-27 08:56 |
Best configuration for linux + dual P4 Xeon + hyperthreading | luma | Software | 3 | 2003-03-28 10:26 |
Multiple systems/multiple CPUs. Best configuration? | BillW | Software | 1 | 2003-01-21 20:11 |