mersenneforum.org Increasing memory channels, but not RAM, slows mprime.

2021-07-09, 13:59   #12
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

3×19×113 Posts

Quote:
 Originally Posted by drkirkby There's 35.75 MB of L3 cache per CPU. The 2400 MHz is a limitation of the CPU - other CPUs in the Xeon Gold or Platinum range run the RAM up to 2933 MHz, but they are quite expensive CPUs, whereas these CPUs are quite cheap. I've benchmarked more workers (I tried 1, 2, 3, ..., 52), but 4 workers give optimal throughput.
Are these the same CPUs as https://www.ebay.co.uk/itm/154497112899 ? I was expecting there to be a catch, if they work well I'll pick myself up a pair. I've got a Supermicro Skylake system which has 4114s in it at the moment.

2021-07-09, 14:38   #13
Uncwilly
6809 > 6502

"""""""""""""""""""
Aug 2003
101×103 Posts

10,009 Posts

Quote:
 Originally Posted by drkirkby Unsurprisingly reducing the number of cores per worker from 26 to 13 increased the iteration time further.
On VBCurtis's behalf:
2

2021-07-09, 14:42   #14
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2×11×347 Posts

Prime95 is not NUMA-aware. Perhaps you need to run two instances of prime95 with some kind of OS command instructing each prime95 instance to allocate memory from different memory banks. I've no idea what that OS command would be in Windows.
2021-07-09, 14:54   #15
axn

Jun 2003

19·271 Posts

Quote:
 Originally Posted by Prime95 Prime95 is not NUMA-aware. Perhaps you need to run two instances of prime95 with some kind of OS command instructing each prime95 instance to allocate memory from different memory banks. I've no idea what that OS command would be in Windows.
I believe he uses Ubuntu (or some flavor of Linux), so taskset (along with Affinity setting in mprime) should work.

Last fiddled with by axn on 2021-07-09 at 14:54
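For concreteness, a minimal sketch of that approach on Linux. The instance directories (inst0/, inst1/) and the core ranges are assumptions for a dual 26-core box like the one in this thread, not taken from anyone's actual setup, and the commands are printed rather than executed:

```shell
#!/bin/sh
# Hypothetical sketch: one mprime instance per CPU package (NUMA node).
# Assumes numactl is installed and each instance has its own copy of
# mprime in ./inst0 and ./inst1 so their work files don't collide.
# On a dual 26-core box, node 0 is cores 0-25 and node 1 is cores 26-51.

# numactl binds both the cores and the memory allocations of each
# instance to one package, so neither touches "foreign" memory:
NODE0_CMD="numactl --cpunodebind=0 --membind=0 ./inst0/mprime -d"
NODE1_CMD="numactl --cpunodebind=1 --membind=1 ./inst1/mprime -d"

# taskset pins cores only and leaves memory policy to the OS, so numactl
# is the better fit for the memory-bank problem George describes:
ALT0_CMD="taskset -c 0-25 ./inst0/mprime -d"
ALT1_CMD="taskset -c 26-51 ./inst1/mprime -d"

# Print rather than exec, since this is only a sketch:
echo "$NODE0_CMD"
echo "$NODE1_CMD"
```

With first-touch allocation on Linux, plain taskset usually keeps memory local anyway, but --membind makes it explicit.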

2021-07-09, 15:00   #16
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

3×149 Posts

Quote:
 Originally Posted by fivemack Are these the same CPUs as https://www.ebay.co.uk/itm/154497112899 ? I was expecting there to be a catch, if they work well I'll pick myself up a pair. I've got a Supermicro Skylake system which has 4114s in it at the moment.
Yes, they are the same CPUs, although I paid less than that: £300 each. They gave a massive improvement in performance over a single Silver 4110, for not a huge outlay. If you want, I can dig out the email address of the seller I bought them from; you might be able to get a better deal than on eBay. PM me if you want. The single-threaded performance of the 8167M is pretty poor according to PassMark, but you get a lot of cores for the money. If you don't mind spending a bit more, the 8171M appears to offer a lot more performance. The 8171M will not work in mainstream machines from Dell, IBM, Lenovo etc., but there's a good chance it would work on your Supermicro motherboard.

I intend to swap out the 8167Ms at a later date for higher-performance CPUs when prices fall. Currently the fast Gold or Platinum CPUs with a lot of cores are very expensive, but the 8167M offers a lot of bang for the buck.

If you only want the performance for GIMPS, a fast graphics card might be a better bet. Their prices are currently well above the manufacturers' recommended retail prices, but they are falling a lot now.

2021-07-09, 17:44   #17
ATH
Einyen

Dec 2003
Denmark

52×127 Posts

I'm not sure you can call this 12-channel RAM. It is 2 CPUs with 6 channels each. You should definitely run different tests on each physical CPU, and each test will get its own 6-channel RAM. But if you run 1 single test on both CPUs, I do not think that test benefits from 12-channel RAM. I could easily be wrong, though; I'm not familiar with this modern hardware.
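A back-of-envelope check of the bandwidth at stake in the post above. The figures (DDR4-2400, 64-bit bus per channel, 6 channels per socket) are assumptions matching the hardware discussed in this thread, not measurements:

```python
# Rough peak-bandwidth arithmetic for one 6-channel DDR4-2400 socket.
channels_per_socket = 6
transfer_rate = 2400e6        # DDR4-2400: 2.4e9 transfers/s per channel
bytes_per_transfer = 8        # 64-bit data bus per channel

per_socket_gb_s = channels_per_socket * transfer_rate * bytes_per_transfer / 1e9
print(per_socket_gb_s)        # 115.2 GB/s of local bandwidth per socket
print(2 * per_socket_gb_s)    # 230.4 GB/s total, but only if each test's
                              # memory is local to the socket running it
```

A single test whose memory sits on one node still sees at most that node's 115.2 GB/s locally, plus slower cross-socket traffic, which is consistent with the doubt expressed above.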
2021-07-09, 17:57   #18
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22×1,447 Posts

Prime95 deals well with dual-package systems in my opinion. I've run (in a single instance), analyzed, and posted prime95 benchmarks on a variety of single- and dual-package systems (up to dual-12-core, but no dual-26-core beasts) versus number of workers, FFT length, and HT vs. not; see the attachments of
https://www.mersenneforum.org/showpo...18&postcount=4
https://www.mersenneforum.org/showpo...19&postcount=5
https://www.mersenneforum.org/showpo...4&postcount=11
2021-07-09, 19:55   #19
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

6778 Posts

Quote:
 Originally Posted by ATH I'm not sure you can call this 12-channel RAM. It is 2 CPUs with 6 channels each. You should definitely run different tests on each physical CPU, and each test will get its own 6-channel RAM. But if you run 1 single test on both CPUs, I do not think that test benefits from 12-channel RAM. I could easily be wrong, though; I'm not familiar with this modern hardware.
I never did call it 12-channel RAM - I just wrote that 12 memory channels were in use. As you say, it's dual CPUs, with each CPU having 6 memory channels.

I will have to run some more benchmarks, but I have some real work to do this weekend. There's a rather important football match taking place on Sunday too.

Last fiddled with by drkirkby on 2021-07-09 at 19:58

2021-07-10, 15:05   #20
phillipsjk

Nov 2019

1058 Posts

Quote:
 Originally Posted by kriesel Prime95 deals well with dual-package systems in my opinion. I've run (in a single instance), analyzed, and posted prime95 benchmarks on a variety of single- and dual-package systems (up to dual-12-core, but no dual-26-core beasts) versus number of workers, FFT length, and HT vs. not; see the attachments of https://www.mersenneforum.org/showpo...18&postcount=4 https://www.mersenneforum.org/showpo...19&postcount=5 https://www.mersenneforum.org/showpo...4&postcount=11
I looked at "dual-12-core%20e5-2697v2%20roa%20performance.pdf", and it does not mention running two instances, each locked to a specific CPU (using the Worker affinity setting in undocumented.txt), vs. one instance possibly accessing ["foreign" memory].

When I was trying to mine Monero on my quad CPU system, the mining software would occasionally error out with a page fault until I started running 1 instance per CPU.

For P-1 factoring work, I am running 4 instances so that each CPU gets its own pool of memory to allocate to its own workers (again avoiding "foreign" memory access). The server does not let me give each CPU its own name, though, so the resulting stats are wonky.

Edit: I think the [tables] on [pages 1 and 3 are] supposed to be showing the penalty for "foreign" memory access [under the "Straddles chips?" (yes) heading]. Bolded numbers on the left-hand side appear to be the "best" times. The percentages appear to be the approximate reduction in performance for each setting.

Last fiddled with by phillipsjk on 2021-07-10 at 15:27 Reason: fixed wording.

2021-07-10, 16:39   #21
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22×1,447 Posts

Quote:
 Originally Posted by phillipsjk I looked at "dual-12-core%20e5-2697v2%20roa%20performance.pdf", and it does not mention running two instances, each locked to a specific CPU (using the Worker affinity setting in undocumented.txt), vs. one instance possibly accessing ["foreign" memory].
You're right, it does not mention what was not done or attempted. All the benchmarking was done in a single instance, as previously posted. Also it presumes using all cores is best for throughput, and did not benchmark reduced-core-count cases.

Benchmarking such things as 3 workers on a dual-package system in a single instance gives slower total throughput. George has stated in the past that prime95 segregates the threads of a worker onto a single CPU, not straddling dual packages, for example. (I don't know how that squares with "not NUMA aware".) That would put two workers onto one cpu, and leave an entire cpu for one of the 3 workers. And benchmark results, IIRC, were consistent with that.
If it did not do that, some threads & cores of a worker would be distant from others, with possible consequent performance loss.

There are some practical issues with attempting to benchmark with more than one prime95 instance. Desynchronization of fft lengths and subcases between the instances is one that comes to mind.
On Windows, specifying NUMA Node in the start command is more of a recommendation the OS is permitted to deviate from, than a definite mandatory specification. From Windows 10's "start /?" command help output:
Code:
    NODE        Specifies the preferred Non-Uniform Memory Architecture (NUMA)
                node as a decimal integer.
The process is restricted to running on these processors.

The affinity mask is interpreted differently when /AFFINITY and
/NODE are combined.  Specify the affinity mask as if the NUMA
node's processor mask is right shifted to begin at bit zero.
The process is restricted to running on those processors in
common between the specified affinity mask and the NUMA node.
If no processors are in common, the process is restricted to
running on the specified NUMA node.

Specifying /NODE allows processes to be created in a way that leverages memory
locality on NUMA systems.  For example, two processes that communicate with
each other heavily through shared memory can be created to share the same
preferred NUMA node in order to minimize memory latencies.  They allocate
memory from the same NUMA node when possible, and they are free to run on
processors outside the specified node.

start /NODE 1 application1.exe
start /NODE 1 application2.exe

These two processes can be further constrained to run on specific processors
within the same NUMA node.  In the following example, application1 runs on the
low-order two processors of the node, while application2 runs on the next two
processors of the node.  This example assumes the specified node has at least
four logical processors.  Note that the node number can be changed to any valid
node number for that computer without having to change the affinity mask.

start /NODE 1 /AFFINITY 0x3 application1.exe
start /NODE 1 /AFFINITY 0xc application2.exe
So presumably testing with a prime95 instance per package would require specifying affinity to all of node1's cores on one instance, and to all of node2's cores on the second, if that is possible.
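As a sketch of that last step, the masks could be computed like this. The 26-cores-per-package figure is an assumption matching the CPUs discussed in this thread; per the start /? text quoted above, each /AFFINITY mask is interpreted relative to its own node's processor mask, so both instances get the same mask and differ only in /NODE:

```python
# Hypothetical sketch: node-relative /AFFINITY masks for one prime95
# instance per package on a dual 26-core system.
CORES_PER_PACKAGE = 26   # assumption: matches the dual Xeon 8167M machine

def node_relative_mask(cores: int) -> str:
    """Mask selecting the low-order `cores` logical processors of one node."""
    return hex((1 << cores) - 1)

mask = node_relative_mask(CORES_PER_PACKAGE)
print(f"start /NODE 0 /AFFINITY {mask} prime95.exe")   # 0x3ffffff
print(f"start /NODE 1 /AFFINITY {mask} prime95.exe")
# With HT enabled there would be 52 logical processors per node, and the
# mask would be node_relative_mask(52) instead.
```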

In practice, I find with single-instance multi-package prime95 benchmarking, that number of workers = n * number of packages benchmarks best for total throughput, with n a small integer changing with fft length.
Throughput-optimal parameters are not always entirely practical. It does little good to tune to a high worker count for the last few % of throughput if the primality-test assignments expire before completion and the progress made is wasted. (Especially now with PRP & proof, where a first test is not followed by a full double check.) Latency less than the expiration time is a constraint. At small fft lengths, the Xeon Phi 7250 benchmarks best nominal total throughput with dozens of workers, but latency is an issue.

Quite a while ago, Madpoo posted results for a different case, optimizing for latency of primality testing a single exponent, such as for verifying a new prime discovery where efficiency is less important than speed, on a dual package system (~dual-18-core?). Max speed on a single exponent was around (an entire cpu package plus ~6 cores of the other package), with the rest of the cores in the second cpu package left idle; adding more cores from package two slowed it.

Last fiddled with by kriesel on 2021-07-10 at 16:55

2021-07-10, 17:17   #22
axn

Jun 2003

19×271 Posts

Quote:
 Originally Posted by kriesel There are some practical issues with attempting to benchmark with more than one prime95 instance. Desynchronization of fft lengths and subcases between the instances is one that comes to mind.
This shouldn't be much of an issue. In theory, if you do this right, neither instance will have any impact on the other one, since they won't be sharing any resources (cores/cache/RAM). So it wouldn't matter if the benchmarks don't exactly sync up.
