mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Software
Old 2017-09-10, 18:07   #1
kruoli
 
 
"Oliver"
Sep 2017
Porta Westfalica, DE

2⁸×3 Posts

NUMA awareness

Hey, everybody!

On my new Threadripper build, running 16 workers doing LLD (all 2400K FFT), the timings on its two NUMA nodes differ substantially (averages are 25 ms and 40 ms). It seems like prime95 is allocating all the memory on one NUMA node and then accessing it across the Infinity Fabric. Is my assumption correct? Is there a way to get around that?

Greetings from Germany,
Oliver!
Old 2017-09-10, 19:23   #2
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

7,691 Posts

Prime95 is not NUMA-aware, especially when it comes to allocating memory.

The only work-around I can think of would be to run two instances of prime95. Use undoc.txt entries to force each instance to think it has only 8 threads available. Then use OS tools to somehow bind each prime95 to a NUMA node -- don't ask me how, or if this is even possible.
Old 2017-09-10, 20:54   #3
VictordeHolland
 
 
"Victor de Hollander"
Aug 2011
the Netherlands

10010011000₂ Posts

Could you try running 4 workers with 4 threads/cores each? AMD Threadripper has four core complexes (CCXs), each with 4 cores and its own slice of L3 cache.

Definitely following this thread!
Old 2017-09-11, 01:33   #4
ewmayer
2ω=0
 
 
Sep 2002
República de California

2D9C₁₆ Posts

Quote:
Originally Posted by Prime95 View Post
Prime95 is not NUMA-aware, especially when it comes to allocating memory.

The only work-around I can think of would be to run two instances of prime95. Use undoc.txt entries to force each instance to think it has only 8 threads available. Then use OS tools to somehow bind each prime95 to a NUMA node -- don't ask me how, or if this is even possible.
Is there no way to set a logical-core index offset (or however you refer to it) for a prime95 instance?

Using Mlucas, and taking into account AMD's logical-core numbering convention, here is how to run (say) an 8-threaded job on each of the 2 nodes with 1 thread per physical core (that is the '2' increment in the lo:hi:incr triplet argument to the -core flag):

Job1: ./Mlucas -core 0:15:2 & [Sets affinity to logical cores 0,2,4,6,8,10,12,14]
Job2: ./Mlucas -core 16:31:2 & [Sets affinity to logical cores 16,18,20,22,24,26,28,30]

Not as automated, but very flexible.

How difficult would it be for you to add a 'starting at...' index option to your code's user interface? (Perhaps as an advanced-menu option, since multi-socket system owners tend to be 'power users'.)

Last fiddled with by ewmayer on 2017-09-11 at 01:35
Old 2017-09-11, 03:11   #5
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

17013₈ Posts

In prime95 v29.2, you can easily assign specific logical CPUs to workers.

What you cannot do is allocate memory in a NUMA-aware manner. I'm assuming from the OP's question that this is his performance problem.
Old 2017-09-11, 18:11   #6
bgbeuning
 
Dec 2014

3×5×17 Posts

On Windows, when a thread allocates memory, the OS fills the allocation from the same NUMA node as the CPU the thread is running on. If thread affinity is used to bind a thread to a CPU, then things work out for prime95. There are some cases where prime95 does not use thread affinity, and then a NUMA machine gets slow results.

The above assumes prime95 is using VirtualAlloc() to get the memory. If it uses malloc(), it may or may not work: if malloc() already has free space on hand, that space might be on the wrong node; if malloc() needs to call VirtualAlloc() to get the space, then all is good.
Old 2017-09-11, 18:56   #7
kruoli
 
 
"Oliver"
Sep 2017
Porta Westfalica, DE

2⁸×3 Posts

Quote:
Originally Posted by VictordeHolland View Post
Could you try running 4 workers with 4 threads/cores each? AMD Threadripper has four core complexes (CCXs), each with 4 cores and its own slice of L3 cache.

Definitely following this thread!
AFAIK, Threadripper has four Ryzen dies, of which two are disabled (or are simply dummy/decoy units). Epyc server processors have all four dies enabled. That is why, I guess, the performance differs so extremely between node 0 (logical cores #0 to #15) and node 1 (logical cores #16 to #31).
Old 2017-09-11, 18:58   #8
kruoli
 
 
"Oliver"
Sep 2017
Porta Westfalica, DE

2⁸×3 Posts

Quote:
Originally Posted by Prime95 View Post
In prime95 v29.2, you can easily assign specific logical CPUs to workers.

What you cannot do is allocate memory in a NUMA-aware manner. I'm assuming from the OP's question that this is his performance problem.
Exactly!

I could not figure out how to start a process with its CPU affinity already set, so that it would not allocate any memory on the wrong node at startup.
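For what it's worth, cmd.exe's built-in start command can launch a process with a preferred NUMA node and an affinity mask already applied, before the process allocates anything. A sketch only (untested on Threadripper; the install paths are hypothetical, and it assumes each node holds 16 logical CPUs, so the mask FFFF selects all CPUs of the given node):

```
:: one prime95 instance per NUMA node; /AFFINITY is a hex CPU mask,
:: interpreted relative to the node when /NODE is also given
start /NODE 0 /AFFINITY FFFF C:\p95a\prime95.exe
start /NODE 1 /AFFINITY FFFF C:\p95b\prime95.exe
```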
Old 2017-09-11, 19:30   #9
airsquirrels
 
 
"David"
Jul 2015
Ohio

11×47 Posts

numactl is useful on KNL for forcing memory allocation for specific processes onto specific nodes. I'm not sure whether it works on Threadripper; I haven't had a chance to test.
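On Linux, the two-instance idea from earlier in the thread would then look something like this with numactl (a sketch; the two mprime install directories are hypothetical):

```
# pin each instance's CPUs *and* its memory to one node
numactl --cpunodebind=0 --membind=0 ~/p95a/mprime &
numactl --cpunodebind=1 --membind=1 ~/p95b/mprime &
```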
Old 2017-09-11, 19:39   #10
kruoli
 
 
"Oliver"
Sep 2017
Porta Westfalica, DE

2⁸·3 Posts

Quote:
Originally Posted by airsquirrels View Post
numactl is useful on KNL for forcing memory allocation for specific processes onto specific nodes. I'm not sure whether it works on Threadripper; I haven't had a chance to test.
Currently I'm running Windows 10, so numactl is not available (as far as I know). Judging by y-cruncher runs on that machine under some Linux flavour, it certainly does help, but I have not tested it myself yet.
Old 2017-09-11, 20:28   #11
Mysticial
 
 
Sep 2016

2³·43 Posts

Quote:
Originally Posted by kruoli View Post
Currently I'm running Windows 10, so numactl is not available (as far as I know). Judging by y-cruncher runs on that machine under some Linux flavour, it certainly does help, but I have not tested it myself yet.
AFAICT there is indeed no Windows equivalent of numactl, so there's no way to control where memory is allocated at startup. But if you start the program and then bind the entire process to specific nodes, all the memory allocated from that point on will (usually) stick to those nodes thanks to the first-touch policy.

If Prime95 doesn't allocate much performance-critical memory on startup, this approach "should" work if you intend to run a separate Prime95 instance on each NUMA node.

(Just be aware that binding a process to specific cores via Task Manager doesn't always work, since the program can override it by setting its own affinities.)
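The bind-after-start approach can also be scripted instead of done through Task Manager, e.g. from PowerShell (a sketch; the affinity value is a bitmask of logical CPUs, here CPUs 0-15, and the path is hypothetical). Note this sets CPU affinity only; the memory then follows via first touch:

```
# start prime95, then restrict the whole process to logical CPUs 0-15
$p = Start-Process -FilePath C:\p95a\prime95.exe -PassThru
$p.ProcessorAffinity = 0xFFFF
```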

If you run the latest y-cruncher on Windows, it will (by default) detect NUMA and do some sort of manual node interleaving.