mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   LLR Affinity Problem (https://www.mersenneforum.org/showthread.php?t=26037)

carpetpool 2020-10-03 21:16

LLR Affinity Problem
 
1 Attachment(s)
I know there's a way to run several LLR instances and have each assigned to a different designated CPU core, so that testing runs significantly faster than with only one instance.

I am using a [URL="https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html"]4 core, 8 thread[/URL] CPU. In the attachment I sent, one instance of LLR is running with only one thread, and time per bit is 0.576 ms. The CPU affinity is set to 0.

After terminating the program, I copy the LLR executable to another directory and run a test on a number of similar size to the first run (one thread). The CPU affinity is set to 1.

I check on the first run and notice the time has increased to 1.172 ms per bit, almost twice that of running only one LLR application! No speedup whatsoever.

My goal is to run 4 instances of LLR with a similar time per bit as running only one single-threaded instance of LLR (4 instances each running at close to 0.576 ms per bit, so that testing is 4x faster). Does anyone know what I am doing wrong here?

I am aware that running a single instance with 8 threads is less productive than running 4 single-threaded instances, but for some reason I have never figured out how to achieve the latter.

Thanks for help!

paulunderwood 2020-10-03 21:38

Running only one instance has all the cache to itself and will run quicker than running two instances, where there will be contention for cache. On a 4c/8t box I run one instance with the -t4 option. I think this approach is cache friendlier.

VBCurtis 2020-10-03 22:06

In Windows, are cores 0 and 1 hyperthreads of the same physical core? That would explain your timing exactly doubling.

What happens when you assign the second LLR copy to core 2 rather than 1?

Have you tried not assigning affinity? I've had decent luck just letting Windows utilize the cores- manually assigning affinity does help sometimes, but for this use case I'm not sure it matters for you.
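
To make the question about cores 0 and 1 concrete, here is a minimal sketch. It assumes that Windows numbers the two hyperthreads of each physical core adjacently (logical CPUs 0 and 1 on core 0, 2 and 3 on core 1, and so on). That layout is an assumption, not something confirmed in this thread for this CPU, and the function names are illustrative only.

```python
def physical_core(logical_cpu, threads_per_core=2):
    """Map a logical CPU index to its physical core, assuming adjacent sibling numbering."""
    return logical_cpu // threads_per_core


def one_cpu_per_core(logical_cpus, threads_per_core=2):
    """Pick the first logical CPU of every physical core."""
    return list(range(0, logical_cpus, threads_per_core))


print(physical_core(0), physical_core(1))  # same physical core under this assumption
print(one_cpu_per_core(8))                 # affinity candidates for a 4c/8t CPU
```

If the numbering assumption holds, affinities 0 and 1 share one physical core, which would be consistent with the doubled time per bit.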

carpetpool 2020-10-04 22:39

Thanks for the suggestions! I ran 4 subsequent instances of LLR --- assigning affinity to CPUS 0, 2.

The time increased by about 0.120 ms which I guess makes sense given that more cores means slower clock speed.

I loaded up 4 instances running on CPUS 0, 2, 4, 6 and the time per bit almost doubled --- (a 0.380 ms increase).

I think Paul is right --- running four threads on one instance seems to be faster than running 4 instances single threaded.

I would think that with a larger number of cores, say 12 or 16, the latter might become slower?
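
The timings quoted so far can be turned into aggregate throughput with a little arithmetic. This is only a sketch: it assumes the 0.380 ms figure is an increase over the 0.576 ms baseline from the first post and that every instance runs at the same speed. It says nothing about a single 4-threaded instance, for which no timing is given in the thread.

```python
def relative_throughput(instances, ms_per_bit, baseline_ms_per_bit):
    """Aggregate bits per ms of `instances` copies, relative to one copy at the baseline."""
    return (instances / ms_per_bit) / (1 / baseline_ms_per_bit)


baseline = 0.576  # ms per bit, one single-threaded instance

runs = {
    "2 instances on CPUs 0 and 1": (2, 1.172),
    # Assumption: the reported increase is added to the baseline time.
    "4 instances on CPUs 0, 2, 4, 6": (4, baseline + 0.380),
}

for label, (count, ms) in runs.items():
    ratio = relative_throughput(count, ms, baseline)
    print(f"{label}: {ratio:.2f}x the single-instance throughput")
```

Under these assumptions, each instance slows down when four run together, but the combined output is still well above that of one instance.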

paulunderwood 2020-10-04 23:30

I don't know about 12 core chips running LLR, but generally it makes sense to run 1 instance per chip or chiplet.

VBCurtis 2020-10-05 05:32

[QUOTE=paulunderwood;558896]I don't know about 12 core chips running LLR, but generally it makes sense to run 1 instance per chip or chiplet.[/QUOTE]

My experience, mostly on Haswell-era desktops, is that LLR doesn't benefit much from splitting small FFTs on to multiple threads. 128K per thread seems to be a good cutoff- so for OP's example 192K FFT, I doubt running two 2-threaded instances would be faster than four 1-threaded.

Once FFT reaches 256K, 2-threaded runs work pretty well.

OP- I've run LLR on this size of number on prebuilt machines with slow 2-channel memory, and running 3 instances was just about as fast as 4 but generated quite a bit less heat. That is, 3 is enough to saturate the memory on some quad-core machines. It takes some experimenting with threads-per-process and number of processes to find the sweet spot!
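
For the experimenting with threads-per-process and number of processes, a small sketch that lists the combinations worth timing on a machine with a given number of physical cores. It only enumerates candidates; it does not run LLR, and the function name is made up for illustration.

```python
def candidate_configs(physical_cores):
    """Yield (processes, threads_per_process) pairs that use at most the given cores."""
    for threads in range(1, physical_cores + 1):
        for processes in range(1, physical_cores // threads + 1):
            yield processes, threads


for processes, threads in candidate_configs(4):
    print(f"{processes} process(es) x {threads} thread(s)")
```

On a quad-core this includes the cases mentioned above: three or four 1-threaded processes, two 2-threaded processes, and one 4-threaded process.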

henryzz 2020-10-05 08:27

[QUOTE=VBCurtis;558917]OP- I've run LLR on this size of number on prebuilt machines with slow 2-channel memory, and running 3 instances was just about as fast as 4 but generated quite a bit less heat. That is, 3 is enough to saturate the memory on some quad-core machines. It takes some experimenting with threads-per-process and number of processes to find the sweet spot![/QUOTE]

Better still might be to reduce the cpu speed to match the memory throughput and still use all cores. Generally lower speeds need less power/cycle. Experimentation might be needed there.

rogue 2022-06-16 12:33

Back to the original question. With the new Intel CPUs, Windows seems to run llr on the efficiency cores by default, not the performance cores. I would like to run llr only on the performance cores. Note that I am using PRPNet, so llr is run once for each PRP/primality test. PRPNet sets Affinity= in the llr.ini file, but that is not being respected.

kruoli 2022-06-16 19:50

1 Attachment(s)
For this problem, I have written a simple program. It is attached with source code (since the executable must be started as administrator).

rogue 2022-06-16 20:24

That isn't helpful because I cannot run that every time I start llr. llr should respect the affinity, but maybe it requires llr to be run as administrator to set affinity.

kruoli 2022-06-16 20:36

For your request that LLR will respect it by itself, I definitely concur. If a process wants to limit its own affinity, it can do it without further privileges (as of Windows 10, at least).

My program only needs to be started once. It will monitor all started processes and apply affinity in accordance with the parameters the program was started with.
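
As an illustration of a process limiting its own affinity without extra privileges, here is a minimal sketch using Python's os.sched_setaffinity. Note the platform difference: that call is available on Linux, not on Windows, where the corresponding Win32 function is SetProcessAffinityMask. This is not how LLR or the attached program does it; the helper name is hypothetical.

```python
import os


def pin_self_to(cpus):
    """Restrict the calling process to the given logical CPUs and return the new set."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available on this platform")
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently use
    print(pin_self_to({min(allowed)}))  # narrow to a single allowed CPU
    pin_self_to(allowed)                # restore the original set
```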

rogue 2022-06-17 02:28

[QUOTE=kruoli;607995]For your request that LLR will respect it by itself, I definitely concur. If a process wants to limit its own affinity, it can do it without further privileges (as of Windows 10, at least).

My program only needs to be started once. It will monitor all started processes and apply affinity in accordance with the parameters the program was started with.[/QUOTE]

If I have multiple copies of llr, does it set affinity based upon the runtime directory of llr?

kruoli 2022-06-17 08:27

At the program's start, it will build a queue of cores available to LLR. When detecting a newly spawned process, it will assign it to a core and mark that core as used. This gets undone if the LLR process ends.

So it is not about the directory, but multiple processes are definitely supported. I found no advantage in assigning affinities by the working directory. It is more important to have one physical core per lifetime of a process.
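
The bookkeeping described here (a queue of available cores, one core per process, freed again when the process ends) can be sketched as follows. This is not the source of the attached program; the class and the parameters first, increment and max_workers are assumptions chosen for illustration, and no real affinity is set.

```python
from collections import deque


class CorePool:
    """Hand out one logical core per process and take it back when the process ends."""

    def __init__(self, first, increment, max_workers):
        # Queue of cores that may be given to new processes.
        self.free = deque(first + k * increment for k in range(max_workers))
        self.in_use = {}  # pid -> core

    def assign(self, pid):
        """Return the core for a newly spawned process, or None if all cores are taken."""
        if pid in self.in_use:
            return self.in_use[pid]
        if not self.free:
            return None
        core = self.free.popleft()
        self.in_use[pid] = core
        return core

    def release(self, pid):
        """Mark the core of a finished process as free again and return it."""
        core = self.in_use.pop(pid, None)
        if core is not None:
            self.free.append(core)
        return core


pool = CorePool(first=1, increment=2, max_workers=3)
print(pool.assign(100), pool.assign(101))  # two processes get two different cores
pool.release(100)                          # the first process ends
print(pool.assign(102))                    # a new process gets a free core
```

Keeping the pid-to-core table is what lets each process stay on one physical core for its whole lifetime.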

rogue 2022-06-17 14:51

[QUOTE=kruoli;608013]At the program's start, it will build a queue of cores available to LLR. When detecting a newly spawned process, it will assign it to a core and mark that core as used. This gets undone if the LLR process ends.

So it is not about the directory, but multiple processes are definitely supported. I found no advantage in assigning affinities by the working directory. It is more important to have one physical core per lifetime of a process.[/QUOTE]

Can I limit your app to the cores I want it to assign the llr processes to? Since the computer has hyper threading and efficiency cores, I would only want to direct llr to one of each pair of hyper threaded cores and to avoid the efficiency cores.

kruoli 2022-06-17 18:25

Yes, take a look at the help the program generates ([C]-h[/C] switch). You can set[LIST][*]the first core to assign to.[*]the increment (on Windows, this would be 2 when having HT).[*]the maximum worker count.[/LIST]

rogue 2022-06-18 13:25

I d/l'd your program, ran it in a command window (started as administrator). Used -h. Nothing. It doesn't output anything.

kruoli 2022-06-18 13:38

You are correct, I could not believe it but tested it myself. That's dumb. I have not tested this before I uploaded it, sorry. Running it without parameter will give this (this is tested):
[CODE]Usage:
AffinitySetter [options] {process name}

Options:
-h Shows help and exits.
-i {number} Increment of cores to set affinity to (default: 2).
-m {number} Maximum number of processes to set affinity of at one given time (default: number of cores / 2).
-q Hides the console window.
-s {number} Zero-based first core to set affinity to (default: 1).
-t {number} Thread count of each process (default: 1).

Remarks: This program assumes HT. If not, you will have to specify -i, -m and -s manually.
The program will only work up to 64 logical cores.[/CODE]

rogue 2022-06-18 17:14

[QUOTE=kruoli;608070]You are correct, I could not believe it but tested it myself. That's dumb. I have not tested this before I uploaded it, sorry. Running it without parameter will give this (this is tested):
[CODE]Usage:
AffinitySetter [options] {process name}

Options:
-h Shows help and exits.
-i {number} Increment of cores to set affinity to (default: 2).
-m {number} Maximum number of processes to set affinity of at one given time (default: number of cores / 2).
-q Hides the console window.
-s {number} Zero-based first core to set affinity to (default: 1).
-t {number} Thread count of each process (default: 1).

Remarks: This program assumes HT. If not, you will have to specify -i, -m and -s manually.
The program will only work up to 64 logical cores.[/CODE][/QUOTE]

So if I have 7 copies of llr running, what parameters do I use to ensure that they are running on the performance cores? That assumes that performance cores (with HT) are 0-15.

paulunderwood 2022-06-18 18:55

If you turn off the energy efficient cores in BIOS, does your motherboard allow the performance cores to use AVX512?

rogue 2022-06-18 20:48

[QUOTE=paulunderwood;608079]If you turn off the energy efficient cores in BIOS, does your motherboard allow the performance cores to use AVX512?[/QUOTE]

I haven't tried that and I'm not certain I want to. It is my wife's computer, so performance will be an issue if she doesn't have cores available for her stuff.

kruoli 2022-06-19 07:55

[QUOTE=rogue;608076]So if I have 7 copies of llr running, what parameters do I use to ensure that they are running on the performance cores? That assumes that performance cores (with HT) are 0-15.[/QUOTE]

You first decide on which core to start on. In your example, everything from 0–3 would be feasible as a start core. That would either involve performance cores 0–6 or 1–7. I would leave performance core 0 free because a lot of kernel things are still done on core 0 on Windows. This would lead to:
[C]AffinitySetter -s 3 -m 7 llr64[/C],
assuming your executable is called [C]llr64[/C]. The option [C]-i[/C] does not need to be added since it is the default on Windows. Leave it running in background. It will automatically detect newly spawned processes.
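
To see which logical cores that command line would select, here is a small sketch. It assumes the options mean what the usage text says (start core, step between cores, worker count) and that hyperthread siblings are numbered adjacently, so logical cores 0-15 are performance cores 0-7. Both points are assumptions, and the function is illustrative, not taken from the program.

```python
def cores_for(start, max_workers, increment=2):
    """Logical cores selected by a start core, a step, and a worker count."""
    return [start + k * increment for k in range(max_workers)]


cores = cores_for(start=3, max_workers=7)  # mirrors -s 3 -m 7 with the default -i 2
print(cores)

# Assumption: logical cores 0-15 are the hyperthreaded performance cores.
print(all(core <= 15 for core in cores))

# Assumption: two adjacent logical cores per physical core.
print(sorted({core // 2 for core in cores}))
```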

rogue 2022-06-19 12:26

[QUOTE=kruoli;608096]You first decide on which core to start on. In your example, everything from 0–3 would be feasible as a start core. That would either involve performance cores 0–6 or 1–7. I would leave performance core 0 free because a lot of kernel things are still done on core 0 on Windows. This would lead to:
[C]AffinitySetter -s 3 -m 7 llr64[/C],
assuming your executable is called [C]llr64[/C]. The option [C]-i[/C] does not need to be added since it is the default on Windows. Leave it running in background. It will automatically detect newly spawned processes.[/QUOTE]

Does it keep running and poll the system every few minutes to determine if the affinity of a process needs to be changed?

After running for a few minutes, it is doing nothing. I see no change in the affinity of the processes. I suggest a few things:

1) Add a switch that indicates if hyper threading is enabled. That way -s 1 -H -m 3 would use cores 1, 3, 5 and -s 1 -m 3 would use cores 1, 2, 3.
2) Output information such as "no process found with that name".
3) Output information such as "process <processname> with pid <pid> changed to use core <core>".
4) Add a switch to specify the number of seconds to poll the OS for running processes to change. Default to 60 seconds.

This means that your program will need to have a table of pids for which it changed the affinity so that it doesn't need to change them each time (and output that the affinity was changed) the system is polled.

kruoli 2022-06-19 16:45

On startup, it will set the affinities of all processes whose name matches the specified process name. After that, it will subscribe to a system event that informs it about newly spawned processes.

So you have omitted .exe in the command line as I have done? You could try to call [C]AffinitySetter notepad[/C] and look if something happens if you open Notepad. It [I]might[/I] be that Windows 11 blocks the behaviour of my program, I cannot test that. (I am assuming you might use Windows 11 because you were speaking about performance cores.)
[LIST=1][*]In your example, [C]-H[/C] (hyperthreading enabled) would be equivalent to [C]-i 2[/C] on Windows and not specifying [C]-H[/C] would be equivalent to [C]-i 1[/C]. Yes, this seems more user friendly, but I will keep [C]-i[/C] when adding such a switch.[*]This could be done on startup, which is the only point where this could apply.[*]This is already the default and should be shown if it works correctly. Real-world example from one of the machines where I am running it:
[CODE]Execute on PID 18340. Set affinity to core 15.
Core 9 freed.
Execute on PID 6744. Set affinity to core 9.
Core 13 freed.
Execute on PID 17332. Set affinity to core 13.
Core 1 freed.
Execute on PID 10520. Set affinity to core 1.
Core 9 freed.
Execute on PID 17088. Set affinity to core 9.
Core 11 freed.
Execute on PID 17208. Set affinity to core 11.
Core 13 freed.
Execute on PID 10792. Set affinity to core 13.
Core 1 freed.
Execute on PID 17516. Set affinity to core 1.
Core 5 freed.
Core 11 freed.
Execute on PID 18824. Set affinity to core 5.
Execute on PID 19148. Set affinity to core 11.[/CODE][*]This does not apply to the event watcher I use, unfortunately.[/LIST]
Yes, already set PIDs are watched and the used cores are marked as "used" in the program.

rogue 2022-06-19 17:29

I see what I did wrong. I was adding the .exe to the process name as that is what Task Manager shows. I have it working now. Thanks.

pepi37 2022-06-19 23:21

The affinity problem under Windows is an old one. I like using Prime95 for PRP tests, and regardless of configuration, one worker is always faster than another. I know other programs steal cycles, which is why I use only 4 cores on a 6-core CPU (6 true cores, no HT). But even in that configuration the first worker is always slower than the others. Prime95 sets affinity itself, so I don't know whether your program will work in this case.

Jean Penné 2022-07-13 14:29

Affinity management on LLR
 
Affinity management was not really implemented in LLR...
This issue is now fixed in the new LLR 4.0.2, in the Linux and WIN32 versions.
It is also fixed in llrCUDA Version 3.8.6 (for the CPU part of the work).
The option -oAffinity=2 makes the program run on logical core 2.
You may also choose a list of cores, for example by setting -oAffinity="2,3,5".

Regards,
Jean
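
For readers wondering what a core list such as "2,3,5" amounts to, here is a small sketch that turns it into the kind of bitmask that operating-system affinity calls commonly use. It is not LLR's own parsing code, and both function names are hypothetical.

```python
def parse_core_list(text):
    """Turn a comma-separated list such as "2,3,5" into a sorted list of core indices."""
    return sorted({int(part) for part in text.split(",") if part.strip()})


def to_mask(cores):
    """Bitmask with bit n set for every distinct logical core n in the list."""
    return sum(1 << core for core in set(cores))


cores = parse_core_list("2,3,5")
print(cores, bin(to_mask(cores)))
```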

rogue 2022-07-13 17:19

FYI, PRPNet supports setting affinity "out of the box" for LLR. PRPNet has been updated to support affinity for pfgw. I'm working on pfgw support for setting CPU affinity, at least for Windows.


All times are UTC.