mersenneforum.org

carpetpool 2020-10-03 21:16

LLR Affinity Problem
 
1 Attachment(s)
I know there's a way to run several LLR instances and have each assigned to its own designated CPU, so that together they run significantly faster than a single instance would.

I am using a [URL="https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html"]4 core, 8 thread[/URL] CPU. In the attachment I sent, one instance of LLR is running with only one thread, and time per bit is 0.576 ms. The CPU affinity is set to 0.

After terminating the program, I copy the LLR executable to another directory and run a test on a number of similar size to the first run (one thread). The CPU affinity is set to 1.

When I check on the first run, I notice the time per bit has increased to 1.172 ms, almost twice that of running only one LLR application! No speedup whatsoever.

My goal is to run 4 instances of LLR, each about as fast as a single instance running single threaded (4 instances each running at close to 0.576 ms per bit, so that testing is 4x faster). Does anyone know what I am doing wrong here?

I am aware that running a single instance with 8 threads is less productive than running 4 single-threaded instances, but for some reason I never figured out how to achieve the latter.

Thanks for help!

paulunderwood 2020-10-03 21:38

Running only one instance has all the cache to itself and will run quicker than running two instances, where there will be contention for cache. On a 4c/8t box I run one instance with the -t4 option. I think this approach is cache friendlier.

VBCurtis 2020-10-03 22:06

In Windows, are cores 0 and 1 hyperthreads of the same physical core? That would explain your timing exactly doubling.

What happens when you assign the second LLR copy to core 2 rather than 1?
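A minimal sketch of the idea behind this suggestion, assuming (as the question above does, without confirming it) that sibling hyperthreads are numbered adjacently, so logical CPUs 0 and 1 share one physical core, 2 and 3 the next, and so on. The function names are made up for illustration. It picks one logical CPU per physical core and prints the hexadecimal mask that the Windows `start /affinity` command expects.

```python
def one_cpu_per_core(physical_cores, threads_per_core=2):
    """First logical CPU of each physical core.

    ASSUMPTION: sibling hyperthreads are numbered adjacently
    (0/1, 2/3, ...). Check the real topology before relying on this.
    """
    return [core * threads_per_core for core in range(physical_cores)]


def affinity_mask(cpus):
    """Bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


if __name__ == "__main__":
    # 4 cores / 8 threads, as on the CPU in the first post.
    for cpu in one_cpu_per_core(4):
        # "start /affinity" takes the mask as a hexadecimal number.
        print(f"logical CPU {cpu}: start /affinity {affinity_mask([cpu]):X} <program>")
```

Under that numbering assumption, a 4c/8t CPU gives logical CPUs 0, 2, 4 and 6, one per physical core.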

Have you tried not assigning affinity? I've had decent luck just letting Windows utilize the cores. Manually assigning affinity does help sometimes, but for this use case I'm not sure it matters for you.

carpetpool 2020-10-04 22:39

Thanks for the suggestions! I ran 4 subsequent instances of LLR, assigning affinity to CPUs 0, 2.

The time increased by about 0.120 ms, which I guess makes sense given that more cores means slower clock speed.

I loaded up 4 instances running on CPUs 0, 2, 4, 6 and the time per bit almost doubled (a 0.380 ms increase).

I think Paul is right: running four threads on one instance seems to be faster than running 4 instances single threaded.

I would think that with a larger number of cores, say 12 or 16, the latter might become slower?
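For comparing these runs it may help to look at aggregate throughput rather than time per bit alone. The sketch below is only arithmetic on the figures quoted in this thread, under three assumptions: 1.172 ms is the absolute time per bit with two instances on CPUs 0 and 1, the 0.120 ms and 0.380 ms figures are increases on top of the 0.576 ms baseline, and the CPUs 0, 2 run used two instances. The thread reports no timing for a single -t4 instance, so that case is not included.

```python
BASELINE_MS_PER_BIT = 0.576  # one instance, one thread (first post)


def aggregate_throughput(instances, ms_per_bit):
    """Total bits processed per millisecond across all instances."""
    return instances / ms_per_bit


# (number of instances, ms per bit for each instance)
# ASSUMPTION: the "increase" figures are added to the baseline.
runs = {
    "1 instance (CPU 0)": (1, BASELINE_MS_PER_BIT),
    "2 instances (CPUs 0, 1)": (2, 1.172),
    "2 instances (CPUs 0, 2)": (2, BASELINE_MS_PER_BIT + 0.120),
    "4 instances (CPUs 0, 2, 4, 6)": (4, BASELINE_MS_PER_BIT + 0.380),
}

if __name__ == "__main__":
    base = aggregate_throughput(*runs["1 instance (CPU 0)"])
    for label, (n, ms) in runs.items():
        rate = aggregate_throughput(n, ms)
        print(f"{label}: {rate:.3f} bits/ms ({rate / base:.2f}x baseline)")
```

On these assumptions the four-instance run finishes each test more slowly but completes roughly 2.4 times as much work per unit time as the single instance, while the two instances on CPUs 0 and 1 gain nothing.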

paulunderwood 2020-10-04 23:30

I don't know about 12 core chips running LLR, but generally it makes sense to run 1 instance per chip or chiplet.

VBCurtis 2020-10-05 05:32

[QUOTE=paulunderwood;558896]I don't know about 12 core chips running LLR, but generally it makes sense to run 1 instance per chip or chiplet.[/QUOTE]

My experience, mostly on Haswell-era desktops, is that LLR doesn't benefit much from splitting small FFTs onto multiple threads. 128K per thread seems to be a good cutoff, so for OP's example 192K FFT, I doubt running two 2-threaded instances would be faster than four 1-threaded.

Once FFT reaches 256K, 2-threaded runs work pretty well.

OP- I've run LLR on this size of number on prebuilt machines with slow 2-channel memory, and running 3 instances was just about as fast as 4 but generated quite a bit less heat. That is, 3 is enough to saturate the memory on some quad-core machines. It takes some experimenting with threads-per-process and number of processes to find the sweet spot!
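The experimenting described above can be narrowed down a little before any benchmarking. The following is a rough sketch, not a tuning tool: it lists the (instances, threads per instance) combinations that fit on the physical cores and keeps at least 128K of FFT per thread, using the rule of thumb from this post. The function name and the default cutoff parameter are illustrative, and the right cutoff will vary by machine.

```python
def candidate_configs(physical_cores, fft_k, min_fft_k_per_thread=128):
    """List (instances, threads_per_instance) pairs worth benchmarking.

    fft_k is the FFT length in K (e.g. 192 for a 192K FFT).
    ASSUMPTION: the 128K-per-thread cutoff is a rule of thumb only.
    """
    configs = []
    for threads in range(1, physical_cores + 1):
        if fft_k / threads < min_fft_k_per_thread:
            break  # more threads would make each thread's share too small
        # Use at most one thread per physical core in total.
        for instances in range(1, physical_cores // threads + 1):
            configs.append((instances, threads))
    return configs


if __name__ == "__main__":
    for fft_k in (192, 256):
        print(f"{fft_k}K FFT on 4 cores: {candidate_configs(4, fft_k)}")
```

For a 192K FFT on four cores this leaves only the single-threaded options (one to four instances); at 256K the two-threaded options appear as well. Each remaining combination still has to be timed, since memory bandwidth may saturate before all cores are in use.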

henryzz 2020-10-05 08:27

[QUOTE=VBCurtis;558917]OP- I've run LLR on this size of number on prebuilt machines with slow 2-channel memory, and running 3 instances was just about as fast as 4 but generated quite a bit less heat. That is, 3 is enough to saturate the memory on some quad-core machines. It takes some experimenting with threads-per-process and number of processes to find the sweet spot![/QUOTE]

Better still might be to reduce the CPU speed to match the memory throughput and still use all cores. Generally, lower speeds need less power per cycle. Experimentation might be needed there.

rogue 2022-06-16 12:33

Back to the original question. With the new Intel CPUs, Windows seems to run llr on the efficiency cores by default, not the performance cores. I would like to run llr only on the performance cores. Note that I am using PRPNet, so llr is run once for each PRP/primality test. PRPNet sets Affinity= in the llr.ini file, but that is not being respected.

kruoli 2022-06-16 19:50

1 Attachment(s)
For this problem, I have written a simple program. It is attached with source code (since the executable must be started as administrator).

rogue 2022-06-16 20:24

That isn't helpful because I cannot run that every time I start llr. llr should respect the affinity, but maybe it requires llr to be run as administrator to set affinity.

kruoli 2022-06-16 20:36

I definitely concur with your request that LLR respect the setting by itself. If a process wants to limit its own affinity, it can do so without further privileges (as of Windows 10, at least).

My program only needs to be started once. It will monitor all started processes and apply affinity in accordance with the parameters the program was started with.
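To illustrate the point about a process limiting its own affinity, here is a minimal sketch in Python. It is not LLR's code and not the attached program. It only shows the operating system calls involved: `SetProcessAffinityMask` on Windows and `os.sched_setaffinity` on Linux. The function name is made up for the example, and which logical CPU indices correspond to performance cores is left to the caller, since that depends on the machine.

```python
import os
import sys


def pin_current_process(cpus):
    """Restrict the calling process to the given logical CPU indices."""
    cpus = sorted(set(cpus))
    if not cpus:
        raise ValueError("need at least one CPU index")

    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
        kernel32.SetProcessAffinityMask.restype = wintypes.BOOL

        # Build the bitmask: bit n set means logical CPU n is allowed.
        mask = 0
        for cpu in cpus:
            mask |= 1 << cpu
        if not kernel32.SetProcessAffinityMask(kernel32.GetCurrentProcess(), mask):
            raise ctypes.WinError(ctypes.get_last_error())
    elif hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    else:
        raise NotImplementedError("no affinity call known for this platform")
```

A call such as `pin_current_process([0, 2])` would restrict the process to logical CPUs 0 and 2, provided those indices exist and are allowed on the machine.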


All times are UTC.