mersenneforum.org  

Old 2020-07-18, 09:09   #1
intelfx
 
Jul 2020

13 Posts
Running Prime95 on a 16-core CPU: one 16-threaded worker or 16 single-threaded workers?

Greetings.

Since yesterday I've been trying my hand at GIMPS, using the spare cycles of my new 16-core workstation. I'm running the official build of Prime95 v29.8b7 on Linux.

When configuring Prime95 for the first time, I was presented with a choice of how many work threads I wanted to create and how many cores to allocate to each work thread (I suppose that a "work thread" is a misnomer?). Originally I chose to run 16 single-threaded workers, which subsequently received 16 different double-check assignments and started computing away at 42-45 ms/iter.

Later, I decided to experiment a bit and reconfigured Prime95 to run a single work thread using all 16 available cores. This yielded a computation speed of 1.5 ms/iter, i.e. a more than 16x speedup per iteration (meaning that the 16 original assignments would finish sooner run one after another than they would run in parallel).

Hence three questions:
  • am I right to conclude that running a single multi-threaded worker is in my case better than 16 single-threaded workers?
  • is this expected behavior?
  • are there any other threading recommendations for similar multi-core machines?
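
To put numbers on the comparison above, per-iteration times can be converted into aggregate throughput. This is only a minimal Python sketch: it assumes all 16 single-threaded workers run at 43.5 ms/iter (the midpoint of the 42-45 ms range quoted above), and the function name is made up for illustration.

```python
def throughput(ms_per_iter, workers):
    """Aggregate iterations per second for `workers` identical workers."""
    return workers * 1000.0 / ms_per_iter

# Figures quoted in the post; 43.5 ms is an assumed midpoint of 42-45 ms.
many = throughput(43.5, workers=16)   # 16 workers x 1 thread each
one = throughput(1.5, workers=1)      # 1 worker x 16 threads

print(f"16 x 1 thread : {many:6.1f} iter/sec")
print(f"1 x 16 threads: {one:6.1f} iter/sec")
print(f"ratio         : {one / many:.2f}x")
```

Under these assumptions the single 16-threaded worker delivers a bit less than twice the total throughput of the 16 single-threaded workers.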

Last fiddled with by intelfx on 2020-07-18 at 09:17
Old 2020-07-18, 09:23   #2
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

2⁵·3·7·13 Posts

Please run a benchmark from the menu(s). Linux users may help you here; I don't know how P95 looks on Linux, and every system is different. How many memory channels? How much cache? Air or water cooled? Etc. (rhetorical questions, no need to answer). The common wisdom is that new processors will give you better output if you run only a few (1-2) workers, each with more than one thread, such that the total number of threads is not higher than the number of your physical cores. It didn't use to be like that in the past, but with CPUs getting lots of cores, the limitation became memory bandwidth: 16 workers would need to move data for 16 tests at once. For example, on my system (10 cores, 20 threads, lots of cache memory) I get the best output with 2 workers, each running 5 threads. Hyper-threading is not useful for most work types; it only produces more heat, not more output.
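
As a rough illustration of the rule of thumb above (total threads no higher than the physical core count), the candidate layouts worth benchmarking can be listed mechanically. This is a sketch only: `even_splits` is a hypothetical helper, and it considers just the layouts that divide the cores evenly and use all of them.

```python
def even_splits(physical_cores):
    """Yield (workers, threads_per_worker) pairs that use every core exactly once."""
    for workers in range(1, physical_cores + 1):
        if physical_cores % workers == 0:
            yield workers, physical_cores // workers

# Candidate layouts for a 10-core machine, hyper-threading left unused.
for workers, threads in even_splits(10):
    print(f"{workers:2d} worker(s) x {threads:2d} thread(s)")
```

Which of these layouts is fastest still has to come from a benchmark on the actual machine.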

Last fiddled with by LaurV on 2020-07-18 at 09:26
Old 2020-07-18, 09:34   #3
intelfx
 
Jul 2020

13₁₀ Posts

Quote:
Originally Posted by LaurV
The common wisdom is that new processors will give you better output if you run only a few (1-2) workers, each with more than one thread, such that the total number of threads is not higher than the number of your physical cores. It didn't use to be like that in the past, but with CPUs getting lots of cores, the limitation became memory bandwidth: 16 workers would need to move data for 16 tests at once. For example, on my system (10 cores, 20 threads, lots of cache memory) I get the best output with 2 workers, each running 5 threads. Hyper-threading is not useful for most work types; it only produces more heat, not more output.

I see. Memory throughput being the bottleneck sounds quite plausible. I'll run the benchmark, thanks.
Old 2020-07-18, 11:14   #4
M344587487
 
"Composite as Heck"
Oct 2017

1162₈ Posts

If you have a Ryzen 3950X there is also the L3 cache split to consider (each CCX of 4 cores can directly access only 16MiB of L3). 4 workers might be optimal for that processor (in theory better cache utilisation means less memory bandwidth consumption to do the same work) but it could depend on FFT size. tl;dr always benchmark.
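
A back-of-the-envelope way to see why FFT size matters here is to compare the size of the FFT data with the 16MiB of L3 per CCX. The sketch below is hedged: it assumes 8 bytes (one double) per FFT element and ignores twiddle tables and other working memory, so real footprints are somewhat larger; the FFT lengths are illustrative and the names are made up.

```python
MIB = 1024 * 1024
BYTES_PER_ELEMENT = 8          # assumption: one double per FFT element
L3_PER_CCX = 16 * MIB          # per-CCX L3 figure from the post above

def fft_data_mib(fft_len_k):
    """Approximate size of the main FFT array in MiB for a length given in K."""
    return fft_len_k * 1024 * BYTES_PER_ELEMENT / MIB

# Illustrative FFT lengths only; use the ones your assignments actually need.
for fft_len_k in (1024, 2048, 2240, 3072):
    size = fft_data_mib(fft_len_k)
    verdict = "fits in" if size * MIB <= L3_PER_CCX else "exceeds"
    print(f"{fft_len_k}K FFT: {size:5.2f} MiB, {verdict} one CCX's L3")
```

Under these assumptions a 2048K FFT is right at the 16MiB limit and anything larger spills out of a single CCX's L3, which is one more reason to benchmark rather than guess.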
Old 2020-07-18, 12:35   #5
intelfx
 
Jul 2020

1101₂ Posts

Quote:
Originally Posted by M344587487
If you have a Ryzen 3950X there is also the L3 cache split to consider (each CCX of 4 cores can directly access only 16MiB of L3). 4 workers might be optimal for that processor (in theory better cache utilisation means less memory bandwidth consumption to do the same work) but it could depend on FFT size. tl;dr always benchmark.

Yup, it's that one.

In fact, I had simply overlooked the benchmark option. Going by the benchmark results, it appears that for 2048K FFTs the best throughput (roughly 1840-1860 iter/sec) is achieved with 4 workers:

Code:
FFTlen=2048K, Type=3, Arch=4, Pass1=1024, Pass2=2048, clm=2 (16 cores, 4 workers):  2.18,  2.18,  2.17,  2.17 ms.  Throughput: 1840.17 iter/sec.
FFTlen=2048K, Type=3, Arch=4, Pass1=2048, Pass2=1024, clm=1 (16 cores, 4 workers):  2.15,  2.16,  2.14,  2.14 ms.  Throughput: 1863.96 iter/sec.

With any larger FFT, however, 4-worker performance begins to degrade compared to 2 workers (I did not run the extended benchmark in this case, just used the defaults):

Code:
Timings for 2240K FFT length (16 cores, 2 workers):  1.43,  1.43 ms.  Throughput: 1399.54 iter/sec. 
Timings for 2240K FFT length (16 cores, 4 workers):  3.39,  3.41,  3.22,  3.21 ms.  Throughput: 1209.72 iter/sec. 
Timings for 2304K FFT length (16 cores, 2 workers):  1.43,  1.43 ms.  Throughput: 1397.51 iter/sec. 
Timings for 2304K FFT length (16 cores, 4 workers):  4.04,  4.00,  3.66,  3.65 ms.  Throughput: 1044.27 iter/sec. 
Timings for 2400K FFT length (16 cores, 2 workers):  1.53,  1.54 ms.  Throughput: 1300.87 iter/sec. 
Timings for 2400K FFT length (16 cores, 4 workers):  4.52,  4.46,  4.66,  4.74 ms.  Throughput: 870.69 iter/sec.

With significantly larger FFTs, 4-worker throughput drops drastically compared to 2 workers:
Code:
Timings for 3072K FFT length (16 cores, 2 workers):  1.88,  1.91 ms.  Throughput: 1056.56 iter/sec. 
Timings for 3072K FFT length (16 cores, 4 workers):  7.88,  7.97,  7.80,  7.79 ms.  Throughput: 508.94 iter/sec.
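
Comparing these lines by eye gets tedious once there are many FFT lengths, so here is a minimal Python sketch that picks the best worker count per FFT length. It assumes the default-benchmark line format shown above and that the lines are already available as a string; the regular expression and function name are illustrative, not part of Prime95.

```python
import re

# Sample lines copied from the default benchmark output above.
SAMPLE = """\
Timings for 2240K FFT length (16 cores, 2 workers):  1.43,  1.43 ms.  Throughput: 1399.54 iter/sec.
Timings for 2240K FFT length (16 cores, 4 workers):  3.39,  3.41,  3.22,  3.21 ms.  Throughput: 1209.72 iter/sec.
Timings for 3072K FFT length (16 cores, 2 workers):  1.88,  1.91 ms.  Throughput: 1056.56 iter/sec.
Timings for 3072K FFT length (16 cores, 4 workers):  7.88,  7.97,  7.80,  7.79 ms.  Throughput: 508.94 iter/sec.
"""

# Captures FFT length (K), core count, worker count and reported throughput.
LINE_RE = re.compile(
    r"Timings for (\d+)K FFT length \((\d+) cores?, (\d+) workers?\):"
    r".*?Throughput: ([\d.]+) iter/sec"
)

def best_worker_counts(text):
    """Map FFT length (in K) to the (workers, throughput) pair with the highest throughput."""
    best = {}
    for match in LINE_RE.finditer(text):
        fft_k, _cores, workers, rate = match.groups()
        fft_k, workers, rate = int(fft_k), int(workers), float(rate)
        if fft_k not in best or rate > best[fft_k][1]:
            best[fft_k] = (workers, rate)
    return best

for fft_k, (workers, rate) in sorted(best_worker_counts(SAMPLE).items()):
    print(f"{fft_k}K: {workers} workers at {rate} iter/sec")
```

The extended-benchmark lines (the ones starting with FFTlen=) use a different layout and would need their own pattern.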


Incidentally, do you happen to know how exactly I can use the extended benchmark results (i.e. the Type, Arch, Pass1, Pass2, clm values)? Can I specify them in a config somewhere to override the built-in values for my CPU?

Last fiddled with by intelfx on 2020-07-18 at 12:42
Old 2020-07-18, 16:49   #6
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

7,151 Posts

Quote:
Originally Posted by intelfx
Incidentally, do you happen to know how exactly I can use the extended benchmark results (i.e. the Type, Arch, Pass1, Pass2, clm values)? Can I specify them in a config somewhere to override the built-in values for my CPU?

Well, it will happen automatically over time. Every night, prime95 will do a quick benchmark of all the different FFT implementations for the work you'll be doing in the near future and write the results to gwnum.txt. Prime95 uses that to pick the fastest FFT implementation for your machine. Some of that info is buried in undoc.txt, but it is pretty terse.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.