#1
Jan 2019
24 Posts
I would like to know whether Prime95 running under Ubuntu Linux 18.04 LTS is programmed such that it will use all 16 cores of the upcoming AMD Ryzen 9 3950X, which is supposed to be released at the end of this month.
#2
Oct 2007
Manchester, UK
1,381 Posts
Yes, you can set multiple workers, each of which can have multiple threads.
If I had to guess, I'd say the best performance would be 2 workers with either 8 or 16 threads each, with each worker assigned its own CCX on the chip. However, benchmarking in various configurations first would be wise.
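For reference, the equal-size splits of 16 physical cores into workers can be enumerated mechanically. A quick sketch; the 4-cores-per-CCX grouping is assumed from AMD's published Zen 2 layout, not taken from this thread:

```python
# Enumerate the ways 16 physical cores split into equal-sized Prime95 workers.
# Assumption (not from this thread): 4 cores per CCX, as on Zen 2 parts.
CORES = 16
CORES_PER_CCX = 4

def splits(total_cores):
    """Return (workers, cores_per_worker) pairs that use every core."""
    return [(w, total_cores // w) for w in range(1, total_cores + 1)
            if total_cores % w == 0]

for workers, cores in splits(CORES):
    ccx_span = max(1, cores // CORES_PER_CCX)  # CCXs each worker would span
    print(f"{workers:2d} workers x {cores:2d} cores, each spanning ~{ccx_span} CCX")
```

So the 2-worker suggestion corresponds to 8 cores (2 CCXs, i.e. one chiplet) per worker.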
#3
Oct 2008
n00bville
1011100000₂ Posts
Tests on my 3700X suggested that each task shouldn't use more than 4 cores; more won't give you much additional performance.

So with the 3950X, use four number checks with four threads each.

Last fiddled with by joblack on 2019-09-16 at 11:51
#4
"Composite as Heck"
Oct 2017
2⁴×61 Posts

Quote:
If you get a chance, would you mind doing some more testing and providing the data?
#5
Oct 2007
Manchester, UK
1381₁₀ Posts

Quote:
As I understand it: 4 cores share an L3 cache and form a "core complex" (CCX). Two CCXs sit on a die and are connected by Infinity Fabric; together they are called a CCD. The two CCD chiplets on the 3900X and 3950X are each connected individually to the IO die, and any access to another CCX's L3 cache, or to RAM, must go via the IO die.

Additionally, the post you linked to is based on a 3600, which only has a single CCD chiplet. Therefore I would still expect 2 workers (1 per chiplet) with either 8 or 16 threads to perform optimally, but as I said, benchmarks would be interesting to see.
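A toy model of the topology described above may help. This sketch assumes the commonly cited Zen 2 figures (4 cores and 16 MB of L3 per CCX, 2 CCXs per CCD) and computes how much L3 a worker of a given width spans:

```python
# Toy model of the Zen 2 cache hierarchy described above.
# Assumed figures: 4 cores and 16 MB of L3 per CCX; 2 CCXs per CCD;
# a 3950X has 2 CCDs (16 cores, 64 MB of L3 total).
CORES_PER_CCX = 4
L3_PER_CCX_MB = 16

def l3_reachable_mb(worker_cores):
    """Total L3 spanned by one worker; each CCX's 16 MB slice is separate."""
    ccx_needed = -(-worker_cores // CORES_PER_CCX)  # ceiling division
    return ccx_needed * L3_PER_CCX_MB

for cores in (4, 8, 16):
    print(f"{cores}-core worker spans {l3_reachable_mb(cores)} MB of L3")
```

Note the model only counts capacity; it says nothing about the latency cost of the IO-die hop between CCXs, which is the open question in this thread.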
#6
"Composite as Heck"
Oct 2017
976₁₀ Posts

Quote:
The jury is still out on how efficiently a worker spans CCXs across chiplets; the lack of a direct intra-chiplet IF link at least doesn't rule out 1 worker across everything as viable. We have a small data point (https://www.mersenneforum.org/showpo...&postcount=110) indicating that a single worker on a 3900X seems to scale fine, but there's no 2+ worker 3900X data, and the test conditions were not equal to the 3600 it was compared against (different RAM and possibly different Fclk configurations). Everything needs more data.

My guess is that either 1 or 2 workers will be optimal for the 3900X/3950X, depending partly on how saturated RAM bandwidth is with 2 workers, and partly on how detrimental spanning more than 2 CCXs is to throughput with 1 worker. It could be a weird situation where 1 worker across 3 CCXs is ideal, as it mostly alleviates RAM pressure while not incurring too much of an inter-CCX penalty. Let's hope that a test junkie among us gets their hands on a 3900X/3950X sometime this year.
#7
Oct 2007
Manchester, UK
1,381 Posts
How much memory does P95 need at given FFT sizes? I'd be interested to see how much is needed for work on, say, 100M-digit numbers. If it can all fit in cache, that would be quite handy...

I still think that accessing cache via the IO die will incur a performance penalty, though.
#8
Feb 2016
UK
2³×3×19 Posts
Only just saw this thread.
RAM needed is FFT_size * 8, plus "a bit" for other lookup data. In practice I've found that just considering FFT * 8 is a sufficient guide to performance. I'm not really that familiar with the large FFT sizes used around here, since my interest is primarily in comparatively smaller tasks at PrimeGrid.

Generally, best throughput comes when the total work fits in, but doesn't exceed, the available cache. There is a performance impact from crossing a CCX. Zen 2's partitioned CCX cache complicates that, and I have no idea what the dominant influence on performance for large tasks is. Given my previous results, my gut feeling is that the large total cache, even if partitioned, does allow Zen 2 to work well at large tasks. It may even mitigate the half-bandwidth writes from each CCD. Personally, I wish it had a unified L3 per CCD, as I think that would enable better scaling at large FFT sizes for this type of workload.

Anyway, it has been some time since I ran those early tests. I currently have both a 3600 and a 3700X, so I can repeat the testing later. BIOS support has improved a lot since those early days, particularly for RAM compatibility. I can run 3600 RAM in those systems, as well as try something "slower" like 3000 or 3200. There is also something I've not tested: I've heard that in other applications there can be a benefit to running IF at a higher speed rather than keeping it synced with RAM clocks, so that would be an interesting test too.
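The FFT_size * 8 rule of thumb above is easy to tabulate. A minimal sketch that ignores the "+ a bit" of lookup-data overhead:

```python
# Rough working-set size for an FFT of fft_k "K" entries, per the
# FFT_size * 8 rule of thumb above (8 bytes per element, ignoring the
# small extra lookup data).
def fft_bytes(fft_k):
    """Bytes touched by an fft_k-K FFT: fft_k * 1024 doubles."""
    return fft_k * 1024 * 8

for fft_k in (2048, 4096, 8192):
    print(f"{fft_k}K FFT ~ {fft_bytes(fft_k) // 2**20} MiB")
```

By this rule a 4096K FFT wants about 32 MiB, i.e. one Zen 2 CCD's worth of L3.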
#9
Oct 2007
Manchester, UK
1,381 Posts

Quote:
Since each CCD has 32 MB of L3 cache, this would be good for an FFT of 4096K, which is the assigned FFT size for exponents up to 78M. However, the first-time-test wavefront is now approaching 90M. A single worker might be the way to go for LL tests on such numbers after all.

It's surprisingly difficult to find information about what size numbers correspond to a given FFT size, but assuming vaguely linear scaling, an 8192K FFT should just about fill the L3 cache and be good for exponents up to 150M in size.

Edit: Worth noting that the L3 cache is exclusive on Zen 2, so the L2 cache gives a little extra capacity. That can be the "a bit" in your formula FFT_size * 8 + "a bit".

Last fiddled with by lavalamp on 2019-09-19 at 09:15
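The "vaguely linear scaling" assumption above can be made explicit. This sketch just extrapolates from the one known pairing stated in the post (4096K handles exponents up to ~78M); it is not Prime95's actual FFT-selection table:

```python
# Linear extrapolation of exponent capacity from one known data point
# (from the post above: a 4096K FFT handles exponents up to ~78M).
KNOWN_FFT_K, KNOWN_EXPONENT = 4096, 78_000_000

def max_exponent(fft_k):
    """Rough exponent limit for a given FFT size, assuming linear scaling."""
    return KNOWN_EXPONENT * fft_k // KNOWN_FFT_K

print(f"8192K FFT -> exponents up to ~{max_exponent(8192):,}")
```

It gives 156M for 8192K, in the same ballpark as the ~150M figure above (real FFT limits grow slightly sub-linearly, so linear scaling slightly overestimates).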
#10
"Sam Laur"
Dec 2018
Turku, Finland
13D₁₆ Posts

Quote:

Quote:
#11
"Composite as Heck"
Oct 2017
2⁴×61 Posts

Quote:
My hypothesis is that decoupling will greatly help 3200 DDR4 but be a wash for 3600 DDR4. Increasing Fclk may also disproportionately help tests that stay mainly in cache, so testing at least three FFT sizes per setup would be nice for comparison: fully in cache, the wavefront (which straddles cache and RAM), and 100M-digit work (which is mainly in RAM).
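The three suggested regimes can be picked mechanically. A sketch assuming 64 MB of total L3 (3950X) and the FFT_size * 8 footprint rule from earlier in the thread; the half-cache threshold for "fully in cache" is an arbitrary choice for illustration, not a measured one:

```python
# Classify FFT sizes into the three test regimes suggested above.
# Assumptions: 64 MB total L3 (3950X) and a working set of
# fft_k * 1024 * 8 bytes; the thresholds are crude, illustrative cutoffs.
L3_TOTAL = 64 * 2**20

def regime(fft_k):
    footprint = fft_k * 1024 * 8
    if footprint <= L3_TOTAL // 2:
        return "fully in cache"
    if footprint <= L3_TOTAL:
        return "straddles cache and RAM"  # roughly the current wavefront
    return "mainly in RAM"                # e.g. 100M-digit work

for fft_k in (2048, 8192, 18432):
    print(f"{fft_k}K: {regime(fft_k)}")
```

This would suggest one FFT size from each bucket per RAM/Fclk setup.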
| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| Is there any sensible auxiliary task for HT logical cores when physical cores already used for PRP? | hansl | Information & Answers | 5 | 2019-06-17 14:07 |
| Can't seem to make Prime95 use fewer cores | Octopuss | Software | 6 | 2018-01-28 13:05 |
| Prime95 fails to recognize more than 2 cores? | MrLittleTexas | Software | 5 | 2016-12-14 03:30 |
| 6 CPU cores not recognized by Prime95 v25.11.8 | Christenson | Information & Answers | 4 | 2011-02-06 01:03 |
| Intel e6600 Dual Core Problem - How to use both cores with Prime95? | Shoallakeboy | Hardware | 2 | 2006-11-06 17:55 |