mersenneforum.org  

Old 2019-09-12, 17:57   #1
rgirard1
 
Jan 2019

168 Posts
Can Prime95 use the 16 cores of the AMD Ryzen 9 3950X?

I would like to know whether Prime95, running under Ubuntu Linux 18.04 LTS, is programmed such that it will use all 16 cores of the upcoming AMD Ryzen 9 3950X, which is supposed to be released at the end of this month.
Old 2019-09-12, 19:12   #2
lavalamp
 
lavalamp's Avatar
 
Oct 2007
London, UK

10100100000₂ Posts

Yes, you can set multiple workers, each of which can have multiple threads.

If I had to guess, I'd say the best performance would be 2 workers with either 8 or 16 threads each, with each worker assigned its own CCX on the chip. However, benchmarking in various configurations first would be wise.
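The suggestion above (separate workers pinned to their own CCX or CCD) can be scripted with taskset on Linux. A minimal sketch, assuming a hypothetical layout of 16 cores numbered contiguously with 4 per CCX; verify your actual topology with lscpu -e or hwloc first, and treat the mprime invocation as illustrative:

```python
# Sketch: emit one taskset-pinned Prime95/mprime command per group of cores.
# Assumes a hypothetical layout: 16 cores numbered 0-15 contiguously,
# 4 per CCX; check the real topology with "lscpu -e" or hwloc.
def pin_commands(total_cores=16, cores_per_worker=8, binary="./mprime"):
    cmds = []
    for start in range(0, total_cores, cores_per_worker):
        cores = f"{start}-{start + cores_per_worker - 1}"
        # run each instance from its own directory so worktodo files don't clash
        cmds.append(f"taskset -c {cores} {binary} -d")
    return cmds

for cmd in pin_commands():  # 2 workers x 8 cores, i.e. one per CCD
    print(cmd)
```

Passing cores_per_worker=4 instead gives the four-workers-of-four layout discussed later in the thread.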
Old 2019-09-16, 11:51   #3
joblack
 
joblack's Avatar
 
Oct 2008
n00bville

5²·29 Posts

Tests on my 3700X suggested that each task shouldn't use more than 4 cores; more won't give you much additional performance.

So with the 3950X, run four number checks with four threads each.

Last fiddled with by joblack on 2019-09-16 at 11:51
Old 2019-09-16, 12:34   #4
M344587487
 
M344587487's Avatar
 
"Composite as Heck"
Oct 2017

2³×3×29 Posts

Quote:
Originally Posted by joblack View Post
Tests on my 3700X suggested that each task shouldn't use more than 4 cores; more won't give you much additional performance.

So with the 3950X, run four number checks with four threads each.
Mackerel's 3600 data suggests otherwise: it indicates that one worker using all cores is optimal, as the large combined cache likely helps avoid a RAM bottleneck: https://www.mersenneforum.org/showpo...3&postcount=61


If you get a chance, would you mind doing some more testing and providing the data?
Old 2019-09-17, 05:30   #5
lavalamp
 
lavalamp's Avatar
 
Oct 2007
London, UK

1312₁₀ Posts

Quote:
Originally Posted by M344587487 View Post
the large conjoined cache likely helps avoid a RAM bottleneck
That is not my understanding of how the L3 cache is configured.

As I understand it:
4 cores share an L3 cache and form a "core complex" (CCX).
2 CCXs sit on a die and are connected by Infinity Fabric; together they are called a CCD.
The 2 CCD chiplets on the 3900X and 3950X are connected individually to the IO die, and any access of L3 cache or RAM must occur via the IO die.

Additionally, the post you linked to is based on a 3600, which only has a single CCD chiplet.

Therefore I would still expect 2 workers (1 per chiplet) with either 8 or 16 threads to perform optimally, but as I said, benchmarks would be interesting to see.
Old 2019-09-17, 09:52   #6
M344587487
 
M344587487's Avatar
 
"Composite as Heck"
Oct 2017

2³·3·29 Posts

Quote:
Originally Posted by lavalamp View Post
That is not my understanding of how the L3 cache is configured.

As I understand it:
4 cores share an L3 cache and form a "core complex" (CCX).
2 CCXs sit on a die and are connected by Infinity Fabric; together they are called a CCD.
The 2 CCD chiplets on the 3900X and 3950X are connected individually to the IO die, and any access of L3 cache or RAM must occur via the IO die.

Additionally, the post you linked to is based on a 3600, which only has a single CCD chiplet.

Therefore I would still expect 2 workers (1 per chiplet) with either 8 or 16 threads to perform optimally, but as I said, benchmarks would be interesting to see.
  • A CCX is 4 cores sharing 16MB of discrete L3, as you say; conjoined was a poor choice of words on my part
  • It's my understanding that there's no intra-chiplet IF in Zen 2; all communication between CCXs has to go through the IO die, even if the CCXs are on the same chiplet. This makes chiplets simpler and memory latency more uniform, at the cost of latency in some situations
  • A 3600 is a 3700X with one core disabled per CCX and likely a worse bin, but the same 16MB of cache per CCX
My post was just about throughput of the single-chiplet SKUs; mackerel's data suggests that joblack should be able to get higher throughput on the 3700X with a single worker spanning both CCXs, but joblack's tests contradict that.

The jury is still out on how efficiently a worker spans CCXs across chiplets; the lack of a direct intra-chiplet IF link at least doesn't rule out 1 worker across everything as viable. We have a small data point ( https://www.mersenneforum.org/showpo...&postcount=110 ) indicating that a single worker on a 3900X seems to scale fine, but there's no 2+ worker 3900X data, and the test conditions were not equal to the 3600 it was compared against (different RAM and possibly Fclk configurations). Everything needs more data.


My guess is that either 1 or 2 workers will be optimal for the 3900X/3950X, depending partly on how saturated RAM bandwidth is when there are 2 workers and partly on how detrimental spanning more than 2 CCXs is to throughput with 1 worker. It could be a weird situation where 1 worker across 3 CCXs is ideal, as it mostly alleviates RAM pressure while not incurring too much of an inter-CCX penalty. Let's hope that a test junkie among us gets their hands on a 3900X/3950X sometime this year.
Old 2019-09-17, 10:06   #7
lavalamp
 
lavalamp's Avatar
 
Oct 2007
London, UK

2⁵·41 Posts

How much memory does P95 need at given FFT sizes? I'd be interested to see how much is needed for work on, say, 100M-digit numbers. If it can all fit in cache, that would be quite handy...

I still think that accessing cache via the IO die will incur a performance penalty though.
Old 2019-09-19, 08:11   #8
mackerel
 
mackerel's Avatar
 
Feb 2016
UK

3×131 Posts

Only just saw this thread.

RAM needed is FFT_size * 8 + "a bit" for other lookup data. In practice I've found that considering just FFT*8 is a sufficient guide to performance.
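That rule of thumb is easy to sanity-check against Zen 2's cache sizes. A small sketch, just arithmetic on the FFT_size * 8 figure (FFT sizes like 4096K meaning 4096 × 1024 doubles), ignoring the "a bit":

```python
# Rule of thumb from this post: RAM needed ~= FFT_size * 8 bytes, plus "a bit".
# An FFT size of "4096K" means 4096 * 1024 doubles of 8 bytes each.
L3_PER_CCD = 32 * 1024 * 1024  # Zen 2: 2 CCXs x 16 MiB L3 per CCD

def fft_bytes(fft_k):
    return fft_k * 1024 * 8

for fft_k in (2048, 3456, 4096, 8192):
    verdict = "fits" if fft_bytes(fft_k) <= L3_PER_CCD else "exceeds"
    print(f"{fft_k}K FFT ~ {fft_bytes(fft_k) // 2**20} MiB: {verdict} one CCD's L3")
```

A 4096K FFT lands exactly on 32 MiB, which is why it marks the cache-fit boundary per CCD in the discussion below.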

I'm not really that familiar with the large FFT sizes used around here, since my interest is primarily in comparatively smaller tasks at PrimeGrid. Generally, best throughput comes when the total work fits within, but doesn't exceed, available cache. There is a performance impact from crossing CCXs.

Zen 2's CCX cache partition complicates that, and I have no idea what is the dominant influence on performance for large tasks. Given my previous results, my gut feel is that the large total cache, even if partitioned, does allow Zen 2 to work well at large tasks. It may even mitigate the half bandwidth writes from each CCD. Personally I wish it had unified L3 per CCD as I think for this type of workload it would enable better scaling at large FFT sizes.

Anyway, it has been some time since I ran those early tests. I currently have both a 3600 and a 3700X, so I can repeat the testing later. BIOS support has improved a lot since those early days, particularly RAM compatibility. I can run 3600 RAM in those systems, as well as trying something "slower" like 3000 or 3200. There is also something I've not tested: I've heard that in other applications there can be a benefit from running IF at a higher speed rather than keeping it synced with RAM clocks, so that would be an interesting test too.
Old 2019-09-19, 09:12   #9
lavalamp
 
lavalamp's Avatar
 
Oct 2007
London, UK

2⁵·41 Posts

Quote:
Originally Posted by mackerel View Post
I've found in practice just considering the FFT*8 sufficient as guide to performance.
Unfortunate then.

Since each CCD has 32 MB of L3 cache, this would be good for an FFT of 4096K, which is the assigned FFT size for exponents up to 78M. However, the first-time test front is now approaching 90M.

A single worker might be the way to go for LL tests on such numbers after all.

It's surprisingly difficult to find information about what size numbers correspond to a given FFT size, but assuming vaguely linear scaling, an 8192K FFT should just about fill the L3 cache and be good for exponents up to 150M in size.
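The vaguely-linear assumption can be sketched in a couple of lines, anchored on the 4096K / 78M pairing mentioned in this thread. Real assignment tables are not exactly linear, so treat the output as a ballpark rather than an assignment boundary:

```python
# Crude linear estimate of the largest exponent for a given FFT size,
# anchored on the 4096K <-> 78M pairing quoted in this thread.
# Real FFT assignment tables are not exactly linear; this is a ballpark only.
def max_exponent(fft_k, anchor_fft_k=4096, anchor_exp=78_000_000):
    return fft_k * anchor_exp // anchor_fft_k

print(max_exponent(8192))  # ballpark for an 8192K FFT
```

The linear estimate slightly overshoots the "up to 150M" figure above, consistent with real tables being a little sub-linear.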

Edit: Worth noting that the L3 cache is exclusive on Zen 2, so the L2 cache gives a little extra capacity. That can be the "a bit" in your formula FFT_size * 8 + "a bit".
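For scale, summing the exclusive L3 with Zen 2's 512 KiB of L2 per core gives an upper bound on that extra capacity per CCD. The plain sum is optimistic, since L2 also holds code and non-FFT data:

```python
# Upper bound on effective cache per CCD if Zen 2's exclusive L3 lets L2 add
# to capacity. 512 KiB L2 per core is the Zen 2 figure; the simple sum is
# optimistic because L2 also holds code and non-FFT data (the "a bit").
L3_CCD = 32 * 1024 * 1024
L2_PER_CORE = 512 * 1024
CORES_PER_CCD = 8
effective = L3_CCD + CORES_PER_CCD * L2_PER_CORE
print(effective // 2**20, "MiB effective per CCD")  # 36 MiB
```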

Last fiddled with by lavalamp on 2019-09-19 at 09:15
Old 2019-09-19, 23:10   #10
nomead
 
nomead's Avatar
 
"Sam Laur"
Dec 2018
Turku, Finland

510₈ Posts

Quote:
Originally Posted by lavalamp View Post
Unfortunate then.

Since each CCD has 32 MB of L3 cache, this would be good for an FFT of 4096K, which is the assigned FFT size for exponents up to 78M. However, the first-time test front is now approaching 90M.
Yes, I agree: for a single CCD the performance really sinks at that point, or actually even a bit before that. The efficiency starts dropping after about 65M, or 3456K FFT size. But on the dual-CCD models (the 3900X and the upcoming 3950X) the rolloff point is double that, still quite capable of running at its optimum rate at the current wavefront.

Quote:
Originally Posted by lavalamp View Post
It's surprisingly difficult to find information about what size numbers correspond to a given FFT size, but assuming vaguely linear scaling, an 8192K FFT should just about fill the L3 cache and be good for exponents up to 150M in size.
Try the CPU credit calculator at mersenne.ca; it gives an estimate of the expected FFT size for the L-L exponent to be tested. But yes, it's roughly linear.
Old 2019-09-20, 10:20   #11
M344587487
 
M344587487's Avatar
 
"Composite as Heck"
Oct 2017

2³×3×29 Posts

Quote:
Originally Posted by mackerel View Post
...
Anyway, it has been some time since I ran those early tests. I currently have both a 3600 and 3700X so I can repeat the testing later. Bios has improved a lot, particularly for ram compatibility since those early days. I can run 3600 ram in those systems, as well as trying something "slower" like 3000 or 3200. There is also something I've not tested, I've heard in other applications there can be benefit from running IF at higher speed than keeping it sync'd with ram clocks, so that would be an interesting test also.
It's my understanding that decoupling Fclk from RAM mainly has some impact on latency, as there's extra complication keeping things in sync. Fclk can get to around 1900, although not every chip can get there; some can go a bit higher. These tests would yield some nice comparison points if you have the patience:
  • 3200 in sync at 1600 Fclk, baseline
  • 3600 in sync at 1800 Fclk, baseline
  • 3200 decoupled at 1800 Fclk, comparable to 3600 baseline
  • 3600 decoupled at 1900 Fclk or whatever Fclk you can get
  • 3200 decoupled at 1900 Fclk or whatever Fclk you can get

My hypothesis is that decoupling will greatly help 3200 DDR4 but be a wash for 3600 DDR4. Increasing Fclk may also disproportionately help tests that stay mainly in cache, so testing at least three FFT sizes per setup would be nice for comparison (fully in cache, wavefront which straddles cache and RAM, and 100M-digit which is mainly in RAM).
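The proposed matrix is five RAM/Fclk configurations crossed with three FFT regimes; a tiny sketch just to enumerate the runs (the Fclk ceilings are the assumptions stated above, not guaranteed on every chip):

```python
# Enumerate the benchmark matrix proposed above: five RAM/Fclk configurations
# crossed with three FFT working-set regimes. Purely organisational; the
# ~1900 Fclk ceiling is an assumption from the post, not a guarantee.
from itertools import product

configs = [
    ("DDR4-3200", 1600, "sync"),       # baseline
    ("DDR4-3600", 1800, "sync"),       # baseline
    ("DDR4-3200", 1800, "decoupled"),  # comparable to the 3600 baseline
    ("DDR4-3600", 1900, "decoupled"),  # or whatever Fclk the chip manages
    ("DDR4-3200", 1900, "decoupled"),
]
fft_regimes = ["in-cache", "wavefront", "100M-digit"]

runs = list(product(configs, fft_regimes))
for (ram, fclk, mode), regime in runs:
    print(f"{ram} Fclk={fclk} ({mode}) @ {regime} FFT")
print(len(runs), "benchmark runs")
```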
