mersenneforum.org  

Old 2021-06-01, 12:13   #1
M344587487

"Composite as Heck"
Oct 2017

2×3×5×29 Posts
Ryzen "3D V-Cache", 96MiB L3 per chiplet

https://www.tomshardware.com/news/am...ng-improvement


It's actually a 64MiB SRAM chip stacked on top of a Ryzen chiplet which already has 32MiB of L3 cache. That they've billed it as L3 is a good sign; hopefully the 64MiB is transparently unified with the 32MiB. Latency TBD. Bandwidth TBD; the link suggests up to 2TB/s based on the TSV tech used for stacking. They're talking in terms of gaming performance, so it seems likely it's destined for consumers.



96MiB of L3 has to at least make a sizeable dent in the amount of memory traffic generated by wavefront tests. At best it'll eliminate RAM as a factor entirely; at worst it should substantially raise the it/s cap imposed by memory bandwidth (hopefully to the point where the cores are the main bottleneck again?). Either way it's a tasty bit of news for the GIMP.
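
As a rough sanity check on that, the sketch below compares the size of an FFT's data with 32MiB and 96MiB of L3. It assumes 8 bytes (one double) per FFT element and ignores twiddle tables, multiple workers and anything else sharing the cache; the FFT lengths listed are illustrative guesses, not figures from the announcement or from Prime95 itself.

```python
MIB = 1024 * 1024


def fft_working_set_mib(fft_length_k, bytes_per_element=8):
    """Approximate FFT data size in MiB for a length given in K (1K = 1024) elements.

    Assumption: one 8-byte double per element and nothing else counted.
    """
    return fft_length_k * 1024 * bytes_per_element / MIB


def fits(fft_length_k, cache_mib):
    """True if the FFT data alone is no larger than the given cache size."""
    return fft_working_set_mib(fft_length_k) <= cache_mib


# Hypothetical FFT lengths, chosen only to show where the cut-offs fall.
for length_k in (2048, 4096, 5632, 6144, 8192):
    size = fft_working_set_mib(length_k)
    print(
        f"{length_k}K FFT: {size:.0f} MiB of data, "
        f"fits 32 MiB L3: {fits(length_k, 32)}, "
        f"fits 96 MiB L3: {fits(length_k, 96)}"
    )
```

Under these assumptions, anything between 4096K and 12288K elements is too big for 32MiB but would fit in 96MiB, which is why the extra cache looks interesting even before latency and bandwidth are known.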
Old 2021-06-01, 13:57   #2
mackerel

Feb 2016
UK

23×5×11 Posts

Very interesting development, but opening many questions:
Bigger cache usually = slower cache. For gaming and P95, might not be a problem. For other use cases: ??? Comes down to size of data.
How will it affect thermals?
Will it be productised any time soon, or more a Zen 4 target?
Cost implication?
May also imply they don't expect great scaling on Infinity Fabric and/or ram interfaces.

Note I haven't seen the original keynote yet, only scanned some news reports.
Old 2021-06-01, 14:41   #3
diep

Sep 2006
The Netherlands

11×71 Posts

Quote:
Originally Posted by mackerel
Very interesting development, but opening many questions:
Bigger cache usually = slower cache. For gaming and P95, might not be a problem. For other use cases: ??? Comes down to size of data.
How will it affect thermals?
Will it be productised any time soon, or more a Zen 4 target?
Cost implication?
May also imply they don't expect great scaling on Infinity Fabric and/or ram interfaces.

Note I haven't seen the original keynote yet, only scanned some news reports.
No, this is L3 cache, which for the better part of two decades now has sat between the L2 and the RAM. Obviously the huge success of the Threadrippers means they have a problem on the path to the RAM, because there is only a handful of DIMMs while a 64-core Threadripper 3990X has 8 CCDs, each of which is an 8-core processor. So it's basically a faster version of an 8-socket machine with 8 cores per socket. Yet an 8-socket machine would have its own RAM, and its own L3 in front of that RAM, for each socket. On Threadripper this is kind of a problem. So basically there is a factor-8 problem towards the RAM.

So you would really want to have 64 DIMMs, so to speak, inside each Threadripper box: 8 DIMMs for each CCD. Yet that's not how it works.

In short, adding quite a lot of SRAM (L3) to every CCD is a very clever way of doing things.

Intel has a major problem competing against this. Intel's way of doing business is to make tons of cash on faster systems with more sockets, DIMMs and other facilities. AMD is packing it all into 1 CPU, and the price of the 64-core Threadripper is, say, a factor 10 too cheap for Intel's way of doing business. If Intel followed the same path as AMD, they would have a major problem business-wise, as they would make less cash.

Last fiddled with by diep on 2021-06-01 at 14:42
Old 2021-06-01, 16:02   #4
mackerel

Feb 2016
UK

23×5×11 Posts

There are many different possible workloads, and not all of them stress bandwidth. For those that do, moving data has been a bigger problem than execution for a long time. Caches are one solution to that. More ram channels, more cache layers/different sizes, even compute in ram are being looked at.

I like monolithic CPUs more but recognise they're on the way out. It is so much simpler to deal with data when it isn't split into different pools, which is unavoidable as we go into chiplets.

I'm wondering where the 2 TB/s claim comes from. Can 8 cores move data in/out of that cache at that speed? Or is it a theoretical maximum based on stacking? I don't have numbers for Zen 3 currently; what does it do as currently sold? I did a quick Aida64 run on my 8-core Cezanne and that's only 100 GB/s copy in L3. The full-fat desktop versions might do a bit better but that's still far from 2 TB/s.
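
One way to put those numbers side by side is to turn them into bytes per core per clock cycle. This is only a sketch: the 8 cores, the 2 TB/s claim and the 100 GB/s Aida64 reading come from the posts above, while the clock speeds are assumed values picked for illustration.

```python
def bytes_per_cycle_per_core(bandwidth_bytes_per_s, cores, clock_hz):
    """Bandwidth each core must sustain every cycle to reach the aggregate figure."""
    return bandwidth_bytes_per_s / (cores * clock_hz)


CORES = 8                 # one chiplet, as in the post
CLAIMED_BW = 2e12         # 2 TB/s, the quoted figure
MEASURED_BW = 100e9       # 100 GB/s, the Aida64 L3 copy reading on Cezanne

# Assumed clock speeds in GHz; not taken from any spec sheet.
for clock_ghz in (3.0, 4.0, 5.0):
    clock_hz = clock_ghz * 1e9
    claimed = bytes_per_cycle_per_core(CLAIMED_BW, CORES, clock_hz)
    measured = bytes_per_cycle_per_core(MEASURED_BW, CORES, clock_hz)
    print(
        f"{clock_ghz:.1f} GHz: claim needs {claimed:.1f} B/cycle/core, "
        f"measured copy is {measured:.2f} B/cycle/core"
    )
```

At an assumed 4 GHz the claim works out to 62.5 bytes per core per cycle against roughly 3 for the measured copy, so the two figures are clearly not describing the same thing.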
Old 2021-06-01, 16:14   #5
diep

Sep 2006
The Netherlands

11×71 Posts

Quote:
Originally Posted by mackerel
...snip....
I'm wondering where the 2 TB/s claim comes from. Can 8 cores move data in/out of that cache at that speed? Or it is a theoretical maximum based on stacking?
All those manufacturers, and there are no exceptions, have a bunch of marketing managers who make fantastic claims that get used to sell hardware.

For example, at the end of the 90s SGI sold the Teras supercomputer, saying it had 1 TB/s of bandwidth over the network.

Actually we can disprove this. The 1024 processors were 256 nodes, each node a quad-socket box.

The connectivity from each node to the network was 1 GB/s, of which quite some part was actually hardware overhead data. So roughly 600 MB/s, or 0.6 GB/s, would be user data. From memory I recall it was 680 MB/s of user data, actually. It's nearly 20 years ago...

Actual theoretical bandwidth = 0.68 * 256 = 174 GB/s (with a GB as 10^9 bytes)

Quite different from its claim of 1 TB.
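
The same arithmetic as a small sketch, using only the figures quoted above (1024 processors, quad-socket nodes, 1 GB/s raw and roughly 0.68 GB/s usable per node, all recalled from memory in the post); the function name is made up for the example.

```python
def aggregate_bandwidth_gb_s(nodes, per_node_gb_s):
    """Upper bound on total network bandwidth: every node's link saturated at once."""
    return nodes * per_node_gb_s


processors = 1024
sockets_per_node = 4
nodes = processors // sockets_per_node          # 256 nodes

raw = aggregate_bandwidth_gb_s(nodes, 1.0)      # link rate including overhead
usable = aggregate_bandwidth_gb_s(nodes, 0.68)  # user data only
claimed = 1000.0                                # 1 TB/s with 1 GB = 10^9 bytes

print(f"nodes:            {nodes}")
print(f"raw aggregate:    {raw:.0f} GB/s")
print(f"usable aggregate: {usable:.0f} GB/s")
print(f"claim / usable:   {claimed / usable:.1f}x")
```

Even counting the raw link rate, the aggregate stays at about a quarter of the advertised figure.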

Also, a single CPU was quite a bit slower, about 20% for my program, than in the other partitions, as if the 512-processor partition still had CPUs with slower caches (which would have been far cheaper CPUs), whereas on paper it should have had the more expensive CPUs.

But well, the government never benchmarked the machine, so what did they know about what they received from SGI.

It's all fantastic claims there.

Internally the Threadripper with its 8 CCDs is basically an 8-socket system. So there will be a complicated cache-coherency protocol of some sort towards the crossbar.

It's far cheaper to produce 8-core CCDs and clock those at 3 GHz than to produce Xeon CPUs of 22 or even 54+ cores, which by definition need to be clocked far lower; 2 GHz is actually a lot for a 20+ core CPU.

Threadripper's biggest advantage is its high clock, and as it's a 'gamers CPU' it's easy to overclock as well.

As for marketing managers: I remember a convention where some Intel guy was giving a presentation, but before he gave any data we first had 10 minutes of slides with DISCLAIMERS that anything he was going to say was a big freaking lie. That was about fantastic claims about Knights Corner and Knights Landing and how this would revolutionize supercomputing. This was around 2010...

You can shred any bandwidth claim.

More L3 cache is very important for whatever sort of prime number software you're busy with.
Old 2021-06-01, 16:35   #6
mackerel

Feb 2016
UK

6708 Posts

AMD showed a consumer CPU with that cache, not a threadripper. It is on that basis I'm asking questions trying to figure out how it might work in practice. We can worry about it scaling elsewhere later.

I'm starting to get the feeling that the 2 TB/s quoted is the maximum bandwidth TSVs could allow, not that it could be attainable in the cache implementation.

Hot info from Ian Cutress of AnandTech: "Confirmed with AMD that V-Cache will be coming to Ryzen Zen 3 products, with production at end of year."
https://twitter.com/IanCutress/statu...66139769602058

Last fiddled with by mackerel on 2021-06-01 at 16:37
Old 2021-06-01, 16:37   #7
diep

Sep 2006
The Netherlands

11×71 Posts

Quote:
Originally Posted by mackerel
AMD showed a consumer CPU with that cache, not a threadripper. It is on that basis I'm asking questions trying to figure out how it might work in practice. We can worry about it scaling elsewhere later.

I'm starting to get the feeling that the 2 TB/s quoted is the maximum bandwidth TSVs could allow, not that it could be attainable in the cache implementation.
Even if you had SRAM that delivers 2 TB/s, what sort of L2 cache can handle 2 TB/s on any sort of CPU right now?

edit:
Please note that read speed is usually much faster than write speed, because reads can usually be done concurrently, whereas for writes some sort of cache-coherency protocol will kill performance.

On typical AMD processors it's 2:1 (2 cache-line reads can be done for every cache-line write).

Last fiddled with by diep on 2021-06-01 at 16:43
Old 2021-06-01, 16:45   #8
mackerel

Feb 2016
UK

23·5·11 Posts

Quote:
Originally Posted by diep
Even if you would have SRAM that delivers 2 TB/s - what sort of L2 cache can handle 2 TB/s for any sort of cpu right now?
Again looking at my Cezanne, that benchmarks with Aida64 (an old version not optimised for recent CPUs) at around 1 TB/s on L2. I'm not sure if that is aggregate or per core. Even if aggregate, it is of a similar order of magnitude. Then we can argue measured vs peak.
Old 2021-06-01, 16:48   #9
diep

Sep 2006
The Netherlands

78110 Posts

Note that your link doesn't work here. Maybe you can cite some text or link to AMD websites? Twitter is not friendly here (I'm under Linux).

I would be amazed if any new CPU from AMD didn't have the CCD concept, because with the CCD concept they have Intel by the balls.
Old 2021-06-01, 17:03   #10
mackerel

Feb 2016
UK

23×5×11 Posts

Quote:
This technology will be productized with 7nm Zen 3-based Ryzen processors. Nothing was said about EPYC.
Those processors will start production at the end of the year. No comment on availability, although Q1 2022 would fit into AMD's regular cadence.
This V-Cache chiplet is 64 MB of additional L3, with no stepped penalty on latency. The V-Cache is address striped with the normal L3 and can be powered down when not in use. The V-Cache sits on the same power plane as the regular L3.
The processor with V-Cache is the same z-height as current Zen 3 products - both the core chiplet and the V-Cache are thinned to have an equal z-height as the IOD die for seamless integration
As the V-Cache is built over the L3 cache on the main CCX, it doesn't sit over any of the hotspots created by the cores and so thermal considerations are less of an issue. The support silicon above the cores is designed to be thermally efficient.
The V-Cache is a single 64 MB die, and is relatively denser than the normal L3 because it uses SRAM-optimized libraries of TSMC's 7nm process. AMD knows that TSMC can do multiple stacked dies, however AMD is only talking about a 1-High stack at this time which it will bring to market.
https://www.anandtech.com/show/16725...-for-15-gaming

Some new news which answers pretty much what I asked.
Old 2021-06-01, 17:11   #11
diep

Sep 2006
The Netherlands

11×71 Posts

Especially the benchmarking based upon an 'unknown videocard' which gives a 20% performance boost or 15% fps increase tells a lot :)

It's a theoretical bandwidth claim.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.