mersenneforum.org Zen4's AVX512 Teardown

2022-09-26, 19:52   #12
Mysticial

Sep 2016

2×5×37 Posts

Quote:
 Originally Posted by preda OK thanks. In case you want to try, there is Prime95 for Windows https://www.mersenne.org/download/ which I assume has the same benchmark function as on Linux. I don't know how easy all that is over the remote access though.

Yeah, I can run those later tonight.

2022-09-26, 19:56   #13
kruoli

"Oliver"
Sep 2017
Porta Westfalica, DE

32·137 Posts

Would you mind starting the benchmark from 1K FFTs? The smaller ones usually profit the most from AVX-512. Larger FFTs like fewer workers and more cores per worker.

2022-09-27, 07:09   #14
Mysticial

Sep 2016

2·5·37 Posts

Quote:
 Originally Posted by preda OK thanks. In case you want to try, there is Prime95 for Windows https://www.mersenne.org/download/ which I assume has the same benchmark function as on Linux. I don't know how easy all that is over the remote access though.

--------------------

Finally got around to looking at more of the regular reviews. Looks like some of these launch-day reviewers had access to slides which I did not.

The FP register file did in fact turn out to be 192.

https://hothardware.com/reviews/amd-...50x-cpu-review
https://images.hothardware.com/conte...3-vs-zen-4.jpg

Last fiddled with by Mysticial on 2022-09-27 at 07:12

2022-09-28, 02:10   #15
Mysticial

Sep 2016

2·5·37 Posts

Some corrections and clarifications:

Quote:
 Shuffles/Permutes: ... This is incredible because the silicon cost for permutes is (probably) quadratic to the granularity, and it appears Zen4 has paid that cost.
After speaking with Travis Downs, I think the cost is actually O(N*log(N)) instead of O(N^2) to the granularity. I missed that the # of bits in each lane decreases - thus offsetting part of the quadratic factor. Then you throw in a log(N) factor for the MUX'ing levels.
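
One way to read the revised estimate, as a rough counting sketch rather than a claim about Zen4's actual layout: an N-lane permute on a fixed-width vector needs N output lanes that each select from N input lanes, but each lane is only W/N bits wide, so the crosspoint count grows as N·W instead of N². A log2(N) factor then accounts for the mux levels. The function names and the 512-bit width below are illustrative assumptions.

```python
import math

VECTOR_BITS = 512  # assumed full vector width


def naive_cost(lanes):
    """Original guess: cost grows with the square of the lane count."""
    return lanes * lanes


def revised_cost(lanes, vector_bits=VECTOR_BITS):
    """Revised guess: lanes^2 crosspoints, each only (vector_bits / lanes)
    bits wide, times log2(lanes) mux levels. For a fixed vector width this
    is O(N log N) in the lane count N."""
    lane_bits = vector_bits // lanes
    return lanes * lanes * lane_bits * math.log2(lanes)


# Relative cost versus a 16-lane (dword) permute under each model.
for name, lanes in [("dword", 16), ("word", 32), ("byte", 64)]:
    print(name,
          naive_cost(lanes) / naive_cost(16),
          revised_cost(lanes) / revised_cost(16))
```

Under this toy model a byte-granular permute costs 6x a dword-granular one, compared with 16x under the quadratic guess.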

Quote:
 Shuffles/Permutes: ... By comparison, Intel's lower-granular permutes are quite slow, though steadily improving since Skylake.
The only "quite slow" permutes on Intel are the 3-input 16-bit and byte-granular permutes (the most expensive ones). They are 3 uops with a throughput of 2 cycles/instruction. So they are 2-3x slower than on AMD, depending on whether you count the extra (non-shuffle) uop. All other shuffles are fast (1/cycle throughput) on Intel.

Quote:
 Shuffles/Permutes: ... it far exceeds my expectations and beats Intel in every aspect.
It does not actually beat Intel in every aspect. When I typed that I was only thinking about throughput and completely ignored latencies.

Zen4 beats or matches Intel in throughput for all shuffle instructions. But Intel has shorter latencies for some of the instructions.

Quote:
 Physical Register File: My best estimate of Zen4's physical vector register file is ~192 x 512-bit
As mentioned in my previous post, this 192 estimate turned out to be spot on. I was actually fairly confident of this, but I gave it a +/-16 uncertainty because I wasn't sure of the effect of the 16 additional architectural registers.

Last fiddled with by Mysticial on 2022-09-28 at 02:15

2022-09-28, 02:40   #16
Peter Cordes

"Peter Cordes"
Sep 2022

12 Posts

Quote:
 Originally Posted by Mysticial But anyway, we live in a very strange world now. AMD has AVX512, but Intel does not for mainstream... If you told me this a few years ago, I'd have looked at you funny. But here we are... Despite the "double-pumping", Zen4's AVX512 is not only competitive with Intel's, it outright beats it in many ways. Intel does remain ahead in a few areas though. AMD bringing a high quality AVX512 implementation to the consumer market is going to shake things up. This will make it much harder for Intel to kill off AVX512. So it will be interesting in the next few years to see how they respond. Do they bring AVX512 to the E-cores? Do they work with OS's to find a way to automatically migrate AVX512 processes to the P-cores? Do they kill off the E-cores? This situation is hilarious since it is Intel who made these instructions and now AMD is running away with it.
Yeah, I'm really glad to see AMD's AVX-512 is even better than how they handled AVX on Zen1 (a strategy they mostly kept from Bulldozer-family). (And really disappointed with Intel suppressing AVX-512 in Alder Lakes with no E-cores, although I assume that helps yields a small amount). A full-width shuffle unit is a cool idea; that solves a lot of the problems of a fully half-width implementation where it can face-plant on code not specifically tuned to be aware of pitfalls.

AVX-512 masking is great; very often you won't actually cross into an unmapped page, but you still need correctness in case you do. So a high penalty for fault-suppression at the end of loops in rare cases can be worth it, especially if that lets compilers auto-vectorize without as much code-size bloat. Depending how you vectorize, if the start of the array is aligned by 32 or 64, you won't cross a page, but you may still want your function to work in general.
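
A minimal scalar model of that masked-tail idea, assuming 16 dword lanes per vector. The helper names are made up for illustration; real code would use masked loads such as vmovdqu32 with a k register. The point it shows is that masked-off lanes are never touched, which is what fault suppression guarantees at the end of an array.

```python
LANES = 16  # assumed: 16 x 32-bit elements per 512-bit vector


def tail_mask(remaining):
    """Mask with one bit set per element still inside the array."""
    return (1 << min(remaining, LANES)) - 1


def masked_load(array, start, mask):
    """Model of a zero-masked load: lanes whose mask bit is clear are
    never read (so they cannot fault) and come back as zero."""
    return [array[start + i] if (mask >> i) & 1 else 0 for i in range(LANES)]


def vector_sum(array):
    """Sum an array of any length with no separate scalar cleanup loop."""
    total = 0
    for start in range(0, len(array), LANES):
        mask = tail_mask(len(array) - start)
        total += sum(masked_load(array, start, mask))
    return total


print(vector_sum(list(range(37))))
```

The last iteration here runs with only 5 of 16 lanes active; the other 11 indices are past the end of the list and are never accessed.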

And new instructions like vpternlogd are great. So many cool features not getting any closer to being widespread, let alone baseline, because of Intel's bad planning and/or market segmentation crap, and tying those new instructions to AVX-512 specifically, not making VEX opcodes for 256-bit versions of them. (There's a ton of VEX coding space left, there's a field where different patterns represent different combinations of mandatory prefixes for legacy SSE. But there are lots of values left for that field.)
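
For readers who have not used it, vpternlogd computes any 3-input bitwise function: the 8-bit immediate is a truth table indexed by the three input bits. Below is a small scalar model of one 32-bit lane; the function name is made up, and the operand order follows the documented convention of destination as the most significant index bit.

```python
MASK32 = 0xFFFFFFFF


def ternlog32(a, b, c, imm8):
    """Model of one 32-bit lane of vpternlogd.
    For every bit position, the bits of a, b, c form a 3-bit index
    (a is the MSB) that selects one bit of imm8 as the result."""
    result = 0
    for bit in range(32):
        idx = ((((a >> bit) & 1) << 2)
               | (((b >> bit) & 1) << 1)
               | ((c >> bit) & 1))
        result |= ((imm8 >> idx) & 1) << bit
    return result


a, b, c = 0xDEADBEEF, 0x12345678, 0x0F0F0F0F
print(hex(ternlog32(a, b, c, 0x96)))  # 0x96: three-way XOR
print(hex(ternlog32(a, b, c, 0xCA)))  # 0xCA: bitwise select, a ? b : c
print(hex(ternlog32(a, b, c, 0xE8)))  # 0xE8: bitwise majority
```

Each of these would otherwise take two or more separate logic instructions.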

It's not an easy problem, though, and I can see not wanting to have CPUID bits for 256-bit-limited AVX-512 just for the complexity. (I wonder if they also didn't want AMD to be able to implement a hypothetical AVX-512-ymm-only with 128-bit halves in a Zen1 uarch, and claim AVX-512 compatibility as feature-parity with Intel). And maybe they wanted software to get compiled with 512-bit codepaths, not using a 256-bit compatibility approach, so future CPUs with less bad throttling could run faster.

I assume Intel's client CPU strategy is built around ahead-of-time compiled binary-only software, where AVX-512 is less likely to get used than when stuff can get built from source with -march=native (or JITed).

For a lot of problems, a scalable extension like AArch64's SVE or RISC-V's vector extensions has big advantages, though, allowing efficient use of whatever hardware vector width exists. These AVX-512 problems really highlight that.

Quote:
 Originally Posted by Mysticial AVX512 Compress Store: vpcompressd (and family) with a memory destination is microcoded. I measured 142 cycles/instruction for "vpcompressd [mem]{k}, zmm". The in-register version is fast as is the corresponding expand load instructions (with or without memory operand). On Intel, these compress stores are not microcoded and are fast. So this is specific to the memory destination version on Zen4.
That is quite surprising, yeah, if masked vmovdqu32 stores are fast. Surprising enough that I assume you already double-checked to look for possible microbenchmarking mistakes, but the ones that occur to me are:

* accidentally having an all-zero mask, so fault suppression happens in a page that was copy-on-write and thus read-only in the HW page tables?
* Or suppression of a trap to a microcode assist to set the Dirty bit in the page table (assuming AMD does that like Intel.)

* Some microarchitectural quirk that doesn't like it when you do masked stores to the same location repeatedly?
* Some possible loop count error that's running it more iterations than you intended to count? Unlikely; if performance counters are available, I assume you'd be able to use those to test the code. Or test the same code on an Intel CPU where you know the perf counters work.

Quote:
 Originally Posted by Mysticial By comparison, Zen1 maps 256-bit registers into two entries in a 128-bit register file - resulting in 256-bit having half the reorder window of 128-bit code. Zen1 also cannot rename YMM registers due to them being split.
I think what you mean is that Zen1 can't *move-eliminate* at register-rename stage (for the high half of a YMM). It does of course still rename YMM registers to avoid WAR and WAW hazards of out-of-order exec! Might want to edit that to say Zen1 can't mov-eliminate YMM registers.
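
The distinction can be shown with a toy register-alias-table model. This is a generic sketch of renaming versus move elimination, not a description of Zen1's actual rename hardware, and the class and method names are invented for illustration.

```python
class Renamer:
    """Toy register alias table (RAT): architectural name -> physical register."""

    def __init__(self, arch_regs):
        self.rat = {reg: phys for phys, reg in enumerate(arch_regs)}
        self.next_phys = len(arch_regs)

    def write(self, dst):
        """Ordinary renaming: every write gets a fresh physical register,
        which is what removes WAR and WAW hazards."""
        self.rat[dst] = self.next_phys
        self.next_phys += 1
        return self.rat[dst]

    def mov(self, dst, src, eliminate):
        """Register-to-register move. Returns True if a uop must execute.
        With move elimination the RAT entry is simply pointed at the
        source's physical register and no execution unit is needed."""
        if eliminate:
            self.rat[dst] = self.rat[src]
            return False
        self.write(dst)
        return True


r = Renamer(["ymm0", "ymm1"])
r.write("ymm0")                               # renamed to a fresh register
needs_uop = r.mov("ymm1", "ymm0", eliminate=True)
print(needs_uop, r.rat)
```

Without elimination the move is still renamed (the destination gets a new physical register); it just has to go through an execution port to copy the data.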

Quote:
 Originally Posted by Mysticial I did notice that 256-bit SIMD seems to have slightly better reorder capabilty [sic] than 512-bit for long latency code. This does not happen on Intel processors.
I wonder if that has anything to do with AMD using per-port scheduling queues, instead of a (mostly) unified scheduler like Intel? But that would only explain anything if there were differences in which ports could run the 256 vs. 512-bit versions of instructions, and you probably picked instructions that had the expected factor of 2 in throughput between versions. Also, it's only fully per-port scheduling for scalar integer. At least if it still works like https://en.wikichip.org/wiki/amd/mic...#Block_Diagram shows for Zen 2, where there's a 64-entry non-scheduling queue and then a single 36-entry scheduling queue in front of the four FP/SIMD pipes. (Which doesn't seem very deep, so I guess decent static scheduling may be important, and/or differences in getting instructions *executed* may matter more. Perhaps even how soon they can leave the scheduler after dispatch to a port?)

Or maybe it has something to do with the latency of the two halves of a 512-bit instruction occupying a port for an extra cycle? It's hidden when forwarding to another 512-bit non-shuffle instruction, but perhaps visible in terms of how soon an instruction from another dep chain can run. Hrm, but the relevant distance for whether two long dep chains can overlap is in uops, not cycles. So yeah, further investigation needed.

Quote:
 Originally Posted by Mysticial Personally, I'm super excited about Zen4 and am very much looking forward to a new build. It has the latest AVX512 and it blows my Skylake 7940X out of the water in code compilation.
Yeah same. I haven't had an AMD since K8, but with Intel dropping the ball on AVX-512 for efficient and affordable CPUs, and AMD supporting ECC RAM, I'm looking at Zen4 for a home server once the dust has settled.

Last fiddled with by Peter Cordes on 2022-09-28 at 03:05

2022-09-28, 04:41   #17
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

2×47×109 Posts

@Mysticial: Wow! Just wow!

Last fiddled with by Dr Sardonicus on 2022-09-28 at 12:40 Reason: xingif topsy

2022-09-28, 15:00   #18
Mysticial

Sep 2016

2·5·37 Posts

Oh hi Peter. Long time no see!

Quote:
 Originally Posted by Peter Cordes That is quite surprising, yeah, if masked vmovdqu32 stores are fast. Surprising enough that I assume you already double-checked to look for possible microbenchmarking mistakes, but the ones that occur to me are: * accidentally having an all-zero mask, so fault suppression happens in a page that was copy-on-write and thus read-only in the HW page tables? * Or suppression of a trap to a microcode assist to set the Dirty bit in the page table (assuming AMD does that like Intel.) * Some microarchitectural quirk that doesn't like it when you do masked stores to the same location repeatedly? * Some possible loop count error that's running it more iterations than you intended to count? Unlikely; if performance counters are available, I assume you'd be able to use those to test the code. Or test the same code on an Intel CPU where you know the perf counters work.
Same code on Intel is fast. My test does reuse the same address, but when I replace it with a regular masked store it's only 20 cycles/instruction - from the failed forwarding through a masked store.

Quote:
 I wonder if that has anything to do with AMD using per-port scheduling queues, instead of a (mostly) unified scheduler like Intel? But that would only explain anything if there were differences in which ports could run the 256 vs. 512-bit versions of instructions, and you probably picked instructions that had the expected factor of 2 in throughput between versions. Also, it's only fully per-port scheduling for scalar integer. At least if it still works like https://en.wikichip.org/wiki/amd/mic...#Block_Diagram shows for Zen 2, where there's a 64-entry non-scheduling queue and then a single 36-entry scheduling queue in front of the four FP/SIMD pipes. (Which doesn't seem very deep, so I guess decent static scheduling may be important, and/or differences in getting instructions *executed* may matter more. Perhaps even how soon they can leave the scheduler after dispatch to a port?) Or maybe it has something to do with the latency of the two halves of a 512-bit instruction occupying a port for an extra cycle? It's hidden when forwarding to another 512-bit non-shuffle instruction, but perhaps visible in terms of how soon an instruction from another dep chain can run. Hrm, but the relevant distance for whether two long dep chains can overlap is in uops, not cycles. So yeah, further investigation needed.
The way I tested this was to iterate over an array where for each element, I read the value, then chain 48 FMAs on it and write it back. So the baseline performance is 4 cycles/instruction (with no reordering capability), and you'd need to reorder across 8 and 4 iterations to saturate the FMA for 256-bit and 512-bit respectively. Neither 256-bit nor 512-bit get anywhere close to saturation.

This test is far from perfect, but more representative of actual code that hasn't been optimized to hide latencies.
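
The 8 and 4 figures follow from latency times throughput. A small sketch of that arithmetic, assuming a 4-cycle FMA latency with 2 FMAs/cycle at 256-bit and 1 FMA/cycle at 512-bit; those throughput values are inferred from the numbers quoted above, not measured here.

```python
FMA_LATENCY = 4  # cycles, matching the 4 cycles/instruction baseline


def chains_to_saturate(fma_per_cycle, latency=FMA_LATENCY):
    """Independent dependency chains (here: array iterations) that must
    be in flight at once to keep the FMA units busy."""
    return latency * fma_per_cycle


def cycles_per_fma(overlapped_iterations, fma_per_cycle, latency=FMA_LATENCY):
    """Idealized cost per FMA when the core overlaps this many iterations:
    latency-bound until the throughput limit is reached."""
    return max(latency / overlapped_iterations, 1 / fma_per_cycle)


print(chains_to_saturate(fma_per_cycle=2))   # 256-bit case
print(chains_to_saturate(fma_per_cycle=1))   # 512-bit case
print(cycles_per_fma(1, fma_per_cycle=2))    # no reordering at all
```

Since each iteration carries 48 dependent FMAs, overlapping even a few iterations requires a reorder window that spans well over a hundred instructions.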

Quote:
 Yeah same. I haven't had an AMD since K8, but with Intel dropping the ball on AVX-512 for efficient and affordable CPUs, and AMD supporting ECC RAM, I'm looking at Zen4 for a home server once the dust has settled.
I'm actually wondering about what the 3D cache options will be. Intel doesn't want to admit that it exists since it's destroying them in benchmarks. And AMD doesn't want to admit it exists either because it will undercut their current Zen4 line until the Zen4-3D options become available next year.

I will actually need two Zen4 machines, one for everyday use (coding/gaming), and one for perf-testing. Can't perf test on my main machine since it needs to be clean and free of background programs. As much as I want to get that 2nd 7950X now, I also want to hold out for the 3D version (if it will exist). And I'm a bit hesitant to ask AMD for a review sample of a 7950X3D later on. Though I might try anyway. I do have a valid usecase for it since I'm currently working on a new internal algorithm for y-cruncher which is slated to better utilize large and deeply hierarchical caches. (which AMD's 3D-cache will fall under)

Last fiddled with by Mysticial on 2022-09-28 at 15:03

2022-09-30, 14:48   #19
henryzz
Just call me Henry

"David"
Sep 2007
Liverpool (GMT/BST)

27×47 Posts

Am I correct in thinking that microcoded instructions can be improved by updates? If so would it be worth you communicating with AMD about the potential for an improved version of vpcompressd?

I am a little confused by how AMD select people to share test setups with. I would have thought it would be in AMD's interest to give people like Agner Fog early access.

It looks like Agner Fog hasn't updated his instruction tables with Zen 4 yet. This is probably due to lack of access. @Mystical Would you be able to help him with this?

Last fiddled with by henryzz on 2022-09-30 at 14:48

2022-09-30, 16:35   #20
Mysticial

Sep 2016

2·5·37 Posts

Quote:
 Originally Posted by henryzz Am I correct in thinking that microcoded instructions can be improved by updates? If so would it be worth you communicating with AMD about the potential for an improved version of vpcompressd?
I bet there's some valid reason for it. It only took me like 2 minutes to figure out how to emulate it, so I doubt they'd have overlooked it.

This post suggests it could be a last minute bug that was patched via ucode instead of re-spinning the silicon: https://www.realworldtech.com/forum/...rpostid=208874
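
The post does not say which emulation was used. One plausible sequence, given that the in-register compress and ordinary masked stores are both reported as fast, is to compress in a register and then do a masked store of the low popcount(k) lanes. The scalar model below checks that this is equivalent to a direct compress store; all function names are illustrative.

```python
def compress_store(memory, addr, src, mask):
    """Model of vpcompressd [mem]{k}, zmm: selected elements are written
    contiguously starting at addr; nothing else is touched."""
    out = addr
    for i, value in enumerate(src):
        if (mask >> i) & 1:
            memory[out] = value
            out += 1


def compress_reg(src, mask):
    """Model of in-register vpcompressd with zero-masking: selected
    elements are packed into the low lanes, the rest are zeroed."""
    packed = [v for i, v in enumerate(src) if (mask >> i) & 1]
    return packed + [0] * (len(src) - len(packed))


def masked_store(memory, addr, src, mask):
    """Model of a masked store (e.g. vmovdqu32 [mem]{k}, zmm)."""
    for i, value in enumerate(src):
        if (mask >> i) & 1:
            memory[addr + i] = value


def emulated_compress_store(memory, addr, src, mask):
    """Assumed emulation: compress in register, then store only the
    low popcount(mask) lanes with a regular masked store."""
    packed = compress_reg(src, mask)
    count = bin(mask & ((1 << len(src)) - 1)).count("1")
    masked_store(memory, addr, packed, (1 << count) - 1)


src = list(range(100, 116))
direct, emulated = [0] * 32, [0] * 32
compress_store(direct, 4, src, 0b1010000000000101)
emulated_compress_store(emulated, 4, src, 0b1010000000000101)
print(direct == emulated, direct[4:8])
```

Because the second step uses a mask of exactly popcount(k) low bits, it writes the same bytes as the real instruction and keeps the same fault-suppression behavior for the untouched lanes.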

Quote:
 I am a little confused by how AMD select people to share test setups with. I would have thought it would be in AMD's interest to give people like Agner Fog early access.
I have no idea. Extreme overclockers typically get access many months early - often before final steppings. Is it to hype up the chip?

Sampling is very tightly controlled to minimize the risk of leaks. When I revealed I had the chip by announcing the new version of y-cruncher, I was immediately slammed by tons of people trying to squeeze information out of me. So handing it out to too many people runs the risk of someone getting compromised (bribed). I bet they were also vetting me in the VCs I had with their chip architects.

In my case, y-cruncher has become quite popular among hardware reviewers. So it would benefit them if they could improve its performance on their chip. In the past, I've also had similar threads with Intel on how to speed up y-cruncher. But those were mostly software-side and never went in the direction of sending me hardware.

Quote:
 It looks like Agner Fog hasn't updated his instruction tables with Zen 4 yet. This is probably due to lack of access. @Mystical Would you be able to help him with this?
Agner needs Linux and I unfortunately can't get Linux to boot on my motherboard.

Last fiddled with by Mysticial on 2022-09-30 at 16:39

2022-10-03, 12:50   #21
Xyzzy

Aug 2002

23·11·97 Posts

Attached Thumbnails (images not preserved)

2022-11-07, 01:10   #22
willmore

Aug 2002

6110 Posts

NTT love?

"This implies that Zen4 has a 64-bit multiplier in every SIMD lane"

Oh, no one tell jasonp!

All times are UTC.