#12
Sep 2016
370₁₀ Posts
Quote:
Yeah, I can run those later tonight.
#13
"Oliver"
Sep 2017
Porta Westfalica, DE
527₁₆ Posts
Would you mind starting the benchmark from 1K FFTs? The smaller ones usually profit the most from AVX-512. Larger FFTs like fewer workers and more cores per worker.
#14
Sep 2016
562₈ Posts
Quote:
--------------------
Finally got around to looking at more of the regular reviews. Looks like some of these launch-day reviewers had access to slides which I did not. The FP register file did in fact turn out to be 192.
https://hothardware.com/reviews/amd-...50x-cpu-review
https://images.hothardware.com/conte...3-vs-zen-4.jpg
Last fiddled with by Mysticial on 2022-09-27 at 07:12
#15
Sep 2016
2·5·37 Posts
Some corrections and clarifications:
Quote:
Quote:
Quote:
Zen4 beats or matches Intel in throughput for all shuffle instructions. But Intel has shorter latencies for some of the instructions.
Quote:
Last fiddled with by Mysticial on 2022-09-28 at 02:15
#16
"Peter Cordes"
Sep 2022
1 Posts
Quote:
AVX-512 masking is great; very often you won't actually cross into an unmapped page, but you still need correctness in case you do. So a high penalty for fault-suppression at the end of loops in rare cases can be worth it, especially if that lets compilers auto-vectorize without as much code-size bloat. Depending on how you vectorize, if the start of the array is aligned by 32 or 64, you won't cross a page, but you may still want your function to work in general.

And new instructions like vpternlogd are great. So many cool features are not getting any closer to being widespread, let alone baseline, because of Intel's bad planning and/or market segmentation crap, and tying those new instructions to AVX-512 specifically, not making VEX opcodes for 256-bit versions of them. (There's a ton of VEX coding space left; there's a field where different patterns represent different combinations of mandatory prefixes for legacy SSE. But there are lots of values left for that field.)

It's not an easy problem, though, and I can see not wanting to have CPUID bits for 256-bit-limited AVX-512 just for the complexity. (I wonder if they also didn't want AMD to be able to implement a hypothetical AVX-512-ymm-only with 128-bit halves in a Zen1 uarch, and claim AVX-512 compatibility as feature-parity with Intel.) And maybe they wanted software to get compiled with 512-bit codepaths, not using a 256-bit compatibility approach, so future CPUs with less bad throttling could run faster.

I assume Intel's client CPU strategy is built around ahead-of-time compiled binary-only software, where AVX-512 is less likely to get used than when stuff can get built from source with `-march=native` (or JITed).

For a lot of problems, a scalable extension like AArch64's SVE or RISC-V's vector extensions has big advantages, though, allowing efficient use of whatever hardware vector width exists. These AVX-512 problems really highlight that.
Quote:
* Accidentally having an all-zero mask, so fault suppression happens in a page that was copy-on-write and thus read-only in the HW page tables?
* Or suppression of a trap to a microcode assist to set the Dirty bit in the page table (assuming AMD does that like Intel.)
* Some microarchitectural quirk that doesn't like it when you do masked stores to the same location repeatedly?
* Some possible loop count error that's running it more iterations than you intended to count? Unlikely; if performance counters are available, I assume you'd be able to use those to test the code. Or test the same code on an Intel CPU where you know the perf counters work.
Quote:
Quote:
Or maybe it has something to do with the latency of the two halves of a 512-bit instruction occupying a port for an extra cycle? It's hidden when forwarding to another 512-bit non-shuffle instruction, but perhaps visible in terms of how soon an instruction from another dep chain can run. Hrm, but the relevant distance for whether two long dep chains can overlap is in uops, not cycles. So yeah, further investigation needed.

Yeah same. I haven't had an AMD since K8, but with Intel dropping the ball on AVX-512 for efficient and affordable CPUs, and AMD supporting ECC RAM, I'm looking at Zen4 for a home server once the dust has settled.
Last fiddled with by Peter Cordes on 2022-09-28 at 03:05
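The masked loop-tail idea from post #16 can be sketched in portable C. This is a scalar emulation and not real AVX-512 code: `tail_mask`, `masked_load`, `masked_store`, and `add_const` are hypothetical names made up for illustration, and the 16-lane width assumes 32-bit elements in a 512-bit vector. The point it shows is that lanes whose mask bit is clear are never read or written, which is the behavior that fault suppression guarantees in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* assumption: 32-bit lanes in one 512-bit vector */

/* Mask with the low `remaining` bits set, for the final partial vector. */
static uint16_t tail_mask(size_t remaining) {
    assert(remaining <= LANES);
    return remaining == LANES ? (uint16_t)0xFFFF
                              : (uint16_t)((1u << remaining) - 1u);
}

/* Emulated zero-masking load: masked-off lanes are never read from memory,
   so an unmapped page past the end of the array is never touched. */
static void masked_load(int32_t dst[LANES], const int32_t *src, uint16_t mask) {
    for (int i = 0; i < LANES; i++)
        dst[i] = ((mask >> i) & 1) ? src[i] : 0;
}

/* Emulated masked store: masked-off lanes are never written. */
static void masked_store(int32_t *dst, const int32_t v[LANES], uint16_t mask) {
    for (int i = 0; i < LANES; i++)
        if ((mask >> i) & 1)
            dst[i] = v[i];
}

/* Add k to a[0..n): full vectors first, then a single masked iteration
   for the tail instead of a separate scalar cleanup loop. */
static void add_const(int32_t *a, size_t n, int32_t k) {
    int32_t v[LANES];
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        masked_load(v, a + i, 0xFFFF);
        for (int j = 0; j < LANES; j++) v[j] += k;
        masked_store(a + i, v, 0xFFFF);
    }
    if (i < n) {
        uint16_t m = tail_mask(n - i);
        masked_load(v, a + i, m);
        for (int j = 0; j < LANES; j++) v[j] += k;
        masked_store(a + i, v, m);
    }
}
```

With real intrinsics, the two helpers would roughly correspond to `_mm512_maskz_loadu_epi32` and `_mm512_mask_storeu_epi32`. The cost being discussed in the thread is what the hardware pays when a masked-off lane actually falls on a faulting page.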
#17
Romulan Interpreter
"name field"
Jun 2011
Thailand
10,273 Posts
@Mysticial: Wow! Just wow!
Last fiddled with by Dr Sardonicus on 2022-09-28 at 12:40 Reason: xingif topsy
#18
Sep 2016
2×5×37 Posts
Oh hi Peter. Long time no see!
Quote:
Quote:
This test is far from perfect, but more representative of actual code that hasn't been optimized to hide latencies.
Quote:
I will actually need two Zen4 machines, one for everyday use (coding/gaming), and one for perf-testing. I can't perf test on my main machine, since the test machine needs to be clean and free of background programs. As much as I want to get that 2nd 7950X now, I also want to hold out for the 3D version (if it will exist). And I'm a bit hesitant to ask AMD for a review sample of a 7950X3D later on. Though I might try anyway. I do have a valid use case for it, since I'm currently working on a new internal algorithm for y-cruncher which is slated to better utilize large and deeply hierarchical caches (which AMD's 3D-cache will fall under).
Last fiddled with by Mysticial on 2022-09-28 at 15:03
#19
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
37×163 Posts
Am I correct in thinking that microcoded instructions can be improved by updates? If so, would it be worth communicating with AMD about the potential for an improved version of vpcompressd?
I am a little confused by how AMD selects people to share test setups with. I would have thought it would be in AMD's interest to give people like Agner Fog early access. It looks like Agner Fog hasn't updated his instruction tables with Zen 4 yet. This is probably due to lack of access. @Mysticial Would you be able to help him with this?
Last fiddled with by henryzz on 2022-09-30 at 14:48
#20
Sep 2016
2×5×37 Posts
Quote:
This post suggests it could be a last-minute bug that was patched via ucode instead of re-spinning the silicon: https://www.realworldtech.com/forum/...rpostid=208874
Quote:
Sampling is very tightly controlled to minimize the risk of leaks. When I revealed I had the chip by announcing the new version of y-cruncher, I was immediately slammed by tons of people trying to squeeze information out of me. So handing it out to too many people runs the risk of someone getting compromised (bribed). I bet they were also vetting me in the VCs I had with their chip architects.

In my case, y-cruncher has become quite popular among hardware reviewers. So it would benefit them if they could improve its performance on their chip. In the past, I've also had similar threads with Intel on how to speed up y-cruncher. But those were mostly software-side and never went in the direction of sending me hardware.
Quote:
Last fiddled with by Mysticial on 2022-09-30 at 16:39
#21
Aug 2002
23·1,069 Posts
#22
Aug 2002
61 Posts
"This implies that Zen4 has a 64-bit multiplier in every SIMD lane"
Oh, no one tell jasonp!
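For context on why a 64-bit multiplier in every lane is notable: when a lane can only do 32×32→64-bit multiplies (as with AVX2, which has no 64-bit lane multiply), the low 64 bits of a product have to be assembled from partial products. Below is a minimal scalar sketch of that decomposition. The function name `mul64_lo_via_32` is hypothetical, and the code illustrates the arithmetic only; it makes no claim about how Zen4 or any other CPU implements the operation.

```c
#include <stdint.h>

/* Low 64 bits of a*b, using only 32x32->64-bit multiplies. */
static uint64_t mul64_lo_via_32(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;               /* contributes all 64 bits */
    uint64_t cross = a_lo * b_hi + a_hi * b_lo; /* only its low 32 bits survive the shift */

    /* a_hi * b_hi would land entirely above bit 63, so it is dropped. */
    return lo_lo + (cross << 32);
}
```

That is three multiplies plus shifts and adds per lane, versus a single instruction when the lane has a native 64-bit multiplier.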