#12 |
P90 years forever!
Aug 2002
Yeehaw, FL
2·4,079 Posts |
Intel's optimization manuals have been updated with AVX512 info. Chapter 15 has some interesting details on FMA's use of ports 0/1/5 and on detecting CPUs with a single AVX512 FMA unit.
The link: https://software.intel.com/sites/def...ion-manual.pdf

Last fiddled with by Prime95 on 2017-12-26 at 20:59
#13 | |
∂2ω=0
Sep 2002
República de California
5·2,351 Posts |
Quote:
Thanks for the link to the updated optimization manual. |
#14 | |
Sep 2016
172₁₆ Posts |
Quote:
So *any* 512-bit arithmetic instruction will contend with the FMA resources. I'm unsure about the mask instructions. I heard somewhere that those go to port 5, which means they will also interfere with the FMAs.

Those are for Knights Landing. Agner Fog doesn't have Skylake X results yet.

Last fiddled with by Mysticial on 2017-12-27 at 02:45
#15 | |
P90 years forever!
Aug 2002
Yeehaw, FL
2·4,079 Posts |
Quote:
For now, I'll code it up in a straightforward way and leave optimization until an I9X is available. I may need to adjust my AVX-512 swizzling strategy later on if I cannot make this perform well.
#16 | |
Sep 2016
2×5×37 Posts |
Quote:
I suppose merging the blending into an actual arithmetic instruction is a no-go since that implies wasted computation. |
#17 |
P90 years forever!
Aug 2002
Yeehaw, FL
8158₁₀ Posts |
More info:
http://users.atw.hu/instlatx64/AVX51...n_v102_PUB.ods

I wonder if "VMOVAPD zreg {kreg}, zreg" uses any ports.
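For readers unfamiliar with the masked form of the move, here is a minimal sketch of what a merge-masked register-to-register move does lane by lane. It is a plain Python emulation, not real SIMD code; the function name `masked_move` and the 8-lane example values are hypothetical, and the sketch says nothing about which execution ports the real instruction uses.

```python
def masked_move(dst, src, mask):
    """Emulate merge-masking: lane i takes src[i] if bit i of mask is set,
    otherwise it keeps the old value dst[i]."""
    if len(dst) != len(src):
        raise ValueError("dst and src must have the same number of lanes")
    return [s if (mask >> i) & 1 else d
            for i, (d, s) in enumerate(zip(dst, src))]


# Eight double-precision lanes, as in one 512-bit register.
dst = [0.0] * 8
src = [float(i) for i in range(1, 9)]

# Only the low four mask bits are set, so only lanes 0..3 are overwritten.
result = masked_move(dst, src, 0b00001111)
print(result)
```

This is the same operation as a blend of two registers under a mask, which is why its port usage matters for code that interleaves blends with FMAs.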
#18 |
P90 years forever!
Aug 2002
Yeehaw, FL
2×4,079 Posts |
And reading between the lines of:
https://software.intel.com/sites/def...ut-latency.pdf

I think all ops that do masking (including vmovapd) use ports 0 and 5, so they will compete with FMA. The exception is masked loads from memory, which do not use ports 0 and 5.

There is plenty of room for Intel to do a better implementation of AVX-512 in future processors. Issuing two FMA instructions uses three ports: 0, 1, and 5. This makes it impossible to do any shuffling / blending / masking / etc. without interfering with FMAs. Intel should put the FMA units on ports 0 & 1, with shuffling / masking / etc. done on port 5 without any extra silicon. This would mirror how AVX2 was implemented on Haswell/Skylake/KabyLake.
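The cost of that contention can be put into rough numbers with a back-of-the-envelope throughput bound. The sketch below is only an illustration under simplifying assumptions that are not from Intel's documentation: every 512-bit vector uop takes one issue slot, latencies, loads/stores and the front end are ignored, and the function names and the 8-FMA / 2-shuffle mix are made up for the example.

```python
def cycles_shared_ports(fma_uops, other_uops):
    """Current layout as described above: two 512-bit slots per cycle
    (fused port 0+1, and port 5), shared by FMAs and by
    shuffle/blend/mask uops alike."""
    return (fma_uops + other_uops) / 2


def cycles_split_ports(fma_uops, other_uops):
    """Hypothetical layout suggested above: FMAs on ports 0 and 1
    (two per cycle), shuffle/blend/mask uops on port 5 (one per cycle)."""
    return max(fma_uops / 2, other_uops / 1)


# Example loop body: 8 FMAs plus 2 shuffle/mask uops per iteration.
shared = cycles_shared_ports(8, 2)
split = cycles_split_ports(8, 2)
print(shared, split)
```

Under these assumptions the shared layout needs at least 5 cycles per iteration, while the split layout needs 4, because the two shuffles no longer steal FMA slots.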
#19 |
Sep 2016
2·5·37 Posts |
This came up in a conversation with a fellow Stackoverflower.
The port 0+1 fusion in 512-bit mode might be related to the register file.

Intel discloses that Skylake client has "168 FP registers". "FP" has historically implied "SIMD", so let's assume that Skylake has 168 SIMD registers. What's not disclosed is how large they are.

Skylake server with AVX512 has the same core as Skylake client, but with the extra L2 and FMA unit attached to the side. If the 168 SIMD registers were 512 bits wide, that would imply a lot of dead silicon in the Skylake client chips. I find this hard to believe for a first-generation implementation of AVX512 that hardly anybody uses.

Personally, I've seen certain sequences with long dependency chains, with identical FP-only code that differs only in 256-bit vs. 512-bit, where the 512-bit version can sometimes run significantly slower (30%) for unknown reasons (same clock frequency). I find it hard to believe that this can be completely explained by the longer port 5 FMA latency.

So here's a new theory:

It's likely that ports 0 and 1 don't even have a 512-bit data width to begin with. Having only half the number of renamed registers in 512-bit mode could certainly help to explain why 512-bit code with long dependency chains can run slower, since it's more likely to run out of them.

Last fiddled with by Mysticial on 2018-02-20 at 20:38
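The "half the number of renamed registers" remark is simple arithmetic once you assume how 512-bit values are stored. The sketch below assumes, purely for illustration, that each of the 168 physical entries is 256 bits wide and that a 512-bit value occupies two of them. Neither assumption is disclosed by Intel, and the function name is hypothetical.

```python
def effective_registers(physical_entries, entry_bits, vector_bits):
    """Number of vector values the register file can hold if each value
    occupies as many physical entries as needed to cover its width."""
    entries_per_vector = max(1, vector_bits // entry_bits)
    return physical_entries // entries_per_vector


# Assumed: 168 physical entries of 256 bits each.
regs_256 = effective_registers(168, 256, 256)
regs_512 = effective_registers(168, 256, 512)
print(regs_256, regs_512)
```

With those assumptions, 256-bit code can have 168 values in flight but 512-bit code only 84, which would make long dependency chains run out of rename registers sooner.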
#20 | |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
6,679 Posts |
And also watch your transitions into and out of using AVX512 instructions.
https://travisdowns.github.io/blog/2.../avxfreq1.html

Quote: