mersenneforum.org  

Old 2017-12-26, 20:59   #12
Prime95

Intel's optimization manuals have been updated with AVX512 info. Chapter 15 has some interesting details on FMA's use of port 0/1/5 and detecting CPUs with a single AVX512 FMA unit.

The link: https://software.intel.com/sites/def...ion-manual.pdf

Last fiddled with by Prime95 on 2017-12-26 at 20:59
Old 2017-12-26, 22:51   #13
ewmayer

Quote:
Originally Posted by Prime95
@mystical: Do you know if the vblendmpd instruction is running on the same ports as the FMA units?

http://users.atw.hu/instlatx64/Genui...InstLatX64.txt shows vblendmpd with a latency of one and a throughput of two.
Agner Fog's instruction tables PDF has vblendmpd issuing from FP ports 0/1, latency 2, throughput 2. You looking at this for 8x8 doubles-transposition, or something else?

Thanks for the link to the updated optimization manual.
Old 2017-12-27, 02:37   #14
Mysticial
 

Quote:
Originally Posted by Prime95
Intel's optimization manuals have been updated with AVX512 info. Chapter 15 has some interesting details on FMA's use of port 0/1/5 and detecting CPUs with a single AVX512 FMA unit.

The link: https://software.intel.com/sites/def...ion-manual.pdf
In 512-bit mode, there are only two ports that will accept vector (non-memory) instructions. On the "good" chips, both of them have FMAs.

So *any* arithmetic 512-bit instruction will contend with the FMA resources. I'm unsure about the mask instructions. I've heard that those go to port 5, which would mean they also interfere with the FMAs.

Quote:
Originally Posted by ewmayer
Agner Fog's instruction tables PDF has vblendmpd issuing from FP ports 0/1, latency 2, throughput 2. You looking at this for 8x8 doubles-transposition, or something else?

Thanks for the link to the updated optimization manual.
Those are for Knights Landing. Agner Fog doesn't have Skylake X results yet.

Last fiddled with by Mysticial on 2017-12-27 at 02:45
Old 2017-12-27, 03:46   #15
Prime95

Quote:
Originally Posted by ewmayer
You looking at this for 8x8 doubles-transposition, or something else?
Something else. My real-to-complex FFT does not use Hermitian symmetry in the standard way (an all-complex FFT with a j, n-j combination step before squaring), as I've never figured out how to do that in a cache-friendly way. Instead, the first step processes 16 reals, producing 7 complex and 2 real values. These results are then swizzled so that each cache line/AVX-512 word contains 1 real value and the real or imaginary parts of 7 complex values. The next step does 3 FFT "levels": an 8-complex butterfly on 7 of the AVX-512 doubles and an 8-reals butterfly on the remaining one. Surprisingly, these 2 different operations are very similar and can be done as fast as an 8-complex butterfly IF vblendmpd is free.

For now, I'll code it up in a straightforward way and leave optimization until an I9X is available. I may need to adjust my AVX-512 swizzling strategy later on if I cannot make this perform well.
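To make the "16 reals in, 7 complex plus 2 reals out" step concrete, here is a minimal Python sketch. It uses a plain 16-point DFT as a stand-in for the first pass, which is an assumption; the real pass is hand-written AVX-512 code and may differ in twiddles and ordering. The names `dft` and `pack_16_reals` and the exact lane layout are hypothetical, chosen only to match the description of one real value plus the real or imaginary parts of 7 complex values per 8-double word.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT; adequate for a 16-point illustration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def pack_16_reals(x):
    """Transform 16 reals and pack the 16 independent outputs into two 8-lane words.

    Hypothetical layout (an assumption, not taken from the post):
      lane 0     -> one of the two purely real outputs (bin 0 or bin 8)
      lanes 1..7 -> real parts (first word) or imaginary parts (second word)
                    of bins 1..7
    """
    assert len(x) == 16
    X = dft(x)
    word_re = [X[0].real] + [X[k].real for k in range(1, 8)]
    word_im = [X[8].real] + [X[k].imag for k in range(1, 8)]
    return X, word_re, word_im

samples = [float((3 * i * i + 1) % 7) for i in range(16)]
X, word_re, word_im = pack_16_reals(samples)

# Bins 0 and 8 are real; bins 9..15 are conjugates of bins 7..1,
# so 2 reals + 7 complex values carry all the information.
redundant = all(abs(X[16 - k] - X[k].conjugate()) < 1e-9 for k in range(1, 8))
```

The two words together hold exactly 16 doubles, the same amount of data as the input, which is why no output bins beyond 0..8 need to be stored.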
Old 2017-12-27, 18:46   #16
Mysticial
 

Quote:
Originally Posted by Prime95
Something else. My real-to-complex FFT does not use Hermitian symmetry in the standard way (an all-complex FFT with a j, n-j combination step before squaring), as I've never figured out how to do that in a cache-friendly way. Instead, the first step processes 16 reals, producing 7 complex and 2 real values. These results are then swizzled so that each cache line/AVX-512 word contains 1 real value and the real or imaginary parts of 7 complex values. The next step does 3 FFT "levels": an 8-complex butterfly on 7 of the AVX-512 doubles and an 8-reals butterfly on the remaining one. Surprisingly, these 2 different operations are very similar and can be done as fast as an 8-complex butterfly IF vblendmpd is free.

For now, I'll code it up in a straightforward way and leave optimization for later until an I9X is available. I may need to adjust my AVX-512 swizzling strategy later on if I cannot make this perform well.
Not sure how feasible this is, but you could try abusing load/stores to do blending without clobbering either of the FMA ports.

I suppose merging the blending into an actual arithmetic instruction is a no-go since that implies wasted computation.
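A small Python model of the load/store idea, for illustration only. It treats a 512-bit register as a list of 8 doubles and shows that a per-lane masked blend gives the same result as writing one operand to memory and doing a masked (merging) load into the other. The function names are hypothetical, and the sketch says nothing about latency or store-forwarding cost, which would have to be measured on real hardware.

```python
LANES = 8  # eight doubles per 512-bit register

def blend(mask, a, b):
    """Register-to-register blend: lane i comes from b if mask bit i is set, else from a."""
    return [b[i] if (mask >> i) & 1 else a[i] for i in range(LANES)]

def blend_via_memory(mask, a, b):
    """Same result using a store followed by a masked, merging load."""
    memory = list(b)           # full-width store of b to a scratch buffer
    reg = list(a)              # destination register starts out holding a
    for i in range(LANES):     # masked load: only selected lanes are overwritten
        if (mask >> i) & 1:
            reg[i] = memory[i]
    return reg

a = [float(i) for i in range(LANES)]
b = [float(10 + i) for i in range(LANES)]
same_for_all_masks = all(
    blend(m, a, b) == blend_via_memory(m, a, b) for m in range(1 << LANES)
)
```

If one of the operands already lives in memory, the store disappears and only the masked load remains, which is the case where this trick looks most attractive.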
Old 2017-12-28, 03:52   #17
Prime95

More info:

http://users.atw.hu/instlatx64/AVX51...n_v102_PUB.ods

I wonder if "VMOVAPD zreg {kreg}, zreg" uses any ports.
Old 2017-12-28, 04:26   #18
Prime95

And reading between the lines of:

https://software.intel.com/sites/def...ut-latency.pdf

I think all ops that do masking (including vmovapd) use ports 0 and 5, so they will compete with FMA. The exception is load-from-memory with masking, which will not use ports 0 and 5.

There is plenty of room for Intel to do a better implementation of AVX-512 in future processors. Issuing two FMA instructions uses 3 ports (0, 1, and 5). This makes it impossible to do any shuffling / blending / masking / etc. without interfering with FMAs. Intel should put the FMA units on ports 0 & 1, with shuffling / masking / etc. done on port 5 without any extra silicon. This would mirror how AVX2 was implemented on Haswell/Skylake/KabyLake.
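A toy throughput model of that complaint, as a hedged sketch. It assumes one 512-bit op per port per cycle, ignores latency, front-end limits and memory ports, and assumes shuffle-type ops can only issue on port 5. Both function names are hypothetical.

```python
from math import ceil

def cycles_skylake_x(fmas, shuffles):
    """512-bit mode as described in the thread: two vector ports
    (fused p0+p1, and p5). FMAs may use either port; shuffles only p5
    (an assumption of this toy model)."""
    return max(ceil((fmas + shuffles) / 2), shuffles)

def cycles_proposed(fmas, shuffles):
    """Suggested layout: FMAs on two ports, shuffles on a separate third port."""
    return max(ceil(fmas / 2), shuffles)

# Example: a loop body with 16 FMAs and 4 shuffle/blend ops.
current = cycles_skylake_x(16, 4)   # 10 cycles: every shuffle steals an FMA slot
wished = cycles_proposed(16, 4)     # 8 cycles: shuffles are hidden behind the FMAs
```

In this model the shuffles are free under the proposed layout as long as there are no more than half as many shuffles as FMAs.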
Old 2018-02-20, 20:32   #19
Mysticial
 

This came up in a conversation with a fellow Stackoverflower.

The port 0+1 fusion in 512-bit mode might be related to the register file.

Intel discloses that Skylake client has "168 FP registers". "FP" has historically implied "SIMD". So let's assume that Skylake has 168 SIMD registers. But what's not disclosed is how large they are.

Skylake server with AVX512 has the same core as Skylake client, but with the extra L2 and FMA unit attached to the side.

If the 168 SIMD registers were 512-bit wide, that would imply a lot of dead silicon in the Skylake client chips. I find this hard to believe for a first generation implementation of AVX512 that hardly anybody uses.

Personally, I've seen certain FP-only sequences with long dependency chains where the code is identical except for being 256-bit vs. 512-bit. The 512-bit version can sometimes run significantly slower (30%) for unknown reasons (same clock frequency). And I find it hard to believe that this can be completely explained by the longer port5 FMA latency.

So here's a new theory:
  • Both Skylake Client and Skylake Server have a register file of 168 x 256-bit entries arranged into two columns of 84 rows each.
  • In 256-bit mode (defined as having no 512-bit ops in the reservation station): All 168 registers can be addressed independently by the hardware and are independently accessible by both ports 0 and 1.
  • In 512-bit mode (on Skylake Server): Registers in the two columns of the same row can fuse into a 512-bit register. Port 0 is hardwired to one column. Port 1 is hardwired to the other column. This means there are only 84 physical 512-bit registers.

It's likely that ports 0 and 1 don't even have a 512-bit data-width to begin with.

Having only half the # of renamed registers in 512-bit mode could certainly help to explain why 512-bit code with long dependency chains can run slower since it's more likely to run out of them.
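The arithmetic behind that theory can be sketched in a few lines of Python. The 168-entry figure is the one quoted above; the rest are assumptions labeled in the comments, in particular that each architectural register pins one physical entry and that 256-bit code uses only ymm0-15 while 512-bit code uses zmm0-31.

```python
PRF_ENTRIES_256 = 168   # "168 FP registers", taken here as 256-bit entries

def rename_headroom(physical, architectural):
    """Physical registers left over for in-flight results once the
    architectural state is accounted for (simplifying assumption)."""
    return physical - architectural

# Under the theory, two 256-bit entries in the same row fuse into one 512-bit register.
regs_512 = PRF_ENTRIES_256 // 2                        # 84

headroom_256 = rename_headroom(PRF_ENTRIES_256, 16)    # assumes ymm0-15 only: 152
headroom_512 = rename_headroom(regs_512, 32)           # assumes zmm0-31: 52
```

If the theory and these assumptions hold, 512-bit code would have roughly a third of the renaming headroom of 256-bit code, not just half, which would fit the observation that long dependency chains suffer most.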

Last fiddled with by Mysticial on 2018-02-20 at 20:38
Old 2020-01-18, 04:01   #20
retina

And also watch your transitions into and out of using AVX512 instructions.

https://travisdowns.github.io/blog/2.../avxfreq1.html
Quote:
Summary

For the benefit of anyone who just skipped to the bottom, or whose eyes glazed over at some point, here’s a summary of the key findings:
  • After a period of about 680 μs not using the AVX upper bits (255:128) or AVX-512 upper bits (511:256) the processor enters a mode where using those bits again requires at least a voltage transition, and sometimes a frequency transition.
  • The processor continues executing instructions during a voltage transition, but at a greatly reduced speed: 1/4th the usual instruction dispatch rate. However, this throttling is fine-grained: it only applies when wide instructions are in flight (details).
  • Voltage transitions end when the voltage reaches the desired level, this depends on the magnitude of the transition but 8 to 20 μs is common on the hardware I tested.
  • In some cases a frequency transition is also required, e.g., because the involved instruction requires a higher power license. These transitions seem to first incur a throttling period similar to a voltage-only transition, and then a halted period of 8 to 10 μs while the frequency changes.
  • A key motivator for this post was to give concrete, qualitative guidance on how to write code that is as fast as possible given this behavior.
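As a rough back-of-the-envelope aid, the figures in the quoted summary can be turned into a worst-case cost per transition. This Python sketch assumes wide instructions are in flight for the whole throttled period (so the 1/4 dispatch rate applies throughout) and that no work at all is done while halted. The function names are hypothetical and the result is an estimate, not a measurement.

```python
def throttled_loss_us(transition_us, dispatch_fraction=0.25):
    """Full-speed time effectively lost while dispatch is throttled."""
    return transition_us * (1.0 - dispatch_fraction)

def transition_overhead_us(voltage_us, halt_us=0.0, dispatch_fraction=0.25):
    """Worst-case overhead of one transition: throttled period plus any halt."""
    return throttled_loss_us(voltage_us, dispatch_fraction) + halt_us

# A 20 us voltage-only transition costs about 15 us of full-speed work.
voltage_only = transition_overhead_us(20.0)

# Adding a 10 us halt for a frequency change brings that to about 25 us.
with_frequency = transition_overhead_us(20.0, halt_us=10.0)
```

Compared with the roughly 680 μs idle window before the upper bits power down, this suggests short, widely spaced bursts of wide-vector work pay the penalty every time, while sustained use pays it once.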
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
