
2016-02-18, 07:46   #1
ewmayer
∂²ω=0

Sep 2002
República de California

2×13×449 Posts

Official AVX-512 programming thread

With AVX-512-capable CPUs expected to hit the market this year, it is time to open a thread dedicated to the various implementation issues. Note the latest AVX-512 Architecture Instruction Set Extensions Programming Reference is available from Intel here (topmost pdf on that page).

Besides the obvious doubling of register width to 64 from AVX's 32 bytes and doubling of the vector register count from 16 to 32, I see the following types of new instructions as being especially useful for GIMPS:

[1] Load-with-broadcast (the various VBROADCAST[...] instructions): This will be very handy for loads of various FFT-related consts (roots of unity) and DWT-weights-related data which consist of the same double-datum repeated across the vector register. AVX already has some similar functionality but AVX-512 significantly extends it, including versions which load-with-broadcast integer data from memory or GPRs.

[2] Gather-load: fill a vector register with smaller 32/64-bit pieces loaded from non-contiguous memory locations, e.g. VGATHERQPD, whose summary is (this is from the PDF but with a few edits-for-succinctness and corrections of my own - e.g. 'faulting-point' data in the original is clearly a typo):
Code:
Using signed qword indices, gather float64 vector into float64 vector zmm1 using OPMASK register k1 as completion mask:

VGATHERQPD zmm1 {k1}, vm64z    [{} arg can be any of k1-7 OPMASK regs, or k0 (or omit {k*} arg) for 'load all']

Description
A set of 8 double-precision floating-point memory locations pointed to by base address BASE_ADDR and index vector V_INDEX with scale SCALE are gathered. The result is written into a vector register. The elements are specified via the VSIB (i.e., the index register vm64z is a vector register, holding packed indices). Elements will only be loaded if their corresponding mask bit in the OPMASK register is one. If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an exception.

With regard to the latter class of instructions, however, I can't find a description of the syntax for the hybrid 'vm64[xyz]' datum in the PDF. It looks like the index vector is a vector register ([xyz]mm depending on the precise instruction that is used), but whether BASE_ADDR is stored in the low 32/64 bits of said vector register is not made clear (so far as I can tell). It seems SCALE can take values {1,2,4,8} depending on the data type & precise instruction, but again, I see no actual clarifying examples in the PDF reference.
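For readers who want the masking rules above in executable form, here is a minimal scalar sketch in C of the data movement described for VGATHERQPD. The function name `gather_qpd_model` and its parameters are hypothetical and chosen for illustration; the model assumes a scale of 8 (indices count doubles) and does not attempt to model faults or the partial-completion behaviour on exceptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (illustrative only) of a masked 8-lane gather of doubles.
 * dst  : models the destination zmm register (8 doubles)
 * k    : models the opmask register, one bit per lane
 * base : models BASE_ADDR, here typed as a pointer to double
 * idx  : models the index vector, signed 64-bit element indices */
static void gather_qpd_model(double dst[8], uint8_t *k,
                             const double *base, const int64_t idx[8])
{
    assert(k != NULL);
    for (int i = 0; i < 8; i++) {
        /* Lane i is loaded only if its mask bit is set;
         * otherwise dst[i] keeps its previous value. */
        if ((*k >> i) & 1u) {
            dst[i] = base[idx[i]];
        }
    }
    /* After a fault-free gather the whole mask reads as zero. */
    *k = 0;
}
```

Called with a mask of 0x0F, only lanes 0 to 3 are overwritten, lanes 4 to 7 keep their old contents, and the mask comes back as zero.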
2016-02-18, 08:51   #2
ldesnogu

Jan 2008
France

2×5²×11 Posts

Quote:
 Originally Posted by ewmayer [2] Gather-load: fill a vector register with smaller 32/64-bit pieces loaded from non-contiguous memory locations, e.g. VGATHERQPD, whose summary is (this is from the PDF but with a few edits-for-succinctness and correction of my own - e.g. 'faulting-point' data in the original is clearly a typo): Code: Using signed qword indices, gather float64 vector into float64 vector zmm1 using OPMASK register k1 as completion mask: VGATHERQPD zmm1 {k1}, vm64z [{} arg can be any of k1-7 OPMASK regs, or k0 (or omit {k*} arg) for 'load all'] Description A set of 8 double-precision floating-point memory locations pointed by base address BASE_ADDR and index vector V_INDEX with scale SCALE are gathered. The result is written into a vector register. The elements are specified via the VSIB (i.e., the index register vm64z is a vector register, holding packed indices). Elements will only be loaded if their corresponding mask bit in the OPMASK register is one. If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an exception. With regard to the latter class of instructions, however, I can't find a description of the syntax for the hybrid 'vm64[xyz]' datum in the PDF. It looks like the index vector is a vector register ([xyz]mm depending on the precise instruction that is used) but whether BASE_ADDR is stored in the low 32/64-bits of said vector register is not made clear (so far as I can tell). It seems SCALE can take values {1,2,4,8} depending on the data type & precise instruction, but again, I see no actual clarifying examples in the PDF reference.

According to test cases found in nasm, the syntax looks like this:
Code:
vgatherqpd zmm30{k1}, [r14+zmm31*8+0x7b]
I'd say that zmm31 is made of 8x64-bit indices (confirmed by Intel documentation), that r14 contains the BASE_ADDR, and that this will generate 8 addresses: r14 + 8*zmm31[i] + 0x7b, for i = 0..7.
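The address arithmetic in that reading can be spelled out with a small C sketch. It only computes the eight effective addresses for an operand of the form [base + index*scale + disp] and performs no loads; `gather_addresses` and its parameter names are hypothetical, and the wrap-around on unsigned 64-bit arithmetic is an assumption standing in for 64-bit address generation.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper: effective addresses for a VSIB-style operand
 * [base + idx[i]*scale + disp], one per lane. No memory is touched. */
static void gather_addresses(uint64_t base, const int64_t idx[8],
                             unsigned scale, int64_t disp, uint64_t out[8])
{
    /* The thread notes SCALE is restricted to {1,2,4,8}. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (int i = 0; i < 8; i++) {
        /* Signed indices: the casts make negative values wrap modulo 2^64. */
        out[i] = base + (uint64_t)idx[i] * scale + (uint64_t)disp;
    }
}
```

With base = 0x1000, scale = 8 and disp = 0x7b (the values in the nasm test case, with r14 assumed to hold 0x1000), index 0 gives 0x107b, index 1 gives 0x1083, and index -1 gives 0x1073.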

Last fiddled with by ldesnogu on 2016-02-18 at 08:51

2016-02-18, 15:45   #3
Xyzzy

Aug 2002

3·11²·23 Posts

We have attached the PDF for your convenience.

Attached Files
 319433-024.pdf (4.83 MB, 1952 views)

2016-02-19, 04:21   #4
ewmayer
∂²ω=0

Sep 2002
República de California

11674₁₀ Posts

Quote:
 Originally Posted by ldesnogu According to test cases found in nasm, the syntax looks like this: Code: vgatherqpd zmm30{k1}, [r14+zmm31*8+0x7b] I'd say that zmm31 is made of 8x64-bit indices (confirmed by Intel documentation), that r14 contains the BASE_ADDR, and that this will generate 8 addresses : r14 + 8*zmm31[i] + 0x7b.
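I first verified that these need to go to the left of the destination register in GCC syntax, and that the k-register takes no prepended % (neither 1 nor 2 of these) as are needed for GPR and vector-data registers - apparently the special status of the k-regs obviates the need for a %. But I still get error/warning messages about the resulting instruction. Let's compare - Without a mask argument, the first of the following pair of test-instructions compiles/assembles OK:
Code: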
"vaddpd	%%zmm3,%%zmm2,%%zmm1	\n\t"\
"vgatherqpd 0x7b(%%r14,%%zmm31,8),%%zmm30	\n\t"\
but the 2nd gives

Assembler messages:
test_file.c:14: Error: default mask isn't allowed for `vgatherqpd'

[i.e. ADDPD accepts a default-mask form, but VGATHER* does not]. Then I tried adding a mask-arg to the destination (rightmost in GCC syntax) operand of both instructions:
Code:
"vaddpd	%%zmm3,%%zmm2,{k1}%%zmm1	\n\t"\
"vgatherqpd 0x7b(%%r14,%%zmm31,8),{k1}%%zmm30	\n\t"\
That gives

Assembler messages:
test_file.c:13: Error: operand size mismatch for `vaddpd'
test_file.c:14: Error: too many memory references for `vgatherqpd'

2016-02-19, 05:41   #5
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

14236₈ Posts

Can you set GCC to use the more sane Intel syntax, instead of that AT&T ugliness?
2016-02-19, 08:06   #6
ewmayer
∂²ω=0

Sep 2002
República de California

2×13×449 Posts

Quote:
 Originally Posted by retina Can you set GCC to use the more sane Intel syntax, instead of that AT&T ugliness?
Eye of the beholder, my friend - yes, there are several ways (which differ in the key aspect of how they interact with memory-operand syntax) to use Intel syntax with GCC, described here.

Having used both syntaxes in the past, I find that it's more a function of whichever one uses most often. I agree that the %% of GCC extended inline asm are hard on the eyes, but I only add those after sketching out the basic assembly code, when I'm ready to test it out. And having learned reading and writing in Western-style left-to-right form, I find the AT&T [src,src,dest] operand order to be much more natural to work with: "combine two source operands and output result in destination operand". I mean, consider the kind of descriptive awkwardness that litters Intel's instruction references due to their [dest,src,src] syntax:

Performs a SIMD [blah] of the double-precision floating-point values in the first source operand (the second operand) by the floating-point values in the second source operand (the third operand). Results are written to the destination operand (the first operand).

Back to the case at hand - in the inline-asm case one saves little 'ugliness' by using Intel syntax; here is the body of a basic AVX-syntax test macro which inlines the same instruction using each of the 2 syntaxes in turn:
Code:
".intel_syntax \n\t"\
".att_syntax \n\t"\
"vaddpd	(%%rax,%%rbx,8),%%ymm2,%%ymm1	\n\t"\

2016-02-19, 08:41   #7
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

2×23×137 Posts

Can you set GCC to not require the %% ugliness?
Can you set GCC to not require the \n\t silliness?
Can you set GCC to not require the surrounding "'s nonsensiness?

Last fiddled with by retina on 2016-02-19 at 08:42
2016-02-19, 13:52   #8
jasonp
Tribal Bullet

Oct 2004

3×1,181 Posts

No, all that stuff is a consequence of the way inline asm works in gcc; you are specifying a string that is blindly inserted into the assembler code the compiler produces, so it needs C string formatting. The two %'s are for when you need to specify a register explicitly, since a single % is assumed to mean the argument number out of the argument list below the asm string.

You can not like it, but if the choice is between this and Microsoft's compiler, which doesn't allow inline asm at all for 64-bit code, then the choice is easy.

Last fiddled with by jasonp on 2016-02-19 at 14:15
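To make the single-% versus double-% distinction concrete, here is a minimal sketch using GCC extended inline asm in the default AT&T syntax. The function names `add_u64` and `add_u64_via_rax` are hypothetical; the asm paths assume an x86-64 target with GCC or Clang, and a plain-C fallback is used elsewhere so the example still builds.

```c
#include <assert.h>
#include <stdint.h>

/* %0 and %1 refer to operands from the lists after the asm string;
 * the compiler picks the registers. */
static uint64_t add_u64(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ ("addq %1, %0    \n\t"
             : "+r" (a)          /* %0: read-write, any GPR */
             : "r" (b)           /* %1: input, any GPR      */
             : "cc");
    return a;
#else
    return a + b;                /* fallback for other targets */
#endif
}

/* %%rax names a specific register; it must be declared as clobbered
 * so the compiler keeps its own values out of it. */
static uint64_t add_u64_via_rax(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t out;
    __asm__ ("movq %1, %%rax   \n\t"
             "addq %2, %%rax   \n\t"
             "movq %%rax, %0   \n\t"
             : "=r" (out)
             : "r" (a), "r" (b)
             : "rax", "cc");
    assert(out == a + b);        /* sanity check against plain C */
    return out;
#else
    return a + b;                /* fallback for other targets */
#endif
}
```

Both functions return the same sum; the only difference is whether the compiler or the programmer chooses the register, which is what the %-versus-%% notation encodes.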
2016-02-19, 14:17   #9
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

2·23·137 Posts

Quote:
 Originally Posted by jasonp No, all that stuff is a consequence of the way inline asm works in gcc; you are specifying a string that is blindly inserted into the assembler code the compiler produces so it needs C string formatting.
It's almost as though the writers of GCC don't want anyone to use inline ASM. Are they actively trying to stop people using it out of some idealistic dream of C being the one and only?
Quote:
 Originally Posted by jasonp You can not like it, but if the choice is between this and Microsoft's compiler, which doesn't allow inline asm at all for 64-bit code, then the choice is easy.
Write all your ASM in "proper" assembly code as a separate file and link it at compile time to the other C stuff later. Or write a simple parser that takes the C code source and upon detection of inline ASM inserts all the ugliness automatically. BTW: Fossil (the SCM) does this for inline webpage text. Makes it much easier to insert inline arbitrary text (without all the ugliness and red-tape).

2016-02-19, 15:47   #10
ldesnogu

Jan 2008
France

2·5²·11 Posts

Quote:
 Originally Posted by ewmayer I first verified that these need to go to the left of the destination register in GCC syntax, and that the k-register takes no prepended % (neither 1 nor 2 of these) as are needed for GPR and vector-data registers - apparently the special status of the k-regs obviates the need for a %. But I still get error/warning messages about the resulting instruction. Let's compare - Without a mask argument, the first of the following pair of test-instructions compiles/assembles OK: Code: "vaddpd %%zmm3,%%zmm2,%%zmm1 \n\t"\ "vgatherqpd 0x7b(%%r14,%%zmm31,8),%%zmm30 \n\t"\ but the 2nd gives Assembler messages: test_file.c:14: Error: default mask isn't allowed for `vgatherqpd' [i.e. ADDPD accepts a default-mask form, but VGATHER* does not]. Then I tried adding a mask-arg to the destination (rightmost in GCC syntax) operand of both instructions: Code: "vaddpd %%zmm3,%%zmm2,{k1}%%zmm1 \n\t"\ "vgatherqpd 0x7b(%%r14,%%zmm31,8),{k1}%%zmm30 \n\t"\ That gives Assembler messages: test_file.c:13: Error: operand size mismatch for `vaddpd' test_file.c:14: Error: too many memory references for `vgatherqpd'
Are you sure the {k1} specifier should go before the register?

gas test files look like this:
Code:
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      123(%ebp,%zmm7,8), %zmm6{%k1}    # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      123(%ebp,%zmm7,8), %zmm6{%k1}    # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      256(%eax,%zmm7), %zmm6{%k1}      # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      1024(%ecx,%zmm7,4), %zmm6{%k1}   # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      zmm6{k1}, ZMMWORD PTR [ebp+zmm7*8-123]   # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      zmm6{k1}, ZMMWORD PTR [ebp+zmm7*8-123]   # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      zmm6{k1}, ZMMWORD PTR [eax+zmm7+256]     # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      zmm6{k1}, ZMMWORD PTR [ecx+zmm7*4+1024]  # AVX512F

2016-02-19, 21:27   #11
ewmayer
∂²ω=0

Sep 2002
República de California

2·13·449 Posts

Quote:
 Originally Posted by ldesnogu Are you sure the {k1} specifier should go before the register? gas test files look like this: Code: gas/testsuite/gas/i386/avx512f.s: vgatherqpd 123(%ebp,%zmm7,8), %zmm6{%k1} # AVX512F
I based that on the error messages I got from GCC in my various iterative syntax experiments - but it appears those may have led me down a blind alley w.r.to the k-register masking syntax. Thanks! Will try as soon as I am home and have access to my Broadwell NUC which has the GCC 4.9 install. Such upfront new-instruction-syntax issues seem to be par for the course - when I first started to upgrade to AVX a couple years ago, the first issue I had to work through was that the then-current version of GCC refused to accept ymm registers in the clobber list - workaround (which I still use, since specifying xmm* as clobbered implies the whole corresponding ymm or zmm register) was to use SSE2-style xmm-clobbers but ymm in the actual code.

Re. retina's separate-.s-file idea: sure that will work, but has issues of its own, like having to respect calling conventions and a 2-step compile. I like being able to focus on the code that actually 'does stuff' and being able to one-step-compile just as with pure C code. To each his own. Further, since I am for the most part translating many 1000s of lines of existing AVX inline-asm - most of which is rote, with the exception of key new instructions like the above gather-load - it would be far more work to now switch to a separate-.s-file paradigm.
