2016-01-17, 00:03 | #1 |
"Ben"
Feb 2007
41·83 Posts |
AVX2 weirdness
I've run across something weird when trying to optimize a loop in AVX2. I am doing trial division on 16 16-bit words, using precomputed multiplicative inverses rather than dividing. This is the code snippet that computes the first 8 (currently using VEX-enabled SSE2 for Haswell):
Code:
"vmovdqa (%1), %%ymm3 \n\t" /* move in primes */ \
"vpsubw %%ymm1, %%ymm0, %%ymm4 \n\t" /* BLOCKSIZE - block_loc */ \
"vpaddw (%2), %%ymm4, %%ymm4 \n\t" /* apply corrections */ \
"vmovdqa (%4), %%ymm6 \n\t" /* move in root1s */ \
"vpmulhuw (%3), %%xmm4, %%xmm4 \n\t" /* (unsigned) multiply by inverses */ \
"salq $16,%%r9 \n\t" /* move to top half of 32-bit word */ \
"vmovdqa (%5), %%ymm2 \n\t" /* move in root2s */ \
"vpsrlw $" xtra_bits ", %%ymm4, %%ymm4 \n\t" /* to get to total shift of 24/26/28 bits */ \
"vpaddw %%ymm3, %%ymm1, %%ymm7 \n\t" /* add primes and block_loc */ \
"vpmullw %%xmm3, %%xmm4, %%xmm4 \n\t" /* (signed) multiply by primes */ \
"vpsubw %%ymm0, %%ymm7, %%ymm7 \n\t" /* subtract blocksize */ \
"vpaddw %%ymm7, %%ymm4, %%ymm4 \n\t" /* add in block_loc + primes - blocksize */ \
"vpcmpeqw %%ymm4, %%ymm6, %%ymm6 \n\t" /* compare to root1s */ \
"vpcmpeqw %%ymm4, %%ymm2, %%ymm2 \n\t" /* compare to root2s */ \
"vpor %%ymm6, %%ymm2, %%ymm2 \n\t" /* combine compares */ \
"vpmovmskb %%ymm2, %%r8 \n\t" /* export to result */ \

I've done profiling with Intel VTune Analyzer, and the slowdown occurs in completely unrelated routines. This routine is actually faster (as it should be, since it takes half the number of instructions to do the same thing). But as soon as I use ymm registers with either of the two instructions that still use xmm registers above (vpmulhuw and vpmullw), wham, instant slowdown.

Anyone heard of anything like this, or can explain it? It is very repeatable. And no, this is not an SSE2/AVX2 transition penalty. I'm using VEX-enabled instructions everywhere, and the front and back of this routine are using VZEROUPPER.

Last fiddled with by bsquared on 2016-01-17 at 00:14 Reason: table formatting
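For anyone following along, here is a rough scalar C sketch of what a single 16-bit lane of the vpmulhuw / vpsrlw pair is doing when it replaces a division with a multiply by a precomputed inverse. This is a sketch under stated assumptions, not the code from the routine above: the helper names (`mulhi_u16`, `make_inverse`, `div_by_inverse`, `count_mismatches`) are made up for illustration, and the inverse formula floor(2^(16+xtra_bits)/p) + 1 is just one common choice. It may not match how the inverses and corrections in the snippet are actually generated, and it is only exact for a limited range of inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of vpmulhuw: high 16 bits of an unsigned 16x16 product. */
static uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}

/* Hypothetical helper: inverse of p for a total shift of (16 + xtra_bits) bits.
   Assumes p > 2^xtra_bits so that the result fits in 16 bits. */
static uint16_t make_inverse(uint16_t p, unsigned xtra_bits) {
    assert(xtra_bits < 16 && p > (1u << xtra_bits));
    return (uint16_t)((((uint32_t)1 << (16 + xtra_bits)) / p) + 1);
}

/* Quotient estimate: mulhi (shift by 16), then a further shift by xtra_bits,
   mirroring the vpmulhuw + vpsrlw sequence. */
static uint16_t div_by_inverse(uint16_t x, uint16_t inv, unsigned xtra_bits) {
    return (uint16_t)(mulhi_u16(x, inv) >> xtra_bits);
}

/* Brute-force check: count x in [0, limit] where the estimate differs from x / p. */
static uint32_t count_mismatches(uint16_t p, unsigned xtra_bits, uint16_t limit) {
    uint16_t inv = make_inverse(p, xtra_bits);
    uint32_t bad = 0;
    for (uint32_t x = 0; x <= limit; x++) {
        if (div_by_inverse((uint16_t)x, inv, xtra_bits) != (uint16_t)(x / p)) {
            bad++;
        }
    }
    return bad;
}
```

With this particular inverse, the estimate equals x / p whenever x * e < 2^(16+xtra_bits), where e = p - (2^(16+xtra_bits) mod p) is at most p. For example, with xtra_bits = 8 and 256 < p < 512 that covers every x up to 32768, which `count_mismatches(p, 8, 32768)` can confirm by brute force. Outside that range the estimate can be off by one, which is presumably the kind of thing a correction term has to account for.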
2016-01-17, 17:26 | #2 |
"Ben"
Feb 2007
6513_{8} Posts |
Here is a (fugly) workaround:
Code:
"vmovdqa (%1), %%ymm3 \n\t" /* move in primes */ \
"vpsubw %%ymm1, %%ymm0, %%ymm4 \n\t" /* BLOCKSIZE - block_loc */ \
"vpaddw (%2), %%ymm4, %%ymm4 \n\t" /* apply corrections */ \
"vmovdqa (%4), %%ymm6 \n\t" /* move in root1s */ \
"vextracti128 $1, %%ymm4, %%xmm8 \n\t" /* high 8 words of the dividend; xmm4 alone only holds the low 8 */ \
"vpmulhuw (%3), %%xmm4, %%xmm5 \n\t" /* low 8 words, (unsigned) multiply by inverses */ \
"vpmulhuw 16(%3), %%xmm8, %%xmm4 \n\t" /* high 8 words, (unsigned) multiply by inverses */ \
"vinserti128 $1, %%xmm4, %%ymm5, %%ymm4 \n\t" /* combine low and high parts */ \
"vmovdqa (%5), %%ymm2 \n\t" /* move in root2s */ \
"vpsrlw $" xtra_bits ", %%ymm4, %%ymm4 \n\t" /* to get to total shift of 24/26/28 bits */ \
"vpaddw %%ymm3, %%ymm1, %%ymm7 \n\t" /* add primes and block_loc */ \
"vextracti128 $1, %%ymm4, %%xmm5 \n\t" /* put high part of op1 into xmm5 */ \
"vextracti128 $1, %%ymm3, %%xmm8 \n\t" /* put high part of op2 into xmm8 */ \
"vpmullw %%xmm3, %%xmm4, %%xmm4 \n\t" /* low 8 words, (signed) multiply by primes */ \
"vpmullw %%xmm8, %%xmm5, %%xmm5 \n\t" /* high 8 words, (signed) multiply by primes */ \
"vinserti128 $1, %%xmm5, %%ymm4, %%ymm4 \n\t" /* combine low and high parts of result */ \
"vpsubw %%ymm0, %%ymm7, %%ymm7 \n\t" /* subtract blocksize */ \
"vpaddw %%ymm7, %%ymm4, %%ymm4 \n\t" /* add in block_loc + primes - blocksize */ \
"vpcmpeqw %%ymm4, %%ymm6, %%ymm6 \n\t" /* compare to root1s */ \
"vpcmpeqw %%ymm4, %%ymm2, %%ymm2 \n\t" /* compare to root2s */ \
"vpor %%ymm6, %%ymm2, %%ymm2 \n\t" /* combine compares */ \
"vpmovmskb %%ymm2, %%r8 \n\t" /* export to result */ \