mersenneforum.org  

2016-01-17, 00:03   #1
bsquared ("Ben", Feb 2007)

AVX2 weirdness

I've run across something weird when trying to optimize a loop in AVX2. I am doing trial division on 16 16-bit words, using precomputed multiplicative inverses rather than dividing. This is the code snippet that computes the first 8 (currently using VEX-encoded SSE2 for Haswell):

Code:
                        "vmovdqa    (%1), %%ymm3 \n\t"		            /* move in primes */							\
			"vpsubw	    %%ymm1, %%ymm0, %%ymm4 \n\t"	    /* BLOCKSIZE - block_loc */						\
			"vpaddw	    (%2), %%ymm4, %%ymm4 \n\t"		    /* apply corrections */							\
			"vmovdqa    (%4), %%ymm6 \n\t"		            /* move in root1s */							\
			"vpmulhuw   (%3), %%xmm4, %%xmm4 \n\t"	        /* (unsigned) multiply by inverses */		\
                        "salq       $16,%%r9    \n\t"                   /* move to top half of 32-bit word */ \
			"vmovdqa    (%5), %%ymm2 \n\t"		            /* move in root2s */							\
			"vpsrlw	$" xtra_bits ", %%ymm4, %%ymm4 \n\t"		/* to get to total shift of 24/26/28 bits */			\
			"vpaddw	    %%ymm3, %%ymm1, %%ymm7 \n\t"	    /* add primes and block_loc */					\
			"vpmullw    %%xmm3, %%xmm4, %%xmm4 \n\t"	    /* (signed) multiply by primes */				\
			"vpsubw	    %%ymm0, %%ymm7, %%ymm7 \n\t"	    /* substract blocksize */						\
			"vpaddw	    %%ymm7, %%ymm4, %%ymm4 \n\t"	    /* add in block_loc + primes - blocksize */		\
			"vpcmpeqw   %%ymm4, %%ymm6, %%ymm6 \n\t"	    /* compare to root1s */						\
			"vpcmpeqw   %%ymm4, %%ymm2, %%ymm2 \n\t"	    /* compare to root2s */						\
			"vpor	    %%ymm6, %%ymm2, %%ymm2 \n\t"	    /* combine compares */							\
			"vpmovmskb  %%ymm2, %%r8 \n\t"		            /* export to result */							\
You'll notice that many of these instructions are using ymm registers. All but two, in fact. For some reason, if I use ymm registers with vpmulhuw or vpmullw, the entire program (not just this tiny part of it) slows down by about 5%.

I've done profiling with Intel VTune Analyzer, and the slowdown occurs in completely unrelated routines. This routine is actually faster (as it should be, since it takes half the number of instructions to do the same thing). But as soon as I use ymm registers with either of those instructions, wham, instant slowdown.

Has anyone heard of anything like this, or can explain it? It is very repeatable. And no, this is not an SSE/AVX transition penalty: I'm using VEX-encoded instructions everywhere, and the front and back of this routine use VZEROUPPER.
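For anyone not familiar with the trick being vectorized here, a scalar sketch of the remainder-by-inverse idea (purely illustrative, not the actual code; the prime, the ceiling-style inverse, and the 24-bit shift are my own example choices):

Code:
#include <stdint.h>
#include <stdio.h>

/* Illustrative scalar version of the multiply-by-inverse idea the asm
 * above vectorizes (not the real code; prime and shift are examples).
 * For a 16-bit prime p > 256, inv = ceil(2^24 / p) fits in 16 bits and
 * q = (x * inv) >> 24 reproduces x / p over the range checked below,
 * so the remainder comes out without any divide instruction.  The
 * 24/26/28-bit shifts in the asm comments correspond to prime classes. */
int main(void)
{
    uint16_t p   = 257;                                   /* example prime       */
    uint16_t inv = (uint16_t)(((1u << 24) + p - 1) / p);  /* precomputed inverse */

    for (uint32_t x = 0; x < 65536; x++) {
        uint32_t q   = (x * (uint32_t)inv) >> 24;         /* approximate x / p   */
        uint32_t rem = x - q * p;                         /* x - q*p             */
        if (rem != x % p)
            printf("inverse fails at x = %u\n", x);       /* never fires here    */
    }
    return 0;
}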

2016-01-17, 17:26   #2
bsquared ("Ben", Feb 2007)

Here is a (fugly) workaround:

Code:
			"vmovdqa (%1), %%ymm3 \n\t"		/* move in primes */							\
			"vpsubw	%%ymm1, %%ymm0, %%ymm4 \n\t"	/* BLOCKSIZE - block_loc */						\
			"vpaddw	(%2), %%ymm4, %%ymm4 \n\t"		/* apply corrections */							\
			"vmovdqa (%4), %%ymm6 \n\t"		/* move in root1s */							\
			"vpmulhuw	(%3), %%xmm4, %%xmm5 \n\t"	/* low 8 words, (unsigned) multiply by inverses */		\
                        "vpmulhuw	16(%3), %%xmm4, %%xmm4 \n\t"	/* high 8 words, (unsigned) multiply by inverses */		\
                        "vinserti128   $1, %%xmm4, %%ymm5, %%ymm4 \n\t" /* combine low and high parts */ \
			"vmovdqa (%5), %%ymm2 \n\t"		/* move in root2s */							\
			"vpsrlw	$" xtra_bits ", %%ymm4, %%ymm4 \n\t"		/* to get to total shift of 24/26/28 bits */			\
			"vpaddw	%%ymm3, %%ymm1, %%ymm7 \n\t"	/* add primes and block_loc */					\
                        "vextracti128  $1, %%ymm4, %%xmm5 \n\t" /* put high part of op1 into xmm5 */ \
                        "vextracti128  $1, %%ymm3, %%xmm8 \n\t" /* put high part of op2 into xmm8 */ \
                        "vpmullw	%%xmm3, %%xmm4, %%xmm4 \n\t"	/* (signed) multiply by primes */				\
			"vpmullw	%%xmm8, %%xmm5, %%xmm5 \n\t"	/* (signed) multiply by primes */				\
                        "vinserti128   $1, %%xmm5, %%ymm4, %%ymm4 \n\t" /* combine low and high parts of result */ \
			"vpsubw	%%ymm0, %%ymm7, %%ymm7 \n\t"	/* substract blocksize */						\
			"vpaddw	%%ymm7, %%ymm4, %%ymm4 \n\t"	/* add in block_loc + primes - blocksize */		\
			"vpcmpeqw	%%ymm4, %%ymm6, %%ymm6 \n\t"	/* compare to root1s */						\
			"vpcmpeqw	%%ymm4, %%ymm2, %%ymm2 \n\t"	/* compare to root2s */						\
			"vpor	%%ymm6, %%ymm2, %%ymm2 \n\t"	/* combine compares */							\
			"vpmovmskb %%ymm2, %%r8 \n\t"		/* export to result */							\
As long as I don't do the multiplies in ymm regs, I don't get the weird performance penalty. Even with all of the extract/insert nonsense, the above is still slightly faster than doing everything with xmm.
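For readers following along in C rather than asm, here is the same split-multiply structure expressed with AVX2 intrinsics (a sketch of the idea only; the helper name is mine and the real code stays in inline assembly):

Code:
#include <immintrin.h>

/* Sketch of the workaround in intrinsics: keep vpmulhuw at 128-bit
 * width by multiplying the two halves separately, then stitch the
 * halves back into a ymm register with vinserti128. */
static inline __m256i mulhi_epu16_in_halves(__m256i x, __m256i inv)
{
    __m128i xlo = _mm256_castsi256_si128(x);        /* low 8 words      */
    __m128i xhi = _mm256_extracti128_si256(x, 1);   /* high 8 words     */
    __m128i ilo = _mm256_castsi256_si128(inv);
    __m128i ihi = _mm256_extracti128_si256(inv, 1);

    __m128i rlo = _mm_mulhi_epu16(xlo, ilo);        /* 128-bit vpmulhuw */
    __m128i rhi = _mm_mulhi_epu16(xhi, ihi);

    /* combine low and high parts of the result */
    return _mm256_inserti128_si256(_mm256_castsi128_si256(rlo), rhi, 1);
}
Whether a compiler keeps the multiplies at 128-bit width here is up to it, of course; the inline asm pins the encoding explicitly.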