mersenneforum.org 64x64 integer multiplication GCC

2020-07-23, 08:39   #12
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

1887₁₆ Posts

Quote:
 Originally Posted by xilman AFAICT essentially all other languages (assembly excepted of course) make it difficult to exploit all the instructions provided by any CISC architecture.
RISC also. No access to any flags. No way to get the high portion of a multiply. etc.

2020-07-23, 13:06   #13
jasonp
Tribal Bullet

Oct 2004

3×1,181 Posts

Modern assembler versions also let you switch to Intel syntax with an assembler directive. The extra boilerplate controls where the input operands come from, where outputs go, what is expected to be overwritten and clobbered, whether the whole block can be moved around other basic blocks in your code, etc. The actual instructions in the inline asm are don't-cares for the compiler: you can put 1000 instructions in there and it will paste them into the generated assembly, or paste nonsense that will fail to compile if you make a mistake. If you want a braindead alternative, Sun's compiler used to have an inline asm syntax that only allowed the text of one instruction, with no way to control any of the above. Good luck doing something nontrivial with that facility.

Last fiddled with by jasonp on 2020-07-23 at 13:12
2020-07-23, 20:55   #14
pinhodecarlos

"Carlos Pinho"
Oct 2011
Milton Keynes, UK

11553₈ Posts

I hope this is a valid question: why not use Fortran?
2020-07-23, 21:23   #15
ewmayer
2ω=0

Sep 2002
República de California

2·3·29·67 Posts

Quote:
 Originally Posted by retina I am still amazed at how much C code it takes just to coax it into using a single native instruction. You have to throw out all the "portability" that C is supposed to provide. And you have to provide a lot of ugly boilerplate lines to coerce it into assembling just one line. And you have to write it in the worst possible syntax: AT&T.
That's why the way to go is to use the single- (or few-)instruction asm macros only at small scale, say for prototyping and proof-of-concept work. In my case, I'm a Luddite re. all those asm-macro i/o flag goodies gcc provides - way too many, syntax really nasty, hard to discern actual register usage, etc. Since those are mainly to help the compiler better inline one's asm with the surrounding C code, any performance gain therefrom is going to come mainly when the enclosed machine-instruction count is small - for bigger chunks of asm, it's not gonna matter.

If you're a performance-fetishist, you're gonna want larger chunks of asm anyway - so in my case, in order to better focus on the actual instruction flow and obviate the distracting boilerplate, I just use a standardized-format asm-macro template, which makes it easy to add as many memory operands as needed, uses actual hardware register names, etc. Here's an example - this is the source for one of the 8x8 matrix-of-doubles transpose macros I tested when avx-512 became available:
Code:
// [1a] Rowwise-load and in-register data shuffles. On KNL: 45 cycles per loop-exec:
nerr = 0; clock1 = getRealTime();
for(i = 0; i < imax; i++) {
__asm__ volatile (\
"movq		%[__data],%%rax		\n\t"\
/* Read in the 8 rows of our input matrix: */\
"vmovaps		0x000(%%rax),%%zmm0		\n\t"\
"vmovaps		0x040(%%rax),%%zmm1		\n\t"\
"vmovaps		0x080(%%rax),%%zmm2		\n\t"\
"vmovaps		0x0c0(%%rax),%%zmm3		\n\t"\
"vmovaps		0x100(%%rax),%%zmm4		\n\t"\
"vmovaps		0x140(%%rax),%%zmm5		\n\t"\
"vmovaps		0x180(%%rax),%%zmm6		\n\t"\
"vmovaps		0x1c0(%%rax),%%zmm7		\n\t"\
/* Transpose uses regs0-7 for data, reg8 for temp: */\
/* [1] First step is a quartet of [UNPCKLPD,UNPCKHPD] pairs to effect transposed 2x2 submatrices - */\
/* indices in comments at right are [row,col] pairs, i.e. octal version of linear array indices: */\
"vunpcklpd		%%zmm1,%%zmm0,%%zmm8	\n\t"/* zmm8 = 00 10 02 12 04 14 06 16 */\
"vunpckhpd		%%zmm1,%%zmm0,%%zmm1	\n\t"/* zmm1 = 01 11 03 13 05 15 07 17 */\
"vunpcklpd		%%zmm3,%%zmm2,%%zmm0	\n\t"/* zmm0 = 20 30 22 32 24 34 26 36 */\
"vunpckhpd		%%zmm3,%%zmm2,%%zmm3	\n\t"/* zmm3 = 21 31 23 33 25 35 27 37 */\
"vunpcklpd		%%zmm5,%%zmm4,%%zmm2	\n\t"/* zmm2 = 40 50 42 52 44 54 46 56 */\
"vunpckhpd		%%zmm5,%%zmm4,%%zmm5	\n\t"/* zmm5 = 41 51 43 53 45 55 47 57 */\
"vunpcklpd		%%zmm7,%%zmm6,%%zmm4	\n\t"/* zmm4 = 60 70 62 72 64 74 66 76 */\
"vunpckhpd		%%zmm7,%%zmm6,%%zmm7	\n\t"/* zmm7 = 61 71 63 73 65 75 67 77 */\
/**** Getting rid of reg-index-nicifying copies here means Outputs not in 0-7 but in 8,1,0,3,2,5,4,7, with 6 now free ****/\
/* [2] 1st layer of VSHUFF64x2, 2 outputs each with trailing index pairs [0,4],[1,5],[2,6],[3,7]. */\
/* Note the imm8 values expressed in terms of 2-bit index subfields again read right-to-left */\
/* (as for the SHUFPS imm8 values in the AVX 8x8 float code) are 221 = (3,1,3,1) and 136 = (2,0,2,0): */\
"vshuff64x2	$136,%%zmm0,%%zmm8,%%zmm6 \n\t"/* zmm6 = 00 10 04 14 20 30 24 34 */\
"vshuff64x2	$221,%%zmm0,%%zmm8,%%zmm0 \n\t"/* zmm0 = 02 12 06 16 22 32 26 36 */\
"vshuff64x2	$136,%%zmm3,%%zmm1,%%zmm8 \n\t"/* zmm8 = 01 11 05 15 21 31 25 35 */\
"vshuff64x2	$221,%%zmm3,%%zmm1,%%zmm3 \n\t"/* zmm3 = 03 13 07 17 23 33 27 37 */\
"vshuff64x2	$136,%%zmm4,%%zmm2,%%zmm1 \n\t"/* zmm1 = 40 50 44 54 60 70 64 74 */\
"vshuff64x2	$221,%%zmm4,%%zmm2,%%zmm4 \n\t"/* zmm4 = 42 52 46 56 62 72 66 76 */\
"vshuff64x2	$136,%%zmm7,%%zmm5,%%zmm2 \n\t"/* zmm2 = 41 51 45 55 61 71 65 75 */\
"vshuff64x2	$221,%%zmm7,%%zmm5,%%zmm7 \n\t"/* zmm7 = 43 53 47 57 63 73 67 77 */\
/**** Getting rid of reg-index-nicifying copies here means Outputs 8,1,2,5 -> 6,8,1,2, with 5 now free ***/\
/* [3] Last step in 2nd layer of VSHUFF64x2, now combining reg-pairs sharing same trailing index pairs. */\
/* Output register indices reflect trailing index of data contained therein: */\
"vshuff64x2	$136,%%zmm1,%%zmm6,%%zmm5 \n\t"/* zmm5 = 00 10 20 30 40 50 60 70 [row 0 of transpose-matrix] */\
"vshuff64x2	$221,%%zmm1,%%zmm6,%%zmm1 \n\t"/* zmm1 = 04 14 24 34 44 54 64 74 [row 4 of transpose-matrix] */\
"vshuff64x2	$136,%%zmm2,%%zmm8,%%zmm6 \n\t"/* zmm6 = 01 11 21 31 41 51 61 71 [row 1 of transpose-matrix] */\
"vshuff64x2	$221,%%zmm2,%%zmm8,%%zmm2 \n\t"/* zmm2 = 05 15 25 35 45 55 65 75 [row 5 of transpose-matrix] */\
"vshuff64x2	$136,%%zmm4,%%zmm0,%%zmm8 \n\t"/* zmm8 = 02 12 22 32 42 52 62 72 [row 2 of transpose-matrix] */\
"vshuff64x2	$221,%%zmm4,%%zmm0,%%zmm4 \n\t"/* zmm4 = 06 16 26 36 46 56 66 76 [row 6 of transpose-matrix] */\
"vshuff64x2	$136,%%zmm7,%%zmm3,%%zmm0 \n\t"/* zmm0 = 03 13 23 33 43 53 63 73 [row 3 of transpose-matrix] */\
"vshuff64x2	$221,%%zmm7,%%zmm3,%%zmm7 \n\t"/* zmm7 = 07 17 27 37 47 57 67 77 */\
/**** Getting rid of reg-index-nicifying copies here means Outputs 6,8,0,3 -> 5,6,8,0 with 3 now free ***/\
/* Write original columns back as rows: */\
"vmovaps		%%zmm5,0x000(%%rax)		\n\t"\
"vmovaps		%%zmm6,0x040(%%rax)		\n\t"\
"vmovaps		%%zmm8,0x080(%%rax)		\n\t"\
"vmovaps		%%zmm0,0x0c0(%%rax)		\n\t"\
"vmovaps		%%zmm1,0x100(%%rax)		\n\t"\
"vmovaps		%%zmm2,0x140(%%rax)		\n\t"\
"vmovaps		%%zmm4,0x180(%%rax)		\n\t"\
"vmovaps		%%zmm7,0x1c0(%%rax)		\n\t"\
:						// outputs: none
: [__data] "m" (data)	// All inputs from memory addresses here
: "cc","memory","rax","xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7","xmm8"	// Clobbered registers - use xmm form for compatibility with older versions of clang/gcc
);
}
clock2 = getRealTime();
tdiff = (double)(clock2 - clock1);
printf("Method [1a]: Time for %u 8x8 doubles-transposes using in-register shuffles =%s\n",imax, get_time_str(tdiff));
Just one mem-operand in this case; if I needed a second, it would be a simple matter of adding e.g. ,[__dat2] "m" (dat2). The actual register clobber list needs to be carefully done by the user once the code is in place - one of my peeves re. GCC is that it doesn't even do something as simple as parse-asm-and-extract-register-names-to-autogenerate-a-clobber-list ... the old 32-bit MSFT Visual Studio was actually great in that respect: it did a "smart parsing" of the user's inline asm and would also allow step-through debugging of same. (I abandoned MSFT when they took literally *years* following wide-scale deployment of x86_64 to update their compiler to support 64-bit inline asm.)

Anyway, as noted, GCC won't say anything re. the guts of such macros at compile time - so if the asm has a syntax error (very easy for these to creep in) one is often left trying to sort out cryptic error messages from the assembler and resorting to "divide and conquer" syntax-debugging: cut the bottom half of the instruction block, see if the assembler error persists, etc.

Lastly, note the liberal use of inline C-syntax comments: most editors capable of C-syntax highlighting will also color-highlight the asm, instructions in the same color as a C string and comments in whatever color those are set to. *Very* nice in terms of aiding debugging and readability.

Last fiddled with by ewmayer on 2020-07-23 at 21:27
