mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2012-04-27, 02:16   #1
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

2·13·449 Posts
Default x86 ASM loop control: JUMP vs LOOP

I am trying to efficiently implement a 64-bit multiword-integer add loop in x86_64 assembler. Here is the C source I began with, from my own custom GMP-style mi64 (mnemonic for 'multiword integer, base-2^64') library:
Code:
/*
	Unsigned add of two base-2^64 vector ints X + Y.
	Any or all of X, Y and Z can point to the same array object.
	Return any exit carry - up to user to determine what to do if this is nonzero.
*/
uint64	mi64_add(const uint64 x[], const uint64 y[], uint64 z[], uint32 len)
{
	uint64 tmp, cy = 0;
	uint32 i;

	for(i=0; i<len; i++)
	{
		tmp = x[i] + cy;
		cy  = (tmp < x[i]);
		tmp = tmp + y[i];
		cy += (tmp < y[i]);
		z[i] = tmp;
	}
	return cy;
}
Since the loop body is small, I am using a single block of ASM for the loop body and loop control.

The first issue I ran into was this: On each loop iteration, I want the x86 carry flag CF to get set based on the sum (including carryin from previous iteration) of each pair of inputs, z[i] = x[i] + y[i] + CF (and re-set CF based on result). If I can preserve CF between loop iterations, each iteration needs just a single ADC instruction. But using the standard kind of jump-based loop control (cf. the GMP library for lots of examples), the jump clears the carry flag CF which I would like to use to manage carries. (For the same reason one must use LEA rather than ADD to increment the array pointers). According to the x86 ISA, there is another instruction, bizarrely named 'LOOP' :), which does just what one wants here. The ISA gives no syntax examples for how to use it, but a bit of web-searching shows it to be very similar to the label-based usage of the various flavors of JUMP. So here is the resulting assembly:
Code:
/* x86_64 ASM implementation of the above loop: */
__asm__ volatile (\
	"xorq	%%rsi,%%rsi	/* Use rsi for carryout of loop */\n\t"\
	"movq	%[__x0],%%rax	/* &x[0] */\n\t"\
	"movq	%[__y0],%%rbx	/* &y[0] */\n\t"\
	"movq	%[__z0],%%rdx	/* &z[0] */\n\t"\
	"addq	%%rsi,%%rsi	/* 0 + 0 inits cy = CF = 0 */\n\t"\
	"movslq	%[__len], %%rcx	/* ASM loop structured as for(j = len; j != 0; -j){...} */\n\t"\
"LoopBeg:	\n\t"\
"/* Load the next pair of addends: */	\n\t"\
	"movq	(%%rax),%%r8 	\n\t"\
	"adcq	(%%rbx),%%r8	/* Add x[i] + y[i] + CF, result in r8, CF bit will set according to result */\n\t"\
	"movq	%%r8,(%%rdx)	/* Write current result word to z[i] */\n\t"\
"/* Increment array pointers by 8 bytes each, using LEA rather than ADD to preserve CF */\n\t"\
	"leaq	0x8(%%rax),%%rax	\n\t"\
	"leaq	0x8(%%rbx),%%rbx	\n\t"\
	"leaq	0x8(%%rdx),%%rdx	\n\t"\
"loop	LoopBeg		/* if (j >= 0), continue loop */\n\t"\
	"adcq	%%rsi,%[__cy]	/* Carryout */\n\t"\
	:			/* outputs: none */\
	: [__x0] "m" (x)	/* All inputs from memory addresses here */\
	 ,[__y0] "m" (y)	\
	 ,[__z0] "m" (z)	\
	 ,[__cy] "m" (cy)	\
	 ,[__len] "m" (len)	\
	: "cl","memory","rax","rbx","rcx","rdx","rsi","r8"	/* Clobbered registers */\
);
That runs, but is appreciably slower than the compiled C code. I am guessing this is largely due to lack of unrolling and the usual host of optimizations the compiler can bring to bear on HLL loops, so I will be playing with unrolling and such to see if I can boost the speed. Any suggestions from the ASM experts around here are welcome.

Second issue: The above function is small enough that the compiler inlines the code when using the normal -O3 optimization level, but that leads to multiple occurrences of the loop label and fatal linker errors:

/var/folders/EM/EMv2WTDiGRSUauDZX4dmgU+++TI/-Tmp-//cc6hktKe.s:1678:FATAL:Symbol LoopBeg already defined.

To work around that I turned the optimization down to -O2 ... is there a way to get the compiler to uniquify such labels when inlining using -O3?

Last fiddled with by ewmayer on 2012-04-27 at 02:19
ewmayer is offline   Reply With Quote
Old 2012-04-27, 03:01   #2
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3·1,193 Posts
Default

Rather than explicitly update all three pointers every iteration, I'd try to just keep one index variable and utilize the decoder to do the array indexing, like so:

Code:
	"xorq	%%r9, %%r9 \n\t"
	"LoopBeg:	\n\t"\
	/* Load the next pair of addends: */ \
		"movq	(%%rax, %%r9, 8),%%r8 	\n\t"\
		"adcq	(%%rbx, %%r9, 8),%%r8	\n\t" /* Add x[i] + y[i] + CF, result in r8, CF bit will set according to result */\
		"movq	%%r8,(%%rdx, %%r9, 8)	\n\t" /* Write current result word to z[i] */\
	/* increment index (preserves CF) */\
		"incq	%%r9 \n\t" \
inc/dec instructions also preserve the carry flag.
bsquared is offline   Reply With Quote
Old 2012-04-27, 03:13   #3
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3·1,193 Posts
Default

Also, you can use jnz and dec to do the looping with no CF clobbering. That way you can better specify the label you want to jump to, which will probably alleviate your second issue. Not sure if it will be faster, tho. Does setting ZF introduce a dependency so that inc/dec will not be issued in parallel?

Code:
"xorq	%%r9, %%r9 \n\t"
		"1:	\n\t" /* loop begin */\
	/* Load the next pair of addends: */ \
		"movq	(%%rax, %%r9, 8),%%r8 	\n\t"\
		"adcq	(%%rbx, %%r9, 8),%%r8	\n\t" /* Add x[i] + y[i] + CF, result in r8, CF bit will set according to result */\
		"movq	%%r8,(%%rdx, %%r9, 8)	\n\t" /* Write current result word to z[i] */\
	/* Increment array pointers by 8 bytes each, using LEA rather than ADD to preserve CF */\
		"incq	%%r9 \n\t" \
		"decq	%%rcx \n\t" \
		"jnz 1b \n\t" \
		"adcq	%%rsi,%[__cy]	\n\t" /* Carryout */\

Last fiddled with by bsquared on 2012-04-27 at 03:18
bsquared is offline   Reply With Quote
Old 2012-04-27, 03:17   #4
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

767710 Posts
Default

As with all ASM problems, each architecture may have its own fastest solution.

First off, you can use DEC as it preserves CF. In fact LOOP is just DEC, JNZ. I know use of LOOP is frowned upon, I'd use DEC/JNZ instead as it may decode faster.

Second, use of DEC is frowned upon because it creates a R/W dependency on the flags register.

(I think the two paragraphs above mean you are screwed :)

Third, LEA is sometimes slower because often there are more ADD units than there are address generation units.

You might try using PUSHF/POPF or SAHF/LAHF. I've never used them, but they might be faster. Alternatively SBB or RCL or similar might be used to save the carry each iteration.

I agree that your big win will come from unrolling the loop.

I'm afraid that wasn't very helpful....
Prime95 is online now   Reply With Quote
Old 2012-04-27, 05:31   #5
axn
 
axn's Avatar
 
Jun 2003

5,179 Posts
Default

Quote:
Originally Posted by bsquared View Post
Also, you can use jnz and dec to do the looping with no CF clobbering. That way you can better specify the label you want to jump to, which will probably alleviate your second issue. Not sure if it will be faster, tho. Does setting ZF introduce a dependency so that inc/dec will not be issued in parallel?

Code:
"xorq	%%r9, %%r9 \n\t"
		"1:	\n\t" /* loop begin */\
	/* Load the next pair of addends: */ \
		"movq	(%%rax, %%r9, 8),%%r8 	\n\t"\
		"adcq	(%%rbx, %%r9, 8),%%r8	\n\t" /* Add x[i] + y[i] + CF, result in r8, CF bit will set according to result */\
		"movq	%%r8,(%%rdx, %%r9, 8)	\n\t" /* Write current result word to z[i] */\
	/* Increment array pointers by 8 bytes each, using LEA rather than ADD to preserve CF */\
		"incq	%%r9 \n\t" \
		"decq	%%rcx \n\t" \
		"jnz 1b \n\t" \
		"adcq	%%rsi,%[__cy]	\n\t" /* Carryout */\
lea can be used instead of inc [but either way, branch prediction means that the jump will be taken even before those instructions are executed]

that plus loop unrolling should cause the above code to beat the crap out of the hll code.
axn is offline   Reply With Quote
Old 2012-04-27, 05:41   #6
davieddy
 
davieddy's Avatar
 
"Lucan"
Dec 2006
England

2×3×13×83 Posts
Default

inc/dec "base index"
jnz

No, I'm not

David

PS This reminds me of an anecdote I told William in "The Well":

A great asm programmer mailed me as to why his jnz loop worked fine with
mov eax,0
but failed with
xor eax,eax
Needless to say he was suitably embarrassed by my instant reply.

Last fiddled with by davieddy on 2012-04-27 at 06:31 Reason: Utilize addressing modes to the hilt
davieddy is offline   Reply With Quote
Old 2012-04-27, 09:39   #7
davieddy
 
davieddy's Avatar
 
"Lucan"
Dec 2006
England

647410 Posts
Default

PPS when did a jump ever affect a flag?

http://www.youtube.com/watch?v=ct2n2...eature=related

Last fiddled with by davieddy on 2012-04-27 at 09:44
davieddy is offline   Reply With Quote
Old 2012-04-27, 10:17   #8
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

3·1,181 Posts
Default

The 'standard' method of optimizing the overhead here is to
- add the array size to all the pointers, then negate it so the loop counts up
- use INC to preserve the carry flag
- put the INC right next to the JNZ, because modern processors pair them together to increase decode bandwidth. Very modern processors actually treat INC/JNZ or DEC/JNZ as an atomic instruction.

George is correct the the best gain would come from unrolling the loop. All versions of Msieve fixed the maximum size of a multiple-precision number so that the loop could be fully unrolled; it made a noticeable difference for addition and subtraction speed.

You can get more inspiration by looking at the ASM that GMP uses; for example, on the Pentium 4 ADC is very slow, so they use SSE2 for everything.

PS: Earlier Intel processors had LOOP implemented to be faster than the alternatives, but all modern processors have LOOP slightly slower than doing the pieces yourself.

Last fiddled with by jasonp on 2012-04-27 at 10:20
jasonp is offline   Reply With Quote
Old 2012-04-27, 11:50   #9
alpertron
 
alpertron's Avatar
 
Aug 2002
Buenos Aires, Argentina

5×277 Posts
Default

Quote:
Originally Posted by davieddy View Post
PPS when did a jump ever affect a flag?

http://www.youtube.com/watch?v=ct2n2...eature=related
In 8051 architecture there is a JBC instruction which jumps if the bit specified in the instruction is set. In that case it also clears that bit.
alpertron is offline   Reply With Quote
Old 2012-04-27, 14:09   #10
davieddy
 
davieddy's Avatar
 
"Lucan"
Dec 2006
England

2×3×13×83 Posts
Default

Quote:
Originally Posted by alpertron View Post
In 8051 architecture there is a JBC instruction which jumps if the bit specified in the instruction is set. In that case it also clears that bit.
While I'm still temporarily persona grata, I would like to to say that I have some experience with the 8051.
Let's see where this post ends up :(((((((

Last fiddled with by davieddy on 2012-04-27 at 14:10
davieddy is offline   Reply With Quote
Old 2012-04-27, 14:51   #11
xilman
Bamboozled!
 
xilman's Avatar
 
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

24·13·53 Posts
Default

Quote:
Originally Posted by davieddy View Post
PPS when did a jump ever affect a flag?
On the VAX, CALLG and CALLS each clear all the condition codes.
xilman is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
How to prevent GCC unrolling a single selected loop ewmayer Programming 8 2017-05-19 04:09
Infinite loop for ggnfs or msieve Greebley Aliquot Sequences 4 2013-02-06 19:28
Big jump up for me in Top Producers list (?) Freightyard PrimeNet 2 2009-11-07 19:48
Unreserve exponent endless loop luckyduck288 Software 3 2008-01-22 13:45
How many times around the loop? nibble4bits Puzzles 2 2006-02-27 09:34

All times are UTC. The time now is 09:22.


Mon Nov 29 09:22:13 UTC 2021 up 129 days, 3:51, 0 users, load averages: 0.49, 0.97, 1.12

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.