 mersenneforum.org Fast double precision Division
 Register FAQ Search Today's Posts Mark Forums Read  2009-04-22, 05:03 #1 __HRB__   Dec 2008 Boycotting the Soapbox 2D016 Posts Fast double precision Division I guess I got a little distracted... Algorithm: Get a 12 bit-approx with rcpps then do one step w/quintic convergence. The whole thing has 21 instructions (no rcppd?!), so I think it should be possible to get it to complete in roughly 7 cycles (~3.5 cycles per double, i.e. 2x faster than the Intel MKL advertises) for in-cache arrays. As far as I can tell, the results match the ones provided by divpd. I think there is a way to compute accurate values for e^x-1 in around 8 cycles, so e.g. hyperbolic tangents should be computable in around 12.5 cycles/double. I think I can also get log(1+x) in under 16 cycles, so the inverse hyperbolic functions can probably be done in under 20 cycles. Code: struct DIVIDE { // compute y/x const __m128d operator() ( __m128d y, __m128d x ) const { __m128d temp; asm volatile( "movaps %[r0],%[r3] \n" "andps %[pdnotexp],%[r0] \n"// mantissa & sign "orps %[pd1],%[r0] \n"// 1<=x<2 "andps %[pdexp],%[r3] \n"// exponent "psubd %[pdexpbias],%[r3] \n"// normalize exponent "movaps %[r0],%[r1] \n"// z "cvtpd2ps %[r0],%[r0] \n" "rcpps %[r0],%[r0] \n"// x (12-bit approx) "cvtps2pd %[r0],%[r0] \n" "mulpd %[r0],%[r1] \n"// z*x "mulpd %[y],%[r0] \n"// y*x - more precision if done here! "movaps %[pd1],%[r2] \n" "subpd %[r1],%[r2] \n"// h=1-z*x "movaps %[r2],%[r1] \n"// h "mulpd %[r2],%[r2] \n"// h^2 "addpd %[r2],%[r1] \n"// h+h^2 "addpd %[pd1],%[r2] \n"// 1+h^2 "mulpd %[r0],%[r1] \n"// y*x*(h+h^2) "mulpd %[r1],%[r2] \n"// y*x*(1+h^2)*(h+h^2) "addpd %[r2],%[r0] \n"// y*x+y*x*(1+h^2)*(h+h^2) "psubd %[r3],%[r0] \n"// -exponent : [r0]"+x"(x), [r1]"=&x"(temp), [r2]"=&x"(temp), [r3]"=&x"(temp) : [y]"X"(y), [pdexp]"X"(CONST::pdexp), [pdnotexp]"X"(CONST::pdnotexp), [pdexpbias]"X"(CONST::pdexpbias), [pd1]"X"(CONST::pd1) : ); return x; } }; P.S. I actually read a little about asm constraints. "X" is kinda nifty for read-only values. "=&x" seems to be the way to go if one needs scratch registers.   2012-01-05, 04:12 #2 bsquared   "Ben" Feb 2007 3·1,193 Posts I need some help understanding this. I'm fine with single precision. I've successfully implemented multi-up division via rcpps and one stage of newton-raphson to get ~23 bits of precision. It runs approximately 4.5x faster than using "/" to divide a list of numbers by another list. My trouble comes about when extending to double precision. Lack of "rcppd" requires using cvtps2ps/cvtpd2pd as above, but everything gets all hosed up when I try it. I suspect the exponents/signs aren't being treated correctly. Maybe it's something as simple as setting the correct floating point rounding mode, but I can't figure it out. The above code gets me close to understanding what needs to happen, but, for instance, what is [pdnotexp], [pdexpbias], and [pdexp], and where do they come from? Here is my code, which simple iterates through two vectors of random doubles, dividing each number from one list by one from the second: Code: asm volatile( "xorl %%ecx, %%ecx \n\t" /* init loop counter */ "movaps (%4), %%xmm3 \n\t" /* bring in vector of 2's */ "1: \n\t" "cmpl %%eax, %%ecx \n\t" /* are we done? */ "jge 2f \n\t" "movapd (%1, %%rcx, 8), %%xmm0 \n\t" /* bring in denominators */ "movapd (%2, %%rcx, 8), %%xmm1 \n\t" /* bring in numerators */ "cvtpd2ps %%xmm0, %%xmm0 \n\t" /* convert to single precision */ "rcpps %%xmm0, %%xmm2 \n\t" /* approx 1/d */ "cvtps2pd %%xmm2, %%xmm2 \n\t" /* convert back to double precision */ "movapd %%xmm2, %%xmm4 \n\t" /* copy 1/d */ "movapd %%xmm3, %%xmm5 \n\t" /* copy 2.0, begin NR stuff below */ "mulpd %%xmm2, %%xmm0 \n\t" /* d * 1/d */ "mulpd %%xmm0, %%xmm2 \n\t" /* d * 1/d * 1/d */ "mulpd %%xmm5, %%xmm4 \n\t" /* 2.0 * 1/d */ "subpd %%xmm2, %%xmm4 \n\t" /* 1/d = 2 * 1/d - d * 1/d^2 */ "mulpd %%xmm4, %%xmm1 \n\t" /* n * 1/d */ "movapd %%xmm1, (%3, %%rcx, 8) \n\t" /* move answers out */ "addl $2, %%ecx \n\t" /* next two divisions */ "jmp 1b \n\t" "2: \n\t" : : "a"(numdiv), "r"(darray), "r"(narray), "r"(rarray), "r"(twoarray) : "rcx", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "memory", "cc"); I then check that given: n/d = r, I get r * d - n < "a suitable low fraction" (say < 0.4). Instead of stuff like this (from control code using ordinary "/" operators): Code: 0: 2.1128 * 13792587.0000 - 29141108.0000 = 0.0000 I get stuff like this: Code: 0: -228535456420911741628038824726223177832726528.0000 * 13792587.0000 - 29141108.0000 = -3152095165270133574372109169366733124553529156960256.0000 Can anyone help me out here? Where am I going wrong?   2012-01-05, 08:36 #3 fivemack (loop (#_fork)) Feb 2006 Cambridge, England 22×32×179 Posts You're doing mulpd %%xmm2, %%xmm0 where xmm0 contains the single-precision version of the denominators! (I did have to step through this in gdb to see what was happening ...) Last fiddled with by fivemack on 2012-01-05 at 08:36   2012-01-05, 14:17 #4 bsquared "Ben" Feb 2007 DFB16 Posts Perfect! Thank you, that did the trick. I need to learn how to operate gdb better. Speedup also looks nice. Even when using 2 NR iterations to get ~48 bits of precision, it runs ~3.5x faster through the loop. Also looks promising as a substitute for integer division, as it runs ~1.8x faster than a for loop of integer divisions. Although I don't know what conversion penalties will do to that figure yet. Thanks again fivemack. - ben.   2012-01-05, 15:00 #5 fivemack (loop (#_fork)) Feb 2006 Cambridge, England 22·32·179 Posts Quote:  Originally Posted by bsquared Perfect! Thank you, that did the trick. I need to learn how to operate gdb better I am definitely not a remotely competent gdb user ... I just did disassemble main (to find out the address of the start of the SSE2 code) break *0x00000004000c839 (to set the breakpoint; you need the * otherwise it thinks you're looking for a variable with a really stupid name) display/i$pc (so that after each instruction it disassembles the .next. one for me)
start
stepi (to run one single instruction)
print $xmm0 print$xmm2
stepi
print $xmm0 print$xmm2

until I figured out what was going on. If I were more competent I could have put 'display $xmm0' 'display$xmm1' through 'display \$xmm5' at the start and it would then print out the values of all the relevant XMM registers at each step.   2012-01-05, 15:10   #6
bsquared

"Ben"
Feb 2007

DFB16 Posts Quote:
 Originally Posted by fivemack I am definitely not a remotely competent gdb user ...
Maybe not, but that sequence of commands displays more competence than I have . I've only ever really used gdb to run backtraces on segfaults and such.

I'm vaguely hopeful that this multi-up division will someday make my squfof routine faster, by enabling more than one multiplier to be run in parallel.   2012-01-05, 15:18   #7
bsquared

"Ben"
Feb 2007

3·1,193 Posts Quote:
 Originally Posted by __HRB__ Get a 12 bit-approx with rcpps then do one step w/quintic convergence. The whole thing has 21 instructions (no rcppd?!), so I think it should be possible to get it to complete in roughly 7 cycles (~3.5 cycles per double, i.e. 2x faster than the Intel MKL advertises) for in-cache arrays.
Looks to be pretty close. My code did 1e8 divisions in 0.25 seconds, for a speed of ~7.325 cycles / division (on a 2.93 GHz X5570). Granted that is not exactly full double precision, but maybe using a 3rd NR iteration or something else with a higher rate of convergence wouldn't hurt too much. As it is, 48 bits is enough to do multi-up 31 bit integer division.   2012-01-05, 19:25   #8
bsquared

"Ben"
Feb 2007

3×1,193 Posts Over lunch I played around with using this for integer division (pointwise division of each integer in two large random lists). Attached is code which does that, about 1.6x faster than a naive for loop of integer divisions.

I had to play around with a fudge factor to get this to work with all possible divisor sizes and ended up with +1/16 (following by truncation). I didn't try to optimize register usage - it'd be nice to be able to tweak it enough to only use the x86 SSE2 register set (i.e. xmm0 - xmm7). Although I think interleaving the inversion approximation of the high half and low half dwords might get a decent speed up (at the cost of using nearly all of the x86_64 SSE2 register set).
Attached Files rcpps_int_division_snippet.txt (4.6 KB, 160 views)

Last fiddled with by bsquared on 2012-01-05 at 19:34 Reason: 1/16, not 1/8...   2012-01-06, 01:05 #9 ewmayer ∂2ω=0   Sep 2002 República de California 22×3×7×139 Posts Ben, I suggest replacing the core NR computation sequence with something like the following. This assumes the packed numerators and denominators are in xmm1 and xmm0 as in your code snippet above), the constant 2.0 in PD form is in xmm7, and xmm6 contains a 64-bit-wide bitmask 0x8000... in PD for negation of a paired-double - follow the code flow to see why this is useful: Code:  "cvtpd2ps %%xmm0, %%xmm2 \n\t" /* convert d to sp and output in xmm2; keep original in xmm0 for negation */ "xorpd %%xmm6, %%xmm0 \n\t" /* -d */ "rcpps %%xmm2, %%xmm2 \n\t" /* ainv := approx 1/d to 11-12 bits of precision */ "cvtps2pd %%xmm2, %%xmm2 \n\t" /* convert approx-inverse back to double precision */ "movaps %%xmm2, %%xmm3 \n\t" /* copy ainv - this is a PD but movaps save an instruction byte*/ "mulpd %%xmm0, %%xmm2 \n\t" /* -d*ainv, overwrites ainv */ "addpd %%xmm7, %%xmm2 \n\t" /* 2 - d*ainv, overwrites ainv */ "mulpd %%xmm3, %%xmm2 \n\t" /* xmm2 = ainv*(2 - d*ainv) = 1/d accurate to ~23 bits */ "movaps %%xmm2, %%xmm3 \n\t" /* Do one more NR iteration by repeating previous 4 instructions: */ "mulpd %%xmm0, %%xmm2 \n\t" "addpd %%xmm7, %%xmm2 \n\t" "mulpd %%xmm3, %%xmm2 \n\t" /* xmm2 = 1/d accurate to ~46 bits */ "mulpd %%xmm2, %%xmm1 \n\t" /* overwrite numerator with desired quotient n/d */ This needs 5 xmm registers, but once we start the 2 iterations only 4 registers (xmm0-3) are used for 'live' data. Moreover, one of the 2 constants (probably the bitmask, since that is used only once at the outset) could be accessed via a memory reference, and since the numerator is only needed at the very end we should wait until then to load into a free register, or simply access it via memref in the final multiply. So let's assume rax and rbx contain pointers to the negation-bitmask and numerator, respectively, and now xmm1 is used much like xmm2 was above, to hold successive approximations to 1/d: Code:  "cvtpd2ps %%xmm0, %%xmm1 \n\t" /* convert d to sp and output in xmm1; keep original in xmm0 for negation */ "xorpd [%%rax], %%xmm0 \n\t" /* -d */ "rcpps %%xmm1, %%xmm1 \n\t" /* ainv := approx 1/d to 11-12 bits of precision */ "cvtps2pd %%xmm1, %%xmm1 \n\t" /* convert approx-inverse back to double precision */ "movaps %%xmm1, %%xmm2 \n\t" /* copy ainv - this is a PD but movaps save an instruction byte*/ "mulpd %%xmm0, %%xmm1 \n\t" /* -d*ainv, overwrites ainv */ "addpd %%xmm7, %%xmm1 \n\t" /* 2 - d*ainv, overwrites ainv */ "mulpd %%xmm2, %%xmm1 \n\t" /* xmm1 = ainv*(2 - d*ainv) = 1/d accurate to ~23 bits */ "movaps %%xmm1, %%xmm2 \n\t" /* Do one more NR iteration by repeating previous 4 instructions: */ "mulpd %%xmm0, %%xmm1 \n\t" "addpd %%xmm7, %%xmm1 \n\t" "mulpd %%xmm2, %%xmm1 \n\t" /* xmm1 = 1/d accurate to ~46 bits */ "mulpd [%%rbx], %%xmm1 \n\t" /* multiply by numerator to get desired quotient n/d */ This leaves xmm3-6 free, so one could use 3 of those to process a second set of data in interleaved fashion to help disguise any latency among dependent instructions above. Since the constant 2 is not overwritten in the implementation, it can be shared by both data streams, leaving one xmm register free for whatever. (I.e. the bitmask could be restored to that if desired). With 4 inputs-at-a-time (i.e. 4 numerators and 4 denominators) it is worth seeing if going to the trouble of packing the sp approximation of all 4 denominators into a single register (via a pair of cvtpd2ps and some high/low-half manipulations of the 2 respective results), then doing a single rcpps and one Newton iteration on that before back-coverting to double for the final iteration is worth it. Lastly, the above 2 snips do 2 full iterations at quadratic convergence - One could consider doing the second iteration at cubic convergence, if something close to the maximal 53 bits of precision is desired. To estimate how many registers that night need, let's consider the respective algorithm in pseudocode form. The Newton-Raphson iteration for successively approximating x = 1/c is x_{n+1} = x_n*[2 - c*x_n] . My implementation above stores -x and y (drop the iteration-subscript now) in registers and proceeds as z = x; // copy inv x *= -c; // -c*x x += 2; // 2 - c*x x *= z; // improved approximation to 1/c To derive a third-order scheme, it helps (or at least it helps me) to recall where the above second-order scheme comes from. NR for roots of f(x) iterates as x_{n+1} = x_n - f(x_n)/f'(x_n). If we want the inverse of a constant c, the obvious target function to express this as a root-finding problem is f(x) := c - 1/x. Then f' = 1/x^2, yielding the iteration formula x_{n+1} = x - (c*x^2 - x) = 2x - cx^2 = x*(2-c*x) . I tried applying Halley's cubic-convergence formula (Philos. Trans. Roy. Soc. London, 18 (1694), 136-145) to the same target function (all derivatives on the rhs are evaluated at he current iterate x_n): x_{n+1} = x_n - (f*f')/[(f')^2 - f*f''/2] Substituting f = c-1/x, f' = 1/x^2, f'' = -2/x^3 gives x_{n+1} = x - (c/x^2 - 1/x^3)/[1/x^4 + (c/x^3 - 1/x^4)] = x - (c/x^2 - 1/x^3)/(c/x^3) = x - (x - 1/c) = 1/c, which is annoying because 1/c - the very inverse we are trying to approximate without use of explicit inversion - appears as the result of the iteration formula. I also tried a standard 4th-order method: x_{n+1} = x_n - f*[(f')^2 - f*f''/2]/[(f')^3 - f*f'*f'' + f^2*f'''/6] with the same result, cancellation of terms causes the rhs to reduce to the thing we seek, 1/c. Crandall gives the following quartic scheme for iterative inversion, but does not give its derivation (nor give a 3rd-order analog): x_{n+1} = x_n*(2 - c*x_n)*[1 + (1 - c*x_n)^2] . Perhaps a usable 3rd-order scheme requires a more-adroit choice of target function. Last fiddled with by ewmayer on 2012-01-06 at 01:05   2012-01-06, 02:47   #10
bsquared

"Ben"
Feb 2007

3·1,193 Posts Quote:
 Originally Posted by ewmayer Ben, I suggest replacing the core NR computation sequence with something like the following. ... This leaves xmm3-6 free, so one could use 3 of those to process a second set of data in interleaved fashion to help disguise any latency among dependent instructions above ... With 4 inputs-at-a-time (i.e. 4 numerators and 4 denominators) it is worth seeing if going to the trouble of packing the sp approximation of all 4 denominators into a single register ... Lastly, the above 2 snips do 2 full iterations at quadratic convergence - One could consider doing the second iteration at cubic convergence, if something close to the maximal 53 bits of precision is desired. ...
Excellent ideas! Thank you! I hadn't yet started optimizing much, so this is very timely and helpful. I did make an interleaved version of my register hogging 1st pass integer division code, and that produced a speed-up of ~15%. I imagine with the much more efficient code ideas you gave that figure will increase quite a bit. Plus, it looks like it will be viable for x86 architectures. I'll make some comparison implementations and post the results. Fun!

Quote:
 Originally Posted by ewmayer Perhaps a usable 3rd-order scheme requires a more-adroit choice of target function.
Yes, but this sort of thing is on the back burner for now. The impact of 2 rounds of NR vs. 0 rounds was less than 10%, so the benefits of 1 NR + cubic might be negligible compared to just doing a 3rd round of NR (if necessary). __HRB__ in the OP has code for 1 round of quintic convergence, but I don't know how that was derived. 4th order Householder method, maybe?

Last fiddled with by bsquared on 2012-01-06 at 02:49   2012-01-06, 05:11   #11
bsquared

"Ben"
Feb 2007

3×1,193 Posts Quote:
 Originally Posted by bsquared I'll make some comparison implementations and post the results. Fun!
I implemented your modified approach, and also ported it into my interleaved integer division routine. Here are some timing results, and the code. All tests were 10,000 iterations of running through a list of 1,000,000 numbers, so 10e9 divisions in all. Divisions involved random 31 bit numerators and random 15 bit denominators.

Code:
test 2x simd double precision division via inverse-approx and newton-raphson iteration:
time for 0x newton-raphson simd division = 22.0409, 0.00 % correct
time for 1x newton-raphson simd division = 22.4083, 21.47 % correct
time for 2x newton-raphson simd division = 24.0490, 100.00 % correct
time for (ewmayer) 2x newton-raphson simd division = 23.7880, 100.00 % correct
time for serial division = 67.3021, 100.00 % correct

test 2x simd integer division via inverse-approx and newton-raphson iteration:
time for 2x newton-raphson simd division = 29.4001, 100.00 % correct
time for interleaved 2x newton-raphson simd division = 26.2557, 100.00 % correct
time for (ewmayer) interleaved 2x newton-raphson simd division = 25.9254, 100.00 % correct
time for serial division = 40.0531, 100.00 % correct
Your approach is faster, but not by as much as I thought it might be.

I'll play around next with the single precision stuff, and with extending precision up to the full 53 bits. Hopefully sometime this weekend.

- ben.
Attached Files multi-up-div.zip (4.0 KB, 161 views)

Last fiddled with by bsquared on 2012-01-06 at 05:14 Reason: more info   Thread Tools Show Printable Version Email this Page Similar Threads Thread Thread Starter Forum Replies Last Post jasong jasong 35 2016-12-11 00:57 ixfd64 Hardware 5 2012-09-12 05:10 dsouza123 Hardware 4 2007-10-15 02:20 fivemack Puzzles 3 2007-04-26 17:01 drew Software 4 2006-08-08 04:08

All times are UTC. The time now is 15:42.

Fri Dec 3 15:42:45 UTC 2021 up 133 days, 10:11, 0 users, load averages: 1.35, 1.25, 1.25