mersenneforum.org Algorithm wanted - sine and/or cosine

2020-01-10, 02:52   #12
jasonp
Tribal Bullet

Oct 2004

3489₁₀ Posts

Is Abramowitz and Stegun available anywhere online? This sounds like the kind of problem for which they list many options. In addition to the rational approximations, if you want simple code with specified precision maybe the CORDIC algorithm can be used. Minimax series can provide uniformly good approximations with minimal work. With modern GPUs small read-only lookup tables should not be performance bottlenecks.

Last fiddled with by jasonp on 2020-01-10 at 03:50
2020-01-10, 03:10   #13
axn

Jun 2003

2²×3²×131 Posts

Quote:
 Originally Posted by jasonp Is Abramowitz and Stegun available anywhere online?
http://people.math.sfu.ca/~cbm/aands/

2020-01-10, 04:15   #14
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

11²×59 Posts

Here is my code. Any improvements are welcome!

Code:
// This version of slowTrig assumes k is positive and k/n <= 0.5, which means
// we want cos and sin values in the range [0, pi/2].
// We found free Sun Microsystems code that is short and efficient in the
// range [-pi/4, pi/4].

/* ====================================================
 * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
 *
 * Developed at SunSoft, a Sun Microsystems, Inc. business.
 * Permission to use, copy, modify, and distribute this
 * software is freely granted, provided that this notice
 * is preserved.
 * ====================================================
 */

/* __kernel_sin(x)
 * kernel sin function on [-pi/4, pi/4], pi/4 ~ 0.7854
 * Input x is assumed to be bounded by ~pi/4 in magnitude.
 *
 * Algorithm
 * 1. Since sin(-x) = -sin(x), we need only to consider positive x.
 * 2. sin(x) is approximated by a polynomial of degree 13 on [0, pi/4]:
 *        sin(x) ~ x + S1*x^3 + ... + S6*x^13
 *    where
 *        |sin(x)/x - (1 + S1*x^2 + S2*x^4 + S3*x^6 + S4*x^8 + S5*x^10 + S6*x^12)| <= 2^-58
 */
double __kernel_sin(double x)
{
    const double
        S1 = -1.66666666666666324348e-01, /* 0xBFC55555, 0x55555549 */
        S2 =  8.33333333332248946124e-03, /* 0x3F811111, 0x1110F8A6 */
        S3 = -1.98412698298579493134e-04, /* 0xBF2A01A0, 0x19C161D5 */
        S4 =  2.75573137070700676789e-06, /* 0x3EC71DE3, 0x57B1FE7D */
        S5 = -2.50507602534068634195e-08, /* 0xBE5AE5E6, 0x8A2B9CEB */
        S6 =  1.58969099521155010221e-10; /* 0x3DE5D93A, 0x5ACFD57C */
    double z, r, v;
    z = x*x;
    v = z*x;
    r = S2 + z*(S3 + z*(S4 + z*(S5 + z*S6)));
    return x + v*(S1 + z*r);
}

/* __kernel_cos(x)
 * kernel cos function on [-pi/4, pi/4], pi/4 ~ 0.785398164
 * Input x is assumed to be bounded by ~pi/4 in magnitude.
 *
 * Algorithm
 * 1. Since cos(-x) = cos(x), we need only to consider positive x.
 * 2. cos(x) is approximated by a polynomial of degree 14 on [0, pi/4]:
 *        cos(x) ~ 1 - x*x/2 + C1*x^4 + ... + C6*x^14
 *    where the remez error is
 *        |cos(x) - (1 - .5*x^2 + C1*x^4 + C2*x^6 + C3*x^8 + C4*x^10 + C5*x^12 + C6*x^14)| <= 2^-58
 * 3. Let r = C1*x^4 + C2*x^6 + C3*x^8 + C4*x^10 + C5*x^12 + C6*x^14; then
 *        cos(x) = 1 - x*x/2 + r.
 *    Since cos(x+y) ~ cos(x) - sin(x)*y ~ cos(x) - x*y,
 *    a correction term is necessary in cos(x), and hence
 *        cos(x+y) = 1 - (x*x/2 - (r - x*y)).
 *    For better accuracy when x > 0.3, let qx = |x|/4 with
 *    the last 32 bits masked off, and if x > 0.78125, let qx = 0.28125.
 *    Then
 *        cos(x+y) = (1-qx) - ((x*x/2-qx) - (r-x*y)).
 *    Note that 1-qx and (x*x/2-qx) are EXACT here, and the
 *    magnitude of the latter is at least a quarter of x*x/2,
 *    thus reducing the rounding error in the subtraction.
 */
double __kernel_cos(double x)
{
    const double
        C1 =  4.16666666666666019037e-02, /* 0x3FA55555, 0x5555554C */
        C2 = -1.38888888888741095749e-03, /* 0xBF56C16C, 0x16C15177 */
        C3 =  2.48015872894767294178e-05, /* 0x3EFA01A0, 0x19CB1590 */
        C4 = -2.75573143513906633035e-07, /* 0xBE927E4F, 0x809C52AD */
        C5 =  2.08757232129817482790e-09, /* 0x3E21EE9E, 0xBDB4B1C4 */
        C6 = -1.13596475577881948265e-11; /* 0xBDA8FAE9, 0xBE8838D4 */
    double z, r;
    z = x*x;
    r = z*(C1 + z*(C2 + z*(C3 + z*(C4 + z*(C5 + z*C6)))));
#ifndef MORE_ACCURATE
    return 1.0 - (0.5*z - (z*r));
#else
    double a, hz, qx;
    int ix;
    ix = __HI(x) & 0x7fffffff;          /* ix = |x|'s high word */
    if (ix < 0x3FD33333)                /* if |x| < 0.3 */
        return 1.0 - (0.5*z - (z*r));
    else {
        if (ix > 0x3fe90000) {          /* x > 0.78125 */
            qx = 0.28125;
        } else {
            __HI(qx) = ix - 0x00200000; /* x/4 */
            __LO(qx) = 0;
        }
        hz = 0.5*z - qx;
        a = 1.0 - qx;
        return a - (hz - (z*r));
    }
#endif
}

// We use the following trig identities to convert our [0, pi/2] range to the
// [-pi/4, pi/4] range:
//     cos(A + B) = cos A cos B - sin A sin B
//     sin(A + B) = sin A cos B + cos A sin B
// We want to compute sin(pi*k/n) and cos(pi*k/n). Let x = pi*k/n - pi/4.
//     cos(pi*k/n) = cos(x + pi/4) = cos(x) * SQRTHALF - sin(x) * SQRTHALF
//     sin(pi*k/n) = sin(x + pi/4) = sin(x) * SQRTHALF + cos(x) * SQRTHALF
double2 slowTrig(i32 k, i32 n)
{
    double angle = M_PI / n * k - M_PI / 4;
    double c = __kernel_cos(angle);
    double s = __kernel_sin(angle);
#if DEBUG
    if (k * 2 > n) printf("slowTrig fail: k=%d, n=%d\n", k, n);
#endif
    return M_SQRT1_2 * U2(c - s, -(c + s));
}
2020-01-10, 04:27   #15
axn

Jun 2003

2²×3²×131 Posts

Code:
r = z*(C1 + z*(C2 + z*(C3 + z*(C4 + z*(C5 + z*C6)))));

Knuth does cover speeding this part up by reducing the number of multiplications.
2020-01-10, 09:59   #16
Nick

Dec 2012
The Netherlands

2×727 Posts

Quote:
 Originally Posted by R.D. Silverman You may want to look up "Padé approximation"; Wiki has a short article including an expression for sin()
For anyone interested, several Padé approximations for sin(), along with attempts to improve them using genetic programming, can be found here:

https://pdfs.semanticscholar.org/486...2462c2bb6f.pdf

2020-01-10, 22:44   #17
ewmayer
2ω=0

Sep 2002
República de California

2·59·83 Posts

Quote:
 Originally Posted by jasonp With modern GPUs small read-only lookup tables should not be performance bottlenecks.
George, presumably you need this functionality for on-the-fly computation of FFT twiddles, i.e. for complex roots of unity of the form exp(I*2*Pi/n), where n is some highly composite integer, likely of the form n = [small odd]*2^k. Did you actually GPU-benchmark the standard 2-small-precomputed-tables approach here, with each table consisting of O(sqrt(n)) precomputed roots of unity? I'm wondering precisely what kind of table lookups are behind your "tables bad" statement, and ISTR that your x86 FFT code uses O(n)-length tables, not O(sqrt(n))-sized ones.

2020-01-10, 23:25   #18
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

11²·59 Posts

Quote:
 Originally Posted by ewmayer George, presumably you need this functionality for on-the-fly computation of FFT twiddles, i.e. for complex roots of unity of the form exp(I*2*Pi/n), where n is some highly composite integer, likely of the form n = [small odd]*2^k. Did you actually GPU-benchmark the standard 2-small-precomputed-tables approach here, with each table consisting of O(sqrt(n)) precomputed roots of unity? I'm wondering precisely what kind of table lookups are behind your "tables bad" statement, and ISTR that your x86 FFT code uses O(n)-length tables, not O(sqrt(n))-sized ones.
Yes, it is for FFT twiddles on a GPU.

Take the awesome Radeon VII. Using the memory bandwidth from the spec sheets, reading a sin/cos twiddle from memory costs 53 clocks of latency if the memory is overclocked, 78 if not. This latency can be hidden if the GPU has some compute it can do.

Looking at the Taylor series code I posted, we're looking at about 7 muls and 7 adds for both cos and sin. In theory an ADD and MUL can be dispatched each cycle. Latency is 8 or 16 clocks, which again can be hidden if there is other compute to do.

In practice, it is easier to hide sin/cos calculation latency than memory read latency. In fact, while waiting for FFT data to be read in, the GPU can independently be calculating twiddle values.

BTW, the code I posted above led to a 20us improvement over the compiler's sin/cos implementation. A 5M FFT is now at 0.7 ms per iteration, or about 18 hours for a primality test.

2020-01-11, 08:05   #19
axn

Jun 2003

126C₁₆ Posts

Using Knuth to optimize the mul count:

Code:
double __kernel_sin(double x)
{
    const double
        A0 = -1.30608015549532788940e+02,
        A1 =  1.23312777810503624981e+04,
        A2 =  5.13167274033946697914e+01,
        A3 = -8.61464096450666776475e+06,
        A4 =  1.55798886190603733380e+07,
        A5 =  6.07511251529788036330e+13,
        A6 =  1.58969099521155010221e-10;
    double x2, r, z, w;
    x2 = x*x;
    z = (x2 + A0)*x2 + A1;
    w = (x2 + A2)*z + A3;
    r = ((w + z + A4)*w + A5)*A6;
    return x*r;
}

double __kernel_cos(double x)
{
    const double
        A0 = -2.86788466915238314072e+02,
        A1 =  6.38040759672674956541e+04,
        A2 =  1.94403031619891871554e+02,
        A3 = -2.01131786823940935180e+07,
        A4 =  1.46249769755550521701e+07,
        A5 =  5.38508025161968705769e+13,
        A6 = -1.13596475577881948265e-11;
    double x2, r, z, w;
    x2 = x*x;
    z = (x2 + A0)*x2 + A1;
    w = (x2 + A2)*z + A3;
    r = ((w + z + A4)*w + A5)*A6;
    return 1.0 + x2*r;
}
2020-01-11, 19:36   #20
ewmayer
2ω=0

Sep 2002
República de California

2·59·83 Posts

Quote:
 Originally Posted by axn Using Knuth to optimize the mul count: [snip]
That is more suitable for non-FMA architectures. On FMA-supporting ones, George's code needs 8 FMA-class operations (2 of which are pure MULs), compared to 10 separate operations (a mix of ADD, MUL, and FMA) for the code you posted. (This assumes one has structured things to do enough such computations in parallel to hide the latencies of the various instructions.)

2020-01-11, 20:20   #21
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

11²·59 Posts

Quote:
 Originally Posted by axn Using Knuth to optimize the mul count:
One would also need to see an analysis of the absolute and relative error in this new code.

2020-01-12, 04:20   #22
axn

Jun 2003

126C₁₆ Posts

Quote:
 Originally Posted by ewmayer That is more suitable for non-FMA architectures. On FMA-supporting ones, George's code needs 8 FMA-class operations (2 of which are pure MULs), compared to 10 separate operations (a mix of ADD, MUL, and FMA) for the code you posted. (This assumes one has structured things to do enough such computations in parallel to hide the latencies of the various instructions.)
Quote:
 Originally Posted by Prime95 One would also need to see an analysis of the absolute and relative error in this new code.
Does the modified code sequence actually result in any reduction in time or is it DOA?

How does one go about doing error analysis? In theory, the code computes the same polynomial as the previous version, just with fewer multiplications. In practice, I don't know how much additional error (if any) the new code introduces. I suppose one could empirically compare all the interesting sin/cos values across all three versions (native / poly / Knuth).

Also, is it worth combining sin & cos into a single routine (for either version)? Would that give the compiler / scheduler more options to do a better job?
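For concreteness, a combined routine along the lines axn asks about would simply share z = x*x between the two independent Horner chains. The sketch below is based on the fdlibm kernels quoted earlier in the thread; since the chains share no other data, a compiler or hardware scheduler is free to interleave them to hide FMA latency.

```c
#include <math.h>

/* Combined sin/cos kernel sketch for |x| <= pi/4: one shared z = x*x,
 * two independent Horner chains (interleavable by the scheduler).
 * Coefficients are the fdlibm S1..S6 / C1..C6 quoted above. */
static void kernel_sincos(double x, double *s, double *c) {
    const double S1 = -1.66666666666666324348e-01, S2 =  8.33333333332248946124e-03,
                 S3 = -1.98412698298579493134e-04, S4 =  2.75573137070700676789e-06,
                 S5 = -2.50507602534068634195e-08, S6 =  1.58969099521155010221e-10;
    const double C1 =  4.16666666666666019037e-02, C2 = -1.38888888888741095749e-03,
                 C3 =  2.48015872894767294178e-05, C4 = -2.75573143513906633035e-07,
                 C5 =  2.08757232129817482790e-09, C6 = -1.13596475577881948265e-11;
    double z = x * x, v = z * x;
    double rs = S2 + z * (S3 + z * (S4 + z * (S5 + z * S6)));
    double rc = z * (C1 + z * (C2 + z * (C3 + z * (C4 + z * (C5 + z * C6)))));
    *s = x + v * (S1 + z * rs);
    *c = 1.0 - (0.5 * z - z * rc);
}
```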
