mersenneforum.org  

Old 2020-01-10, 02:52   #12
jasonp

Is Abramowitz and Stegun available anywhere online? This sounds like the sort of problem for which they list many candidate approximations.

In addition to the rational approximations, if you want simple code with a specified precision, maybe the CORDIC algorithm could be used.
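
For illustration, here is a minimal rotation-mode CORDIC sketch in double precision. The helper name and the on-the-fly table computation are assumptions of this sketch; a real fixed-point implementation would precompute the atan(2^-i) table and the gain constant.

Code:
#include <math.h>

/* Rotation-mode CORDIC sketch (illustrative only). Start from the
 * pre-scaled vector (K, 0), where K = prod 1/sqrt(1 + 2^-2i), and
 * rotate by +-atan(2^-i) so the residual angle is driven to zero;
 * the vector ends at (cos t, sin t). Valid for |t| <= pi/2, with
 * roughly one extra bit of accuracy per iteration. */
static void cordic_sincos(double t, int iters, double *s, double *c)
{
  double k = 1.0;
  for (int i = 0; i < iters; i++)
    k /= sqrt(1.0 + ldexp(1.0, -2 * i));   /* accumulate the gain K */

  double x = k, y = 0.0, pow2 = 1.0;
  for (int i = 0; i < iters; i++) {
    double d = (t >= 0.0) ? 1.0 : -1.0;    /* rotate toward the target */
    double xn = x - d * y * pow2;
    double yn = y + d * x * pow2;
    t -= d * atan(pow2);                   /* table entry atan(2^-i) */
    x = xn;
    y = yn;
    pow2 *= 0.5;                           /* next 2^-i */
  }
  *c = x;                                  /* cos of the original t */
  *s = y;                                  /* sin of the original t */
}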

Minimax series can provide uniformly good approximations with minimal work.

With modern GPUs small read-only lookup tables should not be performance bottlenecks.

Last fiddled with by jasonp on 2020-01-10 at 03:50
Old 2020-01-10, 03:10   #13
axn
 

Quote:
Originally Posted by jasonp
Is Abramowitz and Stegun available anywhere online?
http://people.math.sfu.ca/~cbm/aands/
Old 2020-01-10, 04:15   #14
Prime95

Here is my code. Any improvements are welcome!

Code:
// This version of slowTrig assumes k is positive and k/n <= 0.5, which means we need cos and sin at angles in the range [0, pi/2]
// We found free Sun Microsystems code that is short and efficient in the range [-pi/4, pi/4].

/* ====================================================
 * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
 *
 * Developed at SunSoft, a Sun Microsystems, Inc. business.
 * Permission to use, copy, modify, and distribute this
 * software is freely granted, provided that this notice 
 * is preserved.
 * ====================================================
 */

/* __kernel_sin(x)
 * kernel sin function on [-pi/4, pi/4], pi/4 ~ 0.7854
 * Input x is assumed to be bounded by ~pi/4 in magnitude.
 *
 * Algorithm
 *	1. Since sin(-x) = -sin(x), we need only to consider positive x. 
 *	2. sin(x) is approximated by a polynomial of degree 13 on [0,pi/4]:
 *	   	sin(x) ~ x + S1*x^3 + ... + S6*x^13
 *	   where
 *	   	|sin(x)/x - (1 + S1*x^2 + S2*x^4 + S3*x^6 + S4*x^8 + S5*x^10 + S6*x^12)| <= 2^-58
 * 
 */

double __kernel_sin(double x)
{
  const double 
  S1  = -1.66666666666666324348e-01, /* 0xBFC55555, 0x55555549 */
  S2  =  8.33333333332248946124e-03, /* 0x3F811111, 0x1110F8A6 */
  S3  = -1.98412698298579493134e-04, /* 0xBF2A01A0, 0x19C161D5 */
  S4  =  2.75573137070700676789e-06, /* 0x3EC71DE3, 0x57B1FE7D */
  S5  = -2.50507602534068634195e-08, /* 0xBE5AE5E6, 0x8A2B9CEB */
  S6  =  1.58969099521155010221e-10; /* 0x3DE5D93A, 0x5ACFD57C */
  double z,r,v;
  z	=  x*x;
  v	=  z*x;
  r	=  S2+z*(S3+z*(S4+z*(S5+z*S6)));
  return x+v*(S1+z*r);
}

/*
 * __kernel_cos( x )
 * kernel cos function on [-pi/4, pi/4], pi/4 ~ 0.785398164
 * Input x is assumed to be bounded by ~pi/4 in magnitude.
 *
 * Algorithm
 *	1. Since cos(-x) = cos(x), we need only to consider positive x.
 *	2. cos(x) is approximated by a polynomial of degree 14 on [0,pi/4]:
 *	   	cos(x) ~ 1 - x*x/2 + C1*x^4 + ... + C6*x^14
 *	   where the Remez error satisfies
 *	   	|cos(x) - (1 - 0.5*x^2 + C1*x^4 + C2*x^6 + C3*x^8 + C4*x^10 + C5*x^12 + C6*x^14)| <= 2^-58
 *	3. let r = C1*x^4 + C2*x^6 + C3*x^8 + C4*x^10 + C5*x^12 + C6*x^14, then
 *	       cos(x) = 1 - x*x/2 + r
 *	   since cos(x+y) ~ cos(x) - sin(x)*y 
 *			  ~ cos(x) - x*y,
 *	   a correction term is necessary in cos(x) and hence
 *		cos(x+y) = 1 - (x*x/2 - (r - x*y))
 *	   For better accuracy when x > 0.3, let qx = |x|/4 with
 *	   the last 32 bits mask off, and if x > 0.78125, let qx = 0.28125.
 *	   Then
 *		cos(x+y) = (1-qx) - ((x*x/2-qx) - (r-x*y)).
 *	   Note that 1-qx and (x*x/2-qx) are EXACT here, and the
 *	   magnitude of the latter is at least a quarter of x*x/2,
 *	   thus reducing the rounding error in the subtraction.
 */

double __kernel_cos(double x)
{
  const double 
  C1  =  4.16666666666666019037e-02, /* 0x3FA55555, 0x5555554C */
  C2  = -1.38888888888741095749e-03, /* 0xBF56C16C, 0x16C15177 */
  C3  =  2.48015872894767294178e-05, /* 0x3EFA01A0, 0x19CB1590 */
  C4  = -2.75573143513906633035e-07, /* 0xBE927E4F, 0x809C52AD */
  C5  =  2.08757232129817482790e-09, /* 0x3E21EE9E, 0xBDB4B1C4 */
  C6  = -1.13596475577881948265e-11; /* 0xBDA8FAE9, 0xBE8838D4 */
  double z,r;
  z  = x*x;
  r  = z*(C1+z*(C2+z*(C3+z*(C4+z*(C5+z*C6)))));
#ifndef MORE_ACCURATE
  return 1.0 - (0.5*z - (z*r));
#else
  double a,hz,qx;
  int ix;
  ix = __HI(x)&0x7fffffff;		/* ix = |x|'s high word*/
  if (ix < 0x3FD33333) 			/* if |x| < 0.3 */ 
    return 1.0 - (0.5*z - (z*r));
  else {
    if(ix > 0x3fe90000) {		/* x > 0.78125 */
      qx = 0.28125;
    } else {
      __HI(qx) = ix-0x00200000;	/* x/4 */
      __LO(qx) = 0;
    }
    hz = 0.5*z-qx;
    a  = 1.0-qx;
    return a - (hz - (z*r));
  }
#endif
}

// We use the following trig identities to convert our [0, pi/2] range to [-pi/4, pi/4] range:
//	cos(A + B) = cos A cos B - sin A sin B 
//	sin(A + B) = sin A cos B + cos A sin B 
// We want to compute sin(pi*k/n) and cos(pi*k/n).  Let x = pi*k/n - pi/4
// cos(pi*k/n) = cos(x + pi/4) = cos(x) * SQRTHALF - sin(x) * SQRTHALF
// sin(pi*k/n) = sin(x + pi/4) = sin(x) * SQRTHALF + cos(x) * SQRTHALF

double2 slowTrig(i32 k, i32 n) {
  double angle = M_PI / n * k - M_PI / 4;
  double c = __kernel_cos(angle);
  double s = __kernel_sin(angle);
#if DEBUG
  if (k * 2 > n) printf ("slowTrig fail: k=%d, n=%d\n", k, n);
#endif
  return M_SQRT1_2 * U2(c - s, -(c + s));	// (cos(pi*k/n), -sin(pi*k/n)), i.e. exp(-I*pi*k/n)
}
Old 2020-01-10, 04:27   #15
axn
 

Code:
r  = z*(C1+z*(C2+z*(C3+z*(C4+z*(C5+z*C6)))));
Knuth covers speeding this part up by reducing the number of multiplications (polynomial evaluation with preconditioned coefficients, TAOCP Vol. 2, section 4.6.4).
Old 2020-01-10, 09:59   #16
Nick
 

Quote:
Originally Posted by R.D. Silverman
You may want to look up 'Padé approximation'; Wikipedia has a short article including an expression for sin().
For anyone interested, several Padé approximations for sine(), along with attempts to improve them using genetic programming, can be found here:

https://pdfs.semanticscholar.org/486...2462c2bb6f.pdf
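
For a taste of what such a rational approximation looks like, here is Bhaskara I's classical formula. This is an illustration of the genre, not something taken from the linked paper.

Code:
#include <math.h>

/* Bhaskara I's rational approximation to sine on [0, pi]:
 *     sin(x) ~= 16*x*(pi - x) / (5*pi^2 - 4*x*(pi - x))
 * Exact at 0, pi/2 and pi; absolute error stays below about 2e-3,
 * so it is far too coarse for FFT twiddles but shows the shape of
 * a low-degree rational approximation. */
double bhaskara_sin(double x)
{
  double p = x * (M_PI - x);
  return 16.0 * p / (5.0 * M_PI * M_PI - 4.0 * p);
}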
Old 2020-01-10, 22:44   #17
ewmayer

Quote:
Originally Posted by jasonp
With modern GPUs small read-only lookup tables should not be performance bottlenecks.
George, presumably you need this functionality for on-the-fly computation of FFT twiddles, i.e. for complex roots of unity of the form exp(I*2*Pi/n), where n is some highly composite integer, likely of the form n = [small odd]*2^k. Did you actually GPU-benchmark the use of a standard 2-small-precomputed-tables approach here, each table consisting of O(sqrt(n)) precomputed roots of unity? I'm wondering precisely what kind of table lookups are behind your "tables bad" statement, and ISTR that your x86 FFT code uses O(n)-length tables, not O(sqrt(n))-sized ones.
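
For reference, the two-table scheme in question can be sketched as follows (all names hypothetical): with T ~ sqrt(n), each table holds T precomputed roots, and each twiddle costs one complex multiply.

Code:
#include <complex.h>

/* Sketch of the 2-small-precomputed-tables twiddle lookup.
 * With T ~ sqrt(n), the tables hold
 *     hi[j] = exp(2*Pi*I*j*T/n),   lo[j] = exp(2*Pi*I*j/n),   j < T,
 * so that exp(2*Pi*I*k/n) = hi[k / T] * lo[k % T].
 * The price versus a full O(n) table is one complex multiply
 * (and its rounding error) per twiddle. */
static double complex twiddle(const double complex *hi,
                              const double complex *lo,
                              unsigned T, unsigned k)
{
  return hi[k / T] * lo[k % T];
}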
Old 2020-01-10, 23:25   #18
Prime95

Quote:
Originally Posted by ewmayer
George, presumably you need this functionality for on-the-fly computation of FFT twiddles, i.e. for complex roots of unity of the form exp(I*2*Pi/n), where n is some highly composite integer, likely of the form n = [small odd]*2^k. Did you actually GPU-benchmark the use of a standard 2-small-precomputed-tables approach here, each table consisting of O(sqrt(n)) precomputed roots of unity? I'm wondering precisely what kind of table lookups are behind your "tables bad" statement, and ISTR that your x86 FFT code uses O(n)-length tables, not O(sqrt(n))-sized ones.
Yes, it is for FFT twiddles on a GPU.

Take the awesome Radeon VII. Using the memory bandwidth from the spec sheets, reading a sin/cos twiddle from memory costs 53 clocks of latency if memory is overclocked, 78 if not. This latency can be hidden if the GPU has some compute it can do.

Looking at the Taylor-series code I posted, we need about 7 muls and 7 adds each for cos and sin. In theory an ADD and a MUL can be dispatched each cycle. Latency is 8 or 16 clocks, which again can be hidden if there is other compute to do.

In practice, it is easier to hide sin/cos calculation latency than memory-read latency. In fact, while waiting for FFT data to be read in, the GPU can be independently calculating twiddle values.

BTW, the code I posted above led to a 20us improvement over the compiler's sin/cos implementation. A 5M FFT is now at 0.7 ms per iteration, or about 18 hours for a primality test.
Old 2020-01-11, 08:05   #19
axn
 

Using Knuth to optimize the mul count:
Code:
// Polynomial evaluation with preconditioned ("adapted") coefficients
// (Knuth, TAOCP Vol. 2, sec. 4.6.4): the same degree-13 minimax
// polynomial, rearranged to need fewer multiplications than Horner's rule.
double __kernel_sin(double x)
{
  const double 
  A0 = -1.30608015549532788940e+02,
  A1 =  1.23312777810503624981e+04, 
  A2 =  5.13167274033946697914e+01, 
  A3 = -8.61464096450666776475e+06,
  A4 =  1.55798886190603733380e+07,
  A5 =  6.07511251529788036330e+13,
  A6 =  1.58969099521155010221e-10;

  double x2,r,z,w;

  x2 = x*x;
  z = (x2+A0)*x2+A1;
  w = (x2+A2)*z +A3;
  r = ((w+z+A4)*w+A5)*A6;

  return x*r;
}


// Same preconditioning scheme as __kernel_sin above.
double __kernel_cos(double x)
{
  const double 
  A0 = -2.86788466915238314072e+02,
  A1 =  6.38040759672674956541e+04,
  A2 =  1.94403031619891871554e+02,
  A3 = -2.01131786823940935180e+07,
  A4 =  1.46249769755550521701e+07,
  A5 =  5.38508025161968705769e+13,
  A6 = -1.13596475577881948265e-11;

  double x2,r,z,w;

  x2 = x*x;
  z = (x2+A0)*x2+A1;
  w = (x2+A2)*z +A3;
  r = ((w+z+A4)*w+A5)*A6;

  return 1.0+x2*r;
}
Old 2020-01-11, 19:36   #20
ewmayer

Quote:
Originally Posted by axn
Using Knuth to optimize the mul count:
[snip]
That is more suitable for non-FMA architectures ... for FMA-supporting ones, George's code needs 8 FMA operations (2 of which are pure MULs), compared to 10 separate operations (a mix of ADD, MUL, and FMA) for the code you posted. (This assumes one has structured things to run enough such computations in parallel to hide the latencies of the various instructions.)
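
For concreteness, here is a sketch of George's __kernel_sin written with explicit fma() calls to make that count visible: 2 pure multiplies plus 6 FMAs, 8 fused operations in all. The function name is this sketch's, not his GPU code.

Code:
#include <math.h>

/* George's __kernel_sin with the FMA structure made explicit (sketch). */
double kernel_sin_fma(double x)
{
  const double
  S1 = -1.66666666666666324348e-01, S2 =  8.33333333332248946124e-03,
  S3 = -1.98412698298579493134e-04, S4 =  2.75573137070700676789e-06,
  S5 = -2.50507602534068634195e-08, S6 =  1.58969099521155010221e-10;
  double z = x * x;                                               /* MUL */
  double v = z * x;                                               /* MUL */
  double r = fma(z, fma(z, fma(z, fma(z, S6, S5), S4), S3), S2);  /* 4 FMAs */
  return fma(v, fma(z, r, S1), x);                                /* 2 FMAs */
}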
Old 2020-01-11, 20:20   #21
Prime95

Quote:
Originally Posted by axn
Using Knuth to optimize the mul count:
One would also need to see an analysis of the absolute and relative error in this new code.
Old 2020-01-12, 04:20   #22
axn
 

Quote:
Originally Posted by ewmayer
That is more suitable for non-FMA architectures ... for FMA-supporting ones, George's code needs 8 FMA operations (2 of which are pure MULs), compared to 10 separate operations (a mix of ADD, MUL, and FMA) for the code you posted. (This assumes one has structured things to run enough such computations in parallel to hide the latencies of the various instructions.)
Quote:
Originally Posted by Prime95
One would also need to see an analysis of the absolute and relative error in this new code.
Does the modified code sequence actually result in any reduction in time or is it DOA?

How does one go about doing the error analysis? In theory, the code computes the same polynomial as the previous one, just with fewer multiplications. In practice, I don't know how much additional error (if any) the new code introduces. I suppose one could empirically compare all the interesting sin/cos values across all three (native / poly / Knuth).
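
A sketch of what that empirical comparison might look like, using long-double sinl() as the reference (this sketch's choice) and assuming both kernels are linked in; knuth_kernel_sin is a rename to avoid clashing with the original name:

Code:
#include <math.h>
#include <stdio.h>

double __kernel_sin(double x);       /* George's Horner version */
double knuth_kernel_sin(double x);   /* the preconditioned version, renamed */

/* Scan [-pi/4, pi/4] and report each kernel's worst absolute error
 * against a higher-precision reference. */
int main(void)
{
  double worst_horner = 0.0, worst_knuth = 0.0;
  for (long i = -1000000; i <= 1000000; i++) {
    double x = i * (M_PI / 4) / 1000000.0;
    long double ref = sinl((long double)x);
    double e1 = (double)fabsl((long double)__kernel_sin(x) - ref);
    double e2 = (double)fabsl((long double)knuth_kernel_sin(x) - ref);
    if (e1 > worst_horner) worst_horner = e1;
    if (e2 > worst_knuth)  worst_knuth  = e2;
  }
  printf("max |err|  Horner: %.3e   Knuth: %.3e\n", worst_horner, worst_knuth);
  return 0;
}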

Also, is it worth combining both sin & cos into a single routine (for either version)? Would that give more options for compiler / scheduler to do a better job?
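
On the last question, combining them might look like the sketch below: one shared z = x*x feeding two independent Horner chains that the compiler or scheduler can interleave. Whether that actually schedules better is exactly what would need benchmarking; sincos2 and kernel_sincos are names made up for this sketch.

Code:
/* Combined kernel sketch: share z = x*x and expose two independent
 * dependency chains (sin and cos) for the scheduler to interleave.
 * Coefficients are the Sun minimax values from the code above. */
typedef struct { double s, c; } sincos2;

sincos2 kernel_sincos(double x)
{
  const double
  S1 = -1.66666666666666324348e-01, S2 =  8.33333333332248946124e-03,
  S3 = -1.98412698298579493134e-04, S4 =  2.75573137070700676789e-06,
  S5 = -2.50507602534068634195e-08, S6 =  1.58969099521155010221e-10,
  C1 =  4.16666666666666019037e-02, C2 = -1.38888888888741095749e-03,
  C3 =  2.48015872894767294178e-05, C4 = -2.75573143513906633035e-07,
  C5 =  2.08757232129817482790e-09, C6 = -1.13596475577881948265e-11;
  double z = x * x;
  double v = z * x;
  double rs = S2 + z*(S3 + z*(S4 + z*(S5 + z*S6)));
  double rc = z*(C1 + z*(C2 + z*(C3 + z*(C4 + z*(C5 + z*C6)))));
  sincos2 out;
  out.s = x + v*(S1 + z*rs);
  out.c = 1.0 - (0.5*z - z*rc);
  return out;
}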