mersenneforum.org Algorithm wanted - sine and/or cosine
2020-01-12, 05:47   #23
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2⁴×3²×53 Posts Quote:
 Originally Posted by axn Does the modified code sequence actually result in any reduction in time or is it DOA? How does one go about doing error analysis? In theory, the code is computing the same polynomial as the previous one, with just fewer multiplications. In practice, I don't know how much additional error (if any) that the new code introduces. I suppose, one could empirically compare all the interesting sin/cos values of all three (native / poly / Knuth). Also, is it worth combining both sin & cos into a single routine (for either version)? Would that give more options for compiler / scheduler to do a better job?
I did not try your code. If I counted right, the op count is unchanged. A mul became an add.

I've not yet run across code that produces both sin and cos at a reduced cost.   2020-01-12, 06:24   #24
CRGreathouse

Aug 2006

3×1,993 Posts Quote:
 Originally Posted by Prime95 I've not yet run across code that produces both sin and cos at a reduced cost.
Perhaps I don't understand, but...
https://pari.math.u-bordeaux.fr/cgi-...022d739b#l3498   2020-01-12, 19:36   #25
ewmayer
∂²ω=0

Sep 2002
República de California

26612₈ Posts Quote:
 Originally Posted by axn How does one go about doing error analysis? In theory, the code is computing the same polynomial as the previous one, with just fewer multiplications. In practice, I don't know how much additional error (if any) that the new code introduces. I suppose, one could empirically compare all the interesting sin/cos values of all three (native / poly / Knuth).
One could, say, loop over a bunch of evenly-spaced points in the interval of interest, use the various formulae to compute sincos in DP, compute errors using expensive extended-precision sincos as a reference.   2020-01-12, 20:50   #26
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1110111010000₂ Posts Quote:
 Originally Posted by CRGreathouse Perhaps I don't understand, but... https://pari.math.u-bordeaux.fr/cgi-...022d739b#l3498
Looks like that code does a sqrt to get the sin value from the cos value. On my GPU that is more costly than the polynomial calculation.

2020-01-16, 21:42 #27
ewmayer
∂²ω=0

Sep 2002
República de California

26612₈ Posts

I've been playing with the use of a Chebyshev-polynomial basis expansion for approximating e.g. sin and cos - this gives Padé-like accuracy but is much more computationally convenient. Longer mathematical background (starting with the basics of approximation in Hilbert spaces via orthogonal polynomials) here, but the upshot is: to generate an approximation of a given polynomial order n, we simply need to evaluate our target function f(x) at the n zeros of the first neglected Chebyshev basis function T_n(x) (which are given by -cos((k+1/2)*Pi/n) for k = 0,...,n-1) and multiply-accumulate f(x) at each such x by the values of the n lower-order bases T_0,...,T_{n-1} there. For target functions which are even or odd we need only the even- or odd-order T_j, respectively, and need only do the above multiply-accumulate over the zeros of T_n lying in the half-interval [0,+1].

Here are the results for f(x) = sin(x), using approximations up to the same order as in the Sun code George posted. I did all the computations in quad-float, which is why the errors can be smaller than double-precision epsilon. Unsurprisingly, we get better approximational accuracy for smaller domains, due to the closer-to-linear sin(x) snippets those contain.

MaxErr for Chebyshev approximation to sin(y), y in [-Pi/4,+Pi/4], for odd-order basis set of degree n.
y = x*Pi/4:
 n =  1: maxerr = 1.076412114012678e-02 at x = 0.406840175630695
 n =  3: maxerr = 2.211828604797517e-04 at x = 0.744423941213003
 n =  5: maxerr = 5.841972262455021e-07 at x = 0.536379036312490
 n =  7: maxerr = 1.719787772866942e-09 at x = 0.717635580655843
 n =  9: maxerr = 2.461209496415740e-12 at x = 0.785398163397448
 n = 11: maxerr = 2.382614064390771e-15 at x = 0.711909242646512
 n = 13: maxerr = 1.932929161164053e-18 at x = 0.784331592691555

MaxErr for Chebyshev approximation to sin(y), y in [-1,+1], for odd-order basis set of degree n.
y = x:
 n =  1: maxerr = 7.725438505767193e-02 at x = 1.000000000000000
 n =  3: maxerr = 9.930425666556403e-04 at x = 1.000000000000000
 n =  5: maxerr = 5.988354112364584e-06 at x = 1.000000000000000
 n =  7: maxerr = 2.094915738264925e-08 at x = 1.000000000000000
 n =  9: maxerr = 4.784333699825444e-11 at x = 1.000000000000000
 n = 11: maxerr = 7.693285387103897e-14 at x = 1.000000000000000
 n = 13: maxerr = 9.181670816793705e-17 at x = 1.000000000000000

MaxErr for Chebyshev approximation to sin(y), y in [-Pi/2,+Pi/2], for odd-order basis set of degree n.
y = x*Pi/2:
 n =  1: maxerr = 8.092495638969945e-02 at x = 0.402502421574251
 n =  3: maxerr = 6.605511643131922e-03 at x = 0.742805235598241
 n =  5: maxerr = 7.140718504766345e-05 at x = 0.535902299627308
 n =  7: maxerr = 8.439303067887709e-07 at x = 0.717458080670915
 n =  9: maxerr = 4.857121351026179e-09 at x = 0.785398163397448
 n = 11: maxerr = 1.892336918249797e-11 at x = 0.711866831145689
 n = 13: maxerr = 6.158693885456859e-14 at x = 0.784310386941143

I'll be happy to post the expansion coefficients if desired - they appear more or less the same as those in the Sun code, at least when we consider the same x-domain. Unsurprisingly, they converge to the Taylor-series coefficients as we take n larger; the key lies in the differences between the two sets of expansion coefficients (Chebyshev vs. Taylor) at any given order n.
The nice thing is that, with the sin(x) case coded up and debugged, it's easy to do this for more or less any desired target function, and we have some control over the polynomial order - for example, for sin(y) with y in [-Pi/4,+Pi/4] we could probably use n = 11 rather than n = 13 when working in double precision.

2020-01-17, 02:06 #28
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2⁴×3²×53 Posts

Good work, Ernst. When you've polished up the code, could you upload it for all to enjoy?

2020-01-17, 19:24   #29
ewmayer
∂²ω=0

Sep 2002
República de California

2D8A₁₆ Posts

Bzipped code attached. Note that in the meanwhile I got a PM from Dr Sardonicus about mapping a target interval [0,pi/2] to [-pi/4,+pi/4]. The code George posted in #14 of the thread was for sin and cos, each over [-pi/4,+pi/4], so those are the domains I targeted. At the bottom of the code snippet George has the "We use the following trig identities to convert our [0, pi/2] range to [-pi/4, pi/4] range" bit, which uses the identities:

cos(x + pi/4) = cos(x) * SQRTHALF - sin(x) * SQRTHALF
sin(x + pi/4) = sin(x) * SQRTHALF + cos(x) * SQRTHALF

Those are computationally efficient: in an FFT context we always need both sin and cos, so after mapping [0,pi/2] to [-pi/4,pi/4] and computing sin(x) and cos(x) on the latter interval, the mapping-back needs just one add and one mul for each of sin(x + pi/4) and cos(x + pi/4). But DrS notes that if we desire, e.g., cos(x) for x close to pi/2 (i.e. a near-zero result), then using the first formula above we have x near pi/4, cos(x) nearly equal to sin(x), and we end up subtracting two nearly equal numbers, a known accuracy-killer. That needs further discussion. To avoid this we could use the identity cos(x) = sin(pi/2-x) for arguments in [pi/4,pi/2], but how does one do this sort of thing without the resulting conditionals killing the performance?

==================

To build, drop the unzipped qfcheb.c source file into the Mlucas /src directory, and keep this set of object files away from any resulting from a normal Mlucas build, to prevent collision of multiple main() functions on linking. Compile this subset of files:

gcc -c -O3 get_cpuid.c get_fp_rnd_const.c imul_macro.c mi64.c qfcheb.c qfloat.c rng_isaac.c twopmodq.c types.c util.c
gcc -o qfcheb *.o -lm

Note the code spends nearly all of its time in the max-error-checking loops, which compute the abserr at 1 million evenly-spaced points in the upper half of the Chebyshev interval, x in [0,1] ... the computation of the approximational coefficients is more or less instantaneous, relatively speaking.

I've added minimal sanity checking on the input order n ... for dumb reasons (a legacy of my initial sine-only implementation), n must be even. Taking n > 12 or so takes the precision beyond that of IEEE64 doubles, but you'll see rapidly diminishing returns beyond n ~= 20, when the approximational accuracy approaches that of my custom qfloat type, which is implemented as a pair of uint64s: the .hi one maps bitwise to the IEEE64 double format, hidden bit included, and the .lo one adds 64 more significand bits. If you wanted to use the code to craft custom polys beyond IEEE64 precision, you'd need to fiddle the coefficient-output prints to, say, print hex-integer-form bitfields of the desired length and with the correct bitlengths for the exponent and significand of the target type. E.g. if I wanted outputs in x86 register-double format, I'd craft 80-bit versions of the qfdbl_as_uint64 and qfdbl functions in qfloat.c, which properly map the 128-bit qfloat format to x86 reg-double format.
Attached Files qfcheb.c.bz2 (3.6 KB, 214 views)

Last fiddled with by ewmayer on 2020-01-19 at 19:32 Reason: Fixed missing-reinit-of-maxerr-prior-to-sine-error-computation bug   2020-01-18, 02:01   #30
Dr Sardonicus

Feb 2017
Nowhere

3×1,657 Posts Quote:
 Originally Posted by ewmayer The nice thing is that, with the sin(x) case coded up and debugged, it's easy to do this for more or less any desired target function, and we have some control over the polynomial order - for example for sin(y) with y in [-Pi/4,+Pi/4] we could probably use n = 11 rather than n = 13 when working in double precision.
In the present case, the argument is guaranteed to be in [0, pi/2]. Obviously, you can obtain an argument in [0, pi/4] by replacing the argument x with pi/2 - x if x is greater than pi/4. However, that would appear to require an IF statement; and possibly another would be needed to take care of the reversal of the roles of sine or cosine if you do replace x with pi/2 - x. That may more than offset the reduction in the number of terms required, I don't know. I suppose the subtraction needed for the reduction might also introduce a small error in the "reduced" argument.

Two other ways that are fine in theory are

(1) shifting the interval to [-pi/4, pi/4] by subtracting pi/4 from the argument, compute the sine and cosine for the shifted argument, then use the addition formulas to recover the original sine and cosine. However, this is numerically very bad if the given argument is near either 0 or pi/2. (The near-zero sine or cosine is obtained by subtracting two nearly equal values.)

(2) Divide the argument by 2, then use the double-angle formulas to recover the original sine and cosine. This is numerically bad if the given angle is near pi/2.   2020-01-18, 16:06   #31
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10101011000101₂ Posts Quote:
 Originally Posted by Dr Sardonicus In the present case, the argument is guaranteed to be in [0, pi/2]. Obviously, you can obtain an argument in [0, pi/4] by replacing the argument x with pi/2 - x if x is greater than pi/4. However, that would appear to require an IF statement;
Try pulling the standard trick for constant-time arithmetic, i.e. compute (x-pi/4) and mask all but the sign bit:

/* Assume earlier code has declared x to be double and given a value in [0, pi/2] */

#define P4 (3.14159265358979323846 / 4.0)
union du {double d; uint64_t u;} mask;

mask.d = x - P4;                   /* the sign of (x - pi/4) lands in the top bit */
mask.u &= 0x8000000000000000ULL;   /* mask all but the sign bit */
mask.u |= 0x3FF0000000000000ULL;   /* OR in the bit pattern of +1.0 */

/* Now mask.d is -1.0 or +1.0 according to the result of the subtraction */

If I didn't make any typos the rather disgusting type-punning code above should perform the range change without any conditional instruction. A union type isn't strictly necessary as all the punning can be done with casts at the cost of even more proof-reading of your code. Hint: declare mask to be of type double and perform bit-twiddling on *(uint64_t *) &mask. I would expect the compiler to generate the same instruction stream, but check this claim!

Please check the above very carefully!

Last fiddled with by xilman on 2020-01-18 at 16:31 Reason: Convert variable "double p4" to a constant macro P4   2020-01-18, 23:26 #32
ewmayer
∂²ω=0

Sep 2002
República de California

2·3·29·67 Posts

@Paul: I consider type-punning aliasing a necessary evil, e.g. my quad-float code relies on it. There's also the "when you need to be 100% sure that the double-const you are using is translated faithfully to IEEE64 bit format, with no funny business done by the compiler to the low bits" issue that may arise with decimal-format doubles.

The code I posted above does both cos(x) and sin(x) for x in [-pi/4,pi/4], and outputs the resulting approximation-polynomial coefficients in both decimal-float and uint64-bitfield format, e.g. for a 12th-order approximation:

Code:
MacBook:obj_qfcheb ewmayer$ ./qfcheb -n 12
Using n = 12
Computing Chebyshev-approximation to Cos(y) for |y| <= 0.7853981633974483, using first 12 T_j(x):
Chebyshev basis function coefficients, as double and [exact uint64 bitfield]:
c[ 0] =  8.516319137048081e-01 [0x3feb40919232f208]
c[ 2] = -1.464366443908369e-01 [0xbfc2be6f9b4c1cb6]
c[ 4] =  1.921449311814647e-03 [0x3f5f7b247d216ef5]
c[ 6] = -9.964968489829300e-06 [0xbee4e5e6d7331ba2]
c[ 8] =  2.757659560715710e-08 [0x3e5d9c3266cd5a3e]
c[10] = -4.739945098459938e-11 [0xbdca0ee132c60c1f]
Raw polynomial coefficients, as double and [exact uint64 bitfield]:
d[ 0] =  9.999999999999444e-01 [0x3feffffffffffe0b]
d[ 2] = -4.999999999935111e-01 [0xbfdffffffffe3763]
d[ 4] =  4.166666654391119e-02 [0x3fa5555554476425]
d[ 6] = -1.388888039377642e-03 [0xbf56c16b2d3e61f5]
d[ 8] =  2.479892875532367e-05 [0x3efa00e9685da9c4]
d[10] = -2.717342984102282e-07 [0xbe923c5c15441a00]
maxerr = 5.558974014151506e-14 at x = 0.000000000000000

Computing Chebyshev-approximation to Sin(y) for |y| <= 0.7853981633974483, using first 12 T_j(x):
Chebyshev basis function coefficients, as double and [exact uint64 bitfield]:
c[ 1] =  7.263756766937347e-01 [0x3fe73e7834004d7f]
c[ 3] = -1.942002905320151e-02 [0xbf93e2d816aef8a1]
c[ 5] =  1.516929228510740e-04 [0x3f23e1f8490d5ad3]
c[ 7] = -5.605804684120018e-07 [0xbea2cf597c94d8d6]
c[ 9] =  1.205324169088147e-09 [0x3e14b5125ea41a49]
c[11] = -1.695817118027107e-12 [0xbd7dd54805f3f706]
Raw polynomial coefficients, as double and [exact uint64 bitfield]:
d[ 1] =  9.999999999999957e-01 [0x3fefffffffffffd9]
d[ 3] = -1.666666666661669e-01 [0xbfc5555555550efe]
d[ 5] =  8.333333323878150e-03 [0x3f81111110bde5e3]
d[ 7] = -1.984126329839051e-04 [0xbf2a019f8a207d3f]
d[ 9] =  2.755527189338982e-06 [0x3ec71d7317b8ee33]
d[11] = -2.475655877198039e-08 [0xbe5a9507f3711e2d]
maxerr = 2.382614064390771e-15 at x = 0.711909242646512

Last fiddled with by ewmayer on 2020-01-19 at 19:33   2020-01-19, 01:40 #33
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

2AC5₁₆ Posts

I did make at least 2 typos, but the compiler would have picked them up instantly. The corrected line of code is

union du {double d; uint64_t u;} mask;

Note curly brackets, not round ones.
