mersenneforum.org Algorithm wanted - sine and/or cosine

2020-01-19, 02:40   #34
Dr Sardonicus

Feb 2017
Nowhere

3×1,657 Posts

Quote:
 Originally Posted by xilman Try pulling the standard trick for constant time arithmetic. I.e. compute (x-pi/4) and mask all but the sign bit. [...] Please check the above very carefully!
I am not qualified to assess the specific instructions you're using.

But assuming they can be made to work, then they do force the argument into the interval [0, pi/4] without using IF statements.

And, you can also use the value of mask.d to interchange the sine and cosine values if it is -1.0 (i.e. if you changed the argument from x to pi/2 - x).

So whether the effort of reducing the argument is repaid by the (possible) reduction in operations for computing the sine and cosine on the smaller interval may be worth investigating...
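Not knowing xilman's exact instruction sequence, here is a minimal portable sketch of the trick as described (compute x - pi/4, mask all but the sign bit, then bit-select between x and pi/2 - x). The memcpy-based type-punning and all names here are my own choices, not code from the thread:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

const double kPi = 3.14159265358979323846;

struct Reduced {
    double r;      // reduced argument in [0, pi/4]
    bool swapped;  // true => caller must swap the computed sin and cos
};

// Map x in [0, pi/2] to [0, pi/4] with no IF: extract the sign bit of
// x - pi/4, turn it into an all-ones/all-zeros mask, and bit-select
// between x and pi/2 - x.
Reduced reduce(double x) {
    double d = x - 0.25 * kPi;
    uint64_t db;
    std::memcpy(&db, &d, sizeof db);   // read the bits without aliasing UB
    uint64_t mask = (db >> 63) - 1;    // all-ones if x >= pi/4, else zero
    double alt = 0.5 * kPi - x;
    uint64_t xb, ab;
    std::memcpy(&xb, &x, sizeof xb);
    std::memcpy(&ab, &alt, sizeof ab);
    uint64_t rb = (xb & ~mask) | (ab & mask);
    double r;
    std::memcpy(&r, &rb, sizeof r);
    return { r, mask != 0 };
}
```

In real code the `swapped` flag would itself be kept as a mask (e.g. the -1.0 value Dr Sardonicus mentions) so the sin/cos interchange can also be done without a branch.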

2020-01-19, 02:49   #35
ewmayer
∂2ω=0

Sep 2002
República de California

2×3×29×67 Posts

I forgot to add, C++ defines reinterpret_cast<pointer> as part of the language standard, thus in a sense enshrining the necessary-ugliness of type-punning:
Quote:
 reinterpret_cast converts any pointer type to any other pointer type, even of unrelated classes. The operation result is a simple binary copy of the value from one pointer to the other. All pointer conversions are allowed: neither the content pointed to nor the pointer type itself is checked. It can also cast pointers to or from integer types. The format in which this integer value represents a pointer is platform-specific. The only guarantee is that a pointer cast to an integer type large enough to fully contain it (such as intptr_t) is guaranteed to be able to be cast back to a valid pointer. The conversions that can be performed by reinterpret_cast but not by static_cast are low-level operations based on reinterpreting the binary representations of the types, which in most cases results in code which is system-specific, and thus non-portable. For example:
Code:
class A { /* ... */ };
class B { /* ... */ };
A * a = new A;
B * b = reinterpret_cast<B*>(a);
This code compiles, although it does not make much sense, since now b points to an object of a totally unrelated and likely incompatible class. Dereferencing b is unsafe.
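The quoted passage concerns pointer punning; for merely reading a double's bits (e.g. its sign bit), dereferencing a reinterpret_cast'ed pointer to an unrelated type is undefined behavior under strict aliasing, whereas memcpy is sanctioned and compiles to the same single move on mainstream compilers. A small sketch, with names of my own choosing:

```cpp
#include <cstdint>
#include <cstring>

// Extract the raw IEEE-754 bit pattern of a double.  Unlike
// *reinterpret_cast<uint64_t*>(&x), this does not violate strict aliasing.
uint64_t bits_of(double x) {
    uint64_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// Sign bit: 1 for negative (including -0.0), 0 otherwise.
int sign_bit(double x) { return static_cast<int>(bits_of(x) >> 63); }
```

C++20 adds std::bit_cast, which expresses the same operation directly.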

2020-01-19, 14:04   #36
Dr Sardonicus

Feb 2017
Nowhere

3×1,657 Posts

Hmm. "Platform specific" makes the "extract the sign bit" approach sound unpromising. I hereby invoke "Green's Law of Debate," that anything is possible if you don't know what you're talking about.

From earlier posts, it seems that the sine/cosine calculation is merely part of a larger task. It is not clear to me whether the entire task is being performed on the GPU. So (not knowing what I'm talking about), I wonder whether, if not, perhaps any initial "argument fiddling" could be done before the sine/cosine calculation is sent to the GPU.

Another question that's been sitting in the back of my mind is, how much can actually be said about the fraction a/b for which x = pi*a/b. All that has been specified so far is that a and b are positive 32-bit integers, and that 0 < 2*a < b. Heck, I don't even know whether the fraction a/b is in lowest terms! If anything else is known, it might help.

Assuming these are twos-complement signed integers, b could be as large as 2^31 - 1. If there is a significantly smaller upper bound on the size of b, that might help. Also, if anything is known about just how close a/b can be (or can not be) either to 0 or 1/2, that might help. If b > 2^30 and is odd, and 4*a > b, the fraction 1/2 - a/b = (b - 2*a)/(2*b) can not be represented exactly as the quotient of two 32-bit positive integers.

I note in this regard that, if you compute x = pi*a/b as a real number (double precision or whatever), and a/b is extremely close to 1/2, the subtraction pi/2 - x for the "reduced" argument will be numerically unfortunate. However, (b-2*a)/b is an exact fraction. Computing pi*(b-2*a)/b and then dividing by 2 might give a better result.
2020-01-19, 19:16   #37
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10,949 Posts

Quote:
 Originally Posted by Dr Sardonicus Hmm. "Platform specific" makes the "extract the sign bit" approach sound unpromising. I hereby invoke "Green's Law of Debate," that anything is possible if you don't know what you're talking about.
Exactly the same idea can be used on any architecture. The bit masks may be different but platform-specific compilation is absolutely standard procedure.

Strange as it may seem, I do know what I am talking about. One of my first real jobs was to implement an IEEE floating point instruction set, single and double precision, in AMD-2900 series microcode. This was back in the days when if you wanted to do FP arithmetic on the x86 architecture you either used a software library or installed an 8087 co-processor.

Every numerical format, fixed or floating, of which I am currently aware allows the sign bit (assuming that negative numbers can be represented at all) to be extracted by suitable bit-wise operations on the word(s) storing the number to be tested.

Anyway, we know that in practice IEEE format is ubiquitous.

Last fiddled with by xilman on 2020-01-19 at 19:26

2020-01-19, 19:47   #38
ewmayer
∂2ω=0

Sep 2002
República de California

2×3×29×67 Posts

Found small bug in code I posted in #29 - when I added the second computation (now both cos and sin approximations get computed) I forgot to reinit maxerr = 0 prior to part 2, which in the code is now the sin(x) computation. Thus if the maxerr for cos(x) happened to be greater than that for sin(x), both computations would report the former cos(x) approximation error value. Fixed code reattached to #29, and maxerr value for the n = 12 sin(x) approximation fixed in #32 - now we see that for n = 12, the cos(x) approximation has a whopping 20x the error level of the sin(x) one:

cos: maxerr = 5.558974014151506e-14 at x = 0.000000000000000
sin: maxerr = 2.382614064390771e-15 at x = 0.711909242646512

The discrepancy is likely due to the fact that, in the context of the code, "-n [whatever]" means "consider T_0 through T_n-1", i.e. we only consider polynomials of order < n. Thus -n 12 for cos(x) means x^0,x^2,...,x^10, whereas for sin(x) we consider x,x^3,...,x^11, i.e. a 1-larger polynomial order in the approximation. When we go to -n 14 and thus add x^12 to the mix for cos(x), the resulting maxerr drops to 4.715767759805488e-17.

A related issue: When using the quad-float-computed polynomial coefficients rounded to doubles in a DP computation of sincos data, which yields better accuracy: leaving the coefficients in terms of the ones on the Chebyshev bases [1,x,2x^2-1,4x^3-3x,...], or the ones on the raw powers of x? (I could crunch the numbers myself, of course, but in true egregious-pedant fashion leave it as a small exercise for the interested reader. :)

It may be moot in a practical sense, since the purpose of the exercise is to compute sincos to a desired accuracy in the fewest cycles, and leaving the coefficients in terms of the ones on the Chebyshev bases adds compute cost, in that to compute T_j+1(x) we need to up-multiply using the standard 3-term recurrence T_j+1(x) = 2*x*T_j(x) - T_j-1(x), and moving between bases in either the sin or cos case would need 2 such up-muls due to the odd and even parity of the functions/approximations in question.

Last fiddled with by ewmayer on 2020-01-19 at 20:00
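The three-term "up-multiply" recurrence ewmayer mentions can be sanity-checked against the closed form T_n(x) = cos(n*acos(x)) for |x| <= 1. A minimal sketch (function name is mine):

```cpp
#include <cmath>
#include <vector>

// Evaluate T_0(x) .. T_n(x) via the standard 3-term recurrence
//   T_{j+1}(x) = 2*x*T_j(x) - T_{j-1}(x).
// Each additional basis polynomial costs one multiply by 2x plus one
// subtract beyond the previous one -- the extra compute cost of staying
// on the Chebyshev basis.
std::vector<double> chebyshev_T(double x, int n) {
    std::vector<double> T(n + 1);
    T[0] = 1.0;
    if (n >= 1) T[1] = x;
    for (int j = 1; j < n; ++j)
        T[j + 1] = 2.0 * x * T[j] - T[j - 1];
    return T;
}
```

(In practice one would use Clenshaw's algorithm to sum a Chebyshev series without materializing the T_j, but the per-term cost is comparable.)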
2020-01-19, 22:06   #39
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10,949 Posts

Quote:
 Originally Posted by xilman Try pulling the standard trick for constant time arithmetic. I.e. compute (x-pi/4) and mask all but the sign bit
I forgot to mention that this sort of code is frequently faster than the obvious conditional code, even on processors with short pipelines and a small number of threads such as recent x86 machines. On GPUs it is even more important to avoid loops which take a variable length of time to execute; otherwise you tend to find that all the threads, of which there may be thousands, have to wait until the slowest completes. Trading conditionals for extra straight-line code can be a very big win indeed.

Again, I do know of what I speak. I have written cryptanalysis code in CUDA for which speed was the second most important of only two criteria. The primary criterion, of course, was correctness. Code size, both source and binary, elegance and everything else were completely irrelevant.
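A toy illustration of the compute-both-then-select pattern (none of this is xilman's actual CUDA code; all names are mine). Both candidate results are computed unconditionally and one is chosen by bit-masking on a sign bit, so every thread in a warp executes the identical instruction sequence:

```cpp
#include <cstdint>
#include <cstring>

// Straight-line select: no branch, so no divergence on a GPU.
// Note: -0.0 counts as negative here, since only the sign bit is examined.
double select_by_sign(double t, double if_neg, double if_nonneg) {
    uint64_t tb, nb, pb;
    std::memcpy(&tb, &t, sizeof tb);
    std::memcpy(&nb, &if_neg, sizeof nb);
    std::memcpy(&pb, &if_nonneg, sizeof pb);
    uint64_t mask = uint64_t(0) - (tb >> 63);  // all-ones when t is negative
    uint64_t rb = (nb & mask) | (pb & ~mask);
    double r;
    std::memcpy(&r, &rb, sizeof r);
    return r;
}
```

On GPUs a plain ternary often compiles to a predicated select instruction anyway; writing the mask out by hand just makes the uniform cost explicit.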

2020-07-15, 05:55   #40
njuffa

Jul 2020
2^2 Posts

In case you guys are still looking for a solution, the math function you would want to use in CUDA is sincospi(). For source code of a sincospi() implementation with a 1 ulp error bound, take a look at the code I posted to Stackoverflow in honor of Pi Day 2017: https://stackoverflow.com/questions/...c-math-library

The posted implementation requires FMA (fused multiply add).
2020-07-15, 15:00   #41
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2^4×3^2×53 Posts

Quote:
 Originally Posted by njuffa In case you guys are still looking for a solution, the math function you would want to use in CUDA is sincospi(). For source code of a sincospi() implementation with a 1 ulp error bound, take a look at the code I posted to Stackoverflow in honor of Pi Day 2017: https://stackoverflow.com/questions/...c-math-library The posted implementation requires FMA (fused multiply add).

After lots of study, my final solution was based on your post. It differs in that I need sincos (2*pi*k/n) where n is known in advance and is highly composite. The advance knowledge of n lets us choose different coefficients for even better accuracy (at cost of one extra multiply).

Thanks for your post - it was quite valuable!

2020-07-15, 17:53   #42
njuffa

Jul 2020
48 Posts

Do you actually perform a floating-point division as part of the computation? If so, it may be possible to accelerate that as well if the divisor is an odd integer: Efficient floating-point division with constant integer divisors
2020-07-15, 20:14   #43
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2^4·3^2·53 Posts

No division is done. An example for the case where n (the FFT length) is known at compile-time to be 5*2^some_power.
Code:
double ksinpi(double k, const double n)
{
	const double multiplier = 935.0,
	S0 = 3.359992142876784201685022306409177725743E-3,
	S1 = -6.322131648214566240979995473508278740838E-9,
	S2 = 3.568700182412300594922117535549453349134E-15,
	S3 = -9.592621220743218859188828943012321657143E-22,
	S4 = 1.504115654636920482511097706291165478487E-28,
	S5 = -1.543622286925744278152254064494080552013E-35,
	S6 = 1.105734195160535436277498021504043747290E-42;
	double x = k * (multiplier / n);
	double z = x * x;
	double r = fma(fma(fma(fma(fma(S6, z, S5), z, S4), z, S3), z, S2), z, S1) * (z * x);
	return forced_fma_by_const(x, S0, r);
}
k is an integer between 0 and n/4. The multiplier 935 was chosen such that S0 and the corresponding cosine C0 have very little round-off error when converted to a double. Note that 935 / n, done at compile-time, can be represented exactly as a double.

This is the routine I use when I need the most accurate result. There is another implementation (basically yours) that I use when I can be a little less accurate -- no 935 multiplier.
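Prime95's routine computes sin(pi*k/n): with x = 935*k/n, S0 is approximately pi/935 and the remaining coefficients are the (minimax-adjusted) odd series terms. The harness below is my own guess at a runnable check: it substitutes std::fma for the forced_fma_by_const helper (whose definition is not shown in the thread), picks n = 5*2^17, and the asserted error bound is mine:

```cpp
#include <cmath>

// Sketch of Prime95's ksinpi, with std::fma standing in for the
// forced_fma_by_const helper that is not shown in the thread.
double ksinpi(double k, const double n) {
    const double multiplier = 935.0,
        S0 =  3.359992142876784201685022306409177725743E-3,
        S1 = -6.322131648214566240979995473508278740838E-9,
        S2 =  3.568700182412300594922117535549453349134E-15,
        S3 = -9.592621220743218859188828943012321657143E-22,
        S4 =  1.504115654636920482511097706291165478487E-28,
        S5 = -1.543622286925744278152254064494080552013E-35,
        S6 =  1.105734195160535436277498021504043747290E-42;
    double x = k * (multiplier / n);   // 935/n is exact when n = 5*2^m
    double z = x * x;
    double r = std::fma(std::fma(std::fma(std::fma(std::fma(
                   S6, z, S5), z, S4), z, S3), z, S2), z, S1) * (z * x);
    return std::fma(x, S0, r);         // stand-in for forced_fma_by_const
}
```

For n = 655360 (= 5*2^17) and k in [0, n/4] this agrees with the library sin(pi*k/n) to well below 1e-12 absolute error.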
2020-07-16, 07:42   #44
njuffa

Jul 2020
2^2 Posts

Great example of how one can achieve a smidgen more performance and/or accuracy due to the known restrictions arising from a special case. Thanks for posting.

