2020-01-19, 02:40   #34
Dr Sardonicus

Quote:
Originally Posted by xilman
Try pulling the standard trick for constant time arithmetic. I.e. compute (x-pi/4) and mask all but the sign bit:
<snip>
Please check the above very carefully!
I am not qualified to assess the specific instructions you're using.

But assuming they can be made to work, they do force the argument into the interval [0, pi/4] without using IF statements.

And you can also use the value of mask.d to interchange the sine and cosine values if it is -1.0 (i.e. if you changed the argument from x to pi/2 - x).

So whether the effort to reduce the argument is worth the (possible) reduction in operations for computing the sine and cosine on the smaller interval may be worth investigating...
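A minimal, untested sketch of the branchless reduction being discussed (C++, assuming IEEE-754 doubles and an arithmetic right shift on signed integers; all names here are illustrative, not code from this thread): smear the sign bit of x - pi/4 into an all-ones/all-zeros mask, then use the mask both to select between x and pi/2 - x and to flag whether sin and cos must be swapped.
Code:
#include <cstdint>
#include <cstring>

// Branchless reduction of x in [0, pi/2] to [0, pi/4].
// If x > pi/4 we evaluate at pi/2 - x instead and must swap sin/cos.
void reduce_arg(double x, double* xr, std::uint64_t* swap_mask)
{
    const double PIO4 = 0.78539816339744830962;
    const double PIO2 = 1.57079632679489661923;
    double d = x - PIO4;
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);               // type-pun without UB
    // All-ones if d < 0 (i.e. x < pi/4), else all-zeros.
    std::uint64_t m = (std::uint64_t)((std::int64_t)bits >> 63);
    double alt = PIO2 - x;
    std::uint64_t xb, ab;
    std::memcpy(&xb, &x,   sizeof xb);
    std::memcpy(&ab, &alt, sizeof ab);
    std::uint64_t rb = (xb & m) | (ab & ~m);           // mask-based select
    std::memcpy(xr, &rb, sizeof rb);
    *swap_mask = ~m;                                   // nonzero => swap sin/cos
}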
2020-01-19, 02:49   #35
ewmayer

I forgot to add, C++ defines reinterpret_cast<pointer> as part of the language standard, thus in a sense enshrining the necessary ugliness of type-punning:
Quote:
reinterpret_cast converts any pointer type to any other pointer type, even of unrelated classes. The operation result is a simple binary copy of the value from one pointer to the other. All pointer conversions are allowed: neither the content pointed nor the pointer type itself is checked. It can also cast pointers to or from integer types.

The format in which this integer value represents a pointer is platform-specific. The only guarantee is that a pointer cast to an integer type large enough to fully contain it (such as intptr_t), is guaranteed to be able to be cast back to a valid pointer.

The conversions that can be performed by reinterpret_cast but not by static_cast are low-level operations based on reinterpreting the binary representations of the types, which in most cases results in code which is system-specific, and thus non-portable. For example:
Code:
class A { /* ... */ };
class B { /* ... */ };
A * a = new A;
B * b = reinterpret_cast<B*>(a);
This code compiles, although it does not make much sense, since now b points to an object of a totally unrelated and likely incompatible class. Dereferencing b is unsafe.
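Worth noting in connection with the sign-bit trick upthread: pointer punning of a double's bits (e.g. *(uint64_t*)&d) violates C++'s strict-aliasing rules even where it happens to work; a memcpy (or C++20's std::bit_cast) expresses the same reinterpretation without undefined behavior. A minimal sketch, assuming IEEE-754 doubles:
Code:
#include <cstdint>
#include <cstring>

// Extract the sign bit of a double by reinterpreting its bits.
// memcpy-based punning is well-defined; compilers optimize it away.
inline std::uint64_t sign_bit(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits >> 63;   // 1 if negative, 0 otherwise (IEEE-754 layout)
}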
2020-01-19, 14:04   #36
Dr Sardonicus

Hmm. "Platform specific" makes the "extract the sign bit" approach sound unpromising.

I hereby invoke "Green's Law of Debate," that anything is possible if you don't know what you're talking about.

From earlier posts, it seems that the sine/cosine calculation is merely part of a larger task. It is not clear to me whether the entire task is being performed on the GPU. So (not knowing what I'm talking about) I wonder whether, if it is not, any initial "argument fiddling" could be done before the sine/cosine calculation is sent to the GPU.

Another question that's been sitting in the back of my mind is how much can actually be said about the fraction a/b for which x = pi*a/b. All that has been specified so far is that a and b are positive 32-bit integers, and that 0 < 2*a < b. Heck, I don't even know whether the fraction a/b is in lowest terms! If anything else is known, it might help. Assuming these are twos-complement signed integers, b could be as large as 2^31 - 1.

If there is a significantly smaller upper bound on the size of b, that might help.

Also, if anything is known about just how close a/b can be (or cannot be) either to 0 or 1/2, that might help. If b > 2^30 and is odd, and 4*a > b, the fraction

1/2 - a/b = (b - 2*a)/(2*b)

cannot be represented exactly as the quotient of two 32-bit positive integers. I note in this regard that, if you compute x = pi*a/b as a real number (double precision or whatever), and a/b is extremely close to 1/2, the subtraction pi/2 - x for the "reduced" argument will be numerically unfortunate. However, (b-2*a)/b is an exact fraction. Computing pi*(b-2*a)/b and then dividing by 2 might give a better result.
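To put numbers on that, here is a small illustrative sketch (the values of a and b are hypothetical, chosen so that a/b is as close to 1/2 as 32-bit integers allow) comparing the subtractive form pi/2 - x with the exact-numerator form pi*(b-2*a)/b divided by 2:
Code:
#include <cstdio>

// Compare two ways of forming the reduced argument pi/2 - pi*a/b
// when a/b is extremely close to 1/2 (here a = 2^30 - 1, b = 2^31 - 1).
int main()
{
    const double PI = 3.14159265358979323846;
    unsigned a = 1073741823u;   // 2^30 - 1
    unsigned b = 2147483647u;   // 2^31 - 1, odd
    double x  = PI * ((double)a / (double)b);
    double r1 = PI / 2 - x;                          // subtractive: heavy cancellation
    double r2 = PI * ((double)(b - 2 * a) / b) / 2;  // numerator b - 2*a is exact
    std::printf("r1 = %.17e\nr2 = %.17e\n", r1, r2);
    return 0;
}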
2020-01-19, 19:16   #37
xilman

Quote:
Originally Posted by Dr Sardonicus
Hmm. "Platform specific" makes the "extract the sign bit" approach sound unpromising.

I hereby invoke "Green's Law of Debate," that anything is possible if you don't know what you're talking about.
Exactly the same idea can be used on any architecture. The bit masks may be different but platform-specific compilation is absolutely standard procedure.

Strange as it may seem, I do know what I am talking about. One of my first real jobs was to implement an IEEE floating-point instruction set, single and double precision, in AMD-2900 series microcode. This was back in the days when, if you wanted to do FP arithmetic on the x86 architecture, you either used a software library or installed an 8087 co-processor.

Every single numerical format, fixed or floating, of which I am aware ATM can have the sign bit (assuming that negative numbers can be represented at all) extracted by suitable bit-wise operations on the word(s) storing the number to be tested.

Anyway, we know that in practice IEEE format is ubiquitous.

Last fiddled with by xilman on 2020-01-19 at 19:26
2020-01-19, 19:47   #38
ewmayer

Found a small bug in the code I posted in #29: when I added the second computation (now both cos and sin approximations get computed), I forgot to re-init maxerr = 0 prior to part 2, which in the code is now the sin(x) computation. Thus, if the maxerr for cos(x) happened to be greater than that for sin(x), both computations would report the cos(x) approximation's error value. Fixed code is reattached to #29, and the maxerr value for the n = 12 sin(x) approximation is fixed in #32. Now we see that for n = 12, the cos(x) approximation has a whopping 20x the error level of the sin(x) one:

cos: maxerr = 5.558974014151506e-14 at x = 0.000000000000000
sin: maxerr = 2.382614064390771e-15 at x = 0.711909242646512

The discrepancy is likely due to the fact that, in the context of the code, "-n [whatever]" means "consider T_0 through T_(n-1)", i.e. we only consider polynomials of order < n. Thus -n 12 for cos(x) means x^0, x^2, ..., x^10, whereas for sin(x) we consider x, x^3, ..., x^11, i.e. we have a 1-larger polynomial order in the approximation. When we go to -n 14 and thus add x^12 to the mix for cos(x), the resulting maxerr drops to 4.715767759805488e-17.

A related issue: when using the quad-float-computed polynomial coefficients rounded to doubles in a DP computation of sincos data, which yields better accuracy: leaving the coefficients in terms of the ones on the Chebyshev bases [1, x, 2x^2-1, 4x^3-3x, ...], or the ones on the raw powers of x? (I could crunch the numbers myself, of course, but in true egregious-pedant fashion leave it as a small exercise for the interested reader. :)

It may be moot in a practical sense, since the purpose of the exercise is to compute sincos to a desired accuracy in the fewest cycles, and leaving the coefficients in terms of the ones on the Chebyshev bases adds compute cost: to compute T_(j+1)(x) we need to up-multiply using the standard 3-term recurrence T_(j+1)(x) = 2*x*T_j(x) - T_(j-1)(x), and to move between bases in either the sin or cos case would need 2 such up-muls due to the odd and even parity of the functions/approximations in question.
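For reference, a minimal sketch (not the thread's actual code) of evaluating a length-n coefficient vector directly on the Chebyshev basis via that 3-term recurrence; each term costs one extra multiply-add relative to a Horner evaluation of equivalent monomial coefficients, which is the compute cost referred to above:
Code:
#include <cstddef>

// Evaluate sum_{j=0}^{n-1} c[j]*T_j(x) with the three-term recurrence
// T_{j+1}(x) = 2*x*T_j(x) - T_{j-1}(x).
double cheb_eval(const double* c, std::size_t n, double x)
{
    if (n == 0) return 0.0;
    double tprev = 1.0;           // T_0(x)
    double t     = x;             // T_1(x)
    double sum   = c[0] * tprev;
    for (std::size_t j = 1; j < n; ++j) {
        sum += c[j] * t;
        double tnext = 2.0 * x * t - tprev;   // up-multiply
        tprev = t;
        t = tnext;
    }
    return sum;
}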

Last fiddled with by ewmayer on 2020-01-19 at 20:00
2020-01-19, 22:06   #39
xilman

Quote:
Originally Posted by xilman
Try pulling the standard trick for constant time arithmetic. I.e. compute (x-pi/4) and mask all but the sign bit
I forgot to mention that this sort of code is frequently faster than the obvious conditional code, even on processors with short pipelines and a small number of threads, such as recent x86 machines. On GPUs it is even more important to avoid loops which take a variable length of time to execute, or you tend to find that all the threads, of which there may be thousands, have to wait until the slowest completes. Trading conditionals for extra straight-line code can be a very big win indeed.
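A sketch of the general pattern (plain C++ here, but the same idea applies inside a CUDA kernel): compute both alternatives unconditionally and blend them through a mask, so every thread in a warp executes the same instruction stream:
Code:
// Branch-free select: both values are computed, then blended via a
// 0.0/1.0 mask, so no divergent branch is needed.
inline double select_branchless(bool cond, double a, double b)
{
    double m = cond ? 1.0 : 0.0;   // typically compiles to a predicated move
    return m * a + (1.0 - m) * b;
}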

Again, I do know of what I speak. I have written cryptanalysis code in CUDA for which speed was the second most important of only two criteria. The primary criterion, of course, was correctness. Code size (both source and binary), elegance, and everything else were completely irrelevant.
2020-07-15, 05:55   #40
njuffa

In case you guys are still looking for a solution, the math function you would want to use in CUDA is sincospi().

For source code of a sincospi() implementation with a 1 ulp error bound, take a look at the code I posted to Stackoverflow in honor of Pi Day 2017:

https://stackoverflow.com/questions/...c-math-library

The posted implementation requires FMA (fused multiply add).
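For anyone unfamiliar with it: sincospi(x, &s, &c) computes sin(pi*x) and cos(pi*x), the point being that the scaling by pi happens inside the function, after exact argument reduction. A naive host-side stand-in showing only the semantics (not the accuracy advantage):
Code:
#include <cmath>

// Host-side stand-in for CUDA's sincospi(): computes sin(pi*x) and
// cos(pi*x). A real implementation reduces x exactly before scaling
// by pi, which this naive version does not.
inline void sincospi_naive(double x, double* s, double* c)
{
    const double PI = 3.14159265358979323846;
    *s = std::sin(PI * x);
    *c = std::cos(PI * x);
}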
2020-07-15, 15:00   #41
Prime95

Quote:
Originally Posted by njuffa
In case you guys are still looking for a solution, the math function you would want to use in CUDA is sincospi().

For source code of a sincospi() implementation with a 1 ulp error bound, take a look at the code I posted to Stackoverflow in honor of Pi Day 2017:

https://stackoverflow.com/questions/...c-math-library

The posted implementation requires FMA (fused multiply add).

After lots of study, my final solution was based on your post. It differs in that I need sincos(2*pi*k/n) where n is known in advance and is highly composite. The advance knowledge of n lets us choose different coefficients for even better accuracy (at the cost of one extra multiply).

Thanks for your post - it was quite valuable!
2020-07-15, 17:53   #42
njuffa

Do you actually perform a floating-point division as part of the computation? If so, it may be possible to accelerate that as well if the divisor is an odd integer:

Efficient floating-point division with constant integer divisors
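The gist of that technique, as a hedged sketch rather than the exact scheme of the linked paper: replace x / d by multiplication with a precomputed reciprocal, then clean up the rounding error with an FMA-based correction step:
Code:
#include <cmath>

// Approximate x / d (d a known constant) as multiply + FMA correction:
// q = x*rcp; e = x - d*q (exact via FMA); q' = q + e*rcp.
inline double div_by_const(double x, double d, double rcp /* precomputed 1/d */)
{
    double q = x * rcp;             // initial quotient estimate
    double e = std::fma(-d, q, x);  // residual, computed exactly by the FMA
    return std::fma(rcp, e, q);     // corrected quotient
}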
2020-07-15, 20:14   #43
Prime95

No division is done. Here is an example for the case where n (the FFT length) is known at compile time to be 5*2^some_power.

Code:
double ksinpi(double k, const double n) {
  // Coefficients of an odd polynomial in x = 935*k/n approximating
  // sin(pi*k/n); note S0 is approximately pi/935.
  const double multiplier = 935.0,
  S0 = 3.359992142876784201685022306409177725743E-3,
  S1 = -6.322131648214566240979995473508278740838E-9,
  S2 = 3.568700182412300594922117535549453349134E-15,
  S3 = -9.592621220743218859188828943012321657143E-22,
  S4 = 1.504115654636920482511097706291165478487E-28,
  S5 = -1.543622286925744278152254064494080552013E-35,
  S6 = 1.105734195160535436277498021504043747290E-42;
  double x = k * (multiplier / n);  // multiplier / n is exact (see below)
  double z = x * x;
  // Horner evaluation of the higher-order terms, using FMAs throughout
  double r = fma(fma(fma(fma(fma(S6, z, S5), z, S4), z, S3), z, S2), z, S1) * (z * x);
  return forced_fma_by_const(x, S0, r);  // x*S0 + r, with the FMA forced
}
k is an integer between 0 and n/4.

The multiplier 935 was chosen such that S0 and the corresponding cosine C0 have very little round-off error when converted to a double. Note that 935 / n, done at compile-time, can be represented exactly as a double: with n = 5*2^some_power and 935 = 5*11*17, the quotient is 187/2^some_power, an integer over a power of two, which is exactly representable.

This is the routine I use when I need the most accurate result. There is another implementation (basically yours) that I use when I can be a little less accurate -- no 935 multiplier.
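As a sanity check, the routine can be compared against libm's sin() on the host. A throwaway sketch, with the coefficients copied from the post, forced_fma_by_const stubbed as a plain fma, and n = 5*2^20 chosen purely for illustration:
Code:
#include <cstdio>
#include <cmath>

// ksinpi() as posted above, with forced_fma_by_const stubbed as fma.
static double ksinpi(double k, double n)
{
    const double multiplier = 935.0,
    S0 = 3.359992142876784201685022306409177725743E-3,
    S1 = -6.322131648214566240979995473508278740838E-9,
    S2 = 3.568700182412300594922117535549453349134E-15,
    S3 = -9.592621220743218859188828943012321657143E-22,
    S4 = 1.504115654636920482511097706291165478487E-28,
    S5 = -1.543622286925744278152254064494080552013E-35,
    S6 = 1.105734195160535436277498021504043747290E-42;
    double x = k * (multiplier / n);
    double z = x * x;
    double r = std::fma(std::fma(std::fma(std::fma(std::fma(S6, z, S5), z, S4), z, S3), z, S2), z, S1) * (z * x);
    return std::fma(x, S0, r);
}

int main()
{
    const double PI = 3.14159265358979323846;
    const double n  = 5.0 * 1048576.0;   // 5 * 2^20, illustrative
    double maxerr = 0.0;
    for (double k = 0.0; k <= n / 4.0; k += 997.0)   // sparse sample of k
        maxerr = std::fmax(maxerr, std::fabs(ksinpi(k, n) - std::sin(PI * k / n)));
    std::printf("max abs err vs libm sin: %.3e\n", maxerr);
    return 0;
}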
2020-07-16, 07:42   #44
njuffa

Great example of how one can achieve a smidgen more performance and/or accuracy due to the known restrictions arising from a special case. Thanks for posting.