mersenneforum.org The future is 24bit

2020-09-19, 21:03   #12
preda

"Mihai Preda"
Apr 2015

1,301 Posts

Quote:
 Originally Posted by Prime95 Years ago I toyed with using two or three 32-bit ints to create a 64- or 96-bit float (no exponent bits -- all mantissa). I did enough work to prove to myself it was feasible and, at the time, would be about as fast as a double-precision FFT. As nVidia has lowered and lowered the DP-to-SP ratio, it would be a substantial winner now. An awful lot of code to write though.
Were you using a form of fixed-point to represent the "float without exponent" as ints?

 2020-09-19, 21:06 #13 preda     "Mihai Preda" Apr 2015 1,301 Posts My intuition about using SP in the classic FFT way is that the twiddles are the weak point: represented as SP, the twiddles (the trigonometric constants used in the FFT) do not have enough precision for the FFT sizes needed. The solution is either to use higher-precision twiddles or to reduce the FFT size. Last fiddled with by preda on 2020-09-19 at 21:07
2020-09-20, 04:37   #14
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

11²·59 Posts

Quote:
 Originally Posted by preda Were you using a form of fixed-point to represent the "float without exponent" as ints?

How many ops does it take to do double, triple, and quad precision SP? At a 64-to-1 ratio one of these options ought to pay off.

2020-09-20, 10:39   #15
preda

"Mihai Preda"
Apr 2015

10100010101₂ Posts

Quote:
 Originally Posted by Prime95 How many ops does it take to do double, triple, and quad precision SP? At a 64-to-1 ratio one of these options ought to pay off.
Here's an article I found on the topic; I still have to read it carefully: http://www.andrewthall.com/papers/df64_qf128.pdf

2020-09-20, 22:38   #16
preda

"Mihai Preda"
Apr 2015

1,301 Posts

Quote:
 Originally Posted by preda Here's an article I found on the topic, I still have to read it carefully http://www.andrewthall.com/papers/df64_qf128.pdf
My takeaway from the above paper is:

double-SP multiplication is fast when FMA is available:
we represent a value "x" by a pair of SP (a,b) such that x=a+b, and "a" much larger than "b".
Then multiplication becomes:

(a,b) * (c,d) = (a*c, b*d + fma(a, c, -a*c))
(the term fma(a, c, -a*c) captures the "error" in the multiplication a*c).

OTOH the double-SP addition is not so fast. But if we assume that the two values (a,b) and (c,d) are of similar magnitude, we can approximate the addition as:

(a,b) + (c,d) = (a+c, b+d + (c - ((a+c) - a)))

Last fiddled with by preda on 2020-09-20 at 23:04

2020-09-20, 23:07   #17
preda

"Mihai Preda"
Apr 2015

1,301 Posts

Quote:
 Originally Posted by preda (a,b) * (c,d) = (a*c, b*d + fma(a, c, -a*c)) (a,b) + (c,d) = (a+c, b+d + (c - ((a+c) - a)))
If the above is correct, it means that a double-MUL is 3xMUL+1xADD and a double-ADD is 4xADD, which is rather efficient (though I don't know how bad the double-ADD approximation is).

 2020-09-20, 23:13 #18 preda     "Mihai Preda" Apr 2015 1,301 Posts About the "twiddles": they can be computed in double-SP this way. The hardware (GPU) provides very fast but poor-accuracy SP sin/cos. Precompute an SP table with the difference between the "ideal" sin/cos and the hardware sin/cos. Then the double-SP sin/cos takes its two components from the HW trig and the table delta.
 2020-09-21, 13:28 #19 tServo     "Marv" May 2009 near the Tannhäuser Gate 2⁴·3·11 Posts

Ever since I exchanged a few e-mails with George a while back, I have been working on this from time to time. Here is what I have discovered, but remember these are my opinions and could be wrong. Until I see some definitive benchmarks of RTX 3080 FP32 performance, I will remain skeptical; it seems somewhat like marketing hype and smoke and mirrors. My assumption so far has been a 1:32 ratio, and I will stick with that until I see proof otherwise.

In the code I've written and tested so far I have found two major things:

(1) You must be very careful with rounding. I'm talking about the 3 bits or so beyond the end of the result register. Your results will zoom off into the ozone if you aren't. One of the first tests I did was adding a couple of million numbers, and I found out instantly that my rounding had a subtle bug.

(2) I don't see how this can be done without low-level coding like PTX or SASS on CUDA. Memory accesses are the key to GPU performance, and the slightest bit of sloppy code can doom your performance. In one instance I moved a memory read and my code was suddenly twice as fast! Because of this, I believe everything must be done in registers without touching memory, except to load the input values and store the final result. Since I see no way to incorporate PTX modules into OpenCL on Nvidia hardware, that means GPUOWL is off the table AFAIK.

Remember, these are my opinions, so if you feel strongly otherwise, please keep on testing. I hope to have a low-level benchmark (timing routines, not whole programs) in a month or so.
2020-09-21, 21:09   #20
xx005fs

"Eric"
Jan 2018
USA

2⁴×13 Posts

Quote:
 Originally Posted by tServo Ever since I exchanged a few e-mails with George a while back, I have been working on this from time to time. Here is what I have discovered, but remember these are my opinions and could be wrong. Until I see some definitive benchmarks of RTX 3080 FP32 performance, I will remain skeptical; it seems somewhat like marketing hype and smoke and mirrors. My assumption so far has been a 1:32 ratio, and I will stick with that until I see proof otherwise.

There are already AIDA64 benchmarks on the 3080 out in the wild, such as https://www.overclockersclub.com/rev..._edition/4.htm
The single precision is 31,523 GFLOPS and the double precision is 536.6 GFLOPS.

Last fiddled with by xx005fs on 2020-09-21 at 21:10

2020-09-21, 23:34   #21
tServo

"Marv"
May 2009
near the Tannhäuser Gate

2⁴·3·11 Posts

Quote:
 Originally Posted by xx005fs There are already AIDA64 benchmarks on the 3080 out in the wild such as https://www.overclockersclub.com/rev..._edition/4.htm The single precision is 31523GFLOPs and double precision is 536.6GFLOPs
Thank you for the info.
My opinion on the other items remains though since even that 64:1 won't cure the memory access issue.

2020-09-21, 23:49   #22
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

11²·59 Posts

Quote:
 Originally Posted by tServo My opinion on the other items remains though since even that 64:1 won't cure the memory access issue.
Right now we are getting about 19 bits per double (19/64 = 29.7% efficiency).
If we use two floats to emulate a 48-bit double, we'll get about 14 bits per 64-bit pair (21.9% efficiency), so our memory access requirements go up by about 36%.

Now, implementing triple or quad precision is a different story (38/96 = 39.6% or 62/128 = 48.4%), representing a significant reduction in memory accesses. However, emulation costs go up, as does register pressure. Coding and benchmarking are required to see which, if any, is better.
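The arithmetic in the two posts above, as a throwaway check (the bits-per-word figures are the posts' own estimates; the function names are mine):

```c
/* Fraction of stored bits that carry FFT residue data. */
static double efficiency(double usable_bits, double stored_bits) {
    return usable_bits / stored_bits;
}

/* Memory traffic relative to the DP baseline of 19 usable bits per 64. */
static double traffic_vs_dp(double usable_bits, double stored_bits) {
    return efficiency(19.0, 64.0) / efficiency(usable_bits, stored_bits);
}
```

efficiency(14, 64), efficiency(38, 96) and efficiency(62, 128) reproduce the 21.9%, 39.6% and 48.4% figures, and traffic_vs_dp(14, 64) gives the ~1.36x (36% more) memory traffic of the two-float option.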

