#13 | "Mihai Preda" | Apr 2015

My intuition about why SP doesn't work in the classic FFT way is: the twiddles have too little precision.

The twiddles (the trigonometric constants used in the FFT), if represented as SP, do not have enough precision for the FFT sizes we need. The solution is either to use higher-precision twiddles or to reduce the FFT size.

Last fiddled with by preda on 2020-09-19 at 21:07
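To make the gap concrete, here is a minimal sketch (the FFT size and the sampled indices are my assumptions, not from the thread) measuring how much accuracy a twiddle loses just by being stored as SP:

```c
/* Hypothetical illustration: precision lost when an "ideal" DP twiddle
   is stored as SP. N and the sampling are assumptions, not thread data. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double PI = 3.141592653589793;
    const double N  = 33554432.0;            /* 32M, a large GIMPS-scale FFT */
    double worst = 0.0;
    for (int k = 1; k <= 1000; k++) {
        double angle = 2.0 * PI * k / N;
        double ideal = cos(angle);           /* "ideal" DP twiddle */
        float  sp    = (float)ideal;         /* the same twiddle kept in SP */
        double err   = fabs((double)sp - ideal);
        if (err > worst) worst = err;
    }
    printf("worst SP-storage error: %.3g\n", worst);  /* ~3e-8: the 24-bit ceiling */
    return 0;
}
```

With a 24-bit significand the stored twiddle is only good to about one part in 2^24, and that error feeds into every butterfly, which is why a larger FFT needs either better twiddles or a smaller transform.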
#14 | P90 years forever! | Aug 2002 | Yeehaw, FL

How many ops does it take to do double-, triple-, and quad-precision arithmetic out of SP? At a 64-to-1 SP:DP throughput ratio, one of these options ought to pay off.
#16 | "Mihai Preda" | Apr 2015

Double-SP multiplication is fast when FMA is available. We represent a value x by a pair of SP values (a, b) such that x = a + b, with "a" much larger in magnitude than "b". Multiplication then becomes:

(a, b) * (c, d) = (a*c, b*d + fma(a, c, -a*c))

(the term fma(a, c, -a*c) captures the rounding error of the product a*c).

OTOH, double-SP addition is not so fast. But if we assume the two values (a, b) and (c, d) are of similar magnitude, we can approximate the addition as:

(a, b) + (c, d) = (a+c, b+d + (c - ((a+c) - a)))

Last fiddled with by preda on 2020-09-20 at 23:04
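Spelled out as code, a minimal sketch in plain C with fmaf() (the type and function names are mine, not gpuowl's):

```c
#include <math.h>
#include <stdio.h>

typedef struct { float a, b; } f2;   /* represents a + b, with |a| >> |b| */

/* (a,b) * (c,d) = (a*c, b*d + fma(a, c, -a*c)) */
static f2 mul2(f2 x, f2 y) {
    float hi = x.a * y.a;
    float lo = x.b * y.b + fmaf(x.a, y.a, -hi);  /* fmaf recovers the rounding error of a*c */
    return (f2){hi, lo};
}

/* (a,b) + (c,d) ~ (a+c, b+d + (c - ((a+c) - a))) */
static f2 add2(f2 x, f2 y) {
    float hi = x.a + y.a;
    float lo = x.b + y.b + (y.a - (hi - x.a));   /* Fast2Sum-style error term, no branch */
    return (f2){hi, lo};
}

int main(void) {
    f2 x = {1.0f, 1e-8f}, y = {3.0f, -2e-8f};
    f2 p = mul2(x, y), s = add2(x, y);
    printf("mul: %g + %g, add: %g + %g\n", p.a, p.b, s.a, s.b);
    return 0;
}
```

The addition's correction term is the Fast2Sum error term computed without the usual magnitude test; skipping that test (and ignoring the rounding of b+d) is what makes this an approximation rather than an exact pair sum.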
#17 | "Mihai Preda" | Apr 2015

If the above is correct, it means a double-MUL is 3xMUL + 1xADD (counting the FMA as a MUL), and a double-ADD is 5xADD in total (the high sum a+c, plus four more for the low word), which is rather efficient. (But I don't know how bad the double-ADD approximation is.)
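How bad the approximation is can be checked empirically. A throwaway harness (entirely my own, with an assumed operand distribution matching the "similar magnitude" condition from #16) that compares the approximate pair add against an exact double sum:

```c
/* Hypothetical test harness for the approximate double-SP add from #16. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { float a, b; } f2;

static f2 add2(f2 x, f2 y) {                 /* the approximate add from #16 */
    float hi = x.a + y.a;
    float lo = x.b + y.b + (y.a - (hi - x.a));
    return (f2){hi, lo};
}

static f2 split(double v) {                  /* double -> (hi, lo) SP pair */
    float hi = (float)v;
    return (f2){hi, (float)(v - (double)hi)};
}

int main(void) {
    double worst = 0.0;
    srand(1);
    for (int i = 0; i < 1000000; i++) {
        /* operands of similar magnitude, per the assumption in #16 */
        double u = 1.0 + rand() / (double)RAND_MAX;
        double v = 1.0 + rand() / (double)RAND_MAX;
        f2 s = add2(split(u), split(v));
        double got = (double)s.a + (double)s.b;
        double err = fabs(got - (u + v)) / (u + v);
        if (err > worst) worst = err;
    }
    printf("worst relative error: %.3g\n", worst);
    return 0;
}
```

For same-exponent operands the recovered error term is exact and the pair keeps well over 40 bits; the approximation degrades when the second operand dominates in magnitude, which is where a full (branching or more expensive) two-sum would be needed.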
#18 | "Mihai Preda" | Apr 2015

About the "twiddles": they can be computed in double-SP this way. The hardware (GPU) provides very fast but poor-accuracy SP sin/cos. Precompute an SP table holding the difference between the "ideal" sin/cos and the hardware sin/cos. Then the double-SP sin/cos takes its two elements from the HW trig value and the table delta.
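A minimal CPU-side sketch of that idea (here the C library's sinf() stands in for the GPU's fast hardware sin, e.g. CUDA's __sinf() or OpenCL's native_sin(); the table size and names are my assumptions):

```c
#include <math.h>
#include <stdio.h>

#define N 4096                      /* twiddle count for this illustration */
static const double PI = 3.141592653589793;
static float delta[N];              /* SP table: ideal minus hardware value */

static void init_deltas(void) {
    for (int i = 0; i < N; i++) {
        double angle = 2.0 * PI * i / N;
        float hw = sinf((float)angle);                 /* stand-in for HW sin */
        delta[i] = (float)(sin(angle) - (double)hw);   /* DP ideal - HW value */
    }
}

/* Double-SP twiddle: high word from the fast HW sin, low word from the table. */
static void twiddle_sin(int i, float *hi, float *lo) {
    *hi = sinf((float)(2.0 * PI * i / N));
    *lo = delta[i];
}

int main(void) {
    init_deltas();
    float hi, lo;
    twiddle_sin(123, &hi, &lo);
    printf("double-SP: %.9g + %.3g vs ideal %.17g\n",
           hi, lo, sin(2.0 * PI * 123 / N));
    return 0;
}
```

Note this only works if the hardware sin is deterministic: the value it returns at run time must bit-match the one subtracted at table-build time, otherwise the stored delta corrects the wrong error.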
#19 | "Marv" | May 2009 | near the Tannhäuser Gate

Ever since I exchanged a few e-mails with George a while back, I have been working on this from time to time. Here is what I have discovered, but remember these are my opinions and could be wrong:

Until I see some definitive benchmarks of RTX 3080 FP32 performance, I will remain skeptical. It seems somewhat like marketing hype and smoke and mirrors. My assumption so far has been a 1:32 ratio, and I will stick with that until I see proof otherwise.

On the code I've written and tested so far, I have found two major things:

(1) You must be very careful with rounding. I'm talking about the 3 bits or so beyond the end of the result register. Your results will zoom off into the ozone if you aren't. One of the first tests I did was adding a couple of million numbers, and I found out instantly that my rounding had a subtle bug.

(2) I don't see how this can be done without low-level coding like PTX or SASS on CUDA. Memory accesses are the key to GPU performance, and the slightest bit of sloppy code can doom your performance. In one instance I moved a memory read and my code was suddenly twice as fast! Because of this, I believe everything must be done in registers, without touching memory except to load the input values and store the final result. Since I see no way to incorporate PTX modules into OpenCL on Nvidia hardware, that means GPUOWL is off the table AFAIK.

Remember, these are my opinions, so if you feel strongly otherwise, please keep on testing. I hope to have a low-level benchmark (timing routines, not whole programs) in a month or so.
#20 | "Eric" | Jan 2018 | USA

There are already AIDA64 benchmarks of the 3080 out in the wild, such as https://www.overclockersclub.com/rev..._edition/4.htm

Single precision comes in at 31523 GFLOPS and double precision at 536.6 GFLOPS, a ratio of roughly 59:1.

Last fiddled with by xx005fs on 2020-09-21 at 21:10
#21 | "Marv" | May 2009 | near the Tannhäuser Gate

My opinion on the other items remains, though, since even that 64:1 ratio won't cure the memory access issue.
#22 | P90 years forever! | Aug 2002 | Yeehaw, FL

If we use two floats to emulate a 48-bit double, we'll get about 14 bits per double (14/64 = 21.9% efficiency), so our memory access requirements go up by about 36%. Now implementing triple or quad precision is a different story (38/96 = 39.6% or 62/128 = 48.4%), representing a significant reduction in memory accesses. However, emulation costs go up, as does register pressure. Coding and benchmarking are required to see which, if any, is better.
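For the record, the arithmetic behind those percentages: efficiency is usable data bits divided by stored bits, and memory traffic scales inversely with it. A small sketch; the ~19 usable bits per native 64-bit double used as the baseline is my inference from the 36% figure, not something stated in the post:

```c
#include <stdio.h>

int main(void) {
    /* usable FFT-word bits vs. storage bits for each representation */
    struct { const char *name; double use_bits, store_bits; } fmt[] = {
        {"native double",        19.0,  64.0},   /* assumed baseline */
        {"double-SP (2 floats)", 14.0,  64.0},
        {"triple-SP (3 floats)", 38.0,  96.0},
        {"quad-SP (4 floats)",   62.0, 128.0},
    };
    double base = fmt[0].use_bits / fmt[0].store_bits;
    for (int i = 0; i < 4; i++) {
        double eff = fmt[i].use_bits / fmt[i].store_bits;
        printf("%-22s %5.1f%% efficiency, %4.2fx the memory traffic of native\n",
               fmt[i].name, 100.0 * eff, base / eff);
    }
    return 0;
}
```

Under that baseline, double-SP costs 1.36x the memory traffic (the 36% above), while triple-SP and quad-SP come in at roughly 0.75x and 0.61x, which is the "significant reduction" being weighed against higher emulation cost and register pressure.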