2020-09-19, 21:03  #12
"Mihai Preda"
Apr 2015
2×11×61 Posts 
Quote:


2020-09-19, 21:06  #13
"Mihai Preda"
Apr 2015
1342_{10} Posts 
My intuition about using SP in the classic FFT way is that the twiddles are too small, i.e. too imprecise.
The twiddles (the trigonometric values used in the FFT), if represented in SP, do not have enough precision for the FFT sizes needed. The solution is either to use higher-precision twiddles or to reduce the FFT size.
Last fiddled with by preda on 2020-09-19 at 21:07
2020-09-20, 04:37  #14
P90 years forever!
Aug 2002
Yeehaw, FL
2^{3}·919 Posts 
Quote:
How many ops does it take to do double-, triple-, and quad-precision SP? At a 64-to-1 ratio, one of these options ought to pay off.

2020-09-20, 10:39  #15
"Mihai Preda"
Apr 2015
53E_{16} Posts 
Quote:


2020-09-20, 22:38  #16
"Mihai Preda"
Apr 2015
2×11×61 Posts 
Quote:
Double-SP multiplication is fast when FMA is available: we represent a value "x" by a pair of SP values (a,b) such that x = a+b, with "a" much larger than "b". Then multiplication becomes:
(a,b) * (c,d) = (a*c, b*d + fma(a, c, -a*c))
(the term fma(a, c, -a*c) captures the "error" in the multiplication a*c).
OTOH the double-SP addition is not so fast. But if we assume that the two values (a,b) and (c,d) are of similar magnitude, we can approximate the addition as:
(a,b) + (c,d) = (a+c, b + d + (c - ((a+c) - a)))
Last fiddled with by preda on 2020-09-20 at 23:04

2020-09-20, 23:07  #17
"Mihai Preda"
Apr 2015
10100111110_{2} Posts 
If the above is correct, it means that a double-MUL is 3×MUL + 1×ADD, and a double-ADD is 4×ADD, which is rather efficient (but I don't know how bad the double-ADD approximation is).

2020-09-20, 23:13  #18
"Mihai Preda"
Apr 2015
2·11·61 Posts 
About the "twiddles": they can be computed in double-SP this way:
The hardware (GPU) provides a very fast but poor-accuracy SP sin/cos. Precompute an SP table with the difference between the "ideal" sin/cos and the hardware sin/cos. Then the double-SP sin/cos takes its two elements from the HW trig and the table delta.
2020-09-21, 13:28  #19
"Marv"
May 2009
near the Tannhäuser Gate
3×7×29 Posts 
Ever since I exchanged a few emails with George a while back, I have been working on this from time to time. Here is what I have discovered, but remember these are my opinions and could be wrong:
Until I see some definitive benchmarks of RTX 3080 FP32 performance, I will remain skeptical. It seems somewhat like marketing hype and smoke and mirrors. My assumption so far has been the 1:32 ratio, and I will stick with that until I see proof otherwise.
On the code I've written and tested so far, I have found out 2 major things:
(1) You must be very careful with rounding. I'm talking about the 3 bits or so that are beyond the end of the result register. Your results will zoom off into the ozone if you aren't. One of the first tests I did was adding a couple of million numbers, and I found out instantly that my rounding had a subtle bug.
(2) I don't see how this can be done without using low-level coding like PTX or SASS on CUDA. Memory accesses are the key to GPU performance, and the slightest bit of sloppy code can doom your performance. In one instance I moved a memory read and my code was suddenly twice as fast! Since this is the case, I believe everything must be done in registers without touching memory except to load the input values and store the final result. Since I see no way to incorporate PTX modules into OpenCL on Nvidia hardware, that means GPUOWL is off the table AFAIK.
Remember, these are my opinions, so if you feel strongly otherwise, please keep on testing. I hope to have a low-level (timing routines, not whole programs) benchmark in a month or so.
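For anyone wanting to try the "add a couple of million numbers" sanity check, here is a minimal sketch in C. It is not Marv's code, and the function names are made up. It compares a naive SP running sum with a running sum kept as an SP pair (hi, lo) and renormalized at every step. It assumes the compiler evaluates float expressions in SP and is not run with fast-math style options.

```c
#include <stddef.h>

/* Naive single-precision running sum. */
float sum_naive(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Running sum kept as an SP pair (hi, lo); returns hi + lo as a double. */
double sum_double_sp(const float *v, size_t n)
{
    float hi = 0.0f, lo = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float x = v[i];
        float s = hi + x;
        float bb = s - hi;
        /* TwoSum: e is the exact rounding error of hi + x. */
        float e = (hi - (s - bb)) + (x - bb);
        e += lo;               /* fold in the previous low word */
        hi = s + e;            /* renormalize so that lo stays small */
        lo = e - (hi - s);
    }
    return (double)hi + (double)lo;
}
```

Filling an array with two million copies of `0.1f` and comparing both results against `(double)0.1f * 2000000.0` shows the kind of drift being described: the naive sum ends up far from the reference, while the pair sum should track it closely.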
2020-09-21, 21:09  #20
"Eric"
Jan 2018
USA
2^{2}·53 Posts 
Quote:
There are already AIDA64 benchmarks of the 3080 out in the wild, such as https://www.overclockersclub.com/rev..._edition/4.htm
Single precision is 31523 GFLOPS and double precision is 536.6 GFLOPS.
Last fiddled with by xx005fs on 2020-09-21 at 21:10

2020-09-21, 23:34  #21
"Marv"
May 2009
near the Tannhäuser Gate
3×7×29 Posts 
Quote:
My opinion on the other items remains, though, since even that 64:1 won't cure the memory access issue.

2020-09-21, 23:49  #22
P90 years forever!
Aug 2002
Yeehaw, FL
2^{3}×919 Posts 
Quote:
If we use two floats to emulate a 48-bit double, we'll get about 14 bits per double (21.9% efficiency). So our memory access requirement goes up by 36%. Implementing triple or quad precision is a different story (38/96 = 39.6% or 62/128 = 48.4%), representing a significant reduction in memory accesses. However, emulation costs go up, as does register pressure. Coding and benchmarking are required to see which, if any, is better.
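The efficiency figures above are just useful bits divided by stored bits, which a few lines of C can reproduce. The function names are made up. The 36% figure needs a DP baseline that the post does not state; a baseline of about 19 useful bits per 64-bit double is my assumption, chosen because 19/14 is roughly 1.36.

```c
/* Fraction of stored bits that carry useful data in one FFT word. */
double storage_efficiency(double useful_bits, double storage_bits)
{
    return useful_bits / storage_bits;
}

/* Memory traffic per useful bit of a format, relative to a baseline format.
   A result of 1.36 means 36% more memory traffic than the baseline. */
double relative_memory_traffic(double baseline_efficiency, double format_efficiency)
{
    return baseline_efficiency / format_efficiency;
}
```

With these, `storage_efficiency(14, 64)` is 0.21875, `storage_efficiency(38, 96)` is about 0.396 and `storage_efficiency(62, 128)` is 0.484375, matching the percentages quoted. Under the assumed 19-bit DP baseline, `relative_memory_traffic(storage_efficiency(19, 64), storage_efficiency(14, 64))` comes out near 1.36.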
