mersenneforum.org The future is 24bit

 2020-09-22, 00:30 #23 firejuggler     Apr 2010 Over the rainbow 2537₁₀ Posts I don't know if it can help: https://link.springer.com/chapter/10.1007/978-3-642-28151-8_25 Implementation and Evaluation of Quadruple Precision BLAS Functions on GPUs
2020-09-22, 03:41   #24
preda

"Mihai Preda"
Apr 2015

1353₁₀ Posts

Quote:
 Originally Posted by Prime95 Right now we are getting about 19 bits per double (19/64 = 29.7% efficiency). If we use two floats to emulate a 48-bit double, we'll get about 14 bits per double (21.9% efficiency), so our memory access requirements go up by 36%. Now implementing triple or quad precision is a different story (38/96 = 39.6% or 62/128 = 48.4%), representing a significant reduction in memory accesses. However, emulation costs go up, as does register pressure. Coding and benchmarking are required to see which, if any, is better.
George, how did you work out the number of usable bits? (14, 38, 62)

I'm a bit surprised by the big jump from double-SP (14) to triple-SP (38); is that correct?

2020-09-22, 16:42   #25
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2·3,701 Posts

Quote:
 Originally Posted by preda George, how did you work out the number of usable bits? (14, 38, 62) Is that correct?
Correction (stupid human error):

53-bit doubles: (53-15)/2 = 19 bits (eff = 29.7%)

48-bit doubles: (48-15)/2 = 16.5 bits (eff = 25.8%)
72-bit triples: (72-15)/2 = 28.5 bits (eff = 29.7%)
96-bit quads: (96-15)/2 = 40.5 bits (eff = 31.6%)

 2020-09-22, 21:09 #26 preda     "Mihai Preda" Apr 2015 3·11·41 Posts I'm thinking of trying out an SP + DP implementation in gpuowl. If it achieves similar effective performance on Radeon VII, it should be a net gain on Nvidia. For the twiddles, I would invoke the HW sin/cos (SP) together with a precomputed table of DP "deltas". I'm thinking of representing the most significant bits as SP and the "tail" as DP; I think this pushes a few more operations to SP.
2020-09-26, 11:13   #28
preda

"Mihai Preda"
Apr 2015

3·11·41 Posts

Quote:
 Originally Posted by LaurV Not long ago we did some "study" and posted about using SP to implement DP.
Laur, thanks for the colourful post!

I don't think a GPU implementation is required in order to characterize a solution. It can be tested/demonstrated on the CPU. Afterwards the GPU will be able to run exactly the same FP computation as the CPU (if both use standard IEEE 754).

I don't understand exactly what you mean when you say that DP is needed for carry propagation.

 2020-10-28, 00:42 #29 preda     "Mihai Preda" Apr 2015 3·11·41 Posts I did some experiments with a single-SP FFT (the simplest case) to establish the baseline. A 2M convolution (1024x1024 SP pairs), not weighted, done at the best possible accuracy (perfect SP sin/cos etc.) can handle 5 bits per word. I think this result is consistent with past observations. A "by the ear" interpretation of the bit sizes:
SP: 25 bits
squaring the words: 2 × 5 = 10 bits
summing the convolution terms: 1/2 * log2(1M) = 10 bits
FFT errors: 5 bits
so 25 == 10 + 10 + 5.
 2020-10-28, 10:49 #30 preda     "Mihai Preda" Apr 2015 10101001001₂ Posts 2xSP initial experiments
The 1xSP 2M convolution code I mentioned previously is here:
https://github.com/preda/gpuowl/tree...2b32bd4bd0c/SP
https://github.com/preda/gpuowl/blob...4bd0c/SP/sp.cl
I added a 2xSP 2M convolution here:
https://github.com/preda/gpuowl/blob...0bbf8/SP/sp.cl
Please have a look if interested.
With 2xSP, in the same setup as the previous 1xSP (i.e. not-weighted convolution, 2M words), I see 15 bits/word as doable. This is in the ballpark, but I was expecting a bit more (16 bits?). OTOH I did "cheat" a bit on the trigonometric twiddles, using a technique I mentioned previously that combines the HW SP sin/cos (which are very fast but low-accuracy) with a precomputed table of SP "deltas from ideal". Thus a twiddle requires a single SP memory read (plus a HW sin/cos) vs. a 2xSP memory read for a fully precomputed table, at a slight loss of accuracy.
PS: I just added a precomputed 2xSP twiddles table, and that increases the bits/word from 15 to almost 16 (at 2M, no weighting).
Last fiddled with by preda on 2020-10-28 at 11:42 Reason: info on 2xSP twiddles
 2020-10-28, 12:18 #31 preda     "Mihai Preda" Apr 2015 3·11·41 Posts After a few accuracy fixes, the 2xSP experiment can do 17 bits/word, which is exactly where I was expecting it. OTOH the multiprecision ADD uses 20 SP ADDs!! Given that the FFT does lots of add-sub, that kind of cost inflation can't be good. (The multiprecision MUL, OTOH, is quite fast thanks to the lovely HW FMA.) Code:
// Assumes |a| >= |b|; cost: 3 ADD
float2 fastTwoSum(float a, float b) {
  float s = a + b;
  return U2(s, b - (s - a));
}

// cost: 6 ADD
float2 twoSum(float a, float b) {
  float s = a + b;
  float b1 = s - a;
  float a1 = s - b1;
  float e = (b - b1) + (a - a1);
  return U2(s, e);
}

// cost: 20 ADD !!
T sum(T a, T b) {
  T s = twoSum(a.x, b.x);
  T t = twoSum(a.y, b.y);
  s = fastTwoSum(s.x, s.y + t.x);
  s = fastTwoSum(s.x, s.y + t.y);
  return s;
}
 2020-10-28, 21:04 #32 Mark Rose     "/X\(‘-‘)/X\" Jan 2013 29×101 Posts Niall Emmart's dissertation A Study of High Performance Multiple Precision Arithmetic on Graphics Processing Units may be useful. Last fiddled with by Mark Rose on 2020-10-28 at 21:06
 2020-10-28, 22:06 #33 preda     "Mihai Preda" Apr 2015 3·11·41 Posts It seems the wavefront (100M+) could be handled with 2xSP at 6.5M FFT or *maybe* 6M FFT pushing it a bit. Sounds efficient enough to be worth a try.
