mersenneforum.org The future is 24bit
 User Name Remember Me? Password
 Register FAQ Search Today's Posts Mark Forums Read

 2020-09-22, 00:30 #23 firejuggler     Apr 2010 Over the rainbow 98716 Posts I don't know if it can help https://link.springer.com/chapter/10.1007/978-3-642-28151-8_25 Implementation and Evaluation of Quadruple Precision BLAS Functions on GPUs
2020-09-22, 03:41   #24
preda

"Mihai Preda"
Apr 2015

1,301 Posts

Quote:
 Originally Posted by Prime95 Right now we are getting about 19 bits per double (19/64 = 29.7% efficiency) If we use two floats to emulate us a 48-bit double, we'll get about 14 bits per double (21.9% efficiency). So our memory access requirements goes up by 36%. Now implementing triple or quad precision is a different story (38/96 = 39.6% or 62/128 = 48.4%) representing a significant reduction in memory accesses. However emulation costs go up, as does register pressure. Coding and benchmarking is required to see which, if any, is better.
George, how did you work out the number of usable bits? (14, 38, 62)

I'm a bit surprised by the big jump from double-SP (14) to triple-SP (38), is that correct?

2020-09-22, 16:42   #25
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

112×59 Posts

Quote:
 Originally Posted by preda George, how did you work out the number of usable bits? (14, 38, 62), is that correct?
Correction (stupid human error):

53-bit doubles: (53-15)/2 = 19 bits (eff = 29.7%)

48-bit doubles: (48-15)/2 = 16.5 bits (eff = 25.8%)
72-bit triples: (72-15)/2 = 28.5 bits (eff = 29.7%)
96-bit quads: (96-15)/2 = 40.5 bits (eff = 31.6%)

 2020-09-22, 21:09 #26 preda     "Mihai Preda" Apr 2015 1,301 Posts I'm thinking of trying out a SP + DP implementation in gpuowl. If that would achieve a similar effictive performance on Radeon VII, it should be a net gain on Nvidia. For the twiddles, I would implement them by invoking the HW sin/cos (SP), together with a precomputed table of DP "delta". I'm thinking of representing the most significant bits as SP, and the "tail" as DP, I think this pushes a bit more operations to SP.
 2020-09-23, 08:05 #27 LaurV Romulan Interpreter     Jun 2011 Thailand 19×467 Posts Not long ago we did some "study" and posted about using SP to implement DP. In fact, you can not represent one DP by two SP (like it's the case for integers**), due to how the mantissa and the exponent are arranged in a float. You need 3 SP to "emulate" one DP, and sometimes you will need 4 (carry propagation, anybody?). Now, at the time I was posting, the discussion was about cards having 1/16 and 1/32 DPSP ratios, and my argument was that any card with 1/8 or lower ratio would be futile for DP and the manufacturer would better invest in more silicone for SP. On the other hand, I never implemented a FFT multiplication, so I may be wrong, but I know the theory, I looked into other's implementations (like a cat into an orthodox calendar, as we Romanians use to say), and I work very intimate with the machine and close to the metal in my daily job ("low level" programmer, doing firmware for different electronic devices, mostly C and Assembler for different architectures, for over 30 years by now). I used floats a lot, and usually implement my own low-level things (like in the drivers and firmware), because all libraries are HUGE. Programmers in my area avoid using floats at all, for example, if we do a voltmeter supposed to measure millivolts, we will consider all the measurements are in micro- or nano-volts (large values are easy to achieve, by adding a lot of measurements, which also "stabilizes" the toy, by implicit averaging when you add more values), do all calculus in integers, and only take care to display the decimal dot in the right position on screen, but in the background, all things are done in integers. You don't have 3.5719 volts that you add, neither 3571.9 millivolts, you jsut have 3571900 microvolts. Deal with it! You do all calculus in integers and display it so, but don't forget to put the decimal point somewhere. But your firmware has no idea about floating point numbers. So, you use a integer library which takes only few kilobytes of your memory, instead of using a float library which take few tens of kilobytes. When all you have available is 16 or 32 kB, that is important, and it is also faster. This was just an example, and I am wandering from the subject... Anyhow, I played with floats at assembler level and I consider myself a good "algorithmist" too. You can take three SP floats and make an "animal" structure to emulate one DP float with it, for the most of the operations (like multiplication), you have a range a little bit larger than one DP can handle, and you can multiply such "animals" things each-other using karatsuba/toom-cook, in 5 multiplications plus some overhead. The school grade multiplications would take 9 (three "float digits" times three "float digits") but having less overhead. That would be a bit too much, but not when you talk about 1/16 or 1/32 DPSP ratios. Anyhow, say, you do a compromise, to avoid losing range, and use toom-cook only partial, and do some magic which would take the same time to multiply such thingies (including overhead), as it would take to do 7 SP multiplications, which is quite achievable, as you move "up" from the 5 muls of the cook (or.. is it?). In that case, your SP "library" will be faster than the "native DP", in ANY card that has a lower DP to SP ratio than 1 to 8. If the card has 1 to 16 or 1 to 32, or worse, then you could even afford to do school-grade multiplication, and still be faster with "tricky SP" than with "native DP". Therefore my rhetoric question at the time, why the manufacturers would invest in making "DP silicone" with worse than 1/8 ratios? With 1/4, 1/3, 1/2, is understandable, if you need a fast DP card to be used for "scientific" things (not gaming/graphics/physics). But 1 to 32? Dahack! that is wasted DP silicone. Let the card SP and make it faster... Additional, if my understanding of the FFT multiplication is right, the only place where you really need DP is carry propagation. Everywhere else, is just about how do you split your huge number in "float digits" (i.e. your base). There is a lot of place for improvement, and bringing a lot of "gaming cards" (i.e. very strong SP, almost zero DP) to us, i.e. to the LL/PRP market, would worth all the effort. Unfortunately, our experience in writing code for GPUs is quite limited (remember that cat and the calendar?) ---------- ** about that, I have a funny story, years ago I needed to implement a 48 bit counter on a 16-bit MCU which had no possibility to add or subtract with carry, you could add or subtract 8 or 16 bit, (and also multiply, but you only got the 16 least significant bits of the product, the most significant 16 were lost). Now, splitting the stuff in 8 bits and put them in 16 bits and add them, splitting them and re-arrange, etc, would have taken ages, and the counter wouldn't be fast enough. I spent a lot of time thinking how I could make it better than that, till suddenly I realized that if you add two "digits" (16 bits each, but it works in decimal too!), if you have a carry, the result is smaller than the value you add, and if not, it is bigger. Similar when you subtract. No matter what's in the register, if you add 7 to it, if you got a result 7 or larger than 7, then there was no carry. If there was a carry, your result would be smaller than 7. For example, 5+7, you get result 2, 8+7 you get 5, etc. If so, you just keep in mind and add 1 to the next "level" addition. So at the end I could make it with 3 16-bits additions, plus 3 tests and maximum 3 incs, haha. Much later I found out it can be done even faster than that. I know this may look silly for some of you, but for me at the time it was a big revelation, to find out that you actually don't need the "carry" flag in your CPU Last fiddled with by LaurV on 2020-09-23 at 08:34
2020-09-26, 11:13   #28
preda

"Mihai Preda"
Apr 2015

24258 Posts

Quote:
 Originally Posted by LaurV Not long ago we did some "study" and posted about using SP to implement DP.
Laur, thanks for the colourful post!

I don't think a GPU implementation is required in order to characterize a solution. It can be tested/demonstrated on CPU. Afterwards the GPU will be able to run exactly the same FP compution as the CPU (if both are running standard IEEE 754).

I don't understand exactly what you mean when you say that DP is needed for carry propagation.

 2020-10-28, 00:42 #29 preda     "Mihai Preda" Apr 2015 101000101012 Posts I did some experiments with single SP FFT (the simplest case), to establish the baseline. A 2M convolution (1024x1024 SP pairs), not-weighted, done in the best possible accuracy (perfect SP sin/cos etc.) can handle 5bits per word. I think this result is consistent with past observations. A "by the ear" interpretation of the bit size is: SP : 25 bits, squaring the words: 10bits summing the convolution terms: 1/2 * log2(1M) = 10bits FFT errors: 5 bits, 25 == 10 + 10 + 5.
 2020-10-28, 10:49 #30 preda     "Mihai Preda" Apr 2015 1,301 Posts 2xSP initial experiments The 1xSP 2M convolution code I mentioned previously is here: https://github.com/preda/gpuowl/tree...2b32bd4bd0c/SP https://github.com/preda/gpuowl/blob...4bd0c/SP/sp.cl I added a 2xSP 2M convolution here: https://github.com/preda/gpuowl/blob...0bbf8/SP/sp.cl Please have a look if interested. With 2xSP, in the same setup as the previous 1xSP (i.e. not-weighted convolution 2M words), I see as douable 15bits/word. This is in the ballpark, but I was expecting a bit higher (16bits?). OTOH I did "cheat" a bit on the trigonometric twiddles, using a technique I mentioned previously that combines the HW SP sin/cos (which are very fast but low-accuracy) with a precomputed table of SP "deltas from ideal". Thus a twiddle requires a single SP memory read (plus a HW sin/cos) vs. 2xSP memory read for a fully precomputed table, for a slight loss of accuracy. PS: I just added a precomputed 2xSP twiddles table, and that increases the bits/word from 15 to almost 16 (at 2M, no weighting) Last fiddled with by preda on 2020-10-28 at 11:42 Reason: info on 2xSP twiddles
 2020-10-28, 12:18 #31 preda     "Mihai Preda" Apr 2015 51516 Posts After a few accuracy fixes, the 2xSP experiment can do 17bits/word, which is exactly where I was expecting it. OTOH the multiprecision ADD uses 20 SP ADDs!! Given that in the FFT we do lots of add-sub, that kind of cost inflation can't be good.. (the multiprecision MUL OTOH is quite fast due to the lovely HW FMA) Code: // Assumes |a| >= |b|; cost: 3 ADD float2 fastTwoSum(float a, float b) { float s = a + b; return U2(s, b - (s - a)); } // cost: 6 ADD float2 twoSum(float a, float b) { float s = a + b; float b1 = s - a; float a1 = s - b1; float e = (b - b1) + (a - a1); return U2(s, e); } // cost: 20 ADD !! T sum(T a, T b) { T s = twoSum(a.x, b.x); T t = twoSum(a.y, b.y); s = fastTwoSum(s.x, s.y + t.x); s = fastTwoSum(s.x, s.y + t.y); return s; }
 2020-10-28, 21:04 #32 Mark Rose     "/X\(‘-‘)/X\" Jan 2013 22·719 Posts Niall Emmart's dissertation A Study of High Performance Multiple Precision Arithmetic on Graphics Processing Units may be useful. Last fiddled with by Mark Rose on 2020-10-28 at 21:06
 2020-10-28, 22:06 #33 preda     "Mihai Preda" Apr 2015 1,301 Posts It seems the wavefront (100M+) could be handled with 2xSP at 6.5M FFT or *maybe* 6M FFT pushing it a bit. Sounds efficient enough to be worth a try.

 Thread Tools

 Similar Threads Thread Thread Starter Forum Replies Last Post jasong jasong 1 2015-04-26 08:55 jasonp Msieve 23 2008-10-30 02:23 mfgoode Lounge 3 2006-11-18 23:43 JHansen NFSNET Discussion 15 2004-06-01 19:58 PrimeFun Lounge 21 2003-07-25 02:50

All times are UTC. The time now is 05:38.

Thu Oct 29 05:38:24 UTC 2020 up 49 days, 2:49, 1 user, load averages: 1.67, 1.65, 1.60

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.