#23
Apr 2010
Over the rainbow
2537₁₀ Posts
I don't know if this can help:
https://link.springer.com/chapter/10.1007/978-3-642-28151-8_25 "Implementation and Evaluation of Quadruple Precision BLAS Functions on GPUs"
#24
"Mihai Preda"
Apr 2015
1353₁₀ Posts
Quote:
I'm a bit surprised by the big jump from double-SP (14) to triple-SP (38), is that correct?
#25
P90 years forever!
Aug 2002
Yeehaw, FL
2·3,701 Posts
Quote:
53-bit doubles: (53-15)/2 = 19   bits (eff = 29.7%)
48-bit doubles: (48-15)/2 = 16.5 bits (eff = 25.8%)
72-bit triples: (72-15)/2 = 28.5 bits (eff = 29.7%)
96-bit quads:   (96-15)/2 = 40.5 bits (eff = 31.6%)
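For reference, the arithmetic above can be reproduced mechanically; a minimal C sketch (the 15-bit noise margin and the per-format storage sizes are taken from the post, the helper name is mine, and "eff" is usable bits per word divided by storage bits):

Code:
#include <stdio.h>

// bits/word = (mantissa bits - 15) / 2; efficiency relative to storage
static void effBits(const char *name, double mantissa, double storage) {
    double bits = (mantissa - 15) / 2;
    printf("%s: %.1f bits (eff = %.1f%%)\n", name, bits, 100 * bits / storage);
}

int main(void) {
    effBits("53-bit doubles", 53, 64);   // 19.0 bits, 29.7%
    effBits("48-bit doubles", 48, 64);   // 16.5 bits, 25.8%
    effBits("72-bit triples", 72, 96);   // 28.5 bits, 29.7%
    effBits("96-bit quads",   96, 128);  // 40.5 bits, 31.6%
    return 0;
}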
#26
"Mihai Preda"
Apr 2015
3·11·41 Posts
I'm thinking of trying out an SP + DP implementation in gpuowl. If that achieves similar effective performance on the Radeon VII, it should be a net gain on Nvidia.

For the twiddles, I would implement them by invoking the HW sin/cos (SP), together with a precomputed table of DP "deltas". I'm thinking of representing the most significant bits as SP and the "tail" as DP; I think this pushes a bit more of the operations to SP.
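A minimal OpenCL C sketch of such a twiddle scheme (names and layout are mine, not gpuowl's; shown here with an SP correction table, though the idea above proposes a DP tail):

Code:
// Sketch only: twiddle = fast HW sin/cos plus a precomputed correction.
// deltas[k] would hold the difference between the accurate sin/cos and
// the low-accuracy HW approximation, computed once on the host.
float2 twiddle(uint k, global const float2 *deltas, float angleStep) {
    float angle = k * angleStep;
    float2 t = (float2)(native_cos(angle), native_sin(angle));
    return t + deltas[k]; // correct the HW approximation
}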
#27
Romulan Interpreter
Jun 2011
Thailand
17×19×29 Posts
Not long ago we did some "study" and posted about using SP to implement DP. In fact, you cannot represent one DP with two SP (as is the case for integers**), due to how the mantissa and the exponent are arranged in a float. You need 3 SP to "emulate" one DP, and sometimes you will need 4 (carry propagation, anybody?).

Now, at the time I was posting, the discussion was about cards having 1/16 and 1/32 DP:SP ratios, and my argument was that any card with a 1/8 or lower ratio would be futile for DP, and the manufacturer would do better to invest in more silicon for SP.

On the other hand, I never implemented an FFT multiplication, so I may be wrong, but I know the theory, I have looked into others' implementations (like a cat into an Orthodox calendar, as we Romanians like to say), and I work very intimately with the machine, close to the metal, in my daily job ("low-level" programmer, doing firmware for different electronic devices, mostly C and assembler for different architectures, for over 30 years by now). I have used floats a lot, and I usually implement my own low-level routines (as in the drivers and firmware), because all the libraries are HUGE.

Programmers in my area avoid floats altogether. For example, if we build a voltmeter supposed to measure millivolts, we consider all measurements to be in micro- or nanovolts (large values are easy to reach by adding a lot of measurements, which also "stabilizes" the toy through the implicit averaging), we do all the calculus in integers, and we only take care to display the decimal dot in the right position on the screen. You don't have 3.5719 volts to add, nor 3571.9 millivolts; you just have 3571900 microvolts. Deal with it! You do all the calculus in integers and display it so, but don't forget to put the decimal point somewhere; your firmware has no idea that floating-point numbers exist. This way you use an integer library that takes only a few kilobytes of memory, instead of a float library that takes a few tens of kilobytes. When all you have available is 16 or 32 kB, that is important, and it is also faster. This was just an example, and I am wandering from the subject...

Anyhow, I have played with floats at the assembler level and I consider myself a good "algorithmist" too. You can take three SP floats and make an emulated DP. In that case, your SP "library" will be faster than the "native DP" on ANY card that has a DP to SP ratio lower than 1 to 8. If the card has 1 to 16, or 1 to 32, or worse, then you could even afford to do school-grade multiplication and still be faster with "tricky SP" than with "native DP". Hence my rhetorical question at the time: why would the manufacturers invest in making "DP silicon" with ratios worse than 1/8? With 1/4, 1/3, 1/2, it is understandable, if you need a fast DP card for "scientific" work (not gaming/graphics/physics). But 1 to 32? Dahack! That is wasted DP silicon. Leave the card SP-only and make it faster...

Additionally, if my understanding of FFT multiplication is right, the only place where you really need DP is carry propagation. Everywhere else, it is just a matter of how you split your huge number into "float digits" (i.e. your base). There is a lot of room for improvement here, and bringing a lot of "gaming cards" (i.e. very strong SP, almost zero DP) to us, i.e. to the LL/PRP market, would be worth all the effort. Unfortunately, our experience in writing code for GPUs is quite limited (remember that cat and the calendar?).

----------

** About that, I have a funny story. Years ago I needed to implement a 48-bit counter on a 16-bit MCU which had no way to add or subtract with carry: you could add or subtract 8 or 16 bits (and also multiply, but you only got the 16 least significant bits of the product; the most significant 16 were lost).

Now, splitting the values into 8-bit pieces, widening them to 16 bits, adding, re-arranging, etc., would have taken ages, and the counter wouldn't have been fast enough. I spent a lot of time thinking about how to do better, till suddenly I realized that when you add two "digits" (16 bits each, but it works in decimal too!), if there is a carry, the result is smaller than the value you added, and if not, it is bigger or equal. The same holds when you subtract. No matter what is in the register, if you add 7 to it and get a result of 7 or larger, there was no carry; if there was a carry, the result will be smaller than 7. For example, 5+7 gives 2, and 8+7 gives 5. If so, you just note it and add 1 at the next "level" of the addition. So in the end I could do it with 3 16-bit additions, plus 3 tests and at most 3 increments, haha. Much later I found out it can be done even faster than that. I know this may look silly to some of you, but for me at the time it was a big revelation to find out that you don't actually need the "carry" flag in your CPU.

Last fiddled with by LaurV on 2020-09-23 at 08:34
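For what it's worth, here is that carry test written out in C, as a simplified sketch (adding a 16-bit increment to a 48-bit counter held in three 16-bit limbs; the names are mine, not from any real firmware):

Code:
#include <stdint.h>

// 48-bit counter in three 16-bit limbs, least significant first.
// No carry flag needed: after the wrapping add, a carry happened
// exactly when the result is smaller than the value that was added.
void add48(uint16_t c[3], uint16_t inc) {
    c[0] = (uint16_t)(c[0] + inc);
    if (c[0] < inc) {              // wrapped around 2^16: carry out
        if (++c[1] == 0) ++c[2];   // propagate the carry upward
    }
}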
#28
"Mihai Preda"
Apr 2015
3·11·41 Posts
Quote:
I don't think a GPU implementation is required in order to characterize a solution. It can be tested/demonstrated on a CPU; afterwards the GPU will be able to run exactly the same FP computation as the CPU (if both implement standard IEEE 754).

I don't understand exactly what you mean when you say that DP is needed for carry propagation.
#29
"Mihai Preda"
Apr 2015
3·11·41 Posts
I did some experiments with a single-SP FFT (the simplest case), to establish the baseline.

A 2M convolution (1024x1024 SP pairs), not weighted, done at the best possible accuracy (perfect SP sin/cos etc.), can handle 5 bits per word. I think this result is consistent with past observations.

A "by the ear" interpretation of the bit sizes:

SP: 25 bits
squaring the words: 10 bits
summing the convolution terms: 1/2 * log2(1M) = 10 bits
FFT errors: 5 bits

25 == 10 + 10 + 5.
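Restated as a budget formula, that accounting gives a quick way to estimate other sizes; a minimal C sketch (the decomposition follows the post; the function name and form are mine):

Code:
#include <math.h>

// Mantissa budget: 2*b (the squared words) + 0.5*log2(terms) (growth
// of the convolution sum) + fftErrBits must fit in mantissaBits.
double bitsPerWord(double mantissaBits, double terms, double fftErrBits) {
    return (mantissaBits - 0.5 * log2(terms) - fftErrBits) / 2;
}

// bitsPerWord(25, 1 << 20, 5) == 5.0, matching the observed 5 bits/word.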
#30
"Mihai Preda"
Apr 2015
10101001001₂ Posts
The 1xSP 2M convolution code I mentioned previously is here:

https://github.com/preda/gpuowl/tree...2b32bd4bd0c/SP
https://github.com/preda/gpuowl/blob...4bd0c/SP/sp.cl

I added a 2xSP 2M convolution here:

https://github.com/preda/gpuowl/blob...0bbf8/SP/sp.cl

Please have a look if interested. With 2xSP, in the same setup as the previous 1xSP (i.e. a not-weighted 2M-word convolution), I see 15 bits/word as doable. This is in the ballpark, but I was expecting a bit more (16 bits?). OTOH I did "cheat" a bit on the trigonometric twiddles, using a technique I mentioned previously that combines the HW SP sin/cos (which is very fast but low-accuracy) with a precomputed table of SP "deltas from ideal". Thus a twiddle requires a single SP memory read (plus a HW sin/cos) vs. a 2xSP memory read for a fully precomputed table, for a slight loss of accuracy.

PS: I just added a precomputed 2xSP twiddle table, and that increases the bits/word from 15 to almost 16 (at 2M, no weighting).

Last fiddled with by preda on 2020-10-28 at 11:42. Reason: info on 2xSP twiddles
#31
"Mihai Preda"
Apr 2015
3·11·41 Posts
After a few accuracy fixes, the 2xSP experiment can do 17 bits/word, which is exactly where I was expecting it.

OTOH the multiprecision ADD uses 20 SP ADDs!! Given that in the FFT we do lots of add-sub, that kind of cost inflation can't be good... (the multiprecision MUL, OTOH, is quite fast thanks to the lovely HW FMA)

Code:
// Assumes |a| >= |b|; cost: 3 ADD
float2 fastTwoSum(float a, float b) {
  float s = a + b;
  return U2(s, b - (s - a)); // (s, e) with s + e == a + b exactly
}

// cost: 6 ADD
float2 twoSum(float a, float b) {
  float s = a + b;
  float b1 = s - a;
  float a1 = s - b1;
  float e = (b - b1) + (a - a1); // recover the rounding error of a + b
  return U2(s, e);
}

// cost: 20 ADD !!
T sum(T a, T b) {
  T s = twoSum(a.x, b.x);         // add the high words
  T t = twoSum(a.y, b.y);         // add the low words
  s = fastTwoSum(s.x, s.y + t.x); // fold the error terms back in
  s = fastTwoSum(s.x, s.y + t.y);
  return s;
}
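Presumably the fast FMA-based MUL he mentions rests on the standard twoProd split; a sketch in the same style as the code above (not taken from the gpuowl source; U2 is the same pair constructor used there):

Code:
// cost: 1 MUL + 1 FMA. fma(a, b, -p) recovers the exact rounding
// error of a*b, so (p, e) is the unevaluated double-SP product.
float2 twoProd(float a, float b) {
    float p = a * b;
    float e = fma(a, b, -p);
    return U2(p, e);
}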
#32
"/X\(‘-‘)/X\"
Jan 2013
29×101 Posts
Niall Emmart's dissertation, "A Study of High Performance Multiple Precision Arithmetic on Graphics Processing Units", may be useful.

Last fiddled with by Mark Rose on 2020-10-28 at 21:06
#33
"Mihai Preda"
Apr 2015
3·11·41 Posts
It seems the wavefront (100M+ exponents) could be handled with 2xSP at a 6.5M FFT, or *maybe* a 6M FFT pushing it a bit (100M / 6.5M is about 15.4 bits/word, under the ~17 bits/word measured above). Sounds efficient enough to be worth a try.