mersenneforum.org > Great Internet Mersenne Prime Search > Software
Old 2020-09-19, 21:03   #12
preda ("Mihai Preda")

Quote:
Originally Posted by Prime95
Years ago I toyed with using two or three 32-bit ints to create a 64 or 96-bit float (no exponent bits -- all mantissa).

I did enough work to prove to myself it was feasible and, at the time, it would be about as fast as a double-precision FFT. As nVidia has lowered and lowered the DP-to-SP ratio, it would be a substantial winner now.

An awful lot of code to write though.
Were you using a form of fixed-point to represent the "float without exponent" as ints?
Old 2020-09-19, 21:06   #13
preda

My intuition about using SP in the classic FFT way is: the twiddles are too imprecise.

The twiddles (the trigonometric constants used in the FFT), if represented as SP, do not have enough precision for the FFT sizes needed.

The solution is either to use higher-precision twiddles or to reduce the FFT size.

Old 2020-09-20, 04:37   #14
Prime95


Quote:
Originally Posted by preda
Were you using a form of fixed-point to represent the "float without exponent" as ints?
Yes, fixed point. Your comments about the integer units already being busy with pointer arithmetic and looping have me worried about this approach. That, and the fact that there are twice as many SP units as INT units.

How many ops does it take to do double, triple, and quad precision SP? At a 64-to-1 ratio one of these options ought to pay off.
Old 2020-09-20, 10:39   #15
preda


Quote:
Originally Posted by Prime95
How many ops does it take to do double, triple, and quad precision SP? At a 64-to-1 ratio one of these options ought to pay off.
Here's an article I found on the topic; I still have to read it carefully: http://www.andrewthall.com/papers/df64_qf128.pdf
Old 2020-09-20, 22:38   #16
preda


Quote:
Originally Posted by preda
Here's an article I found on the topic; I still have to read it carefully: http://www.andrewthall.com/papers/df64_qf128.pdf
My takeaway from the above paper is:

double-SP multiplication is fast when FMA is available:
we represent a value x by a pair of SP values (a,b) such that x = a+b, with |a| much larger than |b|.
Then multiplication becomes:

(a,b) * (c,d) = (a*c, b*d + fma(a, c, -a*c))
(the term fma(a, c, -a*c) captures the "error" in the multiplication a*c).

OTOH, the double-SP addition is not so fast. But if we assume that the two values (a,b) and (c,d) are of similar magnitude, we can approximate the addition as:

(a,b) + (c,d) = (a+c, b + d + (c - ((a+c) - a)))

Old 2020-09-20, 23:07   #17
preda


Quote:
Originally Posted by preda
(a,b) * (c,d) = (a*c, b*d + fma(a, c, -a*c))
(a,b) + (c,d) = (a+c, b + d + (c - ((a+c) - a)))
If the above is correct, it means that a double-MUL is 3xMUL + 1xADD, and a double-ADD is 4xADD, which is rather efficient (though I don't know how bad the double-ADD approximation is).
Old 2020-09-20, 23:13   #18
preda


About the "twiddles": they can be computed in double-SP this way:

The hardware (GPU) provides very fast but poor-accuracy SP sin/cos. Precompute an SP table with the difference between the "ideal" sin/cos and the hardware sin/cos. Then the double-SP sin/cos takes its two elements from the HW trig value and the table delta.
Old 2020-09-21, 13:28   #19
tServo ("Marv")


Ever since I exchanged a few e-mails with George a while back, I have been working on this from time to time. Here is what I have discovered, but remember these are my opinions and could be wrong:

Until I see some definitive benchmarks of RTX 3080 FP32 performance, I will remain skeptical. It seems somewhat like marketing hype and smoke and mirrors. My assumption so far has been the 1:32 ratio, and I will stick with that until I see proof otherwise.

In the code I've written and tested so far, I have found two major things:
(1) You must be very careful with rounding. I'm talking about the 3 bits or so beyond the end of the result register; your results will zoom off into the ozone if you aren't. One of the first tests I did was adding a couple of million numbers, and I found out instantly that my rounding had a subtle bug.

(2) I don't see how this can be done without low-level coding like PTX or SASS on CUDA. Memory accesses are the key to GPU performance, and the slightest bit of sloppy code can doom it. In one instance I moved a memory read and my code was suddenly twice as fast!
Since this is the case, I believe everything must be done in registers, without touching memory except to load the input values and store the final result.

Since I see no way to incorporate PTX modules into OpenCL on Nvidia hardware, that means GPUOWL is off the table, AFAIK.

Remember, these are my opinions, so if you feel strongly otherwise, please keep on testing.

I hope to have low-level benchmarks (timing routines, not whole programs) in a month or so.
Old 2020-09-21, 21:09   #20
xx005fs ("Eric")


Quote:
Originally Posted by tServo
Ever since I exchanged a few e-mails with George a while back, I have been working on this from time to time. Here is what I have discovered, but remember these are my opinions and could be wrong:

Until I see some definitive benchmarks on RTX 3080 FP32 performance, I will remain skeptical. It seems somewhat like marketing hype and smoke and mirrors. My assumptions so far have been the 1:32 ratio and I will stick with that until I see proof otherwise.

There are already AIDA64 benchmarks on the 3080 out in the wild, such as https://www.overclockersclub.com/rev..._edition/4.htm
Single precision is 31,523 GFLOPS and double precision is 536.6 GFLOPS.

Old 2020-09-21, 23:34   #21
tServo

Default

Quote:
Originally Posted by xx005fs
There are already AIDA64 benchmarks on the 3080 out in the wild, such as https://www.overclockersclub.com/rev..._edition/4.htm
Single precision is 31,523 GFLOPS and double precision is 536.6 GFLOPS.
Thank you for the info.
My opinion on the other items remains, though, since even that 64:1 won't cure the memory-access issue.
Old 2020-09-21, 23:49   #22
Prime95

Default

Quote:
Originally Posted by tServo
My opinion on the other items remains though since even that 64:1 won't cure the memory access issue.
Right now we are getting about 19 bits per double (19/64 = 29.7% efficiency).
If we use two floats to emulate a ~48-bit double, we'll get about 14 bits per 64 bits of storage (21.9% efficiency). So our memory access requirements go up by 36%.

Now, implementing triple or quad precision is a different story (38/96 = 39.6% or 62/128 = 48.4%), representing a significant reduction in memory accesses. However, emulation costs go up, as does register pressure. Coding and benchmarking are required to see which, if any, is better.