2018-03-21, 02:38  #1 
Tribal Bullet
Oct 2004
3543_{10} Posts 
CUDA integer FFTs
I will probably get access to a compute capability 3.0+ Nvidia GPU in the near future, and have been toying with the idea of porting the integer FFT convolution framework I've been building off and on over the last few years to CUDA GPUs.
Do consumer GPUs still have their double precision throughput sharply limited, i.e. do the cards with 1:32 DP throughput vastly outnumber the server cards that do double precision faster? Using integer arithmetic on these cards has a shot at enabling convolutions that are faster than using cuFFT in double precision, if consumer GPUs have much more headroom for arithmetic than for memory bandwidth. Are current double precision LL programs bandwidth-limited on consumer GPUs even with the cap on double precision floating point throughput? The latest source is here and includes a lot of stuff already, including optimizations for a range of x86 instruction sets. Would there be interest in another LL tester for CUDA? I've never worked with OpenCL, so CUDA is what I know. 
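For readers unfamiliar with the approach: an integer FFT (number-theoretic transform, NTT) does the convolution over a finite field, so every operation is exact modular arithmetic rather than round-off-prone double precision. A minimal Python sketch of the core butterfly, assuming the common 64-bit Solinas prime 2^64 - 2^32 + 1 for illustration (not necessarily the prime jasonp's framework uses):

```python
# NTT arithmetic is exact: everything is reduced mod a prime, so there
# is no round-off error to budget for, unlike a double-precision FFT.
P = 2**64 - 2**32 + 1  # 64-bit Solinas prime, popular for NTTs

def mulmod(a, b):
    # On a GPU this would be a 64x64 -> 128-bit multiply followed by
    # a cheap reduction exploiting the sparse form of P.
    return (a * b) % P

def butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly: (a + w*b, a - w*b) mod P."""
    t = mulmod(w, b)
    return (a + t) % P, (a - t) % P
```

The attraction on throughput-capped consumer cards is that all of this runs on the integer pipelines rather than the throttled DP units.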
2018-03-21, 03:04  #2  
Jun 2003
2×23×113 Posts 
Can't speak for others, but yes, assuming it is significantly faster (2x?) than cudaLucas. EDIT: You should look at implementing a Gerbicz error check PRP version rather than LL. It is all the rage around here :) Last fiddled with by axn on 2018-03-21 at 03:08 

2018-03-21, 03:32  #3  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1011100100101_{2} Posts 
If you come up with something that improves throughput, there will be interest. Are you acquainted with Preda's efforts in OpenCL, with multiple transform types? http://www.mersenneforum.org/showpos...&postcount=224 Memory occupancy and memory controller load are very different depending on what type of calculation is being performed. The left half of the attachment is an AMD RX550 running mfakto trial factoring; the right half is a PRP-3 test in gpuowl on a 77M exponent on another RX550. I don't think I've ever seen an AMD or NVIDIA GPU's memory controller maxed out, whether TF, P-1, LL, or PRP-3 crunching. (P-1 is not available in OpenCL.) (Maybe while moving data from GPU to CPU and back in a P-1 gcd handoff.) Some of the existing software could use some maintenance & enhancement. Last fiddled with by kriesel on 2018-03-21 at 03:38 

2018-03-21, 07:05  #4  
Romulan Interpreter
"name field"
Jun 2011
Thailand
10011001010000_{2} Posts 
 ** in this mode the cards spit fire, but using mfaktc on them spits even more fire, and the output is "vice versa", i.e. doubled when "SP only", from which we can see that cudaLucas still has a lot of margin to improve on the "more SP, less DP" side... Last fiddled with by LaurV on 2018-03-21 at 07:09 

2018-03-21, 10:02  #5 
"David"
Jul 2015
Ohio
11×47 Posts 
I’d be interested in a good integer convolution, in part because it would be easier to reference or port into HDL on the FPGAs I’ve been playing with as of late.

2018-03-21, 15:08  #6  
Sep 2003
101000011001_{2} Posts 
The press release claimed a 20× improvement for deep learning applications; I'm not sure if there's any applicable improvement for number crunching. 

2018-03-21, 18:17  #7 
Tribal Bullet
Oct 2004
3·1,181 Posts 
Deep learning applications mean half-precision floating point, fixed point, or support vector machines, all of which map very nicely onto programmable logic.
George asked back in 2012 about the feasibility of NTT arithmetic to get around the lack of DP throughput on GPUs, and I figured at the time that there was no point, because the design and implementation effort would be rendered obsolete once competition in the GPU space forced the major players to support double precision at higher rates. Six years later, the joke is on me. 
2018-03-21, 18:32  #8  
"David"
Jul 2015
Ohio
11×47 Posts 
I find those very interesting, but the marketing hype is too abstract to fully grok the capabilities. I’ll see what my Xilinx rep can do, but I’ll be surprised if I can get access to a sample any time soon. 

2018-03-21, 19:19  #9 
Sep 2016
2^{3}×43 Posts 
All this "low-precision deep learning" stuff is in the opposite direction of what number crunching needs. But I'm not at all surprised at the direction the industry is going.
Double precision (high precision in general) requires too much area and too much power because of the implicit O(N^2) cost of supporting an N-bit multiplier (and we're too small to benefit from the subquadratic algorithms). A lot of applications don't need such dense hardware anyway. (A number is a number; once you have enough precision, the rest is a waste.) So in an effort to increase throughput and efficiency, they (the hardware manufacturers) are trying to push everyone away from an "inefficient" use of DP and high precision towards more efficient and specialized hardware. But by doing this, they're screwing over all the applications that legitimately need as much precision as possible. For example: large multiplication by FFT, which benefits superlinearly from increased precision of the native type. When you look at this from a broader picture, much of the scientific computing space that wants this dense hardware is the same space that's being hit the hardest by the memory bottleneck. So it almost makes sense for the industry to backpedal on DP/dense hardware until they fix the memory problem, which I doubt will happen anytime soon. 
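The superlinear benefit is easy to see with a back-of-envelope cost model: the transform length shrinks in proportion to the bits carried per word, while the per-element work grows only logarithmically, so doubling the usable precision more than halves the total work. A quick illustrative check (my own toy model, not from the post):

```python
import math

def fft_mult_cost(total_bits, bits_per_word):
    """Toy O(N log N) cost model for an FFT-based big multiply:
    N = transform length when each word carries bits_per_word bits."""
    n = total_bits / bits_per_word
    return n * math.log2(n)

# Doubling the per-word precision more than halves the modelled cost:
cost16 = fft_mult_cost(2**27, 16)  # ~16 usable bits per word
cost32 = fft_mult_cost(2**27, 32)  # ~32 usable bits, e.g. a 64-bit NTT
assert cost32 < cost16 / 2
```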
2018-03-22, 01:03  #10 
"David"
Jul 2015
Ohio
11×47 Posts 
The memory bandwidth problem is very real. We’re at the upper limits of reasonable caches, and more parallelism has nowhere to drop the data unless it’s going to be used by the same core and is thus cache-oriented.
In thinking about building FFTW-style ("Fastest Fourier Transform in the West") LL/large-FFT convolution hardware, I wondered if, instead of dealing with memory, you could just have two or more hardware chips with a super-wide I/O bus playing catch between iterations. You can get actual (vs. theoretical) hundreds of GB/second that way. An 8K FFT needs what, 130MB of throughput per iteration? 1500 iterations/second @ 200 GB/s of bandwidth. Could do a 100M exponent in less than a day. Ultimately we just need an isPrime circuit: 32-bit exponent input wires, answer in 1 clock cycle. Or just pipe those 32 bits, or the bits of the whole number, into a neural net and train it to recognize primes. Surely someone has at least experimented with that. 
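The arithmetic behind that estimate, spelled out (using the post's own round numbers, which are guesses, not measurements):

```python
# Back-of-envelope: iteration rate = bus bandwidth / data per iteration.
bytes_per_iter = 130e6     # ~130 MB moved per iteration (post's figure)
bandwidth = 200e9          # ~200 GB/s achievable chip-to-chip bus

iters_per_sec = bandwidth / bytes_per_iter   # ~1538 iterations/s
seconds = 100_000_000 / iters_per_sec        # one squaring per iteration
days = seconds / 86400                       # ~0.75 days for a 100M exponent
assert days < 1
```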
2018-03-22, 01:39  #11 
P90 years forever!
Aug 2002
Yeehaw, FL
1111000001001_{2} Posts 
A 4M FFT needs ~135 MB of bandwidth per iteration. A 4M FFT is 32MB of data requiring two passes over it. Thus: 32MB read + 32MB write + 32MB read + 32MB write + somewhere between 5MB and 10MB of read-only sin/cos/weights data.
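George's figure checks out, assuming 8 bytes (one double) per FFT element and taking ~7 MB as the middle of his range for the twiddle data:

```python
# Reconstructing the ~135 MB/iteration figure for a 4M-point FFT.
fft_len = 4 * 2**20          # 4M elements
data = fft_len * 8           # 8 bytes each -> 32 MB of residue data
rw_per_pass = 2 * data       # each pass reads and writes everything
streamed = 2 * rw_per_pass   # two passes -> 128 MB
twiddles = 7 * 2**20         # ~5-10 MB of read-only sin/cos/weights
total = streamed + twiddles
assert data == 32 * 2**20
assert total == 135 * 2**20  # ~135 MB per iteration
```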

Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
128 bit integer division in CUDA?  cseizert  GPU Computing  8  2016-11-27 15:41 
Non-power-of-two FFTs  jasonp  Computer Science & Computational Number Theory  15  2014-06-10 14:49 
P95 PrimeNet causes BSOD; small FFTs, large FFTs, and blend test don't  KarateF22  PrimeNet  16  2013-10-28 00:34 
In Place Large FFTs Failure  nwb  Information & Answers  2  2011-07-08 16:04 
gmp-ecm and FFTs  dave_dm  Factoring  9  2004-09-04 11:47 