mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

2018-03-21, 02:38   #1
jasonp (Tribal Bullet, Oct 2004)

CUDA integer FFTs

I will probably get access to a compute capability 3.0+ Nvidia GPU in the near future, and have been toying with the idea of porting the integer FFT convolution framework I've been building off and on over the last few years to CUDA GPUs.

Do consumer GPUs still have their double precision throughput sharply limited, i.e. do the cards with 1:32 DP throughput vastly outnumber the server cards that do double precision faster? If consumer GPUs have much more headroom for arithmetic relative to the memory bandwidth available, integer arithmetic on these cards has a shot at enabling convolutions that are faster than using cufft in double precision. Are current double precision LL programs bandwidth limited on consumer GPUs even with the cap on double precision floating point throughput?

The latest source is here and includes a lot of stuff already, including optimizations for a range of x86 instruction sets.

Would there be interest in another LL tester for CUDA? I've never worked with OpenCL so CUDA is what I know.
2018-03-21, 03:04   #2
axn (Jun 2003)

Quote:
Originally Posted by jasonp
Do consumer GPUs still have their double precision throughput sharply limited, i.e. do the cards with 1:32 DP throughput vastly outnumber the server cards that do double precision faster?
Yes

Quote:
Originally Posted by jasonp
Would there be interest in another LL tester for CUDA?
Can't speak for others, but yes, assuming it is significantly faster (2x?) than cudaLucas.

EDIT: You should look at implementing a Gerbicz error check PRP version rather than LL. It is all the rage around here :-)

2018-03-21, 03:32   #3
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Quote:
Originally Posted by jasonp
I will probably get access to a compute capability 3.0+ Nvidia GPU in the near future, and have been toying with the idea of porting the integer FFT convolution framework I've been building off and on over the last few years to CUDA GPUs,

[...]

Would there be interest in another LL tester for CUDA? I've never worked with OpenCL so CUDA is what I know.
I have several CC 2.x GPUs. Please don't go out of your way to make it incompatible with them; on the other hand, don't let compatibility with them get in the way of performance on newer designs.

If you come up with something that improves throughput, there will be interest.

Are you acquainted with Preda's efforts on OpenCL, with multiple transform types?
http://www.mersenneforum.org/showpos...&postcount=224

Memory occupancy and memory controller load are very different depending on what type of calculation is being performed. The left half of the attachment is an AMD RX550 running mfakto trial factoring; the right half is a PRP 3 test in gpuowl on a 77M exponent on another RX550. I don't think I've ever seen an AMD or Nvidia GPU's memory controller maxed out, whether TF, P-1, LL, or PRP3 crunching. (P-1 is not available in OpenCL.) (Maybe while moving data from GPU to CPU and back in a P-1 gcd handoff.)

Some of the existing software could use some maintenance & enhancement.
[Attached: GPU-Z screenshots of an AMD RX550 running mfakto (left) and gpuowl (right)]

2018-03-21, 07:05   #4
LaurV (Romulan Interpreter, Jun 2011, Thailand)

Quote:
Originally Posted by axn
Can't speak for others, but, yes, assuming it is significantly faster (2x?) than cudaLucas.
It wouldn't necessarily have to be faster. Our cards slow down to about half speed if we set them "SP-only" (the default), compared to "enable DP" (which can be done with the Nvidia control panel**). Moreover, there are lots of gaming cards out there that are of little use for LL testing due to their "almost SP-only" design. An "SP-only" or "integer" LL tester comparable speed-wise with cudaLucas, or in the same ballpark (say at least 80%-90% as fast, but without DP), would be, if not a "revolution", at least very useful.
--------
** In this mode the cards spit fire, yet using mfaktc on them spits even more fire, and its output is vice versa, i.e. doubled when "SP-only"; from which we can see that cudaLucas still has a lot of margin to improve on the "more-SP, less-DP" side...

2018-03-21, 10:02   #5
airsquirrels ("David", Jul 2015, Ohio)

I'd be interested in a good integer convolution, in part because it would be easier to reference/port into HDL on the FPGAs I've been playing with as of late.
2018-03-21, 15:08   #6
GP2 (Sep 2003)

Quote:
Originally Posted by airsquirrels
I’d be interested in a good integer convolution, if partially because it would be easier to reference/ port into HDL on the FPGAs I’ve been playing with as of late.
Have you looked at the very recent ACAP announcement from Xilinx mentioned in the Science News thread, and would it be more promising than FPGA for this purpose?

The press release claimed a 20× improvement for deep learning applications; not sure if there's any applicable improvement for number crunching.
2018-03-21, 18:17   #7
jasonp (Tribal Bullet, Oct 2004)

Deep learning applications mean half-precision floating point, fixed point, or support vector machines, all of which map very nicely to programmable logic.

George asked back in 2012 about the feasibility of NTT arithmetic to get around the lack of DP throughput on GPUs, and I figured at the time that there was no point because the design and implementation effort would be rendered obsolete when competition in the GPU space forced the major players to support double precision at higher rates. Six years later the joke is on me.
2018-03-21, 18:32   #8
airsquirrels ("David", Jul 2015, Ohio)

Quote:
Originally Posted by GP2
Have you looked at the very recent ACAP announcement from Xilinx mentioned in the Science News thread, and would it be more promising than FPGA for this purpose?

The press release claimed a 20× improvement for deep learning applications, not sure if there's any applicable improvement for number crunching.

I find those very interesting, but the marketing hype is too abstract to fully grok the capabilities. I'll see what my Xilinx rep can do, but I'll be surprised if I can get access to a sample any time soon.
2018-03-21, 19:19   #9
Mysticial (Sep 2016)

All this "low-precision deep learning" stuff is in the opposite direction of what number crunching needs. But I'm not at all surprised at the direction the industry is going.

Double precision (high precision in general) requires too much area and too much power because of the implicit O(N^2) cost of supporting an N-bit multiplier (and our word sizes are too small to benefit from the sub-quadratic algorithms).

A lot of applications don't need such dense hardware anyway. (A number is a number, once you have enough precision, the rest is a waste.)

So in an effort to increase throughput and efficiency, they (the hardware manufacturers) are trying to push everyone away from an "inefficient" use of DP and high-precision towards more efficient and specialized hardware.

But by doing this, they're screwing over all the applications that legitimately need as much precision as possible. For example: large multiplication by FFT which benefits superlinearly with increased precision of the native type.

When you look at this from a broader picture, much of the scientific computing space that wants this dense hardware is the same space that's being hit the hardest by the memory bottleneck. So it almost makes sense for the industry to backpedal on DP/dense-hardware until they fix the memory problem - which I doubt will happen anytime soon.
2018-03-22, 01:03   #10
airsquirrels ("David", Jul 2015, Ohio)

The memory bandwidth problem is very real. We’re at the upper limits of reasonable caches and more parallelism has nowhere to drop the data unless it’s going to be used by the same core and thus cache oriented.

In thinking about building the FFTW (fastest in the West) LL/large-FFT convolution hardware, I wondered if, instead of dealing with memory, you could just have two or more hardware chips with a super-wide IO bus playing catch between iterations. You can get actual (vs. theoretical) hundreds of GB/second that way. An 8K FFT needs what, 130MB of throughput per iteration? 1500 iterations/second @ 200 GB/s bandwidth. Could do a 100M exponent in less than a day.

Ultimately we just need an isPrime circuit. 32 bit exponent input wires and gives answer in 1 clock cycle.

Or, just pipe those 32 bits, or the bits of the whole number into a neural net and train it to recognize primes. Surely someone has at least experimented with that.
2018-03-22, 01:39   #11
Prime95 (P90 years forever!, Aug 2002, Yeehaw, FL)

Quote:
Originally Posted by airsquirrels
An 8K FFT needs what, 130MB of throughput per iteration?
A 4M FFT needs ~135 MB of bandwidth per iteration. A 4M FFT is 32MB of data requiring two passes over the data. Thus, 32MB read + 32MB write + 32MB read + 32MB write + somewhere between 5MB and 10MB of read-only sin/cos/weights data.