mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   Fast Mersenne Testing on the GPU using CUDA (https://www.mersenneforum.org/showthread.php?t=14310)

Uncwilly 2010-12-08 03:24

[QUOTE=Mathew Steine;240625][URL="http://andrewthall.org/"]http://andrewthall.org/[/URL][/QUOTE]
[QUOTE]All your irrational base are ours.[/QUOTE]
:missingteeth:

msft 2010-12-08 04:35

Does "an old Cg Lucas-Lehmer implementation" mean CUDALucas?
I can choose a new name, "YLucas", or keep "an old Cg Lucas-Lehmer implementation".
Who is the Godfather? :smile:

mdettweiler 2010-12-08 04:37

[QUOTE=msft;240646]Does "an old Cg Lucas-Lehmer implementation" mean CUDALucas?
I can choose a new name, "YLucas", or keep "an old Cg Lucas-Lehmer implementation".
Who is the Godfather? :smile:[/QUOTE]
I'd keep the current name for CUDALucas...since the new program is called gpuLucas, there shouldn't be too much trouble telling them apart.

frmky 2010-12-08 09:14

[QUOTE=Andrew Thall;240510]
gpuLucas has been tested on GTX 480 and Tesla 2050 cards[/QUOTE]

Is it expected to work on compute 1.3 cards? I've got a Tesla S1070 that I could test it on.

Andrew Thall 2010-12-08 14:30

Certainly no intention of pwning anyone; this is purely research code. I was working from Crandall's original paper and with the understanding that others had gotten it to work with non-powers of two, so I really don't know all the excellent work you all have done with CUDALucas and MacLucasFFTW and such. That's mainly why I did post this week...I can't finish my paper on this without mentioning other current work. If anyone would care to summarize the principal players and their programs, you'll get a grateful acknowledgment, for sure.

I'll post some timing results today or tomorrow...I've got a Friday deadline so finishing off my time trials right now.

As to whether it'll work with 1.3 cards...the implementation is pretty transparent, so it may need one or two mods but will probably work with any card that has true double precision and can run CUDA 3.2, though it does depend on the recent Fermi cards for a lot of its efficiency. Note that CUFFT has improved a lot in the most recent release, eliminating crippling bugs and substantially improving the non-power-of-two FFTs.
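
(For anyone wanting to check a particular card ahead of time, a generic CUFFT smoke test along the lines below will tell you whether the double-precision Z2Z transforms work at a smooth non-power-of-two length; this is not gpuLucas itself, just a throwaway test, and the length 2[SUP]18[/SUP]*3[SUP]2[/SUP] is only an example of the kind of runlength involved.)

[CODE]// Generic CUFFT smoke test (not gpuLucas): create a double-precision Z2Z
// plan at a smooth non-power-of-two length and run a forward/inverse pair.
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(void)
{
    const int N = (1 << 18) * 9;                 // 2,359,296 = 2^18 * 3^2
    cufftDoubleComplex *d_buf;
    if (cudaMalloc(&d_buf, sizeof(cufftDoubleComplex) * N) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_buf, 0, sizeof(cufftDoubleComplex) * N);

    cufftHandle plan;
    if (cufftPlan1d(&plan, N, CUFFT_Z2Z, 1) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d for CUFFT_Z2Z failed\n");
        return 1;
    }
    cufftExecZ2Z(plan, d_buf, d_buf, CUFFT_FORWARD);   // in-place forward...
    cufftExecZ2Z(plan, d_buf, d_buf, CUFFT_INVERSE);   // ...and back
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_buf);
    printf("Z2Z plan of length %d ran without error.\n", N);
    return 0;
}[/CODE]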

As to my credentials...no offense taken...I'm mainly an image-analysis guy, and these days teach undergrads, but I've been interested in Mersenne prime testing since 1995, when I was trying to parallelize LL for a Maspar MP-1. :) I was at Carolina in the late '90s when they were doing the original work with PixelFlow, so we were all excited about programmable graphics hardware. The obsolete Cg work from a few years back was using compiled shaders on 8800GT and 9800 cards, with my own homebrew extended-precision float-float FFTs and very baroque parallel carry-adds. Totally crazy, but perhaps y'all here might appreciate that. :)

R.D. Silverman 2010-12-08 16:14

[QUOTE=Andrew Thall;240510]I'd like to announce the implementation of a Lucas-Lehmer tester, gpuLucas, written in CUDA and running on Fermi-class NVidia cards. It's a full implementation of Crandall's IBDWT method and uses balanced integers and a few little tricks to make it fast on the GPU.

Example timing: demonstrated primality of M[SUB]42643801[/SUB] in 57.86 hours, at a rate of 4.88 msec per Lucas product. This used a DWT runlength of 2,359,296 = 2[SUP]18[/SUP]*3[SUP]2[/SUP], taking advantage of good efficiency for CUFFT runlengths of powers of small primes. Maximum error was 1.8e-1.

gpuLucas has been tested on GTX 480 and Tesla 2050 cards; there's actually very little difference in runtimes between the two...fears of a performance hit due to slow floating point on the 480 are bogus---it's a wicked fast card for the GPGPU stuff; you get an additional 32 CUDA cores in place of the faster double precision, and it's clocked much faster than the Tesla. The Tesla only really shines when you overclock the heck out of it; I ran it up to 1402 MHz for the above test, at which point it is 15-20% faster than the GTX for the big Mersenne numbers. (It depends on the FFT length, though, and when the greater number of processors on the GTX are offset by slower double precision, which is only used in the FFTs anyway.)

Finishing off a paper on the topic, and will post a pre-print here in a week or so. I'll make the code available publicly as well, and maybe set up a tutorial webpage if folks are interested and if time permits.[/QUOTE]
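
For readers who haven't seen the Crandall-Fagin IBDWT that the quoted announcement refers to, the variable digit sizes and weights it uses can be sketched on the host along these lines (a minimal sketch only, not gpuLucas's code; q is the Mersenne exponent and N the runlength from the quoted example):

[CODE]// Crandall-Fagin IBDWT setup for 2^q - 1 with runlength N: digit j carries
// bits[j] = ceil(q*(j+1)/N) - ceil(q*j/N) bits (a mix of two sizes), and is
// scaled by weight[j] = 2^(ceil(q*j/N) - q*j/N) before the forward FFT.
// Sketch only; real code would also build the inverse weights.
#include <cmath>
#include <cstdio>
#include <vector>

static void ibdwt_setup(long long q, long long N,
                        std::vector<int> &bits, std::vector<double> &weight)
{
    bits.resize(N);
    weight.resize(N);
    for (long long j = 0; j < N; ++j) {
        long long lo = (q * j + N - 1) / N;           // ceil(q*j/N)
        long long hi = (q * (j + 1) + N - 1) / N;     // ceil(q*(j+1)/N)
        bits[j]   = (int)(hi - lo);
        weight[j] = pow(2.0, (double)lo - (double)q * (double)j / (double)N);
    }
}

int main(void)
{
    const long long q = 42643801;             // exponent from the quoted test
    const long long N = (1LL << 18) * 9;      // runlength 2,359,296 = 2^18 * 3^2
    std::vector<int> bits;
    std::vector<double> weight;
    ibdwt_setup(q, N, bits, weight);
    printf("digit 0: %d bits, weight %.6f; digit 1: %d bits, weight %.6f\n",
           bits[0], weight[0], bits[1], weight[1]);
    return 0;
}[/CODE]

With q = 42643801 and N = 2[SUP]18[/SUP]*3[SUP]2[/SUP], q/N is about 18.07, so the digits come out as a mix of 18- and 19-bit pieces.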

Truly awesome. Kudos.

Now, it needs to be publicized. I am sure many users will take advantage of it, but they need to know about it, how to install, run, etc.

It should also be folded into GIMPS.

ixfd64 2010-12-08 16:32

Research code or not, it's definitely very exciting. I certainly hope it'll find its way into Prime95 soon! :smile:

Also, I never doubted your work, so I hope you don't take it that way. Oh, and since nobody else has said it: welcome to the GIMPS forum! :smile:

Ken_g6 2010-12-09 01:39

Thanks, Andrew!

You also sound like the kind of person who would have the experience necessary to create an LLR test, considering [url=http://mersenneforum.org/showpost.php?p=231099&postcount=334]George Woltman's requirements for such a test[/url]. Even a test for only small K's, as described in that post, would be of enormous benefit to PrimeGrid, the No Prime Left Behind search, and probably others as well.

mdettweiler 2010-12-09 02:46

[QUOTE=Ken_g6;240853]Thanks, Andrew!

You also sound like the kind of person who would have the experience necessary to create an LLR test, considering [URL="http://mersenneforum.org/showpost.php?p=231099&postcount=334"]George Woltman's requirements for such a test[/URL]. Even a test for only small K's, as described in that post, would be of enormous benefit to PrimeGrid, the No Prime Left Behind search, and probably others as well.[/QUOTE]
:goodposting:

I'm not entirely sure how much effort building a GPU LLR application would entail, but since LLR is an extension of LL, I imagine it could be at least partially derived from the existing application.

As Ken mentioned, such a program would be immensely beneficial to the many k*2^n±1 prime search projects out there. I myself am an assistant admin at NPLB and would be glad to help with testing such an app. (Our main admin, Gary, has a GTX 460 that he bought both for sieving, which is already available for CUDA, and to help test prospective CUDA LLR programs. He's not particularly savvy with this stuff but I have remote access to the GPU machine and can run stuff on it as needed.)

Max :smile:

Andrew Thall 2010-12-09 15:09

With regard to the GPU LLR work: I haven't looked at the sequential algorithms, but based on George W.'s description (straight-line in place of circular convolution, with shift-and-add for modular reduction) it actually sounds pretty close to my initial CUDA efforts on LL, before I dug into Crandall's paper and got a better handle on the IBDWT approach.

You'll pay the cost of the larger FFTs; shift-add modular reduction isn't too hard, but you'll also need a parallel scan-based carry-adder if you need fully resolved carries---I have a hotwired CUDPP that does carry-add and subtract with borrow, so that's doable. (I can ask Mark Harris if they'd like to include that in the standard CUDPP release.) The most recent gpuLucas forgoes that and uses a carry-save configuration to keep all computations local except for the FFTs themselves. Big time savings there.
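
To make the carry-save idea concrete, here is a toy CUDA kernel (emphatically not the gpuLucas kernel, just an illustration of the technique): after the inverse FFT, each position is rounded to an integer, the low bits stay put as a balanced digit, and the overflow is saved in a separate array to be added into the next position on the following iteration, so no global carry propagation is needed inside the loop.

[CODE]// Toy carry-save normalization step (illustration only, not gpuLucas code).
// carry_in holds the carries saved on the previous iteration; carry_out is
// filled with the carries for the next one. The two arrays are swapped
// (double-buffered) between iterations; a full carry propagation is only
// needed once, at the very end, to get a canonical residue.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void carry_save_step(const double *ifft_out,    // unweighted inverse-FFT output
                                long long *digit,           // balanced digits, length N
                                const long long *carry_in,  // carries saved last iteration
                                long long *carry_out,       // carries for next iteration
                                const int *bits,            // per-digit bit counts b_j
                                int N)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= N) return;

    long long base = 1LL << bits[j];
    long long half = base >> 1;

    // round to nearest integer and absorb the carry saved at this position
    long long v = llrint(ifft_out[j]) + carry_in[j];

    // split into a balanced digit in [-base/2, base/2) and an outgoing carry;
    // the arithmetic right shift acts as floor division by the power of two
    long long c = (v + half) >> bits[j];
    digit[j] = v - (c << bits[j]);
    carry_out[(j + 1) % N] = c;   // mod 2^q - 1: the top carry wraps to digit 0
}

int main(void)
{
    const int N = 8;
    // tiny fake data: pretend the inverse FFT produced these values and the
    // IBDWT assigned 3 bits to every digit position
    std::vector<double> h_x = {9.2, -7.1, 3.0, 15.9, -20.2, 0.4, 6.7, -1.3};
    std::vector<int> h_bits(N, 3);

    double *d_x; long long *d_digit, *d_cin, *d_cout; int *d_bits;
    cudaMalloc(&d_x, N * sizeof(double));
    cudaMalloc(&d_digit, N * sizeof(long long));
    cudaMalloc(&d_cin, N * sizeof(long long));
    cudaMalloc(&d_cout, N * sizeof(long long));
    cudaMalloc(&d_bits, N * sizeof(int));
    cudaMemcpy(d_x, h_x.data(), N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_bits, h_bits.data(), N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_cin, 0, N * sizeof(long long));

    carry_save_step<<<1, N>>>(d_x, d_digit, d_cin, d_cout, d_bits, N);
    cudaDeviceSynchronize();

    std::vector<long long> digit(N), carry(N);
    cudaMemcpy(digit.data(), d_digit, N * sizeof(long long), cudaMemcpyDeviceToHost);
    cudaMemcpy(carry.data(), d_cout, N * sizeof(long long), cudaMemcpyDeviceToHost);
    for (int j = 0; j < N; ++j)
        printf("pos %d: digit %lld, carry out %lld\n", j, digit[j], carry[(j + 1) % N]);

    cudaFree(d_x); cudaFree(d_digit); cudaFree(d_cin); cudaFree(d_cout); cudaFree(d_bits);
    return 0;
}[/CODE]

The scan-based adder (or the hotwired CUDPP carry-add mentioned above) only needs to run when a fully propagated result is required, e.g. when reading out the final residue.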

Oddball 2010-12-10 00:09

[QUOTE=Ken_g6;240853]considering [URL="http://mersenneforum.org/showpost.php?p=231099&postcount=334"]George Woltman's requirements for such a test[/URL].[/QUOTE]
Speaking of that post, here's a quick summary of the arguments for both sides:

Pro-LLR GPU side:
1.) Allows people with GPUs freedom of choice. If a GPU program for LLR is developed, those with GPUs can choose to either sieve or test for primes.
2.) Allows for faster verification of large (>1 million digit) primes.
3.) GPU clients are not optimized yet, so there's more potential for improvement.
4.) GPUs are more energy efficient than old CPUs (Pentium 4's, Athlons, etc), judging by the amount of electricity needed to LLR one k/n pair.

Anti-LLR GPU side:
1.) Reduces the number of participants. Those without fast GPUs would be discouraged from participating since they would no longer be able to do a significant amount of "meaningful" LLR work (defined as LLR work that has a reasonable chance of getting into the top 5000 list).
2.) GPUs are much less effective at primality testing than at sieving or trial factoring. Computing systems should be used for what they are best at, so CPU users should stick to LLR tests and GPU users should stick to sieving and factoring.
3.) GPUs have a high power consumption (~400 watts for a GPU system vs. ~150 watts for a CPU system). Even when comparing power needed per primality test, they are less efficient than Core i7s and other recent CPUs.
4.) GPUs have a higher error rate than CPUs. It's much easier to check factors than it is to check LLR residues, so GPUs should stick to trial division.

