
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   Fast Mersenne Testing on the GPU using CUDA (https://www.mersenneforum.org/showthread.php?t=14310)

Andrew Thall 2010-12-07 15:15

Fast Mersenne Testing on the GPU using CUDA
 
I'd like to announce the implementation of a Lucas-Lehmer tester, gpuLucas, written in CUDA and running on Fermi-class NVidia cards. It's a full implementation of Crandall's IBDWT method and uses balanced integers and a few little tricks to make it fast on the GPU.
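
For anyone new to this, the Lucas-Lehmer recurrence itself is tiny; all of the work is in making the modular squaring fast. A toy version for small exponents, just to fix notation (illustrative only, obviously not how gpuLucas handles multi-million-bit numbers):

[code]
#include <stdio.h>

/* Toy Lucas-Lehmer check: s_0 = 4, s_{k+1} = s_k^2 - 2 (mod 2^p - 1),
   and M_p is prime iff s_{p-2} == 0. gpuLucas runs exactly this recurrence,
   but does the modular squaring with a CUFFT-based IBDWT instead of the
   (GCC/Clang-only) 128-bit arithmetic used here. */
int main(void) {
    const unsigned p = 61;                          /* M61 = 2^61 - 1 is prime */
    const unsigned long long Mp = (1ULL << p) - 1;
    unsigned __int128 s = 4;
    for (unsigned k = 0; k < p - 2; k++)
        s = (s * s + Mp - 2) % Mp;                  /* add Mp before subtracting 2 to stay nonnegative */
    printf("M%u is %sprime\n", p, s == 0 ? "" : "not ");
    return 0;
}
[/code]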

Example timing: demonstrated primality of M[SUB]42643801[/SUB] in 57.86 hours, at a rate of 4.88 msec per Lucas product. This used a DWT runlength of 2,359,296 = 2[SUP]18[/SUP]*3[SUP]2[/SUP], taking advantage of good efficiency for CUFFT runlengths of powers of small primes. Maximum error was 1.8e-1.
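
For the curious, the IBDWT bookkeeping behind that run length is the Crandall-Fagin variable word size and weight assignment; a rough C sketch (illustrative, not the gpuLucas source):

[code]
#include <math.h>

/* Crandall-Fagin IBDWT setup sketch for exponent p and run length N:
   word i holds b_i = ceil(p*(i+1)/N) - ceil(p*i/N) bits and gets the
   DWT weight a_i = 2^(ceil(p*i/N) - p*i/N), which lies in [1, 2). */
void ibdwt_setup(unsigned p, unsigned N, int *bits, double *weight) {
    for (unsigned i = 0; i < N; i++) {
        double lo = ceil((double)p *  i      / N);
        double hi = ceil((double)p * (i + 1) / N);
        bits[i]   = (int)(hi - lo);                   /* a "big" or "little" word */
        weight[i] = pow(2.0, lo - (double)p * i / N); /* irrational-base weight */
    }
}
[/code]
Each squaring is then: weight, forward FFT, pointwise square, inverse FFT, unweight, round, carry.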

gpuLucas has been tested on GTX 480 and Tesla 2050 cards; there's actually very little difference in runtimes between the two...fears of a performance hit due to slow floating point on the 480 are bogus---it's a wicked fast card for the GPGPU stuff; you get an additional 32 CUDA cores in place of the faster double precision, and it's clocked much faster than the Tesla. The Tesla only really shines when you overclock the heck out of it; I ran it up to 1402 MHz for the above test, at which point it is 15-20% faster than the GTX for the big Mersenne numbers. (It depends on the FFT length, though, and on when the greater number of processors on the GTX is offset by the slower double precision, which is only used in the FFTs anyway.)

Finishing off a paper on the topic, and will post a pre-print here in a week or so. I'll make the code available publicly as well, and maybe set up a tutorial webpage if folks are interested and if time permits.

msft 2010-12-07 20:14

Hi, Andrew Thall,
Congratulations! :smile:

Mini-Geek 2010-12-07 20:39

When they use the same FFT lengths, how does the speed of this program compare to MacLucasFFTW? In any case, the flexibility of having non-power-of-2 FFTs makes it a very attractive choice compared to MacLucasFFTW.

CRGreathouse 2010-12-07 22:47

[QUOTE=Andrew Thall;240510]Finishing off a paper on the topic, and will post a pre-print here in a week or so. I'll make the code available publicly as well, and maybe set up a tutorial webpage if folks are interested and if time permits.[/QUOTE]

I'd love to see those if/when you get to them.

Uncwilly 2010-12-08 00:50

A verification run in 3 days!?!?! :w00t:

Brain 2010-12-08 00:57

Sounds great
 
We are very interested. I would buy a GTX 460 just for running your program. ;-) Verification in 3 days? Wow. What would CUDALucas have needed?

msft 2010-12-08 01:21

[QUOTE=Brain;240606] What would CUDALucas have needed?[/QUOTE]
9.04 (ms/iter) / 4.88 (ms/iter) * 57.86 (hours) = 107.2 (hours) :smile:

ixfd64 2010-12-08 02:04

I'm usually a bit leery when a brand new user makes such a bold claim; after all, we do get a fair share of trolls and cranks here (for example, someone recently claimed to have written an OpenCL-enabled siever but never followed up after his second post).

However, I am 99% sure that this is legit because the OP in this thread seems to know what he is talking about. If gpuLucas really works as claimed, it will greatly benefit the GIMPS community.

Mathew 2010-12-08 02:14

[URL="http://andrewthall.org/"]http://andrewthall.org/[/URL]

ixfd64 2010-12-08 02:52

[QUOTE=Mathew Steine;240625][URL="http://andrewthall.org/"]http://andrewthall.org/[/URL][/QUOTE]

This is the real deal, then!

No offense to msft, but it looks like CUDALucas just got owned!

msft 2010-12-08 03:17

[QUOTE=ixfd64;240634]No offense to msft, but it looks like that CUDALucas just got owned![/QUOTE]
I can change the name to "YLucas"; "Y" is my initial. :lol:

Uncwilly 2010-12-08 03:24

[QUOTE=Mathew Steine;240625][URL="http://andrewthall.org/"]http://andrewthall.org/[/URL][/QUOTE]
[QUOTE]All your irrational base are ours.[/QUOTE]
:missingteeth:

msft 2010-12-08 04:35

Is "an old Cg Lucas-Lehmer implementation" mean CUDALucas ?
I can choice new name "YLucas" or "an old Cg Lucas-Lehmer implementation".
Who is Godfather ? :smile:

mdettweiler 2010-12-08 04:37

[QUOTE=msft;240646]Is "an old Cg Lucas-Lehmer implementation" mean CUDALucas ?
I can choice new name "YLucas" or "an old Cg Lucas-Lehmer implementation".
Who is Godfather ? :smile:[/QUOTE]
I'd keep the current name for CUDALucas...since the new program is called gpuLucas, there shouldn't be too much trouble telling them apart.

frmky 2010-12-08 09:14

[QUOTE=Andrew Thall;240510]
gpuLucas has been tested on GTX 480 and Tesla 2050 cards[/QUOTE]

Is it expected to work on compute 1.3 cards? I've got a Tesla S1070 that I could test it on.

Andrew Thall 2010-12-08 14:30

Certainly no intention of pwning anyone; this is purely research code. I was working from Crandall's original paper and with the understanding that others had gotten it to work with non-powers of two, so I really don't know all the excellent work you all have done with CUDALucas and MacLucasFFTW and such. That's mainly why I did post this week...I can't finish my paper on this without mentioning other current work. If anyone would care to summarize the principal players and their programs, you'll get a grateful acknowledgment, for sure.

I'll post some timing results today or tomorrow...I've got a Friday deadline so finishing off my time trials right now.

As to whether it'll work with 1.3 cards...the implementation is pretty transparent, so it may need one or two mods but will probably work with any card that has true double precision and can run CUDA 3.2, though it does depend on the recent Fermi cards for a lot of its efficiency. Note that CUFFT has improved a lot in the most recent implementation, eliminating crippling bugs and substantially improving the non-power-of-two FFTs.

As to my credentials...no offense taken...I'm mainly an image-analysis guy, and these days teach undergrads, but I've been interested in Mersenne prime testing since 1995, when I was trying to parallelize LL for a Maspar MP-1. :) I was at Carolina in the late '90s when they were doing the original work with PixelFlow, so we were all excited about programmable graphics hardware. The obsolete Cg work from a few years back was using compiled shaders on 8800GT and 9800 cards, with my own homebrew extended-precision float-float FFTs and very baroque parallel carry-adds. Totally crazy, but perhaps y'all here might appreciate that. :)
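
(For the curious: the heart of those float-float FFTs is representing each value as an unevaluated sum of two singles. A modern sketch of the product step, using FMA rather than the Dekker splitting the old Cg shaders had to use, looks roughly like this:)

[code]
#include <math.h>

/* Float-float ("single-single") value: an unevaluated sum hi + lo with |lo| <= ulp(hi)/2. */
typedef struct { float hi, lo; } ff;

/* Product of two singles as a float-float: hi is the rounded product and
   lo recovers the rounding error exactly via a fused multiply-add. */
static ff ff_two_prod(float a, float b) {
    ff r;
    r.hi = a * b;
    r.lo = fmaf(a, b, -r.hi);
    return r;
}
[/code]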

R.D. Silverman 2010-12-08 16:14

[QUOTE=Andrew Thall;240510]I'd like to announce the implementation of a Lucas-Lehmer tester, gpuLucas, written in CUDA and running on Fermi-class NVidia cards. It's a full implementation of Crandall's IBDWT method and uses balanced integers and a few little tricks to make it fast on the GPU.

Example timing: demonstrated primality of M[SUB]42643801[/SUB] in 57.86 hours, at a rate of 4.88 msec per Lucas product. This used a DWT runlength of 2,359,296 = 2[SUP]18[/SUP]*3[SUP]2[/SUP], taking advantage of good efficiency for CUFFT runlengths of powers of small primes. Maximum error was 1.8e-1.

gpuLucas has been tested on GTX 480 and Tesla 2050 cards; there's actually very little difference in runtimes between the two...fears of a performance hit due to slow floating point on the 480 are bogus---it's a wicked fast card for the GPGPU stuff; you get an additional 32 CUDA cores in place of the faster double precision, and it's clocked much faster than the Tesla. The Tesla only really shines when you overclock the heck out of it; I ran it up to 1402 Mhz for the above test, at which point it is 15-20% faster than the GTX for the big Mersenne numbers. (It depends on the FFT length, though, and when the greater number of processors on the GTX are offset by slower double precision, which is only used in the FFTs anyway.)

Finishing off a paper on the topic, and will post a pre-print here in a week or so. I'll make the code available publicly as well, and maybe set up a tutorial webpage if folks are interested and if time permits.[/QUOTE]

Truly awesome. Kudos.

Now, it needs to be publicized. I am sure many users will take advantage of
it, but they need to know about it, how to install, run, etc.

It should also be folded into GIMPS.

ixfd64 2010-12-08 16:32

Research code or not, it's definitely very exciting. I certainly hope it'll find its way into Prime95 soon! :smile:

Also, I never doubted your work, so I hope you don't take it that way. Oh, and since nobody else has said it: welcome to the GIMPS forum! :smile:

Ken_g6 2010-12-09 01:39

Thanks, Andrew!

You also sound like the kind of person who would have the experience necessary to create an LLR test, considering [url=http://mersenneforum.org/showpost.php?p=231099&postcount=334]George Woltman's requirements for such a test[/url]. Even a test for only small K's, as described in that post, would be of enormous benefit to PrimeGrid, the No Prime Left Behind search, and probably others as well.

mdettweiler 2010-12-09 02:46

[QUOTE=Ken_g6;240853]Thanks, Andrew!

You also sound like the kind of person who would have the experience necessary to create an LLR test, considering [URL="http://mersenneforum.org/showpost.php?p=231099&postcount=334"]George Woltman's requirements for such a test[/URL]. Even a test for only small K's, as described in that post, would be of enormous benefit to PrimeGrid, the No Prime Left Behind search, and probably others as well.[/QUOTE]
:goodposting:

I'm not sure entirely how much effort building a GPU LLR application would entail, but since LLR is an extension of LL, I imagine it could be at least partially derived from the existing application.

As Ken mentioned, such a program would be immensely beneficial to the many k*2^n+-1 prime search projects out there. I myself am an assistant admin at NPLB and would be glad to help with testing such an app. (Our main admin, Gary, has a GTX 460 that he bought both for sieving, which is already available for CUDA, and to help test prospective CUDA LLR programs. He's not particularly savvy with this stuff but I have remote access to the GPU machine and can run stuff on it as needed.)

Max :smile:

Andrew Thall 2010-12-09 15:09

With regard to the GPU LLR work: I haven't looked at the sequential algorithms; based on George W.'s description, the use of straightline in place of circular convolution and shift-add for modular reduction actually sounds pretty close to my initial CUDA efforts on LL, before I dug into Crandall's paper and got a better handle on the IBDWT approach.

You'll pay the cost of the larger FFTs; shift-add modular reduction isn't too hard, but you'll also need a parallel scan-based carry-adder if you need fully resolved carries---I have a hotwired CUDPP that does carry-add and subtract with borrow, so that's doable. (I can ask Mark Harris if they'd like to include that in the standard CUDPP release.) The most recent gpuLucas forgoes that and uses a carry-save configuration to keep all computations local except for the FFTs themselves. Big time savings there.
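
To make the carry step concrete, this is roughly the arithmetic being resolved, as a plain sequential C sketch with the variable IBDWT word sizes in bits[] (gpuLucas keeps this local / carry-save on the GPU; this is not its kernel code):

[code]
/* Assumes word[] holds the rounded convolution output as signed 64-bit values,
   and assumes arithmetic right shift of signed integers (true on common compilers). */
long long resolve_carries(long long *word, const int *bits, int N) {
    long long carry = 0;
    for (int i = 0; i < N; i++) {
        long long v    = word[i] + carry;
        long long base = 1LL << bits[i];
        carry   = (v + base / 2) >> bits[i];   /* round to nearest so digits stay balanced */
        word[i] = v - carry * base;            /* balanced digit in [-base/2, base/2) */
    }
    return carry;   /* any leftover carry wraps around mod 2^p - 1, i.e. is added back at word 0 */
}
[/code]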

Oddball 2010-12-10 00:09

[QUOTE=Ken_g6;240853]considering [URL="http://mersenneforum.org/showpost.php?p=231099&postcount=334"]George Woltman's requirements for such a test[/URL].[/QUOTE]
Speaking of that post, here's a quick summary of the arguments for both sides:

Pro-LLR GPU side:
1.) Allows people with GPUs freedom of choice. If a GPU program for LLR is developed, those with GPUs can choose to either sieve or test for primes.
2.) Allows for faster verification of large (>1 million digit) primes.
3.) GPU clients are not optimized yet, so there's more potential for improvement.
4.) GPUs are more energy efficient than old CPUs (Pentium 4's, Athlons, etc), judging by the amount of electricity needed to LLR one k/n pair.

Anti-LLR GPU side:
1.) Reduces the number of participants. Those without fast CPUs would be discouraged from participating since they would no longer be able to do a significant amount of "meaningful" LLR work (defined as LLR work that has a reasonable chance of getting into the top 5000 list).
2.) GPUs are much less effective at primality testing than at sieving or trial factoring. Computing systems should be used for what they are best at, so CPU users should stick to LLR tests and GPU users should stick to sieving and factoring.
3.) GPUs have a high power consumption (~400 watts for a GPU system vs. ~150 watts for a CPU system). Even when comparing power needed per primality test, they are less efficient than core i7's and other recent CPUs.
4.) GPUs have a higher error rate than CPUs. It's much easier to check factors than it is to check LLR residues, so GPUs should stay with doing trial division.

axn 2010-12-10 02:14

[QUOTE=Oddball;241034]Speaking of that post, here's a quick summary of the arguments for both sides:[/quote]
That post describes the technical complexities of developing a suitable FFT for LLR. It doesn't deal with any "sides".

[QUOTE=Oddball;241034]Pro-LLR GPU side:
1.) Allows people with GPUs freedom of choice. If an GPU program for LLR is developed, those with GPUs can choose to either sieve or test for primes.
2.) Allows for faster verification of large (>1 million digit) primes.
3.) GPU clients are not optimized yet, so there's more potential for improvement.
4.) GPUs are more energy efficient than old CPUs (Pentium 4's, Athlons, etc), judging by the amount of electricity needed to LLR one k/n pair.

Anti-LLR GPU side:
1.) Reduces the number of participants. Those without fast CPUs would be discouraged from participating since they would no longer be able to do a significant amount of "meaningful" LLR work (defined as LLR work that has a reasonable chance of getting into the top 5000 list).
2.) GPUs are much less effective at primality testing than at sieving or trial factoring. Computing systems should be used for what they are best at, so CPU users should stick to LLR tests and GPU users should stick to sieving and factoring.
3.) GPUs have a high power consumption (~400 watts for a GPU system vs. ~150 watts for a CPU system). Even when comparing power needed per primality test, they are less efficient than core i7's and other recent CPUs.
4.) GPUs have a higher error rate than CPUs. It's much easier to check factors than it is to check LLR residues, so GPUs should stay with doing trial division.[/QUOTE]

Aside from you, I haven't actually seen any arguments advanced by anyone for not developing a primality testing program for GPUs. One person hardly makes a "side". The arguments against a GPU-based LLR read like a what's-what of fallacy files.

CRGreathouse 2010-12-10 02:24

[QUOTE=Oddball;241034]3.) GPUs have a high power consumption (~400 watts for a GPU system vs. ~150 watts for a CPU system). Even when comparing power needed per primality test, they are less efficient than core i7's and other recent CPUs.
4.) GPUs have a higher error rate than CPUs. It's much easier to check factors than it is to check LLR residues, so GPUs should stay with doing trial division.[/QUOTE]

Do we have numbers on those?

Oddball 2010-12-10 06:17

[quote]
[I]3.) GPUs have a high power consumption (~400 watts for a GPU system vs. ~150 watts for a CPU system). Even when comparing power needed per primality test, they are less efficient than core i7's and other recent CPUs.[/I]
[I]4.) GPUs have a higher error rate than CPUs. It's much easier to check factors than it is to check LLR residues, so GPUs should stay with doing trial division.[/I]
[/quote]
[QUOTE=CRGreathouse;241045]Do we have numbers on those?[/QUOTE]
There's this:
[URL]http://www.mersenneforum.org/showpost.php?p=213089&postcount=152[/URL]
"487W for GTX 295 under full load!"

The Phenom II system I have right now draws ~150 watts at full load.

The reference for claim #4 is here:
[URL]http://mersenneforum.org/showpost.php?p=238232&postcount=379[/URL]

"Consumer video cards are designed for gaming rather than technical computing, so they don't have as many error-checking features."
There's not enough data to provide more accurate figures.

Oddball 2010-12-10 06:27

[QUOTE=axn;241044]That post describes the technical complexities of developing a suitable FFT for LLR. It doesn't deal with any "sides".[/QUOTE]
In that post, the quote that Prime95 posted was referring to The Carnivore, who was describing the impatience of the pro-GPU side. I wasn't the first person who made the claim of different sides.

[quote]
Aside from you, I haven't actually seen any arguments advanced by any persons for not developing a primality testing program for GPU. One person hardly makes a "side".[/quote]
Here's another person with an anti-GPU point of view:
[URL]http://mersenneforum.org/showpost.php?p=231062&postcount=327[/URL]

Here's what George has to say:
[URL]http://mersenneforum.org/showpost.php?p=231172&postcount=339[/URL]

"if msft develops a CUDA LLR program then it will be modestly more powerful (in terms of throughput) than an i7 -- just like LL testing.
[B]From a project admin's point of view, he'd rather GPUs did sieving than primality testing[/B] as it seems a GPU will greatly exceed (as opposed to modestly exceed) the thoughput of an i7."

But I'm done debating this issue; it's been beaten to death, and none of the users involved are going to change their minds.

mdettweiler 2010-12-10 06:37

[QUOTE=Oddball;241055]"if msft develops a CUDA LLR program then it will be modestly more powerful (in terms of throughput) than an i7 -- just like LL testing.
[B]From a project admin's point of view, he'd rather GPUs did sieving than primality testing[/B] as it seems a GPU will greatly exceed (as opposed to modestly exceed) the thoughput of an i7."[/QUOTE]
The way I see it (from the perspective of a project admin), I figure it's nice to at least have the [i]ability[/i] to do both. As a case in point to why this would be of importance, currently NPLB and PrimeGrid are collaborating on a large (covering all k<10000) GPU sieving drive. With the combined GPU resources of our two projects, we blew through the n<2M range in no time at all--and the 2M-3M range is itself moving very rapidly. Yet the primary leading edges of both projects' LLR testing are below n=1M. We won't get to some of this stuff for years, by which time GPUs will likely be so much more advanced that much of the work done now will be a drop in the bucket compared to the optimal depth relative to the GPUs of then.

Right now, the only work available from k*2^n+-1 prime search projects for GPUs is sieving. Thus, in order to keep the GPUs busy at all, we have to keep sieving farther and farther up in terms of n, which becomes increasingly suboptimal the further we depart from our LLR leading edge. If we had the option of putting those GPUs to work on LLR once everything needed in the foreseeable future has been well-sieved, even if it's not quite the GPUs' forte, we could at least be using them for something that's needed, rather than effectively throwing away sieving work that can be done much more efficiently down the road.

Anyway, that's my $0.02...not trying to beat this to death on this end either.

MooMoo2 2010-12-10 06:58

[QUOTE=mdettweiler;241056]The way I see it (from the perspective of a project admin) I figure it's nice to at least have the [I]ability[/I] to do both. As a case in point to why this would be of importance, currently NPLB and PrimeGrid are collaborating on a large (covering all k<10000) GPU sieving drive. What with the combined GPU resources of our two projects, we blew through the n<2M range in no time at all--and the 2M-3M range itself is itself moving very rapidly. Yet the primary leading edges of both projects' LLR testing are below n=1M. We won't get to some of this stuff for years, after which GPUs will likely be so much advanced that much of the work done now will be a drop in the bucket compared to the optimal depth relative to the GPUs of then.

Right now, the only work available from k*2^n+-1 prime search projects for GPUs is sieving. Thus, in order to keep the GPUs busy at all, we have to keep sieving farther and farther up in terms of n, which becomes increasingly suboptimal the further we depart from our LLR leading edge. If we had the option of putting those GPUs to work on LLR once everything needed in the forseeable future has been well-sieved, even if it's not quite the GPUs' forte, we could at least be using them for something that's needed, rather than effectively throwing away sieving work that can be done much more efficiently down the road.
[/QUOTE]
You can direct the GPUs to the TPS forum if they're out of work :smile:

mdettweiler 2010-12-10 07:48

[QUOTE=MooMoo2;241058]You can direct the GPUs to the TPS forum if they're out of work :smile:[/QUOTE]
Indeed, that is an option. :smile: However, speaking solely from the perspective of a project admin (that is, trying to maximize the utilization of resources within my own project), it would seem worthwhile to have GPU LLR as an option--so that if (say) you have a participant who wants to contribute with his GPU at NPLB but is not particularly interested in TPS, he can still have useful work to do. (Or vice versa.)

CRGreathouse 2010-12-10 14:20

[QUOTE=Oddball;241054]There's this:
[URL]http://www.mersenneforum.org/showpost.php?p=213089&postcount=152[/URL]
"487W for GTX 295 under full load!"

The Phenom II system I have right now draws ~150 watts at full load.[/QUOTE]

I'm seeing 181 watts for the i7 under load. So for your claim "Even when comparing power needed per primality test, they are less efficient than core i7's and other recent CPUs" to hold, the GTX 295 needs to be less than 2.7 times faster than the i7 -- or 10.8 times faster than a single (physical) core. Is that so?

Oddball 2010-12-10 18:05

[QUOTE=CRGreathouse;241092]I'm seeing 181 watts for the i7 under load. So for your claim "Even when comparing power needed per primality test, they are less efficient than core i7's and other recent CPUs" to hold, the GTX 295 needs to be less than 2.7 times faster than the i7 -- or 10.8 times faster than a single (physical) core. Is that so?[/QUOTE]
Yes. See: [URL]http://mersenneforum.org/showpost.php?p=227433&postcount=293[/URL]

"in a worst-case (for the GPU) scenario, you still need all cores of your i7 working together to match its output! In a best-case scenario, it's closer to twice as fast as your CPU."

henryzz 2010-12-10 18:56

Please remember that your i7 CPU can keep running on most cores alongside the GPU app without much more power consumption.

Prime95 2010-12-10 19:16

[QUOTE=Oddball;241055]
[B]From a project admin's point of view, he'd rather GPUs did sieving than primality testing[/B] as it seems a GPU will greatly exceed (as opposed to modestly exceed) the thoughput of an i7."[/QUOTE]

This argument holds less water with this new CUDA program. As expected, IBDWT has halved the iteration times.

A different conclusion is also possible: Perhaps prime95's TF code is in need of optimization.

nucleon 2010-12-10 23:17

[QUOTE=CRGreathouse;241092]I'm seeing 181 watts for the i7 under load. So for your claim "Even when comparing power needed per primality test, they are less efficient than core i7's and other recent CPUs" to hold, the GTX 295 needs to be less than 2.7 times faster than the i7 -- or 10.8 times faster than a single (physical) core. Is that so?[/QUOTE]

My GTX 460 does 3.7 GHz-days per _hour_ of TF, or 88.8 GHz-days per day.

My quad-core i7-930 (by my estimation) does 12 GHz-days per day with all cores utilized.

My GPU is at least 7 times faster than my CPU running TF. Scaling the figures presented for the new GPU CUDA code, a GTX 480 looks to be about 4 times faster than my CPU on LL testing*.

I haven't checked power consumption fully populated.

-- Craig
*Take the 4-times figure with a grain of salt. Large margin of error. Lots of assumptions, and no way I can verify this figure for me personally.

ckdo 2010-12-11 00:25

[QUOTE=nucleon;241166]My 460GTX does 3.7GHz-days per _hour_ of TF, or 88.8 GHz-days per day.[/QUOTE]

That's lowish. My MSI N460GTX HAWK does around 107 GHzd/d on a Q6600 while the CPU is running 4 LLs on exponents in the 41M range (bad idea, I know). All at stock speed, that is, and with X fully responsive. There's probably plenty of room for improvement on my end.

nucleon 2010-12-11 04:27

I think it's the win7 gui. I have the full aero interface enabled.

As I move around to certain objects onscreen, I see the rate drop.

-- Craig

Brain 2010-12-12 10:52

Back to hardware
 
Assumption 1: The GTX 460 has lost almost all advanced computing features / double precision support (compared with 470/480)!?
Fact 1: The GTX 470/480 have only 25% of their possible double precision throughput (ref [URL="http://forums.nvidia.com/index.php?showtopic=164417&view=findpost&p=1028870"]here[/URL]) to protect the Tesla cards.

I always thought DP support was mandatory (for LL)?

Please check my facts and give me a hint as to what kind of graphics hardware fits Mersenne's needs best. ;-)

In my opinion, we do not need more power on the TF part. I'd really love to see CUDA P-1.

cheesehead 2010-12-12 21:34

[QUOTE=Brain;241414]
I always thought DP support was mandatory (for LL)?[/quote]Yes, the number of guard bits required renders SP too inefficient.
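
Rough arithmetic for anyone wondering: at the run length quoted earlier (N = 2,359,296 for p = 42,643,801) each word carries p/N, about 18 bits, so the cross-products in the squaring are already around 36 bits before the convolution sum and rounding error are accounted for. Double precision's 53-bit significand leaves just enough headroom (hence the maximum errors around 0.2 reported above); single precision's 24 bits would force drastically smaller words and correspondingly enormous FFT lengths.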

[quote]I'd really love to see CUDA P-1.[/QUOTE]Yes!

But how much memory is available for stage 2? My sparse understanding about GPUs is that some have on-board RAM, while others (or all) use some of main RAM. Would transfer speeds in the latter case be a bottleneck?

Mr. P-1 2010-12-12 22:07

[QUOTE=cheesehead;241470]But how much memory is available for stage 2? My sparse understanding about GPUs is that some have on-board RAM, while others (or all) use some of main RAM. Would transfer speeds in the latter case be a bottleneck?[/QUOTE]

If that's a problem, then just do stage one on the GPU.

Mini-Geek 2010-12-12 22:57

[QUOTE=cheesehead;241470]But how much memory is available for stage 2? My sparse understanding about GPUs is that some have on-board RAM, while others (or all) use some of main RAM. Would transfer speeds in the latter case be a bottleneck?[/QUOTE]

All GPUs that would likely be used for CUDA work (I'd be jumping to conclusions to say "all CUDA-capable GPUs" - but I think it's practically that) are parts of discrete graphics cards (as opposed to integrated into the motherboard or CPU), which always include on-board RAM. E.g. a current good pick in the $200 range is a GTX 460 GPU on a card with 1 GB of RAM. I don't see any reason, in principle, why most of this memory can't be used for P-1 stage 2.

Andrew Thall 2010-12-13 17:38

@Brain: Fact #1 is true but irrelevant. LL needs the double precision only for the FFT squaring; as I mentioned before, I get better timings from the Tesla 2050 over the GTX 480 only if I overclock it to the same core-processor speed (1400 MHz); otherwise, the greater number of processors on the GTX more than makes up for the better double precision performance. Surprising, but FFTs have to move a lot of data, too; don't assume that they're time bound by the double-precision multiplies, particularly with the way smart compilers reorder operations to get maximum hardware utilization.

KingKurly 2010-12-13 19:16

I take it that if I want to get in on this fun, I'll have to replace my ATI Radeon HD 5450 that came with my store-built PC? The AMD Phenom II X6 in it is doing great, but this GPU computation looks very promising and interesting. Obviously the ATI card is not going to do CUDA, that much I am aware of.

The machine is currently headless and primarily used for GIMPS. I guess I would be in the market for a new Nvidia card and perhaps a new power supply? (I'll have to look at what's in there, I haven't the faintest clue.)

Brain 2010-12-13 20:43

GTX 560
 
[QUOTE=KingKurly;241653]The machine is currently headless and primarily used for GIMPS. I guess I would be in the market for a new Nvidia card and perhaps a new power supply? (I'll have to look at what's in there, I haven't the faintest clue.) [/QUOTE]

I read the first reviews of Nvidia's new GTX 5XX series. They seem to have done a very good job, a lot better than with the GTX 4XX. But a 550W system power supply for a GTX 570 ([URL="http://www.nvidia.com/object/product-geforce-gtx-570-us.html"]ref[/URL]) is too much for me. Latest rumors say its small brother, the GTX 560, will arrive in late January 2011. That one could become the best choice (depending on its CUDA cores / capabilities).

[QUOTE=KingKurly;241653] I take it that if I want to get in on this fun, I'll have to replace my ATI Radeon HD 5450 that came with my store-built PC? The AMD Phenom II X6 in it is doing great, but this GPU computation looks very promising and interesting. Obviously the ATI card is not going to do CUDA, that much I am aware of. [/QUOTE]

I own an ATI 5770. --> No CUDA. :sad:

ixfd64 2010-12-29 04:05

Hmm, it's been over two weeks since there was any update. I hope this doesn't get swept under the rug.

ixfd64 2011-01-12 06:22

Any updates?

moebius 2011-03-09 05:00

I am very interested in the runtimes for a GTX 560 Ti compared to a Phenom II 955, Linux64.
How many hours will it take to prove primality of M4264380 on it?
I would of course make this card available for a test.

ckdo 2011-03-10 05:41

[QUOTE=moebius;254687]How many hours will it take to prove primality of M4264380 on it?
[/QUOTE]

Saving you the heartache: It's not prime. :no:

sdbardwick 2011-03-10 09:35

Just for clarification: 4264380 is not prime, so (2^4264380)-1 cannot be prime.

axn 2011-03-10 10:22

[QUOTE=ckdo;254768]Saving you the heartache: It's not prime. :no:[/QUOTE]

[QUOTE=sdbardwick;254777]Just for clarification: 4264380 is not prime, so (2^4264380)-1 cannot be prime.[/QUOTE]

All this for a simple typo? :huh:

ckdo 2011-03-10 15:00

[QUOTE=axn;254779]All this for a simple typo? :huh:[/QUOTE]

Yes. :smile:

If moebius receives any time estimates for an exponent an order of magnitude smaller than the one he's (supposedly) interested in, they aren't gonna help him that much, either, so hinting at what happened seemed like a reasonable route to follow.

Mini-Geek 2011-03-10 15:10

[QUOTE=ckdo;254803]Yes. :smile:

If moebius receives any time estimates for an exponent an order of magnitude smaller than the one he's (supposedly) interested in, they aren't gonna help him that much, either, so hinting at what happened seemed like a reasonable route to follow.[/QUOTE]

Considering the intended exponent is pretty obvious, it would be better if we could just say something to the effect of:
"M4264380[B]1[/B] would take X hours."
But the program isn't released, so nobody so far has really been able to answer it...

moebius 2011-03-10 19:20

[QUOTE=sdbardwick;254777]Just for clarification: 4264380 is not prime, so (2^4264380)-1 cannot be prime.[/QUOTE]

Too bad, I thought I had found the first even prime number ... haha

xilman 2011-03-10 19:39

[QUOTE=moebius;254826]Too bad, I thought I had found the first even prime number ... haha[/QUOTE]ITYM the [i]second[/i] even prime number. HTH, HAND.

Paul

moebius 2011-03-10 20:23

[QUOTE=xilman;254828]ITYM the [I]second[/I] even prime number. HTH, HAND.

Paul[/QUOTE]

I do not think in such a trivial range of numbers.

Christenson 2011-03-13 04:37

Back to King Kurly's question: Which GPU should be in my future, assuming my budget is only $500 or so? 100GHz days per day makes my Phenom II x6 seem, well, puny!

Ken_g6 2011-03-13 04:55

Depends on your PSU, case, board slots, and cooling. But the 560 is out and does look good now.

I'm wondering what happened to Andrew? [thread=15195]Saturday, March 5[/thread] has come and gone. Any news?

Christenson 2011-03-13 16:50

Target System:
ASRock 880 GM/LE Mobo with one PCI express X16 slot and mechanical space for a double-width card. Power supply will upgrade to support the fans, need to add an internal fan for one chip anyway. Dual output video on-board, running 64-bit ubuntu.
Which card gives the best bang for the buck?

diep 2011-04-29 18:40

[QUOTE=Christenson;255074]Target System:
ASRock 880 GM/LE Mobo with one PCI express X16 slot and mechanical space for a double-width card. Power supply will upgrade to support the fans, need to add an internal fan for one chip anyway. Dual output video on-board, running 64-bit ubuntu.
Which card gives the best bang for the buck?[/QUOTE]

Where AMD GPUs suck for integers, for double precision work the 6990 is unrivalled by Nvidia.

That 6990 is just a few euro over 500 here and it's 1.2 Tflop double precision or so.

Wouldn't be too hard to port the CUDA code to OpenCL, as the programming models are similar.

How fast is the current code compared to CPUs doing LL?

Regards,
Vincent

diep 2011-04-29 18:49

[QUOTE=Prime95;241135]This argument holds less water with this new CUDA program. As expected, IBDWT has halved the iteration times.

A different conclusion is also possible: Perhaps prime95's TF code is in need of optimization.[/QUOTE]

Right when I looked at your assembler I realized there were some optimizations possible for today's CPUs.

A first simple optimization is using a bigger prime base to remove composite FC's. I ran some statistics on the overhead and concluded a prime base of around 500k for generating FC's makes the most sense; with a 64K buffer you can hold 512k bits and have already removed a bunch outside it, so a 500k-sized prime base always has a 'hit' on the write buffer.

The prime base you can also store efficiently, using 1 byte per prime for the distance from each prime to the next one (which is just about that size), with another 24 bits then left for storing where that prime previously hit the buffer.
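
In plain C the packing would look something like this (names made up, purely to illustrate the layout):

[code]
/* One 32-bit word per base prime: low 8 bits = gap to the next base prime,
   high 24 bits = bit offset of its previous hit in the sieve buffer.
   (Gaps between consecutive primes below ~8M are well under 256, so a byte is enough.) */
typedef unsigned int packed_prime;

static packed_prime pack(unsigned gap, unsigned last_hit) { return (gap & 0xffu) | (last_hit << 8); }
static unsigned gap_of(packed_prime x)      { return x & 0xffu; }
static unsigned last_hit_of(packed_prime x) { return x >> 8; }
[/code]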

You can in fact make the prime base quite a bit bigger than the prime buffer, as you can keep track of which base prime will hit which buffer and throw them in a bucket for the buffer where they'll be useful. But that would slow down the data structure quite a tad.

This weeding out of more composite FC's is quite complicated when generating FC's inside the GPU, as I intend to do.

Then the speed of the actual comparisons is hard for me to judge, yet when I see that a single run at 61 bits here with Wagstaff (using your assembler) already takes half an hour at around 8M-12M sizes, there are definitely improvements possible as well.

But even after improving all this, of course GPUs slam the CPUs here, as every GPU unit can multiply, while on CPUs only one execution unit per core can.

lycorn 2011-04-30 11:57

[I]Does anybody here remember Andrew Thall?[/I]
[I]Remember how he said[/I]
[I]We would crunch so fast[/I]
[I]On a sunny day?[/I]
[I]Andrew, Andrew,[/I]
[I]What has become of you?[/I]
[I]Does anybody else in here[/I]
[I]Feel the way I do?[/I]

Adapted from... (?)

xilman 2011-04-30 12:27

[QUOTE=lycorn;260009][I]Does anybody here remember Andrew Thall?[/I]
[I]Remember how he said[/I]
[I]We would crunch so fast[/I]
[I]On a sunny day?[/I]
[I]Andrew, Andrew,[/I]
[I]What has become of you?[/I]
[I]Does anybody else in here[/I]
[I]Feel the way I do?[/I]

Adapted from... (?)[/QUOTE]Is there anybody out there? I can feel one of my turns coming on. Don't look so frightened, this is just a passing phase, one of my bad days.

Paul

P.S. BTW, ITYM "One sunny day".

lycorn 2011-04-30 12:55

Yep, you got it!

[QUOTE=xilman;260012]
P.S. BTW, ITYM "One sunny day".[/QUOTE]

Actually, it's "Some sunny day".

Also, it should be "Remember how he said [U]that[/U]"

Karl M Johnson 2011-04-30 14:14

[QUOTE=diep;259952]Where for integers AMD gpu's suck...[/QUOTE]
You don't have the slightest idea what you are talking about.
A Radeon 5970 can do 2.32 TIPS.
A GTX 590 can do 1244.16 GIPS.
A Radeon 5870 can do 1.36 TIPS, and you can buy up to 3 for the price of a GTX 590.
Not bad, eh?

diep 2011-04-30 15:05

[QUOTE=Karl M Johnson;260021]You dont have the slightest idea what are you talking about.
A Radeon 5970 can do 2.32 TIPS.
A GTX 590 can do 1244.16 GIPS.
A Radeon 5870 can do 1.36 TIPS, you can buy up to 3 for the price of GTX 590.
Not bad, ehh ?[/QUOTE]

The problem is that you cannot easily get the top bits on AMD.

So if you multiply 24 x 24 bits, you get the least significant bits within 1 cycle (throughput latency), yet it needs 4 PEs to get just the 16 top bits, and 5 PEs in the case of the 5000 series.

So it is 5 cycles (throughput latency) to get the 64 bits of output spread over 2 integers, and 6 cycles for the 5000 series cards.

If you want to multiply using 32-bit integers (filling them up with 31 bits, for example), you will need a lot of patience, as that's 2 slow instructions; it requires 8 cycles throughput latency. So with respect to AMD you can divide your numbers by 8 for the 6000 series and by 10 for the 5000 series.

So the fastest way to multiply on AMD GPUs to emulate 70 bits of precision is to use the least significant 32 bits of the fast 24-bit multiplications. So you can store 14 bits of information in each integer.

This should run fast both on the 6000 series as well as on the 5000 series.

A full multiplication using multiply-add then takes 25 multiply-adds, and with all other overhead counted I come to 69 fast instructions. So for throughput that is 69 cycles for a 70 x 70 bit multiplication.
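
In plain C the limb scheme looks roughly like this (illustrative only; on the GPU every multiply below would be a fast 24-bit mul):

[code]
/* 70-bit operands as five 14-bit limbs: every partial product is a 24x24-style
   multiply of at most 28 bits, and 25 multiply-adds cover the whole grid. */
void mul70(const unsigned a[5], const unsigned b[5], unsigned r[10]) {
    unsigned col[9] = {0};
    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 5; j++)
            col[i + j] += a[i] * b[j];   /* at most five 28-bit products per column, fits in 32 bits */
    unsigned carry = 0;
    for (int k = 0; k < 9; k++) {        /* renormalize back to 14-bit limbs */
        unsigned v = col[k] + carry;
        r[k]  = v & 0x3fff;
        carry = v >> 14;
    }
    r[9] = carry;                        /* 10 limbs x 14 bits = the full 140-bit product */
}
[/code]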

On Nvidia, on the other hand, you can use 24x24 bits == 48 bits, so you can use 3 integers for that. That's a lot quicker.

That is where AMD GPUs lose it big time to Nvidia at the moment.

This was also unexpected for me, as old AMD GPUs had 40 bits internally available within 1 cycle. You don't expect, then, that getting the top 16 bits is so slow.

Now we didn't speak yet about add-with-carry, which AMD doesn't have either; there you'll lose another few dozen percent if you want to achieve 72 bits. I knew that already before I started investigating all this, and losing 20% is of course no big deal if you have that many TIPS available.

So for trial factoring a GTX 590 should achieve roughly 800M/s, where a 6990 can achieve, according to my calculation, a max of around 500M/s. A 5970 will achieve nothing there of course, as the 2nd GPU is not supported by AMD for OpenCL (which sucks incredibly, as OpenCL is the only programming language supported right now).

My theoretical calculation is that it would be possible to achieve 270M/s on my Radeon HD6970 if you can perfectly load all PEs with instructions without interruption; that last is rather unlikely, yet I'll try :)

The interesting thing then, when I program it all in simple instructions, will be to see what IPC the code achieves on the 5000 series versus the 6000 series. Probably the 6990 will be the one breaking even best if you add power costs as well, but that's for later to figure out :)

There's no discussion about which card is fastest here: for TF that'll be the GTX 590 from Nvidia.

Karl M Johnson 2011-04-30 15:24

You've got a point; it's far easier to optimize a CUDA application than a CAL/OpenCL one.
And AMD's SDK has bugs which haven't been fixed for at least a quarter of a year.

However, that doesn't mean it's impossible.
A fabulous example is OCLHashcat, written by Atom.
It's used in hash reversal (cracking).
Since hashing passwords deals with integers, AMD GPUs win here.
I've also read that AMD GPUs have native instructions which help in that area, such as bitfield insert and bit align.

Another example is the RC5 GPU clients of distributed.net.


The last I heard from Andrew Thall was on the 16th of February.

diep 2011-04-30 15:34

[QUOTE=Karl M Johnson;260026]You've got point, it's far easier to optimize a CUDA application than a CAL/OpenCL one.
And AMD's SDK has bugs which havent been fixed for at least a quarter of a year.

However, that doesn't mean it's impossible.
A fabulous example is OCLHashcat, written by Atom.
It's used in hash reversal(cracking).
Since hashing passwords deals with integers, AMD gpus win here.
I've also read that AMD GPUs have native instructions, which help in that area, such as bitfield insert and bit align.[/QUOTE]

Yes, it is impossible to get a Radeon HD6970 faster than the rough estimate of 270M/s that I wrote here when using simple instructions. A GTX 580 I'd expect to be 33% faster than that. Oliver's numbers are of course based upon Teslas that achieve the optimum (305M/s), and the gamer cards probably won't do that.

The only escape to speed things up, which means moving to the 3 x 24 bit implementation for 69 bits, would be if the GPU's native MULHI_UINT24 instruction were 1 cycle throughput latency.

OpenCL doesn't support that instruction. The OpenCL specs were created by an ex-ATI guy, so if that instruction had been faster than the 32x32 bit mul_hi, obviously it would have been in the OpenCL 1.1 specs :)

There is 1 report of a guy, possibly an AMD engineer, reporting that MULHI_UINT24 is in reality cast onto the 32x32 bit mul_hi, which is 4 cycles on the 6000 series and 5 cycles on the 5000 series.

I'm still awaiting an official answer from the AMD helpdesk there. No answer means of course guilty.

diep 2011-04-30 15:42

[QUOTE=Karl M Johnson;260026]You've got point, it's far easier to optimize a CUDA application than a CAL/OpenCL one.
And AMD's SDK has bugs which havent been fixed for at least a quarter of a year.

However, that doesn't mean it's impossible.
A fabulous example is OCLHashcat, written by Atom.
It's used in hash reversal(cracking).
Since hashing passwords deals with integers, AMD gpus win here.
I've also read that AMD GPUs have native instructions, which help in that area, such as bitfield insert and bit align.

Another example is RC5 GPU clients of distributed.net.


The last I've heard from Andrew Thall was on 16 of February.[/QUOTE]

It is more work to optimize CUDA, by the way. Please try to dig up which instruction set that GPU has.

You won't find it.

It's not there.

At least the crappy AMD documentation *does* show the instruction set it has. So all you need to dig up then is whether an instruction needs all PEs of a stream core to execute or whether it's a simple fast instruction.

Usually the answer is very simple: there are just a few fast instructions.

With Nvidia it's more complicated, as you don't know which instructions it supports in hardware nor how fast those are.

With AMD the answer is dead simple: just a handful of instructions are fast; basically the half dozen instructions they mention are fast and all the rest isn't.

So OpenCL programming is *far* *far* easier for guys like me.

diep 2011-04-30 16:03

[QUOTE=Karl M Johnson;260026]You've got point, it's far easier to optimize a CUDA application than a CAL/OpenCL one.
And AMD's SDK has bugs which havent been fixed for at least a quarter of a year.

However, that doesn't mean it's impossible.
A fabulous example is OCLHashcat, written by Atom.
It's used in hash reversal(cracking).
Since hashing passwords deals with integers, AMD gpus win here.
I've also read that AMD GPUs have native instructions, which help in that area, such as bitfield insert and bit align.

Another example is RC5 GPU clients of distributed.net.
The last I've heard from Andrew Thall was on 16 of February.[/QUOTE]

Everywhere that you do not need integer multiplication, or can limit it to 16 bits max, AMD will of course be faster, and by a lot. Which is why I bought the GPU, besides the fact that too many people already toy with CUDA, like I once also did.

However, my next project after this will be factorisation, and after a few attempts there I'll investigate (and not implement, if nothing comes out of the investigation that can work fast) a fast multiplication of the million-bit numbers on the GPUs. We've got trillions of instructions per second available on those things, yet what's really needed is a new transform that can deal with the fact that the RAM in no manner can keep up with the processing elements. So far I have only investigated integer transforms there, and the bad news so far is that exactly there you always need some sort of integer multiplication in the end, where Nvidia is faster.

So if I succeed in creating an integer-based multiplication that can avoid non-stop streaming to and from the RAM and bundle the work more efficiently within that 32KB shared cache (LDS) that the PEs share, then that will be most interesting. Yet Nvidia might also be fast there, for the same reason that AMD sucks at integer multiplication (the top bits).

It seems for now that the AMD GPUs are totally floating-point optimized, if I may say so, more than Nvidia. And I'm not interested in floating point at all, with all its rounding errors and inefficient usage of the bits :)

xilman 2011-04-30 16:25

[QUOTE=lycorn;260014]Yep, you got it!.



Actually, it´s "Some sunny day".

Also, it should be "Remember how he said [U]that[/U]"[/QUOTE]True. Shows I was working from memory instead of using Google.

Paul

Prime95 2011-04-30 16:41

His paper is available: [url]http://www.ece.neu.edu/groups/nucar/GPGPU4/files/thall.pdf[/url]

Should give msft enough information to upgrade CUDALucas. I had originally thought Thall's 2x improvement was due to IBDWT. However, msft and Thall both use IBDWT in their programs. The nearly 2x came from using non-power-of-2 FFT lengths. I'd also hoped Thall had improved on Nvidia's FFT library, but that is not the case.
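
Rough numbers behind that: at ~18 bits per word, M42643801 needs a run length of about 2.36M. The non-power-of-2 choice 2,359,296 = 2^18*3^2 fits that almost exactly, while a power-of-2-only code has to jump to 2^22 = 4,194,304, about 1.78x longer, which accounts for most of the near-2x gap.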

diep 2011-04-30 17:40

Maybe silly question but where's Thall's CUDA code?

Regards,
Vincent

TheJudger 2011-04-30 20:10

Hi Vincent

[QUOTE=diep;260027]Olivers numbers are of course based upon Tesla's that achieve optimal (305M/s) and the gamerscards probably won't do that.
[/QUOTE]

Wrong, 305M/s are for my stock GTX 470 for M66.xxx.xxx and factor candidates below 2^79. A Tesla 20x0 is actually slower for mfaktc because of its slightly lower clock.

Oliver

ixfd64 2011-04-30 20:17

[QUOTE=Prime95;260032]His paper is available: [url]http://www.ece.neu.edu/groups/nucar/GPGPU4/files/thall.pdf[/url]

Should give msft enough information to upgrade CudaLucas. I had originally though Thall's 2x improvement was due to IBDWT. However, msft and Thall both use IBDWT in their programs. The nearly 2x came from using non-power-of-2 FFT lengths. I'd also hoped Thall had improved on Nvidia's FFT library, but that is not the case.[/QUOTE]

I wonder if Prime95 could benefit from this information.

diep 2011-04-30 23:29

[QUOTE=TheJudger;260052]Hi Vincent



Wrong, 305M/s are for my stock GTX 470 for M66.xxx.xxx and factor candidates below 2^79. A Tesla 20x0 is actually slower for mfaktc because of its slightly lower clock.

Oliver[/QUOTE]

Ah yes, thanks for the update. This is in fact better than I said, because obviously a 66M+ exponent has more bits than the 8M range I was calculating for. It's 3 bits more than the 24 at 8M.

All my calculations are for the range we're currently busy TFing, which is slightly above 8M.

The extrapolation to 800M/s I had done correctly for your code on a GTX 590, provided you have the bandwidth :)

Add to that the 3 bits. Maybe your 72-bit kernel is also faster than the 79-bit one @ 8M bits?

diep 2011-04-30 23:38

[QUOTE=ixfd64;260054]I wonder if Prime95 could benefit from this information.[/QUOTE]

Would be cool if it's the 9-year-old kids that are going to kick butt there rather than clusters from universities with ECC.

Didn't the last so many Mersennes mainly get found by university machines equipped with ECC and everything?

Note that double-checking the GPUs is probably not a luxury.

Regards,
Vincent

msft 2011-05-01 02:59

[QUOTE=Prime95;260032]His paper is available: [url]http://www.ece.neu.edu/groups/nucar/GPGPU4/files/thall.pdf[/url]
[/QUOTE]Good information.
Thank you,

RichD 2011-09-25 04:01

Seems like his timetable is slipping. :)

[QUOTE]Summer 2011

Presented the gpuLucas work at GPGPU 4 at Newport Beach in March. I am currently working
with my research students (rising sophomores through the PRISM grant) to get gpuLucas ready
for public release in mid-August.[/QUOTE]

Pulled from this [URL="http://andrewthall.org/"]page[/URL].

Karl M Johnson 2011-11-30 15:49

Any updates ?

RichD 2012-02-23 19:25

Just received a note from Andrew Thall and he is releasing his gpuLucas program at [url]https://github.com/Almajester/gpuLucas[/url].

He claims it is still pretty ugly research code but between the ReadMe file, internal documentation and his [URL="http://andrewthall.org/papers/gpuMersenne2011MKII.pdf"]paper[/URL], that should be enough to make a working copy.

It appears the program was developed under Windows 7 using Visual C++ in Visual Studio 2008. I may play with it (time permitting) to see if I can get a working version under Linux.

aaronhaviland 2012-02-24 03:49

[QUOTE=RichD;290603]Just received a note from Andrew Thall and he is releasing his gpuLucas program at [URL]https://github.com/Almajester/gpuLucas[/URL].

I may play with it (time permitting) to see if I can get a working version under Linux.[/QUOTE]

Very interesting!

I've managed to get a linux version working, myself. (Had a bit of trouble with #include <qd/dd_real.h> being included under nvcc compilation.)

Observations: currently the number to test and the FFTlen are hard-coded; there is no checkpoint file; it does not bail/restart/change FFTlen if the error is too great; and there is no residue output for non-primes.

However, after a couple tests, it does seem to be a fair bit faster than CUDALucas: estimated runtime for M(26xxxxxx) using the same FFT size (1572864) is about 47 hrs in CUDALucas, and 40 hrs in gpuLucas (I've actually gotten it down to 36 hrs by fine tuning FFT size, and T_PER_B), but that's just [I]estimated[/I] run-time...

RichD 2012-02-24 04:08

Hey, that's great!!!

I found the QD package at [url]http://crd-legacy.lbl.gov/~dhbailey/mpdist/[/url] but then I ran into another problem before getting sidetracked.

Your observations are what I was expecting (unfortunately).

I think [B]TheJudger[/B] has done a lot of work on threads per block (T_PER_B) in his mfaktc program. Might need to be tuned for each card.

There is a lot of work that still needs to be done before it can be accepted by the community. Or maybe just the ideas present in the code could be used in existing programs. ??

Dubslow 2012-02-24 04:14

A hybrid of GL and CL? (Oh, those are such unfortunate acronyms.)

frmky 2012-02-25 00:29

Yes, gpulucas appears considerably faster. On a GTX 480, for 43122609 using a 2304K FFT, gpulucas claims to require 51.2 hours and CUDALucas 1.58 claims to require 63.7 hours. Of course both of these are ETA's and not actual runtimes, but that's a nearly 20% difference.

TheJudger 2012-02-25 00:50

Hi,

[QUOTE=RichD;290663]I think [B]TheJudger[/B] has done a lot of work on threads per block (T_PER_B) is his mfaktc program. Might need to be tuned for each card.[/QUOTE]

*hmm* not really. Actually "threads per block" is currently fixed at 256 in mfaktc. When I chose this number I did some tests with other values: 512 runs out of registers on CC 1.1 GPUs, and for other GPUs it does not really make any difference whether it's 128, 256 or 512. The more important number for mfaktc is the number of threads per grid, but this might be special to mfaktc, not all CUDA applications.

Oliver

msft 2012-02-25 23:05

Hi,
It works on Linux.
I think the compile options are important.
Makefile
[code]
NVIDIA_SDK = $(HOME)/NVIDIA_GPU_Computing_SDK
gpuLucas: gpuLucas.o
g++ -fPIC -o gpuLucas gpuLucas.o -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 $(NVIDIA_SDK)/C/lib/libcutil_x86_64.a -lqd -lcufft -lm
gpuLucas.o: gpuLucas.cu
/usr/local/cuda/bin/nvcc -O3 -use_fast_math -gencode arch=compute_20,code=sm_20 --compiler-options="-fno-strict-aliasing" -w -I. -I/usr/local/include -I$(NVIDIA_SDK)/C/common/inc gpuLucas.cu -arch=sm_13 -c
clean:
-rm *.o gpuLucas
[/code]
GTX-550Ti
[code]
[0/50]: iteration 4300: max abs error = 0.226562
[0/50]: iteration 4300: max Bit Vector = 39.000000
Time to rebalance llint: 1.936 ms

Time to rebalance and write-back: 821.3 ms

Timing: To test M43112609
elapsed time : 75901 msec = 75.9 sec
dev. elapsed time: 143860 msec = 143.9 sec
est. total time: 620216064 msec = 620216.1 sec

Beginning full test of M43112609
[/code]
CUDALucas
[code]
$ ./CUDALucas 43112609
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.58 (2:35 real, 15.4797 ms/iter, ETA 185:19:36)
[/code]

science_man_88 2012-02-25 23:25

[QUOTE=msft;290911]Hi ,
Work on linux.
I think compile option is important.
Makefile
[code]
NVIDIA_SDK = $(HOME)/NVIDIA_GPU_Computing_SDK
gpuLucas: gpuLucas.o
g++ -fPIC -o gpuLucas gpuLucas.o -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 $(NVIDIA_SDK)/C/lib/libcutil_x86_64.a -lqd -lcufft -lm
gpuLucas.o: gpuLucas.cu
/usr/local/cuda/bin/nvcc -O3 -use_fast_math -gencode arch=compute_20,code=sm_20 --compiler-options="-fno-strict-aliasing" -w -I. -I/usr/local/include -I$(NVIDIA_SDK)/C/common/inc gpuLucas.cu -arch=sm_13 -c
clean:
-rm *.o gpuLucas
[/code]
GTX-550Ti
[code]
[0/50]: iteration 4300: max abs error = 0.226562
[0/50]: iteration 4300: max Bit Vector = 39.000000
Time to rebalance llint: 1.936 ms

Time to rebalance and write-back: 821.3 ms

Timing: To test M43112609
elapsed time : 75901 msec = 75.9 sec
dev. elapsed time: 143860 msec = 143.9 sec
est. total time: 620216064 msec = 620216.1 sec

Beginning full test of M43112609
[/code]
CUDALucas
[code]
$ ./CUDALucas 43112609
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.58 (2:35 real, 15.4797 ms/iter, ETA 185:19:36)
[/code][/QUOTE]
looks like the difference is about 13:02:40

msft 2012-02-26 01:50

[QUOTE=science_man_88;290913]looks like the difference is about 13:02:40[/QUOTE]
Indeed.
[QUOTE=aaronhaviland;290657] and there is no residue output for non-primes.
[/QUOTE]
The residue is available,
but not the same as mprime's.
[code]
M_1215421 tests as non-prime.
M_1215421, 0xfd93939b00a071bf, n = 65536, gpuLucas
[/code]
mprime:
[code]
[Work thread Feb 25 18:53] M1215421 is not prime. Res64: FE93935B009871C0. We8: 5EAF771A,140242,00000000
[/code]
Each h_signalOUT[] value is -1 or +1.

aaronhaviland 2012-02-26 13:47

[QUOTE=msft;290918]residue is available.
but not same mprime.
[code]
M_1215421 tests as non-prime.
M_1215421, 0xfd93939b00a071bf, n = 65536, gpuLucas
[/code]mprime:
[code]
[Work thread Feb 25 18:53] M1215421 is not prime. Res64: FE93935B009871C0. We8: 5EAF771A,140242,00000000
[/code]each h_signalOUT[] value -1 or +1.[/QUOTE]

Residue output wasn't available when I posted that. Since then, I had submitted a pull request with basic residue output, and a basic Makefile, which has been merged.

As far as the residue output: It actually works fine on my end (linux, 64-bit, gtx460):
[CODE]M_1215421, 0xfe93935b009871c0, n = 65536, gpuLucas
M_1215421, 0xfe93935b009871c0, n = 61440, gpuLucas
[/CODE]I've tested it now with several different testIntegers... and different FFT lengths.

msft 2012-02-26 18:59

[QUOTE=aaronhaviland;290949]Residue output wasn't available when I posted that. Since then, I had submitted a pull request with basic residue output, and a basic Makefile, which has been merged.
[/QUOTE]
Understood.
Apologies.

aaronhaviland 2012-02-27 03:56

[QUOTE=msft;290971]understand.
apologies.[/QUOTE]
No need to apologise :)

I'm continuing to work on his code on my fork: [URL]https://github.com/ah42/gpuLucas[/URL]

RichD 2012-02-27 05:17

[QUOTE=aaronhaviland;291003]I'm continuing to work on his code on my fork: [URL]https://github.com/ah42/gpuLucas[/URL][/QUOTE]

Thank you for taking a vested interest in gpuLucas. I am glad someone with more time than me can continue the review/development. I felt this was too valuable an asset to be forgotten by the GIMPS community. Thus, my friendly "prodding" of Prof. Thall.

aaronhaviland 2012-03-02 04:06

1 Attachment(s)
current progress: I'm comfortable calling this an alpha version, and have given it its first version number, 0.9.0 (tagged in git: [URL]https://github.com/ah42/gpuLucas/tags[/URL])
I have no idea if the windows build files still work, and I'm fairly sure this will only run on 64-bit.
I have not yet implemented checkpointing, that is my next goal. Not having checkpoints, I haven't yet run larger exponents to completion.

Currently verified primes: 1279, 2203, 110503, 859433
Verified non-prime residues: 1061, 10771, 106121, 1061069

ChangeLog:
[SIZE=1]Aaron Haviland <orion@parsed.net> 2012-03-01

* Update version to 0.9.0. First versioned commit
* [B][I]Add autodetection of optimal FFT signalSize.[/I][/B] Can test exponents as low as 1000 (verified residue with M(1061))
* [B][I]Auto-select setSliceAndDice[/I][/B] depending on testPrime and signalSize. May need to be further tuned.
* Reduce T_PER_B to 512 to better fit more blocks on GPUs

Aaron Haviland <orion@parsed.net> 2012-02-29

* Add -d flag to choose CUDA device, default to device 0
* Add some more kernel failure checks
* Clean up compiler warnings
* Makefile: more verbose. Lower registers-per-kernel to 20, fits better.
* Fix __cufftSafeCall: break statements had been accidentally omitted

Aaron Haviland <orion@parsed.net> 2012-02-26

* Add help option -h
* [B][I]Introduce getopt support for passing command-line options.[/I][/B]
* Wrap verbose startup messages inside opt_verbose and add verbosity flag.
* Abort if error goes too high in errorTrial
* Remove generated files
* Remove external dependency on cutil as NVIDIA recommends against using it, and it was the only thing from the SDK we needed.

Aaron Haviland <orion@parsed.net> 2012-02-24

* Add residue-printing support for non-primes (based on rw.c from the mers package). Verified to work on 64-bit Linux.
* Add a basic Makefile for building on Linux[/SIZE]


Attachment: compiled for 64-bit Linux, with CUDA 3.2. CUDA libraries not included.

aaronhaviland 2012-03-02 19:27

v0.9.1: Checkpointing added! It took less time than I thought it would. (inspired by checkpointing code in recent versions of CUDALucas)

Checkpoints are incompatible with CUDALucas.

flashjh 2012-03-02 19:58

Just wondering since I haven't used gpuLucas... is this project set to replace CUDALucas or the other way around?

Is it feasible to combine the best of both into one, so as not to spend time on two separate projects with the same goal?

Maybe aaronhaviland and msft can talk about it?

aaronhaviland 2012-03-03 15:00

I don't believe that its intention is to replace CUDALucas at all; rather, it's an alternative route to the same goal. As for merging the two codebases, I'm not sure how possible that is, since, although at the core they both do FFT->Multiplication->IFFT, the supporting maths around them are quite different (which is why I intentionally made the checkpoint files incompatible).

Mr. Thall has done some great work getting the maths to this point, and I think gpuLucas as it stands could see quite a bit of improvement still, and would like to keep working on it independently of CUDALucas.

With the recent improvements in CL, when running with the same FFT lengths, it is only 2% slower than gpuLucas (rather than the larger speedups reported before). However, it does appear that gpuLucas is capable of running with smaller FFT lengths, (where CL bails out due to potential round-off error) thereby increasing its lead again (to about an estimated 9% quicker on this 26xxxxxx exponent I'm currently testing)

All of that said, I think there are things that can be learned from both of these endeavours (as well as some other GPGPU applications I've dug into), as far as portability and best practices for optimising for multiple GPU architectures, and I plan to keep working on gpuLucas (or possibly renaming it as a third derivative work, due to gpuLucas being BSD-licensed but other code being GPL'd).

Prime95 2012-03-04 19:54

Minor bug: You are checkpointing every 1000 iterations, not 10000.


Note: every 10000 iters is every 40 seconds or so? That seems excessive. Prime95 writes one every half hour.

aaronhaviland 2012-03-05 02:58

[QUOTE=Prime95;291890]Minor bug: You are checkpointing every 1000 iterations, not 10000.


Note: every 10000 iters is every 40 seconds or so? That seems excessive. Prime95 writes one every half hour.[/QUOTE]

I agree, I think it is excessive (even at 10,000), but thanks for pointing out my typo. I had left it low for debugging purposes, and I hadn't yet looked at what a sane frequency should be.

I've just committed configurable checkpointing, defaulting to 10,000 iterations, and changed some of the output formats so it matches CL a bit more: it now prints residue output at checkpointing time (unless running in quiet mode with -q).

I'm a little concerned about the residue mismatch I just got on M(26171441), but since I had restarted it several times, and changed a few things, including the checkpoint format itself, it was most likely my fault. I'm re-starting the test... it's using an FFT length with a high round-off error (around 0.37) so I can test out what an acceptable round-off error should be with this method. (So far, residues are matching CUDALucas through around 250,000 iterations.)
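
For reference, the round-off check being discussed is just distance-to-nearest-integer after the inverse transform; a sketch (not necessarily gpuLucas's exact test):

[code]
#include <math.h>

/* After the inverse FFT every element should be near an integer; the largest distance
   to the nearest integer is the round-off error, and somewhere around 0.4 is the usual
   panic threshold before switching to a larger FFT length. */
double max_roundoff(const double *x, int n) {
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double err = fabs(x[i] - rint(x[i]));
        if (err > worst) worst = err;
    }
    return worst;
}
[/code]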

