
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   Fast Mersenne Testing on the GPU using CUDA (https://www.mersenneforum.org/showthread.php?t=14310)

diep 2011-04-30 15:42

[QUOTE=Karl M Johnson;260026]You've got a point, it's far easier to optimize a CUDA application than a CAL/OpenCL one.
And AMD's SDK has bugs which haven't been fixed for at least a quarter of a year.

However, that doesn't mean it's impossible.
A fabulous example is OCLHashcat, written by Atom.
It's used in hash reversal (cracking).
Since hashing passwords deals with integers, AMD GPUs win here.
I've also read that AMD GPUs have native instructions which help in that area, such as bitfield insert and bit align.

Another example is the RC5 GPU clients of distributed.net.


The last I heard from Andrew Thall was on the 16th of February.[/QUOTE]

It is more work to optimize for CUDA, by the way. Try to dig up which instruction set that GPU has.

You won't find it.

It's not there.

At least the crappy AMD documentation *does* show the instruction set it has. So all you need to dig up is whether it takes all the PEs of a stream core to execute a given instruction, or whether it's a simple, fast instruction.

Usually the answer is very simple: there are just a few fast instructions.

With Nvidia it's more complicated, as you don't know which instructions it supports in hardware, nor how fast they are.

With AMD the answer is dead simple: just a handful of instructions are fast. Basically, the half-dozen instructions they mention are fast and all the rest aren't.

So OpenCL programming is *far*, *far* easier for guys like me.

diep 2011-04-30 16:03

[QUOTE=Karl M Johnson;260026]You've got a point, it's far easier to optimize a CUDA application than a CAL/OpenCL one.
And AMD's SDK has bugs which haven't been fixed for at least a quarter of a year.

However, that doesn't mean it's impossible.
A fabulous example is OCLHashcat, written by Atom.
It's used in hash reversal (cracking).
Since hashing passwords deals with integers, AMD GPUs win here.
I've also read that AMD GPUs have native instructions which help in that area, such as bitfield insert and bit align.

Another example is the RC5 GPU clients of distributed.net.
The last I heard from Andrew Thall was on the 16th of February.[/QUOTE]

Everywhere you do not need integer multiplication, or can limit it to 16 bits at most, AMD will of course be faster, and by a lot. That is why I bought the GPU, besides the fact that too many people already toy with CUDA, as I once did too.

However, my next project after this one will be factorisation, and after a few attempts there I'll investigate (and not implement, if nothing workable comes out of the investigation) a fast multiplication of million-bit numbers on the GPUs. We've got trillions of instructions per second available on those things, yet what's really needed is a new transform that can deal with the fact that the RAM in no way can keep up with the processing elements. So far I have only investigated integer transforms there, and the bad news so far is that exactly there you always need some sort of integer multiplication in the end, where Nvidia is faster.

So if I succeed in creating an integer-based multiplication that avoids non-stop streaming to and from RAM and instead bundles the work more efficiently within the 32 KB shared cache (LDS) that the PEs share, that will be most interesting. Yet Nvidia might also be fast there, for the same reason that AMD is weak at integer multiplication (the top bits).

It seems for now that the AMD GPUs are totally floating-point optimized, if I may say so, more than Nvidia. And I'm not interested in floating point at all, with all its rounding errors and inefficient usage of the bits :)
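[Editor's note: since the posts above keep circling around integer transforms for big-number multiplication, here is a minimal CPU-side sketch of the idea in Python: a number-theoretic transform (NTT) multiplying large integers via convolution. The modulus 998244353, the base-256 limbs, and the function names are illustrative choices, not anything from diep's or Thall's code; the GPU version would additionally have to keep the working set inside the LDS, which is exactly the hard part described above.]

```python
# Sketch of NTT-based big-integer multiplication -- the kind of integer
# transform discussed in the thread. Modulus and limb size are illustrative.
MOD = 998244353   # NTT-friendly prime: 119 * 2^23 + 1
ROOT = 3          # primitive root mod MOD

def ntt(a, invert=False):
    """In-place iterative NTT over Z/MOD; len(a) must be a power of two."""
    n = len(a)
    j = 0
    for i in range(1, n):             # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w = pow(w, MOD - 2, MOD)  # inverse root of unity
        for i in range(0, n, length):
            wn = 1
            for k in range(i, i + length // 2):
                u, v = a[k], a[k + length // 2] * wn % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                wn = wn * w % MOD
        length <<= 1
    if invert:
        inv_n = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * inv_n % MOD

def ntt_multiply(x, y):
    """Multiply non-negative ints via NTT convolution on base-256 limbs.

    Valid while every convolution coefficient stays below MOD, i.e. for
    operands up to a few kilobytes; real codes use bigger limbs, several
    primes with CRT, or an IBDWT instead.
    """
    ax = list(x.to_bytes((x.bit_length() + 7) // 8 or 1, 'little'))
    ay = list(y.to_bytes((y.bit_length() + 7) // 8 or 1, 'little'))
    n = 1
    while n < len(ax) + len(ay):
        n <<= 1
    fa = ax + [0] * (n - len(ax))
    fb = ay + [0] * (n - len(ay))
    ntt(fa); ntt(fb)
    fc = [u * v % MOD for u, v in zip(fa, fb)]
    ntt(fc, invert=True)
    result, carry = 0, 0
    for i, c in enumerate(fc):        # carry propagation over base-256 limbs
        carry += c
        result += (carry & 0xFF) << (8 * i)
        carry >>= 8
    return result + (carry << (8 * len(fc)))
```

Note that the pointwise products and the carry chain are exactly where the "top bits" of an integer multiply are needed, which is the AMD weakness diep complains about.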

xilman 2011-04-30 16:25

[QUOTE=lycorn;260014]Yep, you got it!



Actually, it's "Some sunny day".

Also, it should be "Remember how he said [U]that[/U]"[/QUOTE]True. Shows I was working from memory instead of using Google.

Paul

Prime95 2011-04-30 16:41

His paper is available: [url]http://www.ece.neu.edu/groups/nucar/GPGPU4/files/thall.pdf[/url]

Should give msft enough information to upgrade CudaLucas. I had originally thought Thall's 2x improvement was due to the IBDWT. However, msft and Thall both use the IBDWT in their programs. The nearly 2x came from using non-power-of-2 FFT lengths. I'd also hoped Thall had improved on Nvidia's FFT library, but that is not the case.
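[Editor's note: for reference, the recurrence these programs implement is the Lucas-Lehmer test: s_0 = 4, s_{k+1} = s_k^2 - 2 mod M_p, and M_p = 2^p - 1 is prime iff s_{p-2} = 0. A plain-Python sketch follows; the real programs (CudaLucas, gpuLucas) replace the squaring with IBDWT-based FFT arithmetic, which is the part the paper is about.]

```python
def lucas_lehmer(p):
    """Lucas-Lehmer primality test for M_p = 2^p - 1, with p an odd prime.

    Direct sketch of the recurrence; production GPU codes compute the
    squaring step s*s mod M_p via an IBDWT FFT, so the modular reduction
    falls out of the weighted transform for free.
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# M_13 = 8191 is prime; M_11 = 2047 = 23 * 89 is not.
```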

diep 2011-04-30 17:40

Maybe a silly question, but where's Thall's CUDA code?

Regards,
Vincent

TheJudger 2011-04-30 20:10

Hi Vincent

[QUOTE=diep;260027]Olivers numbers are of course based upon Tesla's that achieve optimal (305M/s) and the gamerscards probably won't do that.
[/QUOTE]

Wrong: 305M/s is for my stock GTX 470, for M66.xxx.xxx and factor candidates below 2^79. A Tesla 20x0 is actually slower for mfaktc because of its slightly lower clock.

Oliver

ixfd64 2011-04-30 20:17

[QUOTE=Prime95;260032]His paper is available: [url]http://www.ece.neu.edu/groups/nucar/GPGPU4/files/thall.pdf[/url]

Should give msft enough information to upgrade CudaLucas. I had originally thought Thall's 2x improvement was due to the IBDWT. However, msft and Thall both use the IBDWT in their programs. The nearly 2x came from using non-power-of-2 FFT lengths. I'd also hoped Thall had improved on Nvidia's FFT library, but that is not the case.[/QUOTE]

I wonder if Prime95 could benefit from this information.

diep 2011-04-30 23:29

[QUOTE=TheJudger;260052]Hi Vincent



Wrong: 305M/s is for my stock GTX 470, for M66.xxx.xxx and factor candidates below 2^79. A Tesla 20x0 is actually slower for mfaktc because of its slightly lower clock.

Oliver[/QUOTE]

Ah yes, thanks for the update. This is in fact better than I said, because obviously a 66M+ exponent has more bits than the 8M range I was calculating for. It's 3 bits more than the 24 at 8M.

All my calculations are for the range where we're currently busy doing TF, which is slightly above 8M.

The extrapolation to 800M/s for your code on a GTX 590 I had done correctly, provided you have the bandwidth :)

Add the 3 bits to that. Maybe your 72-bit kernel is also faster than the 79-bit one at 8M bits?
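[Editor's note: for context, the kernels being compared do trial factoring. Any prime factor q of M_p = 2^p - 1 has the form q = 2kp + 1 with q ≡ ±1 (mod 8), and q divides M_p exactly when 2^p ≡ 1 (mod q). A serial Python sketch follows; the function name is illustrative, and mfaktc screens enormous batches of k values in parallel on the GPU, with the bit bound corresponding to the 72- vs 79-bit kernels mentioned here.]

```python
def smallest_tf_factor(p, bits):
    """Trial-factor M_p = 2^p - 1 up to 2^bits -- the job mfaktc does on GPU.

    Checks candidates q = 2*k*p + 1 with q = +/-1 (mod 8); q divides M_p
    exactly when 2^p = 1 (mod q). Returns the smallest factor found, or
    None if none exists below the bound.
    """
    m = (1 << p) - 1
    limit = 1 << bits
    k = 1
    while True:
        q = 2 * k * p + 1
        if q >= limit or q * q > m:
            return None               # no nontrivial factor below the bound
        if q % 8 in (1, 7) and pow(2, p, q) == 1:
            return q
        k += 1

# M_23 = 8388607 has the small factor 47 (k = 1); M_13 = 8191 is prime.
```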

diep 2011-04-30 23:38

[QUOTE=ixfd64;260054]I wonder if Prime95 could benefit from this information.[/QUOTE]

It would be cool if it's the 9-year-old kids who end up kicking butt there, rather than university clusters with ECC.

Didn't the last several Mersennes mainly get found by university machines equipped with ECC and everything?

Note that double-checking the GPUs is probably not a luxury.

Regards,
Vincent

msft 2011-05-01 02:59

[QUOTE=Prime95;260032]His paper is available: [url]http://www.ece.neu.edu/groups/nucar/GPGPU4/files/thall.pdf[/url]
[/QUOTE]Good information.
Thank you,

RichD 2011-09-25 04:01

Seems like his timetable is slipping. :)

[QUOTE]Summer 2011

Presented the gpuLucas work at GPGPU 4 at Newport Beach in March. I am currently working
with my research students (rising sophomores through the PRISM grant) to get gpuLucas ready
for public release in mid-August.[/QUOTE]

Pulled from this [URL="http://andrewthall.org/"]page[/URL].

