mersenneforum.org  

2011-03-13, 04:55   #56
Ken_g6
Jan 2005
Caught in a sieve
5·79 Posts

Depends on your PSU, case, board slots, and cooling. But the 560 is out and does look good now.

I'm wondering what happened to Andrew? Saturday, March 5 has come and gone. Any news?

2011-03-13, 16:50   #57
Christenson
Dec 2010
Monticello
703₁₆ Posts

Target system:
ASRock 880 GM/LE motherboard with one PCI Express x16 slot and mechanical space for a double-width card. I will upgrade the power supply to support the fans; I need to add an internal fan for one chip anyway. Dual-output on-board video, running 64-bit Ubuntu.
Which card gives the best bang for the buck?

2011-04-29, 18:40   #58
diep
Sep 2006
The Netherlands
787 Posts

Quote:
Originally Posted by Christenson
Target system:
ASRock 880 GM/LE motherboard with one PCI Express x16 slot and mechanical space for a double-width card. I will upgrade the power supply to support the fans; I need to add an internal fan for one chip anyway. Dual-output on-board video, running 64-bit Ubuntu.
Which card gives the best bang for the buck?

While AMD GPUs suck for integer work, the 6990 is unrivalled by Nvidia in double-precision work.

The 6990 costs just a few euros over 500 here and delivers 1.2 TFLOPS or so in double precision.

It wouldn't be too hard to port CUDA code to OpenCL, as the programming models are similar.

How fast is the current code compared to CPUs doing LL?

Regards,
Vincent

2011-04-29, 18:49   #59
diep
Sep 2006
The Netherlands
1423₈ Posts

Quote:
Originally Posted by Prime95
This argument holds less water with this new CUDA program. As expected, IBDWT has halved the iteration times.

A different conclusion is also possible: Perhaps prime95's TF code is in need of optimization.

Right. When I looked at your assembler, I realized there were some optimizations possible for today's CPUs.

The first simple optimization is using a bigger primebase to remove composite FCs (factor candidates). I ran some statistics on the overhead and concluded that a primebase of around 500k makes the most sense for generating FCs. A 64K buffer holds 512k bits, and a bunch of candidates have already been removed outside it, so every prime in a 500k-sized primebase always has a 'hit' in the write buffer.

You can also store the primebase efficiently: the distance from each prime to the next fits in 1 byte, which leaves another 24 bits of a 32-bit word for storing where that prime previously hit the buffer.

You can in fact make the primebase quite a bit bigger than the buffer, since you can keep track of which base prime will hit which buffer and throw it in a bucket for the buffer where it will be useful. But that would slow down the data structure quite a bit.
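
As a rough illustration of the packed primebase described above, here is a minimal Python sketch. It is not Prime95's or anyone's actual sieve. It assumes that factor candidates have the usual Mersenne form q = 2kp + 1, that "500k" means primes below 500,000, and that bit j of the buffer stands for k = k0 + j. All function names are hypothetical.

```python
BUFFER_BITS = 64 * 1024 * 8   # a 64 KiB buffer holds 524288 (512k) bits
PRIME_LIMIT = 500_000         # assumed reading of "a primebase of around 500k"


def small_primes(limit):
    """Sieve of Eratosthenes: all primes below limit."""
    flags = bytearray([1]) * limit
    flags[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if flags[i]:
            flags[i * i::i] = bytes(len(range(i * i, limit, i)))
    return [i for i, f in enumerate(flags) if f]


def pack(gap, offset):
    """Pack an 8-bit prime gap and a 24-bit buffer offset into one 32-bit word."""
    assert 0 <= gap < (1 << 8) and 0 <= offset < (1 << 24)
    return (gap << 24) | offset


def unpack(word):
    return word >> 24, word & 0xFFFFFF


def build_primebase(p, k0, limit=PRIME_LIMIT):
    """Return (first prime, packed table) for candidates q = 2*k*p + 1, k >= k0."""
    primes = [s for s in small_primes(limit) if s != 2]   # q is always odd
    table = []
    for i, s in enumerate(primes):
        assert (2 * p) % s != 0, "sketch assumes no base prime divides 2p"
        gap = primes[i + 1] - s if i + 1 < len(primes) else 0
        inv = pow((2 * p) % s, s - 2, s)   # (2p)^-1 mod s via Fermat's little theorem
        k_hit = (-inv) % s                 # the k for which s divides 2*k*p + 1
        table.append(pack(gap, (k_hit - k0) % s))
    return primes[0], table


def sieve_block(first_prime, table, nbits=BUFFER_BITS):
    """Mark composite candidates in one buffer and advance every stored offset."""
    buf = bytearray(nbits // 8)
    s = first_prime
    for i, word in enumerate(table):
        gap, off = unpack(word)
        while off < nbits:
            buf[off >> 3] |= 1 << (off & 7)
            off += s
        table[i] = pack(gap, off - nbits)   # next hit, relative to the next buffer
        s += gap                            # rebuild the next prime from its gap
    return buf
```

Calling build_primebase once and then sieve_block repeatedly walks through consecutive ranges of k; a set bit means the candidate has a small factor and can be skipped. The bucket variant for a primebase larger than the buffer is not shown.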

Weeding out more composite FCs like this is quite complicated when generating the FCs inside the GPU, as I intend to do.

The speed of the actual comparisons is hard for me to judge, but when I see that a single run at 61 bits here with Wagstaff (using your assembler) already takes half an hour at sizes of around 8M-12M, there are definitely improvements possible there as well.

But even after improving all this, GPUs of course slam CPUs here, as every GPU unit can multiply, whereas on a CPU only one execution unit per core can.

Last fiddled with by diep on 2011-04-29 at 18:55

2011-04-30, 11:57   #60
lycorn
"GIMFS"
Sep 2002
Oeiras, Portugal
3026₈ Posts

Does anybody here remember Andrew Thall?
Remember how he said
We would crunch so fast
On a sunny day?
Andrew, Andrew,
What has become of you?
Does anybody else in here
Feel the way I do?

Adapted from... (?)

2011-04-30, 12:27   #61
xilman
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across
11·1,039 Posts

Quote:
Originally Posted by lycorn
Does anybody here remember Andrew Thall?
Remember how he said
We would crunch so fast
On a sunny day?
Andrew, Andrew,
What has become of you?
Does anybody else in here
Feel the way I do?

Adapted from... (?)

Is there anybody out there? I can feel one of my turns coming on. Don't look so frightened, this is just a passing phase, one of my bad days.

Paul

P.S. BTW, ITYM "One sunny day".

2011-04-30, 12:55   #62
lycorn
"GIMFS"
Sep 2002
Oeiras, Portugal
2·19·41 Posts

Yep, you got it!

Quote:
Originally Posted by xilman
P.S. BTW, ITYM "One sunny day".

Actually, it's "Some sunny day".

Also, it should be "Remember how he said that"

Last fiddled with by lycorn on 2011-04-30 at 13:00

2011-04-30, 14:14   #63
Karl M Johnson
Mar 2010
3·137 Posts

Quote:
Originally Posted by diep
While AMD GPUs suck for integer work...

You don't have the slightest idea what you are talking about.
A Radeon 5970 can do 2.32 TIPS.
A GTX 590 can do 1244.16 GIPS.
A Radeon 5870 can do 1.36 TIPS, and you can buy up to 3 of them for the price of a GTX 590.
Not bad, eh?

2011-04-30, 15:05   #64
diep
Sep 2006
The Netherlands
787 Posts

Quote:
Originally Posted by Karl M Johnson
You don't have the slightest idea what you are talking about.
A Radeon 5970 can do 2.32 TIPS.
A GTX 590 can do 1244.16 GIPS.
A Radeon 5870 can do 1.36 TIPS, and you can buy up to 3 of them for the price of a GTX 590.
Not bad, eh?

The problem is that you cannot easily get the top bits on AMD.

If you multiply 24 x 24 bits, you get the least significant 32 bits within 1 cycle (throughput latency), yet it takes 4 PEs to get just the top 16 bits on the 6000 series, and 5 PEs on the 5000 series.

So it takes 5 cycles (throughput latency) to get 64 bits of output spread over 2 integers, and 6 cycles on the 5000 series cards.

If you want to multiply using 32-bit integers (filling them with 31 bits, for example), you will need a lot of patience, as that takes 2 slow instructions and 8 cycles of throughput latency. So you can divide your AMD numbers by 8 for the 6000 series and by 10 for the 5000 series.
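
Applied to the peak figures quoted earlier in the thread, that rule of thumb works out as in the short Python sketch below. It only restates the arithmetic; the divisors are the ones given in this post, the card figures are the ones quoted above, and the function name is made up for illustration.

```python
# Peak integer rates quoted in the thread, in GIPS (1 TIPS = 1000 GIPS).
PEAK_GIPS = {"Radeon 5970": 2320.0, "Radeon 5870": 1360.0}

# Rule-of-thumb divisors from the post for full 32-bit multiplies.
DIVISOR = {"5000": 10, "6000": 8}


def derated_gips(peak_gips, series):
    """Peak rate divided by the per-series penalty for 32-bit multiplies."""
    return peak_gips / DIVISOR[series]


for card, peak in PEAK_GIPS.items():
    # Both quoted Radeon cards belong to the 5000 series.
    print(f"{card}: {peak:.0f} GIPS peak, "
          f"about {derated_gips(peak, '5000'):.0f} GIPS for 32-bit multiplies")
```

Under that rule the 5970 drops to about 232 GIPS and the 5870 to about 136 GIPS for this kind of work. The GTX 590 figure of 1244.16 GIPS is a peak number too and is not derated here, since the post gives no divisor for Nvidia.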

So the fastest way to emulate 70 bits of precision on AMD GPUs is to use only the least significant 32 bits of the fast 24-bit multiplication. That means you can store 14 bits of information in each integer.

This should run fast on both the 6000 series and the 5000 series.

A full multiplication can then use 25 multiply-adds, and counting the other overhead I come to 69 fast instructions in total. In throughput terms, a 70 x 70 bit multiplication therefore costs 69 cycles.
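
The limb layout described above can be sketched in Python as follows. This is only an illustration of why 14-bit limbs work with a multiplier that returns the low 32 bits: every partial product and every column sum stays below 2^32. It does not reproduce the 69-instruction count, and the function names are hypothetical.

```python
LIMB_BITS = 14
LIMBS = 5                          # 5 limbs x 14 bits = 70 bits
LIMB_MASK = (1 << LIMB_BITS) - 1
U32_MAX = (1 << 32) - 1


def to_limbs(x):
    """Split a number below 2**70 into five 14-bit limbs, least significant first."""
    assert 0 <= x < 1 << (LIMB_BITS * LIMBS)
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(LIMBS)]


def from_limbs(limbs):
    """Recombine 14-bit limbs (least significant first) into an integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul70(a, b):
    """Schoolbook 70 x 70 -> 140 bit product using only 32-bit-sized values."""
    cols = [0] * (2 * LIMBS - 1)
    for i in range(LIMBS):
        for j in range(LIMBS):
            cols[i + j] += a[i] * b[j]     # one multiply-add, 25 in total
            assert cols[i + j] <= U32_MAX  # 5 products of 28 bits fit in 32 bits
    out, carry = [], 0
    for c in cols:                         # carry propagation between columns
        c += carry
        assert c <= U32_MAX
        out.append(c & LIMB_MASK)
        carry = c >> LIMB_BITS
    out.append(carry)                      # tenth limb of the 140-bit result
    return out
```

For example, from_limbs(mul70(to_limbs(a), to_limbs(b))) equals a * b for any a and b below 2^70. The modular reduction a real trial-factoring kernel needs is left out.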

On Nvidia, on the other hand, you can use the full 24 x 24 = 48-bit product, so 3 integers are enough. That's a lot quicker.

That is where AMD GPUs lose big time to Nvidia at the moment.

This was also unexpected for me, as old AMD GPUs had 40 bits available internally within 1 cycle. You then don't expect getting the top 16 bits to be so slow.

We haven't yet talked about add-with-carry, which AMD doesn't have either; that costs another few dozen percent if you want to reach 72 bits. I already knew about that before I started investigating all this, and of course losing 20% is no big deal if you have that many TIPS available.

So for trial factoring a GTX 590 should achieve roughly 800M/s, whereas a 6990 can achieve a maximum of around 500M/s according to my calculation. A 5970 will of course achieve nothing there, as AMD does not support its second GPU under OpenCL (which sucks incredibly, as OpenCL is the only programming language supported right now).

My theoretical calculation is that my Radeon HD 6970 could achieve 270M/s if all PEs can be kept perfectly loaded with instructions without interruption. That is rather unlikely, but I'll try :)

The interesting thing then, when I program it all in simple instructions, will be to see what IPC the code achieves on the 5000 series versus the 6000 series. Once power costs are added in, the 6990 will probably come out best, but that's for later to figure out :)

There is no discussion about which card is fastest for TF: that will be the GTX 590 from Nvidia.

Last fiddled with by diep on 2011-04-30 at 15:09

2011-04-30, 15:24   #65
Karl M Johnson
Mar 2010
3·137 Posts

You've got a point: it's far easier to optimize a CUDA application than a CAL/OpenCL one.
And AMD's SDK has bugs which haven't been fixed for at least a quarter of a year.

However, that doesn't mean it's impossible.
A fabulous example is OCLHashcat, written by Atom.
It's used for hash reversal (cracking).
Since hashing passwords deals with integers, AMD GPUs win here.
I've also read that AMD GPUs have native instructions that help in that area, such as bitfield insert and bit align.

Another example is the RC5 GPU clients of distributed.net.

The last I heard from Andrew Thall was on 16 February.

Last fiddled with by Karl M Johnson on 2011-04-30 at 15:35

2011-04-30, 15:34   #66
diep
Sep 2006
The Netherlands
787 Posts

Quote:
Originally Posted by Karl M Johnson
You've got a point: it's far easier to optimize a CUDA application than a CAL/OpenCL one.
And AMD's SDK has bugs which haven't been fixed for at least a quarter of a year.

However, that doesn't mean it's impossible.
A fabulous example is OCLHashcat, written by Atom.
It's used for hash reversal (cracking).
Since hashing passwords deals with integers, AMD GPUs win here.
I've also read that AMD GPUs have native instructions that help in that area, such as bitfield insert and bit align.

Yes, it is impossible to get a Radeon HD 6970 faster than the rough estimate of 270M/s that I wrote here when using simple instructions. I'd expect a GTX 580 to be 33% faster than that. Oliver's numbers are of course based on Teslas, which achieve the optimum (305M/s); the gamer cards probably won't do that.

The only way to speed things up, which means moving to the 3 x 24 bit implementation for 69 bits, would be the GPU's native MULHI_UINT24 instruction, if that instruction has a throughput latency of 1 cycle.
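
To make the 3 x 24 bit layout concrete, here is a minimal Python sketch. It assumes two primitives as described in this thread: one returning the low 32 bits of a 24 x 24 bit product and one returning the top 16 bits. The Python functions only emulate that behaviour, are hypothetical names, and say nothing about real instruction timings.

```python
MASK24 = (1 << 24) - 1
U32_MAX = (1 << 32) - 1


def mul24_lo32(a, b):
    """Emulated fast multiply: low 32 bits of a 24 x 24 bit product."""
    return (a * b) & U32_MAX


def mul24_top16(a, b):
    """Emulated high multiply: bits 32..47 of the same 48-bit product."""
    return (a * b) >> 32


def to_limbs24(x):
    """Split a number below 2**72 into three 24-bit limbs, least significant first."""
    assert 0 <= x < 1 << 72
    return [(x >> (24 * i)) & MASK24 for i in range(3)]


def from_limbs24(limbs):
    """Recombine 24-bit limbs (least significant first) into an integer."""
    return sum(limb << (24 * i) for i, limb in enumerate(limbs))


def mul72(a, b):
    """Schoolbook 72 x 72 -> 144 bit product from 9 low and 9 high multiplies."""
    cols = [0] * 6
    for i in range(3):
        for j in range(3):
            lo = mul24_lo32(a[i], b[j])
            hi = mul24_top16(a[i], b[j])
            cols[i + j] += lo & MASK24                  # bits 0..23 of the product
            cols[i + j + 1] += (hi << 8) | (lo >> 24)   # bits 24..47 of the product
    out, carry = [], 0
    for c in cols:                                      # carry propagation
        c += carry
        assert c <= U32_MAX
        out.append(c & MASK24)
        carry = c >> 24
    assert carry == 0                                   # a 144-bit result fits 6 limbs
    return out
```

Nine pairs of multiplies replace the 25 multiply-adds of the 14-bit layout, which is why the cost of the high-half instruction decides whether this version pays off. The sketch accepts full 72-bit operands; the post uses the layout for 69 bits.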

OpenCL doesn't support that instruction. The OpenCL specs were created by an ex-ATI guy, so if that instruction had been faster than the 32 x 32 bit mul_hi, it would obviously have been in the OpenCL 1.1 specs :)

There is one report from a guy, possibly an AMD engineer, saying that MULHI_UINT24 is in reality mapped onto the 32 x 32 bit mul_hi, which takes 4 cycles on the 6000 series and 5 cycles on the 5000 series.

I'm still awaiting an official answer from the AMD helpdesk on that. No answer of course means guilty.

Last fiddled with by diep on 2011-04-30 at 15:38

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
