2011-03-13, 04:55  #56 
Jan 2005
Caught in a sieve
5·79 Posts 
Depends on your PSU, case, board slots, and cooling. But the 560 is out and does look good now.
I'm wondering what happened to Andrew? Saturday, March 5 has come and gone. Any news? 
2011-03-13, 16:50  #57 
Dec 2010
Monticello
703_{16} Posts 
Target System:
ASRock 880GM/LE mobo with one PCI Express x16 slot and mechanical space for a double-width card. The power supply will be upgraded to support the fans; I need to add an internal fan for one chip anyway. Dual-output video onboard, running 64-bit Ubuntu. Which card gives the best bang for the buck? 
2011-04-29, 18:40  #58  
Sep 2006
The Netherlands
787 Posts 
Quote:
That 6990 is just a few euro over 500 here, and it's roughly 1.2 Tflop double precision. It wouldn't be too hard to port the CUDA code to OpenCL, as the programming models are architecturally similar. How fast is the current code compared to CPUs doing LL? Regards, Vincent 

2011-04-29, 18:49  #59  
Sep 2006
The Netherlands
1423_{8} Posts 
Quote:
The first simple optimization is using a bigger prime base to remove composite factor candidates (FCs). I ran some statistics on the overhead and concluded that a prime base of around 500k primes for generating FCs makes the most sense: with a 64 KB buffer you can hold 512k bits, and you have already removed a bunch outside, so a 500k-sized prime base always gets a 'hit' in the write buffer. The prime base can also be stored efficiently at one byte per prime, since the distance from each prime to the next fits in a byte, which leaves another 24 bits for storing where that prime previously hit the buffer. You can in fact make the prime base quite a bit larger than the buffer, by keeping track of which base prime will hit which buffer and throwing them into a bucket for the buffer where they'll be useful; but that slows down the data structure a tad. This weeding out of more composite FCs is quite complicated when generating FCs inside the GPU, as I intend to do. The speed of the actual comparisons is hard for me to judge, yet when I see that a single run at 61 bits here with Wagstaff (using your assembler) already takes half an hour at around 8M-12M sizes, there are definitely improvements possible there as well. But even after improving all this, GPUs of course slam the CPUs here, as every GPU unit can multiply, while on CPUs only one execution unit per core can. Last fiddled with by diep on 2011-04-29 at 18:55 

2011-04-30, 11:57  #60 
"GIMFS"
Sep 2002
Oeiras, Portugal
3026_{8} Posts 
Does anybody here remember Andrew Thall?
Remember how he said
We would crunch so fast
On a sunny day?
Andrew, Andrew,
What has become of you?
Does anybody else in here
Feel the way I do?

Adapted from... (?) 
2011-04-30, 12:27  #61  
Bamboozled!
"xilman"
May 2003
Down not across
11·1,039 Posts 
Quote:
Paul

P.S. BTW, ITYM "One sunny day". 

2011-04-30, 12:55  #62 
"GIMFS"
Sep 2002
Oeiras, Portugal
2·19·41 Posts 
Yep, you got it!
Actually, it's "Some sunny day". Also, it should be "Remember how he said that". Last fiddled with by lycorn on 2011-04-30 at 13:00 
2011-04-30, 14:14  #63 
Mar 2010
3·137 Posts 

2011-04-30, 15:05  #64  
Sep 2006
The Netherlands
787 Posts 
Quote:
So if you multiply 24 x 24 bits, you get the least significant bits within 1 cycle (throughput latency), yet it takes 4 PEs to get just the 16 top bits, and 5 PEs in the case of the 5000 series. So it is 5 cycles (throughput latency) to get the 64 bits of output spread over 2 integers, and 6 cycles on the 5000 series cards. If you want to multiply using 32-bit integers (filling them with 31 bits, for example), you will need a lot of patience, as that's 2 slow instructions requiring 8 cycles throughput latency. So with respect to AMD you can divide your numbers by 8 for the 6000 series and by 10 for the 5000 series.

So the fastest way to multiply on AMD GPUs to emulate 70-bit precision is to use the least significant 32 bits of the fast 24-bit multiplication, storing 14 bits of information in each integer. This should run fast on both the 6000 series and the 5000 series. A full multiplication using multiply-add can then use 25 multiply-adds, and with all other overhead counted I come to 69 fast instructions. So in throughput terms a 70 x 70 bit multiplication is 69 cycles. On Nvidia, on the other hand, 24 x 24 bits gives you the full 48 bits, so you can use 3 integers for that. That's a lot quicker. That is where AMD GPUs lose big time to Nvidia at the moment. This was also unexpected for me, as old AMD GPUs had 40 bits internally available within 1 cycle; you don't expect getting the top 16 bits to be so slow. And we haven't yet spoken about add-with-carry, which AMD doesn't have either, where you'll lose another few dozen percent if you want to achieve 72 bits. I knew all that before I started investigating this, and losing 20% is of course no big deal if you have that many TIPS available.

So for trial factoring a GTX 590 should achieve roughly 800M/s, where a 6990 can by my calculation achieve a maximum of around 500M/s. A 5970 will achieve nothing there, of course, as the 2nd GPU is not supported by AMD for OpenCL (which sucks incredibly, as OpenCL is the only programming language supported right now). My theoretical calculation is that it would be possible to achieve 270M/s on my Radeon HD 6970 if you can perfectly load all PEs with instructions without interruption; that last part is rather unlikely, yet I'll try :) The interesting thing, once I program it all in simple instructions, will be to see what IPC the code achieves on the 5000 series versus the 6000 series. Probably the 6990 will break even best once you add power costs as well, but that's for later to figure out :) Which card is fastest is not up for discussion here: for TF that'll be the GTX 590 from Nvidia. Last fiddled with by diep on 2011-04-30 at 15:09 

2011-04-30, 15:24  #65 
Mar 2010
3·137 Posts 
You've got a point; it's far easier to optimize a CUDA application than a CAL/OpenCL one.
And AMD's SDK has bugs which haven't been fixed for at least a quarter of a year. However, that doesn't mean it's impossible. A fabulous example is oclHashcat, written by atom, which is used in hash reversal (cracking). Since hashing passwords deals with integers, AMD GPUs win there. I've also read that AMD GPUs have native instructions which help in that area, such as bitfield insert and bit align. Another example is the RC5 GPU clients of distributed.net. The last I heard from Andrew Thall was on the 16th of February. Last fiddled with by Karl M Johnson on 2011-04-30 at 15:35 
2011-04-30, 15:34  #66  
Sep 2006
The Netherlands
787 Posts 
Quote:
The only escape to speed things up, i.e. to move to the 3 x 24 bit implementation for 69 bits, would be if the GPU's native MULHI_UINT24 instruction were 1 cycle throughput latency. OpenCL doesn't expose that instruction. The OpenCL specs were created by an ex-ATI guy, so if that instruction had been faster than the 32 x 32 bit mul_hi, obviously it would have been in the OpenCL 1.1 specs :) There is one report, from a guy who is possibly an AMD engineer, that MULHI_UINT24 is in reality cast onto the 32 x 32 bit mul_hi, which is 4 cycles on the 6000 series and 5 cycles on the 5000 series. I'm still awaiting an official answer from the AMD helpdesk on that. No answer of course means guilty. Last fiddled with by diep on 2011-04-30 at 15:38 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
mfaktc: a CUDA program for Mersenne prefactoring  TheJudger  GPU Computing  3541  2022-04-21 22:37 
Do normal adults give themselves an allowance? (...to fast or not to fast - there is no question!)  jasong  jasong  35  2016-12-11 00:57 
Find Mersenne Primes twice as fast?  Derived  Number Theory Discussion Group  24  2016-09-08 11:45 
TPSieve CUDA Testing Thread  Ken_g6  Twin Prime Search  52  2011-01-16 16:09 
Fast calculations modulo small mersenne primes like M61  Dresdenboy  Programming  10  2004-02-29 17:27 