2009-09-06, 01:53  #122 
Mar 2003
Melbourne
5·103 Posts 
Posting some of my research over the last week:
According to http://en.wikipedia.org/wiki/FLOPS here is a breakdown of performance between the current flagship Intel (Core i7 965 XE) and Nvidia (Tesla C1060, similar to the GTX 280) parts:
Code:
             SP     DP
Core i7     140     70
Nvidia      933     78

SP = 32-bit floats, DP = 64-bit floats
The big difference between SP and DP on Nvidia boils down to one 64-bit FP unit for every eight 32-bit FP units. I think this has been mentioned before. As GPUs advance, who knows how this ratio will evolve. (A new generation of GPUs is due in the next couple of months.)

Then also take this little gem, from https://visualization.hpc.mil/wiki/Introduction_to_CUDA:

"Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display."

This suggests (to me anyway) that coding for CUDA isn't exactly a trivial process. There will be gotchas in the development process. There is no way there will be mass adoption of a GPGPU client if it causes screen corruption.

So what is the x86 camp doing? Intel? http://en.wikipedia.org/wiki/Larrabee_(GPU) First half of 2010; Larrabee will also compete in the GPGPU and high-performance computing markets. It will run x86 code, which Prime95 has an excellent heritage in. (No mention of 64-bit float capability.)

If I could put my holier-than-thou hat on: yes, in a commercial development organization one would set aside resources to develop on GPU platforms to show a proof-of-concept, then make a business call on which way to go; but when resources are tight it's a real tough call.

Anyone want to have a stab at a proof-of-concept LL test on a GPU? Plenty of training resources at: http://www.nvidia.com/object/cuda_education.html  Craig 
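For anyone wanting to attempt that proof-of-concept, the Lucas-Lehmer iteration itself is tiny. Here is a minimal CPU reference sketch in Python (arbitrary-precision integers, no FFT) showing the recurrence a GPU port would need to reproduce with floating-point FFT multiplication, which is where the SP/DP question bites:

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test for the Mersenne number M_p = 2^p - 1 (p an odd prime).

    s_0 = 4, s_{i+1} = s_i^2 - 2 (mod M_p); M_p is prime iff s_{p-2} == 0.
    A GPU version would replace the squaring with an FFT-based multiply.
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# Known small cases: M7 = 127 is prime, M11 = 2047 = 23 * 89 is not.
print(lucas_lehmer(7), lucas_lehmer(11))  # -> True False
```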
2009-09-07, 08:37  #123  
Jul 2006
Calgary
5^{2}×17 Posts 
Quote:


2009-09-07, 18:31  #124  
Jul 2009
31 Posts 
Quote:
AMD uses its single-precision ALUs with microcode programming to produce DP results. These are not IEEE doubles, though they are still much higher precision than single precision. So DP is slower on AMD since it takes many more clocks to compute. Nvidia instead has full IEEE double-precision ALUs in hardware, in addition to its single-precision ALUs. For example, a GTX 275 card has 240 single-precision ALUs but only 30 double-precision ALUs (one per multiprocessor). So NV's double precision runs at full speed, just on fewer ALUs. The architecture hides all this from you, so they could ship any mix of single- and double-precision ALUs and your code wouldn't know or care. The cards also have a "special function" unit for transcendental evaluation (sin, cos, log, etc.); that is something even CPUs don't have, since those are usually done in microcode. All of the G200 series cards have these native DP processors, so it's not just their top-end boards; it's mid-level ($150) and above. 
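The ALU-count explanation squares with the Wikipedia table quoted earlier in the thread. For the Tesla C1060 the usual accounting is 240 SP ALUs doing dual-issue MAD+MUL (3 flops/cycle) and 30 DP ALUs doing one FMA (2 flops/cycle); the 1.296 GHz shader clock and flops-per-cycle figures below are my assumptions, not from the posts above:

```python
clock_ghz = 1.296                       # assumed Tesla C1060 shader clock
sp_alus, sp_flops_per_cycle = 240, 3    # MAD + MUL dual issue (assumed)
dp_alus, dp_flops_per_cycle = 30, 2     # one FMA per cycle (assumed)

sp_peak = sp_alus * sp_flops_per_cycle * clock_ghz   # GFLOPS, single precision
dp_peak = dp_alus * dp_flops_per_cycle * clock_ghz   # GFLOPS, double precision

print(round(sp_peak), round(dp_peak))  # -> 933 78, matching the quoted table
```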

2009-09-07, 20:48  #125  
Jul 2006
Calgary
1A9_{16} Posts 
Quote:
From what I could gather on the Nvidia site, the 250 and 210 models do not support DP, and there is even a 260 model (the 260M, the one that is soldered onto a motherboard) that has no DP support. Are these the ones you mean currently under $150? 

2009-09-07, 21:15  #126  
Jul 2009
31 Posts 
Quote:
Yeah, none of the mobile chips have DP support, not even the latest GTS 260M. The lowest NV card with DP support is the GTX 260; that's $150. That gives you about 95 double-precision GFlops, which is very similar to the ideal DP flops of a 3 GHz quad-core i7 processor using packed SSE. 

2009-09-24, 20:03  #127 
Mar 2003
Melbourne
5·103 Posts 
Stats on the soon-to-be-released ATI video card:
ATI 5000-series video card: 2.72 SP Gflops / 544 DP Gflops. Cost: $400ish. ATI article: http://pcper.com/article.php?aid=783 "AMD added IEEE 754-2008 compliant precision" Does that statement qualify it as suitable for the algorithms for 2^P-1 primes?

Today's flops count for the entire GIMPS effort is 41.761 Tflops. Just one of these video cards would increase GIMPS output by 1.3% (based on purely theoretical numbers). Even if the first iteration of code ran at 1/4 of theoretical, three of these cards would be 1% of GIMPS output.

Anybody want to look at coding an LL test in OpenCL?  Craig 
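The percentages are easy to check from the two figures quoted in the post (544 DP GFLOPS for the card, 41.761 TFLOPS for the whole project):

```python
gimps_gflops = 41761.0    # whole-project throughput quoted above
card_dp_gflops = 544.0    # ATI 5000-series theoretical DP peak quoted above

one_card = card_dp_gflops / gimps_gflops                      # one card at full theoretical speed
three_at_quarter = 3 * (card_dp_gflops / 4) / gimps_gflops    # three cards at 1/4 efficiency

print(f"{one_card:.1%} {three_at_quarter:.1%}")  # -> 1.3% 1.0%
```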
2009-09-25, 12:55  #128 
Sep 2009
2^{3}×3 Posts 
Just to add some information about CUDA with elliptic curves:
http://eecm.cr.yp.to/mpfq.html http://cr.yp.to/talks/2009.09.12/slides.pdf I think these two links are very helpful. If someone can compile this for use with Win32, please post the links. Thank you in advance. 
2009-09-25, 13:05  #129 
Undefined
"The unspeakable one"
Jun 2006
My evil lair
13673_{8} Posts 
How come SP is 200 times less than DP? I think the figures are wrong somewhere.
Last fiddled with by retina on 2009-09-25 at 13:06 
2009-09-25, 14:01  #130 
(loop (#_fork))
Feb 2006
Cambridge, England
1100011101110_{2} Posts 
It is, of course, 2.72 SP Tflops peak.
Only a gigabyte of memory, so no use for GNFS linear algebra, though there does seem to be an addressing mode for the per-SIMD local store which is designed for doing radix-4 SP FFTs. Some reviews suggest that there are now four 24x24 integer multipliers rather than one 32x32; I'm not sure how good a trade-off that is, since it depends on how many extra bits you wanted to carry around between carry-propagation passes. Bernstein, Lange &c's work on mulmod on GPGPUs is all on very small numbers (say 210 bits) which fit in the GPU register sets; I'm not sure how painful things get if the numbers stop fitting there. On the other hand, there's little point in my buying a 5850 to write GPGPU code on until the R800 Instruction Set Architecture document comes out, and AMD pull their finger out and release the OpenCL-running-on-GPU implementation. 
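To put a number on the carry-headroom trade-off mentioned above: accumulating 24x24-bit products into a 64-bit accumulator leaves 64 - 48 = 16 spare bits, so about 2^16 partial products can be summed before a carry-propagation pass is forced, whereas 32x32-bit products leave no headroom at all. A toy worst-case calculation (the 64-bit accumulator width is my assumption):

```python
def headroom_terms(limb_bits, acc_bits=64):
    """Worst-case number of limb*limb products (all-ones limbs) that fit
    in an acc_bits-wide accumulator before a carry pass is forced."""
    max_product = ((1 << limb_bits) - 1) ** 2
    return ((1 << acc_bits) - 1) // max_product

print(headroom_terms(24), headroom_terms(32))  # -> 65536 1
```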
2009-09-25, 14:19  #131 
Tribal Bullet
Oct 2004
2×3×19×31 Posts 
I think all of the complexity of porting algorithms to GPUs boils down to making the important part of the working set fit into caches that are much too small.
If your applications do not require double-precision floating point, getting started with graphics card programming is ridiculously cheap. Nvidia's development libraries are a free download, and a 1-year-old GPU with 1GB of memory costs ~$100 and can potentially crunch 5-10x faster than the machine it's plugged into. 
2009-09-25, 14:46  #132 
Mar 2003
Melbourne
203_{16} Posts 
Yep, what 5mack said....Tflops.
Oops.  Craig 