mersenneforum.org The prime-crunching on dedicated hardware FAQ

 2009-09-06, 01:53 #122 nucleon     Mar 2003 Melbourne 5·103 Posts

Posting some of my research over the last week. According to http://en.wikipedia.org/wiki/FLOPS, here is a breakdown of peak performance (in Gflops) for the current flagship Intel part (Core i7 965 XE) and NVidia part (Tesla C1060, similar to a GTX 280):

Code:
        SP   DP
corei7  140  70
nvidia  933  78

SP = 32-bit floats, DP = 64-bit floats

These are top theoretical, optimistic figures, so don't expect to actually reach them, but they're a good guide for comparing the two. Let's say DP performance is pretty similar between the two architectures. The big SP/DP gap on NVidia boils down to 1x 64-bit FP unit for every 8x 32-bit FP units; I think this has been mentioned before. As GPUs advance, who knows how this ratio will evolve. (A new generation of GPUs is due in the next couple of months.)

Then also take this little gem, from https://visualization.hpc.mil/wiki/Introduction_to_CUDA:

"Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display."

This suggests (to me, anyway) that coding for CUDA isn't exactly a trivial process; there will be gotchas in the development process. There's no way a GPGPU client will see mass adoption if it causes screen corruption.

So what is the x86 camp doing? Intel? http://en.wikipedia.org/wiki/Larrabee_(GPU) - first half of 2010. Larrabee will also compete in the GPGPU and high-performance computing markets, and it will run x86 code, which prime95 has an excellent heritage in. (No mention of 64-bit float capability.)

If I could put my holier-than-thou hat on: yes, in a commercial development organization one would set aside resources to develop on GPU platforms, show a proof of concept, and then make a business call on which way to go. But when resources are tight, it's a real tough call.

Anyone want to have a stab at a proof-of-concept LL test on a GPU? Plenty of training resources at: http://www.nvidia.com/object/cuda_education.html

-- Craig
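[An editorial sketch, not from the thread: the 5-second watchdog quoted above means a long computation such as an LL test has to be split into many short kernel launches. Below is a minimal batch-planning sketch in plain Python; the iteration rate, the `plan_batches` name, and the safety factor are all illustrative assumptions, and the actual CUDA launch is not shown.]

```python
# Sketch: split a long GPU job into launches that each stay well under
# the 5-second display-watchdog limit quoted above. Pure planning logic;
# the real CUDA kernel launch per batch is omitted.

def plan_batches(total_iterations, iters_per_second,
                 watchdog_seconds=5.0, safety=0.5):
    """Choose per-launch batch sizes so each launch stays under the watchdog."""
    budget = watchdog_seconds * safety            # leave generous headroom
    batch = max(1, int(budget * iters_per_second))
    batches = []
    done = 0
    while done < total_iterations:
        n = min(batch, total_iterations - done)
        batches.append(n)                         # one kernel launch per entry
        done += n
    return batches

# Example: 1,000,000 iterations at an assumed 10,000 iterations/s
batches = plan_batches(1_000_000, 10_000)
assert sum(batches) == 1_000_000                  # nothing lost to chunking
assert all(b <= 25_000 for b in batches)          # each launch <= 2.5 s of work
```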
2009-09-07, 08:37   #123
lfm

Jul 2006
Calgary

110101001₂ Posts

Quote:
 Originally Posted by fivemack This business about Brook+ allowing you to use double-precision units on the ATI HD3870 card is quite interesting; the system behaves as if it has 64 775MHz double-precision FPUs rather than 320 single-precision ones, but it looks as if you get something which might be fairly adequate.
Note this is a 5x penalty for DP. It seems to correspond to the 10x penalty quoted for software DP emulation using SP mentioned in the FAQ. A factor of 2 is the minimal hardware support implemented for DP. I think NVidia has a similar scheme on a few of their top-end boards.

2009-09-07, 18:31   #124
SPWorley

Jul 2009

31₁₀ Posts

Quote:
 Originally Posted by lfm Note this is a 5x penalty for DP. It seems to correspond to the 10x penalty quoted for software DP emulation using SP mentioned in the FAQ. A factor of 2 is the minimal hardware support implemented for DP. I think NVidia has a similar scheme on a few of their top-end boards.
The AMD and NVidia approaches are very different.

AMD uses its single-precision ALUs with microcode programming to produce DP results. These are not IEEE doubles, though, but still much higher precision than single precision. So DP is slower on AMD since it takes many more clocks to compute.

NVidia instead has full IEEE double-precision ALUs in hardware, in addition to its single-precision ALUs. For example, a GTX 275 card has 240 single-precision ALUs but only 30 double-precision ALUs.
So NV's double precision runs at full speed, just on fewer ALUs.
The architecture hides this all from you, so they could create any mix of single and double precision ALUs and your code doesn't know or care.
The cards also have a "Special Function" unit for transcendental evaluation (sin, cos, log, etc.). This is something that even CPUs don't have in dedicated hardware; those are usually done in microcode.

All the G200 series of cards have these native DP processors, so it's not just their top-end boards; it's mid-level ($150) and above.

 2009-09-07, 20:48 #125 lfm Jul 2006 Calgary 5²·17 Posts

Quote:
 Originally Posted by SPWorley The AMD and NVidia approaches are very different. [...] All the G200 series of cards has these native DP processors... so it's not their top end boards, it's mid-level ($150) and above.
It seems to have the same end effect: a 5x penalty on ATI, and it looks like 1/5th the ALUs. I think it's an 8-to-1 ratio on NVidia (current cards?).

From what I could gather on the NVidia site, the 250 and 210 models do not support DP, and there is even a 260 model (the 260M, the one that fits on a motherboard) that has no DP support. Are these the ones you mean currently under $150?

 2009-09-07, 21:15 #126 SPWorley Jul 2009 31 Posts

Quote:
 Originally Posted by lfm It seems the same end effect. [...] Are these the ones you mean currently under $150?
Exactly, it's 8 SP : 1 DP : 1 SF for the G200 NVidia cards. The ATI isn't a constant 5x at all; it's variable based on operation type, since it's software, and it's not IEEE either, so sometimes it's hard to compare. (ATI is effectively running Kahan summation to emulate the DP behavior, which is very fast for adds but slower for mults and especially painfully slow for divides.)
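[Editorial sketch of the compensated-arithmetic building block behind the software DP emulation being discussed: a handful of native-precision adds recover the exact rounding error of an addition (Knuth's branch-free "two-sum"). Python doubles stand in for the GPU's native precision; the function name is mine, not from the thread.]

```python
# Sketch of the error-free "two-sum" trick underlying software emulation
# of higher precision (Knuth/Dekker style). Adds are cheap (6 flops);
# emulated multiplies and especially divides need far more work on top,
# which is why the per-operation penalty varies.

def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b == s + e exactly."""
    s = a + b
    bb = s - a                       # the part of b that made it into s
    e = (a - (s - bb)) + (b - bb)    # the rounding error that was lost
    return s, e

# The error term is recovered exactly:
s, e = two_sum(1.0, 1e-17)   # 1e-17 vanishes in the rounded sum ...
assert s == 1.0
assert e == 1e-17            # ... but survives in the error term
```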

Yeah, none of the mobile chips have DP support, even the latest GTS 260m.
The lowest NV card with DP support is the GTX 260. That's $150, and it gives you about 95 double-precision GFlops, which is very similar to the ideal DP flops of a 3 GHz quad i7 processor using packed SSE.

 2009-09-24, 20:03 #127 nucleon Mar 2003 Melbourne 515₁₀ Posts

Stats on the soon-to-be-released ATI video card:

ATI 5000 series video card - 2.72 SP Gflops / 544 DP Gflops
Cost: $400ish
ATI article: http://pcper.com/article.php?aid=783

"AMD added IEEE754-2008 compliant precision" - does that statement qualify it as suitable for the algorithms for 2^P-1 primes?

Today's flops count for the entire GIMPS project is 41.761 Tflops. Just 1 of these vid cards would increase GIMPS output by 1.3% (based on purely theoretical numbers). Even if the first iteration of code ran at 1/4 of theoretical, 3 of these cards would be 1% of GIMPS output.

Anybody want to look at coding an LL test in OpenCL?

-- Craig
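[A quick editorial arithmetic check of the percentages in nucleon's post above; the 544 DP GFlops and 41.761 Tflops figures are taken directly from that post, and everything else is just division.]

```python
# Sanity-check the claimed GIMPS-share percentages using the figures
# quoted in the post above (theoretical DP peak of one ATI 5000-series
# card vs. the whole project's throughput).

gimps_total_gflops = 41_761    # 41.761 Tflops, expressed in GFlops
card_dp_gflops = 544           # theoretical DP peak of one card

one_card_share = card_dp_gflops / gimps_total_gflops
assert round(one_card_share * 100, 1) == 1.3       # ~1.3%, as claimed

# 3 cards running at 1/4 of theoretical:
three_cards_quarter = 3 * (card_dp_gflops / 4) / gimps_total_gflops
assert round(three_cards_quarter * 100, 1) == 1.0  # ~1%, as claimed
```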
 2009-09-25, 12:55 #128 zarabatana   Sep 2009 2³·3 Posts Just to add some information about CUDA with elliptic curves: http://eecm.cr.yp.to/mpfq.html http://cr.yp.to/talks/2009.09.12/slides.pdf I think these two links are very helpful. If someone can compile this for use with win32, please post the links. Thank you in advance.
2009-09-25, 13:05   #129
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

59×103 Posts

Quote:
 Originally Posted by nucleon ATI 5000 series video card - 2.72SP Gflops/544 DP Gflops
How come SP is 200 times less than DP? I think the figures are wrong somewhere.

Last fiddled with by retina on 2009-09-25 at 13:06

 2009-09-25, 14:01 #130 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 18EE₁₆ Posts It is, of course, 2.72 SP Tflops peak. Only a gigabyte of memory, so no use for GNFS linear algebra, though there does seem to be an addressing mode for the per-SIMD local store which is designed for doing radix-4 SP FFTs. Some reviews suggest that there are now four 24x24 integer multipliers rather than one 32x32; I'm not sure how good a trade-off that is - it depends on how many extra bits you wanted to carry around between carry-propagation passes. Bernstein, Lange &c's work on mulmod on GPGPUs is all on very small numbers (say 210 bits) which fit in the GPU register sets, and I'm not sure how painful things get if the numbers stop fitting there. On the other hand, there's little point in my buying a 5850 to write GPGPU code on until the R800 Instruction Set Architecture document comes out, and AMD pull their finger out and release the OpenCL-running-on-GPU implementation.
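[One way to read fivemack's four-24x24-vs-one-32x32 trade-off (an editorial sketch, not from the thread): a full 32x32-bit multiply can be rebuilt from four partial products of 16-bit halves, each of which fits comfortably in a 24x24 multiplier. The function name is illustrative.]

```python
# Sketch: a 32x32 -> 64-bit multiply built from four 16x16-bit partial
# products, each of which a 24x24 hardware multiplier can compute in one
# step. This is the classic schoolbook split that four smaller units can
# do in parallel.

def mul32_from_halves(a, b):
    """Multiply two 32-bit values using only 16x16-bit partial products."""
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    # Four partial products, shifted into place and summed.
    return ((a_hi * b_hi) << 32) + ((a_hi * b_lo) << 16) \
         + ((a_lo * b_hi) << 16) + (a_lo * b_lo)

# Cross-check against Python's exact big-integer multiply:
import random
for _ in range(1000):
    x, y = random.getrandbits(32), random.getrandbits(32)
    assert mul32_from_halves(x, y) == x * y
```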
 2009-09-25, 14:19 #131 jasonp Tribal Bullet     Oct 2004 2·3·19·31 Posts I think all of the complexity of porting algorithms to GPUs boils down to making the important part of the working set fit into caches that are much too small. If your applications do not require double-precision floating point, getting started with graphics card programming is ridiculously cheap. Nvidia's development libraries are a free download, and a 1-year-old GPU with 1GB of memory costs ~$100 and potentially can crunch 5-10x faster than the machine it's plugged into.
 2009-09-25, 14:46 #132 nucleon     Mar 2003 Melbourne 5×103 Posts Yep, what 5mack said....Tflops. Oops. -- Craig
