mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
Old 2009-09-06, 01:53   #122
nucleon

Mar 2003
Melbourne

5·103 Posts

Posting some of my research over the last week:

According to http://en.wikipedia.org/wiki/FLOPS, here is a breakdown of peak performance for the current flagship Intel CPU (Core i7 965 XE) and Nvidia GPU (Tesla C1060, similar to the GTX 280):

Code:
            SP   DP   (peak GFLOPS)
Core i7    140   70
Nvidia     933   78

SP = 32-bit floats, DP = 64-bit floats.
These are optimistic theoretical peaks, so don't expect to achieve them in practice, but they're a good guide for comparing the two. Let's say DP performance is pretty similar between the two architectures.

The big difference between Nvidia's SP and DP throughput boils down to the hardware ratio: one 64-bit FP unit for every eight 32-bit FP units. I think this has been mentioned before. As GPUs advance, who knows how this ratio will evolve. (A new generation of GPUs is due in the next couple of months.)
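
As a sanity check on those Wikipedia numbers, here's how the Tesla C1060 peaks fall out of the unit counts and shader clock. A minimal sketch: the 1.296 GHz clock and the usual MAD+MUL dual-issue counting (3 SP flops/cycle) are my assumptions, not figures from the post.

Code:
#include <stdio.h>

int main(void) {
    double clock_ghz = 1.296;          /* C1060 shader clock (assumed) */
    int sp_units = 240, dp_units = 30; /* 30 SMs: 8 SP + 1 DP unit each */
    /* SP counted as MAD (2 flops) + MUL (1 flop) dual issue = 3/cycle;
       DP counted as one fused multiply-add = 2 flops/cycle */
    printf("SP peak: %.0f GFLOPS\n", sp_units * clock_ghz * 3); /* ~933 */
    printf("DP peak: %.0f GFLOPS\n", dp_units * clock_ghz * 2); /* ~78  */
    return 0;
}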

Then also take this little gem:

from https://visualization.hpc.mil/wiki/Introduction_to_CUDA

"Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display."

This suggests (to me anyway) that coding for CUDA isn't exactly a trivial process. There will be gotchas in the development process. There's no way a GPGPU client will see mass adoption if it causes screen corruption.
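
The standard workaround for that watchdog is to split a long computation into many short kernel launches, each finishing well under the limit. A minimal CUDA sketch of the idea; the kernel body, launch geometry, and chunk sizes are placeholders, not anything from a real client:

Code:
#include <cstdio>

__global__ void work_chunk(float *data, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < iters; k++)       // a short slice of the full job
        data[i] = 0.5f * data[i] + 1.0f;  // dummy bounded workload
}

int main() {
    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    for (int chunk = 0; chunk < 1000; chunk++) {  // many small launches
        work_chunk<<<n / 256, 256>>>(d, 10000);   // each one well under 5 s
        cudaDeviceSynchronize();                  // driver gets the GPU back
    }
    cudaFree(d);
    return 0;
}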

So what are the x86 camp doing? Intel?

http://en.wikipedia.org/wiki/Larrabee_(GPU)

Due in the first half of 2010, Larrabee will also compete in the GPGPU and high-performance computing markets. It will run x86 code, a platform on which Prime95 has an excellent heritage. (There is no mention of its 64-bit float capability.)

If I could put my holier-than-thou hat on: yes, in a commercial development organization one would set aside resources to develop a proof of concept on GPU platforms and then make a business call on which way to go, but when resources are tight it's a really tough call.

Anyone want to have a stab at a proof-of-concept LL test on a GPU?
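
For anyone taking a stab at it, the mathematical core is tiny; the hard part on a GPU is the multi-million-digit squaring. A minimal CPU-side sketch for small exponents (p ≤ 31, so everything fits in 64-bit integers; a real client would replace the squaring with an FFT-based multiply):

Code:
#include <stdio.h>
#include <stdint.h>

/* Lucas-Lehmer: Mp = 2^p - 1 is prime iff s(p-2) == 0,
   where s(0) = 4 and s(i+1) = s(i)^2 - 2 mod Mp. */
static int lucas_lehmer(int p) {
    uint64_t m = (1ULL << p) - 1;
    uint64_t s = 4;
    for (int i = 0; i < p - 2; i++)
        s = (s * s + m - 2) % m;  /* +m keeps the subtraction unsigned-safe */
    return s == 0;
}

int main(void) {
    int exps[] = {3, 5, 7, 11, 13, 17, 19, 23, 31};
    for (int i = 0; i < 9; i++)
        printf("2^%d-1 is %s\n", exps[i],
               lucas_lehmer(exps[i]) ? "prime" : "composite");
    return 0;
}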

Plenty of training resources at:

http://www.nvidia.com/object/cuda_education.html

-- Craig
Old 2009-09-07, 08:37   #123
lfm

Jul 2006
Calgary

110101001₂ Posts

Quote:
Originally Posted by fivemack
This business about Brook+ allowing you to use double-precision units on the ATI HD3870 card is quite interesting; the system behaves as if it has 64 775MHz double-precision FPUs rather than 320 single-precision ones, but it looks as if you get something which might be fairly adequate.
Note this is a 5x penalty for DP. It seems to correspond to the 10x penalty quoted for software DP emulation using SP mentioned in the FAQ. A factor of 2 is the minimal hardware support implemented for DP. I think Nvidia has a similar scheme on a few of their top-end boards.
Old 2009-09-07, 18:31   #124
SPWorley

Jul 2009

31₁₀ Posts

Quote:
Originally Posted by lfm
Note this is a 5x penalty for DP. It seems to correspond to the 10x penalty quoted for software DP emulation using SP mentioned in the FAQ. A factor of 2 is the minimal hardware support implemented for DP. I think Nvidia has a similar scheme on a few of their top-end boards.
The AMD and NVidia approaches are very different.

AMD uses its single-precision ALUs with microcode programming to produce DP results. These are not IEEE doubles, though they are still much higher precision than single precision. So DP is slower on AMD since it takes many more clocks to compute.

NVidia instead has full IEEE double-precision ALUs in hardware, in addition to its single-precision ALUs. For example, a GTX 275 card has 240 single-precision ALUs but only 30 double-precision ALUs.
So NV's double precision runs at full speed; there are just fewer ALUs.
The architecture hides all of this from you, so they could ship any mix of single- and double-precision ALUs and your code wouldn't know or care.
The cards also have a "special function" unit for transcendental evaluation (sin, cos, log, etc.). This is something that even CPUs don't have; there, those operations are usually done in microcode.

All of the G200-series cards have these native DP processors, so it's not just their top-end boards; it's mid-level ($150) and above.
Old 2009-09-07, 20:48   #125
lfm

Jul 2006
Calgary

5²·17 Posts

Quote:
Originally Posted by SPWorley
The AMD and NVidia approaches are very different.

AMD uses its single-precision ALUs with microcode programming to produce DP results. These are not IEEE doubles, though they are still much higher precision than single precision. So DP is slower on AMD since it takes many more clocks to compute.

NVidia instead has full IEEE double-precision ALUs in hardware, in addition to its single-precision ALUs. For example, a GTX 275 card has 240 single-precision ALUs but only 30 double-precision ALUs.
So NV's double precision runs at full speed; there are just fewer ALUs.
The architecture hides all of this from you, so they could ship any mix of single- and double-precision ALUs and your code wouldn't know or care.
The cards also have a "special function" unit for transcendental evaluation (sin, cos, log, etc.). This is something that even CPUs don't have; there, those operations are usually done in microcode.

All of the G200-series cards have these native DP processors, so it's not just their top-end boards; it's mid-level ($150) and above.
It seems to have the same end effect: a 5x penalty on ATI, and it looks like 1/5th the ALUs. I think it's an 8-to-1 ratio on Nvidia (current cards?)

From what I could gather on the Nvidia site, the 250 and 210 models do not support DP, and there is even a 260 model (the 260M, the one that fits in a motherboard) with no DP support. Are these the ones you mean currently under $150?
Old 2009-09-07, 21:15   #126
SPWorley

Jul 2009

31 Posts

Quote:
Originally Posted by lfm
It seems to have the same end effect: a 5x penalty on ATI, and it looks like 1/5th the ALUs. I think it's an 8-to-1 ratio on Nvidia (current cards?)

From what I could gather on the Nvidia site, the 250 and 210 models do not support DP, and there is even a 260 model (the 260M, the one that fits in a motherboard) with no DP support. Are these the ones you mean currently under $150?
Exactly: it's 8 SP : 1 DP : 1 SF for the G200 Nvidia cards. The ATI penalty isn't a constant 5x at all; it's variable by operation type since it's done in software, and it's not IEEE either, so it's sometimes hard to compare. (ATI is effectively running Kahan-style summation to emulate DP behavior, which is very fast for adds but slower for multiplies and especially painfully slow for divides.)
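
For the curious, the general flavor of that float-pair trick looks like the sketch below: a value is carried as an unevaluated sum of two singles, and an error-free transformation (Knuth's TwoSum here) recovers the bits a lone float drops. This is only an illustration of the technique; the actual ATI microcode sequences aren't public.

Code:
#include <stdio.h>

typedef struct { float hi, lo; } dsfloat;  /* value = hi + lo */

/* Double-single addition via Knuth's TwoSum. Compile with strict float
   semantics (no fused contractions) or the error term gets optimized away. */
static dsfloat ds_add(dsfloat a, dsfloat b) {
    float s   = a.hi + b.hi;
    float bb  = s - a.hi;
    float err = (a.hi - (s - bb)) + (b.hi - bb);  /* exact rounding error */
    float lo  = err + a.lo + b.lo;
    dsfloat r = { s + lo, 0.0f };
    r.lo = lo - (r.hi - s);  /* renormalize the pair */
    return r;
}

int main(void) {
    dsfloat x = { 1.0f, 1e-9f };  /* 1 + 1e-9 is not representable in SP */
    dsfloat y = ds_add(x, x);
    printf("hi = %.9g, lo = %.9g\n", y.hi, y.lo);  /* 2 and ~2e-9 survive */
    return 0;
}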

Yeah, none of the mobile chips have DP support, not even the latest GTS 260M.
The lowest NV card with DP support is the GTX 260, at $150. That gives you about 95 double-precision GFLOPS, which is very similar to the ideal DP flops of a 3 GHz quad-core i7 using packed SSE.
Old 2009-09-24, 20:03   #127
nucleon

Mar 2003
Melbourne

515₁₀ Posts

Stats on the soon-to-be-released ATI video card:

ATI 5000-series video card: 2.72 SP Gflops / 544 DP Gflops
Cost: $400ish

ATI article: http://pcper.com/article.php?aid=783

"AMD added IEEE754-2008 compliant precision"

Does that statement qualify it as suitable for the algorithms used to test 2^P-1 primes?

Today's FLOPS count for the entire GIMPS project: 41.761 Tflops.

Just one of these video cards would increase GIMPS's output by 1.3% (based on purely theoretical numbers).

Even if the first iteration of the code ran at 1/4 of theoretical, three of these cards would be 1% of GIMPS's output.
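
Checking that arithmetic against the figures in the post (the 1/4-of-theoretical efficiency is the post's own assumption, and the 544 DP GFLOPS peak is taken at face value):

Code:
#include <stdio.h>

int main(void) {
    double gimps_gflops = 41761.0;  /* GIMPS total from the post */
    double card_gflops  = 544.0;    /* 5000-series peak DP       */
    printf("one card at peak: %.2f%% of GIMPS\n",
           100.0 * card_gflops / gimps_gflops);               /* ~1.30% */
    printf("three cards at 1/4 efficiency: %.2f%%\n",
           100.0 * 3.0 * card_gflops / 4.0 / gimps_gflops);   /* ~0.98% */
    return 0;
}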

Anybody want to look at coding LL test in OpenCL?

-- Craig
Old 2009-09-25, 12:55   #128
zarabatana

Sep 2009

2³·3 Posts

Just to add some information about CUDA with elliptic curves:
http://eecm.cr.yp.to/mpfq.html
http://cr.yp.to/talks/2009.09.12/slides.pdf

I think these two links are very helpful.
If someone can compile this for use with Win32, please post the links.
Thank you in advance.
Old 2009-09-25, 13:05   #129
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

59×103 Posts

Quote:
Originally Posted by nucleon
ATI 5000-series video card: 2.72 SP Gflops / 544 DP Gflops
How come SP is 200 times less than DP? I think the figures are wrong somewhere.

Last fiddled with by retina on 2009-09-25 at 13:06
Old 2009-09-25, 14:01   #130
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

18EE₁₆ Posts

It is, of course, 2.72 SP Tflops peak.

Only a gigabyte of memory, so it's no use for GNFS linear algebra, though there does seem to be an addressing mode for the per-SIMD local store that is designed for doing radix-4 SP FFTs. Some reviews suggest that there are now four 24x24 integer multipliers rather than one 32x32; I'm not sure how good a trade-off that is, since it depends on how many extra bits you want to carry around between carry-propagation passes.
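
To make that carry-headroom trade-off concrete (a rough sketch, with illustrative limb and accumulator sizes): multiplying two b-bit limbs gives a product below 2^(2b), so a w-bit accumulator can absorb about 2^(w-2b) such products before a carry-propagation pass is forced.

Code:
#include <stdio.h>

int main(void) {
    /* products of two b-bit limbs are < 2^(2b); a w-bit accumulator
       then holds 2^(w-2b) of them before carries must be propagated */
    int w = 64;  /* accumulator width in bits */
    for (int b = 12; b <= 24; b += 4)
        printf("%2d-bit limbs: %llu products per carry pass\n",
               b, 1ULL << (w - 2 * b));
    return 0;
}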

Bernstein, Lange &c.'s work on mulmod on GPGPUs is all on very small numbers (say 210 bits) which fit in the GPU register sets; I'm not sure how painful things get if the numbers stop fitting there.

On the other hand, there's little point in my buying a 5850 to write GPGPU code on until the R800 Instruction Set Architecture document comes out and AMD pull their finger out and release the OpenCL-running-on-GPU implementation.
Old 2009-09-25, 14:19   #131
jasonp
Tribal Bullet

Oct 2004

2·3·19·31 Posts

I think all of the complexity of porting algorithms to GPUs boils down to making the important part of the working set fit into caches that are much too small.

If your applications do not require double-precision floating point, getting started with graphics-card programming is ridiculously cheap: Nvidia's development libraries are a free download, and a year-old GPU with 1GB of memory costs ~$100 and can potentially crunch 5-10x faster than the machine it's plugged into.
Old 2009-09-25, 14:46   #132
nucleon

Mar 2003
Melbourne

5×103 Posts

Yep, what 5mack said....Tflops.

Oops.

-- Craig