#122
Mar 2003
Melbourne
5·103 Posts
Posting some of my research over the last week:
According to http://en.wikipedia.org/wiki/FLOPS here is a breakdown of performance between the current flagship Intel (Core i7 965 XE) and Nvidia (Tesla C1060, similar to the GTX 280) parts:

Code:
Peak GFlops    SP    DP
Core i7       140    70
Nvidia        933    78
(SP = 32-bit floats, DP = 64-bit floats)

The big difference between SP and DP on the Nvidia part boils down to 1x 64-bit FP unit for every 8x 32-bit FP units. I think this has been mentioned before. As GPUs advance, who knows how this ratio will evolve. (A new generation of GPUs is due in the next couple of months.)

Then also take this little gem from https://visualization.hpc.mil/wiki/Introduction_to_CUDA:

"Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display."

This suggests (to me anyway) that coding for CUDA isn't exactly a trivial process; there will be gotchas in the development process. No way will there be mass adoption of a GPGPU client if it causes screen corruption.

So what is the x86 camp doing? Intel? http://en.wikipedia.org/wiki/Larrabee_(GPU) First half of 2010; Larrabee will also compete in the GPGPU and high-performance-computing markets. It will run x86 code, which Prime95 has an excellent heritage in. (No mention of 64-bit float capability.)

If I could put my holier-than-thou hat on: yes, in a commercial development organization one would set aside resources to develop on GPU platforms to show a proof of concept and then make a business call on which way to go, but when resources are tight it's a real tough call where to go.

Anyone want to have a stab at a proof-of-concept LL test on a GPU? (A quick way to check for that watchdog limit is sketched below.) Plenty of training resources at: http://www.nvidia.com/object/cuda_education.html

-- Craig
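A minimal, untested sketch of how one could check for that display watchdog from the CUDA runtime before committing to long-running kernels; kernelExecTimeoutEnabled is the relevant field of cudaDeviceProp:

Code:
// Sketch only: list CUDA devices and report whether the display watchdog
// (the ~5 second kernel run-time limit quoted above) applies to each one.
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA-capable device found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("Device %d (%s): watchdog %s\n", i, prop.name,
                    prop.kernelExecTimeoutEnabled
                        ? "enabled (display attached, 5 s limit)"
                        : "disabled");
    }
    return 0;
}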
#123
Jul 2006
Calgary
5²×17 Posts
Quote:
#124
Jul 2009
31 Posts
Quote:
AMD uses its single-precision ALUs with microcode programming to produce DP results. These are not IEEE doubles, though, but still much higher precision than single precision. So DP is slower on AMD since it takes many more clocks to compute.

Nvidia instead has full IEEE double-precision ALUs in hardware, in addition to its single-precision ALUs. So, for example, a GTX 275 card has 240 single-precision ALUs but only 40 double-precision ALUs. NV's double precision runs at full speed, there are just fewer ALUs for it. The architecture hides all of this from you, so they could ship any mix of single- and double-precision ALUs and your code doesn't know or care.

The cards also have a "special function" unit for transcendental evaluation (sin, cos, log, etc.); this is something that even CPUs don't have, where those are usually done in microcode.

All of the G200 series of cards have these native DP processors, so it's not just their top-end boards; it's mid-level ($150) and above.
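Purely as an illustrative sketch (not vendor code, and which unit actually executes each line is the compiler's and hardware's business), here is a tiny kernel that touches all three kinds of unit described above: the single-precision ALUs, the double-precision ALUs, and the special-function unit via the fast __sinf() intrinsic:

Code:
// Illustration only: the same input pushed through single-precision math,
// double-precision math and a special-function-unit intrinsic.
__global__ void unit_demo(const float *in, float *sp_out,
                          double *dp_out, float *sfu_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        sp_out[i]  = x * x + 1.0f;                  // single-precision ALUs
        dp_out[i]  = (double)x * (double)x + 1.0;   // double-precision ALUs
        sfu_out[i] = __sinf(x);                     // special-function unit (fast sin)
    }
}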
#125
Jul 2006
Calgary
1A9₁₆ Posts
Quote:
From what I could gather on the Nvidia site, the 250 and 210 models do not support DP, and there is even a 260 model (the 260M, the one that fits on a motherboard) that has no DP support. Are these the ones you mean that are currently under $150?
#126
Jul 2009
31 Posts
Quote:
Yeah, none of the mobile chips have DP support, not even the latest GTS 260M. The lowest NV card with DP support is the GTX 260; that's $150. That gives you about 95 double-precision GFlops, which is very similar to the ideal DP flops of a 3 GHz quad-core i7 processor using packed SSE.
#127
Mar 2003
Melbourne
5·103 Posts
Stats on the soon-to-be-released ATI video card:

ATI 5000-series video card: 2.72 SP Gflops / 544 DP Gflops. Cost: $400-ish.
ATI article: http://pcper.com/article.php?aid=783

"AMD added IEEE754-2008 compliant precision" - does that statement qualify it as suitable for the algorithms used on 2^P-1 primes?

Today's flops count for the entire GIMPS project is 41.761 Tflops. Just 1 of these video cards would increase GIMPS output by 1.3% (based on purely theoretical numbers). Even if the first iteration of code ran the software at 1/4 of theoretical, 3 of these cards would be 1% of GIMPS output.

Anybody want to look at coding an LL test in OpenCL? (A toy version of the recurrence itself is sketched below.)

-- Craig
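For anyone tempted: the Lucas-Lehmer recurrence itself is tiny; all of the work in a real GPU client is in the multi-million-digit squaring mod 2^p - 1 (usually done with an FFT). A toy host-side sketch of the recurrence, only valid for exponents up to 31 so everything fits in 64-bit integers:

Code:
// Toy Lucas-Lehmer test: for an odd prime p, M_p = 2^p - 1 is prime iff
// s_{p-2} == 0, where s_0 = 4 and s_{k+1} = s_k^2 - 2 (mod M_p).
// Only valid here for p <= 31 so the squaring fits in 64 bits.
#include <cstdio>
#include <cstdint>

static bool lucas_lehmer(unsigned p)          // assumes 2 < p <= 31, p prime
{
    const uint64_t m = (1ULL << p) - 1;       // M_p = 2^p - 1
    uint64_t s = 4;
    for (unsigned k = 0; k + 2 < p; ++k)
        s = (s * s + m - 2) % m;              // +m guards against s < 2
    return s == 0;                            // s_{p-2} == 0  <=>  M_p prime
}

int main()
{
    const unsigned exponents[] = { 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 };
    for (unsigned p : exponents)
        std::printf("2^%2u - 1 is %s\n", p,
                    lucas_lehmer(p) ? "prime" : "composite");
    return 0;
}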
#128
Sep 2009
2³×3 Posts
Just to add some information about CUDA with elliptic curves:
http://eecm.cr.yp.to/mpfq.html
http://cr.yp.to/talks/2009.09.12/slides.pdf

I think these two links are very helpful. If someone can compile this for use on win32, please post the links. Thank you in advance.
#129
Undefined
"The unspeakable one"
Jun 2006
My evil lair
13673₈ Posts
How come SP is 200 times less than DP? I think the figures are wrong somewhere.
Last fiddled with by retina on 2009-09-25 at 13:06
#130
(loop (#_fork))
Feb 2006
Cambridge, England
1100011101110₂ Posts
It is, of course, 2.72 SP Tflops peak.
Only a gigabyte of memory, so no use for GNFS linear algebra, though there does seem to be an addressing mode for the per-SIMD local store which is designed for doing radix-4 SP FFTs.

Some reviews suggest that there are now four 24x24 integer multipliers rather than one 32x32; I'm not sure how good a trade-off that is, since it depends on how many extra bits you wanted to carry around between carry-propagation passes (see the sketch below).

Bernstein, Lange et al.'s work on mulmod on GPGPU is all on very small numbers (say 210 bits) which fit in the GPU register sets; I'm not sure how painful things get if the numbers stop fitting there.

On the other hand, there's little point in my buying a 5850 to write GPGPU code on until the R800 Instruction Set Architecture document comes out, and AMD pull their finger out and release the OpenCL-running-on-GPU implementation.
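To illustrate the carry-headroom point, here is a host-side sketch only (nothing to do with the actual R800 ISA): with 24-bit limbs each 24x24 partial product is 48 bits, so a 64-bit accumulator can absorb on the order of 2^16 partial products before a carry-propagation pass is forced, whereas 32x32 -> 64-bit products leave no slack at all.

Code:
// Sketch: schoolbook multiply of numbers stored as 24-bit limbs, deferring
// all carry propagation to a single final pass. Each 24x24 partial product
// is <= 48 bits, so the 64-bit accumulators have 16 spare bits of headroom.
#include <cstdint>
#include <cstdio>
#include <vector>

static const uint32_t LIMB_MASK = (1u << 24) - 1;

static std::vector<uint32_t> mul_limbs(const std::vector<uint32_t> &a,
                                       const std::vector<uint32_t> &b)
{
    std::vector<uint64_t> acc(a.size() + b.size(), 0);    // lazy accumulators
    for (size_t i = 0; i < a.size(); ++i)
        for (size_t j = 0; j < b.size(); ++j)
            acc[i + j] += (uint64_t)a[i] * b[j];           // <= 48-bit product

    std::vector<uint32_t> out(acc.size());
    uint64_t carry = 0;
    for (size_t k = 0; k < acc.size(); ++k) {              // one carry pass
        carry += acc[k];
        out[k] = (uint32_t)(carry & LIMB_MASK);
        carry >>= 24;
    }
    return out;
}

int main()
{
    // 0xABCDEF * 0x123456, each a single 24-bit limb (least significant first)
    std::vector<uint32_t> a = { 0xABCDEF }, b = { 0x123456 };
    std::vector<uint32_t> r = mul_limbs(a, b);
    std::printf("low limb = 0x%06X, high limb = 0x%06X\n",
                (unsigned)r[0], (unsigned)r[1]);
    return 0;
}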
#131
Tribal Bullet
Oct 2004
2×3×19×31 Posts
I think all of the complexity of porting algorithms to GPUs boils down to making the important part of the working set fit into caches that are much too small.
If your applications do not require double-precision floating point, getting started with graphics-card programming is ridiculously cheap. Nvidia's development libraries are a free download, and a 1-year-old GPU with 1GB of memory costs ~$100 and can potentially crunch 5-10x faster than the machine it's plugged into.
#132
Mar 2003
Melbourne
203₁₆ Posts
Yep, what 5mack said....Tflops.
Oops.

-- Craig