2007-08-15, 19:15  #12
Sep 2006
The Netherlands
13·53 Posts 
Quote:
Vincent 

2007-08-17, 06:51  #13
Jun 2003
3^{2}·17 Posts 
Hey guys,
Google acquired Peakstream Inc. back in June and shut down the Peakstream web site (http://arstechnica.com/news.ars/post...treaminc.html), so it may be too late to look into using their software technology. However, you might still be interested in reading a couple of articles that Jon Stokes of Ars Technica wrote about it: http://arstechnica.com/news.ars/post/200609187763.html and http://arstechnica.com/news.ars/post...pufusion.html
2007-10-18, 08:36  #14
Apr 2003
Berlin, Germany
169_{16} Posts 
According to the rumour site The Inquirer, the next GPU models from both Nvidia and ATI will support double precision:
http://www.theinquirer.net/gb/inquir...gpgpumonsters
Somewhere else I've also seen an ATI RV670-related slide mentioning double precision support. But various news articles have stated that the number of transistors won't increase by much. Because of this and the other arguments already brought up by diep, I think that, if this turns out to be the case, these DP implementations will be similar to the one known from Cell's SPEs. The SPEs are built for SP, but they can also do DP by using the SP units for DP calculations at higher latency. The throughput would suffer as well. The advantage is a low transistor overhead and lower power consumption compared to full-featured DP units. The 128-bit GPU registers might then be used as 2x64 bit instead of 4x32 bit. But we'll see.
2007-10-18, 14:18  #15
Sep 2006
The Netherlands
689_{10} Posts 
Quote:
I remember IBM giving out, a few years ago, a theoretical number of roughly 0.5 Tflop single precision floating point for their CELL processor @ 4 GHz. For double precision floating point their number was roughly 30 Gflop. So a loss of roughly a factor 16 moving from single precision to double precision.
We now know, however, that the CELL isn't clocked at 4 GHz, nor can it theoretically do 0.5 Tflop. In fact on the PlayStation 3 you have just 6 SPEs available and you aren't going to reach the theoretical speed, meaning that getting 100 Gflop single precision out of it is a lot.
Yet the CELL is a very workable design, in contrast to AMD's ATI thing. Despite phoning for weeks, I've yet to receive confirmation that CTM actually exists, let alone that you can freely get it to write software. So the action I took was giving away my ATI 2900 card to a nice guy who hopefully uses it a lot to game a tad faster than he used to.
All those graphics cards have so far been big potential on paper and big bla bla in reality. Let's see how the future works out there, as having many tiny CPUs definitely looks like a workable design.
At least CELL is a chip you can really calculate on, but in the end the quad-core AMDs and quad-core Intels just totally crush all those CPUs/GPUs/CELLs for prime number calculations; OTOH the Tesla is just too expensive. But at least Nvidia has something there; AMD's CTM seems like total vaporware. As long as it stays vaporware, they can claim petaflops of performance out of those GPUs while delivering a few Gflop. Is no one stopping them from making such unfounded claims?
Last fiddled with by diep on 2007-10-18 at 14:26

2007-11-24, 12:03  #16
"Jorge Coveiro"
Nov 2006
Moura, Portugal
32_{8} Posts 
Anyway
Anyway, we could just make a tool to run Prime95 on these cards, regardless of the speed. Many people may have purchased these cards, some in single-card mode, others in SLI. So if there is a way to split the double-precision work into single-precision (32-bit) operations, that would be OK. The thing here is that there may be many computers where we can use the code. So let's do it, regardless of the speed. It's a new resource; it might be slower, but that's OK, because we have a massive group of testing computers that might have these cards and could give a boost to the prime project. That's my message. Thank you.
Or wait for Nvidia 92's (double-precision cards).
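One common way to "split" double precision into 32-bit operations is so-called double-single arithmetic, where a value is stored as an unevaluated sum of two floats. The following is only a minimal sketch of that idea, not Prime95 code and not a GPU kernel: the helper names (`f32`, `two_sum`, `ds_add`, `ds_from_float`, `ds_to_float`) are made up for illustration, and float32 rounding is emulated in plain Python via `struct`.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def two_sum(a: float, b: float):
    """Knuth's error-free addition, every step rounded to float32.
    Returns (s, err) with s = fl32(a + b) and s + err == a + b."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err

def ds_from_float(v: float):
    """Split a value into a (hi, lo) pair of float32 numbers."""
    hi = f32(v)
    lo = f32(v - hi)
    return hi, lo

def ds_add(x, y):
    """Add two (hi, lo) pairs using float32 operations only."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))  # renormalise so that |lo| is small relative to hi
    return hi, lo

def ds_to_float(x) -> float:
    """Collapse a (hi, lo) pair back into one Python float for inspection."""
    return x[0] + x[1]

if __name__ == "__main__":
    tiny = 2.0 ** -30
    plain = f32(1.0 + tiny)                                  # tiny is lost in plain float32
    pair = ds_add(ds_from_float(1.0), ds_from_float(tiny))   # tiny survives in the lo word
    print(plain == 1.0, ds_to_float(pair) == 1.0 + tiny)
```

Note the price: one emulated addition in this sketch costs eleven single-precision operations, and the pair carries roughly 48 significant bits rather than the 53 of a true double.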
2007-11-24, 21:29  #17
Sep 2006
The Netherlands
13×53 Posts 
Quote:
It is a very interesting problem to ponder. What's needed is to rewrite the DWT in 32-bit floating point. Using emulated doubles isn't going to get the maximum out of it. An additional requirement for video cards is that you need to do it massively parallel. The parallel part is solvable; I've done mass parallelizing before.
So the first hurdle is to get 32-bit floating point to work for an FFT. The DWT, if I understand the paper correctly, takes an implicit modulo for special types of primes, reducing the FFT size by a factor 2; as a result this speeds things up by more than a factor 2 over a default FFT multiplication.
So the problem IMHO can be split into a number of steps, with the last phase moving from a well-working FFT to a DWT. The first problem to solve is getting an FFT going that gives a lossless result for tens of millions of bits; that will require some help from some of the great theoretical math guys around here. Are there examples around showing this? I'm sure this problem isn't new.
What is new is the type of FFT that might be fastest for video cards. They are relatively bad at memory throughput. Having 2 input streams, doing a multiply and having 1 output stream, so 2X bytes input from memory and X bytes output to memory, and that "2 log 2n" times (where log is log2), is a rather bad thing to use the video card for. I still have to read the Tesla docs to see what's fast on it, but it's not hard to guess that an algorithm doing more work per limb, yet streaming less from and to RAM, might be way faster. The number of instructions the GPUs can execute is far bigger than their bandwidth to memory.
(To show the insight into this: assume we have a 1 Tflop card. On paper it can execute 1000 billion instructions a second; in the case of a multiply that's 2 x 4 bytes input and 4 bytes output, so 12 bytes of bandwidth per flop, or 12 terabyte/s. The memory controller now delivers something above 100 GB/s with 4 memory controllers of DDR3. So the instructions can be executed at far higher speed than the bandwidth to and from cache and to and from RAM allows; up to a factor 100 difference.)
That creates a vacuum where a different (new?) type of FFT might come out as the superior solution. It might be better to start pondering this already, as the X Tflop GPUs are already there (Tesla). I'm not sure how many of the scientists here also put their thoughts on the net.
Any thoughts?
Last fiddled with by diep on 2007-11-24 at 21:42
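The back-of-the-envelope numbers in the post above can be written out as a few lines of arithmetic. This is just a sketch using the round figures from the post (1 Tflop peak, 100 GB/s memory bandwidth, a multiply reading two 4-byte operands and writing one); the function and variable names are made up for illustration.

```python
def bytes_per_flop(inputs: int = 2, outputs: int = 1, word_bytes: int = 4) -> int:
    """Memory traffic of one streamed operation: operands read plus results written."""
    return (inputs + outputs) * word_bytes

# Round figures taken from the post, not measured values.
peak_flops = 10**12          # 1 Tflop on paper
memory_bandwidth = 100 * 10**9   # about 100 GB/s

traffic = bytes_per_flop()                    # 12 bytes per multiply
needed_bandwidth = peak_flops * traffic       # bytes/s needed to feed every flop from RAM
shortfall = needed_bandwidth / memory_bandwidth

print(f"{traffic} bytes per flop")
print(f"{needed_bandwidth / 1e12:.0f} TB/s needed vs {memory_bandwidth / 1e9:.0f} GB/s available")
print(f"compute outruns memory by a factor of about {shortfall:.0f}")
```

With these inputs the gap comes out at about 120, in line with the "factor 100" estimate: an algorithm has to do on the order of a hundred operations per word fetched from RAM before the arithmetic units become the limit.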

2008-06-18, 00:40  #18
Sep 2006
The Netherlands
13·53 Posts 
GPGPU continued  Nvidia
Quote:
" [snip from beowulf mailing list]
An article posted today about the GTX280, which is to be released tomorrow, states that the GTX280 has "support for the IEEE 754R double-precision floating-point standard."
http://www.maximumpc.com/sites/maxim...s_next_gen_gpu
Craig Tierney "
Then I played devil's advocate and came up with the card delivering:
30 double precision processing cores * 1 instruction a cycle on average (optimistic guess) * 0.675 GHz = 20 Gflop of double precision power for the GTX280.
Oh well, 40 Gflop if they have a vector of 128 bits, but that's not yet confirmed (a reasonable assumption though).
Note it's 250 watt or so. That's from PSU to card at 12 volt, so we can safely assume it's a lot more from the wall socket. Probably up to 400 or so. Not so bad for a card that claims on marketing paper to have 1 Teraflop single precision.
Note the reviewer somehow comes to 90 Gflop double precision, not sure how. Maybe calculations for next year's Christmas.
The new CELL seems to be around 77 Gflop double precision for each chip, that's 150+ Gflop for each node. We can deduce that from the press release of the new supercomputer IBM announced @ 1 Petaflop. I simply calculated it back. 410 watt for each node of that supercomputer, including hard drives, a lot of RAM and network. That's what it eats from the wall socket, I assume.
I'd go for that CELL chip if you want to buy one of the two this year.
Vincent
p.s. cruelty, check PRP top, competition lining up for you :)
Last fiddled with by diep on 2008-06-18 at 00:53

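The GTX280 estimate in the post above is a plain product of unit count, instructions per cycle and clock. A small sketch of that arithmetic follows; the inputs (30 DP cores, 1 instruction per cycle, 0.675 GHz, and a possible 128-bit vector holding two doubles) are the post's own guesses, not confirmed hardware figures, and `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(units: int, instr_per_cycle: float, clock_ghz: float,
                vector_width: int = 1) -> float:
    """Paper peak in Gflop: units * instructions per cycle * GHz * vector lanes."""
    return units * instr_per_cycle * clock_ghz * vector_width

# Guesses from the post: 30 DP cores, 1 instruction/cycle, 0.675 GHz.
scalar_estimate = peak_gflops(30, 1, 0.675)
# Assumption from the post: a 128-bit vector holds two 64-bit doubles.
vector_estimate = peak_gflops(30, 1, 0.675, vector_width=2)

print(f"scalar: {scalar_estimate:.2f} Gflop, 128-bit vector: {vector_estimate:.2f} Gflop")
```

The two results round to the 20 and 40 Gflop quoted in the post; every factor is an assumption, so changing the assumed clock or instructions per cycle moves the estimate proportionally.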
2008-06-18, 03:27  #19
Einyen
Dec 2003
Denmark
3^{2}×331 Posts 
http://www.tomshardware.com/reviews/...80,19538.html
Quote:
Last fiddled with by ATH on 2008-06-18 at 03:27

2008-06-18, 07:43  #20
Sep 2006
The Netherlands
1261_{8} Posts 
Quote:
There are 2 small caveats in the Tom's Hardware way of calculating. If I apply the same logic that they use for double precision to single precision, then it is a multiplication of:
clock rate of 1.296 GHz for the ALU
3 instructions a cycle
vectors of 4 floats
240 stream processors
==> 1.296 GHz * 3 * 4 * 240 = 3732.48 Gflop = 3.73248 Tflop
Now that ain't true; the same article writer claims the GPU delivers 0.933 Tflop, so his logic applied to single precision shows the problem for double precision :)
The 2nd problem is the question of what you multiply with in the case of an FFT; accesses to the device RAM cannot be cached and are dead slow, and on the 8800 they are the bottleneck. The paper claim is that on 8800 hardware a RAM access eats 600 cycles, yet there are only 4 memory controllers and 240 stream processors to fulfill that promise...
...when multiplying a big FFT like the ones needed for GIMPS you cannot keep all data in local cache forever of course, as very little fits in there; ditto for the shared cache of 1 block.
Vincent
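The sanity check in the post above can be reproduced in a few lines. This is only a sketch of the arithmetic, using the figures quoted in the post (1.296 GHz, 3 instructions per cycle, vectors of 4 floats, 240 stream processors, and the 0.933 Tflop claim); the function name is made up for illustration.

```python
def paper_peak_gflops(clock_ghz: float, instr_per_cycle: int,
                      vector_width: int, processors: int) -> float:
    """Multiply out a paper peak the way the post does, in Gflop."""
    return clock_ghz * instr_per_cycle * vector_width * processors

claimed_gflops = 933.0  # the 0.933 Tflop single precision figure quoted in the post

with_vectors = paper_peak_gflops(1.296, 3, 4, 240)     # the reviewer's logic applied to SP
without_vectors = paper_peak_gflops(1.296, 3, 1, 240)  # same product minus the vector factor

print(f"with 4-wide vectors: {with_vectors:.2f} Gflop")
print(f"without the vector factor: {without_vectors:.2f} Gflop")
print(f"overshoot versus the claimed figure: {with_vectors / claimed_gflops:.2f}x")
```

Dropping the vector factor lands almost exactly on the claimed 0.933 Tflop, which suggests the factor of 4 is the term that does not belong in the product.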

2008-06-26, 02:06  #21
Jun 2008
3 Posts 
I'm new here, just want to say hey

2008-06-26, 02:08  #22
Jun 2008
11_{2} Posts 
hi
It didn't work before
