2007-08-15, 19:15   #12
diep (Sep 2006, The Netherlands)

Quote:
Originally Posted by Prime95 View Post
Ernst,

I think CUDA already has all the info you need.

http://www.mersenneforum.org/showthread.php?t=7150

If you read all their documentation, please let us know your opinions on how difficult it will be to integrate into your C code.
Speaking of which, how would one connect these cards to the mersenne.org project?

Vincent

2007-08-17, 06:51   #13
RMAC9.5 (Jun 2003)

Hey guys,
Because Google acquired Peakstream Inc. back in June and shut down the Peakstream web site (http://arstechnica.com/news.ars/post...tream-inc.html), it may be too late to look into using their software technology. However, you might still be interested in a couple of articles that Jon Stokes of Ars Technica wrote about it: http://arstechnica.com/news.ars/post/20060918-7763.html
and http://arstechnica.com/news.ars/post...pu-fusion.html

2007-10-18, 08:36   #14
Dresdenboy (Apr 2003, Berlin, Germany)

According to the rumour site The Inquirer, the upcoming GPU models from both Nvidia and ATI will support double precision:
http://www.theinquirer.net/gb/inquir...gpgpu-monsters

Elsewhere I've also seen an ATI RV670-related slide mentioning double precision support.

But several news articles have stated that the transistor count won't increase by much. Because of this, and the other arguments diep already brought up, I think that - if this turns out to be the case - these DP implementations will be similar to the one known from Cell's SPEs: the SPEs are really built for SP, but can also do DP by reusing the SP units, at higher latency. Throughput suffers accordingly. The advantage is low transistor overhead and modest power consumption compared to full-featured DP units.

The 128-bit GPU registers might then be used as 2×64-bit instead of 4×32-bit. But we'll see.
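
As a rough illustration of that reinterpretation (just an analogy in NumPy, not tied to any particular GPU's register file - it views the same 128 bits of storage both ways):

Code:
import numpy as np

raw = np.zeros(16, dtype=np.uint8)    # 16 bytes = 128 bits of storage
print(raw.view(np.float32).size)      # -> 4 lanes of 32-bit single precision
print(raw.view(np.float64).size)      # -> 2 lanes of 64-bit double precision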

2007-10-18, 14:18   #15
diep (Sep 2006, The Netherlands)

Quote:
Originally Posted by Dresdenboy View Post
According to the rumour site The Inquirer, the upcoming GPU models from both Nvidia and ATI will support double precision:
http://www.theinquirer.net/gb/inquir...gpgpu-monsters

Elsewhere I've also seen an ATI RV670-related slide mentioning double precision support.

But several news articles have stated that the transistor count won't increase by much. Because of this, and the other arguments diep already brought up, I think that - if this turns out to be the case - these DP implementations will be similar to the one known from Cell's SPEs: the SPEs are really built for SP, but can also do DP by reusing the SP units, at higher latency. Throughput suffers accordingly. The advantage is low transistor overhead and modest power consumption compared to full-featured DP units.

The 128-bit GPU registers might then be used as 2×64-bit instead of 4×32-bit. But we'll see.
Hi Dresdenboy,

I remember IBM quoting, a few years ago, a theoretical figure of roughly 0.5 Tflops for their Cell processor at 4 GHz in single precision floating point.

For double precision floating point their figure was roughly 30 Gflops.

So that's a loss of roughly a factor of 16 when moving from single precision to double precision.

We know now, however, that the Cell isn't clocked at 4 GHz, nor can it theoretically do 0.5 Tflops. In fact, on the PlayStation 3 you only have 6 SPEs available, and you aren't going to reach the theoretical speed; getting 100 Gflops single precision out of it is already a lot.

Yet the Cell is a very workable design, in contrast to AMD's ATI offering. Despite phoning for weeks, I've yet to receive confirmation that CTM actually exists, let alone that you can freely get it to write software with. So what I did was give my ATI 2900 card away to a nice guy who will hopefully use it to game a tad faster than he used to.

So far all those graphics cards have been big potential on paper and big blah blah in reality; let's see how the future works out there, as having many tiny CPUs definitely looks like a workable design.

At least the Cell is a chip you can really calculate on, but in the end the quad-core AMDs and quad-core Intels just totally crush all those CPUs/GPUs/Cells for prime number calculations; OTOH, the Tesla is just too expensive.

But at least Nvidia has something there; AMD's CTM seems like total vaporware.
As long as it stays vaporware, they can claim petaflops of performance from those GPUs while actually delivering a few Gflops. Is no one stopping them from making such unfounded claims?

Last fiddled with by diep on 2007-10-18 at 14:26

2007-11-24, 12:03   #16
JCoveiro "Jorge Coveiro" (Nov 2006, Moura, Portugal)

Anyway

Anyway, we could just make a tool to run Prime95 on these cards, regardless of the speed. Many people may have purchased them, some as a single card, others in SLI. So if there is a way to drop from double precision mode to single precision (32-bit) mode, that would be fine. The point is that there may be many computers out there on which we can use the code. So let's do it, independently of the speed. It's a new resource; it might be slower, but that's OK, because we would gain a massive group of testing machines with these cards, and the prime project would get a boost. That's my message. Thank you.

Or wait for the Nvidia 92's (double-precision cards).

2007-11-24, 21:29   #17
diep (Sep 2006, The Netherlands)

Quote:
Originally Posted by JCoveiro View Post
Anyway, we could just make a tool to run Prime95 on these cards, regardless of the speed. Many people may have purchased them, some as a single card, others in SLI. So if there is a way to drop from double precision mode to single precision (32-bit) mode, that would be fine. The point is that there may be many computers out there on which we can use the code. So let's do it, independently of the speed. It's a new resource; it might be slower, but that's OK, because we would gain a massive group of testing machines with these cards, and the prime project would get a boost. That's my message. Thank you.

Or wait for the Nvidia 92's (double-precision cards).
Usually you need two video cards: one to drive the display for the OS, and a second card on which you can then calculate.

It is a very interesting problem to ponder.

What's needed is to rewrite the DWT in 32-bit floating point; using emulated doubles isn't going to get the maximum out of the hardware.

An additional requirement on video cards is that you need to do it massively parallel. The parallelism is solvable; I've done mass parallelization before.

So the first hurdle is to get a 32-bit floating point FFT to work.

The DWT, if I understand the paper correctly, performs the modulo implicitly for special types of primes, reducing the FFT size by a factor of 2; as a result it speeds things up by more than a factor of 2 over a plain FFT multiplication. So IMHO the problem can be split into a number of stages, with the last phase moving from a well-working FFT to the DWT.
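
A toy illustration of that implicit modulo (my own example, not from the paper, and in double precision NumPy for readability): squaring modulo 2^16 - 1 with base B = 2^4. Since B^4 = 2^16 is congruent to 1 mod 2^16 - 1, the wraparound of a cyclic convolution performs the reduction for free - no separate modulo step, and no doubling of the transform length.

Code:
import numpy as np

B, n = 16, 4                          # base 2^4, four digits for 16 bits
x = 0xBEEF
digits = np.array([(x >> (4 * i)) & 0xF for i in range(n)], dtype=float)
f = np.fft.rfft(digits)
sq = np.rint(np.fft.irfft(f * f, n)).astype(int)   # cyclic self-convolution
val = sum(int(d) * B**i for i, d in enumerate(sq)) % (2**16 - 1)
assert val == (x * x) % (2**16 - 1)   # the modulo came out implicitly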

The first problem to solve is getting an FFT going that gives a lossless result for tens of millions of bits; that will require some help from the great theoretical math guys around here.

Are there examples around showing this?
I'm sure this problem isn't new.
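
For what it's worth, a minimal sketch of the standard recipe in NumPy (double precision; the function name, the base and the 0.25 error bound are my own choices): split the numbers into small digits, convolve via FFT, check the worst-case round-off before rounding, then propagate carries.

Code:
import numpy as np

def fft_mul(a, b, base=10**4):
    """Multiply little-endian digit arrays a and b via a floating-point FFT.
    Lossless as long as every coefficient rounds unambiguously to an integer."""
    n = 1
    while n < len(a) + len(b):
        n *= 2                                 # zero-pad to a power of two
    c = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)
    err = np.max(np.abs(c - np.rint(c)))       # worst-case distance to an integer
    assert err < 0.25, f"precision exhausted: {err}"
    out, carry = [], 0
    for v in np.rint(c).astype(np.int64):      # carry propagation
        carry, d = divmod(int(v) + carry, base)
        out.append(int(d))
    while carry:
        carry, d = divmod(carry, base)
        out.append(int(d))
    while len(out) > 1 and out[-1] == 0:       # strip leading zero limbs
        out.pop()
    return out

For example, fft_mul([3456, 12], [9321]) returns [3376, 5073, 11], the base-10^4 digits of 123456 × 9321. Shrinking the base buys head-room at the cost of more digits; with 32-bit floats the usable base is far smaller, which is exactly the difficulty above.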

What is new is the kind of FFT that might be fastest on video cards. They have relatively poor memory throughput. So reading 2 input streams, doing a multiply, and writing 1 output stream - i.e., 2X bytes in from memory and X bytes out - repeated 2·log2(n) times, is a rather bad way to use a video card.

I still have to read the Tesla docs to see what it is fast at, but it's not hard to guess that an algorithm doing more work per limb, yet streaming less to and from RAM, might be way faster.

The number of instructions the GPUs can execute is far bigger than their bandwidth to memory.
(To show the insight behind this: assume we have a 1 Tflop card. On paper that can execute 1000 billion instructions a second; in the case of a multiply that's, for example, 2 × 4 bytes of input and 4 bytes of output, so 12 bytes of bandwidth per flop, or 12 TB/s. A memory controller setup now delivers something above 100 GB/s with 4 DDR3 memory controllers. So instructions can be executed far faster than data can be moved to and from cache and RAM - a gap of up to a factor of 100.)
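
The same back-of-envelope in a couple of lines (my round numbers, matching the ones above):

Code:
flops = 1e12                  # hypothetical 1 Tflop card
naive_bw = flops * 12         # 12 bytes per multiply if nothing is ever reused
print(naive_bw / 100e9)       # vs ~100 GB/s of DRAM bandwidth -> a 120x shortfall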

That creates a vacuum in which a different (new?) type of FFT might come out as the superior solution. It might be wise to start pondering this already, as the X-Tflop GPUs are already there (Tesla).

I'm not sure how many of the scientists here also throw their thoughts onto the net.

Any thoughts?

Last fiddled with by diep on 2007-11-24 at 21:42

2008-06-18, 00:40   #18
diep (Sep 2006, The Netherlands)

GPGPU continued - Nvidia

Quote:
Originally Posted by Cruelty View Post
Could someone create a "sticky" (or even subforum) for subject "Running Prime95 on graphics cards"? Eventually it will be possible+feasible
Nvidia's new video card has been released - that is, the first benchmarks are showing up on websites. It seems to have 240 32-bit stream processors, and:

"
[snip from beowulf mailing list]
An article posted today about the GTX280, which is to be release tomorrow,
states that the GTX280 has "support for the IEEE-754R double-precision
floating-point standard."

http://www.maximumpc.com/sites/maxim...s_next_gen_gpu

Craig Tierney
"

Then I played devil's advocate and came up with the card delivering:

30 double precision processing cores × 1 instruction per cycle on average (optimistic guess) × 0.675 GHz = 20 Gflops of double precision power on the GTX 280. Or 40 Gflops if they have a 128-bit vector, but that's not yet confirmed (a reasonable assumption though). Note it's 250 watts or so - and that's from PSU to card at 12 volts, so we can safely assume it's a lot more from the wall socket. Probably up to 400 or so.

Not so bad for a card that claims on marketing paper to have 1 Teraflop single precision. Note the reviewer somehow arrives at 90 Gflops double precision; not sure how. Maybe calculations for next year's Christmas.
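
One way a figure in that ballpark drops out (my recomputation, assuming the 1.296 GHz shader clock and one double precision MAD unit per SM - the same assumptions as the Tom's Hardware excerpt quoted later in this thread):

Code:
dp_units = 10 * 3                          # 10 TPCs x 3 SMs, one DP unit per SM
shader_clock = 1.296e9                     # 1.296 GHz shader clock, in Hz
print(dp_units * 2 * shader_clock / 1e9)   # MAD = 2 flops -> 77.76 Gflops peak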

The new Cell seems to be around 77 Gflops double precision per chip, so 150+ Gflops per node. We can deduce that from the press release for the new 1-Petaflop supercomputer IBM announced; I simply calculated it back. It's 410 watts per node of that supercomputer, including hard drives, a lot of RAM and the network - that's what it draws from the wall socket, I assume.

I'd go for that Cell chip if you want to buy one of the two this year.

Vincent

p.s. Cruelty, check the PRP top - competition is lining up for you :)

Last fiddled with by diep on 2008-06-18 at 00:53

2008-06-18, 03:27   #19
ATH "Einyen" (Dec 2003, Denmark)

http://www.tomshardware.com/reviews/...80,1953-8.html

Quote:
The last change made to the Streaming Multiprocessors is support for double precision (floating-point numbers on 64 bits instead of 32). Let’s be clear – the additional precision is only moderately useful in graphics algorithms. But as we know, GPGPU is taking on more and more importance for Nvidia, and in certain scientific applications, double precision is a non-negotiable demand!

Nvidia is not the first company to take note of that. IBM recently modified its Cell processor to increase the performance of the SPUs for this type of data. In terms of performance, the GT200 implementation leaves something to be desired – double-precision floating-point calculations are managed by a dedicated Streaming Multiprocessor unit. With a unit capable of executing one double-precision MAD calculation per cycle, we get a peak performance of: 1.296 x 10 (TPC) x 3 (SM) x 2 (Multiply+Add) = 77.78 Gflops, or between 1/8th and 1/12th of the single-precision performance. AMD has introduced support by using the same processing units over several cycles, with noticeably better results – only between two and four times slower than single precision calculations.

Last fiddled with by ATH on 2008-06-18 at 03:27

2008-06-18, 07:43   #20
diep (Sep 2006, The Netherlands)

Hi,

There are two small caveats in the Tom's Hardware way of calculating.
If I apply the same logic they use for double precision to single precision, the multiplication becomes:

clock rate of 1.296 GHz for the ALUs
3 instructions per cycle
vectors of 4 floats
240 stream processors

==> 1.296 GHz × 3 × 4 × 240 = 3,732.48 Gflops ≈ 3.73 Tflops

Now that isn't true: the same article writer claims the GPU delivers 0.933 Tflops, so applying his logic to single precision exposes the problem with the double precision figure :)
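
For comparison, the usual way that 0.933 Tflops single precision figure is reconstructed (my recomputation; the 3 flops per cycle assumes a MAD plus one extra MUL per stream processor):

Code:
shader_clock = 1.296e9                # 1.296 GHz shader clock, in Hz
sp_flops = 240 * 3 * shader_clock     # 240 SPs x 3 flops/cycle (MAD + MUL)
print(sp_flops / 1e12)                # -> 0.933 Tflops, the article's own number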

The second caveat is the question of what you multiply with in the case of an FFT: accesses to device RAM cannot be cached and are dead slow, and on the 8800 they are the bottleneck. The paper claim is that on 8800 hardware a RAM access costs 600 cycles, yet there are only 4 memory controllers to serve 240 stream processors... When multiplying big FFTs like GIMPS needs, you of course cannot keep all the data in local cache forever, as very little fits in there; ditto for the shared memory of a block.

Vincent

2008-06-26, 02:06   #21
knector (Jun 2008)

I'm new here, just wanted to say hey.

2008-06-26, 02:08   #22
knector (Jun 2008)

hi

It didn't work before.