mersenneforum.org  

2007-08-30, 08:05   #1
Dresdenboy

Apr 2003
Berlin, Germany

551₈ Posts
SSE5

AMD will introduce SSE5 with the upcoming "Bulldozer" core (which is meant to be optimized for high throughput) in 2009.

More here:
http://developer.amd.com/sse5.jsp
http://www.extremetech.com/article2/...2177464,00.asp

2007-08-30, 11:35   #2
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

17FC₁₆ Posts

I haven't yet seen an SSE4 chip and AMD already want to do SSE5. I wonder how useful it will really be? Time will tell I guess.

2007-09-07, 09:10   #3
Dresdenboy

Apr 2003
Berlin, Germany

169₁₆ Posts

Quote:
Originally Posted by retina View Post
I haven't yet seen an SSE4 chip and AMD already want to do SSE5. I wonder how useful it will really be? Time will tell I guess.
AMD's initiative is not targeted at end customers, but at compiler and software developers. AMD and Intel did the same in the past for earlier instruction-set changes (MMX, 3DNow!, AMD64, SSEx). It's important for CPU manufacturers to have software support for new features available at launch.

Just think of the SSE4.1 benchmark results shown by Intel a while ago. If there is already an SSE4.1-optimized DivX encoder ready (as a beta version) to be run on an early CPU engineering sample, then the developers must have known about SSE4.1 long before.

SSE5 is in theory much more useful to Prime95 than SSE3 was (with its horizontal operations). Besides other changes, it will provide fused multiply-add (FMAC) with 3 source operands, which should be the most useful change.
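
To make the 3-operand fused multiply-add concrete, here is a minimal Python sketch of its semantics: d = a*b + c with a single rounding at the end, compared with a separate multiply followed by an add. The function names are made up for illustration, and the fused version is modelled with exact rational arithmetic, not with any real SSE5 instruction.

```python
from fractions import Fraction

def fma_model(a, b, c):
    """Hypothetical model of a fused multiply-add: a*b + c is formed
    exactly and rounded to a double only once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def mul_then_add(a, b, c):
    """Separate multiply and add: the product is rounded to a double
    before the add sees it."""
    return a * b + c

# Operands chosen so that the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 in the unfused version.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = mul_then_add(a, b, c)   # product rounds to 1.0, result is 0.0
fused = fma_model(a, b, c)        # keeps the low-order bits of the product
```

For throughput purposes the main point is that one instruction replaces a dependent multiply/add pair; the single rounding is a side benefit.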

2008-01-04, 20:39   #4
ewmayer
2ω=0

Sep 2002
República de California

2D6B₁₆ Posts

Quote:
Originally Posted by Dresdenboy View Post
AMD will introduce SSE5 with the upcoming "Bulldozer" core (which is meant to be optimized for high throughput) in 2009.
If the company is still around by then, you mean. :(

You're right, fused double-pumped SSE-based MADD would give a nice performance boost, but I wonder if you can pull it off without significantly increasing chip wattage. Besides the bugs in, delays with and [what now appear to be] outright lies about Barcelona performance by AMD, one of the things that will hurt them, even if they manage to get their newest chips to market and recapture some of the share they're hemorrhaging to Intel, is that their power consumption will be much higher than that of the Core2 series, and the disparity will only get worse once Intel finishes the shrink to 45nm. AMD desperately needs to stay viable in the notebook-PC market, and making more-power-hungry chips is the exact opposite of what they need to do in that respect.

Sure, the SSE5-capable CPUs may be "targeted" at the server market, but how much engineering talent is being diverted away from the high-volume-low-power consumer market as a result? I think Intel have got it right [once again] - make a fast, great-performing low-power chip series, sell it in 2-4-core versions in the PC market, and in 8-, 16-core and larger versions in the server market. That way you leverage the same single highly-optimized technology for both sectors. Such considerations should be even higher on the list for a company like AMD, since they have far less money and manpower to burn.

Hate to say it, but it's not looking good for AMD, either in terms of near-term deliverables or long-term strategy.

2008-01-10, 17:32   #5
ewmayer
2ω=0

Sep 2002
República de California

7·11·151 Posts

Forgot to mention: one thing I've been mentioning to anyone who might have an "in" with one of the major chip vendors' hardware groups over the past few years, is the desirability of a hardware instruction which computes a packed interleaved floating-point add/sub pair, in-place, i.e. which takes inputs x and y and returns x+y and x-y in their place. [2 such pairs for SSE2-style packed doubles, 4 for SSE packed singles.]

Note that this is different from the SSE3 addsubpd instruction, which takes packed doubles [x0,x1] and [y0,y1] and returns - taking x as the destination operand - [x0-y0,x1+y1]. My version of interleaved add/sub above would do double the number of FP adds and subs per instruction, in that it would take "destination" input [x0,x1] and "source" input [y0,y1] and return [x0+y0,x1+y1] and [x0-y0,x1-y1] in their stead.
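
The difference between the two operations can be sketched in a few lines of Python, with each packed register modelled as a 2-element list. The name add_sub_pair is invented here for the proposed instruction; no such instruction is known to exist under that name.

```python
def addsubpd(x, y):
    """Model of SSE3 addsubpd on packed doubles: subtract in the low
    lane, add in the high lane. One result register."""
    return [x[0] - y[0], x[1] + y[1]]

def add_sub_pair(x, y):
    """Hypothetical proposed instruction: returns the packed sums and the
    packed differences, which would overwrite x and y respectively."""
    sums = [x[0] + y[0], x[1] + y[1]]
    diffs = [x[0] - y[0], x[1] - y[1]]
    return sums, diffs

x = [5.0, 7.0]
y = [2.0, 3.0]
mixed = addsubpd(x, y)        # 2 FP operations in one instruction
x, y = add_sub_pair(x, y)     # 4 FP operations in one instruction
```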

The rationale is this: in computing e.g. x0+-y0, the floating-point sign, exponent and significand extraction and the shift-align step need only be done once, and then one can do what amounts to a fixed-point add/sub pair on the resulting normalized significands, along with the usual FP rounding and ensuing repacking [which would need to be done separately for the outputs of the add and sub, obviously]. Thus, one can get double the throughput for less than double the hardware, without breaking the x86-style two-operand instruction paradigm [except that both src and dest operands would be altered by the operation.]
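
A rough software model of that shared-alignment idea, assuming finite inputs: unpack both operands once, align the significands once, do an integer add and an integer subtract, then round and repack each result separately. All function names are made up for this sketch, and the alignment uses arbitrary-width integers where real hardware would use a bounded shifter with guard bits.

```python
import math
from fractions import Fraction

def unpack(v):
    """Split a finite double into (m, e) with v == m * 2**e, m an integer."""
    frac, exp = math.frexp(v)            # v == frac * 2**exp, 0.5 <= |frac| < 1
    return int(frac * (1 << 53)), exp - 53

def repack(m, e):
    """Round the exact value m * 2**e to the nearest double (one rounding)."""
    return float(Fraction(m) * Fraction(2) ** e)

def add_sub_shared_align(x, y):
    mx, ex = unpack(x)                   # extraction: done once per operand
    my, ey = unpack(y)
    e = min(ex, ey)                      # shift-align: done once, shared
    ax = mx << (ex - e)
    ay = my << (ey - e)
    total, diff = ax + ay, ax - ay       # fixed-point add/sub pair
    return repack(total, e), repack(diff, e)   # separate rounding/repacking

s, d = add_sub_shared_align(1.5, 0.1)
```

Because each result is the exact sum or difference rounded once, this model should agree with ordinary double-precision `x + y` and `x - y` for finite inputs.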

Defining a RISC-style version of this is also not hard, though one would need to relax the RISC 2-input-register-1-output-register paradigm. [But this is already done by many RISC chips for certain instructions - in any event, since the above is done in-place one could treat it as a 3-register RISC instruction with one null or dummy operand, which seems easier than allowing for a fully general [i.e. not nec. in-place] 2-input-register-2-output-register version.]

In my estimation this would give significantly more throughput bang for one's hardware buck than fused mul/add for computations such as transforms [FFT and other kinds]. The drawback is that it's less generally useful than FMADD.
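
As an illustration of how densely such pairs occur in transforms, here is a small sketch of an in-place fast Walsh-Hadamard transform, which consists of nothing but (x+y, x-y) pairs. It is chosen here only as a simple stand-in for the "other kinds" of transform; an FFT butterfly has the same add/sub structure after its twiddle multiply. The helper add_sub_pair is a hypothetical software stand-in for the proposed instruction.

```python
def add_sub_pair(x, y):
    """Stand-in for the proposed instruction: (x, y) -> (x + y, x - y)."""
    return x + y, x - y

def walsh_hadamard(values):
    """Unnormalized fast Walsh-Hadamard transform.
    Returns the transformed list and the number of add/sub pairs used."""
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    pairs = 0
    h = 1
    while h < n:                          # log2(n) stages
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                data[i], data[i + h] = add_sub_pair(data[i], data[i + h])
                pairs += 1                # n/2 pairs per stage
        h *= 2
    return data, pairs

result, pairs = walsh_hadamard([1.0, 2.0, 3.0, 4.0])
```

A length-n transform uses (n/2)·log2(n) pairs, so each pair done as one instruction instead of a separate add and sub halves the add/sub instruction count.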

2008-04-11, 17:34   #6
ewmayer
2ω=0

Sep 2002
República de California

7×11×151 Posts

Quote:
Originally Posted by ewmayer View Post
Hate to say it, but it's not looking good for AMD, either in terms of near-term deliverables or long-term strategy.
AMD Shows Tech Chief Hester the Door

Was he fired or did he quit? Either way, not a good sign for AMD. The fact that SSE5 has already been relegated to ho-hum by Intel's recent announcement of 256-bit 4-way-floating-double SSE-style instructions in their 2010 chips ain't good news for AMD, either. If only they listened to us prime folks about how to max chip performance. ;)

2009-11-25, 12:16   #7
Dresdenboy

Apr 2003
Berlin, Germany

19² Posts

Now, after AMD decided to include AVX (with FMA support, which Sandy Bridge won't have) and published some architecture details this November (a lot of that was already published in patents as I documented in my blog - my latest posting is about the FMAC architecture), it's time again to look at the options of doing prime search on either the Bulldozer core architecture or even using the upcoming APU designs (accelerated processing units - combined CPU + GPU designs).

So far, eight-core chips are planned, two of which will be combined into the 16-core MCM "Interlagos". Those eight cores are actually four "Bulldozer modules" (optimized dual cores). Each module will have two (vector) 128-bit-wide FMAC units in a shared FPU (a kind of SMT) and two integer cores (or clusters), each with four pipelines, a scheduler and an L1 D$. The maximum throughput of one of these FMAC units per cycle will be either
1x128 bit FMA
or
1x128 bit FADD and 1x128 bit FMUL.

One thread (running on one integer core, 1 thread per such core) can use one or both FMAC units per cycle, depending on availability.

The L1 D$s of both integer cores will feed the FPU, 2 loads per cycle (width is unknown so far).

It seems (I just found some evidence on AMD slides) that there will be a boost technology more advanced than "Turbo Boost". I described it in one of my power-management-related blog postings (see tags).

2009-11-29, 14:34   #8
joblack

Oct 2008
n00bville

5²·29 Posts

More interesting will be OpenCL support with the new Geforce Tesla cards ...

2009-11-30, 20:48   #9
Dresdenboy

Apr 2003
Berlin, Germany

361₁₀ Posts

Quote:
Originally Posted by joblack View Post
More interesting will be OpenCL support with the new Geforce Tesla cards ...
You are right. Intel, AMD and Nvidia all think that GPGPU (as an addition to conventional cores) is the way to go. Look at
Larrabee (already reaches 1 TFLOPS - DP I suppose),
Hemlock (the dual-GPU card from ATI with 4.6 TFLOPS in SP and 0.93 TFLOPS in DP - theoretical peak) and, last but not least,
Fermi-based systems, and look at their moves to integrate shader cores into general-purpose processors.

But the first variants will be less powerful, e.g. Llano with ~480 shader units (~130 DP GFLOPS) - just a bit more FP power than an 8-core Bulldozer at 3 GHz might have, at about the same power consumption and without the good GPU memory interface.
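
A back-of-the-envelope check of that comparison, using the module layout from post #7 (four modules, two 128-bit FMAC units per module) and the 3 GHz clock mentioned above. This is only a theoretical-peak sketch; the function name and the assumption that every FMAC unit issues one FMA per cycle are illustrative, not AMD figures.

```python
def peak_dp_gflops(modules, fmac_units_per_module, vector_bits, clock_ghz):
    """Theoretical peak double-precision GFLOPS, assuming every FMAC unit
    issues one full-width FMA (2 flops per element) each cycle."""
    doubles_per_vector = vector_bits // 64
    flops_per_fma = 2                      # one multiply plus one add
    return (modules * fmac_units_per_module * doubles_per_vector
            * flops_per_fma * clock_ghz)

# Assumed figures: 4 modules (8 cores), 2 FMAC units each, 128 bit, 3 GHz.
bulldozer_peak = peak_dp_gflops(4, 2, 128, 3.0)   # 96.0 GFLOPS
llano_estimate = 130.0                            # DP GFLOPS, from the post
```

Under these assumptions the 8-core chip peaks at 96 DP GFLOPS, which sits below the ~130 DP GFLOPS quoted for Llano.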

Maybe it will take another step until heterogeneous computing has come so far that the power of many small processors (similar to today's shaders) is simply there without any overhead - living in the same coherent memory space and so on.

Last fiddled with by Dresdenboy on 2009-11-30 at 20:49

2009-11-30, 21:49   #10
ldesnogu

Jan 2008
France

3·181 Posts

Quote:
Originally Posted by Dresdenboy View Post
Look at Larrabee (already reaches 1 TFLOPS - DP I suppose)
It's SP, I think. At least that's what this article claims: http://www.theregister.co.uk/2009/11...ote/page2.html

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.