mersenneforum.org SSE5

 2007-08-30, 08:05 #1 Dresdenboy     Apr 2003 Berlin, Germany 5518 Posts SSE5 AMD will introduce SSE5 with the upcoming "Bulldozer" core (which is meant to be optimized for high throughput) in 2009. More here: http://developer.amd.com/sse5.jsp http://www.extremetech.com/article2/...2177464,00.asp
 2007-08-30, 11:35 #2 retina Undefined     "The unspeakable one" Jun 2006 My evil lair 17FC16 Posts I haven't yet seen an SSE4 chip and AMD already want to do SSE5. I wonder how useful it will really be? Time will tell I guess.
2007-09-07, 09:10   #3
Dresdenboy

Apr 2003
Berlin, Germany

16916 Posts

Quote:
 Originally Posted by retina I haven't yet seen an SSE4 chip and AMD already want to do SSE5. I wonder how useful it will really be? Time will tell I guess.
AMD's initiative is not targeted at end customers, but at compiler and software developers. AMD and Intel did the same in the past for earlier instruction set changes (MMX, 3DNow!, AMD64, SSEx). It's important for CPU manufacturers to have software support for new features available at launch.

Just think of the SSE4.1 benchmark results shown by Intel a while ago. If there is already an SSE4.1-optimized DivX encoder ready (as a beta version) to be run on an early CPU engineering sample, then the developers must have known about SSE4.1 long before.

SSE5 is in theory much more useful to Prime95 than SSE3 was (with its horizontal operations). Besides other changes, it will provide fused multiply-add (FMAC) with 3 source operands, which should be the most useful change.

2008-01-04, 20:39   #4
ewmayer
∂2ω=0

Sep 2002
República de California

2D6B16 Posts

Quote:
 Originally Posted by Dresdenboy AMD will introduce SSE5 with the upcoming "Bulldozer" core (which is meant to be optimized for high throughput) in 2009.
If the company is still around by then, you mean. :(

You're right, fused double-pumped SSE-based MADD would give a nice performance boost, but I wonder if you can pull it off without significantly increasing chip wattage. Besides the bugs in, delays with and [what now appear to be] outright lies about Barcelona performance by AMD, one of the things that will hurt them even if they manage to get their newest chips to market and recapture some of the share they're hemorrhaging to Intel is that their power consumption will be much higher than the Core2 series, and the disparity will only get worse once Intel finishes the shrink to 45nm. AMD desperately needs to stay viable in the notebook-PC market, and making more-power-hungry chips is the exact opposite of what they need to do in that respect. Sure, the SSE5-capable CPUs may be "targeted" at the server market, but how much engineering talent is being diverted away from the high-volume-low-power consumer market as a result? I think Intel have got it right [once again] - make a fast, great-performing low-power chip series, sell it in 2-4 core in the PC market, and 8,16-core and above in the server market. That way you leverage the same single highly-optimized technology for both sectors. Such considerations should be even higher on the list for a company like AMD, since they have far less money and manpower to burn.

Hate to say it, but not looking good for AMD, in terms of either near-term deliverables or long-term strategy.

 2008-01-10, 17:32 #5 ewmayer ∂2ω=0     Sep 2002 República de California 7·11·151 Posts Forgot to mention: one thing I've been mentioning to anyone who might have an "in" with one of the major chip vendors' hardware groups over the past few years is the desirability of a hardware instruction which computes a packed interleaved floating-point add/sub pair, in-place, i.e. which takes inputs x and y and returns x+y and x-y in their place. [2 such pairs for SSE2-style packed doubles, 4 for SSE packed singles.] Note that this is different than the SSE3 addsubpd instruction, which takes packed doubles [x0,x1] and [y0,y1] and returns - taking x as the destination operand - [x0-y0,x1+y1]. The throughput of my above version of interleaved add/sub would be double the number of FP add and sub per instruction, in that it would take "destination" input [x0,x1] and "source" input [y0,y1] and return [x0+y0,x1+y1] and [x0-y0,x1-y1] in their stead. The rationale is this: in computing e.g. x0+-y0, the floating-point sign, exponent and significand-extraction-and-shift-align step need only be done once, and then one can do what amounts to a fixed-point add/sub pair on the resulting normalized significands, along with the usual FP rounding and ensuing repacking [which would need to be done separately for the output of the add and sub, obviously]. Thus, one can get double the throughput for less than double the hardware, without breaking the x86-style two-operand instruction paradigm [except that both src and dest operands would be altered by the operation]. Defining a RISC-style version of this is also not hard, though one would need to relax the RISC 2-input-register-1-output-register paradigm. [But this is already done by many RISC chips for certain instructions - in any event, since the above is done in-place one could treat it as a 3-register RISC instruction with one null or dummy operand, which seems easier than allowing for a fully general [i.e. not nec. in-place] 2-input-register-2-output-register version.] In my estimation this would give significantly more throughput bang for one's hardware buck than fused mul/add for computations such as transforms [FFT and other kinds]. The drawback is that it's less generally useful than FMADD.
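The difference between the two instructions described above can be sketched as a scalar model in C. This only models the data flow, not any real hardware: the `vec2d` struct is a hypothetical stand-in for a 128-bit register holding two doubles, `addsubpd_model` follows the SSE3 addsubpd behaviour as stated in the post, and `addsub_pair_model` is one reading of the proposed in-place add/sub pair, which does not exist as an instruction.

```c
/* Hypothetical stand-in for a 128-bit register holding two doubles. */
typedef struct {
    double lo;  /* element 0 */
    double hi;  /* element 1 */
} vec2d;

/* Model of SSE3 addsubpd: subtract in the low element, add in the
   high element; only the destination x receives a result. */
static vec2d addsubpd_model(vec2d x, vec2d y)
{
    vec2d r = { x.lo - y.lo, x.hi + y.hi };
    return r;
}

/* Model of the proposed instruction: both operands are overwritten,
   x with the element-wise sums and y with the element-wise
   differences, so one operation yields four results instead of two. */
static void addsub_pair_model(vec2d *x, vec2d *y)
{
    vec2d sum  = { x->lo + y->lo, x->hi + y->hi };
    vec2d diff = { x->lo - y->lo, x->hi - y->hi };
    *x = sum;
    *y = diff;
}
```

For x = [1,2] and y = [3,5], the addsubpd model returns [-2,7], while the proposed pair leaves x = [4,7] and y = [-2,-3]. The second form is exactly the radix-2 butterfly shape that FFT-style transforms compute over and over, which is presumably why it looks attractive for that workload.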
2008-04-11, 17:34   #6
ewmayer
∂2ω=0

Sep 2002
República de California

7×11×151 Posts

Quote:
 Originally Posted by ewmayer Hate to say it, but not looking good for AMD, in terms of either near-term deliverables or long-term strategy.
AMD Shows Tech Chief Hester the Door

Was he fired or did he quit? Either way, not a good sign for AMD. The fact that SSE5 has already been relegated to ho-hum by Intel's recent announcement of 256-bit 4-way-floating-double SSE-style instructions in their 2010 chips ain't good news for AMD, either. If only they listened to us prime folks about how to max chip performance. ;)

 2009-11-25, 12:16 #7 Dresdenboy     Apr 2003 Berlin, Germany 192 Posts Now that AMD has decided to include AVX (with FMA support, which Sandy Bridge won't have) and published some architecture details this November (a lot of that was already published in patents, as I documented in my blog - my latest posting is about the FMAC architecture), it's time again to look at the options of doing prime search on either the Bulldozer core architecture or even the upcoming APU designs (accelerated processing units - combined CPU + GPU designs). So far there will be eight-core chips, two of which will be combined into the 16-core MCM "Interlagos". Those eight cores are actually four "Bulldozer modules" (optimized dual cores). Each module will have two (vector) 128-bit-wide FMAC units in a shared FPU (kind of SMT) and two integer cores (or clusters), each with four pipelines, a scheduler and an L1 D$. Max throughput of one of these FMAC units per cycle will be either 1x128-bit FMA or 1x128-bit FADD and 1x128-bit FMUL. One thread (running on one integer core, 1 thread per such core) can use one or both FMAC units per cycle, depending on availability. The L1 D$s of both integer cores will feed the FPU, 2 loads per cycle (width is unknown so far). As it seems (just found some evidence on AMD slides), there will be some boost technology more advanced than "Turbo Boost". I described it in one of my power management related blog postings (see tags).
 2009-11-29, 14:34 #8 joblack     Oct 2008 n00bville 52·29 Posts More interesting will be OpenCL support with the new GeForce Tesla cards ...
2009-11-30, 20:48   #9
Dresdenboy

Apr 2003
Berlin, Germany

36110 Posts

Quote:
 Originally Posted by joblack More interesting will be OpenCL support with the new GeForce Tesla cards ...
You are right. Intel, AMD and Nvidia think that GPGPU (as an addition to conventional cores) is the way to go. Look at
Larrabee (already reaches 1 TFLOPS - DP I suppose),
Hemlock (this dual GPU card from ATI with 4.6 TFLOPS in SP and 0.93 TFLOPS in DP - theoretical peak) and last but not least
Fermi based systems and look at their moves to integrate shader cores on general purpose processors.

But the first variants will be less powerful. E.g. Llano with ~480 shader units (~130 DP GFLOPS) - just a bit more FP power than an 8-core Bulldozer at 3 GHz might have. And that at about the same power consumption and without the good GPU mem interface.
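As a rough check on that comparison, the theoretical peak of an 8-core Bulldozer can be estimated from the module layout given earlier in the thread (four modules, two 128-bit FMAC units per module, one FMA per unit per cycle). The sketch below is back-of-the-envelope arithmetic only: the function name `peak_dp_gflops` is made up, the 3 GHz clock is the hypothetical figure from the post, and counting one FMA as two floating-point operations is the usual convention rather than anything AMD has stated here.

```c
/* Hypothetical helper: theoretical peak double-precision GFLOPS,
   assuming every FMAC unit issues one full-width FMA per cycle. */
static double peak_dp_gflops(int modules,
                             int fmac_units_per_module,
                             int vector_bits,
                             double clock_ghz)
{
    int doubles_per_vector = vector_bits / 64;  /* 128 bits -> 2 doubles */
    int flops_per_fma = 2;                      /* one multiply + one add */
    return modules * fmac_units_per_module
         * doubles_per_vector * flops_per_fma * clock_ghz;
}
```

Calling `peak_dp_gflops(4, 2, 128, 3.0)` gives 96 GFLOPS, which fits the statement that ~130 DP GFLOPS is only somewhat more than such a chip might deliver. Real code would reach less than this peak on either device.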

Maybe it will take another step until heterogeneous computing has gone so far that the power of many small processors (similar to shaders today) is simply there without any overhead - living in the same coherent memory space and so on.

Last fiddled with by Dresdenboy on 2009-11-30 at 20:49

2009-11-30, 21:49   #10
ldesnogu

Jan 2008
France

3·181 Posts

Quote:
 Originally Posted by Dresdenboy Look at Larrabee (already reaches 1 TFLOPS - DP I suppose)
It's SP I think. At least that's what this article claims : http://www.theregister.co.uk/2009/11...ote/page2.html