#1 |
Dec 2008
Boycotting the Soapbox
2⁴·3²·5 Posts
Occasionally you'll come across a really cool way of doing something using SSE. Then you discover that it won't work, because the designers had a 50/50 or better chance of doing it right - and did it wrong.
pmuludq: It doesn't get any wronger than this. If you want to be fast using SSE, the trick is usually figuring out how to do it with one 8/16/32/64-bit value and then use SSE to process 16/8/4/2 values in one go. Instead of providing an unsigned multiply that delivers the high 32 bits for dword operands or the low 64 bits for qword operands, we get two 32x32->64 bit results. So, to do anything useful with this you'll ALWAYS need shuffles and/or unpacking, as the upper 32 bits of each 64-bit input lane are ignored and have to be processed somewhere else.

psll, psrl w/immediate: Aw, c'mon guys. If you've ever used these instructions, you'd know that 90% of the time you need a move to preserve the inputs. Why doesn't this have a SRC, DST form?

pcmp: There is no excuse for leaving out unsigned versions. Don't tell me that it requires real effort to include them: all compares have an immediate byte with unused bits, so for 50 extra transistors you could have xor'ed one bit with the top bit of the input.

pcmpestrm/pcmpistri: Finally! Now the only missing instruction is: paddcpuidtoweekdayxorbit19iftuesdayandstartinternetexplorer

addsubpd: a.k.a. mycomplextypeis128bitsoSSEistheanswerpd

addsubps: a.k.a. ivectorizecodethewrongwayps
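For readers who want to see the pmuludq and pcmp complaints concretely, here is a minimal sketch in C using SSE2 intrinsics. It assumes an x86 compiler that provides `<emmintrin.h>`; the function names `mullo_epu32_sse2` and `cmpgt_epu32_sse2` are made up for this illustration and are not from the thread. The first function builds a four-lane 32-bit multiply out of two pmuludq operations plus shifts and shuffles; the second does the "xor the top bit" trick in software to get an unsigned compare from the signed one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Low 32 bits of four 32x32 products, using only SSE2.
   _mm_mul_epu32 (pmuludq) multiplies dword lanes 0 and 2 only, so the
   odd lanes are shifted down, multiplied separately, and the results
   are interleaved again. */
static __m128i mullo_epu32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                   /* a0*b0, a2*b2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                 _mm_srli_epi64(b, 32));  /* a1*b1, a3*b3 */
    /* Move the low dword of each 64-bit product into lanes 0 and 1. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);                 /* p0, p1, p2, p3 */
}

/* Unsigned 32-bit a > b: flip the sign bit of both inputs, then use the
   signed compare (pcmpgtd). Each result lane is all ones or all zeros. */
static __m128i cmpgt_epu32_sse2(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    return _mm_cmpgt_epi32(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}
```

Counting operations, the multiply costs two pmuludq plus five shift/shuffle/unpack steps, which is the overhead the post is objecting to.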
#2 |
Tribal Bullet
Oct 2004
110111100111₂ Posts
Maybe PMULUDQ was designed for fixed-point operations where the result is expected to be rounded and shifted right to destroy the low bits. Splitting that across two registers would be very painful...
I also think many of the quirks in SSE2 instructions boil down to a constrained instruction encoding space. The real problem is that multiple precision arithmetic was not on the agenda when this stuff was designed, so we'll just have to make do with what we have.
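The fixed-point use case guessed at above can be sketched as follows. This is only an illustration of that guess, not a statement about why the instruction was designed this way; the function name `fixmul_round_epu32` and the `frac_bits` parameter are assumptions introduced here. Each 64-bit product stays in one register, so rounding and shifting need no cross-register work.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Multiply dword lanes 0 and 2 of a and b as unsigned fixed-point
   values with frac_bits fractional bits (1..63), rounding to nearest.
   The two results are returned as 64-bit lanes. */
static __m128i fixmul_round_epu32(__m128i a, __m128i b, int frac_bits)
{
    __m128i prod = _mm_mul_epu32(a, b);              /* two 64-bit products */
    __m128i half = _mm_set1_epi64x((long long)1 << (frac_bits - 1));
    __m128i cnt  = _mm_cvtsi32_si128(frac_bits);     /* shift count in low qword */
    return _mm_srl_epi64(_mm_add_epi64(prod, half), cnt);
}
```

With 16 fractional bits, 1.5 times 2.25 (raw 0x18000 and 0x24000) comes out as raw 0x36000, i.e. 3.375.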
#3 | ||
∂2ω=0
Sep 2002
República de California
5×2,351 Posts |
Quote:
pmuludq4 xmm0,xmm1,xmm2
where xmm0 and xmm1 are the inputs, stick the low halves of the 4 products in xmm1 and the high halves in xmm2. More elegantly, provide separate instructions to generate 4 lower and upper halves at a time, e.g.
pmulld xmm0,xmm1,xmm2
pmuludh xmm0,xmm1,xmm3
with the low halves output in xmm2 and the high halves in xmm3. This seems inefficient because it uses 2 instructions and the 2nd mul discards the lower halves, but one could add microcode support so that the hardware recognizes such paired lower-and-upper-half muls and fuses them into a single hardware operation, which splits the double-wide outputs into the 2 destination registers. All sorts of ways to do this.

Quote:
Another example - the utterly idiotic lack of any support whatsoever for complex MUL in SSE and SSE2. With a little bit of thought they could have added just 2 or 3 instructions (or better, enhanced some of the ones they did give us) to permit an efficient CMUL. These are supposed to be some of the world's top CPU and ISA people here - no excuse for those kinds of oversights.

Now, with AVX, floating-point support looks really good ... on the integer side they added some really nice crypto-related instructions, but they didn't do a goddamn thing to improve legacy generic-integer support (they didn't even widen the bandwidth from 128-bit to 256-bit as they did for floats), except for the aforementioned 3-operand format - which they should have done right from the start anyway, because with SSE they had in essence a blank slate to work with. You gave us a RISC-style register set, why not a set of RISC-style instructions to go along with it, which don't force us to do cycle-wasting register-operand copying at every turn? Now look how many hoops they have to jump through to retroactively do the right thing - a good fraction of their current and future ISA is in effect there only because its predecessors were so poorly thought out.

I cite AltiVec as a basis for comparison Jason will be intimately familiar with. *There* was a reasonably well-thought-out SIMD ISA ... add double-float and 64-bit int support in later generations and it would have been killer. Intel is great at shrinking a given architecture to incredibly small sizes and lowering power consumption, but good ISA designers, they are not. Not even close.

Last fiddled with by ewmayer on 2009-02-20 at 19:21
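To make the complex-MUL complaint concrete, here is one way a single double-precision complex product might be written with SSE2 only. It is a sketch under assumptions: the layout (real part in the low lane, one complex number per register) and the name `cmul_sse2` are chosen for this example and are not taken from Mlucas or any other code mentioned in the thread.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Complex product of a = (a_re, a_im) and b = (b_re, b_im), each held
   in one __m128d with the real part in the low lane. */
static __m128d cmul_sse2(__m128d a, __m128d b)
{
    __m128d b_re = _mm_unpacklo_pd(b, b);      /* (b_re, b_re) */
    __m128d b_im = _mm_unpackhi_pd(b, b);      /* (b_im, b_im) */
    __m128d a_sw = _mm_shuffle_pd(a, a, 1);    /* (a_im, a_re) */
    __m128d t1   = _mm_mul_pd(a, b_re);        /* (a_re*b_re, a_im*b_re) */
    __m128d t2   = _mm_mul_pd(a_sw, b_im);     /* (a_im*b_im, a_re*b_im) */
    /* Negate only the low lane of t2 by flipping its sign bit. */
    t2 = _mm_xor_pd(t2, _mm_set_pd(0.0, -0.0));
    return _mm_add_pd(t1, t2);                 /* (re, im) of a*b */
}
```

Two multiplies and one add do the arithmetic; the other four operations (two unpacks, a shuffle and a sign flip) only rearrange data. SSE3's addsubpd, mentioned in the first post, can replace the sign flip and the add.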
#4 | |
"Mark"
Apr 2003
Between here and the
7061₁₀ Posts
Quote:
#5 | |
Jan 2008
France
2²×149 Posts
Quote:
As far as bits in control registers go, that's generally handled differently and causes less trouble because these are not as wide as registers and you can apply some tricks by only forwarding parts of these control regs. So it's not a RISC issue or an encoding issue, it's a design constraint.

I hope I did not use too many technical terms and that I made my point clear...
#6 |
Tribal Bullet
Oct 2004
3,559 Posts |
Anyone who is interested in these sorts of issues could look at some of the work of the F-CPU project, which is (was?) an attempt to design a general-purpose 64-bit CPU from scratch, with SIMD designed in from the ground up. Because the first implementations needed to be efficient on FPGAs, and these have a strict limit on the number of read and write ports to the logic you would use for register files, the instruction set has many instructions that write two registers (e.g. a sum and a carry out in register (X) and (X XOR 1), respectively). I don't envy whoever tried to modify gcc to emit code for it :)
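A toy model of the paired-destination idea described above, written in C. Everything here is hypothetical scaffolding for illustration: the register-file array, the `NUM_REGS` constant and the function name `add_with_carry` are not taken from the F-CPU documentation. Only the rule "sum to register X, carry out to register X XOR 1" comes from the post.

```c
#include <stdint.h>

#define NUM_REGS 64  /* hypothetical register count for this model */

/* Model of a paired-destination add: the sum goes to register x and
   the carry out goes to register x XOR 1. */
static void add_with_carry(uint64_t regs[NUM_REGS], unsigned x,
                           uint64_t a, uint64_t b)
{
    uint64_t sum = a + b;      /* wraps modulo 2^64 */
    regs[x]     = sum;
    regs[x ^ 1] = (sum < a);   /* 1 if the addition overflowed, else 0 */
}
```

Because the second destination is implied by the first, the instruction still names only one destination register, which is presumably what keeps the encoding simple.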
#7 | |
∂2ω=0
Sep 2002
República de California
5×2,351 Posts |
Another example of "useless":
Quote:
So I tried using it for the write-outputs step at the end of the loop body of the radix-8 complex-DFT-pass loop in Mlucas last night, in full-optimized mode on my Win32/Core2Duo. And whaddya know - the runtime instantly more than doubled. Note I said "runtime", not "performance". My reaction was something along the lines of "You know, if I needed a way to get my CPU to run cooler, I'd just switch my system power options to max-battery-life mode or fill my assembly code with no-ops."

Another useless example: Since the various SSE mov--- instructions don't care about the data type (e.g. we can freely use movaps in place of movapd, which is in fact recommended because the former has a smaller opcode), why do we need separate MOVAPS, MOVAPD, MOVDQA instructions? (Similarly for the unaligned versions of these.)

Last fiddled with by ewmayer on 2009-02-20 at 23:12
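The point about the moves being type-agnostic can be illustrated with intrinsics. This is a minimal sketch, with the function name `copy2_pd_via_ps` invented for the example; it uses the unaligned single-precision load and store so that it works on any pointer, and which instruction the compiler finally emits is up to the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Copy two doubles using the single-precision load/store intrinsics.
   The moves only transfer 128 bits; they do not interpret the data. */
static void copy2_pd_via_ps(double *dst, const double *src)
{
    __m128 v = _mm_loadu_ps((const float *)src);  /* typically movups */
    _mm_storeu_ps((float *)dst, v);
}
```

The aligned forms `_mm_load_ps` and `_mm_store_ps` correspond to movaps and would need 16-byte aligned pointers.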
#8 | ||
Dec 2008
Boycotting the Soapbox
1320₈ Posts
Quote:
I'd rather have a status register, a 4-bit condition-code, free shifts & rotates in ALL instructions.
Quote:
I want 64 ARMs on one chip running at 2 GHz for my birthday. Oh, and 64 blocks of 64k dual ported RAM on the same chip. Thank you.

My objection is that it didn't have to be on the agenda. Just sticking to SIMD philosophy would have been sufficient. pmuludq mixes 32 and 64-bit operands, so this can't be right if the general idea is to have 2/4/8/16 independent streams.

Last fiddled with by __HRB__ on 2009-02-20 at 23:28
#9 | |
Tribal Bullet
Oct 2004
6747₈ Posts
Quote:
Sometimes I think MOVNTPD is designed only for memory copies; there's an AMD example whitepaper for that application where the MMX version of the instruction increases performance drastically because it allows write combining.
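A memory copy along those lines might look like the sketch below. It is not the code from the AMD whitepaper, which is not reproduced in the thread; the name `stream_copy_pd` and the choice of an unaligned load on the source side are assumptions made for this example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Copy n doubles (n even) using non-temporal stores.
   dst must be 16-byte aligned; src may be unaligned. */
static void stream_copy_pd(double *dst, const double *src, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0 && n % 2 == 0);
    for (size_t i = 0; i < n; i += 2)
        _mm_stream_pd(dst + i, _mm_loadu_pd(src + i));  /* movntpd */
    _mm_sfence();  /* order the streamed stores before later stores */
}
```

Non-temporal stores bypass the cache, so they tend to pay off only when the destination is not read again soon. That would be consistent with the slowdown reported earlier in the thread for outputs that the next pass presumably reads back.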
#10 |
Dec 2008
Boycotting the Soapbox
2⁴×3²×5 Posts
Code:
xorps xmm,xmm
punpck* mem, xmm

but with source and destination exchanged, we'd have a quick zero-extend of the inputs, fetching only what you need from memory. I'd like to quote the docs, but copy&paste is "forbidden by drm". I'm too lazy to figure out how to circumvent it. Anyhoo, they actually mention that you can use punpck* for this purpose with a source operand of 0.

This greatly benefits applications which store data in registers and zero-extend results before writing them to memory. By writing to the same memory location, data compression (lossy) of almost 100% is possible, while still being able to correctly reconstruct 50% of the input.

Last fiddled with by __HRB__ on 2009-02-21 at 04:37
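For comparison, this is roughly what the zero-extend has to look like with the operand order as it actually is: the data is loaded into a register first, and the zero register is the second operand of the unpack. A minimal sketch, with the function name `zext_u8_to_u16` invented for the example:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Zero-extend eight unsigned bytes at src to eight 16-bit words. */
static __m128i zext_u8_to_u16(const uint8_t *src)
{
    __m128i bytes = _mm_loadl_epi64((const __m128i *)src); /* movq: 8 bytes only */
    __m128i zero  = _mm_setzero_si128();                   /* pxor xmm,xmm */
    return _mm_unpacklo_epi8(bytes, zero);                 /* punpcklbw */
}
```

The separate load is the extra step the post is complaining about: with the zero in the destination and the data as the memory operand, the data bytes would land in the high half of each word instead of the low half.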
#11 | |
Jan 2008
France
2²·149 Posts
Quote:
Section 5.1 of the book Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools explains that the area of a register file increases as the square of the number of ports and access time increases linearly with the number of ports. Section 5.4.2 also explains how large the forwarding network can grow.

Anyway, that doesn't explain why the various x86 SIMD instruction sets are so odd. I guess it's the result of adding a few instructions at each generation, instead of spending a few years in R&D thinking about what is really needed in the longer term.
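As a back-of-the-envelope reading of that scaling rule, here is what it would imply for adding a second write port, which is what an instruction with two destination registers would need. The port counts are illustrative assumptions (a hypothetical 3-port file with two reads and one write, growing to 4 ports), not figures from the book, and $p$, $A$ and $t$ are symbols introduced here for the port count, area and access time.

```latex
A(p) \propto p^{2}, \qquad t(p) \propto p
\quad\Longrightarrow\quad
\frac{A(4)}{A(3)} = \frac{16}{9} \approx 1.78, \qquad
\frac{t(4)}{t(3)} = \frac{4}{3} \approx 1.33
```

Under those assumptions, one extra write port would cost roughly 78% more register-file area and about a third more access time.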