mersenneforum.org Useless SSE instructions

2009-02-21, 15:14   #12
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

2·3,343 Posts

Quote:
 Originally Posted by ldesnogu Anyway that doesn't explain why x86 SIMD various instruction sets are so odd. I guess it's the result of adding a few instructions at each generation, instead of spending a few years in R&D thinking about what is really needed in the longer term.
I strongly doubt that the current instruction set(s) is/are just a random collection of things someone threw in before going to lunch.

It seems to me that doing FFT on multi-megabit numbers was not a high priority for Intel/AMD/whoever. I would imagine video/audio encoding/decoding is very high on their list of things to consider when designing new instructions.

How many people do multi-precision arithmetic compared to how many play games, listen to music and/or watch movies? I imagine the ratio is a rather small one.

2009-02-21, 15:50   #13
ldesnogu

Jan 2008
France

2²×149 Posts

Quote:
 Originally Posted by retina I strongly doubt that the current instruction set(s) is/are just a random collection of things someone threw in before going to lunch.
No, they are not random additions, but they certainly are not well thought out, resulting in a stack of instructions that doesn't look very coherent. Intel's AVX looks better, and Altivec (and the newer Power SIMD ISA) looks nicer too.

But as you wrote, these were not designed with MP in mind.

2009-02-21, 16:01   #14
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10110110001110₂ Posts

Quote:
 Originally Posted by ewmayer Sounds pretty good, doesn't it? Sounds like "if you're done crunching a datum whose value is currently stored in an xmm register and you know that it won't be used (either in read or write mode) for a while, you should use this special MOV instruction to write it back to memory, because this bypasses the cache hierarchy and thus allows more soon-to-be-used data to enter the caches without risking being kicked out by it on its way back to main memory."
The first thing I thought when reading this paragraph was "cache side-channel attacks". Use Google if that phrase means nothing to you.

Paul

2009-02-21, 16:44   #15
__HRB__

Dec 2008
Boycotting the Soapbox

2⁴·3²·5 Posts

Quote:
 Originally Posted by retina I strongly doubt that the current instruction set(s) is/are just a random collection of things someone threw in before going to lunch.
I agree.

My guess is that it's the result of one serious rochambeau tournament that took the better part of the afternoon, too.

The guy in the sales department won a lot: he got to add 20 new instructions by aliasing them to existing instructions, and the guy responsible for putting together the "Instruction Set Reference" probably has a template and "add redundant instruction" macroed to F11.

Quote:
 Originally Posted by retina It seems to me that doing FFT on multi-megabit numbers was not a high priority for Intel/AMD/whoever. I would imagine video/audio encoding/decoding is very high on their list of things to consider when designing new instructions.
I doubt it.

Somewhere there's a forum specializing in video encoding with guys concluding that pmuludq must be for multi-megabit computations, because it's useless for anything else.

The rest of the thread is devoted to missing 16-bit floating point support, and wtf to do with the 205 instead of 200 idle cycles now that paddusw, psubsw, etc. eliminated some code instructions.

Last fiddled with by __HRB__ on 2009-02-21 at 16:46

2009-02-21, 16:57   #16
akruppa

"Nancy"
Aug 2002
Alexandria

2,467 Posts

While we're whining about instruction sets, aside from the general travesty of a cpu that is the x86 architecture, why-oh-why is there no setcc that sets a whole register to 0 or 1? With the current instructions, not only do you get a false dependency thanks to a partial register write, but you can't store the carry flag for later easy adding either, since the reg will still contain garbage in bits 8,...,63. This drove me mad while writing mulredc code for GMP-ECM.

Alex

2009-02-21, 17:14   #17
ldesnogu

Jan 2008
France

1124₈ Posts

Quote:
 Originally Posted by akruppa While we're whining about instruction sets, aside from the general travesty of a cpu that is the x86 architecture, why-oh-why is there no setcc that sets a whole register to 0 or 1? With the current instructions, not only do you get a false dependency thanks to a partial register write, but you can't store the carry flag for later easy adding either, since the reg will still contain garbage in bits 8,...,63. This drove me mad while writing mulredc code for GMP-ECM.
I had the same thought and had to resort to zero-extending the result of setcc with a movzbl. After all, there's a price to pay for a monstrosity that has lived far too long and that doesn't show any will to die the horrible death it deserves.

2009-02-21, 17:15   #18
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

2×3,343 Posts

Quote:
 Originally Posted by akruppa While we're whining about instruction sets, aside from the general travesty of a cpu that is the x86 architecture, why-oh-why is there no setcc that sets a whole register to 0 or 1? With the current instructions, not only do you get a false dependency thanks to a partial register write, but you can't store the carry flag for later easy adding either, since the reg will still contain garbage in bits 8,...,63. This drove me mad while writing mulredc code for GMP-ECM. Alex
I thought the partial register stall was only in the P4. Earlier CPUs were not affected, and later CPUs used the older architecture as a starting point and also avoided it. However, I always use sbb eax,eax to replicate the carry through a register and then just sub it later, instead of add, to accumulate carries. I only ever use setcc for accumulating multiple flag tests to avoid long lists of branches.
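For anyone who wants to see the arithmetic behind the sbb trick without reading assembly, here is a rough C model of it. This is only a sketch: the helper names (sbb_same, add_with_carry_out, count_carries) are made up for illustration, the carry flag is modelled as a plain 0/1 value, and it says nothing about timing or dependencies on a real CPU. It shows that "sbb reg,reg" yields 0 or all ones, and that subtracting that mask accumulates carries the same way adding a 0/1 value would.

```c
#include <assert.h>
#include <stdint.h>

/* Models "sbb eax,eax": 0 if the carry is clear, 0xFFFFFFFF (i.e. -1) if set. */
static uint32_t sbb_same(unsigned carry) {
    return (uint32_t)0 - (uint32_t)(carry & 1u);
}

/* Adds a + b modulo 2^32 and reports the carry-out (0 or 1) via *carry. */
static uint32_t add_with_carry_out(uint32_t a, uint32_t b, unsigned *carry) {
    uint32_t sum = a + b;      /* unsigned arithmetic wraps around */
    *carry = sum < a;          /* wrap-around happened iff sum < a */
    return sum;
}

/* Counts how many of the n pairwise additions overflow, accumulating
   the carries with a subtraction of the 0/-1 mask instead of an add. */
static uint32_t count_carries(const uint32_t *a, const uint32_t *b, int n) {
    assert(n >= 0);
    uint32_t total = 0;
    for (int i = 0; i < n; i++) {
        unsigned carry;
        (void)add_with_carry_out(a[i], b[i], &carry);
        uint32_t mask = sbb_same(carry);   /* 0 or -1 */
        total -= mask;                     /* subtracting -1 adds 1 */
    }
    return total;
}
```

With the inputs {0xFFFFFFFF, 1, 0x80000000} and {1, 2, 0x80000000}, the first and third additions wrap around, so count_carries returns 2.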

Last fiddled with by retina on 2009-02-21 at 17:16

2009-02-21, 18:01   #19
akruppa

"Nancy"
Aug 2002
Alexandria

100110100011₂ Posts

Quote:
 Originally Posted by retina I thought the partial register stall was only in the P4. Earlier CPUs were not affected and later CPUs used the older architecture as a starting point and also avoided it. However I always use sbb eax,eax to replicate the carry through a register and then just sub instead of add it later to accumulate carries. And only ever use setcc for accumulating multiple flag tests to avoid long lists of branches.
The AMD64 software optimization guide lists partial register reads/writes as something to avoid. I don't know if the Core2 identifies and ignores these false dependencies, but my (somewhat naive) understanding is that keeping track of the dependencies between instructions is hard enough even without treating each register as 4 independent pieces, where one instruction can write to one or several pieces.

Yes, "sbb reg, same reg" can be used to set a full register according only to the carry flag, and recent cpus even know that this instruction does not depend on the previous value of "reg," so no false dependency occurs here. However, I use three registers as a kind of ring buffer for the carry propagation, and need to add to the register that holds the carry... so having the carry value negated was a bit of a problem. I thought about flipping the sign of the partial result in every pass so I could use sbb and then keep subtracting. I didn't do that, though... I may try some time.

Alex

2009-02-21, 18:29   #20
__HRB__

Dec 2008
Boycotting the Soapbox

2⁴×3²×5 Posts

Quote:
 Originally Posted by retina sbb eax,eax
That's clever. I quickly deleted my post with an inferior solution.

I was thinking about using rcr or rcl to create a carry cache and use clc to remove dependencies to do several multiprecision adds in parallel.

Your solution is much better. You da man!

Here's why I'm so excited:

Let's suppose we want to do 2 multiprecision adds in parallel, with eax & edx as temporaries.

After sbb eax,eax, an add edx,edx will

a) remove the dependency on the carry
b) restore the carry

ending with sbb edx,edx
and setting up the second stream with add eax,eax
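The save/restore idea above can be modelled in portable C. This is a sketch under stated assumptions, not the actual assembly: the names (save_carry, restore_carry, adc, add_two_streams) and the chunk parameter are hypothetical, the carry flag is a plain 0/1 value, and nothing here reflects instruction counts or cycle timings. It only checks the logic: "sbb r,r" parks the carry as a 0/-1 mask, and "add r,r" on that mask regenerates the carry, so two carry chains can be interleaved.

```c
#include <assert.h>
#include <stdint.h>

/* Models "sbb r,r": parks the carry flag in a register as 0 or 0xFFFFFFFF. */
static uint32_t save_carry(unsigned carry) {
    return (uint32_t)0 - (uint32_t)(carry & 1u);
}

/* Models "add r,r" on a parked mask: the carry-out of mask + mask
   equals the carry that was saved (0 + 0 -> 0, -1 + -1 -> 1). */
static unsigned restore_carry(uint32_t mask) {
    uint32_t doubled = mask + mask;
    return doubled < mask;
}

/* Models "adc": a + b + carry-in, updating *carry with the carry-out. */
static uint32_t adc(uint32_t a, uint32_t b, unsigned *carry) {
    uint64_t wide = (uint64_t)a + b + *carry;
    *carry = (unsigned)(wide >> 32);
    return (uint32_t)wide;
}

/* Computes x += y and u += v (little-endian limbs), switching between the
   two carry chains every `chunk` limbs via save_carry / restore_carry. */
static void add_two_streams(uint32_t *x, const uint32_t *y,
                            uint32_t *u, const uint32_t *v,
                            int limbs, int chunk,
                            unsigned *carry_x, unsigned *carry_u) {
    assert(limbs >= 0 && chunk > 0);
    uint32_t saved_x = 0, saved_u = 0;          /* both chains start with carry clear */
    for (int base = 0; base < limbs; base += chunk) {
        int end = base + chunk < limbs ? base + chunk : limbs;
        unsigned cf = restore_carry(saved_x);   /* resume first chain */
        for (int i = base; i < end; i++) x[i] = adc(x[i], y[i], &cf);
        saved_x = save_carry(cf);               /* park its carry */
        cf = restore_carry(saved_u);            /* resume second chain */
        for (int i = base; i < end; i++) u[i] = adc(u[i], v[i], &cf);
        saved_u = save_carry(cf);
    }
    *carry_x = restore_carry(saved_x);
    *carry_u = restore_carry(saved_u);
}
```

The results are the same for any chunk size, since the mask carries exactly one bit of state per chain; the chunk size would only matter for scheduling on real hardware.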

I count 4 + 2*3*N instructions, so, if e.g. N==4, and Core 2 can do 1 add/clock, we're doing 8 multiprecision adds in 12 cycles. This is 25% faster than the naive implementation.

If we're doing X+Y and X-Y and can reuse X and/or Y, Athlons might be able to do more than one 64-bit adc/clock.

This would allow nice butterflies for medium sized power-of-two moduli.

Edit: Two adds can be replaced with shifts. Duh. 10/8=1.25 clocks/adc for core 2.

Last fiddled with by __HRB__ on 2009-02-21 at 18:39 Reason: I'm a moron

2009-02-25, 16:33   #21
akruppa

"Nancy"
Aug 2002
Alexandria

9A3₁₆ Posts

Quote:
 Originally Posted by ldesnogu I had the same thought and had to resort to zero-extending the result of setcc with a movzbl. After all, there's a price to pay for a monstrosity that has lived far too long and that doesn't show any will to die the horrible death it deserves.
Btw, zero-extending the result of setcc keeps the (false) dependency chain intact. It may be better to do a "mov $0, reg" before the setcc; it's just as ugly, but at least it breaks the dependency chain.

Alex

2009-02-25, 16:49   #22
ldesnogu

Jan 2008
France

2²·149 Posts

Quote:
 Originally Posted by akruppa Btw, zero-extending the result of setcc keeps the (false) dependency chain intact. It may be better to do a "mov $0, reg" before the setcc, it's just as ugly but at least it breaks the dependency chain.
I really have to forget my old m68k days and its all-instructions-set-flags view of the world.
I make this remark because I have restrictions on the order of instructions: registers (inputs and destination) are allocated for me from above. So in my case I was doing cmp + setcc + signextend, and I wrongly thought the only other option was mov 0 + cmp + setcc ("only" because I wrongly assumed that mov 0 clobbers flags), which doesn't work if my allocated destination reg overlaps one of the input regs.

Thanks for the hint!
