Quote:
 Originally Posted by ewmayer Note I said "runtime", not "performance". My reaction was something along the lines of "You know, if I needed a way to get my CPU to run cooler, I'd just switch my system power options to max-battery-life mode or fill my assembly code with no-ops."

There is a famous quote from Seymour Cray: What do we need software for?
It just slows the machine down.......

 2009-02-26, 21:33 #24 geoff     Mar 2003 New Zealand 13×89 Posts My whinge: SSE has AND and AND-NOT, but no NOT. So I synthesize NOT from AND, AND-NOT, and say PCMPEQD, which uses an extra scratch register. Why not have AND and NOT and let the programmer synthesize AND-NOT, no scratch register required? I suppose there must be a reason.
Quote:
 Originally Posted by geoff My whinge: SSE has AND and AND-NOT, but no NOT. So I synthesize NOT from AND, AND-NOT, and say PCMPEQD, which uses an extra scratch register. Why not have AND and NOT and let the programmer synthesize AND-NOT, no scratch register required? I suppose there must be a reason.
Synthesizing AND-NOT requires two instructions, so PANDN can be twice as fast. If you need NOT, you can do that with XOR in one instruction, using a memory operand and a location filled with FFFF, if you're experiencing register pressure.
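
As a rough illustration of the XOR approach (not code from this thread): with SSE2 intrinsics in C, NOT is a single PXOR against an all-ones constant. The helper names `not_si128` and `not_low32` are made up for this sketch, and it assumes an x86 target with SSE2. Whether the constant ends up as a memory operand or in a register is left to the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target with SSE2 */
#include <stdint.h>

/* Hypothetical helper: bitwise NOT of all 128 bits via PXOR with all-ones. */
static __m128i not_si128(__m128i x)
{
    /* All-ones constant; the compiler may load it from memory
       or materialize it in a register (e.g. with PCMPEQD). */
    const __m128i ones = _mm_set1_epi32(-1);
    return _mm_xor_si128(x, ones);
}

/* Hypothetical scalar probe: NOT of a 32-bit value routed through the vector path. */
static uint32_t not_low32(uint32_t v)
{
    return (uint32_t)_mm_cvtsi128_si32(not_si128(_mm_cvtsi32_si128((int)v)));
}
```

For example, `not_low32(0x0F0F1234u)` yields `0xF0F0EDCB`.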

 2009-04-22, 04:47 #26 __HRB__     Dec 2008 Boycotting the Soapbox 72010 Posts rcpps, but no rcppd! WTF? (nt) no text
Quote:
 Originally Posted by __HRB__ rcpps, but no rcppd! WTF? (nt)
Since the result is only 12 bits, there seems little sense in expanding it to a 53-bit mantissa. Do four conversions in one cycle and go from there to whatever final precision is needed.

divps, divpd

I should have paid attention to the thread title. What I meant was that divps & divpd are superfluous, since rcpps/rcppd & Newton-Raphson are faster and can be pipelined.

Quote:
 Originally Posted by retina Since the result is only 12 bits, there seems little sense in expanding it to a 53-bit mantissa. Do four conversions in one cycle and go from there to whatever final precision is needed.
The issue is that the missing rcppd forces one to use two extra instructions - convert doubles to floats and floats to doubles - blocking the execution ports for 2 cycles and adding 6-8 cycles in latency.
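
A minimal sketch of the workaround under discussion, written with SSE2 intrinsics rather than taken from any poster's code: convert the doubles to floats, take the RCPPS estimate, convert back, and refine with Newton-Raphson steps x_{n+1} = x_n * (2 - a * x_n). The names `recip_pd` and `recip` are hypothetical, and the three-step count is an assumption chosen to get from roughly 12 bits to double precision. The sketch assumes inputs that fit the float range and does not handle zero, infinity or NaN.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides _mm_rcp_ps); assumes x86 with SSE2 */

/* Hypothetical helper: approximate 1/a for two packed doubles. */
static __m128d recip_pd(__m128d a)
{
    const __m128d two = _mm_set1_pd(2.0);

    /* CVTPD2PS -> RCPPS -> CVTPS2PD: the two extra conversions
       that a packed-double reciprocal estimate would avoid. */
    __m128d x = _mm_cvtps_pd(_mm_rcp_ps(_mm_cvtpd_ps(a)));

    /* Each Newton-Raphson step roughly doubles the number of correct bits:
       about 12 -> 24 -> 48 -> limited by double rounding. */
    for (int i = 0; i < 3; ++i)
        x = _mm_mul_pd(x, _mm_sub_pd(two, _mm_mul_pd(a, x)));

    return x;
}

/* Hypothetical scalar probe: broadcast a, return lane 0 of the result. */
static double recip(double a)
{
    return _mm_cvtsd_f64(recip_pd(_mm_set1_pd(a)));
}
```

Each refinement step is two multiplies and a subtract, so the steps pipeline well across independent vectors, but the result is not guaranteed to be correctly rounded the way DIVPD's is.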

 2012-03-28, 17:58 #29 bsquared     "Ben" Feb 2007 72238 Posts pcmpgtw Ok, so pcmpgtw isn't exactly useless, but I'm really quite upset right now over the fact that there is no unsigned equivalent.
Quote:
 Originally Posted by bsquared Ok, so pcmpgtw isn't exactly useless, but I'm really quite upset right now over the fact that there is no unsigned equivalent.
PSUBUSW should get you almost all the way.

Quote:
 Originally Posted by axn PSUBUSW should get you almost all the way.
Yeah, cool!

This will do the job:
Code:

"pxor %%xmm0, %%xmm0 \n\t"    /* xmm0 := 0 */
"psubusw %%xmm1, %%xmm2 \n\t" /* xmm2 := b - a, saturating at 0 (xmm1 = a, xmm2 = b) */
"pcmpeqw %%xmm0, %%xmm2 \n\t" /* xmm2 := (a >= b) ? 0xFFFF : 0, per 16-bit word */

The extra dependency costs a cycle of latency, a "0" register must be set up (which can be reused for additional tests), and it's actually a ">=" test, but it's still a decent workaround.
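
For comparison, the same idea can be sketched with SSE2 intrinsics. This is an illustration under assumptions, not the poster's code: the helper names (`cmpge_epu16`, `cmpgt_epu16`, `ge16`, `gt16`) are invented here, and the strict ">" variant is one possible extension that complements the swapped ">=" test at the cost of one more instruction.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target with SSE2 */
#include <stdint.h>

/* Hypothetical helper: per 16-bit lane, 0xFFFF where a >= b (unsigned), else 0. */
static __m128i cmpge_epu16(__m128i a, __m128i b)
{
    /* PSUBUSW clamps at 0, so (b - a) == 0 exactly when b <= a. */
    return _mm_cmpeq_epi16(_mm_subs_epu16(b, a), _mm_setzero_si128());
}

/* Hypothetical helper: strict a > b, computed as NOT (b >= a). */
static __m128i cmpgt_epu16(__m128i a, __m128i b)
{
    return _mm_xor_si128(cmpge_epu16(b, a), _mm_set1_epi16(-1));
}

/* Scalar probes: broadcast the inputs and read back lane 0 of the mask. */
static uint16_t ge16(uint16_t a, uint16_t b)
{
    return (uint16_t)_mm_extract_epi16(
        cmpge_epu16(_mm_set1_epi16((short)a), _mm_set1_epi16((short)b)), 0);
}

static uint16_t gt16(uint16_t a, uint16_t b)
{
    return (uint16_t)_mm_extract_epi16(
        cmpgt_epu16(_mm_set1_epi16((short)a), _mm_set1_epi16((short)b)), 0);
}
```

A case where the signed PCMPGTW would give the wrong answer: `ge16(0x8000, 1)` returns `0xFFFF`, since 32768 >= 1 when the words are treated as unsigned.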

In the spirit of this thread, though, it still sucks that this is necessary...

 2012-03-28, 23:05 #32 Batalov     "Serge" Mar 2008 Phi(4,2^7658614+1)/2 235038 Posts "The only thing in the house that didn't suck was the vacuum cleaner."
THX

Quote:
 Originally Posted by Batalov "The only thing in the house that didn't suck was the vacuum cleaner."
I don't laugh often enough these days.

Sounds like Raymond Chandler or similar.

David

