Old 2009-03-23, 18:15   #8
__HRB__
 
 
Dec 2008
Boycotting the Soapbox

2D0₁₆ Posts

Quote:
Originally Posted by fivemack
A problem is that PSLLDQ requires the shift amount to be hard-coded at compile time, so you have to use something like the PSHUFB code in my example if you want to do a variable shift (see 'daft SSE restrictions' thread elsewhere)
Thank any god that's not true!

They screwed up everything else, but the variable shift you are looking for is one of the few things they actually got right (they screwed up the "xmm, imm8" one: they could have combined it with a move at zero cost).

Every SSE-shift (that I know of) has a "xmm, imm8" and a "xmm1, xmm2" form. See, e.g. http://www.ews.uiuc.edu/~cjiang/reference/vc256.htm

Maybe your compiler doesn't have the intrinsics or you were trying to specify the variable shifts using an 'int'?
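
To make the 'int' point concrete, here is a minimal sketch of a run-time shift count with SSE2 intrinsics. The function name shl_u32x4 and the scalar fallback are illustrative assumptions, not anything from the thread. As far as I know, the register-count form exists for the word/dword/qword shifts (PSLLW/PSLLD/PSLLQ and their right-shift counterparts, exposed as _mm_sll_epi16/32/64 and friends), and the count has to be passed as an __m128i rather than an int. The whole-register byte shift PSLLDQ (_mm_slli_si128) only takes an immediate, which is the case fivemack raised.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Shift four 32-bit lanes left by a count that is only known at run time.
   shl_u32x4 is a hypothetical helper name used for this sketch. */
void shl_u32x4(uint32_t out[4], const uint32_t in[4], int count)
{
    assert(count >= 0 && count < 32);
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* PSLLD xmm1, xmm2: the count is read from the low 64 bits of a register,
       so the int is first moved into an __m128i. */
    __m128i c = _mm_cvtsi32_si128(count);
    _mm_storeu_si128((__m128i *)out, _mm_sll_epi32(v, c));
#else
    /* Scalar fallback for targets without SSE2; same result per lane. */
    for (int i = 0; i < 4; i++)
        out[i] = in[i] << count;
#endif
}
```

The same count applies to all four lanes; per-lane counts are a different matter and are not covered by this form.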

You could also do two 'double to int32' conversions to avoid messing around too much with IEEE754: the first one erases the lower 32 bits of the mantissa, and subtracting its result from the original gives you the lower 32 bits.
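
One possible reading of the two-conversion idea, as a sketch only: scale by 2^-32 and truncate to get the upper part, subtract it back out, and convert the remainder. The function name split_double, the restriction to non-negative integer values below 2^53, and the 2^31 bias (needed because the remainder can exceed the signed int32 range) are all assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Split a non-negative, integer-valued double below 2^53 into two 32-bit
   halves using only two double -> int32 conversions.
   split_double is a hypothetical name for this sketch. */
void split_double(double x, uint32_t *hi, uint32_t *lo)
{
    assert(x >= 0.0 && x < 9007199254740992.0);        /* 2^53 */
    /* First conversion: x / 2^32, truncated. Scaling by a power of two is
       exact, so this drops exactly the lower 32 bits. */
    int32_t h = (int32_t)(x * (1.0 / 4294967296.0));
    /* Subtracting the upper part leaves the lower 32 bits, in [0, 2^32). */
    double rest = x - (double)h * 4294967296.0;
    /* Second conversion: bias by 2^31 so the value fits a signed int32,
       then undo the bias in unsigned arithmetic. */
    int32_t l = (int32_t)(rest - 2147483648.0);
    *hi = (uint32_t)h;
    *lo = (uint32_t)l + 0x80000000u;
}
```

No bit-level access to the IEEE754 representation is needed; everything goes through ordinary multiply, subtract and convert.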

Quote:
Originally Posted by fivemack
Decoupling the shift and the conversion seems like a good idea, and the extra parallelism from doing four conversions at once in floats seems useful; I'm just a little concerned that any power of 10 above 10^10 can't be stored exactly in a float (10^22 is the largest that fits exactly in a double), so I'd want to do the multiplication by the larger powers of ten once I'm working in doubles.
Yup, you're 100% right. The only thing I paid attention to was magnitude, so this definitely needs some doctoring.

We should be able to do everything up to 10^7/256 in floats, since log2(10^7)=23.2534966642, which is below the 24 bits of a float's significand. I therefore think only one double-precision multiply with [10^8,10^0] will be needed, and it should be possible to merge this multiply with the final correction, since I don't see anything that isn't distributive.
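
The exactness limits discussed here can be checked with a few lines of integer arithmetic. This is a sketch with made-up function names, and it ignores the exponent range, which is not the limiting factor for these values. Since 10^n = 5^n * 2^n and the factor 2^n only moves the exponent, 10^n is exact whenever 5^n fits in the significand; separately, every integer below 10^d is exact whenever 10^d does not exceed 2^bits.

```c
#include <assert.h>
#include <stdint.h>

/* Largest n such that 10^n is exactly representable with a significand of
   `bits` bits (24 for float, 53 for double). Only 5^n has to fit. */
int largest_exact_pow10(int bits)
{
    assert(bits > 0 && bits < 62);      /* keeps p5 * 5 inside uint64_t */
    uint64_t limit = (uint64_t)1 << bits;
    uint64_t p5 = 1;
    int n = 0;
    while (p5 * 5 < limit) {
        p5 *= 5;
        n++;
    }
    return n;
}

/* Largest d such that every integer below 10^d is exactly representable,
   i.e. 10^d <= 2^bits. */
int exact_decimal_digits(int bits)
{
    assert(bits > 0 && bits < 60);      /* keeps p10 * 10 inside uint64_t */
    uint64_t limit = (uint64_t)1 << bits;
    uint64_t p10 = 1;
    int d = 0;
    while (p10 * 10 <= limit) {
        p10 *= 10;
        d++;
    }
    return d;
}
```

With bits = 24 the first function gives 10 and the second gives 7, which matches both the 10^10 limit fivemack mentions and the 7-digit chunks used here; with bits = 53 the first gives 22.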

Quote:
Originally Posted by fivemack
It's not relevant in this case because the conversion probably takes longer than determining the index, but I'm reminded of some nice Nehalem string-instruction demo code from Intel which features the absurd line

if (A==16) u+=16; else u+=A;

Not too absurd if your strings are very long: the branch is predicted taken and that lets the OOO machinery run several iterations in parallel.
If OOO doesn't do the trick, one could unroll the loop by hand: discover the next 2-4 line feeds (or, if strlen is known, any 2-4 of them), and do every instruction 2-4x with different registers (if one has enough).

Of course, if you're willing to allow data-dependent speed-ups, we could also branch to code that handles all 2^8 distributions of LFs (at the cost of one mispredicted branch). A word composed of the first bit of all "pcmpeqb #LF" bytes could be used to compute the address of the jump-table & next chunk.
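
A minimal sketch of how such a word could be built for one 16-byte chunk, assuming the intended instruction pair is PCMPEQB followed by PMOVMSKB (which collects the top bit of every byte). The function name lf_mask16 and the scalar fallback are assumptions for illustration; the jump table itself is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Return a 16-bit word in which bit i is set when chunk[i] is a line feed.
   lf_mask16 is a hypothetical name; chunk must point at 16 readable bytes. */
unsigned lf_mask16(const unsigned char *chunk)
{
    assert(chunk != NULL);
#if defined(__SSE2__)
    __m128i v  = _mm_loadu_si128((const __m128i *)chunk);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8('\n')); /* PCMPEQB  */
    return (unsigned)_mm_movemask_epi8(eq);               /* PMOVMSKB */
#else
    /* Scalar fallback for targets without SSE2; same bit layout. */
    unsigned m = 0;
    for (int i = 0; i < 16; i++)
        m |= (unsigned)(chunk[i] == '\n') << i;
    return m;
#endif
}
```

Taking the low 8 bits (mask & 0xFF) gives a value in 0..255 that could index a 256-entry table of handlers, one per distribution of LFs in 8 bytes.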