Quote:
Originally Posted by fivemack
A problem is that PSLLDQ requires the shift amount to be hard-coded at compile time, so you have to use something like the PSHUFB code in my example if you want to do a variable shift (see 'daft SSE restrictions' thread elsewhere)
|
Thank any god that's not true!
They screwed up everything else, but the variable shift you are looking for is one of the few things they actually got right (they screwed up the "xmm, imm8" one: they could have combined it with a move at zero cost).
Every SSE-shift (that I know of) has a "xmm, imm8" and a "xmm1, xmm2" form. See, e.g.
http://www.ews.uiuc.edu/~cjiang/reference/vc256.htm
Maybe your compiler doesn't have the intrinsics or you were trying to specify the variable shifts using an 'int'?
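For what it's worth, here is a minimal sketch of a run-time shift count using the SSE2 intrinsics from `<emmintrin.h>`. The helper names `shift_left_var` and `low_lane` are made up for illustration. The point is that the count is not passed as a plain `int`: it has to be moved into an xmm register first. Note that this is the per-lane bit shift (PSLLQ); as far as I'm aware, the whole-register byte shift PSLLDQ (`_mm_slli_si128`) only takes an immediate, so this sketch does not cover it.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: shift both 64-bit lanes of v left by a run-time
   bit count. Counts above 63 give zero in both lanes. */
__m128i shift_left_var(__m128i v, int bits)
{
    assert(bits >= 0);
    /* Move the int into the low bits of an xmm register (zero-extended). */
    __m128i count = _mm_cvtsi32_si128(bits);
    /* PSLLQ xmm1, xmm2: the count is read from the low 64 bits of `count`. */
    return _mm_sll_epi64(v, count);
}

/* Hypothetical helper: read back the low 64-bit lane for inspection. */
uint64_t low_lane(__m128i v)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[0];
}
```

With these, `low_lane(shift_left_var(_mm_set_epi64x(0, 1), 12))` gives 4096, and the shift amount can come from any variable.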
You could also do two 'double to int32' conversions: the first one keeps only the upper part (erasing the lower 32 bits of the mantissa), and subtracting that from the original gives you the lower 32 bits. That avoids messing around too much with IEEE754.
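One possible reading of the two-conversion idea, as a scalar sketch in plain C. The function names `high32` and `low32` are hypothetical, and it assumes the double holds a non-negative integer below 2^53. The second conversion is done to an unsigned 32-bit value here because the remainder can exceed INT32_MAX; an SSE version using a signed conversion would need a bias, which is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define TWO32 4294967296.0  /* 2^32 */

/* Hypothetical helper: upper bits of an integer-valued double, 0 <= x < 2^53.
   Dividing by 2^32 only changes the exponent, and the cast truncates. */
uint32_t high32(double x)
{
    assert(x >= 0.0 && x < 9007199254740992.0);
    return (uint32_t)(int32_t)(x / TWO32);  /* first conversion */
}

/* Hypothetical helper: lower 32 bits, obtained by subtracting the upper
   part from the original. The subtraction is exact for inputs in range. */
uint32_t low32(double x)
{
    double rest = x - (double)high32(x) * TWO32;
    return (uint32_t)rest;  /* second conversion; unsigned is an assumption */
}
```

For example, 5 * 2^32 + 7 splits into a high part of 5 and a low part of 7, without ever looking at the bit pattern of the double.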
Quote:
Originally Posted by fivemack
Decoupling the shift and the conversion seems like a good idea, and the extra parallelism from doing four conversions at once in floats seems useful; I'm just a little concerned that any power of 10 above 10^10 can't be stored exactly in a float (10^22 is the largest that fits exactly in a double), so I'd want to do the multiplication by the larger powers of ten once I'm working in doubles.
|
Yup, you're 100% right. The only thing I paid attention to was magnitude, so this definitely needs some doctoring.
We should be able to do everything up to 10^7/256 in floats, since log2(10^7)=23.2534966642, which fits in the 24-bit significand of a float. I think therefore only one double precision multiply with [10^8,10^0] will be needed, and it should be possible to merge this multiply with the final correction, since I don't see anything that isn't distributive.
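The exactness limits quoted above can be checked with integer arithmetic. This is a small sketch, with the hypothetical function name `max_exact_pow10`, assuming 24 significand bits for a float and 53 for a double, and assuming the exponent range is not the limiting factor (true for the powers discussed here).

```c
#include <assert.h>
#include <stdint.h>

/* Largest k such that 10^k is exactly representable with a significand of
   `bits` bits. Since 10^k = 2^k * 5^k and the factor 2^k only moves the
   exponent, 10^k is exact as long as 5^k < 2^bits. */
int max_exact_pow10(int bits)
{
    assert(bits > 0 && bits < 62);  /* keeps pow5 * 5 within uint64_t */
    uint64_t limit = (uint64_t)1 << bits;
    uint64_t pow5 = 1;
    int k = 0;
    while (pow5 * 5 < limit) {
        pow5 *= 5;
        ++k;
    }
    return k;
}
```

`max_exact_pow10(24)` gives 10 and `max_exact_pow10(53)` gives 22, matching the 10^10 and 10^22 limits in the quote. Separately, every integer up to 10^7 is exact in a float because 10^7 < 2^24.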
Quote:
Originally Posted by fivemack
It's not relevant in this case because the conversion probably takes longer than determining the index, but I'm reminded of some nice Nehalem string-instruction demo code from Intel which features the absurd line
if (A==16) u+=16; else u+=A;
Not too absurd if your strings are very long: the branch is predicted taken and that lets the OOO machinery run several iterations in parallel.
|
If OOO doesn't do the trick, one could unroll the loop by hand: discover the next (or, if strlen is known, any) 2-4 line feeds, and do every instruction 2-4x with different registers (if one has enough).
Of course, if you're willing to allow data-dependent speed-ups, we could also branch to code that handles all 2^8 distributions of LFs (at the cost of one mispredicted branch). A word composed of the first bit of all "pcmpeqb #LF" bytes could be used to compute the address of the jump-table & next chunk.
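A minimal sketch of how such a word could be built with SSE2 intrinsics, assuming 16-byte chunks and the made-up function name `lf_mask16`. PCMPEQB sets matching bytes to 0xFF, and PMOVMSKB then collects the top bit of each byte into an integer. The jump table itself is not shown.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: bit i of the result is set when p[i] is a line feed.
   Reads exactly 16 bytes starting at p (unaligned load). */
unsigned lf_mask16(const char *p)
{
    __m128i chunk = _mm_loadu_si128((const __m128i *)p);
    __m128i lf    = _mm_set1_epi8('\n');
    __m128i eq    = _mm_cmpeq_epi8(chunk, lf);   /* PCMPEQB */
    return (unsigned)_mm_movemask_epi8(eq);      /* PMOVMSKB */
}
```

The low 8 bits of the result (`mask & 0xFF`) could then index a 256-entry table, one entry per distribution of LFs in an 8-byte half.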