mersenneforum.org instruction for counting bits?
 Register FAQ Search Today's Posts Mark Forums Read

 2007-10-18, 04:44 #1 bsquared     "Ben" Feb 2007 22×859 Posts instruction for counting bits? Is there an instruction in the x86 instruction set which will count the number of set bits in a byte, or word? I've looked through the AMD and Intel Architecture manuals a bit, and didn't see anything in the general purpose instructions. But I could have missed it, or maybe there is something in SSE(2,3,4) or MMX or something. In liu of that, the best I could come up with considering pipelining and all is the following: Code:  //val is an array of bytes. count += (val[i] & 0x80) >> 7; count += (val[i] & 0x40) >> 6; count += (val[i] & 0x20) >> 5; count += (val[i] & 0x10) >> 4; count += (val[i] & 0x8) >> 3; count += (val[i] & 0x4) >> 2; count += (val[i] & 0x2) >> 1; count += (val[i] & 0x1);
 2007-10-18, 05:06 #2 bsquared     "Ben" Feb 2007 22×859 Posts If I cast the array of bytes into an unsigned short pointer, and unroll some more, I get a 4% improvement... If I do the same for an unsigned int and unroll to 32 lines per loop iteration I get another 1%, so returns diminish as I would expect. A hardware instruction to just give me the number of set bits would far and away beat all of this, I would think, and be much cleaner.
2007-10-18, 05:52   #3
frmky

Jul 2003
So Cal

23×32×29 Posts

This post suggests that the fastest way is to use a lookup table for each byte.

Greg

Quote:
 Originally Posted by bsquared If I cast the array of bytes into an unsigned short pointer, and unroll some more, I get a 4% improvement... If I do the same for an unsigned int and unroll to 32 lines per loop iteration I get another 1%, so returns diminish as I would expect. A hardware instruction to just give me the number of set bits would far and away beat all of this, I would think, and be much cleaner.

 2007-10-18, 10:03 #4 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 638310 Posts There is a bit-count instruction in AMD's Barcelona processor. This isn't very helpful unless you have one of those. The paper from 1990 has measurements on a series of systems whose performance characteristics are rather different from contemporary hardware. I think a 256-byte table is probably still best for single bytes, but something like the HAKMEM recipe might well work better for 64-bit words. [you do X = (X & 0x55555555555) + ((X & 0xaaaaaaaaaaaaa)>>1) X = (X & 0x33333333333) + ((X & 0xccccccccccccccc)>>2) X = (X & 0x0f0f0f0f0f0f0f) + ((X & 0xf0f0f0f0f0f0f0f0)>>4) X = (X & 0x00ff00ff00ff) + ((X & 0xff00ff00ff00)>>8) X += X>>16; X += X>>32; X &= 0xff; where each step adds up the values in two adjacent 2^n-bit pockets of the register and writes the sum into a 2^{n+1}-bit pocket overwriting the original data. You can replace some of the lines by a multiplication by 0x10101010101010101... which is nice and fast on EMT64 hardware.
 2007-10-18, 10:33 #5 Dresdenboy     Apr 2003 Berlin, Germany 192 Posts I found a paper covering this topic, called "Beating the popcount": http://www.icis.ntu.edu.sg/scs-ijit/91/91-1.pdf
 2007-10-18, 13:35 #6 bsquared     "Ben" Feb 2007 22×859 Posts No Barcelona for me, unfortunately , but if it can sieve faster than the 3GHz Xeon Woodcrest I have access to I'll be impressed. Thank you all for the suggestions! I'll post results as I implement some of these. By the way, I should clarify my original improvement gains. The 5% was for the *overall* application (a sieve of erat.). The bit counting part was about 33% faster with the 32x loop unrolling. As of now, I can find and count all the primes up to 1e9 in 0.75 sec, on the above mentioned system.
 2007-10-18, 13:54 #7 bsquared     "Ben" Feb 2007 65548 Posts Cool! If i use the consistently fastest method from Greg's link (a table lookup), I get a 75% speedup over my 32x unrolled version! Here it is: Code:  //cast array of bytes into an unsigned int32 pointer flagblock32 = (uint32 *)soe.line[j]; //for each word of flags in this sieve line: for (i=0;i
 2007-10-18, 14:31 #8 bsquared     "Ben" Feb 2007 22×859 Posts This method: Code:  //cast array of bytes into an unsigned int32 pointer flagblock32 = (uint32 *)soe.line[j]; //for each word of flags in this sieve line: for (i=0;i> 1) & 0x55555555); n = (n & 0x33333333) + ((n >> 2) & 0x33333333); n = (n + (n >> 4)) & 0x0f0f0f0f; n += n >> 8; n += n >> 16; it += (n & 0xff); }` was unresolvable from the table lookup method for the 1e9 interval, but proved to be slightly slower when I made the interval bigger in order to get something longer in duration to measure. Now if I looked at 64 bit words in each loop iteration... maybe I'll look into that. - ben. Last fiddled with by bsquared on 2007-10-18 at 14:32

 Similar Threads Thread Thread Starter Forum Replies Last Post luisda2994 YAFU 1 2016-12-09 00:30 EdH Factoring 32 2016-10-12 20:49 pepi37 Homework Help 2 2016-04-24 21:18 fivemack Hardware 0 2009-03-25 12:09 PrimeCruncher Hardware 6 2004-05-01 11:53

All times are UTC. The time now is 07:52.

Fri May 7 07:52:58 UTC 2021 up 29 days, 2:33, 0 users, load averages: 1.54, 1.42, 1.61