mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2007-10-18, 04:44   #1
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

343610 Posts
Default instruction for counting bits?

Is there an instruction in the x86 instruction set which will count the number of set bits in a byte, or word? I've looked through the AMD and Intel Architecture manuals a bit, and didn't see anything in the general purpose instructions. But I could have missed it, or maybe there is something in SSE(2,3,4) or MMX or something.

In liu of that, the best I could come up with considering pipelining and all is the following:

Code:
 
//val is an array of bytes.  
count += (val[i] & 0x80) >> 7;
count += (val[i] & 0x40) >> 6;
count += (val[i] & 0x20) >> 5;
count += (val[i] & 0x10) >> 4;
count += (val[i] & 0x8) >> 3;
count += (val[i] & 0x4) >> 2;
count += (val[i] & 0x2) >> 1;
count += (val[i] & 0x1);
bsquared is offline   Reply With Quote
Old 2007-10-18, 05:06   #2
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

22×859 Posts
Default

If I cast the array of bytes into an unsigned short pointer, and unroll some more, I get a 4% improvement...

If I do the same for an unsigned int and unroll to 32 lines per loop iteration I get another 1%, so returns diminish as I would expect.

A hardware instruction to just give me the number of set bits would far and away beat all of this, I would think, and be much cleaner.
bsquared is offline   Reply With Quote
Old 2007-10-18, 05:52   #3
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

23×32×29 Posts
Default

This post suggests that the fastest way is to use a lookup table for each byte.

Greg

Quote:
Originally Posted by bsquared View Post
If I cast the array of bytes into an unsigned short pointer, and unroll some more, I get a 4% improvement...

If I do the same for an unsigned int and unroll to 32 lines per loop iteration I get another 1%, so returns diminish as I would expect.

A hardware instruction to just give me the number of set bits would far and away beat all of this, I would think, and be much cleaner.
frmky is online now   Reply With Quote
Old 2007-10-18, 10:03   #4
fivemack
(loop (#_fork))
 
fivemack's Avatar
 
Feb 2006
Cambridge, England

11000111011112 Posts
Default

There is a bit-count instruction in AMD's Barcelona processor. This isn't very helpful unless you have one of those.

The paper from 1990 has measurements on a series of systems whose performance characteristics are rather different from contemporary hardware.

I think a 256-byte table is probably still best for single bytes, but something like the HAKMEM recipe might well work better for 64-bit words.

[you do

X = (X & 0x55555555555) + ((X & 0xaaaaaaaaaaaaa)>>1)
X = (X & 0x33333333333) + ((X & 0xccccccccccccccc)>>2)
X = (X & 0x0f0f0f0f0f0f0f) + ((X & 0xf0f0f0f0f0f0f0f0)>>4)
X = (X & 0x00ff00ff00ff) + ((X & 0xff00ff00ff00)>>8)
X += X>>16; X += X>>32; X &= 0xff;

where each step adds up the values in two adjacent 2^n-bit pockets of the register and writes the sum into a 2^{n+1}-bit pocket overwriting the original data. You can replace some of the lines by a multiplication by 0x10101010101010101... which is nice and fast on EMT64 hardware.
fivemack is offline   Reply With Quote
Old 2007-10-18, 10:33   #5
Dresdenboy
 
Dresdenboy's Avatar
 
Apr 2003
Berlin, Germany

192 Posts
Default

I found a paper covering this topic, called "Beating the popcount":

http://www.icis.ntu.edu.sg/scs-ijit/91/91-1.pdf
Dresdenboy is offline   Reply With Quote
Old 2007-10-18, 13:35   #6
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

22·859 Posts
Default

No Barcelona for me, unfortunately , but if it can sieve faster than the 3GHz Xeon Woodcrest I have access to I'll be impressed.

Thank you all for the suggestions! I'll post results as I implement some of these.

By the way, I should clarify my original improvement gains. The 5% was for the *overall* application (a sieve of erat.). The bit counting part was about 33% faster with the 32x loop unrolling.

As of now, I can find and count all the primes up to 1e9 in 0.75 sec, on the above mentioned system.
bsquared is offline   Reply With Quote
Old 2007-10-18, 13:54   #7
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

22·859 Posts
Default

Cool!

If i use the consistently fastest method from Greg's link (a table lookup), I get a 75% speedup over my 32x unrolled version!

Here it is:

Code:
 
   //cast array of bytes into an unsigned int32 pointer
   flagblock32 = (uint32 *)soe.line[j];
   //for each word of flags in this sieve line:
   for (i=0;i<numlinebytes/4;i++)
   {
      //cast again into a byte pointer
      p = (unsigned char *)&flagblock32[i]; 
      //lookup and sum the counts in each byte of the 32bit word
      count += (inbits[p[0]] + inbits[p[1]] + inbits[p[2]] + inbits[p[3]]); 
   }
This is a 12% improvement over the total time for the 1e9 interval. My ability to accurately measure the time involved (using clock_t structures in C), is starting to suffer...

Last fiddled with by bsquared on 2007-10-18 at 14:00 Reason: added stuff at the end
bsquared is offline   Reply With Quote
Old 2007-10-18, 14:31   #8
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

22×859 Posts
Default

This method:

Code:
   //cast array of bytes into an unsigned int32 pointer
   flagblock32 = (uint32 *)soe.line[j];
   //for each word of flags in this sieve line:
   for (i=0;i<numlinebytes/4;i++)
   {
    //look at each word
    n = flagblock32[i];
    //and do some magic
    n = (n & 0x55555555) + ((n >> 1) & 0x55555555); 
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333); 
    n = (n + (n >> 4)) & 0x0f0f0f0f; 
    n += n >> 8; 
    n += n >> 16; 
    it += (n & 0xff); 
   }
was unresolvable from the table lookup method for the 1e9 interval, but proved to be slightly slower when I made the interval bigger in order to get something longer in duration to measure. Now if I looked at 64 bit words in each loop iteration... maybe I'll look into that.

- ben.

Last fiddled with by bsquared on 2007-10-18 at 14:32
bsquared is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Yafu instantly crashing after first instruction luisda2994 YAFU 1 2016-12-09 00:30
gnfs asm version sievers illegal instruction EdH Factoring 32 2016-10-12 20:49
Instruction for setup PRPNET server pepi37 Homework Help 2 2016-04-24 21:18
Larrabee instruction set announced fivemack Hardware 0 2009-03-25 12:09
Auto-instruction creating processor!! PrimeCruncher Hardware 6 2004-05-01 11:53

All times are UTC. The time now is 08:21.

Fri May 7 08:21:39 UTC 2021 up 29 days, 3:02, 0 users, load averages: 1.48, 1.64, 1.61

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.