mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2007-10-18, 04:44   #1
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

22×859 Posts
Default instruction for counting bits?

Is there an instruction in the x86 instruction set which will count the number of set bits in a byte, or word? I've looked through the AMD and Intel Architecture manuals a bit, and didn't see anything in the general purpose instructions. But I could have missed it, or maybe there is something in SSE(2,3,4) or MMX or something.

In liu of that, the best I could come up with considering pipelining and all is the following:

Code:
 
//val is an array of bytes.  
count += (val[i] & 0x80) >> 7;
count += (val[i] & 0x40) >> 6;
count += (val[i] & 0x20) >> 5;
count += (val[i] & 0x10) >> 4;
count += (val[i] & 0x8) >> 3;
count += (val[i] & 0x4) >> 2;
count += (val[i] & 0x2) >> 1;
count += (val[i] & 0x1);
bsquared is offline   Reply With Quote
Old 2007-10-18, 05:06   #2
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

22×859 Posts
Default

If I cast the array of bytes into an unsigned short pointer, and unroll some more, I get a 4% improvement...

If I do the same for an unsigned int and unroll to 32 lines per loop iteration I get another 1%, so returns diminish as I would expect.

A hardware instruction to just give me the number of set bits would far and away beat all of this, I would think, and be much cleaner.
bsquared is offline   Reply With Quote
Old 2007-10-18, 05:52   #3
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

23×32×29 Posts
Default

This post suggests that the fastest way is to use a lookup table for each byte.

Greg

Quote:
Originally Posted by bsquared View Post
If I cast the array of bytes into an unsigned short pointer, and unroll some more, I get a 4% improvement...

If I do the same for an unsigned int and unroll to 32 lines per loop iteration I get another 1%, so returns diminish as I would expect.

A hardware instruction to just give me the number of set bits would far and away beat all of this, I would think, and be much cleaner.
frmky is online now   Reply With Quote
Old 2007-10-18, 10:03   #4
fivemack
(loop (#_fork))
 
fivemack's Avatar
 
Feb 2006
Cambridge, England

638310 Posts
Default

There is a bit-count instruction in AMD's Barcelona processor. This isn't very helpful unless you have one of those.

The paper from 1990 has measurements on a series of systems whose performance characteristics are rather different from contemporary hardware.

I think a 256-byte table is probably still best for single bytes, but something like the HAKMEM recipe might well work better for 64-bit words.

[you do

X = (X & 0x55555555555) + ((X & 0xaaaaaaaaaaaaa)>>1)
X = (X & 0x33333333333) + ((X & 0xccccccccccccccc)>>2)
X = (X & 0x0f0f0f0f0f0f0f) + ((X & 0xf0f0f0f0f0f0f0f0)>>4)
X = (X & 0x00ff00ff00ff) + ((X & 0xff00ff00ff00)>>8)
X += X>>16; X += X>>32; X &= 0xff;

where each step adds up the values in two adjacent 2^n-bit pockets of the register and writes the sum into a 2^{n+1}-bit pocket overwriting the original data. You can replace some of the lines by a multiplication by 0x10101010101010101... which is nice and fast on EMT64 hardware.
fivemack is offline   Reply With Quote
Old 2007-10-18, 10:33   #5
Dresdenboy
 
Dresdenboy's Avatar
 
Apr 2003
Berlin, Germany

192 Posts
Default

I found a paper covering this topic, called "Beating the popcount":

http://www.icis.ntu.edu.sg/scs-ijit/91/91-1.pdf
Dresdenboy is offline   Reply With Quote
Old 2007-10-18, 13:35   #6
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

22×859 Posts
Default

No Barcelona for me, unfortunately , but if it can sieve faster than the 3GHz Xeon Woodcrest I have access to I'll be impressed.

Thank you all for the suggestions! I'll post results as I implement some of these.

By the way, I should clarify my original improvement gains. The 5% was for the *overall* application (a sieve of erat.). The bit counting part was about 33% faster with the 32x loop unrolling.

As of now, I can find and count all the primes up to 1e9 in 0.75 sec, on the above mentioned system.
bsquared is offline   Reply With Quote
Old 2007-10-18, 13:54   #7
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

65548 Posts
Default

Cool!

If i use the consistently fastest method from Greg's link (a table lookup), I get a 75% speedup over my 32x unrolled version!

Here it is:

Code:
 
   //cast array of bytes into an unsigned int32 pointer
   flagblock32 = (uint32 *)soe.line[j];
   //for each word of flags in this sieve line:
   for (i=0;i<numlinebytes/4;i++)
   {
      //cast again into a byte pointer
      p = (unsigned char *)&flagblock32[i]; 
      //lookup and sum the counts in each byte of the 32bit word
      count += (inbits[p[0]] + inbits[p[1]] + inbits[p[2]] + inbits[p[3]]); 
   }
This is a 12% improvement over the total time for the 1e9 interval. My ability to accurately measure the time involved (using clock_t structures in C), is starting to suffer...

Last fiddled with by bsquared on 2007-10-18 at 14:00 Reason: added stuff at the end
bsquared is offline   Reply With Quote
Old 2007-10-18, 14:31   #8
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

22×859 Posts
Default

This method:

Code:
   //cast array of bytes into an unsigned int32 pointer
   flagblock32 = (uint32 *)soe.line[j];
   //for each word of flags in this sieve line:
   for (i=0;i<numlinebytes/4;i++)
   {
    //look at each word
    n = flagblock32[i];
    //and do some magic
    n = (n & 0x55555555) + ((n >> 1) & 0x55555555); 
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333); 
    n = (n + (n >> 4)) & 0x0f0f0f0f; 
    n += n >> 8; 
    n += n >> 16; 
    it += (n & 0xff); 
   }
was unresolvable from the table lookup method for the 1e9 interval, but proved to be slightly slower when I made the interval bigger in order to get something longer in duration to measure. Now if I looked at 64 bit words in each loop iteration... maybe I'll look into that.

- ben.

Last fiddled with by bsquared on 2007-10-18 at 14:32
bsquared is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Yafu instantly crashing after first instruction luisda2994 YAFU 1 2016-12-09 00:30
gnfs asm version sievers illegal instruction EdH Factoring 32 2016-10-12 20:49
Instruction for setup PRPNET server pepi37 Homework Help 2 2016-04-24 21:18
Larrabee instruction set announced fivemack Hardware 0 2009-03-25 12:09
Auto-instruction creating processor!! PrimeCruncher Hardware 6 2004-05-01 11:53

All times are UTC. The time now is 07:52.

Fri May 7 07:52:58 UTC 2021 up 29 days, 2:33, 0 users, load averages: 1.54, 1.42, 1.61

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.