mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2011-02-26, 00:02   #1
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

767710 Posts
Default Random bit in a word

I need some code that efficiently (on Intel this means relatively short chunk of code with no branches that are hard to predict) returns the index of a randomly selected set bit in a 64-bit value.

Psuedo code:

int64 t;

cnt = POPCNT (t)
i = random value between 0 and cnt
while (i--) use BSF to find a set bit, then clear that set bit
return BSF(t)

POPCNT is an Intel assembly code instruction to count the number of set bits in a register
BSF is bit-scan-forward, it returns the index of the first set bit in a register


Clever bit-twiddling suggestions are welcome!

Last fiddled with by Prime95 on 2011-02-26 at 00:07
Prime95 is offline   Reply With Quote
Old 2011-02-26, 01:48   #2
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

2·11·19·23 Posts
Default

Quote:
Originally Posted by Prime95 View Post
BSF is bit-scan-forward, it returns the index of the first set bit in a register

Clever bit-twiddling suggestions are welcome!
Shooting in the dark: init a 64k lookup table (for 16-bit words) once, and use it four times per BSF() call?
Batalov is offline   Reply With Quote
Old 2011-02-26, 04:48   #3
axn
 
axn's Avatar
 
Jun 2003

5,179 Posts
Default

Quote:
Originally Posted by Prime95 View Post
use BSF to find a set bit, then clear that set bit
If BSF scans from LSB to MSB, then "x & (x-1)" is a standard way to turn of the least set bit. Don't need bsf at all.
axn is online now   Reply With Quote
Old 2011-02-26, 05:12   #4
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

32×853 Posts
Default

I was thinking a binary search for the random nth set bit as in:

int result = 0;
int64 t;

cnt = POPCNT (t)
i = random value between 0 and cnt

cnt32 = POPCNT (t & 0xFFFFFFFF)
t = t >> ((i > cnt32) ? 32 : 0)
result = result + ((i > cnt32) ? 32 : 0)
i = i - ((i > cnt32) ? cnt32 : 0)
repeat for 16, 8, 4, 2
return result

the compiler ought to be smart enough to generate conditional move instructions and thus the above is jump-less
Prime95 is offline   Reply With Quote
Old 2011-02-26, 07:21   #5
only_human
 
only_human's Avatar
 
"Gang aft agley"
Sep 2002

2·1,877 Posts
Default

Select the bit position (from the most-significant bit) with the given count (rank) might be close to what you want. It does its' own bit counting using a common fast method but retains the intermediate calculations and then uses those values to quickly locate the index of the rth set bit. In the example it finds the position of the rth set bit from the most significant end but it says that it can be adapted to find the bit from the least significant end.

Code:
The following 64-bit code selects the position of the rth 1 bit when counting from the left. In other words if we start at the most significant bit and proceed to the right, counting the number of bits set to 1 until we reach the desired rank, r, then the position where we stop is returned. If the rank requested exceeds the count of bits set, then 64 is returned. The code may be modified for 32-bit or counting from the right.
  uint64_t v;          // Input value to find position with rank r.
  unsigned int r;      // Input: bit's desired rank [1-64].
  unsigned int s;      // Output: Resulting position of bit with rank r [1-64]
  uint64_t a, b, c, d; // Intermediate temporaries for bit count.
  unsigned int t;      // Bit count temporary.

  // Do a normal parallel bit count for a 64-bit integer,                     
  // but store all intermediate steps.                                        
  // a = (v & 0x5555...) + ((v >> 1) & 0x5555...);
  a =  v - ((v >> 1) & ~0UL/3);
  // b = (a & 0x3333...) + ((a >> 2) & 0x3333...);
  b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5);
  // c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...);
  c = (b + (b >> 4)) & ~0UL/0x11;
  // d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...);
  d = (c + (c >> 8)) & ~0UL/0x101;
  t = (d >> 32) + (d >> 48);
  // Now do branchless select!                                                
  s  = 64;
  // if (r > t) {s -= 32; r -= t;}
  s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
  t  = (d >> (s - 16)) & 0xff;
  // if (r > t) {s -= 16; r -= t;}
  s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
  t  = (c >> (s - 8)) & 0xf;
  // if (r > t) {s -= 8; r -= t;}
  s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
  t  = (b >> (s - 4)) & 0x7;
  // if (r > t) {s -= 4; r -= t;}
  s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
  t  = (a >> (s - 2)) & 0x3;
  // if (r > t) {s -= 2; r -= t;}
  s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
  t  = (v >> (s - 1)) & 0x1;
  // if (r > t) s--;
  s -= ((t - r) & 256) >> 8;
  s = 65 - s;
If branching is fast on your target CPU, consider uncommenting the if-statements and commenting the lines that follow them.
Juha Järvi sent this to me on November 21, 2009.
The link I provided doesn't jump to the anchored text properly in my browser. Here is another link anchored below the quoted section (scroll upward to quoted text) http://graphics.stanford.edu/~seande...ml#ParityNaive
only_human is offline   Reply With Quote
Old 2011-02-26, 16:57   #6
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

11101111111012 Posts
Default

Quote:
Originally Posted by only_human View Post
Perfect! Thanks.
Prime95 is offline   Reply With Quote
Old 2011-02-28, 19:14   #7
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

2·13·449 Posts
Default

This is more in generic-C terms, but perhaps worth investigating:

1. Create a pair of 256-entry lookup tables: Table 1 is simply a bytewise lookup-table version of POPCNT: Take a byte and return #set bits in that byte;

Table 2 is a bit more complex - it is a lookup table which encodes the following: For any byte-valued integer B in [0,255], given a target set-bit index I in [0,POPCNT(B)], the table-lookup returns the bit-index of the set bit with respect to B. This is a bit trickier to set up, since different values of B have numbers of set bits ranging from 0 to 8, so one needs to consider whether the requested bit-index is legal for the byte-valued int in question. But assuming the caller will only ever request a bit-index value which is appropriate for the specific byte-value in question, you could use a set of 256 24-bit ints to encode this. Each 24-bit table entry is divided into eight 3-bit subfields containing up to eight bit indices for the byte-value corresponding to the table entry. The low 3 bits are the bit index of the first set bit in B, etc.

For greater efficiency it would probably be better to use 32-bit ints 4-bit index-offset subfields, with one bit in each subfield going unused. Then, declaring this table like so:

uint32 table2[256];

and after properly initializing it, the bit-index of a given set bit I (zero-offset, i.e.I=0 means the first set bit) in byte-value B is simply equal to

(table2[B] >> (4*I)) & 0xf

2. Loop over the 8 bytes in your 64-bit input int X, and do 8 Table-1 lookups to generate a sequence of partial sums based on the total #bits set in the accumulated bytes. This could be the basis for a fast binary search, but since are just 8 such partial the overhead for managing such a data structure might exceed any gains due to the better asymptotic opcount;

3. Given an input bit-index in [0, POPCNT(X)], use the partial sums you generated in step [2] to find which byte contains the bit in question, and then use Table 2 to give you the bit index within the target byte.
ewmayer is offline   Reply With Quote
Old 2011-02-28, 21:58   #8
only_human
 
only_human's Avatar
 
"Gang aft agley"
Sep 2002

72528 Posts
Default

I'm a bit rusty in programming Ernst, so I apologize if I miss the points or end up blathering a bit...

It is easy to get population counts for small fields of the entire 64 bit value

For single bit fields, POPCNT is the bit itself.

For two bit POPCNTs, you add even bit positions to odd bit positions.

Looking above, at the code snippet in my earlier message, it is just a case of shift, mask and add. The comments for the POPCNT variables are more explicit than the slightly optimized actual code.


v = 1 bit POPCNTs (64 of them)
a = 2 bit POPCNTs (32 of them)
b = 4 bit POPCNTs (16 of them)
c = 8 bit POPCNTs (8 of them)
d = 16 bit POPCNTs (4 of them)

Perhaps 'c' would be more efficient than 8 table lookups for step two that you describe.

The rest of the code snippet is a bit roughish; to avoid conditional tests and jumps, for comparisons it does subtracts and depends on certain bits of the result. To adjust 's' it depends on bit 8 of an integer being 0 or 1 after a subtraction . To adjust 'r' it depends on several bits of the second byte of a subtraction being all 0 or all 1.

In any case the fact that it works makes it so ugly that it is nice. The first 't' value looks wrong to me but I accept that I must be misunderstanding things a bit. After each test and adjustment, 't' is changed to depend on a different one of those POPCNT variables I mentioned above shifted by the partial result 's' and masked to the relevant bits for the next comparison.

My tools didn't compile the masking constants correctly and I was forced to replace ~0UL with ~0ui64
Ross

Last fiddled with by only_human on 2011-02-28 at 22:43 Reason: Expanded thoughts. Added 'v' to POPCNT variable list
only_human is offline   Reply With Quote
Old 2011-03-01, 01:05   #9
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

11101111111012 Posts
Default

Quote:
Originally Posted by only_human View Post
It is easy to get population counts for small fields of the entire 64 bit value
I agree, the POPCNTs are fairly easy to calculate. However, Ernst's idea of a table lookup for the last 3 levels of the binary search is a good idea.

The code from Juha takes 11 operations per binary search level. We can replace 33 operations with a few operations and a lookup into a 1KB table.

BTW, I think I can reduce the 11 operations per binary search level down to a more reasonable 6 or 7.
Prime95 is offline   Reply With Quote
Old 2011-03-01, 01:28   #10
only_human
 
only_human's Avatar
 
"Gang aft agley"
Sep 2002

72528 Posts
Default

Quote:
Originally Posted by Prime95 View Post
I agree, the POPCNTs are fairly easy to calculate. However, Ernst's idea of a table lookup for the last 3 levels of the binary search is a good idea.

The code from Juha takes 11 operations per binary search level. We can replace 33 operations with a few operations and a lookup into a 1KB table.

BTW, I think I can reduce the 11 operations per binary search level down to a more reasonable 6 or 7.
Agree that a table could be good in here somewhere and definitely don't intend to disparage Ernst's ideas.

I thought that more than one comparison could be done at a time by setting guard bits that would keep borrows from propagating too far but that is likely primitive thinking because this is probably perfect fodder for modern SSE type instructions. I would be curious to see your ideas on simplifying Juha's binary search if you do develop them.

An ugly part of the code in my opinion is that, as written, in C is that the masks used after the subtractions wouldn't work correctly on machines that have a signed binary default integer representation.
only_human is offline   Reply With Quote
Old 2011-03-01, 01:57   #11
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

32·853 Posts
Default

Here is my modified version without Ernst's suggestion to eliminate the last 3 binary search levels. This is quite a bit faster. Also, it shouldn't be hard to remove the multiplication operations.

The original code was 25.5% of program execution time. The new code is 20.5% of total execution time.

Code:
		// Adapted from code by Juha J‰rvi found on Stanford's bit hacks web page
		uint64_t v = needs_work_mask;          // Input value to find position with rank r.
		unsigned int r;      // Input: bit's desired rank [1-64].
		unsigned int s;      // Output: Resulting position of bit with rank r [1-64]
		uint64_t a, b, c, d; // Intermediate temporaries for bit count.
		unsigned int t;      // Bit count temporary.

		// Do a normal parallel bit count for a 64-bit integer,
		// but store all intermediate steps.
		// a = (v & 0x5555...) + ((v >> 1) & 0x5555...);
		a =  v - ((v >> 1) & ~0UL/3);
		// b = (a & 0x3333...) + ((a >> 2) & 0x3333...);
		b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5);	
		// c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...);
		c = (b + (b >> 4)) & ~0UL/0x11;
		// d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...);
		d = (c + (c >> 8)) & ~0UL/0x101;
#ifdef OLD
		t = (d >> 32) + (d >> 48);
		// Now do branchless select!
		s  = 64;
		// if (r > t) {s -= 32; r -= t;}
		s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
		t  = (d >> (s - 16)) & 0xff;
		// if (r > t) {s -= 16; r -= t;}
		s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
		t  = (c >> (s - 8)) & 0xf;
		// if (r > t) {s -= 8; r -= t;}
		s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
		t  = (b >> (s - 4)) & 0x7;
		// if (r > t) {s -= 4; r -= t;}
		s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
		t  = (a >> (s - 2)) & 0x3;
		// if (r > t) {s -= 2; r -= t;}
		s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
		t  = (v >> (s - 1)) & 0x1;
		// if (r > t) s--;
		s -= ((t - r) & 256) >> 8;
		//s = 65 - s;

		//convert s to LSB 0-63 notation
		s--;
#else
		int	pick;
		// s = 0;
		// if (r > popcnt32[s]) {s += 32; r -= popcnt32[s];}
		t = (d + (d >> 16)) & 0xFF;
		pick = (r > t);
		s = pick << 5; r -= pick * t;
		// if (r > popcnt16[s]) {s += 16; r -= popcnt16[s];}
		t  = (d >> s) & 0xff;
		pick = (r > t);
		s += pick << 4; r -= pick * t;
		// if (r > popcnt8[s]) {s += 8; r -= popcnt8[s];}
		t  = (c >> s) & 0xf;
		pick = (r > t);
		s += pick << 3; r -= pick * t;
		// if (r > popcnt4[s]) {s += 4; r -= popcnt4[s];}
		t  = (b >> s) & 0x7;
		pick = (r > t);
		s += pick << 2; r -= pick * t;
		// if (r > popcnt2[s]) {s += 2; r -= popcnt2[s];}
		t  = (a >> s) & 0x3;
		pick = (r > t);
		s += pick << 1; r -= pick * t;
		// if (r > popcnt1[s]) s++;
		t  = (v >> s) & 0x1;
		pick = (r > t);
		s += pick;
#endif
Prime95 is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
random comments, random questions and thread titles made for Google jasong Lounge 46 2017-05-09 12:32
About random number (random seed) in Msieve Greenk12 Factoring 1 2008-11-15 13:56
My last name is a word. I had no idea. jasong jasong 3 2007-10-05 18:02
Word Problem mfgoode Lounge 11 2006-04-01 16:16
The only word mfgoode Puzzles 11 2004-09-14 19:08

All times are UTC. The time now is 04:04.


Tue Nov 30 04:04:07 UTC 2021 up 129 days, 22:33, 0 users, load averages: 0.90, 0.99, 1.10

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.