 mersenneforum.org Random bit in a word
 Register FAQ Search Today's Posts Mark Forums Read  2011-02-26, 00:02 #1 Prime95 P90 years forever!   Aug 2002 Yeehaw, FL 767710 Posts Random bit in a word I need some code that efficiently (on Intel this means relatively short chunk of code with no branches that are hard to predict) returns the index of a randomly selected set bit in a 64-bit value. Psuedo code: int64 t; cnt = POPCNT (t) i = random value between 0 and cnt while (i--) use BSF to find a set bit, then clear that set bit return BSF(t) POPCNT is an Intel assembly code instruction to count the number of set bits in a register BSF is bit-scan-forward, it returns the index of the first set bit in a register Clever bit-twiddling suggestions are welcome! Last fiddled with by Prime95 on 2011-02-26 at 00:07   2011-02-26, 01:48   #2
Batalov

"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

2·11·19·23 Posts Quote:
 Originally Posted by Prime95 BSF is bit-scan-forward, it returns the index of the first set bit in a register Clever bit-twiddling suggestions are welcome!
Shooting in the dark: init a 64k lookup table (for 16-bit words) once, and use it four times per BSF() call?   2011-02-26, 04:48   #3
axn

Jun 2003

5,179 Posts Quote:
 Originally Posted by Prime95 use BSF to find a set bit, then clear that set bit
If BSF scans from LSB to MSB, then "x & (x-1)" is a standard way to turn of the least set bit. Don't need bsf at all.   2011-02-26, 05:12 #4 Prime95 P90 years forever!   Aug 2002 Yeehaw, FL 32×853 Posts I was thinking a binary search for the random nth set bit as in: int result = 0; int64 t; cnt = POPCNT (t) i = random value between 0 and cnt cnt32 = POPCNT (t & 0xFFFFFFFF) t = t >> ((i > cnt32) ? 32 : 0) result = result + ((i > cnt32) ? 32 : 0) i = i - ((i > cnt32) ? cnt32 : 0) repeat for 16, 8, 4, 2 return result the compiler ought to be smart enough to generate conditional move instructions and thus the above is jump-less   2011-02-26, 07:21 #5 only_human   "Gang aft agley" Sep 2002 2·1,877 Posts Select the bit position (from the most-significant bit) with the given count (rank) might be close to what you want. It does its' own bit counting using a common fast method but retains the intermediate calculations and then uses those values to quickly locate the index of the rth set bit. In the example it finds the position of the rth set bit from the most significant end but it says that it can be adapted to find the bit from the least significant end. Code: The following 64-bit code selects the position of the rth 1 bit when counting from the left. In other words if we start at the most significant bit and proceed to the right, counting the number of bits set to 1 until we reach the desired rank, r, then the position where we stop is returned. If the rank requested exceeds the count of bits set, then 64 is returned. The code may be modified for 32-bit or counting from the right. uint64_t v; // Input value to find position with rank r. unsigned int r; // Input: bit's desired rank [1-64]. unsigned int s; // Output: Resulting position of bit with rank r [1-64] uint64_t a, b, c, d; // Intermediate temporaries for bit count. unsigned int t; // Bit count temporary. // Do a normal parallel bit count for a 64-bit integer, // but store all intermediate steps. // a = (v & 0x5555...) + ((v >> 1) & 0x5555...); a = v - ((v >> 1) & ~0UL/3); // b = (a & 0x3333...) + ((a >> 2) & 0x3333...); b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5); // c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...); c = (b + (b >> 4)) & ~0UL/0x11; // d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...); d = (c + (c >> 8)) & ~0UL/0x101; t = (d >> 32) + (d >> 48); // Now do branchless select! s = 64; // if (r > t) {s -= 32; r -= t;} s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8)); t = (d >> (s - 16)) & 0xff; // if (r > t) {s -= 16; r -= t;} s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8)); t = (c >> (s - 8)) & 0xf; // if (r > t) {s -= 8; r -= t;} s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8)); t = (b >> (s - 4)) & 0x7; // if (r > t) {s -= 4; r -= t;} s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8)); t = (a >> (s - 2)) & 0x3; // if (r > t) {s -= 2; r -= t;} s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8)); t = (v >> (s - 1)) & 0x1; // if (r > t) s--; s -= ((t - r) & 256) >> 8; s = 65 - s; If branching is fast on your target CPU, consider uncommenting the if-statements and commenting the lines that follow them. Juha Järvi sent this to me on November 21, 2009. The link I provided doesn't jump to the anchored text properly in my browser. Here is another link anchored below the quoted section (scroll upward to quoted text) http://graphics.stanford.edu/~seande...ml#ParityNaive   2011-02-26, 16:57   #6
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

11101111111012 Posts Quote:
 Originally Posted by only_human Select the bit position (from the most-significant bit) with the given count (rank) might be close to what you want.
Perfect! Thanks.   2011-02-28, 19:14 #7 ewmayer ∂2ω=0   Sep 2002 República de California 2·13·449 Posts This is more in generic-C terms, but perhaps worth investigating: 1. Create a pair of 256-entry lookup tables: Table 1 is simply a bytewise lookup-table version of POPCNT: Take a byte and return #set bits in that byte; Table 2 is a bit more complex - it is a lookup table which encodes the following: For any byte-valued integer B in [0,255], given a target set-bit index I in [0,POPCNT(B)], the table-lookup returns the bit-index of the set bit with respect to B. This is a bit trickier to set up, since different values of B have numbers of set bits ranging from 0 to 8, so one needs to consider whether the requested bit-index is legal for the byte-valued int in question. But assuming the caller will only ever request a bit-index value which is appropriate for the specific byte-value in question, you could use a set of 256 24-bit ints to encode this. Each 24-bit table entry is divided into eight 3-bit subfields containing up to eight bit indices for the byte-value corresponding to the table entry. The low 3 bits are the bit index of the first set bit in B, etc. For greater efficiency it would probably be better to use 32-bit ints 4-bit index-offset subfields, with one bit in each subfield going unused. Then, declaring this table like so: uint32 table2; and after properly initializing it, the bit-index of a given set bit I (zero-offset, i.e.I=0 means the first set bit) in byte-value B is simply equal to (table2[B] >> (4*I)) & 0xf 2. Loop over the 8 bytes in your 64-bit input int X, and do 8 Table-1 lookups to generate a sequence of partial sums based on the total #bits set in the accumulated bytes. This could be the basis for a fast binary search, but since are just 8 such partial the overhead for managing such a data structure might exceed any gains due to the better asymptotic opcount; 3. Given an input bit-index in [0, POPCNT(X)], use the partial sums you generated in step  to find which byte contains the bit in question, and then use Table 2 to give you the bit index within the target byte.   2011-02-28, 21:58 #8 only_human   "Gang aft agley" Sep 2002 72528 Posts I'm a bit rusty in programming Ernst, so I apologize if I miss the points or end up blathering a bit... It is easy to get population counts for small fields of the entire 64 bit value For single bit fields, POPCNT is the bit itself. For two bit POPCNTs, you add even bit positions to odd bit positions. Looking above, at the code snippet in my earlier message, it is just a case of shift, mask and add. The comments for the POPCNT variables are more explicit than the slightly optimized actual code. v = 1 bit POPCNTs (64 of them) a = 2 bit POPCNTs (32 of them) b = 4 bit POPCNTs (16 of them) c = 8 bit POPCNTs (8 of them) d = 16 bit POPCNTs (4 of them) Perhaps 'c' would be more efficient than 8 table lookups for step two that you describe. The rest of the code snippet is a bit roughish; to avoid conditional tests and jumps, for comparisons it does subtracts and depends on certain bits of the result. To adjust 's' it depends on bit 8 of an integer being 0 or 1 after a subtraction . To adjust 'r' it depends on several bits of the second byte of a subtraction being all 0 or all 1. In any case the fact that it works makes it so ugly that it is nice. The first 't' value looks wrong to me but I accept that I must be misunderstanding things a bit. After each test and adjustment, 't' is changed to depend on a different one of those POPCNT variables I mentioned above shifted by the partial result 's' and masked to the relevant bits for the next comparison. My tools didn't compile the masking constants correctly and I was forced to replace ~0UL with ~0ui64 Ross Last fiddled with by only_human on 2011-02-28 at 22:43 Reason: Expanded thoughts. Added 'v' to POPCNT variable list   2011-03-01, 01:05   #9
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

11101111111012 Posts Quote:
 Originally Posted by only_human It is easy to get population counts for small fields of the entire 64 bit value
I agree, the POPCNTs are fairly easy to calculate. However, Ernst's idea of a table lookup for the last 3 levels of the binary search is a good idea.

The code from Juha takes 11 operations per binary search level. We can replace 33 operations with a few operations and a lookup into a 1KB table.

BTW, I think I can reduce the 11 operations per binary search level down to a more reasonable 6 or 7.   2011-03-01, 01:28   #10
only_human

"Gang aft agley"
Sep 2002

72528 Posts Quote:
 Originally Posted by Prime95 I agree, the POPCNTs are fairly easy to calculate. However, Ernst's idea of a table lookup for the last 3 levels of the binary search is a good idea. The code from Juha takes 11 operations per binary search level. We can replace 33 operations with a few operations and a lookup into a 1KB table. BTW, I think I can reduce the 11 operations per binary search level down to a more reasonable 6 or 7.
Agree that a table could be good in here somewhere and definitely don't intend to disparage Ernst's ideas.

I thought that more than one comparison could be done at a time by setting guard bits that would keep borrows from propagating too far but that is likely primitive thinking because this is probably perfect fodder for modern SSE type instructions. I would be curious to see your ideas on simplifying Juha's binary search if you do develop them.

An ugly part of the code in my opinion is that, as written, in C is that the masks used after the subtractions wouldn't work correctly on machines that have a signed binary default integer representation.   2011-03-01, 01:57 #11 Prime95 P90 years forever!   Aug 2002 Yeehaw, FL 32·853 Posts Here is my modified version without Ernst's suggestion to eliminate the last 3 binary search levels. This is quite a bit faster. Also, it shouldn't be hard to remove the multiplication operations. The original code was 25.5% of program execution time. The new code is 20.5% of total execution time. Code:  // Adapted from code by Juha J‰rvi found on Stanford's bit hacks web page uint64_t v = needs_work_mask; // Input value to find position with rank r. unsigned int r; // Input: bit's desired rank [1-64]. unsigned int s; // Output: Resulting position of bit with rank r [1-64] uint64_t a, b, c, d; // Intermediate temporaries for bit count. unsigned int t; // Bit count temporary. // Do a normal parallel bit count for a 64-bit integer, // but store all intermediate steps. // a = (v & 0x5555...) + ((v >> 1) & 0x5555...); a = v - ((v >> 1) & ~0UL/3); // b = (a & 0x3333...) + ((a >> 2) & 0x3333...); b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5); // c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...); c = (b + (b >> 4)) & ~0UL/0x11; // d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...); d = (c + (c >> 8)) & ~0UL/0x101; #ifdef OLD t = (d >> 32) + (d >> 48); // Now do branchless select! s = 64; // if (r > t) {s -= 32; r -= t;} s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8)); t = (d >> (s - 16)) & 0xff; // if (r > t) {s -= 16; r -= t;} s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8)); t = (c >> (s - 8)) & 0xf; // if (r > t) {s -= 8; r -= t;} s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8)); t = (b >> (s - 4)) & 0x7; // if (r > t) {s -= 4; r -= t;} s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8)); t = (a >> (s - 2)) & 0x3; // if (r > t) {s -= 2; r -= t;} s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8)); t = (v >> (s - 1)) & 0x1; // if (r > t) s--; s -= ((t - r) & 256) >> 8; //s = 65 - s; //convert s to LSB 0-63 notation s--; #else int pick; // s = 0; // if (r > popcnt32[s]) {s += 32; r -= popcnt32[s];} t = (d + (d >> 16)) & 0xFF; pick = (r > t); s = pick << 5; r -= pick * t; // if (r > popcnt16[s]) {s += 16; r -= popcnt16[s];} t = (d >> s) & 0xff; pick = (r > t); s += pick << 4; r -= pick * t; // if (r > popcnt8[s]) {s += 8; r -= popcnt8[s];} t = (c >> s) & 0xf; pick = (r > t); s += pick << 3; r -= pick * t; // if (r > popcnt4[s]) {s += 4; r -= popcnt4[s];} t = (b >> s) & 0x7; pick = (r > t); s += pick << 2; r -= pick * t; // if (r > popcnt2[s]) {s += 2; r -= popcnt2[s];} t = (a >> s) & 0x3; pick = (r > t); s += pick << 1; r -= pick * t; // if (r > popcnt1[s]) s++; t = (v >> s) & 0x1; pick = (r > t); s += pick; #endif   Thread Tools Show Printable Version Email this Page Similar Threads Thread Thread Starter Forum Replies Last Post jasong Lounge 46 2017-05-09 12:32 Greenk12 Factoring 1 2008-11-15 13:56 jasong jasong 3 2007-10-05 18:02 mfgoode Lounge 11 2006-04-01 16:16 mfgoode Puzzles 11 2004-09-14 19:08

All times are UTC. The time now is 04:04.

Tue Nov 30 04:04:07 UTC 2021 up 129 days, 22:33, 0 users, load averages: 0.90, 0.99, 1.10