11.12 Extracting Bit Strings

Of course, we can easily accomplish the converse of merging two bit streams; i.e., we can extract and distribute bits in a bit string among multiple destinations. The following code takes the 32-bit value in EAX and distributes alternate bits among the BX and DX registers:

                mov( 16, cl );     // Count off the number of loop iterations. ExtractLp:     shr( 1, eax );     // Extract even bits to (E)BX.                rcr( 1, ebx );                shr( 1, eax );     // Extract odd bits to (E)DX.                rcr( 1, edx );                dec( cl );         // Repeat 16 times.                jnz ExtractLp;                shr( 16, ebx );    // Need to move the results from the H.O.                shr( 16, edx );    // bytes of EBX/EDX to the L.O. bytes.

This sequence executes 99 instructions. This isn't terrible, but we can probably do a little better by using a better algorithm that extracts bits in parallel. Employing the technique we used to reverse bits in a register, we can come up with the following algorithm that relocates all the even bits to the L.O. word of EAX and all the odd bits to the H.O. word of EAX.

 // Swap bits at positions (1,2), (5,6), (9,10), (13,14), (17,18), // (21,22), (25,26), and (29, 30).           mov( eax, edx );           and( $9999_9999, eax );       // Mask out the bits we'll keep for now.           mov( edx, ecx );           shr( 1, edx );                // Move 1st bits in tuple above to the           and( $2222_2222, ecx );       // correct position and mask out the           and( $2222_2222, edx );       // unneeded bits.           shl( 1, ecx );                // Move 2nd bits in tuples above.           or( edx, ecx );               // Merge all the bits back together.           or( ecx, eax ); // Swap bit pairs at positions ((2,3), (4,5)), ((10,11), (12,13)), etc.           mov( eax, edx );           and( $c3c3_c3c3, eax );       // The bits we'll leave alone.           mov( edx, ecx );           shr( 2, edx );           and( $0c0c_0c0c, ecx );           and( $0c0c_0c0c, edx );           shl( 2, ecx );           or( edx, ecx );           or( ecx, eax ); // Swap nibbles at nibble positions (1,2), (5,6), (9,10), etc.           mov( eax, edx );           and( $f00f_f00f, eax );           mov( edx, ecx );           shr(4, edx );           and( $0f0f_0f0f, ecx );           and( $0f0f_0f0f, ecx );           shl( 4, ecx );           or( edx, ecx );           or( ecx, eax ); // Swap bits at positions 1 and 2.           ror( 8, eax );           xchg( al, ah );           rol( 8, eax );

This sequence requires 30 instructions. At first blush it looks like a winner because the original loop executes 64 instructions. However, this code isn't quite as good as it looks. After all, if we're willing to write this much code, why not unroll the loop above 16 times? That sequence only requires 64 instructions. So the complexity of the previous algorithm may not gain much on instruction count. As to which sequence is faster, well, you'll have to time them to figure this out. However, the shrd instructions are not particularly fast and neither are the instructions in the other sequence. This example does not appear here to show you a better algorithm, but rather to demonstrate that writing really tricky code doesn't always provide a big performance boost.

Extracting other bit combinations is left as an exercise for the reader.