3.7 Packing and Unpacking Data

The advantage of packed data types is efficient memory use. Consider the Social Security identification number in use in the United States. This is a nine-digit code that normally takes the following form (each 'X' represents a single decimal digit):

 XXX-XX-XXXX

If we encode a Social Security number using three separate (32-bit) integers, it will take 12 bytes to represent this value. That's actually more than the 11 bytes needed to represent the number using an array of characters . A better solution is to encode each field using short (16-bit) integers. Now it takes only 6 bytes to represent the Social Security number. Because the middle field in the Social Security number is always between 0 and 99, we can actually shave one more byte off the size of this structure by encoding the middle field with a single byte. Here's a sample Delphi/Kylix record structure that defines this data structure:

 SSN :record          FirstField:  smallint;  // smallints are 16 bits in Delphi/Kylix          SecondField: byte;          ThirdField:  smallint;  end;

If we drop the hyphens in the Social Security number, you'll notice that the result is a nine-digit number. Because we can exactly represent all values between 0 and 999,999,999 (nine digits) using 30 bits, it should be clear that we could actually encode any legal Social Security number using a 32-bit integer. The problem is that some software that manipulates Social Security numbers may need to operate on the individual fields. This means that you have to use expensive division, modulo, and multiplication operators in order to extract fields from a Social Security number you've encoded in a 32-bit integer format. Furthermore, it's a bit more painful to convert Social Security numbers to and from strings when using the 32-bit format. The advantage of using bit fields to hold a value is that it's relatively easy to insert and extract individual bit fields using fast machine instructions, and it's also less work to create a standard string representation (including the hyphens) of one of these fields. Figure 3-8 provides a straightforward implementation of the Social Security number packed data type using a separate string of bits for each field (note that this format uses 31 bits and ignores the HO bit).

Figure 3-8: Social Security number packed fields encoding

As you'll soon see, fields that begin at bit position zero in a packed data object are the ones you can most efficiently access. So it's generally a good idea to arrange the fields in your packed data type so that the field you access most often begins at bit zero. Of course, you'll have to determine which field you access most often on an application-by-application basis. If you have no idea which field you'll access most often, you should try to assign the fields so they begin on a byte boundary. If there are unused bits in your packed type, you should attempt to spread them throughout the structure so that individual fields begin on a byte boundary and have those fields consume multiples of eight bits.

We've only got one unused bit in the Social Security example shown in Figure 3-8, but it turns out that we can use this extra bit to align two fields on a byte boundary and ensure that one of those fields occupies a bit string whose length is a multiple of eight bits. Consider Figure 3-9, which shows a rearranged version of our Social Security number data type.

Figure 3-9: A (possibly) improved encoding of the Social Security number

One problem with the data format shown in Figure 3-9 is that we can't sort Social Security numbers in an intuitive fashion by simply comparing 32-bit unsigned integers. ^[5] Therefore, if you intend to do a lot of sorting based on the entire Social Security number, the format in Figure 3-8 is probably a better format.

If this type of sorting isn't important to you, the format in Figure 3-9 has some advantages. This packed type actually uses eight bits (rather than seven) to represent SecondField (along with moving SecondField down to bit position zero); the extra bit will always contain zero. This means that SecondField consumes bits 0..7 (a whole byte) and ThirdField begins on a byte boundary (bit position eight). ThirdField doesn't consume a multiple of eight bits, and FirstField doesn't begin on a nice byte boundary, but we've done fairly well with this encoding, considering we only had one extrabit to play around with.

The next question is, 'How do we access the fields of this packed type?' There are two separate activities here. We need the ability to retrieve, or extract , the packed fields, and we need to be able to insert data into these fields. The AND, OR, and SHIFT operations provide the tools for this.

When actually operating on these fields, it's convenient to work with three separate variables rather than working directly with the packed data. For our Social Security number example, we can create the three variables FirstField, SecondField , and ThirdField . We can then extract the actual data from the packed value into these three variables, operate on these variables, and then insert the data from these three variables back into their fields when we're done.

Extracting the SecondField data from the packed format shown in Figure 3-9 is easy (remember, the field aligned to bit zero in our packed data is the easiest one to access). All you have to do is copy the data from the packed representation to the SecondField variable and then mask out all but the SecondField bits using the AND operation. Because SecondField is a 7-bit value, we can create the mask as an integer containing all one bits in positions zero through six and zeros everywhere else. The following C/C++ code demonstrates how to extract this field into the SecondField variable ( assuming packedValue is a variable holding the 32-bit packed Social Security number):

 SecondField = packedValue & 0x7f;   // 0x7f = %0111_1111

Extracting fields that are not aligned at bit zero takes a little more work. Consider the ThirdField entry in Figure 3-9. We can mask out all the bits associated with the first and second fields by logically ANDing the packed value with %_11_1111_1111_1111_0000_0000 ($3F_FF00). However, this leaves the ThirdField value sitting in bits 8 through 21, which is not convenient for various arithmetic operations. The solution is to shift the masked value down eight bits so that it's aligned at bit zero in our working variable. The following Pascal/Delphi/Kylix code shows how one might do this:

 SecondField := (packedValue and fff00) shr 8;

As it turns out, you can also do the shift first and then do the logical AND operation (though this requires a different mask, $11_1111_1111_1111 or $3FFF). Here's the C/C++ code that extracts SecondField using that technique:

 SecondField = (packedValue >> 8) & 0x3FFF;

Extracting a field that is aligned against the HO bit, as the first field is in our Social Security packed data type is almost as easy as accessing the data aligned at bit zero. All you have to do is shift the HO field down so that it's aligned at bit zero. The logical shift right operation automatically fills in the HO bits of the result with zeros, so no explicit masking is necessary. The following Pascal/Delphi code demonstrates this:

 FirstField := packedValue shr 18; // Delphi's shift right is a logical                                    // shift right.

In HLA/x86 assembly language, it's actually quite easy to access the second and third fields of the packed data format in Figure 3-9. This is because we can easily access data at any arbitrary byte boundary in memory. That allows us to treat both the second and third fields as though they both are aligned at bit zero in the data structure. In addition, because the SecondField value is an 8-bit value (with the HO bit always containing zero), it only takes a single machine instruction to unpack the data, as shown here:

 movzx( (type byte packedValue), eax );

This instruction fetches the first byte of packedValue (which is the LO 8 bits of packedValue on the 80x86), and it zero extends this value to 32 bits in EAX ( movzx stands for 'move with zero extension'). The EAX register, therefore, contains the SecondField value after this instruction executes.

Extracting the ThirdField value from our packed format isn't quite as easy, because this field isn't an even multiple of eight bits long. Therefore, we'll still need a masking operation to clear the unused bits from the 32-bit result we produce. However, because ThirdField is aligned on a byte (8-bit) boundary in our packed structure, we'll be able to avoid the shift operation that was necessary in the high-level code. Here's the HLA/x86 assembly code that extracts the third field from our packedValue object:

 mov( (type word packedValue[1]), ax );   // Extracts bytes 1 & 2                                           // from packedValue.  and( FFF, eax );                       // Clears all the undesired bits.

Extracting FirstField from the packedValue object in HLA/x86 assembly code is identical to the high-level code; we'll simply shift the upper ten bits (which comprise FirstField ) down to bit zero:

 mov( packedValue, eax );  shr( 21, eax );

Inserting a field into a packed structure is only a little more complicated than extracting a field. Assuming the data you want to insert appears in some variable and contains zeros in the unused bits, inserting a field into apacked object requires three operations. First, if necessary, you shift the field's data to the left so its alignment matches the corresponding field in the packed object. The second step is to clear the corresponding bits in the packed structure. The final step is to logically OR the shifted field into the packed object. Figure 3-10 on the next page displays the details of this operation.

Figure 3-10: Inserting ThirdField into the Social Security packed type

Here's the C/C++ code that accomplishes the operation shown in Figure 3-10:

 packedValue = (packedValue & 0xFFc000FF)  (ThirdField << 8 );

You'll note that $FFC000FF is the hexadecimal value that corresponds to all zeros in bit positions 8 through 21 and ones everywhere else.

^[5] 'Intuitive' meaning that the first field is the most significant portion of the value, the second field is the next most significant, and the third field is the least significant component of the number.