3.6 Bit Fields and Packed Data

CPUs generally operate most efficiently on byte, word, and double-word data types; ^[3] but occasionally you'll need to work with a data type whose size is something other than 8, 16, or 32 bits. In such cases, you may be able to save some memory by packing different strings of bits together as compactly as possible, without wasting any bits to align a particular data field on a byte or other boundary.

Consider a date of the form '04/02/01.' It takes three numeric values to represent this date: month, day, and year values. Months, of course, use the values 1..12. It will require at least four bits (a maximum of 16 different values) to represent the month. Days use the range 1..31. Therefore, it will take five bits (a maximum of 32 different values) to represent the day entry. The year value, assuming that we're working with values in the range 0..99, requires seven bits (representing up to 128 different values). Four plus five plus seven is 16 bits, or two bytes. In other words, we can pack our date data into two bytes rather than the three that would be required if we used a separate byte for each of the month, day, and year values. This saves one byte of memory for each date stored, which could be a substantial saving if you need to store many dates. You might arrange the bits as shown in Figure 3-6.

Figure 3-6: Short packed date format (16 bits)

MMMM represents the four bits making up the month value, DDDDD represents the five bits making up the day, and YYYYYYY is the seven bits that hold the year. Each collection of bits representing a data item is a bit field . We could represent April 2, 2001, with $4101:

 0100 00010 0000001 = %0100_0001_0000_0001 or 01   04   02    01

Although packed values are space efficient (that is, they use little memory), they are computationally inefficient (slow!). The reason? It takes extra instructions to unpack the data from the various bit fields. These extra instructions take time to execute (and additional bytes to hold the instructions); hence, you must carefully consider whether packed data fields will save you anything. The following sample HLA/x86 code demonstrates the effort that must go into packing and unpacking this 16-bit date format.

 program dateDemo;  #include( "stdlib.hhf" )  static      day:        uns8;      month:      uns8;      year:       uns8;      packedDate: word;  begin dateDemo;      stdout.put( "Enter the current month, day, and year: " );      stdin.get( month, day, year );      // Pack the data into the following bits:      //      //  15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0      //   m  m  m  m  d  d  d  d  d  y  y  y  y  y  y  y      mov( 0, ax );      mov( ax, packedDate );  //Just in case there is an error.      if( month > 12 ) then          stdout.put( "Month value is too large", nl );      elseif( month = 0 ) then          stdout.put( "Month value must be in the range 1..12", nl );      elseif( day > 31 ) then          stdout.put( "Day value is too large", nl );      elseif( day = 0 ) then          stdout.put( "Day value must be in the range 1..31", nl );      elseif( year > 99 ) then          stdout.put( "Year value must be in the range 0..99", nl );      else          mov( month, al );          shl( 5, ax );          or( day, al );          shl( 7, ax );          or( year, al );          mov( ax, packedDate );      endif;      // Okay, display the packed value:      stdout.put( "Packed data = $", packedDate, nl );      // Unpack the date:      mov( packedDate, ax );      and( f, al );         // Retrieve the year value.      mov( al, year );      mov( packedDate, ax );  // Retrieve the day value.      shr( 7, ax );      and( %1_1111, al );      mov( al, day );      mov( packedDate, ax );  // Retrieve the month value.      rol( 4, ax );      and( %1111, al );      mov( al, month );      stdout.put( "The date is ", month, "/", day, "/", year, nl );  end dateDemo;

Keeping in mind the Y2K ^[4] problem, adopting a date format that only supports a two-digit year is rather foolish. So consider a better date format, shown in Figure 3-7.

Figure 3-7: Long packed date format (32 bits)

Because there are more bits in a 32-bit variable than are needed to hold the date, even accounting for years in the range 0-65,535, this format allots a full byte for the month and day fields. Because these two fields are bytes, anapplication can easily manipulate them as byte objects, reducing the overhead to pack and unpack these fields on those processors that support byte access. This leaves fewer bits for the year, but 65,536 years is probably sufficient (you can probably assume that your software will not be in use 63,000 years from now).

Of course, you could argue that this is no longer a packed date format. After all, we needed three numeric values, two of which fit just nicely into one byte each and one that should probably have at least two bytes. Because this 'packed' date format consumes the same four bytes as the unpacked version, what is so special about this format? Well, in this example packed effectively means packaged or encapsulated . This particular packed format does not use as few bits as possible; by packing the data into a double-word variable the program can treat the date value as a single data value rather than as three separate variables . This generally means that it requires only a single machine instruction to operate on this data rather than three separate instructions.

Another difference you will note between this long packed date format and the short date format appearing in Figure 3-6 is the fact that this long date format rearranges the Year, Month , and Day fields. This is important because it allows you to easily compare two dates using an unsigned integer comparison. Consider the following HLA/assembly code:

 mov( Date1, eax );        // Assume Date1 and Date2 are double-word variables  if( eax > Date2 ) then    // using the Long Packed Date format.      << do something if Date1 > Date2 >>  endif;

Had you kept the different date fields in separate variables, or organized the fields differently, you would not have been able to compare Date1 and Date2 in such a straightforward fashion. This example demonstrates another reason for packing data, even if you don't realize any space savings - it can make certain computations more convenient or even more efficient (contrary to what normally happens when you pack data).

Some high-level languages provide built-in support for packed data. For example, in C you can define structures like the following:

 struct  {      unsigned bits0_3   :4;      unsigned bits4_11  :8;      unsigned bits12_15 :4;      unsigned bits16_23 :8;      unsigned bits24_31 :8;  } packedData;

This structure specifies that each field is an unsigned object that holds four, eight, four, eight, and eight bits, respectively. The ':n' item appearing after each declaration specifies the minimum number of bits the compiler will allocate for the given field.

Unfortunately, it is not possible to provide a diagram that shows how a C/C++ compiler will allocate the values from a 32-bit double word among the fields. No (single) diagram is possible because C/C++ compiler implementers are free to implement these bit fields any way they see fit. The arrangement of the bits within the bit string is arbitrary (for example, the compiler could allocate the bits0_3 field in bits 28..31 of the ultimate object). The compiler can also inject extra bits between fields as it sees fit. The compiler can use a larger number of bits for each field if it so desires (this is actually the same thing as injecting extra padding bits between fields). Most C compilers attempt to minimize the injection of extraneous padding, but different C compilers ( especially on different CPUs) do have their differences. Therefore, any use of C/C++ struct bit field declarations is almost guaranteed to be nonportable, and you can't really count on what the compiler is going to do with those fields.

The advantage of using the compiler's built-in data-packing capabilities is that the compiler automatically handles packing and unpacking the data for you. For example, you could write the following C/C++ code, and the compiler would automatically emit the necessary machine instructions to store and retrieve the individual bit fields for you:

 struct  {      unsigned year   :7;      unsigned month  :4;      unsigned day    :5;  } ShortDate;          . . .      ShortDate.day = 28;      ShortDate.month = 2;      ShortDate.year = 3;    // 2003

^[3] Some RISC CPUs only operate efficiently on double-word values, so the concept of bit fields and packed data may apply to any object less than 32 bits in size on such CPUs.

^[4] Year 2000, a software engineering disaster that occurred because programmers in the 1900s encoded dates using only two digits and then discovered they couldn't differentiate 1900 and 2000 when the year 2000 came along.