Chapter 7: Composite Data Types and Memory Objects

Composite data types are types that are composed of other, more primitive, types. Examples of composite data types commonly found in applications include pointers, arrays, records or structures, and unions. Many high-level languages provide syntactical abstractions for these composite data types that make them easy to declare and use, all the time hiding their underlying complexities.

Though the cost of using these composite data types is not terrible, it is very easy for a programmer to introduce inefficiencies into an application by using these data types without understanding the underlying costs. Great programmers, therefore, are cognizant of the costs associated with using composite data types so they can use them in an appropriate manner. This chapter discusses the costs associated with these composite data types to better enable you to write great code.

7.1 Pointer Types

A pointer is a variable whose value refers to some other object. Now you've probably experienced pointers firsthand in Pascal, C/C++, or some other programming language, and you may be feeling a little anxious right now. Well, fear not! Pointers are actually easy to deal with.

Probably the best place to start is with the definition of a pointer. High-level languages like Pascal and C/C++ hide the simplicity of pointers behind a wall of abstraction. This added complexity tends to frighten programmers because they don't understand what's going on behind the scenes . However, a little knowledge can erase all your fears of pointers.

Let's just ignore pointers for a moment and work with something that's easier to understand: an array. Consider the following array declaration in Pascal:

 M: array [0..1023] of integer;

Even if you don't know Pascal, the concept here is easy to understand. M is anarray with 1,024 integers in it, indexed from M[0] to M[1023] . Each one of these array elements can hold an integer value that is independent of the others. In other words, this array gives you 1,024 different integer variables each of which you access via an array index rather than by name .

If you have a program with the statement M[0]:=100; it probably wouldn't take you any time to figure out what this statement is doing. It stores the value 100 into the first element of the array M . Now consider the following two statements:

 i := 0; (* assume i is an integer variable *)  M [i] := 100;

You should agree, without too much hesitation, that these two statements do the same thing as M[0]:=100; . Indeed, you're probably willing to agree that you can use any integer expression in the range 0..1,023 as an index of this array. The following statements still perform the same operation as our earlier statement:

 i := 5;     (* assume all variables are  integers*)  j := 10;  k := 50;  m [i * j - k] := 100;

Okay, how about the following:

 M [1] := 0;  M [ M [1] ] := 100;

Whoa! Now that takes a few moments to digest. However, if you take it slowly, it makes sense and you'll discover that these two instructions perform the same operation as before. The first statement stores zero into array element M[1] . The second statement fetches the value of M[1] , which is zero, and uses that value to determine where it stores the value 100.

If you're willing to accept this example as reasonable - perhaps bizarre, but usable nonetheless - then you'll have no problems with pointers, because M[1] is a pointer! Well, not really, but if you were to change 'M' to 'memory' and treat each element of this array as a separate memory location, then this is the definition of a pointer: a pointer is a memory variable whose value is the address of some other memory object.

7.1.1 Pointer Implementation

Although most languages implement pointers using memory addresses, a pointer is actually an abstraction of a memory address, and therefore a language could define a pointer using any mechanism that maps the value of the pointer to the address of some object in memory. Some implementations of Pascal, for example, use offsets from some fixed memory address as pointer values. Some languages (such as dynamic languages like LISP) might actually implement pointers by using double indirection. That is, the pointer object contains the address of some memory variable whose value is the address of the object to be accessed. This double indirection may seem somewhat convoluted, but it does offer certain advantages when using a complex memory management system. However, this chapter will assume that a pointer is a variable whose value is the address of some other object inmemory.

As you've seen in examples from previous chapters, you can indirectly access an object using a pointer with two 80x86 machine instructions (or with a similar sequence on other CPUs), as follows :

 mov(PointerVariable, ebx);   // Load the pointer variable into a register.  mov([ebx], eax);             // Use register indirect mode to access data.

Now consider the double-indirect pointer implementation described earlier. Access to data via double indirection is less efficient than the straight pointer implementation because it takes an extra machine instruction to fetch the data from memory. This isn't obvious in a high-level language like C/C++ or Pascal, where you'd use double indirection as follows (it looks very similar to single indirection):

 i = **cDblPtr;      // C/C++  i := ^^pDblPtr;     (* Pascal *)

In assembly language, however, you'll see the extra work involved:

 mov(hDblPtr, ebx);  // Get the pointer to a pointer.  mov([ebx], ebx);    // Get the pointer to the value.  mov([ebx], eax);    // Get the value.

Contrast this with the two assembly instructions (shown earlier) needed to access an object using single indirection. Because double indirection requires 50 percent more code than single indirection, you can see why many languages implement pointers using single indirection.

7.1.2 Pointers and Dynamic Memory Allocation

Pointers typically reference anonymous variables that you allocate on the heap (a region in memory reserved for dynamic storage allocation) using memory allocation/deallocation functions in different languages like malloc/free (C), new/dispose (Pascal), and new/delete (C++). Objects you allocate on the heap are known as anonymous variables because you refer to them by their address, and you do not associate a name with them. And because the allocation functions return the address of an object on the heap, you would typically store the function's return result into a pointer variable. True, the pointer variable may have a name, but that name applies to the pointer's data (an address), not the name of the object referenced by this address.

7.1.3 Pointer Operations and Pointer Arithmetic

Most languages that provide the pointer data type let you assign addresses to pointer variables, compare pointer values for equality or inequality, and indirectly reference an object via a pointer. Some languages allow additional operations; we're going to look at the possibilities in this section.

Many languages provide the ability to do limited arithmetic with pointers. At the very least, these languages will provide the ability to add an integer constant to a pointer, or subtract an integer constant from a pointer. To understand the purpose of these two arithmetic operations, note the syntax of the malloc function in the C standard library:

 ptrVar = malloc(  bytes_to_allocate  );

The parameter you pass malloc specifies the number of bytes of storage to allocate. A good C programmer will generally supply an expression like sizeof(int) as the parameter to malloc. The sizeof function returns the number of bytes needed by its single parameter. Therefore, sizeof(int) tells malloc to allocate at least enough storage for an int variable. Now consider the following call to malloc :

 ptrVar = malloc(sizeof(int) * 8);

If the size of an integer is 4 bytes, this call to malloc will allocate storage for 32 bytes. The malloc function allocates these 32 bytes at consecutive addresses in memory (see Figure 7-1).

Figure 7-1: Memory allocation with malloc(sizeof(int) * 8)

The pointer that malloc returns contains the address of the first integer in this set, so the C program will only be able to directly access the very first of these eight integers. To access the individual addresses of the other seven integers, you will need to add an integer offset to the base address. On machines that support byte-addressable memory (such as the 80x86), the address of each successive integer in memory is the address of the previous integer plus the size of an integer. For example, if a call to the C standard library malloc routine returns the memory address $0300_1000, then the eight 4-byte integers that malloc allocates will lie at the following memory addresses:

Integer	Memory Addresses
	$0300_1000..$0300_1003
1	$0300_1004..$0300_1007
2	$0300_1008..$0300_100b
3	$0300_100c..$0300_100f
4	$0300_1010..$0300_1013
5	$0300_1014..$0300_1017
6	$0300_1018..$0300_101b
7	$0300_101c..$0300_101f

7.1.3.1 Adding an Integer to a Pointer

Because these integers described in the preceding section are exactly four bytes apart, we need only add four to the address of the first integer to obtain the address of the second integer; likewise, the address of the third integer is the address of the second integer plus four, and so on. In assembly language, we could access these eight integers using code like the following:

 malloc(@size(int32) * 8);    // Returns storage for eight int32 objects.                                   // EAX points at this storage.  mov(0, ecx);  mov(ecx, [eax]);               // Zero out the 32 bytes (four bytes  mov(ecx, [eax+4]);             // at a time).  mov(ecx, [eax+8]);  mov(ecx, [eax+12]);  mov(ecx, [eax+16]);  mov(ecx, [eax+20]);  mov(ecx, [eax+24]);  mov(ecx, [eax+28]);

Notice the use of the 80x86 indexed addressing mode to access the eight integers that malloc allocates. The EAX register maintains the base address (first address) of the eight integers that this code allocates, and the constant appearing in the addressing mode of the mov instructions selects the offset of the specific integer from this base address.

Most CPUs use byte addresses for memory objects. Therefore, when allocating multiple copies of some n -byte object in memory, the objects will not begin at consecutive memory addresses; instead, they will appear in memory at addresses that are n bytes apart. Some machines, however, do not allow a program to access memory at an arbitrary address in memory; they require that applications access data on address boundaries that are a multiple of a word, a double word, or even a quad word. Any attempt to access memory on some other boundary will raise an exception and (possibly) halt the application. If a high-level language supports pointer arithmetic, it must take this fact into consideration and provide a generic pointer arithmetic scheme that is portable across many different CPU architectures. The most common solution that high-level languages use when adding an integer offset to a pointer is to multiply that offset by the size of the object that the pointer references. That is, if you've got a pointer p to a 16-byte object in memory, then p + 1 points 16 bytes beyond where p points. Likewise, p + 2 points 32 bytes beyond the address contained in the pointer p . As long as the size of the data object is a multiple of the required alignment size (which the compiler can enforce by adding padding bytes, if necessary), this scheme avoids problems on those architectures that require aligned data access.

An important thing to realize is that the addition operator only makes sense between a pointer and an integer value. For example, in the C/C++ language you can indirectly access objects in memory using an expression like *(p + i) (where p is a pointer to an object and i is an integer value). It doesn't make any sense to add two pointers together. Similarly, it isn't reasonable to add other data types with a pointer. For example, adding a floating-point value to a pointer makes no sense. What would it mean to reference the data at some base address plus 1.5612? Operations on pointers involving strings, characters , and other data types don't make much sense, either. Integers (signed and unsigned) are the only reasonable values to add to a pointer.

On the other hand, not only can you add an integer to a pointer, you can also add a pointer to an integer and the result is still pointer (both p + i and i + p are legal). This is because addition is commutative.

7.1.3.2 Subtracting an Integer from a Pointer

Another reasonable pointer arithmetic operation is subtraction. Subtracting an integer from a pointer references a memory location immediately before the address held in the pointer. However, subtraction is not commutative and subtracting a pointer from an integer is not a legal operation ( p ˆ’ i is legal, but i ˆ’ p is not).

In C/C++ *(p ˆ’ i) accesses the i ^th object immediately before the object at which p points. In 80x86 assembly language, as in assembly on many processors, you can also specify a negative constant offset when using an indexed addressing mode, e.g.,

 mov([ebx   4], eax);

Keep in mind, that 80x86 assembly language uses byte offsets, not object offsets (as C/C++ does). Therefore, this statement loads into EAX the double word in memory immediately preceding the memory address in EBX.

7.1.3.3 Subtracting a Pointer from a Pointer

Unlike addition, it actually makes sense to subtract the value of one pointer variable from another. Consider the following C/C++ code that marches through a string of characters looking for the first e character that follows the first a that it finds:

 int distance;  char *aPtr;  char *ePtr;      . . .  aPtr = someString;  // Get ptr to start of string in aPtr.  // While we're not at the end of the string and the current  // char isn't 'a':  while(*aPtr != '  int distance; char *aPtr; char *ePtr; . . . aPtr = someString; // Get ptr to start of string in aPtr. // While we're not at the end of the string and the current // char isn't 'a': while(*aPtr != '\0' && *aPtr != 'a') { aPtr = aPtr + 1; // Move on to the next character pointed // at by aPtr. } // While we're not at the end of the string and the current // character isn't 'e': ePtr = aPtr; // Start at the 'a' char (or end of string // if no 'a'). while(*ePtr != '\0' && *ePtr != 'a') { ePtr = ePtr + 1; // Move on to the next character pointed at by aPtr. } // Now compute the number of characters between the 'a' and the 'e' // (counting the 'a' but not counting the 'e'): distance = (ePtr - aPtr); 
 ' && *aPtr != 'a')  {      aPtr = aPtr + 1;   // Move on to the next character pointed                         // at by aPtr.  }  // While we're not at the end of the string and the current  // character isn't 'e':  ePtr = aPtr;         // Start at the 'a' char (or end of string                       // if no 'a').  while(*ePtr != '  int distance; char *aPtr; char *ePtr; . . . aPtr = someString; // Get ptr to start of string in aPtr. // While we're not at the end of the string and the current // char isn't 'a': while(*aPtr != '\0' && *aPtr != 'a') { aPtr = aPtr + 1; // Move on to the next character pointed // at by aPtr. } // While we're not at the end of the string and the current // character isn't 'e': ePtr = aPtr; // Start at the 'a' char (or end of string // if no 'a'). while(*ePtr != '\0' && *ePtr != 'a') { ePtr = ePtr + 1; // Move on to the next character pointed at by aPtr. } // Now compute the number of characters between the 'a' and the 'e' // (counting the 'a' but not counting the 'e'): distance = (ePtr - aPtr); 
 ' && *ePtr != 'a')  {      ePtr = ePtr + 1;   // Move on to the next character pointed at by aPtr.  }  // Now compute the number of characters between the 'a' and the 'e'  // (counting the 'a' but not counting the 'e'):  distance = (ePtr - aPtr);

The subtraction of these two pointers produces the number of data objects that exist between the two pointers (in this case, ePtr and aPtr point at characters, so the subtraction result produces the number of characters, or bytes, between the two pointers).

The subtraction of two pointer values only makes sense if the two pointers reference the same data structure in memory (for example, pointing at characters within the same string, as in this C/C++ example). Although C/C++ (and certainly assembly language) will allow you to subtract two pointers that point at completely different objects in memory, their difference will probably have very little meaning.

When using pointer subtraction in C/C++, the base types of the two pointers must be identical (that is, the two pointers must contain the addresses of two objects whose types are identical). This restriction exists because pointer subtraction in C/C++ produces the number of objects between the two pointers, not the number of bytes. It wouldn't make any sense to compute the number of objects between a byte in memory and a double word in memory; would you be counting the number of bytes or the number of double words in this case? Of course, in assembly language you can get away with this (and the result is always the number of bytes between the two pointers), but it still doesn't make much sense semantically.

Note that the subtraction of two pointers could return a negative number if the left pointer operand is at a lower memory address than the right pointer operand. Depending on your language and its implementation, you may need to take the absolute value of the result if you're only interested in the distance between the two pointers and you don't care which pointer contains the greater address.

7.1.3.4 Comparing Pointers

Comparisons are another set of operations that make sense for pointers. Almost every language that supports pointers will let you compare two pointers to see if they are equal or not equal. Such a comparison will tell you whether the pointers reference the same object in memory. Some languages (such as assembly and C/C++) will also let you compare two pointers to see if one pointer is less than or greater than another. This only makes sense, however, if the pointers have the same base type and both pointers contain the address of some object within the same data structure (such as an array, string, or record). If the comparison of two pointers suggests that the value of one pointer is less than the other, then this tells you that the first pointer references an object within the data structure appearing before the object referenced by the second pointer. The converse is equally true for the greater than comparison.