4.10 Character Strings

After integer values, character strings are probably the most common data type that modern programs use. The 80x86 does support a handful of string instructions, but these instructions are really intended for block memory operations, not a specific implementation of a character string. Therefore, this section will concentrate mainly on the HLA definition of character strings and also discuss the string handling routines available in the HLA Standard Library.

In general, a character string is a sequence of ASCII characters that possesses two main attributes: a length and the character data. Different languages use different data structures to represent strings. To better understand the reasoning behind the design of HLA strings, it is probably instructive to look at two different string representations popularized by various high level languages.

Zero terminated strings are probably the most common string representation in use today because this is the native string format for C, C++, and other languages. A zero terminated string consists of a sequence of zero or more ASCII characters ending with a byte containing zero. For example, in C/C++, the string "abc" requires four bytes: the three characters ‘a’, ‘b’, and ‘c’ followed by a byte containing zero. As you'll soon see, HLA character strings are upward compatible with zero terminated strings, but in the meantime you should note that it is very easy to create zero terminated strings in HLA. The easiest place to do this is in the static section using code like the following:

 static
      zeroTerminatedString:  char; @nostorage;
           byte "This is the zero terminated string", 0;

Remember, when using the @nostorage option, HLA doesn't reserve any space for the variable, so the zeroTerminatedString variable's address in memory corresponds to the first character in the following byte directive. Whenever a character string appears in the byte directive as it does here, HLA emits each character in the string to successive memory locations. The zero value at the end of the string properly terminates this string.

Zero terminated strings have two principal attributes: they are very simple to implement, and the strings can be any length. On the other hand, zero terminated strings have a few drawbacks. First, though not usually important, zero terminated strings cannot contain the NUL character (whose ASCII code is zero). Generally, this isn't a problem, but it does create havoc once in a great while. The second problem with zero terminated strings is that many operations on them are somewhat inefficient. For example, to compute the length of a zero terminated string you must scan the entire string looking for that zero byte (counting characters up to the zero). The following program fragment demonstrates how to compute the length of the string above:

      mov( &zeroTerminatedString, ebx );
      mov( 0, eax );
      while( (type byte [ebx+eax]) <> 0 ) do

           inc( eax );

      endwhile;

      // String length is now in EAX.

As you can see from this code, the time it takes to compute the length of the string is proportional to the length of the string; as the string gets longer it will take longer to compute its length.
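For comparison, the same scan can be sketched in C, the language whose native format this is. This is only an illustration: the helper name zlength and the embeddedNul array are hypothetical (the C library's strlen does the same job as zlength). The sketch also shows the first drawback mentioned earlier, that a zero byte inside the data ends the string early.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the bytes that precede the terminating zero.
   The loop visits every character, so the cost grows with the string. */
size_t zlength(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* Same text as the HLA byte directive; C appends the zero byte itself. */
const char zeroTerminatedString[] = "This is the zero terminated string";

/* Five data bytes followed by a terminator. The embedded zero hides 'd'
   and 'e' from any routine that scans for the terminator. */
const char embeddedNul[] = { 'a', 'b', 0, 'd', 'e', 0 };
```

Here zlength(zeroTerminatedString) steps past 34 characters before it finds the terminator, while zlength(embeddedNul) stops after two characters even though the array holds five data bytes.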

A second string format, length prefixed strings, overcomes some of the problems with zero terminated strings. Length prefixed strings are common in languages like Pascal; they generally consist of a length byte followed by zero or more character values. The first byte specifies the length of the string; the remaining bytes (up to the specified length) are the character data itself. In a length prefixed scheme, the string "abc" would consist of four bytes: $03 (the string length) followed by ‘a’, ‘b’, and ‘c’. You can create length prefixed strings in HLA using code like the following:

 static
      lengthPrefixedString:  char; @nostorage;
           byte 3, "abc";

Counting the characters ahead of time and inserting the count into the byte statement, as was done here, may seem like a major pain. Fortunately, there are ways to have HLA automatically compute the string length for you.

Length prefixed strings solve the two major problems associated with zero terminated strings. It is possible to include the NUL character in length prefixed strings and those operations on zero terminated strings that are relatively inefficient (e.g., string length) are more efficient when using length prefixed strings. However, length prefixed strings have their own drawbacks. The principal drawback to length prefixed strings, as described, is that they are limited to a maximum of 255 characters in length (assuming a one-byte length prefix).
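The trade-offs just described can be sketched in C under a few assumptions: the PString type and the pstr_set and pstr_length functions are hypothetical names invented for this illustration, and the layout is the one-byte prefix discussed above. It is a model of the format, not code from any Pascal runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: byte 0 is the length, bytes 1..255 hold the data. */
typedef struct {
    unsigned char bytes[256];
} PString;

/* Store n bytes of data. Returns 0 on success, or -1 if n is more than a
   one-byte length prefix can describe. */
int pstr_set(PString *p, const void *data, size_t n)
{
    assert(p != NULL && data != NULL);
    if (n > 255) {
        return -1;
    }
    p->bytes[0] = (unsigned char)n;
    memcpy(p->bytes + 1, data, n);
    return 0;
}

/* The length is a single load, however long the string is. */
size_t pstr_length(const PString *p)
{
    assert(p != NULL);
    return p->bytes[0];
}
```

Storing the three bytes ‘a’, NUL, ‘b’ with pstr_set works and pstr_length still reports 3, because nothing scans for a terminator. A request to store 256 bytes is rejected, which is the 255-character limit in action.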

HLA uses an expanded scheme for strings that is upward compatible with both zero terminated and length prefixed strings. HLA strings enjoy the advantages of both zero terminated and length prefixed strings without the disadvantages. In fact, the only drawback to HLA strings compared with these other formats is that HLA strings consume a few additional bytes (the overhead for an HLA string is 9 to 12 bytes, compared to 1 byte for zero terminated or length prefixed strings; the overhead being the number of bytes needed above and beyond the actual characters in the string).
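One way to arrive at the 9 to 12 byte figure is sketched below. It assumes that the string's storage (two double words, the characters, and a zero byte) is rounded up to a multiple of four bytes. That padding rule is an assumption made for this illustration and is not stated in this section, and hla_overhead is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: storage = 8 bytes of length fields + the characters + 1 zero
   byte, rounded up to a multiple of four. The overhead is whatever is left
   after subtracting the characters themselves. */
size_t hla_overhead(size_t length)
{
    size_t raw;
    size_t padded;

    assert(length <= (size_t)-1 - 12);      /* guard against wraparound */
    raw = 4 + 4 + length + 1;
    padded = (raw + 3) & ~(size_t)3;        /* round up to a multiple of 4 */
    return padded - length;
}
```

Under that assumption an 11-character string costs 9 bytes of overhead (20 bytes of storage in all), while an empty string costs 12.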

An HLA string value consists of four components. The first element is a double word value that specifies the maximum number of characters that the string can hold. The second element is a double word value specifying the current length of the string. The third component is the sequence of characters in the string. The final component is a zero terminating byte. You could create an HLA-compatible string in the static section using code like the following:^[12]

 static
      align(4);
      dword 11;
      dword 11;
 TheString: char; @nostorage;
      byte "Hello there";
      byte 0;

Note that the address associated with the HLA string is the address of the first character, not the maximum or current length values.
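To make the layout concrete, here is a rough C model of it. The names hstr_init, hstr_length, hstr_maxlen, and HSTR_HEADER are hypothetical and are not HLA Standard Library routines. The sketch only mirrors the four components listed above, with the string's address pointing at the first character and the two length fields sitting just below it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { HSTR_HEADER = 8 };   /* two 32-bit fields precede the characters */

/* Build an HLA-style string inside buf and return the address of its first
   character. buf must hold at least HSTR_HEADER + maxLength + 1 bytes. */
char *hstr_init(unsigned char *buf, uint32_t maxLength, const char *text)
{
    uint32_t length = (uint32_t)strlen(text);
    char *chars = (char *)buf + HSTR_HEADER;

    assert(length <= maxLength);
    memcpy(buf, &maxLength, 4);                 /* offset -8 from chars */
    memcpy(buf + 4, &length, 4);                /* offset -4 from chars */
    memcpy(chars, text, (size_t)length + 1);    /* characters and zero byte */
    return chars;
}

/* Current length: the double word just below the first character. */
uint32_t hstr_length(const char *s)
{
    uint32_t v;
    memcpy(&v, s - 4, sizeof v);
    return v;
}

/* Maximum length: the double word below the current length. */
uint32_t hstr_maxlen(const char *s)
{
    uint32_t v;
    memcpy(&v, s - 8, sizeof v);
    return v;
}
```

Calling hstr_init(buf, 11, "Hello there") on a 20-byte buffer reproduces the static declaration above: the returned pointer plays the role of TheString, and both length fields read back as 11.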

"So what is the difference between the current and maximum string lengths?" you're probably wondering. Well, in a fixed string like the preceding one they are usually the same. However, when you allocate storage for a string variable at runtime, you will normally specify the maximum number of characters that can go into the string. When you store actual string data into the string, the number of characters you store must be less than or equal to this maximum value. The HLA Standard Library string routines will raise an exception if you attempt to exceed this maximum length (something the C/C++ and Pascal formats can't do).
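The difference between the two lengths, and the check against the maximum, can be illustrated with a small C sketch. Everything here is hypothetical: the BoundedString type, the CAPACITY value, and the bstr_init and bstr_assign functions are invented for this example. The -1 return value stands in for the exception that the HLA Standard Library would raise.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { CAPACITY = 16 };            /* hypothetical maximum length */

typedef struct {
    uint32_t maxLength;            /* most characters the buffer can hold */
    uint32_t length;               /* characters currently stored */
    char     chars[CAPACITY + 1];  /* character data plus the zero byte */
} BoundedString;

/* Start with an empty string: room for CAPACITY characters, none stored. */
void bstr_init(BoundedString *s)
{
    assert(s != NULL);
    s->maxLength = CAPACITY;
    s->length = 0;
    s->chars[0] = 0;
}

/* Returns 0 on success. Returns -1 and leaves *s untouched when the source
   does not fit; this stands in for the exception the library raises. */
int bstr_assign(BoundedString *s, const char *src)
{
    size_t n;

    assert(s != NULL && src != NULL);
    n = strlen(src);
    if (n > s->maxLength) {
        return -1;
    }
    memcpy(s->chars, src, n + 1);  /* copy the characters and the zero byte */
    s->length = (uint32_t)n;
    return 0;
}
```

After assigning "Hello there" the current length is 11 while the maximum stays at 16. A 25-character source is refused and the stored string is left unchanged.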

The zero terminating byte at the end of the HLA string lets you treat an HLA string as a zero terminated string if it is more efficient or more convenient to do so. For example, most calls to Windows and Linux require zero terminated strings for their string parameters. Placing a zero at the end of an HLA string ensures compatibility with Windows, Linux, and other library modules that use zero terminated strings.
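As a small illustration of that compatibility, the C sketch below hands the character data of an HLA-style record to strlen, a routine written for zero terminated strings. The HelloString type and lengths_agree function are hypothetical names for this example. Only the layout (two length fields, characters, zero byte) comes from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t maxLength;
    uint32_t length;
    char     chars[12];   /* 11 characters plus the zero byte */
} HelloString;

/* Any routine that expects a zero terminated string, here the C library's
   strlen, can be given the character pointer directly. Returns nonzero when
   the scanned length matches the stored length field. */
int lengths_agree(const HelloString *s)
{
    assert(s != NULL);
    return strlen(s->chars) == s->length;
}
```

For a record initialized as { 11, 11, "Hello there" }, the scanned length and the stored length field agree.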

^[12]Actually, there are some restrictions on the placement of HLA strings in memory. This text will not cover those issues. See the HLA documentation for more details.