4.16 Character Set Implementation in HLA

There are many different ways to represent character sets in an assembly language program. HLA implements character sets by using an array of 128 boolean values. Each boolean value determines whether the corresponding character is a member of the character set: a true value indicates that the character is a member of the set, and a false value indicates that it is not. To conserve memory, HLA allocates only a single bit for each character in the set; therefore, HLA character sets consume 16 bytes of memory because there are 128 bits in 16 bytes. This array of 128 bits is organized in memory as shown in Figure 4-3.

Figure 4-3: Bit Layout of a Character Set Object.

Bit zero of byte zero corresponds to ASCII code zero (the NUL character). If this bit is one, the character set contains the NUL character; if it is zero, the character set does not contain the NUL character. Likewise, bit zero of byte one (bit eight of the 128-bit array) corresponds to the backspace character (ASCII code eight). Bit one of byte eight corresponds to ASCII code 65, an uppercase 'A'. Bit 65 contains a one if 'A' is currently a member of the character set and a zero if 'A' is not a member of the set.
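The mapping from an ASCII code to a byte and a bit within that byte can be sketched as follows. This is a C model of the layout described above, not HLA's own code; the names cset_model, cset_byte_index, cset_bit_index, and cset_contains are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit character set: 16 bytes, one bit per ASCII code. */
typedef struct { uint8_t bytes[16]; } cset_model;

/* Index of the byte that holds the bit for ASCII code ch (0..127). */
static unsigned cset_byte_index(unsigned char ch) { return ch / 8u; }

/* Position of that bit within its byte (0 is the least significant bit). */
static unsigned cset_bit_index(unsigned char ch) { return ch % 8u; }

/* Returns true if ch is a member; codes above 127 are never members in this model. */
static bool cset_contains(const cset_model *s, unsigned char ch)
{
    if (ch > 127u) {
        return false;
    }
    return ((s->bytes[cset_byte_index(ch)] >> cset_bit_index(ch)) & 1u) != 0;
}
```

Under this model, 'A' (code 65) maps to byte 8, bit 1, and backspace (code 8) maps to byte 1, bit 0, matching the examples in the text.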

While there are other possible ways to implement character sets, this bit vector implementation has the advantage that it is very easy to implement set operations like union, intersection, difference, comparison, and membership tests.
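To illustrate why the bit vector makes these operations easy, here is a minimal C sketch in which each set operation reduces to a bitwise operation applied to the 16 bytes. It models the idea only and is not how HLA's library is written; the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit character set: 16 bytes, one bit per ASCII code. */
typedef struct { uint8_t bytes[16]; } cset_model;

/* Union: a character is in the result if it is in a OR in b. */
static cset_model cset_union(cset_model a, cset_model b)
{
    for (int i = 0; i < 16; ++i) a.bytes[i] |= b.bytes[i];
    return a;
}

/* Intersection: a character is in the result if it is in a AND in b. */
static cset_model cset_intersection(cset_model a, cset_model b)
{
    for (int i = 0; i < 16; ++i) a.bytes[i] &= b.bytes[i];
    return a;
}

/* Difference: a character is in the result if it is in a AND NOT in b. */
static cset_model cset_difference(cset_model a, cset_model b)
{
    for (int i = 0; i < 16; ++i) a.bytes[i] &= (uint8_t)~b.bytes[i];
    return a;
}

/* Equality comparison: all 128 bits must match. */
static bool cset_equal(const cset_model *a, const cset_model *b)
{
    return memcmp(a->bytes, b->bytes, sizeof a->bytes) == 0;
}

/* Subset comparison: a has no member that b lacks. */
static bool cset_subset(const cset_model *a, const cset_model *b)
{
    for (int i = 0; i < 16; ++i) {
        if ((a->bytes[i] & (uint8_t)~b->bytes[i]) != 0) return false;
    }
    return true;
}
```

Each loop touches 16 bytes regardless of how many characters the sets contain, so the cost of every operation is fixed.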

HLA supports character set variables using the cset data type. To declare a character set variable, you would use a declaration like the following:

      static
           CharSetVar: cset;

This declaration will reserve 16 bytes of storage to hold the 128 bits needed to represent a set of ASCII characters.

Although it is possible to manipulate the bits in a character set using instructions like and, or, xor, and so on, the 80x86 instruction set includes several bit test, set, reset, and complement instructions that are nearly perfect for manipulating character sets. The bt (bit test) instruction, for example, will copy a single bit in memory to the carry flag. The bt instruction allows the following syntactical forms:

      bt( BitNumber, BitsToTest );

      bt( reg₁₆, reg₁₆ );
      bt( reg₃₂, reg₃₂ );
      bt( constant, reg₁₆ );
      bt( constant, reg₃₂ );
      bt( reg₁₆, mem₁₆ );
      bt( reg₃₂, mem₃₂ );         // HLA treats cset objects as dwords within bt.
      bt( constant, mem₁₆ );
      bt( constant, mem₃₂ );      // HLA treats cset objects as dwords within bt.

The first operand holds a bit number; the second operand specifies a register or memory location whose bit should be copied into the carry flag. If the second operand is a register, the first operand must contain a value in the range 0..n-1, where n is the number of bits in the second operand. If the first operand is a constant and the second operand is a memory location, the constant must be in the range 0..255. Here are some examples of these instructions:

      bt( 7, ax );           // Copies bit #7 of AX into the carry flag (CF).

      mov( 20, eax );
      bt( eax, ebx );        // Copies bit #20 of EBX into CF.

      // Copies bit #0 of the byte at CharSetVar+3 into CF.

      bt( 24, CharSetVar );

      // Copies bit #4 of the byte at DWmem+2 into CF.

      bt( eax, DWmem );

The bt instruction turns out to be quite useful for testing set membership. For example, to see if the character ‘A’ is a member of a character set, you could use a code sequence like the following:

      bt( 'A', CharSetVar );
      if( @c ) then

           << Do something if 'A' is a member of the set >>

      endif;

The bts (bit test and set), btr (bit test and reset), and btc (bit test and complement) instructions are also quite useful for manipulating character set variables. Like the bt instruction, these instructions copy the specified bit into the carry flag; after copying the specified bit, these instructions will set (bts), clear (btr), or invert (btc) the specified bit. Therefore, you can use the bts instruction to add a character to a character set via set union (that is, it adds a character to the set if the character was not already a member of the set; otherwise the set is unaffected). You can use the btr instruction to remove a character from a character set via set intersection (that is, it removes a character from the set if and only if it was previously in the set; otherwise it has no effect on the set). The btc instruction lets you add a character to the set if it wasn't previously in the set; it removes the character from the set if it was previously a member (that is, it toggles the membership of that character in the set).
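The behavior of these four instructions on a character set can be modeled in C as shown below. This is only a sketch of the semantics described above: each function returns the bit's previous value (standing in for the carry flag) and then leaves, sets, clears, or inverts the bit. The names model_bt, model_bts, model_btr, and model_btc are hypothetical, and the model assumes the bit number is in the range 0..127; it performs no range check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit character set: 16 bytes, one bit per ASCII code. */
typedef struct { uint8_t bytes[16]; } cset_model;

/* Models bt: returns the selected bit, as the carry flag would hold it. */
static bool model_bt(const cset_model *s, unsigned bit)
{
    return ((s->bytes[bit / 8u] >> (bit % 8u)) & 1u) != 0;
}

/* Models bts: returns the old bit, then sets it (adds the character). */
static bool model_bts(cset_model *s, unsigned bit)
{
    bool carry = model_bt(s, bit);
    s->bytes[bit / 8u] |= (uint8_t)(1u << (bit % 8u));
    return carry;
}

/* Models btr: returns the old bit, then clears it (removes the character). */
static bool model_btr(cset_model *s, unsigned bit)
{
    bool carry = model_bt(s, bit);
    s->bytes[bit / 8u] &= (uint8_t)~(1u << (bit % 8u));
    return carry;
}

/* Models btc: returns the old bit, then inverts it (toggles membership). */
static bool model_btc(cset_model *s, unsigned bit)
{
    bool carry = model_bt(s, bit);
    s->bytes[bit / 8u] ^= (uint8_t)(1u << (bit % 8u));
    return carry;
}
```

Because each function reports the bit's earlier state, a caller can tell whether an add or remove actually changed the set, which corresponds to inspecting the carry flag after bts or btr.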