Like strings, character sets are a composite data type built upon the character data type. A character set is a mathematical set of characters. Membership in a set is a yes-or-no (binary) property: a character is either in the set or it is not, and you cannot have multiple copies of the same character in a character set. Furthermore, the concept of sequence (whether one character comes before another, as in a string) is foreign to a character set. If two characters are members of a set, their order in the set is irrelevant.
Table 5-3 lists some of the more common character set functions to give you an idea of the types of operations applications typically perform on character sets.
Function/Operator | Description |
---|---|
Membership (IN) | Checks to see if a character is a member of a character set (returns true/false). |
Intersection | Returns the intersection of two character sets (that is, the set of characters that are members of both sets). |
Union | Returns the union of two character sets (that is, all the characters that are members of either set or both sets). |
Difference | Returns the difference of two sets (that is, those characters in one set that are not in the other). |
Extraction | Extracts a single character from a set. |
Subset | Returns true if one character set is a subset of another. |
Proper subset | Returns true if one character set is a proper subset of another. |
Superset | Returns true if one character set is a superset of another. |
Proper superset | Returns true if one character set is a proper superset of another. |
Equality | Returns true if one character set is equal to another. |
Inequality | Returns true if one character set is not equal to another. |
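To make the table concrete, here is a short sketch that maps each row onto Python's built-in `set` type. Python's `set` is a general-purpose set rather than a dedicated character set type, and the variable names (`vowels`, `hex_letters`, and so on) are chosen only for illustration.

```python
# Illustration only: Python's general-purpose set stands in for a character set.
vowels = set("aeiou")
hex_letters = set("abcdef")
letters = set("abcdefghijklmnopqrstuvwxyz")

is_member = "e" in vowels                  # membership (IN)
common = vowels & hex_letters              # intersection
either = vowels | hex_letters              # union
only_vowels = vowels - hex_letters         # difference

remaining = set(vowels)                    # work on a copy
extracted = remaining.pop()                # extraction: removes one arbitrary member

is_subset = vowels <= letters              # subset
is_proper_subset = vowels < letters        # proper subset
is_superset = letters >= vowels            # superset
is_proper_superset = letters > vowels      # proper superset

# Duplicates and ordering do not matter to equality.
same = set("aab") == set("ba")             # equality
different = vowels != hex_letters          # inequality
```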
There are many different ways to represent character sets. Several languages implement character sets using an array of Boolean values (one Boolean value for each possible character code). Each Boolean value indicates whether its corresponding character is a member of the character set: true means that the character is a member of the set; false means that it is not. To conserve memory, most character set implementations allocate only a single bit for each possible character; therefore, such character sets consume 16 bytes (128 bits) of memory when supporting 128 characters, or 32 bytes (256 bits) when supporting up to 256 possible characters. This representation of a character set is known as a powerset.
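The following is a minimal sketch of such a one-bit-per-character set, not any particular language's implementation. The class name `CharSet`, its methods, and the choice to store code 0 in the low-order bit of byte 0 are assumptions made for this example.

```python
class CharSet:
    """Hypothetical bit-vector character set: one bit per possible character code."""

    def __init__(self, chars="", size=128):
        self.size = size                       # number of possible character codes
        self.bits = bytearray(size // 8)       # 128 codes -> 16 bytes, 256 -> 32 bytes
        for ch in chars:
            self.add(ch)

    def add(self, ch):
        code = ord(ch)
        if code >= self.size:
            raise ValueError("character code out of range for this set")
        # Setting a bit that is already set changes nothing, so duplicates cannot occur.
        self.bits[code // 8] |= 1 << (code % 8)

    def remove(self, ch):
        code = ord(ch)
        if code < self.size:
            self.bits[code // 8] &= ~(1 << (code % 8)) & 0xFF

    def __contains__(self, ch):
        code = ord(ch)
        return code < self.size and bool(self.bits[code // 8] & (1 << (code % 8)))


s = CharSet("hello")
print("h" in s, "z" in s, len(s.bits))
```

Because each character maps to exactly one bit, adding the same character twice leaves the set unchanged, which is how this representation enforces set semantics without any extra checks.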
The HLA language uses an array of 16 bytes to represent the 128 possible ASCII characters. This array of 128 bits is organized in memory as shown in Figure 5-5.
Bit zero of byte zero corresponds to ASCII code zero (the NUL character). If this bit is one, then the character set contains the NUL character; if this bit is zero, then the character set does not contain the NUL character. Likewise, bit one of byte eight corresponds to ASCII code 65 (8 × 8 + 1), an uppercase A. Bit 65 will contain a one if A is a current member of the character set; it will contain a zero if A is not a member of the set.
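The mapping from a character code to its byte and bit can be sketched in a few lines. The helper name `bit_position` is hypothetical; the layout it assumes (eight codes per byte, lowest code in bit zero) is the one just described.

```python
def bit_position(ch):
    """Return (byte index, bit index within that byte) for a character's code."""
    return divmod(ord(ch), 8)

print(bit_position("\0"))   # (0, 0): NUL is bit zero of byte zero
print(bit_position("A"))    # (8, 1): code 65 is bit one of byte eight

# Mark 'A' as a member in a 16-byte (128-bit) array.
bits = bytearray(16)
byte_index, bit_index = bit_position("A")
bits[byte_index] |= 1 << bit_index

# Viewed as a single 128-bit value (low byte first), the bit that is set is bit 65.
as_integer = int.from_bytes(bits, "little")
print((as_integer >> 65) & 1)   # 1
```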
Pascal (for example, Delphi/Kylix) uses a similar scheme to represent character sets. Delphi allows up to 256 characters in a character set, so Delphi/Kylix character sets consume 256 bits (or 32 bytes) of memory.
While there are other possible ways to implement character sets, this bit vector (array) implementation has the advantage that set operations like union, intersection, difference, comparison, and membership tests are very easy to implement.
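To see why these operations are easy, consider the sketch below, in which each one reduces to one or two bitwise instructions. As a simplification, it stores the whole bit vector in a single Python integer (bit n set means character code n is a member) rather than in a byte array; all of the function names are illustrative.

```python
def make_set(chars):
    """Build a bit vector in which bit n is set when character code n is a member."""
    bits = 0
    for ch in chars:
        bits |= 1 << ord(ch)
    return bits

def is_member(ch, a):
    return (a >> ord(ch)) & 1 == 1

def union(a, b):
    return a | b            # bits set in either operand

def intersection(a, b):
    return a & b            # bits set in both operands

def difference(a, b):
    return a & ~b           # bits set in a but not in b

def is_subset(a, b):
    return (a & ~b) == 0    # a has no member that b lacks

def is_proper_subset(a, b):
    return is_subset(a, b) and a != b

def members(a):
    """List the members in ascending character-code order (for display only)."""
    return "".join(chr(n) for n in range(a.bit_length()) if (a >> n) & 1)


digits = make_set("0123456789")
hex_digits = make_set("0123456789abcdef")
print(members(difference(hex_digits, digits)))   # abcdef
print(is_subset(digits, hex_digits))             # True
```

Equality and inequality need no helper at all: two sets are equal exactly when their bit vectors are equal.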
Sometimes a powerset bitmap just isn't the right representation for a character set. For example, if your sets are always very small (no more than three or four members), using 16 or 32 bytes to represent such a set can be overkill. For very small sets, using a character string to represent a list of characters is probably the best way to go. [10] If you rarely have more than a few characters in a set, scanning through a string to locate a particular character is probably efficient enough for most applications.
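As a rough sketch of the string-based alternative, the helpers below keep a small set in an ordinary string and refuse to add a character that is already present. The function names are hypothetical, and the membership test is a simple linear scan.

```python
def str_set_add(s, ch):
    """Return s with ch appended, unless ch is already a member (no duplicates)."""
    return s if ch in s else s + ch

def str_set_contains(s, ch):
    """Linear scan of the string; cheap when the set has only a few members."""
    return ch in s

def str_set_union(a, b):
    """Add each member of b to a, skipping characters a already holds."""
    result = a
    for ch in b:
        result = str_set_add(result, ch)
    return result


delimiters = ""
for ch in ",;,":                      # the repeated ',' is ignored
    delimiters = str_set_add(delimiters, ch)
print(delimiters)                     # ,;
print(str_set_contains(delimiters, ";"))   # True
```

A three-member set stored this way occupies only a few bytes, compared with the 16 or 32 bytes of a powerset, at the cost of a scan on every membership test.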
On the other hand, if your character set has a large number of possible characters, then the powerset representation for the character set could become quite large (for example, a powerset covering the 65,536 character codes of 16-bit Unicode would require 8,192 bytes of memory). For these reasons (and more), the powerset representation isn't always the best. A list or character string representation could be more appropriate in such situations.
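The sizes quoted in this section follow directly from allocating one bit per possible character code, as the short calculation below shows. The helper name `powerset_bytes` is made up for this example.

```python
def powerset_bytes(num_codes):
    """Bytes needed for a bit vector holding one bit per possible character code."""
    return (num_codes + 7) // 8      # round up to a whole number of bytes

print(powerset_bytes(128))       # 16   (7-bit ASCII)
print(powerset_bytes(256))       # 32   (8-bit character codes)
print(powerset_bytes(65536))     # 8192 (16-bit character codes)
```

The cost is the same whether the set is empty or full, which is why a set that typically holds only a handful of characters from a large code space is often better served by a list or string.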
[10] Though it is up to you to ensure that the character string maintains set semantics. That is, you never allow duplicate characters in such a string.