There is very little that is sacred about the ASCII, EBCDIC, and Unicode character sets. Their primary advantage is that they are international standards to which many systems adhere . If you stick with one of these standards, chances are good you'll be able to exchange information with other people. That is the whole purpose of these standardized codes.
However, these codes were not designed to make various character computations easy. ASCII and EBCDIC were designed with now-antiquated hardware in mind. In particular, the ASCII character set was designed to correspond to the mechanical teletypewriters' keyboards, and EBCDIC was designed with old punched-card systems in mind. Given that such equipment is mainly found in museums today, the layout of the codes in these character sets has almost no benefit in modern computer systems. If we could design our own character sets today, they'd probably be considerably different from ASCII or EBCDIC. They'd probably be based on modern keyboards (so they'd include codes for common keys, like LEFT ARROW, RIGHT ARROW, PGUP, and PGDN). The codes would also be laid out to make various common computations a whole lot easier.
Although the ASCII and EBCDIC character sets are not going away anytime soon, there is nothing stopping you from defining your own application-specific character set. Of course, an application-specific character set is, well, application-specific, and you won't be able to share text files containing characters encoded in your custom character set with applications that are ignorant of your private encoding. But it is fairly easy to translate between different character sets using a lookup table, so you can convert between your application's internal character set and an external character set (like ASCII) when performing I/O operations. Assuming you pick a reasonable encoding that makes your programs more efficient overall, the loss of efficiency during I/O can be worthwhile. But how do you choose an encoding for your character set?
The first question you have to ask yourself is, 'How many characters do you want to support in your character set?' Obviously, the number of characters you choose will directly affect the size of your character data. An easy choice is 256 possible characters, because bytes are the most common primitive data type that software uses to represent character data. Keep in mind, however, that if you don't really need 256 characters, you probably shouldn't try to define that many characters in your character set. For example, if you can get by with 128 characters or even 64 characters in your custom character set, then 'text files' you create with such character sets will compress better. Likewise, data transmissions using your custom character set will be faster if you only have to transmit six or seven bits for each character instead of eight. If you need more than 256 characters, you'll have to weigh the advantages and disadvantages of using multiple code pages, double-byte character sets, or 16-bit characters. And keep in mind that Unicode provides for user-defined characters. So if you need more than 256 characters in your character set, you might consider using Unicode and the user -defined points (character sets) to remain 'somewhat standard' with the rest of the world.
However, in this section, we'll define a character set containing 128 characters using an 8-bit byte. For the most part, we're going to simply rearrange the codes in the ASCII character set to make them more convenient for several calculations, and we're going to rename a few of the control codes so they make sense on modern systems instead of the old mainframes and teletypes for which they were created. However, we will add a few new characters beyond those defined by the ASCII standard. Again, the main purpose of this exercise is to make various computations more efficient, not create new characters. We'll call this character set the HyCode character set.
Note | The development of HyCode in this chapter is not an attempt to create some new character set standard. HyCode is simply a demonstration of how you can create acustom, application-specific, character set to improve your programs. |
We should think about several things when designing a new character set. For example, do we need to be able to represent strings of characters using an existing string format? This can have a bearing on the encoding of our strings. For example, if you want to be able to use function libraries that operate on zero- terminated strings, then you need to reserve encoding zero in your custom character set for use as an end-of-string marker. Do keep in mind, however, that a fair number of string functions won't work with your new character set, no matter what you do. For example, functions like stricmp only work if you use the same representation for alphabetic characters as ASCII (or some other common character set). Therefore, you shouldn't feel hampered by the requirements of some particular string representation, because you're going to have to write many of your own string functions to process your custom characters. The HyCode character set doesn't reserve code zero for an end-of-string marker, and that's okay because, as we've seen, zero-terminated strings are not very efficient.
If you look at programs that make use of character functions, you'll see that certain functions occur frequently, such as these:
Check a character to see if it is a digit.
Convert a digit character to its numeric equivalent.
Convert a numeric digit to its character equivalent.
Check a character to see if it is alphabetic.
Check a character to see if it is a lowercase character.
Check a character to see if it is an uppercase character.
Compare two characters (or strings) using a case-insensitive comparison.
Sort a set of alphabetic strings (case-sensitive and case-insensitive sorting).
Check a character to see if it is alphanumeric .
Check a character to see if it is legal in an identifier.
Check a character to see if it is a common arithmetic or logical operator.
Check a character to see if it is a bracketing character (that is, one of (, ), [, ], {, }, <, or > ).
Check a character to see if it is a punctuation character.
Check a character to see if it is a whitespace character (such as a space, tab, or newline).
Check a character to see if it is a cursor control character.
Check a character to see if it is a scroll control key (such as PGUP, PGDN, HOME, and END).
Check a character to see if it is a function key.
We'll design the HyCode character set to make these types of operations as efficient and as easy as possible. A huge improvement we can make over the ASCII character set is to assign contiguous character codes to characters belonging to the same type, such as alphabetic characters and control characters. This will allow us to do any of the tests above by using a pair of comparisons. For example, it would be nice if we could determine that a particular character is some sort of punctuation character by comparing against two values that represent upper and lower bounds of the entire range of such characters. While it's not possible to satisfy every conceivable range comparison this way, we can design our character set to accommodate the most common tests with as few comparisons as possible. Although ASCII does organize some of the character sequences in a reasonable fashion, we can do much better. For example, in ASCII, it is not possible to check for a punctuation character with a pair of comparisons because the punctuation characters are spread throughout the character set.
Consider the first three functions in the previous list - we can achieve all three of these goals by reserving the character codes zero through nine for the characters through 9 . First, by using a single unsigned comparison to check if a character code is less than or equal to nine, we can see if a character is a digit. Next, converting between characters and their numeric representations is trivial, because the character code and the numeric representation are one and the same.
Dealing with alphabetic characters is another common character/string problem. The ASCII character set, though nowhere near as bad as EBCDIC, just isn't well designed for dealing with alphabetic character tests and operations. Here are some problems with ASCII that we'll solve with HyCode:
The alphabetic characters lie in two disjoint ranges. Tests for an alphabetic character, for example, require four comparisons.
The lowercase characters have ASCII codes that are greater than the uppercase characters. For comparison purposes, if we're going to do a case-sensitive comparison, it's more intuitive to treat lowercase characters as being less than uppercase characters.
All lowercase characters have a greater value than any individual uppercase character. This leads to counterintuitive results such as a being greater than B even though any school child who has learned their ABCs knows that this isn't the case.
HyCode solves these problems in a couple of interesting ways. First, HyCode uses encodings $4C through $7F to represent the 52 alphabetic characters. Because HyCode only uses 128 character codes ($00..$7F), the alphabetic codes consume the last 52 character codes. This means that if we want to test a character to see if it is alphabetic, we only need to compare whether the code is greater than or equal to $4C. In a high-level language, you'd write the comparison like this:
if( c >= 76) . . .
Or if your compiler supports the HyCode character set, like this:
if( c >= 'a') . . .
In assembly language, you could use a pair of instructions like the following:
cmp( al, 76 ); jnae NotAlphabetic; // Execute these statements if it's alphabetic NotAlphabetic:
Another advantage of HyCode (and another big difference from other character sets) is that HyCode interleaves the lowercase and uppercase characters (that is, the sequential encodings are for the characters a, A, b, B, c, C, and so on). This makes sorting and comparing strings very easy, regardless of whether you're doing a case-sensitive or case-insensitive search. The interleaving uses the LO bit of the character code to determine whether the character code is lowercase (LO bit is zero) or uppercase (LO bit is one). HyCode uses the following encodings for alphabetic characters:
a:76, A:77, b:78, B:79, c:80, C:81, . . . y:124, Y:125, z:126, Z:127
Checking for an uppercase or lowercase alphabetic using HyCode is a little more work than checking whether a character is alphabetic, but when working in assembly it's still less work than you'll need for the equivalent ASCII comparison. To test a character to see if it's a member of a single case, you effectively need two comparisons; the first test is to see if it's alphabetic, and then you determine its case. In C/C++ you'd probably use statements like the following:
if( (c >= 76) && (c & 1)) { // execute this code if it's an uppercase character } if( (c >= 76) && !(c & 1)) { // execute this code if it's a lowercase character }
Note that the subexpression (c & 1) evaluates true (1) if the LO bit of c is one, meaning we have an uppercase character if c is alphabetic. Likewise, !(c & 1) evaluates true if the LO bit of c is zero, meaning we have a lowercase character. If you're working in 80x86 assembly language, you can actually test a character to see if it is uppercase or lowercase by using three machine instructions:
// Note: ROR(1, AL) maps lowercase to the range ..F (38..63) // and uppercase to $A6..$BF (166..191). Note that all other characters // get mapped to smaller values within these ranges. ror( 1, al ); cmp( al, ); jnae NotLower; // Note: must be an unsigned branch! // Code that deals with a lowercase character. NotLower: // For uppercase, note that the ROR creates codes in the range $A8..$BF which // are negative (8-bit) values. They also happen to be the *most* negative // numbers that ROR will produce from the HyCode character set. ror( 1, al ); cmp( al, $a6 ); jge NotUpper; // Note: must be a signed branch! // Code that deals with an uppercase character. NotUpper:
Unfortunately, very few languages provide the equivalent of an ror operation, and only a few languages allow you to (easily) treat character values as signed and unsigned within the same code sequence. Therefore, this sequence is probably limited to assembly language programs.
The HyCode grouping of alphabetic characters means that lexicographical ordering (i.e., 'dictionary ordering') is almost free, no matter what language you're using. As long as you can live with the fact that lowercase characters are less than the corresponding uppercase characters, sorting your strings by comparing the HyCode character values will give you lexicographical order. This is because, unlike ASCII, HyCode defines the following relations on the alphabetic characters:
a < A < b < B < c < C < d < D < . . . < w < W < x < X < y < Y < z < Z
This is exactly the relationship you want for lexicographical ordering, and it's also the intuitive relationship most people would expect.
Case-insensitive comparisons only involve a tiny bit more work than straight case-sensitive comparisons (and far less work than doing case-insensitive comparisons using a character set like ASCII). When comparing two alphabetic characters, you simply mask out their LO bits (or force them both to one) and you automatically get a case-insensitive comparison.
To see the benefit of the HyCode character set when doing case-insensitive comparisons, let's first take a look at what the standard case-insensitive character comparison would look like in C/C++ for two ASCII characters:
if( toupper( c ) == toupper( d )) { // do code that handles c==d using a case-insensitive comparison. }
This code doesn't look too bad, but consider what the toupper function (or, usually, macro) expands to: [11]
#define toupper(ch) ((ch >= 'a' && ch <= 'z') ? ch & 0x5f : ch )
With this macro, you wind up with the following once the C preprocessor expands the former if statement:
if ( ((c >= 'a' && c <= 'z') ? c & 0x5f : c ) == ((d >= 'a' && d <= 'z') ? d & 0x5f : d ) ) { // do code that handles c==d using a case-insensitive comparison. }
This example expands to 80x86 code similar to this:
// assume c is in cl and d is in dl. cmp( cl, 'a' ); // See if c is in the range 'a'..'z' jb NotLower; cmp( cl, 'z' ); ja NotLower; and( f, cl ); // Convert lowercase char in cl to uppercase. NotLower: cmp( dl, 'a' ); // See if d is in the range 'a'..'z' jb NotLower2; cmp( dl, 'z' ); ja NotLower2; and( f, dl ); // Convert lowercase char in dl to uppercase. NotLower2: cmp( cl, dl ); // Compare the (now uppercase if alphabetic) // chars. jne NotEqual; // Skip the code that handles c==d if they're // not equal. // do code that handles c==d using a case-insensitive comparison. NotEqual:
When using HyCode, case-insensitive comparisons are much simpler. Here's what the HLA assembly code would look like:
// Check to see if CL is alphabetic. No need to check DL as the comparison // will always fail if DL is non-alphabetic. cmp( cl, 76 ); // If CL < 76 ('a') then it's not alphabetic jb TestEqual; // and there is no way the two chars are equal // (even ignoring case). or( 1, cl ); // CL is alpha, force it to uppercase. or( 1, dl ); // DL may or may not be alpha. Force to // uppercase if it is. TestEqual: cmp( cl, dl ); // Compare the uppercase versions of the chars. jne NotEqual; // Bail out if they're not equal. TheyreEqual: // do code that handles c==d using a case-insensitive comparison. NotEqual:
As you can see, the HyCode sequence uses half the instructions for a case-insensitive comparison of two characters.
Because alphabetic characters are at one end of the character-code range and numeric characters are at the other, it takes two comparisons to check a character to see if it's alphanumeric (which is still better than the four comparisons necessary when using ASCII). Here's the Pascal/Delphi/Kylix code you'd use to see if a character is alphanumeric:
if( ch < chr(10) or ch >= chr(76)) then . . .
Several programs (beyond compilers) need to efficiently process strings of characters that represent program identifiers. Most languages allow alphanumeric characters in identifiers, and, as you just saw, we can check a character to see if it's alphanumeric using only two comparisons.
Many languages also allow underscores within identifiers, and some languages, such as MASM and TASM, allow other characters like the at character (@) and dollar sign ($) to appear within identifiers. Therefore, by assigning the underscore character the value 75, and by assigning the $ and @ characters the codes 73 and 74, we can still test for an identifier character using only two comparisons.
For similar reasons, the HyCode character set groups several other classes of characters into contiguous character-code ranges. For example, HyCode groups the cursor control keys together, the whitespace characters, the bracketing characters (parentheses, brackets, braces, and angle brackets), the arithmetic operators, the punctuation characters, and so on. Table 5-4 lists the complete HyCode character set. If you study the numeric codes assigned to each of these characters, you'll discover that their code assignments allow efficient computation of most of the character operations described earlier.
Binary | Hex | Decimal | Character | Binary | Hex | Decimal | Character |
---|---|---|---|---|---|---|---|
0000_0000 | 00 |
|
| 0001_1110 | 1E | 30 | End |
0000_0001 | 01 | 1 | 1 | 0001_1111 | 1F | 31 | Home |
0000_0010 | 02 | 2 | 2 | 0010_0000 | 20 | 32 | PgDn |
0000_0011 | 03 | 3 | 3 | 0010_0001 | 21 | 33 | PgUp |
0000_0100 | 04 | 4 | 4 | 0010_0010 | 22 | 34 | left |
0000_0101 | 05 | 5 | 5 | 0010_0011 | 23 | 35 | right |
0000_0110 | 06 | 6 | 6 | 0010_0100 | 24 | 36 | up |
0000_0111 | 07 | 7 | 7 | 0010_0101 | 25 | 37 | down/ linefeed |
0000_1000 | 08 | 8 | 8 | 0010_0110 | 26 | 38 | nonbreaking space |
0000_1001 | 09 | 9 | 9 | 0010_0111 | 27 | 39 | paragraph |
0000_1010 | 0A | 10 | keypad | 0010_1000 | 28 | 40 | carriage return |
0000_1011 | 0B | 11 | cursor | 0010_1001 | 29 | 41 | newline/enter |
0000_1100 | 0C | 12 | function | 0010_1010 | 2A | 42 | tab |
0000_1101 | 0D | 13 | alt | 0010_1011 | 2B | 43 | space |
0000_1110 | 0E | 14 | control | 0010_1100 | 2C | 44 | ( |
0000_1111 | 0F | 15 | command | 0010_1101 | 2D | 45 | ) |
0001_0000 | 10 | 16 | len | 0010_1110 | 2E | 46 | [ |
0001_0001 | 11 | 17 | len128 | 0010_1111 | 2F | 47 | ] |
0001_0010 | 12 | 18 | bin128 | 0011_0000 | 30 | 48 | { |
0001_0011 | 13 | 19 | EOS | 0011_0001 | 31 | 49 | } |
0001_0100 | 14 | 20 | EOF | 0011_0010 | 32 | 50 | < |
0001_0101 | 15 | 21 | sentinel | 0011_0011 | 33 | 51 | > |
0001_0110 | 16 | 22 | break/ interrupt | 0011_0100 | 34 | 52 | = |
0001_0111 | 17 | 23 | escape/ cancel | 0011_0101 | 35 | 53 | ^ |
0001_1000 | 18 | 24 | pause | 0011_0110 | 36 | 54 |
|
0001_1001 | 19 | 25 | bell | 0011_0111 | 37 | 55 | & |
0001_1010 | 1A | 26 | back tab | 0011_1000 | 38 | 56 | - |
0001_1011 | 1B | 27 | backspace | 0011_1001 | 39 | 57 | + |
0001_1100 | 1C | 28 | delete | ||||
0001_1101 | 1D | 29 | Insert | ||||
0011_1010 | 3A | 58 | * | 0101_1101 | 5D | 93 | I |
0011_1011 | 3B | 59 | / | 0101_1110 | 5E | 94 | j |
0011_1100 | 3C | 60 | % | 0101_1111 | 5F | 95 | J |
0011_1101 | 3D | 61 | ~ | 0110_0000 | 60 | 96 | k |
0011_1110 | 3E | 62 | ! | 0110_0001 | 61 | 97 | K |
0011_1111 | 3F | 63 | ? | 0110_0010 | 62 | 98 | l |
0100_0000 | 40 | 64 | , | 0110_0011 | 63 | 99 | L |
0100_0001 | 41 | 65 | . | 0110_0100 | 64 | 100 | m |
0100_0010 | 42 | 66 | : | 0110_0101 | 65 | 101 | M |
0100_0011 | 43 | 67 | ; | 0110_0110 | 66 | 102 | n |
0100_0100 | 44 | 68 | ' | 0110_0111 | 67 | 103 | N |
0100_0101 | 45 | 69 | ' | 0110_1000 | 68 | 104 | o |
0100_0110 | 46 | 70 | ` | 0110_1001 | 69 | 105 | O |
0100_0111 | 47 | 71 | \ | 0110_1010 | 6A | 106 | p |
0100_1000 | 48 | 72 | # | 0110_1011 | 6B | 107 | P |
0100_1001 | 49 | 73 | $ | 0110_1100 | 6C | 108 | q |
0100_1010 | 4A | 74 | @ | 0110_1101 | 6D | 109 | Q |
0100_1011 | 4B | 75 | _ | 0110_1110 | 6E | 110 | r |
0100_1100 | 4C | 76 | a | 0110_1111 | 6F | 111 | R |
0100_1101 | 4D | 77 | A | 0111_0000 | 70 | 112 | s |
0100_1110 | 4E | 78 | b | 0111_0001 | 71 | 113 | S |
0100_1111 | 4F | 79 | B | 0111_0010 | 72 | 114 | t |
0101_0000 | 50 | 80 | c | 0111_0011 | 73 | 115 | T |
0101_0001 | 51 | 81 | C | 0111_0100 | 74 | 116 | u |
0101_0010 | 52 | 82 | d | 0111_0101 | 75 | 117 | U |
0101_0011 | 53 | 83 | D | 0111_0110 | 76 | 118 | v |
0101_0100 | 54 | 84 | e | 0111_0111 | 77 | 119 | V |
0101_0101 | 55 | 85 | E | 0111_1000 | 78 | 120 | w |
0101_0110 | 56 | 86 | f | 0111_1001 | 79 | 121 | W |
0101_0111 | 57 | 87 | F | 0111_1010 | 7A | 122 | x |
0101_1000 | 58 | 88 | g | 0111_1011 | 7B | 123 | X |
0101_1001 | 59 | 89 | G | 0111_1100 | 7C | 124 | y |
0101_1010 | 5A | 90 | h | 0111_1101 | 7D | 125 | Y |
0101_1011 | 5B | 91 | H | 0111_1110 | 7E | 126 | z |
0101_1100 | 5C | 92 | i | 0111_1111 | 7F | 127 | Z |
[11] Actually, it's worse than this because most C standard libraries use lookup tables to map ranges of characters, but we'll ignore that issue here.