5.4 Designing Your Own Character Set

There is very little that is sacred about the ASCII, EBCDIC, and Unicode character sets. Their primary advantage is that they are international standards to which many systems adhere . If you stick with one of these standards, chances are good you'll be able to exchange information with other people. That is the whole purpose of these standardized codes.

However, these codes were not designed to make various character computations easy. ASCII and EBCDIC were designed with now-antiquated hardware in mind. In particular, the ASCII character set was designed to correspond to the mechanical teletypewriters' keyboards, and EBCDIC was designed with old punched-card systems in mind. Given that such equipment is mainly found in museums today, the layout of the codes in these character sets has almost no benefit in modern computer systems. If we could design our own character sets today, they'd probably be considerably different from ASCII or EBCDIC. They'd probably be based on modern keyboards (so they'd include codes for common keys, like LEFT ARROW, RIGHT ARROW, PGUP, and PGDN). The codes would also be laid out to make various common computations a whole lot easier.

Although the ASCII and EBCDIC character sets are not going away anytime soon, there is nothing stopping you from defining your own application-specific character set. Of course, an application-specific character set is, well, application-specific, and you won't be able to share text files containing characters encoded in your custom character set with applications that are ignorant of your private encoding. But it is fairly easy to translate between different character sets using a lookup table, so you can convert between your application's internal character set and an external character set (like ASCII) when performing I/O operations. Assuming you pick a reasonable encoding that makes your programs more efficient overall, the loss of efficiency during I/O can be worthwhile. But how do you choose an encoding for your character set?

The first question you have to ask yourself is, 'How many characters do you want to support in your character set?' Obviously, the number of characters you choose will directly affect the size of your character data. An easy choice is 256 possible characters, because bytes are the most common primitive data type that software uses to represent character data. Keep in mind, however, that if you don't really need 256 characters, you probably shouldn't try to define that many characters in your character set. For example, if you can get by with 128 characters or even 64 characters in your custom character set, then 'text files' you create with such character sets will compress better. Likewise, data transmissions using your custom character set will be faster if you only have to transmit six or seven bits for each character instead of eight. If you need more than 256 characters, you'll have to weigh the advantages and disadvantages of using multiple code pages, double-byte character sets, or 16-bit characters. And keep in mind that Unicode provides for user-defined characters. So if you need more than 256 characters in your character set, you might consider using Unicode and the user -defined points (character sets) to remain 'somewhat standard' with the rest of the world.

However, in this section, we'll define a character set containing 128 characters using an 8-bit byte. For the most part, we're going to simply rearrange the codes in the ASCII character set to make them more convenient for several calculations, and we're going to rename a few of the control codes so they make sense on modern systems instead of the old mainframes and teletypes for which they were created. However, we will add a few new characters beyond those defined by the ASCII standard. Again, the main purpose of this exercise is to make various computations more efficient, not create new characters. We'll call this character set the HyCode character set.

Note	The development of HyCode in this chapter is not an attempt to create some new character set standard. HyCode is simply a demonstration of how you can create acustom, application-specific, character set to improve your programs.

5.4.1 Designing an Efficient Character Set

We should think about several things when designing a new character set. For example, do we need to be able to represent strings of characters using an existing string format? This can have a bearing on the encoding of our strings. For example, if you want to be able to use function libraries that operate on zero- terminated strings, then you need to reserve encoding zero in your custom character set for use as an end-of-string marker. Do keep in mind, however, that a fair number of string functions won't work with your new character set, no matter what you do. For example, functions like stricmp only work if you use the same representation for alphabetic characters as ASCII (or some other common character set). Therefore, you shouldn't feel hampered by the requirements of some particular string representation, because you're going to have to write many of your own string functions to process your custom characters. The HyCode character set doesn't reserve code zero for an end-of-string marker, and that's okay because, as we've seen, zero-terminated strings are not very efficient.

If you look at programs that make use of character functions, you'll see that certain functions occur frequently, such as these:

Check a character to see if it is a digit.
Convert a digit character to its numeric equivalent.
Convert a numeric digit to its character equivalent.
Check a character to see if it is alphabetic.
Check a character to see if it is a lowercase character.
Check a character to see if it is an uppercase character.
Compare two characters (or strings) using a case-insensitive comparison.
Sort a set of alphabetic strings (case-sensitive and case-insensitive sorting).
Check a character to see if it is alphanumeric .
Check a character to see if it is legal in an identifier.
Check a character to see if it is a common arithmetic or logical operator.
Check a character to see if it is a bracketing character (that is, one of (, ), [, ], {, }, <, or > ).
Check a character to see if it is a punctuation character.
Check a character to see if it is a whitespace character (such as a space, tab, or newline).
Check a character to see if it is a cursor control character.
Check a character to see if it is a scroll control key (such as PGUP, PGDN, HOME, and END).
Check a character to see if it is a function key.

We'll design the HyCode character set to make these types of operations as efficient and as easy as possible. A huge improvement we can make over the ASCII character set is to assign contiguous character codes to characters belonging to the same type, such as alphabetic characters and control characters. This will allow us to do any of the tests above by using a pair of comparisons. For example, it would be nice if we could determine that a particular character is some sort of punctuation character by comparing against two values that represent upper and lower bounds of the entire range of such characters. While it's not possible to satisfy every conceivable range comparison this way, we can design our character set to accommodate the most common tests with as few comparisons as possible. Although ASCII does organize some of the character sequences in a reasonable fashion, we can do much better. For example, in ASCII, it is not possible to check for a punctuation character with a pair of comparisons because the punctuation characters are spread throughout the character set.

5.4.2 Grouping the Character Codes for Numeric Digits

Consider the first three functions in the previous list - we can achieve all three of these goals by reserving the character codes zero through nine for the characters through 9 . First, by using a single unsigned comparison to check if a character code is less than or equal to nine, we can see if a character is a digit. Next, converting between characters and their numeric representations is trivial, because the character code and the numeric representation are one and the same.

5.4.3 Grouping Alphabetic Characters

Dealing with alphabetic characters is another common character/string problem. The ASCII character set, though nowhere near as bad as EBCDIC, just isn't well designed for dealing with alphabetic character tests and operations. Here are some problems with ASCII that we'll solve with HyCode:

The alphabetic characters lie in two disjoint ranges. Tests for an alphabetic character, for example, require four comparisons.
The lowercase characters have ASCII codes that are greater than the uppercase characters. For comparison purposes, if we're going to do a case-sensitive comparison, it's more intuitive to treat lowercase characters as being less than uppercase characters.
All lowercase characters have a greater value than any individual uppercase character. This leads to counterintuitive results such as a being greater than B even though any school child who has learned their ABCs knows that this isn't the case.

HyCode solves these problems in a couple of interesting ways. First, HyCode uses encodings $4C through $7F to represent the 52 alphabetic characters. Because HyCode only uses 128 character codes ($00..$7F), the alphabetic codes consume the last 52 character codes. This means that if we want to test a character to see if it is alphabetic, we only need to compare whether the code is greater than or equal to $4C. In a high-level language, you'd write the comparison like this:

 if( c >= 76) . . .

Or if your compiler supports the HyCode character set, like this:

 if( c >= 'a') . . .

In assembly language, you could use a pair of instructions like the following:

 cmp( al, 76 );               jnae NotAlphabetic;               // Execute these statements if it's alphabetic                NotAlphabetic:

Another advantage of HyCode (and another big difference from other character sets) is that HyCode interleaves the lowercase and uppercase characters (that is, the sequential encodings are for the characters a, A, b, B, c, C, and so on). This makes sorting and comparing strings very easy, regardless of whether you're doing a case-sensitive or case-insensitive search. The interleaving uses the LO bit of the character code to determine whether the character code is lowercase (LO bit is zero) or uppercase (LO bit is one). HyCode uses the following encodings for alphabetic characters:

 a:76, A:77, b:78, B:79, c:80, C:81, . . . y:124, Y:125, z:126, Z:127

Checking for an uppercase or lowercase alphabetic using HyCode is a little more work than checking whether a character is alphabetic, but when working in assembly it's still less work than you'll need for the equivalent ASCII comparison. To test a character to see if it's a member of a single case, you effectively need two comparisons; the first test is to see if it's alphabetic, and then you determine its case. In C/C++ you'd probably use statements like the following:

 if( (c >= 76) && (c & 1))  {      // execute this code if it's an uppercase character  }  if( (c >= 76) && !(c & 1))  {      // execute this code if it's a lowercase character  }

Note that the subexpression (c & 1) evaluates true (1) if the LO bit of c is one, meaning we have an uppercase character if c is alphabetic. Likewise, !(c & 1) evaluates true if the LO bit of c is zero, meaning we have a lowercase character. If you're working in 80x86 assembly language, you can actually test a character to see if it is uppercase or lowercase by using three machine instructions:

 // Note: ROR(1, AL) maps lowercase to the range ..F (38..63)  //       and uppercase to $A6..$BF (166..191). Note that all other characters  //       get mapped to smaller values within these ranges.           ror( 1, al );           cmp( al,  );            jnae NotLower;   // Note: must be an unsigned branch!               // Code that deals with a lowercase character.  NotLower:  // For uppercase, note that the ROR creates codes in the range $A8..$BF which  // are negative (8-bit) values. They also happen to be the *most* negative  // numbers that ROR will produce from the HyCode character set.          ror( 1, al );          cmp( al, $a6 );          jge NotUpper;     // Note: must be a signed branch!              // Code that deals with an uppercase character.  NotUpper:

Unfortunately, very few languages provide the equivalent of an ror operation, and only a few languages allow you to (easily) treat character values as signed and unsigned within the same code sequence. Therefore, this sequence is probably limited to assembly language programs.

5.4.4 Comparing Alphabetic Characters

The HyCode grouping of alphabetic characters means that lexicographical ordering (i.e., 'dictionary ordering') is almost free, no matter what language you're using. As long as you can live with the fact that lowercase characters are less than the corresponding uppercase characters, sorting your strings by comparing the HyCode character values will give you lexicographical order. This is because, unlike ASCII, HyCode defines the following relations on the alphabetic characters:

 a < A < b < B < c < C < d < D < . . . < w < W < x < X < y < Y < z < Z

This is exactly the relationship you want for lexicographical ordering, and it's also the intuitive relationship most people would expect.

Case-insensitive comparisons only involve a tiny bit more work than straight case-sensitive comparisons (and far less work than doing case-insensitive comparisons using a character set like ASCII). When comparing two alphabetic characters, you simply mask out their LO bits (or force them both to one) and you automatically get a case-insensitive comparison.

To see the benefit of the HyCode character set when doing case-insensitive comparisons, let's first take a look at what the standard case-insensitive character comparison would look like in C/C++ for two ASCII characters:

 if( toupper( c ) == toupper( d ))  {       // do code that handles c==d using a case-insensitive comparison.  }

This code doesn't look too bad, but consider what the toupper function (or, usually, macro) expands to: ^[11]

 #define toupper(ch) ((ch >= 'a' && ch <= 'z') ? ch & 0x5f : ch )

With this macro, you wind up with the following once the C preprocessor expands the former if statement:

 if  (          ((c >= 'a' && c <= 'z') ? c & 0x5f : c )       == ((d >= 'a' && d <= 'z') ? d & 0x5f : d )  )  {          // do code that handles c==d using a case-insensitive comparison.  }

This example expands to 80x86 code similar to this:

 // assume c is in cl and d is in dl.          cmp( cl, 'a' );     // See if c is in the range 'a'..'z'          jb NotLower;          cmp( cl, 'z' );          ja NotLower;          and( f, cl );     // Convert lowercase char in cl to uppercase.  NotLower:          cmp( dl, 'a' );     // See if d is in the range 'a'..'z'          jb NotLower2;          cmp( dl, 'z' );          ja NotLower2;          and( f, dl );     // Convert lowercase char in dl to uppercase.  NotLower2:          cmp( cl, dl );      // Compare the (now uppercase if alphabetic)                              // chars.          jne NotEqual;       // Skip the code that handles c==d if they're                              // not equal.              // do code that handles c==d using a case-insensitive comparison.  NotEqual:

When using HyCode, case-insensitive comparisons are much simpler. Here's what the HLA assembly code would look like:

 // Check to see if CL is alphabetic. No need to check DL as the comparison  // will always fail if DL is non-alphabetic.          cmp( cl, 76 );      // If CL < 76 ('a') then it's not alphabetic          jb TestEqual;       // and there is no way the two chars are equal                              // (even ignoring case).          or( 1, cl );        // CL is alpha, force it to uppercase.          or( 1, dl );        // DL may or may not be alpha. Force to                              // uppercase if it is.  TestEqual:          cmp( cl, dl );      // Compare the uppercase versions of the chars.          jne NotEqual;       // Bail out if they're not equal.  TheyreEqual:              // do code that handles c==d using a case-insensitive comparison.  NotEqual:

As you can see, the HyCode sequence uses half the instructions for a case-insensitive comparison of two characters.

5.4.5 Other Character Groupings

Because alphabetic characters are at one end of the character-code range and numeric characters are at the other, it takes two comparisons to check a character to see if it's alphanumeric (which is still better than the four comparisons necessary when using ASCII). Here's the Pascal/Delphi/Kylix code you'd use to see if a character is alphanumeric:

 if( ch < chr(10) or ch >= chr(76)) then . . .

Several programs (beyond compilers) need to efficiently process strings of characters that represent program identifiers. Most languages allow alphanumeric characters in identifiers, and, as you just saw, we can check a character to see if it's alphanumeric using only two comparisons.

Many languages also allow underscores within identifiers, and some languages, such as MASM and TASM, allow other characters like the at character (@) and dollar sign ($) to appear within identifiers. Therefore, by assigning the underscore character the value 75, and by assigning the $ and @ characters the codes 73 and 74, we can still test for an identifier character using only two comparisons.

For similar reasons, the HyCode character set groups several other classes of characters into contiguous character-code ranges. For example, HyCode groups the cursor control keys together, the whitespace characters, the bracketing characters (parentheses, brackets, braces, and angle brackets), the arithmetic operators, the punctuation characters, and so on. Table 5-4 lists the complete HyCode character set. If you study the numeric codes assigned to each of these characters, you'll discover that their code assignments allow efficient computation of most of the character operations described earlier.

Table 5-4: The HyCode Character Set
Binary	Hex	Decimal	Character	Binary	Hex	Decimal	Character
0000_0000	00			0001_1110	1E	30	End
0000_0001	01	1	1	0001_1111	1F	31	Home
0000_0010	02	2	2	0010_0000	20	32	PgDn
0000_0011	03	3	3	0010_0001	21	33	PgUp
0000_0100	04	4	4	0010_0010	22	34	left
0000_0101	05	5	5	0010_0011	23	35	right
0000_0110	06	6	6	0010_0100	24	36	up
0000_0111	07	7	7	0010_0101	25	37	down/ linefeed
0000_1000	08	8	8	0010_0110	26	38	nonbreaking space
0000_1001	09	9	9	0010_0111	27	39	paragraph
0000_1010	0A	10	keypad	0010_1000	28	40	carriage return
0000_1011	0B	11	cursor	0010_1001	29	41	newline/enter
0000_1100	0C	12	function	0010_1010	2A	42	tab
0000_1101	0D	13	alt	0010_1011	2B	43	space
0000_1110	0E	14	control	0010_1100	2C	44	(
0000_1111	0F	15	command	0010_1101	2D	45	)
0001_0000	10	16	len	0010_1110	2E	46	[
0001_0001	11	17	len128	0010_1111	2F	47	]
0001_0010	12	18	bin128	0011_0000	30	48	{
0001_0011	13	19	EOS	0011_0001	31	49	}
0001_0100	14	20	EOF	0011_0010	32	50	<
0001_0101	15	21	sentinel	0011_0011	33	51	>
0001_0110	16	22	break/ interrupt	0011_0100	34	52	=
0001_0111	17	23	escape/ cancel	0011_0101	35	53	^
0001_1000	18	24	pause	0011_0110	36	54
0001_1001	19	25	bell	0011_0111	37	55	&
0001_1010	1A	26	back tab	0011_1000	38	56	-
0001_1011	1B	27	backspace	0011_1001	39	57	+
0001_1100	1C	28	delete
0001_1101	1D	29	Insert
0011_1010	3A	58	*	0101_1101	5D	93	I
0011_1011	3B	59	/	0101_1110	5E	94	j
0011_1100	3C	60	%	0101_1111	5F	95	J
0011_1101	3D	61	~	0110_0000	60	96	k
0011_1110	3E	62	!	0110_0001	61	97	K
0011_1111	3F	63	?	0110_0010	62	98	l
0100_0000	40	64	,	0110_0011	63	99	L
0100_0001	41	65	.	0110_0100	64	100	m
0100_0010	42	66	:	0110_0101	65	101	M
0100_0011	43	67	;	0110_0110	66	102	n
0100_0100	44	68	'	0110_0111	67	103	N
0100_0101	45	69	'	0110_1000	68	104	o
0100_0110	46	70	`	0110_1001	69	105	O
0100_0111	47	71	\	0110_1010	6A	106	p
0100_1000	48	72	#	0110_1011	6B	107	P
0100_1001	49	73	$	0110_1100	6C	108	q
0100_1010	4A	74	@	0110_1101	6D	109	Q
0100_1011	4B	75	_	0110_1110	6E	110	r
0100_1100	4C	76	a	0110_1111	6F	111	R
0100_1101	4D	77	A	0111_0000	70	112	s
0100_1110	4E	78	b	0111_0001	71	113	S
0100_1111	4F	79	B	0111_0010	72	114	t
0101_0000	50	80	c	0111_0011	73	115	T
0101_0001	51	81	C	0111_0100	74	116	u
0101_0010	52	82	d	0111_0101	75	117	U
0101_0011	53	83	D	0111_0110	76	118	v
0101_0100	54	84	e	0111_0111	77	119	V
0101_0101	55	85	E	0111_1000	78	120	w
0101_0110	56	86	f	0111_1001	79	121	W
0101_0111	57	87	F	0111_1010	7A	122	x
0101_1000	58	88	g	0111_1011	7B	123	X
0101_1001	59	89	G	0111_1100	7C	124	y
0101_1010	5A	90	h	0111_1101	7D	125	Y
0101_1011	5B	91	H	0111_1110	7E	126	z
0101_1100	5C	92	i	0111_1111	7F	127	Z

^[11] Actually, it's worse than this because most C standard libraries use lookup tables to map ranges of characters, but we'll ignore that issue here.