5.4 Designing Your Own Character Set


5.4 Designing Your Own Character Set

There is very little that is sacred about the ASCII, EBCDIC, and Unicode character sets. Their primary advantage is that they are international standards to which many systems adhere . If you stick with one of these standards, chances are good you'll be able to exchange information with other people. That is the whole purpose of these standardized codes.

However, these codes were not designed to make various character computations easy. ASCII and EBCDIC were designed with now-antiquated hardware in mind. In particular, the ASCII character set was designed to correspond to the mechanical teletypewriters' keyboards, and EBCDIC was designed with old punched-card systems in mind. Given that such equipment is mainly found in museums today, the layout of the codes in these character sets has almost no benefit in modern computer systems. If we could design our own character sets today, they'd probably be considerably different from ASCII or EBCDIC. They'd probably be based on modern keyboards (so they'd include codes for common keys, like LEFT ARROW, RIGHT ARROW, PGUP, and PGDN). The codes would also be laid out to make various common computations a whole lot easier.

Although the ASCII and EBCDIC character sets are not going away anytime soon, there is nothing stopping you from defining your own application-specific character set. Of course, an application-specific character set is, well, application-specific, and you won't be able to share text files containing characters encoded in your custom character set with applications that are ignorant of your private encoding. But it is fairly easy to translate between different character sets using a lookup table, so you can convert between your application's internal character set and an external character set (like ASCII) when performing I/O operations. Assuming you pick a reasonable encoding that makes your programs more efficient overall, the loss of efficiency during I/O can be worthwhile. But how do you choose an encoding for your character set?

The first question you have to ask yourself is, 'How many characters do you want to support in your character set?' Obviously, the number of characters you choose will directly affect the size of your character data. An easy choice is 256 possible characters, because bytes are the most common primitive data type that software uses to represent character data. Keep in mind, however, that if you don't really need 256 characters, you probably shouldn't try to define that many characters in your character set. For example, if you can get by with 128 characters or even 64 characters in your custom character set, then 'text files' you create with such character sets will compress better. Likewise, data transmissions using your custom character set will be faster if you only have to transmit six or seven bits for each character instead of eight. If you need more than 256 characters, you'll have to weigh the advantages and disadvantages of using multiple code pages, double-byte character sets, or 16-bit characters. And keep in mind that Unicode provides for user-defined characters. So if you need more than 256 characters in your character set, you might consider using Unicode and the user -defined points (character sets) to remain 'somewhat standard' with the rest of the world.

However, in this section, we'll define a character set containing 128 characters using an 8-bit byte. For the most part, we're going to simply rearrange the codes in the ASCII character set to make them more convenient for several calculations, and we're going to rename a few of the control codes so they make sense on modern systems instead of the old mainframes and teletypes for which they were created. However, we will add a few new characters beyond those defined by the ASCII standard. Again, the main purpose of this exercise is to make various computations more efficient, not create new characters. We'll call this character set the HyCode character set.

Note  

The development of HyCode in this chapter is not an attempt to create some new character set standard. HyCode is simply a demonstration of how you can create acustom, application-specific, character set to improve your programs.

5.4.1 Designing an Efficient Character Set

We should think about several things when designing a new character set. For example, do we need to be able to represent strings of characters using an existing string format? This can have a bearing on the encoding of our strings. For example, if you want to be able to use function libraries that operate on zero- terminated strings, then you need to reserve encoding zero in your custom character set for use as an end-of-string marker. Do keep in mind, however, that a fair number of string functions won't work with your new character set, no matter what you do. For example, functions like stricmp only work if you use the same representation for alphabetic characters as ASCII (or some other common character set). Therefore, you shouldn't feel hampered by the requirements of some particular string representation, because you're going to have to write many of your own string functions to process your custom characters. The HyCode character set doesn't reserve code zero for an end-of-string marker, and that's okay because, as we've seen, zero-terminated strings are not very efficient.

If you look at programs that make use of character functions, you'll see that certain functions occur frequently, such as these:

  • Check a character to see if it is a digit.

  • Convert a digit character to its numeric equivalent.

  • Convert a numeric digit to its character equivalent.

  • Check a character to see if it is alphabetic.

  • Check a character to see if it is a lowercase character.

  • Check a character to see if it is an uppercase character.

  • Compare two characters (or strings) using a case-insensitive comparison.

  • Sort a set of alphabetic strings (case-sensitive and case-insensitive sorting).

  • Check a character to see if it is alphanumeric .

  • Check a character to see if it is legal in an identifier.

  • Check a character to see if it is a common arithmetic or logical operator.

  • Check a character to see if it is a bracketing character (that is, one of (, ), [, ], {, }, <, or > ).

  • Check a character to see if it is a punctuation character.

  • Check a character to see if it is a whitespace character (such as a space, tab, or newline).

  • Check a character to see if it is a cursor control character.

  • Check a character to see if it is a scroll control key (such as PGUP, PGDN, HOME, and END).

  • Check a character to see if it is a function key.

We'll design the HyCode character set to make these types of operations as efficient and as easy as possible. A huge improvement we can make over the ASCII character set is to assign contiguous character codes to characters belonging to the same type, such as alphabetic characters and control characters. This will allow us to do any of the tests above by using a pair of comparisons. For example, it would be nice if we could determine that a particular character is some sort of punctuation character by comparing against two values that represent upper and lower bounds of the entire range of such characters. While it's not possible to satisfy every conceivable range comparison this way, we can design our character set to accommodate the most common tests with as few comparisons as possible. Although ASCII does organize some of the character sequences in a reasonable fashion, we can do much better. For example, in ASCII, it is not possible to check for a punctuation character with a pair of comparisons because the punctuation characters are spread throughout the character set.

5.4.2 Grouping the Character Codes for Numeric Digits

Consider the first three functions in the previous list - we can achieve all three of these goals by reserving the character codes zero through nine for the characters through 9 . First, by using a single unsigned comparison to check if a character code is less than or equal to nine, we can see if a character is a digit. Next, converting between characters and their numeric representations is trivial, because the character code and the numeric representation are one and the same.

5.4.3 Grouping Alphabetic Characters

Dealing with alphabetic characters is another common character/string problem. The ASCII character set, though nowhere near as bad as EBCDIC, just isn't well designed for dealing with alphabetic character tests and operations. Here are some problems with ASCII that we'll solve with HyCode:

  • The alphabetic characters lie in two disjoint ranges. Tests for an alphabetic character, for example, require four comparisons.

  • The lowercase characters have ASCII codes that are greater than the uppercase characters. For comparison purposes, if we're going to do a case-sensitive comparison, it's more intuitive to treat lowercase characters as being less than uppercase characters.

  • All lowercase characters have a greater value than any individual uppercase character. This leads to counterintuitive results such as a being greater than B even though any school child who has learned their ABCs knows that this isn't the case.

HyCode solves these problems in a couple of interesting ways. First, HyCode uses encodings $4C through $7F to represent the 52 alphabetic characters. Because HyCode only uses 128 character codes ($00..$7F), the alphabetic codes consume the last 52 character codes. This means that if we want to test a character to see if it is alphabetic, we only need to compare whether the code is greater than or equal to $4C. In a high-level language, you'd write the comparison like this:

 if( c >= 76) . . . 

Or if your compiler supports the HyCode character set, like this:

 if( c >= 'a') . . . 

In assembly language, you could use a pair of instructions like the following:

 cmp( al, 76 );               jnae NotAlphabetic;               // Execute these statements if it's alphabetic                NotAlphabetic: 

Another advantage of HyCode (and another big difference from other character sets) is that HyCode interleaves the lowercase and uppercase characters (that is, the sequential encodings are for the characters a, A, b, B, c, C, and so on). This makes sorting and comparing strings very easy, regardless of whether you're doing a case-sensitive or case-insensitive search. The interleaving uses the LO bit of the character code to determine whether the character code is lowercase (LO bit is zero) or uppercase (LO bit is one). HyCode uses the following encodings for alphabetic characters:

 a:76, A:77, b:78, B:79, c:80, C:81, . . . y:124, Y:125, z:126, Z:127 

Checking for an uppercase or lowercase alphabetic using HyCode is a little more work than checking whether a character is alphabetic, but when working in assembly it's still less work than you'll need for the equivalent ASCII comparison. To test a character to see if it's a member of a single case, you effectively need two comparisons; the first test is to see if it's alphabetic, and then you determine its case. In C/C++ you'd probably use statements like the following:

 if( (c >= 76) && (c & 1))  {      // execute this code if it's an uppercase character  }  if( (c >= 76) && !(c & 1))  {      // execute this code if it's a lowercase character  } 

Note that the subexpression (c & 1) evaluates true (1) if the LO bit of c is one, meaning we have an uppercase character if c is alphabetic. Likewise, !(c & 1) evaluates true if the LO bit of c is zero, meaning we have a lowercase character. If you're working in 80x86 assembly language, you can actually test a character to see if it is uppercase or lowercase by using three machine instructions:

 // Note: ROR(1, AL) maps lowercase to the range ..F (38..63)  //       and uppercase to $A6..$BF (166..191). Note that all other characters  //       get mapped to smaller values within these ranges.           ror( 1, al );           cmp( al,  );            jnae NotLower;   // Note: must be an unsigned branch!               // Code that deals with a lowercase character.  NotLower:  // For uppercase, note that the ROR creates codes in the range $A8..$BF which  // are negative (8-bit) values. They also happen to be the *most* negative  // numbers that ROR will produce from the HyCode character set.          ror( 1, al );          cmp( al, $a6 );          jge NotUpper;     // Note: must be a signed branch!              // Code that deals with an uppercase character.  NotUpper: 

Unfortunately, very few languages provide the equivalent of an ror operation, and only a few languages allow you to (easily) treat character values as signed and unsigned within the same code sequence. Therefore, this sequence is probably limited to assembly language programs.

5.4.4 Comparing Alphabetic Characters

The HyCode grouping of alphabetic characters means that lexicographical ordering (i.e., 'dictionary ordering') is almost free, no matter what language you're using. As long as you can live with the fact that lowercase characters are less than the corresponding uppercase characters, sorting your strings by comparing the HyCode character values will give you lexicographical order. This is because, unlike ASCII, HyCode defines the following relations on the alphabetic characters:

 a < A < b < B < c < C < d < D < . . . < w < W < x < X < y < Y < z < Z 

This is exactly the relationship you want for lexicographical ordering, and it's also the intuitive relationship most people would expect.

Case-insensitive comparisons only involve a tiny bit more work than straight case-sensitive comparisons (and far less work than doing case-insensitive comparisons using a character set like ASCII). When comparing two alphabetic characters, you simply mask out their LO bits (or force them both to one) and you automatically get a case-insensitive comparison.

To see the benefit of the HyCode character set when doing case-insensitive comparisons, let's first take a look at what the standard case-insensitive character comparison would look like in C/C++ for two ASCII characters:

 if( toupper( c ) == toupper( d ))  {       // do code that handles c==d using a case-insensitive comparison.  } 

This code doesn't look too bad, but consider what the toupper function (or, usually, macro) expands to: [11]

 #define toupper(ch) ((ch >= 'a' && ch <= 'z') ? ch & 0x5f : ch ) 

With this macro, you wind up with the following once the C preprocessor expands the former if statement:

 if  (          ((c >= 'a' && c <= 'z') ? c & 0x5f : c )       == ((d >= 'a' && d <= 'z') ? d & 0x5f : d )  )  {          // do code that handles c==d using a case-insensitive comparison.  } 

This example expands to 80x86 code similar to this:

 // assume c is in cl and d is in dl.          cmp( cl, 'a' );     // See if c is in the range 'a'..'z'          jb NotLower;          cmp( cl, 'z' );          ja NotLower;          and( f, cl );     // Convert lowercase char in cl to uppercase.  NotLower:          cmp( dl, 'a' );     // See if d is in the range 'a'..'z'          jb NotLower2;          cmp( dl, 'z' );          ja NotLower2;          and( f, dl );     // Convert lowercase char in dl to uppercase.  NotLower2:          cmp( cl, dl );      // Compare the (now uppercase if alphabetic)                              // chars.          jne NotEqual;       // Skip the code that handles c==d if they're                              // not equal.              // do code that handles c==d using a case-insensitive comparison.  NotEqual: 

When using HyCode, case-insensitive comparisons are much simpler. Here's what the HLA assembly code would look like:

 // Check to see if CL is alphabetic. No need to check DL as the comparison  // will always fail if DL is non-alphabetic.          cmp( cl, 76 );      // If CL < 76 ('a') then it's not alphabetic          jb TestEqual;       // and there is no way the two chars are equal                              // (even ignoring case).          or( 1, cl );        // CL is alpha, force it to uppercase.          or( 1, dl );        // DL may or may not be alpha. Force to                              // uppercase if it is.  TestEqual:          cmp( cl, dl );      // Compare the uppercase versions of the chars.          jne NotEqual;       // Bail out if they're not equal.  TheyreEqual:              // do code that handles c==d using a case-insensitive comparison.  NotEqual: 

As you can see, the HyCode sequence uses half the instructions for a case-insensitive comparison of two characters.

5.4.5 Other Character Groupings

Because alphabetic characters are at one end of the character-code range and numeric characters are at the other, it takes two comparisons to check a character to see if it's alphanumeric (which is still better than the four comparisons necessary when using ASCII). Here's the Pascal/Delphi/Kylix code you'd use to see if a character is alphanumeric:

 if( ch < chr(10) or ch >= chr(76)) then . . . 

Several programs (beyond compilers) need to efficiently process strings of characters that represent program identifiers. Most languages allow alphanumeric characters in identifiers, and, as you just saw, we can check a character to see if it's alphanumeric using only two comparisons.

Many languages also allow underscores within identifiers, and some languages, such as MASM and TASM, allow other characters like the at character (@) and dollar sign ($) to appear within identifiers. Therefore, by assigning the underscore character the value 75, and by assigning the $ and @ characters the codes 73 and 74, we can still test for an identifier character using only two comparisons.

For similar reasons, the HyCode character set groups several other classes of characters into contiguous character-code ranges. For example, HyCode groups the cursor control keys together, the whitespace characters, the bracketing characters (parentheses, brackets, braces, and angle brackets), the arithmetic operators, the punctuation characters, and so on. Table 5-4 lists the complete HyCode character set. If you study the numeric codes assigned to each of these characters, you'll discover that their code assignments allow efficient computation of most of the character operations described earlier.

Table 5-4: The HyCode Character Set

Binary

Hex

Decimal

Character

Binary

Hex

Decimal

Character

0000_0000

00

0001_1110

1E

30

End

0000_0001

01

1

1

0001_1111

1F

31

Home

0000_0010

02

2

2

0010_0000

20

32

PgDn

0000_0011

03

3

3

0010_0001

21

33

PgUp

0000_0100

04

4

4

0010_0010

22

34

left

0000_0101

05

5

5

0010_0011

23

35

right

0000_0110

06

6

6

0010_0100

24

36

up

0000_0111

07

7

7

0010_0101

25

37

down/ linefeed

0000_1000

08

8

8

0010_0110

26

38

nonbreaking space

0000_1001

09

9

9

0010_0111

27

39

paragraph

0000_1010

0A

10

keypad

0010_1000

28

40

carriage return

0000_1011

0B

11

cursor

0010_1001

29

41

newline/enter

0000_1100

0C

12

function

0010_1010

2A

42

tab

0000_1101

0D

13

alt

0010_1011

2B

43

space

0000_1110

0E

14

control

0010_1100

2C

44

(

0000_1111

0F

15

command

0010_1101

2D

45

)

0001_0000

10

16

len

0010_1110

2E

46

[

0001_0001

11

17

len128

0010_1111

2F

47

]

0001_0010

12

18

bin128

0011_0000

30

48

{

0001_0011

13

19

EOS

0011_0001

31

49

}

0001_0100

14

20

EOF

0011_0010

32

50

<

0001_0101

15

21

sentinel

0011_0011

33

51

>

0001_0110

16

22

break/ interrupt

0011_0100

34

52

=

0001_0111

17

23

escape/ cancel

0011_0101

35

53

^

0001_1000

18

24

pause

0011_0110

36

54

0001_1001

19

25

bell

0011_0111

37

55

&

0001_1010

1A

26

back tab

0011_1000

38

56

-

0001_1011

1B

27

backspace

0011_1001

39

57

+

0001_1100

1C

28

delete

 

0001_1101

1D

29

Insert

 

0011_1010

3A

58

*

0101_1101

5D

93

I

0011_1011

3B

59

/

0101_1110

5E

94

j

0011_1100

3C

60

%

0101_1111

5F

95

J

0011_1101

3D

61

~

0110_0000

60

96

k

0011_1110

3E

62

!

0110_0001

61

97

K

0011_1111

3F

63

?

0110_0010

62

98

l

0100_0000

40

64

,

0110_0011

63

99

L

0100_0001

41

65

.

0110_0100

64

100

m

0100_0010

42

66

:

0110_0101

65

101

M

0100_0011

43

67

;

0110_0110

66

102

n

0100_0100

44

68

'

0110_0111

67

103

N

0100_0101

45

69

'

0110_1000

68

104

o

0100_0110

46

70

`

0110_1001

69

105

O

0100_0111

47

71

\

0110_1010

6A

106

p

0100_1000

48

72

#

0110_1011

6B

107

P

0100_1001

49

73

$

0110_1100

6C

108

q

0100_1010

4A

74

@

0110_1101

6D

109

Q

0100_1011

4B

75

_

0110_1110

6E

110

r

0100_1100

4C

76

a

0110_1111

6F

111

R

0100_1101

4D

77

A

0111_0000

70

112

s

0100_1110

4E

78

b

0111_0001

71

113

S

0100_1111

4F

79

B

0111_0010

72

114

t

0101_0000

50

80

c

0111_0011

73

115

T

0101_0001

51

81

C

0111_0100

74

116

u

0101_0010

52

82

d

0111_0101

75

117

U

0101_0011

53

83

D

0111_0110

76

118

v

0101_0100

54

84

e

0111_0111

77

119

V

0101_0101

55

85

E

0111_1000

78

120

w

0101_0110

56

86

f

0111_1001

79

121

W

0101_0111

57

87

F

0111_1010

7A

122

x

0101_1000

58

88

g

0111_1011

7B

123

X

0101_1001

59

89

G

0111_1100

7C

124

y

0101_1010

5A

90

h

0111_1101

7D

125

Y

0101_1011

5B

91

H

0111_1110

7E

126

z

0101_1100

5C

92

i

0111_1111

7F

127

Z

[11] Actually, it's worse than this because most C standard libraries use lookup tables to map ranges of characters, but we'll ignore that issue here.




Write Great Code. Understanding the Machine, Vol. 1
The Art of Assembly Language
ISBN: 1593270038
EAN: 2147483647
Year: 2003
Pages: 144
Authors: Randall Hyde

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net