1.6. Numbering Characters

Definitions in character standards assign a number to each character. The numbers are unique in each standard, but different standards assign the numbers differently. Some commonly used standards are mutually compatible, in part: the numbers of characters in ASCII (ranging from 0 to 127) are the same as in the ISO 8859 standards, and the numbers of characters in ISO 8859-1 (ranging from 0 to 255) are the same as in Unicode.
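The partial compatibility described above can be checked with a short Python sketch. It assumes Python's built-in `ascii` and `latin-1` codecs as stand-ins for ASCII and ISO 8859-1; the variable names are illustrative only.

```python
# Decode every ASCII byte value and compare each result with its Unicode number.
ascii_text = bytes(range(128)).decode("ascii")
ascii_matches_unicode = all(ord(ch) == i for i, ch in enumerate(ascii_text))

# Do the same for ISO 8859-1. Note that Python's "latin-1" codec also maps
# bytes 128..159 to the control code points U+0080..U+009F.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_matches_unicode = all(ord(ch) == i for i, ch in enumerate(latin1_text))

print(ascii_matches_unicode, latin1_matches_unicode)
```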

The numbers are nonnegative integers 0, 1, 2,..., but are not necessarily consecutive; there can be gaps in the assignment. For example, in ISO 8859 standards, numbers in the range 128 to 159 are unassigned; more specifically, they are reserved for control purposes, leaving it up to other standards to define them. Unicode contains many gaps, partly because of its coding structure and partly to leave space for future extensions.

It might sound natural to use the first few code numbers for digits 0, 1,..., but character standards use different assignments. Don't expect to find much logic in it. The code number of a character should be treated as fairly arbitrary, but fixed.

The number assigned to a character in a character standard has many different names: code number, code position, code value, code element, code point, code set value, as well as simply code. In the Unicode standard, the term "code point" is used both about a number and about a location in the coding space where a character could reside. Some code points are allocated for characters, a few have been explicitly designated as not corresponding to characters (now or ever), and most code points are still not assigned in any way.

Since characters are internally represented by their code numbers, a character can also be treated as an integer. In fact, many old programming languages lack a data type for characters and use an integer type instead. However, the code numbers are usually not used in arithmetic operations, since they mostly lack numeric meaning. If a character's number is smaller than another character's number, this by no means implies a corresponding relation in alphabetic order. For some small regions of code numbers, the order actually corresponds to alphabetic order, though.
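A small Python sketch can illustrate why code number order should not be mistaken for alphabetic order. It relies on the fact that Python compares strings by code numbers; the word list is made up for this illustration.

```python
# Plain sorting compares code numbers, so "Z" (90) sorts before "a" (97)
# and "ä" (228) sorts after "z" (122).
words = ["zebra", "Zulu", "apple", "\u00e4iti"]  # "\u00e4" is the letter ä
by_code_number = sorted(words)
print(by_code_number)

# Within a small region such as a..z, code order and alphabetic order agree.
lowercase_in_order = "".join(chr(n) for n in range(ord("a"), ord("z") + 1))
print(lowercase_in_order)
```

Sorting text the way a dictionary does requires language-specific collation rules, which this sketch does not attempt.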

For example, in Unicode, the numbers for the characters a (letter a), 0 (digit zero), ! (exclamation mark), ä (letter a with umlaut), and ‰ (per mille sign) are 97, 48, 33, 228, and 8240 in decimal notation. More often, hexadecimal notation is used: 61, 30, 21, E4, and 2030. The code number assignments are essentially arbitrary: the code number has no relationship with the meaning of a character.
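The numbers quoted above can be reproduced with Python's built-in `ord` function, which returns the Unicode number of a one-character string. This is only a sketch; the list name is illustrative.

```python
# The characters a, digit zero, exclamation mark, ä, and per mille sign.
characters = ["a", "0", "!", "\u00e4", "\u2030"]

for ch in characters:
    number = ord(ch)
    print(f"{ch!r}: decimal {number}, hexadecimal {number:X}")

decimal_numbers = [ord(ch) for ch in characters]
hex_numbers = [f"{ord(ch):X}" for ch in characters]
```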

Normally, you do not need to memorize the numbers; you check them from suitable references. However, if you use some code numbers frequently, you will probably learn to remember some of them by heart. This explains the sarcastic saying: "Real Programmers might or might not know their spouse's name. They do, however, know the entire ASCII (or EBCDIC) code table."

1.6.1. Hexadecimal Notation

As mentioned above, character numbers are usually specified in hexadecimal notation, or hex notation. The phrase "hexadecimal number" is often used, but hexadecimal is in fact just a convention for writing numbers, not a different kind of number. The hexadecimal notation FF denotes the same number as the decimal notation 255.

In hexadecimal notation, letters "A" through "F" (or "a" through "f") are used to denote the numbers from 10 to 15. The number denoted by a two-digit hexadecimal notation is the value of the first digit times 16 plus the value of the second digit. For example, hexadecimal 2E means 2 × 16 + 14 = 46 in decimal. Similarly, the four-digit hexadecimal notation 215A means 2 × 16³ + 1 × 16² + 5 × 16 + 10 = 8,538 in decimal. The largest four-digit hexadecimal number is FFFF, which is 65,535 in decimal.
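The positional rule can be written out as a short Python sketch. The function `hex_to_int` and the constant `HEX_DIGITS` are hypothetical helpers written for this illustration; in practice, the built-in `int(text, 16)` does the same job.

```python
HEX_DIGITS = "0123456789ABCDEF"

def hex_to_int(notation: str) -> int:
    """Evaluate hexadecimal notation one digit at a time."""
    value = 0
    for digit in notation.upper():
        # Shift the value so far one hex position left, then add the new digit.
        value = value * 16 + HEX_DIGITS.index(digit)
    return value

print(hex_to_int("2E"), hex_to_int("215A"), hex_to_int("FFFF"))
```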

Figure 1-8. The Calculator in Windows XP, in Scientific mode

It is usually evident from context whether a number is presented as hexadecimal or decimal. In particular, Unicode code numbers written as U+nnnn are always in hexadecimal. When necessary, some special convention is used to indicate the base. In plain text, it is common to use a "0x" (digit zero, letter "x") prefix for hexadecimal numbers, such as in "0x215A." In mathematical notations, the base is often written (in decimal) as a subscript, as in 215A₁₆ or 8,538₁₀.

It is easy but boring to convert between decimal and hexadecimal, so we mostly use computers for that. In Windows, the Calculator program can be used for such conversions when set to "scientific" mode. As shown in Figure 1-8, you can, for example, set the Calculator to hexadecimal mode, enter a number, and click on "Dec" to get the value in decimal.
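Where a programming language is at hand, the same conversions take one line each. The following Python sketch uses only built-in functions; the variable names are illustrative.

```python
number = int("215A", 16)           # hexadecimal notation to integer
same_number = int("0x215A", 16)    # the "0x" prefix is accepted with base 16

plain_hex = format(number, "X")    # integer to hexadecimal digits
prefixed_hex = hex(number)         # integer to "0x" notation (lowercase digits)
u_notation = f"U+{number:04X}"     # U+nnnn style, at least four hex digits

print(number, plain_hex, prefixed_hex, u_notation)
```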

The reason for using hexadecimal notation in character code issues is that Unicode and other standards use that notation. This in turn reflects the design decisions of using 8-bit bytes and grouping characters into 256-character sets. For example, the Unicode number U+205F denotes the character in relative position 5F inside the set U+2000..U+20FF. Such handy things are not possible if decimal numbers are used.
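The split into a 256-character set and a relative position can be sketched in Python as a division by 256 (hexadecimal 100). The variable names are illustrative.

```python
code_number = 0x205F

# Quotient: which 256-character set; remainder: position inside that set.
set_number, position = divmod(code_number, 0x100)

first_in_set = set_number * 0x100
last_in_set = first_in_set + 0xFF

print(f"set {set_number:02X}, position {position:02X}")
print(f"range U+{first_in_set:04X}..U+{last_in_set:04X}")
```

In hexadecimal notation the two parts are simply the first two and the last two digits, which is why no calculation is needed when reading U+205F.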

Another reason is that it is trivial and fast to convert between hexadecimal and binary, and computers internally use binary. Each hexadecimal digit corresponds to 4 bits: 0 = 0000, 1 = 0001, 2 = 0010, 3 = 0011,..., E = 1110, F = 1111.
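The digit-by-digit correspondence can be sketched in Python as follows. The function `hex_to_binary` is a hypothetical helper written for this illustration.

```python
def hex_to_binary(notation: str) -> str:
    """Replace each hexadecimal digit with its 4-bit binary equivalent."""
    return " ".join(format(int(digit, 16), "04b") for digit in notation)

two_e_in_binary = hex_to_binary("2E")
print(two_e_in_binary)
print(hex_to_binary("215A"))
```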

1.6.2. Numbers as Indexes

We can regard a character code as a row of boxes, each capable of containing one character. In many widely used old character codes, the sequence has 256 boxes. In Unicode, the sequence is about a million boxes long. Although Unicode is often presented as a set of code tables (arrays), each consisting of 256 elements, its fundamental structure is essentially linear.

The code numbers are ordinal numbers, or indexes, of the boxes, starting from zero. They can also be understood as indexes to tables of properties of characters. Thus, to find out whether a particular character is a letter in the most general sense, you would conceptually use the character's code number to access a table that contains information about the general category of each character. Actual implementations do not necessarily use such table lookup techniques, but the idea illustrates the point of using code numbers.
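The idea of using a code number as an index into a property table can be sketched with Python's standard `unicodedata` module, which plays the role of the table here. The function `is_letter` is a hypothetical helper; it treats any general category beginning with "L" as a letter. The results reflect the Unicode version bundled with the Python installation.

```python
import sys
import unicodedata

# Number of boxes in the row: code numbers run from 0 to sys.maxunicode.
box_count = sys.maxunicode + 1

def is_letter(code_number: int) -> bool:
    """Look up the general category by code number and test for a letter."""
    return unicodedata.category(chr(code_number)).startswith("L")

for code_number in (0x61, 0x30, 0xE4, 0x2030):
    category = unicodedata.category(chr(code_number))
    print(f"U+{code_number:04X}", category, is_letter(code_number))
```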

There are some things to note about this model, however:

- Not all boxes contain a character. That is, not all code points correspond to a character. In Unicode, most code points are currently unassigned, and some have been explicitly defined as "noncharacters," i.e., as not corresponding to any character, ever.
- Not all characters have a box of their own, or a code point. Some characters containing a diacritic mark can only be written as decomposed, i.e., as a base character followed by one or more combining diacritic marks. For example, the letter "e" with acute accent, é, has a box of its own; but a Cyrillic letter with an acute accent on it, though used as a character in dictionaries, for example, has no code point: it can only be represented as the base letter followed by a combining acute accent.

Thus, although characters are identified by their code points, which are numbers (unsigned integers), the numeric (arithmetic) value is usually irrelevant. That is, we mostly don't operate on them as numbers, with arithmetic operations. For most purposes, the numbers are just indexes. It is not a pure coincidence, though, that some characters have code points that correspond to their mutual alphabetic order. Many character codes have put letters into alphabetic order, and Unicode has tried to preserve much of that.
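The two caveats about the box model can be observed with Python's standard `unicodedata` module. This is only a sketch: the Cyrillic small letter a (U+0430) is chosen here as an assumed example of a Cyrillic letter without a precomposed accented form, and the variable names are illustrative.

```python
import unicodedata

# A noncharacter: the box exists but never holds a character.
# Its general category is reported as "Cn" (not assigned).
noncharacter_category = unicodedata.category("\uFFFF")

# "e" followed by a combining acute accent (U+0301) has a box of its own:
# normalization to composed form (NFC) yields the single code point U+00E9.
composed_e = unicodedata.normalize("NFC", "e\u0301")

# Assumed example: Cyrillic small letter a (U+0430) plus a combining acute
# accent has no precomposed code point, so NFC leaves two code points.
composed_cyrillic = unicodedata.normalize("NFC", "\u0430\u0301")

print(noncharacter_category, len(composed_e), len(composed_cyrillic))
```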

1.6.3. Making Use of Character Numbers

There are several ways to use the Unicode number of a character. The methods of writing characters will be discussed in Chapter 2, but here are some possibilities:
- In HTML and XML authoring, you can use a character reference of the form &#xnumber; (e.g., &#x212E;). That way, you can include any character, no matter what your keyboard is or what your document's encoding is.
- On Microsoft software that uses the so-called Uniscribe input (e.g., many programs under Windows XP), you can type a character's number in hexadecimal, such as 212e, and then type Alt-X and see how the number is replaced by the character.
- You can use the number as an index to information on characters in different tables, databases, and services, including the Unicode standard.
- You can select a character by its number in user interfaces such as the Character Map in Windows, as illustrated earlier in Figure 1-1, or the window that opens in Microsoft Word when you select Insert → Symbol. The latter is illustrated in Figure 1-9, which shows the window in a Finnish version of Word. As you can see, the character name shown is still the Unicode name as such (in this case, ESTIMATED SYMBOL).

Figure 1-9. Character insertion window in Microsoft Word lets you select a character by its Unicode number, as one possibility
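Some of the uses listed above can also be tried out programmatically. The following Python sketch uses the standard `html` and `unicodedata` modules; it is an illustration, not a description of how browsers or Word work internally, and the variable names are illustrative.

```python
import html
import unicodedata

# Resolve a hexadecimal character reference, as an HTML parser would.
character = html.unescape("&#x212E;")

# Use the code number as an index to character information: the Unicode name.
name = unicodedata.name(character)

# Build a character reference from a character's number.
reference = f"&#x{ord(character):X};"

# Turn typed hexadecimal digits into the character (the idea behind Alt-X).
typed = chr(int("212e", 16))

print(name, reference, typed == character)
```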
