It is uncertain when human beings began speaking, but writing seems to be about six thousand years old. Early writing was pictographic in nature. Alphabets—in which individual letters correspond to spoken sounds—came about just three thousand years ago. Although the various written languages of the world served fine for some time, several nineteenth-century inventors saw a need for something more. When Samuel F. B. Morse developed the telegraph between 1838 and 1854, he also devised a code to use with it. Each letter in the alphabet corresponded to a series of short and long pulses (dots and dashes). There was no distinction between uppercase and lowercase letters, but numbers and punctuation marks had their own codes.
Morse code was not the first instance of written language being represented by something other than drawn or printed glyphs. Between 1821 and 1824, the young Louis Braille was inspired by a military system for writing and reading messages at night to develop a code for embossing raised dots into paper for reading by the blind. Braille is essentially a 6-bit code that encodes letters, common letter combinations, common words, and punctuation. A special escape code indicates that the following letter code is to be interpreted as uppercase. A special shift code allows subsequent letter codes to be interpreted as numbers.
Telex codes, including Baudot (named after a French engineer who died in 1903) and a code known as CCITT #2 (standardized in 1931), were 5-bit codes that included letter shifts and figure shifts.
Early computer character codes evolved from the coding used on Hollerith ("do not fold, spindle, or mutilate") cards, invented by Herman Hollerith and first used in the 1890 United States census. A 6-bit character code known as BCDIC ("Binary-Coded Decimal Interchange Code") based on Hollerith coding was progressively extended to the 8-bit EBCDIC in the 1960s and remains the standard on IBM mainframes but nowhere else.
The American Standard Code for Information Interchange (ASCII) had its origins in the late 1950s and was finalized in 1967. During the development of ASCII, there was considerable debate over whether the code should be 6, 7, or 8 bits wide. Reliability considerations seemed to mandate that no shift character be used, so ASCII couldn't be a 6-bit code. Cost ruled out the 8-bit version. (Bits were very expensive back then.) The final code had 26 lowercase letters, 26 uppercase letters, 10 digits, 32 symbols, 33 control codes, and a space, for a total of 128 codes. ASCII is currently documented in ANSI X3.4-1986, "Coded Character Sets—7-Bit American National Standard Code for Information Interchange (7-Bit ASCII)," published by the American National Standards Institute. Figure 2-1 shows ASCII (for the zillionth time), very similar to how it appears in the ANSI document.
     0-   1-   2-   3-   4-   5-   6-   7-
-0   NUL  DLE  SP   0    @    P    `    p
-1   SOH  DC1  !    1    A    Q    a    q
-2   STX  DC2  "    2    B    R    b    r
-3   ETX  DC3  #    3    C    S    c    s
-4   EOT  DC4  $    4    D    T    d    t
-5   ENQ  NAK  %    5    E    U    e    u
-6   ACK  SYN  &    6    F    V    f    v
-7   BEL  ETB  '    7    G    W    g    w
-8   BS   CAN  (    8    H    X    h    x
-9   HT   EM   )    9    I    Y    i    y
-A   LF   SUB  *    :    J    Z    j    z
-B   VT   ESC  +    ;    K    [    k    {
-C   FF   FS   ,    <    L    \    l    |
-D   CR   GS   -    =    M    ]    m    }
-E   SO   RS   .    >    N    ^    n    ~
-F   SI   US   /    ?    O    _    o    DEL
Figure 2-1. The ASCII character set.
There are a lot of good things you can say about ASCII. The 26 letter codes are contiguous, for example. (This is not the case with EBCDIC.) Uppercase letters can be converted to lowercase and back by flipping one bit. The codes for the 10 digits are easily derived from the value of the digits. (In BCDIC, the code for the character "0" followed the code for the character "9"!)
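These properties translate directly into code. Here is a minimal C sketch (purely an illustration, and it assumes the character is already known to be a Latin letter or a digit) of the bit-flip case conversion and the digit derivation:

/* A minimal sketch illustrating two properties of ASCII: case
   conversion by flipping bit 0x20, and deriving a digit's numeric
   value from its character code. */

#include <stdio.h>

int main(void)
{
    char ch = 'A';                /* 0x41 */

    /* Flipping bit 0x20 toggles between uppercase and lowercase. */
    char lower = ch ^ 0x20;       /* 'a' (0x61) */
    char upper = lower ^ 0x20;    /* back to 'A' (0x41) */

    /* The digits '0' through '9' are codes 0x30 through 0x39,
       so subtracting '0' yields the numeric value. */
    char digit = '7';
    int  value = digit - '0';     /* 7 */

    printf("%c %c %d\n", lower, upper, value);
    return 0;
}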
Best of all, ASCII is a very dependable standard. No other standard is as prevalent or as ingrained in our keyboards, video displays, system hardware, printers, font files, operating systems, and the Internet.
The big problem with ASCII is indicated by the first word of the acronym. ASCII is truly an American standard, and it isn't even good enough for other countries where English is spoken. Where is the British pound symbol (£), for instance?
English uses the Latin (or Roman) alphabet. Among written languages that use the Latin alphabet, English is unusual in that very few words require letters with accent marks (or "diacritics"). Even for those English words where diacritics are traditionally proper, such as coöperate or résumé, the spellings without diacritics are perfectly acceptable.
But north and south of the United States and across the Atlantic are many countries and languages where diacritics are much more common. These accent marks originally aided in adapting the Latin alphabet to the differences in spoken sounds among these languages. Journey farther east or south of Western Europe, and you'll encounter languages that don't use the Latin alphabet at all, such as Greek, Hebrew, Arabic, and Russian (which uses the Cyrillic alphabet). And if you travel even farther east, you'll discover the ideographic Han characters of Chinese, which were also adopted in Japan and Korea.
The history of ASCII since 1967 is mostly a history of attempts to overcome its limitations and make it more applicable to languages other than American English. In 1967, for example, the International Organization for Standardization (ISO) recommended a variant of ASCII with codes 0x40, 0x5B, 0x5C, 0x5D, 0x7B, 0x7C, and 0x7D "reserved for national use" and codes 0x5E, 0x60, and 0x7E labeled as "may be used for other graphical symbols when it is necessary to have 8, 9, or 10 positions for national use." This is obviously not the best solution to internationalization because there's no guarantee of consistency. But it indicates how desperate people were to successfully code symbols necessary to various languages.
By the time the early small computers were being developed, the 8-bit byte had been firmly established. Thus, if a byte were used to store characters, 128 additional characters could be invented to supplement ASCII. When the original IBM PC was introduced in 1981, the video adapters included a ROM-based character set of 256 characters, which in itself was to become an important part of the IBM standard.
The original IBM extended character set included some accented characters and a lowercase Greek alphabet (useful for mathematics notation), as well as some block-drawing and line-drawing characters. Additional characters were also assigned to the code positions of the ASCII control characters, because the bulk of these control characters were not required.
This IBM extended character set was burned into countless ROMs on video boards and in printers, and it was used by numerous applications to decorate their character-mode displays. However, this character set did not include enough accented letters for all Western European languages that used the Latin alphabet, and it was not quite appropriate for Windows. Windows didn't need line-drawing characters because it had an entire graphics system.
In Windows 1.0 (released in November 1985), Microsoft didn't entirely abandon the IBM extended character set, but it was relegated to secondary importance. The native Windows character set was called the "ANSI character set" because it was based on a draft ANSI and ISO standard, which eventually became ANSI/ISO 8859-1-1987, "American National Standard for Information Processing—8-Bit Single-Byte Coded Graphic Character Sets—Part 1: Latin Alphabet No. 1." This is also known more simply as "Latin 1."
The original version of the ANSI character set as printed in the Windows 1.0 Programmer's Reference is shown in Figure 2-2.
     0-  1-  2-  3-  4-  5-  6-  7-  8-  9-  A-  B-  C-  D-  E-  F-
-0   □   □       0   @   P   `   p   □   □       °   À   Ð   à   ð
-1   □   □   !   1   A   Q   a   q   □   □   ¡   ±   Á   Ñ   á   ñ
-2   □   □   "   2   B   R   b   r   □   □   ¢   ²   Â   Ò   â   ò
-3   □   □   #   3   C   S   c   s   □   □   £   ³   Ã   Ó   ã   ó
-4   □   □   $   4   D   T   d   t   □   □   ¤   ´   Ä   Ô   ä   ô
-5   □   □   %   5   E   U   e   u   □   □   ¥   µ   Å   Õ   å   õ
-6   □   □   &   6   F   V   f   v   □   □   ¦   ¶   Æ   Ö   æ   ö
-7   □   □   '   7   G   W   g   w   □   □   §   ·   Ç   □   ç   □
-8   □   □   (   8   H   X   h   x   □   □   ¨   ¸   È   Ø   è   ø
-9   □   □   )   9   I   Y   i   y   □   □   ©   ¹   É   Ù   é   ù
-A   □   □   *   :   J   Z   j   z   □   □   ª   º   Ê   Ú   ê   ú
-B   □   □   +   ;   K   [   k   {   □   □   «   »   Ë   Û   ë   û
-C   □   □   ,   <   L   \   l   |   □   □   ¬   ¼   Ì   Ü   ì   ü
-D   □   □   -   =   M   ]   m   }   □   □   -   ½   Í   Ý   í   ý
-E   □   □   .   >   N   ^   n   ~   □   □   ®   ¾   Î   Þ   î   þ
-F   □   □   /   ?   O   _   o   DEL □   □   ¯   ¿   Ï   ß   ï   ÿ
□ indicates a code for which a character is not defined.
Figure 2-2. The Windows ANSI character set (based on ANSI/ISO 8859-1).
The hollow rectangles indicate codes for which characters are not defined. This is close to how ANSI/ISO 8859-1 was ultimately defined. ANSI/ISO 8859-1 shows only graphic characters, not control characters, so it does not define the DEL. In addition, code 0xA0 is defined as a nonbreaking space (which means that it's a space that shouldn't be used to break a line when formatting), and code 0xAD is a soft hyphen (which means that it shouldn't be displayed unless it's used to break a word at the end of a line). Also, ANSI/ISO 8859-1 defines codes 0xD7 as a multiplication sign (×) and 0xF7 as a division sign (÷). Some fonts in Windows also define some of the characters from 0x80 through 0x9F, but these are not part of the ANSI/ISO 8859-1 standard.
MS-DOS 3.3 (released in April 1987) introduced the concept of code pages to IBM PC users, a concept that was also carried over to Windows. A code page defines a mapping of character codes to characters. The original IBM character set became known as code page 437, or "MS-DOS Latin US." Code page 850 is "MS-DOS Latin 1," which replaces some of the line-drawing characters with additional accented letters (but which is not the Latin 1 ISO/ANSI standard shown in Figure 2-2 above). Other code pages were defined for other languages. The lower 128 codes are always the same; the higher 128 codes depend on the language for which the code page is defined.
Under MS-DOS, if a user sets the PC's keyboard, video display, and printer to a specific code page and then creates, edits, and prints documents on the PC, all will be well. Everything's consistent. However, if the user attempts to exchange documents with another user using a different code page or to change the code page on the machine, problems will result. Character codes are associated with the wrong characters. Applications can save code page information with documents in an attempt to reduce problems, but this strategy involves some work in converting between code pages.
Although code pages originally provided only additional characters of the Latin alphabet beyond the unaccented characters, eventually code pages were devised where the higher 128 characters contained complete non-Latin alphabets, such as Hebrew, Greek, and Cyrillic. Such variety makes code page mix-ups potentially worse, of course; it's one thing if a few accented letters appear incorrect and quite another if an entire text is an incomprehensible jumble.
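To make the hazard concrete: the byte 0xE0 is the letter à under the Windows Latin 1 code page (1252) but a lowercase Cyrillic a under the Windows Cyrillic code page (1251). The following minimal sketch (assuming a Windows build environment; the particular byte and code page numbers are chosen just for illustration) uses the Windows MultiByteToWideChar function, which converts a byte to the corresponding Unicode character, to show the same byte coming out as two different characters:

/* A minimal sketch showing how one byte maps to different
   characters under different code pages. */

#include <windows.h>
#include <stdio.h>

int main(void)
{
    char  byte = (char) 0xE0;    /* one byte from some document */
    WCHAR wch;

    /* Interpreted under code page 1252 (Windows Latin 1)... */
    MultiByteToWideChar(1252, 0, &byte, 1, &wch, 1);
    printf("Code page 1252: U+%04X\n", wch);    /* U+00E0, a with grave */

    /* ...and under code page 1251 (Windows Cyrillic). */
    MultiByteToWideChar(1251, 0, &byte, 1, &wch, 1);
    printf("Code page 1251: U+%04X\n", wch);    /* U+0430, Cyrillic a   */

    return 0;
}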
Code pages proliferated beyond all reason. Just to keep everyone on their toes, the MS-DOS code page 855 for Cyrillic is not the same as either the Windows code page 1251 for Cyrillic or the Macintosh code page 10007 for Cyrillic. Code pages in each environment are modifications of the standard character set for the environment. IBM OS/2 also supports a variety of EBCDIC code pages.
But wait. It gets worse.
So far we've been looking at character sets of 256 characters. But the ideographic symbols of Chinese, Japanese, and Korean number about 21,000. How can these languages be accommodated while still maintaining some kind of compatibility with ASCII?
The solution (if that's the right word for it) is the double-byte character set (DBCS). A DBCS starts off with 256 codes, just like ASCII. Like any well-behaved code page, the first 128 of these codes are ASCII. However, some of the codes in the higher 128 are always followed by a second byte. The two bytes together (called a lead byte and a trail byte) define a single character, usually a complex ideograph.
Although Chinese, Japanese, and Korean share many of the same ideographs, obviously the languages are different and often the same ideograph in the three different languages will represent three different things. Windows supports four different double-byte character sets: code page 932 (Japanese), 936 (Simplified Chinese), 949 (Korean), and 950 (Traditional Chinese). DBCS is supported in only the versions of Windows that are manufactured for these countries.
The problem with a double-byte character set is not that characters are represented by 2 bytes. The problem is that some characters (in particular, the ASCII characters) are represented by 1 byte. This creates odd programming problems. For example, the number of characters in a character string cannot be determined by the byte size of the string. The string has to be parsed to determine its length, and each byte has to be examined to see if it's the lead byte of a 2-byte character. If you have a pointer to a character somewhere in the middle of a DBCS string, what is the address of the previous character in the string? The customary solution is to parse the string starting at the beginning up to the pointer!
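Here, as a minimal sketch (assuming a Windows build environment), is the kind of byte-by-byte scan this implies; the Windows IsDBCSLeadByte function reports whether a byte begins a two-byte character in the system's current code page:

/* A minimal sketch that counts the characters in a DBCS string by
   walking it byte by byte. */

#include <windows.h>

int DbcsCharCount(const char * pString)
{
    int count = 0;

    while (*pString != '\0')
    {
        /* A lead byte and its trail byte together form one character;
           the extra check guards against a truncated string. */
        if (IsDBCSLeadByte((BYTE) *pString) && pString[1] != '\0')
            pString += 2;
        else
            pString += 1;

        count++;
    }
    return count;
}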
The basic problem we have here is that the world's written languages simply cannot be represented by 256 8-bit codes. The previous solutions involving code pages and DBCS have proven insufficient and awkward. What's the real solution?
As programmers, we have experience with problems of this sort. If there are too many things to be represented by 8-bit values, we try wider values, perhaps 16-bit values. (Duh.) And that's the ridiculously simple concept behind Unicode. Rather than the confusion of multiple 256-character code mappings or double-byte character sets that have some 1-byte codes and some 2-byte codes, Unicode is a uniform 16-bit system, thus allowing the representation of 65,536 characters. This is sufficient for all the characters and ideographs in all the written languages of the world, including a bunch of math, symbol, and dingbat collections.
Understanding the difference between Unicode and DBCS is essential. Unicode is said to use (particularly in the context of the C programming language) "wide characters." Each character in Unicode is 16 bits wide rather than 8 bits wide. Eight-bit values have no meaning in Unicode. In contrast, in a double-byte character set we're still dealing with 8-bit values. Some bytes define characters by themselves, and some bytes indicate that another byte is necessary to completely define a character.
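In C, wide characters correspond to the wchar_t type, which is 16 bits wide under Windows (the C standard itself doesn't fix its size). A minimal sketch of the difference in storage:

/* A minimal sketch of 8-bit versus wide characters in C. On Windows,
   wchar_t is a 16-bit type. */

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    char    a[] = "Hello!";      /* 8-bit characters: 7 bytes including the NUL  */
    wchar_t w[] = L"Hello!";     /* wide characters: 14 bytes including the NUL
                                    when wchar_t is 16 bits                      */

    printf("strlen: %u  sizeof: %u\n",
           (unsigned) strlen(a), (unsigned) sizeof(a));
    printf("wcslen: %u  sizeof: %u\n",
           (unsigned) wcslen(w), (unsigned) sizeof(w));

    return 0;
}

Both strings contain six characters, but the wide-character version occupies twice the memory.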
Whereas working with DBCS strings is quite messy, working with Unicode text is much like working with regular text. You'll probably be pleased to learn that the first 128 Unicode characters (16-bit codes 0x0000 through 0x007F) are ASCII, while the second 128 Unicode characters (codes 0x0080 through 0x00FF) are the ISO 8859-1 extensions to ASCII. Various blocks of characters within Unicode are similarly based on existing standards. This is to ease conversion. The Greek alphabet uses codes 0x0370 through 0x03FF, Cyrillic uses codes 0x0400 through 0x04FF, Armenian uses codes 0x0530 through 0x058F, and Hebrew uses codes 0x0590 through 0x05FF. The ideographs of Chinese, Japanese, and Korean (referred to collectively as CJK) occupy codes 0x3000 through 0x9FFF.
The best thing about Unicode is that there's only one character set. There's simply no ambiguity. Unicode came about through the cooperation of virtually every important company in the personal computer industry and is code-for-code identical with the ISO 10646-1 standard. The essential reference for Unicode is The Unicode Standard, Version 2.0 (Addison-Wesley, 1996), an extraordinary book that reveals the richness and diversity of the world's written languages in a way that few other documents have. In addition, the book provides the rationale and details behind the development of Unicode.
Are there any drawbacks to Unicode? Sure. Unicode character strings occupy twice as much memory as ASCII strings. (File compression helps a lot to reduce the disk space differential, however.) But perhaps the worst drawback is that Unicode is still not widely used. As programmers, we have our work cut out for us.