
3.4. Other 8-bit Codes

There is a large number of 8-bit encodings, including HP Roman-8, KOI8-R (for Russian), and many others. A few of them are discussed below.

In general, full conversions between 8-bit character codes are not possible. For example, the Macintosh character repertoire contains the Greek letter pi (π), which does not exist in ISO Latin 1 at all. Naturally, a text can be converted (by a simple program that uses a conversion table) from Macintosh character code to ISO 8859-1 if the text contains only those characters that belong to the ISO Latin 1 character repertoire.
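Such a table-driven conversion can be sketched with Python's built-in codec tables, which implement both encodings (this is an illustration, not part of the original text):

```python
# Convert text from Mac Roman to ISO 8859-1 via Python's codec tables.
# The conversion succeeds only for characters in both repertoires.
mac_bytes = "déjà vu".encode("mac_roman")   # text stored in Mac Roman
text = mac_bytes.decode("mac_roman")        # back to abstract characters
latin1_bytes = text.encode("latin_1")       # fine: all characters exist in Latin 1

try:
    "π".encode("latin_1")                   # pi is not in ISO 8859-1
except UnicodeEncodeError:
    print("pi cannot be represented in ISO 8859-1")
```

Note that the conversion goes through abstract characters (Unicode code points in Python), which is exactly what a conversion table expresses.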

If a document needs to contain, say, both French and Greek (in Greek letters), then no existing 8-bit code would be suitable. Such codes might contain accented characters needed in French, or Greek letters, but not both. It would be impractical to define new codes for every possible combination of characters you might need, and often impossible due to the limitation to a total of 256 code points.

Hence, it is natural to ask whether it should be possible to switch between encodings within a file. For example, could you use ISO 8859-1 for the French text, and then switch to ISO 8859-7 for Greek text, and back to ISO 8859-1? Such ideas have been developed, but their use is much more limited than one might think.

The standard ISO 2022 (and the equivalent ECMA-35) defines a general framework for switching between 8-bit codes (and other codes). One of the basic ideas is that code positions 128 through 159 (decimal) are reserved for use as control codes (C1 controls). Some of those codes are used for switching (shifting) purposes, to specify that subsequent data is in a different encoding. Note that the Windows character sets do not fit well into this scheme, since they use codes in that range for printable characters. The standard is rather complex, and only parts of it have been implemented and used. It is used particularly for East Asian languages, such as Japanese, which uses several writing systems. However, even for such purposes, Unicode offers a more uniform approach.
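The conflict between the Windows character sets and the C1 range can be demonstrated directly (a small illustration using Python's codecs, not part of the original text):

```python
# In ISO 8859-1, bytes 0x80-0x9F decode to C1 control characters,
# which ISO 2022 reserves for switching purposes. windows-1252 instead
# assigns most of these positions to printable characters.
b = bytes([0x93])                # one byte in the C1 range
as_latin1 = b.decode("latin_1")  # U+0093, an (invisible) control character
as_cp1252 = b.decode("cp1252")   # U+201C, the left double quotation mark
print(repr(as_latin1), repr(as_cp1252))
```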

3.4.1. DOS Code Pages

In MS DOS systems, different character codes are used; they are called "code pages." The original American code page was CP 437, which includes some Greek letters, mathematical symbols, and characters that can be used as elements in simple pseudo-graphics. Later, CP 850 became popular, since it contains the letters needed for Western European languages, largely the same letters as ISO 8859-1, but in different code positions. Note that DOS code pages are quite different from Windows character codes, although the latter are sometimes referred to by names like cp-1252 (same as windows-1252)! For further confusion, Microsoft now prefers the term OEM code page for the DOS character set used in a particular country.

The registered names of DOS code pages as encodings (for use on the Internet) have no space or hyphen: cp437, cp850, etc. They have alias names like IBM437, IBM850, etc., because of the once important role of IBM in the PC market and in development.
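The names, aliases, and differing code positions can be checked with Python's codec registry, which supports these code pages (an illustration, not part of the original text):

```python
import codecs

# cp850 and its alias IBM850 name the same encoding.
assert codecs.lookup("cp850").name == codecs.lookup("IBM850").name

# The same byte decodes differently under different DOS code pages:
b = bytes([0xE9])
print(b.decode("cp437"))     # Greek capital theta in CP 437
print(b.decode("cp850"))     # U with acute in CP 850

# CP 850 has largely the same letters as ISO 8859-1,
# but in different code positions:
print("é".encode("cp850"))   # b'\x82'
print("é".encode("latin_1")) # b'\xe9'
```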

In character-encoding menus in Save dialogs, web browsers, etc., you can often see entries like "Cyrillic (DOS)." They refer to DOS code pages designed for particular cultural environments. In that sense, they correspond to the Windows codes mentioned earlier. Otherwise, DOS and Windows code pages can be quite different, in the allocation of code numbers and even in the character repertoire.

Even in modern Windows systems, the command-line user interface (DOS window) still typically uses some DOS code page, so if you try to view a text file there (using the type command, for example), you'll probably get odd results: the data, which is most likely in some Windows encoding, will be interpreted according to another encoding.
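The kind of odd results you would see can be simulated by reinterpreting the bytes (a sketch; the exact garbage depends on which code pages are involved):

```python
# A file saved in windows-1252 and displayed in a console that assumes
# CP 850: each byte is reinterpreted under the wrong encoding, so the
# accented letters turn into unrelated characters.
data = "crème brûlée".encode("cp1252")
garbled = data.decode("cp850")   # CP 850 defines all 256 byte values
print(garbled)                   # ASCII letters survive; è and û do not
```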

DOS code pages should not normally be used for new data, but there is a lot of existing data in such encodings. The main reason for getting acquainted with DOS code pages is finding out how to convert from them to some other encodings. It is not always trivial to identify what the encoding really is, since there are several DOS code pages with similar names and different versions of the code pages.

Detailed information on DOS code pages is available as code page-to-Unicode mapping tables at http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/.

The use of the coding space is rather different in the DOS and Windows encodings, except for the range 0 through 7F (hexadecimal), which follows the ASCII tradition. Figure 3-1 shows the "upper halves" of Windows and DOS encodings designed for Central/Eastern Europe. There are differences in the character repertoires, e.g., due to the presence of various drawing characters. Most strikingly, the allocation of characters is almost completely different.

3.4.2. Mac Encodings

On Macintosh (Mac) computers, there has been less variation in character codes than on Windows PCs. However, much like Windows code pages, there are several codes for different languages and language groups. They can now be called "legacy encodings," since the Mac world is moving to Unicode.

The most widely known legacy encoding is Mac Roman, which is a combination of ASCII, accented letters, mathematical symbols, and other ingredients. The general idea is similar to that of ISO 8859-1 and windows-1252, but the repertoires are different. At http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/, you can find cross-mapping tables from Mac Roman as well as other legacy encodings to Unicode.

The original Mac Roman code is presented visually in Figure 3-2. Code positions 0 through 1F (hexadecimal) are not shown there; they (as well as position 7F) are assigned to control characters, though partly in a manner different from that used in the ASCII context. As you can see, code positions 20 through 7F (the first six rows in the figure) are the same as in ASCII. The same applies to other legacy encodings, with few exceptions.

As you can see, the Mac Roman character set contains several punctuation marks and mathematical symbols that are not present in ISO 8859-1. On the other hand, it lacks the following ISO 8859-1 characters: multiplication sign ×; superscripts ¹, ², and ³; vulgar fractions ¼, ½, and ¾; broken vertical bar ¦; y with acute (ý and Ý); Icelandic letters eth (ð, Ð) and thorn (þ, Þ); and the soft hyphen. Moreover, in the modern version, Mac Roman has the euro sign € instead of the currency sign ¤.

Thus, perfect conversion between Mac Roman and ISO 8859-1 (or windows-1252) is generally not possible. It can of course be performed if the text contains only characters that belong to both encodings.

Figure 3-1. Windows Latin 2 and DOS Latin 2 (characters in code positions 80 through FF in hexadecimal)


In fact, Mac Roman contains a character in position F0, too (the grayed first cell on the last row of the table in Figure 3-2). It is the stylized apple that is used as the symbol of the Apple company, called "Apple logo." Unicode does not include symbols of companies and trademarks, so the mapping tables map the character to U+F8FF, which is the last code point in the Private Use area, to be used by "private agreement" only.
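Python's Mac Roman codec follows the published mapping table, so the Apple logo mapping can be observed directly (an illustration, not part of the original text):

```python
# Byte 0xF0 in Mac Roman is the Apple logo; the mapping table sends it
# to U+F8FF, the last code point of the BMP Private Use Area.
ch = bytes([0xF0]).decode("mac_roman")
print(hex(ord(ch)))                  # 0xf8ff
print(0xE000 <= ord(ch) <= 0xF8FF)   # inside the Private Use Area
```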

Modern Mac computers can use a wider character repertoire, but there is still Mac software that is limited to the Mac Roman encoding. This is one of the main reasons for saying that the character repertoire of ISO 8859-1 is not absolutely universally supported yet.

Mac OS X uses Unicode as its primary character code. Legacy encodings are supported either directly, in a limited manner, in some programs, or through the Mac OS Text Encoding Converter or other conversion software. For more information, consult the document "Background information on Unicode mapping tables for Mac OS legacy text encodings," which is available at the following site: http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/Readme.txt.

Figure 3-2. Mac Roman encoding, code positions 20 to FF (hexadecimal)

3.4.3. EBCDIC

The EBCDIC code was defined by IBM, and it was once in widespread use on large "mainframe" computers but has lost relative importance. EBCDIC exists in different national variants, and due to its nature as a vendor-defined code, EBCDIC lacked rigorous definitions.

EBCDIC deviates from most 8-bit codes in basic structure. It contains all ASCII characters, but in quite different code positions. Another peculiarity is that in EBCDIC, the normal letters A through Z do not all appear in consecutive code positions. They are in alphabetic order, but with gaps. The original reason for this was related to punched card technology. EBCDIC has been the most important practical reason why it is incorrect (even in the limited context of the English language) to test whether a character is a letter simply by checking that it is in the range A through Z or a through z by comparing code numbers.
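The gaps are easy to see with Python's CP 037 codec (an illustration, not part of the original text):

```python
# In the CP 037 variant of EBCDIC, A-Z are in alphabetic order but not
# contiguous, so a range check by code number gives wrong answers.
codes = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".encode("cp037")
print([hex(c) for c in codes])
# 'A' through 'I' occupy 0xC1-0xC9, but 'J' starts at 0xD1:
# a gap of several non-letter positions lies between them.
print(hex(codes[8]), hex(codes[9]))   # 0xc9 0xd1
```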

For example, the CP 037 version of EBCDIC, as defined by the cross mapping table at http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/, is shown in Figure 3-3. Code positions 0 through 3F (hexadecimal) are not shown; they (as well as

Figure 3-3. EBCDIC CP 037 characters with code positions 40 to FE (hexadecimal) are the same as the characters of ISO 8859-1, but in a very different order


position FF) are assigned to control characters, though partly in a manner different from that used in the ASCII context. Code positions 40 and 41, appearing as blank here, have been allocated to space U+0020 and no-break space U+00A0, respectively.

3.4.4. The Cyrillic KOI8 Encodings

For languages written in Cyrillic letters, such as Russian and Ukrainian, several 8-bit encodings have been developed. We have already mentioned ISO 8859-5 (Latin/Cyrillic) and windows-1251 (Windows Cyrillic), and there is also DOS Cyrillic and Mac Cyrillic. However, along with windows-1251, the most widely used encoding for Russian is KOI8-R (the letter "R" stands for Russian). There are also other versions of KOI8.

The KOI8 encodings assign code positions 0 through 7F as in ASCII and place Cyrillic letters and other characters in the "upper half." As you can see from Figure 3-4, KOI8-R contains a large number of drawing characters. Its repertoire of letters covers only (modern) Russian and a few other languages. In contrast, Windows Cyrillic has many more Cyrillic letters, giving a wider coverage of languages.

Comparing code positions C0 through FF (hexadecimal) in the two encodings, i.e., the last four rows of the tables in Figure 3-4, we notice that they have different schemes for allocating the basic Cyrillic letters. Even if you don't know the Cyrillic alphabet, you can probably see that Windows Cyrillic has uppercase letters first, and then lowercase, whereas KOI8-R has them the other way around. In KOI8-R, the letters are not in the Russian alphabetic order but placed so that if the most significant bit of each octet is lost, the text turns into a coarse transliteration with the case of letters reversed: Cyrillic "а" becomes Latin "A," Cyrillic "б" becomes Latin "B," Cyrillic "ц" becomes Latin "C," etc.
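The bit-stripping behavior can be demonstrated in a few lines (an illustration, not part of the original text):

```python
# Stripping the most significant bit of each octet of KOI8-R text
# yields an ASCII transliteration with the letter case reversed.
koi8 = "мир".encode("koi8_r")            # Russian for "peace", lowercase
stripped = bytes(b & 0x7F for b in koi8)
print(stripped.decode("ascii"))          # uppercase Latin transliteration
```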

Figure 3-4. Windows Cyrillic and KOI8-R (code positions 80 through FF hexadecimal)


This implies that if you have Russian text in Windows Cyrillic and your program interprets it according to KOI8-R, or vice versa, words still resemble Russian but in an oddly distorted way. Uppercase becomes lowercase, and vice versa, and with a shift of one position. This is comparable to having "abcdef" munged to "BCDEFG," and such things have actually happened, e.g., in Usenet discussions in the Russian-language relcom.* groups, because some people post their messages in KOI8-R, some in Windows Cyrillic, and they might use software that does not include information about the encoding. Modern software can usually handle either encoding, but only if the encoding is properly declared. The situation is not as bad as you might guess, since nowadays most people post in KOI8-R in those groups. If your software does not use that encoding as the default, you probably need to change its settings in order to read relcom.* groups.
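The distortion is easy to reproduce by decoding KOI8-R bytes as windows-1251 (an illustration, not part of the original text):

```python
# Russian text in KOI8-R misinterpreted as windows-1251: the result
# still looks vaguely Russian, but the case is swapped and the letters
# are displaced, just like "abcdef" turning into "BCDEFG".
data = "мир".encode("koi8_r")    # lowercase Russian word
print(data.decode("cp1251"))     # comes out as uppercase, shifted letters
```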

This illustrates the point (to be elaborated on in Chapter 10) that the multitude of encodings is not a problem as such, as long as there is adequate information about what the encoding is. It is a better approach than trying to make everyone use the same encoding.

Figure 3-5. Samples of Wingdings fonts


3.4.5. Ad Hoc "8-bit Codes" Defined by Fonts

There is a theoretically quite unsatisfactory, yet widely used method of working with characters: using font settings to extend the character repertoire. To take a simple example, use a text-processing program and type the letters abc, and then select them and choose the Symbol font from a font menu. You will probably see the Greek letters αβχ. It seems that this way you can switch between different 8-bit codes, if you have suitable fonts containing various sets of characters. In web authoring, you could achieve a similar effect by using markup like <font face="Symbol">abc</font>. (The Appendix contains a table of Symbol font glyphs and their Unicode equivalents.)

This approach may look conceptually simple, and it has often been practically successful, when you just needed some characters on paper, or perhaps on screen. However, it is quite inadequate for any operations where font information may get lost, or ignored. For example, if viewed on a system without the Symbol font, the data in the example in the last paragraph would appear just as "abc." The same happens if the font is changed for some reason, not to mention any operations of saving and sending data as plain text. When data is entered into a database, for example, font information will hardly be saved. A web browser can be configured or instructed to ignore font suggestions on web pages.

Still, the approach can be useful in special circumstances, such as working with some repertoire of uncommon characters. For example, in phonetics, people have often used a special 8-bit font that contains a collection of phonetic (IPA) characters. Although the material is then unreadable without that font (or a comparable tool), things have worked reasonably well within a community that knows what is needed. Similarly, for some languages with a relatively small repertoire of characters, an 8-bit font might be designed and distributed as a quick way of making it possible to use the language.

There are some graphic symbols, such as Wingdings symbols, that cannot be effectively used except via a font-based approach. Figure 3-5 shows some symbols that can be produced by applying Wingdings fonts to the text "abcdef." Although some Wingdings symbols have been encoded in Unicode (e.g., as Dingbats), many of them are essentially small decorative drawings rather than characters for writing texts.

Similarly, if you wish to use some "private" characters, such as special characters designed for use within a community, the use of a special font is a simple way to achieve this. If you used the characters just to create a printed fantasy book, it would not matter that nobody else has your special font. It is possible, but more complicated, to use "private" characters in Unicode: there is a large block of code points reserved for that purpose.

This approach has been used for many languages, especially in circumstances where programs cannot be expected to support anything other than 8-bit encodings. Whenever you see a statement like "you need the ... font for viewing this document," the odds are that some strictly font-based approach is used. When Unicode or some other standardized encoding is used, you are not limited to any particular font; any font that contains the characters will do.

Conceptually, the approach discussed here means that you implicitly define a character by the design of a font. If you put the letter alpha (α) into the code position that is occupied by the letter "a" in ASCII, i.e., 61 (hexadecimal), you are in the process of defining a character code where that position is allocated for the alpha. However, you rely on the use of a special font, which logically corresponds to a character code conversion.



Unicode Explained
ISBN: 059610121X
Year: 2006