3.2. ISO 8859 Codes

ISO 8859 (or more formally, ISO/IEC 8859) is a family of character code standards. They were largely developed by Ecma, which distributes ECMA standards that are equivalent to ISO 8859 standards. ISO 8859 standards are largely oriented toward languages of European origin.

ISO 8859 codes are widely used on different platforms and in different contexts. For example, on the Web, ISO 8859-1 was long treated as the default encoding. On Windows, ISO 8859 as such is not used that much, but the corresponding, somewhat extended Windows encodings are common. In Unix and Linux, ISO 8859 is very common.

Each ISO 8859 standard tries to address the needs of one or more specific languages and cultural environments, within the fairly narrow framework of an 8-bit structure. This means that in most cases, you cannot represent multilingual text using any single ISO 8859 encoding.
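As a rough illustration of this limitation, the following Python sketch tries every ISO 8859 part on a piece of text and reports which parts can encode it. The helper name encodable_parts and the sample strings are made up for this example. The codec names are the ones accepted by Python's standard library, which is only one possible implementation of these encodings.

```python
# Parts of ISO 8859 for which Python ships a codec (there is no part 12).
ISO_8859_PARTS = [n for n in range(1, 17) if n != 12]


def encodable_parts(text):
    """Return the ISO 8859 part numbers whose encoding can represent text.

    Hypothetical helper for illustration only.
    """
    parts = []
    for n in ISO_8859_PARTS:
        try:
            text.encode(f"iso-8859-{n}")
        except UnicodeEncodeError:
            continue  # at least one character is missing from this part
        parts.append(n)
    return parts


# Escapes are used so the samples do not depend on the file's own encoding.
german = "Gr\u00fc\u00dfe"                          # German greeting
greek = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"      # "Greece" in Greek
russian = "\u041c\u043e\u0441\u043a\u0432\u0430"    # "Moscow" in Russian
mixed = f"{german}, {greek}, {russian}"

for sample in (german, greek, russian, mixed):
    print(sample, "->", encodable_parts(sample))
```

Each monolingual sample fits at least one part, but the mixed string fits none, since no single part contains Latin accented letters, Greek letters, and Cyrillic letters together.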

3.2.1. ISO 8859-1 (ISO Latin 1)

The international standard ISO 8859-1 defines a character repertoire identified as Latin alphabet No. 1, commonly called ISO Latin 1, as well as a character code for it. The repertoire contains the ASCII repertoire as a subset, and the code numbers for those characters are the same as in ASCII. The standard also specifies an encoding, which is similar to that of ASCII: each code number is presented simply as one octet.
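The one-octet-per-character encoding can be checked with a short Python sketch. It relies on Python's built-in "iso-8859-1" codec, and the sample word is an arbitrary choice for this example.

```python
# "cafe" with a precomposed e-acute (code number 233, hexadecimal E9).
text = "caf\u00e9"
data = text.encode("iso-8859-1")

# One octet per character, so the two lengths are equal.
print(len(text), len(data))

# Each octet holds the character's code number directly.
for ch, octet in zip(text, data):
    print(repr(ch), ord(ch), octet)

# ASCII characters keep their ASCII code numbers.
ascii_part = "cafe"
same_as_ascii = ascii_part.encode("iso-8859-1") == ascii_part.encode("ascii")
print(same_as_ascii)
```

Because the code numbers coincide with octet values, no lookup table is needed to convert between the two in this encoding.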

In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western and Northern Europe, and some special characters. These characters occupy code positions 160 to 255, and they are, in code number order and rendered in a monospace font:

  ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ - ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

On the first row, the first character is the so-called no-break space, which corresponds to the ASCII space character but disallows line breaks in text formatting. The third-to-last character on the first row is the soft hyphen character, which either has no graphic appearance or looks the same as the ASCII hyphen character.
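Since these two characters are hard to tell apart visually from their ASCII counterparts, it can help to inspect them by code position. The following is a minimal Python sketch that decodes positions 160 and 173 and prints their Unicode names. The variable names are illustrative only.

```python
import unicodedata

# First position of the first row, and the third-to-last one.
nbsp = bytes([160]).decode("iso-8859-1")
shy = bytes([173]).decode("iso-8859-1")

print(unicodedata.name(nbsp))  # NO-BREAK SPACE
print(unicodedata.name(shy))   # SOFT HYPHEN

# Neither is the same character as its ASCII look-alike.
print(nbsp == " ", shy == "-")
```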

The standard mentions that ISO 8859-1 was designed to cover the needs of the following languages: Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish. It also covers Albanian and some non-European languages, such as Indonesian/Malay, Tagalog, Swahili, and Afrikaans.

3.2.2. Names of Encodings

A character encoding may have several names, even several official names. This is illustrated in Table 3-2, which summarizes some names of ISO 8859-1. Names of encodings are often written with a hyphen instead of a space (e.g., ISO-8859-1) or sometimes with an underscore (low line), e.g., ISO_8859-1. This is because in Internet protocols (see Chapter 10), the character encoding needs to be specified by a name that does not contain spaces. However, each context has its own rules for accepted names. Generally, encoding names are case insensitive: iso-8859-1 is the same as ISO-8859-1.
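Python's codec registry is one such context with its own rules, and it can serve as a small illustration. The sketch below looks up several spellings and collects the canonical names that Python reports. The list of spellings is chosen for this example, and the behavior shown is that of Python's standard codecs module, not of the IANA registry or of any particular Internet protocol.

```python
import codecs

# Spellings that differ in case and in the separator used.
names = ["ISO-8859-1", "iso-8859-1", "ISO_8859-1", "ISO8859-1", "latin1"]

# All of them resolve to the same codec, so the set has a single element.
canonical = {codecs.lookup(name).name for name in names}
print(canonical)

# A descriptive name used in some software's menus is not a codec name here.
try:
    codecs.lookup("West European (ISO)")
    menu_name_accepted = True
except LookupError:
    menu_name_accepted = False
print(menu_name_accepted)
```

Other environments may accept a different set of names, so a name that works in one place should not be assumed to work in another.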

Table 3-2. Names of the ISO 8859-1 standard and encoding

Name                   Context of use
---------------------  ---------------------------------------------------------------
ISO/IEC 8859-1:1998    Official name of a particular version of the standard
ISO/IEC 8859-1         Official name of the standard in general
ISO 8859-1             Commonly used name of the standard and the encoding
ISO-8859-1             Preferred MIME name of the encoding (e.g., in Internet headers)
ISO_8859-1             An alternate MIME name (among others)
ISO8859-1              Unofficial, unregistered name used in some contexts
Latin alphabet No. 1   Official name of the character repertoire (in the standard)
Latin 1                Common name of the encoding and repertoire
ISO Latin 1            Another common name, to distinguish from Windows Latin 1
West European (ISO)    A name for the encoding, used in some software


3.2.3. Other ISO 8859 Codes

ISO 8859-1 is a member of the ISO 8859 family of character codes, which extends the ASCII repertoire in different ways with different special characters, for the purposes of different languages and cultures. Just as ISO 8859-1 contains ASCII characters and a collection of characters needed in languages of Western and Northern Europe, there is ISO 8859-2 alias ISO Latin 2 constructed similarly for languages of Central/Eastern Europe, etc. The ISO 8859 character codes are isomorphic in the following sense: code positions 0 to 127 contain the same characters as in ASCII, positions 128 to 159 are unused (reserved for control characters), and positions 160 to 255 are the varying part (often called "the upper half"), used differently in different members of the ISO 8859 family.
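The shared lower half and the varying upper half can be observed with a short Python sketch. It decodes one octet from the upper half (228, hexadecimal E4, an arbitrary choice) under a few parts, and then checks that positions 0 to 127 decode identically under every part for which Python has a codec.

```python
# One octet from the upper half, interpreted under different parts.
octet = bytes([0xE4])
upper_half = {n: octet.decode(f"iso-8859-{n}") for n in (1, 2, 5, 7)}
for n, ch in upper_half.items():
    print(f"ISO 8859-{n}: {ch} (U+{ord(ch):04X})")

# Positions 0 to 127 are the same as ASCII in every part.
ascii_octets = bytes(range(128))
ascii_text = ascii_octets.decode("ascii")
lower_half_shared = all(
    ascii_octets.decode(f"iso-8859-{n}") == ascii_text
    for n in range(1, 17)
    if n != 12  # there is no part 12
)
print(lower_half_shared)
```

The same octet comes out as a Latin letter in parts 1 and 2, a Cyrillic letter in part 5, and a Greek letter in part 7, which is why the encoding must be known before the upper half can be interpreted.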

The ISO 8859 character codes use the obvious encoding: each code position is represented as one octet. Such encodings have several alternative names in the official registry of character encodings, but the preferred ones are of the form ISO-8859-n.

Although ISO 8859-1 has been a de facto default encoding in many contexts, it has in principle no special role. ISO 8859-15 alias ISO Latin 9 was expected to replace ISO 8859-1 to a great extent, since it contains the politically important symbol for euro (€), but it has gained relatively little practical use. Old software does not recognize it, and new software supports Unicode encodings, which give a much wider repertoire of characters.
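The difference that the euro sign makes can be sketched in Python as follows. The example uses Python's built-in codecs, and the variable names are illustrative.

```python
euro = "\u20ac"  # the euro sign

# ISO 8859-15 assigns the euro sign to position 164 (hexadecimal A4).
in_latin9 = euro.encode("iso-8859-15")
print(in_latin9)

# The same octet read as ISO 8859-1 is the generic currency sign instead.
as_latin1 = in_latin9.decode("iso-8859-1")
print(as_latin1)

# ISO 8859-1 has no position for the euro sign at all.
try:
    euro.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
print(latin1_has_euro)

# A Unicode encoding such as UTF-8 represents it without any such choice.
print(euro.encode("utf-8"))
```

Data labeled as ISO 8859-1 but actually written as ISO 8859-15 (or the reverse) therefore shows the wrong symbol at this position, which is one practical cost of having two near-identical encodings.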

Table 3-3 lists the ISO 8859 alphabets. Note that ISO 8859-n is Latin alphabet No. n (or ISO Latin n for short) for n = 1, 2, 3, 4, but this correspondence is broken for the other Latin alphabets. For any newly approved or proposed ISO 8859 standards, check the page http://anubis.dkuug.dk/jtc1/sc2/ (official home of ISO/IEC JTC 1/SC 2, the international standardization subcommittee for coded character sets).

Table 3-3. ISO 8859 character codes

Standard      Name of alphabet          Characterization                     ECMA standard
------------  ------------------------  -----------------------------------  -------------
ISO 8859-1    Latin alphabet No. 1      "Western," "West European"           94
ISO 8859-2    Latin alphabet No. 2      "Central/East European"              94
ISO 8859-3    Latin alphabet No. 3      (For Maltese and Esperanto)          94
ISO 8859-4    Latin alphabet No. 4      "North European," "Baltic"           94
ISO 8859-5    Latin/Cyrillic alphabet   (For some Slavic languages)          113
ISO 8859-6    Latin/Arabic alphabet     (For the Arabic language)            114
ISO 8859-7    Latin/Greek alphabet      (For modern Greek)                   118
ISO 8859-8    Latin/Hebrew alphabet     (For Hebrew and Yiddish)             121
ISO 8859-9    Latin alphabet No. 5      "Turkish"                            128
ISO 8859-10   Latin alphabet No. 6      "Nordic" (Sámi, Inuit, Icelandic)    144
ISO 8859-11   Latin/Thai alphabet       (For the Thai language)
(There is no part 12; it was planned to cover Devanagari, but the idea was abandoned.)
ISO 8859-13   Latin alphabet No. 7      Baltic Rim
ISO 8859-14   Latin alphabet No. 8      Celtic
ISO 8859-15   Latin alphabet No. 9      "Euro" variant of ISO 8859-1
ISO 8859-16   Latin alphabet No. 10     "South-Eastern European"


Ecma International has defined ECMA standards that have the same content as some ISO 8859 standards, as indicated in the table. For example, ECMA-94 defines Latin alphabets 1 through 4, equivalent to ISO 8859-1 through ISO 8859-4. The ECMA standards are available via http://www.ecma-international.org/publications/standards/Standard.htm.

For a tabular summary of the coverage of European languages by the different ISO Latin codes, refer to http://www.cs.tut.fi/~jkorpela/8859.html. The languages are listed in each standard, but the coverage is somewhat debatable. In particular, ISO Latin codes usually do not contain characters needed for correct punctuation of languages, even English.
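The remark about punctuation can be illustrated with a small Python sketch that tests a few typographic punctuation marks used in English against ISO Latin 1 and ISO Latin 9. The selection of characters and the helper name missing_from are assumptions made for this example, and the check is deliberately limited to these two encodings; other parts of ISO 8859 may differ.

```python
# A few punctuation marks used in well-typeset English text.
PUNCTUATION = {
    "\u2018": "left single quotation mark",
    "\u2019": "right single quotation mark",
    "\u201c": "left double quotation mark",
    "\u201d": "right double quotation mark",
    "\u2013": "en dash",
    "\u2014": "em dash",
}


def missing_from(encoding):
    """Return the names of the sample characters that encoding lacks.

    Hypothetical helper for illustration only.
    """
    missing = []
    for ch, name in PUNCTUATION.items():
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(name)
    return missing


for encoding in ("iso-8859-1", "iso-8859-15"):
    print(encoding, "lacks:", missing_from(encoding))
```

Text in these encodings has to fall back on the ASCII apostrophe, quotation mark, and hyphen for all of these.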



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
