3.6. Encodings for East Asian Languages

The languages written in East Asia pose special problems for character encoding, since their writing systems use, in part, a very large character repertoire. Before considering the problems of characters of Chinese origin, we discuss the modern writing system of Vietnamese, which is just about manageable with 8-bit codes.

The encodings discussed in this section use different approaches to the problem of representing a large repertoire of characters as sequences of octets. We will not consider their technical details or the choice between them in this book. For detailed information on such matters, consult CJKV Information Processing by Ken Lunde (O'Reilly).


3.6.1. Vietnamese 8-bit Codes

The Vietnamese language is nowadays written in Latin letters but with several diacritic marks, including multiple marks on a single letter. For example, the name of Vietnam in Vietnamese is "Việt Nam" (note that the "e" has both a circumflex above it and a dot below it).
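The two marks on such a letter can be made visible with Python's standard unicodedata module. The following is only an illustrative sketch; the helper name describe_marks is invented for this example and is not part of any library.

```python
import unicodedata


def describe_marks(char):
    """Return the Unicode names of the code points in the
    canonical decomposition (NFD) of a single character."""
    decomposed = unicodedata.normalize("NFD", char)
    return [unicodedata.name(code_point) for code_point in decomposed]


# U+1EC7 is the letter "ệ" that appears in "Việt".
for name in describe_marks("\u1ec7"):
    print(name)
```

The decomposition consists of the base letter "e" followed by two combining marks, one for the dot below and one for the circumflex.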

One reason for this is that Vietnamese is a tonal language: the tone (e.g., falling versus rising tone) of a syllable is important and often makes a difference in meaning. Quite often, the tone, indicated by a diacritic mark, is the only thing that distinguishes between words.

In texts in English and other Western languages, it is common to omit all or most of the diacritic marks in Vietnamese names. They are difficult to produce and difficult to preserve in data transmission and processing. However, at least in Vietnamese itself, it would be inappropriate to omit the diacritic marks.

Due to the number of extra characters needed, the ISO 8859 model is not suitable for Vietnamese. Various 8-bit character codes have been developed for it; the most common of them are TCVN, VISCII, VPS, and windows-1258 ("Windows Vietnamese").

VISCII (described in RFC 1456) uses almost all code points in the hexadecimal range 20-FF for printable characters, and it even allocates some points in the 0-1F range to printable characters. Thus, although it has the range 20-7F allocated as in ASCII, it's not a pure extension of ASCII.

Windows-1258 is not very different from ISO-8859-1 but uses some code points for combining diacritic marks. Thus, to write ệ, for example, you would write ê followed by a combining dot below, instead of using a single code point for the character.
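The effect can be tried out with the cp1258 codec that ships with Python, which implements windows-1258. This is a minimal sketch, not a recommendation for handling Vietnamese text; the variable names are chosen for this example only.

```python
import unicodedata

precomposed = "\u1ec7"      # "ệ" as a single Unicode code point
combining = "\u00ea\u0323"  # "ê" followed by COMBINING DOT BELOW

# The two-character form maps to two octets in windows-1258.
encoded = combining.encode("cp1258")
print(encoded.hex())

# The single code point has no octet of its own in this encoding.
try:
    precomposed.encode("cp1258")
    fits_directly = True
except UnicodeEncodeError:
    fits_directly = False
print(fits_directly)

# After decoding, Unicode normalization (NFC) restores the single code point.
roundtrip = unicodedata.normalize("NFC", encoded.decode("cp1258"))
print(roundtrip == precomposed)
```

A program that writes windows-1258 therefore has to split such letters into a base character and a combining mark before encoding, and a program that reads it may want to normalize the text afterward.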

3.6.2. Encodings for Chinese

The traditional Chinese writing system uses thousands of ideographic characters. In the 20th century, a simplified version of the writing system was developed in the People's Republic of China, using simpler forms for the characters. It is called "Simplified Chinese" as opposed to "Traditional Chinese." Thus, the difference between the two is in the writing system, rather than the language as a whole, although these alternatives often appear in a menu for language choice.

Either variant of the writing system can be encoded in different ways. For example, in Mozilla, the menu of encodings contains the following options:

  • Chinese Simplified (GB18030)

  • Chinese Simplified (GB2312)

  • Chinese Simplified (GBK)

  • Chinese Simplified (HZ)

  • Chinese Simplified (ISO-2022-CN)

  • Chinese Traditional (Big5)

  • Chinese Traditional (Big5-HKSCS)

  • Chinese Traditional (EUC-TW)

The names in parentheses refer to specific encodings. The abbreviation "GB" refers to Chinese words that mean national standard in the People's Republic of China. The abbreviation "Big5" refers to an agreement on character encoding by five large companies in Taiwan's computer industry.

For our purposes in this book, it is sufficient to know that several different encodings for Chinese are in use, and one or another is often strongly preferred by a user or by an organization. The choice may involve political considerations as well. Thus, if you design an application that allows Chinese characters to be entered and shown, it is generally not sufficient to support Unicode alone. You could use some Unicode encoding(s) internally (e.g., in a database), but the input and output operations should be carried out using methods that allow at least some of the specific Chinese encodings to be used as well. This means that the application needs to use character code converters.
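As a rough illustration of what a character code converter does, the following Python sketch decodes octets from one encoding into Unicode and re-encodes them in another, using codecs from Python's standard library. The function name transcode is hypothetical, and the sketch assumes the input is valid in the source encoding. Python's standard library covers GB2312, GBK, GB18030, HZ, Big5, and Big5-HKSCS, but not every encoding in the list above.

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode octets from source_encoding to Unicode text,
    then encode that text in target_encoding."""
    text = data.decode(source_encoding)
    return text.encode(target_encoding)


text = "\u4e2d\u6587"  # the two characters of "中文"

# The same text is represented by different octets in different encodings.
gb_bytes = text.encode("gb2312")
big5_bytes = text.encode("big5")
print(gb_bytes.hex(), big5_bytes.hex())

# Input arrives in a Chinese encoding and is stored internally as UTF-8.
stored = transcode(gb_bytes, "gb2312")

# Output is converted from the internal form to the encoding a user prefers.
delivered = transcode(stored, "utf-8", "big5")
print(delivered == big5_bytes)
```

A real application would also need a policy for characters that the target encoding cannot represent, for example through the errors argument of encode.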

3.6.3. Encodings for Japanese

The Japanese language is written using three different types of characters: kanji characters, which are Japanese versions of Chinese characters, and hiragana and katakana, which are much smaller repertoires of characters and are used to describe pronunciation. Although it is possible to represent hiragana or katakana within an 8-bit code, it is usually culturally unacceptable to restrict the writing of Japanese that way. Normally, Japanese is written using a mixture of the three writing systems, and perhaps with additional characters such as Latin letters, too.

Encodings for Japanese include EUC-JP, ISO-2022-JP, and Shift_JIS. The ISO-2022-JP encoding uses the switching mechanism defined in the ISO 2022 standard, effectively using escape sequences to specify which coded character set is in use at each point. Other encodings use different approaches to the switching problem. Shift_JIS is also called Shift-JIS or SJIS.
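The difference between the approaches shows up when the same short text is encoded in each of the three encodings. The following Python sketch uses the standard library codecs iso2022_jp, euc_jp, and shift_jis; it is meant only as an illustration.

```python
text = "A\u3042"  # Latin "A" followed by hiragana "あ"

for codec in ("iso2022_jp", "euc_jp", "shift_jis"):
    print(codec, text.encode(codec))

iso = text.encode("iso2022_jp")

# ISO-2022-JP switches character sets with escape sequences (ESC is 0x1B)
# and uses only octets below 0x80.
uses_escape = b"\x1b" in iso
seven_bit_only = max(iso) < 0x80

# EUC-JP and Shift_JIS use octets with the high bit set instead.
euc = text.encode("euc_jp")
sjis = text.encode("shift_jis")
print(uses_escape, seven_bit_only)
```

In the ISO-2022-JP output, the hiragana character is preceded by an escape sequence that selects a Japanese character set and followed by one that switches back to ASCII.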

3.6.4. Encodings for Korean

Korean was previously written using characters of Chinese origin; hence the abbreviation "CJK," which refers to Chinese characters in a broad sense, with Chinese, Japanese, and Korean versions. The abbreviation "CJKV" adds the old Vietnamese versions to these.

Nowadays, Korean is mostly written using hangul characters, which were specifically developed for Korean. They constitute a very logical and regular system for writing words phonetically. Hangul has been called an "alphabetic syllabary," since it can be regarded as a system of syllable symbols that consist of letters of an alphabet. The number of letters is comparable in size to the English alphabet, whereas the syllable symbols, as precomposed sequences of letters, constitute a very large set.

If Korean is represented in a form that encodes the letters separately, a program for rendering text needs to recognize how adjacent letters constitute syllables and to show them accordingly. The construction of the written text needs to combine glyphs in specific ways. It is much easier to render Korean text encoded using syllable characters.
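In Unicode, the relationship between separately encoded letters (jamo) and precomposed syllable characters is purely arithmetic. The Python sketch below follows the composition formula given in the Unicode Standard; the function name compose_syllable is made up for this example, and the code assumes well-formed input without validating it.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28


def compose_syllable(lead, vowel, trail=None):
    """Combine a leading consonant, a vowel, and an optional trailing
    consonant (all conjoining jamo) into one precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


jamo = "\u1112\u1161\u11ab"  # HIEUH, A, NIEUN as separate letters
syllable = compose_syllable(*jamo)
print(f"U+{ord(syllable):04X}")

# Unicode normalization performs the same conversion in both directions.
print(unicodedata.normalize("NFC", jamo) == syllable)
print(unicodedata.normalize("NFD", syllable) == jamo)
```

A renderer that receives the three-letter form has to do the equivalent work of recognizing the syllable boundary before it can select and position glyphs.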

Encodings for Korean include EUC-KR, ISO-2022-KR, JOHAB, and UHC.



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
