Character Sets and Unicode


What follows is a brief introduction to the character sets you are likely to encounter as you move around the Internet and the computing world. It is not meant to be comprehensive, merely to present the basics of what you will see. For those wishing to learn more, there are many web sites with extremely detailed treatments of this topic.

ASCII

When computers were first developed, one of the primary requirements was the ability to map digital codes into printable characters. Older coding systems existed, but none were quite suited to the binary nature of the computer. With this in mind, the American Standards Association announced the American Standard Code for Information Interchange, more commonly known as ASCII, in 1963. This was a 7-bit character set containing, in addition to all lower- and uppercase Latin letters used in the (American) English alphabet, numbers, punctuation marks, quotation marks, and currency symbols.
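
This mapping is easy to see from PHP, using the standard chr() and ord() functions, which convert between small integer codes and single characters:

    <?php
    // A quick sketch of the ASCII mapping: code 65 is 'A', and
    // the character 'a' has code 97.
    echo chr(65), "\n";   // prints: A
    echo ord('a'), "\n";  // prints: 97
    ?>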

Unfortunately, this system proved particularly ill-suited to the needs of Western European countries, whose requirements ranged from the British pound symbol (£), to accents and other markings for French and Spanish, to new letters and ligatures (two letters combined, such as æ), to completely different letters, as in modern Greek. In other parts of Europe, users of the Cyrillic, Armenian, or Hebrew/Yiddish alphabets also found themselves left in the dark.

The ISO 8859 Character Sets

Fortunately, most modern computers store data in 8-bit bytes, and the 7-bit ASCII characters left the high bit of each byte free, using it at most as a parity bit (for verifying data integrity) rather than for useful information. The obvious next step was to start using the upper 128 slots available in an 8-bit byte for additional characters.

What resulted over time was the ISO 8859 series of character sets (also known as code pages). ISO 8859-1 defined the "Latin Alphabet No. 1," or Latin-1, set of characters that covered the vast majority of Western European languages. Over the years, the 8859 series of code pages has grown to 14 (8859-11 is proposed for Thai, and 8859-12 remains unassigned), with 8859-15, created in 1999, being the Latin-1 (8859-1) code page with the euro symbol (€) added. Other code pages cover the Slavic languages of Eastern Europe, Cyrillic alphabets such as Russian's, Hebrew, and Turkish.

The benefit of these code pages is that they remain compatible with the old ASCII character set, sharing the same lower 128 characters and control codes. There were some slightly modified implementations of these character sets, most notably the default code pages in some versions of the Microsoft Windows operating system. The Windows variant of Latin-1 came to be known as the windows-1252 code page, or simply cp-1252, for the English language (windows-1251 covers Russian, windows-1254 Turkish, and so on). The Apple Macintosh also has a slightly modified Latin-1 code page.
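
You can see both the compatibility and the modifications with a minimal sketch using PHP's iconv() function (assuming the iconv extension is available): the byte 0xE9 decodes to the same character in Latin-1 and windows-1252, while 0x80, a control code in ISO 8859-1, is the euro sign in windows-1252.

    <?php
    // The byte 0xE9 means 'é' in both ISO-8859-1 and windows-1252:
    echo iconv('ISO-8859-1', 'UTF-8', "\xE9"), "\n";   // é
    echo iconv('windows-1252', 'UTF-8', "\xE9"), "\n"; // é

    // But 0x80, an unprintable control code in ISO-8859-1, is the
    // euro sign in windows-1252:
    echo iconv('windows-1252', 'UTF-8', "\x80"), "\n"; // €
    ?>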

Far Eastern Character Sets

It turns out that 256 character codes are not enough to handle the needs of East Asian languages with large character repertoires, such as Chinese, Japanese, or Korean. (Korean actually uses a syllabary, in which larger units are built from individual letters, but computers still have to store the large number of possible syllabic units.) To handle these, many different solutions were developed over the years, many originating in Japan, where the need arose first.

Initially, an 8-bit character set was created that encoded the Latin letters, some symbols, and the characters of the Japanese katakana alphabet (one of the four writing systems used in Japanese). This is called "8-bit JIS" (Japanese Industrial Standards). Afterward came a character code system called "Old JIS," which was superseded by "New JIS." Both are multi-byte character sets that use special byte sequences called escape sequences to switch between 8-bit and 16-bit encodings. In addition to the Japanese phonetic alphabets, these codes also included the Latin and Cyrillic alphabets and the more commonly used Chinese characters (kanji) used in modern Japan.
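
The escape-sequence mechanism is easy to observe. A short sketch, assuming PHP's mbstring extension is available: converting a Japanese string to ISO-2022-JP (the modern descendant of "New JIS") produces a byte stream whose escape sequences, each beginning with the ESC byte (0x1b), switch between ASCII and the 16-bit JIS character set.

    <?php
    // Convert a UTF-8 Japanese string to ISO-2022-JP and dump the
    // raw bytes so that the escape sequences become visible.
    $jis = mb_convert_encoding('日本語', 'ISO-2022-JP', 'UTF-8');
    echo bin2hex($jis), "\n";
    // The output begins with 1b2442 (ESC $ B: switch to 16-bit JIS)
    // and ends with 1b2842 (ESC ( B: switch back to ASCII).
    ?>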

A slight variation on this was invented by Microsoft Corporation: "Shift-JIS," or S-JIS, also known as DBCS (Double-Byte Character Set). This scheme specifies that if certain bits in the first byte are set, a second byte follows, and the two together specify which character to use, thus avoiding the separate escape sequence mechanism. This meant a reduced number of possible characters could fit into the 16 bits available in a multi-byte code, because certain bits were reserved to mark a character as two bytes instead of one. However, the tradeoff was felt to be worthwhile on older 8- and 16-bit computer systems, where space and performance were at a premium.
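
A minimal sketch of how this works in practice (the byte ranges below are the standard Shift-JIS lead-byte ranges; the function names are ours):

    <?php
    // In Shift-JIS, a byte in the ranges 0x81-0x9F or 0xE0-0xEF
    // marks the start of a two-byte character; other bytes stand
    // alone as single-byte characters.
    function sjis_is_lead_byte(int $byte): bool
    {
        return ($byte >= 0x81 && $byte <= 0x9F)
            || ($byte >= 0xE0 && $byte <= 0xEF);
    }

    // Count the characters in a Shift-JIS string by walking its bytes.
    function sjis_char_count(string $s): int
    {
        $count = 0;
        for ($i = 0; $i < strlen($s); $i++) {
            if (sjis_is_lead_byte(ord($s[$i]))) {
                $i++;  // skip the trailing byte of a two-byte character
            }
            $count++;
        }
        return $count;
    }
    ?>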

Over the years, similar systems have been created to encode the various forms of Chinese and Korean. In addition to cryptically named standards covering Simplified Chinese (written in mainland China), such as GB 2312-80, there are standards such as Big-5 for Traditional Chinese (written in Taiwan) and UHC (Unified Hangul Code) for Korean. All these systems have strengths and weaknesses, and many do not fully and properly encode the complete set of characters available in these languages (particularly Chinese).

Unicode

As people started to understand the limitations of the various character sets, and as the need for fully globalized computer applications grew, initiatives were launched to develop a character set that could encode every language. Two such efforts were started in the late 1980s; the one called Unicode (from "Universal Code") eventually came to dominate and is kept synchronized with the parallel ISO standard, ISO 10646.

The initial Unicode standard proposed that all the characters in the world be encoded into 16-bit, two-byte sequences, with the first 128 slots remaining fully compatible with the old ASCII characters. In addition to the Latin alphabets and their variants, support was included for other writing systems, such as Armenian, Greek, Thai, Bengali, Arabic, Chinese, Japanese, and Korean.

Unfortunately, 16 bits is not enough to encode the characters found in Chinese, Japanese, and Korean, which number well in excess of 70,000. The Unicode Consortium's initial approach was to try to consolidate the characters of the three languages and eliminate "redundant" ones, but this would clearly have prevented computers from encoding ancient texts and the names of places and people in those countries.

Therefore, a 32-bit version of Unicode was introduced for the cases in which 16 bits are not sufficient. This encoding reserves space not only for modern, living languages, but also for dead ones. Newer versions of the standard provide maximal flexibility in how text is stored, permitting not only 16-bit and 32-bit character streams, but also single-byte streams.

Unicode Encodings

Next, we must look at how Unicode is transmitted over the Internet, stored on your computer, and sent in HTML or XML, all of which still typically traffic in streams of single bytes. A number of encodings exist for this purpose; the most common ones you will see or hear of are:

  • UTF-7 This encodes all Unicode characters using only 7-bit characters, preserving most of the regular ASCII characters and reserving a character or two to introduce encoded sequences for all the others.

  • UTF-8 This encodes the full ASCII character set in the first 128 slots and then uses a non-trivial variable-length scheme to encode the remaining Unicode characters in as many as four bytes (the original design allowed up to six); see the sketch after this list. This encoding is heavily favored over UTF-7 for single-byte Unicode transmission.

  • UTF-16 This encodes Unicode characters into 16-bit words. Originally envisioned as a fixed-size character set, it now supports pairs of words (surrogate pairs) to correctly handle the full range of characters that Unicode encompasses. Fortunately, the majority of commonly used characters still fall within the first 16 bits.

  • UTF-32 This encodes each Unicode character into a 32-bit double word (often referred to as a "DWORD"). Because every Unicode character fits in a single DWORD, this is a simple fixed-width encoding, trading extra space for uniformity.
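
To make UTF-8's variable-length scheme concrete, here is a minimal sketch in PHP (the function name is ours) that encodes a single Unicode code point into its UTF-8 byte sequence:

    <?php
    // Encode one Unicode code point as UTF-8: one byte for values
    // below 0x80, two below 0x800, three below 0x10000, else four.
    function codepoint_to_utf8(int $cp): string
    {
        if ($cp < 0x80) {                 // 7 bits:  0xxxxxxx
            return chr($cp);
        } elseif ($cp < 0x800) {          // 11 bits: 110xxxxx 10xxxxxx
            return chr(0xC0 | ($cp >> 6))
                 . chr(0x80 | ($cp & 0x3F));
        } elseif ($cp < 0x10000) {        // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
            return chr(0xE0 | ($cp >> 12))
                 . chr(0x80 | (($cp >> 6) & 0x3F))
                 . chr(0x80 | ($cp & 0x3F));
        }                                 // 21 bits: 11110xxx + three trailing bytes
        return chr(0xF0 | ($cp >> 18))
             . chr(0x80 | (($cp >> 12) & 0x3F))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }

    echo bin2hex(codepoint_to_utf8(0x20AC)), "\n"; // e282ac: the euro sign, €
    ?>

Note how every ASCII code point comes out as a single unchanged byte, which is exactly why UTF-8 remains compatible with older software that expects byte streams.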



