A Closer Look at Unicode

As you've seen by now, with the emergence of the Unicode Standard you can consistently handle English text as well as East Asian and other major scripts of the world. This ability is essential when dealing with an increasingly global market. The following sections will first give you a very brief look at Unicode's origin, and will then show you Unicode's capabilities. You'll also gain a deeper understanding of Unicode-aware functions. The discussion will then turn more technical, with specific details on how to create Win32 Unicode applications, as well as how to use encodings in Web pages, in the .NET Framework, and, finally, in console or text-mode programs.

Unicode's Capabilities

Glossary


  • Base character: An encoding code point that does not graphically combine with preceding characters, and that is neither a control nor a format character. The Latin "a" is an example of a base character.
  • Combining character: A character that graphically combines with a preceding base character. The combining acute accent mark (U+0301) is an example of a combining character.
  • Precomposed character: A character that is equivalent to a sequence of one or more other characters. It is also known as a "composed character" or a "composite character." Thus the combining character sequence "a" + combining acute accent (U+0301) forms the precomposed, composed, or composite character "á".
  • ISO 10646: The International Organization for Standardization's encoding that is code-for-code equivalent to Unicode.
  • Private Use Area: The areas in the Unicode repertoire from U+E000 through U+F8FF, U+F0000 through U+FFFFD, and U+100000 through U+10FFFD that are set aside for vendor-specific or user-defined characters.
  • Contextual analysis: A process for determining how to handle text based on surrounding characters, as in Arabic, in which a glyph changes shape depending on its position in a word.
  • Glyph: The actual shape (bit pattern, outline, and so forth) of a character image. For example, an italic "a" and a roman "a" are two different glyphs representing the same underlying character.
  • Logical order: The order in which text is typed. Generally refers to text that might be displayed in a different order, such as Arabic, Hebrew, or bidirectional text.
  • Neutral character: A character whose directionality (right-to-left or left-to-right) is dependent on the directionality of the characters that surround it. See contextual analysis.
  • Request For Comments (RFC) documents: The written definitions of Internet protocols and policies.
  • Little-endian: A computer architecture that stores multibyte numerical values with the least significant byte values first. On systems using little-endian architecture, the letter "A" (U+0041) is stored as 0x41 0x00.
  • Big-endian: A computer architecture that stores multibyte numerical values with the most significant byte values first. On systems using big-endian architecture, the letter "A" (U+0041) is stored as 0x00 0x41.
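To make the byte-order definitions concrete, here is a minimal C sketch (in the style of the Win32 examples later in this chapter) that prints the stored bytes of the 16-bit code unit for "A" (U+0041). On a little-endian system such as x86 it prints "41 00"; on a big-endian system it prints "00 41".

    #include <stdio.h>

    int main(void)
    {
        unsigned short a = 0x0041;                 /* UTF-16 code unit for "A" */
        unsigned char *bytes = (unsigned char *)&a;

        /* The byte order observed here depends on the host architecture. */
        printf("%02X %02X\n", bytes[0], bytes[1]);
        return 0;
    }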

The complex programming methods required for working with mixed-byte encodings, the involved process of creating new code pages every time another language requires computer support, and the importance of mixing and sharing information in a variety of languages across different systems were some of the factors motivating the creators of the Unicode encoding standard.

Unicode originated through collaboration between Xerox and Apple. An ad hoc committee of several companies then formed, and others, including IBM and Microsoft, rapidly joined. In 1991, this group founded the Unicode Consortium, whose membership now includes several leading Information Technology (IT) companies. (For more information on Unicode, visit the Unicode Consortium's site at http://www.Unicode.org.)

Unicode is an especially good fit for the age of the Internet, since the worldwide nature of the Internet demands solutions that work in any language. The Internet Engineering Task Force (IETF) has recognized this fact and now expects all new RFCs to use Unicode for text. Many other products and standards now require or allow use of Unicode; for example, XML, HTML, Microsoft JScript, Java, Perl, Microsoft C#, and Microsoft Visual Basic 7 (VB.NET). Today, Unicode is the de facto character encoding standard accepted by all major computer companies, while ISO 10646 is the corresponding worldwide de jure standard approved by all ISO member countries. The two standards include identical character repertoires and binary representations.

Unicode encompasses virtually all characters used widely in computers today. It is capable of addressing more than 1.1 million code points. The standard has provisions for 8-bit, 16-bit, and 32-bit encoding forms. The 16-bit encoding is used as the default encoding and allows the million-plus code points to be distributed across 17 "planes," with each plane addressing over 65,000 characters. The characters in Plane 0, commonly called the "Basic Multilingual Plane" (BMP), are used to represent most of the world's written scripts, characters used in publishing, mathematical and technical symbols, geometric shapes, basic dingbats (including all level-100 Zapf Dingbats), and punctuation marks. In addition to the support for characters in modern languages and for the symbols and shapes just mentioned, Unicode also offers coverage for other characters, such as less commonly used Chinese, Japanese, and Korean (CJK) ideographs, Arabic presentation forms, and musical symbols. Many of these additional characters are mapped beyond the original plane using an extension mechanism called "surrogate pairs." With Unicode 3.2, over 95,000 code points have already been assigned characters; the rest have been set aside for future use. Unicode also provides Private Use Areas of over 131,000 locations available to applications for user-defined characters, which typically are rare ideographs representing names of people or places.

Figure 3-6 shows the Unicode encoding layout for the BMP (Plane 0) in abstract form. (To see the entire Unicode code-point allocation, see Appendix G, "Unicode Code-Point Allocation.")

Figure 3-6 Unicode encoding layout for the BMP (Plane 0).

Unicode rules, however, are strict about code-point assignment: each code point has a distinct representation. There are also many cases in which Unicode deliberately does not provide code points. Variants of existing characters are not given separate code points, because doing so would duplicate the encoding of what is fundamentally the same character. Examples are font variants (such as bold and italic) and glyph variants, which are simply different ways of rendering the same character.

For the most part, Unicode defines characters uniquely, but some characters can be combined to form others, such as accented characters. The most common accented characters, which are used in French, German, and many other European languages, exist in their precomposed forms and are assigned code points. These same characters can be expressed by combining a base character with one or more nonspacing diacritic marks. For example, "a" followed by a nonspacing accent mark is displayed as "à." Nonspacing accent marks make it possible to have a large set of accented characters without assigning them all distinct code points. This is useful for representing accented characters in written languages that are less widely used, such as some African languages. It's also useful for creating a variety of mathematical symbols. The precomposed characters are encoded in the Unicode Standard primarily for compatibility with other encodings. The Unicode Standard contains strict rules for determining the equivalence of precomposed characters to combining character sequences. (See Figure 3-7.) The Win32 API function FoldStringW maps multiple combining characters into precomposed forms. Also, MultiByteToWideChar can be used with either the MB_PRECOMPOSED or the MB_COMPOSITE flags for mapping characters to their precomposed or composite forms.
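As a brief illustration of these APIs, the following minimal Win32 sketch folds the combining sequence "a" + U+0301 into its precomposed equivalent "á" (U+00E1) with FoldStringW; error handling is omitted for brevity.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* "a" (U+0061) followed by the combining acute accent (U+0301). */
        WCHAR composite[] = L"\x0061\x0301";
        WCHAR precomposed[8] = {0};

        /* MAP_PRECOMPOSED folds combining sequences into precomposed forms. */
        int cch = FoldStringW(MAP_PRECOMPOSED, composite, -1, precomposed, 8);
        if (cch > 0)
            wprintf(L"U+%04X\n", precomposed[0]);  /* prints U+00E1 */
        return 0;
    }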

For all its advantages, the Unicode Standard is far from a panacea for internationalization. The code-point positions of Unicode elements do not imply a sort order, and Unicode does not encode font information. These rules are defined by the operating system; Win32-based applications, for instance, need to obtain sorting and font information from the operating system.

In addition, basing your software on the Unicode Standard is only one step in the internationalization process. You still need to write code that adapts to cultural preferences or language rules. (For more information on other globalization considerations, see Chapter 4, "Locale and Cultural Awareness," Chapter 5, "Text Input, Output, and Display," and Chapter 6, "Multilingual User Interface [MUI].")

As a further caveat, not all Unicode-based text processing is a matter of simple character-by-character parsing. Complex text-based operations such as hyphenation, line breaking, and glyph formation need to take into account the context in which they are being used (the relation to surrounding characters, for instance). The complexity of these operations hinges on language rules and has nothing to do with Unicode as an encoding standard. Instead, the software implementation should define a higher-level protocol for handling these operations.

In contrast, there are unusual characters that have very specific semantic rules attached to them; these characters are detailed in The Unicode Standard. Some characters always allow a line break (for example, most spaces), whereas others never do (for instance, nonspacing or nonbreaking characters). Still other characters, including many used in Arabic and Hebrew, are defined as having strong or weak text directionality. The Unicode Standard defines an algorithm for determining the display order of bidirectional text, and it also defines several "directional formatting codes" as overrides for cases not handled by the implicit bidirectional ordering rules, to help create comprehensible bidirectional text. These formatting codes allow characters to be stored in logical order but displayed appropriately depending on their directionality. Neutral characters, such as punctuation marks, assume the directionality of the strong or weak characters nearby. Formatting codes can also be used to delineate embedded text or to specify the directionality of characters. (For more information on displaying bidirectional Unicode-based text, see The Unicode Standard.)

Figure 3-7 Precomposed and composite characters.

Unicode's Functionality

Glossary


  • Plaintext: Computer-encoded text that contains only code elements and no other formatting or structural information (for example, font size, font type, or other layout information). Plaintext exchange is commonly used between computer systems that might have no other way to exchange information.
  • Wide character: A character encoded by a wchar_t or a 16-bit (WORD) data type. Often used to refer to UTF-16-encoded characters.

You've now seen some of the capabilities that Unicode offers. The sections that follow delve deeper into Unicode's functions to provide helpful information as you work with Unicode Standards and encodings. For instance, what is the function of byte-order marks (BOMs)? What are surrogate pairs, and how do they extend the encoding space from 65,000 characters to more than 1 million additional characters? These and other questions will be explored in the following sections.

Transformations of Unicode Code Points

There are several techniques for representing each Unicode code point in binary format. Each of the following techniques uses a different mapping to represent unique Unicode characters. The Unicode encodings are:

  • UTF-8: To meet the requirements of byte-oriented and ASCII-based systems, UTF-8 has been defined by the Unicode Standard. Each character is represented in UTF-8 as a sequence of up to 4 bytes, where the first byte indicates the number of bytes to follow in a multibyte sequence, allowing for efficient string parsing. UTF-8 is commonly used in transmission via Internet protocols and in Web content.
  • UTF-16: This is the 16-bit encoding form of the Unicode Standard, where characters are assigned a unique 16-bit value, with the exception of characters encoded by surrogate pairs, which consist of a pair of 16-bit values. The Unicode 16-bit encoding form is identical to the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) transformation format UTF-16. In UTF-16, any character with a code point up to 65,535 is encoded as a single 16-bit value; characters with code points above 65,535 are encoded as pairs of 16-bit values. (For more information on surrogate pairs, see "Surrogate Pairs" later in this chapter.) UTF-16 little-endian is the encoding standard at Microsoft (and in the Windows operating system).
  • UTF-32: Each character is represented as a single 32-bit integer.
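To see how the first two forms relate in practice, the following minimal Win32 sketch converts a short UTF-16 string to UTF-8 with WideCharToMultiByte and back with MultiByteToWideChar. (The kanji U+9662 is the character used in the UTF-8 example later in this section.)

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        WCHAR utf16[] = L"A\x9662";    /* "A" plus a kanji character (U+9662) */
        char  utf8[16] = {0};
        WCHAR back[16] = {0};

        /* UTF-16 -> UTF-8. For CP_UTF8 the last two arguments must be NULL. */
        WideCharToMultiByte(CP_UTF8, 0, utf16, -1, utf8, sizeof(utf8), NULL, NULL);

        /* UTF-8 -> UTF-16 round trip. */
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, back, 16);

        /* Prints the UTF-8 bytes: 41 for "A", then E9 99 A2 for the kanji. */
        for (unsigned char *p = (unsigned char *)utf8; *p; p++)
            printf("%02X ", *p);
        printf("\n");
        return 0;
    }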

Figure 3-8 shows two characters encoded in both code pages and Unicode, using UTF-16 and UTF-8.

Figure 3-8 The character "A" and a kanji character encoded in code pages and in Unicode with both UTF-16 and UTF-8.

Since UTF-8 is so commonly used in Web content, it's helpful to know how Unicode code points get mapped into this encoding without reintroducing the parsing hassles of MBCS text. Table 3-3 shows the relationship between Unicode code points and a UTF-8-encoded character. The starting byte of a multibyte sequence in a UTF-8-encoded character tells how many bytes are used to encode that character. All the following bytes start with the bit pattern "10", and the x's denote the bits of the code point's binary representation within the given range.

Table 3-3 Relationship between Unicode code points and a UTF-8-encoded character. In UTF-8, the first byte indicates the number of bytes to follow in a multibyte-encoded sequence.

Unicode Range              UTF-8-Encoded Bytes
0x00000000-0x0000007F      0xxxxxxx
0x00000080-0x000007FF      110xxxxx 10xxxxxx
0x00000800-0x0000FFFF      1110xxxx 10xxxxxx 10xxxxxx
0x00010000-0x001FFFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Using this approach, the kanji character shown in Figure 3-8 (U+9662), once encoded in UTF-8, becomes 0xE9 0x99 0xA2, which in binary is 11101001 10011001 10100010.
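The mapping in Table 3-3 is simple enough to implement directly. The following C sketch (a hypothetical helper written for illustration, not a library routine) encodes a single code point of up to U+1FFFFF into UTF-8 and reproduces the 0xE9 0x99 0xA2 sequence above.

    #include <stdio.h>

    /* Encode one code point as UTF-8 per Table 3-3; returns the byte count. */
    int EncodeUtf8(unsigned long cp, unsigned char out[4])
    {
        if (cp <= 0x7F) {                 /* 0xxxxxxx */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp <= 0x7FF) {         /* 110xxxxx 10xxxxxx */
            out[0] = 0xC0 | (unsigned char)(cp >> 6);
            out[1] = 0x80 | (unsigned char)(cp & 0x3F);
            return 2;
        } else if (cp <= 0xFFFF) {        /* 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = 0xE0 | (unsigned char)(cp >> 12);
            out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
            out[2] = 0x80 | (unsigned char)(cp & 0x3F);
            return 3;
        } else {                          /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            out[0] = 0xF0 | (unsigned char)(cp >> 18);
            out[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
            out[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
            out[3] = 0x80 | (unsigned char)(cp & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        unsigned char bytes[4];
        int i, n = EncodeUtf8(0x9662, bytes);  /* the kanji discussed above */
        for (i = 0; i < n; i++)
            printf("%02X ", bytes[i]);         /* prints E9 99 A2 */
        printf("\n");
        return 0;
    }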

Byte-Order Marks

Another concept to be familiar with as you work with Unicode is that of byte-order marks. A BOM is used to indicate how a processor places serialized text into a sequence of bytes. If the least significant byte is placed in the initial position, this is referred to as "little-endian," whereas if the most significant byte is placed in the initial position, the method is known as "big-endian." A BOM can also be used as a signature to identify the encoding of a text file. Notepad, for example, adds a BOM to the beginning of each file, depending on the encoding used in saving the file; this signature allows Notepad to identify the encoding when it reopens the file later. Table 3-4 shows the byte-order marks for various encodings. The UTF-8 BOM identifies the encoding format rather than the byte order of the document, since in UTF-8 each character is represented by a sequence of bytes and byte order is not significant.

Table 3-4 Binary representation of the byte-order mark (U+FEFF) for specific encodings.

Encoding                 Encoded BOM
UTF-16 big-endian        FE FF
UTF-16 little-endian     FF FE
UTF-8                    EF BB BF
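Reading these signatures back is straightforward. The following minimal C sketch (the file name is hypothetical) examines the first bytes of a file and reports which of the BOMs from Table 3-4, if any, it begins with.

    #include <stdio.h>

    int main(void)
    {
        unsigned char buf[3] = {0};
        size_t n;
        FILE *f = fopen("sample.txt", "rb");   /* hypothetical file name */
        if (!f) return 1;
        n = fread(buf, 1, 3, f);
        fclose(f);

        if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
            printf("UTF-8\n");
        else if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
            printf("UTF-16 big-endian\n");
        else if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
            printf("UTF-16 little-endian\n");
        else
            printf("No BOM found\n");
        return 0;
    }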

Surrogate Pairs

With the Unicode 16-bit encoding system, over 65,000 characters can be encoded (2^16 = 65,536). However, the total number of characters that need to be encoded has actually exceeded that limit (mainly to accommodate the CJK extension of characters). To find additional room for new characters, the developers of the Unicode Standard introduced the notion of surrogate pairs. With surrogate pairs, a Unicode code point from the range U+D800 through U+DBFF (called a "high surrogate") is combined with another Unicode code point from the range U+DC00 through U+DFFF (called a "low surrogate") to form a whole new character, allowing the encoding of over 1 million additional characters. Unlike MBCS bytes, high and low surrogates have no interpretation of their own when they do not appear as part of a surrogate pair, which avoids one of the major challenges of lead-byte and trail-byte processing of MBCS text.
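The arithmetic behind surrogate pairs follows directly from these ranges: subtract 0x10000 from the code point, store the high 10 bits in the high surrogate, and store the low 10 bits in the low surrogate. A minimal sketch, using U+10400 (a Deseret letter from Plane 1) as the example:

    #include <stdio.h>

    int main(void)
    {
        unsigned long cp = 0x10400;   /* a supplementary-plane code point */

        /* Split the code point into a high and a low surrogate. */
        unsigned short hi = (unsigned short)(0xD800 + ((cp - 0x10000) >> 10));
        unsigned short lo = (unsigned short)(0xDC00 + ((cp - 0x10000) & 0x3FF));
        printf("U+%05lX -> %04X %04X\n", cp, hi, lo);   /* D801 DC00 */

        /* Recombine the pair into the original code point. */
        unsigned long back = 0x10000
            + ((unsigned long)(hi - 0xD800) << 10) + (lo - 0xDC00);
        printf("%04X %04X -> U+%05lX\n", hi, lo, back);
        return 0;
    }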

For the first time, in Unicode 3.1, characters are encoded beyond the original 16-bit code space, or BMP (Plane 0). These new characters, encoded at code positions of U+10000 or higher, are synchronized with the international standard ISO/IEC 10646-2. In addition to the two Private Use Areas, Plane 15 (U+F0000 through U+FFFFD) and Plane 16 (U+100000 through U+10FFFD), Unicode 3.1 and 10646-2 define three new supplementary planes:

  • Supplementary Multilingual Plane (SMP), with code positions from U+10000 through U+1FFFF
  • Supplementary Ideographic Plane (SIP), with code positions from U+20000 through U+2FFFF
  • Supplementary Special-purpose Plane (SSP), with code positions from U+E0000 through U+EFFFF

The SMP, or Plane 1, contains several historic scripts and several sets of symbols: Old Italic, Gothic, Deseret, Byzantine Musical Symbols, (Western) Musical Symbols, and Mathematical Alphanumeric Symbols. Together these comprise 1,594 newly encoded characters. The SIP, or Plane 2, contains a very large collection of additional unified Han ideographs known as "CJK Extension B," comprising 42,711 characters, as well as 542 additional CJK Compatibility ideographs. The SSP, or Plane 14, contains a set of 97 tag characters.

Windows Support for Unicode

You've now learned more about the benefits and capabilities that Unicode offers, in addition to looking more closely at its functionality. You might also be wondering about the extent to which Windows supports Unicode's features. Microsoft Windows NT 3.1 was the first major operating system to support Unicode, and since then Microsoft Windows NT 4, Microsoft Windows 2000, and Microsoft Windows XP have extended this support, with Unicode being their native encoding. In fact, when you run a non-Unicode application on them, the operating system converts the application's text internally to Unicode before any processing is done. The operating system then converts the text back to the expected code-page encoding before passing the information back to the application.

In addition, Windows XP supports a majority of the Unicode code points with the fonts, keyboard drivers, and other system files necessary for the input and display of content in all supported languages. Once again, the fundamental representation of text in Windows NT-based operating systems is UTF-16, and the WCHAR data type is a UTF-16 code unit. Windows does provide interfaces for other encodings in order to be backward-compatible, but it converts such text to UTF-16 internally. The system also provides interfaces to convert between UTF-16 and UTF-8 and to inquire about the basic properties of a UTF-16 code point (for example, whether it is a letter, a digit, or a punctuation mark). Since Microsoft Windows 95, Microsoft Windows 98, and Windows Me are not Unicode-based, they provide only a small subset of the Unicode support available in the Windows NT-based versions of Windows. Thus by working with Unicode and Windows NT-based operating systems, you are one step closer to the goal of creating world-ready applications. The remaining sections will show you practical techniques and examples for creating Win32 Unicode applications, as well as tips for using encodings in Web pages, in the .NET Framework, and in console or text-mode programming.
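As a closing illustration of the property interfaces mentioned above, this minimal Win32 sketch uses GetStringTypeW to ask whether each UTF-16 code unit in a short string is a letter, a digit, or punctuation.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        WCHAR text[] = L"A1.";
        WORD types[3];
        int i;

        /* CT_CTYPE1 retrieves basic character classifications. */
        if (GetStringTypeW(CT_CTYPE1, text, 3, types)) {
            for (i = 0; i < 3; i++)
                printf("'%lc':%s%s%s\n", text[i],
                       (types[i] & C1_ALPHA) ? " letter" : "",
                       (types[i] & C1_DIGIT) ? " digit" : "",
                       (types[i] & C1_PUNCT) ? " punctuation" : "");
        }
        return 0;
    }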


