Chapter 3: Unicode
Imagine a room filled with people gathered from various nations, each speaking a language unintelligible to the others. In the midst of this confusion, however, each person has been charged with communicating a vital message to the others. Dealing with varying encoding standards bears some resemblance to this scenario. Most developers of international programs have at some point experienced frustration when trying to work with character encodings. The mishmash of standards makes it hard for users to share data and for programmers to create world-ready software, but Unicode can help combat this problem.
This chapter will begin by helping you understand what Unicode is, its origin, and its purpose as an encoding method. By first looking at traditional encoding methods and some of the problems associated with them, you'll see why Unicode offers a viable and practical solution when you're dealing with multiple scripts and languages, which of course is the case when creating world-ready software. Later subsections will discuss creating Microsoft Win32 Unicode applications, as well as encoding in Web pages, in the .NET Framework, and in console or text-mode programs.
As discussed in Part I, "Introduction," one of the first tasks connected with creating globalized applications is to write Unicode applications. The advantages of using Unicode are many. (See "Unicode's Capabilities" later in this chapter.) Among them, Unicode solves the issue just mentioned of multiple encoding standards. For instance, some standards are 7-bit; others are 8-bit. Single-byte character sets come in several varieties, as do the double-byte standards (which are also called "multibyte" because they are really a mix of single-byte and double-byte character codes). Passing data between different character encodings across networks, platforms, and operating systems involves a gauntlet of mappings, and plenty of headaches.
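The mapping gauntlet can be made concrete with a short illustration (Python is used here purely for brevity, and the byte value and code pages are chosen as examples): one and the same byte decodes to three different letters under three common Windows code pages, so text exchanged without reliable encoding information is easily corrupted.

```python
# A single byte has no inherent meaning: its interpretation depends
# entirely on which legacy code page the receiver assumes.
raw = bytes([0xE9])

print(raw.decode("cp1252"))  # Western European: é
print(raw.decode("cp1251"))  # Cyrillic:         й
print(raw.decode("cp1253"))  # Greek:            ι
```

Because each legacy code page reuses the same 8-bit range for a different alphabet, the sender's intent cannot be recovered from the bytes alone; Unicode removes this ambiguity by giving every character a single, universal code point.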
In the international arena, the ability to share information from a variety of writing systems in a straightforward manner will be increasingly important, especially for applications such as large databases. Take, for instance, a hypothetical European agency based in Belgium that wants to set up a directory to communicate with its French, Greek, Hungarian, and Russian clients. If the company's only computer runs the French edition of Microsoft Windows Millennium Edition (Me), which is based on Windows code page 1252 for Western European languages, the system cannot represent the Greek and Russian alphabets or certain Hungarian accented characters. Some names will then have to be romanized and others spelled with whatever characters are available. In the past this might have been acceptable, but today it is not. People want their names spelled correctly, and online transactions require them to be spelled consistently. It's difficult to retrieve archived information using a name that has been transliterated in a dozen different ways. With Unicode, it's easy to share and retrieve information from various scripts and among differing operating systems. This is certainly not the case with traditional character encoding, as the next section illustrates.
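The agency's dilemma can be sketched in a few lines (the client names below are hypothetical, and Python is again used only for brevity): code page 1252 rejects the Greek, Russian, and Hungarian names outright, while UTF-8, one of the Unicode encoding forms, stores all of them without loss.

```python
# Hypothetical client names: French, Greek, Russian, and Hungarian.
names = ["Dupont", "Παπαδόπουλος", "Иванов", "Erdős"]

for name in names:
    try:
        name.encode("cp1252")  # Windows code page 1252
        print(f"{name}: representable in code page 1252")
    except UnicodeEncodeError:
        print(f"{name}: NOT representable in code page 1252")

# A Unicode encoding such as UTF-8 round-trips every name intact.
for name in names:
    assert name.encode("utf-8").decode("utf-8") == name
```

Only "Dupont" survives the code-page-1252 test; the Greek and Russian names fail entirely, and "Erdős" fails because the Hungarian ő (U+0151) is not among the code page's accented characters, which is exactly the situation that forces romanization.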