Chapter 17: Unicode Support in Office 2003 Editions | Microsoft Office 2003 Editions Resource Kit (Pro-Resource Kit)

Microsoft Office 2003 Editions provide broad support for Unicode, making it straightforward to share files across language versions of Office 2003 Editions. There are a few limitations to full Unicode support that may be important to note, such as the inability to print scripts or special characters on some printers. In general, however, Unicode helps you easily share documents across languages and different versions of Office.

Unicode Support and Multilingual Office Documents

Sharing documents in a multilingual environment can be challenging when the languages involved span multiple Microsoft Windows code pages. However, using the Unicode character encoding standard overcomes many of these challenges, and in Microsoft Office 2003 Editions, all applications are capable of using Unicode.

Office 2003 Editions provide the conversion tables necessary to convert code page–based data to Unicode and back again for interaction with previous applications. Because Office 2003 Editions provide fonts to support many languages, users can create multilingual documents with text from multiple scripts.

Unicode support in Office 2003 Editions also means that users can copy multilingual text from Office 97 documents and paste it into any Office 2003 Editions document, and the text is displayed correctly. Conversely, multilingual text copied from any Office 2003 Editions document can be pasted into a document created in any Office 97 application (except Microsoft Access).

In addition to document text, Office 2003 Editions support Unicode in other areas, including document properties, bookmarks, style names, footnotes, and user information. Unicode support in Office 2003 Editions also means that you can edit and display multilingual text in dialog boxes. For example, you can search for a file by a Greek author’s name in the Open dialog box. In addition, Microsoft Office Outlook 2003 now supports Unicode throughout the product.

Note

For more information about Unicode support in Outlook 2003, see “Unicode Enhancements in Outlook 2003” in Chapter 6, “Planning an Outlook 2003 Deployment.”

Understanding Unicode

Without Unicode, systems typically use a code page–based environment, in which each script has its own table of characters. Documents based on the code page of one Windows operating system rarely travel well to a Windows operating system that uses another code page. In some cases, the documents cannot contain text that uses characters from more than one script.

For example, if a user running the English version of the Microsoft Windows 98 operating system with the Latin code page opens a plain text file created on a computer running the Japanese version of Windows 98, the code points of the Japanese code page are mapped to unexpected or nonexistent characters in the Western script, and the resulting text is unintelligible.
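The effect described above can be reproduced in a few lines. The following Python sketch is an illustration only, not part of Office or Windows. It assumes that the Japanese system stores text in code page 932 and the English system reads it with code page 1252, and the sample string is hypothetical.

```python
# Hypothetical sample text; any Japanese string shows the same effect.
text = "日本語"

# Bytes as a Japanese version of Windows would store them (assumed: code page 932).
raw = text.encode("cp932")

# The same bytes interpreted with the Western European code page 1252.
# errors="replace" covers byte values that code page 1252 leaves undefined.
garbled = raw.decode("cp1252", errors="replace")

# Reading the bytes with the code page they were written in recovers the text.
restored = raw.decode("cp932")

print(garbled)   # unintelligible Western characters
print(restored)  # the original Japanese text
```

The bytes themselves never change; only the table used to interpret them does.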

The universal character set provided by Unicode overcomes this problem.

The following sections describe how scripts and code pages are used in representing characters in different languages. Understanding scripts and code pages helps provide a foundation for understanding how Unicode facilitates a straightforward way of providing language support.

Scripts

Multilingual documents can contain text in languages that require different scripts. However, a single script can be used to represent many languages.

For example, the Latin or Roman script has character shapes—glyphs—for the 26 letters (both uppercase and lowercase) of the English alphabet, as well as accented (extended) characters used to represent sounds in other Western European languages.

The Latin script also has glyphs to represent all of the characters in most European languages and some non-European languages. Some European languages, such as Greek or Russian, have characters for which there are no glyphs in the Latin script; these languages have their own scripts.

Some Asian languages use ideographic scripts that have glyphs based on Chinese characters. Other languages, such as Thai and Arabic, use scripts that have glyphs that are composed of several smaller glyphs, or glyphs that must be shaped differently depending on adjacent characters. These scripts are referred to as complex scripts in this documentation.

Code pages

A common way to store plain text is to represent each character by using a single byte. The value of each byte is a numeric index—or code point—in a table of characters called a code page. Each code point corresponds to a character in the default code page of the computer on which the text document is created. For example, a byte whose value is decimal 65 represents the capital letter ‘A’ on a computer with Microsoft Code Page 1252 (or Latin 1).
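As a minimal sketch of this lookup, the following Python snippet uses Python's built-in "cp1252" codec as a stand-in for the Windows code page table. The variable names are illustrative.

```python
# One byte whose value is decimal 65.
raw = bytes([65])

# Look the byte up in code page 1252.
letter = raw.decode("cp1252")

# The reverse direction: find the code point for a character.
index = "A".encode("cp1252")[0]

print(letter, index)
```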

For single-byte code pages, each code page contains a maximum of 256 code points because a single byte can hold only 256 distinct values.

A code page with a limit of 256 characters cannot accommodate all languages because all languages together use far more than 256 characters. Therefore, different scripts use separate code pages. There is one code page for Greek, another for Cyrillic, and so on.

In addition, single-byte code pages cannot accommodate Asian languages, which commonly use more than 5,000 Chinese-based characters. Double-byte code pages, in which each character is represented by one or two bytes, were developed to support these languages. (The first 128 characters of double-byte code pages are single-byte code points. Because English text uses only these first 128 characters, it is mapped the same way by virtually all code pages, including double-byte code pages.)
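The mixed one-byte and two-byte layout can be observed directly. This Python sketch assumes code page 932 (Japanese) as the double-byte example; the sample characters are hypothetical.

```python
# An English letter occupies one byte in the double-byte code page 932.
latin = "A".encode("cp932")

# A Chinese-based character occupies two bytes in the same code page.
kanji = "日".encode("cp932")

# The single-byte value for 'A' matches the one in single-byte code pages.
same_everywhere = latin == "A".encode("cp1252") == "A".encode("cp1253")

print(len(latin), len(kanji), same_everywhere)
```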

One drawback of the code page system is that the character represented by a particular code point depends on the specific code page on which it was entered. If you do not know which code page a code point is from, you cannot determine how to interpret the code point accurately. This can cause problems when a text document is shared between users on different computers.

For example, unless you know which code page it comes from, the code point 230 might be the Greek lowercase zeta (ζ), the Cyrillic lowercase zhe (ж), or the Western European diphthong (æ). All three characters have the same code point (230), but the code point is from three different code pages (1253, 1251, and 1252, respectively). Users exchanging documents between these languages are likely to see incorrect characters.
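The ambiguity of code point 230 can be checked with a short Python sketch that decodes the same byte with each of the three code pages named above. The dictionary name is illustrative.

```python
# A single byte with the value decimal 230.
raw = bytes([230])

# Interpret the same byte with the Greek, Cyrillic, and Western European code pages.
readings = {cp: raw.decode(cp) for cp in ("cp1253", "cp1251", "cp1252")}

for code_page, character in readings.items():
    print(code_page, character)
```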

Unicode: a worldwide character set

Unicode is a character encoding standard developed by the Unicode Consortium to create a universal character set that can accommodate all known scripts. Because Unicode can use more than one byte for each character, every character has its own unique code point, in contrast to code pages. For example, the Unicode code point of lowercase zeta (ζ) is the hexadecimal value 03B6, lowercase zhe (ж) is 0436, and the diphthong (æ) is 00E6. The Unicode encoding standard enables almost all written languages in the world to be represented by using a single character set.
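As a brief illustration, the following Python sketch builds a string from the three hexadecimal code points given above and reads them back. It relies on Python strings being sequences of Unicode code points; the variable names are illustrative.

```python
# Build one string from three Unicode code points drawn from different scripts.
mixed = chr(0x03B6) + chr(0x0436) + chr(0x00E6)

# Read the code points back as four-digit hexadecimal values.
code_points = [f"{ord(character):04X}" for character in mixed]

print(mixed)
print(code_points)
```

Because each character has a distinct code point, Greek, Cyrillic, and Western European text can sit in the same string without any code page information.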

Currently in the Microsoft Windows operating systems, the two systems of storing text—code pages and Unicode—coexist. However, Unicode-based systems are replacing code page–based systems. For example, Microsoft Windows NT 4, Microsoft Windows 2000, Microsoft Windows XP, Microsoft Office 97 and later, Microsoft Internet Explorer 4.0 and later, and Microsoft SQL Server 7.0 and later all support Unicode.

Note

The Microsoft Visual Basic for Applications environment does not support Unicode. Only characters supported by the active Windows code page can be used in the Visual Basic Editor or displayed in custom dialog boxes or message boxes.

You can use the ChrW( ) function to manipulate text outside the code page. The ChrW( ) function accepts a number that represents the Unicode value of a character and returns that character string.