Character Sets | Programming Applications for Microsoft Windows (Microsoft Programming Series)

The real problem with localization has always been manipulating different character sets. For years, most of us have been coding text strings as a series of single-byte characters with a zero at the end. This is second nature to us. When we call strlen, it returns the number of characters in a zero-terminated array of single-byte characters.

The problem is that some languages and writing systems (Japanese kanji being the classic example) have so many symbols in their character sets that a single byte, which offers no more than 256 different symbols at best, is just not enough. So double-byte character sets (DBCSs) were created to support these languages and writing systems.

Single-Byte and Double-Byte Character Sets

In a double-byte character set, each character in a string consists of either 1 or 2 bytes. With kanji, for example, if the first byte is between 0x81 and 0x9F or between 0xE0 and 0xFC, you must look at the next byte to determine the full character. Working with double-byte character sets is a programmer's nightmare because some characters are 1 byte wide and some are 2 bytes wide.

Simply placing a call to strlen doesn't really tell you how many characters are in the string; it tells you the number of bytes before you hit a terminating zero. The string functions in the ANSI C run-time library, such as strlen, are not aware of double-byte characters. However, the Microsoft Visual C++ run-time library does include a number of functions, such as _mbslen, that allow you to manipulate multibyte (that is, both single-byte and double-byte) character strings.
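To make the difference between bytes and characters concrete, here is a minimal sketch of a DBCS-aware character counter in portable C. It assumes the lead-byte ranges quoted above (0x81 through 0x9F and 0xE0 through 0xFC); the names is_lead_byte and dbcs_char_count are hypothetical helpers written for this illustration, not functions from any run-time library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: lead bytes fall in the two ranges quoted in the text.
   A real program would ask the system about the active code page. */
int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Counts characters, not bytes, in a zero-terminated DBCS string. */
size_t dbcs_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        /* A lead byte followed by a nonzero byte forms one 2-byte character. */
        if (is_lead_byte((unsigned char)*s) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        count++;
    }
    return count;
}
```

For the 4-byte string "A\x93\xFA" "B" (a 1-byte character, one 2-byte character, and another 1-byte character), strlen reports 4 while dbcs_char_count reports 3.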

To help manipulate DBCS strings, Windows offers the following set of helper functions.

Function	Description
PTSTR CharNext (PCTSTR pszCurrentChar);	Returns the address of the next character in a string
PTSTR CharPrev (PCTSTR pszStart, PCTSTR pszCurrentChar);	Returns the address of the previous character in a string
BOOL IsDBCSLeadByte(BYTE bTestChar);	Returns TRUE if the byte is the first byte of a DBCS character

The first two functions, CharNext and CharPrev, allow you to traverse forward or backward through a DBCS string one character at a time. The third function, IsDBCSLeadByte, returns TRUE if the byte passed to it is the first byte of a 2-byte character.
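The following sketch shows, in portable C, the kind of work that forward and backward traversal of a DBCS string involves. It is not the Windows implementation of CharNext or CharPrev; dbcs_next, dbcs_prev, and sjis_is_lead_byte are hypothetical stand-ins that assume the kanji lead-byte ranges given earlier. Notice that the backward step needs the start of the string, just as CharPrev does: a byte such as 0xFA can be either a lead byte or the second byte of a character, so the only reliable way to step back is to rescan from the beginning.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: lead-byte ranges as quoted in the text for kanji. */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Steps forward one character; stays put at the terminating zero. */
const char *dbcs_next(const char *p)
{
    assert(p != NULL);
    if (*p == '\0')
        return p;
    if (sjis_is_lead_byte((unsigned char)*p) && p[1] != '\0')
        return p + 2;
    return p + 1;
}

/* Steps backward one character by walking forward from the start and
   remembering the last character boundary seen before cur. */
const char *dbcs_prev(const char *start, const char *cur)
{
    const char *p = start;
    const char *prev = start;

    assert(start != NULL && cur != NULL && start <= cur);
    while (p < cur && *p != '\0') {
        prev = p;
        p = dbcs_next(p);
    }
    return prev;
}
```

With the string "A\x93\xFA" "B", stepping back from the 'B' at offset 3 lands on offset 1, the lead byte of the 2-byte character, rather than on offset 2.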

Although these functions make manipulating DBCS strings a little easier, a better approach is definitely needed. Enter Unicode.

Unicode: The Wide-Byte Character Set

Unicode is a standard founded by Apple and Xerox in 1988. In 1991, a consortium was created to develop and promote Unicode. The consortium consists of companies such as Apple, Compaq, Hewlett-Packard, IBM, Microsoft, Oracle, Silicon Graphics, Inc., Sybase, Unisys, and Xerox. (A complete and updated list of consortium members is available at www.Unicode.org.) This group of companies is responsible for maintaining the Unicode standard. The full description of Unicode can be found in The Unicode Standard, published by Addison-Wesley. (This book is available through www.Unicode.org.)

Unicode offers a simple and consistent way of representing strings. All characters in a Unicode string are 16-bit values (2 bytes). There are no special bytes that indicate whether the next byte is part of the same character or is a new character. This means that you can traverse the characters in a string by simply incrementing or decrementing a pointer. Calls to functions such as CharNext, CharPrev, and IsDBCSLeadByte are no longer necessary.
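As a rough illustration of how much simpler this is, the sketch below measures and reverses a string by doing nothing more than incrementing and decrementing pointers. It assumes, as the text does, that every character occupies exactly one 16-bit value, and it uses uint16_t so that the element width is 16 bits on any compiler. The function names u16_length and u16_reverse are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the number of 16-bit characters before the terminating zero. */
size_t u16_length(const uint16_t *s)
{
    const uint16_t *p = s;

    assert(s != NULL);
    while (*p != 0)
        p++;                    /* one increment is one character */
    return (size_t)(p - s);
}

/* Reverses the string in place by walking inward from both ends. */
void u16_reverse(uint16_t *s)
{
    size_t n = u16_length(s);
    uint16_t *left = s;
    uint16_t *right;

    if (n < 2)
        return;
    right = s + n - 1;
    while (left < right) {
        uint16_t tmp = *left;
        *left++ = *right;
        *right-- = tmp;
    }
}
```

Reversing a DBCS string this way would scramble the lead and trail bytes of every 2-byte character; with fixed-width 16-bit elements there is nothing to scramble.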

Because Unicode represents each character with a 16-bit value, more than 65,000 characters are available, making it possible to encode all the characters that make up written languages throughout the world. This is a far cry from the 256 characters available with a single-byte character set.

Currently, Unicode code points¹ are defined for the Arabic, Chinese bopomofo, Cyrillic (Russian), Greek, Hebrew, Japanese kana, Korean hangul, and Latin (English) alphabets, among others. A large number of punctuation marks, mathematical symbols, technical symbols, arrows, dingbats, diacritics, and other characters are also included in the character set. When you add together all these alphabets and symbols, they total about 35,000 different code points, which leaves about half of the 65,000 total code points available for future expansion.

These 65,536 characters are divided into regions. The following table shows some of the regions and the characters that are assigned to them.

16-Bit Code	Characters
0000-007F	ASCII
0080-00FF	Latin1 characters
0100-017F	European Latin
0180-01FF	Extended Latin
0250-02AF	Standard phonetic
02B0-02FF	Modified letters
0300-036F	Generic diacritical marks
0400-04FF	Cyrillic
0530-058F	Armenian
0590-05FF	Hebrew
0600-06FF	Arabic
0900-097F	Devanagari
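Because the regions are simple numeric ranges, finding the region for a given 16-bit code is a table lookup. The sketch below copies only the rows shown in the table above, so it is far from a complete map of Unicode; struct region and find_region are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region {
    uint16_t first;     /* first 16-bit code in the region */
    uint16_t last;      /* last 16-bit code in the region  */
    const char *name;
};

/* Only the regions listed in the table; everything else is omitted. */
static const struct region regions[] = {
    { 0x0000, 0x007F, "ASCII" },
    { 0x0080, 0x00FF, "Latin1 characters" },
    { 0x0100, 0x017F, "European Latin" },
    { 0x0180, 0x01FF, "Extended Latin" },
    { 0x0250, 0x02AF, "Standard phonetic" },
    { 0x02B0, 0x02FF, "Modified letters" },
    { 0x0300, 0x036F, "Generic diacritical marks" },
    { 0x0400, 0x04FF, "Cyrillic" },
    { 0x0530, 0x058F, "Armenian" },
    { 0x0590, 0x05FF, "Hebrew" },
    { 0x0600, 0x06FF, "Arabic" },
    { 0x0900, 0x097F, "Devanagari" }
};

/* Returns the matching table row, or NULL if the code falls in a
   range that this partial table does not cover. */
const struct region *find_region(uint16_t code)
{
    size_t i;

    for (i = 0; i < sizeof regions / sizeof regions[0]; i++) {
        assert(regions[i].first <= regions[i].last);
        if (code >= regions[i].first && code <= regions[i].last)
            return &regions[i];
    }
    return NULL;
}
```

For example, find_region(0x0416) returns the Cyrillic row, while find_region(0x0200) returns NULL because the range 0200-024F does not appear in the table.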

Approximately 29,000 code points are currently unassigned, but they are reserved for future use. In addition, approximately 6,000 code points are reserved for your own personal use.