Creating Win32 Unicode Applications | Developing International Software

Glossary

System locale: The system locale determines the Windows code page used by the ANSI (non-Unicode) version of Win32 APIs. String and character parameters passed to a Win32 ANSI API are converted from this Windows code page to Unicode.
Win32 API: The set of 32-bit functions supported by Windows.

You've decided to create a Win32 Unicode application. Following the tips and examples included here will help you to get underway and to perform your tasks efficiently, whether you're converting between code-page encodings and Unicode, compiling your Unicode application in Microsoft Visual C++, migrating your code page-based application to Unicode, or writing Unicode code. Although adopting the Unicode encoding system for your application is only a first step toward fully globalized software, it certainly is the most important factor in multilingual computing. In fact, any language version of a Unicode Windows application can run on any language version of Windows 2000 and Windows XP, regardless of the user's settings or the language version of the operating system.

As mentioned earlier, Windows 2000 and Windows XP use Unicode as their default encoding system. Whenever dealing with data based on code pages, the data is internally converted to UTF-16 and processed in Unicode. This means that a legacy or non-Unicode application's behavior depends on the system's conversion from code-page encoding to Unicode. The system performs this conversion based on the value of the default code page of what is called the system locale. Windows XP refers to this variable as "Language for non-Unicode programs" (found when clicking the Advanced tab within the Regional And Language Options property sheet). (See Figure 3-9.) For each system, this value can only be changed or set by a system administrator and requires the computer to be restarted. The primary effect of the system-locale value is to set the internal system's conversion from the code page to Unicode and vice versa.

Figure 3-9 Windows XP Regional And Language Options property sheet, Advanced tab.

In Figure 3-9, the system locale is set to Russian (Windows code page 1251). This means that a non-Unicode application must use either code page 1251 or ASCII as its default text encoding. Applications built on other encodings will display text incorrectly unless they implement their own code-page conversion mechanism, since the system uses the Russian code page to convert between the code page and Unicode. Figure 3-10 shows how a message box from a non-Unicode Arabic application running with an English system locale would appear. The conversion between code page and Unicode cannot map the bytes to the appropriate Arabic characters, and instead maps them to unrelated extended characters from Windows code page 1252.

Figure 3-10 Non-Unicode application running on mismatched system locales.

Porting existing code page-based applications to Unicode is easier than you might think. In fact, Unicode was implemented in such a way as to make writing Unicode applications almost transparent to developers. Unicode also needed to be implemented in such a way as to ensure that non-Unicode applications remain functional when running on a pure Unicode platform. To accommodate these needs, the implementation of Unicode required changes in two major areas:

Creation of a data-type variable (WCHAR) to handle 16-bit characters
Creation of a set of APIs that accept string parameters with 16-bit character encoding

WCHAR, a 16-Bit Data Type

Most string operations for Unicode can be coded with the same logic used for handling the Windows character set. The difference is that the basic unit of operation is a 16-bit quantity instead of an 8-bit one. The header files provide a number of type definitions that make it easy to create sources that can be compiled for Unicode or the Windows character set.

 For 8-bit (ANSI) and double-byte characters:
     typedef char CHAR;            // 8-bit character
     typedef char *LPSTR;          // pointer to 8-bit string

 For Unicode (wide) characters:
     typedef unsigned short WCHAR; // 16-bit character
     typedef WCHAR *LPWSTR;        // pointer to 16-bit string

Figure 3-11 shows the method by which the Win32 header files define three sets of types:

One set of generic type definitions (TCHAR, LPTSTR), which depend on the state of the UNICODE manifest constant.
Two sets of explicit type definitions (one set for those that are based on code pages or ANSI and one set for Unicode).

With generic declarations, it is possible to maintain a single set of source files and compile them for either Unicode or ANSI support.

Figure 3-11 WCHAR, a new data type.
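To make the idea of generic declarations concrete, here is a minimal, portable sketch of the mechanism. It is not the contents of the Win32 headers: MY_UNICODE, my_tchar, generic_length, and generic_bytes are hypothetical names chosen for illustration, and standard wchar_t stands in for WCHAR (wchar_t is 16 bits wide on Windows but may be wider on other platforms).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-ins: MY_UNICODE plays the role of the UNICODE flag,
   and my_tchar plays the role of TCHAR. */
#ifdef MY_UNICODE
typedef wchar_t my_tchar;   /* wide build */
#else
typedef char my_tchar;      /* code page-based build */
#endif

/* Counts the elements before the terminator. The same source compiles
   for either character width because it never assumes an element size. */
size_t generic_length(const my_tchar *s)
{
    const my_tchar *p = s;
    assert(s != NULL);
    while (*p != 0)
        p++;
    return (size_t)(p - s);   /* pointer difference is in elements, not bytes */
}

/* Bytes needed to store the string, including its terminator. */
size_t generic_bytes(const my_tchar *s)
{
    return (generic_length(s) + 1) * sizeof(my_tchar);
}
```

Compiling the same file with and without -DMY_UNICODE changes only the element size; the string logic itself is untouched.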

W Function Prototypes for Win32 APIs

All Win32 APIs that take a text argument, either as an input or output variable, have been provided with a generic function prototype and two definitions: a version that is based on code pages or ANSI (called "A") to handle code page-based text arguments, and a wide version (called "W") to handle Unicode. The generic function prototype consists of the standard API function name implemented as a macro. The generic prototype gets resolved into one of the explicit function prototypes ("A" or "W"), depending on whether the compile-time manifest constant UNICODE is defined in a #define statement. The letter "W" or "A" is added at the end of the API function name in each explicit function prototype.

 // windows.h
 #ifdef UNICODE
 #define SetWindowText SetWindowTextW
 #else
 #define SetWindowText SetWindowTextA
 #endif // UNICODE

With this mechanism, an application can use the generic API function to work transparently with Unicode, depending on the #define UNICODE manifest constant. It can also make mixed calls by using the explicit function name with "W" or "A."

One function of particular importance in this dual compile design is RegisterClass (and RegisterClassEx). A window class is implemented by a window procedure. A window procedure can be registered with either the RegisterClassA or RegisterClassW function. By using the function's "A" version, the program tells the system that the window procedure of the created class expects messages with text or parameters that are based on code pages; other objects associated with the window are created using a Windows code page as the encoding of the text. By registering the window class with a call to the wide-character version of the function, the program can request that the system pass text parameters of messages as Unicode. The IsWindowUnicode function allows programs to query the nature of each window.

On Windows NT 4, Windows 2000, and Windows XP, "A" routines are wrappers that convert text that is based on code pages or ANSI to Unicode, using the system-locale code page, and then call the corresponding "W" routine. On Windows 95, Windows 98, and Windows Me, "A" routines are native, and most "W" routines are not implemented. If a "W" routine is called but is not implemented, the ERROR_CALL_NOT_IMPLEMENTED error code is returned. (For more information on how to write Unicode-based applications for non-Unicode platforms, see Chapter 18, "Microsoft Layer for Unicode [MSLU].")
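The relationship between a generic name, its "A" wrapper, and its "W" implementation can be sketched in portable C. Everything here is hypothetical and exists only for illustration: set_title_a, set_title_w, set_title, get_title, and MY_UNICODE are not Win32 names. The sketch also assumes the narrow input is plain ASCII, so widening is a simple copy; a real "A" wrapper would convert through the system-locale code page.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define TITLE_MAX 64   /* capacity in characters, including the terminator */

/* Hypothetical title store, kept in wide characters like a Unicode system would. */
static wchar_t g_title[TITLE_MAX];

/* "W" version: the native entry point, takes wide text directly. */
int set_title_w(const wchar_t *title)
{
    assert(title != NULL);
    if (wcslen(title) >= TITLE_MAX)
        return 0;                       /* does not fit */
    wcscpy(g_title, title);
    return 1;
}

/* "A" version: a thin wrapper that widens the text and calls the "W" version.
   Assumption: input is ASCII only, so each byte maps to the same value. */
int set_title_a(const char *title)
{
    wchar_t wide[TITLE_MAX];
    size_t i = 0;
    assert(title != NULL);
    while (title[i] != '\0') {
        if (i + 1 >= TITLE_MAX)
            return 0;                   /* does not fit */
        wide[i] = (wchar_t)(unsigned char)title[i];
        i++;
    }
    wide[i] = L'\0';
    return set_title_w(wide);
}

/* Generic name: resolved to one explicit version at compile time. */
#ifdef MY_UNICODE
#define set_title set_title_w
#else
#define set_title set_title_a
#endif

const wchar_t *get_title(void)
{
    return g_title;
}
```

A caller can use the generic name for most code and still make mixed calls by naming set_title_a or set_title_w explicitly.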

Unicode Text Macro

Visual C++ lets you prefix a literal with an "L" to indicate it is a Unicode string, as shown here:

 LPWSTR str = L"This is a Unicode string";

In the source file, the string is expressed in the code page that the editor or compiler understands. When compiled, the characters are converted to Unicode. The Win32 SDK resource compiler also supports the "L" prefix notation, even though it can interpret Unicode source files directly. WINDOWS.H defines a macro called TEXT() that will mark string literals as Unicode, depending on whether the UNICODE compile flag is set.

 #ifdef UNICODE
 #define TEXT(string) L##string
 #else
 #define TEXT(string) string
 #endif // UNICODE

So, the generic version of a string of characters should become:

 LPTSTR str = TEXT("This is a generic string");
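The following portable sketch shows how a TEXT-style macro can be built with the preprocessor's token-pasting operator (##). The names MY_WIDEN_, MY_WIDEN, MY_TEXT, MY_UNICODE, and APP_NAME are hypothetical. The two-level definition is a common preprocessor idiom: the outer macro lets an argument that is itself a macro expand before the inner macro pastes the L prefix onto it.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical names, for illustration only. */
#define MY_WIDEN_(quote) L##quote          /* pastes L onto the literal */
#define MY_WIDEN(quote)  MY_WIDEN_(quote)  /* expands the argument first */

#ifdef MY_UNICODE
#define MY_TEXT(quote) MY_WIDEN(quote)     /* wide literal in a Unicode build */
#else
#define MY_TEXT(quote) quote               /* unchanged literal otherwise */
#endif

#define APP_NAME "Sample"

/* Size in bytes of one element of a generic literal in this build. */
size_t text_unit_size(void)
{
    return sizeof(MY_TEXT("x")[0]);
}

/* Small self-check that can be called from a test or from main(). */
void check_widening(void)
{
    assert(wcscmp(MY_WIDEN("abc"), L"abc") == 0);
    /* Works because APP_NAME expands to "Sample" before L is pasted on. */
    assert(wcscmp(MY_WIDEN(APP_NAME), L"Sample") == 0);
}
```

With a single-level macro, MY_WIDEN_(APP_NAME) would paste L onto the identifier itself and produce the undefined name LAPP_NAME, which is why the extra level is worth having.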

C Run-Time Extensions

The Unicode data type is compatible with the wide-character data type wchar_t in ANSI C, thus allowing access to the wide-character string functions. Most of the C run-time (CRT) libraries contain wide-character versions of the strxxx string functions. The wide-character versions of the functions all start with wcs. (See Table 3-5.)

Table 3-5 Examples of C run-time library routines used for string manipulation.

Generic CRT	8-bit Character Sets	Unicode
_tcscpy	strcpy	wcscpy
_tcscmp	strcmp	wcscmp
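As a short illustration of the parallel between the two families in Table 3-5, the sketch below performs the same copy-and-compare step once with the str functions and once with the wcs functions. The helper names copy_and_check_a and copy_and_check_w are hypothetical; the library calls are standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* 8-bit version: copies src into dst (capacity cch characters) and
   reports whether the copy compares equal to the source. */
int copy_and_check_a(char *dst, size_t cch, const char *src)
{
    assert(dst != NULL && src != NULL);
    if (strlen(src) >= cch)
        return 0;                 /* no room for the text plus terminator */
    strcpy(dst, src);
    return strcmp(dst, src) == 0;
}

/* Wide version: identical logic, with each strxxx call replaced by wcsxxx.
   The capacity is still counted in characters, not bytes. */
int copy_and_check_w(wchar_t *dst, size_t cch, const wchar_t *src)
{
    assert(dst != NULL && src != NULL);
    if (wcslen(src) >= cch)
        return 0;
    wcscpy(dst, src);
    return wcscmp(dst, src) == 0;
}
```

Note that both capacity checks count characters; only an allocation or a byte-oriented call would need the count multiplied by the element size.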

The C run-time library also provides such functions as mbtowc and wctomb, which can translate the C character set to and from Unicode. The more general set of functions of the Win32 API can perform the same functions as the C run-time libraries including conversions between Unicode, Windows character sets, and MS-DOS code pages. In Windows programming, it is highly recommended that you use the Win32 APIs instead of the CRT libraries in order to take advantage of locale-aware functionality provided by the system, as described in Chapter 4. (See Table 3-6.)

Table 3-6 Equivalent Win32 API functions for the C run-time library routines found in Table 3-5.

Generic Win32	8-bit Character Sets	Unicode
lstrcpy	lstrcpyA	lstrcpyW
lstrcmp	lstrcmpA	lstrcmpW

Conversion Functions Between Code Page and Unicode

Since a large number of applications are still code page-based, and since you might want to support Unicode internally, there are a lot of occasions where a conversion between code-page encodings and Unicode is necessary. The pair of Win32 APIs, MultiByteToWideChar and WideCharToMultiByte, allow you to convert code-page encoding to Unicode and Unicode data to code-page encoding, respectively. Each of these APIs takes as an argument the value of the code page to be used for that conversion. You can, therefore, either specify the value of a given code page (example: 1256 for Arabic) or use predefined flags such as:

CP_ACP: for the currently selected system Windows code page.
CP_OEMCP: for the currently selected system OEM code page.
CP_UTF8: for conversions between UTF-16 and UTF-8.

(For more information, see the Microsoft Developer Network [MSDN] documentation at http://msdn.microsoft.com.)

By using MultiByteToWideChar and WideCharToMultiByte consecutively, using the same code-page information, you do what is called a "round trip." If the code-page number that is used in this encoding conversion is the same as the code-page number that was used in encoding the original string, the round trip should allow you to retrieve the initial character string.
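The round trip, and what happens when the code pages do not match, can be sketched without the Win32 APIs. The functions below are hypothetical stand-ins, not MultiByteToWideChar or WideCharToMultiByte, and they cover only ASCII plus bytes 0xC0 through 0xFF of code pages 1251 and 1252. In that range, code page 1251 maps to the Cyrillic letters U+0410 through U+044F, and code page 1252 maps each byte to the code point with the same value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoder: one byte in the given code page to one UTF-16 unit. */
uint16_t cp_to_utf16(unsigned code_page, uint8_t byte)
{
    if (byte < 0x80)
        return byte;                                   /* ASCII is shared */
    if (byte >= 0xC0 && code_page == 1251)
        return (uint16_t)(0x0410 + (byte - 0xC0));     /* Cyrillic letters */
    if (byte >= 0xC0 && code_page == 1252)
        return byte;                                   /* Latin letters */
    return 0xFFFD;                 /* outside the range this sketch covers */
}

/* Hypothetical encoder: one UTF-16 unit to one byte in the given code page. */
uint8_t utf16_to_cp(unsigned code_page, uint16_t unit)
{
    if (unit < 0x80)
        return (uint8_t)unit;
    if (code_page == 1251 && unit >= 0x0410 && unit <= 0x044F)
        return (uint8_t)(0xC0 + (unit - 0x0410));
    if (code_page == 1252 && unit >= 0x00C0 && unit <= 0x00FF)
        return (uint8_t)unit;
    return '?';                    /* no mapping in the target code page */
}

/* Decodes with one code page, encodes with another, and reports whether
   the original bytes come back unchanged. Handles up to 64 bytes. */
int round_trips(unsigned decode_cp, unsigned encode_cp,
                const uint8_t *src, size_t n)
{
    uint16_t wide[64];
    uint8_t back[64];
    size_t i;
    assert(src != NULL);
    if (n > 64)
        return 0;
    for (i = 0; i < n; i++)
        wide[i] = cp_to_utf16(decode_cp, src[i]);
    for (i = 0; i < n; i++)
        back[i] = utf16_to_cp(encode_cp, wide[i]);
    return memcmp(src, back, n) == 0;
}
```

For the Russian word encoded in code page 1251 as the bytes CF F0 E8 E2 E5 F2, using 1251 in both directions restores the bytes. Decoding the same bytes as 1252 yields unrelated accented Latin letters, and decoding as 1251 but encoding as 1252 loses the text entirely.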

Compiling Unicode Applications in Visual C++

By using the generic data types and function prototypes, you have the liberty of creating a non-Unicode application or compiling your software as Unicode. To compile an application as Unicode in Visual C++, go to Project/Settings/C/C++/General, and include UNICODE and _UNICODE in Preprocessor Definitions. The UNICODE flag is the preprocessor definition for all Win32 APIs and data types, and _UNICODE is the preprocessor definition for C run-time functions.

Migration to Unicode

Glossary

Compatibility zone: The area in Unicode repertoire from U+F900 through U+FFEF that is assigned to characters from other standards. These characters are variants of other Unicode characters.

Creating a new program based on Unicode is fairly easy. Unicode has a few features that require special handling, but you can isolate these in your code. Converting an existing program that uses code-page encoding to one that uses Unicode or generic declarations is also straightforward. Here are the steps to follow:

Modify your code to use generic data types. Determine which variables declared as char or char* are text, and not pointers to buffers or binary byte arrays. Change these types to TCHAR and TCHAR*, as defined in the Win32 file WINDOWS.H, or to _TCHAR as defined in the Visual C++ file TCHAR.H. Replace instances of LPSTR and LPCH with LPTSTR and LPTCH. Make sure to check all local variables and return types. Using generic data types is a good transition strategy because you can compile both ANSI and Unicode versions of your program without sacrificing the readability of the code. Don't use generic data types, however, for data that will always be Unicode or always stays in a given code page. For example, one of the string parameters to MultiByteToWideChar and WideCharToMultiByte should always be a code page-based data type, and the other should always be a Unicode data type.
Modify your code to use generic function prototypes. For example, use the C run-time call _tcslen instead of strlen, and use the Win32 API SetWindowText instead of SetWindowTextA. This rule applies to all APIs and C functions that handle text arguments.
Surround any character or string literal with the TEXT macro. The TEXT macro conditionally places an "L" in front of a character literal or a string literal definition. Be careful with escape sequences. For example, the Win32 resource compiler interprets L\" as an escape sequence specifying a 16-bit Unicode double-quote character, not as the beginning of a Unicode string.
Create generic versions of your data structures. Type definitions for string or character fields in structures should resolve correctly based on the UNICODE compile-time flag. If you write your own string-handling and character-handling functions, or functions that take strings as parameters, create Unicode versions of them and define generic prototypes for them.
Change your build process. When you want to build a Unicode version of your application, both the Win32 compile-time flag -DUNICODE and the C run-time compile-time flag -D_UNICODE must be defined.
Adjust pointer arithmetic. Subtracting char* values yields an answer in terms of bytes; subtracting wchar_t* values yields an answer in terms of 16-bit chunks. When determining the number of bytes (for example, when allocating memory for a string), multiply the length of the string in symbols by sizeof(TCHAR). When determining the number of characters from the number of bytes, divide by sizeof(TCHAR). You can also create macros for these two operations, if you prefer. C makes sure that the ++ and -- operators increment and decrement by the size of the data type. Or even better, use Win32 APIs CharNext and CharPrev.
Check for any code that assumes a character is always 1 byte long. Code that assumes a character's value is always less than 256 (for example, code that uses a character value as an index into a table of size 256) must be changed. Make sure the null character that terminates your strings is 16 bits long.
Add code to support special Unicode characters. These include Unicode characters in the compatibility zone, characters in the Private Use Area, combining characters, and characters with directionality. Other special characters include the noncharacter U+FFFF, which can be used as a placeholder, and the byte-order mark U+FEFF, which can serve as a flag that indicates a file is stored in Unicode. The byte-order mark also indicates whether a text stream is little-endian or big-endian: a reader that sees the byte-swapped value U+FFFE knows that the stream uses the opposite byte order. In plaintext, the line separator U+2028 marks an unconditional end of line. Inserting a paragraph separator, U+2029, between paragraphs makes it easier to lay out text at different line widths.
Debug your port by enabling your compiler's type-checking. Do this with and without the UNICODE flag defined. Some warnings that you might be able to ignore in the code page-based world will cause problems with Unicode. If your original code compiles cleanly with type-checking turned on, it will be easier to port. The warnings will help you make sure that you are not passing the wrong data type to code that expects wide-character data types.
Use the Win32 National Language Support API (NLS API) or equivalent C run-time calls to get character typing and sorting information. Don't try to write your own logic for handling locale-specific type checking, because your application will end up carrying very large tables!
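As one example of the special-character handling mentioned in the steps above, the sketch below checks the first two bytes of a buffer for a UTF-16 byte-order mark and then reads 16-bit units in the detected order. The names bom_kind, detect_utf16_bom, and read_utf16_unit are hypothetical. The sketch deliberately ignores other encodings; for instance, a UTF-32 little-endian stream begins with the same two bytes as UTF-16 little-endian and would be misreported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

/* Looks for U+FEFF at the start of the buffer. Stored low byte first it
   appears as FF FE (little-endian); high byte first it appears as FE FF. */
bom_kind detect_utf16_bom(const uint8_t *data, size_t size)
{
    assert(data != NULL || size == 0);
    if (size >= 2) {
        if (data[0] == 0xFF && data[1] == 0xFE)
            return BOM_UTF16_LE;
        if (data[0] == 0xFE && data[1] == 0xFF)
            return BOM_UTF16_BE;
    }
    return BOM_NONE;
}

/* Reads one 16-bit unit from two bytes. Any order other than
   BOM_UTF16_BE is treated as little-endian in this sketch. */
uint16_t read_utf16_unit(const uint8_t *p, bom_kind order)
{
    assert(p != NULL);
    if (order == BOM_UTF16_BE)
        return (uint16_t)((p[0] << 8) | p[1]);
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

A caller would typically detect the mark once, skip those two bytes, and pass the detected order to every later read.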

In the following example, a string is loaded from the resources and is used in two scenarios:

As a body to a message box
To be drawn at run time in a given window

(For more information on resources, see Chapter 7, "Software Localizability Guidelines.")

For the purpose of simplification, this example will ignore where and how irrelevant variables have been defined. Suppose you want to migrate the following code page-based code to Unicode:

 char g_szTemp[MAX_STR];         // Definition of a char data type

 // Loading IDS_SAMPLE from the resources in our char variable
 LoadString(g_hInst, IDS_SAMPLE, g_szTemp, MAX_STR);

 // Using the loaded string as the body of the message box
 MessageBox(NULL, g_szTemp, "This is an ANSI message box!", MB_OK);

 // Using the loaded string in a call to TextOut for drawing at
 // run time
 ExtTextOut(hDC, 10, 10, ETO_CLIPPED, NULL, g_szTemp,
     strlen(g_szTemp), NULL);

Migrating this code to Unicode is as easy as following the generic coding conventions and properly replacing the data type, Win32 APIs, and C run-time API definitions. The comments call out each change.

 #include <tchar.h>          // Include wchar specific header file

 TCHAR g_szTemp[MAX_STR];    // Definition of the data type as a
                             // generic variable

 // Calling the generic LoadString and not W or A versions explicitly
 LoadString(g_hInst, IDS_SAMPLE, g_szTemp, MAX_STR);

 // Using the appropriate text macro for the title of our message box
 MessageBox(NULL, g_szTemp, TEXT("This is a Unicode message box."),
     MB_OK);

 // Using the generic run-time version of strlen
 ExtTextOut(hDC, 10, 10, ETO_CLIPPED, NULL, g_szTemp,
     _tcslen(g_szTemp), NULL);

After implementing these simple steps, all that is left to do in order to create a Unicode application is to compile your code as Unicode by defining the compile-time flags UNICODE and _UNICODE.

Options to Migrate to Unicode

Depending on your needs and your target operating systems, there are several options for migrating from an application that is based on code pages to one that is based on Unicode. Some of these options do have certain caveats, however.

Create two binaries: default compile for Windows 95, Windows 98, and Windows Me, and Unicode compile for Windows NT, Windows 2000, and Windows XP.
Disadvantage: Maintaining two versions of your software is messy and goes against the principle of a single, worldwide binary, introduced in Chapter 2, "Designing a World-Ready Program."
Always register as a non-Unicode application, converting to and from Unicode as needed.
Disadvantage: Since Windows does not support the creation of custom code pages, you will not be able to use scripts that are supported only through Unicode (such as those in the Indic family of languages, Armenian, and Georgian). Also, this option makes multilingual computing impossible since, when it comes to displaying, your application is always limited to the system's code page.
Create a pure Unicode application.
Disadvantage: This works only on Windows NT, Windows 2000, and Windows XP, since only limited Unicode support is provided on legacy platforms. This is the preferred approach if you are only targeting Unicode platforms.
Use Microsoft Layer for Unicode (MSLU). In this easy approach, you merely create a pure Unicode application, and then link the Unicows.lib file provided by the Platform SDK to your project. You will also need to ship the Unicows.dll file along with your deliverables. If a non-Unicode platform is detected at run time, MSLU essentially wraps all explicit "W" version calls made in your code and redirects them to the "A" versions. This approach is by far the best solution for migrating to Unicode and for ensuring backward-compatibility. (For more information about MSLU, see Chapter 18.)

Best Practices

When writing Unicode code, there are many points to consider, such as when to use UTF-16, when to use UTF-8, what to use for compression, and so forth. The following are recommended practices that will help ensure you choose the best method based on the circumstance at hand.

Choose UTF-16 as the fundamental representation of text in your application. UTF-8 should be used for application interoperability only (for example, for content sent to be displayed in browsers that do not support Unicode, or over networks and servers that do not support Unicode).
Avoid character-by-character processing and use the existing WCHAR system interfaces and resources wherever possible. The interaction between characters in some languages requires expert knowledge of those languages. Microsoft has developed and tested the system interfaces with most of the languages represented by Unicode. Unless you are a multilingual expert, it will be difficult to reproduce this support.
If your application must run on Windows 95, Windows 98, or Windows Me, keep UTF-16 as your fundamental text representation and use MSLU on these operating systems. If you must support non-Unicode text, keep data internally in UTF-16 and convert to other encodings via a gateway. Use system interfaces such as MultiByteToWideChar to convert when necessary.
Ensure your application supports Unicode characters that require two UTF-16 code units (surrogate pairs). This should be automatic if you use existing system interfaces, but will require careful development and testing when you do not. Avoid the trap of likening surrogate pairs to the older East Asian double-byte encodings (DBCS). Instead, centralize the needed string operations in a few subroutines. These subroutines should take surrogate pairs into consideration, but should also handle combining characters and other characters that require special handling. A well-written application can confine surrogate processing to just a few such routines.
Don't use UTF-8 for compression, because it actually expands the size of the data for most languages. If you need a real compression algorithm for Unicode, refer to the Unicode Consortium technical standard "Unicode Technical Standard #6: A Standard Compression Scheme for Unicode" available on their site, http://www.unicode.org.
Don't choose UTF-32 merely to avoid surrogate processing. Data size will double and the processing benefits are elusive. If you follow the earlier advice on surrogate processing, UTF-16 should be adequate.
Test your Unicode support with a mix of unrelated languages such as Arabic, Hindi, and Korean. For a well-written Unicode application, the system-locale setting should be irrelevant. Test to verify this is the case.
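The advice to confine surrogate processing to a few routines can be illustrated with a small, portable sketch. The function names and the choice to return U+FFFD for an unpaired surrogate are assumptions made for this example, not system behavior. It uses uint16_t for UTF-16 units, because wchar_t is not 16 bits wide on every platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu   /* returned for an unpaired surrogate */

int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Reads the code point that starts at text[*index] and advances *index
   past it: by two units for a valid surrogate pair, otherwise by one. */
uint32_t next_code_point(const uint16_t *text, size_t len, size_t *index)
{
    uint16_t unit;
    assert(text != NULL && index != NULL && *index < len);
    unit = text[(*index)++];
    if (is_high_surrogate(unit) && *index < len &&
        is_low_surrogate(text[*index])) {
        uint16_t low = text[(*index)++];
        return 0x10000u + (((uint32_t)unit - 0xD800u) << 10)
                        + ((uint32_t)low - 0xDC00u);
    }
    if (is_high_surrogate(unit) || is_low_surrogate(unit))
        return REPLACEMENT_CHAR;
    return unit;
}

/* Counts code points rather than 16-bit units. */
size_t count_code_points(const uint16_t *text, size_t len)
{
    size_t index = 0, count = 0;
    while (index < len) {
        (void)next_code_point(text, len, &index);
        count++;
    }
    return count;
}
```

Routines like these count code points only. Combining characters still need separate handling, since several code points can form a single displayed character.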

You've now seen techniques and code samples for creating Win32 Unicode applications. Unicode is also extremely useful for dealing with Web content in the global workplace and market. Knowing how to handle encoding in Web pages will help bridge the gap between the plethora of languages that are in use today within Web content.