How to Write Unicode Source Code | Programming Applications for Microsoft Windows (Microsoft Programming Series)


Microsoft designed the Windows API for Unicode so that it would have as little impact on your code as possible. In fact, it is possible to write a single source code file so that it can be compiled with or without using Unicode—you need only define two macros (UNICODE and _UNICODE) to make the change and then recompile.

Unicode Support in the C Run-Time Library

To take advantage of Unicode character strings, some data types have been defined. The standard C header file, String.h, has been modified to define a data type named wchar_t, which is the data type of a Unicode character:

 typedef unsigned short wchar_t;

For example, if you want to create a buffer to hold a Unicode string of up to 99 characters and a terminating zero character, you can use the following statement:

 wchar_t szBuffer[100];

This statement creates an array of one hundred 16-bit values. Of course, the standard C run-time string functions, such as strcpy, strchr, and strcat, operate on ANSI strings only; they don't correctly process Unicode strings. So, ANSI C also has a complementary set of functions. Figure 2-1 shows some of the standard ANSI C string functions followed by their equivalent Unicode functions.

Figure 2-1. Standard ANSI C string functions and their Unicode equivalents

 char * strcat(char *, const char *);
 wchar_t * wcscat(wchar_t *, const wchar_t *);

 char * strchr(const char *, int);
 wchar_t * wcschr(const wchar_t *, wchar_t);

 int strcmp(const char *, const char *);
 int wcscmp(const wchar_t *, const wchar_t *);

 char * strcpy(char *, const char *);
 wchar_t * wcscpy(wchar_t *, const wchar_t *);

 size_t strlen(const char *);
 size_t wcslen(const wchar_t *);

Notice that all the Unicode functions begin with wcs, which stands for wide character string. To call the Unicode function, simply replace the str prefix of any ANSI string function with the wcs prefix.
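As a minimal sketch of the wcs functions in use, the following builds a greeting in a 100-character wide buffer. The names DEMO_COUNTOF and demo_build_greeting are hypothetical and exist only for this illustration. Note that wchar_t is 16 bits with Microsoft's compiler but may be wider with other compilers, so the sketch counts array elements rather than bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Number of elements (not bytes) in a fixed-size array. */
#define DEMO_COUNTOF(a) (sizeof(a) / sizeof((a)[0]))

/* Builds "Hello, <name>" in dst, which can hold cap wide characters
   including the terminating zero. Returns the resulting length in
   characters, or 0 if the text would not fit. */
size_t demo_build_greeting(wchar_t *dst, size_t cap, const wchar_t *name)
{
    const wchar_t *prefix = L"Hello, ";

    assert(dst != NULL && name != NULL);

    /* wcslen counts characters and excludes the terminating zero. */
    if (wcslen(prefix) + wcslen(name) + 1 > cap) {
        return 0;
    }
    wcscpy(dst, prefix);   /* wide counterpart of strcpy */
    wcscat(dst, name);     /* wide counterpart of strcat */
    return wcslen(dst);
}
```

A caller would declare wchar_t szBuffer[100]; and pass DEMO_COUNTOF(szBuffer) as the capacity. Passing sizeof(szBuffer) instead would overstate the capacity, because sizeof reports bytes rather than characters.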

NOTE
One very important point that most developers don't remember is that the C run-time library provided by Microsoft conforms to the ANSI standard C run-time library. ANSI C dictates that the C run-time library supports Unicode characters and strings. This means that you can always call C run-time functions to manipulate Unicode characters and strings—even if you're running on Windows 98. In other words, wcscat, wcslen, wcstok, and so on all work just fine on Windows 98; it's the operating system functions you need to worry about.

Code that includes explicit calls to either the str functions or the wcs functions cannot be compiled easily for both ANSI and Unicode. Earlier in this chapter, I said that it's possible to make a single source code file that can be compiled for both. To set up the dual capability, you include the TChar.h file instead of including String.h.

TChar.h exists for the sole purpose of helping you create ANSI/Unicode generic source code files. It consists of a set of macros that you should use in your source code instead of making direct calls to either the str or the wcs functions. If you define _UNICODE when you compile your source code, the macros reference the wcs set of functions. If you do not define _UNICODE, the macros reference the str set of functions.

For example, there is a macro called _tcscpy in TChar.h. If _UNICODE is not defined when you include this header file, _tcscpy expands to the ANSI strcpy function. However, if _UNICODE is defined, _tcscpy expands to the Unicode wcscpy function. All C run-time functions that take string arguments have a generic macro defined in TChar.h. If you use the generic macros instead of the ANSI/Unicode specific function names, you'll be well on your way to creating source code that can be compiled natively for ANSI or Unicode.

Unfortunately, using these function macros is not quite enough on its own. TChar.h includes some additional macros that you also need.

To define an array of string characters that is ANSI/Unicode generic, use the following TCHAR data type. If _UNICODE is defined, TCHAR is declared as follows:

 typedef wchar_t TCHAR;

If _UNICODE is not defined, TCHAR is declared as

 typedef char TCHAR;

Using this data type, you can allocate a string of characters as follows:

 TCHAR szString[100];

You can also create pointers to strings:

 TCHAR *szError = "Error";

However, there is a problem with the previous line. By default, Microsoft's C++ compiler compiles all strings as though they were ANSI strings, not Unicode strings. As a result, the compiler will compile this line correctly if _UNICODE is not defined, but will generate an error if _UNICODE is defined. To generate a Unicode string instead of an ANSI string, you would have to rewrite the line as follows:

 TCHAR *szError = L"Error";

An uppercase L before a literal string informs the compiler that the string should be compiled as a Unicode string. When the compiler places the string in the program's data section, it stores each character as a 16-bit value, so for plain ASCII text a zero byte follows every character. The problem with this change is that now the program will compile successfully only if _UNICODE is defined. We need another macro that selectively adds the uppercase L before a literal string. This is the job of the _TEXT macro, also defined in TChar.h. If _UNICODE is defined, _TEXT is defined as

 #define _TEXT(x) L ## x

If _UNICODE is not defined, _TEXT is defined as

 #define _TEXT(x) x

Using this macro, we can rewrite the line above so that it compiles correctly whether or not the _UNICODE macro is defined, as shown here:

 TCHAR *szError = _TEXT("Error");

The _TEXT macro can also be used for literal characters. For example, to check whether the first character of a string is an uppercase J, write the following code:

 if (szError[0] == _TEXT('J')) {
    // First character is a 'J'
 } else {
    // First character is not a 'J'
 }
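The pieces described above (a generic character type, a generic literal macro, and generic function macros) can be sketched together in portable C. Everything prefixed DEMO_ or demo_ is a hypothetical stand-in written for this illustration, with DEMO_UNICODE playing the role of _UNICODE; it is not the content of TChar.h. The extra level of macro indirection in DEMO_TEXT is an assumption added here so that an argument that is itself a macro gets expanded before the L is pasted on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for TCHAR, _TEXT, _tcscpy, and _tcslen. */
#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;
#define DEMO_TEXT_(x) L ## x
#define demo_tcscpy   wcscpy
#define demo_tcslen   wcslen
#else
typedef char DEMO_TCHAR;
#define DEMO_TEXT_(x) x
#define demo_tcscpy   strcpy
#define demo_tcslen   strlen
#endif

/* Indirection: expand the argument first, then apply the prefix. */
#define DEMO_TEXT(x) DEMO_TEXT_(x)

/* Returns nonzero if the first character of sz is an uppercase J. */
int demo_starts_with_j(const DEMO_TCHAR *sz)
{
    assert(sz != NULL);
    return sz[0] == DEMO_TEXT('J');
}

/* Copies the literal "Error" into dst (capacity cap characters)
   and returns its length in characters. */
size_t demo_copy_error(DEMO_TCHAR *dst, size_t cap)
{
    const DEMO_TCHAR *szError = DEMO_TEXT("Error");

    assert(dst != NULL && cap > demo_tcslen(szError));
    demo_tcscpy(dst, szError);
    return demo_tcslen(dst);
}
```

The same two functions compile unchanged with or without DEMO_UNICODE defined. Only the typedef and the macro expansions differ, which is the property the generic TChar.h names are meant to give you.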

Unicode Data Types Defined by Windows

The Windows header files define the data types listed in the following table.

Data Type	Description
WCHAR	Unicode character
PWSTR	Pointer to a Unicode string
PCWSTR	Pointer to a constant Unicode string

These data types always refer to Unicode characters and strings. The Windows header files also define the ANSI/Unicode generic data types PTSTR and PCTSTR. These data types point to either an ANSI string or a Unicode string, depending on whether the UNICODE macro is defined when you compile the module.

Notice that this time the UNICODE macro is not preceded by an underscore. The _UNICODE macro is used for the C run-time header files and the UNICODE macro is used for the Windows header files. You usually need to define both macros when compiling a source code module.
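Because the two macros are consumed by different sets of headers, defining only one of them can leave a module half ANSI and half Unicode. One possible guard, offered as a sketch rather than as anything the headers require, is a small preamble placed before every #include that defines whichever macro is missing:

```c
/* Place before any #include so both header families see the same choice. */
#if defined(UNICODE) && !defined(_UNICODE)
#define _UNICODE   /* C run-time headers (TChar.h) test this one */
#endif

#if defined(_UNICODE) && !defined(UNICODE)
#define UNICODE    /* Windows headers test this one */
#endif
```

If neither macro is defined, the preamble does nothing and the module builds as ANSI.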

Unicode and ANSI Functions in Windows

I implied earlier that two functions are called CreateWindowEx: a CreateWindowEx that accepts Unicode strings and a second CreateWindowEx that accepts ANSI strings. This is true, but the two functions are actually prototyped as follows:

 HWND WINAPI CreateWindowExW(
    DWORD dwExStyle,
    PCWSTR pClassName,
    PCWSTR pWindowName,
    DWORD dwStyle,
    int X,
    int Y,
    int nWidth,
    int nHeight,
    HWND hWndParent,
    HMENU hMenu,
    HINSTANCE hInstance,
    PVOID pParam);

 HWND WINAPI CreateWindowExA(
    DWORD dwExStyle,
    PCSTR pClassName,
    PCSTR pWindowName,
    DWORD dwStyle,
    int X,
    int Y,
    int nWidth,
    int nHeight,
    HWND hWndParent,
    HMENU hMenu,
    HINSTANCE hInstance,
    PVOID pParam);

CreateWindowExW is the version that accepts Unicode strings. The uppercase W at the end of the function name stands for wide. Unicode characters are 16 bits each, so they are frequently referred to as wide characters. The uppercase A at the end of CreateWindowExA indicates that the function accepts ANSI character strings.

But usually we just include a call to CreateWindowEx in our code and don't directly call either CreateWindowExW or CreateWindowExA. In WinUser.h, CreateWindowEx is actually a macro defined as

 #ifdef UNICODE
 #define CreateWindowEx CreateWindowExW
 #else
 #define CreateWindowEx CreateWindowExA
 #endif // !UNICODE

Whether UNICODE is defined when you compile your source code module determines which version of CreateWindowEx is called. When you port a 16-bit Windows application, you probably won't define UNICODE when you compile. Any calls you make to CreateWindowEx then expand to CreateWindowExA, the ANSI version of the function. Because 16-bit Windows offers only an ANSI version of CreateWindowEx, this makes porting much easier.

Under Windows 2000, Microsoft's source code for CreateWindowExA is simply a thunking, or translation, layer that allocates memory to convert ANSI strings to Unicode strings; the code then calls CreateWindowExW, passing the converted strings. When CreateWindowExW returns, CreateWindowExA frees its memory buffers and returns the window handle to you.

If you're creating dynamic-link libraries (DLLs) that other software developers will use, consider using this technique: supply two exported functions in the DLL—an ANSI version and a Unicode version. In the ANSI version, simply allocate memory, perform the necessary string conversions, and call the Unicode version of the function. (I'll demonstrate this process later in this chapter.)
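The shape of such an ANSI/Unicode pair can be sketched in portable C. This is not the author's later demonstration: DemoCountDigitsW, DemoCountDigitsA, and the DemoCountDigits macro are hypothetical names, and the standard mbstowcs function stands in for whatever conversion routine a real DLL would use (on Windows that would typically be MultiByteToWideChar).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Wide version: does the actual work. */
size_t DemoCountDigitsW(const wchar_t *text)
{
    size_t count = 0;

    assert(text != NULL);
    for (; *text != L'\0'; ++text) {
        if (iswdigit((wint_t)*text)) {
            ++count;
        }
    }
    return count;
}

/* ANSI version: allocates a buffer, converts, forwards to the wide
   version, and frees the buffer. Returns (size_t)-1 on failure. */
size_t DemoCountDigitsA(const char *text)
{
    size_t cap, result;
    wchar_t *wide;

    assert(text != NULL);

    /* A converted string never has more characters than the source
       has bytes, so strlen(text) + 1 elements are always enough. */
    cap = strlen(text) + 1;
    wide = (wchar_t *)malloc(cap * sizeof *wide);
    if (wide == NULL) {
        return (size_t)-1;
    }
    if (mbstowcs(wide, text, cap) == (size_t)-1) {
        free(wide);
        return (size_t)-1;
    }
    result = DemoCountDigitsW(wide);
    free(wide);
    return result;
}

/* Callers use the generic name; the macro picks the version. */
#ifdef UNICODE
#define DemoCountDigits DemoCountDigitsW
#else
#define DemoCountDigits DemoCountDigitsA
#endif
```

Keeping the logic in the wide version means there is only one implementation to maintain; the ANSI entry point pays for one allocation and one conversion per call.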

Under Windows 98, Microsoft's source code for CreateWindowExA is the function that does the work. Windows 98 offers all the entry points to all the Windows functions that accept a Unicode parameter, but these functions do not translate Unicode strings to ANSI strings—they just return failure. A call to GetLastError returns ERROR_CALL_NOT_IMPLEMENTED. Only ANSI versions of these functions work properly. If your compiled code makes calls to any of the wide-character functions, your application will not run under Windows 98.

Certain functions in the Windows API, such as WinExec and OpenFile, exist solely for backward compatibility with 16-bit Windows programs and should be avoided. You should replace any calls to WinExec and OpenFile with calls to the CreateProcess and CreateFile functions. Internally, the old functions call the new functions anyway. The big problem with the old functions is that they don't accept Unicode strings. When you call these functions, you must pass ANSI strings. All the new and nonobsolete functions, on the other hand, do have both ANSI and Unicode versions on Windows 2000.

Windows String Functions

Windows also offers a comprehensive set of string manipulation functions. These functions are similar to the C run-time string functions, such as strcpy and wcscpy. However, the operating system functions are part of the OS, and many OS components use these functions instead of the C run-time library. I recommend that you favor the OS functions over the C run-time string functions. This will help your application's performance slightly because the OS string functions are used frequently by heavyweight applications such as the operating system's shell process, Explorer.exe. Since the functions are used heavily, they will probably already be loaded into RAM while your application runs.

To use these functions, the system must be running Windows 2000 or Windows 98. The functions are also available on earlier versions of Windows if Internet Explorer 4.0 or later is installed.

In classic OS function style, the OS string function names contain both uppercase and lowercase letters and look like this: StrCat, StrChr, StrCmp, and StrCpy (to name just a few). To use these functions, you must include the ShlWApi.h header file. Also, as previously discussed, these string functions come in both ANSI and Unicode versions, such as StrCatA and StrCatW. Because these are operating system functions, the symbols will expand to their wide versions if you define UNICODE (without the preceding underscore) when you build your application.