Making Your Application ANSI- and Unicode-Ready | Programming Applications for Microsoft Windows (Microsoft Programming Series)

[Previous] [Next]

It's a good idea to start converting your application to be Unicode-ready even if you don't plan to use Unicode right away. Here are the basic guidelines you should follow:

Start thinking of text strings as arrays of characters, not as arrays of chars or arrays of bytes.

Use generic data types (such as TCHAR and PTSTR) for text characters and strings.

Use explicit data types (such as BYTE and PBYTE) for bytes, byte pointers, and data buffers.

Use the TEXT macro for literal characters and strings.

Perform global replaces. (For example, replace PSTR with PTSTR.)

Modify string arithmetic problems. For example, functions usually expect you to pass a buffer's size in characters, not bytes. This means that you should not pass sizeof(szBuffer) but should instead pass (sizeof(szBuffer) / sizeof(TCHAR)). Also, if you need to allocate a block of memory for a string and you have the number of characters in the string, remember that you allocate memory in bytes. This means that you must call malloc(nCharacters * sizeof(TCHAR)) and not call malloc(nCharacters). Of all the guidelines I've just listed, this is the most difficult one to remember, and the compiler offers no warnings or errors if you make a mistake.

When I was developing the sample programs for the first edition of this book, I originally wrote them so that they compiled natively as ANSI-only. Then, when I began to write this chapter, I knew that I wanted to encourage the use of Unicode and was going to create sample programs to demonstrate how easy it is to create programs that can be compiled in both Unicode and ANSI. I decided that the best course of action was to convert all the sample programs in the book so that they could be compiled in both Unicode and ANSI.

I converted all the programs in about four hours, which isn't bad, considering that I didn't have any prior conversion experience.

Windows String Functions

Windows also offers a set of functions for manipulating Unicode strings, as described in the following table.

Function	Description
lstrcat	Concatenates one string onto the end of another
lstrcmp	Performs case-sensitive comparison of two strings
lstrcmpi	Performs case-insensitive comparison of two strings
lstrcpy	Copies one string to another location in memory
lstrlen	Returns the length of a string in characters

These functions are implemented as macros that call either the Unicode version of the function or the ANSI version of the function, depending on whether UNICODE is defined when the source module is compiled. For example, if UNICODE is not defined, lstrcat will expand to lstrcatA. If UNICODE is defined, lstrcat will expand to lstrcatW.

Two string functions, lstrcmp and lstrcmpi, behave differently from their equivalent C run-time functions. The C run-time functions strcmp, strcmpi, wcscmp, and wcscmpi simply compare the values of the code points in the strings; that is, the functions ignore the meaning of the actual characters and simply check the numeric value of each character in the first string with the numeric value of the character in the second string. The Windows functions lstrcmp and lstrcmpi, on the other hand, are implemented as calls to the Windows function CompareString:

 int CompareString( LCID lcid, DWORD fdwStyle, PCWSTR pString1, int cch1, PCTSTR pString2, int cch2);

This function compares two Unicode strings. The first parameter to CompareString specifies a locale ID (LCID), a 32-bit value that identifies a particular language. CompareString uses this LCID to compare the two strings by checking the meaning of the characters as they apply to a particular language. This action is much more meaningful than the simple number comparison performed by the C run-time functions.

When any of the lstrcmp family of functions calls CompareString, the function passes the result of calling the Windows GetThreadLocale function as the first parameter:

 LCID GetThreadLocale();

Every time a thread is created, it is assigned a locale. This function returns the current locale setting for the thread.

The second parameter of CompareString identifies flags that modify the method used by the function to compare the two strings. The following table shows the possible flags.

Flag	Meaning
NORM_IGNORECASE	Ignore case differences
NORM_IGNOREKANATYPE	Do not differentiate between hiragana and katakana characters
NORM_IGNORENONSPACE	Ignore nonspacing characters
NORM_IGNORESYMBOLS	Ignore symbols
NORM_IGNOREWIDTH	Do not differentiate between a single-byte character and the same character as a double-byte character
SORT_STRINGSORT	Treat punctuation the same as symbols

When lstrcmp calls CompareString, it passes 0 for the fdwStyle parameter. But when lstrcmpi calls CompareString, it passes NORM_IGNORECASE. The remaining four parameters of CompareString specify the two strings and their respective lengths. If you pass -1 for the cch1 parameter, the function assumes that the pString1 string is zero-terminated and calculates the length of the string. This also is true for the cch2 parameter with respect to the pString2 string.

Other C run-time functions don't offer good support for manipulating Unicode strings. For example, the tolower and toupper functions don't properly convert characters with accent marks. To compensate for these deficiencies in the C run-time library, you'll need to call the following Windows functions to convert the case of a Unicode string. These functions also work correctly for ANSI strings.

The first two functions,

 PTSTR CharLower(PTSTR pszString);

and

 PTSTR CharUpper(PTSTR pszString);

convert either a single character or an entire zero-terminated string. To convert an entire string, simply pass the address of the string. To convert a single character, you must pass the individual character as follows:

 TCHAR cLowerCaseChar = CharLower((PTSTR) szString[0]);

Casting the single character to a PTSTR calls the function, passing it a value in which the low 16 bits contain the character and the high 16 bits contain 0. When the function sees that the high bits are 0, the function knows that you want to convert a single character rather than a whole string. The value returned will be a 32-bit value with the converted character in the low 16 bits.

The next two functions are similar to the previous two except that they convert the characters contained inside a buffer (which does not need to be zero-terminated):

 DWORD CharLowerBuff( PTSTR pszString, DWORD cchString); DWORD CharUpperBuff( PTSTR pszString, DWORD cchString);

Other C run-time functions, such as isalpha, islower, and isupper, return a value that indicates whether a given character is alphabetic, lowercase, or uppercase. The Windows API offers functions that return this information as well, but the Windows functions also consider the language indicated by the user in the Control Panel:

 BOOL IsCharAlpha(TCHAR ch); BOOL IsCharAlphaNumeric(TCHAR ch); BOOL IsCharLower(TCHAR ch); BOOL IsCharUpper(TCHAR ch);

The printf family of functions is the last group of C run-time functions we'll discuss. If you compile your source module with _UNICODE defined, the printf family of functions expects that all the character and string parameters represent Unicode characters and strings. However, if you compile without defining _UNICODE, the printf family expects that all the characters and strings passed to it are ANSI.

Microsoft has added some special field types to their C run-time's printf family of functions. Some of these field types have not been adopted by ANSI C. The new types allow you to easily mix and match ANSI and Unicode characters and strings. The operating system's wsprintf function has also been enhanced. Here are some examples (note the use of capital S and lowercase s):

 char szA[100]; // An ANSI string buffer WCHAR szW[100]; // A Unicode string buffer // Normal sprintf: all strings are ANSI sprintf(szA, "%s", "ANSI Str"); // Converts Unicode string to ANSI sprintf(szA, "%S", L"Unicode Str"); // Normal swprintf: all strings are Unicode swprintf(szW, L"%s", L"Unicode Str"); // Converts ANSI string to Unicode swprintf(szW, L"%S", "ANSI Str");

Resources

When the resource compiler compiles all your resources, the output file is a binary representation of the resources. String values in your resources (string tables, dialog box templates, menus, and so on) are always written as Unicode strings. Under both Windows 98 and Windows 2000, the system performs internal conversions if your application doesn't define the UNICODE macro. For example, if UNICODE is not defined when you compile your source module, a call to LoadString will actually call the LoadStringA function. LoadStringA will then read the string from your resources and convert the string to ANSI. The ANSI representation of the string will be returned from the function to your application.

Determining If Text Is ANSI or Unicode

To date, there have been very few Unicode text files. In fact, most of Microsoft's own products do not ship with any Unicode text files. However, I expect that this trend could change in the future (albeit a long way into the future). Certainly, the Windows 2000 Notepad application allows you to open both Unicode and ANSI files as well as create them. In fact, Figure 2-2 shows Notepad's File Save As dialog box. Notice the different ways that you can save a text file.

click to view at full size.

Figure 2-2. The Windows 2000 Notepad File Save As dialog box

For many applications that open text files and process them, such as compilers, it would be convenient if, after opening a file, the application could determine whether the text file contained ANSI characters or Unicode characters. The IsTextUnicode function can help make this distinction:

 DWORD IsTextUnicode(CONST PVOID pvBuffer, int cb, PINT pResult);

The problem with text files is that there are no hard and fast rules as to their content. This makes it extremely difficult to determine whether the file contains ANSI or Unicode characters. IsTextUnicode uses a series of statistical and deterministic methods in order to guess at the content of the buffer. Because this is not an exact science, it is possible that IsTextUnicode will return an incorrect result.

The first parameter, pvBuffer, identifies the address of a buffer that you want to test. The data is a void pointer because you don't know whether you have an array of ANSI characters or an array of Unicode characters.

The second parameter, cb, specifies the number of bytes that pvBuffer points to. Again, because you don't know what's in the buffer, cb is a count of bytes rather than a count of characters. Note that you do not have to specify the entire length of the buffer. Of course, the more bytes IsTextUnicode can test, the more accurate a response you're likely to get.

The third parameter, pResult, is the address of an integer that you must initialize before calling IsTextUnicode. You initialize this integer to indicate which tests you want IsTextUnicode to perform. You can also pass NULL for this parameter, in which case IsTextUnicode will perform every test it can. (See the Platform SDK documentation for more details.)

If IsTextUnicode thinks that the buffer contains Unicode text, TRUE is returned; otherwise, FALSE is returned. That's right, the function actually returns a Boolean even though Microsoft prototyped it as returning a DWORD. If specific tests were requested in the integer pointed to by the pResult parameter, the function sets the bits in the integer before returning to reflect the results of each test.

Windows 98
Under Windows 98, the IsTextUnicode function has no useful implementation and simply returns FALSE; calling GetLastError returns ERROR_CALL_NOT_IMPLEMENTED.

The FileRev sample application presented in Chapter 17 demonstrates the use of the IsTextUnicode function.

Translating Strings Between Unicode and ANSI

The Windows function MultiByteToWideChar converts multibyte-character strings to wide-character strings. MultiByteToWideChar is shown below.

 int MultiByteToWideChar( UINT uCodePage, DWORD dwFlags, PCSTR pMultiByteStr, int cchMultiByte, PWSTR pWideCharStr, int cchWideChar);

The uCodePage parameter identifies a code page number that is associated with the multibyte string. The dwFlags parameter allows you to specify additional control that affects characters with diacritical marks such as accents. Usually the flags aren't used, and 0 is passed in the dwFlags parameter. The pMultiByteStr parameter specifies the string to be converted, and the cchMultiByte parameter indicates the length (in bytes) of the string. The function determines the length of the source string if you pass -1 for the cchMultiByte parameter.

The Unicode version of the string resulting from the conversion is written to the buffer located in memory at the address specified by the pWideCharStr parameter. You must specify the maximum size of this buffer (in characters) in the cchWideChar parameter. If you call MultiByteToWideChar, passing 0 for the cchWideChar parameter, the function doesn't perform the conversion and instead returns the size of the buffer required for the conversion to succeed. Typically, you will convert a multibyte-character string to its Unicode equivalent by performing the following steps:

Call MultiByteToWideChar, passing NULL for the pWideCharStr parameter and 0 for the cchWideChar parameter.

Allocate a block of memory large enough to hold the converted Unicode string. This size is returned by the previous call to MultiByteToWideChar.

Call MultiByteToWideChar again, this time passing the address of the buffer as the pWideCharStr parameter and passing the size returned by the first call to MultiByteToWideChar as the cchWideChar parameter.

Use the converted string.

Free the memory block occupying the Unicode string.

The function WideCharToMultiByte converts a wide-character string to its multibyte string equivalent, as shown here:

 int WideCharToMultiByte( UINT uCodePage, DWORD dwFlags, PCWSTR pWideCharStr, int cchWideChar, PSTR pMultiByteStr, int cchMultiByte, PCSTR pDefaultChar, PBOOL pfUsedDefaultChar);

This function is similar to the MultiByteToWideChar function. Again, the uCodePage parameter identifies the code page to be associated with the newly converted string. The dwFlags parameter allows you to specify additional control over the conversion. The flags affect characters with diacritical marks and characters that the system is unable to convert. Usually you won't need this degree of control over the conversion, and you'll pass 0 for the dwFlags parameter.

The pWideCharStr parameter specifies the address in memory of the string to be converted, and the cchWideChar parameter indicates the length (in characters) of this string. The function determines the length of the source string if you pass -1 for the cchWideChar parameter.

The multibyte version of the string resulting from the conversion is written to the buffer indicated by the pMultiByteStr parameter. You must specify the maximum size of this buffer (in bytes) in the cchMultiByte parameter. Passing 0 as the cchMultiByte parameter of the WideCharToMultiByte function causes the function to return the size required by the destination buffer. You'll typically convert a wide-byte character string to a multibyte-character string using a sequence of events similar to those discussed when converting a multibyte string to a wide-byte string.

You'll notice that the WideCharToMultiByte function accepts two parameters more than the MultiByteToWideChar function: pDefaultChar and pfUsedDefaultChar . These parameters are used by the WideCharToMultiByte function only if it comes across a wide character that doesn't have a representation in the code page identified by the uCodePage parameter. If the wide character cannot be converted, the function uses the character pointed to by the pDefaultChar parameter. If this parameter is NULL, which is most common, the function uses a system default character. This default character is usually a question mark. This is dangerous for filenames because the question mark is a wildcard character.

The pfUsedDefaultChar parameter points to a Boolean variable that the function sets to TRUE if at least one character in the wide-character string could not be converted to its multibyte equivalent. The function sets the variable to FALSE if all the characters convert successfully. You can test this variable after the function returns to check whether the wide-character string was converted successfully. Again, you usually pass NULL for this parameter.

For a more complete description of how to use these functions, please refer to the Platform SDK documentation.

You could use these two functions to easily create both Unicode and ANSI versions of functions. For example, you might have a dynamic-link library containing a function that reverses all the characters in a string. You could write the Unicode version of the function as shown here:

 BOOL StringReverseW(PWSTR pWideCharStr) { // Get a pointer to the last character in the string. PWSTR pEndOfStr = pWideCharStr + wcslen(pWideCharStr) - 1; wchar_t cCharT; // Repeat until we reach the center character in the string. while (pWideCharStr < pEndOfStr) { // Save a character in a temporary variable. cCharT = *pWideCharStr; // Put the last character in the first character. *pWideCharStr = *pEndOfStr; // Put the temporary character in the last character. *pEndOfStr = cCharT; // Move in one character from the left. pWideCharStr++; // Move in one character from the right. pEndOfStr--; } // The string is reversed; return success. return(TRUE); }

And you could write the ANSI version of the function so that it doesn't perform the actual work of reversing the string at all. Instead, you could write the ANSI version so that it converts the ANSI string to Unicode, passes the Unicode string to the StringReverseW function, and then converts the reversed string back to ANSI. The function would look like this:

 BOOL StringReverseA(PSTR pMultiByteStr) { PWSTR pWideCharStr; int nLenOfWideCharStr; BOOL fOk = FALSE; // Calculate the number of characters needed to hold // the wide-character version of the string. nLenOfWideCharStr = MultiByteToWideChar(CP_ACP, 0, pMultiByteStr, -1, NULL, 0); // Allocate memory from the process's default heap to // accommodate the size of the wide-character string. // Don't forget that MultiByteToWideChar returns the // number of characters, not the number of bytes, so // you must multiply by the size of a wide character. pWideCharStr = HeapAlloc(GetProcessHeap(), 0, nLenOfWideCharStr * sizeof(WCHAR)); if (pWideCharStr == NULL) return(fOk); // Convert the multibyte string to a wide-character string. MultiByteToWideChar(CP_ACP, 0, pMultiByteStr, -1, pWideCharStr, nLenOfWideCharStr); // Call the wide-character version of this // function to do the actual work. fOk = StringReverseW(pWideCharStr); if (fOk) { // Convert the wide-character string back // to a multibyte string. WideCharToMultiByte(CP_ACP, 0, pWideCharStr, -1, pMultiByteStr, strlen(pMultiByteStr), NULL, NULL); } // Free the memory containing the wide-character string. HeapFree(GetProcessHeap(), 0, pWideCharStr); return(fOk); }

Finally, in the header file that you distribute with the dynamic-link library, you would prototype the two functions as follows:

 BOOL StringReverseW(PWSTR pWideCharStr); BOOL StringReverseA(PSTR pMultiByteStr); #ifdef UNICODE #define StringReverse StringReverseW #else #define StringReverse StringReverseA #endif // !UNICODE