Unicode Text and Strings

Before starting out there are a couple of topics that need to be covered, and the first of these is Unicode. Windows 98 API functions have partial support for Unicode strings, and Windows 2000 and NT allow applications to call either Unicode or ANSI versions of the API functions. Windows CE, on the other hand, only supports Unicode, so you will need to write your applications using Unicode strings and text.

Most of us grew up safe in the knowledge that a character was stored in a single byte using eight bits. Character strings are stored in "char" arrays and are terminated with a NULL (the ANSI 0 character). Strangely enough, the "char" data type is signed by default, but we get used to that. The problem is that the languages of the world use many more than the 256 characters that fit in a "char", so tricks need to be employed to support all these characters. Two such tricks are:

Use multi-byte character strings (MBCS), where special characters act as lead-ins indicating that the next character should be treated as an entirely different character.
Use Code Pages, in which the same ANSI character number is used to display completely different characters depending on which code page is loaded.

Neither of these tricks is satisfactory. Parsing MBCS strings is difficult; for example, the length of a string can only be determined by traversing the entire string and inspecting each character. With code pages, you can display completely incorrect text by having the wrong code page loaded for the text being displayed. The Unicode solution uses two bytes to store a single character. This allows up to 65536 different characters to be represented, which is intended to be enough for all the languages around the world. With Unicode, a character is stored as an unsigned two-byte integer value. Such characters are also known as "wide characters". The Unicode characters in the range 0x00 to 0x7F are reserved for the 7-bit ANSI characters, so these ANSI characters always have the high byte set to zero when represented in Unicode.

Compilers do not provide native support for Unicode; that is, there is no magic compiler switch that changes a char from one byte to two bytes. Instead, support for Unicode is achieved through defines and typedef statements in header files. The data type wchar_t is used to represent a Unicode character, and an array of wchar_t is used to store strings. As with ANSI strings, a NULL terminates a string, but this is a two-byte rather than a one-byte value. ANSI strings and characters can be used alongside Unicode strings and characters, so you can continue to use the "char" data type. This is important because data coming from the outside world (through the Internet or as a file) may use ANSI characters, and these need to be converted before being used.
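As a minimal sketch of these points, the following portable C fragment works with wide strings stored in wchar_t arrays. The helper names is_seven_bit and wide_storage_bytes are made up for this illustration and are not part of any Windows CE header. Note that sizeof(wchar_t) is two bytes with the Microsoft compilers discussed here, but it may be larger on other platforms, which is why the code uses sizeof rather than a hard-coded 2.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if every character in a wide string
   lies in the range 0x00 to 0x7F, so that everything above the low
   seven bits is zero; returns 0 otherwise. */
int is_seven_bit(const wchar_t *s)
{
    assert(s != NULL);
    for (; *s != L'\0'; ++s) {
        if ((unsigned long)*s > 0x7FUL)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: number of bytes needed to store a wide string,
   including its wide terminating null. */
size_t wide_storage_bytes(const wchar_t *s)
{
    assert(s != NULL);
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

With a two-byte wchar_t, wide_storage_bytes(L"Hello") would report 12 bytes, against the 6 bytes needed for the equivalent "char" string.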

Unicode strings obviously take twice as much space to store as ANSI strings. In many applications the majority of strings stored using Unicode actually contain ANSI characters, so every other byte is zero. In Windows CE, the compression algorithms used to store data in the object store (that is, data stored in files or databases) are optimized to recognize this sequence.

Generic String and Character Data Types

You can use the standard Unicode data type wchar_t, but it is more usual to use generic string data types, and then use compiler defines to specify which character type should be used for the compilation. This lets you write portable code that can be compiled for either ANSI or Unicode. The define _UNICODE is set either as a compiler switch or using #define to indicate that the Unicode version of API functions should be used. Some header files expect the UNICODE define to be used, so both often end up being defined. The define _MBCS selects multi-byte character strings (MBCS), which are used in Windows NT/98/2000 to compile for ANSI characters but are not supported under Windows CE. If neither _MBCS nor _UNICODE is defined, the header files default to single-byte character strings (SBCS). SBCS strings don't use lead-in characters to extend the supported range of characters.

To use generic string and character data types, include the file tchar.h and ensure that _UNICODE or _MBCS is defined as appropriate. To declare a character, use the data type TCHAR, and this will be compiled to wchar_t or char depending on the define in operation. The following code declares a character variable and a character string that can store up to ten characters including the terminating NULL:

 TCHAR cChar;
 TCHAR szArray[10];

Rather than using the LPSTR data type for specifying a pointer to a character string, you should use LPTSTR. This will be compiled to either a "char*" or a "wchar_t*".

String Constants

In the following code fragment, the string constant "my string" will always be compiled as an ANSI character string constant using one byte per character.

 LPTSTR lpszStr = "my string";

You will get a compiler type mismatch error if you try to compile this code with _UNICODE defined. The header file tchar.h declares two macros, "_T" and "_TEXT", that are used to specify Unicode character string constants when _UNICODE is defined, and ANSI character string constants when _MBCS is defined. So, the previous line of code should be written in either of the following ways:

 LPTSTR lpszStr = _T("my string");

 LPTSTR lpszStr = _TEXT("my string");

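The L prefix can be used to force a Unicode string constant. It is part of the C language rather than a macro, so it is written directly in front of the opening quote without parentheses. In this next line of code, the LPWSTR data type declares a Unicode string pointer and points it to a Unicode string constant.

 LPWSTR lpszStr = L"my string";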
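To make the mechanism behind the generic types and string constant macros more concrete, here is a simplified, portable sketch of how a header can map one set of names onto either character type. The names MY_UNICODE, MY_TCHAR, MY_T and my_tcslen are hypothetical stand-ins for _UNICODE, TCHAR, _T and _tcslen, and the two helper functions are invented for the example; the real definitions in tchar.h are more elaborate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#define MY_UNICODE 1            /* remove this line to build the char version */

#ifdef MY_UNICODE
typedef wchar_t MY_TCHAR;       /* generic character is a wide character */
#define MY_T(x)    L ## x       /* pastes the L prefix onto the literal */
#define my_tcslen  wcslen
#else
typedef char MY_TCHAR;          /* generic character is a single byte */
#define MY_T(x)    x            /* literal is left unchanged */
#define my_tcslen  strlen
#endif

/* Returns a generic string constant; its type follows MY_TCHAR. */
const MY_TCHAR *greeting(void)
{
    return MY_T("my string");
}

/* Returns the length, in characters, of a generic string. */
size_t generic_length(const MY_TCHAR *s)
{
    assert(s != NULL);
    return my_tcslen(s);
}
```

The same source compiles either way because every character type, literal and library call goes through a name that the preprocessor resolves. A bare "my string" literal in greeting would fail to compile in the wide build, which is the type mismatch described above.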

With Windows CE programming you will need to use the _T or _TEXT macro around just about every string constant. My preference is for _T, only because it is shorter. I like to set up an eMbedded Visual C++ macro and assign it to the Ctrl+T key sequence to generate the _T(" ") sequence in the source file. To do this:

Select the Tools+Macro menu command.
Enter the name of the macro, say "T", and click the Record button.
Enter the text _T(" ") into a source file, followed by two left arrow key presses to locate the cursor between the two double quotes.
Turn off recording by pressing the Macro toolbar icon with a square box.
Select the Tools+Macro menu command again, this time to assign the macro to a keystroke.
Select your macro from the list and click the Options button.
Select the Keystrokes button, and assign the macro to the required keystroke, for example Ctrl+T.

Macros in eMbedded Visual C++ are recorded using VB Script. Here is the source for the _T macro:

 Sub T()
 'DESCRIPTION: A macro to enter _T("") into a source file.
 'Begin Recording
   ActiveDocument.Selection = "_T("""")"
   ActiveDocument.Selection.CharLeft dsMove, 2
 'End Recording
 End Sub

Calculating String Buffer Lengths

One of the most common bugs introduced when moving to Unicode programming concerns calculating buffer lengths. All too often, code assumes that characters are stored in one byte. For example:

 TCHAR szBuffer[200];
 DWORD dwLen;
 dwLen = sizeof(szBuffer);

We might expect dwLen to contain the value 200, but when compiled for Unicode it will actually contain 400, which is the number of bytes occupied by szBuffer. If dwLen were passed to a function to indicate how many characters can be placed in szBuffer, the application might fail, as the function could write beyond the bounds of the array szBuffer. The following code should be used instead, and this will work for both ANSI and Unicode compilation.

 dwLen = sizeof(szBuffer) / sizeof(TCHAR);

When passing the length of a string buffer to a function, check whether the function expects the size of the buffer in bytes or characters.
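The distinction between bytes and characters can be illustrated with a short portable sketch. COUNT_OF, copy_truncated and demo_copy are hypothetical names introduced only for this example, and wchar_t stands in for TCHAR in a Unicode build.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical macro: number of elements in an array. It must be given
   a real array, not a pointer. */
#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical helper: copies src into dst, truncating if necessary.
   cchDst is the capacity of dst in CHARACTERS, including the
   terminating null. */
void copy_truncated(wchar_t *dst, size_t cchDst, const wchar_t *src)
{
    assert(dst != NULL && src != NULL && cchDst > 0);
    wcsncpy(dst, src, cchDst - 1);
    dst[cchDst - 1] = L'\0';    /* wcsncpy does not terminate on truncation */
}

/* Copies a long string into a small buffer and returns the length of
   the truncated result. */
size_t demo_copy(void)
{
    wchar_t szBuffer[8];
    copy_truncated(szBuffer, COUNT_OF(szBuffer), L"a long wide string");
    return wcslen(szBuffer);
}
```

Passing sizeof(szBuffer) instead of COUNT_OF(szBuffer) here would tell copy_truncated that the buffer holds at least twice as many characters as it really does, which is exactly the overrun described above.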

Standard String Library Functions

We are all accustomed to the standard C run-time functions for string manipulation, such as strlen, strcpy, and so on. These functions work with the "char" data type and cannot be used for Unicode strings. Unicode equivalent functions are provided, such as wcslen and wcscpy (standing for "wide character string length" and "wide character string copy").

Generic string functions are also available which will be compiled to the ANSI or Unicode function equivalents. For example, the function _tcslen will compile to strlen if _MBCS is defined, or wcslen if _UNICODE is defined. The header file tchar.h should be included to enable this behavior. Using the _tcs functions makes code portable between ANSI and Unicode. The samples in this book tend to use the wcs functions rather than _tcs, since I never intend to port this code away from Unicode. Table 1.1 shows some of the common C run-time string functions and their generic and Unicode equivalents.
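As a rough illustration of the wcs functions in use, the sketch below builds a path from two wide strings and then locates its file extension. The helpers join_path and find_extension are hypothetical and exist only for this example; they rely on the standard functions wcslen, wcscpy, wcscat and wcsrchr (the wide counterpart of strrchr).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: builds "<dir>\<file>" in dst using the wide
   string functions. cchDst is the capacity of dst in characters.
   Returns 1 on success, 0 if the result would not fit. */
int join_path(wchar_t *dst, size_t cchDst,
              const wchar_t *dir, const wchar_t *file)
{
    assert(dst != NULL && dir != NULL && file != NULL);
    /* +2 allows for one separator and the terminating null */
    if (wcslen(dir) + wcslen(file) + 2 > cchDst)
        return 0;
    wcscpy(dst, dir);
    wcscat(dst, L"\\");
    wcscat(dst, file);
    return 1;
}

/* Hypothetical helper: returns a pointer to the text after the last
   '.' in path, or NULL if path contains no '.'. */
const wchar_t *find_extension(const wchar_t *path)
{
    const wchar_t *dot;
    assert(path != NULL);
    dot = wcsrchr(path, L'.');
    return dot ? dot + 1 : NULL;
}
```

Every length in this code is a count of characters, not bytes, so the same arithmetic would apply unchanged to the strlen family in an ANSI build.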

Converting Between ANSI and Unicode Strings

There are times when you will need to convert ANSI strings or characters to Unicode and vice versa. Examples include:

Reading an ANSI text file into a Windows CE application
Reading and writing characters from a serial device that supplies data in ANSI
Reading and writing data from Internet servers, such as web or email servers, most of which expect text in ANSI

Table 1.1. Common C run-time string functions with generic and Unicode equivalents
Purpose	Generic String Function	ANSI Function	Unicode Function
Return length of string in characters	`_tcslen`	`strlen`	`wcslen`
Concatenate strings	`_tcscat`	`strcat`	`wcscat`
Search for character in string	`_tcschr`	`strchr`	`wcschr`
Compare two strings	`_tcscmp`	`strcmp`	`wcscmp`
Copy a string	`_tcscpy`	`strcpy`	`wcscpy`
Find one string in another	`_tcsstr`	`strstr`	`wcsstr`
Reverse a string	`_tcsrev`	`_strrev`	`_wcsrev`

Converting an ANSI character in the range 0x00 to 0x7F to Unicode is easy: all you need to do is set the high byte in the Unicode character to zero and copy the ANSI character into the low byte. In this next code fragment, the MAKEWORD macro combines a low byte and high byte into a single two-byte word, and the result is assigned to a Unicode character.

 WCHAR wC;
 char c = 'C';
 wC = MAKEWORD(c, 0);
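MAKEWORD is specific to the Windows headers. A portable sketch of the same idea, using a hypothetical helper called widen_char, is shown below. The intermediate cast to unsigned char matters because "char" is usually signed: without it, a byte value above 0x7F would be sign-extended and the upper bits of the wide character would not be zero.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: widens a single-byte character by zero-extending
   it. Going through unsigned char prevents sign extension when the
   char type is signed. The result is only a correct Unicode character
   for the 7-bit range 0x00 to 0x7F discussed in the text. */
wchar_t widen_char(char c)
{
    return (wchar_t)(unsigned char)c;
}

/* Hypothetical helper: returns 1 if widening c gives a value whose
   bits above the low byte are all zero, which should always hold. */
int high_bits_clear(char c)
{
    wchar_t w = widen_char(c);
    assert((unsigned long)w <= 0xFFUL);
    return ((unsigned long)w >> 8) == 0;
}
```

Characters outside the 7-bit range depend on the code page in use, so they are better handled with the string conversion functions than with a simple cast.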

You can convert strings using one of the C run-time functions:

mbstowcs Convert a multi-byte (ANSI) string to wide character string (Unicode)
wcstombs Convert a wide character string to multi-byte string

Both of these functions take three arguments: the buffer in which to place the converted string, the string to convert, and the maximum number of characters (for mbstowcs) or bytes (for wcstombs) that can be placed in the buffer. Both functions return the number of converted characters placed in the buffer, not counting the terminating NULL, or (size_t)-1 if the string could not be converted. The following code converts an ANSI string to Unicode and a Unicode string to ANSI.

 WCHAR szwcBuffer[100];
 char szBuffer[100];
 char* lpszConvert = "ANSI String to convert";
 /* L forces a Unicode constant, matching the WCHAR* type */
 WCHAR* lpszwcConvert = L"Unicode string to convert";
 size_t nChars;
 nChars = mbstowcs(szwcBuffer, lpszConvert, 100);
 nChars = wcstombs(szBuffer, lpszwcConvert, 100);
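The fragment above does not inspect the return values. As a hedged sketch of one way to do so, the hypothetical wrappers to_wide and to_multibyte below treat both an invalid input string and a result that leaves no room for the terminating null as failures. These wrappers are not part of the C run-time or the Windows CE API; they only use the standard mbstowcs and wcstombs functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: converts a multi-byte string to a wide string.
   cchDst is the capacity of dst in characters. Returns 1 on success,
   0 if src is invalid or the result (with terminator) does not fit. */
int to_wide(wchar_t *dst, size_t cchDst, const char *src)
{
    size_t n;
    assert(dst != NULL && src != NULL && cchDst > 0);
    n = mbstowcs(dst, src, cchDst);
    if (n == (size_t)-1 || n == cchDst) {
        /* invalid sequence, or buffer filled without a terminator */
        dst[0] = L'\0';
        return 0;
    }
    return 1;
}

/* Hypothetical wrapper: converts a wide string to a multi-byte string.
   cbDst is the capacity of dst in bytes. Returns 1 on success, 0 if a
   character cannot be converted or the result does not fit. */
int to_multibyte(char *dst, size_t cbDst, const wchar_t *src)
{
    size_t n;
    assert(dst != NULL && src != NULL && cbDst > 0);
    n = wcstombs(dst, src, cbDst);
    if (n == (size_t)-1 || n == cbDst) {
        dst[0] = '\0';
        return 0;
    }
    return 1;
}
```

The check against the buffer size is needed because both functions stop without writing a terminating null when the output buffer fills up, so a return value equal to the limit means the result cannot safely be used as a string.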

If you are using code pages, the Windows API functions MultiByteToWideChar and WideCharToMultiByte should be used instead, since they let you specify the code page to be used for the conversion.