String Data Types, Conversion Classes, and Helper Functions

A Review of Text Data Types

The text data type is somewhat of a pain to deal with in C++ programming. The main problem is that there isn't just one text data type; there are many of them. I use the term text data type here in the general sense of an array of characters. Often, different operating systems and programming languages introduce additional semantics for an array of characters (for example, NUL character termination or a length prefix) before they consider an array of characters a text string.

When you select a text data type, you must make a number of decisions. First, you must decide what type of characters constitute the array. Some operating systems require you to use ANSI characters when you pass a string (such as a file name) to the operating system. Some operating systems prefer that you use Unicode characters but will accept ANSI characters. Other operating systems require you to use EBCDIC characters. Stranger character sets are in use as well, such as the Multi/Double Byte Character Sets (MBCS/DBCS); this book largely doesn't discuss those details.

Second, you must consider what character set you want to use to manipulate text within your program. No requirement states that your source code must use the same character set that the operating system running your program prefers. Clearly, it's more convenient when both use the same character set, but a program and the operating system can use different character sets. You "simply" must convert all text strings going to and coming from the operating system.

Third, you must determine the length of a text string. Some languages, such as C and C++, and some operating systems, such as Windows 9x/NT/XP and UNIX, use a terminating NUL character to delimit the end of a text string. Other languages, such as the Microsoft Visual Basic interpreter, Microsoft Java virtual machine, and Pascal, prefer an explicit length prefix specifying the number of characters in the text string.
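
To make the difference between the two length conventions concrete, here is a minimal sketch in portable C++. The PrefixedText structure and both helper names are hypothetical, invented for this illustration; they do not correspond to any operating system or COM type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical length-prefixed string: the count is stored explicitly,
// so the text may legally contain embedded NUL characters.
struct PrefixedText {
    std::size_t length;   // number of characters, excluding any terminator
    const char* chars;    // not required to be NUL-terminated
};

// NUL-terminated convention: scan until the first '\0' (linear time).
inline std::size_t TerminatedLength(const char* s) {
    return std::strlen(s);
}

// Length-prefixed convention: read the stored count (constant time).
inline std::size_t PrefixedLength(const PrefixedText& t) {
    return t.length;
}
```

Given the five characters a, b, NUL, c, d, the terminated convention sees a string of length 2, whereas a prefix of 5 preserves all of them. That is one reason a length-prefixed string cannot always be handed to code that expects NUL termination.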

Finally, in practice, a text string presents a resource-management issue. Text strings typically vary in length. This makes it difficult to allocate memory for the string on the stack, and the text string might not fit on the stack at all. Therefore, text strings are often dynamically allocated. Of course, this means that a text string must be freed eventually. Resource management introduces the idea of an owner of a text string. Only the owner frees the string, and frees it only once. Ownership becomes quite important when you pass a text string between components.
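
The single-owner rule can be sketched in standard C++ with std::unique_ptr. This is only an illustration of the ownership idea, not how COM allocates strings; OwnedText, DuplicateText, and ConsumeText are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// The unique_ptr is the single owner; its destructor frees the buffer once.
using OwnedText = std::unique_ptr<char[]>;

// Returns a heap copy of src; the caller becomes the owner.
inline OwnedText DuplicateText(const char* src) {
    const std::size_t n = std::strlen(src) + 1;   // include the terminator
    OwnedText copy(new char[n]);
    std::memcpy(copy.get(), src, n);
    return copy;
}

// Takes ownership from the caller; the buffer is freed when this returns.
inline std::size_t ConsumeText(OwnedText text) {
    return std::strlen(text.get());
}
```

Passing the string with std::move makes the hand-off explicit: after the call, the caller's pointer is empty, so it cannot free the buffer a second time.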

To make matters worse, two COM objects can reside on two different computers running two different operating systems that prefer two different character sets for a text string. For example, you can write one COM object in Visual Basic and run it on the Windows XP operating system. You might pass a text string to another COM object written in C++ running on an IBM mainframe. Clearly, we need some standard text data type that all COM objects in a heterogeneous environment can understand.

COM uses the OLECHAR character data type. A COM text string is a NUL-character-terminated array of OLECHAR characters; a pointer to such a string is an LPOLESTR.^[1] As a rule, a text string parameter to a COM interface method should be of type LPOLESTR. When a method doesn't change the string, the parameter should be of type LPCOLESTR, that is, a pointer to a constant array of OLECHAR characters.

^[1] Note that the actual underlying character data type for OLECHAR on one operating system can be different from the underlying character data type for OLECHAR on a different operating system. The COM remoting infrastructure performs any necessary character set conversion during marshaling and unmarshaling. Therefore, a COM component always receives text in its expected OLECHAR format.

Frequently, though not always, the OLECHAR type isn't the same as the characters you use when writing your code. Sometimes, though not always, the OLECHAR type isn't the same as the characters you must provide when passing a text string to the operating system. This means that, depending on context, sometimes you need to convert a text string from one character set to another, and sometimes you don't.

Unfortunately, a change in compiler options (for example, a Windows XP Unicode build or a Windows CE build) can change this context. As a result, code that previously didn't need to convert a string might require conversion, or vice versa. You don't want to rewrite all string-manipulation code each time you change a compiler option. Therefore, ATL provides a number of string-conversion macros that convert a text string from one character set to another and are sensitive to the context in which you invoke the conversion.

Windows Character Data Types

Now let's focus specifically on the Windows platform. Windows-based COM components typically use a mix of four text data types:

Unicode. A specification for representing a character as a "wide-character," 16-bit multilingual character code. The Windows NT/XP operating system uses the Unicode character set internally. All characters used in modern computing worldwide, including technical symbols and special publishing characters, can be represented uniquely in Unicode. The fixed character size simplifies programming when using international character sets. In C/C++, you represent a wide-character string as a wchar_t array; a pointer to such a string is a wchar_t*.
MBCS/DBCS. The Multi-Byte Character Set is a mixed-width character set in which some characters consist of more than 1 byte. The Windows 9x operating systems, in general, use the MBCS to represent characters. The Double-Byte Character Set (DBCS) is a specific type of multibyte character set. It includes some characters that consist of 1 byte and some characters that consist of 2 bytes to represent the symbols for one specific locale, such as the Japanese, Chinese, and Korean languages.
In C/C++, you represent an MBCS/DBCS string as an unsigned char array; a pointer to such a string is an unsigned char*. Sometimes a character is one unsigned char in length; sometimes, it's more than one. This is loads of fun to deal with, especially when you're trying to back up through a string. In Visual C++, MBCS always means DBCS. Character sets wider than 2 bytes are not supported.
ANSI. You can represent all characters in the English language, as well as many Western European languages, using only 8 bits. Versions of Windows that support such languages use a degenerate case of MBCS, called the Microsoft Windows ANSI character set, in which no multibyte characters are present. The Microsoft Windows ANSI character set, which is essentially ISO 8859/x plus additional characters, was originally based on an ANSI draft standard.
The ANSI character set maps the letters and numerals in the same manner as ASCII. However, ANSI does not support control characters and maps many symbols, including accented letters, that are not mapped in standard ASCII. All Windows fonts are defined in the ANSI character set. This is also called the Single-Byte Character Set (SBCS), for symmetry.
In C/C++, you represent an ANSI string as a char array; a pointer to such a string is a char*. A character is always one char in length. By default, a char is a signed char in Visual C++. Because MBCS characters are unsigned and ANSI characters are, by default, signed characters, expressions can evaluate differently when using ANSI characters, compared to using MBCS characters.
TCHAR/_TCHAR. This is a Microsoft-specific generic-text data type that you can map to a Unicode character, an MBCS character, or an ANSI character using compile-time options. You use this character type to write generic code that can be compiled for any of the three character sets. This simplifies code development for international markets. The C runtime library defines the _TCHAR type, and the Windows operating system defines the TCHAR type; they are synonymous.
tchar.h, a Microsoft-specific C runtime library header file, defines the generic-text data type _TCHAR. ANSI C/C++ compiler compliance requires implementer-defined names to be prefixed by an underscore. When you do not define the __STDC__ preprocessor symbol (by default, this macro is not defined in Visual C++), you indicate that you don't require ANSI compliance. In this case, the tchar.h header file also defines the symbol TCHAR as another alias for the generic-text data type if it isn't already defined. winnt.h, a Microsoft-specific Win32 operating system header file, defines the generic-text data type TCHAR. This header file is operating system specific, so the symbol names don't need the underscore prefix.
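
To show why mixed-width text is awkward, the following sketch counts characters (not bytes) in a DBCS string. The lead-byte ranges are an assumption modeled on Shift-JIS (code page 932), and IsLeadByteSketch and DbcsCharCount are hypothetical names; real Windows code would ask the operating system or the C runtime (for example, IsDBCSLeadByte or _ismbblead) instead of hard-coding ranges.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: lead-byte ranges modeled on Shift-JIS (code page 932).
inline bool IsLeadByteSketch(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts characters (not bytes) in a NUL-terminated DBCS string.
inline std::size_t DbcsCharCount(const unsigned char* s) {
    std::size_t count = 0;
    while (*s != 0) {
        // A lead byte followed by a trail byte forms one 2-byte character.
        if (IsLeadByteSketch(*s) && s[1] != 0) {
            s += 2;
        } else {
            s += 1;
        }
        ++count;
    }
    return count;
}
```

Walking forward works because you always know whether you are standing on a lead byte. Walking backward is harder: a trail byte can have the same value as an ordinary single-byte character, so one byte examined in isolation doesn't tell you where the previous character starts.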

Win32 APIs and Strings

Each Win32 API that requires a string has two versions: one that requires a Unicode argument and another that requires an MBCS argument. On a non-MBCS-enabled version of Windows, the MBCS version of an API expects an ANSI argument. For example, the SetWindowText API doesn't really exist. There are actually two functions: SetWindowTextW, which expects a Unicode string argument, and SetWindowTextA, which expects an MBCS/ANSI string argument.

The Windows NT/2000/XP operating systems internally use only Unicode strings. Therefore, when you call SetWindowTextA on Windows NT/2000/XP, the function translates the specified string to Unicode and then calls SetWindowTextW. The Windows 9x operating systems do not support Unicode directly. The SetWindowTextA function on the Windows 9x operating systems does the work, while SetWindowTextW returns an error. The MSLU library from Microsoft^[2] provides implementations of almost all the Unicode functions on Win9x.

^[2] More information on MSLU is available at http://www.microsoft.com/globaldev/handson/dev/mslu_announce.mspx (http://tinysells.com/49).

This gives you a difficult choice. You could write a performance-optimized component using Unicode character strings that runs on Windows 2000 but not on Windows 9x. You could use MSLU for Unicode strings on both families and lose performance on Windows 9x. You could write a more general component using MBCS/ANSI character strings that runs on both operating systems but not optimally on Windows 2000. Alternatively, you could hedge your bets by writing source code that enables you to decide at compile time what character set to support.

A little coding discipline and some preprocessor magic let you code as if there were a single API called SetWindowText that expects a TCHAR string argument. You specify at compile time which kind of component you want to build. For example, you write code that calls SetWindowText and specifies a TCHAR buffer. When compiling a component as Unicode, you call SetWindowTextW; the argument is a wchar_t buffer. When compiling an MBCS/ANSI component, you call SetWindowTextA; the argument is a char buffer.
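
The preprocessor magic can be sketched without any Windows headers. Everything in this block is a hypothetical stand-in: SetTitleA and SetTitleW play the role of an A/W function pair, SKETCH_UNICODE plays the role of UNICODE, and SKETCH_TCHAR and SKETCH_TEXT mimic TCHAR and TEXT. The real headers are more elaborate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for an A/W API pair; each returns the title length.
inline std::size_t SetTitleA(const char* title)    { return std::strlen(title); }
inline std::size_t SetTitleW(const wchar_t* title) { return std::wcslen(title); }

// SKETCH_UNICODE plays the role of the UNICODE symbol in this sketch.
#ifdef SKETCH_UNICODE
typedef wchar_t SKETCH_TCHAR;
#define SKETCH_TEXT(s) L##s
#define SetTitle SetTitleW
#else
typedef char SKETCH_TCHAR;
#define SKETCH_TEXT(s) s
#define SetTitle SetTitleA
#endif

// Generic-text caller: the same source compiles against either function.
inline std::size_t ApplyDefaultTitle() {
    const SKETCH_TCHAR* title = SKETCH_TEXT("Untitled");
    return SetTitle(title);
}
```

ApplyDefaultTitle never names a character type or a suffixed function directly, so defining or removing the one symbol switches the whole call chain from char to wchar_t.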

When you write a Windows-based COM component, you should typically use the TCHAR character type to represent characters used by the component internally. Additionally, you should use it for all characters used in interactions with the operating system. Similarly, you should use the TEXT or __TEXT macro to surround every literal character or string.

tchar.h defines the functionally equivalent macros _T, __T, and _TEXT, which all compile a character or string literal as a generic-text character or literal. winnt.h also defines the functionally equivalent macros TEXT and __TEXT, which are yet more synonyms for _T, __T, and _TEXT. (There's nothing like five ways to do exactly the same thing.) The examples in this chapter use __TEXT because it's defined in winnt.h. I actually prefer _T because it's less clutter in my source code.

An operating-system-agnostic coding approach favors including tchar.h and using the _TCHAR generic-text data type because that's somewhat less tied to the Windows operating systems. However, we're discussing building components with text handling optimized at compile time for specific versions of the Windows operating systems. This argues that we should use TCHAR, the type defined in winnt.h. Plus, TCHAR isn't as jarring to the eyes as _TCHAR and it's easier to type. Most code already implicitly includes the winnt.h header file via windows.h, and you must explicitly include tchar.h. All sorts of good reasons support using TCHAR, so the examples in this book use this as the generic-text data type.

This means that you can compile specialized versions of the component for different markets or for performance reasons. These types and macros are defined in the winnt.h header file.

You also must use a different set of string runtime library functions when manipulating strings of TCHAR characters. The familiar functions strlen, strcpy, and so on operate only on char characters. The less familiar functions wcslen, wcscpy, and so on work on wchar_t characters. Moreover, the totally strange functions _mbslen, _mbscpy, and so on work on multibyte characters. Because TCHAR characters are sometimes wchar_t, sometimes char-holding ANSI characters, and sometimes char-holding (nominally unsigned) multibyte characters, you need an equivalent set of runtime library functions that work with TCHAR characters.

The tchar.h header file defines a number of useful generic-text mappings for string-handling functions. These functions expect TCHAR parameters, so all their function names use the _tcs (the _t character set) prefix. For example, _tcslen is equivalent to the C runtime library strlen function. The _tcslen function expects TCHAR characters, whereas the strlen function expects char characters.

Controlling Generic-Text Mapping Using the Preprocessor

Two preprocessor symbols and two macros control the mapping of the TCHAR data type to the underlying character type the application uses.

UNICODE/_UNICODE. The header files for the Windows operating system APIs use the UNICODE preprocessor symbol. The C/C++ runtime library header files use the _UNICODE preprocessor symbol. Typically, you define either both symbols or neither of them. When you compile with the symbol _UNICODE defined, tchar.h maps all TCHAR characters to wchar_t characters. The _T, __T, and _TEXT macros prefix each character or string literal with a capital L (creating a Unicode character or literal, respectively). When you compile with the symbol UNICODE defined, winnt.h maps all TCHAR characters to wchar_t characters. The TEXT and __TEXT macros prefix each character or string literal with a capital L (creating a Unicode character or literal, respectively). The _tcsXXX functions are mapped to the corresponding wide-character functions: wcsXXX for the standard functions (such as wcscat) and _wcsXXX for the Microsoft-specific ones (such as _wcsrev).
_MBCS. When you compile with the symbol _MBCS defined, all TCHAR characters map to char characters, and the preprocessor removes all the _T and __TEXT macro variations. It leaves the character or literal unchanged (creating an MBCS character or literal, respectively). The _tcsXXX functions are mapped to the corresponding _mbsXXX versions.
None of the above. When you compile with neither symbol defined, all TCHAR characters map to char characters and the preprocessor removes all the _T and __TEXT macro variations, leaving the character or literal unchanged (creating an ANSI character or literal, respectively). The _tcsXXX functions are mapped to the corresponding strXXX functions.

You write generic-text-compatible code by using the generic-text data types and functions. An example of reversing and concatenating to a generic-text string follows:

TCHAR *reversedString, *sourceString, *completeString;
reversedString = _tcsrev (sourceString);
completeString = _tcscat (reversedString, __TEXT("suffix"));

When you compile the code without defining any preprocessor symbols, the preprocessor produces this output:

char *reversedString, *sourceString, *completeString;
reversedString = _strrev (sourceString);
completeString = strcat (reversedString, "suffix");

When you compile the code after defining the _UNICODE preprocessor symbol, the preprocessor produces this output:

wchar_t *reversedString, *sourceString, *completeString;
reversedString = _wcsrev (sourceString);
completeString = wcscat (reversedString, L"suffix");

When you compile the code after defining the _MBCS preprocessor symbol, the preprocessor produces this output:

char *reversedString, *sourceString, *completeString;
reversedString = _mbsrev (sourceString);
completeString = _mbscat (reversedString, "suffix");

COM Character Data Types

COM uses two character types:

OLECHAR. The character type COM uses on the operating system for which you compile your source code. For Win32 operating systems, this is the wchar_t character type.^[3] For Win16 operating systems, this is the char character type. For the Mac OS, this is the char character type. For the Solaris OS, this is the wchar_t character type. For the as yet unknown operating system, this is who knows what. Let's just pretend there is an abstract data type called OLECHAR. COM uses it. Don't rely on it mapping to any specific underlying data type.
^[3] Actually, you can change the Win32 OLECHAR data type from the default wchar_t (which COM uses internally) to char by defining the preprocessor symbol OLE2ANSI. This lets you pretend that COM uses ANSI. MFC once used this feature, but it no longer does and neither should you.
BSTR. A specialized string type some COM components use. A BSTR is a length-prefixed array of OLECHAR characters with numerous special semantics.

Now let's complicate things a bit. You want to write code for which you can select, at compile time, the type of characters it uses. Therefore, you're manipulating strictly TCHAR strings internally. You also want to call a COM method and pass it the same strings. You must pass the method either an OLECHAR string or a BSTR string, depending on its signature. The strings your component uses might or might not be in the correct character format, depending on your compilation options. This is a job for Supermacro!

ATL String-Conversion Classes

ATL provides a number of string-conversion classes that convert, when necessary, among the various character types described previously. The classes perform no conversion and, in fact, do nothing, when the compilation options make the source and destination character types identical. Six different classes in atlconv.h implement the real conversion logic, but this header also uses a number of typedefs and preprocessor #define statements to make using these converter classes syntactically more convenient.

These class names use a number of abbreviations for the various character data types:

T represents a pointer to the Win32 TCHAR character type: an LPTSTR parameter.
W represents a pointer to the Unicode wchar_t character type: an LPWSTR parameter.
A represents a pointer to the MBCS/ANSI char character type: an LPSTR parameter.
OLE represents a pointer to the COM OLECHAR character type: an LPOLESTR parameter.
C represents the C/C++ const modifier.

All class names use the form C<source-abbreviation>2<destination-abbreviation>. For example, the CA2W class converts an LPSTR to an LPWSTR. When there is a C in the name (not including the first C, which stands for "class"), add a const modifier to the abbreviation that follows it; for example, the CT2CW class converts an LPTSTR to an LPCWSTR.

The actual class behavior depends on which preprocessor symbols you define (see Table 2.1). Note that the ATL conversion classes and macros treat OLE and W as equivalent.

Table 2.1. Character Set Preprocessor Symbols
Preprocessor Symbol Defined	`T` Becomes . . .	`OLE` Becomes . . .
None	`A`	`W`
_UNICODE	`W`	`W`

Table 2.2 lists the ATL string-conversion classes.

Table 2.2. ATL String-Conversion Classes
`CA2W`
`CA2WEX`
`CA2T`
`CA2TEX`
`CA2CT`
`CA2CTEX`
`COLE2T`
`COLE2TEX`
`COLE2CT`
`COLE2CTEX`
`CT2A`
`CT2AEX`
`CT2CA`
`CT2CAEX`
`CT2OLE`
`CT2OLEEX`
`CT2COLE`
`CT2COLEEX`
`CT2W`
`CT2WEX`
`CT2CW`
`CT2CWEX`
`CW2A`
`CW2AEX`
`CW2T`
`CW2TEX`
`CW2CT`
`CW2CTEX`

As you can see, no BSTR conversion classes are listed in Table 2.2. The next section of this chapter introduces the CComBSTR class as the preferred mechanism for dealing with BSTR-type conversions.

When you look inside the atlconv.h header file, you'll see that many of the definitions distill down to a fairly small set of six actual classes. For instance, when _UNICODE is defined, CT2A becomes CW2A, which is itself typedef'd to the CW2AEX template class. The type definition merely applies the default template parameters to CW2AEX. Additionally, all the previous class names always map OLE to W, so COLE2T becomes CW2T, which is defined as CW2W under Unicode builds. Because the source and destination types for CW2W are the same, this class performs no conversions. Ultimately, the only six classes defined are the template classes CA2AEX, CA2CAEX, CA2WEX, CW2AEX, CW2CWEX, and CW2WEX. Only CA2WEX and CW2AEX have different source and destination types, so these are the only two classes doing any real work. Thus, our expansive list of conversion classes in Table 2.2 has distilled down to only two interesting ones. These two classes are both defined and implemented similarly, so we look at only CA2WEX to glean an understanding of how they both work.

template< int t_nBufferLength = 128 >
class CA2WEX {
    CA2WEX( LPCSTR psz );
    CA2WEX( LPCSTR psz, UINT nCodePage );
    ...
public:
    LPWSTR m_psz;
    wchar_t m_szBuffer[t_nBufferLength];
    ...
};

The class definition is actually pretty simple. The template parameter specifies the size of a fixed static buffer to hold the string data. This means that most string-conversion operations can be performed without allocating any dynamic storage. If the requested string to convert exceeds the number of characters passed as an argument to the template, CA2WEX uses malloc to allocate additional storage.
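
The fixed-buffer-with-heap-fallback idea can be sketched in portable C++. This is not the ATL implementation: WidenSketch and UsesHeap are hypothetical names, and the sketch assumes 7-bit ASCII input that it widens character by character rather than performing a real code-page conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <new>

// Sketch only: widens 7-bit ASCII by value; no code-page conversion is done.
template <int t_nBufferLength = 128>
class WidenSketch {
public:
    explicit WidenSketch(const char* psz) : m_psz(m_szBuffer) {
        const std::size_t needed = std::strlen(psz) + 1;  // include terminator
        if (needed > static_cast<std::size_t>(t_nBufferLength)) {
            // Too long for the in-object buffer: fall back to the heap.
            m_psz = static_cast<wchar_t*>(std::malloc(needed * sizeof(wchar_t)));
            if (m_psz == nullptr) throw std::bad_alloc();
        }
        for (std::size_t i = 0; i < needed; ++i)
            m_psz[i] = static_cast<wchar_t>(static_cast<unsigned char>(psz[i]));
    }
    ~WidenSketch() {
        if (m_psz != m_szBuffer) std::free(m_psz);  // free only heap storage
    }
    WidenSketch(const WidenSketch&) = delete;
    WidenSketch& operator=(const WidenSketch&) = delete;

    operator const wchar_t*() const { return m_psz; }
    bool UsesHeap() const { return m_psz != m_szBuffer; }

private:
    wchar_t* m_psz;                        // points at m_szBuffer or the heap
    wchar_t m_szBuffer[t_nBufferLength];   // in-object storage for short text
};
```

A WidenSketch<16> built from "short" stays in its member buffer, whereas a WidenSketch<4> built from "longer text" has to allocate. In both cases the destructor is the only place that releases storage.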

Two constructors are provided for CA2WEX. The first constructor accepts an LPCSTR and uses the Win32 API function MultiByteToWideChar to perform the conversion. By default, the class uses the ANSI code page for the current thread's locale to perform the conversion. The second constructor can be used to specify an alternate code page that governs how the conversion is performed. This value is passed directly to MultiByteToWideChar, so see the online documentation for details on code pages accepted by the various Win32 character conversion functions.

The simplest way to use this converter class is to accept the default value for the buffer size parameter. Thus, ATL provides a simple typedef to facilitate this:

typedef CA2WEX<> CA2W;

To use this converter class, you need to write only simple code such as the following:

void PutName (LPCWSTR lpwszName);
void RegisterName (LPCSTR lpsz) {
    PutName (CA2W(lpsz));
}

Two other use cases are also common in practice:

Receiving a generic-text string and passing it to a method that expects an OLESTR as input
Receiving an OLESTR and passing it to a method that expects a generic-text string

The conversion classes are easily employed to deal with these cases:

void PutAddress(LPOLESTR lpszAddress);
void RegisterAddress(LPTSTR lpsz) {
    PutAddress(CT2OLE(lpsz));
}

void PutNickName(LPTSTR lpszName);
void RegisterNickName(LPOLESTR lpsz) {
    PutNickName(COLE2T(lpsz));
}

A Note on Memory Management

As convenient as the conversion classes are, you can run into some nasty pitfalls if you use them incorrectly. The conversion classes allocate the memory for the converted text automatically and clean it up in the class destructor. This is useful because you don't have to worry about buffer management. However, it also means that code like this is a crash waiting to happen:

LPOLESTR ConvertString(LPTSTR lpsz) {
    return CT2OLE(lpsz);
}

You've just returned either a pointer to the stack of the called function (which is trashed when the function returns) if the string was short, or a pointer to an array on the heap that will be deallocated before the function returns.

The worst part is that, depending on your macro selection, the code might work just fine but will crash when you switch from ANSI to Unicode for the first time (usually two days before ship). To avoid this, make sure that you copy the converted string to a separate buffer (or use a string class) first if you need it for more than a single expression.
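
One way to follow the "use a string class" advice is to return an owning string by value, so the characters outlive any temporary converter. The sketch below is portable and only approximates the idea: ConvertStringSafe is a hypothetical name, and the loop that widens 7-bit ASCII stands in for a real conversion such as CT2OLE.

```cpp
#include <cassert>
#include <string>

// Stand-in for a real conversion; widens 7-bit ASCII only.
inline std::wstring ConvertStringSafe(const char* psz) {
    std::wstring result;
    for (; *psz != '\0'; ++psz)
        result.push_back(static_cast<wchar_t>(static_cast<unsigned char>(*psz)));
    return result;  // the caller receives its own copy of the characters
}
```

Because the returned std::wstring owns its storage, the caller can keep it for as long as necessary; no pointer into a destroyed temporary ever escapes the function.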

ATL String-Helper Functions

Sometimes you want to copy a string of OLECHAR characters. You also happen to know that OLECHAR characters are wide characters on the Win32 operating system. When writing a Win32 version of your component, you might call the Win32 operating system function lstrcpyW, which copies wide characters. Unfortunately, Windows NT/2000, which supports Unicode, implements lstrcpyW, but Windows 95 does not. A component that uses the lstrcpyW API doesn't work correctly on Windows 95.

Instead of lstrcpyW, use the ATL string-helper function ocscpy to copy an OLECHAR character string. It works properly on both Windows NT/2000 and Windows 95. The ATL string-helper function ocslen returns the length of an OLECHAR string. This is nice for symmetry, although the lstrlenW function it replaces does work on both operating systems.

OLECHAR* ocscpy(LPOLESTR dest, LPCOLESTR src);
size_t ocslen(LPCOLESTR s);

Similarly, the Win32 CharNextW operating system function doesn't work on Windows 95, so ATL provides a CharNextO string-helper function that increments an OLECHAR* by one character and returns the next character pointer. It does not increment the pointer beyond a NUL termination character.

LPOLESTR CharNextO(LPCOLESTR lp);
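
For readers who want to see the behavior described above spelled out, here is a portable sketch of the three helpers. These are not the ATL implementations: the _sketch names are hypothetical, and OLECHAR_SKETCH assumes that OLECHAR is wchar_t, as it is on Win32.

```cpp
#include <cassert>
#include <cstddef>

typedef wchar_t OLECHAR_SKETCH;  // assumption: OLECHAR is wchar_t, as on Win32

// Returns the number of characters before the NUL terminator.
inline std::size_t ocslen_sketch(const OLECHAR_SKETCH* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Copies src, including its terminator, into dest and returns dest.
inline OLECHAR_SKETCH* ocscpy_sketch(OLECHAR_SKETCH* dest,
                                     const OLECHAR_SKETCH* src) {
    OLECHAR_SKETCH* out = dest;
    while ((*out++ = *src++) != 0) {}
    return dest;
}

// Advances by one character, but never past the NUL terminator.
inline const OLECHAR_SKETCH* CharNextO_sketch(const OLECHAR_SKETCH* p) {
    return (*p != 0) ? p + 1 : p;
}
```

The guard in CharNextO_sketch is what makes a loop such as "advance until the pointer stops moving" safe: at the terminator, the function returns its argument unchanged.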

ATL String-Conversion Macros

The string-conversion classes discussed previously were introduced in ATL 7. ATL 3 (and code written with ATL 3) used a set of macros instead. In fact, these macros are still in use in the ATL code base. For example, this code is in the atlctl.h header:

STDMETHOD(Help)(LPCOLESTR pszHelpDir) {
    T* pT = static_cast<T*>(this);
    USES_CONVERSION;
    ATLTRACE(atlTraceControls,2,
        _T("IPropertyPageImpl::Help\n"));
    CComBSTR szFullFileName(pszHelpDir);
    CComHeapPtr<OLECHAR>
        pszFileName(LoadStringHelper(pT->m_dwHelpFileID));
    if (pszFileName == NULL)
        return E_OUTOFMEMORY;
    szFullFileName.Append(OLESTR("\\"));
    szFullFileName.Append(pszFileName);
    WinHelp(pT->m_hWnd, OLE2CT(szFullFileName),
        HELP_CONTEXTPOPUP, NULL);
    return S_OK;
}

The macros behave much like the conversion classes, minus the leading C in the macro name. So, to convert from TCHAR to OLECHAR, you use T2OLE(s).

Two major differences arise between the macros and the conversion classes. First, the macros require some local variables to work; the USES_CONVERSION macro is required in any function that uses the conversion macros. (It declares these local variables.) The second difference is the location of the conversion buffer.

In the conversion classes, the buffer is stored either as a member variable on the stack (if the buffer is small) or on the heap (if the buffer is large). The conversion macros always use the stack. They call the runtime function _alloca, which allocates extra space on the local stack.

Although it is fast, _alloca has some serious downsides. The stack space isn't freed until the function exits, which means that if you do conversion in a loop, you might end up blowing out your stack space. Another nasty problem is that if you use the conversion macros inside a C++ catch block, the _alloca call messes up the exception-tracking information on the stack and you crash.^[4]

^[4] For this reason, the _alloca function is deprecated in favor of _malloca, but ATL still uses _alloca.
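
The loop problem comes down to when storage is released. The sketch below shows the contrasting behavior of an object with a destructor, which gives its buffer back at the end of every iteration instead of at function exit. ScopedWiden, s_live, and MaxLiveDuringLoop are hypothetical names for this illustration, and the widening again assumes 7-bit ASCII.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical converter: owns a widened copy and counts live instances.
class ScopedWiden {
public:
    explicit ScopedWiden(const char* psz)
        : m_text(psz, psz + std::strlen(psz)) { ++s_live; }
    ~ScopedWiden() { --s_live; }
    ScopedWiden(const ScopedWiden&) = delete;
    ScopedWiden& operator=(const ScopedWiden&) = delete;

    const wchar_t* get() const { return m_text.c_str(); }
    static int s_live;   // number of converters currently holding a buffer

private:
    std::wstring m_text;
};
int ScopedWiden::s_live = 0;

// Converts each name in turn and reports the peak number of live buffers.
inline int MaxLiveDuringLoop(const std::vector<const char*>& names) {
    int maxLive = 0;
    for (const char* name : names) {
        ScopedWiden wide(name);   // buffer acquired for this iteration
        if (ScopedWiden::s_live > maxLive) maxLive = ScopedWiden::s_live;
    }                             // destructor releases it here
    return maxLive;
}
```

However many names the loop processes, at most one buffer is alive at a time. Storage obtained with _alloca in the same loop would instead accumulate until the function returned.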

The ATL team apparently took two swipes at improving the conversion macros. The final solution is the conversion classes. However, a second set of conversion macros exists: the _EX flavor. These are used much like the original conversion macros; you put USES_CONVERSION_EX at the top of the function. The macros have an _EX suffix, as in T2A_EX. The _EX macros are different, however: They take two parameters, not one. The first parameter is the buffer to convert from as usual. The second parameter is a threshold value. If the converted buffer is smaller than this threshold, the memory is allocated via _alloca. If the buffer is larger, it is allocated on the heap instead. So, these macros give you a chance to avoid the stack overflow. (They still won't help you in a catch block.) The ATL code uses the _EX macros extensively; the previous example is the only one left that still uses the old macros.

We don't go into the details of either macro set here; the conversion classes are much safer to use and are preferred for new code. We mention them only so that you know what you're looking at if you see them in older code or the ATL sources themselves.