A Review of Text Data Types
The text data type is somewhat of a pain to deal with in C++ programming. The main problem is that there isn't just one text data type; there are many of them. I use the term text data type here in the general sense of an array of characters. Often, different operating systems and programming languages introduce additional semantics for an array of characters (for example, NUL character termination or a length prefix) before they consider an array of characters a text string.
When you select a text data type, you must make a number of decisions. First, you must decide what type of characters constitute the array. Some operating systems require you to use ANSI characters when you pass a string (such as a file name) to the operating system. Some operating systems prefer that you use Unicode characters but will accept ANSI characters. Other operating systems require you to use EBCDIC characters. Stranger character sets are in use as well, such as the Multi/Double Byte Character Sets (MBCS/DBCS); this book largely doesn't discuss those details.
Second, you must consider what character set you want to use to manipulate text within your program. No requirement states that your source code must use the same character set that the operating system running your program prefers. Clearly, it's more convenient when both use the same character set, but a program and the operating system can use different character sets. You "simply" must convert all text strings going to and coming from the operating system.
Third, you must determine the length of a text string. Some languages, such as C and C++, and some operating systems, such as Windows 9x/NT/XP and UNIX, use a terminating NUL character to delimit the end of a text string. Other languages, such as the Microsoft Visual Basic interpreter, Microsoft Java virtual machine, and Pascal, prefer an explicit length prefix specifying the number of characters in the text string.
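The two length conventions can be contrasted in a few lines of portable C++. This is only a minimal sketch: PrefixedString, MakePrefixed, PrefixedLength, and TerminatedLength are hypothetical names invented for illustration, and the layout shown is not the string format of any particular language or operating system.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// NUL-terminated convention: the length is found by scanning for the
// terminator, so the string cannot contain an embedded NUL character.
std::size_t TerminatedLength(const char* text) {
    assert(text != nullptr);
    return std::strlen(text);
}

// Hypothetical length-prefixed layout (an assumption for illustration):
// an explicit character count stored alongside the characters.
struct PrefixedString {
    std::size_t length;       // number of characters, stored up front
    std::vector<char> chars;  // character data; no terminator required
};

// Builds a length-prefixed string from a buffer and an explicit count.
PrefixedString MakePrefixed(const char* data, std::size_t count) {
    assert(data != nullptr);
    return PrefixedString{count, std::vector<char>(data, data + count)};
}

// Length-prefixed convention: the length is read directly, and embedded
// NUL characters are counted like any other character.
std::size_t PrefixedLength(const PrefixedString& text) {
    return text.length;
}
```

Given the five characters 'a', 'b', NUL, 'c', 'd', TerminatedLength stops at the embedded NUL, whereas PrefixedLength reports the full count supplied to MakePrefixed.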
Finally, in practice, a text string presents a resource-management issue. Text strings typically vary in length. This makes it difficult to allocate memory for the string on the stack, and the text string might not fit on the stack at all. Therefore, text strings are often dynamically allocated. Of course, this means that a text string must be freed eventually. Resource management introduces the idea of an owner of a text string. Only the owner frees the string, and frees it only once. Ownership becomes quite important when you pass a text string between components.
To make matters worse, two COM objects can reside on two different computers running two different operating systems that prefer two different character sets for a text string. For example, you can write one COM object in Visual Basic and run it on the Windows XP operating system. You might pass a text string to another COM object written in C++ running on an IBM mainframe. Clearly, we need some standard text data type that all COM objects in a heterogeneous environment can understand.
COM uses the OLECHAR character data type. A COM text string is a NUL-character-terminated array of OLECHAR characters; a pointer to such a string is an LPOLESTR.[1] As a rule, a text string parameter to a COM interface method should be of type LPOLESTR. When a method doesn't change the string, the parameter should be of type LPCOLESTR, that is, a pointer to a constant array of OLECHAR characters.
Frequently, though not always, the OLECHAR type isn't the same as the characters you use when writing your code. Sometimes, though not always, the OLECHAR type isn't the same as the characters you must provide when passing a text string to the operating system. This means that, depending on context, sometimes you need to convert a text string from one character set to another, and sometimes you won't. Unfortunately, a change in compiler options (for example, a Windows XP Unicode build or a Windows CE build) can change this context. As a result, code that previously didn't need to convert a string might require conversion, or vice versa. You don't want to rewrite all string-manipulation code each time you change a compiler option. Therefore, ATL provides a number of string-conversion macros that convert a text string from one character set to another and are sensitive to the context in which you invoke the conversion.
Windows Character Data Types
Now let's focus specifically on the Windows platform. Windows-based COM components typically use a mix of four text data types:
Win32 APIs and Strings
Each Win32 API that requires a string has two versions: one that requires a Unicode argument and another that requires an MBCS argument. On a non-MBCS-enabled version of Windows, the MBCS version of an API expects an ANSI argument. For example, the SetWindowText API doesn't really exist. There are actually two functions: SetWindowTextW, which expects a Unicode string argument, and SetWindowTextA, which expects an MBCS/ANSI string argument.
The Windows NT/2000/XP operating systems internally use only Unicode strings. Therefore, when you call SetWindowTextA on Windows NT/2000/XP, the function translates the specified string to Unicode and then calls SetWindowTextW. The Windows 9x operating systems do not support Unicode directly. The SetWindowTextA function on the Windows 9x operating systems does the work, while SetWindowTextW returns an error. The MSLU library from Microsoft[2] provides implementations of almost all the Unicode functions on Win9x.
This gives you a difficult choice. You could write a performance-optimized component using Unicode character strings that runs on Windows 2000 but not on Windows 9x. You could use MSLU for Unicode strings on both families and lose performance on Windows 9x. You could write a more general component using MBCS/ANSI character strings that runs on both operating systems but not optimally on Windows 2000. Alternatively, you could hedge your bets by writing source code that enables you to decide at compile time what character set to support.
A little coding discipline and some preprocessor magic let you code as if there were a single API called SetWindowText that expects a TCHAR string argument. You specify at compile time which kind of component you want to build. For example, you write code that calls SetWindowText and specifies a TCHAR buffer. When compiling a component as Unicode, you call SetWindowTextW; the argument is a wchar_t buffer. When compiling an MBCS/ANSI component, you call SetWindowTextA; the argument is a char buffer.
When you write a Windows-based COM component, you should typically use the TCHAR character type to represent characters used by the component internally. Additionally, you should use it for all characters used in interactions with the operating system. Similarly, you should use the TEXT or __TEXT macro to surround every literal character or string.
tchar.h defines the functionally equivalent macros _T, __T, and _TEXT, which all compile a character or string literal as a generic-text character or literal. winnt.h also defines the functionally equivalent macros TEXT and __TEXT, which are yet more synonyms for _T, __T, and _TEXT. (There's nothing like five ways to do exactly the same thing.) The examples in this chapter use __TEXT because it's defined in winnt.h. I actually prefer _T because it's less clutter in my source code.
An operating-system-agnostic coding approach favors including tchar.h and using the _TCHAR generic-text data type because that's somewhat less tied to the Windows operating systems. However, we're discussing building components with text handling optimized at compile time for specific versions of the Windows operating systems. This argues that we should use TCHAR, the type defined in winnt.h. Plus, TCHAR isn't as jarring to the eyes as _TCHAR and it's easier to type. Most code already implicitly includes the winnt.h header file via windows.h, and you must explicitly include tchar.h. All sorts of good reasons support using TCHAR, so the examples in this book use this as the generic-text data type. This means that you can compile specialized versions of the component for different markets or for performance reasons. These types and macros are defined in the winnt.h header file. You also must use a different set of string runtime library functions when manipulating strings of TCHAR characters. The familiar functions strlen, strcpy, and so on operate only on char characters. The less familiar functions wcslen, wcscpy, and so on work on wchar_t characters. Moreover, the totally strange functions _mbslen, _mbscpy, and so on work on multibyte characters. Because TCHAR characters are sometimes wchar_t, sometimes char-holding ANSI characters, and sometimes char-holding (nominally unsigned) multibyte characters, you need an equivalent set of runtime library functions that work with TCHAR characters. The tchar.h header file defines a number of useful generic-text mappings for string-handling functions. These functions expect TCHAR parameters, so all their function names use the _tcs (the _t character set) prefix. For example, _tcslen is equivalent to the C runtime library strlen function. The _tcslen function expects TCHAR characters, whereas the strlen function expects char characters. 
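The compile-time selection described above can be imitated in portable C++. What follows is a rough sketch under stated assumptions, not the contents of tchar.h or winnt.h: DEMO_UNICODE, demo_tchar, DEMO_TEXT, demo_tcslen, and CountCharacters are hypothetical stand-ins for _UNICODE, TCHAR, __TEXT, _tcslen, and a piece of component code, chosen so that the example compiles without the Windows headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins (assumptions for illustration only):
//   DEMO_UNICODE ~ _UNICODE, demo_tchar ~ TCHAR,
//   DEMO_TEXT ~ __TEXT, demo_tcslen ~ _tcslen.
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_TEXT(s) L##s             // wide literal: "x" becomes L"x"
#define demo_tcslen std::wcslen       // wide-character length function
#else
typedef char demo_tchar;
#define DEMO_TEXT(s) s                // narrow literal left unchanged
#define demo_tcslen std::strlen       // narrow-character length function
#endif

// Component code is written once against the generic names; the
// preprocessor picks the concrete character type and runtime function.
std::size_t CountCharacters(const demo_tchar* text) {
    assert(text != nullptr);
    return demo_tcslen(text);
}
```

Calling CountCharacters(DEMO_TEXT("suffix")) compiles in either configuration; only the character type and the runtime function behind the generic names change when DEMO_UNICODE is defined.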
Controlling Generic-Text Mapping Using the Preprocessor
Two preprocessor symbols and two macros control the mapping of the TCHAR data type to the underlying character type the application uses.
You write generic-text-compatible code by using the generic-text data types and functions. An example of reversing and concatenating to a generic-text string follows:
    TCHAR *reversedString, *sourceString, *completeString;
    reversedString = _tcsrev (sourceString);
    completeString = _tcscat (reversedString, __TEXT("suffix"));
When you compile the code without defining any preprocessor symbols, the preprocessor produces this output:
    char *reversedString, *sourceString, *completeString;
    reversedString = _strrev (sourceString);
    completeString = strcat (reversedString, "suffix");
When you compile the code after defining the _UNICODE preprocessor symbol, the preprocessor produces this output:
    wchar_t *reversedString, *sourceString, *completeString;
    reversedString = _wcsrev (sourceString);
    completeString = wcscat (reversedString, L"suffix");
When you compile the code after defining the _MBCS preprocessor symbol, the preprocessor produces this output:
    char *reversedString, *sourceString, *completeString;
    reversedString = _mbsrev (sourceString);
    completeString = _mbscat (reversedString, "suffix");
COM Character Data Types
COM uses two character types:
Now let's complicate things a bit. You want to write code for which you can select, at compile time, the type of characters it uses. Therefore, you're manipulating strictly TCHAR strings internally. You also want to call a COM method and pass it the same strings. You must pass the method either an OLECHAR string or a BSTR string, depending on its signature. The strings your component uses might or might not be in the correct character format, depending on your compilation options. This is a job for Supermacro!
ATL String-Conversion Classes
ATL provides a number of string-conversion classes that convert, when necessary, among the various character types described previously. The classes perform no conversion and, in fact, do nothing, when the compilation options make the source and destination character types identical. Six different classes in atlconv.h implement the real conversion logic, but this header also uses a number of typedefs and preprocessor #define statements to make using these converter classes syntactically more convenient. These class names use a number of abbreviations for the various character data types:
All class names use the form C<source-abbreviation>2<destination-abbreviation>. For example, the CA2W class converts an LPSTR to an LPWSTR. When there is a C in the name (not including the first C, which stands for "class"), add a const modification to the following abbreviation; for example, the CT2CW class converts an LPTSTR to an LPCWSTR. The actual class behavior depends on which preprocessor symbols you define (see Table 2.1). Note that the ATL conversion classes and macros treat OLE and W as equivalent.
Table 2.2 lists the ATL string-conversion macros.
As you can see, no BSTR conversion classes are listed in Table 2.2. The next section of this chapter introduces the CComBSTR class as the preferred mechanism for dealing with BSTR-type conversions.
When you look inside the atlconv.h header file, you'll see that many of the definitions distill down to a fairly small set of six actual classes. For instance, when _UNICODE is defined, CT2A becomes CW2A, which is itself typedef'd to the CW2AEX template class. The type definition merely applies the default template parameters to CW2AEX. Additionally, all the previous class names always map OLE to W, so COLE2T becomes CW2T, which is defined as CW2W under Unicode builds. Because the source and destination types for CW2W are the same, this class performs no conversions. Ultimately, the only six classes defined are the template classes CA2AEX, CA2CAEX, CA2WEX, CW2AEX, CW2CWEX, and CW2WEX. Only CA2WEX and CW2AEX have different source and destination types, so these are the only two classes doing any real work. Thus, our expansive list of conversion classes in Table 2.2 has distilled down to only two interesting ones. These two classes are both defined and implemented similarly, so we look at only CA2WEX to glean an understanding of how they both work.
    template< int t_nBufferLength = 128 >
    class CA2WEX {
        CA2WEX( LPCSTR psz );
        CA2WEX( LPCSTR psz, UINT nCodePage );
        ...
    public:
        LPWSTR m_psz;
        wchar_t m_szBuffer[t_nBufferLength];
        ...
    };
The class definition is actually pretty simple. The template parameter specifies the size of a fixed static buffer to hold the string data. This means that most string-conversion operations can be performed without allocating any dynamic storage. If the requested string to convert exceeds the number of characters passed as an argument to the template, CA2WEX uses malloc to allocate additional storage. Two constructors are provided for CA2WEX.
The first constructor accepts an LPCSTR and uses the Win32 API function MultiByteToWideChar to perform the conversion. By default, the class uses the ANSI code page for the current thread's locale to perform the conversion. The second constructor can be used to specify an alternate code page that governs how the conversion is performed. This value is passed directly to MultiByteToWideChar, so see the online documentation for details on code pages accepted by the various Win32 character conversion functions.
The simplest way to use this converter class is to accept the default value for the buffer size parameter. Thus, ATL provides a simple typedef to facilitate this:
    typedef CA2WEX<> CA2W;
To use this converter class, you need to write only simple code such as the following:
    void PutName (LPCWSTR lpwszName);
    void RegisterName (LPCSTR lpsz) {
        PutName (CA2W(lpsz));
    }
Two other use cases are also common in practice:
The conversion classes are easily employed to deal with these cases:
    void PutAddress(LPOLESTR lpszAddress);
    void RegisterAddress(LPTSTR lpsz) {
        PutAddress(CT2OLE(lpsz));
    }
    void PutNickName(LPTSTR lpszName);
    void RegisterNickName(LPOLESTR lpsz) {
        PutNickName(COLE2T(lpsz));
    }
A Note on Memory Management
As convenient as the conversion classes are, you can run into some nasty pitfalls if you use them incorrectly. The conversion classes allocate the memory for the converted text automatically and clean it up in the class destructor. This is useful because you don't have to worry about buffer management. However, it also means that code like this is a crash waiting to happen:
    LPOLESTR ConvertString(LPTSTR lpsz) {
        return CT2OLE(lpsz);
    }
You've just returned either a pointer to the stack of the called function (which is trashed when the function returns) if the string was short, or a pointer to an array on the heap that will be deallocated before the function returns. The worst part is that, depending on your macro selection, the code might work just fine but will crash when you switch from ANSI to Unicode for the first time (usually two days before ship). To avoid this, make sure that you copy the converted string to a separate buffer (or use a string class) first if you need it for more than a single expression.
ATL String-Helper Functions
Sometimes you want to copy a string of OLECHAR characters. You also happen to know that OLECHAR characters are wide characters on the Win32 operating system. When writing a Win32 version of your component, you might call the Win32 operating system function lstrcpyW, which copies wide characters. Unfortunately, Windows NT/2000, which supports Unicode, implements lstrcpyW, but Windows 95 does not. A component that uses the lstrcpyW API doesn't work correctly on Windows 95. Instead of lstrcpyW, use the ATL string-helper function ocscpy to copy an OLECHAR character string. It works properly on both Windows NT/2000 and Windows 95.
The ATL string-helper function ocslen returns the length of an OLECHAR string. This is nice for symmetry, although the lstrlenW function it replaces does work on both operating systems.
    OLECHAR* ocscpy(LPOLESTR dest, LPCOLESTR src);
    size_t ocslen(LPCOLESTR s);
Similarly, the Win32 CharNextW operating system function doesn't work on Windows 95, so ATL provides a CharNextO string-helper function that increments an OLECHAR* by one character and returns the next character pointer. It does not increment the pointer beyond a NUL termination character.
    LPOLESTR CharNextO(LPCOLESTR lp);
ATL String-Conversion Macros
The string-conversion classes discussed previously were introduced in ATL 7. ATL 3 (and code written with ATL 3) used a set of macros instead. In fact, these macros are still in use in the ATL code base. For example, this code is in the atlctl.h header:
    STDMETHOD(Help)(LPCOLESTR pszHelpDir) {
        T* pT = static_cast<T*>(this);
        USES_CONVERSION;
        ATLTRACE(atlTraceControls,2, _T("IPropertyPageImpl::Help\n"));
        CComBSTR szFullFileName(pszHelpDir);
        CComHeapPtr<OLECHAR> pszFileName(LoadStringHelper(pT->m_dwHelpFileID));
        if (pszFileName == NULL)
            return E_OUTOFMEMORY;
        szFullFileName.Append(OLESTR("\\"));
        szFullFileName.Append(pszFileName);
        WinHelp(pT->m_hWnd, OLE2CT(szFullFileName), HELP_CONTEXTPOPUP, NULL);
        return S_OK;
    }
The macros behave much like the conversion classes, minus the leading C in the macro name. So, to convert from TCHAR to OLECHAR, you use T2OLE(s).
Two major differences arise between the macros and the conversion classes. First, the macros require some local variables to work; the USES_CONVERSION macro is required in any function that uses the conversion macros. (It declares these local variables.) The second difference is the location of the conversion buffer. In the conversion classes, the buffer is stored either as a member variable on the stack (if the buffer is small) or on the heap (if the buffer is large). The conversion macros always use the stack.
They call the runtime function _alloca, which allocates extra space on the local stack. Although it is fast, _alloca has some serious downsides. The stack space isn't freed until the function exits, which means that if you do conversion in a loop, you might end up blowing out your stack space. Another nasty problem is that if you use the conversion macros inside a C++ catch block, the _alloca call messes up the exception-tracking information on the stack and you crash.[4]
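By contrast, a conversion object that owns its buffer releases it as soon as the object goes out of scope, so a conversion inside a loop does not accumulate storage. The following is a minimal portable sketch of the fixed-buffer-with-heap-fallback idea, not the ATL implementation: WidenSketch, UsesHeap, and WidenCopy are hypothetical names, and the sketch assumes plain ASCII input so that each char widens to exactly one wchar_t (the real classes call MultiByteToWideChar instead).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <string>

// Hypothetical illustration class (not ATL code). Assumes ASCII input.
template <std::size_t BufferLength = 128>
class WidenSketch {
public:
    explicit WidenSketch(const char* source) : m_psz(m_buffer) {
        assert(source != nullptr);
        const std::size_t needed = std::strlen(source) + 1;  // include NUL
        if (needed > BufferLength) {
            // Too large for the member buffer: fall back to the heap.
            m_psz = static_cast<wchar_t*>(
                std::malloc(needed * sizeof(wchar_t)));
            if (m_psz == nullptr) throw std::bad_alloc();
        }
        for (std::size_t i = 0; i < needed; ++i) {
            m_psz[i] = static_cast<wchar_t>(
                static_cast<unsigned char>(source[i]));
        }
    }

    // Heap storage, if any, is released when the object leaves scope.
    ~WidenSketch() {
        if (m_psz != m_buffer) std::free(m_psz);
    }

    WidenSketch(const WidenSketch&) = delete;
    WidenSketch& operator=(const WidenSketch&) = delete;

    // Valid only while this object is alive.
    operator const wchar_t*() const { return m_psz; }
    bool UsesHeap() const { return m_psz != m_buffer; }

private:
    wchar_t* m_psz;
    wchar_t m_buffer[BufferLength];
};

// Copies the converted text into an owning std::wstring before the
// converter is destroyed, so the caller never sees a dangling pointer.
std::wstring WidenCopy(const char* source) {
    WidenSketch<> converted(source);
    const wchar_t* text = converted;
    return std::wstring(text);
}
```

Declaring a WidenSketch inside a loop body constructs and destroys it on every iteration, which is the behavior the stack-allocating macros cannot offer. WidenCopy shows the copy-before-return pattern recommended in the memory-management note for strings needed beyond a single expression.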
The ATL team apparently took two swipes at improving the conversion macros. The final solution is the conversion classes. However, a second set of conversion macros exists: the _EX flavor. These are used much like the original conversion macros; you put USES_CONVERSION_EX at the top of the function. The macros have an _EX suffix, as in T2A_EX. The _EX macros are different, however: They take two parameters, not one. The first parameter is the buffer to convert from as usual. The second parameter is a threshold value. If the converted buffer is smaller than this threshold, the memory is allocated via _alloca. If the buffer is larger, it is allocated on the heap instead. So, these macros give you a chance to avoid the stack overflow. (They still won't help you in a catch block.) The ATL code uses the _EX macros extensively; the previous example is the only one left that still uses the old macros.
We don't go into the details of either macro set here; the conversion classes are much safer to use and are preferred for new code. We mention them only so that you know what you're looking at if you see them in older code or the ATL sources themselves.