Prevent I18N Buffer Overruns

To avoid buffer overruns, always allocate sufficient buffer space for conversion and always check the function result. The following code shows how to do this correctly.

    // Determine the size of the buffer required for the converted string.
    // The length includes the terminating \0.
    int nLen = MultiByteToWideChar(CP_OEMCP, MB_ERR_INVALID_CHARS, lpszOld, -1, NULL, 0);

    // If the function failed, don't convert!
    if (nLen == 0) {
        // oops!
    }

    // Allocate the buffer for the converted string.
    LPWSTR lpszNew = (LPWSTR) GlobalAlloc(0, sizeof(WCHAR) * nLen);

    // If the allocation failed, don't convert!
    if (lpszNew == NULL) {
        // oops!
    }

    // Convert the string.
    nLen = MultiByteToWideChar(CP_OEMCP, MB_ERR_INVALID_CHARS, lpszOld, -1, lpszNew, nLen);

    // The conversion failed, the result is unreliable.
    if (nLen == 0) {
        // oops!
    }

In general, do not rely on a precalculated maximum buffer size. For example, the new Chinese standard GB18030, in which a single character can occupy up to 4 bytes, has invalidated many such calculations.

LCMapString is especially tricky: the output buffer length is measured in words unless the function is called with the LCMAP_SORTKEY option, in which case the output buffer length is measured in bytes.

More Info
If you think Unicode buffer overruns are hard to exploit, you should read Creating Arbitrary Shellcode in Unicode Expanded Strings at http://www.nextgenss.com/papers/unicodebo.pdf.

Words and Bytes

Despite their names and descriptions, most Win32 functions do not process characters. Most Win32 A functions, such as CreateProcessA, process bytes, so a two-byte character, such as a character in a double-byte character set (DBCS), counts as two bytes instead of one character. Most Win32 W functions, such as CreateProcessW, process 16-bit words, so a pair of surrogates counts as two words instead of one character. Surrogates are explained in a moment. Confusing bytes, words, and characters can easily lead to buffer overruns or overallocation.
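The difference between a word count and a byte count matters most at allocation time. The following is a minimal, portable sketch of converting one to the other, not code from any Windows header: the type word16_t is a stand-in for WCHAR, and the helper name words_to_bytes is hypothetical. The overflow check is an added precaution for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 16-bit WCHAR type. */
typedef uint16_t word16_t;

/*
 * Convert a length in 16-bit words to a length in bytes.
 * Returns false, leaving *n_bytes untouched, if the multiplication
 * would overflow size_t.
 */
bool words_to_bytes(size_t n_words, size_t *n_bytes)
{
    if (n_words > SIZE_MAX / sizeof(word16_t)) {
        return false;               /* byte count does not fit in size_t */
    }
    *n_bytes = n_words * sizeof(word16_t);
    return true;
}
```

A caller would pass the word count reported by a W function and hand the resulting byte count to the allocator, rather than passing the word count to the allocator directly.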

Many people don't realize there are A and W functions in Windows. The following code snippet from winbase.h should help you understand their relationship.

    #ifdef UNICODE
    #define CreateProcess  CreateProcessW
    #else
    #define CreateProcess  CreateProcessA
    #endif // !UNICODE

What's a Unicode Surrogate?

The Unicode standard defines a surrogate pair as a coded character representation for a single abstract character that consists of a sequence of two Unicode code values. The first value of the surrogate pair is the high surrogate, and it contains a 16-bit code value in the range of U+D800 through U+DBFF. The second value of the pair is the low surrogate; it contains values in the range of U+DC00 through U+DFFF.
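The ranges above translate directly into range checks. The following portable sketch shows one way to classify a 16-bit code value and to combine a valid pair into the code point it represents. The function names are hypothetical, and the combining arithmetic is the standard UTF-16 decoding rule rather than anything specific to Windows.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if w is a high surrogate (U+D800 through U+DBFF). */
bool is_high_surrogate(uint16_t w)
{
    return w >= 0xD800 && w <= 0xDBFF;
}

/* True if w is a low surrogate (U+DC00 through U+DFFF). */
bool is_low_surrogate(uint16_t w)
{
    return w >= 0xDC00 && w <= 0xDFFF;
}

/*
 * Combine a high and a low surrogate into one code point.
 * The caller must have verified both values with the checks above;
 * the result is in the range U+10000 through U+10FFFF.
 */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    uint32_t hi_bits = (uint32_t)(high - 0xD800) << 10;  /* upper 10 bits */
    uint32_t lo_bits = (uint32_t)(low - 0xDC00);         /* lower 10 bits */
    return 0x10000u + (hi_bits | lo_bits);
}
```

For example, the pair 0xD83D, 0xDE00 combines to the code point U+1F600.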

The Unicode standard defines a combining character sequence as a combination of a base character and one or more combining characters. A surrogate pair can represent a base character or a combining character. For more information on surrogate pairs and combining character sequences, see The Unicode Standard at http://www.unicode.org.

The key point to remember is that the two surrogates of a pair together represent a single abstract character, so you cannot assume that one 16-bit UTF-16 encoding value maps to exactly one character. By using surrogate pairs, a 16-bit Unicode encoded system can address an additional one million characters, called supplementary characters. The Unicode standard already assigns many important characters to the supplementary region.
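To make the gap between words and characters concrete, here is a small portable sketch that counts code points in a UTF-16 buffer. The function name utf16_count_code_points is hypothetical, and counting an unpaired surrogate as one code point is an assumption made for this illustration; real code may prefer to reject such input. Combining character sequences are not merged, so the result can still exceed the number of characters a user perceives.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count the code points in a buffer of n_units 16-bit UTF-16 values.
 * A high surrogate followed by a low surrogate counts as one code point.
 * An unpaired surrogate is counted as one code point (an assumption of
 * this sketch, not a rule from the text).
 */
size_t utf16_count_code_points(const uint16_t *s, size_t n_units)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n_units) {
        uint16_t w = s[i];
        if (w >= 0xD800 && w <= 0xDBFF &&        /* high surrogate */
            i + 1 < n_units &&                   /* a next value exists */
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {  /* low surrogate */
            i += 2;                              /* consume the whole pair */
        } else {
            i += 1;                              /* ordinary or unpaired value */
        }
        count++;
    }
    return count;
}
```

A buffer holding 0x0041, 0xD83D, 0xDE00, 0x0042 contains four 16-bit words but only three code points, which is why a word count cannot be used as a character count.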