Unicode and ANSI Buffer Size Mismatches

The buffer overrun caused by Unicode and ANSI buffer size mismatches is somewhat common on Windows platforms. It occurs if you mix up the number of elements with the size in bytes of a Unicode buffer. There are two reasons it's rather widespread: Windows NT and later support ANSI and Unicode strings, and most Unicode functions deal with buffer sizes in wide characters, not byte sizes.

The most commonly used function that is vulnerable to this kind of bug is MultiByteToWideChar. Take a look at the following code:

BOOL GetName(char *szName)
{
    WCHAR wszUserName[256];

    // Convert ANSI name to Unicode.
    MultiByteToWideChar(CP_ACP, 0,
                        szName, -1,
                        wszUserName,
                        sizeof(wszUserName));

    // Snip
}

Can you see the vulnerability? OK, time is up. The problem is the last argument of MultiByteToWideChar. The documentation for this argument states: "Specifies the size, in wide characters, of the buffer pointed to by the lpWideCharStr parameter." The value passed into this call is sizeof(wszUserName), which is 256, right? No, it's not. wszUserName is a Unicode string; it's 256 wide characters. A wide character is two bytes, so sizeof(wszUserName) is actually 512 bytes. Hence, the function thinks the buffer is 512 wide characters in size, twice its real capacity, and it can write up to 1,024 bytes into a 512-byte buffer. Because wszUserName is on the stack, we have a potentially exploitable buffer overrun.

Here's the correct way to write the call:

 MultiByteToWideChar(CP_ACP, 0, szName, -1, wszUserName, sizeof(wszUserName) / sizeof(wszUserName[0]));
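The difference between the two size expressions is easy to check in isolation. The following is a minimal, portable sketch rather than Win32 code: WIDE_CHAR is a hypothetical 2-byte stand-in for WCHAR, and widen is a hypothetical converter that, like MultiByteToWideChar, takes its destination size as an element count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 2-byte Windows WCHAR type. */
typedef uint16_t WIDE_CHAR;

WIDE_CHAR g_name[256];

enum {
    NAME_BYTES = sizeof(g_name),                     /* size in bytes     */
    NAME_COUNT = sizeof(g_name) / sizeof(g_name[0])  /* size in elements  */
};

/* Hypothetical converter: widens a 7-bit ASCII string into dst.
   dst_count is the capacity of dst in ELEMENTS, including the terminator.
   Returns the number of elements written, or 0 if dst is too small. */
size_t widen(const char *src, WIDE_CHAR *dst, size_t dst_count)
{
    size_t i = 0;

    assert(src != NULL && dst != NULL);
    for (;;) {
        if (i == dst_count) {
            /* Out of room: leave an empty string rather than a partial one. */
            if (dst_count > 0) {
                dst[0] = 0;
            }
            return 0;
        }
        dst[i] = (WIDE_CHAR)(unsigned char)src[i];
        if (src[i] == '\0') {
            return i + 1;
        }
        ++i;
    }
}
```

Passing NAME_COUNT lets the converter reject an oversized name. Passing NAME_BYTES would tell it the buffer holds twice as many elements as it really does, which is exactly the mistake in GetName.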

To reduce confusion, one good approach is to create a macro like so:

#define ElementCount(x) (sizeof(x)/sizeof((x)[0]))
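One caveat is worth keeping in mind: a macro of this kind gives the right answer only when it is applied to a real array. Once an array has been passed to a function it is just a pointer, and sizeof reports the size of the pointer. The sketch below, which uses the hypothetical helpers fill_buffer and fill_table, shows one way to handle this: compute the count where the array is declared and pass it along explicitly.

```c
#include <assert.h>
#include <stddef.h>

#define ElementCount(x) (sizeof(x) / sizeof((x)[0]))

/* Inside this function buf is only a pointer, so ElementCount(buf) would
   divide the pointer size by the element size and give a meaningless value.
   The caller therefore supplies the element count. */
size_t fill_buffer(unsigned short *buf, size_t count, unsigned short value)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < count; ++i) {
        buf[i] = value;
    }
    return count;
}

size_t fill_table(void)
{
    unsigned short table[64];

    /* table is a true array here, so the macro yields 64, not 128. */
    return fill_buffer(table, ElementCount(table), 0);
}
```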

Here's something else to consider when translating Unicode to ANSI: not all characters will translate. The second argument to WideCharToMultiByte determines how the function behaves when a character cannot be translated. This is important when dealing with canonicalization or the logging of user input, particularly from the network.

WARNING
Using the %S format specifier with the printf family of functions will silently skip characters that don't translate, so it's quite possible that the number of characters in the input Unicode string will be greater than the number of characters in the output string.
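For logging or canonicalization code, it is usually safer to detect an untranslatable character than to let it be dropped or replaced silently. The following sketch illustrates the idea with the standard C function wcrtomb rather than WideCharToMultiByte, so it is an analogy, not the Win32 behavior itself. The helper name narrow_strict is hypothetical, and the example assumes the program is running in the default "C" locale, where a character such as U+20AC has no narrow representation.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts src to the current locale's multibyte
   encoding. Returns the number of bytes written (excluding the terminator),
   or -1 if any character cannot be represented or dst is too small.
   On failure dst is left as an empty string. */
long narrow_strict(const wchar_t *src, char *dst, size_t dst_size)
{
    mbstate_t state;
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    assert(src != NULL && dst != NULL);
    if (dst_size == 0) {
        return -1;
    }
    memset(&state, 0, sizeof state);

    for (; *src != L'\0'; ++src) {
        size_t n = wcrtomb(tmp, *src, &state);

        /* (size_t)-1 means the character has no representation. */
        if (n == (size_t)-1) {
            dst[0] = '\0';
            return -1;
        }
        /* Need room for n bytes plus the terminating null. */
        if (n >= dst_size - used) {
            dst[0] = '\0';
            return -1;
        }
        memcpy(dst + used, tmp, n);
        used += n;
    }
    dst[used] = '\0';
    return (long)used;
}
```

The caller can then reject the input or log it in an escaped form instead of recording a string that no longer matches what was received.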

A Real Unicode Bug Example

The Internet Printing Protocol (IPP) buffer overrun vulnerability was a Unicode bug. You can find more information on this vulnerability at http://www.microsoft.com/technet/security; look at bulletin MS01-023. IPP runs as an ISAPI application in the same process as Internet Information Services (IIS) 5, which runs under the SYSTEM account; therefore, an exploitable buffer overrun is even more dangerous. Notice that the bug was not in IIS. The vulnerable code looks somewhat like this:

TCHAR wszComputerName[256];

BOOL GetServerName(EXTENSION_CONTROL_BLOCK *pECB)
{
    DWORD dwSize = sizeof(wszComputerName);
    char szComputerName[256];

    if (pECB->GetServerVariable(pECB->ConnID,
                                "SERVER_NAME",
                                szComputerName,
                                &dwSize)) {
        // Do something.
    }
}

GetServerVariable, an ISAPI function, copies up to dwSize bytes to szComputerName. However, dwSize is 512 because it is computed from wszComputerName, not szComputerName, and TCHAR is a macro that, in the case of this code, expands to a two-byte Unicode (wide) character. The function is told that it can copy up to 512 bytes of data into szComputerName, which is only 256 bytes in size! Oops!
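The general remedy is to compute the size from the very buffer that is passed to the function. The sketch below is not the actual IPP fix and does not use the ISAPI interface. It assumes a hypothetical get_variable function that loosely mimics the GetServerVariable contract: the caller passes the buffer size in bytes, and the function fails and reports the required size if the value does not fit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a GetServerVariable-style call.
   On entry *size is the capacity of buf in bytes. Returns 1 on success,
   or 0 if the value does not fit; either way *size is set to the number
   of bytes the value needs, including the terminating null. */
int get_variable(const char *value, char *buf, unsigned long *size)
{
    size_t needed;

    assert(value != NULL && buf != NULL && size != NULL);
    needed = strlen(value) + 1;
    if (needed > *size) {
        *size = (unsigned long)needed;
        return 0;
    }
    memcpy(buf, value, needed);
    *size = (unsigned long)needed;
    return 1;
}

/* incoming plays the role of the attacker-controlled server name. */
int get_server_name(const char *incoming)
{
    char szComputerName[256];

    /* Size taken from the destination buffer itself, in bytes. */
    unsigned long dwSize = sizeof(szComputerName);

    return get_variable(incoming, szComputerName, &dwSize);
}
```

With the size tied to szComputerName, an overlong name makes the call fail instead of overwriting the stack.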

It's also a common misconception that overruns where the buffer gets converted from ANSI to Unicode first aren't exploitable. Every other character is null, so how could you exploit it? Here's a paper, written by Chris Anley, that details how it can be done: http://www.nextgenss.com/papers/unicodebo.pdf. To sum it up, you need a somewhat larger buffer than usual, and the attacker then takes advantage of the fact that instructions on the Intel architecture can have a variable number of bytes. This allows the attacker to cause the system to decode a series of Unicode characters into a string of single-byte instructions. As always, assume that if an attacker can affect the execution path in any way, an exploit is possible.
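To see why every other byte is null, consider what happens to plain ASCII input when it is widened. The following sketch uses a hypothetical helper, ascii_to_utf16le, and assumes 7-bit ASCII input and little-endian 16-bit characters, as on Intel-based Windows systems. Input outside that range depends on the code page in use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expands len ASCII bytes into little-endian 16-bit
   characters. out_bytes is the capacity of out in BYTES.
   Returns the number of bytes written, or 0 if out is too small. */
size_t ascii_to_utf16le(const char *src, size_t len,
                        unsigned char *out, size_t out_bytes)
{
    size_t i;

    assert(src != NULL && out != NULL);
    if (len > out_bytes / 2) {
        return 0;
    }
    for (i = 0; i < len; ++i) {
        out[2 * i]     = (unsigned char)src[i]; /* low byte: the character */
        out[2 * i + 1] = 0x00;                  /* high byte: always null  */
    }
    return 2 * len;
}
```

Widening "AB" in this way yields the bytes 41 00 42 00. This is the byte pattern an attacker has to work with when the overflowed buffer has been through such a conversion.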