Character Sets and Unicode


In the previous section, you were primarily concerned with characters that, when left unchecked, might represent a security threat to the application you're reviewing. Extending this idea, you now examine different character set encodings and common situations in which they can cause problems. Character set encodings determine the sequence of bytes used to represent characters in different languages. In the context of security, you're concerned with how conversions between character sets affect an application's capability to accurately evaluate data streams and filter hostile input.

Unicode

The Unicode standard describes characters from any language in a unique and unambiguous way. It was primarily intended to address limitations of the ASCII character set and the proliferation of potentially incompatible character sets for other languages. The result is a standard that defines "a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software." The Unicode standard (available at www.unicode.org) defines characters as a series of codepoints (numerical values) that can be encoded in several formats, each with different size code units. A code unit is a single entity as seen by the encoding and decoding routines; each code unit size can be represented in either byte order: big endian (BE) or little endian (LE). Table 8-3 shows the different encoding formats in Unicode.

Table 8-3. Unicode Encoding Formats

Name        Code Unit Size (in Bits)    Byte Order
UTF-8       8                           N/A
UTF-16BE    16                          Big endian
UTF-16LE    16                          Little endian
UTF-32BE    32                          Big endian
UTF-32LE    32                          Little endian


Note that the byte-order specification (BE or LE) can be omitted, in which case a byte order mark (BOM) at the beginning of the input can be used to indicate the byte order.

These encoding schemes are used extensively in HTTP communications for request data or XML documents. They are also used in a lot of Microsoft-based software because current Windows operating systems use Unicode internally to represent strings. Unicode's codespace is 0 to 0x10FFFF, so 16-bit and 8-bit code units might not be able to represent an entire Unicode character because of size constraints. However, characters can be encoded as multibyte streams; that is, several encoded bytes in sequence might combine to represent one Unicode character.

Auditing programs that make use of Unicode characters and Unicode encoding schemes requires reviewers to verify:

  • Whether characters can be encoded to bypass security checks

  • Whether the implementation of encoding and decoding contains vulnerabilities of its own

The first check requires verifying that characters aren't converted after filter code has run to check the input's integrity. For example, a major bug was discovered in the Microsoft Internet Information Services (IIS) Web server. It was a result of the Web server software failing to decode Unicode escapes before checking whether a user was trying to perform a directory traversal (double dot) attack; so it didn't catch encoded ../ and ..\ sequences. Users could make the following request:

GET /..%c0%af..%c0%afwinnt/system32/cmd.exe?/c+dir


In this way, they could run arbitrary commands with the permissions the Web server uses.
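To make this ordering problem concrete, the following self-contained sketch models the flawed sequence of operations in C. It is not the actual IIS code; the function names, the toy percent-and-UTF-8 decoder, and the request string are all illustrative, and the decoder deliberately accepts the overlong two-byte form the way a lenient implementation would.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Percent-decode %XX escapes, then collapse the two-byte form 0xC0 0xXX
   into an ASCII character the way a lenient (non-shortest-form) UTF-8
   decoder would. This is only a toy model of the decoding behavior. */
void decode_path(const char *in, char *out, size_t outlen)
{
    unsigned char tmp[1024];
    size_t i = 0, j = 0;

    /* pass 1: %XX hex escapes */
    while(*in && i < sizeof(tmp) - 1){
        if(in[0] == '%' && isxdigit((unsigned char)in[1]) &&
           isxdigit((unsigned char)in[2])){
            unsigned v;
            sscanf(in + 1, "%2x", &v);
            tmp[i++] = (unsigned char)v;
            in += 3;
        } else {
            tmp[i++] = (unsigned char)*in++;
        }
    }
    tmp[i] = 0;

    /* pass 2: lenient UTF-8; accepts the overlong 0xC0 0xAF as '/' */
    for(i = 0; tmp[i] && j < outlen - 1; i++){
        if(tmp[i] == 0xC0 && (tmp[i + 1] & 0xC0) == 0x80){
            out[j++] = (char)(tmp[i + 1] & 0x3F);
            i++;
        } else {
            out[j++] = (char)tmp[i];
        }
    }
    out[j] = 0;
}

/* The flaw: the traversal check runs on the RAW path; decoding happens after. */
int path_is_safe(const char *raw)
{
    return strstr(raw, "../") == NULL && strstr(raw, "..\\") == NULL;
}

int main(void)
{
    const char *raw = "/..%c0%af..%c0%afwinnt/system32/cmd.exe";
    char decoded[1024];

    if(!path_is_safe(raw))                       /* 1. filter sees no "../"   */
        return 1;

    decode_path(raw, decoded, sizeof(decoded));  /* 2. now it contains them   */
    printf("decoded to: %s\n", decoded);
    return 0;
}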

Note

You can find details of this vulnerability at www.microsoft.com/security/technet/bulletin/MS00-078.mspx.


Because many applications use Unicode representation, an attack of this nature is always a major threat. Given that a range of encoding schemes are available to express data, there are quite a few ways to represent the same codepoint. You already know that you can represent a value in 8-, 16-, or 32-bit code units (in either byte order), but smaller code units have multiple ways to represent individual code points. To understand this better, you need to know more about how code points are encoded, explained in the following sections.

UTF-8

UTF-8 encoded codepoints are represented as single or multibyte sequences. For the range of values 0x00 to 0x7F, only a single byte is required, so the UTF-8 encoding for U.S. ASCII codepoints is identical to ASCII. For other values that can't be represented in 7 bits, a lead byte is given, followed by a variable number of trailing bytes (up to four) that combine to represent the value being encoded. The lead byte consists of the highest bit set plus a number of other bits in the most significant word that indicate how many bytes are in this multibyte set. So the number of bits set contiguously in the lead byte's high word specifies the number of trailing bytes, as shown in Table 8-4.

Table 8-4. UTF-8 Lead-Byte Encoding Scheme

Bit Pattern     Bytes Following
110x xxxx       1
1110 xxxx       2
1111 xxxx       3, 4, or 5


Note

The bit pattern rules in Table 8-4 are a slight oversimplification, but they are adequate for the purposes of this discussion. Interested readers are encouraged to browse the current specification at www.unicode.org.


The bits replaced by x are used to hold part of the value being represented. Each trailing byte begins with its topmost bits set to 10 and has its least significant 6 bits set to hold part of the value being represented. Therefore, it's illegal for a trailing byte to be less than 0x80 or greater than 0xBF, and it's also illegal for a lead byte to start with 10 (as that would make it indistinguishable from a trailing byte).
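The following helpers are a minimal sketch of how a decoder classifies lead and trail bytes; the function names are illustrative, and the lead-byte rules follow the modern four-byte maximum rather than the older five- and six-byte forms shown in Table 8-4.

/* Number of trailing bytes implied by a lead byte, or -1 if the byte
   can't legally start a sequence (0x80-0xBF are trail bytes). */
int utf8_trail_count(unsigned char lead)
{
    if(lead < 0x80) return 0;               /* plain ASCII, single byte  */
    if((lead & 0xE0) == 0xC0) return 1;     /* 110x xxxx                 */
    if((lead & 0xF0) == 0xE0) return 2;     /* 1110 xxxx                 */
    if((lead & 0xF8) == 0xF0) return 3;     /* 1111 0xxx                 */
    return -1;                              /* trail byte or invalid     */
}

/* Trailing bytes must be 10xx xxxx, that is, in the range 0x80-0xBF. */
int utf8_is_trail(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}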

Until recently, you could encode Unicode values with any of the supported multibyte lengths you wanted. So, for example, a / character could be represented as

  • 0x2F

  • 0xC0 0xAF

  • 0xE0 0x80 0xAF

  • 0xF0 0x80 0x80 0xAF

The Unicode 3.0 standard, released in 1999, has been revised to allow only the shortest form encoding; for instance, the only legal UTF-8 encoding in the preceding list is 0x2F. Windows XP and later enforce the shortest-form encoding rule. However, not all Unicode implementations are compliant. In particular, ASCII characters are often accepted as one- or two-byte sequences, which could be useful in evading character filters. For example, a filter searching for slashes in a path argument (0x2F) might miss the sequence 0xC0 0xAF; if UTF-8 conversions are performed later, this character filter can be completely evaded for any arbitrary ASCII character.
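The following self-contained sketch shows why this matters to a filter: a byte-level scan for 0x2F never sees the overlong two-byte form, but a lenient decoder still produces a slash from it. The decoding arithmetic is the standard two-byte UTF-8 formula; a strict decoder must reject the result because the value fits in 7 bits.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned char path[] = { '.', '.', 0xC0, 0xAF, 'e', 't', 'c', 0 };

    /* the naive filter: no 0x2F byte present, so the input passes */
    printf("raw slash found: %s\n",
           memchr(path, 0x2F, sizeof(path) - 1) ? "yes" : "no");

    /* a lenient decoder reconstructs the codepoint anyway */
    unsigned cp = ((path[2] & 0x1F) << 6) | (path[3] & 0x3F);
    printf("decoded codepoint: 0x%02X ('%c')\n", cp, cp);

    /* the shortest-form check a strict decoder applies */
    if(cp < 0x80)
        printf("overlong encoding: strict decoders must reject this\n");
    return 0;
}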

Note

Daniel J. Roelker published an interesting paper on combining these different multibyte encodings with several other hexadecimal encoding techniques to evade IDS filtering of HTTP traffic. It's available at http://docs.idsresearch.org/http_ids_evasions.pdf.


UTF-16

UTF-16 expresses codepoints as 16-bit words, which is enough to represent most commonly used characters in major languages. However, some characters require more than 16 bits. Remember, the codespace for Unicode ranges from 0 to 0x10FFFF, and the maximum value a 16-bit integer can represent is 0xFFFF. Therefore, UTF-16 can also contain multi-unit sequences, so UTF-16 encoded codepoints can be one or two units. A codepoint higher than 0xFFFF requires two code units to be expressed and is encoded as a surrogate pair; that is, a pair of code units with a special lead bit sequence that combines to represent a codepoint. These are the rules for encoding Unicode codepoints in UTF-16 (taken from RFC 2781):

1. If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.

2. Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits.

3. Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. Each integer has 10 bits free to encode the character value, for a total of 20 bits.

4. Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate.

Because the constant value 0x10000 is added to the bits read from a surrogate pair, you can't encode arbitrary values the way you were able to in UTF-8. With UTF-16 encoding, there's only one way to represent a codepoint.
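A short sketch of these rules in C follows; the function name is arbitrary, and the example codepoint (U+1D11E) is just a convenient value above 0xFFFF.

#include <stdint.h>
#include <stdio.h>

/* Encode a codepoint as UTF-16, following the RFC 2781 steps above.
   Returns the number of 16-bit code units written (1 or 2). */
int utf16_encode(uint32_t U, uint16_t out[2])
{
    if(U < 0x10000){                            /* rule 1: one code unit */
        out[0] = (uint16_t)U;
        return 1;
    }
    U -= 0x10000;                               /* rule 2: U' fits in 20 bits */
    out[0] = 0xD800 | (uint16_t)(U >> 10);      /* rules 3-4: high 10 bits    */
    out[1] = 0xDC00 | (uint16_t)(U & 0x3FF);    /* rules 3-4: low 10 bits     */
    return 2;
}

int main(void)
{
    uint16_t w[2];
    int n = utf16_encode(0x1D11E, w);   /* U+1D11E MUSICAL SYMBOL G CLEF */
    printf("%d unit(s): 0x%04X 0x%04X\n", n, (unsigned)w[0],
           (unsigned)(n == 2 ? w[1] : 0));
    /* prints: 2 unit(s): 0xD834 0xDD1E */
    return 0;
}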

UTF-32

UTF-32 expresses codepoints as 32-bit values. Because it can represent any codepoint in a single value, multi-unit sequences are never required, as they are in UTF-8 (multibyte sequences) and UTF-16 (surrogate pairs). The only way to alter how a codepoint is represented in UTF-32 encoding is to change the data stream's endian format (using the special BOM mentioned after Table 8-3).

Vulnerabilities in Decoding

As mentioned, the difficulty with filtering Unicode data correctly is that the same value can be represented in many ways by using different word-size encodings, by switching byte order, and by abusing UTF-8's unique capability to represent the same value in more than one way. An application isn't going to be susceptible to filter bypasses if only one data decoding is performed; that is, the data is decoded, checked, and then used. However, in the context of HTTP traffic, only one decoding seldom happens. Why? Web applications have increased the complexity of HTTP exchanges dramatically, and data can often be decoded several times and in several ways. For example, the IIS Web server decodes hexadecimal sequences in a request, and then later performs UTF-8 decoding on it, and then might hand it off to an ISAPI filter or Web application that does more hexadecimal decoding on it.

Note

You can find excellent information on security issues with Unicode in TR36, the Unicode Security Considerations Technical Report. At the time of this writing, it's available at www.unicode.org/reports/tr36/.


Homographic Attacks

Homographic attacks are primarily useful as a form of social engineering; Evgeniy Gabrilovich and Alex Gontmakher originally described them in "The Homographic Attack," published in the February 2002 edition of Communications of the ACM. These attacks take advantage of Unicode homographs: different characters that have the same visual representation. At its simplest level, a homographic attack doesn't specifically require Unicode. For example, the digit 1 (ASCII 0x31) can look like the lowercase letter l (ASCII 0x6c). However, with a little scrutiny, you can tell them apart. In contrast, a Unicode homographic attack involves two graphical representations that are identical, even though the underlying characters are different. For example, the Cyrillic character at codepoint 0x0441 happens to look a lot like the Latin-1 (ASCII) character 0x0063; both are typically rendered as a lowercase c.

Chapter 17 includes an example of a well-publicized homographic attack in the discussion on phishing. For now, just understand that attackers can take advantage of these simple tricks when you're reviewing an application that presents users with data from external sources. Even if the data isn't directly harmful, attackers might be able to use it to trick unsuspecting users.

Windows Unicode Functions

The Windows operating system deals with string data internally as wide characters (encoded as UTF-16). Because many applications deal with ASCII strings (or perhaps other single or multibyte character sets), Windows provides functions for converting between the two formats as well as ASCII wrapper functions for all the exposed API functions that would otherwise require wide character strings.

The conversion between character encodings takes place similarly whether an application uses ASCII wrapper functions or converts data explicitly. The rules for these conversions are determined primarily by the behavior of two functions: MultiByteToWideChar() and WideCharToMultiByte(). The details of how these functions perform conversions have a number of security implications ranging from memory corruption errors to conversions that produce unexpected results, as discussed in the following sections.

MultiByteToWideChar()

The MultiByteToWideChar() function is used to convert multi- and single-byte character strings into Unicode strings. A maximum of cchWideChar characters can be written to the output buffer (lpWideCharStr). A common error that application developers make when using this function is to specify the destination buffer's size in bytes as the cchWideChar parameter. Doing this means twice as many bytes could be written to the output buffer as space has been allocated for, and a buffer overflow might occur. The MultiByteToWideChar() function is summarized in the following list:

  • Function int MultiByteToWideChar(UINT CodePage, DWORD dwFlags, LPCSTR lpMultiByteStr, int cbMultiByte, LPWSTR lpWideCharStr, int cchWideChar)

  • API Win32 API

  • Similar functions mbtowc

  • Purpose MultiByteToWideChar() maps a single- or multibyte character string to a wide character string.

The following code is an example misusing MultiByteToWideChar():

HANDLE OpenFile(LPSTR lpFilename)
{
    WCHAR wPath[MAX_PATH];

    if(MultiByteToWideChar(0, 0, lpFilename, -1, wPath,
                           sizeof(wPath)) == 0)
        return INVALID_HANDLE_VALUE;

    ... Create the file ...
}


This code is an example of the problem just mentioned. The last argument shows the wide-character count is set to the size of the output buffer in bytes, which in this case is MAX_PATH * sizeof(WCHAR). However, a WCHAR is two bytes, so the output size provided to MultiByteToWideChar() is interpreted as MAX_PATH * 2 wide characters, twice the real length of the output buffer.
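One corrected form of the call (a sketch, not the only possible fix) passes the destination size in wide characters by dividing the byte size by the element size:

if(MultiByteToWideChar(0, 0, lpFilename, -1, wPath,
                       sizeof(wPath) / sizeof(wPath[0])) == 0)
    return INVALID_HANDLE_VALUE;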

WideCharToMultiByte()

The WideCharToMultiByte() function is the inverse of MultiByteToWideChar(); it converts a string of wide characters into a string of narrow characters. Developers are considerably less likely to trigger a buffer overflow when using this function because the output size is in bytes rather than wide characters, so there's no misunderstanding the meaning of the size parameter. The WideCharToMultiByte() function is summarized in the following list:

  • Function int WideCharToMultiByte(UINT CodePage, DWORD dwFlags, LPCWSTR lpWideCharStr, int cchWideChar, LPSTR lpMultiByteStr, int cbMultiByte, LPCSTR lpDefaultChar, LPBOOL lpUsedDefaultChar)

  • API Win32 API

  • Similar functions wctomb

  • Purpose WideCharToMultiByte() maps a wide character string to a single- or multibyte character string.

Because wide characters are a larger data type, their information sometimes needs to be represented by a sequence of single bytes, called a multibyte character. The rules for encoding wide characters into multibyte characters are governed by the code page specified as the first argument to this function.

NUL-Termination Problems

The MultiByteToWideChar() and WideCharToMultiByte() functions don't guarantee NUL-termination if the destination buffer is filled. In these cases, the functions return 0, as opposed to the number of characters converted. It's intended that users of these functions check the return value; however, this is often not the case. Listing 8-29 shows a brief example.

Listing 8-29. Return Value Checking of MultiByteToWideChar()

HANDLE open_file(char *name)
{
    WCHAR buf[1024];
    HANDLE hFile;

    MultiByteToWideChar(CP_ACP, 0, name, strlen(name),
                        buf, sizeof(buf)/2);

    wcsncat(buf, L".txt", sizeof(buf)/2 - wcslen(buf) - 1);

    ...
}

Because the return value is left unchecked, the fact that buf isn't big enough to hold the name being converted isn't caught, and buf is not NUL-terminated. This causes wcsncat() to miscalculate the remaining buffer size as a negative number, which you know is converted into a large positive number if you review the wcsncat() function prototype listed under strncat().
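One way to address both problems is sketched here; this is an illustrative fix, not the book's. It converts the terminator along with the string by passing -1, checks the return value, and verifies there's room for the suffix before concatenating.

HANDLE open_file(char *name)
{
    WCHAR buf[1024];

    /* passing -1 converts the terminating NUL too, so buf is
       NUL-terminated whenever the call succeeds */
    if(MultiByteToWideChar(CP_ACP, 0, name, -1, buf,
                           sizeof(buf) / sizeof(buf[0])) == 0)
        return INVALID_HANDLE_VALUE;

    /* make sure L".txt" and the terminator still fit */
    if(wcslen(buf) + 4 >= sizeof(buf) / sizeof(buf[0]))
        return INVALID_HANDLE_VALUE;

    wcscat(buf, L".txt");

    ... open and return the file handle ...
}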

MultiByteToWideChar() might have additional problems when multibyte character sets are being converted. If the MB_ERR_INVALID_CHARS flag is specified, the function triggers an error when an invalid multibyte sequence is encountered. Here's an example showing a potentially dangerous call:

PWCHAR convert_string(UINT cp, char *instr)
{
    WCHAR *outstr;
    size_t length;

    length = strlen(instr) + 1;

    outstr = (WCHAR *)calloc(length, sizeof(WCHAR));

    MultiByteToWideChar(cp, MB_ERR_INVALID_CHARS, instr, -1,
                        outstr, -1);

    return outstr;
}


Again, because the function's return value isn't checked, the convert_string() function doesn't catch invalid character sequences. The problem is that MultiByteToWideChar() returns an error when it sees an invalid character sequence, but it doesn't NUL-terminate the destination buffer (outstr, in this case). The function therefore doesn't deal with this error, and an unterminated wide string is returned to the caller. Because of this, any later processing on this string could result in memory corruption.
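A more defensive pattern is sketched next; it relies on the documented size-query behavior of MultiByteToWideChar() (a zero output size makes the function return the required length) and checks both calls, and is offered as an illustration rather than the only correct approach.

PWCHAR convert_string(UINT cp, char *instr)
{
    WCHAR *outstr;
    int needed;

    needed = MultiByteToWideChar(cp, MB_ERR_INVALID_CHARS, instr, -1,
                                 NULL, 0);      /* size query only        */
    if(needed == 0)                             /* invalid sequence, etc. */
        return NULL;

    outstr = (WCHAR *)calloc(needed, sizeof(WCHAR));
    if(outstr == NULL)
        return NULL;

    if(MultiByteToWideChar(cp, MB_ERR_INVALID_CHARS, instr, -1,
                           outstr, needed) == 0){
        free(outstr);
        return NULL;
    }

    return outstr;
}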

Unicode Manipulation Vulnerabilities

Memory management issues can also occur when using any bounded multibyte or wide character functions. Take a look at an example using wcsncpy():

wchar_t destination[1024];

wcsncpy(destination, source, sizeof(destination));


At first glance, it seems as though this code is correct, but of course the size parameter should indicate how big the destination buffer is in wide characters, not the size in bytes; so the third argument is actually twice the length of the output buffer. This mistake is easy to make, so code auditors should keep an eye out for it.
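A corrected sketch sizes the copy in wide characters and terminates explicitly, since wcsncpy() doesn't guarantee termination on truncation:

wchar_t destination[1024];

wcsncpy(destination, source, sizeof(destination) / sizeof(wchar_t) - 1);
destination[sizeof(destination) / sizeof(wchar_t) - 1] = L'\0';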

Another interesting quirk is errors in dealing with user-supplied multibyte-character data strings. If the application code page indicates that a double-byte character set (DBCS) is in use, characters can be one or two bytes. Applications processing these strings need to identify whether each byte being processed is a single character or part of a two-byte sequence; in Windows, this check is performed with the IsDBCSLeadByte() function. Vulnerabilities in which a pointer can be incremented out of the buffer's bounds can easily occur if the application determines that a byte is the first of a two-byte sequence and neglects to check the next byte to make sure it isn't the terminating NUL byte. Listing 8-30 shows an example.

Listing 8-30. Dangerous Use of IsDBCSLeadByte()

char *escape_string(char *src)
{
    char *newstring, *dst;

    newstring = (char *)malloc(2*strlen(src) + 1);

    if(!newstring)
        return NULL;

    for(dst = newstring; *src; src++){
        if(IsDBCSLeadByte(*src)){
            *dst++ = *src++;
            *dst++ = *src;
            continue;
        }

        if(*src == '\'')
            *dst++ = '\\';

        *dst++ = *src;
    }

    return newstring;
}

When the code in Listing 8-30 encounters a lead byte of a two-byte sequence, it does no checking on the second byte of the two-byte sequence. If the string passed to this function ends with a DBCS lead byte, the lead byte and the terminating NUL byte are written to the destination buffer. The src pointer is incremented past the NUL byte and continues processing bytes outside the bounds of the string passed to this function. This error could result in a buffer overflow of the newstring buffer, as the allocated length is based on the string length of the source string.
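A sketch of the missing check follows: the second byte is consumed only if the lead byte isn't immediately followed by the terminating NUL. (Explicit termination of the output is also added here, although that isn't the bug under discussion.)

for(dst = newstring; *src; src++){
    if(IsDBCSLeadByte(*src) && *(src + 1) != '\0'){
        *dst++ = *src++;
        *dst++ = *src;
        continue;
    }

    if(*src == '\'')
        *dst++ = '\\';

    *dst++ = *src;
}
*dst = '\0';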

Note

When multibyte character sequences are interpreted, examine the code to see what can happen if the second byte of the sequence is the string's terminating NUL byte. If no check is done on the second byte, processing data outside the buffer's bounds might be possible.


Code Page Assumptions

When converting from multibyte to wide characters, the code page argument affects how MultiByteToWideChar() behaves, as it specifies the character set the multibyte string is encoded in. In most cases, this function is used with the default system code page (CP_ACP, ANSI Code Page), which doesn't do much. However, when a multibyte code page is used (or an attacker can influence which code page is applied), multibyte character sequences can be constructed that evade filters in earlier layers. Listing 8-31 is an example of a vulnerable code fragment.

Listing 8-31. Code Page Mismatch Example

if(strchr(filename, '/') || strchr(filename, '\\')){
    error("filenames with slashes are illegal!");
    return 1;
}

MultiByteToWideChar(CP_UTF8, 0, filename, strlen(filename),
                    wfilename, sizeof(wfilename)/2);

...

As you can see, encoding is performed after a check for slashes, so by encoding slashes, attackers targeting earlier versions of Windows can evade that check and presumably do something they shouldn't be able to later. Akio Ishida and Yasuo Ohgaki discovered an interesting variation on this vulnerability in the PostgreSQL and MySQL database APIs (available at www.postgresql.org/docs/techdocs.50). As mentioned, SQL control characters are commonly escaped with the backslash (\) character. However, some naive implementations of this technique might not account for multibyte characters correctly. Consider the following sequence of bytes:

0x95 0x5c 0x27


It's actually a string in which the first two bytes are a valid Shift-JIS encoded Japanese character, and the last byte is an ASCII single quote ('). A naive filter won't identify that the first two bytes refer to one character; instead, it interprets the 0x5c byte as the backslash character. Escaping this sequence would result in the following bytes:

0x95 0x5c 0x5c 0x5c 0x27


Passing the resulting string to a multibyte-aware database can cause a problem because the first two bytes are interpreted as a single Japanese character. Then the remaining two 0x5c bytes are interpreted as an escaped backslash sequence. Finally, the last byte is left as an unescaped single quote character. This misinterpreted encoding can be exploited to inject SQL statements into an application that otherwise shouldn't be vulnerable.
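The following toy program (an illustration, not code from either database API) shows the byte-level view: an escaper that only looks at individual bytes produces exactly the problematic sequence described above.

#include <stdio.h>

/* Escapes backslashes and single quotes one byte at a time, with no
   awareness of multibyte characters such as Shift-JIS sequences. */
void naive_escape(const unsigned char *in, unsigned char *out)
{
    for(; *in; in++){
        if(*in == '\\' || *in == '\'')   /* byte-level check only */
            *out++ = '\\';
        *out++ = *in;
    }
    *out = 0;
}

int main(void)
{
    unsigned char input[] = { 0x95, 0x5C, 0x27, 0x00 };
    unsigned char output[16];
    size_t i;

    naive_escape(input, output);
    for(i = 0; output[i]; i++)
        printf("0x%02X ", output[i]);
    printf("\n");   /* prints 0x95 0x5C 0x5C 0x5C 0x27 */
    return 0;
}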

Using multibyte character sets with MultiByteToWideChar() can introduce additional complications related to memory corruption. Listing 8-32 shows an interesting call to this function.

Listing 8-32. NUL Bytes in Multibyte Code Pages

PWCHAR convert_string(UINT cp, char *instr)
{
    WCHAR *outstr;
    size_t length;

    length = strlen(instr) * 2 + 1;

    outstr = (WCHAR *)calloc(length, sizeof(WCHAR));

    MultiByteToWideChar(cp, 0, instr, -1, outstr, -1);

    return outstr;
}

The call to MultiByteToWideChar() in Listing 8-32 is vulnerable to a buffer overflow when a multibyte code page is used. Why? Because the output string length is calculated by multiplying the input string length by two. However, this calculation isn't adequate because the NUL byte in the string could be part of a multibyte character; therefore, the NUL byte can be skipped, and out-of-bounds characters continue to be processed and written to the output buffer. In UTF-8, a NUL byte appearing inside a multibyte sequence would form an illegal character; however, MultiByteToWideChar() substitutes a default replacement character or skips the sequence (depending on the Windows version), unless the MB_ERR_INVALID_CHARS flag is specified in the second argument. When that flag is specified, the function returns an error when it encounters an illegal character sequence.

Character Equivalence

Using WideCharToMultiByte() has some interesting consequences when decoding data. If conversions are performed after character filters, the code is equally susceptible to attackers sneaking illegal characters through filters. When converting wide characters into multibyte characters, however, the risk increases for two main reasons:

  • Even with the default code page, multiple 16-bit values often map to the same 8-bit character. As an example, if you want a backslash to appear in the input stream of the converted character set, you can supply three different wide characters that convert into the backslash byte (0x5c): 0x00 0x5c, 0x22 0x16, and 0xff 0x0c. You can do this not because the backslash character has three Unicode representations, but because the output character represents the closest match when an exact conversion can't be performed. This behavior can be toggled with the WC_NO_BEST_FIT_CHARS flag (see the probe sketch after this list).

  • When a character is encountered that can't be converted to a multibyte character and a close replacement can't be found (or the WC_NO_BEST_FIT_CHARS flag is set), a default replacement character is inserted in the output stream; the ? character is used for the ANSI code page, unless otherwise specified. If this replacement character is filtered, a wide range of values can generate this character in the output stream.
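If you want to see which values collapse to a particular character on a given system, a small Windows-only probe such as the following sketch can enumerate them; the choice of backslash as the target byte is just an example, and the exact set reported depends on the code page and Windows version.

#include <stdio.h>
#include <windows.h>

int main(void)
{
    WCHAR in[1];
    char out[8];
    unsigned c;

    for(c = 1; c <= 0xFFFF; c++){
        in[0] = (WCHAR)c;
        /* with WC_NO_BEST_FIT_CHARS instead of 0, only U+005C itself
           should convert to a backslash */
        if(WideCharToMultiByte(CP_ACP, 0, in, 1, out, sizeof(out),
                               NULL, NULL) == 1 && out[0] == '\\')
            printf("U+%04X converts to a backslash\n", c);
    }
    return 0;
}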

Auditing code that uses MultiByteToWideChar() or WideCharToMultiByte() requires careful attention to all these factors:

  • Check whether data is required to pass through a filter before it's converted rather than after.

  • Check whether the code page is multibyte or can be specified by a user.

  • If the MB_ERR_INVALID_CHARS flag is set for converting multibyte streams, user input can prematurely terminate processing and leave an unterminated output buffer. If it's omitted, a multibyte sequence including the trailing NUL byte can be specified, potentially causing problems for poorly written code.

  • If the WC_NO_BEST_FIT_CHARS flag is omitted when converting wide character data, users might be able to supply multiple data values that translate into the same single-byte character. The best-fit matching rules are years out of date, and most developers shouldn't rely on them anyway.

  • Look for any other flags affecting how these functions might be misused.

  • Make sure the return value is checked. If a conversion error is not identified, unterminated buffers might be used incorrectly.

  • Check the sizes of input and output buffers, as mentioned in the discussion of memory corruption in Chapter 5.



