Web-Specific Canonicalization Bugs

Chapter 8 covers canonicalization issues in detail, but I purposefully avoided Web-based canonicalization topics there because, although the security vulnerabilities are the same, the attack types are subtly different. First, a quick recap of canonicalization bugs. Canonicalization mistakes are caused when your application makes a security decision based on a name (such as a filename, a directory name, or a URL) and more than one representation of the resource name exists, which can lead to the security check being bypassed.

What makes Web-based canonicalization issues so prevalent and hard to defend against is the number of ways you can represent any character. For example, any character can be represented in a URL or a Web page by using one or more of the following mechanisms:

The normal 7-bit or 8-bit character representation, also called US-ASCII
Hexadecimal escape codes
UTF-8 variable-width encoding
UCS-2 Unicode encoding
Double encoding
HTML escape codes (Web pages, not URLs)

7-Bit and 8-Bit ASCII

I trust you understand the 7-bit and 8-bit ASCII representations, which have been used in computer systems for many years.

Hexadecimal Escape Codes

Hex escapes are a way to represent a possibly nonprintable character by using its hexadecimal equivalent. For example, the space character is %20, and the pounds sterling character (£) is %A3. You can use this mapping in a URL such as http://www.northwindtraders.com/my%20document.doc, which will open my document.doc on the Northwind Traders Web site; http://www.northwindtraders.com/my%20document%2Edoc will do likewise.

In Chapter 8, I mentioned a canonicalization bug in eEye's SecureIIS tool. The tool looked for certain words in the client request and rejected the request if any of the words were found. However, an attacker could hex escape any of the characters in the request, and the tool would not reject the request, essentially bypassing the security mechanism.
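To make the bypass concrete, here is a minimal sketch of a filter that decodes hex escapes before it looks for a blocked word. The names percentDecodeOnce and requestIsBlocked are hypothetical, and this is not how any particular product works; it only illustrates the order of operations (decode first, then check).

```cpp
#include <string>

// Returns the value of one hex digit, or -1 if c is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes %XX escapes once into 'out'. Returns false for a malformed
// escape (a '%' that is not followed by two hex digits).
bool percentDecodeOnce(const std::string& in, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '%') {
            out.push_back(in[i]);
            continue;
        }
        if (i + 2 >= in.size()) return false;
        const int hi = hexValue(in[i + 1]);
        const int lo = hexValue(in[i + 2]);
        if (hi < 0 || lo < 0) return false;
        out.push_back(static_cast<char>(hi * 16 + lo));
        i += 2;
    }
    return true;
}

// Hypothetical filter: searches for the blocked word only after decoding.
// Requests with malformed escapes are rejected outright.
bool requestIsBlocked(const std::string& rawUrl, const std::string& blockedWord) {
    std::string decoded;
    if (!percentDecodeOnce(rawUrl, decoded)) return true;
    return decoded.find(blockedWord) != std::string::npos;
}
```

A filter that searched the raw request for cmd.exe would miss /scripts/%63md.exe; the version above catches it because the comparison happens on the decoded form. It still handles only one layer of escaping, which matters for the double-encoding discussion later in this section.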

UTF-8 Variable-Width Encoding

Eight-bit Unicode Transformation Format, UTF-8, as defined in RFC 2279 (www.ietf.org/rfc/rfc2279.txt), is a way to encode characters by using one or more bytes. The variable-byte sizes allow UTF-8 to encode many different byte-size character sets, such as 2-byte Unicode (UCS-2), 4-byte Unicode (UCS-4), and ASCII, to name but a few. However, the fact that one character can potentially map to multiple-byte representations is problematic.

How UTF-8 Encodes Data

UTF-8 can encode n-byte characters into different byte sequences, depending on the value of the original characters. For example, a character in the 7-bit ASCII range 0x00-0x7F encodes to 07654321, where 0 is the leading bit, set to 0, and 7654321 represents the 7 bits that make up the 7-bit ASCII character. For instance, the letter H, which is 0x48 in hex, or 1001000 in binary, becomes the UTF-8 character 01001000, or 0x48. As you can see, 7-bit ASCII characters are unchanged by UTF-8.

Things become a little more complex as you start mapping characters beyond the 7-bit ASCII range, all the way up to the top of the Unicode range, 0x7FFFFFFF. For example, any character in the range 0x80-0x7FF encodes to 110xxxxx 10xxxxxx, where 110 and 10 are predefined bits and each x represents one bit from the character. For example, pounds sterling (£) is 0xA3, which is 10100011 in binary, or 00010 100011 when padded to 11 bits and split into a 5-bit and a 6-bit group. The UTF-8 representation is 11000010 10100011, or 0xC2 0xA3. However, it doesn't stop there. UTF-8 can encode larger byte-size characters. Table 12-2 outlines the mappings.

Table 12-2 UTF-8 Character Mappings
Character Range	Encoded Bytes
0x00000000-0x0000007F	0xxxxxxx
0x00000080-0x000007FF	110xxxxx 10xxxxxx
0x00000800-0x0000FFFF	1110xxxx 10xxxxxx 10xxxxxx
0x00010000-0x001FFFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000-0x03FFFFFF	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000-0x7FFFFFFF	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
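The mappings in Table 12-2 can be sketched as a small encoder. This is an illustration only: encodeUtf8 is a hypothetical name, and the function follows the six-row RFC 2279 layout shown in the table, always choosing the shortest row that fits the value.

```cpp
#include <cstdint>
#include <vector>

// Encodes one character value using the shortest row of Table 12-2 that
// can hold it. Returns an empty vector if the value is above 0x7FFFFFFF.
std::vector<std::uint8_t> encodeUtf8(std::uint32_t cp) {
    std::vector<std::uint8_t> out;
    if (cp > 0x7FFFFFFFu) return out;
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));  // 0xxxxxxx
        return out;
    }

    // Pick the sequence length from the ranges in the table.
    const int len = cp < 0x800      ? 2
                  : cp < 0x10000    ? 3
                  : cp < 0x200000   ? 4
                  : cp < 0x4000000  ? 5
                                    : 6;

    // Lead-byte prefixes: 110xxxxx, 1110xxxx, 11110xxx, 111110xx, 1111110x.
    static const std::uint8_t kLeadPrefix[7] = {0, 0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};

    out.resize(len);
    // Fill the continuation bytes (10xxxxxx) from the right, 6 bits each.
    for (int i = len - 1; i > 0; --i) {
        out[i] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    // Whatever bits remain go into the lead byte.
    out[0] = static_cast<std::uint8_t>(kLeadPrefix[len] | cp);
    return out;
}
```

For example, encodeUtf8(0x48) yields the single byte 0x48, and encodeUtf8(0xA3) yields 0xC2 0xA3.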

And this is where the fun starts; it is possible to represent a character by using any of these mappings, even though the UTF-8 specification warns against doing so. All UTF-8 characters should be represented in the shortest possible format. For example, the only valid UTF-8 representation of the ? character is 0x3F, or 00111111 in binary. On the other hand, an attacker might try using illegal nonshortest formats, such as these:

0xC0 0xBF
0xE0 0x80 0xBF
0xF0 0x80 0x80 0xBF
0xF8 0x80 0x80 0x80 0xBF
0xFC 0x80 0x80 0x80 0x80 0xBF

A bad UTF-8 parser might determine that all of these formats are the same, when, in fact, only 0x3F is valid.
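A parser that gets this right has to check, after decoding, that the value really needed as many bytes as were used. The following is a minimal sketch of such a check, not a production decoder. decodeUtf8Strict is a hypothetical name, and the sketch assumes the later, tighter definition of UTF-8 in RFC 3629, which allows at most four bytes and no value above 0x10FFFF, so the 5-byte and 6-byte forms are rejected outright.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes the UTF-8 sequence at the start of buf[0..len).
// Returns true and stores the character in *cp only if the sequence is
// well formed and in shortest form; overlong sequences are rejected.
bool decodeUtf8Strict(const std::uint8_t* buf, std::size_t len, std::uint32_t* cp) {
    if (len == 0) return false;

    const std::uint8_t b0 = buf[0];
    std::size_t n = 0;
    std::uint32_t value = 0;
    if (b0 < 0x80)                { n = 1; value = b0; }
    else if ((b0 & 0xE0) == 0xC0) { n = 2; value = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; value = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; value = b0 & 0x07; }
    else return false;  // stray continuation byte, or a 5/6-byte lead byte

    if (len < n) return false;  // truncated sequence
    for (std::size_t i = 1; i < n; ++i) {
        if ((buf[i] & 0xC0) != 0x80) return false;  // must be 10xxxxxx
        value = (value << 6) | (buf[i] & 0x3Fu);
    }

    // Smallest value that legitimately needs n bytes.
    static const std::uint32_t kMin[5] = {0, 0, 0x80, 0x800, 0x10000};
    if (n > 1 && value < kMin[n]) return false;            // overlong form
    if (value > 0x10FFFF) return false;                    // out of range
    if (value >= 0xD800 && value <= 0xDFFF) return false;  // surrogate

    *cp = value;
    return true;
}
```

With this check in place, 0x3F decodes to the ? character, while every one of the longer forms listed above is refused.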

Perhaps the most famous UTF-8 attack was against unpatched Microsoft Internet Information Server (IIS) 4 and IIS 5 servers. If an attacker made a request such as http://servername/scripts/..%c0%af../winnt/system32/cmd.exe, the server didn't correctly handle %c0%af in the URL. What do you think %c0%af means? It's 11000000 10101111 in binary; if it's broken up using the UTF-8 mapping rules in Table 12-2, the predefined 110 and 10 prefix bits drop away and the payload bits 00000 and 101111 remain. Therefore, the character is 00000101111, or 0x2F, the slash (/) character! The %c0%af is an invalid UTF-8 representation of the / character. Such an invalid UTF-8 escape is often referred to as an overlong sequence.

So when the attacker requested the tainted URL, he accessed http://servername/scripts/../../winnt/system32/cmd.exe. In other words, he walked out of the scripts virtual directory, which is marked to allow program execution, up to the root and down into the system32 directory, where he could pass commands to the command shell, Cmd.exe.

You can read more about the File Permission Canonicalization vulnerability at www.microsoft.com/technet/security/bulletin/MS00-057.asp.

UCS-2 Unicode Encoding

UCS-2 issues are a variation of hex encoding and, to some extent, UTF-8 encoding. Two-byte Universal Character Set, UCS-2, can be hex-encoded in a similar manner as ASCII characters but with the %uNNNN format, where NNNN is the hexadecimal value of the Unicode character. For example, %5C is the ASCII and UTF-8 hex escape for the backslash (\) character, and %u005C is the same character in two-byte Unicode.

To really confuse things, %u005C can also be represented by a wide Unicode equivalent called a fullwidth version. The fullwidth encodings are provided by Unicode to support conversions between some legacy Asian double-byte encoding systems. The characters in the range %uFF00 to %uFFEF are reserved as the fullwidth equivalents of %20 to %7E. For example, the \ character is %u005C and %uFF3C.
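One way to account for the fullwidth forms when checking a decoded character is to fold them back onto ASCII before comparing. The sketch below is an assumption-laden illustration: foldFullwidth is a hypothetical name, and it covers only the ASCII-shaped fullwidth forms U+FF01 through U+FF5E, which sit at a fixed offset of 0xFEE0 from U+0021 through U+007E.

```cpp
#include <cstdint>

// Maps a fullwidth form in U+FF01..U+FF5E to its ASCII counterpart
// (U+0021..U+007E). Any other value is returned unchanged.
std::uint32_t foldFullwidth(std::uint32_t cp) {
    if (cp >= 0xFF01 && cp <= 0xFF5E) return cp - 0xFEE0;
    return cp;
}
```

With this folding, both %u005C and %uFF3C compare equal to the backslash character once the %uNNNN escape has been converted to its numeric value.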

You can view these characters by using the Character Map application included with Microsoft Windows. Figure 12-1 shows the backslash character once the Arial Unicode MS font is installed from Microsoft Office XP.

Figure 12-1

Using the Character Map application to view Unicode characters.

Double Encoding

Just when you thought you understood the various encoding schemes (and we've looked at only the most common), along comes double encoding, which involves reencoding the encoded data. For example, the UTF-8 escape for the backslash character is %5C, which is made up of three characters (%, 5, and C), all of which can be reencoded using their own escapes: %25 for %, %35 for 5, and %43 for C or %63 for c. Hex digits in an escape are not case sensitive, so %5c and %5C decode to the same character. Table 12-3 outlines some double-encoding variations of the \ character.

Table 12-3 Sample Double Escaping Representations of \
Escape	Comments
%5C	Normal UTF-8 escape of the backslash character
%255C	%25, the escape for % followed by 5C
%%35%63	The % character followed by %35, the escape for 5, and %63, the escape for c (%5c is equivalent to %5C)
%25%35%63	The individual escapes for %, 5, and c

The vulnerability lies in the mistaken belief that a simple unescape operation will yield clean, raw data. The application then makes a security decision based on the data, but the data might not be fully unescaped.
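One hedged way to detect this situation is to unescape once and then ask whether a second unescape would change the data again. The sketch below uses hypothetical names (percentDecodeOnce, isMultiplyEncoded) and handles only %XX escapes. Treating multiply encoded input as invalid is a policy choice made for illustration, and it will also flag data that legitimately contains text such as %41 after one decoding pass.

```cpp
#include <string>

// Returns the value of one hex digit, or -1 if c is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes %XX escapes once. A '%' that does not start a valid escape
// is copied through unchanged.
std::string percentDecodeOnce(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            const int hi = hexValue(in[i + 1]);
            const int lo = hexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}

// True if a second decoding pass would change the data, which means the
// input was escaped more than once.
bool isMultiplyEncoded(const std::string& raw) {
    const std::string once = percentDecodeOnce(raw);
    return percentDecodeOnce(once) != once;
}
```

Of the four rows in Table 12-3, only the plain %5C passes this test; the other three are reported as multiply encoded and can be rejected before any security decision is made.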

HTML Escape Codes

HTML pages can also escape characters by using special character sequences. For example, angle brackets (< and >) can be represented as &lt; and &gt;, and the pounds sterling symbol can be represented as &pound;. But wait, there's more! These escape sequences can also be represented using the decimal or hexadecimal character values, not just easy-to-remember mnemonics, such as &lt; (less than) and &gt; (greater than). For example, &lt; is the same as &#x3C; (hexadecimal value of the < character) and is also the same as &#60; (decimal value of the < character). A complete list of these entities is available at www.w3.org/TR/REC-html40/sgml/entities.html.
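The numeric forms are easy to reduce to a single value, which is what a filter has to do before it compares anything. The following sketch parses one numeric character reference; parseNumericReference is a hypothetical name, and named entities such as &lt; are deliberately out of scope because they need a lookup table.

```cpp
#include <string>

// Parses a single numeric character reference such as "&#60;" or "&#x3C;".
// Returns the character value, or -1 if the text is not a well-formed
// numeric reference.
long parseNumericReference(const std::string& s) {
    if (s.size() < 4 || s[0] != '&' || s[1] != '#' || s[s.size() - 1] != ';')
        return -1;

    const bool hex = (s[2] == 'x' || s[2] == 'X');
    const int base = hex ? 16 : 10;
    const std::size_t first = hex ? 3 : 2;
    if (first >= s.size() - 1) return -1;  // no digits before the ';'

    long value = 0;
    for (std::size_t i = first; i + 1 < s.size(); ++i) {
        const char c = s[i];
        int d;
        if (c >= '0' && c <= '9') d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return -1;
        if (d >= base) return -1;           // for example, 'A' in a decimal reference
        value = value * base + d;
        if (value > 0x10FFFF) return -1;    // beyond the Unicode range
    }
    return value;
}
```

Both &#x3C; and &#60; come back as 0x3C, the < character, so a check written against the parsed value sees the same thing regardless of which form the attacker chose.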

As you can see, many ways exist to encode data on the Web, which makes basing decisions on the name of a resource a dangerous programming practice. Let's now focus on remedies for these issues.

Web-Based Canonicalization Remedies

As with all potential canonicalization vulnerabilities, the first defense is simply not to make decisions based on the name of a resource if it's possible to represent the resource name in more than one way.

Restrict What Is Valid Input

The next best remedy is to restrict what is a valid user request. You created the resources being protected, so you can define the valid ways to access that data and reject all other requests. This is achieved using regular expressions, which are discussed in Chapter 8. Learning to define and use good regular expressions is critical to the security of your application. I'll say it just one more time: always determine what is valid input and reject all other input. It's safer to have a client complain that something doesn't work because of an overzealous regular expression than to have the service not work because it's been hacked!
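As a small illustration of the idea, the sketch below accepts only request paths that match one narrow pattern and rejects everything else. The pattern itself is an assumption made up for this example (lowercase names, at most three directory levels, three permitted extensions); your own application would define its own rules. isValidRequestPath is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Hypothetical allow-list: each path segment is 1 to 32 characters drawn
// from lowercase letters, digits, '_' and '-'; at most three directory
// levels; and the file must end in .html, .jpg, or .gif.
bool isValidRequestPath(const std::string& path) {
    static const std::regex kAllowed(
        "^(/[a-z0-9_-]{1,32}){0,3}/[a-z0-9_-]{1,32}\\.(html|jpg|gif)$");
    return std::regex_match(path, kAllowed);
}
```

Because the % character and the dot are not allowed inside a segment, hex escapes, overlong UTF-8 escapes, and parent paths all fail the match without the code having to recognize any of them individually.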

Be Careful When Dealing with UTF-8

If you must manipulate UTF-8 characters, you need to reduce the data to its canonical form by using the MultiByteToWideChar function in Windows. The following sample code shows how you can call this function with various valid and invalid UTF-8 characters. You can find the complete code listing on the companion CD in the folder Secureco\Chapter 12\UTF8. Also note that if you want to create UTF-8 characters, you can use WideCharToMultiByte by setting the code page to CP_UTF8.

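#include <windows.h>
#include <stdio.h>

// MAX_CHAR is not shown in this excerpt; 512 is an assumed value.
#define MAX_CHAR 512

void FromUTF8(LPBYTE pUTF8, DWORD cbUTF8) {
    WCHAR wszResult[MAX_CHAR+1];
    DWORD dwResult = MAX_CHAR;

    // MB_ERR_INVALID_CHARS makes the call fail on malformed UTF-8.
    // With a flag value of 0, current versions of Windows substitute
    // U+FFFD for the bad sequence instead of failing.
    int iRes = MultiByteToWideChar(CP_UTF8,
                                   MB_ERR_INVALID_CHARS,
                                   (LPCSTR)pUTF8,
                                   cbUTF8,
                                   wszResult,
                                   dwResult);

    if (iRes == 0) {
        DWORD dwErr = GetLastError();
        printf("MultiByteToWideChar() failed -> %lu\n", dwErr);
    } else {
        // The input length excludes a terminator, so add one before printing.
        wszResult[iRes] = L'\0';
        printf("MultiByteToWideChar() returned "
               "%ls (%d) wide characters\n",
               wszResult,
               iRes);
    }
}

int main() {
    // Get Unicode for 0x5C; should be '\'.
    BYTE pUTF8_1[] = {0x5C};
    DWORD cbUTF8_1 = sizeof pUTF8_1;
    FromUTF8(pUTF8_1, cbUTF8_1);

    // Get Unicode for 0xC0 0xAF.
    // Should fail because this is
    // an overlong '/'.
    BYTE pUTF8_2[] = {0xC0, 0xAF};
    DWORD cbUTF8_2 = sizeof pUTF8_2;
    FromUTF8(pUTF8_2, cbUTF8_2);

    // Get Unicode for 0xC2 0xA9; should be
    // a '©' symbol.
    BYTE pUTF8_3[] = {0xC2, 0xA9};
    DWORD cbUTF8_3 = sizeof pUTF8_3;
    FromUTF8(pUTF8_3, cbUTF8_3);

    return 0;
}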

Design Parent Paths Out of Your Application

Another canonicalization issue relates to the handling of parent paths (..), which can lead to directory traversal issues if not done correctly. You should design your Web-based system in such a way that parent paths are not required when data within the application is being accessed. It's common to see a Web application with a directory structure that requires the use of parent paths, thereby encouraging attackers to attempt to access data outside of the Web root by using URLs like http://servername/../../boot.ini to access the boot configuration file, boot.ini. Take a look at the example directory structure in Figure 12-2.

Figure 12-2

A common Web application directory structure.

As you can see, a common source of images is used throughout the application. To access an image file from a directory that is below the images directory or that is a peer of the images directory (for example, advertising and private), your application will need to move out of the current directory into the images directory, therefore requiring that your application use parent paths. For example, to load an image, a file named /private/default.aspx would need to use <IMG> tags that look like this:

<IMG SRC=../images/Logo.jpg>

However, in Windows 2000 and later, the need for parent paths can be reduced. You can create a junction point to the images directory or a hard link to an individual file in the images directory from within the present directory. Figure 12-3 shows what the newer directory structure looks like. It's more secure because there's no need to access any file or directory by using parent paths; your application can treat any file request that contains multiple dots as invalid.

Figure 12-3

A common Web application directory structure using links to a parent or peer directory.

With this directory format in place, the application can access the image without using parent paths, like so:

<IMG SRC=images/Logo.jpg>

You can create junction points by using the Linkd.exe tool included in the Windows 2000 Resource Kit, and you can link to an individual file by using the CreateHardLink function. The following is a simple example of using the CreateHardLink function to create hard links to files. You can also find this example code on the companion CD in the folder Secureco\Chapter 12\HardLink.

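/* HardLink.cpp */
// Explicit headers replace the project's stdafx.h so that the
// listing builds on its own.
#define _WIN32_WINNT 0x0500   // CreateHardLink requires Windows 2000 or later
#include <windows.h>
#include <stdio.h>

DWORD DoHardLink(LPCSTR szName, LPCSTR szTarget) {
    DWORD dwErr = 0;
    // CreateHardLinkA is the ANSI form of CreateHardLink,
    // matching the char* arguments.
    if (!CreateHardLinkA(szName, szTarget, NULL))
        dwErr = GetLastError();
    return dwErr;
}

int main(int argc, char* argv[]) {
    if (argc != 3) {
        printf("Usage: HardLink <linkname> <target>\n");
        return 1;   // don't fall through and read missing arguments
    }

    DWORD dwErr = DoHardLink(argv[1], argv[2]);
    if (dwErr)
        printf("Error calling CreateHardLink() -> %lu\n", dwErr);
    else
        printf("Hard link created to %s\n", argv[2]);

    return dwErr ? 1 : 0;
}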


	Just say no to parent paths. If you remove the requirement for parent paths in your application, anyone attempting to access a resource by using parent paths is, by definition, an attacker!
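Once parent paths are no longer needed, rejecting them takes only a few lines. The following is a minimal sketch, with containsParentPath as a hypothetical name. It assumes it is handed a path that has already been fully decoded and canonicalized; run against raw input, it would miss escaped forms such as %2e%2e.

```cpp
#include <string>

// Returns true if any segment of the path, split on '/' or '\',
// is exactly "..". The path is assumed to be fully decoded already.
bool containsParentPath(const std::string& path) {
    std::size_t start = 0;
    while (start <= path.size()) {
        std::size_t end = path.find_first_of("/\\", start);
        if (end == std::string::npos) end = path.size();
        if (path.compare(start, end - start, "..") == 0) return true;
        start = end + 1;
    }
    return false;
}
```

A request for ../images/Logo.jpg is flagged, while images/Logo.jpg is not. Names that merely contain two dots, such as a..b, are left alone because only whole segments are compared.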