Canonical Web-Based Issues
Unfortunately, many applications make security decisions based on the name of a URL, or a component of a URL. Just as with file-based security decisions, making URL-based security decisions raises several concerns. Let's look at a few.
Bypassing AOL Parental Controls
America Online (AOL) 5.0 added controls so that parents could prevent their children from accessing certain Web sites. When a user typed a URL in the browser, the software checked the Web site name against a list of restricted sites, and if it found the site on the list, access to that site was blocked. Here's the flaw: if the user added a period to the end of the host name, the software allowed the user to access the site. My guess is that the vulnerability existed because the software did not take into consideration the trailing dot when performing a string compare against the list of disallowed Web sites, and the software stripped out invalid characters from the URL after the check had been made.
The bug is now rectified. More information on this vulnerability can be found at http://www.slashdot.org/features/00/07/15/0327239.shtml.
Bypassing eEye's Security Checks
The irony of this example is that the vulnerabilities were found in a security product, SecureIIS, designed to protect Microsoft Internet Information Services (IIS) from attack. Marketing material from eEye (http://www.eeye.com) describes SecureIIS like so:
| SecureIIS protects Microsoft Internet Information Services Web servers from known and unknown attacks. SecureIIS wraps around IIS and works within it, verifying and analyzing incoming and outgoing Web server data for any possible security breaches. | 
Two canonicalization bugs were found in the product. The first related to how SecureIIS handled specific keywords. For example, say you decided that a user (or attacker) should not have access to a specific area of the Web site if he entered a URL query string containing action=delete. An attacker could escape any character in the query string to bypass the SecureIIS settings. Rather than entering action=delete, the attacker could enter action=%64elete and obtain the desired access. %64 is the hexadecimal representation of the letter d.
The other bug related to how SecureIIS checked for characters that were used to traverse out of a Web directory to other directories. For example, as a Web site developer or administrator, you wouldn't want users accessing a URL like http://www.northwindtraders.com/scripts/process.asp?file=../../../winnt/repair/sam, which returns the backup SAM database to the user. The traversal characters are the two dots (..) and the slash (/), which SecureIIS looks for. However, an attacker can bypass the check by typing http://www.northwindtraders.com/scripts/process.asp?file=%2e%2e/%2e%2e/%2e%2e/winnt/repair/sam. As you've probably worked out, %2e is the escaped representation of the dot in hexadecimal!
You can read more about this vulnerability at http://www.securityfocus.com/bid/2742.
Zones and the Internet Explorer 4 Dotless-IP Address Bug
Security zones, introduced in Internet Explorer 4 (exported by UrlMon.dll), are an easy way to administer security because they allow you to gather security settings into easy-to-manage groups. These settings are enforced as the user browses Web sites. Each Web page is handled according to specific security restrictions depending on the page's host Web site, thereby tying security restrictions to Web page origin.
Internet Explorer 4 uses a simple heuristic to determine whether a Web site is located in the more trusted Local Intranet Zone or in the less trusted Internet Zone. If a Web site name contains one or more dots, such as http://www.microsoft.com, the site must be in the Internet Zone unless the user has explicitly placed the Web site in some other zone. If the site has no dots in its name, such as http://northwindtraders, it must be in the Local Intranet Zone because only a NetBIOS name, which has no dots, can be accessed from within the local intranet. Makes sense, right? Not quite!
This mechanism has a wrinkle: if the user enters the IP address of a remote computer, Internet Explorer will apply the security settings of the more restrictive Internet Zone, even if the site is on the local intranet. This is good because the browser will use more stringent security checks. However, an IP address can be represented as a dotless-IP address, which can be calculated by taking a dotted-IP address (that is, an address in the form a.b.c.d) and applying the following formula:
Dotless-IP = (a × 16777216) + (b × 65536) + (c × 256) + d
For example, 192.168.197.100 is the same as 3232286052. If you enter http://192.168.197.100 in Internet Explorer 4, the browser will invoke security policies for the Internet Zone, which is correct. And if you enter http://3232286052 in the unpatched Internet Explorer 4, the browser will notice no dots in the name, place the site in the Local Intranet Zone, and apply the less restrictive security policy. This might lead to a malicious Internet-based Web site executing code in the less secure environment.
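The dotless-IP arithmetic is easy to check in code. The following is a minimal sketch, not code from Internet Explorer; dotted_to_dotless is a hypothetical helper name chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: convert a dotted-IP address such as
   "192.168.197.100" to its dotless form.
   Returns 0 on success, -1 if the text is not four octets in 0-255. */
int dotted_to_dotless(const char *dotted, uint32_t *dotless)
{
    unsigned int a, b, c, d;
    char extra;

    assert(dotted != NULL && dotless != NULL);

    /* The trailing %c matches only if junk follows the fourth octet. */
    if (sscanf(dotted, "%u.%u.%u.%u%c", &a, &b, &c, &d, &extra) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;

    /* Dotless-IP = (a x 16777216) + (b x 65536) + (c x 256) + d */
    *dotless = (a * 16777216u) + (b * 65536u) + (c * 256u) + d;
    return 0;
}
```

Passing "192.168.197.100" yields 3232286052, the value used in the example above. The lesson is that a name containing no dots says nothing about where the host actually lives.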
More information is available at http://www.microsoft.com/technet/security/bulletin/MS98-016.asp.
Internet Information Server 4.0 ::$DATA Vulnerability
I remember the IIS ::$DATA vulnerability well because I was on the IIS team at the time the bug was found. Allow me to go over a little background material. The NTFS file system built into Microsoft Windows NT and later is designed to be a superset of many other file systems, including the Apple Macintosh HFS file system, which supports two sets of data, or forks, in a disk-based file. These forks are called the data fork and the resource fork. (You can read more about this at http://support.microsoft.com/default.aspx?scid=kb;en-us;Q147438.) To help support these files, NTFS provides multiple-named data streams. For example, you could create a new stream named test in a file named Bar.txt (that is, bar.txt:test) by using the following code:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* The stream name follows the colon: file bar.txt, stream test. */
        const char *szFilename = "c:\\temp\\bar.txt:test";
        HANDLE h = CreateFileA(szFilename, GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            printf("Error CreateFile() %lu", GetLastError());
            return 1;
        }

        const char *bBuff = "Hello, stream world!";
        DWORD dwWritten = 0;
        if (WriteFile(h, bBuff, lstrlenA(bBuff), &dwWritten, NULL)) {
            printf("Cool!");
        } else {
            printf("Error WriteFile() %lu", GetLastError());
        }

        CloseHandle(h);
        return 0;
    }

This example code is available in the companion content in the folder Secureco2\Chapter11\NTFSStream. You can view the contents of the file from the command line by using the following syntax:
more < bar.txt:test
You can also use the echo command to insert a stream into a file and then view the contents of the file:
echo Hello, Stream World! > bar.txt:test
more < bar.txt:test
Doing so displays the contents of the stream on the console. The normal data in a file is held in a stream that has no name, and it has an internal NTFS data type of $DATA. With this in mind, you can also access the default data stream in an NTFS file by using the following command-line syntax:
more < boot.ini::$DATA
Figure 11-1 outlines what this file syntax means.
 
 
Figure 11-1. The NTFS file system stream syntax.
An NTFS stream name follows the same naming rules as an NTFS filename, including all alphanumeric characters and a limited set of punctuation characters. For example, two files, john3 and readme, with streams named 16 and now, respectively, would become john3:16 and readme:now. Any combination of valid filename characters is allowed.
Back to the vulnerability. When IIS receives a request from a user, the server looks at the file extension and determines what it should do with the request. For example, if the file ends in .asp, the request must be for an Active Server Pages (ASP) file, so the server routes the request to Asp.dll for processing. If IIS does not recognize the extension, the request is sent directly to Windows for processing so that the contents of the file can be shown to the user. This functionality is handled by the static-file handler. Think of this as a big default switch in a switch statement. So if the user requests Data.txt and no special extension handler, called a script map, associated with the .txt file extension is found, the source code of the text file is sent to the user.
The vulnerability lies in the attacker requesting a file such as Default.asp::$DATA. When IIS evaluates the extension, it does not recognize .asp::$DATA as a file extension and passes the file to the operating system for processing. NTFS determines that the user requested the default data stream in the file and returns the contents of Default.asp, not the processed result, to the attacker.
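To make the mismatch concrete, here is a small sketch of an extension lookup that is fooled by the stream syntax, alongside one way to strip the stream suffix first. This is an illustration only, not the IIS code or the actual fix; naive_extension and strip_stream_suffix are hypothetical names, and the sketch assumes the input is a bare filename with no drive letter or path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive lookup: treat everything from the last dot onward as the
   extension, the way a simple script-map check might. */
const char *naive_extension(const char *name)
{
    const char *dot;

    assert(name != NULL);
    dot = strrchr(name, '.');
    return dot ? dot : "";
}

/* Copy the filename up to the first colon, dropping any stream name
   and stream type. Assumes a bare filename with no drive letter.
   Returns 0 on success, -1 if the output buffer is too small. */
int strip_stream_suffix(const char *name, char *out, size_t outlen)
{
    size_t n;

    assert(name != NULL && out != NULL);
    n = strcspn(name, ":");
    if (n >= outlen)
        return -1;
    memcpy(out, name, n);
    out[n] = '\0';
    return 0;
}
```

For Default.asp::$DATA, the naive lookup reports the extension as .asp::$DATA, which matches no script map. After the stream suffix is stripped, the same lookup reports .asp.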
You can find out more about this bug at http://www.microsoft.com/technet/security/bulletin/MS98-003.asp.
When is a Line Really Two Lines?
A more recent class of vulnerability involves processing lines that include carriage return or carriage return/line feed characters. Imagine your application logs client requests, and as an example, a client requests file.txt. Your server application logs the IP address of the client, his name, the date and time, and the requested resource in the following format:
172.23.11.19 Mike 2002-09-03 13:02:43 file.txt
Imagine that an attacker decides to access a file named file.txt\r\n127.0.0.1\tCheryl\t2002-09-03\t13:03:00\tsecretfile.txt, which results in this log entry:
172.23.11.19 Mike 2002-09-03 13:02:43 file.txt
127.0.0.1 Cheryl 2002-09-03 13:03:00 secretfile.txt
Does this mean that Cheryl accessed a sensitive file by logging on the server (127.0.0.1)? No, it does not. The attacker forced a new entry in the log file by using a carriage return and line feed character in the requested resource! You can read more about this vulnerability at http://online.securityfocus.com/archive/82/271498/2002-05-09/2002-05-15/2.
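One way to blunt this attack is to make control characters visible before writing a field to the log. The following sketch assumes a tab-delimited, one-entry-per-line log like the one shown above; escape_log_field is a hypothetical helper, not part of any logging library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy src into dst, replacing CR, LF, tab, and
   backslash with visible two-character escapes (\r, \n, \t, \\) so
   that one field can never start a new log line or a new column.
   Returns 0 on success, -1 if dst is too small. */
int escape_log_field(const char *src, char *dst, size_t dstlen)
{
    size_t o = 0;

    assert(src != NULL && dst != NULL);
    if (dstlen == 0)
        return -1;

    for (; *src != '\0'; ++src) {
        char esc = 0;
        switch (*src) {
        case '\r': esc = 'r';  break;
        case '\n': esc = 'n';  break;
        case '\t': esc = 't';  break;
        case '\\': esc = '\\'; break;
        default:               break;
        }
        if (esc) {
            if (o + 2 >= dstlen)      /* two characters plus the NUL */
                return -1;
            dst[o++] = '\\';
            dst[o++] = esc;
        } else {
            if (o + 1 >= dstlen)      /* one character plus the NUL */
                return -1;
            dst[o++] = *src;
        }
    }
    dst[o] = '\0';
    return 0;
}
```

With this in place, the attacker's resource name is logged on a single line with a literal \r\n in it, and the forged Cheryl entry never appears as a separate record.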
Yet Another Web Issue: Escaping
What makes Web-based canonicalization issues so prevalent and hard to defend against is the number of ways you can represent any character. For example, any character can be represented in a URL or a Web page by using one or more of the following mechanisms:
The normal 7-bit or 8-bit character representation, also called US-ASCII
Hexadecimal escape codes
UTF-8 variable-width encoding
UCS-2 Unicode encoding
Double encoding
HTML escape codes (Web pages, not URLs)
7-Bit and 8-Bit ASCII
I trust you understand the 7-bit and 8-bit ASCII representations, which have been used in computer systems for many years, so I won't cover them here.
Hexadecimal Escape Codes
Hex escapes are a way to represent a possibly nonprintable character by using its hexadecimal equivalent. For example, the space character is %20, and the pounds sterling character (£) is %A3. You can use this mapping in a URL such as http://www.northwindtraders.com/my%20document.doc, which will open my document.doc on the Northwind Traders Web site; http://www.northwindtraders.com/my%20document%2Edoc will do likewise.
I have already mentioned a canonicalization bug in eEye's SecureIIS tool. The tool looked for certain words in the client request and rejected the request if any of the words were found. However, an attacker could hex escape any of the characters in the request and the tool would fail to reject the requests, essentially bypassing the security mechanisms.
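The point is easier to see with a decoder in hand. The following is a minimal sketch of a single-pass hex unescape; percent_decode and hex_value are hypothetical names, and the sketch makes no attempt to handle the other encodings covered in this chapter. A keyword check is meaningful only if it runs on the decoded string.

```c
#include <assert.h>
#include <string.h>

/* Return the value of one hexadecimal digit, or -1 if ch is not one. */
static int hex_value(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decode %XX escapes in place, in a single pass.
   Returns 0 on success, or -1 on a malformed escape or an escaped NUL,
   in which case the buffer contents should be discarded. */
int percent_decode(char *s)
{
    char *out = s;

    assert(s != NULL);
    while (*s != '\0') {
        if (*s == '%') {
            int hi = hex_value(s[1]);
            int lo = (hi < 0) ? -1 : hex_value(s[2]);
            if (hi < 0 || lo < 0 || (hi == 0 && lo == 0))
                return -1;
            *out++ = (char)(hi * 16 + lo);
            s += 3;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
    return 0;
}
```

Decoding action=%64elete yields action=delete, and decoding %2e%2e/%2e%2e/ yields ../../, so a check that runs after this step sees what the server will eventually see. A check that runs before it sees only harmless-looking text.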
UTF-8 Variable-Width Encoding
Eight-bit Unicode Transformation Format, UTF-8, as defined in RFC 2279 (http://www.ietf.org/rfc/rfc2279.txt), is a way to encode characters by using one or more bytes. The variable-byte sizes allow UTF-8 to encode many different byte-size character sets, such as 2-byte Unicode (UCS-2), 4-byte Unicode (UCS-4), and ASCII, to name but a few. However, the fact that one character can potentially map to multiple-byte representations is problematic.
How UTF-8 Encodes Data
UTF-8 can encode n-byte characters into different byte sequences, depending on the value of the original characters. For example, a character in the 7-bit ASCII range 0x00-0x7F encodes to 07654321, where 0 is the leading bit, set to 0, and 7654321 represents the 7 bits that make up the 7-bit ASCII character. For instance, the letter H, which is 0x48 in hex or 1001000 in binary, becomes the UTF-8 character 01001000, or 0x48. As you can see, 7-bit ASCII characters are unchanged by UTF-8.
Things become a little more complex as you start mapping characters beyond the 7-bit ASCII range, all the way up to the top of the Unicode range, 0x7FFFFFFF. For example, any character in the range 0x80-0x7FF encodes to 110xxxxx 10xxxxxx, where 110 and 10 are predefined bits and each x represents one bit from the character. For example, pounds sterling is 0xA3, which is 10100011 in binary. The UTF-8 representation is 11000010 10100011, or 0xC2 0xA3. However, it doesn't stop there. UTF-8 can encode larger byte-size characters. Table 11-1 outlines the mappings.
| Character Range | Encoded Bytes | 
| 0x00000000-0x0000007F | 0xxxxxxx | 
| 0x00000080-0x000007FF | 110xxxxx 10xxxxxx | 
| 0x00000800-0x0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 
| 0x00010000-0x001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 
| 0x00200000-0x03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 
| 0x04000000-0x7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 
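The first four rows of the table translate directly into code. The sketch below is an illustration under stated assumptions rather than a complete encoder: utf8_encode is a hypothetical name, the 5-byte and 6-byte rows are omitted for brevity, and the function always picks the shortest row that fits the character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode cp using the shortest matching row of
   the table. Handles only the first four rows (up to 0x1FFFFF).
   Returns the number of bytes written, or 0 if cp is out of range. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp <= 0x7F) {                   /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                  /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x1FFFFF) {               /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding 0x48 produces the single byte 0x48, and encoding 0xA3 produces 0xC2 0xA3, matching the worked examples in the text.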
And this is where the fun starts; it is possible to represent a character by using any of these mappings, even though the UTF-8 specification warns against doing so. All UTF-8 characters should be represented in the shortest possible format. For example, the only valid UTF-8 representation of the ? character is 0x3F, or 00111111 in binary. On the other hand, an attacker might try using illegal nonshortest formats, such as these:
0xC0 0xBF
0xE0 0x80 0xBF
0xF0 0x80 0x80 0xBF
0xF8 0x80 0x80 0x80 0xBF
0xFC 0x80 0x80 0x80 0x80 0xBF
A bad UTF-8 parser might determine that all of these formats are the same, when, in fact, only 0x3F is valid.
Perhaps the most famous UTF-8 attack was against unpatched Microsoft Internet Information Server (IIS) 4 and IIS 5 servers. If an attacker made a request that looked like this: http://servername/scripts/..%c0%af../winnt/system32/cmd.exe, the server didn't correctly handle %c0%af in the URL. What do you think %c0%af means? It's 11000000 10101111 in binary; and if it's broken up using the UTF-8 mapping rules in Table 11-1, we get this: 110 00000 and 10 101111, where 110 and 10 are the predefined bits. Therefore, the character is 00000101111, or 0x2F, the slash (/) character! The %c0%af is an invalid UTF-8 representation of the / character. Such an invalid UTF-8 escape is often referred to as an overlong sequence.
So when the attacker requested the tainted URL, he accessed http://servername/scripts/../../winnt/system32/cmd.exe. In other words, he walked out of the script's virtual directory, which is marked to allow program execution, up to the root and down into the system32 directory, where he could pass commands to the command shell, Cmd.exe.
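The difference between a careless parser and a careful one comes down to a single comparison. The following sketch covers only 2-byte sequences and is not the IIS code; lenient_decode2 and strict_decode2 are hypothetical names used for illustration.

```c
#include <assert.h>

/* What a careless parser does: pull the payload bits out of a 2-byte
   sequence (110xxxxx 10xxxxxx) without checking anything. */
int lenient_decode2(unsigned char b0, unsigned char b1)
{
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
}

/* Same extraction, but returns -1 if the predefined bits are wrong or
   if the result would have fit in one byte (an overlong sequence). */
int strict_decode2(unsigned char b0, unsigned char b1)
{
    int cp;

    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return -1;
    cp = lenient_decode2(b0, b1);
    return (cp < 0x80) ? -1 : cp;
}
```

The lenient version turns 0xC0 0xAF into 0x2F, the slash, which is exactly the mistake described above. The strict version rejects that pair but still accepts 0xC2 0xA3 as 0xA3.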
More Info
You can read more about the "File Permission Canonicalization" vulnerability at http://www.microsoft.com/technet/security/bulletin/MS00-057.asp.
UCS-2 Unicode Encoding
UCS-2 issues are a variation of hex encoding and, to some extent, UTF-8 encoding. Two-byte Universal Character Set, UCS-2, can be hex-encoded in a similar manner as ASCII characters but with the %uNNNN format, where NNNN is the hexadecimal value of the Unicode character. For example, %5C is the ASCII and UTF-8 hex escape for the backslash (\) character, and %u005C is the same character in 2-byte Unicode.
To really confuse things, %u005C can also be represented by a wide Unicode equivalent called a fullwidth version. The fullwidth encodings are provided by Unicode to support conversions between some legacy Asian double-byte encoding systems. The characters in the range %uFF00 to %uFFEF are reserved as the fullwidth equivalents of %20 to %7E. For example, the \ character is %u005C and %uFF3C.
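If you compare characters after decoding, fullwidth forms need to be folded onto their ASCII equivalents first. The sketch below assumes the fullwidth ASCII variants occupy U+FF01 through U+FF5E, each offset by 0xFEE0 from its ASCII counterpart; fold_fullwidth is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a fullwidth form (U+FF01 to U+FF5E) onto
   its ASCII equivalent (U+0021 to U+007E). Other values are returned
   unchanged. */
uint32_t fold_fullwidth(uint32_t cp)
{
    if (cp >= 0xFF01 && cp <= 0xFF5E)
        return cp - 0xFEE0;
    return cp;
}
```

With this mapping, 0xFF3C folds to 0x5C, the backslash, so %uFF3C and %u005C compare as equal after decoding.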
Double Encoding
Just when you thought you understood the various encoding schemes (and we've looked at only the most common), along comes double encoding, which involves re-encoding the encoded data. For example, the UTF-8 escape for the backslash character is %5c, which is made up of three characters (%, 5, and c), all of which can be re-encoded using their UTF-8 escapes, %25, %35, and %63. Table 11-2 outlines some double-encoding variations of the \ character.
| Escape | Comments | 
| %5c | Normal UTF-8 escape of the backslash character | 
| %255c | %25, the escape for % followed by 5c | 
| %%35%63 | The % character followed by %35, the escape for 5, and %63, the escape for c | 
| %25%35%63 | The individual escapes for %, 5, and c | 
The vulnerability lies in the mistaken belief that a simple unescape operation will yield clean, raw data. The application then makes a security decision based on the data, but the data might not be fully unescaped.
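You can see the problem by counting how many unescape passes a string needs before it stops changing. The following is a sketch only; unescape_once and unescape_fully are hypothetical names, malformed escapes and %00 are simply left in place, and the pass limit is an arbitrary safety cap.

```c
#include <assert.h>
#include <string.h>

/* Return the value of one hexadecimal digit, or -1 if ch is not one. */
static int hex_value(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/* One unescape pass, in place. Malformed escapes and %00 are copied
   through untouched. Returns the number of escapes decoded. */
static int unescape_once(char *s)
{
    char *out = s;
    int count = 0;

    while (*s != '\0') {
        int hi = (*s == '%') ? hex_value(s[1]) : -1;
        int lo = (hi >= 0) ? hex_value(s[2]) : -1;
        if (lo >= 0 && (hi | lo) != 0) {
            *out++ = (char)(hi * 16 + lo);
            s += 3;
            ++count;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
    return count;
}

/* Repeat until a pass decodes nothing or max_passes is reached.
   Returns the number of passes that decoded at least one escape. */
int unescape_fully(char *s, int max_passes)
{
    int passes = 0;

    assert(s != NULL);
    while (passes < max_passes && unescape_once(s) > 0)
        ++passes;
    return passes;
}
```

Every double-encoded row in Table 11-2 needs two passes to become a backslash, whereas %5c needs only one. A result greater than 1 is therefore a strong hint that someone is trying to slip data past a single unescape, and a result equal to max_passes means the data might still not be fully unescaped.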
HTML Escape Codes
HTML pages can also escape characters by using special characters. For example, angle brackets (< and >) can be represented as &lt; and &gt; and the pound sterling symbol can be represented as &pound;. But wait, there's more! These escape sequences can also be represented using the decimal or hexadecimal character values, not just easy-to-remember mnemonics. For example, &lt; is the same as &#x3C; (hexadecimal value of the < character) and is also the same as &#60; (decimal value of the < character). A complete list of these entities is available at http://www.w3.org/TR/REC-html40/sgml/entities.html.
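The numeric forms are simple to parse, which is part of why filters that look only for the mnemonic forms are easy to evade. Here is a minimal sketch that handles only numeric references; parse_numeric_entity is a hypothetical name, and named entities such as &lt; are deliberately out of scope.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: parse a numeric character reference such as
   "&#60;" or "&#x3C;" at the start of s. Returns the character value,
   or -1 if s does not start with a well-formed numeric reference. */
long parse_numeric_entity(const char *s)
{
    long value = 0;
    int base = 10;
    int digits = 0;

    assert(s != NULL);
    if (s[0] != '&' || s[1] != '#')
        return -1;
    s += 2;
    if (*s == 'x' || *s == 'X') {
        base = 16;
        ++s;
    }
    for (; *s != ';'; ++s, ++digits) {
        int d;
        if (*s >= '0' && *s <= '9')
            d = *s - '0';
        else if (base == 16 && *s >= 'a' && *s <= 'f')
            d = *s - 'a' + 10;
        else if (base == 16 && *s >= 'A' && *s <= 'F')
            d = *s - 'A' + 10;
        else
            return -1;              /* also catches a missing ';' */
        value = value * base + d;
        if (value > 0x10FFFF)       /* larger than any Unicode value */
            return -1;
    }
    return (digits > 0) ? value : -1;
}
```

Both &#60; and &#x3C; parse to 60, the value of the < character, so a filter has to treat all three spellings as the same thing.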
As you can see, there are many ways to encode data on the Web, which means that making decisions based on the name of a resource is a dangerous programming practice. Let's now focus on remedies for these issues.
