Web-Based Canonicalization Issues | Hunting Security Bugs

Some of the topics mentioned in the preceding section on file-based canonicalization issues, such as directory traversal and dealing with file extensions, also apply to Web-based applications. However, Web applications are more complex because of encoding and issues in handling URLs. In either case, great care must be taken to ensure that security decisions based on a name are tested thoroughly.

Encoding Issues

When you read about the canonicalization issues that files have, you saw how variations like c:\file.txt, \\?\c:\test.txt, and c:\windows\..\.\file.TxT could all be used to refer to the same file. With Web-based applications, encoding issues add to the problem of making security decisions based on a name. For instance, consider the following values:

%41
%u0041
%C1%81
%uFF21
%EF%BC%A1
&#65;

All of the preceding values are equivalent to the ASCII character A, and this isn't even a complete list. These variations illustrate some of the types of encodings that are covered in this section, including URL escaping, HTML encoding, overlong UTF-8, and more. As a security tester, you should realize that canonicalization typically offers many variations that can fool parsers if the parsers are not comparing the canonical form of the value, resulting in a security bug.

Using Hexadecimal Escape Codes

You are probably most familiar with using ASCII characters, such as A, B, #, and !. Each ASCII character has a decimal value, and those values can be converted to hexadecimal. Table 12-2 shows several ASCII characters with their decimal and hexadecimal values.

Table 12-2: ASCII Characters and Their Decimal and Hexadecimal Equivalents
ASCII character	Decimal	Hexadecimal
A	65	41
B	66	42
#	35	23
!	33	21
.	46	2E
/	47	2F
\	92	5C

More Info

A complete ASCII character code chart can be found at http://msdn.microsoft.com/library/en-us/vsintro7/html/_pluslang_ASCII_Character_Codes.asp .

Hexadecimal escape codes are just another way to represent a character. In URLs, hexadecimal escape codes are often used to represent nonprintable characters and characters that have a special meaning in the URL. For example, an ampersand (&) in a URL usually is a delimiter between the name/value pairs in a query string, such as http://www.example.com/file.aspx?name1=value1&name2=value2 . What happens if one of the values contains an ampersand? The programmer would not want the application to mistake the ampersand in the value for another name/value pair delimiter, so the hexadecimal escape code for the ampersand (%26) can be used. Thus, the URL would be http://www.example.com/file.aspx?name1=some%26value .
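The ampersand case can be tried out with a few lines of Python. This is only a sketch using the standard urllib.parse module; the parameter name and value are the illustrative ones from the text, not part of any real application.

```python
from urllib.parse import parse_qs, urlencode

# Illustrative name/value pair whose value contains an ampersand.
query = urlencode({"name1": "some&value"})
print(query)  # name1=some%26value

# The parser splits on the literal delimiters first, then decodes each value,
# so the escaped ampersand survives as data.
print(parse_qs(query))  # {'name1': ['some&value']}

# Without escaping, the ampersand is treated as a delimiter and the value is cut short.
print(parse_qs("name1=some&value", keep_blank_values=True))
```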

If an application makes a security decision based on a name before it decodes the escape characters, a security vulnerability is likely. In the ASP.NET path validation vulnerability mentioned at the beginning of the chapter, Microsoft Internet Explorer automatically replaced the backslash (\) with a forward slash (/) if you made a request to http://www.example.com/secure\somefile.aspx . However, if you replaced the backslash with %5C (the hexadecimal escape code for the backslash), the request would succeed and enable you to access somefile.aspx .
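The difference between checking the raw name and checking the decoded name can be sketched as follows. The protected folder name and both helper functions are hypothetical; this is not the actual ASP.NET logic, only an illustration of the ordering problem.

```python
from urllib.parse import unquote

PROTECTED_PREFIX = "/secure/"  # hypothetical protected folder


def is_protected_naive(raw_path: str) -> bool:
    # Flawed: compares the raw, still-escaped path against the prefix.
    return raw_path.lower().startswith(PROTECTED_PREFIX)


def is_protected_canonical(raw_path: str) -> bool:
    # Decode the escapes, normalize the separators, and only then compare.
    path = unquote(raw_path).replace("\\", "/").lower()
    return path.startswith(PROTECTED_PREFIX)


request_path = "/secure%5Csomefile.aspx"
print(is_protected_naive(request_path))      # False: the check is bypassed
print(is_protected_canonical(request_path))  # True
```

A tester can use the same idea in reverse: take any name the application protects and submit each escaped variant to see whether the check still fires.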

Using Overlong UTF-8 Encoding

The ASCII character examples in the preceding section are all 1 byte long, but many languages in the world require more than one byte to represent a character. The 8-bit Unicode Transformation Format (UTF-8) is a common encoding used for Internet URLs. UTF-8 is a variable-byte encoding scheme that allows different character sets, such as 2-byte Unicode (UCS-2; this encoding is discussed shortly), to be encoded. The following are common places where UTF-8 encodings are used:

URLs
Multipurpose Internet Mail Extensions (MIME) encodings
XML documents
Text files

More Info

For more information about the format of UTF-8, refer to RFC 2279 at http://www.faqs.org/rfcs/rfc2279.html .

Because UTF-8 can be used to encode a character with more than one byte, it can also represent a single-byte character by using any of the longer UTF-8 character mappings. Generally, all UTF-8 characters are shown in the shortest form, but it is possible for an attacker to use a nonshortest form of a character encoding, which is known as overlong UTF-8 encoding. An attacker can use the overlong form in the hope of tricking the parser, which should accept only the shortest form. Let's look at an example. The UTF-8 representation of a forward slash (/) is 0x2F. The overlong UTF-8 equivalent of this value is any one of the following:

0xC0 0xAF
0xE0 0x80 0xAF
0xF0 0x80 0x80 0xAF
0xF8 0x80 0x80 0x80 0xAF
0xFC 0x80 0x80 0x80 0x80 0xAF
0xFE 0x80 0x80 0x80 0x80 0x80 0xAF

If the UTF-8 parser does not enforce the shortest form, it might treat all these representations as the same character, leading to a canonicalization issue.
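The 2-byte through 6-byte forms in the list can be reproduced with a short sketch. The function overlong_utf8 is a hypothetical helper written for this illustration (it is not the OverlongUTF tool), and it follows the original UTF-8 bit layout while deliberately ignoring the shortest-form rule. Python's built-in decoder is strict, so it serves here as an example of a parser that rejects these sequences.

```python
def overlong_utf8(code_point: int, length: int) -> bytes:
    """Encode code_point in exactly `length` bytes (2 to 6), ignoring the shortest-form rule."""
    if not 2 <= length <= 6:
        raise ValueError("length must be between 2 and 6")
    # The lead byte starts with `length` one bits followed by a zero bit.
    lead_prefix = (0xFF << (8 - length)) & 0xFF
    lead_payload_bits = 7 - length
    continuation = []
    remaining = code_point
    for _ in range(length - 1):
        # Each continuation byte carries 6 payload bits behind the 10 marker.
        continuation.append(0x80 | (remaining & 0x3F))
        remaining >>= 6
    if remaining >> lead_payload_bits:
        raise ValueError("code point does not fit in the requested length")
    return bytes([lead_prefix | remaining] + continuation[::-1])


def is_strict_utf8(data: bytes) -> bool:
    # Python's decoder accepts only shortest-form sequences.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for length in range(2, 7):
    encoded = overlong_utf8(0x2F, length)
    print(encoded.hex(" ").upper(), is_strict_utf8(encoded))
```

Every line printed ends in False, whereas the single byte 0x2F decodes normally. A parser that prints True for any overlong form is worth a closer look.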

Using an overlong UTF-8 sequence is another way attackers can try to trick the parser into thinking a value is something else when it is actually equivalent in canonical form. To generate an overlong UTF-8 encoding of a character, you can use the tool called OverlongUTF, which is included on this book's companion Web site. Figure 12-2 shows the overlong UTF-8 encodings of the forward slash.

Figure 12-2: Using OverlongUTF to generate the overlong UTF-8 encodings of a character

Using UCS-2 Unicode Encoding

Another encoding that can be used in URLs is called UCS-2 Unicode encoding. It is a lot like hexadecimal and UTF-8, but uses the format %uNNNN, where NNNN is the Unicode character value in hexadecimal. Look at the forward slash (/) character again, which had the hexadecimal value %2F. This value is the same as its UCS-2 encoding of %u002F. Having fun yet?

In Figure 12-2, notice the output also shows 0x2F has equivalent values of U+FF0F and %uFF0F. The latter two representations are the wide Unicode equivalent called the full-width version. OverlongUTF can be used to show whether a full-width version of the character is available. If there is, you can use the full-width value in hopes of fooling the parser.

You can also use the UTF-8 encoding format to represent the UCS-2 Unicode value. For example, %uFF0F in UTF-8 format is %EF%BC%8F. And, of course, even that is subject to overlong UTF-8 sequences.
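The relationship between these forms can be checked with a small sketch. The %uNNNN escape is nonstandard, and Python's urllib.parse.unquote does not decode it, so decode_percent_u below is a hypothetical helper. The NFKC normalization step is an assumption about how a lenient back end might fold the full-width character into the ASCII slash; real servers and frameworks vary.

```python
import re
import unicodedata


def decode_percent_u(value: str) -> str:
    # Replace each %uNNNN escape with the character for that 16-bit value.
    return re.sub(r"%u([0-9A-Fa-f]{4})",
                  lambda match: chr(int(match.group(1), 16)),
                  value)


print(decode_percent_u("%u002F"))  # /

full_width = decode_percent_u("%uFF0F")
print(full_width.encode("utf-8").hex(" ").upper())  # EF BC 8F

# Compatibility normalization maps the full-width solidus to the ASCII slash.
print(unicodedata.normalize("NFKC", full_width))  # /
```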

Selecting Other Character Encodings

You might consider trying other types of encodings, depending on your application. For instance, UTF-7 and UCS-4 can sometimes fool certain parsers. Chapter 10, HTML Scripting Attacks, gives an example of how UTF-7 can be used to encode data to fool parsers. For instance, if a Web site tries to block certain HTML tags by stripping out angle brackets (< and >), it might still be vulnerable to attack if another encoding could be used.

Internet Explorer has a feature that attempts to autoselect the encoding for a Web site. If the Web page contains characters in the first 200 bytes that use a specific encoding, Internet Explorer defaults to using that encoding unless the request explicitly specifies a particular encoding. So if the browser can be forced to use UTF-7, the attacker can use the UTF-7 encoding of the angle brackets (+ADw- and +AD4-) to bypass the filter an application might use. Normally, UTF-7 is used for mail and news transports, but that does not mean an attacker won't use it to attempt to fool your application.
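To see why a filter that strips angle brackets does not help once the data is interpreted as UTF-7, consider this sketch. The filter function and the payload are hypothetical; the decoding uses Python's built-in utf-7 codec to stand in for a browser that has been coaxed into that encoding.

```python
def strip_angle_brackets(data: bytes) -> bytes:
    # Hypothetical filter that removes only the literal ASCII bytes < and >.
    return data.replace(b"<", b"").replace(b">", b"")


payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
filtered = strip_angle_brackets(payload)

print(filtered == payload)       # True: the filter found nothing to remove
print(filtered.decode("utf-7"))  # <script>alert(1)</script>
```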

More Info

For more information about UTF-7 encoding, refer to RFC 1642 ( http://www.faqs.org/rfcs/rfc1642.html ).

Double Encoding Characters

To make matters more interesting, values can even be double-encoded in an attempt to bypass code in which the developer fails to fully decode the data. The process of double encoding takes a character from the string and essentially encodes it twice. This usually causes a problem when the application decodes the data in one place and then later decodes it again. Normally, this is not a problem when using the application because the input is not double-encoded. Look at the following examples:

Encoding the letter A one time results in A = %41.
Encoding each character in the %41 sequence results in % = %25, 4 = %34, and 1 = %31.
If you encode the A once, and then encode the percent sign, you end up with %2541.
If you encode just the 4 instead of the percent sign, you get %%341.
If you encode all the characters in %41, you get %25%34%31.

When you use the preceding technique to double-encode values, the following URLs are equivalent:

http://example/file.asp
http://example/file.%41sp
http://example/file.%2541sp
http://example/file.%%341sp
http://example/file.%25%34%31sp

Note	Values can also be triply encoded, even though it isn't a common problem. For example, if %2541 is a result of double encoding, %252541 is a triple encoding.
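A tester can confirm how many decoding passes a value needs with a helper like the one below. The function name and the round limit are assumptions made for this sketch, and decoding until the value stops changing is a testing aid rather than a recommendation for how an application should treat input.

```python
from urllib.parse import unquote


def decode_until_stable(value: str, max_rounds: int = 5) -> tuple[str, int]:
    """Percent-decode repeatedly; return the final value and the number of passes that changed it."""
    rounds = 0
    while rounds < max_rounds:
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
        rounds += 1
    return value, rounds


urls = [
    "http://example/file.asp",
    "http://example/file.%41sp",
    "http://example/file.%2541sp",
    "http://example/file.%%341sp",
    "http://example/file.%25%34%31sp",
]
for url in urls:
    print(decode_until_stable(url))
```

The encoded variants all settle on http://example/file.Asp, which matches the first URL once case is ignored. The single-encoded form needs one pass, and each double-encoded form needs two.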

Using HTML Escape Codes

Chapter 10 discusses cross-site scripting attacks in great detail, but canonicalization techniques can be used to fool parsers that are attempting to block script. Remember, if your application wants to prevent malicious data, it should accept only the safe values by using an allow list and fail on everything else. Otherwise, cases are likely to be missed.

For example, some Web applications attempt to block malicious script by looking for values such as <script> or javascript: and removing them from the input. If you can fool the parser into allowing an equivalent value to a restricted value, you will have found a bug. Let's look at different ways characters can be represented in HTML.

The decimal value of a forward slash (/) is 47 and the hexadecimal value is 0x2F. If you create HTML files with the following content, they will all be equivalent:

<a href="http://www.contoso.com">Regular Value</a>
<a href="http:&#47;/www.contoso.com">Decimal Value</a>
<a href="http:&#x2F;/www.contoso.com">Hexadecimal Value</a>

You can also omit the semicolon and pad the beginning of the value with zeros, such as http:&#00047/www.contoso.com . By using HTML escape codes for characters, you might be able to fool the parser that is supposed to block the malicious values. On June 3, 2004, GreyMagic published a security advisory against Yahoo!'s Web-based e-mail service. The Yahoo! mail service attempted to remove any malicious script in an e-mail; however, it missed a variation that allowed the following HTML to be embedded in an e-mail message:

 <div style="background-image:url(jav&#000013;ascript:alert())">Hi!</div>

Tip	You can use the tool Web Text Converter, which is included on the book's Web site, to escape a string or convert an escaped string back to a more readable format.
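If the Web Text Converter tool is not at hand, Python's html.unescape gives a rough equivalent for decoding. The sketch below decodes the slash variants discussed above and then shows how a naive substring check misses the Yahoo! style payload. The function contains_script_url is a hypothetical checker, and html.unescape follows HTML5 parsing rules, which might differ from what a particular browser does.

```python
import html

for link in ("http:&#47;/www.contoso.com",
             "http:&#x2F;/www.contoso.com",
             "http:&#00047/www.contoso.com"):
    print(html.unescape(link))  # http://www.contoso.com each time

style = "background-image:url(jav&#000013;ascript:alert())"


def contains_script_url(value: str) -> bool:
    # Decode the character references, then drop whitespace and control characters.
    decoded = html.unescape(value)
    cleaned = "".join(ch for ch in decoded if ch > " ")
    return "javascript:" in cleaned.lower()


print("javascript:" in style)      # False: the raw check misses it
print(contains_script_url(style))  # True
```

Even the improved check is still a block list, so it illustrates the bypass rather than a complete defense.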

HTML Entities

You can also escape certain characters by using a special value known as a named entity. For instance, an ampersand can be escaped as &amp; as well as by using the decimal (&#38;), hexadecimal (&#x26;), and full-width UCS-2 (&#xFF06;) escape codes. Table 12-3 shows a few examples of HTML entities that are commonly used. A complete list is available at http://www.w3.org/TR/REC-html40/sgml/entities.html .

Table 12-3: Common HTML Entities
Entity	Value
amp	&
lt	<
gt	>
quot	"
nbsp	(space)
URL Issues

As mentioned earlier, a URL can use different types of encodings to represent characters. The common encodings that could lead to issues when parsing a URL include hexadecimal, UTF-8, overlong UTF-8, and UCS-2. If your application parses the URL to make security decisions, be sure to try the different encoding techniques to see whether you can fool the parser.

In addition to encoding issues for URLs, other common problems include these:

Improper handling of SSL URLs
Improper handling of domain name parsing
Improper handling of credentials in a URL
Improper handling of a forward slash versus a backslash

Handling SSL URLs

We often hear people claim that their applications are secure because they use SSL. Throughout this book, you will read how SSL does not offer protection against such attacks as cross-site scripting and SQL injection, among others. In addition, applications that do not handle the URL properly when dealing with SSL also have problems. For example, to access a Web site using SSL, the https: protocol is used. If you search your source code for http: , you might find code like the following:

 if (url.StartsWith("http:") == true)
 {
     // Handle URL.
 }
 else
 {
     // Invalid URL format, so return false.
     return false;
 }

If so, your application might not be properly handling URLs that use SSL. Also, imagine that the code was supposed to return an error if the URL started with http: . The check could be bypassed by using https: instead. You'll need to determine the intention of the check because the developer might have forgotten to check both http: and https: .
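One way to avoid the prefix problem is to parse the URL and compare the scheme against an explicit allow list. The following is a sketch in Python rather than the language of the snippet above; the allow list and function name are assumptions for illustration.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumed policy for this sketch


def has_allowed_scheme(url: str) -> bool:
    # urlsplit lowercases the scheme, so HTTPS: and https: compare equal.
    return urlsplit(url).scheme in ALLOWED_SCHEMES


print("https://www.example.com".startswith("http:"))  # False: the prefix check rejects SSL URLs
print(has_allowed_scheme("https://www.example.com"))  # True
print(has_allowed_scheme("HTTPS://www.example.com"))  # True
print(has_allowed_scheme("ftp://www.example.com"))    # False
```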

Handling Domain Name Parsing

If your application makes decisions based on parsing the domain name, you must consider a few things. For example, how might a developer implement a check in an application to allow connections only to intranet sites? One method might be to check to make sure the Web site name does not contain any dots, such as http://contoso . This idea might seem reasonable because most Internet addresses either have one or more dots, for example, http://contoso.com , or they are in the IP address form http://207.46.130.108 ; but this check isn t good enough. Other ways to fool the parser include the following:

Encoding the URL
Using dotless IP addresses
Using Internet Protocol version 6 (IPv6) formats

Important

Some browsers also allow a dot at the end of the domain name, so http://www.microsoft.com and http://www.microsoft.com. (with a trailing dot) would both work. This technique could fool some parsers, especially if your application has a block list for domains.

Remember how values can be encoded to represent the same character? Depending on how the parser works, encoding an Internet address might be able to bypass the check for dots. To accomplish this exploit, %2E can be used in place of the dot because it is the hexadecimal escape code for the dot. So the URL looks like http://contoso%2Ecom , which contains no literal dots, so the check passes.
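The bypass can be reproduced with a sketch of the dot check. Both functions are hypothetical, and the decoded version is shown only to contrast with the naive one; as the rest of this section explains, counting dots is not a reliable intranet test even after decoding.

```python
from urllib.parse import unquote, urlsplit


def looks_intranet_naive(url: str) -> bool:
    # Flawed: inspects the raw host name without decoding it.
    return "." not in (urlsplit(url).hostname or "")


def looks_intranet_decoded(url: str) -> bool:
    # Decode escapes and ignore a single trailing dot before looking for dots.
    host = unquote(urlsplit(url).hostname or "").rstrip(".")
    return "." not in host


print(looks_intranet_naive("http://contoso%2Ecom"))    # True: the check is fooled
print(looks_intranet_decoded("http://contoso%2Ecom"))  # False
print(looks_intranet_decoded("http://contoso"))        # True
```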

Dotless IP Addresses

When you type in a human-readable Web address, the name resolves to an IP address. An Internet Protocol version 4 (IPv4) address is broken up into four segments that each use numbers in the range of 0 to 255. You can usually use this address to access the Web site. The IP address can then be converted into dotless form using different formats.

For example, to convert an IP address in the form of a.b.c.d (where a , b , c , and d are numbers ranging from 0 to 255) into a DWORD (32-bit) value, use the formula:

 DWORD Dot-less IP = (a × 16777216) + (b × 65536) + (c × 256) + d

In this example, running 207.46.130.108 through this formula results in the value 3475931756. Browsing to http://3475931756 is the same as browsing to http://207.46.130.108 .

Another method involves converting the IP address to a hexadecimal address. To accomplish this conversion, change each of the four segments from the decimal to the hexadecimal value. Using the hex format, the IP address 207.46.130.108 becomes 0xCF.0x2E.0x82.0x6C . You can then omit the dots and simply precede the beginning of the address with 0x, which results in another form of the dotless IP address: http://0xCF2E826C .
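Both conversions can be verified with a few lines of Python. The helper dotted_to_dword is a hypothetical name for this sketch; the ipaddress module is used only to convert the integer back to dotted form.

```python
import ipaddress


def dotted_to_dword(address: str) -> int:
    # Weight each segment by its position: 256^3, 256^2, 256^1, 256^0.
    a, b, c, d = (int(part) for part in address.split("."))
    return a * 16777216 + b * 65536 + c * 256 + d


dword = dotted_to_dword("207.46.130.108")
print(dword)                        # 3475931756
print(hex(dword))                   # 0xcf2e826c
print(ipaddress.ip_address(dword))  # 207.46.130.108
```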

The main point is, do not make assumptions about whether the domain is on the Internet or intranet based on whether there are dots in the name. IPv6 introduces additional problems, especially because it uses colons (:) instead of dots.

IPv6 Formats

Although many details about the IPv6 format are beyond the scope of this book, if your application supports this format, there are some interesting canonicalization issues you should consider.

IPv4 supports only 4.3 × 10⁹ (or 4.3 billion) addresses; the world is slowly running out of IPv4 address space. IPv6, which supports 3.4 × 10³⁸ addresses, was introduced to alleviate the problem. IPv6 uses 128-bit addresses in the format xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx (hexadecimal). IPv6 allows the zeros to be compressed or trimmed, so you can have the following:

:0000: can be compressed to :000:
:000: can be compressed to :00:
:00: can be compressed to :0:
:0: can be compressed to ::

The general rule is that one or more consecutive groups of zeros can be reduced to a double colon, as long as there isn't more than one double colon in an address. This means 0000:0000:0000:0000:0000:0000:0000:0001 can be reduced to simply ::1, which is also known as the loopback address.

Also, a sequence of four bytes at the end of the IPv6 address can be written in decimal format for compatibility reasons. So the following are also the same:

0000:0000:0000:0000:0000:0000:0102:0304
::102:304
::1.2.3.4
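A parser that compares IPv6 addresses as strings would treat these three forms as different values. Parsing them first, as in this sketch with Python's standard ipaddress module, shows that they are the same address.

```python
import ipaddress

forms = [
    "0000:0000:0000:0000:0000:0000:0102:0304",
    "::102:304",
    "::1.2.3.4",
]

# Parsing puts every spelling into one canonical 128-bit value.
parsed = {ipaddress.IPv6Address(form) for form in forms}
print(len(parsed))  # 1

address = next(iter(parsed))
print(address.compressed)  # ::102:304
print(address.exploded)    # 0000:0000:0000:0000:0000:0000:0102:0304

loopback = ipaddress.IPv6Address("0000:0000:0000:0000:0000:0000:0000:0001")
print(loopback.compressed, loopback.is_loopback)  # ::1 True
```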

More Info

As the Internet continues to grow, more applications will need to support IPv6, which introduces additional security threats. For more information about IPv6 format, see http://www.faqs.org/rfcs/rfc3513.html .

Handling Credentials in a URL

Some Web browsers allow the user name and password to be supplied as part of the URL by using the following format:

  http://username:password@server/resource.ext

This syntax can be used when users log on to a site that uses basic authentication, but it can lead to all sorts of problems. For example, look at the following URLs:

http://www.contoso.com@www.example.com
http://www.contoso.com%40www.example.com
http://www.contoso.com%40%77%77%77%2E%65%78%61%6D%70%6C%65%2E%63%6F%6D

Where do you think they go? All of them take you to http://www.example.com . This method of representing a URL can help in exploiting spoofing attacks, which are discussed in Chapter 6, Spoofing. However, think about ways this technique might fool a parser. For example, perhaps an application that is preparing to open a resource on a server wants to display the name of the server in case the user wants to cancel the action. The application verifies that the URL begins with http:// or https:// , and then displays all the characters up to the first character that is not alphanumeric, a hyphen (-), a dot (.), or a colon (:), or up to the end of the string. In this simple example, the application would miss the case of the percent sign (%) and at sign (@), and thus would display the incorrect server name to the user.
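The display logic described above can be sketched and compared with what a URL parser reports. Both functions are hypothetical. Decoding the whole URL before splitting it is an assumption that mimics the lenient browser behavior the text describes; it is not a general recipe for URL parsing.

```python
import re
from urllib.parse import unquote, urlsplit


def displayed_server_naive(url: str) -> str:
    # Mirrors the flawed logic: take characters after the scheme until the
    # first one that is not alphanumeric, a hyphen, a dot, or a colon.
    match = re.match(r"https?://([A-Za-z0-9.:-]*)", url)
    return match.group(1) if match else ""


def actual_server(url: str) -> str:
    # Decode escapes first, then let the parser separate credentials from host.
    return urlsplit(unquote(url)).hostname or ""


urls = [
    "http://www.contoso.com@www.example.com",
    "http://www.contoso.com%40www.example.com",
    "http://www.contoso.com%40%77%77%77%2E%65%78%61%6D%70%6C%65%2E%63%6F%6D",
]
for url in urls:
    print(displayed_server_naive(url), "->", actual_server(url))
```

Each line shows www.contoso.com as the name the user would see and www.example.com as the server that is actually contacted under these assumptions.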