Metacharacter Filtering


The potential issues associated with metacharacters often necessitate a more defensive coding strategy. Generally, this strategy involves attempting to detect potential attacks or sanitize input before it's interpreted. There are three basic options:

  • Detect erroneous input and reject what appears to be an attack.

  • Detect and strip dangerous characters.

  • Detect and encode dangerous characters with a metacharacter escape sequence.

Each of these options has its uses, and each opens the potential for new vulnerabilities. The first two options attempt to eliminate metacharacters outright, so they share certain commonalities addressed in the next section. The third option involves a number of unique concerns, so it is addressed separately in "Escaping Metacharacters."

Eliminating Metacharacters

Rejecting illegal requests and stripping dangerous characters are similar strategies; they both involve running user data through some sort of sanitization routine, often using a regular expression. If the disallowed input is rejected, any request containing illegal metacharacters is simply discarded. This approach usually includes some sort of error indicating why the input wasn't allowed, as shown in this example:

if($input_data =~ /[^A-Za-z0-9_ ]/){
    print "Error! Input data contains illegal characters!";
    exit;
}


In this example, the input_data variable is checked for any character that isn't alphanumeric, an underscore, or a space. If any of these characters are found, an error is signaled and processing terminates.

With character stripping, the input is modified to get rid of any violations to the restrictions, and then processing continues as normal. Here's a simple modification of the previous example:

$input_data =~ s/[^A-Za-z0-9_ ]//g;


Each option has its strengths and weaknesses. Rejection of dangerous input lessens the chance of a breach because fewer things can go wrong in handling. However, a high false-positive rate on certain inputs might cause the application to be particularly unfriendly. Stripping data elements is more dangerous because developers could make small errors in implementing filters that fix up the input stream. However, stripping input may be considered more robust because the application can handle a wide variety of input without constantly generating errors.

Both approaches must account for how strong their filter implementation is; if they don't catch all the dangerous input, nothing that happens afterward matters much! There are two main types of filters: explicit deny filters (black lists) and explicit allow filters (white lists). With an explicit deny filter, all data is assumed to be legal except the specific characters deemed dangerous. Listing 8-22 is an example of an explicit deny filter implementation.

Listing 8-22. Character Black-List Filter

int islegal(char *input)
{
    char *bad_characters = "\"\\|;<>&-*";

    for(; *input; input++){
        if(strchr(bad_characters, *input))
            return 0;
    }

    return 1;
}

As you can see, this filter allows any characters except those in the bad_characters set. Conversely, an explicit allow filter checks for characters known to be legal, and anything else is assumed illegal, as shown in Listing 8-23.

Listing 8-23. Character White-List Filter

int islegal(char *input)
{
    for(; *input; input++){
        if(!isalnum(*input) && *input != '_' && !isspace(*input))
            return 0;
    }

    return 1;
}

This example is similar to Listing 8-22, except it's testing for the existence of each character in a set of legal characters, as opposed to checking for illegal characters. White-list filters are much more restrictive by nature, so they are generally considered more secure. When the accept set is large, however, using an explicit deny filter might be more appropriate.
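The practical difference between the two filter styles shows up on characters the developer never considered. A percent sign or backtick, for instance, is absent from the Listing 8-22 reject set, so the black list passes it, while the white list rejects it. The following side-by-side sketch makes this concrete (the function names are mine, not from the listings):

```c
#include <ctype.h>
#include <string.h>

/* Black-list check modeled on Listing 8-22 */
int islegal_deny(const char *input)
{
    const char *bad_characters = "\"\\|;<>&-*";

    for(; *input; input++){
        if(strchr(bad_characters, *input))
            return 0;
    }
    return 1;
}

/* White-list check modeled on Listing 8-23 */
int islegal_allow(const char *input)
{
    for(; *input; input++){
        if(!isalnum((unsigned char)*input) && *input != '_'
                && !isspace((unsigned char)*input))
            return 0;
    }
    return 1;
}
```

The deny filter is only as good as the developer's imagination: `islegal_deny("a%b")` returns 1, while `islegal_allow("a%b")` returns 0.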

When reviewing code containing filters of either kind, you must determine whether the application has failed to account for any dangerous input. To do this, you should take these steps:

1. Make a list of every input the filter allows.

2. Make a list of every input that's dangerous if left in the input stream.

3. Check whether there are any results from the intersection of these two lists.

Step 1 is straightforward and can be done from just reading the code; however, step 2 might require more creativity. The more knowledge you have about the component or program interpreting the data, the more thorough analysis you can perform. It follows, therefore, that a good code auditor should be familiar with whatever data formats they encounter in an assessment. For example, shell programming and SQL are metadata formats commonly used in web applications.
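When both lists can be expressed as simple character sets, step 3 can even be mechanized. The helper below (a hypothetical sketch, not from the text) returns the characters that are both allowed by the filter and dangerous to the downstream interpreter; any nonempty result is a finding:

```c
#include <string.h>

/* Write every character present in both sets into out (NUL-terminated)
   and return how many were found. */
int charset_intersect(const char *allowed, const char *dangerous, char *out)
{
    int n = 0;

    for(; *dangerous; dangerous++){
        if(strchr(allowed, *dangerous))
            out[n++] = *dangerous;
    }
    out[n] = '\0';
    return n;
}
```

For example, intersecting an allowed set of alphanumerics plus space against the shell's command separators should produce an empty result; if a newline or semicolon comes back, the filter is broken.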

Insufficient Filtering

When you already have a thorough knowledge of the formats you deal with, it's tempting to skip making lists of allowed input and instead draw on your existing knowledge to assess the filter's strength. This approach may be adequate, but it also increases your chances of missing subtle vulnerabilities, just as the application developer might. For example, take a look at Listing 8-24, which demonstrates a filtering vulnerability in the PCNFSD server.

Listing 8-24. Metacharacter Vulnerability in PCNFSD

int suspicious(s)
char *s;
{
    if(strpbrk(s, ";|&<>`'#!?*()[]^") != NULL)
        return 1;

    return 0;
}

A filter is constructed to strip out dangerous characters before the data is passed to popen(). The developers have a fairly complete reject set, but they missed a character. Can you see what it is? That's right: it's the newline ('\n') character. If a newline character is inserted in the input stream, the shell treats the data before it as one command and the data after it as a new command, thus allowing attackers to run arbitrary commands. This example is interesting because the newline character is often forgotten when filtering data for shell execution issues. People think about other command separators, such as semicolons, but often neglect to filter out the newline character, demonstrating that even experienced programmers can be familiar with a system yet make oversights that result in vulnerabilities.
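To make the oversight concrete, the sketch below reimplements the Listing 8-24 check and shows that a newline-separated command passes it (the test strings are illustrative):

```c
#include <string.h>

/* Reject set from Listing 8-24; note that '\n' is missing,
   so a newline-separated second command is not flagged. */
int suspicious(const char *s)
{
    return strpbrk(s, ";|&<>`'#!?*()[]^") != NULL;
}
```

A semicolon-separated command is caught, but the same payload behind a newline is not: `suspicious("file; rm x")` returns 1, while `suspicious("file\nrm x")` returns 0.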

Even when you're familiar with a format, you need to keep in mind the different implementations or versions of a program. Unique extensions might introduce the potential for variations of standard attacks, and data might be interpreted more than once with different rules. For example, when sanitizing input for a call to popen(), you need to be aware that any data passed to the program being called is interpreted by the command shell, and then interpreted again differently by the program that's running.

Character Stripping Vulnerabilities

There are additional risks when stripping illegal characters instead of just rejecting the request. The reason is that there are more opportunities for developers to make mistakes. In addition to missing potentially dangerous characters, they might make mistakes in implementing sanitization routines. Sometimes implementations are required to filter out multicharacter sequences; for example, consider a CGI script that opens a file in a server-side data directory. The developers want to allow users to open any file in this directory, and maybe even data in subdirectories below that directory. Therefore, both dot (.) and slash (/) are valid characters. They certainly don't want to allow user-supplied filenames outside the data directory, such as ../../../etc/passwd; so the developers strip out occurrences of the ../ sequence. An implementation for this filter is shown in Listing 8-25.

Listing 8-25. Vulnerability in Filtering a Character Sequence

char *clean_path(char *input)
{
    char *src, *dst;

    for(src = dst = input; *src; ){
        if(src[0] == '.' && src[1] == '.' && src[2] == '/'){
            src += 3;
            memmove(dst, src, strlen(src)+1);
            continue;
        } else
            *dst++ = *src++;
    }

    *dst = '\0';
    return input;
}

Unfortunately, this filtering algorithm has a severe flaw. When a ../ is encountered, it's removed from the stream by copying the rest of the path over the ../ sequence. However, the src pointer is incremented by three bytes, so the three bytes immediately following a ../ sequence are never processed. Therefore, all an attacker needs to do to bypass the filter is place one ../ sequence directly after another, because the second one is missed. For example, input such as ../../test.txt is converted to ../test.txt. Listing 8-26 shows how to fix the incorrect filter.

Listing 8-26. Vulnerability in Filtering a Character Sequence #2

char *clean_path(char *input)
{
    char *src, *dst;

    for(src = dst = input; *src; ){
        if(src[0] == '.' && src[1] == '.' && src[2] == '/'){
            memmove(dst, src+3, strlen(src+3)+1);
            continue;
        } else
            *dst++ = *src++;
    }

    *dst = '\0';
    return input;
}

Now the algorithm removes ../ sequences, but do you see that there's still a problem? What happens if you supply a file argument such as ....//hi? Table 8-2 steps through the algorithm.

Table 8-2. Desk-Check of clean_path with Input ....//hi

Iteration   Input       Output
    1       ....//hi    .
    2       ...//hi     ..
    3       ..//hi      .. (Nothing is written)
    4       /hi         ../
    5       hi          ../h
    6       i           ../hi

This algorithm demonstrates a subtle issue common to many multicharacter filters that strip invalid input. By supplying characters around forbidden patterns that combine to make the forbidden pattern, you have the filter itself construct the malicious input by stripping out the bytes in between.
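A short harness confirms the desk-check in Table 8-2. The function body is the Listing 8-26 version, reproduced here only so the example is self-contained:

```c
#include <string.h>

char *clean_path(char *input)
{
    char *src, *dst;

    for(src = dst = input; *src; ){
        if(src[0] == '.' && src[1] == '.' && src[2] == '/'){
            /* Strip "../" by shifting the remainder left */
            memmove(dst, src+3, strlen(src+3)+1);
            continue;
        } else
            *dst++ = *src++;
    }

    *dst = '\0';
    return input;
}
```

Running it on "....//hi" yields "../hi": the filter itself reassembles the forbidden sequence from the surrounding bytes.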

Auditing Tip

When auditing multicharacter filters, try to determine whether an illegal sequence can be reconstructed by embedding forbidden patterns within each other, as in Listing 8-26.

Also, note that these attacks are possible when developers use a single substitution pattern with regular expressions, such as this example:

$path =~ s/\.\.\///g;


This approach is prevalent in several programming languages (notably Perl and PHP).
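One common remediation is to repeat the substitution until the input stops changing, so that sequences reassembled by one pass are caught by the next. A minimal C sketch of the idea (a hypothetical helper, not from the text):

```c
#include <string.h>

/* Remove "../" sequences repeatedly until none remain, so a sequence
   reassembled by one stripping pass is caught on the next. */
void strip_traversal(char *path)
{
    char *p;

    while((p = strstr(path, "../")) != NULL)
        memmove(p, p + 3, strlen(p + 3) + 1);
}
```

This collapses "....//hi" all the way down to "hi" instead of leaving a reassembled "../hi" behind. Rejecting such input outright is simpler and safer still.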


Escaping Metacharacters

Escaping dangerous characters differs from other filtering methods because it's essentially nondestructive. That is, it doesn't deny or remove metacharacters but handles them in a safer form. Escaping methods differ among data formats, but the most common method is to prepend an escape metacharacter (usually a backslash) to any potentially dangerous metacharacters. This method allows these characters to be safely interpreted as a two-character escape sequence, so the application won't interpret the metacharacter directly.

When reviewing these implementations, you need to be mindful of the escape character. If this character isn't treated carefully, it could be used to undermine the rest of the character filter. For example, the following filter is designed to escape the quote characters from a MySQL query using the backslash as an escape metacharacter:

$username =~ s/(["'*])/\\$1/g;
$passwd =~ s/(["'*])/\\$1/g;

...

$query = "SELECT * FROM users WHERE user='" . $username
         . "' AND pass = '" . $passwd . "'";


This filter replaces dangerous quote characters with an escaped version of the character. For example, a username of "bob' OR user <> 'bob" would be replaced with "bob\' OR user <> \'bob". Therefore, attackers couldn't break out of the single quotes and compromise the application. The regular expression pattern neglects to escape the backslash character (\), however, so attackers still have an avenue of attack by submitting the following:

username = bob\' OR username =
passwd = OR 1=1


This input would create the following query after being filtered:

SELECT * FROM users WHERE user='bob\\' OR username = '   AND pass = ' OR 1=1


The MySQL server interprets the double-backslash sequence after bob as an escaped backslash. This prevents the inserted backslash from escaping the single quote, allowing an attacker to alter the query.
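A correct escaping routine must treat the escape character itself as a metacharacter. The sketch below (a hypothetical helper, written in C for consistency with the other listings) escapes both quote characters and the backslash; dst must have room for at least twice the length of src plus one byte:

```c
#include <string.h>

/* Escape quotes AND the backslash escape character itself.
   dst must hold at least 2*strlen(src)+1 bytes. */
void escape_sql(char *dst, const char *src)
{
    for(; *src; src++){
        if(*src == '\'' || *src == '"' || *src == '\\')
            *dst++ = '\\';
        *dst++ = *src;
    }
    *dst = '\0';
}
```

With this version, the attacker's leading backslash is itself escaped, so the quote that follows it remains safely escaped too.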

Note

Escape characters vary between SQL implementations. Generally, the database supports the backslash-quote (\') or double-apostrophe ('') escape sequences. However, developers might confuse which escape sequence is supported and accidentally use the wrong sequence for the target database.


Metacharacter Evasion

One of the most interesting security ramifications of escaping metacharacters is that the encoded characters can be used to avoid other filtering mechanisms. As a code auditor, you must determine when data can be encoded in a manner that undermines application security. To do this, you must couple decoding phases with relevant security decisions and resulting actions in the code. The following steps are a basic procedure:

1. Identify each location in the code where escaped input is decoded.

2. Identify associated security decisions based on that input.

3. If decoding occurs after the decision is made, you have a problem.

To perform this procedure correctly, you need to correlate what data is relevant to the action performed after the security check. There's no hard and fast method of tying a decoding phase to a security decision, but one thing you need to consider is that the more times data is modified, the more opportunities exist for fooling security logic. Beyond that, it's just a matter of understanding the code involved in data processing. To help build this understanding, the following sections provide specific examples of how data encodings are used to evade filters.

Hexadecimal Encoding

HTTP is discussed in more detail in later chapters; however, this discussion of encoding would be remiss if it didn't address the standard encoding form for URIs and query data. For the most part, all alphanumeric characters are transmitted directly via HTTP, and all other characters (excluding control characters) are escaped by using a three-character encoding scheme. This scheme uses a percent character (%) followed by two hexadecimal digits representing the byte value. For example, a space character (which has a hexadecimal value of 0x20) uses this three-character sequence: %20.

HTTP transactions can also include Unicode characters. Details of Unicode are covered in "Character Sets and Unicode" later in this chapter, but for this discussion, you just need to remember that Unicode characters can be represented as sequences of one or two bytes. For one-byte sequences, HTTP uses the hexadecimal encoding method already discussed. However, for two-byte sequences, Unicode characters can be encoded with a six-character sequence consisting of the string %u or %U followed by four hexadecimal digits. These digits represent the 16-bit value of a Unicode character. These alternate encodings are a potential threat for smuggling dangerous characters through character filters. To understand the problem, look at the sample code in Listing 8-27.

Listing 8-27. Hex-Encoded Pathname Vulnerability

int open_profile(char *username)
{
    if(strchr(username, '/')) {
        log("possible attack, slashes in username");
        return 1;
    }

    chdir("/data/profiles");

    return open(hexdecode(username), O_RDONLY);
}

This admittedly contrived example has a glaring security problem: the username variable is checked for slashes (/) before hexadecimal characters are decoded. Using the coupling technique described earlier, you can associate decoding phases, security decisions, and actions as shown in this list:

  • Decision: If username contains a / character, it's dangerous (refer to line 3 in Listing 8-27).

  • Decoding: Hexadecimal decoding is performed on input after the decision (refer to line 10).

  • Action: Username is used to open a file (refer to line 10).

So a username such as ..%2F..%2Fetc%2Fpasswd results in this program opening the system password file. Usually, these types of vulnerabilities aren't as obvious. Decoding issues are more likely to occur when a program is compartmentalized, and individual modules are isolated from the decoding process. Therefore, the developer using a decoding module generally isn't aware of what's occurring.
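The fix is to reverse the ordering: fully decode first, then make the security decision on the final form of the data. The sketch below folds a minimal in-place percent-decoder into the check (the function names are mine; Listing 8-27's hexdecode() and log() are not reproduced):

```c
#include <string.h>

static int hexval(char c)
{
    if(c >= '0' && c <= '9') return c - '0';
    if(c >= 'a' && c <= 'f') return c - 'a' + 10;
    if(c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode %XX sequences in place, then check for '/'. Because the
   decision is made on the decoded name, ..%2F.. no longer slips by. */
int is_safe_username(char *name)
{
    char *src = name, *dst = name;
    int hi, lo;

    while(*src){
        if(src[0] == '%' && (hi = hexval(src[1])) >= 0
                         && (lo = hexval(src[2])) >= 0){
            *dst++ = (char)((hi << 4) | lo);
            src += 3;
        } else
            *dst++ = *src++;
    }
    *dst = '\0';

    return strchr(name, '/') == NULL;
}
```

With this ordering, a username of ..%2F..%2Fetc%2Fpasswd decodes to ../../etc/passwd before the slash check runs and is rejected.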

Note

Hexadecimal encoding is also a popular method for evading security software (such as IDSs) used to detect attacks against Web servers. If an IDS fails to decode hexadecimal encoded requests or decodes them improperly, an attack can be staged without generating an alert.


Handling embedded hexadecimal sequences is usually simple. A decoder can generally do two things wrong:

  • Skip a NUL byte.

  • Decode illegal characters.

Earlier in this chapter, you examined a faulty implementation that failed to check for NUL bytes (see Listing 8-5). So this coverage concentrates on the second error, decoding illegal characters. This error can happen when assumptions are made about the data following a % sign. Two hexadecimal digits are expected to follow a % sign. Listing 8-28 shows a typical implementation for converting those values into data.

Listing 8-28. Decoding Incorrect Byte Values

int convert_byte(char byte)
{
    if(byte >= 'A' && byte <= 'F')
        return (byte - 'A') + 10;
    else if(byte >= 'a' && byte <= 'f')
        return (byte - 'a') + 10;
    else
        return (byte - '0');
}

int convert_hex(char *string)
{
    int val1, val2;

    val1 = convert_byte(string[0]);
    val2 = convert_byte(string[1]);

    return (val1 << 4) | val2;
}

The convert_byte() function is flawed in that it assumes the byte is a digit character whenever it isn't explicitly a hexadecimal letter (note the unconditional final else clause). Therefore, invalid hex characters passed to this function (G through Z, punctuation, and so on) produce unexpected decoded bytes. The security implication of this incorrect decoding is simple: any filters processing the data in an earlier stage miss values that can appear in the resulting output stream.
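A safer decoder refuses anything outside [0-9A-Fa-f] rather than assuming it's a digit. A strict sketch (hypothetical names, not from the listing):

```c
/* Strict version: returns -1 for any character that is not a valid
   hexadecimal digit, instead of assuming it's '0'..'9'. */
int convert_byte_strict(char byte)
{
    if(byte >= '0' && byte <= '9')
        return byte - '0';
    if(byte >= 'A' && byte <= 'F')
        return (byte - 'A') + 10;
    if(byte >= 'a' && byte <= 'f')
        return (byte - 'a') + 10;
    return -1;
}

/* Returns the decoded byte value, or -1 if either digit is invalid. */
int convert_hex_strict(const char *s)
{
    int hi = convert_byte_strict(s[0]);
    int lo = convert_byte_strict(s[1]);

    if(hi < 0 || lo < 0)
        return -1;

    return (hi << 4) | lo;
}
```

The caller can then reject (or pass through unmodified) any malformed %-sequence instead of emitting an arbitrary byte that earlier filters never saw.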

HTML and XML Encoding

HTML and XML documents can contain encoded data in the form of entities, which are used to encode HTML rendering metacharacters. Entities are constructed by using the ampersand sign (&), followed by the entity abbreviation, and terminated with a semicolon. For example, to represent an ampersand, the abbreviation is "amp," so &amp; is the encoded HTML entity. A complete list of entities is available from the World Wide Web Consortium (W3C) site at www.w3c.org.

Even more interesting, characters can also be encoded as their numeric codepoints in both decimal and hexadecimal. To represent a codepoint in decimal, the codepoint value is prepended with &#. For example, a space character has the decimal value 32, so it's represented as &#32;. Hex encoding is similar, except the value is prepended with &#x, so the space character (0x20) is represented as &#x20;. Two-byte Unicode characters can also be specified with five decimal or four hexadecimal digit sequences. This encoding form is susceptible to the same basic vulnerabilities that hexadecimal decoders might have, such as embedding NUL characters, evading filters, and assuming that at least two bytes follow an &# sequence.
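The sketch below decodes single-byte numeric references in place and shows where those same mistakes can creep in; it deliberately rejects &#0; (NUL injection) and leaves anything malformed untouched. This is an illustrative sketch, not a complete entity decoder; named entities and multibyte codepoints are out of scope:

```c
#include <stdlib.h>
#include <string.h>

/* Decode numeric character references (&#32; and &#x20;) for byte
   values 1-255, in place. Malformed references are copied through. */
void decode_numeric_entities(char *s)
{
    char *src = s, *dst = s;

    while(*src){
        if(src[0] == '&' && src[1] == '#'){
            char *end;
            int base = (src[2] == 'x' || src[2] == 'X') ? 16 : 10;
            const char *start = (base == 16) ? src + 3 : src + 2;
            long val = strtol(start, &end, base);

            /* Require digits, a terminating ';', and a nonzero byte */
            if(end != start && *end == ';' && val > 0 && val < 256){
                *dst++ = (char)val;
                src = end + 1;
                continue;
            }
        }
        *dst++ = *src++;
    }
    *dst = '\0';
}
```

As with percent-decoding, the danger lies in running this decode step after a filter has already inspected the data: "a&#32;b&#x2F;c" becomes "a b/c", slash and all.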

Note

Keep in mind that HTML decoding is normally handled by a client browser application. However, using this encoding form in XML data does open the possibility of a variety of server-directed attacks.


Multiple Encoding Layers

Sometimes data is decoded several times and in several different ways, especially when multiple layers of processing are performed before the input is used for its intended purpose. Decoding several times makes validation extremely difficult, as higher layers see the data in an intermediate format rather than the final unencoded content.

In complex multitiered applications, the fact that input goes through a number of filters or conversions might not be immediately obvious, or it might happen only in certain conditions. For example, data posted to an HTTP Web server might go through base64 decoding if the Content-Encoding header specifies this behavior, UTF-8 decoding because it's the encoding format specified in the Content-Type header, and finally hexadecimal decoding, which occurs on all HTTP traffic. Additionally, if the data is destined for a Web application or script, it's entirely possible that it goes through another level of hexadecimal decoding. Figure 8-3 shows this behavior.

Figure 8-3. Encoded Web data


Each component involved in decoding is often developed with no regard to other components performing additional decoding steps at lower or higher layers, so developers might make incorrect judgments on what input should result. Vulnerabilities of this nature tie back into previous discussions on design errors. Specifically, cross-component problems might happen when an interface to a component is known, but the component's exact function is unknown or undefined. For example, a Web server module might perform some decoding of request data to make security decisions about that decoded data. The data might then undergo another layer of decoding afterward, thus introducing the possibility for attackers to sneak encoded content through a filter.
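The classic instance of this bug is double percent-decoding: %252F decodes to %2F in the first pass and to / in the second, so a filter that runs between the two passes never sees a slash. A sketch using a minimal in-place decoder (a hypothetical helper, not from the text):

```c
#include <string.h>

static int hexval(char c)
{
    if(c >= '0' && c <= '9') return c - '0';
    if(c >= 'a' && c <= 'f') return c - 'a' + 10;
    if(c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* One pass of in-place %XX decoding; malformed sequences pass through. */
void percent_decode(char *s)
{
    char *src = s, *dst = s;
    int hi, lo;

    while(*src){
        if(src[0] == '%' && (hi = hexval(src[1])) >= 0
                         && (lo = hexval(src[2])) >= 0){
            *dst++ = (char)((hi << 4) | lo);
            src += 3;
        } else
            *dst++ = *src++;
    }
    *dst = '\0';
}
```

After one pass, "..%252F.." becomes "..%2F.." and contains no slash for a filter to catch; a second pass in a lower layer produces "../..".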

This example brings up another interesting point: Vulnerabilities of this nature might also be a result of operational security flaws. As you learned in Chapter 3, "Operational Review," applications don't operate in a vacuum, especially integrated pieces of software, such as Web applications. The web server and platform modules may provide encoding methods that attackers can use to violate the security of an application.




The Art of Software Security Assessment: Identifying and Preventing Software Vulnerabilities
ISBN: 0321444426
Year: 2004
Pages: 194