Metacharacters | The Art of Software Security Assessment: Identifying and Preventing Software Vulnerabilities

For many types of data, a program also maintains metadata (or meta-information) that it tracks alongside the main data; metadata is simply information that describes or augments the main data. It might include details on how to format data for display, processing instructions, or information on how pieces of the data are stored in memory. There are two basic strategies for representing program data alongside its associated metadata: embedding the metadata in-band or storing the metadata separately, out-of-band.

In-band representation embeds metadata in the data itself. When embedding metadata in textual data, you indicate this information by using special characters called metacharacters or metacharacter sequences. One of the simplest examples of in-band representation is the NUL character terminator in a C string. Out-of-band representation keeps metadata separate from data and associates the two through some external mechanism. String data types in other languages provide a simple example of out-of-band data. Many programming languages (such as C++, Java, PHP, Python, and Pascal) do not have a string terminator character; instead these languages store the string's length in an out-of-band variable.

In many ways, in-band representation is a superior format, as it is often more compact and human readable. However, there are a number of security pitfalls associated with in-band metadata representation that are not a concern for out-of-band metadata. These pitfalls exist because in-band representation creates the potential for overlapping trust domains where explicit boundaries are required. Essentially, in-band metadata representation places both data and metadata within the same trust domain, and parsing routines must handle the logical trust boundaries that exist between data and metadata. However, parsing functions are often very complex, and it can be extremely difficult for developers to account for the security implications of all possible data and metadata combinations.

So far, this chapter has discussed vulnerabilities that can result from mishandling a single in-band metacharacter: the NUL terminator character. However, there are a variety of in-band representations that are common in textual data formats. For example, a slash (/) metacharacter in a filename indicates the beginning or end of a path segment, a dot (.) metacharacter in a hostname indicates a subdomain, and a space metacharacter in an ASCII-based protocol often denotes the end of an input token. It's not unusual for applications to construct strings by incorporating user-controllable data, as in the following common situations:

Constructing a filename
Constructing a registry path (Windows-specific)
Constructing an e-mail address
Constructing an SQL statement
Adding user data to a text file

The following sections examine the potential security ramifications of neglecting to carefully sanitize user input when constructing strings containing metacharacters. Although these sections cover only general situations, later in the chapter you focus on specific examples in contemporary applications, including notorious cases of metacharacter abuse.

Embedded Delimiters

The simplest case of metacharacter vulnerabilities occur when users can embed delimiter characters used to denote the termination of a field. Vulnerabilities of this nature are caused by insufficiently sanitized user input incorporated into a formatted string. For example, say you have a data file containing username and password pairs, with each line in the file in the format username:password.

You can deduce that two delimiters are used: the colon (:) character and the newline (\n) character. What if you have the username bob, but could specify the password test\nnewuser:newpassword\n? The password entry would be written to the file like this:

bob:test newuseruser:newpassword

You can add an arbitrary new user account, which probably isn't what the developer intended for regular users.

So what would a vulnerable application look like? Essentially, you're looking for a pattern in which the application takes user input that isn't filtered sufficiently and uses it as input to a function that interprets the formatted string. Note that this interpretation might not happen immediately; it might be written to a secondary storage facility and then interpreted later. An attack of this kind is sometimes referred to a "second-order injection attack."

Note

This phrase "second-order injection attack" has been coined to refer to delayed SQL and cross-site scripting attacks, but it could apply to any sort of stored metacharacter data that's interpreted later.

To see an example of an application that's vulnerable to a basic delimiter injection attack, look at Listing 8-8, which contains the code that writes the password file shown previously.

Listing 8-8. Embedded Delimiter Example

use CGI; ... verify session details ... $new_password = $query->param('password'); open(IFH, "</opt/passwords.txt") || die("$!"); open(OFH, ">/opt/passwords.txt.tmp") || die("$!"); while(<IFH>){     ($user, $pass) = split /:/;     if($user ne $session_username)         print OFH "$user:$pass\n";     else         print OFH "$user:$new_password\n"; } close(IFH); close(OFH);

Listing 8-8 does no real sanitization; it simply writes the supplied password parameter to the file, so an attacker could add extraneous delimiters.

In general, discovering vulnerabilities of this nature consists of a number of steps:


1.	Identify some code that deals with metacharacter strings, including the common examples presented throughout this chapter. Web applications often have a variety of metacharacter strings because they constantly deal with URLs, session data, database queries, and so on. Some of these formats are covered in this chapter; however Web applications are covered in more depth in Chapters 17, "Web Applications," and 18, "Web Technologies."
2.	Identify all the delimiter characters that are specially handled. Depending on the situation, different characters take on special meanings. In well-known examples such as format strings and SQL, this chapter specifies the characters you need to be aware of. However, for unique situations, you need to examine the code that interprets the data to find the special characters.
3.	Identify any filtering performed on the input, and see what characters or character sequences are filtered (as described in "Input Filters" later in this chapter).
4.	Eliminate potentially hazardous delimiter characters from your compiled list that have been filtered out successfully. Any remaining delimiters indicate a vulnerability.

Using this simple procedure, you can quickly evaluate the construction of strings to determine what delimiters or special character sequences could be sneaked into input. The impact of being able to sneak delimiters into the string depends heavily on what the string represents and how it's interpreted. To see this technique in action, look at Listing 8-9, which is a CGI application being launched by a Web server:

Listing 8-9. Multiple Embedded Delimiters

BOOL HandleUploadedFile(char *filename) {     unsigned char buf[MAX_PATH], pathname[MAX_PATH];     char *fname = filename, *tmp1, *tmp2;     DWORD rc;     HANDLE hFile;     tmp1 = strrchr(filename, '/');     tmp2 = strrchr(filename, '\\');     if(tmp1 || tmp2)         fname = (tmp1 > tmp2 ? tmp1 : tmp2) + 1;     if(!*fname)         return FALSE;     if(strstr(fname, ".."))         return FALSE;     _snprintf(buf, sizeof(buf), "\\\\?\\%TEMP%\\%s", fname);     rc = ExpandEnvironmentStrings(buf, pathname, sizeof(pathname));     if(rc == 0 || rc > sizeof(pathname))         return FALSE;     hFile = CreateFile(pathname, ...);     ... read bytes into the file ... }

This code snippet handles an uploaded file from the client and stores the file in a specific temporary directory. Being able to store files outside this directory isn't desirable, of course, but is it safe? Apply the procedure shown previously:

1.	Identify some code that deals with format strings. The input string is formatted a couple of ways before it eventually becomes a filename. First, it's added to a statically sized buffer and is prefixed with `"\\\\?\\%TEMP%\\"`. Second, it's passed to `ExpandEnvironmentStrings()`, where presumably `%TEMP%` is expanded to a temporary directory. Finally, it's used as part of a filename.
2.	Identify the set of delimiter characters that are specially handled. Primarily, you want to access a special file or achieve directory traversal, which would involve characters such as `'/'`, `'\'` and the sequence `".."`. Also, notice that the string is passed to `ExpandEnvironmentStrings()`. Environment variables are denoted with `%` characters. Interesting!
3.	Identify any filtering that's performed. The `strrchr()` function is used to find the last slash and then increments past it. Therefore, slashes are out. The code also specifically checks for the double-dot sequence `".."`, so that's out, too.
4.	You have eliminated all the usual directory traversal tricks but are left with the `%` character that `ExpandEnvironmentStrings()` interprets. This interpretation allows arbitrary environment variables to be substituted in the pathname. Given that this code is a CGI program, clients could actually supply a number of environment variables, such as `QUERY_STRING`. This environment variable could contain all the sequences that have already been checked for in the original filename. If `"..\..\..\any\pathname\file.txt"` is supplied to `QUERY_STRING`, the client can write to arbitrary locations on the file system.

NUL Character Injection

As you've seen, C uses the NUL metacharacter as a string delimiter, but higher-level languages (such as Java, PHP, and Perl) use counted strings, in which the string contains its length and the NUL character has no special meaning. This difference in interpretation creates situations where the NUL character can be injected to manipulate the behavior of C APIs called by higher level languages. This issue is really just a special case of an embedded delimiter, but it's unique enough that it helps to discuss it separately.

Note

NUL byte injection is an issue regardless of the technology because at some level, the counted string language might eventually interact with the OS. Even a true virtual machine environment, such as Java or .NET, eventually calls base OS functions to do things such as open and close files.

You know that NUL-terminated strings are necessary when calling C routines from the OS and many external APIs. Therefore, a vulnerability may exist when attackers can include NUL characters in a string later handled as a C-style string. For example, say a Perl application opens a file based on some user-provided input. The application requires only text files, so the developer includes a filter requiring that the file end in a .txt extension. Figure 8-1 shows an example of a valid filename laid out in memory:

Figure 8-1. C strings in memory

However, what if one of the bytes is a NUL terminator character? After all, Perl doesn't treat the NUL character as a metacharacter. So the resulting string could look like Figure 8-2.

Figure 8-2. C string with NUL-byte injection in memory

The function responsible for opening this file would consider the first NUL byte the end of the string, so the .txt extension would disappear and the bob file would be opened.

This scenario is actually quite common in CGI and server-side Web scripting languages. The problems arise when decoding hexadecimal-encoded data (discussed in more depth in "Hexadecimal Decoding" later in this chapter). If the sequence %00 is encountered in input, it's decoded into a single NUL character. If the NUL character isn't handled correctly, attackers can artificially truncate strings while still meeting any other filtering requirements. The following Perl code is a simple example that could generate the altered file name shown Figure 8-2:

open(FH, ">$username.txt") || die("$!"); print FH $data; close(FH);

The username variable in this code isn't checked for NUL characters. Therefore, attackers can NUL terminate the string and create whatever file extensions they choose. The string in Figure 8-2 is just one example, but the NUL character could be used to exploit the server. For example, supplying execcmd.pl%00 for the username will create a file named execcmd.pl. A file with the .pl extension can be used to execute arbitrary code on many Web servers.

Most C/C++ programs aren't prone to having NUL bytes injected into user data because they deal with strings as regular C-character arrays. However, there are situations in which unexpected NUL characters can appear in strings. This most commonly occurs when string data is read directly from the network, as shown in Listing 8-10.

Listing 8-10. NUL-Byte Injection with Memory Corruption

int read_string(int fd, char *buffer, size_t length) {     int rc;     char *p;     if(length == 0)         return 1;     length--;     rc = read(fd, buffer, length);     if(rc <= 0)         return 1;     buffer[length] = '\0';     // trim trailing whitespace     for(p = &buffer[strlen(buffer)-1]; isspace(*p); p--)         *p = '\0';     return 0; }

The read_string() function in Listing 8-10 reads a string and returns it to users after removing trailing whitespace. The developer makes the assumption, however, that the string includes a trailing newline and does not contain any NUL characters (except at the end). If the first byte is a NUL character, the code trims whitespace before the beginning of the buffer, which could result in memory corruption.

The same can be said of dealing with files. When the read primitives are used to read a number of bytes into the buffer from a file, they too might be populated with unexpected NUL characters. This error can lead to problems like the one described previously in Listing 8-10. For example, the fgets() function, used for reading strings from files, is designed to read text strings from a file into a buffer. That is, it reads bytes into a file until one of the following happens:

It runs out of space in the destination buffer.
It encounters a newline character (\n) or end-of-file (EOF).

So the fgets() function doesn't stop reading when it encounters a NUL byte. Because it's specifically intended to deal with strings, it can catch developers unaware sometimes. The following example illustrates how this function might be a problem:

if(fgets(buffer, sizeof(buffer), fp) != NULL){     buffer[strlen(buffer)-1] = '\0';    ... }

This code is written with the assumption that the trailing newline character must be stripped. However, if the first character is a NUL byte, this code writes another NUL byte before the beginning of the buffer, thus corrupting another variable or program control information.

Truncation

Truncation bugs are one of the most overlooked areas in format string handling, but they can have a variety of interesting results. Developers using memory-unsafe languages can dynamically resize memory at runtime to accommodate for user input or use statically sized buffers based on an expected maximum input length. In statically sizes buffers, input that exceeds the length of the buffer must be truncated to fit the buffer size and avoid buffer overflows. Although truncation avoids memory corruption, you might observe interesting side effects from data loss in the shortened input string. To see how this works, say that a programmer has replaced a call to sprintf() with a call to snprintf() to avoid buffer overflows, as in Listing 8-11.

Listing 8-11. Data Truncation Vulnerability

int update_profile(char *username, char *data) {     char buf[64];     int fd;     snprintf(buf, sizeof(buf), "/data/profiles/%s.txt",               username);     fd = open(buf, O_WRONLY);     ... }

The snprintf() function (shown in bold) in Listing 8-11 is safe from buffer overflows, but a potentially interesting side effect has been introduced: The filename can be a maximum of only 64 characters. Therefore, if the supplied username is close to or exceeds 60 bytes, the buffer is completely filled and the .txt extension is never appended. This result is especially interesting in a Web application because attackers could specify a new arbitrary file extension (such as .php) and then request the file directly from the Web server. The file would then be interpreted in a totally different manner than intended; for example, specifying a .php extension would cause the file to run as a PHP script.

Note

File paths are among the most common examples of truncation vulnerabilities; they can allow an attacker to cut off a mandatory component of the file path (for example, the file extension). The resulting path might avoid a security restriction the developer intended for the code.

Listing 8-12 shows a slightly different example of truncating file paths.

Listing 8-12. Data Truncation Vulnerability 2

int read_profile(char *username, char *data) {     char buf[64];     int fd;     snprintf(buf, sizeof(buf), "/data/%s_profile.txt",              username);     fd = open(buf, O_WRONLY);     ... }

For Listing 8-12, assume you want to read sensitive files in the /data/ directory, but they don't end in _profile.txt. Even though you can truncate the ending off the filename, you can't view the sensitive file unless the filename is exactly the right number of characters to fill up this buffer, right? The truth is it doesn't matter because you can fill up the buffer with slashes. In filename components, any number of contiguous slashes are seen as just a single path separator; for example, /////// and / are treated the same. Additionally, you can use the current directory entry (.) repetitively to fill up the buffer in a pattern such as this: ././././././.

Auditing Tip

Code that uses snprintf() and equivalents often does so because the developer wants to combine user-controlled data with static string elements. This use may indicate that delimiters can be embedded or some level of truncation can be performed. To spot the possibility of truncation, concentrate on static data following attacker-controllable elements that can be of excessive length.

Another point to consider is the idiosyncrasies of API functions when dealing with data they need to truncate. You have already seen examples of low-level memory-related problems with functions in the strncpy() family, but you need to consider how every function behaves when it receives data that isn't going to fit in a destination buffer. Does it just overflow the destination buffer? If it truncates the data, does it correctly NUL-terminate the destination buffer? Does it have a way for the caller to know whether it truncated data? If so, does the caller check for this truncation? You need to address these questions when examining functions that manipulate string data. Some functions don't behave as you'd expect, leading to potentially interesting results. For example, the GetFullPathName() function in Windows has the following prototype:

DWORD GetFullPathName(LPCTSTR lpFileName, DWORD nBufferLength,                         LPTSTR lpBuffer, LPTSTR *lpFilePart)

This function gets the full pathname of lpFileName and stores it in lpBuffer, which is nBufferLength TCHARs long. Then it returns the length of the path it outputs, or 0 on error. What happens if the full pathname is longer than nBufferLength TCHARs? The function leaves lpBuffer untouched (uninitialized) and returns the number of TCHARs required to hold the full pathname. So this failure case is handled in a very unintuitive manner. Listing 8-13 shows a correct calling of this function.

Listing 8-13. Correct Use of GetFullPathName()

DWORD rc; TCHAR buffer[MAX_PATH], *filepart; DWORD length = sizeof(buffer)/sizeof(TCHAR); rc = GetFullPathName(filename, length, buffer, &filepart); if(rc == 0 || rc > length) {     ... handle error ... }

As you have probably guessed, it's not uncommon for callers to mistakenly just check whether the return value is 0 and neglect to check whether the return code is larger than the specified length. As a result, if the lpFileName parameter is long enough, the call to GetFullPathName() doesn't touch the output buffer at all, and the program uses an uninitialized variable as a pathname. Listing 8-14 from the Apache 2.x codebase shows a vulnerable call of GetFullPathName().

Listing 8-14. GetFullPathName() Call in Apache 2.2.0

apr_status_t filepath_root_case(char **rootpath, char *root, apr_pool_t *p) { #if APR_HAS_UNICODE_FS     IF_WIN_OS_IS_UNICODE     {         apr_wchar_t *ignored;         apr_wchar_t wpath[APR_PATH_MAX];         apr_status_t rv;         apr_wchar_t wroot[APR_PATH_MAX];         /* ???: This needs review. Apparently "\\?\d:."          * returns "\\?\d:" as if that is useful for          * anything.          */         if (rv = utf8_to_unicode_path(wroot, sizeof(wroot)             / sizeof(apr_wchar_t), root))             return rv;         if (!GetFullPathNameW(wroot, sizeof(wpath) /             sizeof(apr_wchar_t), wpath, &ignored))             return apr_get_os_error();         /* Borrow wroot as a char buffer (twice as big as          * necessary)          */         if ((rv = unicode_to_utf8_path((char*)wroot,              sizeof(wroot), wpath)))             return rv;         *rootpath = apr_pstrdup(p, (char*)wroot); } #endif     return APR_SUCCESS; }

You can see that the truncation case hasn't been checked for in Listing 8-14. As a result, the wroot variable can be used even though GetFullPathName() might not have initialized it. You might encounter other functions exhibiting similar behavior, so keep your eyes peeled!

Note

ExpandEnvironmentStrings() is one function that behaves similarly to GetFullPathName().