Preventing Canonicalization Mistakes

Now that you've read the bad news, let's look at solutions for canonicalization mistakes. The solutions include avoiding making decisions based on names, restricting what is allowed in a name, and attempting to canonicalize the name. Let's look at each in detail.

Don't Make Decisions Based on Names

The simplest way of avoiding canonicalization bugs is to avoid making decisions based on the filename. Let the file system and operating system do the work for you, and use ACLs or other operating system-based authorization technologies. Of course, it's not quite as simple as that! Some security semantics cannot currently be represented in the file system. For example, IIS supports scripting. In other words, a script file, such as an ASP page containing Visual Basic Scripting Edition (VBScript) or Microsoft JScript, is read and processed by a script engine, and the results of the script are sent to the user. This is not the same as read access or execute access; it's somewhere in the middle. IIS, not the operating system, has to determine how to process the file. All it takes is a mistake in IIS's canonicalization, such as that in the ::$DATA exploit, and IIS sends the script file source code to the user rather than processing the file correctly.

As mentioned, you can limit access to resources based on the user's IP address. However, these security semantics currently cannot be represented as an ACL, and applications supporting restrictions based on IP address, Domain Name System (DNS) name, or subnet must use their own access code.


	Refrain from making security decisions based on the name of a file. The wrong conclusion might have dire security consequences.

Use a Regular Expression to Restrict What's Allowed in a Name

If you must make name-based security decisions, restrict what you consider a valid name and deny all other formats. For example, you might require that all filenames be absolute paths containing a restricted pool of characters. Or you might decide that the following must be true for a file to be determined as valid:

The file must reside on drive c: or d:.
The path is a series of backslashes and alphanumeric characters.
The filename follows the path; the filename is also alphanumeric, is not longer than 32 characters, is followed by a dot, and ends with the txt, jpg, or gif extension.

The easiest way to do this is to use regular expressions. A regular expression is a series of characters that define a pattern which is then compared with target data, such as a string, to see whether the target includes any matches of the pattern. For example, the following regular expression will represent the example absolute path just described:

^[cd]:(?:\\\w+)+\\\w{1,32}\.(txt|jpg|gif)$

Let me quickly explain this write-only syntax. Take a look at Table 8-2.

Table 8-2 A Regular Expression to Match Absolute Paths
Element	Comments
^	Matches the position at the beginning of the input string.
[cd]:	The letter c or d followed by a colon.
(?:\\\w+)+	The opening and closing parentheses have two purposes. The first purpose is to group parts of a pattern together, and the second is to capture the match and store the result in a variable. The ?: means don't store the matched characters; just treat the next characters as a group that must appear together. In this case, we want \\\w+ to be a group. The \\ is a \. You must escape certain characters with a \ first. The \w is shorthand for A-Za-z0-9 and underscore (_). The plus sign indicates one or more matches. So this portion of the expression means, "Look for one or more series of a backslash followed by one or more alphanumeric or underscore characters (for example, \abc\def or \xyz), and don't bother storing the data that was found."
\\\w{1,32}\.	The backslash character followed by between 1 and 32 alphanumeric or underscore characters and then a period. This is the first part of the filename.
(txt|jpg|gif)	The letters txt, jpg, or gif. This and the previous portion of the pattern make up a backslash followed by the filename.
$	Matches the position at the end of the input string.

This expression is strict. The following are valid:

c:\mydir\myotherdir\myfile.txt
d:\mydir\myotherdir\someotherdir\picture.jpg

The following are invalid:

e:\mydir\myotherdir\myfile.txt (invalid drive letter)
c:\fred.txt (must have a directory before the filename)
c:\mydir\myotherdir\..\mydir\myfile.txt (can't have anything but A-Za-z0-9 and an underscore in a directory name)
c:\mydir\myotherdir\fdisk.exe (invalid file extension)
c:\mydir\myothe~1\myfile.txt (the tilde [~] is invalid)
c:\mydir\myfile.txt::$DATA (the colon [:] is invalid other than after the drive letter; $ is also invalid)
C:\mydir\myfile.txt. (the trailing dot is invalid)
\\myserver\myshare\myfile.txt (no drive letter)
\\?\c:\mydir\myfile.txt (no drive letter)

As you can see, using this simple expression can drastically reduce the possibility of using a noncanonical name. However, it does not detect whether a filename represents a device; we'll look at that shortly.


	Regular expressions teach an important lesson. A regular expression determines what is valid, and everything else is invalid. This is the correct way to parse any kind of input. You should never look for and block invalid data and then allow everything else through; you will likely miss a rare edge case. This is incredibly important. I repeat: look for that which is provably valid, and disallow everything else.

Visual Basic 6, VBScript 5 and later, JScript, Perl, and any language using the .NET Framework, such as C#, have support for regular expressions. If you use C++, a Standard Template Library-aware class named Regex++ is available at www.boost.org, and Microsoft Visual C++ included with Visual Studio .NET includes a lightweight Active Template Library (ATL) regular-expression parser template class, CAtlRegExp. Note that the regular-expression syntax used by CAtlRegExp is a little different from the classic syntax; some of the less-used operators are missing.

Regular Expressions and International Applications

All the example regular expressions in this chapter use 8-bit characters, which is less than adequate for international audiences. If you want to take advantage of international characters, you'll need to use four-hex-digit Unicode escapes in your expressions. For example, \u00A9 matches the copyright symbol, ©, and \u00DF is the German sharp-S symbol, ß. You can see all these symbols by using the Character Map application included with Windows.

The following simple command-line JScript code shows how to construct a regular expression and test whether the data entered on the command line is a valid directory and filename:

var args = WScript.Arguments;
if (args.length > 0) {
    var reg = /^[cd]:(?:\\\w+)+\\\w{1,32}\.(txt|jpg|gif)$/i;
    WScript.echo(reg.test(args(0))
        ? "Cool, it matches!"
        : "Ugh, invalid!");
}

Note the use of slash (/) characters to start and end the regular expression. This is similar to the way expressions are built in, say, Perl. The i at the end of the expression means perform a case-insensitive search.

Here's a similar example in a C# class using the .NET Framework classes:

using System;
using System.Text.RegularExpressions;

class SampleRegExp {
    static bool IsValidFileName(String s) {
        String pat = @"^[cd]:(?:\\\w+)+\\\w{1,32}\.(txt|jpg|gif)$";
        Regex re = new Regex(pat);
        return re.Match(s).Success;
    }

    static void Main(string[] args) {
        if (args.Length > 0)
            if (IsValidFileName(args[0]))
                Console.Write("{0} is a valid filename!", args[0]);
    }
}

As with the JScript example, you can run this small application from the command line and check the syntax of a directory and filename.
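For C++ code that cannot take a dependency on Regex++ or ATL, much the same check can be sketched with std::regex from the C++11 standard library. This is not one of the original samples: the use of std::regex is an assumption, and the function name IsValidFileName simply mirrors the C# example. ECMAScript is the default grammar for std::regex, so the pattern carries over unchanged.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Sketch only: the same allowlist pattern as the JScript and C# samples.
// std::regex (C++11) is an assumption; the text names Regex++ and CAtlRegExp.
bool IsValidFileName(const std::string& s) {
    // Raw string literal, so backslashes appear exactly as in the pattern.
    static const std::regex re(
        R"(^[cd]:(?:\\\w+)+\\\w{1,32}\.(txt|jpg|gif)$)");
    // regex_match succeeds only if the whole input matches the pattern.
    return std::regex_match(s, re);
}
```

Like the C# sample, this sketch is case sensitive. Constructing the regex with the std::regex::icase flag would mirror the i modifier in the JScript version.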

Stopping 8.3 Filename Generation

You should also consider preventing the file system from generating short filenames. This is not a programmatic option; it's an administrative setting. You can stop Windows from creating 8.3 filenames by adding the following setting to the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem registry key:

NtfsDisable8dot3NameCreation : REG_DWORD : 1

This option does not remove previously generated 8.3 filenames.

Don't Trust the PATH

Never depend on the PATH environment variable to find files. You should be explicit about where your files reside. For all you know, an attacker might have changed the PATH to read c:\myhacktools;%systemroot% and so on! When was the last time you checked the PATH on your systems?
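One way to act on that closing question is to audit the PATH periodically. The following is a minimal sketch and not part of the original samples: UnexpectedPathEntries is a hypothetical helper that splits a PATH-style string on semicolons and reports every entry that is not on a list of directories you expect. The exact-match comparison is a simplifying assumption; a real tool would need to normalize case and expand environment variables first.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original text): returns the PATH
// entries that are missing from an allowlist of expected directories.
// Assumption: entries are compared exactly, with no case folding and
// no environment-variable expansion.
std::vector<std::string> UnexpectedPathEntries(
        const std::string& path,
        const std::set<std::string>& expected) {
    std::vector<std::string> unexpected;
    std::istringstream in(path);
    std::string entry;
    // Windows separates PATH entries with a semicolon.
    while (std::getline(in, entry, ';')) {
        if (expected.count(entry) == 0)
            unexpected.push_back(entry);
    }
    return unexpected;
}
```

Passing the value returned by std::getenv("PATH") as the first argument would check the live environment. An empty entry between two semicolons is reported as well, because it is not on the list.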

A new registry setting in Windows XP allows you to search some of the folders specified in the PATH environment variable before searching the current directory. Normally, the current directory is searched first, which can make it easy for attackers to place Trojan horses on the computer. The registry value is HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SafeDllSearchMode. You need to add this registry value. The value is a DWORD type and is 0 by default. If the value is set to 1, the current directory is searched after system32.

Restricting what is valid in a filename and rejecting all else is reasonably safe, as long as you use a good regular expression. However, if you want more flexibility, you might need to attempt to canonicalize the filename for yourself, and that's the next topic.

Attempt to Canonicalize the Name

Canonicalizing a filename is not as hard as it seems; you just need to be aware of some Win32 functions to help you. The goal of canonicalization is to get as close as possible to the file system's representation of the file in your code and then to make decisions based on the result. In my opinion, you should get as close as possible to the canonical representation and reject the name if it still does not look valid. For example, the CleanCanon application I've written in the past performs robust canonicalization functions as described in the following steps:

1. It takes an untrusted filename request from a user, for example, mysecretfile.txt.
2. It determines whether the filename is well formed. For example, mysecretfile.txt is valid; mysecr~1.txt, mysecretfile.txt::$DATA, and mysecretfile.txt. (trailing dot) are all invalid.
3. The code determines whether the combined length of the filename and the directory is greater than MAX_PATH. If so, the request is rejected. This is to help mitigate denial of service attacks and buffer overruns.
4. It prepends an application-configurable directory to the filename (for example, c:\myfiles) to yield c:\myfiles\mysecretfile.txt.
5. It determines the correct directory structure that allows for two dots (..); this is achieved by calling GetFullPathName.
6. It evaluates the long filename of the file in case the user uses the short filename version. For example, mysecr~1.txt becomes mysecretfile.txt, achieved by calling GetLongPathName. This is technically moot because of the filename validation in step 2. However, it's a defense-in-depth measure!
7. It determines whether the filename represents a file or a device. This is something a regular expression cannot achieve. If the GetFileType function determines the file to be of type FILE_TYPE_DISK, it's a real file and not a device of some kind.


	Earlier I mentioned that device name issues exist in Linux and Unix also. C or C++ programs running on these operating systems can determine whether a file is a file or a device by calling the stat function and checking the st_mode member of the stat structure it fills in. If the file-type bits of st_mode (st_mode & S_IFMT) equal S_IFREG (octal 0100000), the file is indeed a real file and not a device. Because stat follows symbolic links, call lstat instead if you also need to detect a link.
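The Linux and Unix check described in the note above could be sketched as follows. This is an illustration under stated assumptions rather than code from the original text: IsRegularFile is a hypothetical name, and the sketch calls lstat so that a symbolic link is rejected instead of being followed to its target.

```cpp
#include <cassert>
#include <sys/stat.h>

// Sketch for Linux/Unix: returns true only for a regular file.
// lstat is used (an assumption, see above) so that a symbolic link is
// reported as a link rather than as whatever it points to.
bool IsRegularFile(const char* path) {
    struct stat st;
    if (lstat(path, &st) != 0)
        return false;            // missing or inaccessible: reject
    // S_ISREG(m) is equivalent to (m & S_IFMT) == S_IFREG.
    return S_ISREG(st.st_mode);
}
```

A device such as /dev/null, a directory, and a nonexistent path all yield false. As with GetFileType on Windows, the answer describes the file at the moment of the call, so the check belongs as close as possible to the code that opens the file.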

Let's look at the C++ code, written using Visual C++ .NET, that performs these steps:

/*
  CleanCanon.cpp
*/
#include "stdafx.h"
#include "atlrx.h"
#include <new>

enum errCanon {
    ERR_CANON_NO_ERROR = 0,
    ERR_CANON_INVALID_FILENAME,
    ERR_CANON_NOT_A_FILE,
    ERR_CANON_NO_SUCH_FILE,
    ERR_CANON_TOO_BIG,
    ERR_CANON_NO_MEM};

errCanon GetCanonicalFileName(_TCHAR *szFilename,
                              _TCHAR *szDir,
                              _TCHAR **pszNewFilename) {
    *pszNewFilename = NULL;
    _TCHAR *pTempFullDir = NULL;
    HANDLE hFile = INVALID_HANDLE_VALUE;

    errCanon err = ERR_CANON_NO_ERROR;

    try {
        // STEP 2
        // Check that filename is valid (alphanum '.' 0-4 alphanums)
        // Case insensitive
        CAtlRegExp<> reFilename;
        reFilename.Parse(_T("^\\a+\\.\\a?\\a?\\a?\\a?$"), FALSE);
        CAtlREMatchContext<> mc;
        if (!reFilename.Match(szFilename, &mc))
            throw ERR_CANON_INVALID_FILENAME;

        DWORD cbFilename = lstrlen(szFilename);
        DWORD cbDir = lstrlen(szDir);

        // Temp new buffer size, allow for added '\'.
        DWORD cbNewFilename = cbFilename + cbDir + 1;

        // STEP 3
        // Make sure the filename is small enough.
        if (cbNewFilename > MAX_PATH)
            throw ERR_CANON_TOO_BIG;

        // Allocate memory for the new filename.
        // Accommodate for trailing '\0'.
        // Assign to the outer pointer (do not redeclare it here)
        // so that the cleanup code below can free the buffer.
        pTempFullDir = new (std::nothrow) _TCHAR[cbNewFilename + 1];
        if (pTempFullDir == NULL)
            throw ERR_CANON_NO_MEM;

        // STEP 4
        // Join the dir and the filename together.
        _sntprintf(pTempFullDir,
                   cbNewFilename,
                   _T("%s\\%s"),
                   szDir,
                   szFilename);
        pTempFullDir[cbNewFilename] = _T('\0');

        // STEP 5
        // Get the full path,
        // Accommodates for .. and trailing '.' and spaces
        LPTSTR pFilename;
        _TCHAR pFullPathName[MAX_PATH + 1];
        DWORD dwFullPathLen =
            GetFullPathName(pTempFullDir,
                            MAX_PATH,
                            pFullPathName,
                            &pFilename);
        // A return value of 0 means the call failed.
        if (dwFullPathLen == 0)
            throw ERR_CANON_INVALID_FILENAME;
        if (dwFullPathLen > MAX_PATH)
            throw ERR_CANON_NO_MEM;

        // STEP 6
        // Get the long filename.
        GetLongPathName(pFullPathName, pFullPathName, MAX_PATH);

        // STEP 7
        // Is this a file or a device?
        // Assign to the outer handle so that it is closed below.
        hFile = CreateFile(pFullPathName,
                           0, 0, NULL,
                           OPEN_EXISTING,
                           0, NULL);
        if (hFile == INVALID_HANDLE_VALUE)
            throw ERR_CANON_NO_SUCH_FILE;

        if (GetFileType(hFile) != FILE_TYPE_DISK)
            throw ERR_CANON_NOT_A_FILE;

        // Looks good!
        // Caller must call delete [] *pszNewFilename.
        *pszNewFilename =
            new (std::nothrow) _TCHAR[lstrlen(pFullPathName) + 1];
        if (*pszNewFilename != NULL)
            lstrcpy(*pszNewFilename, pFullPathName);
        else
            err = ERR_CANON_NO_MEM;

    } catch(errCanon e) {
        err = e;
    }

    if (pTempFullDir) delete [] pTempFullDir;
    if (hFile != INVALID_HANDLE_VALUE) CloseHandle(hFile);

    return err;
}

The complete code listing is available on the companion CD in the folder Secureco\Chapter 8\CleanCanon. CreateFile has a side effect when it's determining whether the file is a drive-based file. The function will fail if the file does not exist, saving your application from performing the check.

You might realize that one check is missing in this chapter: there's no support for unescaping characters from Unicode, UTF-8, or hexadecimal. Because this is very much a Web problem, I will defer that discussion until Chapter 12.