The Common Gateway Interface (CGI) standard specifies how an ordinary executable interacts with a Web server to create dynamic Web content. It lays out how the two programs can use the features of their runtime environment to communicate everything necessary about an HTTP request and response. Specifically, the CGI program takes input about the HTTP request through its environment variables, its command line, and its standard input, and it returns all its HTTP response instructions and data over its standard output.

It's unlikely you'll need to review the security of a straightforward CGI application, as CGI has been obsolete as a dynamic Web programming technique for at least a decade. However, modern Web technology borrows so much from the CGI interface, both implicitly and explicitly, that it's worthwhile to cover the technical nuances that are still around today. The following sections focus on the artifacts that are still causing security headaches for Web developers.

Indexed Queries

In the CGI model, most of the information about the incoming HTTP request is placed in the CGI program's environment variables. They are covered in detail in the next section, but they will probably seem familiar to you, with names such as QUERY_STRING and SERVER_NAME. Most people are aware that the CGI program's standard input (stdin) is used to send the body of the HTTP request, which is generally referred to as "POSTing data." The CGI program uses its standard output to communicate its HTTP response to the Web server.

Next, look at the command-line arguments. You've probably assumed that the GET query string parameters are passed over the command line. It turns out, however, that this assumption is almost entirely wrong. The query string is always in the QUERY_STRING environment variable, but it's almost never passed over the command line.
This contention probably seems flat wrong to anyone who has witnessed the efficacy of URLs such as the following:

    GET /scripts/..%c1%c1../winnt/system32/cmd.exe?/c+dir+c:\

This Unicode attack works because it inadvertently initiates an antiquated form of HTTP request called an "indexed query." Indexed queries are old: They predate HTML forms and today's GET and POST methods. (At one point, they were almost added to the HTTP specification as the TEXTSEARCH query, but they never made it into the final draft.) Before HTML had input boxes and buttons, the only input you could offer was a search box, placed on your page by using the <ISINDEX> tag. The tag causes a single text input box to be placed on your page, and still works if you want to see it in action. If a user enters data in the box and presses Enter, the Web browser issues an indexed query to the page. As an example, entering the string "jump car cake door" causes the browser to send the following query:

    GET /name/of/the/page.exe?jump+car+cake+door

The Web server interprets this indexed query by running page.exe with an argument array argv[] of {"page.exe", "jump", "car", "cake", "door"}. The original string delimiter was the plus sign (+), not the ampersand (&), but other than that, it's close to the query string mechanism used today.

So when a contemporary Web server sees a request with a query string, it checks to see whether it's an indexed query. If the query string contains an unescaped equals sign (=), the Web server decides it's a normal GET query string request, puts the query string in the QUERY_STRING environment variable, and doesn't pass any command-line arguments to the CGI program. If the Web server sees a query string without an equals sign, it assumes it's an indexed query. It still places the entire query string in QUERY_STRING, but it also sets up command-line arguments for running the CGI program.
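The decision just described can be sketched in a few lines of Java. This is a minimal illustration, not any particular server's implementation: the class and method names are hypothetical, and the sketch assumes that each word of an indexed query is percent-decoded before being handed to the program. It needs Java 10 or later for the URLDecoder overload it uses.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class IndexedQuery {
    /** Builds the argv a server might pass to a CGI program (hypothetical helper). */
    static List<String> buildArgv(String program, String queryString) {
        List<String> argv = new ArrayList<>();
        argv.add(program);
        // An unescaped '=' marks an ordinary GET query: no command-line arguments.
        if (queryString == null || queryString.isEmpty() || queryString.contains("=")) {
            return argv;
        }
        // Indexed query: '+' separates the words; each word is percent-decoded.
        for (String word : queryString.split("\\+")) {
            if (!word.isEmpty()) {
                argv.add(URLDecoder.decode(word, StandardCharsets.UTF_8));
            }
        }
        return argv;
    }

    public static void main(String[] args) {
        // Prints [page.exe, jump, car, cake, door]
        System.out.println(buildArgv("page.exe", "jump+car+cake+door"));
        // Prints [page.exe] because the '=' makes this a normal query string
        System.out.println(buildArgv("page.exe", "name=value"));
    }
}
```

Note that a query string such as /c+dir+c:\ contains no equals sign, so under this logic it becomes the arguments "/c", "dir", and "c:\", which is why the cmd.exe request shown earlier runs a directory listing.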
Environment Variables

Most of the information about a Web request is communicated through environment variables in the CGI model. It's important to have a grasp of these variables because they have been carried through into most new Web technology. In fact, a few subtly confusing variables inherited from the CGI interface still trip up new developers.

Some variables are straightforward pieces of data that are copied straight out of the client's HTTP request, and the Web server fills out other variables to explain its runtime environment and configuration. Finally, some variables contain analysis and interpretation of the request. The Web server performs analysis and processing of the request to reach the point where it decides it should call a CGI program. Some of this analysis is passed on to the CGI, and it's usually these variables that cause problems because of their nuanced nature.

Static Variables

Start with the variables that stay the same across multiple requests:
Straightforward Request Variables

These variables vary depending on the HTTP request, but they are fairly straightforward in how they get their information and what they mean:
Parroted Request Variables

For every HTTP request header field the Web server sees, it translates the field name into an appropriate environment variable name and passes it on to the application. For example, an HTTP request contains the following User-Agent header field:

    User-Agent: AwesomeWebBrowser/1.5

The CGI engine converts the field name to all uppercase letters. It then converts any hyphen characters into underscores, and finally adds HTTP_ to the beginning of all automatically converted request header fields. So you end up with the environment variable HTTP_USER_AGENT set to the value AwesomeWebBrowser/1.5.

The Web server puts a few request header fields, such as Content-Length and Content-Type, into the core environment variables, so it doesn't need to convert those request header fields and duplicate the information. Also, CGI engines shouldn't translate a few request header fields for security reasons, such as the base64 authorization data users provide. This makes sense; if the Web server is handling authentication and verification of credentials, there's no reason to expose usernames and passwords to the CGI script as well.

Synthesized Request Variables

As the Web server processes a request, it creates more subtle variables. Originally, the CGI system was designed around a straightforward file tree model that assumes a URI refers to a file existing on the file system. This assumption is often untrue in modern applications, as the Web server may perform a number of path mappings before determining the final URI. In many cases, the server must synthesize the final URI, along with variables and state information that match the CGI program's requirements. When run, the CGI program is told it's being called on behalf of a particular URI, called the script URI. It might be the same URI the client requested, or it could be a completely arbitrary fabrication of the Web server.
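Returning briefly to the parroted header variables, the naming rule described there can be sketched as follows. This is an illustration under stated assumptions rather than the behavior of any specific server: the class and method names are hypothetical, and the set of header fields that are withheld or mapped to core variables is limited to the examples the text mentions (Authorization, Content-Length, and Content-Type).

```java
import java.util.Locale;
import java.util.Optional;

final class HeaderToEnv {
    /** Maps a request header field name to a CGI variable name (hypothetical helper). */
    static Optional<String> envName(String headerName) {
        // Uppercase the field name and turn hyphens into underscores.
        String upper = headerName.toUpperCase(Locale.ROOT).replace('-', '_');
        // Assumption: these two already exist as core variables, so no HTTP_ prefix.
        if (upper.equals("CONTENT_LENGTH") || upper.equals("CONTENT_TYPE")) {
            return Optional.of(upper);
        }
        // Assumption: credentials are withheld from the CGI program entirely.
        if (upper.equals("AUTHORIZATION")) {
            return Optional.empty();
        }
        return Optional.of("HTTP_" + upper);
    }

    public static void main(String[] args) {
        // Prints HTTP_USER_AGENT
        System.out.println(envName("User-Agent").orElse("(not passed)"));
        // Prints (not passed)
        System.out.println(envName("Authorization").orElse("(not passed)"));
    }
}
```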
Whether the script URI matches the client's request or is fabricated by the Web server, all the information provided in separate environment variables should appear to refer to a single initial query from the user. These synthesized request variables are described in the following list:
Path Confusion

If you think about the exposed functions in the CGI specification, there isn't a lot to help developers who want to know where their application resides in the Web tree and the file system. The odd thing is that the environment variable names sound as though they have a logical purpose toward this end. Most people assume PATH_INFO is the path to the directory where the script resides. They assume PATH_TRANSLATED is simply that pathname mapped to the physical file system. However, these variables don't behave even remotely as their names imply. What's amusing is that sometimes developers get lucky by virtue of circumstance, and their code works well enough to get by even though it uses the variables incorrectly.

So CGI path handling provides a historic interface that's quite inconsistent, solves the wrong problems, and is prone to being misunderstood and used incorrectly. Naturally, it has been propagated to every Web technology in some form or another as a universal interface. The following sections explain how some common environment variables have been incorporated into modern Web environments, focusing on PATH_INFO, PATH_TRANSLATED, QUERY_STRING, and SCRIPT_NAME, because they are the most important or baffling. Table 18-1 summarizes these variables.
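To make the distinction concrete, the following sketch shows roughly what these four variables hold, going by the CGI specification (RFC 3875): SCRIPT_NAME is the URI path that identifies the script, PATH_INFO is whatever extra path follows the script name, PATH_TRANSLATED is that extra path mapped onto the file system as if it were a URI of its own, and QUERY_STRING is everything after the question mark. The class name, method name, and example paths are hypothetical. The sketch also assumes the server already knows which prefix of the path is the script and that it maps paths by simply prepending a document root; real servers percent-decode the path and apply their own mapping rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CgiPathVars {
    /** Derives the path-related CGI variables for one request (simplified sketch). */
    static Map<String, String> derive(String requestTarget, String scriptName, String documentRoot) {
        // Split off the query string at the first '?'.
        int q = requestTarget.indexOf('?');
        String path = q < 0 ? requestTarget : requestTarget.substring(0, q);
        String query = q < 0 ? "" : requestTarget.substring(q + 1);

        if (!path.startsWith(scriptName)) {
            throw new IllegalArgumentException("request path does not begin with the script name");
        }
        // Whatever follows the script name is the extra path information.
        String pathInfo = path.substring(scriptName.length());
        if (!pathInfo.isEmpty() && !pathInfo.startsWith("/")) {
            throw new IllegalArgumentException("script name is not a whole path prefix");
        }

        Map<String, String> env = new LinkedHashMap<>();
        env.put("SCRIPT_NAME", scriptName);
        env.put("PATH_INFO", pathInfo);
        if (!pathInfo.isEmpty()) {
            // Assumption: the server maps URI paths by prepending the document root.
            env.put("PATH_TRANSLATED", documentRoot + pathInfo);
        }
        env.put("QUERY_STRING", query);
        return env;
    }

    public static void main(String[] args) {
        derive("/cgi-bin/page.cgi/extra/path?x=1", "/cgi-bin/page.cgi", "/var/www")
            .forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

In this example PATH_INFO is /extra/path and PATH_TRANSLATED is /var/www/extra/path. Neither value says anything about the directory in which page.cgi itself resides, which is the misunderstanding the text describes.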
Example of a PATH_INFO-Related Vulnerability

One common security mistake is to not consider PATH_INFO information when performing a security check against a filename. If the dynamic code constructs its notion of the SCRIPT_NAME in a way that includes PATH_INFO or a query string, the integrity of that filename can be violated. Here's a real-world example of a security check that went wrong:

    // Flawed check: getRequestURI() also includes any PATH_INFO the client appended.
    if (!request.getRequestURI().endsWith("_proc.jsp")) {
        // Not recognized as a _proc.jsp page: end the session and go to the login page.
        session.invalidate();
        weblogic.servlet.security.ServletAuthentication.logout(request);
        RequestDispatcher rd = application.getRequestDispatcher(
            "/sanitized/login.jsp");
        rd.forward(request, response);
    } else {
        ... Actual page content ...
    }

In this code, the request.getRequestURI() function is used to get the filename of the currently running program, and then the code attempts to check that the filename ends in _proc.jsp. The problem is that the equivalent of SCRIPT_NAME should have been checked; it's retrieved with getServletPath(). The getRequestURI() function is similar, except it includes any PATH_INFO that's present. Therefore, an attacker can avoid the endsWith() security check by appending extraneous PATH_INFO ending in _proc.jsp.
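The difference between the two checks can be shown without a servlet container. In the sketch below, plain strings stand in for the values that getRequestURI() and getServletPath() would return; the class name, method names, and the request for /app/report.jsp/x_proc.jsp are hypothetical, and the assumed split of that request into a servlet path of /report.jsp plus PATH_INFO of /x_proc.jsp depends on how the container maps JSP pages.

```java
final class ProcPageCheck {
    /** Mirrors the flawed check: tests the whole request URI, PATH_INFO included. */
    static boolean flawedIsProcPage(String requestUri) {
        return requestUri.endsWith("_proc.jsp");
    }

    /** Tests only the servlet path, the servlet equivalent of SCRIPT_NAME. */
    static boolean saferIsProcPage(String servletPath) {
        return servletPath.endsWith("_proc.jsp");
    }

    public static void main(String[] args) {
        // Hand-written stand-ins for what the servlet API would report for
        // an attacker's request to /app/report.jsp/x_proc.jsp.
        String requestUri = "/app/report.jsp/x_proc.jsp";
        String servletPath = "/report.jsp";

        // Prints true: the appended PATH_INFO satisfies the flawed check.
        System.out.println(flawedIsProcPage(requestUri));
        // Prints false: the page actually being run is not a _proc.jsp page.
        System.out.println(saferIsProcPage(servletPath));
    }
}
```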