CGI

The Common Gateway Interface (CGI) standard specifies how a normal, run-of-the-mill executable interacts with a Web server to create dynamic Web content. It lays out how the two programs can use the features of their runtime environment to communicate everything necessary about a HTTP request and response. Specifically, the CGI program takes input about the HTTP request through its environment variables, its command line, and its standard input, and it returns all its HTTP response instructions and data over its standard output.

It's unlikely you'll need to review the security of a straightforward CGI application, as it's been obsolete as a dynamic Web programming technique for at least a decade. However, modern Web technology borrows so much from the CGI interface, both implicitly and explicitly, that it's worthwhile to cover the technical nuances that are still around today. The following sections focus on the artifacts that are still causing security headaches for Web developers.

Indexed Queries

In the CGI model, most of the information about the incoming HTTP request is placed in the CGI program's environment variables. They are covered in detail in the next section, but they will probably seem familiar to you, with names such as QUERY_STRING and SERVER_NAME. Most people are aware that the CGI program's standard input (stdin) is used to send the body of the HTTP request, which is generally referred to as "POSTing data." CGI uses its standard output to communicate its HTTP response to the Web server.

Next, look at the command-line arguments. You've probably assumed that the GET query string parameters are passed over the command line. It turns out, however, that this assumption is almost entirely wrong. The query string is always in the QUERY_STRING environment variable, but it's almost never passed over the command line. This contention probably seems flat wrong to anyone who has witnessed the efficacy of URLs such as the following:

GET /scripts/..%c1%c1../winnt/system32/cmd.exe?/c+dir+c:\

This Unicode attack works because it inadvertently initiates an antiquated form of HTTP request called an "indexed query." Indexed queries are old: They predate HTML forms and today's GET and POST methods. (At one point, they were almost added to the HTTP specification as the TEXTSEARCH query, but they never made it into the final draft.) Before HTML had input boxes and buttons, you could place only a search box on your Web site by using the <ISINDEX> tag on your page. It causes a single input text box to be placed on your site, and still works if you want to see it in action. If a user enters data in the box and presses Enter, the Web browser issues an indexed query to the page. As an example, entering the string "jump car cake door" causes the browser to send the following query:

GET /name/of/the/page.exe?jump+car+cake+door

The Web server interprets this indexed query by running page.exe with an argument array argv[] of {"page.exe", "jump", "car", "cake", "door"}. The original string delimiter was the addition sign, not the ampersand, but other than that, it's close to the query string mechanism used today.

So when a contemporary Web server sees a request with a query string, it checks to see whether it's an indexed query. If the query string contains an unescaped equal sign (=), the Web server decides it's a normal GET query string request, puts the query string in the QUERY_STRING environment variable, and doesn't pass any command-line arguments to the CGI program.

If the Web server sees a query string without an equal sign, it assumes it's an indexed query. It still places the entire query string in QUERY_STRING, but it also sets up command-line arguments for running the CGI program.

Environment Variables

Most of the information about a Web request is communicated through environment variables in the CGI model. It's important to have a grasp of these variables because they have been carried through into most new Web technology. In fact, a few subtly confusing variables inherited from the CGI interface still trip up new developers.

Some variables are straightforward pieces of data that are copied straight out of the client's HTTP request, and the Web server fills out other variables to explain its runtime environment and configuration. Finally, some variables contain analysis and interpretation of the request. The Web server performs analysis and processing of the request to reach the point where it decides it should call a CGI program. Some of this analysis is passed on to the CGI, and it's usually these variables that cause problems because of their nuanced nature.

Static Variables

Start with the variables that stay the same across multiple requests:

GATEWAY_INTERFACE This variable tells the CGI program what version of the CGI interface the Web server is using, such as CGI/1.1.
SERVER_SOFTWARE This variable is the name and version of the Web server managing the CGI gatewayfor example, Apache 1.32.3.

Straightforward Request Variables

These variables vary depending on the HTTP request, but they are fairly straightforward in how they get their information and what they mean:

REMOTE_ADDR This variable is the IP address of the machine sending the request to the Web server. It's often the IP address of a load-balancer or proxy appliance, if these devices are in use.
REMOTE_HOST This variable is the fully qualified domain name of the host sending the request to the server, if it's available. Again, it isn't always the real client's hostname; it could refer to a proxy server.
REMOTE_IDENT If the Web server queries the IDENT server on the client and gets a response indicating the client's username, that name is placed in this variable.
CONTENT_LENGTH This variable contains the number of bytes the Web server is going to send over stdin to the CGI program. It's the size of the content data of the HTTP requestfor example, 10000, meaning 10,000 bytes.
CONTENT_TYPE This variable is the media type of the request body data sent over stdin, such as application/x-www-form-urlencoded. If the server can't figure it out from the request, it can omit it.
AUTH_TYPE This variable tells the CGI which type of HTTP authentication the user requested, if any. The Web server parses this valueBasic, for examplefrom the Authorization header field.
REMOTE_USER If the user authenticates with HTTP authentication, this variable is the username. Otherwise, it's undefined.
REQUEST_METHOD This variable is the HTTP method the client used, such as GET, POST, or TRACE.

Parroted Request Variables

For every HTTP request line the Web server sees, it translates it into an appropriate environment variable name and passes it on to the application. For example, an HTTP request header contains the following User-Agent tag:

User-Agent: AwesomeWebBrowser/1.5

The CGI engine converts the variable name to all uppercase letters. It then converts any hyphen characters into underscores, and finally adds HTTP_ to the beginning of all automatically converted request header fields. So you end up with the environment variable HTTP_USER_AGENT set to the value AwesomeWebBrowser/1.5.

The Web server puts a few request header fields, such as Content-Length and Content-Type, into the core environment variables, so it doesn't need to convert those request header fields and duplicate the information. Also, CGI engines shouldn't translate a few request header fields for security reasons, such as the base64 authorization data users provide. This makes sense; if the Web server is handling authentication and verification of credentials, there's no reason to expose usernames and passwords to the CGI script as well.

Synthesized Request Variables

As the Web server processes a request, it creates more subtle variables. Originally, the CGI system was designed around a straightforward file tree model that assumes a URI refers to a file existing on the file system. This assumption is often untrue in modern applications, as the web server may perform number of path mappings before determining the final URI. In many cases, the server must synthesize the final URI, along with variables and state information that match the CGI programs requirements.

When run, the CGI program is told it's being called on behalf of a particular URI, called the script URI. It might be the same URI the client requested, or it could be a completely arbitrary fabrication of the Web server. Either way, all the information provided in separate environment variables should appear to refer to a single initial query from the user. These synthesized request variables are described in the following list:

SERVER_NAME This variable is simply the hostname of the Web server. It's listed under synthesized request variables because certain valid requests include a hostname from the client. A fully expressed URL includes the hostname in a GET statement, and the virtual hosting support of most Web servers allows the client to provide a hostname in the request header. So a Web server has some latitude in constructing what CGI sees as the server's hostname.
Inge Henriksen, an independent security researcher, discovered that Internet Information Services (IIS) 4, 5, and 6 are malleable in this fashion, and he came up with several situations in which SERVER_NAME is trusted as being immutable (archived at http://secunia.com/advisories/16548/). The attack is simply to change a request like the following:
```
  GET /test.asp HTTP/1.0
```
To this request:
```
  GET http://localhost/test.asp HTTP/1.0
```
ISS trusts the supplied hostname as a reasonable specification of a virtual host, and then certain code that checks to make sure SERVER_NAME is localhost ends up being defeated.
SERVER_PORT This variable is the TCP port on which the request came in. This value should be fairly immutable, too, but it might be influenced by attackers somehow. It's unlikely, however.
SERVER_PROTOCOL This variable specifies the protocol used when the request is submitted by the client. It's usually something like "HTTP/1.1," corresponding roughly to the protocol specified on the first line of an HTTP request.
PATH_INFO This variable refers to a lesser-known technique used to pass arguments to CGI scripts and other dynamically executed code. Say you have a program named compute.exe in your Web tree in the directory /scripts. If someone issues this Web request:
```
  GET /scripts/compute.exe HTTP/1.0
```
it calls your compute.exe program just as you would expect. Here's the request with some PATH_INFO added:
```
  GET /scripts/compute.exe/compute_slow/output_blue HTTP/1.0
```
This might not be what you'd expect, but the Web server still runs compute.exe. The algorithm that Web servers use stops at the first such solid match when interpreting a pathname. Everything past the matched name is considered additional arguments to the program, called PATH_INFO. So the string /compute_slow/output_blue is provided to compute.exe in the environment variable PATH_INFO.
PATH_TRANSLATED If you think the implicit default support for PATH_INFO in Web servers is odd, you'll wonder what underground lab PATH_TRANSLATED crawled out of. To get the value for PATH_TRANSLATED, the Web server starts by interpreting the PATH_INFO component of the query as a pathname, assuming it's relative to the document root. It then converts that pathname from a virtual Web tree path to an actual path in the underlying file system. It's not immediately obvious why someone would do all this, which makes it even more amazing that it's one of a select few default behaviors of Web servers.
This processing comes in useful, however, if you want to use a CGI program as a wrapper or filter to other files or content. Say you have a popular Web page in your Web tree in /cake.html, and you wrote a program that converts files from English to French. You could place the French program on your Web site in the root as well.
If users go to www.cakestories.com/French/cake.html, they end up running the French program with a PATH_INFO of /cake.html. So PATH_TRANSLATED takes /cake.html and figures out the physical drive path corresponding to that file. When French runs, its PATH_TRANSLATED environment variable is set to something like /home/jim/jenny/website/htdocs/cake.html. The French program can open that file directly with file system API calls, do its magic, and display the results.
PATH_TRANSLATED can be used to make wrapper-type programs as well, assuming you have the support of the Web server. A program based on PATH_TRANSLATED simply opens the file in that environment variable, assuming it's called with that filename. With a little sleight of hand performed by the Web server, the French program doesn't need to be in the Web tree or in the immediate file path.
QUERY_STRING This variable is what it sounds like, which is probably a relief after the previous two environment variables. It's everything in the requested URI past the question mark. For example, say you have a program at /convert.exe, and this request is sent:
```
  GET /convert.exe?query HTTP/1.0
```
The QUERY_STRING variable is set to query, even though it's an indexed query. It's always set to the query string if there is one. Now consider this request:
```
  GET /convert.exe/allthestuff/pathinfoisfun?queryingis/also/fun
```
The PATH_INFO in this request is /allthestuff/pathinfoisfun, and the QUERY_STRING is queryingis/also/fun. The query string is simply everything after the question mark in the URI.
SCRIPT_NAME This variable is a Web path that can be used to identify the CGI that's running. It should not overlap with PATH_INFO or QUERY_STRING, and you should be able to concatenate all three variables to assemble the script URI the CGI program is processing. SCRIPT_NAME has to be a URL a script can use to refer to itself when talking to the Web server.

Path Confusion

If you think about the exposed functions in the CGI specification, there isn't a lot to help developers who want to know where their application resides in the Web tree and the file system. The odd thing is that the environment variable names sound as though they have a logical purpose toward this end. Most people assume PATH_INFO is the path to the directory where the script resides. They assume PATH_TRANSLATED is simply that pathname mapped to the physical file system. However these variables don't behave even remotely as their names imply. What's amusing is that sometimes developer's get lucky by virtue of circumstance, and their code works well enough to get by even though it uses the variables incorrectly.

So CGI path handling provides a historic interface that's quite inconsistent, solves the wrong problems, and is prone to being misunderstood and used incorrectly. Naturally, it has been propagated to every Web technology in some form or another as a universal interface. The following sections explain how some common environment variables have been incorporated into modern Web environments, focusing on PATH_INFO, PATH_TRANSLATED, QUERY_STRING, and SCRIPT_NAME, because they are the most important or baffling. Table 18-1 summarizes these variables.

Table 18-1. Common Web Environment Variables
Language	Interface
PATH_INFO: additional path argument information
CGI and Perl	Environment variable `PATH_INFO`
PHP	`$_SERVER['PATH_INFO']`
ASP and ASP.NET	`Request.ServerVariables("PATH_INFO")`
Java and JSP	`Request.getPathInfo()`
PATH_TRANSLATED: a filename mapped to the real file system
CGI and Perl	Environment variable `PATH_TRANSLATED`
PHP	`$_SERVER['PATH_TRANSLATED']`
ASP and ASP.NET	`Request.ServerVariables("PATH_TRANSLATED")`
Java and JSP	`Request.getPathTranslated()`
QUERY_STRING: everything to the right of the `?`
CGI and Perl	Environment variable `QUERY_STRING`
PHP	`$_SERVER['QUERY_STRING']`, among others
ASP and ASP.NET	`Request.ServerVariables("QUERY_STRING")`, among others
Java and JSP	`Request.getQueryString()`
SCRIPT_NAME: virtual path to the running URI
CGI and Perl	Environment variable `SCRIPT_NAME`
PHP	`$_SERVER['SCRIPT_NAME']`
ASP and ASP.NET	`Request.ServerVariables("SCRIPT_NAME")`
Java and JSP	`Request.getServletPath()`

Example of a PATH_INFO-Related Vulnerability

One common security mistake is to not consider PATH_INFO information when performing a security check against a filename. If the dynamic code constructs its notion of the SCRIPT_NAME in a way that includes PATH_INFO or a query string, the integrity of that filename can be violated. Here's a real-world example of a security check that went wrong:

   if (!request.getRequestURI().endsWith("_proc.jsp")){       session.invalidate();       weblogic.servlet.security. ServletAuthentication.logout(request);       RequestDispatcher rd = application.getRequestDispatcher(          "/sanitized/login.jsp");       rd.forward(request, response);    }else{ ... Actual page content ... }

In this code, the request.getRequestURI() function is used to get the filename of the currently running program, and then the code attempts to check that it's indeed a JSP file. The problem is that the equivalent of SCRIPT_NAME should have been checked; it's retrieved with getServletPath(). The getrequestURI() function is similar, except it includes any PATH_INFO that's present. Therefore, an attacker can avoid the bolded security check by appending extraneous PATH_INFO ending in _proc.jsp.