HTTP

HTTP is the network protocol that all Web transactions use under the hood. The next section summarizes the high points, but interested readers should check out RFC 2616 (www.ietf.org) or find a good Web inspection proxy tool and start studying traffic.

Overview

HTTP is a straightforward request and response protocol, in which every request the client sends to the server is reciprocated with a single response. These requests are performed over TCP connections. In contemporary versions of HTTP, a single TCP connection is typically reused for multiple requests to the same server, but historically, each Web request caused the creation of an entirely new TCP connection. Here's an example of a simple HTTP request:

GET /testing/test.html HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-gsarcade-launch, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)
Host: test.testing.com:1234
Connection: Keep-Alive

HTTP requests are composed of a header and an optional body. A blank line (a line containing nothing but a carriage return/line feed, or CRLF) separates the header and the body. The preceding request doesn't have a body, so the blank line is simply the end of the request.

The first line of an HTTP request is composed of a method, a URI path, and an HTTP protocol version. The method tells the server what type of request it is. The preceding request has a GET method, which tells the server to retrieve (get) the requested resource. The URI path tells the server which resource the client is requesting. The preceding request asks for the resource located at /testing/test.html on the server. The protocol version specifies the version of HTTP the client is using. In the preceding request, the client is using version HTTP/1.1.

The rest of the lines in the request header share the same general format: a field name followed by a colon, and then a field value. The preceding request includes the following request header fields:

Accept: This header field tells the server which kinds of media (such as an image or application) are acceptable for the response and their order of preference.
Accept-Language: This header field tells the server which languages the client accepts and prefers, which in the preceding request is U.S. English.
Accept-Encoding: This header field tells the server which content encodings, such as compression schemes, it can apply to the response body if necessary.
User-Agent: This header field tells the server what software versions the client is using for its Web browser and operating system. You can see that the preceding request was made from Internet Explorer 6.0 (MSIE 6.0) on a Windows XP machine (Windows NT 5.1) with the .NET 1.0 and 1.1 runtimes installed (.NET CLR 1.0.3705; .NET CLR 1.1.4322).
Host: This header field tells the Web server which host the request is for, which is useful if multiple Web sites are hosted on the same machine (called virtual hosts). You can see that the request was for the machine named test.testing.com, and the client is talking to the server on port 1234.
Connection: This header field gives the server options that are specific to the connection. In the preceding request, the client's Keep-Alive value tells the server not to close the connection after it answers the request. This way, the client can reuse the TCP connection to issue another request.
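
The request layout described above can be sketched in a few lines of Python. This is only an illustration of the format, not a robust parser: the function name parse_request is hypothetical, and the sketch assumes a well-formed request with CRLF line endings and no folded header lines.

```python
def parse_request(raw: bytes):
    """Split a raw HTTP request into request line parts, header fields, and body."""
    # The first blank line (CRLF CRLF) separates the header from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Request line: method, URI path, protocol version.
    method, path, version = lines[0].split(" ", 2)

    # Remaining lines: field name, colon, field value.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers, body


raw = (
    b"GET /testing/test.html HTTP/1.1\r\n"
    b"Accept-Language: en-us\r\n"
    b"Host: test.testing.com:1234\r\n"
    b"Connection: Keep-Alive\r\n"
    b"\r\n"
)
method, path, version, headers, body = parse_request(raw)
print(method, path, version)
print(headers["host"])
```

Field names are lowercased here because HTTP header names are case-insensitive; the empty body reflects the fact that this GET request ends at the blank line.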

Now look at the response to this query:

HTTP/1.1 404 Not Found
Date: Fri, 20 Aug 2006 01:58:14 GMT
Server: Apache/1.3.28 (Unix) PHP/4.3.0
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1

d3
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>404 Not Found</TITLE>
</HEAD><BODY>
<H1>Not Found</H1>
The requested URL /testing/test.html was not found on this server.<P>
</BODY></HTML>

0

HTTP responses are similar to HTTP requests. The response has a header and a body, and the response header is set up so that the first line has a special format. The rest of the response header lines share the field name, colon, and field value format.

The first line of the HTTP response header is composed of the HTTP protocol version, the response code, and the response reason phrase. The protocol version is the same as in the request: HTTP/1.1. The response code is a numeric status code that tells the client the result of the request. In the preceding response, it's 404, which is probably familiar to you. If it isn't, the response reason phrase gives a short text description of the status code, which is "Not Found" in this response.
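
As a small illustration of the status line format just described, the following Python sketch splits the line into its three parts. The function name parse_status_line is hypothetical, and the sketch assumes a well-formed line.

```python
def parse_status_line(line: str):
    """Return (protocol version, numeric status code, reason phrase)."""
    # Split on the first two spaces only; the reason phrase may contain spaces.
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


status = parse_status_line("HTTP/1.1 404 Not Found")
print(status)
```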

The rest of the response header lines provide information to the client:

Date: This field tells the client when the server generated the response.
Server: This field gives the client information about the Web server software. You can see that the Web server is running Apache 1.3.28 on some kind of UNIX machine.
Keep-Alive and Connection: These fields give the client information about the connection and how long it will be held open.
Transfer-Encoding: This field tells the client the mechanism the server uses to transmit the body of the response. This server elected to use the chunked method of encoding.
Content-Type: This field tells the client the media type and character set of the response, which is a plain HTML document.

The response body in the example is encoded with the chunked encoding method, which is made up of a series of chunks. Each chunk has a line specifying its length in hexadecimal and the corresponding data. In the preceding response, d3 specifies 211 bytes of data in the first chunk. The 0 at the end indicates the end of the chunked data. You can see that in the response, which is plain HTML, the server gives an error message to go along with the error code 404.
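
The chunk layout described above can be illustrated with a short Python sketch. It is a simplified decoder, assuming well-formed input with CRLF delimiters and no trailer fields; the function name decode_chunked and the sample data are made up for the illustration.

```python
def decode_chunked(data: bytes) -> bytes:
    """Reassemble a chunked body made of '<hex length>CRLF<data>CRLF' chunks."""
    body = b""
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        # The size line is hexadecimal; ignore any chunk extension after ';'.
        size_text = data[pos:eol].split(b";", 1)[0].decode("ascii").strip()
        size = int(size_text, 16)
        if size == 0:
            # A zero-length chunk marks the end of the chunked data.
            return body
        start = eol + 2
        body += data[start:start + size]
        # Skip the chunk data and the CRLF that follows it.
        pos = start + size + 2


encoded = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
decoded = decode_chunked(encoded)
first_chunk_length = int("d3", 16)  # the length line from the response above
print(decoded, first_chunk_length)
```

When auditing a server or proxy that decodes chunked data itself, the conversion and use of the length value is the part that deserves the closest look.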

Versions

Three versions of HTTP are currently in use: 0.9, 1.0, and 1.1. An HTTP version 0.9 request looks like this:

GET /

This request retrieves the root document. It's about as straightforward as it can get and can be used for quick manual testing. A minimal HTTP version 1.0 request looks like this:

GET / HTTP/1.0

This request is similar to the request shown in the previous section. Note that a blank line (a second CRLF) signifies the end of the HTTP request header and, therefore, the end of the HTTP request. If you're entering requests by hand, HTTP/1.0 is easiest to use because it's simpler than HTTP/1.1. Here's a minimal HTTP/1.1 request:

GET / HTTP/1.1
Host: test.com

This request is nearly identical to the minimal HTTP/1.0 request, except it requires the client to provide a Host header in the request.
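
For manual testing, the three minimal requests can be generated with a small helper like the following Python sketch. The function name minimal_request is hypothetical; the output is just the raw bytes you would otherwise type into a TCP connection.

```python
def minimal_request(version: str, host: str = "") -> bytes:
    """Build the smallest GET request for the root document in a given HTTP version."""
    if version == "0.9":
        # HTTP/0.9 has no version token and no header fields.
        return b"GET /\r\n"
    lines = ["GET / HTTP/" + version]
    if version == "1.1":
        # HTTP/1.1 requires the client to supply a Host header.
        lines.append("Host: " + host)
    # The trailing blank line (a second CRLF) ends the request header.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


for version in ("0.9", "1.0", "1.1"):
    print(minimal_request(version, "test.com"))
```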

Headers

HTTP headers provide descriptive information (metadata) about the HTTP connection. They are used in negotiating an HTTP connection and establishing the connection's properties after successful negotiation. HTTP supports a variety of headers that fall into one of four basic categories:

Request: Headers in the initial request
Response: Headers in the server response
General: Headers that can be in a request or response
Entity: Headers that apply to a specific entity in the request or response

The remainder of this chapter refers to a number of HTTP headers, so Table 17-1 lists them for easy reference.

Table 17-1. Request and Response Header Fields
Header	Type	Description
Accept	Request	Lists media (MIME) types the client will accept
Accept-Charset	Request	Lists character encodings the client will accept
Accept-Encoding	Request	Lists content encodings the client will accept, such as compression mechanisms
Accept-Language	Request	Lists languages the client will accept
Accept-Ranges	Response	Server indicates it supports range requests
Age	Response	Freshness of the requested URI
Allow	Entity	Lists HTTP methods allowed for the requested URI
Allowed	Response	Deprecated: lists allowed request methods
Authorization	Request	Presents credentials for HTTP authentication
Cache-Control	General	Specifies caching directives for the request or response
Charge-To	Request	Deprecated: billing information
Connection	General	Allows the client to specify connection options
Content-Encoding	Entity	Identifies additional encoding of the entity body, such as compression
Content-Transfer-Encoding	Response	Deprecated: MIME transfer encoding
Content-Language	Entity	Identifies the language of the entity body
Content-Length	Entity	Identifies the length (in bytes) of the entity body
Content-Location	Entity	Supplies the correct location for the entity if known and not available at the requested URI
Content-MD5	Entity	Supplies an MD5 digest of the entity body
Content-Range	Entity	Lists the byte range of a partial entity body
Content-Type	Entity	Specifies the media (MIME) type of the entity
Cost	Response	Deprecated: cost of requested URI
Date	General	Date and time of the message
Derived-From	Response	Deprecated: previous version of requested URI
ETag	Response	Entity tag used for caching purposes
Expect	Request	Lists server behaviors required by the client
Expires	Entity	Date and time after which the entity is considered stale
From	Request	E-mail address of the requester
Host	Request	Host name and port number of the requested URI
If-Match	Request	Used to make request conditional based on entity tags
If-Modified-Since	Request	Used to make request conditional based on HTTP date
If-None-Match	Request	Used to make request conditional based on entity tags
If-Range	Request	Used to make a range request conditional based on entity tags
If-Unmodified-Since	Request	Used to make request conditional based on HTTP date
Last-Modified	Entity	Identifies the time the entity was last modified
Location	Response	Supplies an alternate location for the requested URI
Max-Forwards	Request	Mechanism for limiting the number of gateways in a TRACE or OPTIONS request
Message-Id	Response	Deprecated: globally unique message identifier
Pragma	General	Used for implementation-specific headers
Proxy-Authenticate	Response	Identifies that a proxy requires authentication
Proxy-Authorization	Request	Presents credentials for HTTP proxy authentication
Public	Response	Deprecated: lists publicly accessible methods
Range	Request	Identifies a specific range of bytes needed from the requested URI
Referer	Request	Client-provided URI responsible for initiating the request
Retry-After	Response	Indicates how long a service is expected to be unavailable
Server	Response	Server identification string
TE	Request	Lists transfer encodings accepted by the client for a chunked transfer
Trailer	General	Indicates header fields present in the trailer of a chunked message
Transfer-Encoding	General	Identifies the encoding applied to the message
Upgrade	General	Identifies additional protocols supported by the client
URI	Response	Deprecated: superseded by Location header field
User-Agent	Request	Contains general information about the client
Vary	Response	Provided by the server to determine cache freshness
Version	Response	Deprecated: version of requested URI
Via	General	Used by gateways and proxies to identify intermediate hosts
Warning	General	Provides additional message status information
WWW-Authenticate	Response	Initiates the HTTP authentication challenge required by a server
WWW-Title	Response	Deprecated: document title
WWW-Link	Response	Deprecated: external document reference

Methods

HTTP supports many methods, especially considering vendor extensions to the protocol. The three most important are GET, HEAD, and POST. GET is the most common method used by a client to retrieve a resource. HEAD is identical to GET, except it tells the server not to return the actual document contents. In other words, it tells the server to return only the response headers. POST is used to submit a block of data to a specified resource on the server. The difference between GET and POST is related to how developers use HTML forms and parameters (covered in "Parameters and Forms" later in this chapter). The following sections describe some less common methods.

DELETE and PUT

The DELETE and PUT methods allow files to be removed from and added to a Web server. Historically, these two methods have seen little use in real sites; further, they have been associated with a number of vulnerabilities and are usually disabled. The notable exception is using these methods as a component of complete WebDAV support.

TEXTSEARCH and SPACEJUMP

The TEXTSEARCH and SPACEJUMP requests aren't true methods (both are issued with GET), nor were they ever officially added to the HTTP specification. However, they were proposed as methods, and the functionality they describe is supported in modern Web servers. To briefly see how they work, start by looking at the TEXTSEARCH request:

GET /customers?John+Doe HTTP/1.0

This request uses the ? character to terminate the resource path and contains a URL-encoded search string. This string causes the server to run a file at the supplied location and pass the decoded search string as a command line. Anyone familiar with common path traversal attacks should recognize this request type immediately. It's the form of request commonly used to pass parameters to an executable file via the query string, which makes it useful in exploiting a path traversal vulnerability. In all truth, this use might be the only remaining one for this request type.

The following SPACEJUMP request represents another legacy request type:

GET /map/1.1+2.7 HTTP/1.0

This request is designed for handling server-side image maps. It provides the coordinates of a clicked point in an object. As server-side image mapping has disappeared, so has the SPACEJUMP request. It's interesting to note, however, that this request type has also been associated with a number of vulnerabilities. The classic handler for this request (on both Apache and IIS servers) is the htimage program, which has been the source of a number of high-risk vulnerabilities, ranging from data disclosure to stack buffer overflows.

OPTIONS and TRACE

The OPTIONS and TRACE methods provide information about a server. The OPTIONS request simply lists all methods the server accepts. This information is not particularly sensitive, although it does give a potential attacker details about the system. Further, this method is useful only for servers that support extended functionality, such as WebDAV.

The HTTP TRACE method is quite simple, although its implications are interesting. This method simply echoes the received request back to the client in the response body, ostensibly for testing purposes. Of course, the capability to have a Web site present arbitrary content can present some interesting possibilities for vulnerabilities, discussed in "Cross-Site Scripting" later in this chapter.

CONNECT

The HTTP CONNECT method provides a way for proxies to establish Secure Sockets Layer (SSL) connections with other servers. It's a reasonable method for use in proxies but is usually dangerous on application servers.

WebDAV Methods

Web Distributed Authoring and Versioning (WebDAV) is a set of methods and associated protocols for managing files over HTTP connections. It makes use of the standard GET, PUT, and DELETE methods for basic file access. WebDAV adds a number of methods for other file-management tasks, described in Table 17-2.

Table 17-2. WebDAV Methods
Method	Description
`COPY`	Copies a resource from one URI to another
`MOVE`	Moves a resource from one URI to another
`LOCK`	Locks a resource for shared or exclusive use
`UNLOCK`	Removes a lock from a resource
`PROPFIND`	Retrieves properties from a resource
`PROPPATCH`	Modifies multiple properties atomically
`MKCOL`	Creates a directory (collection)
`SEARCH`	Initiates a server-side search

Fortunately, most Web applications do not (and certainly should not) expose WebDAV functionality directly. However, you should keep a few points in mind when you encounter WebDAV systems. First, WebDAV uses HTTP as a transport protocol and uses the same basic security mechanisms of SSL and HTTP authentication, so the coverage of these standards also applies to WebDAV. Second, the specification for WebDAV access control is only in draft form and not widely implemented at the time of this writing, so access control capabilities can vary widely between products.

Parameters and Forms

A Web client transmits parameters (user-supplied input and variables) to a Web application through HTTP in three main ways, explained in the following sections.

Embedded Path Information

A URI path can contain embedded parameters as part of the path components. This embedded path information can be handled by server-based filtering such as path rewriting rules, which remap the received path and place the information into request variables. Path information may also be handled through the PATH_INFO environment variable common to most Web application platforms. The PATH_INFO variable contains additional components appended to a URI resource path. For example, say you have a dynamic Web application at /webapp, and a user submitted the following request:

GET /webapp/blah/blah/blah HTTP/1.1
Host: test.com

The Web server calls the program or request handler corresponding to /webapp and indicates that extra information was passed through the appropriate mechanism. If the program gets information through CGI variables, the CGI program would see something like this:

PATH_INFO=/blah/blah/blah
SCRIPT_NAME=/webapp

If the program is a Java servlet and calls request.getServletPath(), it receives /webapp. However, if the program calls request.getRequestURI(), it receives /webapp/blah/blah/blah.

Auditing Tip

If you see code performing actions or checks based on the request URI, make sure the developer is handling the path information correctly. Many servlet programmers use request.getRequestURI() when they intend to use request.getServletPath(), which can definitely have security consequences. Be sure to look for checks done on file extensions, as supplying unexpected path information can circumvent these checks as well.
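
To make the tip concrete, here is a Python sketch that models the problem outside of any servlet container. Everything in it is hypothetical: split_path_info imitates how a server separates the handler path from the extra path information, and is_image_request stands in for an application check that exempts static images from some control.

```python
def split_path_info(request_uri: str, servlet_path: str):
    """Return (servlet path, extra path information) for a request URI."""
    if request_uri == servlet_path or request_uri.startswith(servlet_path + "/"):
        return servlet_path, request_uri[len(servlet_path):]
    raise ValueError("request URI is not under the servlet path")


def is_image_request(path: str) -> bool:
    # Hypothetical check: treat anything ending in an image extension as static.
    return path.lower().endswith((".gif", ".jpg", ".png"))


request_uri = "/webapp/blah.gif"
servlet_path, path_info = split_path_info(request_uri, "/webapp")

# The same request gives different answers depending on which value is checked.
check_on_request_uri = is_image_request(request_uri)
check_on_servlet_path = is_image_request(servlet_path)
print(path_info, check_on_request_uri, check_on_servlet_path)
```

The request is still handled by /webapp, yet a check performed on the full request URI concludes that the client asked for an image, because the attacker controls the trailing path information.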

GET and Query Strings

The second mechanism for transmitting parameters to a Web application is the query string. It's the component of a request URI that follows the question mark character (?). For example, if the http://test.com/webapp?arg1=hi&arg2=jimbo URI is entered into a browser, the browser connects to the test.com server and submits a request similar to the following:

GET /webapp?arg1=hi&arg2=jimbo HTTP/1.1
Host: test.com

This is the query string in the preceding request:

arg1=hi&arg2=jimbo

Most dynamic Web technologies parse this query string into two separate variables: arg1 with a value of hi and arg2 with a value of jimbo. The & character is used to separate the arguments, and the = character separates the argument name from the argument value.
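
As an illustration of this parsing, Python's standard urllib.parse module performs the same split. This is only one platform's behavior; other Web technologies may differ in details such as how repeated or empty parameters are handled.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://test.com/webapp?arg1=hi&arg2=jimbo"

# The query string is everything after the ? character.
query = urlsplit(url).query

# & separates the arguments, and = separates each name from its value.
params = dict(parse_qsl(query))
print(query)
print(params)
```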

The other possible form for a query string is the one mentioned for the TEXTSEARCH request. If the query string doesn't contain an = character, the Web server assumes the query is an indexed query, and the search words, which are separated by + characters, are passed to the program as command-line arguments. For example, the following request runs the CGI program mycgi.pl with the arguments hi and jimbo:

GET /mycgi.pl?hi+jimbo HTTP/1.1
Host: test.com
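
The decision a server makes for indexed queries can be sketched as follows. The sketch assumes the CGI convention (RFC 3875) in which search words are separated by + characters and then URL-decoded; the function name indexed_query_args is hypothetical, and real servers differ in how they escape the resulting words before building a command line.

```python
from urllib.parse import unquote


def indexed_query_args(query: str):
    """Return the command-line words for an indexed query, or None otherwise."""
    # A query containing = is treated as ordinary name=value parameters.
    if not query or "=" in query:
        return None
    # Search words are separated by + and URL-decoded individually.
    return [unquote(word) for word in query.split("+")]


print(indexed_query_args("John+Doe"))
print(indexed_query_args("arg1=hi&arg2=jimbo"))
```

Because the decoded words end up on a command line, any shell metacharacters that survive decoding are worth tracing through the server's escaping logic.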

HTML Forms

Before you look at the third common way of transmitting parameters, take a look at HTML forms. Forms are an HTML construct that enables application designers to construct Web pages that request user input and then relay it back to the server. A basic HTML form has an action, a method, and variables. The action is a URI that corresponds to the resource handling the filled-out form. The method is GET or POST, and it determines which method the client uses to transmit the filled-out form. The variables are the actual content of the form, and designers can use a few basic types of variables. Here's a brief example of a form:

<form method="GET" action="http://test.com/transfer.php">
Source Account: <select name="source">
<option selected value="42424242">42424242</option>
<option value="82345678">82345678</option>
</select><br>
Destination Account: <select name="dest">
<option selected value="12345678">12345678</option>
<option value="82345678">82345678</option>
</select><br>
Amount: <input type="text" name="value"><br>
<input type="Submit" value="Transfer Money"><br>
</form>

Figure 17-1 shows what this simple form would look like rendered in a client's browser. This form uses the GET method, and the results are submitted to the transfer.php page. There are drop-down list boxes for the source account and destination account and a simple text input field for the transfer amount. The last input is the submit button, which allows users to initiate the transmission of the form contents.

Figure 17-1. Simple form

When users submit this form, their browsers connect to test.com and issue a request similar to the following:

GET /transfer.php?source=42424242&dest=12345678&value=123 HTTP/1.1
Host: test.com

In this request, you can see that the variables in the form have been turned into a query string. The source, dest, and value parameters are transmitted to the server and submitted via the GET method.
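
The browser's translation of form variables into a query string can be approximated with Python's urllib.parse.urlencode, as in this sketch. The field values are the ones from the example form; the dictionary simply stands in for whatever the user selected and typed.

```python
from urllib.parse import urlencode

# Form variables as the user filled them in.
fields = {"source": "42424242", "dest": "12345678", "value": "123"}

# Each variable becomes name=value, joined with & characters.
query = urlencode(fields)
request_line = "GET /transfer.php?" + query + " HTTP/1.1"
print(request_line)
```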

POST and Content Body

The third mechanism for transmitting parameters to a Web application is the POST method. In this method, the user's data is transferred by using the body of the HTTP request instead of embedding the data in the URI as the GET method does. Assume you changed the preceding form to use a POST method instead of a GET method by changing this line:

<form method="GET" action="http://test.com/transfer.php">

To this:

<form method="POST" action="http://test.com/transfer.php">

When users submit this form, a request from the Web browser similar to the following is issued:

POST /transfer.php HTTP/1.0
Content-Type: application/x-www-form-urlencoded
Content-Length: 39

source=42424242&dest=12345678&value=123

You can see that the parameters are encoded in a similar fashion to the GET request, but they are now in the request's content body.
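
A POST request like this one can be assembled by hand, as the following Python sketch shows. The helper name build_post is hypothetical. Note that the Content-Length value is computed from the encoded body rather than typed in, since it must match the number of bytes that follow the blank line.

```python
from urllib.parse import urlencode


def build_post(path: str, fields: dict) -> bytes:
    """Build a form-encoded HTTP/1.0 POST request as raw bytes."""
    body = urlencode(fields).encode("ascii")
    head = (
        "POST " + path + " HTTP/1.0\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        "Content-Length: " + str(len(body)) + "\r\n"
        "\r\n"  # blank line separating the header from the content body
    ).encode("ascii")
    return head + body


request = build_post(
    "/transfer.php",
    {"source": "42424242", "dest": "12345678", "value": "123"},
)
head, _, body = request.partition(b"\r\n\r\n")
print(request.decode("ascii"))
```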

Parameter Encoding

Parameters are encoded by using guidelines outlined in RFC 2396, which defines the general URI syntax. This encoding is necessary whether they are sent via the GET method in a query string or the POST method in the content body. Nonalphanumeric ASCII characters are generally encoded, as are the bytes of non-ASCII data such as multibyte and Unicode characters. This encoding is described in Chapter 8, "Strings and Metacharacters," but we will briefly recap it here.

The URL encoding scheme is % hex hex, with a percent character starting the escape sequence, followed by a hexadecimal representation of the required byte value. For example, the character = has the value 61 in the ASCII character set, which is 0x3d in hexadecimal. Therefore, an equal sign can be encoded by using the sequence %3d. So you can set the testvar variable to the string jim=42 with the following encoded string:

testvar=jim%3d42
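
The same escape can be produced and reversed with Python's urllib.parse module, shown here as a quick sketch. Note that quote emits uppercase hexadecimal digits (%3D), while decoders accept either case.

```python
from urllib.parse import quote, parse_qsl

# The byte value of = in hexadecimal.
equals_hex = hex(ord("="))

# Encode the value so the = inside it is not mistaken for a separator.
encoded = "testvar=" + quote("jim=42", safe="")

# Decoding the string from the text recovers the original name and value.
decoded = parse_qsl("testvar=jim%3d42")
print(equals_hex, encoded, decoded)
```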

GET Versus POST

Although you've learned the technical details of GET and POST, you haven't seen the difference between them in a real-world sense. Here are the essential tradeoffs:

GET requests have more limitations than POST requests. The Web server typically limits the query string to a certain number of characters. This limitation is usually between 1024 and 8192 characters and is tied to the maximum size request header line the Web server accepts. POST requests can effectively be any length, although the Web server might limit them to a reasonable threshold (or crash because of numeric overflow vulnerabilities).
GET requests are easier to create, as you can specify them via hyperlinks without having to create an HTML form. POST requests, on the other hand, require creating an HTML form or scripted events, which might have display characteristics that Web designers want to avoid.
GET requests are less secure because they are likely to be logged in Web proxy logs, browser histories, and Web server logs. Usually, security-sensitive information shouldn't be transmitted in GET requests because of this logging.
GET requests also expose application logic to end users by placing variables in the Web browser's address bar, which just tempts users to manipulate them.
The Referer request header tells the server the URI of the page the client just came from. So if the query string used to generate a page contains sensitive variables, and users click a link on that page that takes them to another server, those sensitive variables are transferred to the third-party server in the Referer header.
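
The Referer exposure in the last point is easy to demonstrate. The following Python sketch plays the role of the third-party server and pulls sensitive-looking parameters out of a received Referer value. The parameter names in SENSITIVE, the view.php page, and the function name leaked_params are all hypothetical choices for the illustration.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical list of parameter names worth flagging during a review.
SENSITIVE = {"password", "session", "token", "account"}


def leaked_params(referer: str):
    """Return sensitive-looking parameter names visible in a Referer URI."""
    query = urlsplit(referer).query
    return sorted(
        name for name, _ in parse_qsl(query) if name.lower() in SENSITIVE
    )


referer = "http://test.com/view.php?session=abc123&page=2"
print(leaked_params(referer))
```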