Introduction to the HTTP Protocol

The Hypertext Transfer Protocol (HTTP) is the backbone of the World Wide Web. Developed in the early 1990s to support information exchange at CERN in Switzerland, HTTP is a very simple protocol, with no mandatory authentication and only a few possible client commands. It's optimized for the lightweight serving of small (several kilobytes at most) text files, in keeping with the original intent of HTTP as a means of disseminating interlinked informational pages written in the Hypertext Markup Language (HTML), the then newly developed markup language understood by web browsers.

Although originally not intended for this purpose, HTTP is now used for the transfer of large binary files, often including many inline images requested at once from an HTML page as it renders. This kind of data transfer was seldom done in the early days of the Web (GIF and JPEG image support in 1994 was sporadic across various platforms). In response to this change, certain augmentations have been made to the HTTP specification. The most important of these is the HTTP/1.1 standard, which provides features such as pipelining (sending several requests over a single connection without waiting for each response to arrive), improved caching, partial requests (for resuming interrupted downloads), and compression. HTTP/1.1 is used by most popular browsers today, but HTTP/1.0 is a useful fallback that all browsers understand.

Unlike SMTP, FTP, and many other popular protocols, HTTP is stateless, meaning that there isn't a concept of a "session" in which a client connects to a server, performs several transactions, and then ends the connection. HTTP/1.0 by default permits only a single request per connection (HTTP/1.1 can keep a connection open for several requests, but the server still handles each request independently), and therefore it's really not possible to determine how many users are connected to a web server at any given time (except for downloads that are currently in progress). This also means there aren't any of the topological issues associated with protocols such as SMTP: relaying, MX records, queues, and so on aren't relevant to HTTP. Instead, the problems that an HTTP server administrator faces have mostly to do with bandwidth, CPU, and memory resources, and with using them efficiently as concurrent activity grows with the popularity of the website.

HTTP Request Structure

The structure of an HTTP/1.0 request is about as simple as it can get. You can simulate an HTTP transaction by connecting to port 80 of an HTTP server and issuing a GET request (which can contain multiple lines) in the following form:

    # telnet www.example.com 80
    Connected to www.example.com.
    Escape character is '^]'.
    GET / HTTP/1.0

    HTTP/1.1 200 OK
    Date: Thu, 12 Jan 2006 18:38:23 GMT
    Server: Apache/1.3.33 (Unix) DAV/1.0.3 PHP/4.4.0
    Content-Location: index.html
    Vary: negotiate,accept-language,accept-charset
    TCN: choice
    Last-Modified: Thu, 07 Aug 2003 20:11:54 GMT
    ETag: "52ab66-3c-43c6a251"
    Accept-Ranges: bytes
    Content-Length: 60
    Connection: close
    Content-Type: text/html
    Content-Language: en

    <HTML>
    <TITLE>test page</TITLE>
    <BODY>
    test
    </BODY>
    </HTML>

You could instead issue a HEAD request of the same form, to retrieve the headers only, not the message body. You enter a blank line to indicate the end of the multiline request (press Enter twice).
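
If you would rather script the exchange than type it into telnet, the request can be assembled programmatically. The following Python sketch is only an illustration: the function names build_request and send_request are made up for this example, and it assumes the server closes the connection when the response is complete, as an HTTP/1.0 server normally does.

```python
import socket
from typing import Dict, Optional


def build_request(method: str, path: str, version: str = "HTTP/1.0",
                  headers: Optional[Dict[str, str]] = None) -> bytes:
    """Return the raw bytes of a simple HTTP request (hypothetical helper)."""
    lines = [f"{method} {path} {version}"]          # the request line comes first
    for name, value in (headers or {}).items():     # optional extra lines
        lines.append(f"{name}: {value}")
    # Every line ends with CRLF; the extra CRLF is the blank line
    # that marks the end of the request.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def send_request(host: str, request: bytes, port: int = 80) -> bytes:
    """Send the request over TCP and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:        # an empty read means the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


get_request = build_request("GET", "/")
head_request = build_request("HEAD", "/")
```

Calling send_request("www.example.com", head_request) would then return only the response headers, just as a HEAD request typed by hand does. That call is left out of the listing so that the listing runs without network access.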

The response from the server is divided into two sections: the header and the body, separated by a blank line. The response might contain a subset of the headers shown here, or others not shown, depending on the type of document requested and the configuration of the server. Each HTTP server puts its own signature on the Server: line. This signature tells you the name of the server software and the platform that it was built for; bear in mind, though, that even though the example says "(Unix)", this doesn't tell you which operating system the server is actually running, which could be anything from Linux to Solaris or Mac OS X. The rest of the lines, especially the Content-*: lines, contain information that helps the web browser lay out the page. For instance, Content-Length:, when present, allows the browser to report how much data is left in the download, and Content-Type: tells the browser how to render the requested file (as HTML, plain text, GIF or JPEG image data, and so on).
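
To make the header/body split concrete, here is a minimal Python sketch that takes a raw response apart. The sample response is an abbreviated, made-up stand-in for the one in the telnet session, and parse_response is a hypothetical helper rather than part of any library; real clients handle many more cases, such as repeated headers.

```python
from typing import Dict, Tuple

# Abbreviated sample response; the body is illustrative only.
SAMPLE_BODY = b"<HTML><BODY>test</BODY></HTML>\n"
SAMPLE_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Apache/1.3.33 (Unix) DAV/1.0.3 PHP/4.4.0\r\n"
    b"Content-Length: " + str(len(SAMPLE_BODY)).encode("ascii") + b"\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
) + SAMPLE_BODY


def parse_response(raw: bytes) -> Tuple[int, str, Dict[str, str], bytes]:
    """Split a raw response into status code, reason phrase, headers, and body."""
    # The first blank line (CRLF CRLF) separates the header from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()   # header names are case-insensitive
    return int(code), reason, headers, body


code, reason, headers, body = parse_response(SAMPLE_RESPONSE)
remaining = int(headers["content-length"]) - len(body)  # bytes still to download
```

Here remaining works out to zero because the whole body has already arrived; a browser performs the same subtraction to drive its progress display.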

HTTP/1.0 allows a number of extra lines to be included in the request, including lines specifying cookies, accepted encodings, preferred languages, and so on. Aside from the request line (the GET command in this case), which must appear first, the order of the rest of the lines doesn't matter. These additional lines are optional; only one line (the request line itself) is required. An HTTP/1.1 request is almost the same as an HTTP/1.0 request, except that a second line is also required: the Host: line. This is an addition to the protocol intended to support virtual hosting, where a single web server can answer for many different hostnames, which means that the client has to specify the hostname whose web content it wants to see. Because a web browser looks up the server's IP address from the hostname the user specifies and then makes the HTTP connection based on the IP address (which is how TCP/IP applications operate, as you saw in Chapter 22, "Principles of TCP/IP Networking"), the server knows nothing about what hostname the user is trying to reach unless the client supplies the Host: header, as shown here:

    # telnet www.example.com 80
    Connected to www.example.com.
    Escape character is '^]'.
    GET / HTTP/1.1
    Host: www.frankspage.com

All major browsers today, including text-only browsers such as Lynx, support HTTP/1.1-style Host: headers. (However, whether they formulate their requests to claim HTTP/1.1 support is inconsistent. Netscape Navigator, for instance, supports many HTTP/1.1 features, but it issues its requests as HTTP/1.0 anyway.) This means that virtual hosting based on the Host: header (rather than by IP address and network-level IP aliases) is now almost exclusively the method of choice, greatly simplifying matters. We'll talk more about virtual hosting later in this chapter.
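
The way a server uses the Host: header to choose among several sites can be sketched in a few lines of Python. This is not how Apache implements it; the hostname-to-directory table and the directory paths below are invented for illustration, and select_document_root is a hypothetical helper.

```python
# Hypothetical mapping of hostnames to document roots (paths are made up).
VIRTUAL_HOSTS = {
    "www.example.com": "/usr/local/www/example",
    "www.frankspage.com": "/usr/local/www/frankspage",
}
DEFAULT_ROOT = "/usr/local/www/default"   # used when no Host: header matches


def select_document_root(request_text: str) -> str:
    """Pick a document root based on the Host: header of a raw request."""
    lines = request_text.split("\r\n")
    for line in lines[1:]:                # skip the request line itself
        if not line:                      # blank line: end of the request
            break
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip().lower().split(":")[0]   # drop any :port suffix
            return VIRTUAL_HOSTS.get(host, DEFAULT_ROOT)
    return DEFAULT_ROOT                   # HTTP/1.0 request without Host:


with_host = select_document_root("GET / HTTP/1.1\r\nHost: www.frankspage.com\r\n\r\n")
without_host = select_document_root("GET / HTTP/1.0\r\n\r\n")
```

The second call shows why the header matters: without it, the server has only the IP address to go on and has to fall back to a default site.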

Response Codes and Redirects

Although there are only a few request methods (GET, HEAD, and POST, plus several more for HTTP/1.1), the server can return a wide variety of responses. These responses are three-digit numeric codes, grouped by meaning according to their first digit. Table 26.1 shows the most common HTTP response codes and what they mean, particularly in the context of Apache and its features.

Table 26.1. HTTP Response Codes
Numeric Code	Name	Meaning
2XX	Success
200	OK	Standard success code.
201	Created
202	Accepted
203	Non-Authoritative Information
204	No Content
3XX	Redirection
300	Multiple Choices	`MultiViews` or `CheckSpelling` found multiple matches.
301	Moved Permanently	Trailing slash was omitted.
302	Moved Temporarily	Redirect found.
304	Not Modified	Cached copy is okay to use.
4XX	Client Error
400	Bad Request
401	Unauthorized	Must authenticate to continue.
403	Forbidden	Server permissions or configuration do not permit access.
404	Not Found	File does not exist.
5XX	Server Error
500	Internal Server Error	Server-side (CGI) program failed (generic error).
501	Not Implemented
502	Bad Gateway
503	Service Unavailable	Resources to process the request are not available.
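
Because the first digit carries the general meaning, a program can classify a response without knowing every individual code. The short Python sketch below illustrates the idea; status_class is a hypothetical helper, and the 1XX class (Informational, introduced with HTTP/1.1) is included even though the table doesn't list any of its codes.

```python
STATUS_CLASSES = {
    1: "Informational",   # HTTP/1.1 only; not listed in Table 26.1
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Return the general meaning of a response code from its first digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


examples = {code: status_class(code) for code in (200, 304, 404, 503)}
```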

You're probably quite familiar with 404 and 403 errors, and 500 errors will be familiar to you if you've ever done any CGI programming. However, one important but little-understood code is 304. This code is never seen by a user because it's intended purely for a browser's use; nonetheless, it's one of the most commonly used codes by a production server, as you would see if you were to look through the access log (/var/log/httpd-access.log).

When a client has to make a request for a file that it already has in its cache (such as an inline GIF image in an HTML page), it performs a GET request with the If-Modified-Since field set to the date and time the image was last downloaded. This causes the server to evaluate whether the file has been changed on the server since that time. If it has, it sends the file (with a 200 success code); if it hasn't, it returns a 304 (Not Modified) code, telling the browser that it's okay to display the copy that it has in its cache, thereby saving the trouble of serving the file all over again.
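
The server's side of this decision can be sketched as follows. This is a simplified illustration in Python, not Apache's actual logic: conditional_status is a hypothetical helper, and it assumes that a missing or unparsable If-Modified-Since value is simply ignored so that the file is sent normally.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime,
                       if_modified_since: Optional[str]) -> int:
    """Return 200 if the file must be sent, or 304 if the cached copy is still good."""
    if if_modified_since is None:
        return 200                          # ordinary, unconditional GET
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                          # unparsable date: ignore the header
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)   # HTTP dates are always GMT
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    if last_modified.replace(microsecond=0) > since:
        return 200                          # changed since the client's copy
    return 304                              # Not Modified


# Modification time taken from the Last-Modified: line in the earlier example.
file_mtime = datetime(2003, 8, 7, 20, 11, 54, tzinfo=timezone.utc)
unchanged = conditional_status(file_mtime, "Thu, 07 Aug 2003 20:11:54 GMT")
changed = conditional_status(file_mtime, "Wed, 06 Aug 2003 20:11:54 GMT")
```

In the first call the client's copy is as new as the file on the server, so the result is 304; in the second the file has changed since the client's date, so the result is 200 and the file would be sent again.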

Another code frequently seen by browsers but not people is 301 (Moved Permanently). This code most often occurs when someone requests a URL of the type http://some.host.com/Subdirectory, where Subdirectory is the name of a directory on the server. The correct form of the URL that accesses the index of that directory is http://some.host.com/Subdirectory/, with a trailing slash. Notice, however, that if you enter the URL without the trailing slash, you'll still get the page, but the browser attaches the slash for you. This is because it received a 301 code for the first request, redirecting it to the same URL, reconstructed from the server's hostname and the request path, with the slash appended. The URL in your browser was updated, the browser made a second request, and the correct page was served. For this to work seamlessly, the server needs to know exactly what its hostname is; this is the purpose of the ServerName directive in Apache, which we will discuss a little later in this chapter.
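
The redirect described here amounts to rebuilding the URL from the server's own name and the request path. The Python sketch below shows the idea under simplifying assumptions: the set of known directories stands in for a real filesystem check, and both function names are hypothetical.

```python
from typing import Optional, Set


def trailing_slash_redirect(server_name: str, path: str,
                            directories: Set[str]) -> Optional[str]:
    """Return the corrected URL if path names a directory but lacks the trailing slash."""
    if path in directories and not path.endswith("/"):
        # Rebuilt from the server's configured hostname, not from what the user typed.
        return f"http://{server_name}{path}/"
    return None


def redirect_response(location: str) -> str:
    """Build a minimal 301 response pointing the browser at the corrected URL."""
    return ("HTTP/1.1 301 Moved Permanently\r\n"
            f"Location: {location}\r\n"
            "\r\n")


known_directories = {"/Subdirectory"}       # stand-in for checking the filesystem
target = trailing_slash_redirect("some.host.com", "/Subdirectory", known_directories)
response = redirect_response(target) if target else None
```

If server_name were wrong, the Location: line would send the browser to a host that might not exist, which is why the server's idea of its own hostname has to be accurate.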

More information about HTTP, its structure, response codes, and much more can be found at the World Wide Web Consortium (W3C) website, http://www.w3.org/Protocols. The original HTTP/1.0 specification is laid out in RFC 1945, and HTTP/1.1 in RFC 2068 (since superseded by RFC 2616).
