Section 7.3. Using HTTP | Building Scalable Web Sites: Building, Scaling, and Optimizing the Next Generation of Web Applications

7.3. Using HTTP

The most familiar way of interacting with a remote service, as far as web developers are concerned, is using HTTP. HTTP is a widely agreed-upon and well-understood protocol, with support in most languages; it's the backbone of the Web. In use since 1990, the protocol was formalized as HTTP/1.0 in RFC 1945 in May 1996 and has been updated once, to HTTP/1.1, in RFC 2068 in January 1997 (itself revised by RFC 2616 in June 1999). As Internet standards go, it's fairly set in stone and is easy to implement against in its most basic forms.

As if powering the Web were not enough, HTTP is also the protocol underlying a number of other services. Services based on XML-RPC, SOAP, and REST all use HTTP as their base transport. Atom, the newcomer to the publishing protocols, is also built on top of HTTP. HTTP is ideal for use as a transport layer in higher-level protocols because it completely defines a mechanism for requesting and returning resources. The main benefit of using HTTP isn't necessarily the protocol itself, but rather that it's been around for a long time and a lot of client and server code already exists. We can build services on top of it without having to do any lower-level implementation. This should appeal to our laziness.

Since we're typically servicing the parent request via HTTP, using HTTP for child requests fits our connectionless, stateless model very well. We don't need to worry about expensive communication link setup and teardown for each request, as the protocol was designed with connect-request-disconnect semantics.

7.3.1. The HTTP Request and Response Cycle

HTTP consists of two fundamental parts: request and response. Both of these parts consist further of headers and a body, just as with email. Headers consist of zero or more lines of text, in the following format:

 field-name: field-value

 Keep-Alive: 300
 Last-Modified: Thu, 06 Oct 2005 23:55:19 GMT

After the headers comes a single carriage return and linefeed, which delimits the headers from the body. Like email, the content of the body depends on the media type specified in the headers. HTTP allows multipart media types, as with email, so that a single request or response can contain multiple bodies.
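To make the header and body layout concrete, here is a minimal PHP sketch that splits a raw message at the first blank line and collects the header fields into an array. The function name parse_http_message() is our own invention for illustration, and the sketch assumes well-formed CRLF line endings with no folded (multi-line) header values.

```php
<?php
// Hypothetical helper: split a raw HTTP message into its leading line,
// a map of headers, and the body.
function parse_http_message(string $raw): array
{
    // The first blank line (CRLF CRLF) separates the headers from the body.
    $parts = explode("\r\n\r\n", $raw, 2);
    $head = $parts[0];
    $body = $parts[1] ?? '';

    $lines = explode("\r\n", $head);
    $leading = array_shift($lines); // request line or status line

    $headers = [];
    foreach ($lines as $line) {
        // Split on the first colon only; values such as dates contain colons.
        $pair = explode(':', $line, 2);
        if (count($pair) !== 2) {
            continue; // skip malformed lines in this sketch
        }
        // Field names are case-insensitive, so normalize them to lowercase.
        $headers[strtolower(trim($pair[0]))] = trim($pair[1]);
    }

    return ['leading' => $leading, 'headers' => $headers, 'body' => $body];
}

$message = parse_http_message(
    "HTTP/1.1 200 OK\r\n" .
    "Content-Type: text/plain; charset=UTF-8\r\n" .
    "Content-Length: 11\r\n" .
    "\r\n" .
    "hello world"
);
echo $message['headers']['content-length'], "\n"; // prints 11
```

Storing the field names in lowercase is a design choice of this sketch; it lets later code look up a header without caring how the sender capitalized it.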

In addition to the email-esque headers and body, HTTP requests and responses also have a single leading header of a different format. For the HTTP request, it takes this form:

 Method Request-URI HTTP-Version

 GET /test.html HTTP/1.0
 POST /foo.php HTTP/1.1

This header specifies the resource (path) the request is for, the action (verb) to perform on it, and the version of the protocol being used. The most common verbs are GET (for fetching a resource) and POST (for updating a resource), although PUT (for creating a resource), DELETE (for deleting a resource), and HEAD (for checking a resource exists) are common in some applications.
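As a rough illustration of the three-part request line, the following PHP sketch pulls out the method, the request URI, and the protocol version. The helper name parse_request_line() is hypothetical, and the pattern is deliberately strict: it assumes single spaces between the parts and an uppercase method.

```php
<?php
// Hypothetical helper: break a request line into method, URI, and version.
function parse_request_line(string $line): ?array
{
    // Three space-separated tokens: Method Request-URI HTTP-Version
    $pattern = '#^([A-Z]+) (\S+) HTTP/(\d+\.\d+)$#';
    if (!preg_match($pattern, rtrim($line, "\r\n"), $m)) {
        return null; // not a well-formed request line
    }
    return ['method' => $m[1], 'uri' => $m[2], 'version' => $m[3]];
}

print_r(parse_request_line('POST /foo.php HTTP/1.1'));
```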

The leading header for the response takes a different form:

 HTTP-Version Status-Code Reason-Phrase

 HTTP/1.0 404 File not found
 HTTP/1.1 200 OK

This header specifies the status code of the response using some predefined values and a textual description of the status. The status code is expressed using three digits, with the first digit representing the category of the response. 1xx codes are for informational messages, 2xx for successful requests, 3xx for redirections, 4xx for client errors, and 5xx for server errors.
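The status line and its category digit can be handled in a few lines of PHP. This is only a sketch: parse_status_line() and status_category() are names we made up for the example, and the category labels are our own shorthand for the five classes described above.

```php
<?php
// Hypothetical helper: break a status line into version, code, and reason.
function parse_status_line(string $line): ?array
{
    // HTTP-Version, a three-digit Status-Code, then an optional Reason-Phrase.
    $pattern = '#^HTTP/(\d+\.\d+) (\d{3})(?: (.*))?$#';
    if (!preg_match($pattern, rtrim($line, "\r\n"), $m)) {
        return null; // not a well-formed status line
    }
    return ['version' => $m[1], 'code' => (int) $m[2], 'reason' => $m[3] ?? ''];
}

// Hypothetical helper: map the first digit of a status code to its category.
function status_category(int $code): string
{
    $categories = [
        1 => 'informational',
        2 => 'success',
        3 => 'redirection',
        4 => 'client error',
        5 => 'server error',
    ];
    return $categories[intdiv($code, 100)] ?? 'unknown';
}

$status = parse_status_line('HTTP/1.0 404 File not found');
echo status_category($status['code']), "\n"; // prints "client error"
```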

Some of the more important response codes are listed below:

200 OK: The request was successful.
301 Moved Permanently / 302 Moved Temporarily: The requested resource has been moved. The new location for the resource is sent in the response using the Location header.
304 Not Modified: The requested resource has not changed since the time it was last requested (the client specifies the time of the last request using an If-Modified-Since header).
401 Unauthorized: The client was not authorized to view the requested resource. Authentication instructions, in the form of a WWW-Authenticate header, should be included in the response. The user agent should then prompt the user for authorization credentials and resend the request. If this response is returned when making a request with credentials attached, then it indicates that the credentials were invalid.
403 Forbidden: The client was not authorized to view the requested resource. Unlike a 401, no instructions for authenticating the call are sent.
404 Not Found: The requested resource could not be found. This usually means that it can't be found on disk, but nothing about the protocol requires that resource URLs map to actual files.
500 Internal Server Error: The server encountered an unexpected condition and was unable to fulfill the request.
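
One way a simple client might react to these codes is sketched below in PHP. The function action_for_response() and the action strings it returns are purely illustrative, and the sketch assumes the headers arrive as an array keyed by lowercase field name.

```php
<?php
// Hypothetical dispatcher: map a response to the action a simple client
// might take. $headers is assumed to be keyed by lowercase field name.
function action_for_response(int $code, array $headers): string
{
    switch ($code) {
        case 200:
            return 'use body';
        case 301:
        case 302:
            // The new location arrives in the Location header.
            return isset($headers['location'])
                ? 'follow ' . $headers['location']
                : 'fail';
        case 304:
            // Nothing changed since our If-Modified-Since time.
            return 'use cached copy';
        case 401:
            // The challenge arrives in the WWW-Authenticate header.
            return isset($headers['www-authenticate'])
                ? 'send credentials'
                : 'fail';
        default:
            // 403, 404, 500, and anything this sketch does not know about.
            return 'fail';
    }
}

echo action_for_response(302, ['location' => '/new.html']), "\n"; // prints "follow /new.html"
```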

A typical HTTP request and response cycle might look like this. First, the request is sent from the client to the server:

 GET /hello.txt HTTP/1.1
 Host: test.com
 User-Agent: Flock/0.4

Here we requested a resource called /hello.txt from the server test.com using HTTP version 1.1. Our User-Agent string, which tells the server what client we're using, is Flock/0.4.

The server receives the request, parses it, and performs its magic. When it's ready, it spits back an HTTP response:

 HTTP/1.1 200 OK
 Date: Thu, 06 Oct 2005 23:56:01 GMT
 Last-Modified: Thu, 06 Oct 2005 23:55:19 GMT
 Server: Apache/2.0.52
 Connection: close
 Content-Length: 11
 Content-Type: text/plain; charset=UTF-8

 hello world

The request was a success (status code 200) and returned 11 bytes of UTF-8 text. We are told the date and time according to the server as well as the time it thinks the resource we requested was last modified. The server also tells us what software it's running, using the Server header. The Connection header tells us that the server will close the connection when it's done, rather than keeping it open for the next request (known as a keep-alive).

7.3.2. HTTP Authentication

HTTP also has authentication baked into the protocol. HTTP Basic Auth was described along with HTTP itself in RFC 1945 and allows a client to send authentication details using the Authorization header in a request. When a client makes an unauthenticated request for a resource that requires authentication, the server returns one or more WWW-Authenticate headers, with the following format:

  "WWW-Authenticate:" auth-scheme realm ("," auth-param)*

The auth-scheme specifies the scheme to use for authentication. Following the scheme are one or more comma-separated auth-param items, although the first must be the realm parameter. Each parameter takes the format token=value. The realm token specifies the realm under which the requested resource falls. A client should send the same authentication credentials for all resources under the same canonical URL within the same realm.
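A challenge in this format can be picked apart with a couple of regular expressions. The PHP sketch below is an approximation rather than a full parser: parse_challenge() is a hypothetical name, the sample realm values are made up, and escaped quotes inside quoted values are not handled.

```php
<?php
// Hypothetical helper: parse the value of a WWW-Authenticate header into
// its scheme and a map of parameters.
function parse_challenge(string $value): ?array
{
    // The scheme is the first token; the parameters follow it.
    if (!preg_match('/^(\S+)\s+(.*)$/s', trim($value), $m)) {
        return null;
    }

    $params = [];
    // Each parameter is token=value, where the value may be a quoted string.
    $pattern = '/([A-Za-z0-9_-]+)\s*=\s*(?:"([^"]*)"|([^\s,]+))/';
    preg_match_all($pattern, $m[2], $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        // Group 2 holds a quoted value, group 3 an unquoted one.
        $value = ($pair[2] !== '') ? $pair[2] : ($pair[3] ?? '');
        $params[strtolower($pair[1])] = $value;
    }

    return ['scheme' => $m[1], 'params' => $params];
}

$challenge = parse_challenge('Basic realm="Secret Area"');
echo $challenge['scheme'], ' / ', $challenge['params']['realm'], "\n"; // prints "Basic / Secret Area"
```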

The HTTP 1.0 specification defines a single authentication method called Basic. HTTP Digest authentication was described in RFC 2069, but we won't cover that here. Basic authentication credentials are passed along using an Authorization header in the following format:

 "Authorization:" "Basic" basic-cookie

The terminology here seems designed to confuse the casual observer: the basic cookie is not in fact a cookie in the usual HTTP sense, but just a chunk of authentication information. The format for the cookie is defined as follows:

 base64( userid ":" password )

To access a resource using the username "foo" and the password "bar", we need to calculate the base64-encoded version of "foo:bar". This is "Zm9vOmJhcg==", so our full request header is as follows:

 Authorization: Basic Zm9vOmJhcg==

The obvious drawback to this approach is that our password is effectively sent in the clear, since decoding base64 is trivial. For this reason, digest authentication was invented and is recommended over using basic. Other HTTP-based protocols, such as Atom, use authentication mechanisms outside of the WWW-Authenticate and Authorization headers.
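Both directions are one-liners in PHP, which shows just how little protection the encoding offers. In this sketch, basic_auth_header() and decode_basic_cookie() are names of our own choosing; base64_encode() and base64_decode() are PHP built-ins.

```php
<?php
// Hypothetical helper: build the Authorization header for Basic auth.
function basic_auth_header(string $user, string $password): string
{
    return 'Authorization: Basic ' . base64_encode($user . ':' . $password);
}

// Hypothetical helper: recover the user ID and password from a basic cookie.
function decode_basic_cookie(string $cookie): array
{
    // Split on the first colon only, since the password may contain colons.
    return explode(':', base64_decode($cookie), 2);
}

echo basic_auth_header('foo', 'bar'), "\n"; // prints "Authorization: Basic Zm9vOmJhcg=="
print_r(decode_basic_cookie('Zm9vOmJhcg=='));
```

Anyone who can observe the request can run the second function, so basic authentication is only as private as the connection carrying it.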

7.3.3. Making an HTTP Request

Once we have a working knowledge of the protocol, writing some code to request a resource using HTTP is pretty trivial. In PHP, we can just open a socket using the fsockopen( ) function:

 $request = "GET /hello.txt HTTP/1.1\r\nHost: test.com\r\n\r\n";

 $sock = fsockopen('test.com', 80);
 fwrite($sock, $request);

 $response = '';
 while (!feof($sock)) {
   $response .= fgets($sock);
 }

 fclose($sock);

This works well for the very simple case, but there are a number of issues with performing HTTP calls by hand. First, we need to think about the connection-related error conditions. What if the remote server isn't listening on port 80? What if the remote server is completely unreachable? What if it takes a long time to respond to your SYN request? What if the connection opens, but the remote server doesn't respond or takes a long time to start responding?
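Some of these connection-level cases can be covered with the timeout arguments that PHP already provides. The sketch below is one possible approach, not a complete answer: open_http_socket() and read_line_or_null() are hypothetical wrappers, and the timeout values are left for the caller to choose.

```php
<?php
// Hypothetical wrapper around fsockopen() with explicit timeouts.
// Returns an open socket, or null if the connection could not be made.
function open_http_socket(string $host, int $port, float $connectTimeout, int $readTimeout)
{
    // The fifth argument limits how long the connection attempt may take.
    $sock = @fsockopen($host, $port, $errno, $errstr, $connectTimeout);
    if ($sock === false) {
        // Covers a closed port, an unreachable host, and a slow handshake.
        return null;
    }
    // Limit how long each subsequent read may block.
    stream_set_timeout($sock, $readTimeout);
    return $sock;
}

// Hypothetical wrapper around fgets() that reports a stalled server as null.
function read_line_or_null($sock): ?string
{
    $line = fgets($sock);
    $meta = stream_get_meta_data($sock);
    if ($line === false || $meta['timed_out']) {
        return null; // connection closed or the read timed out
    }
    return $line;
}
```

A caller would check for null after each call and give up on the request, for example `$sock = open_http_socket('test.com', 80, 2.0, 5);` followed by a loop over read_line_or_null().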

Once we've dealt with all of these, we need to worry about the protocol-level errors. What if the remote server responds with half a response? What if the response is garbage? What if the response length doesn't match the Content-Length header? What if the connection is kept in keep-alive mode and left open? Your code will need to detect the end of the first response correctly.

As if these issues weren't enough to worry about, there are also protocol-level features we might have to contend with. What if the server responds with a 301 or 302 relocation status? What if the request requires basic or digest authentication? What if the server responds using chunked encoding? What if you need to request several pages in sequence and accept cookies from the first to pass to the subsequent?
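Chunked encoding is a good example of how much work hides behind one of these questions. In a chunked body, each chunk is preceded by its length in hexadecimal on a line of its own, and a zero-length chunk marks the end. The PHP sketch below decodes such a body under simplifying assumptions: decode_chunked() is a hypothetical name, the whole body is already in memory, and any trailer headers after the last chunk are ignored.

```php
<?php
// Hypothetical helper: decode a body sent with Transfer-Encoding: chunked.
// Returns the decoded body, or null if the input is truncated or malformed.
function decode_chunked(string $body): ?string
{
    $decoded = '';
    $offset = 0;
    while (true) {
        $lineEnd = strpos($body, "\r\n", $offset);
        if ($lineEnd === false) {
            return null; // ran out of data before the final chunk
        }
        // The size line is hex digits, optionally followed by ";extensions".
        $sizeLine = substr($body, $offset, $lineEnd - $offset);
        $hex = trim(explode(';', $sizeLine, 2)[0]);
        if (!preg_match('/^[0-9a-fA-F]+$/', $hex)) {
            return null; // not a valid chunk size
        }
        $size = (int) hexdec($hex);
        $offset = $lineEnd + 2;
        if ($size === 0) {
            return $decoded; // a zero-length chunk ends the body
        }
        if (strlen($body) < $offset + $size + 2) {
            return null; // chunk data is shorter than advertised
        }
        $decoded .= substr($body, $offset, $size);
        $offset += $size + 2; // skip the chunk data and its trailing CRLF
    }
}

echo decode_chunked("5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"), "\n"; // prints "hello world"
```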

Once we have a successful response, how do we find out the character set and convert it to UTF-8? How do we encode UTF-8 data in the request URL or body? If we're fetching back HTML, how do we verify that it's in the expected character set: do we trust the HTTP headers or the http-equiv meta tags?
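For the first of these questions, one possible approach is to read the charset parameter from the Content-Type header and convert with PHP's iconv() function. The sketch below makes several assumptions that should be stated loudly: the helper names are hypothetical, it trusts the HTTP header and never looks at meta tags, and the ISO-8859-1 fallback for a missing charset is our choice for the example.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type value.
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null; // no charset parameter present
}

// Hypothetical helper: return the body as UTF-8, or null if it can't be converted.
function body_to_utf8(string $body, string $contentType, string $fallback = 'ISO-8859-1'): ?string
{
    // Assumption: fall back to ISO-8859-1 when the header names no charset.
    $charset = charset_from_content_type($contentType) ?? $fallback;
    if ($charset === 'UTF-8') {
        // Don't take the header's word for it; the /u modifier makes
        // preg_match() fail on byte sequences that are not valid UTF-8.
        return preg_match('//u', $body) === 1 ? $body : null;
    }
    $converted = @iconv($charset, 'UTF-8', $body);
    return $converted === false ? null : $converted;
}

echo charset_from_content_type('text/plain; charset=UTF-8'), "\n"; // prints "UTF-8"
```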

As we add support for each new case, the code starts to become large and unwieldy. Of course, we're not the first people to use HTTP as a protocol, so we should be able to build on the work of others. Since this is a fairly common task, there's good support in all major languages. We'll look at a generic solution and a couple of language-specific implementations.

libcurl and the curl command-line application are part of the multiplatform cURL open source URL file transfer library. cURL allows you to programmatically request documents via HTTP, HTTPS, FTP, GOPHER, and a bunch of more esoteric protocols. cURL has already resolved all of the issues identified above, up to the point of passing the completed response back to us. It's been tried and tested, contains a full regression suite, and knows more about HTTP than we do. It sounds like something we'd want to use.

PHP has support for cURL through an optional module called curl. Perl has support for cURL through the WWW::Curl module that contains XS bindings to libcurl. If compiling C code is outside of your scope, then you can try using cURL by simply forking a process and executing the curl command-line application.
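With the PHP curl module installed, a GET request with timeouts and redirect handling might be sketched as follows. The wrapper name curl_get(), its default timeout values, and the limit of five redirects are assumptions made for this example; the curl_* functions and CURLOPT_* constants come from the module itself.

```php
<?php
// Hypothetical wrapper: fetch a URL using PHP's curl module.
// Returns the status code and body, or null if the transfer failed.
function curl_get(string $url, int $connectTimeout = 5, int $totalTimeout = 10): ?array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow 301 and 302 responses
        CURLOPT_MAXREDIRS      => 5,               // but not forever
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $totalTimeout,   // seconds allowed for the whole transfer
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error($ch) holds a description of what went wrong.
        return null;
    }

    return [
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'body'   => $body,
    ];
}
```

A call such as `curl_get('http://test.com/hello.txt')` would then return the final status code and body, with chunked encoding and redirects already dealt with by libcurl.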

If you want a little tighter control over your protocol, then you can use a good source-based library for your requests. For PHP, PEAR contains an HTTP_Request class that provides all the features you'll probably need. To make a simple GET request, we can just use the following code:

 include_once("HTTP/Request.php");

 // $url and the two timeout values (in seconds) are set elsewhere
 $req = new HTTP_Request($url, array(
   'timeout'     => $http_connection_timeout,
   'readTimeout' => array($http_io_timeout, 0),
 ));

 $req->sendRequest();

 $status_code = $req->getResponseCode();
 $headers = $req->getResponseHeader();
 $body = $req->getResponseBody();

For the Perl users, the equivalent is the LWP set of modules. To perform a simple GET request, you'll need to use the following code:

 use LWP::UserAgent;

 my $agent = LWP::UserAgent->new;
 $agent->timeout($http_timeout);

 my $request = HTTP::Request->new(GET => $url);
 my $response = $agent->request($request);

 my $status_code = $response->code();
 my $headers = $response->headers();
 my $body = $response->decoded_content();