| only for RuBoard - do not distribute or recompile |
This section is directed to content providers: I want to convince you to engineer a cache-friendly web site. If you stick with me through the motivation section, I'll give you some practical advice and even show you how to implement many of the tips on the Apache server.
Why should you, as a content provider, care about web caching? For at least the following three reasons:
When people access your web site, pages will load faster.
Caches isolate
Caches reduce the load placed on your servers and network connections.
Let's examine each of these reasons in more detail.
It should be pretty obvious that caching objects close to web clients can greatly reduce the amount of time it takes to access those objects. This is the reason why web browsers have their own built-in cache. Retrieving an object from the browser cache is almost always faster than retrieving it over the network. When considering how your web site
What if the
When discussing how caches reduce latency, it is very important to include the differences between validated and unvalidated cache hits (see Section 2.3). While validation
Most likely, you have
People who use web caches, however, may still be able to receive your site's pages, even during a network outage. As long as the user has network connectivity to the cache, pages already in the cache can be sent to the user. If a cached object is
For a stale cached object, the cache forwards a validation request to the origin server. If the validation request fails because of the network outage, the cache may be able to send the cached copy to the user anyway. HTTP/1.1
Warning: 111 (cache.foo.com:3128) Revalidation Failed
Note that caches are not required to send stale responses for failed validation requests. This is up to the caching proxy implementation or is perhaps an option for the cache administrator. Also recall that if the cached response includes the
must-revalidate
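The decision a cache faces after a failed validation can be sketched in Python. This is only an illustration under stated assumptions: the function name and the cache host `cache.example.com:3128` are hypothetical, the Warning value follows the RFC 2616 grammar (agent, then quoted text), and a real caching proxy may apply additional policy set by its administrator.

```python
def respond_after_failed_validation(cache_control, cache_host="cache.example.com:3128"):
    """Return (status, extra_headers) for a stale object whose validation failed.

    cache_control is the Cache-control value stored with the cached response.
    cache_host is a hypothetical name identifying this cache in the Warning header.
    """
    directives = {d.strip().lower() for d in cache_control.split(",") if d.strip()}
    if "must-revalidate" in directives:
        # HTTP/1.1 forbids serving the stale copy; the cache reports a gateway timeout.
        return 504, {}
    # Otherwise the cache may (but need not) send the stale copy with a warning.
    return 200, {"Warning": f'111 {cache_host} "Revalidation Failed"'}


print(respond_after_failed_validation("max-age=60"))
print(respond_after_failed_validation("max-age=60, must-revalidate"))
```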
As a content provider, you probably want your users to receive your information as quickly as possible. Many people
Server load is usually measured in terms of requests per second. Numerous factors affect a server's performance, including network speed, CPU power, disk access times, and TCP
It should be pretty obvious that web caches can reduce the load placed on origin servers, but it is actually quite difficult to say how much of an origin server's load is absorbed by caches. Both the site's popularity and
As fascinating as all this may be, should you really care about the load absorbed by web caches? Maybe not. Organizations that require heavy-duty web servers can usually afford to buy whatever they need. Smaller organizations can probably get by with inexpensive hardware and free software. The most important thing to remember is this: if you're thinking about making all your content uncachable one day, expect a very large increase in server load shortly thereafter. If you are not prepared to handle significantly more requests, you might lose your job!
This "Top Ten list" describes the steps you can take to build a cache-friendly web site. Do not feel like you have to implement all of these. It is still beneficial if you put just one or two into practice. The most beneficial and practical ideas are listed first:
Avoid using CGI, ASP, and server-side includes (SSI) unless
The main problem with CGI scripts is that many caches simply do not store a response when the URL includes
cgi-bin
or even
cgi
. The reason for this heuristic is perhaps historical. When caching was first in use, this was the
From a cache's point of view, Active Server Pages (ASP) are very similar to CGI scripts. Both are generated by the server, on the fly, for each request. As such, ASP responses usually have
Finally, you should avoid server-side includes (SSI) for the same reasons. This is a feature of some HTTP servers to parse HTML at request time and replace certain markers with special text. For example, with Apache, you can insert the current date and time or the current file
Use the GET method instead of POST if possible. Both
However, this difference also means that POST responses cannot be cached unless
Avoid renaming web site files; use unique filenames instead. This might be difficult or
In this example, it is better to use a file-naming scheme that does not depend on the presentation order; a scheme based on the presenter's
Give your content a default expiration time, even if it is very short. If your content is relatively static, adding an Expires header can significantly speed up access to your site. The explicit expiration time means clients know exactly when they should issue revalidation requests. An expiration-based cache hit is almost always faster than a validation-based near hit; see Section A.8. See Section 6.2.4 for advice on choosing expiration values.
If you have a mixture of static and dynamic content, you might find it helpful to have a separate HTTP server for each. This way, you can set server-wide defaults to improve the cachability of your static content without
Don't use content negotiation. Occasionally, people like to create pages that are customized for the user's browser; for example, Netscape may have a nifty feature that Internet Explorer does not have. An origin server can examine the User-agent request header and generate special HTML to take advantage of a browser feature. To use the terminology from HTTP, an origin server may have any number of variants for a single URI. The mechanism for selecting the most appropriate variant is known as content negotiation, and it has negative consequences for web caches.
First of all, if either the cache or the origin server does not correctly implement content negotiation, a cache client might receive the wrong response. For example, if an HTML page with content specific to Internet Explorer gets cached, the cache might send the page to a Netscape user. To prevent this from happening, the origin server is supposed to add a response header telling caches that the response varies on the User-agent value:
Vary: User-agent
If the cache ignores the Vary header, or if the origin server does not send it, cache users can get incorrect responses.
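To see why the Vary header matters, here is a minimal Python sketch of how a cache might build its lookup key. The function name `cache_key` and the example URL are assumptions made for illustration. Real caches typically remember the Vary value from the first stored response and use it to select among variants; this sketch only shows the underlying idea.

```python
def cache_key(url, request_headers, vary):
    """Build a lookup key from the URL plus the request headers named by Vary.

    vary is the value of the response's Vary header ("" if the header is absent).
    """
    names = [n.strip().lower() for n in vary.split(",") if n.strip()]
    lowered = {k.lower(): v for k, v in request_headers.items()}
    # Header names are case-insensitive, so compare them in lowercase.
    selected = tuple((n, lowered.get(n, "")) for n in sorted(names))
    return (url, selected)


url = "http://www.example.com/index.html"
explorer = cache_key(url, {"User-Agent": "MSIE"}, "User-agent")
netscape = cache_key(url, {"User-Agent": "Mozilla"}, "User-agent")
print(explorer != netscape)  # the two browsers map to different cache entries
```

Without the Vary value, both requests collapse to the same key, which is exactly the situation in which a Netscape user can receive the Internet Explorer page.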
Even when content negotiation is correctly implemented, it
Synchronize your system clocks with a reference clock. This ensures that your server sends accurate
Last-modified
and
Expires
timestamps in its responses. Even though
Avoid using
Address-based authentication can also deny
[2] DMZ stands for de-militarized zone. A DMZ network is considered to be "neutral territory" between your internal network and the outside world. See [Zwicky, Cooper and Chapman, 2000] for more information.
Think different! Sometimes, those of us in the United States forget about Internet users in other
Even if you think shared proxy caches are evil, consider allowing single-user browser caches to store your pages. There is a simple way to accomplish this with HTTP/1.1. Just add the following header to your server's replies:
Cache-control: private
This header allows only browser caches to store responses. The browser may then perform a validation request on the cached object as necessary.
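The effect of the private directive can be sketched in a few lines of Python. The function name `may_store` is hypothetical, and the sketch deliberately models only the private directive discussed here; a real cache also considers other directives, the request method, and the status code.

```python
def may_store(cache_control, shared_cache):
    """Return True if a cache may store a response, considering only 'private'.

    shared_cache is True for a proxy cache serving many users and False for
    a single-user browser cache.
    """
    names = set()
    for directive in cache_control.split(","):
        # Keep only the directive name, dropping any "=value" part.
        name = directive.strip().lower().split("=")[0]
        if name:
            names.add(name)
    return not ("private" in names and shared_cache)


print(may_store("private", shared_cache=True))
print(may_store("private", shared_cache=False))
```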
In the previous section, I gave you a number of recommendations for the responses generated by your web server. Now, we will see how you can implement those with the Apache server.
Apache has a couple of ways to include an Expires header in HTTP responses. The old way is actually a legacy from the CERN proxy. It uses .meta directories to hold the header information. For example, if you want to set a header value for the resource /foo/index.html, create a file named /foo/.meta/index.html, in which you put lines such as:
Expires: Wed, 28 Feb 2001 19:52:18 GMT
Before you can use meta files in Apache, you must include the cern_meta module when you compile the server. This is accomplished with Version 1.3 of Apache by giving the following command-line option to the configure script:
./configure --add-module=src/modules/extra/mod_cern_meta.c
The CERN meta file technique has a number of shortcomings. First of all, you have to create a separate meta file for every file on your server. Second, you must specify the headers exactly. If you do not remember to update the Expires time, responses are served with an expiration time in the past. It is not possible to have the server dynamically calculate the expiration time. For these reasons, I strongly discourage you from using the .meta technique.
Apache has a newer module, called mod_expires, that is easier to use and offers much more flexibility. This module is available in Version 1.3 of Apache and later. To add the module to your server binary, you need to use this command:
./configure --add-module=src/modules/standard/mod_expires.c
This module is nice because it sets the max-age cache control directive, in addition to the Expires header. Documentation from the Apache web site can be found at http://www.apache.org/docs/mod/mod_expires.html.
To use the expires module, you must first enable the option for your server with the ExpiresActive keyword. This option can be set either globally or for a specific subset of your document tree. The easiest technique is simply to enable it for your whole server by adding the following line to your httpd.conf file:
ExpiresActive on
If you want to use fine-grained controls with the .htaccess file, you must also add AllowOverride Indexes for the necessary directories in httpd.conf.
The expires module has two directives that specify which objects receive an Expires header. The ExpiresDefault directive applies to all responses, while ExpiresByType applies to objects of a specific content type, such as text/html. Unfortunately, you cannot use wildcards (text/*) in the type specification. These directives may appear in a number of contexts. They can be applied to the entire server, a virtual domain name, or a subdirectory. Thus, you have a lot of control over which responses have an expiration time.
Expiration times can be calculated in two ways, based on either the object's modification time or its access time. In both cases, the
Expires
value is calculated as a fixed offset from the
access plus one day
More complex specifications are allowed:
access plus 1 week 2 days 4 hours 7 minutes
The expiration time can also be based on the modification time, using the modification keyword. For example:
modification plus 2 weeks
The latter approach should be used only for objects that definitely change at regular intervals. If the expiration time
Now let's see how to put it all together. Let's say you want to
ExpiresActive on
ExpiresByType image/jpeg "access plus 3 days"
ExpiresByType image/gif "access plus 3 days"
ExpiresByType text/html "access plus 12 hours"
ExpiresDefault "access plus 1 day"
If you have a subdirectory that requires special treatment, you can put similar commands in an
.htaccess
file. For example, let's say you have a directory called
weather
that holds current images from weather
ExpiresByType image/gif "modification plus 1 hour"
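To make the arithmetic behind these specifications concrete, the following Python sketch computes an expiration time from an "access plus ..." or "modification plus ..." string. It is not how mod_expires is implemented; the function name `expires_at` is hypothetical, amounts must be written as digits, and only the units that appear in this section's examples are handled.

```python
from datetime import datetime, timedelta, timezone

# Units used in this section's examples; mod_expires accepts others as well.
UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400, "week": 604800}


def expires_at(spec, access_time, modification_time):
    """Return the expiration time for a spec such as 'access plus 3 days'."""
    words = spec.lower().split()
    if not words or words[0] not in ("access", "modification"):
        raise ValueError(f"unknown base in {spec!r}")
    base = access_time if words[0] == "access" else modification_time
    rest = words[1:]
    if rest and rest[0] == "plus":  # treated as optional in this sketch
        rest = rest[1:]
    if not rest or len(rest) % 2:
        raise ValueError(f"expected number/unit pairs in {spec!r}")
    offset = 0
    for amount, unit in zip(rest[0::2], rest[1::2]):
        unit = unit[:-1] if unit.endswith("s") else unit  # "days" -> "day"
        if unit not in UNIT_SECONDS:
            raise ValueError(f"unsupported unit {unit!r}")
        offset += int(amount) * UNIT_SECONDS[unit]
    return base + timedelta(seconds=offset)


requested = datetime(2001, 3, 9, 2, 50, 34, tzinfo=timezone.utc)
modified = requested - timedelta(minutes=30)
print(expires_at("access plus 3 days", requested, modified))
print(expires_at("modification plus 1 hour", requested, modified))
```

Note how the second call yields a time only 30 minutes after the request: with a modification-based rule, the remaining freshness shrinks as the file ages.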
Apache also has a module that allows you to add arbitrary headers to a response. This module is called mod_headers, and it is useful for setting headers such as Cache-control. To add the headers module to your Apache installation, use the following configure option:
./configure --add-module=src/modules/standard/mod_headers.c
The full documentation can be found at http://www.apache.org/docs/mod/mod_headers.html.
With the headers module, you can easily add, remove, and append almost any HTTP header. If you are not familiar with the format and structure of HTTP headers, review Section 4.2 of RFC 2616. The general syntax for the headers module is:
Header <set|append|add> name value
Header unset name
If the
value
includes whitespace, it must be
Now we'll see how to include a Cache-control header so some responses are cachable by browsers but not by shared proxy caches. You can apply header directives for an entire subdirectory by adding the following to an .htaccess file in that directory:
Header append Cache-control private
Note that we use append instead of set because we don't want to clobber any existing Cache-control directives in the response.
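The difference between the four actions is easier to see with a small model. The Python sketch below mimics the documented behavior of set, append, add, and unset on a list of header pairs; the function name `apply_header` is hypothetical, and this is a simplified model rather than Apache's actual code.

```python
def apply_header(headers, action, name, value=None):
    """Return a new list of (name, value) pairs after one Header action."""
    def same(other):
        return other.lower() == name.lower()  # header names are case-insensitive

    if action == "unset":
        return [(n, v) for n, v in headers if not same(n)]
    if action == "set":
        # Replace any previous header with this name.
        return [(n, v) for n, v in headers if not same(n)] + [(name, value)]
    if action == "add":
        # Always add a new header line, even if the name already exists.
        return headers + [(name, value)]
    if action == "append":
        # Merge onto an existing header, separated by a comma.
        for i, (n, v) in enumerate(headers):
            if same(n):
                return headers[:i] + [(n, v + ", " + value)] + headers[i + 1:]
        return headers + [(name, value)]
    raise ValueError(f"unknown action {action!r}")


existing = [("Cache-control", "max-age=300")]
print(apply_header(existing, "append", "Cache-control", "private"))
print(apply_header(existing, "set", "Cache-control", "private"))
```

With append, the existing max-age directive survives alongside private; with set, it is lost.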
As an alternative to the expires module described previously, you can use the headers module with the max-age directive to define an expiration time. For example:
Header append Cache-control "max-age=3600"
For HTTP/1.1, this is equivalent to using "access plus 1 hour" in the expires module. The difference is that here we use Cache-control. An HTTP/1.0 client (or cache) may not understand the Cache-control header.
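The following Python sketch shows how an HTTP/1.1 cache might derive a freshness lifetime from these headers: max-age takes precedence when present, and otherwise the lifetime is the difference between Expires and Date. The function name `freshness_lifetime` is hypothetical, and the sketch ignores heuristics that real caches apply when neither header is present.

```python
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Return the freshness lifetime in seconds, or None if none is given."""
    lowered = {k.lower(): v for k, v in headers.items()}
    # In HTTP/1.1, max-age overrides Expires.
    for directive in lowered.get("cache-control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    # Fall back to Expires minus Date, which HTTP/1.0 caches also understand.
    if "expires" in lowered and "date" in lowered:
        delta = (parsedate_to_datetime(lowered["expires"])
                 - parsedate_to_datetime(lowered["date"]))
        return max(0, int(delta.total_seconds()))
    return None


print(freshness_lifetime({"Cache-control": "max-age=3600"}))
print(freshness_lifetime({"Date": "Wed, 10 Jan 2001 03:13:11 GMT",
                          "Expires": "Wed, 10 Jan 2001 03:18:11 GMT"}))
```

Sending both Expires and max-age, as the expires module does, covers HTTP/1.0 and HTTP/1.1 caches alike.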
It is also possible to set headers from your CGI scripts without the headers or expires modules. In particular, you might want to set Last-modified and Expires headers, since these are normally absent from CGI script responses.
The output of a CGI program consists of reply headers, an empty line, and the reply body. Some of the reply headers (such as Date and Server) are supplied by the server and not by the CGI program. However, the CGI script must at least output a Content-type header. Apache also allows you to pass other reply headers from the script to the server.
I've already mentioned that Apache includes
Last-modified
only for disk files. CGI scripts are considered dynamic, so the server does not generate a
Last-modified
header for them. You can generate your own from a CGI script with relatively little trouble. While you're at it, you might as well send an expiration time too. The following Perl code
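#!/usr/bin/perl -w
use strict;
use POSIX qw(strftime);
# Freshness lifetime in seconds (5 minutes).
my $exp_delta = 300;
# Read the clock once so both timestamps derive from the same instant.
my $now = time;
# HTTP dates are always expressed in GMT.
my $fmt = "%a, %d %b %Y %H:%M:%S GMT";
my $lmt = strftime($fmt, gmtime($now));
my $exp = strftime($fmt, gmtime($now + $exp_delta));
# Reply headers come first, then an empty line, then the body.
print "Content-type: text/plain\n";
print "Last-modified: $lmt\n";
print "Expires: $exp\n";
print "Cache-control: max-age=$exp_delta\n";
print "\n";
print "This demonstrates setting reply headers from a CGI script.\n";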
The trickiest part is that we have to use Perl's
POSIX
module to get the
strftime()
function. The
When the above script is placed on a server and requested, the output looks something like this:
HTTP/1.1 200 OK
Date: Wed, 10 Jan 2001 03:13:11 GMT
Server: Apache/1.3.3 (Unix)
Cache-control: max-age=300
Expires: Wed, 10 Jan 2001 03:18:11 GMT
Last-Modified: Wed, 10 Jan 2001 03:13:11 GMT
Connection: close
Content-Type: text/plain

This demonstrates setting reply headers from a CGI script.
Note that the server included all of the headers we added and even rearranged them a little bit. The server also added Date, Server, and Connection headers.
Note that the response does not have a Content-length header. This is because Apache does not know how long the reply body is going to be. You should output a Content-length header from your CGI scripts if the body length is easy to calculate. As mentioned in Section 6.1.5, a missing Content-length has certain negative consequences for persistent connections.
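One way to supply Content-length is to build the whole body before printing any headers. The sketch below illustrates the idea in Python rather than Perl; the function name `cgi_reply` is hypothetical, and the approach assumes the body is small enough to hold in memory.

```python
import sys
import time
from email.utils import formatdate


def cgi_reply(body_text, now, exp_delta=300):
    """Return the complete CGI output (headers, empty line, body) as bytes."""
    body = body_text.encode("utf-8")
    headers = [
        "Content-type: text/plain; charset=utf-8",
        # The length is counted in bytes, so measure the encoded body.
        f"Content-length: {len(body)}",
        f"Last-modified: {formatdate(now, usegmt=True)}",
        f"Expires: {formatdate(now + exp_delta, usegmt=True)}",
        f"Cache-control: max-age={exp_delta}",
    ]
    return ("\n".join(headers) + "\n\n").encode("ascii") + body


if __name__ == "__main__":
    message = "This demonstrates setting reply headers from a CGI script.\n"
    sys.stdout.buffer.write(cgi_reply(message, time.time()))
```

Because the length is known before the first header is written, the server can pass it along and keep a persistent connection open.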
If I have managed to convince you that specific expiration times are a good thing, you might
Generally, people consider HTML pages to be more dynamic and critical than images. Thus, you can probably give images a longer expiration period than HTML files. This is a good
If your HTML content changes daily at specific times (e.g., midnight), obviously you should use an expiration time of "modification time plus one day." However, if your content changes daily, but at random times, you'll probably want to use an expiration scheme such as "access time plus six hours." Then it will take no longer than six hours for your updated page to propagate through all web caches. If you have time-sensitive information, it is a good idea to include a timestamp somewhere on the page. For example:
This page was last modified Fri Mar 9 02:50:34 GMT 2001.
This gives your viewers important information when they wonder if the page is up-to-date. If they believe the page may have been changed, they can ask for the latest version by clicking on their browser's Reload button.