6.2 Being Cache-Friendly

This section is directed to content providers: I want to convince you to engineer a cache-friendly web site. If you stick with me through the motivation section, I'll give you some practical advice and even show you how to implement many of the tips on the Apache server.

6.2.1 Why?

Why should you, as a content provider, care about web caching? For at least the following three reasons:

When people access your web site, pages will load faster.
Caches isolate clients from network failures.
Caches reduce the load placed on your servers and network connections.

Let's examine each of these reasons in more detail.

6.2.1.1 Latency

It should be pretty obvious that caching objects close to web clients can greatly reduce the amount of time it takes to access those objects. This is the reason why web browsers have their own built-in cache. Retrieving an object from the browser cache is almost always faster than retrieving it over the network. When considering how your web site interacts with caches, don't forget browser caches!

What if the requested objects are not in the browser cache but might be stored in a proxy cache? Now the benefits of caching are strongly correlated to relative proximity of the client, cache, and origin server. Here, proximity refers to network topology rather than geography. Both latency and throughput characterize network proximity. As an example, let's consider two different users accessing the www.cnn.com home page. The first user is connected to a U.S. ISP with a 56K modem, using the ISP's proxy cache. The second is in a classroom on the campus of a university in Perth, Australia, using the university's proxy cache. The dial-up user is actually "far away" from the ISP cache because the dial-up connection is a major bottleneck. Throughput between the user's home computer and the proxy cache is limited to about 4KB per second. The ISP cache probably does not speed up transfers for the dial-up user because both cache hits and misses are limited to the modem speed. However, the Australian student is very close to her cache, probably connected via a local area network. In her case, the major bottleneck is the transoceanic link between Australia and the United States. The cache provides a significant speedup because cache hits are transferred much faster than cache misses.

When discussing how caches reduce latency, it is very important to include the differences between validated and unvalidated cache hits (see Section 2.3). While validation requests can contribute significantly to reducing network bandwidth, they still incur high latency penalties. In some situations, a validated cache hit takes just about as long as a cache miss. For users who enjoy high-speed network connections, round-trip delays, rather than transmission delays, are the primary source of latency. Thus, validated cache hits do not appear to be significantly faster on high-speed networks. For dial-up users, however, transmission time is the primary source of delay, and a validated hit from the browser cache should be faster than a cache miss.
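
The difference between the two kinds of users can be illustrated with a rough back-of-the-envelope model. The sketch below is only an approximation: it assumes a cache miss costs one round trip plus the time to push the object through the slowest link, and that a validated hit from the browser cache costs one round trip and nothing else. The function names, the 0.2-second round-trip time, and the 10,000-byte object size are illustrative assumptions, not measurements.

```python
def miss_time(rtt_s, size_bytes, throughput_bytes_per_s):
    """Approximate time for a cache miss: one round trip plus transmission."""
    return rtt_s + size_bytes / throughput_bytes_per_s


def validated_hit_time(rtt_s):
    """Approximate time for a validated browser-cache hit: one round trip,
    because only a small validation exchange crosses the network."""
    return rtt_s


def speedup(rtt_s, size_bytes, throughput_bytes_per_s):
    """How many times faster a validated hit is than a miss in this model."""
    return miss_time(rtt_s, size_bytes, throughput_bytes_per_s) / validated_hit_time(rtt_s)


# Assumed numbers: 0.2 s round trip, 10,000-byte object.
dialup = speedup(0.2, 10_000, 4_000)          # modem at roughly 4 KB per second
fast_link = speedup(0.2, 10_000, 1_000_000)   # hypothetical high-speed link
```

Under these assumptions the dial-up user sees a validated hit that is more than ten times faster than a miss, while on the fast link the two are nearly indistinguishable, which matches the argument above.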

6.2.1.2 Hiding network failures

Most likely, you have experienced a network outage or failure when using the Internet. An outage may be due to failed hardware (such as routers and switches) or a telecommunication breakdown (fiber cut). Whatever the cause, network failures are frustrating because they prevent users from reaching certain web sites.

People who use web caches, however, may still be able to receive your site's pages, even during a network outage. As long as the user has network connectivity to the cache, pages already in the cache can be sent to the user. If a cached object is considered fresh, there is no need to contact the origin server anyway. For an unvalidated cache hit, both the user and the cache would never even know about the network outage.

For a stale cached object, the cache forwards a validation request to the origin server. If the validation request fails because of the network outage, the cache may be able to send the cached copy to the user anyway. HTTP/1.1 generally allows caches to do this, but the cache must insert a Warning header, which looks like this:

 Warning: 111 (cache.foo.com:3128) Revalidation Failed

Note that caches are not required to send stale responses for failed validation requests. This is up to the caching proxy implementation or is perhaps an option for the cache administrator. Also recall that if the cached response includes the must-revalidate cache-control directive, the cache cannot send a stale response to the client.
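
The decision a cache faces after a failed validation can be sketched in a few lines. This is a simplified illustration, not the logic of any particular caching product: the function name, the `allow_stale` switch (standing in for an administrator option), and the `cache.example.com:3128` host are assumptions made for the example. The Warning value is built using the RFC 2616 grammar, in which the warning text is a quoted string, and the 504 status is what RFC 2616 prescribes when must-revalidate prevents a stale reply.

```python
def respond_after_failed_validation(cached_headers, allow_stale=True,
                                    cache_host="cache.example.com:3128"):
    """Return (status, headers) for a stale object whose validation failed.

    cached_headers is a dict of the stored response headers with
    lowercase names. allow_stale models a hypothetical admin setting.
    """
    cache_control = cached_headers.get("cache-control", "")
    directives = {d.strip().lower() for d in cache_control.split(",") if d.strip()}

    # must-revalidate forbids serving the stale copy; so does local policy.
    if "must-revalidate" in directives or not allow_stale:
        return 504, {}

    # Otherwise serve the stale copy, but tell the client what happened.
    headers = dict(cached_headers)
    headers["warning"] = '111 %s "Revalidation Failed"' % cache_host
    return 200, headers
```

For example, a stored response carrying `Cache-control: max-age=60` would be served with the added Warning header, while one carrying `Cache-control: must-revalidate` would produce an error instead.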

6.2.1.3 Server load reduction

As a content provider, you probably want your users to receive your information as quickly as possible. Many people spend a lot of time thinking about and working on ways to optimize their HTTP servers. The load placed upon an origin server affects its overall performance. In other words, as the load increases, the average response time increases as well.

Server load is usually measured in terms of requests per second. Numerous factors affect a server's performance, including network speed, CPU power, disk access times, and TCP implementations. At the time of this writing, state-of-the-art web servers can handle about 10,000 requests per second. That's much more than most of us require.

It should be pretty obvious that web caches can reduce the load placed on origin servers, but it's quite difficult actually to say how much of an origin server's load is absorbed by caches. Both the site's popularity and cachability affect the percentage of requests satisfied as cache hits. A more popular site provides more opportunities for cache hits because more clients request its objects. An object that remains fresh for a long period of time also provides more opportunities for cache hits.

As fascinating as all this may be, should you really care about the load absorbed by web caches? Maybe not. Organizations that require heavy-duty web servers can usually afford to buy whatever they need. Smaller organizations can probably get by with inexpensive hardware and free software. The most important thing to remember is this: if you're thinking about making all your content uncachable one day, expect a very large increase in server load shortly thereafter. If you are not prepared to handle significantly more requests, you might lose your job!

6.2.2 Ten Ways to be Cache-Friendly

This "Top Ten list" describes the steps you can take to build a cache-friendly web site. Do not feel like you have to implement all of these. It is still beneficial if you put just one or two into practice. The most beneficial and practical ideas are listed first:

Avoid using CGI, ASP, and server-side includes (SSI) unless absolutely necessary. Generally, these techniques are bad for caches because they usually produce dynamic content. Dynamic content is not a bad thing per se, but it may be abused. CGI and ASP can also generate cache-friendly, static content, but this requires special effort by the author and seems to occur infrequently in practice.

The main problem with CGI scripts is that many caches simply do not store a response when the URL includes cgi-bin or even cgi. The reason for this heuristic is perhaps historical. When caching was first in use, this was the easiest way to identify dynamic content. Today, with HTTP/1.1, we probably need to look at only the response headers to determine what may be cached. Even so, the heuristic remains, and some caches might be hard-wired never to store CGI responses.

From a cache's point of view, Active Server Pages (ASP) are very similar to CGI scripts. Both are generated by the server, on the fly, for each request. As such, ASP responses usually have neither a Last-modified nor an Expires header. On the plus side, it is uncommon to find special cache heuristics for ASP (unlike CGI) probably because ASP was invented well after caching was in widespread use.

Finally, you should avoid server-side includes (SSI) for the same reasons. This is a feature of some HTTP servers to parse HTML at request time and replace certain markers with special text. For example, with Apache, you can insert the current date and time or the current file size into an HTML page. Because the server generates new content, the Last-Modified header is either absent in the response or set to the current time. Both cases are bad for caches.
Use the GET method instead of POST if possible. Both methods are used for HTML forms and query-type requests. With the POST method, query terms are transmitted in the request body. A GET request, on the other hand, puts the query terms in the URI. It's easy to see the difference in your browser's Location box. A GET query has all the terms in the box with lots of & and = characters. This means that POST is somewhat more secure because the query terms are hidden in the message body.

However, this difference also means that POST responses cannot be cached unless specifically allowed. POST responses may have side effects on the server (e.g., updating a database), but those side effects aren't triggered if the cache gives back a cached response. Section 9.1 of RFC 2616 explains the important differences between GET and POST. In practice, it is quite rare to find a cachable POST response, so I would not be surprised if most caching products never cache any POST responses at all. If you want to have cachable query results, you certainly should use GET instead of POST.
Avoid renaming web site files; use unique filenames instead. This might be difficult or impossible in some situations, but consider this example: a web site lists a schedule of talks for a conference. For each talk there is an abstract, stored in a separate HTML file. These files are named in order of their presentation during the conference: talk01.html, talk02.html, talk03.html, etc. At some point, the schedule changes and the filenames are no longer in order. If the files are renamed to match the order of the presentation, web caches are likely to become confused. Renaming usually does not update the file modification time, so an If-Modified-Since request for a renamed file can have unpredictable consequences. Renaming files in this manner is similar to cache poisoning.

In this example, it is better to use a file-naming scheme that does not depend on the presentation order; a scheme based on the presenter's name would be preferable. Then, if the order of presentation changes, the HTML file must be rewritten, but the other files can still be served from the cache. Another solution is to touch the files to adjust the timestamp.
Give your content a default expiration time, even if it is very short. If your content is relatively static, adding an Expires header can significantly speed up access to your site. The explicit expiration time means clients know exactly when they should issue revalidation requests. An expiration-based cache hit is almost always faster than a validation-based near hit; see Section A.8. See Section 6.2.4 for advice on choosing expiration values.
If you have a mixture of static and dynamic content, you might find it helpful to have a separate HTTP server for each. This way, you can set server-wide defaults to improve the cachability of your static content without affecting the dynamic data. Since the entire server is dedicated to static objects, you need to maintain only one configuration file. A number of large web sites have taken this approach. Yahoo! serves all of their images from a server at images.yahoo.com, as does CNN with images.cnn.com. Wired serves advertisements and other images from static.wired.com, and Hotbot uses a server named static.hotbot.com.
Don't use content negotiation. Occasionally, people like to create pages that are customized for the user's browser; for example, Netscape may have a nifty feature that Internet Explorer does not have. An origin server can examine the User-agent request header and generate special HTML to take advantage of a browser feature. To use the terminology from HTTP, an origin server may have any number of variants for a single URI. The mechanism for selecting the most appropriate variant is known as content negotiation, and it has negative consequences for web caches.

First of all, if either the cache or the origin server does not correctly implement content negotiation, a cache client might receive the wrong response. For example, if an HTML page with content specific to Internet Explorer gets cached, the cache might send the page to a Netscape user. To prevent this from happening, the origin server is supposed to add a response header telling caches that the response varies on the User-agent value:
```
 Vary: User-agent 
```
If the cache ignores the Vary header, or if the origin server does not send it, cache users can get incorrect responses.

Even when content negotiation is correctly implemented, it reduces the number of cache hits for the URL. If a response varies on the User-agent header, a cache must store a separate response for every User-agent it encounters. Note that the User-agent value is more than just Netscape or MSIE. Rather, it is a string such as Mozilla/4.05 [en] (X11; I; FreeBSD 2.2.5-RELEASE i386; Nav). Thus, when a response varies on the User-agent header, we can get a cache hit only for clients running the same version of the browser on the same operating system.
Synchronize your system clocks with a reference clock. This ensures that your server sends accurate Last-modified and Expires timestamps in its responses. Even though newer versions of HTTP use techniques that are less susceptible to clock skew, many web clients and servers still rely on the absolute timestamps. xntpd implements the Network Time Protocol (NTP) and is widely used to keep clocks synchronized on Unix systems. You can get the software and installation tips from http://www.ntp.org/.
Avoid using address-based authentication. Recall that most proxy caches hide the addresses of clients. An origin server sees connections coming from the proxy's address, not the client's. Furthermore, there is no standard and safe way to convey the client's address in an HTTP request. Some of the consequences of address-based authentication are discussed in Section 2.2.5.

Address-based authentication can also deny legitimate users access to protected information when they use a proxy cache. Many organizations use a DMZ network for the firewall between the Internet and their internal systems. ^[2] A cache that runs on the DMZ network is probably not allowed to access internal web servers. Thus, the users on the internal network cannot simply send all of their requests to a cache on the DMZ network. Instead, the browsers must be configured to make direct connections for the internal servers.

^[2] DMZ stands for de-militarized zone. A DMZ network is considered to be "neutral territory" between your internal network and the outside world. See [Zwicky, Cooper and Chapman, 2000] for more information.
Think different! Sometimes, those of us in the United States forget about Internet users in other parts of the world. In some countries, Internet bandwidth is so constrained that we would find it appalling. What takes seconds or minutes to load in the U.S. may take hours or even days in some locations. I strongly encourage you to remember bandwidth-starved users when designing your web sites, and remember that improved cachability speeds up your web site for such users.
Even if you think shared proxy caches are evil, consider allowing single-user browser caches to store your pages. There is a simple way to accomplish this with HTTP/1.1. Just add the following header to your server's replies:
```
 Cache-control: private 
```
This header allows only browser caches to store responses. The browser may then perform a validation request on the cached object as necessary.
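
Several of the tips above come down to how a shared proxy cache decides whether to store a response and how it looks one up later. The following is a deliberately simplified sketch of that reasoning, not the behavior of any real caching product. The function names are hypothetical, header dictionaries are assumed to use lowercase names, and the rules cover only the points made in the list: the POST restriction, the cgi heuristic, the private directive, and the Vary header.

```python
from urllib.parse import urlsplit


def shared_cache_may_store(method, url, response_headers):
    """Very rough storability check for a shared (proxy) cache."""
    if method.upper() != "GET":
        return False              # POST responses are rarely cachable in practice
    if "cgi" in urlsplit(url).path.lower():
        return False              # the historical cgi-bin heuristic
    cache_control = response_headers.get("cache-control", "").lower()
    directives = {d.strip().split("=")[0] for d in cache_control.split(",")}
    if "private" in directives:
        return False              # only single-user browser caches may store it
    return True


def cache_key(url, request_headers, response_headers):
    """Build a lookup key that honors the Vary response header.

    Every request header named in Vary becomes part of the key, so
    two clients share a cached entry only if those headers match.
    """
    vary = response_headers.get("vary", "")
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    return (url,) + tuple((n, request_headers.get(n, "")) for n in names)
```

With `Vary: User-agent` in the response, two requests for the same URL produce the same key only when their User-agent strings are identical, which is why content negotiation on that header yields so few hits.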

6.2.3 Apache

In the previous section, I gave you a number of recommendations for the responses generated by your web server. Now, we will see how you can implement those with the Apache server.

6.2.3.1 The Expires header

Apache has a couple of ways to include an Expires header in HTTP responses. The old way is actually a legacy from the CERN proxy. It uses .meta directories to hold the header information. For example, if you want to set a header value for the resource /foo/index.html, create a file named /foo/.meta/index.html, in which you put lines such as:

 Expires: Wed, 28 Feb 2001 19:52:18 GMT

Before you can use meta files in Apache, you must include the cern_meta module when you compile the server. This is accomplished with Version 1.3 of Apache by giving the following command-line option to the configure script:

 ./configure --add-module=src/modules/extra/mod_cern_meta.c

The CERN meta file technique has a number of shortcomings. First of all, you have to create a separate meta file for every file on your server. Second, you must specify the headers exactly. If you do not remember to update the Expires time, responses are served with an expiration time in the past. It is not possible to have the server dynamically calculate the expiration time. For these reasons, I strongly discourage you from using the .meta technique.

Apache has a newer module, called mod_expires, that is easier to use and offers much more flexibility. This module is available in Version 1.3 of Apache and later. To add the module to your server binary, you need to use this command:

 ./configure --add-module=src/modules/standard/mod_expires.c

This module is nice because it sets the max-age cache control directive, in addition to the Expires header. Documentation from the Apache web site can be found at http://www.apache.org/docs/mod/mod_expires.html.

To use the expires module, you must first enable the option for your server with the ExpiresActive keyword. This option can be set either globally or for a specific subset of your document tree. The easiest technique is simply to enable it for your whole server by adding the following line to your httpd.conf file:

 ExpiresActive on

If you want to use fine-grained controls with the .htaccess file, you must also add AllowOverride Indexes for the necessary directories in httpd.conf.

The expires module has two directives that specify which objects receive an Expires header. The ExpiresDefault directive applies to all responses, while ExpiresByType applies to objects of a specific content type, such as text/html. Unfortunately, you cannot use wildcards (text/*) in the type specification. These directives may appear in a number of contexts. They can be applied to the entire server, a virtual domain name, or a subdirectory. Thus, you have a lot of control over which responses have an expiration time.

Expiration times can be calculated in two ways, based on either the object's modification time or its access time. In both cases, the Expires value is calculated as a fixed offset from the chosen time. For example, to specify an expiration time of one day after the time of access, you write:

 access plus 1 day

More complex specifications are allowed:

 access plus 1 week 2 days 4 hours 7 minutes

The expiration time can also be based on the modification time, using the modification keyword. For example:

 modification plus 2 weeks

The latter approach should be used only for objects that definitely change at regular intervals. If the expiration time passes and the object does not get updated, any subsequent request for the object will result in a preexpired response. This hardly improves the cachability of the object! Furthermore, you should use the modification keyword only for disk files. When a response is generated dynamically (a CGI script, for example), it does not have a Last-modified time, and thus Apache cannot include an Expires header.
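
To make the arithmetic concrete, here is a minimal sketch of how such a specification turns into an Expires value. It is not Apache's implementation: the function name and the unit table are assumptions for illustration, and it handles only the access and modification bases with the units used in the examples above.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Units handled by this sketch (an illustrative subset).
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600,
                "day": 86400, "week": 604800}


def expires_for(spec, access_time, modification_time):
    """Compute an expiration time from a spec such as 'access plus 1 day'."""
    words = spec.split()
    base = {"access": access_time, "modification": modification_time}[words[0]]
    rest = words[1:]
    if rest and rest[0] == "plus":
        rest = rest[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected pairs of <number> <unit>")
    offset = timedelta()
    for number, unit in zip(rest[0::2], rest[1::2]):
        offset += timedelta(seconds=int(number) * UNIT_SECONDS[unit.rstrip("s")])
    return base + offset


# Hypothetical times: the file was last changed three weeks before the request.
accessed = datetime(2001, 2, 27, 19, 52, 18, tzinfo=timezone.utc)
modified = accessed - timedelta(weeks=3)

fresh = expires_for("access plus 1 day", accessed, modified)
stale = expires_for("modification plus 2 weeks", accessed, modified)

header_value = format_datetime(fresh, usegmt=True)  # HTTP-style date string
already_expired = stale < accessed                  # True: a preexpired response
```

The second calculation shows the pitfall described above: because the file has not been updated for three weeks, "modification plus 2 weeks" yields an expiration time that is already in the past at the moment of the request.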

Now let's see how to put it all together. Let's say you want to turn on Expires headers for your web server. You want images to expire 3 days after being accessed and HTML pages to expire after 12 hours. All other content should expire one day after being accessed. The following configuration lines, placed in httpd.conf , do what you want:

 ExpiresActive on
 ExpiresByType image/jpeg "access plus 3 days"
 ExpiresByType image/gif "access plus 3 days"
 ExpiresByType text/html "access plus 12 hours"
 ExpiresDefault "access plus 1 day"

If you have a subdirectory that requires special treatment, you can put similar commands in an .htaccess file. For example, let's say you have a directory called weather that holds current images from weather satellites. If the images are updated every hour, you can put these configuration lines in the file named weather/.htaccess:

 ExpiresByType image/gif "modification plus 1 hour"

6.2.3.2 General header manipulation

Apache also has a module that allows you to add arbitrary headers to a response. This module is called mod_headers, and it is useful for setting headers such as Cache-control. To add the headers module to your Apache installation, use the following configure option:

 ./configure --add-module=src/modules/standard/mod_headers.c

The full documentation can be found at http://www.apache.org/docs/mod/mod_headers.html.

With the headers module, you can easily add, remove, and append almost any HTTP header. If you are not familiar with the format and structure of HTTP headers, review Section 4.2 of RFC 2616. The general syntax for the headers module is:

 Header <set|append|add> name value
 Header unset name

If the value includes whitespace, it must be enclosed in double quotes. The set keyword overwrites any existing headers with the same name. The append and add keywords are similar. Neither overwrites an existing header. The append keyword inserts the value at the end of an existing header, while add adds a new, possibly duplicate header. The unset keyword removes the first header with the given name. You can delete only an entire header, not a single value within a header.

Now we'll see how to include a Cache-control header so some responses are cachable by browsers but not by shared proxy caches. You can apply header directives for an entire subdirectory by adding the following to an .htaccess file in that directory:

 Header append Cache-control private

Note that we use append instead of set because we don't want to clobber any existing Cache-control directives in the response.

As an alternative to the expires module described previously, you can use the headers module with the max-age directive to define an expiration time. For example:

 Header append Cache-control "max-age=3600"

For HTTP/1.1, this is equivalent to using "access plus 1 hour" in the expires module. The difference is that here we use Cache-control. An HTTP/1.0 client (or cache) may not understand the Cache-control header.
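
The practical consequence can be sketched as a small function that estimates how long a recipient treats a response as fresh. This is an illustrative simplification rather than the algorithm of any specific cache: the function name and the `understands_cache_control` flag (standing in for an HTTP/1.1 versus HTTP/1.0 recipient) are assumptions, and heuristic expiration is ignored entirely.

```python
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers, understands_cache_control=True):
    """Return the explicit freshness lifetime in seconds, or None.

    A recipient that understands Cache-control prefers max-age; one that
    does not can fall back only on Expires minus Date.
    """
    h = {name.lower(): value for name, value in headers.items()}

    if understands_cache_control:
        for directive in h.get("cache-control", "").split(","):
            name, _, value = directive.strip().partition("=")
            if name.lower() == "max-age" and value.isdigit():
                return int(value)

    if "expires" in h and "date" in h:
        delta = parsedate_to_datetime(h["expires"]) - parsedate_to_datetime(h["date"])
        return int(delta.total_seconds())

    return None  # no explicit expiration information the recipient can use


only_max_age = {"Cache-control": "max-age=3600"}
http11_view = freshness_lifetime(only_max_age)                                   # 3600
http10_view = freshness_lifetime(only_max_age, understands_cache_control=False)  # None
```

A response that carries only `Cache-control: max-age=3600` gives an older recipient nothing to work with, which is one reason to send an Expires header as well.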

6.2.3.3 Setting headers from CGI scripts

It is also possible to set headers from your CGI scripts without the headers or expires modules. In particular, you might want to set Last-modified and Expires headers, since these are normally absent from CGI script responses.

The output of a CGI program consists of reply headers, an empty line, and the reply body. Some of the reply headers (such as Date and Server) are supplied by the server and not by the CGI program. However, the CGI script must at least output a Content-type header. Apache also allows you to pass other reply headers from the script to the server.

I've already mentioned that Apache includes Last-modified only for disk files. CGI scripts are considered dynamic, so the server does not generate a Last-modified header for them. You can generate your own from a CGI script with relatively little trouble. While you're at it, you might as well send an expiration time too. The following Perl code demonstrates how to correctly generate these headers:

 #!/usr/bin/perl -w
 use POSIX;
 $exp_delta = 300;       # 5 minutes
 $lmt = strftime("%a, %d %b %Y %H:%M:%S GMT", gmtime(time));
 $exp = strftime("%a, %d %b %Y %H:%M:%S GMT",
     gmtime(time+$exp_delta));
 print "Content-type: text/plain\n";
 print "Last-modified: $lmt\n";
 print "Expires: $exp\n";
 print "Cache-control: max-age=$exp_delta\n";
 print "\n";
 print "This demonstrates setting reply headers from a CGI script.\n";

The trickiest part is that we have to use Perl's POSIX module to get the strftime() function. The magical format string, passed as the first argument to strftime(), is the preferred date format for HTTP messages, as defined in RFCs 822 and 1123.

When the above script is placed on a server and requested, the output looks something like this:

 HTTP/1.1 200 OK
 Date: Wed, 10 Jan 2001 03:13:11 GMT
 Server: Apache/1.3.3 (Unix)
 Cache-control: max-age=300
 Expires: Wed, 10 Jan 2001 03:18:11 GMT
 Last-Modified: Wed, 10 Jan 2001 03:13:11 GMT
 Connection: close
 Content-Type: text/plain

 This demonstrates setting reply headers from a CGI script.

Note that the server included all of the headers we added and even rearranged them a little bit. The server also added Date, Server, and Connection headers.

Note that the response does not have a Content-length header. This is because Apache does not know how long the reply body is going to be. You should output a Content-length header from your CGI scripts if the body length is easy to calculate. As mentioned in Section 6.1.5, a missing Content-length has certain negative consequences for persistent connections.
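
When the whole body is available before anything is written, computing the length is straightforward. The sketch below shows the idea in Python rather than Perl; the function name is hypothetical, and it assumes the body is small enough to build in memory and is sent as UTF-8 text.

```python
def build_cgi_reply(body_text):
    """Return the complete CGI output (headers, blank line, body) as bytes.

    The body is encoded first so that Content-length counts bytes,
    not characters.
    """
    body = body_text.encode("utf-8")
    headers = (
        "Content-type: text/plain\n"
        "Content-length: %d\n"
        "\n" % len(body)
    )
    return headers.encode("ascii") + body
```

A CGI program would write the returned bytes to its standard output in one piece. Measuring the encoded bytes matters because a body containing non-ASCII characters is longer in bytes than in characters.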

6.2.4 How to Choose Expiration Times

If I have managed to convince you that specific expiration times are a good thing, you might wonder what sorts of values you should use. Before we can answer that, you'll have to think about these related questions: how often does your content usually change? How important is it for your readers/viewers to have absolutely up-to-date content? The answer to the latter question might vary depending on the type of content.

Generally, people consider HTML pages to be more dynamic and critical than images. Thus, you can probably give images a longer expiration period than HTML files. This is a good tradeoff since images comprise about 70% of all web traffic, while HTML accounts for only 15%. For most web sites, I recommend an expiration period of between one hour and one day for HTML and between one day and one week for images. You may certainly use expiration times longer than one week, but unless the object is popular, many caches will delete it before then anyway.

If your HTML content changes daily at specific times (e.g., midnight), obviously you should use an expiration time of "modification time plus one day." However, if your content changes daily, but at random times, you'll probably want to use an expiration scheme such as "access time plus six hours." Then it will take no longer than six hours for your updated page to propagate through all web caches. If you have time-sensitive information, it is a good idea to include a timestamp somewhere on the page. For example:

 This page was last modified Fri Mar  9 02:50:34 GMT 2001.

This gives your viewers important information when they wonder if the page is up-to-date. If they believe the page may have been changed, they can ask for the latest version by clicking on their browser's Reload button.
