2.2 Is It Cachable? | Web Caching

only for RuBoard - do not distribute or recompile

2.2 Is It Cachable?

The primary purpose of a cache is to store some of the responses it receives from origin servers. A response is said to be cachable if it can be used to answer a future request. For typical request streams, about 75% of responses are cachable.

A cache decides if a particular response is cachable by looking at different components of the request and response. In particular, it examines the following:

The response status code
The request method
Response Cache-control directives
A response validator
Request authentication

These different factors interact in a somewhat complicated manner. For example, some request methods are uncachable unless allowed by a Cache-control directive. Some status codes are cachable by default, but authentication and Cache-control take precedence.

Even though a response is cachable, a cache may choose not to store it. Many products include heuristics (or allow the administrator to define rules) that avoid caching certain responses. Some objects are more valuable than others. An object that gets requested frequently (and results in cache hits) is more valuable than an object that is requested only once. Many dynamic responses fall into the latter category. If the cache can identify worthless responses, it saves resources and increases performance by not caching them.

2.2.1 Status Codes

One of the most important factors in determining cachability is the HTTP server response code, or status code. The three-digit status code indicates whether the request was successful or if some kind of error occurred. The status codes are divided into the following five groups:

1xx: An informational, intermediate status. The transaction is still being processed.
2xx: The request was successfully received and processed.
3xx: The server is redirecting the client to a different location.
4xx: There is an error or problem with the client's request. For example, authentication is required, or the resource does not exist.
5xx: An error occurred on the server for a valid request.

Refer to Appendix F for the complete list, or see Section 10 of RFC 2616.

The most common status code is 200 (OK), which means the request was successfully processed. The 200 code and a number of others shown in Table 2-1 are cachable by default. However, some other aspect of the request or response can make the response uncachable. In other words, the status code alone is not enough to make a response cachable. We'll talk about the other factors shortly.

Table 2-1. Cachable Response Codes

Code	Description	Explanation
200	OK	Request was processed successfully.
203	Non-Authoritative Information	This is similar to 200, but it can be used when the sender has reason to believe the given entity headers are different from those that the origin server would send.
206	Partial Content	The 206 response is similar to 200, but it is a response to a range request. A 206 response is cachable if the cache fully supports range requests. ^[3]
300	Multiple Choices	The response includes a list of appropriate choices from which the user should make a selection.
301	Moved Permanently	The requested resource has been moved to a new location. The new URL is given in the response headers.
410	Gone	The requested resource has been intentionally and permanently removed from the origin server.

^[3] HTTP/1.1 defines the range request. It allows clients to request a specific subset of a resource, rather than the entire thing. For example, a user agent can request specific pages from a large PDF document instead of transferring the whole file.

All other response codes are uncachable by default but can be cached if explicitly allowed by other means. For example, if a 302 (Moved Temporarily) response also has an Expires header, it can be cached. In reality, there is relatively little to gain from caching these uncachable-by-default status responses, even when allowed. These status codes occur infrequently to begin with. It's even more unlikely that the response also includes Expires or Cache-control headers to make them cachable. Thus, it is simpler and safer for a cache never to store one of these responses.
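The status-code test can be sketched in a few lines of Python. This is only an illustration of Table 2-1, not code from any particular caching product; the names CACHABLE_BY_DEFAULT, status_class, and status_cachable_by_default are made up for this example, and the supports_ranges flag is an assumed way of expressing "the cache fully supports range requests."

```python
# Status codes that Table 2-1 lists as cachable by default.
CACHABLE_BY_DEFAULT = frozenset({200, 203, 206, 300, 301, 410})


def status_class(code: int) -> str:
    """Return the group ('1xx' through '5xx') that a status code belongs to."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return f"{code // 100}xx"


def status_cachable_by_default(code: int, supports_ranges: bool = False) -> bool:
    """Apply only the status-code test; other factors may still override it."""
    if code == 206:
        # A 206 is cachable only if the cache fully supports range requests.
        return supports_ranges
    return code in CACHABLE_BY_DEFAULT
```

A result of False here means only "uncachable by default." As described above, a 302 response with an Expires header could still be cached by a cache that chooses to do so.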

2.2.2 Request Methods

Another significant factor in determining cachability is the request method. Here, the rules are somewhat simpler. As shown in Table 2-2, we really have to worry about only three methods: GET, HEAD, and POST.

Table 2-2. Request Methods and Cachability

Request Method	Cachable?
GET	Yes, cachable by default
HEAD	May be used to update previously cached entry
POST	Uncachable by default; cachable if `Cache-control` headers allow
PUT	Never cachable
DELETE	Never cachable
OPTIONS	Never cachable
TRACE	Never cachable

GET is the most popular request method, and responses to GET requests are by default cachable. Responses to HEAD requests are treated specially. HEAD response messages do not include bodies, so there is really nothing to cache. However, we can use the response headers to update a previously cached response's metadata. For example, a HEAD response can return a new expiration time. Similarly, the response may instead indicate that the resource has changed, which means the cached copy is now invalid. A POST response is cachable only if the response includes an expiration time or one of the Cache-control directives that overrides the default. Cachable POST responses are quite rare in reality.

Responses from all other request methods are always uncachable. In addition, caching proxies must not reuse responses from unknown ("extension") methods. Some RFCs, such as 2518 (WEBDAV), define extensions to HTTP, including new request methods. It's possible that such extensions allow responses from new methods to be cached. RFC 2518, however, does not state that any of the WEBDAV methods are cachable.
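Table 2-2 and the rule for extension methods translate naturally into a lookup with a conservative default. The following Python sketch is one way to express that; MethodRule and method_rule are hypothetical names chosen for this example.

```python
from enum import Enum


class MethodRule(Enum):
    CACHABLE = "cachable by default"
    UPDATE_ONLY = "may be used to update a previously cached entry"
    NEEDS_PERMISSION = "cachable only if the response headers allow it"
    NEVER = "never cachable"


# Only three methods need special handling (Table 2-2).
_RULES = {
    "GET": MethodRule.CACHABLE,
    "HEAD": MethodRule.UPDATE_ONLY,
    "POST": MethodRule.NEEDS_PERMISSION,
}


def method_rule(method: str) -> MethodRule:
    """Classify a request method; anything not listed is never cachable.

    Method names are case-sensitive in HTTP, so "get" is treated as an
    unknown extension method rather than as GET.
    """
    return _RULES.get(method, MethodRule.NEVER)
```

Falling back to NEVER means that PUT, DELETE, OPTIONS, TRACE, and any unknown method (such as the WEBDAV methods) are all handled the same way without being listed.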

2.2.3 Expiration and Validation

HTTP/1.1 provides two ways for caches to maintain consistency with origin servers: expiration times and validators. Both ensure that users always receive up-to-date information. Ideally, every cachable response would include one or both; in reality, a small but significant percentage of responses have neither.

Expiration and validation affect cachability in two important ways. First, RFC 2616 says that responses with neither an expiration time nor a cache validator should not be cached. Without these pieces of information, a cache can never tell if a cached copy is still valid. Note, however, that this is only a recommendation, not a requirement. Storing and reusing these types of responses does not violate the protocol. Second, an expiration time turns normally uncachable responses into cachable ones. For example, responses to POST requests and 302 status messages become cachable when the origin server provides an expiration time.

It's very important to understand the subtle difference between expiration and cachability. Cachable means that the response can be stored in the proxy. Expired responses may still be cached, but they must be validated before being used again. Even though a response expires, it does not mean that the resource has changed.

In fact, some web sites send pre-expired responses. This means that the cache must validate the response the next time someone requests it. Pre-expiration is useful for origin servers that want to closely track accesses to their site but also want their content to be cachable. The proper way to pre-expire a response is to set the Expires header equal to the Date header. For example:

 Date: Sun, 01 Apr 2001 18:32:48 GMT
 Expires: Sun, 01 Apr 2001 18:32:48 GMT

An alternative method is to send an invalid date or use the value "0":

 Expires: 0

Unfortunately, the meaning of an expiration time in the past and an invalid expiration value has changed over time. Under HTTP/1.0 (RFC 1945), caches are not supposed to store such responses. ^[4] However, HTTP/1.1 allows them to be cached. This causes confusion, as well as problems for people running old HTTP servers, which RFC 2616 recognizes:

^[4] Since HTTP/1.0 lacks other cache-specific headers, this was the only way to mark a response as uncachable. Some HTTP/1.0 agents won't cache a response that includes Pragma: no-cache, but this is not mentioned in RFC 1945.

Many HTTP/1.0 cache implementations treat an Expires value that is less than or equal to the response Date value as being equivalent to the Cache-Control response directive "no-cache". If an HTTP/1.1 cache receives such a response, and the response does not include a Cache-Control header field, it SHOULD consider the response to be non-cachable in order to retain compatibility with HTTP/1.0 servers.

We'll talk more about expiration in Section 2.3, and about validation in Section 2.5. For now, the important point is this: in order to be cachable, a response should have an expiration time, a validator, or both. Responses with neither should not be cached. However, HTTP/1.1 allows them to be cached anyway without violating the protocol.
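To make the pre-expiration rules concrete, here is a small Python sketch that detects a pre-expired response and applies the HTTP/1.0 compatibility rule quoted from RFC 2616. The function names are invented for this example, and the headers are assumed to arrive as a plain dictionary of header names to values. Date parsing relies on parsedate_to_datetime from Python's standard email.utils module.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_http_date(value: str) -> Optional[datetime]:
    """Return the parsed date, or None if the value is not a valid date."""
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # e.g. "0" or an empty string
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_pre_expired(headers: dict) -> bool:
    """True if Expires is invalid (such as "0") or not later than Date."""
    lowered = {name.lower(): value for name, value in headers.items()}
    raw_expires = lowered.get("expires")
    if raw_expires is None:
        return False  # no expiration time was given at all
    expires = parse_http_date(raw_expires)
    if expires is None:
        return True  # an invalid value counts as already expired
    # Assumption: fall back to the local clock if Date is missing or invalid.
    date = parse_http_date(lowered.get("date", "")) or datetime.now(timezone.utc)
    return expires <= date


def http10_compat_uncachable(headers: dict) -> bool:
    """Pre-expired and no Cache-control header: treat as non-cachable."""
    has_cache_control = any(name.lower() == "cache-control" for name in headers)
    return is_pre_expired(headers) and not has_cache_control
```

With the Date and Expires values from the example above, is_pre_expired returns True, and so does http10_compat_uncachable unless the response also carries a Cache-control header.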

2.2.4 Cache-control

The Cache-control header is a new feature of HTTP/1.1 used to tell caches how to handle requests and responses. The value of the header is one or more directive keywords, for example:

 Cache-control: private
 Cache-control: public,max-age=86400

Although Cache-control appears in both requests and responses, our discussion in this section focuses only on directives that appear in a response. We're not interested in request directives yet because, with one exception, they don't affect response cachability.

Note that the Cache-control directives override the defaults for most status codes and request methods when determining cachability. In other words, some Cache-control directives can turn a normally cachable GET response into one that is uncachable, and vice-versa for POST responses.
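Before a cache can act on these directives, it has to split the header value into individual keywords and arguments. The following Python sketch shows one simple way to do that; parse_cache_control is a hypothetical helper, not part of any standard library, and it makes no attempt to validate directive names or handle malformed quoting.

```python
def parse_cache_control(*header_values: str) -> dict:
    """Map each directive name (lowercased) to its argument, or to None.

    Several Cache-control header lines may be passed; they are merged.
    Commas inside a quoted argument do not separate directives.
    """
    directives = {}

    def store(chars):
        name, sep, arg = "".join(chars).strip().partition("=")
        if name.strip():
            directives[name.strip().lower()] = arg.strip() if sep else None

    for value in header_values:
        part, in_quotes = [], False
        for ch in value:
            if ch == '"':
                in_quotes = not in_quotes  # quotes are dropped from the result
            elif ch == "," and not in_quotes:
                store(part)
                part = []
            else:
                part.append(ch)
        store(part)  # flush the last directive on the line
    return directives
```

For the two example headers shown earlier, parse_cache_control("private") yields a dictionary with the single key private, and parse_cache_control("public,max-age=86400") yields public with no argument and max-age with the argument "86400".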

Here are the Cache-control directives that may appear in an HTTP response and affect its cachability:

no-cache

The no-cache directive is, unfortunately, somewhat confusing. Its meaning has changed in a significant way over time. The January 1997 draft standard RFC for HTTP/1.1, number 2068, says: "no-cache indicates that all or part of the response message MUST NOT be cached anywhere." This statement is quite clear and matches what people intuitively think "no-cache" means.

In RFC 2616 (June 1999), however, the directive's meaning has changed. Now, responses with no-cache can be stored but may not be reused without validation ("a cache MUST NOT use the response to satisfy a subsequent request without successful revalidation with the origin server"). Not only is this counterintuitive, but it seems to mean the same thing as must-revalidate with max-age=0.

Note that, even though a response with no-cache can be stored, it is not one of the Cache-control directives that turns uncachable responses into cachable ones. In fact, it doesn't affect cachability at all. Rather, it instructs a cache to always validate the response if it has been cached.

The no-cache directive also has a secondary use. When the directive specifies one or more header names, those headers must not be sent to a client without validation. This enables origin servers to prevent sharing of certain headers while allowing the response content to be cached and reused. An example of this usage is:

 Cache-control: no-cache="Set-cookie"

private

The private directive gives user agent caches permission to store a response but prevents shared caching proxies from doing so. The directive is useful if the response contains content customized for just one person. An origin server might also use it to track individuals and still allow responses to be cached.

For a shared cache, the private directive makes a normally cachable response uncachable. However, for a nonshared cache, a normally uncachable response becomes cachable. For example, a browser may cache a response to a POST request if the response includes the Cache-control: private directive.

public

The public directive makes normally uncachable responses cachable by both shared and nonshared caches.

The public directive takes even higher precedence than authorization credentials. That is, if a request includes an Authorization header, and the response headers contain Cache-control: public, the response is cachable.

max-age

The max-age directive is an alternate way to specify an expiration time. Recall from the previous section that some normally uncachable responses become cachable when an expiration time is given. Also, RFC 2616 recommends that responses with neither a validator nor an expiration time should not be cached. When a response contains the max-age directive, the public directive is implied as well.

s-maxage

The s-maxage directive is very similar to max-age, except it only applies to shared caches. Like its cousin, the s-maxage directive allows normally uncachable responses to be cached. If both are present in a response, a shared cache uses the s-maxage value and ignores all other expiration values.

Unlike max-age, s-maxage does not also imply the public directive. It does, however, imply the proxy-revalidate directive.

must-revalidate

This directive also allows caches to store a response that is normally uncachable. Since the must-revalidate directive deals with validation, we'll talk about it in the following section.

proxy-revalidate

The proxy-revalidate directive is similar to must-revalidate, except it applies only to shared caches. It also allows caches to store a normally uncachable response.

no-store

The no-store directive causes any response to become uncachable. It may also be present in a request, in which case the corresponding response is uncachable.

Section 14.9.2 of RFC 2616 describes no-store with relatively strong language. Requests and responses with this directive must never be written to nonvolatile storage (i.e., disk), even temporarily. It is a way for paranoid content providers to decrease the probability that sensitive information is inadvertently discovered or made public.

Now that no-cache has a different meaning, however, no-store becomes very important. It is the only directive (except private) that prevents a response from getting cached. Thus, we're likely to see no-store used for many types of responses, not just super-sensitive data.

To summarize, public, max-age, s-maxage, must-revalidate, and proxy-revalidate turn normally uncachable responses into cachable ones. no-store and private turn normally cachable responses into uncachable ones for a (shared) caching proxy. At the same time, private turns an uncachable response into a cachable one for a (nonshared) user agent cache.
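The summary above can be written down as a small decision function. The Python sketch below is one reading of these rules, not a statement of what RFC 2616 mandates. In particular, checking no-store first and private second is an assumption about precedence, and counting s-maxage and proxy-revalidate only for shared caches follows the earlier descriptions of those two directives. The function name and its parameters are invented for this example.

```python
# Directives that turn a normally uncachable response into a cachable one.
ENABLING = {"public", "max-age", "must-revalidate"}
# Directives that do the same, but apply only to shared caches.
ENABLING_SHARED_ONLY = {"s-maxage", "proxy-revalidate"}


def response_cachable(default_cachable: bool, directives: set, shared: bool) -> bool:
    """Combine the default answer (status code, method) with Cache-control.

    default_cachable: result of the status-code and request-method tests.
    directives: names of the response's Cache-control directives.
    shared: True for a caching proxy, False for a user agent cache.
    """
    if "no-store" in directives:
        return False  # assumption: no-store wins over everything else
    if "private" in directives:
        return not shared  # uncachable for proxies, cachable for browsers
    enabling = ENABLING | ENABLING_SHARED_ONLY if shared else ENABLING
    if directives & enabling:
        return True
    # no-cache is deliberately absent: it does not affect cachability.
    return default_cachable
```

For example, a POST response (uncachable by default) carrying private gives True for a browser cache and False for a proxy, while a GET response carrying only no-cache keeps its default answer.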

The HTTP/1.1 standard defines additional cache-control directives for responses that are not presented here. Section 14.9 of RFC 2616 gives the full specification. Cache-control directives appear in HTTP requests as well, but since they generally do not affect cachability, we'll talk about those directives in Section 2.6.

2.2.5 Authentication

Requests that require authentication are not normally cachable. Only the origin server can determine who is allowed to access its resources. Since a caching proxy doesn't know which users are authorized, it cannot give out unvalidated hits.

Origin servers typically use a simple challenge-response scheme to authenticate users. When you first try to access a protected resource, the server returns a message with a 401 (Unauthorized) status code. The response also includes a WWW-Authenticate header, which contains the challenge. Upon receipt of this response, your user-agent prompts you for your authorization credentials. Normally this is a username and password. The user-agent then resubmits the request, this time including an Authorization header. When a caching proxy finds this header in a request, it knows that the corresponding response is uncachable unless the origin server explicitly allows it.

The rules for caching authenticated responses are somewhat tricky. Section 14.8 of RFC 2616 talks about the conditions under which shared caches can store and reuse such responses. They can be cached only when one of the following Cache-control directives is present: s-maxage, must-revalidate, or public. The public directive alone allows the response to be cached and reused for any subsequent request, subject to normal expiration. The s-maxage directive allows the response to be reused until the expiration time is reached. After that, the caching proxy must revalidate it with the origin server. The must-revalidate directive instructs the cache to revalidate the response for every subsequent request.

The HTTP RFC doesn't say how nonshared (user-agent) caches should handle authenticated responses. It's safe to assume that they can be cached and reused, since, by definition, the cache is not shared with other users. However, nonshared caches can actually be shared by multiple users. For example, consider terminals in university labs or Internet cafes.

I wouldn't expect many authenticated responses to include public or s-maxage cache controls. However, the must-revalidate and proxy-revalidate directives can be quite useful. They provide a mechanism that allows responses to be cached, while still giving the origin server full control over who can and cannot access the protected information.
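As a rough illustration, the rules for shared caches described in this section might be coded as follows. The AuthPolicy names and the shared_cache_auth_policy function are hypothetical. The order in which the directives are checked (most restrictive first) is an assumption for the case where a response carries more than one of them; the text above does not specify it.

```python
from enum import Enum


class AuthPolicy(Enum):
    NOT_AUTHENTICATED = "no Authorization header; the normal rules apply"
    UNCACHABLE = "must not be stored by a shared cache"
    REVALIDATE_ALWAYS = "store, but revalidate for every request"
    REUSE_UNTIL_S_MAXAGE = "reuse until s-maxage expires, then revalidate"
    REUSE_UNTIL_EXPIRED = "reuse for any request, subject to normal expiration"


def shared_cache_auth_policy(has_authorization: bool, directives: set) -> AuthPolicy:
    """Decide how a shared cache may treat a response to an authenticated request.

    has_authorization: True if the request carried an Authorization header.
    directives: names of the response's Cache-control directives.
    """
    if not has_authorization:
        return AuthPolicy.NOT_AUTHENTICATED
    # Assumed precedence: the most restrictive directive wins.
    if "must-revalidate" in directives:
        return AuthPolicy.REVALIDATE_ALWAYS
    if "s-maxage" in directives:
        return AuthPolicy.REUSE_UNTIL_S_MAXAGE
    if "public" in directives:
        return AuthPolicy.REUSE_UNTIL_EXPIRED
    return AuthPolicy.UNCACHABLE
```

Note that a plain max-age or Expires value is not enough here: without one of the three directives, the sketch reports the response as uncachable for a shared cache.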

There is another issue related to authentication, but not cachability. Origin servers sometimes use address-based authentication rather than passwords. That is, the client's IP address determines whether or not the request is allowed. In these cases, it's possible for a caching proxy to open a back door into a protected server. If the proxy's IP address is in the range of allowed addresses, and the proxy itself allows requests from anywhere, then anyone can access the protected server via the proxy.

People are often surprised to discover that HTTP does not have a request header that specifies the client's IP address. Actually, there was such a header in early versions of HTTP/1.1, but it was taken out. The problem with such a feature is that there is no way to validate the correctness of this information. A proxy or other agent that forwards requests can easily spoof the header. It is a bad idea to use such untrustworthy information for authentication.

2.2.6 Cookies

A cookie is a device that allows an origin server to maintain session information for individual users between requests [Kristol and Montulli, 2000]. A response may include a Set-cookie header, which contains the cookie as a string of random-looking characters. When user-agents receive a cookie, they are supposed to use it in their future requests to that server. Thus, the cookie serves as a session identifier.

Cookies are typically used to represent or reference private information (e.g., a "shopping basket"), and they should not be shared between users. However, in most cases, the cookie itself is the only thing that is private. The object in the body of the reply message may be public and cachable. Rather than making the entire response uncachable, HTTP has a way to make only the cookie information uncachable. This is done with the Cache-control: no-cache directive. For example:

 Cache-control: no-cache="Set-cookie"

Origin servers can use the no-cache directive to prevent caching of other headers as well.
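A cache that honors this form of no-cache needs to know which headers it may hand out on an unvalidated hit. The Python sketch below shows the idea under some simplifying assumptions: the headers are held in a plain dictionary, the directive's argument has already been extracted (with the surrounding quotes removed), and the function names are invented for this example.

```python
def protected_fields(no_cache_argument: str) -> set:
    """Return the lowercased header names listed in a no-cache directive."""
    return {
        name.strip().lower()
        for name in no_cache_argument.split(",")
        if name.strip()
    }


def headers_reusable_without_validation(headers: dict, no_cache_argument: str) -> dict:
    """Drop the headers that must not be sent to a client without validation.

    Header names are compared case-insensitively, so "Set-cookie" in the
    directive matches a stored "Set-Cookie" header.
    """
    protected = protected_fields(no_cache_argument)
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in protected
    }
```

Given a stored response with Content-Type and Set-Cookie headers and the directive argument Set-cookie, only Content-Type remains in the result; the body can still be served from the cache.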

RFC 2965 actually talks about the Set-cookie2 and Cookie2 headers. Apparently these new headers are designed to avoid interoperability problems with the older specifications.

2.2.7 Dynamic Content

I mentioned dynamic responses briefly in Chapter 1. RFC 2616 does not directly address cachability of dynamic content. In terms of the protocol, only the request and reply headers determine cachability. It does not matter how the content is generated. Dynamic responses can be cached, subject to the previously described rules. Origin servers usually mark dynamic content as uncachable, so we have very little to worry about. Even so, it's a good idea to understand dynamic content in case problems arise. Additionally, you may experience a slight performance improvement by not caching some responses that would normally be stored.

Some people worry that a cache will incorrectly handle some dynamic content. For example, consider stock quotes and weather reports. If users happen to receive out-of-date information, they can make misinformed decisions with negative consequences. In these cases, the burden of responsibility is really on the content provider. They need to make sure their server marks dynamic responses as uncachable, using one of the HTTP headers described previously. However, rather than trusting servers (and caches) to do the right thing, some cache administrators feel more comfortable not caching any dynamic content.

Lack of popularity is another reason for not caching dynamic responses. Caches are only beneficial when the request stream includes repeat requests. An object that is requested only once has no value for a cache. Rather, it consumes resources (disk space, memory) that can be more efficiently utilized if allocated to some other object. If we can identify objects that are unlikely to be requested again, it leaves more space for more valuable objects.

How can we identify dynamic content? The HTTP reply headers provide a good indication. Specifically, the Last-modified header specifies when the resource was last updated. If the modification time is close to the current time, the content is probably dynamic. Unfortunately, about 35% of responses don't include a last-modification timestamp. Another useful technique is to look for specific strings in URLs. The following usually indicate a dynamic response:

/cgi-bin/ or .cgi

The presence of /cgi-bin/ or .cgi in a URL path indicates the response was generated from a CGI script.

/servlet/

When the URL path includes /servlet/, the response was most likely generated by the execution of a Java servlet on the origin server.

.asp

ASP pages normally have the .asp extension, for example, http://www.microsoft.com/default.asp.

.shtml

The .shtml extension indicates an HTML page that is parsed for server-side includes. These are special tags that the origin server interprets and then replaces with some dynamically generated content. With Apache, for example, you can include a file's time of last modification (here, for a hypothetical file named foo.html) with this bit of HTML:

 <!--#flastmod file="foo.html" -->

Java servlets may also be used this way inside a .shtml file.

Query terms

Another likely indication of dynamic content is the presence of a "?" (question mark) in a URL. This character delineates the optional query terms, or searchpart, of a URL. Usually, the query terms are entered by users into HTML forms and then given as parameters to CGI scripts or database frontends.
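The URL patterns listed above are easy to check in code. Here is a minimal Python sketch of such a heuristic; looks_dynamic is a hypothetical name, and matching the patterns case-insensitively is an assumption made for this example. A real cache would more likely let the administrator configure these patterns.

```python
from urllib.parse import urlsplit

DYNAMIC_EXTENSIONS = (".cgi", ".asp", ".shtml")
DYNAMIC_DIRECTORIES = ("/cgi-bin/", "/servlet/")


def looks_dynamic(url: str) -> bool:
    """Guess from the URL alone whether a response is dynamically generated."""
    # A "?" before any fragment marks the start of the query terms.
    if "?" in url.split("#", 1)[0]:
        return True
    path = urlsplit(url).path.lower()
    if any(directory in path for directory in DYNAMIC_DIRECTORIES):
        return True
    # Check every path segment, since a script name may be followed by
    # extra path information (for example /search.cgi/extra).
    return any(
        segment.endswith(DYNAMIC_EXTENSIONS) for segment in path.split("/")
    )
```

The function only guesses. A URL that matches may still name a perfectly cachable object, so a cache might use the result to lower an object's priority rather than to refuse it outright.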
