E.1 The Cache Digest Implementation

only for RuBoard - do not distribute or recompile

E.1 The Cache Digest Implementation

Since digests are designed to be shared between cooperating caches, all implementations must agree on (or have some way to determine) how many and which hash functions to use, the format of the digest key, and the size of the Bloom filter. This section describes Cache Digests Version 5, as implemented in Squid. Currently, there is no other formal documentation for Cache Digests, apart from some pages on the Squid web site.

E.1.1 Keys

Let's start with the digest key. This is the chunk of data to which we apply the hash functions. The key is simply a unique identifier for each object in the cache. The URI alone is not a good cache key. Cached responses are identified by a request method, a URI, and possibly additional request fields. A Cache Digest key consists of the request method, encoded as an 8-bit value, followed by the URI string (see Figure E-1). The null byte that terminates the URI string is not included in the key. The values used to encode the request method are listed in Table E-1.

Table E-1. Cache Digest Request Method Encodings

Value	Method
	NONE
1	GET
2	POST
3	PUT
4	HEAD
5	CONNECT
6	TRACE
7	PURGE
8	OPTIONS

Not all of these request methods are cachable , and thus not all are appropriate for a Cache Digest, but the full list is included here for completeness. This encoding scheme leaves much to be desired. New request methods cannot be supported in a Cache Digest until they are assigned a value.

Figure E-1. Cache Digest key

Strictly speaking, there are numerous other attributes that should probably be included in a cache key. For example, the client may specify a language preference. If a resource (HTML page) is available in different languages, the response can vary on the client's Accept-language request header.

E.1.2 Hash Functions

Recall that the Bloom filter design uses K independent hash functions to produce K bit locations. The Cache Digest implementation uses four hash functions. Intuitively four feels like about the right number, but the decision is somewhat arbitrary. As computer scientists, we like powers of two. Two hash functions are probably not enough, and eight are probably too many.

The number four works well because it evenly divides 128the number of bits in an MD5 hash. The four Cache Digest hash functions are really just four chunks of the MD5 checksum. That is, we take the 128 bits and chop them into four 32-bit pieces. The first hash function is bits 132, the second is bits 3364, etc.

When converting a 32-bit MD5 chunk into an integer, the result depends on the byte ordering of the CPU architecture. Since Cache Digests are exchanged between all types of systems, we must enforce a consistent conversion for this step. We use the htonl () function to interpret the MD5 chunk in network byte order. This code snippet from Squid demonstrates the necessary conversion:

 unsigned int * cacheDigestHashKey(const CacheDigest * cd, const MD5 * md5) {     const unsigned int nbits = cd->mask_size * 8;     unsigned int tmp_vals[4];     static unsigned int hashed_vals[4];     memcpy(tmp_vals, md5, sizeof(tmp_vals));     hashed_vals[0] = htonl(tmp_vals[0]) % nbits;     hashed_vals[1] = htonl(tmp_vals[1]) % nbits;     hashed_vals[2] = htonl(tmp_vals[2]) % nbits;     hashed_vals[3] = htonl(tmp_vals[3]) % nbits;     return hashed_vals; }

E.1.3 Sizing the Filter

Determining the optimal filter size is an interesting problem. If the filter is too small, there are too many false positives. If it's too large, most of the bits will be 0, which results in wasted space. Bloom says that the filter has an optimal size when exactly half the bits are on. Recall from Section 8.4.1, that M is the number of bits in the filter, N is the number of objects in the cache, and K is the number of hash functions. Now we can calculate the optimal number of bits per item that should turn on half of the bits.

According to these formulas, we should use 5.77 bits per entry with four hash functions. The ratio of M/N is a local decision only. A digest user does not need to know what the ratio is; it only needs to know M , the number of bits in the digest. Thus, the digest creator determines the false-hit probability for his own digest.

The previous formulas are for the so-called optimal digest size, which, according to Bloom, is when exactly half the bits are on. This requirement determines the ratio of M/N and the false-hit probability. We might like to specify the false-hit probability first and then determine the bits per entry ratio. Solving the previous equation for M/N in terms of the probability, p , and K , we have:

For example, if we want a false-hit probability of 1%, then we need 10.5 bits per entry. Figure E-2 is a graph of all probabilities.

Figure E-2. Bits per entry versus false hit probability for four hash functions

E.1.4 Selecting Objects for the Digest

A cache need not, and probably should not, add every cached object to its digest. Any object that, when requested , would result in a cache miss should not be added. Obviously this includes uncachable objects, which might temporarily be in the cache. It also includes objects that are likely to become stale, or even removed in the near future. We'll talk more about these objects in the following section.

Generally, there are no rules for determining which objects should be put in the digest. The digest essentially advertises the objects that a cache is willing to serve to its neighbors. The cache may choose to advertise only a subset of its cache. For example, we may want to serve only small objects to our neighbors. Thus, objects that are larger than a certain threshold should not be added to the digest. In theory, a cache can have more than one digest it gives to neighbors. One digest could be given to child caches and a different one to parents.

E.1.5 False Hits and Digest Freshness

When using cache digests, a false hit can occur for one of two reasons: when a digest lookup is a false positive or when an object added to the digest becomes stale or is removed from the cache. As described previously, the false-hit probability can be controlled by adjusting the Bloom filter parameters (increasing the filter size, for example). Limiting false hits due to the second factor requires a little more work.

A Cache Digest represents a cache's contents at a single point in time. Obviously, the contents change with time, so the digest must be updated. The current implementation does not support incremental updates because we cannot remove entries from a Bloom filter. Instead, the cache rebuilds its own digest from scratch at regular intervals. In Squid, the rebuild interval defaults to one hour. In addition, a cache must update its copies of its neighbor's digests. A neighbor's digest is refreshed when its expiration time is reached. Thus, a copy of a neighbor's digest represents the state of the cache within the last hour .

In conjunction with the update interval, we can use the following heuristics to minimize false hits:

Do not add objects that will become stale before the next digest update, according to response headers or the local freshness rules. Unfortunately, a client's Cache-control request header may still cause freshness-based cache misses.
Do not add objects that are likely to be removed quickly due to the cache replacement policy. For example, with an LRU replacement policy, we can identify objects that will be removed if they are not accessed before the next digest update.

E.1.6 Exchanging Digests

Digests are transferred using HTTP. A cache periodically requests its neighbor's digest with a well-known URL. The Cache Digest protocol does not specify how or under what URL a digest is requested. Thus, each product must currently choose its own naming scheme. In Squid, the digest URL consists of the cache's hostname and port number followed by /squid-internal-periodic/store_digest , for example:

http://squid.ircache.net:3128/squid-internal-periodic/store_digest

The digest's response header includes valid Expires and Last-modified values. As you would expect, the Last-modified value indicates when the digest was created. Similarly, the Expires value indicates when the digest becomes stale and a new version is available. The update period is set by local policy. In Squid, the default value is one hour. A digest must not be used after it becomes stale. Instead, the digest user (i.e., a neighbor cache) should fetch a fresh digest.

only for RuBoard - do not distribute or recompile