8.4 Cache Digests

Two of the protocols discussed so far, ICP and HTCP, incur a lookup delay for every cache miss. Before a cache miss can be forwarded, the selection process takes about one network round-trip time to query the neighbor caches. Depending on the proximity of the neighbor caches, this could take anywhere from 10 to 300 milliseconds. Such a delay can be significant, especially for a web page with many embedded images.

The lookup delay exists because a cache doesn't know a priori which neighbors hold a particular object. Furthermore, it cannot predict which, if any, would return a cache hit for the object. If we can give a cache this advance knowledge about its neighbors, then we can eliminate the delays. We want the best of both worlds: hit predictions without delays.

As mentioned previously, CARP has no forwarding delays. However, it does not meet our requirements because the next-hop cache is chosen based solely on the requested URL. CARP does not care whether the request will be a cache hit or miss. This means, for example, that CARP is not usable in a sibling relationship.

To predict hits in a neighbor, a cache needs to know which objects are stored there. A neighbor's cache contents can be represented as a database or directory. These directories must be exchanged between neighbors. By looking up objects in the directory, we can predict cache hits. Since a cache's content changes over time, the directories must also be updated. A primary goal of our new protocol is to make the transmission and storage of this directory as efficient as possible.

Probably the worst we could do is use a plain list of cached URLs. The average URL length is about 55 bytes, or 440 bits. A URL list is a poor choice for a directory because URLs are an inefficient representation: they use only printable characters, and some character sequences appear very frequently. In information theory terms, URLs have low entropy. A better option is to use a list of hash values of the URLs. With hash values, we may have collisions, where two or more URLs have the same hash value. The probability of a collision is related to the range of the hash values and the number of URLs in the list. For a cache of about one million objects, we probably need 24 to 32 bits per object. That's a significant reduction in directory size compared with 440 bits per object in a URL list. Believe it or not, we can do even better. With an algorithm known as a Bloom filter, storage is reduced to only a few bits per object.
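
To make these sizes concrete, here is a quick back-of-the-envelope comparison in Python for a cache of one million objects; the 5-bits-per-object Bloom filter figure anticipates the sizing assumed later in this section:

    N = 1_000_000              # cached objects
    url_list  = N * 55         # plain URL list: ~55 bytes per URL
    hash_list = N * 32 // 8    # 32-bit hash values
    bloom     = N * 5 // 8     # Bloom filter at 5 bits per object

    print(url_list // 2**20, "MB")   # ~52 MB
    print(hash_list // 2**20, "MB")  # ~3 MB
    print(bloom // 2**10, "KB")      # ~610 KB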

8.4.1 Bloom Filters

The Bloom filter algorithm, first described by Burton Bloom in 1970, encodes a set of input keys with the help of some hash functions. Recall that a hash function is simply an algorithm that takes a chunk of data as input (the key) and produces an integer as output. A good hash function has the characteristic of being random. That is, given a sufficiently large number of input keys, the output values should have an apparently random distribution. Another hash function characteristic is the range of its output. Often, the range is expressed as a power of 2, such as 32 bits, or 2^32. One of the most commonly used hash functions is MD5, short for Message Digest (Version 5) [Rivest, 1992]. The MD5 hash function outputs 128-bit values and is used extensively in the current Cache Digest implementation.
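
For instance, Python's hashlib module exposes MD5 directly. The snippet below is just an illustration; splitting the 128-bit digest into four 32-bit words is one simple way to derive several hash values from a single digest (an assumption we reuse in the sketches that follow), not necessarily what any particular cache does:

    import hashlib

    # MD5 maps an arbitrary key to a 128-bit (16-byte) value.
    digest = hashlib.md5(b"http://www.squid-cache.org").digest()
    assert len(digest) == 16

    # Illustration: carve the digest into four 32-bit integers,
    # giving four hash values from one MD5 computation.
    values = [int.from_bytes(digest[i:i+4], "big")
              for i in range(0, 16, 4)]
    print(values)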

A Bloom filter is defined by two parameters: K independent hash functions and an array of M bits. The bits of the filter are used to efficiently encode a collection of N items. For a Cache Digest, the collection is the set of object keys (URIs or URLs) stored in a cache. To construct a Bloom filter, we iterate over the items in the collection. Applying the K hash functions to each item gives us K hash values. Each hash value represents a bit position in the filter that should be turned on. If a hash value is larger than M, we apply the modulus operator to keep it in range. Note that M should be less than the hash function range; otherwise, some bits can never be turned on.
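
The construction procedure translates almost directly into code. Below is a minimal sketch in Python; the class and its use of MD5-derived hash values are our own illustration (with K limited to 4, since one MD5 digest yields four 32-bit words), not a transcription of any real implementation:

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits, k_hashes):
            # K is at most 4 here: one MD5 digest gives four 32-bit words.
            self.m = m_bits
            self.k = k_hashes
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, key):
            # Derive K hash values from one MD5 digest; the modulus
            # keeps each bit position in the range [0, M).
            digest = hashlib.md5(key.encode()).digest()
            for i in range(self.k):
                word = int.from_bytes(digest[4*i:4*i+4], "big")
                yield word % self.m

        def add(self, key):
            # Turn on the K bit positions for this key; a bit that
            # is already on simply stays on.
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)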

After all items have been added, the filter should consist of approximately equal numbers of 1's and 0's. Note that more than one key can turn on the same bit. (Of course, a bit can only be turned on once. In other words, once a bit is turned on, it remains on.) This property means that it is not possible to delete an item from the filter. A given bit may be on due to multiple items. Clearing the bit removes all of these items from the filter.

Let's look at a simple example to get a better handle on Bloom filters. Assume that we have four hash functions called H1, H2, H3, and H4, and a filter size of 2^20 bits. We want to add an object, in this case a URL, to the filter. Our sample URL is http://www.squid-cache.org. Step one is to apply our four hash functions to the URL. Step two is to take the modulus of each value by 1,048,576. Finally, step three is to turn on the bits in the corresponding positions. Table 8-1 shows the values for this example.

Table 8-1. Bloom Filter Example

Hash Function   Step 1 (hash output)   Step 2 (mod 2^20)
H1              4,107,498,967          226,775
H2              2,692,669,443          974,851
H3              3,532,948,500          295,956
H4              991,683,655            779,335
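
The step-2 values are just the step-1 hash outputs reduced modulo 2^20; a quick check in Python confirms the table:

    # Step 2: reduce each step-1 hash value modulo 2**20 (1,048,576).
    step1 = [4_107_498_967, 2_692_669_443, 3_532_948_500, 991_683_655]
    print([h % 2**20 for h in step1])
    # [226775, 974851, 295956, 779335]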

After the filter has been populated, we can look up specific objects using the same procedure. We apply the same four hash functions to the URI in question and end up with four bit positions. If at least one of these bits is off, we know the URI is not present. If all of the bits are on, then there is some probability that the URI is present.
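
Lookup mirrors construction: recompute the same K bit positions and test them. Continuing the hypothetical BloomFilter sketch from above:

    # Method to add inside the BloomFilter class above:
    def lookup(self, key):
        # Any bit off: the key is definitely absent. All K bits on:
        # the key is probably present (it could be a false positive).
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

    # Usage:
    bf = BloomFilter(m_bits=2**20, k_hashes=4)
    bf.add("http://www.squid-cache.org")
    print(bf.lookup("http://www.squid-cache.org"))   # True
    print(bf.lookup("http://example.com/"))          # almost surely False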

Even when all K bits are on, we cannot conclude that an item is present. Consider the following contrived example: the key "A" turns on bits 1, 2, 5, and 9. The key "B" turns on bits 3, 6, 7, and 11. Now we test for the presence of key "C," which corresponds to bits 2, 5, 6, and 11. Even though "C" was not added to the filter, we will find all of C's bits on. This occurrence, known as a false positive, is the price we pay for the Bloom filter's compact representation. The probability that a lookup will be a false positive depends on the size of the filter, the number of hash functions, and the number of items in the collection. Bloom's paper [Bloom, 1970] gives some formulas that relate all of these parameters. In particular, this probability is:

$$P = \left(1 - \left(1 - \frac{1}{M}\right)^{NK}\right)^K$$

For large values of M, this simplifies to:

$$P \approx \left(1 - e^{-NK/M}\right)^K$$

For example, with M/N = 5 bits per object and K = 4 hash functions, the false positive probability is 9.2%. [3]

[3] Bloom's formula requires that K distinct bits are turned on for each input key. In Cache Digests, two or more hash functions could have the same value for a single input key, and thus fewer than K bits could be turned on. The probability of this occurring is the inverse of the filter size, i.e., on the order of 10^-6 for a typical cache.
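
The approximation is easy to evaluate. A small Python sketch (the function name is ours) reproduces the 9.2% figure and shows how the false positive rate falls as more bits are spent per object:

    import math

    def false_positive_rate(bits_per_object, k):
        # Large-M approximation: (1 - e^(-NK/M))^K, with M/N bits per object.
        return (1 - math.exp(-k / bits_per_object)) ** k

    print(false_positive_rate(5, 4))    # ~0.092, the 9.2% quoted above
    print(false_positive_rate(8, 4))    # ~0.024
    print(false_positive_rate(10, 4))   # ~0.012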

8.4.2 Comparing Digests and ICP

How do Cache Digests and ICP compare in terms of network bandwidth? We know that a Cache Digest is a very efficient encoding of a cache's contents, but that doesn't mean it uses less bandwidth. The Cache Digest protocol uses bandwidth in proportion to the size of a cache. ICP, on the other hand, uses bandwidth in proportion to the rate of cache requests. This means, for example, that ICP uses less bandwidth during idle periods, whereas Cache Digests probably use the same amount during idle and busy times.

We can do some simple calculations to estimate when it makes sense to switch from ICP to Cache Digests. First, let's define our parameters: S is the cache size in gigabytes, L is the mean object size in kilobytes, T is the digest update period in hours, and R is the average ICP query rate per peer in queries per second. Let's also assume the cache digest is sized at 5 bits per cache object, and the average ICP message size is 80 bytes. Then, the bandwidth that Cache Digests uses is given in bytes/second as:

$$BW_{CD} = \frac{5 \cdot S \cdot 2^{20}}{8 \cdot 3600 \cdot L \cdot T}$$

The ICP bandwidth, counting an 80-byte query and an 80-byte reply for each of the R queries per second, is:

$$BW_{ICP} = 2 \cdot 80 \cdot R = 160 \cdot R$$

Equating these two, substituting 10 KB for L, and solving for R, we get:

$$R = \frac{5 \cdot S \cdot 2^{20}}{160 \cdot 8 \cdot 10 \cdot 3600 \cdot T} \approx 0.11 \cdot \frac{S}{T}$$

Given a cache size S and digest update period T, we can now calculate the ICP query rate that uses the same amount of bandwidth. For example, a 20GB cache with a 1-hour digest update period uses the same bandwidth as 2.3 ICP queries per second. If the query rate exceeds 2.3 ICP queries per second, then Cache Digests use less bandwidth than ICP.
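
The arithmetic behind this break-even point is easy to check. A short Python sketch, using the assumptions defined above (5 bits per object, 80-byte ICP messages counting query plus reply, L = 10 KB), reproduces the 2.3 queries-per-second figure:

    # Break-even ICP query rate for a cache of S gigabytes with a
    # digest update period of T hours and mean object size L (KB).
    def breakeven_icp_rate(S, T, L=10):
        digest_bytes = 5 * S * 2**20 / (8 * L)    # 5 bits per object
        cd_bandwidth = digest_bytes / (3600 * T)  # bytes/second
        return cd_bandwidth / (2 * 80)            # 80-byte query + reply

    print(round(breakeven_icp_rate(S=20, T=1), 1))  # 2.3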

ICP and Cache Digests also differ significantly in their usage of system resources. For the most part, ICP requires little CPU processing, except for copying buffers and building packets. Cache Digests, on the other hand, may require more of the CPU because of the hash function calculations. The major difference, however, is in memory usage. ICP is stateless and therefore uses no memory. Digests, of course, require a moderate amount of memory. A cache may store its own digest on disk, but it must keep all of its neighbors' digests in memory for fast lookups. Cache Digest memory usage scales linearly with the number of neighbors.

Cache Digests eliminate the annoying per-request delays from ICP but introduce another potential source of delay. A sibling relationship that uses Cache Digests experiences some fraction of false hits. In these cases, the first cache must re-forward its request to another server. Most likely, the delays due to cache digest false hits are negligible compared to ICP delays. It may be difficult to quantify the delay caused by a false hit, but we can assume it is on the order of one network round-trip, just like an ICP query. Since false hits occur in a small percentage of cases, whereas ICP delays every request, we're already ahead by using digests. Furthermore, since ICP also may cause false hits, we're even better off with digests. Finally, remember that digest false hits cause a delay only for sibling relationships.
