Chapter 1. Introduction

The term cache has French roots and means, literally, to store. As a data processing term, caching refers to the storage of recently retrieved computer information for future reference. The stored information may or may not be used again, so caches are beneficial only when the cost of storing the information is less than the cost of retrieving or computing the information again.

The concept of caching has found its way into almost every aspect of computing and networking systems. Computer processors have both data and instruction caches. Computer operating systems have buffer caches for disk drives and filesystems. Distributed (networked) filesystems such as NFS and AFS rely heavily on caching for good performance. Internet routers cache recently used routes. Domain Name System (DNS) servers cache hostname-to-address and other lookups.

Caches work well because of a principle known as locality of reference. There are two flavors of locality: temporal and spatial. Temporal locality means that some pieces of data are more popular than others. CNN's home page is more popular than mine. Within a given period of time, somebody is more likely to request the CNN page than my page. Spatial locality means that requests for certain pieces of data are likely to occur together. A request for the CNN home page is usually followed by requests for all of the page's embedded graphics. Caches use locality of reference to predict future accesses based on previous ones. When the prediction is correct, there is a significant performance improvement. In practice, this technique works so well that we would find computer systems unbearably slow without memory and disk caches. Almost all data processing tasks exhibit locality of reference and therefore benefit from caching.
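
To make the effect of locality concrete, here is a minimal sketch of a tiny cache replaying two request traces. It assumes a least-recently-used (LRU) replacement policy, which is my choice for illustration and not something the text above prescribes. The class name LRUCache, the helper replay, and the trace contents are likewise hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny cache that evicts the least recently used key when full.

    LRU is an assumption made for this sketch; real caches may use
    other replacement policies.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # keys ordered from oldest to newest use
        self.hits = 0
        self.misses = 0

    def request(self, key):
        """Record one request and return True if it was already cached."""
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.items[key] = True
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used key
        return False


def replay(trace, capacity):
    """Replay a list of requested keys and return the number of hits."""
    cache = LRUCache(capacity)
    for key in trace:
        cache.request(key)
    return cache.hits


if __name__ == "__main__":
    # One popular page dominates this trace, so earlier requests predict later ones.
    popular = ["cnn", "cnn", "mine", "cnn", "other", "cnn"]
    # Three pages requested in rotation defeat a two-entry LRU cache.
    rotating = ["a", "b", "c", "a", "b", "c"]
    print("hits with a popular page:", replay(popular, capacity=2))
    print("hits with rotating pages:", replay(rotating, capacity=2))
```

With room for only two entries, the first trace still produces hits because the popular page keeps being reused, while the second trace evicts each page just before it is requested again.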

When requested data is found in the cache, we call it a hit. Similarly, referenced data that is not cached is known as a miss. The performance improvement that a cache provides is based mostly on the difference in service times for cache hits compared to misses. The percentage of all requests that are hits is called the hit ratio.
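
The relationship between hit ratio and service time can be sketched with a little arithmetic. The function names and the sample numbers below are made up for illustration. The second function simply weights the hit and miss service times by how often each occurs, which is a simplified model that ignores any extra cost of checking the cache on a miss.

```python
def hit_ratio(hits, misses):
    """Return the fraction of all requests that were cache hits."""
    total = hits + misses
    if total == 0:
        return 0.0  # no requests yet, so report zero rather than divide by zero
    return hits / total


def mean_service_time(ratio, hit_time, miss_time):
    """Return the expected time per request under a simple weighted model."""
    return ratio * hit_time + (1.0 - ratio) * miss_time


if __name__ == "__main__":
    # Hypothetical workload: 400 hits, 600 misses,
    # 0.01 s to serve a hit and 0.5 s to serve a miss.
    ratio = hit_ratio(hits=400, misses=600)
    print(f"hit ratio: {ratio:.0%}")
    print(f"mean service time: {mean_service_time(ratio, 0.01, 0.5):.3f} s")
```

In this model, the larger the gap between hit and miss service times, the more each additional point of hit ratio is worth.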

Any system that utilizes caching must have mechanisms for maintaining cache consistency. This is the process by which cached copies are kept up-to-date with the originals. We say that cached data is either fresh or stale. Caches can reuse fresh copies immediately, but stale data usually requires validation. The algorithms that are used to maintain consistency may be either weak or strong. Weak consistency means that the cache sometimes returns outdated information. Strong consistency, on the other hand, means that cached data is always validated before it is used. CPU and filesystem caches require strong consistency. However, some types of caches, such as those in routers and DNS resolvers, are effective even if they return stale information.
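
The difference between weak and strong consistency can be sketched in a few lines. Everything here is an assumption made for illustration: the Entry fields, the fixed time-to-live used to decide freshness, the version number used for validation, and the origin dictionary standing in for the original data. Real systems decide freshness and perform validation in their own ways.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    value: str
    version: int      # hypothetical version tag copied from the original
    stored_at: float  # time the copy was stored or last validated
    ttl: float        # seconds the copy is assumed to remain fresh


def is_fresh(entry, now):
    """A copy is fresh until its assumed time-to-live has elapsed."""
    return now - entry.stored_at < entry.ttl


def lookup(entry, origin, now, strong=False):
    """Return the cached value, validating against the origin when required.

    Weak consistency (strong=False) reuses a fresh copy without checking,
    so it may return outdated information. Strong consistency validates
    on every lookup.
    """
    if not strong and is_fresh(entry, now):
        return entry.value
    # Validation: compare versions and refresh the copy if the original changed.
    if origin["version"] != entry.version:
        entry.value = origin["value"]
        entry.version = origin["version"]
    entry.stored_at = now
    return entry.value


if __name__ == "__main__":
    origin = {"value": "new page", "version": 2}  # the original has changed
    weak = Entry("old page", version=1, stored_at=0.0, ttl=60.0)
    strong = Entry("old page", version=1, stored_at=0.0, ttl=60.0)
    print(lookup(weak, origin, now=30.0))                 # fresh copy reused as-is
    print(lookup(strong, origin, now=30.0, strong=True))  # validated first
    print(lookup(weak, origin, now=90.0))                 # stale, so validated
```

The weak lookup saves a trip to the origin at the cost of sometimes serving the old page, which is the trade-off described above.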

We know that caching plays an important role in modern computer memory and disk systems. Can it be applied to the Web with equal success? Ask different people and you're likely to get different answers. For some, caching is critical to making the Web usable. Others view caching as a necessary evil. A fraction probably consider it just plain evil [Tewksbury, 1998].

In this book, I'll talk about applying caching techniques to the World Wide Web and try to convince you that web caching is a worthwhile endeavor. We'll see how web caches work, how they interact with clients and servers, and the role that HTTP plays. You'll learn about a number of protocols that are used to build cache clusters and hierarchies. In addition to talking about the technical aspects, I'll also spend a lot of time on the issues and politics. The Web presents some interesting problems due to its highly distributed nature.

After you've read this book, you should be able to design and evaluate a caching proxy solution for your organization. Perhaps you'll install a single caching proxy on your firewall, or maybe you need many caches located throughout your network. Furthermore, you should be well prepared to understand and diagnose any problems that may arise from the operation or failure of your caches. If you're a content provider, then I hope I'll have convinced you to increase the cachability of the information you serve.
