Appendix A. Analysis of Production Cache Trace Data

In this appendix, we'll look at some interesting characteristics of web traffic, such as reply size distributions, HTTP headers, and expiration times. Such data is useful for a number of reasons. First, the information in this appendix backs up some of the statements I made earlier in the book. For example, when I said that small files are more popular than large ones, I wasn't just making that up. Second, this data can help you make decisions regarding your own caching proxies. For example, the hit ratio analysis demonstrates how increasing your cache size may result in higher hit ratios.

For these analyses, I use data from two different sources. One is the NLANR/IRCache project, consisting of nine caches I maintain throughout the U.S. ^[A] The other is a proxy cache located at a U.S. university, which I'll call Anon-U. All data comes from production Squid caches with real users.

^[A] The NLANR/IRCache project is funded by the National Science Foundation, grants NCR-9616602 and NCR-9521745.

I use the IRCache data for most analyses because it is significantly larger and includes more information. The IRCache set includes client access logs, cache "store" logs, and HTTP header logs. The access logs are from March 5–25, 2000, and contain 216 million responses. The store logs are from March 8–25, 2000, and contain 71 million entries. The header logs are from April 2–29, 2000, and contain 268 million request and response entries.

The IRCache proxies are unique in certain ways that can skew the data. In other words, the data collected from these proxies does not necessarily represent typical web traffic. In particular, keep the following points in mind while reading this appendix:

Most of the IRCache clients are other caches. In some cases, there are three or more caches between the user and my cache. Many requests that would be hits are filtered out by the lower-layer caches. This tends to reduce the caches' hit ratios.
Many clients use ICP or Cache Digests, so they request only cache hits. This tends to increase the caches' hit ratios.
A number of clients use the IRCache proxies to bypass filtering in their own organization. Thus, these caches may see a higher percentage of pornography, etc., than a typical cache does.

The Anon-U data consists of 21 million access log entries from May 1–31, 1999. To protect the privacy of that cache's users, both the URLs and client IP addresses have been randomized. The URLs are sanitized in a way that removes information such as hostnames, protocols, port numbers, and filename extensions.

When the analysis results in a distribution (e.g., reply sizes), I report both mean and median values. Many of the distributions have heavy tails, so a small number of very large values pulls the mean well above what most observations look like. In these cases, the median is a better representation of a typical value.
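
To see why a heavy tail makes the mean less informative, consider the following minimal Python sketch. The reply sizes are invented for illustration and are not taken from the IRCache or Anon-U logs; the variable names are likewise hypothetical.

```python
from statistics import mean, median

# Hypothetical reply sizes in bytes (not real trace data):
# nine small replies plus a single very large download.
reply_sizes = [500, 800, 1_200, 2_000, 3_500, 4_000, 6_000, 9_000, 15_000, 50_000_000]

mean_size = mean(reply_sizes)      # dominated by the one huge reply
median_size = median(reply_sizes)  # midpoint of the sorted sizes

print(f"mean:   {mean_size:,.0f} bytes")
print(f"median: {median_size:,.0f} bytes")
```

With these made-up numbers, the single 50 MB reply puts the mean at about 5 MB, even though nine of the ten replies are 15 KB or smaller. The median, 3,750 bytes, is much closer to what a typical reply looks like.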

When looking at traffic statistics collected from web caches, it's important to understand where the data comes from and what it represents. I can think of at least four different ways to collect and analyze caching proxy data:

Per request, as seen on the network between the cache and its clients
Per object (URL) stored in the cache
Per request, as seen on the network between the cache and origin servers
Per object (URL) stored at origin servers

Most of the analyses in this appendix use the first technique: per request between the cache and its clients. The exceptions are object sizes and cachability, both of which are per object. In each of the following subsections, I'll explain a little about the procedures used to generate the results.
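
The difference between per-request and per-object accounting can be illustrated with a short Python sketch. This is not the procedure used to produce the results in this appendix; the log entries, the Request record, and both function names are assumptions made up for the example.

```python
from collections import namedtuple

# Hypothetical, simplified log record: one entry per client request.
Request = namedtuple("Request", ["url", "size"])

# Invented sample log: a small image requested three times,
# plus two other objects requested once each.
log = [
    Request("http://example.com/a.gif", 1_000),
    Request("http://example.com/a.gif", 1_000),
    Request("http://example.com/a.gif", 1_000),
    Request("http://example.com/b.html", 4_000),
    Request("http://example.com/c.zip", 100_000),
]

def per_request_mean(entries):
    """Mean size where every request counts once."""
    return sum(e.size for e in entries) / len(entries)

def per_object_mean(entries):
    """Mean size where every distinct URL counts once."""
    sizes = {}
    for e in entries:
        sizes[e.url] = e.size  # keep the most recently seen size per URL
    return sum(sizes.values()) / len(sizes)

print(per_request_mean(log))  # popular objects are weighted by request count
print(per_object_mean(log))   # each URL contributes exactly once
```

In this toy log, the per-request mean is 21,400 bytes and the per-object mean is 35,000 bytes. The popular small image is counted three times in the first figure but only once in the second, so the two methods can give quite different answers for the same traffic.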
