1.4 Why Not Cache the Web?

By now, you may have the impression that web caching is a wonderful solution without any negative side effects. In fact, there are a number of important issues and consequences to understand about web caching. I'll mention some of them here, with a deeper discussion to follow in Chapter 3.

Unlike more tightly coupled systems, it can be difficult for a web cache to guarantee consistency. This means that a cache might return out-of-date information to a user. Why should this be the case? One important factor is that web servers provide only weak hints about freshness. Many responses don't have any hints at all. On-demand validation, in which the cache asks the origin server whether its copy is still current, is the only way to guarantee a cached response is up-to-date. Given the relatively high latencies involved (compared to other systems), validation can take a significant amount of time. Furthermore, the cache may not even be able to reach the server due to a network or server failure. If a validation request fails, the cache cannot know whether its response is up-to-date. Some caching products can be configured to intentionally return stale responses.
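As a rough illustration of the choices described above, the following Python sketch shows how a cache might decide what to do with a stored response. It is a simplified model under stated assumptions, not how any particular product works: the `CachedResponse` type, the `serve` function, the `allow_stale` flag, and the returned labels are all hypothetical names invented for this example, and the `validate` callback stands in for a real round trip to the origin server.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CachedResponse:
    """Hypothetical cache entry; the field names are illustrative only."""
    stored_at: float             # time the response was cached, in seconds
    expires_at: Optional[float]  # the server's freshness hint, or None if absent


def serve(entry: CachedResponse,
          now: float,
          validate: Callable[[], bool],
          allow_stale: bool = False) -> str:
    """Return a label describing how the cache would answer a request.

    `validate` models a validation request. It returns True if the cached
    copy is still current, returns False if the resource has changed, and
    raises ConnectionError when the origin server cannot be reached.
    """
    # An unexpired freshness hint lets the cache answer without contacting
    # the server. This is a hint, not a guarantee of consistency.
    if entry.expires_at is not None and now < entry.expires_at:
        return "hit (assumed fresh)"

    # No hint, or the hint has expired: only validation can confirm the copy.
    try:
        if validate():
            return "hit (validated)"
        return "miss (refetch)"
    except ConnectionError:
        # The cache cannot tell whether its copy is current. Some products
        # can be configured to return it anyway.
        return "hit (possibly stale)" if allow_stale else "error"
```

A caller would pass the current time and a function that performs the validation. With `allow_stale=True`, a failed validation yields the stored copy rather than an error, which models the configurable stale-response behavior mentioned above.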

If you've ever set up and maintained a web server, you understand how good it feels to watch the access log file and see people visiting your site. Many content providers feel the same way. They want to know exactly who their users are, which pages they view, and how often. Caches complicate their analysis. Requests served as cache hits are not logged at the origin server. Proxies also tend to hide the identity of users. For example, all users behind a caching proxy appear to come from the same IP address. Furthermore, some products have features to remove or modify HTTP request headers that could otherwise identify individual users.

Copyright has been controversial with respect to caching for quite some time. Some people feel that caches violate an author's right to control the distribution of her work. The possibility of being sued for copyright infringement prevents some people from providing caching services. HTTP does allow content providers to specify if, and how, their information should be handled and distributed by different types of caches. However, the protocol does not address copyright directly.
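To make the point about HTTP's controls concrete, here is a minimal Python sketch that reads a `Cache-Control` header value and checks whether a shared cache (such as a caching proxy) may store the response. The directives `no-store` and `private` are part of HTTP/1.1, but the function names are hypothetical, and the parsing is deliberately simplified: it ignores every other directive and does not handle quoted arguments that contain commas.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header value into a {directive: argument} dict.

    Simplified: splits on commas, so quoted arguments containing commas
    are not handled correctly.
    """
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directives without an argument (e.g. "public") map to None.
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def shared_cache_may_store(value: str) -> bool:
    """Simplified check: may a shared (proxy) cache store this response?

    Only considers the `no-store` and `private` directives.
    """
    directives = parse_cache_control(value)
    return "no-store" not in directives and "private" not in directives
```

For example, a response sent with `Cache-Control: private, max-age=60` would be rejected by `shared_cache_may_store`, while one sent with `Cache-Control: max-age=60` would be accepted. Note that these directives express handling preferences only; they say nothing about copyright.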

Some people predict that the percentage of web content that is dynamic and personalized is increasing. Dynamic responses usually should not be cached, because they cannot be reused for a future request; Jane should not receive a page that was customized for Bob. If the prediction is true, then caching will become less important over time. However, other people believe that web content is increasingly static, and that it's becoming easier to differentiate static and dynamic data. In this case, caching becomes more efficient. (Note that movies, Java applets, and Macromedia Flash are all static content, even though they can display changing images on your screen.)

These problems all highlight the ongoing struggle for control of web content. Users and their service providers want a high percentage of cache hits, because cache hits save them time and bandwidth. Some content providers, on the other hand, want fewer hits delivered from caching proxies. They don't want their content stored in servers they don't control, and they want to accurately count page accesses and track users throughout their site.

Despite these potential problems, I still feel that web caching is a worthwhile practice. This is not to say that the problems should be ignored. In fact, we'll continue talking about them throughout the rest of this book.
