Chapter 5. Interception Proxying and Caching

only for RuBoard - do not distribute or recompile

As we discussed in Chapter 4, one of the most difficult problems you might face in deploying a web caching service is getting users to use your cache. In some cases, the problem is mostly political; users might resist caching because of privacy concerns or fears they will receive stale information. But even if users are convinced to use the cache, or have no choice, administrative hurdles may still be a problem. Changing the configuration of thousands of installed clients is a daunting task. For ISPs, the issue is slightly different: they have little or no control over their customers' browser configurations. An ISP can provide preconfigured browsers to its customers, but that doesn't necessarily ensure that customers will continue to use the caching proxy.

Because of problems such as these, interception caching has become very popular recently. The fundamental idea behind interception caching (or proxying) is to bring traffic to your cache without configuring clients. This is different from a technique such as WPAD (see Section 4.4), whereby clients automatically locate a nearby proxy cache. Rather, your clients initiate TCP connections directly to origin servers, and a router or switch on your network recognizes HTTP traffic and redirects it to your cache. Web caches require only minor modifications to process requests received in this manner.
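The diversion decision described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not how any particular router or switch implements it: the `Packet` type, the `next_hop` function, and the addresses are hypothetical, and HTTP traffic is recognized simply by TCP destination port 80.

```python
from dataclasses import dataclass

TCP = 6          # IP protocol number for TCP
HTTP_PORT = 80   # assumption: HTTP is recognized by its well-known port


@dataclass(frozen=True)
class Packet:
    """Hypothetical, minimal view of the fields a diverting device inspects."""
    protocol: int
    dst_ip: str
    dst_port: int


def next_hop(packet: Packet, cache_ip: str, default_route: str) -> str:
    """Return where to forward the packet.

    HTTP packets go to the cache; everything else follows normal routing.
    The packet's destination address is left untouched: only the forwarding
    decision changes, which is why the client remains unaware of the cache.
    """
    if packet.protocol == TCP and packet.dst_port == HTTP_PORT:
        return cache_ip
    return default_route


# Usage: a web request is diverted, a mail connection is not.
web = Packet(protocol=TCP, dst_ip="192.0.2.10", dst_port=80)
mail = Packet(protocol=TCP, dst_ip="192.0.2.10", dst_port=25)
print(next_hop(web, "10.0.0.5", "10.0.0.1"))
print(next_hop(mail, "10.0.0.5", "10.0.0.1"))
```

Note that the client still addresses its packets to the origin server. That is the key difference from a configured proxy, where the client opens its connection to the cache itself.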

As wonderful as this may sound, a number of issues surround interception caching. Interception caching breaks the rules of the Internet Protocol. Routers and switches are supposed to deliver IP packets to their intended destination. Diverting web traffic to a cache is similar to a postal service that opens your mail and reads it before deciding where to send it or whether it needs to be sent at all.^[1] The phrase connection hijacking is often used to describe interception caching, as a reminder that it violates the Internet Protocol standards. Interception also leads to problems with HTTP. Clients may not send certain headers, such as Cache-Control, when they are unaware of the caching proxy.

^[1] Imagine how much work the postal service could avoid by not delivering losing sweepstakes entries. Imagine how upset Publisher's Clearinghouse would be if they did!

Interception proxies are also known as transparent proxies. Even though the word "transparent" is very common, it is a poor choice for several reasons. First of all, "transparent" doesn't really describe the function. We hope that users remain unaware of interception caches, and all web caches for that matter. However, interception proxies are certainly not transparent to origin servers. Furthermore, interception proxies are known to break both HTTP and IP interoperability. Another reason is that RFC 2616 defines a transparent proxy to mean something different. In particular, it states, "A 'transparent proxy' is a proxy that does not modify the request or response beyond what is required for proxy authentication and identification." Thus, to remain consistent with documents produced by the IETF Web Replication and Caching working group, I use the term interception caching.

In this chapter, we'll explore how interception caching works and the issues surrounding it. The technical discussion is broken into three sections, corresponding to different networking layers. We start near the bottom, with the IP layer. As packets traverse the network, a router or switch diverts HTTP packets to a nearby proxy cache. At the TCP layer, we'll see how the diverted packets are accepted, possibly modified, and then sent to the application. Finally, at the application layer, the cache uses some simple tricks to turn the original request into a proxy-HTTP request.
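As a preview of the application-layer step, here is a minimal sketch of how an intercepted request might be turned into a proxy-HTTP request. It rests on assumptions: the function name `to_proxy_request` and its parameters are hypothetical, the request is assumed to be plain HTTP, and the server name is taken from the Host header, falling back to the original destination address when that header is missing. A real cache does considerably more validation.

```python
def to_proxy_request(request_head: str, original_dst: str) -> str:
    """Rewrite an origin-style request line into proxy (absolute-URL) form.

    request_head: the request line and headers, with CRLF line endings.
    original_dst: the address the client was really connecting to, used
                  only when the request carries no Host header (assumption).
    """
    lines = request_head.split("\r\n")
    method, target, version = lines[0].split(" ", 2)

    # A request that already names a full URL needs no rewriting.
    if target.startswith("http://"):
        return request_head

    host = original_dst
    for line in lines[1:]:
        if not line:          # blank line marks the end of the headers
            break
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
            break

    lines[0] = f"{method} http://{host}{target} {version}"
    return "\r\n".join(lines)


# Usage: the client believed it was talking to the origin server.
intercepted = "GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
print(to_proxy_request(intercepted, "192.0.2.10").split("\r\n")[0])
```

Preferring the Host header over the destination address keeps the rebuilt URL meaningful for servers that host many sites on one IP address; the address is only a last resort in this sketch.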
