7.3 Why Not Join a Hierarchy?


Now that you know about the advantages of hierarchical caching, it is also important to consider carefully some of the disadvantages and potential problems. Any improvements in performance may be offset by one or more of the following issues.

Some of these issues are significant only when you establish a relationship with caches outside your own organization. For example, you probably trust a neighbor cache within your company more than you trust one that belongs to another company. The presence or absence of a business agreement between two organizations also affects many of these issues. For instance, when you pay another party for a service, it is much easier to get problems resolved quickly.

7.3.1 Trust

You may recall that we talked about trust in Chapter 3. That discussion focused on content integrity and privacy concerns with logfile information. These issues are even more important when you join a cache hierarchy. Not only must you trust your immediate neighbors, but you must also trust all of their neighbors, and so on. Again, you are trusting them to protect the privacy of your web requests and to deliver correct, unmodified documents. Hierarchies can be quite large. It's possible that your requests and responses pass through five or more caching proxies between you and the origin server.

When you get a web page, how can you tell if it is authentic? Currently, there is no good way. A scheme for verifying authenticity would most likely involve digital signatures and public encryption keys in the manner of PGP. Then it should be possible to prove the content originated with a certain entity and was not tampered with before reaching you. In practice, this can be very difficult to implement. Imagine needing to store a PGP-like public key for every web server you visit. Furthermore, the distribution of such keys requires secure channels of communication.

Since there is no end-to-end mechanism for verifying web content, clients usually have no choice but to trust that neighbor caches and origin servers deliver the correct data. If a web page becomes altered, either intentionally or not, it may go undetected. When combined with the fact that HTTP headers may be altered as well, an invalid response could remain fresh for a very long time. Since fresh objects are not normally revalidated, many people may receive the wrong page or incorrect information. Furthermore, this bogus content may spread through a cache hierarchy.

In Chapter 3, we also talked about the need to protect users' privacy. Of course, when you use neighbor caches, many of the requests from your users end up in the neighbor's log files. There are obvious privacy concerns here. The operators of the other cache may easily be able to determine something about your company or your users that you or your users would rather keep secret.

7.3.2 Low Hit Ratios

The hit ratios from parent and sibling caches are normally quite low when compared to a cache that services end users directly. For example, let's say you have a standalone cache that has a hit ratio of 35%. The other 65% of your requests are cache misses that must be forwarded to origin servers. If you establish a parent or sibling relationship with another cache, you can expect that about 5% of all requests would be found as hits in your neighbor cache. In other words, you would have 35% local hits, 5% remote hits, and 60% going to origin servers. There are two primary reasons for low neighbor hit ratios:

A response that passes through two caches, such as a parent and a child, is usually cached by both. This means, of course, that some responses are duplicated in both caches. The degree of duplication depends on the size of each cache. If the two caches have the same size, then all cache hits are satisfied by the first cache. If, on the other hand, a parent cache is significantly larger than its child cache, objects that quickly get removed from the child may still be found in the parent.

First-level caches (that service end users directly) usually have more clients than a parent cache does. A first-level cache may serve hundreds or thousands of users. A typical parent cache probably serves no more than 20 child caches. More clients means more cache hits, because there is a smaller probability that any single client is the first to request a particular URI.

In an ideal parent-child relationship, the parent cache would be about an order of magnitude larger than the child cache. In practice, it may not be possible to build or find such a large parent cache. At the minimum, the parent should be at least twice as large as the child. If the two caches are nearly the same size, then only a small percentage of requests result in cache hits at the parent.
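
The arithmetic behind the hit ratio example in this section can be sketched in a few lines of Python. The function name is hypothetical, and the 35% and 5% figures are simply the illustrative values used in the text, not measurements.

```python
def request_breakdown(local_hit_ratio, neighbor_hit_ratio):
    """Split all requests into local hits, neighbor hits, and origin fetches.

    Both ratios are fractions of *all* requests, as in the text's example.
    """
    origin = 1.0 - local_hit_ratio - neighbor_hit_ratio
    # Share of the local cache's misses that the neighbor manages to absorb.
    neighbor_share_of_misses = neighbor_hit_ratio / (1.0 - local_hit_ratio)
    return {
        "local": local_hit_ratio,
        "neighbor": neighbor_hit_ratio,
        "origin": origin,
        "neighbor_share_of_misses": neighbor_share_of_misses,
    }


breakdown = request_breakdown(0.35, 0.05)
for name, share in breakdown.items():
    print(f"{name:26s} {share:.1%}")
```

With these example numbers, the neighbor satisfies fewer than one in twelve of the requests that miss locally, which is why the benefit of a neighbor can be modest.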

7.3.3 Effects on Routing

We've already discussed the fact that proxy caches alter the flow of packets through a network (see Section 3.10). In some cases, this can be advantageous, while at other times, it causes problems. As a rule of thumb, when neighbor caches are close, there are fewer problems. As the distance (router hops) increases, so do the effects of routing differences.

As an example, let's say you have a parent cache that has multiple connections to the Internet. Your own cache uses a different Internet connection. If one of the parent's connections goes down, it won't be able to reach some origin servers. Requests sent to the parent may result in "connection timed out" error messages. Since your own cache has a different route to the Internet, it may still be able to reach those origin servers. If the outage is severe enough, you may want to terminate the parent relationship, at least temporarily. Squid has some features that attempt to detect such failures automatically and work around the problem.

7.3.4 Freshness

Maintaining consistency and freshness between members of a hierarchy can be difficult. Consider a child cache with two parents where each cache has saved a copy of a particular URI and the cached response is still fresh. A user generates a no-cache request by clicking on her Reload button. This request goes through the child cache and the first parent cache. The resource has been modified recently, so the origin server returns a response with new content. The new response is saved in the child cache and the first parent. However, at this point in time, the second parent has an older version of the resource, which it believes is still fresh.
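
The scenario above can be reproduced with a toy model. This is only a sketch: the `Cache` class, its `fetch` method, and the `/page` URI are invented for illustration, and the model treats every stored copy as fresh rather than implementing real HTTP expiration.

```python
class Cache:
    """Toy cache: stores one body per URI and treats every stored copy as fresh."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def fetch(self, uri, origin, parent=None, no_cache=False):
        # A no-cache request (a Reload) bypasses the stored copy.
        if not no_cache and uri in self.store:
            return self.store[uri]
        if parent is not None:
            body = parent.fetch(uri, origin, no_cache=no_cache)
        else:
            body = origin[uri]  # no parent: go straight to the origin server
        self.store[uri] = body
        return body


origin = {"/page": "v1"}
child, parent1, parent2 = Cache("child"), Cache("parent1"), Cache("parent2")

# All three caches end up holding the original version.
child.fetch("/page", origin, parent=parent1)
parent2.fetch("/page", origin)

# The resource changes, and a user reloads through the child and the first parent.
origin["/page"] = "v2"
child.fetch("/page", origin, parent=parent1, no_cache=True)

print(child.store["/page"], parent1.store["/page"], parent2.store["/page"])
# prints: v2 v2 v1
```

The second parent never sees the reload, so nothing tells it that its copy is out of date.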

How can we avoid this situation? The best way is to use some sort of object invalidation process. When a cache discovers a resource has been updated, it "broadcasts" invalidation messages to its neighbors. Some of the protocols that we'll discuss in Chapter 8 have invalidation features. It seems that relatively few products support invalidation at this time, however. Mirror Image Internet has a caching service that makes use of invalidation between nodes. There is currently some interest within the IETF to create a standard invalidation protocol. You may want to search the IETF pages and databases for "RUP," which stands for Resource Update Protocol.

7.3.5 Large Families

If you consider a hierarchy like the one in Figure 7-1, you can see that the upper-layer nodes need to support all the traffic from the lower layers. This is the famous scaling issue found in many aspects of the Internet. The issue brings a number of questions to mind. Can a single parent cache support the load from hundreds, thousands, or even more child caches? How many levels deep should a hierarchy be? At what point do the uppermost nodes become a bottleneck? I do not have a simple answer to these questions. It will depend on many factors, such as the performance of your systems and the request rate. Every product and every architecture has its limits, but finding that limit may be hard. Most likely, you'll have to take a wait-and-see approach.

A caching proxy can become a bottleneck for a number of reasons. A caching system consists of many components, each of which has finite resources. Every product has its limits, but different products have different limits, due to either their design or their particular hardware. The common bottlenecks include network media bandwidth, disk drives, available memory, and network state information. (We'll talk more about performance in Chapter 12.)

When the incoming load exceeds a single proxy cache's resources and leads to performance degradation, steps must be taken to rectify the situation. Typically, you have three options: upgrade the cache, create a cluster of caches, or reduce the incoming load. Upgrading a cache may be as simple as adding disks or replacing a CPU. On the other hand, it may require the purchase of an entire new system. Cache clusters provide a scalable solution to this problem. (We'll talk more about clusters in Chapter 9.) Finally, you may want or need to decrease the load from lower layers. This involves asking the other administrators to stop using your cache as a parent and instead forward their requests directly to origin servers.

7.3.6 Abuses, Real and Imagined

As I mentioned in Section 1.5.2, a caching proxy hides the client's IP address. Whenever a proxy forwards a request to an origin server, the server logs a connection from the proxy's IP address. If a content or service provider believes their resources are being abused, they almost always contact the person or organization associated with the source IP address. A parent cache that forwards traffic for thousands of users is likely to receive a few email messages from angry providers complaining about one thing or another. Sometimes the complaints are legitimate, but usually they are not.

Credit card fraud is a good example of a legitimate complaint. People who buy and sell lists of stolen credit card numbers use automated software to figure out which are still valid. It seems that they repeatedly submit orders for products via web sites. If the order is accepted, the card number is valid. By using a hierarchy of caching proxies, they hide from the merchant site and force the merchant to deal with the proxy administrator. A deeper hierarchy is better for the criminals because it's less likely they will be found.

Fortunately, almost all web sites use SSL encryption for credit card transactions. Since caching proxies cannot store encrypted responses anyway, parent caches should be configured to deny all SSL requests. The only reason for a caching proxy to tunnel SSL requests is when the users are behind a firewall. By denying SSL requests, we force users to connect directly to origin servers instead. If the users are doing something illegal, this makes it easier to identify them and take appropriate action.

Though it is obvious that cache administrators should intervene to stop illegal activities such as credit card fraud, there are cases where the appropriate course of action is less obvious. The example I am thinking of relates to freedom of speech. Message boards and chat rooms abound on the Web, and many of them are unmoderated. Anyone can post or say whatever they like. When someone posts a message the other members find offensive, they may be able to find out that the message "originated" from your parent cache. They may ask you to do something to prevent further offensive messages. In some cases, they may block all accesses from your cache's IP address.

The fact that hierarchies aggregate traffic from lower layers is another source of potential problems. At the upper layers, the traffic intensity is significantly increased compared to a normal user. In other words, a top-level parent cache has a higher connection rate than we would expect for a single user. Some content providers interpret such traffic as web robots or even denial-of-service attacks. The xxx.lanl.gov site is famous for its anti-robot stance. If they detect apparent robot activity, all subsequent requests from that IP address are denied. Fortunately, the LANL folks understand caching proxies and are willing to make exceptions.

7.3.7 Error Messages

Proxy caches, by their nature, must occasionally generate error messages for the end user. There are some requests that a caching proxy cannot possibly satisfy, such as when the user enters an invalid hostname. In this case, the proxy returns an HTML page with an error message stating that the hostname could not be resolved. Unfortunately, the proxy can't always tell the difference between a DNS name that really doesn't exist and a temporary failure.

In most cases, neither the caching proxy nor the end user is really smart enough to identify the real cause of a particular error. If the user is to blame, we don't want them calling the support staff to complain that the proxy doesn't work. Conversely, we do want users to notify the support staff if the proxy is misconfigured or malfunctioning. Thus, error pages usually include an email address or other contact information so users can receive assistance if they need it.

A tricky situation arises when cache hierarchies cross organizational boundaries. Downstream users may receive an error message from an upstream cache. In the event of a problem, the downstream users may contact the upstream provider with support questions. The support staff from company A is probably not interested in providing assistance to the users or customers of company B, unless there is some kind of business relationship between the two.

7.3.8 False Hits

Recall that a sibling relationship requires a hit prediction mechanism, such as one or more of the intercache protocols described in the following chapter. These predictions are not always correct, due to various factors and characteristics of the intercache protocols. When a request is predicted to be a hit but turns out to be a miss, we call it a false hit.

False hits can be a serious problem for sibling relationships. By definition, the sibling relationship forbids the forwarding of cache misses. False hits are not a problem for parent relationships because the parent is willing to forward the request. Given that false hits are a reality, we have two ways to deal with them.

One way is to relax the requirements of a sibling relationship. That is, allow a small percentage of false hits to be forwarded anyway. This may require configuring the sibling to always allow misses to be forwarded. If so, the sibling is vulnerable to abuse by its neighbors.

HTTP/1.1 provides a significantly better solution to this problem. The only-if-cached cache-control directive is used on all requests sent to a sibling. If the request is a cache miss at the sibling, then it returns a 504 (Gateway Timeout) response. Upon receiving this response, the first cache knows that it should retry the request somewhere else.
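
The exchange described above might be sketched as follows. This is a simplified model, not any product's implementation: the function names, the dictionary standing in for the sibling's storage, and the `fetch_elsewhere` callback are all hypothetical. Only the only-if-cached directive and the 504 status code come from HTTP/1.1.

```python
def sibling_lookup(sibling_store, uri):
    """Sibling's answer to a request carrying 'Cache-Control: only-if-cached'."""
    if uri in sibling_store:
        return 200, sibling_store[uri]
    # Not cached, and the directive forbids forwarding: 504 (Gateway Timeout).
    return 504, None


def fetch_via_sibling(uri, sibling_store, fetch_elsewhere):
    """First cache: try the predicted sibling hit, retry elsewhere on a 504."""
    status, body = sibling_lookup(sibling_store, uri)
    if status == 504:
        # The predicted hit was a false hit; go to a parent or the origin server.
        return fetch_elsewhere(uri), "elsewhere"
    return body, "sibling"


sibling_store = {"/a": "cached copy of /a"}


def from_origin(uri):
    # Stand-in for forwarding the request to a parent or origin server.
    return f"origin copy of {uri}"


print(fetch_via_sibling("/a", sibling_store, from_origin))
print(fetch_via_sibling("/b", sibling_store, from_origin))
```

The sibling never forwards anything on behalf of its neighbor, so the sibling relationship keeps its meaning even when predictions are wrong.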

A false hit followed by a 504 response adds a small delay to the user's request. Checking the cache for a particular URI should be almost instantaneous. Thus, most of the delay is due to network transmission and should be approximately equal to two round-trip times. In most situations, this corresponds to 100 milliseconds or less. Given that sibling hits are rare (say 5%) and among those, false hits are rare as well (say 10%), very few requests overall (0.5%) experience this delay.
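
As a quick check of the estimate above, here is the same calculation in Python. The rates and the 100-millisecond delay are the text's illustrative values, and the variable names are made up for this sketch.

```python
sibling_hit_rate = 0.05   # fraction of all requests sent to a sibling as predicted hits
false_hit_rate = 0.10     # fraction of those that turn out to be misses
delay_ms = 100            # extra delay per false hit (about two round-trip times)

# Fraction of all requests that pay the false-hit penalty.
affected = sibling_hit_rate * false_hit_rate

# Extra delay averaged over every request, affected or not.
mean_added_delay_ms = affected * delay_ms

print(f"{affected:.1%} of requests affected, "
      f"{mean_added_delay_ms:.1f} ms added per request on average")
```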

7.3.9 Forwarding Loops

A forwarding loop occurs when a request is sent back and forth between two or more nodes. Looping also occurs in other systems such as email and IP routing. Normally, a well-behaved network is free of loops. They may appear, however, due to configuration mistakes or other errors.

In a proxy cache hierarchy (or a mesh), a forwarding loop can appear when two caches are configured such that each has a parent relationship with the other. For a request that is a cache miss, each cache forwards the request to the other. This configuration can result only from human error. Two caches must not be parents for each other. It is of course possible that loops appear for other reasons and in more complicated situations. For example, a group of three or more caches may have a forwarding loop. Loops with sibling caches were relatively common before the only-if-cached directive became widely used.

Fortunately, it is relatively easy for a proxy cache to detect a loop. The Via request header is a list of all proxy caches that a request has been forwarded through. By searching this list for its own hostname, a cache knows whether it has seen a particular request before. If a loop is detected, it is easily broken by sending the request to the origin server rather than to a parent cache. Of course, with interception proxying (see Chapter 5), it may not be possible to connect directly to the origin server. The request could be diverted to a caching proxy instead.
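
A minimal sketch of this check, assuming a Via value made of comma-separated elements, each of the form "protocol-version received-by". The function names and hostnames are hypothetical, and the parsing is deliberately simplified: it ignores optional comments that contain commas and compares names as plain strings.

```python
def via_hosts(via_header):
    """Return the received-by names listed in a Via header value."""
    hosts = []
    for element in via_header.split(","):
        fields = element.split()
        if len(fields) >= 2:
            # fields[0] is the protocol version, fields[1] the proxy's name.
            hosts.append(fields[1].lower())
    return hosts


def is_forwarding_loop(via_header, my_name):
    """True if this cache's own name already appears in the Via list."""
    return my_name.lower() in via_hosts(via_header)


def choose_next_hop(via_header, my_name, parent):
    """Break a detected loop by going to the origin server instead of a parent."""
    if is_forwarding_loop(via_header, my_name):
        return "origin"
    return parent


via = "1.1 child.example.com, 1.1 parent.example.com"
print(choose_next_hop(via, "child.example.com", "parent.example.com"))
print(choose_next_hop(via, "other.example.com", "parent.example.com"))
```

The first call sees its own name in the list and falls back to the origin server; the second forwards to the parent as usual.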

7.3.10 Failures and Service Denial

A proxy cache, and especially a parent cache, is potentially a single point of failure. There are a number of techniques designed to work around equipment failures transparently. Layer-four switching products have the ability to bypass servers that become unavailable. Many organizations use dual servers in a redundant configuration such that, if one system fails, the second can take over. Also, applications themselves may be able to detect upstream failures and stop sending requests to caches that appear to be down.
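
One simple way an application might detect an upstream failure is to count consecutive failed requests to a parent and stop using it once a threshold is reached. The sketch below is an assumption about how such a check could look, not a description of any particular product; the class name and the threshold of three failures are invented for illustration.

```python
class ParentStatus:
    """Tracks whether a parent cache appears to be up, based on recent results."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures  # hypothetical threshold
        self.failures = 0                 # consecutive failures seen so far

    def record(self, ok):
        # Any success resets the counter; each failure increments it.
        self.failures = 0 if ok else self.failures + 1

    @property
    def alive(self):
        return self.failures < self.max_failures


parent = ParentStatus()
for ok in (True, False, False, False):
    parent.record(ok)
print(parent.alive)   # prints: False

parent.record(True)   # a later successful probe brings the parent back
print(parent.alive)   # prints: True
```

A check like this catches total failures only. It does not notice the partial failures discussed next, such as a parent that is merely slow.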

There are some subtle problems that are more difficult to detect than a total failure. If your parent cache becomes heavily loaded, your response times increase, but all requests continue to be serviced. However, increased response times don't necessarily indicate a problem with your parent cache. It might instead be due to general Internet congestion or the failure of a major traffic exchange.

The DNS is another potential source of service denial. If your parent cache's DNS server fails, it is likely to cause an increase in response times for some of your requests. It may be enough to annoy a few of your users but not significant enough to detect and quickly diagnose, especially since failed DNS lookups are an everyday occurrence when surfing the Web anyway.

The bottom line is that you should carefully consider these and other possibilities when using a parent cache. If your parent cache is under the administration of a separate organization, there may be little you can do to get problems fixed quickly. Unless your caching product is good at detecting failures (and hopefully partial failures as well), you may find yourself disabling neighbor caches when you observe suspicious behavior or performance.
