Tackling Content Distribution

We are going to build this rip-roaring static cluster serving traffic. First we'll approach the simplest of questions: How does the content get onto the machines in the first place?

A Priori Placement

A priori placement is just a fancy way of saying that the files need to be hosted on each node before they can be served from that node as depicted in Figure 6.2. This is one of the oldest problems with running clusters of machines, solved a million times over and, luckily for us, simplified greatly due to the fact that we are serving cacheable static content.

Figure 6.2. A priori placement of content.

Dynamic content typically performs some sort of business logic. As such, having one node execute the new version of code and another execute an older version spells disaster, specifically in multinode environments where each in the sequence of requests that compose a user's session may be serviced by different nodes. Although it is possible to "stick" a user to a specific node in the cluster, we already discussed in Chapter 5, "Load Balancing and the Utter Confusion Surrounding It," why this is a bad idea and should be avoided if possible.

Static content distribution, on the other hand, is a much simpler problem to address because we have already forfeited something quite important: control. To serve static content fast(er), we allow remote sites to cache objects. This means that we can't change an object and expect clients to immediately know that their cached copies are no longer valid. This isn't a problem logistically at all because new static objects are usually added and old ones removed because changing an object in-place is unreliable.

This means that no complicated semantics are necessary to maintain clusterwide consistency during the propagation of new content. Content is simply propagated and then available for use. Simply propagated... Hmmm...

The issue here is that the propagation of large sets of data can be challenging to accomplish in both a resource- and time-efficient manner. The "right" approach isn't always the most efficient approach, at least from a resource utilization perspective. It is important to align the techniques used to distribute content to your application servers and static servers in a manner that reduces confusion. If developers or systems administrators have to use different policies or take different pitfalls into consideration when pushing static content as opposed to dynamic content, mistakes are bound to happen.

This means that if you push code by performing exports directly from your revision control system, it may be easier to propagate static content via the same mechanism. Although an export may not be as efficient as a differential transfer of content from a "master server," only one mentality must be adoptedand remember: Humans confuse easily.

If no such infrastructure exists, we have several content distribution options.

NFS

Use a network file system. One copy of the content exists on a file server accessible by all web servers. The web servers mount this file system over IP and access the content directly. This is a popular solution in some environments, but poses a clear single point of failure because the clients cannot operate independently of the single file server. This approach also has certain deficiencies when used over a wide-area network.

AFS or CODA

AFS and CODA are "next-generation" network file systems. Both AFS and CODA allow for independent operation of the master file server by caching copies of accessed files locally. However, this technology is like implementing cache-on-demand on a different level in the application stack. Although these protocols are more friendly for wide areas, they still suffer from a variety of problems when operating in a disconnected manner.

Differential Synchronization

This technique involves moving only the data that has changed from the content distribution point to the nodes in need. You are probably familiar with the freely available (and widely used) tool called rsync that implements this. Rsync provides a mechanism for synchronizing the content from a source to a destination by first determining the differences and then transferring only what is needed. This is extremely network efficient and sounds ideal. However, the servers in question are serving traffic from their disks and are heavily taxed by the request loads placed on them. Rsync sacrifices local resources to save on network resources by first performing relatively inexpensive checksums across all files and then comparing them with the remote checksums so that unnecessary data transfers can be avoided.

The "relatively inexpensive" checksums don't seem so inexpensive when run on a highly utilized server. Plus, all our nodes need to sync from a master server, and, although rsync is only consuming marginal resources on each of the N nodes, it is consuming N times as many resources on the master server.

Hence, the world needs a network-efficient, resource-efficient 1 to N content distribution system with rsync's ease of use. Consider this a call to arms.

Exports from Revision Control

Directly exporting from revision control, assuming that your content is stored in some revision control system (as it certainly should be), has tremendous advantages. All the uses for revision control of code directly apply to static content: understanding change sets, backing out changes, and analyzing differences between tags or dates.

Most systems administrators are familiar with some version of revision control, and all developers should be fluent. This means that revision control is not only philosophically compatible with the internal control of static content but also is familiar, intuitive, and usable as well.

With so much praise, why isn't this approach used by everyone? The answer is efficiency. CVS is the most popular revision control system around due to its licensing and availability. CVS suffers from terrible tag times, and exports on large trees can be quite painful. Even with commercial tools and newer open free tools such as Subversion, the efficiency of a large checkout is, well, inefficient.

Most well-controlled setups implement exports from revision control on their master server and use a tool such as rsync to distribute that content out to the nodes responsible for serving that traffic, or opt for a pure cache-on-demand system.

Cache-on-Demand

Cache-on-demand uses a different strategy to propagate information to a second (or third, or fourth) tier. A variety of commercial solutions are available that you simply "plug in" in front of your website, and it runs faster. These solutions typically deploy a web proxy in reverse cache mode.

Web caches were originally deployed around the Internet in an attempt to minimize bandwidth usage and latency by caching commonly accessed content closer to a set of users. However, the technologies were designed with high throughput and high concurrency in mind, and most of the technologies tend to outperform general-purpose web servers. As such, a web cache can be deployed in front of a busy website to accomplish two things:

Decrease the traffic directed at the general web servers by serving some of the previously seen content from its own cache.
Reduce the amount of time the general purpose web server spends negotiating and tearing down TCP connections and sending data to clients by requiring the general purpose web server to talk over low-latency, high-throughput connections. The web cache is responsible for expensive client TCP session handling and spoon-feeding data back to clients connected via low-throughput networks.

Web caches handle client requests directly by serving the content from a local cache. If the content does not exist in the local cache, the server requests the content from the "authoritative" source once and places the resulting content in the local cache so that subsequent requests for that data do not require referencing the "authoritative" source. This architecture is depicted in Figure 6.3.

Figure 6.3. Typical cache-on-demand architecture.

Web caches that operate in this reverse proxying fashion are often called web accelerators. Apache with mod_proxy and Squid are two popular, widely adopted caching solutions commonly deployed in this configuration. Figure 6.3 shows a configuration in which the authoritative source is the main website. We can remove this marginal load from the dynamic servers by placing the content a priori on a single static content server that is hidden from the public as seen in Figure 6.4.