What's So Tricky?
Static web serving doesn't seem so difficult at first. And, in truth, it isn't. However, as sites evolve and grow, they must scale, and the stresses of serving traffic in development and staging are different from those placed on an architecture by millions of visitors.
Let's walk through a typical approach to serving a website and then investigate that approach for inefficiencies. First, how much of the traffic is static? The answer to this is the basis for numerous timing, throughput, and other scalability calculations. Table 6.1 shows four sites: three real sites (one very small site and two that are huge) and a fourth site that is an example large site for discussion purposes whose page composition metrics are legitimized by the real data.
As can be seen in Table 6.1, the number of requests for static content far outweighs the number of requests for possibly dynamic HTML pages. Additionally, the volume of static content served constitutes more than 50% of the total volume served.
Although the browser cache reduces the number of requests for these static objects (particularly images), as can be seen in the "subsequent pages" rows, it does not eliminate them entirely. Sites that rarely change benefit tremendously from browser-side caching, whereas more dynamic sites do not. If you look at a popular online news source (such as CNN or BBC), you will notice that almost every news story added contains several new external objects, whether an image, a movie, or an audio stream. Visitors are unlikely to reread the same article and thus unlikely to capitalize on the corresponding images already sitting in their browser's cache. While surfing the CNN site, I calculated about an 88% cache hit rate for static objects within a page.
Additionally, ISPs, corporations, and even small companies employ transparent and nontransparent caching web proxies to help reduce network usage and improve user experience. This means that when Bob loads a page and all its images, Bob's ISP may cache those images on its network so that when another user from that ISP requests the page and attempts to fetch the embedded images, he will wind up pulling them (more quickly) from the ISP's local cache.
Now, it should be obvious that only one of those TCP/IP connections was used to fetch the base document as it was fetched first and only once, and that the connection was reused to fetch some of the static objects. That leaves six induced connections due to page dependencies. Although subsequent page loads weren't so dramatic, the initial page loads alone provide enough evidence to paint a dismal picture.
Let's take the most popular web server on the Internet as the basis for our discussion. According to the Netcraft Web Server Survey, Apache is the web server technology behind approximately 67% of all Internet sites, with a footprint of approximately 33 million websites. It is considered by many industry experts to be an excellent tool for enterprise and carrier-class deployments, and it has my vote of confidence as well. For a variety of reasons, Apache 1.3 remains far more popular than Apache 2.0, so we will assume Apache 1.3.x to be the core technology for our web installation at www.example.com.
Apache 1.3 uses a process model to handle requests and serve traffic. This means that when a TCP/IP connection is made, a process is dedicated to fielding all the requests that arrive on that connection, and yes, the process remains dedicated during all the lag time, pauses, and slow content delivery due to slow client connection speeds.
How many processes can your system run? Most Unix-like machines ship with a maximum process limit between 512 and 20,000. That is a big range, but we can narrow it down if we consider the speed of a context switch. Context switching on most modern Unix-like systems is fast and completely unnoticeable on a workstation with 100 processes running. However, that is because those 100 processes aren't all running; only a handful are actually running at any given time. In stark comparison, on a heavily trafficked web server, all the processes are either on the CPU or waiting for some in-demand resource such as the CPU or disk.
Due to the nature of web requests (quick and short), processes accomplish relatively small units of work at any given time, which leads to processes bouncing into and out of the run queue at a rapid rate. When one process is taken off the CPU and another process (from the run queue) is placed on the CPU to execute, it is called a context switch. We won't get into the details, but the important thing to remember is that nothing in computing is free.
There is a common misconception that serving a web request requires the process to be switched onto the processor, where it does its job, and then switched off again, for a total of two context switches. This is not true. Any time a task must communicate over the network (with a client or a database), or read from or write to disk, it has nothing to do while those interactions complete. At the bare minimum, a web server process must

1. Accept the connection.
2. Read the request.
3. Write the response.
4. Log the transaction.
5. Close the connection.
Each of these actions requires the process to be context switched in and then out, totaling 10 context switches. If multiple requests are serviced over the life of a single connection, the first and last events in the preceding list are amortized over the life of the connection, dropping the total to slightly more than six. However, the preceding events are the bare minimum. Typically, websites do something when they service requests, which places one or more events (such as file I/O or database operations) between reading the request and writing the response. In the end, 10 context switches is a hopeful lower bound that most web environments never achieve.
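The bookkeeping above can be sketched in a few lines. This is a simplified model, assuming the five-step lifecycle (accept, read, write, log, close) and two switches per blocking action; the numbers are the chapter's illustrative figures, not measurements:

```python
# Context-switch accounting for a process-per-connection web server.
# Assumed five-step request lifecycle (illustrative, not measured):
actions = ["accept connection", "read request", "write response",
           "log transaction", "close connection"]
switches_per_action = 2  # switched onto the CPU to act, then off to wait on I/O

cold_switches = len(actions) * switches_per_action  # fresh connection: 10 switches

def amortized_switches(requests_per_connection):
    """HTTP keep-alive pays accept and close once per connection,
    spreading their cost over every request it carries."""
    per_request = (len(actions) - 2) * switches_per_action    # read/write/log
    connection_overhead = 2 * switches_per_action             # accept + close
    return per_request + connection_overhead / requests_per_connection
```

With one request per connection this yields the full 10 switches; at 8 requests per connection it drops to 6.5, the "slightly more than six" figure in the text.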
Assume that a context switch takes an average of 10μs (microseconds) to complete. That means the server can perform 100,000 context switches per second, and at 10 per web request, that comes out to 10,000 web requests per second, right? No.
The system could perform the context switches necessary to service 10,000 requests per second, but then it would not have resources remaining to do anything. In other words, it would be spending 100% of its time switching between processes and never actually running them.
Well, that's okay. The goal isn't to serve 10,000 requests per second. Instead, we have a modest goal of serving 1,000 requests per second. At that rate, we spend 10% of all time on the system switching between tasks. Although 10% isn't an enormous amount, an architect should always consider this when the CPU is being used for other important tasks such as generating dynamic content.
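The overhead arithmetic is worth making explicit. A quick sketch, using the assumed 10μs switch cost and the 10-switch lower bound from above:

```python
# Context-switch overhead, back of the envelope.
switch_cost = 10e-6          # assumed: 10 microseconds per context switch
switches_per_request = 10    # the hopeful lower bound derived above

# If the CPU did nothing but switch, this is the request ceiling:
ceiling_rps = (1 / switch_cost) / switches_per_request   # ~10,000 req/s

# At the modest goal of 1,000 requests/second:
rps = 1_000
overhead_fraction = rps * switches_per_request * switch_cost  # 0.10 = 10% of CPU
```

At the 10,000 req/s ceiling the machine spends 100% of its time switching; at 1,000 req/s it still burns 10% of every second on switches alone, before doing any real work.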
Many web server platforms in use have different design architectures than Apache 1.3.x. Some are based on a thread per request, some are event driven, and others take the extreme approach of servicing a portion of the requests from within the kernel of the operating system, and by doing so, help alleviate the particular nuisance of context switching costs.
Because so many existing architectures use Apache 1.3.x and enjoy its benefits (namely stability, flexibility, and wide adoption), the issues of process limitation often must be addressed without changing platforms.
The next step in understanding the scope of the problem requires looking at the resources required to service a request and comparing that to the resources actually allocated to service that request.
This, unlike many complicated technical problems, can easily be explained with a simple simile: Web serving is like carpentry. Carpentry requires the use of nails: many different types of nails (finishing, framing, outdoor, tacks, masonry, and so on). If you ask a carpenter why he doesn't use the same hammer to drive all the nails, he would probably answer: "That question shows why I'm a carpenter and you aren't."
Of course, I have seen a carpenter use the same hammer to drive all types of nails, which is what makes this simile so apropos. You can use the same web server configuration and setup to serve all your traffic, and it will work. However, if you see a carpenter set on a long task of installing trim, he will pick up a trim hammer, and if he were about to spend all day anchoring lumber into concrete, he would certainly use a hand sledge or even a masonry nail gun.
In essence, it is the difference between a quick hack and a good efficient solution. A claw hammer can be used on just about any task, but it isn't always the most effective tool for the job.
In addition, if a big job requires a lot of carpentry, more than one carpenter will be hired. If you know beforehand that half the time will be spent driving framing nails, and the other half will be setting trim, you have valuable knowledge. If one person works on each task independently, you can make two valuable optimizations. The first is obvious from our context switching discussion above: Neither has to waste time switching between hammers. The second is the crux of the solution: A framing carpenter costs less than a trim carpenter.
Web serving is essentially the same, even inside Apache itself. Apache can be compiled with mod_php or mod_perl to generate dynamic content based on custom applications. Think of Apache with an embedded scripting language as a sledgehammer and images as finishing nails. Although you can set finishing nails with a sledgehammer, your arm is going to become unnecessarily tired.
Listing 6.1 shows the memory footprint size of Apache running with mod_perl, Apache running with mod_php, and Apache "barebones" with mod_proxy and mod_rewrite all "in-flight" at a high traffic site.
Listing 6.1. Apache Memory Resource Consumption
As can be seen, the static Apache server has a drastically smaller memory footprint. Because machines have limited resources, only so many Apache processes can run in memory concurrently. If we look at the httpd-perl instance, we see more than 20MB of memory being used by each process (RSS - SHARE). At 20MB per process, we can have fewer than 100 processes on a server with 2GB RAM before we exhaust the memory resources and begin to swap, which spells certain disaster. On the other hand, we have the httpd-static processes consuming almost no memory at all (less than 1MB across all processes combined). We could have several thousand httpd-static processes running without exhausting memory.
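The memory budget falls out of simple division. A sketch using the round figures from the text (2GB of RAM, roughly 20MB of private memory, i.e. RSS minus SHARE, per httpd-perl process):

```python
# Rough process budget: the memory a process truly owns is RSS - SHARE.
ram_mb = 2 * 1024            # 2GB of RAM, expressed in MB
perl_private_mb = 20         # ~20MB private per httpd-perl process (from Listing 6.1)

max_perl_procs = ram_mb // perl_private_mb   # 102 in theory
# In practice the kernel, filesystem buffers, and other daemons need room too,
# so fewer than 100 httpd-perl processes is the realistic ceiling before swapping.
```

The httpd-static processes, by contrast, consume under 1MB combined, which is why several thousand of them fit in the same machine.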
Small websites can get by with general-purpose Apache instances serving all types of traffic because traffic is low and resources are plentiful. Large sites that require more than a single server to handle the load empirically have resource shortages. On our www.example.com site, each visitor can hold seven connections (and thus seven processes) hostage for 4 seconds or more. Assuming that it is running the httpd-perl variant shown previously to serve traffic and we have 2GB RAM, we know we can only sustain 100 concurrent processes. 100 processes / (7 processes * 4 seconds) = 3.57 visits/second. Not very high performance.
If we revisit our browsing of the www.example.com site and instruct our browser to retrieve only the URL (no dependencies), we see that only one connection is established and that it lasts approximately 1 second. If we chose to serve only the dynamic content from the httpd-perl Apache processes, we would see 100 processes / (1 process * 1 second) = 100 visits/second. To achieve 100 visits per second serving all the traffic from this Apache instance, we would need 100/3.57 ≈ 28, call it 30 machines, and that assumes optimal load balancing, which we know is an impossibility. So, with a 70% capacity model, we wind up with 43 machines.
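The capacity arithmetic above can be checked in a few lines. This sketch uses the chapter's figures (100 processes, seven 4-second connections per visitor, a 100 visits/second target) and notes where the text rounds:

```python
procs = 100               # sustainable httpd-perl processes in 2GB of RAM
conns_per_visit = 7       # connections a single visitor holds open
secs_held = 4             # seconds each connection stays occupied

all_traffic_vps = procs / (conns_per_visit * secs_held)  # ~3.57 visits/second
dynamic_only_vps = procs / (1 * 1)                       # 100 visits/second

target_vps = 100
machines_ideal = target_vps / all_traffic_vps  # ~28, assuming perfect balancing
# The text rounds up to ~30 machines for headroom, then applies the 70% model:
machines_70pct = 30 / 0.7                      # ~42.9, i.e. 43 machines
```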
Why were the processes tied up for so long when serving all the content? Well, pinging www.example.com yields 90ms of latency from my workstation. TCP requires a three-way handshake, but we will account for only the first two phases because data can ride on the heels of the third, making its latency contribution negligible. So, as shown in Figure 6.1, establishing the connection takes 90ms. Sending the request takes 45ms, and getting the response takes at least 45ms. Our www.example.com visit requires loading 59 individual pieces of content spread over seven connections, yielding about eight requests per connection. On average, each connection spent at minimum one handshake plus eight request/response exchanges, summing to roughly 900ms.
Figure 6.1. TCP state diagram for a typical web request.
This constitutes 900ms out of the 4 seconds, so where did the rest go? Well, as is typical, my cable Internet connection, although good on average, yielded only about 500Kb/s (bits, not bytes). Our total page size was 167,500 bytes (or 1,340,000 bits). That means about 2.7 seconds were spent fetching all those bits. Now, this isn't exact science, as some of this time can and will overlap with the 900ms of latency described in the previous paragraph, but you get the point: it took a while.
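Both halves of the 4-second budget, handshake-plus-round-trip latency and raw transfer time, can be reproduced from the figures above. A sketch, treating each request/response exchange as one round trip:

```python
rtt = 0.090                    # seconds, round-trip to www.example.com
objects, connections = 59, 7
reqs_per_conn = round(objects / connections)   # ~8 requests per connection

# One RTT to establish the connection (the first two handshake phases),
# then at least one RTT per request/response exchange.
latency_per_conn = rtt + reqs_per_conn * rtt   # ~0.81s; the text rounds to ~900ms

page_bits = 167_500 * 8        # 1,340,000 bits for the whole page
link_bps = 500_000             # ~500Kb/s cable downstream
transfer_secs = page_bits / link_bps   # 2.68s -> the remaining ~2.7 seconds
```

Latency and transfer partially overlap in practice, which is why the pieces don't sum exactly to the observed 4 seconds.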
Now imagine dial-up users at 56Kb/s; the math is left as an exercise to the reader.