Choosing a Web Serving Platform | Scalable Internet Architectures

Let's continue down the free software route, as Apache is already at the core of our architecture. Apache 1.3.x with a relatively vanilla install clocks in at about 800 requests per second on our server. Our goal is to service 11,750 requests per second, and we don't want to exceed 70% capacity, which leaves us needing (11,750/70%)/800 ≈ 20.98, or roughly 20 servers (21 if we round up strictly). Each server here is capable of pushing about 16MB/s. Although commodity servers such as this have a list price of around $2,000 each, for a reasonable total of about $40,000, 20 servers for static images seems, well, less than satisfying.

Because we are serving static traffic, several other web server technologies could be used in place of Apache with little or no effort. A quick download and compile of thttpd yields better results: approximately 3,500 requests per second on the same server, pushing about 70MB/s. Repeating the previous server calculation with our new metrics, we now need five servers: (11,750/70%)/3,500 ≈ 4.8, rounded up.
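The sizing arithmetic used above can be sketched as a small function. This is only an illustration of the calculation in the text; the function name `servers_needed` and its parameter names are hypothetical, and the throughput figures are the benchmark numbers quoted in this section.

```python
import math


def servers_needed(peak_rps: float, per_server_rps: float,
                   max_utilization: float = 0.70) -> int:
    """Return the smallest whole number of servers that keeps each
    server at or below max_utilization when serving peak_rps."""
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in (0, 1]")
    if peak_rps < 0 or per_server_rps <= 0:
        raise ValueError("request rates must be positive")
    # Inflate the peak load by the headroom target, then divide by
    # what one server can handle and round up to a whole machine.
    return math.ceil(peak_rps / max_utilization / per_server_rps)


PEAK_RPS = 11_750  # target from the text

# Benchmark figures quoted in this section (requests per second).
for name, rps in [("Apache 1.3.x", 800), ("thttpd", 3_500)]:
    print(f"{name}: {servers_needed(PEAK_RPS, rps)} servers")
```

Note that strict rounding up gives 21 servers for Apache (the raw quotient is about 20.98), where the text works with the round figure of 20. The thttpd result matches the text at five.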

One valuable feature of Apache is notably absent from thttpd: reverse-proxy (web cache) support. This feature is useful because it allows you to build a cluster with a different strategy, and it adds elegance and simplicity to the overall solution. thttpd requires a priori placement of content, whereas Apache can use both a priori placement of content and cache-on-demand via the mod_proxy module. As we have seen, it takes 20 servers running Apache to meet the capacity requirements of our project, so let's find a higher-performance caching architecture.

Apache is slower than thttpd in this particular environment for several reasons:

It is more flexible, extensible, and standard.
It is more complicated and multipurposed.
It uses an architectural model that allocates more resources to each individual connection.

So, logically, we want to find a web server capable of proxying and caching data that is single-purposed, simple, and extremely efficient on a per-connection basis. Some research leads us to Squid (www.squid-cache.org). Architecturally, it is similar to thttpd, but single-purposed to be a web cache.

Cache-on-demand systems are inherently less efficient than direct-serve content servers because extra effort is required to acquire items that are requested but not yet in the cache and to ensure that data served from cache is still valid. However, a quick test of Squid is a good indication of whether such a performance degradation is acceptable.

By installing Squid in HTTP acceleration mode, we can benchmark around 2,800 requests per second on the same hardware. This is 20% slower than thttpd. However, this only increases our single-location requirement to (11,750/70%)/2,800 ≈ 5.99, and thus six servers.
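As a quick sanity check on these figures, the following sketch computes the relative slowdown of Squid against thttpd and the per-server utilization at peak for the cluster sizes arrived at in the text. The helper names `relative_slowdown` and `peak_utilization` are hypothetical, and the calculation assumes load is spread evenly across the servers.

```python
def relative_slowdown(baseline_rps: float, candidate_rps: float) -> float:
    """Fraction by which candidate throughput falls short of baseline."""
    return 1 - candidate_rps / baseline_rps


def peak_utilization(peak_rps: float, per_server_rps: float,
                     servers: int) -> float:
    """Fraction of total cluster capacity consumed at peak load,
    assuming requests are distributed evenly across servers."""
    return peak_rps / (per_server_rps * servers)


PEAK_RPS = 11_750     # target from the text
THTTPD_RPS = 3_500    # benchmark figure quoted earlier
SQUID_RPS = 2_800     # benchmark figure quoted above

print(f"Squid vs. thttpd slowdown: "
      f"{relative_slowdown(THTTPD_RPS, SQUID_RPS):.1%}")
print(f"5 thttpd servers at peak:  "
      f"{peak_utilization(PEAK_RPS, THTTPD_RPS, 5):.1%}")
print(f"6 Squid servers at peak:   "
      f"{peak_utilization(PEAK_RPS, SQUID_RPS, 6):.1%}")
```

Under these assumptions, six Squid servers sit just under the 70% ceiling at peak, so the move from five servers to six buys caching support with essentially no spare headroom beyond the target.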

Several commercial products boast higher performance than Squid or Apache. The adoption of any such device is really up to the preference of the operations group. Growing a solution based on open source technologies tends to have clear technical and financial advantages for geographically distributed sites because costs multiply rapidly when commercial technologies are used in these situations. This is basically the same argument that was made in Chapter 4, "High Availability HA! No Downtime?!" Adding a high-performance appliance to a setup is often an easy way to accomplish a goal; just remember that you need two for high availability. Although this may make good sense at a single highly trafficked location, if the site were to want a presence at four geographically unique locations, you have just committed to six more of these appliances (two at each additional site). Whether that is right is a decision you need to make for yourself.
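The appliance count grows linearly with the number of locations. A minimal sketch of that arithmetic follows, assuming (as the text does) two appliances per site for high availability; the function name `appliances_required` is hypothetical.

```python
def appliances_required(sites: int, per_site: int = 2) -> int:
    """Total appliances when every site needs per_site units
    (two per site gives a redundant pair for high availability)."""
    if sites < 0 or per_site < 0:
        raise ValueError("counts must be non-negative")
    return sites * per_site


single_site = appliances_required(1)   # one heavily trafficked location
four_sites = appliances_required(4)    # four geographic locations

print(f"One site:   {single_site} appliances")
print(f"Four sites: {four_sites} appliances")
print(f"Additional: {four_sites - single_site} appliances")
```

Multiplying the result by a vendor's per-unit price gives a rough picture of how quickly the commitment grows as locations are added.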

Six servers are better than 20 for more reasons than capital hardware costs. A smaller cluster is easier and cheaper to manage, and, as discussed in previous chapters, it simplifies troubleshooting. Because this cluster serves only the simplest of content, we are not impressed by the extra features of Apache and its modules, and we do not find them a compelling reason to choose it over a smaller, simpler, and faster solution for the project at hand.