High Availability | Scalable Internet Architectures

Because high availability was the first item in the previous list, the first thing on your mind might be: "What about load balancing?" The criticality of an environment has absolutely nothing to do with its scale. Load balancing attempts to effectively combine multiple resources to handle higher loads and therefore is completely related to scale. Throughout this book, we will continue to unlearn the misunderstood relationship between high availability and load balancing.

When discussing mission-critical systems, the first thing that comes to mind should be high availability. Without it, a simple system failure could cause a service interruption, and a service interruption isn't acceptable in a mission-critical environment. Perhaps it is the first thing that comes to mind because of the rate at which things seem to break in many production environments. The point of this discussion is to understand that although high availability is a necessity, it certainly won't save you from ignorance or idiocy.

High availability from a technical perspective is simply taking a single "service" and ensuring that a failure of one of its components will not result in an outage. So often high availability is considered only on the machinery level: one machine is failover for another. However, that is not the business goal.

The business goal is to ensure the services provided by the business are functional and accessible 100% of the time. Goals are nice, and it is always good to have a few unachievable goals in life to keep your accomplishments in perspective. Building a system that guarantees 100% uptime is an impossibility. A relatively useful but deceptive measurement that was widely popular during the dot-com era was the n nines measurement. Everyone wanted an architecture with five nines availability, which meant functioning and accessible 99.999% of the time.

Let's do a little math to see what this really means and why a healthy bit of perspective can make an unreasonable technical goal reasonable. Five nines availability means that of the (60 seconds/minute * 60 minutes/hour * 24 hours/day * 365 days/year =) 31,536,000 seconds in a year you must be up (99.999% * 31,536,000 seconds =) 31,535,684.64 seconds. This leaves an allowable (31,536,000 - 31,535,684.64 =) 315.36 seconds of unavailability. That's just slightly more than 5 minutes of downtime in an entire year.
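The same arithmetic can be repeated for any number of nines. The following is a minimal sketch in Python, assuming a 365-day year as in the calculation above; the function name `allowed_downtime_seconds` is invented here for illustration and is not part of any library.

```python
# Seconds in a 365-day year (leap years ignored, as in the text).
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000


def allowed_downtime_seconds(nines: int) -> float:
    """Return the downtime per year permitted by an n-nines availability target."""
    unavailability = 10 ** -nines  # e.g. five nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


# Print the yearly downtime budget for one through five nines.
for n in range(1, 6):
    seconds = allowed_downtime_seconds(n)
    print(f"{n} nines: {seconds:,.2f} seconds (about {seconds / 60:,.1f} minutes) per year")
```

Each additional nine cuts the allowable downtime by a factor of ten, which is why the step from four nines to five is so much harder to engineer than the step from two to three.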

Now, in all fairness, there are different perspectives on what it means to be available. Take online banking for example. It is absolutely vital that I be able to access my account online and transfer money to pay bills. However, being the night owl that I am, I constantly try to access my bank account at 3 a.m., and at least twice per month it is unavailable with a message regarding "planned maintenance." I believe that my bank has a maintenance window between 2 a.m. and 5 a.m. daily that it uses every so often. Although this may seem like cheating, most large production environments define high availability to be the lack of unplanned outages. So, what may be considered cheating could also be viewed as smart, responsible, and controlled. Planned maintenance windows (regardless of whether they go unused) provide an opportunity to perform proactive maintenance that reduces the risk of unexpected outages during non-maintenance windows.
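To see how much the definition matters, here is a rough sketch in Python that computes availability both ways. The figures are assumptions chosen for illustration: two maintenance events per month that each consume the full three-hour window, and a hypothetical 300 seconds of unplanned downtime per year. The function name `availability` is likewise invented for this example.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 365-day year, leap years ignored


def availability(unplanned_s: float, planned_s: float = 0.0,
                 count_planned: bool = False) -> float:
    """Return the fraction of the year the service counts as available.

    Planned maintenance is charged against availability only when
    count_planned is True.
    """
    downtime = unplanned_s + (planned_s if count_planned else 0.0)
    return 1 - downtime / SECONDS_PER_YEAR


# Assumed figures, not measurements:
planned = 2 * 12 * 3 * 60 * 60  # two full 3-hour windows per month
unplanned = 300                 # hypothetical unplanned downtime per year

print(f"Counting all downtime:    {availability(unplanned, planned, True):.5%}")
print(f"Counting unplanned only:  {availability(unplanned):.5%}")
```

Under these assumptions the same service misses even three nines when every second of downtime counts, yet clears five nines when only unplanned outages are measured.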