Chapter 4. High Availability. HA No Downtime? | Scalable Internet Architectures

4. High Availability. HA! No Downtime?!

People often use the terms high availability (HA) and load balancing (LB) interchangeably, and incorrectly: they say high availability when they mean load balancing alone, or when they mean high availability and load balancing together. The two terms have been fused into "high availability/load balancing" because product brochures and promotional material combine them in a way that implies that they are one and the same thing. This is simply wrong. High availability is orthogonal to load balancing, and treating it as such leads to a better understanding of both.

High availability means that things are always available. Even in the event of an unexpected failure, the services being provided should remain available. This property is known as fault tolerance.

Fault tolerance: The capability of a system or component to continue normal operation despite the presence of hardware or software faults. Also, the number of faults a system or component can withstand before normal operation is impaired.

Simply put, actors may die, but the show must go on. But before you say "Great! Let's do that!" take a look at costs and benefits. Figure 4.1 depicts a classic network diagram with no fault tolerance. Figure 4.2 depicts the same architecture with fault-tolerant networking infrastructure. Basically, the second architecture eliminates any single point of failure in the egress point to our network provider.

Figure 4.1. A simple network with no fault tolerance.

Figure 4.2. A simple, fault-tolerant network.

Obviously, if all the components fail, there is no way the service can survive. But how many components can fail? If an architecture allows for a single component failure without the underlying service being jeopardized, it is considered 1 fault tolerant. On the other hand, if every component but one can fail (the strongest guarantee possible) while maintaining the availability of the service, the system is called N-1 fault tolerant.
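To make the counting concrete, here is a minimal sketch. It assumes the simplest possible model, which the text does not spell out: the components are interchangeable, and the service stays up as long as some minimum number of them are running. The function names and the `minimum_required` parameter are illustrative only.

```python
from typing import Sequence


def faults_tolerated(total_components: int, minimum_required: int = 1) -> int:
    """Number of component failures the service can absorb and stay up.

    Assumes interchangeable components (a simplification for illustration).
    """
    if not 1 <= minimum_required <= total_components:
        raise ValueError("minimum_required must be between 1 and total_components")
    return total_components - minimum_required


def is_available(component_up: Sequence[bool], minimum_required: int = 1) -> bool:
    """The service is up while at least minimum_required components are up."""
    return sum(component_up) >= minimum_required


# Two redundant routers, either of which can carry the service alone.
print(faults_tolerated(2))           # 1
print(is_available([True, False]))   # True
print(is_available([False, False]))  # False
```

With `minimum_required` left at 1, a system of N components tolerates N-1 faults, which is the N-1 fault tolerant case described above.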

Because making a system highly available adds cost, what can you hope to gain for your pains? The following are some advantages of investing time, money, and effort in building and maintaining a highly available architecture:

Taking down components of your architecture for maintenance and upgrades is simple. When an architecture has been designed to transparently cope with the unexpected loss of a system or component, it is often trivial to intentionally remove that system or component for testing, upgrades, or anything else your heart desires.
A simple example of this outside the web world is the operation of mail exchanges. SMTP, albeit an ancient protocol, has an intrinsic requirement of responsibility and reliability. A mail server must hold onto a message until it has definitively handed that message to the responsible party. In the case of Internet email, a Message Transfer Agent (MTA) will try all the listed mail exchanges in ascending order of preference value (lowest value first) until one succeeds. Although this may sound obvious, the intrinsic fault tolerance of the system may not be so obvious.
This means that, at any time, you can intentionally or accidentally disable an advertised mail exchange, and all inbound emails will simply use the other mail exchanges that are still accessible. SMTP's simple and elegant design allows for this basic fault tolerance without the need for extraneous (and often expensive) hardware or software solutions to provide high availability. HTTP provides no such intrinsic fault tolerance, and that forces us to architect an appropriate highly available solution.
Well-designed highly available architectures often lend themselves to higher capacity. The rest of this book details how to approach the issues of highly scalable architectural design. Although you can always take "the wrong path" and build a large highly available system that scales poorly, we will concentrate on the fact that it can be done "right" and afford the architect, maintainers, and users the luxury of a system that scales without the need for fundamental engineering changes.
It is important to keep in mind that any hardware that is required for the architecture to run cannot be considered redundant. If you rely on your redundant hardware for the purposes of handling routine load, it isn't really redundant hardware at all.
Well-designed distributed systems offer the benefits of controlled and understood horizontal scalability. Some of today's peer-to-peer (P2P) systems are examples of this "good design." A key feature of their success is that as the networks grow to arbitrarily large sizes, and the number of machines on the network increases, that growth neither requires the individual machines to be more powerful nor sacrifices the quality of service.
After tackling the three-node cluster, scaling to n nodes is a process, not an experiment. When building a cluster of machines to serve a single purpose, one of the most common problems encountered is getting those machines to cooperate toward the common goal. This is the greatest challenge of distributed systems design. For some services, such as DNS, this exercise is simple and easy. For other systems, specifically databases, this challenge poses a problem that we can consider academically difficult.
Database replication is a controversial topic. Many claim that the problem is solved; others claim the solutions are inadequate. The simpler two-node solution has been used for many years in the financial industry. This replication is usually employed to ensure high availability on two systems using a protocol such as two-phase commit (2PC) or three-phase commit (3PC). Unfortunately, these commonly deployed solutions do not handle the expansion to three nodes in an elegant way due to the limitations of the protocols they use. In short, there are no good enterprise-ready, horizontally scalable N-1 failure tolerant database systems in existence. This problem is expensive to solve correctly, so instead of solving the three-node problem, the accepted approach is to engineer around it.
We won't delve into this subject too deeply here because Chapter 8, "Distributed Databases Are Easy, Just Read the Fine Print," is dedicated to the exploration of distributed, fault-tolerant databases both from a theoretical aspect and in the real world.
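Returning to the mail exchange example from the first advantage above, the failover behavior can be sketched in a few lines. This is a simplified illustration, not how any particular MTA is implemented: the `deliver` function, the `send` callback, and the `example.com` hostnames are all hypothetical, and real MTAs add queuing, retry schedules, and randomization among exchanges of equal preference.

```python
from typing import Callable, Iterable, Optional, Tuple


def deliver(
    message: str,
    exchanges: Iterable[Tuple[int, str]],
    send: Callable[[str, str], bool],
) -> Optional[str]:
    """Try each mail exchange, lowest preference value first.

    Returns the host that accepted the message, or None if every exchange
    failed (the sender must then keep the message queued and retry later).
    """
    for _preference, host in sorted(exchanges):
        try:
            if send(host, message):
                return host
        except OSError:
            # Host unreachable or connection refused: try the next exchange.
            continue
    return None


# Hypothetical transport standing in for a real SMTP conversation.
def fake_send(host: str, message: str) -> bool:
    if host == "mx1.example.com":
        raise ConnectionRefusedError("mx1 is down for maintenance")
    return True


mx_records = [(20, "mx2.example.com"), (10, "mx1.example.com")]
print(deliver("hello", mx_records, fake_send))  # mx2.example.com
```

Taking `mx1.example.com` out of service, on purpose or by accident, changes nothing for the sender except which host ends up accepting the message.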

Now that we know what we've bought, let's take a look at the bill. The disadvantages of building and maintaining a highly available architecture are both serious and unavoidable:

Implementing a high-availability solution means more equipment and services to maintain. By their very nature, highly available systems have multiple working components. Multiple components allow for survivability in the face of failures. Assuming that the system is N-1 fault tolerant, you may find yourself building a service with a multitude of components and should truly understand the technical, financial, and emotional cost of maintaining them. As illustrated in previous chapters, solid policies and procedures can help minimize these costs.
Troubleshooting distributed systems is dramatically more difficult than troubleshooting their single-node counterparts. Web services are built on top of HTTP, so a long single transaction from the end-user's perspective is actually composed of a series of simple and short transactions. Because a single conceptual transaction is actually a series of individual POST and GET requests over HTTP, each one could conceivably arrive at a different machine in a cluster. One solution to this is to force the entire conceptual transaction to visit the same machine for each incremental web transaction. Chapter 5, "Load Balancing and the Utter Confusion Surrounding It," discusses the numerous pitfalls of this approach. We are left with tightly coupled events occurring on separate machines. This vastly complicates identifying cause and effect, which is the most fundamental of troubleshooting concepts. Chapter 9, "Juggling Logs and Other Circus Tricks," discusses approaches to unifying the logging across all the machines in a web cluster to provide a single-instance perspective. This in turn provides an environment where traditional troubleshooting techniques can be applied to solve problems.
Application programmers must be aware of content synchronization issues. When running a web service on a single machine, it is intrinsically impossible to have content synchronization or consistency issues. However, when applications are deployed across a cluster of machines, there is a risk of a difference in content between two servers affecting the integrity of the service being provided to the end-user.
Code consistency is generally a simpler issue to solve. At first glance, the problem would seem to manifest itself as one user seeing one thing, and a second user seeing something different; however, the problem is more severe. As subsequent requests in the same user's session arrive at different machines, a variety of complicated problems can arise.
Content synchronization is a much more complicated issue because it can manifest itself in a wide variety of scenarios ranging from dynamic content in web pages to synchronizing user session states. Several techniques can be used to distribute content and user data across a cluster of machines, but most are foreign concepts to classic application design.
The mere fact that you need a highly available solution speaks to the attention required to properly manage the overall architecture. Managing production systems is an art. More than that, it is the responsibility of a group of multidisciplinary artisans who always have their hands in changing the artwork. As availability demands increase, the margin for error decreases. Developers, systems administrators, database administrators, and the rest of the crew are required to work more closely to ensure that availability requirements can be met.
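As a small illustration of the troubleshooting problem described above, the sketch below interleaves per-machine logs into a single timeline so that cause and effect can be read in order. It is not the approach Chapter 9 takes. It assumes that each machine's log is already sorted by time and that the machines' clocks are synchronized. The `merge_logs` name, the tuple layout, and the sample entries are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

# (timestamp in seconds, machine name, message); layout chosen for illustration.
LogEntry = Tuple[float, str, str]


def merge_logs(*machine_logs: Iterable[LogEntry]) -> List[LogEntry]:
    """Interleave per-machine logs, each already sorted by timestamp,
    into one timeline ordered by timestamp."""
    return list(heapq.merge(*machine_logs, key=lambda entry: entry[0]))


web1 = [(1.0, "web1", "GET /cart"), (3.0, "web1", "POST /checkout")]
web2 = [(2.0, "web2", "POST /cart/add")]

for timestamp, machine, message in merge_logs(web1, web2):
    print(timestamp, machine, message)
```

One user's conceptual transaction touched two machines here, yet the merged view reads as if it had happened on one.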