11.3 Redundancy and single points of failure

The typical application solution, which consists of hardware and software, has a multitude of parts that all have some failure rate. A single point of failure (SPOF) is a part of a solution whose failure makes the entire solution unavailable. A typical high-availability design focuses both on the availability characteristics of its parts and on assuring continuous availability in the face of individual part failures.

Redundancy, having appropriate spare parts available for use, is one design technique used to address single points of failure.^[13] Of course, only those with extremely large budgets can afford to have one or more spare instances of every resource. So the challenge in any high-availability design is determining the single points of failure within the proposed solution, and finding a way at reasonable cost to ensure a quick recovery from the failure of those parts that are important.

^[13] Systems that are responsible for controlling human life-support usually have a backup system with a completely unique design, just in case the "failure mechanism" is in the original design! Redundancy is not the only way to handle single points of failure.

Given the definition of 99.99% solution availability, system owners are forced to deal with exponentially rising costs as the various design alternatives look to cover more of the potential failure scenarios. High availability is really the business practice of "good enough, and no more." One common approach to high availability is to ignore the catastrophic scenarios. Many find that addressing the set of highly probable single points of failure is their cost-effective answer. This approach can be simplified even further with the assumption that one only needs to deal with one failure at a time. In other words, the single-failure assumption is used during the design phase to understand failure problems, and as a design assumption to control the complexity and cost of solution development.
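To make the cost pressure concrete, the following minimal sketch converts an availability target into a yearly downtime budget and shows how independent parts combine. It is an illustration only: the function names are hypothetical, the year length is an averaged figure, and the formulas assume that part failures are independent, which real solutions only approximate.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # averaged year length (assumption)


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime allowed by a target (0.9999 means 99.99%)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def series_availability(parts: list[float]) -> float:
    """Availability when every part is needed, so each part is a SPOF."""
    result = 1.0
    for part in parts:
        result *= part
    return result


def redundant_availability(part: float, copies: int) -> float:
    """Availability of identical, independent copies when one copy suffices."""
    return 1.0 - (1.0 - part) ** copies


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target:.3%} allows {minutes:9.1f} minutes of downtime per year")

    # Two parts in series are less available than either part alone ...
    print(series_availability([0.999, 0.999]))
    # ... while a redundant pair is more available than a single part.
    print(redundant_availability(0.99, 2))
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is one way to see why covering ever more failure scenarios becomes expensive so quickly.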

Interestingly, redundancy can occur at many levels. You might be surprised to know that the internals of a processor contain many redundancies. Starting with redundancy at the "processor" level, there are further levels of redundancy all the way up to completely redundant application servers. At this higher level of recovery from failures, such a design chooses to disregard which part of an application server failed and simply replaces the entire server with another one of similar capabilities. A set of redundant application servers, interconnected in a redundant way and sharing the data on disk, is called a cluster. In 11.4, "High availability for the ISPCompany example," we explore how ISPCompany uses a load balancing cluster to provide a high-availability, static, Web-page serving solution.

11.3.1 Linux image redundancy

It is also possible to configure redundant Linux guest images, where availability is achieved by having another Linux image take over work from the failing image. There are two different approaches taken, one called hot standby and the other clustering.

The hot-standby approach to availability has an alternate, already-booted but idle, Linux image just waiting to take over the workload if the primary image fails. With z/VM, a hot-standby image is simply an idle Linux image that z/VM will have swapped out, not consuming real resources until the point of failure of the primary server. At that precise time of failure, the primary server stops using z/VM resources, and those resources begin to be used by the hot standby. The z/VM approach to hot-standby images has an extremely low cost, since the hot standby does not require additional computing resources. This is especially true if you compare this to a discrete server farm implementation, where the hot-standby image is a completely configured duplicate hardware box. Figure 11-2 is a simplified drawing of this approach on z/VM.

Figure 11-2. A simplified Linux hot-standby environment

graphics/11fig02.gif

It is important to recognize that this hot-standby approach is a special case where read/write file sharing is safe. The hot standby never writes while it is idle, so there is only one "writer" at any one time. Writing only begins when it is driven to recover. After the recovery takeover process, it is safe to write to these shared disks because part of the process is ensuring that the failed primary image can no longer write to disk.

With z/VM, this approach to Linux image redundancy has a number of significant advantages over a discrete server farm implementation. First, there is no need for additional server hardware or cabling, which can be a significant cost savings. When we discussed the various types of server consolidation in Chapter 7, "The Value of Virtualization," we emphasized the efficiency with which z/VM managed the logical resources for the hundreds of guests. Here, because idle guests put almost no resource load on z/VM, the low cost factor makes it reasonable to consider having more production images, each having its own unique hot-standby image.

Configuring for hot standby with Linux on the mainframe is also easy. One uses a cloning process to configure the hot-standby image. In this case, the clone will also have the identical set of disks for the data as the primary server. The heartbeat process is a simple communication path between the two images exchanging an "I am well" message. Ensuring that the failed image is in fact no longer able to do I/O is as simple as requesting z/VM CP to "stop" and not reboot the particular virtual machine. The speed of recovery comes down to how fast the application data can be returned to a safe state. The implementation of the data recovery depends upon the application and the specific Linux file system.
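The takeover sequence described above can be sketched roughly as follows. This is a hedged illustration, not a real heartbeat implementation: the class name, the timeout value, and the three callbacks are hypothetical placeholders, and the way the primary guest is actually stopped through z/VM CP is deliberately left to the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HotStandby:
    """Idle standby that takes over when the primary's heartbeat goes silent."""

    timeout: float                     # seconds of silence tolerated (assumed)
    fence_primary: Callable[[], None]  # e.g. have z/VM CP stop the primary guest
    recover_data: Callable[[], None]   # application and file system specific
    start_service: Callable[[], None]  # begin doing the primary's work
    last_heartbeat: float = 0.0
    active: bool = False

    def on_heartbeat(self, now: float) -> None:
        """Record an "I am well" message from the primary."""
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Take over once the primary has been silent for too long."""
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.fence_primary()  # first make sure the primary cannot write
            self.recover_data()   # then return the shared data to a safe state
            self.start_service()  # only now does the standby start writing
            self.active = True
        return self.active


if __name__ == "__main__":
    steps = []
    standby = HotStandby(
        timeout=3.0,
        fence_primary=lambda: steps.append("fence primary"),
        recover_data=lambda: steps.append("recover data"),
        start_service=lambda: steps.append("start service"),
    )
    standby.on_heartbeat(now=1.0)
    standby.check(now=2.0)  # primary still considered healthy
    standby.check(now=5.0)  # heartbeat overdue, standby takes over
    print(steps)
```

The ordering inside `check` is the point of the sketch: the standby writes to the shared disks only after the failed primary has been prevented from doing I/O.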

There is a very similar story for the cluster approach to Linux image redundancy. This design is based on there being a significant number of equally capable application servers. A few, typically two, servers in the configuration are unique in that they act as workload managers, receiving the incoming work and sending it to one of the many application servers. The cluster approach has two unique advantages. First, an individual server failure does not affect the application availability, so there is zero recovery time. Only the active units of work in that server are lost. The workload manager will no longer send incoming work to a failed server. The cluster just has a little less capacity. The second advantage is that the overall capacity of the cluster can be increased with no availability impact by simply adding yet another application server which, once identified to the workload manager, becomes a candidate for its share of the incoming work.
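The behavior of the workload manager described above can be illustrated with a small sketch. It assumes simple round-robin distribution and hypothetical server names; real workload managers, including LVS, offer several scheduling policies and detect failures on their own, neither of which is modeled here.

```python
from typing import List, Set


class WorkloadManager:
    """Round-robin dispatcher that skips application servers marked as failed."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = list(servers)
        self.failed: Set[str] = set()
        self._next = 0  # index of the next candidate server

    def add_server(self, name: str) -> None:
        """Identify a new application server; it becomes a candidate for work."""
        self.servers.append(name)

    def mark_failed(self, name: str) -> None:
        """Stop sending incoming work to a failed server."""
        self.failed.add(name)

    def dispatch(self) -> str:
        """Return the server that should receive the next unit of work."""
        for _ in range(len(self.servers)):
            candidate = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if candidate not in self.failed:
                return candidate
        raise RuntimeError("no healthy application server available")


if __name__ == "__main__":
    manager = WorkloadManager(["linux1", "linux2", "linux3"])
    manager.mark_failed("linux2")   # the cluster just has a little less capacity
    manager.add_server("linux4")    # capacity added with no availability impact
    print([manager.dispatch() for _ in range(6)])
```

A failure removes a server from the rotation and an addition extends the rotation; in neither case does the workload manager stop accepting work.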

A Linux cluster is fairly simple to implement on z/VM, and there is a related Open Source project called Linux Virtual Server (LVS).^[14] The workload manager is actually likely to be implemented as two Linux images, one as a hot standby (Figure 11-3). The Open Source module, FAKE, is used to handle taking over the failed workload manager's IP address so that users do not notice that the application server has moved to a new Linux image. The cluster is a set of cloned Linux images, with a Linux file system that allows them to share a common set of disks. The connectivity among the cluster elements would be a z/VM Guest LAN. Only the two workload managers would have redundant connections to real network adapters.

^[14] Table 25-6 provides a pointer to further information about this project.

Figure 11-3. A simplified Linux cluster environment

graphics/11fig03.gif

When it is required to increase the work capacity of the cluster, an additional interesting option exists with this type of z/VM Linux cluster. As with a real server farm implementation, one could define yet another Linux image. However, one has the new option to just increase the CPU shares for each of the cluster members. And if the environment was already "maxed out" on the available real CPU capacity, one could use the zSeries feature of Capacity Upgrade on Demand to dynamically add the real CPU capacity to z/VM, and thus to the cluster. See 11.4, "High availability for the ISPCompany example" for details of how ISPCompany uses LVS.

11.3.2 z/VM: Redundancy and recovery time

First we will discuss how z/VM itself uses redundancy, and also lets its guests use redundancy. Then we will discuss having a redundant z/VM.

z/VM inherits the zSeries hardware availability advantages (as described in 11.2, "The zSeries hardware availability") and can be configured to pass them on to its guests. In the case where the hardware is architecturally required to notify the software about a failure, z/VM will process that information. Most of the time it completely hides the failure event from the Linux guest, or, in some worst case scenarios, z/VM will fail the Linux guest. z/VM can then have its own automation control for a recovery of the Linux guest (which usually involves a re-IPL of the Linux guest).

The hardware configuration file (IOCDS), which controls the environment where z/VM executes, should always contain definitions for at least two of every hardware resource:

There should be at least two CPUs defined.
Every disk should have at least two paths defined.
There should be at least two LAN connections.

Similarly, if the intent is to have a Linux guest able to recover from single points of virtual hardware failure, each Linux guest will have the same generic list of redundancy needs as z/VM itself.

Figure 11-4 shows how the three Linux guests (Linux X, Linux Y, and Linux Z) have been defined. Each Linux guest has two virtual CPUs that it can use (CPU 00 and CPU 01). Since none of these Linux guests has dedicated real paths and devices, there is no need to specify the path redundancy. z/VM automatically handles that aspect of recovery for the Linux guest.

Figure 11-4. Sample statements that show the CPU redundancy for the Linux guests

graphics/11fig04.gif

As with any part of a solution, there is the possibility that z/VM becomes unavailable. z/VM is a very stable operating system, and although it is possible that z/VM itself crashes, it is more likely that factors such as human error or z/VM maintenance or software upgrade might cause an interruption. The same schemes as for Linux image redundancy, hot standby and clusters, can be implemented at the z/VM level.

The first item to identify in a z/VM availability discussion is the source of the risk of z/VM failure. Almost equally important is the speed of recovery from the failure. For example, if it was human error that stopped a z/VM image, it is possible to immediately restart z/VM and then all the Linux images, and so on. While this approach is simple and inexpensive, it is so lengthy as to not fit any high-availability scenario. If a more timely recovery process is required, then a second z/VM image is needed.

The secondary z/VM image could be in another LPAR on the same machine. This approach is effective providing the zSeries hardware itself was not the cause of the outage. If you need to include the risk of the hardware failure, then the secondary z/VM must be in another zSeries machine, and possibly even at another site, all depending upon which causes of outage you want to survive.

The hot-standby choice for z/VM redundancy makes a lot of sense in the situation where all the lost Linux images can be brought up fast enough to meet the recovery time requirements. A hot-standby z/VM can even share all the disks with the primary z/VM. By using the two LPARs on the same machine approach and sharing the identical set of resources with both the primary and hot-standby LPARs, the failed primary LPAR's resources can be used by the hot standby. The same set of resources adequately serves both LPARs because of the serial nature of the requirements.

The cluster approach of two z/VM images sharing the workload is another approach for z/VM image redundancy. It is going to be required if the recovery time is tight. However, clusters imply a certain amount of actively-shared data across the z/VM cluster and are more challenging to implement.

11.3.3 Data redundancy and recovery time

The data constitute another component of the solution whose high availability must also be assured. There are two levels that need to be addressed: the hardware (the physical layer), and the Linux file systems (the logical layer) that provide access to the data. Having highly available data brings with it the need to address how quickly the data (the logical, consistent view they represent) can be recovered following the interruption of access.

Access to data on the zSeries has two hardware components: the channel path to the device and the device itself where the data resides. Simply assuring that the hardware (and virtual machine) configuration includes multiple paths to each device will assure the appropriate recovery from a channel path failure.

There are a number of approaches used with disk devices based on the concept of redundancy. By eliminating the single point of failure at the hardware level, redundancy ensures continued data availability in the face of the loss of a single disk. One such approach is the Redundant Array of Independent Disks, better known as RAID. For example, with disks configured for RAID 5, one adds an extra disk to the set of data disks and keeps a type of error-correcting code for each unit of data stored across the array of data disks. If any single disk in the array fails, there is sufficient information on the other disks to rebuild "on the fly" all the data within the array. The recovery time in this implementation is zero.
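The rebuild idea can be shown with a short sketch that uses XOR parity, the kind of code RAID 5 keeps. It is a simplification under stated assumptions: it handles a single stripe, keeps the parity block in the last position rather than rotating it across the disks, and uses hypothetical function names.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_stripe(data_blocks: List[bytes]) -> List[bytes]:
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Reconstruct a stripe in which at most one block (None) has been lost."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot recover more than one lost block")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        # XOR of all surviving blocks yields exactly the missing block.
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    stripe = make_stripe([b"ABCD", b"EFGH", b"IJKL"])
    damaged = list(stripe)
    damaged[1] = None                  # simulate the loss of one disk
    print(rebuild(damaged) == stripe)  # the lost block is recomputed
```

Because any one block equals the XOR of all the others, the same routine recovers a lost data block or a lost parity block, but never two losses at once.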

Another common, but slightly more expensive, scheme goes by the names of "mirroring" or "dual copy." The concept is that every write-to-disk operation causes the data to be placed on two disks. This dual placement assures that if any single disk fails, there is an up-to-date copy available on the mirror disk. Depending upon the specific implementation, when the primary copy fails the data may be immediately available, or some specific action, usually automated, needs to be taken to make the second copy available to the running system. This duplication scheme can be extended over significant distances. This latter type of remote dual copy is typically used for availability across a single-site disaster.^[15]

^[15] There also is a Linux Open Source project related to data redundancy that might be of use in certain types of solution design. rsync is a tool that can be used to create and manage mirror data across two remote sites. The tool is run periodically to resynchronize the two instances of the data. This approach might prove useful for a disaster recovery design, but would not handle a timely recovery of the data that would meet high-availability requirements.

There is another way that access to the data might be lost to the application. What if the software fails, leaving the data on disk inconsistent? Having a RAID array or a dual copy of the data will not help. What additional considerations are needed for a highly available application where the data on disk are damaged?

The typical Linux file systems are byte-stream oriented file systems that have the characteristic that the data from individual files are interleaved throughout the entire disk space managed by the file system. The allocation map as to where the data for each file reside is critical to being able to use any file. The Linux ext2 file system, in the interest of high-speed performance, does not always have this allocation map (that correctly represents that data) actually on the disk with the file system data. If Linux crashes, crucial information about each of the open ext2 file systems can be lost. In the boot process, Linux checks each file system to see if it had been closed properly and if not, then it runs a special file-check program which attempts to rebuild the allocation map. The larger the file system and the more writing done to the file system, the longer it will take to recover access to the data. And some data or files may be non-recoverable!

The Open Source community has developed a number of unique file systems,^[16] each addressing some aspects of delivering both high performance while in normal use, and reasonable speed for data recovery in the event of a Linux system crash. The latest emphasis has been on journaled file systems, like JFS, ext3 and ReiserFS. These newer file systems are now showing up in the Linux distributions that support the mainframe. If you consider there to be an availability risk from a Linux crash, then when you install your Linux on the mainframe you should consider exploiting one of these journaled file systems.
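Why a journal shortens recovery can be illustrated with a toy model. It is not how JFS, ext3, or ReiserFS are implemented: the record layout and names below are hypothetical, and the allocation map is reduced to a dictionary from file name to block numbers. The sketch only shows the general principle that recovery replays a short log of committed changes instead of scanning the whole file system.

```python
from typing import Dict, List, Tuple

# One journal record: (transaction id, kind, file name, block numbers).
# kind is "update" for a change to the allocation map, "commit" to seal it.
Record = Tuple[int, str, str, List[int]]


def replay(
    journal: List[Record], allocation_map: Dict[str, List[int]]
) -> Dict[str, List[int]]:
    """Apply only the updates whose transaction reached its commit record."""
    committed = {txid for txid, kind, _, _ in journal if kind == "commit"}
    recovered = dict(allocation_map)
    for txid, kind, name, blocks in journal:
        if kind == "update" and txid in committed:
            recovered[name] = blocks
    return recovered


if __name__ == "__main__":
    on_disk = {"a.txt": [10]}  # allocation map as last written to disk
    journal: List[Record] = [
        (1, "update", "a.txt", [10, 11]),
        (1, "commit", "", []),
        (2, "update", "b.txt", [12]),  # crash happened before the commit
    ]
    print(replay(journal, on_disk))
```

In this model the work done after a crash grows with the length of the journal, not with the size of the file system, and an interrupted transaction is simply ignored rather than left half applied.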

^[16] http://www.linuxgazette.com/issue55/florido.html is a good tutorial on various Linux file systems.