Fault Tolerance

As far as computers are concerned, fault tolerance refers to the capability of the computer system or network to provide continued data availability in the event of hardware failure. Every component within a server, from CPU fan to power supply, has a chance of failure. Some components such as processors rarely fail, whereas hard disk failures are well documented.

Almost every component has fault-tolerant measures. These measures typically require redundant hardware components that can easily or automatically take over when there is a hardware failure.

Of all the components inside computer systems, the hard disks require the most redundancy. Not only are hard disk failures more common than failures of any other component, but the disks also hold the data, without which there would be little need for a network.

Hard Disks Are Half the Problem

In fact, according to recent research, hard disks are responsible for one of every two server hardware failures. Put another way, protecting the disks addresses half of the hardware failures a server is likely to suffer.

Disk-level Fault Tolerance

Making the decision to have hard disk fault tolerance on the server is the first step; the second is deciding which fault-tolerant strategy to use. Hard disk fault tolerance is implemented according to different RAID (redundant array of inexpensive disks) levels. Each RAID level offers differing amounts of data protection and performance. The RAID level appropriate for a given situation depends on the importance placed on the data, the difficulty of replacing that data, and the costs of the respective RAID implementation. Oftentimes, the cost of data loss and replacement outweighs the costs associated with implementing a strong RAID fault-tolerant solution.

RAID 0: Stripe Set Without Parity

Although it's given RAID status, RAID 0 does not actually provide any fault tolerance; in fact, using RAID 0 might even be less fault tolerant than storing all of your data on a single hard disk.
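
The claim that a stripe set can be less fault tolerant than a single disk follows from simple probability. The sketch below assumes that every disk fails independently and with the same probability; the 5% figure is a hypothetical value chosen only for illustration, not a measured failure rate.

```python
def array_survival_probability(disk_failure_prob: float, disk_count: int) -> float:
    """Probability that a stripe set with no redundancy survives.

    Assumes each disk fails independently with the same probability
    (an illustrative simplification, not a measured figure).
    """
    # The array survives only if every one of its disks survives.
    return (1.0 - disk_failure_prob) ** disk_count


# Hypothetical 5% chance that any one disk fails during some period.
p = 0.05
single_disk = array_survival_probability(p, 1)
three_disk_stripe = array_survival_probability(p, 3)
print(f"single disk survives:   {single_disk:.4f}")
print(f"3-disk RAID 0 survives: {three_disk_stripe:.4f}")
```

Under these assumptions, every disk added to the stripe set lowers the chance that the whole volume survives, because the loss of any one disk loses the data.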

RAID 0 combines unused disk space on two or more hard drives into a single logical volume with data being written to equally sized stripes across all the disks. By using multiple disks, reads and writes are performed simultaneously across all drives. This means that disk access is faster, making the performance of RAID 0 better than other RAID solutions and significantly better than a single hard disk. The downside of RAID 0 is that if any disk in the array fails, the data is lost and must be restored from backup.
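
The striping described above can be pictured with a small simulation. This is a minimal sketch, not how a real controller is implemented: the disks are modeled as Python lists, and the function names and the two-byte stripe size are assumptions made for the example.

```python
def stripe(data: bytes, disk_count: int, stripe_size: int) -> list[list[bytes]]:
    """Split data into equally sized stripes and deal them round-robin to disks."""
    disks: list[list[bytes]] = [[] for _ in range(disk_count)]
    for index in range(0, len(data), stripe_size):
        chunk = data[index:index + stripe_size]
        disks[(index // stripe_size) % disk_count].append(chunk)
    return disks


def read_back(disks: list[list[bytes]]) -> bytes:
    """Reassemble the data by visiting the disks in the order they were written."""
    out = bytearray()
    rows = max(len(disk) for disk in disks)
    for row in range(rows):
        for disk in disks:
            if row < len(disk):
                out += disk[row]
    return bytes(out)


disks = stripe(b"ABCDEFGHIJKL", disk_count=3, stripe_size=2)
for number, contents in enumerate(disks):
    print(f"disk {number}: {contents}")
print(read_back(disks))
```

Each disk ends up holding only every third stripe, which is why the loss of any single disk makes the whole volume unreadable.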

Because of its lack of fault tolerance, RAID 0 is rarely implemented. Figure 9.2 shows an example of RAID 0 striping across three hard disks.

Figure 9.2. RAID 0 striping without parity.

RAID 1

One of the more common RAID implementations is RAID 1. RAID 1 requires two hard disks and uses disk mirroring to provide fault tolerance. When information is written to the hard disk, it is automatically and simultaneously written to the second hard disk. Both of the hard disks in the mirrored configuration use the same hard disk controller; the partitions used on the hard disk need to be approximately the same size to establish the mirror. In the mirrored configuration, if the primary disk were to fail, the second mirrored disk would contain all the required information and there would be little disruption to data availability. RAID 1 ensures that the server will continue operating in the case of the primary disk failure.
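
Mirroring can be sketched in a few lines. The class below is a toy model under stated assumptions: each disk is a dictionary of block numbers, and the class and method names are invented for the example rather than taken from any operating system or controller.

```python
class MirroredVolume:
    """Toy RAID 1 volume: every write goes to both disks."""

    def __init__(self) -> None:
        # Each "disk" is modeled as a dict of block number -> data.
        self.disks: list[dict[int, bytes]] = [{}, {}]
        self.failed: set[int] = set()

    def write(self, block: int, data: bytes) -> None:
        """Write the same block to every disk that is still working."""
        for number, disk in enumerate(self.disks):
            if number not in self.failed:
                disk[block] = data

    def fail_disk(self, number: int) -> None:
        """Simulate a disk failure by discarding that disk's contents."""
        self.failed.add(number)
        self.disks[number].clear()

    def read(self, block: int) -> bytes:
        """Read from the first disk that has not failed."""
        for number, disk in enumerate(self.disks):
            if number not in self.failed:
                return disk[block]
        raise IOError("both disks in the mirror have failed")


volume = MirroredVolume()
volume.write(0, b"payroll")
volume.fail_disk(0)        # the primary disk fails
print(volume.read(0))      # the data is still served from the mirror
```

The read succeeds after the primary fails because the second disk already holds an identical copy of every block.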

There are some key advantages to a RAID 1 solution. First, it is cheap, as only two hard disks are required to provide fault tolerance. Second, no additional software is required for establishing RAID 1, as modern network operating systems have built-in support for it. Third, RAID 1 can protect a boot or system partition, whereas RAID levels using striping are often incapable of including a boot or system partition in fault-tolerant solutions. Finally, RAID 1 offers load balancing over multiple disks, which increases read performance over that of a single disk. Write performance, however, is not improved.

Because of its advantages, RAID 1 is well suited as an entry-level RAID solution, but it has a few significant shortcomings that exclude its use in many environments. It has limited storage capacity: two 100GB hard drives provide only 100GB of storage space. Organizations with large data storage needs can exceed a mirrored solution's capacity in very short order. RAID 1 also has a single point of failure, the hard disk controller. If it were to fail, the data would be inaccessible on either drive. Figure 9.3 shows an example of RAID 1 disk mirroring.

Figure 9.3. RAID 1 disk mirroring.

An extension of RAID 1 is disk duplexing. Disk duplexing is the same as mirroring with the exception of one key detail: It places the hard disks on separate hard disk controllers, eliminating the single point of failure.

Be aware of the differences between disk duplexing and mirroring for the exam.

RAID 5

RAID 5, also known as disk striping with parity, uses distributed parity to write information across all disks in the array. Unlike the striping used in RAID 0, RAID 5 includes parity information in the striping, which provides fault tolerance. This parity information is used to re-create the data in the event of a failure. RAID 5 requires a minimum of three disks with the equivalent of a single disk being used for the parity information. This means that if you have three 40GB hard disks, you have 80GB of storage space with the other 40GB used for parity. To increase storage space in a RAID 5 array, you need only add another disk to the array. Depending on the sophistication of the RAID setup you are using, the RAID controller will be able to incorporate the new drive into the array automatically, or you will need to rebuild the array and restore the data from backup.

Many factors have made RAID 5 a very popular fault-tolerant design. RAID 5 can continue to function in the event of a single drive failure. If a hard disk in the array were to fail, the parity information would be used to re-create the missing data, and the array would continue to function with the remaining drives. The read performance of RAID 5 is also better than that of a single disk.
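
How parity re-creates missing data is easiest to see with a worked example. The sketch below assumes XOR parity, which is the usual choice for RAID 5 although the text above does not name it; the block contents and the helper name are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a three-disk array: two data blocks plus one parity block.
data_a = b"\x10\x20\x30\x40"
data_b = b"\x0f\x0e\x0d\x0c"
parity = xor_blocks([data_a, data_b])

# The disk holding data_b fails; rebuild it from the surviving blocks.
rebuilt_b = xor_blocks([data_a, parity])
print(rebuilt_b == data_b)
```

Because XOR is its own inverse, combining the surviving data block with the parity block yields exactly the block that was lost. The same calculation, repeated for every stripe, is what makes the regeneration of a replacement disk resource intensive.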

There are only a few drawbacks for the RAID 5 solution. These are as follows:

The initial costs of implementing RAID 5 are higher than those of other fault-tolerant measures because a minimum of three hard disks is required. Given the costs of hard disks today, this is a minor concern.
RAID 5 suffers from poorer write performance because the parity has to be calculated and then written across several disks. In practice, the performance lag is minimal and won't have a noticeable effect on the network.
When a new disk is placed in a failed RAID 5 array, there is a regeneration time while the data is rebuilt on the new drive. This process requires extensive resources from the server.

Figure 9.4 shows an example of RAID 5 striping with parity.

Figure 9.4. RAID 5 striping with parity.

RAID 10

Sometimes RAID levels are combined to take advantage of the best of each. One such strategy is RAID 10, which combines RAID levels 1 and 0. In this configuration, four disks are required. As you might expect, the configuration consists of a mirrored stripe set. To some extent, RAID 10 takes advantage of the performance capability of a stripe set while offering the fault tolerance of a mirrored solution. Along with the benefits of each strategy, however, RAID 10 also inherits the shortcomings of each. In this case, the high overhead and the decreased write performance are the disadvantages. Figure 9.5 shows an example of a RAID 10 configuration. Table 9.3 provides a summary of the various RAID levels.

Figure 9.5. Disks in a RAID 10 configuration.

Table 9.3. Summary of RAID Levels
RAID Level	Description	Advantages	Disadvantages	Required Disks
RAID 0	Disk striping	Increased read and write performance. RAID 0 can be implemented with only two disks.	Does not offer any fault tolerance.	Two or more
RAID 1	Disk mirroring	Provides fault tolerance. Can also be used with separate disk controllers, reducing the single point of failure (called disk duplexing).	RAID 1 has a 50% overhead and suffers from poor write performance.	Two
RAID 5	Disk striping with distributed parity	Can recover from a single disk failure; increased read performance over a single disk. Disks can be added to the array to increase storage capacity.	May slow down network during regeneration time, and may suffer from poor write performance.	Minimum of three
RAID 10	Striping with mirrored volumes	Increased performance with striping; offers mirrored fault tolerance.	High overhead, as with mirroring.	Four

RAID levels 2, 3, and 4 are omitted from this discussion as they are infrequently used and will rarely, if at all, be seen in modern network environments.
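
The storage overhead of the RAID levels summarized in Table 9.3 comes down to simple arithmetic. The function below is a sketch that assumes all disks in an array are the same size; its name is hypothetical, and allowing RAID 10 to use any even number of disks from four upward is an assumption that goes beyond the four-disk case described in this section.

```python
def usable_capacity_gb(level: str, disk_count: int, disk_size_gb: int) -> int:
    """Usable space for the RAID levels in Table 9.3, assuming equal-size disks."""
    if level == "RAID 0":
        if disk_count < 2:
            raise ValueError("RAID 0 needs two or more disks")
        return disk_count * disk_size_gb            # no redundancy, no overhead
    if level == "RAID 1":
        if disk_count != 2:
            raise ValueError("RAID 1 uses two disks")
        return disk_size_gb                         # 50% overhead
    if level == "RAID 5":
        if disk_count < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disk_count - 1) * disk_size_gb      # one disk's worth of parity
    if level == "RAID 10":
        if disk_count < 4 or disk_count % 2:
            raise ValueError("RAID 10 needs an even number of disks, four or more")
        return disk_count // 2 * disk_size_gb       # half the disks hold mirrors
    raise ValueError(f"unsupported level: {level}")


print(usable_capacity_gb("RAID 1", 2, 100))   # two 100GB disks
print(usable_capacity_gb("RAID 5", 3, 40))    # three 40GB disks
```

The two calls reproduce the figures used earlier in this section: a mirror of two 100GB drives yields 100GB, and three 40GB disks in RAID 5 yield 80GB.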

Server and Services Fault Tolerance

In addition to providing fault tolerance for individual hardware components, some organizations go the extra mile to include the entire server in the fault-tolerant design. Such a design keeps servers and the services they provide up and running. When it comes to server fault tolerance, two key strategies are commonly employed: stand-by servers and server clustering.

Stand-by Servers

Stand-by servers are a fault-tolerant measure in which a second server is configured identically to the first one. The second server can be stored remotely or locally and set up in a failover configuration. In a failover configuration, the secondary server is connected to the primary and ready to take over the server functions at a heartbeat's notice. If the secondary server detects that the primary has failed, it will automatically cut in. Network users will not notice the transition, as there will be little or no disruption in data availability.

The primary server communicates with the secondary server by issuing special notification messages referred to as heartbeats. If the secondary server stops receiving the heartbeat messages, it assumes that the primary has failed and takes over the primary server's role.
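
The heartbeat logic can be sketched as a small state machine on the secondary server. This is a simplified model, not a real failover product: the class name, the three-second timeout, and the use of explicit timestamps instead of a real clock and network are all assumptions made so the example is easy to follow.

```python
class StandbyMonitor:
    """Secondary server's view of the primary, driven by explicit timestamps."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = 0.0   # assumes the clock starts at zero
        self.active = False         # True once the secondary has taken over

    def receive_heartbeat(self, now: float) -> None:
        """Record the arrival time of a heartbeat from the primary."""
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Take over if no heartbeat has arrived within the timeout."""
        if not self.active and now - self.last_heartbeat > self.timeout_seconds:
            self.active = True
        return self.active


monitor = StandbyMonitor(timeout_seconds=3.0)
for second in (1.0, 2.0, 3.0):       # the primary is healthy
    monitor.receive_heartbeat(second)
print(monitor.check(now=4.0))        # last heartbeat is 1 second old
print(monitor.check(now=7.5))        # 4.5 seconds of silence
```

The first check leaves the secondary on standby because the last heartbeat is recent; the second check exceeds the timeout, so the secondary takes over.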

Server Clustering

Those companies wanting maximum data availability that have the funds to pay for it can choose to use server clustering. As the name suggests, server clustering involves grouping servers together for the purposes of fault tolerance and load balancing. In this configuration, other servers in the cluster can compensate for the failure of a single server. The failed server will have no impact on the network, and the end users will have no idea that a server has failed.

The clear advantage of server clusters is that they offer the highest level of fault tolerance and data availability. The disadvantage is equally clear: cost. The cost of buying a single server can be a huge investment for many organizations; having to buy duplicate servers can be far too costly.

Link Redundancy

Although a failed network card might not actually stop the server or a system, it might as well. A network server that cannot be used on the network makes for server downtime. Although the chances of a failed network card are relatively low, our attempts to reduce the occurrence of downtime have led to the development of a strategy that provides fault tolerance for network connections.

Through a process called adapter teaming, groups of network cards are configured to act as a single unit. The teaming capability is achieved through software, either as a function of the network card driver or through specific application software. Adapter teaming is not yet widely implemented, but the benefits it offers are many, so it's likely to become a more common sight. The result of adapter teaming is increased bandwidth, fault tolerance, and the ability to manage network traffic more effectively. These features fall into three categories:

Adapter fault tolerance: The basic configuration enables one network card to be configured as the primary device and others as secondary. If the primary adapter fails, one of the other cards can take its place without the need for intervention. When the original card is replaced, it resumes the role of primary adapter.
Adapter load balancing: Because software controls the network adapters, workloads can be distributed evenly among the cards so that each link is used to a similar degree. This distribution allows for a more responsive server because one card is not overworked while another is underworked.
Link aggregation: This provides vastly improved performance by allowing more than one network card's bandwidth to be aggregated (combined) into a single connection. For example, through link aggregation, four 100Mbps network cards can provide a total of 400Mbps bandwidth. Link aggregation requires that both the network adapters and the switch being used support it. In 2000, the IEEE ratified the 802.3ad standard for link aggregation, allowing compatible products to be produced.
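
The three teaming features can be illustrated with one toy model. The class below is only a sketch: the card names, the round-robin balancing policy, and every method name are assumptions made for the example and do not correspond to any particular driver or teaming product.

```python
from itertools import cycle


class AdapterTeam:
    """Toy model of a team of network cards acting as one unit."""

    def __init__(self, names: list[str], mbps_each: int) -> None:
        self.mbps_each = mbps_each
        self.healthy = list(names)   # the first entry acts as the primary

    def primary(self) -> str:
        return self.healthy[0]

    def fail(self, name: str) -> None:
        """Fault tolerance: drop a failed card; the next one becomes primary."""
        self.healthy.remove(name)

    def restore(self, name: str) -> None:
        """A replaced original card resumes the primary role."""
        self.healthy.insert(0, name)

    def aggregate_mbps(self) -> int:
        """Link aggregation: combined bandwidth of the healthy cards."""
        return self.mbps_each * len(self.healthy)

    def assign(self, flows: list[str]) -> dict[str, str]:
        """Load balancing: spread traffic flows round-robin over healthy cards."""
        return dict(zip(flows, cycle(self.healthy)))


team = AdapterTeam(["nic0", "nic1", "nic2", "nic3"], mbps_each=100)
print(team.aggregate_mbps())
print(team.assign(["a", "b", "c", "d", "e"]))
team.fail("nic0")
print(team.primary(), team.aggregate_mbps())
```

With four healthy cards, the team offers the combined bandwidth of all four; after the primary fails, a secondary card takes over and the aggregate drops by one card's share.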

Using Uninterruptible Power Supplies

No discussion of fault tolerance can be complete without a look at power-related issues and the mechanisms used to combat them. When you're designing a fault-tolerant system, your planning should definitely include uninterruptible power supplies (UPSs). A UPS serves many functions and is a major consideration when planning and implementing a server.

On a basic level, a UPS is a box that holds a battery and a built-in charging circuit. During times of good power, the battery is recharged; when the UPS is needed, it's ready to provide power to the server. Most often, the UPS is required to provide enough power to give the administrator time to shut down the server in an orderly fashion, preventing any potential data loss from a dirty shutdown.
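
Whether a UPS gives the administrator enough time for an orderly shutdown can be estimated roughly. The sketch below assumes a simple model in which runtime is the usable battery energy divided by the load; real battery behavior is less linear, and the battery size, load, efficiency, and shutdown time shown are hypothetical values, not vendor figures.

```python
def estimated_runtime_minutes(battery_wh: float, load_watts: float,
                              efficiency: float = 0.9) -> float:
    """Rough runtime estimate: usable stored energy divided by the load."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    return battery_wh * efficiency / load_watts * 60.0


def can_shut_down_cleanly(runtime_minutes: float, shutdown_minutes: float) -> bool:
    """True if the UPS should outlast an orderly shutdown."""
    return runtime_minutes >= shutdown_minutes


# Hypothetical 180 watt-hour battery feeding a 400-watt server.
runtime = estimated_runtime_minutes(battery_wh=180.0, load_watts=400.0)
print(f"{runtime:.1f} minutes of runtime")
print(can_shut_down_cleanly(runtime, shutdown_minutes=10.0))
```

An estimate like this is only a starting point; the manufacturer's runtime charts for the actual load are the better guide when sizing a UPS.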

Why Use a UPS?

Organizations of all shapes and sizes need UPSs as part of their fault-tolerance strategies. A UPS is as important as any other fault-tolerance measure. Three key reasons make a UPS necessary:

Data availability: The goal of any fault-tolerance measure is data availability. A UPS ensures access to the server in the event of a power failure, or at least for as long as it takes to save a file.
Protection from data loss: Fluctuations in power or a sudden power down can damage the data on the server system. In addition, many servers take full advantage of caching, and a sudden loss of power could cause the loss of all information held in cache.
Protection from hardware damage: Constant power fluctuations or sudden power downs can damage hardware components within a computer. Damaged hardware can lead to reduced data availability while the hardware is being repaired.

Power Threats

In addition to keeping a server functioning long enough to safely shut it down, a UPS also safeguards a server from inconsistent power. This inconsistent power can take many forms. A UPS protects a system from the following power-related threats:

Blackout: A total failure of the power supplied to the server.
Spike: A spike is a very short (usually less than a second) but very intense increase in voltage. Spikes can do irreparable damage to any kind of equipment, especially computers.
Surge: Compared to a spike, a surge is a considerably longer (sometimes many seconds) but usually less intense increase in voltage. Surges can also damage your computer equipment.
Sag: A sag is a short-term voltage drop (the opposite of a spike). This type of voltage drop can cause a server to reboot.
Brownout: A brownout is a drop in voltage that usually lasts more than a few minutes.
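
The five threats differ mainly in the direction of the voltage change and in how long it lasts, which the sketch below turns into a small classifier. The numeric cut-offs (one second for a spike, three minutes for a brownout) are assumptions chosen to match the rough descriptions above; they are not standard values, and the function name is invented for the example.

```python
def classify_power_event(voltage_ratio: float, duration_seconds: float) -> str:
    """Name a power event from its voltage (as a fraction of nominal) and duration.

    The numeric cut-offs are illustrative assumptions, not standard values.
    """
    if voltage_ratio == 0.0:
        return "blackout"
    if voltage_ratio > 1.0:
        # Very short increases are spikes; longer ones are surges.
        return "spike" if duration_seconds < 1.0 else "surge"
    if voltage_ratio < 1.0:
        # "More than a few minutes" is taken here as three minutes or longer.
        return "brownout" if duration_seconds >= 180.0 else "sag"
    return "normal"


events = [(0.0, 30.0), (1.8, 0.01), (1.2, 5.0), (0.8, 2.0), (0.85, 600.0)]
for ratio, seconds in events:
    print(ratio, seconds, classify_power_event(ratio, seconds))
```

The sample events cover one case of each threat, from a total loss of power to a voltage drop that lasts ten minutes.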

Many of these power-related threats can occur without your knowledge; if you don't have a UPS, you cannot prepare for them. For the cost, it is worth buying a UPS, if for no other reason than to sleep better at night.