In computing, fault tolerance refers to the capability of a computer system or network to provide continued data availability in the event of hardware failure. Every component within a server, from the CPU fan to the power supply, has a chance of failure. Some components, such as processors, rarely fail, whereas hard disk failures are well documented. Almost every component has fault-tolerant measures. These measures typically require redundant hardware components that can easily or automatically take over when there is a hardware failure. Of all the components inside computer systems, the hard disks require the most redundancy. Not only are hard disk failures more common than failures of any other component, but the disks also hold the data, without which there would be little need for a network.
Disk-level Fault Tolerance

Making the decision to have hard disk fault tolerance on the server is the first step; the second is deciding which fault-tolerant strategy to use. Hard disk fault tolerance is implemented according to different RAID (redundant array of inexpensive disks) levels. Each RAID level offers a different degree of data protection and performance. The RAID level appropriate for a given situation depends on the importance placed on the data, the difficulty of replacing that data, and the costs of the respective RAID implementation. Often, the cost of data loss and replacement outweighs the cost of implementing a strong RAID fault-tolerant solution.

RAID 0: Stripe Set Without Parity

Although it's given RAID status, RAID 0 does not actually provide any fault tolerance; in fact, using RAID 0 might even be less fault tolerant than storing all of your data on a single hard disk. RAID 0 combines unused disk space on two or more hard drives into a single logical volume, with data being written in equally sized stripes across all the disks. Because multiple disks are used, reads and writes are performed simultaneously across all drives. This means that disk access is faster, making the performance of RAID 0 better than other RAID solutions and significantly better than a single hard disk. The downside of RAID 0 is that if any disk in the array fails, the data is lost and must be restored from backup. Because of its lack of fault tolerance, RAID 0 is rarely implemented. Figure 9.2 shows an example of RAID 0 striping across three hard disks.

Figure 9.2. RAID 0 striping without parity.

RAID 1

One of the more common RAID implementations is RAID 1. RAID 1 requires two hard disks and uses disk mirroring to provide fault tolerance. When information is written to the first hard disk, it is automatically and simultaneously written to the second hard disk.
Both of the hard disks in the mirrored configuration use the same hard disk controller; the partitions used on the hard disks need to be approximately the same size to establish the mirror. In the mirrored configuration, if the primary disk were to fail, the second mirrored disk would contain all the required information and there would be little disruption to data availability. RAID 1 ensures that the server will continue operating if the primary disk fails.

There are some key advantages to a RAID 1 solution. First, it is cheap, as only two hard disks are required to provide fault tolerance. Second, no additional software is required for establishing RAID 1, as modern network operating systems have built-in support for it. Third, RAID 1 can protect a boot or system partition, whereas RAID levels using striping are often incapable of including a boot or system partition in fault-tolerant solutions. Finally, RAID 1 offers load balancing over multiple disks, which increases read performance over that of a single disk. Write performance, however, is not improved.

Because of its advantages, RAID 1 is well suited as an entry-level RAID solution, but it has a few significant shortcomings that exclude its use in many environments. It has limited storage capacity: two 100GB hard drives provide only 100GB of storage space. Organizations with large data storage needs can exceed a mirrored solution's capacity in very short order. RAID 1 also has a single point of failure, the hard disk controller. If it were to fail, the data would be inaccessible on either drive. Figure 9.3 shows an example of RAID 1 disk mirroring.

Figure 9.3. RAID 1 disk mirroring.

An extension of RAID 1 is disk duplexing. Disk duplexing is the same as mirroring with the exception of one key detail: it places the hard disks on separate hard disk controllers, eliminating the single point of failure.
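The difference between RAID 0 striping and RAID 1 mirroring described above can be illustrated with a small Python sketch. It is only a toy model that treats each disk as an in-memory list; the function names (`stripe_write`, `stripe_read`, `mirror_write`) and the stripe size are assumptions made for illustration, not part of any real RAID implementation.

```python
def stripe_write(data, num_disks, stripe_size):
    """RAID 0 style: deal equally sized stripes round-robin across the disks."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), stripe_size):
        stripe_index = offset // stripe_size
        disks[stripe_index % num_disks].append(data[offset:offset + stripe_size])
    return disks


def stripe_read(disks):
    """Reassemble striped data; every disk is needed to rebuild the volume."""
    depth = max(len(disk) for disk in disks)
    stripes = []
    for row in range(depth):
        for disk in disks:
            if row < len(disk):
                stripes.append(disk[row])
    return b"".join(stripes)


def mirror_write(data):
    """RAID 1 style: the same data is written to both disks."""
    return data, data


striped = stripe_write(b"ABCDEFGH", num_disks=3, stripe_size=2)
primary, mirror = mirror_write(b"ABCDEFGH")
```

In the striped layout each disk holds only part of the data, so losing any one list makes the volume unrecoverable, whereas either copy returned by `mirror_write` is a complete replica.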
RAID 5

RAID 5, also known as disk striping with parity, uses distributed parity to write information across all disks in the array. Unlike the striping used in RAID 0, RAID 5 includes parity information in the striping, which provides fault tolerance. This parity information is used to re-create the data in the event of a failure. RAID 5 requires a minimum of three disks, with the equivalent of a single disk being used for the parity information. This means that if you have three 40GB hard disks, you have 80GB of storage space, with the other 40GB used for parity. To increase storage space in a RAID 5 array, you need only add another disk to the array. Depending on the sophistication of the RAID setup you are using, either the RAID controller will be able to incorporate the new drive into the array automatically, or you will need to rebuild the array and restore the data from backup.

Many factors have made RAID 5 a very popular fault-tolerant design. RAID 5 can continue to function in the event of a single drive failure. If a hard disk in the array were to fail, the array would use the parity to re-create the missing data and continue to function with the remaining drives. The read performance of RAID 5 is improved over that of a single disk. There are only a few drawbacks for the RAID 5 solution. These are as follows:
Figure 9.4 shows an example of RAID 5 striping with parity.

Figure 9.4. RAID 5 striping with parity.

RAID 10

Sometimes RAID levels are combined to take advantage of the best of each. One such strategy is RAID 10, which combines RAID levels 1 and 0. In this configuration, four disks are required. As you might expect, the configuration consists of a mirrored stripe set. To some extent, RAID 10 takes advantage of the performance capability of a stripe set while offering the fault tolerance of a mirrored solution. Along with the benefits of each, though, RAID 10 also inherits the shortcomings of each strategy. In this case, the high overhead and the decreased write performance are the disadvantages. Figure 9.5 shows an example of a RAID 10 configuration. Table 9.3 provides a summary of the various RAID levels.

Figure 9.5. Disks in a RAID 10 configuration.
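To make the RAID 5 parity idea and the capacity figures quoted in this section concrete, here is a minimal Python sketch. The text does not name the parity operation; the sketch assumes bytewise XOR, which is the conventional choice for RAID 5. The function names (`xor_blocks`, `usable_capacity_gb`) are hypothetical, and the capacity helper assumes equally sized disks and a two-disk RAID 1 mirror.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks (assumed RAID 5 parity operation)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def usable_capacity_gb(level, disk_gb, num_disks):
    """Usable space for the RAID levels in this section, assuming equal disks."""
    if level == 0:
        return disk_gb * num_disks          # no redundancy, all space usable
    if level == 1:
        return disk_gb                      # two-disk mirror: one disk's worth
    if level == 5:
        return disk_gb * (num_disks - 1)    # one disk's worth holds parity
    if level == 10:
        return disk_gb * num_disks // 2     # half the disks are mirrors
    raise ValueError("unsupported RAID level")


# One stripe across a three-disk RAID 5 array: two data blocks plus parity.
block_a = b"NETW"
block_b = b"ORK+"
parity = xor_blocks([block_a, block_b])

# If the disk holding block_b fails, XOR of the survivors re-creates it.
rebuilt_b = xor_blocks([block_a, parity])
```

With these definitions, `usable_capacity_gb(5, 40, 3)` gives the 80GB from the three-disk example, and `usable_capacity_gb(1, 100, 2)` gives the 100GB from the mirrored example.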
Server and Services Fault Tolerance

In addition to providing fault tolerance for individual hardware components, some organizations go the extra mile to include the entire server in the fault-tolerant design. Such a design keeps servers and the services they provide up and running. When it comes to server fault tolerance, two key strategies are commonly employed: stand-by servers and server clustering.

Stand-by Servers

Stand-by servers are a fault-tolerant measure in which a second server is configured identically to the first one. The second server can be stored remotely or locally and set up in a failover configuration. In a failover configuration, the secondary server is connected to the primary and ready to take over the server functions at a heartbeat's notice. If the secondary server detects that the primary has failed, it will automatically cut in. Network users will not notice the transition, as there will be little or no disruption in data availability. The primary server communicates with the secondary server by issuing special notification messages referred to as heartbeats. If the secondary server stops receiving the heartbeat messages, it assumes that the primary has died and takes over the primary server configuration.

Server Clustering

Companies that want maximum data availability and have the funds to pay for it can choose to use server clustering. As the name suggests, server clustering involves grouping servers together for the purposes of fault tolerance and load balancing. In this configuration, other servers in the cluster can compensate for the failure of a single server. The failed server will have no impact on the network, and the end users will have no idea that a server has failed. The clear advantage of server clusters is that they offer the highest level of fault tolerance and data availability. The disadvantage is equally clear: cost.
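The heartbeat-based failover described under stand-by servers can be sketched in Python as follows. This is an illustrative model only: the `StandbyMonitor` class, its method names, and the three-second timeout are assumptions, and a real failover product would also have to handle networking, shared storage, and address takeover. The clock is injectable so that the behavior can be exercised without waiting in real time.

```python
import time


class StandbyMonitor:
    """Runs on the secondary server and watches for heartbeats from the primary."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s      # assumed silence threshold, in seconds
        self.clock = clock
        self.last_heartbeat = clock()
        self.active = False             # True once the standby has taken over

    def on_heartbeat(self):
        """Record that a heartbeat message arrived from the primary."""
        self.last_heartbeat = self.clock()

    def check(self):
        """Take over if the primary has been silent for longer than the timeout."""
        silent_for = self.clock() - self.last_heartbeat
        if not self.active and silent_for > self.timeout_s:
            self.active = True          # assume the primary server's role
        return self.active
```

The secondary would call `on_heartbeat()` whenever a message arrives and call `check()` periodically; once `check()` returns `True`, the standby has assumed the primary's role.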
The cost of buying a single server can be a huge investment for many organizations; having to buy duplicate servers is far too costly.

Link Redundancy

Although a failed network card might not actually stop the server or a system, it might as well. A network server that cannot be reached on the network is, in effect, down. Although the chances of a failed network card are relatively low, attempts to reduce the occurrence of downtime have led to the development of a strategy that provides fault tolerance for network connections. Through a process called adapter teaming, groups of network cards are configured to act as a single unit. The teaming capability is achieved through software, either as a function of the network card driver or through specific application software. Adapter teaming is not widely implemented, though the benefits it offers are many, so it is likely to become a more common sight. The result of adapter teaming is increased bandwidth, fault tolerance, and the ability to manage network traffic more effectively. These features are broken down into three sections:
Using Uninterruptible Power Supplies

No discussion of fault tolerance can be complete without a look at power-related issues and the mechanisms used to combat them. When you're designing a fault-tolerant system, your planning should definitely include uninterruptible power supplies (UPSs). A UPS serves many functions and is a major part of server consideration and implementation. On a basic level, a UPS is a box that holds a battery and a built-in charging circuit. During times of good power, the battery is recharged; when the UPS is needed, it's ready to provide power to the server. Most often, the UPS is required to provide enough power to give the administrator time to shut down the server in an orderly fashion, preventing any potential data loss from a dirty shutdown.

Why Use a UPS?

Organizations of all shapes and sizes need UPSs as part of their fault-tolerance strategies. A UPS is as important as any other fault-tolerance measure. Three key reasons make a UPS necessary:
Power Threats

In addition to keeping a server functioning long enough to safely shut it down, a UPS also safeguards a server from inconsistent power. This inconsistent power can take many forms. A UPS protects a system from the following power-related threats:
Many of these power-related threats can occur without your knowledge; if you don't have a UPS, you cannot prepare for them. For the cost, it is worth buying a UPS, if for no other reason than to sleep better at night.