Redundancy | Upgrading and Repairing Servers

As mentioned earlier, servers may be rated based on their availability, or the amount of uptime they have. The more reliable a server has to be, the more "insurance" you have to buy for contingencies. Lots of things can go wrong on a server: everything from driver corruption to power supply problems or the failure of an add-in card. To create servers that are truly fault tolerant, vendors go to great lengths to build in redundancy as well as seamless failover. In fault-tolerant systems like the ones that Stratus Computer (www.stratus.com) is famous for, essentially every component in the server is duplicated. In such a system, when a component of the server fails, the entire server fails over quickly and seamlessly.

Storage systems, disks, and disk controllers are the components that fail most often in servers. A disk is a mechanical device, and it runs hot and at high speed. You can make these components very reliable, but you can't completely eliminate failure. When you multiply the number of disks to expand your storage capacity, the potential for problems also goes up linearly. Even in servers that don't need to be highly available and can suffer some downtime, protecting your storage assets (that is, data) is paramount.
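
To make the scaling claim concrete, here is a minimal sketch that assumes each disk fails independently with the same probability over some period. The function names and the 3 percent figure are illustrative assumptions, not measurements from any vendor. The expected number of failed disks grows linearly with the disk count, while the chance of seeing at least one failure grows a little more slowly because it can never exceed 1.

```python
def expected_failures(num_disks: int, p_single: float) -> float:
    """Expected number of failed disks; grows linearly with num_disks."""
    return num_disks * p_single


def prob_any_failure(num_disks: int, p_single: float) -> float:
    """Probability that at least one disk fails, assuming independent failures."""
    return 1.0 - (1.0 - p_single) ** num_disks


if __name__ == "__main__":
    # 0.03 is a hypothetical per-disk failure probability, chosen only for illustration.
    for n in (1, 2, 5, 10, 20):
        print(n, expected_failures(n, 0.03), round(prob_any_failure(n, 0.03), 4))
```

For small per-disk probabilities the two numbers stay close, which is why "linear" is a reasonable rule of thumb for modest array sizes.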

The popularization of inexpensive disk drive technology provided the impetus to create disk structures that could survive different types of failure, including data corruption, drive failures, host bus failures, and array failures. You do this by creating redundancy, which you can achieve in several different ways:

If a drive fails, you can switch to another drive with exactly the same data, called a mirror.
If your data spans more than one drive and a drive fails, you can reconstruct your data from additional parity data written to the volume that "fills in" the missing data, provided that the data is spread out evenly (that is, striped) across all the volumes in the array.
If a host bus adapter (HBA) fails, you can switch to another HBA, usually one that manages an array mirroring the array attached to the failed HBA. This approach also works for array failure.
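
The parity approach in the second item can be sketched in a few lines. The text does not say how the parity data is computed; this sketch assumes simple XOR parity, the scheme commonly used by parity-based RAID levels, and models each drive's share of a stripe as a short byte string. The helper names (xor_blocks, make_parity, reconstruct) are hypothetical and exist only for this illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the parity block for one stripe of data blocks."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# One stripe spread across three data drives (illustrative contents).
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(stripe)

# Simulate losing the second drive, then rebuild its block.
recovered = reconstruct([stripe[0], stripe[2]], parity)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields exactly the missing block. The same property explains the limit of this scheme: a single parity block can "fill in" only one missing drive per stripe.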

These are three very different approaches to three very different problems, yet all of them are grouped together under a concept called RAID. The redundancy discussed in the first case involves both hardware (a disk) and data (a data set). In the second case, the data set still exists, but hardware is added (a spare disk), and the data set is reestablished in its original form. The third case also substitutes redundant hardware and a redundant data set to recover from error. RAID can be relatively inexpensive (for example, replace a drive) or very expensive (for example, run duplicate or triplicate data systems for redundancy).