Fault-Tolerant Systems


Network downtime is not only frustrating to users and network administrators; it can become downright expensive. A nonoperational network can cost an organization upwards of $50,000 per hour, depending on the application and marketplace, according to a downtime cost survey performed by market research firm The Yankee Group. Take the case of a reservation system, such as that of an airline or a theater-ticket agency. During a two-hour outage, an agency dependent on such a network could lose thousands or, in the case of an airline, millions of dollars in revenue.

As a result, there is a strong demand among end users for products that protect their network systems against loss of data; these include magnetic tape-based backup-and-recording systems, uninterruptible power supplies, and fault-tolerant systems.

Most network managers rely on some form of regular backup procedure, which creates archival files, and on uninterruptible power supplies, which provide battery-supplied electricity that takes over automatically when regular electrical service fails. As a result, these technologies are fairly well understood and widely implemented.

Fault-tolerant products such as Novell's System Fault Tolerant (SFT) NetWare are not so well understood and, as a consequence, relatively few are in use. Because they are based on hardware redundancy that provides two identical copies of data and program files, fault-tolerant systems also go by the nickname of mirroring devices.

The term mirroring is apt in one sense: Fault-tolerant products rely on two mass-storage devices (usually two servers or two hard disks) that work in tandem to maintain mirror images of each other. That is, they contain identical formatting, applications, and data files. The analogy isn't exact, however. A true mirrored image is backward from its original, while the formatting and files on fault-tolerant systems, as one might expect, are identical, not backward.

Still, the mirroring analogy is helpful in describing and understanding the concept of fault-tolerant systems. Fault-tolerant products offer a measure of security that goes beyond the backup-and-recovery process, which provides a static, or time-specific, record of the data stored on a network's hard disk drives.

Identical Data Stores

Fault-tolerant systems prevent data loss and network downtime by giving the network operating system (NOS) real-time, immediate access to two identical and dynamically changing copies of the information stored on the network. A fault-tolerant system thus relies on hardware redundancy, with either two identical hard disks or two identical servers.

In a fault-tolerant system, the failure of one mirrored component (a hard disk, for example) doesn't bring on a catastrophic collapse of the network. The duplicate, or secondary, device, which runs concurrently with the primary device, simply takes over the tasks the primary component was handling, and the user isn't aware that the network has experienced trouble.

In a disk-mirroring system, a network server contains hard disks in shadowed pairs of primary and secondary drives. When network users store data to the network, the server writes it to both drives, thus creating mirrored images on the separate devices. Should either hard disk fail, network operation can continue uninterrupted, since the NOS automatically makes all reads and writes to and from the remaining hard disk.
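The write-to-both, read-from-the-survivor behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not how any particular NOS implements mirroring; the names Disk, MirroredPair, and DiskFailure are hypothetical, and each drive is simulated as an in-memory dictionary of blocks.

```python
class DiskFailure(Exception):
    """Raised when a simulated drive cannot service a request."""


class Disk:
    """A toy drive: a dict of block number -> bytes, plus a failure flag."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, block_no, data):
        if self.failed:
            raise DiskFailure(self.name)
        self.blocks[block_no] = data

    def read(self, block_no):
        if self.failed:
            raise DiskFailure(self.name)
        return self.blocks[block_no]


class MirroredPair:
    """Writes go to every healthy drive; reads use the first healthy drive."""

    def __init__(self, primary, secondary):
        self.drives = [primary, secondary]

    def _healthy(self):
        healthy = [d for d in self.drives if not d.failed]
        if not healthy:
            raise DiskFailure("both drives failed")
        return healthy

    def write(self, block_no, data):
        for drive in self._healthy():
            drive.write(block_no, data)

    def read(self, block_no):
        return self._healthy()[0].read(block_no)


primary, secondary = Disk("primary"), Disk("secondary")
pair = MirroredPair(primary, secondary)
pair.write(0, b"reservation record")   # stored on both drives
primary.failed = True                  # simulate a crash of the primary drive
survivor_data = pair.read(0)           # served by the secondary drive
pair.write(1, b"new booking")          # writes continue on the remaining drive
```

In this sketch the caller of pair.read() and pair.write() never learns which drive answered, which mirrors the point that users do not notice the failure.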

Some disk-mirroring packages offer specific options to fine-tune a system. Novell's SFT NetWare, for example, lets network administrators duplicate directories and file-allocation tables while providing a read-after-write verification process. And if an SFT NetWare server fails to read a block of data from one mirrored disk, the server automatically looks to the secondary disk to fetch the data. Moreover, SFT NetWare marks the bad area on the disk as unusable, then repairs the file by copying the valid data from the secondary disk to a known-usable area on the primary disk.
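As a rough illustration of read-after-write verification, the Python sketch below writes a block, reads it straight back, and redirects the data to a spare area when the two don't match. It is a simplified model under stated assumptions, not SFT NetWare's actual mechanism: ToyDisk, VerifyingDisk, and the spare_start parameter are hypothetical, and a defective block is simulated as one that stores garbage.

```python
class ToyDisk:
    """A simulated drive whose defective blocks corrupt whatever is written."""

    def __init__(self, size, bad_blocks=()):
        self.blocks = [None] * size
        self.bad_blocks = set(bad_blocks)

    def raw_write(self, physical, data):
        # A defective block stores garbage instead of the caller's data.
        self.blocks[physical] = b"<garbled>" if physical in self.bad_blocks else data

    def raw_read(self, physical):
        return self.blocks[physical]


class VerifyingDisk:
    """Verifies every write by reading it back; remaps blocks that fail.

    Blocks from spare_start onward are assumed to be reserved as the
    known-usable spare area and are not addressed directly by callers.
    """

    def __init__(self, disk, spare_start):
        self.disk = disk
        self.next_spare = spare_start
        self.remap = {}        # logical block -> spare physical block
        self.unusable = set()  # physical blocks that failed verification

    def write(self, logical, data):
        physical = self.remap.get(logical, logical)
        while True:
            self.disk.raw_write(physical, data)
            if self.disk.raw_read(physical) == data:
                return physical            # verification passed
            self.unusable.add(physical)    # mark the bad area unusable
            if self.next_spare >= len(self.disk.blocks):
                raise OSError("no spare blocks left")
            physical = self.next_spare     # retry in the spare area
            self.next_spare += 1
            self.remap[logical] = physical

    def read(self, logical):
        return self.disk.raw_read(self.remap.get(logical, logical))


disk = ToyDisk(size=10, bad_blocks={3})
verified = VerifyingDisk(disk, spare_start=8)
landed_at = verified.write(3, b"payroll")  # block 3 fails, data lands in a spare
```

After the write, reads of logical block 3 are transparently served from the spare block, so the caller never sees the defect.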

Mirroring Servers

Similarly, a server-mirroring system, composed of primary and secondary servers, operates with the two servers running in parallel. The primary server handles all network activity while the secondary server operates concurrently, in the background, as it were.

Mirrored servers generally are linked via a special cable and dedicated interface adapters, one of which must be plugged into each server's internal bus. Depending on vendor, these links can be made via RS-232, parallel, or SCSI connections.

Each server continuously monitors the operation of the other, so when the primary server fails, the secondary server automatically takes control of the network. Because the secondary server's hard disks contain mirrored images of those on the primary server, users don't lose data in the exchange. Should the primary server's failure be limited to a hard-disk crash, the primary server automatically switches disk I/O to the secondary server's mass-storage system.
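One common way to implement this kind of mutual monitoring is a heartbeat with a timeout: the secondary takes over once it has heard nothing from the primary for too long. The source does not say which mechanism any given vendor uses, so the Python sketch below is an assumption for illustration only. ServerMonitor and its timeout value are hypothetical, and time is passed in explicitly so the example runs without real delays.

```python
class ServerMonitor:
    """Runs on the secondary server and watches heartbeats from the primary."""

    def __init__(self, timeout):
        self.timeout = timeout        # seconds of silence tolerated
        self.last_heartbeat = 0.0     # time the primary was last heard from
        self.role = "secondary"

    def heartbeat(self, now):
        """Record a heartbeat received over the dedicated link."""
        self.last_heartbeat = now

    def check(self, now):
        """Promote this server if the primary has been silent too long."""
        if self.role == "secondary" and now - self.last_heartbeat > self.timeout:
            self.role = "primary"
        return self.role


monitor = ServerMonitor(timeout=3.0)
for t in (0.0, 1.0, 2.0):             # primary is alive and reporting
    monitor.heartbeat(t)

# The primary falls silent after t = 2.0; the secondary keeps checking.
roles = [monitor.check(t) for t in (2.5, 4.0, 5.5)]
```

Choosing the timeout is a trade-off: too short and a brief hiccup on the link triggers a needless takeover, too long and users wait before the secondary steps in.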

Disk Duplexing

Another form of fault tolerance is disk duplexing, in which two disk controllers rather than one are used within a server. This provides nonstop operation should a disk controller, disk interface, or disk power supply fail. Duplexing can also improve a system's performance by creating two data channels. If any component within one channel fails, the second channel takes over automatically, again without loss of data.

In this type of system, the server's processor can receive data from whichever disk channel responds first. This often improves performance because the majority of requests across a network are disk reads. In addition, a server with duplexed disk drives can read from one disk drive, write to a second, then, when the process is completed, create a mirror image of the most recently stored data.
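The idea of taking data from whichever channel responds first can be modeled with two concurrent reads. The Python sketch below is only an analogy using threads and artificial delays; make_channel, duplexed_read, and the latency figures are hypothetical, and real duplexing happens at the controller and driver level rather than in application code.

```python
import concurrent.futures
import time


def make_channel(name, latency_s, blocks):
    """Build a simulated disk channel that answers after latency_s seconds."""
    def read(block_no):
        time.sleep(latency_s)
        return name, blocks[block_no]
    return read


def duplexed_read(channels, block_no):
    """Issue the read on every channel and return the first answer."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(channels))
    futures = [pool.submit(channel, block_no) for channel in channels]
    done, _pending = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED
    )
    pool.shutdown(wait=False)   # don't hold the caller up for the slower read
    return next(iter(done)).result()


mirrored_blocks = {7: b"seat map"}    # both channels hold identical data
channel_a = make_channel("channel A", 0.30, mirrored_blocks)  # busy channel
channel_b = make_channel("channel B", 0.01, mirrored_blocks)  # idle channel
winner, data = duplexed_read([channel_a, channel_b], 7)
```

Because both channels hold the same data, it doesn't matter which one wins; the read simply completes as fast as the quicker channel allows.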

After a disk or server has been taken out of operation and is ready to be put back online, a fault-tolerant system must provide a synchronization process that puts the redundant components back in sync with each other.
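A minimal way to picture resynchronization is to compare the returning device with the survivor block by block and copy over whatever differs. The Python sketch below assumes each device is a simple dictionary of blocks; resynchronize is a hypothetical helper, and real products may track changed blocks instead of comparing everything.

```python
def resynchronize(survivor, returning):
    """Bring the returning device back in step with the survivor.

    Both arguments are dicts of block number -> bytes. Returns the sorted
    list of block numbers that had to be copied.
    """
    copied = []
    for block_no, data in survivor.items():
        if returning.get(block_no) != data:
            returning[block_no] = data       # stale or missing: copy it over
            copied.append(block_no)
    for block_no in list(returning):
        if block_no not in survivor:
            del returning[block_no]          # deleted while the device was offline
    return sorted(copied)


# The survivor kept taking writes while the other device was out of service.
survivor = {0: b"alpha", 1: b"bravo v2", 2: b"charlie"}
returning = {0: b"alpha", 1: b"bravo v1", 9: b"stale"}
copied_blocks = resynchronize(survivor, returning)
```

Only blocks 1 and 2 are copied here; block 0 was already identical, which is why comparing before copying keeps the resynchronization traffic down.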

Costly Overhead?

All fault-tolerant products have drawbacks, the most obvious of which are the costs of the redundant hardware and of the software to run it. Another factor is the overhead from the executable code that controls the mirroring or duplexing process.

Monetary costs are easy to figure out. You'll pay for two of the most expensive components on a network (servers or hard disks or, in duplexing, disk controllers), not just one.

Determining whether such hardware and software costs are justifiable in any particular installation is a much more complex matter. This risk analysis process involves evaluating the loss potential, determining the equipment necessary to reduce the risk, and forming effective management procedures. (Numerous network vendors can provide worksheets and formulas that help in determining whether a particular installation would benefit from the installation of fault-tolerant products.)
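The core arithmetic behind such a worksheet can be sketched as expected loss avoided minus the cost of the redundancy. The Python example below is a simplified illustration, not a vendor formula: the function names are hypothetical, and every input except the $50,000-per-hour downtime figure cited earlier is an invented number.

```python
def expected_annual_loss(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly cost of downtime, in dollars."""
    return outages_per_year * hours_per_outage * cost_per_hour


def net_annual_benefit(loss_without, loss_with, annual_cost_of_redundancy):
    """Loss avoided by fault tolerance, minus what the redundancy costs."""
    return (loss_without - loss_with) - annual_cost_of_redundancy


COST_PER_HOUR = 50_000   # downtime figure cited in this tutorial

# Assumed inputs for illustration only.
loss_without = expected_annual_loss(4, 2, COST_PER_HOUR)  # four 2-hour outages
loss_with = expected_annual_loss(1, 2, COST_PER_HOUR)     # one outage still gets through
redundancy_cost = 60_000                                  # extra hardware and software per year

benefit = net_annual_benefit(loss_without, loss_with, redundancy_cost)
```

With these assumed figures the benefit is positive, so the redundancy would pay for itself; a site with cheaper downtime or rarer outages could easily get a negative result.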

Software overhead is a key issue in server and disk mirroring. Not only must the fault-tolerant product provide the code to handle reads and writes to and from redundant disks or servers, it must also be capable of determining when one of the disks or servers has failed, then putting the secondary component in charge. This, of course, can consume quite a bit of RAM and CPU time on the server and generate extra traffic on the wire.

In the final analysis, however, many network managers believe that the security that fault-tolerant products provide is well worth the costs.

This tutorial, number 18, was originally published in the January 1990 issue of LAN Magazine/Network Magazine.

 


Network Tutorial
Lan Tutorial With Glossary of Terms: A Complete Introduction to Local Area Networks (Lan Networking Library)
ISBN: 0879303794
Year: 2003
Pages: 193
