Building Fault-Tolerant Systems | Microsoft Windows Server 2003 Unleashed (R2 Edition)

Building fault-tolerant computing systems consists of carefully planning and configuring server hardware and software, network devices, and power sources. Purchasing quality server and network hardware is a good start to building a fault-tolerant system, but the proper configuration of this hardware is equally important. Also, providing this equipment with stable line power that is backed up by a battery or generator adds fault tolerance to the network. Last but not least, proper tuning of server operating systems helps enhance availability of network services such as file shares, print servers, network applications, and authentication servers.

Using Uninterruptible Power Supplies

Connecting server and network devices to line power through uninterruptible power supplies (UPSs) provides two benefits: the UPS conditions incoming power by removing voltage spikes and holding line voltage steady, and it supplies battery backup power. When line power fails, the UPS switches to battery mode, which should provide ample time to shut down the server or network device without risk of damaging hardware or corrupting data. UPS manufacturers commonly provide software that can send network notifications, run scripts, or even gracefully shut down servers when power thresholds are met. In addition, most computer and network hardware manufacturers offer device configurations with redundant power supplies, designed to keep the system powered up if a single power supply fails.
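The threshold behavior described above can be sketched in a few lines. This is a minimal illustration only, not any vendor's monitoring software: the `UpsStatus` fields, the `choose_action` function, and the five-minute default threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Snapshot of UPS state; field names are hypothetical, not a vendor API."""
    on_battery: bool
    runtime_minutes: float  # estimated battery runtime remaining


def choose_action(status: UpsStatus, shutdown_below: float = 5.0) -> str:
    """Return 'none', 'notify', or 'shutdown' for the given UPS status."""
    if not status.on_battery:
        return "none"        # line power is present; nothing to do
    if status.runtime_minutes <= shutdown_below:
        return "shutdown"    # too little runtime left; shut down gracefully
    return "notify"          # on battery with time to spare; warn administrators


if __name__ == "__main__":
    samples = (UpsStatus(False, 30.0), UpsStatus(True, 12.0), UpsStatus(True, 3.0))
    for sample in samples:
        print(sample, "->", choose_action(sample))
```

In practice the threshold would be set from the time a server actually needs to shut down cleanly, with some margin added.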

During power outages, many system administrators discover which critical devices are not connected to a UPS, and the race begins to shut down non-critical devices and shift their power to the critical ones. To avoid these situations, administrators need to inspect critical hardware in server rooms and network closets regularly to ensure that all necessary servers, network routers, switches, hubs, and firewalls are backed by battery power. Network devices matter as much as servers here: when power to a server fails and the battery provides only a few minutes, users can save data and close connections, reducing the chance of data corruption, only if the network remains available.
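A regular inspection of this kind can be backed by a simple inventory check. The sketch below assumes a hand-maintained device list; the `Device` fields and the device names are hypothetical and exist only to illustrate the idea.

```python
from typing import NamedTuple


class Device(NamedTuple):
    name: str
    critical: bool  # must stay up during an outage
    on_ups: bool    # currently backed by battery power


def unprotected_critical(devices):
    """Return names of critical devices that are not backed by a UPS."""
    return [d.name for d in devices if d.critical and not d.on_ups]


# Hypothetical inventory for illustration only.
inventory = [
    Device("file-server-01", critical=True, on_ups=True),
    Device("core-switch-01", critical=True, on_ups=False),
    Device("firewall-01", critical=True, on_ups=True),
    Device("lab-printer", critical=False, on_ups=False),
]

if __name__ == "__main__":
    for name in unprotected_critical(inventory):
        print("Not on UPS:", name)
```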

Choosing Networking Hardware for Fault Tolerance

Network design can also incorporate fault tolerance by creating redundant network routes and by utilizing technologies that can group devices together for the purposes of load balancing and device failover. Load balancing is the process of spreading requests across multiple devices to keep individual device load at an acceptable level. Failover is the process of moving services offered on one device to another upon device failure, to maintain availability.
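The two definitions above can be illustrated together. The following is a minimal sketch, assuming a simple round-robin policy and an externally supplied health signal; the `Pool` class and its method names are hypothetical and do not correspond to any Windows Server 2003 or vendor API.

```python
from itertools import cycle


class Pool:
    """Round-robin pool that skips devices marked as failed."""

    def __init__(self, devices):
        self.devices = list(devices)
        self.healthy = set(self.devices)
        self._ring = cycle(self.devices)

    def mark_failed(self, device):
        """Take a device out of rotation (failover away from it)."""
        self.healthy.discard(device)

    def mark_recovered(self, device):
        """Return a known device to rotation."""
        if device in self.devices:
            self.healthy.add(device)

    def next_device(self):
        """Return the next healthy device, or raise if none remain."""
        # One full pass over the ring is enough to find any healthy device.
        for _ in range(len(self.devices)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy devices available")


if __name__ == "__main__":
    pool = Pool(["web1", "web2", "web3"])
    print([pool.next_device() for _ in range(4)])  # load spread across all three
    pool.mark_failed("web2")
    print([pool.next_device() for _ in range(3)])  # web2 is skipped
```

Requests rotate across every device while all are healthy (load balancing), and a failed device is simply skipped so the remaining devices absorb its share (failover).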

Networking hardware such as Ethernet switches, routers, and network cards can be configured to provide fault-tolerant services through load-balancing applications or through features within the network device firmware or operating system. Refer to the manufacturer's documentation to research fault-tolerant configurations available in your organization's network devices.

For more robust redundant network card configurations, third-party hardware vendors have created network card teaming and network card fault-tolerant software applications. These technologies allow client/server communication to fail over from one network interface card (NIC) to another in the event of an NIC failure. Also, they can be configured to balance network requests across all the NICs in one server simultaneously. Refer to the particular hardware manufacturer's documentation to find out whether a compatible teaming application is available for your network card.

Note

Windows Server 2003 Network Load Balancing (NLB) does not allow multiple NICs on the same server to participate in the same NLB cluster.

Selecting Server Storage for Redundancy

Server disk storage usually contains user data, operating system files, or both, which makes it a critical server subsystem that should incorporate fault tolerance. There are two ways to create fault-tolerant disk storage for the Windows Server 2003 operating system. The first is creating Redundant Arrays of Inexpensive Disks (RAID) using disk controller configuration utilities, and the second is creating the RAID disks using dynamic disk configuration from within the Windows Server 2003 operating system.

Using two or more disks, different RAID-level arrays can be configured to provide fault tolerance that can withstand disk failures and still provide uninterrupted disk access. Implementing hardware-level RAID configured and stored on the disk controller is preferred over the software-level RAID configurable within Windows Server 2003 Disk Management because the disk management and synchronization processes in hardware-level RAID are offloaded to the RAID controller. With these processes offloaded from the operating system to the RAID controller, the operating system performs better overall.
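To see how an array can lose a disk and still serve data, consider parity-based levels such as RAID 5, which store an XOR parity block alongside the data blocks in each stripe. The sketch below is a simplified illustration of that one idea, not a model of any controller or of Windows Server 2003 Disk Management; the function names and the sample stripe are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recover the one missing block of a stripe from all the others."""
    return xor_blocks(surviving_blocks)


# One stripe spread over three data disks, plus its parity block.
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuilding its block from the rest.
recovered = rebuild_missing([data[0], data[2], parity])

if __name__ == "__main__":
    print("recovered block:", recovered)
```

Because XOR is its own inverse, combining the surviving data blocks with the parity block yields the lost block. Mirroring (RAID 1) reaches the same goal more simply by keeping a full copy of the data on a second disk.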

Another good reason to provide hardware-level RAID is that the configuration of the disks does not depend on the operating system, which gives administrators greater flexibility when it comes to recovering server systems and performing upgrades. Refer to Chapter 22, "Windows Server 2003 Management and Maintenance Practices," for more information on ways to create RAID arrays using Windows Server 2003 Disk Management. Also, refer to the manufacturer's documentation on creating RAID arrays on your RAID disk controller.

Improving Application Reliability

An application's reliability depends heavily on the software code and the hardware it runs on. Administrators can make legacy client/server applications more reliable on Windows Server 2003 by running them in lower application compatibility modes; doing so isolates each application instance in its own memory space. If one instance crashes, the remaining instances and the server itself remain available and unaffected. Reliability for client/server-based applications written for Windows Server 2003 can be improved by deploying these applications on clusters. Windows Server 2003 Enterprise and Datacenter servers provide two different clustering technologies that enhance application reliability by providing server load balancing and failover capabilities.