Lesson 1: Designing a Fault-Tolerant System

When designing a highly available network infrastructure, you must ensure that your Microsoft Windows 2000 Server computers remain available to the network. One way to ensure this availability is to use redundant components within the servers or to provide readily available backup components. Another way to ensure availability is to make certain that the environment in which the servers are located facilitates a highly available operation and that the servers have a supply of uninterruptible power. In this lesson you’ll learn how to design a server configuration that’s fault tolerant and how to ensure a safe environment for those servers so that they remain highly available.

After this lesson, you will be able to

  • Plan a server configuration that uses redundant components to help ensure a highly available Web site
  • Plan a server environment that protects the servers

Estimated lesson time: 30 minutes

Highly Available Configurations

It’s important to be prepared for problems, but you can also take steps to protect against certain failures, such as disk failures, component problems, network problems, and power failures. Hardware and software configurations can reduce the likelihood of problems that result in costly downtime and recovery processes.

Computers running Windows 2000 Server have fault tolerance features built into the operating system. Fault tolerance is the ability of a computer or operating system to respond to a catastrophic event—a power outage or hardware failure, for example—so that no data is lost and that work in progress isn’t corrupted. Although the term is often applied to disk subsystems, it also can refer to any piece of hardware that uses redundancy to ensure the system’s availability. In a fully fault-tolerant system, every major component is made redundant. For example, such a system includes redundant disk controllers, power supplies, uninterruptible power supplies (UPSs), disk subsystems, and other redundant components. In such a system, every single point of failure is eliminated.

To ensure that your servers are configured to support high availability, you should use redundant and backup components and provide an environment that protects the servers.
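The effect of redundancy on availability can be estimated with simple arithmetic. The following Python sketch assumes that components fail independently of one another, which real hardware only approximates; the function names and the availability figures are hypothetical and chosen only for illustration.

```python
from math import prod
from typing import Iterable


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of redundant copies of one component.

    Assumes independent failures: the group is down only when every
    copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


# Hypothetical figures, not measurements of any real hardware.
single_psu = 0.99
dual_psu = parallel_availability(single_psu, copies=2)
# A server that needs the power supply pair, a NIC, and a disk controller.
server = series_availability([dual_psu, 0.999, 0.995])
print(f"single PSU {single_psu:.4f}, dual PSU {dual_psu:.4f}, server {server:.4f}")
```

Under these assumptions, duplicating a component shrinks its contribution to downtime sharply, while any component left without a spare continues to limit the availability of the whole server.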

Redundant Components

You can help to ensure a highly available network by using redundant components within your Windows 2000 Server computers or by having backup components on hand. This section provides information about many of the server components that you can make fault-tolerant.

NICs

A network interface card (NIC) is an adapter card that plugs into a computer’s system bus and allows the computer to send and receive signals on a network. The NIC in most servers is a single point of failure. Fortunately, the NIC is typically very reliable and failures are rare; still, they do occur. You can install multiple NICs in a single server or configure a single NIC with multiple Internet Protocol (IP) addresses. This process is known as multihoming a computer. On a Transmission Control Protocol/Internet Protocol (TCP/IP) network, a multihomed computer has a separate IP address assigned to each of its interfaces. Multihoming is typically used to increase bandwidth by providing connectivity to several networks simultaneously.

Installing multiple NICs can also enhance the reliability of critical network servers. If you make a NIC fault tolerant by installing a redundant card, the redundant NIC, which is inactive, shares the device driver with the active NIC. If the device driver detects an unrecoverable error, the driver uses the redundant NIC without interrupting service.

In some cases you can configure each NIC into a subnet separate from the other NICs on the computer. Configuring multiple NICs into separate subnets can enhance both performance and availability. First, performance is improved by shortening the network routes between clients and servers. Second, availability is increased because clients may be able to find alternate routes to critical network services in the event of a failed network adapter.

You should take care to configure multiple NICs into separate subnets properly because some network services don’t operate as expected on multihomed hosts.
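One check you can automate is whether the addresses planned for each NIC really fall into separate subnets. The following sketch uses Python’s standard ipaddress module; the function name, the NIC names, and the addresses are hypothetical.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def overlapping_subnets(interfaces: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of NIC names whose configured networks overlap.

    `interfaces` maps a NIC name to an address in CIDR form,
    for example "192.168.1.10/24".
    """
    networks = {
        name: ipaddress.ip_interface(cidr).network
        for name, cidr in interfaces.items()
    }
    return [
        (first, second)
        for first, second in combinations(networks, 2)
        if networks[first].overlaps(networks[second])
    ]


# Hypothetical plan for a multihomed server.
plan = {"nic_a": "192.168.1.10/24", "nic_b": "192.168.2.10/24"}
conflicts = overlapping_subnets(plan)
print("separate subnets" if not conflicts else f"overlap: {conflicts}")
```

An empty result means that no two NICs share a subnet. It doesn’t show that every network service behaves correctly on the multihomed host, so you still need to test those services.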

By installing multiple NICs on separate network segments, administrators can significantly reduce downtime due to network outages of any single segment. For maximum reliability, you should configure the secondary NIC as a backup in case of failures on the primary NIC. In this case the secondary interface is kept in a hot standby mode. If the primary adapter fails, the standby adapter takes over.
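The hot standby behavior described above can be sketched in a few lines of Python. This is a simplified model, not the logic of any particular NIC driver: the class names are hypothetical, adapter health is set by hand instead of being detected, and the sketch deliberately doesn’t fail back to the primary adapter when it recovers.

```python
from dataclasses import dataclass


@dataclass
class Adapter:
    name: str
    healthy: bool = True


class FailoverPair:
    """The primary adapter carries traffic; the standby takes over on failure."""

    def __init__(self, primary: Adapter, standby: Adapter) -> None:
        self.primary = primary
        self.standby = standby
        self.active = primary

    def check(self) -> Adapter:
        """Re-evaluate health and return the adapter that should carry traffic."""
        if not self.active.healthy:
            other = self.standby if self.active is self.primary else self.primary
            if not other.healthy:
                raise RuntimeError("no healthy adapter available")
            self.active = other  # switch to the remaining healthy adapter
        return self.active


pair = FailoverPair(Adapter("nic_a"), Adapter("nic_b"))
pair.primary.healthy = False      # simulate a failure of the primary NIC
print(pair.check().name)          # the standby adapter is now active
```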

In a clustered configuration, you can configure two network segments—one for normal network traffic and a second dedicated to the heartbeat signal used by cluster members to monitor the cluster’s health.

Although this configuration works, other methods are more effective. For example, you can use a dedicated virtual local area network (VLAN) that supports heartbeat connections between cluster nodes. You can also use redundant direct connections (using cross-over cables) for the heartbeat.
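To show what monitoring a heartbeat involves, the following Python sketch tracks when each cluster node was last heard from and reports nodes that have been silent for too long. It is an illustration only, not the mechanism that the Windows 2000 Cluster service uses; the class name, the node names, and the five-second timeout are hypothetical.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat from each node and flags silent nodes."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_seconds = timeout_seconds
        self._clock = clock              # injectable so the logic can be tested
        self._last_seen: Dict[str, float] = {}

    def record(self, node: str) -> None:
        """Call whenever a heartbeat arrives from `node`."""
        self._last_seen[node] = self._clock()

    def failed_nodes(self) -> List[str]:
        """Nodes whose most recent heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record("node1")
monitor.record("node2")
print(monitor.failed_nodes())     # both nodes were just heard from
```

Because a missed heartbeat looks the same whether the node or the heartbeat network has failed, a dedicated or redundant heartbeat path reduces the chance of declaring a healthy node dead.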

When installing two NICs, make sure that you’ve run cabling from two separate hubs or switches. Color-code or otherwise mark the network cables to identify network A and network B so that a cable isn’t plugged into the wrong NIC. In addition, always use fixed IP addresses on these servers rather than Dynamic Host Configuration Protocol (DHCP), so that a failure of the DHCP server can’t cause an outage. Fixed addresses can also improve address resolution by Domain Name System (DNS) servers that don’t handle the dynamic address assignment that DHCP provides.

Motherboard and CPU

Motherboards consist of electronics that can and do fail, although the motherboard and the central processing unit (CPU) are generally reliable computer components. You can’t do very much to prevent a motherboard failure or CPU fault, except to run regular system checks to ensure that the components are functioning correctly. Some systems include built-in diagnostic tools that operate with Windows 2000.

Memory

The three major types of random access memory (RAM) that deal with error detection and correction are nonparity RAM, parity RAM, and error-correction coding (ECC) RAM.

  • Nonparity RAM. If you use nonparity RAM, Windows 2000 has no way to detect memory problems and your computer might crash randomly. Nonparity RAM costs less than parity RAM, and parity RAM isn’t available for all computers. If you don’t have parity RAM in your computers, ask your vendor if it can be installed or if your computer supports it.
  • Parity RAM. Parity RAM contains an extra bit that indicates whether a byte in the RAM is faulty. When parity RAM detects a parity difference, it signals the CPU through a nonmaskable interrupt (NMI). Depending on where and when detection happens, Windows 2000 determines if this is an input/output (I/O) board parity error, memory bus error, or some other kind of parity error. Windows 2000 can also report I/O channel parity errors from cards in slots. In these cases an error message is generated, and sometimes the computer stops. Parity memory isn’t fault tolerant because even if it does detect a memory error, the memory can’t correct it. Thus, the server halts operation abruptly.
  • ECC RAM. High-end systems often use ECC RAM, which can detect a two-bit failure and correct a single-bit failure in the system memory. With ECC RAM, Windows 2000 can continue to run in spite of a single-bit failure. Depending on the hardware design, there might or might not be a report of this corrective action. For maximum memory fault tolerance, choose advanced ECC memory whenever possible.

Be aware that even with ECC, memory chips do fail. Try to keep enough memory on hand to replace a computer’s entire memory. If memory check errors become frequent, or the machine won’t boot, replace all of the memory chips rather than spend time trying to figure out which one is bad. You can find the faulty chip at your leisure.
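The difference between parity and ECC protection can be illustrated in Python. A single parity bit reveals that one bit has changed but not which one, whereas an error-correcting code stores enough check bits to locate and repair a single flipped bit and to detect two flipped bits. The sketch below uses an extended Hamming code over a 4-bit word purely to keep the example short; it is an assumption for illustration, not a description of how any specific memory module is built, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 when the count of 1 bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def ecc_encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds an overall parity bit. Indexes 1-7 are Hamming
    positions: check bits at 1, 2, and 4; data bits at 3, 5, 6, and 7.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2          # makes the total number of 1 bits even
    return code


def ecc_decode(codeword: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); status is "ok", "corrected", or "uncorrectable".

    Errors of three or more bits are beyond what this code can
    reliably detect and may be reported incorrectly.
    """
    code = list(codeword)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position         # XOR of the positions holding a 1
    overall_odd = sum(code) % 2 == 1
    if overall_odd:
        # Exactly one bit flipped; syndrome 0 means it was the parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        return None, "uncorrectable"     # two bits flipped: detected, not fixed
    else:
        status = "ok"
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, status


word = ecc_encode(0b1011)
word[5] ^= 1                             # simulate a single-bit memory failure
print(ecc_decode(word))
```

With parity alone, the only safe response to a mismatch is to stop, which matches the abrupt halt described for parity RAM. With the error-correcting code, a single-bit failure is repaired and processing can continue.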

Cooling

Cooling is one of the most overlooked elements of a server. Should the cooling fan fail, the processors, hard drives, or controller cards might overheat and fail. If the computer feels extremely warm when you open the chassis, a cooling fan might have failed. Most servers have two or more fans to protect against overheating in this circumstance. Some servers also have thermal sensors to detect abnormal temperatures.

Power Supplies

Although power supplies are very reliable, they do fail. Most midrange to high-end servers offer the option of multiple power supplies. If one of the power supplies should fail, another continues to provide power. When using dual power supplies, you should use two separate power feeds. Using two power feeds protects against a circuit breaker tripping or some other accidental event. Also, don’t forget about external cabinets for RAID arrays or modem banks. If they have power supplies, check whether dual supplies are available.

Disk Controllers

The disk controller, like other components, can be a single point of failure. If the controller fails, the data stored on the hard disks is not accessible until that controller is replaced, whether or not the disks themselves have been made fault-tolerant. Redundant controllers provide a level of fault tolerance that eliminates the single point of failure that exists when a system is configured with only one controller.

Storage

Storage strategies are based on the type and quantity of information that must be stored and the cost of equipment. If a particular computer isn’t used to store data, the storage solution can be simple and inexpensive. However, if the computer will be used to store large amounts of data and to perform frequent database reads and writes, the storage strategy is more complicated. Consider the cost of any storage components when developing a strategy for storing the data that your organization needs. It doesn’t make sense to spend more on the storage system than you would expect to lose in time and data if the storage failed. Storage strategies are discussed in more detail in Lesson 2, "Designing Data Storage for High Availability."
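The comparison between what protection costs and what a failure is expected to cost can be written as a short calculation. The following Python sketch is a simplified model under stated assumptions: failures are treated as an average yearly rate, every failure is assumed to cost the same, and all of the figures and function names are hypothetical.

```python
def expected_annual_loss(failures_per_year: float,
                         hours_down_per_failure: float,
                         cost_per_hour_down: float,
                         data_loss_cost_per_failure: float = 0.0) -> float:
    """Average yearly cost of failures: lost time plus lost data."""
    per_failure = (hours_down_per_failure * cost_per_hour_down
                   + data_loss_cost_per_failure)
    return failures_per_year * per_failure


def protection_pays_off(annual_cost_of_protection: float,
                        loss_without_protection: float,
                        loss_with_protection: float) -> bool:
    """True when the protection costs less per year than the loss it avoids."""
    avoided = loss_without_protection - loss_with_protection
    return annual_cost_of_protection < avoided


# Hypothetical figures for a single server.
unprotected = expected_annual_loss(0.5, 8, 1000, data_loss_cost_per_failure=2000)
protected = expected_annual_loss(0.5, 1, 1000)
print(unprotected, protected, protection_pays_off(3000, unprotected, protected))
```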

Environmental Concerns

When addressing environmental concerns, you must ensure that the network components, particularly servers, are protected from extremes in temperature and humidity. You must also keep them clean. In addition, you should provide a UPS and set up cables in a way that follows specific guidelines.

Temperature, Humidity, and Cleanliness

Computers perform best at a temperature of approximately 70°F (21°C). A long rack of computers can generate a huge amount of heat, which would raise the surrounding air temperature to unacceptable levels. For this reason, almost all computer rooms have some form of cooling or air conditioning. When adding servers to a computer room, be careful to make sure you won’t exceed the room’s cooling capacity. If the temperature greatly exceeds 70°F (21°C), you could have problems.
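To check whether additional servers fit within a room’s cooling capacity, you can estimate the heat load from the power the servers draw. The following Python sketch assumes that nearly all of the electrical power a server draws ends up as heat and uses the standard conversion of 1 watt to approximately 3.412 BTU per hour. The wattages, the capacity figure, the 20 percent headroom, and the function names are hypothetical.

```python
from typing import Iterable

BTU_PER_HOUR_PER_WATT = 3.412142  # standard conversion factor


def heat_load_btu_per_hour(server_watts: Iterable[float]) -> float:
    """Estimated heat output, assuming all power drawn becomes heat."""
    return sum(server_watts) * BTU_PER_HOUR_PER_WATT


def fits_cooling_capacity(server_watts: Iterable[float],
                          capacity_btu_per_hour: float,
                          headroom: float = 0.2) -> bool:
    """True if the load stays within capacity after reserving `headroom`."""
    usable = capacity_btu_per_hour * (1.0 - headroom)
    return heat_load_btu_per_hour(server_watts) <= usable


# Hypothetical room: ten 500-watt servers and a 24,000 BTU/hour cooling unit.
current = [500.0] * 10
print(fits_cooling_capacity(current, 24000))                   # existing load
print(fits_cooling_capacity(current + [500.0, 500.0], 24000))  # two more servers
```

Other heat sources in the room, such as lighting, people, and UPS equipment, also count against the cooling capacity and aren’t included in this estimate.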

Humidity is an important factor as well when it’s high enough to create condensation or low enough to create static. Obviously, water condensing in a computer could cause that computer to fail. In addition, you don’t want mold forming on the computers, because it could affect cooling or cause a short circuit. Air that is too dry also can present a problem, because people near the computer can build up a static charge in those conditions. A good static jolt can damage internal components or cause the computer to restart.

Cleanliness is very important for computers; dust and dirt can cause shorts and, in extreme conditions, fires. For computer-room computers, whenever the case is opened for any reason, a quick check should be made to determine if the unit needs cleaning. If it does, you should check all the units in the area.

You should check computers in office areas quarterly, or more often if they’re in a dirty area. For computers on a plant floor or in other hazardous areas, an enclosure with air filtration and climate control is a necessity. You should clean the air filters on the cabinet according to the manufacturer’s recommendation. At the same time, you should check and clean the computer and its cabinet if you need to.

Power

A computer won’t run without power, and power grids aren’t always reliable. Consequently, backup power may be necessary, at least to allow the computer to shut down in a controlled manner. Outages occur at two scopes: the first is a building or computer-room failure and the second is a regional outage.

In building power failures, particularly in a corporate computer center, it may be necessary to continue providing service to other buildings in the area or to areas geographically remote from the computer center. In this instance short outages can be survived by using UPS units. Standby generators can handle longer outages.

Most UPS devices are one of the following types:

  • Online UPS. An online UPS is connected between the main power source and the computer. The main power continuously charges the batteries, and the batteries supply the power to the computer. This method provides power conditioning, which means that it removes spikes, surges, sags, and noise.
  • Standby UPS. A standby UPS can provide either the main power or its own power and switches from one to the other as necessary. When main power is available, the UPS device connects the main power directly to the computer and monitors the main power voltage level. When the main power fails or the voltage falls below an acceptable level, the UPS device switches to its own power.

There are two strategies for using UPS: implementing one large unit or implementing many small units. Using a large UPS system to cover the entire computer room has the advantage of being easier to maintain and monitor. However, a single large unit is itself a single point of failure: if it doesn’t work, every computer that depends on it loses backup power. You should test the system regularly, preferably on weekends or holidays.

The other strategy is for every computer to have its own UPS. This tends to be more practical for computers outside of a computer room. The advantage of these systems is that they can interface directly with the computer to signal a shutdown warning when the battery power drops to a set point. The other advantage is that a breaker trip or some other isolated power outage won’t shut down the computer. The disadvantage of these units is that maintenance is more involved because of the number of units, both from the point of view of record keeping and the physical replacement of batteries and testing of units.
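The shutdown warning that a per-computer UPS can signal comes down to a simple decision based on the power source and the remaining battery charge. The following Python sketch shows that decision only. It doesn’t talk to real UPS hardware, and the names and the 20 percent set point are hypothetical.

```python
from enum import Enum


class Action(Enum):
    RUN = "run normally"
    WARN = "on battery, warn users"
    SHUT_DOWN = "begin controlled shutdown"


def ups_action(on_main_power: bool, battery_percent: float,
               shutdown_at_percent: float = 20.0) -> Action:
    """Decide what the computer should do given the UPS state."""
    if on_main_power:
        return Action.RUN
    if battery_percent <= shutdown_at_percent:
        # The set point must leave enough charge to finish shutting down.
        return Action.SHUT_DOWN
    return Action.WARN


for state in [(True, 100.0), (False, 80.0), (False, 15.0)]:
    print(state, ups_action(*state).value)
```

The set point is a trade-off: a higher value shuts the computer down sooner than necessary during short outages, whereas a lower value risks exhausting the battery before the shutdown completes.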

When an area experiences a regional power outage, the UPS and generators may work fine but telecommunication links may fail. A regional failure can be very expensive if your company has distributed locations in the same region or your business is actively involved with e-commerce or the Internet. The best alternative in this case is to have another facility in a geographically separate location. This facility should duplicate as many server resources as is practical.

Cables

You need to consider what can fail in the connections between computers as well as within an individual computer. When working with cables, you should adhere to the following guidelines:

  • Keep cables neat and orderly, using either a cable management system or tie wraps. Cables should never be loose in a cabinet, because loose cables can be disconnected accidentally.
  • Use strain relief whenever possible to secure the cables to a fixed point near the computer, particularly with pull-out rack-mounted equipment. This way a tug on the cable won’t pull the cable out of its socket.
  • Make sure all cables are securely attached at both ends wherever possible.
  • If you use multiple sources of power or network communications, try to route the cables feeding the cabinets from different points. This way, if one is severed, the other will likely still be functional.
  • Label all cables at both ends if possible. Color-coding with tape or labels helps as well.
  • Make sure rack-mounted pull-out equipment has enough slack in the cables and that the cables won’t bind or be pinched or scraped.
  • Don’t plug dual power supplies into the same power strip or connect them to the same power source.
  • Make sure that cables can’t be accidentally snagged by someone walking by or by a passing cart. All cables should be inside the cabinet.

Making a Decision

The level of fault tolerance that you can implement in your system often depends on the associated costs of implementing fault-tolerant components and on how much downtime can cost your business. Table 3.1 provides an overview of the issues involved in implementing fault tolerance into your server configurations.

Table 3.1 Implementing Fault Tolerance

Strategy | Description
--- | ---
Redundant NICs | Multiple NICs provide some fault tolerance for your network connections. However, if the NICs are on the same subnet, clients might not be able to find critical services if the primary NIC fails or the network path fails. Installing NICs on separate subnets allows clients to find alternative routes to the services.
Motherboard/CPU maintenance | In some systems, there is little you can do to avoid a motherboard failure or CPU fault, except to run regular system checks to ensure that the components are functioning correctly. However, some companies, such as Stratus and Marathon Technologies, provide fully fault-tolerant computer systems that include redundant system I/O boards and CPUs.
Fault-tolerant memory | Windows 2000 can’t detect memory problems with nonparity RAM. Windows 2000 can detect problems with parity RAM but can’t correct the error. With ECC RAM, Windows 2000 can detect and correct a one-bit failure and detect a two-bit failure.
Backup memory | Memory can fail. Enough backup memory should be kept on hand to replace all the memory in a computer should any of the memory fail.
Redundant cooling fans, power supplies, disk controllers | Each server should have two or more fans to protect against the failure of a fan and at least two power supplies, each one using a power feed separate from the other. In addition, you can use multiple disk controllers to remove that point of failure.
Fault-tolerant storage | You should make storage fault-tolerant. Storage strategies are discussed in more detail in Lesson 2, "Designing Data Storage for High Availability."
Proper environment | You should maintain temperature, humidity, and cleanliness to ensure that servers don’t risk failure.
Redundant power | UPS can be implemented on a large scale for a group of computers or implemented for individual computers. A large UPS is easier to maintain but is a major problem if it fails at a critical moment. Individual UPS systems require more maintenance but provide greater reliability. To protect against regional power failures, you should set up alternate sites in separate geographical areas.
Proper cable maintenance | You should properly maintain cables and make connections secure.

Recommendations

When configuring the server, use multiple NICs installed in separate subnets. In addition, use ECC RAM and keep enough memory on hand to replace all the memory in the computer. Each server should also have redundant cooling fans, power supplies, and disk controllers that support a fault-tolerant storage subsystem.

In addition to configuring your servers with redundant components, pay careful attention to the environment in which the servers are located. Maintain temperature, humidity, and cleanliness to ensure that servers don’t risk failure, and maintain cables according to recommended guidelines. Connect the servers to a UPS. If possible, each server should have its own UPS. To prepare for regional power failures, your network should, ideally, have at least one other facility in another geographical region.

Example: A Fault-Tolerant System

Organizations must determine how much money they want to invest in each server to provide fault tolerance. Figure 3.1 shows one way that you can configure a Windows 2000 Server computer.

Figure 3.1 - A Windows 2000 Server computer configured for fault tolerance

Lesson Summary

You can help to ensure a highly available network by using redundant components within your Windows 2000 Server computers or by having backup components on hand. In addition, you should use regular system checks to ensure that the components are functioning correctly. Installing multiple NICs in a computer can enhance the reliability of critical network servers. The three major types of RAM that deal with error detection and correction are nonparity RAM, parity RAM, and ECC RAM, which provides the greatest fault tolerance. You should keep enough backup ECC memory on hand to replace all the computer’s memory. Each server should also include redundant cooling fans, power supplies, and disk controllers. When addressing environmental concerns, you must ensure that the network components are protected from extremes in temperature and humidity and that the server environment is kept clean. You should set up cables in a way that follows specific guidelines and provide a UPS.



Microsoft Corporation - MCSE Training Kit. Designing Highly Available Web Solutions with Microsoft Windows 2000 Server Technologies
MCSE Training Kit (Exam 70-226): Designing Highly Available Web Solutions with Microsoft Windows 2000 Server Technologies (MCSE Training Kits)
ISBN: 0735614253
Year: 2001
Pages: 103
