6.4 Component-level availability

At the lowest level of an internetwork, we have the individual network devices and discrete systems (i.e., servers, routers, modems, workstations, network printers, etc.). The decision of whether or not to provide backup systems and procedures should these devices fail is largely a matter of assessing risk and relative value. The systems most often targeted for attention are business- or mission-critical servers, key data repositories, wide area routers, and perimeter firewalls—that is, systems where even a small break in service could equate to significant financial or productivity losses. On carrier networks it is expected that systems will operate nonstop at key switching centers and PoPs. On corporate networks there may be similar requirements for systems providing e-commerce applications or intranet service. Some of the key techniques used to improve availability are summarized as follows:

Passive backplanes and hot swap line cards
Server mirroring, clustering, distributed processing
Fault-tolerant power, backup DC power, uninterruptible power
Backup media, disk mirroring, RAID
Complete or partial hot or cold standby systems and subcomponents
Automated and alternate data and boot image sources (Disk, BOOTP, DHCP, etc.)

Standard off-the-shelf commercial products may include limited resilience features; however, at the top end of the market there is a subset of commercially available systems referred to as carrier class (a term often misused in product literature and typically requiring at least 99.999 percent availability). In essence the design of a carrier class system should eliminate all single points of failure, including loss of power, loss of a single interface card, loss of a processor card, loss of configuration data, and loss of recorded data. It is usually mandatory that such a system be capable of running 24/7. This dictates that it should be possible to make software changes without taking the system down. It is also important that a high level of system integration is provided, together with sophisticated system and environmental management (for proactive signaling of predicted failures in power supplies, fan systems, etc.). True carrier class systems tend to be very expensive, and they may make stringent demands on the environmental conditions provided in the equipment rooms (air quality, temperature control, humidity, etc.).
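An availability percentage is easier to judge once it is converted into permitted downtime. The following is a minimal sketch of that arithmetic; the function name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years (assumption)


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = annual_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is one reason the cost of a system rises so steeply at the top end of the market.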

6.4.1 Backplane design

The internal backplane design of higher-end systems may offer several features that assist fault tolerance, including the following:

Passive backplane design—So-called passive rather than active backplanes are preferred. Passive backplanes do not require power to operate and so are less prone to failure.
Multipath backplane design—A single large bus represents a single point of failure. Multiple buses or, better still, multiple switching fabrics, offer multiple paths so that at least partial work can continue in the event of bus failure.
Backplane connectors—Using female backplane connectors is preferred (i.e., no pins on the backplane). If the backplane has male connectors, then a connector failure (e.g., a bent or broken pin) affects the whole backplane, which may need to be replaced. If male connectors are used on line cards then a connector failure affects only that line card.
Midplane design—Some backplanes are positioned in the middle of the chassis with connectors at either side, enabling interface cards to be swapped or upgraded without affecting the processing function. With some architectures this may also improve cost efficiency, since cheaper interface cards may operate with a corresponding processor card.

If the product is based on standard system and bus components (e.g., a generic PC chassis running BSD UNIX or Linux), there may be serious limitations in performance and fault-tolerance features. For example, the standard PCI architecture does not support hot swap, so several vendors have moved to the CompactPCI (cPCI) architecture.

6.4.2 Hot swap

Hot swap is a feature often provided by high-end routing and firewall platforms; it is mandatory in carrier class systems such as high-end backbone routers. In essence, hot swap is the ability to replace or install new cards or hardware without taking the system down and rebooting. Clearly, any users serviced by the card being replaced will be affected, but the point is that other users will not. On a large routing concentrator or firewall this is important. A related feature often grouped with hot swap is the ability to perform some configuration or updating online (e.g., a firewall or IDS may update virus signatures online and have them activated without requiring a reboot). A more ambitious target is the capability to patch the operating system or install a completely new operating system online (very hard to achieve in practice). Enterprise and backbone routers generally offer hot swap features, though the implementation and scope are entirely vendor dependent and may be constrained by the internal bus architecture. In practice hot swap architectures are more expensive to implement and require more sophisticated components (e.g., early-earth detection and more sophisticated arbitration logic to ensure that new cards do not interfere with existing systems running on shared buses during bootstrapping or during bus data transfers).

6.4.3 Disk mirroring

Data can be protected by recording them to multiple storage devices simultaneously; the most widely used techniques are disk mirroring and RAID. Disk mirroring is a common technique used for mission-critical devices such as application servers or firewalls. It is often used in online database systems and for the storage of mission-critical event, policy, or configuration data (such as firewall accounting data), where it is critical that these data be accessible and recoverable at all times. The concept is quite simple: a backup disk drive is dynamically synchronized with the primary drive so that all write operations are performed on both drives simultaneously. Typically there is a simple synchronization protocol and/or some watchdog hardware to ensure that the backup drive contains identical, or near identical, data. The second drive may be located in the same physical device or (for additional resilience) on a mirrored device elsewhere in the network. For true resilience ensure that the drive farm is accessible by multiple disk controllers. For further information on such products, refer to [20, 21].
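The mirroring idea can be illustrated at file level. The sketch below is a toy model only: real mirroring is performed by the disk controller or the operating system's volume manager, and the class name, method names, and file names here are all hypothetical.

```python
import os
import tempfile


class MirroredStore:
    """Toy mirror: every write goes to a primary and a backup file."""

    def __init__(self, primary_path: str, backup_path: str) -> None:
        self.paths = (primary_path, backup_path)

    def write(self, data: bytes) -> None:
        # Append to both copies and force each to stable storage before returning
        for path in self.paths:
            with open(path, "ab") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())

    def in_sync(self) -> bool:
        """Watchdog-style check that both copies hold identical data."""
        contents = []
        for path in self.paths:
            with open(path, "rb") as f:
                contents.append(f.read())
        return contents[0] == contents[1]

    def read(self) -> bytes:
        """Read from the primary, falling back to the backup if that fails."""
        for path in self.paths:
            try:
                with open(path, "rb") as f:
                    return f.read()
            except OSError:
                continue
        raise OSError("both mirror copies are unavailable")


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    store = MirroredStore(os.path.join(workdir, "primary.dat"),
                          os.path.join(workdir, "backup.dat"))
    store.write(b"firewall accounting record\n")
    print("copies identical:", store.in_sync())
```

Because the write returns only after both copies are flushed, a reader can fall back to the backup copy at any time without losing acknowledged data, at the cost of slower writes.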

6.4.4 Redundant Array of Inexpensive Disks (RAID)

Redundant Array of Inexpensive Disks (RAID) is a technology that enables multiple inexpensive PC-style hard disks to be logically clustered so that they appear as a large single hard disk (normally achieved using SCSI). This flexibility enables higher performance and resilient configurations. RAID is available at several levels, designed to suit various applications, as follows:

Level 0—High-speed mode, using a technique called striping, where blocks of data are written to multiple disks concurrently and can be read back into memory concurrently. This significantly reduces overall disk access times. No fault tolerance is provided, since all disks are used to provide aggregate capacity.
Level 1—Offers disk mirroring. Two copies of data are simultaneously written on different drives, so only half of the total disk space is available for data.
Level 2—Similar to Level 1 but with additional bit-level error detection/correction distributed over all drives. Performance can be seriously degraded for small files (it is not suitable for server applications). Only 70 percent of the total disk space is available for data.
Level 3—Similar to Level 2 but with bit-level error detection/correction applied to a single drive called the parity disk. Disk write operations are sequential, seriously degrading performance (again, it is not suitable for server applications). Approximately 85 percent of disk space is available for data.
Level 4—Similar to Level 3 with optimized error detection/correction at the sector level (rather than bit-level), offering higher performance.
Level 5—Similar to Level 4 but with the error detection/correction data distributed over all drives rather than held on a single parity disk. This enables concurrent read and write operations, significantly improving performance. Approximately 85 percent of disk space is available for data. This mode requires more expensive intelligent disk controllers.
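The parity used at the higher RAID levels is conventionally an exclusive-OR (XOR) of the corresponding data blocks, which is what allows the contents of one failed drive to be rebuilt from the survivors. The following is a minimal sketch of that principle; the function names are hypothetical, and a real controller performs this per stripe in hardware or firmware.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]   # one block from each of three data drives
    parity = xor_blocks(data)            # block stored as parity
    recovered = rebuild_missing([data[0], data[2]], parity)  # drive 2 has failed
    print("recovered block matches original:", recovered == data[1])
```

Only one block per stripe can be recovered this way, which is why a second drive failure before the rebuild completes results in data loss.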

For true resilience ensure that the drive farm is accessible by multiple SCSI controllers. It is not unusual to see a server communicating with a fault-tolerant RAID using a single SCSI controller card, which then represents a single point of failure (SPOF).

Peripheral switching

Peripheral switching places an intelligent switch on the SCSI chain between the server and the disk farm (typically a RAID device). This is a logical extension of the matrix switches used for many years on large mainframe sites to move huge numbers of peripherals from one mainframe channel to another. A typical example of peripheral switching could be as follows. Two RAID devices are connected to the switch along with two servers. One server is primary, the other an active backup. Each server owns one of the RAID devices. In the standby server, a background application runs, which periodically polls the primary and performs small disk reads to ensure operation. If the test fails, it waits a configurable period to retry. After a second failure it signals the switch to move the failed server's RAID device to the secondary. The secondary then mounts the volumes and starts the appropriate applications.
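The polling logic run by the standby server in this example can be sketched as follows. This is an illustration under stated assumptions: `probe` and `switch_to_secondary` are hypothetical callables standing in for the small disk read against the primary and for the command sent to the peripheral switch, and the default timings are arbitrary.

```python
import time
from typing import Callable, Optional


def monitor_primary(
    probe: Callable[[], bool],
    switch_to_secondary: Callable[[], None],
    retry_delay: float = 5.0,
    poll_interval: float = 1.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the primary; after two consecutive failures, trigger the switch.

    Returns True if failover was triggered, False if max_polls was reached.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if not probe():
            # First failure: wait the configurable period, then retry once
            sleep(retry_delay)
            if not probe():
                # Second failure: move the failed server's RAID device over
                switch_to_secondary()
                return True
        sleep(poll_interval)
    return False


if __name__ == "__main__":
    answers = iter([True, True, False, False])  # simulated probe results
    triggered = monitor_primary(
        probe=lambda: next(answers),
        switch_to_secondary=lambda: print("signalling switch: move RAID to secondary"),
        sleep=lambda seconds: None,  # skip real delays in this demonstration
    )
    print("failover triggered:", triggered)
```

The retry step guards against switching on a transient fault; the delay should be tuned so that a busy but healthy primary is not mistaken for a failed one.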

6.4.5 Reliable power

Power faults are common in large networks; in fact, some studies report that nearly 50 percent of claims for computer damage relate to power surges. Power faults are difficult to predict; they can be caused by pulled power cords, tripped circuit breakers, and regional or national power failures due to adverse weather conditions (such as lightning strikes or high winds) or brownouts due to network overload. For each network design you need to include the probability and scope of power failure in your risk analysis. Before going much further, some definitions are in order, as follows:

Power surges occur when the voltage exceeds the recommended level for a brief time; this can happen if a heavy load is removed from a circuit quickly. Typically a surge lasts for half the AC period or possibly longer. In the United States (60 Hz) this would be approximately 1/120th of a second; in Europe (50 Hz) this would be 1/100th of a second. Surges are potentially very damaging for electrical components.
Power sags are temporary drops in voltage below the recommended level. A typical example is switching on a power shower at home where you might see the light dim slightly for a few seconds. Typically, power sags will be restricted locally to the same electrical circuit and should not affect modern computer equipment.
Brownouts are power sags that are much longer in duration. They are typically caused when the network is overloaded and the resulting supply becomes diluted. This could be caused by millions of television viewers all switching on coffee-makers during the intermission of a major television event. European equipment is more tolerant of brownouts than U.S. equipment, since it is typically designed to cope with the much lower U.S. voltages (110 V).
Power spikes are very short overvoltages. They can be caused by electrical storms or devices such as fluorescent lights, photocopiers, heaters, and air conditioners. Most power supplies include some measure of protection from everyday spikes, but spikes caused by lightning are potentially very harmful to electrical equipment and may cause severe damage. Spikes of ±100 volts are not uncommon even in major city power grids. It is, therefore, recommended that clean power sources always be used for sensitive computer equipment.
Noise has a variety of sources; examples include RFI, computer equipment, and distant lightning. Typically there will be noise even on a clean electrical supply, but as noise becomes more serious it will have effects similar to those of power surges and spikes.
Blackouts are periods of total power loss. The duration is unpredictable, since there are a number of possible causes, some of them natural phenomena. A typical example would be loss of overhead supply cables to a region due to high winds or heavy snow. The effects of a blackout are unpredictable, since it is effectively the same as switching the power off on your equipment while in the middle of some operation. Having said that, blackouts are typically preceded and followed by sags, surges, and spikes and these may cause real damage.
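The surge durations quoted in the definitions above follow directly from the mains frequency, since half an AC period is 1/(2f) seconds. A minimal sketch of the arithmetic (the function name is an assumption made for illustration):

```python
def half_cycle_seconds(mains_frequency_hz: float) -> float:
    """Return the duration of half an AC period for the given mains frequency."""
    if mains_frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1.0 / (2.0 * mains_frequency_hz)


if __name__ == "__main__":
    for region, hz in (("United States", 60.0), ("Europe", 50.0)):
        ms = half_cycle_seconds(hz) * 1000.0
        print(f"{region} ({hz:.0f} Hz): half cycle lasts about {ms:.2f} ms")
```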

Preventative measures

There are several basic approaches to providing a fault-tolerant power supply, including the following:

Source your main power supply from different providers. This option is currently more widely available in the United States than in Europe. It may be possible to get power from different power grids using the same supplier, but this can be very expensive.
Ask your provider if it can provide surge-protected supplies. This is likely to cost more money.
Install surge suppressors. These devices protect against overvoltage problems such as surges and spikes. This can be an expensive option, and one should also consider that the excess voltage is typically redirected to ground. Since your LAN infrastructure is likely to share the same ground, this is hardly desirable; there are specialized devices you can buy to address this problem.
Install voltage regulators. These are passive devices designed to damp out temporary voltage anomalies. Since they cannot create power, these devices cannot cope with power sags unless supplemented by batteries. Batteries are trickle charged in normal operation and brought into play during power sags to provide extra voltage. Power sags are quite common and so such a device may be a cost-effective solution.
Install AC filters. These are passive devices and essentially less powerful voltage regulators without battery backup. AC filters essentially smooth out noise. Top-end voltage regulators typically include AC filters as standard.
Install an Uninterruptible Power Supply (UPS). A UPS is essentially a battery system designed to cope with short-term power loss. Some UPS systems may cooperate with servers (such as Novell) so that they signal imminent failure before the battery is exhausted, allowing the server to shut down gracefully. You should ensure that the UPS produces a waveform your systems can tolerate during the switchover between main AC and battery AC, since many UPS systems produce a square AC wave, which some power supplies will not tolerate.
Ensure that all equipment and cabling are correctly grounded. This is especially true for tall buildings (where a potential difference can be created between upper and lower stories) and industrial complexes (where grounding may be problematic).

It is common to protect key resources such as servers, firewalls, and backbone routers with one or more UPS systems. You should also note that many mid- to high-end devices (such as routers) support DC operation rather than AC. This is because many service providers require their telecommunications exchanges to run on a -48 V DC supply, since this is much easier to back up with batteries than AC. Devices such as the Cisco 4000 and 7000 and the Bay BLN and BCN nodes support DC mode operation.

Multiple power supplies

Localized power failures can be avoided by the use of redundant power supplies in key servers, firewalls, and routers. Often these power supplies can be implemented in a load-sharing configuration, thereby extending the life of the power supplies. High-end managed devices may also interface power supplies to a common management bus, so that the status of each power supply can be monitored and alarms sent in the event of any significant status change. This level of integration may even allow power supplies to be switched on and off remotely via a network management station or Telnet session.

6.4.6 Boot and configuration data

The operating system and configuration data of devices such as servers, routers, gateways, and firewalls are vulnerable to corruption. These systems often rely on hard or floppy disk media for holding OS or configuration data. A moving media subsystem is generally more prone to failure than static media such as NonVolatile RAM (NVRAM). For mission-critical applications (such as firewalling, server clustering, or backbone routing) designers typically demand at least some degree of fault tolerance in the media subsystem. Operating system and key configuration data can be loaded from a number of sources, including the following:

Hard disk, tape
Flash EPROM, ROM/PROM
NVRAM (holding system configuration data)
SCSI RAID for OS, configuration and logging data
Reboot from local CD or floppy disk
The capability to reboot automatically over the network, using protocols such as BOOTP, RARP, DHCP, MOP, and TFTP

Fault-tolerant loading should be implemented so that failure of one source forces the node to try alternate sources. Note that general-purpose OSs can be quite large, and this may restrict loading to some forms of mass storage (hard disk or tape) for practical purposes. In this case mirrored hard disks or RAID configurations should be considered. In many cases a mass storage device is required for high-volume storage, such as event logging. In the event of catastrophic disk failure, some devices can automatically switch over to pump logging data to a remote system over the network.
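The fallback behavior described above, where failure of one source forces the node to try the next, can be sketched as an ordered list of loaders. Everything named here is hypothetical: `BootSourceError`, `load_image`, and the example loaders stand in for whatever device-specific mechanisms read from flash, disk, or the network.

```python
from typing import Callable, Iterable, Tuple


class BootSourceError(Exception):
    """Raised by a loader when its source is missing, unreachable, or corrupt."""


def load_image(sources: Iterable[Tuple[str, Callable[[], bytes]]]) -> Tuple[str, bytes]:
    """Try each (name, loader) pair in order; return the first image that loads."""
    failures = []
    for name, loader in sources:
        try:
            image = loader()
        except BootSourceError as exc:
            # Record the reason and fall through to the next source
            failures.append(f"{name}: {exc}")
            continue
        return name, image
    raise BootSourceError("all boot sources failed: " + "; ".join(failures))


if __name__ == "__main__":
    def from_flash() -> bytes:
        raise BootSourceError("checksum mismatch")  # simulated corrupt image

    def from_network() -> bytes:
        return b"\x7fIMAGE"  # simulated image fetched from a boot server

    source, image = load_image([("flash", from_flash), ("network", from_network)])
    print(f"booted from {source} ({len(image)} bytes)")
```

Ordering the list from fastest local media to slowest network source keeps normal boot times short while still leaving a path to recovery when local media fail.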

6.4.7 Standby modules and spares

The internal architecture of a mission-critical device may lead to critical single points of failure. For example, if a router uses a single module to perform the routing processor function, then if this card fails all routing dies with it. Carrier class devices often allow the use of multiple modules for critical system functions such as routing. These modules may either be in a cold-standby state (i.e., the standby module simply monitors the live module using some form of watchdog and performs no activity until the primary module fails) or hot standby (where load could be distributed between modules, and either module is capable of taking over if the other dies). The approach you should take is as follows:

Identify all single points of failure in the system. Get the MTBF of each component from the vendor.
Scope the impact of any failures. Does it just affect the users connected to that card or the whole system functionality? How would you rectify the problem?
Identify whether the application justifies additional fault-tolerant measures. A single firewall to the Internet may be a critical resource. If you cannot justify multiple firewalls, then it would be wise to consider at least system fault tolerance.
Identify what possible warnings can be generated to indicate failure (SNMP traps on low disk space, environmental problems, etc.).
Identify what fault-tolerant features are available. What are the switchover times for each option? What is the impact on users (session loss, no effect)?
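Once the MTBF figures have been collected, they can be combined into a rough availability estimate for the whole system. The sketch below uses the standard relationships, with two assumptions that are not taken from the text above: a mean time to repair (MTTR) is known for each component, and component failures are independent. The function names and the example figures are illustrative only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*availabilities: float) -> float:
    """All components are required (each one is a single point of failure)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Redundant components: the function is lost only if all of them fail."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1.0 - a)
    return 1.0 - unavailable


if __name__ == "__main__":
    # Illustrative figures only, not vendor data
    psu = availability(mtbf_hours=50_000, mttr_hours=4)
    cpu_card = availability(mtbf_hours=100_000, mttr_hours=4)
    single = series(psu, cpu_card)
    redundant = series(parallel(psu, psu), parallel(cpu_card, cpu_card))
    print(f"single modules:    {single:.6f}")
    print(f"redundant modules: {redundant:.6f}")
```

Comparing the two results gives a first indication of whether duplicating a module is worth its cost; switchover time and shared dependencies (such as a common backplane) are not captured by this simple model.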

At the very least, most networks now carry some spares at key locations and document procedures for dealing with component failure. On a large international network spares holding could represent a substantial part of the overall network cost, and there is a potential issue of maintaining current revision levels of hardware and software. In such cases it is worth investigating spares provision as part of an outsourced maintenance contract.