6.3 Fault-tolerant, high-availability, and clustering systems

The approach taken for service protection depends largely upon the budget, the relative importance of service outages to the business or organization, and the type of data being protected. The network designer may choose from a basic high-availability solution through to a full fault-tolerant system. These systems can be described as follows:

Fault-tolerant systems—At the highest level of availability there are fault-tolerant systems (sometimes called Continuous Availability [CA] systems). These systems employ highly sophisticated designs to eliminate practically all single points of failure; consequently, only a small number of vendors can truly claim to have fault-tolerant platforms.
Fault-resilient systems—At the next level of availability there are fault-resilient systems. These systems employ sophisticated designs to eliminate some single points of failure, so there are basic levels of fault tolerance (e.g., ECC memory, RAID arrays, and multiple network interfaces).
High-availability systems (clustering and server mirroring)—These techniques rely on external cooperation between multiple systems so that the systems effectively operate as a single logical group resource.

Figure 6.13 illustrates two of these systems.

Figure 6.13: General architecture for servers in (a) fault-tolerant mode and (b) clustered high-availability (HA) mode.

Fault-tolerant solutions are explicitly designed to eliminate all unplanned and planned computer downtime by incorporating reliability, maintainability, and serviceability features. Fault-tolerant systems claim as much as 99.999 percent uptime (colloquially known as five-nines). High-availability systems can deliver average levels of uptime in the 99 to 99.9 percent range.
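To put these percentages in perspective, the short Python sketch below converts an availability figure into the downtime it permits per year. The function name and the 365-day year are assumptions made purely for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime (minutes per year) implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions five-nines allows a little over five minutes of downtime per year, whereas 99 percent allows more than three and a half days.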

6.3.1 Maintaining service levels

When we talk about system availability, we need to differentiate between data loss, transaction loss, and loss of service. These issues are tackled via different techniques. Data can be protected by recording them simultaneously to multiple storage devices; the most widely used techniques are disk mirroring and RAID. Fault-tolerant systems offer complete protection at the transaction level but are typically very expensive and do not scale. In contrast, HA solutions offer cost-effective and scalable protection; however, transactions may be lost. Regardless of the solution provided, overall service availability depends heavily on the architecture of the applications. Application-based redundancy refers to application-level routines designed to protect data integrity, such as two-phase commit or various data replication processes. Simple applications generally have no protection against transactions lost in midsession; once data are committed they are gone for good. Sophisticated database management systems that support a two-phase commit model are typically much more robust against data loss.

6.3.2 Application models and availability

In general the purpose of improving availability is to protect mission- or business-critical services; it is, therefore, important to understand the dynamics of these services in detail to ensure that the solutions provided are appropriate for, and consistent with, application behavior. Networked applications today often use a multitier architecture based on a client/server or distributed model. For availability purposes, it is useful to consider applications as three layers: communications services, application services, and database services, since the availability issues for each of these layers can be quite different, as follows:

Communications services include management of the LAN and WAN connections together with higher-level communications and messaging functions (such as routing, store-and-forward messaging, and protocol or data conversion). Systems at this layer typically rely on a large amount of memory-resident data (e.g., connection/session status, queued messages, intermediate results for transactions, and the context of user dialogs). The ability to preserve these critical data against serious failures may be highly desirable for transaction-based services, or services where sessions need to be maintained continuously.
Application services functions will vary depending upon the application but typically include user interface handling, transaction capture and sequencing, local searching and sorting, computational processing and statistics reporting, event and message logging, and so on. The availability solution most appropriate for this layer will depend largely on the application requirements. Fault-tolerant servers are appropriate when it is important to protect critical, memory-based state information or an in-memory database. If this is not a requirement, then an HA cluster may be more appropriate. In some cases front-end services may mandate stand-in processing, where applications provide continuous client service during periods when back-end databases are unavailable.
Database services are transaction-based services and typically require a large back-end central database. This is generally implemented as a high-end, scalable server platform architected to handle volume transaction traffic and interface with mass storage systems (either via SCSI/RAID, Fibre Channel, or some other peripheral interface). The recovery issues are fairly straightforward for this layer, since the persistence, data integrity, and recoverability of application data are standard features of all serious database and transaction software.

At a large central site, the three layers are often distributed across several server platforms, in which case it is generally possible to provide a tailored availability solution for each layer. For example, fault-tolerant communications servers can provide continuous availability for higher-level communications services, enabling front-end applications to route messages among multiple back-end systems and to store and later submit transactions if back-end systems are offline. For smaller sites it is normal practice to select a single solution that is most suited for the needs of all three layers.

With fault-tolerant systems much of the hard work is transparent to the network designer, so these systems can be relatively straightforward to implement. However, fault-tolerant systems have finite resources and do not scale (so even FT systems may have to use clustering techniques). HA solutions provide scalability and can provide equivalent availability; an HA solution can be harder to implement, since much of the workings of the HA solution are exposed and may require special tuning or have topological restrictions. Currently there are no generic communications solutions that provide the equivalent level of reliability provided by two-phase commit database management software (although sessions may be maintained by HA solutions, they typically cannot guarantee against transaction loss). Further complications arise from the heterogeneous nature of many communications environments, as well as the need to incorporate existing legacy protocols, multiple operating systems, and a variety of networking devices in the end-to-end delivery path. All of these factors impact reliability.

6.3.3 Fault-tolerant systems

At the high end of the server market there are a number of vendors that offer truly fault-tolerant machines (such as Tandem [11] and Stratus [12]). These machines are designed to eliminate most single points of failure within the internal processor, memory, IO, and power subsystems. They also typically offer multiple network adaptors. Key features to look for in a fault-tolerant system are as follows:

Replicated hardware subsystems—Some vendors offer the choice of Dual-Modular Redundancy (DMR) or even Triple-Modular Redundancy (TMR) at the hardware level. Systems are monitored constantly and failover is immediate (or at least in the order of milliseconds).
Online software serviceability—Support for configuration changes, operating-system upgrades, patches, and device driver updates without requiring a system reboot.
Online hardware serviceability—Hot-swappable drives are an essential component (RAID and otherwise).
Fast boot methods—Fast dump and reboot to minimize reboot time after a catastrophic OS failure.
Sophisticated fault-detection logic—Isolates and classifies faults quickly and initiates recovery procedures. The recovery method will depend on whether the failure is hard (repeatable) or transient (not repeatable).
Hardware failure recovery—Automatic and transparent recovery from hardware failures, including CPU, memory, disk, LAN, I/O, adapter, fans, and power supplies.
Backup power—A UPS interface is essential for continued short-term operation or graceful shutdown in the event of a major power outage.
Persistent memory—Some systems offer so-called persistent memory, which enables applications to read and write to selected memory contents even after rebooting the system (this memory is not reinitialized after a reboot). For example, the RAMDISK feature of Windows 2000 could be used to this end.
Total remote management—Including the ability to reboot remote servers regardless of their state.
Concurrent backup and restore—The system must be usable during backups and restores (some implementations impose performance or file locking issues, which reduce usability for the duration).

The most resilient of fault-tolerant architectures include full hardware-based fault tolerance. Within these systems, hardware is engineered to include continuous self-checking logic, and all of the main subsystems are physically duplicated (CPUs, main memory, I/O controllers, system bus, power subsystems, disks, etc.). Self-checking logic is typically resident on every major circuit board to detect and immediately isolate failures. Since every separate element of the computer is duplicated, normal application processing continues even if one of these components should fail. In such systems, hardware-based fault tolerance is transparent to application software, and there is no performance degradation if a component failure occurs. Also, logic self-checking allows data errors to be isolated at each clock tick, assuring that erroneous data never enter the bus and, as a result, cannot corrupt other parts of the system. Finally, onboard diagnostics built into continuous availability system architectures often automatically detect problems before they lead to failures and initiate service instantaneously should a component fail.

Conventional computers (even in HA mode) are not fault tolerant; they have many single points of failure that can crash the system running the end user's mission-critical application. Their goal is to recover as soon as possible from a crash, rather than to assure that a crash does not occur in the first place. Recovery times can vary from seconds to many minutes depending on the complexity of the application and the communications involved. Recovery can take much longer if the application must be returned to the point where it was before the failure occurred. In many cases it may actually be impossible to recover the application context, and users may simply have to log in and start again. Disks may have to be resynchronized, databases synchronized, and screens refreshed. If a crash occurs in the middle of a transaction, corrupt data may enter the system, entailing additional time and cost to rectify. Obviously, transaction data also may be permanently lost during a system crash. Conventional computers reconfigured for high availability rely on layered systems software or custom application code residing above the operating system in order to help an application recover from a system crash. These configurations, however, have limited capabilities to identify hardware failures. They cannot detect transient hardware failures, for example. As a result, although the hardware platform may continue to run, the mission-critical software application can be rendered useless by bad data.

While fault-tolerant systems clearly have their advantages, they raise several issues from the designer's perspective: they tend to be very expensive; because the system sits in a single location, that location is itself a single point of failure; and finite resources mean that scalability can become a problem at very high traffic volumes.

Operating systems

The server operating system can be proprietary or an industry standard OS (such as UNIX or Windows NT/2000). Clearly, the more standard the OS, the more applications are likely to be available to run on the platform. For example, the initial Stratus fault-tolerant platforms ran the proprietary Virtual Operating System (VOS). Stratus subsequently released support for two flavors of UNIX (FTX and HP-UX). Stratus Continuum systems are based on the HP PA-RISC microprocessor family and run a standard version of HP-UX. These systems are reportedly fully ABI and API compatible with HP-9000 servers and can run both HP and third-party software without modification. Stratus recently offered support for Windows 2000 via its Melody fault-tolerant server platform.

Scalability

While fault tolerance can be essential for mission- or business-critical servers, these systems have finite resources. As traffic and transaction levels increase, fault-tolerant systems eventually run out of steam, and an intelligent way of clustering systems (either with or without fault tolerance) is required. Clustering builds availability into a solution through external system redundancy and control procedures, much the same way that fault-tolerant systems internalize those processes. Clustering enables systems to provide scalability through modular addition of further systems into a cooperating group.

Example fault-tolerant application

Fault-tolerant systems are routinely employed to support Automated Teller Machine (ATM) and Point of Sale (POS) card authorization systems, since these systems typically support transactions on a global, 24/7 basis. These systems are characterized by continuously available front-end communications, authorization, and logging services, with back-end mainframe systems handling the customer account databases and settlements. With a fault-tolerant communications front end, service can continue during periods where the back-end systems are unavailable. Another common application is a 24/7 call center, integrated either with a customer service or an order-entry application. By deploying fault-tolerant systems in the call center front end, the call center can provide temporary processing or transaction capture if the back-end database systems are unavailable. The front end can also route transactions to multiple back-end systems if required.

Fault-tolerant systems may be used to support applications such as mission-critical Intelligent Network (IN) elements and Service Control Points (SCPs). These systems provide application and database services to the voice switches using the SS7 protocol running over WAN links. SCP applications require very high reliability and rely on large in-memory databases to fulfill the subsecond response times required. Network management systems, especially those used in large enterprises or for telecommunications network monitoring, may be deployed using fault-tolerant systems. These applications often keep large amounts of network status information in memory, and some fault-tolerant systems can provide persistent memory for in-context recovery after system failure.

6.3.4 Clustering and high-availability systems

Application and file servers are typically key resources on a network. IT managers often prefer to centralize these resources, whether for cost, management, or performance reasons. Since many organizations are heavily dependent on these services, it is imperative that you protect them by deploying some fault-tolerance measures. In the past, enterprise networks were often designed with multiple routers between LAN segments in order to provide redundancy. The effectiveness of this design was limited by the speed at which the hosts on those LANs detected a topology failure and switched to an alternate router. As we have already discussed, many IP hosts tend to be configured with a default gateway or are configured to use Proxy ARP in order to find the nearest router on their LAN. Forcing an IP host to change its default router often requires manual intervention to clear the ARP cache or to change the default gateway. This section describes how this and other problems can be resolved. The main approaches we will consider are as follows:

Server cluster techniques—Sophisticated software for managing server cluster status.
Router clustering techniques—Virtual Router Redundancy Protocol (VRRP) and Hot Standby Router Protocol (HSRP).
Proxy server techniques—Round-robin DNS, network address translation, Proxy ARP, DNS load sharing.

There are many vendors currently offering standard and proprietary load-balancing hardware or software products. The techniques vary widely and have advantages and disadvantages that we will now examine.

Design features of HA systems

Specialized networking devices, such as routers, switches, and firewalls, generally use purpose-built hardware designs, particularly at the high end, that enable fault tolerance or simply improve reliability. These include the following:

Fault-tolerant power supply—Load sharing managed power supplies with a common power rail.
DC power supply—To enable backup power generators to provide power should the main AC supply fail.
Fault-tolerant backplane design—Avoiding single points of failure and improving overall reliability.
Fault-tolerant processors—Dual processor cards (either hot swap or hot standby) may spread the load and duplicate key live data (such as routing tables or firewall state tables).
Hot swap line cards—Hot swap interface and processor cards; commissioning can be achieved without taking down the whole system and rebooting.
Clustering—Logical clustering or stacking to avoid system-level failures.
Fault-tolerant loading of OS and configuration data—The ability to load the OS and/or configuration data from fast permanent storage with backup sources.

Server mirroring

Server mirroring enables servers to be distributed, eliminating the problems associated with a single location while also enabling more cost-effective platforms to be purchased. With this technique, a backup server simultaneously duplicates all the processes and transactions of the primary server. If the primary server fails, the backup server can immediately take its place without any downtime. Server mirroring is an expensive but effective strategy for achieving fault tolerance. It's expensive because each server must be mirrored by an identical server whose only purpose is to be there in the event of a failure.

This technique generally works by providing hot standby servers. A primary and backup server exchange keep-alive messages; when the primary stops transmitting, the standby system automatically takes over. These solutions work by capturing disk writes and replicating them to two live servers. Each piece of data written to the volume is captured on the backup server. The backup server is always online, ready to step in (this can take anything from milliseconds to minutes depending upon the architecture of the system). Products in this category are available from several vendors, including Novell (SFT Level III), Vinca Corp (StandbyServer), IBM (HACMP/HAGEO), and HP (SwitchOver/UX). A less expensive technique that is becoming more and more popular is clustering.
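The keep-alive exchange described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name, the timeout value, and the use of caller-supplied timestamps (rather than a real network socket and clock) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Hypothetical standby-side monitor that takes over when keep-alives stop."""

    timeout: float            # seconds of silence tolerated before takeover (assumed)
    last_heard: float = 0.0   # timestamp of the most recent keep-alive
    active: bool = False      # True once the standby has taken over

    def on_keepalive(self, now: float) -> None:
        # Record the arrival time of a keep-alive from the primary.
        self.last_heard = now

    def tick(self, now: float) -> bool:
        # Called periodically; promotes the standby if the primary has gone quiet.
        if not self.active and now - self.last_heard > self.timeout:
            self.active = True
        return self.active


if __name__ == "__main__":
    monitor = StandbyMonitor(timeout=3.0)
    monitor.on_keepalive(10.0)
    print(monitor.tick(12.0))   # primary heard 2 s ago, standby stays passive
    print(monitor.tick(13.5))   # 3.5 s of silence, standby takes over
```

The sketch deliberately omits failback and the disk-write replication that real products perform; it shows only the detection step that triggers the takeover.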

Clustering techniques

Clustering is a technique generally associated with the grouping of two or more systems together in such a way that they behave like a single logical system. Clustering is used to provide a scalable, resilient solution for mission-critical networking components such as servers, routers, firewalls, and VPN appliances. Clustering is used for parallel processing, for load balancing, and for fault tolerance. Clustering is a popular strategy for implementing parallel processing applications, because it enables companies to leverage the investment already made in PCs and workstations. In addition, it's relatively easy to add new CPUs simply by adding a new PC to the network.

Clustering may be achieved using proprietary protocols and signaling systems or via standard protocols. Protocols such as VRRP or HSRP offer a crude form of clustering for routers.

Clustering software

High-availability server clustering software is available from a number of vendors, including Hewlett-Packard's MC/ServiceGuard [13], IBM's HACMP [5], and Microsoft's Cluster Server software (Wolfpack) [14]. These enable multiple servers, in conjunction with shared disk storage units, to rapidly recover from failures. Whenever a hardware or software failure is detected that affects a critical component (or an entire system), the software automatically triggers a failover from one cluster node to another. Data integrity and data access are preserved using a shared-access RAID system. Application processing and access to disk-based data are typically restored within minutes (recovery times will vary depending upon specific characteristics of the application and system configuration). To ensure proper failover behavior, the user must customize the configuration to match the environment by creating a number of failover scripts. Management of a cluster is generally more complex than for a single system, since multiple systems must be managed and configuration information must be consistent across systems. Software upgrades and other hardware or software maintenance operations can be done with minimal disruption by migrating applications from one node in a cluster to another.

Availability of this software is dependent upon the operating system used. Router and firewall platforms typically rely on proprietary, or at least heavily modified operating systems, which means that this software is unlikely to be available. These devices either implement proprietary clustering protocols or standards-based protocols such as VRRP.

6.3.5 Virtual Router Redundancy Protocol (VRRP)

Virtual Router Redundancy Protocol (VRRP) is a standards-based protocol [15]. VRRP enables devices (normally routers) to act as a logical cluster, offering themselves on a single virtual IP address. The clustering model is simple but effective, based on master-slave relationships. VRRP slave devices continually monitor a master's status and offer hot standby redundancy. A crude form of load sharing is possible through the use of multiple virtual groups. VRRP is similar in operation to Cisco's proprietary Hot Standby Router Protocol (HSRP) [16]. Although primarily used for fault-tolerant router deployment, VRRP has also been employed with other platforms (such as Nokia's range of firewall appliances [17]). The current version of VRRP is version 2.

The real problem VRRP attempts to address is the network vulnerability caused by the lack of end-system routing capabilities on most workstations and desktop devices. The vast majority of end systems interact with routers via default routes; the problem with default gateway functionality is that it creates a single point of failure—if the default router goes down, then all host communications may be lost outside the local subnet, either permanently or until some timer has expired. A mechanism was required to solve this problem quickly and transparently, so that no additional host configuration or software is required. VRRP solves this by clustering routing nodes that reside on a common shared media interface, offering them to end systems under a single virtual IP address. End systems continue to use the default gateway approach; nodes within the VRRP cluster resolve who should forward this traffic. Before proceeding any further let us reflect on the following definitions:

VRRP router—A router running the VRRP. A VRRP router may participate in one or more virtual routers (i.e., it may belong to more than one virtual group). A VRRP router can be in different states for different virtual groups (it may be master of one group and slave of another).
Virtual router—This is an abstract object managed by VRRP that acts as a default router for hosts on a shared LAN. Essentially it is a virtual group address and comprises a Virtual Router Identifier (VRID) and a set of associated IP addresses.
IP address owner—The VRRP router that has the virtual router's IP address(es) as real interface address(es). This is the router that will normally respond to packets addressed to one of these IP addresses (such as ICMP pings, TCP connections, etc.).
Primary IP address—An IP address selected from the set of real interface addresses (this could be the first address, depending upon the selection algorithm adopted). VRRP advertisements are always sent using the primary IP address as the source address.
Virtual router master—The VRRP router that currently has the responsibility for forwarding packets sent to the IP address(es) associated with the virtual router and answering ARP requests for these IP addresses. If the IP address owner is alive, then it will always assume the role of master.
Virtual router backup—The set of VRRP routers in standby mode, ready to assume forwarding responsibility should the current master fail.
Virtual MAC address (VMAC)—The MAC address used by the master when advertising or responding to queries (such as an ARP request).
Physical MAC address—The real MAC address for a particular VRRP node (i.e., the unique address typically burned into its network hardware).

It is worth pointing out that VRRP is essentially a LAN-based protocol. To my knowledge there are no VRRP implementations available for wide area interfaces (although multiaccess technologies such as Frame Relay or SMDS could conceivably support it). Since the default gateway problem does not manifest itself in the wide area, it makes little sense to use VRRP on WAN interfaces, and dynamic routing protocols generally do a much better job.

VRRP packet format

VRRP messages are encapsulated in IP packets (protocol 112) and addressed to the IPv4 multicast address 224.0.0.18. This is a link-local scope multicast address, so routers must not forward a datagram with this destination address regardless of its TTL. The TTL must nevertheless be set to 255, and a VRRP router receiving a VRRP packet with a TTL not equal to 255 must discard it; this guards against VRRP packets injected from a remote network, since any packet that had crossed a router would arrive with a lower TTL. The function of VRRP messages is to communicate priority and status information. In a stable state these messages originate from the master only. (See Figure 6.14.)

Figure 6.14: VRRP packet format.
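The IP-level rules for VRRP messages (protocol 112, destination 224.0.0.18, TTL of 255) amount to a simple acceptance test on the receiving router. The following Python sketch shows that test in isolation; the function and constant names are hypothetical, and a real implementation would read these values from the received IP header.

```python
VRRP_PROTOCOL = 112          # IP protocol number carried in the IP header
VRRP_GROUP = "224.0.0.18"    # link-local multicast destination address
VRRP_TTL = 255               # the only TTL value a receiver accepts


def ip_header_ok(protocol: int, destination: str, ttl: int) -> bool:
    """Return True if the IP header fields are acceptable for a VRRP message."""
    return (
        protocol == VRRP_PROTOCOL
        and destination == VRRP_GROUP
        and ttl == VRRP_TTL
    )


if __name__ == "__main__":
    print(ip_header_ok(112, "224.0.0.18", 255))  # locally originated advertisement
    print(ip_header_ok(112, "224.0.0.18", 254))  # TTL was decremented, so discard
```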

Field Definitions

Ver—Specifies the VRRP protocol version of this packet; the current version is 2.
Type—Specifies the type of this VRRP packet. The only type defined at present is 1; all other types should be discarded.
VRID—An abstract group identifier that identifies a logical community of cooperating VRRP routers. The VRID must be unique for a particular multiaccess network but may be reused on different multiaccess network interfaces on the same system. The master virtual router advertises status information using the VRID.
Priority—Specifies the sending VRRP router's priority for the virtual router. Higher values equal higher priority. The priority value for the VRRP router that owns the IP addresses associated with the virtual router (i.e., has a real IP address assigned to one of its interfaces that matches the virtual IP address) should be set to 255 (implementations should ideally assign this automatically). VRRP routers backing up a virtual router must use priority values between 1 and 254, with the default set to 100. The priority value zero (0) has special meaning, indicating that the current master has stopped participating in VRRP. This is used to trigger backup routers to quickly transition to master without having to wait for the current master to time out. A VRRP master router with priority 255 will respond to pings.
Count IP Addrs—The number of IP addresses contained in this VRRP advertisement.
Auth Type—Identifies the authentication method being utilized. Authentication type is unique on a per interface basis. The authentication type field is an 8-bit unsigned integer. A packet with unknown authentication type or that does not match the locally configured authentication method must be discarded. The authentication methods currently defined are as follows:
- 0—No Authentication
- 1—Simple text password
- 2—IP Authentication Header (using HMAC-MD5-96)
Adver Int—The interval, in seconds, between advertisements from the master. The default is one second.
Checksum—A 16-bit one's complement of the one's complement sum of the entire VRRP message, starting with the version field (where the checksum field is set to 0 for the purpose of this calculation). It is used to detect data corruption in the VRRP message.
IP Address(es)—One or more IP addresses that are associated with the virtual router.
Authentication Data—Any associated authentication data.
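The field definitions above can be made concrete with a short Python sketch that builds and parses an advertisement. The byte layout used here (a 4-bit version and 4-bit type sharing the first octet, followed by VRID, priority, address count, authentication type, advertisement interval, a 16-bit checksum, the IPv4 addresses, and an 8-octet authentication data field) is an assumption based on the VRRP version 2 specification [15] rather than on anything stated explicitly in this section, and all function names are hypothetical.

```python
import socket
import struct

# ver/type, VRID, priority, address count, auth type, adver int, checksum
HEADER = struct.Struct("!BBBBBBH")
AUTH_DATA_LEN = 8  # assumed fixed-size authentication data field


def checksum16(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of data."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_advertisement(vrid, priority, addresses, adver_int=1,
                        auth_type=0, auth_data=bytes(AUTH_DATA_LEN)):
    """Build a version 2, type 1 advertisement with a valid checksum."""
    body = b"".join(socket.inet_aton(a) for a in addresses) + auth_data
    fields = ((2 << 4) | 1, vrid, priority, len(addresses), auth_type, adver_int)
    unsummed = HEADER.pack(*fields, 0)       # checksum field is 0 while summing
    return HEADER.pack(*fields, checksum16(unsummed + body)) + body


def parse_advertisement(packet: bytes) -> dict:
    """Validate and decode an advertisement; raise ValueError if it must be discarded."""
    if len(packet) < HEADER.size:
        raise ValueError("packet too short")
    ver_type, vrid, priority, count, auth_type, adver_int, _ = HEADER.unpack_from(packet)
    version, ptype = ver_type >> 4, ver_type & 0x0F
    if version != 2 or ptype != 1:
        raise ValueError("unsupported version or type")
    if len(packet) != HEADER.size + 4 * count + AUTH_DATA_LEN:
        raise ValueError("length does not match address count")
    if checksum16(packet) != 0:              # a valid message sums to zero
        raise ValueError("checksum mismatch")
    start = HEADER.size
    addresses = [socket.inet_ntoa(packet[start + 4 * i:start + 4 * i + 4])
                 for i in range(count)]
    return {"version": version, "type": ptype, "vrid": vrid, "priority": priority,
            "auth_type": auth_type, "adver_int": adver_int, "addresses": addresses}


if __name__ == "__main__":
    pkt = build_advertisement(vrid=2, priority=100, addresses=["192.168.1.1"])
    print(parse_advertisement(pkt))
```

The sketch handles only authentication type 0 in any meaningful sense; the authentication data is carried but never checked.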

VRRP operation

VRRP operations are fairly straightforward and are summarized as follows:

Master election process—VRRP uses a simple election process to dynamically promote master responsibility. The election process is deliberately simple to minimize bandwidth and resource overheads; it implements a very limited state machine and uses a single message type. When VRRP routers first come online, the master is elected via a straightforward highest-priority mechanism. The network administrator typically assigns priorities manually. In a tie, where two routers have equal priority, the router with the higher IP address wins. Once election is complete, the master forwards all traffic as the default gateway for a specific virtual IP address. The exception to this rule is that the real virtual IP address owner should always seize master status if it is functioning properly.
Backup routers—If the master becomes unavailable (i.e., stops sending VRRP announcements), the highest-priority backup will transition to master after a short delay (by default three advertisement intervals, each of one second, plus a skew time of less than one second that shrinks as the router's priority rises). This provides a controlled transition of the virtual router responsibility with minimal service interruption. Note that VRRP routers in a backup state do not send advertisements; they simply monitor the status of the master. A backup router will preempt the master only if it has higher priority. This avoids unnecessary service interruptions: the master changes only when a more preferred path becomes available. It should also be possible to prohibit all preemption attempts through the configuration interface. The only exception is that a VRRP router will always become master of any virtual router associated with addresses it owns.
VRRP advertisements—The master sends regular advertisements (by default once a second) to inform slave routers that it is alive. These messages are sent as IP multicast datagrams, so VRRP may operate over any multiaccess LAN media supporting IP multicast. If the master fails to send announcements for a specified time period, a backup router with the next highest priority will take over forwarding responsibility to maintain integrity.
Virtual groups—Each virtual IP address maps to a logical group, abstracted by a Virtual Router Identifier (VRID). The mapping between VRID and addresses is usually configured manually on a set of VRRP routers by the network administrator. The VRID must be unique to a LAN; however, the same VRID can be reused on other LAN interfaces as long as those LANs are not joined together. In general it is recommended that you use different VRIDs to simplify diagnostics.
IP and MAC addresses used—To avoid issues with address resolution, a special virtual MAC address is generated and used as the source MAC address for all VRRP announcements sent by the master. This enables devices such as bridges and switches to cache a single MAC address in extended LANs. The virtual router MAC (VMAC) address associated with a virtual router is an IEEE 802 MAC address in the following format (expressed in hex in Internet standard bit order): 00-00-5E-00-01-{VRID}. The first three octets are derived from the IANA's OUI. The next two octets (00-01) indicate the address block assigned to the VRRP protocol. {VRID} is the VRRP virtual router identifier. This mapping provides for up to 255 VRRP routers on a network. For VRRP advertisements the master uses the unicast VMAC as its source address; the destination is the multicast MAC address to which the VRRP IP multicast group address maps. For example, for a VRRPv2 advertisement for VRID 2 (sent to IP multicast address 224.0.0.18) we would see the MAC destination address set to 01-00-5E-00-00-12 and the MAC source address set to 00-00-5E-00-01-02.
Address seeding—To precharge host, bridge, and switch address caches, VRRP routers use gratuitous ARP messages to advertise their presence. This, together with the use of the VMAC source addresses in all VRRP interactions (including failover to the backup router), is particularly important for efficient multiport bridge and switch operation (VRRPv1 was inconsistent in the use of source addresses, and this caused switches to flood frames onto all ports on occasion).
ICMP echo (ping) interaction—When a node pings a VRRP master router, or pings through a VRRP router to a remote host, the ICMP response packet from the VRRP router should include the physical (i.e., real) MAC address of the router as its source address, rather than the VMAC address. This is important for diagnosing faults or for remote NMS polling operations (in a VRRP failure scenario, if the VMAC were used as the source address then it would be impossible to distinguish which physical router has failed, and these routers may not necessarily be at the same site). This behavior is not specified explicitly in [15].
ICMP redirects—ICMP redirects may be used normally when VRRP is running between a group of routers. This allows VRRP to be used in environments where the topology is not symmetric. The IP source address of an ICMP redirect should be the address the end host used when making its next-hop routing decision. If a backup router is currently acting as master, then it must determine which virtual router the packet was sent to when selecting the redirect source address. One method is to examine the destination MAC address in the packet that triggered the redirect. It may be useful to disable redirects for specific cases where VRRP is being used to load share traffic between several routers in a symmetric topology.
ARP interaction—In contrast to ping operations, when a host sends an ARP request for one of the virtual router IP addresses, the master virtual router must respond using its VMAC address rather than its physical MAC address. This allows the client to consistently use the same MAC address regardless of the current master router. When a VRRP router restarts or boots, it should not send any ARP messages with its physical MAC address for the IP address it owns; it should send only ARP messages that include VMAC addresses. This may entail the following actions:
- When configuring an interface, VRRP routers should broadcast a gratuitous ARP request containing the VMAC address for each IP address on that interface.
- At system boot, when initializing interfaces for VRRP operation, delay gratuitous ARP requests and ARP responses until both the IP address and the VMAC address are configured.

If Proxy ARP is to be used on a VRRP router, then the VRRP router must advertise the VMAC address in the Proxy ARP message; otherwise, hosts might learn the real MAC address of the VRRP router.
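
The virtual MAC format described under "IP and MAC addresses used" can be illustrated with a minimal sketch. The function name `vrrp_vmac` is hypothetical and chosen only for this illustration; the address format itself (00-00-5E-00-01-{VRID}) is the one given in the text.

```python
def vrrp_vmac(vrid: int) -> str:
    """Return the virtual router MAC address for a VRID.

    The format is 00-00-5E-00-01-{VRID}, with the VRID written as
    two hexadecimal digits in the final octet.
    """
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in the range 1..255")
    return f"00-00-5E-00-01-{vrid:02X}"


print(vrrp_vmac(2))    # 00-00-5E-00-01-02
print(vrrp_vmac(255))  # 00-00-5E-00-01-FF
```

Because the VRID occupies a single octet, the range check mirrors the limit of 255 virtual routers per network noted earlier.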

Example design—simple hot standby

Figure 6.15 illustrates a topology where VRRP is used between two routers to provide resilience for client/server access for two LANs. In this configuration, both routers run VRRP on all interfaces, and on each LAN both routers participate in a single VRRP group. Note that the VRIDs used could be the same in this case, since the two broadcast LANs are physically separate.

Figure 6.15: VRRP configuration with resilience for high-speed server farm.

End systems on the client LAN (VRID-1) install a default route to the virtual IP address (194.34.4.1), and Router-1 (with a priority of 254 on this interface) is configured as the master VRRP router for that group. Router-2 acts as backup for VRID-1 and only starts forwarding if the master router dies.

End systems on the server LAN (VRID-2) install a default route to the virtual IP address (193.168.32.12), and Router-2 (with a priority of 254 on this interface) is configured as the master VRRP router for that group. Router-1 acts as backup for VRID-2 and starts forwarding only if the master router dies.

This configuration enables full transparent resilience for both clients and servers. The hosts require no special software or configuration and are oblivious to the VRRP operations.
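
The election outcome for this example can be sketched as follows. This is an illustration only, under stated assumptions: the priorities of 254 come from the example, but the backup priority of 100 and the real interface addresses (194.34.4.2, 194.34.4.3, 193.168.32.2, 193.168.32.3) are hypothetical values that do not appear in Figure 6.15. The rule applied is the one described earlier: highest priority wins, and a tie goes to the higher IP address.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class VrrpRouter:
    name: str
    priority: int       # configured priority on this LAN interface
    interface_ip: str   # real (not virtual) address; hypothetical values below


def elect_master(routers):
    """Highest priority wins; equal priorities are broken by the higher IP."""
    return max(routers, key=lambda r: (r.priority, IPv4Address(r.interface_ip)))


# VRID-1 (client LAN): Router-1 is configured with priority 254.
client_lan = [
    VrrpRouter("Router-1", 254, "194.34.4.2"),
    VrrpRouter("Router-2", 100, "194.34.4.3"),   # backup priority is an assumption
]
# VRID-2 (server LAN): Router-2 is configured with priority 254.
server_lan = [
    VrrpRouter("Router-1", 100, "193.168.32.2"),  # backup priority is an assumption
    VrrpRouter("Router-2", 254, "193.168.32.3"),
]

print(elect_master(client_lan).name)  # Router-1
print(elect_master(server_lan).name)  # Router-2

# If Router-1 fails, only Router-2 remains eligible on the client LAN.
survivors = [r for r in client_lan if r.name != "Router-1"]
print(elect_master(survivors).name)   # Router-2
```

The sketch ignores timers and the address-owner exception; it only shows why each router ends up as master on one LAN and backup on the other.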

The more observant of you may have noticed that in the topology shown in Figure 6.15, we have effectively created asymmetrical paths across the VRRP cluster; traffic from the client network (VRID-1) is forwarded via Router-1 and is returned from the server network (VRID-2) via Router-2. It would have been just as easy to force the path to be symmetrical by making Router-1 master on both interfaces. In this scenario asymmetry is not a problem, assuming both routers are evenly resourced; in fact, this configuration distributes some of the workload between routers. In cases where the routers have very different performance characteristics (i.e., processor speeds and buffer sizes), this would not be advisable. In such cases the router with the most resources should be configured as master for both interfaces, or at least master for the server side configuration (assuming the bulk of the traffic is server-to-client oriented). Path asymmetry can also be an issue for VRRP routers that also offer firewall applications (session states may be maintained between firewalls), and, depending on the state update frequency, Router-1 and Router-2 may be out of synchronization.

Note also that in this configuration Router-2 is not backed up by Router-1 on the 194.34.4.0 network, and Router-1 is not backed up by Router-2 on the 193.168.32.0 network. Mutual backup can be achieved by configuring another virtual router on each LAN, this time with the master and backup roles reversed. This configuration could be enhanced quite easily to support full load sharing in addition to backup.

VRRP issues

The VRRP default router technique is relatively simple and effective but not without its drawbacks, and there are some subtle VRRP configuration issues that can be difficult to analyze for inexperienced engineers. These include the following:

Lack of topological knowledge—VRRP is fairly dumb, and the scope of its topological knowledge is generally very narrow. It is typically implemented in isolation, so that there is no interaction with other routing protocols (static or otherwise). In certain scenarios this can lead to serious routing issues. This situation could be improved through tighter vendor integration with dynamic routing protocols, but it would be useful for the standards to address such interplay, even if just for guidance.
Routing problems and inefficiencies—As indicated previously, VRRP's isolation can lead to serious routing problems, particularly in LAN-WAN scenarios. For example, in Figure 6.16 let us assume that static routes are used on the wide area links. VRRP would typically have no knowledge of the status of these WAN interfaces. Failure of the remote interface 140.0.0.1 would mean that even though router R1 has no remote connectivity, it still remains master for a particular VRID. In this case packets from the LAN destined for the 140.0.0.0 network are simply black-holed unless there is manual intervention to force a transition.

Figure 6.16: VRRP configuration with LAN and WAN interfaces.

Assuming that we run a dynamic routing protocol, or R1's VRRP process somehow gains knowledge of the broken link, it will at best start sending ICMP redirects to clients on the 194.34.4.1 interface, redirecting them to the real IP address of R2. This can lead to further problems, since R2 sees that VRRP operations on the client side are working well (R1 never actually relinquishes master status for VRID-1). This results in routing inefficiencies, since every new user session must be explicitly redirected (some client stacks also handle ICMP redirects badly). To solve this problem some vendors allow the monitoring of specified interfaces, so that transitions on those interfaces automatically trigger a change to the VRRP master status. In this case we could monitor 140.0.0.1, and any failure would be treated as a soft system failure, so the master stops advertising or lowers its priority to force reelection (note that this feature is not specified in the standards).
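
The interface-monitoring enhancement can be sketched as a priority adjustment. Since this feature is vendor specific and not part of the standards, everything here is an assumption made for illustration: the function name, the interface label, the base priorities (200 for R1, 150 for R2), and the decrement of 60.

```python
def effective_priority(base_priority, tracked_up, decrement=60):
    """Lower the advertised priority for each tracked interface that is down.

    tracked_up maps an interface label to True (up) or False (down).
    The result never drops below 1, the lowest usable backup priority.
    """
    down_count = sum(1 for is_up in tracked_up.values() if not is_up)
    return max(1, base_priority - decrement * down_count)


R1_BASE, R2_BASE = 200, 150  # hypothetical configured priorities

# WAN link healthy: R1 advertises 200 and remains master.
print(effective_priority(R1_BASE, {"wan-140.0.0.1": True}))   # 200

# WAN link down: R1 advertises 140, which is lower than R2's 150,
# so R2 can preempt (provided preemption is enabled on R2).
print(effective_priority(R1_BASE, {"wan-140.0.0.1": False}))  # 140
```

The decrement has to be large enough to take the master below the backup's priority; otherwise the failure is detected but no transition occurs.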

This works, but there is yet another subtle problem that this enhancement does not address. Since dynamic routing knowledge is not available to VRRP, in some scenarios it is quite possible that a dynamic routing protocol (e.g., RIP or OSPF) will be announcing a more optimal next-hop address, based on superior topological knowledge. In Figure 6.16 consider a failure of interface 140.0.0.1; assuming circuit monitoring is available, this will result in VRRP transitioning, so that R2 starts forwarding traffic as expected. However, OSPF (running concurrently on R1) may announce a better route to 140.0.0.0 via R1 and network 150.0.0.0 (e.g., there could be problems upstream of R2 at the interface to 140.0.0.0 that VRRP is unaware of). Other VRRP issues include the following:

Scalability—Perhaps a more important issue with the VRRP/default route technique is that it is not a truly scalable solution for LANs with a large workstation population, particularly in load-sharing mode. In a large enterprise it will be cumbersome to configure default gateways for hundreds of PCs by hand. A dynamic service such as DHCP can help; however, the flexibility offered by DHCP depends on the implementation. On a Windows NT machine you can allocate default gateway addresses on a per-subnet basis. This would enable clients to load share very crudely by subnet. There could be huge differences in the actual load distribution, especially if subnets use different services.
Backward compatibility—There were significant changes between VRRPv1 and VRRPv2 that affect compatibility: VRRPv1 used IP protocol number 99, whereas VRRPv2 uses IP protocol number 112; VRRPv1 used destination IP multicast address 224.0.0.12, whereas VRRPv2 uses 224.0.0.18; and VRRPv1 used an additional IP address to denote the virtual address, whereas VRRPv2 uses real addresses.
Convergence speed—For standards-compliant implementations using default timers, a backup detects the loss of the master after approximately three to four seconds (three advertisement intervals of one second each plus a skew time of less than one second).
Duplicate packets—The typical operational scenarios are defined to be two redundant routers and/or distinct path preferences among the routers. A side effect when these assumptions are violated (e.g., more than two redundant paths all with equal preference) is that duplicate packets can be forwarded for a brief period during master election. However, the typical scenario assumptions are likely to cover the vast majority of deployments, loss of the master router is infrequent, and the expected duration of master election convergence is quite small (less than one second). Thus, the VRRP optimizations represent significant simplifications in the protocol design while incurring an insignificant probability of brief network degradation.
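
The timing behind the convergence-speed item can be worked through with a short sketch. The formulas are those of the VRRPv2 specification (RFC 2338 and its successor RFC 3768), in which the skew time is derived from the backup's priority; the function names are chosen for this illustration only. The result covers detection of a failed master, so any additional time needed by hosts or switches comes on top.

```python
def skew_time(priority: int) -> float:
    """Skew_Time = (256 - Priority) / 256 seconds."""
    return (256 - priority) / 256


def master_down_interval(priority: int, advertisement_interval: float = 1.0) -> float:
    """Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time."""
    return 3 * advertisement_interval + skew_time(priority)


for priority in (254, 100, 1):
    print(priority, master_down_interval(priority))
# 254 3.0078125
# 100 3.609375
# 1 3.99609375
```

Because a higher priority yields a shorter skew, the most preferred backup times out first and takes over before lower-priority backups react.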

VRRP is clearly useful but can be problematic for anything other than simple clustering applications. Subtle interactions with ARP, ping, and interior routing protocols often result in confusion for engineers, making diagnostic work protracted. VRRP does not provide efficient load sharing; in practice it distributes traffic on a per-node basis (i.e., not at the session or packet level). This means that a heavy traffic producer always uses the same gateway, regardless of load. For large client populations there is no easy way for a network administrator to automate allocation of default gateways fairly (ironically, this is the scenario that would benefit most from VRRP as a quick fix). In summary VRRP is a useful but very basic tool. For real high-bandwidth load sharing and fault-tolerant applications a more granular, more intelligent, transparent clustering technique is required.

6.3.6 Hot Standby Router Protocol (HSRP)

The Hot Standby Router Protocol (HSRP) predates VRRP and is described in [16]. HSRP is a Cisco proprietary protocol (see Cisco's patent [18]) with functionality similar to VRRP. HSRP handles network topology changes transparently to the host by using a virtual group IP address. HSRP has its own terminology, as follows:

Active router is the router that is currently forwarding packets for the virtual router.
Standby router is the primary backup router.
Standby group is the set of routers participating in HSRP that jointly emulate a virtual router.
Hello time is the interval between successive HSRP hello messages from a given router.
Hold time is the interval between receipt of a hello and the presumption that the sending router has failed.

HSRP is supported over Ethernet, Token Ring, FDDI, Fast Ethernet, and ATM. HSRP runs over the UDP protocol and uses port number 1985. Routers use their actual IP address as the source address for protocol packets, not the virtual IP address. This is necessary so that the HSRP routers can identify each other. Packets are sent to multicast address 224.0.0.2 with a TTL of 1. As with VRRP, an HSRP group can be defined on each LAN. One member of the group is elected master (the active router), and this router forwards all packets sent to the HSRP virtual group address. The other routers are in standby mode and constantly monitor the status of the active router. All members of the group know the standby IP address and the standby MAC address. If the active router becomes unavailable, the highest-priority standby router is elected and inherits the HSRP MAC address and IP address. HSRP typically allows hosts to reroute in approximately ten seconds. High-end routers (Cisco 4500, 7000, and 7500 families) are able to support multiple MAC addresses on the same Ethernet or FDDI interface, allowing the routers to simultaneously handle both traffic that is sent to the standby MAC address and to the private MAC address. As with VRRP, if multiple groups are configured on a single LAN, load sharing is possible using different standby groups and appropriate default routes in end systems.
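
The reroute time of approximately ten seconds follows from the hello and hold timers defined above. The sketch below is a simplified illustration, assuming the default values given in RFC 2281 (hello time 3 seconds, hold time 10 seconds); the function names are hypothetical and the HSRP state machine is not modeled.

```python
HELLO_TIME = 3.0   # seconds between hellos (RFC 2281 default)
HOLD_TIME = 10.0   # seconds without a hello before failure is presumed


def active_presumed_failed(last_hello_at: float, now: float,
                           hold_time: float = HOLD_TIME) -> bool:
    """True once the hold time has elapsed since the last hello was received."""
    return now - last_hello_at >= hold_time


def detection_window(hello_time: float = HELLO_TIME,
                     hold_time: float = HOLD_TIME) -> tuple:
    """Shortest and longest delay between a failure and its detection.

    A router that fails just before its next hello is detected after
    hold_time - hello_time; one that fails just after sending a hello
    is detected after the full hold_time.
    """
    return hold_time - hello_time, hold_time


print(active_presumed_failed(last_hello_at=6.0, now=15.0))  # False
print(active_presumed_failed(last_hello_at=6.0, now=16.0))  # True
print(detection_window())                                   # (7.0, 10.0)
```

Shortening both timers reduces the failover delay at the cost of more hello traffic, which is the trade-off behind the defaults.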

Differences between HSRP and VRRP

The main differences between VRRP and HSRP are as follows:

Both active and standby HSRP routers send hello messages. In VRRP only the master sends hellos.
HSRP implementation is a little more sophisticated than VRRP in that HSRP uses a more complex state machine and includes more events. The operational results from the user perspective are very similar, except that HSRP produces more traffic.
Unlike VRRP, [16] mandates that routers participating in HSRP on an interface must not send ICMP redirects on that interface.
Unlike VRRP, the virtual IP address is not assigned to any real interface (i.e., very much like VRRP version 1). This means that an additional IP address is required for an HSRP deployment.

The same reservations I made for VRRP in sophisticated designs apply to HSRP also. For further information on HSRP, the interested reader is directed to [16, 19].

6.3.7 Proxy server and interception techniques

There are a number of techniques used by gateways and proxies to enable load balancing as a kind of value-added service. These techniques usually work on the premise that since these proxies are often placed in strategic network locations (say in front of a server farm or on the perimeter WAN interface), and they need to modify or respond to requests as part of their basic function, they might add value without the user knowing. Note that the term balancing here is generally misleading, and the term sharing is perhaps more appropriate. Most of these techniques rely on quite crude methods to distribute load, and generally there are no guarantees about the actual load levels. Techniques in this class include the following:

DNS load-sharing proxies—Domain Name System (DNS) requests can be intercepted and distributed to a cluster of real DNS servers to provide both performance and reliability improvements.
HTTP load-sharing proxies—A method of distributing requests for Web service among a cluster of Web servers to improve performance and resilience. This technique is often referred to as HTTP redirection.
FTP load-sharing proxies—A method of distributing requests for FTP service among a cluster of FTP servers to improve performance and resilience. In practice this is generally achieved using network address translation.
ARP load-sharing proxies—Proxy ARP can be used to provide resilient paths to services and improve performance.
NAT load-sharing proxies—Network address translation is a common technique for providing resilient paths to services and improving performance.
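
To show why sharing is a more accurate term than balancing, the following minimal sketch models the simplest distribution method a proxy might use: round-robin selection. It is an assumption for illustration, not a description of any particular product; the class name and the server addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical.

```python
from collections import Counter
from itertools import cycle


class RoundRobinSharer:
    """Hand out back-end servers in strict rotation, ignoring their load."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)

    def pick(self):
        """Return the next server in the rotation."""
        return next(self._rotation)


sharer = RoundRobinSharer(["192.0.2.10", "192.0.2.11", "192.0.2.12"])

# Six requests with very different costs (arbitrary work units).
request_costs = [1, 50, 1, 1, 50, 1]
work_per_server = Counter()
for cost in request_costs:
    work_per_server[sharer.pick()] += cost

print(dict(work_per_server))
# {'192.0.2.10': 2, '192.0.2.11': 100, '192.0.2.12': 2}
```

Each server receives the same number of requests, yet one carries almost all of the work, which is the sense in which these methods offer no guarantee about actual load levels.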

When deploying proxy load balancers, careful attention should be paid to the design so that the balancer does not become a single point of failure. If the load balancer dies, then remote clients may not be able to reach any of the servers in the cluster behind it. Since this functionality is increasingly being integrated into general-purpose load-sharing systems, switches, and firewalls, these devices can often be clustered to provide appropriate levels of fault tolerance. Many of the techniques listed above are closely associated with load sharing and performance optimization.

6.3.8 Other related techniques

There are a number of other protocol techniques used to provide clustered fault tolerance, some standard, some proprietary, including router discovery protocol and IGP default routes.

6.3.9 Combining HA clusters with fault-tolerant servers

An architecture that combines HA clustering and fault-tolerant servers in a front-end/back-end configuration provides customers with the best combination of performance, flexibility, scalability, and availability. High-end servers, combined with MC/ServiceGuard software, provide a robust, scalable, back-end database service. Fault-tolerant servers provide a continuously available front-end communication service. Application services can run on the back end, front end, or both, depending on the specific application requirements. In some cases, the Application Services layer may warrant a separate set of server systems, again depending on the particular application environment and availability needs. This front-end/back-end architecture is actually very similar to the traditional mainframe architecture that has supported enterprise applications for over 30 years. Mainframes have used an intelligent communications controller to offload communications processing from the host (much of this communication is IBM SNA) and to allow routing of transactions among multiple hosts, providing both higher availability and load sharing. The combination of approaches brings the benefits of this traditional architecture to the world of open systems.

Deciding between an HA cluster and a fault-tolerant system will depend on the particular characteristics of the application and operational features of the network. HA clusters are a good choice if the following conditions apply:

Database services are the predominant part of the application.
The recovery model is transaction based.
Scalability beyond a midrange system is needed.
Application can accommodate seconds to minutes of recovery time.
Operations staff is available for cluster management.

A fault-tolerant system is a good choice if the following conditions apply:

Communications services or in-memory data are the predominant part of the application.
The recovery model relies on in-memory data or application state.
Remote site requires lights-out operation.
Application requires subsecond recovery time.