8.11 Failures and Fault Tolerance in Multi-Board Systems | Designing Embedded Communications Software

The single control card + multiple line card architecture is the more common implementation of multi-board systems in communications applications, so it is used as the basis for the discussion of fault tolerance.

Multi-board systems are as susceptible to hardware failure as single-board systems. Unlike single-board systems, however, a multi-board architecture can handle a failure by switching over to a redundant board, that is, an additional board that takes over when the primary board fails. This capability is a requirement in carrier or service provider environments, where multi-board systems are frequently deployed, because it gives the system high availability. A system or network with high availability is able to recover from failures and continue operation.

This section covers the common redundancy schemes used in multi-board systems for high availability and how software is modified to support them.

8.11.1 Types of Failures

In a multi-board system, hardware failure can take multiple forms, the common ones being:

Line Card Port Failure
Line Card Failure
Line Card to Control Card Link Failure
Control Card Failure
Switch Fabric Failure
Line Card to Switch Fabric Link Failure

When a port on a line card fails, the line card identifies the failure and informs the control card, after which the protocol tasks on the control card process this as an event. For example, the RIP routing protocol needs to declare all routes reachable through that port as unreachable and propagate this information to neighboring routers. The FIB is recalculated by the RTM after the routing protocol has converged and then downloaded to all line cards. This FIB reloading at the line card could be incremental or complete, depending upon the impact of the port failure on the routes.
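The sequence above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the route table layout, the function names, and the `full_reload_ratio` threshold used to choose between an incremental and a complete download are all hypothetical and do not come from any particular RIP or RTM implementation. The one protocol fact relied on is that RIP uses a metric of 16 to mean "unreachable".

```python
RIP_INFINITY = 16  # RIP metric value meaning "unreachable"


def handle_port_failure(routes, failed_port):
    """Mark every route that uses failed_port as unreachable.

    routes maps prefix -> {"port": str, "metric": int} (hypothetical layout).
    Returns the prefixes that changed, which the protocol would then
    advertise to neighboring routers.
    """
    changed = []
    for prefix, route in routes.items():
        if route["port"] == failed_port and route["metric"] < RIP_INFINITY:
            route["metric"] = RIP_INFINITY
            changed.append(prefix)
    return changed


def build_fib(routes):
    """Derive a forwarding table (prefix -> port) from the reachable routes."""
    return {p: r["port"] for p, r in routes.items() if r["metric"] < RIP_INFINITY}


def plan_download(old_fib, new_fib, full_reload_ratio=0.5):
    """Decide whether line cards get an incremental or a complete FIB.

    full_reload_ratio is an assumed tuning knob: if more than this
    fraction of the old FIB is touched, send the whole table instead.
    """
    removed = sorted(p for p in old_fib if p not in new_fib)
    updated = {p: port for p, port in new_fib.items() if old_fib.get(p) != port}
    touched = len(removed) + len(updated)
    if old_fib and touched / len(old_fib) > full_reload_ratio:
        return ("complete", new_fib)
    return ("incremental", {"removed": removed, "updated": updated})
```

A caller would build the FIB before and after `handle_port_failure` and pass both to `plan_download`; a failure that affects only a few routes yields a small delta, while one that affects most of the table yields a full reload.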

A line card failure or a card-to-card link failure manifests itself as a loss of heartbeat between the control and line cards. Specifically, the ICCP or messaging sub-layer on the control card detects that it has not received heartbeats from the line card within a configured time interval. The ICCP informs the control card CPU, which causes protocol tasks to be notified that the ports on that line card have failed. The resulting actions are similar to the single-port failure case, applied to every port on the card.
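A minimal sketch of this detection logic follows. The class and function names are hypothetical, and time is passed in explicitly rather than read from a clock so that the behavior is easy to follow; a real ICCP layer would drive this from its own timers.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each line card (illustrative only)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # the configured time interval
        self.last_seen = {}         # card id -> time of last heartbeat

    def record_heartbeat(self, card_id, now):
        self.last_seen[card_id] = now

    def failed_cards(self, now):
        """Return the cards whose heartbeat is overdue at time `now`."""
        return sorted(
            card for card, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


def ports_to_fail(card_ports, failed_cards):
    """Expand failed cards into (card, port) pairs for the protocol tasks.

    card_ports maps card id -> list of port numbers (assumed layout).
    """
    return [(card, port) for card in failed_cards for port in card_ports.get(card, [])]
```

Note that the monitor cannot tell a failed card from a failed card-to-card link; both look like a missing heartbeat, which matches the description above.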

The port failure and the line card failure cause only a degradation in system performance, since the device still functions as a router but with a lower port count. However, failure of the control card causes the system to lose the core of its intelligence, in terms of routing protocols, management agents, and so on. While forwarding could possibly continue for some time using the information provided earlier by the RTM, the risk is that this information could be stale, causing misdirection of network data traffic and network instability. To address this, many systems provide control card redundancy to ensure continuous operation.

Switch fabric failure results in islands of line cards: each card can forward packets between its own ports but cannot forward packets to other line cards. While the system is still functional, there is a severe performance degradation, so systems often have a redundant switch fabric card to address this. The line card-to-switch fabric link failure looks the same as a switch fabric failure from the perspective of the affected line card, but the rest of the system continues to function with just that one line card isolated.

8.11.2 Redundancy Scenarios with Control and Line Cards

There are two options for redundancy with control and line cards:

A redundant card for each card (1:1 redundancy)
One redundant card for N cards (1:N redundancy)

With 1:1 redundancy, each primary line card has an equivalent backup line card with the same configuration. When the primary card fails, the backup (also called redundant or standby) card takes over. A highly available system requires that the switch from primary to redundant card take place without operator intervention. To accomplish this, the primary and backup cards exchange heartbeats so that the redundant card can take over on both the switch fabric and control card links if the primary card fails. There are two options upon startup for the redundant line card:

Warm Standby.

The standby card is initialized in the redundant configuration. When it takes over, it requests a download of the configuration from the system operator before continuing operation, so warm-standby operation requires operator intervention.

Hot Standby.

The configuration is obtained from the primary card while it is still functional. The primary sends updates to the standby card periodically and/or whenever the configuration changes; each such update is known as a checkpoint. When the standby card takes over, its information is as current as the last checkpoint from the primary.

The warm-standby operation is less flexible, since the configuration has to be provided to the redundant card after the failure, causing a disruption in system operation until the redundant card is fully operational. Moreover, it places an extra burden on the system operator, since the previous configuration has to be replicated step by step.

The hot-standby operation, on the other hand, requires three types of messages to be exchanged between the primary and redundant cards:

An initialization or bulk update, sent by the primary card when the redundant card comes up, provides a complete snapshot of the current configuration.
A periodic or on-demand checkpoint of configuration changes sent from the primary card to the redundant card.
A heartbeat message and response between the primary and redundant cards, sent in the absence of the checkpoint messages.
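The three message types above can be sketched as follows from the standby card's point of view. The names, the dictionary-based configuration, and the convention that a `None` value in a checkpoint deletes a key are all assumptions made for illustration; the text does not prescribe a message format.

```python
from dataclasses import dataclass, field
from enum import Enum


class MsgType(Enum):
    BULK_UPDATE = 1  # complete snapshot, sent when the redundant card comes up
    CHECKPOINT = 2   # configuration changes since the last update
    HEARTBEAT = 3    # liveness only, sent when there is nothing to checkpoint


@dataclass
class StandbyState:
    config: dict = field(default_factory=dict)
    synced: bool = False  # True once a bulk update has been applied


def apply_message(state, msg_type, payload=None):
    """Apply one message from the primary; return False if it was rejected."""
    if msg_type is MsgType.BULK_UPDATE:
        state.config = dict(payload)  # replace everything with the snapshot
        state.synced = True
    elif msg_type is MsgType.CHECKPOINT:
        if not state.synced:
            return False  # a delta is meaningless without a base snapshot
        for key, value in payload.items():
            if value is None:  # assumed convention: None deletes the key
                state.config.pop(key, None)
            else:
                state.config[key] = value
    # HEARTBEAT carries no configuration, so there is nothing to apply.
    return True
```

Rejecting a checkpoint that arrives before the bulk update is one possible design choice; an implementation could equally queue it until the snapshot arrives.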

8.11.3 Control Card Redundancy

The card initialization, bulk update, and periodic checkpoints of a control card are shown in Figure 8.8. Note that when the redundant card comes up, it initializes itself, requests the complete configuration from the primary card, and then remains in standby mode. In this mode, it does not process events or messages other than the periodic checkpoint and heartbeat messages from the primary card. When it detects that the primary has not sent any heartbeats or checkpoints in a configured time period, it takes over as the primary card.

At this point, the software on the redundant card moves from a standby mode to the primary mode of operation. This causes the redundant card to start responding to all the standard events like queue events, timer events, and so on. The redundant card has taken over operation from the primary card.

This scenario implies that the software needs to operate in two modes: primary and standby. In primary mode, the software operates as it would in a non-redundant configuration, with the addition that it processes messages from the standby card software and provides initialization and checkpoint updates. In standby mode, the software only obtains and updates the configuration, using the initialization update and the checkpoint updates.
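The two modes and the takeover rule can be sketched as a small state machine. The class and method names are hypothetical, and time is supplied by the caller rather than by a real timer.

```python
from enum import Enum


class Mode(Enum):
    STANDBY = "standby"
    PRIMARY = "primary"


class RedundantCard:
    """Standby-side view of the primary/standby mode switch (illustrative)."""

    def __init__(self, takeover_timeout_s, now=0.0):
        self.mode = Mode.STANDBY
        self.takeover_timeout_s = takeover_timeout_s  # configured time period
        self.last_from_primary = now
        self.handled = []  # events processed after takeover

    def on_primary_message(self, now):
        """Called for every heartbeat or checkpoint received from the primary."""
        if self.mode is Mode.STANDBY:
            self.last_from_primary = now

    def tick(self, now):
        """Periodic check: take over if the primary has been silent too long."""
        silent_for = now - self.last_from_primary
        if self.mode is Mode.STANDBY and silent_for > self.takeover_timeout_s:
            self.mode = Mode.PRIMARY
        return self.mode

    def dispatch(self, event):
        """Queue events, timer events, and so on are ignored while in standby."""
        if self.mode is not Mode.PRIMARY:
            return False
        self.handled.append(event)
        return True
```

In this sketch the switch is one-way: once the card has become primary it never returns to standby on its own, which leaves the question of how a repaired card rejoins the system to the surrounding software.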

Instead of each protocol task implementing the messaging scheme for the primary-to-standby transaction, system designers usually implement a piece of software called redundancy middleware (RM), which provides the facilities for checkpoints, heartbeats, and so on. The redundancy middleware offers a framework for the individual protocol and system tasks to provide information in a format appropriate for the standby. Protocol tasks make calls to the RM to pass information to the standby. The RM can either accumulate the checkpoints from the individual tasks and send a periodic update or send each checkpoint immediately; the choice is usually configurable. The redundancy middleware uses the services of the ICCP messaging layer (see Figure 8.7) to communicate between primary and standby. On the standby, the RM provides the initialization or checkpoint data to the protocol tasks that have registered for it.
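A minimal sketch of such middleware is shown below, assuming a hypothetical interface: `checkpoint` and `flush` on the primary side, `register` and `deliver` on the standby side, and a `send` callable that stands in for the ICCP messaging layer. None of these names come from an actual RM product.

```python
class RedundancyMiddleware:
    """Toy redundancy middleware; one instance runs on each card."""

    def __init__(self, send, immediate=False):
        self.send = send            # stand-in for ICCP: takes {task: [records]}
        self.immediate = immediate  # configurable: send now or accumulate
        self.pending = {}           # task name -> records awaiting a flush
        self.handlers = {}          # task name -> callback on the standby

    # Primary side
    def checkpoint(self, task, record):
        """Called by a protocol task to pass information to the standby."""
        if self.immediate:
            self.send({task: [record]})
        else:
            self.pending.setdefault(task, []).append(record)

    def flush(self):
        """Send the accumulated checkpoints; meant to be called periodically."""
        if self.pending:
            batch, self.pending = self.pending, {}
            self.send(batch)

    # Standby side
    def register(self, task, handler):
        """A standby task registers to receive its own checkpoint data."""
        self.handlers[task] = handler

    def deliver(self, batch):
        """Hand each received record to the task that registered for it."""
        for task, records in batch.items():
            handler = self.handlers.get(task)
            if handler is None:
                continue  # no task registered for this data; drop it
            for record in records:
                handler(record)
```

For a quick experiment the two sides can be wired together directly, for example `primary = RedundancyMiddleware(send=standby.deliver)`, so that a flush on the primary invokes the handlers registered on the standby.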

Figure 8.8: Control card redundancy.

8.11.4 Line Card Redundancy

The 1:N redundancy scheme is used more often with line cards. It is less expensive in terms of hardware, since there is only one backup card for several line cards. With hot standby, the initialization and checkpointing activities are the same as in the 1:1 case, except that the backup line card has to repeat initialization and checkpointing for every line card it backs up. This increases both the size of the maintained data and the software complexity, since the backup card now needs to track all the line cards. So, the preferred approach is warm standby, in which the card is initialized but the configuration is not loaded, since it will vary depending upon which primary fails.
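The warm-standby behavior in the 1:N case can be sketched as follows. The class name and the `config_store` mapping are hypothetical; `config_store` simply stands in for wherever the configuration is downloaded from, which the earlier description of warm standby attributes to the system operator.

```python
class WarmStandbyLineCard:
    """One backup line card protecting N primaries (illustrative only)."""

    def __init__(self, protected_slots):
        self.protected_slots = set(protected_slots)
        self.active_for = None  # slot currently being replaced, if any
        self.config = None      # nothing is loaded until a primary fails

    def take_over(self, failed_slot, config_store):
        """Load the failed primary's configuration and start acting for it.

        config_store maps slot -> configuration dict (assumed layout).
        Returns False if the backup is already covering another failure.
        """
        if failed_slot not in self.protected_slots:
            raise ValueError(f"slot {failed_slot} is not protected by this card")
        if self.active_for is not None:
            return False  # a single backup can cover only one failure at a time
        self.config = dict(config_store[failed_slot])
        self.active_for = failed_slot
        return True
```

The sketch makes the trade-off visible: the backup holds no per-primary state before a failure, at the cost of a configuration download during the switchover and of being unable to cover a second failure.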

8.11.5 Summary of Redundancy Model and Standby Models for Control and Line Cards

In the single control card + multiple line card architecture, only 1:1 redundancy is of interest for the control card: the control card is backed up by a standby control card, and hot standby is more commonly used to avoid switchover delays. For line cards, 1:N redundancy with warm standby is more common.

Table 8.4 summarizes the various redundancy schemes for the control card and line cards.

Table 8.4: Redundancy schemes for control and line cards.
Card Type	Common Redundancy Model	Common Standby Model
Control Card	1:1	Hot Standby
Line Card	1:N	Warm Standby