Competing with the Reliability of Existing Phone Systems

The perception of many in business today is that VoIP simply isn't reliable enough to support the telecommunication demands of a corporate environment. After all, corporate PBX systems are considered highly reliable, but how many times in a month do you hear users say, "My e-mail isn't working," "The Internet is down," or "I can't print to the network printer"? Because of such past frustrations with data network applications, this perception of unreliability has unfortunately carried forward to any new application running on the data network, such as VoIP.

Many PBX administrators like to boast, although they are not always correct, that their PBX has "five nines" of availability. By "five nines" they mean that the PBX is available (that is, up and running) 99.999 percent of the time, and not just during regular business hours: it's 24 hours a day, 365 days a year. If we do the math, a system that is available 99.999 percent of the time is unavailable for only about five minutes a year. Consider Table 3-1, which illustrates the yearly downtime associated with various availability levels.

Table 3-1. Availability and Downtime
Availability	Maximum Yearly Downtime
99.000 percent (two nines)	3 days, 15 hours, and 36 minutes
99.900 percent (three nines)	8 hours and 46 minutes
99.990 percent (four nines)	53 minutes
99.999 percent (five nines)	5 minutes
99.9999 percent (six nines)	32 seconds
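
The downtime figures in Table 3-1 follow from simple arithmetic. As a rough illustration, the following Python sketch recomputes them; the function names are our own, and the calculation assumes a 365-day year, as in the text.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap, 365-day year


def yearly_downtime_seconds(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in seconds, for an availability level."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


def format_duration(seconds: float) -> str:
    """Format a duration as days, hours, minutes, and seconds (rounded)."""
    total = round(seconds)
    days, remainder = divmod(total, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.99, 99.999, 99.9999):
        downtime = yearly_downtime_seconds(percent)
        print(f"{percent} percent -> {format_duration(downtime)}")
```

The exact results (for example, a little over 315 seconds a year for five nines) round to approximately the values shown in the table.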

Before discussing how a VoIP network can be designed to be more available, we need to distinguish between reliability and availability. For example, a reliable network does not drop many packets, whereas an available network is up and functioning. Availability is a function of the mean time to repair (MTTR) and the mean time between failures (MTBF).
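
One common way to express this relationship is availability = MTBF / (MTBF + MTTR). The Python sketch below applies that formula; the function name and the hour values are hypothetical and are not vendor figures.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is up, given its MTBF and MTTR in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical component: it fails on average every 50,000 hours.
    slow_repair = availability(mtbf_hours=50_000, mttr_hours=4.0)
    fast_repair = availability(mtbf_hours=50_000, mttr_hours=0.5)
    print(f"4-hour repair:    {slow_repair:.6f}")
    print(f"30-minute repair: {fast_repair:.6f}")
```

With the MTBF held constant, shortening the repair time raises the availability, which is why the MTTR deserves attention in the network design.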

As the names suggest, the MTTR is the average time it takes to repair a failed network component, and the MTBF is the average time between the failures of a network component. A network's availability can be improved by reducing the MTTR and increasing the MTBF. Many manufacturers, such as Cisco, provide MTBF information for the network hardware you purchase (for example, an Ethernet switch), and you can determine the MTTR as part of your network design. For example, you might have spare parts on-site to quickly swap out failed equipment, or you might have redundant components within a chassis (for example, redundant supervisor engines in a Cisco Catalyst switch). These network components can also be interconnected in a redundant fashion (for example, having multiple connections between multiple devices). Let's consider some of these design approaches in a bit more detail.

One approach is to build fault tolerance into the network components themselves. So, even though confidence in VoIP reliability is still growing, we can actually design VoIP networks that are just as reliable as legacy PBX systems. Notice in Figure 3-1 that there are dual physical connections between all network components.

Figure 3-1. Redundant Devices with Single Points of Failure

As an example, we might have a Cisco Catalyst 6500 multilayer switch with redundant features built into the chassis itself, including:

Two supervisor engines (the "brains" of the switch)
Two power supplies
Two switch fabric modules (which increase the throughput of the switch)

Not only do these redundant modules let the switch keep running when one of them fails, but they are also hot-swappable, which helps minimize the MTTR. For example, if one supervisor engine were to fail, the other supervisor engine could take over, and the failed supervisor engine could be removed from the chassis and replaced without powering down the chassis.
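
To see why duplicated modules matter, consider a simplified model in which the modules fail independently and failover is instantaneous. Under those assumptions (which real hardware only approximates), the group is down only when every module is down. The function name and the 99.9 percent figure below are hypothetical.

```python
def redundant_availability(module_availability: float, module_count: int) -> float:
    """Availability of a group that stays up while at least one module works.

    Assumes independent failures and instantaneous failover.
    """
    all_fail_probability = (1.0 - module_availability) ** module_count
    return 1.0 - all_fail_probability


if __name__ == "__main__":
    single = redundant_availability(0.999, 1)  # one hypothetical module
    dual = redundant_availability(0.999, 2)    # two such modules in one chassis
    print(f"one module:  {single:.6f}")
    print(f"two modules: {dual:.6f}")
```

In this model, pairing two "three nines" modules yields roughly "six nines" for the pair, although shared dependencies (such as the chassis itself) limit the gain in practice.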

Instead of having redundancy built into the router or switch itself, another design approach is to have redundancy between devices, as shown in Figure 3-2. Notice that in this topology, any single network link or network infrastructure device (for example, a switch or router) can fail, with the exception of the wiring closet switch where the IP phone attaches, and a path will still exist from the host to the server.

Figure 3-2. No Single Points of Failure
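
A design like this can be checked mechanically: remove each device or link in turn and test whether the host can still reach the server. The Python sketch below does so for a hypothetical topology of our own (it is not a transcription of the figure), with one access switch, two distribution switches, and two core routers.

```python
from collections import deque

# Hypothetical topology; node names are illustrative only.
LINKS = [
    ("phone", "access"),
    ("access", "dist1"), ("access", "dist2"),
    ("dist1", "core1"), ("dist1", "core2"),
    ("dist2", "core1"), ("dist2", "core2"),
    ("core1", "server"), ("core2", "server"),
]


def reachable(links, src, dst, failed_node=None):
    """Breadth-first search from src to dst, skipping links on a failed node."""
    adjacency = {}
    for a, b in links:
        if failed_node in (a, b):
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(links, src, dst):
    """Return the devices and links whose individual failure isolates dst."""
    devices = {n for link in links for n in link} - {src, dst}
    bad_devices = sorted(
        d for d in devices if not reachable(links, src, dst, failed_node=d)
    )
    bad_links = [
        link for link in links
        if not reachable([l for l in links if l != link], src, dst)
    ]
    return bad_devices, bad_links


if __name__ == "__main__":
    print(single_points_of_failure(LINKS, "phone", "server"))
```

In this sample topology, the only weak spots are the access switch and the phone's own cable to it, which matches the exception noted for the wiring closet switch.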

Redundant design approaches such as these benefit not only voice networks but also data networks. In fact, many network redundancy features were available well before the introduction of VoIP. However, the critical nature of voice traffic is causing many network designers to beef up the redundancy in their existing data networks.

End systems not running a routing protocol point to a default gateway. The default gateway is traditionally the IP address of a router on the local subnet. However, if the default gateway router fails, the end systems are unable to leave their subnet. Two approaches to Layer 3 redundancy include Hot Standby Router Protocol (HSRP) and Virtual Router Redundancy Protocol (VRRP). With both of these technologies, the Media Access Control (MAC) address and the IP address of the default gateway can be serviced by more than one router. Therefore, if a default gateway router goes down, then another router can take over, still servicing the same MAC and IP addresses:

HSRP is a Cisco-proprietary approach to Layer 3 redundancy.
VRRP is a standards-based approach to Layer 3 redundancy.
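
The failover idea can be sketched in a few lines of Python. This is a loose model, not an implementation of either protocol: it ignores hello timers, preemption settings, and state machines. It assumes the HSRP election rule that the highest priority wins, with the highest interface IP address breaking ties; the router names and addresses are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

VIRTUAL_GATEWAY_IP = IPv4Address("10.1.1.1")  # hypothetical shared gateway address


@dataclass
class Router:
    name: str
    interface_ip: IPv4Address
    priority: int = 100  # HSRP's default priority
    alive: bool = True


def elect_active(routers):
    """Pick the router that answers for the virtual gateway address."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, r.interface_ip))


if __name__ == "__main__":
    r1 = Router("R1", IPv4Address("10.1.1.2"), priority=110)
    r2 = Router("R2", IPv4Address("10.1.1.3"))
    group = [r1, r2]
    print(f"{elect_active(group).name} services {VIRTUAL_GATEWAY_IP}")
    r1.alive = False  # simulate a failure of the active router
    print(f"{elect_active(group).name} services {VIRTUAL_GATEWAY_IP}")
```

The end systems keep pointing at the same gateway address throughout; only the router answering for it changes.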

Layer 3 redundancy is also achieved by having multiple links between devices and selecting a routing protocol that load balances over the links. EtherChannel is another way to load balance across multiple links. With EtherChannel, you can define up to eight physical links that are logically bundled together, such that the bundle appears as a single link to the route processor.
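
To illustrate how a bundle can spread traffic while keeping each conversation on one physical link, the following Python sketch hashes the low-order bits of the source and destination MAC addresses. It is a simplified stand-in for illustration, not the actual algorithm of any particular switch; the function name and addresses are hypothetical.

```python
MAX_LINKS_PER_BUNDLE = 8  # the text's limit of eight physical links


def pick_link(src_mac: str, dst_mac: str, link_count: int) -> int:
    """Return the index of the physical link used for a source/destination pair."""
    if not 1 <= link_count <= MAX_LINKS_PER_BUNDLE:
        raise ValueError("a bundle must contain between 1 and 8 links")
    src_low = int(src_mac.replace(":", ""), 16) & 0xFF  # last byte of source MAC
    dst_low = int(dst_mac.replace(":", ""), 16) & 0xFF  # last byte of destination MAC
    return (src_low ^ dst_low) % link_count


if __name__ == "__main__":
    link = pick_link("00:00:00:00:00:01", "00:00:00:00:00:03", link_count=8)
    print(f"frames for this pair use link {link}")
```

Because the result depends only on the address pair, frames between the same two stations always take the same link, which avoids reordering within a conversation.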

Although having multiple links between switches is great for redundancy, these links can cause loops in the Layer 2 (that is, switching) network. These Layer 2 loops can cause broadcast storms, where broadcast frames circle the network forever, consuming bandwidth and switch processor resources. The IEEE 802.1D standard is the legacy approach for Layer 2 loop avoidance. IEEE 802.1D is better known as the Spanning Tree Protocol (STP). By default, in the event of a link failure, STP can take up to 50 seconds to recover and start forwarding traffic over a backup link (that is, to converge). Cisco added proprietary enhancements to speed up the convergence time. These Cisco-proprietary STP enhancements include:

PortFast: Used on ports connecting to end stations
UplinkFast: Used on building access switches
BackboneFast: Used on all switches in the topology
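
The 50-second default mentioned earlier can be derived from the default 802.1D timers, which are not listed in the text above: a maximum age of 20 seconds and a forward delay of 15 seconds, applied once for the listening state and once for the learning state. The Python sketch below shows the arithmetic; the function and parameter names are our own.

```python
MAX_AGE_SECONDS = 20        # default 802.1D max age timer
FORWARD_DELAY_SECONDS = 15  # default 802.1D forward delay timer


def stp_convergence_seconds(indirect_failure: bool = True) -> int:
    """Worst-case delay before a blocked port forwards, using default timers.

    An indirect failure must first wait for the stored root information to
    age out; a failure on a directly connected link skips that wait.
    """
    aging_wait = MAX_AGE_SECONDS if indirect_failure else 0
    listening = FORWARD_DELAY_SECONDS
    learning = FORWARD_DELAY_SECONDS
    return aging_wait + listening + learning


if __name__ == "__main__":
    print(stp_convergence_seconds(indirect_failure=True))
    print(stp_convergence_seconds(indirect_failure=False))
```

Broadly speaking, the enhancements listed above shorten or bypass parts of this waiting period in the situations for which each one is intended.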

Each virtual LAN (VLAN) can run its own instance of STP. This Per-VLAN STP approach allows different VLANs (that is, subnets) to have different root bridges (that is, switches in the Layer 2 network that serve as the points to which other switches forward traffic). However, with the Cisco Per-VLAN STP, every VLAN must run its own instance of STP, which might place unnecessary overhead on the switches.

The best of both worlds is achieved with the newer IEEE 802.1w and 802.1s protocols. IEEE 802.1w (that is, Rapid Spanning Tree Protocol) dramatically reduces convergence times in the event of a failure. IEEE 802.1s (that is, Multiple Spanning Tree) allows you to create a set of STP instances. VLANs can then be assigned to the appropriate STP instances. This eliminates the Per-VLAN STP requirement that each VLAN run its own instance of STP.
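
The saving is easy to quantify. The Python sketch below compares the number of spanning-tree instances a switch would maintain under each approach; the 100 VLANs and the odd/even mapping to two instances are hypothetical choices made only for illustration.

```python
VLAN_IDS = range(1, 101)  # hypothetical network with 100 VLANs


def mst_instance_for(vlan_id: int) -> int:
    """Map a VLAN to a spanning-tree instance (odd VLANs to 1, even VLANs to 2)."""
    return 1 if vlan_id % 2 else 2


def instance_counts(vlan_ids):
    """Return (per-VLAN STP instance count, Multiple Spanning Tree instance count)."""
    per_vlan = len(list(vlan_ids))
    mst = len({mst_instance_for(vlan_id) for vlan_id in vlan_ids})
    return per_vlan, mst


if __name__ == "__main__":
    per_vlan, mst = instance_counts(VLAN_IDS)
    print(f"Per-VLAN STP instances: {per_vlan}")
    print(f"Multiple Spanning Tree instances: {mst}")
```

With two instances, each could use a different root bridge, so the design keeps the ability to spread VLANs across uplinks without running one STP instance per VLAN.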
