Increasing Network Availability

When we think of network availability, we must go back to the business objectives and requirements (the first step in the design process) to see what purpose the network has in the organization. For example, availability can mean that the online customer services must be available 24 hours a day, 7 days a week. Or, it can mean that the IP phone system must be as available as the public switched telephone network (PSTN) system.

Thus, when we think of increasing the network availability, or achieving high availability, we must also reference the business objectives. One definition of high availability is as follows:

The ability to define, achieve, and sustain 'target availability objectives' across services and/or technologies supported in the network that align with the objectives of the business.^[1]

Availability is usually measured either as the percentage of time that the network is up or as the amount of time that the network is down. For example, two common formulas for availability are as follows:^[2]

Availability = MTBF / (MTBF + MTTR), where
MTBF = Mean time between failures: the average amount of time that the network is up (between failures).
MTTR = Mean time to repair: the average amount of time it takes to get the network functioning again after a failure has occurred.
The type of network connections, for example, whether devices are connected in parallel or in series, can make this calculation more complex.
Availability = (Total User Time - Total User Outage Time) / Total User Time, where
Total User Time = Total amount of user time that the network should be accessible = Number of users * Total measurement time.
Total User Outage Time = Sum of the amount of time that each user was unable to access the system during the measurement time.
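
The two formulas above can be sketched in a few lines of Python. The function names and sample figures are hypothetical, chosen only for illustration. The series and parallel helpers are not part of the formulas given here; they use the common textbook approximation that components fail independently, which is one way the connection type can change the calculation.

```python
from typing import Sequence


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def availability_from_user_time(num_users: int,
                                measurement_hours: float,
                                outage_hours_per_user: Sequence[float]) -> float:
    """Availability = (Total User Time - Total User Outage Time) / Total User Time."""
    total_user_time = num_users * measurement_hours
    total_outage_time = sum(outage_hours_per_user)
    return (total_user_time - total_outage_time) / total_user_time


def series(*availabilities: float) -> float:
    """Devices in series: every device must be up (assumes independent failures)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Devices in parallel: down only if all are down at once (assumes independent failures)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


# Hypothetical figures: 4000 hours between failures, 4 hours to repair.
print(availability_from_mtbf(4000, 4))
# Hypothetical figures: 100 users over 720 hours; three users lost 2, 1.5, and 0.5 hours.
print(availability_from_user_time(100, 720, [2.0, 1.5, 0.5]))
# Two 99%-available devices, first in series, then in parallel.
print(series(0.99, 0.99), parallel(0.99, 0.99))
```

Under the independence assumption, placing two devices in series lowers the combined availability below that of either device, whereas placing them in parallel raises it.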

Table 10-1 illustrates some availability percentages and describes how they translate to the amount of downtime in a year. High availability usually means that the network is down for no more than about 5 minutes in a year, which equates to 99.999% availability (also known as five nines availability).

Table 10-1. Availability Can Be Translated into Network Downtime
Availability, %	Downtime per Year
99.000	3 days, 15 hours, 36 minutes
99.500	1 day, 19 hours, 48 minutes
99.700	26 hours, 17 minutes
99.900	8 hours, 46 minutes
99.950	4 hours, 23 minutes
99.990	53 minutes
99.999	5 minutes
99.9999	30 seconds
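
The downtime figures in Table 10-1 follow directly from the availability percentage. The following is a minimal sketch of that conversion, assuming a 365-day year; the function and constant names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into expected minutes of downtime per year."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} minutes per year")
```

For 99.000%, the result is 5256 minutes, which is the 3 days, 15 hours, 36 minutes shown in the first row of the table.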

When you consider increasing the availability of your network, the cost of doing so should be weighed against the cost of downtime. For example, ensuring that an online ordering system is highly available avoids the opportunity costs of lost sales and therefore might be worth the expense. In contrast, ensuring that every user is always able to dial in to the corporate network without getting a busy signal might not be worth the loss in productivity of a few users having to retry making the connection.
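
This trade-off can be expressed as simple arithmetic. The sketch below compares the yearly cost of downtime at two availability levels with the yearly cost of achieving the higher level. All figures and names are hypothetical, and the model assumes that downtime cost is a flat amount per hour, which real outages rarely are.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year


def annual_downtime_cost(availability_percent: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a given availability level."""
    hours_down = (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR
    return hours_down * cost_per_hour


def upgrade_is_worthwhile(current_percent: float, target_percent: float,
                          cost_per_hour: float, annual_upgrade_cost: float) -> bool:
    """True if the downtime cost avoided exceeds the yearly cost of the upgrade."""
    savings = (annual_downtime_cost(current_percent, cost_per_hour)
               - annual_downtime_cost(target_percent, cost_per_hour))
    return savings > annual_upgrade_cost


# Hypothetical online ordering system: each hour of downtime costs 10,000 in lost sales.
print(upgrade_is_worthwhile(99.9, 99.99, cost_per_hour=10_000, annual_upgrade_cost=50_000))
# Hypothetical dial-in service: each hour of downtime costs 100 in lost productivity.
print(upgrade_is_worthwhile(99.9, 99.99, cost_per_hour=100, annual_upgrade_cost=50_000))
```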

The reasons for network problems must also be considered. Many times, only the design and the technologies used are considered in availability analysis; however, a network can experience problems for other reasons. For example, one study^[3] found that the relative distribution of the common causes of network outages is as follows:

User and process errors (including change management and process issues): 40%
Software and applications (including software, performance, and load issues): 40%
Technology (including design, hardware, and links): 20%

Thus, design and equipment issues should be considered, but other factors must also be taken into account. Therefore, increasing the availability of your network can include implementing the following measures:

Using redundant links between devices, including between redundant devices
Using redundant components within devices, for example, installing redundant network interface cards (NICs) in mission-critical servers or redundant processors in network devices
Having a simple, logical network design that is easily understood by the network administrators and having processes and procedures for naming and labeling equipment, and for implementing changes to anything within the network
Having processes and procedures in place for monitoring the network for potential problems and for correcting those problems before they cause the network to fail
Ensuring the appropriate physical and environmental conditions for all equipment and the availability of appropriate spare parts

For redundancy, recall from Chapter 2, "Switching Design," that a Layer 2 switched network with redundant links can have problems because of the way that switches forward frames. Thus, the Spanning-Tree Protocol (STP) logically disables part of the redundant network for regular traffic while still maintaining the redundancy in case an error occurs. When multiple virtual LANs (VLANs) exist, algorithms such as per-VLAN spanning tree (PVST) can also be implemented. With PVST, switches run one instance of STP per VLAN. PVST can result in load balancing across the redundant links by allowing different links to be forwarding for each VLAN.

In Chapter 3, "IPv4 Routing Design," you saw that routed (Layer 3) networks inherently support redundant paths, so a protocol such as STP is not required. All the IP version 4 (IPv4) routing protocols can load-balance over multiple paths of equal cost; EIGRP and Interior Gateway Routing Protocol (IGRP) can also load-balance over unequal-cost paths.

Some of the other protocols that can be enabled on network devices for increasing availability include the following:

Hot Standby Router Protocol (HSRP): The Cisco HSRP allows a group of routers to appear as a single virtual router to the hosts on a LAN. The group is assigned a virtual IP address (and is either assigned or autoconfigures, based on the group number, a virtual Media Access Control [MAC] address); hosts on the LAN have the virtual address as their default gateway. One router is elected as the active router and processes packets addressed to the virtual address. If the active router fails, another router takes over this responsibility, and routing continues transparently to the hosts.
Note

HSRP supports load sharing, using the multiple HSRP (MHSRP) groups feature. However, hosts on the LAN must be configured to point to routers in the different groups as their default gateways.
Virtual Router Redundancy Protocol (VRRP): VRRP is a standard protocol, similar to the Cisco HSRP. A group of routers represents a single virtual router; the IP address of the virtual router is the same as the address configured on one of the real routers. That router, known as the master virtual router, is initially responsible for processing packets addressed to the IP address. If the master virtual router fails, one of the backup virtual routers (as determined by a priority) takes over, and routing continues transparently to the hosts.
Gateway Load Balancing Protocol (GLBP): GLBP is another protocol that allows redundancy of routers on a LAN, similar to HSRP. The difference is that GLBP allows load balancing over the redundant routers, using a single virtual IP address and multiple virtual MAC addresses, so that all hosts are configured with the same default gateway. All routers in the group participate in forwarding packets simultaneously, making better use of network resources.
Nonstop Forwarding (NSF) with Stateful Switchover (SSO): In Cisco devices that support two route processors, the SSO feature allows one to be active while the other is in standby mode. Configuration data and routing information are synchronized between the two, and if the active route processor fails, the other takes over. During the switchover, the NSF feature ensures that packets continue to be forwarded along the previous routes, with no packet loss.
Server Load Balancing (SLB): The Cisco SLB feature provides IP server load balancing. A virtual server address represents a group of real servers. When a client initiates a connection to the virtual server address, the SLB function chooses a real server for the connection, based on a load-balancing algorithm.
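
The failover behavior that HSRP and VRRP provide, as described in the list above, can be illustrated with a small Python model. This is a conceptual sketch only, not an implementation of either protocol: it omits hello messages, timers, preemption rules, and tie-breaking, and the class names, router names, priorities, and address are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    name: str
    priority: int      # higher priority wins (simplified; real protocols add tie-breakers)
    up: bool = True


class VirtualRouterGroup:
    """A group of routers that hosts see as one default gateway address."""

    def __init__(self, virtual_ip: str, routers: List[Router]) -> None:
        self.virtual_ip = virtual_ip
        self.routers = list(routers)

    def active_router(self) -> Optional[Router]:
        """Return the highest-priority router that is still up, or None if all are down."""
        candidates = [r for r in self.routers if r.up]
        if not candidates:
            return None
        return max(candidates, key=lambda r: r.priority)


r1 = Router("R1", priority=110)
r2 = Router("R2", priority=100)
group = VirtualRouterGroup("10.1.1.1", [r1, r2])

print(group.virtual_ip, "is served by", group.active_router().name)
r1.up = False  # simulate a failure of the active router
print(group.virtual_ip, "is served by", group.active_router().name)
```

The hosts' default gateway (the virtual IP address) never changes; only the router that answers for it does, which is why the failover is transparent to the hosts.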

Note

Further information on increasing network availability can be found at http://www.cisco.com/go/availability.
