6.1 Availability

In the same way that "one person's passion is another person's poison," there are different ways in which you can look at and define availability. Thus, to alleviate any possibility of confusion, let us first define the term as it relates to both components and systems. Once this is accomplished, we can examine its applicability to different local area network (LAN) communications configurations.

6.1.1 Component Availability

The availability of an individual component can be expressed in two ways that are directly related to one another. First, as a percentage, availability can be defined as the operational time of a device divided by the total time, with the result multiplied by 100. This is indicated by the following equation:

A% = (operational time / total time) * 100

where A% is availability expressed as a percent.

For example, consider a leased line modem or DSU that normally operates continuously 24 hours per day. Over a one-year period, let us assume that the modem failed once and required eight hours to repair. During the year, the modem was available for use 365 days * 24 hours/day - 8 hours, or 8752 hours. Thus, the modem was operational for 8752 hours out of a total of 8760 hours. Using the availability formula, we obtain:

A% = (8752 / 8760) * 100 = 99.91 percent
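
As a quick check of the arithmetic, the following minimal Python sketch (not part of the original text; the function name is illustrative) computes availability as a percentage from operational time and total time, using the modem example above:

    def availability_percent(operational_hours, total_hours):
        # Availability expressed as a percentage of total time
        return operational_hours / total_hours * 100

    total_hours = 365 * 24                   # 8760 hours in a non-leap year
    operational_hours = total_hours - 8      # one failure requiring 8 hours to repair
    print(round(availability_percent(operational_hours, total_hours), 2))  # 99.91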

6.1.2 MTBF and MTTR

Now let us define two commonly used terms and discuss their relationship to operational time and total time. The first term, Mean Time Before Failure (MTBF), is the average operational time of a device prior to its failure. Thus, MTBF is equivalent to the operational time of a device.

Once a device has failed, you must initiate actions to effect its repair. The interval from the time the device fails until the time the device is repaired is known as the time to repair, and the average of the repair times is known as the Mean Time To Repair (MTTR). Because the total time is MTBF + MTTR, we can rewrite the availability formula as follows:

A% = [MTBF / (MTBF + MTTR)] * 100

It is important to remember the M in MTBF and MTTR, as you must use the average or mean time before failure and average or mean time to repair. Otherwise, your calculations are subject to error. For example, if your device failure occurred halfway through the year, you might be tempted to assign 4380 hours to the MTBF. Then, you would compute availability as:

A% = [4380 / (4380 + 8)] * 100 = 99.82 percent

The problem with the above computation is the fact that only one failure occurred, which results in the MTBF not actually representing a mean. Although the computed MTBF is correct for one particular device, as sure as the sun rises in the East, the MTBF would be different for a second device, different for a third device, and so on. Thus, if you are attempting to obtain an availability level for a number of devices installed or to be installed, you will, in effect, compute an average level of availability through the use of an average MTBF.

The next logical question is: How do you obtain average MTBF information for a communications device? Fortunately, many vendors provide MTBF information for the products they manufacture, which you can use instead of waiting a significant period of time to gather your own data. Although many published MTBF statistics can be used as is, certain statistics may represent extrapolations that deserve a degree of elaboration. When vendors introduce a new product and quote an MTBF of 50,000, 100,000, or more hours, they obviously did not operate that device for that length of time. Instead, they either extrapolated MTBF statistics based on the improvements made to a previously manufactured product or based their statistics on the MTBF values of individual components.
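
To make the averaging point concrete, the brief sketch below (my own illustration, using hypothetical observed failure times) computes a mean MTBF from several identical devices rather than relying on the history of a single unit:

    # Hypothetical hours of operation before failure observed for five identical devices
    observed_hours_to_failure = [4380, 7900, 9100, 6200, 8800]

    mean_mtbf = sum(observed_hours_to_failure) / len(observed_hours_to_failure)
    print(round(mean_mtbf))   # the mean across the sample, not any single device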

If you note an asterisk next to an MTBF figure and the footnote indicates extrapolation, you should probably question the MTBF value. After all, if the MTBF of some device is indicated as 100,000 hours or almost 12 years, why is the warranty period typically one or two years? In such situations, you might wish to consider using the warranty period as the MTBF value instead of an extrapolated MTBF value. Concerning the MTTR, this number is also provided by the manufacturer but normally requires a degree of modification to be realistic.

Most manufacturers quote an MTTR figure based on the time required to repair a device once a repair person is on-site. Thus, you must consider the location where your equipment is to be installed and the travel time from a vendor's location to your location. If your organization has a maintenance contract that guarantees a service call within a predefined period after notification of an equipment failure, you can add that time period to the MTTR. For example, assume the specification sheet for a vendor's T1 channel service unit (CSU) lists an MTBF of 16,500 hours and an MTTR of two hours. If you anticipate installing the device in Macon, Georgia, and the nearest vendor service office is located in the northern suburbs of Atlanta, you would probably add four to six hours to the MTTR. This addition reflects the time required for a repair person in Atlanta to receive notification to service a failed device in Macon, complete his or her work in Atlanta, and travel to the site in Macon. While this may not be significant when an MTBF exceeds a year, suppose your equipment location were Boise, Idaho, and the CSU vendor used next-day delivery to ship a replacement device, in effect repairing by replacement. In this situation, you might have to add 24 hours or more to the time required to swap CSUs to obtain a more realistic MTTR value.
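
The effect of adding travel or shipping time to a vendor-quoted MTTR can be illustrated with the following sketch, which assumes the T1 CSU figures cited above (16,500-hour MTBF, two-hour on-site MTTR) plus the travel and next-day replacement delays discussed:

    def availability(mtbf_hours, mttr_hours):
        # Availability as a fraction: MTBF / (MTBF + MTTR)
        return mtbf_hours / (mtbf_hours + mttr_hours)

    mtbf = 16500                                   # vendor-quoted MTBF for the T1 CSU
    print(round(availability(mtbf, 2), 5))         # on-site repair time only
    print(round(availability(mtbf, 2 + 6), 5))     # plus up to six hours of travel time
    print(round(availability(mtbf, 2 + 24), 5))    # plus next-day replacement shipping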

6.1.3 System Availability

In communications, a system is considered to represent a collection of devices connected through the use of one or more transmission facilities that form a given topology. Thus, to determine the availability of a system, you must consider the availability of each device and transmission facility as well as the overall topology of the system. Concerning the latter, the manner in which components are connected, in series or in parallel, affects the overall availability of the system. To illustrate the effect of topology on system availability, several basic LAN structures in which devices are connected in series and in parallel will be examined.

6.1.4 Devices Connected in Series

The top portion of Figure 6.1 illustrates the connection of n components in series. In this and subsequent illustrations, a component will be considered to represent either a physical network device or a transmission facility connecting two devices. Thus, the boxes labeled A1, A2, and A3 could represent the availability of a data service unit (DSU; A1), the availability of a leased line (A2), and the availability of a second DSU (A3).

Figure 6.1: Network Components in Series

The availability of n components connected in series is computed by multiplying the availability of each individual component. Mathematically, this is expressed as follows for n components:

AS = A1 * A2 * A3 * ... * An
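
A minimal sketch of the series computation follows; the function name is illustrative and the values are arbitrary:

    from math import prod

    def series_availability(availabilities):
        # Components in series: multiply the individual availabilities
        return prod(availabilities)

    print(series_availability([0.99, 0.994, 0.99]))   # e.g., DSU, leased line, DSU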

To illustrate the computation of a system in which components are arranged in series, consider the Token Ring network connected to an Ethernet LAN via the use of two remote bridges or routers and a pair of DSUs. This networking system is illustrated in the lower portion of Figure 6.1.

Let us assume that each remote bridge has an MTBF of one year, or 8760 hours, and any failure would be corrected by the manufacturer shipping a replacement unit to each location where a bridge or router is installed. Thus, we might assume a worst-case MTTR of 48 hours to allow for the time between reporting a failure and the arrival and installation of a replacement unit. Similarly, let us assume an MTBF of 8760 hours and an MTTR of 48 hours for each DSU. For the transmission line, most communications carriers specify a 99.5 percent availability level for digital circuits, so using a slightly lower level of 99.4 percent for the communications carrier serving our location would appear reasonable. Based on the preceding, the availability of the communications system, AS, which enables a user on a Token Ring network to access the Ethernet network and vice versa via a pair of single-port bridges or routers, then becomes:

AS = BridgeA * DSUA * LineD * DSUA * BridgeA = (BridgeA)^2 * (DSUA)^2 * LineD

where:

  • BridgeA = availability level of each bridge

  • DSUA = availability level of each DSU

  • LineD = availability level of the digital circuit connecting the two locations

Because the availability of each component equals the MTBF divided by the sum of the MTBF and the MTTR, we obtain:

AS = (8760/8808)^2 * (8760/8808)^2 * 0.994 = 0.9725, or 97.25 percent

This means that 2.75 percent of the time (100 - 97.25), an attempt to use the communications system will encounter the failure of one or more components that renders the system inoperative.
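
The preceding result can be reproduced with the sketch below, assuming (as above) an 8760-hour MTBF and 48-hour MTTR for each bridge or router and each DSU, and a 99.4 percent line availability:

    def availability(mtbf, mttr):
        return mtbf / (mtbf + mttr)

    bridge = availability(8760, 48)               # each remote bridge or router
    dsu = availability(8760, 48)                  # each DSU
    line = 0.994                                  # digital circuit availability

    system = bridge * dsu * line * dsu * bridge   # five components in series
    print(round(system * 100, 2))                 # approximately 97.25 percent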

6.1.5 Devices Connected in Parallel

Figure 6.2a illustrates n devices connected in parallel. If only one device out of n is required to provide communications at any point in time, then the availability of the system as a percentage becomes:

A% = [1 - (1 - A1) * (1 - A2) * ... * (1 - An)] * 100

Figure 6.2: Connecting Devices in Parallel

For example, assume two devices, each having an availability level of 99 percent, are operated in parallel. Then, the availability level of the resulting parallel system becomes:

AS = A1 + A2 - (A1 * A2)

Substituting, we obtain:

AS = 0.99 + 0.99 - (0.99 * 0.99) = 0.9999, or 99.99 percent
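
The parallel case can also be expressed as one minus the product of the individual unavailabilities, as in this sketch (the function name is illustrative):

    def parallel_availability(availabilities):
        # Only one of the parallel components must be operational
        unavailability = 1.0
        for a in availabilities:
            unavailability *= (1 - a)
        return 1 - unavailability

    print(parallel_availability([0.99, 0.99]))   # 0.9999 for the two-device example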

Because communications within and between networks normally traverse multiple components, the use of parallel transmission paths normally involves multiple components on each path. Figures 6.2b and 6.2c illustrate two methods by which alternative paths could be provided to connect the Token Ring and Ethernet networks together.

Readers should note that the transparent bridging supported by Ethernet networks precludes the active use of the closed loops that physically result from the dual bridges or multi-port bridges illustrated in Figures 6.2b and 6.2c. However, if we assume that only one path operates at any point in time, transparent bridging does support the use of two paths in which one becomes active only when the other becomes inactive. If routers are used instead of bridges, we do not have to concern ourselves with closed loops, because routers operate at the network layer while the loop restriction applies only to layer 2, the data link layer.

In the topology illustrated in Figure 6.2b, duplicate remote bridges or routers, DSUs, and transmission paths were assumed to be installed. In Figure 6.2c, it was assumed that your organization could obtain a very reliable remote bridge or router and preferred to expend funds on parallel communications circuits and DSUs because the failure rate of long-distance communications facilities normally exceeds the failure rate of the equipment.

For simplicity, let us assume that the availability level of each component illustrated in Figure 6.2b is 0.9. For each parallel path, you can consider the traversal of the path to encounter five components: two bridges or routers, two DSUs, and the communications line. Thus, the upper path containing five devices in series would have an availability level of 0.9 * 0.9 * 0.9 * 0.9 * 0.9, or 0.59049. Similarly, the lower path would have an availability level of 0.59049. Thus, you have now reduced the network structure to two parallel paths, each having an availability level of 0.59049. If A1 is the availability level of path 1 and A2 is the availability level of the second path, system availability, AS, becomes:

AS = 1 - [(1 - A1) * (1 - A2)]

Thus:

AS = A1 + A2 - (A1 * A2)

Simplifying the above equation by multiplying the terms and substituting 0.59049 for A1 and A2, you obtain:

AS = 0.59049 + 0.59049 - (0.59049 * 0.59049) = 0.8323, or 83.23 percent
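
Combining the series and parallel formulas reproduces the 83.23 percent figure, as the following sketch shows for two paths of five components, each with an availability of 0.9:

    path = 0.9 ** 5                       # five components in series: 0.59049
    system = path + path - path * path    # two such paths in parallel
    print(round(system, 4))               # 0.8323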

6.1.6 Mixed Topologies

Now let us focus attention on the network configuration illustrated in Figure 6.2c, in which a common bridge or router at each LAN location provides access to duplicate transmission facilities. To compute the availability of this communications system, you can treat each bridge or router as a serial element, while the two DSUs and the communications line on each route represent parallel routes of three serial devices.

Figure 6.3 illustrates how you can consider the communications system previously illustrated in Figure 6.2c as a sequence of serial and parallel elements. By combining groups of serial and parallel elements, you can easily compute the overall level of availability for the communications system as indicated in the four parts of Figure 6.3. In Figure 6.3a, the three serial elements of each parallel circuit are combined, including the two DSUs and communications line, to obtain a serial availability level of (0.9 * 0.9 * 0.9), or 0.729. Next, in Figure 6.3b, the two parallel paths are combined to obtain a joint availability of [(0.729 + 0.729) - (0.729 * 0.729)], or 0.926. Finally, in Figure 6.3c, the joint availability of the parallel transmission paths is treated as a serial element with the two bridges or routers, obtaining a system availability of 0.75, which is shown in Figure 6.3d. Note that at a uniform 90 percent level of availability for each device, the use of single bridges or routers in place of dual bridges or routers lowers the system availability by approximately 8 percent. That is, the system availability obtained through the use of dual single-port bridges or routers illustrated in Figure 6.2b was computed to be 83.23 percent. In comparison, the system availability obtained from the use of single dual-port bridges or routers illustrated in Figure 6.2c was determined to be 75 percent.

Figure 6.3: Computing the Availability of a Mixed Serial and Parallel Transmission System
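
The step-by-step reduction shown in Figure 6.3 can be mirrored in a few lines, again assuming a uniform 0.9 component availability:

    serial_path = 0.9 * 0.9 * 0.9                                           # DSU, line, DSU: 0.729
    parallel_paths = serial_path + serial_path - serial_path * serial_path  # approximately 0.926
    system = 0.9 * parallel_paths * 0.9                                     # add the two bridges or routers
    print(round(system, 4))                                                 # approximately 0.7505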

6.1.7 Dual Hardware versus Dual Transmission Facilities

In the above calculations, assuming a uniform component availability of 0.90, the system availability of the dual network illustrated in Figure 6.2b was computed to be 83.23 percent, while the availability of the network illustrated in Figure 6.2c was computed to be 75 percent. Although the difference in availability of over 8 percent could probably justify the extra cost associated with dual bridges for organizations with critical applications, what happens to the difference in the availability of each network as the availability level of each component increases?

Table 6.1 compares the system availability for the use of single multi-port and parallel single-port bridge networks as component availability increases from 0.90 to 0.999. In examining the entries in Table 6.1, you will note that the difference between the availability level of each network decreases as component availability increases. In fact, the 8 percent difference in the availability of each system at a component availability level of 0.9 decreases to under 4 percent at a component availability level of 0.98 and to under 2 percent at a component availability level of 0.99. Most modern communications devices have a component availability level close to 0.999, which, by the way, usually exceeds the availability level of a transmission facility by 0.005. Thus, the use of dual bridges or dual routers instead of multi-port bridges or routers may be limited to increasing network availability by one tenth of 1 percent. This means you must balance the gain in availability against the cost of redundant bridges or routers.

Table 6.1: System Availability Comparison

Component Availability    Single Multi-Port Bridge or Router    Parallel Single-Port Bridges or Routers
0.90                      0.7505                                0.8323
0.91                      0.7778                                0.8586
0.92                      0.8049                                0.8838
0.93                      0.8318                                0.9073
0.94                      0.8582                                0.9292
0.95                      0.8841                                0.9488
0.96                      0.9094                                0.9659
0.97                      0.9337                                0.9800
0.98                      0.9571                                0.9908
0.99                      0.9792                                0.9976
0.999                     0.9980                                0.9999
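
The entries in Table 6.1 can be reproduced, to rounding, with the sketch below, which applies the series and parallel formulas to both topologies for each component availability level:

    def parallel(a1, a2):
        return a1 + a2 - a1 * a2

    levels = [0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 0.999]
    for a in levels:
        multiport = a * parallel(a ** 3, a ** 3) * a   # single dual-port bridge or router at each site
        dual = parallel(a ** 5, a ** 5)                # dual single-port bridges or routers
        print(f"{a:<8} {multiport:.4f}  {dual:.4f}")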

Concerning cost, although bridges vary considerably with respect to features and price, at the end of 2002 their average price was approximately $1000. In comparison, the incremental cost of a dual-port bridge versus a single-port bridge was typically less than $500. Because you require two bridges to link geographically separated networks, the cost difference between the use of dual single-port bridges and single dual-port bridges is approximately (1000 - 500) * 2, or $1000. Thus, in this example you would have to decide if an increase in network availability by approximately one tenth of 1 percent is worth $1000.

If you apply the preceding economic analysis to routers, the cost disparity between different network configurations becomes more pronounced. For example, a middle-range router might cost $5000, while the incremental cost of a dual-port router can be expected to add approximately $750 to the cost of the device. Then, the cost difference between dual single-port routers and single dual-port routers becomes (5000 - 750) * 2, or $8500. Thus, when routers are used, your decision criterion could be one of deciding if an increase in network availability of approximately 0.1 percent is worth $8500.

If your organization operates a reservation system in which a minute of downtime could result in the loss of thousands to tens of thousands of dollars of revenue, the additional cost would probably be most acceptable. If your organization uses your network for interoffice electronic mail transmissions, you might prefer to save the $1000 and use dual-port bridges instead of dual single-port bridges or save $8500 if you were using routers for interconnecting your offices. This is because a gain of 0.1 percent of availability based on an eight-hour workday with 22 workdays per month would only produce 2.1 additional hours of availability per year. At a cost of approximately $500 when bridges are used or over $4250 when routers are used for the additional availability, it might be more economical to simply delay non-urgent mail and use the telephone for urgent communications.
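
For the workday arithmetic just cited, a quick sketch under the stated assumptions (eight-hour days, 22 workdays per month, a 0.1 percent gain in availability):

    work_hours_per_year = 8 * 22 * 12           # 2112 business hours per year
    extra_hours = work_hours_per_year * 0.001   # 0.1 percent availability gain
    print(round(extra_hours, 1))                # approximately 2.1 hours per year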

6.1.8 Evaluating Disk Mirroring

Another area where availability computations can prove valuable with respect to LAN performance is in evaluating the advantage associated with a disk mirroring system. To illustrate the use of availability in evaluating the practicality of disk mirroring, let us compare different file server equipment structures.

Figure 6.4a (the top portion of Figure 6.4) illustrates the use of a conventional disk subsystem installed in a file server. In this example, data transmitted from a workstation flows across the network media into the file server and through its controller onto the disk. Thus, from the perspective of availability, the conventional disk subsystem can be schematically represented as three components in series: the file server, the controller, and the disk. The rationale for including the file server when computing the availability level of a disk system is the fact that the failure of the server terminates access to the disk. Although you might be tempted to compare availability levels of different disk systems without considering the availability level of the file server, by including the server you obtain a more realistic comparison.

Figure 6.4: Comparing Data Flow Using Conventional and Mirrored Disk Systems on a File Server

Figure 6.4b illustrates the dataflow when a file server has dual controllers and dual disks installed, a configuration commonly referred to as a fully mirrored disk subsystem.

A schematic representation, from the perspective of availability, of the mirrored disk drives and dual controllers is shown in Figure 6.5. Note that the file server can be considered to be placed in series with two parallel paths, each consisting of a controller and a disk drive. Thus, if the file server fails, access to either drive is blocked, which reflects reality.

Figure 6.5: A Schematic Representation of a Fully Mirrored Disk Subsystem from the Perspective of Its Availability

Because we need availability levels for the file server, controller, and disk to compare single and mirrored disk systems, let us make some assumptions. Let us assume the file server is expected to fail once every three years and will require replacement via Express Mail, UPS, or another service within a 48-hour notification period. Because three years represents 3 * 8760, or 26,280 hours, the availability of the file server becomes:

A = 26,280 / (26,280 + 48) = 0.9982

For the disk controller, let us assume it is very reliable and might fail once in a five-year period (43,800 hours) and that its replacement will require 24 hours. Thus, the availability of the disk controller becomes:

A = 43,800 / (43,800 + 24) = 0.9995

Although disk vendors commonly quote an MTBF of 100,000 hours or more, from practical experience my file server disks seem to need replacement on average once every 2.5 years, or 21,900 hours. When they fail, their replacement involves more than installing new hardware, as the last backup must be moved onto the disk. Again, from this author's experience, the ordering of a replacement disk, its arrival and installation, accompanied by a full restore, can result in an average MTTR of 36 hours. Thus, I would compute the availability of the disks I use as follows:

A = 21,900 / (21,900 + 36) = 0.9984
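
The three component availabilities can be derived from the stated assumptions as in the following sketch; the figures are rounded to four places, as in the text:

    def availability(mtbf, mttr):
        return mtbf / (mtbf + mttr)

    server = availability(3 * 8760, 48)       # fails once in three years, 48-hour replacement
    controller = availability(5 * 8760, 24)   # fails once in five years, 24-hour replacement
    disk = availability(2.5 * 8760, 36)       # fails once in 2.5 years, 36-hour replace and restore
    print(round(server, 4), round(controller, 4), round(disk, 4))   # 0.9982 0.9995 0.9984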

Now that we have an availability level for each component, we can compare the availability level of the single and fully mirrored disk systems.

6.1.8.1 Single Disk System

The single disk system can be represented by three components in series. Hence, its availability becomes:

A = 0.9982 * 0.9995 * 0.9984 = 0.9961, or 99.61 percent

6.1.8.2 Mirrored Disk System

For the fully mirrored disk system, the availability of a controller and disk in series becomes:

A = 0.9995 * 0.9984 = 0.9979

The fully mirrored system consists of two parallel paths, each containing a controller and disk in series. Thus, the joint availability of those parallel paths becomes:

A = 0.9979 + 0.9979 - (0.9979 * 0.9979) = 0.999996

When we consider the availability of the file server, the availability of the fully mirrored disk subsystem, including the computer in which its components reside, becomes:

A = 0.9982 * 0.999996 = 0.9982, or 99.82 percent

In comparing the availability level of a single disk subsystem versus fully redundant mirrored disks with dual controllers, note that availability has increased from 99.61 percent to 99.82 percent. By itself, this increase of 0.21 percent may not appear meaningful, so let us consider it from an operational perspective. Over a three-year period, not considering a leap year in that period, there are 8760 * 3, or 26,280 hours. By increasing availability by 0.21 percent, we can expect to gain 26,280 * 0.0021, or approximately 55.2 hours of operation prior to a failure occurring that terminates access to data. Whether or not the additional operational time is worth the additional cost associated with installing a fully mirrored disk subsystem obviously depends on its additional cost as well as the actual MTBF and MTTR values you would use in your computations. However, the preceding information provides you with a methodology that you can easily alter to analyze the equipment you may be considering.
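
Putting the pieces together, the following sketch compares the two disk subsystem configurations and the hours gained over three years, using the rounded component availabilities from above:

    server, controller, disk = 0.9982, 0.9995, 0.9984

    single = server * controller * disk                # three components in series
    path = controller * disk                           # one controller-and-disk path
    mirrored = server * (path + path - path * path)    # server in series with two parallel paths

    print(round(single, 4), round(mirrored, 4))        # 0.9961 versus 0.9982
    print(round((mirrored - single) * 8760 * 3, 1))    # roughly 55 additional hours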



