The Need to Understand Network Statistics

In this section, the need to understand statistics will be shown through the following:

Availability data
Random nature of the variables collected for baselines and trending
The need to aggregate data
The use of simulation modeling for networking new solution deployment and design

Availability Statistics

One reason that network managers need an understanding of basic statistics is so they can apply availability data effectively in monitoring the network and service-level agreements. To amplify on the availability discussion in Chapter 4, availability can be defined as the probability that a product or service will operate when needed. In data network environments, this can be defined as the average fraction of connection time that the product or service is expected to be in operating condition. For a network that can have partial as well as total system outages, availability is typically expressed as network availability.

Equation 7-1

graphics/07equ01.gif

Equation 7-2

graphics/07equ02.gif

NOTE

In discussions of availability, the term unavailability is often required within some of the different types of availability calculations. For instance, see Equation 7-3:

Equation 7-3

Two key statistical metrics provide more detail for the analysis of availability: mean time between failures (MTBF)and mean time to repair (MTTR).

The time between when the device broke and when it was brought back into service is called the mean time to repair (MTTR). In real networks, MTTR can be controlled by using good processes, using a well-developed sparing plan, or paying for high-end service contracts. For example, if you have a device that breaks and you have Cisco's onsite service (with a four hour guaranteed response), the probability is a higher that the device will be down for four hours rather than twenty-four hours. It should be noted that the MTTR includes response time (both Network Operations Center (NOC) and dispatch to the site if necessary), isolation of the fault, time to fix the fault, and verification that the fix did indeed correct the original problem. A trouble ticket can help you measure MTTR by including the times for failure detection, craft dispatch, (if any), fault diagnosis, fault isolation, the actual repair, and any software resynchronization time needed to restore the entire service.

Obviously, a device that breaks down four hours per year is more desirable than a device that is down for twenty-four hours per year. What may not be as obvious and is certainly not easy is the methodology and procedures an organization needs in place to accurately gather this metric.

Mean time between failures (MTBF) is calculated by measuring the average time between failures of a device. In the preceding example, the assumption is that the device fails once each year. This would indicate an MTBF of 8,760 hours (365 days x 24 hours) less the MTTR of 24 hours, which together equal the number of hours between failures for a year. In fact, if we have MTBF and MTTR, we can then go immediately to solution for availability by using the following mathematics:

Equation 7-4

Equation 7-5

Equation 7-6

Equation 7-7

NOTE

Equation 7-7 applies to an individual component and its hardware MTBF and MTTR.

In the networking community today, it is somewhat common to hear a requirement for a server to have a 99.999 (or five nines) percent availability. How much time does that translate into? Assume that the time frame is a seven-day by 24-hour operational year. One year = 32,000,000 seconds or 32 x 10⁶ seconds. Looking initially at a yearly availability of 99.9999 would mean that a failure can occur only 1/(1 0.999999) = 0.000001 times per year. Using scientific notation, this works out to a failure rate of 32 x 10⁶ x 10⁶ = 32 seconds for a single device or component. For five nines of availability, this number would be multiplied by 10 for 320 seconds. In a computer network, there are dependencies and redundancies that will impact the aggregate system availability. All of the previous discussion concerns device availability.

Chapter 4's discussion of availability included several different types. In order to show a broader picture of the issues with availability, these areas of path availability, application availability, and reachability will be expanded on and briefly discussed.

Path Availability

Path availability refers to availability from one point in the network to another. Availability from point A to point B in the network can be defined as shown in Equation 7-8, where X is equal to a failure due to software, hardware, power, human error and path convergence time in redundant systems.

Equation 7-8

This definition is very useful in predicting the availability of new network systems where the network has a well-defined hierarchy and connectivity from any end point to any other is similar. However, predicting the availability of new software systems and power and the likelihood of human error. Cisco currently has a tool that can predict path availability where X is equal to hardware failures.

Application Availability

Application availability is defined as the probability that an application service will operate when needed. In data network environments, this is defined as the average fraction of application connection time for which an application is expected to be in operating condition for all users of that application (see Equation 7-9). Application availability can then be expressed as the sum of network availability and server/application availability for the users of that service (see Equation 7-10).

Equation 7-9

Equation 7-10

In general, application availability takes into account network availability and server/application availability. The definition of application availability does not include individual user or client workstation issues. This definition is useful for enterprise organizations where application availability is the ultimate goal or requirement.

Server application availability is typically measured by accessing the individual application over time, but may not take into account other network problems that affect connectivity to that application. Comparing the general definition of availability and device availability given earlier, you can see that application availability is a special case of device availability.

Reachability

Another measurement of availability is reachability. It can be defined simply as the capability to successfully ping a device B in the network from another device A in the network. An ICMP echo packet is sent to the destination IP address; if the device is reachable, it returns an ICMP echo response to the originating device. If device A receives the ICMP echo response, device B is said to be reachable from that location. Reachability can be used to determine network availability from certain points in the network to other points in the network using the Layer 3 IP protocol. This does not guarantee network availability from all points in the network to all other points and does not guarantee the delivery of upper-layer protocols or other Layer 3 protocols. Reachability can be a useful method, however, of estimating network availability in IP environments.

Other Availability Measurement Methods

Other availability measurement methodologies include the following:

Impacted user minutes (IUMs) can be defined for an organization by the total amount of unproductive user time due to network or server downtime and performance issues.
Link & device status is the cumulative "up" time and "down" time of both the devices and links being measured.
ICMP reachability is the "ping" test on whether a device responds to a ping or not.
Application reachability is a modified "ping" test. It usually uses a udp or tcp form of a ping that goes to the application port or socket. If the application then responds, it is "reachable."
Response Time Reporter (RTR) is a feature set on Cisco routers that allows them to send ICMP pings and other types of packets that devices may respond to. RTR has the additional advantage of being RMON-like in setting thresholds, using events and alarms.
Defects per million (DPM) is the number of calls lost per million calls processed. DPM is particularly useful for measuring the availability of switched virtual circuit (SVC) services in a multiservice switch, where connections are constantly made, sustained, and torn down.
Combined methods could be any or all of these. It usually includes some type of correlation and often heuristics.

Measuring availability is a complicated problem. There are a variety of techniques that can only approximate availability for a given environment. Several major problems exist with measuring availability accurately for a given environment. See the sections "Availability" and "Measuring Availability" in Chapter 4 for more discussion of major inhibitors to accurate availability measurement and the resulting measurement dilemma.

The practical result of the measurement dilemma is that availability numbers themselves have become less meaningful. Both vendors and service providers quickly announce five nines solutions or even 100 percent availability, yet the reality is that the majority of these solutions cannot achieve this level of availability over time (as you saw in the previous five nines example).

Although many end users have become confused about and wary of high availability, they continue to ask for high availability solutions. Customers, vendors, and service providers are all trying to understand what it takes to achieve high availability from a technology, people, expertise, and process perspective.

What the industry needs is a fairly accurate and useful definition of availability that can be applied to most networks and situations. Organizations need to better understand their network availability and what it takes to achieve a higher availability environment. As you can see, an understanding of your network's availability requires some level of knowledge of mathematics and statistics.

Availability Example

Following is a short example of computing the MTBF and availability numbers for the Cisco IOS 11.x software version.

Figure 7-1 shows 472 inter-failure times observed for Cisco's 11.x IOS series software in a "scatter graph." This group includes all minor versions running in a network from the 11.0, 11.1, 11.2, and 11.3 releases. Figure 7-2 shows a histogram of these data plotted under a normal curve.

Figure 7-1. Scatter Graph of 472 Inter-failure Times Observed for Cisco's 11.x IOS Series Software

graphics/07fig01.gif

Figure 7-2. Histogram of the Data Plotted Under a Normal Curve

graphics/07fig02.gif

The two figures are the data collection and distributions for the IOS failure rates. The collection of 472 data points can provide a more accurate MTBF.

It's common to assume an exponential distribution for hardware failure rates. However, both theory and experience indicate that a lognormal distribution is more appropriate for software failure rates. Looking closely at the scatter graph in Figure 7-1, the experience with this type of data indicates that the data from this report will fit the lognormal distribution. As you can see in Figure 7-2, the use of the lognormal distribution does in fact closely match the data histogram. This provides you with the proper statistical measurements that can be used. In this case, the mean, standard deviation, and variance are calculated by using scaled parameters in Equation 7-11.

The use of mean, standard deviation, and variance will be explained in the section "Basic Statistical Measures and Applications" later in this chapter. They are used here as an example of determining MTBF and system availability.

Table 7-1 gives summary statistics for the 11.x series observations.

Table 7-1. Descriptive Statistics for 11.x Failures
Statistics	Ln(TBF(Seconds))
Mean	13.395
Standard Deviation	3.143
Count	472
Minimum	3.689
Maximum	20.646
Variance	9.878

Ln(TBF(Seconds) is read as lognormal(Time between failure(in seconds)). Using the equations given by Kececioglu (see Equation 7-11), you can calculate the MTBF using the Mean of the log inter-failure times in conjunction with the Variance.

Equation 7-11

graphics/07equ11.gif

The availability of the Cisco IOS 11.x software is given by equation 7-12. We are using a MTTR for IOS software as approximately 6 minutes (.1 hrs) in this calculation. This approximation is based on observation of the average router reload times for the 472 data points recorded.

Equation 7-12

graphics/07equ12.gif

Using the data for the Cisco IOS software availability, we are now ready to place the data into system availability equations. Equation 7-13 provides the overall system availability.

Equation 7-13

graphics/07equ13.gif

From Equation 7-12, we will use 25484.1 from the previous IOS MTBF in the Availability of IOS equation following. We still need MTBF and MTTR for the other components of a router. The MTTR is 24 hours for everything except the IOS reload. The following MTBF numbers are provided for the example:

Chassis MTBF: 250,000

Circuit board MTBF: 250,000

Power supply MTBF: 500,000

Availability of Cisco chassis (a₁) = (250,000/250,000+24) = .999904

Availability of circuit board (a₂) = (250,000/250,000 +24) = .999904

Availability of power supply (a₃) = (500,000/500,000+24) = .999952

Availability of IOS software (a₄) = (25484.1/(25484.1+0.1))= .999996076

System Availability = (.999904)x(.999904)x(.999952)x(.999996076)=.999756=99.9756%

Equation 7-13 describes system availability as simply the product of all the component availabilities. Depending on the network topology and the redundancy built within your network, more complex equations would be needed. Kececioglu provides several examples of calculations with both redundant and non-redundant networks, as does Stallings (see end-of-chapter references).

NOTE

As you can see in the preceding equations where 24 is used for the MTTR, one way to increase availability is through a reduction of the MTTR. The development of a formal sparing plan for the network's major geographic locations and the training of personnel doing the hardware replacement can provide tremendous dividends in the reduction of the MTTR.

Random Nature of Network Performance Variables

Another reason that a network analyst needs to have some knowledge about probability and statistics is the concept of randomness. Randomness refers to the unpredictability of certain data or events.

Randomness is a major characteristic for all computer networks. At a basic level, all the data carried by networks has random properties. For example, both response time and utilization can be defined as random processes. Therefore, virtually every performance measurement made on a network is a sample of a random process.

Performance factors are not the only random processes in the network. Other random variables that the network manager must be aware of are errors and overhead. Overhead includes such things as SNMP polling, routing protocol updates, and those packets that are not application-related. The pervasiveness of random properties makes it essential for a network manager or a performance analyst to understand the basics about probability and statistics.

Although some recent literature refers to the non-random nature of ethernets and self-correlation due to packet trains and burstiness, randomness works as a first approximation and is generally found to be applicable in practice.

Simulations

Network managers need a tool to collapse the data that they have collected. Sampling and statistical techniques are needed to analyze and condense the data collected from the computer network for utilization reports, performance reports, availability reports, and reliability/stability reports.

For network planning and design of very large networks, it can be important to define the data traffic streams as mathematical distributions and network processes as additional distributions, and then build a statistical model. Such modeling can reveal bandwidth and latency issues, which are the essence of capacity planning. Statistical modeling now can be accomplished through relatively inexpensive computer systems.

NOTE

Although some may disagree, an informal survey of several simulation vendors indicates that there is an increased use of computer network simulations in the area of capacity planning and network design. As will be shown in the use of network statistics section, capacity planning and network design require statistical knowledge.

The drawback of statistical modeling and simulation is that the mathematical sophistication needed by the user to develop a verifiable statistical model or simulation can initially be quite high. For that reason, modeling is not covered in this book. The rest of this chapter concentrates not on mathematical modeling but on using the data that you gather in ways that allow you to glean the most practical information from it. The understanding required for performance reports and capacity planning is in fact the same needed for analysis of the output from simulations and mathematical modeling. So, even if simulation will not be explicitly covered in this chapter, it is of growing importance and is covered implicitly.

Baselining requires some understanding of statistics. Also, many of today's larger enterprise networks, as well as most Internet Service Providers (ISPs), collect basic statistics on their network's performance and traffic flows. Usually, this measurement includes metrics on throughput, delay, and availability. Again, an understanding of statistics is necessary in order to do the basic analysis on this data.

Often, the statistics and mathematics needed for understanding your network are presented in a formal manner. These usually include proofs of each of the formulas and concepts. In this chapter, we present the statistics and mathematics in as practical and empirical a manner as possible. The emphasis is on practical application of the techniques and how to determine the correct variables to collect for your network. Although proofs and mathematically correct presentation are not part of this chapter, a bibliography has been provided with several references that do present the material in this way.

Availability Statistics

Path Availability

Application Availability

Reachability

Other Availability Measurement Methods

Availability Example

Figure 7-1. Scatter Graph of 472 Inter-failure Times Observed for Cisco's 11.x IOS Series Software

Figure 7-2. Histogram of the Data Plotted Under a Normal Curve

Table 7-1. Descriptive Statistics for 11.x Failures

Random Nature of Network Performance Variables

Simulations