Basic Statistical Measures and Applications

This section defines some basic statistical values and gives simple examples of how they apply to network management. The following indexes are considered:

Average
Mode
Median
Range
Variance
Standard deviation

Before defining and looking at examples of these values, a bit of background is in order.

At its most basic, data analysis provides us with ways of organizing information. Descriptive measures such as averages and standard deviations combined with graphing techniques such as histograms, pie charts, and line diagrams can provide us with an overall picture of the data characteristics.

Because data from a network represents random samples, you can form confidence intervals to make inferences about the network itself. From these confidence intervals, you can define the confidence levels that you want to see in your data. Confidence levels are estimates of your uncertainty about your parameters and thus indicate numerically your confidence in the results. Often, one standard deviation on either side of the mean is chosen as the 68 percent confidence interval. That is, you are 68 percent sure that the conclusions that you draw from the data are what you will see in reality. The wider the interval, the more confident you are that it contains the parameter. The 95 percent confidence interval is therefore wider than the 68 percent confidence interval.

Another term that you may see for confidence interval is margin of error. You can think of the margin of error at the 95 percent confidence interval as being equal to two standard deviations in your polling sample.

This is probably a good place to touch on sampling size and its statistical significance. Perhaps the easiest way to figure this out is to think about it backwards. Let's say you decided to collect a network value during seven half-hour intervals over the course of a week. What is the chance that the times you picked do not accurately represent your network as a whole? For example, what is the chance that the percentage of low values received during your seven sampling periods does not match the percentage of low values over an entire week?

Common sense suggests that the chance that your sample is off the mark decreases as you add more time marks to your sample. In other words, the more times that you sample the data, the more likely you are to get a representative sample. This is easy so far, right?

The formula that describes this relationship is basically as follows:

The margin of error in a sample = 1 divided by the square root of the number of data points (N) in the sample. Or as expressed in Equation 7-14:

Equation 7-14

graphics/07equ14.gif

The formula is also derived from the standard deviation of the proportion of times that a researcher gets a sample "right," given a large number of samples. Both the standard deviation and variance formulas are discussed in detail after we have gone through a few other statistics.

Polling Interval Versus Sampling Size

There is often some confusion about the difference between polling interval and the sampling size. Polling is a mechanism to collect data over time. Sampling is a statistical methodology that uses the data gathered through the polling process. For instance, you might want to poll an interface for the ifInOctets variable every 15 minutes. This would mean that over a period of 24 hours you would have a sample size of 96 (4 x 24) data points, or 672 data points in a week.

This would give a margin of error of 1/√672, or about 0.039, after one week. As you continue to gather data, the margin of error will continue to decrease. For instance, in a four-week period, you would have 2,688 data points, which would give you a margin of error of 0.019. As your margin of error decreases, you can have more confidence in your sample and in the decisions that you make based on this data.
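The relationship between polling duration and margin of error can be sketched in a few lines of Python. This is only an illustration of the 1/√N rule given above. The function name `margin_of_error` and the constant `POLLS_PER_DAY` are names chosen here, not part of any management tool.

```python
import math


def margin_of_error(sample_size: int) -> float:
    """Approximate margin of error for a sample of the given size: 1 / sqrt(N)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return 1.0 / math.sqrt(sample_size)


POLLS_PER_DAY = 4 * 24  # one poll every 15 minutes

# Show how the margin of error shrinks as polling continues.
for label, days in (("1 day", 1), ("1 week", 7), ("4 weeks", 28)):
    n = POLLS_PER_DAY * days
    print(f"{label:>7}: N = {n:4d}, margin of error = {margin_of_error(n):.3f}")
```

Because the margin of error falls with the square root of N, halving it requires roughly four times as many data points.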

For instance, by polling the different variables that you have identified through the process described in Chapter 3, "Developing the Network Knowledge Base," and logging that data to a server, you have created a database of historical values. This database can be used in a number of different ways:

To understand network performance during a certain time period in the past
To analyze network problems
To plan for capacity requirements for your network
To verify network designs and changes that are made on the network

However, in each of these cases, the raw data can be overwhelming. So, you use statistical techniques to aggregate the data to a more manageable form. Besides reduction of the data, statistical analysis often can derive additional meaning from the data. One way that you accomplish this analysis is through the use of standardized reports on the network health. For instance, in the area of performance analysis and capacity planning, the use of the averages can be a major factor in planning for future capacity.

For example, in Tables 7-2 and 7-3, average utilization has been calculated for the serial line and CPU of several routers. The data indicates that the serial line utilization on Rtr001 and Rtr002 is very high, but that the average CPU utilization for both routers remains at a manageable level. Now, by looking more closely at the actual traffic on the two serial lines, you can decide whether an upgrade is needed.

Table 7-2. Serial Line Utilization
Resource	Address	Speed	Average Util (%)	Peak Util (%)
Rtr001	10.101.2.2	1.544 Mbps	87.3	97.9
Rtr002	10.101.2.1	1.544 Mbps	88.5	98.2
Rtr003	10.102.2.1	64 Kbps	69.2	89.5
Rtr004	10.102.2.2	64 Kbps	69	89.7

Table 7-3. CPU Utilization
Resource	Address	Average Utilization (%)	Peak Utilization (%)
Rtr001	10.101.2.2	52.4	95
Rtr002	10.101.2.1	48	81
Rtr003	10.102.2.1	35.2	98
Rtr004	10.102.2.2	34.7	93

Aggregation of data refers to the replacement of data values on a number of time intervals by some function of the values over the union of the intervals (RFC 1857). Although this may sound complex, it is important to understand that the raw data will have to be reduced through aggregation in almost every network. Also, the shorter-term aggregates may be re-aggregated.

Data aggregation not only reduces the amount of data; it also reduces the available information. Depending on the use of the data (for instance, troubleshooting), this reduction can hinder problem resolution. However, for trending of the data over a three-month or longer period of time, this reduction is very useful. The data reduction model from RFCs 1404 and 1857 will be discussed later in the chapter.

Particularly in the area of problem determination and troubleshooting, it is important that the historical data stored be highly granular.

Performance polling gathers data over time that can be analyzed to determine trends and to aid in capacity planning. First, determine what MIB variables to poll for. Chapter 3 describes how you can make a list of variables for data collection, and Part II of this book recommends specific variables for specific technologies.

For performance polling, individual data points are stored intermittently on the polling machine. Depending on the polling mechanism you are using, the data could be stored in either a raw format (the default for HP OpenView) or a relational database.

As stated earlier, to keep the data manageable, aggregate the raw data periodically, and store it in another database or data file for future reporting. The norm is to keep the raw data some specified period of time for backups, but eventually purge this data and keep only the aggregate data. Use the aggregate data to produce reports to determine trends and patterns. Depending on what you are planning, you can use the raw collected data to determine minimum, maximum, and average values for each variable; and then delete the raw data. Or, you can use a plan that summarizes the data and enables you to do more accurate statistical analysis on a larger historical sample.

The suggested aggregation periods from RFC 1857 are as follows:

Over a 24-hour period, aggregate to 15 minutes
Over a 1-month period, aggregate to 1 hour
Over a 1-year period, aggregate to 1 day

Setting up this type of data reduction is not an easy task. However, all of the current commercial network performance-reporting applications provide an aggregation mechanism for you. Understanding what is going on "behind the veil" of such applications enables you to use the data gathered in the most efficient and effective ways.
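As a rough picture of what such an application does internally, the following Python sketch collapses consecutive samples into minimum, maximum, and average values. It is not taken from RFC 1857 or from any product. The function `aggregate` and the utilization figures are hypothetical and serve only to illustrate the idea.

```python
from statistics import mean


def aggregate(samples, group_size):
    """Collapse consecutive samples into (min, max, avg) tuples.

    Each tuple summarizes group_size samples; a short final group is
    summarized from whatever samples remain.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    aggregates = []
    for start in range(0, len(samples), group_size):
        chunk = samples[start:start + group_size]
        aggregates.append((min(chunk), max(chunk), mean(chunk)))
    return aggregates


# Hypothetical utilization percentages polled every 15 minutes (two hours).
quarter_hourly = [62.0, 71.5, 88.0, 90.5, 45.0, 47.5, 52.0, 49.5]

# Four 15-minute samples per hour gives one (min, max, avg) tuple per hour.
hourly = aggregate(quarter_hourly, 4)
for low, high, avg in hourly:
    print(f"min = {low:.1f}  max = {high:.1f}  avg = {avg:.1f}")
```

Keeping the minimum and maximum alongside the average preserves the peaks that an average alone would hide, which matters when the aggregates are later re-aggregated to longer periods.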

Average, Mode, and Median

Measurements in network performance are rarely exact or repeatable because of the random nature of network traffic. It often cannot be predicted with certainty and usually contains a time-dependent nature as well. This randomness forces network managers to consider statistical values, specifically average response times.

An average, also known as an arithmetic mean, is calculated by adding up all the sample data (x_i) and then dividing the result by the number of samples (N). Equations 7-15 and 7-16 show alternative representations of the average.

Equation 7-15

graphics/07equ15.gif

Equation 7-16

graphics/07equ16.gif

The average is one measure of how data points tend to cluster around the center of a distribution. This clustering is sometimes referred to as the central tendency of the data. Besides the average, there are two other indexes of the central tendency of data: mode and median. Very simply, the mode is the most frequently reported data point in a distribution. The median is the midpoint within the distribution. Of the three statistics, the average is the most sensitive to changes in the data, because every score in the distribution contributes to it.

If the average is more sensitive to all scores in a distribution, why bother with the mode or median? One of the most practical arguments is that they are easy to obtain. After the data points are grouped, the mode is obtained by simple inspection. If the data points are ranked, the median is also easy to obtain.

Another reason for obtaining the median is to look for a significant value discrepancy between it and the average. Normally, the two values should be fairly close. But if the distribution contains extremely high or low scores, the average may skew significantly from the median. For a more accurate average, statisticians often omit the highest and lowest values in the distribution as anomalies, and you may need to consider doing the same. For instance, consider gathering data by pinging a device across the wide area network (see Table 7-4). The normal round-trip time may be 120ms, but due to a backup that takes place early in the morning, the time increases to 2400ms for several data-collection points. These anomalous high values can cause the average to be higher than it would be normally. Both the mode and median help to point out the discrepancy.

Table 7-4. Example of the Use of Mode and Median with Average
Ping (ms) collected on an hourly basis
As collected	120	119	121	110	120	100	128	2400	2390	2405	120	121
Ranked	100	110	119	120	120	120	121	121	128	2390	2400	2405
Sum of all (12) data points: 8254	Average: 687.833
Mode: 120	Median: 120.5
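The three measures of central tendency for the ping data can be checked with Python's standard `statistics` module, as in the sketch below. For an even number of points, `statistics.median` returns the midpoint of the two central ranked values. The helper `trimmed_mean` is a hypothetical function written for this example. It illustrates the practice of omitting the highest and lowest values, and the choice of how many values to drop is an assumption.

```python
from statistics import mean, median, mode

# The 12 ping round-trip times (ms) from Table 7-4, as collected.
ping_ms = [120, 119, 121, 110, 120, 100, 128, 2400, 2390, 2405, 120, 121]

average = mean(ping_ms)      # pulled upward by the three backup-window readings
middle = median(ping_ms)     # midpoint of the ranked data
most_common = mode(ping_ms)  # most frequently reported value


def trimmed_mean(values, k):
    """Average after dropping the k lowest and k highest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this data set")
    return mean(ordered[k:len(ordered) - k])


print(f"average = {average:.3f} ms")
print(f"median  = {middle} ms")
print(f"mode    = {most_common} ms")
print(f"trimmed mean (k=3) = {trimmed_mean(ping_ms, 3):.3f} ms")
```

A large gap between the average and the median is the signal to look for outliers before trusting the average.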

Range, Variance, and Standard Deviation

Just as there are indexes to describe how data cluster around the center of a distribution, there are those that describe the dispersion or scatter across the measurement scale. The most common of the dispersion indexes are the range, variance, and standard deviation.

The range is the highest "score" in a distribution, minus the lowest score. The range is a very crude measurement. Just as the mode and median are insensitive to all but a few data points, so is the range. In the example in Table 7-4, the range would be 2405 minus 100 or 2305. Although this information is not very useful by itself, if you were expecting a range of about 100ms, it would indicate a network issue that needs to be explored.

Variance is essentially the average of the squared deviations of the data points from the average of the distribution. Variance, which is also known as "the second moment about the mean," is calculated by subtracting each x value from the average, squaring the difference, summing the squared differences, and dividing the sum by the number of samples minus 1, or (N − 1). See Equations 7-17 and 7-18.

Equation 7-17

graphics/07equ17.gif

Equation 7-18

graphics/07equ18.gif

NOTE

The variance could almost be the average squared deviation around the mean if the expression were divided by N rather than N − 1. It is divided by N − 1, called the degrees of freedom (df), for theoretical reasons. If the mean is known, as it must be to compute the numerator of the expression, then only N − 1 scores are free to vary. That is, if the mean and N − 1 scores are known, it is possible to figure out the Nth score.

Variance, like the mean (arithmetic average), is sensitive to all data points in a distribution. By comparing the average, standard deviation, and variance, you can develop an expected value range. If the variance or standard deviation is much greater than expected, you should examine the data more closely or, if it proves anomalous, discard it altogether. Thus, the standard deviation and variance complement the statistics considered earlier: they measure the spread of the data around the average, which serves as the balance point of the distribution.

The first column of Table 7-5 contains the ping data from Table 7-4. The second column is the square of the average minus the data point, or (avg − n_i)². Summing the values in the second column gives the "sum of the squares," in this case 11703875.67. Dividing the sum of the squares by N − 1 produces the variance. Table 7-6 shows the variance for the original data, 1063988.697, and for a second data set in which the three high data points (>2300ms) have been replaced with the mode value (120ms). The variance for the second set of data is 48.20454545. This example shows why other statistical values are needed beyond the simple arithmetic average.

Table 7-5. Sample Data for Calculating Variance
Data (ms)	(Avg − n_i)²
120	322434.6944
119	323571.3611
121	321300.0277
110	333891.3611
120	322434.6944
100	345548.0277
128	313413.3611
2400	2931514.695
2390	2897371.361
2405	2948661.361
120	322434.6944
121	321300.0277

Table 7-6. Comparison of Example Data, With and Without Three High Data Points
Value	Original Data Statistics	Statistics without Three High Data Points
Sum	8254	1419
Avg	687.8333333	118.25
Max	2405	128
Min	100	100
Variance	1063988.697	48.20454545
Variance (population)	975322.9722	44.1875
Standard Deviation	1031.498278	6.942949334
Standard Deviation (population)	987.5844127	6.6473679

Note that Table 7-6 also contains the standard deviations for the original and revised example data. The standard deviation is the square root of the variance (see Equation 7-19).

Equation 7-19

graphics/07equ19.gif
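The dispersion figures in Table 7-6 can be reproduced with Python's standard `statistics` module. The sketch below is only an illustration: the function `summarize` is a name chosen here, and the second data set is built on the assumption that each reading above 2300ms is replaced with 120ms. In this module, `variance` and `stdev` divide by N − 1, whereas `pvariance` and `pstdev` divide by N.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Ping data (ms) from Table 7-4.
original = [120, 119, 121, 110, 120, 100, 128, 2400, 2390, 2405, 120, 121]

# Assumption: each reading above 2300 ms is replaced with 120 ms.
adjusted = [120 if x > 2300 else x for x in original]


def summarize(data):
    """Return the statistics reported in Table 7-6, plus the range."""
    return {
        "sum": sum(data),
        "avg": mean(data),
        "max": max(data),
        "min": min(data),
        "range": max(data) - min(data),
        "variance": variance(data),              # divides by N - 1
        "variance_population": pvariance(data),  # divides by N
        "std_dev": stdev(data),                  # sqrt of sample variance
        "std_dev_population": pstdev(data),      # sqrt of population variance
    }


for name, data in (("original", original), ("adjusted", adjusted)):
    print(name)
    for key, value in summarize(data).items():
        print(f"  {key:<20} {value:.4f}")
```

Comparing the two summaries side by side shows how three outlying readings dominate every dispersion measure.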

A practical advantage of using the standard deviation rather than the variance as the index of dispersion is that its values are easier to use because of its natural relationship to the mean. The standard deviation gives us the basis for estimating the probability of how frequently certain data points can be expected to occur based on the sampling rate (or the margin of error discussed earlier).

The variance and standard deviation both calculate how much variation the samples have around the mean value. Therefore, small variations indicate a strong central tendency of the samples. This also indicates that the sample set has little statistical randomness. Larger variations indicate very little central tendency, and show large statistical randomness. Central tendency is a typical or representative score. The three measures of central tendency that have been discussed in this chapter are the mode, median, and mean.

The variance and standard deviation are important statistics for understanding traffic distributions for a link or network. They are the initial metrics used in defining the type of statistical distribution that may be of use to the analyst.

Deviations from the mean and the concepts of variance and standard deviation are crucial for understanding statistical models. It is important to be able to conceptualize their interrelationships. One method of doing this is through the use of graphing techniques. Figure 7-3 shows graphically the data from Table 7-4.

Figure 7-3. The Cumulative Distribution Function for the Data in Table 7-4

graphics/07fig03.gif

Although this enables you to get a certain feel for the data using the mean and standard deviation, the use of a normal distribution as in Figure 7-4 is usually more helpful.

Figure 7-4. The Normal Distribution for the Data in Table 7-4

graphics/07fig04.gif

The curve in Figure 7-4 was generated from the mean and standard deviation of the data in Table 7-4.

The normal distribution is also known as the bell curve because of its bell-like shape, which is symmetric about the mean. Figure 7-4 has a mean (μ) = 687.8333 and a standard deviation (σ) = 1008.825. It shows the use of the standard deviation more clearly than Figure 7-3 does. In Figure 7-5, the long ping times (>2000 ms) were removed. Figure 7-5 therefore has a mean (μ) = 118.7083 and a standard deviation (σ) = 7.055551, which is more in line with what was expected for this interface.

Figure 7-5. Normal Distribution of Ping Data Without Points > 2000ms

graphics/07fig05.gif
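To see how a mean and standard deviation translate into probabilities under a normal model, consider the following Python sketch. It uses `statistics.NormalDist` (Python 3.8 or later) with the mean and sample standard deviation from the right-hand column of Table 7-6. The helper `probability_within` and the 140ms threshold are hypothetical choices for illustration, and the sketch assumes that the ping times really are approximately normal.

```python
from statistics import NormalDist

# Mean and sample standard deviation of the adjusted ping data (Table 7-6).
ping_model = NormalDist(mu=118.25, sigma=6.942949334)


def probability_within(dist, k):
    """Probability that a value lies within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


one_sigma = probability_within(ping_model, 1)  # roughly 68 percent
two_sigma = probability_within(ping_model, 2)  # roughly 95 percent

# Hypothetical threshold: chance of a ping slower than 140 ms under this model.
p_slow = 1 - ping_model.cdf(140)

print(f"within 1 standard deviation:  {one_sigma:.4f}")
print(f"within 2 standard deviations: {two_sigma:.4f}")
print(f"P(ping > 140 ms):             {p_slow:.5f}")
```

The one- and two-standard-deviation results correspond to the 68 percent and 95 percent confidence levels described at the start of this section.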

There are many other distributions that can be useful in the analysis of performance data. The distributions that can be useful in network management are the Poisson, exponential, uniform, lognormal, and fixed. Many examples of these can be found in the sources listed in the References at the end of this chapter. One of the easiest ways to use these distributions is with a mathematical scratchpad. It allows you to import data from a spreadsheet and, by using the correct statistical values for the distribution, graph your sample data.

Weighted Mean

Another context in which statistics can be of help to you is when applying a single requirement to multiple applications. For instance, if you have three applications (Voice over IP, Database transactions, and Web video) that have different response time requirements, how do you make a decision on the most affordable network design?

One way is to use a weighted mean. In this case, you weight each application based on how critical it is to the enterprise, giving the most critical the highest weight. It is important that you understand that this weight is an arbitrary value that you have subjectively chosen, but the relationship between the various weights is significant. By setting a high weight on the critical applications, you ensure that their needs are met.

The weighted mean can be used as a decision analysis tool. It allows you to analyze the critical applications within your organization, assign a weight to them, and then use the new value in calculations. In the following example, each step will be explained.

To find the weighted mean, do the following:

Find the average response time required for each of the applications.
Select weights for each application based on how crucial it is to the enterprise.
Multiply the required average response time for each application by its assigned weight and sum the products.
Divide the sum of the products by the sum of the weights.

Here is an example:

Voice over IP = 150ms average response time; assigned weight = 4
Voice over IP is an application that is currently being placed in several areas on the network and is therefore quite important. This is indicated with a moderately high weighting of 4.
DBMS transactions = 300ms average response time; assigned weight = 6
The database transactions are the most critical application for this customer and therefore have been given the highest weighting (6).
Web video = 500ms average response time; assigned weight = 2
In Web video, you have an application that is just being looked at by the customer and is not of critical importance. However, because this organization is actively testing the application, we give Web video a relatively low weight of 2 rather than a very low weight (1) for our calculations.

For this example, the weighted average works out to be the following:

[(150 x 4) + (300 x 6) + (500 x 2)] / (4 + 6 + 2) = 3400/12 = 283.3ms

The non-weighted average is as follows:

(150 + 300 + 500)/3 = 950/3 = 316.7ms

Note that the weighted average is still less than the non-weighted average. Because you are looking for the lowest response time, the weighted mean can be seen as more favorable to your goals. Comparing the weighted and non-weighted averages is usually a good sanity check of your weighting scheme and the results.
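The four steps for finding the weighted mean can be sketched in Python as follows. The function name `weighted_mean` and the dictionary layout are choices made for this illustration. The response times and weights are the ones from the example, and the divisor is the sum of the weights (4 + 6 + 2 = 12).

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal in length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Required average response time (ms) and assigned weight per application.
applications = {
    "Voice over IP": (150, 4),
    "DBMS transactions": (300, 6),
    "Web video": (500, 2),
}

times = [time_ms for time_ms, _ in applications.values()]
weights = [weight for _, weight in applications.values()]

weighted = weighted_mean(times, weights)
unweighted = sum(times) / len(times)

print(f"weighted mean:     {weighted:.1f} ms")
print(f"non-weighted mean: {unweighted:.1f} ms")
```

Only the ratios between the weights matter: doubling every weight leaves the weighted mean unchanged.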
