Chapter 17: Functions for Analyzing Statistics | MicrosoftВ® Office ExcelВ® 2007 Inside Out (Inside Out (Microsoft))

Microsoft Office Excel 2007 provides a wide range of features that can help you analyze statistical data. A number of functions that assist in simple analysis tasks, such as AVERAGE, MEDIAN, and MODE, are built into the program. If the built-in statistical functions aren't enough, you can turn to the Analysis Toolpak, an add-in that provides a collection of tools to augment the built-in analytical capabilities of Office Excel 2007. You can use these tools to create histograms, create rank-and-percentile tables, extract samples from a data set, perform regression analysis, generate special random-number sets, apply Fourier and other transformations to your data, and more. In this chapter, we'll explore the most important statistical analysis functions that are built into Excel, as well as those included with the Analysis Toolpak.

Analyzing Distributions of Data

In statistics, a collection of measurements is called a distribution. Excel has several methods you can use to analyze distributions: built-in statistical functions, the sample and population statistical functions, or the rank-and-percentile functions together with the Rank And Percentile tool.

Note

You can also analyze distributions using the Descriptive Statistics and Histogram tools, both of which are included in the Analysis Toolpak add-in. For more information, see "Using the Analysis Toolpak Data Analysis Tools" on page 567.

Using Built-in Statistical Functions

image from book You use the built-in statistical functions to analyze a group (or population) of measurements. In the following sections, the discussion is limited to the most commonly used statistical functions. To quickly access these functions, click the More Functions button on the Formulas tab on the Ribbon, and then click Statistical to display a menu of statistical functions.

Note

Excel also offers the advanced statistical functions LINEST, LOGEST, TREND, and GROWTH, which operate on arrays. For more information, see "Understanding Linear and Exponential Regression" on page 559.

The AVERAGE Function

The AVERAGE function computes the arithmetic mean, or average, of the numbers in a range by summing a series of numeric values and then dividing the result by the number of values. This function takes the arguments (number1, number2, . . .), can include up to 255 arguments, and ignores blank cells and cells containing logical and text values. For example, to calculate the average of the values in cells B4 through B15, you could use the formula =(B4+B5+B6+B7+B8+B9+B10+B11+B12+B13+B14+B15)/12, but it's obviously more efficient to use =AVERAGE(B4:B15).

For more information about this function, see the sidebar "AVERAGE vs. AVG" on page 502.

The MEDIAN, MODE, MAX, MIN, and COUNT Functions

These functions all take the same arguments, essentially just a cell range or a list of numbers separated by commas, such as (number1, number2, . . .). They can accept up to 255 arguments, ignoring text, error values, and logical values. Here's a brief description of each:

MEDIAN Computes the median of a set of numbers. The median is the number in the middle of the set, that is, an equal number of values are higher and lower than the median. If the numbers specified include an even number of values, the value returned is the average of the two that lie in the middle of the set.
MODE Determines which value occurs most frequently in a set of numbers. If no number occurs more than once, MODE returns the #N/A error value.
MAX Returns the largest value in a range.
MIN Returns the smallest value in a range.
COUNT Tells you how many cells in a given range contain numbers, including dates and formulas that evaluate to numbers.

Note

To count all nonblank cells, regardless of what they contain, you can use the COUNTA function. For more information about this function, see "The A Functions" on page 556.

The SUMIF, SUMIFS, and COUNTIF Functions

The SUMIF function is similar to SUM but first tests each cell using a specified conditional test before adding it to the total. This function takes the arguments (range, criteria, sum_range). The range argument specifies the range you want to test, the criteria argument specifies the conditional test to be performed on each cell in the range, and the sum_range argument specifies the cells to be totaled. For example, if you have a worksheet with a column of month names defined using the range name Months and an adjacent column of numbers named Sales, use the formula =SUMIF(Months, "June", Sales) to return the value in the Sales cell that is adjacent to the label June. Alternatively, you can use a conditional test formula, such as =SUMIF(Sales, ">=999", Sales), to return the total of all sales figures that are more than $999.

The SUMIFS function does similar work to that of the SUMIF function, except you can specify up to 127 different ranges to sum, each with their own criteria. Note that the sum_range argument is in the first position instead of the third position in this function: (sum_range, criteria_range1, criteria1, criteria_range2, criteria2, . . .). The sum range and each criteria range must all be the same size and shape. Using a similar example to the one we used for the SUMIF function, suppose we also created defined names for cell ranges Months, Totals, and Product1, Product2, and so on. The formula =SUMIFS(Totals, Product3, "<=124", Months, "Jun") returns the total sales for the month of June when sales of Product2 were less than or equal to $124.

You can use the Conditional Sum Wizard add-in to help you construct formulas. For more information, see "Using the Conditional Sum and Lookup Wizards" on page 478.

Similarly, COUNTIF counts the cells that match specified criteria and takes the arguments (range, criteria). Using the same example, you can find the number of months in which total sales fell to less than $600 using a conditional test, as in the formula =COUNTIF(Totals, "<600").

For more information about conditional tests, see "Creating Conditional Tests" on page 478. For more about using range names, see "Naming Cells and Cell Ranges" on page 441.

Using Functions That Analyze Rank and Percentile

Excel includes several functions that extract rank and percentile information from a set of input values: PERCENTRANK, PERCENTILE, QUARTILE, SMALL, LARGE, and RANK.

The PERCENTRANK Function

The PERCENTRANK function returns a percentile ranking for any member of a data set. You can use this function to create a percentile table that's linked to the input range so that the percentile figures are updated if the input values change. We used this function to create the percentile ranking in column E of Figure 17-1.

image from book
Figure 17-1: PERCENTRANK links percentile figures to input values.

On the CD You can find the SAT Scores.xlsx file in the Sample Files section of the companion CD.

The PERCENTRANK function takes the arguments (array, x, significance). The array argument specifies the input range (which is D2:D1001, in our example), and x specifies the value whose rank you want to obtain. The significance argument, which is optional, indicates the number of digits of precision you want; if omitted, results are rounded to three digits (0.xxx or xx.x%).

The A Functions

Excel includes a set of functions that give you more flexibility when calculating data sets that include text or logical values. These functions are AVERAGEA, COUNTA, MAXA, MINA, STDEVA, STDEVPA, VARA, and VARPA, all of which accept a series of up to 255 arguments (value1, value2,...).

Ordinarily, the non-A versions of these functions ignore cells containing text values. For example, if a range of 10 cells contains one text value, AVERAGE ignores that cell and divides by 9 to arrive at the average, whereas AVERAGEA considers the text value part of the range and divides by 10. This is helpful if you always want to include all referenced cells in your calculations, especially if you use formulas that return text flags, such as "none," if a certain condition is met. For more information about STDEVA, STDEVPA, VARA, and VARPA, see "Using Sample and Population Statistical Functions" on page 558.

The PERCENTILE and QUARTILE Functions

You use the PERCENTILE function to determine which member of an input range stands at a specified percentile ranking; it takes the arguments (array, k). You must express the percentile k as a decimal fraction from 0 to 1. For example, to find out which score in Figure 17-1 represents the 87th percentile, you can use the formula =PERCENTILE(D2:D1001, 0.87).

The QUARTILE function, which takes the arguments (array, quart), works much like the PERCENTILE function, except it returns the value that represents the lowest percentile, or any quarter-percentile in the input set. The array argument specifies the input range. The quart argument specifies the value to be returned, as shown in Table 17-1.

Table 17-1: The *Quart* Argument
Open table as spreadsheet
Quart	Returns
0	Lowest value
1	25th-percentile value
2	Median (50th-percentile) value
3	75th-percentile value
4	Highest value

Inside Out-Use MIN, MEDIAN, and MAX

QUARTILE is a powerful function, but if you don't need to return 25th or 75th percentile values, you will get faster results using other functions, particularly when working with large data sets. Use the MIN function instead of QUARTILE(array, 0), the MEDIAN function instead of QUARTILE(array, 2), and the MAX function instead of QUARTILE(array, 4).

The SMALL and LARGE Functions

The SMALL and LARGE functions return the kth smallest and kth largest values in an input range; both take the arguments (array, k), where k is the position from the largest or smallest value to the value in the array you want to find. For example, to find the 15th highest score in Figure 17-1, you can use the formula =LARGE(D2:D1001, 15).

The RANK Function

The RANK function returns the ranked position of a particular number within a set of numbers and takes the arguments (number, ref, order). The number argument is the number for which you want to find the rank, ref is the range containing the data set, and order optionally ranks the number as if it were in a ranking list in an ascending or descending (the default) order. For example, to find out which ranking the score 1200 has in the data set in Figure 17-1, you can use the formula =RANK(1200, D2:D1001).

By default, the highest value is ranked 1, the second highest is ranked 2, and so on. If RANK can't find an exact match between its first argument and an input value, it returns the #N/A error value.

Using Sample and Population Statistical Functions

Variance and standard deviation are statistical measurements of the dispersion of a group, or population, of numbers. The standard deviation is the square root of the variance. As a rule, about 68 percent of a normally distributed population falls within one standard deviation of the mean, and about 95 percent falls within two standard deviations. A large standard deviation indicates that the population is widely dispersed from the mean; a small standard deviation indicates that the population is tightly packed around the mean.

Four statistical functions-VAR, VARP, STDEV, and STDEVP-compute the variance and standard deviation of the numbers in a range of cells. Before you calculate the variance and standard deviation of a group of values, you must determine whether those values represent the total population or only a representative sample of that population. The VAR and STDEV functions assume that the values represent only a sample of the total population; the VARP and STDEVP functions assume that the values represent the total population.

Calculating Sample Statistics: VAR and STDEV

The VAR and STDEV functions compute variance and standard deviation, assuming that their arguments represent only a sample of the total population. These functions both take the arguments (number1, number2, . . .) and accept up to 255 arguments. The worksheet in Figure 17-2 shows exam scores for five students and assumes that the scores in cells B4:E8 represent only a part of the total population.

image from book
Figure 17-2: The VAR and STDEV functions measure the dispersion of sample exam scores.

On the CD You can find the VAR.xlsx file in the Sample Files area of the companion CD.

Cell 14 uses the VAR function =VAR(B4:E8) to calculate the variance for this sample group of test scores. Cell 15 uses the STDEV function =STDEV(B4:E8) to calculate the standard deviation.

Assuming that the test scores in the example are normally distributed, we can deduce that about 68 percent of the students (the general-rule percentage) achieved scores from 83.65 (the average 89.20 minus the standard deviation 5.55) to 94.75 (89.20 plus 5.55).

Calculating Total Population Statistics: VARP and STDEVP

If the numbers you're analyzing represent an entire population rather than a sample, use the VARP and STDEVP functions to calculate variance and standard deviation. These functions both take the arguments (number1, number2, . . .) and accept up to 255 arguments.

Assuming that cells B4:E8 in the worksheet shown in Figure 17-2 represent the total population, you can calculate the variance and standard deviation with the formulas =VARP(B4:E8) and =STDEVP(B4:E8). The VARP function returns 29.26, and the STDEVP function returns 5.41.

Note

The STDEV, STDEVP, VAR, and VARP functions do not include text values or blank cells in their calculations. If you want to include blanks or text, use the A versions: STDEVA, STDEVPA, VARA, and VARPA. For more information, see "The A Functions" on page 556.