Recipe5.3.Calculating Confidence Intervals

Recipe 5.3. Calculating Confidence Intervals

Problem

You need to compute confidence intervals for certain estimates and you're not sure what support Excel provides for these calculations.

Solution

Excel offers a built-in function called CONFIDENCE that allows you to compute confidence intervals based on a normal distribution. Moreover, Excel provides functions such as NORMSINV and TINV that allow you to compute inverses for normal and Student's t-distributions, respectively. You can then use these results to compute confidence intervals.

Discussion

In Recipe 5.1, I showed you how to use Excel's built-in functions to compute summary statistics. I also showed you how to compute summary statistics using the Descriptive Statistics tool available from the Analysis ToolPak. In both cases, confidence intervals for the mean were computed and I pointed out that the confidence interval returned by Excel's CONFIDENCE function is different from that returned by the Descriptive Statistics tool. Figure 5-1 shows the different values.

The difference in the computed confidence interval is due to the fact that the CONFIDENCE function uses the inverse of a standard normal distribution to compute the confidence interval, whereas the Analysis ToolPak uses the inverse of Student's t-distribution. Instead of using the CONFIDENCE function, you can use the functions NORMSINV or TINV to compute confidence intervals; this gives you the option of using a normal distribution or Student's t-distribution, which is more suitable for small samples.

NORMSINV computes the inverse of a standard normal distribution with a mean of 0 and standard deviation equal to 1. In other words, NORMSINV returns the z-score corresponding to a given probability. The syntax for NORMSINV is =NORMSINV(probability). For a 95% confidence interval, you need to enter a probability of 1-0.05/2 or 0.975. Referring to the example shown in Figure 5-1, you can compute the confidence interval for the mean using the formula =H9*NORMSINV(0.975), where cell H9's value contains the standard error of the mean. This formula returns a confidence value of 0.015486, which agrees with the results returned using the CONFIDENCE function.

You can use TINV to compute the confidence interval based on Student's t-distribution. TINV computes the inverse of Student's t-distribution given a probability and the degrees of freedom characterizing the distribution. Again, referring back to the example in Figure 5-1, you can compute the confidence interval using the formula =H9*TINV(0.05,H20-1). In this case, the probability for 95% confidence is entered as 1-0.95 or 0.05. As before, H9 refers to the cell containing the standard error of the mean. Cell H20 contains the value returned by COUNT, and the number of degrees of freedom is computed by h20-1. The resulting confidence value is 0.015677, which agrees with the result returned by the Descriptive Statistics tool.

You can use the formulas discussed in this recipe to compute confidence intervals for other estimates as well. In Recipe 8.8, I show you how to compute confidence intervals for regression curves.

Recipe 5.4. Correlating Data

Problem

You're working with a multivariable problem and want to compute the correlation coefficient between variables.

Solution

Use Excel's built-in functions CORREL or PEARSON, or use the Correlation tool available in the Analysis ToolPak.

Discussion

In Recipe 8.4, I show you how to perform linear regression where the dependent variable depends on more than one independent variable. The example I discuss in that recipe involves six independent variables. This is a typical example where it's prudent to assess the correlation between independent variables before conducting the regression analysis. This way you can avoid using highly correlated (or colinear) independent variables, which can give you trouble.

Excel provides two built-in functions that allow you to compute the Pearson product-moment correlation coefficients between variables. These functions are CORREL and PEARSON and, according to Excel's help documents, they perform exactly the same calculation. (I've no idea why there are two functions to do the same thing.)

Figure 5-8 shows the data from the multiple linear regression example I discuss in Chapter 8. For now, I'll show you how to examine the correlation between independent variablesthe x variablesusing Excel's correlation functions as well as the Analysis ToolPak.

To compute the correlation coefficient between variables x1 and x2, you can use the cell formula =CORREL(E7:E22,F7:F22) or =PEARSON(E7:E22,F7:F22). They both produce the same results, which is 0.991589. Clearly, these two variables are positively correlated and in a regression analysis, for example, you would probably exclude one of these variables from the curve fit. You can compute correlation coefficients between the other variables in a similar manner.

Figure 5-8. Multiple linear regression data

To save some time when dealing with numerous variables, you might consider using the Correlation tool available in the Analysis ToolPak. Select Tools images/U2192.jpg border=0> Data Analysis from the main menu bar to open the Data Analysis dialog box shown in Figure 5-9.

Figure 5-9. Data Analysis Correlation tool

Select the Correlation tool and press OK to open the Correlation tool dialog box shown in Figure 5-10.

Figure 5-10. Correlation tool dialog box

The Correlation tool allows you to select multiple datasets at once; it will compute the correlation coefficients between all the datasets. For example, in Figure 5-10 I entered cells E6 through J22 in the Input Range. This cell range includes all of the x variables shown in Figure 5-8. I included the header labels in the input range, so I checked the "Labels in first row" option. Further, I specified cell M7 (on the same worksheet) as the starting cell where I'd like the output displayed. Figure 5-11 shows the results.

Figure 5-11. Correlation results

This table presents the correlation coefficient between each given variable and every other variable. It also correlates each variable with itself, which of course results in a correlation coefficient of 1 (as you can see in Figure 5-11).

Sometimes it's useful to compute coefficients of determination instead of correlation coefficients alone. To compute a coefficient of determination, simply square the correlation coefficient. For example, if cell N9 contains the correlation coefficient between variables x1 and x2, then the formula =N9^2 would yield the coefficient of determination.

There are other types of correlation coefficients common in statistical analysis. For example, Phi, rank biserial, point biserial, and Spearman rank coefficients are common, in addition to the Pearson coefficient. Excel does not have built-in support for these other correlation coefficients; however, you can certainly set up a spreadsheet to calculate these coefficients.

For example, you can readily prepare a spreadsheet to compute the Spearman rank correlation coefficient between variables x1 and x2 from the previous example. The Spearman coefficient requires you to convert the raw data to ranks first. You can achieve this conversion using Excel's built-in RANK function (see Recipe 5.5). Figure 5-12 shows a simple spreadsheet I set up to compute the Spearman rank correlation coefficient between the variables x1 and x2 from the previous example.

The first two columns in the table contain the raw data. The third and fourth columns contain the rank of each data point. I used the RANK function for these calculations. For example, cell E41 contains the cell formula =RANK(C41,\$C\$41:\$C\$56,1), which computes the rank for the first x1 value, assuming the data are sorted in ascending order (see Recipe 5.5).

The formula for computing the Pearson rank correlation coefficient is:

Here, d is the difference in ranks between corresponding values for both x1 and x2, and N is the number of data points. You can compute N using the COUNT function. Cell D58 contains the formula =COUNT(D41:D56) which returns a value of 16 for the number of points.

To compute d2, I set up another column containing formulas like =(E41-F41)^2. The results are shown in the last column of the table shown in Figure 5-12.

Cell G57 computes the sum of d2 values using the formula =SUM(G41:G56). Cell G59 computes the N(N2 - 1) term using the formula =(D58*(D58^2-1)). And finally, the correlation coefficient is computed in cell G60 using the formula =1-6*G57/G59. The resulting correlation coefficient is 0.997.