Chapter 64: Using Resampling to Analyze Data | MicrosoftВ® Office ExcelВ® 2007: Data Analysis and Business Modeling (Bpg -- Other)

Overview

I’ve produced nine batches of a product by using a high temperature and seven batches by using a low temperature. What is the probability that the product yield is better at the high temperature?

In our work and personal lives, we often use data to answer questions such as these:

What is the probability that a new teaching technique improves student learning?
What is the probability that aspirin reduces the incidence of heart attacks?
What is the probability that Machine 1 is the most productive of our three machines?

You can use a simple yet powerful technique known as resampling to make inferences from data. To make statistical inferences by using resampling, you regenerate data many times by sampling with replacement from your data. Sampling with replacement from data means that the same data point can be chosen more than once. You then make inferences based on the results of this repeated sampling. A key tool in implementing resampling is the RANDBE-TWEEN function. Entering the function RANDBETWEEN(a,b) yields with equal probability any integer between a and b (inclusive). For example, RANDBETWEEN(1,9) is equally likely to yield one of the numbers 1 through 9, inclusive.

I’ve produced nine batches of a product by using a high temperature and seven batches by using a low temperature. What is the probability that the product yield is better at the high temperature?
The file Resampleyield.xlsx, shown in Figure 64-1 on the next page, contains the product yield from nine batches of a product manufactured at a high temperature and seven batches manufactured at a low temperature.

Figure 64-1: Product yields at high and low temperature
The mean yield at a high temperature is 39.74, and the mean yield at a low temperature is 32.27. This difference does not prove, however, that mean yield at a high temperature is better than mean yield at a low temperature. We want to know, based on our sample data, the probability that yield at a high temperature is better than at a low temperature. To answer this question, we can randomly generate nine integers between 1 and 9, which creates a resampling of the high-temperature yields. For example, if we generate the random number 4, the resampled data for high-temperature yields will include a yield of 41.40, and so on. Next we randomly generate seven integers between 1 and 7, which creates our resampling of the low-temperature yields. We can then check the resampled data to see whether the high-temperature mean is larger than the low-temperature mean, and then use a data table to repeat this process several hundred times. (I repeated the process 400 times in this example.) In the resampled data, the fraction of the time that the high-temperature mean is larger than the low-temperature mean estimates the probability that the high-temperature process is superior to the low-temperature process.
To begin, we generate a resampling of the high-temperature data by copying from cell C16 to C17:C24 the formula RANDBETWEEN(1,9), as shown in Figure 64-2. A given observation can be chosen more than once or not chosen at all. Copying from cell D16 to D17:D24 the formula VLOOKUP(C16,lookup,2)-the range C4:E13 has been named Lookup-generates the yields corresponding to our random resampling of the data. Next we generate a resampling from our low-temperature yields. Copying from E16 to E17:E22 the formula RANDBETWEEN(1,7) generates a resampling of seven observations from our original low-temperature data. Copying from F16 to F17:F22 the formula VLOOKUP(E16,lookup,3) generates the seven actual resampled low-temperature yields.

Figure 64-2: Implementation of resampling
In cell D26, I compute the mean of our resampled high-temperature yields with the formula AVERAGE(D16:D24). Similarly, in cell F26, I compute the mean of our resampled low-temperature yields with the formula AVERAGE(F16:F22). In cell D29, I determine whether the resampled mean for high temperature is larger than the resampled mean for low temperature with the formula IF(D26>F26,1,0).
To replay our resampling 400 times, we can use a one-way data table. I copy iteration numbers 1 through 400 to the cell range C32:431. (See Chapter 60, “Introduction to Monte Carlo Simulation,” for an explanation of how to use the Fill Series command to easily create a list of iteration values.) By typing =D29 in cell D31, I create the formula that records whether high-temperature mean is larger than low-temperature mean in the output cell for our data table. After selecting the table range (C31:D431) and then selecting Data Table from the What-If Analysis command in the Data Tools group on the Data tab, we can choose any blank cell in the worksheet as our Column Input Cell. We have now tricked Microsoft Office Excel 2007 into playing out our resampling 400 times. Each iteration with a value of 1 indicates a resampling in which high temperature has the larger mean. Each iteration with a value of 0 indicates a resampling for which low temperature has a larger mean. In cell F31, I determine the fraction of time that high-temperature yield has a larger mean by using the formula =AVERAGE(D32:D431). In Figure 64-2, our resampling indicates a 91 percent chance that high temperature has a larger mean than low temperature. Of course, pressing F9 will generate a different set of 400 resamplings and will give us a slightly different estimate of the probability that high temperature yield is superior to low temperature yield.