HOW MUCH DO THE VALUES DIFFER?

Measures of central tendency provide information only about "typical" values. They tell you nothing about how much the values vary within the sample. Suppose you examine the transmissions of 20 vehicles: ten trucks and ten sedans. The trucks were used to haul heavy items, whereas the sedans were used for pleasure driving. You are interested in the number of problems each vehicle had per week during three months in service. The data are shown in Table 3.1.

Table 3.1: Vehicle Problems per Week

                   Trucks                              Sedans
    Problems per Week    Difference      Problems per Week    Difference
            4               -1                   0                -5
            4               -1                   0                -5
            5                0                   0                -5
            5                0                   0                -5
            5                0                   0                -5
            5                0                   3                -2
            5                0                  10                 5
            5                0                  10                 5
            6                1                  12                 7
            6                1                  15                10
       Sum  = 50         Sum  = 0           Sum  = 50          Sum  = 0
       Mean =  5         Mean = 0           Mean =  5          Mean = 0

The average number of problems for the two groups of vehicles is the same: five per week. However, the distributions of values differ. All of the trucks had consistent numbers of problems, between four and six per week; the numbers do not vary much from truck to truck. The sedans, on the other hand, differ from each other much more: some had no problems at all, while others had as many as 15 per week.

How can you measure this variability? One of the more obvious ways is to report the smallest and largest values in each of the samples. The minimum number of problems for the trucks is four and the maximum is six. In the sedans, the minimum number of problems is zero and the maximum is 15.

The difference between the largest and smallest values is called the range. For the trucks, the range is 2, while for the sedans it is 15. That is quite a difference. By comparing the ranges of the two samples, you can tell that the numbers of problems per vehicle in the sedan sample differed from each other far more than those in the truck sample.

The range is not a particularly good measure of variability, though. It depends only on the smallest and largest numbers and pays no attention to the distribution of the numbers in between. Nevertheless, it is the best measure of variability available for variables measured on an ordinal scale. For a variable measured on an interval or ratio scale, however, you can compute some better measures.
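As a quick check, here is the computation in Python (a sketch we add for illustration; the data lists are read from Table 3.1):

    # Range = largest value minus smallest value.
    trucks = [4, 4, 5, 5, 5, 5, 5, 5, 6, 6]
    sedans = [0, 0, 0, 0, 0, 3, 10, 10, 12, 15]

    print(max(trucks) - min(trucks))  # 2
    print(max(sedans) - min(sedans))  # 15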

THE VARIANCE

For each case, you can compute how much it varies from the mean of all the cases. Just subtract the overall mean from the case's value. For the first case in Table 3.1, the difference is:

4 (the case's value) - 5 (the mean) = -1

This indicates that this particular truck had one fewer problem per week than the average. Table 3.1 shows the differences for each case. From the table, you can see that the differences are much smaller for the trucks than for the sedans.

How can you use these differences to measure variability? The simplest tactic that comes to mind is just to add up the differences and compute a mean difference for each group. Like many seemingly good ideas, this one has a flaw: the sum of the differences from the mean is always zero. Some of the differences are positive and some are negative, so when you add them all up, the positives and negatives cancel. You need a better way to combine all of the differences from the mean.

You can do this in several ways. For example, you could treat all the differences as if they were positive and compute a mean difference for them. It turns out, though, that a better way by far is to

  1. Square the differences.

  2. Add them up.

  3. Divide the sum by the number of cases minus one.

This measure is called the variance.

Why divide by the number of cases minus one instead of the number of cases? You are working with a sample taken from a larger population, and you are trying to describe how much the responses vary from the mean of the entire population. However, since you do not know the population mean, you have to use the sample mean in your calculation, and using the sample mean makes the sample seem less variable than it really is. When you divide by the number of cases minus one, you compensate for the smaller variability that you observe in the sample. Later on, we are going to define this divisor as the degrees of freedom (df).

Large values of the variance tell you that the values are quite spread out; small values indicate that the responses are pretty similar. In fact, a variance of zero means that all of the values are exactly equal. For the data shown in Table 3.1, the variance for the trucks is .44, while that for the sedans is 36.44. This supports our observation that the sedans vary more.
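To make the three-step recipe concrete, here is a minimal sketch in Python (our illustration; the book itself relies on a statistical package) applied to the Table 3.1 data:

    # Sample variance: square the differences from the mean, add the
    # squares up, and divide by the number of cases minus one.
    def sample_variance(values):
        n = len(values)
        mean = sum(values) / n
        return sum((x - mean) ** 2 for x in values) / (n - 1)

    trucks = [4, 4, 5, 5, 5, 5, 5, 5, 6, 6]
    sedans = [0, 0, 0, 0, 0, 3, 10, 10, 12, 15]

    print(round(sample_variance(trucks), 2))  # 0.44
    print(round(sample_variance(sedans), 2))  # 36.44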

THE STANDARD DEVIATION

Since you calculate the variance by squaring differences from the mean, it is expressed in squared units of measurement: squared hours, squared children, or something similar. To express the variability in the same units as the observations, you take the square root of the variance. This is called the standard deviation. The standard deviation is expressed in the same units as the original data. For the trucks in Table 3.1, the standard deviation is the square root of .44, or about .66. For the sedans, it is the square root of 36.44, or 6.04.
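Continuing the sketch, Python's standard library statistics module computes both measures directly (both functions divide by the number of cases minus one):

    import statistics

    trucks = [4, 4, 5, 5, 5, 5, 5, 5, 6, 6]
    sedans = [0, 0, 0, 0, 0, 3, 10, 10, 12, 15]

    # Working from the unrounded variance gives .67 for the trucks;
    # the .66 in the text is the square root of the rounded value .44.
    print(round(statistics.stdev(trucks), 2))  # 0.67
    print(round(statistics.stdev(sedans), 2))  # 6.04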

With any statistical package, it is easy to calculate various measures of central tendency and variability. Under the FREQUENCY command, you specify the statistics you want (kurtosis, mean, median, mode, standard error of the mean, skewness, sum, standard error of skewness, minimum, maximum, range, variance, and standard deviation) and the computer will do the rest.

FREQUENCY TABLES VERSUS CROSS-CLASSIFICATION TABLES

Up to this point, we have been dealing primarily with frequency tables. We can obtain additional types of information about our data, though, if we use cross-classification tables.

Fundamentally, the difference between frequency tables and cross-classification tables is that in the first, we deal with variables one at a time while in the second, we have the capability to do multiple comparisons. To do the comparisons, however, we must know the variables and their individual groupings.

For example, consider a study in which we have asked people to evaluate the ride of a new vehicle, classifying it as "comfortable," "normal," or "rough." We want to know how many of the 684 people who found the ride comfortable were men and how many were women. How do we find out? No problem. Just use the CROSSTABS command.

This command creates a table (such as Table 3.2) showing the categories of the two variables and compares the variables with each other. Notice that the numbers appear in cells, and they are arranged in rows and columns. Labels at the left and the top of the table describe what is in each of the rows and columns. To the right and at the bottom of the table are totals, often called marginal totals because they are in the table's margins.

Table 3.2: Cross-Tabulation Table
(in each cell, the upper number is the count and the lower number is the column percentage)

    Ride                 Male (1)    Female (2)    Row Total
    Comfortable (1)         300          384           684
                           50.3         44.4          46.8
    Normal (2)              267          437           704
                           44.8         50.5          48.2
    Rough (3)                29           44            73
                            4.9          5.1           5.0
    Column Total            596          865          1461
                           40.8         59.2         100.0

    Missing Observations = 12

Because the categories of the two variables are "crossed" with each other, this kind of table is called a cross-classification table or simply a cross-tabulation. A cross-classification table shows a cell for every combination of categories of the two variables. Inside the cell is a number showing how many people gave that combination of responses. (Our example uses a 2 × 3 cross-tabulation table; the process is the same, and just as easy with a computer, even with more than two variables.) The table is a very efficient way to present a lot of numbers, and once you get used to it, it is quite easy to read. Let us look at what is in the cells.

The number in the first cell of the table, 300, tells you that 300 males found the ride comfortable. The next number is in the column labeled Female, and it tells you that 384 females found the ride comfortable. The sum of these two numbers, 684, is shown in the last column.

Each of the cells also contains a second number: the column percentage. For example, 50.3% of the men reported a comfortable ride, but only 44.4% of the women reported the same thing. These percentages point in just the opposite direction from the counts (384 women versus 300 men found the ride comfortable). It is easy to mislead yourself if you compare just the counts in the cells of a cross-tabulation table; it is better to turn the counts into percentages in order to eliminate the differences that arise simply because one group contains more people than another.
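As an illustration, here is a sketch using Python's pandas library in place of the CROSSTABS command (the data frame and its column names, sex and ride, are hypothetical):

    import pandas as pd

    # Hypothetical survey responses, one row per person.
    df = pd.DataFrame({
        "sex":  ["Male", "Female", "Male", "Female", "Male", "Female"],
        "ride": ["Comfortable", "Comfortable", "Normal",
                 "Rough", "Normal", "Normal"],
    })

    # Cell counts with marginal totals, as in Table 3.2.
    print(pd.crosstab(df["ride"], df["sex"], margins=True))

    # Column percentages: each column sums to 100.
    pct = pd.crosstab(df["ride"], df["sex"], normalize="columns") * 100
    print(pct.round(1))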

What do these numbers tell you? They tell you how likely it was that a man, or a woman, found the ride comfortable. If you wanted to know instead how likely it was that a person who considered the ride comfortable was male or female, you would need row percentages, which express each count as a percentage of its row total. It is usually true in a cross-tabulation table that either the row percentages or the column percentages answer your question. Deciding which to use is often based on which variable you consider dependent and which you consider independent. This is very important, and it is easy to remember which is which. Here is what you need to remember:

  • The dependent variable depends on the other one.

  • The independent variable does not depend on the other one; it goes its own way, independently.

In summary, here are some key points about how you study the relationship between responses to two or more questions which have a small number of possible answers:

  • A cross-tabulation shows the numbers of cases that have particular combinations of responses to two or more questions.

  • The number of cases in each cell of a cross-tabulation can be expressed as the percentage of all cases in that row (the row percentage) or the percentage of all cases in that column (the column percentage).

  • The variable that is thought to influence the values of another variable is called the independent variable.

  • The variable that is influenced is called the dependent variable.

  • If there is an independent variable, percentages should be calculated so that they sum to 100 for each category of the independent variable.

  • When you have more than two variables, you can make separate cross-tabulations for each of the combinations of variables.

MEANS

So far we have looked at the relation between variables and groupings of variables. By doing so, however, we have ignored some of the available information. All cases with values in the same range have been treated as the same.

We can look at the relation between ride and male/female in another way that still produces compact tables but is based on each person's actual response. What we can do is compute means: the mean of a quantitative variable for each group as well as for the sample as a whole. This can be accomplished with a simple MEANS command in any statistical package, provided that you specify the variables you want to work with.
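For instance, a minimal sketch in pandas (the column names sex and rating are hypothetical; the MEANS command of a statistical package plays the same role):

    import pandas as pd

    # Hypothetical data: a numeric ride rating for each respondent.
    df = pd.DataFrame({
        "sex":    ["Male", "Female", "Male", "Female", "Male"],
        "rating": [7, 5, 6, 8, 9],
    })

    print(df["rating"].mean())                 # mean for the whole sample
    print(df.groupby("sex")["rating"].mean())  # mean for each group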

MEANS FROM SAMPLES

So far we have tried to answer questions such as: "What percentage of the sample thinks that the ride is comfortable?" or "What is the average age of the people who said the ride is rough?" The emphasis was always on reporting just the results of the study. We looked at the data and described the sample. Nothing more.

Now, we will begin to look at the problems we face when drawing conclusions about a whole population on the basis of what is observed in a sample.

In our sample, the men were more likely than the women to find the ride comfortable. No doubt about it: 50% of the men but only 44% of the women called the ride comfortable. Unless an error was made somewhere in entering the data into the file, the results are crisp and clear. We can speak about the sample with confidence. We know, or can figure out, anything we want to about the sample, assuming that we asked the right questions and entered the data correctly into the file.

But talking about a sample is usually not enough. We do not want conclusions about the 1473 people in the study; we want conclusions about the population that this sample represents. We want to be able to say such things as: "A comfortable ride means more to American men than it does to American women." Based on the results in the sample, we want to speak about the population from which the sample was selected.

That may not seem like a big deal. Why not just assume that whatever is true for the sample is also true for the population? If the men in the sample found the comfortable ride more appealing than the women did, why not claim that the same must be true in the population? Let us just conclude that American men are more enthusiastic about a comfortable ride than American women. That would certainly be simple. But would it always be correct?

PROBLEMS IN GENERALIZING

Suppose you had a sample of two men and two women, and you found that one of the men but neither of the women was concerned about the ride. Would you be willing to draw the conclusion that, in general, men are more excited by the ride than women are? The numbers, especially if you do not think about them, suggest the headline, "Amazing new research shows that half of all men but no women at all are excited by a vehicle's 'ride.'" It does not take much statistical know-how to find fault with this headline. Generalizing from a tiny sample of two men and two women to the whole U.S. population is laughable. If you sampled another two men and two women, you would probably get completely different results. But you could not generalize from those results, either. You cannot conclude much at all about the whole population from a sample of four people.

What if the sample were larger, say 200 men and 200 women? Conclusions from a study with this sample size would certainly be more believable than those from a four-person study. It is easier to believe that the results observed in the larger sample hold true for the population. But if you found that 50% of all the men and 49% of all the women in the larger sample said their vehicle ride was comfortable, would you be willing to conclude that in the population, men are more likely than women to find the ride comfortable? What if the difference were larger, say 50% of the men compared to 40% of the women?

SAMPLING VARIABILITY

You get different results from different samples. Consequently, it takes some thought to sort out what you can reasonably say about the population, based on the results from a sample. If you and I each look at samples of 400 people from the same population, we are not going to get exactly the same answers when we each analyze our own data. Our samples will undoubtedly include different people, and our results will differ. With any luck, the results will be similar, but it is very unlikely that they will be identical to the last decimal place. Even if they are, they probably would not be the same values that we would obtain if we questioned the whole population.

How much the results from different samples vary from one to another depends not only on the size of the samples but also on how often the various responses occur in the population. (Statisticians call this the distribution of responses in the population.) If everybody in the United States plans to vote for the same candidate for President (say, the one you have been working for), any old sample will lead to the same answer to the question, "What percentage of the vote will my candidate receive?" The answers would not vary from person to person, and they would not vary from survey to survey. Any survey would tell you that 100% of the voters plan to vote for your candidate.

On the other hand, if only half of the voters plan to vote for your candidate, your samples would show more variability. One sample might show that 60% of the vote will go to your candidate, and another sample might show that your candidate will get 45% of the vote. If 1000 researchers took random samples of 400 voters each, they would obtain a lot of different percentages. Some would be close to the correct figure of 50%, while others would be higher or lower.

By now you should be wondering if the size of the sample has anything to do with the results of the study. The effect of sample size on any study is indeed important. For our discussion here, a basic fact to remember is that results from large samples do not vary as much as results from small samples do. You can test this on your own with a simple simulated study.

A COMPUTER MODEL

We can use the computer to actually do what we have been talking about. With the proper instructions, it can set up a population in which half of the people say they will vote for your candidate, and half say they will not. We can instruct the computer to conduct a hypothetical survey by randomly selecting 400 cases from this population. Then we can tell the computer to calculate from this sample the percentage of the cases that endorse your candidate. We can have the computer repeat this kind of survey as many times as we want. Each time, it will select a new random sample of 400 hypothetical people and compute the percentage planning to vote for your candidate. (This is called a simulated survey.) This is very important for you to recognize because we are going to use this principle in Volume VI, titled Design for Six Sigma.
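Here is a minimal sketch of such a simulation in Python (our construction, not the book's program). It also demonstrates the earlier point that results from large samples vary less than results from small ones:

    import random

    def simulated_survey(sample_size, p=0.5):
        # Sample from a population in which a proportion p plan to vote
        # for the candidate; return the sample percentage of supporters.
        votes = sum(1 for _ in range(sample_size) if random.random() < p)
        return 100 * votes / sample_size

    # Repeat the survey 1000 times for a small and a large sample size.
    for n in (25, 400):
        results = [simulated_survey(n) for _ in range(1000)]
        print(f"n = {n}: results span {min(results):.1f}% to {max(results):.1f}%")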

OTHER STATISTICS

Although we examined the percentage of people planning to vote for a candidate, we could have looked at some other characteristic, such as mean weight, mean number of pencils owned, or mean income. The procedure would have been the same. For each of the random samples from a population, we would have calculated the mean. Then we would have seen how much the mean values varied from sample to sample. The results would have been very similar to what we have seen, and the same basic rules would have applied. After all, percentages agreeing with a statement are equivalent to means.

How can a percentage be the same thing as a mean? For a variable that can have only two possible values (such as yes or no, agree or disagree, cured or not cured), you can code one of the responses as 0 and the other response as 1. If you add up the values for all of the cases, divide by the number of cases, and then multiply by 100, you will obtain the percentage of cases giving the response coded as 1.

Consider a simple example. You ask five people whether they approve of the president's performance. Three say they do, and two say they do not. If you code Approve as 1, you have the values 1, 1, 1, 0, 0. The mean of these values is 3/5 = .6. To get the percentage agreeing with the statement, just multiply the mean by 100. In this survey, 60% of the people approved of the president's performance.
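The same arithmetic takes only a few lines of Python:

    # Code Approve as 1 and Disapprove as 0.
    responses = [1, 1, 1, 0, 0]
    mean = sum(responses) / len(responses)  # 0.6
    print(mean * 100)                       # 60.0, the percentage approving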

At this point it is important to take a breather. We have introduced the word statistic quite a few times but without an official explanation. So, what is a statistic? A statistic is nothing more than some characteristic of a sample. The average height of the people in a sample is a statistic. So is the standard deviation or the variance of the heights. The term statistic is used only to describe sample values. The term parameter is used to describe characteristics of the population. If you could measure the height of all the people in the United States and calculate their average height, the result would be a parameter, since it would be the value for the population. Most of the time, population values, or parameters, are not known. You must estimate them based on statistics calculated from samples.

Here is what you can say about the mean of a population, based on the results observed in a sample (a small simulation sketch follows this list):

  • When you take a sample from a population and compute the sample mean, it will not be identical to the mean you would have obtained if you had observed the entire population.

  • Different samples result in different means.

  • The distribution of all possible values of the mean, for samples of a particular size, is called the sampling distribution of the mean.

  • The variability of the distribution of sample means depends on how large your sample is and on how much variability exists in the population from which the samples are taken.

  • As the size of the sample increases, the variability of the sample means decreases.

  • As variability in a population increases, so does the variability of the sample means.
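The following sketch (our construction, assuming a normal population with mean 100) demonstrates the last two points empirically: the standard deviation of the sample means shrinks as the sample size grows, and grows with the variability of the population:

    import random
    import statistics

    def mean_of_one_sample(pop_std, sample_size):
        # Draw one random sample from a normal population with mean 100.
        sample = [random.gauss(100, pop_std) for _ in range(sample_size)]
        return statistics.mean(sample)

    # Variability of 1000 sample means for each combination of
    # population standard deviation and sample size.
    for pop_std in (10, 30):
        for n in (25, 400):
            means = [mean_of_one_sample(pop_std, n) for _ in range(1000)]
            print(pop_std, n, round(statistics.stdev(means), 2))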

DESCRIBING DATA SETS WITH BOXPLOTS

The final tool to be discussed in this chapter is the boxplot, a very useful graphical method for summarizing data. Boxplots can be used in two ways: either to describe a single variable in a data set or to compare two (or more) variables. The keys to understanding a boxplot are the following:

  • The right and left edges of the box are at the third and first quartiles. Therefore, the length of the box equals the interquartile range (IQR), and the box represents the middle 50% of the observations. The height of the box has no significance.

  • The vertical line inside the box indicates the location of the median. The point inside the box indicates the location of the mean.

  • Horizontal lines (the whiskers) are drawn from each side of the box. They extend to the most extreme observations that are no farther than 1.5 IQRs from the box. They are useful for indicating variability and skewness.

  • Observations farther than 1.5 IQRs from the box are shown as individual points. If they are between 1.5 and 3 IQRs from the box, they are called mild outliers and are plotted as hollow points. Otherwise, they are called extreme outliers and are plotted as solid points.

Boxplots are probably most useful for comparing two populations graphically. As for the terminology and the conventional interpretation of the plot, we owe it all to the statistician John Tukey.
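As a sketch, the following Python code draws horizontal boxplots of the Table 3.1 data with matplotlib (note that matplotlib marks the mean only when asked, and by default does not distinguish mild from extreme outliers in the way described above):

    import matplotlib.pyplot as plt

    trucks = [4, 4, 5, 5, 5, 5, 5, 5, 6, 6]
    sedans = [0, 0, 0, 0, 0, 3, 10, 10, 12, 15]

    # Horizontal boxes: edges at the quartiles, whiskers out to the most
    # extreme points within 1.5 IQRs, points beyond that drawn individually.
    plt.boxplot([trucks, sedans], vert=False, showmeans=True)
    plt.yticks([1, 2], ["Trucks", "Sedans"])
    plt.xlabel("Problems per week")
    plt.show()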



