CORRELATION


Although plots give you a pretty good idea of the strength of a linear association, they do not provide an objective summary measure that you can use to compare and summarize the relationships between pairs of variables. On the basis of the plots, you could say that a relationship between variable X and variable Y appears to be present (strong or weak, depending on how tightly the points cluster around a line), but you cannot really say how strong the relationship is unless you have a summary measure that quantifies your visual impressions. This is reflected in Figure 8.3, which summarizes a combination of outcomes in matrix form. Values above the diagonal are bivariate correlations, with the corresponding scatterplots below the diagonal. The diagonal portrays the distribution of each variable.

Figure 8.3: Scatterplot matrix of metric variables.

The most commonly used measure is the Pearson correlation coefficient, which is abbreviated as r. (The statistic is named after Karl Pearson, an eminent statistician of the early twentieth century.) Here are some of its characteristics:

  • If no linear relationship exists between two variables, the value of the coefficient is 0.

  • If a perfect positive linear relationship is present, the value is +1.

  • If a perfect negative linear relationship is present, the value is -1.

To summarize, the values of the coefficient range from -1 to +1, with a value of 0 indicating no linear relationship. Positive values mean that a positive relationship exists between the variables; negative values mean that a negative relationship is present. If one pair of variables has a correlation coefficient of +.8 while another pair has a coefficient of -.8, the strength of the relationship is the same for both; only the direction differs.
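As an illustration, here is a minimal sketch (Python with NumPy; the data are invented for the demonstration) that computes the coefficient for a perfect positive line, a perfect negative line, and two unrelated variables:

    import numpy as np

    x = np.arange(10, dtype=float)
    print(np.corrcoef(x, 2 * x + 3)[0, 1])    # perfect positive line: +1
    print(np.corrcoef(x, -x + 7)[0, 1])       # perfect negative line: -1

    rng = np.random.default_rng(0)
    print(np.corrcoef(rng.normal(size=10_000),
                      rng.normal(size=10_000))[0, 1])   # unrelated variables: near 0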

Does a correlation coefficient of zero mean that no relationship exists between two variables? Not necessarily. The Pearson correlation coefficient measures only the strength of a linear relationship. Two variables can have a correlation coefficient close to zero and yet have a very strong nonlinear relationship. Look at Figure 8.4, which is a plot of two hypothetical variables: a strong relationship clearly exists between them, yet the value of the correlation coefficient is close to zero. Always plot the values of the variables before you compute a correlation coefficient, so that you can detect nonlinear relationships, for which the Pearson correlation coefficient is not a good summary measure.

Figure 8.4: Strong relationship but very low correlation.
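A short sketch of the situation in Figure 8.4 (hypothetical data, assuming Python with NumPy): y is completely determined by x, yet the Pearson coefficient is essentially zero because the relationship is curved rather than linear.

    import numpy as np

    x = np.linspace(-3, 3, 201)
    y = x ** 2                      # strong, perfectly deterministic relationship
    r = np.corrcoef(x, y)[0, 1]
    print(round(r, 4))              # approximately 0: r misses the curved pattern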

The mathematical formula that tells you how to calculate the correlation coefficient for a pair of variables is

    r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{(N - 1) S_X S_Y}

where X_i and Y_i are the values of the two variables for case i, \bar{X} and \bar{Y} are their means, N is the number of cases, and S_X and S_Y are the standard deviations of the two variables. It does not matter which variable you take to be X and which to be Y in the formula, since the correlation coefficient will be the same. The correlation coefficient is not expressed in any unit of measurement, and its value for two variables will be the same regardless of the units in which you measure them.
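To make the formula concrete, here is a minimal sketch (Python with NumPy; the function name pearson_r and the small data set are invented for illustration) that applies the formula directly and checks it against NumPy's built-in calculation:

    import numpy as np

    def pearson_r(x, y):
        # Pearson r computed directly from the formula above,
        # using sample standard deviations (ddof=1)
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        sx, sy = x.std(ddof=1), y.std(ddof=1)
        return np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * sx * sy)

    x = [1.0, 2.0, 4.0, 5.0, 7.0]
    y = [2.1, 2.9, 6.2, 7.8, 9.4]
    print(pearson_r(x, y))               # matches the built-in result below
    print(np.corrcoef(x, y)[0, 1])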

Sometimes the correlation coefficient is used simply to summarize the strength of a linear relationship between two variables. In other situations you may want to do more than that; you may want to test hypotheses about the population correlation coefficient. For example, you may want to test the null hypothesis that no linear relationship exists between variable X and variable Y in the population. Remember, if your data are a random sample from a particular population, you want to be able to draw conclusions about the population based on the results you observe in your sample. As was the case with other descriptive measures such as the mean, you know that the value of the correlation coefficient you calculate for your sample will not exactly equal the value that you would obtain if you had values for the entire population. You know that if you took many samples from the same population and calculated the correlation coefficients, their values would vary. That is, there is a distribution of possible values of the correlation coefficient, just as there is a distribution of possible values for sample means. If you know what the distribution is, you can calculate observed significance levels. For example, you can calculate how often you would expect to find, in samples of a particular size, a coefficient of .3 or greater when the population value is zero.
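That last question can be checked by simulation. The sketch below (Python with NumPy; the sample size of 30 and the 10,000 replications are arbitrary choices) draws repeated samples from a population in which the true correlation is zero and counts how often the sample coefficient reaches .3 in absolute value:

    import numpy as np

    rng = np.random.default_rng(42)
    n, trials = 30, 10_000

    count = 0
    for _ in range(trials):
        x = rng.normal(size=n)
        y = rng.normal(size=n)           # independent, so the population correlation is 0
        if abs(np.corrcoef(x, y)[0, 1]) >= 0.3:
            count += 1

    print(count / trials)                # proportion of samples with |r| >= .3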

DOES SIGNIFICANT MEAN IMPORTANT?

If you reject the null hypothesis, does that mean that an important relationship exists between the two variables? No. It simply means that it is unlikely that the value of the correlation coefficient is zero in the population. For large sample sizes, even very small correlation coefficients have small observed significance levels. You can have a correlation coefficient of .1 and have it be statistically significant. It indicates that a very small, but nonzero, linear relationship exists between the variables. You should look at both the value of the coefficient and its associated significance level when evaluating the relationships among variables.
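A brief demonstration of this point (assuming Python with NumPy and SciPy; the sample size and effect size are invented): with 5,000 cases, a coefficient near .1 is highly significant even though the relationship is very weak.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 5000
    x = rng.normal(size=n)
    y = 0.1 * x + rng.normal(size=n)     # weak but real linear relationship

    r, p = stats.pearsonr(x, y)
    print(f"r = {r:.3f}, p = {p:.2g}")   # r near .1, yet p is tiny at this sample size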

ONE-TAILED AND TWO-TAILED SIGNIFICANCE PROBABILITIES

If you do not know before looking at your data whether a pair of variables should be positively or negatively correlated, you must use a two-tailed significance level: you reject the null hypothesis for either large positive or large negative values of the correlation coefficient. If you know in advance whether your variables should be positively or negatively correlated, you can use a one-tailed significance test. For example, if you are studying the relationship between total yearly income and the value of housing, you know that if a relationship exists, it will be positive, since poor people cannot own expensive houses.

For a one-tailed test, you reject the null hypothesis only if the value of the correlation coefficient is large and in the direction you specified. The one-tailed observed significance level is half of the two-tailed value, because you calculate the probability of obtaining a more extreme value in one direction only, not in both. If you do not specify what kind of test you want, most software packages default to a two-tailed test; to obtain a one-tailed test, you must request it with the appropriate option of the software.
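The relationship between the two significance levels is easy to verify. This sketch (Python with SciPy; the function name r_pvalues and the example values are invented for illustration) converts r to a t statistic with N - 2 degrees of freedom and computes both p-values; when r is in the predicted direction, the one-tailed value is half the two-tailed value.

    import numpy as np
    from scipy import stats

    def r_pvalues(r, n):
        # t transformation for testing r = 0, with n - 2 degrees of freedom
        t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
        p_two = 2 * stats.t.sf(abs(t), df=n - 2)
        p_one = stats.t.sf(t, df=n - 2)  # one-tailed, positive direction
        return p_two, p_one

    print(r_pvalues(r=0.35, n=30))       # the one-tailed value is half the two-tailed one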

ASSUMPTIONS ABOUT THE DATA

In order to test hypotheses about the Pearson correlation coefficient you have to make certain assumptions about the data. If your data are a random sample from a population in which the distribution of the two variables together is normal, the previously described procedure is appropriate. If it seems unreasonable to assume that the variables are from normal distributions, you may have to use other statistical procedures that do not require the normality assumption. These are nonparametric procedures and are described in detail in statistics books such as Siegel (1956). Some of them will be summarized at the end of this chapter.
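One such nonparametric alternative is the Spearman rank correlation, sketched below (Python with SciPy; the data are invented and deliberately non-normal). Because it works on ranks, it does not require the normality assumption.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x = rng.exponential(size=50)             # clearly non-normal variable
    y = x ** 2 + rng.exponential(size=50)    # monotonically related to x

    rho, p = stats.spearmanr(x, y)           # rank-based coefficient and its p-value
    print(f"Spearman rho = {rho:.3f}, p = {p:.2g}")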

EXAMINING MANY COEFFICIENTS

If your study involves many variables, you may be tempted to compute all possible correlation coefficients among them. If you are just interested in exploring possible associations among the variables, you may find the coefficients helpful in identifying possible relationships. However, you must be careful when examining the significance levels from large tables. If you have enough coefficients, you expect some of them to be statistically significant even if no relationship exists between the variables in the population. If you compute 100 coefficients, you expect around five of them to have observed significance levels less than .05 even if none of the variables are truly related. Think about it: that is what a significance level means. The simulation below illustrates the point.
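This simulation (assuming Python with NumPy and SciPy; the 100 tests on 50 cases each are arbitrary choices) generates unrelated variables and counts how many coefficients reach the .05 level by chance alone:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, n_tests = 50, 100

    false_hits = 0
    for _ in range(n_tests):
        x = rng.normal(size=n)
        y = rng.normal(size=n)           # no relationship in the population
        if stats.pearsonr(x, y)[1] < 0.05:
            false_hits += 1

    print(false_hits)                    # around 5 "significant" results by chance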

When dealing with large samples and many variables, we must also be concerned about missing values. One common strategy is to compute each coefficient from all of the cases that have values for that particular pair of variables; this is called pairwise deletion of missing data, and it uses as much of the data as possible. Even so, analysis of data in which some cases have missing information can be troublesome, especially if you have reason to believe that the missing values are related to the values of one of the variables you are analyzing.

If your data have any missing values, you should see whether the missing values show a pattern. For example, you can compare the averages of a variable for the cases that are missing values on another variable and for the cases that are not. If these averages are quite different, you have reason to suspect that the values are not randomly missing. When values are not randomly missing, you must use great caution in attempting to analyze the data; in fact, you may not be able to analyze some of it. (With pairwise deletion of missing data, you could even end up with correlation coefficients that are based on entirely different groups of cases.)
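A short sketch of the difference (assuming Python with pandas and NumPy; the data frame and the missing-data mechanism are invented): pandas computes correlations with pairwise deletion by default, while dropping incomplete cases first gives listwise deletion, and the two can differ noticeably when values are not randomly missing.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(5)
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
    df.loc[df["a"] > 1, "b"] = np.nan    # b is missing in a way that depends on a

    print(df.corr())                     # pairwise deletion: each pair uses its own cases
    print(df.dropna().corr())            # listwise deletion: complete cases only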

Perhaps one of the most confusing issues in correlation is the notion of whether correlation and cause are the same. So, if two variables are correlated, does that mean one of them causes the other? Not at all. You can never assume that just because two variables are correlated, one of them causes the other. If you find a large correlation coefficient between the ounces of coffee consumed in a day and number of auto accidents in a year, you cannot conclude that coffee consumption causes auto accidents. It may well be that coffee drinkers also consume more alcohol, or are older, or are more poorly coordinated than people who do not drink coffee. You cannot easily tell which of the factors may influence the occurrence of accidents.

How can you summarize the strength of the linear relationship between two variables? Here are some key points to remember:

  • The Pearson correlation coefficient measures the strength of the linear relationship between variables.

  • Two variables have a positive relationship if, as the values of one variable increase, so do the values of the other.

  • Two variables have a negative relationship if, as the values of one variable increase, the values of the other decrease.

  • A correlation coefficient of +1 means that a perfect positive linear relationship exists between two variables. A value of -1 means that a perfect negative linear relationship exists.

  • A correlation coefficient only measures the strength of a linear relationship. If a strong nonlinear relationship between two variables is present, the correlation coefficient can be zero.

  • A correlation between two variables does not necessarily mean that one causes the other.

  • To test the null hypothesis that the correlation coefficient is zero in the population, you can calculate the observed significance level for the coefficient.

  • You can use a one-tailed test if you know in advance whether the relationship between two variables is positive or negative.

  • If you have missing values in your data, you should see whether there is a pattern to the cases for which information is missing.



