Revealing the invisible connections in the world is just a matter of recording observations and computing the magical, mystical correlation coefficient.
You probably make all sorts of assumptions about why people feel the way they feel or do the things they do. Statistical researchers would call these assumptions hypotheses about the relationship among variables.
Regardless of what science calls it, you probably do it. You might make these guesses about associations between attitudes and behavior or between attitudes and attitudes or behaviors and behaviors. You might do it informally as you seek to understand people in the world around you, or you might need to do it as a marketing specialist to understand your customer, or you might be a struggling psychology graduate student who needs to complete a class assignment that requires statistical analysis of the relationship between self-esteem and depression.
In statistics, such a relationship is called a correlation. The number describing the size of that relationship is a correlation coefficient. By computing this useful value, you can get answers to any question you have about relationships (except in terms of dating relationships; you're on your own there).
Testing Hypotheses About Relationships
Imagine a study in which a researcher for the American Cheesecake Sellers Association has a hypothesis that the reason people like cheesecake is that they like cheese. She is guessing that there is a relationship between attitude toward cheese and attitude toward cheesecake. If her hypothesis turns out to be correct, she'll purchase the huge mailing list of cheese lovers from the American Cheese Lovers Association and send them informative brochures about the healing properties of cheesecake. If she's right, sales will rocket up!
To test her hypothesis, she creates two surveys. One asks respondents to say how they feel about cheese, and the other asks how they feel about cheesecake. A score of 50 means the person loves cheese (or cheesecake), and a score of 0 means the person hates cheesecake (or cheese). Table 2-1 shows the results for the data she collects from five people on the bus on her way to work.
Let's look at the data and see if there seems to be a relationship between the two variables. (Go ahead, I'll give you 30 seconds.)
I'd say there is a pretty clear relationship there. The people who scored the highest on the cheese scale also scored the highest on the cheesecake scale. The five people didn't score exactly the same on both scales, of course, and the rank order isn't even identical, but, relatively speaking, each person's position relative to the others on cheese attitude is about the same as on cheesecake attitude. The Association's marketer has support for her hypothesis.
Computing a Correlation Coefficient
Just eyeballing two columns of numbers from a sample, though, is usually not enough to really know whether there is a relationship between two things. The marketing specialist in our example wants to use a single number to more precisely describe whatever relationship is seen.
The correlation coefficient takes into account all the information we used when we looked at our two columns of numbers in Table 2-1 and decided whether there was a relationship there. It is produced through a formula that compares, pair by pair, where each person stands on one variable with where that same person stands on the other.
If this were a statistics textbook, I'd have to present a somewhat complicated formula for calculating the correlation coefficient. To call it somewhat complicated is generous. Frankly, it is terrifically frightening. For your own sanity, I'm not even going to show it to you. Trust me. Instead, I'll show you this pleasant, friendly looking formula (which works just as well):

r = ΣZxZy / (N - 1)
Z refers to a Z-score, which is the distance of a score from the mean, divided by the standard deviation for that distribution. So, Zx means all the Z-scores from the first column, and Zy means all the Z-scores from the second column. ZxZy means multiply them together. The Σ symbol means add up. So, the equation says to multiply together all the pairs of Z-scores and add those cross-products together. Then, divide by the number (N) of pairs of scores minus 1.
The mean is the arithmetic average of a group of scores. It is produced by adding up all the numbers and dividing by the number of scores. A standard deviation for a group of numbers is the average distance of each score from the mean.
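These two definitions translate directly into a few lines of Python. The scores below are made up for illustration (Table 2-1's actual values aren't reproduced here), and the standard deviation shown is the sample version, which divides by N - 1 to match the correlation formula's denominator:

```python
from math import sqrt

scores = [50, 45, 30, 30, 10]  # hypothetical cheese-attitude scores

# Mean: add up all the numbers and divide by how many there are.
mean = sum(scores) / len(scores)

# Sample standard deviation: square each score's distance from the
# mean, average the squared distances using N - 1, then take the
# square root.
sd = sqrt(sum((s - mean) ** 2 for s in scores) / (len(scores) - 1))

print(mean)          # 33.0
print(round(sd, 2))  # 15.65
```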
Before I produce the Z-scores used in our correlation formula, I need to know the means and standard deviations for each column of data. Equations for calculating these key values are provided in "Describe the World Using Just Two Numbers" [Hack #2]. Here are the means and standard deviations for our two variables:
Table 2-2 shows some of the calculations for our cheese attitude data.
The correlation is .93. This is very close to 1.0, which is the strongest a positive correlation can be, so the cheese-to-cheesecake correlation represents a very strong relationship.
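The whole calculation fits comfortably in a short Python sketch. The scores below are made up (Table 2-1's actual values aren't reproduced here), chosen only so that they happen to yield a similarly strong correlation:

```python
from math import sqrt

def correlation(xs, ys):
    """Multiply each pair of Z-scores, add the cross-products,
    and divide by N - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

cheese = [50, 45, 30, 30, 10]      # hypothetical cheese attitudes
cheesecake = [36, 35, 22, 25, 20]  # hypothetical cheesecake attitudes

print(round(correlation(cheese, cheesecake), 2))  # 0.93
```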
Interpreting a Correlation Coefficient
Somewhat magically, the correlation formula process produces a number, ranging in value from -1.00 to +1.00, that measures the strength of relationship between two variables. Positive signs indicate the relationship is in the same direction. As one value increases, the other value increases. Negative signs indicate the relationship is in the opposite direction. As one value increases, the other value decreases. An important point to make is that the correlation coefficient provides a standardized measure of the strength of the linear relationship between two variables [Hack #12].
The direction of a correlation (whether it is negative or positive) is the artificial result of the direction of the scale one chooses to use to measure the variables. In other words, strong correlations can be negative. Think of a measure of golf skill correlated with average golf score. The higher the skill, the lower the score, so the correlation would be negative, but you would still expect the relationship to be strong.
Statistical Significance and Correlations
Our marketing specialist is likely also interested in whether a sample correlation is large enough that it is likely to have been drawn from a population where the correlation is bigger than zero. In other words, is the correlation we found in our sample so large that it must have come from a population where there is at least some sort of relationship between these variables?
The marketer in our example trusts correlations between a large number of pairs more than she does correlations from a small sample (such as our five bus riders). If she were to report this relationship to her boss and it wasn't true about most people, she might find herself selling cheesecake out of her minivan for a living.
Table 2-3 shows how large a correlation in a sample must be before statisticians are sure that there is a relationship greater than zero in the population it represents.
With our sample of five people, any correlation at least as big as .88 would be treated as statistically significant (which means "so big it probably exists in whatever population you took your sample from").
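That cutoff comes from the t distribution: a sample correlation is significant at the .05 level when t = r√(N − 2)/√(1 − r²) exceeds the two-tailed critical t value for N − 2 degrees of freedom, and solving for r gives cutoffs like those in Table 2-3. The sketch below hardcodes a few critical t values copied from a standard t table rather than pulling in a stats library:

```python
from math import sqrt

# Two-tailed .05 critical values of t (degrees of freedom = N - 2),
# copied from a standard t table.
T_CRIT_05 = {1: 12.706, 2: 4.303, 3: 3.182, 8: 2.306, 28: 2.048}

def critical_r(n_pairs):
    """Smallest sample correlation treated as statistically
    significant at the .05 level for n_pairs pairs of scores."""
    df = n_pairs - 2
    t = T_CRIT_05[df]
    return t / sqrt(t ** 2 + df)

print(round(critical_r(5), 2))   # 0.88 -- our five bus riders
print(round(critical_r(30), 2))  # 0.36 -- bigger samples need less
```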
Where Else It Works
You can produce a correlation coefficient as a measure of the strength of a relationship between any two variables as long as certain conditions are met, chiefly that both variables are measured on numeric scales and that the relationship between them is linear.
Dire Warning About Correlations
It's tempting to treat correlational evidence as evidence of cause and effect. Of course, there might be all sorts of reasons why two things are related that have nothing to do with one thing causing the other.
For example, in the presence of such a strong correlation between attitude toward cheese and attitude toward cheesecake, you might want to conclude that a person's affinity for cheese causes him to like cheesecake because there is cheese in it. There might be noncausal explanations, though. The same people who like cheese might tend to like cheesecake because they like all foods that are kind of soft and smooshy.