Revealing the invisible connections in the world is just a matter of recording observations and computing the magical, mystical correlation coefficient.
You probably make all sorts of assumptions about why people feel the way they feel or do the things they do. Statistical researchers would call these assumptions hypotheses about the relationship among variables.
Regardless of what science calls it, you probably do it. You might make these guesses about associations between attitudes and behavior or between attitudes and attitudes or behaviors and behaviors. You might do it informally as you seek to understand people in the world around you, or you might need to do it as a marketing specialist to understand your customer, or you might be a struggling psychology graduate student who needs to complete a class assignment that requires statistical analysis of the relationship between self-esteem and depression.
In statistics, such a relationship is called a correlation. The number describing the size of that relationship is a correlation coefficient. By computing this useful value, you can get answers to any question you have about relationships (except in terms of dating relationships; you're on your own there).
Testing Hypotheses About Relationships
Imagine a study in which a researcher for the American Cheesecake Sellers Association has a hypothesis that the reason people like cheesecake is that they like cheese. She is guessing that there is a relationship between attitude toward cheese and attitude toward cheesecake. If her hypothesis turns out to be correct, she'll purchase the huge mailing list of cheese lovers from the American Cheese Lovers Association and send them informative brochures about the healing properties of cheesecake. If she's right, sales will rocket up!
To test her hypothesis, she creates two surveys. One asks respondents to say how they feel about cheese, and the other asks how they feel about cheesecake. A score of 50 means the person loves cheese (or cheesecake), and a score of 0 means the person hates cheesecake (or cheese). Table 2-1 shows the results for the data she collects from five people on the bus on her way to work.
Let's look at the data and see if there seems to be a relationship between the two variables. (Go ahead, I'll give you 30 seconds.)
I'd say there is a pretty clear relationship there. The people who scored the highest on the cheese scale also scored the highest on the cheesecake scale. The five people didn't score exactly the same on both scales, of course, and the rank order isn't even identical, but, relatively speaking, each person's position relative to the others on cheese attitude is about the same as on cheesecake attitude. The Association's marketer has support for her hypothesis.
Computing a Correlation Coefficient
Just eyeballing two columns of numbers from a sample, though, is usually not enough to really know whether there is a relationship between two things. The marketing specialist in our example wants to use a single number to more precisely describe whatever relationship is seen.
The correlation coefficient takes into account all the information we used when we looked at our two columns of numbers in Table 2-1 and decided whether there was a relationship there. It is produced through a formula that compares, pair by pair, where each person stands on one variable with where that same person stands on the other.
If this were a statistics textbook, I'd have to present a somewhat complicated formula for calculating the correlation coefficient. To call it somewhat complicated is generous. Frankly, it is terrifically frightening. For your own sanity, I'm not even going to show it to you. Trust me. Instead, I'll show you this pleasant, friendly looking formula (which works just as well):

r = ΣZxZy / (N - 1)
Z refers to a Z-score, which is the distance of a score from the mean, divided by the standard deviation for that distribution. So, Zx means all the Z-scores from the first column, and Zy means all the Z-scores from the second column. ZxZy means multiply them together. The Σ symbol means add up. So, the equation says to multiply together all the pairs of Z-scores and add those cross-products together. Then, divide by the number (N) of pairs of scores minus 1.
The mean is the arithmetic average of a group of scores. It is produced by adding up all the numbers and dividing by the number of scores. A standard deviation for a group of numbers is the average distance of each score from the mean.
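These two definitions translate directly into a few lines of Python. The scores below are made up for illustration (Table 2-1's actual values aren't reproduced here), and the standard deviation shown is the sample version, which divides by N - 1 to match the correlation formula's denominator:

```python
from math import sqrt

scores = [50, 45, 30, 30, 10]  # hypothetical cheese-attitude scores

# Mean: add up all the numbers and divide by how many there are.
mean = sum(scores) / len(scores)

# Sample standard deviation: square each score's distance from the
# mean, average the squared distances using N - 1, then take the
# square root.
sd = sqrt(sum((s - mean) ** 2 for s in scores) / (len(scores) - 1))

print(mean)          # 33.0
print(round(sd, 2))  # 15.65
```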
Before I produce the Z-scores used in our correlation formula, I need to know the means and standard deviations for each column of data. Equations for calculating these key values are provided in "Describe the World Using Just Two Numbers" [Hack #2]. Here are the means and standard deviations for our two variables:
Table 2-2 shows some of the calculations for our cheese attitude data.
The correlation is .93. This is very close to 1.0, which is the strongest a positive correlation can be, so the cheese-to-cheesecake correlation represents a very strong relationship.
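The whole calculation fits comfortably in a short Python sketch. The scores below are made up (Table 2-1's actual values aren't reproduced here), chosen only so that they happen to yield a similarly strong correlation:

```python
from math import sqrt

def correlation(xs, ys):
    """Multiply each pair of Z-scores, add the cross-products,
    and divide by N - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

cheese = [50, 45, 30, 30, 10]      # hypothetical cheese attitudes
cheesecake = [36, 35, 22, 25, 20]  # hypothetical cheesecake attitudes

print(round(correlation(cheese, cheesecake), 2))  # 0.93
```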
Interpreting a Correlation Coefficient
Somewhat magically, the correlation formula process produces a number, ranging in value from -1.00 to +1.00, that measures the strength of relationship between two variables. Positive signs indicate the relationship is in the same direction. As one value increases, the other value increases. Negative signs indicate the relationship is in the opposite direction. As one value increases, the other value decreases. An important point to make is that the correlation coefficient provides a standardized measure of the strength of the linear relationship between two variables [Hack #12].
The direction of a correlation (whether it is negative or positive) is the artificial result of the direction of the scale one chooses to use to measure the variables. In other words, strong correlations can be negative. Think of a measure of golf skill correlated with average golf score. The higher the skill, the lower the score, so the correlation would be negative, but you would still expect the relationship to be strong.
Statistical Significance and Correlations
Our marketing specialist is likely also interested in whether a sample correlation is large enough that it is likely to have been drawn from a population where the correlation is bigger than zero. In other words, is the correlation we found in our sample so large that it must have come from a population where there is at least some sort of relationship between these variables?
The marketer in our example trusts correlations between a large number of pairs more than she does correlations from a small sample (such as our five bus riders). If she were to report this relationship to her boss and it wasn't true about most people, she might find herself selling cheesecake out of her minivan for a living.
Table 2-3 shows how large a correlation in a sample must be before statisticians are sure that there is a relationship greater than zero in the population it represents.
With our sample of five people, any correlation at least as big as .88 would be treated as statistically significant (which means "so big it probably exists in whatever population you took your sample from").
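That cutoff comes from the t distribution: a sample correlation is significant at the .05 level when t = r√(N − 2)/√(1 − r²) exceeds the two-tailed critical t value for N − 2 degrees of freedom, and solving for r gives cutoffs like those in Table 2-3. The sketch below hardcodes a few critical t values copied from a standard t table rather than pulling in a stats library:

```python
from math import sqrt

# Two-tailed .05 critical values of t (degrees of freedom = N - 2),
# copied from a standard t table.
T_CRIT_05 = {1: 12.706, 2: 4.303, 3: 3.182, 8: 2.306, 28: 2.048}

def critical_r(n_pairs):
    """Smallest sample correlation treated as statistically
    significant at the .05 level for n_pairs pairs of scores."""
    df = n_pairs - 2
    t = T_CRIT_05[df]
    return t / sqrt(t ** 2 + df)

print(round(critical_r(5), 2))   # 0.88 -- our five bus riders
print(round(critical_r(30), 2))  # 0.36 -- bigger samples need less
```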
Where Else It Works
You can produce a correlation coefficient as a measure of the strength of a relationship between any two variables as long as certain conditions are met, chiefly that both variables are measured on numeric scales and that the relationship between them is linear.
Dire Warning About Correlations
It's tempting to treat correlational evidence as evidence of cause and effect. Of course, there might be all sorts of reasons why two things are related that have nothing to do with one thing causing the other.
For example, in the presence of such a strong correlation between attitude toward cheese and attitude toward cheesecake, you might want to conclude that a person's affinity for cheese causes him to like cheesecake because there is cheese in it. There might be noncausal explanations, though. The same people who like cheese might tend to like cheesecake because they like all foods that are kind of soft and smooshy.