MEASURES OF ASSOCIATION FOR VARIABLES

When you have variables that are measured on a nominal scale, you are limited in what you can say about their relationship. You cannot say that marital status increases as religious affiliation increases, or that automobile color decreases with increasing state of residence. You cannot say anything about the direction of the association. If the categories of the variables do not have a meaningful order, it does not make sense to say they are associated in one direction or another. All you can do is try to measure the strength of the association. Two types of measures of association are useful for nominal variables: measures based on chi-square and measures of proportional reduction in error (called PRE measures). We will look at each of these in turn.

MEASURES BASED ON CHI-SQUARE

We just finished discussing why the chi-square statistic is not a good measure of association. However, since its use is common in tests of independence, people have tried to construct measures of association based on it. The measures based on chi-square attempt to modify it so that it is not influenced by sample size and so that it falls in the range of zero to one, where a value of zero corresponds to no association and a value of one to perfect association. Without such adjustments, you cannot compare chi-square values from tables with different sample sizes and different dimensions.

The phi coefficient. This is one of the simplest modifications of the chi-square statistic. To calculate a phi coefficient, just divide the chi-square value by the sample size N and then take the square root. The formula is

$$\phi = \sqrt{\frac{\chi^2}{N}}$$

The maximum value of phi depends on the size of the table. If a table has more than two rows or two columns, the phi coefficient can be greater than one, an undesirable feature.

The coefficient of contingency. This measure is always less than or equal to one. It is often abbreviated with the letter C. It is calculated from the chi-square statistic using the following formula:

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}}$$

Although the value of C is always between 0 and 1, it can never get as high as 1, even for a table showing what seems to be a perfect relationship. The largest value it can have depends on the number of rows and columns in the table. For example, if you have a four-by-four table, the largest possible value of C is .87.

Cramer's V. This is a chi-square-based measure of association that can attain the value of 1 for tables of any dimension. Its formula is:

$$V = \sqrt{\frac{\chi^2}{N(k-1)}}$$

where k is the smaller of the number of rows and columns. If the number of rows or columns is two, Cramer's V is identical in value to phi.
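To make these coefficients concrete, here is a minimal Python sketch (the function names are ours, not part of any package) that computes each of them from a table of observed counts:

    import numpy as np

    def chi_square(table):
        # Pearson chi-square: sum of (observed - expected)^2 / expected,
        # where expected = (row total * column total) / N for each cell
        table = np.asarray(table, dtype=float)
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
        return ((table - expected) ** 2 / expected).sum()

    def phi_coefficient(table):
        # phi = sqrt(chi-square / N)
        return np.sqrt(chi_square(table) / np.sum(table))

    def contingency_c(table):
        # C = sqrt(chi-square / (chi-square + N))
        chi2 = chi_square(table)
        return np.sqrt(chi2 / (chi2 + np.sum(table)))

    def cramers_v(table):
        # V = sqrt(chi-square / (N * (k - 1))), k = min(rows, columns)
        table = np.asarray(table)
        k = min(table.shape)
        return np.sqrt(chi_square(table) / (table.sum() * (k - 1)))

For a two-by-two table, k - 1 is 1, so cramers_v returns the same value as phi_coefficient, as noted above.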

CALCULATING THE λ (LAMBDA)

The lambda statistic measures how much your error rate in predicting one variable decreases when you use information about a second variable. It is calculated as:

$$\lambda = \frac{\text{misclassified in situation 1} - \text{misclassified in situation 2}}{\text{misclassified in situation 1}}$$

Lambda tells you the proportion by which you can reduce your error in predicting the dependent variable if you know the independent variable. That is why it is called a proportional reduction in error measure. The largest value that lambda can take is one. A value of zero for lambda means the independent variable is of no help in predicting the dependent variable. When two variables are statistically independent, lambda is zero; but a lambda of zero does not necessarily imply statistical independence. As with all measures of association, lambda measures association in a very specific way: reduction in error when values of one variable are used to predict values of the other. If this particular type of association is absent, lambda is zero. Even when lambda is zero, other measures of association may find associations of a different kind. No measure of association is sensitive to every type of association imaginable.
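A minimal sketch of that arithmetic, assuming the rows of the count table represent the dependent variable (the function name is ours):

    import numpy as np

    def asymmetric_lambda(table):
        # Rows: dependent variable; columns: independent variable
        table = np.asarray(table)
        n = table.sum()
        # Situation 1: always predict the modal row category
        misclassified_1 = n - table.sum(axis=1).max()
        # Situation 2: within each column, predict the row with the largest count
        misclassified_2 = n - table.max(axis=0).sum()
        return (misclassified_1 - misclassified_2) / misclassified_1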

TWO DIFFERENT LAMBDAS

Lambda is not a symmetric measure. Its value depends on which variable you predict from which. Suppose that instead of predicting the excitement category based on marital happiness, you tried to predict the reverse: how happy a person's marriage was, based on how exciting the person found life to be. You would get a different value for lambda. The actual statistic is generated by the "crosstab" job that produced the cross-tabulation table. For this discussion, you must recognize that two lambdas exist: (1) the asymmetric lambda, which we just covered, and (2) the symmetric lambda.

Although lambda is not a symmetric statistic, if you have no reason to consider one of the variables dependent and the other independent, you can compute a symmetric lambda coefficient. You predict the first variable from the second and then the second variable from the first. The symmetric lambda is the sum of the two numerator differences divided by the sum of the two denominators, the totals misclassified without additional information in each direction. In other words, you just add up the numerators for the two lambdas, add up the denominators, and divide.
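Continuing the sketch above, the symmetric version just pools the numerators and denominators from the two prediction directions:

    def symmetric_lambda(table):
        table = np.asarray(table)
        n = table.sum()
        # Errors without the other variable: predict each variable's modal category
        err_rows = n - table.sum(axis=1).max()      # predicting the row variable
        err_cols = n - table.sum(axis=0).max()      # predicting the column variable
        # Errors when the other variable is known
        err_rows_given = n - table.max(axis=0).sum()
        err_cols_given = n - table.max(axis=1).sum()
        numerator = (err_rows - err_rows_given) + (err_cols - err_cols_given)
        return numerator / (err_rows + err_cols)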

Is it really possible for variables to be related and still have a lambda of zero? That does not sound right. Actually, this can happen easily, depending on the distribution of the dependent variable. For example, consider the following cross-tabulation table:

Cross-tabulation of DEP (dependent variable) by INDEP (independent variable):

                          INDEP
  Count              1.00    2.00    3.00   Row total
  DEP    1.00          19      10       1          30
         2.00          20      20      20          60
         3.00           1      10      19          30
  Column total         40      40      40         120
  (percent)          33.3    33.3    33.3       100.0

Using the SPSS software, we generate a table that looks like this:

  Statistic     Symmetric    With DEP Dependent    With INDEP Dependent
  Lambda          .12857          .00000                 .22500

  Number of Missing Observations = 0
We can see from the above information that the two variables are clearly associated, but value two of the dependent variable occurs most often in each category of the independent variable. You would predict that value whether or not you knew the independent variable. Since knowing the independent variable does not help at all, lambda equals zero. You can see that SPSS/PC+ reports a lambda of zero when DEP is the dependent variable. Remember: a measure of association is sensitive to a particular kind of association.
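Feeding the table above into the lambda sketches from earlier reproduces these figures (transposing the table makes INDEP the row, and therefore the dependent, variable):

    table = [[19, 10,  1],
             [20, 20, 20],
             [ 1, 10, 19]]

    asymmetric_lambda(table)                  # 0.0     (DEP dependent)
    asymmetric_lambda(np.transpose(table))    # 0.225   (INDEP dependent)
    symmetric_lambda(table)                   # 0.12857 (symmetric)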

MEASURES OF ASSOCIATION FOR ORDINAL VARIABLES

Lambda can be used as a measure of association for variables measured on ordinal scales as well as for variables measured on nominal scales. The computation of lambda, however, does not use the order information. Because of that, you can rearrange the order of the rows and columns in any way you want without changing the value of lambda at all.

Several measures of association make use of the additional information available for ordinal variables. They tell us not only about the strength of the association but the direction as well. For example, if one variable changes in the same direction as the other, then we say that the two variables have a positive relationship. If, on the other hand, the values of one variable increase while those of the other decrease, we can say the variables have a negative relationship. We cannot make statements like these about nominal variables, since the categories of the variables have no order. Values cannot increase or decrease unless they have an order.

CONCORDANT AND DISCORDANT PAIRS

Many ordinal measures of association are based on comparing pairs of cases. For example, look at the following data, which contain the values of Var1 and Var2 for three cases.

            Var1    Var2
  Case 1       1       2
  Case 2       2       3
  Case 3       3       2

Consider the pair of cases, Case 1 and Case 2. Both Case 2 values are larger than the corresponding values in Case 1. That is, the value for Var1 is larger for Case 2 than for Case 1, and the value for Var2 is larger for Case 2 than for Case 1. Such a pair of cases is called concordant. A pair of cases is concordant if the value of each variable is larger (or each is smaller) for one case than for the other case.

A pair of cases is discordant if the value of one variable for a case is larger than the value for the other case, but the direction is reversed for the second variable. For example, Case 2 and Case 3 are a discordant pair, since the value of Var1 for Case 3 is larger than for Case 2, but the value of Var2 is larger for Case 2 than for Case 3.

When two cases have identical values on one or both variables, they are said to be tied. Five different outcomes are possible when you compare two cases. They can be concordant, discordant, tied on the first variable, tied on the second variable, or tied on both variables. When data are arranged in a cross-tabulation, it is easy to compute the number of concordant, discordant, and tied pairs, just by looking at the table and adding up cell frequencies.
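A minimal Python sketch of this pair-by-pair bookkeeping (a helper of our own, not a library routine) makes the five outcomes explicit; several of the measures below reuse it:

    def pair_counts(x, y):
        # Classify every distinct pair of cases as concordant (P), discordant (Q),
        # tied only on the first variable (tx), tied only on the second (ty),
        # or tied on both (txy)
        P = Q = tx = ty = txy = 0
        for i in range(len(x)):
            for j in range(i + 1, len(x)):
                dx, dy = x[i] - x[j], y[i] - y[j]
                if dx == 0 and dy == 0:
                    txy += 1
                elif dx == 0:
                    tx += 1
                elif dy == 0:
                    ty += 1
                elif dx * dy > 0:
                    P += 1        # both variables move in the same direction
                else:
                    Q += 1        # the variables move in opposite directions
        return P, Q, tx, ty, txy

For the three cases above, pair_counts([1, 2, 3], [2, 3, 2]) returns one concordant pair, one discordant pair, and one pair tied on Var2 only.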

If most of the pairs are concordant, the association is said to be positive. As values of one variable increase (or decrease), so do the values of the other variable. If most of the pairs are discordant, the association is negative. As values of one variable increase, those of the other tend to decrease. If concordant and discordant pairs are equally likely, we say that no association is present.

MEASURES BASED ON CONCORDANT AND DISCORDANT PAIRS

The ordinal measures of association that we will consider are all based on the difference between the number of concordant pairs (P) and the number of discordant pairs (Q), calculated for all distinct pairs of observations. Since we want our measures of association to fall within a known range for all tables, we must standardize the difference P - Q (if possible, to the range from -1 to 1, where -1 indicates a perfect negative relationship, 1 indicates a perfect positive relationship, and 0 indicates no relationship). The measures differ in the way they attempt to standardize P - Q.

Goodman and Kruskal's Gamma

One way of standardizing the difference between the number of concordant and discordant pairs is to use Goodman and Kruskal's gamma. You calculate the difference between the number of concordant and discordant pairs (P - Q) and then divide this difference by the sum of the number of concordant and discordant pairs (P + Q). For example, for the three cases of Var1 and Var2 above, P = 1 and Q = 1, so gamma is (1 - 1)/(1 + 1) = 0: like and unlike pairs exactly balance. What does the sign mean? A positive gamma tells you that there are more "like" (concordant) pairs of cases than "unlike" (discordant) pairs. A negative gamma means that a negative relationship exists.
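In symbols, the rule just described is

$$\gamma = \frac{P - Q}{P + Q}$$

and, reusing the hypothetical pair_counts helper sketched earlier:

    def gamma(x, y):
        # Goodman and Kruskal's gamma: pairs with ties are ignored entirely
        P, Q, _, _, _ = pair_counts(x, y)
        return (P - Q) / (P + Q)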

The absolute value of gamma has a proportional reduction in error interpretation. What you are trying to predict is whether a pair of cases is like or unlike. In the first situation, you classify pairs as like or unlike based on the flip of a fair coin. In the second situation, you base your decision rule on whether you find more concordant or more discordant pairs. If most of the pairs are concordant you predict "like" for all pairs. If most of the pairs are discordant you predict "unlike." The absolute value of gamma (the numerical value, ignoring a minus sign if one is present) is the proportional reduction in error when the second rule is used instead of the first.

For example, if half of the pairs of cases are concordant and half are discordant, classifying all pairs as "like" misclassifies the same number of pairs as guessing randomly: one half. The value of gamma is then zero. If all the pairs are concordant, guessing "like" will result in correct classification of all pairs. Guessing randomly will classify only half of the pairs correctly. In this situation, the value of gamma is one.

If two variables are independent, the value of gamma is zero. However, a gamma of zero does not necessarily mean independence. (If the table is two by two, though, a gamma of zero does mean that the variables are independent.)

Kendall's Tau-b

Gamma ignores all pairs of cases that involve ties. A measure that attempts to normalize P - Q by considering ties on each variable in a pair separately (but not ties on both variables) is tau-b. It is computed as

$$\tau_b = \frac{P - Q}{\sqrt{(P + Q + T_x)(P + Q + T_y)}}$$

where T_x is the number of ties involving only the first variable, and T_y is the number of ties involving only the second variable. Tau-b can have the values of +1 and -1 only for square tables. Since the denominator is complicated, there is no simple explanation in terms of proportional reduction in error. However, tau-b is a commonly used measure.
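As a sketch, again reusing the hypothetical pair_counts helper from earlier:

    import math

    def tau_b(x, y):
        # Pairs tied on both variables are excluded from both correction terms
        P, Q, tx, ty, _ = pair_counts(x, y)
        return (P - Q) / math.sqrt((P + Q + tx) * (P + Q + ty))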

Tau-c

A measure that can attain, or nearly attain, the values of +1 and -1 for a table of any size is tau-c. It is computed as

$$\tau_c = \frac{2m(P - Q)}{N^2(m - 1)}$$

where m is the smaller of the number of rows and columns and N is the number of cases. Unfortunately, no simple proportional reduction in error interpretation is possible for tau-c either.
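A corresponding sketch (m, the smaller table dimension, must be supplied; N is taken to be the number of cases):

    def tau_c(x, y, m):
        # m is the smaller of the number of rows and columns in the table
        P, Q, _, _, _ = pair_counts(x, y)
        n = len(x)
        return 2 * m * (P - Q) / (n ** 2 * (m - 1))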

Somers' d

Gamma, tau-b, and tau-c are all symmetric measures. It does not matter whether one of the variables is considered dependent. The value of the statistic is the same. Somers proposed an extension of gamma in which one of the variables is considered dependent. It differs from gamma only in that the denominator is the sum of all pairs of cases that are not tied on the independent variable. (In gamma, all cases involving ties are excluded from the denominator.)
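Written out under that description, with the second variable treated as dependent and T_y the number of pairs tied on it (but not on the independent variable), the usual form is

$$d = \frac{P - Q}{P + Q + T_y}$$

or, as a sketch on top of the pair_counts helper:

    def somers_d(x, y):
        # y is the dependent variable; the denominator counts every pair
        # that is not tied on the independent variable x
        P, Q, tx, ty, _ = pair_counts(x, y)
        return (P - Q) / (P + Q + ty)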

MEASURES INVOLVING INTERVAL DATA

If the two variables are measured on an interval scale, you can calculate coefficients that make use of this additional information. The Pearson correlation coefficient, discussed later, measures the strength of what is called a linear association. The eta (η) coefficient can be used when a dependent variable is measured on an interval scale and the independent variable is measured on a nominal or ordinal scale. When eta is squared, it can be interpreted as the proportion of the total variance in the dependent variable that can be accounted for by knowing the values of the independent variable.
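A minimal sketch of the eta-squared computation (the function name and layout are ours: each entry of groups holds the interval-scale dependent values for one category of the independent variable):

    import numpy as np

    def eta_squared(groups):
        # Proportion of total variation in the dependent variable accounted
        # for by the grouping: between-group SS divided by total SS
        values = np.concatenate([np.asarray(g, dtype=float) for g in groups])
        grand_mean = values.mean()
        total_ss = ((values - grand_mean) ** 2).sum()
        between_ss = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
        return between_ss / total_ss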



