The FREQ procedure provides easy access to statistics for testing for association in a crosstabulation table.
In this example, high school students applied for courses in a summer enrichment program: these courses included journalism, art history, statistics, graphic arts, and computer programming. The students accepted were randomly assigned to classes with and without internships in local companies. The following table contains counts of the students who enrolled in the summer program by gender and whether they were assigned an internship slot.
Gender | Internship | Enrollment: Yes | Enrollment: No | Total |
---|---|---|---|---|
boys | yes | 35 | 29 | 64 |
boys | no | 14 | 27 | 41 |
girls | yes | 32 | 10 | 42 |
girls | no | 53 | 23 | 76 |
The SAS data set SummerSchool is created by entering the summer enrichment data as cell count data, that is, by providing the frequency count for each combination of variable values. The following DATA step statements create the SAS data set SummerSchool.
data SummerSchool;
   input Gender $ Internship $ Enrollment $ Count @@;
   datalines;
boys  yes yes 35   boys  yes no 29
boys  no  yes 14   boys  no  no 27
girls yes yes 32   girls yes no 10
girls no  yes 53   girls no  no 23
;
The variable Gender takes the values 'boys' or 'girls', the variable Internship takes the values 'yes' and 'no', and the variable Enrollment takes the values 'yes' and 'no'. The variable Count contains the number of students corresponding to each combination of data values. The double at sign (@@) indicates that more than one observation is included on a single data line. In this DATA step, two observations are included on each line.
Researchers are interested in whether there is an association between internship status and summer program enrollment. The Pearson chi-square statistic is an appropriate statistic to assess the association in the corresponding 2 × 2 table. The following PROC FREQ statements specify this analysis.
You specify the table for which you want to compute statistics with the TABLES statement. You specify the statistics you want to compute with options after a slash (/) in the TABLES statement.
proc freq data=SummerSchool order=data;
   weight count;
   tables Internship*Enrollment / chisq;
run;
The ORDER= option controls the order in which variable values are displayed in the rows and columns of the table. By default, the values are arranged according to the alphanumeric order of their unformatted values. If you specify ORDER=DATA, the values are displayed in the same order as they occur in the input data set. Here, since 'yes' appears before 'no' in the data, 'yes' appears first in any table. Other options for controlling order include ORDER=FORMATTED, which orders according to the formatted values, and ORDER=FREQ, which orders by descending frequency count.
In the TABLES statement, Internship*Enrollment specifies a table where the rows are internship status and the columns are program enrollment. The CHISQ option requests chi-square statistics for assessing association between these two variables. Since the input data are in cell count form, the WEIGHT statement is required. The WEIGHT statement names the variable Count, which provides the frequency of each combination of data values.
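To see why the WEIGHT statement is needed, it can help to compare cell count data with data that have one observation per student. The following is a minimal sketch, assuming the SummerSchool data set created above; the data set name SummerSchoolLong and the loop index i are names introduced here for illustration only. It expands each cell count into individual observations, after which the same table can be requested without a WEIGHT statement.

```sas
/* Illustrative sketch: expand cell counts to one observation per student. */
/* SummerSchoolLong and i are hypothetical names, not part of the example. */
data SummerSchoolLong;
   set SummerSchool;
   do i = 1 to Count;      /* write Count copies of each combination */
      output;
   end;
   drop i Count;
run;

/* No WEIGHT statement: each observation now represents one student. */
proc freq data=SummerSchoolLong order=data;
   tables Internship*Enrollment / chisq;
run;
```

Both forms should describe the same 223 students; the cell count form is simply more compact to enter.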
The FREQ Procedure

Table of Internship by Enrollment

Internship     Enrollment

Frequency|
Percent  |
Row Pct  |
Col Pct  |yes     |no      |  Total
---------+--------+--------+
yes      |     67 |     39 |    106
         |  30.04 |  17.49 |  47.53
         |  63.21 |  36.79 |
         |  50.00 |  43.82 |
---------+--------+--------+
no       |     67 |     50 |    117
         |  30.04 |  22.42 |  52.47
         |  57.26 |  42.74 |
         |  50.00 |  56.18 |
---------+--------+--------+
Total         134       89      223
            60.09    39.91   100.00
Statistics for Table of Internship by Enrollment

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1      0.8189    0.3655
Likelihood Ratio Chi-Square    1      0.8202    0.3651
Continuity Adj. Chi-Square     1      0.5899    0.4425
Mantel-Haenszel Chi-Square     1      0.8153    0.3666
Phi Coefficient                       0.0606
Contingency Coefficient               0.0605
Cramer's V                            0.0606

       Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        67
Left-sided Pr <= F          0.8513
Right-sided Pr >= F         0.2213

Table Probability (P)       0.0726
Two-sided Pr <= P           0.4122

Sample Size = 223
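As a check on the Chi-Square value in the output above, the Pearson statistic can be worked by hand. The sketch below uses the standard shortcut formula for a 2 × 2 table; the symbols are notation introduced here, with n_{ij} the cell counts, n_{i.} and n_{.j} the row and column totals, and n the sample size.

```latex
\begin{aligned}
Q_P &= \frac{n\,(n_{11}n_{22} - n_{12}n_{21})^2}
            {n_{1\cdot}\,n_{2\cdot}\,n_{\cdot 1}\,n_{\cdot 2}} \\
    &= \frac{223\,(67 \cdot 50 - 39 \cdot 67)^2}{106 \cdot 117 \cdot 134 \cdot 89}
     = \frac{223 \cdot 737^2}{147{,}906{,}252} \approx 0.8189
\end{aligned}
```

With 1 degree of freedom, a value this small is consistent with the reported p-value of 0.3655.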
The analysis so far has ignored gender. However, it may be of interest to ask whether program enrollment is associated with internship status after adjusting for gender. You can address this question by analyzing a set of tables, in this case one table for boys and one for girls. The Cochran-Mantel-Haenszel statistic is appropriate for this situation: it addresses whether rows and columns are associated after controlling for the stratification variable. In this case, you would be stratifying by gender.
The PROC FREQ statements for this analysis are very similar to those for the first analysis, except that there is a third variable, Gender, in the TABLES statement. When you cross more than two variables, the two rightmost variables form the rows and columns of the table, respectively, and the leftmost variables determine the stratification.
proc freq data=SummerSchool;
   weight count;
   tables Gender*Internship*Enrollment / chisq cmh;
run;
This execution of PROC FREQ first produces two individual crosstabulation tables of Internship*Enrollment, one for boys and one for girls. Chi-square statistics are produced for each individual table. Figure 29.3 shows the results for boys. Note that the chi-square statistic for boys is significant at the α = 0.05 level of significance. Boys offered a course with an internship are more likely to enroll than boys who are not.
The FREQ Procedure

Table 1 of Internship by Enrollment
Controlling for Gender=boys

Internship     Enrollment

Frequency|
Percent  |
Row Pct  |
Col Pct  |no      |yes     |  Total
---------+--------+--------+
no       |     27 |     14 |     41
         |  25.71 |  13.33 |  39.05
         |  65.85 |  34.15 |
         |  48.21 |  28.57 |
---------+--------+--------+
yes      |     29 |     35 |     64
         |  27.62 |  33.33 |  60.95
         |  45.31 |  54.69 |
         |  51.79 |  71.43 |
---------+--------+--------+
Total          56       49      105
            53.33    46.67   100.00

Statistics for Table 1 of Internship by Enrollment
Controlling for Gender=boys

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1      4.2366    0.0396
Likelihood Ratio Chi-Square    1      4.2903    0.0383
Continuity Adj. Chi-Square     1      3.4515    0.0632
Mantel-Haenszel Chi-Square     1      4.1963    0.0405
Phi Coefficient                       0.2009
Contingency Coefficient               0.1969
Cramer's V                            0.2009

       Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        27
Left-sided Pr <= F          0.9885
Right-sided Pr >= F         0.0311

Table Probability (P)       0.0196
Two-sided Pr <= P           0.0467

Sample Size = 105
If you look at the individual table for girls, displayed in Figure 29.4, you see that there is no evidence of association for girls between internship offers and program enrollment.
Table 2 of Internship by Enrollment
Controlling for Gender=girls

Internship     Enrollment

Frequency|
Percent  |
Row Pct  |
Col Pct  |no      |yes     |  Total
---------+--------+--------+
no       |     23 |     53 |     76
         |  19.49 |  44.92 |  64.41
         |  30.26 |  69.74 |
         |  69.70 |  62.35 |
---------+--------+--------+
yes      |     10 |     32 |     42
         |   8.47 |  27.12 |  35.59
         |  23.81 |  76.19 |
         |  30.30 |  37.65 |
---------+--------+--------+
Total          33       85      118
            27.97    72.03   100.00

Statistics for Table 2 of Internship by Enrollment
Controlling for Gender=girls

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1      0.5593    0.4546
Likelihood Ratio Chi-Square    1      0.5681    0.4510
Continuity Adj. Chi-Square     1      0.2848    0.5936
Mantel-Haenszel Chi-Square     1      0.5545    0.4565
Phi Coefficient                       0.0688
Contingency Coefficient               0.0687
Cramer's V                            0.0688

       Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        23
Left-sided Pr <= F          0.8317
Right-sided Pr >= F         0.2994

Table Probability (P)       0.1311
Two-sided Pr <= P           0.5245

Sample Size = 118
These individual table results demonstrate the occasional problems with combining information into one table and not accounting for information in other variables such as Gender. Figure 29.5 contains the CMH results. There are three summary (CMH) statistics; which one you use depends on whether your rows and/or columns have an order in r × c tables. However, in the case of 2 × 2 tables, ordering does not matter and all three statistics take the same value. The CMH statistic follows the chi-square distribution under the hypothesis of no association, and here, it takes the value 4.0186 with 1 degree of freedom. The associated p-value is 0.0450, which indicates a significant association at the α = 0.05 level.
Summary Statistics for Internship by Enrollment
Controlling for Gender

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic    Alternative Hypothesis    DF       Value      Prob
---------------------------------------------------------------
    1        Nonzero Correlation        1      4.0186    0.0450
    2        Row Mean Scores Differ     1      4.0186    0.0450
    3        General Association        1      4.0186    0.0450

Total Sample Size = 223
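The CMH value can be reproduced approximately from the two stratum tables. The sketch below uses the usual form of the statistic for a set of 2 × 2 tables, without a continuity correction; the notation is introduced here, with k indexing the strata (boys, girls) and cell (1,1) taken to be Internship = no, Enrollment = no, as in the individual tables.

```latex
\begin{aligned}
Q_{CMH} &= \frac{\bigl[\sum_k (n_{11k} - E_k)\bigr]^2}{\sum_k V_k},
\qquad E_k = \frac{n_{1\cdot k}\,n_{\cdot 1k}}{n_k},
\qquad V_k = \frac{n_{1\cdot k}\,n_{2\cdot k}\,n_{\cdot 1k}\,n_{\cdot 2k}}{n_k^2\,(n_k - 1)} \\
\text{boys:}\quad n_{11} &= 27,\quad E = \tfrac{41 \cdot 56}{105} \approx 21.867,\quad
  V = \tfrac{41 \cdot 64 \cdot 56 \cdot 49}{105^2 \cdot 104} \approx 6.280 \\
\text{girls:}\quad n_{11} &= 23,\quad E = \tfrac{76 \cdot 33}{118} \approx 21.254,\quad
  V = \tfrac{76 \cdot 42 \cdot 33 \cdot 85}{118^2 \cdot 117} \approx 5.496 \\
Q_{CMH} &\approx \frac{(5.133 + 1.746)^2}{6.280 + 5.496} \approx 4.02
\end{aligned}
```

This agrees with the reported 4.0186 up to rounding of the intermediate values. Both strata contribute a positive difference between observed and expected counts, which is why the stratified statistic detects an association that the pooled table hides.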
Thus, when you adjust for the effect of gender in these data, there is an association between internship and program enrollment. If you ignore gender, however, no association is found. Note that the CMH option also produces other statistics, including estimates and confidence limits for relative risk and odds ratios for 2 × 2 tables and the Breslow-Day test. These results are not displayed here.
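If only the stratified summary is of interest, one possibility is to suppress the individual crosstabulation tables. The following is a minimal sketch, assuming the SummerSchool data set created earlier. It omits the CHISQ option and adds NOPRINT to the TABLES statement, which suppresses the display of the tables but not the requested statistics.

```sas
/* Illustrative sketch: request only the CMH summary results.       */
/* NOPRINT in the TABLES statement hides the crosstabulation tables */
/* but still displays the statistics requested by CMH.              */
proc freq data=SummerSchool;
   weight count;
   tables Gender*Internship*Enrollment / cmh noprint;
run;
```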
Medical researchers are interested in evaluating the efficacy of a new treatment for a skin condition. Dermatologists from participating clinics were trained to conduct the study and to evaluate the condition. After the training, two dermatologists examined patients with the skin condition from a pilot study and rated the same patients. The possible evaluations are terrible, poor, marginal, and clear. Table 29.2 contains the data.
Dermatologist 1 | Dermatologist 2: Terrible | Dermatologist 2: Poor | Dermatologist 2: Marginal | Dermatologist 2: Clear |
---|---|---|---|---|
Terrible | 10 | 4 | 1 | 0 |
Poor | 5 | 10 | 12 | 2 |
Marginal | 2 | 4 | 12 | 5 |
Clear | 0 | 2 | 6 | 13 |
The dermatologists' evaluations of the patients are contained in the variables derm1 and derm2; the variable count is the number of patients given a particular pair of ratings. In order to evaluate the agreement of the diagnoses (a possible contribution to measurement error in the study), the kappa coefficient is computed. You specify the AGREE option in the TABLES statement and use the TEST statement to request a test for the null hypothesis that their agreement is purely by chance. You specify the keyword KAPPA to perform this test for the kappa coefficient. The results are shown in Figure 29.6.
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
terrible terrible 10
terrible poor      4
terrible marginal  1
terrible clear     0
poor     terrible  5
poor     poor     10
poor     marginal 12
poor     clear     2
marginal terrible  2
marginal poor      4
marginal marginal 12
marginal clear     5
clear    terrible  0
clear    poor      2
clear    marginal  6
clear    clear    13
;

proc freq data=SkinCondition order=data;
   weight count;
   tables derm1*derm2 / agree noprint;
   test kappa;
run;
The FREQ Procedure

Statistics for Table of derm1 by derm2

    Simple Kappa Coefficient
--------------------------------
Kappa                     0.3449
ASE                       0.0724
95% Lower Conf Limit      0.2030
95% Upper Conf Limit      0.4868

     Test of H0: Kappa = 0

ASE under H0              0.0612
Z                         5.6366
One-sided Pr >  Z         <.0001
Two-sided Pr > |Z|        <.0001

Sample Size = 88
The kappa coefficient has the value 0.3449, which indicates slight agreement between the dermatologists, and the hypothesis test confirms that you can reject the null hypothesis of no agreement. This conclusion is further supported by the confidence interval of (0.2030, 0.4868), which suggests that the true kappa is greater than zero. The AGREE option also produces Bowker's test for symmetry and the weighted kappa coefficient, but that output is not shown.
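The kappa estimate can be verified directly from the table of ratings. In the sketch below, P_o is the observed proportion of agreement (the diagonal cells) and P_e is the proportion expected by chance from the row and column totals; these symbols are notation introduced here. The row totals are 15, 29, 23, and 21, and the column totals are 17, 20, 31, and 20.

```latex
\begin{aligned}
P_o &= \frac{10 + 10 + 12 + 13}{88} = \frac{45}{88} \approx 0.5114 \\
P_e &= \frac{15 \cdot 17 + 29 \cdot 20 + 23 \cdot 31 + 21 \cdot 20}{88^2}
     = \frac{1968}{7744} \approx 0.2541 \\
\hat{\kappa} &= \frac{P_o - P_e}{1 - P_e}
     = \frac{3960 - 1968}{7744 - 1968} = \frac{1992}{5776} \approx 0.3449
\end{aligned}
```

The dermatologists agree on about 51% of patients, but roughly 25% agreement would be expected by chance alone, which is why kappa is much lower than the raw agreement rate.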