Getting Started | SAS/STAT 9.1 Users Guide Volume 2 only

Frequency Tables and Statistics

The FREQ procedure provides easy access to statistics for testing for association in a crosstabulation table.

In this example, high school students applied for courses in a summer enrichment program; the courses included journalism, art history, statistics, graphic arts, and computer programming. The students accepted were randomly assigned to classes with and without internships in local companies. The following table contains counts of the accepted students by gender, by whether they were assigned an internship slot, and by whether they enrolled in the summer program.

Table 29.1: Summer Enrichment Data
		Enrollment
Gender	Internship	Yes	No	Total
boys	yes	35	29	64
boys	no	14	27	41
girls	yes	32	10	42
girls	no	53	23	76

The SAS data set SummerSchool is created by inputting the summer enrichment data as cell count data, that is, by providing the frequency count for each combination of variable values. The following DATA step statements create the SAS data set SummerSchool.

   data SummerSchool;
      input Gender $ Internship $ Enrollment $ Count @@;
      datalines;
   boys  yes yes 35   boys  yes no 29
   boys   no yes 14   boys   no no 27
   girls yes yes 32   girls yes no 10
   girls  no yes 53   girls  no no 23
   ;

The variable Gender takes the values 'boys' or 'girls', the variable Internship takes the values 'yes' and 'no', and the variable Enrollment takes the values 'yes' and 'no'. The variable Count contains the number of students corresponding to each combination of data values. The double at sign (@@) indicates that more than one observation is included on a single data line. In this DATA step, two observations are included on each line.
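For comparison, the same cell counts can be read with one observation per data line, in which case the trailing @@ is not needed. The following is a minimal sketch rather than part of the original example; the data set name SummerSchool2 is hypothetical and is used only to avoid overwriting SummerSchool.

```sas
data SummerSchool2;   /* hypothetical name, for illustration only */
   /* no trailing @@: each data line supplies exactly one observation */
   input Gender $ Internship $ Enrollment $ Count;
   datalines;
boys  yes yes 35
boys  yes no  29
boys  no  yes 14
boys  no  no  27
girls yes yes 32
girls yes no  10
girls no  yes 53
girls no  no  23
;

/* list the eight observations to confirm they were read as intended */
proc print data=SummerSchool2;
run;
```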

Researchers are interested in whether there is an association between internship status and summer program enrollment. The Pearson chi-square statistic is an appropriate statistic to assess the association in the corresponding 2 × 2 table. The following PROC FREQ statements specify this analysis.

You specify the table for which you want to compute statistics with the TABLES statement. You specify the statistics you want to compute with options after a slash (/) in the TABLES statement.

   proc freq data=SummerSchool order=data;
      weight count;
      tables Internship*Enrollment / chisq;
   run;

The ORDER= option controls the order in which variable values are displayed in the rows and columns of the table. By default, the values are arranged according to the alphanumeric order of their unformatted values. If you specify ORDER=DATA, the data are displayed in the same order as they occur in the input data set. Here, since 'yes' appears before 'no' in the data, 'yes' appears first in any table. Other options for controlling order include ORDER=FORMATTED, which orders according to the formatted values, and ORDER=FREQUENCY, which orders by descending frequency count.

In the TABLES statement, Internship*Enrollment specifies a table where the rows are internship status and the columns are program enrollment. The CHISQ option requests chi-square statistics for assessing association between these two variables. Since the input data are in cell count form, the WEIGHT statement is required. The WEIGHT statement names the variable Count, which provides the frequency of each combination of data values.

Figure 29.1 presents the crosstabulation of Internship and Enrollment. In each cell, the values printed under the cell count are the table percentage, row percentage, and column percentage, respectively. For example, the row percentages in the first row show that 63.21 percent of the students offered courses with internships enrolled and 36.79 percent did not.

                    The FREQ Procedure

             Table of Internship by Enrollment

             Internship     Enrollment

             Frequency|
             Percent  |
             Row Pct  |
             Col Pct  |yes     |no      |  Total
             ---------+--------+--------+
             yes      |     67 |     39 |    106
                      |  30.04 |  17.49 |  47.53
                      |  63.21 |  36.79 |
                      |  50.00 |  43.82 |
             ---------+--------+--------+
             no       |     67 |     50 |    117
                      |  30.04 |  22.42 |  52.47
                      |  57.26 |  42.74 |
                      |  50.00 |  56.18 |
             ---------+--------+--------+
             Total         134       89      223
                         60.09    39.91   100.00

Figure 29.1: Crosstabulation Table
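The contents of each cell can be adjusted with TABLES statement options. The following is a sketch, not part of the original example, that assumes the SummerSchool data set created earlier: NOPERCENT, NOROW, and NOCOL suppress the three percentages, and EXPECTED adds the expected frequency under independence (for the first cell, 106 × 134 / 223, or about 63.70).

```sas
proc freq data=SummerSchool order=data;
   weight Count;
   /* show only observed and expected counts in each cell */
   tables Internship*Enrollment / expected nopercent norow nocol;
run;
```

Comparing observed with expected counts in this way gives an informal view of the same departure from independence that the chi-square statistics summarize.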

Figure 29.2 displays the statistics produced by the CHISQ option. The Pearson chi-square statistic is labeled 'Chi-Square' and has a value of 0.8189 with 1 degree of freedom. The associated p-value is 0.3655, which means that there is no significant evidence of an association between internship status and program enrollment. The other chi-square statistics have similar values and are asymptotically equivalent. The other statistics (Phi Coefficient, Contingency Coefficient, and Cramer's V) are measures of association derived from the Pearson chi-square. For Fisher's exact test, the two-sided p-value is 0.4122, which likewise provides no evidence of an association between internship status and program enrollment.

        Statistics for Table of Internship by Enrollment

   Statistic                     DF       Value      Prob
   ------------------------------------------------------
   Chi-Square                     1      0.8189    0.3655
   Likelihood Ratio Chi-Square    1      0.8202    0.3651
   Continuity Adj. Chi-Square     1      0.5899    0.4425
   Mantel-Haenszel Chi-Square     1      0.8153    0.3666
   Phi Coefficient                       0.0606
   Contingency Coefficient               0.0605
   Cramer's V                            0.0606

                   Fisher's Exact Test
            ----------------------------------
            Cell (1,1) Frequency (F)        67
            Left-sided Pr <= F          0.8513
            Right-sided Pr >= F         0.2213

            Table Probability (P)       0.0726
            Two-sided Pr <= P           0.4122

                    Sample Size = 223

Figure 29.2: Statistics Produced with the CHISQ Option
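The Pearson chi-square value in Figure 29.2 can be checked by hand. For a 2 × 2 table with cell counts a and b in the first row, c and d in the second row, and total n, the statistic reduces to the shortcut form below. The symbols a, b, c, d, and n are notation introduced here for illustration; they do not appear in the PROC FREQ output.

```latex
\[
Q_P = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
    = \frac{223\,(67 \cdot 50 - 39 \cdot 67)^2}{106 \cdot 117 \cdot 134 \cdot 89}
    = \frac{223 \cdot 737^2}{147{,}906{,}252}
    \approx 0.8189
\]
\[
\phi = \sqrt{Q_P / n} = \sqrt{0.8189 / 223} \approx 0.0606
\]
```

Both values agree with the Chi-Square and Phi Coefficient rows of the figure.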

The analysis, so far, has ignored gender. However, it may be of interest to ask whether program enrollment is associated with internship status after adjusting for gender. You can address this question by analyzing a set of tables, in this case one table for boys and one for girls. The Cochran-Mantel-Haenszel statistic is appropriate for this situation: it addresses whether rows and columns are associated after controlling for the stratification variable. In this case, you would be stratifying by gender.

The FREQ statements for this analysis are very similar to those for the first analysis, except that there is a third variable, Gender, in the TABLES statement. When you cross more than two variables, the two rightmost variables construct the rows and columns of the table, respectively, and the leftmost variables determine the stratification.

   proc freq data=SummerSchool;
      weight count;
      tables Gender*Internship*Enrollment / chisq cmh;
   run;

This execution of PROC FREQ first produces two individual crosstabulation tables of Internship*Enrollment, one for boys and one for girls. Chi-square statistics are produced for each individual table. Figure 29.3 shows the results for boys. Note that the chi-square statistic for boys is significant at the α = 0.05 level of significance. Boys offered a course with an internship are more likely to enroll than boys who are not.

                    The FREQ Procedure

            Table 1 of Internship by Enrollment
                Controlling for Gender=boys

             Internship     Enrollment

             Frequency|
             Percent  |
             Row Pct  |
             Col Pct  |no      |yes     |  Total
             ---------+--------+--------+
             no       |     27 |     14 |     41
                      |  25.71 |  13.33 |  39.05
                      |  65.85 |  34.15 |
                      |  48.21 |  28.57 |
             ---------+--------+--------+
             yes      |     29 |     35 |     64
                      |  27.62 |  33.33 |  60.95
                      |  45.31 |  54.69 |
                      |  51.79 |  71.43 |
             ---------+--------+--------+
             Total          56       49      105
                         53.33    46.67   100.00

       Statistics for Table 1 of Internship by Enrollment
                Controlling for Gender=boys

   Statistic                     DF       Value      Prob
   ------------------------------------------------------
   Chi-Square                     1      4.2366    0.0396
   Likelihood Ratio Chi-Square    1      4.2903    0.0383
   Continuity Adj. Chi-Square     1      3.4515    0.0632
   Mantel-Haenszel Chi-Square     1      4.1963    0.0405
   Phi Coefficient                       0.2009
   Contingency Coefficient               0.1969
   Cramer's V                            0.2009

                   Fisher's Exact Test
            ----------------------------------
            Cell (1,1) Frequency (F)        27
            Left-sided Pr <= F          0.9885
            Right-sided Pr >= F         0.0311

            Table Probability (P)       0.0196
            Two-sided Pr <= P           0.0467

                    Sample Size = 105

Figure 29.3: Crosstabulation Table and Statistics for Boys

If you look at the individual table for girls, displayed in Figure 29.4, you see that there is no evidence of association for girls between internship offers and program enrollment.

            Table 2 of Internship by Enrollment
                Controlling for Gender=girls

             Internship     Enrollment

             Frequency|
             Percent  |
             Row Pct  |
             Col Pct  |no      |yes     |  Total
             ---------+--------+--------+
             no       |     23 |     53 |     76
                      |  19.49 |  44.92 |  64.41
                      |  30.26 |  69.74 |
                      |  69.70 |  62.35 |
             ---------+--------+--------+
             yes      |     10 |     32 |     42
                      |   8.47 |  27.12 |  35.59
                      |  23.81 |  76.19 |
                      |  30.30 |  37.65 |
             ---------+--------+--------+
             Total          33       85      118
                         27.97    72.03   100.00

       Statistics for Table 2 of Internship by Enrollment
                Controlling for Gender=girls

   Statistic                     DF       Value      Prob
   ------------------------------------------------------
   Chi-Square                     1      0.5593    0.4546
   Likelihood Ratio Chi-Square    1      0.5681    0.4510
   Continuity Adj. Chi-Square     1      0.2848    0.5936
   Mantel-Haenszel Chi-Square     1      0.5545    0.4565
   Phi Coefficient                       0.0688
   Contingency Coefficient               0.0687
   Cramer's V                            0.0688

                   Fisher's Exact Test
            ----------------------------------
            Cell (1,1) Frequency (F)        23
            Left-sided Pr <= F          0.8317
            Right-sided Pr >= F         0.2994

            Table Probability (P)       0.1311
            Two-sided Pr <= P           0.5245

                    Sample Size = 118

Figure 29.4: Crosstabulation Table and Statistics for Girls

These individual table results demonstrate the occasional problems with combining information into one table and not accounting for information in other variables such as Gender. Figure 29.5 contains the CMH results. There are three summary (CMH) statistics; which one you use depends on whether your rows and/or columns have an order in r × c tables. However, in the case of 2 × 2 tables, ordering does not matter and all three statistics take the same value. The CMH statistic follows the chi-square distribution under the hypothesis of no association, and here, it takes the value 4.0186 with 1 degree of freedom. The associated p-value is 0.0450, which indicates a significant association at the α = 0.05 level.

            Summary Statistics for Internship by Enrollment
                       Controlling for Gender

     Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

   Statistic    Alternative Hypothesis    DF       Value      Prob
   ---------------------------------------------------------------
       1        Nonzero Correlation        1      4.0186    0.0450
       2        Row Mean Scores Differ     1      4.0186    0.0450
       3        General Association        1      4.0186    0.0450

                      Total Sample Size = 223

Figure 29.5: Test for the Hypothesis of No Association
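The CMH value in Figure 29.5 can be reproduced from the two stratum tables. The sketch below uses the usual form of the statistic for a set of 2 × 2 tables; the subscripted symbols are notation introduced here for illustration, with h indexing the gender strata, n_{h11} the count in cell (1,1) of stratum h, and dots denoting row or column totals.

```latex
\[
Q_{CMH} = \frac{\left[\sum_h \left(n_{h11} - m_{h11}\right)\right]^2}{\sum_h v_{h11}},
\qquad
m_{h11} = \frac{n_{h1\cdot}\, n_{h\cdot 1}}{n_h},
\qquad
v_{h11} = \frac{n_{h1\cdot}\, n_{h2\cdot}\, n_{h\cdot 1}\, n_{h\cdot 2}}{n_h^2\,(n_h - 1)}
\]
% boys:  n_{111} = 27, m = 41*56/105 = 21.8667, v = 41*64*56*49/(105^2*104) = 6.2797
% girls: n_{211} = 23, m = 76*33/118 = 21.2542, v = 76*42*33*85/(118^2*117) = 5.4960
\[
Q_{CMH} \approx \frac{(5.1333 + 1.7458)^2}{6.2797 + 5.4960}
        = \frac{47.3220}{11.7757} \approx 4.0186
\]
```

Both strata contribute a positive difference between the observed and expected count, so the differences reinforce each other in the numerator instead of cancelling.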

Thus, when you adjust for the effect of gender in these data, there is an association between internship and program enrollment. But, if you ignore gender, no association is found. Note that the CMH option also produces other statistics, including estimates and confidence limits for relative risk and odds ratios for 2 × 2 tables and the Breslow-Day test. These results are not displayed here.
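If only the stratified summary is of interest, the individual crosstabulation tables can be suppressed. The following sketch assumes the SummerSchool data set created earlier and uses the NOPRINT option in the TABLES statement, the same option used in the agreement example later in this section.

```sas
proc freq data=SummerSchool;
   weight Count;
   /* NOPRINT suppresses the per-stratum crosstabulation tables;
      the statistics requested by CMH are still displayed */
   tables Gender*Internship*Enrollment / cmh noprint;
run;
```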

Agreement Study Example

Medical researchers are interested in evaluating the efficacy of a new treatment for a skin condition. Dermatologists from participating clinics were trained to conduct the study and to evaluate the condition. After the training, two dermatologists each examined and rated the same patients with the skin condition from a pilot study. The possible evaluations are terrible, poor, marginal, and clear. Table 29.2 contains the data.

Table 29.2: Skin Condition Data
	Dermatologist 2
Dermatologist 1	Terrible	Poor	Marginal	Clear
Terrible	10	4	1	0
Poor	5	10	12	2
Marginal	2	4	12	5
Clear	0	2	6	13

The dermatologists' evaluations of the patients are contained in the variables derm1 and derm2; the variable count is the number of patients given a particular pair of ratings. In order to evaluate the agreement of the diagnoses (a possible contribution to measurement error in the study), the kappa coefficient is computed. You specify the AGREE option in the TABLES statement and use the TEST statement to request a test for the null hypothesis that their agreement is purely by chance. You specify the keyword KAPPA to perform this test for the kappa coefficient. The results are shown in Figure 29.6.

   data SkinCondition;
      input derm1 $ derm2 $ count;
      datalines;
   terrible terrible 10
   terrible     poor  4
   terrible marginal  1
   terrible    clear  0
   poor     terrible  5
   poor         poor 10
   poor     marginal 12
   poor        clear  2
   marginal terrible  2
   marginal     poor  4
   marginal marginal 12
   marginal    clear  5
   clear    terrible  0
   clear        poor  2
   clear    marginal  6
   clear       clear 13
   ;

   proc freq data=SkinCondition order=data;
      weight count;
      tables derm1*derm2 / agree noprint;
      test kappa;
   run;

              The FREQ Procedure

     Statistics for Table of derm1 by derm2

            Simple Kappa Coefficient
        --------------------------------
        Kappa                     0.3449
        ASE                       0.0724
        95% Lower Conf Limit      0.2030
        95% Upper Conf Limit      0.4868

             Test of H0: Kappa = 0

        ASE under H0              0.0612
        Z                         5.6366
        One-sided Pr >  Z         <.0001
        Two-sided Pr > |Z|        <.0001

                Sample Size = 88

Figure 29.6: Agreement Study
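The kappa value in Figure 29.6 can be reproduced from the cell counts. In the sketch below, P_o denotes the observed proportion of agreement (the diagonal cells) and P_e the proportion expected by chance (products of matching row and column totals); these symbols are notation introduced here for illustration. The row totals are 15, 29, 23, and 21, and the column totals are 17, 20, 31, and 20.

```latex
\[
P_o = \frac{10 + 10 + 12 + 13}{88} = \frac{45}{88},
\qquad
P_e = \frac{15 \cdot 17 + 29 \cdot 20 + 23 \cdot 31 + 21 \cdot 20}{88^2} = \frac{1968}{7744}
\]
\[
\kappa = \frac{P_o - P_e}{1 - P_e}
       = \frac{3960 - 1968}{7744 - 1968}
       = \frac{1992}{5776}
       \approx 0.3449
\]
```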

The kappa coefficient has the value 0.3449, which indicates slight agreement between the dermatologists, and the hypothesis test confirms that you can reject the null hypothesis that their agreement is purely by chance (kappa = 0). This conclusion is further supported by the confidence interval of (0.2030, 0.4868), which excludes zero. The AGREE option also produces Bowker's test for symmetry and the weighted kappa coefficient, but that output is not shown.
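Because the four ratings are ordered, the weighted kappa coefficient may also be of interest, since it gives partial credit to near agreement. The following sketch assumes the SkinCondition data set created above and adds the WTKAP keyword to the TEST statement to request a test for the weighted kappa as well. ORDER=DATA matters here, because the weights depend on the order of the rating categories.

```sas
proc freq data=SkinCondition order=data;
   weight count;
   tables derm1*derm2 / agree noprint;
   /* KAPPA tests the simple kappa; WTKAP tests the weighted kappa */
   test kappa wtkap;
run;
```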