Getting Started


The CATMOD procedure is a general modeling procedure for categorical data analysis. It can be used for very sophisticated analyses that require matrix specification of the response function and the design matrix, and it can also be used to perform very basic analysis-of-variance-type analyses that require very few statements. The following is a basic example.

Weighted-Least-Squares Analysis of Mean Response

Consider the data in the following table (Stokes, Davis, and Koch 2000).

Table 22.2: Colds in Children

                            Periods with Colds
  Sex       Residence       0       1       2     Total
  -----------------------------------------------------
  Female    Rural          45      64      71       180
  Female    Urban          80     104     116       300
  Male      Rural          84     124      82       290
  Male      Urban         106     117      87       310

For males and females in rural and urban counties, the number of periods (out of two) in which subjects report cold symptoms is recorded. Thus, 45 females from rural counties report no cold symptoms, and 71 females from rural counties report colds in both periods.

The question of interest is whether the mean number of periods with colds reported is associated with gender or type of county. There is no reason to believe that the mean number of periods with colds is normally distributed, so a weighted least-squares analysis of these data is performed with PROC CATMOD instead of an analysis of variance with PROC ANOVA or PROC GLM.

Categorical data are often recorded in frequency form, with the count for each particular profile being the input value. Thus, for the colds data, the input SAS data set colds is created with the following statements. The variable count contains the frequency of observations that have the particular profile described by the values of the other variables (sex, residence, and periods) on that input line.

  data colds;
     input sex $ residence $ periods count @@;
     datalines;
  female rural 0  45  female rural 1  64  female rural 2  71
  female urban 0  80  female urban 1 104  female urban 2 116
  male   rural 0  84  male   rural 1 124  male   rural 2  82
  male   urban 0 106  male   urban 1 117  male   urban 2  87
  ;
  run;

In order to fit a model to the mean number of periods with colds, you have to specify the response function in PROC CATMOD. The default response function is the logit if the response variable has two values, and it is generalized logits if the response variable has more than two values. If you want a different response function, then you request that function in the RESPONSE statement. To request the mean number of periods with colds, you specify the MEANS option in the RESPONSE statement.

You can request a model consisting of the main effects and interaction of the variables sex and residence just as you would in the GLM procedure. Unlike the GLM procedure, you do not need to use a CLASS statement in PROC CATMOD to treat a variable as a classification variable. All variables in the MODEL statement in the CATMOD procedure are treated as classification variables unless you specify otherwise with a DIRECT statement. To verify that your model is specified correctly, you can specify the DESIGN option in the MODEL statement to display the design matrix.

Thus, the PROC CATMOD statements needed to model mean periods of colds with a main effects and interaction model are

  proc catmod data=colds;
     weight count;
     response means;
     model periods = sex residence sex*residence / design;
  run;

The results of this analysis are shown in Figure 22.1 through Figure 22.3.

start figure
                  The CATMOD Procedure

                      Data Summary

  Response            periods    Response Levels       3
  Weight Variable     count      Populations           4
  Data Set            COLDS      Total Frequency    1080
  Frequency Missing   0          Observations         12

                  Population Profiles

  Sample    sex       residence    Sample Size
  --------------------------------------------
     1      female    rural                180
     2      female    urban                300
     3      male      rural                290
     4      male      urban                310

                   Response Profiles

  Response    periods
  -------------------
      1          0
      2          1
      3          2
end figure

Figure 22.1: Model Information and Profile Tables

The CATMOD procedure first displays a summary of the contingency table you are analyzing. The 'Population Profiles' table lists the values of the explanatory variables that define each population, or row of the underlying contingency table, and labels each group with a sample number. The number of observations in each population is also displayed. The 'Response Profiles' table lists the variable levels that define the response, or columns of the underlying contingency table.

start figure
           Response Functions and Design Matrix

             Response              Design Matrix
  Sample     Function        1        2        3        4
  --------------------------------------------------------
     1        1.14444        1        1        1        1
     2        1.12000        1        1       -1       -1
     3        0.99310        1       -1        1       -1
     4        0.93871        1       -1       -1        1
end figure

Figure 22.2: Observed Response Functions and Design Matrix

The 'Response Functions and Design Matrix' table contains the observed response functions (in this case, the mean number of periods with colds for each of the populations) and the design matrix. The first column of the design matrix contains the coefficients for the intercept parameter. The second column contains the coefficients for the sex parameter; the sum-to-zero constraint of a full-rank parameterization implies that the coefficient for males is the negative of that for females, and the parameter is called the differential effect for females. The third column is set up similarly for residence, and the last column is for the interaction.
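The coding described above can be reproduced outside SAS. The following Python sketch is only an illustration, not part of PROC CATMOD; the names sex_code, residence_code, and design_row are hypothetical. It assumes the sum-to-zero coding described in the text (+1 for the first level, -1 for the last) and forms the interaction column as the product of the two main-effect columns.

```python
# Sum-to-zero (effect) coding for the two-level factors, as described in the text.
# The dictionary and function names are illustrative only.
sex_code = {"female": 1, "male": -1}
residence_code = {"rural": 1, "urban": -1}

# Populations in the order shown in the 'Population Profiles' table.
populations = [
    ("female", "rural"),
    ("female", "urban"),
    ("male", "rural"),
    ("male", "urban"),
]


def design_row(sex, residence):
    """Return [intercept, sex, residence, sex*residence] for one population."""
    s = sex_code[sex]
    r = residence_code[residence]
    return [1, s, r, s * r]


design = [design_row(sex, residence) for sex, residence in populations]

for population, row in zip(populations, design):
    print(population, row)
```

Dropping the last entry of each row gives the three-column design matrix used for a main-effects model.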

start figure
              Analysis of Variance

  Source            DF   Chi-Square   Pr > ChiSq
  -----------------------------------------------
  Intercept          1      1841.13       <.0001
  sex                1        11.57       0.0007
  residence          1         0.65       0.4202
  sex*residence      1         0.09       0.7594

  Residual           0          .          .
end figure

Figure 22.3: ANOVA Table for the Saturated Model

The model-fitting results are displayed in the 'Analysis of Variance' table (Figure 22.3), which is similar to an ANOVA table. The effects from the right-hand side of the MODEL statement are listed under the 'Source' column.

The interaction effect is nonsignificant, so the data are reanalyzed using a main-effects model. Since PROC CATMOD is an interactive procedure, you can analyze the main-effects model by simply submitting the new MODEL statement as follows. The resulting tables are displayed in Figure 22.4 through Figure 22.7.

  proc catmod data=colds;
     weight count;
     response means;
     model periods = sex residence / design;
  run;

start figure
                  The CATMOD Procedure

                      Data Summary

  Response            periods    Response Levels       3
  Weight Variable     count      Populations           4
  Data Set            COLDS      Total Frequency    1080
  Frequency Missing   0          Observations         12

                  Population Profiles

  Sample    sex       residence    Sample Size
  --------------------------------------------
     1      female    rural                180
     2      female    urban                300
     3      male      rural                290
     4      male      urban                310

                   Response Profiles

  Response    periods
  -------------------
      1          0
      2          1
      3          2
end figure

Figure 22.4: Population and Response Profiles, Main-Effects Model
start figure
       Response Functions and Design Matrix

             Response          Design Matrix
  Sample     Function        1        2        3
  -----------------------------------------------
     1        1.14444        1        1        1
     2        1.12000        1        1       -1
     3        0.99310        1       -1        1
     4        0.93871        1       -1       -1
end figure

Figure 22.5: Design Matrix for the Main-Effects Model
start figure
            Analysis of Variance

  Source       DF   Chi-Square   Pr > ChiSq
  -------------------------------------------
  Intercept     1      1882.77       <.0001
  sex           1        12.08       0.0005
  residence     1         0.76       0.3839

  Residual      1         0.09       0.7594
end figure

Figure 22.6: ANOVA Table for the Main-Effects Model

The goodness-of-fit chi-square statistic (the Residual line in Figure 22.6) is 0.09 with one degree of freedom and a p-value of 0.7594; hence, the model fits the data. Note that the chi-square tests in Figure 22.6 test whether all the parameters for a given effect are zero. In this model, each effect has only one parameter, and therefore only one degree of freedom.
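As a rough cross-check of the p-values reported for these one-degree-of-freedom tests, the upper tail of a chi-square distribution with one degree of freedom can be written with the complementary error function. The Python sketch below is an illustration only; the function name chisq1_pvalue is hypothetical, and it applies only when the test has exactly one degree of freedom.

```python
import math


def chisq1_pvalue(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    If Z is standard normal, P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# Statistics as printed (rounded to two decimals) in the ANOVA table.
for label, statistic in [("sex", 12.08), ("residence", 0.76), ("Residual", 0.09)]:
    print(f"{label:10s} chi-square = {statistic:6.2f}   p = {chisq1_pvalue(statistic):.4f}")
```

Because the statistics in the table are rounded to two decimals, p-values recomputed this way can differ slightly from the printed ones, which SAS derives from the unrounded statistics. The difference is most visible for small statistics such as the residual chi-square.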

start figure
            Analysis of Weighted Least Squares Estimates

                                   Standard       Chi-
  Parameter            Estimate      Error      Square    Pr > ChiSq
  -----------------------------------------------------------------
  Intercept              1.0501     0.0242     1882.77        <.0001
  sex       female       0.0842     0.0242       12.08        0.0005
  residence rural        0.0210     0.0241        0.76        0.3839
end figure

Figure 22.7: Parameter Estimates for the Main-Effects Model

The 'Analysis of Weighted Least Squares Estimates' table lists the parameters and their estimates for the model, as well as the standard errors, Wald statistics, and p-values. These chi-square tests are single-degree-of-freedom tests that the individual parameter is equal to zero. They are equal to the tests shown in Figure 22.6 since each effect is composed of exactly one parameter.
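To see where these numbers come from, the weighted least-squares fit can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure's actual implementation: it assumes that each population's mean score has estimated variance (sum of a^2 p - mean^2) / n under multinomial sampling, that the populations are independent, and that the weights are the reciprocals of those variances. All function and variable names are hypothetical.

```python
import math

# Counts for 0, 1, and 2 periods with colds, in population order (Table 22.2).
counts = [(45, 64, 71), (80, 104, 116), (84, 124, 82), (106, 117, 87)]
scores = (0, 1, 2)
# Main-effects design matrix: intercept, sex (female = 1), residence (rural = 1).
design = [[1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1]]


def mean_and_variance(freqs):
    """Mean score and the estimated variance of that mean for one population."""
    n = sum(freqs)
    mean = sum(a * f for a, f in zip(scores, freqs)) / n
    second_moment = sum(a * a * f for a, f in zip(scores, freqs)) / n
    return mean, (second_moment - mean ** 2) / n


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


stats = [mean_and_variance(f) for f in counts]
F = [s[0] for s in stats]          # observed response functions
W = [1.0 / s[1] for s in stats]    # weights = inverse variances

k = len(design[0])
xtwx = [[sum(w * x[i] * x[j] for w, x in zip(W, design)) for j in range(k)]
        for i in range(k)]
xtwf = [sum(w * x[i] * f for w, x, f in zip(W, design, F)) for i in range(k)]

beta = solve(xtwx, xtwf)
# Standard errors: square roots of the diagonal of the inverse of X'WX.
std_err = [math.sqrt(solve(xtwx, [1.0 if j == i else 0.0 for j in range(k)])[i])
           for i in range(k)]
fitted = [sum(b * xi for b, xi in zip(beta, x)) for x in design]
residual_chisq = sum(w * (f - g) ** 2 for w, f, g in zip(W, F, fitted))

for name, b, se in zip(("Intercept", "sex female", "residence rural"), beta, std_err):
    print(f"{name:16s} estimate = {b:.4f}  std err = {se:.4f}  chi-square = {(b / se) ** 2:.2f}")
print(f"Residual chi-square = {residual_chisq:.2f}")
```

Under these assumptions the estimates should agree with the SAS output to about the printed precision.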

You can compute the mean number of periods of colds for the first population (Sample 1, females in rural residences) from Table 22.2 as follows.

\[
\text{mean colds} = \frac{0(45) + 1(64) + 2(71)}{180} = \frac{206}{180} = 1.1444
\]

This is the same value as reported for the Response Function for Sample 1 in Figure 22.5.
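The same calculation can be repeated for all four populations. The short Python sketch below is illustrative only (the names counts and mean_periods are hypothetical); it recomputes each response function from the frequencies in Table 22.2.

```python
# Counts for 0, 1, and 2 periods with colds, keyed by (sex, residence), from Table 22.2.
counts = {
    ("female", "rural"): (45, 64, 71),
    ("female", "urban"): (80, 104, 116),
    ("male", "rural"): (84, 124, 82),
    ("male", "urban"): (106, 117, 87),
}


def mean_periods(freqs):
    """Weighted mean of the scores 0, 1, 2 using the observed frequencies."""
    total = sum(freqs)
    return sum(score * f for score, f in enumerate(freqs)) / total


means = {population: mean_periods(freqs) for population, freqs in counts.items()}

for (sex, residence), value in means.items():
    print(f"{sex:7s} {residence:6s} n = {sum(counts[(sex, residence)]):4d}  mean = {value:.5f}")
```

The four means correspond to the Response Function column for Samples 1 through 4.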

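PROC CATMOD is fitting a model to the mean number of colds in each population as follows:

\[
E\begin{bmatrix} F_1 \\ F_2 \\ F_3 \\ F_4 \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & -1 \\ 1 & -1 & 1 \\ 1 & -1 & -1 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
\]

where $F_i$ is the mean number of periods with colds for population $i$ (the response function in Figure 22.5), the design matrix is the same one displayed in Figure 22.5, $\beta_0$ is the mean number of colds averaged over all the populations, $\beta_1$ is the differential effect for females, and $\beta_2$ is the differential effect for rural residences. The parameter estimates are shown in Figure 22.7; thus, the expected number of periods with colds for rural females from this model is

\[
\beta_0 + \beta_1 + \beta_2 = 1.0501 + 0.0842 + 0.0210 = 1.1553
\]

and the expected number for rural males from this model is

\[
\beta_0 - \beta_1 + \beta_2 = 1.0501 - 0.0842 + 0.0210 = 0.9869
\]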
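The predicted means for all four populations follow by multiplying each row of the design matrix by the parameter estimates. The Python sketch below is illustrative only; it uses the rounded estimates as printed in Figure 22.7, and the variable names are hypothetical.

```python
# Rounded estimates from Figure 22.7: intercept, sex (female), residence (rural).
beta = [1.0501, 0.0842, 0.0210]

# Rows of the main-effects design matrix from Figure 22.5, keyed by population.
design = {
    ("female", "rural"): [1, 1, 1],
    ("female", "urban"): [1, 1, -1],
    ("male", "rural"): [1, -1, 1],
    ("male", "urban"): [1, -1, -1],
}

predicted = {
    population: sum(b * x for b, x in zip(beta, row))
    for population, row in design.items()
}

for (sex, residence), value in predicted.items():
    print(f"{sex:7s} {residence:6s} predicted mean = {value:.4f}")
```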

Notice also, in Figure 22.7, that the differential effect for residence is nonsignificant (p = 0.3839). If you continue the analysis by fitting a single-effect model (sex), you need to include a POPULATION statement to maintain the same underlying contingency table.

     population sex residence;
     model periods = sex;
  run;

Generalized Logits Model

Over the course of one school year, third-graders from three different schools are exposed to three different styles of mathematics instruction: a self-paced computer-learning style, a team approach, and a traditional class approach. The students are asked which style they prefer, and their responses, classified by the type of program they are in (a regular school day versus a regular day supplemented with an afternoon school program), are displayed in Table 22.3. The data set is from Stokes, Davis, and Koch (2000), and it is also analyzed in Example 42.4 on page 2416 of Chapter 42, 'The LOGISTIC Procedure.'

Table 22.3: School Program Data

                         Learning Style Preference
  School    Program        Self     Team    Class
  ------------------------------------------------
     1      Regular          10       17       26
     1      Afternoon         5       12       50
     2      Regular          21       17       26
     2      Afternoon        16       12       36
     3      Regular          15       15       16
     3      Afternoon        12       12       20

The levels of the response variable (self, team, and class) have no essential ordering, so a logistic regression is performed on the generalized logits. The model to be fit is

\[
\log\left(\frac{\pi_{hij}}{\pi_{hir}}\right) = \alpha_j + \mathbf{x}_{hi}'\boldsymbol{\beta}_j
\]

where $\pi_{hij}$ is the probability that a student in school $h$ and program $i$ prefers teaching style $j$, $j \neq r$, and style $r$ is the class style. There are separate sets of intercept parameters $\alpha_j$ and regression parameters $\boldsymbol{\beta}_j$ for each logit, and the matrix $\mathbf{x}_{hi}$ is the set of explanatory variables for the $hi$th population. Thus, two logits are modeled for each school and program combination (population): the logit comparing self to class and the logit comparing team to class.
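To make the two logits concrete, the observed (unmodeled) generalized logits can be computed directly from the counts in Table 22.3. The Python sketch below is an illustration only, with hypothetical names; it uses class as the reference style, as in the text, and does not fit any model.

```python
import math

# Counts of (self, team, class) preferences, keyed by (school, program), from Table 22.3.
counts = {
    (1, "regular"): (10, 17, 26),
    (1, "afternoon"): (5, 12, 50),
    (2, "regular"): (21, 17, 26),
    (2, "afternoon"): (16, 12, 36),
    (3, "regular"): (15, 15, 16),
    (3, "afternoon"): (12, 12, 20),
}


def observed_logits(freqs):
    """Return (log(self/class), log(team/class)) for one population."""
    self_count, team_count, class_count = freqs
    return (math.log(self_count / class_count), math.log(team_count / class_count))


logits = {population: observed_logits(freqs) for population, freqs in counts.items()}

for (school, program), (self_vs_class, team_vs_class) in logits.items():
    print(f"school {school} {program:9s}  self vs class = {self_vs_class:7.4f}"
          f"  team vs class = {team_vs_class:7.4f}")
```

With six populations and two logits each, there are twelve observed response functions for the model to describe.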

The following statements create the data set school and request the analysis. Generalized logits are the default response functions, and maximum likelihood estimation is the default method for analyzing generalized logits, so only the WEIGHT and MODEL statements are required. The option ORDER=DATA means that the response variable levels are ordered as they exist in the data set: self, team, and class; thus the logits are formed by comparing self to class and by comparing team to class. The results of this analysis are shown in Figure 22.8 and Figure 22.9.

  data school;
     length Program $ 9;
     input School Program $ Style $ Count @@;
     datalines;
  1 regular   self 10 1 regular     team 17 1 regular   class 26
  1 afternoon self  5 1 afternoon   team 12 1 afternoon class 50
  2 regular   self 21 2 regular     team 17 2 regular   class 26
  2 afternoon self 16 2 afternoon   team 12 2 afternoon class 36
  3 regular   self 15 3 regular     team 15 3 regular   class 16
  3 afternoon self 12 3 afternoon   team 12 3 afternoon class 20
  ;

  proc catmod order=data;
     weight Count;
     model Style=School Program School*Program;
  run;

start figure
                  The CATMOD Procedure

                      Data Summary

  Response            Style      Response Levels      3
  Weight Variable     Count      Populations          6
  Data Set            SCHOOL     Total Frequency    338
  Frequency Missing   0          Observations        18

                  Population Profiles

  Sample    School    Program      Sample Size
  --------------------------------------------
     1      1         regular               53
     2      1         afternoon             67
     3      2         regular               64
     4      2         afternoon             64
     5      3         regular               46
     6      3         afternoon             44

                   Response Profiles

  Response    Style
  -----------------
      1       self
      2       team
      3       class
end figure

Figure 22.8: Model Information and Profile Tables

A summary of the data set is displayed in Figure 22.8; the variable levels that form the three responses and six populations are listed in the 'Response Profiles' and 'Population Profiles' tables, respectively.

start figure
       Maximum Likelihood Analysis of Variance

  Source               DF   Chi-Square    Pr > ChiSq
  --------------------------------------------------
  Intercept             2        40.05        <.0001
  School                4        14.55        0.0057
  Program               2        10.48        0.0053
  School*Program        4         1.74        0.7827

  Likelihood Ratio      0          .           .
end figure

Figure 22.9: ANOVA Table

The analysis of variance table is displayed in Figure 22.9. Since this is a saturated model, there are no degrees of freedom remaining for a likelihood ratio test, and missing values are displayed in the table. The interaction effect is clearly nonsignificant, so a main effects model is fit.

Since PROC CATMOD is an interactive procedure, you can analyze the main effects model by simply submitting the new MODEL statement as follows.

     model Style=School Program;
  run;

start figure
                The CATMOD Procedure

       Maximum Likelihood Analysis of Variance

  Source               DF   Chi-Square    Pr > ChiSq
  --------------------------------------------------
  Intercept             2        39.88        <.0001
  School                4        14.84        0.0050
  Program               2        10.92        0.0043

  Likelihood Ratio      4         1.78        0.7766
end figure

Figure 22.10: ANOVA Table

You can check the population and response profiles (not shown) to confirm that they are the same as those in Figure 22.8. The analysis of variance table is shown in Figure 22.10. The likelihood ratio chi-square statistic is 1.78 with a p-value of 0.7766, indicating a good fit, and the Wald chi-square tests for the School and Program effects are both significant. Since School has three levels, two parameters are estimated for each of the two logits modeled, for a total of four degrees of freedom. Since Program has two levels, one parameter is estimated for each of the two logits, for a total of two degrees of freedom.
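The degrees-of-freedom bookkeeping described above can be written out as a small calculation. The Python sketch below is illustrative only and uses hypothetical names. It assumes that each effect contributes (levels - 1) parameters per logit, that an interaction contributes the product of those terms, and that the likelihood ratio degrees of freedom equal the number of response functions (populations times logits) minus the number of estimated parameters.

```python
response_levels = 3                  # self, team, class
n_logits = response_levels - 1       # each non-reference level is compared to class
factor_levels = {"School": 3, "Program": 2}
n_populations = factor_levels["School"] * factor_levels["Program"]


def effect_df(*factors):
    """Degrees of freedom for a main effect or interaction across all logits."""
    per_logit = 1
    for factor in factors:
        per_logit *= factor_levels[factor] - 1
    return per_logit * n_logits


intercept_df = n_logits
school_df = effect_df("School")
program_df = effect_df("Program")
interaction_df = effect_df("School", "Program")

n_response_functions = n_populations * n_logits
saturated_lr_df = n_response_functions - (intercept_df + school_df + program_df + interaction_df)
main_effects_lr_df = n_response_functions - (intercept_df + school_df + program_df)

print("Intercept:", intercept_df, " School:", school_df, " Program:", program_df,
      " School*Program:", interaction_df)
print("Likelihood ratio df, saturated model:", saturated_lr_df)
print("Likelihood ratio df, main-effects model:", main_effects_lr_df)
```

The results agree with the DF columns of the two analysis of variance tables: zero remaining degrees of freedom for the saturated model and four for the main-effects model.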

start figure
                 Analysis of Maximum Likelihood Estimates

                          Function               Standard       Chi-
  Parameter               Number     Estimate      Error      Square    Pr > ChiSq
  ---------------------------------------------------------------------------
  Intercept                  1        -0.7979     0.1465       29.65        <.0001
                             2        -0.6589     0.1367       23.23        <.0001
  School     1               1        -0.7992     0.2198       13.22        0.0003
             1               2        -0.2786     0.1867        2.23        0.1356
             2               1         0.2836     0.1899        2.23        0.1352
             2               2        -0.0985     0.1892        0.27        0.6028
  Program    regular         1         0.3737     0.1410        7.03        0.0080
             regular         2         0.3713     0.1353        7.53        0.0061
end figure

Figure 22.11: Parameter Estimates

The parameter estimates and tests for individual parameters are displayed in Figure 22.11. The ordering of the parameters corresponds to the order of the population and response variables as shown in the profile tables (see Figure 22.8), with the levels of the response variables varying most rapidly. So, for the first response function, which is the logit that compares self to class, Parameter 1 is the intercept, Parameter 3 is the parameter for the differential effect for School=1, Parameter 5 is the parameter for the differential effect for School=2, and Parameter 7 is the parameter for the differential effect for Program=regular. The even-numbered parameters are interpreted similarly for the second logit, which compares team to class.

The Program variable (Parameters 7 and 8) has nearly the same effect on both logits, while School=1 (Parameters 3 and 4) has the largest effect of the schools.
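The estimates can be turned into predicted preference probabilities for any school and program combination. The Python sketch below is an illustration with hypothetical names, using the rounded estimates as printed in Figure 22.11. It assumes the sum-to-zero coding used throughout this section, so the effect for School=3 is taken as the negative of the sum of the School=1 and School=2 effects, and the effect for Program=afternoon as the negative of the Program=regular effect.

```python
import math

# Rounded estimates from Figure 22.11. In each pair, index 0 is the
# self-vs-class logit and index 1 is the team-vs-class logit.
intercept = (-0.7979, -0.6589)
school_effect = {1: (-0.7992, -0.2786), 2: (0.2836, -0.0985)}
program_effect = {"regular": (0.3737, 0.3713)}

# Assumption: sum-to-zero coding, so the omitted level is minus the sum of the others.
school_effect[3] = tuple(-(a + b) for a, b in zip(school_effect[1], school_effect[2]))
program_effect["afternoon"] = tuple(-value for value in program_effect["regular"])


def predicted_probabilities(school, program):
    """Convert the two predicted logits into probabilities for self, team, class."""
    logits = [
        intercept[j] + school_effect[school][j] + program_effect[program][j]
        for j in range(2)
    ]
    odds = [math.exp(value) for value in logits] + [1.0]  # class is the reference
    total = sum(odds)
    return dict(zip(("self", "team", "class"), (o / total for o in odds)))


for school in (1, 2, 3):
    for program in ("regular", "afternoon"):
        p = predicted_probabilities(school, program)
        print(f"school {school} {program:9s}  self = {p['self']:.3f}"
              f"  team = {p['team']:.3f}  class = {p['class']:.3f}")
```

Comparing these probabilities with the observed proportions in Table 22.3 gives an informal view of how closely the main-effects model tracks the data.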




SAS/STAT 9.1 User's Guide, Volume 2 (2004)
