Chapter 22: The CATMOD Procedure | SAS/STAT 9.1 Users Guide Volume 2 only

Overview

The CATMOD procedure performs categorical data modeling of data that can be represented by a contingency table. PROC CATMOD fits linear models to functions of response frequencies, and it can be used for linear modeling, log-linear modeling, logistic regression, and repeated measurement analysis. PROC CATMOD uses

weighted least-squares (WLS) estimation of parameters for a wide range of general linear models
maximum likelihood (ML) estimation of parameters for log-linear models and the analysis of generalized logits

The CATMOD procedure provides a wide variety of categorical data analyses, many of which are generalizations of continuous data analysis methods. For example, analysis of variance, in the traditional sense, refers to the analysis of means and the partitioning of variation among the means into various sources. Here, the term analysis of variance is used in a generalized sense to denote the analysis of response functions and the partitioning of variation among those functions into various sources. The response functions might be mean scores if the dependent variables are ordinally scaled. But they can also be marginal probabilities, cumulative logits, or other functions that incorporate the essential information from the dependent variables.

Types of Input Data

The data that PROC CATMOD analyzes are usually supplied in one of two ways. First, you can supply raw data, where each observation is a subject. Second, you can supply cell count data, where each observation is a cell in a contingency table. (A third way, which uses direct input of the covariance matrix, is also available; details are given in the 'Inputting Response Functions and Covariances Directly' section on page 862.)

Suppose detergent preference is related to three other categorical variables: water softness, water temperature, and previous use of a brand of detergent. In the raw data case, each observation in the input data set identifies a given respondent in the study and contains information on all four variables. The data set contains the same number of observations as the survey had respondents. In the cell count case, each observation identifies a given cell in the four-way table of water softness, water temperature, previous use of brand, and brand preference. A fifth variable contains the number of respondents in the cell. In the analysis, this fifth variable is identified in a WEIGHT statement. The data set contains the same number of observations as the number of cross-classifications formed by the four categorical variables. For more on this particular example, see Example 22.1 on page 901. For additional details, see the section 'Input Data Sets' on page 860.

Most of the examples in this chapter use cell counts as input and use a WEIGHT statement.
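
The two input forms can be related with a short sketch. The data set names (raw, cells) and the variable names (softness, temp, prev, brand) are hypothetical and are not taken from Example 22.1. Assuming a raw data set with one observation per respondent, the following statements would build a cell count data set with PROC FREQ, which stores each cell frequency in a variable named COUNT, and then analyze it with a WEIGHT statement.

  /* Hypothetical names: raw, cells, softness, temp, prev, brand */
  proc freq data=raw noprint;
     tables softness*temp*prev*brand / out=cells;   /* one observation per nonempty cell */
  run;

  proc catmod data=cells;
     weight count;                     /* COUNT is the frequency variable written by PROC FREQ */
     model brand=softness temp prev;   /* main-effects model, chosen only for illustration */
  run;
  quit;

With the raw data set, the same model could be requested by specifying data=raw and omitting the WEIGHT statement, since each observation then represents a single respondent.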

Types of Statistical Analyses

This section illustrates, by example, the wide variety of categorical data analyses that PROC CATMOD provides. For each type of analysis, a brief description of the statistical problem and the SAS statements to provide the analysis are given. For each analysis, assume that the input data set consists of a set of cell counts from a contingency table. The variable specified in the WEIGHT statement contains these counts. In all these analyses, both the dependent and independent variables are categorical.

Linear Model Analysis

Suppose you want to analyze the relationship between the dependent variables (r1, r2) and the independent variables (a, b). Analyze the marginal probabilities of the dependent variables, and use a main-effects model.

  proc catmod;
     weight wt;
     response marginals;
     model r1*r2=a b;
  quit;

Log-Linear Model Analysis

Suppose you want to analyze the nominal dependent variables (r1, r2, r3) with a log-linear model. Use maximum likelihood analysis, and include the main effects and the r1*r2 interaction in the model. Obtain the predicted cell frequencies.

  proc catmod;
     weight wt;
     model r1*r2*r3=_response_ / pred=freq;
     loglin r1|r2 r3;
  quit;

Logistic Regression

Suppose you want to analyze the relationship between the nominal dependent variable (r) and the independent variables (x1, x2) with a logistic regression analysis. Use maximum likelihood estimation.

  proc catmod;
     weight wt;
     direct x1 x2;
     model r=x1 x2;
  quit;

If x1 and x2 are continuous so that each observation has a unique value of these two variables, then it may be more appropriate to use the LOGISTIC, GENMOD, or PROBIT procedure. See the 'Logistic Regression' section on page 869.
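
As a hedged sketch of that alternative, assume a hypothetical data set named subjects with one observation per subject and continuous x1 and x2. A generalized logit model for a nominal response could then be requested from PROC LOGISTIC as follows. The LINK=GLOGIT option is an assumption about the analysis wanted here; without it, PROC LOGISTIC fits a cumulative logit model when r has more than two levels.

  proc logistic data=subjects;      /* subjects is a hypothetical data set name */
     model r=x1 x2 / link=glogit;   /* generalized logits for a nominal response */
  run;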

Repeated Measures Analysis

Suppose the dependent variables (r1, r2, r3) represent the same type of measurement taken at three different times. Analyze the relationship among the dependent variables, the repeated measurement factor (time), and the independent variable (a).

  proc catmod;
     weight wt;
     response marginals;
     model r1*r2*r3=_response_|a;
     repeated time 3 / _response_=time;
  quit;

Analysis of Variance

Suppose you want to investigate the relationship between the dependent variable (r) and the independent variables (a, b). Analyze the mean of the dependent variable, and include all main effects and interactions in the model.

  proc catmod;
     weight wt;
     response mean;
     model r=a|b;
  quit;

Linear Regression

PROC CATMOD can analyze the relationship between the dependent variables (r1, r2) and the independent variables (x1, x2). Use a linear regression analysis to analyze the marginal probabilities of the dependent variables.

  proc catmod;
     weight wt;
     direct x1 x2;
     response marginals;
     model r1*r2=x1 x2;
  quit;

Logistic Analysis of Ordinal Data

Suppose you want to analyze the relationship between the ordinally scaled dependent variable (r) and the independent variable (a). Use cumulative logits to take into account the ordinal nature of the dependent variable. Use weighted least-squares estimation.

  proc catmod;
     weight wt;
     response clogits;
     model r=_response_ a;
  quit;

Sample Survey Analysis

Suppose the data set contains estimates of a vector of four functions and their covariance matrix, estimated in such a way as to correspond to the sampling process that is used. Analyze the functions with respect to the independent variables (a, b), and use a main-effects model.

  proc catmod;
     response read b1-b10;
     model _f_=_response_;
     factors a 2, b 5 / _response_=a b;
  quit;

Background: The Underlying Model

The CATMOD procedure analyzes data that can be represented by a two-dimensional contingency table. The rows of the table correspond to populations (or samples) formed on the basis of one or more independent variables. The columns of the table correspond to observed responses formed on the basis of one or more dependent variables. The frequency in the (i, j)th cell is the number of subjects in the ith population that have the jth response. The frequencies in the table are assumed to follow a product multinomial distribution, corresponding to a sampling design in which a simple random sample is taken for each population. The contingency table can be represented as shown in Table 22.1.

Table 22.1: Contingency Table Representation
	Response
Sample	1	2	...	r	Total
1	$n_{11}$	$n_{12}$	...	$n_{1r}$	$n_1$
2	$n_{21}$	$n_{22}$	...	$n_{2r}$	$n_2$
...	...	...	...	...	...
s	$n_{s1}$	$n_{s2}$	...	$n_{sr}$	$n_s$

For each sample $i$, the probability of the $j$th response ($\pi_{ij}$) is estimated by the sample proportion, $p_{ij} = n_{ij}/n_i$. The vector ($\mathbf{p}$) of all such proportions is then transformed into a vector of functions, denoted by $\mathbf{F} = \mathbf{F}(\mathbf{p})$. If $\boldsymbol{\pi}$ denotes the vector of true probabilities for the entire table, then the functions of the true probabilities, denoted by $\mathbf{F}(\boldsymbol{\pi})$, are assumed to follow a linear model

$$E_A(\mathbf{F}) = \mathbf{F}(\boldsymbol{\pi}) = \mathbf{X}\boldsymbol{\beta}$$

where $E_A$ denotes asymptotic expectation, $\mathbf{X}$ is the design matrix containing fixed constants, and $\boldsymbol{\beta}$ is a vector of parameters to be estimated.

PROC CATMOD provides two estimation methods:

The maximum likelihood method estimates the parameters of the linear model so as to maximize the value of the joint multinomial likelihood function of the responses. Maximum likelihood estimation is available only for the standard response functions, logits and generalized logits, which are used for logistic regression analysis and log-linear model analysis. Two methods of maximization are available: Newton-Raphson and iterative proportional fitting. For details of the theory, refer to Bishop, Fienberg, and Holland (1975).
The weighted least-squares method minimizes the weighted residual sum of squares for the model. The weights are contained in the inverse covariance matrix of the functions $\mathbf{F}(\mathbf{p})$. According to central limit theory, if the sample sizes within populations are sufficiently large, the elements of $\mathbf{F}$ and $\mathbf{b}$ (the estimate of $\boldsymbol{\beta}$) are distributed approximately as multivariate normal. This allows the computation of statistics for testing the goodness of fit of the model and the significance of other sources of variation. For details of the theory, refer to Grizzle, Starmer, and Koch (1969) or Koch et al. (1977, Appendix 1). Weighted least-squares estimation is available for all types of response functions.
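
The weighted least-squares computation can be summarized in the standard form given by Grizzle, Starmer, and Koch. This is a sketch of the textbook estimator, not a statement of the procedure's internal algorithm. The symbols $\mathbf{S}$ (the estimated covariance matrix of the response functions $\mathbf{F}$) and $Q$ are introduced here for illustration, and $\mathbf{X}$ is the design matrix.

$$\mathbf{b} = (\mathbf{X}'\mathbf{S}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{S}^{-1}\mathbf{F}, \qquad Q = (\mathbf{F}-\mathbf{X}\mathbf{b})'\,\mathbf{S}^{-1}\,(\mathbf{F}-\mathbf{X}\mathbf{b})$$

Here $Q$ is the minimized weighted residual sum of squares. When the model fits and the samples are large, $Q$ is approximately chi-square distributed, with degrees of freedom equal to the number of response functions minus the number of estimated parameters.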

Following parameter estimation, hypotheses about linear combinations of the parameters can be tested. For that purpose, PROC CATMOD computes generalized Wald (1943) statistics, which are approximately distributed as chi-square if the sample sizes are sufficiently large and the null hypotheses are true.
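
For a hypothesis of the form $H_0\colon \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$, the Wald statistic under weighted least-squares estimation takes the following familiar form. The contrast matrix $\mathbf{C}$ (assumed to have full row rank) and the symbol $\mathbf{S}$ (the estimated covariance matrix of the response functions) are notation introduced here for illustration; $\mathbf{b}$ is the estimate of $\boldsymbol{\beta}$ and $\mathbf{X}$ is the design matrix.

$$Q_C = (\mathbf{C}\mathbf{b})'\,\bigl[\mathbf{C}\,(\mathbf{X}'\mathbf{S}^{-1}\mathbf{X})^{-1}\,\mathbf{C}'\bigr]^{-1}\,(\mathbf{C}\mathbf{b})$$

The matrix $(\mathbf{X}'\mathbf{S}^{-1}\mathbf{X})^{-1}$ is the estimated covariance matrix of $\mathbf{b}$, so $Q_C$ measures the distance of $\mathbf{C}\mathbf{b}$ from zero relative to its estimated variability. Under the null hypothesis and with large samples, $Q_C$ is approximately chi-square with degrees of freedom equal to the number of rows of $\mathbf{C}$.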

Linear Models Contrasted with Log-Linear Models

Linear model methods (as typified by the Grizzle, Starmer, Koch approach) make a very clear distinction between independent and dependent variables. The emphasis of these methods is estimation and hypothesis testing of the model parameters. Therefore, it is easy to test for differences among probabilities, perform repeated measurement analysis, and test for marginal homogeneity, but it is awkward to test independence and generalized independence. These methods are a natural extension of the usual ANOVA approach for continuous data.

In contrast, log-linear model methods (as typified by the Bishop, Fienberg, Holland approach) do not make an a priori distinction between independent and dependent variables, although model specifications that allow for the distinction can be made. The emphasis of these methods is on model building, goodness-of-fit tests, and estimation of cell frequencies or probabilities for the underlying contingency table. With these methods, it is easy to test independence and generalized independence, but it is awkward to test for differences among probabilities, do repeated measurement analysis, and test for marginal homogeneity.

Using PROC CATMOD Interactively

You can use the CATMOD procedure interactively. After specifying a model with a MODEL statement and running PROC CATMOD with a RUN statement, you can execute any statement without reinvoking PROC CATMOD. You can execute the statements singly or in groups by following the single statement or group of statements with a RUN statement. Note that you can use more than one MODEL statement; this is an important difference from the GLM procedure.

If you use PROC CATMOD interactively, you can end the CATMOD procedure with a DATA step, another PROC step, an ENDSAS statement, or a QUIT statement. The syntax of the QUIT statement is

  quit;

When you are using PROC CATMOD interactively, additional RUN statements do not end the procedure but tell the procedure to execute additional statements.
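
An interactive session of this kind might look like the following sketch. It reuses the variable names from the earlier examples (wt, r1, r2, a, b) and assumes that the most recently created data set holds the cell counts; the choice of the two models is only illustrative.

  proc catmod;
     weight wt;
     response marginals;
     model r1*r2=a|b;      /* first model: main effects and the a*b interaction */
  run;
     model r1*r2=a b;      /* second model, submitted without reinvoking PROC CATMOD */
  run;
  quit;                    /* ends the procedure */

Each RUN statement executes the statements submitted since the previous one, and the QUIT statement ends the procedure.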

When the CATMOD procedure detects a BY statement, it disables interactive processing; that is, once the BY statement and the next RUN statement are encountered, processing proceeds for each BY group in the data set, and no additional statements are accepted by the procedure. For example, the following statements tell PROC CATMOD to do three analyses: one for the entire data set, one for males, and one for females.

  proc catmod;
     weight wt;
     response marginals;
     model r1*r2=a|b;
  run;
     by sex;
  run;

Note that the BY statement may appear after the first RUN statement; this is an important difference from PROC GLM, which requires that the BY statement appear before the first RUN statement.