The following statements are available in PROC CATMOD.
PROC CATMOD < options > ;
DIRECT < variables > ;
MODEL response-effect=design-effects < / options > ;
CONTRAST 'label' row-description <, ..., row-description > < / options > ;
BY variables ;
FACTORS factor-description <, ..., factor-description > < / options > ;
LOGLIN effects ;
POPULATION variables ;
REPEATED factor-description <, ..., factor-description > < / options > ;
RESPONSE function <, ..., function > < / options > ;
RESTRICT parameter=value < ... parameter=value > ;
WEIGHT variable ;
You can use all of the statements in PROC CATMOD interactively. The first RUN statement executes all of the previous statements. Any subsequent RUN statement executes only those statements that appear between the previous RUN statement and the current one. However, if you specify a BY statement, interactive processing is disabled. That is, all statements through the following RUN statement are processed for each BY group in the data set, but no additional statements are accepted by the procedure.
If more than one CONTRAST statement appears between two RUN statements, all the CONTRAST statements are processed. If more than one RESPONSE statement appears between two RUN statements, then analyses associated with each RESPONSE statement are produced. For all other statements, there can be only one occurrence of the statement between any two RUN statements. For example, if there are two LOGLIN statements between two RUN statements, the first LOGLIN statement is ignored.
The PROC CATMOD and MODEL statements are required. If specified, the DIRECT statement must precede the MODEL statement. As a result, if you use the DIRECT statement interactively, you need to specify a MODEL statement in the same RUN group. See the section 'DIRECT Statement' on page 835 for an example.
The CONTRAST statements, if any, must follow the MODEL statement.
You can specify only one of the LOGLIN, REPEATED, and FACTORS statements between any two RUN statements, because they all specify the same information: how to partition the variation among the response functions within a population.
A QUIT statement executes any statements that have not been processed and then ends the CATMOD procedure.
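As an illustrative sketch of interactive processing, the following statements are submitted in two RUN groups. The data set work.trial and the variables count, response, and treatment are hypothetical names, and treatment is assumed to have three levels (so that it contributes two parameters).

```sas
/* Sketch only: work.trial, count, response, and treatment are hypothetical. */
proc catmod data=work.trial;
   weight count;                 /* count is assumed to hold cell frequencies */
   model response = treatment;
run;                             /* first RUN: executes all statements above  */

   contrast 'Treatment 1 vs. 2' treatment 1 -1;
run;                             /* second RUN: executes only the CONTRAST    */

quit;                            /* ends the CATMOD procedure                 */
```

Because the CONTRAST statement follows the MODEL statement of the earlier RUN group, it can be added without refitting the model.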
The purpose of each statement, other than the PROC CATMOD statement, is summarized in the following list:
BY | determines groups in which data are to be processed separately. |
CONTRAST | specifies a hypothesis to test. |
DIRECT | specifies independent variables that are to be treated quantitatively (like continuous variables) rather than qualitatively (like class or discrete variables). These variables also help to determine the rows of the contingency table and distinguish response functions in one population from those in other populations. |
FACTORS | specifies (1) the factors that distinguish response functions from others in the same population and (2) model effects, based on these factors, which help to determine the design matrix. |
LOGLIN | specifies log-linear model effects. |
MODEL | specifies (1) dependent variables, which determine the columns of the contingency table, (2) independent variables, which distinguish response functions in one population from those in other populations, and (3) model effects, which determine the design matrix and the way in which total variation among the response functions is partitioned. |
POPULATION | specifies variables which determine the rows of the contingency table and distinguish response functions in one population from those in other populations. |
REPEATED | specifies (1) the repeated measurement factors that distinguish response functions from others in the same population and (2) model effects, based on these factors, which help to determine the design matrix. |
RESPONSE | determines the response functions that are to be modeled. |
RESTRICT | restricts values of parameters to the values you specify. |
WEIGHT | specifies a variable containing frequency counts. |
PROC CATMOD < options > ;
The PROC CATMOD statement invokes the procedure. You can specify the following options.
DATA= SAS-data-set
names the SAS data set containing the data to be analyzed. By default, the CATMOD procedure uses the most recently created SAS data set. For details, see the section 'Input Data Sets' on page 860.
NAMELEN= n
specifies the length of effect names in tables and output data sets to be n characters long, where n is a value between 24 and 200 characters. The default length is 24 characters.
NOPRINT
suppresses the normal display of results. The NOPRINT option is useful when you want only to create output data sets with the OUT= or OUTEST= option in the RESPONSE statement. A NOPRINT option is also available in the MODEL statement. Note that this option temporarily disables the Output Delivery System (ODS); see Chapter 14, 'Using the Output Delivery System,' for more information.
ORDER=DATA | FORMATTED | FREQ | INTERNAL
specifies the sorting order for the levels of classification variables. This affects the ordering of the populations, responses, and parameters, as well as the definitions of the parameters. The default, ORDER=INTERNAL, orders the variable levels by their unformatted values (for example, numeric order or alphabetical order).
The following table shows how PROC CATMOD interprets values of the ORDER= option.
Value of ORDER= | Levels Sorted By |
---|---|
DATA | order of appearance in the input data set |
FORMATTED | external formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value |
FREQ | descending frequency count; levels with the most observations come first in the order |
INTERNAL | unformatted value |
By default, ORDER=INTERNAL. For ORDER=FORMATTED and ORDER=INTERNAL, the sort order is machine dependent. See the section 'Ordering of Populations and Responses' on page 863 for more information and examples. For more information on sorting order, see the chapter on the SORT procedure in the SAS Procedures Guide and the discussion of BY-group processing in SAS Language Reference: Concepts.
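The following sketch shows one way to request a nondefault ordering and then check it. The data set and variable names are hypothetical; the ONEWAY option in the MODEL statement is used here only to display the one-way frequency tables, in which the resulting level order can be inspected.

```sas
/* Sketch only: work.trial, count, response, and treatment are hypothetical. */
proc catmod data=work.trial order=data;   /* levels ordered by first appearance */
   weight count;
   model response = treatment / oneway;   /* one-way frequencies show the order */
run;
quit;
```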
BY variables ;
You can specify a BY statement with PROC CATMOD to obtain separate analyses of groups determined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. The variables are one or more variables in the input data set.
If your input data set is not sorted in ascending order, use one of the following alternatives:
Sort the data using the SORT procedure with a similar BY statement.
Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the CATMOD procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
Create an index on the BY variables using the DATASETS procedure (in base SAS software).
For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
When you specify a BY statement with PROC CATMOD, no further interactive processing is possible. In other words, once the BY statement appears, all statements up to the associated RUN statement are executed for each BY group in the data set. After the RUN statement, no further statements are accepted by the procedure.
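A minimal sketch of BY-group processing follows, using the first alternative (sorting beforehand). The data set work.trial and the variables center, count, response, and treatment are hypothetical names.

```sas
/* Sketch only: all data set and variable names are hypothetical. */
proc sort data=work.trial out=work.trial_sorted;
   by center;                     /* sort in ascending order of the BY variable */
run;

proc catmod data=work.trial_sorted;
   by center;                     /* one analysis per value of center           */
   weight count;
   model response = treatment;
run;                              /* no further statements are accepted here    */
```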
CONTRAST 'label' row-description <, ..., row-description > < / options > ;
where a row-description is
< @n > effect values < ... < @n > effect values >
The CONTRAST statement constructs and tests linear functions of the parameters in the MODEL statement or effects listed in the LOGLIN statement. Each set of effects (separated by commas) specifies one row or set of rows of the matrix $\mathbf{C}$ that PROC CATMOD uses to test the hypothesis $\mathbf{C}\boldsymbol{\beta} = \mathbf{0}$.
CONTRAST statements must be preceded by the MODEL statement, and by the LOGLIN statement, if one is used. You can specify the following terms in the CONTRAST statement.
' label ' | specifies up to 256 characters of identifying information displayed with the test. The ' label ' is required. |
effect | is one of the effects specified in the MODEL or LOGLIN statement, INTERCEPT (for the intercept parameter), or ALL_PARMS (for the complete set of parameters). The ALL_PARMS option is regarded as an effect with the same number of parameters as the number of columns in the design matrix. This is particularly useful when the design matrix is input directly, as in the following example: model y=(1 0 0 0, 1 0 1 0, 1 1 0 0, 1 1 1 1); contrast 'Main Effect of B' all_parms 0 1 0 0; contrast 'Main Effect of C' all_parms 0 0 1 0; contrast 'B*C Interaction' all_parms 0 0 0 1; |
values | are numbers that form the coefficients of the parameters associated with the given effect. If there are fewer values than parameters for an effect, the remaining coefficients become zero. For example, if you specify two values and the effect actually has five parameters, the final three are set to zero. |
@ n | points to the parameters in the n th set when the model has a separate set of parameters for each of the response functions. The @ n notation is seldom needed. It enables you to test the variation among response functions in the same population. However, it is usually easier to model and test such variation by using the _RESPONSE_ effect in the MODEL statement or by using the ALL_PARMS designation. Usually, contrasts are performed with respect to all of the response functions, and this is what the CONTRAST statement does by default (in this case, do not use the @ n notation). |
For example, if there are three response functions per population, then contrast 'Level 1 vs. Level 2' A 1 -1 0;
results in a three-degree-of-freedom test comparing the first two levels of A simultaneously on the three response functions.
If, however, you want to specify a contrast with respect to the parameters in the n th set only, then use a single @ n in a row-description. For example, to test that the first parameter of A and the first parameter of B are zero in the third response function, specify contrast 'A=0, B=0, Function 3' @3 A 1 B 1;
To specify a contrast with respect to parameters in two or more different sets of effects, use @ n with each effect. For example, contrast 'Average over Functions' @1 A 1 0 -1 @2 A 1 1 -2;
When the model does not have a separate set of parameters for each of the response functions, the @ n notation is invalid. This type of model is called AVERAGED. For details, see the description of the AVERAGED option on page 842 and the 'Generation of the Design Matrix' section on page 876.
You can specify the following options in the CONTRAST statement after a slash.
ALPHA= value
specifies the significance level of the confidence interval for each contrast when the ESTIMATE= option is specified. The default is ALPHA=0.05, resulting in a 95% confidence interval for each contrast.
ESTIMATE= keyword
EST= keyword
requests that each individual contrast (that is, each row, $\mathbf{c}_i\boldsymbol{\beta}$, of $\mathbf{C}\boldsymbol{\beta}$) or exponentiated contrast ($\exp(\mathbf{c}_i\boldsymbol{\beta})$) be estimated and tested. PROC CATMOD displays the point estimate, its standard error, a Wald confidence interval, and a Wald chi-square test for each contrast. The significance level of the confidence interval is controlled by the ALPHA= option.
You can estimate the contrast or the exponentiated contrast, or both, by specifying one of the following keywords:
PARM | specifies that the contrast itself be estimated. |
EXP | specifies that the exponentiated contrast be estimated. |
BOTH | specifies that both the contrast and the exponentiated contrast be estimated. |
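As a sketch of how these options combine, the following statements request both the contrast and its exponentiated form with 90% confidence limits. The data set and variable names are hypothetical, and treatment is assumed to have three levels.

```sas
/* Sketch only: work.trial, count, response, and treatment are hypothetical. */
proc catmod data=work.trial;
   weight count;
   model response = treatment;
   /* ESTIMATE=BOTH: estimate the contrast and exp(contrast);          */
   /* ALPHA=0.10: 90% Wald confidence intervals for each contrast row. */
   contrast 'Treatment 1 vs. 2' treatment 1 -1 / estimate=both alpha=0.10;
run;
quit;
```

With logit-type response functions, the exponentiated contrast is often the quantity of interest because it is on the odds-ratio scale.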
PROC CATMOD is parameterized differently than PROC GLM, so you must be careful not to use the same contrasts that you would with PROC GLM. Since PROC CATMOD uses a full-rank parameterization, all estimable parameters are directly estimable without involving other parameters.
For example, suppose a class variable A has four levels. Then there are four parameters ($\alpha_1, \alpha_2, \alpha_3, \alpha_4$), of which PROC CATMOD uses only the first three. The fourth parameter is related to the others by the equation
$$\alpha_4 = -\alpha_1 - \alpha_2 - \alpha_3$$
To test the first versus the fourth level of A, you would test $\alpha_1 = \alpha_4$, which is
$$\alpha_1 = -\alpha_1 - \alpha_2 - \alpha_3$$
or, equivalently,
$$2\alpha_1 + \alpha_2 + \alpha_3 = 0$$
Therefore, you would use the following CONTRAST statement:
contrast '1 vs. 4' A 2 1 1;
To contrast the third level with the average of the first two levels, you would test
$$\frac{\alpha_1 + \alpha_2}{2} = \alpha_3$$
or, equivalently,
$$\alpha_1 + \alpha_2 - 2\alpha_3 = 0$$
Therefore, you would use the following CONTRAST statement:
contrast '1&2 vs. 3' A 1 1 -2;
Other CONTRAST statements are constructed similarly; for example,
   contrast '1 vs. 2    ' A 1 -1  0;
   contrast '1&2 vs. 4  ' A 3  3  2;
   contrast '1&2 vs. 3&4' A 2  2  0;
   contrast 'Main Effect' A 1  0  0,
                          A 0  1  0,
                          A 0  0  1;
The actual form of the C matrix depends on the effects in the model. The following examples assume a single response function for each population.
   proc catmod;
      model y=a;
      contrast '1 vs. 4' A 2 1 1;
   run;
The C matrix for the preceding statements is
$$\mathbf{C} = \begin{bmatrix} 0 & 2 & 1 & 1 \end{bmatrix}$$
since the first parameter corresponds to the intercept.
But if there is a variable B with three levels and you use the following statements,
   proc catmod;
      model y=b a;
      contrast '1 vs. 4' A 2 1 1;
   run;
then the CONTRAST statement induces the C matrix
$$\mathbf{C} = \begin{bmatrix} 0 & 0 & 0 & 2 & 1 & 1 \end{bmatrix}$$
since the first parameter corresponds to the intercept and the next two correspond to the B main effect.
You can also use the CONTRAST statement to test the joint effect of two or more effects in the MODEL statement. For example, the joint effect of A and B in the previous model has five degrees of freedom and is obtained by specifying
contrast 'Joint Effect of A&B' A 1 0 0, A 0 1 0, A 0 0 1, B 1 0, B 0 1;
The ordering of variable levels is determined by the ORDER= option in the PROC CATMOD statement. Whenever you specify a contrast that depends on the order of the variable levels, you should verify the order from the 'Population Profiles' table, the 'Response Profiles' table, or the 'One-Way Frequencies' table.
DIRECT variables ;
The DIRECT statement lists numeric independent variables to be treated in a quantitative, rather than qualitative, way. The DIRECT statement is useful for logistic regression, which is described in the 'Logistic Regression' section on page 869. For limitations of models involving continuous variables, see the 'Continuous Variables' section on page 870.
If a DIRECT variable is formatted, then the unformatted (internal) values are used in the analysis and the formatted values are displayed. CAUTION: If you use a format to group the internal values into one formatted value, then the first internal value is used in the analysis.
If specified, the DIRECT statement must precede the MODEL statement. For example,
   proc catmod;
      direct X;
      model Y=X;
   run;
Suppose X has five levels. Then the main effect X induces only one column in the design matrix, rather than four. The values inserted into the design matrix are the actual values of X.
You can interactively change the variables declared as DIRECT variables by using the statement without listing any variables. The following statements are valid:
   proc catmod;
      direct X;
      model Y=X;
      weight wt;
   run;
      direct;
      model Y=X;
   run;
The first MODEL statement uses the actual values of X, and the second MODEL statement uses the four variables created when PROC CATMOD generates the design matrix. Note that the preceding statements can be run without a WEIGHT statement if the input data are raw data rather than cell counts.
For more details, see the discussions of main and direct effects in the section 'Generation of the Design Matrix' on page 876.
FACTORS factor-description <, ..., factor-description > < / options > ;
where a factor-description is
factor-name < $ > < levels >
and factor-descriptions are separated from each other by a comma. The $ is required for character-valued factors. The value of levels provides the number of levels of the factor identified by a given factor-name. For only one factor, levels is optional; for two or more factors, it is required.
The FACTORS statement identifies factors that distinguish response functions from others in the same population. It also specifies how those factors are incorporated into the model. You can use the FACTORS statement whenever there is more than one response function per population and the keyword _RESPONSE_ is specified in the MODEL statement. You can specify the name, type, and number of levels of each factor and the identification of each level.
The FACTORS statement is most useful when the response functions and their covariance matrix are read directly from the input data set. In this case, PROC CATMOD reads the response functions as though they are from one population (this poses no problem in the multiple-population case because the appropriately constructed covariance matrix is also read directly). Thus, you can use the FACTORS statement to partition the variation among the response functions into appropriate sources, even when the functions actually represent separate populations.
The format of the FACTORS statement is identical to that of the REPEATED statement. In fact, repeated measurement factors are simply special cases of factors in which some of the response functions correspond to multiple dependent variables that are measurements on the same experimental (or sampling) units.
You cannot specify the FACTORS statement for an analysis that also contains the REPEATED or LOGLIN statement since all of them specify the same information: how to partition the variation among the response functions within a population.
In the FACTORS statement,
factor-name | names a factor that corresponds to two or more response functions. This name must be a valid SAS variable name, and it should not be the same as the name of a variable that already exists in the data set being analyzed. |
$ | indicates that the factor is character-valued. If the $ is omitted, then PROC CATMOD assumes that the factor is numeric. The type of the factor is relevant only when you use the PROFILE= option or when the _RESPONSE_= option (described later in this section) specifies nested-by-value effects. |
levels | specifies the number of levels of the corresponding factor. If there is only one such factor, and the number is omitted, then PROC CATMOD assumes that the number of levels is equal to the number of response functions per population ( q ). Unless you specify the PROFILE= option, the number q must either be equal to or be a multiple of the product of the number of levels of all the factors. |
You can specify the following options in the FACTORS statement after a slash.
PROFILE=( matrix )
specifies the values assumed by the factors for each response function. There should be one column for each factor, and the values in a given column (character or numeric) should match the type of the corresponding factor. Character values are restricted to 16 characters or less. If there are q response functions per population, then the matrix must have i rows, where q must either be equal to or be a multiple of i . Adjacent rows of the matrix should be separated by a comma.
The values in the PROFILE matrix are useful for specifying models in those situations where the study design is not a full factorial with respect to the factors. They can also be used to specify nested-by-value effects in the _RESPONSE_= option. If you specify character values in both places (the PROFILE= option and the _RESPONSE_= option), then the values must match with respect to whether or not they are enclosed in quotes (that is, enclosed in quotes in both places or in neither place).
For an example of using the PROFILE= option, see Example 22.10 on page 944.
_RESPONSE_= effects
specifies design effects. The variables named in the effects must be factor-names that appear in the FACTORS statement. If the _RESPONSE_= option is omitted, then PROC CATMOD builds a full factorial _RESPONSE_ effect with respect to the factors.
TITLE= ' title '
displays the title at the top of certain pages of output that correspond to the current FACTORS statement.
For an example of how the FACTORS statement is useful, consider the case where the response functions and their covariance matrix are read directly from the input data set. The TYPE=EST data set might be created in the following manner:
   data direct(type=est);
      input b1-b4 _type_ $ _name_ $8.;
      datalines;
0.590463 0.384720 0.273269 0.136458 parms .
0.001690 0.000911 0.000474 0.000432 cov b1
0.000911 0.001823 0.000031 0.000102 cov b2
0.000474 0.000031 0.001056 0.000477 cov b3
0.000432 0.000102 0.000477 0.000396 cov b4
;
Suppose the response functions correspond to four populations that represent the cross-classification of age (two groups) by sex. You can use the FACTORS statement to identify these two factors and to name the effects in the model. The statements required to fit a main-effects model to these data are
   proc catmod data=direct;
      response read b1-b4;
      model _f_=_response_;
      factors age 2, sex 2 / _response_=age sex;
   run;
If you want to specify some nested-by-value effects, you can change the FACTORS statement to
   factors age $ 2, sex $ 2 /
           _response_=age sex(age='under 30') sex(age='30 & over')
           profile=('under 30' male,
                    'under 30' female,
                    '30 & over' male,
                    '30 & over' female);
If, by design or by chance, the study contains no male subjects under 30 years of age, then there are only three response functions, and you can specify a main-effects model as
   proc catmod data=direct;
      response read b2-b4;
      model _f_=_response_;
      factors age $ 2, sex $ 2 /
              _response_=age sex
              profile=('under 30' female,
                       '30 & over' male,
                       '30 & over' female);
   run;
When you specify two or more factors and omit the PROFILE= option, PROC CATMOD presumes that the response functions are ordered so that the levels of the rightmost factor change most rapidly. For the preceding example, the order implied by the FACTORS statement is as follows.
Response Function | Dependent Variable | Age | Sex |
---|---|---|---|
1 | b1 | 1 | 1 |
2 | b2 | 1 | 2 |
3 | b3 | 2 | 1 |
4 | b4 | 2 | 2 |
For additional examples of how to use the FACTORS statement, see the section 'Repeated Measures Analysis' on page 873. All of the examples in that section are applicable, with the REPEATED statement replaced by the FACTORS statement.
LOGLIN effects < / option > ;
The LOGLIN statement is used to define log-linear model effects. It can be used whenever the default response functions (generalized logits) are used.
In the LOGLIN statement, effects are design effects that contain dependent variables in the MODEL statement, including interaction, nested, and nested-by-value effects. You can use the bar (|) and at (@) operators as well. The following lists of effects are equivalent:
   a b c a*b a*c b*c
and
   a|b|c @2
When you use the LOGLIN statement, the keyword _RESPONSE_ should be specified in the MODEL statement. For further information on log-linear model analysis, see the 'Log-Linear Model Analysis' section on page 870.
You cannot specify the LOGLIN statement for an analysis that also contains the REPEATED or FACTORS statement since all of them specify the same information: how to partition the variation among the response functions within a population. You can specify the following option in the LOGLIN statement after a slash.
TITLE= ' title '
displays the title at the top of certain pages of output that correspond to this LOGLIN statement.
The following statements give an example of how to use the LOGLIN statement.
   proc catmod;
      model a*b*c=_response_;
      loglin a|b|c @ 2;
   run;
These statements yield a log-linear model analysis that contains all main effects and two-variable interactions. For more examples of log-linear model analysis, see the 'Log-Linear Model Analysis' section on page 870.
MODEL response-effect= < design-effects >< / options > ;
PROC CATMOD requires a MODEL statement. You can specify the following in a MODEL statement:
response-effect | can be either a single variable, a crossed effect with two or more variables joined by asterisks, or _F_. The _F_ specification indicates that the response functions and their estimated covariance matrix are to be read directly into the procedure (see the 'Inputting Response Functions and Covariances Directly' section on page 862 for details). The response-effect indicates the dependent variables that determine the response categories (the columns of the underlying contingency table). |
design-effects | specify potential sources of variation (such as main effects and interactions) in the model. Thus, these effects determine the number of model parameters, as well as the interpretation of such parameters. In addition, if there is no POPULATION statement, PROC CATMOD uses these variables to determine the populations (the rows of the underlying contingency table). When fitting the model, PROC CATMOD adjusts the independent effects in the model for all other independent effects in the model. Design-effects can be any of those described in the section 'Specification of Effects' on page 864, or they can be defined by specifying the actual design matrix, enclosed in parentheses (see the 'Specifying the Design Matrix Directly' section on page 847). In addition, you can use the keyword _RESPONSE_ alone or as part of an effect. Effects cannot be nested within _RESPONSE_, so effects of the form A(_RESPONSE_) are invalid. For more information, see the 'Log-Linear Model Analysis' section on page 870 and the 'Repeated Measures Analysis' section on page 873. |
Some examples of MODEL statements are
   model r=a b;                 /* main effects only                       */
   model r=a b a*b;             /* main effects with interaction           */
   model r=a b(a);              /* nested effect                           */
   model r=a|b;                 /* complete factorial                      */
   model r=a b(a=1) b(a=2);     /* nested-by-value effects                 */
   model r*s=_response_;        /* log-linear model                        */
   model r*s=a _response_(a);   /* nested repeated measurement factor      */
   model _f_=_response_;        /* direct input of the response functions  */
The relationship between these specifications and the structure of the design matrix X is described in the 'Generation of the Design Matrix' section on page 876.
The following table summarizes the options available in the MODEL statement.
Task | Options |
---|---|
Specify details of computation | |
Generates maximum likelihood estimates | ML= |
Generates weighted least-squares estimates | GLS, WLS |
Omits intercept term from the model | NOINT |
Specifies parameterization of classification variables | PARAM= |
Adds a number to each cell frequency | ADDCELL= |
Averages main effects across response functions | AVERAGED |
Specifies the convergence criterion for maximum likelihood | EPSILON= |
Specifies the number of iterations for maximum likelihood | MAXITER= |
Specifies how missing cells are treated | MISSING= |
Specifies how zero cells are treated | ZERO= |
Request additional computation and tables | |
Significance level of confidence intervals | ALPHA= |
Wald confidence intervals of estimates | CLPARM |
Estimated correlation matrix of estimates | CORRB |
Covariance matrix of response functions | COV |
Estimated covariance matrix of estimates | COVB |
Design and _RESPONSE_ matrix | DESIGN |
Two-way frequency tables | FREQ |
Iterations for maximum likelihood | ITPRINT |
One-way frequency tables | ONEWAY |
Predicted values | PRED=, PREDICT |
Probability estimates | PROB |
Population profiles | PROFILE |
Crossproducts matrix | XPX |
Title | TITLE= |
Suppress output | |
Design matrix | NODESIGN |
Parameter estimates | NOPARM |
Variable levels | NOPREDVAR |
Population and response profiles | NOPROFILE |
_RESPONSE_ matrix | NORESPONSE |
The following list describes these options in alphabetical order.
ADDCELL= number
adds number to the frequency count in each cell, where number is any positive number. This option has no effect on maximum likelihood analysis; it is used only for weighted least-squares analysis.
ALPHA= number
sets the significance level for the Wald confidence intervals for parameter estimates. The value must be between 0 and 1. The default value of 0.05 results in the calculation of a 95% confidence interval. This option has no effect unless the CLPARM option is also specified.
AVERAGED
specifies that dependent variable effects can be modeled and that independent variable main effects are averaged across the response functions in a population. For further information on the effect of using (or not using) the AVERAGED option, see the 'Generation of the Design Matrix' section on page 876. Direct input of the design matrix or specification of the _RESPONSE_ keyword in the MODEL statement automatically induces an AVERAGED model type.
CLPARM
produces Wald confidence limits for the parameter estimates. The confidence coefficient can be specified with the ALPHA= option.
CORRB
displays the estimated correlation matrix of the parameter estimates.
COV
displays $\mathbf{S}_i$, which is the covariance matrix of the response functions for each population.
COVB
displays the estimated covariance matrix of the parameter estimates.
DESIGN
displays the design matrix X for WLS and ML analyses, and also displays the _RESPONSE_ matrix for log-linear models. For further information, see the 'Generation of the Design Matrix' section on page 876.
EPSILON= number
specifies the convergence criterion for the maximum likelihood estimation of the parameters. The iterative estimation process stops when the proportional change in the log likelihood is less than number, or after the number of iterations specified by the MAXITER= option, whichever comes first. By default, EPSILON=1E-8.
FREQ
produces the two-way frequency table for the cross-classification of populations by responses.
ITPRINT
displays parameter estimates and other information at each iteration of a maximum likelihood analysis.
MAXITER= number
specifies the maximum number of iterations used for the maximum likelihood estimation of the parameters. By default, MAXITER=20.
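The following sketch shows how the EPSILON=, MAXITER=, and ITPRINT options might be combined to loosen the convergence criterion, allow more iterations, and display the iteration history. The data set and variable names are hypothetical, and the particular values are chosen only for illustration.

```sas
/* Sketch only: work.trial, count, response, and treatment are hypothetical. */
proc catmod data=work.trial;
   weight count;
   model response = treatment
         / ml=nr              /* Newton-Raphson maximum likelihood        */
           epsilon=1e-6       /* stop when proportional change < 1E-6     */
           maxiter=50         /* or after at most 50 iterations           */
           itprint;           /* display estimates at each iteration      */
run;
quit;
```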
ML < = NR | IPF < ( ipf-options ) > >
computes maximum likelihood estimates (MLE) using either a Newton-Raphson algorithm (NR) or an iterative proportional fitting algorithm (IPF).
The option ML=NR (or simply ML) is available when you use generalized logits, and also when you perform binary logistic regression with logits, cumulative logits, or adjacent category logits. For generalized logits (the default response functions), ML=NR is the default estimation method.
The option ML=IPF is available for fitting a hierarchical log-linear model with one population (no independent variables and no population variables). The use of bar notation to express the log-linear effects guarantees that the model is hierarchical (the presence of any interaction term in the model requires the presence of all its lower-order terms). If your table is incomplete (that is, your table has a zero or missing entry in at least one cell), then all missing cells and all cells with zero weight are treated as structural zeros by default; this behavior can be modified with the ZERO= and MISSING= options in the MODEL statement.
You can control the convergence of the two algorithms with the EPSILON= and MAXITER= options in the MODEL statement. You can select the convergence criterion for the IPF algorithm with the CONVCRIT= option. Note: The RESTRICT statement is not available with the ML=IPF option.
You can specify the following ipf-options within parentheses after the ML=IPF option.
CONV= keyword
CONVCRIT= keyword
specifies the method that determines when convergence of the IPF algorithm occurs. You can specify one of the following keywords:
CELL | termination requires the maximum absolute difference between consecutive cell estimates to be less than 0.001 (or the value of the EPSILON= option, if specified). |
LOGL | termination requires the relative difference between consecutive estimates of the log-likelihood to be less than 1E-8 (or the value of the EPSILON= option, if specified). This is the default. |
MARGIN | termination requires the maximum absolute difference between consecutive margin estimates to be less than 0.001 (or the value of the EPSILON= option, if specified). |
DF= keyword
specifies the method used to compute the degrees of freedom for the goodness-of-fit $G^2$ test (labeled 'Likelihood Ratio' in the 'Estimates' table).
For a complete table (a table having nonzero entries in every cell), the degrees of freedom are calculated as the number of cells in the table ($n_c$) minus the number of independent parameters specified in the model ($n_p$). For incomplete tables, these degrees of freedom may be adjusted by the number of fitted zeros ($n_z$, which includes the number of structural zeros) and the number of nonestimable parameters due to the zeros ($n_n$). If you are analyzing an incomplete table, you should verify that the degrees of freedom are correct.
You can specify one of the following keywords :
UNADJ | computes the unadjusted degrees of freedom as $n_c - n_p$. These are the same degrees of freedom you would get if all cells in the table were positive. |
ADJ | computes the degrees of freedom as $(n_c - n_p) - (n_z - n_n)$ (Bishop, Fienberg, and Holland 1975), which adjusts for fitted zeros and nonestimable parameters. This is the default, and for complete tables gives the same results as the UNADJ option. |
ADJEST | computes the degrees of freedom as $(n_c - n_p) - n_z$, which adjusts for fitted zeros only. This gives a lower bound on the true degrees of freedom. |
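As a worked illustration of the three DF= keywords, the following arithmetic uses purely hypothetical counts (12 cells, 7 independent parameters, 2 fitted zeros, 1 nonestimable parameter); the numbers are assumptions chosen only to show how the three formulas differ.

```latex
% Hypothetical incomplete table: n_c = 12, n_p = 7, n_z = 2, n_n = 1
\begin{align*}
\text{UNADJ:}  &\quad n_c - n_p = 12 - 7 = 5 \\
\text{ADJ:}    &\quad (n_c - n_p) - (n_z - n_n) = 5 - (2 - 1) = 4 \\
\text{ADJEST:} &\quad (n_c - n_p) - n_z = 5 - 2 = 3
\end{align*}
```

In this sketch ADJEST gives the smallest value, in line with its description as a lower bound.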
PARM
computes parameter estimates, generates the 'ANOVA,' 'Parameter Estimates,' and 'Predicted Values of Response Functions' tables, and includes the predicted standard errors in the 'Predicted Values of Frequencies and Probabilities' tables.
When you specify the PARM option, the algorithm used to obtain the maximum likelihood parameter estimates is weighted least squares on the IPF-predicted frequencies. This algorithm can be much faster than the Newton-Raphson algorithm that is used if you specify only the ML=NR option. In the resulting ANOVA table, the likelihood ratio is computed from the initial IPF fit while the degrees of freedom are generated from the WLS analysis; the DF= option can override this. Also, the initial response function, which the WLS method usually computes from the raw data, is computed from the IPF fitted frequencies.
If there are any zero marginals in the configurations that define the model, then there are predicted cell frequencies of zero and WLS cannot be used to compute the estimates. In this case, PROC CATMOD automatically changes the algorithm from ML=IPF to ML=NR and prints a note in the log.
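The following is a minimal sketch of how the ipf-options described above might be combined in one MODEL statement. The data set name counts, the variables A, B, and C, and the cell counts are hypothetical; only the option names are taken from this section.

```sas
/* Hypothetical 2x2x2 table of counts: one population, no independent variables */
data counts;
   input A B C wt @@;
   datalines;
1 1 1 12   1 1 2  8   1 2 1 15   1 2 2  5
2 1 1  9   2 1 2 11   2 2 1  7   2 2 2 13
;

proc catmod data=counts;
   weight wt;
   /* IPF fit: convergence judged on cell estimates, adjusted df, */
   /* and parameter estimates requested through the PARM option   */
   model A*B*C=_response_ / ml=ipf(convcrit=cell df=adj parm)
                            epsilon=0.0001 maxiter=50;
   /* bar notation keeps the log-linear model hierarchical */
   loglin A|B|C @ 2;
run;
```

Because CONVCRIT=CELL is specified, the EPSILON= value here replaces the 0.001 threshold for the maximum absolute difference between consecutive cell estimates.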
MISSING= keyword
MISS = keyword
specifies whether a missing cell is treated as a sampling or structural zero.
Structural zero cells are removed from the analysis since their expected values are zero, while sampling zero cells may have nonzero expected value and may be estimable. For a single population, the missing cells are treated as structural zeros by default. For multiple populations, as long as some population has a nonzero count for a given population and response profile, the missing values are treated as sampling zeros by default.
The following table displays the available keywords and summarizes how PROC CATMOD treats missing values for one or more populations.
MISSING= | One Population | Multiple Populations |
---|---|---|
STRUCTURAL (default) | structural zeros | sampling zeros |
SAMP SAMPLING | sampling zeros | sampling zeros |
value | sets missing weights and cells to value | sets missing weights and cells to value |
NODESIGN
suppresses the display of the design matrix X when the DESIGN option is also specified. This enables you to display only the _RESPONSE_ matrix for log-linear models.
NOINT
suppresses the intercept term in the model.
NOITER
suppresses the display of parameter estimates and other information at each iteration of a maximum likelihood analysis.
NOPARM
suppresses the display of the estimated parameters and the statistics for testing that each parameter is zero.
NOPREDVAR
suppresses the display of the variable levels in tables requested with the PRED= option and in the 'Estimates' table. Population profiles are replaced with the sample number, class variable levels are suppressed, and response profiles are replaced with a function number.
NOPRINT
suppresses the normal display of results. The NOPRINT option is useful when you only want to create output data sets with the OUT= or OUTEST= option in the RESPONSE statement. A NOPRINT option is also available in the PROC CATMOD statement. Note that this option temporarily disables the Output Delivery System (ODS); see Chapter 14, 'Using the Output Delivery System,' for more information.
NOPROFILE
suppresses the display of the population profiles and the response profiles.
NORESPONSE
suppresses the display of the _RESPONSE_ matrix for log-linear models when the DESIGN option is also specified. This enables you to display only the design matrix for log-linear models.
ONEWAY
produces a one-way table of frequencies for each variable used in the analysis. This table is useful in determining the order of the observed levels for each variable.
PARAM=EFFECT | REFERENCE
specifies the parameterization method for the classification variable or variables. The default is PARAM=EFFECT. Both the effect and reference parameterizations are full rank. See the 'Generation of the Design Matrix' section on page 876 for further details.
PREDICT
PRED=FREQ | PROB
displays the observed and predicted values of the response functions for each population, together with their standard errors and the residuals (observed - predicted). In addition, if the response functions are the standard ones (generalized logits), then the PRED=FREQ option specifies the computation and display of predicted cell frequencies, while PRED=PROB (or just PREDICT) specifies the computation and display of predicted cell probabilities.
The OUT= data set always contains the predicted probabilities. If the response functions are the generalized logits, the predicted cell probabilities are output unless the option PRED=FREQ is specified, in which case the predicted cell frequencies are output.
PROB
produces the two-way table of probability estimates for the cross-classification of populations by responses. These estimates sum to one across the response categories for each population.
PROFILE
displays all of the population profiles. If you have more than 60 populations, then by default only the first 40 profiles are displayed; the PROFILE option overrides this default behavior.
TITLE=' title '
displays the title at the top of certain pages of output that correspond to this MODEL statement.
WLS
GLS
computes weighted least-squares estimates. This type of estimation is also called generalized-least-squares estimation. For response functions other than the default (of generalized logits), WLS is the default estimation method.
XPX
displays $X' S^{-1} X$, the crossproducts matrix for the normal equations.
ZERO= keyword
ZEROS= keyword
ZEROES= keyword
specifies whether a non-missing cell with zero weight in the data set is treated as a sampling or structural zero.
Structural zero cells are removed from the analysis since their expected values are zero, while sampling zero cells have nonzero expected value and may be estimable. For a single population, the zero cells are treated as structural zeros by default; with multiple populations, as long as some population has a nonzero count for a given population and response profile, the zeros are treated as sampling zeros by default.
The following table displays the available keywords and summarizes how PROC CATMOD treats zeros for one or more populations.
ZERO= | One Population | Multiple Populations |
---|---|---|
STRUCTURAL (default) | structural zeros | sampling zeros |
SAMP SAMPLING | sampling zeros | sampling zeros |
value | sets zero weights to value | sets zero weights to value |
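A minimal sketch of the ZERO= and MISSING= options on an incomplete table follows. The data set sparse and its counts are hypothetical: the (1,3) cell is present with zero weight, and the (2,2) cell is absent from the data set altogether.

```sas
/* Hypothetical incomplete 2x3 table:                      */
/*   cell (A=1,B=3) has zero weight, cell (A=2,B=2) is missing */
data sparse;
   input A B wt @@;
   datalines;
1 1 10   1 2  6   1 3  0
2 1  4   2 3  7
;

proc catmod data=sparse;
   weight wt;
   /* with one population both kinds of cell default to structural zeros; */
   /* here both are treated as sampling zeros instead                      */
   model A*B=_response_ / ml=nr zero=sampling missing=sampling;
   loglin A B;
run;
```

Omitting the two options in this single-population setting would remove both cells from the analysis as structural zeros.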
If you specify the design matrix directly, adjacent rows of the matrix must be separated by a comma, and the matrix must have $q \times s$ rows, where s is the number of populations and q is the number of response functions per population. The first q rows correspond to the response functions for the first population, the second set of q rows corresponds to the functions for the second population, and so forth. The following is an example using direct specification of the design matrix.

   proc catmod;
      model R=(1 0,
               1 1,
               1 2,
               1 3);
   run;

These statements are appropriate for the case of one population and for R with five levels (generating four response functions), so that $4 \times 1 = 4$. These statements are also appropriate for a situation with two populations and two response functions per population, giving $2 \times 2 = 4$ rows of the design matrix. (To induce more than one population, the POPULATION statement is needed.)
When you input the design matrix directly, you also have the option of specifying that any subsets of the parameters be tested for equality to zero. Indicate each subset by specifying the appropriate column numbers of the design matrix, followed by an equal sign and a label (24 characters or less, in quotes) that describes the subset. Adjacent subsets are separated by a comma, and the entire specification is enclosed in parentheses and placed after the design matrix. For example,

   proc catmod;
      population Group Time;
      model R=(1  1  0 0,
               1  1  0 1,
               1  1  0 2,
               1  0  1 0,
               1  0  1 1,
               1  0  1 2,
               1 -1 -1 0,
               1 -1 -1 1,
               1 -1 -1 2)
              (1  ='Intercept',
               2 3='Group main effect',
               4  ='Linear effect of Time');
   run;

The preceding statements are appropriate when Group and Time each have three levels, and R is dichotomous. The POPULATION statement induces nine populations, and q = 1 (since R is dichotomous), so $q \times s = 1 \times 9 = 9$.
If you input the design matrix directly but do not specify any subsets of the parameters to be tested, then PROC CATMOD tests the effect of MODEL MEAN, which represents the significance of the model beyond what is explained by an overall mean. For the previous example, the MODEL MEAN effect is the same as that obtained by specifying
(2 3 4='modelmean');
at the end of the MODEL statement.
POPULATION variables ;
The POPULATION statement specifies that populations are to be based only on cross-classifications of the specified variables. If you do not specify the POPULATION statement, then populations are based only on cross-classifications of the independent variables in the MODEL statement.
The POPULATION statement has two major uses:
When you enter the design matrix directly, there are no independent variables in the MODEL statement; therefore, the POPULATION statement is the only way of inducing more than one population.
When you fit a reduced model, the POPULATION statement may be necessary if you want to form the same number of populations as there are for the saturated model.
To illustrate the first use, suppose that you specify the following statements:

   data one;
      input A $ B $ wt @@;
      datalines;
   yes yes 23   yes no 31   no yes 47   no no 50
   ;

   proc catmod;
      weight wt;
      population B;
      model A=(1 0,
               1 1);
   run;

Since the dependent variable A has two levels, there is one response function per population. Since the variable B has two levels, there are two populations. Thus, the MODEL statement is valid since the number of rows in the design matrix (2) is the same as the total number of response functions. If the POPULATION statement is omitted, there would be only one population and one response function, and the MODEL statement would be invalid.
To illustrate the second use, suppose that you specify

   data two;
      input A $ B $ Y wt @@;
      datalines;
   yes yes 1 23   yes yes 2 63   yes no 1 31   yes no 2 70
   no  yes 1 47   no  yes 2 80   no  no 1 50   no  no 2 84
   ;

   proc catmod;
      weight wt;
      model Y=A B A*B / wls;
   run;

These statements form four populations and produce the following design matrix and analysis of variance table.
Source | DF | Chi-Square | Pr > ChiSq |
---|---|---|---|
Intercept | 1 | 48.10 | <.0001 |
A | 1 | 3.47 | 0.0625 |
B | 1 | 0.25 | 0.6186 |
A*B | 1 | 0.19 | 0.6638 |
Residual | | | |
Since the B and A*B effects are nonsignificant (p > 0.10), you may want to fit the reduced model that contains only the A effect. If your new statements are

   proc catmod;
      weight wt;
      model Y=A / wls;
   run;

then only two populations are formed, and the design matrix and the analysis of variance table are as follows.
Source | DF | Chi-Square | Pr > ChiSq |
---|---|---|---|
Intercept | 1 | 47.94 | <.0001 |
A | 1 | 3.33 | 0.0678 |
Residual | | | |
However, if the new statements are

   proc catmod;
      weight wt;
      population A B;
      model Y=A / wls;
   run;

then four populations are formed, and the design matrix and the analysis of variance table are as follows.
Source | DF | Chi-Square | Pr > ChiSq |
---|---|---|---|
Intercept | 1 | 47.76 | <.0001 |
A | 1 | 3.30 | 0.0694 |
Residual | 2 | 0.35 | 0.8374 |
The advantage of the latter analysis is that it retains four populations for the reduced model, thereby creating a built-in goodness-of-fit test: the residual chi-square. Such a test is important because the cumulative (or joint) effect of deleting two or more effects from the model may be significant, even if the individual effects are not.
The resulting differences between the two analyses are due to the fact that the latter analysis uses pure weighted least-squares estimates with respect to the four populations that are actually sampled. The former analysis pools populations and therefore uses parameter estimates that can be regarded as weighted least-squares estimates of maximum likelihood predicted cell frequencies. In any case, the estimation methods are asymptotically equivalent; therefore, the results are very similar. If you specify the ML option (instead of the WLS option) in the MODEL statements, then the parameter estimates are identical for the two analyses.
CAUTION: if your model has different covariate profiles within any population, then the first profile is used in the analysis.
REPEATED factor-description <, ..., factor-description >< / options > ;
where a factor-description is
factor-name < $ >< levels >
and factor-descriptions are separated from each other by a comma. The $ is required for character-valued factors. The value of levels provides the number of levels of the repeated measurement factor identified by a given factor-name. For only one repeated measurement factor, levels is optional; for two or more repeated measurement factors, it is required.
The REPEATED statement incorporates repeated measurement factors into the model. You can use this statement whenever there is more than one dependent variable and the keyword _RESPONSE_ is specified in the MODEL statement. If the dependent variables correspond to one or more repeated measurement factors, you can use the REPEATED statement to define _RESPONSE_ in terms of those factors. You can specify the name, type, and number of levels of each factor, as well as the identification of each level.
You cannot specify the REPEATED statement for an analysis that also contains the FACTORS or LOGLIN statement since all of them specify the same information: how to partition the variation among the response functions within a population.
In the REPEATED statement,
factor-name | names a repeated measurement factor that corresponds to two or more response functions. This name must be a valid SAS variable name, and it should not be the same as the name of a variable that already exists in the data set being analyzed. |
$ | indicates that the factor is character-valued. If the $ is omitted, then PROC CATMOD assumes that the factor is numeric. The type of the factor is relevant only when you use the PROFILE= option or when the _RESPONSE_= option specifies nested-by-value effects. |
levels | specifies the number of levels of the corresponding repeated measurement factor. If there is only one such factor and the number is omitted, then PROC CATMOD assumes that the number of levels is equal to the number of response functions per population ( q ). Unless you specify the PROFILE= option, the number q must either be equal to or be a multiple of the product of the number of levels of all the factors. |
You can specify the following options in the REPEATED statement after a slash.
PROFILE=( matrix )
specifies the values assumed by the factors for each response function. There should be one column for each factor, and the values in a given column should match the type (character or numeric) of the corresponding factor. Character values are restricted to 16 characters or less. If there are q response functions per population, then the matrix must have i rows, where q must either be equal to or be a multiple of i . Adjacent rows of the matrix should be separated by a comma.
The values in the PROFILE matrix are useful for specifying models in those situations where the study design is not a full factorial with respect to the factors. They can also be used to specify nested-by-value effects in the _RESPONSE_= option. If you specify character values in both the PROFILE= option and the _RESPONSE_= option, then the values must match with respect to whether or not they are enclosed in quotes (that is, enclosed in quotes in both places or in neither place).
_RESPONSE_= effects
specifies design effects. The variables named in the effects must be factor-names that appear in the REPEATED statement. If the _RESPONSE_= option is omitted, then PROC CATMOD builds a full factorial _RESPONSE_ effect with respect to the repeated measurement factors. For example, the following two statements are equivalent in that they produce the same parameter estimates.

   repeated Time 2, Treatment 2;
   repeated Time 2, Treatment 2 / _response_=Time|Treatment;

However, the second statement produces tests of the Time, Treatment, and Time*Treatment effects in the 'Analysis of Variance' table, whereas the first statement produces a single test for the combined effects in _RESPONSE_.
TITLE= ' title '
displays the title at the top of certain pages of output that correspond to this REPEATED statement.
For further information and numerous examples of the REPEATED statement, see the section 'Repeated Measures Analysis' on page 873.
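The following sketch shows one way the REPEATED statement might be used with a single repeated measurement factor. The data set visits, the variables T1 through T3, and the counts are hypothetical; each of the three dependent variables is a binary response recorded at one visit.

```sas
/* Hypothetical summary data: wt is the number of subjects */
/* with each pattern of responses across three visits      */
data visits;
   input T1 T2 T3 wt @@;
   datalines;
1 1 1 16   1 1 2 13   1 2 1  9   1 2 2  3
2 1 1 14   2 1 2  4   2 2 1 15   2 2 2  6
;

proc catmod data=visits;
   weight wt;
   response marginals;                 /* one marginal probability per visit */
   model T1*T2*T3=_response_;          /* q = 3 response functions           */
   repeated Time 3 / _response_=Time;  /* one factor with three levels       */
run;
```

Here q = 3 equals the number of levels of Time, so the levels value could also be omitted for this single factor.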
RESPONSE < function >< / options > ;
The RESPONSE statement specifies functions of the response probabilities. The procedure models these response functions as linear combinations of the parameters.
By default, PROC CATMOD uses the standard response functions (generalized logits, which are explained in detail in the 'Understanding the Standard Response Functions' section on page 859). With these standard response functions, the default estimation method is maximum likelihood, but you can use the WLS option in the MODEL statement to request weighted least-squares estimation. With other response functions (specified in the RESPONSE statement), the default (and only) estimation method is weighted least squares.
You can specify more than one RESPONSE statement, in which case each RESPONSE statement produces a separate analysis. If the computed response functions for any population are linearly dependent (yielding a singular covariance matrix), then PROC CATMOD displays an error message and stops processing. See the 'Cautions' section on page 887 for methods of dealing with this.
The function specification can be any of the items in the following list. For an example of response functions generated and formulas for q (the number of response functions), see the 'More on Response Functions' section on page 854.
ALOGIT ALOGITS | specifies response functions as adjacent-category logits of the marginal probabilities for each of the dependent variables. For each dependent variable, the response functions are a set of linearly independent adjacent-category logits, obtained by taking the logarithms of the ratios of two probabilities. The denominator of the k th ratio is the marginal probability corresponding to the k th level of the variable, and the numerator is the marginal probability corresponding to the ( k + 1)th level. If a dependent variable has two levels, then the adjacent-category logit is the negative of the generalized logit. |
CLOGIT CLOGITS | specifies that the response functions are cumulative logits of the marginal probabilities for each of the dependent variables. For each dependent variable, the response functions are a set of linearly independent cumulative logits, obtained by taking the logarithms of the ratios of two probabilities. The denominator of the k th ratio is the cumulative probability, c k , corresponding to the k th level of the variable, and the numerator is 1 - c k (Agresti 1984, 113-114). If a dependent variable has two levels, then PROC CATMOD computes its cumulative logit as the negative of its generalized logit. You should use cumulative logits only when the dependent variables are ordinally scaled. |
JOINT | specifies that the response functions are the joint response probabilities. A linearly independent set is created by deleting the last response probability. For the case of one dependent variable, the JOINT and MARGINALS specifications are equivalent. |
LOGIT LOGITS | specifies that the response functions are generalized logits of the marginal probabilities for each of the dependent variables. For each dependent variable, the response functions are a set of linearly independent generalized logits, obtained by taking the logarithms of the ratios of two probabilities. The denominator of each ratio is the marginal probability corresponding to the last observed level of the variable, and the numerators are the marginal probabilities corresponding to each of the other levels. If there is one dependent variable, then specifying LOGIT is equivalent to using the standard response functions. |
MARGINAL MARGINALS | specifies that the response functions are marginal probabilities for each of the dependent variables in the MODEL statement. For each dependent variable, the response functions are a set of linearly independent marginals, obtained by deleting the marginal probability corresponding to the last level. |
MEAN MEANS | specifies that the response functions are the means of the dependent variables in the MODEL statement. This specification requires that all of the dependent variables be numeric. |
READ variables | specifies that the response functions and their covariance matrix are to be read directly from the input data set with one response function for each variable named. See the section 'Inputting Response Functions and Covariances Directly' on page 862 for more information. |
transformation | specifies response functions that can be expressed by using successive applications of the four operations: LOG , EXP , * matrix literal, or + matrix literal. The operations are described in detail in the 'Using a Transformation to Specify Response Functions' section on page 856. |
You can specify the following options in the RESPONSE statement after a slash.
OUT= SAS-data-set
produces a SAS data set that contains, for each population, the observed and predicted values of the response functions, their standard errors, and the residuals. Moreover, if you use the standard response functions, the data set also includes observed and predicted values of the cell frequencies or the cell probabilities. For further information, see the 'Output Data Sets' section on page 866.
OUTEST= SAS-data-set
produces a SAS data set that contains the estimated parameter vector and its estimated covariance matrix. For further information, see the 'Output Data Sets' section on page 866.
TITLE= ' title'
displays the title at the top of certain pages of output that correspond to this RESPONSE statement.
Suppose the dependent variable A has 3 levels and is the only response-effect in the MODEL statement. The following table shows the proportions upon which the response functions are defined.
Value of A : | 1 | 2 | 3 |
proportions: | p 1 | p 2 | p 3 |
Note that $\sum_j p_j = 1$. The following table shows the response functions generated for each population.
Function Specification | Value of q | Response Function |
---|---|---|
none [*] | 2 | $\log(p_1/p_3)$, $\log(p_2/p_3)$ |
ALOGITS | 2 | $\log(p_2/p_1)$, $\log(p_3/p_2)$ |
CLOGITS | 2 | $\log\frac{1-p_1}{p_1}$, $\log\frac{1-(p_1+p_2)}{p_1+p_2}$ |
JOINT | 2 | $p_1$, $p_2$ |
LOGITS | 2 | $\log(p_1/p_3)$, $\log(p_2/p_3)$ |
MARGINAL | 2 | $p_1$, $p_2$ |
MEAN | 1 | $1p_1 + 2p_2 + 3p_3$ |
[*] Without a function specification, the default response functions are generalized logits. |
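To make the single-variable definitions concrete, the following arithmetic evaluates each function specification for one hypothetical population. The proportions 0.2, 0.3, and 0.5 are assumptions chosen for illustration, and the formulas follow the descriptions of ALOGIT, CLOGIT, LOGIT, MARGINAL, and MEAN given earlier in this section.

```latex
% Hypothetical proportions: p_1 = 0.2, p_2 = 0.3, p_3 = 0.5
\begin{align*}
\text{default / LOGITS:} &\quad \log(0.2/0.5) \approx -0.916, \quad \log(0.3/0.5) \approx -0.511 \\
\text{ALOGITS:}          &\quad \log(0.3/0.2) \approx 0.405, \quad \log(0.5/0.3) \approx 0.511 \\
\text{CLOGITS:}          &\quad \log(0.8/0.2) \approx 1.386, \quad \log(0.5/0.5) = 0 \\
\text{JOINT / MARGINAL:} &\quad 0.2, \; 0.3 \\
\text{MEAN:}             &\quad 1(0.2) + 2(0.3) + 3(0.5) = 2.3
\end{align*}
```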
Now, suppose the dependent variables A and B each have 3 levels (valued 1, 2, and 3 each) and the response-effect in the MODEL statement is A * B . The following table shows the proportions upon which the response functions are defined.
Value of A : | 1 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 3 |
Value of B : | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | 3 |
proportions: | p 1 | p 2 | p 3 | p 4 | p 5 | p 6 | p 7 | p 8 | p 9 |
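The marginal totals for the preceding table are defined as follows,

$$p_{1\cdot} = p_1 + p_2 + p_3, \qquad p_{\cdot 1} = p_1 + p_4 + p_7$$
$$p_{2\cdot} = p_4 + p_5 + p_6, \qquad p_{\cdot 2} = p_2 + p_5 + p_8$$
$$p_{3\cdot} = p_7 + p_8 + p_9, \qquad p_{\cdot 3} = p_3 + p_6 + p_9$$

where $\sum_j p_j = 1$. The following table shows the response functions generated for each population.
Function Specification | Value of q | Response Function |
---|---|---|
none [*] | 8 | $\log(p_1/p_9)$, $\log(p_2/p_9)$, ..., $\log(p_8/p_9)$ |
ALOGITS | 4 | $\log(p_{2\cdot}/p_{1\cdot})$, $\log(p_{3\cdot}/p_{2\cdot})$, $\log(p_{\cdot 2}/p_{\cdot 1})$, $\log(p_{\cdot 3}/p_{\cdot 2})$ |
CLOGITS | 4 | $\log\frac{1-p_{1\cdot}}{p_{1\cdot}}$, $\log\frac{1-(p_{1\cdot}+p_{2\cdot})}{p_{1\cdot}+p_{2\cdot}}$, $\log\frac{1-p_{\cdot 1}}{p_{\cdot 1}}$, $\log\frac{1-(p_{\cdot 1}+p_{\cdot 2})}{p_{\cdot 1}+p_{\cdot 2}}$ |
JOINT | 8 | $p_1$, $p_2$, $p_3$, $p_4$, $p_5$, $p_6$, $p_7$, $p_8$ |
LOGITS | 4 | $\log(p_{1\cdot}/p_{3\cdot})$, $\log(p_{2\cdot}/p_{3\cdot})$, $\log(p_{\cdot 1}/p_{\cdot 3})$, $\log(p_{\cdot 2}/p_{\cdot 3})$ |
MARGINAL | 4 | $p_{1\cdot}$, $p_{2\cdot}$, $p_{\cdot 1}$, $p_{\cdot 2}$ |
MEAN | 2 | $1p_{1\cdot} + 2p_{2\cdot} + 3p_{3\cdot}$, $1p_{\cdot 1} + 2p_{\cdot 2} + 3p_{\cdot 3}$ |
[*] Without a function specification, the default response functions are generalized logits. |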
The READ and transformation function specifications are not shown in the preceding table. For these two situations, there is not a general response function; the response functions generated depend on what you specify.
Another important aspect of the function specification is the number of response functions generated per population, q. Let $m_i$ represent the number of levels for the i th dependent variable in the MODEL statement, and let d represent the number of dependent variables in the MODEL statement. Then, if the function specification is ALOGITS, CLOGITS, LOGITS, or MARGINALS, the number of response functions per population is

$$q = \sum_{i=1}^{d} (m_i - 1)$$

If the function specification is JOINT or the default (generalized logits), the number of response functions per population is

$$q = r - 1$$

where r is the number of response profiles. If every possible cross-classification of the dependent variables is observed in the samples, then

$$r = \prod_{i=1}^{d} m_i$$

Otherwise, r is the number of cross-classifications actually observed.
If the function specification is MEANS, the number of response functions per population is q = d .
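As a worked count, consider a hypothetical MODEL statement with d = 3 dependent variables that have 2, 3, and 4 levels. The sketch assumes the counting rules $q = \sum_i (m_i - 1)$ for the marginal-type specifications and $q = r - 1$ for JOINT or the default, which reproduce the values 4 and 8 shown for the two-variable table above.

```latex
% Hypothetical case: d = 3 dependent variables with m_1 = 2, m_2 = 3, m_3 = 4 levels
\begin{align*}
\text{ALOGITS, CLOGITS, LOGITS, MARGINALS:}      &\quad q = (2-1) + (3-1) + (4-1) = 6 \\
\text{JOINT or default, all profiles observed:}  &\quad r = 2 \cdot 3 \cdot 4 = 24, \quad q = r - 1 = 23 \\
\text{MEANS:}                                    &\quad q = d = 3
\end{align*}
```

If some cross-classifications were never observed, r (and therefore q for JOINT or the default) would be smaller than shown.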
Some example response statements are shown in the following table.
Example | Result |
---|---|
response marginals; | marginals for each dependent variable |
response means; | the mean of each dependent variable |
response logits; | generalized logits of the marginal probabilities |
response clogits; | cumulative logits of the marginal probabilities |
response alogits; | adjacent-category logits of the marginal probabilities |
response joint; | the joint probabilities |
response 1 -1 log; | the logit |
response; | generalized logits |
response 1 2 3; | the mean score, with scores of 1, 2, and 3 corresponding to the three response levels |
response read b1-b4; | four response functions and their covariance matrix, read directly from the input data set |
If you specify a transformation , it is applied to the vector that contains the sample proportions in each population. The transformation can be any combination of the following four operations.
Operation | Specification |
---|---|
linear combination | * matrix literal, or matrix literal alone |
logarithm | LOG |
exponential | EXP |
adding constant | + matrix literal |
If more than one operation is specified, then PROC CATMOD applies the operations consecutively from right to left.
A matrix literal is a matrix of numbers with each row of the matrix separated from the next by a comma. If you specify a linear combination, in most cases the * is not needed. The following statement defines the response function p 1 + 1. The * is needed to separate the two matrix literals '1' and '1 0'.
response + 1 * 1 0;
The LOG of a vector transforms each element of the vector into its natural logarithm; the EXP of a vector transforms each element into its exponential function (antilogarithm).
In order to specify a linear response function for data that have r = 3 response categories, you could specify either of the following RESPONSE statements:

   response * 1 0 0 , 0 1 0;
   response 1 0 0 , 0 1 0;

The matrix literal in the preceding statements specifies a $2 \times 3$ matrix, which is applied to each population as follows:

$$\begin{bmatrix} F_1 \\ F_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \\ p_3 \end{bmatrix}$$

where $p_1$, $p_2$, and $p_3$ are sample proportions for the three response categories in a population, and $F_1$ and $F_2$ are the two response functions computed for that population. This response function, therefore, sets $F_1 = p_1$ and $F_2 = p_2$ in each population.
As another example of the linear response function, suppose you have two dependent variables corresponding to two observers who evaluate the same subjects. If the observers grade on the same three-point scale and if all nine possible responses are observed, then the following RESPONSE statement would compute the probability that the observers agree on their assessments:
response 1 0 0 0 1 0 0 0 1;
This response function is then computed as

$$F = p_{11} + p_{22} + p_{33}$$

where $p_{ij}$ denotes the probability that a subject gets a grade of i from the first observer and j from the second observer.
If the function is a compound function, requiring more than one operation to specify it, then the operations should be listed so that the first operation to be applied is on the right and the last operation to be applied is on the left. For example, if there are two response levels, the response function
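
   response 1 -1 log;

is equivalent to the matrix expression:

$$F = \begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} \log(p_1) \\ \log(p_2) \end{bmatrix} = \log(p_1) - \log(p_2) = \log\left(\frac{p_1}{p_2}\right)$$

which is the logit response function since $p_2 = 1 - p_1$ when there are only two response levels.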
Another example of a compound response function is
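
   response exp 1 -1 * 1 0 0 1, 0 1 1 0 log;

which is equivalent to the matrix expression
F = EXP ( A * B * LOG ( P ))
where P is the vector of sample proportions for some population,

$$A = \begin{bmatrix} 1 & -1 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix}$$

If the four responses are based on two dependent variables, each with two levels, then the function can also be written as

$$F = \frac{p_{11}\, p_{22}}{p_{12}\, p_{21}}$$

which is the odds (crossproduct) ratio for a $2 \times 2$ table.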
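The right-to-left evaluation of this compound function can be traced step by step. The sketch below uses hypothetical proportions and assumes that the four response profiles are ordered as $(p_{11}, p_{12}, p_{21}, p_{22})$, with A = (1 -1) and the rows of B equal to (1 0 0 1) and (0 1 1 0), which is the sign pattern needed to obtain the stated odds ratio.

```latex
% Hypothetical proportions: P = (p_{11}, p_{12}, p_{21}, p_{22})' = (0.4, 0.1, 0.2, 0.3)'
\begin{align*}
B \, \log(P)      &= \begin{bmatrix} \log 0.4 + \log 0.3 \\ \log 0.1 + \log 0.2 \end{bmatrix}
                   = \begin{bmatrix} \log 0.12 \\ \log 0.02 \end{bmatrix} \\
A \, B \, \log(P) &= \log 0.12 - \log 0.02 = \log 6 \\
F                 &= \exp(\log 6) = 6 = \frac{(0.4)(0.3)}{(0.1)(0.2)}
\end{align*}
```

The LOG operation is applied first, then the two matrix literals from right to left, and EXP last.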
If no RESPONSE statement is specified, PROC CATMOD computes the standard response functions, which contrast the log of each response probability with the log of the probability for the last response category. If there are r response categories, then there are $r - 1$ standard response functions. For example, if there are four response categories, using no RESPONSE statement is equivalent to specifying

   response 1 0 0 -1, 0 1 0 -1, 0 0 1 -1 log;

This results in three response functions:

$$F = \begin{bmatrix} F_1 \\ F_2 \\ F_3 \end{bmatrix} = \begin{bmatrix} \log(p_1/p_4) \\ \log(p_2/p_4) \\ \log(p_3/p_4) \end{bmatrix}$$

If there are only two response levels, the resulting response function would be a logit. Thus, the standard response functions are called generalized logits. They are useful in dealing with the log-linear model:

$$\pi = \mathrm{EXP}(X\beta)$$

If C denotes the matrix in the preceding RESPONSE statement, then because of the restriction that the probabilities sum to 1, it follows that an equivalent model is

$$C * \mathrm{LOG}(\pi) = (CX)\beta$$

But C * LOG ( P ) is simply the vector of standard response functions. Thus, fitting a log-linear model on the cell probabilities is equivalent to fitting a linear model on the generalized logits.
RESTRICT parameter=value <... parameter=value > ;
where parameter is the letter B followed by a number; for example, B3 specifies the third parameter in the model. The value is the value to which the parameter is restricted. The RESTRICT statement restricts values of parameters to the values you specify, so that the estimation of the remaining parameters is subject to these restrictions. Consider the following statement:
restrict b1=1 b4=0 b6=0;
This restricts the values of three parameters. The first parameter is set to 1, and the fourth and sixth parameters are set to zero.
The RESTRICT statement is interactive. A new RESTRICT statement replaces any previous ones. In addition, if you submit two or more MODEL, LOGLIN, FACTORS, or REPEATED statements, then the subsequent occurrences of these statements also delete the previous RESTRICT statement.
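The interactive behavior described above might be exercised as in the following sketch. It reuses the layout of data set two from the POPULATION statement examples; the assumption that B3 and B4 correspond to the B and A*B parameters follows from the order of effects in the MODEL statement and should be checked against the 'Parameter Estimates' table.

```sas
/* Same layout as data set two in the POPULATION statement examples */
data two;
   input A $ B $ Y wt @@;
   datalines;
yes yes 1 23  yes yes 2 63  yes no 1 31  yes no 2 70
no  yes 1 47  no  yes 2 80  no  no 1 50  no  no 2 84
;

proc catmod data=two;
   weight wt;
   model Y=A B A*B / wls;
run;                     /* first RUN group: unrestricted fit */

   restrict b4=0;        /* assumed to be the A*B parameter   */
run;                     /* refit subject to the restriction  */

   restrict b3=0 b4=0;   /* replaces the previous RESTRICT statement */
run;
quit;
```

Each RUN statement after the first executes only the statements submitted since the previous RUN, so the model is refit under whichever RESTRICT statement is current.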
WEIGHT variable ;
You can use a WEIGHT statement to refer to a variable containing the cell frequencies, which need not be integers. The WEIGHT statement lets you use summary data sets containing a count variable. See the 'Input Data Sets' section on page 860 for further information concerning the WEIGHT statement.