Syntax | SAS/STAT 9.1 User's Guide, Volume 3

You can specify the following statements in the GENMOD procedure. Items within the <> are optional.

PROC GENMOD < options > ;
- ASSESS | ASSESSMENT keyword < / options > ;
- BY variables ;
- CLASS variables ;
- CONTRAST label effect values < ,... effect values > < / options > ;
- DEVIANCE variable = expression ;
- ESTIMATE label effect values < effect values > < / options > ;
- FREQ | FREQUENCY variable ;
- FWDLINK variable = expression ;
- INVLINK variable = expression ;
- LSMEANS effects < / options > ;
- MODEL response = < effects >< / options > ;
- OUTPUT < OUT= SAS-data-set > < keyword= name ... keyword= name > ;
- programming statements
- REPEATED SUBJECT= subject-effect < / options > ;
- WEIGHT | SCWGT variable ;
- VARIANCE variable = expression ;

The PROC GENMOD statement invokes the procedure. All statements other than the MODEL statement are optional. The CLASS statement, if present, must precede the MODEL statement, and the CONTRAST statement must come after the MODEL statement.

PROC GENMOD Statement

PROC GENMOD < options > ;

The PROC GENMOD statement invokes the procedure. You can specify the following options.

DATA= SAS-data-set

specifies the SAS data set containing the data to be analyzed. If you omit the DATA= option, the procedure uses the most recently created SAS data set.

DESCENDING | DESCEND | DESC

specifies that the levels of the response variable for the ordinal multinomial model and the binomial model with single variable response syntax be sorted in the reverse of the default order. For example, if RORDER=FORMATTED (the default), the DESCENDING option causes the levels to be sorted from highest to lowest instead of from lowest to highest. If RORDER=FREQ, the DESCENDING option causes the levels to be sorted from lowest frequency count to highest instead of from highest to lowest.
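As a minimal sketch of the DESCENDING option, the following step fits a binary logistic model. The data set trial and the variables dose and resp are hypothetical names made up for this illustration. In the default order the level resp=0 sorts first; with DESCENDING the order is reversed so that resp=1 sorts first.

```sas
/* Hypothetical data: one binary outcome (resp) per subject at each dose */
data trial;
   input dose resp @@;
   datalines;
1 0 1 0 1 1 2 0 2 1 2 0 3 1 3 0 3 1 4 1 4 1 4 0
;

/* DESCENDING reverses the default (RORDER=FORMATTED) response ordering */
proc genmod data=trial descending;
   model resp = dose / dist=binomial link=logit;
run;
```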

NAMELEN= n

specifies the length of effect names in tables and output data sets to be n characters long, where n is a value between 20 and 200 characters. The default length is 20 characters.

ORDER= keyword

specifies the sorting order for the levels of the classification variables (specified in the CLASS statement). This ordering determines which parameters in the model correspond to each level in the data, so the ORDER= option may be useful when you use the CONTRAST or ESTIMATE statement. Note that the ORDER= option applies to the levels for all classification variables. The exception is the default ORDER=FORMATTED for numeric variables for which you have supplied no explicit format. In this case, the levels are ordered by their internal value. Note that this represents a change from previous releases for how class levels are ordered. In releases previous to Version 8, numeric class levels with no explicit format were ordered by their BEST12. formatted values, and in order to revert to the previous ordering you can specify this format explicitly for the affected classification variables. The change was implemented because the former default behavior for ORDER=FORMATTED often resulted in levels not being ordered numerically and usually required the user to intervene with an explicit format or ORDER=INTERNAL to get the more natural ordering. The following table displays the valid keywords and describes how PROC GENMOD interprets them.

ORDER= keyword	Levels Sorted by
DATA	order of appearance in the input data set
FORMATTED	external formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value
FREQ	descending frequency count; levels with the most observations come first in the order
INTERNAL	unformatted value

By default, ORDER=FORMATTED. For ORDER=FORMATTED and ORDER=INTERNAL, the sort order is machine dependent. For more information on sorting order, refer to the chapter titled "The SORT Procedure" in the SAS Procedures Guide.

RORDER= keyword

specifies the sorting order for the levels of the response variable. This ordering determines which intercept parameter in the model corresponds to each level in the data. If RORDER=FORMATTED for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values. Note that this represents a change from previous releases for how class levels are ordered. In releases previous to Version 8, numeric class levels with no explicit format were ordered by their BEST12. formatted values, and in order to revert to the previous ordering you can specify this format explicitly for the response variable. The change was implemented because the former default behavior for RORDER=FORMATTED often resulted in levels not being ordered numerically and usually required the user to intervene with an explicit format or RORDER=INTERNAL to get the more natural ordering. The following table displays the valid keywords and describes how PROC GENMOD interprets them.

RORDER= keyword	Levels Sorted by
DATA	order of appearance in the input data set
FORMATTED	external formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value
FREQ	descending frequency count; levels with the most observations come first in the order
INTERNAL	unformatted value

By default, RORDER=FORMATTED. For RORDER=FORMATTED and RORDER=INTERNAL, the sort order is machine dependent. The DESCENDING option in the PROC GENMOD statement causes the response variable to be sorted in the reverse of the order displayed in the previous table. For more information on sorting order, refer to the chapter on the SORT procedure in the SAS Procedures Guide.

The NOPRINT option, which suppresses displayed output in other SAS procedures, is not available in the PROC GENMOD statement. However, you can use the Output Delivery System (ODS) to suppress all displayed output, store all output on disk for further analysis, or create SAS data sets from selected output. You can suppress all displayed output with the statement ODS SELECT NONE; and you can turn displayed output back on with the statement ODS SELECT ALL;. See Table 31.3 on page 1694 for the names of output tables available from PROC GENMOD. For more information on ODS, see Chapter 14, "Using the Output Delivery System."
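The ODS approach described above can be sketched as follows. The data set claims, its variables, and the output data set name pe are hypothetical and chosen only for illustration; ParameterEstimates is assumed here to be the ODS name of the parameter estimates table, so check it against the table of ODS names before relying on it.

```sas
/* Hypothetical data: claim counts (c) and policyholders (n) by car size and age group */
data claims;
   input car $ age $ n c;
   ln = log(n);                        /* offset for the Poisson model */
   datalines;
small  1  500  42
medium 1 1200  37
large  1  100   1
small  2  400 101
medium 2  500  73
large  2  300  14
;

ods select none;                       /* suppress all displayed output */
ods output ParameterEstimates=pe;      /* still capture one table as a data set */
proc genmod data=claims;
   class car age;
   model c = car age / dist=poisson link=log offset=ln;
run;
ods select all;                        /* turn displayed output back on */

proc print data=pe;
run;
```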

ASSESS Statement (Experimental)

ASSESS | ASSESSMENT VAR=(effect) | LINK < / options > ;

The ASSESS statement computes and plots, using ODS graphics, model-checking statistics based on aggregates of residuals. See the "Assessment of Models Based on Aggregates of Residuals" section on page 1680 for details about the model assessment methods available in GENMOD.

The types of aggregates available are cumulative residuals, moving sums of residuals, and lowess smoothed residuals. If you do not specify which aggregate to use, the assessments are based on cumulative sums. PROC GENMOD uses experimental ODS graphics for graphical displays. For specific information about the experimental graphics available in GENMOD, see the "ODS Graphics" section on page 1695.

You must specify either LINK or VAR= in order to create an analysis.

LINK

requests the assessment of the link function by performing the analysis with respect to the linear predictor.

VAR=( effect )

specifies that the functional form of a covariate be checked by performing the analysis with respect to the variable identified by the effect. The effect must be specified in the MODEL statement and must contain only continuous variables (variables not listed in a CLASS statement).

You can specify the following options after the slash (/).

CRPANEL

requests that a plot with four panels, each containing aggregates of the observed residuals and two simulated curves, be created.

LOWESS < ( number ) >

requests model assessment based on lowess smoothed residuals, with the optional number specifying the fraction of data used. The value of number must be between zero and one. If number is not specified, the default value of one-third is used.

NPATHS | NPATH | PATHS | PATH = number

specifies the number of simulated paths to plot on the default aggregate residuals plot.

RESAMPLE | RESAMPLES < = number >

specifies that a p-value be computed based on 1,000 simulated paths, or on number paths if number is specified.

SEED= number

specifies a seed for the normal random number generator used in creating simulated realizations of aggregates of residuals for plots and estimating p-values. Specifying a seed allows you to produce identical graphs and p-values from run to run of the procedure. If a seed is not specified, or if number is negative or zero, a random number seed is derived from the time of day.

WINDOW < ( number ) >

requests assessment based on a moving sum window of width number. If number is not specified, a value of one-half of the range of the x-coordinate is used.
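The following is a hedged sketch of how the ASSESS statement and its options might be combined to check the functional form of one covariate. The data set fit, the variables x and y, and the seed values are hypothetical; the ODS HTML and ODS GRAPHICS statements reflect the assumption that the experimental graphics need an ODS destination and graphics to be switched on.

```sas
/* Hypothetical simulated data: linear response in a continuous covariate x */
data fit;
   do i = 1 to 100;
      x = i / 10;
      y = 1 + 0.5*x + rannor(12345);
      output;
   end;
   drop i;
run;

ods html;
ods graphics on;
proc genmod data=fit;
   model y = x / dist=normal link=identity;
   /* cumulative residuals against x, p-value from 1,000 simulated paths, */
   /* fixed seed for reproducible plots, plus the four-panel plot         */
   assess var=(x) / resample=1000 seed=603708000 crpanel;
run;
ods graphics off;
ods html close;
```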

BY Statement

BY variables ;

You can specify a BY statement with PROC GENMOD to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables.

Since sorting the data changes the order in which PROC GENMOD reads the data, this can affect the sorting order for the levels of classification variables if you have specified ORDER=DATA in the PROC GENMOD statement. This, in turn, affects specifications in the CONTRAST statement.

If your input data set is not sorted in ascending order, use one of the following alternatives:

- Sort the data using the SORT procedure with a similar BY statement.
- Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the GENMOD procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
- Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
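The first alternative, sorting before the analysis, can be sketched as follows. The data set visits and the variables region, trt, and count are hypothetical and exist only for this illustration.

```sas
/* Hypothetical data: event counts by region and treatment */
data visits;
   input region $ trt $ count @@;
   datalines;
west A 12 east A 9 west B 20 east B 15 west A 14 east A 7 west B 18 east B 17
;

proc sort data=visits;
   by region;              /* put the BY groups in ascending order */
run;

proc genmod data=visits;
   by region;              /* one separate analysis per value of region */
   class trt;
   model count = trt / dist=poisson;
run;
```

If ORDER=DATA were specified here, the sort could change which level of trt is read first, which is the interaction with CONTRAST specifications noted above.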

CLASS Statement

The CLASS statement names the classification variables to be used in the analysis. The CLASS statement must precede the MODEL statement. You can specify various v-options for each variable by enclosing them in parentheses after the variable name. You can also specify global v-options for the CLASS statement by placing them after a slash (/). Global v-options are applied to all the variables specified in the CLASS statement. If you specify more than one CLASS statement, the global v-options specified on any one CLASS statement apply to all CLASS statements. However, individual CLASS variable v-options override the global v-options.

DESCENDING | DESC

reverses the sorting order of the classification variable.

MISSING

allows a missing value ('.' for a numeric variable and blanks for a character variable) as a valid value for the CLASS variable.

ORDER=DATA | FORMATTED | FREQ | INTERNAL

specifies the sorting order for the levels of classification variables. This ordering determines which parameters in the model correspond to each level in the data, so the ORDER= option may be useful when you use the CONTRAST or ESTIMATE statement. If ORDER=FORMATTED for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values. Note that this represents a change from previous releases for how class levels are ordered. In releases previous to Version 8, numeric class levels with no explicit format were ordered by their BEST12. formatted values, and in order to revert to the previous ordering you can specify this format explicitly for the affected classification variables. The change was implemented because the former default behavior for ORDER=FORMATTED often resulted in levels not being ordered numerically and usually required the user to intervene with an explicit format or ORDER=INTERNAL to get the more natural ordering. The following table shows how PROC GENMOD interprets values of the ORDER= option.

Value of ORDER=	Levels Sorted By
DATA	order of appearance in the input data set
FORMATTED	external formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value
FREQ	descending frequency count; levels with the most observations come first in the order
INTERNAL	unformatted value

By default, ORDER=FORMATTED. For FORMATTED and INTERNAL, the sort order is machine dependent. For more information on sorting order, see the chapter on the SORT procedure in the SAS Procedures Guide and the discussion of BY-group processing in SAS Language Reference: Concepts.

PARAM= keyword

specifies the parameterization method for the classification variable or variables. Design matrix columns are created from CLASS variables according to the following coding schemes. The default is PARAM=GLM. If PARAM=ORTHPOLY or PARAM=POLY, and the CLASS levels are numeric, then the ORDER= option in the CLASS statement is ignored, and the internal, unformatted values are used. See the "CLASS Variable Parameterization" section on page 1661 for further details.

EFFECT	specifies effect coding
GLM	specifies less-than-full-rank, reference-cell coding; this option can only be used as a global option
ORDINAL | THERMOMETER	specifies the cumulative parameterization for an ordinal CLASS variable
POLYNOMIAL | POLY	specifies polynomial coding
REFERENCE | REF	specifies reference cell coding
ORTHEFFECT	orthogonalizes PARAM=EFFECT
ORTHORDINAL | ORTHOTHERM	orthogonalizes PARAM=ORDINAL
ORTHPOLY	orthogonalizes PARAM=POLYNOMIAL
ORTHREF	orthogonalizes PARAM=REFERENCE

The EFFECT, POLYNOMIAL, REFERENCE, ORDINAL, and their orthogonal parameterizations are full rank. The REF= option in the CLASS statement determines the reference level for the EFFECT, REFERENCE, and their orthogonal parameterizations.

REF= level | keyword

specifies the reference level for PARAM=EFFECT, PARAM=REFERENCE, and their orthogonalizations. For an individual (but not a global) variable REF= option, you can specify the level of the variable to use as the reference level. For a global or individual variable REF= option, you can use one of the following keywords. The default is REF=LAST.

FIRST

designates the first ordered level as reference

LAST

designates the last ordered level as reference

TRUNCATE < =n >

specifies the length n of CLASS variable values to use in determining CLASS variable levels. If you specify TRUNCATE without the length n, the first 16 characters of the formatted values are used. When formatted values are longer than 16 characters, you can use this option to revert to the levels as determined in releases previous to Version 9. The default is to use the full formatted length of the CLASS variable. The TRUNCATE option is only available as a global option.
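A short sketch of individual and global v-options follows. The data set trial2 and the variables trt, sex, and y are hypothetical. The global options after the slash request reference-cell coding with the first ordered level as reference, and the individual REF= option on trt overrides the global choice for that variable only.

```sas
/* Hypothetical data: counts by treatment and sex */
data trial2;
   input trt $ sex $ y @@;
   datalines;
placebo F 10 placebo M 12 drugA F 15 drugA M 17 drugB F 21 drugB M 19
placebo F 11 placebo M 13 drugA F 14 drugA M 18 drugB F 22 drugB M 20
;

proc genmod data=trial2;
   /* global: PARAM=REF, REF=FIRST; individual: placebo is the reference for trt */
   class trt(ref='placebo') sex / param=ref ref=first;
   model y = trt sex / dist=poisson;
run;
```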

CONTRAST Statement

CONTRAST label effect values < ,... effect values >< /options > ;

The CONTRAST statement provides a means for obtaining a test for a specified hypothesis concerning the model parameters. This is accomplished by specifying a matrix L for testing the hypothesis $L'\beta = 0$. You must be familiar with the details of the model parameterization that PROC GENMOD uses. For more information, see the "Parameterization Used in PROC GENMOD" section on page 1661 and the "CLASS Variable Parameterization" section on page 1661. Computed statistics are based on the asymptotic chi-square distribution of the likelihood ratio statistic, or the generalized score statistic for GEE models, with degrees of freedom determined by the number of linearly independent rows in the L matrix. You can request Wald chi-square statistics with the WALD option in the CONTRAST statement.

There is no limit to the number of CONTRAST statements that you can specify, but they must appear after the MODEL statement. Statistics for multiple CONTRAST statements are displayed in a single table.

The following parameters are specified in the CONTRAST statement:

label	identifies the contrast on the output. A label is required for every contrast specified. Labels can be up to 20 characters and must be enclosed in single quotes.
effect	identifies an effect that appears in the MODEL statement. The value INTERCEPT or intercept can be used as an effect when an intercept is included in the model. You do not need to include all effects that are included in the MODEL statement.
values	are constants that are elements of the L vector associated with the effect.

The rows of $L'$ are specified in order and are separated by commas.

If you use the default less-than-full-rank GLM CLASS variable parameterization, each row of the L matrix is checked for estimability. If PROC GENMOD finds a contrast to be nonestimable, it displays missing values in corresponding rows in the results. Refer to Searle (1971) for a discussion of estimable functions. If the elements of L are not specified for an effect that contains a specified effect, then the elements of the specified effect are distributed over the levels of the higher-order effect just as the GLM procedure does for its CONTRAST and ESTIMATE statements. For example, suppose that the model contains effects A and B and their interaction A*B. If you specify a CONTRAST statement involving A alone, the L matrix contains nonzero terms for both A and A*B, since A*B contains A.

When you use any of the full-rank PARAM= CLASS variable options, all parameters are directly estimable, and rows of L are not checked for estimability.

If an effect is not specified in the CONTRAST statement, all of its coefficients in the L matrix are set to 0. If too many values are specified for an effect, the extra ones are ignored. If too few values are specified, the remaining ones are set to 0.

PROC GENMOD handles missing level combinations of classification variables in the same manner as the GLM and MIXED procedures. Parameters corresponding to missing level combinations are not included in the model. This convention can affect the way in which you specify the L matrix in your CONTRAST statement.

If you specify the WALD option, the test of hypothesis is based on a Wald chi-square statistic. If you omit the WALD option, the test statistic computed depends on whether an ordinary generalized linear model or a GEE-type model is specified.

For an ordinary generalized linear model, the CONTRAST statement computes the likelihood ratio statistic. This is defined to be twice the difference between the log likelihood of the model unconstrained by the contrast and the log likelihood with the model fitted under the constraint that the linear function of the parameters defined by the contrast is equal to 0. A p-value is computed based on the asymptotic chi-square distribution of the chi-square statistic.

If you specify a GEE model with the REPEATED statement, the test is based on a score statistic. The GEE model is fit under the constraint that the linear function of the parameters defined by the contrast is equal to 0. The score chi-square statistic is computed based on the generalized score function. See the "Generalized Score Statistics" section on page 1680 for more information.

The degrees of freedom is the number of linearly independent constraints implied by the CONTRAST statement, that is, the rank of L.

You can specify the following options after a slash (/).

E

requests that the L matrix be displayed.

SINGULAR = number

tunes the estimability checking. If v is a vector, define ABS(v) to be the absolute value of the element of v with the largest absolute value. Let $K'$ be any row in the contrast matrix L. Define C to be equal to ABS($K'$) if ABS($K'$) is greater than 0; otherwise, C equals 1. If ABS($K' - K'T$) is greater than C*number, then K is declared nonestimable. T is the Hermite form matrix $(X'X)^{-}(X'X)$, and $(X'X)^{-}$ represents a generalized inverse of the matrix $X'X$. The value for number must be between 0 and 1; the default value is 1E-4.

WALD

requests that a Wald chi-square statistic be computed for the contrast rather than the default likelihood ratio or score statistic. The Wald statistic for testing $L'\beta = 0$ is defined by

$$S = (L'\hat{\beta})' \, (L'\hat{\Sigma}L)^{-} \, (L'\hat{\beta})$$

where $\hat{\beta}$ is the maximum likelihood estimate, $\hat{\Sigma}$ is its estimated covariance matrix, and the superscript $-$ denotes a generalized inverse. The asymptotic distribution of S is $\chi^2_r$, where r is the rank of L. Computed p-values are based on this distribution.

If you specify a GEE model with the REPEATED statement, $\hat{\Sigma}$ is the empirical covariance matrix estimate.
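As an illustration of single-row and multi-row contrasts, consider the sketch below. The data set claims and its variables are hypothetical. With the default GLM parameterization and ORDER=FORMATTED, the levels of car are assumed to sort as large, medium, small, so the coefficients are listed in that order.

```sas
/* Hypothetical data: claim counts (c) and policyholders (n) by car size and age group */
data claims;
   input car $ age $ n c;
   ln = log(n);
   datalines;
small  1  500  42
medium 1 1200  37
large  1  100   1
small  2  400 101
medium 2  500  73
large  2  300  14
;

proc genmod data=claims;
   class car age;
   model c = car age / dist=poisson link=log offset=ln;
   /* one row: large vs small; E displays the L matrix */
   contrast 'large vs small' car 1 0 -1 / e;
   /* two rows separated by a comma: 2 df test of any car effect, Wald statistic */
   contrast 'any car effect' car 1 0 -1, car 0 1 -1 / wald;
run;
```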

DEVIANCE Statement

DEVIANCE variable = expression ;

You can specify a probability distribution other than those available in PROC GENMOD by using the DEVIANCE and VARIANCE statements. You do not need to specify the DEVIANCE or VARIANCE statements if you use the DIST= MODEL statement option to specify a probability distribution. The variable identifies the deviance contribution from a single observation to the procedure, and it must be a valid SAS variable name that does not appear in the input data set. The expression can be any arithmetic expression supported by the DATA step language, and it is used to define the functional dependence of the deviance on the mean and the response. You use the automatic variables _MEAN_ and _RESP_ to represent the mean and response in the expression.

Alternatively, the deviance function can be defined using programming statements (see the "Programming Statements" section on page 1645) and assigned to a variable, which is then listed as the expression. This form is convenient for using complex statements such as if-then-else clauses.

The DEVIANCE statement is ignored unless the VARIANCE statement is also specified.
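The sketch below shows the programming-statement form, using the Poisson deviance and variance function purely as a familiar illustration (in practice DIST=POISSON would be simpler). The data set claims, the helper variables a, y, and d, and the names var and dev are all hypothetical choices, not names required by the procedure.

```sas
/* Hypothetical data: claim counts (c) and policyholders (n) by car size and age group */
data claims;
   input car $ age $ n c;
   ln = log(n);
   datalines;
small  1  500  42
medium 1 1200  37
large  1  100   1
small  2  400 101
medium 2  500  73
large  2  300  14
;

proc genmod data=claims;
   class car age;
   a = _MEAN_;                                /* mean of the current observation     */
   y = _RESP_;                                /* response of the current observation */
   d = 2 * ( y * log( y / a ) - ( y - a ) );  /* Poisson deviance contribution       */
   variance var = a;                          /* variance function V(mu) = mu        */
   deviance dev = d;
   model c = car age / link=log offset=ln;    /* no DIST= option: user-defined       */
run;
```

The deviance expression assumes strictly positive responses, which holds for this illustrative data set; a zero count would need an if-then-else clause to avoid log(0).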

ESTIMATE Statement

ESTIMATE label effect values ... < /options > ;

The ESTIMATE statement is similar to a CONTRAST statement, except that only one-row $L'$ matrices are permitted.

If you use the default less-than-full-rank GLM CLASS variable parameterization, each row is checked for estimability. If PROC GENMOD finds a contrast to be nonestimable, it displays missing values in corresponding rows in the results. Refer to Searle (1971) for a discussion of estimable functions.

The actual estimate, $L'\hat{\beta}$, its approximate standard error, and its confidence limits are displayed. A Wald chi-square test that $L'\beta = 0$ is also displayed.

The approximate standard error of the estimate is computed as the square root of $L'\hat{\Sigma}L$, where $\hat{\Sigma}$ is the estimated covariance matrix of the parameter estimates. If you specify a GEE model in the REPEATED statement, $\hat{\Sigma}$ is the empirical covariance matrix estimate.

If you specify the EXP option, then exp($L'\hat{\beta}$), its standard error, and its confidence limits are also displayed.

The construction of the L vector for an ESTIMATE statement follows the same rules as listed under the CONTRAST statement.

You can specify the following options in the ESTIMATE statement after a slash (/).

ALPHA= number

requests that a confidence interval be constructed with confidence level 1 − number. The value of number must be between 0 and 1; the default value is 0.05.

E

requests that the L matrix coefficients be displayed.

EXP

requests that exp($L'\hat{\beta}$), its standard error, and its confidence limits be computed.
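A hedged example of the ESTIMATE statement with its options follows. The data set claims and its variables are hypothetical, and the coefficient order assumes the levels of car sort as large, medium, small under the default ordering. With a log link, the EXP option puts the estimated difference on the rate-ratio scale.

```sas
/* Hypothetical data: claim counts (c) and policyholders (n) by car size and age group */
data claims;
   input car $ age $ n c;
   ln = log(n);
   datalines;
small  1  500  42
medium 1 1200  37
large  1  100   1
small  2  400 101
medium 2  500  73
large  2  300  14
;

proc genmod data=claims;
   class car age;
   model c = car age / dist=poisson link=log offset=ln;
   /* 90% limits (ALPHA=0.1), exponentiated estimate, and L coefficients displayed */
   estimate 'large vs small' car 1 0 -1 / exp alpha=0.1 e;
run;
```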

FREQ Statement

FREQ | FREQUENCY variable ;

The variable in the FREQ statement identifies a variable in the input data set containing the frequency of occurrence of each observation. PROC GENMOD treats each observation as if it appears n times, where n is the value of the FREQ variable for the observation. If it is not an integer, the frequency value is truncated to an integer. If it is less than 1 or if it is missing, the observation is not used.
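The truncation and exclusion rules can be seen in a small sketch. The data set grouped and the variables x, y, and count are hypothetical. Following the rules above, the frequency 2.7 is treated as 2, and the row with frequency 0 is not used.

```sas
/* Hypothetical data: each row stands for 'count' identical observations */
data grouped;
   input x y count;
   datalines;
0 0 20
0 1 5
1 0 10
1 1 15
1 1 2.7
1 0 0
;

proc genmod data=grouped descending;
   freq count;                     /* 2.7 is truncated to 2; the 0 row is dropped */
   model y = x / dist=binomial;
run;
```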

FWDLINK Statement

FWDLINK variable = expression ;

You can define a link function other than a built-in link function by using the FWDLINK statement. If you use the MODEL statement option LINK= to specify a link function, you do not need to use the FWDLINK statement. The variable identifies the link function to the procedure. The expression can be any arithmetic expression supported by the DATA step language, and it is used to define the functional dependence on the mean.

Alternatively, the link function can be defined by using programming statements (see the "Programming Statements" section on page 1645) and assigned to a variable, which is then listed as the expression. The second form is convenient for using complex statements such as if-then-else clauses. The GENMOD procedure automatically computes derivatives of the link function required for iterative fitting. You must specify the inverse of the link function in the INVLINK statement when you specify the FWDLINK statement to define the link function. You use the automatic variable _MEAN_ to represent the mean in the preceding expression.

INVLINK Statement

INVLINK variable = expression ;

If you define a link function in the FWDLINK statement, then you must define the inverse link function using the INVLINK statement. If you use the MODEL statement option LINK= to specify a link function, you do not need to use the INVLINK statement. The variable identifies the inverse link function to the procedure. The expression can be any arithmetic expression supported by the DATA step language, and it is used to define the functional dependence on the linear predictor.

Alternatively, the inverse link function can be defined using programming statements (see the "Programming Statements" section on page 1645) and assigned to a variable, which is then listed as the expression. The second form is convenient for using complex statements such as if-then-else clauses. The automatic variable _XBETA_ represents the linear predictor in the preceding expression.
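A minimal sketch of a matched FWDLINK and INVLINK pair is shown below, re-creating the log link only as an illustration (LINK=LOG would normally be used instead). The data set claims and the variable names link and ilink are hypothetical.

```sas
/* Hypothetical data: claim counts (c) by car size and age group */
data claims;
   input car $ age $ c;
   datalines;
small  1  42
medium 1  37
large  1   1
small  2 101
medium 2  73
large  2  14
;

proc genmod data=claims;
   class car age;
   fwdlink link = log(_MEAN_);      /* g(mu) = log(mu)        */
   invlink ilink = exp(_XBETA_);    /* inverse: mu = exp(eta) */
   model c = car age / dist=poisson;
run;
```

The two expressions must be true inverses of each other; the procedure differentiates the link itself but has no way to check that the pair matches.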

LSMEANS Statement

LSMEANS effects < / options > ;

The LSMEANS statement computes least-squares means (LS-means) corresponding to the specified effects for the linear predictor part of the model. The L matrix constructed to compute them is precisely the same as the one formed in PROC GLM.

The LSMEANS statement is not available for multinomial distribution models for ordinal response data.

Each LS-mean is computed as $L'\hat{\beta}$, where L is the coefficient matrix associated with the least-squares mean and $\hat{\beta}$ is the estimate of the parameter vector. The approximate standard error for the LS-mean is computed as the square root of $L'\hat{\Sigma}L$, where $\hat{\Sigma}$ is the estimated covariance matrix of the parameter estimates. If you specify a GEE model in the REPEATED statement, $\hat{\Sigma}$ is the empirical covariance matrix estimate.

LS-means can be computed for any effect in the MODEL statement that involves CLASS variables. You can specify multiple effects in one LSMEANS statement or multiple LSMEANS statements, and all LSMEANS statements must appear after the MODEL statement.

As in the ESTIMATE statement, the L matrix is tested for estimability, and if this test fails, PROC GENMOD displays "Non-est" for the LS-means entries.

Assuming the LS-mean is estimable, PROC GENMOD constructs a Wald chi-square test to test the null hypothesis that the associated population quantity equals zero.

You can specify the following options in the LSMEANS statement after a slash (/).

ALPHA= number

requests that a confidence interval be constructed for each of the LS-means with confidence level (1 − number) × 100%. The value of number must be between 0 and 1; the default value is 0.05, corresponding to a 95% confidence interval.

CL

requests that confidence limits be constructed for each of the LS-means. The confidence level is 0.95 by default; this can be changed with the ALPHA= option.

CORR

displays the estimated correlation matrix of the LS-means as part of the Least Squares Means table.

COV

displays the estimated covariance matrix of the LS-means as part of the Least Squares Means table.

DIFF

requests that differences of the LS-means be displayed. All possible differences of LS-means, standard errors, and a Wald chi-square test are computed. Confidence limits are computed if the CL option is also specified.

E

requests that the L matrix coefficients for all LSMEANS effects be displayed.
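The LSMEANS options can be combined as in the following sketch. The data set claims and its variables are hypothetical. Because LS-means are computed for the linear predictor, the values for this log-link model are on the log scale.

```sas
/* Hypothetical data: claim counts (c) and policyholders (n) by car size and age group */
data claims;
   input car $ age $ n c;
   ln = log(n);
   datalines;
small  1  500  42
medium 1 1200  37
large  1  100   1
small  2  400 101
medium 2  500  73
large  2  300  14
;

proc genmod data=claims;
   class car age;
   model c = car age / dist=poisson link=log offset=ln;
   /* LS-means for car, all pairwise differences, 90% confidence limits */
   lsmeans car / diff cl alpha=0.1;
run;
```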

MODEL Statement

MODEL response = < effects >< /options > ;
MODEL events/trials = < effects >< /options > ;

The MODEL statement specifies the response, or dependent variable, and the effects, or explanatory variables. If you omit the explanatory variables, the procedure fits an intercept-only model. An intercept term is included in the model by default. The intercept can be removed with the NOINT option.

You can specify the response in the form of a single variable or in the form of a ratio of two variables denoted events/trials . The first form is applicable to all responses. The second form is applicable only to summarized binomial response data. When each observation in the input data set contains the number of events (for example, successes) and the number of trials from a set of binomial trials, use the events/trials syntax.

In the events/trials model syntax, you specify two variables that contain the event and trial counts. These two variables are separated by a slash (/). The values of both events and (trials − events) must be nonnegative, and the value of the trials variable must be greater than 0 for an observation to be valid. The variable events or trials may take noninteger values.
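A short sketch of the events/trials form follows. The data set assay and the variables dose, events, and trials are hypothetical; each row summarizes one set of binomial trials.

```sas
/* Hypothetical summarized binomial data: events out of trials at each dose */
data assay;
   input dose events trials;
   datalines;
1  2 20
2  5 20
3  9 20
4 14 20
5 18 20
;

proc genmod data=assay;
   model events/trials = dose / dist=binomial link=logit;
run;
```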

When each observation in the input data set contains a single trial from a binomial or multinomial experiment, use the first form of the preceding MODEL statements. The response variable can be numeric or character. The ordering of response levels is critical in these models. You can use the RORDER= option in the PROC GENMOD statement to specify the response level ordering.

Responses for the Poisson distribution must be positive, but they can be noninteger values.

The effects in the MODEL statement consist of an explanatory variable or combination of variables. Explanatory variables can be continuous or classification variables. Classification variables can be character or numeric. Explanatory variables representing nominal, or classification, data must be declared in a CLASS statement. Interactions between variables can also be included as effects. Columns of the design matrix are automatically generated for classification variables and interactions. The syntax for specification of effects is the same as for the GLM procedure. See the "Specification of Effects" section on page 1659 for more information. Also refer to Chapter 32, "The GLM Procedure."

You can specify the following options in the MODEL statement after a slash (/).

AGGREGATE= (variable-list)

AGGREGATE= variable

specifies the subpopulations on which the Pearson chi-square and the deviance are calculated. This option applies only to the multinomial distribution or the binomial distribution with binary (single trial syntax) response. It is ignored if specified for other cases. Observations with common values in the given list of variables are regarded as coming from the same subpopulation. This affects the computation of the deviance and Pearson chi-square statistics. Variables in the list can be any variables in the input data set.

ALPHA | ALPH | A= number

sets the confidence coefficient for parameter confidence intervals to 1 − number. The value of number must be between 0 and 1. The default value of number is 0.05.

CICONV= number

sets the convergence criterion for profile likelihood confidence intervals. See the "Confidence Intervals for Parameters" section on page 1666 for the definition of convergence. The value of number must be between 0 and 1. By default, CICONV=1E-4.

CL

requests that confidence limits for predicted values be displayed. See the OBSTATS option.

CODING=EFFECT | FULLRANK

specifies that effect coding be used for all class variables in the model. This is the same as specifying PARAM=EFFECT as a CLASS statement option.

CONVERGE= number

sets the convergence criterion. The value of number must be between 0 and 1. The iterations are considered to have converged when the maximum change in the parameter estimates between iteration steps is less than the value specified. The change is a relative change if the parameter is greater than 0.01 in absolute value; otherwise, it is an absolute change. By default, CONVERGE=1E-4. This convergence criterion is used in parameter estimation for a single model fit, Type 1 statistics, and likelihood ratio statistics for Type 3 analyses and CONTRAST statements.

CONVH= number

sets the relative Hessian convergence criterion. The value of number must be between 0 and 1. After convergence is determined with the change in parameter criterion specified with the CONVERGE= option, the quantity $tc = \frac{g' H^{-1} g}{|f|}$ is computed and compared to number, where g is the gradient vector, H is the Hessian matrix for the model parameters, and f is the log-likelihood function. If tc is greater than number, a warning that the relative Hessian convergence criterion has been exceeded is printed. This criterion detects the occasional case where the change in parameter convergence criterion is satisfied, but a maximum in the log-likelihood function has not been attained. By default, CONVH=1E-4.

CORRB

requests that the parameter estimate correlation matrix be displayed.

COVB

requests that the parameter estimate covariance matrix be displayed.

DIST | D | ERROR | ERR = keyword

specifies the built-in probability distribution to use in the model. If you specify the DIST= option and you omit a user-defined link function, a default link function is chosen as displayed in the following table. If you specify no distribution and no link function, then the GENMOD procedure defaults to the normal distribution with the identity link function.

DIST=	Distribution	Default Link Function
BINOMIAL | BIN | B	binomial	logit
GAMMA | GAM | G	gamma	inverse ( power(−1) )
IGAUSSIAN | IG	inverse Gaussian	inverse squared ( power(−2) )
MULTINOMIAL | MULT	multinomial	cumulative logit
NEGBIN | NB	negative binomial	log
NORMAL | NOR | N	normal	identity
POISSON | POI | P	Poisson	log

EXPECTED

requests that the expected Fisher information matrix be used to compute parameter estimate covariances and the associated statistics. The default action is to use the observed Fisher information matrix. See the SCORING= option.

ID= variable

causes the values of variable in the input data set to be displayed in the OBSTATS table. If an explicit format for variable has been defined, the formatted values are displayed. If the OBSTATS option is not specified, this option has no effect.

INITIAL= numbers

sets initial values for parameter estimates in the model. The default initial parameter values are weighted least squares estimates based on using the response data as the initial mean estimate. This option can be useful in case of convergence difficulty. The intercept parameter is initialized with the INTERCEPT= option and is not included here. The values are assigned to the variables in the MODEL statement in the same order in which they appear in the MODEL statement. The order of levels for CLASS variables is determined by the ORDER= option. Note that some levels of class variables can be aliased; that is, they correspond to linearly dependent parameters that are not estimated by the procedure. Initial values must be assigned to all levels of class variables, regardless of whether they are aliased or not. The procedure ignores initial values corresponding to parameters not being estimated. If you specify a BY statement, all class variables must take on the same number of levels in each BY group. Otherwise, class variables in some of the BY groups are assigned incorrect initial values. Types of INITIAL= specifications are illustrated in the following table.

Type of List	Specification
list separated by blanks	INITIAL = 3 4 5
list separated by commas	INITIAL = 3, 4, 5
x to y	INITIAL = 3 to 5
x to y by z	INITIAL = 3 to 5 by 1
combination of list types	INITIAL = 1, 3 to 5, 9

INTERCEPT= number

initializes the intercept term to number for parameter estimation. If you specify both the INTERCEPT= and the NOINT options, the intercept term is not estimated, but an intercept term of number is included in the model.

ITPRINT

displays the iteration history for all iterative processes: parameter estimation, fitting constrained models for contrasts and Type 3 analyses, and profile likelihood confidence intervals. The last evaluation of the gradient and the negative of the Hessian (second derivative) matrix are also displayed for parameter estimation. This option may result in a large amount of displayed output, especially if some of the optional iterative processes are selected.

LINK = keyword

specifies the link function to use in the model. The keywords and their associated built-in link functions are as follows.

LINK=	Link Function
CUMCLL | CCLL	cumulative complementary log-log
CUMLOGIT | CLOGIT	cumulative logit
CUMPROBIT | CPROBIT	cumulative probit
CLOGLOG | CLL	complementary log-log
IDENTITY | ID	identity
LOG	log
LOGIT	logit
PROBIT	probit
POWER( number ) | POW( number )	power with λ = number

If no LINK= option is supplied and there is a user-defined link function, the user-defined link function is used. If you specify neither the LINK= option nor a user-defined link function, then the default canonical link function is used if you specify the DIST= option. Otherwise, if you omit the DIST= option, the identity link function is used.
The cumulative link functions are appropriate only for the multinomial distribution.
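As an illustration of how the DIST= and LINK= options combine, the following sketch fits a Poisson model with an explicit log link. The data set work.claims and the variables n_claims, age, region, and log_exposure are hypothetical names chosen for this example. Because log is the default link for DIST=POISSON, the LINK=LOG specification is redundant here and is shown only to make the pairing visible.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc genmod data=work.claims;
   class region;
   /* DIST= selects the built-in distribution; LINK= overrides or confirms the default link */
   model n_claims = age region / dist=poisson link=log offset=log_exposure;
run;
```

Omitting both DIST= and LINK= from this MODEL statement would instead give the normal distribution with the identity link, as described above.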

LRCI

requests that two-sided confidence intervals for all model parameters be computed based on the profile likelihood function. This is sometimes called the partially maximized likelihood function. See the Confidence Intervals for Parameters section on page 1666 for more information on the profile likelihood function. This computation is iterative and can consume a relatively large amount of CPU time. The confidence coefficient can be selected with the ALPHA= number option. The resulting confidence coefficient is 1 − number. The default confidence coefficient is 0.95.

MAXITER= number

MAXIT= number

sets the maximum allowable number of iterations for all iterative computation processes in PROC GENMOD. By default, MAXITER=50.

NOINT

requests that no intercept term be included in the model. An intercept is included unless this option is specified.

NOSCALE

holds the scale parameter fixed. Otherwise, for the normal, inverse Gaussian, and gamma distributions, the scale parameter is estimated by maximum likelihood. If you omit the SCALE= option, the scale parameter is fixed at the value 1.

OFFSET= variable

specifies a variable in the input data set to be used as an offset variable. This variable cannot be a CLASS variable, and it cannot be the response variable or one of the explanatory variables.

OBSTATS

specifies that an additional table of statistics be displayed. For each observation, the following items are displayed:
- the value of the response variable (variables if the data are binomial), frequency, and weight variables
- the values of the regression variables
- predicted mean, $\mu = g^{-1}(\eta)$, where $\eta = x_i'\beta$ is the linear predictor and g is the link function. If there is an offset, it is included in $x_i'\beta$.
- estimate of the linear predictor $x_i'\beta$. If there is an offset, it is included in $x_i'\beta$.
- standard error of the linear predictor $x_i'\beta$
- the value of the Hessian weight at the final iteration
- lower confidence limit of the predicted value of the mean. The confidence coefficient is specified with the ALPHA= option. See the section Confidence Intervals on Predicted Values on page 1669 for the computational method.
- upper confidence limit of the predicted value of the mean
- raw residual, defined as $Y - \mu$
- Pearson, or chi residual, defined as the square root of the contribution for the observation to the Pearson chi-square, that is

  $(Y - \mu)\sqrt{\frac{w}{V(\mu)}}$

  where Y is the response, $\mu$ is the predicted mean, w is the value of the prior weight variable specified in a WEIGHT statement, and $V(\mu)$ is the variance function evaluated at $\mu$.
- the standardized Pearson residual
- deviance residual, defined as the square root of the deviance contribution for the observation, with sign equal to the sign of the raw residual
- the standardized deviance residual
- the likelihood residual

The RESIDUALS, PREDICTED, XVARS, and CL options cause only subgroups of the observation statistics to be displayed. You can specify more than one of these options to include different subgroups of statistics.
The ID= variable option causes the values of variable in the input data set to be displayed in the table. If an explicit format for variable has been defined, the formatted values are displayed.
If a REPEATED statement is present, a table is displayed for the GEE model specified in the REPEATED statement. Only the regression variables, response values, predicted values, confidence limits for the predicted values, linear predictor, raw residuals, and Pearson residuals for each observation in the input data set are available.
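The following sketch shows one way to request the observation statistics table and keep it as a data set. The input data set work.claims, its variables, and the output data set work.obs_stats are hypothetical; ObStats is assumed to be the ODS table name under which the OBSTATS table is produced.

```sas
/* Capture the OBSTATS table through ODS (output data set name is hypothetical) */
ods output ObStats=work.obs_stats;

proc genmod data=work.claims;
   class region;
   /* ID= adds the values of policy_id to the OBSTATS table */
   model n_claims = age region / dist=poisson obstats id=policy_id;
run;
```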

PREDICTED

PRED

requests that predicted values, the linear predictor, its standard error, and the Hessian weight be displayed. See the OBSTATS option.

RESIDUALS

requests that residuals and standardized residuals be displayed. See the OBSTATS option.

SCALE= number

SCALE=PEARSON

SCALE=P

PSCALE

SCALE=DEVIANCE

SCALE=D

DSCALE

sets the value used for the scale parameter where the NOSCALE option is used. For the binomial and Poisson distributions, which have no free scale parameter, this can be used to specify an overdispersed model. In this case, the parameter covariance matrix and the likelihood function are adjusted by the scale parameter. See the Dispersion Parameter section (page 1658) and the Overdispersion section (page 1659) for more information. If the NOSCALE option is not specified, then number is used as an initial estimate of the scale parameter.
Specifying SCALE=PEARSON or SCALE=P is the same as specifying the PSCALE option. This fixes the scale parameter at the value 1 in the estimation procedure. After the parameter estimates are determined, the exponential family dispersion parameter is assumed to be given by Pearson's chi-square statistic divided by the degrees of freedom, and all statistics such as standard errors and likelihood ratio statistics are adjusted appropriately.
Specifying SCALE=DEVIANCE or SCALE=D is the same as specifying the DSCALE option. This fixes the scale parameter at a value of 1 in the estimation procedure. After the parameter estimates are determined, the exponential family dispersion parameter is assumed to be given by the deviance divided by the degrees of freedom. All statistics such as standard errors and likelihood ratio statistics are adjusted appropriately.
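A minimal sketch of an overdispersed Poisson fit follows, assuming a hypothetical data set work.claims with a count response n_claims. SCALE=PEARSON is used here; per the description above, PSCALE would be an equivalent spelling and SCALE=DEVIANCE would base the adjustment on the deviance instead.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc genmod data=work.claims;
   class region;
   /* Estimation uses scale = 1; standard errors and likelihood ratio   */
   /* statistics are then adjusted by Pearson chi-square / DF           */
   model n_claims = age region / dist=poisson link=log scale=pearson;
run;
```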

SCORING= number

requests that on iterations up to number, the Hessian matrix is computed using Fisher's scoring method. For further iterations, the full Hessian matrix is computed. The default value is 1. A value of 0 causes all iterations to use the full Hessian matrix, and a value greater than or equal to the value of the MAXITER option causes all iterations to use Fisher's scoring. The value of the SCORING= option must be 0 or a positive integer.

SINGULAR= number

sets the tolerance for testing singularity of the information matrix and the crossproducts matrix. Roughly, the test requires that a pivot be at least this number times the original diagonal value. By default, number is $10^{7}$ times the machine epsilon. The default number is approximately $10^{-9}$ on most machines.

TYPE1

requests that a Type 1, or sequential, analysis be performed. This consists of sequentially fitting models, beginning with the null (intercept term only) model and continuing up to the model specified in the MODEL statement. The likelihood ratio statistic between each successive pair of models is computed and displayed in a table. A Type 1 analysis is not available for GEE models, since there is no associated likelihood.

TYPE3

requests that statistics for Type 3 contrasts be computed for each effect specified in the MODEL statement. The default analysis is to compute likelihood ratio statistics for the contrasts or score statistics for GEEs. Wald statistics are computed if the WALD option is also specified.

WALD

requests Wald statistics for Type 3 contrasts. You must also specify the TYPE3 option in order to compute Type 3 Wald statistics.

WALDCI

requests that two-sided Wald confidence intervals for all model parameters be computed based on the asymptotic normality of the parameter estimators. This computation is not as time consuming as the LRCI method, since it does not involve an iterative procedure. However, it is not thought to be as accurate, especially for small sample sizes. The confidence coefficient can be selected with the ALPHA= option in the same way as for the LRCI option.

XVARS

requests that the regression variables be included in the OBSTATS table.

OUTPUT Statement

OUTPUT < OUT= SAS-data-set >
- < keyword=name ... keyword=name > ;

The OUTPUT statement creates a new SAS data set that contains all the variables in the input data set and, optionally, the estimated linear predictors (XBETA) and their standard error estimates, the weights for the Hessian matrix, predicted values of the mean, confidence limits for predicted values, and residuals.

You can also request these statistics with the OBSTATS, PREDICTED, RESIDUALS, CL, or XVARS options in the MODEL statement. You can then create a SAS data set containing them with ODS OUTPUT commands. You may prefer to specify the OUTPUT statement for requesting these statistics since

- the OUTPUT statement produces no tabular output
- the OUTPUT statement creates a SAS data set more efficiently than ODS. This can be an advantage for large data sets.
- you can specify the individual statistics to be included in the SAS data set

If you use the multinomial distribution with one of the cumulative link functions for ordinal data, the data set also contains variables named _ORDER_ and _LEVEL_ that indicate the levels of the ordinal response variable and the values of the variable in the input data set corresponding to the sorted levels. These variables indicate that the predicted value for a given observation is the probability that the response variable is as large as the value of the Value variable.

The estimated linear predictor, its standard error estimate, and the predicted values and their confidence intervals are computed for all observations in which the explanatory variables are all nonmissing, even if the response is missing. By adding observations with missing response values to the input data set, you can compute these statistics for new observations or for settings of the explanatory variables not present in the data without affecting the model fit.

The following list explains specifications in the OUTPUT statement.

OUT= SAS-data-set

specifies the output data set. If you omit the OUT= option, the output data set is created and given a default name using the DATAn convention.

keyword=name

specifies the statistics to be included in the output data set and names the new variables that contain the statistics. Specify a keyword for each desired statistic (see the following list of keywords), an equal sign, and the name of the new variable or variables to contain the statistic. You can list only one variable after the equal sign. Although you can use the OUTPUT statement without any keyword=name specifications, the output data set then contains only the original variables and, possibly, the variables Level and Value (if you use the multinomial model with ordinal data). Note that the residuals are not available for the multinomial model with ordinal data. Formulas for the statistics are given in the section Predicted Values of the Mean on page 1669 and the Residuals section on page 1669. The keywords allowed and the statistics they represent are as follows:

HESSWGT	diagonal element of the weight matrix used in computing the Hessian matrix
LOWER | L	lower confidence limit for the predicted value of the mean, or the lower confidence limit for the probability that the response is less than or equal to the value of Level or Value. The confidence coefficient is determined by the ALPHA= number option in the MODEL statement as (1 − number) × 100%. The default confidence coefficient is 95%.
PREDICTED | PRED | PROB | P	predicted value of the mean or the predicted probability that the response variable is less than or equal to the value of Level or Value if the multinomial model for ordinal data is used (in other words, Pr(Y ≤ Value), where Y is the response variable)
RESCHI	Pearson (Chi) residual for identifying observations that are poorly accounted for by the model
RESDEV	deviance residual for identifying poorly fitted observations
RESLIK	likelihood residual for identifying poorly fitted observations
STDXBETA	standard error estimate of XBETA (see the XBETA keyword)
STDRESCHI	standardized Pearson (Chi) residual for identifying observations that are poorly accounted for by the model
STDRESDEV	standardized deviance residual for identifying poorly fitted observations
UPPER | U	upper confidence limit for the predicted value of the mean, or the upper confidence limit for the probability that the response is less than or equal to the value of Level or Value. The confidence coefficient is determined by the ALPHA= number option in the MODEL statement as (1 − number) × 100%. The default confidence coefficient is 95%.
XBETA	estimate of the linear predictor $x_i'\beta$ for observation i, or $\alpha_j + x_i'\beta$, where j is the corresponding ordered value of the response variable for the multinomial model with ordinal data. If there is an offset, it is included in $x_i'\beta$.
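The keywords above can be combined in a single OUTPUT statement, as in the following sketch. The data sets work.claims and work.fit and all variable names are hypothetical. ALPHA=0.10 in the MODEL statement is included only to show how the confidence coefficient for the LOWER= and UPPER= limits is set (90% in this case).

```sas
/* Hypothetical data set and variable names, for illustration only */
proc genmod data=work.claims;
   class region;
   model n_claims = age region / dist=poisson link=log alpha=0.10;
   /* Each keyword=name pair adds one new variable to work.fit */
   output out=work.fit
          pred=mu_hat   lower=mu_lcl  upper=mu_ucl
          xbeta=eta     stdxbeta=se_eta
          reschi=r_pearson  resdev=r_dev;
run;
```

Observations in work.claims with a missing n_claims but nonmissing explanatory variables would still receive values for mu_hat, mu_lcl, mu_ucl, eta, and se_eta, which is how the statement can be used to score new settings without affecting the fit.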

Programming Statements

Although the most commonly used link and probability distributions are available as built-in functions, the GENMOD procedure enables you to define your own link functions and response probability distributions using the FWDLINK, INVLINK, VARIANCE, and DEVIANCE statements. The variables assigned in these statements can have values computed in programming statements. These programming statements can occur anywhere between the PROC GENMOD statement and the RUN statement. Variable names used in programming statements must be unique. Variables from the input data set may be referenced in programming statements. The mean, linear predictor, and response are represented by the automatic variables _MEAN_ , _XBETA_ , and _RESP_ , which can be referenced in your programming statements. Programming statements are used to define the functional dependencies of the link function, the inverse link function, the variance function, and the deviance function on the mean, linear predictor, and response variable.

The following code illustrates the use of programming statements. Even though you usually request the Poisson distribution by specifying DIST=POISSON as a MODEL statement option, you can define the variance and deviance functions for the Poisson distribution by using the VARIANCE and DEVIANCE statements. For example, the following code performs the same analysis as the Poisson regression example in the Getting Started section on page 1616. The code must be in logical order for computation, just as in a DATA step.

  proc genmod;
     class car age;
     a = _MEAN_;
     y = _RESP_;
     d = 2 * ( y * log( y / a ) - ( y - a ) );
     variance var = a;
     deviance dev = d;
     model c = car age / link = log offset = ln;
  run;

The variables var and dev are dummy variables used internally by the procedure to identify the variance and deviance functions. Any valid SAS variable names can be used.

Similarly, the log link function and its inverse could be defined with the FWDLINK and INVLINK statements.

  fwdlink link = log(_MEAN_);
  invlink ilink = exp(_XBETA_);

This code is for illustration, and it works well for most Poisson regression problems. If, however, in the iterative fitting process, the mean parameter becomes too close to 0, or a 0 response value occurs, an error condition occurs when the procedure attempts to evaluate the log function. You can circumvent this kind of problem by using if-then-else clauses or other conditional statements to check for possible error conditions and appropriately define the functions for these cases.
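A minimal sketch of such a guard, reusing the variables from the Poisson example above, handles the zero-response case explicitly. It relies on the fact that y log(y/a) tends to 0 as y tends to 0, so the deviance contribution reduces to 2a. It does not address a mean that drifts too close to 0; that case would need an additional check of your own choosing.

```sas
proc genmod;
   class car age;
   a = _MEAN_;
   y = _RESP_;
   /* For y = 0, y*log(y/a) is replaced by its limit 0, so d = 2*a */
   if y > 0 then d = 2 * ( y * log( y / a ) - ( y - a ) );
   else d = 2 * a;
   variance var = a;
   deviance dev = d;
   model c = car age / link = log offset = ln;
run;
```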

Data set variables can be referenced in user definitions of the link function and response distributions using programming statements and the FWDLINK, INVLINK, DEVIANCE, and VARIANCE statements.

See the DEVIANCE, VARIANCE, FWDLINK, and INVLINK statements for more information.

REPEATED Statement

REPEATED SUBJECT= subject-effect < / options > ;

The REPEATED statement specifies the covariance structure of multivariate responses for GEE model fitting in the GENMOD procedure. In addition, the REPEATED statement controls the iterative fitting algorithm used in GEEs and specifies optional output. Other GENMOD procedure statements, such as the MODEL and CLASS statements, are used in the same way as they are for ordinary generalized linear models to specify the regression model for the mean of the responses.

SUBJECT= subject-effect

identifies subjects in the input data set. The subject-effect can be a single variable, an interaction effect, a nested effect, or a combination. Each distinct value, or level, of the effect identifies a different subject, or cluster. Responses from different subjects are assumed to be statistically independent, and responses within subjects are assumed to be correlated. A subject-effect must be specified, and variables used in defining the subject-effect must be listed in the CLASS statement. The input data set does not need to be sorted by subject. See the SORTED option.
The options control how the model is fit and what output is produced. You can specify the following options after a slash (/).

ALPHAINIT= numbers

specifies initial values for log odds ratio regression parameters if the LOGOR= option is specified for binary data. If this option is not specified, an initial value of 0.01 is used for all the parameters.

CONVERGE= number

specifies the convergence criterion for GEE parameter estimation. If the maximum absolute difference between regression parameter estimates is less than the value of number on two successive iterations, convergence is declared. If the absolute value of a regression parameter estimate is greater than 0.08, then the absolute difference normalized by the regression parameter value is used instead of the absolute difference. The default value of number is 0.0001.

CORRW

displays the estimated working correlation matrix.

CORRB

displays the estimated regression parameter correlation matrix. Both model-based and empirical correlations are displayed.

COVB

displays the estimated regression parameter covariance matrix. Both model-based and empirical covariances are displayed.

ECORRB

displays the estimated regression parameter empirical correlation matrix.

ECOVB

displays the estimated regression parameter empirical covariance matrix.

INTERCEPT= number

specifies either an initial or a fixed value of the intercept regression parameter in the GEE model. If you specify the NOINT option in the MODEL statement, then the intercept is fixed at the value of number .

INITIAL= numbers

specifies initial values of the regression parameters, other than the intercept parameter, for GEE estimation. If this option is not specified, the estimated regression parameters assuming independence for all responses are used for the initial values.

LOGOR= log odds ratio structure keyword

specifies the regression structure of the log odds ratio used to model the association of the responses from subjects for binary data. The response syntax must be of the single variable type, the distribution must be binomial, and the data must be binary. The following table displays the log odds ratio structure keywords and the corresponding log odds ratio regression structures. See the Alternating Logistic Regressions section on page 1676 for definitions of the log odds ratio types and examples of specifying log odds ratio models. You should specify either the LOGOR= or the TYPE= option, but not both.

Table 31.1: Log Odds Ratio Regression Structures
Keyword	Log Odds Ratio Regression Structure
EXCH	exchangeable
FULLCLUST	fully parameterized clusters
LOGORVAR( variable )	indicator variable for specifying block effects
NESTK	k-nested
NEST1	1-nested
ZFULL	fully specified z-matrix specified in ZDATA= data set
ZREP	single cluster specification for replicated z-matrix specified in ZDATA= data set
ZREP(matrix)	single cluster specification for replicated z-matrix

MAXITER= number

MAXIT= number

specifies the maximum number of iterations allowed in the iterative GEE estimation process. The default number is 50.

MCORRB

displays the estimated regression parameter model-based correlation matrix.

MCOVB

displays the estimated regression parameter model-based covariance matrix.

MODELSE

displays an analysis of parameter estimates table using model-based standard errors. By default, an Analysis of Parameter Estimates table based on empirical standard errors is displayed.

RUPDATE= number

specifies the number of iterations between updates of the working correlation matrix. For example, RUPDATE=5 specifies that the working correlation is updated once for every five regression parameter updates. The default value of number is 1; that is, the working correlation is updated every time the regression parameters are updated.

SORTED

specifies that the input data are grouped by subject and sorted within subject. If this option is not specified, then the procedure internally sorts by subject-effect and within subject-effect , if a within subject-effect is specified.

SUBCLUSTER= variable

SUBCLUST= variable

specifies a variable defining subclusters for the 1-nested or k-nested log odds ratio association modeling structures.

TYPE | CORR= correlation-structure keyword

specifies the structure of the working correlation matrix used to model the correlation of the responses from subjects. The following table displays the correlation structure keywords and the corresponding correlation structures. The default working correlation type is independent (CORR=IND). See the Details section on page 1650 for definitions of the correlation matrix types. You should specify LOGOR= or TYPE= but not both.

Table 31.2: Correlation Structure Types
Keyword	Correlation Matrix Type
AR | AR(1)	autoregressive(1)
EXCH | CS	exchangeable
IND	independent
MDEP(number)	m-dependent with m = number
UNSTR | UN	unstructured
USER | FIXED (matrix)	fixed, user-specified correlation matrix

For example, you can specify a fixed 4 × 4 correlation matrix with the option

  TYPE=USER( 1.0  0.9  0.8  0.6
             0.9  1.0  0.9  0.8
             0.8  0.9  1.0  0.9
             0.6  0.8  0.9  1.0 )

V6CORR

specifies that the "Version 6" method of computing the normalized Pearson chi-square be used for working correlation estimation and for the model-based covariance matrix scale factor.

WITHINSUBJECT | WITHIN= within subject-effect

defines an effect specifying the order of measurements within subjects. Each distinct level of the within subject-effect defines a different response from the same subject. If the data are in proper order within each subject, you do not need to specify this option.
If some measurements do not appear in the data for some subjects, this option properly orders the existing measurements and treats the omitted measurements as missing values. If the WITHINSUBJECT= option is not used in this situation, measurements may be improperly ordered and missing values assumed for the last measurements in a cluster.
Variables used in defining the within subject-effect must be listed in the CLASS statement.
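The following sketch puts several REPEATED statement options together for a GEE fit with a binary response. The data set work.visits and the variables patient, visit, outcome, treatment, and age are hypothetical. Both patient and visit appear in the CLASS statement because they define the subject-effect and the within subject-effect.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc genmod data=work.visits;
   class patient visit;
   model outcome = treatment age / dist=binomial link=logit;
   /* Exchangeable working correlation; WITHIN= orders measurements in each subject; */
   /* CORRW prints the working correlation; MODELSE adds model-based standard errors */
   repeated subject=patient / within=visit type=exch corrw modelse;
run;
```

TYPE=EXCH is used rather than LOGOR= because only one of the two should be specified.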

YPAIR= variable-list

specifies the variables in the ZDATA= data set corresponding to pairs of responses for log odds ratio association modeling.

ZDATA= SAS-data-set

specifies a SAS data set containing either the full z-matrix for log odds ratio association modeling or the z-matrix for a single complete cluster to be replicated for all clusters.

ZROW= variable-list

specifies the variables in the ZDATA= data set corresponding to rows of the z-matrix for log odds ratio association modeling.

VARIANCE Statement

VARIANCE variable = expression ;

You can specify a probability distribution other than the built-in distributions by using the VARIANCE and DEVIANCE statements. The variable name variable identifies the variance function to the procedure. The expression is used to define the functional dependence on the mean, and it can be any arithmetic expression supported by the DATA step language. You use the automatic variable _MEAN_ to represent the mean in the expression.

Alternatively, you can define the variance function with programming statements, as detailed in the section Programming Statements on page 1645. This form is convenient for using complex statements such as if-then-else clauses. Derivatives of the variance function for use during optimization are computed automatically. The DEVIANCE statement must also appear when the VARIANCE statement is used to define the variance function.

WEIGHT Statement

WEIGHT | SCWGT variable ;

The WEIGHT statement identifies a variable in the input data set to be used as the exponential family dispersion parameter weight for each observation. The exponential family dispersion parameter is divided by the WEIGHT variable value for each observation. This is true regardless of whether the parameter is estimated by the procedure or specified in the MODEL statement with the SCALE= option. It is also true for distributions such as the Poisson and binomial that are not usually defined to have a dispersion parameter. For these distributions, a WEIGHT variable weights the overdispersion parameter, which has the default value of 1.

The WEIGHT variable does not have to be an integer; if it is less than or equal to 0 or if it is missing, the corresponding observation is not used.

FIRST	designates the first ordered level as reference
LAST	designates the last ordered level as reference