Chapter 31: The GENMOD Procedure


Overview

The GENMOD procedure fits generalized linear models, as defined by Nelder and Wedderburn (1972). The class of generalized linear models is an extension of traditional linear models that allows the mean of a population to depend on a linear predictor through a nonlinear link function and allows the response probability distribution to be any member of an exponential family of distributions. Many widely used statistical models are generalized linear models. These include classical linear models with normal errors, logistic and probit models for binary data, and log-linear models for multinomial data. Many other useful statistical models can be formulated as generalized linear models by the selection of an appropriate link function and response probability distribution. Refer to McCullagh and Nelder (1989) for a discussion of statistical modeling using generalized linear models. The books by Aitkin, Anderson, Francis, and Hinde (1989) and Dobson (1990) are also excellent references with many examples of applications of generalized linear models. Firth (1991) provides an overview of generalized linear models.

The analysis of correlated data arising from repeated measurements when the measurements are assumed to be multivariate normal has been studied extensively. However, the normality assumption may not always be reasonable; for example, different methodology must be used in the data analysis when the responses are discrete and correlated. Generalized Estimating Equations (GEEs) provide a practical method with reasonable statistical efficiency to analyze such data.

Liang and Zeger (1986) introduced GEEs as a method of dealing with correlated data when, except for the correlation among responses, the data can be modeled as a generalized linear model. For example, correlated binary and count data in many cases can be modeled in this way.

The GENMOD procedure can fit models to correlated responses by the GEE method. You can use PROC GENMOD to fit models with most of the correlation structures from Liang and Zeger (1986) using GEEs. Refer to Liang and Zeger (1986), Diggle, Liang, and Zeger (1994), and Lipsitz, Fitzmaurice, Orav, and Laird (1994) for more details on GEEs.

Experimental graphics are now available with the GENMOD procedure for model assessment. For more information, see the ODS Graphics section on page 1695.

What Is a Generalized Linear Model?

A traditional linear model is of the form

   y_i = x_i′β + ε_i

where y_i is the response variable for the ith observation. The quantity x_i is a column vector of covariates, or explanatory variables, for observation i that is known from the experimental setting and is considered to be fixed, or nonrandom. The vector of unknown coefficients β is estimated by a least squares fit to the data y. The ε_i are assumed to be independent, normal random variables with zero mean and constant variance. The expected value of y_i, denoted by μ_i, is

   μ_i = x_i′β

While traditional linear models are used extensively in statistical data analysis, there are types of problems for which they are not appropriate.

  • It may not be reasonable to assume that data are normally distributed. For example, the normal distribution (which is continuous) may not be adequate for modeling counts or measured proportions that are considered to be discrete.

  • If the mean of the data is naturally restricted to a range of values, the traditional linear model may not be appropriate, since the linear predictor can take on any value. For example, the mean of a measured proportion is between 0 and 1, but the linear predictor of the mean in a traditional linear model is not restricted to this range.

  • It may not be realistic to assume that the variance of the data is constant for all observations. For example, it is not unusual to observe data where the variance increases with the mean of the data.

A generalized linear model extends the traditional linear model and is, therefore, applicable to a wider range of data analysis problems. A generalized linear model consists of the following components:

  • The linear component is defined just as it is for traditional linear models:

       η_i = x_i′β

  • A monotonic differentiable link function g describes how the expected value of y_i is related to the linear predictor η_i:

       g(μ_i) = η_i

  • The response variables y_i are independent for i = 1, 2, …, and have a probability distribution from an exponential family. This implies that the variance of the response depends on the mean μ through a variance function V:

       var(y_i) = φ V(μ_i) / w_i

    where φ is a constant and w_i is a known weight for each observation. The dispersion parameter φ is either known (for example, for the binomial or Poisson distribution, φ = 1) or it must be estimated.

See the section Response Probability Distributions on page 1650 for the form of a probability distribution from the exponential family of distributions.

As in the case of traditional linear models, fitted generalized linear models can be summarized through statistics such as parameter estimates, their standard errors, and goodness-of-fit statistics. You can also make statistical inference about the parameters using confidence intervals and hypothesis tests. However, specific inference procedures are usually based on asymptotic considerations, since exact distribution theory is not available or is not practical for all generalized linear models.

Examples of Generalized Linear Models

You construct a generalized linear model by deciding on response and explanatory variables for your data and choosing an appropriate link function and response probability distribution. Some examples of generalized linear models follow. Explanatory variables can be any combination of continuous variables, classification variables, and interactions.

Traditional Linear Model

  • response variable: a continuous variable

  • distribution: normal

  • link function: identity, g(μ) = μ
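
For illustration, a model of this kind might be requested in PROC GENMOD as follows; this is only a sketch, and the data set name and the variables weight, height, and sex are hypothetical:

   proc genmod data=mydata;
      class sex;                                    /* classification variable */
      model weight = height sex / dist=normal link=identity;
   run;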

Logistic Regression

  • response variable: a proportion

  • distribution: binomial

  • link function: logit, g(μ) = log( μ / (1 − μ) )
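
A corresponding sketch for a logistic regression, using the events/trials response syntax that PROC GENMOD accepts for binomial proportions (the variables r, n, and dose are hypothetical):

   proc genmod data=mydata;
      /* r successes out of n trials at each value of dose */
      model r/n = dose / dist=binomial link=logit;
   run;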

Poisson Regression in Log Linear Model

  • response variable: a count

  • distribution: Poisson

  • link function: log, g(μ) = log(μ)
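
A sketch of a Poisson log-linear fit (the variables count, treatment, and block are hypothetical):

   proc genmod data=mydata;
      class treatment block;
      model count = treatment block / dist=poisson link=log;
   run;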

Gamma Model with Log Link

  • response variable: a positive, continuous variable

  • distribution: gamma

  • link function: log, g(μ) = log(μ)
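
And a sketch of a gamma model with a log link (the variables cost, age, and group are hypothetical):

   proc genmod data=mydata;
      class group;
      model cost = age group / dist=gamma link=log;
   run;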

The GENMOD Procedure

The GENMOD procedure fits a generalized linear model to the data by maximum likelihood estimation of the parameter vector β. There is, in general, no closed form solution for the maximum likelihood estimates of the parameters. The GENMOD procedure estimates the parameters of the model numerically through an iterative fitting process. The dispersion parameter φ is also estimated by maximum likelihood or, optionally, by the residual deviance or by Pearson's chi-square divided by the degrees of freedom. Covariances, standard errors, and p-values are computed for the estimated parameters based on the asymptotic normality of maximum likelihood estimators.
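
For example, one way to request a moment-based dispersion estimate is with the SCALE= option in the MODEL statement; this is a sketch, and the variables y, x1, and x2 are hypothetical:

   model y = x1 x2 / dist=normal link=identity scale=pearson;

Here SCALE=PEARSON requests the estimate based on Pearson's chi-square divided by its degrees of freedom; SCALE=DEVIANCE requests the analogous deviance-based estimate.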

A number of popular link functions and probability distributions are available in the GENMOD procedure. The built-in link functions are

  • identity: g(μ) = μ

  • logit: g(μ) = log( μ / (1 − μ) )

  • probit: g(μ) = Φ⁻¹(μ), where Φ is the standard normal cumulative distribution function

  • power: g(μ) = μ^λ if λ ≠ 0, g(μ) = log(μ) if λ = 0

  • log: g(μ) = log(μ)

  • complementary log-log: g(μ) = log( −log(1 − μ) )

The available distributions and associated variance functions are

  • normal: V(μ) = 1

  • binomial (proportion): V(μ) = μ(1 − μ)

  • Poisson: V(μ) = μ

  • gamma: V(μ) = μ²

  • inverse Gaussian: V(μ) = μ³

  • negative binomial: V(μ) = μ + kμ²

  • multinomial

The negative binomial is a distribution with an additional parameter k in the variance function. PROC GENMOD estimates k by maximum likelihood, or you can optionally set it to a constant value. Refer to McCullagh and Nelder (1989, Chapter 11), Hilbe (1994), or Lawless (1987) for discussions of the negative binomial distribution.
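
A minimal sketch of a negative binomial fit (the variables count and x are hypothetical); by default the dispersion parameter k is estimated by maximum likelihood:

   proc genmod data=mydata;
      model count = x / dist=negbin link=log;
   run;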

The multinomial distribution is sometimes used to model a response that can take values from a number of categories. The binomial is a special case of the multinomial with two categories. See the section Multinomial Models on page 1671 and refer to McCullagh and Nelder (1989, Chapter 5) for a description of the multinomial distribution.

In addition, you can easily define your own link functions or distributions through DATA step programming statements used within the procedure.
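
As a sketch of the programming-statement approach, the following statements define the logit link explicitly with the FWDLINK and INVLINK statements (equivalent here to specifying LINK=LOGIT); the automatic variables _MEAN_ and _XBETA_ refer to the current mean and linear predictor, and the data set and model variables are hypothetical:

   proc genmod data=mydata;
      fwdlink link  = log(_MEAN_ / (1 - _MEAN_));          /* g(mu)        */
      invlink ilink = exp(_XBETA_) / (1 + exp(_XBETA_));   /* inverse link */
      model y = x1 x2 / dist=binomial;
   run;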

An important aspect of generalized linear modeling is the selection of explanatory variables in the model. Changes in goodness-of-fit statistics are often used to evaluate the contribution of subsets of explanatory variables to a particular model. The deviance, defined to be twice the difference between the maximum attainable log likelihood and the log likelihood of the model under consideration, is often used as a measure of goodness of fit. The maximum attainable log likelihood is achieved with a model that has a parameter for every observation. See the section Goodness of Fit on page 1656 for formulas for the deviance.

One strategy for variable selection is to fit a sequence of models, beginning with a simple model with only an intercept term, and then include one additional explanatory variable in each successive model. You can measure the importance of the additional explanatory variable by the difference in deviances or fitted log likelihoods between successive models. Asymptotic tests computed by the GENMOD procedure enable you to assess the statistical significance of the additional term.

The GENMOD procedure enables you to fit a sequence of models, up through a maximum number of terms specified in a MODEL statement. A table summarizes twice the difference in log likelihoods between each successive pair of models. This is called a Type 1 analysis in the GENMOD procedure, because it is analogous to Type I (sequential) sums of squares in the GLM procedure. As with the PROC GLM Type I sums of squares, the results from this process depend on the order in which the model terms are fit.
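
For example, a Type 1 analysis can be requested with the TYPE1 option in the MODEL statement; terms are added in the order they are listed. This sketch uses hypothetical effects a and b:

   proc genmod data=mydata;
      class a b;
      model y = a b a*b / dist=poisson link=log type1;
   run;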

The GENMOD procedure also generates a Type 3 analysis analogous to Type III sums of squares in the GLM procedure. A Type 3 analysis does not depend on the order in which the terms for the model are specified. A GENMOD procedure Type 3 analysis consists of specifying a model and computing likelihood ratio statistics for Type III contrasts for each term in the model. The contrasts are defined in the same way as they are in the GLM procedure. The GENMOD procedure optionally computes Wald statistics for Type III contrasts. This is computationally less expensive than likelihood ratio statistics, but it is thought to be less accurate because the specified significance level of hypothesis tests based on the Wald statistic may not be as close to the actual significance level as it is for likelihood ratio tests.
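
As a sketch, the TYPE3 option requests the likelihood ratio version of this analysis, and adding the WALD option requests Wald statistics for the Type 3 contrasts instead (data set and effects again hypothetical):

   proc genmod data=mydata;
      class a b;
      model y = a b a*b / dist=binomial link=logit type3 wald;
   run;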

A Type 3 analysis generalizes the use of Type III estimable functions in linear models. Briefly, a Type III estimable function (contrast) for an effect is a linear function of the model parameters that involves the parameters of the effect and any interactions with that effect. A test of the hypothesis that the Type III contrast for a main effect is equal to 0 is intended to test the significance of the main effect in the presence of interactions. See Chapter 32, The GLM Procedure, and Chapter 11, The Four Types of Estimable Functions, for more information about Type III estimable functions. Also refer to Littell, Freund, and Spector (1991).

Additional features of the GENMOD procedure are

  • likelihood ratio statistics for user-defined contrasts, that is, linear functions of the parameters, and p-values based on their asymptotic chi-square distributions

  • estimated values, standard errors, and confidence limits for user-defined contrasts and least-squares means

  • ability to create a SAS data set corresponding to most tables displayed by the procedure (see Table 31.3 on page 1694)

  • confidence intervals for model parameters based on either the profile likelihood function or asymptotic normality

  • syntax similar to that of PROC GLM for the specification of the response and model effects, including interaction terms and automatic coding of classification variables

  • ability to fit GEE models for clustered response data
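
As an example of the last item, a GEE model for clustered binary responses might be specified with the REPEATED statement; this is a sketch in which the data set and the variables id, y, treatment, and time are hypothetical, and TYPE=EXCH requests an exchangeable working correlation structure:

   proc genmod data=long;
      class id treatment;
      model y = treatment time / dist=binomial link=logit;
      repeated subject=id / type=exch corrw;   /* CORRW prints the working correlation matrix */
   run;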



