Getting Started | SAS/STAT 9.1 User's Guide, Volume 3

Poisson Regression

You can use the GENMOD procedure to fit a variety of statistical models. A typical use of PROC GENMOD is to perform Poisson regression.

You can use the Poisson distribution to model the distribution of cell counts in a multiway contingency table. Aitkin, Anderson, Francis, and Hinde (1989) have used this method to model insurance claims data. Suppose the following hypothetical insurance claims data are classified by two factors: age group (with two levels) and car type (with three levels).

  data insure;
     input n c car$ age;
     ln = log(n);
     datalines;
  500   42  small  1
  1200  37  medium 1
  100    1  large  1
  400  101  small  2
  500   73  medium 2
  300   14  large  2
  ;
  run;

In the preceding data set, the variable n represents the number of insurance policyholders and the variable c represents the number of insurance claims. The variable car is the type of car involved (classified into three groups) and the variable age is the age group of a policyholder (classified into two groups).
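Before fitting a model, it can help to look at the observed claims per policyholder in each cell. The following is a minimal sketch; the data set name rates and the variable name rate are introduced here for illustration only.

```sas
/* Observed claim rate (claims per policyholder) for each car-by-age cell */
data rates;
   set insure;
   rate = c / n;
run;

proc print data=rates noobs;
   var car age n c rate;
   format rate 6.4;
run;
```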

You can use PROC GENMOD to perform a Poisson regression analysis of these data with a log link function. This type of model is sometimes called a log-linear model.

Assume that the number of claims c has a Poisson probability distribution and that its mean, $\mu_i$, is related to the factors car and age for observation i by

$$\log(\mu_i) = \log(n_i) + \beta_0 + \mathrm{car}_i(1)\,\beta_1 + \mathrm{car}_i(2)\,\beta_2 + \mathrm{car}_i(3)\,\beta_3 + \mathrm{age}_i(1)\,\beta_4 + \mathrm{age}_i(2)\,\beta_5$$

The indicator variables $\mathrm{car}_i(j)$ and $\mathrm{age}_i(j)$ are associated with the jth level of the variables car and age for observation i: each indicator equals 1 if observation i is at level j of the variable and 0 otherwise.

The $\beta$s are unknown parameters to be estimated by the procedure. The logarithm of the variable n is used as an offset, that is, a regression variable with a constant coefficient of 1 for each observation. A log-linear relationship between the mean and the factors car and age is specified by the log link function. The log link function ensures that the mean number of insurance claims for each car and age group predicted from the fitted model is positive.

The following statements invoke the GENMOD procedure to perform this analysis:

  proc genmod data=insure;
     class car age;
     model c = car age / dist   = poisson
                         link   = log
                         offset = ln;
  run;

The variables car and age are specified as CLASS variables so that PROC GENMOD automatically generates the indicator variables associated with car and age.

The MODEL statement specifies c as the response variable and car and age as explanatory variables. An intercept term is included by default. Thus, the model matrix X (the matrix that has as its ith row the transpose of the covariate vector for the ith observation) consists of a column of 1s representing the intercept term and columns of 0s and 1s derived from indicator variables representing the levels of the car and age variables.

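That is, the model matrix is

$$X = \begin{bmatrix}
1 & 0 & 0 & 1 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 & 1
\end{bmatrix}$$

where the rows follow the order of the observations in the data set, the first column corresponds to the intercept, the next three columns correspond to the levels of the variable car in sorted order (large, medium, small), and the last two columns correspond to the levels of the variable age (1, 2).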
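To see what the CLASS statement does for you, the following sketch builds the indicator columns by hand and fits the same model without a CLASS statement. The variable names car_large, car_medium, and age_1 and the data set name insure_ind are hypothetical. The indicators for the last level of each factor (small and age 2) are left out, since those columns are linearly dependent on the intercept and the other indicators.

```sas
/* Hand-coded indicators; small and age=2 act as reference levels */
data insure_ind;
   set insure;
   car_large  = (car = 'large');
   car_medium = (car = 'medium');
   age_1      = (age = 1);
run;

proc genmod data=insure_ind;
   model c = car_large car_medium age_1 / dist=poisson link=log offset=ln;
run;
```

Under these assumptions the estimates for the three indicators should agree with the nonzero car and age rows that the CLASS version of the model reports.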

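The response distribution is specified as Poisson, and the link function is chosen to be log. That is, the Poisson mean parameter $\mu_i$ is related to the linear predictor by

$$\log(\mu_i) = \log(n_i) + \mathbf{x}_i'\boldsymbol{\beta}$$

where $\mathbf{x}_i'$ is the ith row of the model matrix X and $\boldsymbol{\beta}$ is the vector of parameters.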

The logarithm of n is specified as an offset variable, as is common in this type of analysis. In this case, the offset variable serves to normalize the fitted cell means to a per-policyholder basis, because only the total number of claims in each cell, not the claims of individual policyholders, is observed.
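Subtracting the offset from both sides makes the per-policyholder interpretation explicit. The short derivation that follows is a sketch that writes $\eta_i$ (a symbol introduced here for illustration) for the intercept, car, and age terms of the linear predictor, excluding the offset.

```latex
\begin{aligned}
\log(\mu_i) &= \log(n_i) + \eta_i \\
\log\!\left(\frac{\mu_i}{n_i}\right) &= \eta_i \\
\frac{\mu_i}{n_i} &= \exp(\eta_i) > 0
\end{aligned}
```

The regression terms therefore describe the expected number of claims per policyholder, and the fitted mean count $\mu_i = n_i \exp(\eta_i)$ is always positive.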

PROC GENMOD produces the following default output from the preceding statements.

The Model Information table displayed in Figure 31.1 provides information about the specified model and the input data set.

  The GENMOD Procedure

  Model Information

  Data Set              WORK.INSURE
  Distribution              Poisson
  Link Function                 Log
  Dependent Variable              c
  Offset Variable                ln

Figure 31.1: Model Information

Figure 31.2 displays the Class Level Information table, which identifies the levels of the classification variables that are used in the model. Note that car is a character variable, and the values are sorted in alphabetical order. This is the default sort order, but you can select different sort orders with the ORDER= option in the PROC GENMOD statement (see the description of the ORDER= option for details).

  Class Level Information

  Class      Levels    Values
  car             3    large medium small
  age             2    1 2

Figure 31.2: Class Level Information
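As an illustration of the ORDER= option, the following sketch refits the model with ORDER=DATA, which orders the levels of each CLASS variable as they first appear in the input data set (small, medium, large for car). The output of this run is not shown in this section. With this ordering you would expect large, rather than small, to be the last car level listed, so check the Class Level Information and parameter estimates tables to see which level is treated as the reference.

```sas
/* Same model, with class levels ordered by first appearance in the data */
proc genmod data=insure order=data;
   class car age;
   model c = car age / dist=poisson link=log offset=ln;
run;
```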

The Criteria For Assessing Goodness Of Fit table displayed in Figure 31.3 contains statistics that summarize the fit of the specified model. These statistics are helpful in judging the adequacy of a model and in comparing it with other models under consideration. If you compare the deviance of 2.8207 with its asymptotic chi-square distribution with 2 degrees of freedom, you find that the p-value is 0.24. This indicates that the specified model fits the data reasonably well.
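The p-value quoted for the deviance can be checked with the PROBCHI function in a DATA step. This is a small sketch; the variable names are chosen here for illustration.

```sas
/* Upper-tail chi-square probability for the deviance on 2 degrees of freedom */
data _null_;
   deviance = 2.8207;
   df       = 2;
   p_value  = 1 - probchi(deviance, df);
   put p_value= 6.4;
run;
```

The value written to the log rounds to 0.24, the figure quoted for the goodness-of-fit comparison.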

  Criteria For Assessing Goodness Of Fit

  Criterion                 DF           Value        Value/DF
  Deviance                   2          2.8207          1.4103
  Scaled Deviance            2          2.8207          1.4103
  Pearson Chi-Square         2          2.8416          1.4208
  Scaled Pearson X2          2          2.8416          1.4208
  Log Likelihood                      837.4533

Figure 31.3: Goodness Of Fit

Figure 31.4 displays the Analysis Of Parameter Estimates table, which summarizes the results of the iterative parameter estimation process. For each parameter in the model, PROC GENMOD displays columns with the parameter name, the degrees of freedom associated with the parameter, the estimated parameter value, the standard error of the parameter estimate, the confidence limits, and the Wald chi-square statistic and associated p-value for testing the significance of the parameter to the model. If a column of the model matrix corresponding to a parameter is found to be linearly dependent, or aliased, with columns corresponding to parameters preceding it in the model, PROC GENMOD assigns it zero degrees of freedom and displays a value of zero for both the parameter estimate and its standard error.

  Analysis Of Parameter Estimates

                                  Standard    Wald 95% Confidence       Chi-
  Parameter         DF  Estimate     Error          Limits            Square   Pr > ChiSq
  Intercept          1   -1.3168    0.0903    -1.4937    -1.1398      212.73       <.0001
  car      large     1   -1.7643    0.2724    -2.2981    -1.2304       41.96       <.0001
  car      medium    1   -0.6928    0.1282    -0.9441    -0.4414       29.18       <.0001
  car      small     0    0.0000    0.0000     0.0000     0.0000         .            .
  age      1         1   -1.3199    0.1359    -1.5863    -1.0536       94.34       <.0001
  age      2         0    0.0000    0.0000     0.0000     0.0000         .            .
  Scale              0    1.0000    0.0000     1.0000     1.0000

  NOTE: The scale parameter was held fixed.

Figure 31.4: Analysis Of Parameter Estimates

This table includes a row for a scale parameter, even though there is no free scale parameter in the Poisson distribution. See the section "Response Probability Distributions" for the form of the Poisson probability distribution. PROC GENMOD allows the specification of a scale parameter to fit overdispersed Poisson and binomial distributions. In such cases, the SCALE row indicates the value of the overdispersion scale parameter used in adjusting output statistics. See the section "Overdispersion" for more on overdispersion and the meaning of the SCALE parameter output by the GENMOD procedure. PROC GENMOD displays a note indicating that the scale parameter is fixed, that is, not estimated by the iterative fitting process.
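As a sketch of how an overdispersed Poisson fit might be requested for these data, the SCALE=DEVIANCE option in the MODEL statement (DSCALE is an equivalent form) asks PROC GENMOD to estimate the scale from the deviance divided by its degrees of freedom instead of holding it fixed at 1. The output of this run is not shown in this section, and whether such an adjustment is appropriate for a table with only two residual degrees of freedom is a judgment call.

```sas
/* Poisson regression with the scale estimated from the deviance */
proc genmod data=insure;
   class car age;
   model c = car age / dist=poisson link=log offset=ln
                       scale=deviance;
run;
```

With this option the parameter estimates are unchanged, while standard errors and test statistics are adjusted by the estimated scale.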

It is usually of interest to assess the importance of the main effects in the model. Type 1 and Type 3 analyses generate statistical tests for the significance of these effects. You can request these analyses with the TYPE1 and TYPE3 options in the MODEL statement.

  proc genmod data=insure;
     class car age;
     model c = car age / dist   = poisson
                         link   = log
                         offset = ln
                         type1
                         type3;
  run;

The results of these analyses are summarized in the tables that follow.

In the table for Type 1 analysis displayed in Figure 31.5, each entry in the deviance column represents the deviance for the model containing the effect for that row and all effects preceding it in the table. For example, the deviance corresponding to car in the table is the deviance of the model containing an intercept and car. As more terms are included in the model, the deviance decreases.

  The GENMOD Procedure

  LR Statistics For Type 1 Analysis

                                          Chi-
  Source         Deviance        DF     Square    Pr > ChiSq
  Intercept      175.1536
  car            107.4620         2      67.69        <.0001
  age              2.8207         1     104.64        <.0001

Figure 31.5: Type 1 Analysis

Entries in the chi-square column are likelihood ratio statistics for testing the significance of the effect added to the model containing all the preceding effects. The chi-square value of 67.69 for car represents twice the difference in log likelihoods between fitting a model with only an intercept term and a model with an intercept and car. Since the scale parameter is set to 1 in this analysis, this is equal to the difference in deviances. Since two additional parameters are involved, this statistic can be compared with a chi-square distribution with two degrees of freedom. The resulting p-value (labeled Pr > ChiSq) of less than 0.0001 indicates that this variable is highly significant. Similarly, the chi-square value of 104.64 for age represents twice the difference in log likelihoods between the model with the intercept and car and the model with the intercept, car, and age. This effect is also highly significant, as indicated by the small p-value.

  LR Statistics For Type 3 Analysis

                           Chi-
  Source           DF     Square    Pr > ChiSq
  car               2      72.82        <.0001
  age               1     104.64        <.0001
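Because the scale is fixed at 1, the Type 1 chi-square values can be reproduced as differences of the deviances reported in the Type 1 table. The following DATA step is a small sketch of that arithmetic; the variable names are chosen here for illustration.

```sas
/* Type 1 LR statistics as differences of successive deviances */
data _null_;
   dev_intercept = 175.1536;   /* intercept only          */
   dev_car       = 107.4620;   /* intercept + car         */
   dev_car_age   =   2.8207;   /* intercept + car + age   */

   chisq_car = dev_intercept - dev_car;    /* 2 additional parameters */
   chisq_age = dev_car - dev_car_age;      /* 1 additional parameter  */

   p_car = 1 - probchi(chisq_car, 2);
   p_age = 1 - probchi(chisq_age, 1);

   put chisq_car= 8.2 chisq_age= 8.2 p_car= pvalue6.4 p_age= pvalue6.4;
run;
```

The two differences are 67.6916 and 104.6413, which round to the tabled values of 67.69 and 104.64.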

Figure 31.6: Type 3 Analysis

The Type 3 analysis results in the same conclusions as the Type 1 analysis. The Type 3 chi-square value for the car variable, for example, is twice the difference between the log likelihood for the model with the variables Intercept, car, and age included and the log likelihood for the model with the car variable excluded. The hypothesis tested in this case is the significance of the variable car given that the variable age is in the model. In other words, it tests the additional contribution of car in the model.
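One way to see where the Type 3 statistic for car comes from is to fit the reduced model that excludes car and compare deviances. The following is a sketch of that reduced fit; its output is not shown in this section.

```sas
/* Reduced model: intercept and age only (car excluded) */
proc genmod data=insure;
   class age;
   model c = age / dist=poisson link=log offset=ln;
run;
```

With the scale fixed at 1, the deviance of this reduced model minus the full-model deviance of 2.8207 should reproduce the Type 3 chi-square of 72.82 for car, up to rounding.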

The values of the Type 3 likelihood ratio statistics for the car and age variables indicate that both of these factors are highly significant in determining the claims performance of the insurance policyholders.
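To see what the fitted model implies for each cell, you can write the predicted means to a data set with the OUTPUT statement and divide by the number of policyholders. This is a minimal sketch; the names fitted, mu_hat, and rate_hat are hypothetical.

```sas
/* Save predicted mean claim counts, then convert them to rates */
proc genmod data=insure;
   class car age;
   model c = car age / dist=poisson link=log offset=ln;
   output out=fitted pred=mu_hat;
run;

data fitted;
   set fitted;
   rate_hat = mu_hat / n;   /* fitted claims per policyholder */
run;

proc print data=fitted noobs;
   var car age n c mu_hat rate_hat;
run;
```

Comparing c with mu_hat cell by cell gives an informal view of the same fit that the deviance summarizes.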

Generalized Estimating Equations

This section illustrates the use of the REPEATED statement to fit a GEE model, using repeated measures data from the Six Cities study of the health effects of air pollution (Ware et al. 1984). The data analyzed are the 16 selected cases in Lipsitz, Fitzmaurice, et al. (1994). The binary response is the wheezing status of 16 children at ages 9, 10, 11, and 12 years. The mean response is modeled as a logistic regression model using the explanatory variables city of residence, age, and maternal smoking status at the particular age. The binary responses for individual children are assumed to be equally correlated, implying an exchangeable correlation structure.

The data set and SAS statements that fit the model by the GEE method are as follows:

  data six;
     input case city$ @@;
     do i=1 to 4;
        input age smoke wheeze @@;
        output;
     end;
     datalines;
   1 portage   9 0 1  10 0 1  11 0 1  12 0 0
   2 kingston  9 1 1  10 2 1  11 2 0  12 2 0
   3 kingston  9 0 1  10 0 0  11 1 0  12 1 0
   4 portage   9 0 0  10 0 1  11 0 1  12 1 0
   5 kingston  9 0 0  10 1 0  11 1 0  12 1 0
   6 portage   9 0 0  10 1 0  11 1 0  12 1 0
   7 kingston  9 1 0  10 1 0  11 0 0  12 0 0
   8 portage   9 1 0  10 1 0  11 1 0  12 2 0
   9 portage   9 2 1  10 2 0  11 1 0  12 1 0
  10 kingston  9 0 0  10 0 0  11 0 0  12 1 0
  11 kingston  9 1 1  10 0 0  11 0 1  12 0 1
  12 portage   9 1 0  10 0 0  11 0 0  12 0 0
  13 kingston  9 1 0  10 0 1  11 1 1  12 1 1
  14 portage   9 1 0  10 2 0  11 1 0  12 2 1
  15 kingston  9 1 0  10 1 0  11 1 0  12 2 1
  16 portage   9 1 1  10 1 1  11 2 0  12 1 0
  ;
  run;

  proc genmod data=six;
     class case city;
     model wheeze = city age smoke / dist=bin;
     repeated subject=case / type=exch covb corrw;
  run;

The CLASS statement and the MODEL statement specify the model for the mean of the wheeze variable response as a logistic regression with city, age, and smoke as independent variables, just as for an ordinary logistic regression.

The REPEATED statement invokes the GEE method, specifies the correlation structure, and controls the displayed output from the GEE model. The option SUBJECT=CASE specifies that individual subjects are identified in the input data set by the variable case. The SUBJECT= variable case must be listed in the CLASS statement. Measurements on individual subjects at ages 9, 10, 11, and 12 are in the proper order in the data set, so the WITHINSUBJECT= option is not required. The TYPE=EXCH option specifies an exchangeable working correlation structure, the COVB option specifies that the parameter estimate covariance matrix be displayed, and the CORRW option specifies that the final working correlation be displayed.
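If the measurements for a subject were not already in time order, or if some occasions were missing, the WITHINSUBJECT= option could be used to identify the occasion of each record. The following is a hedged sketch of that variation, not part of the analysis shown here. Because the WITHINSUBJECT= effect must be listed in the CLASS statement while age enters the model as a continuous covariate, the sketch copies age into a separate variable, hypothetically named visit, in a hypothetical data set six_w.

```sas
/* Copy age into a class variable used only to order the repeated measures */
data six_w;
   set six;
   visit = age;
run;

proc genmod data=six_w;
   class case city visit;
   model wheeze = city age smoke / dist=bin;
   repeated subject=case / withinsubject=visit type=exch covb corrw;
run;
```

For the data as given, where every child has all four measurements in order, this should lead to the same fit as the statements without WITHINSUBJECT=.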

Initial parameter estimates for iterative fitting of the GEE model are computed as in an ordinary generalized linear model, as described previously. Results of the initial model fit displayed as part of the generated output are not shown here. Statistics for the initial model fit such as parameter estimates, standard errors, deviances, and Pearson chi-squares do not apply to the GEE model, and are only valid for the initial model fit. The following tables display information that applies to the GEE model fit.

Figure 31.7 displays general information about the GEE model fit.

  The GENMOD Procedure

  GEE Model Information

  Correlation Structure               Exchangeable
  Subject Effect                  case (16 levels)
  Number of Clusters                            16
  Correlation Matrix Dimension                   4
  Maximum Cluster Size                           4
  Minimum Cluster Size                           4

Figure 31.7: GEE Model Information

Figure 31.8 displays the parameter estimate covariance matrices specified by the COVB option. Both model-based and empirical covariances are produced.

  Covariance Matrix (Model-Based)

               Prm1         Prm2         Prm4         Prm5
  Prm1      5.74947     -0.22257     -0.53472      0.01655
  Prm2     -0.22257      0.45478    -0.002410      0.01876
  Prm4     -0.53472    -0.002410      0.05300     -0.01658
  Prm5      0.01655      0.01876     -0.01658      0.19104

  Covariance Matrix (Empirical)

               Prm1         Prm2         Prm4         Prm5
  Prm1      9.33994     -0.85104     -0.83253     -0.16534
  Prm2     -0.85104      0.47368      0.05736      0.04023
  Prm4     -0.83253      0.05736      0.07778    -0.002364
  Prm5     -0.16534      0.04023    -0.002364      0.13051

Figure 31.8: GEE Parameter Estimate Covariance Matrices

The exchangeable working correlation matrix specified by the CORRW option is displayed in Figure 31.9.

  Working Correlation Matrix

               Col1         Col2         Col3         Col4
  Row1       1.0000       0.1648       0.1648       0.1648
  Row2       0.1648       1.0000       0.1648       0.1648
  Row3       0.1648       0.1648       1.0000       0.1648
  Row4       0.1648       0.1648       0.1648       1.0000

Figure 31.9: GEE Working Correlation Matrix
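Under the exchangeable structure, every pair of distinct measurements on the same child shares a single correlation parameter. Writing $\alpha$ for that parameter and $Y_{ij}$ for the response of child i at occasion j (notation assumed here for illustration), the working correlation can be sketched as

```latex
\operatorname{Corr}(Y_{ij}, Y_{ik}) =
\begin{cases}
1      & j = k \\
\alpha & j \ne k
\end{cases}
\qquad j, k = 1, \dots, 4
```

In the matrix displayed in Figure 31.9, the estimate of $\alpha$ is 0.1648.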

The parameter estimates table, displayed in Figure 31.10, contains parameter estimates, standard errors, confidence intervals, Z scores, and p-values for the parameter estimates. Empirical standard error estimates are used in this table. A table using model-based standard errors can be created by using the REPEATED statement option MODELSE.

  Analysis Of GEE Parameter Estimates
  Empirical Standard Error Estimates

                                Standard     95% Confidence
  Parameter          Estimate      Error         Limits             Z   Pr > |Z|
  Intercept           -1.2751     3.0561   -7.2650    4.7148    -0.42     0.6765
  city     kingston   -0.1223     0.6882   -1.4713    1.2266    -0.18     0.8589
  city     portage     0.0000     0.0000    0.0000    0.0000      .        .
  age                  0.2036     0.2789   -0.3431    0.7502     0.73     0.4655
  smoke                0.0935     0.3613   -0.6145    0.8016     0.26     0.7957
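As a sketch of the MODELSE variation, the following statements refit the same GEE model and additionally request the parameter estimates table based on model-based standard errors. The output of this run is not shown in this section.

```sas
/* Same GEE model, also requesting model-based standard errors */
proc genmod data=six;
   class case city;
   model wheeze = city age smoke / dist=bin;
   repeated subject=case / type=exch modelse;
run;
```

Comparing the two tables gives a rough indication of how sensitive the inferences are to the assumed working correlation structure.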

Figure 31.10: GEE Parameter Estimates Table
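The Z score, p-value, and confidence limits in each row follow from the estimate and its empirical standard error. The following DATA step is a sketch of that arithmetic for the age row; it assumes a two-sided test based on the standard normal distribution, and the variable names are chosen here for illustration.

```sas
/* Z score, two-sided p-value, and 95% limits for the age parameter */
data _null_;
   estimate = 0.2036;
   se       = 0.2789;

   z     = estimate / se;
   p     = 2 * (1 - probnorm(abs(z)));
   lower = estimate - probit(0.975) * se;
   upper = estimate + probit(0.975) * se;

   put z= 6.2 p= 6.4 lower= 8.4 upper= 8.4;
run;
```

The results should agree with the age row of the table up to rounding of the displayed estimate and standard error.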