Getting Started (SAS/STAT 9.1 User's Guide, Volume 3)

PROC GLM for Unbalanced ANOVA

Analysis of variance, or ANOVA, typically refers to partitioning the variation in a variable's values into variation between and within several groups or classes of observations. The GLM procedure can perform simple or complicated ANOVA for balanced or unbalanced data.

This example discusses a 2×2 ANOVA model. The experimental design is a full factorial, in which each level of one treatment factor occurs at each level of the other treatment factor. The data are shown in a table and then read into a SAS data set.

              A1        A2
  B1        12, 14    20, 18
  B2        11, 9     17

  title 'Analysis of Unbalanced 2-by-2 Factorial';
  data exp;
     input A $ B $ Y @@;
     datalines;
  A1 B1 12 A1 B1 14
  A1 B2 11 A1 B2 9
  A2 B1 20 A2 B1 18
  A2 B2 17
  ;

Note that there is only one value for the cell with A = 'A2' and B = 'B2'. Since one cell contains a different number of values from the other cells in the table, this is an unbalanced design.
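One way to confirm the imbalance before fitting a model is to tabulate the cell counts. The following is a minimal sketch, assuming the EXP data set created by the preceding DATA step; the NOROW, NOCOL, and NOPERCENT options only suppress the percentages so that the frequencies stand out.

```sas
/* Cross-tabulate A by B to count the observations in each cell. */
proc freq data=exp;
   tables A*B / norow nocol nopercent;
run;
```

For these data, the table should show two observations in every cell except A2/B2, which has one.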

The following PROC GLM invocation produces the analysis.

  proc glm;
     class A B;
     model Y=A B A*B;
  run;

Both treatments are listed in the CLASS statement because they are classification variables. A*B denotes the interaction of the A effect and the B effect. The results are shown in Figure 32.1 and Figure 32.2.

  Analysis of Unbalanced 2-by-2 Factorial

  The GLM Procedure

  Class Level Information

  Class         Levels    Values
  A                  2    A1 A2
  B                  2    B1 B2

  Number of Observations Read           7
  Number of Observations Used           7

Figure 32.1: Class Level Information

  Analysis of Unbalanced 2-by-2 Factorial

  The GLM Procedure

  Dependent Variable: Y

  Source              DF    Sum of Squares    Mean Square    F Value    Pr > F
  Model                3       91.71428571    30.57142857      15.29    0.0253
  Error                3        6.00000000     2.00000000
  Corrected Total      6       97.71428571

  R-Square     Coeff Var      Root MSE        Y Mean
  0.938596      9.801480      1.414214      14.42857

  Source              DF         Type I SS    Mean Square    F Value    Pr > F
  A                    1       80.04761905    80.04761905      40.02    0.0080
  B                    1       11.26666667    11.26666667       5.63    0.0982
  A*B                  1        0.40000000     0.40000000       0.20    0.6850

  Source              DF       Type III SS    Mean Square    F Value    Pr > F
  A                    1       67.60000000    67.60000000      33.80    0.0101
  B                    1       10.00000000    10.00000000       5.00    0.1114
  A*B                  1        0.40000000     0.40000000       0.20    0.6850

Figure 32.2: ANOVA Table and Tests of Effects

Figure 32.1 displays information about the classes as well as the number of observations in the data set. Figure 32.2 shows the ANOVA table, simple statistics, and tests of effects.

The degrees of freedom may be used to check your data. The Model degrees of freedom for a 2×2 factorial design with interaction are (ab − 1), where a is the number of levels of A and b is the number of levels of B; in this case, (2 × 2 − 1) = 3. The Corrected Total degrees of freedom are always one less than the number of observations used in the analysis; in this case, 7 − 1 = 6.
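As a worked check, the degrees of freedom in Figure 32.2 can be partitioned as follows. This is a sketch that assumes every cell contains at least one observation, as is the case here; the symbol N (the number of observations used) is introduced only for this illustration.

```latex
\begin{align*}
\text{Model df} &= ab - 1 = (a-1) + (b-1) + (a-1)(b-1) = 1 + 1 + 1 = 3,\\
\text{Error df} &= N - ab = 7 - 4 = 3,\\
\text{Corrected Total df} &= N - 1 = 6 = 3 + 3 .
\end{align*}
```

The three single degrees of freedom in the first line correspond to the A, B, and A*B rows of the effect tests.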

The overall F test is significant (F = 15.29, p = 0.0253), indicating strong evidence that the means for the four different A×B cells are different. You can further analyze this difference by examining the individual tests for each effect.
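The overall F statistic can be reproduced from the sums of squares and degrees of freedom reported in Figure 32.2. The following is only an arithmetic check of the printed values, not an additional analysis.

```latex
\[
F = \frac{\mathrm{SS}_{\text{Model}} / \mathrm{df}_{\text{Model}}}
         {\mathrm{SS}_{\text{Error}} / \mathrm{df}_{\text{Error}}}
  = \frac{91.71428571 / 3}{6.00000000 / 3}
  = \frac{30.57142857}{2.00000000}
  \approx 15.29
\]
```

The p-value is then taken from an F distribution with 3 numerator and 3 denominator degrees of freedom.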

Four types of estimable functions of parameters are available for testing hypotheses in PROC GLM. For data with no missing cells, the Type III and Type IV estimable functions are the same and test the same hypotheses that would be tested if the data were balanced. Type I and Type III sums of squares are typically not equal when the data are unbalanced; Type III sums of squares are preferred in testing effects in unbalanced cases because they test a function of the underlying parameters that is independent of the number of observations per treatment combination.
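To see the difference concretely for effect A, both sums of squares can be reproduced by hand from the cell means. This is a sketch under the assumption that, for a two-way layout with interaction and no empty cells, the Type III test of A compares unweighted averages of the cell means, whereas the Type I test of A (entered first) compares the observed, sample-size-weighted A means. The notation (cell means, cell counts, contrast L) is introduced here for illustration.

```latex
\begin{align*}
&\bar{y}_{11} = 13,\quad \bar{y}_{12} = 10,\quad \bar{y}_{21} = 19,\quad \bar{y}_{22} = 17,
 \qquad n_{11} = n_{12} = n_{21} = 2,\quad n_{22} = 1 \\[4pt]
\text{Type III:}\quad
 &\hat{L} = \tfrac12(\bar{y}_{11} + \bar{y}_{12}) - \tfrac12(\bar{y}_{21} + \bar{y}_{22})
          = 11.5 - 18 = -6.5 \\
 &\mathrm{SS}_A = \frac{\hat{L}^2}{\sum_{ij} c_{ij}^2 / n_{ij}}
          = \frac{42.25}{\tfrac14\left(\tfrac12 + \tfrac12 + \tfrac12 + 1\right)}
          = \frac{42.25}{0.625} = 67.6 \\[4pt]
\text{Type I:}\quad
 &\mathrm{SS}_A = \sum_i n_{i\cdot}\,(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2
          = 4\,(11.5 - 14.42857)^2 + 3\,(18.33333 - 14.42857)^2 \approx 80.0476
\end{align*}
```

Both results agree with Figure 32.2. The Type III value does not change if the weights attached to the cell means change, which is the sense in which it is independent of the number of observations per treatment combination.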

At a significance level of 5% (α = 0.05), the A*B interaction is not significant (F = 0.20, p = 0.6850). This indicates that the effect of A does not depend on the level of B and vice versa. Therefore, the tests for the individual effects are valid, showing a significant A effect (F = 33.80, p = 0.0101) but no significant B effect (F = 5.00, p = 0.1114).
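If you want to see the means that the main-effect tests compare, one option is to request least-squares means. The following is a minimal sketch, assuming the EXP data set from this example; the SS3, STDERR, and PDIFF options are optional and are chosen here only for illustration.

```sas
/* Sketch: refit the same model and request least-squares means. */
proc glm data=exp;
   class A B;
   model Y = A B A*B / ss3;      /* SS3 limits the effect tests to Type III */
   lsmeans A B / stderr pdiff;   /* LS-means, standard errors, pairwise p-values */
run;
quit;
```

For these data, the least-squares means for A are the unweighted averages of the cell means, 11.5 for A1 and 18.0 for A2.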

PROC GLM for Quadratic Least Squares Regression

In polynomial regression, the values of a dependent variable (also called a response variable) are described or predicted in terms of polynomial terms involving one or more independent or explanatory variables. An example of quadratic regression in PROC GLM follows. These data are taken from Draper and Smith (1966, p. 57). Thirteen specimens of 90/10 Cu-Ni alloys are tested in a corrosion-wheel setup in order to examine corrosion. Each specimen has a certain iron content. The wheel is rotated in salt sea water at 30 ft/sec for 60 days. Weight loss is used to quantify the corrosion. The fe variable represents the iron content, and the loss variable denotes the weight loss in milligrams/square decimeter/day in the following DATA step.

  title 'Regression in PROC GLM';
  data iron;
     input fe loss @@;
     datalines;
  0.01 127.6   0.48 124.0   0.71 110.8   0.95 103.9
  1.19 101.5   0.01 130.1   0.48 122.0   1.44  92.3
  0.71 113.1   1.96  83.7   0.01 128.0   1.44  91.4
  1.96  86.2
  ;

The GPLOT procedure is used to request a scatter plot of the response variable versus the independent variable.

  symbol1 c=blue;
  proc gplot;
     plot loss*fe / vm=1;
  run;

The plot in Figure 32.3 displays a strong negative relationship between iron content and weight loss (that is, corrosion decreases as iron content increases), but it is not clear whether there is curvature in this relationship.

Figure 32.3: Plot of LOSS vs. FE

The following statements fit a quadratic regression model to the data. This enables you to estimate the linear relationship between iron content and corrosion resistance and test for the presence of a quadratic component. The intercept is automatically fit unless the NOINT option is specified.

  proc glm;
     model loss=fe fe*fe;
  run;

The CLASS statement is omitted because a regression line is being fitted. Unlike PROC REG, PROC GLM allows polynomial terms in the MODEL statement.
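For comparison, fitting the same quadratic model with PROC REG requires the squared term to exist as a variable in the data set. The following is a sketch only; the data set name IRON2 and the variable name FE2 are hypothetical names chosen for this illustration.

```sas
/* Sketch: create the squared term in a DATA step, then fit with PROC REG. */
data iron2;              /* iron2 and fe2 are hypothetical names */
   set iron;
   fe2 = fe*fe;          /* quadratic term computed explicitly */
run;

proc reg data=iron2;
   model loss = fe fe2;
run;
quit;
```

In PROC GLM, the fe*fe term in the MODEL statement takes the place of this extra DATA step.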

The preliminary information in Figure 32.4 informs you that the GLM procedure has been invoked and states the number of observations in the data set. If the model involves classification variables, they are also listed here, along with their levels.

  Regression in PROC GLM

  The GLM Procedure

  Number of Observations Read          13
  Number of Observations Used          13

Figure 32.4: Class Level Information

Figure 32.5 shows the overall ANOVA table and some simple statistics. The degrees of freedom can be used to check that the model is correct and that the data have been read correctly. The Model degrees of freedom for a regression is the number of parameters in the model minus 1. You are fitting a model with three parameters in this case,

  loss = β₀ + β₁(fe) + β₂(fe)² + error

so the degrees of freedom are 3 − 1 = 2. The Corrected Total degrees of freedom are always one less than the number of observations used in the analysis; in this case, 13 − 1 = 12.

  Regression in PROC GLM

  The GLM Procedure

  Dependent Variable: loss

  Source              DF    Sum of Squares    Mean Square    F Value    Pr > F
  Model                2       3296.530589    1648.265295     164.68    <.0001
  Error               10        100.086334      10.008633
  Corrected Total     12       3396.616923

  R-Square     Coeff Var      Root MSE     loss Mean
  0.970534      2.907348      3.163642      108.8154

Figure 32.5: ANOVA Table

The R² value indicates that the model accounts for 97% of the variation in LOSS. The coefficient of variation (C.V., labeled Coeff Var in the output), Root MSE (the square root of the Mean Square for Error), and mean of the dependent variable are also listed.
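These summary statistics can be recovered from the ANOVA table in Figure 32.5. The following arithmetic is only a check of the printed values, using the standard definitions of each quantity.

```latex
\begin{align*}
R^2 &= \frac{\mathrm{SS}_{\text{Model}}}{\mathrm{SS}_{\text{Corrected Total}}}
     = \frac{3296.530589}{3396.616923} \approx 0.9705 \\
\text{Root MSE} &= \sqrt{\mathrm{MS}_{\text{Error}}} = \sqrt{10.008633} \approx 3.1636 \\
\text{Coeff Var} &= 100 \times \frac{\text{Root MSE}}{\overline{\text{loss}}}
     = 100 \times \frac{3.163642}{108.8154} \approx 2.907
\end{align*}
```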

The overall F test is significant (F = 164.68, p < 0.0001), indicating that the model as a whole accounts for a significant amount of the variation in LOSS. Thus, it is appropriate to proceed to testing the effects.

Figure 32.6 contains tests of effects and parameter estimates. The latter are displayed by default when the model contains only continuous variables.

  Regression in PROC GLM

  The GLM Procedure

  Dependent Variable: loss

  Source              DF         Type I SS    Mean Square    F Value    Pr > F
  fe                   1       3293.766690    3293.766690     329.09    <.0001
  fe*fe                1          2.763899       2.763899       0.28    0.6107

  Source              DF       Type III SS    Mean Square    F Value    Pr > F
  fe                   1       356.7572421    356.7572421      35.64    0.0001
  fe*fe                1         2.7638994      2.7638994       0.28    0.6107

  Parameter         Estimate    Standard Error    t Value    Pr > |t|
  Intercept      130.3199337        1.77096213      73.59      <.0001
  fe             -26.2203900        4.39177557      -5.97      0.0001
  fe*fe            1.1552018        2.19828568       0.53      0.6107

Figure 32.6: Tests of Effects and Parameter Estimates

The t tests provided are equivalent to the Type III F tests. The quadratic term is not significant (F = 0.28, p = 0.6107; t = 0.53, p = 0.6107) and thus can be removed from the model; the linear term is significant (F = 35.64, p = 0.0001; t = −5.97, p = 0.0001). This suggests that there is indeed a straight line relationship between loss and fe.
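The equivalence can be checked numerically: for an effect with one degree of freedom, the Type III F statistic is the square of the corresponding t statistic, so the sign of t does not matter. The following uses the rounded values quoted in the text, so agreement is only to the printed precision.

```latex
\begin{align*}
\text{fe:}\quad     & t^2 = (-5.97)^2 = 35.6409 \approx 35.64 = F \\
\text{fe*fe:}\quad  & t^2 = (0.53)^2 = 0.2809 \approx 0.28 = F
\end{align*}
```

Because both tests use the same error degrees of freedom (10), the p-values are identical as well.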

Fitting the model without the quadratic term provides more accurate estimates for β₀ and β₁. PROC GLM allows only one MODEL statement per invocation of the procedure, so the PROC GLM statement must be issued again. The statements used to fit the linear model are

  proc glm;
     model loss=fe;
  run;

Figure 32.7 displays the output produced by these statements. The linear term is still significant (F = 352.27, p < 0.0001). The estimated model is now

  Regression in PROC GLM

  The GLM Procedure

  Dependent Variable: loss

  Source              DF    Sum of Squares    Mean Square    F Value    Pr > F
  Model                1       3293.766690    3293.766690     352.27    <.0001
  Error               11        102.850233       9.350021
  Corrected Total     12       3396.616923

  R-Square     Coeff Var      Root MSE     loss Mean
  0.969720      2.810063      3.057780      108.8154

  Source              DF         Type I SS    Mean Square    F Value    Pr > F
  fe                   1       3293.766690    3293.766690     352.27    <.0001

  Source              DF       Type III SS    Mean Square    F Value    Pr > F
  fe                   1       3293.766690    3293.766690     352.27    <.0001

  Parameter         Estimate    Standard Error    t Value    Pr > |t|
  Intercept      129.7865993        1.40273671      92.52      <.0001
  fe             -24.0198934        1.27976715     -18.77      <.0001

Figure 32.7: Linear Model Output
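If you also want the fitted values and residuals from the linear model, one possibility is to add an OUTPUT statement. The following is a minimal sketch, assuming the IRON data set from this example; the output data set name IRONFIT and the variable names PRED and RESID are hypothetical names chosen for this illustration.

```sas
/* Sketch: save predicted values and residuals from the linear fit. */
proc glm data=iron;
   model loss = fe;
   output out=ironfit predicted=pred residual=resid;  /* hypothetical names */
run;
quit;

/* List the observed and fitted values side by side. */
proc print data=ironfit;
   var fe loss pred resid;
run;
```

Plotting RESID against FE is one informal way to look for any curvature that the straight-line model leaves unexplained.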