Analysis of variance, or ANOVA, typically refers to partitioning the variation in a variable's values into variation between and within several groups or classes of observations. The GLM procedure can perform simple or complicated ANOVA for balanced or unbalanced data.
This example discusses a 2×2 ANOVA model. The experimental design is a full factorial, in which each level of one treatment factor occurs at each level of the other treatment factor. The data are shown in the following table and then read into a SAS data set.

            A1        A2
   B1     12, 14    20, 18
   B2     11, 9     17
   title 'Analysis of Unbalanced 2-by-2 Factorial';
   data exp;
      input A $ B $ Y @@;
      datalines;
   A1 B1 12  A1 B1 14     A1 B2 11  A1 B2 9
   A2 B1 20  A2 B1 18     A2 B2 17
   ;
Note that there is only one value for the cell with A='A2' and B='B2'. Since one cell contains a different number of values from the other cells in the table, this is an unbalanced design.
The following PROC GLM invocation produces the analysis.
   proc glm;
      class A B;
      model Y=A B A*B;
   run;
Both treatments are listed in the CLASS statement because they are classification variables. A*B denotes the interaction of the A effect and the B effect. The results are shown in Figure 32.1 and Figure 32.2.
   Analysis of Unbalanced 2-by-2 Factorial

   The GLM Procedure

   Class Level Information

   Class    Levels    Values
   A             2    A1 A2
   B             2    B1 B2

   Number of Observations Read    7
   Number of Observations Used    7
   Analysis of Unbalanced 2-by-2 Factorial

   The GLM Procedure

   Dependent Variable: Y

                                 Sum of
   Source             DF        Squares    Mean Square   F Value   Pr > F
   Model               3    91.71428571    30.57142857     15.29   0.0253
   Error               3     6.00000000     2.00000000
   Corrected Total     6    97.71428571

   R-Square    Coeff Var    Root MSE      Y Mean
   0.938596     9.801480    1.414214    14.42857

   Source    DF      Type I SS    Mean Square   F Value   Pr > F
   A          1    80.04761905    80.04761905     40.02   0.0080
   B          1    11.26666667    11.26666667      5.63   0.0982
   A*B        1     0.40000000     0.40000000      0.20   0.6850

   Source    DF    Type III SS    Mean Square   F Value   Pr > F
   A          1    67.60000000    67.60000000     33.80   0.0101
   B          1    10.00000000    10.00000000      5.00   0.1114
   A*B        1     0.40000000     0.40000000      0.20   0.6850
Figure 32.1 displays information about the classes as well as the number of observations in the data set. Figure 32.2 shows the ANOVA table, simple statistics, and tests of effects.
The degrees of freedom can be used to check your data. The Model degrees of freedom for a 2×2 factorial design with interaction are (ab − 1), where a is the number of levels of A and b is the number of levels of B; in this case, (2×2 − 1) = 3. The Corrected Total degrees of freedom are always one less than the number of observations used in the analysis; in this case, 7 − 1 = 6.
The overall F test is significant (F = 15.29, p = 0.0253), indicating strong evidence that the means for the four different A×B cells are different. You can further analyze this difference by examining the individual tests for each effect.
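The degrees-of-freedom check and the overall F test can be reproduced by hand from the values in Figure 32.2. The following worked arithmetic is only a sketch; the symbol n (the number of observations used) and the abbreviations DF, MS, and SS with subscripts are introduced here for illustration.

```latex
\begin{align*}
\mathrm{DF}_{\text{Model}} &= ab - 1 = 2 \times 2 - 1 = 3 \\
\mathrm{DF}_{\text{Corrected Total}} &= n - 1 = 7 - 1 = 6 \\
\mathrm{DF}_{\text{Error}} &= \mathrm{DF}_{\text{Corrected Total}} - \mathrm{DF}_{\text{Model}} = 6 - 3 = 3 \\
F &= \frac{\mathrm{MS}_{\text{Model}}}{\mathrm{MS}_{\text{Error}}}
   = \frac{30.57142857}{2.00000000} \approx 15.29 \\
R^2 &= \frac{\mathrm{SS}_{\text{Model}}}{\mathrm{SS}_{\text{Corrected Total}}}
   = \frac{91.71428571}{97.71428571} \approx 0.9386
\end{align*}
```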
Four types of estimable functions of parameters are available for testing hypotheses in PROC GLM. For data with no missing cells, the Type III and Type IV estimable functions are the same and test the same hypotheses that would be tested if the data were balanced. Type I and Type III sums of squares are typically not equal when the data are unbalanced; Type III sums of squares are preferred in testing effects in unbalanced cases because they test a function of the underlying parameters that is independent of the number of observations per treatment combination.
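If you want to compare the types of sums of squares side by side, the following sketch requests all four with the SS1, SS2, SS3, and SS4 options in the MODEL statement. It assumes the exp data set created earlier in this example. The second step lists B before A, which is one way to see that the Type I (sequential) sums of squares depend on the order of the effects, whereas the Type III sums of squares do not.

```sas
/* Request Type I, II, III, and IV sums of squares for the */
/* unbalanced 2-by-2 data in the exp data set.             */
proc glm data=exp;
   class A B;
   model Y = A B A*B / ss1 ss2 ss3 ss4;
run;
quit;

/* Same model with the main effects reordered. Only the    */
/* Type I table is expected to change for A and B.         */
proc glm data=exp;
   class A B;
   model Y = B A A*B / ss1 ss3;
run;
quit;
```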
At a significance level of 5% (α = 0.05), the A*B interaction is not significant (F = 0.20, p = 0.6850). This provides no evidence that the effect of A depends on the level of B, or vice versa. Therefore, the tests for the individual effects can be interpreted directly; they show a significant A effect (F = 33.80, p = 0.0101) but no significant B effect (F = 5.00, p = 0.1114).
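As a check, the Type III sums of squares for A*B and A can be recovered by hand from the cell means of the seven observations. The sketch below assumes cell-mean notation, with \bar{y}_{ij} and n_{ij} denoting the mean and count of the cell for the i-th level of A and j-th level of B, and it uses the standard sum of squares for a one-degree-of-freedom contrast with coefficients c_{ij}. None of this notation appears elsewhere in this example.

```latex
\begin{align*}
\bar{y}_{11} &= 13, \quad \bar{y}_{12} = 10, \quad \bar{y}_{21} = 19, \quad \bar{y}_{22} = 17,
\qquad n_{11} = n_{12} = n_{21} = 2, \quad n_{22} = 1 \\[4pt]
L_{AB} &= \bar{y}_{11} - \bar{y}_{12} - \bar{y}_{21} + \bar{y}_{22} = 13 - 10 - 19 + 17 = 1 \\
\mathrm{SS}_{AB} &= \frac{L_{AB}^2}{\sum_{i,j} c_{ij}^2 / n_{ij}}
  = \frac{1^2}{\tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} + 1} = \frac{1}{2.5} = 0.4 \\[4pt]
L_{A} &= \tfrac{1}{2}(\bar{y}_{11} + \bar{y}_{12}) - \tfrac{1}{2}(\bar{y}_{21} + \bar{y}_{22})
  = 11.5 - 18 = -6.5 \\
\mathrm{SS}_{A} &= \frac{L_{A}^2}{\tfrac{1}{4}\left(\tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} + 1\right)}
  = \frac{42.25}{0.625} = 67.6, \qquad
F_{A} = \frac{67.6}{2.0} = 33.80
\end{align*}
```

Both values agree with the Type III table in Figure 32.2. The contrast for A weights the cell means equally, which is why the result does not depend on the cell counts beyond their contribution to the variance of the contrast.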
In polynomial regression, the values of a dependent variable (also called a response variable) are described or predicted in terms of polynomial terms involving one or more independent or explanatory variables. An example of quadratic regression in PROC GLM follows. These data are taken from Draper and Smith (1966, p. 57). Thirteen specimens of 90/10 Cu-Ni alloys are tested in a corrosion-wheel setup in order to examine corrosion. Each specimen has a certain iron content. The wheel is rotated in salt sea water at 30 ft/sec for 60 days. Weight loss is used to quantify the corrosion. The fe variable represents the iron content, and the loss variable denotes the weight loss in milligrams/square decimeter/day in the following DATA step.
   title 'Regression in PROC GLM';
   data iron;
      input fe loss @@;
      datalines;
   0.01 127.6   0.48 124.0   0.71 110.8   0.95 103.9
   1.19 101.5   0.01 130.1   0.48 122.0   1.44  92.3
   0.71 113.1   1.96  83.7   0.01 128.0   1.44  91.4
   1.96  86.2
   ;
The GPLOT procedure is used to request a scatter plot of the response variable versus the independent variable.
   symbol1 c=blue;
   proc gplot;
      plot loss*fe / vm=1;
   run;
The plot in Figure 32.3 displays a strong negative relationship between iron content and weight loss (that is, corrosion decreases as iron content increases), but it is not clear whether there is curvature in this relationship.
The following statements fit a quadratic regression model to the data. This enables you to estimate the linear relationship between iron content and weight loss and to test for the presence of a quadratic component. The intercept is automatically fit unless the NOINT option is specified.
   proc glm;
      model loss=fe fe*fe;
   run;
The CLASS statement is omitted because a regression model is being fitted and fe is a continuous variable rather than a classification variable. Unlike PROC REG, PROC GLM allows polynomial terms in the MODEL statement.
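To illustrate the contrast with PROC REG, the following sketch fits the same quadratic model by first computing the squared term in a DATA step. The data set name iron2 and the variable name fe2 are hypothetical names chosen for this illustration; the iron data set is the one created earlier in this example.

```sas
/* Create the squared term explicitly, because PROC REG    */
/* does not accept fe*fe in its MODEL statement.           */
data iron2;
   set iron;
   fe2 = fe*fe;   /* hypothetical name for the quadratic term */
run;

/* Fit loss = b0 + b1*fe + b2*fe2 with PROC REG. */
proc reg data=iron2;
   model loss = fe fe2;
run;
quit;
```

With PROC GLM the extra DATA step is unnecessary, which is the convenience the preceding paragraph refers to.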
The preliminary information in Figure 32.4 informs you that the GLM procedure has been invoked and states the number of observations in the data set. If the model involves classification variables, they are also listed here, along with their levels.
   Regression in PROC GLM

   The GLM Procedure

   Number of Observations Read    13
   Number of Observations Used    13
Figure 32.5 shows the overall ANOVA table and some simple statistics. The degrees of freedom can be used to check that the model is correct and that the data have been read correctly. The Model degrees of freedom for a regression is the number of parameters in the model minus 1. You are fitting a model with three parameters in this case,

   loss = β0 + β1(fe) + β2(fe)² + error

so the degrees of freedom are 3 − 1 = 2. The Corrected Total degrees of freedom are always one less than the number of observations used in the analysis; in this case, 13 − 1 = 12.

   Regression in PROC GLM

   The GLM Procedure

   Dependent Variable: loss

                                 Sum of
   Source             DF        Squares    Mean Square   F Value   Pr > F
   Model               2    3296.530589    1648.265295    164.68   <.0001
   Error              10     100.086334      10.008633
   Corrected Total    12    3396.616923

   R-Square    Coeff Var    Root MSE    loss Mean
   0.970534     2.907348    3.163642     108.8154
The R-square value indicates that the model accounts for 97% of the variation in loss. The coefficient of variation (Coeff Var), the Root MSE (the square root of the Mean Square for Error), and the mean of the dependent variable are also listed.
The overall F test is significant (F = 164.68, p < 0.0001), indicating that the model as a whole accounts for a significant amount of the variation in loss. Thus, it is appropriate to proceed to testing the effects.
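The simple statistics and the overall F value reported in Figure 32.5 follow directly from the sums of squares and mean squares in the same table. The worked arithmetic below is a sketch; the abbreviations SS, MS, and CV with subscripts are shorthand introduced here for illustration.

```latex
\begin{align*}
R^2 &= \frac{\mathrm{SS}_{\text{Model}}}{\mathrm{SS}_{\text{Corrected Total}}}
     = \frac{3296.530589}{3396.616923} \approx 0.9705 \\
\text{Root MSE} &= \sqrt{\mathrm{MS}_{\text{Error}}} = \sqrt{10.008633} \approx 3.1636 \\
\mathrm{CV} &= 100 \times \frac{\text{Root MSE}}{\text{loss Mean}}
     = 100 \times \frac{3.163642}{108.8154} \approx 2.907 \\
F &= \frac{\mathrm{MS}_{\text{Model}}}{\mathrm{MS}_{\text{Error}}}
     = \frac{1648.265295}{10.008633} \approx 164.68
\end{align*}
```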
Figure 32.6 contains tests of effects and parameter estimates. The latter are displayed by default when the model contains only continuous variables.
   Regression in PROC GLM

   The GLM Procedure

   Dependent Variable: loss

   Source    DF      Type I SS    Mean Square   F Value   Pr > F
   fe         1    3293.766690    3293.766690    329.09   <.0001
   fe*fe      1       2.763899       2.763899      0.28   0.6107

   Source    DF    Type III SS    Mean Square   F Value   Pr > F
   fe         1    356.7572421    356.7572421     35.64   0.0001
   fe*fe      1      2.7638994      2.7638994      0.28   0.6107

                                  Standard
   Parameter       Estimate          Error    t Value    Pr > |t|
   Intercept    130.3199337     1.77096213      73.59      <.0001
   fe           -26.2203900     4.39177557      -5.97      0.0001
   fe*fe          1.1552018     2.19828568       0.53      0.6107
The t tests provided are equivalent to the Type III F tests. The quadratic term is not significant (F = 0.28, p = 0.6107; t = 0.53, p = 0.6107) and thus can be removed from the model; the linear term is significant (F = 35.64, p = 0.0001; t = −5.97, p = 0.0001). This suggests that there is indeed a straight-line relationship between loss and fe.
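The equivalence of the t tests and the Type III F tests can be verified from the parameter estimates in Figure 32.6: each t value is the estimate divided by its standard error, and its square equals the corresponding one-degree-of-freedom F value up to rounding. In the sketch below, the symbols \hat{\beta}_1 and \hat{\beta}_2 for the fe and fe*fe estimates are introduced for illustration, and the fe estimate is taken as negative, consistent with the reported t = −5.97.

```latex
\begin{align*}
t_{\text{fe*fe}} &= \frac{\hat{\beta}_2}{\mathrm{SE}(\hat{\beta}_2)}
  = \frac{1.1552018}{2.19828568} \approx 0.53,
  & t_{\text{fe*fe}}^2 &\approx 0.28 = F_{\text{fe*fe}} \\
t_{\text{fe}} &= \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}
  = \frac{-26.2203900}{4.39177557} \approx -5.97,
  & t_{\text{fe}}^2 &\approx 35.64 = F_{\text{fe}}
\end{align*}
```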
Fitting the model without the quadratic term provides more accurate estimates for β0 and β1. PROC GLM allows only one MODEL statement per invocation of the procedure, so the PROC GLM statement must be issued again. The statements used to fit the linear model are
   proc glm;
      model loss=fe;
   run;
Figure 32.7 displays the output produced by these statements. The linear term is still significant (F = 352.27, p < 0.0001). The estimated model is now

   loss = 129.787 − 24.020 × fe

   Regression in PROC GLM

   The GLM Procedure

   Dependent Variable: loss

                                 Sum of
   Source             DF        Squares    Mean Square   F Value   Pr > F
   Model               1    3293.766690    3293.766690    352.27   <.0001
   Error              11     102.850233       9.350021
   Corrected Total    12    3396.616923

   R-Square    Coeff Var    Root MSE    loss Mean
   0.969720     2.810063    3.057780     108.8154

   Source    DF      Type I SS    Mean Square   F Value   Pr > F
   fe         1    3293.766690    3293.766690    352.27   <.0001

   Source    DF    Type III SS    Mean Square   F Value   Pr > F
   fe         1    3293.766690    3293.766690    352.27   <.0001

                                  Standard
   Parameter       Estimate          Error    t Value    Pr > |t|
   Intercept    129.7865993     1.40273671      92.52      <.0001
   fe           -24.0198934     1.27976715     -18.77      <.0001
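Once the linear model is chosen, you might want the fitted values and residuals for each specimen. The following sketch, which assumes the iron data set created earlier, uses the OUTPUT statement of PROC GLM to save them. The output data set name pred and the variable names predicted and residual are hypothetical names chosen for this illustration.

```sas
/* Refit the linear model and save predicted values and    */
/* residuals to a new data set.                            */
proc glm data=iron;
   model loss = fe;
   output out=pred p=predicted r=residual;
run;
quit;

/* List the observed loss next to the fitted values. */
proc print data=pred;
   var fe loss predicted residual;
run;
```

Plotting the residuals against fe is a common way to confirm that dropping the quadratic term left no visible curvature.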