In SAS analysis-of-variance procedures, the variables that identify levels of the classifications are called classification variables, and they are declared in the CLASS statement. Classification variables are also called categorical, qualitative, discrete, or nominal variables. The values of a class variable are called levels. Class variables can be either numeric or character. This is in contrast to the response (or dependent) variables, which are continuous. Response variables must be numeric.
The analysis-of-variance model specifies effects, which are combinations of classification variables used to explain the variability of the dependent variables in the following manner:
Main effects are specified by writing the class variables (those declared in the CLASS statement) by themselves in the MODEL statement: A B C. Main effects used as independent variables test the hypothesis that the mean of the dependent variable is the same for each level of the factor in question, ignoring the other independent variables in the model.
Crossed effects (interactions) are specified by joining the class variables with asterisks in the MODEL statement: A*B A*C A*B*C. Interaction terms in a model test the hypothesis that the effect of a factor does not depend on the levels of the other factors in the interaction.
Nested effects are specified by following a main effect or crossed effect with a class variable or list of class variables enclosed in parentheses in the MODEL statement. The main effect or crossed effect is nested within the effects listed in parentheses: B(A) C*D(A B). Nested effects test hypotheses similar to interactions, but the levels of the nested variables are not the same for every combination within which they are nested.
The general form of an effect can be illustrated using the class variables A, B, C, D, E, and F:

    A*B*C(D E F)

The crossed list should come first, followed by the nested list in parentheses. Note that no asterisks appear within the nested list or immediately before the left parenthesis.
For a three-factor main effects model with A, B, and C as the factors and Y as the dependent variable, the necessary statements are

    proc anova;
       class A B C;
       model Y=A B C;
    run;
To specify interactions in a factorial model, join effects with asterisks as described previously. For example, these statements specify a complete factorial model, which includes all the interactions:

    proc anova;
       class A B C;
       model Y=A B C A*B A*C B*C A*B*C;
    run;
You can shorten the specifications of a full factorial model by using bar notation. For example, the preceding statements can also be written

    proc anova;
       class A B C;
       model Y=A|B|C;
    run;
When the bar (|) is used, the expression on the right side of the equal sign is expanded from left to right using the equivalents of rules 2-4 given in Searle (1971, p. 390). The variables on the right- and left-hand sides of the bar become effects, and the cross of them becomes an effect. Multiple bars are permitted. For instance, A | B | C is evaluated as follows:

    A | B | C  ->  { A | B } | C
               ->  { A B A*B } | C
               ->  A B A*B C A*C B*C A*B*C
You can also specify the maximum number of variables involved in any effect that results from bar evaluation by specifying that maximum number, preceded by an @ sign, at the end of the bar effect. For example, the specification A | B | C @2 results in only those effects that contain two or fewer variables; in this case, A B A*B C A*C and B*C.
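As a minimal sketch of the bar and @ operators in context, the following step fits all main effects and two-factor interactions for three factors. The data set name work.demo, the seed, and the simulated response values are hypothetical and serve only to make the step runnable.

```sas
/* Hypothetical balanced 2x2x2 layout with 3 replicates per cell */
data work.demo;
   do A = 1 to 2;
      do B = 1 to 2;
         do C = 1 to 2;
            do Rep = 1 to 3;
               Y = 10 + A + 2*B + rannor(12345);  /* simulated response */
               output;
            end;
         end;
      end;
   end;
run;

proc anova data=work.demo;
   class A B C;
   model Y = A|B|C@2;   /* same effects as: A B A*B C A*C B*C */
run;
quit;
```

Because the three-factor interaction A*B*C is excluded by @2, its sum of squares is not listed as a separate effect.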
The following table gives more examples of using the bar and at operators.

    A | C(B)          is equivalent to   A C(B) A*C(B)
    A(B) | C(B)       is equivalent to   A(B) C(B) A*C(B)
    A(B) | B(D E)     is equivalent to   A(B) B(D E)
    A | B(A) | C      is equivalent to   A B(A) C A*C B*C(A)
    A | B(A) | C @2   is equivalent to   A B(A) C A*C
    A | B | C | D @2  is equivalent to   A B A*B C A*C B*C D A*D B*D C*D

Consult the 'Specification of Effects' section on page 1784 in Chapter 32, 'The GLM Procedure,' for further details on bar notation.
Write the effect that is nested within another effect first, followed by the other effect in parentheses. For example, if A and B are main effects and C is nested within A and B (that is, the levels of C that are observed are not the same for each combination of A and B), the statements for PROC ANOVA are

    proc anova;
       class A B C;
       model y=A B C(A B);
    run;
The identity of a level is viewed within the context of the level of the containing effects. For example, if City is nested within State, then the identity of City is viewed within the context of State.
The distinguishing feature of a nested specification is that nested effects never appear as main effects. Another way of viewing nested effects is that they are effects that pool the main effect with the interaction of the nesting variable. See the 'Automatic Pooling' section, which follows.
Asterisks and parentheses can be combined in the MODEL statement for models involving nested and crossed effects:

    proc anova;
       class A B C;
       model Y=A B(A) C(A) B*C(A);
    run;
In line with the general philosophy of the GLM procedure, there is no difference between the statements
model Y=A B(A);
and
model Y=A A*B;
The effect B becomes a nested effect by virtue of the fact that it does not occur as a main effect. If B is not written as a main effect in addition to participating in A*B, then the sum of squares that is associated with B is pooled into A*B.
This feature allows the automatic pooling of sums of squares. If an effect is omitted from the model, it is automatically pooled with all the higher-level effects containing the class variables in the omitted effect (or within-error). This feature is most useful in split-plot designs.
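The split-plot case mentioned above can be sketched as follows. This is an illustrative layout, not taken from this chapter: the data set work.split, the factor names Block, A, and B, and the simulated values are all assumptions. The effects Block*B and Block*A*B are deliberately left out of the MODEL statement, so their sums of squares are pooled into the error term.

```sas
/* Hypothetical split-plot data: A is the whole-plot factor, B the subplot factor */
data work.split;
   do Block = 1 to 3;
      do A = 1 to 2;
         do B = 1 to 2;
            Y = 20 + 2*A + B + rannor(2718);  /* simulated response */
            output;
         end;
      end;
   end;
run;

proc anova data=work.split;
   class Block A B;
   model Y = Block A Block*A B A*B;  /* omitted Block*B, Block*A*B pool into error */
   test h=A e=Block*A;               /* test A against the whole-plot error */
run;
quit;
```

The TEST statement is used here because the whole-plot factor A is usually tested against Block*A rather than against the pooled residual error.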
PROC ANOVA can be used interactively. After you specify a model in a MODEL statement and run PROC ANOVA with a RUN statement, a variety of statements (such as MEANS, MANOVA, TEST, and REPEATED) can be executed without PROC ANOVA recalculating the model sum of squares.
The 'Syntax' section (page 432) describes which statements can be used interactively. You can execute these interactive statements individually or in groups by following the single statement or group of statements with a RUN statement. Note that the MODEL statement cannot be repeated; the ANOVA procedure allows only one MODEL statement.
If you use PROC ANOVA interactively, you can end the procedure with a DATA step, another PROC step, an ENDSAS statement, or a QUIT statement. The syntax of the QUIT statement is
quit;
When you use PROC ANOVA interactively, additional RUN statements do not end the procedure but tell PROC ANOVA to execute additional statements.
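A short sketch of an interactive session follows. The data set work.oneway and its contents are hypothetical; the point is only the order of the RUN and QUIT statements.

```sas
/* Hypothetical one-way data: 3 groups, 5 observations each */
data work.oneway;
   do Group = 1 to 3;
      do Rep = 1 to 5;
         Y = 5 + Group + rannor(4321);  /* simulated response */
         output;
      end;
   end;
run;

proc anova data=work.oneway;
   class Group;
   model Y = Group;
run;                       /* fits the model; PROC ANOVA remains active */

   means Group / tukey;    /* executed without refitting the model */
run;

quit;                      /* ends the procedure */
```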
When a WHERE statement is used with PROC ANOVA, it should appear before the first RUN statement. The WHERE statement enables you to select only certain observations for analysis without using a subsetting DATA step. For example, the statement where group ne 5 omits observations with GROUP=5 from the analysis. Refer to SAS Language Reference: Dictionary for details on this statement.
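As an illustration of the placement described above, the following step excludes one group before the model is fit. The data set work.groups and the variable names are assumptions made for this sketch.

```sas
/* Hypothetical data: 5 groups, 4 observations each */
data work.groups;
   do Group = 1 to 5;
      do Rep = 1 to 4;
         Y = 3 + 0.5*Group + rannor(97);  /* simulated response */
         output;
      end;
   end;
run;

proc anova data=work.groups;
   where Group ne 5;      /* placed before the first RUN statement */
   class Group;
   model Y = Group;
run;
quit;
```

With the WHERE statement in place, only the observations for groups 1 through 4 enter the analysis.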
When a BY statement is used with PROC ANOVA, interactive processing is not possible; that is, once the first RUN statement is encountered, processing proceeds for each BY group in the data set, and no further statements are accepted by the procedure.
Interactivity is also disabled when there are different patterns of missing values among the dependent variables. For details, see the section 'Missing Values,' which follows.
For an analysis involving one dependent variable, PROC ANOVA uses an observation if values are nonmissing for that dependent variable and for all the variables used in independent effects.
For an analysis involving multiple dependent variables without the MANOVA or REPEATED statement, or without the MANOVA option in the PROC ANOVA statement, a missing value in one dependent variable does not eliminate the observation from the analysis of other nonmissing dependent variables. For an analysis with the MANOVA or REPEATED statement, or with the MANOVA option in the PROC ANOVA statement, the ANOVA procedure requires values for all dependent variables to be nonmissing for an observation before the observation can be used in the analysis.
During processing, PROC ANOVA groups the dependent variables by their pattern of missing values across observations so that sums and cross products can be collected in the most efficient manner.
If your data have different patterns of missing values among the dependent variables, interactivity is disabled. This could occur when some of the variables in your data set have missing values and
you do not use the MANOVA option in the PROC ANOVA statement
you do not use a MANOVA or REPEATED statement before the first RUN statement
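The two treatments of missing values can be compared with a small sketch. The data set work.twodep, the variables Y1 and Y2, and the position of the missing value are hypothetical.

```sas
/* Hypothetical data: two dependent variables, one missing value in Y2 */
data work.twodep;
   do Group = 1 to 3;
      do Rep = 1 to 4;
         Y1 = 5 + Group + rannor(11);
         Y2 = 8 - Group + rannor(11);
         if Group = 2 and Rep = 1 then Y2 = .;  /* missing in Y2 only */
         output;
      end;
   end;
run;

/* No MANOVA option: Y1 is analyzed with 12 observations, Y2 with 11 */
proc anova data=work.twodep;
   class Group;
   model Y1 Y2 = Group;
run;
quit;

/* MANOVA option: only the 11 complete observations are used for both */
proc anova data=work.twodep manova;
   class Group;
   model Y1 Y2 = Group;
run;
quit;
```

In the first step the dependent variables have different patterns of missing values, which is the situation in which interactivity is disabled.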
The OUTSTAT= option in the PROC ANOVA statement produces an output data set that contains the following:
the BY variables, if any
_TYPE_, a new character variable. This variable has the value 'ANOVA' for observations corresponding to sums of squares; it has the value 'CANCORR', 'STRUCTUR', or 'SCORE' if a canonical analysis is performed through the MANOVA statement and no M= matrix is specified.
_SOURCE_, a new character variable. For each observation in the data set, _SOURCE_ contains the name of the model effect from which the corresponding statistics are generated.
_NAME_, a new character variable. The variable _NAME_ contains the name of one of the dependent variables in the model or, in the case of canonical statistics, the name of one of the canonical variables (CAN1, CAN2, and so on).
four new numeric variables, SS, DF, F, and PROB, containing sums of squares, degrees of freedom, F values, and probabilities, respectively, for each model or contrast sum of squares generated in the analysis. For observations resulting from canonical analyses, these variables have missing values.
if there is more than one dependent variable, then variables with the same names as the dependent variables represent
for _TYPE_ = 'ANOVA', the crossproducts of the hypothesis matrices
for _TYPE_ = 'CANCORR', canonical correlations for each variable
for _TYPE_ = 'STRUCTUR', coefficients of the total structure matrix
for _TYPE_ = 'SCORE', raw canonical score coefficients
The output data set can be used to perform special hypothesis tests (for example, with the IML procedure in SAS/IML software), to reformat output, to produce canonical variates (through the SCORE procedure), or to rotate structure matrices (through the FACTOR procedure).
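A minimal sketch of creating and inspecting the OUTSTAT= data set follows. The data set names work.twoway and work.stats and the simulated values are hypothetical; the variables listed in the VAR statement are the ones described above.

```sas
/* Hypothetical balanced two-factor data: 2x3 layout, 4 replicates per cell */
data work.twoway;
   do A = 1 to 2;
      do B = 1 to 3;
         do Rep = 1 to 4;
            Y = 10 + A + B + rannor(555);  /* simulated response */
            output;
         end;
      end;
   end;
run;

proc anova data=work.twoway outstat=work.stats;
   class A B;
   model Y = A B A*B;
run;
quit;

/* One observation per source of variation for the dependent variable Y */
proc print data=work.stats;
   var _NAME_ _SOURCE_ _TYPE_ DF SS F PROB;
run;
```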
Let $X$ represent the $n \times p$ design matrix. The columns of $X$ contain only 0s and 1s. Let $Y$ represent the $n \times 1$ vector of dependent variables.
In the GLM procedure, $X'X$, $X'Y$, and $Y'Y$ are formed in main storage. However, in the ANOVA procedure, only the diagonals of $X'X$ are computed, along with $X'Y$ and $Y'Y$. Thus, PROC ANOVA saves a considerable amount of storage as well as time. The memory requirements for PROC ANOVA are asymptotically linear functions of $n^2$ and $nr$, where $n$ is the number of dependent variables and $r$ the number of independent parameters.
The elements of $X'Y$ are cell totals, and the diagonal elements of $X'X$ are cell frequencies. Since PROC ANOVA automatically pools omitted effects into the next higher-level effect containing the names of the omitted effect (or within-error), a slight modification to the rules given by Searle (1971, p. 389) is used.
PROC ANOVA computes the sum of squares for each effect as if it is a main effect. In other words, for each effect, PROC ANOVA squares each cell total and divides by its cell frequency. The procedure then adds these quantities together and subtracts the correction factor for the mean (total squared over N).
For each effect involving two class names, PROC ANOVA subtracts the SS for any main effect with a name that is contained in the two-factor effect.
For each effect involving three class names, PROC ANOVA subtracts the SS for all main effects and two-factor effects with names that are contained in the three-factor effect. If effects involving four or more class names are present, the procedure continues this process.
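The three steps above can be written compactly as follows. The notation is introduced here only for illustration and is not taken from the procedure's documentation: $U(E)$ denotes the quantity computed as if effect $E$ were a main effect, $T_c$ and $n_c$ are the total and frequency of cell $c$, $T$ is the grand total, and $N$ is the number of observations.

```latex
% U(E): sum of squares computed as if E were a main effect (assumed notation)
\begin{align*}
U(E)          &= \sum_{c \,\in\, \mathrm{cells}(E)} \frac{T_c^{2}}{n_c} \;-\; \frac{T^{2}}{N} \\
SS(A)         &= U(A) \\
SS(A{*}B)     &= U(A{*}B) - SS(A) - SS(B) \\
SS(A{*}B{*}C) &= U(A{*}B{*}C) - SS(A) - SS(B) - SS(C) \\
              &\qquad - SS(A{*}B) - SS(A{*}C) - SS(B{*}C)
\end{align*}
```

Only effects that actually appear in the model are subtracted, which is how an omitted effect ends up pooled into a higher-level one. As a small check of the first formula, a one-way layout with two cells of three observations each and cell totals 6 and 12 gives $U = 36/3 + 144/3 - 324/6 = 6$.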
PROC ANOVA first displays a table that includes the following:
the name of each variable in the CLASS statement
the number of different values or Levels of the Class variables
the Values of the Class variables
the Number of observations in the data set and the number of observations excluded from the analysis because of missing values, if any
PROC ANOVA then displays an analysis-of-variance table for each dependent variable in the MODEL statement. This table breaks down
the Total Sum of Squares for the dependent variable into the portion attributed to the Model and the portion attributed to Error
the Mean Square term, which is the Sum of Squares divided by the degrees of freedom (DF)
The analysis-of-variance table also lists the following:
the Mean Square for Error (MSE), which is an estimate of $\sigma^2$, the variance of the true errors
the F Value, which is the ratio produced by dividing the Mean Square for the Model by the Mean Square for Error. It tests how well the model as a whole (adjusted for the mean) accounts for the dependent variable's behavior. This F test is a test of the null hypothesis that all parameters except the intercept are zero.
the significance probability associated with the F statistic, labeled 'Pr > F'
R-Square, $R^2$, which measures how much variation in the dependent variable can be accounted for by the model. The $R^2$ statistic, which can range from 0 to 1, is the ratio of the sum of squares for the model divided by the sum of squares for the corrected total. In general, the larger the $R^2$ value, the better the model fits the data.
C.V., the coefficient of variation, which is often used to describe the amount of variation in the population. The C.V. is 100 times the standard deviation of the dependent variable divided by the Mean. The coefficient of variation is often a preferred measure because it is unitless.
Root MSE, which estimates the standard deviation of the dependent variable. Root MSE is computed as the square root of Mean Square for Error, the mean square of the error term.
the Mean of the dependent variable
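The statistics in this list can be summarized as formulas. The subscripted symbols are shorthand introduced here, and the C.V. line assumes that the "standard deviation of the dependent variable" in its description is the Root MSE estimate described in the same list.

```latex
% Shorthand (assumed): SS = sum of squares, MS = mean square, \bar{Y} = dependent mean
\begin{align*}
F               &= \frac{MS_{\mathrm{Model}}}{MSE} \\
R^{2}           &= \frac{SS_{\mathrm{Model}}}{SS_{\mathrm{Corrected\ Total}}} \\
\text{Root MSE} &= \sqrt{MSE} \\
\text{C.V.}     &= 100 \times \frac{\text{Root MSE}}{\bar{Y}}
\end{align*}
```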
For each effect (or source of variation) in the model, PROC ANOVA then displays the following:
DF, degrees of freedom
Anova SS, the sum of squares, and the associated Mean Square
the F Value for testing the hypothesis that the group means for that effect are equal
Pr > F, the significance probability value associated with the F Value
When you specify a TEST statement, PROC ANOVA displays the results of the requested tests. When you specify a MANOVA statement and the model includes more than one dependent variable, PROC ANOVA produces these additional statistics:
the characteristic roots and vectors of $E^{-1}H$ for each $H$ matrix
the Hotelling-Lawley trace
Pillai's trace
Wilks' criterion
Roy's maximum root criterion
See Example 32.6 on page 1868 in Chapter 32, 'The GLM Procedure,' for an example of the MANOVA results. These MANOVA tests are discussed in Chapter 2, 'Introduction to Regression Procedures.'
PROC ANOVA assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'
ODS Table Name | Description | Statement / Option |
---|---|---|
AltErrTests | Anova tests with error other than MSE | TEST /E= |
Bartlett | Bartlett's homogeneity of variance test | MEANS / HOVTEST=BARTLETT |
CLDiffs | Multiple comparisons of pairwise differences | MEANS / CLDIFF or DUNNETT or (Unequal cells and not LINES) |
CLDiffsInfo | Information for multiple comparisons of pairwise differences | MEANS / CLDIFF or DUNNETT or (Unequal cells and not LINES) |
CLMeans | Multiple comparisons of means with confidence/comparison interval | MEANS / CLM with (BON or GABRIEL or SCHEFFE or SIDAK or SMM or T or LSD) |
CLMeansInfo | Information for multiple comparisons of means with confidence/comparison interval | MEANS / CLM |
CanAnalysis | Canonical analysis | (MANOVA or REPEATED) / CANONICAL |
CanCoef | Canonical coefficients | (MANOVA or REPEATED) / CANONICAL |
CanStructure | Canonical structure | (MANOVA or REPEATED) / CANONICAL |
CharStruct | Characteristic roots and vectors | (MANOVA / not CANONICAL) or (REPEATED / PRINTRV) |
ClassLevels | Classification variable levels | CLASS statement |
DependentInfo | Simultaneously analyzed dependent variables | default when there are multiple dependent variables with different patterns of missing values |
Epsilons | Greenhouse-Geisser and Huynh-Feldt epsilons | REPEATED statement |
ErrorSSCP | Error SSCP matrix | (MANOVA or REPEATED) / PRINTE |
FitStatistics | R-Square, C.V., Root MSE, and dependent mean | default |
HOVFTest | Homogeneity of variance ANOVA | MEANS / HOVTEST |
HypothesisSSCP | Hypothesis SSCP matrix | (MANOVA or REPEATED) / PRINTE |
MANOVATransform | Multivariate transformation matrix | MANOVA / M= |
MCLines | Multiple comparisons LINES output | MEANS / LINES or ((DUNCAN or WALLER or SNK or REGWQ) and not(CLDIFF or CLM)) or (Equal cells and not CLDIFF) |
MCLinesInfo | Information for multiple comparison LINES output | MEANS / LINES or ((DUNCAN or WALLER or SNK or REGWQ) and not (CLDIFF or CLM)) or (Equal cells and not CLDIFF) |
MCLinesRange | Ranges for multiple range MC tests | MEANS / LINES or ((DUNCAN or WALLER or SNK or REGWQ) and not (CLDIFF or CLM)) or (Equal cells and not CLDIFF) |
Means | Group means | MEANS statement |
ModelANOVA | ANOVA for model terms | default |
MultStat | Multivariate tests | MANOVA statement |
NObs | Number of observations | default |
OverallANOVA | Over-all ANOVA | default |
PartialCorr | Partial correlation matrix | (MANOVA or REPEATED) / PRINTE |
RepTransform | Repeated transformation matrix | REPEATED (CONTRAST or HELMERT or MEAN or POLYNOMIAL or PROFILE) |
RepeatedLevelInfo | Correspondence between dependents and repeated measures levels | REPEATED statement |
Sphericity | Sphericity tests | REPEATED / PRINTE |
Tests | Summary ANOVA for specified MANOVA H= effects | MANOVA / H= SUMMARY |
Welch | Welch's ANOVA | MEANS / WELCH |
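As a sketch of how these table names are used, the following step captures two of the default tables as data sets with the ODS OUTPUT statement. The data set names work.oneway, work.effects, and work.fit and the simulated values are hypothetical.

```sas
/* Hypothetical one-way data: 3 groups, 5 observations each */
data work.oneway;
   do Group = 1 to 3;
      do Rep = 1 to 5;
         Y = 5 + Group + rannor(4321);  /* simulated response */
         output;
      end;
   end;
run;

/* Request data sets for the ModelANOVA and FitStatistics tables */
ods output ModelANOVA=work.effects FitStatistics=work.fit;

proc anova data=work.oneway;
   class Group;
   model Y = Group;
run;
quit;

proc print data=work.effects;
run;

proc print data=work.fit;
run;
```

The ODS OUTPUT statement is placed before the procedure step so that the request is in effect when the tables are produced.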
This section describes the use of ODS for creating statistical graphs with the ANOVA procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release. To request these graphs you must specify the ODS GRAPHICS statement with an appropriate model, as discussed in the following. For more information on the ODS GRAPHICS statement, see Chapter 15, 'Statistical Graphics Using ODS.'
When ODS graphics are in effect and you specify a one-way analysis-of-variance model (a model with just one independent classification variable), the ANOVA procedure produces a grouped box plot of the response values versus the classification levels. For an example of the box plot, see the 'One-Way Layout with Means Comparisons' section on page 424.
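A minimal sketch of requesting the box plot follows. It assumes that an ODS destination capable of displaying graphics is already open; the data set work.oneway and its simulated values are hypothetical.

```sas
/* Hypothetical one-way data: 3 groups, 5 observations each */
data work.oneway;
   do Group = 1 to 3;
      do Rep = 1 to 5;
         Y = 5 + Group + rannor(4321);  /* simulated response */
         output;
      end;
   end;
run;

ods graphics on;            /* enable ODS graphics */

proc anova data=work.oneway;
   class Group;             /* a single classification variable: one-way model */
   model Y = Group;
run;
quit;

ods graphics off;
```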
PROC ANOVA assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 17.4.
ODS Graph Name | Plot Description |
---|---|
BoxPlot | Box plot |
To request these graphs you must specify the ODS GRAPHICS statement. For more information on the ODS GRAPHICS statement, see Chapter 15, 'Statistical Graphics Using ODS.'