PROC REG constructs only one crossproducts matrix for the variables in all regressions. If any variable needed for any regression is missing, the observation is excluded from all estimates. If you include variables with missing values in the VAR statement, the corresponding observations are excluded from all analyses, even if you never include the variables in a model. PROC REG assumes that you may want to include these variables after the first RUN statement and deletes observations with missing values.
PROC REG does not compute new regressors. For example, if you want a quadratic term in your model, you must create the new variable when you prepare the input data; the statement
model y=x1 x1*x1;
is not valid. Note that this MODEL statement is valid in the GLM procedure.
The input data set for most applications of PROC REG contains standard rectangular data, but special TYPE=CORR, TYPE=COV, or TYPE=SSCP data sets can also be used. TYPE=CORR and TYPE=COV data sets created by the CORR procedure contain means and standard deviations. In addition, TYPE=CORR data sets contain correlations and TYPE=COV data sets contain covariances. TYPE=SSCP data sets created in previous runs of PROC REG that used the OUTSSCP= option contain the sums of squares and crossproducts of the variables. See Appendix A, Special SAS Data Sets, and the SAS Files section in SAS Language Reference: Concepts for more information on special SAS data sets.
These summary files save CPU time. It takes nk² operations (where n = number of observations and k = number of variables) to calculate the crossproducts; the regressions are of order k³. When n is in the thousands and k is less than 10, you can save 99 percent of the CPU time by reusing the SSCP matrix rather than recomputing it.
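That arithmetic is easy to sketch. The constants below are illustrative order-of-magnitude counts, not an exact model of PROC REG's internals:

```python
# Rough operation counts for the SSCP approach (illustrative constants,
# not an exact accounting of PROC REG's implementation).
n, k = 5000, 10          # observations, variables

sscp_ops = n * k**2      # forming the k-by-k crossproducts matrix: ~n*k^2
solve_ops = k**3         # solving the normal equations from it: ~k^3

# Reusing a saved SSCP data set skips the n*k^2 step on every later run.
savings = sscp_ops / (sscp_ops + solve_ops)
print(f"fraction of work avoided by reusing the SSCP matrix: {savings:.3f}")
```

With n in the thousands and k under 10, the crossproducts step dominates, which is where the "99 percent" figure comes from.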
When you want to use a special SAS data set as input, PROC REG must determine the TYPE for the data set. PROC CORR and PROC REG automatically set the type for their output data sets. However, if you create the data set by some other means (such as a DATA step) you must specify its type with the TYPE= data set option. If the TYPE for the data set is not specified when the data set is created, you can specify TYPE= as a data set option in the DATA= option in the PROC REG statement. For example,
proc reg data=a(type=corr);
When TYPE=CORR, TYPE=COV, or TYPE=SSCP data sets are used with PROC REG, statements and options that require the original data values have no effect. The OUTPUT, PAINT, PLOT, and REWEIGHT statements and the MODEL and PRINT statement options P, R, CLM, CLI, DW, INFLUENCE, and PARTIAL are disabled since the original observations needed to calculate predicted and residual values are not present.
This example uses PROC CORR to produce an input data set for PROC REG. The fitness data for this analysis can be found in Example 61.1 on page 3924.
proc corr data=fitness outp=r noprint;
   var Oxygen RunTime Age Weight RunPulse MaxPulse RestPulse;
proc print data=r;
proc reg data=r;
   model Oxygen=RunTime Age Weight;
run;
Since the OUTP= data set from PROC CORR is automatically set to TYPE=CORR, the TYPE= data set option is not required in this example. The data set containing the correlation matrix is displayed by the PRINT procedure as shown in Figure 61.12. Figure 61.13 shows results from the regression using the TYPE=CORR data as an input data set.
Obs  _TYPE_  _NAME_      Oxygen  RunTime      Age   Weight  RunPulse  MaxPulse  RestPulse
  1  MEAN               47.3758  10.5861  47.6774  77.4445   169.645   173.774    53.4516
  2  STD                 5.3272   1.3874   5.2114   8.3286    10.252     9.164     7.6194
  3  N                  31.0000  31.0000  31.0000  31.0000    31.000    31.000    31.0000
  4  CORR    Oxygen      1.0000   0.8622   0.3046   0.1628     0.398     0.237     0.3994
  5  CORR    RunTime     0.8622   1.0000   0.1887   0.1435     0.314     0.226     0.4504
  6  CORR    Age         0.3046   0.1887   1.0000   0.2335     0.338     0.433     0.1641
  7  CORR    Weight      0.1628   0.1435   0.2335   1.0000     0.182     0.249     0.0440
  8  CORR    RunPulse    0.3980   0.3136   0.3379   0.1815     1.000     0.930     0.3525
  9  CORR    MaxPulse    0.2367   0.2261   0.4329   0.2494     0.930     1.000     0.3051
 10  CORR    RestPulse   0.3994   0.4504   0.1641   0.0440     0.352     0.305     1.0000
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             3       656.27095    218.75698    30.27  <.0001
Error            27       195.11060      7.22632
Corrected Total  30       851.38154

Root MSE        2.68818   R-Square  0.7708
Dependent Mean 47.37581   Adj R-Sq  0.7454
Coeff Var       5.67416

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1            93.12615         7.55916    12.32    <.0001
RunTime     1            -3.14039         0.36738    -8.55    <.0001
Age         1            -0.17388         0.09955    -1.75    0.0921
Weight      1            -0.05444         0.06181    -0.88    0.3862
The following example uses the saved crossproducts matrix:
proc reg data=fitness outsscp=sscp noprint;
   model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse;
proc print data=sscp;
proc reg data=sscp;
   model Oxygen=RunTime Age Weight;
run;
First, all variables are used to fit the data and create the SSCP data set. Figure 61.14 shows the PROC PRINT display of the SSCP data set. The SSCP data set is then used as the input data set for PROC REG, and a reduced model is fit to the data. Figure 61.15 also shows the PROC REG results for the reduced model. (For the PROC REG results for the full model, see Figure 61.27 on page 3877.)
In the preceding example, the TYPE= data set option is not required since PROC REG sets the OUTSSCP= data set to TYPE=SSCP.
Obs  _TYPE_  _NAME_     Intercept   RunTime        Age     Weight   RunPulse   MaxPulse  RestPulse     Oxygen
  1  SSCP    Intercept      31.00    328.17    1478.00    2400.78    5259.00    5387.00    1657.00    1468.65
  2  SSCP    RunTime       328.17   3531.80   15687.24   25464.71   55806.29   57113.72   17684.05   15356.14
  3  SSCP    Age          1478.00  15687.24   71282.00  114158.90  250194.00  256218.00   78806.00   69767.75
  4  SSCP    Weight       2400.78  25464.71  114158.90  188008.20  407745.67  417764.62  128409.28  113522.26
  5  SSCP    RunPulse     5259.00  55806.29  250194.00  407745.67  895317.00  916499.00  281928.00  248497.31
  6  SSCP    MaxPulse     5387.00  57113.72  256218.00  417764.62  916499.00  938641.00  288583.00  254866.75
  7  SSCP    RestPulse    1657.00  17684.05   78806.00  128409.28  281928.00  288583.00   90311.00   78015.41
  8  SSCP    Oxygen       1468.65  15356.14   69767.75  113522.26  248497.31  254866.75   78015.41   70429.86
  9  N                      31.00     31.00      31.00      31.00      31.00      31.00      31.00      31.00
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             3       656.27095    218.75698    30.27  <.0001
Error            27       195.11060      7.22632
Corrected Total  30       851.38154

Root MSE        2.68818   R-Square  0.7708
Dependent Mean 47.37581   Adj R-Sq  0.7454
Coeff Var       5.67416

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1            93.12615         7.55916    12.32    <.0001
RunTime     1            -3.14039         0.36738    -8.55    <.0001
Age         1            -0.17388         0.09955    -1.75    0.0921
Weight      1            -0.05444         0.06181    -0.88    0.3862
The OUTEST= specification produces a TYPE=EST output SAS data set containing estimates and optional statistics from the regression models. For each BY group and each dependent variable occurring in each MODEL statement, PROC REG outputs an observation to the OUTEST= data set. The variables output to the data set are as follows:
the BY variables, if any
_MODEL_, a character variable containing the label of the corresponding MODEL statement, or MODELn if no label is specified, where n is 1 for the first MODEL statement, 2 for the second MODEL statement, and so on
_TYPE_, a character variable with the value PARMS for every observation
_DEPVAR_, the name of the dependent variable
_RMSE_, the root mean squared error or the estimate of the standard deviation of the error term
Intercept, the estimated intercept, unless the NOINT option is specified
all the variables listed in any MODEL or VAR statement. Values of these variables are the estimated regression coefficients for the model. A variable that does not appear in the model corresponding to a given observation has a missing value in that observation. The dependent variable in each model is given a value of −1.
If you specify the COVOUT option, the covariance matrix of the estimates is output after the estimates; the _TYPE_ variable is set to the value COV and the names of the rows are identified by the 8-byte character variable, _NAME_ .
If you specify the TABLEOUT option, the following statistics listed by _TYPE_ are added after the estimates:
STDERR, the standard error of the estimate
T, the t statistic for testing if the estimate is zero
PVALUE, the associated p-value
LnB, the 100(1−α)% lower confidence limit for the estimate, where n is the nearest integer to 100(1−α) and α defaults to 0.05 or is set by using the ALPHA= option in the PROC REG or MODEL statement
UnB, the 100(1−α)% upper confidence limit for the estimate
Specifying the ADJRSQ, AIC, BIC, CP, EDF, GMSEP, JP, MSE, PC, RSQUARE, SBC, SP, or SSE option in the PROC REG or MODEL statement automatically outputs these statistics and the model R² for each model selected, regardless of the model selection method. Additional variables, in order of occurrence, are as follows.
_IN_, the number of regressors in the model, not including the intercept
_P_, the number of parameters in the model, including the intercept, if any
_EDF_, the error degrees of freedom
_SSE_, the error sum of squares, if the SSE option is specified
_MSE_, the mean squared error, if the MSE option is specified
_RSQ_, the R² statistic
_ADJRSQ_, the adjusted R², if the ADJRSQ option is specified
_CP_, the Cp statistic, if the CP option is specified
_SP_, the Sp statistic, if the SP option is specified
_JP_, the Jp statistic, if the JP option is specified
_PC_, the PC statistic, if the PC option is specified
_GMSEP_, the GMSEP statistic, if the GMSEP option is specified
_AIC_, the AIC statistic, if the AIC option is specified
_BIC_, the BIC statistic, if the BIC option is specified
_SBC_, the SBC statistic, if the SBC option is specified
The following is an example with a display of the OUTEST= data set. This example uses the population data given in the section Polynomial Regression beginning on page 3804. Figure 61.16 on page 3865 through Figure 61.18 on page 3866 show the regression equations and the resulting OUTEST= data set.
proc reg data=USPopulation outest=est;
   m1: model Population=Year;
   m2: model Population=Year YearSq;
proc print data=est;
run;
The REG Procedure
Model: m1
Dependent Variable: Population

Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1          146869       146869   228.92  <.0001
Error            20           12832    641.58160
Corrected Total  21          159700

Root MSE       25.32946   R-Square  0.9197
Dependent Mean 94.64800   Adj R-Sq  0.9156
Coeff Var      26.76175

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1         -2345.85498       161.39279   -14.54    <.0001
Year        1             1.28786         0.08512    15.13    <.0001
The REG Procedure
Model: m2
Dependent Variable: Population

Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2          159529        79765  8864.19  <.0001
Error            19       170.97193      8.99852
Corrected Total  21          159700

Root MSE        2.99975   R-Square  0.9989
Dependent Mean 94.64800   Adj R-Sq  0.9988
Coeff Var       3.16938

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1               21631       639.50181    33.82    <.0001
Year        1           -24.04581         0.67547   -35.60    <.0001
YearSq      1             0.00668      0.00017820    37.51    <.0001
Obs  _MODEL_  _TYPE_  _DEPVAR_    _RMSE_  Intercept      Year  Population      YearSq
  1  m1       PARMS   Population  25.3295   -2345.85    1.2879         -1           .
  2  m2       PARMS   Population   2.9998   21630.89  -24.0458         -1  .006684346
The following modification of the previous example uses the TABLEOUT and ALPHA= options to obtain additional information in the OUTEST= data set:
proc reg data=USPopulation outest=est tableout alpha=0.1;
   m1: model Population=Year/noprint;
   m2: model Population=Year YearSq/noprint;
proc print data=est;
run;
Notice that the TABLEOUT option causes standard errors, t statistics, p-values, and confidence limits for the estimates to be added to the OUTEST= data set. Also note that the ALPHA= option is used to set the confidence level at 90%. The OUTEST= data set follows.
Obs  _MODEL_  _TYPE_   _DEPVAR_    _RMSE_  Intercept      Year  Population   YearSq
  1  m1       PARMS    Population  25.3295   -2345.85    1.2879         -1        .
  2  m1       STDERR   Population  25.3295     161.39    0.0851          .        .
  3  m1       T        Population  25.3295     -14.54   15.1300          .        .
  4  m1       PVALUE   Population  25.3295       0.00    0.0000          .        .
  5  m1       L90B     Population  25.3295   -2624.21    1.1411          .        .
  6  m1       U90B     Population  25.3295   -2067.50    1.4347          .        .
  7  m2       PARMS    Population   2.9998   21630.89  -24.0458         -1   0.0067
  8  m2       STDERR   Population   2.9998     639.50    0.6755          .   0.0002
  9  m2       T        Population   2.9998      33.82  -35.5988          .  37.5096
 10  m2       PVALUE   Population   2.9998       0.00    0.0000          .   0.0000
 11  m2       L90B     Population   2.9998   20525.11  -25.2138          .   0.0064
 12  m2       U90B     Population   2.9998   22736.68  -22.8778          .   0.0070
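The L90B and U90B rows are simply estimate ± t* × standard error, where t* is the two-sided 90% critical value of a t distribution with the model's error degrees of freedom (20 for m1). A quick check of the Year limits in m1, using t* ≈ 1.7247 taken from a t table (the Python standard library has no t quantile function):

```python
# Reproduce the L90B/U90B rows for Year in model m1 (alpha = 0.10).
estimate = 1.28786      # Year coefficient from the m1 fit
stderr   = 0.08512      # its standard error
t_crit   = 1.7247       # t quantile, 95th percentile at 20 df (from a table)

lower = estimate - t_crit * stderr
upper = estimate + t_crit * stderr
print(round(lower, 4), round(upper, 4))   # agrees with 1.1411 and 1.4347
```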
A slightly different OUTEST= data set is created when you use the RSQUARE selection method. This example requests only the best model for each subset size but asks for a variety of model selection statistics, as well as the estimated regression coefficients. An OUTEST= data set is created and displayed. See Figure 61.20 and Figure 61.21 for results.
proc reg data=fitness outest=est;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=rsquare mse jp gmsep cp aic bic sbc b best=1;
proc print data=est;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

R-Square Selection Method

Number in                                      Estimated MSE
Model     R-Square     C(p)      AIC      BIC  of Prediction     J(p)      MSE       SBC
1           0.7434  13.6988  64.5341  65.4673         8.0546   8.0199  7.53384  67.40210
2           0.7642  12.3894  63.9050  64.8212         7.9478   7.8621  7.16842  68.20695
3           0.8111   6.9596  59.0373  61.3127         6.8583   6.7253  5.95669  64.77326
4           0.8368   4.8800  56.4995  60.3996         6.3984   6.2053  5.34346  63.66941
5           0.8480   5.1063  56.2986  61.5667         6.4565   6.1782  5.17634  64.90250
6           0.8487   7.0000  58.1616  64.0748         6.9870   6.5804  5.36825  68.19952

Number in                           Parameter Estimates
Model  R-Square  Intercept       Age    Weight   RunTime  RunPulse  RestPulse  MaxPulse
1        0.7434   82.42177         .         .  -3.31056         .          .         .
2        0.7642   88.46229  -0.15037         .  -3.20395         .          .         .
3        0.8111  111.71806  -0.25640         .  -2.82538  -0.13091          .         .
4        0.8368   98.14789  -0.19773         .  -2.76758  -0.34811          .   0.27051
5        0.8480  102.20428  -0.21962  -0.07230  -2.68252  -0.37340          .   0.30491
6        0.8487  102.93448  -0.22697  -0.07418  -2.62865  -0.36963   -0.02153   0.30322

Obs  _MODEL_  _TYPE_  _DEPVAR_  _RMSE_   Intercept       Age     Weight   RunTime  RunPulse  RestPulse  MaxPulse
  1  MODEL1   PARMS   Oxygen    2.74478     82.422         .          .  -3.31056         .          .         .
  2  MODEL1   PARMS   Oxygen    2.67739     88.462  -0.15037          .  -3.20395         .          .         .
  3  MODEL1   PARMS   Oxygen    2.44063    111.718  -0.25640          .  -2.82538  -0.13091          .         .
  4  MODEL1   PARMS   Oxygen    2.31159     98.148  -0.19773          .  -2.76758  -0.34811          .   0.27051
  5  MODEL1   PARMS   Oxygen    2.27516    102.204  -0.21962  -0.072302  -2.68252  -0.37340          .   0.30491
  6  MODEL1   PARMS   Oxygen    2.31695    102.934  -0.22697  -0.074177  -2.62865  -0.36963  -0.021534   0.30322

Obs  Oxygen  _IN_  _P_  _EDF_    _MSE_    _RSQ_     _CP_     _JP_  _GMSEP_    _AIC_    _BIC_    _SBC_
  1      -1     1    2     29  7.53384  0.74338  13.6988  8.01990  8.05462  64.5341  65.4673  67.4021
  2      -1     2    3     28  7.16842  0.76425  12.3894  7.86214  7.94778  63.9050  64.8212  68.2069
  3      -1     3    4     27  5.95669  0.81109   6.9596  6.72530  6.85833  59.0373  61.3127  64.7733
  4      -1     4    5     26  5.34346  0.83682   4.8800  6.20531  6.39837  56.4995  60.3996  63.6694
  5      -1     5    6     25  5.17634  0.84800   5.1063  6.17821  6.45651  56.2986  61.5667  64.9025
  6      -1     6    7     24  5.36825  0.84867   7.0000  6.58043  6.98700  58.1616  64.0748  68.1995
The OUTSSCP= option produces a TYPE=SSCP output SAS data set containing sums of squares and crossproducts. A special row (observation) and column (variable) of the matrix called Intercept contain the number of observations and sums. Observations are identified by the character variable _NAME_ . The data set contains all variables used in MODEL statements. You can specify additional variables that you want included in the crossproducts matrix with a VAR statement.
The SSCP data set is used when a large number of observations are explored in many different runs. The SSCP data set can be saved and used for subsequent runs, which are much less expensive since PROC REG never reads the original data again. If you run PROC REG once to create only an SSCP data set, you should list all the variables that you may need in a VAR statement or include all the variables that you may need in a MODEL statement.
The following example uses the fitness data from Example 61.1 on page 3924 to produce an output data set with the OUTSSCP= option. The resulting output is shown in Figure 61.22.
proc reg data=fitness outsscp=sscp;
   var Oxygen RunTime Age Weight RestPulse RunPulse MaxPulse;
proc print data=sscp;
run;
Obs  _TYPE_  _NAME_     Intercept     Oxygen    RunTime        Age     Weight  RestPulse   RunPulse   MaxPulse
  1  SSCP    Intercept      31.00    1468.65     328.17    1478.00    2400.78    1657.00    5259.00    5387.00
  2  SSCP    Oxygen       1468.65   70429.86   15356.14   69767.75  113522.26   78015.41  248497.31  254866.75
  3  SSCP    RunTime       328.17   15356.14    3531.80   15687.24   25464.71   17684.05   55806.29   57113.72
  4  SSCP    Age          1478.00   69767.75   15687.24   71282.00  114158.90   78806.00  250194.00  256218.00
  5  SSCP    Weight       2400.78  113522.26   25464.71  114158.90  188008.20  128409.28  407745.67  417764.62
  6  SSCP    RestPulse    1657.00   78015.41   17684.05   78806.00  128409.28   90311.00  281928.00  288583.00
  7  SSCP    RunPulse     5259.00  248497.31   55806.29  250194.00  407745.67  281928.00  895317.00  916499.00
  8  SSCP    MaxPulse     5387.00  254866.75   57113.72  256218.00  417764.62  288583.00  916499.00  938641.00
  9  N                      31.00      31.00      31.00      31.00      31.00      31.00      31.00      31.00
Since a model is not fit to the data and since the only request is to create the SSCP data set, a MODEL statement is not required in this example. However, since the MODEL statement is not used, the VAR statement is required.
PROC REG enables you to change interactively both the model and the data used to compute the model, and to produce and highlight scatter plots. See the section Using PROC REG Interactively on page 3812 for an overview of interactive analysis using PROC REG. The following statements can be used interactively (without reinvoking PROC REG): ADD, DELETE, MODEL, MTEST, OUTPUT, PAINT, PLOT, PRINT, REFIT, RESTRICT, REWEIGHT, and TEST. All interactive features are disabled if there is a BY statement.
The ADD, DELETE, and REWEIGHT statements can be used to modify the current MODEL. Every use of an ADD, DELETE, or REWEIGHT statement causes the model label to be modified by attaching an additional number to it. This number is the cumulative total of the number of ADD, DELETE, or REWEIGHT statements following the current MODEL statement.
A more detailed explanation of changing the data used to compute the model is given in the section Reweighting Observations in an Analysis on page 3903. Extra features for line printer scatter plots are discussed in the section Line Printer Scatter Plot Features on page 3882.
The following example illustrates the usefulness of the interactive features. First, the full regression model is fit to the class data (see the Getting Started section on page 3800), and Figure 61.23 is produced.
proc reg data=Class;
   model Weight=Age Height;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2      7215.63710   3607.81855    27.23  <.0001
Error            16      2120.09974    132.50623
Corrected Total  18      9335.73684

Root MSE        11.51114   R-Square  0.7729
Dependent Mean 100.02632   Adj R-Sq  0.7445
Coeff Var       11.50811

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1          -141.22376        33.38309    -4.23    0.0006
Age         1             1.27839         3.11010     0.41    0.6865
Height      1             3.59703         0.90546     3.97    0.0011
Next , the regression model is reduced by the following statements, and Figure 61.24 is produced.
delete age;
print;
run;
The REG Procedure
Model: MODEL1.1
Dependent Variable: Weight

Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1      7193.24912   7193.24912    57.08  <.0001
Error            17      2142.48772    126.02869
Corrected Total  18      9335.73684

Root MSE        11.22625   R-Square  0.7705
Dependent Mean 100.02632   Adj R-Sq  0.7570
Coeff Var       11.22330

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1          -143.02692        32.27459    -4.43    0.0004
Height      1             3.89903         0.51609     7.55    <.0001
Note that the MODEL label has been changed from MODEL1 to MODEL1.1, since the original MODEL has been changed by the DELETE statement.
The following statements generate a scatter plot of the residuals against the predicted values from the full model. Figure 61.25 is produced, and the scatter plot shows a possible outlier.
add age;
plot r.*p. / cframe=ligr;
run;
The following statements delete the observation with the largest residual, refit the regression model, and produce a scatter plot of residuals against predicted values for the refitted model. Figure 61.26 shows the new scatter plot.
reweight r.>20;
plot / cframe=ligr;
run;
The nine methods of model selection implemented in PROC REG are specified with the SELECTION= option in the MODEL statement. Each method is discussed in this section.
This method is the default and provides no model selection capability. The complete model specified in the MODEL statement is used to fit the model. For many regression analyses, this may be the only method you need.
The forward-selection technique begins with no variables in the model. For each of the independent variables, the FORWARD method calculates F statistics that reflect the variable's contribution to the model if it is included. The p-values for these F statistics are compared to the SLENTRY= value that is specified in the MODEL statement (or to 0.50 if the SLENTRY= option is omitted). If no F statistic is significant at the SLENTRY= level, the FORWARD selection stops. Otherwise, the FORWARD method adds the variable that has the largest F statistic to the model. The FORWARD method then calculates F statistics again for the variables still remaining outside the model, and the evaluation process is repeated. Thus, variables are added one by one to the model until no remaining variable produces a significant F statistic. Once a variable is in the model, it stays.
The backward elimination technique begins by calculating F statistics for a model, including all of the independent variables. Then the variables are deleted from the model one by one until all the variables remaining in the model produce F statistics significant at the SLSTAY= level specified in the MODEL statement (or at the 0.10 level if the SLSTAY= option is omitted). At each step, the variable showing the smallest contribution to the model is deleted.
The stepwise method is a modification of the forward-selection technique and differs in that variables already in the model do not necessarily stay there. As in the forward-selection method, variables are added one by one to the model, and the F statistic for a variable to be added must be significant at the SLENTRY= level. After a variable is added, however, the stepwise method looks at all the variables already included in the model and deletes any variable that does not produce an F statistic significant at the SLSTAY= level. Only after this check is made and the necessary deletions are accomplished can another variable be added to the model. The stepwise process ends when none of the variables outside the model has an F statistic significant at the SLENTRY= level and every variable in the model is significant at the SLSTAY= level, or when the variable to be added to the model is the one just deleted from it.
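The forward-selection loop described above can be sketched compactly. The sketch below is illustrative, not PROC REG's implementation: it fits each candidate model by the normal equations, and it compares the partial F statistic to a fixed, hypothetical F-to-enter cutoff of 4.0 rather than converting F to a p-value as SLENTRY= does (the Python standard library has no F-distribution quantiles). The data at the bottom are made up for illustration.

```python
# Forward selection sketch: enter the variable with the largest partial F,
# stop when no candidate's F exceeds a fixed F-to-enter cutoff.

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col and m[col][col] != 0:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def sse(y, cols):
    """Error sum of squares for an OLS fit of y on an intercept plus cols."""
    x = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yy for r, yy in zip(x, y)) for i in range(k)]
    b = solve(xtx, xty)
    return sum((yy - sum(bi * xi for bi, xi in zip(b, r))) ** 2
               for yy, r in zip(y, x))

def forward_select(y, xs, f_to_enter=4.0):
    chosen, remaining = [], list(range(len(xs)))
    while remaining:
        current = sse(y, [xs[i] for i in chosen])
        best = None
        for i in remaining:
            trial = sse(y, [xs[j] for j in chosen + [i]])
            dfe = len(y) - (len(chosen) + 2)          # error df after adding i
            f = (current - trial) / (trial / dfe) if trial > 0 else float("inf")
            if best is None or f > best[1]:
                best = (i, f)
        if best[1] < f_to_enter:                      # nothing significant: stop
            break
        chosen.append(best[0])
        remaining.remove(best[0])
    return chosen

# Hypothetical toy data: y depends strongly on x0; x1 is a noise column.
x0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x1 = [0.3, -0.1, 0.2, 0.0, -0.2, 0.1, -0.3, 0.2]
y  = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 15.9]
print(forward_select(y, [x0, x1]))   # x0 enters; x1's partial F stays small
```

The stepwise method would add one more step after each entry: recompute F for every variable already in the model and drop any that falls below the stay threshold.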
The maximum R² improvement technique does not settle on a single model. Instead, it tries to find the best one-variable model, the best two-variable model, and so forth, although it is not guaranteed to find the model with the largest R² for each size.

The MAXR method begins by finding the one-variable model producing the highest R². Then another variable, the one that yields the greatest increase in R², is added. Once the two-variable model is obtained, each of the variables in the model is compared to each variable not in the model. For each comparison, the MAXR method determines if removing one variable and replacing it with the other variable increases R². After comparing all possible switches, the MAXR method makes the switch that produces the largest increase in R². Comparisons begin again, and the process continues until the MAXR method finds that no switch could increase R². Thus, the two-variable model achieved is considered the best two-variable model the technique can find. Another variable is then added to the model, and the comparing-and-switching process is repeated to find the best three-variable model, and so forth.

The difference between the STEPWISE method and the MAXR method is that all switches are evaluated before any switch is made in the MAXR method. In the STEPWISE method, the worst variable may be removed without considering what adding the best remaining variable might accomplish. The MAXR method may require much more computer time than the STEPWISE method.

The MINR method closely resembles the MAXR method, but the switch chosen is the one that produces the smallest increase in R². For a given number of variables in the model, the MAXR and MINR methods usually produce the same best model, but the MINR method considers more models of each size.
The RSQUARE method finds subsets of independent variables that best predict a dependent variable by linear regression in the given sample. You can specify the largest and smallest number of independent variables to appear in a subset and the number of subsets of each size to be selected. The RSQUARE method can efficiently perform all possible subset regressions and display the models in decreasing order of R² magnitude within each subset size. Other statistics are available for comparing subsets of different sizes. These statistics, as well as estimated regression coefficients, can be displayed or output to a SAS data set.

The subset models selected by the RSQUARE method are optimal in terms of R² for the given sample, but they are not necessarily optimal for the population from which the sample is drawn or for any other sample for which you may want to make predictions. If a subset model is selected on the basis of a large R² value or any other criterion commonly used for model selection, then all regression statistics computed for that model under the assumption that the model is given a priori, including all statistics computed by PROC REG, are biased.
While the RSQUARE method is a useful tool for exploratory model building, no statistical method can be relied on to identify the true model. Effective model building requires substantive theory to suggest relevant predictors and plausible functional forms for the model.
The RSQUARE method differs from the other selection methods in that RSQUARE always identifies the model with the largest R² for each number of variables considered. The other selection methods are not guaranteed to find the model with the largest R². The RSQUARE method requires much more computer time than the other selection methods, so a different selection method, such as the STEPWISE method, is a good choice when there are many independent variables to consider.
This method is similar to the RSQUARE method, except that the adjusted R² statistic is used as the criterion for selecting models, and the method finds the models with the highest adjusted R² within the range of sizes.

This method is similar to the ADJRSQ method, except that Mallows' Cp statistic is used as the criterion for model selection. Models are listed in ascending order of Cp.
If the RSQUARE or STEPWISE procedure (as documented in SAS User's Guide: Statistics, Version 5 Edition) is requested, PROC REG with the appropriate model-selection method is actually used.
Reviews of model-selection methods by Hocking (1976) and Judge et al. (1980) describe these and other variable-selection methods.
When many significance tests are performed, each at a level of, for example, 5 percent, the overall probability of rejecting at least one true null hypothesis is much larger than 5 percent. If you want to guard against including any variables that do not contribute to the predictive power of the model in the population, you should specify a very small SLE= significance level for the FORWARD and STEPWISE methods and a very small SLS= significance level for the BACKWARD and STEPWISE methods.
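For independent tests the inflation is easy to quantify: with m true null hypotheses each tested at level α, the chance of at least one false rejection is 1 − (1 − α)^m. A quick check, assuming independence of the tests:

```python
# Familywise error rate for m independent tests at level alpha:
# P(at least one false rejection) = 1 - (1 - alpha)^m
alpha = 0.05
for m in (1, 5, 10, 20):
    print(m, round(1 - (1 - alpha) ** m, 3))
```

With 10 candidate variables the chance of at least one spurious entry is already about 40 percent, which is why small SLE=/SLS= values are recommended when the goal is to exclude noise variables.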
In most applications, many of the variables considered have some predictive power, however small. If you want to choose the model that provides the best prediction using the sample estimates, you need only to guard against estimating more parameters than can be reliably estimated with the given sample size, so you should use a moderate significance level, perhaps in the range of 10 percent to 25 percent.
In addition to R², the Cp statistic is displayed for each model generated in the model-selection methods. The Cp statistic was proposed by Mallows (1973) as a criterion for selecting a model. It is a measure of total squared error defined as

   Cp = SSEp / s² - (N - 2p)

where s² is the MSE for the full model, SSEp is the sum-of-squares error for a model with p parameters including the intercept (if any), and N is the number of observations. If Cp is plotted against p, Mallows recommends the model where Cp first approaches p. When the right model is chosen, the parameter estimates are unbiased, and this is reflected in Cp near p. For further discussion, refer to Daniel and Wood (1980).
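The definition can be checked against the numbers printed for the fitness data earlier in this chapter: the best four-regressor subset in the RSQUARE example has MSE 5.34346 on 26 error degrees of freedom, the full six-regressor model has MSE 5.36825, and N = 31. The arithmetic below is just a sanity check of the formula against that output:

```python
# Recompute Mallows' C(p) for the four-regressor fitness subset model:
# C(p) = SSE_p / s^2 - (N - 2p), with s^2 the full-model MSE.
N = 31
p = 5                     # 4 regressors + intercept
sse_p = 5.34346 * 26      # subset-model SSE = MSE * error df
s2 = 5.36825              # MSE of the full six-regressor model
cp = sse_p / s2 - (N - 2 * p)
print(round(cp, 2))       # close to the 4.8800 shown in the RSQUARE output
```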
The adjusted R² statistic is an alternative to R² that is adjusted for the number of parameters in the model. The adjusted R² statistic is calculated as

   ADJRSQ = 1 - ((n - i)(1 - R²)) / (n - p)

where n is the number of observations used in fitting the model, p is the number of parameters in the model (including the intercept), and i is an indicator variable that is 1 if the model includes an intercept and 0 otherwise.
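The formula can be verified against the full-model fitness output shown later in this section (model SS 722.54361, corrected total SS 851.38154, n = 31, seven parameters including the intercept), which prints Adj R-Sq 0.8108:

```python
# Adjusted R-square: 1 - ((n - i) * (1 - rsq)) / (n - p)
n, p, i = 31, 7, 1               # observations, parameters, intercept flag
rsq = 722.54361 / 851.38154      # model SS / corrected total SS
adj = 1 - ((n - i) * (1 - rsq)) / (n - p)
print(round(adj, 4))             # reproduces the printed 0.8108
```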
The use of model-selection methods can be time-consuming in some cases because there is no built-in limit on the number of independent variables, and the calculations for a large number of independent variables can be lengthy. The recommended limit on the number of independent variables for the MINR method is 20 + i , where i is the value of the INCLUDE= option.
For the RSQUARE, ADJRSQ, or CP methods, with a large value of the BEST= option, adding one more variable to the list from which regressors are selected may significantly increase the CPU time. Also, the time required for the analysis is highly dependent on the data and on the values of the BEST=, START=, and STOP= options.
The following example uses the fitness data from Example 61.1 on page 3924. Figure 61.28 shows the parameter estimates and the tables from the SS1, SS2, STB, CLB, COVB, and CORRB options:
proc reg data=fitness;
   model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
         / ss1 ss2 stb clb covb corrb;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Parameter Estimates
                Parameter  Standard                              Type I     Type II  Standardized    95% Confidence
Variable   DF    Estimate     Error  t Value  Pr > |t|              SS          SS      Estimate            Limits
Intercept   1   102.93448  12.40326     8.30    <.0001           69578   369.72831             0  77.33541 128.53355
RunTime     1    -2.62865   0.38456    -6.84    <.0001       632.90010   250.82210      -0.68460  -3.42235  -1.83496
Age         1    -0.22697   0.09984    -2.27    0.0322        17.76563    27.74577      -0.22204  -0.43303  -0.02092
Weight      1    -0.07418   0.05459    -1.36    0.1869         5.60522     9.91059      -0.11597  -0.18685   0.03850
RunPulse    1    -0.36963   0.11985    -3.08    0.0051        38.87574    51.05806      -0.71133  -0.61699  -0.12226
MaxPulse    1     0.30322   0.13650     2.22    0.0360        26.82640    26.49142       0.52161   0.02150   0.58493
RestPulse   1    -0.02153   0.06605    -0.33    0.7473         0.57051     0.57051      -0.03080  -0.15786   0.11480
The procedure first displays an Analysis of Variance table (Figure 61.27). The F statistic for the overall model is significant, indicating that the model explains a significant portion of the variation in the data.
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              6        722.54361     120.42393     22.43   <.0001
Error             24        128.83794       5.36825
Corrected Total   30        851.38154

Root MSE          2.31695   R-Square   0.8487
Dependent Mean   47.37581   Adj R-Sq   0.8108
Coeff Var         4.89057
The procedure next displays Parameter Estimates and some associated statistics (Figure 61.28). First, the estimates are shown, followed by their Standard Errors. The next two columns of the table contain the t statistics and the corresponding probabilities for testing the null hypothesis that the parameter is zero. These probabilities are usually referred to as p-values. For example, the Intercept term in the model is estimated to be 102.9 and is significantly different from zero. The next two columns of the table are the result of requesting the SS1 and SS2 options, and they show sequential and partial Sums of Squares (SS) associated with each variable. The Standardized Estimates (produced by the STB option) are the parameter estimates that result when all variables are standardized to a mean of 0 and a variance of 1. These estimates are computed by multiplying the original estimates by the standard deviation of the regressor (independent) variable and then dividing by the standard deviation of the dependent variable. The CLB option adds the upper and lower 95% confidence limits for the parameter estimates; the α level can be changed by specifying the ALPHA= option in the PROC REG or MODEL statement.
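The STB scaling just described is straightforward to reproduce outside of SAS. The following NumPy sketch (illustrative code; the function name is invented) applies the scaling and checks it against the equivalent route of refitting on z-scored data.

```python
import numpy as np

def standardized_estimates(X, y, b):
    """Scale raw slope estimates b (excluding the intercept) the way
    the STB option does: multiply each by the regressor's standard
    deviation and divide by the dependent variable's."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    return b * X.std(axis=0, ddof=1) / y.std(ddof=1)
```

The scaled slopes equal the slopes you would get by standardizing every variable to mean 0 and variance 1 before fitting.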
The final two tables are produced as a result of requesting the COVB and CORRB options (Figure 61.29). These tables show the estimated covariance matrix of the parameter estimates, and the estimated correlation matrix of the estimates.
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Covariance of Estimates

Variable    Intercept      RunTime        Age            Weight         RunPulse       MaxPulse       RestPulse
Intercept   153.84081152   0.7678373769   0.902049478    0.178237818    0.280796516    0.832761667    0.147954715
RunTime     0.7678373769   0.1478880839   0.014191688    0.004417672    0.009047784    0.0046249498   0.010915224
Age         0.902049478    0.014191688    0.009967521    0.0010219105   0.001203914    0.0035823843   0.0014897532
Weight      0.178237818    0.004417672    0.0010219105   0.0029804131   0.0009644683   0.001372241    0.0003799295
RunPulse    0.280796516    0.009047784    0.001203914    0.0009644683   0.0143647273   0.014952457    0.000764507
MaxPulse    0.832761667    0.0046249498   0.0035823843   0.001372241    0.014952457    0.0186309364   0.0003425724
RestPulse   0.147954715    0.010915224    0.0014897532   0.0003799295   0.000764507    0.0003425724   0.0043631674

Correlation of Estimates

Variable    Intercept   RunTime   Age      Weight   RunPulse   MaxPulse   RestPulse
Intercept   1.0000      0.1610    0.7285   0.2632   0.1889     0.4919     0.1806
RunTime     0.1610      1.0000    0.3696   0.2104   0.1963     0.0881     0.4297
Age         0.7285      0.3696    1.0000   0.1875   0.1006     0.2629     0.2259
Weight      0.2632      0.2104    0.1875   1.0000   0.1474     0.1842     0.1054
RunPulse    0.1889      0.1963    0.1006   0.1474   1.0000     0.9140     0.0966
MaxPulse    0.4919      0.0881    0.2629   0.1842   0.9140     1.0000     0.0380
RestPulse   0.1806      0.4297    0.2259   0.1054   0.0966     0.0380     1.0000
For further discussion of the parameters and statistics, see the Displayed Output section on page 3918, and Chapter 2, Introduction to Regression Procedures.
The display of the predicted values and residuals is controlled by the P, R, CLM, and CLI options in the MODEL statement. The P option causes PROC REG to display the observation number, the ID value (if an ID statement is used), the actual value, the predicted value, and the residual. The R, CLI, and CLM options also produce the items under the P option. Thus, P is unnecessary if you use one of the other options.
The R option requests more detail, especially about the residuals. The standard errors of the mean predicted value and the residual are displayed. The studentized residual, which is the residual divided by its standard error, is both displayed and plotted. A measure of influence, Cook's D, is displayed. Cook's D measures the change to the estimates that results from deleting each observation (Cook 1977, 1979). This statistic is very similar to DFFITS.
The CLM option requests that PROC REG display the 100(1 − α)% lower and upper confidence limits for the mean predicted values. This accounts for the variation due to estimating the parameters only. If you want a 100(1 − α)% confidence interval for observed values, then you can use the CLI option, which adds in the variability of the error term. The α level can be specified with the ALPHA= option in the PROC REG or MODEL statement.
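The CLM/CLI distinction can be sketched in NumPy (illustrative code; the helper name is invented, and the caller must supply the two-sided t critical value, since the Python standard library has no t quantile):

```python
import numpy as np

def mean_and_individual_limits(X, x0, s2, yhat0, tcrit):
    """Limits for the mean response (CLM-style) and for a new
    observation (CLI-style) at the point x0.  s2 is the mean squared
    error and tcrit the two-sided t critical value for the error
    degrees of freedom."""
    h0 = x0 @ np.linalg.inv(X.T @ X) @ x0      # leverage of x0
    stdp = np.sqrt(s2 * h0)                    # std error of the mean prediction
    stdi = np.sqrt(s2 * (1.0 + h0))            # adds the error variance
    return ((yhat0 - tcrit * stdp, yhat0 + tcrit * stdp),
            (yhat0 - tcrit * stdi, yhat0 + tcrit * stdi))
```

Because the individual interval adds the error variance, it always strictly contains the mean interval.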
You can use these statistics in PLOT and PAINT statements. This is useful in performing a variety of regression diagnostics. For definitions of the statistics produced by these options, see Chapter 2, Introduction to Regression Procedures.
The following example uses the US population data found in the section Polynomial Regression beginning on page 3804.
data USPop2;
   input Year @@;
   YearSq=Year*Year;
   datalines;
2010 2020 2030
;

data USPop2;
   set USPopulation USPop2;

proc reg data=USPop2;
   id Year;
   model Population=Year YearSq / r cli clm;
run;
After producing the usual Analysis of Variance and Parameter Estimates tables (Figure 61.30), the procedure displays the results of requesting the options for predicted and residual values (Figure 61.31). For each observation, the requested information is shown. Note that the ID variable is used to identify each observation. Also note that, for observations with missing dependent variables, the predicted value, standard error of the predicted value, and confidence intervals for the predicted value are still available.
The REG Procedure
Model: MODEL1
Dependent Variable: Population

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2           159529         79765   8864.19   <.0001
Error             19        170.97193       8.99852
Corrected Total   21           159700

Root MSE          2.99975   R-Square   0.9989
Dependent Mean   94.64800   Adj R-Sq   0.9988
Coeff Var         3.16938

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1                21631        639.50181     33.82     <.0001
Year         1            -24.04581          0.67547    -35.60     <.0001
YearSq       1              0.00668       0.00017820     37.51     <.0001
The REG Procedure
Model: MODEL1
Dependent Variable: Population

Output Statistics

                Dependent   Predicted   Std Error
Obs   Year      Variable    Value       Mean Predict   95% CL Mean            95% CL Predict
  1   1790        3.9290      6.2127      1.7565         2.5362    9.8892     -1.0631   13.4884
  2   1800        5.3080      5.7226      1.4560         2.6751    8.7701     -1.2565   12.7017
  3   1810        7.2390      6.5694      1.2118         4.0331    9.1057     -0.2021   13.3409
  4   1820        9.6380      8.7531      1.0305         6.5963   10.9100      2.1144   15.3918
  5   1830       12.8660     12.2737      0.9163        10.3558   14.1916      5.7087   18.8386
  6   1840       17.0690     17.1311      0.8650        15.3207   18.9415     10.5968   23.6655
  7   1850       23.1910     23.3254      0.8613        21.5227   25.1281     16.7932   29.8576
  8   1860       31.4430     30.8566      0.8846        29.0051   32.7080     24.3107   37.4024
  9   1870       39.8180     39.7246      0.9163        37.8067   41.6425     33.1597   46.2896
 10   1880       50.1550     49.9295      0.9436        47.9545   51.9046     43.3476   56.5114
 11   1890       62.9470     61.4713      0.9590        59.4641   63.4785     54.8797   68.0629
 12   1900       75.9940     74.3499      0.9590        72.3427   76.3571     67.7583   80.9415
 13   1910       91.9720     88.5655      0.9436        86.5904   90.5405     81.9836   95.1473
 14   1920      105.7100    104.1178      0.9163       102.2000  106.0357     97.5529  110.6828
 15   1930      122.7750    121.0071      0.8846       119.1556  122.8585    114.4612  127.5529
 16   1940      131.6690    139.2332      0.8613       137.4305  141.0359    132.7010  145.7654
 17   1950      151.3250    158.7962      0.8650       156.9858  160.6066    152.2618  165.3306
 18   1960      179.3230    179.6961      0.9163       177.7782  181.6139    173.1311  186.2610
 19   1970      203.2110    201.9328      1.0305       199.7759  204.0896    195.2941  208.5715
 20   1980      226.5420    225.5064      1.2118       222.9701  228.0427    218.7349  232.2779
 21   1990      248.7100    250.4168      1.4560       247.3693  253.4644    243.4378  257.3959
 22   2000      281.4220    276.6642      1.7565       272.9877  280.3407    269.3884  283.9400
 23   2010             .    304.2484      2.1073       299.8377  308.6591    296.5754  311.9214
 24   2020             .    333.1695      2.5040       327.9285  338.4104    324.9910  341.3479
 25   2030             .    363.4274      2.9435       357.2665  369.5883    354.6310  372.2238

Output Statistics

                           Std Error   Student
Obs   Year   Residual      Residual    Residual   Cook's D
  1   1790    -2.2837       2.432       -0.939      0.153
  2   1800    -0.4146       2.623       -0.158      0.003
  3   1810     0.6696       2.744        0.244      0.004
  4   1820     0.8849       2.817        0.314      0.004
  5   1830     0.5923       2.856        0.207      0.001
  6   1840    -0.0621       2.872       -0.0216     0.000
  7   1850    -0.1344       2.873       -0.0468     0.000
  8   1860     0.5864       2.866        0.205      0.001
  9   1870     0.0934       2.856        0.0327     0.000
 10   1880     0.2255       2.847        0.0792     0.000
 11   1890     1.4757       2.842        0.519      0.010
 12   1900     1.6441       2.842        0.578      0.013
 13   1910     3.4065       2.847        1.196      0.052
 14   1920     1.5922       2.856        0.557      0.011
 15   1930     1.7679       2.866        0.617      0.012
 16   1940    -7.5642       2.873       -2.632      0.208
 17   1950    -7.4712       2.872       -2.601      0.205
 18   1960    -0.3731       2.856       -0.131      0.001
 19   1970     1.2782       2.817        0.454      0.009
 20   1980     1.0356       2.744        0.377      0.009
 21   1990    -1.7068       2.623       -0.651      0.044
 22   2000     4.7578       2.432        1.957      0.666
 23   2010          .           .            .          .
 24   2020          .           .            .          .
 25   2030          .           .            .          .

(The listing also includes a line-printer bar chart of each studentized residual on a -2 to 2 scale; residuals beyond the scale are flagged with asterisks.)

Sum of Residuals                   4.4596E-11
Sum of Squared Residuals            170.97193
Predicted Residual SS (PRESS)       237.71229
Plots of the studentized residuals and Cook's D statistics are displayed as a result of requesting the R option. In the plot of studentized residuals, a large number of observations with absolute values greater than two indicates an inadequate model. A version of the studentized residual plot can be created on a high-resolution graphics device; see Example 61.7 on page 3952 for a similar example.
This section discusses the special options available with line printer scatter plots. Detailed examples of high resolution graphics plots and options are given in Example 61.6 on page 3950.
The interactive PLOT statement available in PROC REG enables you to look at scatter plots of data and diagnostic statistics. These plots can help you to evaluate the model and detect outliers in your data. Several options enable you to place multiple plots on a single page, superimpose plots, and collect plots to be overlaid by later plots. The PAINT statement can be used to highlight points on a plot. See the section Painting Scatter Plots on page 3889 for more information on painting.
The Class data set introduced in the Simple Linear Regression section on page 3800 is used in the following examples.
You can superimpose several plots with the OVERLAY option. With the following statements, a plot of Weight against Height is overlaid with plots of the predicted values and the 95% prediction intervals. The model on which the statistics are based is the full model including Height and Age . These statements produce Figure 61.32:
proc reg data=Class lineprinter;
   model Weight=Height Age / noprint;
   plot (ucl. lcl. p.)*Height='-'
        Weight*Height / overlay symbol='o';
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Weight

[Figure 61.32: line-printer plot of Weight against Height (Height axis from 50 to 72), overlaid with the predicted values and the upper and lower bounds of the 95% prediction interval. Data values are marked 'o', predicted values and limits are marked '-', and positions containing points from more than one plot are marked '?'.]
In this plot, the data values are marked with the symbol 'o' and the predicted values and prediction interval limits are labeled with the symbol '-'. The plot is scaled to accommodate the points from all plots. This is an important difference from the COLLECT option, which does not rescale plots after the first plot or plots are collected. You could separate the overlaid plots by using the following statements:
plot; run;
This places each of the four plots on a separate page, while the statements

   plot / overlay;
   run;

reproduce the overlaid plot on a single page.
The next example shows how you can overlay plots of statistics before and after a change in the model. For the full model involving Height and Age, the ordinary residuals and the studentized residuals are plotted against the predicted values. The COLLECT option causes these plots to be collected or retained for redisplay later. The option HPLOTS=2 allows the two plots to appear side by side on one page. The symbol 'f' is used on these plots to identify them as resulting from the full model. These statements produce Figure 61.33:
plot r.*p. student.*p. / collect hplots=2 symbol='f';
run;
The REG Procedure
Model: MODEL1

[Figure 61.33: side-by-side line-printer plots of the residuals (RESIDUAL) and the studentized residuals (STUDENT) against the predicted values (PRED, from 40 to 140), with all points marked 'f' for the full model.]
Note that these plots are not overlaid. The COLLECT option does not overlay the plots in one PLOT statement but retains them so that they can be overlaid by later plots. When the COLLECT option appears in a PLOT statement, the plots in that statement become the first plots in the collection.
Next, the model is reduced by deleting the Age variable. The PLOT statement requests the same plots as before but labels the points with the symbol 'r' denoting the reduced model. The following statements produce Figure 61.34:
delete Age;
plot r.*p. student.*p. / symbol='r';
run;
The REG Procedure
Model: MODEL1.1

[Figure 61.34: the residual and studentized-residual plots against PRED with the reduced-model points (marked 'r') overlaid on the collected full-model points (marked 'f'); positions containing at least one point from each model are marked '?'.]
Notice that the COLLECT option causes the corresponding plots to be overlaid. Also notice that the DELETE statement causes the model label to be changed from MODEL1 to MODEL1.1. The points labeled f are from the full model, and points labeled r are from the reduced model. Positions labeled ? contain at least one point from each model. In this example, the OVERLAY option cannot be used because all of the plots to be overlaid cannot be specified in one PLOT statement. With the COLLECT option, any changes to the model or the data used to fit the model do not affect plots collected before the changes. Collected plots are always reproduced exactly as they first appear. (Similarly, a PAINT statement does not affect plots collected before the PAINT statement is issued.)
The previous example overlays the residual plots for two different models. You may prefer to see them side by side on the same page. This can also be done with the COLLECT option by using a blank plot. Continuing from the last example, the COLLECT, HPLOTS=2, and SYMBOL='r' options are still in effect. In the following PLOT statement, the CLEAR option deletes the collected plots and allows the specified plot to begin a new collection. The plot created is the residual plot for the reduced model. These statements produce Figure 61.35:
plot r.*p. / clear;
run;
The REG Procedure
Model: MODEL1.1

[Figure 61.35: line-printer plot of the residuals (RESIDUAL) against the predicted values (PRED) for the reduced model, with points marked 'r'.]
The next statements add the variable Age back to the model and place the residual plot for the full model next to the plot for the reduced model. Notice that a blank plot is created in the first plot request by placing nothing between the quotes. Since the COLLECT option is in effect, this plot is superimposed on the residual plot for the reduced model. The residual plot for the full model is created by the second request. The result is the desired side-by-side plots. The NOCOLLECT option turns off the collection process after the specified plots are added and displayed. Any PLOT statements that follow show only the newly specified plots. These statements produce Figure 61.36:
add Age;
plot r.*p.='' r.*p.='f' / nocollect;
run;
The REG Procedure
Model: MODEL1.2

[Figure 61.36: side-by-side residual plots against PRED, with the reduced model on the left (points marked 'r') and the full model on the right (points marked 'f').]
Frequently, when the COLLECT option is in effect, you want the current and following PLOT statements to show only the specified plots. To do this, use both the CLEAR and NOCOLLECT options in the current PLOT statement.
Painting scatter plots is a useful interactive tool that enables you to mark points of interest in scatter plots. Painting can be used to identify extreme points in scatter plots or to reveal the relationship between two scatter plots. The Class data set (from the Simple Linear Regression section on page 3800) is used to illustrate some of these applications. First, a scatter plot of the studentized residuals against the predicted values is generated. This plot is shown in Figure 61.37.
proc reg data=Class lineprinter;
   model Weight=Age Height / noprint;
   plot student.*p.;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Weight

[Figure 61.37: line-printer scatter plot of the studentized residuals (STUDENT) against the Predicted Value of Weight (PRED, from 50 to 140). Each position is marked with the number of observations plotted there.]
Then, the following statements identify the observation Henry in the scatter plot and produce Figure 61.38:
paint Name='Henry' / symbol='H';
plot;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Weight

[Figure 61.38: the same scatter plot of STUDENT against PRED, with the observation for Henry now painted with the symbol 'H'.]
Next, the following statements identify observations with large absolute residuals:
paint student.>=2 or student.<=-2 / symbol='s';
plot;
run;
The log shows the observation numbers found with these conditions and gives the painting symbol and the number of observations found. Note that the previous PAINT statement is also used in the PLOT statement. Figure 61.39 shows the scatter plot produced by the preceding statements.
The REG Procedure
Model: MODEL1
Dependent Variable: Weight

[Figure 61.39: the scatter plot of STUDENT against PRED with observations whose studentized residuals are at least 2 in absolute value painted with the symbol 's'; the point for Henry remains marked 'H'.]
The following statements relate two different scatter plots. These statements produce Figure 61.40.
paint student.>=1 / symbol='p';
paint student.<1 and student.>-1 / symbol='s';
paint student.<=-1 / symbol='n';
plot student.*p. cookd.*h. / hplots=2;
run;
The REG Procedure
Model: MODEL1

[Figure 61.40: side-by-side plots of the studentized residuals (STUDENT) against the predicted values (PRED) and of Cook's D (COOKD) against the hat diagonals (H). Points are painted 'p' where STUDENT >= 1, 's' where STUDENT is between -1 and 1, and 'n' where STUDENT <= -1.]
If the model is not full rank, there are infinitely many least-squares solutions for the estimates. PROC REG chooses a nonzero solution for all variables that are linearly independent of previous variables and a zero solution for other variables. This solution corresponds to using a generalized inverse in the normal equations, and the expected values of the estimates are the Hermite normal form of X′X multiplied by the true parameters:

   E(b) = (X′X)⁻(X′X)β
Degrees of freedom for the zeroed estimates are reported as zero. The hypotheses that are not testable have t tests reported as missing. The message that the model is not full rank includes a display of the relations that exist in the matrix.
The next example uses the fitness data from Example 61.1 on page 3924. The variable Dif=RunPulse-RestPulse is created. When this variable is included in the model along with RunPulse and RestPulse, there is a linear dependency (or exact collinearity) between the independent variables. Figure 61.41 shows how this problem is diagnosed.
data fit2;
   set fitness;
   Dif=RunPulse-RestPulse;

proc reg data=fit2;
   model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse Dif;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              6        722.54361     120.42393     22.43   <.0001
Error             24        128.83794       5.36825
Corrected Total   30        851.38154

Root MSE          2.31695   R-Square   0.8487
Dependent Mean   47.37581   Adj R-Sq   0.8108
Coeff Var         4.89057

NOTE: Model is not full rank. Least-squares solutions for the parameters are
      not unique. Some statistics will be misleading. A reported DF of 0 or B
      means that the estimate is biased.

NOTE: The following parameters have been set to 0, since the variables are a
      linear combination of other variables as shown.

      Dif = RunPulse - RestPulse

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            102.93448         12.40326      8.30     <.0001
RunTime      1             -2.62865          0.38456     -6.84     <.0001
Age          1             -0.22697          0.09984     -2.27     0.0322
Weight       1             -0.07418          0.05459     -1.36     0.1869
RunPulse     B             -0.36963          0.11985     -3.08     0.0051
MaxPulse     1              0.30322          0.13650      2.22     0.0360
RestPulse    B             -0.02153          0.06605     -0.33     0.7473
Dif          0              0                      .         .          .
PROC REG produces a message informing you that the model is less than full rank. Parameters with DF=0 are not estimated, and parameters with DF=B are biased. In addition, the form of the linear dependency among the regressors is displayed.
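The way PROC REG picks one solution can be mimicked outside of SAS: zero the coefficient of any column that is a linear combination of the columns before it, then solve least squares on the remaining columns. The following NumPy sketch (illustrative code; the function name and tolerance are invented) does exactly that.

```python
import numpy as np

def reg_style_solution(X, y, tol=1e-8):
    """Keep each column only if it is linearly independent of the
    columns already kept; dependent columns get a zero coefficient,
    mimicking PROC REG's choice when the model is not full rank."""
    X = np.asarray(X, float)
    keep = []
    for j in range(X.shape[1]):
        cols = keep + [j]
        if np.linalg.matrix_rank(X[:, cols], tol=tol) == len(cols):
            keep.append(j)
    b = np.zeros(X.shape[1])
    b[keep] = np.linalg.lstsq(X[:, keep], np.asarray(y, float), rcond=None)[0]
    return b
```

All least-squares solutions give the same fitted values, so this zeroed solution fits exactly as well as the minimum-norm solution from a generalized inverse.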
When a regressor is nearly a linear combination of other regressors in the model, the affected estimates are unstable and have high standard errors. This problem is called collinearity or multicollinearity . It is a good idea to find out which variables are nearly collinear with which other variables. The approach in PROC REG follows that of Belsley, Kuh, and Welsch (1980). PROC REG provides several methods for detecting collinearity with the COLLIN, COLLINOINT, TOL, and VIF options.
The COLLIN option in the MODEL statement requests that a collinearity analysis be performed. First, X′X is scaled to have 1s on the diagonal. If you specify the COLLINOINT option, the intercept variable is adjusted out first. Then the eigenvalues and eigenvectors are extracted. The analysis in PROC REG is reported with eigenvalues of X′X rather than singular values of X. The eigenvalues of X′X are the squares of the singular values of X.
The condition indices are the square roots of the ratio of the largest eigenvalue to each individual eigenvalue. The largest condition index is the condition number of the scaled X matrix. Belsley, Kuh, and Welsch (1980) suggest that, when this number is around 10, weak dependencies may be starting to affect the regression estimates. When this number is larger than 100, the estimates may have a fair amount of numerical error (although the statistical standard error almost always is much greater than the numerical error).
For each variable, PROC REG produces the proportion of the variance of the estimate accounted for by each principal component. A collinearity problem occurs when a component associated with a high condition index contributes strongly (variance proportion greater than about 0.5) to the variance of two or more variables.
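The scaling, eigenanalysis, and variance proportions described above can be sketched in a few lines of NumPy (illustrative code following the Belsley, Kuh, and Welsch construction; the function name is invented):

```python
import numpy as np

def collin_diagnostics(X):
    """COLLIN-style analysis: scale X'X to unit diagonal, take its
    eigendecomposition, and report condition indices plus the
    proportion of each coefficient's variance associated with each
    eigenvalue."""
    S = X.T @ X
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)                       # unit-diagonal scaling
    lam, V = np.linalg.eigh(R)
    lam, V = lam[::-1], V[:, ::-1]               # largest eigenvalue first
    cond_index = np.sqrt(lam[0] / lam)
    phi = V ** 2 / lam                           # var-decomposition terms
    prop = phi / phi.sum(axis=1, keepdims=True)  # rows: variables; cols: components
    return lam, cond_index, prop
```

Each row of the proportion matrix sums to 1, and the first condition index is always exactly 1.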
The VIF option in the MODEL statement provides the Variance Inflation Factors (VIF). These factors measure the inflation in the variances of the parameter estimates due to collinearities that exist among the regressor (independent) variables. There are no formal criteria for deciding if a VIF is large enough to affect the predicted values.
The TOL option requests the tolerance values for the parameter estimates. The tolerance is defined as 1/VIF.
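Equivalently, the tolerance for a variable is 1 minus the R² from regressing that variable on all the other regressors. The following NumPy sketch (illustrative code; the function name is invented) computes tolerance and VIF that way:

```python
import numpy as np

def tolerance_and_vif(X):
    """Tolerance = 1 - R^2 from regressing each column of X on all
    the other columns (with an intercept); VIF = 1 / tolerance."""
    X = np.asarray(X, float)
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        xj = X[:, j]
        fit = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        sse = np.sum((xj - fit) ** 2)
        sst = np.sum((xj - xj.mean()) ** 2)
        tol[j] = sse / sst                       # = 1 - R^2
    return tol, 1.0 / tol
```

For mutually orthogonal, centered columns the tolerances are exactly 1; a column that nearly duplicates another produces a large VIF.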
For a complete discussion of the preceding methods, refer to Belsley, Kuh, and Welsch (1980). For a more detailed explanation of using the methods with PROC REG, refer to Freund and Littell (1986).
This example uses the COLLIN option on the fitness data found in Example 61.1 on page 3924. The following statements produce Figure 61.42.
proc reg data=fitness;
   model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
         / tol vif collin;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              6        722.54361     120.42393     22.43   <.0001
Error             24        128.83794       5.36825
Corrected Total   30        851.38154

Root MSE          2.31695   R-Square   0.8487
Dependent Mean   47.37581   Adj R-Sq   0.8108
Coeff Var         4.89057

Parameter Estimates

                 Parameter   Standard                                Variance
Variable    DF   Estimate    Error      t Value   Pr > |t|  Tolerance  Inflation
Intercept    1   102.93448   12.40326     8.30    <.0001    .          0
RunTime      1    -2.62865    0.38456    -6.84    <.0001    0.62859    1.59087
Age          1    -0.22697    0.09984    -2.27    0.0322    0.66101    1.51284
Weight       1    -0.07418    0.05459    -1.36    0.1869    0.86555    1.15533
RunPulse     1    -0.36963    0.11985    -3.08    0.0051    0.11852    8.43727
MaxPulse     1     0.30322    0.13650     2.22    0.0360    0.11437    8.74385
RestPulse    1    -0.02153    0.06605    -0.33    0.7473    0.70642    1.41559

Collinearity Diagnostics

                      Condition   ---------------------- Proportion of Variation ----------------------
Number  Eigenvalue       Index    Intercept    RunTime     Age         Weight      RunPulse    MaxPulse    RestPulse
1        6.94991        1.00000   0.00002326   0.00021086  0.00015451  0.00019651  0.00000862  0.00000634  0.00027850
2        0.01868       19.29087   0.00218      0.02522     0.14632     0.01042     0.00000244  0.00000743  0.39064
3        0.01503       21.50072   0.00061541   0.12858     0.15013     0.23571     0.00119     0.00125     0.02809
4        0.00911       27.62115   0.00638      0.60897     0.03186     0.18313     0.00149     0.00123     0.19030
5        0.00607       33.82918   0.00133      0.12501     0.11284     0.44442     0.01506     0.00833     0.36475
6        0.00102       82.63757   0.79966      0.09746     0.49660     0.10330     0.06948     0.00561     0.02026
7        0.00017947   196.78560   0.18981      0.01455     0.06210     0.02283     0.91277     0.98357     0.00568
This section gathers the formulas for the statistics available in the MODEL, PLOT, and OUTPUT statements. The model to be fit is Y = Xβ + ε, and the parameter estimate is denoted by b = (X′X)⁻X′Y. The subscript i denotes values for the ith observation, the parenthetical subscript (i) means that the statistic is computed using all observations except the ith observation, and the subscript jj indicates the jth diagonal matrix entry. The ALPHA= option in the PROC REG or MODEL statement is used to set the α value for the t statistics.
Table 61.6 contains the summary statistics for assessing the fit of the model.
Table 61.7 contains the diagnostic statistics and their formulas; these formulas and further information can be found in Chapter 2, Introduction to Regression Procedures, and in the Influence Diagnostics section on page 3898. Each statistic is computed for each observation.
MODEL Option or Statistic | Formula |
---|---|
PRED (Yhat_i) | X_i b |
RES (r_i) | Y_i - Yhat_i |
H (h_i) | x_i (X′X)⁻¹ x_i′ |
STDP | sqrt(h_i s²) |
STDI | sqrt((1 + h_i) s²) |
STDR | sqrt((1 - h_i) s²) |
LCL | Yhat_i - t(α/2) STDI |
LCLM | Yhat_i - t(α/2) STDP |
UCL | Yhat_i + t(α/2) STDI |
UCLM | Yhat_i + t(α/2) STDP |
STUDENT | r_i / STDR_i |
RSTUDENT | r_i / (s(i) sqrt(1 - h_i)) |
COOKD | (1/p) STUDENT² (STDP² / STDR²) |
COVRATIO | det(s(i)² (X(i)′X(i))⁻¹) / det(s² (X′X)⁻¹) |
DFFITS | (Yhat_i - Yhat(i)) / (s(i) sqrt(h_i)) |
DFBETAS_j | (b_j - b(i)_j) / (s(i) sqrt([(X′X)⁻¹]_jj)) |
PRESS (predr_i) | r_i / (1 - h_i) |
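Most of the single-fit quantities in the table can be computed together in a few lines of NumPy (an illustrative sketch, not PROC REG's implementation; the function name is invented):

```python
import numpy as np

def reg_diagnostics(X, y):
    """Hat diagonals, studentized residuals, Cook's D, and PRESS
    residuals r_i / (1 - h_i), all from a single least-squares fit."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # hat diagonals
    s2 = r @ r / (n - p)                       # mean squared error
    stdp = np.sqrt(s2 * h)                     # std error of mean prediction
    stdr = np.sqrt(s2 * (1 - h))               # std error of residual
    student = r / stdr
    cookd = student ** 2 * stdp ** 2 / (stdr ** 2 * p)
    press = r / (1 - h)                        # PRESS residuals
    return dict(b=b, h=h, student=student, cookd=cookd, press=press)
```

The PRESS residual r_i / (1 - h_i) equals the true leave-one-out prediction error, which gives a convenient correctness check.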
This section discusses the INFLUENCE option, which produces several influence statistics, and the PARTIAL option, which produces partial regression leverage plots.
The INFLUENCE option (in the MODEL statement) requests the statistics proposed by Belsley, Kuh, and Welsch (1980) to measure the influence of each observation on the estimates. Influential observations are those that, according to various criteria, appear to have a large influence on the parameter estimates.
Let b(i) be the parameter estimates after deleting the ith observation; let s(i)² be the variance estimate after deleting the ith observation; let X(i) be the X matrix without the ith observation; let Yhat(i) be the ith value predicted without using the ith observation; let r_i = Y_i - Yhat_i be the ith residual; and let h_i be the ith diagonal of the projection matrix for the predictor space, also called the hat matrix:

   h_i = x_i (X′X)⁻¹ x_i′
Belsley, Kuh, and Welsch propose a cutoff of 2 p/n , where n is the number of observations used to fit the model and p is the number of parameters in the model. Observations with h i values above this cutoff should be investigated.
For each observation, PROC REG first displays the residual, the studentized residual (RSTUDENT), and the h_i. The studentized residual RSTUDENT differs slightly from STUDENT since the error variance is estimated by s(i)², computed without the ith observation, not by s². For example,

   RSTUDENT = r_i / (s(i) sqrt(1 - h_i))
Observations with RSTUDENT larger than 2 in absolute value may need some attention.
The COVRATIO statistic measures the change in the determinant of the covariance matrix of the estimates by deleting the ith observation:

   COVRATIO = det(s(i)² (X(i)′X(i))⁻¹) / det(s² (X′X)⁻¹)
Belsley, Kuh, and Welsch suggest that observations with

   |COVRATIO - 1| >= 3p/n

where p is the number of parameters in the model and n is the number of observations used to fit the model, are worth investigation.
The DFFITS statistic is a scaled measure of the change in the predicted value for the ith observation and is calculated by deleting the ith observation:

   DFFITS = (Yhat_i - Yhat(i)) / (s(i) sqrt(h_i))

A large value indicates that the observation is very influential in its neighborhood of the X space. Large values of DFFITS indicate influential observations. A general cutoff to consider is 2; a size-adjusted cutoff recommended by Belsley, Kuh, and Welsch is 2 sqrt(p/n), where n and p are as defined previously.
The DFFITS statistic is very similar to Cook's D, defined in the section Predicted and Residual Values on page 3879.
The DFBETAS statistics are the scaled measures of the change in each parameter estimate and are calculated by deleting the ith observation:

   DFBETAS_j = (b_j - b(i)_j) / (s(i) sqrt([(X′X)⁻¹]_jj))

where [(X′X)⁻¹]_jj is the (j,j)th element of (X′X)⁻¹.
In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations and 2/sqrt(n) as a size-adjusted cutoff.
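The deletion statistics above can be computed directly from their definitions by actually refitting without each observation; a NumPy sketch follows (illustrative code, deliberately using brute-force deletion rather than the updating formulas; the function name is invented):

```python
import numpy as np

def influence_stats(X, y):
    """RSTUDENT, DFFITS, DFBETAS, and COVRATIO computed by deleting
    each observation in turn, exactly as the definitions read."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)     # hat diagonals
    b = XtX_inv @ X.T @ y
    s2 = np.sum((y - X @ b) ** 2) / (n - p)
    out = {'rstudent': np.empty(n), 'dffits': np.empty(n),
           'covratio': np.empty(n), 'dfbetas': np.empty((n, p))}
    for i in range(n):
        Xi, yi = np.delete(X, i, 0), np.delete(y, i, 0)
        XtXi_inv = np.linalg.inv(Xi.T @ Xi)
        bi = XtXi_inv @ Xi.T @ yi                   # b(i)
        si = np.sqrt(np.sum((yi - Xi @ bi) ** 2) / (n - 1 - p))  # s(i)
        out['rstudent'][i] = (y[i] - X[i] @ b) / (si * np.sqrt(1 - h[i]))
        out['dffits'][i] = (X[i] @ b - X[i] @ bi) / (si * np.sqrt(h[i]))
        out['covratio'][i] = (np.linalg.det(si ** 2 * XtXi_inv)
                              / np.linalg.det(s2 * XtX_inv))
        out['dfbetas'][i] = (b - bi) / (si * np.sqrt(np.diag(XtX_inv)))
    return out
```

A useful identity connects two of these statistics: DFFITS_i = RSTUDENT_i sqrt(h_i / (1 - h_i)).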
Figure 61.43 shows the tables produced by the INFLUENCE option for the population example (the section Polynomial Regression beginning on page 3804). See Figure 61.30 for the fitted regression equation.
proc reg data=USPopulation;
   model Population=Year YearSq / influence;
run;
In Figure 61.43, observations 16, 17, and 19 exceed the cutoff value of 2 for RSTUDENT. None of the observations exceeds the general cutoff of 2 for DFFITS or the DFBETAS, but observations 16, 17, and 19 exceed at least one of the size-adjusted cutoffs for these statistics. Observations 1 and 19 exceed the cutoff for the hat diagonals, and observations 1, 2, 16, 17, and 18 exceed the cutoffs for COVRATIO. Taken together, these statistics indicate that you should look first at observations 16, 17, and 19 and then perhaps investigate the other observations that exceeded a cutoff.
The PARTIAL option in the MODEL statement produces partial regression leverage plots. If the experimental ODS graphics are not in effect, this option requires the use of the LINEPRINTER option in the PROC REG statement. One plot is created for each regressor in the current full model. For example, plots are produced for regressors included by using ADD statements; plots are not produced for interim models in the various model-selection methods but only for the full model. If you use a model-selection method and the final model contains only a subset of the original regressors, the PARTIAL option still produces plots for all regressors in the full model. If the experimental ODS graphics are in effect, these plots are produced as high-resolution graphics, in panels with a maximum of six partial regression leverage plots per panel. Multiple panels are displayed for models with more than six regressors.
For a given regressor, the partial regression leverage plot is the plot of the dependent variable and the regressor after they have been made orthogonal to the other regressors in the model. These can be obtained by plotting the residuals for the dependent variable against the residuals for the selected regressor, where the residuals for the dependent variable are calculated with the selected regressor omitted, and the residuals for the selected regressor are calculated from a model where the selected regressor is regressed on the remaining regressors. A line fit to the points has a slope equal to the parameter estimate in the full model.
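The construction just described can be verified numerically. This Python sketch (hypothetical data) residualizes the response and one regressor on the remaining regressors and confirms that the slope through the resulting points equals that regressor's coefficient in the full model:

```python
# Partial regression leverage construction, illustrated with NumPy.
# Hypothetical data; the slope equality is the Frisch-Waugh-Lovell result.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2*x1 - 3*x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# residualize y and x1 on the other regressors (intercept and x2)
X_other = np.column_stack([np.ones(n), x2])
ry = y  - X_other @ np.linalg.lstsq(X_other, y,  rcond=None)[0]
rx = x1 - X_other @ np.linalg.lstsq(X_other, x1, rcond=None)[0]

slope = (rx @ ry) / (rx @ rx)   # slope of the partial regression plot
```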
When the experimental ODS graphics are not in effect, points in the plot are marked by the number of replicates appearing at one position. The symbol * is used if there are ten or more replicates. If an ID statement is specified, the left-most nonblank character in the value of the ID variable is used as the plotting symbol.
The following statements use the fitness data in Example 61.1 on page 3924 with the PARTIAL option and the ODS GRAPHICS statement to produce the partial regression leverage plots. The plots are shown in Figure 61.44. For general information about ODS graphics, see Chapter 15, Statistical Graphics Using ODS. For specific information about the graphics available in the REG procedure, see the ODS Graphics section on page 3922.
ods html;
ods graphics on;

proc reg data=fitness;
   model Oxygen=RunTime Weight Age / partial;
run;

ods graphics off;
ods html close;
The following statements create a similar panel of partial regression plots using the OUTPUT data set and the GPLOT procedure. Four plots (created by regressing Oxygen and one of the variables on the remaining variables) are displayed in Figure 61.45. Notice that the Int variable is explicitly added to be used as the intercept term.
data fitness2;
   set fitness;
   Int=1;
run;

proc reg data=fitness2 noprint;
   model Oxygen Int = RunTime Weight Age / noint;
   output out=temp r=ry rx;
run;

symbol1 c=blue;
proc gplot data=temp;
   plot ry*rx / cframe=ligr;
   label ry='Oxygen' rx='Intercept';
run;
Reweighting observations is an interactive feature of PROC REG that enables you to change the weights of observations used in computing the regression equation. Observations can also be deleted from the analysis (not from the data set) by changing their weights to zero. The Class data (in the Getting Started section on page 3800) are used to illustrate some of the features of the REWEIGHT statement. First, the full model is fit, and the residuals are displayed in Figure 61.46.
proc reg data=Class;
   model Weight=Age Height / p;
   id Name;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Output Statistics

                     Dependent   Predicted
 Obs  Name            Variable       Value    Residual
   1  Alfred          112.5000    124.8686    -12.3686
   2  Alice            84.0000     78.6273      5.3727
   3  Barbara          98.0000    110.2812    -12.2812
   4  Carol           102.5000    102.5670     -0.0670
   5  Henry           102.5000    105.0849     -2.5849
   6  James            83.0000     80.2266      2.7734
   7  Jane             84.5000     89.2191     -4.7191
   8  Janet           112.5000    102.7663      9.7337
   9  Jeffrey          84.0000    100.2095    -16.2095
  10  John             99.5000     86.3415     13.1585
  11  Joyce            50.5000     57.3660     -6.8660
  12  Judy             90.0000    107.9625    -17.9625
  13  Louise           77.0000     76.6295      0.3705
  14  Mary            112.0000    117.1544     -5.1544
  15  Philip          150.0000    138.2164     11.7836
  16  Robert          128.0000    107.2043     20.7957
  17  Ronald          133.0000    118.9529     14.0471
  18  Thomas           85.0000     79.6676      5.3324
  19  William         112.0000    117.1544     -5.1544

Sum of Residuals                          0
Sum of Squared Residuals         2120.09974
Predicted Residual SS (PRESS)    3272.72186
Upon examining the data and residuals, you realize that observation 17 (Ronald) was mistakenly included in the analysis. Also, you would like to examine the effect of reweighting to 0.5 those observations with residuals that have absolute values greater than or equal to 17.
reweight obs.=17;
reweight r. le -17 or r. ge 17 / weight=0.5;
print p;
run;
At this point, a message (on the log) appears that tells you which observations have been reweighted and what the new weights are. Figure 61.47 is produced.
The REG Procedure
Model: MODEL1.2
Dependent Variable: Weight

Output Statistics

                       Weight   Dependent   Predicted
 Obs  Name           Variable    Variable       Value    Residual
   1  Alfred           1.0000    112.5000    121.6250     -9.1250
   2  Alice            1.0000     84.0000     79.9296      4.0704
   3  Barbara          1.0000     98.0000    107.5484     -9.5484
   4  Carol            1.0000    102.5000    102.1663      0.3337
   5  Henry            1.0000    102.5000    104.3632     -1.8632
   6  James            1.0000     83.0000     79.9762      3.0238
   7  Jane             1.0000     84.5000     87.8225     -3.3225
   8  Janet            1.0000    112.5000    103.6889      8.8111
   9  Jeffrey          1.0000     84.0000     98.7606    -14.7606
  10  John             1.0000     99.5000     85.3117     14.1883
  11  Joyce            1.0000     50.5000     58.6811     -8.1811
  12  Judy             0.5000     90.0000    106.8740    -16.8740
  13  Louise           1.0000     77.0000     76.8377      0.1623
  14  Mary             1.0000    112.0000    116.2429     -4.2429
  15  Philip           1.0000    150.0000    135.9688     14.0312
  16  Robert           0.5000    128.0000    103.5150     24.4850
  17  Ronald                0    133.0000    117.8121     15.1879
  18  Thomas           1.0000     85.0000     78.1398      6.8602
  19  William          1.0000    112.0000    116.2429     -4.2429

Sum of Residuals                          0
Sum of Squared Residuals         1500.61194
Predicted Residual SS (PRESS)    2287.57621

NOTE: The above statistics use observation weights or frequencies.
The first REWEIGHT statement excludes observation 17, and the second REWEIGHT statement reweights observations 12 and 16 to 0.5. An important feature to note from this example is that the model is not refit until after the PRINT statement. REWEIGHT statements do not cause the model to be refit. This is so that multiple REWEIGHT statements can be applied to a subsequent model.
In this example, since the intent is to reweight observations with large residuals, the observation that was mistakenly included in the analysis should be deleted; then, the model should be fit for those remaining observations, and the observations with large residuals should be reweighted. To accomplish this, use the REFIT statement. Note that the model label has been changed from MODEL1 to MODEL1.2 as two REWEIGHT statements have been used. These statements produce Figure 61.48:
reweight allobs / weight=1.0;
reweight obs.=17;
refit;
reweight r. le -17 or r. ge 17 / weight=.5;
print;
run;
The REG Procedure
Model: MODEL1.5
Dependent Variable: Weight

Output Statistics

                       Weight   Dependent   Predicted
 Obs  Name           Variable    Variable       Value    Residual
   1  Alfred           1.0000    112.5000    120.9716     -8.4716
   2  Alice            1.0000     84.0000     79.5342      4.4658
   3  Barbara          1.0000     98.0000    107.0746     -9.0746
   4  Carol            1.0000    102.5000    101.5681      0.9319
   5  Henry            1.0000    102.5000    103.7588     -1.2588
   6  James            1.0000     83.0000     79.7204      3.2796
   7  Jane             1.0000     84.5000     87.5443     -3.0443
   8  Janet            1.0000    112.5000    102.9467      9.5533
   9  Jeffrey          1.0000     84.0000     98.3117    -14.3117
  10  John             1.0000     99.5000     85.0407     14.4593
  11  Joyce            1.0000     50.5000     58.6253     -8.1253
  12  Judy             1.0000     90.0000    106.2625    -16.2625
  13  Louise           1.0000     77.0000     76.5908      0.4092
  14  Mary             1.0000    112.0000    115.4651     -3.4651
  15  Philip           1.0000    150.0000    134.9953     15.0047
  16  Robert           0.5000    128.0000    103.1923     24.8077
  17  Ronald                0    133.0000    117.0299     15.9701
  18  Thomas           1.0000     85.0000     78.0288      6.9712
  19  William          1.0000    112.0000    115.4651     -3.4651

Sum of Residuals                          0
Sum of Squared Residuals         1637.81879
Predicted Residual SS (PRESS)    2473.87984

NOTE: The above statistics use observation weights or frequencies.
Notice that this results in a slightly different model than the previous set of statements: only observation 16 is reweighted to 0.5. Also note that the model label is now MODEL1.5 since five REWEIGHT statements have been used for this model.
Another important feature of the REWEIGHT statement is the ability to nullify the effect of a previous or all REWEIGHT statements. First, assume that you have several REWEIGHT statements in effect and you want to restore the original weights of all the observations. The following REWEIGHT statement accomplishes this and produces Figure 61.49:
reweight allobs / reset;
print;
run;
The REG Procedure
Model: MODEL1.6
Dependent Variable: Weight

Output Statistics

                     Dependent   Predicted
 Obs  Name            Variable       Value    Residual
   1  Alfred          112.5000    124.8686    -12.3686
   2  Alice            84.0000     78.6273      5.3727
   3  Barbara          98.0000    110.2812    -12.2812
   4  Carol           102.5000    102.5670     -0.0670
   5  Henry           102.5000    105.0849     -2.5849
   6  James            83.0000     80.2266      2.7734
   7  Jane             84.5000     89.2191     -4.7191
   8  Janet           112.5000    102.7663      9.7337
   9  Jeffrey          84.0000    100.2095    -16.2095
  10  John             99.5000     86.3415     13.1585
  11  Joyce            50.5000     57.3660     -6.8660
  12  Judy             90.0000    107.9625    -17.9625
  13  Louise           77.0000     76.6295      0.3705
  14  Mary            112.0000    117.1544     -5.1544
  15  Philip          150.0000    138.2164     11.7836
  16  Robert          128.0000    107.2043     20.7957
  17  Ronald          133.0000    118.9529     14.0471
  18  Thomas           85.0000     79.6676      5.3324
  19  William         112.0000    117.1544     -5.1544

Sum of Residuals                          0
Sum of Squared Residuals         2120.09974
Predicted Residual SS (PRESS)    3272.72186
The resulting model is identical to the original model specified at the beginning of this section. Notice that the model label is now MODEL1.6. Note that the Weight column does not appear, since all observations have been reweighted to have weight=1.
Now suppose you want only to undo the changes made by the most recent REWEIGHT statement. Use REWEIGHT UNDO for this. The following statements produce Figure 61.50:
reweight r. le -12 or r. ge 12 / weight=.75;
reweight r. le -17 or r. ge 17 / weight=.5;
reweight undo;
print;
run;
The REG Procedure
Model: MODEL1.9
Dependent Variable: Weight

Output Statistics

                       Weight   Dependent   Predicted
 Obs  Name           Variable    Variable       Value    Residual
   1  Alfred           0.7500    112.5000    125.1152    -12.6152
   2  Alice            1.0000     84.0000     78.7691      5.2309
   3  Barbara          0.7500     98.0000    110.3236    -12.3236
   4  Carol            1.0000    102.5000    102.8836     -0.3836
   5  Henry            1.0000    102.5000    105.3936     -2.8936
   6  James            1.0000     83.0000     80.1133      2.8867
   7  Jane             1.0000     84.5000     89.0776     -4.5776
   8  Janet            1.0000    112.5000    103.3322      9.1678
   9  Jeffrey          0.7500     84.0000    100.2835    -16.2835
  10  John             0.7500     99.5000     86.2090     13.2910
  11  Joyce            1.0000     50.5000     57.0745     -6.5745
  12  Judy             0.7500     90.0000    108.2622    -18.2622
  13  Louise           1.0000     77.0000     76.5275      0.4725
  14  Mary             1.0000    112.0000    117.6752     -5.6752
  15  Philip           1.0000    150.0000    138.9211     11.0789
  16  Robert           0.7500    128.0000    107.0063     20.9937
  17  Ronald           0.7500    133.0000    119.4681     13.5319
  18  Thomas           1.0000     85.0000     79.3061      5.6939
  19  William          1.0000    112.0000    117.6752     -5.6752

Sum of Residuals                          0
Sum of Squared Residuals         1694.87114
Predicted Residual SS (PRESS)    2547.22751

NOTE: The above statistics use observation weights or frequencies.
The resulting model reflects changes made only by the first REWEIGHT statement since the third REWEIGHT statement negates the effect of the second REWEIGHT statement. Observations 1, 3, 9, 10, 12, 16, and 17 have their weights changed to 0.75. Note that the label MODEL1.9 reflects the use of nine REWEIGHT statements for the current model.
Now suppose you want to reset the observations selected by the most recent REWEIGHT statement to their original weights. Use the REWEIGHT statement with the RESET option to do this. The following statements produce Figure 61.51:
reweight r. le -12 or r. ge 12 / weight=.75;
reweight r. le -17 or r. ge 17 / weight=.5;
reweight / reset;
print;
run;
The REG Procedure
Model: MODEL1.12
Dependent Variable: Weight

Output Statistics

                       Weight   Dependent   Predicted
 Obs  Name           Variable    Variable       Value    Residual
   1  Alfred           0.7500    112.5000    126.0076    -13.5076
   2  Alice            1.0000     84.0000     77.8727      6.1273
   3  Barbara          0.7500     98.0000    111.2805    -13.2805
   4  Carol            1.0000    102.5000    102.4703      0.0297
   5  Henry            1.0000    102.5000    105.1278     -2.6278
   6  James            1.0000     83.0000     80.2290      2.7710
   7  Jane             1.0000     84.5000     89.7199     -5.2199
   8  Janet            1.0000    112.5000    102.0122     10.4878
   9  Jeffrey          0.7500     84.0000    100.6507    -16.6507
  10  John             0.7500     99.5000     86.6828     12.8172
  11  Joyce            1.0000     50.5000     56.7703     -6.2703
  12  Judy             1.0000     90.0000    108.1649    -18.1649
  13  Louise           1.0000     77.0000     76.4327      0.5673
  14  Mary             1.0000    112.0000    117.1975     -5.1975
  15  Philip           1.0000    150.0000    138.7581     11.2419
  16  Robert           1.0000    128.0000    108.7016     19.2984
  17  Ronald           0.7500    133.0000    119.0957     13.9043
  18  Thomas           1.0000     85.0000     80.3076      4.6924
  19  William          1.0000    112.0000    117.1975     -5.1975

Sum of Residuals                          0
Sum of Squared Residuals         1879.08980
Predicted Residual SS (PRESS)    2959.57279

NOTE: The above statistics use observation weights or frequencies.
Note that observations that meet the condition of the second REWEIGHT statement (residuals with an absolute value greater than or equal to 17) now have weights reset to their original value of 1. Observations 1, 3, 9, 10, and 17 have weights of 0.75, but observations 12 and 16 (which meet the condition of the second REWEIGHT statement) have their weights reset to 1.
Notice how the last three examples show three ways to change weights back to a previous value. In the first example, ALLOBS and the RESET option are used to change weights for all observations back to their original values. In the second example, the UNDO option is used to negate the effect of a previous REWEIGHT statement, thus changing weights for observations selected in the previous REWEIGHT statement to the weights specified in still another REWEIGHT statement. In the third example, the RESET option is used to change weights for observations selected in a previous REWEIGHT statement back to their original values. Finally, note that the label MODEL1.12 indicates that twelve REWEIGHT statements have been applied to the original model.
The regression model is specified as y_i = x_i'β + ε_i, where the ε_i's are identically and independently distributed: E(ε) = 0 and E(εε') = σ²I. If the ε_i's are not independent or their variances are not constant, the parameter estimates are unbiased, but the estimate of the covariance matrix is inconsistent. In the case of heteroscedasticity, the ACOV option provides a consistent estimate of the covariance matrix. If the regression data are from a simple random sample, the ACOV option produces the covariance matrix. This matrix is

   (X'X)⁻¹ (X' diag(e_i²) X) (X'X)⁻¹

where e_i = y_i − x_i b is the ith residual.
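As an illustration of the sandwich form above (an illustration of the formula, not PROC REG's implementation), the following Python sketch computes the heteroscedasticity-consistent covariance estimate on hypothetical data:

```python
# White/HC0-style sandwich covariance estimate, illustrated with NumPy.
# Hypothetical data with deliberately heteroscedastic errors.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1 + 0.5*x + rng.normal(size=n) * (1 + np.abs(x))   # non-constant variance

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
bread = np.linalg.inv(X.T @ X)
meat = X.T @ (e[:, None]**2 * X)        # X' diag(e_i^2) X
acov = bread @ meat @ bread             # sandwich covariance estimate
se_robust = np.sqrt(np.diag(acov))
```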
The SPEC option performs a model specification test. The null hypothesis for this test maintains that the errors are homoscedastic, independent of the regressors, and that several technical assumptions about the model specification are valid. For details, see theorem 2 and assumptions 1-7 of White (1980). When the model is correctly specified and the errors are independent of the regressors, the rejection of this null hypothesis is evidence of heteroscedasticity. In implementing this test, an estimator of the average covariance matrix (White 1980, p. 822) is constructed and inverted. The nonsingularity of this matrix is one of the assumptions in the null hypothesis about the model specification. When PROC REG determines this matrix to be numerically singular, a generalized inverse is used and a note to this effect is written to the log. In such cases, care should be taken in interpreting the results of this test.
When you specify the SPEC option, tests listed in the TEST statement are performed with both the usual covariance matrix and the heteroscedasticity-consistent covariance matrix. Tests performed with the consistent covariance matrix are asymptotic. For more information, refer to White (1980).
Both the ACOV and SPEC options can be specified in a MODEL or PRINT statement.
The MTEST statement described in the MTEST Statement section on page 3832 can test hypotheses involving several dependent variables in the form

   (Lβ − cj)M = 0

where L is a linear function on the regressor side, β is a matrix of parameters, c is a column vector of constants, j is a row vector of ones, and M is a linear function on the dependent side. The special case where the constants are zero is

   LβM = 0

To test this hypothesis, PROC REG constructs two matrices called H and E that correspond to the numerator and denominator of a univariate F test:

   H = M'(Lb − cj)'(L(X'X)⁻L')⁻¹(Lb − cj)M
   E = M'(Y'Y − b'X'Y)M

where b is the matrix of parameter estimates and (X'X)⁻ is a generalized inverse of X'X.
These matrices are displayed for each MTEST statement if the PRINT option is specified.
Four test statistics based on the eigenvalues of E⁻¹H or (E + H)⁻¹H are formed. These are Wilks' lambda, Pillai's trace, the Hotelling-Lawley trace, and Roy's maximum root. These test statistics are discussed in Chapter 2, Introduction to Regression Procedures.
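These four statistics are simple functions of the eigenvalues λ_i of E⁻¹H: Wilks' lambda is the product of 1/(1 + λ_i), Pillai's trace is the sum of λ_i/(1 + λ_i), the Hotelling-Lawley trace is the sum of the λ_i, and Roy's root is the largest λ_i. The following Python sketch illustrates this, using the 2×2 E and H matrices from the drugshow test in this section:

```python
# The four multivariate statistics from the eigenvalues of inv(E) @ H.
# H and E are the hypothesis and error matrices printed for the drugshow test.
import numpy as np

H = np.array([[301.0, 97.5], [97.5, 36.333333333]])
E = np.array([[94.5, 76.5], [76.5, 114.0]])

lam = np.linalg.eigvals(np.linalg.inv(E) @ H).real
lam = np.sort(lam)[::-1]                    # largest eigenvalue first

wilks  = np.prod(1.0 / (1.0 + lam))         # Wilks' lambda
pillai = np.sum(lam / (1.0 + lam))          # Pillai's trace
hotlaw = np.sum(lam)                        # Hotelling-Lawley trace
roy    = np.max(lam)                        # Roy's maximum root
```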
The following statements perform a multivariate analysis of variance and produce Figure 61.52 through Figure 61.56:
* Manova Data from Morrison (1976, 190);
data a;
   input sex $ drug $ @;
   do rep=1 to 4;
      input y1 y2 @;
      sexcode=(sex='m')-(sex='f');
      drug1=(drug='a')-(drug='c');
      drug2=(drug='b')-(drug='c');
      sexdrug1=sexcode*drug1;
      sexdrug2=sexcode*drug2;
      output;
   end;
   datalines;
m a 5 6 5 4 9 9 7 6
m b 7 6 7 7 9 12 6 8
m c 21 15 14 11 17 12 12 10
f a 7 10 6 6 9 7 8 10
f b 10 13 8 7 7 6 6 9
f c 16 12 14 9 14 8 10 5
;

proc reg;
   model y1 y2=sexcode drug1 drug2 sexdrug1 sexdrug2;
   y1y2drug: mtest y1=y2, drug1, drug2;
   drugshow: mtest drug1, drug2 / print canprint;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: y1

Analysis of Variance

                             Sum of        Mean
Source            DF        Squares      Square   F Value   Pr > F
Model              5      316.00000    63.20000     12.04   <.0001
Error             18       94.50000     5.25000
Corrected Total   23      410.50000

Root MSE          2.29129   R-Square   0.7698
Dependent Mean    9.75000   Adj R-Sq   0.7058
Coeff Var        23.50039

Parameter Estimates

                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1      9.75000    0.46771     20.85     <.0001
sexcode      1      0.16667    0.46771      0.36     0.7257
drug1        1     -2.75000    0.66144     -4.16     0.0006
drug2        1     -2.25000    0.66144     -3.40     0.0032
sexdrug1     1     -0.66667    0.66144     -1.01     0.3269
sexdrug2     1     -0.41667    0.66144     -0.63     0.5366
The REG Procedure
Model: MODEL1
Dependent Variable: y2

Analysis of Variance

                             Sum of        Mean
Source            DF        Squares      Square   F Value   Pr > F
Model              5       69.33333    13.86667      2.19   0.1008
Error             18      114.00000     6.33333
Corrected Total   23      183.33333

Root MSE          2.51661   R-Square   0.3782
Dependent Mean    8.66667   Adj R-Sq   0.2055
Coeff Var        29.03782

Parameter Estimates

                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1      8.66667    0.51370     16.87     <.0001
sexcode      1      0.16667    0.51370      0.32     0.7493
drug1        1     -1.41667    0.72648     -1.95     0.0669
drug2        1     -0.16667    0.72648     -0.23     0.8211
sexdrug1     1     -1.16667    0.72648     -1.61     0.1257
sexdrug2     1     -0.41667    0.72648     -0.57     0.5734
The REG Procedure
Model: MODEL1
Multivariate Test: y1y2drug

Multivariate Statistics and Exact F Statistics

S=1  M=0  N=8

Statistic                      Value   F Value   Num DF   Den DF   Pr > F
Wilks' Lambda             0.28053917     23.08        2       18   <.0001
Pillai's Trace            0.71946083     23.08        2       18   <.0001
Hotelling-Lawley Trace    2.56456456     23.08        2       18   <.0001
Roy's Greatest Root       2.56456456     23.08        2       18   <.0001
The REG Procedure
Model: MODEL1
Multivariate Test: drugshow

Error Matrix (E)
            94.5            76.5
            76.5             114

Hypothesis Matrix (H)
             301            97.5
            97.5    36.333333333

                     Adjusted   Approximate       Squared
        Canonical   Canonical      Standard     Canonical
      Correlation Correlation         Error   Correlation
   1     0.905903    0.899927      0.040101      0.820661
   2     0.244371           .      0.210254      0.059717

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)

      Eigenvalue   Difference   Proportion   Cumulative
   1      4.5760       4.5125       0.9863       0.9863
   2      0.0635                    0.0137       1.0000

Test of H0: The canonical correlations in the
current row and all that follow are zero

      Likelihood   Approximate
           Ratio       F Value   Num DF   Den DF   Pr > F
   1  0.16862952         12.20        4       34   <.0001
   2  0.94028273          1.14        1       18   0.2991
The REG Procedure
Model: MODEL1
Multivariate Test: drugshow

Multivariate Statistics and F Approximations

S=2  M=-0.5  N=7.5

Statistic                      Value   F Value   Num DF   Den DF   Pr > F
Wilks' Lambda             0.16862952     12.20        4       34   <.0001
Pillai's Trace            0.88037810      7.08        4       36   0.0003
Hotelling-Lawley Trace    4.63953666     19.40        4   19.407   <.0001
Roy's Greatest Root       4.57602675     41.18        2       18   <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
The four multivariate test statistics for the y1y2drug test are all highly significant, giving strong evidence that the coefficients of drug1 and drug2 are not the same across dependent variables y1 and y2.
The four multivariate test statistics for the drugshow test are also all highly significant, giving strong evidence that the coefficients of drug1 and drug2 are not zero for both dependent variables.
When regression is performed on time series data, the errors may not be independent. Often errors are autocorrelated; that is, each error is correlated with the error immediately before it. Autocorrelation is also a symptom of systematic lack of fit. The DW option provides the Durbin-Watson d statistic to test that the autocorrelation is zero:

   d = ( Σ_{i=2}^{n} (e_i − e_{i−1})² ) / ( Σ_{i=1}^{n} e_i² )
The value of d is close to 2 if the errors are uncorrelated. The distribution of d is reported by Durbin and Watson (1951). Tables of the distribution are found in most econometrics textbooks, such as Johnston (1972) and Pindyck and Rubinfeld (1981).
The sample autocorrelation estimate is displayed after the Durbin-Watson statistic. The sample autocorrelation is computed as

   r = ( Σ_{i=2}^{n} e_i e_{i−1} ) / ( Σ_{i=1}^{n} e_i² )
This autocorrelation of the residuals may not be a very good estimate of the autocorrelation of the true errors, especially if there are few observations and the independent variables have certain patterns. If there are missing observations in the regression, these measures are computed as though the missing observations did not exist.
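Both quantities are direct functions of the OLS residuals. The following Python sketch (hypothetical data) computes the Durbin-Watson d statistic and the first-order sample autocorrelation:

```python
# Durbin-Watson d and first-order residual autocorrelation, with NumPy.
# Hypothetical trend data; not PROC REG output.
import numpy as np

rng = np.random.default_rng(3)
n = 100
t = np.arange(n, dtype=float)
X = np.column_stack([np.ones(n), t])
y = 5 + 0.3*t + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

d  = np.sum(np.diff(e)**2) / np.sum(e**2)    # Durbin-Watson statistic
r1 = np.sum(e[1:] * e[:-1]) / np.sum(e**2)   # first-order autocorrelation
```

Note that d is approximately 2(1 − r), so uncorrelated errors give d near 2.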
Positive autocorrelation of the errors generally tends to make the estimate of the error variance too small, so confidence intervals are too narrow and true null hypotheses are rejected with a higher probability than the stated significance level. Negative autocorrelation of the errors generally tends to make the estimate of the error variance too large, so confidence intervals are too wide and the power of significance tests is reduced. With either positive or negative autocorrelation, least-squares parameter estimates are usually not as efficient as generalized least-squares parameter estimates. For more details, refer to Judge et al. (1985, Chapter 8) and the SAS/ETS User's Guide.
The following SAS statements request the DW option for the US population data (see Figure 61.57):
proc reg data=USPopulation;
   model Population=Year YearSq / dw;
run;
In ridge regression analysis, the crossproduct matrix for the independent variables is centered (the NOINT option is ignored if it is specified) and scaled to one on the diagonal elements. The ridge constant k (specified with the RIDGE= option) is then added to each diagonal element of the crossproduct matrix. The ridge regression estimates are the least-squares estimates obtained by using the new crossproduct matrix.
Let X be an n × p matrix of the independent variables after centering the data, and let Y be an n × 1 vector corresponding to the dependent variable. Let D be a p × p diagonal matrix with diagonal elements as in X'X. The ridge regression estimate corresponding to the ridge constant k can be computed as

   D^{-1/2} (Z'Z + kI_p)^{-1} Z'Y

where Z = XD^{-1/2} and I_p is a p × p identity matrix.
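The following Python sketch (hypothetical, near-collinear data) carries out the centering and scaling described above and computes the ridge estimate; with k = 0 it reduces to the ordinary least-squares solution:

```python
# Ridge estimator via the centered, unit-diagonal crossproducts matrix.
# Hypothetical data; illustrates the formula, not PROC REG's internals.
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 3
Xraw = rng.normal(size=(n, p))
Xraw[:, 2] = Xraw[:, 0] + 0.01 * rng.normal(size=n)   # near-collinear column
y = Xraw @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

X = Xraw - Xraw.mean(axis=0)             # center the regressors
Y = y - y.mean()
d = np.diag(X.T @ X)                     # diagonal of X'X
S = np.diag(1.0 / np.sqrt(d))            # D^{-1/2}
Z = X @ S                                # Z'Z has ones on the diagonal

k = 0.1                                  # ridge constant (RIDGE= option)
theta = np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ Y)
b_ridge = S @ theta                      # back to the original scale
```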
For IPC analysis, the smallest m eigenvalues of Z'Z (where m is specified with the PCOMIT= option) are omitted to form the estimates.
For information about ridge regression and IPC standardized parameter estimates, parameter estimate standard errors, and variance inflation factors, refer to Rawlings (1988), Neter, Wasserman, and Kutner (1990), and Marquardt and Snee (1975). Unlike Rawlings (1988), the REG procedure uses the mean squared errors of the submodels instead of the full model MSE to compute the standard errors of the parameter estimates.
If a normal probability-probability or quantile-quantile plot for the variable x is requested, the n nonmissing values of x are first ordered from smallest to largest:

   x(1) ≤ x(2) ≤ ··· ≤ x(n)
If a Q-Q plot is requested (with a PLOT statement of the form PLOT yvariable*NQQ.), the ith ordered value x(i) is represented by a point with y-coordinate x(i) and x-coordinate Φ⁻¹((i − 0.375)/(n + 0.25)), where Φ(·) is the standard normal distribution function.
If a P-P plot is requested (with a PLOT statement of the form PLOT yvariable*NPP.), the ith ordered value x(i) is represented by a point with y-coordinate Φ((x(i) − μ)/σ) and x-coordinate i/n, where μ is the mean of the nonmissing x-values and σ is the standard deviation. If an x-value has multiplicity k (that is, x(i) = ··· = x(i+k−1)), then only the point corresponding to x(i+k−1) is displayed.
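A small Python sketch of these plotting coordinates, using only the standard library. The Blom-type position (i − 0.375)/(n + 0.25) for the Q-Q plot, i/n for the P-P plot, and the use of the sample standard deviation are assumptions taken from the description above, not verified option defaults:

```python
# Q-Q and P-P plotting coordinates for a small hypothetical sample.
# Plotting-position constants are assumptions; see the lead-in text.
from statistics import NormalDist, mean, stdev

x = [3.1, 1.4, 4.7, 2.2, 5.9, 2.6, 3.3, 4.1]   # hypothetical data
xs = sorted(x)                                  # x(1) <= ... <= x(n)
n = len(xs)
norm = NormalDist()

# Q-Q plot: y-coordinate x(i), x-coordinate a standard normal quantile
qq = [(norm.inv_cdf((i - 0.375) / (n + 0.25)), xi)
      for i, xi in enumerate(xs, start=1)]

# P-P plot: y-coordinate Phi((x(i) - mu)/sigma), x-coordinate i/n
mu, sigma = mean(xs), stdev(xs)
pp = [(i / n, norm.cdf((xi - mu) / sigma)) for i, xi in enumerate(xs, start=1)]
```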
The REG procedure first composes a crossproducts matrix. The matrix can be calculated from input data, reformed from an input correlation matrix, or read in from an SSCP data set. For each model, the procedure selects the appropriate crossproducts from the main matrix. The normal equations formed from the crossproducts are solved using a sweep algorithm (Goodnight 1979). The method is accurate for data that are reasonably scaled and not too collinear.
The mechanism that PROC REG uses to check for singularity involves the diagonal (pivot) elements of X'X as it is being swept. If a pivot is less than SINGULAR*CSS, then a singularity is declared and the pivot is not swept (where CSS is the corrected sum of squares for the regressor and SINGULAR is machine dependent but is approximately 1E-7 on most machines or reset in the PROC statement).
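The sweep idea can be illustrated in a few lines of Python. This sketch (hypothetical data; a textbook sweep operator, not PROC REG's implementation) sweeps the regressor pivots of the bordered crossproducts matrix, after which the coefficient estimates sit in the border column and the residual sum of squares in the corner:

```python
# Sweep operator applied to the bordered SSCP matrix [[X'X, X'y],[y'X, y'y]].
# Hypothetical data; singularity check is a crude absolute threshold rather
# than PROC REG's SINGULAR*CSS test.
import numpy as np

def sweep(A, k):
    """Sweep the symmetric matrix A on pivot k; returns a new array."""
    B = A.copy()
    d = A[k, k]
    if abs(d) < 1e-12:
        raise ValueError("pivot is numerically zero")
    B[k, :] = A[k, :] / d
    for i in range(A.shape[0]):
        if i != k:
            B[i, :] = A[i, :] - A[i, k] * A[k, :] / d
            B[i, k] = -A[i, k] / d
    B[k, k] = 1.0 / d
    return B

rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

M = np.zeros((p + 1, p + 1))        # bordered crossproducts matrix
M[:p, :p] = X.T @ X
M[:p, p] = X.T @ y
M[p, :p] = X.T @ y
M[p, p] = y @ y

for k in range(p):                  # sweep the p regressor pivots
    M = sweep(M, k)

b_sweep = M[:p, p]                  # regression coefficients
sse = M[p, p]                       # residual sum of squares
```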
The sweep algorithm is also used in many places in the model-selection methods. The RSQUARE method uses the leaps and bounds algorithm by Furnival and Wilson (1974).
The REG procedure is efficient for ordinary regression; however, requests for optional features can greatly increase the amount of time required.
The major computational expense in the regression analysis is the collection of the crossproducts matrix. For p variables and n observations, the time required is proportional to np². For each model run, PROC REG needs time roughly proportional to k³, where k is the number of regressors in the model. Add an additional nk² for one of the R, CLM, or CLI options and another nk² for the INFLUENCE option.
Most of the memory that PROC REG needs to solve large problems is used for crossproducts matrices. PROC REG requires 4p² bytes for the main crossproducts matrix plus 4k² bytes for the largest model. If several output data sets are requested, memory is also needed for buffers.
See the Input Data Sets section on page 3860 for information on how to use TYPE=SSCP data sets to reduce computing time.
Many of the more specialized tables are described in detail in previous sections. Most of the formulas for the statistics are in Chapter 2, Introduction to Regression Procedures, while other formulas can be found in the section Model Fit and Diagnostic Statistics on page 3896 and the Influence Diagnostics section on page 3898.
The analysis-of-variance table includes
the Source of the variation, Model for the fitted regression, Error for the residual error, and C Total for the total variation after correcting for the mean. The Uncorrected Total Variation is produced when the NOINT option is used.
the degrees of freedom (DF) associated with the source
the Sum of Squares for the term
the Mean Square, the sum of squares divided by the degrees of freedom
the F Value for testing the hypothesis that all parameters are zero except for the intercept. This is formed by dividing the mean square for Model by the mean square for Error.
the Prob>F, the probability of getting a greater F statistic than that observed if the hypothesis is true. This is the significance probability.
Other statistics displayed include the following:
Root MSE is an estimate of the standard deviation of the error term. It is calculated as the square root of the mean square error.
Dep Mean is the sample mean of the dependent variable.
C.V. is the coefficient of variation, computed as 100 times Root MSE divided by Dep Mean. This expresses the variation in unitless values.
R-Square is a measure between 0 and 1 that indicates the portion of the (corrected) total variation that is attributed to the fit rather than left to residual error. It is calculated as SS(Model) divided by SS(Total). It is also called the coefficient of determination . It is the square of the multiple correlation; in other words, the square of the correlation between the dependent variable and the predicted values.
Adj R-Sq, the adjusted R², is a version of R² that has been adjusted for degrees of freedom. It is calculated as

   Adj R² = 1 − ( (n − i)(1 − R²) ) / (n − p)
where i is equal to 1 if there is an intercept and 0 otherwise; n is the number of observations used to fit the model; and p is the number of parameters in the model.
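These fit statistics are easy to compute directly. A Python sketch on hypothetical data, following the definitions above (Root MSE, R-Square, and the adjusted R-Square formula with an intercept, so i = 1):

```python
# Root MSE, R-Square, and adjusted R-Square computed from OLS residuals.
# Hypothetical data; illustrates the formulas in the text.
import numpy as np

rng = np.random.default_rng(6)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])     # intercept present, so i = 1
y = 4 + 1.5*x + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
ss_total = np.sum((y - y.mean())**2)     # corrected total sum of squares
ss_error = e @ e
p = X.shape[1]                           # number of parameters

root_mse = np.sqrt(ss_error / (n - p))
r_square = 1 - ss_error / ss_total
adj_rsq = 1 - (n - 1) * (1 - r_square) / (n - p)
```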
The parameter estimates and associated statistics are then displayed, and they include the following:
the Variable used as the regressor, including the name Intercept to represent the estimate of the intercept parameter
the degrees of freedom (DF) for the variable. There is one degree of freedom unless the model is not full rank.
the Parameter Estimate
the Standard Error, the estimate of the standard deviation of the parameter estimate
T for H0: Parameter=0, the t test that the parameter is zero. This is computed as the Parameter Estimate divided by the Standard Error.
the Prob > |T|, the probability that a t statistic would obtain a greater absolute value than that observed given that the true parameter is zero. This is the two-tailed significance probability.
If model-selection methods other than NONE, RSQUARE, ADJRSQ, or CP are used, the analysis-of-variance table and the parameter estimates with associated statistics are produced at each step. Also displayed are
C(p), which is Mallows' Cp statistic
bounds on the condition number of the correlation matrix for the variables in the model (Berk 1977)
After statistics for the final model are produced, the following is displayed when the method chosen is FORWARD, BACKWARD, or STEPWISE:
a Summary table listing Step number, Variable Entered or Removed, Partial and Model R-Square, and C(p) and F statistics
The RSQUARE method displays its results beginning with the model containing the fewest independent variables and producing the largest R 2 . Results for other models with the same number of variables are then shown in order of decreasing R 2 , and so on, for models with larger numbers of variables. The ADJRSQ and CP methods group models of all sizes together and display results beginning with the model having the optimal value of adjusted R 2 and C p , respectively.
For each model considered, the RSQUARE, ADJRSQ, and CP methods display the following:
Number in Model or IN, the number of independent variables used in each model
R-Square or RSQ, the squared multiple correlation coefficient
If the B option is specified, the RSQUARE, ADJRSQ, and CP methods produce the following:
Parameter Estimates, the estimated regression coefficients
If the B option is not specified, the RSQUARE, ADJRSQ, and CP methods display the following:
Variables in Model, the names of the independent variables included in the model
PROC REG assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.
ODS Table Name | Description | Statement | Option |
---|---|---|---|
ACovEst | Consistent covariance of estimates matrix | MODEL | ALL, ACOV |
ACovTestANOVA | Test ANOVA using ACOV estimates | TEST | ACOV (MODEL statement) |
ANOVA | Model ANOVA table | MODEL | default |
CanCorr | Canonical correlations for hypothesis combinations | MTEST | CANPRINT |
CollinDiag | Collinearity Diagnostics table | MODEL | COLLIN |
CollinDiagNoInt | Collinearity Diagnostics for no intercept model | MODEL | COLLINOINT |
ConditionBounds | Bounds on condition number | MODEL | (SELECTION=BACKWARD FORWARD STEPWISE MAXR MINR) and DETAILS |
Corr | Correlation matrix for analysis variables | PROC | ALL, CORR |
CorrB | Correlation of estimates | MODEL | CORRB |
CovB | Covariance of estimates | MODEL | COVB |
CrossProducts | Bordered model X'X matrix | MODEL | ALL, XPX |
DWStatistic | Durbin-Watson statistic | MODEL | ALL, DW |
DependenceEquations | Linear dependence equations | MODEL | default if needed |
Eigenvalues | MTest eigenvalues | MTEST | CANPRINT |
Eigenvectors | MTest eigenvectors | MTEST | CANPRINT |
EntryStatistics | Entry statistics for selection methods | MODEL | (SELECTION=BACKWARD FORWARD STEPWISE MAXR MINR) and DETAILS |
ErrorPlusHypothesis | MTest error plus hypothesis matrix H + E | MTEST | |
ErrorSSCP | MTest error matrix E | MTEST | |
FitStatistics | Model fit statistics | MODEL | default |
HypothesisSSCP | MTest hypothesis matrix | MTEST | |
InvMTestCov | Inv(L Ginv(X'X) L') and Inv(Lb-c) | MTEST | DETAILS |
InvTestCov | Inv(L Ginv(X'X) L') and Inv(Lb-c) | TEST | |
InvXPX | Bordered X'X inverse matrix | MODEL | I |
MTestCov | L Ginv(X'X) L' and Lb-c | MTEST | DETAILS |
MTransform | MTest matrix M , across dependents | MTEST | DETAILS |
MultStat | Multivariate test statistics | MTEST | default |
NObs | Number of observations | | default |
OutputStatistics | Output statistics table | MODEL | ALL, CLI, CLM, INFLUENCE, P, R |
ParameterEstimates | Model parameter estimates | MODEL | default |
RemovalStatistics | Removal statistics for selection methods | MODEL | (SELECTION=BACKWARD STEPWISE MAXR MINR) and DETAILS |
ResidualStatistics | Residual statistics and PRESS statistic | MODEL | ALL, CLI, CLM, INFLUENCE, P, R |
SelParmEst | Parameter estimates for selection methods | MODEL | SELECTION=BACKWARD FORWARD STEPWISE MAXR MINR |
SelectionSummary | Selection summary for forward, backward, and stepwise methods | MODEL | SELECTION=BACKWARD FORWARD STEPWISE |
SeqParmEst | Sequential parameter estimates | MODEL | SEQB |
SimpleStatistics | Simple statistics for analysis variables | PROC | ALL, SIMPLE |
SpecTest | White's heteroscedasticity test | MODEL | ALL, SPEC |
SubsetSelSummary | Selection summary for R-Square, Adj-RSq, and Cp methods | MODEL | SELECTION=RSQUARE ADJRSQ CP |
TestANOVA | Test ANOVA table | TEST | default |
TestCov | L Ginv(X'X) L' and Lb-c | TEST | |
USSCP | Uncorrected SSCP matrix for analysis variables | PROC | ALL, USSCP |
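For example, any of the tables above can be captured as a SAS data set with an ODS OUTPUT statement. The following minimal sketch saves the ParameterEstimates table; the SASHELP.CLASS sample data set and the output data set name ParmEst are illustrative:

```sas
ods output ParameterEstimates=ParmEst;   /* capture the table as a data set */
proc reg data=sashelp.class;
   model Weight = Height;
run;
quit;

proc print data=ParmEst;                 /* inspect the captured estimates */
run;
```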
This section describes the use of ODS for creating statistical graphs with the REG procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release.
To request these graphs you must specify the ODS GRAPHICS statement. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.
When the experimental ODS graphics are in effect, the REG procedure produces a variety of plots. For models with multiple dependent variables, separate plots are produced for each dependent variable. For jobs with more than one MODEL statement, plots are produced for each MODEL statement.
The plots available are as follows:
With a single regressor, a scatter plot of the input data overlaid with the fitted regression line, confidence band, and prediction limits.
A summary panel of fit diagnostics:
Residuals versus the predicted values
Studentized residuals versus the predicted values
Studentized residuals versus the leverage
Normal quantile plot of the residuals
Dependent variable values versus the predicted values
Cook's D versus observation number
Histogram of the residuals
A Residual-Fit (or RF) plot consisting of side-by-side quantile plots of the centered fit and the residuals. This plot shows how much variation in the data is explained by the fit and how much remains in the residuals (Cleveland, 1993).
If the PLOTS(UNPACKPANELS) option is specified in the PROC REG statement, then the eight plots in the fit diagnostics panel are displayed individually.
Panels of the residuals versus the regressors in the model. Each panel contains at most six plots; multiple panels are used when there are more than six regressors (including the intercept) in the model.
If the PARTIAL option is specified in a MODEL statement, panels of the partial regression plots for each regressor (see the "The PARTIAL Option" section on page 3901). Each panel contains at most six partial plots; multiple panels are used when there are more than six regressors in the model.
If the RIDGE= option is specified in the MODEL statement, panels of ridge traces versus the specified ridge parameters for each regressor in the model. At most eight ridge traces are included on a panel, and multiple panels are used for models with more than eight regressors.
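A minimal job that requests these experimental graphics might look as follows; the SASHELP.CLASS sample data set stands in for your own data:

```sas
ods graphics on;                 /* enable experimental ODS graphics */
proc reg data=sashelp.class;
   model Weight = Height;        /* single regressor: fit plot is produced */
run;
quit;
ods graphics off;                /* disable graphics for subsequent steps */
```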
PLOTS ( general-plot-options )
specifies characteristics of the graphics produced when you use the experimental ODS GRAPHICS statement. You can specify the following general-plot-options in parentheses after the PLOTS option:
UNPACK | UNPACKPANELS specifies that plots in the fit diagnostics panel should be displayed separately.
MAXPOINTS=number | NONE suppresses plots with elements that require processing more than number points. The default is MAXPOINTS=5000. This cutoff is ignored if you specify MAXPOINTS=NONE.
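As a sketch, both general-plot-options can be specified together in the PROC REG statement (the data set name is illustrative):

```sas
ods graphics on;
proc reg data=sashelp.class plots(unpackpanels maxpoints=none);
   model Weight = Height;        /* diagnostics displayed as separate plots */
run;                             /* with no point-count cutoff applied      */
quit;
ods graphics off;
```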
PROC REG assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 61.9.
ODS Graph Name | Plot Description | PLOTS Option |
---|---|---|
ActualByPredicted | Dependent variable versus predicted values | UNPACKPANELS |
CooksD | Cook's D statistic versus observation number | UNPACKPANELS |
DiagnosticsPanel | Panel of fit diagnostics | |
Fit | Regression line, confidence band, and prediction limits overlaid on scatter plot of data | |
PartialPlotPanel i | Panel i of partial regression plots | |
QQPlot | Normal quantile plot of residuals | UNPACKPANELS |
ResidualByPredicted | Residuals versus predicted values | UNPACKPANELS |
ResidualHistogram | Histogram of fit residuals | UNPACKPANELS |
ResidualPanel i | Panel i of residuals versus regressors | |
RFPlot | Side-by-side plots of quantiles of centered fit and residuals | UNPACKPANELS |
RidgePanel i | Panel i of ridge traces | |
RStudentByLeverage | Studentized residuals versus leverage | UNPACKPANELS |
RStudentByPredicted | Studentized residuals versus predicted values | UNPACKPANELS |