Details


Missing Values

PROC REG constructs only one crossproducts matrix for the variables in all regressions. If any variable needed for any regression is missing, the observation is excluded from all estimates. If you include variables with missing values in the VAR statement, the corresponding observations are excluded from all analyses, even if you never include the variables in a model. PROC REG assumes that you may want to include these variables after the first RUN statement and deletes observations with missing values.
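As a minimal sketch of this behavior, the following statements build a small hypothetical data set (WORK.DEMO, with made-up values) in which Z is missing for two observations. Z is listed in the VAR statement but never appears in a model, so, by the rule described above, the two observations with a missing Z should still be excluded from the fit of Y on X.

```sas
/* Hypothetical data: z is missing in observations 2 and 5 */
data work.demo;
   input y x z;
   datalines;
1 1 5
2 2 .
3 3 7
5 4 8
4 5 .
;
run;

proc reg data=work.demo;
   var z;          /* z enters the crossproducts matrix but no model */
   model y = x;    /* expected to use only the three complete observations */
run;
quit;
```

If Z is not needed in any later model, leaving it out of the VAR statement should allow all five observations to contribute.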

Input Data Sets

PROC REG does not compute new regressors. If you want a quadratic term in your model, you should create a new variable when you prepare the input data. For example, the statement

  model y=x1 x1*x1;  

is not valid. Note that this MODEL statement is valid in the GLM procedure.
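One way to supply the quadratic term is to compute it in a DATA step before calling PROC REG. The following is a sketch only; the data set name WORK.CURVE, the variable name X1SQ, and the data values are hypothetical.

```sas
/* Hypothetical data; x1sq is created in the DATA step, not in PROC REG */
data work.curve;
   input y x1;
   x1sq = x1*x1;          /* quadratic regressor */
   datalines;
2.1 1
3.9 2
9.2 3
15.8 4
25.3 5
36.1 6
;
run;

proc reg data=work.curve;
   model y = x1 x1sq;     /* valid: both regressors exist in the input data */
run;
quit;
```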

The input data set for most applications of PROC REG contains standard rectangular data, but special TYPE=CORR, TYPE=COV, or TYPE=SSCP data sets can also be used. TYPE=CORR and TYPE=COV data sets created by the CORR procedure contain means and standard deviations. In addition, TYPE=CORR data sets contain correlations and TYPE=COV data sets contain covariances. TYPE=SSCP data sets created in previous runs of PROC REG that used the OUTSSCP= option contain the sums of squares and crossproducts of the variables. See Appendix A, Special SAS Data Sets, and the SAS Files section in SAS Language Reference: Concepts for more information on special SAS data sets.

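These summary files save CPU time. It takes $nk^2$ operations (where $n$ is the number of observations and $k$ is the number of variables) to calculate crossproducts; the regressions are of the order $k^3$. For example, with $n = 5000$ and $k = 8$, the crossproducts take on the order of $5000 \times 8^2 = 320{,}000$ operations, while a regression takes on the order of $8^3 = 512$. When $n$ is in the thousands and $k$ is less than 10, you can save 99 percent of the CPU time by reusing the SSCP matrix rather than recomputing it.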

When you want to use a special SAS data set as input, PROC REG must determine the TYPE for the data set. PROC CORR and PROC REG automatically set the type for their output data sets. However, if you create the data set by some other means (such as a DATA step) you must specify its type with the TYPE= data set option. If the TYPE for the data set is not specified when the data set is created, you can specify TYPE= as a data set option in the DATA= option in the PROC REG statement. For example,

  proc reg data=a(type=corr);  
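As an illustration of building such a data set by hand, the sketch below enters means, standard deviations, the number of observations, and a correlation matrix in a DATA step and marks the result with the TYPE= data set option. The data set name WORK.CORRIN, the variables Y, X1, and X2, and all numeric values are hypothetical.

```sas
/* Hypothetical summary statistics entered by hand.             */
/* A single period is read as a missing value by list input.    */
data work.corrin(type=corr);
   input _type_ $ _name_ $ y x1 x2;
   datalines;
MEAN . 10  5   3
STD  . 2   1   1.5
N    . 50  50  50
CORR y  1.0 0.6 0.4
CORR x1 0.6 1.0 0.2
CORR x2 0.4 0.2 1.0
;
run;

/* TYPE= was set when the data set was created, so it is not repeated here */
proc reg data=work.corrin;
   model y = x1 x2;
run;
quit;
```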

When TYPE=CORR, TYPE=COV, or TYPE=SSCP data sets are used with PROC REG, statements and options that require the original data values have no effect. The OUTPUT, PAINT, PLOT, and REWEIGHT statements and the MODEL and PRINT statement options P, R, CLM, CLI, DW, INFLUENCE, and PARTIAL are disabled since the original observations needed to calculate predicted and residual values are not present.

Example Using TYPE=CORR Data Set

This example uses PROC CORR to produce an input data set for PROC REG. The fitness data for this analysis can be found in Example 61.1 on page 3924.

  proc corr data=fitness outp=r noprint;
     var Oxygen RunTime Age Weight RunPulse MaxPulse RestPulse;
  proc print data=r;
  proc reg data=r;
     model Oxygen=RunTime Age Weight;
  run;

Since the OUTP= data set from PROC CORR is automatically set to TYPE=CORR, the TYPE= data set option is not required in this example. The data set containing the correlation matrix is displayed by the PRINT procedure as shown in Figure 61.12. Figure 61.13 shows results from the regression using the TYPE=CORR data as an input data set.

  Obs  _TYPE_  _NAME_       Oxygen   RunTime       Age    Weight  RunPulse  MaxPulse  RestPulse
    1  MEAN                47.3758   10.5861   47.6774   77.4445   169.645   173.774    53.4516
    2  STD                  5.3272    1.3874    5.2114    8.3286    10.252     9.164     7.6194
    3  N                   31.0000   31.0000   31.0000   31.0000    31.000    31.000    31.0000
    4  CORR    Oxygen       1.0000   -0.8622   -0.3046   -0.1628    -0.398    -0.237    -0.3994
    5  CORR    RunTime     -0.8622    1.0000    0.1887    0.1435     0.314     0.226     0.4504
    6  CORR    Age         -0.3046    0.1887    1.0000   -0.2335    -0.338    -0.433    -0.1641
    7  CORR    Weight      -0.1628    0.1435   -0.2335    1.0000     0.182     0.249     0.0440
    8  CORR    RunPulse    -0.3980    0.3136   -0.3379    0.1815     1.000     0.930     0.3525
    9  CORR    MaxPulse    -0.2367    0.2261   -0.4329    0.2494     0.930     1.000     0.3051
   10  CORR    RestPulse   -0.3994    0.4504   -0.1641    0.0440     0.352     0.305     1.0000

Figure 61.12: TYPE=CORR Data Set Created by PROC CORR

  The REG Procedure
  Model: MODEL1
  Dependent Variable: Oxygen

  Analysis of Variance

                              Sum of          Mean
  Source             DF      Squares        Square   F Value   Pr > F
  Model               3    656.27095     218.75698     30.27   <.0001
  Error              27    195.11060       7.22632
  Corrected Total    30    851.38154

  Root MSE            2.68818   R-Square   0.7708
  Dependent Mean     47.37581   Adj R-Sq   0.7454
  Coeff Var           5.67416

  Parameter Estimates

                    Parameter    Standard
  Variable    DF     Estimate       Error   t Value   Pr > |t|
  Intercept    1     93.12615     7.55916     12.32     <.0001
  RunTime      1     -3.14039     0.36738     -8.55     <.0001
  Age          1     -0.17388     0.09955     -1.75     0.0921
  Weight       1     -0.05444     0.06181     -0.88     0.3862

Figure 61.13: Regression on TYPE=CORR Data Set

Example Using TYPE=SSCP Data Set

The following example uses the saved crossproducts matrix:

  proc reg data=fitness outsscp=sscp noprint;
     model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse;
  proc print data=sscp;
  proc reg data=sscp;
     model Oxygen=RunTime Age Weight;
  run;

First, all variables are used to fit the data and create the SSCP data set. Figure 61.14 shows the PROC PRINT display of the SSCP data set. The SSCP data set is then used as the input data set for PROC REG, and a reduced model is fit to the data. Figure 61.15 shows the PROC REG results for the reduced model. (For the PROC REG results for the full model, see Figure 61.27 on page 3877.)

In the preceding example, the TYPE= data set option is not required since PROC REG sets the OUTSSCP= data set to TYPE=SSCP.

  Obs  _TYPE_  _NAME_     Intercept   RunTime        Age     Weight   RunPulse   MaxPulse  RestPulse     Oxygen
    1  SSCP    Intercept      31.00    328.17    1478.00    2400.78    5259.00    5387.00    1657.00    1468.65
    2  SSCP    RunTime       328.17   3531.80   15687.24   25464.71   55806.29   57113.72   17684.05   15356.14
    3  SSCP    Age          1478.00  15687.24   71282.00  114158.90  250194.00  256218.00   78806.00   69767.75
    4  SSCP    Weight       2400.78  25464.71  114158.90  188008.20  407745.67  417764.62  128409.28  113522.26
    5  SSCP    RunPulse     5259.00  55806.29  250194.00  407745.67  895317.00  916499.00  281928.00  248497.31
    6  SSCP    MaxPulse     5387.00  57113.72  256218.00  417764.62  916499.00  938641.00  288583.00  254866.75
    7  SSCP    RestPulse    1657.00  17684.05   78806.00  128409.28  281928.00  288583.00   90311.00   78015.41
    8  SSCP    Oxygen       1468.65  15356.14   69767.75  113522.26  248497.31  254866.75   78015.41   70429.86
    9  N                      31.00     31.00      31.00      31.00      31.00      31.00      31.00      31.00

Figure 61.14: TYPE=SSCP Data Set Created by PROC REG

  The REG Procedure
  Model: MODEL1
  Dependent Variable: Oxygen

  Analysis of Variance

                              Sum of          Mean
  Source             DF      Squares        Square   F Value   Pr > F
  Model               3    656.27095     218.75698     30.27   <.0001
  Error              27    195.11060       7.22632
  Corrected Total    30    851.38154

  Root MSE            2.68818   R-Square   0.7708
  Dependent Mean     47.37581   Adj R-Sq   0.7454
  Coeff Var           5.67416

  Parameter Estimates

                    Parameter    Standard
  Variable    DF     Estimate       Error   t Value   Pr > |t|
  Intercept    1     93.12615     7.55916     12.32     <.0001
  RunTime      1     -3.14039     0.36738     -8.55     <.0001
  Age          1     -0.17388     0.09955     -1.75     0.0921
  Weight       1     -0.05444     0.06181     -0.88     0.3862

Figure 61.15: Regression on TYPE=SSCP Data Set

Output Data Sets

OUTEST= Data Set

The OUTEST= specification produces a TYPE=EST output SAS data set containing estimates and optional statistics from the regression models. For each BY group on each dependent variable occurring in each MODEL statement, PROC REG outputs an observation to the OUTEST= data set. The variables output to the data set are as follows:

  • the BY variables, if any

  • _MODEL_, a character variable containing the label of the corresponding MODEL statement, or MODELn if no label is specified, where n is 1 for the first MODEL statement, 2 for the second MODEL statement, and so on

  • _TYPE_, a character variable with the value PARMS for every observation

  • _DEPVAR_, the name of the dependent variable

  • _RMSE_, the root mean squared error or the estimate of the standard deviation of the error term

  • Intercept, the estimated intercept, unless the NOINT option is specified

  • all the variables listed in any MODEL or VAR statement. Values of these variables are the estimated regression coefficients for the model. A variable that does not appear in the model corresponding to a given observation has a missing value in that observation. The dependent variable in each model is given a value of -1.

If you specify the COVOUT option, the covariance matrix of the estimates is output after the estimates; the _TYPE_ variable is set to the value COV and the names of the rows are identified by the 8-byte character variable, _NAME_.
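A short sketch of requesting the covariance matrix follows. It assumes that the USPopulation data set from the section Polynomial Regression (with the variables Population, Year, and YearSq) has already been created; the output data set name EST is arbitrary.

```sas
/* Assumes the USPopulation data set already exists */
proc reg data=USPopulation outest=est covout;
   m2: model Population=Year YearSq / noprint;
run;
quit;

/* The PARMS observation should be followed by observations with _TYPE_=COV */
proc print data=est;
run;
```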

If you specify the TABLEOUT option, the following statistics listed by _TYPE_ are added after the estimates:

  • STDERR, the standard error of the estimate

  • T, the t statistic for testing if the estimate is zero

  • PVALUE, the associated p-value

  • LnB, the $100(1-\alpha)\%$ lower confidence limit for the estimate, where n is the nearest integer to $100(1-\alpha)$ and $\alpha$ defaults to 0.05 or is set using the ALPHA= option in the PROC REG or MODEL statement

  • UnB, the $100(1-\alpha)\%$ upper confidence limit for the estimate

Specifying the option ADJRSQ, AIC, BIC, CP, EDF, GMSEP, JP, MSE, PC, RSQUARE, SBC, SP, or SSE in the PROC REG or MODEL statement automatically outputs these statistics and the model $R^2$ for each model selected, regardless of the model selection method. Additional variables, in order of occurrence, are as follows:

  • _IN_, the number of regressors in the model not including the intercept

  • _P_, the number of parameters in the model including the intercept, if any

  • _EDF_, the error degrees of freedom

  • _SSE_, the error sum of squares, if the SSE option is specified

  • _MSE_, the mean squared error, if the MSE option is specified

  • _RSQ_, the $R^2$ statistic

  • _ADJRSQ_, the adjusted $R^2$, if the ADJRSQ option is specified

  • _CP_, the $C_p$ statistic, if the CP option is specified

  • _SP_, the $S_p$ statistic, if the SP option is specified

  • _JP_, the $J_p$ statistic, if the JP option is specified

  • _PC_, the PC statistic, if the PC option is specified

  • _GMSEP_, the GMSEP statistic, if the GMSEP option is specified

  • _AIC_, the AIC statistic, if the AIC option is specified

  • _BIC_, the BIC statistic, if the BIC option is specified

  • _SBC_, the SBC statistic, if the SBC option is specified

The following is an example with a display of the OUTEST= data set. This example uses the population data given in the section Polynomial Regression beginning on page 3804. Figure 61.16 on page 3865 through Figure 61.18 on page 3866 show the regression equations and the resulting OUTEST= data set.

  proc reg data=USPopulation outest=est;
     m1: model Population=Year;
     m2: model Population=Year YearSq;
  proc print data=est;
  run;

  The REG Procedure
  Model: m1
  Dependent Variable: Population

  Analysis of Variance

                              Sum of          Mean
  Source             DF      Squares        Square   F Value   Pr > F
  Model               1       146869        146869    228.92   <.0001
  Error              20        12832     641.58160
  Corrected Total    21       159700

  Root MSE           25.32946   R-Square   0.9197
  Dependent Mean     94.64800   Adj R-Sq   0.9156
  Coeff Var          26.76175

  Parameter Estimates

                    Parameter    Standard
  Variable    DF     Estimate       Error   t Value   Pr > |t|
  Intercept    1  -2345.85498   161.39279    -14.54     <.0001
  Year         1      1.28786     0.08512     15.13     <.0001

Figure 61.16: Regression Output for Model M1

  The REG Procedure
  Model: m2
  Dependent Variable: Population

  Analysis of Variance

                              Sum of          Mean
  Source             DF      Squares        Square   F Value   Pr > F
  Model               2       159529         79765   8864.19   <.0001
  Error              19    170.97193       8.99852
  Corrected Total    21       159700

  Root MSE            2.99975   R-Square   0.9989
  Dependent Mean     94.64800   Adj R-Sq   0.9988
  Coeff Var           3.16938

  Parameter Estimates

                    Parameter      Standard
  Variable    DF     Estimate         Error   t Value   Pr > |t|
  Intercept    1        21631     639.50181     33.82     <.0001
  Year         1    -24.04581       0.67547    -35.60     <.0001
  YearSq       1      0.00668    0.00017820     37.51     <.0001

Figure 61.17: Regression Output for Model M2

  Obs  _MODEL_  _TYPE_  _DEPVAR_     _RMSE_  Intercept      Year  Population      YearSq
    1  m1       PARMS   Population  25.3295   -2345.85    1.2879          -1           .
    2  m2       PARMS   Population   2.9998   21630.89  -24.0458          -1  .006684346

Figure 61.18: OUTEST= Data Set

The following modification of the previous example uses the TABLEOUT and ALPHA= options to obtain additional information in the OUTEST= data set:

  proc reg data=USPopulation outest=est tableout alpha=0.1;
     m1: model Population=Year/noprint;
     m2: model Population=Year YearSq/noprint;
  proc print data=est;
  run;

Notice that the TABLEOUT option causes standard errors, t statistics, p-values, and confidence limits for the estimates to be added to the OUTEST= data set. Also note that the ALPHA= option is used to set the confidence level at 90%. The OUTEST= data set follows.

  Obs  _MODEL_  _TYPE_  _DEPVAR_     _RMSE_  Intercept      Year  Population   YearSq
    1  m1       PARMS   Population  25.3295   -2345.85    1.2879          -1        .
    2  m1       STDERR  Population  25.3295     161.39    0.0851           .        .
    3  m1       T       Population  25.3295     -14.54   15.1300           .        .
    4  m1       PVALUE  Population  25.3295       0.00    0.0000           .        .
    5  m1       L90B    Population  25.3295   -2624.21    1.1411           .        .
    6  m1       U90B    Population  25.3295   -2067.50    1.4347           .        .
    7  m2       PARMS   Population   2.9998   21630.89  -24.0458          -1   0.0067
    8  m2       STDERR  Population   2.9998     639.50    0.6755           .   0.0002
    9  m2       T       Population   2.9998      33.82  -35.5988           .  37.5096
   10  m2       PVALUE  Population   2.9998       0.00    0.0000           .   0.0000
   11  m2       L90B    Population   2.9998   20525.11  -25.2138           .   0.0064
   12  m2       U90B    Population   2.9998   22736.68  -22.8778           .   0.0070

Figure 61.19: The OUTEST= Data Set When TABLEOUT is Specified

A slightly different OUTEST= data set is created when you use the RSQUARE selection method. This example requests only the best model for each subset size but asks for a variety of model selection statistics, as well as the estimated regression coefficients. An OUTEST= data set is created and displayed. See Figure 61.20 and Figure 61.21 for results.

  proc reg data=fitness outest=est;
     model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
           / selection=rsquare mse jp gmsep cp aic bic sbc b best=1;
  proc print data=est;
  run;

  The REG Procedure
  Model: MODEL1
  Dependent Variable: Oxygen

  R-Square Selection Method

  Number in                                          Estimated MSE
      Model  R-Square     C(p)      AIC      BIC     of Prediction    J(p)      MSE       SBC
          1    0.7434  13.6988  64.5341  65.4673            8.0546  8.0199  7.53384  67.40210
          2    0.7642  12.3894  63.9050  64.8212            7.9478  7.8621  7.16842  68.20695
          3    0.8111   6.9596  59.0373  61.3127            6.8583  6.7253  5.95669  64.77326
          4    0.8368   4.8800  56.4995  60.3996            6.3984  6.2053  5.34346  63.66941
          5    0.8480   5.1063  56.2986  61.5667            6.4565  6.1782  5.17634  64.90250
          6    0.8487   7.0000  58.1616  64.0748            6.9870  6.5804  5.36825  68.19952

  Number in            Parameter Estimates
      Model  R-Square  Intercept       Age    Weight   RunTime  RunPulse  RestPulse  MaxPulse
          1    0.7434   82.42177         .         .  -3.31056         .          .         .
          2    0.7642   88.46229  -0.15037         .  -3.20395         .          .         .
          3    0.8111  111.71806  -0.25640         .  -2.82538  -0.13091          .         .
          4    0.8368   98.14789  -0.19773         .  -2.76758  -0.34811          .   0.27051
          5    0.8480  102.20428  -0.21962  -0.07230  -2.68252  -0.37340          .   0.30491
          6    0.8487  102.93448  -0.22697  -0.07418  -2.62865  -0.36963   -0.02153   0.30322

Figure 61.20: PROC REG Output for Physical Fitness Data: Best Models

  Obs  _MODEL_  _TYPE_  _DEPVAR_   _RMSE_  Intercept       Age     Weight   RunTime  RunPulse  RestPulse  MaxPulse
    1  MODEL1   PARMS   Oxygen    2.74478     82.422         .          .  -3.31056         .          .         .
    2  MODEL1   PARMS   Oxygen    2.67739     88.462  -0.15037          .  -3.20395         .          .         .
    3  MODEL1   PARMS   Oxygen    2.44063    111.718  -0.25640          .  -2.82538  -0.13091          .         .
    4  MODEL1   PARMS   Oxygen    2.31159     98.148  -0.19773          .  -2.76758  -0.34811          .   0.27051
    5  MODEL1   PARMS   Oxygen    2.27516    102.204  -0.21962  -0.072302  -2.68252  -0.37340          .   0.30491
    6  MODEL1   PARMS   Oxygen    2.31695    102.934  -0.22697  -0.074177  -2.62865  -0.36963  -0.021534   0.30322

  Obs  Oxygen  _IN_  _P_  _EDF_    _MSE_    _RSQ_     _CP_     _JP_  _GMSEP_    _AIC_    _BIC_    _SBC_
    1      -1     1    2     29  7.53384  0.74338  13.6988  8.01990  8.05462  64.5341  65.4673  67.4021
    2      -1     2    3     28  7.16842  0.76425  12.3894  7.86214  7.94778  63.9050  64.8212  68.2069
    3      -1     3    4     27  5.95669  0.81109   6.9596  6.72530  6.85833  59.0373  61.3127  64.7733
    4      -1     4    5     26  5.34346  0.83682   4.8800  6.20531  6.39837  56.4995  60.3996  63.6694
    5      -1     5    6     25  5.17634  0.84800   5.1063  6.17821  6.45651  56.2986  61.5667  64.9025
    6      -1     6    7     24  5.36825  0.84867   7.0000  6.58043  6.98700  58.1616  64.0748  68.1995

Figure 61.21: PROC PRINT Output for Physical Fitness Data: OUTEST= Data Set

OUTSSCP= Data Sets

The OUTSSCP= option produces a TYPE=SSCP output SAS data set containing sums of squares and crossproducts. A special row (observation) and column (variable) of the matrix called Intercept contain the number of observations and sums. Observations are identified by the character variable _NAME_. The data set contains all variables used in MODEL statements. You can specify additional variables that you want included in the crossproducts matrix with a VAR statement.

The SSCP data set is used when a large number of observations are explored in many different runs. The SSCP data set can be saved and used for subsequent runs, which are much less expensive since PROC REG never reads the original data again. If you run PROC REG once to create only a SSCP data set, you should list all the variables that you may need in a VAR statement or include all the variables that you may need in a MODEL statement.

The following example uses the fitness data from Example 61.1 on page 3924 to produce an output data set with the OUTSSCP= option. The resulting output is shown in Figure 61.22.

  proc reg data=fitness outsscp=sscp;
     var Oxygen RunTime Age Weight RestPulse RunPulse MaxPulse;
  proc print data=sscp;
  run;

  Obs  _TYPE_  _NAME_     Intercept     Oxygen   RunTime        Age     Weight  RestPulse   RunPulse   MaxPulse
    1  SSCP    Intercept      31.00    1468.65    328.17    1478.00    2400.78    1657.00    5259.00    5387.00
    2  SSCP    Oxygen       1468.65   70429.86  15356.14   69767.75  113522.26   78015.41  248497.31  254866.75
    3  SSCP    RunTime       328.17   15356.14   3531.80   15687.24   25464.71   17684.05   55806.29   57113.72
    4  SSCP    Age          1478.00   69767.75  15687.24   71282.00  114158.90   78806.00  250194.00  256218.00
    5  SSCP    Weight       2400.78  113522.26  25464.71  114158.90  188008.20  128409.28  407745.67  417764.62
    6  SSCP    RestPulse    1657.00   78015.41  17684.05   78806.00  128409.28   90311.00  281928.00  288583.00
    7  SSCP    RunPulse     5259.00  248497.31  55806.29  250194.00  407745.67  281928.00  895317.00  916499.00
    8  SSCP    MaxPulse     5387.00  254866.75  57113.72  256218.00  417764.62  288583.00  916499.00  938641.00
    9  N                      31.00      31.00     31.00      31.00      31.00      31.00      31.00      31.00

Figure 61.22: SSCP Data Set Created with OUTSSCP= Option: REG Procedure

Since a model is not fit to the data and since the only request is to create the SSCP data set, a MODEL statement is not required in this example. However, since the MODEL statement is not used, the VAR statement is required.

Interactive Analysis

PROC REG enables you to change interactively both the model and the data used to compute the model, and to produce and highlight scatter plots. See the section Using PROC REG Interactively on page 3812 for an overview of interactive analysis using PROC REG. The following statements can be used interactively (without reinvoking PROC REG): ADD, DELETE, MODEL, MTEST, OUTPUT, PAINT, PLOT, PRINT, REFIT, RESTRICT, REWEIGHT, and TEST. All interactive features are disabled if there is a BY statement.

The ADD, DELETE and REWEIGHT statements can be used to modify the current MODEL. Every use of an ADD, DELETE or REWEIGHT statement causes the model label to be modified by attaching an additional number to it. This number is the cumulative total of the number of ADD, DELETE or REWEIGHT statements following the current MODEL statement.

A more detailed explanation of changing the data used to compute the model is given in the section Reweighting Observations in an Analysis on page 3903. Extra features for line printer scatter plots are discussed in the section Line Printer Scatter Plot Features on page 3882.

The following example illustrates the usefulness of the interactive features. First, the full regression model is fit to the class data (see the Getting Started section on page 3800), and Figure 61.23 is produced.

  proc reg data=Class;
     model Weight=Age Height;
  run;

  The REG Procedure
  Model: MODEL1
  Dependent Variable: Weight

  Analysis of Variance

                              Sum of          Mean
  Source             DF      Squares        Square   F Value   Pr > F
  Model               2   7215.63710    3607.81855     27.23   <.0001
  Error              16   2120.09974     132.50623
  Corrected Total    18   9335.73684

  Root MSE           11.51114   R-Square   0.7729
  Dependent Mean    100.02632   Adj R-Sq   0.7445
  Coeff Var          11.50811

  Parameter Estimates

                    Parameter    Standard
  Variable    DF     Estimate       Error   t Value   Pr > |t|
  Intercept    1   -141.22376    33.38309     -4.23     0.0006
  Age          1      1.27839     3.11010      0.41     0.6865
  Height       1      3.59703     0.90546      3.97     0.0011

Figure 61.23: Interactive Analysis: Full Model

Next, the regression model is reduced by the following statements, and Figure 61.24 is produced.

  delete age;
  print;
  run;

  The REG Procedure
  Model: MODEL1.1
  Dependent Variable: Weight

  Analysis of Variance

                              Sum of          Mean
  Source             DF      Squares        Square   F Value   Pr > F
  Model               1   7193.24912    7193.24912     57.08   <.0001
  Error              17   2142.48772     126.02869
  Corrected Total    18   9335.73684

  Root MSE           11.22625   R-Square   0.7705
  Dependent Mean    100.02632   Adj R-Sq   0.7570
  Coeff Var          11.22330

  Parameter Estimates

                    Parameter    Standard
  Variable    DF     Estimate       Error   t Value   Pr > |t|
  Intercept    1   -143.02692    32.27459     -4.43     0.0004
  Height       1      3.89903     0.51609      7.55     <.0001

Figure 61.24: Interactive Analysis: Reduced Model

Note that the MODEL label has been changed from MODEL1 to MODEL1.1, because the original model has been changed by the DELETE statement.

The following statements generate a scatter plot of the residuals against the predicted values from the full model. Figure 61.25 is produced, and the scatter plot shows a possible outlier.

  add age;
  plot r.*p. / cframe=ligr;
  run;

Figure 61.25: Interactive Analysis: Scatter Plot

The following statements delete the observation with the largest residual, refit the regression model, and produce a scatter plot of residuals against predicted values for the refitted model. Figure 61.26 shows the new scatter plot.

  reweight r.>20;
  plot / cframe=ligr;
  run;

Figure 61.26: Interactive Analysis: Scatter Plot for Refitted Model

Model-Selection Methods

The nine methods of model selection implemented in PROC REG are specified with the SELECTION= option in the MODEL statement. Each method is discussed in this section.

Full Model Fitted (NONE)

This method is the default and provides no model selection capability. The complete model specified in the MODEL statement is used to fit the model. For many regression analyses, this may be the only method you need.

Forward Selection (FORWARD)

The forward-selection technique begins with no variables in the model. For each of the independent variables, the FORWARD method calculates F statistics that reflect the variable's contribution to the model if it is included. The p-values for these F statistics are compared to the SLENTRY= value that is specified in the MODEL statement (or to 0.50 if the SLENTRY= option is omitted). If no F statistic is significant at the SLENTRY= level (that is, no p-value is smaller than the SLENTRY= value), the FORWARD selection stops. Otherwise, the FORWARD method adds the variable that has the largest F statistic to the model. The FORWARD method then calculates F statistics again for the variables still remaining outside the model, and the evaluation process is repeated. Thus, variables are added one by one to the model until no remaining variable produces a significant F statistic. Once a variable is in the model, it stays.
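As a sketch of how forward selection might be requested, the following statements assume that the fitness data set from Example 61.1 has already been created. The entry level of 0.25 is an arbitrary illustrative choice, not a recommendation.

```sas
/* Assumes the fitness data set from Example 61.1 exists */
proc reg data=fitness;
   /* A variable enters only if its F statistic is significant at the 0.25 level */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=forward slentry=0.25;
run;
quit;
```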

Backward Elimination (BACKWARD)

The backward elimination technique begins by calculating F statistics for a model that includes all of the independent variables. Then the variables are deleted from the model one by one until all the variables remaining in the model produce F statistics significant at the SLSTAY= level specified in the MODEL statement (or at the 0.10 level if the SLSTAY= option is omitted). At each step, the variable showing the smallest contribution to the model is deleted.
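A comparable sketch for backward elimination follows, again assuming that the fitness data set from Example 61.1 exists. The stay level of 0.05 is chosen only for illustration.

```sas
/* Assumes the fitness data set from Example 61.1 exists */
proc reg data=fitness;
   /* Variables are removed until all that remain are significant at the 0.05 level */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=backward slstay=0.05;
run;
quit;
```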

Stepwise (STEPWISE)

The stepwise method is a modification of the forward-selection technique and differs in that variables already in the model do not necessarily stay there. As in the forward-selection method, variables are added one by one to the model, and the F statistic for a variable to be added must be significant at the SLENTRY= level. After a variable is added, however, the stepwise method looks at all the variables already included in the model and deletes any variable that does not produce an F statistic significant at the SLSTAY= level. Only after this check is made and the necessary deletions accomplished can another variable be added to the model. The stepwise process ends when none of the variables outside the model has an F statistic significant at the SLENTRY= level and every variable in the model is significant at the SLSTAY= level, or when the variable to be added to the model is the one just deleted from it.
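The stepwise method uses both significance levels, so a request might look like the following sketch. It assumes the fitness data set from Example 61.1; the two levels shown are illustrative values only.

```sas
/* Assumes the fitness data set from Example 61.1 exists */
proc reg data=fitness;
   /* SLENTRY= governs entry; SLSTAY= governs removal after each entry */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=stepwise slentry=0.10 slstay=0.10;
run;
quit;
```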

Maximum $R^2$ Improvement (MAXR)

The maximum $R^2$ improvement technique does not settle on a single model. Instead, it tries to find the best one-variable model, the best two-variable model, and so forth, although it is not guaranteed to find the model with the largest $R^2$ for each size.

The MAXR method begins by finding the one-variable model producing the highest $R^2$. Then another variable, the one that yields the greatest increase in $R^2$, is added. Once the two-variable model is obtained, each of the variables in the model is compared to each variable not in the model. For each comparison, the MAXR method determines if removing one variable and replacing it with the other variable increases $R^2$. After comparing all possible switches, the MAXR method makes the switch that produces the largest increase in $R^2$. Comparisons begin again, and the process continues until the MAXR method finds that no switch could increase $R^2$. Thus, the two-variable model achieved is considered the best two-variable model the technique can find. Another variable is then added to the model, and the comparing-and-switching process is repeated to find the best three-variable model, and so forth.

The difference between the STEPWISE method and the MAXR method is that all switches are evaluated before any switch is made in the MAXR method. In the STEPWISE method, the worst variable may be removed without considering what adding the best remaining variable might accomplish. The MAXR method may require much more computer time than the STEPWISE method.
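Because MAXR can be expensive, it may help to limit how far the search goes. The following sketch assumes the fitness data set from Example 61.1 and uses the STOP= option of the MODEL statement to end the search once the best three-variable model has been found; the cutoff of 3 is arbitrary.

```sas
/* Assumes the fitness data set from Example 61.1 exists */
proc reg data=fitness;
   /* Search for the best one-, two-, and three-variable models, then stop */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=maxr stop=3;
run;
quit;
```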

Minimum R 2 (MINR) Improvement

The MINR method closely resembles the MAXR method, but the switch chosen is the one that produces the smallest increase in R 2 . For a given number of variables in the model, the MAXR and MINR methods usually produce the same best model, but the MINR method considers more models of each size.
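To compare the two switching methods, one possible sketch fits both in a single PROC REG step, again assuming the fitness data from Example 61.1. The MODEL labels MaxR and MinR are names chosen here only to tell the two sets of output apart.

```sas
/* Two labeled MODEL statements in one step; labels are arbitrary. */
proc reg data=fitness;
   MaxR: model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
               / selection=maxr;
   MinR: model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
               / selection=minr;
run;
quit;
```

Comparing the final model of each size across the two outputs shows whether the methods agree for this sample.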

R² Selection (RSQUARE)

The RSQUARE method finds subsets of independent variables that best predict a dependent variable by linear regression in the given sample. You can specify the largest and smallest number of independent variables to appear in a subset and the number of subsets of each size to be selected. The RSQUARE method can efficiently perform all possible subset regressions and display the models in decreasing order of R² magnitude within each subset size. Other statistics are available for comparing subsets of different sizes. These statistics, as well as estimated regression coefficients, can be displayed or output to a SAS data set.

The subset models selected by the RSQUARE method are optimal in terms of R² for the given sample, but they are not necessarily optimal for the population from which the sample is drawn or for any other sample for which you may want to make predictions. If a subset model is selected on the basis of a large R² value or any other criterion commonly used for model selection, then all regression statistics computed for that model under the assumption that the model is given a priori, including all statistics computed by PROC REG, are biased.

While the RSQUARE method is a useful tool for exploratory model building, no statistical method can be relied on to identify the true model. Effective model building requires substantive theory to suggest relevant predictors and plausible functional forms for the model.

The RSQUARE method differs from the other selection methods in that RSQUARE always identifies the model with the largest R² for each number of variables considered. The other selection methods are not guaranteed to find the model with the largest R². The RSQUARE method requires much more computer time than the other selection methods, so a different selection method such as the STEPWISE method is a good choice when there are many independent variables to consider.

Adjusted R² Selection (ADJRSQ)

This method is similar to the RSQUARE method, except that the adjusted R² statistic is used as the criterion for selecting models, and the method finds the models with the highest adjusted R² within the range of sizes.

Mallows Cp Selection (CP)

This method is similar to the ADJRSQ method, except that Mallows' C_p statistic is used as the criterion for model selection. Models are listed in ascending order of C_p.
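The three all-subsets methods share the BEST=, START=, and STOP= options. The following sketch, which assumes the fitness data from Example 61.1, limits the search to subsets of one to four regressors and keeps the three best subsets of each size. The particular limits are arbitrary values chosen for illustration.

```sas
/* RSQUARE search restricted to 1-4 regressors, 3 best per size. */
/* ADJRSQ and CP add those statistics to each listed subset.     */
proc reg data=fitness;
   model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
         / selection=rsquare start=1 stop=4 best=3 adjrsq cp;
run;
quit;
```

Replacing SELECTION=RSQUARE with SELECTION=ADJRSQ or SELECTION=CP changes the ranking criterion while leaving the rest of the statement unchanged.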

Additional Information on Model-Selection Methods

If the RSQUARE or STEPWISE procedure (as documented in SAS User's Guide: Statistics, Version 5 Edition) is requested, PROC REG with the appropriate model-selection method is actually used.

Reviews of model-selection methods by Hocking (1976) and Judge et al. (1980) describe these and other variable-selection methods.

Criteria Used in Model-Selection Methods

When many significance tests are performed, each at a level of, for example, 5 percent, the overall probability of rejecting at least one true null hypothesis is much larger than 5 percent. If you want to guard against including any variables that do not contribute to the predictive power of the model in the population, you should specify a very small SLE= significance level for the FORWARD and STEPWISE methods and a very small SLS= significance level for the BACKWARD and STEPWISE methods.

In most applications, many of the variables considered have some predictive power, however small. If you want to choose the model that provides the best prediction using the sample estimates, you need only to guard against estimating more parameters than can be reliably estimated with the given sample size, so you should use a moderate significance level, perhaps in the range of 10 percent to 25 percent.

In addition to R², the C_p statistic is displayed for each model generated in the model-selection methods. The C_p statistic was proposed by Mallows (1973) as a criterion for selecting a model. It is a measure of total squared error defined as

  $$ C_p = \frac{SSE_p}{s^2} - (n - 2p) $$

where s² is the MSE for the full model, SSE_p is the sum-of-squares error for a model with p parameters including the intercept, if any, and n is the number of observations. If C_p is plotted against p, Mallows recommends the model where C_p first approaches p. When the right model is chosen, the parameter estimates are unbiased, and this is reflected in C_p near p. For further discussion, refer to Daniel and Wood (1980).
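As a quick check of the definition, consider the full six-regressor fitness model, whose ANOVA table (Figure 61.27) reports an error sum of squares of 128.83794 and a mean square error of 5.36825 with n = 31 and p = 7. Because the full model supplies both SSE_p and s², its C_p equals p. This is a worked illustration, not part of the procedure's output:

```latex
\begin{aligned}
C_p &= \frac{SSE_p}{s^2} - (n - 2p) \\
    &= \frac{128.83794}{5.36825} - (31 - 2 \cdot 7) \\
    &\approx 24.00 - 17 = 7.00
\end{aligned}
```

Any smaller subset model is judged by how close its own C_p comes to its own p.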

The adjusted R² statistic is an alternative to R² that is adjusted for the number of parameters in the model. The adjusted R² statistic is calculated as

  $$ \mathrm{ADJRSQ} = 1 - \frac{(n - i)(1 - R^2)}{n - p} $$

where n is the number of observations used in fitting the model, p is the number of parameters in the model (including the intercept, if any), and i is an indicator variable that is 1 if the model includes an intercept, and 0 otherwise.
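For example, the full fitness model in Figure 61.27 has n = 31, p = 7, and an intercept (i = 1). Using 1 − R² = SSE/SST from that ANOVA table, a hand calculation reproduces the reported Adj R-Sq of 0.8108 to rounding:

```latex
\begin{aligned}
\mathrm{ADJRSQ} &= 1 - \frac{(n - i)(1 - R^2)}{n - p}
                 = 1 - \frac{(31 - 1)\,(128.83794 / 851.38154)}{31 - 7} \\
                &\approx 1 - \frac{30 \times 0.15133}{24}
                 \approx 0.8108
\end{aligned}
```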

Limitations in Model-Selection Methods

The use of model-selection methods can be time-consuming in some cases because there is no built-in limit on the number of independent variables, and the calculations for a large number of independent variables can be lengthy. The recommended limit on the number of independent variables for the MINR method is 20 + i , where i is the value of the INCLUDE= option.

For the RSQUARE, ADJRSQ, or CP methods, with a large value of the BEST= option, adding one more variable to the list from which regressors are selected may significantly increase the CPU time. Also, the time required for the analysis is highly dependent on the data and on the values of the BEST=, START=, and STOP= options.

Parameter Estimates and Associated Statistics

The following example uses the fitness data from Example 61.1 on page 3924. Figure 61.28 shows the parameter estimates and the tables from the SS1, SS2, STB, CLB, COVB, and CORRB options:

  proc reg data=fitness;
     model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
           / ss1 ss2 stb clb covb corrb;
  run;
start figure
  The REG Procedure
  Model: MODEL1
  Dependent Variable: Oxygen

  Parameter Estimates

                 Parameter  Standard                                             Standardized
  Variable   DF   Estimate     Error  t Value  Pr > |t|   Type I SS  Type II SS      Estimate   95% Confidence Limits
  Intercept   1  102.93448  12.40326     8.30    <.0001       69578   369.72831             0    77.33541   128.53355
  RunTime     1   -2.62865   0.38456    -6.84    <.0001   632.90010   250.82210      -0.68460    -3.42235    -1.83496
  Age         1   -0.22697   0.09984    -2.27    0.0322    17.76563    27.74577      -0.22204    -0.43303    -0.02092
  Weight      1   -0.07418   0.05459    -1.36    0.1869     5.60522     9.91059      -0.11597    -0.18685     0.03850
  RunPulse    1   -0.36963   0.11985    -3.08    0.0051    38.87574    51.05806      -0.71133    -0.61699    -0.12226
  MaxPulse    1    0.30322   0.13650     2.22    0.0360    26.82640    26.49142       0.52161     0.02150     0.58493
  RestPulse   1   -0.02153   0.06605    -0.33    0.7473     0.57051     0.57051      -0.03080    -0.15786     0.11480
end figure

Figure 61.28: SS1, SS2, STB, CLB, COVB, and CORRB Options: Parameter Estimates

The procedure first displays an Analysis of Variance table (Figure 61.27). The F statistic for the overall model is significant, indicating that the model explains a significant portion of the variation in the data.

start figure
  The REG Procedure
  Model: MODEL1
  Dependent Variable: Oxygen

  Analysis of Variance

                            Sum of        Mean
  Source            DF     Squares      Square  F Value  Pr > F
  Model              6   722.54361   120.42393    22.43  <.0001
  Error             24   128.83794     5.36825
  Corrected Total   30   851.38154

  Root MSE           2.31695    R-Square    0.8487
  Dependent Mean    47.37581    Adj R-Sq    0.8108
  Coeff Var          4.89057
end figure

Figure 61.27: ANOVA Table

The procedure next displays Parameter Estimates and some associated statistics (Figure 61.28). First, the estimates are shown, followed by their Standard Errors. The next two columns of the table contain the t statistics and the corresponding probabilities for testing the null hypothesis that the parameter is zero. These probabilities are usually referred to as p-values. For example, the Intercept term in the model is estimated to be 102.9 and is significantly different from zero. The next two columns of the table are the result of requesting the SS1 and SS2 options, and they show sequential and partial Sums of Squares (SS) associated with each variable. The Standardized Estimates (produced by the STB option) are the parameter estimates that result when all variables are standardized to a mean of 0 and a variance of 1. These estimates are computed by multiplying the original estimates by the standard deviation of the regressor (independent) variable and then dividing by the standard deviation of the dependent variable. The CLB option adds the upper and lower 95% confidence limits for the parameter estimates; the α level can be changed by specifying the ALPHA= option in the PROC REG or MODEL statement.
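The CLB limits can be checked by hand as the estimate plus or minus a t quantile times the standard error. The sketch below uses the Intercept row of Figure 61.28 and the 24 error degrees of freedom from Figure 61.27. The quantile 2.0639 is the standard 97.5th percentile of the t distribution with 24 degrees of freedom, supplied here as an assumption rather than taken from the output:

```latex
\begin{aligned}
b \pm t_{1-\alpha/2,\,n-p}\,\mathrm{SE}(b)
  &= 102.93448 \pm 2.0639 \times 12.40326 \\
  &\approx 102.93448 \pm 25.599 \\
  &\approx (77.335,\ 128.534)
\end{aligned}
```

This agrees with the displayed limits of 77.33541 and 128.53355 up to rounding of the quantile.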

The final two tables are produced as a result of requesting the COVB and CORRB options (Figure 61.29). These tables show the estimated covariance matrix of the parameter estimates, and the estimated correlation matrix of the estimates.

start figure
  The REG Procedure
  Model: MODEL1
  Dependent Variable: Oxygen

  Covariance of Estimates

  Variable      Intercept       RunTime           Age        Weight      RunPulse      MaxPulse     RestPulse
  Intercept  153.84081152  0.7678373769  -0.902049478  -0.178237818   0.280796516  -0.832761667  -0.147954715
  RunTime    0.7678373769  0.1478880839  -0.014191688  -0.004417672  -0.009047784  0.0046249498  -0.010915224
  Age        -0.902049478  -0.014191688   0.009967521  0.0010219105  -0.001203914  0.0035823843  0.0014897532
  Weight     -0.178237818  -0.004417672  0.0010219105  0.0029804131  0.0009644683  -0.001372241  0.0003799295
  RunPulse    0.280796516  -0.009047784  -0.001203914  0.0009644683  0.0143647273  -0.014952457  -0.000764507
  MaxPulse   -0.832761667  0.0046249498  0.0035823843  -0.001372241  -0.014952457  0.0186309364  0.0003425724
  RestPulse  -0.147954715  -0.010915224  0.0014897532  0.0003799295  -0.000764507  0.0003425724  0.0043631674

  Correlation of Estimates

  Variable   Intercept    RunTime        Age     Weight   RunPulse   MaxPulse  RestPulse
  Intercept     1.0000     0.1610    -0.7285    -0.2632     0.1889    -0.4919    -0.1806
  RunTime       0.1610     1.0000    -0.3696    -0.2104    -0.1963     0.0881    -0.4297
  Age          -0.7285    -0.3696     1.0000     0.1875    -0.1006     0.2629     0.2259
  Weight       -0.2632    -0.2104     0.1875     1.0000     0.1474    -0.1842     0.1054
  RunPulse      0.1889    -0.1963    -0.1006     0.1474     1.0000    -0.9140    -0.0966
  MaxPulse     -0.4919     0.0881     0.2629    -0.1842    -0.9140     1.0000     0.0380
  RestPulse    -0.1806    -0.4297     0.2259     0.1054    -0.0966     0.0380     1.0000
end figure

Figure 61.29: SS1, SS2, STB, CLB, COVB, and CORRB Options: Covariances and Correlations

For further discussion of the parameters and statistics, see the Displayed Output section on page 3918, and Chapter 2, Introduction to Regression Procedures.

Predicted and Residual Values

The display of the predicted values and residuals is controlled by the P, R, CLM, and CLI options in the MODEL statement. The P option causes PROC REG to display the observation number, the ID value (if an ID statement is used), the actual value, the predicted value, and the residual. The R, CLI, and CLM options also produce the items under the P option. Thus, P is unnecessary if you use one of the other options.

The R option requests more detail, especially about the residuals. The standard errors of the mean predicted value and the residual are displayed. The studentized residual, which is the residual divided by its standard error, is both displayed and plotted. A measure of influence, Cook's D, is displayed. Cook's D measures the change to the estimates that results from deleting each observation (Cook 1977, 1979). This statistic is very similar to DFFITS.

The CLM option requests that PROC REG display the 100(1 − α)% lower and upper confidence limits for the mean predicted values. This accounts for the variation due to estimating the parameters only. If you want a 100(1 − α)% confidence interval for observed values, then you can use the CLI option, which adds in the variability of the error term. The α level can be specified with the ALPHA= option in the PROC REG or MODEL statement.
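The difference between the two intervals can be seen in a hand calculation for the year 2000 observation of the population example (Figures 61.30 and 61.31), where the predicted value is 276.6642, the standard error of the mean prediction is 1.7565, and the error mean square is 8.99852 on 19 degrees of freedom. The quantile 2.0930 is the standard 97.5th percentile of the t distribution with 19 degrees of freedom and is an assumption supplied for this sketch:

```latex
\begin{aligned}
\text{CLM:}\quad & \hat{y} \pm t\,\mathrm{SE}(\hat{y})
   = 276.6642 \pm 2.0930 \times 1.7565
   \approx (272.988,\ 280.341) \\
\text{CLI:}\quad & \hat{y} \pm t\,\sqrt{s^2 + \mathrm{SE}(\hat{y})^2}
   = 276.6642 \pm 2.0930 \times \sqrt{8.99852 + 1.7565^2}
   \approx (269.389,\ 283.940)
\end{aligned}
```

Both results match the displayed 95% CL Mean and 95% CL Predict columns to rounding. The CLI interval is wider because it includes the error variance s².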

You can use these statistics in PLOT and PAINT statements. This is useful in performing a variety of regression diagnostics. For definitions of the statistics produced by these options, see Chapter 2, Introduction to Regression Procedures.

The following example uses the US population data found in the section Polynomial Regression beginning on page 3804.

  data USPop2;
     input Year @@;
     YearSq=Year*Year;
     datalines;
  2010 2020 2030
  ;

  data USPop2;
     set USPopulation USPop2;

  proc reg data=USPop2;
     id Year;
     model Population=Year YearSq / r cli clm;
  run;

After producing the usual Analysis of Variance and Parameter Estimates tables (Figure 61.30), the procedure displays the results of requesting the options for predicted and residual values (Figure 61.31). For each observation, the requested information is shown. Note that the ID variable is used to identify each observation. Also note that, for observations with missing dependent variables, the predicted value, standard error of the predicted value, and confidence intervals for the predicted value are still available.

start figure
  The REG Procedure
  Model: MODEL1
  Dependent Variable: Population

  Analysis of Variance

                            Sum of        Mean
  Source            DF     Squares      Square  F Value  Pr > F
  Model              2      159529       79765  8864.19  <.0001
  Error             19   170.97193     8.99852
  Corrected Total   21      159700

  Root MSE           2.99975    R-Square    0.9989
  Dependent Mean    94.64800    Adj R-Sq    0.9988
  Coeff Var          3.16938

  Parameter Estimates

                 Parameter    Standard
  Variable   DF   Estimate       Error  t Value  Pr > |t|
  Intercept   1      21631   639.50181    33.82    <.0001
  Year        1  -24.04581     0.67547   -35.60    <.0001
  YearSq      1    0.00668  0.00017820    37.51    <.0001
end figure

Figure 61.30: Regression Using the R, CLI, and CLM Options
start figure
  The REG Procedure
  Model: MODEL1
  Dependent Variable: Population

  Output Statistics

            Dependent Predicted    Std Error
  Obs  Year  Variable     Value Mean Predict         95% CL Mean      95% CL Predict
    1  1790    3.9290    6.2127       1.7565    2.5362    9.8892   -1.0631   13.4884
    2  1800    5.3080    5.7226       1.4560    2.6751    8.7701   -1.2565   12.7017
    3  1810    7.2390    6.5694       1.2118    4.0331    9.1057   -0.2021   13.3409
    4  1820    9.6380    8.7531       1.0305    6.5963   10.9100    2.1144   15.3918
    5  1830   12.8660   12.2737       0.9163   10.3558   14.1916    5.7087   18.8386
    6  1840   17.0690   17.1311       0.8650   15.3207   18.9415   10.5968   23.6655
    7  1850   23.1910   23.3254       0.8613   21.5227   25.1281   16.7932   29.8576
    8  1860   31.4430   30.8566       0.8846   29.0051   32.7080   24.3107   37.4024
    9  1870   39.8180   39.7246       0.9163   37.8067   41.6425   33.1597   46.2896
   10  1880   50.1550   49.9295       0.9436   47.9545   51.9046   43.3476   56.5114
   11  1890   62.9470   61.4713       0.9590   59.4641   63.4785   54.8797   68.0629
   12  1900   75.9940   74.3499       0.9590   72.3427   76.3571   67.7583   80.9415
   13  1910   91.9720   88.5655       0.9436   86.5904   90.5405   81.9836   95.1473
   14  1920  105.7100  104.1178       0.9163  102.2000  106.0357   97.5529  110.6828
   15  1930  122.7750  121.0071       0.8846  119.1556  122.8585  114.4612  127.5529
   16  1940  131.6690  139.2332       0.8613  137.4305  141.0359  132.7010  145.7654
   17  1950  151.3250  158.7962       0.8650  156.9858  160.6066  152.2618  165.3306
   18  1960  179.3230  179.6961       0.9163  177.7782  181.6139  173.1311  186.2610
   19  1970  203.2110  201.9328       1.0305  199.7759  204.0896  195.2941  208.5715
   20  1980  226.5420  225.5064       1.2118  222.9701  228.0427  218.7349  232.2779
   21  1990  248.7100  250.4168       1.4560  247.3693  253.4644  243.4378  257.3959
   22  2000  281.4220  276.6642       1.7565  272.9877  280.3407  269.3884  283.9400
   23  2010         .  304.2484       2.1073  299.8377  308.6591  296.5754  311.9214
   24  2020         .  333.1695       2.5040  327.9285  338.4104  324.9910  341.3479
   25  2030         .  363.4274       2.9435  357.2665  369.5883  354.6310  372.2238

  Output Statistics

                      Std Error   Student                     Cook's
  Obs  Year  Residual  Residual  Residual     -2-1 0 1 2            D
    1  1790   -2.2837     2.432    -0.939   |     *|      |    0.153
    2  1800   -0.4146     2.623    -0.158   |      |      |    0.003
    3  1810    0.6696     2.744     0.244   |      |      |    0.004
    4  1820    0.8849     2.817     0.314   |      |      |    0.004
    5  1830    0.5923     2.856     0.207   |      |      |    0.001
    6  1840   -0.0621     2.872   -0.0216   |      |      |    0.000
    7  1850   -0.1344     2.873   -0.0468   |      |      |    0.000
    8  1860    0.5864     2.866     0.205   |      |      |    0.001
    9  1870    0.0934     2.856    0.0327   |      |      |    0.000
   10  1880    0.2255     2.847    0.0792   |      |      |    0.000
   11  1890    1.4757     2.842     0.519   |      |*     |    0.010
   12  1900    1.6441     2.842     0.578   |      |*     |    0.013
   13  1910    3.4065     2.847     1.196   |      |**    |    0.052
   14  1920    1.5922     2.856     0.557   |      |*     |    0.011
   15  1930    1.7679     2.866     0.617   |      |*     |    0.012
   16  1940   -7.5642     2.873    -2.632   | *****|      |    0.208
   17  1950   -7.4712     2.872    -2.601   | *****|      |    0.205
   18  1960   -0.3731     2.856    -0.131   |      |      |    0.001
   19  1970    1.2782     2.817     0.454   |      |      |    0.009
   20  1980    1.0356     2.744     0.377   |      |      |    0.009
   21  1990   -1.7068     2.623    -0.651   |     *|      |    0.044
   22  2000    4.7578     2.432     1.957   |      |***   |    0.666
   23  2010         .         .         .                          .
   24  2020         .         .         .                          .
   25  2030         .         .         .                          .

  Sum of Residuals                  -4.4596E-11
  Sum of Squared Residuals            170.97193
  Predicted Residual SS (PRESS)       237.71229
end figure

Figure 61.31: Regression Using the R, CLI, and CLM Options

The plot of studentized residuals and Cook's D statistics are displayed as a result of requesting the R option. In the plot of studentized residuals, a large number of observations with absolute values greater than two indicates an inadequate model. A version of the studentized residual plot can be created on a high-resolution graphics device; see Example 61.7 on page 3952 for a similar example.
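If you prefer a list of the flagged observations to a visual scan of the plot, one possible approach is to save the diagnostics with an OUTPUT statement and print the rows whose studentized residual exceeds the rule-of-thumb cutoff of two in absolute value. In this sketch the data set name Diag and the variable names StudentResid and CooksD are hypothetical names chosen for illustration; USPop2 is the data set built in the preceding example.

```sas
/* Save studentized residuals and Cook's D.                      */
/* Diag, StudentResid, and CooksD are hypothetical names.        */
proc reg data=USPop2;
   id Year;
   model Population=Year YearSq / noprint;
   output out=Diag student=StudentResid cookd=CooksD;
run;
quit;

/* List observations beyond the |studentized residual| > 2 cutoff. */
/* Rows with a missing residual do not satisfy the condition.      */
proc print data=Diag;
   where abs(StudentResid) > 2;
   var Year Population StudentResid CooksD;
run;
```

The cutoff of two is the informal guideline mentioned in the text, not a formal test.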

Line Printer Scatter Plot Features

This section discusses the special options available with line printer scatter plots. Detailed examples of high resolution graphics plots and options are given in Example 61.6 on page 3950.

Producing Scatter Plots

The interactive PLOT statement available in PROC REG enables you to look at scatter plots of data and diagnostic statistics. These plots can help you to evaluate the model and detect outliers in your data. Several options enable you to place multiple plots on a single page, superimpose plots, and collect plots to be overlaid by later plots. The PAINT statement can be used to highlight points on a plot. See the section Painting Scatter Plots on page 3889 for more information on painting.

The Class data set is used in the following examples.

You can superimpose several plots with the OVERLAY option. With the following statements, a plot of Weight against Height is overlaid with plots of the predicted values and the 95% prediction intervals. The model on which the statistics are based is the full model including Height and Age . These statements produce Figure 61.32:

  proc reg data=Class lineprinter;
     model Weight=Height Age / noprint;
     plot (ucl. lcl. p.)*Height='-' Weight*Height
          / overlay symbol='o';
  run;
start figure
  The REG Procedure
  Model: MODEL1
  Dependent Variable: Weight

  [Line printer plot: Weight (symbol 'o') overlaid with the predicted values and the upper and lower 95% prediction limits (symbol '-'), plotted against Height. Horizontal axis: Height, 50 to 72. Vertical axis: 25 to 175.]
end figure

Figure 61.32: Scatter Plot Showing Data, Predicted Values, and Confidence Limits

In this plot, the data values are marked with the symbol 'o' and the predicted values and prediction interval limits are labeled with the symbol '-'. The plot is scaled to accommodate the points from all plots. This is an important difference from the COLLECT option, which does not rescale plots after the first plot or plots are collected. You could separate the overlaid plots by using the following statements:

  plot;
  run;

This places each of the four plots on a separate page, while the statements

  plot / overlay;
  run;
repeat the previous overlaid plot. In general, the statement
  plot;  
is equivalent to respecifying the most recent PLOT statement without any options. However, the COLLECT, HPLOTS=, SYMBOL=, and VPLOTS= options apply across PLOT statements and remain in effect.

The next example shows how you can overlay plots of statistics before and after a change in the model. For the full model involving Height and Age , the ordinary residuals and the studentized residuals are plotted against the predicted values. The COLLECT option causes these plots to be collected or retained for re-display later. The option HPLOTS=2 allows the two plots to appear side by side on one page. The symbol f is used on these plots to identify them as resulting from the full model. These statements produce Figure 61.33:

  plot r.*p. student.*p. / collect hplots=2 symbol='f';
  run;
start figure
  The REG Procedure
  Model: MODEL1

  [Two side-by-side line printer plots for the full model, all points marked 'f': RESIDUAL against PRED (left) and STUDENT against PRED (right). Horizontal axes: PRED, 40 to 140.]
end figure

Figure 61.33: Collecting Residual Plots for the Full Model

Note that these plots are not overlaid. The COLLECT option does not overlay the plots in one PLOT statement but retains them so that they can be overlaid by later plots. When the COLLECT option appears in a PLOT statement, the plots in that statement become the first plots in the collection.

Next, the model is reduced by deleting the Age variable. The PLOT statement requests the same plots as before but labels the points with the symbol r denoting the reduced model. The following statements produce Figure 61.34:

  delete Age;
  plot r.*p. student.*p. / symbol='r';
  run;
start figure
  The REG Procedure
  Model: MODEL1.1

  [Two side-by-side line printer plots: RESIDUAL against PRED (left) and STUDENT against PRED (right), with reduced-model points ('r') overlaid on the collected full-model points ('f'); '?' marks positions shared by both models. Horizontal axes: PRED, 40 to 140.]
end figure

Figure 61.34: Overlaid Residual Plots for Full and Reduced Models

Notice that the COLLECT option causes the corresponding plots to be overlaid. Also notice that the DELETE statement causes the model label to be changed from MODEL1 to MODEL1.1. The points labeled f are from the full model, and points labeled r are from the reduced model. Positions labeled ? contain at least one point from each model. In this example, the OVERLAY option cannot be used because all of the plots to be overlaid cannot be specified in one PLOT statement. With the COLLECT option, any changes to the model or the data used to fit the model do not affect plots collected before the changes. Collected plots are always reproduced exactly as they first appear. (Similarly, a PAINT statement does not affect plots collected before the PAINT statement is issued.)

The previous example overlays the residual plots for two different models. You may prefer to see them side by side on the same page. This can also be done with the COLLECT option by using a blank plot. Continuing from the last example, the COLLECT, HPLOTS=2, and SYMBOL='r' options are still in effect. In the following PLOT statement, the CLEAR option deletes the collected plots and allows the specified plot to begin a new collection. The plot created is the residual plot for the reduced model. These statements produce Figure 61.35:

  plot r.*p. / clear;
  run;
start figure
  The REG Procedure
  Model: MODEL1.1

  [Line printer plot: RESIDUAL against PRED for the reduced model, points marked 'r'. Horizontal axis: PRED, 40 to 140.]
end figure

Figure 61.35: Residual Plot for Reduced Model Only

The next statements add the variable AGE to the model and place the residual plot for the full model next to the plot for the reduced model. Notice that a blank plot is created in the first plot request by placing nothing between the quotes. Since the COLLECT option is in effect, this plot is superimposed on the residual plot for the reduced model. The residual plot for the full model is created by the second request. The result is the desired side-by-side plots. The NOCOLLECT option turns off the collection process after the specified plots are added and displayed. Any PLOT statements that follow show only the newly specified plots. These statements produce Figure 61.36:

  add Age;
  plot r.*p.='' r.*p.='f' / nocollect;
  run;
start figure
  The REG Procedure
  Model: MODEL1.2

  [Two side-by-side line printer plots of RESIDUAL against PRED: the reduced model (points marked 'r') on the left and the full model (points marked 'f') on the right. Horizontal axes: PRED, 40 to 140.]
end figure

Figure 61.36: Side-by-Side Residual Plots for the Full and Reduced Models

Frequently, when the COLLECT option is in effect, you want the current and following PLOT statements to show only the specified plots. To do this, use both the CLEAR and NOCOLLECT options in the current PLOT statement.
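The following statements are a minimal sketch of that combination. The data set and model are borrowed from the fitness examples in this chapter, and the particular plot requests are illustrative assumptions rather than part of the original example.

```sas
/* Sketch only: illustrative plot requests on the fitness data. */
proc reg data=fitness lineprinter;
   model Oxygen=RunTime;
   plot r.*p.='r' / collect;               /* begin collecting plots */
run;
   add Age;
   /* CLEAR discards the plots collected so far; NOCOLLECT ends the
      collection, so this PLOT statement and any later ones show
      only the plots they request. */
   plot student.*p.='s' / clear nocollect;
run;
quit;
```

With CLEAR alone, collection would continue into later PLOT statements; with NOCOLLECT alone, the previously collected residual plot would still appear in this display.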

Painting Scatter Plots

Painting scatter plots is a useful interactive tool that enables you to mark points of interest in scatter plots. Painting can be used to identify extreme points in scatter plots or to reveal the relationship between two scatter plots. The CLASS data (from the Simple Linear Regression section on page 3800) is used to illustrate some of these applications. First, a scatter plot of the studentized residuals against the predicted values is generated. This plot is shown in Figure 61.37.

   proc reg data=Class lineprinter;
      model Weight=Age Height / noprint;
      plot student.*p.;
   run;

[Line-printer output not reproduced. The REG Procedure, Model: MODEL1, Dependent Variable: Weight. Scatter plot of STUDENT (Studentized Residual) against PRED (Predicted Value of Weight).]

Figure 61.37: Plotting Studentized Residuals Against Predicted Values

Then, the following statements identify the observation Henry in the scatter plot and produce Figure 61.38:

   paint Name='Henry' / symbol = 'H';
   plot;
   run;

[Line-printer output not reproduced. The REG Procedure, Model: MODEL1, Dependent Variable: Weight. The same plot of STUDENT against PRED (Predicted Value of Weight), with the observation for Henry marked 'H'.]

Figure 61.38: Painting One Observation

Next, the following statements identify observations with large absolute residuals:

   paint student.>=2 or student.<=-2 / symbol='s';
   plot;
   run;

The log shows the observation numbers found with these conditions and gives the painting symbol and the number of observations found. Note that the previous PAINT statement is also used in the PLOT statement. Figure 61.39 shows the scatter plot produced by the preceding statements.

[Line-printer output not reproduced. The REG Procedure, Model: MODEL1, Dependent Variable: Weight. The same plot of STUDENT against PRED (Predicted Value of Weight), with the observation that meets the PAINT condition marked 's' and Henry still marked 'H'.]

Figure 61.39: Painting Several Observations

The following statements relate two different scatter plots. These statements produce Figure 61.40.

   paint student.>=1 / symbol='p';
   paint student.<1 and student.>-1 / symbol='s';
   paint student.<=-1 / symbol='n';
   plot student. * p. cookd. * h. / hplots=2;
   run;

[Line-printer output not reproduced. The REG Procedure, Model: MODEL1. Two side-by-side plots: STUDENT against PRED on the left and COOKD against H on the right. In both plots each observation is marked 'p', 's', or 'n' according to the PAINT statements.]

Figure 61.40: Painting Observations on More than One Plot

Models of Less Than Full Rank

If the model is not full rank, there are an infinite number of least-squares solutions for the estimates. PROC REG chooses a nonzero solution for all variables that are linearly independent of previous variables and a zero solution for other variables. This solution corresponds to using a generalized inverse in the normal equations, and the expected values of the estimates are the Hermite normal form of X multiplied by the true parameters:

$$E(b) = (X'X)^{-}(X'X)\beta$$

Degrees of freedom for the zeroed estimates are reported as zero. The hypotheses that are not testable have t tests reported as missing. The message that the model is not full rank includes a display of the relations that exist in the matrix.

The next example uses the fitness data from Example 61.1 on page 3924. The variable Dif=RunPulse-RestPulse is created. When this variable is included in the model along with RunPulse and RestPulse, there is a linear dependency (or exact collinearity) between the independent variables. Figure 61.41 shows how this problem is diagnosed.

   data fit2;
      set fitness;
      Dif=RunPulse-RestPulse;
   proc reg data=fit2;
      model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse Dif;
   run;

                         The REG Procedure
                           Model: MODEL1
                    Dependent Variable: Oxygen

                        Analysis of Variance

                                  Sum of          Mean
   Source             DF         Squares        Square   F Value   Pr > F
   Model               6       722.54361     120.42393     22.43   <.0001
   Error              24       128.83794       5.36825
   Corrected Total    30       851.38154

   Root MSE            2.31695   R-Square   0.8487
   Dependent Mean     47.37581   Adj R-Sq   0.8108
   Coeff Var           4.89057

   NOTE: Model is not full rank. Least-squares solutions for the parameters are
         not unique. Some statistics will be misleading. A reported DF of 0 or B
         means that the estimate is biased.
   NOTE: The following parameters have been set to 0, since the variables are a
         linear combination of other variables as shown.

            Dif = RunPulse - RestPulse

                        Parameter Estimates

                      Parameter     Standard
   Variable    DF      Estimate        Error   t Value   Pr > |t|
   Intercept    1     102.93448     12.40326      8.30     <.0001
   RunTime      1      -2.62865      0.38456     -6.84     <.0001
   Age          1      -0.22697      0.09984     -2.27     0.0322
   Weight       1      -0.07418      0.05459     -1.36     0.1869
   RunPulse     B      -0.36963      0.11985     -3.08     0.0051
   MaxPulse     1       0.30322      0.13650      2.22     0.0360
   RestPulse    B      -0.02153      0.06605     -0.33     0.7473
   Dif          0             0            .         .          .

Figure 61.41: Model That Is Not Full Rank: REG Procedure

PROC REG produces a message informing you that the model is less than full rank. Parameters with DF=0 are not estimated, and parameters with DF=B are biased. In addition, the form of the linear dependency among the regressors is displayed.

Collinearity Diagnostics

When a regressor is nearly a linear combination of other regressors in the model, the affected estimates are unstable and have high standard errors. This problem is called collinearity or multicollinearity . It is a good idea to find out which variables are nearly collinear with which other variables. The approach in PROC REG follows that of Belsley, Kuh, and Welsch (1980). PROC REG provides several methods for detecting collinearity with the COLLIN, COLLINOINT, TOL, and VIF options.

The COLLIN option in the MODEL statement requests that a collinearity analysis be performed. First, $X'X$ is scaled to have 1s on the diagonal. If you specify the COLLINOINT option, the intercept variable is adjusted out first. Then the eigenvalues and eigenvectors are extracted. The analysis in PROC REG is reported with eigenvalues of $X'X$ rather than singular values of $X$. The eigenvalues of $X'X$ are the squares of the singular values of $X$.

The condition indices are the square roots of the ratio of the largest eigenvalue to each individual eigenvalue. The largest condition index is the condition number of the scaled X matrix. Belsley, Kuh, and Welsch (1980) suggest that, when this number is around 10, weak dependencies may be starting to affect the regression estimates. When this number is larger than 100, the estimates may have a fair amount of numerical error (although the statistical standard error almost always is much greater than the numerical error).

For each variable, PROC REG produces the proportion of the variance of the estimate accounted for by each principal component. A collinearity problem occurs when a component associated with a high condition index contributes strongly (variance proportion greater than about 0.5) to the variance of two or more variables.

The VIF option in the MODEL statement provides the Variance Inflation Factors (VIF). These factors measure the inflation in the variances of the parameter estimates due to collinearities that exist among the regressor (independent) variables. There are no formal criteria for deciding if a VIF is large enough to affect the predicted values.

The TOL option requests the tolerance values for the parameter estimates. The tolerance is defined as 1/VIF.
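For reference, both statistics can be written in terms of $R_j^2$, the R-square obtained by regressing the jth regressor on the other regressors in the model. This is the standard textbook formulation, offered as a reading aid rather than as a description of how PROC REG computes the values internally:

```latex
\mathrm{TOL}_j = 1 - R_j^2,
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^2} = \frac{1}{\mathrm{TOL}_j}
```

A regressor that is unrelated to the others has $R_j^2 = 0$, so its tolerance and VIF both equal 1. As $R_j^2$ approaches 1, the tolerance approaches 0 and the VIF grows without bound.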

For a complete discussion of the preceding methods, refer to Belsley, Kuh, and Welsch (1980). For a more detailed explanation of using the methods with PROC REG, refer to Freund and Littell (1986).

This example uses the COLLIN option on the fitness data found in Example 61.1 on page 3924. The following statements produce Figure 61.42.

   proc reg data=fitness;
      model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
            / tol vif collin;
   run;

                         The REG Procedure
                           Model: MODEL1
                    Dependent Variable: Oxygen

                        Analysis of Variance

                                  Sum of          Mean
   Source             DF         Squares        Square   F Value   Pr > F
   Model               6       722.54361     120.42393     22.43   <.0001
   Error              24       128.83794       5.36825
   Corrected Total    30       851.38154

   Root MSE            2.31695   R-Square   0.8487
   Dependent Mean     47.37581   Adj R-Sq   0.8108
   Coeff Var           4.89057

                                    Parameter Estimates

                      Parameter     Standard                                      Variance
   Variable    DF      Estimate        Error   t Value   Pr > |t|   Tolerance    Inflation
   Intercept    1     102.93448     12.40326      8.30     <.0001           .            0
   RunTime      1      -2.62865      0.38456     -6.84     <.0001     0.62859      1.59087
   Age          1      -0.22697      0.09984     -2.27     0.0322     0.66101      1.51284
   Weight       1      -0.07418      0.05459     -1.36     0.1869     0.86555      1.15533
   RunPulse     1      -0.36963      0.11985     -3.08     0.0051     0.11852      8.43727
   MaxPulse     1       0.30322      0.13650      2.22     0.0360     0.11437      8.74385
   RestPulse    1      -0.02153      0.06605     -0.33     0.7473     0.70642      1.41559

                                              Collinearity Diagnostics

                         Condition   -------------------------------- Proportion of Variation --------------------------------
   Number   Eigenvalue       Index    Intercept      RunTime          Age       Weight     RunPulse     MaxPulse    RestPulse
        1      6.94991     1.00000   0.00002326   0.00021086   0.00015451   0.00019651   0.00000862   0.00000634   0.00027850
        2      0.01868    19.29087      0.00218      0.02522      0.14632      0.01042   0.00000244   0.00000743      0.39064
        3      0.01503    21.50072   0.00061541      0.12858      0.15013      0.23571      0.00119      0.00125      0.02809
        4      0.00911    27.62115      0.00638      0.60897      0.03186      0.18313      0.00149      0.00123      0.19030
        5      0.00607    33.82918      0.00133      0.12501      0.11284      0.44442      0.01506      0.00833      0.36475
        6      0.00102    82.63757      0.79966      0.09746      0.49660      0.10330      0.06948      0.00561      0.02026
        7   0.00017947   196.78560      0.18981      0.01455      0.06210      0.02283      0.91277      0.98357      0.00568

Figure 61.42: Regression Using the TOL, VIF, and COLLIN Options
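As a reading aid for Figure 61.42, the following arithmetic reproduces two of the reported values from the definitions given earlier in this section, using only numbers printed in the figure. Small differences in the last digits are expected because the printed eigenvalues are rounded.

```latex
\begin{aligned}
\text{condition index}_7 &= \sqrt{\frac{6.94991}{0.00017947}} \approx 196.8 \\[4pt]
\mathrm{TOL}_{\mathrm{RunPulse}} &= \frac{1}{\mathrm{VIF}_{\mathrm{RunPulse}}} = \frac{1}{8.43727} \approx 0.1185
\end{aligned}
```

Component 7 has the largest condition index, and its variance proportions are 0.91277 for RunPulse and 0.98357 for MaxPulse. Both exceed 0.5, so by the rule stated earlier this component points to a near dependency between these two regressors. Their VIF values (8.43727 and 8.74385) are also the largest in the table.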

Model Fit and Diagnostic Statistics

This section gathers the formulas for the statistics available in the MODEL, PLOT, and OUTPUT statements. The model to be fit is $Y = X\beta + \epsilon$, and the parameter estimate is denoted by $b = (X'X)^{-}X'Y$. The subscript $i$ denotes values for the $i$th observation, the parenthetical subscript $(i)$ means that the statistic is computed using all observations except the $i$th observation, and the subscript $jj$ indicates the $j$th diagonal matrix entry. The ALPHA= option in the PROC REG or MODEL statement is used to set the $\alpha$ value for the t statistics.

Table 61.6 contains the summary statistics for assessing the fit of the model.

Table 61.7 contains the diagnostic statistics and their formulas; these formulas and further information can be found in Chapter 2, Introduction to Regression Procedures, and in the Influence Diagnostics section on page 3898. Each statistic is computed for each observation.

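Table 61.7: Formulas and Definitions for Diagnostic Statistics

   MODEL Option or Statistic     Formula
   PRED ($\hat{Y}_i$)            $X_i b$
   RES ($r_i$)                   $Y_i - \hat{Y}_i$
   H ($h_i$)                     $X_i (X'X)^{-} X_i'$
   STDP                          $\sqrt{h_i s^2}$
   STDI                          $\sqrt{(1 + h_i) s^2}$
   STDR                          $\sqrt{(1 - h_i) s^2}$
   LCL                           $\hat{Y}_i - t_{\alpha/2}\,\mathrm{STDI}$
   LCLM                          $\hat{Y}_i - t_{\alpha/2}\,\mathrm{STDP}$
   UCL                           $\hat{Y}_i + t_{\alpha/2}\,\mathrm{STDI}$
   UCLM                          $\hat{Y}_i + t_{\alpha/2}\,\mathrm{STDP}$
   STUDENT                       $r_i / \mathrm{STDR}_i$
   RSTUDENT                      $r_i / \left(s_{(i)} \sqrt{1 - h_i}\right)$
   COOKD                         $\frac{1}{p}\,\mathrm{STUDENT}^2\,\frac{\mathrm{STDP}^2}{\mathrm{STDR}^2}$
   COVRATIO                      $\det\left(s_{(i)}^2 (X_{(i)}'X_{(i)})^{-1}\right) / \det\left(s^2 (X'X)^{-1}\right)$
   DFFITS                        $(\hat{Y}_i - \hat{Y}_{(i)}) / \left(s_{(i)} \sqrt{h_i}\right)$
   DFBETAS$_j$                   $(b_j - b_{(i)j}) / \left(s_{(i)} \sqrt{[(X'X)^{-1}]_{jj}}\right)$
   PRESS ($\mathrm{predr}_i$)    $r_i / (1 - h_i)$

In these formulas, $s^2$ is the error mean square of the fitted model, $p$ is the number of parameters in the model, and $t_{\alpha/2}$ is the t critical value for the level set by the ALPHA= option.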

Influence Diagnostics

This section discusses the INFLUENCE option, which produces several influence statistics, and the PARTIAL option, which produces partial regression leverage plots.

The INFLUENCE Option

The INFLUENCE option (in the MODEL statement) requests the statistics proposed by Belsley, Kuh, and Welsch (1980) to measure the influence of each observation on the estimates. Influential observations are those that, according to various criteria, appear to have a large influence on the parameter estimates.

Let $b_{(i)}$ be the parameter estimates after deleting the $i$th observation; let $s_{(i)}^2$ be the variance estimate after deleting the $i$th observation; let $X_{(i)}$ be the $X$ matrix without the $i$th observation; let $\hat{y}_{(i)}$ be the $i$th value predicted without using the $i$th observation; let $r_i = y_i - \hat{y}_i$ be the $i$th residual; and let $h_i$ be the $i$th diagonal of the projection matrix for the predictor space, also called the hat matrix:

$$h_i = X_i (X'X)^{-1} X_i'$$

Belsley, Kuh, and Welsch propose a cutoff of $2p/n$, where $n$ is the number of observations used to fit the model and $p$ is the number of parameters in the model. Observations with $h_i$ values above this cutoff should be investigated.

For each observation, PROC REG first displays the residual, the studentized residual (RSTUDENT), and the $h_i$. The studentized residual RSTUDENT differs slightly from STUDENT since the error variance is estimated by $s_{(i)}^2$ without the $i$th observation, not by $s^2$. For example,

$$\mathrm{RSTUDENT}_i = \frac{r_i}{s_{(i)}\sqrt{1 - h_i}}$$

Observations with RSTUDENT larger than 2 in absolute value may need some attention.

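The COVRATIO statistic measures the change in the determinant of the covariance matrix of the estimates by deleting the $i$th observation:

$$\mathrm{COVRATIO}_i = \frac{\det\left(s_{(i)}^2 \left(X_{(i)}'X_{(i)}\right)^{-1}\right)}{\det\left(s^2 (X'X)^{-1}\right)}$$

Belsley, Kuh, and Welsch suggest that observations with

$$\left|\mathrm{COVRATIO}_i - 1\right| \ge \frac{3p}{n}$$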

where p is the number of parameters in the model and n is the number of observations used to fit the model, are worth investigation.

The DFFITS statistic is a scaled measure of the change in the predicted value for the $i$th observation and is calculated by deleting the $i$th observation. A large value indicates that the observation is very influential in its neighborhood of the X space.

$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{s_{(i)}\sqrt{h_i}}$$

Large values of DFFITS indicate influential observations. A general cutoff to consider is 2; a size-adjusted cutoff recommended by Belsley, Kuh, and Welsch is $2\sqrt{p/n}$, where $n$ and $p$ are as defined previously.

The DFFITS statistic is very similar to Cook's D, defined in the section Predicted and Residual Values on page 3879.

The DFBETAS statistics are the scaled measures of the change in each parameter estimate and are calculated by deleting the $i$th observation:

$$\mathrm{DFBETAS}_j = \frac{b_j - b_{(i)j}}{s_{(i)}\sqrt{(X'X)^{jj}}}$$

where $(X'X)^{jj}$ is the $(j,j)$th element of $(X'X)^{-1}$.

In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations and $2/\sqrt{n}$ as a size-adjusted cutoff.

Figure 61.43 shows the tables produced by the INFLUENCE option for the population example (the section Polynomial Regression beginning on page 3804). See Figure 61.30 for the fitted regression equation.

   proc reg data=USPopulation;
      model Population=Year YearSq / influence;
   run;

In Figure 61.43, observations 16, 17, and 19 exceed the cutoff value of 2 for RSTUDENT. None of the observations exceeds the general cutoff of 2 for DFFITS or the DFBETAS, but observations 16, 17, and 19 exceed at least one of the size-adjusted cutoffs for these statistics. Observations 1 and 19 exceed the cutoff for the hat diagonals, and observations 1, 2, 16, 17, and 18 exceed the cutoffs for COVRATIO. Taken together, these statistics indicate that you should look first at observations 16, 17, and 19 and then perhaps investigate the other observations that exceeded a cutoff.
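The cutoffs can also be applied in a DATA step instead of by reading the table. The following is a minimal sketch, not part of the original example: the data set names Infl and Flagged and the variable names Lev, RStud, CovR, DFit, and the four flag variables are hypothetical, and the values n=19 and p=3 are assumptions about this model that you should replace with the counts for your own analysis. DFBETAS is omitted because the sketch relies only on statistics saved by the OUTPUT statement.

```sas
/* Sketch only: save influence statistics, then flag observations
   that exceed the Belsley, Kuh, and Welsch cutoffs.             */
proc reg data=USPopulation noprint;
   model Population=Year YearSq;
   output out=Infl h=Lev rstudent=RStud covratio=CovR dffits=DFit;
run;
quit;

data Flagged;
   set Infl;
   n=19;                                   /* assumed number of observations used */
   p=3;                                    /* assumed number of parameters        */
   HighLeverage = (Lev > 2*p/n);           /* hat diagonal above 2p/n             */
   LargeRStud   = (abs(RStud) > 2);        /* |RSTUDENT| above 2                  */
   CovRatioFlag = (abs(CovR-1) >= 3*p/n);  /* |COVRATIO - 1| at least 3p/n        */
   DffitsFlag   = (abs(DFit) > 2*sqrt(p/n)); /* size-adjusted DFFITS cutoff       */
run;

proc print data=Flagged;
   where HighLeverage or LargeRStud or CovRatioFlag or DffitsFlag;
run;
```

Each flag is 1 when the corresponding cutoff is exceeded and 0 otherwise, so the final PROC PRINT step lists only the observations that merit a closer look.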

The PARTIAL Option

The PARTIAL option in the MODEL statement produces partial regression leverage plots. If the experimental ODS graphics are not in effect, this option requires the use of the LINEPRINTER option in the PROC REG statement. One plot is created for each regressor in the current full model. For example, plots are produced for regressors included by using ADD statements; plots are not produced for interim models in the various model-selection methods but only for the full model. If you use a model-selection method and the final model contains only a subset of the original regressors, the PARTIAL option still produces plots for all regressors in the full model. If the experimental ODS graphics are in effect, these plots are produced as high-resolution graphics, in panels with a maximum of six partial regression leverage plots per panel. Multiple panels are displayed for models with more than six regressors.

For a given regressor, the partial regression leverage plot is the plot of the dependent variable and the regressor after they have been made orthogonal to the other regressors in the model. These can be obtained by plotting the residuals for the dependent variable against the residuals for the selected regressor, where the residuals for the dependent variable are calculated with the selected regressor omitted, and the residuals for the selected regressor are calculated from a model where the selected regressor is regressed on the remaining regressors. A line fit to the points has a slope equal to the parameter estimate in the full model.
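In symbols, the construction for the kth regressor can be sketched as follows. The names $X_{-k}$, $M_{-k}$, $e_{y}$, and $e_{k}$ are notation introduced here for illustration only, and the sketch assumes that $X_{-k}$ has full column rank so that the inverse exists.

```latex
\begin{aligned}
M_{-k} &= I - X_{-k}\left(X_{-k}'X_{-k}\right)^{-1}X_{-k}' \\
e_{y}  &= M_{-k}\,Y   && \text{(residuals of the dependent variable)} \\
e_{k}  &= M_{-k}\,x_k && \text{(residuals of the selected regressor)} \\
b_k    &= \frac{e_{k}'\,e_{y}}{e_{k}'\,e_{k}}
\end{aligned}
```

Here $x_k$ is the column of $X$ for the selected regressor and $X_{-k}$ is $X$ with that column removed. The last line is the slope of the least-squares line through the origin for the plotted points $(e_k, e_y)$, which equals the full-model estimate $b_k$.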

When the experimental ODS graphics are not in effect, points in the plot are marked by the number of replicates appearing at one position. The symbol * is used if there are ten or more replicates. If an ID statement is specified, the left-most nonblank character in the value of the ID variable is used as the plotting symbol.

The following statements use the fitness data in Example 61.1 on page 3924 with the PARTIAL option and the ODS GRAPHICS statement to produce the partial regression leverage plots. The plots are shown in Figure 61.44. For general information about ODS graphics, see Chapter 15, Statistical Graphics Using ODS. For specific information about the graphics available in the REG procedure, see the ODS Graphics section on page 3922.

   ods html;
   ods graphics on;

   proc reg data=fitness;
      model Oxygen=RunTime Weight Age / partial;
   run;

   ods graphics off;
   ods html close;
Figure 61.44: Partial Regression Leverage Plots (Experimental)

The following statements create a similar panel of partial regression plots using the OUTPUT dataset and the GPLOT procedure. Four plots (created by regressing Oxygen and one of the variables on the remaining variables) are displayed in Figure 61.45. Notice that the Int variable is explicitly added to be used as the intercept term.

   data fitness2;
      set fitness;
      Int=1;
   proc reg data=fitness2 noprint;
      model Oxygen Int = RunTime Weight Age / noint;
      output out=temp r=ry rx;
   symbol1 c=blue;
   proc gplot data=temp;
      plot ry*rx / cframe=ligr;
      label ry=Oxygen
            rx=Intercept;
   run;
Figure 61.45: Partial Regression Leverage Plots

Reweighting Observations in an Analysis

Reweighting observations is an interactive feature of PROC REG that enables you to change the weights of observations used in computing the regression equation. Observations can also be deleted from the analysis (not from the data set) by changing their weights to zero. The Class data (in the Getting Started section on page 3800) are used to illustrate some of the features of the REWEIGHT statement. First, the full model is fit, and the residuals are displayed in Figure 61.46.

   proc reg data=Class;
      model Weight=Age Height / p;
      id Name;
   run;

                    The REG Procedure
                      Model: MODEL1
                Dependent Variable: Weight

                    Output Statistics

                     Dependent   Predicted
   Obs   Name         Variable       Value    Residual
     1   Alfred       112.5000    124.8686    -12.3686
     2   Alice         84.0000     78.6273      5.3727
     3   Barbara       98.0000    110.2812    -12.2812
     4   Carol        102.5000    102.5670     -0.0670
     5   Henry        102.5000    105.0849     -2.5849
     6   James         83.0000     80.2266      2.7734
     7   Jane          84.5000     89.2191     -4.7191
     8   Janet        112.5000    102.7663      9.7337
     9   Jeffrey       84.0000    100.2095    -16.2095
    10   John          99.5000     86.3415     13.1585
    11   Joyce         50.5000     57.3660     -6.8660
    12   Judy          90.0000    107.9625    -17.9625
    13   Louise        77.0000     76.6295      0.3705
    14   Mary         112.0000    117.1544     -5.1544
    15   Philip       150.0000    138.2164     11.7836
    16   Robert       128.0000    107.2043     20.7957
    17   Ronald       133.0000    118.9529     14.0471
    18   Thomas        85.0000     79.6676      5.3324
    19   William      112.0000    117.1544     -5.1544

   Sum of Residuals                          0
   Sum of Squared Residuals         2120.09974
   Predicted Residual SS (PRESS)    3272.72186

Figure 61.46: Full Model for CLASS Data, Residuals Shown

Upon examining the data and residuals, you realize that observation 17 (Ronald) was mistakenly included in the analysis. Also, you would like to examine the effect of reweighting to 0.5 those observations with residuals that have absolute values greater than or equal to 17.

   reweight obs.=17;
   reweight r. le -17 or r. ge 17 / weight=0.5;
   print p;
   run;

At this point, a message (on the log) appears that tells you which observations have been reweighted and what the new weights are. Figure 61.47 is produced.

                         The REG Procedure
                          Model: MODEL1.2
                    Dependent Variable: Weight

                         Output Statistics

                      Weight   Dependent   Predicted
   Obs   Name       Variable    Variable       Value    Residual
     1   Alfred       1.0000    112.5000    121.6250     -9.1250
     2   Alice        1.0000     84.0000     79.9296      4.0704
     3   Barbara      1.0000     98.0000    107.5484     -9.5484
     4   Carol        1.0000    102.5000    102.1663      0.3337
     5   Henry        1.0000    102.5000    104.3632     -1.8632
     6   James        1.0000     83.0000     79.9762      3.0238
     7   Jane         1.0000     84.5000     87.8225     -3.3225
     8   Janet        1.0000    112.5000    103.6889      8.8111
     9   Jeffrey      1.0000     84.0000     98.7606    -14.7606
    10   John         1.0000     99.5000     85.3117     14.1883
    11   Joyce        1.0000     50.5000     58.6811     -8.1811
    12   Judy         0.5000     90.0000    106.8740    -16.8740
    13   Louise       1.0000     77.0000     76.8377      0.1623
    14   Mary         1.0000    112.0000    116.2429     -4.2429
    15   Philip       1.0000    150.0000    135.9688     14.0312
    16   Robert       0.5000    128.0000    103.5150     24.4850
    17   Ronald            0    133.0000    117.8121     15.1879
    18   Thomas       1.0000     85.0000     78.1398      6.8602
    19   William      1.0000    112.0000    116.2429     -4.2429

   Sum of Residuals                          0
   Sum of Squared Residuals         1500.61194
   Predicted Residual SS (PRESS)    2287.57621

   NOTE: The above statistics use observation weights or frequencies.

Figure 61.47: Model with Reweighted Observations

The first REWEIGHT statement excludes observation 17, and the second REWEIGHT statement reweights observations 12 and 16 to 0.5. An important feature to note from this example is that the model is not refit until after the PRINT statement. REWEIGHT statements do not cause the model to be refit. This is so that multiple REWEIGHT statements can be applied to a subsequent model.
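The note at the bottom of Figure 61.47 can be checked by hand. Assuming that the reported Sum of Squared Residuals is the weighted sum $\sum_i w_i r_i^2$ (an interpretation, though one that agrees with the printed values), the weights and residuals in the figure give approximately:

```latex
\sum_i w_i r_i^2
  \approx \underbrace{1058.49}_{\text{16 observations with } w_i = 1}
  + \underbrace{0.5\,(16.8740)^2}_{\text{Judy}}
  + \underbrace{0.5\,(24.4850)^2}_{\text{Robert}}
  + \underbrace{0\,(15.1879)^2}_{\text{Ronald}}
  \approx 1058.49 + 142.37 + 299.76 + 0
  \approx 1500.6
```

This agrees with the printed value of 1500.61194 up to rounding. Ronald's residual is still displayed, but with a weight of 0 it contributes nothing to the sum.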

In this example, since the intent is to reweight observations with large residuals, the observation that was mistakenly included in the analysis should be deleted; then, the model should be fit for those remaining observations, and the observations with large residuals should be reweighted. To accomplish this, use the REFIT statement. Note that the model label has been changed from MODEL1 to MODEL1.2 as two REWEIGHT statements have been used. These statements produce Figure 61.48:

   reweight allobs / weight=1.0;
   reweight obs.=17;
   refit;
   reweight r. le -17 or r. ge 17 / weight=.5;
   print;
   run;

                         The REG Procedure
                          Model: MODEL1.5
                    Dependent Variable: Weight

                         Output Statistics

                      Weight   Dependent   Predicted
   Obs   Name       Variable    Variable       Value    Residual
     1   Alfred       1.0000    112.5000    120.9716     -8.4716
     2   Alice        1.0000     84.0000     79.5342      4.4658
     3   Barbara      1.0000     98.0000    107.0746     -9.0746
     4   Carol        1.0000    102.5000    101.5681      0.9319
     5   Henry        1.0000    102.5000    103.7588     -1.2588
     6   James        1.0000     83.0000     79.7204      3.2796
     7   Jane         1.0000     84.5000     87.5443     -3.0443
     8   Janet        1.0000    112.5000    102.9467      9.5533
     9   Jeffrey      1.0000     84.0000     98.3117    -14.3117
    10   John         1.0000     99.5000     85.0407     14.4593
    11   Joyce        1.0000     50.5000     58.6253     -8.1253
    12   Judy         1.0000     90.0000    106.2625    -16.2625
    13   Louise       1.0000     77.0000     76.5908      0.4092
    14   Mary         1.0000    112.0000    115.4651     -3.4651
    15   Philip       1.0000    150.0000    134.9953     15.0047
    16   Robert       0.5000    128.0000    103.1923     24.8077
    17   Ronald            0    133.0000    117.0299     15.9701
    18   Thomas       1.0000     85.0000     78.0288      6.9712
    19   William      1.0000    112.0000    115.4651     -3.4651

   Sum of Residuals                          0
   Sum of Squared Residuals         1637.81879
   Predicted Residual SS (PRESS)    2473.87984

   NOTE: The above statistics use observation weights or frequencies.

Figure 61.48: Observations Excluded from Analysis, Model Refitted and Observations Reweighted

Notice that this results in a slightly different model than the previous set of statements: only observation 16 is reweighted to 0.5. Also note that the model label is now MODEL1.5 since five REWEIGHT statements have been used for this model.

Another important feature of the REWEIGHT statement is the ability to nullify the effect of a previous or all REWEIGHT statements. First, assume that you have several REWEIGHT statements in effect and you want to restore the original weights of all the observations. The following REWEIGHT statement accomplishes this and produces Figure 61.49:

   reweight allobs / reset;
   print;
   run;

                    The REG Procedure
                     Model: MODEL1.6
                Dependent Variable: Weight

                    Output Statistics

                     Dependent   Predicted
   Obs   Name         Variable       Value    Residual
     1   Alfred       112.5000    124.8686    -12.3686
     2   Alice         84.0000     78.6273      5.3727
     3   Barbara       98.0000    110.2812    -12.2812
     4   Carol        102.5000    102.5670     -0.0670
     5   Henry        102.5000    105.0849     -2.5849
     6   James         83.0000     80.2266      2.7734
     7   Jane          84.5000     89.2191     -4.7191
     8   Janet        112.5000    102.7663      9.7337
     9   Jeffrey       84.0000    100.2095    -16.2095
    10   John          99.5000     86.3415     13.1585
    11   Joyce         50.5000     57.3660     -6.8660
    12   Judy          90.0000    107.9625    -17.9625
    13   Louise        77.0000     76.6295      0.3705
    14   Mary         112.0000    117.1544     -5.1544
    15   Philip       150.0000    138.2164     11.7836
    16   Robert       128.0000    107.2043     20.7957
    17   Ronald       133.0000    118.9529     14.0471
    18   Thomas        85.0000     79.6676      5.3324
    19   William      112.0000    117.1544     -5.1544

   Sum of Residuals                          0
   Sum of Squared Residuals         2120.09974
   Predicted Residual SS (PRESS)    3272.72186

Figure 61.49: Restoring Weights of All Observations

The resulting model is identical to the original model specified at the beginning of this section. Notice that the model label is now MODEL1.6. Note that the Weight column does not appear, since all observations have been reweighted to have weight=1.

Now suppose you want only to undo the changes made by the most recent REWEIGHT statement. Use REWEIGHT UNDO for this. The following statements produce Figure 61.50:

  reweight r. le -12 or r. ge 12 / weight=.75;
  reweight r. le -17 or r. ge 17 / weight=.5;
  reweight undo;
  print;
  run;
  The REG Procedure
  Model: MODEL1.9
  Dependent Variable: Weight

  Output Statistics

                        Weight    Dependent    Predicted
  Obs    Name         Variable     Variable        Value     Residual
    1    Alfred         0.7500     112.5000     125.1152     -12.6152
    2    Alice          1.0000      84.0000      78.7691       5.2309
    3    Barbara        0.7500      98.0000     110.3236     -12.3236
    4    Carol          1.0000     102.5000     102.8836      -0.3836
    5    Henry          1.0000     102.5000     105.3936      -2.8936
    6    James          1.0000      83.0000      80.1133       2.8867
    7    Jane           1.0000      84.5000      89.0776      -4.5776
    8    Janet          1.0000     112.5000     103.3322       9.1678
    9    Jeffrey        0.7500      84.0000     100.2835     -16.2835
   10    John           0.7500      99.5000      86.2090      13.2910
   11    Joyce          1.0000      50.5000      57.0745      -6.5745
   12    Judy           0.7500      90.0000     108.2622     -18.2622
   13    Louise         1.0000      77.0000      76.5275       0.4725
   14    Mary           1.0000     112.0000     117.6752      -5.6752
   15    Philip         1.0000     150.0000     138.9211      11.0789
   16    Robert         0.7500     128.0000     107.0063      20.9937
   17    Ronald         0.7500     133.0000     119.4681      13.5319
   18    Thomas         1.0000      85.0000      79.3061       5.6939
   19    William        1.0000     112.0000     117.6752      -5.6752

  Sum of Residuals                           0
  Sum of Squared Residuals          1694.87114
  Predicted Residual SS (PRESS)     2547.22751

  NOTE: The above statistics use observation weights or frequencies.

Figure 61.50: Example of UNDO in REWEIGHT Statement

The resulting model reflects changes made only by the first REWEIGHT statement since the third REWEIGHT statement negates the effect of the second REWEIGHT statement. Observations 1, 3, 9, 10, 12, 16, and 17 have their weights changed to 0.75. Note that the label MODEL1.9 reflects the use of nine REWEIGHT statements for the current model.

Now suppose you want to reset the observations selected by the most recent REWEIGHT statement to their original weights. Use the REWEIGHT statement with the RESET option to do this. The following statements produce Figure 61.51:

  reweight r. le -12 or r. ge 12 / weight=.75;
  reweight r. le -17 or r. ge 17 / weight=.5;
  reweight / reset;
  print;
  run;
  The REG Procedure
  Model: MODEL1.12
  Dependent Variable: Weight

  Output Statistics

                        Weight    Dependent    Predicted
  Obs    Name         Variable     Variable        Value     Residual
    1    Alfred         0.7500     112.5000     126.0076     -13.5076
    2    Alice          1.0000      84.0000      77.8727       6.1273
    3    Barbara        0.7500      98.0000     111.2805     -13.2805
    4    Carol          1.0000     102.5000     102.4703       0.0297
    5    Henry          1.0000     102.5000     105.1278      -2.6278
    6    James          1.0000      83.0000      80.2290       2.7710
    7    Jane           1.0000      84.5000      89.7199      -5.2199
    8    Janet          1.0000     112.5000     102.0122      10.4878
    9    Jeffrey        0.7500      84.0000     100.6507     -16.6507
   10    John           0.7500      99.5000      86.6828      12.8172
   11    Joyce          1.0000      50.5000      56.7703      -6.2703
   12    Judy           1.0000      90.0000     108.1649     -18.1649
   13    Louise         1.0000      77.0000      76.4327       0.5673
   14    Mary           1.0000     112.0000     117.1975      -5.1975
   15    Philip         1.0000     150.0000     138.7581      11.2419
   16    Robert         1.0000     128.0000     108.7016      19.2984
   17    Ronald         0.7500     133.0000     119.0957      13.9043
   18    Thomas         1.0000      85.0000      80.3076       4.6924
   19    William        1.0000     112.0000     117.1975      -5.1975

  Sum of Residuals                           0
  Sum of Squared Residuals          1879.08980
  Predicted Residual SS (PRESS)     2959.57279

  NOTE: The above statistics use observation weights or frequencies.

Figure 61.51: REWEIGHT Statement with RESET option

Note that observations that meet the condition of the second REWEIGHT statement (residuals with an absolute value greater than or equal to 17) now have weights reset to their original value of 1. Observations 1, 3, 9, 10, and 17 have weights of 0.75, but observations 12 and 16 (which meet the condition of the second REWEIGHT statement) have their weights reset to 1.

Notice how the last three examples show three ways to change weights back to a previous value. In the first example, ALLOBS and the RESET option are used to change weights for all observations back to their original values. In the second example, the UNDO option is used to negate the effect of a previous REWEIGHT statement, thus changing weights for observations selected in the previous REWEIGHT statement to the weights specified in still another REWEIGHT statement. In the third example, the RESET option is used to change weights for observations selected in a previous REWEIGHT statement back to their original values. Finally, note that the label MODEL1.12 indicates that twelve REWEIGHT statements have been applied to the original model.
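The difference between the three ways of restoring weights can be traced with a small bookkeeping model. The Python sketch below is only an illustration of the behavior described above, not PROC REG's implementation: the class `ReweightState`, its method names, and the helper `changed` are hypothetical, and each condition is assumed to have been evaluated already, so the selected observation numbers are passed in directly (they are taken from the discussion of Figure 61.50 and Figure 61.51).

```python
class ReweightState:
    """Toy bookkeeping for REWEIGHT, UNDO, and RESET (not PROC REG internals)."""

    def __init__(self, n_obs):
        self.original = [1.0] * n_obs   # original weights; 1 when no WEIGHT statement
        self.weights = list(self.original)
        self.history = []               # (weights, selection) saved before each change
        self.selection = []             # observations chosen by the latest condition
        self.statements = 0             # REWEIGHT statements issued so far

    def _snapshot(self):
        self.history.append((list(self.weights), list(self.selection)))
        self.statements += 1

    def reweight(self, selected, weight):
        """REWEIGHT condition / WEIGHT=weight, with the condition already evaluated."""
        self._snapshot()
        self.selection = list(selected)
        for obs in selected:
            self.weights[obs - 1] = weight

    def reset_all(self):
        """REWEIGHT ALLOBS / RESET: every observation returns to its original weight."""
        self._snapshot()
        self.selection = list(range(1, len(self.weights) + 1))
        self.weights = list(self.original)

    def reset_last(self):
        """REWEIGHT / RESET: only the latest selection returns to its original weights."""
        self._snapshot()
        for obs in self.selection:
            self.weights[obs - 1] = self.original[obs - 1]

    def undo(self):
        """REWEIGHT UNDO: discard the effect of the most recent statement."""
        self.statements += 1
        if self.history:
            self.weights, self.selection = self.history.pop()


def changed(state):
    """Observation numbers whose weight differs from the original weight."""
    pairs = zip(state.weights, state.original)
    return [i + 1 for i, (w, o) in enumerate(pairs) if w != o]


first = [1, 3, 9, 10, 12, 16, 17]   # selected by the first condition (weight .75)
second = [12, 16]                   # selected by the second condition (weight .5)

s = ReweightState(19)
s.reweight(first, 0.75)
s.reweight(second, 0.5)
s.undo()                            # observations 12 and 16 return to 0.75
print(changed(s), s.weights[11], s.weights[15])

t = ReweightState(19)
t.reweight(first, 0.75)
t.reweight(second, 0.5)
t.reset_last()                      # observations 12 and 16 return to 1
print(changed(t))
```

In this sketch UNDO restores the state saved before the latest statement, whereas RESET without ALLOBS touches only the observations picked by the latest condition. Each call, including the undo, increments the statement counter, which mirrors how the model label advances from MODEL1.6 to MODEL1.9 and then to MODEL1.12.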

Testing for Heteroscedasticity

The regression model is specified as $y_i = x_i\beta + \epsilon_i$, where the $\epsilon_i$'s are identically and independently distributed: $E(\epsilon) = 0$ and $E(\epsilon\epsilon') = \sigma^2 I$. If the $\epsilon_i$'s are not independent or their variances are not constant, the parameter estimates are unbiased, but the estimate of the covariance matrix is inconsistent. In the case of heteroscedasticity, the ACOV option provides a consistent estimate of the covariance matrix. If the regression data are from a simple random sample, the ACOV option produces the covariance matrix. This matrix is

$$ (X'X)^{-1} \left( X' \, \mathrm{diag}(e_i^2) \, X \right) (X'X)^{-1} $$

where

$$ e_i = y_i - x_i b $$

and $b$ is the least-squares estimate of $\beta$.

The SPEC option performs a model specification test. The null hypothesis for this test maintains that the errors are homoscedastic, independent of the regressors and that several technical assumptions about the model specification are valid. For details, see theorem 2 and assumptions 1-7 of White (1980). When the model is correctly specified and the errors are independent of the regressors, the rejection of this null hypothesis is evidence of heteroscedasticity. In implementing this test, an estimator of the average covariance matrix (White 1980, p. 822) is constructed and inverted. The nonsingularity of this matrix is one of the assumptions in the null hypothesis about the model specification. When PROC REG determines this matrix to be numerically singular, a generalized inverse is used and a note to this effect is written to the log. In such cases, care should be taken in interpreting the results of this test.

When you specify the SPEC option, tests listed in the TEST statement are performed with both the usual covariance matrix and the heteroscedasticity consistent covariance matrix. Tests performed with the consistent covariance matrix are asymptotic. For more information, refer to White (1980).

Both the ACOV and SPEC options can be specified in a MODEL or PRINT statement.
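As a rough numerical illustration of the heteroscedasticity-consistent covariance matrix, the following Python sketch evaluates the sandwich form (X'X)^-1 X' diag(e_i^2) X (X'X)^-1 of White (1980) for a model with an intercept and one regressor. The function names `matmul2` and `white_acov` and the three data points are made up for the example; this is a hand computation under those assumptions, not a description of how PROC REG computes its output.

```python
def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def white_acov(x, y):
    """Return ((b0, b1), acov) for the least-squares fit of y = b0 + b1*x."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(u * v for u, v in zip(x, y))
    det = n * sxx - sx * sx
    xtx_inv = [[sxx / det, -sx / det], [-sx / det, n / det]]    # (X'X)^-1
    b0 = xtx_inv[0][0] * sy + xtx_inv[0][1] * sxy
    b1 = xtx_inv[1][0] * sy + xtx_inv[1][1] * sxy
    e2 = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]     # squared residuals
    # X' diag(e_i^2) X, where row i of X is (1, x_i)
    m01 = sum(e * xi for e, xi in zip(e2, x))
    meat = [[sum(e2), m01],
            [m01, sum(e * xi * xi for e, xi in zip(e2, x))]]
    return (b0, b1), matmul2(matmul2(xtx_inv, meat), xtx_inv)


# Hypothetical data, chosen so the result is easy to check by hand
(b0, b1), acov = white_acov([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
print(b0, b1)
print(acov)
```

For these three points the fitted line is flat (b0 = 1/3, b1 = 0), and working the matrix products by hand gives 7/54 and 1/18 on the diagonal and -1/18 off the diagonal.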

Multivariate Tests

The MTEST statement described in the "MTEST Statement" section can test hypotheses involving several dependent variables in the form

$$ (L\beta - cj)M = 0 $$

where $L$ is a linear function on the regressor side, $\beta$ is a matrix of parameters, $c$ is a column vector of constants, $j$ is a row vector of ones, and $M$ is a linear function on the dependent side. The special case where the constants are zero is

$$ L\beta M = 0 $$

To test this hypothesis, PROC REG constructs two matrices called H and E that correspond to the numerator and denominator of a univariate F test:

$$ H = M'(LB - cj)' \left( L(X'X)^{-}L' \right)^{-1} (LB - cj)M $$

$$ E = M' \left( Y'Y - B'(X'X)B \right) M $$

Here $B$ denotes the matrix of least-squares parameter estimates and $(X'X)^{-}$ a generalized inverse of $X'X$.

These matrices are displayed for each MTEST statement if the PRINT option is specified.

Four test statistics based on the eigenvalues of $E^{-1}H$ or $(E+H)^{-1}H$ are formed. These are Wilks' Lambda, Pillai's Trace, the Hotelling-Lawley Trace, and Roy's maximum root. These test statistics are discussed in Chapter 2, "Introduction to Regression Procedures."

The following statements perform a multivariate analysis of variance and produce Figure 61.52 through Figure 61.56:

  * Manova Data from Morrison (1976, 190);
  data a;
     input sex $ drug $ @;
     do rep=1 to 4;
        input y1 y2 @;
        sexcode=(sex='m')-(sex='f');
        drug1=(drug='a')-(drug='c');
        drug2=(drug='b')-(drug='c');
        sexdrug1=sexcode*drug1;
        sexdrug2=sexcode*drug2;
        output;
     end;
     datalines;
  m a  5  6  5  4  9  9  7  6
  m b  7  6  7  7  9 12  6  8
  m c 21 15 14 11 17 12 12 10
  f a  7 10  6  6  9  7  8 10
  f b 10 13  8  7  7  6  6  9
  f c 16 12 14  9 14  8 10  5
  ;

  proc reg;
     model y1 y2=sexcode drug1 drug2 sexdrug1 sexdrug2;
     y1y2drug: mtest y1=y2, drug1,drug2;
     drugshow: mtest drug1, drug2 / print canprint;
  run;
  The REG Procedure
  Model: MODEL1
  Dependent Variable: y1

  Analysis of Variance

                                    Sum of           Mean
  Source                 DF        Squares         Square    F Value    Pr > F
  Model                   5      316.00000       63.20000      12.04    <.0001
  Error                  18       94.50000        5.25000
  Corrected Total        23      410.50000

  Root MSE              2.29129    R-Square     0.7698
  Dependent Mean        9.75000    Adj R-Sq     0.7058
  Coeff Var            23.50039

  Parameter Estimates

                      Parameter       Standard
  Variable     DF      Estimate          Error    t Value    Pr > |t|
  Intercept     1       9.75000        0.46771      20.85      <.0001
  sexcode       1       0.16667        0.46771       0.36      0.7257
  drug1         1      -2.75000        0.66144      -4.16      0.0006
  drug2         1      -2.25000        0.66144      -3.40      0.0032
  sexdrug1      1      -0.66667        0.66144      -1.01      0.3269
  sexdrug2      1      -0.41667        0.66144      -0.63      0.5366

Figure 61.52: Multivariate Analysis of Variance: REG Procedure
  The REG Procedure
  Model: MODEL1
  Dependent Variable: y2

  Analysis of Variance

                                    Sum of           Mean
  Source                 DF        Squares         Square    F Value    Pr > F
  Model                   5       69.33333       13.86667       2.19    0.1008
  Error                  18      114.00000        6.33333
  Corrected Total        23      183.33333

  Root MSE              2.51661    R-Square     0.3782
  Dependent Mean        8.66667    Adj R-Sq     0.2055
  Coeff Var            29.03782

  Parameter Estimates

                      Parameter       Standard
  Variable     DF      Estimate          Error    t Value    Pr > |t|
  Intercept     1       8.66667        0.51370      16.87      <.0001
  sexcode       1       0.16667        0.51370       0.32      0.7493
  drug1         1      -1.41667        0.72648      -1.95      0.0669
  drug2         1      -0.16667        0.72648      -0.23      0.8211
  sexdrug1      1      -1.16667        0.72648      -1.61      0.1257
  sexdrug2      1      -0.41667        0.72648      -0.57      0.5734

Figure 61.53: Multivariate Analysis of Variance: REG Procedure
  The REG Procedure
  Model: MODEL1
  Multivariate Test: y1y2drug

  Multivariate Statistics and Exact F Statistics

  S=1    M=0    N=8

  Statistic                      Value    F Value    Num DF    Den DF    Pr > F
  Wilks' Lambda             0.28053917      23.08         2        18    <.0001
  Pillai's Trace            0.71946083      23.08         2        18    <.0001
  Hotelling-Lawley Trace    2.56456456      23.08         2        18    <.0001
  Roy's Greatest Root       2.56456456      23.08         2        18    <.0001

Figure 61.54: Multivariate Analysis of Variance: First Test
  The REG Procedure
  Model: MODEL1
  Multivariate Test: drugshow

  Error Matrix (E)

          94.5            76.5
          76.5             114

  Hypothesis Matrix (H)

           301            97.5
          97.5    36.333333333

                        Adjusted    Approximate        Squared
         Canonical     Canonical       Standard      Canonical
       Correlation   Correlation          Error    Correlation
  1       0.905903      0.899927       0.040101       0.820661
  2       0.244371       .             0.210254       0.059717

  Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)

       Eigenvalue    Difference    Proportion    Cumulative
  1        4.5760        4.5125        0.9863        0.9863
  2        0.0635                      0.0137        1.0000

  Test of H0: The canonical correlations in the current row and all that follow are zero

       Likelihood    Approximate
            Ratio        F Value    Num DF    Den DF    Pr > F
  1    0.16862952          12.20         4        34    <.0001
  2    0.94028273           1.14         1        18    0.2991

Figure 61.55: Multivariate Analysis of Variance: Second Test
  The REG Procedure
  Model: MODEL1
  Multivariate Test: drugshow

  Multivariate Statistics and F Approximations

  S=2    M=-0.5    N=7.5

  Statistic                      Value    F Value    Num DF    Den DF    Pr > F
  Wilks' Lambda             0.16862952      12.20         4        34    <.0001
  Pillai's Trace            0.88037810       7.08         4        36    0.0003
  Hotelling-Lawley Trace    4.63953666      19.40         4    19.407    <.0001
  Roy's Greatest Root       4.57602675      41.18         2        18    <.0001

  NOTE: F Statistic for Roy's Greatest Root is an upper bound.
  NOTE: F Statistic for Wilks' Lambda is exact.

Figure 61.56: Multivariate Analysis of Variance: Second Test

For the first test (y1y2drug, Figure 61.54), the four multivariate test statistics are all highly significant, giving strong evidence that the coefficients of drug1 and drug2 are not the same across dependent variables y1 and y2.

For the second test (drugshow, Figure 61.55 and Figure 61.56), the four multivariate test statistics are again all highly significant, giving strong evidence that the coefficients of drug1 and drug2 are not zero for both dependent variables.
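The four statistics in Figure 61.56 can be reproduced from the E and H matrices displayed in Figure 61.55. The Python sketch below is a cross-check outside SAS: it uses the standard eigenvalue definitions (Wilks' Lambda as the product of 1/(1+λ), Pillai's Trace as the sum of λ/(1+λ), the Hotelling-Lawley Trace as the sum of the λ, and Roy's root as the largest λ, where the λ are the eigenvalues of inv(E)*H). The helper names are illustrative, and the closed-form eigenvalue solution applies only to the 2 × 2 case shown here.

```python
import math

# Error and hypothesis matrices as displayed in Figure 61.55
E = [[94.5, 76.5], [76.5, 114.0]]
H = [[301.0, 97.5], [97.5, 36.333333333]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def eigenvalues_inv_e_h(e, h):
    """Eigenvalues of inv(E)*H for 2x2 matrices, largest first."""
    d = det2(e)
    e_inv = [[e[1][1] / d, -e[0][1] / d], [-e[1][0] / d, e[0][0] / d]]
    a = [[sum(e_inv[i][k] * h[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    trace, det = a[0][0] + a[1][1], det2(a)
    root = math.sqrt(trace * trace - 4.0 * det)   # roots of the characteristic polynomial
    return (trace + root) / 2.0, (trace - root) / 2.0


lam1, lam2 = eigenvalues_inv_e_h(E, H)
wilks = 1.0 / ((1.0 + lam1) * (1.0 + lam2))
pillai = lam1 / (1.0 + lam1) + lam2 / (1.0 + lam2)
hotelling_lawley = lam1 + lam2
roy = lam1

print(round(lam1, 4), round(lam2, 4))
print(round(wilks, 8), round(pillai, 8), round(hotelling_lawley, 8), round(roy, 8))
```

The two eigenvalues should match the "Eigenvalues of Inv(E)*H" table in Figure 61.55, and the four statistics should agree with the Value column of Figure 61.56 up to the rounding of the displayed H matrix.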

Autocorrelation in Time Series Data

When regression is performed on time series data, the errors may not be independent. Often errors are autocorrelated; that is, each error is correlated with the error immediately before it. Autocorrelation is also a symptom of systematic lack of fit. The DW option provides the Durbin-Watson d statistic to test that the autocorrelation is zero:

$$ d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} $$

where $e_i$ is the $i$th residual. The value of d is close to 2 if the errors are uncorrelated. The distribution of d is reported by Durbin and Watson (1951). Tables of the distribution are found in most econometrics textbooks, such as Johnston (1972) and Pindyck and Rubinfeld (1981).

The sample autocorrelation estimate is displayed after the Durbin-Watson statistic. The sample autocorrelation is computed as

$$ r = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=1}^{n} e_i^2} $$

This autocorrelation of the residuals may not be a very good estimate of the autocorrelation of the true errors, especially if there are few observations and the independent variables have certain patterns. If there are missing observations in the regression, these measures are computed as though the missing observations did not exist.
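To make the two measures concrete, here is a minimal Python sketch that computes them from a list of residuals in observation order. It assumes the usual definitions: d is the sum of squared successive differences of the residuals divided by the sum of squared residuals, and the first-order autocorrelation is the sum of products of successive residuals divided by the same denominator. The function names and the two residual series are invented for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def first_order_autocorrelation(residuals):
    """Lag-1 sample autocorrelation of the residuals."""
    num = sum(residuals[i] * residuals[i - 1]
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


alternating = [1.0, -1.0, 1.0, -1.0]       # negatively autocorrelated pattern
trending = [-2.0, -1.0, 0.0, 1.0, 2.0]     # positively autocorrelated pattern

print(durbin_watson(alternating), first_order_autocorrelation(alternating))  # 3.0 -0.75
print(durbin_watson(trending), first_order_autocorrelation(trending))        # 0.4 0.4
```

The alternating series pushes d above 2 and the trending series pulls it toward 0, which is the direction of departure one would expect for negative and positive autocorrelation, respectively.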

Positive autocorrelation of the errors generally tends to make the estimate of the error variance too small, so confidence intervals are too narrow and true null hypotheses are rejected with a higher probability than the stated significance level. Negative autocorrelation of the errors generally tends to make the estimate of the error variance too large, so confidence intervals are too wide and the power of significance tests is reduced. With either positive or negative autocorrelation, least-squares parameter estimates are usually not as efficient as generalized least-squares parameter estimates. For more details, refer to Judge et al. (1985, Chapter 8) and the SAS/ETS User's Guide.

The following SAS statements request the DW option for the US population data (see Figure 61.57):

  proc reg data=USPopulation;
     model Population=Year YearSq / dw;
  run;

Computations for Ridge Regression and IPC Analysis

In ridge regression analysis, the crossproduct matrix for the independent variables is centered (the NOINT option is ignored if it is specified) and scaled to one on the diagonal elements. The ridge constant k (specified with the RIDGE= option) is then added to each diagonal element of the crossproduct matrix. The ridge regression estimates are the least-squares estimates obtained by using the new crossproduct matrix.

Let $X$ be an $n \times p$ matrix of the independent variables after centering the data, and let $Y$ be an $n \times 1$ vector corresponding to the dependent variable. Let $D$ be a $p \times p$ diagonal matrix with diagonal elements as in $X'X$. The ridge regression estimate corresponding to the ridge constant $k$ can be computed as

$$ D^{-1/2} \left( Z'Z + k I_p \right)^{-1} Z'Y $$

where $Z = XD^{-1/2}$ and $I_p$ is a $p \times p$ identity matrix.

For IPC analysis, the smallest $m$ eigenvalues of $Z'Z$ (where $m$ is specified with the PCOMIT= option) are omitted to form the estimates.

For information about ridge regression and IPC standardized parameter estimates, parameter estimate standard errors, and variance inflation factors, refer to Rawlings (1988), Neter, Wasserman, and Kutner (1990), and Marquardt and Snee (1975). Unlike Rawlings (1988), the REG procedure uses the mean squared errors of the submodels instead of the full model MSE to compute the standard errors of the parameter estimates.
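The ridge computation can be followed step by step for the smallest interesting case, two regressors. The Python sketch below centers the regressors, scales the crossproduct matrix to unit diagonal, adds the ridge constant k to the diagonal, solves the resulting 2 × 2 system, and rescales to the original units, that is, D^(-1/2) (Z'Z + k I)^(-1) Z'Y with Z = X D^(-1/2). The function name `ridge_two_regressors` and the data are assumptions made for the example; it is a hand-sized illustration rather than the procedure's code.

```python
import math


def ridge_two_regressors(x1, x2, y, k):
    """Ridge slope estimates (b1, b2) on the original scale for ridge constant k."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]                  # centered regressors (columns of X)
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]                   # centering y does not change Z'Y
    d1 = sum(v * v for v in c1)                # diagonal elements of X'X
    d2 = sum(v * v for v in c2)
    r = sum(u * v for u, v in zip(c1, c2)) / math.sqrt(d1 * d2)  # off-diagonal of Z'Z
    zy1 = sum(u * v for u, v in zip(c1, cy)) / math.sqrt(d1)     # elements of Z'Y
    zy2 = sum(u * v for u, v in zip(c2, cy)) / math.sqrt(d2)
    # Solve (Z'Z + k I) g = Z'Y, where Z'Z + k I = [[1 + k, r], [r, 1 + k]]
    a = 1.0 + k
    det = a * a - r * r
    g1 = (a * zy1 - r * zy2) / det
    g2 = (a * zy2 - r * zy1) / det
    return g1 / math.sqrt(d1), g2 / math.sqrt(d2)                # D^(-1/2) g


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

for k in (0.0, 0.05, 0.5):
    print(k, ridge_two_regressors(x1, x2, y, k))
```

With k = 0 the sketch reduces to ordinary least squares and recovers the slopes 2 and 3 used to generate the data. Larger values of k move the estimates away from the least-squares solution, and a very large k drives them toward zero.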

Construction of Q-Q and P-P Plots

If a normal probability-probability or quantile-quantile plot for the variable x is requested, the n nonmissing values of x are first ordered from smallest to largest:

$$ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} $$

If a Q-Q plot is requested (with a PLOT statement of the form PLOT yvariable*NQQ.), the $i$th ordered value $x_{(i)}$ is represented by a point with y-coordinate $x_{(i)}$ and x-coordinate $\Phi^{-1}\left(\frac{i - 0.375}{n + 0.25}\right)$, where $\Phi(\cdot)$ is the standard normal distribution.

If a P-P plot is requested (with a PLOT statement of the form PLOT yvariable*NPP.), the $i$th ordered value $x_{(i)}$ is represented by a point with y-coordinate $i/n$ and x-coordinate $\Phi\left(\frac{x_{(i)} - \mu}{\sigma}\right)$, where $\mu$ is the mean of the nonmissing x-values and $\sigma$ is the standard deviation. If an x-value has multiplicity $k$ (that is, $x_{(i)} = \cdots = x_{(i+k-1)}$), then only the point $\left(\Phi\left(\frac{x_{(i)} - \mu}{\sigma}\right), \frac{i + k - 1}{n}\right)$ is displayed.
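The plot coordinates are easy to compute directly. The following Python sketch assumes the plotting position (i - 0.375)/(n + 0.25) for the Q-Q plot, the y-coordinate i/n for the P-P plot, and the sample standard deviation (divisor n - 1) for σ. The function names `qq_points` and `pp_points` are invented for the illustration, and the choice of divisor in particular is an assumption that the text does not settle.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """(x, y) pairs: normal quantile of (i - 0.375)/(n + 0.25) against x_(i)."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i - 0.375) / (n + 0.25)), x)
            for i, x in enumerate(xs, start=1)]


def pp_points(values):
    """(x, y) pairs: Phi((x_(i) - mean)/sd) against i/n, one point per distinct value."""
    xs = sorted(values)
    n = len(xs)
    mu, sigma = mean(xs), stdev(xs)        # sample standard deviation (assumed)
    std_normal = NormalDist()
    points = {}
    for i, x in enumerate(xs, start=1):
        # A later tie overwrites the earlier entry, which keeps y = (i + k - 1)/n
        points[x] = (std_normal.cdf((x - mu) / sigma), i / n)
    return [points[x] for x in sorted(points)]


print(qq_points([3.0, 1.0, 2.0, 5.0, 4.0]))
print(pp_points([1.0, 2.0, 2.0, 3.0]))
```

In the second call the value 2 occurs twice, so only three points are returned and the tied value is plotted at height 3/4, the larger of its two candidate heights.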

Computational Methods

The REG procedure first composes a crossproducts matrix. The matrix can be calculated from input data, reformed from an input correlation matrix, or read in from an SSCP data set. For each model, the procedure selects the appropriate crossproducts from the main matrix. The normal equations formed from the crossproducts are solved using a sweep algorithm (Goodnight 1979). The method is accurate for data that are reasonably scaled and not too collinear.

The mechanism that PROC REG uses to check for singularity involves the diagonal (pivot) elements of $X'X$ as it is being swept. If a pivot is less than SINGULAR*CSS, then a singularity is declared and the pivot is not swept (where CSS is the corrected sum of squares for the regressor and SINGULAR is machine dependent but is approximately 1E-7 on most machines or reset in the PROC statement).

The sweep algorithm is also used in many places in the model-selection methods. The RSQUARE method uses the leaps and bounds algorithm by Furnival and Wilson (1974).
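The sweep operation and the pivot check can be illustrated on a small crossproducts matrix. The Python sketch below is a textbook-style sweep in the spirit of Goodnight (1979), not the procedure's code: the function names, the matrix layout (regressor columns first, the dependent variable last), and the way the corrected sums of squares are supplied are all assumptions made for the example. The third regressor is deliberately an exact multiple of the second so that its pivot collapses to zero.

```python
def sweep(a, k):
    """Sweep the square matrix a (list of lists) in place on pivot k."""
    d = a[k][k]
    a[k] = [v / d for v in a[k]]
    for i in range(len(a)):
        if i != k:
            b = a[i][k]
            a[i] = [v - b * p for v, p in zip(a[i], a[k])]
            a[i][k] = -b / d
    a[k][k] = 1.0 / d


def sweep_regressors(a, css, singular=1e-7):
    """Sweep each regressor pivot unless it is below singular * CSS."""
    swept = []
    for k in range(len(css)):
        if a[k][k] < singular * css[k]:
            continue                      # singularity declared; pivot is not swept
        sweep(a, k)
        swept.append(k)
    return swept


def crossproducts(columns):
    """Uncorrected crossproducts matrix of the given data columns."""
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in columns]
            for ci in columns]


# Hypothetical data: y = 1 + 2*x exactly, and x2 is collinear with x
x = [0.0, 1.0, 2.0, 3.0]
x2 = [2.0 * v for v in x]
y = [1.0 + 2.0 * v for v in x]
ones = [1.0] * len(x)

a = crossproducts([ones, x, x2, y])
css = [0.0, 5.0, 20.0]                    # corrected SS of intercept, x, and x2
swept = sweep_regressors(a, css)

print(swept)                              # pivots actually swept
print(a[0][3], a[1][3])                   # intercept and slope estimates
print(a[3][3])                            # error sum of squares
```

After sweeping the intercept and x, the last column holds the parameter estimates and the lower right corner holds the error sum of squares. The pivot for x2 is zero at that point, so it falls below SINGULAR*CSS and is skipped.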

Computer Resources in Regression Analysis

The REG procedure is efficient for ordinary regression; however, requests for optional features can greatly increase the amount of time required.

The major computational expense in the regression analysis is the collection of the crossproducts matrix. For $p$ variables and $n$ observations, the time required is proportional to $np^2$. For each model run, PROC REG needs time roughly proportional to $k^3$, where $k$ is the number of regressors in the model. Add an additional $nk^2$ for one of the R, CLM, or CLI options and another $nk^2$ for the INFLUENCE option.

Most of the memory that PROC REG needs to solve large problems is used for crossproducts matrices. PROC REG requires $4p^2$ bytes for the main crossproducts matrix plus $4k^2$ bytes for the largest model. If several output data sets are requested, memory is also needed for buffers.

See the "Input Data Sets" section for information on how to use TYPE=SSCP data sets to reduce computing time.

Displayed Output

Many of the more specialized tables are described in detail in previous sections. Most of the formulas for the statistics are in Chapter 2, "Introduction to Regression Procedures," while other formulas can be found in the "Model Fit and Diagnostic Statistics" and "Influence Diagnostics" sections.

The analysis-of-variance table includes

  • the Source of the variation, Model for the fitted regression, Error for the residual error, and C Total for the total variation after correcting for the mean. The Uncorrected Total Variation is produced when the NOINT option is used.

  • the degrees of freedom (DF) associated with the source

  • the Sum of Squares for the term

  • the Mean Square, the sum of squares divided by the degrees of freedom

  • the F Value for testing the hypothesis that all parameters are zero except for the intercept. This is formed by dividing the mean square for Model by the mean square for Error.

  • the Prob>F, the probability of getting a greater F statistic than that observed if the hypothesis is true. This is the significance probability.

Other statistics displayed include the following:

  • Root MSE is an estimate of the standard deviation of the error term. It is calculated as the square root of the mean square error.

  • Dep Mean is the sample mean of the dependent variable.

  • C.V. is the coefficient of variation, computed as 100 times Root MSE divided by Dep Mean. This expresses the variation in unitless values.

  • R-Square is a measure between 0 and 1 that indicates the portion of the (corrected) total variation that is attributed to the fit rather than left to residual error. It is calculated as SS(Model) divided by SS(Total). It is also called the coefficient of determination. It is the square of the multiple correlation; in other words, the square of the correlation between the dependent variable and the predicted values.

  • Adj R-Sq, the adjusted $R^2$, is a version of $R^2$ that has been adjusted for degrees of freedom. It is calculated as

    $$ \bar{R}^2 = 1 - \frac{(n - i)(1 - R^2)}{n - p} $$

    where $i$ is equal to 1 if there is an intercept and 0 otherwise; $n$ is the number of observations used to fit the model; and $p$ is the number of parameters in the model.
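These definitions can be checked against the analysis-of-variance table for y1 in Figure 61.52. The Python sketch below recomputes the derived statistics from the displayed sums of squares and degrees of freedom. It assumes the adjusted R-square form 1 - (n - i)(1 - R²)/(n - p) with n = 24 observations, p = 6 parameters, and an intercept; the variable names are illustrative.

```python
import math

# Values displayed for dependent variable y1 in Figure 61.52
ss_model, ss_error = 316.0, 94.5
df_model, df_error = 5, 18
dep_mean = 9.75
n, p, intercept = 24, 6, 1          # observations, parameters, intercept indicator

ms_model = ss_model / df_model                      # Mean Square for Model
ms_error = ss_error / df_error                      # Mean Square for Error
f_value = ms_model / ms_error                       # F Value
root_mse = math.sqrt(ms_error)                      # Root MSE
coeff_var = 100.0 * root_mse / dep_mean             # C.V.
r_square = ss_model / (ss_model + ss_error)         # SS(Model) / SS(Total)
adj_r_square = 1.0 - (n - intercept) * (1.0 - r_square) / (n - p)

print(round(f_value, 2), round(root_mse, 5), round(coeff_var, 5))
print(round(r_square, 4), round(adj_r_square, 4))
```

The results should agree with the F Value, Root MSE, Coeff Var, R-Square, and Adj R-Sq entries shown in Figure 61.52 to the displayed precision.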

The parameter estimates and associated statistics are then displayed, and they include the following:

  • the Variable used as the regressor, including the name Intercept to represent the estimate of the intercept parameter

  • the degrees of freedom (DF) for the variable. There is one degree of freedom unless the model is not full rank.

  • the Parameter Estimate

  • the Standard Error, the estimate of the standard deviation of the parameter estimate

  • T for H0: Parameter=0, the t test that the parameter is zero. This is computed as the Parameter Estimate divided by the Standard Error.

  • the Prob > |T|, the probability that a t statistic would obtain a greater absolute value than that observed given that the true parameter is zero. This is the two-tailed significance probability.

If model-selection methods other than NONE, RSQUARE, ADJRSQ, or CP are used, the analysis-of-variance table and the parameter estimates with associated statistics are produced at each step. Also displayed are

  • C(p), which is Mallows' $C_p$ statistic

  • bounds on the condition number of the correlation matrix for the variables in the model (Berk 1977)

After statistics for the final model are produced, the following is displayed when the method chosen is FORWARD, BACKWARD, or STEPWISE:

  • a Summary table listing Step number, Variable Entered or Removed, Partial and Model R-Square, and C(p) and F statistics

The RSQUARE method displays its results beginning with the model containing the fewest independent variables and producing the largest $R^2$. Results for other models with the same number of variables are then shown in order of decreasing $R^2$, and so on, for models with larger numbers of variables. The ADJRSQ and CP methods group models of all sizes together and display results beginning with the model having the optimal value of adjusted $R^2$ and $C_p$, respectively.

For each model considered, the RSQUARE, ADJRSQ, and CP methods display the following:

  • Number in Model or IN, the number of independent variables used in each model

  • R-Square or RSQ, the squared multiple correlation coefficient

If the B option is specified, the RSQUARE, ADJRSQ, and CP methods produce the following:

  • Parameter Estimates, the estimated regression coefficients

If the B option is not specified, the RSQUARE, ADJRSQ, and CP methods display the following:

  • Variables in Model, the names of the independent variables included in the model

ODS Table Names

PROC REG assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 61.8: ODS Tables Produced in PROC REG

ODS Table Name        Description                                                   Statement  Option
ACovEst               Consistent covariance of estimates matrix                     MODEL      ALL, ACOV
ACovTestANOVA         Test ANOVA using ACOV estimates                               TEST       ACOV (MODEL statement)
ANOVA                 Model ANOVA table                                             MODEL      default
CanCorr               Canonical correlations for hypothesis combinations            MTEST      CANPRINT
CollinDiag            Collinearity Diagnostics table                                MODEL      COLLIN
CollinDiagNoInt       Collinearity Diagnostics for no intercept model               MODEL      COLLINOINT
ConditionBounds       Bounds on condition number                                    MODEL      (SELECTION=BACKWARD | FORWARD | STEPWISE | MAXR | MINR) and DETAILS
Corr                  Correlation matrix for analysis variables                     PROC       ALL, CORR
CorrB                 Correlation of estimates                                      MODEL      CORRB
CovB                  Covariance of estimates                                       MODEL      COVB
CrossProducts         Bordered model X'X matrix                                     MODEL      ALL, XPX
DWStatistic           Durbin-Watson statistic                                       MODEL      ALL, DW
DependenceEquations   Linear dependence equations                                   MODEL      default if needed
Eigenvalues           MTest eigenvalues                                             MTEST      CANPRINT
Eigenvectors          MTest eigenvectors                                            MTEST      CANPRINT
EntryStatistics       Entry statistics for selection methods                        MODEL      (SELECTION=BACKWARD | FORWARD | STEPWISE | MAXR | MINR) and DETAILS
ErrorPlusHypothesis   MTest error plus hypothesis matrix H + E                      MTEST      PRINT
ErrorSSCP             MTest error matrix E                                          MTEST      PRINT
FitStatistics         Model fit statistics                                          MODEL      default
HypothesisSSCP        MTest hypothesis matrix                                       MTEST      PRINT
InvMTestCov           Inv(L Ginv(X'X) L') and Inv(Lb - c)                           MTEST      DETAILS
InvTestCov            Inv(L Ginv(X'X) L') and Inv(Lb - c)                           TEST       PRINT
InvXPX                Bordered X'X inverse matrix                                   MODEL      I
MTestCov              L Ginv(X'X) L' and Lb - c                                     MTEST      DETAILS
MTransform            MTest matrix M, across dependents                             MTEST      DETAILS
MultStat              Multivariate test statistics                                  MTEST      default
NObs                  Number of observations                                                   default
OutputStatistics      Output statistics table                                       MODEL      ALL, CLI, CLM, INFLUENCE, P, R
ParameterEstimates    Model parameter estimates                                     MODEL      default
RemovalStatistics     Removal statistics for selection methods                      MODEL      (SELECTION=BACKWARD | STEPWISE | MAXR | MINR) and DETAILS
ResidualStatistics    Residual statistics and PRESS statistic                       MODEL      ALL, CLI, CLM, INFLUENCE, P, R
SelParmEst            Parameter estimates for selection methods                     MODEL      SELECTION=BACKWARD | FORWARD | STEPWISE | MAXR | MINR
SelectionSummary      Selection summary for forward, backward and stepwise methods  MODEL      SELECTION=BACKWARD | FORWARD | STEPWISE
SeqParmEst            Sequential parameter estimates                                MODEL      SEQB
SimpleStatistics      Simple statistics for analysis variables                      PROC       ALL, SIMPLE
SpecTest              White's heteroscedasticity test                               MODEL      ALL, SPEC
SubsetSelSummary      Selection summary for R-Square, Adj-RSq and Cp methods        MODEL      SELECTION=RSQUARE | ADJRSQ | CP
TestANOVA             Test ANOVA table                                              TEST       default
TestCov               L Ginv(X'X) L' and Lb - c                                     TEST       PRINT
USSCP                 Uncorrected SSCP matrix for analysis variables                PROC       ALL, USSCP

ODS Graphics (Experimental)

This section describes the use of ODS for creating statistical graphs with the REG procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release.

To request these graphs you must specify the ODS GRAPHICS statement. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.

When the experimental ODS graphics are in effect, the REG procedure produces a variety of plots. For models with multiple dependent variables, separate plots are produced for each dependent variable. For jobs with more than one MODEL statement, plots are produced for each model statement.

The plots available are as follows:

  • With a single regressor, a scatterplot of the input data overlaid with the fitted regression line, confidence band, and prediction limits.

  • A summary panel of fit diagnostics:

    • Residuals versus the predicted values

    • Studentized residuals versus the predicted values

    • Studentized residuals versus the leverage

    • Normal quantile plot of the residuals

    • Dependent variable values versus the predicted values

    • Cook's D versus observation number

    • Histogram of the residuals

    • A Residual-Fit (or RF) plot consisting of side-by-side quantile plots of the centered fit and the residuals. This plot shows how much variation in the data is explained by the fit and how much remains in the residuals (Cleveland, 1993).

  • If the PLOTS(UNPACKPANELS) option is specified in the PROC REG statement, then the eight plots in the fit diagnostics panel are displayed individually.

  • Panels of the residuals versus the regressors in the model. Note that each panel contains at most six plots, and multiple panels are used in the case that there are more than six regressors (including the intercept) in the model.

  • If the PARTIAL option is specified in a MODEL statement, panels of the partial regression plots for each regressor (see the "The PARTIAL Option" section). Note that each panel contains at most six partial plots, and multiple panels are used in the case that there are more than six regressors in the model.

  • If the RIDGE= option is specified in the MODEL statement, panels of ridge traces versus the specified ridge parameters for each regressor in the model. At most eight ridge traces are included on a panel and multiple panels are used for models with more than eight regressors.

PLOTS ( general-plot-options )

    specifies characteristics of the graphics produced when you use the experimental ODS GRAPHICS statement. You can specify the following general-plot-options in parentheses after the PLOTS option:

  • UNPACK | UNPACKPANELS specifies that plots in the fit diagnostics panel should be displayed separately.

  • MAXPOINTS=number | NONE specifies that plots with elements that require processing more than number points are suppressed. The default is MAXPOINTS=5000. This cutoff is ignored if you specify MAXPOINTS=NONE.

ODS Graph Names

PROC REG assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 61.9.

Table 61.9: ODS Graphics Produced by PROC REG

ODS Graph Name        Plot Description                                                                         PLOTS Option
ActualByPredicted     Dependent variable versus predicted values                                               UNPACKPANELS
CooksD                Cook's D statistic versus observation number                                             UNPACKPANELS
DiagnosticsPanel      Panel of fit diagnostics
Fit                   Regression line, confidence band, and prediction limits overlaid on scatterplot of data
PartialPlotPanel i    Panel i of partial regression plots
QQPlot                Normal quantile plot of residuals                                                        UNPACKPANELS
ResidualByPredicted   Residuals versus predicted values                                                        UNPACKPANELS
ResidualHistogram     Histogram of fit residuals                                                               UNPACKPANELS
ResidualPanel i       Panel i of residuals versus regressors
RFPlot                Side-by-side plots of quantiles of centered fit and residuals                            UNPACKPANELS
RidgePanel i          Panel i of ridge traces
RStudentByLeverage    Studentized residuals versus leverage                                                    UNPACKPANELS
RStudentByPredicted   Studentized residuals versus predicted values                                            UNPACKPANELS

SAS/STAT 9.1 User's Guide (Vol. 6)
Year: 2004