Getting Started | SAS/STAT 9.1 User's Guide (Vol. 6)

Simple Linear Regression

Suppose that a response variable Y can be predicted by a linear function of a regressor variable X. You can estimate $\beta_0$, the intercept, and $\beta_1$, the slope, in

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

for the observations $i = 1, 2, \ldots, n$. Fitting this model with the REG procedure requires only the following MODEL statement, where y is the outcome variable and x is the regressor variable.

  proc reg;
     model y=x;
  run;

For example, you might use regression analysis to find out how well you can predict a child's weight if you know that child's height. The following data are from a study of nineteen children. Height and weight are measured for each child.

  title 'Simple Linear Regression';
  data Class;
     input Name $ Height Weight Age @@;
     datalines;
  Alfred  69.0 112.5 14  Alice  56.5  84.0 13  Barbara 65.3  98.0 13
  Carol   62.8 102.5 14  Henry  63.5 102.5 14  James   57.3  83.0 12
  Jane    59.8  84.5 12  Janet  62.5 112.5 15  Jeffrey 62.5  84.0 13
  John    59.0  99.5 12  Joyce  51.3  50.5 11  Judy    64.3  90.0 14
  Louise  56.3  77.0 12  Mary   66.5 112.0 15  Philip  72.0 150.0 16
  Robert  64.8 128.0 12  Ronald 67.0 133.0 15  Thomas  57.5  85.0 11
  William 66.5 112.0 15
  ;

The equation of interest is

$$\mathrm{Weight} = \beta_0 + \beta_1 \, \mathrm{Height} + \epsilon$$

The variable Weight is the response or dependent variable in this equation, and $\beta_0$ and $\beta_1$ are the unknown parameters to be estimated. The variable Height is the regressor or independent variable, and $\epsilon$ is the unknown error. The following commands invoke the REG procedure and fit this model to the data.

  proc reg;
     model Weight = Height;
  run;

Figure 61.1 includes some information concerning model fit.

  Simple Linear Regression

  The REG Procedure
  Model: MODEL1
  Dependent Variable: Weight

  Analysis of Variance

                              Sum of           Mean
  Source             DF      Squares         Square    F Value    Pr > F
  Model               1   7193.24912     7193.24912      57.08    <.0001
  Error              17   2142.48772      126.02869
  Corrected Total    18   9335.73684

  Root MSE            11.22625    R-Square    0.7705
  Dependent Mean     100.02632    Adj R-Sq    0.7570
  Coeff Var           11.22330

Figure 61.1: ANOVA Table

The F statistic for the overall model is highly significant (F = 57.076, p < 0.0001), indicating that the model explains a significant portion of the variation in the data.

The degrees of freedom can be used in checking accuracy of the data and model. The model degrees of freedom are one less than the number of parameters to be estimated. This model estimates two parameters, $\beta_0$ and $\beta_1$; thus, the degrees of freedom should be 2 - 1 = 1. The corrected total degrees of freedom are always one less than the total number of observations in the data set, in this case 19 - 1 = 18.
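The same bookkeeping can be carried one step further as a quick check on Figure 61.1. The following worked arithmetic is a sketch that uses only the values printed in the ANOVA table; the symbols $n$ (number of observations) and $p$ (number of estimated parameters) are notation introduced here for illustration.

```latex
\[
\begin{aligned}
\mathrm{DF}_{\mathrm{Error}} &= n - p = 19 - 2 = 17, \\
\mathrm{MS}_{\mathrm{Error}} &= \frac{2142.48772}{17} \approx 126.0287, \\
F &= \frac{\mathrm{MS}_{\mathrm{Model}}}{\mathrm{MS}_{\mathrm{Error}}}
   = \frac{7193.24912}{126.02869} \approx 57.08 .
\end{aligned}
\]
```

The model and error degrees of freedom also sum to the corrected total, 1 + 17 = 18.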

Several simple statistics follow the ANOVA table. The Root MSE is an estimate of the standard deviation of the error term. The coefficient of variation, or Coeff Var, is a unitless expression of the variation in the data. The R-Square and Adj R-Sq are two statistics used in assessing the fit of the model; values close to 1 indicate a better fit. The R-Square of 0.77 indicates that Height accounts for 77% of the variation in Weight.
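These summary statistics can be reproduced by hand from the sums of squares in Figure 61.1. The arithmetic below is a sketch using the standard definitions for a model with an intercept; it is offered as a check, not as a description of how the procedure computes them internally.

```latex
\[
\begin{aligned}
R^2 &= 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
     = 1 - \frac{2142.48772}{9335.73684} \approx 0.7705, \\
\text{Adj } R^2 &= 1 - \frac{n-1}{n-p}\,(1 - R^2)
     = 1 - \frac{18}{17}\,(0.2295) \approx 0.7570, \\
\text{Root MSE} &= \sqrt{126.02869} \approx 11.226, \\
\text{Coeff Var} &= 100 \times \frac{\text{Root MSE}}{\text{Dependent Mean}}
     = 100 \times \frac{11.22625}{100.02632} \approx 11.22 .
\end{aligned}
\]
```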

The Parameter Estimates table shown in Figure 61.2 contains the estimates of $\beta_0$ and $\beta_1$. The table also contains the t statistics and the corresponding p-values for testing whether each parameter is significantly different from zero. The p-values (t = -4.43, p = 0.0004 and t = 7.55, p < .0001) indicate that the intercept and Height parameter estimates, respectively, are highly significant.

  Simple Linear Regression

  The REG Procedure
  Model: MODEL1
  Dependent Variable: Weight

  Parameter Estimates

                    Parameter      Standard
  Variable    DF     Estimate         Error    t Value    Pr > |t|
  Intercept    1   -143.02692      32.27459      -4.43      0.0004
  Height       1      3.89903       0.51609       7.55      <.0001

Figure 61.2: Parameter Estimates
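Each t value in Figure 61.2 is the parameter estimate divided by its standard error. The following sketch reproduces both values from the printed estimates; the last line relies on the general fact that, in a regression with a single regressor, the squared slope t statistic equals the overall F statistic.

```latex
\[
\begin{aligned}
t_{\mathrm{Intercept}} &= \frac{-143.02692}{32.27459} \approx -4.43, \\
t_{\mathrm{Height}} &= \frac{3.89903}{0.51609} \approx 7.55, \\
t_{\mathrm{Height}}^2 &\approx 57.08 = F .
\end{aligned}
\]
```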

From the parameter estimates, the fitted model is

$$\mathrm{Weight} = -143.0 + 3.9 \times \mathrm{Height}$$

The REG procedure can be used interactively. After you specify a model with the MODEL statement and submit the PROC REG statements, you can submit further statements without reinvoking the procedure. The following command can now be issued to request a plot of the residual versus the predicted values, as shown in Figure 61.3.

  plot r.*p.;
  run;

Figure 61.3: Plot of Residual vs. Predicted Values

A trend in the residuals would indicate nonconstant variance in the data. Figure 61.3 may indicate a slight trend in the residuals; they appear to increase slightly as the predicted values increase. A fan-shaped trend may indicate the need for a variance-stabilizing transformation. A curved trend (such as a semi-circle) may indicate the need for a quadratic term in the model. Since these residuals show no pronounced trend, the analysis is considered to be acceptable.
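If a fan-shaped pattern were more pronounced, one common variance-stabilizing choice is to model the logarithm of the response. The following is a minimal sketch under that assumption; the data set name ClassLog and the variable LogWeight are hypothetical and are not part of the original example, which does not require a transformation.

```sas
/* Hypothetical illustration: ClassLog and LogWeight are assumed names. */
data ClassLog;
   set Class;
   LogWeight = log(Weight);   /* natural log of the response */
run;

proc reg data=ClassLog;
   model LogWeight = Height;
   plot r.*p.;                /* recheck the residual pattern */
run;
quit;
```

Whether a log, a square root, or some other transformation is appropriate depends on how the spread of the residuals changes with the predicted values.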

Polynomial Regression

Consider a response variable Y that can be predicted by a polynomial function of a regressor variable X. You can estimate $\beta_0$, the intercept, $\beta_1$, the slope due to X, and $\beta_2$, the slope due to $X^2$, in

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \epsilon_i$$

for the observations $i = 1, 2, \ldots, n$.

Consider the following example on population growth trends. The population of the United States from 1790 to 2000 is fit to linear and quadratic functions of time. Note that the quadratic term, YearSq, is created in the DATA step; this is done because polynomial effects such as Year*Year cannot be specified in the MODEL statement in PROC REG. The data are as follows:

  data USPopulation;
     input Population @@;
     retain Year 1780;
     Year=Year+10;
     YearSq=Year*Year;
     Population=Population/1000;
     datalines;
  3929 5308 7239 9638 12866 17069 23191 31443 39818 50155
  62947 75994 91972 105710 122775 131669 151325 179323 203211
  226542 248710 281422
  ;
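For comparison, the GLM procedure does accept crossed effects in its MODEL statement, so the quadratic term could be written there without creating YearSq first. This is a sketch of that alternative only; the rest of this section continues with PROC REG and the YearSq variable.

```sas
/* Alternative sketch: the polynomial effect is written directly in PROC GLM. */
proc glm data=USPopulation;
   model Population = Year Year*Year;
run;
quit;
```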

The following statements begin the analysis. (Influence diagnostics and autocorrelation information for the full model are shown in Figure 61.43 on page 3900 and Figure 61.57 on page 3916.)

  symbol1 c=blue;
  proc reg data=USPopulation;
     var YearSq;
     model Population=Year / r cli clm;
     plot r.*p. / cframe=ligr;
  run;

  The REG Procedure
  Model: MODEL1
  Dependent Variable: Population

  Output Statistics

                             Hat Diag       Cov              ----------DFBETAS----------
  Obs   Residual   RStudent         H     Ratio    DFFITS   Intercept      Year    YearSq
    1    -2.2837    -0.9361    0.3429    1.5519   -0.6762     -0.4924    0.4862   -0.4802
    2    -0.4146    -0.1540    0.2356    1.5325   -0.0855     -0.0540    0.0531   -0.0523
    3     0.6696     0.2379    0.1632    1.3923    0.1050      0.0517   -0.0505    0.0494
    4     0.8849     0.3065    0.1180    1.3128    0.1121      0.0335   -0.0322    0.0310
    5     0.5923     0.2021    0.0933    1.2883    0.0648      0.0040   -0.0032    0.0025
    6    -0.0621    -0.0210    0.0831    1.2827   -0.0063      0.0012   -0.0012    0.0013
    7    -0.1344    -0.0455    0.0824    1.2813   -0.0136      0.0054   -0.0055    0.0056
    8     0.5864     0.1994    0.0870    1.2796    0.0615     -0.0339    0.0343   -0.0347
    9     0.0934     0.0318    0.0933    1.2969    0.0102     -0.0067    0.0067   -0.0068
   10     0.2255     0.0771    0.0990    1.3040    0.0255     -0.0182    0.0183   -0.0183
   11     1.4757     0.5090    0.1022    1.2550    0.1717     -0.1272    0.1275   -0.1276
   12     1.6441     0.5680    0.1022    1.2420    0.1916     -0.1426    0.1426   -0.1424
   13     3.4065     1.2109    0.0990    1.0320    0.4013     -0.2895    0.2889   -0.2880
   14     1.5922     0.5470    0.0933    1.2345    0.1755     -0.1173    0.1167   -0.1160
   15     1.7679     0.6064    0.0870    1.2123    0.1871     -0.1076    0.1067   -0.1056
   16    -7.5642    -3.2147    0.0824    0.3286   -0.9636      0.4130   -0.4063    0.3987
   17    -7.4712    -3.1550    0.0831    0.3425   -0.9501      0.2131   -0.2048    0.1957
   18    -0.3731    -0.1272    0.0933    1.2936   -0.0408     -0.0007    0.0012   -0.0016
   19     1.2782     0.4440    0.1180    1.2906    0.1624      0.0415   -0.0432    0.0449
   20     1.0356     0.3687    0.1632    1.3741    0.1628      0.0732   -0.0749    0.0766
   21    -1.7068    -0.6406    0.2356    1.4380   -0.3557     -0.2107    0.2141   -0.2176
   22     4.7578     2.1312    0.3429    0.9113    1.5395      1.0656   -1.0793    1.0933

  Sum of Residuals                   4.4596E-11
  Sum of Squared Residuals            170.97193
  Predicted Residual SS (PRESS)       237.71229

Figure 61.43: Regression Using the INFLUENCE Option

  The REG Procedure
  Model: MODEL1
  Dependent Variable: Population

  Durbin-Watson D                1.191
  Number of Observations            22
  1st Order Autocorrelation      0.323

Figure 61.57: Regression Using DW Option

The DATA= option ensures that the procedure uses the intended data set. Any variable that you might add to the model but that is not included in the first MODEL statement must appear in the VAR statement. In the MODEL statement, three options are specified: R requests a residual analysis to be performed, CLI requests 95% confidence limits for an individual value, and CLM requests these limits for the expected value of the dependent variable. You can request specific $100(1 - \alpha)\%$ limits with the ALPHA= option in the PROC REG or MODEL statement. A plot of the residuals against the predicted values is requested by the PLOT statement.
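As an illustration of the ALPHA= option, the following sketch requests 90% rather than 95% limits for the same linear model. The value 0.10 is an arbitrary choice made here for the example.

```sas
/* Sketch: ALPHA=0.10 is an assumed value that yields 90% limits. */
proc reg data=USPopulation;
   model Population = Year / cli clm alpha=0.10;
run;
quit;
```

Placing ALPHA= in the MODEL statement affects only that model; placing it in the PROC REG statement sets the level for the whole step.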

The ANOVA table is displayed in Figure 61.4.

  The REG Procedure
  Model: MODEL1
  Dependent Variable: Population

  Analysis of Variance

                              Sum of           Mean
  Source             DF      Squares         Square    F Value    Pr > F
  Model               1       146869         146869     228.92    <.0001
  Error              20        12832      641.58160
  Corrected Total    21       159700

  Root MSE            25.32946    R-Square    0.9197
  Dependent Mean      94.64800    Adj R-Sq    0.9156
  Coeff Var           26.76175

  Parameter Estimates

                    Parameter      Standard
  Variable    DF     Estimate         Error    t Value    Pr > |t|
  Intercept    1  -2345.85498     161.39279     -14.54      <.0001
  Year         1      1.28786       0.08512      15.13      <.0001

Figure 61.4: ANOVA Table and Parameter Estimates

The Model F statistic is significant (F = 228.92, p < 0.0001), indicating that the model accounts for a significant portion of variation in the data. The R-Square indicates that the model accounts for 92% of the variation in population growth. The fitted equation for this model is

$$\mathrm{Population} = -2345.85 + 1.29 \times \mathrm{Year}$$

Figure 61.5 shows the confidence limits for both individual and expected values resulting from the CLM and CLI options.

  The REG Procedure
  Model: MODEL1
  Dependent Variable: Population

  Output Statistics

        Dependent   Predicted     Std Error
  Obs    Variable       Value  Mean Predict        95% CL Mean           95% CL Predict
    1      3.9290    -40.5778       10.4424    -62.3602   -18.7953    -97.7280    16.5725
    2      5.3080    -27.6991        9.7238    -47.9826    -7.4156    -84.2950    28.8968
    3      7.2390    -14.8205        9.0283    -33.6533     4.0123    -70.9128    41.2719
    4      9.6380     -1.9418        8.3617    -19.3841    15.5004    -57.5827    53.6991
    5     12.8660     10.9368        7.7314     -5.1906    27.0643    -44.3060    66.1797
    6     17.0690     23.8155        7.1470      8.9070    38.7239    -31.0839    78.7148
    7     23.1910     36.6941        6.6208     22.8834    50.5048    -17.9174    91.3056
    8     31.4430     49.5727        6.1675     36.7075    62.4380     -4.8073   103.9528
    9     39.8180     62.4514        5.8044     50.3436    74.5592      8.2455   116.6573
   10     50.1550     75.3300        5.5491     63.7547    86.9053     21.2406   129.4195
   11     62.9470     88.2087        5.4170     76.9090    99.5084     34.1776   142.2398
   12     75.9940    101.0873        5.4170     89.7876   112.3870     47.0562   155.1184
   13     91.9720    113.9660        5.5491    102.3907   125.5413     59.8765   168.0554
   14    105.7100    126.8446        5.8044    114.7368   138.9524     72.6387   181.0505
   15    122.7750    139.7233        6.1675    126.8580   152.5885     85.3432   194.1033
   16    131.6690    152.6019        6.6208    138.7912   166.4126     97.9904   207.2134
   17    151.3250    165.4805        7.1470    150.5721   180.3890    110.5812   220.3799
   18    179.3230    178.3592        7.7314    162.2317   194.4866    123.1163   233.6020
   19    203.2110    191.2378        8.3617    173.7956   208.6801    135.5969   246.8787
   20    226.5420    204.1165        9.0283    185.2837   222.9493    148.0241   260.2088
   21    248.7100    216.9951        9.7238    196.7116   237.2786    160.3992   273.5910
   22    281.4220    229.8738       10.4424    208.0913   251.6562    172.7235   287.0240

Figure 61.5: Confidence Limits

The observed dependent variable is displayed for each observation along with its predicted value from the regression equation and the standard error of the mean predicted value. The 95% CL Mean columns are the confidence limits for the expected value of each observation. The 95% CL Predict columns are the confidence limits for the individual observations.
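To make the two kinds of limits concrete, the following sketch reproduces the limits for observation 22 (the year 2000) from the printed values. It assumes the usual t-based intervals with 20 error degrees of freedom, for which the 0.975 quantile is approximately 2.086, and uses the MSE of 641.58160 from Figure 61.4; the symbols $\hat{y}$ and $s$ are notation introduced here.

```latex
\[
\begin{aligned}
\text{95\% CL Mean:}\quad
  & \hat{y} \pm t_{0.975,\,20}\; s_{\hat{y}}
    = 229.8738 \pm 2.086 \times 10.4424
    \approx (208.09,\ 251.66), \\
\text{95\% CL Predict:}\quad
  & \hat{y} \pm t_{0.975,\,20}\,\sqrt{\mathrm{MSE} + s_{\hat{y}}^{2}}
    = 229.8738 \pm 2.086 \times \sqrt{641.5816 + 109.0437} \\
  & \approx 229.8738 \pm 57.15
    \approx (172.72,\ 287.02).
\end{aligned}
\]
```

The prediction limits are wider because they add the error variance of a single new observation to the uncertainty in the estimated mean.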

Figure 61.6 displays the residual analysis requested by the R option.

  Output Statistics

                    Std Error     Student                       Cook's
  Obs   Residual     Residual    Residual      -2-1 0 1 2            D
    1    44.5068       23.077       1.929    |      |***   |     0.381
    2    33.0071       23.389       1.411    |      |**    |     0.172
    3    22.0595       23.666       0.932    |      |*     |     0.063
    4    11.5798       23.909       0.484    |      |      |     0.014
    5     1.9292       24.121      0.0800    |      |      |     0.000
    6    -6.7465       24.300      -0.278    |      |      |     0.003
    7   -13.5031       24.449      -0.552    |     *|      |     0.011
    8   -18.1297       24.567      -0.738    |     *|      |     0.017
    9   -22.6334       24.655      -0.918    |     *|      |     0.023
   10   -25.1750       24.714      -1.019    |    **|      |     0.026
   11   -25.2617       24.743      -1.021    |    **|      |     0.025
   12   -25.0933       24.743      -1.014    |    **|      |     0.025
   13   -21.9940       24.714      -0.890    |     *|      |     0.020
   14   -21.1346       24.655      -0.857    |     *|      |     0.020
   15   -16.9483       24.567      -0.690    |     *|      |     0.015
   16   -20.9329       24.449      -0.856    |     *|      |     0.027
   17   -14.1555       24.300      -0.583    |     *|      |     0.015
   18     0.9638       24.121      0.0400    |      |      |     0.000
   19    11.9732       23.909       0.501    |      |*     |     0.015
   20    22.4255       23.666       0.948    |      |*     |     0.065
   21    31.7149       23.389       1.356    |      |**    |     0.159
   22    51.5482       23.077       2.234    |      |****  |     0.511

  Sum of Residuals                         0
  Sum of Squared Residuals             12832
  Predicted Residual SS (PRESS)        16662

Figure 61.6: Residual Analysis

The residual, its standard error, and the studentized residuals are displayed for each observation. The studentized residual is the residual divided by its standard error. The magnitude of each studentized residual is shown in a plot. Studentized residuals follow a t distribution and can be used to identify outlying or extreme observations. Asterisks (*) extending beyond the dashed lines indicate that the residual is more than three standard errors from zero. Many observations having absolute studentized residuals greater than 2 may indicate an inadequate model. The wave pattern seen in this plot is also an indication that the model is inadequate; a quadratic term may be needed or autocorrelation may be present in the data. Cook's D is a measure of the change in the predicted values upon deletion of that observation from the data set; hence, it measures the influence of the observation on the estimated regression coefficients. A fairly close agreement between the PRESS statistic (see Table 61.6 on page 3897) and the Sum of Squared Residuals indicates that the MSE is a reasonable measure of the predictive accuracy of the fitted model (Neter, Wasserman, and Kutner, 1990).
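As a worked check of the definition, the following sketch recomputes the studentized residual for observation 1 from Figure 61.6. The second line uses the general identity that the variance of a residual and the variance of the corresponding mean prediction sum to the MSE, taking the standard error of the mean prediction (10.4424) from Figure 61.5; the small discrepancy from 641.58 comes from rounding in the printed values.

```latex
\[
\begin{aligned}
\text{Student Residual}_1 &= \frac{44.5068}{23.077} \approx 1.929, \\
23.077^2 + 10.4424^2 &\approx 532.55 + 109.04 = 641.59 \approx \mathrm{MSE}.
\end{aligned}
\]
```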

Table 61.6: Formulas and Definitions for Model Fit Summary Statistics
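MODEL Option or Statistic	Definition or Formula
n	the number of observations
p	the number of parameters including the intercept
i	1 if there is an intercept, 0 otherwise
$\hat{\sigma}^2$	the estimate of pure error variance from the SIGMA= option or from fitting the full model
SST$_0$	the uncorrected total sum of squares for the dependent variable
SST$_1$	the total sum of squares corrected for the mean for the dependent variable
SSE	the error sum of squares
MSE	$\mathrm{SSE}/(n-p)$
$R^2$	$1 - \mathrm{SSE}/\mathrm{SST}_i$
ADJRSQ	$1 - \dfrac{(n-i)(1-R^2)}{n-p}$
AIC	$n \ln(\mathrm{SSE}/n) + 2p$
BIC	$n \ln(\mathrm{SSE}/n) + 2(p+2)q - 2q^2$, where $q = n\hat{\sigma}^2/\mathrm{SSE}$
CP ($C_p$)	$\mathrm{SSE}/\hat{\sigma}^2 + 2p - n$
GMSEP	$\mathrm{MSE}\,(n+1)(n-2)/\bigl(n(n-p-1)\bigr) = \tfrac{1}{n} S_p (n+1)(n-2)$
JP ($J_p$)	$\dfrac{n+p}{n}\,\mathrm{MSE}$
PC	$\dfrac{n+p}{n-p}\,(1-R^2) = J_p\,(n/\mathrm{SST}_i)$
PRESS	the sum of squares of predr$_i$ (see Table 61.7)
RMSE	$\sqrt{\mathrm{MSE}}$
SBC	$n \ln(\mathrm{SSE}/n) + p \ln(n)$
SP ($S_p$)	$\mathrm{MSE}/(n-p-1)$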

A plot of the residuals versus predicted values is shown in Figure 61.7.

Figure 61.7: Plot of Residual vs. Predicted Values

The wave pattern of the studentized residual plot is seen here again. The semi-circle shape indicates an inadequate model; perhaps additional terms (such as the quadratic) are needed, or perhaps the data need to be transformed before analysis. If a model fits well, the plot of residuals against predicted values should exhibit no apparent trends.

Using the interactive feature of PROC REG, the following commands add the variable YearSq to the independent variables and refit the model.

  add YearSq;
  print;
  plot / cframe=ligr;
  run;

The ADD statement requests that YearSq be added to the model, and the PRINT command displays the ANOVA table for the new model. The PLOT statement with no variables recreates the most recent plot requested, in this case a plot of residual versus predicted values.
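The interactive ADD statement is convenient when the linear model has already been fit. If you know in advance that the quadratic term is wanted, the same model can be specified in a single step. This is a sketch of that alternative, reusing the options shown earlier; the output should correspond to the refit model, although the model label in the listing may differ.

```sas
/* Sketch: fit the quadratic model directly instead of using ADD. */
proc reg data=USPopulation;
   model Population = Year YearSq / r cli clm;
   plot r.*p. / cframe=ligr;
run;
quit;
```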

Figure 61.8 displays the ANOVA table and estimates for the new model.

  The REG Procedure
  Model: MODEL1.1
  Dependent Variable: Population

  Analysis of Variance

                              Sum of           Mean
  Source             DF      Squares         Square    F Value    Pr > F
  Model               2       159529          79765    8864.19    <.0001
  Error              19    170.97193        8.99852
  Corrected Total    21       159700

  Root MSE             2.99975    R-Square    0.9989
  Dependent Mean      94.64800    Adj R-Sq    0.9988
  Coeff Var            3.16938

  Parameter Estimates

                    Parameter      Standard
  Variable    DF     Estimate         Error    t Value    Pr > |t|
  Intercept    1        21631     639.50181      33.82      <.0001
  Year         1    -24.04581       0.67547     -35.60      <.0001
  YearSq       1      0.00668    0.00017820      37.51      <.0001

Figure 61.8: ANOVA Table and Parameter Estimates

The overall F statistic is still significant (F = 8864.19, p < 0.0001). The R-Square has increased from 0.9197 to 0.9989, indicating that the model now accounts for 99.9% of the variation in Population. All effects are significant with p < 0.0001 for each effect in the model.
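The improvement can be traced directly to the drop in the error sum of squares, from 12832 for the linear model to 170.97193 for the quadratic model. The following sketch recomputes the fit statistics from the values in Figure 61.8; small differences from the printed F value arise because the printed mean square is rounded.

```latex
\[
\begin{aligned}
R^2 &= 1 - \frac{170.97193}{159700} \approx 0.9989, \\
\mathrm{MSE} &= \frac{170.97193}{19} \approx 8.9985, \\
F &= \frac{79765}{8.99852} \approx 8864 .
\end{aligned}
\]
```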

The fitted equation is now

$$\mathrm{Population} = 21631 - 24.046 \times \mathrm{Year} + 0.0067 \times \mathrm{YearSq}$$

The confidence limits and residual analysis for the second model are displayed in Figure 61.9.

  The REG Procedure
  Model: MODEL1.1
  Dependent Variable: Population

  Output Statistics

        Dependent   Predicted     Std Error
  Obs    Variable       Value  Mean Predict        95% CL Mean           95% CL Predict
    1      3.9290      6.2127        1.7565      2.5362     9.8892     -1.0631    13.4884
    2      5.3080      5.7226        1.4560      2.6751     8.7701     -1.2565    12.7017
    3      7.2390      6.5694        1.2118      4.0331     9.1057     -0.2021    13.3409
    4      9.6380      8.7531        1.0305      6.5963    10.9100      2.1144    15.3918
    5     12.8660     12.2737        0.9163     10.3558    14.1916      5.7087    18.8386
    6     17.0690     17.1311        0.8650     15.3207    18.9415     10.5968    23.6655
    7     23.1910     23.3254        0.8613     21.5227    25.1281     16.7932    29.8576
    8     31.4430     30.8566        0.8846     29.0051    32.7080     24.3107    37.4024
    9     39.8180     39.7246        0.9163     37.8067    41.6425     33.1597    46.2896
   10     50.1550     49.9295        0.9436     47.9545    51.9046     43.3476    56.5114
   11     62.9470     61.4713        0.9590     59.4641    63.4785     54.8797    68.0629
   12     75.9940     74.3499        0.9590     72.3427    76.3571     67.7583    80.9415
   13     91.9720     88.5655        0.9436     86.5904    90.5405     81.9836    95.1473
   14    105.7100    104.1178        0.9163    102.2000   106.0357     97.5529   110.6828
   15    122.7750    121.0071        0.8846    119.1556   122.8585    114.4612   127.5529
   16    131.6690    139.2332        0.8613    137.4305   141.0359    132.7010   145.7654
   17    151.3250    158.7962        0.8650    156.9858   160.6066    152.2618   165.3306
   18    179.3230    179.6961        0.9163    177.7782   181.6139    173.1311   186.2610
   19    203.2110    201.9328        1.0305    199.7759   204.0896    195.2941   208.5715
   20    226.5420    225.5064        1.2118    222.9701   228.0427    218.7349   232.2779
   21    248.7100    250.4168        1.4560    247.3693   253.4644    243.4378   257.3959
   22    281.4220    276.6642        1.7565    272.9877   280.3407    269.3884   283.9400

  Output Statistics

                    Std Error     Student                       Cook's
  Obs   Residual     Residual    Residual      -2-1 0 1 2            D
    1    -2.2837        2.432      -0.939    |     *|      |     0.153
    2    -0.4146        2.623      -0.158    |      |      |     0.003
    3     0.6696        2.744       0.244    |      |      |     0.004
    4     0.8849        2.817       0.314    |      |      |     0.004
    5     0.5923        2.856       0.207    |      |      |     0.001
    6    -0.0621        2.872     -0.0216    |      |      |     0.000
    7    -0.1344        2.873     -0.0468    |      |      |     0.000
    8     0.5864        2.866       0.205    |      |      |     0.001
    9     0.0934        2.856      0.0327    |      |      |     0.000
   10     0.2255        2.847      0.0792    |      |      |     0.000
   11     1.4757        2.842       0.519    |      |*     |     0.010
   12     1.6441        2.842       0.578    |      |*     |     0.013
   13     3.4065        2.847       1.196    |      |**    |     0.052
   14     1.5922        2.856       0.557    |      |*     |     0.011
   15     1.7679        2.866       0.617    |      |*     |     0.012
   16    -7.5642        2.873      -2.632    | *****|      |     0.208
   17    -7.4712        2.872      -2.601    | *****|      |     0.205
   18    -0.3731        2.856      -0.131    |      |      |     0.001
   19     1.2782        2.817       0.454    |      |      |     0.009
   20     1.0356        2.744       0.377    |      |      |     0.009
   21    -1.7068        2.623      -0.651    |     *|      |     0.044
   22     4.7578        2.432       1.957    |      |***   |     0.666

  Sum of Residuals                   4.4596E-11
  Sum of Squared Residuals            170.97193
  Predicted Residual SS (PRESS)       237.71229

Figure 61.9: Confidence Limits and Residual Analysis

The plot of the studentized residuals shows that the wave structure is gone. The PRESS statistic is much closer to the Sum of Squared Residuals now, and both statistics have been dramatically reduced. Most of the Cook's D statistics have also been reduced.

The plot of residuals versus predicted values seen in Figure 61.10 has improved since a major trend is no longer visible.

Figure 61.10: Plot of Residual vs. Predicted Values

To create a plot of the observed values, predicted values, and confidence limits against Year all on the same plot and to exert some control over the look of the resulting plot, you can submit the following statements.

  symbol1 v=dot     c=yellow h=.3;
  symbol2 v=square  c=red;
  symbol3 f=simplex c=blue  h=2 v='-';
  symbol4 f=simplex c=blue  h=2 v='-';
  plot (Population predicted. u95. l95.)*Year
       / overlay cframe=ligr;
  run;

Figure 61.11: Plot of Population vs Year with Confidence Limits

The SYMBOL statements request that the actual data be displayed as dots, the predicted values as squares, and the upper and lower 95% confidence limits for an individual value (sometimes called a prediction interval) as dashes. PROC REG provides the short-hand commands CONF and PRED to request confidence and prediction intervals for simple regression models; see the PLOT Statement section on page 3839 for details.

To complete an analysis of these data, you may want to examine influence statistics and, since the data are essentially time series data, examine the Durbin-Watson statistic. You might also want to examine other residual plots, such as the residuals vs. regressors.
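These follow-up diagnostics can be requested with MODEL statement options and additional PLOT requests. The following is a minimal sketch, assuming the quadratic model is fit in a fresh PROC REG step; the INFLUENCE and DW options correspond to the output shown in Figure 61.43 and Figure 61.57.

```sas
/* Sketch: influence statistics, Durbin-Watson, and residuals vs. regressors. */
proc reg data=USPopulation;
   model Population = Year YearSq / influence dw;
   plot r.*Year r.*YearSq;
run;
quit;
```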

Using PROC REG Interactively

PROC REG can be used interactively. After you specify a model with a MODEL statement and run REG with a RUN statement, a variety of statements can be executed without reinvoking REG.

The Interactive Analysis section on page 3869 describes which statements can be used interactively. These interactive statements can be executed singly or in groups by following the single statement or group of statements with a RUN statement. Note that the MODEL statement can be repeated. This is an important difference from the GLM procedure, which allows only one MODEL statement.

If you use REG interactively, you can end the REG procedure with a DATA step, another PROC step, an ENDSAS statement, or with a QUIT statement. The syntax of the QUIT statement is

  quit;

When you are using REG interactively, additional RUN statements do not end REG but tell the procedure to execute additional statements.
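The following sketch shows how such a session might be laid out, using the USPopulation data from the previous example. Each RUN statement executes the statements submitted since the previous one, and only QUIT ends the procedure. The particular sequence of ADD and DELETE statements is chosen here for illustration.

```sas
proc reg data=USPopulation;
   var YearSq;                 /* make YearSq available for later use */
   model Population = Year;
run;                           /* fits the linear model; REG stays active */

   add YearSq;                 /* refit with the quadratic term */
   print;
run;

   delete YearSq;              /* return to the linear model */
   print;
run;

quit;                          /* ends the REG procedure */
```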

When a BY statement is used with PROC REG, interactive processing is not possible; that is, once the first RUN statement is encountered, processing proceeds for each BY group in the data set, and no further statements are accepted by the procedure.

When using REG interactively, you can fit a model, perform diagnostics, then refit the model, and perform diagnostics on the refitted model. Most of the interactive statements implicitly refit the model; for example, if you use the ADD statement to add a variable to the model, the regression equation is automatically recomputed. The two exceptions to this automatic recomputing are the PAINT and REWEIGHT statements. These two statements do not cause the model to be refitted. To do so, you can follow these statements either with a REFIT statement, which causes the model to be explicitly recomputed, or with another interactive statement that causes the model to be implicitly recomputed.
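As an illustration of the explicit refit, the following sketch excludes observations with large negative residuals from the quadratic population model and then recomputes the fit. The cutoff of -7 is an assumption chosen here so that the condition matches the two observations with residuals near -7.5 in Figure 61.9; it is not a recommendation from the original text.

```sas
proc reg data=USPopulation;
   model Population = Year YearSq;
run;

   reweight r. < -7;   /* assumed cutoff; matching observations get weight 0 */
   refit;              /* explicitly recompute the model */
   print;
run;
quit;
```

Without the REFIT statement, the model would be recomputed only when a later statement that implicitly refits, such as PRINT, is executed.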