Suppose that a response variable $Y$ can be predicted by a linear function of a regressor variable $X$. You can estimate $\beta_0$, the intercept, and $\beta_1$, the slope, in

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

for the observations $i = 1, 2, \ldots, n$. Fitting this model with the REG procedure requires only the following MODEL statement, where y is the outcome variable and x is the regressor variable.
proc reg;
   model y=x;
run;
For example, you might use regression analysis to find out how well you can predict a child's weight if you know that child's height. The following data are from a study of nineteen children. Height and weight are measured for each child.
title 'Simple Linear Regression';
data Class;
   input Name $ Height Weight Age @@;
   datalines;
Alfred  69.0 112.5 14  Alice  56.5  84.0 13  Barbara 65.3  98.0 13
Carol   62.8 102.5 14  Henry  63.5 102.5 14  James   57.3  83.0 12
Jane    59.8  84.5 12  Janet  62.5 112.5 15  Jeffrey 62.5  84.0 13
John    59.0  99.5 12  Joyce  51.3  50.5 11  Judy    64.3  90.0 14
Louise  56.3  77.0 12  Mary   66.5 112.0 15  Philip  72.0 150.0 16
Robert  64.8 128.0 12  Ronald 67.0 133.0 15  Thomas  57.5  85.0 11
William 66.5 112.0 15
;
The equation of interest is

$$\text{Weight} = \beta_0 + \beta_1 \, \text{Height} + \epsilon$$

The variable Weight is the response or dependent variable in this equation, and $\beta_0$ and $\beta_1$ are the unknown parameters to be estimated. The variable Height is the regressor or independent variable, and $\epsilon$ is the unknown error. The following commands invoke the REG procedure and fit this model to the data.
proc reg;
   model Weight = Height;
run;
Figure 61.1 includes some information concerning model fit.
Simple Linear Regression

The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Analysis of Variance

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 7193.24912 | 7193.24912 | 57.08 | <.0001 |
Error | 17 | 2142.48772 | 126.02869 | | |
Corrected Total | 18 | 9335.73684 | | | |

Statistic | Value |
---|---|
Root MSE | 11.22625 |
Dependent Mean | 100.02632 |
Coeff Var | 11.22330 |
R-Square | 0.7705 |
Adj R-Sq | 0.7570 |
The F statistic for the overall model is highly significant ($F = 57.076$, $p < 0.0001$), indicating that the model explains a significant portion of the variation in the data.
The degrees of freedom can be used in checking accuracy of the data and model. The model degrees of freedom are one less than the number of parameters to be estimated. This model estimates two parameters, $\beta_0$ and $\beta_1$; thus, the degrees of freedom should be $2 - 1 = 1$. The corrected total degrees of freedom are always one less than the total number of observations in the data set, in this case $19 - 1 = 18$.
Several simple statistics follow the ANOVA table. The Root MSE is an estimate of the standard deviation of the error term. The coefficient of variation, or Coeff Var, is a unitless expression of the variation in the data. The R-Square and Adj R-Sq are two statistics used in assessing the fit of the model; values close to 1 indicate a better fit. The R-Square of 0.77 indicates that Height accounts for 77% of the variation in Weight.
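The summary statistics can be checked by hand from the sums of squares in the ANOVA table. The following Python sketch is only an arithmetic cross-check outside SAS, not part of the REG procedure; the variable names are illustrative, and the inputs are the values printed in Figure 61.1.

```python
import math

# Values copied from the Analysis of Variance table in Figure 61.1.
n = 19                      # number of observations
p = 2                       # estimated parameters: intercept and slope
ss_model = 7193.24912
ss_error = 2142.48772
ss_total = 9335.73684       # corrected total sum of squares
dependent_mean = 100.02632

# Degrees of freedom as described in the text.
df_model = p - 1            # 1
df_error = n - p            # 17
df_total = n - 1            # 18

ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_value = ms_model / ms_error
root_mse = math.sqrt(ms_error)
r_square = 1 - ss_error / ss_total
adj_r_square = 1 - (df_total / df_error) * (1 - r_square)
coeff_var = 100 * root_mse / dependent_mean

print(f"F Value   {f_value:.3f}")
print(f"Root MSE  {root_mse:.5f}")
print(f"R-Square  {r_square:.4f}")
print(f"Adj R-Sq  {adj_r_square:.4f}")
print(f"Coeff Var {coeff_var:.5f}")
```

The adjusted R-Square is smaller than R-Square because it rescales the unexplained fraction by the ratio of total to error degrees of freedom.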
The Parameter Estimates table shown in Figure 61.2 contains the estimates of $\beta_0$ and $\beta_1$. The table also contains the t statistics and the corresponding p-values for testing whether each parameter is significantly different from zero. The p-values ($t = -4.43$, $p = 0.0004$ and $t = 7.55$, $p < 0.0001$) indicate that the intercept and Height parameter estimates, respectively, are highly significant.
Simple Linear Regression

The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | -143.02692 | 32.27459 | -4.43 | 0.0004 |
Height | 1 | 3.89903 | 0.51609 | 7.55 | <.0001 |
From the parameter estimates, the fitted model is

$$\text{Weight} = -143.0 + 3.9 \times \text{Height}$$
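As a cross-check outside SAS, the estimates, standard errors, and t values in Figure 61.2 can be reproduced with the closed-form least squares formulas for a single regressor. This is a minimal Python sketch under that assumption; it is not how PROC REG computes its results internally, and all names are illustrative.

```python
import math

# (Height, Weight) pairs for the nineteen children in the Class data set.
data = [
    (69.0, 112.5), (56.5, 84.0), (65.3, 98.0), (62.8, 102.5), (63.5, 102.5),
    (57.3, 83.0), (59.8, 84.5), (62.5, 112.5), (62.5, 84.0), (59.0, 99.5),
    (51.3, 50.5), (64.3, 90.0), (56.3, 77.0), (66.5, 112.0), (72.0, 150.0),
    (64.8, 128.0), (67.0, 133.0), (57.5, 85.0), (66.5, 112.0),
]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n
sxx = sum((x - x_bar) ** 2 for x, _ in data)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)

# Least squares estimates for a single regressor.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Error sum of squares and mean square error with n - 2 degrees of freedom.
sse = sum((y - intercept - slope * x) ** 2 for x, y in data)
mse = sse / (n - 2)

# Standard errors and t statistics of the two estimates.
se_slope = math.sqrt(mse / sxx)
se_intercept = math.sqrt(mse * (1 / n + x_bar ** 2 / sxx))
t_slope = slope / se_slope
t_intercept = intercept / se_intercept

print(f"Intercept {intercept:.5f}  SE {se_intercept:.5f}  t {t_intercept:.2f}")
print(f"Height    {slope:.5f}  SE {se_slope:.5f}  t {t_slope:.2f}")
```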
The REG procedure can be used interactively. After you specify a model with the MODEL statement and submit the PROC REG statements, you can submit further statements without reinvoking the procedure. The following command can now be issued to request a plot of the residual versus the predicted values, as shown in Figure 61.3.
plot r.*p.; run;
A trend in the residuals would indicate nonconstant variance in the data. Figure 61.3 may indicate a slight trend in the residuals; they appear to increase slightly as the predicted values increase. A fan-shaped trend may indicate the need for a variance-stabilizing transformation. A curved trend (such as a semicircle) may indicate the need for a quadratic term in the model. Since these residuals show no pronounced trend, the analysis is considered to be acceptable.
Consider a response variable $Y$ that can be predicted by a polynomial function of a regressor variable $X$. You can estimate $\beta_0$, the intercept, $\beta_1$, the slope due to $X$, and $\beta_2$, the slope due to $X^2$, in

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \epsilon_i$$

for the observations $i = 1, 2, \ldots, n$.
Consider the following example on population growth trends. The population of the United States from 1790 to 2000 is fit to linear and quadratic functions of time. Note that the quadratic term, YearSq, is created in the DATA step; this is done since polynomial effects such as Year*Year cannot be specified in the MODEL statement in PROC REG. The data are as follows:
data USPopulation;
   input Population @@;
   retain Year 1780;
   Year=Year+10;
   YearSq=Year*Year;
   Population=Population/1000;
   datalines;
3929 5308 7239 9638 12866 17069 23191 31443 39818 50155
62947 75994 91972 105710 122775 131669 151325 179323 203211
226542 248710 281422
;
The following statements begin the analysis. (Influence diagnostics and autocorrelation information for the full model are shown in Figure 61.43 on page 3900 and Figure 61.57 on page 3916.)
symbol1 c=blue;
proc reg data=USPopulation;
   var YearSq;
   model Population=Year / r cli clm;
   plot r.*p. / cframe=ligr;
run;
The REG Procedure Model: MODEL1 Dependent Variable: Population Output Statistics Hat Diag Cov -----------DFBETAS----------- Obs Residual RStudent H Ratio DFFITS Intercept Year YearSq 1 2.2837 0.9361 0.3429 1.5519 0.6762 0.4924 0.4862 0.4802 2 0.4146 0.1540 0.2356 1.5325 0.0855 0.0540 0.0531 0.0523 3 0.6696 0.2379 0.1632 1.3923 0.1050 0.0517 0.0505 0.0494 4 0.8849 0.3065 0.1180 1.3128 0.1121 0.0335 0.0322 0.0310 5 0.5923 0.2021 0.0933 1.2883 0.0648 0.0040 0.0032 0.0025 6 0.0621 0.0210 0.0831 1.2827 0.0063 0.0012 0.0012 0.0013 7 0.1344 0.0455 0.0824 1.2813 0.0136 0.0054 0.0055 0.0056 8 0.5864 0.1994 0.0870 1.2796 0.0615 0.0339 0.0343 0.0347 9 0.0934 0.0318 0.0933 1.2969 0.0102 0.0067 0.0067 0.0068 10 0.2255 0.0771 0.0990 1.3040 0.0255 0.0182 0.0183 0.0183 11 1.4757 0.5090 0.1022 1.2550 0.1717 0.1272 0.1275 0.1276 12 1.6441 0.5680 0.1022 1.2420 0.1916 0.1426 0.1426 0.1424 13 3.4065 1.2109 0.0990 1.0320 0.4013 0.2895 0.2889 0.2880 14 1.5922 0.5470 0.0933 1.2345 0.1755 0.1173 0.1167 0.1160 15 1.7679 0.6064 0.0870 1.2123 0.1871 0.1076 0.1067 0.1056 16 7.5642 3.2147 0.0824 0.3286 0.9636 0.4130 0.4063 0.3987 17 7.4712 3.1550 0.0831 0.3425 0.9501 0.2131 0.2048 0.1957 18 0.3731 0.1272 0.0933 1.2936 0.0408 0.0007 0.0012 0.0016 19 1.2782 0.4440 0.1180 1.2906 0.1624 0.0415 0.0432 0.0449 20 1.0356 0.3687 0.1632 1.3741 0.1628 0.0732 0.0749 0.0766 21 1.7068 0.6406 0.2356 1.4380 0.3557 0.2107 0.2141 0.2176 22 4.7578 2.1312 0.3429 0.9113 1.5395 1.0656 1.0793 1.0933 Sum of Residuals 4.4596E-11 Sum of Squared Residuals 170.97193 Predicted Residual SS (PRESS) 237.71229
The REG Procedure
Model: MODEL1
Dependent Variable: Population

Statistic | Value |
---|---|
Durbin-Watson D | 1.191 |
Number of Observations | 22 |
1st Order Autocorrelation | 0.323 |
The DATA= option ensures that the procedure uses the intended data set. Any variable that you might add to the model but that is not included in the first MODEL statement must appear in the VAR statement. In the MODEL statement, three options are specified: R requests a residual analysis to be performed, CLI requests 95% confidence limits for an individual value, and CLM requests these limits for the expected value of the dependent variable. You can request specific $100(1 - \alpha)\%$ limits with the ALPHA= option in the PROC REG or MODEL statement. A plot of the residuals against the predicted values is requested by the PLOT statement.
The ANOVA table is displayed in Figure 61.4.
The REG Procedure
Model: MODEL1
Dependent Variable: Population

Analysis of Variance

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 146869 | 146869 | 228.92 | <.0001 |
Error | 20 | 12832 | 641.58160 | | |
Corrected Total | 21 | 159700 | | | |

Statistic | Value |
---|---|
Root MSE | 25.32946 |
Dependent Mean | 94.64800 |
Coeff Var | 26.76175 |
R-Square | 0.9197 |
Adj R-Sq | 0.9156 |

Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | -2345.85498 | 161.39279 | -14.54 | <.0001 |
Year | 1 | 1.28786 | 0.08512 | 15.13 | <.0001 |

The Model F statistic is significant ($F = 228.92$, $p < 0.0001$), indicating that the model accounts for a significant portion of variation in the data. The R-Square indicates that the model accounts for 92% of the variation in population growth. The fitted equation for this model is

$$\text{Population} = -2345.85 + 1.29 \times \text{Year}$$
Figure 61.5 shows the confidence limits for both individual and expected values resulting from the CLM and CLI options.
The REG Procedure
Model: MODEL1
Dependent Variable: Population

Output Statistics

Obs | Dependent Variable | Predicted Value | Std Error Mean Predict | 95% CL Mean (Lower) | 95% CL Mean (Upper) | 95% CL Predict (Lower) | 95% CL Predict (Upper) |
---|---|---|---|---|---|---|---|
1 | 3.9290 | -40.5778 | 10.4424 | -62.3602 | -18.7953 | -97.7280 | 16.5725 |
2 | 5.3080 | -27.6991 | 9.7238 | -47.9826 | -7.4156 | -84.2950 | 28.8968 |
3 | 7.2390 | -14.8205 | 9.0283 | -33.6533 | 4.0123 | -70.9128 | 41.2719 |
4 | 9.6380 | -1.9418 | 8.3617 | -19.3841 | 15.5004 | -57.5827 | 53.6991 |
5 | 12.8660 | 10.9368 | 7.7314 | -5.1906 | 27.0643 | -44.3060 | 66.1797 |
6 | 17.0690 | 23.8155 | 7.1470 | 8.9070 | 38.7239 | -31.0839 | 78.7148 |
7 | 23.1910 | 36.6941 | 6.6208 | 22.8834 | 50.5048 | -17.9174 | 91.3056 |
8 | 31.4430 | 49.5727 | 6.1675 | 36.7075 | 62.4380 | -4.8073 | 103.9528 |
9 | 39.8180 | 62.4514 | 5.8044 | 50.3436 | 74.5592 | 8.2455 | 116.6573 |
10 | 50.1550 | 75.3300 | 5.5491 | 63.7547 | 86.9053 | 21.2406 | 129.4195 |
11 | 62.9470 | 88.2087 | 5.4170 | 76.9090 | 99.5084 | 34.1776 | 142.2398 |
12 | 75.9940 | 101.0873 | 5.4170 | 89.7876 | 112.3870 | 47.0562 | 155.1184 |
13 | 91.9720 | 113.9660 | 5.5491 | 102.3907 | 125.5413 | 59.8765 | 168.0554 |
14 | 105.7100 | 126.8446 | 5.8044 | 114.7368 | 138.9524 | 72.6387 | 181.0505 |
15 | 122.7750 | 139.7233 | 6.1675 | 126.8580 | 152.5885 | 85.3432 | 194.1033 |
16 | 131.6690 | 152.6019 | 6.6208 | 138.7912 | 166.4126 | 97.9904 | 207.2134 |
17 | 151.3250 | 165.4805 | 7.1470 | 150.5721 | 180.3890 | 110.5812 | 220.3799 |
18 | 179.3230 | 178.3592 | 7.7314 | 162.2317 | 194.4866 | 123.1163 | 233.6020 |
19 | 203.2110 | 191.2378 | 8.3617 | 173.7956 | 208.6801 | 135.5969 | 246.8787 |
20 | 226.5420 | 204.1165 | 9.0283 | 185.2837 | 222.9493 | 148.0241 | 260.2088 |
21 | 248.7100 | 216.9951 | 9.7238 | 196.7116 | 237.2786 | 160.3992 | 273.5910 |
22 | 281.4220 | 229.8738 | 10.4424 | 208.0913 | 251.6562 | 172.7235 | 287.0240 |
The observed dependent variable is displayed for each observation along with its predicted value from the regression equation and the standard error of the mean predicted value. The 95% CL Mean columns are the confidence limits for the expected value of each observation. The 95% CL Predict columns are the confidence limits for the individual observations.
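The two kinds of limits differ only in the standard error that is used. The following Python sketch recomputes both intervals for observation 10 from the values printed in Figure 61.5 and the mean square error in Figure 61.4. The t quantile is hard-coded from a standard t table, which is an assumption of this sketch because the Python standard library has no t distribution; the variable names are illustrative.

```python
import math

# Observation 10 (Year = 1880), copied from Figure 61.5.
predicted = 75.3300
se_mean = 5.5491            # Std Error Mean Predict
mse = 641.58160             # Mean Square Error from the ANOVA table

# Assumption: 97.5th percentile of the t distribution with 20 error
# degrees of freedom, taken from a standard t table.
T_975_DF20 = 2.085963

# Limits for the expected value use the standard error of the mean.
half_mean = T_975_DF20 * se_mean
cl_mean = (predicted - half_mean, predicted + half_mean)

# Limits for an individual value add the error variance to the
# variance of the estimated mean.
se_individual = math.sqrt(mse + se_mean ** 2)
half_predict = T_975_DF20 * se_individual
cl_predict = (predicted - half_predict, predicted + half_predict)

print(f"95% CL Mean    {cl_mean[0]:.4f} {cl_mean[1]:.4f}")
print(f"95% CL Predict {cl_predict[0]:.4f} {cl_predict[1]:.4f}")
```

The interval for an individual value is much wider because the error variance dominates the variance of the estimated mean.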
Figure 61.6 displays the residual analysis requested by the R option.
Output Statistics

Obs | Residual | Std Error Residual | Student Residual | -2 -1 0 1 2 | Cook's D |
---|---|---|---|---|---|
1 | 44.5068 | 23.077 | 1.929 | *** | 0.381 |
2 | 33.0071 | 23.389 | 1.411 | ** | 0.172 |
3 | 22.0595 | 23.666 | 0.932 | * | 0.063 |
4 | 11.5798 | 23.909 | 0.484 | | 0.014 |
5 | 1.9292 | 24.121 | 0.0800 | | 0.000 |
6 | -6.7465 | 24.300 | -0.278 | | 0.003 |
7 | -13.5031 | 24.449 | -0.552 | * | 0.011 |
8 | -18.1297 | 24.567 | -0.738 | * | 0.017 |
9 | -22.6334 | 24.655 | -0.918 | * | 0.023 |
10 | -25.1750 | 24.714 | -1.019 | ** | 0.026 |
11 | -25.2617 | 24.743 | -1.021 | ** | 0.025 |
12 | -25.0933 | 24.743 | -1.014 | ** | 0.025 |
13 | -21.9940 | 24.714 | -0.890 | * | 0.020 |
14 | -21.1346 | 24.655 | -0.857 | * | 0.020 |
15 | -16.9483 | 24.567 | -0.690 | * | 0.015 |
16 | -20.9329 | 24.449 | -0.856 | * | 0.027 |
17 | -14.1555 | 24.300 | -0.583 | * | 0.015 |
18 | 0.9638 | 24.121 | 0.0400 | | 0.000 |
19 | 11.9732 | 23.909 | 0.501 | * | 0.015 |
20 | 22.4255 | 23.666 | 0.948 | * | 0.065 |
21 | 31.7149 | 23.389 | 1.356 | ** | 0.159 |
22 | 51.5482 | 23.077 | 2.234 | **** | 0.511 |

Statistic | Value |
---|---|
Sum of Residuals | 0 |
Sum of Squared Residuals | 12832 |
Predicted Residual SS (PRESS) | 16662 |
The residual, its standard error, and the studentized residuals are displayed for each observation. The studentized residual is the residual divided by its standard error. The magnitude of each studentized residual is shown in a plot. Studentized residuals follow a t distribution and can be used to identify outlying or extreme observations. Asterisks (*) extending beyond the dashed lines indicate that the residual is more than three standard errors from zero. Many observations having absolute studentized residuals greater than 2 may indicate an inadequate model. The wave pattern seen in this plot is also an indication that the model is inadequate; a quadratic term may be needed or autocorrelation may be present in the data. Cook's D is a measure of the change in the predicted values upon deletion of that observation from the data set; hence, it measures the influence of the observation on the estimated regression coefficients. A fairly close agreement between the PRESS statistic (see Table 61.6 on page 3897) and the Sum of Squared Residuals indicates that the MSE is a reasonable measure of the predictive accuracy of the fitted model (Neter, Wasserman, and Kutner, 1990).
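The residual diagnostics for the linear model can be reproduced from the data with the usual leverage-based textbook formulas. The following Python sketch assumes those formulas correspond to the columns that the R option prints; it is a cross-check outside SAS, and the names are illustrative.

```python
import math

# U.S. population in millions for the census years 1790 to 2000.
population = [
    3.929, 5.308, 7.239, 9.638, 12.866, 17.069, 23.191, 31.443, 39.818,
    50.155, 62.947, 75.994, 91.972, 105.710, 122.775, 131.669, 151.325,
    179.323, 203.211, 226.542, 248.710, 281.422,
]
years = [1790 + 10 * k for k in range(len(population))]

n, p = len(years), 2        # p counts the intercept and the slope
x_bar = sum(years) / n
y_bar = sum(population) / n
sxx = sum((x - x_bar) ** 2 for x in years)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, population)) / sxx
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(years, population)]
sse = sum(e * e for e in residuals)
mse = sse / (n - p)

# Leverage (hat diagonal) of each observation in simple linear regression.
leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in years]

# Studentized residual: residual divided by its standard error.
student = [e / math.sqrt(mse * (1 - h)) for e, h in zip(residuals, leverage)]

# Cook's D combines the studentized residual with the leverage.
cooks_d = [r * r / p * h / (1 - h) for r, h in zip(student, leverage)]

# PRESS: sum of squared residuals predicted with each observation deleted.
press = sum((e / (1 - h)) ** 2 for e, h in zip(residuals, leverage))

print(f"Obs 22: residual {residuals[-1]:.4f}, student {student[-1]:.3f}, "
      f"Cook's D {cooks_d[-1]:.3f}")
print(f"SSE {sse:.0f}  PRESS {press:.0f}")
```

The end points of the series have the largest leverage, which is why observations 1 and 22 show the largest Cook's D values.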
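MODEL Option or Statistic | Definition or Formula |
---|---|
$n$ | the number of observations |
$p$ | the number of parameters including the intercept |
$i$ | 1 if there is an intercept, 0 otherwise |
$\hat{\sigma}^2$ | the estimate of pure error variance from the SIGMA= option or from fitting the full model |
$\mathrm{SST}_0$ | the uncorrected total sum of squares for the dependent variable |
$\mathrm{SST}_1$ | the total sum of squares corrected for the mean for the dependent variable |
SSE | the error sum of squares |
MSE | $\mathrm{SSE}/(n-p)$ |
$R^2$ | $1 - \mathrm{SSE}/\mathrm{SST}_i$ |
ADJRSQ | $1 - \dfrac{(n-i)(1-R^2)}{n-p}$ |
AIC | $n \ln(\mathrm{SSE}/n) + 2p$ |
BIC | $n \ln(\mathrm{SSE}/n) + 2(p+2)q - 2q^2$, where $q = n\hat{\sigma}^2/\mathrm{SSE}$ |
CP ($C_p$) | $\mathrm{SSE}/\hat{\sigma}^2 + 2p - n$ |
GMSEP | $\mathrm{MSE}\,(n+1)(n-2)/(n(n-p-1))$ |
JP ($J_p$) | $(n+p)\,\mathrm{MSE}/n$ |
PC | $\dfrac{n+p}{n-p}\,(1-R^2)$ |
PRESS | the sum of squares of $\mathrm{predr}_i$ (see Table 61.7) |
RMSE | $\sqrt{\mathrm{MSE}}$ |
SBC | $n \ln(\mathrm{SSE}/n) + p \ln(n)$ |
SP ($S_p$) | $\mathrm{MSE}/(n-p-1)$ |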
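Several of the statistics in this table depend only on SSE, the corrected total sum of squares, $n$, and $p$. The following Python sketch evaluates a few of them for the linear and quadratic population models, using the sums of squares printed in the ANOVA tables for those models. The function name and dictionary keys are hypothetical, and the formulas for MSE, R-Square, ADJRSQ, AIC, and SBC are the usual textbook definitions, assumed here rather than taken from PROC REG itself.

```python
import math

def fit_statistics(sse, sst, n, p, intercept=True):
    """Return a few fit statistics computed from sums of squares.

    sse: error sum of squares; sst: total sum of squares (corrected for
    the mean when the model has an intercept); n: observations;
    p: parameters including the intercept.
    """
    i = 1 if intercept else 0
    mse = sse / (n - p)
    r_square = 1 - sse / sst
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "RSQUARE": r_square,
        "ADJRSQ": 1 - (n - i) * (1 - r_square) / (n - p),
        "AIC": n * math.log(sse / n) + 2 * p,
        "SBC": n * math.log(sse / n) + p * math.log(n),
    }

# Sums of squares as printed (rounded) in the two ANOVA tables.
linear = fit_statistics(sse=12832, sst=159700, n=22, p=2)
quadratic = fit_statistics(sse=170.97193, sst=159700, n=22, p=3)

for name, stats in (("linear", linear), ("quadratic", quadratic)):
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Smaller AIC and SBC values indicate the preferred model, so both criteria favor the quadratic model here despite its extra parameter.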
A plot of the residuals versus predicted values is shown in Figure 61.7.
The wave pattern of the studentized residual plot is seen here again. The semi-circle shape indicates an inadequate model; perhaps additional terms (such as the quadratic) are needed, or perhaps the data need to be transformed before analysis. If a model fits well, the plot of residuals against predicted values should exhibit no apparent trends.
Using the interactive feature of PROC REG, the following commands add the variable YearSq to the independent variables and refit the model.
add YearSq;
print;
plot / cframe=ligr;
run;
The ADD statement requests that YearSq be added to the model, and the PRINT command displays the ANOVA table for the new model. The PLOT statement with no variables recreates the most recent plot requested, in this case a plot of residual versus predicted values.
Figure 61.8 displays the ANOVA table and estimates for the new model.
The REG Procedure
Model: MODEL1.1
Dependent Variable: Population

Analysis of Variance

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 2 | 159529 | 79765 | 8864.19 | <.0001 |
Error | 19 | 170.97193 | 8.99852 | | |
Corrected Total | 21 | 159700 | | | |

Statistic | Value |
---|---|
Root MSE | 2.99975 |
Dependent Mean | 94.64800 |
Coeff Var | 3.16938 |
R-Square | 0.9989 |
Adj R-Sq | 0.9988 |

Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 21631 | 639.50181 | 33.82 | <.0001 |
Year | 1 | -24.04581 | 0.67547 | -35.60 | <.0001 |
YearSq | 1 | 0.00668 | 0.00017820 | 37.51 | <.0001 |
The overall F statistic is still significant ($F = 8864.19$, $p < 0.0001$). The R-Square has increased from 0.9197 to 0.9989, indicating that the model now accounts for 99.9% of the variation in Population. All effects are significant with $p < 0.0001$ for each effect in the model.
The fitted equation is now

$$\text{Population} = 21631 - 24.046 \times \text{Year} + 0.0067 \times \text{YearSq}$$
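The quadratic fit can be cross-checked outside SAS. Because the census years are equally spaced, a centered decade index and a centered quadratic term are mutually orthogonal, so each coefficient is a simple ratio of sums. The following Python sketch relies on that property of this particular design and is not the algorithm PROC REG uses; the names are illustrative.

```python
# U.S. population in millions for the census years 1790 to 2000.
population = [
    3.929, 5.308, 7.239, 9.638, 12.866, 17.069, 23.191, 31.443, 39.818,
    50.155, 62.947, 75.994, 91.972, 105.710, 122.775, 131.669, 151.325,
    179.323, 203.211, 226.542, 248.710, 281.422,
]
n = len(population)
year_mid = 1895.0                       # midpoint of 1790..2000

# Centered decade index t = (Year - 1895) / 10, from -10.5 to 10.5.
t = [k - (n - 1) / 2 for k in range(n)]
m = sum(v * v for v in t) / n           # mean of t squared
q = [v * v - m for v in t]              # quadratic term, orthogonal to 1 and t

# With orthogonal regressors each coefficient is a ratio of sums.
y_bar = sum(population) / n
b = sum(v * y for v, y in zip(t, population)) / sum(v * v for v in t)
c = sum(w * y for w, y in zip(q, population)) / sum(w * w for w in q)

# Convert back to the Year and YearSq scale used in the MODEL statement.
b_yearsq = c / 100
b_year = b / 10 - 2 * year_mid * c / 100
b_intercept = y_bar - b * year_mid / 10 + c * (year_mid ** 2 / 100 - m)

fitted = [y_bar + b * v + c * w for v, w in zip(t, q)]
sse = sum((y - f) ** 2 for y, f in zip(population, fitted))
sst = sum((y - y_bar) ** 2 for y in population)
r_square = 1 - sse / sst

print(f"Intercept {b_intercept:.0f}  Year {b_year:.5f}  YearSq {b_yearsq:.5f}")
print(f"SSE {sse:.5f}  R-Square {r_square:.4f}")
```

Centering also avoids the numerical difficulty of working directly with Year and YearSq, whose values are very large and nearly collinear over this range.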
The confidence limits and residual analysis for the second model are displayed in Figure 61.9.
The REG Procedure
Model: MODEL1.1
Dependent Variable: Population

Output Statistics

Obs | Dependent Variable | Predicted Value | Std Error Mean Predict | 95% CL Mean (Lower) | 95% CL Mean (Upper) | 95% CL Predict (Lower) | 95% CL Predict (Upper) |
---|---|---|---|---|---|---|---|
1 | 3.9290 | 6.2127 | 1.7565 | 2.5362 | 9.8892 | -1.0631 | 13.4884 |
2 | 5.3080 | 5.7226 | 1.4560 | 2.6751 | 8.7701 | -1.2565 | 12.7017 |
3 | 7.2390 | 6.5694 | 1.2118 | 4.0331 | 9.1057 | -0.2021 | 13.3409 |
4 | 9.6380 | 8.7531 | 1.0305 | 6.5963 | 10.9100 | 2.1144 | 15.3918 |
5 | 12.8660 | 12.2737 | 0.9163 | 10.3558 | 14.1916 | 5.7087 | 18.8386 |
6 | 17.0690 | 17.1311 | 0.8650 | 15.3207 | 18.9415 | 10.5968 | 23.6655 |
7 | 23.1910 | 23.3254 | 0.8613 | 21.5227 | 25.1281 | 16.7932 | 29.8576 |
8 | 31.4430 | 30.8566 | 0.8846 | 29.0051 | 32.7080 | 24.3107 | 37.4024 |
9 | 39.8180 | 39.7246 | 0.9163 | 37.8067 | 41.6425 | 33.1597 | 46.2896 |
10 | 50.1550 | 49.9295 | 0.9436 | 47.9545 | 51.9046 | 43.3476 | 56.5114 |
11 | 62.9470 | 61.4713 | 0.9590 | 59.4641 | 63.4785 | 54.8797 | 68.0629 |
12 | 75.9940 | 74.3499 | 0.9590 | 72.3427 | 76.3571 | 67.7583 | 80.9415 |
13 | 91.9720 | 88.5655 | 0.9436 | 86.5904 | 90.5405 | 81.9836 | 95.1473 |
14 | 105.7100 | 104.1178 | 0.9163 | 102.2000 | 106.0357 | 97.5529 | 110.6828 |
15 | 122.7750 | 121.0071 | 0.8846 | 119.1556 | 122.8585 | 114.4612 | 127.5529 |
16 | 131.6690 | 139.2332 | 0.8613 | 137.4305 | 141.0359 | 132.7010 | 145.7654 |
17 | 151.3250 | 158.7962 | 0.8650 | 156.9858 | 160.6066 | 152.2618 | 165.3306 |
18 | 179.3230 | 179.6961 | 0.9163 | 177.7782 | 181.6139 | 173.1311 | 186.2610 |
19 | 203.2110 | 201.9328 | 1.0305 | 199.7759 | 204.0896 | 195.2941 | 208.5715 |
20 | 226.5420 | 225.5064 | 1.2118 | 222.9701 | 228.0427 | 218.7349 | 232.2779 |
21 | 248.7100 | 250.4168 | 1.4560 | 247.3693 | 253.4644 | 243.4378 | 257.3959 |
22 | 281.4220 | 276.6642 | 1.7565 | 272.9877 | 280.3407 | 269.3884 | 283.9400 |

Output Statistics

Obs | Residual | Std Error Residual | Student Residual | -2 -1 0 1 2 | Cook's D |
---|---|---|---|---|---|
1 | -2.2837 | 2.432 | -0.939 | * | 0.153 |
2 | -0.4146 | 2.623 | -0.158 | | 0.003 |
3 | 0.6696 | 2.744 | 0.244 | | 0.004 |
4 | 0.8849 | 2.817 | 0.314 | | 0.004 |
5 | 0.5923 | 2.856 | 0.207 | | 0.001 |
6 | -0.0621 | 2.872 | -0.0216 | | 0.000 |
7 | -0.1344 | 2.873 | -0.0468 | | 0.000 |
8 | 0.5864 | 2.866 | 0.205 | | 0.001 |
9 | 0.0934 | 2.856 | 0.0327 | | 0.000 |
10 | 0.2255 | 2.847 | 0.0792 | | 0.000 |
11 | 1.4757 | 2.842 | 0.519 | * | 0.010 |
12 | 1.6441 | 2.842 | 0.578 | * | 0.013 |
13 | 3.4065 | 2.847 | 1.196 | ** | 0.052 |
14 | 1.5922 | 2.856 | 0.557 | * | 0.011 |
15 | 1.7679 | 2.866 | 0.617 | * | 0.012 |
16 | -7.5642 | 2.873 | -2.632 | ***** | 0.208 |
17 | -7.4712 | 2.872 | -2.601 | ***** | 0.205 |
18 | -0.3731 | 2.856 | -0.131 | | 0.001 |
19 | 1.2782 | 2.817 | 0.454 | | 0.009 |
20 | 1.0356 | 2.744 | 0.377 | | 0.009 |
21 | -1.7068 | 2.623 | -0.651 | * | 0.044 |
22 | 4.7578 | 2.432 | 1.957 | *** | 0.666 |

Statistic | Value |
---|---|
Sum of Residuals | 4.4596E-11 |
Sum of Squared Residuals | 170.97193 |
Predicted Residual SS (PRESS) | 237.71229 |
The plot of the studentized residuals shows that the wave structure is gone. The PRESS statistic is much closer to the Sum of Squared Residuals now, and both statistics have been dramatically reduced. Most of the Cook's D statistics have also been reduced.
The plot of residuals versus predicted values seen in Figure 61.10 has improved since a major trend is no longer visible.
To create a plot of the observed values, predicted values, and confidence limits against Year all on the same plot and to exert some control over the look of the resulting plot, you can submit the following statements.
symbol1 v=dot c=yellow h=.3;
symbol2 v=square c=red;
symbol3 f=simplex c=blue h=2 v='-';
symbol4 f=simplex c=blue h=2 v='-';
plot (Population predicted. u95. l95.)*Year / overlay cframe=ligr;
run;
The SYMBOL statements request that the actual data be displayed as dots, the predicted values as squares, and the upper and lower 95% confidence limits for an individual value (sometimes called a prediction interval) as dashes. PROC REG provides the shorthand commands CONF and PRED to request confidence and prediction intervals for simple regression models; see the PLOT Statement section on page 3839 for details.
To complete an analysis of these data, you may want to examine influence statistics and, since the data are essentially time series data, examine the Durbin-Watson statistic. You might also want to examine other residual plots, such as the residuals vs. regressors.
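The Durbin-Watson statistic and the first-order autocorrelation of the residuals can be approximated by hand. The following Python sketch takes the Dependent Variable and Predicted Value columns of Figure 61.9 for the quadratic model, forms the residuals as observed minus predicted, and applies the standard definitions of the two statistics. It is a cross-check outside SAS under those assumptions, and because the inputs are rounded to four decimals the results agree only to about three decimals.

```python
# Dependent Variable and Predicted Value columns from Figure 61.9.
observed = [
    3.9290, 5.3080, 7.2390, 9.6380, 12.8660, 17.0690, 23.1910, 31.4430,
    39.8180, 50.1550, 62.9470, 75.9940, 91.9720, 105.7100, 122.7750,
    131.6690, 151.3250, 179.3230, 203.2110, 226.5420, 248.7100, 281.4220,
]
predicted = [
    6.2127, 5.7226, 6.5694, 8.7531, 12.2737, 17.1311, 23.3254, 30.8566,
    39.7246, 49.9295, 61.4713, 74.3499, 88.5655, 104.1178, 121.0071,
    139.2332, 158.7962, 179.6961, 201.9328, 225.5064, 250.4168, 276.6642,
]

residuals = [o - f for o, f in zip(observed, predicted)]
n = len(residuals)
sse = sum(e * e for e in residuals)

# Durbin-Watson D: squared successive differences relative to SSE.
durbin_watson = sum(
    (residuals[k] - residuals[k - 1]) ** 2 for k in range(1, n)
) / sse

# First-order autocorrelation of the residuals.
autocorrelation = sum(
    residuals[k] * residuals[k - 1] for k in range(1, n)
) / sse

print(f"Durbin-Watson D           {durbin_watson:.3f}")
print(f"1st Order Autocorrelation {autocorrelation:.3f}")
```

A value of D well below 2 together with a positive first-order autocorrelation suggests that neighboring residuals tend to share the same sign, which is common in time series data.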
PROC REG can be used interactively. After you specify a model with a MODEL statement and run REG with a RUN statement, a variety of statements can be executed without reinvoking REG.
The Interactive Analysis section on page 3869 describes which statements can be used interactively. These interactive statements can be executed singly or in groups by following the single statement or group of statements with a RUN statement. Note that the MODEL statement can be repeated. This is an important difference from the GLM procedure, which allows only one MODEL statement.
If you use REG interactively, you can end the REG procedure with a DATA step, another PROC step, an ENDSAS statement, or with a QUIT statement. The syntax of the QUIT statement is
quit;
When you are using REG interactively, additional RUN statements do not end REG but tell the procedure to execute additional statements.
When a BY statement is used with PROC REG, interactive processing is not possible; that is, once the first RUN statement is encountered, processing proceeds for each BY group in the data set, and no further statements are accepted by the procedure.
When using REG interactively, you can fit a model, perform diagnostics, then refit the model, and perform diagnostics on the refitted model. Most of the interactive statements implicitly refit the model; for example, if you use the ADD statement to add a variable to the model, the regression equation is automatically recomputed. The two exceptions to this automatic recomputing are the PAINT and REWEIGHT statements. These two statements do not cause the model to be refitted. To do so, you can follow these statements either with a REFIT statement, which causes the model to be explicitly recomputed, or with another interactive statement that causes the model to be implicitly recomputed.