Aerobic fitness (measured by the ability to consume oxygen) is modeled as a function of some simple exercise tests. The goal is to develop an equation to predict fitness based on the exercise tests rather than on expensive and cumbersome oxygen consumption measurements. Three model-selection methods are used: forward selection, backward selection, and MAXR selection. The following statements produce Output 61.1.1 through Output 61.1.5. (Collinearity diagnostics for the full model are shown in Figure 61.42 on page 3896.)
*-------------------Data on Physical Fitness-------------------*
 These measurements were made on men involved in a physical
 fitness course at N.C.State Univ. The variables are Age
 (years), Weight (kg), Oxygen intake rate (ml per kg body
 weight per minute), time to run 1.5 miles (minutes), heart
 rate while resting, heart rate while running (same time
 Oxygen rate measured), and maximum heart rate recorded while
 running.
 ***Certain values of MaxPulse were changed for this analysis.
*--------------------------------------------------------------*;
data fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
   datalines;
44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;

proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=forward;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=backward;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=maxr;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Forward Selection: Step 1
Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1        632.90010     632.90010     84.01   <.0001
Error              29        218.48144       7.53384
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              82.42177          3.85530   3443.36654    457.05   <.0001
RunTime                -3.31056          0.36119    632.90010     84.01   <.0001

Bounds on condition number: 1, 1
--------------------------------------------------------------------------------
Forward Selection: Step 2
Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2        650.66573     325.33287     45.38   <.0001
Error              28        200.71581       7.16842
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              88.46229          5.37264   1943.41071    271.11   <.0001
Age                    -0.15037          0.09551     17.76563      2.48   0.1267
RunTime                -3.20395          0.35877    571.67751     79.75   <.0001

Bounds on condition number: 1.0369, 4.1478
--------------------------------------------------------------------------------
The FORWARD model-selection method begins with no variables in the model and adds RunTime, then Age, then RunPulse, then MaxPulse,
Forward Selection: Step 3
Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3        690.55086     230.18362     38.64   <.0001
Error              27        160.83069       5.95669
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept             111.71806         10.23509    709.69014    119.14   <.0001
Age                    -0.25640          0.09623     42.28867      7.10   0.0129
RunTime                -2.82538          0.35828    370.43529     62.19   <.0001
RunPulse               -0.13091          0.05059     39.88512      6.70   0.0154

Bounds on condition number: 1.3548, 11.597
--------------------------------------------------------------------------------
Forward Selection: Step 4
Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               4        712.45153     178.11288     33.33   <.0001
Error              26        138.93002       5.34346
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              98.14789         11.78569    370.57373     69.35   <.0001
Age                    -0.19773          0.09564     22.84231      4.27   0.0488
RunTime                -2.76758          0.34054    352.93570     66.05   <.0001
RunPulse               -0.34811          0.11750     46.90089      8.78   0.0064
MaxPulse                0.27051          0.13362     21.90067      4.10   0.0533

Bounds on condition number: 8.4182, 76.851
--------------------------------------------------------------------------------
and finally, Weight. The final variable available to add to the model, RestPulse, is not added since it does not meet the 50% significance-level criterion for entry into the model (the default value of the SLE= option is 0.5 for FORWARD selection).
Forward Selection: Step 5
Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               5        721.97309     144.39462     27.90   <.0001
Error              25        129.40845       5.17634
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept             102.20428         11.97929    376.78935     72.79   <.0001
Age                    -0.21962          0.09550     27.37429      5.29   0.0301
Weight                 -0.07230          0.05331      9.52157      1.84   0.1871
RunTime                -2.68252          0.34099    320.35968     61.89   <.0001
RunPulse               -0.37340          0.11714     52.59624     10.16   0.0038
MaxPulse                0.30491          0.13394     26.82640      5.18   0.0316

Bounds on condition number: 8.7312, 104.83
--------------------------------------------------------------------------------
No other variable met the 0.5000 significance level for entry into the model.

Summary of Forward Selection
Step   Variable Entered   Number Vars In   Partial R-Square   Model R-Square      C(p)   F Value   Pr > F
1      RunTime                         1             0.7434           0.7434   13.6988     84.01   <.0001
2      Age                             2             0.0209           0.7642   12.3894      2.48   0.1267
3      RunPulse                        3             0.0468           0.8111    6.9596      6.70   0.0154
4      MaxPulse                        4             0.0257           0.8368    4.8800      4.10   0.0533
5      Weight                          5             0.0112           0.8480    5.1063      1.84   0.1871
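The entry threshold can be tightened with the SLENTRY= (SLE=) option. The following is a minimal sketch that is not part of the original example; the value 0.10 is chosen only for illustration. Because Age, the strongest candidate at step 2, enters with p = 0.1267, forward selection under this stricter threshold should stop after RunTime.

```sas
/* Illustrative only: forward selection with a stricter entry level.  */
/* SLENTRY=0.10 is an assumed value, not one used in the example.     */
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=forward slentry=0.10;
run;
quit;
```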
The BACKWARD model-selection method begins with the full model.
The REG Procedure
Model: MODEL2
Dependent Variable: Oxygen

Backward Elimination: Step 0
All Variables Entered: R-Square = 0.8487 and C(p) = 7.0000

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               6        722.54361     120.42393     22.43   <.0001
Error              24        128.83794       5.36825
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept             102.93448         12.40326    369.72831     68.87   <.0001
Age                    -0.22697          0.09984     27.74577      5.17   0.0322
Weight                 -0.07418          0.05459      9.91059      1.85   0.1869
RunTime                -2.62865          0.38456    250.82210     46.72   <.0001
RunPulse               -0.36963          0.11985     51.05806      9.51   0.0051
RestPulse              -0.02153          0.06605      0.57051      0.11   0.7473
MaxPulse                0.30322          0.13650     26.49142      4.93   0.0360

Bounds on condition number: 8.7438, 137.13
--------------------------------------------------------------------------------
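RestPulse is the first variable deleted, followed by Weight. No other variables are deleted, since every variable remaining in the model (Age, RunTime, RunPulse, and MaxPulse) is significant at the 0.10 level, which is the default value of the SLS= option for BACKWARD elimination.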
Backward Elimination: Step 1
Variable RestPulse Removed: R-Square = 0.8480 and C(p) = 5.1063

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               5        721.97309     144.39462     27.90   <.0001
Error              25        129.40845       5.17634
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept             102.20428         11.97929    376.78935     72.79   <.0001
Age                    -0.21962          0.09550     27.37429      5.29   0.0301
Weight                 -0.07230          0.05331      9.52157      1.84   0.1871
RunTime                -2.68252          0.34099    320.35968     61.89   <.0001
RunPulse               -0.37340          0.11714     52.59624     10.16   0.0038
MaxPulse                0.30491          0.13394     26.82640      5.18   0.0316

Bounds on condition number: 8.7312, 104.83
--------------------------------------------------------------------------------
Backward Elimination: Step 2
Variable Weight Removed: R-Square = 0.8368 and C(p) = 4.8800

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               4        712.45153     178.11288     33.33   <.0001
Error              26        138.93002       5.34346
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              98.14789         11.78569    370.57373     69.35   <.0001
Age                    -0.19773          0.09564     22.84231      4.27   0.0488
RunTime                -2.76758          0.34054    352.93570     66.05   <.0001
RunPulse               -0.34811          0.11750     46.90089      8.78   0.0064
MaxPulse                0.27051          0.13362     21.90067      4.10   0.0533

Bounds on condition number: 8.4182, 76.851
--------------------------------------------------------------------------------
All variables left in the model are significant at the 0.1000 level.

Summary of Backward Elimination
Step   Variable Removed   Number Vars In   Partial R-Square   Model R-Square      C(p)   F Value   Pr > F
1      RestPulse                       5             0.0007           0.8480    5.1063      0.11   0.7473
2      Weight                          4             0.0112           0.8368    4.8800      1.84   0.1871
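The stay threshold for backward elimination is controlled by the SLSTAY= (SLS=) option. The sketch below is not part of the original example and uses an assumed value of 0.05. Under that threshold MaxPulse (p = 0.0533 in the four-variable model) would no longer qualify to stay. The three-variable fit of Age, RunTime, and RunPulse reported in the forward-selection Step 3 output has all p-values below 0.05, so elimination should stop there.

```sas
/* Illustrative only: backward elimination with a stricter stay level. */
/* SLSTAY=0.05 is an assumed value, not one used in the example.       */
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=backward slstay=0.05;
run;
quit;
```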
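The MAXR method tries to find the best one-variable model, the best two-variable model, and so on. For the fitness data, the one-variable model contains RunTime; the two-variable model contains RunTime and Age; the three-variable model contains Age, RunTime, and RunPulse; the four-variable model contains Age, RunTime, RunPulse, and MaxPulse; the five-variable model contains Age, Weight, RunTime, RunPulse, and MaxPulse; and the six-variable model contains all the variables in the MODEL statement.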
The REG Procedure
Model: MODEL3
Dependent Variable: Oxygen

Maximum R-Square Improvement: Step 1
Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1        632.90010     632.90010     84.01   <.0001
Error              29        218.48144       7.53384
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              82.42177          3.85530   3443.36654    457.05   <.0001
RunTime                -3.31056          0.36119    632.90010     84.01   <.0001

Bounds on condition number: 1, 1
--------------------------------------------------------------------------------
The above model is the best 1-variable model found.

Maximum R-Square Improvement: Step 2
Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2        650.66573     325.33287     45.38   <.0001
Error              28        200.71581       7.16842
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              88.46229          5.37264   1943.41071    271.11   <.0001
Age                    -0.15037          0.09551     17.76563      2.48   0.1267
RunTime                -3.20395          0.35877    571.67751     79.75   <.0001

Bounds on condition number: 1.0369, 4.1478
--------------------------------------------------------------------------------
The above model is the best 2-variable model found.
Maximum R-Square Improvement: Step 3
Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3        690.55086     230.18362     38.64   <.0001
Error              27        160.83069       5.95669
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept             111.71806         10.23509    709.69014    119.14   <.0001
Age                    -0.25640          0.09623     42.28867      7.10   0.0129
RunTime                -2.82538          0.35828    370.43529     62.19   <.0001
RunPulse               -0.13091          0.05059     39.88512      6.70   0.0154

Bounds on condition number: 1.3548, 11.597
--------------------------------------------------------------------------------
The above model is the best 3-variable model found.

Maximum R-Square Improvement: Step 4
Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               4        712.45153     178.11288     33.33   <.0001
Error              26        138.93002       5.34346
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              98.14789         11.78569    370.57373     69.35   <.0001
Age                    -0.19773          0.09564     22.84231      4.27   0.0488
RunTime                -2.76758          0.34054    352.93570     66.05   <.0001
RunPulse               -0.34811          0.11750     46.90089      8.78   0.0064
MaxPulse                0.27051          0.13362     21.90067      4.10   0.0533

Bounds on condition number: 8.4182, 76.851
--------------------------------------------------------------------------------
The above model is the best 4-variable model found.
Maximum R-Square Improvement: Step 5
Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               5        721.97309     144.39462     27.90   <.0001
Error              25        129.40845       5.17634
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept             102.20428         11.97929    376.78935     72.79   <.0001
Age                    -0.21962          0.09550     27.37429      5.29   0.0301
Weight                 -0.07230          0.05331      9.52157      1.84   0.1871
RunTime                -2.68252          0.34099    320.35968     61.89   <.0001
RunPulse               -0.37340          0.11714     52.59624     10.16   0.0038
MaxPulse                0.30491          0.13394     26.82640      5.18   0.0316

Bounds on condition number: 8.7312, 104.83
--------------------------------------------------------------------------------
The above model is the best 5-variable model found.

Maximum R-Square Improvement: Step 6
Variable RestPulse Entered: R-Square = 0.8487 and C(p) = 7.0000

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               6        722.54361     120.42393     22.43   <.0001
Error              24        128.83794       5.36825
Corrected Total    30        851.38154

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept             102.93448         12.40326    369.72831     68.87   <.0001
Age                    -0.22697          0.09984     27.74577      5.17   0.0322
Weight                 -0.07418          0.05459      9.91059      1.85   0.1869
RunTime                -2.62865          0.38456    250.82210     46.72   <.0001
RunPulse               -0.36963          0.11985     51.05806      9.51   0.0051
RestPulse              -0.02153          0.06605      0.57051      0.11   0.7473
MaxPulse                0.30322          0.13650     26.49142      4.93   0.0360

Bounds on condition number: 8.7438, 137.13
--------------------------------------------------------------------------------
The above model is the best 6-variable model found.
No further improvement in R-Square is possible.
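The C(p) values reported at each step can be checked by hand. The worked sketch below assumes the usual definition of Mallows' statistic, in which s² is the error mean square of the full six-variable model (5.36825), p counts the parameters including the intercept, and n = 31 is the number of observations.

```latex
\begin{align*}
C_p &= \frac{SSE_p}{s^2} - (n - 2p) \\[4pt]
\text{RunTime only } (p = 2):\quad
C_p &= \frac{218.48144}{5.36825} - (31 - 4) \approx 40.6988 - 27 = 13.6988 \\[4pt]
\text{Full model } (p = 7):\quad
C_p &= \frac{128.83794}{5.36825} - (31 - 14) = 24 - 17 = 7
\end{align*}
```

For the full model the statistic always equals p, which is why the six-variable fit shows C(p) = 7.0000.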
Note that for all three of these methods, RestPulse contributes least to the model. In the case of forward selection, it is not added to the model. In the case of backward selection, it is the first variable to be removed from the model. In the case of MAXR selection, RestPulse is included only for the full model.
For the STEPWISE, BACKWARD, and FORWARD selection methods, you can control the amount of detail displayed by using the DETAILS= option. For example, the following statements display only the selection summary table for the FORWARD selection method.
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=forward details=summary;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Summary of Forward Selection
Step   Variable Entered   Number Vars In   Partial R-Square   Model R-Square      C(p)   F Value   Pr > F
1      RunTime                         1             0.7434           0.7434   13.6988     84.01   <.0001
2      Age                             2             0.0209           0.7642   12.3894      2.48   0.1267
3      RunPulse                        3             0.0468           0.8111    6.9596      6.70   0.0154
4      MaxPulse                        4             0.0257           0.8368    4.8800      4.10   0.0533
5      Weight                          5             0.0112           0.8480    5.1063      1.84   0.1871
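The Partial R-Square and F Value columns of the summary table can be reproduced from the sums of squares in the step-by-step output. As a worked sketch for step 2 (Age entering after RunTime), using the model sums of squares 632.90010 and 650.66573, the corrected total 851.38154, and the step 2 error mean square 7.16842:

```latex
\begin{align*}
\text{Partial } R^2 &= \frac{SS_{\text{model},2} - SS_{\text{model},1}}{SS_{\text{total}}}
  = \frac{650.66573 - 632.90010}{851.38154}
  = \frac{17.76563}{851.38154} \approx 0.0209 \\[4pt]
F &= \frac{17.76563 / 1}{7.16842} \approx 2.48
\end{align*}
```

The numerator 17.76563 is the Type II sum of squares for Age in the two-variable model.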
Next, the RSQUARE model-selection method is used to request R-square and C(p) statistics for all possible combinations of the six independent variables. The following statements produce Output 61.1.5:
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=rsquare cp;
   title 'Physical fitness data: all models';
run;
Physical fitness data: all models

The REG Procedure
Model: MODEL2
Dependent Variable: Oxygen

R-Square Selection Method

Number in Model   R-Square       C(p)   Variables in Model
1                   0.7434    13.6988   RunTime
1                   0.1595   106.3021   RestPulse
1                   0.1584   106.4769   RunPulse
1                   0.0928   116.8818   Age
1                   0.0560   122.7072   MaxPulse
1                   0.0265   127.3948   Weight
------------------------------------------------------------------------------
2                   0.7642    12.3894   Age RunTime
2                   0.7614    12.8372   RunTime RunPulse
2                   0.7452    15.4069   RunTime MaxPulse
2                   0.7449    15.4523   Weight RunTime
2                   0.7435    15.6746   RunTime RestPulse
2                   0.3760    73.9645   Age RunPulse
2                   0.3003    85.9742   Age RestPulse
2                   0.2894    87.6951   RunPulse MaxPulse
2                   0.2600    92.3638   Age MaxPulse
2                   0.2350    96.3209   RunPulse RestPulse
2                   0.1806   104.9523   Weight RestPulse
2                   0.1740   105.9939   RestPulse MaxPulse
2                   0.1669   107.1332   Weight RunPulse
2                   0.1506   109.7057   Age Weight
2                   0.0675   122.8881   Weight MaxPulse
------------------------------------------------------------------------------
3                   0.8111     6.9596   Age RunTime RunPulse
3                   0.8100     7.1350   RunTime RunPulse MaxPulse
3                   0.7817    11.6167   Age RunTime MaxPulse
3                   0.7708    13.3453   Age Weight RunTime
3                   0.7673    13.8974   Age RunTime RestPulse
3                   0.7619    14.7619   RunTime RunPulse RestPulse
3                   0.7618    14.7729   Weight RunTime RunPulse
3                   0.7462    17.2588   Weight RunTime MaxPulse
3                   0.7452    17.4060   RunTime RestPulse MaxPulse
3                   0.7451    17.4243   Weight RunTime RestPulse
3                   0.4666    61.5873   Age RunPulse RestPulse
3                   0.4223    68.6250   Age RunPulse MaxPulse
3                   0.4091    70.7102   Age Weight RunPulse
3                   0.3900    73.7424   Age RestPulse MaxPulse
3                   0.3568    79.0013   Age Weight RestPulse
3                   0.3538    79.4891   RunPulse RestPulse MaxPulse
3                   0.3208    84.7216   Weight RunPulse MaxPulse
3                   0.2902    89.5693   Age Weight MaxPulse
3                   0.2447    96.7952   Weight RunPulse RestPulse
3                   0.1882   105.7430   Weight RestPulse MaxPulse
------------------------------------------------------------------------------
4                   0.8368     4.8800   Age RunTime RunPulse MaxPulse
4                   0.8165     8.1035   Age Weight RunTime RunPulse
4                   0.8158     8.2056   Weight RunTime RunPulse MaxPulse
4                   0.8117     8.8683   Age RunTime RunPulse RestPulse
4                   0.8104     9.0697   RunTime RunPulse RestPulse MaxPulse
4                   0.7862    12.9039   Age Weight RunTime MaxPulse
4                   0.7834    13.3468   Age RunTime RestPulse MaxPulse
4                   0.7750    14.6788   Age Weight RunTime RestPulse
4                   0.7623    16.7058   Weight RunTime RunPulse RestPulse
4                   0.7462    19.2550   Weight RunTime RestPulse MaxPulse
4                   0.5034    57.7590   Age Weight RunPulse RestPulse
4                   0.5025    57.9092   Age RunPulse RestPulse MaxPulse
4                   0.4717    62.7830   Age Weight RunPulse MaxPulse
4                   0.4256    70.0963   Age Weight RestPulse MaxPulse
4                   0.3858    76.4100   Weight RunPulse RestPulse MaxPulse
------------------------------------------------------------------------------
5                   0.8480     5.1063   Age Weight RunTime RunPulse MaxPulse
5                   0.8370     6.8461   Age RunTime RunPulse RestPulse MaxPulse
5                   0.8176     9.9348   Age Weight RunTime RunPulse RestPulse
5                   0.8161    10.1685   Weight RunTime RunPulse RestPulse MaxPulse
5                   0.7887    14.5111   Age Weight RunTime RestPulse MaxPulse
5                   0.5541    51.7233   Age Weight RunPulse RestPulse MaxPulse
------------------------------------------------------------------------------
6                   0.8487     7.0000   Age Weight RunTime RunPulse RestPulse MaxPulse
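With six regressors the all-subsets listing already runs to 63 models, so it is often convenient to limit the display. The following sketch is not part of the original example: it uses the BEST= option to keep only the top three models of each size and adds the ADJRSQ option so that adjusted R-square is printed alongside R-square and C(p). The choice of three is arbitrary.

```sas
/* Illustrative only: show the three best subsets of each size,       */
/* with adjusted R-square in addition to R-square and C(p).           */
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=rsquare cp adjrsq best=3;
   title 'Physical fitness data: best three models of each size';
run;
quit;
```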
In this example, the weights of school children are modeled as a function of their heights and ages. Modeling is performed separately for boys and girls. The example shows the use of a BY statement with PROC REG, multiple MODEL statements, and the OUTEST= and OUTSSCP= options, which create data sets. Since the BY statement is used, interactive processing is not possible in this example; no statements can appear after the first RUN statement. The following statements produce Output 61.2.1 through Output 61.2.4:
*------------Data on Age, Weight, and Height of Children-------*
 Age (months), height (inches), and weight (pounds) were
 recorded for a group of school children.
 From Lewis and Taylor (1967).
*--------------------------------------------------------------*;
data htwt;
   input sex $ age :3.1 height weight @@;
   datalines;
f 143 56.3  85.0 f 155 62.3 105.0 f 153 63.3 108.0 f 161 59.0  92.0
f 191 62.5 112.5 f 171 62.5 112.0 f 185 59.0 104.0 f 142 56.5  69.0
f 160 62.0  94.5 f 140 53.8  68.5 f 139 61.5 104.0 f 178 61.5 103.5
f 157 64.5 123.5 f 149 58.3  93.0 f 143 51.3  50.5 f 145 58.8  89.0
f 191 65.3 107.0 f 150 59.5  78.5 f 147 61.3 115.0 f 180 63.3 114.0
f 141 61.8  85.0 f 140 53.5  81.0 f 164 58.0  83.5 f 176 61.3 112.0
f 185 63.3 101.0 f 166 61.5 103.5 f 175 60.8  93.5 f 180 59.0 112.0
f 210 65.5 140.0 f 146 56.3  83.5 f 170 64.3  90.0 f 162 58.0  84.0
f 149 64.3 110.5 f 139 57.5  96.0 f 186 57.8  95.0 f 197 61.5 121.0
f 169 62.3  99.5 f 177 61.8 142.5 f 185 65.3 118.0 f 182 58.3 104.5
f 173 62.8 102.5 f 166 59.3  89.5 f 168 61.5  95.0 f 169 62.0  98.5
f 150 61.3  94.0 f 184 62.3 108.0 f 139 52.8  63.5 f 147 59.8  84.5
f 144 59.5  93.5 f 177 61.3 112.0 f 178 63.5 148.5 f 197 64.8 112.0
f 146 60.0 109.0 f 145 59.0  91.5 f 147 55.8  75.0 f 145 57.8  84.0
f 155 61.3 107.0 f 167 62.3  92.5 f 183 64.3 109.5 f 143 55.5  84.0
f 183 64.5 102.5 f 185 60.0 106.0 f 148 56.3  77.0 f 147 58.3 111.5
f 154 60.0 114.0 f 156 54.5  75.0 f 144 55.8  73.5 f 154 62.8  93.5
f 152 60.5 105.0 f 191 63.3 113.5 f 190 66.8 140.0 f 140 60.0  77.0
f 148 60.5  84.5 f 189 64.3 113.5 f 143 58.3  77.5 f 178 66.5 117.5
f 164 65.3  98.0 f 157 60.5 112.0 f 147 59.5 101.0 f 148 59.0  95.0
f 177 61.3  81.0 f 171 61.5  91.0 f 172 64.8 142.0 f 190 56.8  98.5
f 183 66.5 112.0 f 143 61.5 116.5 f 179 63.0  98.5 f 186 57.0  83.5
f 182 65.5 133.0 f 182 62.0  91.5 f 142 56.0  72.5 f 165 61.3 106.5
f 165 55.5  67.0 f 154 61.0 122.5 f 150 54.5  74.0 f 155 66.0 144.5
f 163 56.5  84.0 f 141 56.0  72.5 f 147 51.5  64.0 f 210 62.0 116.0
f 171 63.0  84.0 f 167 61.0  93.5 f 182 64.0 111.5 f 144 61.0  92.0
f 193 59.8 115.0 f 141 61.3  85.0 f 164 63.3 108.0 f 186 63.5 108.0
f 169 61.5  85.0 f 175 60.3  86.0 f 180 61.3 110.5
m 165 64.8  98.0 m 157 60.5 105.0 m 144 57.3  76.5 m 150 59.5  84.0
m 150 60.8 128.0 m 139 60.5  87.0 m 189 67.0 128.0 m 183 64.8 111.0
m 147 50.5  79.0 m 146 57.5  90.0 m 160 60.5  84.0 m 156 61.8 112.0
m 173 61.3  93.0 m 151 66.3 117.0 m 141 53.3  84.0 m 150 59.0  99.5
m 164 57.8  95.0 m 153 60.0  84.0 m 206 68.3 134.0 m 250 67.5 171.5
m 176 63.8  98.5 m 176 65.0 118.5 m 140 59.5  94.5 m 185 66.0 105.0
m 180 61.8 104.0 m 146 57.3  83.0 m 183 66.0 105.5 m 140 56.5  84.0
m 151 58.3  86.0 m 151 61.0  81.0 m 144 62.8  94.0 m 160 59.3  78.5
m 178 67.3 119.5 m 193 66.3 133.0 m 162 64.5 119.0 m 164 60.5  95.0
m 186 66.0 112.0 m 143 57.5  75.0 m 175 64.0  92.0 m 175 68.0 112.0
m 175 63.5  98.5 m 173 69.0 112.5 m 170 63.8 112.5 m 174 66.0 108.0
m 164 63.5 108.0 m 144 59.5  88.0 m 156 66.3 106.0 m 149 57.0  92.0
m 144 60.0 117.5 m 147 57.0  84.0 m 188 67.3 112.0 m 169 62.0 100.0
m 172 65.0 112.0 m 150 59.5  84.0 m 193 67.8 127.5 m 157 58.0  80.5
m 168 60.0  93.5 m 140 58.5  86.5 m 156 58.3  92.5 m 156 61.5 108.5
m 158 65.0 121.0 m 184 66.5 112.0 m 156 68.5 114.0 m 144 57.0  84.0
m 176 61.5  81.0 m 168 66.5 111.5 m 149 52.5  81.0 m 142 55.0  70.0
m 188 71.0 140.0 m 203 66.5 117.0 m 142 58.8  84.0 m 189 66.3 112.0
m 188 65.8 150.5 m 200 71.0 147.0 m 152 59.5 105.0 m 174 69.8 119.5
m 166 62.5  84.0 m 145 56.5  91.0 m 143 57.5 101.0 m 163 65.3 117.5
m 166 67.3 121.0 m 182 67.0 133.0 m 173 66.0 112.0 m 155 61.8  91.5
m 162 60.0 105.0 m 177 63.0 111.0 m 177 60.5 112.0 m 175 65.5 114.0
m 166 62.0  91.0 m 150 59.0  98.0 m 150 61.8 118.0 m 188 63.3 115.5
m 163 66.0 112.0 m 171 61.8 112.0 m 162 63.0  91.0 m 141 57.5  85.0
m 174 63.0 112.0 m 142 56.0  87.5 m 148 60.5 118.0 m 140 56.8  83.5
m 160 64.0 116.0 m 144 60.0  89.0 m 206 69.5 171.5 m 159 63.3 112.0
m 149 56.3  72.0 m 193 72.0 150.0 m 194 65.3 134.5 m 152 60.8  97.0
m 146 55.0  71.5 m 139 55.0  73.5 m 186 66.5 112.0 m 161 56.8  75.0
m 153 64.8 128.0 m 196 64.5  98.0 m 164 58.0  84.0 m 159 62.8  99.0
m 178 63.8 112.0 m 153 57.8  79.5 m 155 57.3  80.5 m 178 63.5 102.5
m 142 55.0  76.0 m 164 66.5 112.0 m 189 65.0 114.0 m 164 61.5 140.0
m 167 62.0 107.5 m 151 59.3  87.0
;

title '----- Data on age, weight, and height of children ------';
proc reg outest=est1 outsscp=sscp1 rsquare;
   by sex;
   eq1: model weight=height;
   eq2: model weight=height age;
proc print data=sscp1;
   title2 'SSCP type data set';
proc print data=est1;
   title2 'EST type data set';
run;
----- Data on age, weight, and height of children ------
------------------------------------ sex=f -------------------------------------
The REG Procedure
Model: eq1
Dependent Variable: weight

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1            21507         21507    141.09   <.0001
Error             109            16615     152.42739
Corrected Total   110            38121

Root MSE           12.34615   R-Square   0.5642
Dependent Mean     98.87838   Adj R-Sq   0.5602
Coeff Var          12.48620

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1           -153.12891         21.24814     -7.21     <.0001
height        1              4.16361          0.35052     11.88     <.0001

----- Data on age, weight, and height of children ------
------------------------------------ sex=f -------------------------------------
The REG Procedure
Model: eq2
Dependent Variable: weight

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2            22432         11216     77.21   <.0001
Error             108            15689     145.26700
Corrected Total   110            38121

Root MSE           12.05268   R-Square   0.5884
Dependent Mean     98.87838   Adj R-Sq   0.5808
Coeff Var          12.18939

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1           -150.59698         20.76730     -7.25     <.0001
height        1              3.60378          0.40777      8.84     <.0001
age           1              1.90703          0.75543      2.52     0.0130
----- Data on age, weight, and height of children ------
------------------------------------ sex=m -------------------------------------
The REG Procedure
Model: eq1
Dependent Variable: weight

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1            31126         31126    206.24   <.0001
Error             124            18714     150.92222
Corrected Total   125            49840

Root MSE           12.28504   R-Square   0.6245
Dependent Mean    103.44841   Adj R-Sq   0.6215
Coeff Var          11.87552

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1           -125.69807         15.99362     -7.86     <.0001
height        1              3.68977          0.25693     14.36     <.0001

----- Data on age, weight, and height of children ------
------------------------------------ sex=m -------------------------------------
The REG Procedure
Model: eq2
Dependent Variable: weight

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2            32975         16487    120.24   <.0001
Error             123            16866     137.11922
Corrected Total   125            49840

Root MSE           11.70979   R-Square   0.6616
Dependent Mean    103.44841   Adj R-Sq   0.6561
Coeff Var          11.31945

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1           -113.71346         15.59021     -7.29     <.0001
height        1              2.68075          0.36809      7.28     <.0001
age           1              3.08167          0.83927      3.67     0.0004
----- Data on age, weight, and height of children ------
SSCP type data set

Obs   sex   _TYPE_   _NAME_      Intercept      height       weight         age
  1   f     SSCP     Intercept       111.0     6718.40     10975.50     1824.90
  2   f     SSCP     height         6718.4   407879.32    669469.85   110818.32
  3   f     SSCP     weight        10975.5   669469.85   1123360.75   182444.95
  4   f     SSCP     age            1824.9   110818.32    182444.95    30363.81
  5   f     N                        111.0      111.00       111.00      111.00
  6   m     SSCP     Intercept       126.0     7825.00     13034.50     2072.10
  7   m     SSCP     height         7825.0   488243.60    817919.60   129432.57
  8   m     SSCP     weight        13034.5   817919.60   1398238.75   217717.45
  9   m     SSCP     age            2072.1   129432.57    217717.45    34515.95
 10   m     N                        126.0      126.00       126.00      126.00
----- Data on age, weight, and height of children ------
EST type data set

Obs   sex   _MODEL_   _TYPE_   _DEPVAR_    _RMSE_   Intercept    height   weight       age   _IN_   _P_   _EDF_     _RSQ_
  1   f     eq1       PARMS    weight     12.3461    -153.129   4.16361       -1    .           1     2     109   0.56416
  2   f     eq2       PARMS    weight     12.0527    -150.597   3.60378       -1   1.90703      2     3     108   0.58845
  3   m     eq1       PARMS    weight     12.2850    -125.698   3.68977       -1    .           1     2     124   0.62451
  4   m     eq2       PARMS    weight     11.7098    -113.713   2.68075       -1   3.08167      2     3     123   0.66161
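For both females and males, the overall F statistics for both models are significant, indicating that each model explains a significant portion of the variation in the data. For females, the full model is

$$\widehat{\text{weight}} = -150.60 + 3.60 \times \text{height} + 1.91 \times \text{age}$$

and, for males, the full model is

$$\widehat{\text{weight}} = -113.71 + 2.68 \times \text{height} + 3.08 \times \text{age}$$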
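As an illustration of how the two fitted equations would be used, the following DATA step sketch evaluates them for one hypothetical girl and one hypothetical boy. The data set name, the variable weight_hat, and the input values are assumptions made for this illustration. Age is entered on the scale that the :3.1 informat produces in the htwt data set, so a raw value of 150 corresponds to 15.0 here.

```sas
/* Illustrative only: evaluate the fitted eq2 models by hand.         */
/* The two input records are hypothetical, not taken from htwt.       */
data predict;
   input sex $ height age;
   if sex = 'f' then
      weight_hat = -150.59698 + 3.60378*height + 1.90703*age;
   else if sex = 'm' then
      weight_hat = -113.71346 + 2.68075*height + 3.08167*age;
   datalines;
f 60 15.0
m 60 15.0
;

proc print data=predict;
run;
```

With these inputs the predicted weights work out to roughly 94.2 pounds for the girl and 93.4 pounds for the boy.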
At times it is desirable to have independent variables in the model that are qualitative rather than quantitative. This is easily handled in a regression framework. Regression uses qualitative variables to distinguish between populations. There are two main advantages of fitting both populations in one model. You gain the ability to test for different slopes or intercepts in the populations, and more degrees of freedom are available for the analysis.
Regression with qualitative variables is different from analysis of variance and analysis of covariance. Analysis of variance uses qualitative independent variables only. Analysis of covariance uses quantitative variables in addition to the qualitative variables in order to account for correlation in the data and reduce MSE; however, the quantitative variables are not of primary interest and merely improve the precision of the analysis.
Consider the case where $Y_i$ is the dependent variable, $X1_i$ is a quantitative variable, $X2_i$ is a qualitative variable taking on values 0 or 1, and $X1_i X2_i$ is the interaction. The variable $X2_i$ is called a dummy, binary, or indicator variable. With values 0 or 1, it distinguishes between two populations. The model is of the form

$$Y_i = \beta_0 + \beta_1 X1_i + \beta_2 X2_i + \beta_3 X1_i X2_i + \epsilon_i$$

for the observations $i = 1, 2, \ldots, n$. The parameters to be estimated are $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$. The number of dummy variables used is one less than the number of qualitative levels. This yields a nonsingular $X'X$ matrix. See Chapter 10 of Neter, Wasserman, and Kutner (1990) for more details.
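The "one less than the number of levels" rule can be sketched for a qualitative variable with three levels. Everything in the following DATA step is hypothetical (the data set firms3, the level names, and the three records); it only shows how two indicator variables are enough, with the third level serving as the reference for which both indicators are 0.

```sas
/* Illustrative only: two dummy variables for a three-level factor.   */
/* All names and values here are made up for the sketch.              */
data firms3;
   length firmtype $ 8;
   input firmtype $ size time;
   is_stock  = (firmtype = 'stock');    /* 1 for stock firms, else 0  */
   is_mutual = (firmtype = 'mutual');   /* 1 for mutual firms, else 0 */
   /* 'other' is the reference level: both indicators are 0 */
   datalines;
stock  150 20
mutual 100 25
other  200 15
;

proc print data=firms3;
run;
```

Adding a third indicator for the reference level would make the columns sum to the intercept column, which is the singularity the rule avoids.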
An example from Neter, Wasserman, and Kutner (1990) follows. An economist is investigating the relationship between the size of an insurance firm and the speed at which it implements new insurance innovations. He believes that the type of firm may affect this relationship and suspects that there may be some interaction between the size and type of firm. The dummy variable in the model allows the two firm types to have different intercepts. The interaction term allows the firm types to have different slopes as well.
In this study, $Y_i$ is the number of months from the time the first firm implemented the innovation to the time it was implemented by the $i$th firm. The variable $X1_i$ is the size of the firm, measured in total assets of the firm. The variable $X2_i$ denotes the firm type and is 0 if the firm is a mutual fund company and 1 if the firm is a stock company. Together, the dummy variable and the interaction term allow each firm type to have its own intercept and slope.
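The previous model can be broken down into a model for each firm type by plugging in the values for $X2_i$. If $X2_i = 0$, the model is

$$Y_i = \beta_0 + \beta_1 X1_i + \epsilon_i$$

This is the model for a mutual company. If $X2_i = 1$, the model for a stock firm is

$$Y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X1_i + \epsilon_i$$

This model has intercept $\beta_0 + \beta_2$ and slope $\beta_1 + \beta_3$.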
The data follow. Note that the interaction term is created in the DATA step since crossed effects such as size*type are not allowed in the MODEL statement in the REG procedure.
title 'Regression With Quantitative and Qualitative Variables';
data insurance;
   input time size type @@;
   sizetype=size*type;
   datalines;
17 151 0   26  92 0   21 175 0   30  31 0   22 104 0
 0 277 0   12 210 0   19 120 0    4 290 0   16 238 0
28 164 1   15 272 1   11 295 1   38  68 1   31  85 1
21 224 1   20 166 1   13 305 1   30 124 1   14 246 1
;
run;
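For comparison, a procedure that accepts crossed effects directly can fit the same model without the precomputed sizetype variable. The following PROC GLM sketch is not part of the original example. Because type is declared in a CLASS statement, PROC GLM chooses its own reference level (by default the last level, type=1), so the individual parameter estimates from the SOLUTION option are parameterized differently from the 0/1 dummy coding used with PROC REG, although the overall fit should be the same.

```sas
/* Illustrative only: the interaction is specified in the MODEL       */
/* statement instead of being built in a DATA step.                   */
proc glm data=insurance;
   class type;
   model time = size type size*type / solution;
run;
quit;
```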
The following statements begin the analysis:
proc reg data=insurance;
   model time = size type sizetype;
run;
The ANOVA table is displayed in Output 61.3.1.
Regression With Quantitative and Qualitative Variables

The REG Procedure
Model: MODEL1
Dependent Variable: time

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3       1504.41904     501.47301     45.49   <.0001
Error              16        176.38096      11.02381
Corrected Total    19       1680.80000

Root MSE            3.32021   R-Square   0.8951
Dependent Mean     19.40000   Adj R-Sq   0.8754
Coeff Var          17.11450

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1             33.83837          2.44065     13.86     <.0001
size          1             -0.10153          0.01305     -7.78     <.0001
type          1              8.13125          3.65405      2.23     0.0408
sizetype      1          -0.00041714          0.01833     -0.02     0.9821
The overall F statistic is significant (F = 45.490, p < 0.0001). The interaction term is not significant (t = -0.023, p = 0.9821). Hence, this term should be removed and the model refit, as shown in the following statements.
   delete sizetype;
   print;
run;
The DELETE statement removes the interaction term ( sizetype ) from the model. The new ANOVA table is shown in Output 61.3.2.
Regression With Quantitative and Qualitative Variables

The REG Procedure
Model: MODEL1.1
Dependent Variable: time

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2       1504.41333     752.20667     72.50   <.0001
Error              17        176.38667      10.37569
Corrected Total    19       1680.80000

Root MSE            3.22113   R-Square   0.8951
Dependent Mean     19.40000   Adj R-Sq   0.8827
Coeff Var          16.60377

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1             33.87407          1.81386     18.68     <.0001
size          1             -0.10174          0.00889    -11.44     <.0001
type          1              8.05547          1.45911      5.52     <.0001
The overall F statistic is still significant (F = 72.497, p < 0.0001). The intercept and the coefficients associated with size and type are significantly different from zero (t = 18.675, p < 0.0001; t = -11.443, p < 0.0001; t = 5.521, p < 0.0001, respectively). Notice that the R² did not change with the omission of the interaction term.
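As a worked check, the two ANOVA tables can be compared directly. The following arithmetic uses only the error sums of squares, the full-model mean square error, and the corrected total reported in Output 61.3.1 and Output 61.3.2; the subscripts "full" and "red" are labels introduced here for the models with and without the interaction term.

```latex
\begin{aligned}
F &= \frac{(SSE_{\text{red}} - SSE_{\text{full}})/1}{MSE_{\text{full}}}
   = \frac{176.38667 - 176.38096}{11.02381}
   \approx 0.0005 \approx t^2 = (-0.023)^2 \\
R^2_{\text{full}} &= 1 - \frac{176.38096}{1680.80000} \approx 0.89506, \qquad
R^2_{\text{red}} = 1 - \frac{176.38667}{1680.80000} \approx 0.89506
\end{aligned}
```

The extra sum of squares explained by the interaction is about 0.006, which is why R² is unchanged to four decimal places while the adjusted R² rises slightly.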
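The fitted model is

$$\widehat{\text{time}} = 33.87407 - 0.10174\,(\text{size}) + 8.05547\,(\text{type})$$

The fitted model for a mutual fund company ($X_{2i} = 0$) is

$$\widehat{\text{time}} = 33.87407 - 0.10174\,(\text{size})$$

and the fitted model for a stock company ($X_{2i} = 1$) is

$$\widehat{\text{time}} = (33.87407 + 8.05547) - 0.10174\,(\text{size}) = 41.92954 - 0.10174\,(\text{size})$$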
So the two models have different intercepts but the same slope.
Now plot the residual versus predicted values using the firm type as the plot symbol (PLOT=TYPE); this can be useful in determining if the firm types have different residual patterns. PROC REG does not support the plot y*x=type syntax for high-resolution graphics, so use PROC GPLOT to create Output 61.3.3. First, the OUTPUT statement saves the residuals and predicted values from the new model in the OUT= data set.
   output out=out r=r p=p;
run;

symbol1 v='0' c=blue f=swissb;
symbol2 v='1' c=yellow f=swissb;
axis1 label=(angle=90);

proc gplot data=out;
   plot r*p=type / nolegend vaxis=axis1 cframe=ligr;
   plot p*size=type / nolegend vaxis=axis1 cframe=ligr;
run;
The residuals show no major trend. Neither firm type by itself shows a trend either. This indicates that the model is satisfactory.
A plot of the predicted values versus size appears in Output 61.3.4, where the firm type is again used as the plotting symbol.
The different intercepts are very evident in this plot.
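For readers on a release that includes the statistical graphics procedures, the same two displays can be approximated with PROC SGPLOT. This is a minimal sketch and not part of the original example: it assumes SAS 9.2 or later and the OUT data set (with variables r, p, size, and type) created by the OUTPUT statement above. The axis labels are illustrative.

```sas
/* Sketch only: assumes data set OUT from the OUTPUT statement above */
proc sgplot data=out;
   scatter x=p y=r / group=type;      /* residual versus predicted, by firm type */
   refline 0 / axis=y;                /* horizontal reference line at zero */
   xaxis label='Predicted value';
   yaxis label='Residual';
run;

proc sgplot data=out;
   scatter x=size y=p / group=type;   /* predicted value versus size, by firm type */
run;
```

The GROUP= option plays the role of the =type suffix in the PROC GPLOT request, assigning a distinct marker to each firm type.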
This example introduces the basic PROC REG graphics syntax used to produce a standard plot of data from the aerobic fitness data set (Example 61.1 on page 3924). A simple linear regression of Oxygen on RunTime is performed, and a plot of Oxygen*RunTime is requested. The fitted model, the regression line, and the four default statistics are also displayed in Output 61.4.1.
data fitness;
   set fitness;
   label Age      ='age(years)'
         Weight   ='weight(kg)'
         Oxygen   ='oxygen uptake(ml/kg/min)'
         RunTime  ='1.5 mile time(min)'
         RestPulse='rest pulse'
         RunPulse ='running pulse'
         MaxPulse ='maximum running pulse';
proc reg data=fitness;
   model Oxygen=RunTime;
   plot Oxygen*RunTime / cframe=ligr;
run;
The Cp statistics for model selection are plotted against the number of parameters in the model, and the CHOCKING= and CMALLOWS= options draw useful reference lines. Note the four default statistics in the plot margin, the default model equation, and the default legend in Output 61.5.1.
title 'Cp Plot with Reference Lines';
symbol1 c=green;
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=rsquare noprint;
   plot cp.*np.
        / chocking=red cmallows=blue
          vaxis=0 to 15 by 5 cframe=ligr;
run;
Using the criteria suggested by Hocking (1976) (see the section Dictionary of PLOT Statement Options beginning on page 3844), Output 61.5.1 indicates that a 6-variable model is a reasonable choice for doing parameter estimation, while a 5-variable model may be suitable for doing prediction.
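The criteria themselves are not restated in this example. As a reminder, and only as a sketch of how they are usually written, Mallows' statistic and Hocking's two rules take the following form. The notation is an assumption introduced here: $SSE_p$ is the error sum of squares of a subset model with $p$ parameters (including the intercept), $s^2$ is the mean square error of the full model, $n$ is the number of observations, and $p_{\text{full}}$ is the size of the full model. The exact convention for $p_{\text{full}}$ should be checked against the section cited above.

```latex
\begin{aligned}
C_p &= \frac{SSE_p}{s^2} - (n - 2p) \\
C_p &\le p \qquad\qquad\qquad\;\; \text{(Hocking, for prediction)} \\
C_p &\le 2p - p_{\text{full}} + 1 \quad \text{(Hocking, for parameter estimation)}
\end{aligned}
```

On the plot, a candidate model satisfies a criterion when its point falls on or below the corresponding reference line.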
This example uses model fit summary statistics from the OUTEST= data set to create a plot for a model selection analysis. Global graphics statements and PLOT statement options are used to control the appearance of the plot.
goptions ctitle=black htitle=3.5pct ftitle=swiss
         ctext =magenta htext =3.0pct ftext =swiss
         cback =ligr border;
symbol1 v=circle c=red h=1 w=2;
title1 'Selection=Rsquare';
title2 'plot Rsquare versus the number of parameters P in each model';
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=rsquare noprint;
   plot rsq.*np.
        / aic bic edf gmsep jp np pc sbc sp
          haxis=2 to 7 by 1
          caxis=red cframe=white ctext=blue
          modellab='Full Model' modelht=2.4
          statht=2.4;
run;
In the GOPTIONS statement,
BORDER | frames the entire display |
CBACK= | specifies the background color |
CTEXT= | selects the default color for the border and all text, including titles, footnotes, and notes |
CTITLE= | specifies the title, footnote, note, and border color |
HTEXT= | specifies the height for all text in the display |
HTITLE= | specifies the height for the first title line |
FTEXT= | selects the default font for all text, including titles, footnotes, notes, the model label and equation, the statistics, the axis labels, the tick values, and the legend |
FTITLE= | specifies the first title font |
For more information on the GOPTIONS statement and other global graphics statements, refer to SAS/GRAPH Software: Reference .
In Output 61.6.1, note the following:
The PLOT statement option CTEXT= affects all text not controlled by the CTITLE= option in the GOPTIONS statement. Hence, the GOPTIONS statement option CTEXT=MAGENTA has no effect. Therefore, the color of the title is black and all other text is blue.
The area enclosed by the axes and the frame has a white background, while the background outside the plot area is gray.
The MODELHT= option allows the entire model equation to fit on one line.
The STATHT= option allows the statistics in the margin to fit in one column.
The displayed statistics and the fitted model equation refer to the selected model. See the Traditional High-Resolution Graphics Plots section beginning on page 3840 for more information.
This example illustrates how you can display diagnostics for checking the adequacy of a regression model. The following statements plot the studentized deleted residuals against the observation number for the full model. Reference lines on the vertical axis (the VREF= option) at ±tinv(0.95, n - p - 1) = ±1.714 are added to identify possible outlying Oxygen values. A reference line at zero on the vertical axis is displayed by default when the RSTUDENT option is specified. The graph is shown in Output 61.7.1. Observations 15 and 17 are indicated as possible outliers.
title 'Check for Outlying Observations';
symbol v=dot h=1 c=green;
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse;
   plot rstudent.*obs.
        / vref= -1.714 1.714 cvref=blue lvref=1
          href= 0 to 30 by 5 chref=red cframe=ligr;
run;
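The cutoff used in the VREF= option can be reproduced with the TINV function. The DATA step below is a small sketch rather than part of the original example. It assumes n = 31 observations in the fitness data set and p = 7 parameters (six regressors plus the intercept), which gives the 23 degrees of freedom implied by the text. The variable names n, p, and cutoff are illustrative.

```sas
/* Sketch: n and p are assumptions read off the fitness example */
data _null_;
   n = 31;                           /* observations in the fitness data set */
   p = 7;                            /* parameters, including the intercept  */
   cutoff = tinv(0.95, n - p - 1);   /* 95th percentile of t with 23 df      */
   put cutoff= 8.3;                  /* writes the cutoff to the SAS log     */
run;
```

The value written to the log should agree with the 1.714 used for the reference lines.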
The following program creates probability-probability plots and quantile-quantile plots of the residuals (Output 61.8.1 and Output 61.8.2, respectively). An annotation data set is created to produce the reference line from (0,0) to (1,1) for the PP-plot. Note that the NOSTAT option for the PP-plot suppresses the statistics that would be displayed in the margin.
data annote1; length function color ; retain ysys xsys