This example illustrates differences in the performance of robust estimates available in the ROBUSTREG procedure.
The following statements generate 1000 random observations. The first 900 observations are from a linear model and the last 100 observations are significantly biased in the y -direction. In other words, ten percent of the observations are contaminated with outliers.
data a (drop=i); do i=1 to 1000; x1=rannor(1234); x2=rannor(1234); e=rannor(1234); if i > 900 then y=100 + e; else y=10 + 5*x1 + 3*x2 + .5 * e; output; end; run; proc reg data=a; model y = x1 x2; run; proc robustreg data=a method=m ; model y = x1 x2; run; proc robustreg data=a method=mm; model y = x1 x2; run;
The tables of parameter estimates generated by the ROBUSTREG procedure using M estimation and MM estimation are shown in Output 62.1.2 and Output 62.1.3. For comparison, the ordinary least squares (OLS) estimates produced by the REG procedure are shown in Output 62.1.1. Both the M estimate and the MM estimate correctly estimate the regression coefficients for the underlying model (10, 5, and 3), but the OLS estimate does not.
The ROBUSTREG Procedure Model Information Data Set WORK.B Dependent Variable y Number of Covariates 2 Number of Observations 1000 Method M Estimation Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 10.0024 0.0174 9.9683 10.0364 331908 <.0001 x1 1 5.0077 0.0175 4.9735 5.0420 82106.9 <.0001 x2 1 3.0161 0.0167 2.9834 3.0488 32612.5 <.0001 Scale 1 0.5780
The ROBUSTREG Procedure Model Information Data Set WORK.B Dependent Variable y Number of Covariates 2 Number of Observations 1000 Method MM Estimation Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 10.0035 0.0176 9.9690 10.0379 323947 <.0001 x1 1 5.0085 0.0178 4.9737 5.0433 79600.6 <.0001 x2 1 3.0181 0.0168 2.9851 3.0511 32165.0 <.0001 Scale 0 0.6733
The REG Procedure Model: MODEL1 Dependent Variable: y Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > t Intercept 1 19.06712 0.86322 22.09 <.0001 x1 1 3.55485 0.86892 4.09 <.0001 x2 1 2.12341 0.83039 2.56 0.0107
The next statements demonstrate that if the percentage of contamination is increased to 40%, the M estimates and MM estimates with default options fail to pick up the underlying model. However, by tuning the constant c for the M estimate and the constants INITH and K0 for the MM estimate, you can increase the breakdown values of these estimates and capture the right model. Output 62.1.4 and Output 62.1.5 display these estimates.
The ROBUSTREG Procedure Model Information Data Set WORK.B Dependent Variable y Number of Covariates 2 Number of Observations 1000 Method M Estimation Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 10.0137 0.0219 9.9708 10.0565 209688 <.0001 x1 1 4.9905 0.0220 4.9473 5.0336 51399.1 <.0001 x2 1 3.0399 0.0210 2.9987 3.0811 20882.4 <.0001 Scale 1 1.0531
The ROBUSTREG Procedure Model Information Data Set WORK.B Dependent Variable y Number of Covariates 2 Number of Observations 1000 Method MM Estimation Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 10.0103 0.0213 9.9686 10.0520 221639 <.0001 x1 1 4.9890 0.0218 4.9463 5.0316 52535.7 <.0001 x2 1 3.0363 0.0201 2.9970 3.0756 22895.4 <.0001 Scale 0 1.8997
data b (drop=i); do i=1 to 1000; x1=rannor(1234); x2=rannor(1234); e=rannor(1234); if i > 600 then y=100 + e; else y=10 + 5*x1 + 3*x2 + .5 * e; output; end; run; proc robustreg data=b method=m(wf=bisquare(c=2)); model y = x1 x2; run; proc robustreg data=b method=mm(inith=502 k0=1.8); model y = x1 x2; run;
When there are bad leverage points, the M estimates fail to pick up the underlying model no matter what constant c you use. In this case, other estimates (LTS, S, and MM estimates) in PROC ROBUSTREG, which are robust to bad leverage points, will pick up the underlying model.
The following statements generate 1000 observations with 1% bad high leverage points.
data b (drop=i); do i=1 to 1000; x1=rannor(1234); x2=rannor(1234); e=rannor(1234); if i > 600 then y=100 + e; else y=10 + 5*x1 + 3*x2 + .5 * e; if i < 11 then x1=200 * rannor(1234); if i < 11 then x2=200 * rannor(1234); if i < 11 then y= 100*e; output; end; run; proc robustreg data=b method=s(k0=1.8); model y = x1 x2; run; proc robustreg data=b method=mm(inith=502 k0=1.8); model y = x1 x2; run;
Output 62.1.6 displays the S estimates and Output 62.1.7 displays the MM estimates with initial LTS estimates.
The ROBUSTREG Procedure Model Information Data Set WORK.C Dependent Variable y Number of Covariates 2 Number of Observations 1000 Method S Estimation Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 9.9808 0.0216 9.9383 10.0232 212532 <.0001 x1 1 5.0303 0.0208 4.9896 5.0710 58656.3 <.0001 x2 1 3.0217 0.0222 2.9782 3.0652 18555.7 <.0001 Scale 0 2.2094
The ROBUSTREG Procedure Model Information Data Set WORK.C Dependent Variable y Number of Covariates 2 Number of Observations 1000 Method MM Estimation Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 9.9820 0.0215 9.9398 10.0241 215369 <.0001 x1 1 5.0303 0.0206 4.9898 5.0707 59469.1 <.0001 x2 1 3.0222 0.0221 2.9789 3.0655 18744.9 <.0001 Scale 0 2.2134
The classical analysis of variance (ANOVA) technique based on least squares assumes that the underlying experimental errors are normally distributed. However, data often contain outliers due to recording or other errors. In other cases, extreme responses occurs when control variables in the experiments is set to extremes. It is important to distinguish these extreme points and determine whether they are outliers or important extreme cases. You can use the ROBUSTREG procedure for robust analysis of variance based on M estimation. Typically, there are no high leverage points in a well-designed experiment, so M estimation is appropriate.
The following example shows how to use the ROBUSTREG procedure for robust ANOVA.
An experiment was carried out to study the effects of two successive treatments ( T1 , T2 ) on the recovery time of mice with certain diseases. Sixteen mice were randomly assigned into four groups for the four different combinations of the treatments. The recovery times ( time ) were recorded (in hours).
data recover; input id T1 $ T2 $ time; datalines; 1 0 0 20.2 2 0 0 23.9 3 0 0 21.9 4 0 0 42.4 5 1 0 27.2 6 1 0 34.0 7 1 0 27.4 8 1 0 28.5 9 0 1 25.9 10 0 1 34.5 11 0 1 25.1 12 0 1 34.2 13 1 1 35.0 14 1 1 33.9 15 1 1 38.3 16 1 1 39.9 ;
The following statements invoke the GLM procedure for a standard ANOVA.
proc glm data=recover; class T1 T2; model time = T1 T2 T1*T2; run;
Output 62.2.1 indicates that the overall model effect is not significantatthe10% level and Output 62.2.2 indicates that neither treatment is significantatthe10% level.
The GLM Procedure Dependent Variable: time Sum of Source DF Squares Mean Square F Value Pr > F Model 3 209.9118750 69.9706250 1.86 0.1905 Error 12 451.9225000 37.6602083 Corrected Total 15 661.8343750 R-Square Coeff Var Root MSE time Mean 0.317167 19.94488 6.136791 30.76875
The GLM Procedure Dependent Variable: time Source DF Type I SS Mean Square F Value Pr > F T1 1 81.4506250 81.4506250 2.16 0.1671 T2 1 106.6056250 106.6056250 2.83 0.1183 T1*T2 1 21.8556250 21.8556250 0.58 0.4609 Source DF Type III SS Mean Square F Value Pr > F T1 1 81.4506250 81.4506250 2.16 0.1671 T2 1 106.6056250 106.6056250 2.83 0.1183 T1*T2 1 21.8556250 21.8556250 0.58 0.4609
The following statements invoke the ROBUSTREG procedure with the same model.
proc robustreg data=recover; class T1 T2; model time = T1 T2 T1*T2 / diagnostics; T1_T2: test T1*T2; output out=robout r=resid sr=stdres; run;
Output 62.2.3 shows some basic information about the model and the response variable time .
The ROBUSTREG Procedure Model Information Data Set WORK.RECOVER Dependent Variable time Number of Covariates 2 Number of Continuous Covariates 0 Number of Discrete Covariates 2 Number of Observations 16 Method M Estimation Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD time 25.5000 31.2000 34.7500 30.7688 6.6425 6.8941
The Parameter Estimates table in Output 62.2.4 indicates that the main effects of both treatments are significantatthe5% level.
The ROBUSTREG Procedure Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 36.7655 2.0489 32.7497 40.7814 321.98 <.0001 T1 0 1 6.8307 2.8976 12.5100 1.1514 5.56 0.0184 T1 1 0.0000 0.0000 0.0000 0.0000 . . T2 0 1 7.6755 2.8976 13.3548 1.9962 7.02 0.0081 T2 1 0.0000 0.0000 0.0000 0.0000 . . T1*T2 0 0 1 0.2619 4.0979 8.2936 7.7698 0.00 0.9490 T1*T2 0 1 0.0000 0.0000 0.0000 0.0000 . . T1*T2 1 0 0.0000 0.0000 0.0000 0.0000 . . T1*T2 1 1 0.0000 0.0000 0.0000 0.0000 . . Scale 1 3.5346
The reason for the difference between the traditional ANOVA and the robust ANOVA is explained by Output 62.2.5, which shows that the fourth observation is an outlier. Further investigation shows that the original value of 24.4 for the fourth observation was recorded incorrectly.
The ROBUSTREG Procedure Diagnostics Standardized Robust Obs Residual Outlier 4 5.7722 * Diagnostics Summary Observation Type Proportion Cutoff Outlier 0.0625 3.0000
Output 62.2.6 displays the robust test results. The interaction between the two treatments is not significant. Output 62.2.7 displays the robust residuals and standardized robust residuals.
The ROBUSTREG Procedure Robust Linear Tests T1_T2 Test Chi- Test Statistic Lambda DF Square Pr > ChiSq Rho 0.0041 0.7977 1 0.01 0.9431 Rn2 0.0041 1 0.00 0.9490
Obs T1 T2 time resid stdres 1 0 0 20.2 1.7974 0.50851 2 0 0 23.9 1.9026 0.53827 3 0 0 21.9 0.0974 0.02756 4 0 0 42.4 20.4026 5.77222 5 1 0 27.2 1.8900 0.53472 6 1 0 34.0 4.9100 1.38911 7 1 0 27.4 1.6900 0.47813 8 1 0 28.5 0.5900 0.16693 9 0 1 25.9 4.0348 1.14152 10 0 1 34.5 4.5652 1.29156 11 0 1 25.1 4.8348 1.36785 12 0 1 34.2 4.2652 1.20668 13 1 1 35.0 1.7655 0.49950 14 1 1 33.9 2.8655 0.81070 15 1 1 38.3 1.5345 0.43413 16 1 1 39.9 3.1345 0.88679
Robust regression and outlier detection techniques have considerable applications to econometrics. The following example from Zaman, Rousseeuw, and Orhan (2001) shows how these techniques substantially improve the ordinary least squares (OLS) results for the growth study of De Long and Summers.
De Long and Summers (1991) studied the national growth of 61 countries from 1960 to 1985 using OLS.
data growth; input country$ GDP LFG EQP NEQ GAP @@; datalines; Argentin 0.0089 0.0118 0.0214 0.2286 0.6079 Austria 0.0332 0.0014 0.0991 0.1349 0.5809 Belgium 0.0256 0.0061 0.0684 0.1653 0.4109 Bolivia 0.0124 0.0209 0.0167 0.1133 0.8634 Botswana 0.0676 0.0239 0.1310 0.1490 0.9474 Brazil 0.0437 0.0306 0.0646 0.1588 0.8498 Cameroon 0.0458 0.0169 0.0415 0.0885 0.9333 Canada 0.0169 0.0261 0.0771 0.1529 0.1783 Chile 0.0021 0.0216 0.0154 0.2846 0.5402 Colombia 0.0239 0.0266 0.0229 0.1553 0.7695 CostaRic 0.0121 0.0354 0.0433 0.1067 0.7043 Denmark 0.0187 0.0115 0.0688 0.1834 0.4079 Dominica 0.0199 0.0280 0.0321 0.1379 0.8293 Ecuador 0.0283 0.0274 0.0303 0.2097 0.8205 ElSalvad 0.0046 0.0316 0.0223 0.0577 0.8414 Ethiopia 0.0094 0.0206 0.0212 0.0288 0.9805 Finland 0.0301 0.0083 0.1206 0.2494 0.5589 France 0.0292 0.0089 0.0879 0.1767 0.4708 Germany 0.0259 0.0047 0.0890 0.1885 0.4585 Greece 0.0446 0.0044 0.0655 0.2245 0.7924 Guatemal 0.0149 0.0242 0.0384 0.0516 0.7885 Honduras 0.0148 0.0303 0.0446 0.0954 0.8850 HongKong 0.0484 0.0359 0.0767 0.1233 0.7471 India 0.0115 0.0170 0.0278 0.1448 0.9356 Indonesi 0.0345 0.0213 0.0221 0.1179 0.9243 Ireland 0.0288 0.0081 0.0814 0.1879 0.6457 Israel 0.0452 0.0305 0.1112 0.1788 0.6816 Italy 0.0362 0.0038 0.0683 0.1790 0.5441 IvoryCoa 0.0278 0.0274 0.0243 0.0957 0.9207 Jamaica 0.0055 0.0201 0.0609 0.1455 0.8229 Japan 0.0535 0.0117 0.1223 0.2464 0.7484 Kenya 0.0146 0.0346 0.0462 0.1268 0.9415 Korea 0.0479 0.0282 0.0557 0.1842 0.8807 Luxembou 0.0236 0.0064 0.0711 0.1944 0.2863 Madagasc 0.0102 0.0203 0.0219 0.0481 0.9217 Malawi 0.0153 0.0226 0.0361 0.0935 0.9628 Malaysia 0.0332 0.0316 0.0446 0.1878 0.7853 Mali 0.0044 0.0184 0.0433 0.0267 0.9478 Mexico 0.0198 0.0349 0.0273 0.1687 0.5921 Morocco 0.0243 0.0281 0.0260 0.0540 0.8405 Netherla 0.0231 0.0146 0.0778 0.1781 0.3605 Nigeria 0.0047 0.0283 0.0358 0.0842 0.8579 Norway 0.0260 0.0150 0.0701 0.2199 0.3755 Pakistan 0.0295 0.0258 0.0263 0.0880 0.9180 Panama 0.0295 0.0279 0.0388 0.2212 0.8015 Paraguay 0.0261 0.0299 0.0189 0.1011 0.8458 Peru 0.0107 0.0271 0.0267 0.0933 0.7406 Philippi 0.0179 0.0253 0.0445 0.0974 0.8747 Portugal 0.0318 0.0118 0.0729 0.1571 0.8033 Senegal 0.0011 0.0274 0.0193 0.0807 0.8884 Spain 0.0373 0.0069 0.0397 0.1305 0.6613 SriLanka 0.0137 0.0207 0.0138 0.1352 0.8555 Tanzania 0.0184 0.0276 0.0860 0.0940 0.9762 Thailand 0.0341 0.0278 0.0395 0.1412 0.9174 Tunisia 0.0279 0.0256 0.0428 0.0972 0.7838 U.K. 0.0189 0.0048 0.0694 0.1132 0.4307 U.S. 0.0133 0.0189 0.0762 0.1356 0.0000 Uruguay 0.0041 0.0052 0.0155 0.1154 0.5782 Venezuel 0.0120 0.0378 0.0340 0.0760 0.4974 Zambia 0.0110 0.0275 0.0702 0.2012 0.8695 Zimbabwe 0.0110 0.0309 0.0843 0.1257 0.8875 ;
The regression equation they used is:
where the response variable is the growth in gross domestic product per worker ( GDP ) and the regressors are labor force growth ( LFG ), relative GDP gap ( GAP ), equipment investment ( EQP ), and non-equipment investment ( NEQ ).
The following statements invoke the REG procedure for the OLS analysis:
proc reg data=growth; model GDP = LFG GAP EQP NEQ ; run;
The OLS analysis of Output 62.3.1 indicates that GAP and EQP have a significant influence on GDP at the 5% level.
The REG Procedure Model: MODEL1 Dependent Variable: GDP Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > t Intercept 1 0.01430 0.01028 1.39 0.1697 LFG 1 0.02981 0.19838 0.15 0.8811 GAP 1 0.02026 0.00917 2.21 0.0313 EQP 1 0.26538 0.06529 4.06 0.0002 NEQ 1 0.06236 0.03482 1.79 0.0787
The following statements invoke the ROBUSTREG procedure with the default M estimation.
proc robustreg data=growth; model GDP = LFG GAP EQP NEQ / diagnostics leverage; output out=robout r=resid sr=stdres; run;
Output 62.3.2 displays model information and summary statistics for variables in the model.
The ROBUSTREG Procedure Model Information Data Set MYLIB.GROWTH Dependent Variable GDP Number of Covariates 4 Number of Observations 61 Method M Estimation Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD LFG 0.0118 0.0239 0.0281 0.0211 0.00979 0.00949 GAP 0.5796 0.8015 0.8863 0.7258 0.2181 0.1778 EQP 0.0265 0.0433 0.0720 0.0523 0.0296 0.0325 NEQ 0.0956 0.1356 0.1812 0.1399 0.0570 0.0624 GDP 0.0121 0.0231 0.0310 0.0224 0.0155 0.0150
Output 62.3.3 displays the M estimates. Besides GAP and EQP , the robust analysis also indicates that NEQ is significant. This new finding is explained by Output 62.3.4, which shows that Zambia, the sixtieth country in the data, is an outlier. Output 62.3.4 also identifies leverage points based the robust MCD distances; however, there are no serious high leverage points in this data set. Output 62.3.5 displays robust versions of goodness-of-fit statistics for the model.
The ROBUSTREG Procedure Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 0.0247 0.0097 0.0437 0.0058 6.53 0.0106 LFG 1 0.1040 0.1867 0.2619 0.4699 0.31 0.5775 GAP 1 0.0250 0.0086 0.0080 0.0419 8.36 0.0038 EQP 1 0.2968 0.0614 0.1764 0.4172 23.33 <.0001 NEQ 1 0.0885 0.0328 0.0242 0.1527 7.29 0.0069 Scale 1 0.0099
The ROBUSTREG Procedure Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier 1 2.6083 4.0639 * 0.9424 5 3.4351 6.7391 * 1.4200 8 3.1876 4.6843 * 0.1972 9 3.6752 5.0599 * 1.8784 17 2.6024 3.8186 * 1.7971 23 2.1225 3.8238 * 1.7161 27 2.6461 5.0336 * 0.0909 31 2.9179 4.7140 * 0.0216 53 2.2600 4.3193 * 1.8082 57 3.8701 5.4874 * 0.1448 58 2.5953 3.9671 * 0.0978 59 2.9239 4.1663 * 0.3573 60 1.8562 2.7135 4.9798 * 61 1.9634 3.9128 * 2.5959 Diagnostics Summary Observation Type Proportion Cutoff Outlier 0.0164 3.0000 Leverage 0.2131 3.3382
The ROBUSTREG Procedure Goodness-of-Fit Statistic Value R-Square 0.3178 AICR 80.2134 BICR 91.5095 Deviance 0.0070
The following statements invoke the ROBUSTREG procedure with LTS estimation, which was used by Zaman, Rousseeuw, and Orhan (2001). The results are consistent with those of M estimation.
proc robustreg method=lts(h=33) fwls data=growth; model GDP = LFG GAP EQP NEQ / diagnostics leverage ; output out=robout r=resid sr=stdres; run;
Output 62.3.6 displays the LTS estimates.
The ROBUSTREG Procedure LTS Profile Total Number of Observations 61 Number of Squares Minimized 33 Number of Coefficients 5 Highest Possible Breakdown Value 0.4590 LTS Parameter Estimates Parameter DF Estimate Intercept 1 0.0249 LFG 1 0.1123 GAP 1 0.0214 EQP 1 0.2669 NEQ 1 0.1110 Scale (sLTS) 0 0.0076 Scale (Wscale) 0 0.0109
Output 62.3.7 displays outlier and leverage point diagnostics based on the LTS estimates.
The ROBUSTREG Procedure Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier 1 2.6083 4.0639 * 1.0715 5 3.4351 6.7391 * 1.6574 8 3.1876 4.6843 * 0.2324 9 3.6752 5.0599 * 2.0896 17 2.6024 3.8186 * 1.6367 23 2.1225 3.8238 * 1.7570 27 2.6461 5.0336 * 0.2334 31 2.9179 4.7140 * 0.0971 53 2.2600 4.3193 * 1.2978 57 3.8701 5.4874 * 0.0605 58 2.5953 3.9671 * 0.0857 59 2.9239 4.1663 * 0.4113 60 1.8562 2.7135 4.4984 * 61 1.9634 3.9128 * 2.1201 Diagnostics Summary Observation Type Proportion Cutoff Outlier 0.0164 3.0000 Leverage 0.2131 3.3382 R-Square for LTS Estimation R-Square 0.7418
Output 62.3.8 displays the final weighted lease squares estimates, which are identical to those reported in Zaman, Rousseeuw, and Orhan (2001).
The ROBUSTREG Procedure Parameter Estimates for Final Weighted Least Squares Fit Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 0.0222 0.0093 0.0405 0.0039 5.65 0.0175 LFG 1 0.0446 0.1771 0.3026 0.3917 0.06 0.8013 GAP 1 0.0245 0.0082 0.0084 0.0406 8.89 0.0029 EQP 1 0.2824 0.0581 0.1685 0.3964 23.60 <.0001 NEQ 1 0.0849 0.0314 0.0233 0.1465 7.30 0.0069 Scale 0 0.0116