Details

This section describes the statistical and computational aspects of the ROBUSTREG procedure. The following notation is used throughout this section.

Let X =( x _ij ) denote an n — p matrix, y =( y ₁ ,...,y _n ) ^T a given n -vector of responses, and =( ₁ ,..., _p ) ^T an unknown p -vector of parameters or coefficients whose components are to be estimated. The matrix X is called the design matrix. Consider the usual linear model

where e =( e ₁ ,...,e _n ) ^T is an n -vector of unknown errors. It is assumed that (for a given X ) the components e _i of e are independent and identically distributed according to a distribution L ( · / ƒ ), where ƒ is a scale parameter (usually unknown). The vector of residuals for a given value of is denoted by r =( r ₁ , ...,rn ) ^T and the i th row of the matrix X is denoted by .

M Estimation

M estimation in the context of regression was first introduced by Huber (1973) as an result of making the least squares approach robust. Although M estimators are not robust with respect to leverage points, they are popular in applications where leverage points are not an issue.

Instead of minimizing a sum of squares of the residuals, a Huber-type M estimator _M of minimizes a sum of less rapidly increasing functions of the residuals:

where r = y ˆ’ X . For the ordinary least squares estimation, is the quadratic function.

If ƒ is known, by taking derivatives with respect to , _M is also a solution of the system of p equations:

where ˆ = ² . If is convex, _M is the unique solution.

The ROBUSTREG procedure solves this system by using iteratively reweighted least squares (IRLS). The weight function w ( x ) is defined as

The ROBUSTREG procedure provides ten kinds of weight functions (corresponding to ten -functions) through the WEIGHTFUNCTION= option in the MODEL statement. See the section 'Weight Functions' on page 3995 for a complete discussion. You can specify the scale parameter ƒ with the SCALE= option in the PROC statement.

If ƒ is unknown, both and ƒ are estimated by minimizing the function

The algorithm proceeds by alternately improving in a location step and in a scale step.

For the scale step, three methods are available to estimate ƒ , which you can select with the SCALE= option.

(SCALE=HUBER<(D=d)>) Compute by the iteration

where

is the Huber function and is the Huber constant (refer to Huber 1981, p. 179). You can specify d with the D= option. By default, d =2 . 5.
(SCALE=TUKEY<(D=d)>) Compute by solving the supplementary equation

where

Here is Tukey's biweight function, and ² = ˆ« d ( s ) d ( s ) is the constant such that the solution is asymptotically consistent when L ( ·/ ƒ ) = ( ·) (refer to Hampel et. al. 1986, p. 149). You can specify d with the D= option. By default, d =2.5.
(SCALE=MED) Compute by the iteration

where ² = ^{ˆ’ 1} (.75) is the constant such that the solution is asymptotically consistent when L ( ·/ ƒ )= ( ·) (refer to Hampel et. al. 1986, p. 312).

Note that SCALE = MED is the default.

Algorithm

The basic algorithm for computing M estimates for regression is iteratively reweighted least squares (IRLS). As the name suggests, a weighted least squares fit is carried out inside an iteration loop. For each iteration, a set of weights for the observations is used in the least squares fit. The weights are constructed by applying a weight function to the current residuals. Initial weights are based on residuals from an initial fit. The ROBUSTREG procedure uses the unweighted least squares fit as a default initial fit. The iteration terminates when a convergence criterion is satisfied. The maximum number of iterations is set to 1000. You can specify the weight function and the convergence criteria.

Weight Functions

You can specify the weight function for M estimation with the WEIGHTFUNCTION= option. The ROBUSTREG procedure provides ten weight functions. By default, the procedure uses the bisquare weight function. In most cases, M estimates are more sensitive to the parameters of these weight functions than to the type of the weight function. The median weight function is not stable and is seldom recommended in data analysis; it is included in the procedure for completeness. You can specify the parameters for these weight functions. Except for the hampel and median weight functions, default values for these parameters are defined such that the corresponding M estimates have 95% asymptotic efficiency in the location model with the Gaussian distribution (see Holland and Welsch (1977)).

The following list shows the weight functions available.

andrews
bisquare
cauchy
fair

hampel
huber
logistic
median
talworth

welsch

See Table 62.4 on page 3985 for the default values of the constants in these weight functions.

Convergence Criteria

The following convergence criteria are available in PROC ROBUSTREG:

Relative change in the scaled residuals (CONVERGENCE= RESID)
Relative change in the coefficients (CONVERGENCE= COEF)
Relative change in weights (CONVERGENCE= W)

You can specify the criteria with the CONVERGENCE= option in the PROC statement. The default is CONVERGENCE= COEF.

You can specify the precision of the convergence criterion with the EPS= sub-option.

In addition to these convergence criteria, a convergence criterion based on scale-independent measure of the gradient is always checked. See Coleman, et. al. (1980). A warning is issued if this criterion is not satisfied.

Asymptotic Covariance and Confidence Intervals

The following three estimators of the asymptotic covariance of the robust estimator are available in PROC ROBUSTREG:

where is a correction factor and W _jk = ˆ‘ˆ ² ( r _i ) x _ij x _ik . Refer to Huber (1981, p. 173) for more details.

You can specify the asymptotic covariance estimate with the option ASYMPCOV= option. The ROBUSTREG procedure uses H1 as the default because of its simplicity and stability. Confidence intervals are computed from the diagonal elements of the estimated asymptotic covariance matrix.

R ² and Deviance

The robust version of R ² is defined as

and the robust deviance is defined as the optimal value of the objective function on the ƒ ² -scale:

where ² = ˆ , is the M estimator of , is the M estimator of location, and is the M estimator of the scale parameter in the full model.

Linear Tests

Two tests are available in PROC ROBUSTREG for the canonical linear hypothesis

The first test is a robust version of the F test, which is refered to as the -test. Denote the M estimators in the full and reduced model as ˆˆ and ₁ ˆˆ ₁ , respectively. Let

with

The robust F test is based on the test statistic

Asymptotically under , where the standardization factor is » = ˆ« ˆ ² ( s ) d ( s )/ ˆ« ˆ ² ( s ) d ( s ) and is the cumulative distribution function of the standard normal distribution. Large values of are significant. This test is a special case of the general -test of Hampel et. al. (1986, Section 7.2).

The second test is a robust version of the Wald test, which is refered to as -test. The test uses a test statistic

where is the ( p ˆ’ q ) —( p ˆ’ q ) lower right block of the asymptotic covariance matrix of the M estimate _M of in a p -parameter linear model.

Under , the statistic has an asymptotic ² distribution with p ˆ’ q degrees of freedom. Large absolute values of are significant. Refer to Hampel et. al. (1986, Chapter 7).

Model Selection

When M estimation is used, two criteria are available in PROC ROBUSTREG for model selection. The first criterion is a counterpart of the Akaike (1974) AIC criterion for robust regression, and it is defined as

where is a robust estimate of ƒ and is the M estimator with p -dimensional design matrix.

As with AIC, ± is the weight of the penalty for dimensions. The ROBUSTREG procedure uses ± =2 E ˆ ² /E ˆ ² (Ronchetti (1985)) and estimates it using the final robust residuals.

The second criterion is a robust version of the Schwarz information criteria(BIC), and it is defined as

High Breakdown Value Estimation

The breakdown value of an estimator is defined as the smallest fraction of contamination that can cause the estimator to take on values arbitrarily far from its value on the uncontamined data. The breakdown value of an estimator can be used as a measure of the robustness of the estimator. Rousseeuw and Leroy (1987) and others introduced the following high breakdown value estimators for linear regression.

LTS Estimate

The least trimmed squares (LTS) estimate proposed by Rousseeuw (1984) is defined as the p -vector

where

are the ordered squared residuals and h is defined in the range .

You can specify the parameter h with the H= option in the PROC statement. By default, . The breakdown value is for the LTS estimate.

The least median of squares (LMS) estimate is defined as the p -vector

where

are the ordered squared residuals i = 1,..., n , and h is defined in the range .

The breakdown value for the LMS estimate is also . However the LTS estimate has several advantages over the LMS estimate. Its objective function is smoother, making the LTS estimate less 'jumpy' (i.e. sensitive to local effects) than the LMS estimate. Its statistical efficiency is better, because the LTS estimate is asymptotically normal whereas the LMS estimate has a lower convergence rate (Rousseeuw and Leroy (1987)). Another important advantage is that, using the FAST-LTS algorithm of Rousseeuw and Van Driessen (2000), the LTS estimate takes less computing time and is more accurate.

The ROBUSTREG procedure computes LTS estimates using the FAST-LTS algorithm. The estimates are often used to detect outliers in the data, which are then downweighted in the resulting weighted LS regression.

Algorithm

Least trimmed squares (LTS) regression is based on the subset of h observations (out of a total of n observations) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between and n . The LTS method was proposed by Rousseeuw (1984, p. 876) as a highly robust regression estimator with breakdown value . The ROBUSTREG procedure uses the FAST-LTS algorithm given by Rousseeuw and Van Driessen (1998). The intercept adjustment technique is also used in this implementation. However, because this adjustment is expensive to compute, it is optional. You can use the IADJUST option in the PROC statement to request or suppress the intercept adjustment. By default, PROC ROBUSTREG does intercept adjustment for data sets with less than 10000 observations. The algorithm is described briefly as follows . Refer to Rousseeuw and Van Driessen (2000) for details.
1. The default h is where p is the number of independent variables . You can specify any integer h with with the H= option in the MODEL statement. The breakdown value for LTS, is reported . The default h is a good compromise between breakdown value and statistical efficiency.
2. If p =1(single regressor) the procedure uses the exact algorithm of Rousseeuw and Leroy (1987, p. 172-172).
3. If p ‰ 2, the procedure uses the following algorithm. If n < 2 ssubs , where ssubs is the size of the subgroups (you can specify ssubs using the SUBGROUPSIZE= option in the PROC statement, by default, ssubs = 300), draw a random p -subset and compute the regression coefficients using these p points (if the regression is degenerate , draw another p -subset). Compute the absolute residuals for all observations in the data set and select the first h points with smallest absolute residuals. From this selected h -subset, carry out nsteps C-steps (Concentration step, see Rousseeuw and Van Driessen (2000) for details. You can specify nsteps with the CSTEP= option in the PROC statement, by default, nsteps =2). Redraw p - subsets and repeat the preceding computing procedure nrep times and find the nbsol (at most) solutions with the lowest sums of h squared residuals. nrep can be specified with the NREP= option in the PROC statement. By default, NREP=min{500, }. For small n and p , all subsets are used and the NREP= option is ignored (Rousseeuw and Hubert (1996)). nbsol can be specified with the NBEST= option in the PROC statement. By default, NBEST=10. For each of these nbsol best solutions, take C-steps until convergence and find the best final solution.
4. If n ‰ 5 ssubs , construct 5 disjoint random subgroups with size ssubs . If 2 ssubs < n < 5 ssubs , the data are split into at most four subgroups with ssubs or more observations in each subgroup, so that each observation belongs to a subgroup and such that the subgroups have roughly the same size. Let nsubs denote the number of subgroups. Inside each subgroup, repeat the procedure in Step 3 times and keep the nbsol best solutions. Pool the subgroups, yielding the merged set of size n _merged . In the merged set, for each of the nsubs — nbsol best solutions, carry out nsteps C-steps using n _merged and and keep the nbsol best solutions. In the full data set, for each of these nbsol best solutions, take C-steps using n and h until convergence and find the best final solution.

R ²

The robust version of R ² for the LTS estimate is defined as

for models with the intercept term and as

for models without the intercept term, where

s _LTS is a preliminary estimate of the parameter ƒ in the distribution function L ( · / ƒ ).

Here d _h,n is chosen to make s _LTS consistent assuming a Gaussian model. Specifically,

with and being the distribution function and the density function of the standard normal distribution, respectively.

Final Weighted Scale Estimator

The ROBUSTREG procedure displays two scale estimators, s _LTS and Wscale. The estimate Wscale is a more efficient scale estimate based on the preliminary estimate s _LTS , and it is defined as

where
You can specify k with the CUTOFF= option in the MODEL statement. By default, k =3.

S Estimate

The S estimate proposed by Rousseeuw and Yohai (1984) is defined as the p -vector

where the dispersion S ( ) is the solution of

Here ² is set to ˆ« ( s ) d ( s ) such that s and S ( s ) are asymptotically consistent estimates of and ƒ for the Gaussian regression model. The breakdown value of the S estimate is

The ROBUSTREG procedure provides two choices for : the Tukey function and the Yohai function.

The Tukey function, which you can specify with the option CHIF=TUKEY, is

The constant k controls the breakdown value and efficiency of the S estimate. By specifying the efficiency using the EFF= option, you can determine the corresponding k . The default k is 2.9366 such that the breakdown value of the S estimate is 0.25 with a corresponding asymptotic efficiency for the Gaussian model of 75 . 9%.

The Yohai function, which you can specify with the option CHIF=YOHAI, is

where b =1 . 792, b ₁ = ˆ’ . 972, b ₂ =0 . 432, b ₃ = ˆ’ . 052, and b ₄ =0 . 002. By specifying the efficiency using the EFF= option, you can determine the corresponding k . By default, k is set to 0.7405 such that the breakdown value of the S estimate is 0.25 with a corresponding asymptotic efficiency for the Gaussian model of 72 . 7%.

Algorithm

The ROBUSTREG procedure implements the algorithm by Marazzi (1993) for the S estimate, which is a refined version of the algorithm proposed by Ruppert (1992). The refined algorithm is briefly described as follows.
Initialize iter = 1.
1. Draw a random q -subset of the total n observations and compute the regression coefficients using these q observations (if the regression is degenerate, draw another q -subset), where q ‰ p can be specified with the SUBSIZE= option. By default, q = p .
2. Compute the residuals: for i =1 , , n . If iter = 1, set s * = 2 median { r _i ,i = 1 , ,n } ; if s * = 0, set s * = min { r _i ,i =1 , ,n }; while ; go to Step3. If iter > 1 and , go to the Step 3; else go to Step 5.
3. Solve for s the equation
  
  using an iterative algorithm.
4. If iter > 1 and s>s , go to Step5. Otherwise , sets = s an d * = . I f s * <TOLS , return s * and *; else go to Step5.
5. if iter <NREP , setiter = iter + 1 and return to Step 1; else return s * and *.

The ROBUSTREG procedure does the following refinement step by default. You can request this refinement not be done using the NOREFINE option in the PROC statement.
1. Let ˆ = ² . Using the values s * and * from the previous steps, compute M estimates _M and ƒ _M of and ƒ with the setup for M estimation in the section 'M Estimation' on page 3993. If ƒ _M >s *, give a warning and return s * and *; otherwise, return ƒ _M and _M .

You can specify TOLS with the TOLERANCE= option; by default, TOLS = . 001. Alternately You can specify NREP with the NREP= option. You can also use the options NREP= NREP0 or NREP= NREP1 to determine NREP according to the following table. NREP= NREP0 is set as the default.

Table 62.7: Default NREP
P	NREP0	NREP1
1	150	500
2	300	1000
3	400	1500
4	500	2000
5	600	2500
6	700	3000
7	850	3000
8	1250	3000
9	1500	3000
>9	1500	3000

R ² and Deviance

The robust version of R ² for the S estimate is defined as
for the model with the intercept term and
for the model without the intercept term, where S _p is the S estimate of the scale in the full model, S _µ is the S estimate of the scale in the regression model with only the intercept term, and S is the S estimate of the scale without any regressor. The deviance D is defined as the optimal value of the objective function on the ƒ ² -scale:

Asymptotic Covariance and Confidence Intervals

Since the S estimate satisfies the first-order necessary conditions as the M estimate, it has the same asymptotic covariance as that of the M estimate. All three estimators of the asymptotic covariance for the M estimate in the section 'Asymptotic Covariance and Confidence Intervals' on page 3997 can be used for the S estimate. Besides, the weighted covariance estimator H4 described in the section 'Asymptotic Covariance and Confidence Intervals' on page 4008 is also available and is set as the default. Confidence intervals for estimated parameters are computed from the diagonal elements of the estimated asymptotic covariance matrix.

MM Estimation

MM estimation is a combination of high breakdown value estimation and efficient estimation, which was introduced by Yohai (1987). It has three steps:

Compute an initial (consistent) high breakdown value estimate ² . The ROBUSTREG procedure provides two kinds of estimates as the initial estimate, the LTS estimate and the S estimate. By default, the LTS estimate because of its speed, efficiency, and high breakdown value. The breakdown value of the final MM estimate is decided by the breakdown value of the initial LTS estimate and the constant k in the function. To use the S estimate as the initial estimate, you specify the INITEST=S option in the PROC statement. In this case, the breakdown value of the final MM estimate is decided only by the constant k . Instead of computing the LTS estimate or the S estimate as initial estimates, you can also specify the initial estimate explicitly using the INEST= option in the PROC statement. See the section 'INEST= Data Set' on page 4011 for details.
Find ² such that

where ² = ˆ« ( s ) d ( s ).

The ROBUSTREG procedure provides two choices for : the Tukey function and the Yohai function.

The Tukey function, which you can specify with the option CHIF=TUKEY, is

where k can be specified with the K0= option. The default k = 2.9366 such that the asymptotically consistent scale estimate ² has the breakdown value of 25%.

The Yohai function, which you can specify with the option CHIF=YOHAI, is

where b = 1 . 792, b ₁ = ˆ’ . 972, b ₂ = 0 . 432, b ₃ = ˆ’ . 052, and b ₄ = 0 . 002. You can specify k with the K0= option. The default k is .7405 such that the asymptotically consistent scale estimate ² has the breakdown value of 25%.
Find a local minimum _MM of

such that Q _MM ( _MM ) ‰ Q _MM ( ² ). The algorithm for M estimation is used here.

The ROBUSTREG procedure provides two choices for : the Tukey function and the Yohai function.

The Tukey function, which you can specify with the option CHIF=TUKEY, is

where k ₁ can be specified with the K1= option. The default k ₁ is 3.440 such that the MM estimate has 85% asymptotic efficiency with the Gaussian distribution.

The Yohai function, which you can specify with the option CHIF=Yohai, is

where k ₁ can be specified with the K1= option. The default k ₁ is 0.868 such that the MM estimate has 85% asymptotic efficiency with the Gaussian distribution.