Syntax | SAS/STAT 9.1 User's Guide (Vol. 6)

PROC ROBUSTREG < options >;
- BY variables ;
- CLASS variables ;
- ID variables ;
- MODEL response = < effects > < / options > ;
- OUTPUT < OUT = SAS-data-set > < options > ;
- PERFORMANCE < options > ;
- <label:> TEST effects ;
- WEIGHT variable ;

The PROC ROBUSTREG statement invokes the procedure. The METHOD= option in the PROC ROBUSTREG statement selects one of the four estimation methods: M, LTS, S, or MM. By default, Huber M estimation is used. The MODEL statement is required and specifies the variables used in the regression. Main effects and interaction terms can be specified in the MODEL statement, as in the GLM procedure. The CLASS statement specifies which explanatory variables are treated as categorical. These variables are allowed in the MODEL statement only for M estimation, and not for other estimation methods. The ID statement names variables to identify observations in the outlier diagnostics tables. The WEIGHT statement identifies a variable in the input data set whose values are used to weight the observations. The OUTPUT statement creates an output data set containing final weights, predicted values, and residuals. The TEST statement requests robust linear tests for the model parameters. The PERFORMANCE statement tunes the performance of the procedure by using single or multiple processors available on the hardware. In one invocation of PROC ROBUSTREG, multiple OUTPUT and TEST statements are allowed.
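As a minimal sketch of the simplest invocation described above, the following step fits a model with the default estimation method. The data set WORK.DEMO and the variables y, x1, and x2 are hypothetical and are simulated here only so that the example is self-contained.

```sas
/* Hypothetical data: y is linear in x1 and x2, with a few gross outliers */
data work.demo;
   do i = 1 to 100;
      x1 = rannor(1);
      x2 = rannor(1);
      y  = 10 + 5*x1 + 3*x2 + 0.5*rannor(1);
      if i > 95 then y = y + 50;   /* contaminate the last five observations */
      output;
   end;
run;

/* No METHOD= option is given, so the default estimation method is used */
proc robustreg data=work.demo;
   model y = x1 x2;
run;
```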

PROC ROBUSTREG Statement

PROC ROBUSTREG < options > ;

The PROC ROBUSTREG statement invokes the procedure. You can specify the following options in the PROC ROBUSTREG statement.

COVOUT

saves the estimated covariance matrix in the OUTEST= data set for M estimation and MM estimation.

DATA = SAS-data-set

specifies the input SAS data set used by PROC ROBUSTREG. By default, the most recently created SAS data set is used.

FWLS

requests that final weighted least squares estimators be computed.

INEST = SAS-data-set

specifies an input SAS data set that contains initial estimates for all the parameters in the model. See the section 'INEST= Data Set' on page 4011 for a detailed description of the contents of the INEST= data set.

ITPRINT

displays the iteration history for the iteratively reweighted least squares algorithm used by M and MM estimation. You can also use this option in the MODEL statement.

NAMELEN = n

specifies the length of effect names in tables and output data sets to be n characters, where n is a value between 20 and 200. The default length is 20 characters.

ORDER=DATA | FORMATTED | FREQ | INTERNAL

specifies the sorting order for the levels of the classification variables (specified in the CLASS statement). This ordering determines which parameters in the model correspond to each level in the data. The following table explains how PROC ROBUSTREG interprets values of the ORDER= option.

Table 62.1: Options for Order
Value of ORDER=	Levels Sorted By
DATA	order of appearance in the input data set
FORMATTED	formatted value
FREQ	descending frequency count; levels with the most observations come first in the order
INTERNAL	unformatted value

By default, ORDER=FORMATTED. For FORMATTED and INTERNAL, the sort order is machine dependent. For more information on sorting order, refer to the chapter titled 'The SORT Procedure' in the SAS Procedures Guide.

OUTEST = SAS-data-set

specifies an output SAS data set containing the parameter estimates, and, if the COVOUT option is specified, the estimated covariance matrix. See the section 'OUTEST= Data Set' on page 4011 for a detailed description of the contents of the OUTEST= data set.

SEED = number

specifies the seed for the random number generator used to randomly select the subgroups and subsets for LTS and S estimation. By default, or if you specify zero, the ROBUSTREG procedure generates a seed between one and one billion.

METHOD = method type <( options )>

specifies the estimation method; options specifies additional options for that method. PROC ROBUSTREG provides four estimation methods: M estimation, LTS estimation, S estimation, and MM estimation. The default method is M estimation.
Because the LTS and S methods use subsampling algorithms, they are not suitable for an analysis with continuous independent variables that have only a few nonzero values, or only a few nonzero values within one BY group.

Options with METHOD=M

With METHOD=M, you can specify the following additional options:

ASYMPCOV = H1 | H2 | H3

specifies the type of asymptotic covariance computed for the M estimate. The three types are described in the section 'Asymptotic Covariance and Confidence Intervals' on page 3997. By default, ASYMPCOV=H1.

CONVERGENCE = criterion < ( EPS = value ) >

specifies a convergence criterion for the M estimate.

Table 62.2: Options to Specify Convergence Criteria
Type	Option
residual	CONVERGENCE= RESID
weight	CONVERGENCE= WEIGHT
coefficient	CONVERGENCE= COEF

By default, CONVERGENCE=COEF. You can specify the precision of the convergence criterion with the EPS= option. By default, EPS=1.E-8.

MAXITER = n

sets the maximum number of iterations during the parameter estimation. By default, MAXITER=1000.

SCALE = scale type | value

specifies the scale parameter or a method for estimating the scale parameter.

Table 62.3: Options to Specify Scale
Scale	Option	Default d
Median estimate	SCALE=MED
Tukey estimate	SCALE=TUKEY<(D=d)>	2.5
Huber estimate	SCALE=HUBER<(D=d)>	2.5
Fixed constant	SCALE= value

By default, SCALE = MED.

WF | WEIGHTFUNCTION = function type

specifies the weight function used for the M estimate. The ROBUSTREG procedure provides ten weight functions, which are listed in the following table. You can specify the parameters in these functions with the A=, B=, and C= options. These functions are described in the section 'M Estimation' on page 3993. The default weight function is bisquare.

Table 62.4: Options to Specify Weight Functions
Weight Function	Option	Default a, b, c
andrews	WF = ANDREWS<(C=c)>	1.339
bisquare	WF = BISQUARE<(C=c)>	4.685
cauchy	WF = CAUCHY<(C=c)>	2.385
fair	WF = FAIR<(C=c)>	1.4
hampel	WF = HAMPEL<( <A=a> <B=b> <C=c>)>	2, 4, 8
huber	WF = HUBER<(C=c)>	1.345
logistic	WF = LOGISTIC<(C=c)>	1.205
median	WF = MEDIAN<(C=c)>	0.01
talworth	WF = TALWORTH<(C=c)>	2.795
welsch	WF = WELSCH<(C=c)>	2.985
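The METHOD=M suboptions above can be combined inside the parentheses that follow METHOD=M. The following is a sketch only: the data set WORK.DEMO and its variables are hypothetical, and the suboption values are illustrative choices rather than recommendations.

```sas
/* Hypothetical data with a few outliers in the response */
data work.demo;
   do i = 1 to 100;
      x1 = rannor(2);
      x2 = rannor(2);
      y  = 10 + 5*x1 + 3*x2 + 0.5*rannor(2);
      if i > 95 then y = y + 50;
      output;
   end;
run;

/* Huber weight function and Huber scale estimate, each with its documented
   default tuning constant written out explicitly; convergence is judged on
   residuals with a looser precision than the default */
proc robustreg data=work.demo
               method=m(wf=huber(c=1.345)
                        scale=huber(d=2.5)
                        convergence=resid(eps=1e-6)
                        maxiter=200
                        asympcov=h1);
   model y = x1 x2 / itprint;   /* ITPRINT shows the iteration history */
run;
```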

Options with METHOD=LTS

With METHOD=LTS, you can specify the following additional options:

CSTEP = n

specifies the number of C-steps for the LTS estimate. See the section 'LTS Estimate' on page 4000 for how the default value is determined.

IADJUST = ALL | NONE

requests (IADJUST=ALL) or suppresses (IADJUST=NONE) the intercept adjustment for all estimates in the LTS algorithm. By default, the intercept adjustment is used for data sets with fewer than 10000 observations. See the section 'Algorithm' on page 4001 for details.

H = n

specifies the quantile for the LTS estimate. See the section 'LTS Estimate' on page 4000 for how the default value is determined.

NBEST= n

specifies the number of best solutions kept for each subgroup during the computation of the LTS estimate. The default number is 10, which is the maximum number allowed.

NREP= n

specifies the number of repeats of least squares fit in subgroups during the computation of the LTS estimate. See the section 'LTS Estimate' on page 4000 for how the default number is determined.

SUBANALYSIS

requests a display of the subgrouping information and parameter estimates within subgroups. This option may generate the following ODS tables:

Table 62.5: ODS Tables Available with SUBANALYSIS
ODS Table Name	Description
BestEstimates	Best final estimates for LTS
BestSubEstimates	Best estimates for each subgroup
CStep	C-Step information for LTS
Groups	Grouping information for LTS

Some of these tables are data dependent.

SUBGROUPSIZE= n

specifies the data set size of the subgroups in the computation of the LTS estimate. The default number is 300.
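A sketch of an LTS fit that uses several of these suboptions follows. The data set and variable names are hypothetical, and H=60 is an arbitrary illustrative value for 100 observations, not the procedure's default. A nonzero SEED= is supplied so that the random subsampling can be repeated.

```sas
/* Hypothetical data with a few outliers in the response */
data work.demo;
   do i = 1 to 100;
      x1 = rannor(3);
      x2 = rannor(3);
      y  = 10 + 5*x1 + 3*x2 + 0.5*rannor(3);
      if i > 95 then y = y + 50;
      output;
   end;
run;

/* LTS estimation; NBEST= and SUBGROUPSIZE= restate their documented
   defaults, and SUBANALYSIS requests the subgroup tables */
proc robustreg data=work.demo
               method=lts(h=60 nbest=10 subgroupsize=300 subanalysis)
               seed=12345;
   model y = x1 x2;
run;
```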

Options with METHOD=S

With METHOD=S, you can specify the following additional options:

ASYMPCOV= H1 | H2 | H3 | H4

specifies the type of asymptotic covariance computed for the S estimate. The four types are described in the section 'Asymptotic Covariance and Confidence Intervals' on page 4005. By default, ASYMPCOV= H4.

CHIF= TUKEY | YOHAI

specifies the χ function for the S estimate. PROC ROBUSTREG provides two χ functions, Tukey's BISQUARE function and Yohai's OPTIMAL function, which you can request with CHIF=TUKEY and CHIF=YOHAI, respectively. The default is Tukey's bisquare function.

EFF= value

specifies the efficiency for the S estimate. The parameter k in the χ function is determined by this efficiency. The default efficiency is determined such that the consistent S estimate has the breakdown value of 25%.

MAXITER= n

sets the maximum number of iterations for computing the scale parameter of the S estimate. By default, MAXITER=1000.

NREP= n

specifies the number of repeats of subsampling in the computation of the S estimate. See the section 'Algorithm' on page 4004 for how the default number of repeats is determined.

NOREFINE

suppresses the refinement for the S estimate. See the section 'Algorithm' on page 4004 for details.

SUBSETSIZE= n

specifies the size of the subset for the S estimate. See the section 'Algorithm' on page 4004 for how its default value is determined.

TOLERANCE= value

specifies the tolerance for the S estimate of the scale. The default value is .001.

Options with METHOD=MM

With METHOD=MM, you can specify the following additional options:

ASYMPCOV= H1 | H2 | H3 | H4

specifies the type of asymptotic covariance computed for the MM estimate. The four types are described in the 'Details' section. By default, ASYMPCOV= H4.

BIASTEST<(ALPHA= number )>

requests the bias test for the final MM estimate. See the section 'Bias Test' on page 4008 for details about this test.

CHIF= TUKEY | YOHAI

selects the χ function for the MM estimate. PROC ROBUSTREG provides two χ functions: Tukey's BISQUARE function and Yohai's OPTIMAL function, which you can request with CHIF=TUKEY and CHIF=YOHAI, respectively. The default is Tukey's bisquare function. This function is also used by the initial S estimate if you specify the INITEST=S option.

CONVERGENCE= criterion < (EPS= number ) >

specifies a convergence criterion for the MM estimate.

Table 62.6: Options to Specify Convergence Criteria
Type	Option
residual	CONVERGENCE= RESID
weight	CONVERGENCE= WEIGHT
coefficient	CONVERGENCE= COEF

By default, CONVERGENCE=COEF. You can specify the precision of the convergence criterion with the EPS= option. By default, EPS=1.E-8.

EFF= value

specifies the efficiency for the MM estimate. The parameter k₁ in the χ function is determined by this efficiency. The default efficiency is set to 85%, which corresponds to k₁ = 3.440 for CHIF=TUKEY or k₁ = 0.868 for CHIF=YOHAI.

INITH= n

specifies the integer h for the initial LTS estimator used by the MM estimator. See the section 'Algorithm' on page 4007 for how to specify h and how the default is determined.

INITEST= LTS | S

specifies the initial estimator for the MM estimator. By default, the LTS estimator is used as the initial estimator for the MM estimator.

K0= number

specifies the parameter k₀ in the χ function for the MM estimate. For CHIF=TUKEY, the default is k₀ = 2.9366. For CHIF=YOHAI, the default is k₀ = 0.7405. These default values correspond to the 25% breakdown value of the MM estimator.

MAXITER= n

sets the maximum number of iterations during the parameter estimation. By default, MAXITER=1000.
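The following sketch shows one way the METHOD=MM suboptions might be combined. The data set and variables are hypothetical, and the ALPHA= value for the bias test is an illustrative choice. K0= restates the default documented above for CHIF=TUKEY.

```sas
/* Hypothetical data with a few outliers in the response */
data work.demo;
   do i = 1 to 100;
      x1 = rannor(4);
      x2 = rannor(4);
      y  = 10 + 5*x1 + 3*x2 + 0.5*rannor(4);
      if i > 95 then y = y + 50;
      output;
   end;
run;

/* MM estimation started from an S estimate instead of the default LTS
   estimate; SEED= makes the subsampling in the initial step repeatable */
proc robustreg data=work.demo
               method=mm(initest=s
                         chif=tukey
                         k0=2.9366
                         biastest(alpha=0.05)
                         maxiter=500)
               seed=100;
   model y = x1 x2;
run;
```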

BY Statement

BY variables ;

You can specify a BY statement with PROC ROBUSTREG to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables.

If your input data set is not sorted in ascending order, use one of the following alternatives:

- Sort the data using the SORT procedure with a similar BY statement.
- Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the ROBUSTREG procedure. The NOTSORTED option does not mean that the data are unsorted, but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
- Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the SAS Procedures Guide.
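The first alternative, sorting before the analysis, can be sketched as follows. The data set WORK.BYGROUPS and the variables region, x, and y are hypothetical; the groups are deliberately created out of alphabetical order so that the sort step is needed.

```sas
/* Hypothetical data: two groups written in the order West, East */
data work.bygroups;
   do region = 'West', 'East';
      do i = 1 to 50;
         x = rannor(5);
         y = 2 + 3*x + 0.5*rannor(5);
         output;
      end;
   end;
run;

/* Sort with the same BY statement that the analysis uses */
proc sort data=work.bygroups;
   by region;
run;

/* One robust fit per value of region */
proc robustreg data=work.bygroups;
   by region;
   model y = x;
run;
```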

CLASS Statement

CLASS variables ;

Explanatory variables that are classification variables rather than quantitative numeric variables must be listed in the CLASS statement. For each explanatory variable listed in the CLASS statement, indicator variables are generated for the levels assumed by the CLASS variable. If the CLASS statement is used, it must appear before the MODEL statement.

ID Statement

ID variables ;

When the diagnostics table is requested with the DIAGNOSTICS option in the MODEL statement, the variables listed in the ID statement are displayed beside the observation number. These variables can be used to identify each observation. If the ID statement is omitted, the observation number is used to identify the observations.

MODEL Statement

<label:> MODEL response = < effects > < / options > ;

Main effects and interaction terms can be specified in the MODEL statement, as in the GLM procedure. Class variables are not allowed in the MODEL statement when you specify MM estimation or LTS estimation using the METHOD= option in the PROC statement.

The optional label is used to label output from the matching MODEL statement.

Options

You can specify the following options for the model fit.

ALPHA= value

specifies the significance level for the confidence intervals for regression parameters. The value must be between 0 and 1. By default, ALPHA = 0.05.

CORRB

produces the estimated correlation matrix of the parameter estimates.

COVB

produces the estimated covariance matrix of the parameter estimates.

CUTOFF= value

specifies the multiplier of the cutoff value for outlier detection. By default, CUTOFF=3.

DIAGNOSTICS<(ALL)>

requests the outlier diagnostics. By default, only observations identified as outliers or leverage points are displayed. To request that all observations be displayed, specify the ALL option.

ITPRINT

displays the iteration history for the iteratively reweighted least squares algorithm used by M and MM estimation. You can also use this option in the PROC statement.

LEVERAGE<(CUTOFF= value CUTOFFALPHA= value QUANTILE= n )>

requests an analysis of leverage points for the continuous covariates. The results are added to the diagnostics table, which you can request with the DIAGNOSTICS option in the MODEL statement. You can specify the cutoff value for leverage point detection with the CUTOFF= option. The default cutoff value is $\sqrt{\chi^2_{p;\,1-\alpha}}$, where α can be specified with the CUTOFFALPHA= option. By default, α = 0.025. You can use the QUANTILE= option to specify the quantile to be minimized for the MCD algorithm used for the leverage point analysis. By default, QUANTILE=[(3n + p + 1)/4], where n is the number of observations and p is the number of independent variables. The LEVERAGE option is ignored if the model includes class variables as covariates.
Because the MCD algorithm uses subsampling, the leverage point analysis is not suitable for continuous variables that have only a few nonzero values, or only a few nonzero values within one BY group.

NOGOODFIT

suppresses the computation of goodness-of-fit statistics.

NOINT

specifies no-intercept regression.

SINGULAR= value

specifies the tolerance for testing singularity of the information matrix and the crossproducts matrix for the initial least-squares estimates. Roughly, the test requires that a pivot be at least this value times the original diagonal value. By default, SINGULAR=1.E-12.
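MODEL statement options follow a slash after the effects. The sketch below requests outlier and leverage diagnostics together with the covariance matrix of the estimates. The data set, the variables, and the ID variable obs_id are hypothetical; CUTOFF=, CUTOFFALPHA=, and ALPHA= simply restate the defaults given above.

```sas
/* Hypothetical data: outliers in y and a few high-leverage points in x1 */
data work.demo;
   do obs_id = 1 to 100;
      x1 = rannor(6);
      x2 = rannor(6);
      if obs_id <= 3 then x1 = x1 + 10;         /* leverage points */
      y  = 10 + 5*x1 + 3*x2 + 0.5*rannor(6);
      if obs_id > 95 then y = y + 50;           /* outliers */
      output;
   end;
run;

proc robustreg data=work.demo;
   model y = x1 x2 / diagnostics(all)
                     leverage(cutoffalpha=0.025)
                     cutoff=3
                     covb
                     alpha=0.05;
   id obs_id;   /* shown beside the observation number in the diagnostics table */
run;
```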

OUTPUT Statement

OUTPUT < OUT= SAS-data-set > keyword = name <... keyword = name > ;

The OUTPUT statement creates an output SAS data set containing statistics calculated after fitting the model. At least one specification of the form keyword = name is required.

All variables in the original data set are included in the new data set, along with the variables created with keyword options in the OUTPUT statement. These new variables contain fitted values and estimated quantiles. If you want to create a permanent SAS data set, you must specify a two-level name (refer to SAS Language Reference: Concepts for more information on permanent SAS data sets).

The following specifications can appear in the OUTPUT statement:

OUT= SAS-data-set	specifies the new data set. By default, the procedure uses the DATAn convention to name the new data set.
keyword=name	specifies the statistics to include in the output data set and gives names to the new variables. Specify a keyword for each desired statistic (see the following list), an equal sign, and the variable to contain the statistic.

The keywords allowed and the statistics they represent are as follows:

LEVERAGE	specifies a variable to indicate leverage points. To include this variable in the OUTPUT data set, you must specify the LEVERAGE option in the MODEL statement. See the section 'Leverage Point and Outlier Detection' on page 4010 for how to define LEVERAGE.
OUTLIER	specifies a variable to indicate outliers. See the section 'Leverage Point and Outlier Detection' on page 4010 for how to define OUTLIER.
PREDICTED | P	specifies a variable to contain the estimated response.
RESIDUAL | R	specifies a variable to contain the residuals.
SRESIDUAL | SR	specifies a variable to contain the standardized residuals.
STDP	specifies a variable to contain the estimates of the standard errors of the estimated response.
WEIGHT	specifies a variable to contain the computed final weights.
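A sketch of an OUTPUT statement that uses these keywords follows. The data set names and the new variable names (yhat, resid, and so on) are hypothetical. The LEVERAGE option is included in the MODEL statement because the LEVERAGE= keyword depends on the leverage point analysis.

```sas
/* Hypothetical data with a few outliers in the response */
data work.demo;
   do i = 1 to 100;
      x1 = rannor(8);
      x2 = rannor(8);
      y  = 10 + 5*x1 + 3*x2 + 0.5*rannor(8);
      if i > 95 then y = y + 50;
      output;
   end;
run;

proc robustreg data=work.demo;
   model y = x1 x2 / leverage;
   output out=work.robout
          p=yhat            /* estimated response */
          r=resid           /* residuals */
          sr=std_resid      /* standardized residuals */
          outlier=out_flag  /* outlier indicator */
          leverage=lev_flag /* leverage point indicator */
          weight=final_wt;  /* computed final weights */
run;

/* Inspect the first few rows of the new data set */
proc print data=work.robout(obs=5);
run;
```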

PERFORMANCE Statement

You use the PERFORMANCE statement to specify options that tune the performance of PROC ROBUSTREG. By default these options are chosen to maximize performance. See Chen (2002) for some empirical results.

PERFORMANCE < options > ;

The following option is available:

CPUCOUNT= n

specifies the number of threads to use in the computation of LTS or S estimation (initial LTS or S estimation for MM estimation). By default this will be equal to the number of processors on the hardware.

TEST Statement

<label:> TEST effects ;

With M estimation and MM estimation, the TEST statement provides a means for obtaining a test for the canonical linear hypothesis concerning the model parameters:

where p is the total number of parameters in the model, and q is the number of parameters for testing of significance.

PROC ROBUSTREG provides two kinds of robust tests: the ρ-test and the R_n²-test. They are described in the 'Details' section. No test is available for LTS and S estimation.

The optional label is used to label output from the corresponding TEST statement.
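A labeled TEST statement might look like the following sketch. The data set, the variables, and the label x3_test are hypothetical; x3 is simulated with no effect on y, so that testing it is a natural illustration.

```sas
/* Hypothetical data: x3 does not enter the true model for y */
data work.demo;
   do i = 1 to 100;
      x1 = rannor(9);
      x2 = rannor(9);
      x3 = rannor(9);
      y  = 10 + 5*x1 + 3*x2 + 0.5*rannor(9);
      if i > 95 then y = y + 50;
      output;
   end;
run;

/* Robust linear tests are available with M and MM estimation */
proc robustreg data=work.demo method=m;
   model y = x1 x2 x3;
   x3_test: test x3;   /* the label identifies the output of this test */
run;
```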

WEIGHT Statement

WEIGHT variable ;

The WEIGHT statement specifies a weight variable in the input data set.

If you want to use fixed weights for each observation in the input data set, place the weights in a variable in the data set and specify the name in a WEIGHT statement. The values of the WEIGHT variable can be nonintegral and are not truncated. Observations with nonpositive or missing values for the weight variable do not contribute to the fit of the model.
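The weighting rules above can be illustrated with a short sketch. The data set, the variables, and the weight values are hypothetical: most observations receive a weight of 1, some receive a nonintegral weight of 0.5, and a few receive a weight of 0 and therefore do not contribute to the fit.

```sas
/* Hypothetical data with a fixed weight variable w */
data work.weighted;
   do i = 1 to 100;
      x = rannor(10);
      y = 2 + 3*x + 0.5*rannor(10);
      if      i <= 5  then w = 0;     /* nonpositive: excluded from the fit */
      else if i <= 30 then w = 0.5;   /* nonintegral weights are not truncated */
      else                 w = 1;
      output;
   end;
run;

proc robustreg data=work.weighted;
   model y = x;
   weight w;
run;
```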