Details | SAS/STAT 9.1 User's Guide (Vol. 4)

Input Data Sets

You specify input data sets based on the type of inference you request. For univariate inference, you can use one of the following options:

a DATA= data set, which provides both parameter estimates and the associated standard errors.
a DATA= type EST, COV, or CORR data set, which provides both parameter estimates and the associated standard errors either explicitly (type CORR) or through the covariance matrix (type EST, COV).
a PARMS= data set, which provides both parameter estimates and the associated standard errors.

For multivariate inference, which includes the testing of linear hypotheses about parameters, you can use one of the following option combinations:

a DATA= type EST, COV, or CORR data set, which provides parameter estimates and the associated covariance matrix either explicitly (type EST, COV) or through the correlation matrix and standard errors (type CORR) in a single data set.
PARMS= and COVB= data sets, which provide parameter estimates in a PARMS= data set and the associated covariance matrix in a COVB= data set.
PARMS=, COVB=, and PARMINFO= data sets, which provide parameter estimates in a PARMS= data set, the associated covariance matrix in a COVB= data set with variables named Prm1, Prm2, ..., and the effects associated with these variables in a PARMINFO= data set.
PARMS= and XPXI= data sets, which provide parameter estimates and the associated standard errors in a PARMS= data set and the associated $(X'X)^{-1}$ matrix in an XPXI= data set.

The appropriate combination depends on the type of inference and the SAS procedure you used to create the data sets. For instance, if you used PROC REG to create an OUTEST= data set containing the parameter estimates and covariance matrix, you would use the DATA= option to read the OUTEST= data set.

When the input DATA= data set is not a specially structured SAS data set, each observation corresponds to an imputation and contains both parameter estimates and associated standard errors. For the other input data sets, each data set must contain the variable _Imputation_ to identify the imputation by number.

If you do not specify an input data set with the DATA= or PARMS= option, then the most recently created SAS data set is used as an input DATA= data set. Note that with a DATA= data set, each effect represents a continuous variable; only regressor effects (continuous variables by themselves) are allowed in the MODELEFFECTS statement.

DATA= SAS data set

The DATA= data set provides both parameter estimates and the associated standard errors computed from imputed data sets. Such data sets are typically created with an OUTPUT statement using procedures such as PROC MEANS and PROC UNIVARIATE.

The MIANALYZE procedure reads parameter estimates from observations with variables in the MODELEFFECTS statement, and standard errors for parameter estimates from observations with variables in the STDERR statement. The order of the variables for standard errors must match the order of the variables for parameter estimates.
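As a rough sketch of this kind of input, the following statements build one observation per imputation with PROC UNIVARIATE and pass it to PROC MIANALYZE. The data set outmi and the variables x1, x2, sx1, and sx2 are hypothetical; the DATA step only fabricates a stand-in for stacked PROC MI output so that the example runs on its own.

```sas
/* Made-up stand-in for stacked PROC MI output (3 imputations, 25 rows each) */
data outmi;
   do _Imputation_ = 1 to 3;
      do i = 1 to 25;
         x1 = 10 + 2*rannor(101);
         x2 = 50 + 5*rannor(101);
         output;
      end;
   end;
   drop i;
run;

/* One observation per imputation: sample means and their standard errors */
proc univariate data=outmi noprint;
   var x1 x2;
   output out=outuni mean=x1 x2 stderr=sx1 sx2;
   by _Imputation_;
run;

/* STDERR variables are listed in the same order as the MODELEFFECTS variables */
proc mianalyze data=outuni;
   modeleffects x1 x2;
   stderr sx1 sx2;
run;
```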

DATA= type EST, COV, or CORR SAS data set

The specially structured DATA= data set provides both parameter estimates and the associated covariance matrix computed from imputed data sets. Such data sets are created by procedures such as PROC CORR (type COV, CORR) and PROC REG (type EST).

With TYPE=EST, the MIANALYZE procedure reads parameter estimates from observations with _TYPE_='PARM', _TYPE_='PARMS', _TYPE_='OLS', or _TYPE_='FINAL', and covariance matrices for parameter estimates from observations with _TYPE_='COV' or _TYPE_='COVB'.

With TYPE=COV, the procedure reads sample means from observations with _TYPE_='MEAN', sample size n from observations with _TYPE_='N', and covariance matrices for variables from observations with _TYPE_='COV'.

With TYPE=CORR, the procedure reads sample means from observations with _TYPE_='MEAN', sample size n from observations with _TYPE_='N', correlation matrices for variables from observations with _TYPE_='CORR', and standard errors for variables from observations with _TYPE_='STD'. The standard errors and correlation matrix are used to generate a covariance matrix for the variables.

Note that with TYPE=COV or CORR, each covariance matrix for the variables is divided by n to create the covariance matrix for the sample means.
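The following sketch shows one way a TYPE=EST data set might be produced and read, assuming PROC REG with the OUTEST= and COVOUT options. The data set and variable names (outmi, outreg, y, x1, x2) are hypothetical, and the DATA step only fabricates a stand-in for stacked imputed data.

```sas
/* Made-up stand-in for stacked PROC MI output (3 imputations, 25 rows each) */
data outmi;
   do _Imputation_ = 1 to 3;
      do i = 1 to 25;
         x1 = 10 + 2*rannor(202);
         x2 = 50 + 5*rannor(202);
         y  = 5 + 2*x1 - 0.1*x2 + rannor(202);
         output;
      end;
   end;
   drop i;
run;

/* COVOUT adds the covariance matrix rows to the OUTEST= (TYPE=EST) data set */
proc reg data=outmi outest=outreg covout noprint;
   model y = x1 x2;
   by _Imputation_;
run;
quit;

/* The specially structured data set is read with the DATA= option */
proc mianalyze data=outreg;
   modeleffects Intercept x1 x2;
run;
```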

PARMS <(CLASSVAR=ctype)>= data set

The PARMS= data set contains parameter estimates and associated standard errors computed from imputed data sets. Such data sets are typically created with an ODS OUTPUT statement using procedures such as PROC GENMOD, PROC GLM, PROC LOGISTIC, and PROC MIXED.

The MIANALYZE procedure reads effect names from observations with the variable Parameter, Effect, Variable, or Parm. It then reads parameter estimates from observations with the variable Estimate and standard errors for parameter estimates from observations with the variable StdErr.

When the effects contain CLASS variables, the option CLASSVAR=ctype can be used to identify associated CLASS variables when reading the CLASS levels from observations. The available types are FULL, LEVEL, and CLASSVAL. The default is CLASSVAR=FULL.

With CLASSVAR=FULL, the data set contains the CLASS variables explicitly. PROC MIANALYZE reads the CLASS levels from observations with their corresponding CLASS variables. PROC MIXED generates this type of table.

With CLASSVAR=LEVEL, PROC MIANALYZE reads the classification levels for the effect from observations with variables Level1, Level2, and so on, where the variable Level1 contains the classification level for the first CLASS variable in the effect, and the variable Level2 contains the classification level for the second CLASS variable in the effect. For each effect, the variables in the crossed list are displayed before the variables in the nested list. The variable order in the CLASS statement is used for variables inside each list. PROC GENMOD generates this type of table.

For example, with the following statements,

   proc mianalyze parms(classvar=Level)=dataparm;
      class c1 c2 c3;
      modeleffects c2 c3(c2 c1);
   run;

the variable Level1 has the classification level of the variable c2 for the effect c2. For the effect c3(c2 c1), the variable Level1 has the classification level of the variable c3, Level2 has the level of c1, and Level3 has the level of c2.

Similarly, with CLASSVAR=CLASSVAL, PROC MIANALYZE reads the classification levels for the effect from observations with variables ClassVal0, ClassVal1, and so on, where the variable ClassVal0 contains the classification level for the first CLASS variable in the effect, and the variable ClassVal1 contains the classification level for the second CLASS variable in the effect. For each effect, the variables in the crossed list are displayed before the variables in the nested list. The variable order in the CLASS statement is used for variables inside each list. PROC LOGISTIC generates this type of table.
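A minimal sketch of the CLASSVAR=FULL case, assuming the fixed-effects solution table from PROC MIXED is saved with ODS OUTPUT. The data set outmi, the CLASS variable grp, and the other variable names are hypothetical; the DATA step fabricates a stand-in for stacked imputed data.

```sas
/* Made-up stand-in for stacked PROC MI output with a 3-level CLASS variable */
data outmi;
   do _Imputation_ = 1 to 3;
      do i = 1 to 30;
         grp = mod(i, 3) + 1;
         x1  = 10 + 2*rannor(303);
         y   = grp + 2*x1 + rannor(303);
         output;
      end;
   end;
   drop i;
run;

/* SolutionF carries the CLASS variable grp as a column of its own */
proc mixed data=outmi;
   class grp;
   model y = grp x1 / solution;
   by _Imputation_;
   ods output SolutionF=mixparms;
run;

/* CLASS levels are read from the grp column of the PARMS= data set */
proc mianalyze parms(classvar=full)=mixparms;
   class grp;
   modeleffects Intercept grp x1;
run;
```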

PARMS <(CLASSVAR=ctype)>= and COVB= data sets

The PARMS= data set contains parameter estimates and the COVB= data set contains associated covariance matrices computed from imputed data sets. Such data sets are typically created with an ODS OUTPUT statement using procedures such as PROC LOGISTIC, PROC MIXED, and PROC REG.

With a PARMS= data set, the MIANALYZE procedure reads effect names from observations with the variable Parameter, Effect, Variable, or Parm. It then reads parameter estimates from observations with the variable Estimate.

When the effects contain CLASS variables, the option CLASSVAR=ctype can be used to identify the associated CLASS variables when reading the CLASS levels from observations. The available types are FULL, LEVEL, and CLASSVAL, and are described in the 'PARMS <(CLASSVAR=ctype)>= data set' section. The default is CLASSVAR=FULL.

The option EFFECTVAR= etype identifies the variables for parameters displayed in the covariance matrix. The available types are STACKING and ROWCOL. The default is EFFECTVAR=STACKING.

With EFFECTVAR=STACKING, each parameter is displayed by stacking variables in the effect. The stacking begins with the variables in the crossed list, followed by the continuous list, and then the nested list. Each CLASS variable is displayed with its CLASS level attached. PROC LOGISTIC generates this type of table.

When each effect is a continuous variable by itself, each stacked parameter name reduces to the effect name. PROC REG generates this type of table.

With EFFECTVAR=STACKING, the MIANALYZE procedure reads parameter names from observations with the variable Parameter, Effect, Variable, Parm, or RowName. It then reads covariance matrices from observations with the stacked variables in a COVB= data set.

With EFFECTVAR=ROWCOL, parameters are displayed by the variables Col1, Col2, .... The parameter associated with the variable Col1 is identified by the observation with value 1 for the variable Row. The parameter associated with the variable Col2 is identified by the observation with value 2 for the variable Row. PROC MIXED generates this type of table.

With EFFECTVAR=ROWCOL, the MIANALYZE procedure reads the parameter indices from observations with the variable Row, the effect names from observations with the variable Parameter, Effect, Variable, Parm, or RowName, and covariance matrices from observations with the variables Col1, Col2, ... in a COVB= data set.

When the effects contain CLASS variables, the data set contains the CLASS variables explicitly and the MIANALYZE procedure also reads the CLASS levels from their corresponding CLASS variables.
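The PARMS= and COVB= combination with EFFECTVAR=ROWCOL might look like the following sketch, which assumes the SolutionF and CovB tables from PROC MIXED. All data set and variable names are hypothetical, and the DATA step fabricates a stand-in for stacked imputed data.

```sas
/* Made-up stand-in for stacked PROC MI output (3 imputations, 25 rows each) */
data outmi;
   do _Imputation_ = 1 to 3;
      do i = 1 to 25;
         x1 = 10 + 2*rannor(404);
         x2 = 50 + 5*rannor(404);
         y  = 5 + 2*x1 - 0.1*x2 + rannor(404);
         output;
      end;
   end;
   drop i;
run;

/* COVB in the MODEL statement requests the covariance matrix table */
proc mixed data=outmi;
   model y = x1 x2 / solution covb;
   by _Imputation_;
   ods output SolutionF=mixparms CovB=mixcovb;
run;

/* The CovB table uses Row and Col1, Col2, ..., hence EFFECTVAR=ROWCOL */
proc mianalyze parms=mixparms covb(effectvar=rowcol)=mixcovb;
   modeleffects Intercept x1 x2;
run;
```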

PARMS <(CLASSVAR=ctype)>=, PARMINFO=, and COVB= data sets

The input PARMS= data set contains parameter estimates and the input COVB= data set contains associated covariance matrices computed from imputed data sets. Such data sets are typically created with an ODS OUTPUT statement using a procedure such as PROC GENMOD.

With a COVB= data set, the MIANALYZE procedure reads parameter names from observations with the variable Parameter, Effect, Variable, Parm, or RowName and covariance matrices from observations with variables Prm1, Prm2, and so on.

The parameters associated with the variables Prm1, Prm2, ... are identified in the PARMINFO= data set. PROC MIANALYZE reads the parameter names from observations with the variable Parameter and the corresponding effect from observations with the variable Effect. When the effects contain CLASS variables, the data set contains the CLASS variables explicitly and the MIANALYZE procedure also reads the CLASS levels from observations with their corresponding CLASS variables.
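As an illustration of the three-data-set combination, the sketch below assumes the ParameterEstimates, ParmInfo, and CovB tables from PROC GENMOD. The data set and variable names are hypothetical, and the DATA step fabricates a stand-in for stacked imputed data.

```sas
/* Made-up stand-in for stacked PROC MI output (3 imputations, 25 rows each) */
data outmi;
   do _Imputation_ = 1 to 3;
      do i = 1 to 25;
         x1 = 10 + 2*rannor(505);
         x2 = 50 + 5*rannor(505);
         y  = 5 + 2*x1 - 0.1*x2 + rannor(505);
         output;
      end;
   end;
   drop i;
run;

/* ParmInfo maps Prm1, Prm2, ... in the CovB table to model effects */
proc genmod data=outmi;
   model y = x1 x2 / covb;
   by _Imputation_;
   ods output ParameterEstimates=gmparms ParmInfo=gmpinfo CovB=gmcovb;
run;

proc mianalyze parms=gmparms covb=gmcovb parminfo=gmpinfo;
   modeleffects Intercept x1 x2;
run;
```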

PARMS= and XPXI= data sets

The input PARMS= data set contains parameter estimates and the input XPXI= data set contains associated $(X'X)^{-1}$ matrices computed from imputed data sets. Such data sets are typically created with an ODS OUTPUT statement using a procedure such as PROC GLM.

With a PARMS= data set, the MIANALYZE procedure reads parameter names from observations with the variable Parameter, Effect, Variable, or Parm. It then reads parameter estimates from observations with the variable Estimate and standard errors for parameter estimates from observations with the variable StdErr.

With an XPXI= data set, the MIANALYZE procedure reads parameter names from observations with the variable Parameter and $(X'X)^{-1}$ matrices from observations with the parameter variables in the data set.

Note that this combination can only be used when each effect is a continuous variable by itself.
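A possible way to create and read this combination is sketched below, assuming the ParameterEstimates and InvXPX tables from PROC GLM (the latter requested with the INVERSE option). Data set and variable names are hypothetical, and the DATA step fabricates a stand-in for stacked imputed data.

```sas
/* Made-up stand-in for stacked PROC MI output (3 imputations, 25 rows each) */
data outmi;
   do _Imputation_ = 1 to 3;
      do i = 1 to 25;
         x1 = 10 + 2*rannor(606);
         x2 = 50 + 5*rannor(606);
         y  = 5 + 2*x1 - 0.1*x2 + rannor(606);
         output;
      end;
   end;
   drop i;
run;

/* INVERSE displays the (X'X) inverse matrix, saved here as glmxpxi */
proc glm data=outmi;
   model y = x1 x2 / inverse;
   by _Imputation_;
   ods output ParameterEstimates=glmparms InvXPX=glmxpxi;
quit;

/* Every effect is a continuous variable by itself, as this combination requires */
proc mianalyze parms=glmparms xpxi=glmxpxi;
   modeleffects Intercept x1 x2;
run;
```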

Combining Inferences from Imputed Data Sets

With m imputations, m different sets of the point and variance estimates for a parameter Q can be computed. Suppose that $\hat{Q}_i$ and $\hat{U}_i$ are the point and variance estimates from the ith imputed data set, i = 1, 2, ..., m. Then the combined point estimate for Q from multiple imputation is the average of the m complete-data estimates:

$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i$$

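Suppose that W is the within-imputation variance, which is the average of the m complete-data variance estimates $\hat{U}_i$:

$$W = \frac{1}{m} \sum_{i=1}^{m} \hat{U}_i$$

and that B is the between-imputation variance

$$B = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{Q}_i - \bar{Q})^2$$

where $\bar{Q}$ is the combined point estimate. Then the variance estimate associated with $\bar{Q}$ is the total variance (Rubin 1987)

$$T = W + \left(1 + \frac{1}{m}\right) B$$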

The statistic $(Q - \bar{Q})\,T^{-1/2}$ is approximately distributed as t with $v_m$ degrees of freedom (Rubin 1987), where

$$v_m = (m - 1) \left[ 1 + \frac{W}{(1 + m^{-1})\,B} \right]^2$$

The degrees of freedom $v_m$ depends on m and the ratio

$$r = \frac{(1 + m^{-1})\,B}{W}$$

The ratio r is called the relative increase in variance due to nonresponse (Rubin 1987). When there is no missing information about Q, the values of r and B are both zero. With a large value of m or a small value of r, the degrees of freedom $v_m$ will be large and the distribution of $(Q - \bar{Q})\,T^{-1/2}$ will be approximately normal.

Another useful statistic is the fraction of missing information about Q:

$$\lambda = \frac{r + 2/(v_m + 3)}{r + 1}$$

Both statistics r and λ are helpful diagnostics for assessing how the missing data contribute to the uncertainty about Q.
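The combining rules in this section can be traced by hand in a DATA step. The sketch below is only an illustration: the five point estimates and variances are made-up numbers, and the variable names are hypothetical. The total variance uses Rubin's (1987) form, within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```sas
data pooled;
   /* Made-up complete-data estimates and variances from m = 5 imputations */
   array q[5] _temporary_ (10.2 9.8 10.5 10.1 9.9);
   array u[5] _temporary_ (0.40 0.38 0.42 0.41 0.39);

   m      = dim(q);
   qbar   = mean(of q[*]);                /* combined point estimate            */
   w      = mean(of u[*]);                /* within-imputation variance         */
   b      = var(of q[*]);                 /* between-imputation variance, m-1   */
   t      = w + (1 + 1/m)*b;              /* total variance                     */
   se     = sqrt(t);                      /* combined standard error            */
   r      = (1 + 1/m)*b / w;              /* relative increase in variance      */
   vm     = (m - 1)*(1 + 1/r)**2;         /* degrees of freedom                 */
   lambda = (r + 2/(vm + 3)) / (r + 1);   /* fraction of missing information    */
run;

proc print data=pooled noobs;
run;
```

PROC MIANALYZE performs these computations itself; a hand calculation like this is mainly useful for checking how B and W drive r, the degrees of freedom, and λ.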

When the complete-data degrees of freedom v is small, and there is only a modest proportion of missing data, the computed degrees of freedom, $v_m$, can be much larger than v, which is inappropriate. For example, with m = 5 and r = 10%, the computed degrees of freedom $v_m$ = 484, which is inappropriate for data sets with complete-data degrees of freedom less than 484.

Barnard and Rubin (1999) recommend the use of an adjusted degrees of freedom

$$v_m^* = \left[ \frac{1}{v_m} + \frac{1}{\hat{v}_{obs}} \right]^{-1}$$

where $\hat{v}_{obs} = (1 - \gamma)\, v (v + 1) / (v + 3)$ and $\gamma = (1 + m^{-1})\, B / T$.

If you specify the complete-data degrees of freedom v with the EDF= option, the MIANALYZE procedure uses the adjusted degrees of freedom, $v_m^*$, for inference. Otherwise, the degrees of freedom $v_m$ is used.
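The effect of the adjustment can be checked with a short DATA step. This sketch reuses m = 5 and r = 10% from the example in the text and assumes a hypothetical complete-data degrees of freedom of v = 30. Because T = W(1 + r), the quantity (1 + 1/m)B/T equals r/(1 + r), which the code uses to obtain gamma from r alone.

```sas
data df_adjust;
   m = 5;
   r = 0.10;                               /* relative increase in variance      */
   v = 30;                                 /* hypothetical complete-data df      */

   vm    = (m - 1)*(1 + 1/r)**2;           /* unadjusted df: 4 * 11**2 = 484     */
   gamma = r / (1 + r);                    /* (1 + 1/m)*B/T expressed through r  */
   vobs  = (1 - gamma)*v*(v + 1)/(v + 3);
   vstar = 1 / (1/vm + 1/vobs);            /* Barnard-Rubin adjusted df          */
run;

proc print data=df_adjust noobs;
run;
```

Since the adjusted value is the reciprocal of a sum of reciprocals, it is smaller than both vm and vobs, and vobs itself is smaller than v, so the adjusted degrees of freedom cannot exceed the complete-data degrees of freedom.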

Multiple Imputation Efficiency

The relative efficiency (RE) of using the finite m imputation estimator, rather than using an infinite number for the fully efficient imputation, in units of variance, is approximately a function of m and λ (Rubin 1987, p. 114):

$$RE = \left(1 + \frac{\lambda}{m}\right)^{-1}$$

The following table shows relative efficiencies with different values of m and λ.

Table 45.2: Relative Efficiency
m	λ = 10%	λ = 20%	λ = 30%	λ = 50%	λ = 70%
3	0.9677	0.9375	0.9091	0.8571	0.8108
5	0.9804	0.9615	0.9434	0.9091	0.8772
10	0.9901	0.9804	0.9709	0.9524	0.9346
20	0.9950	0.9901	0.9852	0.9756	0.9662

The table shows that for situations with little missing information, only a small number of imputations are necessary. In practice, the number of imputations needed can be informally verified by replicating sets of m imputations and checking whether the estimates are stable between sets (Horton and Lipsitz 2001, p. 246).
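The entries of Table 45.2 can be regenerated with a short DATA step, assuming the relative efficiency takes Rubin's (1987) form RE = 1/(1 + λ/m). The data set and variable names here are hypothetical.

```sas
data releff;
   do m = 3, 5, 10, 20;
      do lambda = 0.1, 0.2, 0.3, 0.5, 0.7;
         re = 1 / (1 + lambda/m);     /* relative efficiency in units of variance */
         output;
      end;
   end;
   format re 6.4 lambda percent6.;
run;

proc print data=releff noobs;
run;
```

Adding further values of m or lambda to the DO lists extends the table, which can help when judging whether more imputations are worthwhile for a given fraction of missing information.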

Multivariate Inferences

Multivariate inference based on Wald tests can be done with m imputed data sets. The approach is a generalization of the approach taken in the univariate case (Rubin 1987, p. 137; Schafer 1997, p. 113). Suppose that $\hat{Q}_i$ and $\hat{U}_i$ are the point and covariance matrix estimates for a p-dimensional parameter Q (such as a multivariate mean) from the ith imputed data set, i = 1, 2, ..., m. Then the combined point estimate for Q from the multiple imputation is the average of the m complete-data estimates:

$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i$$

Suppose that $\bar{U}$ is the within-imputation covariance matrix, which is the average of the m complete-data estimates

$$\bar{U} = \frac{1}{m} \sum_{i=1}^{m} \hat{U}_i$$

and suppose that B is the between-imputation covariance matrix

$$B = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{Q}_i - \bar{Q})(\hat{Q}_i - \bar{Q})'$$

Then the covariance matrix associated with $\bar{Q}$ is the total covariance matrix

$$T_0 = \bar{U} + \left(1 + \frac{1}{m}\right) B$$

The natural multivariate extension of the t statistic used in the univariate case is the F statistic

$$F_0 = (Q - \bar{Q})'\, T_0^{-1}\, (Q - \bar{Q})$$

with degrees of freedom p and

$$v = (m - 1) \left(1 + \frac{1}{r}\right)^2$$

where

$$r = \left(1 + \frac{1}{m}\right) \mathrm{trace}(B\, \bar{U}^{-1}) / p$$

is an average relative increase in variance due to nonresponse (Rubin 1987, p. 137; Schafer 1997, p. 114).

However, the reference distribution of the statistic F is not easily derived. Especially for small m, the between-imputation covariance matrix B is unstable and does not have full rank for m ≤ p (Schafer 1997, p. 113).

One solution is to make an additional assumption that the population between-imputation and within-imputation covariance matrices are proportional to each other (Schafer 1997, p. 113). This assumption implies that the fractions of missing information for all components of Q are equal. Under this assumption, a more stable estimate of the total covariance matrix is

$$T = (1 + r)\, \bar{U}$$

where $\bar{U}$ is the within-imputation covariance matrix and r is the average relative increase in variance due to nonresponse.

With the total covariance matrix T, the F statistic (Rubin 1987, p. 137)

$$F = (Q - \bar{Q})'\, T^{-1}\, (Q - \bar{Q}) / p$$

has an F distribution with degrees of freedom p and v₁, where

$$v_1 = \frac{1}{2} (p + 1)(m - 1) \left(1 + \frac{1}{r}\right)^2$$

For t = p(m − 1) ≤ 4, PROC MIANALYZE uses the degrees of freedom v₁ in the analysis. For t = p(m − 1) > 4, PROC MIANALYZE uses v₂, a better approximation of the degrees of freedom given by Li, Raghunathan, and Rubin (1991):

$$v_2 = 4 + (t - 4) \left[ 1 + \frac{1}{r} \left(1 - \frac{2}{t}\right) \right]^2$$
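The choice between the two denominator degrees of freedom can be sketched in a DATA step. The values of p, m, and r below are hypothetical, and the expressions for v1 (Rubin 1987) and v2 (Li, Raghunathan, and Rubin 1991) are written from the published forms as an assumption about what the procedure computes.

```sas
data mult_df;
   p = 3;                                  /* hypothetical dimension of Q          */
   m = 5;                                  /* number of imputations                */
   r = 0.20;                               /* hypothetical average relative increase */

   t  = p*(m - 1);
   v1 = 0.5*(p + 1)*(m - 1)*(1 + 1/r)**2;
   v2 = 4 + (t - 4)*(1 + (1/r)*(1 - 2/t))**2;

   /* v2 is used when t > 4, otherwise v1 */
   if t > 4 then ddf = v2;
   else ddf = v1;
run;

proc print data=mult_df noobs;
run;
```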

Testing Linear Hypotheses about the Parameters

Linear hypotheses for parameters β are expressed in matrix form as

H₀: Lβ = c

where L is a matrix of coefficients for the linear hypotheses, and c is a vector of constants.

Suppose that $\hat{Q}_i$ and $\hat{U}_i$ are the point and covariance matrix estimates for a p-dimensional parameter Q from the ith imputed data set, i = 1, 2, ..., m. Then for a given matrix L, the point and covariance matrix estimates for the linear functions LQ in the ith imputed data set are

$L\hat{Q}_i$ and $L\hat{U}_i L'$

The inferences described in the 'Combining Inferences from Imputed Data Sets' section and the 'Multivariate Inferences' section are applied to these linear estimates for testing the null hypothesis H₀: Lβ = c.

For each TEST statement, the 'Test Specification' table displays the L matrix and the c vector, the 'Multiple Imputation Variance Information' table displays the between-imputation, within-imputation, and total variances for combining complete-data inferences, and the 'Multiple Imputation Parameter Estimates' table displays a combined estimate and standard error for each linear component.

With the WCOV and BCOV options in the TEST statement, the procedure displays the within-imputation and between-imputation covariance matrices, respectively.

With the TCOV option, the procedure displays the total covariance matrix derived under the assumption that the population between-imputation and within-imputation covariance matrices are proportional to each other.

With the MULT option in the TEST statement, the 'Multiple Imputation Multivariate Inference' table displays an F test for the null hypothesis Lβ = c of the linear components.
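A TEST statement with the MULT option might be used as in the following sketch. The data set and variable names are hypothetical, the DATA step fabricates a stand-in for stacked imputed data, and the two equations are arbitrary examples of rows of L and elements of c.

```sas
/* Made-up stand-in for stacked PROC MI output (3 imputations, 25 rows each) */
data outmi;
   do _Imputation_ = 1 to 3;
      do i = 1 to 25;
         x1 = 10 + 2*rannor(707);
         x2 = 50 + 5*rannor(707);
         y  = 5 + 2*x1 - 0.1*x2 + rannor(707);
         output;
      end;
   end;
   drop i;
run;

proc reg data=outmi outest=outreg covout noprint;
   model y = x1 x2;
   by _Imputation_;
run;
quit;

/* Two linear hypotheses tested jointly; MULT requests the F test */
proc mianalyze data=outreg;
   modeleffects Intercept x1 x2;
   test Intercept=0, x1=x2 / mult;
run;
```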

Examples of the Complete-Data Inferences

For a given parameter of interest, it is not always possible to compute the estimate and associated covariance matrix directly from a SAS procedure. This section describes examples of parameters with their estimates and associated covariance matrices, which provide the input to the MIANALYZE procedure. Some are straightforward, and others require special techniques.

Means

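For a population mean vector μ, the usual estimate is the sample mean vector

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

A variance estimate for $\bar{y}$ is $\frac{1}{n} S$, where S is the sample covariance matrix

$$S = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})'$$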

These statistics can be computed from a procedure such as CORR. This approach is illustrated in Example 45.2.

Regression Coefficients

Many SAS procedures are available for regression analysis. Among them, PROC REG provides the most general analysis capabilities, and others like PROC LOGISTIC and PROC MIXED provide more specialized analyses.

Some regression procedures, such as REG and LOGISTIC, create an EST type data set that contains both the parameter estimates for the regression coefficients and their associated covariance matrix. You can read an EST type data set in the MIANALYZE procedure with the DATA= option. This approach is illustrated in Example 45.3.

Other procedures, such as GLM, MIXED, and GENMOD, do not generate EST type data sets for regression coefficients. For PROC MIXED and PROC GENMOD, you can use the ODS OUTPUT statement to save parameter estimates in a data set and the associated covariance matrix in a separate data set. These data sets are then read in the MIANALYZE procedure with the PARMS= and COVB= options, respectively. This approach is illustrated in Example 45.4 for PROC MIXED and in Example 45.5 for PROC GENMOD.

PROC GLM does not display tables for covariance matrices. However, you can use the ODS OUTPUT statement to save parameter estimates and associated standard errors in a data set and the associated $(X'X)^{-1}$ matrix in a separate data set. These data sets are then read in the MIANALYZE procedure with the PARMS= and XPXI= options, respectively. This approach is illustrated in Example 45.6.

For univariate inference, only parameter estimates and associated standard errors are needed. You can use the ODS OUTPUT statement to save parameter estimates and associated standard errors in a data set. This data set is then read in the MIANALYZE procedure with the PARMS= option. This approach is illustrated in Example 45.4.

Correlation Coefficients

For the population correlation coefficient ρ, a point estimate is the sample correlation coefficient r. However, for nonzero ρ, the distribution of r is skewed.

The distribution of r can be normalized through Fisher's z transformation

$$z(r) = \frac{1}{2} \log\left(\frac{1 + r}{1 - r}\right)$$

z(r) is approximately normally distributed with mean z(ρ) and variance 1/(n − 3).

With a point estimate $\hat{z}$ and an approximate 95% confidence interval (z₁, z₂) for z(ρ), a point estimate $\hat{r}$ and a 95% confidence interval (r₁, r₂) for ρ can be obtained by applying the inverse transformation

$$r = \tanh(z) = \frac{e^{2z} - 1}{e^{2z} + 1}$$

to z = $\hat{z}$, z₁, and z₂.

This approach is illustrated in Example 45.10.
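The transformation and its inverse can be sketched for a single estimate in a DATA step. The correlation r = 0.6 and sample size n = 50 are made-up values, and the variable names are hypothetical. In a multiple imputation analysis, the z values and their variances from each imputed data set would be combined before the back-transformation is applied.

```sas
data fisherz;
   r = 0.6;                                /* hypothetical sample correlation   */
   n = 50;                                 /* hypothetical sample size          */

   z    = 0.5*log((1 + r)/(1 - r));        /* Fisher's z transformation         */
   se_z = sqrt(1/(n - 3));                 /* approximate standard error of z   */

   z1 = z - probit(0.975)*se_z;            /* approximate 95% limits for z(rho) */
   z2 = z + probit(0.975)*se_z;

   rhat = tanh(z);                         /* inverse transformation            */
   r1   = tanh(z1);
   r2   = tanh(z2);
run;

proc print data=fisherz noobs;
run;
```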

Ratios of Variable Means

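For the ratio μ₁/μ₂ of means for variables Y₁ and Y₂, the point estimate is $\bar{y}_1 / \bar{y}_2$, the ratio of the sample means. The Taylor expansion and delta method can be applied to the function $\bar{y}_1 / \bar{y}_2$ to obtain the variance estimate (Schafer 1997, p. 196)

$$\frac{1}{n} \left[ \left(\frac{\bar{y}_1}{\bar{y}_2}\right)^2 \left( \frac{s_{11}}{\bar{y}_1^2} - 2\,\frac{s_{12}}{\bar{y}_1 \bar{y}_2} + \frac{s_{22}}{\bar{y}_2^2} \right) \right]$$

where s₁₁ and s₂₂ are the sample variances of Y₁ and Y₂, respectively, and s₁₂ is the sample covariance between Y₁ and Y₂.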

A ratio of sample means will be approximately unbiased and normally distributed if the coefficient of variation of the denominator (the standard error for the mean divided by the estimated mean) is 10% or less (Cochran 1977, p. 166; Schafer 1997, p. 196).
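For a single complete data set, the ratio estimate, its delta-method variance, and the coefficient of variation of the denominator could be computed as in the sketch below. All input values and variable names are hypothetical.

```sas
data ratio_est;
   /* Made-up complete-data summary statistics */
   n = 40;
   ybar1 = 12;  ybar2 = 8;
   s11 = 9;  s22 = 4;  s12 = 3;

   rhat = ybar1 / ybar2;                           /* ratio of sample means       */

   /* Delta-method variance estimate of the ratio */
   var_rhat = (1/n) * rhat**2 *
              (s11/ybar1**2 - 2*s12/(ybar1*ybar2) + s22/ybar2**2);
   se_rhat  = sqrt(var_rhat);

   /* Standard error of the denominator mean divided by the mean itself */
   cv_denom = sqrt(s22/n) / ybar2;
run;

proc print data=ratio_est noobs;
run;
```

Computing rhat and se_rhat for each imputed data set gives one observation per imputation, the layout that a DATA= data set with MODELEFFECTS and STDERR statements expects.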

ODS Table Names

PROC MIANALYZE assigns a name to each table it creates. You must use these names to reference tables when using the Output Delivery System (ODS). These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'

Table 45.3: ODS Tables Produced in PROC MIANALYZE
ODS Table Name	Description	Statement	Option
BCov	Between-imputation covariance matrix		BCOV
ModelInfo	Model information
MultStat	Multivariate inference		MULT
ParameterEstimates	Parameter estimates
TCov	Total covariance matrix		TCOV
TestBCov	Between-imputation covariance matrix for Lβ	TEST	BCOV
TestMultStat	Multivariate inference for Lβ	TEST	MULT
TestParameterEstimates	Parameter estimates for Lβ	TEST
TestSpec	Test specification, L and c	TEST
TestTCov	Total covariance matrix for Lβ	TEST	TCOV
TestVarianceInfo	Variance information for Lβ	TEST
TestWCov	Within-imputation covariance matrix for Lβ	TEST	WCOV
VarianceInfo	Variance information
WCov	Within-imputation covariance matrix		WCOV
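The table names above can be used in an ODS OUTPUT statement to save PROC MIANALYZE results as data sets. The following is a sketch only: the data set and variable names are hypothetical, and the DATA step fabricates a stand-in for stacked imputed data.

```sas
/* Made-up stand-in for stacked PROC MI output (3 imputations, 25 rows each) */
data outmi;
   do _Imputation_ = 1 to 3;
      do i = 1 to 25;
         x1 = 10 + 2*rannor(808);
         y  = 5 + 2*x1 + rannor(808);
         output;
      end;
   end;
   drop i;
run;

proc reg data=outmi outest=outreg covout noprint;
   model y = x1;
   by _Imputation_;
run;
quit;

/* Save the combined estimates and the variance information tables */
proc mianalyze data=outreg;
   modeleffects Intercept x1;
   ods output ParameterEstimates=miparms VarianceInfo=mivar;
run;

proc print data=miparms noobs;
run;
```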