Details


Descriptive Statistics

Suppose Y = (y_1, y_2, ..., y_n)' is the (n × p) matrix of complete data, which may not be fully observed, n_0 is the number of observations that are fully observed, and n_j is the number of observations with observed values for variable Y_j.

With complete cases, the sample mean vector is

    \bar{y} = \frac{1}{n_0} \sum_{i} y_i

and the CSSCP matrix is

    \sum_{i} (y_i - \bar{y})(y_i - \bar{y})'

where each summation is over the fully observed observations.

The sample covariance matrix is

    S = \frac{1}{n_0 - 1} \sum_{i} (y_i - \bar{y})(y_i - \bar{y})'

and is an unbiased estimate of the covariance matrix.

The correlation matrix R containing the Pearson product-moment correlations of the variables is derived by scaling the corresponding covariance matrix:

    R = D^{-1} S D^{-1}

where D is a diagonal matrix whose diagonal elements are the square roots of the diagonal elements of S .

With available cases, the corrected sum of squares for variable Y_j is

    \sum_{i} (y_{ij} - \bar{y}_j)^2

where \bar{y}_j is the sample mean and each summation is over observations with observed values for variable Y_j.

The variance is

    s_j^2 = \frac{1}{n_j - 1} \sum_{i} (y_{ij} - \bar{y}_j)^2

The correlations for available cases contain pairwise correlations for each pair of variables. Each correlation is computed from all observations that have nonmissing values for the corresponding pair of variables.
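These complete-case and available-case statistics are displayed when you specify the SIMPLE option in the PROC MI statement. A minimal sketch, with hypothetical data set and variable names (NIMPUTE=0 suppresses the imputation itself):

    proc mi data=fitmiss nimpute=0 simple;
       var y1 y2 y3;
    run;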

EM Algorithm for Data with Missing Values

The EM algorithm (Dempster, Laird, and Rubin 1977) is a technique that finds maximum likelihood estimates in parametric models for incomplete data. The books by Little and Rubin (1987), Schafer (1997), and McLachlan and Krishnan (1997) provide detailed description and applications of the EM algorithm.

The EM algorithm is an iterative procedure that finds the MLE of the parameter vector by repeating the following steps:

1. The expectation E-step:

Given a set of parameter estimates, such as a mean vector and covariance matrix for a multivariate normal distribution, the E-step calculates the conditional expectation of the complete-data log likelihood given the observed data and the parameter estimates.

2. The maximization M-step:

Given a complete-data log likelihood, the M-step finds the parameter estimates to maximize the complete-data log likelihood from the E-step.

The two steps are iterated until the iterations converge.

In the EM process, the observed-data log likelihood is non-decreasing at each iteration. For multivariate normal data, suppose there are G groups with distinct missing patterns. Then the observed-data log likelihood being maximized can be expressed as

    \log L(\theta \mid Y_{obs}) = \sum_{g=1}^{G} \log L_g(\theta \mid Y_{obs})

where log L_g(θ | Y_obs) is the observed-data log likelihood from the g-th group, and

    \log L_g(\theta \mid Y_{obs}) = -\frac{n_g}{2} \log|\Sigma_g| - \frac{1}{2} \sum_{i \in g} (y_{ig} - \mu_g)' \Sigma_g^{-1} (y_{ig} - \mu_g)

where n_g is the number of observations in the g-th group, the summation is over observations in the g-th group, y_{ig} is a vector of observed values corresponding to observed variables, μ_g is the corresponding mean vector, and Σ_g is the associated covariance matrix.

A sample covariance matrix is computed at each step of the EM algorithm. If the covariance matrix is singular, the linearly dependent variables for the observed data are excluded from the likelihood function. That is, for each observation with linear dependency among its observed variables, the dependent variables are excluded from the likelihood function. Note that this may result in an unexpected change in the likelihood between iterations prior to the final convergence.

Refer to Schafer (1997, pp. 163-181) for a detailed description of the EM algorithm for multivariate normal data.

PROC MI uses the means and standard deviations from available cases as the initial estimates for the EM algorithm. The correlations are set to zero. This provides a good starting value with a positive definite covariance matrix. For a discussion of suggested starting values for the algorithm, see Schafer (1997, p. 169).

You can specify the convergence criterion with the CONVERGE= option in the EM statement. The iterations are considered to have converged when the maximum change in the parameter estimates between iteration steps is less than the value specified. You can also specify the maximum number of iterations used in the EM algorithm with the MAXITER= option.

The MI procedure displays tables of the initial parameter estimates used to begin the EM process and the MLE parameter estimates derived from EM. You can also display the EM iteration history with the ITPRINT option. PROC MI lists the iteration number, the value of -2 log L, and the parameter values μ at each iteration. You can also save the MLE derived from the EM algorithm in a SAS data set specified with the OUTEM= option.
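For example, the following sketch, with hypothetical data set and variable names, requests only the EM estimates, displays the iteration history, and saves the MLE in a data set:

    proc mi data=fitmiss nimpute=0;
       em itprint converge=1e-4 maxiter=500 outem=emest;
       var y1 y2 y3;
    run;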

Statistical Assumptions for Multiple Imputation

The MI procedure assumes that the data are from a continuous multivariate distribution and contain missing values that can occur for any of the variables. It also assumes that the data are from a multivariate normal distribution when either the regression method or the MCMC method is used.

Suppose Y is the (n × p) matrix of complete data, which is not fully observed, and denote the observed part of Y by Y_obs and the missing part by Y_mis. The SAS MI and MIANALYZE procedures assume that the missing data are missing at random (MAR), that is, the probability that an observation is missing can depend on Y_obs, but not on Y_mis (Rubin 1976; 1987, p. 53).

To be more precise, suppose that R is the (n × p) matrix of response indicators whose elements are zero or one depending on whether the corresponding elements of Y are missing or observed. Then the MAR assumption is that the distribution of R can depend on Y_obs but not on Y_mis:

    \Pr(R \mid Y_{obs}, Y_{mis}) = \Pr(R \mid Y_{obs})

For example, consider a trivariate data set with variables Y_1 and Y_2 fully observed, and a variable Y_3 that has missing values. MAR assumes that the probability that Y_3 is missing for an individual can be related to the individual's values of variables Y_1 and Y_2, but not to its value of Y_3. On the other hand, if a complete case and an incomplete case for Y_3 with exactly the same values for variables Y_1 and Y_2 have systematically different values of Y_3, then there exists a response bias for Y_3, and MAR is violated.

The MAR assumption is not the same as missing completely at random (MCAR), which is a special case of MAR. Under the MCAR assumption, the missing data values are a simple random sample of all data values; the missingness does not depend on the values of any variables in the data set.

Although the MAR assumption cannot be verified with the data and it can be questionable in some situations, the assumption becomes more plausible as more variables are included in the imputation model (Schafer 1997, pp. 27-28; van Buuren, Boshuizen, and Knook, 1999, p. 687).

Furthermore, the MI and MIANALYZE procedures assume that the parameters θ of the data model and the parameters φ of the model for the missing data indicators are distinct. That is, knowing the values of θ does not provide any additional information about φ, and vice versa. If both the MAR and distinctness assumptions are satisfied, the missing-data mechanism is said to be ignorable (Rubin 1987, pp. 50-54; Schafer 1997, pp. 10-11).

Missing Data Patterns

The MI procedure sorts the data into groups based on whether an individual's value is observed or missing for each variable to be analyzed. Note that the input data set does not need to be sorted in any order.

For example, with variables Y_1, Y_2, and Y_3 (in that order) in a data set, up to eight groups of observations can be formed from the data set. The following figure displays the eight groups of observations and a unique missing pattern for each group:

  Group    Y1    Y2    Y3
    1      X     X     X
    2      X     X     .
    3      X     .     X
    4      X     .     .
    5      .     X     X
    6      .     X     .
    7      .     .     X
    8      .     .     .

Figure 44.6: Missing Data Patterns

Here, an 'X' means that the variable is observed in the corresponding group and a '.' means that the variable is missing.

The variable order is used to derive the order of the groups from the data set, and thus, determines the order of missing values in the data to be imputed. If you specify a different order of variables in the VAR statement, then the results are different even if the other specifications remain the same.

A data set with variables Y_1, Y_2, ..., Y_p (in that order) is said to have a monotone missing pattern when the event that a variable Y_j is missing for a particular individual implies that all subsequent variables Y_k, k > j, are missing for that individual. Equivalently, when a variable Y_j is observed for a particular individual, all previous variables Y_k, k < j, are also observed for that individual.

For example, the following figure displays a data set of three variables with a monotone missing pattern.

  Group    Y1    Y2    Y3
    1      X     X     X
    2      X     X     .
    3      X     .     .

Figure 44.7: Monotone Missing Patterns

The following figure displays a data set of three variables with a non-monotone missing pattern.

  Group    Y1    Y2    Y3
    1      X     X     X
    2      X     .     X
    3      .     X     .
    4      .     .     X

Figure 44.8: Non-monotone Missing Patterns

A data set with an arbitrary missing pattern is a data set with either a monotone missing pattern or a non-monotone missing pattern.
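As an illustration, the following sketch, using a small hypothetical data set with a monotone pattern, displays the "Missing Data Patterns" table without creating any imputations:

    data mono;
       input y1 y2 y3;
       datalines;
    1.0  2.3  3.1
    1.2  2.1   .
    0.8   .    .
    ;

    proc mi data=mono nimpute=0;
       var y1 y2 y3;
    run;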

Imputation Methods

This section describes the methods for multiple imputation that are available in the MI procedure. The method of choice depends on the pattern of missingness in the data and the type of the imputed variable, as summarized in the following table:

 
Table 44.3: Imputation Methods in PROC MI

  Pattern of Missingness    Type of Imputed Variable     Recommended Methods
  Monotone                  Continuous                   Regression
                                                         Predicted Mean Matching
                                                         Propensity Score
  Monotone                  Classification (Ordinal)     Logistic Regression
  Monotone                  Classification (Nominal)     Discriminant Function Method
  Arbitrary                 Continuous                   MCMC Full-Data Imputation
                                                         MCMC Monotone-Data Imputation

To impute missing values for a continuous variable in data sets with monotone missing patterns, you should use either a parametric method that assumes multivariate normality or a nonparametric method that uses propensity scores (Rubin 1987, p. 124, 158; Lavori, Dawson, and Shera 1995). Parametric methods available include the regression method (Rubin 1987, pp. 166-167) and the predictive mean matching method (Heitjan and Little 1991; Schenker and Taylor 1996).

To impute missing values for a CLASS variable in data sets with monotone missing patterns, you should use the logistic regression method or the discriminant function method. Use the logistic regression method when the CLASS variable has a binary or ordinal response, and the discriminant function method when the CLASS variable has a binary or nominal response.

For continuous variables in data sets with arbitrary missing patterns, you can use the Markov Chain Monte Carlo (MCMC) method (Schafer 1997) to impute either all the missing values or just enough missing values to make the imputed data sets have monotone missing patterns.

With a monotone missing data pattern, you have greater flexibility in your choice of imputation models. In addition to the MCMC method, you can implement other methods, such as the regression method, that do not use Markov chains. You can also specify a different set of covariates for each imputed variable.

With an arbitrary missing data pattern, you can often use the MCMC method, which creates multiple imputations by drawing simulations from a Bayesian predictive distribution for normal data. Another way to handle a data set with an arbitrary missing data pattern is to use the MCMC approach to impute just enough values to make the missing data pattern monotone. Then, you can use a more flexible imputation method. This approach is described in the 'Producing Monotone Missingness with the MCMC Method' section on page 2552.
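A minimal sketch of the default full-data MCMC imputation for an arbitrary missing pattern, with hypothetical data set, seed, and variable names:

    proc mi data=fitmiss seed=42037 nimpute=5 out=outmi;
       var y1 y2 y3 y4;
    run;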

Although the regression and MCMC methods assume multivariate normality, inferences based on multiple imputation can be robust to departures from multivariate normality if the amount of missing information is not large, because the imputation model is effectively applied not to the entire data set but only to its missing part (Schafer 1997, pp. 147-148).

You can also use a TRANSFORM statement to transform variables to conform to the multivariate normality assumption. Variables are transformed before the imputation process and then are reverse-transformed to create the imputed data set.
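For example, the following sketch, with hypothetical variable names, imputes y1 on the log scale and y2 on the logit scale (y2 is assumed to take values strictly between 0 and 1) and back-transforms the imputed values in the output data set:

    proc mi data=fitmiss seed=1305417 out=outmi;
       transform log(y1) logit(y2);
       var y1 y2 y3;
    run;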

Li (1988) presented a theoretical argument for convergence of the MCMC method in the continuous case and used it to create imputations for incomplete multivariate continuous data. In practice, however, it is not easy to check the convergence of a Markov chain, especially for a large number of parameters. PROC MI generates statistics and plots which you can use to check for convergence of the MCMC process. The details are described in the 'Checking Convergence in MCMC' section on page 2555.

Regression Method for Monotone Missing Data

The regression method is the default imputation method for continuous variables in a data set with a monotone missing pattern.

In the regression method, a regression model is fitted for a continuous variable with the covariates constructed from a set of effects. Based on the fitted regression model, a new regression model is simulated from the posterior predictive distribution of the parameters and is used to impute the missing values for each variable (Rubin 1987, pp. 166-167). That is, for a continuous variable Y j with missing values, a model

    Y_j = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k

is fitted using observations with observed values for the variable Y_j and its covariates X_1, X_2, ..., X_k.

The fitted model includes the regression parameter estimates β̂ = (β̂_0, β̂_1, ..., β̂_k) and the associated covariance matrix σ̂_j² V_j, where V_j is the usual X'X inverse matrix derived from the intercept and covariates X_1, X_2, ..., X_k.

The following steps are used to generate imputed values for each imputation:

1. New parameters β_* = (β_{*0}, β_{*1}, ..., β_{*k}) and σ_{*j}² are drawn from the posterior predictive distribution of the parameters. That is, they are simulated from (β̂_0, β̂_1, ..., β̂_k), σ̂_j², and V_j. The variance is drawn as

    \sigma_{*j}^2 = \hat{\sigma}_j^2 \, (n_j - k - 1) / g

where g is a χ² random variate with n_j − k − 1 degrees of freedom and n_j is the number of nonmissing observations for Y_j. The regression coefficients are drawn as

    \beta_* = \hat{\beta} + \sigma_{*j} V_{hj}' Z

where V_{hj} is the upper triangular matrix in the Cholesky decomposition, V_j = V_{hj}' V_{hj}, and Z is a vector of k + 1 independent random normal variates.

2. The missing values are then replaced by

    \beta_{*0} + \beta_{*1} x_1 + \beta_{*2} x_2 + \cdots + \beta_{*k} x_k + z_i \, \sigma_{*j}

where x_1, x_2, ..., x_k are the values of the covariates and z_i is a simulated normal deviate.
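A sketch of requesting this method with the MONOTONE statement, assuming hypothetical variable names and a VAR order that yields a monotone missing pattern (effects may include interactions such as y1*y2):

    proc mi data=fitmiss seed=9232 nimpute=5 out=outmi;
       monotone reg(y3 = y1 y2 y1*y2);
       var y1 y2 y3;
    run;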

Predictive Mean Matching Method for Monotone Missing Data

The predictive mean matching method is also an imputation method available for continuous variables. It is similar to the regression method except that for each missing value, it imputes a value randomly from a set of observed values whose predicted values are closest to the predicted value for the missing value from the simulated regression model (Heitjan and Little 1991; Schenker and Taylor 1996).

Following the description of the model in the 'Regression Method for Monotone Missing Data' section on page 2541, the following steps are used to generate imputed values:

1. New parameters β_* = (β_{*0}, β_{*1}, ..., β_{*k}) and σ_{*j}² are drawn from the posterior predictive distribution of the parameters. That is, they are simulated from (β̂_0, β̂_1, ..., β̂_k), σ̂_j², and V_j. The variance is drawn as

    \sigma_{*j}^2 = \hat{\sigma}_j^2 \, (n_j - k - 1) / g

where g is a χ² random variate with n_j − k − 1 degrees of freedom and n_j is the number of nonmissing observations for Y_j. The regression coefficients are drawn as

    \beta_* = \hat{\beta} + \sigma_{*j} V_{hj}' Z

where V_{hj} is the upper triangular matrix in the Cholesky decomposition, V_j = V_{hj}' V_{hj}, and Z is a vector of k + 1 independent random normal variates.

2. For each missing value, a predicted value

    \hat{y}_{i*} = \beta_{*0} + \beta_{*1} x_1 + \beta_{*2} x_2 + \cdots + \beta_{*k} x_k

is computed with the covariate values x_1, x_2, ..., x_k.

3. A set of k_0 observations whose corresponding predicted values are closest to ŷ_{i*} is generated. You can specify k_0 with the K= option.

4. The missing value is then replaced by a value drawn randomly from these k_0 observed values.

The predictive mean matching method requires the number of closest observations to be specified. A smaller k_0 tends to increase the correlation among the multiple imputations for the missing observation and results in a higher variability of point estimators in repeated sampling. On the other hand, a larger k_0 tends to lessen the effect from the imputation model and results in biased estimators (Schenker and Taylor 1996, p. 430). An optimal k_0 is currently not available in the literature on multiple imputation. The default is K=5. This default value is experimental and may change in future releases.
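A sketch of requesting predictive mean matching, assuming the REGPMM keyword of the MONOTONE statement with the K= suboption for the number of closest observations (names are hypothetical):

    proc mi data=fitmiss seed=37853 nimpute=5 out=outmi;
       monotone regpmm(y3 = y1 y2 / k=5);
       var y1 y2 y3;
    run;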

The predictive mean matching method ensures that imputed values are plausible and may be more appropriate than the regression method if the normality assumption is violated (Horton and Lipsitz 2001, p. 246).

Note that in SAS 9.0, the predictive mean matching method replaces each missing value by the observed value closest to its predicted value. This may result in a higher variability of point estimators in repeated sampling (Schenker and Taylor 1996, p. 430).

Propensity Score Method for Monotone Missing Data

The propensity score method is another imputation method available for continuous variables when the data set has a monotone missing pattern.

A propensity score is generally defined as the conditional probability of assignment to a particular treatment given a vector of observed covariates (Rosenbaum and Rubin 1983). In the propensity score method, for a variable with missing values, a propensity score is generated for each observation to estimate the probability that the observation is missing. The observations are then grouped based on these propensity scores, and an approximate Bayesian bootstrap imputation (Rubin 1987, p. 124) is applied to each group (Lavori, Dawson, and Shera 1995).

The propensity score method uses the following steps to impute values for variable Y j with missing values:

1. Create an indicator variable R_j with the value 0 for observations with missing Y_j and 1 otherwise.

2. Fit a logistic regression model

    \mathrm{logit}(p_j) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k

where X_1, X_2, ..., X_k are covariates for Y_j, p_j = Pr(R_j = 0 | X_1, X_2, ..., X_k), and logit(p) = log(p/(1 − p)).

3. Create a propensity score for each observation to estimate the probability that it is missing.

4. Divide the observations into a fixed number of groups (typically assumed to be five) based on these propensity scores.

5. Apply an approximate Bayesian bootstrap imputation to each group. In group k, suppose that Y_obs denotes the n_1 observations with nonmissing Y_j values and Y_mis denotes the n_0 observations with missing Y_j. The approximate Bayesian bootstrap imputation first draws n_1 observations randomly with replacement from Y_obs to create a new data set Y*_obs. This is a nonparametric analogue of drawing parameters from the posterior predictive distribution of the parameters. The process then draws the n_0 values for Y_mis randomly with replacement from Y*_obs.

Steps 1 through 5 are repeated sequentially for each variable with missing values.

Note that the propensity score method was originally designed for a randomized experiment with repeated measures on the response variables. The goal was to impute the missing values on the response variables. The method uses only the covariate information that is associated with whether the imputed variable values are missing. It does not use correlations among variables. It is effective for inferences about the distributions of individual imputed variables, such as a univariate analysis, but it is not appropriate for analyses involving relationships among variables, such as a regression analysis (Schafer 1999, p. 11). It can also produce badly biased estimates of regression coefficients when data on predictor variables are missing (Allison 2000).
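A sketch of requesting the propensity score method, assuming the PROPENSITY keyword of the MONOTONE statement with an NGROUPS= suboption for the number of propensity score groups (names are hypothetical):

    proc mi data=fitmiss seed=50126 nimpute=5 out=outmi;
       monotone propensity(y3 = y1 y2 / ngroups=5);
       var y1 y2 y3;
    run;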

Discriminant Function Method for Monotone Missing Data

The discriminant function method is the default imputation method for CLASS variables in a data set with a monotone missing pattern.

For a nominal class variable Y_j with responses 1, ..., g, and a set of effects from its preceding variables, if the covariates X_1, X_2, ..., X_k associated with these effects within each group are approximately multivariate normal and the within-group covariance matrices are approximately equal, the discriminant function method (Brand 1999, pp. 95-96) can be used to impute missing values for the variable Y_j.

Denote the group-specific means for covariates X_1, X_2, ..., X_k by

    \bar{X}_t = (\bar{X}_{t1}, \bar{X}_{t2}, \ldots, \bar{X}_{tk}), \quad t = 1, 2, \ldots, g

then the pooled covariance matrix is computed as

    S = \frac{1}{n - g} \sum_{t} (n_t - 1) S_t

where S_t is the within-group covariance matrix, n_t is the group-specific sample size, and n = n_1 + n_2 + ... + n_g is the total sample size.

In each imputation, new parameters of the group-specific means ( m * t ), pooled covariance matrix ( S * ), and prior probabilities of group membership ( q * t ) can be drawn from their corresponding posterior distributions (Schafer 1997, p. 356).

Pooled Covariance Matrix and Group-specific Means

For each imputation, the MI procedure uses either the fixed observed pooled covariance matrix (PCOV=FIXED) or a drawn pooled covariance matrix (PCOV=POSTERIOR) from its posterior distribution with a noninformative prior. That is,

    S_* \mid S \sim W^{-1}\left(n - g, \ (n - g) S\right)

where W^{-1} is an inverted Wishart distribution.

The group-specific means are then drawn from their posterior distributions with a noninformative prior:

    m_{*t} \mid (S_*, \bar{X}_t) \sim N\!\left(\bar{X}_t, \ \frac{1}{n_t} S_*\right)

See the 'Bayesian Estimation of the Mean Vector and Covariance Matrix' section on page 2549 for a complete description of the inverted Wishart distribution and posterior distributions using a noninformative prior.

Prior Probabilities of Group Membership

The prior probabilities are computed through the drawing of new group sample sizes. When the total sample size n is considered fixed, the group sample sizes (n_1, n_2, ..., n_g) have a multinomial distribution. New multinomial parameters (group sample sizes) can be drawn from their posterior distribution using a Dirichlet prior with parameters (α_1, α_2, ..., α_g).

After the new sample sizes are drawn from the posterior distribution of (n_1, n_2, ..., n_g), the prior probabilities q_{*t} are computed proportionally to the drawn sample sizes.

Refer to Schafer (1997, pp. 247-255) for a complete description of the Dirichlet prior.

Imputation Steps

The discriminant function method uses the following steps in each imputation to impute values for a nominal class variable Y j with g responses:

1. Draw a pooled covariance matrix S * from its posterior distribution if the PCOV=POSTERIOR option is used.

2. For each group, draw group means m_{*t} from the observed group mean \bar{X}_t and either the observed pooled covariance matrix (PCOV=FIXED) or the drawn pooled covariance matrix S_* (PCOV=POSTERIOR).

3. For each group, compute or draw q * t , prior probabilities of group membership, based on the PRIOR= option:

  • PRIOR=EQUAL, q_{*t} = 1/g, prior probabilities of group membership are all equal.

  • PRIOR=PROPORTIONAL, q_{*t} = n_t/n, prior probabilities are proportional to their group sample sizes.

  • PRIOR=JEFFREYS=c, a noninformative Dirichlet prior with α_t = c is used.

  • PRIOR=RIDGE=d, a ridge prior is used with α_t = d · n_t/n for d ≥ 1 and α_t = d · n_t for d < 1.

4. With the group means m_{*t}, the pooled covariance matrix S_*, and the prior probabilities of group membership q_{*t}, the discriminant function method derives the linear discriminant function and computes the posterior probability of an observation belonging to each group t:

    p_t(x) = \frac{\exp\left(-0.5 \, D_t^2(x)\right)}{\sum_{u=1}^{g} \exp\left(-0.5 \, D_u^2(x)\right)}

where D_t^2(x) = (x − m_{*t})' S_*^{-1} (x − m_{*t}) − 2 log(q_{*t}) is the generalized squared distance from x to group t.

5. Draw a random uniform variate u, between 0 and 1, for each observation with a missing group value. With the posterior probabilities p_1(x) + p_2(x) + ... + p_g(x) = 1, the discriminant function method imputes Y_j = 1 if the value of u is less than p_1(x), Y_j = 2 if the value is greater than or equal to p_1(x) but less than p_1(x) + p_2(x), and so on.
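A sketch of requesting the discriminant function method for a nominal CLASS variable, with hypothetical names; the PCOV= and PRIOR= suboptions correspond to the choices described above:

    proc mi data=fish seed=305 nimpute=5 out=outfish;
       class species;
       monotone discrim(species = length width / pcov=posterior prior=proportional);
       var length width species;
    run;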

Logistic Regression Method for Monotone Missing Data

The logistic regression method is another imputation method available for CLASS variables in a data set with a monotone missing pattern.

In the logistic regression method, a logistic regression model is fitted for a class variable with a set of covariates constructed from the effects. For a binary class variable, based on the fitted regression model, a new logistic regression model is simulated from the posterior predictive distribution of the parameters and is used to impute the missing values for each variable (Rubin 1987, pp. 169-170).

For a binary variable Y_j with responses 1 and 2, a logistic regression model is fitted using observations with observed values for the imputed variable Y_j and its covariates X_1, X_2, ..., X_k:

    \mathrm{logit}(p_j) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k

where X_1, X_2, ..., X_k are covariates for Y_j, p_j = Pr(Y_j = 1 | X_1, X_2, ..., X_k), and logit(p) = log(p/(1 − p)).

The fitted model includes the regression parameter estimates β̂ = (β̂_0, β̂_1, ..., β̂_k) and the associated covariance matrix V_j.

The following steps are used to generate imputed values for a binary variable Y j with responses 1 and 2:

1. New parameters β_* = (β_{*0}, β_{*1}, ..., β_{*k}) are drawn from the posterior predictive distribution of the parameters:

    \beta_* = \hat{\beta} + V_{hj}' Z

where V_{hj} is the upper triangular matrix in the Cholesky decomposition, V_j = V_{hj}' V_{hj}, and Z is a vector of k + 1 independent random normal variates.

2. For an observation with missing Y_j and covariates x_1, x_2, ..., x_k, compute the expected probability that Y_j = 1:

    p_j = \frac{\exp(\mu_j)}{1 + \exp(\mu_j)}

where μ_j = β_{*0} + β_{*1} x_1 + β_{*2} x_2 + ... + β_{*k} x_k.

3. Draw a random uniform variate, u , between 0 and 1. If the value of u is less than p j , impute Y j = 1, otherwise impute Y j = 2.

The preceding logistic regression method can be extended to ordinal class variables with more than two levels of response. The options ORDER= and DESCENDING can be used to specify the sorting order for the levels of the imputed variables.
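A sketch of requesting the logistic regression method for an ordinal CLASS variable, with hypothetical names; DESCENDING reverses the sorting order of the response levels:

    proc mi data=trial seed=899 nimpute=5 out=outtrial;
       class severity;
       monotone logistic(severity = age bmi / descending);
       var age bmi severity;
    run;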

MCMC Method for Arbitrary Missing Data

The Markov Chain Monte Carlo (MCMC) method originated in physics as a tool for exploring equilibrium distributions of interacting molecules. In statistical applications, it is used to generate pseudo-random draws from multidimensional and otherwise intractable probability distributions via Markov chains. A Markov chain is a sequence of random variables in which the distribution of each element depends only on the value of the previous one.

In MCMC simulation, one constructs a Markov chain long enough for the distribution of the elements to stabilize to a stationary distribution, which is the distribution of interest. By repeatedly simulating steps of the chain, the method simulates draws from the distribution of interest. Refer to Schafer (1997) for a detailed discussion of this method.

In Bayesian inference, information about unknown parameters is expressed in the form of a posterior probability distribution. This posterior distribution is computed using Bayes' theorem:

    p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta}

MCMC has been applied as a method for exploring posterior distributions in Bayesian inference. That is, through MCMC, one can simulate the entire joint posterior distribution of the unknown quantities and obtain simulation-based estimates of posterior parameters that are of interest.

In many incomplete data problems, the observed-data posterior p(θ | Y_obs) is intractable and cannot easily be simulated. However, when Y_obs is augmented by an estimated or simulated value of the missing data Y_mis, the complete-data posterior p(θ | Y_obs, Y_mis) is much easier to simulate. Assuming that the data are from a multivariate normal distribution, data augmentation can be applied to Bayesian inference with missing data by repeating the following steps:

1. The imputation I-step:

Given an estimated mean vector and covariance matrix, the I-step simulates the missing values for each observation independently. That is, if you denote the variables with missing values for observation i by Y i ( mis ) and the variables with observed values by Y i ( obs ) , then the I-step draws values for Y i ( mis ) from a conditional distribution for Y i ( mis ) given Y i ( obs ) .

2. The posterior P-step:

Given a complete sample, the P-step simulates the posterior population mean vector and covariance matrix. These new estimates are then used in the next I-step. Without prior information about the parameters, a noninformative prior is used. You can also use other informative priors. For example, prior information about the covariance matrix can help stabilize the inference about the mean vector for a nearly singular covariance matrix.

The two steps are iterated long enough for the results to be reliable for a multiply imputed data set (Schafer 1997, p. 72). That is, with a current parameter estimate θ^(t) at the t-th iteration, the I-step draws Y_mis^(t+1) from p(Y_mis | Y_obs, θ^(t)) and the P-step draws θ^(t+1) from p(θ | Y_obs, Y_mis^(t+1)).

This creates a Markov chain

    \left(Y_{mis}^{(1)}, \theta^{(1)}\right), \ \left(Y_{mis}^{(2)}, \theta^{(2)}\right), \ \ldots

which converges in distribution to p(Y_mis, θ | Y_obs). Assuming the iterates converge to a stationary distribution, the goal is to simulate an approximately independent draw of the missing values from this distribution.

To validate the imputation results, you should repeat the process with different random number generators and starting values based on different initial parameter estimates.

The next three sections provide details for the imputation step, Bayesian estimation of the mean vector and covariance matrix, and the posterior step.

Imputation Step

In each iteration, starting with a given mean vector μ and covariance matrix Σ, the imputation step draws values for the missing data from the conditional distribution of Y_mis given Y_obs.

Suppose μ = (μ_1', μ_2')' is the partitioned mean vector of two sets of variables, Y_obs and Y_mis, where μ_1 is the mean vector for variables Y_obs and μ_2 is the mean vector for variables Y_mis.

Also suppose

    \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix}

is the partitioned covariance matrix for these variables, where Σ_11 is the covariance matrix for variables Y_obs, Σ_22 is the covariance matrix for variables Y_mis, and Σ_12 is the covariance matrix between variables Y_obs and variables Y_mis.

By using the sweep operator (Goodnight 1979) on the pivots of the Σ_11 submatrix, the matrix becomes

    \begin{pmatrix} -\Sigma_{11}^{-1} & \Sigma_{11}^{-1}\Sigma_{12} \\ \Sigma_{12}'\Sigma_{11}^{-1} & \Sigma_{22.1} \end{pmatrix}

where Σ_{22.1} = Σ_22 − Σ_12' Σ_11^{-1} Σ_12 can be used to compute the conditional covariance matrix of Y_mis after controlling for Y_obs.

For an observation with the preceding missing pattern, the conditional distribution of Y_mis given Y_obs = y_1 is a multivariate normal distribution with the mean vector

    \mu_2 + \Sigma_{12}' \Sigma_{11}^{-1} (y_1 - \mu_1)

and the conditional covariance matrix

    \Sigma_{22.1} = \Sigma_{22} - \Sigma_{12}' \Sigma_{11}^{-1} \Sigma_{12}

Bayesian Estimation of the Mean Vector and Covariance Matrix

Suppose that Y = (y_1', y_2', ..., y_n')' is an (n × p) matrix made up of n (p × 1) independent vectors y_i, each of which has a multivariate normal distribution with mean zero and covariance matrix Σ. Then the SSCP matrix

    A = \sum_{i} y_i y_i' = Y'Y

has a Wishart distribution W(n, Σ).

When each observation y_i is distributed with a multivariate normal distribution with an unknown mean μ, then the CSSCP matrix

    A = \sum_{i} (y_i - \bar{y})(y_i - \bar{y})'

has a Wishart distribution W(n − 1, Σ).

If A has a Wishart distribution W(n, Σ), then B = A^{-1} has an inverted Wishart distribution W^{-1}(n, Ψ), where n is the degrees of freedom and Ψ = Σ^{-1} is the precision matrix (Anderson 1984).

Note that, instead of using the parameter Ψ = Σ^{-1} for the inverted Wishart distribution, Schafer (1997) uses the parameter Λ = Ψ^{-1}.

Suppose that each observation in the data matrix Y has a multivariate normal distribution with mean μ and covariance matrix Σ. Then with a prior inverted Wishart distribution for Σ and a prior normal distribution for μ,

    \Sigma \sim W^{-1}(m, \Psi), \qquad \mu \mid \Sigma \sim N\!\left(\mu_0, \ \frac{1}{\tau}\Sigma\right)

where τ > 0 is a fixed number.

The posterior distribution (Anderson 1984, p. 270; Schafer 1997, p. 152) is

    \Sigma \mid Y \sim W^{-1}\!\left(n + m, \ (n-1)S + \Psi + \frac{n\tau}{n+\tau}(\bar{y} - \mu_0)(\bar{y} - \mu_0)'\right)

    \mu \mid (\Sigma, Y) \sim N\!\left(\frac{n\bar{y} + \tau\mu_0}{n+\tau}, \ \frac{1}{n+\tau}\Sigma\right)

where (n − 1)S is the CSSCP matrix.

Posterior Step

In each iteration, the posterior step simulates the posterior population mean vector μ and covariance matrix Σ from prior information for μ and Σ, and the complete sample estimates.

You can specify the prior parameter information using one of the following methods:

  • PRIOR=JEFFREYS, which uses a noninformative prior.

  • PRIOR=INPUT=, which provides prior information for Σ in the data set. Optionally, it also provides prior information for μ in the data set.

  • PRIOR=RIDGE=, which uses a ridge prior.

The next four subsections provide details of the posterior step for different prior distributions.

1. A Noninformative Prior

Without prior information about the mean and covariance estimates, a noninformative prior can be used by specifying the PRIOR=JEFFREYS option. The posterior distributions (Schafer 1997, p. 154) are

    \Sigma \mid Y \sim W^{-1}\!\left(n - 1, \ (n-1)S\right)

    \mu \mid (\Sigma, Y) \sim N\!\left(\bar{y}, \ \frac{1}{n}\Sigma\right)

2. An Informative Prior for μ and Σ

When prior information is available for the parameters μ and Σ, you can provide it with a SAS data set that you specify with the PRIOR=INPUT= option:

    \Sigma \sim W^{-1}\!\left(d_*, \ d_* S_*\right), \qquad \mu \mid \Sigma \sim N\!\left(\mu_0, \ \frac{1}{n_0}\Sigma\right)

To obtain the prior distribution for Σ, PROC MI reads the matrix S_* from observations in the data set with _TYPE_ = 'COV', and it reads n_* = d_* + 1 from observations with _TYPE_ = 'N'.

To obtain the prior distribution for μ, PROC MI reads the mean vector μ_0 from observations with _TYPE_ = 'MEAN', and it reads n_0 from observations with _TYPE_ = 'N_MEAN'. When there are no observations with _TYPE_ = 'N_MEAN', PROC MI reads n_0 from observations with _TYPE_ = 'N'.

The resulting posterior distribution, as described in the 'Bayesian Estimation of the Mean Vector and Covariance Matrix' section on page 2549, is given by

    \Sigma \mid Y \sim W^{-1}\!\left(n + d_*, \ (n-1)S + d_* S_* + \frac{n\, n_0}{n+n_0}(\bar{y} - \mu_0)(\bar{y} - \mu_0)'\right)

    \mu \mid (\Sigma, Y) \sim N\!\left(\mu_n, \ \frac{1}{n+n_0}\Sigma\right)

where

    \mu_n = \frac{n\,\bar{y} + n_0\,\mu_0}{n + n_0}

3. An Informative Prior for Σ

When the sample covariance matrix S is singular or near singular, prior information about Σ can also be used without prior information about μ to stabilize the inference about μ. You can provide it with a SAS data set that you specify with the PRIOR=INPUT= option.

To obtain the prior distribution for Σ, PROC MI reads the matrix S_* from observations in the data set with _TYPE_ = 'COV', and it reads n_* from observations with _TYPE_ = 'N'.

The resulting posterior distribution for (μ, Σ) (Schafer 1997, p. 156) is

    \Sigma \mid Y \sim W^{-1}\!\left(n - 1 + d_*, \ (n-1)S + d_* S_*\right)

    \mu \mid (\Sigma, Y) \sim N\!\left(\bar{y}, \ \frac{1}{n}\Sigma\right)

Note that if the PRIOR=INPUT= data set also contains observations with _TYPE_ = 'MEAN', then a complete informative prior for both μ and Σ will be used.

4. A Ridge Prior

A special case of the preceding adjustment is a ridge prior with S * =Diag( S ) (Schafer 1997, p. 156). That is, S * is a diagonal matrix with diagonal elements equal to the corresponding elements in S .

You can request a ridge prior by using the PRIOR=RIDGE= option. You can explicitly specify the number d_* ≥ 1 in the PRIOR=RIDGE=d_* option. Or you can implicitly specify the number by specifying the proportion p < 1 in the PRIOR=RIDGE=p option to request d_* = (n − 1)p.

The posterior is then given by

    \Sigma \mid Y \sim W^{-1}\!\left(n - 1 + d_*, \ (n-1)S + d_*\,\mathrm{Diag}(S)\right)

    \mu \mid (\Sigma, Y) \sim N\!\left(\bar{y}, \ \frac{1}{n}\Sigma\right)

Producing Monotone Missingness with the MCMC Method

The monotone data MCMC method was first proposed by Li (1988), and Liu (1993) described the algorithm. The method is especially useful when a data set is close to having a monotone missing pattern. In this case, the method needs to impute only a few missing values for the imputed data set to have a monotone missing pattern. Compared to a full data imputation that imputes all missing values, the monotone data MCMC method imputes fewer missing values in each iteration and achieves approximate stationarity in fewer iterations (Schafer 1997, p. 227).

You can request the monotone MCMC method by specifying the option IMPUTE=MONOTONE in the MCMC statement. The 'Missing Data Patterns' table now denotes the variables with missing values by '.' or 'O'. The value '.' means that the variable is missing and will be imputed and the value 'O' means that the variable is missing and will not be imputed. The tables of 'Multiple Imputation Variance Information' and 'Multiple Imputation Parameter Estimates' are not created.

You must specify the variables in the VAR statement. The variable order in the list determines the monotone missing pattern in the imputed data set. With a different order in the VAR list, the results will be different because the monotone missing pattern to be constructed will be different.
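A sketch of the two-step approach, with hypothetical names: the first run imputes just enough values to make each imputed data set monotone, and the second run applies a monotone method (here the default regression method) separately within each imputation:

    proc mi data=fitmiss seed=17655 nimpute=5 out=monodata;
       mcmc impute=monotone;
       var y1 y2 y3;
    run;

    proc mi data=monodata seed=51343 nimpute=1 out=outmi;
       by _Imputation_;
       monotone reg;
       var y1 y2 y3;
    run;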

Assuming that the data are from a multivariate normal distribution, then similar to the MCMC method, the monotone MCMC method repeats the following steps:

1. The imputation I-step:

Given an estimated mean vector and covariance matrix, the I-step simulates the missing values for each observation independently. Only a subset of missing values are simulated to achieve a monotone pattern of missingness.

2. The posterior P-step:

Given a new sample with a monotone pattern of missingness, the P-step simulates the posterior population mean vector and covariance matrix with a noninformative Jeffreys prior. These new estimates are then used in the next I-step.

Imputation Step

The I-step is almost identical to the I-step described in the 'MCMC Method for Arbitrary Missing Data' section on page 2547 except that only a subset of missing values need to be simulated. To state this precisely, denote the variables with observed values for observation i by Y_i(obs) and the variables with missing values by Y_i(mis) = (Y_i(m1), Y_i(m2)), where Y_i(m1) is a subset of the missing variables that will result in a monotone missingness when their values are imputed. Then the I-step draws values for Y_i(m1) from a conditional distribution for Y_i(m1) given Y_i(obs).

Posterior Step

The P-step is different from the P-step described in the 'MCMC Method for Arbitrary Missing Data' section on page 2547. Instead of simulating the μ and Σ parameters from the full imputed data set, this P-step simulates the μ and Σ parameters through simulated regression coefficients from regression models based on the imputed data set with a monotone pattern of missingness. The step is similar to the process described in the 'Regression Method for Monotone Missing Data' section on page 2541.

That is, for the variable Y_j, a model

    Y_j = \beta_0 + \beta_1 Y_1 + \beta_2 Y_2 + \cdots + \beta_{j-1} Y_{j-1}

is fitted using the n_j nonmissing observations for variable Y_j in the imputed data sets.

The fitted model consists of the regression parameter estimates β̂ = (β̂_0, β̂_1, ..., β̂_{j-1}) and the associated covariance matrix σ̂_j² V_j, where V_j is the usual X'X inverse matrix derived from the intercept and variables Y_1, Y_2, ..., Y_{j-1}.

For each imputation, new parameters β_* = (β_{*0}, β_{*1}, ..., β_{*(j-1)}) and σ_{*j}² are drawn from the posterior predictive distribution of the parameters. That is, they are simulated from (β̂_0, β̂_1, ..., β̂_{j-1}), σ̂_j², and V_j. The variance is drawn as

    \sigma_{*j}^2 = \hat{\sigma}_j^2 \, (n_j - j) / g

where g is a χ² random variate with n_j − j degrees of freedom and n_j is the number of nonmissing observations for Y_j. The regression coefficients are drawn as

    \beta_* = \hat{\beta} + \sigma_{*j} V_{hj}' Z

where V_{hj} is the upper triangular matrix in the Cholesky decomposition, V_j = V_{hj}' V_{hj}, and Z is a vector of j independent random normal variates.

These simulated values of β_* and σ_{*j}² are then used to re-create the parameters μ and Σ. For a detailed description of how to produce monotone missingness with the MCMC method for multivariate normal data, refer to Schafer (1997, pp. 226-235).

MCMC Method Specifications

With MCMC, you can impute either all missing values (IMPUTE=FULL) or just enough missing values to make the imputed data set have a monotone missing pattern (IMPUTE=MONOTONE). In the process, either a single chain for all imputations (CHAIN=SINGLE) or a separate chain for each imputation (CHAIN=MULTIPLE) is used. The single chain may be somewhat more precise for estimating a single quantity such as posterior mean (Schafer 1997, p. 138). Refer to Schafer (1997, pp. 137-138) for a discussion of single versus multiple chains.

You can specify the number of initial burn-in iterations before the first imputation with the NBITER= option. With multiple chains, this number is also used for each subsequent chain. For a single chain, you can also specify the number of iterations between imputations with the NITER= option.

You can explicitly specify initial parameter values for the MCMC process with the INITIAL=INPUT= data set option. Alternatively, you can use the EM algorithm to derive a set of initial parameter values for MCMC with the option INITIAL=EM. These estimates are used as either the starting value (START=VALUE) or as the starting distribution (START=DIST) for the MCMC process. For multiple chains, these estimates are used again as either the starting value (START=VALUE) or as the starting distribution (START=DIST) for the subsequent chains.

You can specify the prior parameter information in the PRIOR= option. You can use a noninformative prior (PRIOR=JEFFREYS), a ridge prior (PRIOR=RIDGE), or an informative prior specified in a data set (PRIOR=INPUT).

The parameter estimates used to generate imputed values in each imputation can be saved in a data set with the OUTEST= option. Later, this data set can be read with the INEST= option to provide the reference distribution for imputing missing values for a new data set.

By default, the MCMC method uses a single chain to produce five imputations. It completes 200 burn-in iterations before the first imputation and 100 iterations between imputations. The posterior mode computed from the EM algorithm with a noninformative prior is used as the starting values for the MCMC process.
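For example, the following sketch, with hypothetical names, uses a separate chain for each of four imputations, 300 burn-in iterations per chain, overdispersed EM starting values obtained from bootstrap samples (the BOOTSTRAP suboption is assumed), and a noninformative prior:

    proc mi data=fitmiss seed=21355 nimpute=4 out=outmi;
       mcmc chain=multiple nbiter=300 initial=em(bootstrap) prior=jeffreys;
       var y1 y2 y3;
    run;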

INITIAL=EM Specifications

In the EM statement, the EM algorithm is used to find the maximum likelihood estimates for incomplete data. You can also use the EM algorithm to find a posterior mode, the parameter estimates that maximize the observed-data posterior density. The resulting posterior mode provides a good starting value for the MCMC process.

With INITIAL=EM, PROC MI uses the MLE of the parameter vector as the initial estimates in the EM algorithm for the posterior mode. You can use the ITPRINT option in INITIAL=EM to display the iteration history for the EM algorithm.

You can use the CONVERGE= option to specify the convergence criterion in deriving the EM posterior mode. The iterations are considered to have converged when the maximum change in the parameter estimates between iteration steps is less than the value specified. By default, CONVERGE=1E-4.

You can also use the MAXITER= option to specify the maximum number of iterations in the EM algorithm. By default, MAXITER=200.

With the BOOTSTRAP option, you can use overdispersed starting values for the MCMC process. In this case, PROC MI applies the EM algorithm to a bootstrap sample, a simple random sample with replacement from the input data set, to derive the initial estimates for each chain (Schafer 1997, p. 128).

Checking Convergence in MCMC

The theoretical convergence of the MCMC process has been explored under various conditions, as described in Schafer (1997, p. 70). However, in practice, verification of convergence is not a simple matter.

The parameters used in the imputation step for each iteration can be saved in an output data set with the OUTITER= option. These include the means, standard deviations, covariances, the worst linear function, and observed-data LR statistics. You can then monitor the convergence in a single chain by displaying time-series plots and autocorrelations for those parameter values (Schafer 1997, p. 120). The time-series and autocorrelation function plots for parameters such as variable means, covariances, and the worst linear function can be displayed by specifying the TIMEPLOT and ACFPLOT option.
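For example, the following sketch, with hypothetical names, saves the means, covariances, worst linear function, and LR statistics from each iteration and displays time-series and autocorrelation plots for the worst linear function and the mean of y1 (the option syntax shown here is assumed):

    proc mi data=fitmiss seed=42037 nimpute=2 out=outmi;
       mcmc niter=200 wlf
            timeplot(wlf mean(y1))
            acfplot(wlf / nlag=10)
            outiter(mean cov wlf lr)=miiter;
       var y1 y2 y3;
    run;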

You can apply EM to a bootstrap sample to obtain overdispersed starting values for multiple chains (Gelman and Rubin 1992). This provides a conservative estimate of the number of iterations needed before each imputation.

The next four subsections describe useful statistics and plots that can be used to check the convergence of the MCMC process.

LR Statistics

You can save the observed-data likelihood ratio (LR) statistic in each iteration with the LR option in the OUTITER= data set. The statistic is based on the observed-data likelihood with parameter values used in the iteration and the observed-data maximum likelihood derived from the EM algorithm.

In each iteration, the LR statistic is given by

    -2 \log \frac{f(\theta_i)}{f(\hat{\theta})}

where f(θ̂) is the observed-data maximum likelihood derived from the EM algorithm and f(θ_i) is the observed-data likelihood for the parameter value θ_i used in the iteration.

Similarly, you can also save the observed-data LR posterior mode statistic for each iteration with the LR_POST option. This statistic is based on the observed-data posterior density with parameter values used in each iteration and the observed-data posterior mode derived from the EM algorithm for posterior mode.

For large samples, these LR statistics tend to be approximately χ² distributed with degrees of freedom equal to the dimension of θ (Schafer 1997, p. 131). For example, with a large number of iterations, if the values of the LR statistic do not behave like a random sample from the described χ² distribution, then there is evidence that the MCMC process has not converged.

Worst Linear Function of Parameters

The worst linear function (WLF) of parameters (Schafer 1997, pp. 129-131) is a scalar function of parameters ¼ and that is 'worst' in the sense that its function values converge most slowly among parameters in the MCMC process. The convergence of this function is evidence that other parameters are likely to converge as well.

For linear functions of the parameters θ = (μ, Σ), the worst linear function of θ has the highest asymptotic rate of missing information. The function can be derived from the iterative values of θ near the posterior mode in the EM algorithm. That is, an estimated worst linear function of θ is

    w(\theta) = v'(\theta - \hat{\theta})

where θ̂ is the posterior mode and the coefficients v = θ̂^(-1) − θ̂ are the difference between the estimated value of θ one step prior to convergence and the converged value θ̂.

You can display the coefficients of the worst linear function, v , by specifying the WLF option in the MCMC statement. You can save the function value from each iteration in an OUTITER= data set by specifying the WLF option in the OUTITER option. You can also display the worst linear function values from iterations in an autocorrelation plot or a time-series plot by specifying WLF as an ACFPLOT or TIMEPLOT option, respectively.

Note that when the observed-data posterior is nearly normal, the WLF is one of the slowest functions to approach stationarity. When the posterior is not close to normal, other functions may take much longer than the WLF to converge, as described in Schafer (1997, p.130).

Time-Series Plot

A time-series plot for a parameter ξ is a scatter plot of successive parameter estimates ξ_i against the iteration number i. The plot provides a simple way to examine the convergence behavior of the estimation algorithm for ξ. Long-term trends in the plot indicate that successive iterations are highly correlated and that the series of iterations has not converged.

You can display time-series plots for the worst linear function, the variable means, variable variances, and covariances of variables. You can also request logarithmic transformations for positive parameters in the plots with the LOG option. When a parameter value is less than or equal to zero, the value is not displayed in the corresponding plot.

By default, the MI procedure uses solid line segments to connect data points in a time-series plot. You can use the CCONNECT=, LCONNECT=, and WCONNECT= options to change the color, line type, and width of the line segments. When WCONNECT=0 is specified, the data points are not connected, and the procedure uses the plus sign (+) as the plot symbol to display the points with a height of one (percentage screen unit) in a time-series plot. You can use the SYMBOL=, CSYMBOL=, and HSYMBOL= options to change the shape, color, and height of the plot symbol.

By default, the plot title 'Time-Series Plot' is displayed in a time-series plot. You can request another title by using the TITLE= option in the TIMEPLOT option. When another title is also specified in a TITLE statement, this title is displayed as the main title and the plot title is displayed as a subtitle in the plot.

You can use options in the GOPTIONS statement to change the color and height of the title. Refer to the chapter 'The SAS/GRAPH Statements' in SAS/GRAPH Software: Reference for an illustration of title options. See Example 44.8 for a usage of the time-series plot.

Autocorrelation Function Plot

To examine relationships of successive parameter estimates ξ, the autocorrelation function (ACF) can be used. For a stationary series ξ_i, i ≥ 1, in time series data, the autocorrelation function at lag k is

    \rho_k = \frac{\mathrm{Cov}(\xi_i, \xi_{i+k})}{\mathrm{Var}(\xi_i)}

The sample k-th order autocorrelation is computed as

    r_k = \frac{\sum_{i=1}^{n-k} (\xi_i - \bar{\xi})(\xi_{i+k} - \bar{\xi})}{\sum_{i=1}^{n} (\xi_i - \bar{\xi})^2}

You can display autocorrelation function plots for the worst linear function, the variable means, variable variances, and covariances of variables. You can also request logarithmic transformations for parameters in the plots with the LOG option. When a parameter has values less than or equal to zero, the corresponding plot is not created.

You specify the maximum number of lags of the series with the NLAG= option. The autocorrelations at each lag less than or equal to the specified lag are displayed in the graph. In addition, the plot also displays approximate 95% confidence limits for the autocorrelations. At lag k, the confidence limits indicate a set of approximate 95% critical values for testing the hypothesis ρ_j = 0 for all j ≥ k.

By default, the MI procedure uses the star sign (*) as the plot symbol to display the points with a height of one (percentage screen unit) in the plot, a solid line to display the reference line of zero autocorrelation, vertical line segments to connect autocorrelations to the reference line, and a pair of dashed lines to display approximately 95% confidence limits for the autocorrelations.

You can use the SYMBOL=, CSYMBOL=, and HSYMBOL= options to change the shape, color, and height of the plot symbol, and the CNEEDLES= and WNEEDLES= options to change the color and width of the needles . You can also use the LREF=, CREF=, and WREF= options to change the line type, color, and width of the reference line. Similarly, you can use the LCONF=, CCONF=, and WCONF= options to change the line type, color, and width of the confidence limits.

By default, the plot title 'Autocorrelation Plot' is displayed in an autocorrelation function plot. You can request another title by using the TITLE= option in ACFPLOT. When another title is also specified in a TITLE statement, this title is displayed as the main title and the plot title is displayed as a subtitle in the plot.

You can use options in the GOPTIONS statement to change the color and height of the title. Refer to the chapter 'The SAS/GRAPH Statements' in SAS/GRAPH Software: Reference for a description of title options. See Example 44.8 for an illustration of the autocorrelation function plot.

Input Data Sets

You can specify the input data set with missing values with the DATA= option in the PROC MI statement. When an MCMC method is used, you can specify the data set containing the reference distribution information for imputation with the INEST= option, the data set containing initial parameter estimates for the MCMC process with the INITIAL=INPUT= option, and the data set containing information for the prior distribution with the PRIOR=INPUT= option in the MCMC statement.

DATA= SAS-data-set

  • The input DATA= data set is an ordinary SAS data set containing multivariate data with missing values.

INEST= SAS-data-set

  • The input INEST= data set is a TYPE=EST data set and contains a variable _Imputation_ to identify the imputation number. For each imputation, PROC MI reads the point estimate from the observations with _TYPE_ = 'PARM' or _TYPE_ = 'PARMS' and the associated covariances from the observations with _TYPE_ = 'COV' or _TYPE_ = 'COVB'. These estimates are used as the reference distribution to impute values for observations in the DATA= data set. When the input INEST= data set also contains observations with _TYPE_ = 'SEED', PROC MI reads the seed information for the random number generator from these observations. Otherwise, the SEED= option provides the seed information.

INITIAL=INPUT= SAS-data-set

  • The input INITIAL=INPUT= data set is a TYPE=COV or CORR data set and provides initial parameter estimates for the MCMC process. The covariances derived from the TYPE=COV/CORR data set are divided by the number of observations to get the correct covariance matrix for the point estimate (sample mean).

    If TYPE=COV, PROC MI reads the number of observations from the observations with _TYPE_ = 'N', the point estimate from the observations with _TYPE_ = 'MEAN', and the covariances from the observations with _TYPE_ = 'COV'.

    If TYPE=CORR, PROC MI reads the number of observations from the observations with _TYPE_ = 'N', the point estimate from the observations with _TYPE_ = 'MEAN', the correlations from the observations with _TYPE_ = 'CORR', and the standard deviations from the observations with _TYPE_ = 'STD'.

PRIOR=INPUT= SAS-data-set

  • The input PRIOR=INPUT= data set is a TYPE=COV data set that provides information for the prior distribution. You can use the data set to specify a prior distribution for Σ of the form

    \Sigma \sim W^{-1}\!\left(d_*, \ d_* S_*\right)

    where d_* = n_* − 1 is the degrees of freedom. PROC MI reads the matrix S_* from observations with _TYPE_ = 'COV' and n_* from observations with _TYPE_ = 'N'.

    You can also use this data set to specify a prior distribution for μ of the form

    \mu \sim N\!\left(\mu_0, \ \frac{1}{n_0}\Sigma\right)

    PROC MI reads the mean vector μ_0 from observations with _TYPE_ = 'MEAN' and n_0 from observations with _TYPE_ = 'N_MEAN'. When there are no observations with _TYPE_ = 'N_MEAN', PROC MI reads n_0 from observations with _TYPE_ = 'N'.
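    A sketch of constructing such a data set and using it, with hypothetical variable names and prior values:

    data prior (type=cov);
       input _TYPE_ $ _NAME_ $ y1 y2 y3;
       datalines;
    COV  y1   1.0  0.5  0.4
    COV  y2   0.5  1.0  0.3
    COV  y3   0.4  0.3  1.0
    N    .     20   20   20
    ;

    proc mi data=fitmiss seed=14028 out=outmi;
       mcmc prior=input=prior;
       var y1 y2 y3;
    run;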

Output Data Sets

You can specify the output data set of imputed values with the OUT= option in the PROC MI statement. When an EM statement is used, you can specify the data set containing the original data set with missing values being replaced by the expected values from the EM algorithm with the OUT= option in the EM statement. You can also specify the data set containing MLE computed with the EM algorithm with the OUTEM= option.

When an MCMC method is used, you can specify the data set containing parameter estimates used in each imputation with the OUTEST= option and the data set containing parameters used in the imputation step for each iteration with the OUTITER option in the MCMC statement.

OUT= SAS-data-set in the PROC MI statement

  • The OUT= data set contains all the variables in the original data set and a new variable named _Imputation_ that identifies the imputation. For each imputation, the data set contains all variables in the input DATA= data set with missing values being replaced by imputed values. Note that when the NIMPUTE=1 option is specified, the variable _Imputation_ is not created.

OUT= SAS-data-set in an EM statement

  • The OUT= data set contains the original data set with missing values being replaced by expected values from the EM algorithm.

OUTEM= SAS-data-set

  • The OUTEM= data set is a TYPE=COV data set and contains the MLE computed with the EM algorithm. The observations with _TYPE_ = 'MEAN' contain the estimated mean and the observations with _TYPE_ = 'COV' contain the estimated covariances.

OUTEST= SAS-data-set

  • The OUTEST= data set is a TYPE=EST data set and contains parameter estimates used in each imputation in the MCMC method. It also includes an index variable named _Imputation_ , which identifies the imputation.

  • The observations with _TYPE_ = 'SEED' contain the seed information for the random number generator. The observations with _TYPE_ = 'PARM' or _TYPE_ = 'PARMS' contain the point estimate and the observations with _TYPE_ = 'COV' or _TYPE_ = 'COVB' contain the associated covariances. These estimates are used as the parameters of the reference distribution to impute values for observations in the DATA= data set.

  • Note that these estimates are the values used in the I-step before each imputation. These are not the parameter values simulated from the P-step in the same iteration. See Example 44.9 for a usage of this option.

OUTITER < ( options ) > = SAS-data-set in an EM statement

  • The OUTITER= data set in an EM statement is a TYPE=COV data set and contains parameters for each iteration. It also includes a variable _Iteration_ that provides the iteration number.

  • The parameters in the output data set depend on the options specified. You can specify the MEAN and COV options for OUTITER. With the MEAN option, the output data set contains the mean parameters in observations with _TYPE_ = 'MEAN'. Similarly, with the COV option, the output data set contains the covariance parameters in observations with _TYPE_ = 'COV'. When no options are specified, the output data set contains the mean parameters for each iteration.

OUTITER < ( options ) > = SAS-data-set in an MCMC statement

  • The OUTITER= data set in an MCMC statement is a TYPE=COV data set and contains parameters used in the imputation step for each iteration. It also includes variables named _Imputation_ and _Iteration_ , which provide the imputation number and iteration number.

  • The parameters in the output data set depend on the options specified. The following table summarizes the options available for OUTITER and the corresponding values for the output variable _TYPE_ .

    Table 44.4: Summary of Options for OUTITER in an MCMC statement

    Options     Output Parameters                           _TYPE_
    MEAN        mean parameters                             MEAN
    STD         standard deviations                         STD
    COV         covariances                                 COV
    LR          -2 log LR statistic                         LOG_LR
    LR_POST     -2 log LR statistic of the posterior mode   LOG_POST
    WLF         worst linear function                       WLF

  • When no options are specified, the output data set contains the mean parameters used in the imputation step for each iteration. For a detailed description of the worst linear function and LR statistics, see the 'Checking Convergence in MCMC' section on page 2555.
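  • For instance, the following sketch (hypothetical names and seed) saves the mean and covariance parameters, together with the worst linear function, for every iteration of every imputation:

        proc mi data=FitMiss seed=21557 nimpute=2 out=outmi;
           mcmc niter=100 wlf outiter(mean cov wlf)=mciter;
           var Oxygen RunTime RunPulse;
        run;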

Combining Inferences from Multiply Imputed Data Sets

With $m$ imputations, $m$ different sets of the point and variance estimates for a parameter $Q$ can be computed. Suppose $\hat{Q}_i$ and $\hat{U}_i$ are the point and variance estimates from the $i$th imputed data set, $i = 1, 2, \ldots, m$. Then the combined point estimate for $Q$ from multiple imputation is the average of the $m$ complete-data estimates:

$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i$$

Suppose $\bar{U}$ is the within-imputation variance, which is the average of the $m$ complete-data variance estimates:

$$\bar{U} = \frac{1}{m} \sum_{i=1}^{m} \hat{U}_i$$

and $B$ is the between-imputation variance

$$B = \frac{1}{m-1} \sum_{i=1}^{m} \left( \hat{Q}_i - \bar{Q} \right)^2$$

Then the variance estimate associated with $\bar{Q}$ is the total variance (Rubin 1987)

$$T = \bar{U} + \left( 1 + \frac{1}{m} \right) B$$

The statistic $(Q - \bar{Q})\, T^{-1/2}$ is approximately distributed as $t$ with $v_m$ degrees of freedom (Rubin 1987), where

$$v_m = (m-1) \left[ 1 + \frac{\bar{U}}{\left( 1 + m^{-1} \right) B} \right]^{2}$$

The degrees of freedom $v_m$ depends on $m$ and the ratio

$$r = \frac{\left( 1 + m^{-1} \right) B}{\bar{U}}$$

The ratio $r$ is called the relative increase in variance due to nonresponse (Rubin 1987). When there is no missing information about $Q$, the values of $r$ and $B$ are both zero. With a large value of $m$ or a small value of $r$, the degrees of freedom $v_m$ will be large and the distribution of $(Q - \bar{Q})\, T^{-1/2}$ will be approximately normal.

Another useful statistic is the fraction of missing information about $Q$:

$$\hat{\lambda} = \frac{r + 2 / (v_m + 3)}{r + 1}$$

Both statistics $r$ and $\hat{\lambda}$ are helpful diagnostics for assessing how the missing data contribute to the uncertainty about $Q$.

When the complete-data degrees of freedom $v$ is small and there is only a modest proportion of missing data, the computed degrees of freedom $v_m$ can be much larger than $v$, which is inappropriate. For example, with $m = 5$ and $r = 10\%$, the computed degrees of freedom is $v_m = 484$, which is inappropriate for data sets with complete-data degrees of freedom less than 484.

Barnard and Rubin (1999) recommend the use of an adjusted degrees of freedom

$$v_m^{*} = \left[ \frac{1}{v_m} + \frac{1}{\hat{v}_{\mathrm{obs}}} \right]^{-1}$$

where $\hat{v}_{\mathrm{obs}} = (1 - \gamma)\, v\, (v+1) / (v+3)$ and $\gamma = \left( 1 + m^{-1} \right) B / T$.

Note that the MI procedure uses the adjusted degrees of freedom, $v_m^{*}$, for inference.
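As a small worked illustration of these formulas (the numbers are purely hypothetical), suppose $m = 3$ imputations give point estimates $\hat{Q}_i = 10.2,\ 10.8,\ 10.5$ with complete-data variances $\hat{U}_i = 0.40,\ 0.44,\ 0.42$. Then

$$\bar{Q} = 10.5, \qquad \bar{U} = 0.42, \qquad B = \tfrac{1}{2}\left[ (-0.3)^2 + (0.3)^2 + 0^2 \right] = 0.09,$$

$$T = 0.42 + \left( 1 + \tfrac{1}{3} \right)(0.09) = 0.54, \qquad r = \frac{\left( 1 + \tfrac{1}{3} \right)(0.09)}{0.42} \approx 0.286, \qquad v_m = (m-1)\left( 1 + \tfrac{1}{r} \right)^2 = 2\,(4.5)^2 = 40.5 .$$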

Multiple Imputation Efficiency

The relative efficiency (RE) of using the finite $m$ imputation estimator, rather than using an infinite number of imputations for the fully efficient estimator, in units of variance, is approximately a function of $m$ and $\lambda$ (Rubin 1987, p. 114):

$$\mathrm{RE} = \left( 1 + \frac{\lambda}{m} \right)^{-1}$$

The following table shows relative efficiencies with different values of $m$ and $\lambda$.

Table 44.5: Relative Efficiency

                                λ
  m        10%       20%       30%       50%       70%
  3      0.9677    0.9375    0.9091    0.8571    0.8108
  5      0.9804    0.9615    0.9434    0.9091    0.8772
 10      0.9901    0.9804    0.9709    0.9524    0.9346
 20      0.9950    0.9901    0.9852    0.9756    0.9662

The table shows that for situations with little missing information, only a small number of imputations are necessary. In practice, the number of imputations needed can be informally verified by replicating sets of m imputations and checking whether the estimates are stable between sets (Horton and Lipsitz 2001, p. 246).
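For example, the entry for $m = 5$ and $\lambda = 30\%$ follows directly from the formula above: $\mathrm{RE} = (1 + 0.30/5)^{-1} = 1/1.06 \approx 0.9434$.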

Imputer's Model Versus Analyst's Model

Multiple imputation inference assumes that the model you used to analyze the multiply imputed data (the analyst's model) is the same as the model used to impute missing values in multiple imputation (the imputer's model). But in practice, the two models may not be the same (Schafer 1997, p. 139).

Schafer (1997, pp. 139-143) provides comprehensive coverage of this topic, and the following example is based on his work.

Consider a trivariate data set with variables Y 1 and Y 2 fully observed, and a variable Y 3 with missing values. An imputer creates multiple imputations with the model Y 3 = Y 1 Y 2 . However, the analyst can later use the simpler model Y 3 = Y 1 . In this case, the analyst assumes more than the imputer. That is, the analyst assumes there is no relationship between variables Y 3 and Y 2 .

The effect of the discrepancy between the models depends on whether the analyst's additional assumption is true. If the assumption is true, the imputer's model still applies. The inferences derived from multiple imputations will still be valid, although they may be somewhat conservative because they reflect the additional uncertainty of estimating the relationship between Y 3 and Y 2 .

On the other hand, suppose that the analyst models Y 3 = Y 1 , and there is a relationship between variables Y 3 and Y 2 . Then the model Y 3 = Y 1 will be biased and is inappropriate. Appropriate results can be generated only from appropriate analyst models.

Another type of discrepancy occurs when the imputer assumes more than the analyst. For example, suppose that an imputer creates multiple imputations with the model Y 3 = Y 1 , but the analyst later fits a model Y 3 = Y 1 Y 2 . When the assumption is true, the imputer's model is a correct model and the inferences still hold.

On the other hand, suppose there is a relationship between Y 3 and Y 2 . Imputations created under the incorrect assumption that there is no relationship between Y 3 and Y 2 will make the analyst's estimate of the relationship biased toward zero. Multiple imputations created under an incorrect model can lead to incorrect conclusions.

Thus, generally you should include as many variables as you can when doing multiple imputation. The precision you lose by including unimportant predictors is usually a relatively small price to pay for the general validity of analyses of the resultant multiply imputed data set (Rubin 1996). But at the same time, you need to keep the model building and fitting feasible (Barnard and Meng 1999, pp. 19-20).

To produce high-quality imputations for a particular variable, the imputation model should also include variables that are potentially related to the imputed variable and variables that are potentially related to the missingness of the imputed variable (Schafer 1997, p. 143).

Similar suggestions were also given by van Buuren, Boshuizen, and Knook (1999, p. 687). They recommended that the imputation model include three sets of covariates: variables in the analyst's model, variables associated with the missingness of the imputed variable, and variables correlated with the imputed variable. They also recommended removing covariates not in the analyst's model if they have too many missing values among observations on which the imputed variable is missing.

Note that it is good practice to include a description of the imputer's model with the multiply imputed data set (Rubin 1996, p.479). That way, the analysts will have information about the variables involved in the imputation and which relationships among the variables have been implicitly set to zero.

Parameter Simulation Versus Multiple Imputation

As an alternative to multiple imputation, parameter simulation can also be used to analyze the data for many incomplete-data problems. Although the MI procedure does not offer parameter simulation, the trade-offs between the two methods (Schafer 1997, pp. 89-90, 135-136) are examined in this section.

The parameter simulation method simulates random values of the parameters from the observed-data posterior distribution and makes simple inferences about these parameters (Schafer 1997, p. 89). When a set of well-defined population parameters $\theta$ is of interest, parameter simulation can be used to directly examine and summarize simulated values of $\theta$. This usually requires a large number of iterations and involves calculating appropriate summaries of the resulting dependent sample of the iterates of $\theta$. If only a small set of parameters is involved, parameter simulation is suitable (Schafer 1997).

Multiple imputation only requires a small number of imputations. Generating and storing a few imputations can be more efficient than generating and storing a large number of iterations for parameter simulation.

When fractions of missing information are low, methods that average over simulated values of the missing data, as in multiple imputation, can be much more efficient than methods that average over simulated values of $\theta$, as in parameter simulation (Schafer 1997).

Summary of Issues in Multiple Imputation

This section summarizes issues which are encountered in applications of the MI procedure.

The MAR Assumption

The missing at random (MAR) assumption is needed for the imputation methods in the MI Procedure. Although this assumption cannot be verified with the data, it becomes more plausible as more variables are included in the imputation model (Schafer 1997, pp. 27-28; van Buuren, Boshuizen, and Knook, 1999, p. 687).

Number of Imputations

Based on the theory of multiple imputation, only a small number of imputations are needed for a data set with little missing information (Rubin 1987, p. 114). The number of imputations can be informally verified by replicating sets of m imputations and checking whether the estimates are stable (Horton and Lipsitz 2001, p. 246).

Imputation Model

Generally you should include as many variables as you can in the imputation model (Rubin 1996). At the same time, however, it is important to keep the number of variables under control, as discussed by Barnard and Meng (1999, pp. 19-20). For the imputation of a particular variable, the model should include variables in the complete-data model, variables that are correlated with the imputed variable, and variables that are associated with the missingness of the imputed variable (Schafer 1997, p. 143; van Buuren, Boshuizen, and Knook 1999, p. 687).

Multivariate Normality Assumption

Although the regression and MCMC methods assume multivariate normality, inferences based on multiple imputation can be robust to departures from the multivariate normality if the amount of missing information is not large (Schafer 1997, pp. 147-148).

You can use variable transformations to make the normality assumption more tenable. Variables are transformed before the imputation process and then back-transformed to create imputed values.
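A sketch of such a transformation (the data set, variable names, and seed are hypothetical): the TRANSFORM statement below imputes log(RunTime) under the normality assumption, and the back-transformed values of RunTime are written to the output data set.

    proc mi data=FitMiss seed=32937921 nimpute=5 out=outmi;
       transform log(RunTime);
       var Oxygen RunTime RunPulse;
    run;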

Monotone Regression Method

With the multivariate normality assumption, either the regression method or the predictive mean matching method can be used to impute continuous variables in data sets with monotone missing patterns.

The predictive mean matching method ensures that imputed values are plausible and may be more appropriate than the regression method if the normality assumption is violated (Horton and Lipsitz 2001, p. 246).
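For a data set with a monotone missing pattern, a sketch of the MONOTONE statement (all names and the seed are hypothetical) might look like the following; the REGPMM method draws each imputed value of Oxygen from observed values whose predicted means are closest to the predicted mean for the missing observation.

    proc mi data=FitMono seed=899603 nimpute=5 out=outmi;
       monotone reg(RunPulse)
                regpmm(Oxygen = RunTime RunPulse);
       var RunTime RunPulse Oxygen;
    run;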

Monotone Propensity Score Method

The propensity score method can also be used to impute continuous variables in data sets with monotone missing patterns.

The propensity score method does not use correlations among variables and is not appropriate for analyses involving relationships among variables, such as a regression analysis (Schafer 1999, p. 11). It can also produce badly biased estimates of regression coefficients when data on predictor variables are missing (Allison 2000).

MCMC Monotone-Data Imputation

The MCMC Method is used to impute continuous variables in data sets with arbitrary missing patterns, assuming a multivariate normal distribution for the data. It can also be used to impute just enough missing values to make the imputed data sets have a monotone missing pattern. Then, a more flexible monotone imputation method can be used for the remaining missing values.
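A sketch of this two-stage approach (all names and seeds are hypothetical): the first PROC MI step imputes just enough values to produce a monotone pattern in the order given by the VAR statement, and the second step, run with BY _Imputation_ and NIMPUTE=1, completes each data set with a monotone method.

    proc mi data=FitMiss seed=17655417 nimpute=5 out=outmono;
       mcmc impute=monotone;
       var RunTime RunPulse Oxygen;
    run;

    proc mi data=outmono seed=51343672 nimpute=1 out=outmi;
       by _Imputation_;
       monotone reg;
       var RunTime RunPulse Oxygen;
    run;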

Checking Convergence in MCMC

In an MCMC process, parameters are drawn after the MCMC is run long enough to converge to its stationary distribution. In practice, however, it is not simple to verify the convergence of the process, especially for a large number of parameters.

You can check for convergence by examining the observed-data likelihood ratio statistic and the worst linear function of the parameters in each iteration. You can also check for convergence by examining a plot of the autocorrelation function, as well as a time-series plot, of the parameters (Schafer 1997, p. 120).

EM Estimates

The EM algorithm can be used to compute the MLE of the mean vector and covariance matrix of the data with missing values, assuming a multivariate normal distribution for the data. However, the covariance matrix associated with the estimate of the mean vector cannot be derived from the EM algorithm.

In the MI procedure, you can use the EM algorithm to compute the posterior mode, which provides a good starting value for the MCMC process (Schafer 1997, p. 169).

ODS Table Names

PROC MI assigns a name to each table it creates. You must use these names to reference tables when using the Output Delivery System (ODS). These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'

Table 44.6: ODS Tables Produced in PROC MI

ODS Table Name       Description                                Statement   Option
Corr                 Pairwise correlations                                  SIMPLE
EMEstimates          EM (MLE) estimates                         EM
EMInitEstimates      EM initial estimates                       EM
EMIterHistory        EM (MLE) iteration history                 EM          ITPRINT
EMPostEstimates      EM (Posterior mode) estimates              MCMC        INITIAL=EM
EMPostIterHistory    EM (Posterior mode) iteration history      MCMC        INITIAL=EM (ITPRINT)
EMWLF                Worst linear function                      MCMC        WLF
MCMCInitEstimates    MCMC initial estimates                     MCMC        DISPLAYINIT
MissPattern          Missing data patterns
ModelInfo            Model information
MonoDiscrim          Discriminant model group means             MONOTONE    DISCRIM (/DETAILS)
MonoLogistic         Logistic model                             MONOTONE    LOGISTIC (/DETAILS)
MonoModel            Multiple monotone models                   MONOTONE
MonoPropensity       Propensity score model logistic function   MONOTONE    PROPENSITY (/DETAILS)
MonoReg              Regression model                           MONOTONE    REG (/DETAILS)
MonoRegPMM           Predicted mean matching model              MONOTONE    REGPMM (/DETAILS)
ParameterEstimates   Parameter estimates
Transform            Variable transformations                   TRANSFORM
Univariate           Univariate statistics                                  SIMPLE
VarianceInfo         Between, within, and total variances
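As a brief sketch (hypothetical names and seed), any of these tables can be captured in a data set with an ODS OUTPUT statement, for example:

    ods output MissPattern=mpattern VarianceInfo=varinfo;

    proc mi data=FitMiss seed=1305417 nimpute=3 out=outmi;
       var Oxygen RunTime RunPulse;
    run;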

   

ODS Graphics (Experimental)

This section describes the use of ODS for creating graphics with the MI procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release.

To request these graphs, you must specify the ODS GRAPHICS statement in addition to the following options in the MCMC statement. For more information on the ODS GRAPHICS statement, see Chapter 15, 'Statistical Graphics Using ODS.'

ACFPLOT < ( options < / display-options > ) >

  • displays plots of the autocorrelation function of parameters from iterations.

  • For a detailed description of the ACFPLOT option, see the 'Autocorrelation Function Plot' section on page 2557. Note that for the display-options, only the LOG, NLAG=, and TITLE= options are applicable.

TIMEPLOT < ( options < / display-options > ) >

  • displays time-series plots of parameters from iterations.

  • For a detailed description of the TIMEPLOT option, see the 'Time-Series Plot' section on page 2556. Note that for the display-options, only the LOG, WCONNECT=, and TITLE= options are applicable. If you specify the WCONNECT=0 option, a scatter plot is created. Otherwise, a line plot is created.
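  • As a hedged sketch (hypothetical data set, variables, and seed), both plots can be requested as follows:

        ods html;
        ods graphics on;

        proc mi data=FitMiss seed=21355417 nimpute=5 out=outmi;
           mcmc acfplot(mean(Oxygen)) timeplot(mean(Oxygen));
           var Oxygen RunTime RunPulse;
        run;

        ods graphics off;
        ods html close;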

ODS Graph Names

PROC MI assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 44.7.

To request these graphs, you must specify the ODS GRAPHICS statement in addition to the options indicated in Table 44.7. For more information on the ODS GRAPHICS statement, see Chapter 15, 'Statistical Graphics Using ODS.'

Table 44.7: ODS Graphics Produced by PROC MI

ODS Graph Name     Plot Description           Statement   Option
ACFPlot            ACF plot                   MCMC        ACFPLOT
TimeScatterPlot    Time-series scatter plot   MCMC        TIMEPLOT(WCONNECT=0)
TimeSeriesPlot     Time-series plot           MCMC        TIMEPLOT



