Consider the following Fitness data set that has been altered to contain an arbitrary pattern of missingness:
*----------------- Data on Physical Fitness -----------------*
 These measurements were made on men involved in a physical fitness
 course at N.C. State University. Only selected variables of
 Oxygen (oxygen intake, ml per kg body weight per minute),
 Runtime (time to run 1.5 miles in minutes), and
 RunPulse (heart rate while running) are used.
 Certain values were changed to missing for the analysis.
*------------------------------------------------------------*;
data FitMiss;
   input Oxygen RunTime RunPulse @@;
   datalines;
44.609  11.37  178     45.313  10.07  185
54.297   8.65  156     59.571    .      .
49.874   9.22    .     44.811  11.63  176
  .     11.95  176       .     10.85    .
39.442  13.08  174     60.055   8.63  170
50.541    .      .     37.388  14.03  186
44.754  11.12  176     47.273    .      .
51.855  10.33  166     49.156   8.95  180
40.836  10.95  168     46.672  10.00    .
46.774  10.25    .     50.388  10.08  168
39.407  12.63  174     46.080  11.17  156
45.441   9.63  164       .      8.92    .
45.118  11.08    .     39.203  12.88  168
45.790  10.47  186     50.545   9.93  148
48.673   9.40  186     47.920  11.50  170
47.467  10.50  170
;
Suppose that the data are multivariate normally distributed and the missing data are missing at random (MAR). That is, the probability that an observation is missing can depend on the observed variable values of the individual, but not on the missing variable values of the individual. See the 'Statistical Assumptions for Multiple Imputation' section on page 2537 for a detailed description of the MAR assumption.
The following statements invoke the MI procedure and impute missing values for the FitMiss data set.
proc mi data=FitMiss seed=501213 mu0=50 10 180 out=outmi;
   var Oxygen RunTime RunPulse;
run;
The 'Model Information' table displayed in Figure 44.1 describes the method used in the multiple imputation process. By default, the procedure uses the Markov Chain Monte Carlo (MCMC) method with a single chain to create five imputations. The posterior mode, the highest observed-data posterior density, with a noninformative prior, is computed from the EM algorithm and is used as the starting value for the chain.
                 The MI Procedure

                Model Information

Data Set                            WORK.FITMISS
Method                              MCMC
Multiple Imputation Chain           Single Chain
Initial Estimates for MCMC          EM Posterior Mode
Start                               Starting Value
Prior                               Jeffreys
Number of Imputations               5
Number of Burn-in Iterations        200
Number of Iterations                100
Seed for random number generator    501213
The MI procedure runs 200 burn-in iterations before the first imputation and 100 iterations between imputations. In a Markov chain, the state at the current iteration influences the state at the next iteration. The burn-in iterations at the beginning of each chain are used both to eliminate the serial dependence on the starting value of the chain and to reach the stationary distribution. The between-imputation iterations in a single chain are used to eliminate the serial dependence between two successive imputations.
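The defaults described above can also be written out explicitly. The following is a minimal sketch, assuming the NIMPUTE= option of the PROC MI statement and the CHAIN=, NBITER=, NITER=, INITIAL=, and PRIOR= options of the MCMC statement; it is intended to request the same settings that the 'Model Information' table reports, not to change the analysis.

```sas
/* Sketch: spell out the default MCMC settings reported in the   */
/* 'Model Information' table (single chain, 200 burn-in          */
/* iterations, 100 iterations between imputations, EM starting   */
/* values, Jeffreys prior, five imputations).                    */
proc mi data=FitMiss seed=501213 nimpute=5 mu0=50 10 180 out=outmi;
   mcmc chain=single nbiter=200 niter=100 initial=em prior=jeffreys;
   var Oxygen RunTime RunPulse;
run;
```

Increasing NBITER= or NITER= is one way to allow more iterations if the chain appears slow to lose its dependence on the starting value or on the previous imputation.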
The 'Missing Data Patterns' table displayed in Figure 44.2 lists the distinct missing data patterns with their corresponding frequencies and percents. Here, an 'X' means that the variable is observed in the corresponding group and a '.' means that the variable is missing. The table also displays group-specific variable means. The MI procedure sorts the data into groups based on whether an individual's value is observed or missing for each variable to be analyzed. For a detailed description of missing data patterns, see the 'Missing Data Patterns' section on page 2538.
                    The MI Procedure

                 Missing Data Patterns

                     Run      Run
Group    Oxygen      Time     Pulse     Freq    Percent
    1    X           X        X           21      67.74
    2    X           X        .            4      12.90
    3    X           .        .            3       9.68
    4    .           X        X            1       3.23
    5    .           X        .            2       6.45

                 Missing Data Patterns

         -----------------Group Means-----------------
Group           Oxygen          RunTime         RunPulse
    1        46.353810        10.809524       171.666667
    2        47.109500        10.137500                .
    3        52.461667                .                .
    4                .        11.950000       176.000000
    5                .         9.885000                .
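The grouping in the 'Missing Data Patterns' table can be reproduced by hand as a check. The following is a minimal sketch, not the procedure's own algorithm: the data set name FitPattern and the variable Pattern are hypothetical names introduced here, and the pattern string is built with the MISSING, IFC, and CATS functions.

```sas
/* Sketch: tag each observation with its missing data pattern.   */
/* 'X' marks an observed value and '.' marks a missing value, in */
/* the order Oxygen, RunTime, RunPulse.                          */
/* FitPattern and Pattern are hypothetical names.                */
data FitPattern;
   set FitMiss;
   length Pattern $ 3;
   Pattern = cats(ifc(missing(Oxygen),   '.', 'X'),
                  ifc(missing(RunTime),  '.', 'X'),
                  ifc(missing(RunPulse), '.', 'X'));
run;

/* Observation counts and variable means within each pattern */
proc means data=FitPattern n mean;
   class Pattern;
   var Oxygen RunTime RunPulse;
run;
```

The number of observations in each pattern corresponds to the Freq column, and the within-pattern means correspond to the group means. PROC MEANS orders the patterns alphabetically, so the rows need not appear in the same order as the groups in the table.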
After the completion of m imputations, the 'Multiple Imputation Variance Information' table shown in Figure 44.3 displays the between-imputation variance, within-imputation variance, and total variance for combining complete-data inferences. It also displays the degrees of freedom for the total variance. The relative increase in variance due to missing values, the fraction of missing information, and the relative efficiency (in units of variance) for each variable are also displayed. A detailed description of these statistics is provided in the 'Combining Inferences from Multiply Imputed Data Sets' section on page 2561.
                       The MI Procedure

           Multiple Imputation Variance Information

             -----------------Variance-----------------
Variable          Between        Within         Total         DF
Oxygen           0.056930      0.954041      1.022356     25.549
RunTime          0.000811      0.064496      0.065469     27.721
RunPulse         0.922032      3.269089      4.375528     15.753

           Multiple Imputation Variance Information

                 Relative       Fraction
                 Increase        Missing       Relative
Variable      in Variance    Information     Efficiency
Oxygen           0.071606       0.068898       0.986408
RunTime          0.015084       0.014968       0.997015
RunPulse         0.338455       0.275664       0.947748
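The columns of the variance information table are related through the standard rules for combining multiply imputed results. As a sketch, with notation introduced here rather than taken from the text (W for the within-imputation variance, B for the between-imputation variance, T for the total variance, r for the relative increase in variance, \lambda for the fraction of missing information, \nu_m for the unadjusted degrees of freedom, and RE for the relative efficiency):

```latex
\begin{align*}
T           &= W + \left(1 + \tfrac{1}{m}\right) B \\
r           &= \frac{(1 + 1/m)\, B}{W} \\
\nu_m       &= (m - 1)\left(1 + \tfrac{1}{r}\right)^{2} \\
\lambda     &= \frac{r + 2/(\nu_m + 3)}{r + 1} \\
\mathrm{RE} &= \left(1 + \tfrac{\lambda}{m}\right)^{-1}
\end{align*}
```

For Oxygen with m = 5, these give T = 0.954041 + 1.2 x 0.056930, or about 1.0224, r = 0.068316 / 0.954041, or about 0.0716, and RE = 1 / (1 + 0.068898 / 5), or about 0.9864, which agree with the table up to rounding. The DF value of 25.549 is much smaller than \nu_m from the formula above, which is consistent with a small-sample adjustment based on the complete-data degrees of freedom; the 'Combining Inferences from Multiply Imputed Data Sets' section is the authoritative description.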
The 'Multiple Imputation Parameter Estimates' table shown in Figure 44.4 displays the estimated mean and standard error of the mean for each variable. The inferences are based on the t distribution. The table also displays a 95% confidence interval for the mean and a t statistic with the associated p-value for the hypothesis that the population mean is equal to the value specified with the MU0= option. A detailed description of these statistics is provided in the 'Combining Inferences from Multiply Imputed Data Sets' section on page 2561.
                          The MI Procedure

              Multiple Imputation Parameter Estimates

Variable           Mean    Std Error    95% Confidence Limits        DF
Oxygen        47.094040     1.011116     45.0139      49.1742    25.549
RunTime       10.572073     0.255870     10.0477      11.0964    27.721
RunPulse     171.787793     2.091776    167.3478     176.2278    15.753

              Multiple Imputation Parameter Estimates

                                                    t for H0:
Variable        Minimum       Maximum          Mu0   Mean=Mu0   Pr > |t|
Oxygen        46.783898     47.395550    50.000000      -2.87     0.0081
RunTime       10.526392     10.599616    10.000000       2.24     0.0336
RunPulse     170.774818    173.122002   180.000000      -3.93     0.0012
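The confidence limits and the test in the parameter estimates table follow from the mean, standard error, and degrees of freedom. The following is a minimal sketch for the Oxygen row, assuming the test is two-sided and using the TINV and PROBT functions; the variable names in the DATA step are hypothetical, and the inputs are the rounded values from the table.

```sas
/* Sketch: recompute the Oxygen row from its mean, standard      */
/* error, and degrees of freedom. Inputs are rounded table       */
/* values, so results agree with the table only approximately.   */
data _null_;
   mean = 47.094040;
   se   = 1.011116;
   df   = 25.549;
   mu0  = 50;

   t     = (mean - mu0) / se;              /* t statistic for H0: Mean=Mu0 */
   tcrit = tinv(0.975, df);                /* 97.5th percentile of t(df)   */
   lower = mean - tcrit * se;              /* 95% confidence limits        */
   upper = mean + tcrit * se;
   p     = 2 * (1 - probt(abs(t), df));    /* two-sided p-value            */

   put t= lower= upper= p=;
run;
```

The values written to the log should be close to the t statistic, confidence limits, and p-value reported for Oxygen.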
In addition to the output tables, the procedure also creates a data set with imputed values. The imputed data sets are stored in the outmi data set, with the index variable _Imputation_ indicating the imputation number. The data set can now be analyzed using standard statistical procedures with _Imputation_ as a BY variable. The following statements list the first ten observations of data set outmi.
proc print data=outmi (obs=10);
   title 'First 10 Observations of the Imputed Data Set';
run;
The table displayed in Figure 44.5 shows that the precision of the imputed values differs from the precision of the observed values. You can use the ROUND= option to make the imputed values consistent with the observed values.
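As a sketch of the ROUND= option, the following statements rerun the imputation with rounding units chosen to match the observed precision: three decimal places for Oxygen, two for RunTime, and whole numbers for RunPulse. The rounding units and the output data set name outmi_round are assumptions made for illustration. The listing in Figure 44.5 comes from the run without rounding.

```sas
/* Sketch: round imputed values to the precision of the observed */
/* data. The units are listed in the order of the VAR statement. */
/* outmi_round is a hypothetical data set name.                  */
proc mi data=FitMiss seed=501213 round=.001 .01 1
        mu0=50 10 180 out=outmi_round;
   var Oxygen RunTime RunPulse;
run;
```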
     First 10 Observations of the Imputed Data Set

                                                  Run
Obs    _Imputation_     Oxygen     RunTime      Pulse
  1         1          44.6090     11.3700    178.000
  2         1          45.3130     10.0700    185.000
  3         1          54.2970      8.6500    156.000
  4         1          59.5710      8.0747    155.925
  5         1          49.8740      9.2200    176.837
  6         1          44.8110     11.6300    176.000
  7         1          42.8857     11.9500    176.000
  8         1          46.9992     10.8500    173.099
  9         1          39.4420     13.0800    174.000
 10         1          60.0550      8.6300    170.000
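Analyzing the imputed data with _Imputation_ as a BY variable can be sketched as follows. The regression of Oxygen on RunTime and RunPulse is chosen only for illustration, and the data set name outreg is hypothetical. The sketch fits the model separately within each imputation and then combines the results with the MIANALYZE procedure.

```sas
/* Sketch: fit the same model to each imputed data set.          */
/* OUTEST= with COVOUT saves the parameter estimates and their   */
/* covariance matrices for each value of _Imputation_.           */
/* The model and the name outreg are illustrative assumptions.   */
proc reg data=outmi outest=outreg covout noprint;
   model Oxygen = RunTime RunPulse;
   by _Imputation_;
run;

/* Combine the five sets of estimates into one inference */
proc mianalyze data=outreg;
   modeleffects Intercept RunTime RunPulse;
run;
```

Any procedure that supports BY-group processing and can save its estimates with their standard errors or covariance matrices can be used in the same way.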