Getting Started | SAS/STAT 9.1 User's Guide (Vol. 4)

Consider the following Fitness data set that has been altered to contain an arbitrary pattern of missingness:

  *----------------- Data on Physical Fitness -----------------*
    These measurements were made on men involved in a physical
    fitness course at N.C. State University.
    Only selected variables of
    Oxygen (oxygen intake, ml per kg body weight per minute),
    Runtime (time to run 1.5 miles in minutes), and
    RunPulse (heart rate while running) are used.
    Certain values were changed to missing for the analysis.
  *------------------------------------------------------------*;
  data FitMiss;
     input Oxygen RunTime RunPulse @@;
     datalines;
  44.609  11.37  178     45.313  10.07   185
  54.297   8.65  156     59.571    .       .
  49.874   9.22    .     44.811  11.63   176
       .  11.95  176          .  10.85     .
  39.442  13.08  174     60.055   8.63   170
  50.541    .      .     37.388  14.03   186
  44.754  11.12  176     47.273    .       .
  51.855  10.33  166     49.156   8.95   180
  40.836  10.95  168     46.672  10.00     .
  46.774  10.25    .     50.388  10.08   168
  39.407  12.63  174     46.080  11.17   156
  45.441   9.63  164          .   8.92     .
  45.118  11.08    .     39.203  12.88   168
  45.790  10.47  186     50.545   9.93   148
  48.673   9.40  186     47.920  11.50   170
  47.467  10.50  170
  ;

Suppose that the data are multivariate normally distributed and the missing data are missing at random (MAR). That is, the probability that an observation is missing can depend on the observed variable values of the individual, but not on the missing variable values of the individual. See the 'Statistical Assumptions for Multiple Imputation' section on page 2537 for a detailed description of the MAR assumption.
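In symbols, the MAR assumption can be sketched as follows. The notation is introduced here and is not part of this section: R denotes the matrix of missingness indicators, and Y_obs and Y_mis denote the observed and missing parts of the data.

```latex
\Pr(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = \Pr(R \mid Y_{\mathrm{obs}})
```

For the FitMiss data, this would mean, for example, that whether RunPulse is missing may depend on an individual's observed Oxygen or RunTime value but not on the unrecorded pulse itself.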

The following statements invoke the MI procedure and impute missing values for the FitMiss data set.

  proc mi data=FitMiss seed=501213 mu0=50 10 180 out=outmi;
     var Oxygen RunTime RunPulse;
  run;

The 'Model Information' table displayed in Figure 44.1 describes the method used in the multiple imputation process. By default, the procedure uses the Markov Chain Monte Carlo (MCMC) method with a single chain to create five imputations. The posterior mode, the highest observed-data posterior density, with a noninformative prior, is computed from the EM algorithm and is used as the starting value for the chain.

  The MI Procedure

  Model Information

  Data Set                             WORK.FITMISS
  Method                               MCMC
  Multiple Imputation Chain            Single Chain
  Initial Estimates for MCMC           EM Posterior Mode
  Start                                Starting Value
  Prior                                Jeffreys
  Number of Imputations                5
  Number of Burn-in Iterations         200
  Number of Iterations                 100
  Seed for random number generator     501213

Figure 44.1: Model Information

The MI procedure takes 200 burn-in iterations before the first imputation and 100 iterations between imputations. In a Markov chain, the state at the current iteration influences the state at the next iteration. The burn-in iterations are the iterations at the beginning of each chain that are used both to eliminate the dependence on the starting value of the chain and to reach the stationary distribution. The between-imputation iterations in a single chain are used to eliminate the serial dependence between successive imputations.
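As a sketch, the defaults reported in Figure 44.1 can also be written out explicitly, which makes the burn-in and between-imputation iteration counts visible in the code. The option names below (NIMPUTE= in the PROC MI statement; CHAIN=, INITIAL=, PRIOR=, NBITER=, and NITER= in the MCMC statement) come from the PROC MI syntax rather than from this section, and the output data set name outmi2 is hypothetical. The run is intended to match the earlier one.

```sas
proc mi data=FitMiss seed=501213 nimpute=5 mu0=50 10 180 out=outmi2;
   /* single chain, EM posterior mode as starting value, Jeffreys prior, */
   /* 200 burn-in iterations, 100 iterations between imputations         */
   mcmc chain=single initial=em prior=jeffreys nbiter=200 niter=100;
   var Oxygen RunTime RunPulse;
run;
```

Raising NBITER= or NITER= is one way to spend more iterations on removing dependence if the defaults seem too short for a given data set.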

The 'Missing Data Patterns' table displayed in Figure 44.2 lists distinct missing data patterns with corresponding frequencies and percents. Here, an 'X' means that the variable is observed in the corresponding group and a '.' means that the variable is missing. The table also displays group-specific variable means. The MI procedure sorts the data into groups based on whether an individual's value is observed or missing for each variable to be analyzed. For a detailed description of missing data patterns, see the 'Missing Data Patterns' section on page 2538.

  The MI Procedure

  Missing Data Patterns

                       Run      Run
  Group    Oxygen      Time     Pulse      Freq    Percent
      1    X           X        X            21      67.74
      2    X           X        .             4      12.90
      3    X           .        .             3       9.68
      4    .           X        X             1       3.23
      5    .           X        .             2       6.45

  Missing Data Patterns

           -----------------Group Means-----------------
  Group          Oxygen         RunTime        RunPulse
      1       46.353810       10.809524      171.666667
      2       47.109500       10.137500               .
      3       52.461667               .               .
      4               .       11.950000      176.000000
      5               .        9.885000               .

Figure 44.2: Missing Data Patterns
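To inspect the missing data patterns before committing to an imputation run, one option is to request zero imputations. This is a minimal sketch that assumes the NIMPUTE=0 setting of PROC MI, which is not used elsewhere in this section. With that setting the procedure displays tables such as 'Model Information' and 'Missing Data Patterns' but creates no imputed values.

```sas
/* Display the missing data patterns only; no imputations are created */
proc mi data=FitMiss nimpute=0;
   var Oxygen RunTime RunPulse;
run;
```

The percents in the resulting table are the group frequencies divided by the 31 observations in FitMiss; for example, 21/31 gives the 67.74 shown for group 1.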

After the completion of m imputations, the 'Multiple Imputation Variance Information' table shown in Figure 44.3 displays the between-imputation variance, within-imputation variance, and total variance for combining complete-data inferences. It also displays the degrees of freedom for the total variance. The relative increase in variance due to missing values, the fraction of missing information, and the relative efficiency (in units of variance) for each variable are also displayed. A detailed description of these statistics is provided in the 'Combining Inferences from Multiply Imputed Data Sets' section on page 2561.

  The MI Procedure

  Multiple Imputation Variance Information

               -----------------Variance-----------------
  Variable         Between         Within          Total        DF
  Oxygen          0.056930       0.954041       1.022356    25.549
  RunTime         0.000811       0.064496       0.065469    27.721
  RunPulse        0.922032       3.269089       4.375528    15.753

  Multiple Imputation Variance Information

                  Relative       Fraction
                  Increase        Missing       Relative
  Variable     in Variance    Information     Efficiency
  Oxygen          0.071606       0.068898       0.986408
  RunTime         0.015084       0.014968       0.997015
  RunPulse        0.338455       0.275664       0.947748

Figure 44.3: Variance Information
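As a worked check of Figure 44.3, the Oxygen row can be reproduced by hand from the between-imputation variance B, the within-imputation variance W, and m = 5. The formulas below are the standard combining rules for multiply imputed data; the symbols T, r, v_m, and λ are notation introduced here. One assumption is worth flagging: the fraction of missing information is computed with the unadjusted degrees of freedom v_m rather than with the DF value shown in the table. The displayed values agree with that reading up to rounding.

```latex
\begin{aligned}
T &= W + \left(1 + \tfrac{1}{m}\right) B = 0.954041 + 1.2 \times 0.056930 \approx 1.022357 \\
r &= \frac{(1 + 1/m)\,B}{W} = \frac{0.068316}{0.954041} \approx 0.071607 \\
v_m &= (m - 1)\left(1 + \tfrac{1}{r}\right)^{2} \approx 895.8 \\
\hat{\lambda} &= \frac{r + 2/(v_m + 3)}{r + 1} \approx \frac{0.071607 + 0.002225}{1.071607} \approx 0.06890 \\
RE &= \left(1 + \tfrac{\hat{\lambda}}{m}\right)^{-1} \approx \frac{1}{1.013780} \approx 0.98641
\end{aligned}
```

The same arithmetic for RunPulse gives T = 3.269089 + 1.2 × 0.922032 ≈ 4.375527, which shows why that variable, with the most missing values, has the largest relative increase in variance.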

The 'Multiple Imputation Parameter Estimates' table shown in Figure 44.4 displays the estimated mean and standard error of the mean for each variable. The inferences are based on the t distribution. The table also displays a 95% confidence interval for the mean and a t statistic with the associated p-value for the hypothesis that the population mean is equal to the value specified with the MU0= option. A detailed description of these statistics is provided in the 'Combining Inferences from Multiply Imputed Data Sets' section on page 2561.

  The MI Procedure

  Multiple Imputation Parameter Estimates

  Variable            Mean      Std Error     95% Confidence Limits         DF
  Oxygen         47.094040       1.011116      45.0139      49.1742     25.549
  RunTime        10.572073       0.255870      10.0477      11.0964     27.721
  RunPulse      171.787793       2.091776     167.3478     176.2278     15.753

  Multiple Imputation Parameter Estimates

                                                              t for H0:
  Variable         Minimum        Maximum            Mu0      Mean=Mu0    Pr > |t|
  Oxygen         46.783898      47.395550      50.000000         -2.87      0.0081
  RunTime        10.526392      10.599616      10.000000          2.24      0.0336
  RunPulse      170.774818     173.122002     180.000000         -3.93      0.0012

Figure 44.4: Parameter Estimates
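The Oxygen row of Figure 44.4 can be verified from the total variance reported in Figure 44.3 and the MU0= value of 50. The symbols below are notation introduced here, and the t quantile 2.0573 is backed out from the displayed confidence limits rather than looked up independently, so treat it as approximate.

```latex
\begin{aligned}
\mathrm{SE} &= \sqrt{T} = \sqrt{1.022356} \approx 1.011116 \\
t &= \frac{\bar{Q} - \mu_0}{\mathrm{SE}} = \frac{47.094040 - 50}{1.011116} \approx -2.87 \\
\bar{Q} \pm t_{0.975,\,25.549}\,\mathrm{SE} &\approx 47.094040 \pm 2.0573 \times 1.011116 \approx (45.014,\ 49.174)
\end{aligned}
```

Because the interval excludes 50, the t statistic is significant at the 5% level, consistent with the p-value of 0.0081 in the table.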

In addition to the output tables, the procedure also creates a data set with imputed values. The imputed data sets are stored in the outmi data set, with the index variable _Imputation_ indicating the imputation numbers. The data set can now be analyzed using standard statistical procedures with _Imputation_ as a BY variable. The following statements list the first ten observations of data set outmi.

  proc print data=outmi (obs=10);
     title 'First 10 Observations of the Imputed Data Set';
  run;

The table displayed in Figure 44.5 shows that the precision of the imputed values differs from the precision of the observed values. You can use the ROUND= option to make the imputed values consistent with the observed values.

  First 10 Observations of the Imputed Data Set

                                                   Run
  Obs    _Imputation_     Oxygen    RunTime      Pulse
    1         1          44.6090    11.3700    178.000
    2         1          45.3130    10.0700    185.000
    3         1          54.2970     8.6500    156.000
    4         1          59.5710     8.0747    155.925
    5         1          49.8740     9.2200    176.837
    6         1          44.8110    11.6300    176.000
    7         1          42.8857    11.9500    176.000
    8         1          46.9992    10.8500    173.099
    9         1          39.4420    13.0800    174.000
   10         1          60.0550     8.6300    170.000

Figure 44.5: Imputed Data Set
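The two follow-up steps mentioned above, rounding the imputed values and analyzing the result with _Imputation_ as a BY variable, might look like the following sketch. The rounding units (.001, .01, and 1) are inferred from the precision of the observed values. The data set names outround and outreg and the regression of Oxygen on RunTime and RunPulse are illustrative assumptions, not part of this example. PROC MIANALYZE is used here as one way to combine the five sets of estimates.

```sas
/* Re-impute with one rounding unit per VAR variable */
proc mi data=FitMiss seed=501213 round=.001 .01 1 out=outround;
   var Oxygen RunTime RunPulse;
run;

/* Fit the same model separately within each imputed data set */
proc reg data=outround outest=outreg covout noprint;
   model Oxygen = RunTime RunPulse;
   by _Imputation_;
run;

/* Combine the five sets of estimates and covariance matrices */
proc mianalyze data=outreg;
   modeleffects Intercept RunTime RunPulse;
run;
```

Any procedure that supports BY processing and can write its estimates to a data set could take the place of PROC REG in the middle step.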