The Fish data described in the STEPDISC procedure are measurements of 159 fish of seven species caught in Finland's Lake Laengelmavesi. For each fish, the length, height, and width are measured. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail (Length1), from the nose to the notch of its tail (Length2), and from the nose to the end of its tail (Length3). See Chapter 67, 'The STEPDISC Procedure,' for more information.
The Fish1 data set is constructed from the Fish data set and contains only one species of fish and the three length measurements. Some values have been set to missing, and the resulting data set has a monotone missing pattern in the variables Length1, Length2, and Length3. The Fish1 data set is used in Example 44.2 with the propensity score method and in Example 44.3 with the regression method.
The Fish2 data set is also constructed from the Fish data set and contains two species of fish. Some values have been set to missing, and the resulting data set has a monotone missing pattern in the variables Length3, Height, Width, and Species. The Fish2 data set is used in Example 44.4 with the logistic regression method and in Example 44.5 with the discriminant function method. Note that some values of the variable Species have also been altered in the data set.
The FitMiss data set created in the 'Getting Started' section is used in other examples. The following statements create the Fish1 data set.
   /*----------- Fishes of Species Bream ----------*/
   data Fish1;
      title 'Fish Measurement Data';
      input Length1 Length2 Length3 @@;
      datalines;
   23.2 25.4 30.0   24.0 26.3 31.2   23.9 26.5 31.1   26.3 29.0 33.5
   26.5 29.0  .     26.8 29.7 34.7   26.8  .    .     27.6 30.0 35.0
   27.6 30.0 35.1   28.5 30.7 36.2   28.4 31.0 36.2   28.7  .    .
   29.1 31.5  .     29.5 32.0 37.3   29.4 32.0 37.2   29.4 32.0 37.2
   30.4 33.0 38.3   30.4 33.0 38.5   30.9 33.5 38.6   31.0 33.5 38.7
   31.3 34.0 39.5   31.4 34.0 39.2   31.5 34.5  .     31.8 35.0 40.6
   31.9 35.0 40.5   31.8 35.0 40.9   32.0 35.0 40.6   32.7 36.0 41.5
   32.8 36.0 41.6   33.5 37.0 42.6   35.0 38.5 44.1   35.0 38.5 44.0
   36.2 39.5 45.3   37.4 41.0 45.9   38.0 41.0 46.5
   ;
The Fish2 data set contains two of the seven species in the Fish data set. For each of the two species ( Bream and Parkki ), the length from the nose of the fish to the end of its tail, the height, and the width of each fish are measured. The height and width are recorded as percentages of the length variable.
The following statements create the Fish2 data set.
   /*-------- Fishes of Species Bream and Parkki --------*/
   data Fish2 (drop=HtPct WidthPct);
      title 'Fish Measurement Data';
      input Species $ Length3 HtPct WidthPct @@;
      Height = HtPct*Length3/100;
      Width  = WidthPct*Length3/100;
      datalines;
   Gp1 30.0 38.4 13.4   Gp1 31.2 40.0 13.8   Gp1 31.1 39.8 15.1
    .  33.5 38.0  .      .  34.0 36.6 15.1   Gp1 34.7 39.2 14.2
   Gp1 34.5 41.1 15.3   Gp1 35.0 36.2 13.4   Gp1 35.1 39.9 13.8
    .  36.2 39.3 13.7   Gp1 36.2 39.4 14.1    .  36.2 39.7 13.3
   Gp1 36.4 37.8 12.0    .  37.3 37.3 13.6   Gp1 37.2 40.2 13.9
   Gp1 37.2 41.5 15.0   Gp1 38.3 38.8 13.8   Gp1 38.5 38.8 13.5
   Gp1 38.6 40.5 13.3   Gp1 38.7 37.4 14.8   Gp1 39.5 38.3 14.1
   Gp1 39.2 40.8 13.7    .  39.7 39.1  .     Gp1 40.6 38.1 15.1
   Gp1 40.5 40.1 13.8   Gp1 40.9 40.0 14.8   Gp1 40.6 40.3 15.0
   Gp1 41.5 39.8 14.1   Gp2 41.6 40.6 14.9   Gp1 42.6 44.5 15.5
   Gp1 44.1 40.9 14.3   Gp1 44.0 41.1 14.3   Gp1 45.3 41.4 14.9
   Gp1 45.9 40.6 14.7   Gp1 46.5 37.9 13.7   Gp2 16.2 25.6 14.0
   Gp2 20.3 26.1 13.9   Gp2 21.2 26.3 13.7   Gp2 22.2 25.3 14.3
   Gp2 22.2 28.0 16.1   Gp2 22.8 28.4 14.7   Gp2 23.1 26.7 14.7
    .  23.7 25.8 13.9   Gp2 24.7 23.5 15.2   Gp1 24.3 27.3 14.6
   Gp2 25.3 27.8 15.1   Gp2 25.0 26.2 13.3   Gp2 25.0 25.6 15.2
   Gp2 27.2 27.7 14.1   Gp2 26.7 25.9 13.6    .  26.8 27.6 15.4
   Gp2 27.9 25.4 14.0   Gp2 29.2 30.4 15.4   Gp2 30.6 28.0 15.6
   Gp2 35.0 27.1 15.3
   ;
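As a quick check of the DATA step arithmetic, the first observation (Length3 = 30.0, HtPct = 38.4, WidthPct = 13.4) works out as follows. The results agree with the first row of the imputed data sets listed in Example 44.4 and Example 44.5.

```latex
\begin{align*}
\text{Height} &= \frac{\text{HtPct} \times \text{Length3}}{100}
               = \frac{38.4 \times 30.0}{100} = 11.52, \\
\text{Width}  &= \frac{\text{WidthPct} \times \text{Length3}}{100}
               = \frac{13.4 \times 30.0}{100} = 4.02.
\end{align*}
```

An observation with a missing WidthPct value, such as the fourth one, therefore has a missing Width value as well.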
This example uses the EM algorithm to compute the maximum likelihood estimates for parameters of a multivariate normal distribution using data with missing values. The following statements invoke the MI procedure and request the EM algorithm to compute the MLE for (μ, Σ) of a multivariate normal distribution from the input data set FitMiss.
   proc mi data=FitMiss seed=1518971 simple nimpute=0;
      em itprint outem=outem;
      var Oxygen RunTime RunPulse;
   run;
Note that when you specify the NIMPUTE=0 option, the missing values will not be imputed. The procedure generates the following output:
The 'Model Information' table shown in Output 44.1.1 describes the method and options used in the procedure if a positive number is specified in the NIMPUTE= option.
   Fish Measurement Data

   The MI Procedure

   Model Information

   Data Set                             WORK.FITMISS
   Method                               MCMC
   Multiple Imputation Chain            Single Chain
   Initial Estimates for MCMC           EM Posterior Mode
   Start                                Starting Value
   Prior                                Jeffreys
   Number of Imputations                0
   Number of Burn-in Iterations         200
   Number of Iterations                 100
   Seed for random number generator     1518971
The 'Missing Data Patterns' table shown in Output 44.1.2 lists distinct missing data patterns with corresponding frequencies and percents. Here, a value of 'X' means that the variable is observed in the corresponding group and a value of '.' means that the variable is missing. The table also displays group-specific variable means.
   The MI Procedure

   Missing Data Patterns

                      Run     Run
   Group    Oxygen    Time    Pulse    Freq    Percent
       1    X         X       X          21      67.74
       2    X         X       .           4      12.90
       3    X         .       .           3       9.68
       4    .         X       X           1       3.23
       5    .         X       .           2       6.45

   Missing Data Patterns

            -----------------Group Means----------------
   Group         Oxygen        RunTime       RunPulse
       1      46.353810      10.809524     171.666667
       2      47.109500      10.137500              .
       3      52.461667              .              .
       4              .      11.950000     176.000000
       5              .       9.885000              .
With the SIMPLE option, the procedure displays simple descriptive univariate statistics for available cases in the 'Univariate Statistics' table shown in Output 44.1.3 and correlations from pairwise available cases in the 'Pairwise Correlations' table shown in Output 44.1.4.
   The MI Procedure

   Univariate Statistics

   Variable      N         Mean      Std Dev      Minimum      Maximum
   Oxygen       28     47.11618      5.41305     37.38800     60.05500
   RunTime      28     10.68821      1.37988      8.63000     14.03000
   RunPulse     22    171.86364     10.14324    148.00000    186.00000

   Univariate Statistics

               ---Missing Values---
   Variable       Count    Percent
   Oxygen             3       9.68
   RunTime            3       9.68
   RunPulse           9      29.03

   The MI Procedure

   Pairwise Correlations

                    Oxygen        RunTime       RunPulse
   Oxygen      1.000000000    0.849118562    0.343961742
   RunTime     0.849118562    1.000000000    0.247258191
   RunPulse    0.343961742    0.247258191    1.000000000
With the EM statement, the procedure displays the initial parameter estimates for EM in the 'Initial Parameter Estimates for EM' table shown in Output 44.1.5.
   The MI Procedure

   Initial Parameter Estimates for EM

   _TYPE_    _NAME_          Oxygen       RunTime      RunPulse
   MEAN                   47.116179     10.688214    171.863636
   COV       Oxygen       29.301078             0             0
   COV       RunTime              0      1.904067             0
   COV       RunPulse             0             0    102.885281
With the ITPRINT option in the EM statement, the 'EM (MLE) Iteration History' table shown in Output 44.1.6 displays the iteration history for the EM algorithm.
   The MI Procedure

   EM (MLE) Iteration History

   _Iteration_      -2 Log L       Oxygen      RunTime      RunPulse
             0    289.544782    47.116179    10.688214    171.863636
             1    263.549489    47.116179    10.688214    171.863636
             2    255.851312    47.139089    10.603506    171.538203
             3    254.616428    47.122353    10.571685    171.426790
             4    254.494971    47.111080    10.560585    171.398296
             5    254.483973    47.106523    10.556768    171.389208
             6    254.482920    47.104899    10.555485    171.385257
             7    254.482813    47.104348    10.555062    171.383345
             8    254.482801    47.104165    10.554923    171.382424
             9    254.482800    47.104105    10.554878    171.381992
            10    254.482800    47.104086    10.554864    171.381796
            11    254.482800    47.104079    10.554859    171.381708
            12    254.482800    47.104077    10.554858    171.381669
The 'EM (MLE) Parameter Estimates' table shown in Output 44.1.7 displays the maximum likelihood estimates for μ and Σ of a multivariate normal distribution from the data set FitMiss.
   The MI Procedure

   EM (MLE) Parameter Estimates

   _TYPE_    _NAME_          Oxygen       RunTime      RunPulse
   MEAN                   47.104077     10.554858    171.381669
   COV       Oxygen       27.797931      6.457975     18.031298
   COV       RunTime       6.457975      2.015514      3.516287
   COV       RunPulse     18.031298      3.516287     97.766857
You can also output the EM (MLE) parameter estimates into an output data set with the OUTEM= option. The following statements list the observations in the output data set outem .
   proc print data=outem;
      title 'EM Estimates';
   run;
The output data set outem shown in Output 44.1.8 is a TYPE=COV data set. The observation with _TYPE_='MEAN' contains the MLE for the parameter μ, and the observations with _TYPE_='COV' contain the MLE for the parameter Σ of a multivariate normal distribution from the data set FitMiss.
   EM Estimates

   Obs    _TYPE_    _NAME_       Oxygen    RunTime    RunPulse
     1    MEAN                  47.1041    10.5549     171.382
     2    COV       Oxygen      27.7979     6.4580      18.031
     3    COV       RunTime      6.4580     2.0155       3.516
     4    COV       RunPulse    18.0313     3.5163      97.767
This example uses the propensity score method to impute missing values for variables in a data set with a monotone missing pattern. The following statements invoke the MI procedure and request the propensity score method. The resulting data set is named outex2 .
   proc mi data=Fish1 seed=899603 out=outex2;
      monotone propensity;
      var Length1 Length2 Length3;
   run;
Note that the VAR statement is required and the data set must have a monotone missing pattern with variables as ordered in the VAR statement. The procedure generates the following output:
The 'Model Information' table shown in Output 44.2.1 describes the method and options used in the multiple imputation process. By default, five imputations are created for the missing data.
   The MI Procedure

   Model Information

   Data Set                             WORK.FISH1
   Method                               Monotone
   Number of Imputations                5
   Seed for random number generator     899603
When monotone methods are used in the imputation, MONOTONE is displayed as the method. The 'Monotone Model Specification' table shown in Output 44.2.2 displays the detailed model specification. By default, the observations are sorted into five groups based on their propensity scores.
   The MI Procedure

   Monotone Model Specification

                             Imputed
   Method                    Variables
   Propensity( Groups= 5)    Length2 Length3
Without covariates specified for imputed variables Length2 and Length3 , the variable Length1 is used as the covariate for Length2 , and variables Length1 and Length2 are used as covariates for Length3 .
The 'Missing Data Patterns' table shown in Output 44.2.3 lists distinct missing data patterns with corresponding frequencies and percents. Here, values of 'X' and '.' indicate that the variable is observed or missing in the corresponding group. The table confirms a monotone missing pattern for these three variables.
   The MI Procedure

   Missing Data Patterns

   Group    Length1    Length2    Length3    Freq    Percent
       1    X          X          X            30      85.71
       2    X          X          .             3       8.57
       3    X          .          .             2       5.71

   Missing Data Patterns

            -----------------Group Means----------------
   Group        Length1        Length2        Length3
       1      30.603333      33.436667      38.720000
       2      29.033333      31.666667              .
       3      27.750000              .              .

The imputation proceeds in three steps. First, the missing values of Length2 in Group 3 are imputed using observed values of Length1. Then the missing values of Length3 in Group 2 are imputed using observed values of Length1 and Length2. Finally, the missing values of Length3 in Group 3 are imputed using observed values of Length1 and imputed values of Length2.
After the completion of m imputations, the 'Multiple Imputation Variance Information' table shown in Output 44.2.4 displays the between-imputation variance, within-imputation variance, and total variance for combining complete-data inferences. It also displays the degrees of freedom for the total variance. The relative increase in variance due to missingness, the fraction of missing information, and the relative efficiency for each variable are also displayed. A detailed description of these statistics is provided in the 'Combining Inferences from Multiply Imputed Data Sets' section on page 2561.
   The MI Procedure

   Multiple Imputation Variance Information

               -----------------Variance-----------------
   Variable       Between        Within         Total        DF
   Length2       0.001500      0.465422      0.467223    32.034
   Length3       0.049725      0.547434      0.607104    27.103

   Multiple Imputation Variance Information

                  Relative       Fraction
                  Increase        Missing      Relative
   Variable    in Variance    Information    Efficiency
   Length2        0.003869       0.003861      0.999228
   Length3        0.108999       0.102610      0.979891
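The entries in this table are related by the standard combining rules for m imputations. As a sketch, the Length3 row with m = 5 can be reproduced as follows. The symbols B, W, T, r, λ, and RE are notation introduced here for the between-imputation variance, within-imputation variance, total variance, relative increase in variance, fraction of missing information, and relative efficiency. The value of λ is taken from the table rather than derived.

```latex
\begin{align*}
T &= W + \left(1 + \tfrac{1}{m}\right) B
   = 0.547434 + 1.2 \times 0.049725 = 0.607104, \\
r &= \frac{\left(1 + \tfrac{1}{m}\right) B}{W}
   = \frac{0.059670}{0.547434} \approx 0.108999, \\
\mathrm{RE} &= \left(1 + \frac{\lambda}{m}\right)^{-1}
   = \left(1 + \frac{0.102610}{5}\right)^{-1} \approx 0.979891.
\end{align*}
```

The same arithmetic applied to the Length2 row gives a total variance of about 0.467222, which matches the displayed value up to rounding of the printed inputs.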
The 'Multiple Imputation Parameter Estimates' table shown in Output 44.2.5 displays the estimated mean and standard error of the mean for each variable. The inferences are based on the t distribution. For each variable, the table also displays a 95% confidence interval for the mean and a t statistic with the associated p-value for the hypothesis that the population mean is equal to the value specified in the MU0= option, which is zero by default.
   The MI Procedure

   Multiple Imputation Parameter Estimates

   Variable         Mean    Std Error    95% Confidence Limits        DF
   Length2     33.006857     0.683537     31.61460    34.39912    32.034
   Length3     38.361714     0.779169     36.76328    39.96015    27.103

   Multiple Imputation Parameter Estimates

                                                   t for H0:
   Variable      Minimum      Maximum      Mu0    Mean=Mu0    Pr > |t|
   Length2     32.957143    33.060000        0       48.29      <.0001
   Length3     38.080000    38.545714        0       49.23      <.0001
The following statements list the first ten observations of the data set outex2, as shown in Output 44.2.6. The missing values are imputed from observed values with similar propensity scores.
   proc print data=outex2(obs=10);
      title 'First 10 Observations of the Imputed Data Set';
   run;

   First 10 Observations of the Imputed Data Set

   Obs    _Imputation_    Length1    Length2    Length3
     1               1       23.2       25.4       30.0
     2               1       24.0       26.3       31.2
     3               1       23.9       26.5       31.1
     4               1       26.3       29.0       33.5
     5               1       26.5       29.0       38.6
     6               1       26.8       29.7       34.7
     7               1       26.8       29.0       35.0
     8               1       27.6       30.0       35.0
     9               1       27.6       30.0       35.1
    10               1       28.5       30.7       36.2
This example uses the regression method to impute missing values for all variables in a data set with a monotone missing pattern. The following statements invoke the MI procedure and request the regression method for variable Length2 and the predictive mean matching method for variable Length3 . The resulting data set is named outex3 .
   proc mi data=Fish1 round=.1 mu0= 0 35 45
           seed=13951639 out=outex3;
      monotone reg(Length2/ details)
               regpmm(Length3= Length1 Length2 Length1*Length2/ details);
      var Length1 Length2 Length3;
   run;
The ROUND= option is used to round the imputed values to the same precision as observed values. The values specified with the ROUND= option are matched with the variables Length1, Length2, and Length3 in the order listed in the VAR statement. The MU0= option requests t tests for the hypotheses that the population means corresponding to the variables in the VAR statement are Length2=35 and Length3=45.
Note that an optimal K= value is currently not available for the REGPMM option in the literature on multiple imputation. The default K=5 is experimental and may change in future releases.
The 'Missing Data Patterns' table lists distinct missing data patterns with corresponding frequencies and percents. It is identical to the table displayed in Output 44.2.3 in the previous example.
The 'Monotone Model Specification' table shown in Output 44.3.1 displays the model specification.
   Fish Measurement Data

   The MI Procedure

   Monotone Model Specification

                              Imputed
   Method                     Variables
   Regression                 Length2
   Regression-PMM( K= 5)      Length3
With the DETAILS option, the parameters estimated from the observed data and the parameters used in each imputation are displayed in Output 44.3.2 and Output 44.3.3.
   The MI Procedure

   Regression Models for Monotone Method

   Imputed                              ----------------Imputation----------------
   Variable    Effect       Obs-Data              1              2              3
   Length2     Intercept     0.04249       0.049184       0.055470       0.051346
   Length2     Length1       0.98587       1.001934       0.995275       0.992294

   Regression Models for Monotone Method

   Imputed                  ---------Imputation---------
   Variable    Effect                  4              5
   Length2     Intercept        0.064193       0.030719
   Length2     Length1          0.983122       0.995883

   The MI Procedure

   Regression Models for Monotone Predicted Mean Matching Method

   Imputed                                    ---------------Imputation---------------
   Variable    Effect             Obs-Data             1             2             3
   Length3     Intercept           0.01304      0.004134      0.011417      0.034177
   Length3     Length1             0.01332      0.025320      0.037494      0.308765
   Length3     Length2             0.98918      0.955510      1.025741      0.673374
   Length3     Length1*Length2     0.02521      0.034964      0.022017      0.017919

   Regression Models for Monotone Predicted Mean Matching Method

   Imputed                        ---------Imputation---------
   Variable    Effect                        4              5
   Length3     Intercept              0.010532       0.004685
   Length3     Length1                0.156606       0.147118
   Length3     Length2                0.828384       1.146440
   Length3     Length1*Length2        0.029335       0.034671

After the five imputations (the default number) are completed, the 'Multiple Imputation Variance Information' table shown in Output 44.3.4 displays the between-imputation variance, within-imputation variance, and total variance for combining complete-data inferences. The relative increase in variance due to missingness, the fraction of missing information, and the relative efficiency for each variable are also displayed. These statistics are described in the 'Combining Inferences from Multiply Imputed Data Sets' section on page 2561.
   The MI Procedure

   Multiple Imputation Variance Information

               -----------------Variance-----------------
   Variable       Between        Within         Total        DF
   Length2       0.000133      0.439512      0.439672     32.15
   Length3       0.000386      0.486913      0.487376    32.131

   Multiple Imputation Variance Information

                  Relative       Fraction
                  Increase        Missing      Relative
   Variable    in Variance    Information    Efficiency
   Length2        0.000363       0.000363      0.999927
   Length3        0.000952       0.000951      0.999810
The 'Multiple Imputation Parameter Estimates' table shown in Output 44.3.5 displays a 95% confidence interval for the mean and a t statistic with its associated p-value for each of the hypotheses requested with the MU0= option.
   The MI Procedure

   Multiple Imputation Parameter Estimates

   Variable         Mean    Std Error    95% Confidence Limits        DF
   Length2     33.104571     0.663078     31.75417    34.45497     32.15
   Length3     38.424571     0.698123     37.00277    39.84637    32.131

   Multiple Imputation Parameter Estimates

                                                         t for H0:
   Variable      Minimum      Maximum          Mu0    Mean=Mu0    Pr > |t|
   Length2     33.088571    33.117143    35.000000       -2.86      0.0074
   Length3     38.397143    38.445714    45.000000       -9.42      <.0001
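Each t statistic in this table is the combined mean minus the MU0= value, divided by the combined standard error, and the standard error is the square root of the total variance reported in Output 44.3.4. A quick check with the printed values (the subscripts are notation introduced here):

```latex
\begin{align*}
\mathrm{SE}_{\text{Length2}} &= \sqrt{0.439672} \approx 0.663078, \\
t_{\text{Length2}} &= \frac{33.104571 - 35}{0.663078} \approx -2.86, \\
t_{\text{Length3}} &= \frac{38.424571 - 45}{0.698123} \approx -9.42.
\end{align*}
```

Both statistics are negative because the combined means fall below the hypothesized values of 35 and 45.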
The following statements list the first ten observations of the data set outex3 in Output 44.3.6. Note that the imputed values of Length2 are rounded to the same precision as the observed values.
   proc print data=outex3(obs=10);
      title 'First 10 Observations of the Imputed Data Set';
   run;

   First 10 Observations of the Imputed Data Set

   Obs    _Imputation_    Length1    Length2    Length3
     1               1       23.2       25.4       30.0
     2               1       24.0       26.3       31.2
     3               1       23.9       26.5       31.1
     4               1       26.3       29.0       33.5
     5               1       26.5       29.0       34.7
     6               1       26.8       29.7       34.7
     7               1       26.8       28.8       34.7
     8               1       27.6       30.0       35.0
     9               1       27.6       30.0       35.1
    10               1       28.5       30.7       36.2
This example uses the logistic regression method to impute values for a binary variable in a data set with a monotone missing pattern.
The logistic regression method is used for the binary and ordinal CLASS variables. Since the variable Species is not an ordinal variable, only the first two species are used.
   proc mi data=Fish2 seed=1305417 out=outex4;
      class Species;
      monotone logistic(Species= Height Width Height*Width/ details);
      var Height Width Species;
   run;
The 'Model Information' table shown in Output 44.4.1 describes the method and options used in the multiple imputation process.
   The MI Procedure

   Model Information

   Data Set                             WORK.FISH2
   Method                               Monotone
   Number of Imputations                5
   Seed for random number generator     1305417
The 'Monotone Model Specification' table shown in Output 44.4.2 describes methods and imputed variables in the imputation model. The procedure uses the logistic regression method to impute variable Species in the model. Missing values in other variables are not imputed.
   The MI Procedure

   Monotone Model Specification

                           Imputed
   Method                  Variables
   Logistic Regression     Species
The 'Missing Data Patterns' table shown in Output 44.4.3 lists distinct missing data patterns with corresponding frequencies and percents. The table confirms a monotone missing pattern for these variables.
   The MI Procedure

   Missing Data Patterns

                                                        --------Group Means-------
   Group    Height    Width    Species    Freq    Percent       Height       Width
       1    X         X        X            47      85.45    12.097645    4.808204
       2    X         X        .             6      10.91    11.411050    4.567050
       3    X         .        .             2       3.64    14.126350           .
With the DETAILS option, parameters estimated from the observed data and the parameters used in each imputation are displayed in the 'Logistic Models for Monotone Method' table in Output 44.4.4.
   The MI Procedure

   Logistic Models for Monotone Method

   Imputed                                 ---------------Imputation---------------
   Variable    Effect          Obs-Data             1             2             3
   Species     Intercept        2.65234      1.794014      5.392323      5.859932
   Species     Height           7.73757      3.727095     11.790557     12.200408
   Species     Width            5.25709      1.209209      8.492849      8.696497
   Species     Height*Width     1.12990      1.593964      1.989302      3.087310

   Logistic Models for Monotone Method

   Imputed                     ---------Imputation---------
   Variable    Effect                     4              5
   Species     Intercept           0.649860       6.393629
   Species     Height              2.449332      13.644077
   Species     Width               0.629963      10.767135
   Species     Height*Width        0.979165       2.389491
The following statements list the first ten observations of the data set outex4 in Output 44.4.5.
   proc print data=outex4(obs=10);
      title 'First 10 Observations of the Imputed Data Set';
   run;

   First 10 Observations of the Imputed Data Set

   Obs    _Imputation_    Species    Length3     Height     Width
     1               1    Gp1           30.0    11.5200    4.0200
     2               1    Gp1           31.2    12.4800    4.3056
     3               1    Gp1           31.1    12.3778    4.6961
     4               1                  33.5    12.7300         .
     5               1    Gp1           34.0    12.4440    5.1340
     6               1    Gp1           34.7    13.6024    4.9274
     7               1    Gp1           34.5    14.1795    5.2785
     8               1    Gp1           35.0    12.6700    4.6900
     9               1    Gp1           35.1    14.0049    4.8438
    10               1    Gp1           36.2    14.2266    4.9594
Note that a missing value of the variable Species is not imputed if the corresponding covariates are missing and not imputed, as shown by observation 4 in the table.
This example uses the discriminant function method to impute values of a CLASS variable from the observed values in a data set with a monotone missing pattern.
The following statements impute the continuous variables Height and Width with the regression method and the CLASS variable Species with the discriminant function method.
   proc mi data=Fish2 seed=7545417 nimpute=3 out=outex5;
      class Species;
      monotone reg(Height Width)
               discrim(Species= Length3 Height Width/ details);
      var Length3 Height Width Species;
   run;
The 'Model Information' table shown in Output 44.5.1 describes the method and options used in the multiple imputation process.
   The MI Procedure

   Model Information

   Data Set                             WORK.FISH2
   Method                               Monotone
   Number of Imputations                3
   Seed for random number generator     7545417

The 'Monotone Model Specification' table shown in Output 44.5.2 describes methods and imputed variables in the imputation model. The procedure uses the regression method to impute variables Height and Width, and uses the discriminant function method to impute variable Species in the model.
   The MI Procedure

   Monotone Model Specification

                             Imputed
   Method                    Variables
   Regression                Height Width
   Discriminant Function     Species
The 'Missing Data Patterns' table shown in Output 44.5.3 lists distinct missing data patterns with corresponding frequencies and percents. The table confirms a monotone missing pattern for these variables.
   The MI Procedure

   Missing Data Patterns

   Group    Length3    Height    Width    Species    Freq    Percent
       1    X          X         X        X            47      85.45
       2    X          X         X        .             6      10.91
       3    X          X         .        .             2       3.64

   Missing Data Patterns

            -----------------Group Means----------------
   Group        Length3         Height          Width
       1      33.497872      12.097645       4.808204
       2      32.366667      11.411050       4.567050
       3      36.600000      14.126350              .
With the DETAILS option, parameters estimated from the observed data and parameters used in each imputation are displayed in Output 44.5.4.
   The MI Procedure

   Group Means for Monotone Discriminant Method

                                       ----------------Imputation----------------
   Species    Variable    Obs-Data              1              2              3
   Gp1        Length3      0.61625       0.707861       0.662448       0.505410
   Gp1        Height       0.67244       0.750984       0.732151       0.594226
   Gp1        Width        0.57896       0.643334       0.665698       0.515014
   Gp2        Length3      0.98925       0.776131       0.987989       0.887032
   Gp2        Height       1.08272       0.934081       1.081832       1.004799
   Gp2        Width        0.86963       0.680065       0.811745       0.722943

The following statements list the first ten observations of the data set outex5 in Output 44.5.5. Note that all missing values of variables Width and Species are imputed.
   proc print data=outex5(obs=10);
      title 'First 10 Observations of the Imputed Data Set';
   run;

   First 10 Observations of the Imputed Data Set

   Obs    _Imputation_    Species    Length3     Height      Width
     1               1    Gp1           30.0    11.5200    4.02000
     2               1    Gp1           31.2    12.4800    4.30560
     3               1    Gp1           31.1    12.3778    4.69610
     4               1    Gp1           33.5    12.7300    4.67966
     5               1    Gp2           34.0    12.4440    5.13400
     6               1    Gp1           34.7    13.6024    4.92740
     7               1    Gp1           34.5    14.1795    5.27850
     8               1    Gp1           35.0    12.6700    4.69000
     9               1    Gp1           35.1    14.0049    4.84380
    10               1    Gp1           36.2    14.2266    4.95940
This example uses the MCMC method to impute missing values for a data set with an arbitrary missing pattern. The following statements invoke the MI procedure and specify the MCMC method with six imputations.
   proc mi data=FitMiss seed=21355417 nimpute=6 mu0=50 10 180;
      mcmc chain=multiple displayinit initial=em(itprint);
      var Oxygen RunTime RunPulse;
   run;

   The MI Procedure

   Model Information

   Data Set                             WORK.FITMISS
   Method                               MCMC
   Multiple Imputation Chain            Multiple Chains
   Initial Estimates for MCMC           EM Posterior Mode
   Start                                Starting Value
   Prior                                Jeffreys
   Number of Imputations                6
   Number of Burn-in Iterations         200
   Seed for random number generator     21355417
The 'Model Information' table shown in Output 44.6.1 describes the method used in the multiple imputation process. With CHAIN=MULTIPLE, the procedure uses multiple chains and completes the default 200 burn-in iterations before each imputation. The 200 burn-in iterations are used to make the iterations converge to the stationary distribution before the imputation.
By default, the procedure uses a noninformative Jeffreys prior to derive the posterior mode from the EM algorithm as the starting values for the MCMC process.
The 'Missing Data Patterns' table shown in Output 44.6.2 lists distinct missing data patterns with corresponding statistics.
   The MI Procedure

   Missing Data Patterns

                      Run     Run
   Group    Oxygen    Time    Pulse    Freq    Percent
       1    X         X       X          21      67.74
       2    X         X       .           4      12.90
       3    X         .       .           3       9.68
       4    .         X       X           1       3.23
       5    .         X       .           2       6.45

   Missing Data Patterns

            -----------------Group Means----------------
   Group         Oxygen        RunTime       RunPulse
       1      46.353810      10.809524     171.666667
       2      47.109500      10.137500              .
       3      52.461667              .              .
       4              .      11.950000     176.000000
       5              .       9.885000              .
With the ITPRINT option in INITIAL=EM, the procedure displays the 'EM (Posterior Mode) Iteration History' table in Output 44.6.3.
   The MI Procedure

   EM (Posterior Mode) Iteration History

   _Iteration_      -2 Log L    -2 Log Posterior       Oxygen      RunTime
             0    254.482800          282.909549    47.104077    10.554858
             1    255.081168          282.051584    47.104077    10.554857
             2    255.271408          282.017488    47.104077    10.554857
             3    255.318622          282.015372    47.104002    10.554523
             4    255.330259          282.015232    47.103861    10.554388
             5    255.333161          282.015222    47.103797    10.554341
             6    255.333896          282.015222    47.103774    10.554325
             7    255.334085          282.015222    47.103766    10.554320

   EM (Posterior Mode) Iteration History

   _Iteration_      RunPulse
             0    171.381669
             1    171.381652
             2    171.381644
             3    171.381842
             4    171.382053
             5    171.382150
             6    171.382185
             7    171.382196
With the DISPLAYINIT option in the MCMC statement, the 'Initial Parameter Estimates for MCMC' table shown in Output 44.6.4 displays the starting mean and covariance estimates used in MCMC. The same starting estimates are used for the MCMC process for multiple chains because the EM algorithm is applied to the same data set in each chain. You can explicitly specify different initial estimates for different imputations, or you can use the bootstrap to generate different parameter estimates from the EM algorithm for the MCMC process.
   The MI Procedure

   Initial Parameter Estimates for MCMC

   _TYPE_    _NAME_          Oxygen       RunTime      RunPulse
   MEAN                   47.103766     10.554320    171.382196
   COV       Oxygen       24.549967      5.726112     15.926036
   COV       RunTime       5.726112      1.781407      3.124798
   COV       RunPulse     15.926036      3.124798     83.164045
Output 44.6.5 and Output 44.6.6 display variance information and parameter estimates from the multiple imputation.
   The MI Procedure

   Multiple Imputation Variance Information

               -----------------Variance-----------------
   Variable       Between        Within         Total        DF
   Oxygen        0.051560      0.928170      0.988323    25.958
   RunTime       0.003979      0.070057      0.074699    25.902
   RunPulse      4.118578      4.260631      9.065638    7.5938

   Multiple Imputation Variance Information

                  Relative       Fraction
                  Increase        Missing      Relative
   Variable    in Variance    Information    Efficiency
   Oxygen         0.064809       0.062253      0.989731
   RunTime        0.066262       0.063589      0.989513
   RunPulse       1.127769       0.575218      0.912517

   The MI Procedure

   Multiple Imputation Parameter Estimates

   Variable          Mean    Std Error    95% Confidence Limits        DF
   Oxygen       47.164819     0.994145     45.1212     49.2085     25.958
   RunTime      10.549936     0.273312      9.9880     11.1118     25.902
   RunPulse    170.969836     3.010920    163.9615    177.9782     7.5938

   Multiple Imputation Parameter Estimates

                                                            t for H0:
   Variable       Minimum       Maximum           Mu0    Mean=Mu0    Pr > |t|
   Oxygen       46.858020     47.363540     50.000000       -2.85      0.0084
   RunTime      10.476886     10.659412     10.000000        2.01      0.0547
   RunPulse    168.252615    172.894991    180.000000       -3.00      0.0182
This example uses the MCMC method to impute just enough missing values for a data set with an arbitrary missing pattern so that each imputed data set has a monotone missing pattern based on the order of variables in the VAR statement.
The following statements invoke the MI procedure and specify the IMPUTE=MONOTONE option to create the imputed data set with a monotone missing pattern. You must specify a VAR statement to provide the order of variables for the imputed data to achieve a monotone missing pattern.
   proc mi data=FitMiss seed=17655417 out=outex7;
      mcmc impute=monotone;
      var Oxygen RunTime RunPulse;
   run;

   The MI Procedure

   Model Information

   Data Set                             WORK.FITMISS
   Method                               Monotone-data MCMC
   Multiple Imputation Chain            Single Chain
   Initial Estimates for MCMC           EM Posterior Mode
   Start                                Starting Value
   Prior                                Jeffreys
   Number of Imputations                5
   Number of Burn-in Iterations         200
   Number of Iterations                 100
   Seed for random number generator     17655417
The 'Model Information' table shown in Output 44.7.1 describes the method used in the multiple imputation process.
The 'Missing Data Patterns' table shown in Output 44.7.2 lists distinct missing data patterns with corresponding statistics. Here, an 'X' means that the variable is observed in the corresponding group, a '.' means that the variable is missing and will be imputed to achieve the monotone missingness for the imputed data set, and an 'O' means that the variable is missing and will not be imputed. The table also displays group-specific variable means.
   The MI Procedure

   Missing Data Patterns

                      Run     Run
   Group    Oxygen    Time    Pulse    Freq    Percent
       1    X         X       X          21      67.74
       2    X         X       O           4      12.90
       3    X         O       O           3       9.68
       4    .         X       X           1       3.23
       5    .         X       O           2       6.45

   Missing Data Patterns

            -----------------Group Means----------------
   Group         Oxygen        RunTime       RunPulse
       1      46.353810      10.809524     171.666667
       2      47.109500      10.137500              .
       3      52.461667              .              .
       4              .      11.950000     176.000000
       5              .       9.885000              .

As shown in the table, the MI procedure needs to impute only three missing values of Oxygen (one in Group 4 and two in Group 5) to achieve a monotone missing pattern for the imputed data set.
When using the MCMC method to produce an imputed data set with a monotone missing pattern, tables of variance information and parameter estimates are not created.
The following statements are used just to show the monotone missingness of the output data set outex7 .
   proc mi data=outex7 nimpute=0;
      var Oxygen RunTime RunPulse;
   run;

   The MI Procedure

   Missing Data Patterns

                      Run     Run
   Group    Oxygen    Time    Pulse    Freq    Percent
       1    X         X       X         110      70.97
       2    X         X       .          30      19.35
       3    X         .       .          15       9.68

   Missing Data Patterns

            -----------------Group Means---------------
   Group         Oxygen        RunTime       RunPulse
       1      46.152428      10.861364     171.863636
       2      47.796038      10.053333              .
       3      52.461667              .              .
The 'Missing Data Patterns' table shown in Output 44.7.3 displays a monotone missing data pattern.
The following statements impute one value for each missing value in the monotone missingness data set outex7 .
   proc mi data=outex7 nimpute=1 seed=51343672 out=outds;
      monotone reg;
      var Oxygen RunTime RunPulse;
      by _Imputation_;
   run;
You can then analyze these data sets by using other SAS procedures and combine these results by using the MIANALYZE procedure. Note that the VAR statement is required with a MONOTONE statement to provide the variable order for the monotone missing pattern.
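As an illustration of that last step, the following sketch fits a regression within each completed data set and then combines the results. The choice of model (Oxygen regressed on RunTime and RunPulse) and the data set name outreg are assumptions made for this example and are not part of the original analysis.

```sas
/* Hypothetical analysis step: fit the same regression separately   */
/* within each imputed data set. The model and the name outreg are  */
/* assumptions made for illustration only.                          */
proc reg data=outds outest=outreg covout noprint;
   model Oxygen = RunTime RunPulse;
   by _Imputation_;
run;

/* Combine the per-imputation estimates and covariance matrices. */
proc mianalyze data=outreg;
   modeleffects Intercept RunTime RunPulse;
run;
```

The COVOUT option is included so that the OUTEST= data set carries the covariance matrices of the estimates, which the MIANALYZE procedure uses to compute the within-imputation variance.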
This example uses the MCMC method with a single chain. It also displays time-series and autocorrelation plots to check convergence for the single chain.
The following statements use the MCMC method to create an iteration plot for the successive estimates of the mean of Oxygen. Note that iterations during the burn-in period are indicated with negative iteration numbers. These statements also create an autocorrelation function plot for the variable Oxygen.

   proc mi data=FitMiss seed=42037921 noprint nimpute=2;
      mcmc timeplot(mean(Oxygen)) acfplot(mean(Oxygen));
      var Oxygen RunTime RunPulse;
   run;
With the TIMEPLOT(MEAN(Oxygen)) option, the procedure displays a time-series plot for the mean of Oxygen in Output 44.8.1.
By default, the MI procedure displays solid line segments that connect data points in the time-series plot. The plot shows no apparent trends for the variable Oxygen.
With the ACFPLOT(MEAN(Oxygen)) option, the procedure displays an autocorrelation plot for the mean of Oxygen in Output 44.8.2.
By default, the MI procedure uses the star sign (*) as the plot symbol to display the points in the plot, a solid line to display the reference line of zero autocorrelation, and a pair of dashed lines to display approximately 95% confidence limits for the autocorrelations. The autocorrelation function plot shows no significant positive or negative autocorrelation.
The following statements use display options to modify the autocorrelation function plot for Oxygen in Output 44.8.3.
   proc mi data=FitMiss seed=42037921 noprint nimpute=2;
      mcmc acfplot(mean(Oxygen) / symbol=dot lref=2);
      var Oxygen RunTime RunPulse;
   run;

You can also create plots for the worst linear function, the means of other variables, the variances of variables, and covariances between variables. Alternatively, you can use the OUTITER option to save statistics such as the means, standard deviations, covariances, -2 log LR statistic, -2 log LR statistic of the posterior mode, and worst linear function from each iteration in an output data set. Then you can do a more in-depth time-series analysis of the iterations with other procedures, such as PROC AUTOREG and PROC ARIMA in the SAS/ETS User's Guide.
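The OUTITER route might look like the following sketch. The data set name outit and the NLAG= value are hypothetical choices, and the sketch assumes that requesting only the MEAN statistic yields one observation per iteration, which is worth confirming with the PROC PRINT step before modeling. PROC ARIMA requires SAS/ETS software.

```sas
/* Save the variable means from every MCMC iteration.   */
/* outit is a hypothetical output data set name.        */
proc mi data=FitMiss seed=42037921 noprint nimpute=2;
   mcmc outiter(mean)=outit;
   var Oxygen RunTime RunPulse;
run;

/* Check the layout of the saved iterations first. */
proc print data=outit(obs=10);
run;

/* Autocorrelation analysis of the saved series for the mean of Oxygen. */
proc arima data=outit;
   identify var=Oxygen nlag=20;
run;
quit;
```

The IDENTIFY statement reports sample autocorrelations, which gives a tabular counterpart to the ACFPLOT display.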
With the experimental ODS GRAPHICS statement specified in the following statements
   ods html;
   ods graphics on;
   proc mi data=FitMiss seed=42037921 noprint nimpute=2;
      mcmc timeplot(mean(Oxygen)) acfplot(mean(Oxygen));
      var Oxygen RunTime RunPulse;
   run;
   ods graphics off;
   ods html close;
the MI procedure produces the experimental graphs, as shown in Output 44.8.4 and Output 44.8.5.
For general information about ODS graphics, see Chapter 15, 'Statistical Graphics Using ODS.' For specific information about the graphics available in the MI procedure, see the 'ODS Graphics' section on page 2567.
This example uses the MCMC method with multiple chains as specified in Example 44.6. It saves the parameter values used for each imputation in an output data set of type EST called miest. This output data set can then be used to impute missing values in other similar input data sets. The following statements invoke the MI procedure and specify the MCMC method with multiple chains to create six imputations.

   proc mi data=FitMiss seed=21355417 nimpute=6 mu0=50 10 180;
      mcmc chain=multiple initial=em outest=miest;
      var Oxygen RunTime RunPulse;
   run;

The following statements list the first 15 observations of miest, which contain the parameters used for the first three imputations, in Output 44.9.1. Note that the data set includes observations with _TYPE_='SEED' containing the seed to start the next random number generator.

   proc print data=miest(obs=15);
      title 'Parameters for the Imputations';
   run;

                    Parameters for the Imputations

Obs  _Imputation_  _TYPE_  _NAME_           Oxygen        RunTime       RunPulse
  1             1  SEED               825240167.00   825240167.00   825240167.00
  2             1  PARM                      46.77          10.47         169.41
  3             1  COV     Oxygen            30.59           8.32          50.99
  4             1  COV     RunTime            8.32           2.90          17.03
  5             1  COV     RunPulse          50.99          17.03         200.09
  6             2  SEED              1895925872.00  1895925872.00  1895925872.00
  7             2  PARM                      47.41          10.37         173.34
  8             2  COV     Oxygen            22.35           4.44          21.18
  9             2  COV     RunTime            4.44           1.76           1.25
 10             2  COV     RunPulse          21.18           1.25         125.67
 11             3  SEED               137653011.00   137653011.00   137653011.00
 12             3  PARM                      48.21          10.36         170.52
 13             3  COV     Oxygen            23.59           5.25          19.76
 14             3  COV     RunTime            5.25           1.66           5.00
 15             3  COV     RunPulse          19.76           5.00         110.99
The following statements invoke the MI procedure and use the INEST= option in the MCMC statement.
   proc mi data=FitMiss;
      mcmc inest=miest;
      var Oxygen RunTime RunPulse;
   run;

The MI Procedure

           Model Information

Data Set                 WORK.FITMISS
Method                   MCMC
INEST Data Set           WORK.MIEST
Number of Imputations    6

The 'Model Information' table shown in Output 44.9.2 describes the method used in the multiple imputation process. The remaining tables for the example are identical to the tables in Output 44.6.2, Output 44.6.4, Output 44.6.5, and Output 44.6.6 in Example 44.6.
This example applies the MCMC method to the FitMiss data set in which the variable Oxygen is transformed. Assume that Oxygen is skewed and can be transformed to normality with a logarithmic transformation. The following statements invoke the MI procedure and specify the transformation. The TRANSFORM statement specifies the log transformation for Oxygen. Note that the values displayed for Oxygen in all of the results correspond to transformed values.

   proc mi data=FitMiss seed=32937921 mu0=50 10 180 out=outex10;
      transform log(Oxygen);
      mcmc chain=multiple displayinit;
      var Oxygen RunTime RunPulse;
   run;
The 'Missing Data Patterns' table shown in Output 44.10.1 lists distinct missing data patterns with corresponding statistics for the FitMiss data. Note that the values of Oxygen shown in the tables are transformed values.
The MI Procedure

                Missing Data Patterns

                   Run     Run
Group    Oxygen    Time    Pulse    Freq    Percent
    1    X         X       X          21      67.74
    2    X         X       .           4      12.90
    3    X         .       .           3       9.68
    4    .         X       X           1       3.23
    5    .         X       .           2       6.45

Transformed Variables: Oxygen

                Missing Data Patterns

         -----------------Group Means----------------
Group          Oxygen         RunTime        RunPulse
    1        3.829760       10.809524      171.666667
    2        3.851813       10.137500               .
    3        3.955298               .               .
    4               .       11.950000      176.000000
    5               .        9.885000               .

Transformed Variables: Oxygen
The 'Variable Transformations' table shown in Output 44.10.2 lists the variables that have been transformed.
The MI Procedure

Variable Transformations

Variable    _Transform_
Oxygen      LOG
The 'Initial Parameter Estimates for MCMC' table shown in Output 44.10.3 displays the starting mean and covariance estimates used in the MCMC process.
The MI Procedure

             Initial Parameter Estimates for MCMC

_TYPE_    _NAME_           Oxygen        RunTime       RunPulse
MEAN                     3.846122      10.557605     171.382949
COV       Oxygen         0.010827       0.120891       0.328772
COV       RunTime        0.120891       1.744580       3.011180
COV       RunPulse       0.328772       3.011180      82.747609

Transformed Variables: Oxygen
Output 44.10.4 displays variance information from the multiple imputation.
The MI Procedure

           Multiple Imputation Variance Information

              -----------------Variance-----------------
Variable          Between         Within          Total        DF
* Oxygen      0.000016175       0.000401       0.000420    26.499
  RunTime        0.001762       0.065421       0.067536    27.118
  RunPulse       0.205979       3.116830       3.364004    25.222

* Transformed Variables

           Multiple Imputation Variance Information

                 Relative       Fraction
                 Increase        Missing      Relative
Variable      in Variance    Information    Efficiency
* Oxygen         0.048454       0.047232      0.990642
  RunTime        0.032318       0.031780      0.993684
  RunPulse       0.079303       0.075967      0.985034

* Transformed Variables
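The quantities in the variance information table are linked by the standard multiple-imputation combining rules. With m imputations, between-imputation variance B, and within-imputation variance W, the total variance is T = W + (1 + 1/m)B, the relative increase in variance is r = (1 + 1/m)B/W, and the relative efficiency is RE = 1/(1 + lambda/m), where lambda is the fraction of missing information. The following DATA step is a minimal sketch that checks the Oxygen row. It assumes m = 5, because the example specifies no NIMPUTE= value and m = 5 is consistent with the displayed totals; the variable names are illustrative.

```sas
data _null_;
   m      = 5;             /* assumed number of imputations              */
   B      = 0.000016175;   /* between-imputation variance, from table    */
   W      = 0.000401;      /* within-imputation variance, from table     */
   lambda = 0.047232;      /* fraction of missing information, from table */

   T  = W + (1 + 1/m)*B;      /* total variance                */
   r  = (1 + 1/m)*B / W;      /* relative increase in variance */
   RE = 1 / (1 + lambda/m);   /* relative efficiency           */

   put T= 12.9 r= 10.6 RE= 10.6;   /* write the results to the SAS log */
run;
```

Because the inputs are the rounded values displayed in the table, the computed quantities agree with the table only to about the displayed precision. The value of r is the most sensitive, since W is shown with only three significant digits.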
Output 44.10.5 displays parameter estimates from the multiple imputation. Note that the parameter value of μ0 has also been transformed using the logarithmic transformation.

The MI Procedure

              Multiple Imputation Parameter Estimates

Variable            Mean    Std Error    95% Confidence Limits        DF
* Oxygen        3.845175     0.020494      3.8031       3.8873    26.499
  RunTime      10.560131     0.259876     10.0270      11.0932    27.118
  RunPulse    171.802181     1.834122    168.0264     175.5779    25.222

* Transformed Variables

              Multiple Imputation Parameter Estimates

                                                        t for H0:
Variable         Minimum       Maximum           Mu0     Mean=Mu0    Pr > |t|
* Oxygen        3.838599      3.848456      3.912023        -3.26      0.0030
  RunTime      10.493031     10.600498     10.000000         2.16      0.0402
  RunPulse    171.251777    172.498626    180.000000        -4.47      0.0001

* Transformed Variables
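The Mu0 and t entries for Oxygen can be checked by hand: the MU0= value of 50 is reported on the log scale, and the t statistic is the combined mean minus Mu0, divided by the standard error. The following DATA step is a minimal sketch that uses the displayed (rounded) values; the variable names are illustrative.

```sas
data _null_;
   mu0_log = log(50);               /* MU0=50 expressed on the log scale */
   est     = 3.845175;              /* combined mean of log(Oxygen)      */
   se      = 0.020494;              /* standard error of that mean       */
   t       = (est - mu0_log) / se;  /* t statistic for H0: Mean=Mu0      */

   put mu0_log= 10.6 t= 8.2;        /* write the results to the SAS log  */
run;
```

The statistic is negative for Oxygen because the combined mean lies below log(50).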
The following statements list the first ten observations of the data set outex10 in Output 44.10.6. Note that the values for Oxygen are in the original scale.

   proc print data=outex10(obs=10);
      title 'First 10 Observations of the Imputed Data Set';
   run;

     First 10 Observations of the Imputed Data Set

                                                Run
Obs    _Imputation_     Oxygen    RunTime      Pulse
  1               1    44.6090    11.3700    178.000
  2               1    45.3130    10.0700    185.000
  3               1    54.2970     8.6500    156.000
  4               1    59.5710     7.1440    167.012
  5               1    49.8740     9.2200    170.092
  6               1    44.8110    11.6300    176.000
  7               1    38.5834    11.9500    176.000
  8               1    43.7376    10.8500    158.851
  9               1    39.4420    13.0800    174.000
 10               1    60.0550     8.6300    170.000
Note that the preceding analysis can also be carried out with the following statements without using a TRANSFORM statement. A transformed value of log(50)=3.91202 is used in the MU0= option. Because these statements specify a different SEED= value, the imputed values do not match the preceding output exactly.

   data temp;
      set FitMiss;
      LogOxygen= log(Oxygen);
   run;
   proc mi data=temp seed=14337921 mu0=3.91202 10 180 out=outtemp;
      mcmc chain=multiple displayinit;
      var LogOxygen RunTime RunPulse;
   run;
   data outex10;
      set outtemp;
      Oxygen= exp(LogOxygen);
   run;
This example uses two separate imputation procedures to complete the imputation process. The first MI procedure uses the MCMC method to impute just enough missing values for a data set with an arbitrary missing pattern so that each imputed data set has a monotone missing pattern. The second MI procedure uses a MONOTONE statement to impute missing values for data sets with monotone missing patterns.
The following statements are identical to those in Example 44.7. The statements invoke the MI procedure and specify the IMPUTE=MONOTONE option to create the imputed data set with a monotone missing pattern.

   proc mi data=FitMiss seed=17655417 out=outex11;
      mcmc impute=monotone;
      var Oxygen RunTime RunPulse;
   run;
The 'Missing Data Patterns' table shown in Output 44.11.1 lists distinct missing data patterns with corresponding statistics. Here, an 'X' means that the variable is observed in the corresponding group, a '.' means that the variable is missing and will be imputed to achieve the monotone missingness for the imputed data set, and an 'O' means that the variable is missing and will not be imputed. The table also displays group-specific variable means.
The MI Procedure

                Missing Data Patterns

                   Run     Run
Group    Oxygen    Time    Pulse    Freq    Percent
    1    X         X       X          21      67.74
    2    X         X       O           4      12.90
    3    X         O       O           3       9.68
    4    .         X       X           1       3.23
    5    .         X       O           2       6.45

                Missing Data Patterns

         -----------------Group Means----------------
Group          Oxygen         RunTime        RunPulse
    1       46.353810       10.809524      171.666667
    2       47.109500       10.137500               .
    3       52.461667               .               .
    4               .       11.950000      176.000000
    5               .        9.885000               .

As shown in the table, the MI procedure needs to impute only three missing values, from Group 4 and Group 5, to achieve a monotone missing pattern for the imputed data set. When the MCMC method is used to produce an imputed data set with a monotone missing pattern, tables of variance information and parameter estimates are not created.
The following statements impute one value for each missing value in the monotone missingness data set outex11.

   proc mi data=outex11 nimpute=1 seed=51343672 out=outex11a;
      monotone reg;
      var Oxygen RunTime RunPulse;
      by _Imputation_;
   run;

You can then analyze these data sets by using other SAS procedures and combine these results by using the MIANALYZE procedure. Note that the VAR statement is required with a MONOTONE statement to provide the variable order for the monotone missing pattern.
The 'Model Information' table displayed in Output 44.11.2 shows that a monotone method is used to generate imputed values in the first BY group.
----------------------------- Imputation Number=1 ------------------------------

The MI Procedure

                 Model Information

Data Set                            WORK.OUTEX11
Method                              Monotone
Number of Imputations               1
Seed for random number generator    51343672
The 'Monotone Model Specification' table shown in Output 44.11.3 describes methods and imputed variables in the imputation model. The procedure uses the regression method to impute variables RunTime and RunPulse in the model.
----------------------------- Imputation Number=1 ------------------------------

The MI Procedure

   Monotone Model Specification

                 Imputed
Method           Variables
Regression       RunTime RunPulse
The 'Missing Data Patterns' table shown in Output 44.11.4 lists distinct missing data patterns with corresponding statistics. It shows a monotone missing pattern for the imputed data set.
----------------------------- Imputation Number=1 ------------------------------

The MI Procedure

                Missing Data Patterns

                   Run     Run
Group    Oxygen    Time    Pulse    Freq    Percent
    1    X         X       X          22      70.97
    2    X         X       .           6      19.35
    3    X         .       .           3       9.68

                Missing Data Patterns

         -----------------Group Means----------------
Group          Oxygen         RunTime        RunPulse
    1       46.057479       10.861364      171.863636
    2       46.745227       10.053333               .
    3       52.461667               .               .
The following statements list the first ten observations of the data set outex11a in Output 44.11.5.
   proc print data=outex11a(obs=10);
      title 'First 10 Observations of the Imputed Data Set';
   run;

     First 10 Observations of the Imputed Data Set

                                                Run
Obs    _Imputation_     Oxygen    RunTime      Pulse
  1               1    44.6090    11.3700    178.000
  2               1    45.3130    10.0700    185.000
  3               1    54.2970     8.6500    156.000
  4               1    59.5710     7.1569    169.914
  5               1    49.8740     9.2200    159.315
  6               1    44.8110    11.6300    176.000
  7               1    39.8345    11.9500    176.000
  8               1    45.3196    10.8500    151.252
  9               1    39.4420    13.0800    174.000
 10               1    60.0550     8.6300    170.000

This example presents an alternative to full-data MCMC imputation, in which only a few missing values need to be imputed to achieve a monotone missing pattern for the imputed data set. The example uses a monotone MCMC method that imputes fewer missing values in each iteration and achieves approximate stationarity in fewer iterations (Schafer 1997, p. 227). The example also demonstrates how to combine the monotone MCMC method with a method for monotone missing data, which does not rely on iterations of steps.