Syntax


The following statements are available in PROC MI.

  • PROC MI < options > ;

    • BY variables ;

    • CLASS variables ;

    • EM < options > ;

    • FREQ variable ;

    • MCMC < options > ;

    • MONOTONE < options > ;

    • TRANSFORM transform ( variables < / options > )

      • < ...transform ( variables < / options > ) > ;

    • VAR variables ;

The BY statement specifies groups in which separate multiple imputation analyses are performed.

The CLASS statement lists the classification variables in the VAR statement. Classification variables can be either character or numeric.

The EM statement uses the EM algorithm to compute the maximum likelihood estimate (MLE) of the data with missing values, assuming a multivariate normal distribution for the data.

The FREQ statement specifies the variable that represents the frequency of occurrence for other values in the observation.

The MCMC statement uses a Markov chain Monte Carlo method to impute values for a data set with an arbitrary missing pattern, assuming a multivariate normal distribution for the data.

The MONOTONE statement specifies monotone methods to impute continuous and CLASS variables for a data set with a monotone missing pattern. Note that you can use either an MCMC statement or a MONOTONE statement, but not both. When neither of these two statements is specified, the MCMC method with its default options is used.

The TRANSFORM statement lists the variables to be transformed before the imputation process. The imputed values of these transformed variables are reverse-transformed to their original forms after the imputation.

The VAR statement lists the numeric variables to be analyzed. If you omit the VAR statement, all numeric variables not listed in other statements are used.

The PROC MI statement is the only required statement for the MI procedure. The rest of this section provides detailed syntax information for each of these statements, beginning with the PROC MI statement. The remaining statements are in alphabetical order.
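
As a minimal sketch of how these statements fit together, the following step imputes three numeric variables with the default MCMC method. The data set names (Study, StudyMI), the variable names (Y1, Y2, Y3), and the seed value are hypothetical and serve only as an illustration.

```sas
/* Hypothetical input data set Study with numeric variables Y1-Y3 */
proc mi data=Study seed=37851 nimpute=5 out=StudyMI;
   var Y1 Y2 Y3;   /* no MCMC or MONOTONE statement: MCMC defaults apply */
run;
```

Each completed copy of the data in StudyMI is identified by the _Imputation_ index variable.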

PROC MI Statement

  • PROC MI < options > ;

The following table summarizes the options available in the PROC MI statement.

Table 44.1: Summary of PROC MI Options

Tasks                                                      Options

Specify data sets
   input data set                                          DATA=
   output data set with imputed values                     OUT=

Specify imputation details
   number of imputations                                   NIMPUTE=
   seed to begin random number generator                   SEED=
   units to round imputed variable values                  ROUND=
   maximum values for imputed variable values              MAXIMUM=
   minimum values for imputed variable values              MINIMUM=
   maximum number of iterations to impute values
      in the specified range                               MINMAXITER=
   singularity tolerance                                   SINGULAR=

Specify statistical analysis
   level for the confidence interval, (1 − α)              ALPHA=
   means under the null hypothesis                         MU0=

Control printed output
   suppress all displayed output                           NOPRINT
   displays univariate statistics and correlations         SIMPLE

The following options can be used in the PROC MI statement (in alphabetical order):

ALPHA= α

  • specifies that confidence limits be constructed for the mean estimates with confidence level 100(1 − α)%, where 0 < α < 1. The default is ALPHA=0.05.

DATA= SAS-data-set

  • names the SAS data set to be analyzed by PROC MI. By default, the procedure uses the most recently created SAS data set.

MAXIMUM= numbers

  • specifies maximum values for imputed variables. When an intended imputed value is greater than the maximum, PROC MI redraws another value for imputation. If only one number is specified, that number is used for all variables. If more than one number is specified, you must use a VAR statement, and the specified numbers must correspond to variables in the VAR statement. The default number is a missing value, which indicates no restriction on the maximum for the corresponding variable.

  • The MAXIMUM= option is related to the MINIMUM= and ROUND= options, which are used to make the imputed values more consistent with the observed variable values. These options are applicable only if you use the MCMC method or the monotone regression method.

  • When specifying a maximum for the first variable only, you must also specify a missing value after the maximum. Otherwise, the maximum is used for all variables. For example, the option 'MAXIMUM= 100 .' sets a maximum of 100 for the first analysis variable only and no maximum for the remaining variables. The option 'MAXIMUM= . 100' sets a maximum of 100 for the second analysis variable only and no maximum for the other variables.

MINIMUM= numbers

  • specifies the minimum values for imputed variables. When an intended imputed value is less than the minimum, PROC MI redraws another value for imputation. If only one number is specified, that number is used for all variables. If more than one number is specified, you must use a VAR statement, and the specified numbers must correspond to variables in the VAR statement. The default number is a missing value, which indicates no restriction on the minimum for the corresponding variable.

MINMAXITER= number

  • specifies the maximum number of iterations for imputed values to be in the specified range when the MINIMUM= or MAXIMUM= option is also specified. The default is MINMAXITER=100.

MU0= numbers

THETA0= numbers

  • specifies the parameter values μ0 under the null hypothesis μ = μ0 for the population means corresponding to the analysis variables. Each hypothesis is tested with a t test. If only one number is specified, that number is used for all variables. If more than one number is specified, you must use a VAR statement, and the specified numbers must correspond to variables in the VAR statement. The default is MU0=0.

  • If a variable is transformed as specified in a TRANSFORM statement, then the same transformation is also applied to its corresponding MU0= value in the t test. If the parameter value μ0 for a transformed variable is not specified, then a value of zero is used for the resulting μ0 after transformation.

NIMPUTE= number

  • specifies the number of imputations. The default is NIMPUTE=5. You can specify NIMPUTE=0 to skip the imputation. In this case, only tables of model information, missing data patterns, descriptive statistics (SIMPLE option), and MLE from the EM algorithm (EM statement) are displayed.

NOPRINT

  • suppresses the display of all output. Note that this option temporarily disables the Output Delivery System (ODS); see Chapter 14, 'Using the Output Delivery System,' for more information.

OUT= SAS-data-set

  • creates an output SAS data set containing imputation results. The data set includes an index variable, _Imputation_ , to identify the imputation number. For each imputation, the data set contains all variables in the input data set with missing values being replaced by the imputed values. See the 'Output Data Sets' section on page 2559 for a description of this data set.

ROUND= numbers

  • specifies the units to round variables in the imputation. If only one number is specified, that number is used for all continuous variables. If more than one number is specified, you must use a VAR statement, and the specified numbers must correspond to variables in the VAR statement. When the CLASS variables are listed in the VAR statement, their corresponding roundoff units are not used. The default number is a missing value, which indicates no rounding for imputed variables.

    When specifying a roundoff unit for the first variable only, you must also specify a missing value after the roundoff unit. Otherwise, the roundoff unit is used for all variables. For example, the option 'ROUND= 10 .' sets a roundoff unit of 10 for the first analysis variable only and no rounding for the remaining variables. The option 'ROUND= . 10' sets a roundoff unit of 10 for the second analysis variable only and no rounding for other variables.

    The ROUND= option sets the precision of imputed values. For example, with a roundoff unit of 0.001, each value is rounded to the nearest multiple of 0.001. That is, each value has three significant digits after the decimal point. See Example 44.3 for an illustration of this option.
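
The positional use of missing values in the MINIMUM=, MAXIMUM=, and ROUND= lists can be sketched as follows. The data set and variable names are hypothetical, and the bounds and roundoff units are arbitrary illustration values; each list has one entry per variable in the VAR statement, with a period meaning no restriction for that variable.

```sas
/* Hypothetical data set Study; the lists line up with Y1 Y2 Y3 in VAR */
proc mi data=Study seed=21355417 nimpute=5
        minimum=0   . 40     /* Y1 >= 0,   Y2 unrestricted, Y3 >= 40  */
        maximum=100 . 200    /* Y1 <= 100, Y2 unrestricted, Y3 <= 200 */
        round=0.1   . 1      /* Y1 to 0.1, Y2 not rounded,  Y3 to 1   */
        out=StudyMI;
   var Y1 Y2 Y3;
run;
```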

SEED= number

  • specifies a positive integer to start the pseudo-random number generator. The default is a value generated from reading the time of day from the computer's clock. However, in order to duplicate the results under identical situations, you must use the same value of the seed explicitly in subsequent runs of the MI procedure.

  • The seed is displayed in the 'Model Information' table, so you can reproduce the results later by specifying that value with the SEED= option.

SIMPLE

  • displays simple descriptive univariate statistics and pairwise correlations from available cases. For a detailed description of these statistics, see the 'Descriptive Statistics' section on page 2535.

SINGULAR= p

  • specifies the criterion for determining the singularity of a covariance matrix based on standardized variables, where 0 < p < 1. The default is SINGULAR=1E-8.

    Suppose that S is a covariance matrix and v is the number of variables in S. Based on the spectral decomposition $S = \Gamma \Lambda \Gamma'$, where $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_j$, $j = 1, \ldots, v$, with $\lambda_i \ge \lambda_j$ when $i < j$, and $\Gamma$ is a matrix with the corresponding orthonormal eigenvectors of S as columns, S is considered singular when an eigenvalue $\lambda_j$ is less than $p \bar{\lambda}$, where the average $\bar{\lambda} = \sum_{k=1}^{v} \lambda_k / v$.
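
The singularity criterion can be illustrated outside PROC MI with a short SAS/IML sketch. This is only an illustration of the rule as described here, not the procedure's internal code; the 3 x 3 matrix is a hypothetical correlation matrix in which the first two variables are almost perfectly collinear.

```sas
proc iml;
   /* Hypothetical correlation matrix (covariance of standardized variables) */
   S = {1.0         0.999999999 0.5,
        0.999999999 1.0         0.5,
        0.5         0.5         1.0};
   p = 1e-8;                        /* same value as the SINGULAR= default  */
   call eigen(lambda, gamma, S);    /* eigenvalues and orthonormal vectors  */
   lambdaBar  = lambda[:];          /* average of the eigenvalues           */
   isSingular = any(lambda < p * lambdaBar);   /* 1 if any eigenvalue is too small */
   print lambda lambdaBar isSingular;
quit;
```

In this sketch the smallest eigenvalue is about 1E-9 and the average eigenvalue is 1, so the matrix would be flagged as singular under p = 1E-8.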

BY Statement

  • BY variables ;

You can specify a BY statement with PROC MI to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables.

If your input data set is not sorted in ascending order, use one of the following alternatives:

  • Sort the data using the SORT procedure with a similar BY statement.

  • Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the MI procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.

  • Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts . For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide .
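
A sketch of the first alternative, sorting before imputing within groups, is shown below. The data set Trial, the BY variable Site, and the analysis variables are hypothetical names.

```sas
/* Sort the hypothetical data set by the grouping variable first */
proc sort data=Trial;
   by Site;
run;

/* Separate imputations are then carried out within each value of Site */
proc mi data=Trial seed=20231 nimpute=5 out=TrialMI;
   by Site;
   var Y1 Y2 Y3;
run;
```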

CLASS Statement (Experimental)

  • CLASS variables ;

The CLASS statement specifies the classification variables in the VAR statement. Classification variables can be either character or numeric. The CLASS statement must be used in conjunction with the MONOTONE statement.

Class levels are determined from the formatted values of the CLASS variables. Refer to the chapter titled 'The FORMAT Procedure' in the SAS Procedures Guide .

EM Statement

  • EM < options > ;

The expectation-maximization (EM) algorithm is a technique for maximum likelihood estimation in parametric models for incomplete data. The EM statement uses the EM algorithm to compute the MLE for (μ, Σ), the means and covariance matrix, of a multivariate normal distribution from the input data set with missing values. Either the means and covariances from complete cases or the means and standard deviations from available cases can be used as the initial estimates for the EM algorithm. You can also specify the correlations for the estimates from available cases.

You can also use the EM statement with the NIMPUTE=0 option in the PROC statement to compute the EM estimates without multiple imputation, as shown in Example 44.1 in the 'Examples' section on page 2568.
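
A minimal sketch of this usage follows, with hypothetical data set and variable names. NIMPUTE=0 skips the imputation, so only the descriptive tables and the EM results are produced; the CONVERGE= and MAXITER= values are arbitrary and shown only to illustrate the options described below.

```sas
/* EM estimates only: no imputed data sets are created */
proc mi data=Study simple nimpute=0;
   em itprint converge=1e-5 maxiter=300 outem=EMEst;   /* EMEst is a hypothetical name */
   var Y1 Y2 Y3;
run;
```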

The following seven options are available with the EM statement.

CONVERGE= p

XCONV= p

  • sets the convergence criterion. The value must be between 0 and 1. The iterations are considered to have converged when the change in the parameter estimates between iteration steps is less than p for each parameter, that is, for each of the means and covariances. For each parameter, the change is a relative change if the parameter is greater than 0.01 in absolute value; otherwise, it is an absolute change. By default, CONVERGE=1E-4.

INITIAL=CC | AC | AC(R= r )

  • sets the initial estimates for the EM algorithm. The INITIAL=CC option uses the means and covariances from complete cases, the INITIAL=AC option uses the means and standard deviations from available cases with the correlations set to zero, and the INITIAL=AC( R= r ) option uses the means and standard deviations from available cases with correlation r, where −1/(p − 1) < r < 1 and p is the number of variables to be analyzed. The default is INITIAL=AC.

ITPRINT

  • prints the iteration history in the EM algorithm.

MAXITER= number

  • specifies the maximum number of iterations used in the EM algorithm. The default is MAXITER=200.

OUT= SAS-data-set

  • creates an output SAS data set containing results from the EM algorithm. The data set contains all variables in the input data set with missing values being replaced by the expected values from the EM algorithm. See the 'Output Data Sets' section on page 2559 for a description of this data set.

OUTEM= SAS-data-set

  • creates an output SAS data set of TYPE=COV containing the MLE of the parameter vector (μ, Σ). These estimates are computed with the EM algorithm. See the 'Output Data Sets' section on page 2559 for a description of this output data set.

OUTITER < ( options ) > = SAS-data-set

  • creates an output SAS data set of TYPE=COV containing parameters for each iteration. The data set includes a variable named _Iteration_ to identify the iteration number. The parameters in the output data set depend on the options specified. You can specify the MEAN and COV options to output the mean and covariance parameters. When no options are specified, the output data set contains the mean parameters for each iteration. See the 'Output Data Sets' section on page 2559 for a description of this data set.

FREQ Statement

  • FREQ variable ;

If one variable in your input data set represents the frequency of occurrence for other values in the observation, specify the variable name in a FREQ statement. PROC MI then treats the data set as if each observation appears n times, where n is the value of the FREQ variable for the observation. If the value of the FREQ variable is less than one, the observation is not used in the analysis. Only the integer portion of the value is used. The total number of observations is considered to be equal to the sum of the FREQ variable when PROC MI calculates significance probabilities.
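
As a brief sketch, suppose a hypothetical data set Grouped stores each distinct record once, with a variable Count giving how many times it occurs. The names are assumptions for illustration only.

```sas
/* Each observation is treated as if it appeared Count times */
proc mi data=Grouped seed=501 nimpute=5 out=GroupedMI;
   freq Count;
   var Y1 Y2;
run;
```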

MCMC Statement

  • MCMC < options > ;

The MCMC statement specifies the details of the MCMC method for imputation.

The following table summarizes the options available for the MCMC statement.

Table 44.2: Summary of Options in MCMC

Tasks                                                      Options

Specify data sets
   input parameter estimates for imputations               INEST=
   output parameter estimates used in imputations          OUTEST=
   output parameter estimates used in iterations           OUTITER=

Specify imputation details
   monotone/full imputation                                IMPUTE=
   single/multiple chain                                   CHAIN=
   number of burn-in iterations for each chain             NBITER=
   number of iterations between imputations in a chain     NITER=
   initial parameter estimates for MCMC                    INITIAL=
   prior parameter information                             PRIOR=
   starting parameters                                     START=

Specify output graphics
   displays time-series plots                              TIMEPLOT=
   displays autocorrelation plots                          ACFPLOT=
   graphics catalog name for saving graphics output        GOUT=

Control printed output
   displays worst linear function                          WLF
   displays initial parameter values for MCMC              DISPLAYINIT

The following options are available for the MCMC statement (in alphabetical order):

ACFPLOT < ( options < / display-options > ) >

  • displays the autocorrelation function plots of parameters from iterations.

  • The available options are:

  • COV < ( < variables >< variable1*variable2 ><... variable1*variable2 > ) >

    • displays plots of variances for variables in the list and covariances for pairs of variables in the list. When the option COV is specified without variables, variances for all variables and covariances for all pairs of variables are used.

  • MEAN < ( variables ) >

    • displays plots of means for variables in the list. When the option MEAN is specified without variables, all variables are used.

  • WLF

    • displays the plot for the worst linear function.

  • When the ACFPLOT is specified without the preceding options, the procedure displays plots of means for all variables that are used.

  • The display-options provide additional information for the autocorrelation function plots. The available display-options are:

  • CCONF= color

    • specifies the color of the displayed confidence limits. The default is CCONF=BLACK.

  • CFRAME= color

    • specifies the color for filling the area enclosed by the axes and the frame. By default, this area is not filled.

  • CNEEDLES= color

    • specifies the color of the vertical line segments ( needles ) that connect autocorrelations to the reference line. The default is CNEEDLES=BLACK.

  • CREF= color

    • specifies the color of the displayed reference line. The default is CREF=BLACK.

  • CSYMBOL= color

    • specifies the color of the displayed data points. The default is CSYMBOL=BLACK.

  • HSYMBOL= number

    • specifies the height for data points in percentage screen units. The default is HSYMBOL=1.

  • LCONF= linetype

    • specifies the line type for the displayed confidence limits. The default is LCONF=1, a solid line.

  • LOG

    • requests that the logarithmic transformations of parameters be used to compute the autocorrelations. This option is generally used for the variances of variables. When a parameter has values less than or equal to zero, the corresponding plot is not created.

  • LREF= linetype

    • specifies the line type for the displayed reference line. The default is LREF=3, a dashed line.

  • NLAG= number

    • specifies the maximum lag of the series. The default is NLAG=20. The autocorrelations at each lag are displayed in the graph.

  • SYMBOL= value

    • specifies the symbol for data points. The default is SYMBOL=STAR.

  • TITLE= 'string'

    • specifies the title to be displayed in the autocorrelation function plots. The default is TITLE='Autocorrelation Plot'.

  • WCONF= number

    • specifies the width for the displayed confidence limits in percentage screen units. If you specify the WCONF=0 option, the confidence limits are not displayed. The default is WCONF=1.

  • WNEEDLES= number

    • specifies the width for the displayed needles that connect autocorrelations to the reference line in percentage screen units. If you specify the WNEEDLES=0 option, the needles are not displayed. The default is WNEEDLES=1.

  • WREF= number

    • specifies the width for the displayed reference line in percentage screen units. If you specify the WREF=0 option, the reference line is not displayed. The default is WREF=1.

    • For example, the statement

        mcmc acfplot( mean(y1) cov(y1) / log );

    • requests autocorrelation function plots for the means and variances of the variable y1, respectively. Logarithmic transformations of both the means and variances are used in the plots. For a detailed description of the autocorrelation function plot, see the 'Autocorrelation Function Plot' section on page 2557; refer also to Schafer (1997, pp. 120-126) and the SAS/ETS User's Guide .

CHAIN=SINGLE | MULTIPLE

  • specifies whether a single chain is used for all imputations or a separate chain is used for each imputation. The default is CHAIN=SINGLE.

DISPLAYINIT

  • displays initial parameter values in the MCMC process for each imputation.

GOUT= graphics-catalog

  • specifies the graphics catalog for saving graphics output from PROC MI. The default is WORK.GSEG. For more information, refer to the chapter 'The GREPLAY Procedure' in SAS/GRAPH Software: Reference .

IMPUTE=FULL | MONOTONE

  • specifies whether a full-data imputation is used for all missing values or a monotone-data imputation is used for a subset of missing values to make the imputed data sets have a monotone missing pattern. The default is IMPUTE=FULL. When IMPUTE=MONOTONE is specified, the order in the VAR statement is used to complete the monotone pattern.

INEST= SAS-data-set

  • names a SAS data set of TYPE=EST containing parameter estimates for imputations. These estimates are used to impute values for observations in the DATA= data set. A detailed description of the data set is provided in the 'Input Data Sets' section on page 2558.

INITIAL=EM < ( options ) >

INITIAL=INPUT= SAS-data-set

  • specifies the initial mean and covariance estimates for the MCMC process. The default is INITIAL=EM.

  • You can specify INITIAL=INPUT= SAS-data-set to read the initial estimates of the mean and covariance matrix for each imputation from a SAS data set. See the 'Input Data Sets' section on page 2558 for a description of this data set.

  • With INITIAL=EM, PROC MI derives parameter estimates for a posterior mode, the highest observed-data posterior density, from the EM algorithm. The MLE from EM is used to start the EM algorithm for the posterior mode, and the resulting EM estimates are used to begin the MCMC process. The prior information specified in the PRIOR= option is also used in the process to compute the posterior mode.

  • The following four options are available with INITIAL=EM.

    BOOTSTRAP < = number >

    • requests bootstrap resampling, which uses a simple random sample with replacement from the input data set for the initial estimate. You can explicitly specify the number of observations in the random sample. Alternatively, you can implicitly specify the number of observations in the random sample by specifying the proportion p, 0 < p ≤ 1, to request [ np ] observations in the random sample, where n is the number of observations in the data set and [ np ] is the integer part of np. This produces an overdispersed initial estimate that provides different starting values for the MCMC process. If you specify the BOOTSTRAP option without the number, p = 0.75 is used by default.

    CONVERGE= p

    XCONV= p

    • sets the convergence criterion. The value must be between 0 and 1. The iterations are considered to have converged when the change in the parameter estimates between iteration steps is less than p for each parameter, that is, for each of the means and covariances. For each parameter, the change is a relative change if the parameter is greater than 0.01 in absolute value; otherwise, it is an absolute change. By default, CONVERGE=1E-4.

    ITPRINT

    • prints the iteration history in the EM algorithm for the posterior mode.

    MAXITER= number

    • specifies the maximum number of iterations used in the EM algorithm. The default is MAXITER=200.

NBITER= number

  • specifies the number of burn-in iterations before the first imputation in each chain. The default is NBITER=200.

NITER= number

  • specifies the number of iterations between imputations in a single chain. The default is NITER=100.

OUTEST= SAS-data-set

  • creates an output SAS data set of TYPE=EST. The data set contains parameter estimates used in each imputation. The data set also includes a variable named _Imputation_ to identify the imputation number. See the 'Output Data Sets' section on page 2559 for a description of this data set.

OUTITER < ( options ) > = SAS-data-set

  • creates an output SAS data set of TYPE=COV containing parameters used in the imputation step for each iteration. The data set includes variables named _Imputation_ and _Iteration_ to identify the imputation number and iteration number.

  • The parameters in the output data set depend on the options specified. You can specify options MEAN, STD, COV, LR, LR_POST, and WLF to output parameters of means, standard deviations, covariances, -2 log LR statistic, -2 log LR statistic of the posterior mode, and the worst linear function. When no options are specified, the output data set contains the mean parameters used in the imputation step for each iteration. See the 'Output Data Sets' section on page 2559 for a description of this data set.

PRIOR= name

  • specifies the prior information for the means and covariances. Valid values for name are as follows:

    JEFFREYS

    specifies a noninformative prior.

    RIDGE= number

    specifies a ridge prior.

    INPUT= SAS-data-set

    specifies a data set containing prior information.

  • For a detailed description of the prior information, see the 'Bayesian Estimation of the Mean Vector and Covariance Matrix' section on page 2549 and the 'Posterior Step' section on page 2550. If you do not specify the PRIOR= option, the default is PRIOR=JEFFREYS.

  • The PRIOR=INPUT= option specifies a TYPE=COV data set from which the prior information of the mean vector and the covariance matrix is read. See the 'Input Data Sets' section on page 2558 for a description of this data set.

START=VALUE | DIST

  • specifies that the initial parameter estimates are used as either the starting value (START=VALUE) or as the starting distribution (START=DIST) in the first imputation step of each chain. If the IMPUTE=MONOTONE option is specified, then START=VALUE is used in the procedure. The default is START=VALUE.

TIMEPLOT < ( options < / display-options > ) >

  • displays the time-series plots of parameters from iterations. The available options are:

COV < ( < variables >< variable1*variable2 ><... variable1*variable2 > ) >

  • displays plots of variances for variables in the list and covariances for pairs of variables in the list. When the option COV is specified without variables, variances for all variables and covariances for all pairs of variables are used.

MEAN < ( variables ) >

  • displays plots of means for variables in the list. When the option MEAN is specified without variables, all variables are used.

WLF

  • displays the plot for the worst linear function.

  • When the TIMEPLOT option is specified without the preceding options, the procedure displays plots of means for all variables that are used.

  • The display-options provide additional information for the time-series plots. The available display-options are:

CCONNECT= color

  • specifies the color for the line segments that connect data points in the time-series plots. The default is CCONNECT=BLACK.

CFRAME= color

  • specifies the color for filling the area enclosed by the axes and the frame. By default, this area is not filled.

CSYMBOL= color

  • specifies the color of the data points to be displayed in the time-series plots. The default is CSYMBOL=BLACK.

HSYMBOL= number

  • specifies the height for data points in percentage screen units. The default is HSYMBOL=1.

LCONNECT= linetype

  • specifies the line type for the line segments that connect data points in the time-series plots. The default is LCONNECT=1, a solid line.

LOG

  • requests that the logarithmic transformations of parameters be used. This option is generally used for the variances of variables. When a parameter value is less than or equal to zero, the value is not displayed in the corresponding plot.

SYMBOL= value

  • specifies the symbol for data points. The default is SYMBOL=PLUS.

TITLE= 'string'

  • specifies the title to be displayed in the time-series plots. The default is TITLE='Time-series Plot for Iterations'.

WCONNECT= number

  • specifies the width for the line segments that connect data points in the time-series plots in percentage screen units. If you specify the WCONNECT=0 option, the data points are not connected. The default is WCONNECT=1. For a detailed description of the time-series plot, see the 'Time-Series Plot' section on page 2556 and Schafer (1997, pp. 120-126).

WLF

  • displays the worst linear function of parameters. This scalar function of parameters μ and Σ is 'worst' in the sense that its values from iterations converge most slowly among parameters. For a detailed description of this statistic, see the 'Worst Linear Function of Parameters' section on page 2556.
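
Several of the MCMC options above can be combined in one statement, as in the following sketch. The data set names, variable names, seed, title text, and numeric settings are hypothetical choices for illustration rather than recommended values.

```sas
proc mi data=Study seed=21355417 nimpute=6 out=StudyMI;
   mcmc chain=multiple                /* separate chain for each imputation  */
        nbiter=300                    /* burn-in iterations per chain        */
        initial=em(itprint)           /* start from the EM posterior mode    */
        displayinit                   /* show initial parameter values       */
        wlf                           /* report the worst linear function    */
        timeplot(mean(Y1) / title='Trace of the mean of Y1')
        acfplot(cov(Y1) / log nlag=30)
        outiter(mean cov)=IterParms;  /* hypothetical output data set name   */
   var Y1 Y2 Y3;
run;
```

Because CHAIN=MULTIPLE starts a new chain for every imputation, the NITER= option, which counts iterations between imputations in a single chain, is omitted here.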

MONOTONE Statement

  • MONOTONE < method < ( < imputed < = effects >>< / options > ) > >

  • < ... method < ( < imputed < = effects >>< / options > ) >> ;

The MONOTONE statement specifies imputation methods for data sets with monotone missingness. You must also specify a VAR statement and the data set must have a monotone missing pattern with variables ordered in the VAR list. When both MONOTONE and MCMC statements are specified, the MONOTONE statement is not used.

For each method, you can specify the imputed variables and, optionally, a set of effects used to impute these variables. Each effect is a variable or a combination of variables preceding the imputed variable in the VAR statement. The syntax for specification of effects is the same as for the GLM procedure. See Chapter 32, 'The GLM Procedure,' for more information.

One general form of an effect involving several variables is

   X1 * X2 * A * B * C ( D E )

where A, B, C, D, and E are class variables and X1 and X2 are continuous variables.

If no covariates are specified, then all preceding variables are used as the covariates. That is, each preceding continuous variable is used as a regressor effect, and each preceding class variable is used as a main effect. For the discriminant function method, only the continuous variables can be used as covariate effects.

When a method for continuous variables is specified without imputed variables, the method is used for all continuous variables in the VAR statement that are not specified in other methods. Similarly, when a method for class variables is specified without imputed variables, the method is used for all class variables in the VAR statement that are not specified in other methods.

When a MONOTONE statement is used without specifying any methods, the regression method is used for all continuous variables and the discriminant function method is used for all class variables. The preceding variables of each imputed variable in the VAR statement are used as the covariates.

With a MONOTONE statement, the variables are imputed sequentially in the order given by the VAR statement. For a continuous variable, you can use a regression method, a regression predicted mean matching method, or a propensity score method to impute missing values.

For a nominal class variable, you can use a discriminant function method to impute missing values without using the ordering of the class levels. For an ordinal class variable, you can use a logistic regression method to impute missing values using the ordering of the class levels. For a binary class variable, either a discriminant function method or a logistic regression method can be used.
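
The following sketch combines a regression method for a continuous variable with a discriminant function method for a nominal class variable. It assumes a hypothetical data set Fish in which Length is fully observed and the missing pattern is monotone in the order Length, Height, Species; all names and the seed are illustrative.

```sas
proc mi data=Fish seed=1305417 nimpute=5 out=FishMI;
   class Species;                                  /* nominal class variable        */
   monotone reg(Height = Length / details)         /* continuous: regression method */
            discrim(Species = Length Height / details);
   var Length Height Species;                      /* imputation follows this order */
run;
```

Each effect list refers only to variables that precede the imputed variable in the VAR statement, and the DISCRIM effects are continuous variables only.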

Note that all methods other than the regression method impute values that are taken from the observed values. You can specify the following methods in a MONOTONE statement.

DISCRIM < ( imputed < = effects ></ options > ) >

  • specifies the discriminant function method for class variables. Only the continuous variables are allowed as covariate effects. The available options are DETAILS, PCOV=, and PRIOR=. The DETAILS option displays the group means and pooled covariance matrix used in each imputation. The PCOV= option specifies the pooled covariance used in the discriminant method. Valid values for the PCOV= option are:

    FIXED

    uses the observed-data pooled covariance matrix for each imputation.

    POSTERIOR

    draws a pooled covariance matrix from its posterior distribution.

  • The default is PCOV=POSTERIOR. See the 'Discriminant Function Method for Monotone Missing Data' section on page 2544 for a detailed description of the method.

  • The PRIOR= option specifies the prior probabilities of group membership. Valid values for the PRIOR= option are:

    EQUAL

    sets the prior probabilities equal for all groups.

    PROPORTIONAL

    sets the prior probabilities proportional to the group sample sizes.

    JEFFREYS < = c >

    specifies a noninformative prior, 0 < c < 1. If the number c is not specified, JEFFREYS=0.5.

    RIDGE < = d >

    specifies a ridge prior, d > 0. If the number d is not specified, RIDGE=0.25.

  • The default is PRIOR=JEFFREYS. See the 'Discriminant Function Method for Monotone Missing Data' section on page 2544 for a detailed description of the method.
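The following sketch shows one way the DISCRIM options might be combined. The data set and variable names are hypothetical, and grp is assumed to be a nominal class variable preceded in the VAR statement by the continuous covariates x1 and x2.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc mi data=work.study out=work.study_mi;
   class grp;
   var x1 x2 grp;
   /* Use the observed-data pooled covariance matrix and equal   */
   /* prior probabilities, and display the group means and       */
   /* pooled covariance matrix for each imputation.              */
   monotone discrim(grp = x1 x2 / details pcov=fixed prior=equal);
run;
```

Omitting PCOV= and PRIOR= here would fall back to the defaults stated above, PCOV=POSTERIOR and PRIOR=JEFFREYS.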

LOGISTIC < ( imputed < = effects ></ options > ) >

  • specifies the logistic regression method for class variables. The available options are DETAILS, ORDER=, and DESCENDING. The DETAILS option displays the regression coefficients in the logistic regression model used in each imputation.

  • When the imputed variable has more than two response levels, the ordinal logistic regression method is used. The ORDER= option specifies the sorting order for the levels of the response variable. Valid values for the ORDER= option are:

    DATA

    sorts by the order of appearance in the input data set

    FORMATTED

    sorts by their external formatted values

    FREQ

    sorts by the descending frequency counts

    INTERNAL

    sorts by the unformatted values

    By default, ORDER=FORMATTED.

    The option DESCENDING reverses the sorting order for the levels of the response variables.

    See the 'Logistic Regression Method for Monotone Missing Data' section on page 2546 for a detailed description of the method.
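As an illustration of the LOGISTIC options, the sketch below imputes a hypothetical ordinal class variable severity from the continuous variables age and dose. All data set and variable names are assumptions made for this example.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc mi data=work.visits out=work.visits_mi;
   class severity;
   var age dose severity;
   /* Sort the response levels by their unformatted values, then */
   /* reverse that order, and display the model coefficients.    */
   monotone logistic(severity = age dose / details order=internal descending);
run;
```

If severity has more than two levels, the ordinal logistic regression method is used, so the level ordering chosen with ORDER= and DESCENDING matters for the fitted model.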

REG | REGRESSION < ( imputed < = effects ></ DETAILS > ) >

  • specifies the regression method for continuous variables. The DETAILS option displays the regression coefficients in the regression model used in each imputation.

  • With a regression method, the MAXIMUM=, MINIMUM=, and ROUND= options can be used to make the imputed values more consistent with the observed variable values.

  • See the 'Regression Method for Monotone Missing Data' section on page 2541 for a detailed description of the method.
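A minimal sketch of the regression method combined with range and rounding options follows. The data set and variable names are hypothetical; MINIMUM= and ROUND= are given in the PROC MI statement, and a single number is assumed to apply to every variable in the VAR statement.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc mi data=work.labs out=work.labs_mi minimum=0 round=0.1;
   var age chol hdl;
   /* Impute hdl from age and chol by regression and display */
   /* the regression coefficients used in each imputation.   */
   monotone reg(hdl = age chol / details);
run;
```

Here the imputed hdl values are kept nonnegative and rounded to one decimal place, which keeps them closer in form to the observed values.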

REGPMM < ( imputed < = effects ></ options > ) >

REGPREDMEANMATCH < ( imputed < = effects ></ options > ) >

  • specifies the predictive mean matching method for continuous variables. This method is similar to the regression method except that it imputes a value randomly from a set of observed values whose predicted values are closest to the predicted value for the missing value from the simulated regression model (Heitjan and Little 1991; Schenker and Taylor 1996).

  • The available options are DETAILS and K=. The DETAILS option displays the regression coefficients in the regression model used in each imputation. The K= option specifies the number of closest observations to be used in the selection. The default is K=5.

  • Note that an optimal K= value is currently not available in the literature on multiple imputation. The default K=5 is experimental and may change in future releases.

  • See the 'Predictive Mean Matching Method for Monotone Missing Data' section on page 2542 for a detailed description of the method.
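The following sketch requests predictive mean matching with a non-default K= value. The data set and variable names are hypothetical, and K=10 is chosen only to show the option, not as a recommended setting.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc mi data=work.labs out=work.labs_mi;
   var age chol hdl;
   /* For each missing hdl value, draw the imputed value from the */
   /* 10 observed values whose predicted values are closest.      */
   monotone regpmm(hdl = age chol / details k=10);
run;
```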

PROPENSITY < ( imputed < = effects ></ options > ) >

  • specifies the propensity score method for variables, each of which can be either a class variable or a continuous variable. The available options are DETAILS and NGROUPS=. The DETAILS option displays the regression coefficients in the logistic regression model for propensity scores. The NGROUPS= option specifies the number of groups created based on propensity scores. The default is NGROUPS=5.

  • See the 'Propensity Score Method for Monotone Missing Data' section on page 2543 for a detailed description of the method.
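As a sketch of the PROPENSITY options, the step below groups observations into four propensity score groups instead of the default five. The data set and variable names are hypothetical.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc mi data=work.labs out=work.labs_mi;
   var age chol hdl;
   /* Model the propensity that hdl is missing from age and chol, */
   /* form 4 groups from the scores, and display the logistic     */
   /* regression coefficients.                                    */
   monotone propensity(hdl = age chol / details ngroups=4);
run;
```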

  • With a MONOTONE statement, the missing values of a variable are imputed when the variable is either explicitly specified in the method or implicitly specified when a method is specified without imputed variables. These variables are imputed sequentially in the order specified in the VAR statement. For example, the following MI procedure

      proc mi;
         class c1;
         var y1 y2 c1 y3;
         monotone reg(y3= y1 y2 c1) logistic(c1= y1 y2 y1*y2);
      run;
  • uses the logistic regression method to impute variable c1 from effects y1, y2, and y1*y2 first, then uses the regression method to impute variable y3 from effects y1, y2, and c1. The variables y1 and y2 are not imputed since y1 is the leading variable in the VAR statement and y2 is not specified as an imputed variable in the MONOTONE statement.

TRANSFORM Statement

  • TRANSFORM transform ( variables < / options > )

    • < ... transform ( variables < / options > ) > ;

The TRANSFORM statement lists the transformations and their associated variables to be transformed. The options are transformation options that provide additional information for the transformation.

The MI procedure assumes that the data are from a multivariate normal distribution when either the regression method or the MCMC method is used. When some variables in a data set are clearly non-normal, it is useful to transform these variables to conform to the multivariate normality assumption. With a TRANSFORM statement, variables are transformed before the imputation process and these transformed variable values are displayed in all of the results. When you specify an OUT= option, the variable values are back-transformed to create the imputed data set.
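A minimal sketch of this behavior follows, assuming a hypothetical data set work.labs in which the variable trig is positive and right-skewed. Because neither an MCMC nor a MONOTONE statement is given, the MCMC method with its default options is used.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc mi data=work.labs out=work.labs_mi;
   transform log(trig);   /* impute trig on the log scale */
   var age chol trig;
run;
```

In this sketch the displayed results refer to log(trig), while the OUT= data set work.labs_mi holds trig back-transformed to its original scale.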

The following transformations can be used in the TRANSFORM statement.

BOXCOX

  • specifies the Box-Cox transformation of variables. The variable Y is transformed to ((Y + c)^λ - 1) / λ, where c is a constant such that each value of Y + c must be positive and the constant λ > 0.

EXP

  • specifies the exponential transformation of variables. The variable Y is transformed to e^(Y + c), where c is a constant.

LOG

  • specifies the logarithmic transformation of variables. The variable Y is transformed to log(Y + c ), where c is a constant such that each value of Y+ c must be positive.

LOGIT

  • specifies the logit transformation of variables. The variable Y is transformed to log( (Y/c) / (1 - Y/c) ), where the constant c > 0 and the values of Y/c must be between 0 and 1.

POWER

  • specifies the power transformation of variables. The variable Y is transformed to (Y + c)^λ, where c is a constant such that each value of Y + c must be positive and the constant λ ≠ 0.

  • The following options provide the constant c and λ values in the transformations.

C= number

  • specifies the c value in the transformation. The default is c = 1 for the logit transformation and c = 0 for other transformations.

LAMBDA= number

  • specifies the λ value in the power and Box-Cox transformations. You must specify the λ value for these two transformations.

  • For example, the statement

      transform log(y1) power(y2 / c=1 lambda=.5);
  • requests that the variables log(y1), a logarithmic transformation of the variable y1, and sqrt(y2 + 1), a power transformation of the variable y2, be used in the imputation.

  • If the MU0= option is used to specify a parameter value μ0 for a transformed variable, the same transformation for the variable is also applied to its corresponding MU0= value in the t test. Otherwise, μ0 = 0 is used for the transformed variable. See Example 44.10 for a usage of the TRANSFORM statement.
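To make the effect of the C= and LAMBDA= values concrete, the following DATA step evaluates the standard form of each transformation for a single value. This is an illustrative calculation only, not part of the MI procedure; the variable names and the sample values y = 3, c = 1, lambda = 0.5, and a logit constant of 10 are assumptions chosen for the example.

```sas
/* Illustrative arithmetic only; all names and values are hypothetical */
data _null_;
   y = 3;  c = 1;  lambda = 0.5;

   boxcox_y = ((y + c)**lambda - 1) / lambda;   /* Box-Cox     */
   exp_y    = exp(y + c);                       /* exponential */
   log_y    = log(y + c);                       /* logarithmic */
   power_y  = (y + c)**lambda;                  /* power       */

   /* The logit needs 0 < y/c < 1, so a separate constant is used */
   c_logit  = 10;
   logit_y  = log((y / c_logit) / (1 - y / c_logit));

   /* Write the transformed values to the SAS log */
   put boxcox_y= exp_y= log_y= power_y= logit_y=;
run;
```

With these values, the power transformation is the square root of 4, which is 2, and the Box-Cox transformation is (2 - 1) / 0.5, which is also 2.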

VAR Statement

  • VAR variables ;

The VAR statement lists the variables to be analyzed. The variables can be either character or numeric. If you omit the VAR statement, all continuous variables not mentioned in other statements are used. The VAR statement is required if you specify a MONOTONE statement, an IMPUTE=MONOTONE option in the MCMC statement, or more than one number in the MU0=, MAXIMUM=, MINIMUM=, or ROUND= option.

Character variables are allowed only when they are specified as CLASS variables and the MONOTONE statement is also specified.
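As a sketch of this requirement, the step below lists a character variable in the VAR statement. The data set and variable names are hypothetical, and sex is assumed to be a character variable with two levels.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc mi data=work.study out=work.study_mi;
   class sex;             /* character variable declared as CLASS */
   var age weight sex;    /* VAR is required with MONOTONE        */
   monotone logistic(sex = age weight);
run;
```

Without the CLASS and MONOTONE statements, the character variable sex would not be allowed in the VAR statement.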




SAS.STAT 9.1 Users Guide (Vol. 4)
Year: 2004