METHODS OF ADDRESSING MISSING DATA

Chapter VII - The Impact of Missing Data on Data Mining
Data Mining: Opportunities and Challenges
by John Wang (ed) 
Idea Group Publishing 2003

Methods for dealing with missing data can be broken down into the following categories:

  • Use Of Complete Data Only;

  • Deleting Selected Cases Or Variables;

  • Data Imputation; and

  • Model-Based Approaches.

These categories are based on the randomness of the missing data and on how the missing data are estimated and used for replacement. The following sections describe each category.

Use of Complete Data Only

One of the most direct and simple methods of addressing missing data is to include only those cases with complete data. This method is generally referred to as the "complete case approach" and is readily available in most statistical analysis packages. It can be used successfully only if the data are missing completely at random (MCAR). If they are not, bias is introduced and the results cannot be generalized to the overall population. The method is best suited to situations where the amount of missing data is small. When the relationships within a data set are strong enough not to be significantly affected by missing data, large sample sizes may allow for the deletion of a predetermined percentage of cases.
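The complete case approach can be sketched in a few lines of Python. This is only a minimal illustration: the records, the field names, and the helper complete_cases are hypothetical and do not come from the chapter.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"case": 1, "income": 41000.0, "age": 26},
    {"case": 2, "income": None, "age": 45},
    {"case": 3, "income": 52000.0, "age": None},
    {"case": 4, "income": 47000.0, "age": 28},
]

def complete_cases(rows):
    """Keep only the rows in which every field has a recorded value."""
    return [row for row in rows if all(v is not None for v in row.values())]

kept = complete_cases(records)
kept_ids = [row["case"] for row in kept]
dropped_share = 1 - len(kept) / len(records)

print(kept_ids)
print(f"{dropped_share:.0%} of cases dropped")
```

Reporting the share of dropped cases alongside the result makes it easier to judge whether the amount of missing data is small enough for this approach.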

Delete Selected Cases or Variables

The simple deletion of data that contain missing values may be used when a nonrandom pattern of missing data is present. Even though Nie, Hull, Jenkins, Steinbrenner, and Bent (1975) examined this strategy, no firm guidelines exist for the deletion of offending cases. Overall, if the deletion of a particular subset (cluster) significantly detracts from the usefulness of the data, case deletion may not be effective. Further, it may simply not be cost-effective to delete cases from a sample. Assume that new automobiles costing $20,000 each have been selected to test new oil additives. During a 100,000-mile test procedure, the drivers of the automobiles found it necessary to add an oil additive to the engine while driving. If the chemicals in that additive significantly polluted the oil samples taken throughout the 100,000-mile test, it would be ill-advised to eliminate all of the samples taken from a $20,000 test automobile. The researchers may instead find other ways to gain new knowledge from the test without dropping all sample cases.

Also, an attribute that contains missing data may be needed as an independent variable in a statistical regression procedure. If deleting that attribute would have a significant impact on the dependent variable, various imputation methods may be applied to replace the missing data instead, so that the significance of the independent variable for the dependent variable is not altered.

Imputation Methods for Missing Data

The definition of imputation is "the process of estimating missing data of an observation based on valid values of other variables" (Hair, Anderson, Tatham, & Black, 1998). Imputation methods are literally methods of filling in missing values by attributing them to other available data. As Dempster and Rubin (1983) commented, "Imputation is a general and flexible method for handling missing-data problems, but is not without its pitfalls. Imputation methods can be dangerous as they can generate substantial biases between real and imputed data." Nonetheless, imputation methods tend to be a popular method for addressing missing data.

Commonly used imputation methods include:

  • Case Substitution

  • Mean Substitution

  • Cold Deck Imputation

  • Hot Deck Imputation

  • Regression Imputation

  • Multiple Imputation

Case Substitution

Case substitution is the method most widely used to replace observations with completely missing data. Cases are simply replaced by non-sampled observations. Only a researcher with complete knowledge of the data (and its history) should have the authority to replace missing data with values from previous research. For example, if the records were lost for an automobile test sample, an authorized researcher could review similar previous test results and determine whether they could be substituted for the lost sample values. If it were found that all automobiles had nearly identical sample results for the first 10,000 miles of the test, then these results could easily be used in place of the lost sample values.

Mean Substitution

Mean substitution is accomplished by estimating missing values with the mean of the recorded or available values. This is a popular imputation method for replacing missing data. However, it is important to calculate the mean only from responses that have been proven to be valid and that are drawn from a population verified to have a normal distribution. If the data are skewed, it is usually better to use the median of the available data as the substitute. For example, suppose that respondents to a survey are asked to provide their income levels and some choose not to respond. If the mean income from the valid responses, verified to be normally distributed, is $48,250, then any missing income values are assigned that value. Otherwise, the median is considered as an alternative replacement value.

The rationale for using the mean for missing data is that, without any additional knowledge, the mean provides the best estimate. There are three main disadvantages to mean substitution:

  1. Variance estimates derived after mean substitution are invalid because they understate the true variance.

  2. The actual distribution of values is distorted. It would appear that more observations fall into the category containing the calculated mean than may actually exist.

  3. Observed correlations are depressed due to the repetition of a single constant value.

Mean imputation is a widely used method for dealing with missing data. The main advantage is its ease of implementation and ability to provide all cases with complete information. A researcher must weigh the advantages against the disadvantages. These are dependent upon the application being considered.
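The following Python sketch illustrates mean substitution and the first disadvantage listed above, the understated variance. The income values are hypothetical; they were chosen only so that the observed mean matches the $48,250 figure used in the example, and the helper substitute is an illustrative name.

```python
import statistics

# Hypothetical incomes; None marks a respondent who did not answer.
incomes = [42000.0, 51000.0, None, 47500.0, None, 52500.0]

def substitute(values, fill):
    """Replace every missing entry with a single constant fill value."""
    return [fill if v is None else v for v in values]

observed = [v for v in incomes if v is not None]
mean_fill = statistics.mean(observed)      # used when the data look normal
median_fill = statistics.median(observed)  # alternative for skewed data

mean_imputed = substitute(incomes, mean_fill)

# The imputed values sit exactly at the mean, so they add cases
# without adding spread: the sample variance shrinks.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(mean_imputed)
print(mean_fill, median_fill)
print(var_imputed < var_observed)
```

Because every imputed case contributes zero squared deviation, the variance after substitution is always smaller than the variance of the observed values whenever at least one value is filled in.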

Cold Deck Imputation

Cold deck imputation methods select values or use relationships obtained from sources other than the current database (Kalton & Kasprzyk, 1982, 1986; Sande, 1982, 1983). With this method, the end-user substitutes a constant value derived from external sources or from previous research for the missing values. The end-user must ascertain that the replacement value is more valid than any internally derived value. Unfortunately, cold deck imputation methods do not always provide feasible values. Many of the same disadvantages that apply to the mean substitution method apply to cold deck imputation. Cold deck imputation methods are rarely used as the sole method of imputation and instead are generally used to provide starting values for hot deck imputation methods. Pennell (1993) contains a good example of using cold deck imputation to provide values for an ensuing hot deck imputation application. Hot deck imputation is discussed next.

Hot Deck Imputation

Hot deck imputation replaces each missing value with a value selected from an estimated distribution of similar responding units. In most instances, the empirical distribution consists of values from responding units. Generally speaking, hot deck imputation replaces missing values with values drawn from the most similar complete case. This method is very common in practice, but has received little attention in the missing data literature (although one paper utilizing SAS to perform hot deck imputation was written by Iannacchione (1982)). For example, Table 1 displays the following data set.

Table 1: Illustration of Hot Deck Imputation: Incomplete data set

Case    Item 1    Item 2    Item 3    Item 4
1       10        2         3         5
2       13        10        3         13
3       5         10        3         ???
4       2         5         10        2

From Table 1, it is noted that case three is missing data for item four. Using hot deck imputation, each of the other cases with complete data is examined and the value from the most similar case is substituted for the missing data value. In this example, cases one, two, and four are examined. Case four is easily eliminated, as it has nothing in common with case three. Cases one and two both have similarities with case three. Case one has one item in common whereas case two has two items in common. Therefore, case two is the most similar to case three.

Once the most similar case has been identified, hot deck imputation substitutes the most similar complete case's value for the missing value. Since case two contains the value of 13 for item four, a value of 13 replaces the missing data point for case three.

Table 2 displays the revised data set with the hot deck imputation results.

Table 2: Illustration of Hot Deck Imputation: Imputed data set

Case    Item 1    Item 2    Item 3    Item 4
1       10        2         3         5
2       13        10        3         13
3       5         10        3         13
4       2         5         10        2

The advantages of hot deck imputation include conceptual simplicity, maintenance of the proper measurement level of variables, and the availability of a complete set of data at the end of the imputation process that can be analyzed like any complete set of data. One of hot deck's disadvantages is the difficulty of defining what is "similar". Hence, many different schemes for deciding what is "similar" may evolve.
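The worked example from Tables 1 and 2 can be reproduced with a short Python sketch. The similarity measure used here, a count of items with identical values, is one possible reading of "similar" taken from the example; the function names matches and hot_deck are illustrative.

```python
# Table 1 as a mapping from case number to item values; None marks the gap.
table = {
    1: [10, 2, 3, 5],
    2: [13, 10, 3, 13],
    3: [5, 10, 3, None],
    4: [2, 5, 10, 2],
}

def matches(a, b):
    """Count the items on which two cases record the same value."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x == y)

def hot_deck(cases):
    """Fill each incomplete case from the complete case that matches it best."""
    donors = {k: v for k, v in cases.items() if None not in v}
    completed = {}
    for key, row in cases.items():
        if None not in row:
            completed[key] = list(row)
            continue
        # max() keeps the first donor on ties; a real scheme needs a tie rule.
        best = max(donors, key=lambda k: matches(row, donors[k]))
        completed[key] = [d if v is None else v
                          for v, d in zip(row, donors[best])]
    return completed

completed = hot_deck(table)
print(completed[3])  # case three with item four taken from case two
```

Swapping matches for a distance-based measure changes which donor is chosen, which is exactly the difficulty of defining "similar" noted above.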

Regression Imputation

Single and multiple regression can be used to impute missing values. Regression analysis is used to predict missing values based on the variable's relationship to other variables in the data set. The first step consists of identifying the independent variables and the dependent variable. The dependent variable is then regressed on the independent variables, and the resulting regression equation is used to predict the missing values. Table 3 displays an example of regression imputation.

Table 3: Illustration of Regression Imputation

Case    Income        Age    Years of College Education    Regression Prediction
1       $45,251.25    26     4                             $47,951.79
2       $62,498.27    45     6                             $56,776.85
3       $49,350.32    28     5                             $50,107.78
4       $46,424.92    28     4                             $48,553.54
5       $56,077.27    46     4                             $53,969.22
6       $51,776.24    38     4                             $51,562.25
7       $51,410.97    35     4                             $50,659.64
8       $64,102.33    50     6                             $58,281.20
9       $45,953.96    45     3                             $52,114.10
10      $50,818.87    52     5                             $57,328.70
11      $49,078.98    30     0                             $42,938.29
12      $61,657.42    50     6                             $58,281.20
13      $54,479.90    46     6                             $57,077.72
14      $64,035.71    48     6                             $57,679.46
15      $51,651.50    50     6                             $58,281.20
16      $46,326.93    31     3                             $47,901.90
17      $53,742.71    50     4                             $55,172.71
18      ???           55     6                             $59,785.56
19      ???           35     4                             $50,659.64
20      ???           39     5                             $53,417.37

Table 3 lists twenty cases with three variables (income, age, and years of college education). Income contains missing data and is identified as the dependent variable, while age and years of college education are identified as the independent variables.

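The following regression equation, with coefficients recovered from the predictions in Table 3 and rounded to two decimal places, is produced for the example:

Income = 33,912.14 + 300.87 (Age) + 1,554.25 (Years of College Education)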

Predictions of income can be made using the regression equation. The right-most column of the table displays these predictions. For cases eighteen, nineteen, and twenty, income is therefore predicted to be $59,785.56, $50,659.64, and $53,417.37, respectively. An advantage of regression imputation is that it preserves the variance and covariance structures of variables with missing data.
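As a rough check, the regression step can be sketched in plain Python by fitting ordinary least squares to the seventeen complete cases of Table 3 and predicting the three missing incomes. The chapter does not say which software produced its figures; the normal-equations solver below is an assumption chosen to keep the sketch self-contained, and the names solve and fit_ols are illustrative.

```python
# (income, age, years of college education) for cases 1-17 of Table 3
complete = [
    (45251.25, 26, 4), (62498.27, 45, 6), (49350.32, 28, 5),
    (46424.92, 28, 4), (56077.27, 46, 4), (51776.24, 38, 4),
    (51410.97, 35, 4), (64102.33, 50, 6), (45953.96, 45, 3),
    (50818.87, 52, 5), (49078.98, 30, 0), (61657.42, 50, 6),
    (54479.90, 46, 6), (64035.71, 48, 6), (51651.50, 50, 6),
    (46326.93, 31, 3), (53742.71, 50, 4),
]
# (age, years of college education) for the cases with missing income
missing = {18: (55, 6), 19: (35, 4), 20: (39, 5)}

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows):
    """Return (intercept, age slope, education slope) from the normal equations."""
    design = [(1.0, age, edu) for _, age, edu in rows]
    y = [income for income, _, _ in rows]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * v for r, v in zip(design, y)) for i in range(3)]
    return solve(xtx, xty)

b0, b1, b2 = fit_ols(complete)
predicted = {case: b0 + b1 * age + b2 * edu
             for case, (age, edu) in missing.items()}
for case, value in predicted.items():
    print(case, round(value, 2))
```

The fitted values should agree with the right-most column of Table 3 up to rounding. A real application would also check the residuals before trusting the imputed incomes.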

Although regression imputation is useful for simple estimates, it has several inherent disadvantages:

  1. This method reinforces relationships that already exist within the data. As this method is utilized more often, the resulting data becomes more reflective of the sample and becomes less generalizable to the universe it represents.

  2. The variance of the distribution is understated.

  3. The assumption is implied that the variable being estimated has a substantial correlation to other attributes within the data set.

  4. The estimated value is not constrained and therefore may fall outside predetermined boundaries for the given variable. An additional adjustment may be necessary.

In addition to these points, there is also the problem of over-prediction. Regression imputation may lead to over-prediction of the model's explanatory power. For example, if the regression R² is too strong, multicollinearity most likely exists. If, on the other hand, the R² value is modest, errors in the regression prediction equation will be substantial (see Graham et al., 1994).

Mean imputation can be regarded as a special type of regression imputation. For data where the relationships between variables are sufficiently established, regression imputation is a very good method of imputing values for missing data.

Overall, regression imputation not only estimates the missing values but also derives inferences for the population (see the discussion of variance and covariance above). For discussions of regression imputation, see Royall and Herson (1973) and Hansen, Madow, and Tepping (1982).

Multiple Imputation

Rubin (1978) was the first to propose multiple imputation as a method for dealing with missing data. Multiple imputation combines a number of imputation methods into a single procedure. In most cases, expectation maximization (see Little & Rubin, 1987) is combined with maximum likelihood estimates and hot deck imputation to provide data for analysis. The method works by generating a maximum likelihood covariance matrix and a mean vector. Statistical uncertainty is introduced into the model and is used to emulate the natural variability of the complete database. Hot deck imputation is then used to fill in missing data points to complete the data set. Multiple imputation differs from hot deck imputation in the number of imputed data sets generated. Whereas hot deck imputation generates one imputed data set to draw values from, multiple imputation creates multiple imputed data sets.

Multiple imputation creates a summary data set for imputing missing values from these multiple imputed data sets. Multiple imputation has a distinct advantage in that it is robust to the normality conditions of the variables used in the analysis and it outputs complete data matrices. The method is time-intensive, as the researcher must create the multiple data sets, test the models for each data set separately, and then combine the data sets into one summary set. The process is simplified if the researcher is using basic regression analysis as the modeling technique. It is much more complex when models such as factor analysis, structural equation modeling, or high-order regression analysis are used.
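The create, analyze, and combine cycle described above can be sketched as follows. Several assumptions are made here that the chapter does not state: the data are hypothetical, each imputed data set is produced by a simple random hot deck draw rather than the EM-based procedure, the analysis is just the sample mean, and the results are combined with the pooling rules usually attributed to Rubin (1987). The names impute_once and pool are illustrative.

```python
import random
import statistics

# Hypothetical incomes; None marks a missing value.
incomes = [42000.0, 51000.0, None, 47500.0, None, 52500.0, 49000.0, None]

def impute_once(values, rng):
    """Fill each gap with a randomly drawn observed value (random hot deck)."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

def pool(estimates, variances):
    """Combine per-data-set estimates and their variances into one result."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)       # average sampling variance
    between = statistics.variance(estimates)  # spread across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

rng = random.Random(0)  # fixed seed so the sketch is repeatable
m = 5                   # number of imputed data sets
estimates, variances = [], []
for _ in range(m):
    data = impute_once(incomes, rng)
    estimates.append(statistics.mean(data))
    # squared standard error of the mean for this imputed data set
    variances.append(statistics.variance(data) / len(data))

pooled_mean, total_variance = pool(estimates, variances)
print(round(pooled_mean, 2), round(total_variance, 2))
```

The between-imputation term is what distinguishes this from a single imputation: it carries the uncertainty about the missing values into the final variance.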

A comprehensive handling of multiple imputation is given in Rubin (1987) and Schafer (1997). Other seminal works include Rubin (1986), Herzog and Rubin (1983), Li (1985), and Rubin and Schenker (1986).

Model-Based Procedures

Model-based procedures incorporate missing data into the analysis. These procedures are characterized in one of two ways: maximum likelihood estimation or missing data inclusion.

Dempster, Laird, and Rubin (1977) give a general approach for computing maximum likelihood estimates from missing data. They call their technique the EM approach. The approach consists of two steps, "E" for the conditional expectation step and "M" for the maximum likelihood step. The EM approach is an iterative method. The first step makes the best possible estimates of the missing data, and the second step then estimates the parameters (e.g., means, variances, or correlations) assuming the missing data are replaced. The two steps are repeated until the change in the estimated values is negligible. The missing data are then replaced with these estimated values. This approach has become extremely popular and is included in commercial software packages such as SPSS. Starting with SPSS 7.5, a missing value module employing the EM procedure for treating missing data is included.
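A minimal Python sketch of the two alternating steps is given below. It assumes a deliberately simple setting that the chapter does not specify: two variables that are jointly normal, with x fully observed and y missing for some cases. The data and the function name em_fill are hypothetical.

```python
def em_fill(x, y, tol=1e-10, max_iter=1000):
    """EM for a bivariate normal model: x is complete, y has gaps (None)."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n
    seen = [v for v in y if v is not None]
    mu_y = sum(seen) / len(seen)  # starting values from the observed y
    s_yy = sum((v - mu_y) ** 2 for v in seen) / len(seen)
    s_xy = 0.0
    for _ in range(max_iter):
        slope = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        # E step: expected y and y squared for each case, given the parameters
        ey = [mu_y + slope * (xi - mu_x) if yi is None else yi
              for xi, yi in zip(x, y)]
        ey2 = [e * e + (resid_var if yi is None else 0.0)
               for e, yi in zip(ey, y)]
        # M step: re-estimate the parameters that involve y
        new_mu_y = sum(ey) / n
        new_s_yy = sum(ey2) / n - new_mu_y ** 2
        new_s_xy = sum(xi * e for xi, e in zip(x, ey)) / n - mu_x * new_mu_y
        change = (abs(new_mu_y - mu_y) + abs(new_s_yy - s_yy)
                  + abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:  # stop once the estimates no longer move
            break
    slope = s_xy / s_xx
    filled = [mu_y + slope * (xi - mu_x) if yi is None else yi
              for xi, yi in zip(x, y)]
    return mu_y, slope, filled

# Hypothetical data: y is missing for the fifth and seventh cases.
x_vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_vals = [2.1, 3.9, 6.2, 7.8, None, 12.1, None, 16.2]
mu_y_hat, slope_hat, filled = em_fill(x_vals, y_vals)
print(round(mu_y_hat, 3), round(slope_hat, 3))
```

In this simple pattern the iterations settle on the slope of the regression of y on x among the complete cases, so the final filled-in values lie on that regression line. Richer missing-data patterns need the full multivariate version of both steps.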

Cohen and Cohen (1983) prescribe inclusion of missing data into the analysis. In general, the missing data is grouped as a subset of the entire data set. This subset of missing data is then analyzed using any standard statistical test. If the missing data occurs on a non-metric variable, statistical methods such as ANOVA, MANOVA, or discriminant analysis can be used. If the missing data occurs on a metric variable in a dependence relationship, regression can be used as the analysis method.

ISBN: 1591400511