Chapter VII: The Impact of Missing Data on Data Mining

Data Mining: Opportunities and Challenges, John Wang (ed.), Idea Group Publishing, 2003

Marvin L. Brown, Hawaii Pacific University, USA

John F. Kros, East Carolina University, USA

Data mining is based upon searching the concatenation of multiple databases, which usually contain some amount of missing data along with a variable percentage of inaccurate data, pollution, outliers, and noise. The data-mining process itself deals significantly with prediction, estimation, classification, pattern recognition, and the development of association rules. The significance of any analysis therefore depends heavily on the accuracy of the database and on the sample data chosen for model training and testing. The issue of missing data must be addressed, since ignoring it can introduce bias into the models being evaluated and lead to inaccurate data-mining conclusions.

THE IMPACT OF MISSING DATA

Missing or inconsistent data has been a pervasive problem in data analysis since the origin of data collection. The proliferation of computer software and the high capacity of storage media mean that more historical data is being collected today than ever before, which makes missing data an even more pervasive dilemma: the more data collected, the higher the likelihood of missing values. Anyone who intends to mine such data must therefore address the missing-data problem in order to be effective.

For the last four decades, statisticians have attempted to address the problem of missing data, and their work has become increasingly relevant to information technology.

This chapter's objective is to address missing data and its impact on data mining. The chapter commences with a background analysis, including a review of both seminal and current literature. Reasons for data inconsistency, along with definitions of the various types of missing data, are addressed. The main thrust of the chapter focuses on methods for handling missing data and on the impact that missing data has on the knowledge discovery process. Finally, trends regarding missing data and data mining are discussed, along with future research opportunities and concluding remarks.

Background

The analysis of missing data is a comparatively recent discipline. With the advent of the mainframe computer in the 1960s, businesses became capable of collecting large amounts of data in their customer databases. As large amounts of data were collected, the issue of missing data began to appear. A number of works provide perspective on missing data and data mining.

Afifi and Elashoff (1966) provide a review of the early literature on missing observations. Their paper contains many seminal concepts; however, the work may be dated for today's use. Hartley and Hocking (1971), in their paper entitled "The Analysis of Incomplete Data," presented one of the first discussions of dealing with skewed and categorical data, especially via maximum likelihood (ML) algorithms such as those used in Amos. Orchard and Woodbury (1972) provide early reasoning for approaching missing data by using what is commonly referred to as an expectation-maximization (EM) algorithm to produce unbiased estimates when the data are missing at random (MAR). Dempster, Laird, and Rubin's (1977) paper provided another method for obtaining ML estimates using the EM algorithm. The main difference between Dempster, Laird, and Rubin's (1977) EM approach and that of Hartley and Hocking lies in the Full Information Maximum Likelihood (FIML) algorithm used by Amos: the FIML algorithm employs both first- and second-order derivatives, whereas the EM algorithm uses only first-order derivatives.
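To make the EM idea concrete, the following is a minimal sketch (our own illustration, not code from any of the cited papers) of EM estimation of the mean vector and covariance matrix for a bivariate normal sample in which the second variable is missing at random. The E-step replaces each missing value with its conditional expectation given the observed variable; the M-step re-estimates the moments from the completed data.

```python
import numpy as np

def em_bivariate_normal(x1, x2, n_iter=50):
    """x1: fully observed; x2: observed with np.nan where missing (MAR)."""
    miss = np.isnan(x2)
    # Initialize from the complete cases only.
    mu = np.array([x1.mean(), x2[~miss].mean()])
    cov = np.cov(x1[~miss], x2[~miss])
    for _ in range(n_iter):
        # E-step: E[x2 | x1] and E[x2^2 | x1] for the missing cases.
        beta = cov[0, 1] / cov[0, 0]              # slope of x2 regressed on x1
        resid_var = cov[1, 1] - beta * cov[0, 1]  # conditional variance of x2
        x2_hat = np.where(miss, mu[1] + beta * (x1 - mu[0]), x2)
        x2_sq = np.where(miss, x2_hat ** 2 + resid_var, x2 ** 2)
        # M-step: re-estimate mean and covariance from the completed sample.
        mu = np.array([x1.mean(), x2_hat.mean()])
        s11 = np.mean((x1 - mu[0]) ** 2)
        s12 = np.mean((x1 - mu[0]) * (x2_hat - mu[1]))
        s22 = np.mean(x2_sq) - mu[1] ** 2
        cov = np.array([[s11, s12], [s12, s22]])
    return mu, cov
```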

Little (1982) discussed models for nonresponse, while Little and Rubin (1987) considered statistical analysis with missing data. Specifically, Little and Rubin (1987) defined three unique types of missing-data mechanisms (data missing completely at random, missing at random, and missing not at random) and provided parametric methods for handling these types of missing data. These papers sparked numerous works in the area of missing data. Diggle and Kenward (1994) addressed issues regarding data missing completely at random, data missing at random, and likelihood-based inference. Graham, Hofer, Donaldson, MacKinnon, and Schafer (1997) discussed using the EM algorithm to estimate means and covariance matrices from incomplete data. Papers from Little (1995) and Little and Rubin (1989) extended the concept of ML estimation in data mining, although they also tended to concentrate on data exhibiting only a few distinct patterns of missingness. Howell (1998) provided a good overview and examples of basic statistical calculations for handling missing data.

The problem of missing data is a complex one. Little and Rubin (1987) and Schafer (1997) provided conventional statistical methods for analyzing missing data and discussed the negative implications of naïve imputation methods. However, the statistical literature on missing data deals almost exclusively with the training of models rather than with prediction (Little, 1992). Training proceeds as follows: when only a small proportion of cases contain missing data, those cases can simply be eliminated for training purposes. Cases cannot be eliminated, however, if any portion of a case is needed in any segment of the overall discovery process.
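As a concrete illustration, here is a minimal sketch (a hypothetical table of our own, not an example from the cited literature) of listwise (complete-case) deletion, the simplest way to eliminate incomplete cases before training:

```python
import numpy as np
import pandas as pd

# Hypothetical training table with scattered missing entries.
train = pd.DataFrame({
    "age":    [34, np.nan, 51, 29],
    "income": [52000, 61000, np.nan, 44000],
    "buyer":  [1, 0, 1, 0],
})

# Listwise deletion: keep only the fully observed cases for training.
complete_cases = train.dropna()
print(complete_cases)   # rows 0 and 3 survive; rows 1 and 2 are discarded
```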

In theory, Bayesian methods can be used to ameliorate this issue; however, they carry strong assumptions. Imputation methods are valuable alternatives, as they can be interpreted as approximate Bayesian inference for the quantities of interest based on the observed data.

A number of articles regarding imputation methodology have been published since the early 1990s. Schafer and Olsen (1998) and Schafer (1999) provided an excellent starting point for multiple imputation. Rubin (1996) provided a detailed discussion of the interrelationship between the model used for imputation and the model used for analysis, and Schafer's (1997) text is considered a follow-up to Rubin's 1987 text. A number of conceptual issues associated with imputation methods are clarified in Little (1992). In addition, case studies have been published regarding the use of imputation in medicine (Barnard & Meng, 1999; van Buuren, Boshuizen, & Knook, 1999) and in survey research (Clogg, Rubin, Schenker, Schultz, & Weidman, 1991). A number of researchers have begun to discuss specific imputation methods. Hot deck imputation and nearest-neighbor methods are very popular in practice, despite receiving little overall coverage with regard to data mining (see Ernst, 1980; Kalton & Kish, 1981; Ford, 1983; and David, Little, Samuhel, & Triest, 1986).
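To illustrate the nearest-neighbor flavor of hot deck imputation, the following is a minimal sketch (our own, with a hypothetical helper name nn_impute): each incomplete case borrows its missing values from the closest complete "donor" case, with distance measured over the columns both cases have observed.

```python
import numpy as np

def nn_impute(X):
    """Fill each incomplete row of X from its nearest complete donor row.
    Assumes at least one fully observed row exists."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        # Euclidean distance over the jointly observed columns only.
        d = np.linalg.norm(complete[:, obs] - X[i, obs], axis=1)
        donor = complete[d.argmin()]     # the "hot deck" donor record
        X[i, ~obs] = donor[~obs]         # copy the donor's values for the gaps
    return X
```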

Breiman, Friedman, Olshen, and Stone (1984) developed a method known as CART, or classification and regression trees. Classification trees are used to predict the membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. Loh and Shih (1997) expanded on classification trees with their paper regarding split-selection methods. Some popular classification tree programs include FACT (Loh & Vanichsetakul, 1988), THAID (Morgan & Messenger, 1973), as well as the related programs AID, for Automatic Interaction Detection (Morgan & Sonquist, 1963), and CHAID, for Chi-Square Automatic Interaction Detection (Kass, 1980). Classification trees are useful data-mining techniques because they are easily understood by business practitioners and are easy to visualize. The basic tree-induction algorithm takes a "greedy," recursive divide-and-conquer approach to classification (Han & Kamber, 2001). This allows raw data to be analyzed quickly without a great deal of preprocessing: no data is lost, and outliers can be identified and dealt with immediately (Berson, Smith, & Thearling, 2000). A small sketch of such a tree at work follows.
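The sketch below (toy data of our own, and scikit-learn as an assumed tooling choice rather than anything the chapter prescribes) shows CART-style greedy induction: at each node the learner picks the single split that most reduces impurity, then recurses on the two resulting partitions.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: predict buyer (1) vs. non-buyer (0) from age and income.
X = [[25, 30000], [32, 52000], [47, 61000], [51, 80000], [23, 40000], [45, 75000]]
y = [0, 0, 1, 1, 0, 1]

# CART greedily chooses the impurity-minimizing split at each node,
# then recursively divides and conquers each partition.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))
```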

Agrawal, Imielinski, and Swami (1993) introduced association rules in their paper, "Mining Association Rules between Sets of Items in Large Databases." A second paper, by Agrawal and Srikant (1994), introduced the Apriori algorithm, which remains the reference algorithm for finding association rules in a database. A valuable "general purpose" chapter on the discovery of association rules, "Fast Discovery of Association Rules" by R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, appears in the text Advances in Knowledge Discovery and Data Mining. Association rules in data-mining applications search for interesting relationships among a given set of attributes. Rule generation is based on the "interestingness" of a proposed rule, measured by satisfaction of thresholds for minimum support and minimum confidence. Another method for determining interestingness is the use of correlation analysis to find correlation rules between itemsets (Han & Kamber, 2001).
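To make support and confidence concrete, here is a minimal sketch (with hypothetical transactions of our own) that enumerates one-item-to-one-item rules and keeps those clearing both thresholds:

```python
from itertools import combinations

# Toy transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# A rule A -> B is "interesting" here only if it clears both thresholds.
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.7
items = set().union(*transactions)
for a, b in combinations(items, 2):
    s = support({a, b})
    conf = s / support({a})            # confidence of the rule {a} -> {b}
    if s >= MIN_SUPPORT and conf >= MIN_CONFIDENCE:
        print(f"{{{a}}} -> {{{b}}}  support={s:.2f}  confidence={conf:.2f}")
```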

In addition to the aforementioned data-mining techniques, neural networks are used to build explanatory models by exploring datasets in search of relevant variables or groups of variables. Haykin (1994), Masters (1995), and Ripley (1996) provided foundational treatments of neural networks; for a good discussion of neural networks as statistical tools, see Warner and Misra (1996). More recent neural network literature also contains good papers covering prediction with missing data (Ghahramani & Jordan, 1997; Tresp, Neuneier, & Ahmad, 1995). Neural networks are useful in data-mining software packages because they can be readily applied to prediction, classification, and clustering problems and can be trained to generalize and learn (Berson, Smith, & Thearling, 2000). When prediction is the main goal, a neural network is most applicable when both the inputs to and the outputs from the network are well understood by the knowledge worker. The answers obtained from neural networks are often correct and contain a considerable amount of business value (Berry & Linoff, 1997).

Finally, genetic algorithms are also learning-based data-mining techniques. Holland (1975) introduced genetic algorithms as a learning-based method for search and optimization problems, and Michalewicz (1994) provided a good overview of genetic algorithms, data structures, and evolution programs. A number of interesting articles discussing genetic algorithms have appeared, including Flockhart and Radcliffe (1996), Szpiro (1997), and Sharpe and Glover (1999). While neural networks calculate internal link weights to determine the best output, genetic algorithms can also be used in data mining to find the best possible link weights. Simulating natural evolution, a genetic algorithm generates many different estimates of the link weights, creating a population of candidate neural networks; survival-of-the-fittest techniques then weed out the networks that perform poorly (Berson, Smith, & Thearling, 2000). A sketch of this weight-evolution loop follows.
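The following is a minimal sketch (a toy setup of our own, not from the cited sources) of a genetic algorithm evolving the weights of a tiny one-neuron network: each generation scores a population of weight vectors, keeps the fittest, and breeds the next generation via crossover and mutation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, linearly separable target: the logical OR of two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

def predict(w, X):
    """Tiny fixed network: 2 inputs -> 1 logistic output; w = [w1, w2, bias]."""
    return 1 / (1 + np.exp(-(X @ w[:2] + w[2])))

def fitness(w):
    return -np.mean((predict(w, X) - y) ** 2)   # lower error = fitter network

population = rng.normal(size=(50, 3))           # 50 candidate weight vectors
for generation in range(200):
    scores = np.array([fitness(w) for w in population])
    survivors = population[np.argsort(scores)[-10:]]        # keep the fittest 10
    parents = survivors[rng.integers(0, 10, size=(50, 2))]  # random parent pairs
    # Uniform crossover: each gene comes from one parent or the other ...
    children = np.where(rng.random((50, 3)) < 0.5, parents[:, 0], parents[:, 1])
    # ... followed by Gaussian mutation.
    population = children + rng.normal(scale=0.1, size=(50, 3))

best = population[np.argmax([fitness(w) for w in population])]
print(predict(best, X).round(2))   # outputs should approach [0, 1, 1, 1]
```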

In summary, the literature to date has addressed various approaches to data mining through the application of historically proven methods. The base theories of nearest neighbor, classification trees, association rules, neural networks, and genetic algorithms have been researched and proven viable, and they serve as the foundation of commercial data-mining software packages. The impact of incomplete or missing data on the knowledge discovery (data mining) process has more recently been approached in association with these individual methodologies. Prior research in this area has resulted in updates to commercial software that address the quality and preprocessing of both training and new data sets, as well as the nature and origin of various problematic data issues.

In the future, many hybrid methods built from the proven algorithms currently used for successful data mining will be developed to compete in both research and private industry. In addition to these algorithmic hybrids, new methodologies for dealing with incomplete and missing data will be developed, likewise through the merging of proven and newly developed approaches. Other research gaps to pursue include the warehousing of "dirty data," the warehousing of pattern-recognition results, and the development of new techniques to search those patterns for the nature of a problem and its solution. In areas where no practical solution is available, methods for handling individual "special interest" situations must be investigated.
