Inconsistent data and missing data are a fact of life in knowledge discovery and data mining. Developers of new data-mining applications must address them with rigor before the end-users of these systems can reach viable decisions.
Following a background analysis and literature review of missing data concepts, the authors addressed the reasons for data inconsistency, including procedural factors, refusal of response, and inapplicable questions. Missing data were then classified into four types: data missing at random (MAR), data missing completely at random (MCAR), non-ignorable missing data, and outliers treated as missing data. Next, existing methods for addressing the problem of missing data were reviewed, covering the deletion of cases or variables and various imputation methods. The imputation methods included were case substitution, mean substitution, cold deck imputation, hot deck imputation, regression imputation, and multiple imputation.
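As a brief recap of three of the methods named above, the following is a minimal Python sketch of case deletion, mean substitution, and a simplified hot deck imputation. The toy records, the function names, and the choice of the closest age as the donor-matching rule are illustrative assumptions, not a procedure prescribed in this chapter.

```python
from statistics import mean

# Hypothetical toy records: (age, income); None marks a missing income.
records = [(25, 30.0), (32, None), (41, 50.0), (45, None), (53, 70.0)]


def delete_cases(rows):
    """Case deletion: drop every record that contains a missing value."""
    return [r for r in rows if None not in r]


def mean_substitution(rows):
    """Replace each missing income with the mean of the observed incomes."""
    observed = [income for _, income in rows if income is not None]
    fill = mean(observed)
    return [(age, fill if income is None else income) for age, income in rows]


def hot_deck(rows):
    """Simplified hot deck: borrow the income of the complete case (donor)
    in the same data set whose age is closest. The matching rule is an
    assumption made for this sketch."""
    donors = delete_cases(rows)
    imputed = []
    for age, income in rows:
        if income is None:
            donor = min(donors, key=lambda d: abs(d[0] - age))
            income = donor[1]
        imputed.append((age, income))
    return imputed


print(delete_cases(records))
print(mean_substitution(records))
print(hot_deck(records))
```

In this sketch, deletion shrinks the data set from five records to three, mean substitution assigns the same value to every incomplete record (which reduces the variance of the imputed variable), and the hot deck variant keeps values that were actually observed in similar cases.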
The impact of missing data on various data mining algorithms was then addressed. The algorithms reviewed included k-Nearest Neighbor, Decision Trees, Association Rules, and Neural Networks.
Finally, the authors offered their opinions on future developments and trends that developers of knowledge discovery software can expect to face, and the needs of end-users when confronted with the issues of data inconsistency and missing data.
It is the goal of the authors to expose individuals new to the fields of knowledge discovery and data mining to the issues of inconsistent data and missing data. The topic is worthy of research and investigation by developers of new data-mining applications, and it also provides a basis for reviewing systems that have already been developed or that are currently under construction.
This concludes the chapter on the impact of missing data on data mining. It is our sincere hope that readers have gained a fresh perspective on the necessity of data consistency and an exposure to alternatives for dealing with missing data.