Poor data quality has plagued the knowledge discovery process and all associated data-mining techniques. Future data-mining systems should be sensitive to noise and able to deal with all types of pollution, both internally and in conjunction with end-users. As data gathers noise, the system should only reduce the level of confidence associated with the results provided, not "suddenly alter" the direction of discovery. We cannot dismiss the dangers of blind data mining, which can deteriorate into a data-dredging process. Systems should still produce the most significant findings possible from the data set, even when noise is present, and should be robust against uncertainty and missing-data issues (Fayyad & Piatetsky-Shapiro, 1996).
As software products evolve and new data-mining algorithms are developed, alternative methods for the identification and deletion/replacement of missing data will also be developed. New hybrid methodologies for dealing with noisy data will continue to evolve as end-users uncover previously undiscovered patterns of missing data within their applications. Commonality of patterns discovered within a particular industry should result in the sharing of new methodologies for dealing with these discoveries. Warehousing missing-data patterns and giving end-users the capability to select and test a variety of patterns and imputation methods will become commonplace, both in-house and in the public domain.
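The two basic treatments discussed here, deleting incomplete cases and imputing replacement values, can be sketched in a few lines. This is a minimal illustration in plain Python, using `None` to mark a missing field; the function names are illustrative and not taken from any particular product or library.

```python
# Two elementary treatments for missing data, sketched with plain
# Python lists of tuples. None marks a missing field.

def listwise_delete(records):
    """Case deletion: drop any record containing a missing field."""
    return [r for r in records if None not in r]

def mean_impute(records):
    """Mean imputation: replace each missing field with the mean of
    the observed values in that column, preserving every record."""
    n_cols = len(records[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in records if r[c] is not None]
        means.append(sum(observed) / len(observed))
    return [
        tuple(means[c] if r[c] is None else r[c] for c in range(n_cols))
        for r in records
    ]

data = [(1.0, 2.0), (None, 4.0), (3.0, None), (5.0, 6.0)]
print(listwise_delete(data))  # deletion shrinks the data set
print(mean_impute(data))      # imputation keeps all four records
```

The trade-off the text describes is visible even at this scale: deletion discards half of this small data set, while imputation retains every case at the cost of introducing estimated values.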
Software from other applications in the software industry will also prove to positively impact the area of data cleansing and preparation for proper data mining. For instance, simulation software using various approaches for prediction and projection will be interwoven with current techniques to form new hybrids for the estimation of missing values. Software for sensitivity analysis and noise tolerance will also be utilized to determine a measurement of volatility for a data set prior to the deletion/replacement of cases (Chung & Gray, 1999). Some end-users may prefer to maintain stability without altering the original data, rather than resorting to case deletion or data imputation methods.
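One simple way to measure the volatility mentioned above is to compare a summary statistic under the candidate treatments before committing to either. The sketch below, an assumed illustration rather than the method of Chung & Gray (1999), contrasts the spread of a column after case deletion with its spread after mean imputation; a large gap flags a data set whose results are sensitive to the treatment chosen.

```python
import statistics

def volatility_report(values):
    """Illustrative sensitivity check for a single numeric column
    (None marks a missing entry). Mean imputation leaves the column
    mean unchanged but artificially shrinks its spread, so the gap
    between the two standard deviations signals how volatile the
    deletion/replacement decision is for this data set."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    imputed = [mean if v is None else v for v in values]
    return {
        "sd_after_deletion": statistics.stdev(observed),
        "sd_after_imputation": statistics.stdev(imputed),
    }

col = [10.0, 12.0, None, 11.0, None, 13.0]
report = volatility_report(col)
print(report)
```

Here the imputed column shows a noticeably smaller standard deviation than the deleted one, which is exactly the kind of distortion a pre-treatment volatility measurement is meant to expose.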
As data mining continues to evolve and mature as a viable business tool, it will be monitored to address its role in the technological life cycle. While data mining is currently regarded as being at the third stage of a six-stage maturity process in technological evolution, it will soon gain momentum and grow into the area of Early Majority. The maturity stages include: innovators, early adopters, chasm, early majority, late majority, and laggards.
The "chasm" stage is characterized by various hurdles and challenges that must be met before the technology can become widely accepted as mainstream. As tools for the preprocessing and cleansing of noisy data continue to prove effective, advancement toward the level of "early majority" (where the technology becomes mature and is generally accepted and used) will be accomplished.
Tools for dealing with missing data will also grow from being used as a horizontal solution (not designed to provide business-specific end solutions) into a type of vertical solution (integration of domain-specific logic into data mining solutions). As the gigabyte-, terabyte-, and petabyte-size data sets become more prevalent in data warehousing applications, the issue of dealing with missing data will itself become an integral solution for the use of such data rather than simply existing as a component of the knowledge discovery and data mining processes (Han & Kamber, 2001).
As in other maturing applications, missing or incomplete data will be looked upon as a global issue. Clues to terrorist activities will be uncovered by utilizing textual data-mining techniques for incomplete and unstructured data (Sullivan, 2001). For instance, monitoring links and searching for trends between news reports from commercial news services may be used to glean clues from items in cross-cultural environments that might not otherwise be available.
Although the issue of statistical analysis with missing data has been addressed since the early 1970s (Little & Rubin, 1987), the advent of data warehousing, knowledge discovery, data mining, and data cleansing has brought the concept of dealing with missing data into the limelight. Costs associated with the deletion of data will make the imputation of data a focal point for independent data-sleuthing entities. For example, an independent firm involved in the restructuring of incomplete data sets will be able to obtain or purchase fragmented data that has been deemed unusable by mainstream corporations and produce complete data with a level of support worthy of use in actual production environments.
An extended topic in this arena will be the ownership of such newly created data sets and/or algorithms and imputation methods that are discovered within an application or industry. What if an independent firm discovers new knowledge based on findings from data sets that were originally incomplete and made available by outside firms? Are the discoveries partially owned by both firms? It is up to the legal system and the parties involved to sort out these issues.
Future generations of data miners will clearly be faced with many challenges concerning missing data. As the industry as a whole continues to evolve and mature, so will end-user expectations of having a variety of readily available methods for dealing with missing data.