DATA MINING WITH INCONSISTENT DATA/MISSING DATA

This chapter focuses on methods of addressing missing data and on the impact that missing data has on the knowledge discovery process, an impact that depends on the data-mining algorithm being used. Finally, trends regarding missing data and data mining are discussed, along with future research opportunities and concluding remarks.

Reasons For Data Inconsistency

Data inconsistency may arise for a number of reasons, including:

  • Procedural Factors;

  • Refusal of Response; or

  • Inapplicable Responses.

Procedural Factors

Errors in databases are a fact of life; however, their impact on knowledge discovery and data mining can be serious. Data-entry errors are common. Dillman (1999) provides an excellent text on designing surveys and collecting data, and he promotes discussion of reducing survey error, including coverage, sampling, measurement, and nonresponse error.

Whenever invalid codes slip into a database, new data are classified inaccurately, resulting in classification error or omission. Erroneous estimates, predictions, and invalid pattern-recognition conclusions may also follow. Correlations between attributes can become skewed as well, which results in erroneous association rules.
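One common safeguard is to screen each coded attribute against its code book before mining begins and to recode anything out of range as missing rather than letting it slip through. The sketch below shows a minimal version of such a screen in Python with pandas; the column names and the code book are hypothetical, not values from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical code book: the valid category codes for each coded field.
VALID_CODES = {
    "marital_status": {1, 2, 3, 4},
    "education": {1, 2, 3, 4, 5},
}

def screen_invalid_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Recode out-of-code-book values as missing (NaN) so they cannot
    be silently misclassified downstream."""
    df = df.copy()
    for column, valid in VALID_CODES.items():
        # Flag present-but-invalid codes; leave genuine missing values alone.
        bad = df[column].notna() & ~df[column].isin(valid)
        df.loc[bad, column] = np.nan
    return df

records = pd.DataFrame({"marital_status": [1, 9, 2],
                        "education": [3, 5, 7]})
print(screen_invalid_codes(records))
```

Recoding to NaN rather than deleting the rows keeps the valid portions of each record available to the mining algorithm.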

Data from questionnaire items that are left blank by the respondent further complicate the data-mining process. If a large number of similar respondents fail to complete similar questions, deleting or misclassifying those observations can take the researcher down the wrong path of investigation or lead end-users to inaccurate decisions. Methods for preventing procedural data inconsistency are presented in Jenkins and Dillman (1997), including how to design a questionnaire's typography and layout so as to avoid data inconsistency. Another excellent paper is Brick and Kalton (1996), which discusses the handling of missing data in survey research.

Refusal of Response

Some respondents may find certain survey questions offensive, or they may be personally sensitive to certain questions. For example, questions that refer to one's education level, income, age, or weight may be deemed too personal by some respondents. In addition, some respondents may have no opinion regarding certain questions, such as political or religious affiliation.

Furthermore, respondents may simply have insufficient knowledge to accurately answer particular questions (Hair, 1998). Students, in particular, may have insufficient knowledge to answer certain questions. For example, when polled for data concerning future goals and/or career choices, they may not have had the time to investigate certain aspects of their career choice (such as salaries in various regions of the country, retirement options, insurance choices, etc.).

Inapplicable Responses

Sometimes questions are left blank simply because the questions apply to a more general population than to an individual respondent. In addition, if a subset of questions on a questionnaire does not apply to the individual respondent, data may be missing for a particular expected group within a data set. For example, adults who have never been married or who are widowed or divorced are not likely to answer a question regarding years of marriage. Likewise, graduate students may choose to leave questions blank that concern social activities for which they simply do not have time.

Types of Missing Data

It is important for an analyst to understand the different types of missing data before he or she can address the issue. The following is a list of the standard types of missing data:

  • Data Missing At Random;

  • Data Missing Completely At Random;

  • Non-Ignorable Missing Data; and

  • Outliers Treated As Missing Data.

[Data] Missing At Random (MAR)

Cases containing incomplete data must be treated differently from cases with complete data. In the MAR situation, the pattern of the missing data is traceable or predictable from other variables in the database, rather than being attributable to the specific variable on which the data are missing (Stat. Serv. Texas, 2000). Rubin (1976) defined missing data as Missing At Random (MAR) "when given the variables X and Y, the probability of response depends on X but not on Y." For example, if the likelihood that a respondent will provide his or her weight depends on the respondent's age but not on the weight itself, then the missing weight data are considered MAR (Kim, 2001).

Consider the situation of reading comprehension. Investigators may administer a reading comprehension test at the beginning of a survey session; participants with lower comprehension scores may then be less likely to complete questions located at the end of the survey. Because the missingness is predictable from the observed comprehension score rather than from the unanswered questions themselves, such data are MAR.
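To make the mechanism concrete, the sketch below simulates MAR missingness in Python with NumPy and pandas. The variables (age and weight) and the dropout probabilities are hypothetical assumptions for illustration; the point is only that the chance of a missing weight depends on the observed age, never on the weight itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical survey variables.
age = rng.integers(18, 80, size=n)
weight = rng.normal(80.0, 12.0, size=n)

# MAR: the chance that weight is missing depends on the observed age
# (younger respondents answer less often), not on the weight itself.
p_missing = np.where(age < 30, 0.40, 0.05)
weight_obs = np.where(rng.random(n) < p_missing, np.nan, weight)

df = pd.DataFrame({"age": age, "weight": weight_obs})
# Missingness rates differ by age group, confirming the MAR mechanism.
print(df["weight"].isna().groupby(df["age"] < 30).mean())
```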

[Data] Missing Completely At Random (MCAR)

Missing Completely At Random (MCAR) data exhibit a higher level of randomness than MAR data do. Rubin (1976) and Kim (2001) classified data as MCAR when the probability of response is independent of both X and Y. In other words, the observed values of Y form a truly random sample of all values of Y, and no other factor included in the study biases the observed values of Y.

Consider the case of a laboratory providing the results of a decomposition test of a chemical compound in which a significant level of iron is being sought. If some iron values are missing, and neither the iron level itself nor any other element in the compound correlates with which values are missing, then the missing iron data are MCAR.
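A minimal simulation of this situation, with hypothetical iron values and an assumed 10% loss rate, illustrates the defining property of MCAR: the observed values remain a random sample of all values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

# Hypothetical lab results: iron content (parts per million) per sample.
iron = rng.normal(300.0, 60.0, size=n)

# MCAR: a fixed 10% of results are lost, independent of the iron level
# and of every other measured element.
lost = rng.random(n) < 0.10
iron_obs = np.where(lost, np.nan, iron)

# Because the loss is completely random, the observed mean closely
# tracks the true mean.
print(f"observed mean: {np.nanmean(iron_obs):.1f}, true mean: {iron.mean():.1f}")
```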

Non-Ignorable Missing Data

Given two variables, X and Y, data are deemed Non-Ignorable when the probability of response depends on variable Y itself and possibly on variable X. For example, if the likelihood of an individual providing his or her weight varies with the weight itself, the missing data are non-ignorable (Kim, 2001). Thus, the pattern of missing data is non-random and cannot be predicted from the other variables in the database.

In contrast to the MAR situation, where data missingness is explained by other measured variables in a study, non-ignorable missing data arise because the missingness pattern is explainable, and only explainable, by the very variable(s) on which the data are missing (Stat. Serv. Texas, 2000).

In practice, the MCAR assumption is seldom met. Most missing-data methods are applied under the assumption of MAR, although that assumption is not always tenable. As Kim (2001) notes, "Non-Ignorable missing data is the hardest condition to deal with, but, unfortunately, the most likely to occur as well."
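The sketch below simulates a non-ignorable mechanism, again with hypothetical weights: the probability that a value is missing is a logistic function of the value itself, so heavier respondents report less often and the observed sample is biased in a way no other column can explain or correct.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
weight = rng.normal(80.0, 12.0, size=n)

# Non-ignorable: the chance that a weight is missing grows with the
# weight itself, so missingness depends on the very value that is lost.
p_missing = 1.0 / (1.0 + np.exp(-(weight - 95.0) / 5.0))
weight_obs = np.where(rng.random(n) < p_missing, np.nan, weight)

# The observed mean is biased downward relative to the true mean.
print(f"observed mean: {np.nanmean(weight_obs):.1f}, true mean: {weight.mean():.1f}")
```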

Outliers Treated As Missing Data

Data whose values fall outside of predefined ranges may skew test results, and it is often necessary to classify these outliers as missing data. Pre-testing and calculating threshold boundaries during the pre-processing of the data are necessary to identify the values that should be classified as missing. Reconsider the case of a laboratory providing the results of a decomposition test of a chemical compound. If it has been predetermined that the maximum amount of iron that a particular compound can contain is 500 parts per million, then the value of the variable "iron" should never exceed that amount. If, for some reason, a value does exceed 500 parts per million, then some visualization or screening technique should be implemented to identify it, and the offending cases should be presented to the end-users.
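The following sketch shows one way such a screen might be implemented in Python with pandas. The 500 parts-per-million ceiling comes from the example above; the column name and helper function are illustrative assumptions.

```python
import numpy as np
import pandas as pd

IRON_MAX_PPM = 500.0  # predetermined physical maximum from the example

def outliers_to_missing(samples: pd.DataFrame) -> pd.DataFrame:
    """Recode iron readings above the known maximum as missing so that
    they cannot skew downstream mining results."""
    samples = samples.copy()
    offending = samples["iron_ppm"] > IRON_MAX_PPM
    if offending.any():
        # Surface the offending cases for the end-users to review.
        print("Offending cases:\n", samples.loc[offending])
    samples.loc[offending, "iron_ppm"] = np.nan
    return samples

lab = pd.DataFrame({"iron_ppm": [310.0, 512.5, 488.0]})
print(outliers_to_missing(lab))
```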

For even greater precision, various levels of a specific attribute can be calculated according to its volume, magnitude, percentage, and overall impact on other attributes, and these levels can then be used to help determine the attribute's impact on overall data-mining performance.

Suppose that the amount of silicon in the chemical compound from the previous example affected the level of iron in that same compound. A percentile threshold for the level of silicon permissible in the compound can be calculated to identify the point at which silicon significantly alters the iron content and, therefore, the overall compound. This "trigger" should be defined in the data-mining procedure so that it identifies test samples that may be polluted with an overabundance of silicon, which would skew the iron reading taken from the compound.
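The sketch below illustrates one way such a trigger might be computed. The 95th-percentile threshold, the column names, and the simulated data are assumptions for illustration, not values from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
lab = pd.DataFrame({
    "silicon_ppm": rng.normal(120.0, 30.0, size=500),
    "iron_ppm": rng.normal(300.0, 60.0, size=500),
})

# Hypothetical trigger: the 95th percentile of silicon stands in for
# the level at which silicon significantly alters the iron content.
silicon_trigger = lab["silicon_ppm"].quantile(0.95)
polluted = lab["silicon_ppm"] > silicon_trigger

# Treat iron readings from polluted samples as suspect, i.e., missing.
lab.loc[polluted, "iron_ppm"] = np.nan
print(f"trigger = {silicon_trigger:.1f} ppm; {polluted.sum()} samples flagged")
```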
