STATISTICAL ISSUES

Systematic Patterns vs. Sample-Specific Patterns

There are two main goals of time series analysis: (a) identifying the nature of the phenomenon represented by the sequence of observations, and (b) forecasting (predicting future values of the time series variable). Both of these goals require that the pattern of observed time series data be identified and more or less formally described. Once the pattern is established, we can interpret it and integrate it with other data. Regardless of the depth of our understanding and the validity of our interpretation of the phenomenon, we can extrapolate (ceteris paribus) the identified pattern to predict future events. Pyle (1999) argued that series data have many of the same problems as non-series data, along with a number of special problems of their own.
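
As a minimal illustration of these two goals, the sketch below first identifies a pattern in an observed series and then extrapolates it to forecast future values. The synthetic data and the assumed linear trend are choices made here for illustration, not an example from the chapter.

```python
import numpy as np

# Illustrative only: a synthetic series with a known linear trend plus noise.
rng = np.random.default_rng(0)
t = np.arange(60)                                   # 60 observed periods
series = 10 + 0.5 * t + rng.normal(0, 2, size=t.size)

# Goal (a): identify the pattern -- here, assume a linear trend and estimate it.
slope, intercept = np.polyfit(t, series, deg=1)

# Goal (b): extrapolate the identified pattern (ceteris paribus) to forecast ahead.
future_t = np.arange(60, 72)                        # the next 12 periods
forecast = intercept + slope * future_t

print(f"estimated trend: {slope:.2f} per period (true value 0.5)")
print("12-period forecast:", np.round(forecast, 1))
```

Any such forecast, of course, rests on the ceteris paribus assumption that the identified pattern continues to hold.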

Temporal DM is concerned with forecasting events that may occur in the future. Trends usually describe the rise or decline of some quantity. There may also be seasonal patterns that can be tracked over the years, allowing future outcomes to be predicted. Finally, there are irregular patterns, with erratic outcomes that follow no true course. The shortcoming of temporal DM appears when something follows an irregular pattern. For the most part, events that follow a pattern continue to follow that course, but occasionally an unexpected outcome occurs. Such outcomes may result from something the decision maker could not foresee happening in the near future, and they may render the outcome of any analysis worthless. Temporal DM cannot anticipate such an event because its user cannot predict the occurrence.
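
The contrast can be made concrete with a hedged sketch: two synthetic series, one with a repeating seasonal cycle and one purely irregular (modeled here, as an assumption, by a random walk), both forecast by the naive rule of repeating the last observed cycle.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(72)

# Two hypothetical series: a repeating seasonal cycle vs. a purely irregular random walk.
seasonal = 100 + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)
irregular = 100 + np.cumsum(rng.normal(0, 5, t.size))

def last_cycle_error(y, period=12):
    """Forecast the final year by repeating the previous year; return mean absolute error."""
    forecast = y[-2 * period:-period]
    return np.mean(np.abs(y[-period:] - forecast))

print(f"seasonal series MAE:  {last_cycle_error(seasonal):6.1f}")   # small: the cycle repeats
print(f"irregular series MAE: {last_cycle_error(irregular):6.1f}")  # typically large: nothing to carry forward
```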

As DM is used more extensively, caution must be exercised while searching through DBs for correlations and patterns. The practice of using DM is not inherently bad, but the context in which it is used must be observed with a keen and practiced eye. Possibly the most notorious group of data miners are stock market researchers who seek to predict future stock price movement. Basically, past performance is used to predict future results. There are a number of potential problems in making the leap from a back-tested strategy to successfully investing in future real-world conditions. The first problem is determining whether the relationships occurred at random, or whether the anomaly is unique to the specific sample that was tested. Statisticians are fond of pointing out that if you torture the data long enough, it will confess to anything (McQueen & Thorley, 1999).

In describing the pitfalls of DM, Leinweber "sifted through a United Nations CD-ROM and discovered that, historically, the single-best predictor of the Standard & Poor's 500-stock index was butter production in Bangladesh." The lesson to learn here is that a "formula that happens to fit the data of the past won't necessarily have any predictive value" (Investor Home, 1999). The "random walk theory" of stock prices suggests that securities prices cannot be forecasted. Successful investment strategies, even those that have been successful for many years, may turn out to be fool's gold rather than a golden chalice.

Given a finite amount of historical data and an infinite number of complex models, uninformed investors may be lured into "overfitting" the data. Patterns that are assumed to be systematic may actually be sample-specific and therefore of no value (Montana, 2001). When people search through enough data, the data can be tailored to back any theory. The vast majority of the things that people discover by taking standard mathematical tools and sifting through a vast amount of data are statistical artifacts.
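
A small simulation illustrates the point. The "index returns" and "candidate predictors" below are pure noise invented for this sketch: searching enough series always turns up an impressive but sample-specific correlation, much like the butter-production example above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_periods, n_candidates = 120, 5000

# Hypothetical "index returns" and thousands of unrelated candidate predictors (pure noise).
index_returns = rng.normal(0, 1, n_periods)
candidates = rng.normal(0, 1, (n_candidates, n_periods))

train, test = slice(0, 60), slice(60, 120)

# Search the first half of the history for the candidate most correlated with the index.
in_sample = np.array([np.corrcoef(c[train], index_returns[train])[0, 1] for c in candidates])
best = np.argmax(np.abs(in_sample))

# The "discovered" predictor looks impressive in sample but collapses out of sample.
out_sample = np.corrcoef(candidates[best][test], index_returns[test])[0, 1]
print(f"in-sample correlation of best candidate:  {abs(in_sample[best]):.2f}")
print(f"out-of-sample correlation of same series: {abs(out_sample):.2f}")
```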

Explanatory Factors vs. Random Variables

The variables used in DM need to be more than variables; they need to be explanatory factors. If the factors are fundamentally sound, there is a greater chance that DM will prove fruitful. We might also review the relationship in several distinct time periods. A common mistake made by those inexperienced in DM is to do "data dredging," that is, simply attempting to find associations, patterns, and trends in the data by using various DM tools without any prior analysis and preparation of the data. Work with multiple comparison procedures shows that data can be twisted to indicate a trend if the user feels so inclined (Jensen, 1999). Those who do "data dredging" are likely to find patterns that are common in the data, but are less likely to find patterns that are rare events, such as fraud.
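
The multiple-comparisons point can be shown with a short simulation on synthetic data: when many unrelated variables are each tested against an outcome at the usual 0.05 level, a predictable number of them will look "significant" purely by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_tests, n_obs, alpha = 200, 50, 0.05

# "Data dredging": test 200 unrelated variables against an outcome with no real associations.
outcome = rng.normal(size=n_obs)
false_positives = 0
for _ in range(n_tests):
    variable = rng.normal(size=n_obs)
    _, p = stats.pearsonr(variable, outcome)
    false_positives += p < alpha

# Roughly alpha * n_tests (about 10) "significant" associations appear by chance alone.
print(f"{false_positives} of {n_tests} unrelated variables look significant at alpha = {alpha}")
```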

An old saw that may fit the analysis of DM is "garbage in, garbage out." One of the quirks of statistical analysis is that one may be able to find a factor that seems very highly correlated with the dependent variable during a specific time period, but such a relationship turns out to be spurious when tested in other time periods. Such spurious correlations produce the iron pyrite ("fool's gold") of DM. However, even when adjustments are made for excessive collinearity by removing less explanatory, collinear variables from the model, many such models have trouble withstanding the test of time (Dietterich, 1999). Just as gold is difficult to detect in the ground because it is a rare and precious metal, so too are low-incidence occurrences such as fraud. The person mining has to know where to search and what signs to look for to discover fraudulent practice, which is where data analysis comes in.
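
One common diagnostic for excessive collinearity, not discussed in the chapter itself, is the variance inflation factor. A minimal numpy sketch on synthetic data is shown below; a high VIF flags the nearly collinear pair as a candidate for removal.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X -- a minimal sketch."""
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(len(y))])   # regress column j on the others
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r_squared = 1 - resid.var() / y.var()
        factors.append(1 / (1 - r_squared))
    return np.array(factors)

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(0, 0.1, 200)        # nearly collinear with x1
x3 = rng.normal(size=200)                # independent
print(np.round(vif(np.column_stack([x1, x2, x3])), 1))
# High values flag x1 and x2; a common practice is to drop one of the pair.
```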

Statistical inference "has a great deal to offer in evaluating hypotheses in the search, in evaluating the results of the search and in applying the results" (Numerical Machine Learning, 1997). Yet any correlation found through statistical inference might be completely random and therefore not meaningful. Even worse, "variables not included in a dataset may obscure relationships enough to make the effect of a variable appear the opposite from the truth" (Numerical Machine Learning, 1997). Furthermore, other research indicates that on large datasets with many variables, statistical results can become overwhelming and therefore difficult to interpret.
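
The omitted-variable point can be illustrated with a synthetic example; the "segment" confounder and its effect sizes are assumptions made for illustration. Pooled over segments, the sign of the relationship between x and y appears to be the opposite of the true within-segment effect.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000

# A hypothetical confounder (e.g., customer segment) that is not in the dataset.
segment = rng.binomial(1, 0.5, n)
x = rng.normal(size=n) + 3 * segment                 # x runs higher in segment 1
y = 0.5 * x - 4 * segment + rng.normal(size=n)       # the true effect of x on y is +0.5

# Pooled over both segments, the sign of the relationship flips.
print(f"pooled correlation: {np.corrcoef(x, y)[0, 1]:+.2f}")
for g in (0, 1):
    m = segment == g
    print(f"within segment {g}:  {np.corrcoef(x[m], y[m])[0, 1]:+.2f}")
```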

Hypothesis testing is a respectable and sometimes valuable tool to assess the results of experiments. However, it too has difficulties. For example, there is an asymmetry in the treatment of the null and alternative hypotheses: the procedure controls the probability of Type I errors, but the probability of Type II errors may have to be largely guessed (Berger & Berry, 1988). In practice, the approach is not followed literally; common sense prevails. Rather than setting an α in advance and then acting accordingly, most researchers tend to treat the p-value obtained for their data as a kind of standardized descriptive statistic. They report these p-values and then let others draw their own conclusions; such conclusions will often be that further experiments are needed. The problem then is that there is no standard approach to arriving at a final conclusion. Perhaps this is how it should be, but it means that statistical tests are used as a component in a slightly ill-defined mechanism for accumulating evidence, rather than in the tidy, cut-and-dried way that their inventors were trying to establish. The rejection/acceptance paradigm also leads to the problem of biased reporting. Positive results are usually much more exciting than negative ones, and so it is tempting to use low p-values as a criterion for publication of results. Despite these difficulties, those who seek rigorous analysis of experimental results will often want to see p-values, and provided its limitations are borne in mind, the hypothesis-testing methodology can be applied in useful and effective ways.
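
The asymmetry between the two error types can be seen in a small simulation. The normal data, the assumed effect size of 0.3 standard deviations, and the choice of a two-sample t-test are all illustrative assumptions: α is fixed in advance, while the Type II error rate has to be estimated and depends heavily on the effect size and sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
alpha, n, effect, n_sims = 0.05, 30, 0.3, 2000

# Type I error is set by alpha; Type II error is estimated here by simulation
# under an assumed effect size.
type1 = type2 = 0
for _ in range(n_sims):
    control = rng.normal(0, 1, n)
    type1 += stats.ttest_ind(control, rng.normal(0, 1, n)).pvalue < alpha        # null is true
    type2 += stats.ttest_ind(control, rng.normal(effect, 1, n)).pvalue >= alpha  # alternative is true

print(f"estimated Type I error:  {type1 / n_sims:.2f}   (controlled by alpha)")
print(f"estimated Type II error: {type2 / n_sims:.2f}   (depends on effect size and n)")
```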

Segmentation vs. Sampling

Segmentation is an inherently different task from sampling. With a segment, we deliberately focus on a subset of the data, sharpening the focus of the analysis. But when we sample data, we lose information, because we throw away records without knowing what to keep and what to ignore. Sampling will almost always result in a loss of information, in particular with respect to data fields with a large number of non-numeric values.

Most of the time it does not make sense to analyze all of the variables in a large dataset, because patterns are lost through dilution. To find useful patterns in a large data warehouse, we usually have to select a segment (and not a sample) of data that fits a business objective, prepare it for analysis, and then perform DM. Looking at all of the data at once often hides the patterns, because the factors that apply to distinct business objectives often dilute each other.
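
A hypothetical example of such dilution (the customer segments and the spend/tenure relationship are invented for this sketch): a pattern that is strong within one segment nearly disappears when the segments are analyzed together.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Hypothetical customer data: a spend/tenure relationship that exists only in the retail segment.
segment = rng.choice(["retail", "wholesale"], size=n)
tenure = rng.uniform(0, 10, n)
spend = np.where(segment == "retail",
                 100 + 20 * tenure + rng.normal(0, 30, n),    # clear pattern
                 300 + rng.normal(0, 150, n))                  # no pattern, high variance

print(f"all data:       corr(tenure, spend) = {np.corrcoef(tenure, spend)[0, 1]:.2f}")
retail = segment == "retail"
print(f"retail segment: corr(tenure, spend) = {np.corrcoef(tenure[retail], spend[retail])[0, 1]:.2f}")
```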

While sampling may seem to offer a shortcut to faster data analysis, the end results are often less than desirable. Sampling was used within statistics because it was so difficult to gain access to an entire population. Sampling methods were developed to allow rough calculations about some characteristics of the population without access to the whole. This runs counter to the point of having a large DB: we build huge DBs of customer behavior precisely for the purpose of having access to the entire population. Sampling a large warehouse for analysis almost defeats the purpose of having all the data in the first place (Data Mines for Data Warehouses, 2001).
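
The loss is most visible on categorical fields with many values, as noted above. In the hedged sketch below, a long-tailed field of synthetic product codes loses a large share of its distinct values in a 1% sample.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical categorical field with many values, most of them rare (a long-tailed distribution).
codes = np.array([f"code_{i}" for i in range(5000)])
weights = 1.0 / np.arange(1, 5001)
field = rng.choice(codes, size=1_000_000, p=weights / weights.sum())

sample = rng.choice(field, size=10_000, replace=False)        # a 1% sample
print(f"distinct codes in the full field: {np.unique(field).size}")
print(f"distinct codes in the 1% sample:  {np.unique(sample).size}")   # many rare values vanish
```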

In discussing some pitfalls of DM, the issue of how to avoid them deserves mention. There are unique statistical challenges produced by searching a space of models and evaluating model quality based on a single data sample. Work in statistics on specification searches and multiple comparisons has long explored the implications of DM, and statisticians have also developed several adjustments to account for the effects of search. Work in machine learning and knowledge discovery related to overfitting and oversearching has explored similar themes, and researchers in these fields have developed techniques such as pruning and minimum description length encoding to adjust for the effects of the search (Jensen, 1999). However, this "dark side" of DM is still largely unknown to some practitioners, and problems such as overfitting and overestimation of accuracy still arise in knowledge discovery applications with surprising regularity. In addition, the statistical effects of the search can be quite subtle, and they can trip up even experienced researchers and practitioners.

A very common approach is to obtain new data or to divide an existing sample into two or more subsamples, using one subsample to select a small number of models and the other subsamples to obtain unbiased scores. Cross-validation is a related approach that can be used when the process for identifying a "best" model is algorithmic. Sometimes incremental induction is efficient (Hand, Mannila, & Smyth, 2001). For example, a model developed on a small data sample may be suggestive of an interesting relationship yet fail to exceed a prespecified critical value. Another small sample of data that becomes available later may also be too small to confer statistical significance on the model. However, the relationship might well be significant if the two data samples were considered together. This indicates the importance of maintaining both tentative models and links to the original data (Jensen, 2001).
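
A sketch of the pooling point, in which the weak linear relationship and the sample sizes are assumptions chosen for illustration: each small sample on its own will often fall short of the usual significance threshold, while the pooled data may not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# An assumed weak but real relationship: y = 0.25 * x + noise.
def draw(n):
    x = rng.normal(size=n)
    return x, 0.25 * x + rng.normal(size=n)

x1, y1 = draw(40)                                  # first small sample
x2, y2 = draw(40)                                  # second small sample, arriving later

pooled = (np.concatenate([x1, x2]), np.concatenate([y1, y2]))
for label, (x, y) in [("sample 1", (x1, y1)), ("sample 2", (x2, y2)), ("pooled  ", pooled)]:
    r, p = stats.pearsonr(x, y)
    print(f"{label}: r = {r:+.2f}, p = {p:.3f}")
# With a weak true effect, each small sample will often miss the 0.05 threshold
# while the pooled sample does not.
```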

Several relatively simple mathematical adjustments can be made to statistical significance tests to correct for the effects of multiple comparisons. These have been explored in detail within the statistical literature on experimental design. Unfortunately, the assumptions of these adjustments are often restrictive. Many of the most successful approaches are based on computationally intensive techniques such as randomization and resampling (Saarenvirta, 1999). Randomization tests have been employed in several knowledge discovery algorithms. Serious statistical problems are introduced by searching large model spaces, and unwary analysts and researchers can still fall prey to these pitfalls.
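
A randomization test of this kind can be sketched as follows on synthetic data. The key point is that the entire search, not a single comparison, is re-run on shuffled outcomes, so the best-of-many correlation is judged against a search-adjusted reference distribution.

```python
import numpy as np

rng = np.random.default_rng(10)
n_obs, n_candidates, n_shuffles = 50, 100, 500

# Synthetic outcome and candidate variables -- all pure noise, so the search should find nothing real.
outcome = rng.normal(size=n_obs)
candidates = rng.normal(size=(n_candidates, n_obs))

def best_abs_corr(y):
    """The strongest correlation the search would report for outcome y."""
    return max(abs(np.corrcoef(c, y)[0, 1]) for c in candidates)

observed = best_abs_corr(outcome)

# Randomization test: re-run the entire search on shuffled outcomes to build a
# reference distribution that already accounts for the effect of searching.
null_best = np.array([best_abs_corr(rng.permutation(outcome)) for _ in range(n_shuffles)])
p_value = np.mean(null_best >= observed)

print(f"best correlation found by the search:  {observed:.2f}")
print(f"search-adjusted randomization p-value: {p_value:.2f}")
```

Because everything here is noise, the best correlation the search turns up looks sizeable on its own, yet the search-adjusted p-value shows it is unremarkable.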
