THE IMPACT OF MISSING DATA ON DATA-MINING ALGORITHMS
Missing data impacts the Knowledge Discovery Process in different ways depending on which data-mining algorithm is being utilized. We will now address that impact for several common types of data-mining algorithms.
The Impact on the k-Nearest Neighbor Algorithm
The k-Nearest Neighbor algorithm depends, by its very nature, on the accuracy of the data, so missing and inaccurate data have a severe impact on the performance of this type of algorithm. If data is missing entirely, misrepresented clusters (data distributions) can occur, depending upon the frequency and categorization of the cases containing the missing data. One method to help solve this problem is to use the k-Nearest Neighbor data-mining algorithm itself to approach the missing data problem: the imputed values obtained can then be used to enhance the performance of the Nearest Neighbor algorithm.
First, the k nearest neighbors (cases containing no missing data) of the observation that does contain missing data are identified. The k stands for a predetermined constant representing the number of complete-data neighbors to be considered in the analysis. According to Witten and Frank (2000), it is advisable to keep the value of k small, say five, so that the impact of any noise present is kept to a minimum.
Hence, this algorithm is not recommended for large data sets (Adriaans & Zantinge, 1996). Once these "neighbors" have been identified, the majority class for the attribute in question can be assigned to the case containing the missing value. Berson, Smith and Thearling (2000) maintained that a historical database whose attributes contain predictor values similar to those in the offending case can also be utilized to aid in the classification of unclassified records.
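The imputation procedure just described can be sketched as follows. This is a minimal illustration, not a production implementation: the function name, toy data, and class labels are hypothetical, and k is kept small per Witten and Frank's advice.

```python
import numpy as np
from collections import Counter

def knn_impute(data, labels, missing_row, k=5):
    """Impute a missing categorical value using the k nearest
    complete cases (Euclidean distance on the known attributes)."""
    # Distances from the incomplete case to every complete case.
    dists = np.linalg.norm(data - missing_row, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest neighbors
    # The majority class among the neighbors becomes the imputed value.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: four complete cases with a known class label.
complete = np.array([[1.0, 2.0], [1.1, 1.9], [8.0, 9.0], [7.9, 9.2]])
classes = ["A", "A", "B", "B"]
print(knn_impute(complete, classes, np.array([1.05, 2.1]), k=3))  # "A"
```

The two closest neighbors belong to class "A," so the majority vote assigns "A" to the incomplete case.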
Of course, the three main disadvantages mentioned in the imputation section (variance understatement, distribution distortion, and correlation depression) should be addressed whenever a constant value is used to replace missing data. The proportion of values replaced should be calculated and compared against the clusters and category assignments that existed prior to the replacement of the missing data.
Further, inaccurate data and outliers must be identified. Predetermined "impossible" and "improbable" values are determined, along with minimum and maximum boundary thresholds, to act as identifiers for inaccurate data and outliers. The "closeness" of values to these boundaries can be determined using the Euclidean distance between the two points (Han & Kamber, 2001). These values are then visualized to the end-user and treated as though the data were actually missing. An imputation method is then selected and implemented as previously described to replace offending values.
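A minimal sketch of such boundary-based screening follows. In one dimension, the Euclidean distance to a boundary reduces to an absolute difference; the attribute, limits, and tolerance below are illustrative assumptions.

```python
def flag_suspect_values(values, lo, hi, tol):
    """Flag values outside [lo, hi] as impossible, and values within
    `tol` of either boundary as outlier candidates for review."""
    flagged = []
    for i, v in enumerate(values):
        if v < lo or v > hi:
            flagged.append((i, v, "impossible"))     # outside hard limits
        elif min(abs(v - lo), abs(v - hi)) <= tol:
            flagged.append((i, v, "near-boundary"))  # close to a threshold
    return flagged

# Hypothetical pH readings with hard limits 0-14 and a 0.5 tolerance.
readings = [6.8, 7.1, 15.2, 13.9, -0.3]
print(flag_suspect_values(readings, lo=0.0, hi=14.0, tol=0.5))
```

The flagged values would then be shown to the end-user and treated as missing, as the text describes.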
The Impact on Decision Trees
If missing data is a frequent occurrence in the data mining application in question, decision trees are a good methodology for dealing with it (Berry & Linoff, 1997).
They also scale up very well for large data sets (Adriaans & Zantinge, 1997). A popular method utilized when using decision trees is to treat the missing values as if they were an actual data type for a given attribute. In fact, an entire coding scheme may be developed for any number of types of missing data that may be identified in a given system. A separate branch for missing data may be implemented into a tree for a given attribute and various limbs developed for each of the missing data code types.
For instance, data that is known but "not yet" entered could be coded as "NYE," while data that does not apply to a particular case may be entered as "DNA."
When missing data is to be replaced, a popular solution is to record the number of elements in the training set that follow each branch for that particular attribute and then use the most popular branch as the default replacement value for the missing data. However, any of the aforementioned imputation methods could be utilized to replace the missing data.
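The most-popular-branch default can be sketched as follows; the attribute name and values are hypothetical, and `None` stands in for a missing-data code such as "NYE" or "DNA."

```python
from collections import Counter

def default_branch(training_values):
    """Return the most frequently followed branch value for an attribute,
    to be used as the replacement for missing data at that node."""
    # Ignore missing entries when counting how often each branch is taken.
    observed = [v for v in training_values if v is not None]
    return Counter(observed).most_common(1)[0][0]

# None marks missing entries for this (hypothetical) attribute.
colour = ["red", "blue", "red", None, "red", "blue", None]
print(default_branch(colour))  # most popular branch: "red"
```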
It is sometimes useful to prune the tree whenever there is an overabundance of missing data in certain branches (Berry & Linoff, 1997). Eliminating particular paths may be necessary to ensure that the overall success of the decision-making process is not inhibited by the inclusion of cases containing missing data. Witten and Frank (2000) advise the use of prepruning during the tree-building process to determine when to stop developing subtrees. Postpruning can be utilized after a tree is completely built. If one chooses postpruning, decisions for pruning rules can then be made after the tree has been built and analyzed.
The Impact on Association Rules
Association Rules help to identify how various attribute values are related within a data set. They are developed to predict the value of an attribute (or sets of attributes) in the same data set (Darling, 1997). Since Association Rules are often developed to help identify various regularities (patterns) within a data set, algorithms that utilize association rules have been found to work best with large data sets. Attributes containing missing or corrupted data values may easily result in the creation of invalid rule sets or in a failure to identify valid patterns that normally exist within the data. Since the main focus of association rule discovery is to identify rules that apply to large numbers of cases to which the rules can directly relate, missing data may overstate both the support and the confidence of any newly discovered rule sets (Witten & Frank, 2000).
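Support and confidence, the two measures that missing data can overstate, can be computed with a short sketch; the helper name and toy item sets below are illustrative assumptions.

```python
def support_confidence(transactions, antecedent, consequent):
    """Compute support and confidence of the rule antecedent -> consequent
    over a list of transaction item sets."""
    n = len(transactions)
    # Cases matching both sides, and cases matching the antecedent alone.
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"},
           {"bread"}, {"milk"}, {"bread", "milk"}]
print(support_confidence(baskets, {"bread"}, {"milk"}))  # (0.6, 0.75)
```

If cases with missing items are silently dropped from `transactions`, the denominators shrink and both measures can be inflated, which is the overstatement the text warns about.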
However, if the data set used to train the algorithm contains only "pristine" data, the resulting model will typically overfit the patterns included in the training set.
Therefore, rules need to be developed for the "exceptions-to-rule" sets that have been developed in violation of correct or "clean" data. It is then necessary to populate the training set for algorithms that utilize Association Rules with a sufficient percentage of "noisy data," representing all possible types of exceptions to existing rules. In this way, exception rules can be developed to handle all patterns of noise that may be associated with a given data set, rather than redesigning rule sets that deal with "clean" data or attempting to force cases that do not belong to existing rule sets into those sets.
Default outcomes should normally be defined for exception rules as well as for "normal" data. The most frequently occurring outcome (type of missing data or noise) is chosen as the default outcome.
As exceptions are discovered for initial exceptions, a type of tree structure is created, forming a decision list for the treatment of missing and noisy data for the data set. It becomes necessary to utilize both propositional rules and relational rules in the rule set for the treatment of missing or noisy data.
Propositional rules test an attribute's value against a constant value, thereby developing very concise limits to delineate between "clean" and "noisy" data. In extreme instances, the constants, breakpoints, and values from associated attributes are used to grow a regression tree in order to estimate missing data values under various conditions.
As these regression trees become larger they can also be utilized as a model tree for a missing data rule set. Relational rules are used to test the relationships between attributes. For nominal attributes, tests are made for equality and inequality. Numeric attributes, on the other hand, test the conditions of less than, greater than, less than or equal to, and greater than or equal to.
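The two rule forms described above can be sketched as simple predicates; the record fields and operator set below are illustrative assumptions.

```python
def propositional_test(record, attribute, constant):
    """Propositional rule: compare an attribute's value to a constant."""
    return record[attribute] == constant

def relational_test(record, attr_a, attr_b, op):
    """Relational rule: compare two attributes of the same record.
    Nominal attributes use ==/!=; numeric ones use <, <=, >, >=."""
    ops = {"==": lambda a, b: a == b, "!=": lambda a, b: a != b,
           "<":  lambda a, b: a < b,  "<=": lambda a, b: a <= b,
           ">":  lambda a, b: a > b,  ">=": lambda a, b: a >= b}
    return ops[op](record[attr_a], record[attr_b])

rec = {"width": 3.0, "height": 5.0, "shade": "dark"}
print(propositional_test(rec, "shade", "dark"))      # True
print(relational_test(rec, "width", "height", "<"))  # True
```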
Incorporating an additional rule or rule set to deal with exceptions (such as missing data) can easily be accomplished since some rules may be developed to predict multiple outcomes. Failure to allow for the missing data exception may easily misrepresent some of the associations between attributes.
Although a rule may have both high support and confidence, a subjective evaluation by the end-user may determine how interesting a newly discovered rule is (Groth, 2000). Some association rule software packages may be trained to automatically prune "uninteresting rules." Therefore, minimum values (breakpoints) must be established for both the confidence and support of newly discovered rules. In some instances, a hierarchy of rules can be developed so that some rules may imply other rules. In some cases, only the strongest rule is presented as a newly discovered rule and rules of "lesser strength" (support and confidence) are linked to the stronger rule for use at a later time (Han & Kamber, 2001). These resulting item sets may also be stored as metadata for further investigation as the process of final evaluation and pruning takes place by the end-user.
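The breakpoint-based pruning described above can be sketched as a simple filter; the rule records and threshold values are hypothetical.

```python
def prune_rules(rules, min_support, min_confidence):
    """Keep only rules that meet the minimum support and confidence breakpoints."""
    return [r for r in rules
            if r["support"] >= min_support and r["confidence"] >= min_confidence]

# Hypothetical discovered rules with their measured strengths.
discovered = [
    {"rule": "bread -> milk", "support": 0.6, "confidence": 0.75},
    {"rule": "eggs -> milk",  "support": 0.2, "confidence": 1.00},
]
print(prune_rules(discovered, min_support=0.3, min_confidence=0.7))
```

The second rule is pruned despite its perfect confidence because its support falls below the breakpoint; in the hierarchy the text describes, such "lesser strength" rules could instead be linked to a stronger rule and retained as metadata.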
The Impact on Neural Networks
Neural networks have been found to be both reliable and effective when applied to applications involving prediction, classification, and clustering (Adriaans & Zantinge, 1997). Even so, missing data has a similar impact on neural networks as it does on other types of classification algorithms, such as k-Nearest Neighbor: variance understatement, distribution distortion, and correlation depression all apply.
Since the internal weights used to calculate outputs are created and distributed within the network without providing the insight as to how a solution is developed, missing or dirty data can distort the weights that are assigned as the associations between nodes in a manner unknown to the research analyst.
Further, numeric outliers containing extremely large values tend to "swamp" attributes of lesser value, which can impact the correlations between both these and other attributes. These distorted weights can throw off the performance of the entire network while also falsely identifying some attributes as being "more important" than others (Han & Kamber, 2001).
Categorical attributes containing missing data may also be impacted negatively. Since data used as input to a neural network is usually massaged to values between 0 and 1, a categorical attribute containing five values (say, 0.0, 0.25, 0.50, 0.75, and 1.0) can be grossly misrepresented whenever missing data is heavy for that attribute (Berson, Smith, & Thearling, 2000). For instance, if our chemical breakdown analysis were to contain a categorical attribute that rated the breakdown on such a five-point scale, and that attribute contained a high degree of missing data, an overall rating value across multiple tests could easily be misclassified (when unmassaged) as "Good" or "Poor" rather than the perhaps more accurate rating of "Fair."
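This misclassification can be illustrated with a short sketch. The five-point scale labels and the default value chosen for missing entries are hypothetical; the point is only to show how defaulted missing data pulls an averaged rating away from "Fair."

```python
def unmassage(x, scale):
    """Map a massaged numeric score back to the nearest category label."""
    return min(scale, key=lambda label: abs(scale[label] - x))

# Hypothetical five-point scale massaged onto [0, 1].
scale = {"Very Poor": 0.0, "Poor": 0.25, "Fair": 0.5, "Good": 0.75, "Very Good": 1.0}

ratings = [0.5, 0.5, 0.5, 0.5]            # four tests, all genuinely "Fair"
with_missing = ratings + [1.0, 1.0, 1.0]  # three missing entries defaulted to 1.0

mean_clean = sum(ratings) / len(ratings)
mean_dirty = sum(with_missing) / len(with_missing)
print(unmassage(mean_clean, scale))  # "Fair"
print(unmassage(mean_dirty, scale))  # "Good" -- the defaults skew the average
```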
Another point regarding the issue of missing data and neural networks is that it may be necessary to "train" the initial network with missing data if the data to be tested and evaluated later is itself going to contain missing data. By training the network with only "clean" data, the internal weights developed using the training set cannot be accurately applied to the test set later.
In conjunction with the issue of training the model with a representative amount of missing data, the model must also be refreshed with a certain amount of missing data. This is done so that, as the model ages, it does not become insensitive to the fact that the application does, in fact, face missing data as input when the network is actually being utilized with fresh data.
These issues may be regarded as simply "housekeeping" functions when using neural networks. If an application involves the use of missing data as input, it only makes sense to both train and refresh the training set with a similar percentage of missing data.
How does missing data actually impact the internal execution of the neural network? The hidden layer is where the actual weights are developed for the network, while the activation function combines the inputs to the network into a single output (Westphal & Blaxton, 1998). The output remains low until the combined inputs reach a predetermined threshold, and small changes to the input can have a dramatic effect on the output (Groth, 2000). The activation function can therefore be very sensitive to missing data.
Let's take this a step further. The activation function of the basic unit of a neural network has two sub-functions: the combination function and the transfer function.
The combination function commonly uses the "standard weighted sum" (the summation of the input attribute values multiplied by the weights that have been assigned to those attributes) to calculate a value to be passed on to the transfer function.
The transfer function applies either a linear or non-linear function to the value passed to it by the combination function. Even though a linear function used in a feedforward neural network is simply performing a linear regression, missing values can distort the coefficients in the regression equation and therefore pass on invalid values as output (Berry & Linoff, 1997).
Between the linear and non-linear function types lies the most commonly used transfer function: the S-shaped sigmoid function. This function represents a gradual change from a linear model to a non-linear model and is the transfer function found in most "off the shelf" neural network packages. Since the result of the combination function is usually between -1 and 1, the function is deemed "near-linear" and satisfies most linear and non-linear applications. However, the impact of missing data on the sigmoid function goes back to how the combination function derives the value for the sigmoid function to act upon (Skapura, 1995).
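The combination and transfer functions described above can be sketched as follows. The weights and the 0.0 default substituted for the missing input are illustrative assumptions; the sketch only shows how a single defaulted input shifts the value the sigmoid acts upon, and hence the unit's output.

```python
import math

def combination(inputs, weights):
    """Standard weighted sum: each input multiplied by its assigned weight."""
    return sum(x * w for x, w in zip(inputs, weights))

def sigmoid(z):
    """S-shaped transfer function; near-linear for z close to 0."""
    return 1.0 / (1.0 + math.exp(-z))

weights = [0.8, -0.4, 0.6]
full_inputs = [0.5, 0.25, 0.75]     # all three attributes present
missing_inputs = [0.5, 0.0, 0.75]   # second attribute missing, defaulted to 0.0

# The distorted combination value propagates straight through the transfer function.
print(sigmoid(combination(full_inputs, weights)))
print(sigmoid(combination(missing_inputs, weights)))
```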