COOPERATIVE DATA MINING

Data mining concerns the automated discovery of patterns and relationships in data that are not immediately obvious. A number of data mining tools exist, including decision trees, rule induction programs, and neural networks (Han & Kamber, 2001). Each of these tools applies a specific algorithm to the data in order to build a model based on the statistically significant relationships found in the data.

Data mining practitioners have found that, when supplied with the same data repository, different data mining tools produce diverse results. This is due to the so-called inductive bias that each algorithm applies when constructing its model (Mitchell, 1997). That is, each data mining tool uses a different learning style, based on different statistical measures and criteria, when building its model. Combining the results of diverse data mining tools through fusion aims to produce results of higher quality than any single tool can achieve alone. The development of hybrid multi-strategy learning systems that incorporate more than one learning style is therefore an active area of research (Honavar, 1995; Lin & Hendler, 1995; Sun, 1995). In such a multi-strategy learning approach, the strengths of each technique are amplified, and its weaknesses are ameliorated.
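To make the effect of inductive bias concrete, the following sketch trains two learners with different biases, a decision tree and a neural network, on the same data and counts their disagreements. It assumes the scikit-learn library, which is not part of the CILT system; the point is only that identical data yields different models:

    # Minimal sketch (assuming scikit-learn): two learners with different
    # inductive biases, trained on the same data, disagree on some cases.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    net = MLPClassifier(max_iter=2000, random_state=0).fit(X_train, y_train)

    # The two models encode different hypotheses about the same data, so
    # their test-set predictions (and errors) differ.
    disagreements = int((tree.predict(X_test) != net.predict(X_test)).sum())
    print(tree.score(X_test, y_test), net.score(X_test, y_test), disagreements)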

In the CILT system, two or more diverse data mining techniques are combined into a multi-strategy learning system (Viktor, 1999). The system currently includes three data mining tools with different knowledge representations. The C4.5 tool constructs a decision tree using the information gain ratio criterion (Quinlan, 1994); this tree is pruned to produce a set of rules. The CN2 method induces rules from examples using the Laplace error estimate (Clark & Niblett, 1989). Third, the ANNSER learner creates rules from a trained neural network (Viktor, Engelbrecht, & Cloete, 1998). In addition, the rules of the domain expert are included through a personal assistant (PA) learner, which contains a knowledge base constructed after interviews with one or more domain experts (Viktor, 1999). That is, the user plays an active role in the data mining process and is able to make and implement decisions that affect its outcome (Hinke & Newman, 2001).
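The common currency between these tools is the if-then rule over attribute-value tests. As a point of reference for the discussion that follows, the sketch below shows one minimal way such rules could be represented; all names are illustrative, not taken from the CILT implementation:

    # Illustrative rule representation: a rule is a conjunction of
    # attribute-value tests predicting a concept (class label), and a
    # knowledge base is a disjunction (DNF) of such rules.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Test:
        attribute: str   # e.g., "petal-width"
        op: str          # ">" or "<"
        value: float

        def holds(self, example: dict) -> bool:
            v = example[self.attribute]
            return v > self.value if self.op == ">" else v < self.value

    @dataclass(frozen=True)
    class Rule:
        tests: frozenset  # conjunction of Test objects
        concept: str      # class label the rule predicts

        def covers(self, example: dict) -> bool:
            return all(t.holds(example) for t in self.tests)

    def rule_accuracy(rule: Rule, examples: list) -> float:
        # Fraction of covered examples whose label matches the concept.
        covered = [e for e in examples if rule.covers(e)]
        if not covered:
            return 0.0
        return sum(e["class"] == rule.concept for e in covered) / len(covered)

Under such a representation, the rule accuracy threshold used in Phase 2 below would simply be a cutoff applied to a measure like rule_accuracy to separate low- from high-quality rules.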

In the CILT system, the individual results are fused into one through a three-phase process, as follows (Viktor, 1999; Viktor, le Roux, & Paquet, 2001):

  • Phase 1: Individual learning. First, each individual data mining tool (or learner) uses a training set to form an initial representation of the problem at hand. In addition, the results of the user, a domain expert, are modeled as part of the system (Viktor, 1999). Each individual component's knowledge is stored in a separate knowledge base as a set of disjunctive normal form (DNF) rules.

  • Phase 2: Cooperative learning. During phase 2, the individual data mining tools share their knowledge with one another. A rule accuracy threshold is used to distinguish between low- and high-quality rules. Cooperative learning proceeds in four steps, as follows (Viktor, le Roux, & Paquet, 2001):

    1. Each data mining tool queries the knowledge bases of the others to obtain high-quality rules that it may have missed. These rules are compared with the rules contained in its own rule set. In this way, a NewRule list of relevant rules is produced.

    2. The NewRule list is compiled by identifying the following relationships between the rules, where R1 denotes a high-quality rule contained in the tool's own rule set and R2 denotes a high-quality rule created by another data mining tool (a code sketch of these three checks appears after the phase list below).

      • Similarities. Two rules, R1 and R2, are similar if two conditions hold: the training examples covered by rule R1 are a subset of those covered by R2, and the rules contain the same attributes with similar values. For example, the attribute-value test (petal-width > 49.0) is similar to the test (petal-width > 49.5). If R1 and R2 are similar, the learner has already acquired the knowledge contained in rule R2, and R2 should not be added to the NewRule list.

      • Overlapping rules. Two rules overlap when they contain one or more attribute-value tests that are the same. For example, rule R1 with attribute-value tests (petal-length > 7.5) and (petal-width < 47.5) and rule R2 with attribute-value tests (petal-length > 7.5) and (sepal-length < 35.5) overlap. Rule R2 is placed on the tool's NewRule list. Note that, for this example, a new rule R3 of the form (petal-length > 7.5) and (petal-width < 47.5) and (sepal-length < 35.5) may be formed. This rule represents a specialization that will usually be less accurate than the original, more general rules and will cover fewer cases. Such a specialization should be avoided, since it leads to overfitting.

      • Subsumption. A rule R2 subsumes another rule R1 if and only if they describe the same concept and the attribute-value tests of R2 form a subset of those of rule R1. In other words, rule R2 is more general than R1. If R2 subsumes rule R1, it is placed on the NewRule list.

    3. The rule combination procedure is executed. Here, the rules contained in the NewRule list are used to form new rules, as follows. The attribute-value tests of the rules in the NewRule list are combined with the attribute-value tests of the rules in the tool's rule set to form a new set of rules. Each of these new rules is evaluated against the test set. The new high-quality rules, which are dissimilar to, do not overlap with, and are not subsumed by existing rules, are retained on the NewRule list. These rules act as input to the data generation step.

    4. The data generator uses each of the rules in the NewRule list to generate a new set of training instances. The newly generated training instances are added to the original training set, and the learner reiterates the individual learning phase. In this way, a new training set that is biased towards the particular rule is generated. This process is constrained by ensuring that the distribution of the data contained in the original training set is maintained (an illustrative sketch of this step also appears after the phase list). Interested readers are referred to Viktor (1999) for a detailed description of this process.

    Steps 1 to 4 are reiterated until no new rules can be generated. Lastly, redundant rules are pruned using a reduced error pruning algorithm.

  • Phase 3: Knowledge fusion. Finally, the resulting knowledge, as contained in the individual knowledge bases, is fused into one. Again, redundant rules are pruned and a fused knowledge base that reflects the results of multi-strategy learning is created.
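The three rule relationships of step 2 translate directly into set operations over a rule's attribute-value tests and the examples it covers. The sketch below is a simplified reading rather than the published CILT definitions; the coverage check and the value tolerance, in particular, are assumptions. Tests are represented as (attribute, op, value) tuples:

    # Sketch of the step 2 relationship checks: similarity, overlap, and
    # subsumption. Representation and tolerance are assumptions.
    from typing import NamedTuple, FrozenSet, Tuple

    class Rule(NamedTuple):
        concept: str
        tests: FrozenSet[Tuple[str, str, float]]  # (attribute, op, value)

    def covered(rule: Rule, examples: list) -> set:
        # Indices of the examples covered by the rule's conjunction of tests.
        def holds(test, e):
            attr, op, val = test
            return e[attr] > val if op == ">" else e[attr] < val
        return {i for i, e in enumerate(examples)
                if all(holds(t, e) for t in rule.tests)}

    def subsumes(r2: Rule, r1: Rule) -> bool:
        # R2 subsumes R1: same concept, and R2's tests are a subset of R1's,
        # making R2 the more general rule.
        return r2.concept == r1.concept and r2.tests <= r1.tests

    def overlaps(r1: Rule, r2: Rule) -> bool:
        # Rules overlap when they share at least one attribute-value test.
        return bool(r1.tests & r2.tests)

    def similar(r1: Rule, r2: Rule, examples: list, tol: float = 1.0) -> bool:
        # R1 is similar to R2 if R1's coverage is a subset of R2's and the
        # rules test the same attributes with nearly equal values.
        if not covered(r1, examples) <= covered(r2, examples):
            return False
        if {(a, op) for a, op, _ in r1.tests} != {(a, op) for a, op, _ in r2.tests}:
            return False
        return all(any(a2 == a1 and op2 == op1 and abs(v2 - v1) <= tol
                       for a2, op2, v2 in r2.tests)
                   for a1, op1, v1 in r1.tests)

For instance, with the chapter's own example, Rule("Iris-virginica", frozenset({("petal-width", ">", 49.0)})) and its counterpart with threshold 49.5 test the same attribute with close values, so similar returns True on any training set where the first rule's coverage is contained in the second's, and the queried rule would then not be added to the NewRule list.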
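The chapter defers the details of the step 4 data generator to Viktor (1999). Purely as an illustration of the stated constraints, biasing the training set towards a rule while keeping every original example so that the original distribution remains represented, one simplified stand-in could be:

    # Hypothetical stand-in for the step 4 data generator, NOT the
    # published method: oversample the examples a new rule covers and
    # append them to the untouched original training set.
    import random

    def biased_training_set(covers, examples, n_new, seed=0):
        # covers: predicate returning True if the rule covers an example.
        rng = random.Random(seed)
        pool = [e for e in examples if covers(e)]
        if not pool:
            return list(examples)
        return list(examples) + [dict(rng.choice(pool)) for _ in range(n_new)]

The published generator creates new instances guided by the rule rather than merely duplicating existing ones; see Viktor (1999) for the actual procedure.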

Note that a detailed description of the cooperative learning approach falls beyond the scope of this chapter; interested readers are referred to Viktor (1999). This chapter concerns the visual representation of the cooperative data mining process and results, as well as the data itself, using visual data mining techniques, as discussed next.
