VISUAL DATA MINING

Data mining techniques, as discussed above, construct a model of the data through repetitive calculation to find statistically significant relationships within the data. However, the human visual perception system can detect patterns within the data that are unknown to a data mining tool (Johnson-Laird, 1993). The combination of the various strengths of the human visual system and data mining tools may subsequently lead to the discovery of novel insights and the improvement of the human's perspective of the problem at hand.

Data mining extracts information from a data repository of which the user may be unaware. Useful relationships between variables that are non-intuitive are the jewels that data mining hopes to locate. The aim of visual data mining techniques is thus to discover and extract implicit, useful knowledge from large data sets using data and/or knowledge visualization techniques. Visual data mining harnesses the power of the human vision system, making it an effective tool to comprehend data distribution, patterns, clusters, and outliers in data (Han & Kamber, 2001).

Visual data mining integrates data visualization and data mining and is thus closely related to computer graphics, multimedia systems, human-computer interfaces, pattern recognition, and high-performance computing. Since there are usually many ways to graphically represent a model, the visualizations used should be chosen to maximize their value to the user (Johnson-Laird, 1993). This requirement implies that we understand the user's needs and design the visualization with the end user in mind.

Note that, in order to ensure the success of visualization, the visual data mining process should be interactive. In interactive visual data mining, visualization tools can be used to help users make smart data mining decisions (Docherty & Beck, 2001; Han & Kamber, 2001). Here, the data distribution in a set of attributes is displayed using color sectors or columns, giving the user the ability to visually understand the data and to take part interactively in the mining process. In the CILT environment, the user participates (through the PA learner) in the cooperative process and is therefore able to validate the knowledge, as well as add his personal knowledge to the process.

The following observation is noteworthy. Visual data mining concerns both visualizing the data, and visualizing the results of data mining and the data mining process itself. In a cooperative data mining environment, as introduced in the last section, result visualization includes the interactive visualization of the results of multiple data mining techniques and the cooperation processes. Data visualization is important not only during data mining, but also during data preprocessing, as discussed next.

Data Visualization During Data Preprocessing

Data preprocessing is one of the most important aspects of any data mining exercise. According to Adriaans and Zantinge (1996), data preprocessing consumes 80% of the time of a typical, real-world data mining effort. Here, the "garbage-in, garbage-out" rule applies. According to a survey conducted by Redman (1996), a typical operational data repository contains 1% to 5% incorrect values. It follows that the poor quality of data may lead to nonsensical data mining results, which will subsequently have to be discarded. In addition, the implicit assumption that the data do in fact relate to the case study from which they were drawn and thus reflect the real world is often not tested (Pyle, 1999).

Figure 1 shows the different components of the knowledge discovery process, which includes the selection of appropriate tools, the interpretation of the results, and the actual data mining itself. Data preprocessing concerns the selection, evaluation, cleaning, enrichment, and transformation of the data (Adriaans & Zantinge, 1996; Han & Kamber, 2001; Pyle, 1999). Data preprocessing involves the following aspects:

  • Data cleaning is used to ensure that the data are of a high quality and contain no duplicate values. If the data set is large, random samples are obtained and analyzed. The data-cleaning process involves the detection and possible elimination of incorrect and missing values. Such values may have one of a number of causes. These causes include data capturing errors due to missing information, deliberate typing errors, and negligence. Moreover, end users may fraudulently supply misleading information.

  • Data integration. When integrating data, historic data and data referring to day-to-day operations are merged into a uniform format. For example, data from source A may include a "date of recording" field, whereas the data from source B implicitly refer to current operations.

  • Data selection involves the collection and selection of appropriate data. High-quality data collection requires care to minimize ambiguity, errors, and randomness in data. The data are collected to cover the widest range of the problem domain.

  • Data transformation involves transforming the original data set to the data representations of the individual data mining tools. Neural networks, for example, use numeric-valued attributes, while decision trees usually combine numeric and symbolic representations. Care should be taken to ensure that no information is lost during this coding process. (A brief sketch of these preprocessing steps follows this list.)
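
To make the above steps concrete, the following is a minimal Python sketch of cleaning, integration, and transformation using pandas. The two sources, their column names, and the numeric coding of the class are illustrative assumptions, not part of the chapter's case study.

    # Hypothetical sources: column names and values are illustrative only.
    import numpy as np
    import pandas as pd

    source_a = pd.DataFrame({"date_recorded": ["2002-05-01", "2002-05-02"],
                             "temperature": [71.0, np.nan],
                             "outcome": ["thunder", "no_thunder"]})
    source_b = pd.DataFrame({"temperature": [68.0, 74.0],
                             "outcome": ["no_thunder", "thunder"]})

    # Data integration: source B implicitly refers to current operations,
    # so give it an explicit "date_recorded" field before merging.
    source_b["date_recorded"] = pd.Timestamp.today().strftime("%Y-%m-%d")
    data = pd.concat([source_a, source_b], ignore_index=True)

    # Data cleaning: remove duplicates and detect missing values.
    data = data.drop_duplicates()
    print("missing values per column:\n", data.isna().sum())
    data = data.dropna()  # or impute, depending on the study

    # Data transformation: code the symbolic class numerically
    # (e.g., for a neural network that expects numeric-valued attributes).
    data["outcome_code"] = data["outcome"].map({"no_thunder": 0, "thunder": 1})
    print(data)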

Figure 1: Data preprocessing and data mining tasks [Adapted from Docherty & Beck, 2001].

Data visualization provides a powerful mechanism to aid the user during the important data preprocessing steps (Foong, 2001). Through the visualization of the original data, the user can browse to get a "feel" for the properties of that data. For example, large samples can be visualized and analyzed (Thearling et al., 2001). In particular, visualization can be used for outlier detection, which highlights surprises in the data, that is, data instances that do not comply with the general behavior or model of the data (Han & Kamber, 2001; Pyle, 1999). In addition, the user is aided in selecting the appropriate data through a visual interface. During data transformation, visualizing the data can help the user to ensure the correctness of the transformation. That is, the user may determine whether the two views (original versus transformed) of the data are equivalent. Visualization may also be used to assist users when integrating data sources, helping them to see relationships within the different formats.
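
As a brief illustration of visual outlier detection, the sketch below flags values lying more than three standard deviations from the mean and highlights them in a scatter plot. The synthetic "value" attribute and the three-sigma cut-off are assumptions made for this example, not recommendations from the cited authors.

    # Minimal sketch: highlight possible outliers in a single numeric attribute.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"value": np.concatenate([rng.normal(50, 5, 200),
                                                [95.0, 4.0]])})  # two "surprises"

    z = (df["value"] - df["value"].mean()) / df["value"].std()
    outlier = z.abs() > 3  # assumed cut-off

    plt.scatter(df.index[~outlier], df["value"][~outlier], s=10, label="typical")
    plt.scatter(df.index[outlier], df["value"][outlier], s=40, c="red",
                label="possible outlier")
    plt.xlabel("record number")
    plt.ylabel("value")
    plt.legend()
    plt.show()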

Data mining relies heavily on the training data, and it is important to understand the limitations of the original data repository. Visualizing the data preprocessing steps thus helps the user to place the appropriate amount of trust in the final model (Thearling et al., 2001).

The next section discusses data visualization techniques.

Data Visualization

According to Grinstein and Ward (2002), data visualization techniques can be classified according to three aspects: their focus, i.e., symbolic versus geometric; their stimulus (2D versus 3D); and whether the display is static or dynamic. In addition, data in a data repository can be viewed at different levels of granularity or abstraction, or as different combinations of attributes or dimensions. The data can be presented in various visual formats, including box plots, scatter plots, 3D-cubes, data distribution charts, curves, volume visualizations, surfaces, or link graphs, among others (Thearling et al., 2001).

Scatter plots refer to the visualization of data items according to two axes, namely X and Y values. The data are shown as points on this 2D coordinate plane, with possible extra information on each point, such as a name, a number, or even a color. 3D-cubes are used in relationship diagrams, where the data are compared as totals of different categories. According to Hoffman and Grinstein (2002), the scatter plot is probably the most popular visualization tool, since it can help find clusters, outliers, trends, and correlations. In surface charts, the data points are visualized by drawing a line between them; the area defined by the line, together with the lower portion of the chart, is subsequently filled. Link or line graphs display the relationships between data points by fitting a line connecting them (CAESAR Project, http://www.sae.org/technicalcommittees/caesumm.htm; NRC Cleopatra Anthropometric Search Engine, http://www.cleopatra.nrc.ca/; Paquet, Robinette & Rioux, 2000). They are normally used for 2D data where the X value is not repeated (Hoffman & Grinstein, 2001).
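
As a small illustration of two of these display formats, the sketch below draws a scatter plot and a filled surface (area) chart with matplotlib. The data are synthetic, and the layout choices are illustrative assumptions rather than prescriptions from the cited work.

    # Minimal sketch: a scatter plot and a surface (area) chart side by side.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 10, 40))
    y = 2.0 * x + rng.normal(0, 2, 40)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

    # Scatter plot: each record becomes a point; color can carry a third attribute.
    ax1.scatter(x, y, c=y, cmap="viridis", s=15)
    ax1.set_title("scatter plot")

    # Surface (area) chart: points joined by a line, with the area below filled.
    ax2.plot(x, y, color="black")
    ax2.fill_between(x, y, y.min(), alpha=0.4)
    ax2.set_title("surface chart")

    plt.tight_layout()
    plt.show()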

Note that advanced visualization techniques greatly expand the range of models that can be understood by domain experts, thereby easing the so-called accuracy-versus-understandability tradeoff (Singhal & Zyda, 1999). However, due to the so-called "curse of dimensionality," highly accurate models are usually less understandable, and vice versa. In a data mining system, the aim of data visualization is to obtain an initial understanding of the data and the quality thereof. The actual accurate assessment of the data and the discovery of new knowledge are the tasks of the data mining tools. Therefore, the visual display should preferably be highly understandable, possibly at the cost of accuracy.

Three components are essential for understanding a visual model of the data, namely representation, interaction and integration (Singhal et al., 1999):

  • Representation refers to the visual form in which the model appears. A high-quality representation displays the model in terms of visual components that are already familiar to the user.

  • Interaction refers to the ability to view the model "in action" in real time, which allows the user to play with the model. Examples are "what-if" analysis and forecasting based on the data and the business scenario.

  • Integration concerns the ability to display relationships between the model and alternative views of the data, thus providing the user with a holistic view of the data mining process.

A number of "rule of thumb" guidelines have to be kept in mind when developing and evaluating the data visualization techniques, including the following (Tufte, 1990): color should be used with care, and context-sensitive expectations and perceptual limitations should be kept in mind; intuitive mappings should be used as far as possible, keeping in mind non-intuitive mappings may reveal interesting features; the representation should be appealing to the eye, and there should be a balance between simplicity and pleasing color combinations while avoiding excessive texture variations and distracting flashes; and data distortion should be avoided and data should be scaled with care.

The use of one or more of the above-mentioned data visualization techniques thus helps the user to obtain an initial model of the data, in order to detect possible outliers and to obtain an intuitive assessment of the quality of the data used for data mining. The visualization of the data mining process and results is discussed next.

Processes and Result Visualization

According to Foster and Gee (2002), it is crucial to be aware of what users require for exploring data sets, small and large. The driving force behind visualizing data mining models can be broken down into two key areas, namely, understanding and trust (Singhal et al., 1999; Thearling et al., 2001). Understanding is undoubtedly the most fundamental motivation behind visualization. Understanding means more than just comprehension; it also involves context. If the user can understand what has been discovered in the context of the business issue, he will trust the data and the underlying model and thus put it to use. Visualizing a model also allows a user to discuss and explain the logic behind the model to others. In this way, the overall trust in the model increases and subsequent actions taken as a result are justifiable (Thearling et al., 2001).

According to Gershon and Eick (1995), the art of information visualization can be seen as the combination of three well-defined and understood disciplines, namely, cognitive science, graphic art, and information graphics. A number of important factors have to be kept in mind during process and result visualization, including the following: the visualization approach should provide an easy understanding of the domain knowledge, explore visual parameters, and produce useful outputs; salient features should be encoded graphically; and the interactive process should prove useful to the user.

As stated in a previous section, the CILT learning strategy involves the cooperation of two or more data mining tools. During the cooperative data mining effort, the data mining processes of both the individual tools and the cooperative learning process are visualized. This type of visualization presents the various processes of data mining. In this way, the user can determine how the data are extracted, from which data repository the data are extracted, as well as how the selected data are cleaned, integrated, preprocessed, and mined. Moreover, the visualization also indicates which method is selected for data mining, where the results are stored, and how they may be viewed.

The format of knowledge extracted during the mining process depends on the type of data mining task and its complexity. Examples include classification rules, association rules, temporal sequences, and causal graphs (Singhal et al., 1999). Visualization of these data mining results involves the presentation of the results or knowledge obtained from data mining in visual forms, such as decision trees, association rules, clusters, outliers, and generalized rules. An example is the Visual Query-Language-Based Control Interface, where the user is allowed to make queries based on visual inputs and to manipulate the visual representations, i.e., the system provides a visual query capability (Multiple Authors, 2000). The Silicon Graphics (SGI) MineSet 3.0 toolset, on the other hand, uses connectivity diagrams to visualize decision trees, and simple Bayesian and decision table classifiers (Carter & Hamilton, 1997; Han & Kamber, 2001; Thearling et al., 2001). Other examples include the Evidence Visualizer, which is used to visualize Bayesian classifiers (Becker, Kohavi, & Sommerfield, 2002); the DB-Discover system, which uses multi-attribute generalization to summarize data (Carter & Hamilton, 1997; Hilderman, Li, & Hamilton, 2002); and the NASD Regulation Advanced Detection System, which employs decision tree and association rule visualization for surveillance of the NASDAQ stock market (Senator, Goldberg, & Shyr, 2002).
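
For readers who want a concrete, if simplified, example of result visualization, the sketch below renders a decision tree learned from the Iris data using scikit-learn's plot_tree. It is a generic illustration only and does not reproduce any of the tools cited above.

    # Minimal sketch: visualize a decision tree (one of the result formats above).
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

    plt.figure(figsize=(10, 6))
    plot_tree(tree, feature_names=iris.feature_names,
              class_names=list(iris.target_names), filled=True)
    plt.show()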

In addition, the model-building process may be visualized. For example, the Angoss decision tree builder gives the user full control over the decision tree building process (http://www.angoss.com/). That is, the user is able to suggest splits, prune the tree or add his knowledge through the manual construction of the tree.

Alternatively, visualization of the constructs created by a data mining tool (e.g., rules, decision tree branches, etc.) and the data covered by them may be accomplished through the use of scatter plots and box plots. For example, scatter plots may be used to indicate the data points covered by a rule in one color and the points not covered in another color. This visualization method allows users to ask simple, intuitive questions interactively (Thearling et al., 2001). That is, the user is able to complete some form of "what-if" analysis. For example, consider a rule IF Temp > 70 THEN Thunder used on a Weather Prediction data repository. The user is subsequently able to see the effect when the rule's conditions are changed slightly, to IF Temp > 72 THEN Thunder, for instance.
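
A minimal sketch of this kind of "what-if" interaction follows: it counts how many records the rule covers, and how often thunder actually occurred among them, as the temperature threshold is moved from 70 to 72. The weather data frame, its column names, and the random values are hypothetical.

    # Minimal sketch: effect of shifting a rule's threshold on its coverage.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    weather = pd.DataFrame({"Temp": rng.uniform(50, 90, 500),
                            "Thunder": rng.integers(0, 2, 500).astype(bool)})

    def rule_coverage(threshold):
        """Records matching IF Temp > threshold, and the fraction with thunder."""
        covered = weather[weather["Temp"] > threshold]
        return len(covered), covered["Thunder"].mean()

    for t in (70, 72):
        n, acc = rule_coverage(t)
        print(f"IF Temp > {t} THEN Thunder: covers {n} records, accuracy {acc:.2f}")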

This section provides an overview of current techniques used to visualize the data mining process and its results. The next section discusses the implementation of the ViziMine tool.

Visual Data Mining with ViziMine

As stated in the previous section, the driving forces behind visual data mining are understanding and trust. ViziMine addresses the comprehensibility and trust issues in two ways: first, by visualizing the data, the cooperative data mining process, and the results of data mining in a meaningful way, and second, by allowing the user to participate in the cooperative data mining process, through manipulation of the data and the rules. In this way, the user is able to interact and participate in the data mining process. Because of the user's active participation and understanding of and trust in the data, the data mining process and its underlying model should improve. Therefore, the main aim of the ViziMine tool is to illustrate the cooperative data mining process and to provide a tool that can be used by the domain expert to incorporate his knowledge into the system. This type of visualization attempts to provide a greater understanding of the data and the cooperative data mining process (Docherty & Beck, 2001).

Data Visualization

Interaction is an essential part of a data exploration effort. The user should be able to interact with the data to discover information hidden in them (Cruz-Neira, Sandin & Defanti, 1993). Dynamically manipulating the data allows the user to "get a feel" for their dynamics and to test whether the data accurately reflect the business environment.

The current implementation of ViziMine provides an option to visualize the data by means of scatter diagrams. This graphical representation is simple enough to be easily understood, while being complete enough to reveal all the information present in the model. Experience shows that, even when illustrating simple relational data, the challenge is to navigate through these visualizations while staying focused on the objects of interest. As indicated in the previous section, this technique is popular due to its strength when attempting to locate clusters, outliers, trends, and correlations. Scatter plots work well for small data sets that contain a small number of input features and a few classes. With larger data sets, the use of colors allows the user to gain a good understanding of the data and to detect possible tendencies in the data.

For example, consider the well-known Iris benchmark data set. The problem concerns the classification of Irises into one of three classes, i.e., Setosa, Virginica, and Versicolor. The flowers are classified in terms of four inputs, namely, the sepal-width, sepal-length, petal-width, and petal-length. Figure 2 shows how two input values, namely petal-width and petal-length, are plotted in a Cartesian space. The figure illustrates the usefulness of scatter plots for identifying possible clusters of data points. This tool can convey enormous amounts of information in a compact representation. The user is able to view a data set by making a projection along any two input dimensions. The user uses this tool to obtain a general idea of the contents and quality of the data set.

Figure 2: Graphical representation of rules.

However, from a cooperative learning perspective, the main objective of the data visualization component is to visualize the portion of the data set covered by a particular rule, as described next. For example, Figure 2 depicts how a rule describing the Virginica class (henceforth referred to as Rule 1) is visualized by means of a scatter diagram. The diagram depicts the petal-width and petal-length input dimensions. The scatter diagram shows the Virginica examples covered by the rule (in yellow), together with those examples that were not covered (in black or red). The diagram indicates that the petal-lengths of most Virginica Irises are larger than 46.50. Note that any two of the inputs can be depicted on the X-axis or Y-axis. This may lead to the identification of new clusters that are not described in the rule. For example, a new cluster identifying a relationship between the sepal-widths and petal-widths of Virginicas may be identified, merely by changing the two inputs displayed. This information may then subsequently be used to form a new rule to be used for further training (Viktor, le Roux, & Paquet, 2001).

Visual Cooperative Learning

As stated previously, the CILT system currently incorporates the results of a number of data mining tools that are used for classification. The ViziMine tool imports the rules (or initial knowledge) produced by each tool as an ASCII file and subsequently represents the rules visually as part of the learning team. This allows the user to easily understand the rules, and provides an interface between the user and the data mining tool.
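
How such an ASCII rule file might be read into a simple internal structure is sketched below. The "IF ... THEN ..." line format, the file name, and the Rule class are assumptions made for illustration; they do not describe the actual CILT/ViziMine file format.

    # Minimal sketch: parse "IF <cond> AND <cond> THEN <class>" lines into a structure.
    import re
    from dataclasses import dataclass

    @dataclass
    class Rule:
        conditions: list  # e.g., [("petal-length", ">", 46.5)]
        conclusion: str   # e.g., "Virginica"

    def parse_rule(line):
        lhs, rhs = line.split("THEN")
        conditions = []
        for cond in lhs.replace("IF", "").split("AND"):
            m = re.match(r"\s*([\w-]+)\s*(<=|>=|<|>|=)\s*([\d.]+)", cond)
            if m:
                conditions.append((m.group(1), m.group(2), float(m.group(3))))
        return Rule(conditions, rhs.strip())

    with open("rules.txt") as f:  # hypothetical export from one of the learners
        rules = [parse_rule(line) for line in f if "THEN" in line]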

The ViziMine tool thus provides the user with a detailed description of the current state of each data mining tool. This interface also allows the user to incorporate his knowledge into the system by participating as a learning team member.

The visualization of, for example, a C4.5 decision tree provides a user with an initial understanding of the knowledge discovered. However, the ability to drag-and-drop the rules onto the data and see the impact immediately allows the user first to understand the rules, and then to experiment with various "what-if" analysis techniques. ViziMine allows the user to achieve this by illustrating the rules visually on selected data axes by means of 2D scatter plots. This is made possible by the easily comprehensible interface, where the user can select a rule, either directly from one of the learners or from the manipulated rules of the domain expert, and then drop it onto the data visualization. The portion of the data covered by the rule is subsequently shown on the data, as illustrated in Figure 2.

The data-items (inputs) of the class that is covered by the rule are displayed in color, while the data points of the complementary classes are displayed using round black dots. For example, in Figure 2, the data-items of Iris types Setosa and Versicolor are shown as round black dots. For the Virginica Irises, the data-items covered by Rule 1 are displayed as diamond-shaped yellow dots. The data-items of Virginicas that are not covered by Rule 1 are indicated through the use of red triangles. This information is used to assess the individual rule's accuracy and coverage. By interacting in this way, the user can understand the data that underlies a particular rule constructed by the data mining tool. That is, a clearer understanding of the knowledge discovered by the various mining tools that coexist in the cooperative learning environment is obtained.
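
The color scheme just described can be sketched as follows for the Iris example. Since Rule 1 is shown only graphically in the chapter, the petal-length condition used here (4.65 cm, i.e., 46.5 on the chapter's scale) is an assumption made for illustration.

    # Minimal sketch: color the data by class and by Rule 1 coverage, as in Figure 2.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    petal_length, petal_width = iris.data[:, 2], iris.data[:, 3]
    is_virginica = iris.target == 2
    covered = is_virginica & (petal_length > 4.65)  # assumed Rule 1 condition

    # Other classes: round black dots; covered Virginica: yellow diamonds;
    # uncovered Virginica: red triangles.
    plt.scatter(petal_length[~is_virginica], petal_width[~is_virginica],
                c="black", marker="o", label="Setosa / Versicolor")
    plt.scatter(petal_length[covered], petal_width[covered],
                c="yellow", marker="D", edgecolors="gray",
                label="Virginica covered by Rule 1")
    plt.scatter(petal_length[is_virginica & ~covered], petal_width[is_virginica & ~covered],
                c="red", marker="^", label="Virginica not covered")
    plt.xlabel("petal length")
    plt.ylabel("petal width")
    plt.legend()
    plt.show()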

The user is also able to participate in the learning process by manually combining parts of the rules. The Rule Builder models an individual data mining tool's participation during the cooperative learning phase, as discussed earlier in the chapter. Note that the learning team operates in one of two modes:

  1. In automatic mode, the cooperative learning phase is dynamically reiterated until no new rules can be created. That is, the rule combining, data generation, and rule pruning steps are completed without any feedback from the user. Here, the user acts as a spectator, viewing the data mining tools' cooperation. This is especially useful when the user wishes to trace the learning process. (A minimal sketch of this loop follows the list.)

  2. In manual mode, the user actively participates in the learning process. He monitors and guides the reiteration of new rules and the data generation process. Importantly, the user can promote the cooperation process by removing or adapting rules. In this way, the user guides the learning process by incorporating his domain knowledge into the system.
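
A minimal sketch of the automatic-mode loop is given below, under the assumption that rule combining, data generation, and rule pruning are supplied as placeholder functions; it is not the CILT implementation itself.

    # Minimal sketch: repeat combine / generate / prune until no new rules appear.
    def cooperative_learning(rules, data, combine, generate_data, prune):
        while True:
            new_rules = combine(rules, data)           # rule combining step
            data = generate_data(rules + new_rules)    # data generation step
            new_rules = prune(new_rules, data)         # rule pruning step
            if not set(new_rules) - set(rules):        # stop when nothing new is created
                return rules
            rules = list(set(rules) | set(new_rules))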

The Rule Builder interface is shown in Figure 3. The top part of the interface displays the rules as generated by the active learner. The current rule accuracy threshold, which is equal to the average rule accuracy, is displayed next. Recall that this value is used to distinguish between high- and low-quality rules. The window also displays the newly generated rules. These rules are generated using the rule combination algorithm described earlier. In this way, the visualization helps the user to easily understand the various rules. For an expert user, the tool also provides an additional function allowing the user to add his own rules to the combined knowledge.
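
The threshold computation just described can be sketched as follows; the rule names and accuracy values are made up for illustration.

    # Minimal sketch: the threshold is the average rule accuracy; rules at or
    # above it count as high quality, the rest as low quality.
    def split_by_quality(rule_accuracies):
        threshold = sum(rule_accuracies.values()) / len(rule_accuracies)
        high = {r: a for r, a in rule_accuracies.items() if a >= threshold}
        low = {r: a for r, a in rule_accuracies.items() if a < threshold}
        return threshold, high, low

    threshold, high, low = split_by_quality({"Rule 1": 0.92, "Rule 2": 0.64, "Rule 3": 0.81})
    print(f"threshold = {threshold:.2f}, high-quality rules: {sorted(high)}")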

Figure 3: Rule Builder interface.

As has been indicated, the three components essential for understanding a visual model are high-quality representation, real-time interaction, and model and data integration. The current implementation of the ViziMine system addresses these three requirements through the real-time integration of data visualization and data mining result visualization into a single view.

The next section discusses the use of three-dimensional visualization and virtual reality as a powerful tool for visualizing the data during the data preprocessing and mining processes.
