THREE-DIMENSIONAL VISUALIZATION, VIRTUAL REALITY AND DATA MINING

data mining: opportunities and challenges

Chapter III - Cooperative Learning and Virtual Reality-Based Visualization for Data Mining
Data Mining: Opportunities and Challenges
by John Wang (ed)
Idea Group Publishing 2003


	Brought to you by Team-Fly

Three-dimensional visualization has the potential to show far more information than two-dimensional visualization, while retaining its simplicity. Current and future research efforts will include the ability to model relationships between data in a three-dimensional rule grid. This visualization technique quickly reveals the quantity and relative strength of relationships between elements, helping to focus attention on important data entities and rules. It therefore aids both the data preprocessing and data mining processes.

Note that, although the data set and the data context determine the visualization technique, the size also plays a very important role. Unfortunately, the more data items on the visual graph, the more complex the graph becomes, and thus the user is faced with information overload or the so-called "curse of dimensionality."

The next section contains general considerations on Virtual Reality (VR) and data mining, but most of the conclusions can be applied to the ViziMine tool as well. The ViziMine tool relies on bidimensional visualization, but the importance of each rule can be better visualized if three-dimensional visualization is utilized.

Dimensionality Reduction

The curse of dimensionality is not restricted to data mining. This problem can arise in information retrieval, and Bayesian and multivariate analysis, to mention just a few. Many solutions have been designed, including the principal component analysis (or PCA). Under certain conditions, defined formally in Reinsel and Velu (1998), it is possible to drastically reduce the number of dimensions, while keeping most of the knowledge by performing a PCA on the data. The approach can be summarized as follows. The covariance matrix of the data is first computed, and then the corresponding Eigen values and vectors are evaluated. Because the covariance matrix is symmetric, the Eigen values and vectors can be easily calculated with a deterministic algorithm like the Jacobi method.

The amplitude of the Eigen values is representative of the importance of a particular dimension: the greater the value, the higher the importance. The corresponding Eigen vectors provide the basis that spans the subspace. It can be shown that the dimensions corresponding to the smallest Eigen values can be neglected. It can also be proved that the truncated Eigen decomposition is the decomposition that minimizes the quadratic error between the truncated decomposition and the real one.

The algorithm can be better understood by considering a bidimensional data distribution. Let us suppose that the data are distributed along a regression line. The Eigen vectors corresponding to this distribution are respectively oriented parallel and perpendicularly to the regression line; the biggest Eigen value corresponds to the Eigen vector oriented parallel to the regression line while the smallest Eigen value corresponds to the Eigen vector normal to the regression line. This particular case shows a fundamental limitation of the PCA method; i.e., in order to reduce the number of dimensions, the data must be correlated in the sense that a linear relationship must exist between them. Even if such a correlation is common in practice, that constraint constitutes an important limitation.

Many researchers have tried to overcome the limitation inherent in the linear relation by working on a generalization of the previous algorithm. A case of particular interest is the piecewise linear relation, which is an immediate generalization of the previous case. Instead of assuming a global linear relation, it is assumed that the domain can be divided into small domains to which a linear relationship applies. That means that the dimensions reduction is performed on a local basis and not on a global basis as in the previous case. More details about this method can be found in Chakrabarti and Menrotra (2000).

In addition to PCA, other methods like neural networks, clustorization, and latent semantic indexing can be utilized for dimension reduction. Clustorization allows representation of the data by a few archetypes corresponding to representative or typical data within the clusters. Neural networks reduce the data set to a limited number of classes. The classes can be determined in advance, as it is the case for a multilayer perceptron and radial basis neural networks, or determined by the network itself, as in the self-organizing Kohonen map.

Three-Dimensional Representation and Visualization of Data

Even if the number of dimensions can be reduced, it is well known that the number of remaining dimensions is much higher than three. Under these circumstances, it is legitimate to ask whether or not there is a clear advantage to increasing the number of dimensions from two to three. Doing so has many advantages. It is well known that complicated data sets are very difficult to visualize in two dimensions. There are many reasons for that. Let us look at three of them: the amount of information that can be displayed, the navigation through the data set, and the way data can be represented.

Let us suppose that N information elements can be displayed in each dimension. If the data are displayed in two dimensions with a volumetric display, it is theoretically possible to display up to N² information elements. If three dimensions are utilized, it is then possible to display up to N³ information elements simultaneously. In practice, the number of displayed information elements is much lower, but this simple example illustrates the fact that an increase in the number of dimensions dramatically increases the bandwidth of the display system.

In visual data mining (Fayyad, Grinstein, & Wierse, 2001), the way data are looked at or the point of view from which they are considered is very important. A pattern that can be evident from a certain point of view might be very difficult to see from a different point of view. Consequently, it is very important that the analyst is able to navigate through the data in order to determine the best localization for his analysis. It is well known that bidimensional representations cannot take into account more that one point of view. Of course, it is always possible to reprocess the data in order to show them from a different point of view, but in practice that is not convenient. First, it can take a lot of time to reprocess the data. Such delays can greatly hamper the analysis process. Second, it is very difficult to determine the right point of view if one cannot navigate through the data in a non-stepwise manner.

The above-mentioned problems can be overcome by using an additional dimension. Three-dimensional data can be viewed from any point of view just by navigating through them. If the system is properly designed, the navigation can be done in realtime. The determination of the right point of view is much easier because the user can walk or fly though the data and look at them from any suitable direction. These systems can be implemented using a VRML browser, a virtual reality environment, or any suitable graphical system.

Multimedia Standards

In order to represent the data, new multimedia alternatives are becoming available, including the X3D and MPEG-4 standards. X3D allows the analyst to define a graphical library adapted to the class of problem he is trying to solve. Not only is it possible to create templates, but also to define a grammar for a specific class of problems. Once the grammar has been defined, it can be reused transparently with various data sets. More details about X3D can be found at the Web3D Consortium web site (http://www.web3d.com/).

MPEG-4 is the first, if not the only, real multimedia format. It can transparently handle sound, images, videos, and 3-D objects, as well as events, synchronization, and scripting languages. MPEG-4 is specially adapted to represent very complex multimedia data sets and to facilitate the interaction between those data sets and the analysts. The data mining task could be further simplified by using MPEG-7. MPEG-7 is a multimedia description standard that can describe the content of any multimedia object. That means that MPEG-7 can provide the analyst with additional information. For a video, this information could be the duration, the title, and a trailer. Even if the MPEG-4 and MPEG-7 standards look very promising, it is too early to draw a conclusion about their usefulness in data mining. More details about MPEG-4 and MPEG-7 can be found at http://mpeg.telecomitalialab.com/.

Three-Dimensional Visualization Techniques

In two dimensions, data representation is limited to bidimensional graphical elements. In three dimensions, both two - and three-dimensional graphical elements can be utilized. These elements are much more numerous and diversified in three dimensions than in two. Furthermore, three-dimensional representations can be either volumetric or surface-based depending on whether the internal structure is of interest or not. A surface-based representation only takes into account the outer appearance or the shell of the object, while a volumetric approach assigns a value to each volume element. The latter approach is quite common in biomedical imagery like CAT scanning.

Many techniques are available to visualize data in three dimensions (Harris, 2000). Let us review a few of them. It is very common to represent data by glyphs (Hoffman & Grinstein, 2001; Fayyad, Grinstein, & Wierse, 2001). A glyph can be defined as a three - dimensional object suitable for representing data or subsets of data. The object is chosen in order to facilitate both the visualization and the data mining process. The glyph must be self-explanatory and unambiguous. Glyphs can have various attributes like their color and scale. Even if most glyphs are rigid objects, non-rigid and articulated objects can be used as well. It is then possible to use the deformation and the pose of the glyph to represent some specific behavior of the data set. Furthermore, glyphs can be animated in order to model some dynamic process. A scene is defined as the set of all glyphs and their surroundings, as explained by the following example.

Furniture Store Example

Assume that the original data come from a furniture store. The data concern a data mining effort initiated by the Orders Department. The aim of data mining is to determine the purchase patterns of customers indicating the popularity of items, e.g., of sofas. In this example, each sold category is represented by a glyph in the virtual scene. The color of the glyph represents the warehouse stock status: a warm color means that the item is a back order, a cold color means that the item is being overproduced, and a gray color corresponds to a normal inventory. The relative sizes indicate the current sales status. For example, a small sofa would indicate that the sales are too low, while a big sofa would indicate that the sofas are selling well on the market. If the sofa is much bigger than anything else, it means that the sales are unbalanced. The position of the glyphs in the virtual scene is related to the localization of the corresponding items in the store. From an analysis of the localization, it is possible to determine the disposition that maximizes the sales. In addition to the geometrical content, aural attributes can be added to the scene as well (Begault, 1994). These attributes can be utilized to attract the analyst's attention to some particular characteristics of the data or to bias the data mining process in a certain way. The sounds can signal points of interest or can be voice recordings of a previous analysis. When used in a moderate way, aural attributes do not interfere with the visual content and can provide additional information about the data. The overall efficiency of aural attributes can be enhanced if they are spatialized, i.e., they can only be heard at certain locations and within a certain spatial range. The scene can be illuminated in various ways. Depending on the angle of incidence, the number of luminaires, and their nature, it is possible to enhance or hide various aspects of the data. If many analysts are simultaneously working on the data set, the lighting can be used to help them to better visualize their mutual understanding of the data by enhancing features or patterns related to their respective understanding of the situation. The lighting does not need to be static; the dynamic nature of the data can be taken into account as well. A time-dependent lighting can facilitate understanding the evolution of data over time. The interaction of the lighting with the scene can be used to model the interaction of the data with an external agent.

Glyphs can be shown at various levels of detail (LOD). The LOD can be adjusted to provide the analyst with the right amount of information for the task he has to perform. The LOD can be related to the number of triangles used to represent the glyph in a surface - based representation or by the resolution of the basic volumetric element in a volumetric representation. A low LOD glyph can be seen as a sketch of the situation while a high LOD glyph corresponds to a more detail representation. It is important to point out that each glyph can have a different LOD in the scene.

The LOD can be increased by either adding more details to the structure of the glyph or by repeating the structure of the glyph when a scaling operation is performed; this is equivalent to using a fractal representation for the glyph. A fractal is an object that can repeat itself at different scales or LOD. The object does not necessarily repeat itself in an identical fashion at each scale; random variations are allowed. The fractal behavior is usually valid only on a certain range of scales. Fractal structures are very useful for representing data that have a high degree of auto-similarity at various levels, like organizational data. By varying the number of levels on which the symmetry is valid, it is possible to determine whether or not it is suitable to repeat a structure or a behavior within an organization.

Not only can three-dimensional representations be used to model data efficiently, but they can also be utilized to model interrelations. The fractal structure described above is an interesting example. It is also possible to utilize complicated graphs like the octree and the cone diagram, to mention but a few. Although these graphs can be very useful, they must be utilized with great care because they can rapidly become difficult to visualize. A possible solution to this problem is to transpose the data into another context. The basic idea is to map the data space, which can be very abstract, to a well - known problem that can be easily visualized and understood. That kind of paradigm is called a metaphor. The metaphor maps the problem space into a space familiar to the analyst called the "metaphoric space." There must be a one-to-one correspondence between the real space and the metaphoric space. The reason for this is that if a pattern is found in the metaphoric space, the transformation must be invertible in order to find the corresponding pattern in the real space.

Attempts have been made to use more than three geometric dimensions at once by continuously showing subsets of the higher dimension space;dimensions are sampled at random and displayed to the analyst at regular time intervals. Even though these systems are of great interest from a research point of view, they have proved to be very difficult to use, because the computing load is very high and can only be handled by high - end computers, and because it is very difficult for the analyst to make sense of all the subsets that are shown to him. It is clear that classical visualization paradigms are not suited to those systems. However, computers that are more powerful and new visualization paradigms should open the door to multidimensional data mining.

Virtual Reality and Data Mining

Three-dimensional visualization can be made more efficient by the use of virtual reality (VR) (Hoffman & Grinstein, 2001). It is commonly admitted that a virtual environment (VE) is a three-dimensional environment characterized by the fact that it is immersive, interactive, illustrative, and intuitive.

The fact that the environment is immersive is of great importance in data mining. In an image, one looks at the data from outside, while in a VR environment the user is part of the data world. This means that the user can utilize all his senses in order to navigate and understand the data. This also implies that the representation is more intuitive. VR is particularly well adapted to representing the scale and the topology of various sets of data. That becomes even more evident when stereo visualization is utilized. Stereo vision allows the analyst to have real depth perception. This depth perception is important to estimate the relative distances and scales between the glyphs. Such estimation can be difficult without stereo vision if the scene does not correspond to the paradigms our brain is used to processing. In certain cases, the depth perception can be enhanced by the use of metaphors.

If more than three dimensions are required, more dimensions can be added locally by inserting an image or a graph at a specific position in the VR world. Dimensions can also be added by modifying the lighting conditions. It is also possible to add more dimensions by using feedback or haptic devices (Burdea, 1996). These devices can be very useful if the user interacts with the data. The force applied by the haptic device to the user simulates the difficulty of performing a given operation on the data. Color and texture can be used as well to represent additional dimensions or to enhance specific features within the VR world.

Collaborative Virtual Environments (CVEs) (Churchill, Snowdon & Munro, 2001; Singhal et al., 1999) can be considered as a major breakthrough in data mining. By analogy, they can be considered the equivalent of collaborative agents in visualization. Traditionally, one or more analysts perform visualization at a unique site. This operational model does not reflect the fact that many enterprises are distributed worldwide and so are their operations, data, and specialists. It is consequently impossible for those enterprises to centralize all their data mining operations in a single center. Not only must they collaborate on the data mining process, which can be carried out automatically to a certain extent by distributed and collaborative agents, but they must also collaborate on visualization and on the visual data mining aspect.

CVE allows these enterprises to work on data originating from various sources and to analyze them simultaneously from various physical locations. Each location has its own VE. For most CVEs, the VEs do not need to be of the same type; one VE could be a workstation, another a tiled wall display (Cruz-Neira, Sandin & Defanti, 2000), and the third one could be a surround screen environment like a CAVE (Cruz-Neira et al., 1993). The VEs can exchange data, video, and sound. All data can be accessed and manipulated from all VEs simultaneously. If needed, a monitor can ensure that the VEs do not apply conflicting manipulations to the data.

Note that it is important to choose the VE that is best adapted to the data and their analysis. Graphical workstations are perfect for simple data sets analyzed by a few specialists. Tiled wall displays are made up of a set of flat panels with liquid crystal displays (LCDs) that are carefully aligned on a flat or a curved surface. They usually cover a very large area and consequently allow many specialists to work on the same data simultaneously. Surround screen environments are usually made up of many surfaces. Each surface is either a titled wall display or a projection surface. Surround screen environments are used to achieve an optimum level of immersion. For many technical reasons, the projector is based on digital micromirror device technology or DMD (Digital Light Processing, http://www.dlp.com/). Stereo vision can be achieved by alternatively projecting the left-eye and right-eye views and by synchronizing the rendering with shutter glasses. Stereo vision can also be achieved by simultaneously projecting polarized right-eye and left-eye views on the screen. In that case, passive polarized glasses replace the active synchronized glasses. It should be noted that a metallic screen must be utilized to preserve the polarization. The polarization system has a refreshing rate that is twice the refreshing rate of the corresponding shutter system. A high refreshing rate is suitable in order to avoid tiredness and VR-sickness. Since data mining usually involves long sessions, the polarization-based system is the most suitable.

Multimedia Data Mining and VR

Over the past decades, data mining has mostly been applied to alphanumerical data. Data mining is also applied to multimedia data, but most of the time the data mining process is restricted to the associated metadata and to the surrounding text. Multimedia objects contain a huge amount of information. Their description is directly related to their content and for that reason they are called "content-based." Most of the time, the descriptor is a feature vector that contains a set of abstract data representing the characteristics of interest.

Multimedia data mining is currently in its infancy. For that reason, the discussion will be limited to the description of the data mining of three-dimensional objects in the context of the CAESAR Project and of the National Research Council of Canada's (NRC) Cleopatra multimedia data mining system.

The CAESAR Project is an international project that involves the USA, Canada, Italy, and the Netherlands. The purpose of the project is to collect statistical and anthropometrical data about the worldwide population. The anthropometrical data are made up of various measurements performed on thousands of individuals, as well as of three-dimensional scans of their bodies. The statistical data contain information about their perceptions, consumer habits, and lifestyle. The CAESAR database is intended for utilization mainly by the apparel and the transportation industries. The apparel industry needs body archetypes to design clothes that fit the population, and the transportation industry needs archetypes to design car interiors and seats in planes that are suitable for its clients.

In this project, clustering and filtering is used as a method to group individuals within the population into clusters (Han & Kamber, 2001). That is, we use the clustering data mining technique to find similar individuals within the population, based on an archetype. An archetype is defined as a typical individual within a cluster. It should be noted that an archetype is usually not the center of the cluster; it is an individual that is statistically representative of the behavior of the population within the cluster. Note that the archetype must be a real individual belonging to the database. Average and median individuals usually lead to archetypes that do not exist at all in the general population.

The clustering method is based on the description of the individuals as contained in the Cleopatra system designed by the NRC. Cleopatra analyzes the body scans as contained in the CAESAR database and subsequently generates a shape descriptor for each one of them. The descriptor is an abstract and compact representation of the human body. Cleopatra then loads the shape descriptors, and the anthropometrical and statistical data into an Oracle8i database. An object-oriented-relational representation has been adopted for the data. The Cleopatra Search Engine can query the database by human shape, anthropometrical data, and statistical data.

Given a subject and some of his characteristics, like weight, the Cleopatra Search Engine retrieves similar bodies from the database. The alphanumerical data act as a filter; bodies that meet the filter requirements are accepted, while bodies that do not meet the requirements are rejected. Once the filtering process is completed, the search engine retrieves similar bodies based on their 3-D shape. The outcome of the retrieval operation is a cluster corresponding to the proposed subject. By reiterating the process, it is possible to validate the cluster and to find the right archetype. More details can be found in Paquet, Robinette, and Rioux (2000) and The Caesar Project (http://www.sae.org/technicalcommittees/caesumm.htm), and a demonstration of the system can be found at the NRC's web site(http://www.cleopatra.nrc.ca/). Once the cluster has been characterized, it can be visualized by using a VE; the archetype occupies the center of the visual field while the various individuals belonging to the cluster are distributed around the archetypes according to their similarity. In order not to bias the visualization process, the individuals can be subjected to a random motion on spherical trajectory of constant radius, the radius being related to the degree of similarity.


	Brought to you by Team-Fly