Review of Literature on Data Mining

Data mining is a broad area involving the discovery of different kinds of patterns within historical databases. In general, data mining involves the identification of patterns that are not easily obtained using traditional descriptive statistics such as the mean, median, mode and standard deviation. Much of the research in data mining relates to machine learning of rules and the observation of patterns in very large customer databases (Cheung et al., 1996) and to knowledge acquisition for various knowledge-based systems (Bhattacharyya & Pendharkar, 1998). We review the current literature on data mining in both of these areas.

Past research on identifying rules and patterns in very large databases has focused on learning association rules in transactional databases, clustering abstract objects and finding similarities among different sequences in a database. The majority of research on mining and learning association rules (Agrawal, Faloutsos, & Swami, 1993; Agrawal & Srikant, 1995; Chen et al., 1996; Han & Fu, 1995; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1995; Savasere, Omiecinski, & Navathe, 1995; Srikant & Agrawal, 1995) has concentrated on learning association rules of the form:

P1 ∧ P2 ∧ … ∧ Pn ⇒ Q1 ∧ Q2 ∧ … ∧ Qm

where Pi and Qj, for i ∈ [1, …, n] and j ∈ [1, …, m], are attribute values from the data sets. In general, the problem of mining association rules can be stated in the following form:

  • Let I = {i1, i2, …, in} represent the set of items in a database D, and let T represent a transaction consisting of a set of items, so that T ⊆ I. If X is a set of items, then we say that T contains X if and only if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The confidence factor, CF, of the association rule in database D is the percentage of the transactions in D containing X that also contain Y. The evidence E for the association rule in database D is the percentage of transactions in D that contain X ∪ Y. If ηminCF is the minimum confidence threshold and ηminE is the minimum evidence threshold, then the problem of mining association rules is to find all association rules whose confidence and evidence exceed the respective thresholds. (A small computational sketch of these two measures follows.)
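To make the two measures concrete, here is a minimal Python sketch (our own illustration; the transactions, item names and thresholds are hypothetical, not taken from the cited papers) that computes the evidence and confidence of a candidate rule X ⇒ Y:

```python
def rule_metrics(transactions, X, Y):
    """Evidence (share of transactions containing X ∪ Y) and confidence
    (share of X-containing transactions that also contain Y) of X => Y."""
    X, Y = set(X), set(Y)
    n_xy = sum(1 for t in transactions if X | Y <= t)
    n_x = sum(1 for t in transactions if X <= t)
    evidence = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return evidence, confidence

# Hypothetical market-basket data: does {bread} => {butter} clear
# thresholds of minE = 0.4 and minCF = 0.6?
db = [{"bread", "butter"}, {"bread", "milk"},
      {"bread", "butter", "milk"}, {"milk"}]
ev, cf = rule_metrics(db, {"bread"}, {"butter"})
print(ev, cf, ev >= 0.4 and cf >= 0.6)   # 0.5  0.666...  True
```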

Many algorithms have been proposed for mining association rules. Popular algorithms include Apriori (Agrawal & Srikant, 1994), DHP (Chen et al., 1996), PARTITION (Savasere et al., 1995) and DMA (Cheung et al., 1996), with Apriori being one of the most popular for mining association rules in a centralized database. In the Apriori algorithm, the entire database is scanned once and the large item sets are identified; the item sets are then arranged in ascending order of size. Let L1 be the set of large one-item sets generated by the first scan. A candidate set C2 of two-item sets is then created using L1 * L1, where * is a concatenation operation given by:

  • Lk * Lk = { X ∪ Y | X, Y ∈ Lk, |X ∩ Y| = k − 1 }
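In Python, this concatenation step can be sketched as follows (our own illustration, not code from the cited papers; a full implementation would also prune any candidate having a k-item subset that is not in Lk):

```python
def apriori_join(Lk):
    """Concatenation: join large k-item sets X, Y that share k-1 items,
    i.e. Lk * Lk = { X ∪ Y | X, Y ∈ Lk, |X ∩ Y| = k-1 }."""
    Lk = [frozenset(s) for s in Lk]
    out = set()
    for i, X in enumerate(Lk):
        for Y in Lk[i + 1:]:
            if len(X & Y) == len(X) - 1:   # |X ∩ Y| = k - 1
                out.add(X | Y)             # (k+1)-item candidate
    return out

# L2 = {AB, AC, BC} yields the single three-item candidate {A, B, C}
print(apriori_join([{"A", "B"}, {"A", "C"}, {"B", "C"}]))
```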

Agrawal and Srikant (1994) describe the Apriori algorithm with an easy-to-follow example. DHP and PARTITION are extensions of the Apriori algorithm that make it more computationally efficient. The DHP algorithm uses hash tables to prune candidates during the various scan iterations (Cheung et al., 1996), while the PARTITION algorithm divides a database into partitions such that each can be processed effectively (Savasere et al., 1995). While the Apriori, DHP and PARTITION algorithms were designed for centralized databases, real-life transactional data is often stored in distributed databases. To address this, Cheung et al. (1996) proposed the distributed mining of association rules (DMA) algorithm for mining association rules in a distributed database environment. DMA extends the original Apriori algorithm to the distributed setting: a small number of candidate sets are generated, and messages are exchanged between the various sites of the distributed database to mine the association rules.
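The observation underlying PARTITION can likewise be sketched briefly (a toy illustration under our own assumptions: the database is an in-memory list of item sets, and a brute-force miner stands in for the per-partition Apriori pass of the published algorithm). Any itemset that is large in the whole database must be large in at least one partition, so the union of the locally large itemsets forms a complete candidate set that one final scan can verify:

```python
from itertools import combinations

def local_large(part, min_sup):
    """Brute-force all itemsets meeting min_sup within one partition
    (a stand-in for a real per-partition Apriori pass; tiny data only)."""
    counts = {}
    for t in part:
        for k in range(1, len(t) + 1):
            for c in combinations(sorted(t), k):
                counts[frozenset(c)] = counts.get(frozenset(c), 0) + 1
    return {s for s, n in counts.items() if n / len(part) >= min_sup}

def partition_mine(db, n_parts, min_sup):
    """Union of locally large itemsets is a complete global candidate set;
    one final scan over db verifies true global support."""
    size = -(-len(db) // n_parts)                      # ceiling division
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    candidates = set().union(*(local_large(p, min_sup) for p in parts))
    return {c for c in candidates
            if sum(1 for t in db if c <= t) / len(db) >= min_sup}
```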

Other research on machine learning of rules and patterns in very large databases has focused on cluster analysis and pattern-based similarity search. Cluster analysis is the grouping of abstract objects into groups of similar objects; in a large-database context, it provides a way to divide a large database into smaller, similar components. Cluster analysis has been studied in the statistics (Cheeseman & Stutz, 1996; Jain & Dubes, 1988), machine learning (Fisher, 1987, 1995) and data mining literature (Ester, Kriegel, & Xu, 1995; Ng & Han, 1994; Weiss & Kapouleas, 1989). In statistics, Bayesian classification approaches have been used for clustering (Cheeseman & Stutz, 1996). In machine learning, various unsupervised learning approaches have been used; these differ from the statistical approaches in that the distance-based measures used in statistics are replaced by measures of similarity between objects (Chen et al., 1996). Other approaches use conceptual clustering methods based on probability analysis. The probability-analysis approaches, however, assume that the probability distributions of the attributes are independent of each other (Fisher, 1987, 1995), an assumption challenged by researchers who observe that correlations between attributes often exist (Cheung et al., 1996).

Pattern-based similarity approaches have been used for matching sequences in temporal databases (Agrawal et al., 1993; Agrawal et al., 1995; Faloutsos & Lin, 1995; Faloutsos, Ranganathan, & Manolopoulos, 1994; Li, Yu, & Castelli, 1996; Mannila et al., 1994). Two types of similarity search queries support data mining operations. The first, the object similarity query, searches for the collection of objects that lie within a user-defined distance of the queried object. The second, the all-pair similarity query, finds all pairs of elements that are separated by at most a user-specified distance (Chen, Han, & Yu, 1996; Chen, Park, & Yu, 1996). The similarity measures used fall into two categories: Euclidean distance measures (Faloutsos & Lin, 1995; Faloutsos et al., 1994) and correlation-based measures (Li et al., 1996). A sketch of both query types appears below.
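The two query types can be sketched in a few lines of Python (our own illustration using Euclidean distance; the function names, data and eps parameter are hypothetical):

```python
import math

def object_similarity_query(objects, query, eps):
    """First query type: all objects within distance eps of the query object."""
    return [o for o in objects if math.dist(o, query) <= eps]

def all_pair_similarity_query(objects, eps):
    """Second query type: all pairs of objects at most eps apart."""
    return [(a, b) for i, a in enumerate(objects)
            for b in objects[i + 1:] if math.dist(a, b) <= eps]

seqs = [(0.0, 1.0), (0.1, 1.1), (5.0, 5.0)]
print(object_similarity_query(seqs, (0.0, 1.0), eps=0.5))  # first two points
print(all_pair_similarity_query(seqs, eps=0.5))            # one close pair
```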

The research on data mining of large databases described above has focused primarily on developing new algorithms and improving the performance of existing ones, with emphasis on mining association rules, aggregating data into clusters and searching for relevant concepts in large databases using similarity search techniques.

A second stream of researchers has used statistical methods, case-based reasoning and machine learning concepts for mining decision rules in order to acquire knowledge for expert systems. Unlike the data mining research on large databases, this work used relatively small databases containing expert decisions. The problem addressed in the majority of this research was that of mining and developing models for classification/discrimination and forecasting.

Statistical techniques used for classification problems include Fisher's (1936) linear discriminant analysis (LDA) as well as quadratic and logistic discriminant analysis. The techniques differ in their assumptions about group distributions and in the functional forms of the discriminant function. Linear models, given the ease of interpreting their results and their reliability, were generally preferred for decision making (Hand, 1981); non-linear models, though more accurate on training data, tended to show sharp declines in performance on unseen test samples (Altman, Eisenbeis, & Sinkey, 1981). The major drawback of statistical methods is that real-world data often do not satisfy their parametric distributional assumptions. Nonparametric methods relax these assumptions and are less restrictive; among the popular nonparametric methods are the k-nearest neighbor and linear programming methods (Freed & Glover, 1981; Hand, 1981). A minimal sketch of the k-nearest neighbor rule appears below.
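As an illustration of a nonparametric classifier, here is a minimal k-nearest-neighbor sketch (our own illustration; the training pairs are hypothetical):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify by majority vote among the k training points nearest
    (Euclidean) to the query; no parametric distribution assumptions."""
    nearest = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical training pairs of (feature vector, class label)
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.1), "B"), ((0.9, 1.0), "B")]
print(knn_classify(train, (0.2, 0.1)))   # -> "A"
```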

Machine learning techniques used for classification fall into two categories: (1) connectionist models employing some form of neural network learning algorithm, and (2) inductive learning models in which the discriminant function is expressed in symbolic form using IF-THEN rules or a decision tree. The back-propagation algorithm (Rumelhart, Hinton, & Williams, 1986) was the most commonly used learning algorithm for connectionist schemes. Several induction algorithms have been suggested for classification; among the popular ones are CART (Breiman, Friedman, Olshen, & Stone, 1984), ID3 (Quinlan, 1993) and CN2 (Clark & Niblett, 1989). A third set of techniques that has recently received attention for classification problems is based on the evolutionary computation paradigm, which includes genetic algorithms and genetic programming (Bhattacharyya & Pendharkar, 1998; Koehler, 1991).

This wide variety of approaches makes selecting the technique that "best" matches a given problem a difficult task. The literature includes a number of studies comparing the performance of machine learning and statistical approaches (Atlas, Cole, Connor, El-Sharkawi, Marks, Muthusamy, & Barnard, 1990; Chung & Silver, 1992; Fisher & McKusick, 1989; Shavlik, Mooney, & Towell, 1991; Wolpert & Macready, 1995). Reviewing a number of comparative studies of symbolic and neural network methods, Quinlan (1986) emphasized that no single method uniformly demonstrates superior performance.

The "No Free Lunch" (NFL) theorems on search (Wolpert & Macready, 1995) point out that "positive performance in some learning situations must be balanced by negative performance in others"; that is, search algorithms perform the same when their performance is averaged over a sufficiently large set of problems. The NFL results emphasize that conclusions about a technique's performance can be drawn only with respect to the specific problem type being examined; a technique should therefore be chosen with an understanding of its strengths and limitations. Bhattacharyya and Pendharkar (1998) performed several experiments on simulated data to compare the performance of statistical discriminant analysis, genetic algorithms, C4.5, genetic programming and neural networks for classification. Their results indicate that a technique's performance depends on input data distribution characteristics, including group variance heterogeneity, distribution kurtosis and normality.

Recently, a few researchers have investigated the semantics of misclassification. Mannino and Koushik (1996) looked into the inverse classification problem: identifying cost-minimizing changes to the independent variables so that a case can be classified into a different, preferred class. Park (1997) incorporated Type I and Type II error cost estimates into binary tree and linear discriminant analysis classification approaches to find a total-cost-minimizing classification system. Murphy and Benaroch (1997) used classification cost for selecting attributes in a rule-based induction system. Mookerjee and Dos Santos (1993) and Nunez (1991) used cost minimization and value maximization approaches for inductive expert systems. Pendharkar and Kumar (1998) used data envelopment analysis to estimate the marginal costs of predictor attributes in case-based expert systems for classification, and cost estimates have also been used to redesign case retrieval in case-based systems (Mookerjee & Mannino, 1997). Commercial systems such as DecileMax® (Bhattacharyya, Ratner, & Omac, 1995) incorporate cost-based estimates in a linear genetic-algorithm-based classifier. The cost-minimizing classification problem differs from minimizing the number of misclassifications: it allows changing organizational costs to be incorporated into a classifier that minimizes misclassification cost rather than misclassification rate. A few researchers have argued that minimizing the misclassification rate is not a good criterion for multistep induction systems (Breiman, Friedman, Olshen, & Stone, 1984; Murphy & Benaroch, 1997). The sketch below illustrates this distinction.
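For instance (a minimal sketch under our own assumptions, not code from any of the cited systems): given scores from a binary classifier, one can choose the cutoff that minimizes total cost under asymmetric Type I (false positive) and Type II (false negative) costs rather than the cutoff that minimizes the raw error count:

```python
def min_cost_cutoff(scores, labels, c_fp, c_fn):
    """Pick the score cutoff minimizing total misclassification cost with
    asymmetric Type I (false positive) and Type II (false negative) costs."""
    best_t, best_cost = None, float("inf")
    for t in sorted(set(scores)):
        cost = sum(c_fp if s >= t and y == 0 else      # false positive
                   c_fn if s < t and y == 1 else 0     # false negative
                   for s, y in zip(scores, labels))
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Hypothetical scores/labels; a false negative costs 5x a false positive
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]
print(min_cost_cutoff(scores, labels, c_fp=1.0, c_fn=5.0))  # (0.35, 1.0)
```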





