7.9 The Decision Tree Tools



Most machine-learning-based software products can generate decision trees or IF/THEN rules, and some produce both. The following software systems primarily produce decision trees. For further information on these products, visit their Web sites for white papers, demos, screenshots, and evaluation copies of their software.

AC2

http://www.alice-soft.com

AC2 provides graphical tools for both data preparation and building decision trees. AC2 uses a proprietary machine-learning algorithm to test all possible combinations in a data set, segmenting it to discover the optimum criteria for converging on a selected variable (e.g., fraud versus legal). It ranks all relevant criteria along a graphical decision tree. An Example Editor allows the user to evaluate single data objects more closely. This software product is from France but is available in the United States from its producer, ISoft (see Figure 7.12).

Figure 7.12: Alice decision tree interface.

Alice d'Isoft 6.0, a streamlined version of ISoft's decision-tree-based AC2 data mining product, is designed for individuals who are new to data mining (see Figure 7.13).

Figure 7.13: Alice d'Isoft 6.0 decision tree output.

Attar XpertRule

http://www.attar.com/

XpertRule Miner provides graphical binary decision trees (two-branch splits). The software is highly flexible and scalable, supporting workstation or server deployment. It can generate several types of code from the trees it produces, such as COM+, Java, or XML for data exchange. Attar is a software firm from the United Kingdom, but all of its products have U.S. distributors.

Business Miner

http://www.businessobjects.com

Business Miner allows the user to build decision trees interactively with just a few mouse clicks, revealing the trends and relationships in any data set (see Figure 7.14). There is no complex algorithm tuning and no confusing technical terminology. The tool features a familiar, easy-to-use interface. Designed for the business analyst, it can be adapted for criminal investigation applications.

Figure 7.14: Business Miner decision tree interface.

C5.0

http://www.rulequest.com/

C5.0 constructs classifiers in the form of both decision trees and rule sets. C5.0 includes the latest innovations, such as boosting for discovering patterns that delineate categories, assembling them into classifiers, and using them to make predictions. C5.0 has been designed to analyze databases containing hundreds of thousands of records and tens to hundreds of numeric or nominal fields. To maximize interpretability, C5.0 classifiers are expressed as decision trees or sets of IF/THEN rules, formats that are generally easier to understand than neural networks. C5.0 is available from its creator at the above site; it is also licensed to other vendors, such as SPSS, whose Clementine data mining workbench is discussed later in this chapter.
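To make the rule-set format concrete, here is a minimal sketch (the rules, fields, and thresholds are invented, not C5.0 output) of how an IF/THEN classifier of the kind C5.0 emits can be applied: the first rule whose condition fires assigns the class, with a default class as the fallback.

```python
# Invented example rules in the IF/THEN style produced by
# C5.0-type tools; the conditions and classes are made up.
RULES = [
    (lambda r: r["amount"] > 5000 and r["country"] != "US", "fraud"),
    (lambda r: r["age"] <= 21 and r["amount"] > 1000, "review"),
]
DEFAULT_CLASS = "legal"

def classify(record):
    """Return the class of the first rule that fires, else the default."""
    for condition, label in RULES:
        if condition(record):
            return label
    return DEFAULT_CLASS

print(classify({"amount": 9000, "country": "NG", "age": 40}))  # → fraud
print(classify({"amount": 200, "country": "US", "age": 30}))   # → legal
```

Because the rules are read in order and the default catches everything else, a rule set like this is easy to audit, which is the interpretability advantage the paragraph above describes.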

CART

http://www.salford-systems.com

One of the most powerful data mining algorithms in the marketplace is CART, licensed and offered by Salford Systems. The algorithm was developed and refined over the years by several renowned statisticians, most notably Dr. Jerome H. Friedman of Stanford University. Like C5.0, CART can generate trees or IF/THEN rules. Its decision trees are exclusively binary: each node is split into two child nodes by posing a question with a "yes" or "no" answer. For example: Is age <= 55? Is credit score <= 600? Is fraud <= 234?

How does CART come up with the candidate splits it uses to generate its rules? Its method is to look at all possible splits for all variables included in the analysis. For example, consider a data set with 215 cases and 19 variables. CART considers up to 215 splits for each of the 19 variables, for a total of up to 215 × 19 = 4,085 possible splits. Any problem has a finite number of candidate splits, and CART conducts a brute-force search through them all.
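The brute-force search can be sketched in a few lines of Python (an illustration of the idea, not Salford's implementation): for every variable, every observed value is tried as a threshold, and the split with the lowest weighted Gini impurity, CART's default splitting criterion, wins.

```python
# Illustrative CART-style exhaustive split search (not Salford's code):
# every variable x every observed value is a candidate threshold.
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (impurity, variable index, threshold) of the best binary split."""
    best = None
    n_vars = len(rows[0])
    for var in range(n_vars):
        for threshold in sorted({row[var] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[var] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[var] > threshold]
            if not left or not right:
                continue  # a split must send cases to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, var, threshold)
    return best

# Toy data: two variables (age, credit score) and a binary fraud label.
rows = [(25, 700), (40, 580), (60, 640), (33, 610), (58, 720)]
labels = ["legal", "fraud", "legal", "fraud", "legal"]
print(best_split(rows, labels))  # → (0.0, 1, 610): credit score <= 610 separates perfectly
```

With 215 cases and 19 variables, the two loops above examine at most 215 × 19 = 4,085 candidate splits, which is why the search is feasible but computationally demanding.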

The CART algorithm has won data mining contests for its accuracy; however, its brute-force search approach makes it extremely demanding of computing power. One disadvantage of CART is that it works only with numeric data. Categorical data, such as male or female, must be converted into something like male = 0 and female = 1, or high, medium, and low into high = 1, medium = 2, and low = 3. This requires some additional data preparation and rule deployment processing.
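A minimal sketch of this encoding step (the field names and code values are illustrative, not part of CART):

```python
# Hypothetical preprocessing step: CART needs numeric inputs, so
# categorical fields are mapped to numeric codes before modeling.
GENDER_CODES = {"male": 0, "female": 1}
RISK_CODES = {"high": 1, "medium": 2, "low": 3}

def encode(record):
    """Return a copy of the record with categories replaced by codes."""
    out = dict(record)
    out["gender"] = GENDER_CODES[record["gender"]]
    out["risk"] = RISK_CODES[record["risk"]]
    return out

print(encode({"gender": "female", "risk": "medium", "amount": 950}))
# → {'gender': 1, 'risk': 2, 'amount': 950}
```

The same code tables must be applied again at deployment time, which is the extra rule-deployment processing the paragraph above mentions.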

CART ships with DBMS/COPY, a data management utility that imports data from virtually any format; once loaded, the data appears in the model setup screen. From there the user selects the target variable (the dependent field) that will be used to segment the data, along with the independent variables (see Figure 7.15).

Figure 7.15: This is the CART interface for model setup.

Once the variables have been selected, tabs along the model setup screen control further options; for example, the data can be split for training and testing either at random or by partitioning according to a percentage of the file. CART then generates its binary decision trees (see Figure 7.16).
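The two partitioning options can be sketched as follows (the function names and the 70/30 default are illustrative, not CART's actual settings):

```python
# Illustrative train/test partitioning: a random split versus a
# fixed-percentage cut of the file, as described above.
import random

def split_random(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records, then cut at the given fraction."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def split_by_percentage(records, train_fraction=0.7):
    """Take the first N% of the file, in order, for training."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]

train, test = split_by_percentage(list(range(10)))
print(len(train), len(test))  # → 7 3
```

A random split guards against ordering effects in the file (e.g., records sorted by date), while a percentage cut is reproducible without a seed; the choice matters for how honest the test-set accuracy estimate is.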

Figure 7.16: The CART binary trees.

Once the decision trees have been created by CART, an assortment of charts and reports on the results of the analysis is available via tabs (see Figure 7.17).

Figure 7.17: Lift charts for each class from the decision trees can be viewed.

The variables that are most important in the construction of a decision tree can also be prioritized and viewed (see Figure 7.18).

Figure 7.18: This instrument displays the variables of most importance.

The tool can also report on the expected accuracy of the model, both on the training and test data sets (see Figure 7.19).

Figure 7.19: The rates of prediction for training and testing classes can be viewed.

Rules can also be generated by CART directly from the data; the rules that CART generates and exports are in C programming-language format (see Figure 7.20).

Figure 7.20: Sample of CART rules.

Cognos Scenario

http://www.cognos.com/products/scenario/index.html

Cognos Scenario allows the user to quickly identify and rank the factors that have a significant impact on a database. The tool is specifically designed to spot patterns and exceptions in data and allows users to visualize the information being uncovered in graphs or classification trees. Scenario can also identify data variables that are not a factor for predicting such outcomes as fraud. The tool can also highlight data values that are unexpected and possibly incorrect, again a feature important for predicting fraud and flagging data outliers. As with other software tools in this category, Scenario can be used to quickly and easily perform classification, segmentation, profiling, and outlier detection; it can also identify data points that are out of range, a key process in such criminal detection analyses as money laundering. The user can drill down within factors to build a profile or choose from different views of the data: graphs, classification trees, etc. Query and reporting can be integrated with Cognos's other online analytic processing (OLAP) tools, Impromptu and PowerPlay.
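Cognos does not publish Scenario's internals, but the kind of out-of-range check described above can be illustrated with a simple standard-deviation screen (the cutoff and the data are invented):

```python
# Generic out-of-range screen (illustrative only; not Cognos's
# algorithm): flag values far from the mean in standard-deviation
# units, a common first pass over transaction amounts.
import statistics

def outliers(values, z_cutoff=2.0):
    """Return values more than z_cutoff population standard
    deviations from the mean (a deliberately loose cutoff,
    since a large outlier inflates the deviation itself)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > z_cutoff * stdev]

amounts = [120, 95, 110, 130, 105, 9_800]  # one wildly atypical transfer
print(outliers(amounts))  # → [9800]
```

Real screening tools use more robust statistics (and domain rules) than this, but the principle, scoring each value against the distribution of its field, is the same.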

Neusciences aXi Decision Tree

http://www.neusciences.com

Neusciences aXi DecisionTree uses ActiveX Controls for building decision trees. It can work with discrete and continuous data variables and can extract rules directly from the tree. The ActiveX component makes it easy to embed into other applications. It has four data preprocessing options, allowing the user to select the best one for his or her data. aXi has three different decision tree implementations and allows for the pruning and removal of unnecessary rules, which can be extracted in text format.

SPSS Answer Tree

http://www.spss.com/spssbi/answertree/

SPSS offers AnswerTree, an easy-to-use package that uses the CHAID algorithm as its decision tree engine and includes decision tree export in XML format. SPSS also offers a decision tree component, based on the C5.0 machine-learning algorithm, in its data mining suite, Clementine.

Free Trees

There are also some free decision tree software tools. Links to them may be found in the data mining portal kdnuggets.com (Knowledge Discovery Nuggets). They include the following:

  • C4.5: the "classic" decision tree tool, developed by J. R. Quinlan. Available with restricted distribution from the master himself.

    http://www.cse.unsw.edu.au/~quinlan

  • EC4.5: a more efficient version of C4.5 (the predecessor of C5.0) that uses the best among three strategies at each node construction. Made available by the University of Pisa.

    http://www-kdd.cnuce.cnr.it

  • IND: Gini- and C4.5-style decision trees and more. It is publicly available from NASA with some export restrictions.

    http://ic-www.arc.nasa.gov/ic/projects/bayes-group/ind/IND-program.html

  • LMDT: builds linear machine decision trees. Made available by Purdue University.

    http://mow.ecn.purdue.edu/~brodley/software/lmdt.html

  • OC1: decision tree system with continuous feature values; builds decision trees with linear combinations of attributes at each internal node. Made available by Johns Hopkins University.

    http://www.cs.jhu.edu/~salzberg/announce-oc1.html

  • PC4.5: a parallel version of C4.5 built with the Persistent Linda (PLinda) system. Made available by New York University.

    http://cs1.cs.nyu.edu/~binli/pc4.5

  • PLUS: constructs polytomous logistic regression trees with unbiased splits.

    http://www.recursive-partitioning.com/plus




Investigative Data Mining for Security and Criminal Detection
ISBN: 0750676132
Year: 2005
Pages: 232
Authors: Jesus Mena