4 Approaches to Data Mining

In this chapter, we'll see how Microsoft Analysis Services is used to implement decision tree models. As we go over the steps in this process, I'll explain the statistical principles behind the algorithm and how the model is used to make predictions.

Decision trees are a well-known technique used in one form or another by almost all commercially available data-mining tools. This umbrella term covers a number of specific algorithms, such as Chi-squared Automatic Interaction Detector (CHAID) and C4.5, as well as the process they share, which results in models that look like trees. Decision tree algorithms are recommended for predictive tasks that require a classification-oriented model, and as such they are designed for problems best solved by segregating cases into discrete groups. For example, decision trees are often used to predict which customers are most likely to respond to direct-mail marketing or which applicants are likely to be approved for loans.

One advantage of this method is that describing nodes with rules is intuitive and easily understood by any operator. However, weighing the significance of a discovered rule can be a serious problem for the decision tree approach. The problem originates from the fact that as the tree expands, fewer records are left at the nodes of the deeper levels of the classification tree being built. A decision tree splits data into a larger number of sets that become smaller as they get more specific, and the smaller each set of training examples becomes, the less confidence we can have that further classification will be accurate. If the tree becomes overbuilt with a large number of small branches, there's a good chance that the rules inside those nodes won't stand up to justifiable statistical scrutiny, because each of the nodes stemming from those branches will generally contain only a suspiciously small percentage of the overall cases. This often leads to the problem of overfitting, described in Chapter 4 and discussed later in this chapter as well.
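The loss of confidence in deeper nodes can be illustrated with a quick calculation (a conceptual sketch, not part of Analysis Services): the standard error of an estimated class proportion grows sharply as the number of cases left in a node shrinks.

```python
import math

def proportion_std_error(p, n):
    """Standard error of a class proportion p estimated from n cases."""
    return math.sqrt(p * (1 - p) / n)

# Each split leaves fewer training cases at the deeper nodes, so the
# estimate of, say, "percentage edible" in a node grows less reliable.
for n in (4000, 400, 40, 4):
    se = proportion_std_error(0.5, n)
    print(f"cases in node = {n:4d}, std. error of estimate ~ {se:.3f}")
```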

Creating the Model

The first step in any data-mining operation is to create the model. The data-mining model is generated from cases contained in a data source. Any data source that can be connected to through OLE DB can be used to create a model. These sources include relational databases, OLAP cubes, FoxPro tables, text files, or even Microsoft Excel spreadsheets.

In this section, we'll create two decision tree models, one using standard relational tables from Microsoft SQL Server 2000 as a source and another using OLAP cubes. We'll also look at how to use those data sources to store the test cases used to make predictions and how to store the results of those predictions.

Analysis Manager

The starting point to create a data-mining model is with Analysis Manager included in the Analysis Services Installation package on the SQL Server 2000 CD-ROM.

Before anything can begin, you must register the analysis server that you wish to connect to by right-clicking on the Analysis Servers folder and choosing Register Server, as shown in Figure 5-1.


Figure 5-1. Microsoft Analysis Manager.

The Register Analysis Server dialog box appears on the screen. The server name requested is the same as the name of the Microsoft Windows server on which it resides. Once the server is registered, you'll see the FoodMart 2000 sample analysis server database if you chose it as an option during the installation of Analysis Services. (See Figure 5-2.)


Figure 5-2. FoodMart 2000 sample database.

You'll notice that there are several Analysis Manager folders that contain the elements needed to create OLAP cubes and data-mining models. Analysis Server includes the following components:

  • Databases   Each Analysis Server contains one or more databases; an icon represents each database. There are four folders and an icon beneath each database icon.
  • Data Sources   The Data Sources folder contains the data sources specified in the database. A data source maintains OLE DB provider information, server connection information, network settings, connection time-out, and access permissions. A database can contain multiple data sources in its Data Sources folder.
  • Cubes   The Cubes folder contains the cubes in the database. An icon represents each cube. Three varieties of cubes are depicted in the Analysis Manager Tree pane: Regular, Linked, and Virtual.
  • In a Cubes folder, an icon represents each cube variety. Beneath each Cube icon is a Partitions folder that contains an icon for each partition in the cube and a Cube Roles icon that represents all the cube roles for the cube.

    To see the dimensions, measures, and other components in the Regular, Linked, or Virtual cube, right-click the appropriate icon and then choose Edit.

  • Partitions   A cube's Partitions folder contains an icon for each partition in the cube. There are two types of partitions depicted in the Analysis Manager Tree pane: Local and Remote.
  • In a Partitions folder, an icon represents each Local partition and each Remote partition. To access the settings for a partition, right-click the appropriate icon and then click Edit.

  • Cube Roles   Beneath a cube, a single Cube Roles icon represents all the cube's roles. To access the roles, right-click the icon and then choose Manage Roles.
  • Shared Dimensions   The Shared Dimensions folder contains an icon for each shared dimension in the database. These dimensions can be included in any cube in the database. Four varieties of shared dimensions are depicted in the Analysis Manager Tree pane: Regular, Virtual, Parent-Child, and Data-Mining.
  • In a Shared Dimensions folder, an icon represents each of these four dimensions. To see the levels, members, and other components in a dimension, right-click the appropriate icon and then choose Edit.

  • Mining Models   The Mining Models folder contains an icon for each mining model in the database. You'll notice that two icons represent the two types of data-mining models. The small cube icon indicates that the data source for this data-mining model is an OLAP cube, while the cylinder icon indicates that the data source for this data-mining model uses a relational database.
  • To view or modify the structure of the relational or OLAP mining models, right-click the appropriate icon and then choose Edit. To view the contents of a mining model, right-click its icon and then choose Browse.

    Beneath a mining model, a single Mining Model Roles icon represents all the roles for the mining model. To access the roles, right-click the icon and then choose Manage Roles.

  • Database Roles   The Database Roles icon represents all the database roles in the database. These roles can be assigned to any cube or any data-mining model in the database. To access the roles, right-click the icon and then choose Manage Roles.

Mushrooms Data-Mining Model

To illustrate the step-by-step process of using decision trees, we'll create a mining model using cases from a SQL Server database about mushrooms. The model will indicate whether a mushroom is edible.


Note

This data comes from one of the many "data-mining ready" databases in the University of California, Irvine, Machine Learning Repository. UC Irvine's Department of Information and Computer Science graciously offers its data at no cost on the Web. For more information, visit the repository at http://www.ics.uci.edu/~mlearn/MLRepository.html


The Mushrooms database contains over 8000 cases of mushrooms. Table 5-1 lists the field descriptions.

Table 5-1. Fields in the Mushrooms Database
Field Name                  Source
ID                          Primary key ID generated by SQL Server
edibility                   Target and input
cap_shape                   Input
cap_surface                 Input
cap_color                   Input
bruises                     Input
odor                        Input
gill_attachment             Input
gill_spacing                Input
gill_size                   Input
gill_color                  Input
stalk_shape                 Input
stalk_root                  Input
stalk_surface_above_ring    Input
stalk_surface_below_ring    Input
stalk_color_above_ring      Input
stalk_color_below_ring      Input
veil_type                   Input
veil_color                  Input
ring_number                 Input
ring_type                   Input
spore_print_color           Input
population                  Input
habitat                     Target and input

Creating the Database

Creating the database is simple. You only need to right-click on the server and choose New Database. The Database dialog box, shown in Figure 5-3, prompts you for the name of the database, which in this example is Mushrooms. Optionally, you can enter a description of the database.


Note

To create a data-mining model using DTS, see Chapter 8, "Using Microsoft Data Transformation Services (DTS)."



Figure 5-3. Choosing a database.

As you can see in Figure 5-4, a database with a folder for each type of component needed to manage your analysis tasks has been created. Some of these folders are optional. For example, if you are mining a relational database, you won't use the Cubes folder.


Figure 5-4. Data analysis database folder structure.

To create a new mining model, right-click on the Mining Models folder and then choose New Mining Model to open the Mining Model Wizard.

Mining Model Wizard

Microsoft products use wizards to accomplish certain tasks in a limited and predictable number of steps. The Mining Model Wizard walks you through the following steps to create a model:

  1. Select source.
  2. Select the case table or tables for the data-mining model.
  3. Select data-mining technique (algorithm).
  4. Edit joins if multiple tables were chosen as the source in the previous step.
  5. Select Case Key column.
  6. Select Input and Prediction columns.
  7. Finish.
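Behind the scenes, the wizard's choices correspond roughly to a CREATE MINING MODEL statement in the OLE DB for Data Mining syntax. The sketch below builds such a statement in Python with an abbreviated column list from Table 5-1; the exact statement Analysis Services generates may differ.

```python
# Abbreviated column list from Table 5-1; the full model would include
# all the input columns.
input_columns = ["cap_shape", "cap_surface", "cap_color", "odor"]

rows = ",\n".join(f"    [{c}] TEXT DISCRETE" for c in input_columns)
dmx = (
    "CREATE MINING MODEL [Mushroom Analysis RDBMS] (\n"
    "    [ID] LONG KEY,\n"
    "    [Edibility] TEXT DISCRETE PREDICT,\n"
    f"{rows}\n"
    ") USING Microsoft_Decision_Trees"
)
print(dmx)
```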

Select source

You have the choice of creating data-mining models, which contain cases located either in relational tables or in OLAP cubes. For this exercise, choose the relational data type to use cases from the SQL Server 2000 database.

Select case tables

As Figure 5-5 illustrates, the connections used with a relational model are created and displayed in the Select Case Tables screen. Also provided here is the option of creating a new connection by clicking the New Data Source button. You can also specify whether your cases are contained in a single table or in multiple tables. In this case, we need only one table.


Figure 5-5. Select Case Tables screen.

To get a data source, click the New Data Source button to bring up the Data Link Properties dialog box, shown in Figure 5-6. On the Provider tab is a list of all the OLE DB drivers installed on the server where the Analysis Services server is installed. In this case, we pick the Microsoft OLE DB provider for SQL Server to get access to the Mushrooms table on the SQL Server 2000 server, which contains all the cases we need.


Figure 5-6. Data Link Properties dialog box.

To connect to a specific server and database, enter the name of the SQL Server and, optionally, the name of the database on the Connection tab in the Data Link Properties dialog box. (See Figure 5-7.) If you don't provide the name of a specific database, you'll be connected to the designated default database for your server. Supply the logon credentials you'll use to connect. Because of its simplicity, you'll be tempted to use the Integrated Security option, which lets SQL Server use the credentials supplied by Microsoft Windows NT or Windows 2000 based on your Windows user name and password, without needing a SQL Server user name and password. Be forewarned that this option opens the door to hard-to-find errors because the security context will be that of the user currently logged on to the server where Analysis Services resides. For added safety, you can click the Test Connection button to check the connection parameters you supplied.


Note

The database structure information is cached on the Analysis Manager client to save on bandwidth requirements. As a result, whenever you change or add fields to the tables that you use to build the models, you must remember to right-click on the connection and then choose Refresh Connection. This causes Analysis Manager to query the SQL Server to reload the database schema information so that you can see the changes in the list of fields you pick from when creating the data-mining model.



Figure 5-7. Selecting a database.

As soon as you close the Data Link Properties dialog box, the Select Case Tables screen shows the table information. As shown in Figure 5-8, you select your tables from the Available Tables pane in the Select Case Tables screen. The Details pane gives you a list of the field names available in the table.


Figure 5-8. Selecting a case table.

Before committing the table to the mining model, you can click the Browse Data button to preview the first 1000 rows of data, as displayed in Figure 5-9, to make sure that the contents correspond to your expectations.


Figure 5-9. Preview of a Mushrooms table.

Select a data-mining technique

The Mining Model Wizard offers two data-mining algorithms, or "techniques" as they're called in the wizard, for you to select from. For our purposes, we'll select Microsoft Decision Trees in the Select Data Mining Techniques screen.

Create and edit joins

If you selected multiple case tables in the previous steps, the Create And Edit Joins screen will appear next. This screen lets you graphically join tables by dragging the Key columns from the parent tables to the children. If you chose only a single table, this step is skipped.

Select the Key column

The next step is to select ID as the Case Key column. The choice of ID has a very important effect on the output of the decision tree because this Key is what the engine uses to uniquely identify a case. Choosing a Key is mandatory, so it's very important to create a Key in the SQL Server database if one does not exist already, especially since a Case Key is not an option for use as either a target or an input value.

Select Input and Prediction columns

In the Select Input And Prediction Columns screen, pick at least one Input column for the mining model from the available columns in the list on the left pane. Input columns represent the actual data that is used to train the mining model. If you selected Microsoft Decision Trees in the Select Data Mining Techniques screen, also select at least one Predictable column. Predictable columns are the fields used to provide the output predicted by the mining model. If you want to use a model to predict whether a given mushroom is edible, the Edibility column would be the Predictable column because that would presumably be the one thing we don't know about a given mushroom but that the model can tell us based on all the other attributes of that mushroom. For convenience, these columns are also used as Input columns; later in this chapter we'll see how we can make a column a Predictable column only. For the purposes of this chapter, we'll select both Edibility and Habitat as the Predictable columns.
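In library terms, the distinction between Input and Predictable columns is simply the split between training attributes and the training target. A minimal sketch follows; the rows here are invented for illustration, not taken from the Mushrooms database:

```python
# Hypothetical case records; field values are invented for illustration.
cases = [
    {"ID": 1, "odor": "none", "cap_color": "gray",  "edibility": "edible"},
    {"ID": 2, "odor": "foul", "cap_color": "brown", "edibility": "poisonous"},
]

key = "ID"
predictable = "edibility"
input_columns = [c for c in cases[0] if c not in (key, predictable)]

X = [[case[c] for c in input_columns] for case in cases]  # training inputs
y = [case[predictable] for case in cases]                 # values to predict

print(input_columns)  # ['odor', 'cap_color']
print(y)              # ['edible', 'poisonous']
```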

Shown in Figure 5-10 are three panes to work with:

  • Available columns   Select columns from the tree view. Use the buttons provided to move columns to either the Predictable Columns pane or the Input Columns pane or to remove columns from the selection. You cannot use the ID column you selected in the Select The Key Column dialog box as an Input column because it's a key field. Select all the columns besides Habitat and Edibility as Input columns.
  • Predictable columns   View the selected Predictable columns. This pane is displayed only if you selected Microsoft Decision Trees in the Select Data Mining Techniques screen. For this exercise, select Edibility and Habitat as Predictable columns.
  • Input columns   View the selected Input columns.

Figure 5-10. Select Input and Predictable columns.

If you select the Finish This Mining Model In The Editor check box, you'll bypass selection of Input and Predictable columns and finish working with the mining model in the Relational Mining Model Editor. If you select this option, you cannot process the mining model in the last step of the wizard; instead, you'll need to explicitly select that option in the editor. Leave the check box clear for this exercise, and click Next.

Finish

Now that the mining-model parameters have been defined, you have to enter the name of the data-mining model, Mushroom Analysis RDBMS for this exercise. If you select the Save, But Don't Process Now option, the data-mining model will be saved but will still need processing to be trained. Choosing the Save And Process Now option (the option you should select for this exercise) saves and trains the model at the same time.

As the model is trained, the various training steps are detailed as they occur. The starting time of the process is recorded, and a progress bar in the bottom of the screen reflects the stage of processing during each step, as shown in Figure 5-11.


Figure 5-11. Processing the model.

Once the processing is complete and you click the Close button, the Relational Mining Model Editor appears, as shown in Figure 5-12. Here you can make modifications to the mining model and reprocess the model as needed.


Figure 5-12. Relational Mining Model Editor.

Relational Mining Model Editor

As convenient as wizards are, they limit your flexibility at each step because, to maintain simplicity, the wizard must use default values and implicit decisions to accomplish a task. By using the Relational Mining Model Editor, you can dispense with the wizard, at least in part, and make some design decisions that you would not otherwise have the chance to make. To illustrate, create another data-mining model and make the same choices you made before until you reach the Select Input And Predictable Columns screen shown again in Figure 5-13. For this model, choose to finish building the mining model in the editor rather than in the wizard.


Figure 5-13. Select Input and Predictable columns.

As you can see, you did have to use part of the wizard to get to this point, but thus far you have only designated the source type, the source table, the case key, and the mining technique. By selecting the Finish The Mining Model In The Editor check box and then clicking Next, the Finish The Mining Model screen appears prompting you for a name. After you enter a name (Mushrooms DBMS for this example) and click Finish, the main editing canvas of the Relational Mining Model Editor appears.

The editor contains many components that are worth examining. On the upper left of the editing canvas, you'll find the Editor toolbar, shown in Figure 5-14.


Figure 5-14. The Editor toolbar.

Table 5-2 describes the Editor toolbar buttons.

Table 5-2. Editor Toolbar
Button                  Description
Save                    Saves the relational data-mining model.
Insert Table            Adds a new table to the schema of the relational data-mining model.
Insert Column           Adds a new column to the structure of the relational data-mining model.
Insert Nested Table     Adds a new nested table to the structure of the relational data-mining model.
Process Mining Model    Displays the Process A Mining Model dialog box, where you can select the processing method for the relational data-mining model.

As you can see in Figure 5-15, the Mushrooms DBMS mining model is rather sparsely populated, containing only the case key. You must now add the columns to the model one by one.


Figure 5-15. Mushrooms mining model.

To add a Predictable column, right-click on the column you wish to add and then choose Insert As Column. For this exercise, add the Edibility column as a Predictable column.

Before continuing, let's take a quick look at portions of the Relational Mining Model Editor window. The lower-left pane is the Properties pane, which displays the properties of either the mining model or the individual columns, depending on which one is highlighted in the Structure pane above.

Table 5-3 describes the features of the Properties pane.

Table 5-3. Properties Pane
Feature             Description
Properties button   Shows or hides the Properties pane
Basic tab           Shows the most commonly used properties, such as Name and Description, for the mining model and mining-model columns
Advanced tab        Displays advanced properties, such as Distribution and Content Type, used to further define the mining-model columns
Description         Displays the name and a brief explanation of the property selected in the Properties pane

The Basic tab is used to display and, optionally, edit the most commonly viewed properties for data-mining models and data-mining columns. Table 5-4 describes the properties displayed in the Basic tab in more detail and indicates the data-mining object (data-mining model or data-mining column) to which the property applies.

Table 5-4. Basic Tab Properties
Property  Description  Applicable objects
Name  The name of the selected data-mining model or column. This property is read-only for data-mining models.  Both
Description  The description of the selected data-mining model or column.  Both
Mining Algorithm  The data-mining algorithm provider for the selected data-mining model. By default, the available algorithms are Decision Trees and Clustering, but any other algorithm for which a provider is installed is also displayed.  Data-mining model
Are Keys Unique  Whether the Key columns in the data-mining model uniquely identify records in the source case table.  Data-mining model
Is Case Key  Whether the data-mining column is used as a Key column in the data-mining model. This property must be set to False before you can delete the column.  Data-mining column
Is Nested Key  Whether the data-mining column is used as a Key column for a nested table in the data-mining model. This property must be set to False before you can delete the column.  Data-mining nested table column
Source Column  The name of the source column in the case or supporting table.  Data-mining column
Data Type  The data type of the data-mining column. This setting must be compatible with the data-mining algorithm provider that is being used. The data types and algorithms that Microsoft SQL Server 2000 Analysis Services supports are documented in the OLE DB for Data Mining specification. For more information, see the Microsoft OLE DB Web page at http://www.microsoft.com/data/oledb/default.htm. For data types supported by data-mining algorithm providers, see the data-mining algorithm provider documentation.  Data-mining column
Usage  Whether the data-mining column is used as an Input column, a Predictable column, or both. This property is read-only for Key columns.  Data-mining column
Additional Parameters  A comma-delimited list of provider-specific mining parameter names and values. For mining parameters supported by data-mining algorithm providers, see the data-mining algorithm provider documentation.  Data-mining model

The Advanced tab displays Advanced properties for data-mining models and data-mining columns, such as Relation column information and distribution. Table 5-5 describes these Advanced properties in more detail.

Table 5-5. Advanced Tab Properties
Property  Description  Applicable objects
Related To  For Relation columns, used when multiple tables supply the case data. This is the name of the column to which the selected data-mining column is related. It is read-only for Key columns. When this property is set for a column, the Usage property is changed to match the value of the related column.  Data-mining column
Distribution  The distribution flag, such as NORMAL or UNIFORM, of the data-mining column. It is read-only for Key columns.  Data-mining column
Content Type  The content type, such as DISCRETE or ORDERED, for the data-mining column. It is read-only for Key columns.  Data-mining column
Data Options  The model flag, such as MODEL_EXISTENCE_ONLY or NOT NULL, of the data-mining column. It is read-only for Key columns.  Data-mining column

Notice that the column you just added is by default an Input column. To make it a Predictable column, click the Usage field of the Basic tab and then select Input And Predictable, or select Predictable alone if you want to ensure that the value is not also used as input to train the model. For now, select Input And Predictable. After you make this change, you'll see a diamond-shaped icon appear next to the column in the Structure pane. (Yes, we're mining for diamonds of information.)


Note

Selecting Predictable is possible only through the Relational Mining Model Editor. The wizard offers only Input And Predictable as a Predictable column option.


On the Advanced tab for the column, you can select the type of data this column represents. Select DISCRETE for descriptive data such as names or types, and select ORDERED for sequential values such as age or years. Add all the other database columns, and make them Input columns. Now you're ready to begin processing, but before you do, be sure to save the new model changes.

To process the model, click the Process Mining Model button (it looks like a set of gears) or choose Process Mining Model from the Tools menu. Whenever you process a new model or a model with a changed structure, the model must be rebuilt and the cases inserted anew. If new cases are added to a model but no changes are made to the model's structure, you have the option to refresh the data instead. In that case the structure does not have to be re-created; the model is purged of all the existing cases, reloaded with cases, and retrained with the new data.

When processing is finished, the status screen appears, as shown in Figure 5-16.


Figure 5-16. Processing completed.

Visualizing the Model

One of the most valuable features of a decision tree is the simplicity of the logic behind its construction. The Data Mining Model Editor contains two tabs at the bottom of the screen, the Schema tab, which we have been using thus far to alter the structure of the model, and the Content tab, which shows how the data has been classified and organized within the tree. The Content tab is shown in Figure 5-17.


Figure 5-17. The Content tab.

The Content tab is a quick and convenient way to look at the model, but the Structure and Properties panes on the sides take up a good portion of the screen real estate. Another way to arrive at a screen that permits easy visualization is to go to the Analysis Manager tree, right-click on the data-mining model that you wish to visualize, and choose Browse to bring up a similar view with more dedicated space for the decision tree. (See Figure 5-18.)


Figure 5-18. View of the Edibility tree.

Two magnifying glass icons in the upper-left corner of this window allow you to zoom in on the diagram to get a better view of the tree outline. In the upper middle of the window, there's a drop-down list box that contains a list of all the tree structures in the model. A data-mining model can contain multiple trees, one for each Predictive column. In this model, you made Habitat and Edibility Predictive columns. As a result, you can view Habitat as the current tree by choosing it from the list. The tree will no longer predict whether a mushroom is edible, but it will predict a mushroom's habitat.

As you can see, the diagram of the model looks like a fallen tree with the root to the left and the branches to the right. This hierarchical structure is created by the IF->THEN rules used to classify information. Describing nodes with intuitive rules is one of the advantages of the decision tree technique.
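The rule-per-node structure can be sketched as a small data structure. The rules and the layout below are illustrative, not read from the actual model:

```python
class Node:
    """One node of a decision tree; leaves carry a prediction."""
    def __init__(self, rule, prediction=None, children=None):
        self.rule = rule              # e.g. "odor = none"
        self.prediction = prediction  # set only on leaf nodes
        self.children = children or []

# A tiny tree shaped like the Edibility tree: root on the left,
# branches to the right.
root = Node("all cases", children=[
    Node("odor = none", children=[
        Node("spore_print_color = green", prediction="poisonous"),
        Node("spore_print_color not = green", prediction="edible"),
    ]),
    Node("odor not = none", prediction="poisonous"),
])

def node_path(node, rule, path=()):
    """Return the chain of IF->THEN rules leading to the named node."""
    path = path + (node.rule,)
    if node.rule == rule:
        return path
    for child in node.children:
        found = node_path(child, rule, path)
        if found:
            return found
    return None

print(node_path(root, "spore_print_color = green"))
```

Clicking a node in the browser shows exactly such a chain of rules in the Node Path pane.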

An added bonus of the visualization pane is that the colors give you a feel for the density of cases in the nodes. The darker the color, the higher the percentage of cases in that node that correspond to the selected attribute value. In the lower-right portion of the window, there is a drop-down list box that contains all the possible values for the Predictive column of that tree. By default, it's set to All Cases so that the color of the nodes reflects the overall quantity of cases in a node. But if you want to highlight those nodes that contain higher percentages of edible mushrooms, you need only select that attribute from the list box and look for the darkest nodes. A view of the edible mushrooms in the tree is shown in Figure 5-19.


Figure 5-19. Edible mushrooms.

Figure 5-20 shows the same tree, but the colors highlight the poisonous mushrooms.


Figure 5-20. Poisonous mushrooms.

The nodes and branches of the tree respond to click events from the user, revealing more information. If you click on any of the nodes in the tree, the Attributes pane in the middle-right portion of the window will reflect the data of that node. In addition, the Node Path pane on the lower-right corner will display the description of the rules that govern the inclusion of the cases within the node. On the upper-right panel, the Content Navigator provides a very general overview of the shape of the tree and also allows you to select a visible portion of the tree by clicking on that part of the miniature tree in that panel. For example, if I wanted to see the leaf nodes of a tree, I could either click on the individual branches of the tree in the Visualization pane, or I could click on that portion of the tree in the Content Navigator that I wish to see and the Visualization pane will reflect my choice.

Prediction Columns

Decision trees allow only one variable to be the target of a prediction at a time. When you analyze the data-mining models generated by decision trees, there can be more than one target variable, but a separate tree is created for each target variable. By choosing a target variable from the drop-down list at the top of the visualization pane of Analysis Manager, you can view each of the different trees.
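The one-tree-per-target behavior can be sketched as follows; here train_tree is a stand-in for the real training step, not an Analysis Services call:

```python
def train_tree(cases, target):
    """Stand-in for decision tree training; returns a placeholder model."""
    return {"target": target, "trained_on": len(cases)}

# Each Predictable column gets its own tree built from the same cases.
cases = [{"odor": "none", "edibility": "edible", "habitat": "grasses"}] * 3
predictable_columns = ["edibility", "habitat"]

trees = {t: train_tree(cases, t) for t in predictable_columns}
print(sorted(trees))  # one tree per Predictable column
```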

As you can see from the tree, the data has been separated into groups that can be used to make predictions because each node contains data that follows the rules that describe the data based on the attributes Analysis Manager has gathered. You can use this model to determine whether a particular mushroom is edible. To see how the tree works, let's imagine that you found a mushroom in the forest with the characteristics listed in Table 5-6 and you'd like to know whether you can use it for a new cream of mushroom soup recipe you saw on TV.

Table 5-6. Mushroom Characteristics
Field Name                  Value
cap_shape                   Convex
cap_surface                 Smooth
cap_color                   Gray
bruises                     No Bruises
odor                        None
gill_attachment             Free
gill_spacing                Crowded
gill_size                   Narrow
gill_color                  Black
stalk_shape                 Tapering
stalk_root                  Equal
stalk_surface_above_ring    Smooth
stalk_surface_below_ring    Scaly
stalk_color_above_ring      White
stalk_color_below_ring      White
veil_type                   Partial
veil_color                  White
ring_number                 One
ring_type                   Evanescent
spore_print_color           Brown
population                  Abundant
habitat                     Grasses

A decision tree separates data into sets of rules that can be used to describe data or make predictions. In this model, you see that the algorithm created multiple branches based on the odor alone. When the tree color is based on whether a mushroom is poisonous (as in Figure 5-20), the tree seems to indicate that if a mushroom has an odor it's probably poisonous. But if there is no odor, as is the case with this mushroom, then further analysis is needed to determine whether it is poisonous. When you click on the Odor = None node, as shown in Figure 5-21, the Attributes pane tells you that of the more than 8000 total cases, 3528 fall into the odorless category and of those, 120 are poisonous.
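The counts quoted for the Odor = None node translate directly into an estimated conditional probability; this is simple arithmetic over the figures reported in the Attributes pane:

```python
odorless_cases = 3528      # cases in the Odor = None node
odorless_poisonous = 120   # poisonous cases among them

p = odorless_poisonous / odorless_cases
print(f"P(poisonous | odor = none) ~ {p:.3f}")  # about 0.034
```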


Figure 5-21. Mushroom odor analysis.

The Spore Print Color nodes describe the print color of the spores. If you click on the Spore Print Color = Green node, the Attributes pane reveals that if the spores are green, they are poisonous. Your mushroom has brown spores, however, and when you click on the Spore Print Color not = Green Node, as shown in Figure 5-22, the Attributes pane indicates that of 3456 cases of brown-spored odorless mushrooms, only 48 are poisonous.


Figure 5-22. Mushroom color analysis.

The stalk surface below the ring of your mushroom is scaly, so you're not out of the woods just yet. As the information in the Attributes pane shown in Figure 5-23 indicates, there are only 56 cases of mushrooms like yours and of those 40 are poisonous.

Perhaps it's time to put some gloves on before handling your mushroom any further, but for the definitive conclusion that your mushroom is poisonous, you need to investigate a bit more. The final verdict is determined by the mushroom ring type. Your mushroom is evanescent and according to the information in the Attributes panel shown in Figure 5-24, your mushroom is definitely poisonous!

One of the first tasks undertaken by the decision tree for the mushroom database is to find the combinations of characteristic variables that best set the poisonous mushrooms apart from the edible ones. This process is called segmentation.


Figure 5-23. Mushroom stalk surface analysis.

Figure 5-24. Mushroom ring type analysis.

This example shows how a predictive task uses the model to make a prediction. Of course, we will be able to make these kinds of predictions automatically and in much greater numbers using predictive queries, which will be covered in more detail in Chapter 12, "Data-Mining Queries."
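The chain of rules just traversed can be sketched as a few nested tests. This is an illustrative reconstruction of the walkthrough above, not the Analysis Services implementation; the attribute names and the order of the tests are taken from the text.

```python
def classify_mushroom(case):
    """Walk the decision path from the text: odor, spore print color,
    stalk surface below ring, then ring type."""
    if case["odor"] != "none":
        return "poisonous"          # odorous mushrooms are mostly poisonous
    if case["spore_print_color"] == "green":
        return "poisonous"          # green-spored odorless cases are poisonous
    if case["stalk_surface_below_ring"] != "scaly":
        return "edible"
    # scaly below-ring cases: the final verdict depends on ring type
    return "poisonous" if case["ring_type"] == "evanescent" else "edible"

found = {
    "odor": "none",
    "spore_print_color": "brown",
    "stalk_surface_below_ring": "scaly",
    "ring_type": "evanescent",
}
print(classify_mushroom(found))   # -> poisonous
```

Each `if` corresponds to clicking one node deeper in the browser; the returned value is the predominant class in the leaf the case lands in.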

Dependency Network Browser

Dependency Network Browser, shown in Figure 5-25, is a tool used to view the dependencies and relationships among objects in a data-mining model. To display it from the Analysis Manager Tree pane, right-click a data-mining model and then choose Browse Dependency Network.


Figure 5-25. Dependency Network Browser.

In Dependency Network Browser, a data-mining model is expressed as a network of attributes. Within the model, you can identify data dependencies and predictability among the related attributes. Dependency is indicated by arrows. The direction of predictability is indicated by arrowheads and by the color-coding of the nodes.

Dependency Network Browser Helps Understand Models

Dependency Network Browser presents data-mining content for decision tree mining models from a different point of view than that of Data Mining Model Browser. Data Mining Model Browser allows you to view relationship and distribution information from the Predictive attribute that governs the structure of the tree. Dependency Network Browser allows you to view the entire data-mining model from all the attributes using relationship information alone.

Dependency Network Browser displays all the attributes in the data-mining model as nodes. Arrows between nodes represent prediction links. For example, an arrow from the Odor node to the Edibility node indicates that the Odor attribute predicts the Edibility attribute.

The nodes are color-coded to represent the selected node and the predictability direction of related nodes; click on a node to view its dependency relationships. To improve the view of the relationships, you can drag the nodes or use the Improve Layout button, which automatically distributes and resizes nodes.

The mushroom tree presented by Dependency Network Browser clearly shows which input attributes contribute most to deriving the predictive value. As you move the slider at the left of the window downward, Dependency Network Browser removes the weakest dependencies one at a time. Figure 5-26 reveals that the strongest predictor of edibility is odor because the last arrow remaining on the screen comes from the odor attribute.
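The slider's behavior amounts to ranking links by strength and dropping the weakest first. A minimal sketch, assuming hypothetical link strengths (the real model exposes these only through the browser, not through code like this):

```python
# Hypothetical dependency strengths for the mushroom network.
links = {
    ("odor", "edibility"): 0.95,
    ("spore_print_color", "edibility"): 0.60,
    ("gill_color", "edibility"): 0.40,
    ("habitat", "edibility"): 0.15,
}

def strongest_links(links, keep):
    """Keep only the `keep` strongest dependencies, as the slider does."""
    ranked = sorted(links.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:keep])

# The last surviving arrow is the strongest predictor: odor -> edibility.
print(strongest_links(links, 1))
```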


Figure 5-26. The Links pointer.

After an attribute is isolated, Data Mining Model Browser allows you to view the details and distribution information for the relationships of the selected attribute.

Dealing with Numerical Data

To function, decision tree algorithms must place data in categorical groups, or bins. Quantitative values such as salaries or miles are continuous data. Although some algorithms are able to make use of the individual values of continuous data, Microsoft decision trees will actually bin the data for you to create a semblance of discrete variables.

The following example shows the advantages of binning data to obtain predictions from your data. Let's say you created a data-mining model using decision trees with cases of car information. To get the best possible price prediction for a used car, you use all the attributes of the car (such as the mileage, year, make, model, body type, and color) as the input types and an optimal selling price as the target variable. Although this seems like a straightforward problem, you'll be surprised to find that the algorithm creates discrete groups of attributes based on the price or target variable. Although the prices for each make and model range from $500 to $80,000, the predictions usually are a set of four or five values that will look something like $1,030, $8,540, $15,900, and $27,520. Each node will have a division for each value, making it very difficult to decide on a ticket price. One way to make these predictions more accurate is to bin the price attributes before submitting them to the data-mining algorithm. To bin the attributes, create a new field in the case table that contains a price range, much like the one shown in Table 5-7.
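As a sketch of that pre-binning step, the ranges in Table 5-7 can be generated with a simple helper. The $1,000 bin width is an assumption that matches the table; a real case table would get this as a computed column.

```python
def price_range(price, width=1000):
    """Map a price to a labeled bin, e.g. 1525 -> '$1,000 - $1,999'."""
    low = (price // width) * width
    return f"${low:,.0f} - ${low + width - 1:,.0f}"

for price in (1525, 8600, 8250, 11500):
    print(price, "->", price_range(price))
# 1525 -> $1,000 - $1,999
# 8600 -> $8,000 - $8,999
```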

Table 5-7. Case Table for Car Fields
… Year Make Price Price Range
… … … $1,525 $1,000 - $1,999
… … … $8,600 $8,000 - $8,999
… … … $8,250 $8,000 - $8,999
… … … $11,500 $11,000 - $11,999

Although this method won't provide one sale price, it will come up with a price accurate to within $999. If you're willing to sacrifice that much precision, this method is great, but a car dealership really can't afford to be off by that much money.

Depending on how accurate the prediction must be, binning might or might not be the best data-mining method. The example of used-car pricing showed that binning does not work well when there is a large range of numbers and the predictions must be precise. On the other hand, if you're pricing collectable postage stamps (especially for very rare stamps valued in the hundreds of thousands of dollars), the price range might be small enough to create smaller and more precise bins.

Consider these options before binning:

  • Create small bins.
  • Avoid creating too many bins. When there are too many bins, the engine, unable to deal with the complexity, doesn't split and classify the data correctly. How many is too many? There's no right answer, since much depends on how you intend to use the data-mining model and the nature of the source data used to populate it. You may want to experiment until you find the number of bins that works best for you.
  • Create multiple decision trees instead of bins. For example, you could create one decision tree for each car price category. In other words, divide the car population into those that sell for less than $5000, those that sell for less than $10,000, and so on. Next create a separate data-mining model for each population. Because the price range is smaller, the bins created for each model are more precise.

Inside the Decision Tree Algorithm

As its name suggests, a decision tree algorithm produces a tree-shaped model. There's no limit to the number of levels, and the more inputs and variables assigned to the algorithm, the wider and deeper the tree grows.

CART, CHAID, and C4.5

When a decision tree algorithm is applied to a data-mining problem, the result, or decision, looks like a tree. Although Microsoft uses its own algorithm to generate a decision tree, this algorithm is inspired by other tried-and-proven methods . Let's take a minute to discuss these popular decision tree algorithms used in the world of data mining.

Classification and Regression Trees (CART)

CART is by far the most widely used algorithm because of its efficient classification system that uses various automatic tree-pruning techniques, including cross-validation using a test set.

CART picks attributes with strong predictive value once it determines that they introduce order to the data set. An attribute is used if it splits the existing pool of data into two separate nodes, from which further branches or leaves are subsequently created.

One of the most useful features of CART is its ability to handle missing data when building the tree. Either it will know not to use a certain record to determine whether a split should be made, or it will use surrogate data. For example, if income information is missing from a record, the effective tax rate can be used in its place because it is a correlative value, even if it's not an exact predictor of income.


Caution

At times, surrogate data points (such as the one mentioned in the previous example) can cause a model to contain incorrect data that is generated from false assumptions that are valid in the universe of the data set but not in the real world. For instance, the sample of cases used to build a model for the sale of cars could have only red Ford Expeditions and a few Expeditions of unspecified color. The algorithm could incorrectly proceed as if Expeditions of unspecified color are red because red is the only specified color for Ford Expeditions in the database.


One unique feature of the CART algorithm is its binary split restriction, which causes tree nodes to sprout only two branches at a time and produces a tree that is deeper than trees built by algorithms that can sprout multiple branches from a single node. A deep tree is considerably more economical with data and as a result is able to detect more structure before too little data is left for analysis. Remember that every node in a hierarchy divides its records among the nodes below it, so a node with 1000 records will create two nodes with a certain number of records in each. If a node has a very low record count, it ceases to split, while the other node continues. If multiple nodes emanate from a higher-level node, each of them must divide a smaller number of records than a purely binary split would, which causes more leaf nodes to exist. Other decision tree algorithms fragment the data rapidly with multiple splits, making it difficult to detect rules that might need a larger number of attributes in a group to make more accurate splits.

Chi-Squared Automatic Interaction Detector (CHAID)

The CHAID algorithm uses Chi-Squared analysis tests to validate the tree. Because Chi-Squared analysis (see Chapter 4) relies on group tables or contingency grids to determine the distribution of a given value, the attributes of the data sets must be forced into groups that can be tested. For example, income ranges would have to be placed into discrete attributes such as $10,000 to $19,999 and $20,000 to $29,999.

C4.5

This algorithm is an enhancement to another basic decision tree algorithm known as Iterative Dichotomizer version 3 (ID3), which was developed well over 20 years ago. At the time, it used a logical decision tree to come up with chess moves that would beat a human opponent. Because ID3 was not standardized at the time, many variations and enhancements were made to ID3 before C4.5 was introduced. C4.5 also shares many of its features with CART, and now there is very little difference between them.

To better understand the structure and function of a tree, let's look at a generic diagram of a decision tree structure. To do so, we'll look at a database of credit card holders from the sample FoodMart 2000 database, as shown in Figure 5-27.


Figure 5-27. Generic decision tree structure.

Each element in the tree is actually a node. Every tree starts with one node at the very top that is known as the root node. (Although the root node is on the left of Figure 5-27, it is the logical top of the tree.) Many more nodes, all connected by branches, make up the remainder of the tree. Just like a real tree, all subsequent nodes and branches stem from the root node. The leaf nodes, located on the end of the branches, are the last nodes on the tree, and they're the ones that contain the values that are ultimately used to make predictions.

What does the presence of pure leaf nodes say about your data?

A tree that has only pure leaf nodes is known as a pure tree. Each pure leaf node contains 100 percent of a given type of case. This structure defeats the purpose of using a decision tree, and most data-analysis professionals actually consider pure trees suspect because in most cases, some very severe overfitting is needed to create a tree of this kind. Some data-mining models are built with training sets that contain many different attributes that could be applied. However, unbeknownst to the model builder, only one of these attributes actually determines the probability of reaching a target objective. Consider the fictitious database of surf conditions for various locations shown in Table 5-8.

This data is derived by observing surfers and finding all the factors that contribute to a surfer's decision to frequent a particular beach. Although this is not an exhaustive listing of table elements, it gives an idea of the amount and kind of data that is used and the factor combinations that are possible, such as wave size and quality and water and air temperatures. The last column, the verdict, is the database's recommendation to the surfer. If you processed this data through a Microsoft decision tree algorithm, you'd find that the model contains only one four-way split that results in four pure nodes. Why? The reason is simple, but not immediately obvious from the data. Surfers (in this fictitious world) only take wave size into consideration when deciding where to surf. The algorithm detects this and thus creates one node for each wave type that contains 100 percent of that kind of wave. The algorithm doesn't recognize other possible splits in the rest of the data that was used to build the model. Although the nodes are 100 percent pure, they still describe something important about the data, namely that only wave size matters to surfers.

Table 5-8. Surf Conditions
Waves Temperature Coolness Break Quality Verdict
BIG WARM VERY GOOD RUN
SMALL WARM VERY GOOD STAY HOME
MEDIUM WARM VERY GOOD WALK
HUGE WARM VERY GOOD RUN
BIG COLD VERY GOOD RUN
SMALL COLD VERY GOOD STAY HOME
MEDIUM COLD VERY GOOD WALK
HUGE COLD VERY GOOD RUN
BIG FREEZING VERY GOOD RUN
SMALL FREEZING VERY GOOD STAY HOME
MEDIUM FREEZING VERY GOOD WALK
HUGE FREEZING VERY GOOD RUN
BIG WARM NOT GOOD RUN
SMALL WARM NOT GOOD STAY HOME
MEDIUM WARM NOT GOOD WALK
HUGE WARM NOT GOOD RUN
BIG COLD NOT GOOD RUN
SMALL COLD NOT GOOD STAY HOME
MEDIUM COLD NOT GOOD WALK
HUGE COLD NOT GOOD RUN
BIG FREEZING NOT GOOD RUN
SMALL FREEZING NOT GOOD STAY HOME
BIG COLD VERY BAD RUN
SMALL COLD VERY BAD STAY HOME
MEDIUM COLD VERY BAD WALK
HUGE COLD VERY BAD RUN
BIG FREEZING VERY BAD RUN
SMALL FREEZING VERY BAD STAY HOME
MEDIUM FREEZING VERY BAD WALK
HUGE FREEZING VERY BAD RUN
BIG WARM NOT BAD RUN
SMALL WARM NOT BAD STAY HOME
MEDIUM WARM NOT BAD WALK
HUGE WARM NOT BAD RUN
BIG COLD NOT BAD RUN
SMALL COLD NOT BAD STAY HOME
MEDIUM COLD NOT BAD WALK
HUGE COLD NOT BAD RUN
BIG FREEZING NOT BAD RUN
SMALL FREEZING NOT BAD STAY HOME
MEDIUM FREEZING NOT BAD WALK
HUGE FREEZING NOT BAD RUN
BIG WARM NEUTRAL BAD RUN
SMALL WARM NEUTRAL BAD STAY HOME
MEDIUM WARM NEUTRAL BAD WALK
HUGE WARM NEUTRAL BAD RUN
BIG COLD NEUTRAL BAD RUN
SMALL COLD NEUTRAL BAD STAY HOME
MEDIUM COLD NEUTRAL BAD WALK
HUGE COLD NEUTRAL BAD RUN
BIG FREEZING NEUTRAL BAD RUN
SMALL FREEZING NEUTRAL BAD STAY HOME
MEDIUM FREEZING NEUTRAL BAD WALK
HUGE FREEZING NEUTRAL BAD RUN
BIG WARM VERY FAIR RUN
SMALL WARM VERY FAIR STAY HOME
MEDIUM WARM VERY FAIR WALK
HUGE WARM VERY FAIR RUN
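The claim that the verdict depends on wave size alone can be checked directly against the table: grouping rows by the Waves column yields exactly one verdict per group. A quick illustrative check over a hand-copied subset of the rows:

```python
from collections import defaultdict

# A subset of Table 5-8 rows: (waves, temperature, coolness, quality, verdict).
rows = [
    ("BIG", "WARM", "VERY", "GOOD", "RUN"),
    ("SMALL", "WARM", "VERY", "GOOD", "STAY HOME"),
    ("MEDIUM", "COLD", "NOT", "BAD", "WALK"),
    ("HUGE", "FREEZING", "NEUTRAL", "BAD", "RUN"),
    ("BIG", "FREEZING", "NOT", "BAD", "RUN"),
    ("SMALL", "COLD", "VERY", "BAD", "STAY HOME"),
]

verdicts_by_wave = defaultdict(set)
for waves, _temp, _cool, _quality, verdict in rows:
    verdicts_by_wave[waves].add(verdict)

# Every wave size maps to exactly one verdict: the four nodes are 100 percent pure.
print(all(len(v) == 1 for v in verdicts_by_wave.values()))  # -> True
```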

Note

The Member Card RDBMS data-mining model used in the following sections is one of the sample data-mining models on the SQL Server 2000 CD-ROM. To install the samples on your computer, select them when you run the SQL Server 2000 Setup program.


As you can see in Figure 5-28, each node of the Member Card RDBMS data-mining model contains information about the characteristics that define each group. You can view the number of instances attached to that node and learn about the distribution of dependent variable values that are being predicted.


Figure 5-28. The Member Card RDBMS data-mining model tree structure.

To view the node contents, click on the node you want to examine. The Attributes panel shows the contents for that node. This panel also contains the quantity of cases for each target variable as well as the percentage of the population it represents.

The Node Path panel describes the grouping criteria of the selected node. The Content Navigator panel provides a rough visual description of the tree shape. The darker the node, the more samples it contains.

The attributes in the root node cover all the instances in the training set. In this model, the root node contains 10,281 cases, of which 55.45 percent have Bronze cards, 11.66 percent have Gold cards, 23.54 percent have Normal cards, and 9.34 percent have Silver cards.

The first binary split stems from either the root or a parent node. The data is split into two new child nodes. In this model, the data from card holders with incomes between $10,000 and $30,000 goes in one child node and the rest of the data goes in the other child node. Presumably, those who fall in the $10,000 to $30,000 group are low-income earners, and as you can see from the contents of the node shown in Figure 5-29, two conclusions can already be drawn about this income group:

  • These card holders represent a minority of the card holding population (2222 members out of 10,281).
  • The vast majority of these card holders have Normal cards.

Figure 5-29. The $10,000 to $30,000 node.

Because this split has over 92 percent of cases with the same target value, or card type, another split would not be warranted. In data-mining terminology, a node with such a high concentration is considered pure, even though in this case it's really more accurate to say it's "pure enough" because only about eight percent of the cases are other card member types.

The second node in the first split (see Figure 5-30) contains all of those members who do not fall within the $10,000 to $30,000 income bracket, presumably those belonging to higher income brackets. As you can see, this node contains 8059 of the 10,281 members and a better distribution of card types than the low-income node. What we find here is that the majority of these members have Bronze cards (69.82 percent), and the lowest incidence of card type is the Normal card (4.53 percent).


Figure 5-30. The more than $30,000 node.

From this model, we learn that the higher numbers of Gold and Silver cards (14.34 and 11.31 percent, respectively) trigger the algorithm to create yet another split, which categorizes the card members according to how many children they have living at home. Those with one or two children at home are the majority, judging by the dark color of the node and the contents displayed in Figure 5-31, which shows 7007 individual cases.


Figure 5-31. The node with less than 2.25 children at home.

Note

The algorithm in this example split at 2.25 children. We can only hope that we don't find any .25 individuals in the real world! Numerical attributes will sometimes end up with decimal values even though the original data set contained only whole integers. This is because to determine at which number they can be split, the numerical values are analyzed in order. This sample data set contains cases with more than two children and cases with less than two children. The algorithm determined that 2.25 would be the demarcation between the two nodes. You can avoid this problem by having the numerical values be character types, but this might cause the attribute to be treated as though each value were a separate, discrete entity instead of one having a continuous value.
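The usual way such fractional cut points arise is that candidate splits are taken between adjacent distinct values of the sorted attribute. The exact point Microsoft's engine lands on (2.25 here) depends on its internal scan, so treat this as a sketch of the general idea rather than the actual implementation:

```python
def candidate_splits(values):
    """Candidate cut points: midpoints between adjacent distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Whole-integer child counts still yield fractional candidate cut points.
children_at_home = [0, 1, 1, 2, 2, 3, 4, 5]
print(candidate_splits(children_at_home))  # -> [0.5, 1.5, 2.5, 3.5, 4.5]
```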


By examining the contents of the node shown in Figure 5-31, you can see that the majority of card holders with less than 2.25 children living at home (over 77 percent) have Bronze cards.

The node shown in Figure 5-32 shows the members with more than two children living at home. The numbers show that fewer of these folks become card members to begin with. Interestingly, of these members, almost 73 percent have Gold cards and more than 12 percent have Silver cards. As was the case with the $10,000 to $30,000 income bracket node, the case population is too low and concentrated to justify any more splits.


Figure 5-32. The node with more than 2 children at home.

Traveling across the nodes brings us to some interesting conclusions:

  • If you want to promote Bronze cards, target high-income members with one or two children.
  • If you want to promote Gold cards, target high-income members with more than two children.
  • Don't bother targeting low-income nonmembers unless you want to subscribe more Normal card members.

The fact is, there is a high enough number of cases in the one or two children node, as compared to the total case set, for the node to benefit from further splits. As you can see in Figure 5-33, another split is created for those who make more than $150,000 per year and those who make less than $150,000 per year.


Figure 5-33. The more than $150,000 income node.

How Splits Are Determined

Decision trees use an induction algorithm to decide the order in which the splits occur. The inductive algorithm draws conclusions (known as rules) based on repeated instances of an event rather than the logical correlation between events. For example, a purely logical algorithm would be able to determine that if John and D.J. are brothers, and D.J. and Bill are brothers, then John and Bill must also be brothers.

Inductive reasoning, on the other hand, will determine that if Box A is a wooden box and is empty, and Box B is also a wooden box and is empty, and Box C is a wooden box and is empty, and Box D is a wooden box and is empty, and Box F is a metal box and is full, and Box G is a metal box and is full, then a wooden Box H, which has never been examined, is probably empty. (This example illustrates why a higher number of cases ensures better accuracy of the data-mining model.)

In the credit card model, the over $150,000 income split was decided based on the algorithm's determination that the income attribute causes much of the data to belong predominately to one of two groups, those who make more than $150,000 per year and those who don't. To find the best split, the algorithm runs through a series of complex calculations on each attribute to find the one that will cause two groups to be created with a predominance of a single class or attribute value. This is essentially an exercise in calculating diversity.

Calculating Diversity

When it comes to deciding which attribute to use to split a node, diversity is the key factor. To calculate the diversity of an attribute, say the $10,000 to $30,000 income bracket, the algorithm counts all the cases that fit into that group and all the cases that don't. Then a calculation is made to determine the diversity level of the attribute. But what is diversity exactly? Let's say for instance that you happened to be living in a town with all the members of the "Member Card RDBMS" sample and only those members. Every person you meet on the street or in the store would fall into one of the above-mentioned categories. The odds of meeting a person who falls in the $10,000 to $30,000 income bracket as opposed to the odds of meeting someone in a higher income bracket is a measure of diversity. If there is low diversity in an income level attribute, you will meet people in one category far more often than the other. If there is high diversity, you will meet an equal number of people from both income groups. In this case, for the algorithm to decide which attribute to use for a split, it seeks the least amount of diversity possible in each group. As we can see from the previous examples, there are 2222 of the 10,281 members in the low-income category and 8059 in the other, so when you go outside, you have 2222 chances out of 10,281 to find a low-income earner, or a 21.6 percent chance. Conversely, you have a 78.3 percent chance of finding someone in the higher income bracket. To get a diversity index, the algorithm needs to determine the odds of first running into a low-wage earner and then encountering a high-wage earner. The best way to do that is to get the odds of running into the same kind of wage earner twice in a row for each type and add them together. Whatever remains after subtracting that figure from one is the odds of running into two different types, one after the other:

 1 - ( (21.6% * 21.6%) + (78.3% * 78.3%) ) = 34% 

The lower the resulting index, the lower the diversity and the better the chance that the attribute will be picked for the split. To find the lowest index, this formula is applied to all the attributes. The worst result is to wind up with a diversity index of 50 percent, which represents a perfectly even, and therefore uninformative, mix.
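The calculation above is what is usually called the Gini index: one minus the odds of drawing the same class twice in a row. A sketch using the case counts from the credit card model:

```python
def diversity_index(counts):
    """Gini-style diversity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# 2222 low-income vs. 8059 higher-income cases out of 10,281
index = diversity_index([2222, 8059])
print(round(index, 2))  # -> 0.34, the 34 percent from the text

# A perfectly even two-way mix gives the worst-case index of 0.5.
print(diversity_index([50, 50]))  # -> 0.5
```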

Upon examining the leaf nodes, we don't find any hard-and-fast rules about the members of the node. For instance, the one or two children and yearly income between $30,000 and $150,000 node indicates that nearly 81 percent of the members have Bronze cards and of the population that fits that classification, about 20 percent use other cards. Presumably, the algorithm could have easily split the rest of the population even further according to all the available attributes, such as by gender and car ownership, until the pure leaf nodes contained 100 percent of one type of card or another. At some point, the nodes would have so few members that their predictive value would be suspect. The algorithm has built-in checks and balances to make sure that such a thing doesn't happen.

How Predictions Are Derived

Once the tree is built, it can be used for one of its most important purposes, predicting the missing values for new cases. There are two ways to approach making predictions. You can pick a case and follow its pathways, which are determined by its attributes, to see what leaf node it winds up in, or you can use each leaf node to derive a new rule.

Navigating the Tree

Once a tree is filled out, you can use it to predict new cases by starting at the root node of the tree and traversing the route down the branches, based on the attributes of the new case, until you reach a leaf node. The path that the new case follows is based on the very values that caused the splits in the first place and not on the independent variables in the new instance.

Let's say that you were to examine a row in the training set for a person with the following characteristics:

Name Adam Barn
Income $167,000
Marital Status Married
Children 1

Because Adam's income does not fall into the $10,000 to $30,000 node, his data will be located on a bottom branch. Since Adam has one child at home, follow the tree upwards to the One Or Two Children At Home node. Now we know that Adam makes more than the $150,000 income required to go up to the Yearly Income Is Above $150,000 node. Adam is married, so we get to classify him in the Marital Status = M node. Now we've reached one of the leaf nodes, and the predicted value is the predominant value in that node, which happens to be Golden, referring to the Gold card. If you took all the values in the set that were used to build the data-mining model, you'd find that this particular tree is 100 percent accurate.
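Adam's path through the tree can be sketched as a cascade of tests. The thresholds follow the figures; any branch the text doesn't spell out (for example, the unmarried branch under the high-income node) returns None here as a placeholder rather than an invented card type.

```python
def predict_card(case):
    """Follow the split path described for the Member Card RDBMS tree."""
    if 10_000 <= case["income"] <= 30_000:
        return "Normal"               # low-income node: ~92 percent Normal
    if case["children_at_home"] > 2.25:
        return "Gold"                 # more-than-two-children node majority
    if case["income"] <= 150_000:
        return "Bronze"               # one-or-two-children, mid-income majority
    # above $150,000: the married leaf is predominantly Gold
    return "Gold" if case["marital_status"] == "M" else None

adam = {"income": 167_000, "children_at_home": 1, "marital_status": "M"}
print(predict_card(adam))  # -> Gold
```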

The uncanny accuracy of training sets

Why is it that a case from a training set always seems to come up accurate 100 percent of the time? Let us suppose that Microsoft develops a new kind of certification test that allows the test taker to grade himself. He would take the exam and then give himself the maximum score each time he took the test. The only problem, of course, is that he has no real way of knowing which questions were incorrect without an answer key. In the same way, acquiring data for predictions from the same source as the data used to train the model will cause the predictions to always be 100 percent accurate.

However, if the test taker met with another person who just took the test and allowed this person to grade the exam, the test taker would discover errors by comparing the answers, and the grading accuracy, though not perfect, would still be more accurate. In the same way, for a model to be effective the test cases presented to it must come from a source other than the one used to build the model. Obviously, if you could use more than one test taker to grade the exam, the accuracy of the results would increase, as would the effectiveness of the model if you used a larger sample of data to build it with.
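The self-grading analogy is easy to demonstrate: a "model" that simply memorizes its training cases is always 100 percent accurate on them, and only a held-out case exposes its errors. The tiny data set here is made up for illustration.

```python
# Training cases: (odor, edibility). The "model" is a plain lookup table.
train = [("foul", "poisonous"), ("none", "edible"), ("pungent", "poisonous")]
lookup = dict(train)

# Scoring on the training data is perfect by construction.
train_acc = sum(lookup[x] == y for x, y in train) / len(train)
print(train_acc)  # -> 1.0, always

# A held-out odorless mushroom that happens to be poisonous exposes the model.
test = [("none", "poisonous")]
test_acc = sum(lookup.get(x) == y for x, y in test) / len(test)
print(test_acc)   # -> 0.0
```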

Problem Trees

If by chance you happen to generate a pure tree, then the predictions based on the cases used to build the model will always show 100 percent accurate results. If Adam Barn's case was used as part of the training set, it's 100 percent certain that any predictions using Adam's case as a test set will be perfectly accurate. That's because the attributes used to build the tree will always match the cases. However, once you start making predictions based on cases that are independent of the training set, the 100 percent pure tree will be downright lousy at making predictions not only because the rules supported by the tree are simply too strict to reflect the real world but also because oftentimes, to reach such high levels of purity in a tree, the model has to resort to techniques that cause overfitting. Both of these problems might cause a case to fall into a category that it matches only by chance. For instance, if Bill, a holder of a Normal card, happened to be self-employed, and others in his category were not, you can envision an eager algorithm generating another split that might further divide the 86 cases in that category into 6 self-employed cases and 80 salaried cases. Of those six, another split might be created based on the card holder's educational level. If Adam and Mary, another Normal card holder, were the only Ph.D.s in the group, they might find themselves in pure leaf nodes because all the other card types for that node would be zero in quantity. However, their reasons for holding a Normal card may be completely unrelated to their educational level or employment status. Maybe Adam is a Normal card member because he has a Gold card from another company, and maybe Mary, a data-mining consultant, is too busy traveling all over the world to upgrade her card; therefore, their membership in the same group is completely coincidental.

Let's suppose for a moment that Lifetime Credit Company wants to review all of its credit card applications and propose the more expensive Gold card to those applicants determined most likely to accept it. Imagine that a new application is presented to the model for D.J. Cornfield, who has the following characteristics:

Name D.J. Cornfield
Income $280,000
Marital Status Married
Children 1
Education Ph.D.
Employment Status Self-employed

If you follow D.J.'s attributes along the decision tree, at first he seems like an ideal candidate for the Gold card, until you get to splits concerning his employment status. He's a Microsoft SQL Server consultant, and his Ph.D. is from Harvard University. Now anyone who knows D.J. knows that he's a big spender who will invariably spend a few extra dollars to get the extra service or the extra trimmings on whatever he's buying. If given the choice between the Gold card and the Normal card, you can be sure that he'll go for the Gold. However, the model knows best and will immediately decide that D.J. is actually a bad candidate for the Gold card because apparently he's an incorrigible Normal card member and therefore there is no point in wasting money on that penny-pincher.

As you can see, Lifetime Credit Company would be better off if the algorithm tempered its enthusiasm for splits. Microsoft decision trees usually does a pretty good job of not overgrowing tree branches. That said, you might need to examine the tree to make sure that the nodes contain enough cases to justify the rule that determined their membership. If you find that the tree is overgrown, you may need to eliminate some of the splits, a process commonly known as pruning the tree. Pruning involves tampering with the underlying structures, a technique we will explore in more detail in Chapter 10, "Understanding Data-Mining Structures."
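As a hedged illustration of what pruning accomplishes (using scikit-learn's cost-complexity pruning, which is one pruning technique among several and not the mechanism Analysis Services uses), raising the pruning strength collapses small branches whose splits don't pay for their added complexity:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Fabricated data standing in for the card-holder cases.
X, y = make_classification(n_samples=400, n_features=6, random_state=1)

# Fully grown tree vs. one pruned with a cost-complexity penalty:
# ccp_alpha trades training purity for a simpler, sturdier tree.
full = DecisionTreeClassifier(random_state=1).fit(X, y)
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=0.01).fit(X, y)

print(full.get_n_leaves(), pruned.get_n_leaves())
```

The pruned tree ends up with fewer leaves, each holding more cases, so the rule behind each leaf rests on a less suspiciously small share of the data.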

The other side of the coin is the node that doesn't have a predominant value. In other words, all the outcomes present in a given node have approximately the same chance of occurring. How a prediction for a test case that arrives at such a node is treated depends on how the prediction is implemented. The best method is usually to let the operator decide what to do in such a case. In practice, an application has a few options:

  • Remove the node, and use the one above it.
  • Allow for a failed, or unknown, prediction.
  • Accept the predominant value, even if it's only slightly more common than the other values.
  • Use a default value.
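
Two of these options can be sketched in a few lines of Python. The function below is hypothetical (its name, the margin parameter, and the card-type counts are all invented for illustration): given a leaf's class distribution, it accepts the predominant value only when it clearly beats the runner-up, and otherwise either fails the prediction or falls back to a designer-chosen default.

```python
def predict_from_leaf(counts, margin=0.10, default=None):
    """counts: dict mapping card type -> number of training cases in the leaf."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else (None, 0)
    # Accept the predominant value only if it beats the runner-up by `margin`.
    if (best[1] - second[1]) / total >= margin:
        return best[0]
    return default  # None signals a failed, or unknown, prediction

print(predict_from_leaf({"Gold": 70, "Normal": 10}))                    # Gold
print(predict_from_leaf({"Gold": 41, "Normal": 40}))                    # None
print(predict_from_leaf({"Gold": 41, "Normal": 40}, default="Normal"))  # Normal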

By removing the ambivalent node, you give the prediction task a chance to use a node that has a more informative distribution of target values on which to base a decision. The disadvantage of that tactic is that your prediction will be less precise: as you work your way back up the tree toward the root, the population grows larger, is described by fewer attributes, and is therefore more generalized.

If your application can identify predictions based on nearly even odds of reaching a target, you could chalk up the prediction as failed. Whether that is acceptable depends largely on the application relying on the prediction. Clearly, you should be prepared for the fact that certain attributes will simply have little or no effect on the outcome of an event. For instance, you could use a decision tree algorithm to find out whether car purchases are affected by astrological and numerological charts, only to find that the stars don't seem to have much effect on car sales. This would be obvious from the fact that sales of all car models are represented by a relatively even distribution in all the nodes. Rather than considering a failed prediction a failure, you can take the nonevent as a sign that the attributes used to arrive at a target might not accurately reflect the outcome.

Assigning default values or taking a value as the true outcome because it's represented a few percentage points more than another can be very dangerous in circumstances where the training set does not contain the exhaustive list of all the cases. Consider that even large samples of data will have some built-in imbalances because of the difficulty in obtaining a truly random sampling of data from a larger source. It's perfectly acceptable in the larger scheme of things to accept a few percentage points of difference in the weighting of some attributes in the sample as compared to the totality of the data set, but this difference cannot accurately be used as a predictor. Default values, although dangerous if misused, are actually preferable because they offer at least a certain measure of control by the designer of the data-mining model. Clearly the default values chosen would have to be incorporated into the overall strategy of the data-mining effort in such a way as to continue to provide useful predictions without having to report failure.

Navigation vs. Rules

Although navigating a tree to produce predicted values offers the best way to follow the logic of the data-mining model, it can become extremely cumbersome, especially when the tree grows large and its branches deep. As was pointed out earlier, each node in the data-mining model, when visualized in Analysis Manager, shows the description of the characteristics that define membership in that node.

This description reads very much like a rule, which shows that it's possible to derive a set of rules for a tree by establishing one rule for each leaf node. To create the rule, follow the path between the root and that leaf node. The rules for the leaf nodes in Figure 5-33, read left to right, are as follows:

 if marital status is married and
     the yearly income is greater than $150,000 and
     the number of children at home is zero, one, or two
 then the probability of a Gold card is high and
     the probability of other cards is low

If you're just looking to establish a prediction for one type of card, such as the Gold card, you can reduce the rules to just two outcomes, one for the Gold card and one for the other cards, through the judicious use of the OR connector and the AND joiner. This way of expressing the path of the tree makes it easy to describe how a decision is arrived at. In other words, if someone were to ask you who uses the Gold card, you could simply reply, "Married people who make more than $150,000 a year and have zero, one, or two children." In spite of having to go through five nodes to get it, the answer is very clear and succinct.
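Rule extraction of this kind can be automated. As a hedged sketch (scikit-learn rather than Analysis Services, with fabricated data whose feature and class names merely echo the chapter's example), `export_text` walks each root-to-leaf path and prints it as an if/then rule:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
# Fabricated card-holder cases: income and children at home.
X = np.column_stack([rng.normal(100_000, 50_000, 300),  # yearly income
                     rng.integers(0, 5, 300)])          # children at home
y = np.where(X[:, 0] > 150_000, "Gold", "Normal")       # toy labeling rule

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["yearly income", "children at home"])
print(rules)  # each root-to-leaf path reads as one if/then rule
```

Each printed path corresponds to one leaf node, exactly as described above: the conditions along the path are the AND-joined clauses, and the leaf's class is the rule's conclusion.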

When comparing rules to trees, we can make the following general statements:

  • A rule generally contains as many conditions joined by AND as there are nodes in the path from the root node to the leaf node.
  • OR can be used to combine some rules, which then reduces the total number of rules to one rule per dependent variable.

Even when not used for prediction, the rules provide interesting descriptive information about the data. There are often additional interesting and potentially useful observations about the data that can be made after a tree has been induced. In the case of the Member Card RDBMS mining model, the following observations can be made:

  • Gender and education have no effect on card types.
  • People who earn between $10,000 and $30,000 per year almost always have Normal cards.
  • Income is the most significant factor in determining card type.

By comparing these conclusions to the tree and the data set used to build the tree, the following further observations can be made about the process:

  • The observations above are based on a decision tree algorithm that tried to prioritize its splits by choosing the most significant split first. A different first split might make all the difference in subsequent nodes.
  • These observations were made from a sampling of a population of credit card holders. The generalizations made about the sample might not apply to the whole population, particularly if the extraction criterion of the sample itself introduces bias.
  • Data mining frequently analyzes information about groups of people. This type of analysis raises important ethical, legal, moral, and political issues when rules regarding gender, race, and national origin are applied to the larger population.

When to Use Decision Trees

I recommend using decision trees under the following circumstances:

  • When you want to reliably apply the segmentation scheme to a set of data that reflects a group of potential customers
  • When you want to identify possible interactive relationships between variables in a way that would lead you to understand how changing one variable can affect another
  • When you want to provide a visual representation of the relationship between variables in the form of a tree, which is a relatively easy way to understand the nature of the data residing in your databases
  • When you want to simplify the mix of attributes and categories down to the essential ones needed to make predictions
  • When you want to explore data to identify important variables in a data set that can eventually be used as a target
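
The last two points can be illustrated with a hedged sketch (scikit-learn, on fabricated data): a fitted tree reports how much each variable contributed to its splits, which helps you trim the attribute mix down to the essentials and spot candidate target variables.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Fabricated data: 5 variables, only 2 of which actually drive the class.
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=3)

tree = DecisionTreeClassifier(max_depth=4, random_state=3).fit(X, y)

# Importances sum to 1; noise columns score at or near zero.
for i, imp in enumerate(tree.feature_importances_):
    print(f"variable {i}: importance {imp:.2f}")
```

Variables with near-zero importance are candidates to drop from the model, mirroring the advice above about keeping only the attributes needed to make predictions.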

Summary

Microsoft decision trees is one of the methods, or algorithms, used to build a data-mining model. This method, although one of the most popular, is used for specific types of predictions involving the classification of cases into specific groups.

The decision tree model is used primarily for predictive purposes, but it can also help explain how the underlying data attributes are distributed. In other words, by examining a decision tree, it's easy to draw general conclusions about the population of your data.

The statistical techniques employed to build the decision trees include:

  • CART
  • CHAID
  • C4.5

Splits are determined by applying sophisticated statistical analysis to the data attributes that make up the cases. The general objective is to build trees that are as unbalanced as possible, in terms of the distribution of those attributes. In other words, the algorithm seeks to put as many of one type of attribute as possible in a given node. When possible, the algorithm will try to have all pure nodes, which means that the node contains only cases with 100 percent of a given type of variable.
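This "unbalanced node" objective can be quantified with an impurity measure. Gini impurity, used by CART (one of the algorithms listed above), is one such measure; a hand-rolled sketch:

```python
def gini(counts):
    """Gini impurity of a node, given the count of cases per class.

    0.0 means a pure node (all cases hold one card type); the maximum,
    0.5 for two classes, means an evenly mixed, least informative node.
    """
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([100, 0]))  # 0.0 -> pure node
print(gini([50, 50]))  # 0.5 -> evenly mixed node
print(gini([86, 6]))   # low impurity: one card type predominates
```

A candidate split is scored by how much it reduces the weighted impurity of the resulting child nodes relative to the parent, and the algorithm picks the split with the largest reduction.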

While building trees, it's a good idea to watch for problems that can hamper accurate predictions. These can be caused by overfitting the data or by nodes that lack a predominant value.

Decision trees also create rules. Every node can be expressed as a set of rules that provides a description of the function of that particular node in the tree as well as the nodes that led up to it.

Another commonly used algorithm to create models is Microsoft Clustering. This algorithm is designed to create a model that cannot be used to make predictions but is very effective in finding records that have attributes in common with each other. In the next chapter, we'll learn how to create a clustered data-mining model and what underlying mechanisms come into play when the algorithm trains the model.



Data Mining with Microsoft[r] SQL Server[tm] 2000 Technical Reference
ISBN: B007EMTPI0
Year: 2001