5 Microsoft Decision Trees | Data Mining with Microsoft[r] SQL Server[tm] 2000 Technical Reference

OLAP is a well-structured format designed primarily to optimize the storage of aggregated data. With OLAP, you can create persistent aggregations along hierarchical dimensions and quickly access values summed according to time dimensions, product dimensions, and geographical locations, much like the GROUP BY statement in SQL. The dimensions offer a means by which to express relationships among data fields in a way that's not easily done with relational data. For example, to store in flat relational tables the hierarchical relationships that exist between employees and their managers in a corporation's human resources database requires relatively complex logic.

OLAP's dimensions are hierarchical by design. It's easy to tell Analysis Manager that there's an employee dimension with 5000 employees at the bottom level, 500 at the middle manager level, 100 at the upper manager level, 20 at the vice president level, and one CEO at the pyramid pinnacle. Of course, this hierarchy can be used to organize just about any grouping of people, species, or product.

Although these levels are really designed to facilitate the aggregation of data by level, it just so happens that this organization is also a convenient way to provide cases for data mining. In a relational data-mining model, you can't store that data-mining model anywhere other than in the Microsoft Analysis Services structures. However, when using data organized with OLAP, you can store a model as a dimension of a cube, as a cube itself, or even as a virtual cube.

Creating the Model 1

We'll begin by creating our model with the Mining Model Wizard. For a detailed discussion of how to get to and use this wizard, please refer to ‚ Chapter 5, "Microsoft Decision Trees, " ‚ on creating data-mining models using relational data.

Note

The example in this section uses the FoodMart 2000 database that ships with Microsoft SQL Server 2000 and the decision tree algorithm to create the data-mining model. The clustering algorithm goes through the exact process to create a model with a few exceptions. I'll point out those exceptions as we go along. The clustered model created through OLAP is essentially the same as the model created with the relational source, which is why you may want to refer back to ‚ Chapter 5 ‚ from time to time.

Follow these steps to create the model using the Mining Model Wizard: 2

Select source type.
Select the source cube for your mining model.
Select the data-mining technique.
Select the dimension and level the mining model will analyze.
Select training data.
Create a dimension, a virtual cube, or both. This is an optional step.
Finish. 3

Note

You can't use either dimensions that have their Visible property set to False or Virtual dimensions to create data-mining models. Also, you can't use virtual cubes and cubes that contain calculated cells and custom members to create data-mining models.

Select Source Type

The Introduction screen requires no input, but the Select Source Type dialog box, shown in Figure 6-1, requires that you specify the data source, which in this case is OLAP.

Selecting OLAP causes the subsequent screens to be specific to cubes and dimensions as opposed to tables and fields, as was the case with relational data.

Select Source Cube and Data-Mining Technique

In the Select Source Cube dialog box, shown in Figure 6-2, you select the cube that contains the cases that you'll be using to train the model.

Figure 6-1. Select source type.

Figure 6-2. Select source cube.

Next you select the data-mining technique, as shown in Figure 6-3. You can choose between Microsoft Decision Trees and Microsoft Clustering. If there are any third-party algorithms, these will appear as well.

Note

Because you can only create data-mining models from real cubes, as opposed to virtual cubes, you cannot select the virtual cubes even though they might be displayed in the Cube pane. A virtual cube does not contain any data; instead, it serves as a logical representation of either a single cube or a join of two or more cubes. The virtual cube is similar to SQL VIEW in that regard. You also are not allowed to choose cubes that contain virtual members in the form of calculated fields.

Figure 6-3. Select data-mining technique. 5

Select Case

In the Select Case screen, select the dimension that contains the cases that are to be used to train the data-mining model. (See Figure 6-4.) Also, optionally choose the level that you're interested in using. If you don't choose a level, the wizard assumes the lowest level in the dimension. The dimension level constitutes the input fields.

Note

Dimension levels are similar to fields in a relational database, except that dimensions can be composed of multiple hierarchical levels. In a relational table, you would have a date field, but in a cube, you could have a date dimension which contains the year as the highest level. For each year, you would have twelve months in a month level, and for each month, you would have that month's allocated number of days in a day level, and so on.

Figure 6-4. Select case.

Note

Because OLAP deals mainly in aggregates, there is no notion of a case key, such as an ID, because unique row identifiers are normally lost during aggregation.

Select Predicted Entity 6

In the Select Predicted Entity screen, shown in Figure 6-5, you have three options for the source of your predictions .

Measure of the source cube
Member property of the case level
Members of another dimension

Figure 6-5. Select predicted entity.

Note

The Select Predicted Entity screen is available only if you choose OLAP with the decision tree algorithm. Unlike the decision tree algorithm, clustering does not provide any real predictive abilities . 7

A Measure of the Source Cube

If you want to make predictions based on the measures, the numerical values, in the cube, you would choose the A Measure Of The Source Cube option. Bear in mind that because Microsoft Decision Trees is designed primarily for discrete values, it doesn't deal with numerical values in the way you might normally expect because of the tendency of the Microsoft Decision Trees algorithm to use the numbers to make small numbers of discrete groups known as bins .

As was mentioned in ‚ Chapter 5 ‚ , Microsoft Decision Trees will have a tendency to bin a wide range of numbers into a much smaller set of discrete values that it can place in a decision tree node. That's fine unless you want the decision tree to predict a very specific value.

Member Property of the Case Level

All dimension levels in OLAP can contain member properties to aid in describing that level. For example, employee levels will likely be expressed as employee numbers so that OLAP can create aggregations based on those numbers. However, when it comes time to generate a report or to display the data from the cube, you're going to want to display employee names, which are more descriptive, even if the names aren't suitable for aggregation. Each level can have multiple properties associated with it, and each of those properties can serve as predictive attributes for the data-mining model. One of those properties is chosen in Figure 6-5. 8

Members of Another Dimension

If there is relationship between the dimension containing the cases and another dimension, you can use that related dimension as the source of your prediction attributes. If the dimension you choose does not have any relation to the cases, then you'll most likely generate an error at processing time because there will be no entries in the model.

Select Training Data

In the next step, you select data for training your model. The dimension that you selected in the Select Case screen is selected by default, but you must also choose at least one other dimension as a case. (See Figure 6-6.) Failure to do so will not cause an error, but will result in a rather flat tree.

Figure 6-6. Selecting training data.

Select Dimension and Virtual Cube 9

The next step is optional but offers some very powerful features available only when using OLAP as a data source and Microsoft Decision Trees as your data-mining algorithm. Regardless of the options you choose, a data-mining model will be created in the Analysis Services structures. However, the Create A Dimension And Virtual Cube screen, shown in Figure 6-7, asks if you also want to create a data-mining dimension and a virtual cube as output from the data-mining algorithm. If you choose to create the dimension, you'll also be able to create the virtual cube. You can only create the virtual cube if the dimension is first chosen as an output option.

Note

If the source cube happens to contain a measure that is a distinct count, the option will be unavailable to you. Select Dimension And Virtual Cube.

Figure 6-7. Virtual dimension and cube.

The Dimension

The dimension is the result of the output of the data-mining model. If you look at any OLAP dimension, you'll notice that it forms a hierarchical tree in which branches may have sub-branches, each of which may have sub-branches of its own.

Analysis Services capitalizes on the tree-like structure of the dimension by replicating the tree structure of the Microsoft Decision Trees data-mining model into an OLAP dimension tree. 10

After completing this data model, we'll discuss the contents of that dimension and how it can be used to analyze data.

The Virtual Cube

The virtual cube is almost identical to the cube from which the cases came except that it also contains the dimension that was created in the current session. Even if you choose not to build a new dimension or a new virtual cube, you'll still create a data-mining model. Go ahead and create the virtual dimension. Later in this chapter, you'll get a chance to use this cube to better analyze your source cube.

Completing the Data-Mining Model

The last step, shown in Figure 6-8, is to name the data-mining model.

Figure 6-8. Naming the data-mining model. 11

One click on the Finish button and the process of creating dimensions and cubes and building the data-mining model begins. A processed cube is shown in Figure 6-9.

Figure 6-9. A successfully processed cube.

A Word About Transactions

There are multiple tasks that need to be completed in succession when processing a cube or a data-mining model. These include:

Creating structures
Querying source data 12
Inserting data into the structures
Creating calculated fields

Before Analysis Services declares that the cube or data-mining model is complete, it checks that all the steps have been completed. If any of the steps fail, all the previous steps must be undone lest they get partially processed and used for the model. This could produce a cube or a data-mining model that looks usable but is actually incomplete. The following figure shows a tree-like structure of all the processes that took place while trying to create a cube for which one of the steps was not completed. The lower tasks need to be complete before the upper tasks of the tree can be completed. If any of the tasks fail, as happened in the model shown here, all the dependent tasks also fail.

To insure the integrity of an OLAP or data-mining structure, Analysis Services uses a mechanism known as transactions. Transactions allow you to bunch multiple steps into a single task that must either succeed or fail as one unit. If the individual steps succeed, the engine issues a commit; if any step fails, the engine issues a rollback, which restores the data structures to their original state.

Transactions are used to insure the integrity of these operations by wrapping the multiple steps involved. RDBMS engines are best able to accomplish this because the same engine that's responsible for the manipulation of the data is also responsible for its storage. This makes it easy for the engine to have full visibility of the transaction completion stage and gives it the ability to physically undo any writes to the data structures. Analysis Services has a great deal of control over all the events except for the data storage because the data is actually stored in files that are the responsibility of the operating system. To get around this, Analysis Services makes all the changes in temporary files, and once the tasks are complete, it overwrites the original files with the temporary ones.

When you look at a data-mining processing screen in action, you'll notice that the first statement is an Initiate Transaction statement followed by a Committing Transaction statement. Once the transaction is complete, the Initiate Transaction statement becomes the Transaction Complete statement. 13

OLAP Mining Model Editor

When the processing is done, click the Close button and wait for the OLAP Mining Model Editor to appear. (See Figure 6-10.)

Figure 6-10. OLAP Mining Model Editor.

This editor functions in basically the same way as the Relational Mining Model Editor. The few differences are because OLAP is the source of the model and not a relational database.

Content Detail Pane

The first thing you'll notice is that the nodes in the trees don't have regular field names as they do with relational data, although they still function the same as with relational data. They have the name, level, and property of the OLAP dimension used to derive the rules for inclusion in that node. 14

Structure Panel

Figure 6-10 shows that the structure is represented not by fields as in the relational data-mining model, but as a hierarchy of levels representing the structure of the dimensions within the cube. You can make changes to the structure of the model by adding, changing, or removing dimensions, levels, and properties for the Input or the Predictable columns .

Prediction Tree List

The Prediction Tree list contains the various decision tree structures that are in the model. Each decision tree is represented by the prediction field that will be derived through its use. In this example, when the choices in the drop-down menu are presented to us, we see a fully qualified property from a level in a dimension as the option value.

If you followed the steps on pages 115, 123, you've created a virtual cube and a virtual dimension in addition to a data-mining model. These structures are excellent tools for the power of OLAP to provide leverage to better understand the data-mining models. Now that you know how to create a virtual cube and a data-mining model from the cube, it's time to discuss the ways and methods used to analyze data. Before we get to this, let's look at how an OLAP-based data-mining model differs from the one we created in ‚ Chapter 5 ‚ using relational data. Gaining an understanding of this part will make the subsequent sections involving the use of the virtual cubes and dimensions much clearer. 15

Analyzing Data with the OLAP Data-Mining Model

Creating a data-mining model from OLAP is similar in many ways to creating it from a relational database source.

The undirected data-mining tasks that we're seeking to accomplish would seem to be ultimately the same regardless of the source, except for the fact that OLAP, unlike the relational model, offers us the unique possibility of relating the data-mining model itself to the OLAP cube that was used as the source.

Note

The data-mining model, regardless of the source, requires a flattened source, even if you're using OLAP as the source. To flatten the source, the engine creates an intermediary structure to allow the dimension levels to be allocated to individual fields. This will cause the OLAP structure to appear as a regular two-dimensional table with one column per dimension level for the time it takes to populate the model.

Remember that even if relational databases are sometimes used for decision support, they are best used as support for transaction processing, such as order entry, accounting, and billing. A valuable by-product of these transaction processing tasks is the table data that can later be mined. But because these relational databases are not generally designed for ease of reporting, mining this data requires putting it into a separate data-mining model, which tends to disconnect the source from the model. Once the model is created, it's often somewhat difficult to go back and find out how that model applies to the data in those tables.

In contrast to the relational databases, OLAP is designed for reporting and decision support to the practical exclusion of transaction processing. In fact, it's hard to imagine how OLAP could ever be adapted to transaction processing at all. However, whatever shortcomings exist in OLAP's transaction processing ability are more than compensated for by its ability to provide structure to existing data. As with data mining, OLAP uses relational data as a source and puts the data in its own structure, often disconnecting itself from the original source of data. Unlike data mining, OLAP provides primarily numerical data in the form of sums, counts, or averages. The dimensions simply provide a context to those numbers in such a way that the summed numbers can be viewed according to specific dimensional characteristics. OLAP cubes allow a user to browse this data and see how sales figures relate to time, geography, and store front locations. Data-mining models are not really capable of indicating anything about sales figures as they relate to time and geography, but they can do something that OLAP cannot. They can function beyond their predictive capabilities and tell you, for example, that of your 1000 stores located nationwide , 50 of these share common characteristics which account for 25 percent of your sales.

Wouldn't it be nice to know what the sales figures are for those stores that seem to have something in common? If it happened that their sales were unusually low or unusually high, then it could be cause for further investigation and perhaps even a call to action. 16

If the source of that data was a relational model, it would be possible to issue a query that takes into account the characteristics of the decision tree nodes and then sum those sales figures to get an answer. With a few more queries, it would then be possible to compare these figures to other figures and see whether that value is low or high.

If the source is OLAP, Analysis Services has a set of tools that allow you to analyze those sales figures using the generated model. In other words, you could instantly see the customer sales figures described by one of the nodes without having to issue any queries.

Using the Generated Virtual Cube

If you created a data-mining model as outlined earlier in this chapter, you have a virtual cube as well as a data-mining model. To browse the contents of the cube, right-click on your cube and choose Browse Data, as shown in Figure 6-11.

Figure 6-11. Browsing the cube data.

The cube browser window uses a flex-grid control to display the data. The lower part of the screen contains the numerical data, or measures, which is aggregated according to the dimensions chosen in the drop-down lists at the top of the window. 17

In a virtual cube, all the measures actually originate from the cube that was used when processing the data-mining model. All the dimensions also come from that cube, so in essence, this virtual cube is an exact representation of the original cube to which it's linked. In other words, this virtual cube contains all the data that was used to train the data-mining model. There is one extra element, and that is the dimension that was created along with this cube. (See Figure 6-12.)

Figure 6-12. Virtual dimension levels.

You'll notice that this dimension contains levels that are placed in a similar structure as the nodes of the decision tree present in the data-mining model. In fact, every node in the decision tree has an equivalent level in the virtual dimension. This comes in handy when analyzing data from the cube. Before analyzing the cube, let's look at what makes up a dimension.

Using the Generated Dimension

Dimensions originate in hierarchies derived from data in tables. For example, an employee database can have various of levels of management. These levels are identified by the operator who defines the cube and are then picked up by the OLAP dimension processor, which creates a structure that represents the management hierarchy in the database.

Another way to create a dimension is to use Multidimensional Expressions (MDX) to create calculated member levels containing very specific records. This allows the member levels to include only those records that meet the criteria for the nodes in the decision tree. 18

What Is MDX?

Wherever large quantities of formatted, structured data is stored, a query language is nearby waiting to retrieve it. Microsoft SQL Server, DB2, and Oracle use SQL as the standard query language, and Microsoft OLAP uses MDX, a query language tailored specifically to access cubes. MDX is also used to create calculated members and custom member formulas that specify rules for inclusion in a data set.

At first glance, MDX looks like SQL because they share some key words, but it's actually a very different language. An in-depth discussion of MDX is beyond the scope of this book, but the Analysis Services portion of the SQL Server Books Online, and other books, such as Microsoft OLAP Unleashed, by Timothy Peterson (SAMS, 1999), offer longer discussions on this technology.

Besides having the same structural look, dimensions are also similar to decision trees in that they both use rules of inclusion in a member set. In the same way that a decision tree node applies rules to determine which case records are included, cube dimensions use MDX.

To get a better idea of how this process works, go to the Shared Dimensions folder of the cube, right-click your dimension, and choose Browse Dimension Data. (See Figure 6-13.)

Note

Dimensions derived from a data-mining model are easily recognized by the miner's pickaxe icon.

Figure 6-13. Browse the dimension data.

The dimension structure and the MDX expression used to make that level are shown in Figure 6-14.

Notice that the expression that describes the Yearly Income Over $150,000 is identical to the one in the node of the tree that has the same rules. (See Figure 6-15.)

Figure 6-14. The virtual dimension comprised of MDX statements.

Figure 6-15. The virtual dimension level with income 150K+. 20

The virtual dimension level feature is significant because it allows you to browse the data in the cube according to the nodes that are present in the data-mining model. Browsing in this manner allows you to view the measures in the cube according to the rules present in the mining model. (See Figure 6-16.)

Figure 6-16. The actual sales figures by dimension level.

As you can see in this example, the unit sales can be measured according to their membership in a given data-mining model node. Notice that the people who earn lower incomes are responsible for far lower sales than people in higher income groups. This might or might not be a surprise depending on the kinds of products the company sells.

As compared to the relational data-mining model, the OLAP-based model offers unprecedented analytical power to the user. With the power afforded by MDX and calculated members, it's possible to find what percentage of the overall population each member group represents, or the standard deviation, to get a better understanding of the data.

Note

Whenever OLAP uses calculated fields or calculated members as is the case with the generated dimensions, there is a real possibility that queries and browsing done with these dimensions will be far slower than with ordinary dimensions, mostly because the cube is making the calculations on the fly, which precludes OLAP from taking advantage of the speed gains that come with using stored aggregates.

Summary 21

As far as the data-mining portion of Analysis Services is concerned , all data-mining models created with decision trees are treated exactly the same regardless of the structure of the source data because the data-mining processor formats the information in the same way before training the model. All the information is converted to single-table structures with columns and rows so that all the information is temporarily placed in a flat structure that the data-mining processor can deal with. As a result, a data-mining model created with OLAP is fundamentally the same as one created using a relational data source.

By using OLAP as the data source, Analysis Services is able to provide unique options during the creation of the data-mining model that is not be available if using a relational database as the source for the case data. You can choose to create additional OLAP structures, such as a virtual cube and a virtual dimension. These are extremely valuable in that they offer an operator who wants to browse the data the ability to do so with the help of the patterns that were discovered and organized in the data-mining model.