Data Mining Structure | Delivering Business Intelligence with Microsoft SQL Server 2005: Utilize Microsofts Data Warehousing, Mining & Reporting Tools to Provide Critical Intelligence to A

As we work with data mining in the Business Intelligence Development Studio, we use a data mining structure to contain everything we develop. The data mining structure enables us to define the data we will use, either from a relational database or from an OLAP cube, as the source for the mining. It also lets us define one or more data mining algorithms to apply to that data.

Data Columns

The first step in doing any mining is to select the data being mined. We select a data source: either relational data or an OLAP cube. Once this is done, we need to select a table if this is a relational data source, or a dimension if this is an OLAP data source. Finally, we must select the table columns or dimension attributes to be used as the data columns for our data mining.

Although an initial table or dimension must be selected for data mining, it is possible to select additional related tables or dimensions for inclusion. These are known as nested tables. If a relational database is being used, the initial table and the nested table must be linked by a foreign key relationship (or a chain of foreign key relationships). If a cube is being used, the initial dimension and the nested dimension must be linked through the fact table.

Data Mining Model

The data mining model combines the data columns with a data mining algorithm. Additionally, we must determine how each data column should be used by the data mining algorithm. This process determines how the data mining algorithm functions and what it predicts.

Data Column Usage

Each column in the data mining structure must have one of the following usages:

Key The key is the unique identifier for a table or a dimension. The key is not used to uniquely identify records or members and it is not used by the data mining algorithm to predict values.
Input Input columns are used by the data mining algorithm when making a prediction. The input values are the values that influence the prediction made by the algorithm.
Predict A predict is a data column whose value is being predicted by the data mining algorithm. This column can also be used as an input column. If data column A and data column B are both predict, then data column A will serve as an input when the algorithm is predicting data column B, and vice versa.
Predict Only A predict only is a data column whose value is being predicted by the data mining algorithm. This column cannot be used as an input column when predicting other columns.
Ignore This data column is not used by the data mining algorithm.

Training Data Set

In Chapter 12, we discussed the fact that data must be divided into training and testing data sets. For the exercises in this book, we will use an OLAP cube as our data source. If we were to use a relational data source, the data would need to be split up elsewhere. It is a bit more challenging to divide the data when it resides in a single OLAP cube. To accomplish this division, we use the cube slice.

Cube Slice

The cube slice enables us to specify an expression that divides the cube into two parts. The portion of the data that satisfies the expression is fed into the data mining algorithm for training. The rest of the data will be testing data set.

Learn By Doing—Defining a Data Mining Model

Features Highlighted

Creating a data mining structure based on an OLAP cube
Defining data mining models
Slicing the source cube

Business Need The Maximum Miniatures Marketing Department wants to do a targeted mailing promoting the Mythic World line of figurines. Previous research has shown that the buyers of the Mythic World line of products do not have any children living at home. Unfortunately, the list purchased for the Mythic World mailing does not include the number of children living at home statistic for each household, but it does include the following facts about each household:

Number of cars owned
Marital status
Whether the address is a rental property or occupied by the owner

The Marketing Department would like to find a way, using the three facts included in the mailing list data and the information known about current customers, to predict which households have no children living at home. The mailing will then be sent only to those households likely to have no children living at home.

Wholesale customers have customer numbers below 5000. Retail customers have customer numbers of 5000 and above. Of course, we should only use retail customers as our data for this mining operation.

Steps

Look at the business requirements and determine which data mining task we need to accomplish. In this case, we want to classify the households in the new mailing list as either having no children living at home or having children living at home.
Select the data mining algorithms to perform this task. Four algorithms perform classification: Microsoft Decision Trees, Microsoft Naïve Bayes, Microsoft Clustering, and Microsoft Neural Network.
Determine where the data for the mining operation will come from. We have an OLAP cube full of customer data. This can provide an efficient source for our mining operation.
Open the Business Intelligence Development Studio.
Open the MaxMinSalesDM project.
If the Max Min Sales DM cube has not been deployed, select Build I Deploy MaxMinSalesDM to deploy it.
When the deployment is complete, right-click the entry for the MaxMinSalesDM project in the Solution Explorer window. Select Process from the Context menu. The Process Database - MaxMinSalesDM dialog box appears.
Click Run. The Process Progress dialog box appears.
When the process has succeeded, click Close to close the Process Progress dialog box. Click Close to close the Process Database - MaxMinSalesDM dialog box.
Right-click the Mining Structures folder in the Solution Explorer window and select New Mining Structure from the Context menu. The Data Mining Wizard appears.

Click Next. The Select the Definition Method page of the wizard appears as shown in Figure 13-1.

Note

Data Mining Structures can use both relational databases and OLAP cubes for data. We could just as easily have used the MaxMinSalesDM relational database as the source for this cube. Relational databases designed for transactional operations, rather than for a data mart, can also be used.

Select the From Existing Cube radio button. Click Next. The Select the Data Mining Technique page of the wizard appears as shown in Figure 13-2.
Select Microsoft Decision Trees from the drop-down list. Click Next. The Select the Source Cube Dimension wizard page appears as shown in Figure 13-3.
Select the Customer Dimension. This is the main source of information for this mining operation. Click Next. The Select the Case Key page appears as shown in Figure 13-4.
Leave the Customer Name attribute selected. This attribute serves as the primary key for the mining operation. Click Next. The Select Case Level Columns wizard page appears as shown in Figure 13-5.
The columns selected here serve as the input and predictable columns in our model. Check the following attributes:
- Houseowner
- Marital Status
- Num Cars Owned
- Num Children At Home
Click Next. The Specify Mining Model Column Usage wizard page appears.
Check the check box in the Predictable column for Num Children At Home. Uncheck the check box in the Input column for Num Children At Home. The Specify Mining Model Column Usage page appears as shown in Figure 13-6.

Click Next. The Specify Columns' Content and Data Type page of the wizard appears as shown in Figure 13-7.

Note

In Chapter 12, we said that some data mining algorithms can predict discrete values, while others can predict continuous values. In this case, we are using algorithms that require discrete values. Fortunately, the properties we are using all have a discrete set of values. (Technically, number of children at home could be a continuous set of values, but reality limits it to a discrete set of choices.) It is possible to force a continuous set of values to be discrete. We do this by choosing Discretized as the Content Type, and then selecting a method that groups continuous values into a discrete number of buckets (that is, ages 11–15, ages 16–20, and so on). Often, it is much easier to do analysis on discretized values than on continuous values. For the purposes of a given analysis, the buying habits of 16–20-year-olds may be similar enough so that we can study them as a group in one discretized bucket.

Click Next. The Slice Source Cube page of the wizard appears.
We need to use only retail customers for our data mining (Account Num attribute of 5000 and above). In addition, we need to split our data into a training data set and a testing data set. We will use customer numbers 5000 to 25000 for training and customer numbers above 25000 for testing. Select Customer from the drop-down list in the first row of the Dimension column.
Select Account Num from the drop-down list in the first row of the Hierarchy column.
Select Range (Inclusive) from the drop-down list in the first row of the Operator column.
Click in the first row of the Filter Expression column. Two drop-down lists appear. Activate the drop-down list on the left.
Expand the All entry and scroll down and select the 05000 entry. Click OK.
Activate the drop-down list on the right. Scroll to the bottom of the list. You can see it only goes up to 05496. To preserve performance, not all of the members are loaded into the list. We must filter the list to get to the member we want.
Click the Filter Members link. The Filter Members dialog box appears.
In the drop-down list in the Property column, select Name.
In the drop-down list in the Operator column, select Begins With.
Type 250 in the Value column.
Click Test to view the result of the filter. The Filter Members dialog box appears as shown in Figure 13-8.
Click OK to exit the Filter Members dialog box. We return to the Slice Source Cube page of the Data Mining Wizard.
Select 25000 in the list and click OK. The Slice Source Cube page appears as shown in Figure 13-9.
Click Next. The Completing the Wizard page appears.
Enter Classification - Children At Home for Mining Structure Name.
Enter Decision Trees - Children At Home for Mining Model Name. The wizard page appears as shown in Figure 13-10.
Click Finish. The data mining structure containing our new data mining model is created and the Data Mining Design tab appears as shown in Figure 13-11.
Select the Mining Models tab on the Data Mining Design tab.

Note
If needed, we can change the column usage right on the Mining Models tab, as shown in Figure 13-12. However, we do not need to make any changes here for our model.
Click the Create a Related Mining Model button in the Mining Models Design tab toolbar. The New Mining Model dialog box appears.
Enter Naive Bayes - Children At Home for Model Name.
Select Microsoft Naive Bayes from the Algorithm Name drop-down list. The New Mining Model dialog box appears as shown in Figure 13-13.
Click OK.
Again, click the Create a Related Mining Model button in the Mining Models Design tab toolbar. The New Mining Model dialog box appears.
Enter Clustering - Children At Home for Model Name.
Select Microsoft Clustering from the Algorithm Name drop-down list.
Click OK.
Once more, click the Create a Related Mining Model button in the Mining Models Design tab toolbar. The New Mining Model dialog box appears.
Enter Neural Network - Children At Home for Model Name.
Select Microsoft Neural Network from the Algorithm Name drop-down list.
Click OK. The Mining Models tab appears as shown in Figure 13-14.
Click the Save All button in the toolbar.
Click the Process the Mining Structure and All Its Related Models button in the Mining Models Design tab toolbar. A dialog box informs you the server content appears to be out-of-date.
Click Yes to deploy the project.
When the deployment is complete, the Process Mining Structure dialog box appears, as shown in Figure 13-15. Click Run. The Process Progress dialog box appears.
When the models have all been trained and the process has been completed successfully, the Process Progress dialog box appears as shown in Figure 13-16. Click Close to exit the Process Progress dialog box.
Click Close to exit the Process Mining Structure dialog box.

image from book
Figure 13-1: The Select the Definition Method page of the Data Mining Wizard

image from book
Figure 13-2: The Select the Data Mining Technique page of the Data Mining Wizard

image from book
Figure 13-3: The Select the Source Cube Dimension page of the Data Mining Wizard

image from book
Figure 13-4: The Select the Case Key page of the Data Mining Wizard

image from book
Figure 13-5: The Select Case Level Columns page of the Data Mining Wizard

image from book
Figure 13-6: The Specify Mining Model Column Usage page of the Data Mining Wizard

image from book
Figure 13-7: The Specify Columns' Content and Data Type page of the Data Mining Wizard

image from book
Figure 13-8: The Filter Members dialog box

image from book
Figure 13-9: The Slice Source Cube page of the Data Mining Wizard

image from book
Figure 13-10: The Completing the Wizard page of the Data Mining Wizard

image from book
Figure 13-11: The Mining Structure tab on the Data Mining Design tab

image from book
Figure 13-12: Changing the column usage

image from book
Figure 13-13: The New Mining Model dialog box

image from book
Figure 13-14: The Mining Models tab on the Data Mining Design tab

image from book
Figure 13-15: The Process Mining Structure dialog box

image from book
Figure 13-16: The Process Progress dialog box