Analysis Services 2005 provides two types of mining models: relational mining models and OLAP mining models. Relational mining models are created directly from a relational data source, while OLAP mining models are created from an existing cube or part of a cube. The nine data mining algorithms can be used within the context of either relational or OLAP mining models. In this chapter you will learn about both model types by creating mining models with a few of the algorithms and analyzing the results.
The Adventure Works DW sample relational database contains specific patterns that demonstrate the various algorithms available in Analysis Services 2005. In this section you learn how to create and analyze a decision tree model and a clustering model. To explore and analyze the information, you first need to create a mining model. When you build a mining model, Analysis Services 2005 retrieves data from the data source and stores it in a proprietary format. If you built several mining models from the same data set, that data would be stored redundantly on Analysis Services 2005. To share data across several mining models, Analysis Services 2005 stores the shareable information in an object called a mining structure. Internally, the information read from relational data sources is stored as a cube so that the data can be retrieved efficiently during mining model creation. The mining structure stores the data types of the mining model's attributes and their corresponding columns in the data source, and it allows you to modify certain data mining properties that are common to all of your mining models.
Have you ever received coupons in the mail? If you have a postal address, you have. Retail companies used to send coupons to all customers, and even to some who weren't customers. That was expensive and far from efficient. To minimize cost and maximize profit, companies now use data mining to select the targets for coupon or other special mailings. Based on certain attributes, retail companies can classify customers into several groups (for example, Gold, Silver, or Bronze membership) and clearly identify the unique characteristics of each group. They can then mail only to the targeted groups instead of to every address on file. This practice saves marketing money and yields a better probability of making sales.
The following steps show you how to solve the targeted mailing type problem by creating a relational mining model on top of the vTargetMail view in the Adventure Works DW database. To create a relational mining model, you first need a data source view containing the table(s) on top of which you want to build a mining model.
Create a new Analysis Services project named DM2005Tutorial, and create a data source pointing to the Adventure Works DW relational database.
Create a DSV that includes the vTargetMail view in the Adventure Works DW.
vTargetMail is a view that retrieves information from several tables in the Adventure Works DW database. It contains information about customers who buy bicycles; based on this information, you can identify potential customers who are likely to buy bicycles. The view has been specifically designed to contain patterns that can be identified by the data mining algorithms. As its name indicates, vTargetMail demonstrates the usefulness of data mining: customers can be categorized based on their attributes, and targeted mail with discounts or other promotions can be sent only to the customers who are likely to buy bicycles.
Similar to the wizards that helped you create cubes and dimensions, there is a wizard for creating data mining structures. To create a relational mining model, right-click the Mining Structures folder in the Solution Explorer and select New Mining Structure, as shown in Figure 14-2.
The welcome page provides information on the data mining wizard. Click the Next button on the Welcome page.
You will now see the Select the Definition Method page as shown in Figure 14-3. This page allows you to create a mining model from a relational data source or from a cube. Select the "From existing relational database or data warehouse" radio button and then click Next.
On the Select the Data Mining Technique page you can select the data mining technique to use for modeling. If you click the drop-down list box you can see all the available algorithms, as shown in Figure 14-4. Analysis Services also provides the option of adding your own data mining technique; if you have added and exposed a custom data mining technique, you will see it in this drop-down list box. Select the Microsoft Decision Trees algorithm and click Next.
On the Select Data Source View page, select the DSV that contains vTargetMail (the DSV you created in Step 2) and click Next.
The Specify Table Types page, shown in Figure 14-5, allows you to select, from the tables and views in your DSV, the table(s) upon which to create the mining model, with two selections for each table: Case (the primary table) and Nested. Certain algorithms, used for problems such as market-basket analysis, need to analyze data across multiple tables; in such cases you select certain table(s) as nested tables. Typically there is a one-to-many relationship between the case and nested tables. Select vTargetMail as the case table and click Next.
On the Specify Training Data page of the wizard you need to select the columns from the source table(s) that are to be used in creation of mining models. In addition, you need to specify whether a specific column should be used as a key column, input column, or predictable column. If you specify a column as an input column, Analysis Services uses this column as an input to the mining model for determining patterns. If a specific column is marked as predictable, Analysis Services allows you to predict this column for a new data set based on the existing model if the input columns for the new data set are provided.
In the current data set you want to predict whether a customer is a potential buyer of bikes. The column BikeBuyer indicates whether an existing customer has bought bikes before, so you need to mark this column as a predictable column. Having done so, you need to identify the potential factors that can influence a customer to buy bikes; if you think certain columns capture such factors, we recommend you select those columns as input columns. The wizard can also recommend input columns by analyzing a sample data set. Click the Suggest button on the page.
The wizard analyzes a sample of the data set and lists the columns that are related to the selected predictable attribute BikeBuyer, as shown in Figure 14-6. The Score column indicates how closely an attribute is related to the BikeBuyer column; a higher number indicates a stronger relationship, meaning the column is more likely to influence the chosen predictable column. Based on the score, the wizard automatically selects certain columns as input columns. You can deselect these attributes or select additional attributes that you think might influence a customer's decision to buy bikes. Click OK to continue.
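The wizard's exact scoring formula is not documented here, but a dependency measure such as mutual information conveys the idea: a column whose values vary together with BikeBuyer scores higher than one that does not. The following Python sketch is purely illustrative (the function name and data are ours, not the wizard's implementation):

```python
from collections import Counter
from math import log2

def relatedness_score(column, target):
    """Mutual information between a candidate input column and the
    predictable column -- one plausible dependency measure, not the
    wizard's actual scoring formula."""
    n = len(column)
    joint, px, py = Counter(zip(column, target)), Counter(column), Counter(target)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# Toy data: cars_owned tracks the target exactly, gender is unrelated.
bike_buyer = [1, 1, 0, 0, 1, 0, 1, 0]
cars_owned = [0, 0, 2, 2, 0, 2, 0, 2]
gender     = ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']
print(relatedness_score(cars_owned, bike_buyer) >
      relatedness_score(gender, bike_buyer))  # True
```

A column with a higher score is a stronger candidate for the input column list, which matches how the Suggest dialog ranks attributes.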
The selections you made on the Suggest Related Columns page are now reflected on the Specify the Training Data page, as shown in Figure 14-7. Select the columns Age, Commute Distance, English Education, English Occupation, Gender, House Owner Flag, Marital Status, Number Cars Owned, Number Children At Home, Region, Total Children, and Yearly Income as input columns.
The selected columns, along with their data types, are shown on the Specify Columns' Content and Data Type page. As shown in Figure 14-8, the Content Type column indicates how each selected column will be used by Analysis Services while creating the mining model. You learn more about these content types while refining this model. For now, change all of the Continuous content types to Discrete except Yearly Income. The mining model wizard maps the relational data type of each column to the corresponding data type used within Analysis Services 2005. Click Next.
Similar to the completion pages of dimension and cube wizards, you can specify a name for the mining structure object. Enable the Allow Drill through option so that you have the ability to see additional details when you browse the Mining Model. Click Finish to create the mining model. Since each mining model is within a mining structure, Analysis Services automatically creates a mining structure with the same name as the mining model.
The mining structure object with a decision tree mining model can be seen in the mining model editor, as shown in Figure 14-9. The mining model editor contains five views: Mining Structure, Mining Models, Mining Model Viewer, Mining Accuracy Chart, and Mining Model Prediction. By default you will be in the Mining Structure tab. The Mining Structure view contains two panes. The DSV pane shows the tables that are part of the mining structure and allows you to perform the operations available within a DSV. The pane on the left shows the columns of the mining structure in a tree view. You can delete existing columns, or add columns to the mining structure by dragging and dropping them from the DSV. The properties of a column can be edited in the Properties pane when the column is selected.
Figure 14-10 shows the Mining Models view, which shows the mining models in the current mining structure. You can have one or more mining models within each mining structure, and the columns of the mining structure are by default inherited as columns of each mining model. Each column of a mining structure can serve a specific purpose in a mining model: it can be used as an input column, as a predictable-only column, as both input and predictable, or not used in the mining model at all. These four usages are represented as Input, Predict Only, Predict, and Ignore, respectively, and can be selected from the drop-down list box corresponding to a column and a mining model. You can add mining models to a mining structure by right-clicking in the Mining Models view and selecting New Mining Model. The mining structure wizard detects the content type of the mining model columns based on a sample of the data.
The mining model editor is used to refine the mining model created by the mining model wizard. You will learn to make refinements by making two changes to the decision tree mining model you have just created: changing the content type of the Age column and the usage of the Bike Buyer column. Age can be treated as a discrete attribute, because the value is recorded as an integer between 0 and 100. However, if you have ever participated in a market survey, you know that surveys generally ask for your age within a specific range rather than your exact age. In this example you will likewise model Age as a range rather than as a discrete value. This content type is called Discretized: the values are split across N ranges, and each value of Age is assigned a new value based on the range it falls in. The number of ranges is controlled by the DiscretizationBucketCount property; based on its value, Analysis Services 2005 identifies appropriate ranges from the minimum and maximum values of Age.
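To make the Discretized content type concrete, here is a minimal Python sketch of equal-width bucketing. This shows only the simplest possible scheme; Analysis Services offers several discretization methods (via the DiscretizationMethod property), so treat this strictly as an illustration of the idea:

```python
def discretize(value, min_val, max_val, bucket_count):
    """Map a continuous value into one of bucket_count equal-width
    ranges between min_val and max_val. A simplification: Analysis
    Services chooses ranges by its own discretization method, not
    necessarily equal widths."""
    width = (max_val - min_val) / bucket_count
    bucket = int((value - min_val) / width)
    return min(bucket, bucket_count - 1)  # the maximum value lands in the last bucket

# With DiscretizationBucketCount = 10 over ages 0-100,
# each bucket spans roughly a ten-year range.
print(discretize(34, 0, 100, 10))   # 3  (ages 30-39)
print(discretize(100, 0, 100, 10))  # 9  (clamped into the last bucket)
```

After discretization, the algorithm sees ten age-range states instead of up to a hundred distinct ages, which generally produces more robust splits.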
In the Mining Models view, select the Age column under the Structure column. The properties for the Age column appear in the Properties window, as shown in Figure 14-11. Change the Content property from Discrete to Discretized and the DiscretizationBucketCount property to 10.
The usage of the Bike Buyer column was initially set to Predict Only because that is what you selected in the wizard. Change the usage to Predict. The value Predict means that the Bike Buyer attribute will be used as an input as well as an output of the mining model. By choosing Bike Buyer as an input column, you provide additional information to the algorithm so that the model more accurately represents the input data.
Having completed these refinements to the decision tree model, you can now deploy it to the server, just as you would a cube or dimension. Press F5 to deploy the mining model.
Business Intelligence Development Studio sends the definition of the entire project to the server along with a process request. Once the database is processed, BIDS switches to the Mining Model Viewer, as shown in Figure 14-12. The decision trees algorithm identifies the factors that influence customers to buy bikes, splits the customers based on those factors, and stores the splits within the model. This information is best visualized as a tree: the mining model viewer represents the contents of the model as a tree view containing a series of nodes. The root of the tree is a single node that represents all customers. Each node shows the percentage of customers (indicated by the horizontal bar within the node) who have bought bikes, and each node is split into multiple nodes based on the next most important factor that determines why a customer has bought a bike. The tree contains nodes at levels from 1 to N based on the number of splits determined by the decision tree algorithm; each node is associated with a specific level, with the root node at level 1, the topmost level. The depth of the tree is measured by the number of splits or levels. In the mining model viewer you can choose to view the tree up to a specific level by selecting the option from Default Expansion or Show Level. Figure 14-12 shows nodes whose horizontal bars are shaded with two colors; in the product you will see these shaded regions in red, blue, or both. The Mining Legend window shows the legend for the colors in a node's horizontal bar. If the legend window is not visible, right-click in the mining model viewer and select Show Legend.
The legend in Figure 14-13 shows that blue (Value=0) indicates customers who are not bike buyers, red (Value=1) indicates customers who have bought bikes, and white indicates customers for whom the BikeBuyer value is missing. Even though the underlying relational data source has no missing values for the BikeBuyer column, the algorithm adds a small percentage (0.03%) to the root node so that the algorithm is not biased toward certain prediction inputs. The split from a node at one level to the nodes at the next level is based on a condition, shown within the node (such as Region=North America), that is determined by an input column that influences the predictable attribute.
In the sample you are analyzing, the most important factor in whether a customer buys a bike is the number of cars the customer owns. The root node is split into five nodes based on the values of Number Cars Owned (zero through four); three of these nodes are shown in Figure 14-12. The node with Number Cars Owned >= 2 shows that 40% of such customers are likely bike buyers. The next influencing factor is the customer's Yearly Income. You can traverse the tree from each node to identify the conditions that are likely to affect customers' decisions to buy a bike. Based on the information in the mining model, you can not only understand the factors influencing customers' decisions but also predict whether a customer is a potential bike buyer based on his or her properties. Once you identify potential customers, you can send targeted mail to them rather than to all customers.
Once created, can your model predict accurately? If so, how accurate is it? How much trust can you place in its results? To answer these questions, the data mining editor has a view called the Mining Accuracy Chart, shown in Figure 14-14. This view contains three sub-views that help you validate model accuracy: Column Mapping, Lift Chart, and Classification Matrix (see Figure 14-14). The Column Mapping sub-view contains three sections that help you compare the accuracy of the model against a specific data set.
To measure the accuracy of a model, a subset of the input data set is often set aside during the data mining process. The Column Mapping tab is used to select the input data set against which prediction accuracy is to be compared. You can filter this input data set by specifying filter conditions in the second section of the tab, Filter the Input Used to Generate the Lift Chart. In the third section of the Column Mapping tab, you have the option of comparing accuracy only for certain predicted values. For example, if you want to compare accuracy only for customers who are buying bikes, you can select the predictable column Bike Buyer and the predicted value 1; this means you are interested in identifying all the customers that the model has predicted as bike buyers and comparing that prediction with the original bike buyer value in the relational data source.
The mining model is used to predict the bike buyer value for each customer. If the predicted value matches the original value in the relational data source, the model has predicted correctly; if not, the model has predicted the Bike Buyer value for that customer inaccurately.
To analyze the accuracy of the mining model, do the following:
In the Mining Accuracy Chart view, click the Select Case Table button to choose the input table used to evaluate the accuracy of the model. Select the DSV containing the vTargetMail view, select the vTargetMail view, and click OK.
The designer detects the column mappings between the mining model and the case table based on matching names, and draws the corresponding lines between them, as displayed in Figure 14-14. If any mappings detected by the data mining editor are incorrect or missing, you can correct them in the Modify Mappings dialog, which you launch by right-clicking within the data mining editor and selecting Modify Mappings.
In this example, you compare the accuracy of the decision tree model with the entire data set that was used to create the model. Therefore you do not apply any filters on the input data set.
In the third section of the Column Mapping tab leave the predict value blank so that all the predict values can be compared for accuracy.
Analysis Services 2005 provides two ways to analyze the validity of the model: graphically, and through actual numbers. To see the validity of the model graphically, select the Lift Chart sub-view within the Mining Accuracy Chart view. You will see a graph with two lines, as shown in Figure 14-15. The X-axis shows the percentage of the data set used for prediction and the Y-axis shows the percentage of correct predictions. A legend window provides details on the two lines. A perfect model that predicted every input correctly would always be at 100% correctness; this ideal model is represented by the blue line in the graph (the 45-degree line in Figure 14-15). In the current model the predictable attribute can take only one of two values: 0 or 1. Naturally, you want a model whose predictions come very close to those of an ideal model, and this graph gives you a visual way to compare your model's prediction correctness against the ideal. The prediction correctness of the model is shown as a red line. You can see from Figure 14-15 that the model is not very close to the ideal but is reasonably good, since the predictions are correct for 76% of the overall data set (the population correct % at 100% of the overall population).
That's all well and good, but how is the lift chart calculated and visually represented? Analysis Services 2005 predicts the bike buyer attribute for every row in the input table. Each prediction also carries a value called the predict probability, a confidence value associated with the prediction: a predict probability of 1.0 means the model believes the predicted value is always correct, while a predict probability of 0.90 means there is a 90% probability that the predicted value is correct. After predicting the values for all the input rows, the results are ordered by descending predict probability, the predicted values are compared with the original values to calculate the prediction correctness percentage, and the points are plotted on the graph. This ordering explains why, in Figure 14-15, the lines for the current model and the ideal model are nearly identical up to 10% of the population. The Mining Legend shows the score, population percentage, and predict probability values for a specific population selection; in Figure 14-15 the population selection is 50%, indicated by the darker vertical line. You can select a different population percentage by clicking the graph at a specific population. When 100% of the data set is considered, the decision tree mining model predicts the correct value for 76% of the data set, as indicated by the score value. The score indicates the accuracy of the model and lies between 0 and 1; for an ideal model the score is 1, and a higher score means the model is more accurate and closer to an ideal model.
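The calculation just described can be sketched in a few lines of Python. This is a conceptual illustration of how lift-chart points could be derived from a set of predictions, not the server's actual implementation:

```python
def lift_points(predictions):
    """Compute (population %, correct %) points for a lift chart from
    (predicted, actual, predict_probability) tuples. Conceptual
    sketch only; Analysis Services computes this internally."""
    # Order the rows by descending predict probability, as described above.
    ordered = sorted(predictions, key=lambda p: -p[2])
    points, correct = [], 0
    for i, (predicted, actual, _) in enumerate(ordered, start=1):
        if predicted == actual:
            correct += 1
        # X = % of population considered so far, Y = % predicted correctly.
        points.append((100 * i / len(ordered), 100 * correct / len(ordered)))
    return points

# Four toy predictions: three correct, one wrong.
predictions = [(1, 1, 0.9), (1, 0, 0.8), (0, 0, 0.7), (1, 1, 0.6)]
print(lift_points(predictions)[-1])  # (100.0, 75.0) -- a score of 75%
```

The last point of the returned list corresponds to the score reported in the Mining Legend when 100% of the population is selected.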
The Mining Accuracy Chart viewer provides two types of charts to help you analyze the mining model. You have learned about the lift chart; the second is called the profit chart. The profit chart helps you analyze the potential profit your business would make based on four input parameters: the number of samples, the fixed cost incurred, the cost per row, and the profit gained per correct prediction. The profit chart is beneficial only when there is a significant cost per sample. Select the chart type Profit Chart, and click the Settings button to launch the Profit Chart Settings dialog. Specify the individual cost as 11, accept the remaining default values, and click OK. You will now see the profit chart, as shown in Figure 14-16.
Analysis Services 2005 predicts the bike buyer column and calculates the profit based on the profit chart settings you provided. As with the lift chart, the predicted values are sorted by predict probability and then plotted on the graph, so lower values of overall population percentage correspond to higher predict probability values. You can slide the gray vertical line, and the Mining Legend shows the profit and predict probability values for the corresponding population percentage, as shown in Figure 14-16. As you can see from the profit chart, the profit increases with the sample size, reaches a maximum, and then drops. If you mailed the entire population, the net profit would be just under $17,500, whereas the maximum profit you can obtain for the specified profit chart parameters is around $45,300. To maximize your profit, you should mail only to customers whose predict probability is greater than the predict probability corresponding to the population percentage with maximum profit. The profit chart thus improves the profit of a business by saving the cost that would otherwise be incurred mailing customers who are not potential buyers.
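As a rough illustration of why the curve peaks and then falls, the profit at a given population percentage could be computed as follows. The function and parameter names here are ours, mirroring the settings dialog rather than any documented formula:

```python
def profit_at(population_pct, rows, fixed_cost, individual_cost, revenue_per_sale):
    """Profit from mailing only the top population_pct% of customers,
    ranked by predict probability. Illustrative sketch, not the
    product's exact profit-chart formula."""
    ordered = sorted(rows, key=lambda r: -r[2])   # highest confidence first
    n = int(len(ordered) * population_pct / 100)  # customers actually mailed
    buyers = sum(1 for predicted, actual, _ in ordered[:n] if actual == 1)
    return buyers * revenue_per_sale - fixed_cost - n * individual_cost

# Toy data as (predicted, actual, predict_probability) tuples.
rows = [(1, 1, 0.95), (1, 1, 0.90), (1, 0, 0.50), (0, 0, 0.40)]
# Mailing only the most confident half beats mailing everyone,
# because the extra mailings cost money without producing sales.
print(profit_at(50, rows, 100, 11, 100))   # 78
print(profit_at(100, rows, 100, 11, 100))  # 56
```

Past the peak, each additional mailing adds the per-row cost but a diminishing chance of a sale, which is exactly the drop-off visible in Figure 14-16.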
The third sub-view in the Mining Accuracy Chart view is the classification matrix. The classification matrix of the decision tree model on vTargetMail is shown in Figure 14-17. The classification matrix shows the original and predicted values for the predictable attribute; in the decision tree model, the predictable values for Bike Buyer are 0 and 1. As shown in Figure 14-17, the columns indicate the actual values while the rows indicate the predicted values. There are a total of 9,121 input rows with a Bike Buyer value of 0. Of these, the model correctly predicted 7,049 as 0 and incorrectly predicted the remaining 2,072 as 1. Similarly, the model correctly predicted 7,060 of the 9,363 input rows that have a value of 1. This matrix gives you an overview, in matrix form, of how good the predicted values are compared with the actual values.
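Conceptually, the classification matrix is just a tally of (predicted, actual) pairs. A minimal Python sketch with made-up data (not the vTargetMail counts):

```python
from collections import Counter

def classification_matrix(predicted, actual):
    """Tally (predicted, actual) pairs, as in the Classification
    Matrix sub-view: matrix rows are predicted values, columns are
    actual values."""
    return Counter(zip(predicted, actual))

# Six toy rows of the predictable attribute.
predicted = [0, 1, 1, 1, 0, 0]
actual    = [0, 0, 1, 1, 1, 0]
m = classification_matrix(predicted, actual)
# m[(p, a)] counts rows predicted as p whose actual value was a.
print(m[(0, 0)], m[(1, 1)])  # correct predictions:   2 2
print(m[(1, 0)], m[(0, 1)])  # incorrect predictions: 1 1
```

The diagonal cells (predicted equals actual) are the correct predictions; everything off the diagonal is a misprediction, which is how you read Figure 14-17.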
Having seen the accuracy of the model created by Analysis Services 2005, you can meaningfully start predicting values with it. The Mining Model Prediction view helps you perform predictions and store the results; Figure 14-18 shows this view. As in the Mining Accuracy Chart view, you need to specify the case table containing the input data for which you want predictions. Select the source table vTargetMail, for which you will now predict the bike buyer value. In this view you select the columns from the input table that you want retrieved along with the predicted value(s); you can also apply data mining prediction functions or custom expressions to the predicted values to retrieve results such as the top 10% of customers or the top 10 customers. The input columns and prediction functions are selected in the lower half of the window pane, as shown in Figure 14-18. Select the predicted column Bike Buyer from the model, and the columns Customer Key, First Name, Last Name, and Email Address from the input table. You can select additional columns if you need to.
To see the prediction results, click the button that allows you to switch between the design, query, and result views, as shown in Figure 14-19. You will now see the results of the prediction, as shown in Figure 14-19. You can specify constraints on the predicted value, such as predicted value = 1, so that the results view shows only customers who are likely to buy bikes.
You were able to predict the bike buyer value for the input case table using the designer. The designer creates a query that retrieves the predicted data from Analysis Services. The query language used to retrieve prediction results from mining models is called DMX, which stands for Data Mining Extensions; it is specified in the OLE DB for Data Mining specification. DMX is similar to SQL and contains statements for data definition and data manipulation. The data definition language includes statements for model creation, and the data manipulation language contains statements for training the model (which includes inserting data into the model) and for retrieving prediction results from it. Just as SQL has a SELECT statement to retrieve data from a relational database, DMX has a SELECT statement to retrieve data from mining models; the DMX SELECT statement has several variations based on the nature of the results being retrieved. For detailed information on the data definition and data manipulation languages for data mining, please refer to the Analysis Services 2005 documentation.
Click the icon you used to switch between the design and result views (shown in Figure 14-19) and select the Query option. You will see the following DMX query, generated by the query designer to retrieve the prediction results, in the query window pane.
SELECT
  t.[CustomerKey],
  t.[FirstName],
  t.[LastName],
  t.[EmailAddress],
  [v Target Mail].[Bike Buyer] AS [Predicted Value]
FROM
  [v Target Mail]
PREDICTION JOIN
  OPENQUERY ([Adventure Works DW],
    'SELECT [CustomerKey], [FirstName], [LastName], [EmailAddress],
            [MaritalStatus], [Gender], [YearlyIncome], [TotalChildren],
            [NumberChildrenAtHome], [EnglishEducation], [EnglishOccupation],
            [HouseOwnerFlag], [NumberCarsOwned], [CommuteDistance],
            [Region], [Age], [BikeBuyer]
     FROM [dbo].[vTargetMail]') AS t
ON
  [v Target Mail].[Marital Status] = t.[MaritalStatus] AND
  [v Target Mail].[Gender] = t.[Gender] AND
  [v Target Mail].[Yearly Income] = t.[YearlyIncome] AND
  [v Target Mail].[Total Children] = t.[TotalChildren] AND
  [v Target Mail].[Number Children At Home] = t.[NumberChildrenAtHome] AND
  [v Target Mail].[English Education] = t.[EnglishEducation] AND
  [v Target Mail].[English Occupation] = t.[EnglishOccupation] AND
  [v Target Mail].[House Owner Flag] = t.[HouseOwnerFlag] AND
  [v Target Mail].[Number Cars Owned] = t.[NumberCarsOwned] AND
  [v Target Mail].[Commute Distance] = t.[CommuteDistance] AND
  [v Target Mail].[Region] = t.[Region] AND
  [v Target Mail].[Age] = t.[Age] AND
  [v Target Mail].[Bike Buyer] = t.[BikeBuyer]
The preceding prediction query is one of the variations of the DMX SELECT query that has the following syntax:
SELECT [FLATTENED] [TOP <n>] <select expression list>
FROM <model> | <sub select>
[NATURAL] PREDICTION JOIN <source data query>
[ON <join mapping list>]
[WHERE <condition expression>]
[ORDER BY <expression> [DESC|ASC]]
The input data for the prediction is specified after the PREDICTION JOIN keywords. The <select expression list>, specified after the SELECT keyword, contains the columns to be retrieved as part of the results, including columns from the input (case) table and the predicted columns. The mining model used for the prediction is specified after the FROM keyword. The mapping of columns from the input data set to the mining model attributes is specified in the ON clause, as seen in the preceding prediction query, which retrieves four columns from the input table along with the predicted column for each input row. Just as you execute MDX queries from SQL Server Management Studio, you can execute this DMX query there. You have seen only a simple DMX query in this example. The Analysis Services 2005 tools help you build DMX queries graphically, but if you prefer to write your own DMX queries, this is a good start. You can learn more about DMX and prediction queries from the Analysis Services 2005 documentation.
You can create multiple mining models within the same mining structure, using either the same or different mining algorithms. You would typically create a new mining model with the same algorithm when you want to see how the accuracy of the existing model changes with slightly different column properties, such as disabling certain input columns or changing columns from Predict Only to Predict. Alternatively, you can create a new mining model with a different algorithm over the same attributes; a typical example is to create a clustering or Naïve Bayes model on a data set for which you have already created a decision tree model. Next you learn to create a clustering model on the same data set and analyze the results. Follow these steps:
Switch to the mining model view in the mining model editor.
Right-click anywhere within the mining pane and select New Mining Model.
Select the mining algorithm Microsoft Clustering from the mining algorithm drop-down list, type Clustering as the name of the mining model, and click OK.
A new clustering mining model with the name Clustering is created in the mining model view, as shown in Figure 14-20.
Deploy the project to the Analysis Services 2005 server to process the Clustering mining model. The mining model editor switches to the Mining Model Viewer as soon as the server has processed the model. In the viewer, the default view shows the decision tree mining model. Click the mining model drop-down list and select Clustering, the name of the clustering mining model you created (Figure 14-21).
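The same operation can be expressed in DMX with an ALTER MINING STRUCTURE statement, which adds a new model to an existing structure. The structure and column names below are illustrative and assume a mining structure like the one built earlier in this chapter:

```sql
ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [Clustering]
(
    [Customer Key],
    [Age],
    [Yearly Income],
    [Bike Buyer] PREDICT      // the predictable column
)
USING Microsoft_Clustering
```

Because the new model lives in the same structure, it is trained from the same cached data as the decision tree model, with no redundant reads from the relational source.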
You will now see the clustering mining model represented as several nodes with lines between them, as shown in Figure 14-21. By default the clustering mining model groups the customers into ten clusters. The number of clusters to generate can be changed through a property of the clustering mining model. Each cluster is shown as a node in the cluster viewer. The shade of a node depends on the shading variable column and the specific state of that column shown in the viewer: darker shading indicates that the cluster strongly favors that state, and lighter shading indicates the opposite. If there is a relationship (that is, similarity) between two clusters, it is indicated by a line connecting the two nodes; the stronger the relationship, the darker the line, such as the one between clusters cluster5 and cluster6. You can move the slider on the left of the cluster diagram from All Links to Strongest Links; as you do, the weaker relationships between clusters are no longer displayed. You can change a cluster's name by right-clicking the cluster and selecting Rename. The cluster diagram helps you get an overall picture of the clusters: how each cluster is affected by the model column used as the shading variable, as well as the relationships between clusters.
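The property that controls the number of clusters is the CLUSTER_COUNT algorithm parameter (10 by default). If you define the model in DMX rather than in the editor, the parameter is passed in the USING clause; for example (only the USING clause of the model definition is shown):

```sql
-- generate 5 clusters instead of the default 10
USING Microsoft_Clustering (CLUSTER_COUNT = 5)
```

Setting CLUSTER_COUNT to 0 asks the algorithm to estimate a suitable number of clusters from the data.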
In Figure 14-21 the chosen column is Population. Population is the name used in the mining model viewer for the entire data set used to train the mining model. You can select any input column of the mining model from the Shading Variable drop-down to see the effect of that column on the various clusters. When you choose a specific shading variable column, you also need to choose one of the states of that column to be used for shading the clusters.
For example, if you choose Age as the shading variable, you can pick from several states, such as the missing value, < 32, or >= 86, as shown in Figure 14-22. You can see that Cluster 7 has a darker shade, indicating that the cluster is predominantly populated with customers whose age is < 32. Overall, the cluster diagram lets you analyze the various clusters, their characteristics, and the relationships between clusters based on a specific column value, so that you can quickly grasp the cluster characteristics.
Once you have a good overview of the clusters from the cluster diagram view, the next step is to learn more about each cluster along with the distribution of values for each column. Click the Cluster Profiles tab to see these details.
The Cluster Profiles view shows the relationship between the mining columns of the model and the clusters in a matrix format, as shown in Figure 14-23. The cell at the intersection of a specific column and a cluster shows a histogram of the various values of that column within the cluster. The size of each bar reflects the number of training cases having that value; if you hover over a histogram you can see the size of each bar, as shown in Figure 14-23. The number of bars shown is controlled by the Histogram Bars value: the bars are sorted by size, and only the first N bars (where N is the Histogram Bars value) are shown. For example, for the Age attribute there are eleven groups shown in the legend; because the Histogram Bars value is 4, the cluster viewer picks the four most important buckets of the Age column and shows how this column contributes toward the profile of a specific cluster.
Each column has a histogram bar called missing for input records that had no value specified for that column during training. Columns with more states than the number of bars to be shown (again controlled by the Histogram Bars value) get an additional bar called Other, which aggregates all the states that are not shown explicitly. In the cluster diagram view for the overall population you saw that the strongest link is between clusters cluster5 and cluster6. This is because the predictable column Bike Buyer has a similar distribution of values in these two clusters, as you can see in Figure 14-23.
Click on the Cluster Characteristics tab to see the characteristics of a single cluster and how the various states of the input columns make up the cluster.
You will see the cluster characteristics of the entire data set, as shown in Figure 14-24. You can view the characteristics of a specific cluster by selecting its name from the Cluster drop-down list. The probability associated with a specific value of an input column, such as Number Cars Owned = 0, is calculated from the number of input records in the cluster having that value. The Probability column shows this calculated probability for each input value in the chosen cluster; if you hover over a bar in the Probability column you will see the corresponding probability value, as shown in Figure 14-24.
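The per-cluster information shown in these viewers can also be retrieved programmatically by querying the model content in DMX. Each row returned describes one node (here, one cluster) of the model; the model name assumes the Clustering model created above:

```sql
SELECT NODE_CAPTION, NODE_SUPPORT, NODE_DESCRIPTION
FROM [Clustering].CONTENT
```

NODE_SUPPORT gives the number of training cases assigned to each cluster, and NODE_DESCRIPTION summarizes the characteristics that define it.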
Once the clusters are formed, a typical next step is to compare the characteristics of two clusters, especially related ones, to better understand each of them. The data mining editor provides a way to compare the differences between clusters.
Click the Cluster Discrimination tab to see the characteristics that distinguish one cluster from another, or from the rest of the population. Initially you will see the characteristics of cluster 1 compared with the complement of cluster 1. From the cluster diagram you identified that the strongest relationship in the entire data set is between clusters cluster5 and cluster6. To compare these clusters, select cluster5 and cluster6 from the drop-down lists labeled Cluster 1 and Cluster 2, as shown in Figure 14-25.
In Figure 14-24 you saw the characteristics of a single cluster. The Cluster Discrimination view shows which states of the input columns favor one cluster over the other. These states indicate the differences between the two clusters and are ordered by how much each contributes to the discrimination. The two columns on the right indicate which cluster a specific column value favors, and the length of the bar indicates how strongly that value influences the cluster, as shown in Figure 14-25.
The clustering mining model you created here can also be created using the mining model wizard. However, when you create a new mining model with the wizard, the wizard automatically creates a new mining structure and then creates the mining model within it. Whenever you need multiple mining models for comparison, we recommend you create one model using the wizard and the remaining models within the same mining structure using the data mining editor. The mining model wizard is self-explanatory, and you can explore the creation of other mining models, such as Microsoft Sequence Clustering, Microsoft Regression Trees, and Microsoft Neural Network, using the data sets available in the Adventure Works DW sample relational database.
Certain types of business problems necessitate the use of aggregated data for analysis instead of individual input rows. For example, assume a customer buys several products from various stores. You might want to segment customers not only by their individual attributes but also by the total amount they have spent in the stores. The mining model would then require the aggregated sales from the various purchases made by each customer as an input attribute to the clustering model. You could certainly add such a column to the customer table using a named query, but if the sales table has billions of records, aggregation at the relational data source will be slow. You should also consider the maintainability of the mining model, because you might want to reprocess the models on a periodic basis. There is indeed a better solution than aggregating the data at the relational data source: what better way to aggregate data than by creating a cube?
Because Analysis Services helps you create cubes as well as mining models, Analysis Services 2005 provides a way of creating mining models from cubes. Mining models created on top of a cube are called OLAP mining models, since the data source is a cube and cubes contain OLAP data. Analysis Services 2005 also lets you create new cubes that combine the original cube with content from the mining model, giving you the power and flexibility to analyze the cube based on the patterns discovered by the model. Such analysis can further help you understand your data and make better business decisions. In this section you will create and analyze cubes containing mining model content. You will use the AnalysisServices2005Tutorial project you created earlier to create OLAP mining models; when you download the samples for this book you will find the AnalysisServices2005Tutorial project under the Chapter 14 folder. To create an OLAP mining model, do the following:
Open the AnalysisServices2005Tutorial project. Deploy the entire project to your Analysis Services instance. In Solution Explorer, right-click the Mining Structures folder and select New Mining Structure… to launch the Data Mining Wizard.
In the Select the Definition Method page of the Data Mining Wizard, select the option From existing cube as shown in Figure 14-26 and click Next.
In the algorithm selection page, select the Microsoft Clustering algorithm as shown in Figure 14-27 and click Next.
You will now be in the Select the Source Cube Dimension page, shown in Figure 14-28, which lists the cube dimensions in the database upon which a mining model can be created. You need to select the cube dimension that will serve as the case table for the mining model: select the cube dimension Dim Customer and click Next.
In the Select Case Key page of the wizard, select Dim Customer, the key attribute of the Dim Customer dimension, as the key for the mining structure, as shown in Figure 14-29. The Dim Customer attribute will also be used as the key for the mining model.
On the Select Case Level Columns page you select all the attributes that will be part of the mining model; attributes to be used as input or predictable columns should be selected here. Select the cube dimension attributes Commute Distance, English Education, English Occupation, Gender, House Owner Flag, Marital Status, Number Cars Owned, Number Children At Home, Total Children, and Yearly Income. Also select the measures Fact Internet Sales - Order Quantity and Fact Internet Sales - Sales Amount of the Fact Internet Sales measure group, as shown in Figure 14-30, and click Next.
On the Specify Mining Model Column Usage page you need to specify the input and predictable attributes. Make the input and predictable selections as shown in Figure 14-31 and then click Next.
On the Specify Columns' Content and Data Type page you can change the data type and content type for each attribute if needed. Both the content type and data type play an important role in the creation or training of the mining model. Accept the defaults and click Next.
You will now be in the Slice Source Cube page, which lets you slice the cube and build the mining model on only a specific part of it. You can specify the constraints for slicing the cube on this page, similar to specifying filter conditions in the cube browser. You have the option to filter on dimensions other than Dim Customer, such as specific dates of the Order Date dimension or specific product categories. Accept the default (no slice) on this page and click Next.
In the final page, specify OLAPMiningStructure as the name for the mining structure and CustomerOLAPMiningModel as the name for the mining model, as shown in Figure 14-32. You can analyze the mining model separately using the mining model viewer, but Analysis Services 2005 takes it one step further by allowing you to create a cube that includes the results of the mining model. In Analysis Services 2005 you can create a dimension from the results of the mining model and add it to the existing cube; the wizard facilitates the creation of a new cube that includes the existing cube along with the results of the mining model. Select the check boxes Create mining model dimension, Create cube using mining model dimension, and Allow drill through, as shown in Figure 14-32, and click Finish.
The OLAP mining model is created, and you will now see it in the mining model editor as shown in Figure 14-33. The input columns of the mining structure are mapped to the corresponding cube dimension and measure group, indicated by the line connecting the case-level columns and Fact Internet Sales within the DSV pane of the data mining editor. When the mining model is processed, the aggregated data of the measures is used instead of the individual transactions. Deploy the project to the server.
Once processing is complete, the mining model editor switches to the Mining Model Viewer, where you can see the various clusters created from the chosen attributes of the cube. Ten clusters are created by default, and you will find the strongest relationship between clusters cluster4 and cluster10. As with relational mining model clusters, you can use the Cluster Profiles, Cluster Characteristics, and Cluster Discrimination tabs to learn more about the clusters created from the cube.
When you created the OLAP mining model, you selected the creation of a data mining dimension and a cube in the data mining wizard. Creating a new dimension and cube requires a DSV, so a DSV containing a query that retrieves the data from the OLAP mining model is created; you will see it, named Dim Customer_DMDSV, under the DSV folder in the Solution Explorer. Next, a dimension called [Dim Customer_DMDim] that includes attributes from the data mining model is created. Finally, a cube called [Adventure Works DW_DM] is created that includes all the cube dimensions and measure groups of the Adventure Works DW cube plus the newly created data mining dimension [Dim Customer_DMDim]. In Chapter 9 you learned about the data mining relationship between a dimension and a cube: it is defined between a dimension derived from a data mining model and a cube that contains it. The [Adventure Works DW_DM] cube created by the data mining wizard includes such a data mining dimension; if you open the Dimension Usage tab of the cube you will see the data mining relationship, as shown in Figure 14-34.
You can browse the [Adventure Works DW_DM] cube along with the data mining dimension to analyze the Internet sales obtained from the various clusters, as shown in Figure 14-35. How is this information useful? You can analyze current sales by cluster, and then compare sales after a targeted mailing to identify how effective the mailing was. Analysis Services thereby provides an integrated environment for creating mining models and cubes that together support effective business analysis.
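For example, a simple MDX query against the new cube can break down Internet sales by cluster. The dimension, hierarchy, and measure names below follow those generated in this example, though the exact names may differ in your project:

```sql
SELECT [Measures].[Sales Amount] ON COLUMNS,
       [Dim Customer_DMDim].[Dim Customer_DMDim].MEMBERS ON ROWS
FROM   [Adventure Works DW_DM]
```

Each row of the result shows the total Internet sales contributed by the customers assigned to one cluster, which is exactly the comparison you would run before and after a targeted mailing.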