Before we put a data mining model to use in a production environment, we need to ensure the model is making predictions with the desired accuracy. This is done using the Mining Accuracy Chart tab in the Business Intelligence Development Studio. The Mining Accuracy Chart tab provides three tools for determining the accuracy of our mining models: the Lift Chart, the Profit Chart, and the Classification Matrix.
The Lift Chart is a line graph showing the performance of our mining models using a test data set. The Profit Chart is a line graph showing the projected profitability of a marketing campaign as it relates to the accuracy of our mining models. The Classification Matrix provides a chart showing the number of correct predictions and the number of each type of error made by our models. Before we can use any of these tools, we need to use the Column Mapping tab to define a test data set and feed it into our models.
The Column Mapping tab lets us take a relational data source and configure it to be used as a test data set for our mining models. We define how the fields will be mapped to the inputs and the predictables in our model. We can also filter the data to create the desired test data set. Finally, we can set the predictable we want to check, if more than one predictable is in our models, and the value we want to predict.
The Select Data Mining Structure window enables us to select a mining structure from the current project. This is the mining structure that will then be tested for accuracy. Don't get confused between mining structure and mining model. We are selecting a mining structure here. We can select which mining models to include at the bottom of the Column Mapping tab.
The Select Input Table window lets us select a table or a set of related tables from the data source or data source view (DSV). This table or set of tables provides the test data set for the mining models. The test data set must have a column to map to every input and predictable column in the mining structure.
Remember, the test data set must already have a known value for the predictable. We test for accuracy by feeding values from a data record to the inputs of the mining model, asking the model to come up with a value for the predictable, and then comparing the predicted value with the actual value from that record. This is done for each record in the test data set.
In many cases, we need to filter the data being used for the test data set. Perhaps a subset of the rows in a table was used to train the mining model. We want to use the complement of that row subset to do our model testing. The filter lets us pull out only the records we want to use for the testing process.
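The training/testing split described here can be illustrated with a short sketch. The customer-number boundaries (5000 through 25000 for training, above 25000 for testing) come from the MaxMinSalesDM scenario in this chapter; the record layout is a simplified stand-in:

```python
# Retail customers have customer numbers of 5000 and above.
customers = [{"CustomerNumber": n} for n in range(5000, 40001, 5000)]

# Customers 5000-25000 trained the model; the complement is for testing.
training = [c for c in customers if 5000 <= c["CustomerNumber"] <= 25000]
testing  = [c for c in customers if c["CustomerNumber"] > 25000]

# No record appears in both sets, so the test is on unseen data.
overlap = {c["CustomerNumber"] for c in training} & \
          {c["CustomerNumber"] for c in testing}
print(len(training), len(testing), len(overlap))
```

Testing on records the model has never seen is what makes the accuracy figures trustworthy; testing on the training data itself would overstate every model's accuracy.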
Our mining models may include more than one predictable column. If that is the case, we need to select which predictable to use for testing. We can test our model on the prediction of all values for the predictable or on its accuracy at predicting a selected value.
Creating a testing data set
Mapping the test data set to a mining model
Business Need In the Learn By Doing in Chapter 12, we were asked by the Maximum Miniatures Marketing Department to come up with a way to predict when a customer, or potential customer, has no children at home. This knowledge is going to be used to determine who should receive a promotional mailing. We have created a mining structure using four different data mining algorithms. We now need to validate these four algorithms and determine which algorithm is giving us the most accurate predictions.
We need a testing data set for the validation process. Recall, retail customers have customer numbers of 5000 and above. We used retail customers with customer numbers from 5000 to 25000 for our training data set. Now, we use customers with numbers above 25000 for testing.
Open the Business Intelligence Development Studio.
Open the MaxMinSalesDM project.
If the Data Mining Design tab for the Classification - Children At Home data mining structure is not displayed, double-click the entry for this data mining structure in the Solution Explorer window.
Select the Mining Accuracy Chart tab on the Data Mining Design tab. The Column Mapping tab appears as shown in Figure 14-1.
Click Select Structure in the Mining Structure window. The Select Mining Structure dialog box appears as shown in Figure 14-2.
This dialog box shows all the mining structures in the project. (There is only one mining structure in the MaxMinSalesDM project.) Select Classification - Children At Home and click OK.
Click Select Case Table in the Select Input Table(s) window. The Select Table dialog box appears as shown in Figure 14-3.
The Data Source drop-down list enables us to choose between the Max Min Sales DM data source and the MaxMinSalesDM data source view. Because the file extensions are not included in this drop-down list, you can easily get confused with these two similar names. You can compare the icons in the drop-down list to the icons in the Solution Explorer window to determine which is which. Both the data source and the data source view contain the tables in the relational MaxMinSalesDM database. Select the MaxMinSalesDM data source view from the Data Source drop-down list. (This should be selected by default.)
Select the Customer (MaxMinSalesDM) table in the Table/View Name list. Click OK to exit the Select Table dialog box. The fields in the Mining Structure are mapped to the fields in the Customer table based on field name. If an automatic mapping is incorrect, you can select a mapping line, right-click it, and select Delete from the Context menu as shown in Figure 14-4. You can also manually map columns by dragging a column from one window and dropping it on the corresponding column in the other.
If your data mining structure includes nested tables, the test data set may also include nested tables. Once the initial table is selected, the Select Case Table button changes to a Select Nested Tables button.
In the Filter the Input Data Used to Generate the Lift Chart grid, click the first row in the Source column. A drop-down list appears. Activate the drop-down list and select Customer Table.
Customer Name is selected by default in the Field column. In the first row under the Criteria/Argument column, enter > 25000.
In the Select Predictable Mining Model Columns to Show in the Lift Chart grid, the Predictable Column Name is already filled in because there is only one predictable column in the mining structure. We can specify, in the Predict Value column, the value we want to have the model predict. Before we do that, we can look at the Lift Chart without a predict value specified. The Column Mapping tab on the Mining Accuracy Chart tab should appear as shown in Figure 14-5.
Figure 14-1: The Column Mapping tab before configuration
Figure 14-2: The Select Mining Structure dialog box
Figure 14-3: The Select Table dialog box
Figure 14-4: Deleting a column mapping
Figure 14-5: The Column Mapping tab after configuration
The entries on the Column Mapping tab are considered part of a one-time test of the mining models and are not saved as part of the mining structure.
The Lift Chart is used to judge the effectiveness of a mining model. It creates a line graph showing the accuracy of each mining model at predicting the selected value. In addition, the Lift Chart contains a line showing the prediction accuracy of a random-guess model and a line showing the perfect prediction of an ideal model. Any distance that the graph line has for a particular mining model above the random-guess model is lift. The more lift, the closer the mining model comes to the ideal model.
Two different types of Lift Charts are available to use. The first type is produced when we do not have a prediction value specified. If we select the Lift Chart tab, and then wait a few moments for the chart to be calculated, we see a chart similar to Figure 14-6.
Figure 14-6: The Lift Chart with no prediction value specified
This type of Lift Chart shows how well each mining model did at predicting the correct number of children at home for each customer in the testing data set. The X axis represents the percentage of the testing data set that was processed. The Y axis represents the percentage of the testing data set that was predicted correctly. The blue line shows the ideal prediction model. If 50% of the testing data set has been processed, then 50% of the testing data set has been predicted correctly. This type of Lift Chart does not include a line for a random-guess model.
The Mining Legend gives the statistics for each mining model at a certain overall population percentage. This percentage is shown at the top of the Mining Legend window and is represented on the graph by the dark gray line. This line can be moved to different positions by clicking the desired location on the graph. In Figure 14-6, the Mining Legend contains information for an overall population percentage of 50%. We can see that when 50% of the population is processed, the Decision Trees and Neural Network mining models will have predicted 44.69% of the values correctly. The Clustering mining model will have predicted only 11.45% correctly.
To view the second type of Lift Chart, we need to specify the prediction value we are looking for. To view this type of Lift Chart, do the following:
Select the Column Mapping tab.
Click the first row in the Predict Value column to activate the drop-down list.
Select 0 from the drop-down list. This means we are testing the accuracy of predicting a value of 0 for the Num Children At Home column. Selecting 0 in this first row places a 0 in all the rows. This happens because the Synchronize Prediction Columns and Values check box is checked. When this box is unchecked, different predictable columns and prediction values may be selected for each mining model.
Select the Lift Chart tab.
The Lift Chart should now appear as shown in Figure 14-7.
Figure 14-7: The Lift Chart for the Classification - Children At Home mining model
On this chart, the X axis still represents the percentage of the testing data set that has been processed. The meaning of the Y axis has changed. We now have a target population for our predictions. That target population is customers with no children at home (Num Children At Home = 0).
Of the customers in the testing data set, 22% have no children at home. Therefore, in a perfect world, we only need to process 22% of the testing data set to find all the customers with no children at home. This is why the line for the ideal model reaches 100% on the Y axis at 22% on the X axis. Using the random-guess model, when 50% of the testing data set has been processed, 50% of the target population will have been found.
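The ideal and random-guess lines on this chart follow directly from the 22% target rate. A short sketch of both curves, with the 22% figure taken from the testing data set described above:

```python
def ideal_gain(pct_processed, target_pct=22.0):
    """Ideal model: every record it processes first is a member of the
    target population, so it reaches 100% of the targets once
    target_pct of the data set has been processed."""
    return min(pct_processed / target_pct, 1.0) * 100.0

def random_gain(pct_processed):
    """Random guess: finds targets at the same rate it processes records."""
    return float(pct_processed)

print(ideal_gain(22))    # the ideal line hits 100% at 22% processed
print(ideal_gain(11))    # halfway there at 11% processed
print(random_gain(50))   # random finds 50% of targets at 50% processed
```

Every real mining model's line falls somewhere between these two curves.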
On this Lift Chart, it is hard to see where some of the lines are because they overlap. This is where the Mining Legend window comes in handy. If we click the 10% line on the X axis, we can see where these overlapping lines are located. This is shown in Figure 14-8. Using the figures in the Mining Legend, we can see the Decision Trees, Naive Bayes, and Neural Network mining models have all found 44.54% of the target population.
Figure 14-8: The Lift Chart at 10%
If we jump up to 20%, as shown in Figure 14-9, we can see things have changed. The Decision Trees and Neural Network mining models have now found 89.28% of the target population. The Naive Bayes mining model is at 45.02%, hardly any progress over where it was at 10%.
Figure 14-9: The Lift Chart at 20%
Moving to 35%, we can see that the other mining models have caught up to the Decision Trees and Neural Network mining models. This is shown in Figure 14-10. All the models have found close to 90% of the target population. Some differentiation occurs between the mining models as we move higher than 35%, but the spread is small and we have already identified the vast majority of the target population at this point.
Figure 14-10: The Lift Chart at 35%
The Profit Chart lets us analyze financial aspects of a campaign that depends on the predictive capabilities of our mining models. The information in the Profit Chart can help us determine the size of campaign we should undertake. It can also provide a prediction of the profit we should expect from a campaign (hence, the name).
To create a Profit Chart, select Profit Chart from the Chart Type drop-down list. This displays the Profit Chart Settings dialog box shown in Figure 14-11. We need to enter the population and financial aspects of the campaign we are modeling. In our case, let's suppose Maximum Miniatures has purchased a mailing list with 100,000 names. Enter 100000 for Population in the dialog box. This mailing will have $5,000 of fixed costs—costs that do not change no matter how many items we mail—and $2 per mailing sent. Enter 5000 for Fixed Cost and 2 for Individual Cost. Finally, for each person who receives the mailing and makes a purchase, we are going to receive, on average, $25. Enter 25 for Revenue Per Individual. Click OK to produce the Profit Chart.
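The profit calculation behind the chart can be sketched directly from the figures just entered. The 22% target rate comes from the testing data set, and the target-found percentage is supplied by each model's lift curve; the function below is an illustrative reconstruction, not the exact formula Analysis Services uses:

```python
def campaign_profit(pct_mailed, pct_target_found,
                    population=100_000, target_rate=0.22,
                    fixed_cost=5_000, individual_cost=2,
                    revenue_per_individual=25):
    """Projected profit for mailing to the top pct_mailed of the list,
    assuming the mining model has found pct_target_found of the
    likely buyers within that slice. Cost and revenue figures match
    the Profit Chart Settings dialog box entries."""
    mailed = population * pct_mailed / 100
    buyers = population * target_rate * pct_target_found / 100
    revenue = buyers * revenue_per_individual
    cost = fixed_cost + individual_cost * mailed
    return revenue - cost

# Mailing to the top 20%, with a model finding ~89% of the buyers there,
# beats mailing to everyone: full coverage finds all buyers but the
# extra $2-per-piece cost swamps the extra revenue.
print(campaign_profit(20, 89.28))
print(campaign_profit(100, 100))
```

The exact dollar figures on the chart depend on each model's actual response curve, but this is the trade-off the Profit Chart graphs: past a certain point, each additional mailing costs more than the revenue it is likely to return.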
Figure 14-11: The Profit Chart Settings dialog box
The resulting Profit Chart is shown in Figure 14-12. We can see from this graph that the maximum profit will come if Maximum Miniatures sends the mailing to 20% of the names in the mailing list it purchased. At least, that is the case if either the Decision Trees or Neural Network mining model is used to select that 20%. We use either of these mining models to predict which customers have no children at home. (Remember, previous research has shown Maximum Miniatures that these are the most likely buyers of the Mythic World figurines.) Given these two mining models' capability to predict, we should make about $428,943 in profit after the cost of the mailing. Mailing to more people does not significantly improve our chances of getting sales, but it does increase our costs. Because of this, our expected profit goes down if we mail to more people.
Figure 14-12: The Profit Chart
We know our mining models are not going to do a perfect job of predicting. They are going to make mistakes. The Classification Matrix lets us see exactly what mistakes our models have made.
We view the Classification Matrix by selecting the Classification Matrix tab on the Mining Accuracy Chart tab. The Classification Matrix is shown in Figure 14-13. Using the Classification Matrix, we can see the predictions made by each mining model.
Figure 14-13: The Classification Matrix
The left-hand column in each grid shows the value predicted by the mining model. In Figure 14-13, this is the number of children at home predicted for each customer in the testing data set. The other columns show the actual value for the number of children at home for each customer.
In Figure 14-13, the top grid shows the result for the Decision Trees mining model. Looking at the top row in the grid, we can see that in 805 cases, the Decision Trees mining model predicted three children at home when there were actually three children at home. These were correct predictions. In 109 cases, the Decision Trees mining model predicted three children at home when there were actually four children at home. These predictions were in error.
The diagonal of the grid shows the correct predictions: predicted three with actual three, predicted two with actual two, predicted four with actual four, and so on. We want to have the largest numbers along the diagonal. This is the case for the Decision Trees mining model. We already know this model was accurate. The Naive Bayes mining model, shown in the middle grid in Figure 14-13, does not have the largest numbers along the diagonal. This mining model had a tendency to predict two children at home when there were actually four children at home. This mistake occurred 994 times during the processing of our testing data set.
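The tally behind a Classification Matrix can be sketched in a few lines. The data here is a toy example, not the figures from Figure 14-13:

```python
from collections import Counter

def classification_matrix(predicted, actual):
    """Tally (predicted, actual) pairs. Pairs where the two values
    match fall on the diagonal of the grid and are correct
    predictions; everything else is an error."""
    return Counter(zip(predicted, actual))

# Toy predictions of number of children at home:
predicted = [3, 3, 2, 4, 2, 0]
actual    = [3, 4, 2, 4, 4, 0]
matrix = classification_matrix(predicted, actual)

# Correct predictions sit on the diagonal (predicted == actual):
correct = sum(n for (p, a), n in matrix.items() if p == a)
print(correct)  # 4 of the 6 predictions are correct
```

A model like the Naive Bayes example above would show a large off-diagonal count, such as the cell for predicted-two/actual-four, revealing exactly which mistake it tends to make.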