9 Using Decision Support Objects (DSO) | Data Mining with Microsoft[r] SQL Server[tm] 2000 Technical Reference

The usefulness of data-mining models, particularly those that use decision trees, comes from the ability to predict the values of the missing attributes in new cases. To be able to make a prediction, you need to have a fully populated data-mining model as well as your test cases that have the missing attributes. The test cases are simple cases with missing attributes that need to be predicted . In our sample mushrooms database, the cases contain as many attributes about the mushrooms as we can identify in hopes of finding out whether the missing attribute," Edibility, " can be found. The reason we will be concentrating on the decision trees algorithm in this instance is because it is the only algorithm that offers the possibility of predicting missing values. As we've discussed, clustering creates a model that is browsed to find records that are similar to each other.

Note

Remember that it's not a good idea to use the same cases for predictions that were used to train the model because this method will invariably lead to seemingly perfect predictions and any inaccuracies in the model will be missed.

The prediction queries are issued against the Analysis server's model or a local data-mining model created with PivotTable Service. The connection and the query itself are issued through OLE DB, usually using the ADO wrapper as with a typical SQL Server or Jet Engine connection.

Components of a Prediction Query 1

The statement used to execute a data-mining query is the SELECT statement that appears to be very similar to the standard SQL Select statement. There are, however, some very specific key words and constructs that exist only in the data-mining query.

The Basic Prediction Query

The basic syntax of the SELECT statement needed to make a basic prediction query is shown here. For the moment we're concentrating on the column references, so the reference to the test case table may seem a bit oversimplified, but I'll explain how to join a case table.

 SELECT  *   FROM      <data mining model> as DM PREDICTION JOIN <cases table> as TC ON  DM.column1 = TC.column1     DM.column2 = TC.column2

Specifying the Test Case Source

Typically, the test cases are going to reside in a table or a flat file structure such as an Excel spreadsheet. Because this structure is never stored as a data-mining model, the query needs to connect to a source outside the data-mining environment for the duration of the query. The OPENROWSET function returns a table structure based on OLE DB connection options and the query passed on to that connection. If I had a SQL Server table containing test cases in a SQL Server table, I would have to specify the connectivity options for a remote connection as shown in the code sample that follows . We can dispense with this only if the test cases happen to be in an OLAP cube on the same server, in which case it's only necessary to issue an MDX query to retrieve the data source.

 SELECT  *   FROM      <mushroom analysis rdbms> as DM PREDICTION JOIN  OPENROWSET     ('SQLOLEDB.1',         'Provider=SQLOLEDB.1;user=dtsuser;password=;' &_         'Initial Catalog=DataMiner;Data Source=dataserver', &_         'SELECT * FROM "Mushrooms_test" ') as TC ON       DM.column1 = TC.column1     DM.column2 = TC.column2 . . .   2

Singleton Queries

Although test cases are often stored in table structures for retrieval, there is an easier way to retrieve them with the singleton query. In the case of mushrooms, you may want to test the edibility of a single mushroom case without having to go through the trouble to insert that record in a table and use OPENROWSET to retrieve it for the query. A singleton query allows you to place hard-coded values in the source query without establishing a connection to a remote data source, as you can see from the following listing:

 SELECT FLATTENED [TC].[mushroom_name],     (SELECT          edibility ,         $Probability,         $support FROM          PredictHistogram([edibility])) AS [EdibilityStats] FROM [Mushroom Analysis RDBMS]  NATURAL PREDICTION JOIN     (select         'Test Mushroom' as [Mushroom_Name],         'Convex' as [Cap Shape],         'Smooth' as [Cap Surface],         'White' as [Cap Color],         '' as [Bruises],         'None' as [Odor],         '' as [Gill Attachment],         '' as [Gill Spacing],         '' as [Gill Size],         'White' as [Gill Color],         '' as [Stalk Shape],         '' as [Stalk Root],         '' as [Stalk Surface Above Ring] ,         '' as [Stalk Surface Below Ring] ,         '' as [Stalk Color Above Ring],         '' as [Stalk Color Below Ring],         '' as [Veil Type] ,         '' as [Veil Color] ,         '' as [Ring Number] ,         '' as [Ring Type] ,         '' as [Spore Print Color] ,         '' as [Population] ,         '' as [Habitat])     AS [TC]

This is especially convenient for applications that allow a user to enter ad hoc attributes of a test case on a Web page, for example, and retrieve the prediction values on the fly.

Specifying Columns

Instead of specifying columns with the asterisk (*), you can specify the columns from the data-mining model, from the cases table, from functions, and from scalar variables . To select specific columns, the syntax is similar to standard SQL syntax as you can see from this listing.

 SELECT     TC.[Mushroom Name],      DM.[Habitat],       DM.[Edibility] FROM      <mushroom analysis rdbms> as DM PREDICTION JOIN  OPENROWSET     ('SQLOLEDB.1',         'Provider=SQLOLEDB.1;user=dtsuser;password=;' &_         'Initial Catalog=DataMiner;Data Source=dataserver', &_         'SELECT * FROM "mushrooms_test" ') as TC ON       DM.[odor] = TC.[odor]     DM.[cap shape] = TC.[cap shape] . . .   3

The PREDICTION JOIN Clause

The SELECT statement is composed of a join between two data sources, the data-mining model and the test cases. Because of its predictive function, this type of join is called a PREDICTION JOIN. When constructing this join, the actual cases from the test cases source are matched with the set of all possible cases from the mining model with the PREDICTION JOIN operator. The PREDICTION JOIN differs from the JOIN in a standard SQL SELECT statement in that all the attributes in the data-mining model are matched with all the known attributes of the test cases. The PREDICTION JOIN is used to find records that match, and these matches enable the mining model to provide the known attribute and provide the missing values to the test cases. The PREDICTION JOIN will take the case from the test set that's presented and, using the values in the ON clause, will find all the matching sets of cases from the data-mining model. The matching sets are then collapsed into an aggregated set that contains the probability of the resulting values.

Tip

You can use the NATURAL PREDICTION JOIN when you know that the columns in the model and the columns in the source data have the same names. This allows you to dispense with the need for lengthy ON clauses because the columns will be automatically joined based on their common names . (See the singleton query listing above for an example of this clause.) Just remember that the data-mining model replaces all underscores with spaces in the column names, so your natural prediction join may not work if your test cases still have columns with underscores in the name. The query will still run without errors, but many of the attributes will have been missed, and therefore the queries may be far less accurate than they could be.

The Key columns identified during the building of the model serve no purpose for the query itself; therefore, they should not be involved in the ON clause. Remember that the test cases need to come from a source different from the cases used to train the model, so there should be no possible join condition on the Key columns.

Let's suppose that you went mushroom picking and came home with a number of different varieties. Before cooking them, you decide to play it safe and consult the data-mining model. You enter all the characteristics you know about them including the name of the mushrooms, and you issue a prediction query. Here is an example of a query used to find the edible mushrooms among those contained in Figure 12-1:

Figure 12-1. Mushroom test cases. 4

Using the data above, we'll issue the following prediction query to fill in the missing, but crucial, "Edibility" column:

 SELECT FLATTENED [TC].[Mushroom_name],  [DM].[Edibility] FROM [Mushroom Analysis RDBMS] as DM PREDICTION JOIN     OPENROWSET     ('SQLOLEDB.1',     'Provider=SQLOLEDB.1;Persist Security Info=False;         User ID=dtsuser;Initial Catalog=DataMiner;         Data Source=DATASERVER',     'SELECT  [mushroom_name],         [cap_shape] AS [Cap Shape],          [cap_surface] AS [Cap Surface],          [cap_color] AS [Cap Color],          [bruises] AS [Bruises],          [odor] AS [Odor],          [gill_attachment] AS [Gill Attachment],          [gill_spacing] AS [Gill Spacing],          [gill_size] AS [Gill Size],          [gill_color] AS [Gill Color],          [stalk_shape] AS [Stalk Shape],          [stalk_root] AS [Stalk Root],          [stalk_surface_above_ring] AS             [Stalk Surface Above Ring],          [stalk_surface_below_ring] AS             [Stalk Surface Below Ring],          [stalk_color_above_ring] AS             [Stalk Color Above Ring],          [stalk_color_below_ring] AS             [Stalk Color Below Ring],          [veil_type] AS [Veil Type],          [veil_color] AS [Veil Color],          [ring_number] AS [Ring Number],          [ring_type] AS [Ring Type],          [spore_print_color] AS [Spore Print Color],          [population] AS [Population],          [habitat] AS [Habitat]      FROM [mushrooms_test]')     AS [TC]     ON         [TC].[Cap Shape] = [TC].[Cap Shape] AND         [TC].[Cap Surface] = [TC].[Cap Surface] AND         [TC].[Cap Color] = [TC].[Cap Color] AND         [TC].[Bruises] = [TC].[Bruises] AND         [TC].[Odor] = [TC].[Odor] AND         [TC].[Gill Attachment] = [TC].[Gill Attachment] AND         [TC].[Gill Spacing] = [TC].[Gill Spacing] AND         [TC].[Gill Size] = [TC].[Gill Size] AND         [TC].[Gill Color] = [TC].[Gill Color] AND         [TC].[Stalk Shape] = [TC].[Stalk Shape] AND         [TC].[Stalk Root] = [TC].[Stalk Root] AND         [TC].[Stalk Surface Above Ring] =             [TC].[Stalk Surface Above Ring] AND         [TC].[Stalk Surface Below Ring] =             [TC].[Stalk Surface Below Ring] AND         [TC].[Stalk Color Above Ring] =             [TC].[Stalk Color Above Ring] AND         [TC].[Stalk Color Below Ring] =             [TC].[Stalk Color Below Ring] AND         [TC].[Veil Type] = [TC].[Veil Type] AND         [TC].[Veil Color] = [TC].[Veil Color] AND         [TC].[Ring Number] = [TC].[Ring Number] AND         [TC].[Ring Type] = [TC].[Ring Type] AND         [TC].[Spore Print Color] = [TC].[Spore Print Color] AND         [TC].[Population] = [TC].[Population] AND         [TC].[Habitat] = [TC].[Habitat]

The results of the query are listed in Table 12-1.

Table 12-1. Prediction Query Results

5 8

Agaricus Cothurnata	poisonous
Agaricus Bisporigera	poisonous
Agaricus Ocreata	poisonous
Agaricus Muscaria 6	poisonous
Agaricus Pantherina	poisonous
Agaricus Phalloides	poisonous 7
Agaricus Tenufolia	poisonous
Agaricus Verna	poisonous
Agaricus Virosa	poisonous
Agaricus A1	edible
Agaricus A2 9	poisonous

It's probably good that you checked these before eating them! We did not have as many characteristics in our test cases as there are in the model, so some of the mushrooms have a higher probability of being poisonous than others. Because we made a simple prediction join, the result set contains only the highest probability scenarios. The question is, how do we find out more information about the results? Specifically, what are the odds that the edible mushroom, Agaricus A2, is poisonous? In the next section, we'll explore some of the functions of a prediction query, which bring more precision in our result sets.

Using Functions as Columns

A function can be used as a column to provide additional information about an attribute. This type of column is known as a qualifier and can contain information related to the probability associated to a predicted value or information related to the distribution of the predicted value. There are several such functions, which will be described in detail later in this chapter. This listing shows the use of a function as a column that returns the number of cases supporting that particular prediction.

 SELECT     TC.[Mushroom Name],      DM.[Habitat],       DM.[Edibility],     PREDICTSUPPORT(dm.[Edibility]) as Edibility_Support FROM      <mushroom analysis rdbms> as DM PREDICTION JOIN  . . .

Using Tabular Values as Columns

Data-mining queries support the notion of nested tables, or tables within tables. This is done by issuing a subquery as a column as in the query listing above. 10

In ‚ Chapter 5 ‚ , when we built our data-mining models, we had the option of using more than one table as long as they were related. This relationship gets translated into a nested table as the model is built. That means the foreign key value in the parent table becomes a table type column, which contains the related items from the parent table. This simplifies the structure because the notion of relations, as in an RDBMS engine, no longer needs to be maintained because the multiple tables have been transformed into a single structure. This level of simplicity allows data-mining clients to use the data without having to also retrieve information about the relational rules governing the tables, the rules are simply embedded inside the related fields.

The end result is a model made from multiple tables that is the same as a model created from a single flat table. However, the queries make use of the notion of nested tables when constructing queries and formatting the results. In other words, you can have a field contain the results of a subquery that returns multiple rows. Analysis Services will create a nested tables structure as the result set. Again, this makes it easy for any client to use the results without having to know how to interpret relational rules about related tables. In order for most clients to be able to make use of this nested table, the results need to be "flattened" so as to appear as a single table. For this, the SELECT syntax provides the FLATTENED option, as in the example listing that follows. The FLATTENED option turns the SELECT result table from a hierarchical table to a flat table structure, with a results set that contains one row for each predicted value, simplifying the processing of the prediction results. The result is actually similar to the flattening that occurs in standard SQL when issuing a SELECT statement, which joins two tables in a one-to-many relationship. The result contains as many iterations of the "one" side as there are corresponding records in the "many" side.

 SELECT FLATTENED [TC].[mushroom_name], (SELECT      edibility,      $Probability,      $support  FROM PredictHistogram([edibility])) AS [EdibilityDetails] FROM [Mushroom Analysis RDBMS] PREDICTION JOIN . . .

This query yields a result set like the one shown in Table 12-2.

Table 12-2. Results with a Nested Table 11

13 18 23 28 33

Mushroom Name	Edibility	Probability	Support
Agaricus Cothurnata	poisonous	99.65% 12	576
Agaricus Cothurnata	missing	0.17%
Agaricus Cothurnata	edible	0.17%
Agaricus Bisporigera	poisonous 14	99.65%	576
Agaricus Bisporigera	missing	0.17%	15
Agaricus Bisporigera	edible	0.17%
Agaricus Ocreata 16	poisonous	99.65%	576
Agaricus Ocreata	missing	0.17% 17
Agaricus Ocreata	edible	0.17%
Agaricus Muscaria	poisonous	99.65%	576
Agaricus Muscaria	missing 19	0.17%
Agaricus Muscaria	edible	0.17%	20
Agaricus Pantherina	poisonous	99.65%	576
Agaricus Pantherina 21	missing	0.17%
Agaricus Pantherina	edible	0.17% 22
Agaricus Phalloides	poisonous	99.23%	256
Agaricus Phalloides	missing	0.39%
Agaricus Phalloides	edible 24	0.39%
Agaricus Tenufolia	poisonous	99.65%	576 25
Agaricus Tenufolia	missing	0.17%
Agaricus Tenufolia 26	edible	0.17%
Agaricus Verna	poisonous	99.65% 27	576
Agaricus Verna	missing	0.17%
Agaricus Verna	edible	0.17%
Agaricus Virosa	poisonous 29	99.65%	576
Agaricus Virosa	missing	0.17%	30
Agaricus Virosa	edible	0.17%
Agaricus A1 31	edible	89.47%	16
Agaricus A1	missing	5.26% 32
Agaricus A1	poisonous	5.26%
Agaricus A2	poisonous	69.49%	40
Agaricus A2	edible 34	28.81%	16
Agaricus A2	missing	1.69%	35

The WHERE Clause

The WHERE clause in a prediction query is used in almost exactly the same way as in standard SQL, to filter the results of the query. In the query below, the prediction results are limited to only those mushrooms for which the edibility can be determined with greater than 90 percent accuracy.

 SELECT     TC.[Mushroom Name],      DM.[Habitat],       DM.[Edibility],     PREDICTSUPPORT(DM.[Edibility]) as Edibility_Support FROM      <mushroom analysis rdbms> as DM PREDICTION JOIN  . . . WHERE PredictProbability(DM.Edibility)  > .90

Prediction Functions

By default, the prediction query tries to find the best fit for the missing values in the test data to provide a best-case scenario. You can derive extra information from the query by using built-in functions in the data-mining component of Analysis Services.

Some of these functions will return a number as a scalar column value, while others will return entire tables that contain rows of detail concerning the prediction.

By using these functions with a subquery, we can get special nested-table values that contain specific information related to probabilities and distributions. We can also tailor the results to fit the type of prediction we need. 36

Predict

 Predict(<scalar or table column reference>, option1, option2, K)

The Predict function returns either a scalar or table value depending on which type of column this function is applied to. The options in Table 12-3 can be applied to the function, with each option having an effect on how the predicted values are returned.

Table 12-3. Options for the Predict Function

Option	Data Type	Description 37
EXCLUDE_NULL (default)	Scalar	Does not take missing values in the test cases into account.
INCLUDE_NULL	Scalar	Missing values in the test cases are considered significant and should be taken as a measure. This can be useful in cases where the fact that a value is missing in an attribute means something, such as in the example of a customer who has not made payments on a loan. 38
INCLUSIVE	Table	Includes the information from the test cases as part of the resulting prediction and combines it with the results in the model.
EXCLUSIVE (default)	Table 39	Does not include the information from the test cases as part of the resulting prediction.
INPUT_ONLY	Table	This ensures that the resulting prediction contains only the rows that were supplied from the test cases.
INCLUDE_STATISTICS 40	Table	This causes some scalar values $Probability and $Support to become automatically available as a column that can be included in a column list.

PredictProbability

 PredictProbability(<column>)

The PredictProbability function returns the percentage of cases that a given prediction represents. This percentage is returned for the row that contains the highest probability value and is therefore considered to be the predicted value. In the listing below, this function is used in a WHERE clause.

 SELECT     TC.[Mushroom Name],      DM.[Habitat],       DM.[Edibility],     PREDICTSUPPORT(DM.[Edibility]) as Edibility_Support FROM      <mushroom analysis rdbms> as DM PREDICTION JOIN  . . . WHERE PredictProbability(DM.Edibility)  > .90

PredictSupport 41

 PredictSupport(<column>)

The PredictSupport function returns the support value for the entry that has the highest probability. It provides information on how many records represent the given probability prediction. The listing below uses the Predict and PredictSupport functions.

 SELECT FLATTENED     [TC].[mushroom_name],     Predict([edibility]) as PredictionValue,     PredictSupport([edibility]) as SupportValue FROM     [Mushroom Analysis RDBMS] AS DM PREDICTION JOIN ...

This code yields the result set shown in Table 12-4.

Table 12-4. Query Results Using Predict and PredictSupport

Mushroom Name	PredictionValue 42	SupportValue
Agaricus Cothurnata	Poisonous	576
Agaricus Bisporigera	Poisonous 43	576
Agaricus Ocreata	Poisonous	576
Agaricus Muscaria 44	Poisonous	576
Agaricus Pantherina	Poisonous	576
Agaricus Phalloides	Poisonous	256
Agaricus Tenufolia	Poisonous	576 46
Agaricus Verna	Poisonous	576
Agaricus Virosa	Poisonous 47	576
Agaricus A1	Edible	16
Agaricus A2 48	Poisonous	40

PredictVariance

 PredictVariance(<column>)

The PredictVariance function returns a number that represents a statistical variance for the row that contains the highest probability for being true. This is relevant to continuous data values ( numbers ).

Note

PredictVariance, PredictStdev, PredictProbabilityStdev, and PredictProbabilityVariance functions are only relevant to continuous attributes. Remember that continuous attributes in Microsoft Data Mining are binned into mean values for the node in which they are contained. These functions provide insight into what those numbers represent in relation to the overall population of cases in that node. Knowing the standard deviation and variance of a number tells you how far away that mean value is from the high and low numbers for the cases in that node, which in turn determines the accuracy of that figure. It's important to keep in mind that the functions are not aggregate functions like those you would find in SQL Server or OLAP; instead, they represent the range of values that can be found in a given node. It's for that reason that each returned row from a prediction query can return a different return value for the function.

PredictStdev 49

 PredictStdev(<column>)

The PredictStdev function returns a number that represents a statistical standard deviation value for the row that contains the highest probability for being true. This is relevant to continuous data values (numbers).

PredictProbabilityVariance

 PredictProbabilityVariance(<column>)

The PredictProbabilityVariance function returns the variance of the probability for the histogram entry with the highest probability.

PredictProbabilityStdev

 PredictProbabilityStdev(<scalar column reference>)   50

The PredictProbabilityStdev function returns the standard deviation of the probability for the histogram entry with the highest probability.

PredictHistogram

 PredictHistogram(<scalar column reference>) PredictHistogram(<cluster column reference>)

The PredictHistogram function returns a table representing a histogram for prediction of the given column. When you look at a node in the Analysis manager, you'll notice the predicted value Id, broken down to all its possible values and the statistical probability of each. A histogram generates statistics columns. For any given predictive column, a histogram consists of the following seven columns:

$Probability
$ProbabilityStdev
$ProbabilityVariance 51
$Support
$Stdev (standard deviation)
$Variance
Column being predicted

And when the query involves a cluster, the following columns are returned:

$Distance
$Probability 52
$Support
Cluster to represent the cluster identifier

It's important to remember that this function returns a nested table with one record for each possible value for the predicted column. The following listing is an example of a query that uses the PredictHistogram function:

 SELECT FLATTENED [T1].[mushroom_name], (SELECT  edibility , $Probability, $support FROM      PredictHistogram([edibility])) AS [EdibilityStats] FROM     [Mushroom Analysis RDBMS] PREDICTION JOIN     OPENROWSET     ('SQLOLEDB.1',         'Provider=SQLOLEDB.1;Persist Security Info=False;             User ID=dtsuser;Initial Catalog=DataMiner;             Data Source=DATASERVER',         'SELECT * FROM  "Mushrooms_test" ORDER BY "Mushroom_name" ')     AS [T1] ON     [Mushroom Analysis RDBMS].[Cap Shape] = [T1].[Cap_Shape] AND         [Mushroom Analysis RDBMS].[Cap Surface] =             [T1].[Cap_Surface] AND         [Mushroom Analysis RDBMS].[Cap Color] =             [T1].[Cap_Color] AND         [Mushroom Analysis RDBMS].[Bruises] = [T1].[Bruises] AND         [Mushroom Analysis RDBMS].[Odor] = [T1].[Odor] AND         [Mushroom Analysis RDBMS].[Gill Attachment] =             [T1].[Gill_Attachment] AND         [Mushroom Analysis RDBMS].[Gill Spacing] =             [T1].[Gill_Spacing] AND         [Mushroom Analysis RDBMS].[Gill Size] =             [T1].[Gill_Size] AND         [Mushroom Analysis RDBMS].[Gill Color] =             [T1].[Gill_Color] AND         [Mushroom Analysis RDBMS].[Stalk Shape] =             [T1].[Stalk_Shape] AND         [Mushroom Analysis RDBMS].[Stalk Root] =             [T1].[Stalk_Root] AND         [Mushroom Analysis RDBMS].[Stalk Surface Above Ring] =             [T1].[Stalk_Surface_Above_Ring] AND         [Mushroom Analysis RDBMS].[Stalk Surface Below Ring] =             [T1].[Stalk_Surface_Below_Ring] AND         [Mushroom Analysis RDBMS].[Stalk Color Above Ring] =             [T1].[Stalk_Color_Above_Ring] AND         [Mushroom Analysis RDBMS].[Stalk Color Below Ring] =             [T1].[Stalk_Color_Below_Ring] AND         [Mushroom Analysis RDBMS].[Veil Type] =             [T1].[Veil_Type] AND         [Mushroom Analysis RDBMS].[Veil Color] =             [T1].[Veil_Color] AND         [Mushroom Analysis RDBMS].[Ring Number] =             [T1].[Ring_Number] AND         [Mushroom Analysis RDBMS].[Ring Type] =             [T1].[Ring_Type] AND         [Mushroom Analysis RDBMS].[Spore Print Color] =             [T1].[Spore_Print_Color] AND         [Mushroom Analysis RDBMS].[Population] =             [T1].[Population] AND        [Mushroom Analysis RDBMS].[habitat] = [T1].[habitat]

This query yields the results shown in Table 12-5.

Table 12-5. Query Results Using PredictHistogram Function

57 62 67 72

Mushroom Name 53	Edibility	Support	Probability
Agaricus A1	edible	16	0.895 54
Agaricus A1	missing		0.053
Agaricus A1 55	poisonous		0.053
Agaricus A2	poisonous	40 56	0.695
Agaricus A2	edible	16	0.288
Agaricus A2	missing		0.017
Agaricus Bisporigera	poisonous 58	576	0.997
Agaricus Bisporigera	missing		0.002 59
Agaricus Bisporigera	edible		0.002
Agaricus Cothurnata 60	poisonous	576	0.997
Agaricus Cothurnata	missing	61	0.002
Agaricus Cothurnata	edible		0.002
Agaricus Muscaria	poisonous	576	0.997
Agaricus Muscaria	missing 63		0.002
Agaricus Muscaria	edible		0.002 64
Agaricus Ocreata	poisonous	576	0.997
Agaricus Ocreata 65	missing		0.002
Agaricus Ocreata	edible	66	0.002
Agaricus Pantherina	poisonous	576	0.997
Agaricus Pantherina	missing		0.002
Agaricus Pantherina	edible 68		0.002
Agaricus Phalloides	poisonous	256	0.992 69
Agaricus Phalloides	missing		0.004
Agaricus Phalloides 70	edible		0.004
Agaricus Tenufolia	poisonous	576 71	0.997
Agaricus Tenufolia	missing		0.002
Agaricus Tenufolia	edible		0.002
Agaricus Verna	poisonous 73	576	0.997
Agaricus Verna	missing		0.002 74
Agaricus Verna	edible		0.002
Agaricus Virosa 75	poisonous	576	0.997
Agaricus Virosa	missing	76	0.002
Agaricus Virosa	edible		0.002

TopCount 77

 TopCount(<table expression>, <rank expression>, <n-items>)

TopCount is a table-returning expression that includes <table column reference> and functions that return a table. This function returns the first <n-items> rows in a decreasing order of <rank expression>.

TopSum

 TopSum(<table expression>, <rank expression>, <sum>)

TopSum is a table-returning expression that includes <table column reference> and functions that return a table. This function returns the first N rows in a decreasing order of <rank column reference>, such that the sum of the <rank expression> values is at least <sum>. TopSum returns the smallest number of elements possible while still meeting that criterion.

TopPercent

 TopPercent(<table expression>, <rank expression>, <percent>)   78

The TopPercent function returns the first N rows in a decreasing order of <rank expression>, such that the sum of the <rank column reference> values is at least the given percentage of the total sum of <rank column reference> values. TopPercent returns the smallest number of elements possible while still meeting that criterion.

RangeMin

 RangeMin(<column>)

When a numerical column gets placed in a bin, the number represents a mean value for that node. The RangeMin function finds the minimum value present in the cases used to build that binned or discretized value.

RangeMid

 RangeMid(<column>)

The RangeMid function finds the middle value of the predicted bucket that was discovered for a discretized column. 79

RangeMax

 RangeMax(<column>)

RangeMax is like the RangeMin function except that it finds the highest value in the bin instead of the lowest value.

PredictScore

 PredictScore(<scalar column reference>) PredictScore(<table column reference>)

The PredictScore function returns the prediction score of the specified column.

Note

The PredictScore function exists in OLE DB for data-mining specification but is not implemented by Microsoft Data Mining. The complete OLE DB for Data Mining specification can be downloaded from 80 ‚ http://www.microsoft.com/data/oledb/dm.htm ‚ .

PredictNodeId

 PredictNodeId(<column>)

The PredictNodeId function returns the node ID of the tree leaf node that the case is identified with. This function is especially useful in decision trees to find out how far down the tree the case was able to go (and therefore how accurate the prediction is) before the attributes could no longer be matched to a node.

Prediction Queries with Clustering Models

Unlike decision trees, clustering can't determine missing values directly by filling in the missing attribute columns, and yet it is available as a target for prediction queries. What can we use a cluster for if not to predict missing values? Given that you have identified clusters of cases with common characteristics, you may be able to indirectly predict something about your test cases by knowing what cluster your test cases would have belonged to if they had been used as part of a training set! For example, you might find that your cluster node #8 contains the most Gold credit card members and those members also happen to be single females who make more than $50,000 per year. If you find out which of your test cases belong in node #8, this information can be used to persuade members of this group to upgrade their cards. In this example, we're not using clustering to fill missing values; we're looking for cluster node identifiers for the test cases. 81

Note

In clustering models, each node in a cluster has an associated number. When you browse through a clustering model in the Analysis manager, you'll notice that each node is named with a number such as "Cluster 1, " "Cluster 2, " and so on. The number in that name is actually that node's ID value.

The Cluster, ClusterDistance, and ClusterProbability functions are used only with clustering models. They return crucial information related to the node IDs.

Cluster

The Cluster function takes no arguments and returns the identifier of the cluster for which the input case has the highest probability of belonging . It also can be used as a column reference for the PredictHistogram function.

ClusterProbability

 ClusterProbability(([<ClusterNode ID>]))

The ClusterProbability function returns the probability that the input case belongs to the cluster that has the highest probability. If the cluster node ID is given, the cluster is identified by the evaluation of the expression. The following example queries the FoodMart 2000 database using a singleton query as the data source. 82

 SELECT     [T1].[yearly income],      [T1].[marital status],      [T1].[gender],     cluster() as [cluster node],     clusterprobability() as [cluster probability] FROM     [Member Card Cluster]     PREDICTION JOIN         (select 'K - K' as [yearly income],            'S' as [marital status], 'F' as [gender])     AS [T1]     ON         [Member Card Cluster].[yearly income] =             [T1].[yearly income] AND          [Member Card Cluster].[marital status] =             [T1].[marital status] AND          [Member Card Cluster].[gender] = [T1].[gender]

This returns the cluster with the highest probability, as shown in Table 12-6:

Table 12-6. Record Containing the Cluster Node and Probability of Belonging.

Yearly Income	Marital Status	Gender	Cluster Node 83	Cluster Probability
$50K - $70K	S	F	8	0.30

Cluster node #8 shows that 89 percent of the members of that node have Bronze credit cards, and based on the cluster probability, we can assume that this person has a 30 percent chance of belonging to that group. Again, it's not a prediction in the most direct sense, but it is a form of classification and generalization. 84

ClusterDistance

 ClusterDistance([<ClusterNode ID>])

The ClusterDistance function returns the distance between the input case and the center of the cluster that has the highest probability. The closer the case is to the center, the more accurately it is represented by the node it's contained in. Unlike decision trees, clusters contain records that have varying degrees of relevance to their nodes because they might have some characteristics that identify them with a given node but also have others that do not. Some records are borderline cases that are so close to their node border that they almost belong to another node. Two good examples of borderline cases are the fish and mammal classifications. If our records contain primarily lions, tigers , monkeys , bass, sharks, and barracuda, then a test case containing a leopard will clearly fall in the center of the mammals node because of the leopard's characteristics. However, a dolphin, although a mammal, will be pulled away from the center toward the fish node because of characteristics a dolphin shares with fish. The number returned by the function determines how far from the center a given record is from a given node.

Note

The ClusterDistance function is not implemented yet and only returns a NULL value.

Using DTS to Run Prediction Queries

DTS is not only a tool used to generate new mining models, it can also be used to create queries and store the results in a SQL Server table, Microsoft Access database, or an ODBC-compliant database. In the DTS canvas, there is a toolbar on the bottom left with a Data Mining Prediction Task button (the miner's pickaxe). This button serves to create the queries, run them, and store them on another machine. Clicking on this button opens the Data Mining Prediction Query Task Wizard as shown in Figure 12-2.

Figure 12-2. Data Mining Prediction Query Task Wizard.

You can alter the name and the description of the task if you want. You must enter the server name where the data-mining model resides. Once you do, the drop-down menu below the server field will let you choose from all the Analysis Services databases. Pick the one that contains the model that you wish to query. The mining-models panel on the lower left contains the available mining models in the database you have chosen. The panel on the lower right provides some statistics that tell you which algorithm has been chosen and whether or not it has been processed . Once you're done, click on the Query tab and move to the next screen. On this screen, you must enter the connection string needed to access the data source where the test cases are stored. You may manually enter the string, or you can click on the button with three dots and build the string with the Data Link Properties dialog box, as shown in Figure 12-3.

Figure 12-3. Data Link Properties dialog box.

You may type your prediction query manually in the text box provided, or you can make use of the assistance provided by the Prediction Query Builder dialog box, which is accessed by clicking on the New Query button. (See Figure 12-4.)

Figure 12-4. Prediction Query Builder dialog box. 86

On the left panel of this dialog box, you'll find the available columns for the data-mining model, and on the right side, you'll find the automatically matched columns from the test data source. Those columns that are not matched automatically can be selected from drop-down menus on the right.

Caution

When you work with the Prediction Query Builder, there are some features to be aware of. First, the Prediction Query Builder will automatically and irrevocably choose the case ID of the data-mining model that is to be part of the query even though the case almost never should be part of the model. Usually the test cases come from samples that are not used to build the model (as should be the case); therefore, it's highly unlikely that you'll find a match between the ID values in the model and the ID values in the test cases. The builder also forces you to supply a matching column in the test case for the ID in the model before you can leave the window. I suggest that you go ahead and supply an arbitrary column as the matching column and that you edit the generated query later to eliminate references to joins involving those two columns. The columns in the model are automatically matched for you when the names are the same; however, the data-mining model replaces all underscores in column names with spaces, so you'll find the only columns that automatically match are those that contain no underscores in the test set. You will need to manually match the columns by picking them from the list on the right.

At the bottom of the window, you can select the attribute to be predicted. Click on the Finish button when you're done to generate the query in the Prediction Query editor as shown in Figure 12-5.

Figure 12-5. Prediction Query editor with generated code.

You can make any manual adjustments to the code including any special functions you wish to add in the Prediction Query editor. (See Figure 12-6.)

Figure 12-6. Prediction Query editor with modified code.

To modify the code, click on the Output tab to go to the Output selection window. (See Figure 12-7.)

Figure 12-7. Prediction Query Builder Output selection window.

The Output selection window is simple to use and requires only the connection string and table name of the location where the results will be stored. Again, you can either type in the connection string or use the Data Link dialog box.

Note

The generated table will overwrite any existing table with the same name.

Now you can click on the Execute button at the top of the DTS designer canvas and run the task until it's finished. (See Figure 12-8.) 88

Figure 12-8. Prediction Query Builder: Running the task.

Notice how simple it is to run the task because all connections, queries, and output designations are stored in one task. Now you can go to the location of the table and query the results in the same way you would any table. (See Figure 12-9.)

Figure 12-9. Querying the results.

Summary

The culmination of all the efforts to build a data-mining model centers around the ability to query it and thus use it to predict missing values from test cases. These cases can be on the topics of mushrooms, credit cards, or consumer product preferences. Microsoft Data Mining provides the prediction query syntax to allow a user or programmer to find missing prediction values from a decision tree with the same relative ease of querying a relational table. In this chapter, we covered DTS tasks to make these predictions. 89

Because the syntax of data-mining queries is so similar to that of standard SQL, the technical aspects of prediction queries are easily understood by thousands of SQL programmers.

The additional functions and special join features help gain an advantage from the statistical foundations that data mining is based on. With these functions, you generate predictions that come with additional information regarding the strength of those predictions in terms of degree of probability and support by number of cases. We also looked at using prediction queries to gather information from test cases as they relate to Microsoft Clustering. Although these are not prediction queries in the sense that we can predict the value of missing attributes, it does tell you what group any particular case is likely to belong to by using functions specific to clustering models.

Microsoft Data Mining provides tight integration with other Microsoft tools such as DTS, but also serves as an OLE DB provider, which allows third-party products to easily integrate prediction queries into their applications using PERL, C++, Visual Basic, or any other language capable of connecting through OLE DB.