The structure of a data-mining model is designed to be similar to a relational database table. It has tables that contain columns and a certain number of rows. It should come as no surprise, then, that the structure and contents of these tables can be retrieved through a SELECT statement, albeit a derivation of the standard SQL statement. This is important because you can develop third-party applications that access the structure and contents of the nodes in a decision tree model and the clusters of a clustering model. In fact, if you were up to it, you could use the information here to create your own node browser like the one in the Analysis Manager data-mining content browser. As we'll see in ‚ Chapter 12 ‚ , this SQL-like statement also allows us to use the models to make predictions against test data.
The Structure of the Data-Mining Model Case 1
One of the first things we need to do is understand how the structure of these data-mining models is represented when we query their content. The content we want relates to data-mining models, of course, but also to the individual components that make them up. Therefore, a given data-mining model has a certain number of nodes, each with certain characteristics, such as its name and the rule that describes it. But each node also contains a variable number of instances of each attribute that are the basis for the histogram and the distribution count. This panel is located on the right side of the node browser, which contains the percentage of each kind of attribute in the node.
Data-Mining Models Look Like Tables
In order to make it possible to query the model using conventionally available tools and interfaces, such as ADO with Visual Basic, it was necessary to make these objects available in the form of tables, with columns for the attributes and a row for each node. In some cases, there are logical data relationships such as with the node and the table of probability histograms in that node. To make objects available in this form, one of the columns in the node table contains a reference to a table, rather than a scalar value. This is called a nested table because of the way the second table appears to be embedded in the first table.
Using Code to Browse Data-Mining Models
First I'll begin with a short program that connects to a data-mining model and queries the contents of a data-mining model only for the sake of getting the names of the fields in the mushrooms model we created in ‚ Chapter 5 ‚ . 2
For the sake of simplicity and practicality, the code samples used in this chapter use VBScript with Windows Scripting Host (WSH). Windows Scripting Host allows you to run scripts that are similar both to the code you would use in a Visual Basic application and to code you would use in an Active Server Pages (ASP) page. Understanding this code requires you to have at least a rudimentary knowledge of the ADO object model and the VBScript syntax.
If you want to run these samples, download them from ‚ http://msdn.microsoft.com/scripting ‚ .If you happen to be using Windows 2000 or Windows XP, you don't need to do anything because WSH is already installed on your computer. Once you have WSH installed, you only need to create a text file with a .vbs extension and call it from the command line using the following syntax:
C:\>cscript myfile.vbs
Here's the code:
Dim cn Dim rs Dim ns Set cn = wscript.CreateObject("adodb.connection") Set rs = wscript.CreateObject("adodb.recordset") cn.Provider = "msolap.2" cn.ConnectionString = _ "data source=dataserver;initial catalog=mushrooms" cn.Open set rs = _ cn.Execute ("SELECT * from [mushroom analysis rdbms].content") For i = 0 to rs.fields.count -1 wscript.echo cstr(i+1) & ": " & rs.fields(i).name Next rs.Close cn.Close 3
The output from this program should display the following list of fields:
Microsoft (R) Windows Script Host Version 5.1 for Windows Copyright (C) Microsoft Corporation 1996-1999. All rights reserved. 1: MODEL_CATALOG 2: MODEL_SCHEMA 3: MODEL_NAME 4: ATTRIBUTE_NAME 5: NODE_NAME 6: NODE_UNIQUE_NAME 7: NODE_TYPE 8: NODE_GUID 9: NODE_CAPTION 10: CHILDREN_CARDINALITY 11: PARENT_UNIQUE_NAME 12: NODE_DESCRIPTION 13: NODE_RULE 14: MARGINAL_RULE 15: NODE_PROBABILITY 16: MARGINAL_PROBABILITY 17: NODE_DISTRIBUTION 18: NODE_SUPPORT 19: MSOLAP_MODEL_COLUMN 20: MSOLAP_NODE_SCORE 21: MSOLAP_NODE_SHORT_CAPTION
Note that in order to get this list, I used standard ADO with a connection string that specifies MSOLAP as the provider. Other than the choice of providers, the connection and the recordset operations are exactly the same as any ordinary ADO program using SQL Server as the provider.
The results of this query will be sorted by the following fields by default:
The query used is very close to standard Ansi-SQL except for the object of the FROM clause. This clause is [data mining model].content. The .content specifies that I'm not making a prediction query, but that I'm specifically querying the contents of the mining model in order to browse them. The purpose of each column is discussed in the"MINING_MODELS Schema Rowset" section later in this chapter.
The following program will display all the nodes in the data-mining model. It also shows a representative number of attributes.
Dim cn Dim rs Dim ns Dim tmpVal Set cn = wscript.CreateObject("adodb.connection") Set rs = wscript.CreateObject("adodb.recordset") Set ns = wscript.CreateObject("adodb.recordset") cn.Provider = "msolap.2" cn.ConnectionString = _ "data source=dataserver;initial catalog=mushrooms" cn.Open set rs = _ cn.Execute ("SELECT * from [mushroom analysis rdbms].content") While Not rs.EOF For i = 0 to rs.fields.count -1 ' This field is a nested table that can't be output to the ' screen like an ordinary scalar value; therefore, a label ' is placed there instead, knowing that there's code a few ' lines down, which is designed to list the contents of ' this nested table. If rs.fields(i).name = "NODE_DISTRIBUTION" Then tmpVal = "NESTED TABLE" Else tmpVal = rs.fields(i).value End If wscript.echo cstr(i+1) &": "& rs.fields(i).name & _ " - " & tmpVal ' This is a nested table that contains the distribution ' histogram of that node for the predictive value. If rs.fields(i).name = "NODE_DISTRIBUTION" Then ' This column actually contains a recordset object! set ns = rs.fields("node_distribution").value wscript.echo While not (ns.eof) For j = 0 To ns.Fields.Count - 1 wscript.echo " " & _ ns.Fields(j).name & _ " "& ns.Fields(j).Value Next wscript.echo " ********" ns.movenext Wend End If Next rs.Movenext wscript.echo "-----------" Wend rs.Close cn.Close
Here is a partial listing from the program's output that shows a representative number of attributes. Note the values in the fields:
. . . 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA - 3: MODEL_NAME - mushroom analysis rdbms 4: ATTRIBUTE_NAME - Edibility 5: NODE_NAME - 012147483669 6: NODE_UNIQUE_NAME - 012147483669 7: NODE_TYPE - 3 8: NODE_GUID - 9: NODE_CAPTION - Odor = none 10: CHILDREN_CARDINALITY - 2 11: PARENT_UNIQUE_NAME - 2147483669 12: NODE_DESCRIPTION - Odor = none 13: NODE_RULE - <predicate op = "eq" value = "none"> <simple-attribute name = "Odor"/> </predicate> 14: MARGINAL_RULE - <predicate op = "eq" value = "none"> <simple-attribute name = "Odor"/> </predicate> 15: NODE_PROBABILITY - 0.434268833087149 16: MARGINAL_PROBABILITY - 0.434268833087149 17: NODE_DISTRIBUTION - NESTED TABLE ATTRIBUTE_NAME Edibility ATTRIBUTE_VALUE missing SUPPORT 0 PROBABILITY 0.00283205890682526 VARIANCE 0 VALUETYPE 1 ******** ATTRIBUTE_NAME Edibility ATTRIBUTE_VALUE edible SUPPORT 3408 PROBABILITY 0.965448881336732 VARIANCE 0 VALUETYPE 4 ******** ATTRIBUTE_NAME Edibility ATTRIBUTE_VALUE poisonous SUPPORT 120 PROBABILITY 0.0342679127725857 VARIANCE 0 VALUETYPE 4 ******** 18: NODE_SUPPORT - 3528 19: MSOLAP_MODEL_COLUMN - Edibility 20: MSOLAP_NODE_SCORE - 262.639862428876 21: MSOLAP_NODE_SHORT_CAPTION - Odor = none . . .
The NODE_DISTRIBUTION column contains the distribution histogram nested table with the following fields as you can see from the previous listing. 5
Using the Schema Rowsets
Another more effective way to retrieve data about data mining is to use the ADO OpenSchema method to retrieve the metadata about the data-mining models within Analysis Services. Because the schema type has not been defined yet, the adSchemaProviderSpecific type needs to be used along with the GUID value of the type of metadata you need to retrieve. Here's the syntax:
Connection.OpenSchema(adSchemaProviderSpecific, _ array(restrictions), guid)
MINING_MODELS Schema Rowset 9
This schema rowset describes all the data-mining models present within the database. The order of the results is sorted by
Available restriction keys are
The column names are as follows :
Table 10-1. Service Type IDs |
DM_SERVICETYPE_CLASSIFICATION | 0x0000001 |
DM_SERVICETYPE_CLUSTERING 15 | 0x0000002 |
DM_SERVICETYPE_ASSOCIATION | 0x0000004 |
DM_SERVICETYPE_DENSITY_ESTIMATE | 0x0000008 |
DM_SERVICETYPE_SEQUENCE | 0x0000010 16 |
A restriction column serves as a filter to limit the types of rows that get returned. This gets defined as an array placed as a second parameter. For example:
connection.OpenSchema(adSchemaProviderSpecific,_ Array(EMPTY,EMPTY,EMPTY, "Microsoft_Decision_Trees"),_ DMSCHEMA_MINING_MODELS)
This restriction would limit the rows to those models that were created using decision trees.
To retrieve information about all the data-mining models in my Mushrooms database with the structure of Table 10-1, you can use the following program:
dim cn dim rs Set cn = wscript.CreateObject("adodb.connection") set rs = wscript.CreateObject("adodb.recordset") cn.Provider = "msolap.2" cn.ConnectionString = _ "data source=dataserver;initial catalog=mushrooms" cn.Open Const DMSCHEMA_MINING_MODELS = _ "{3add8a77-d8b9-11d2-8d2a-00e029154fde}" Const adSchemaProviderSpecific = -1 Set rs = cn.OpenSchema(adSchemaProviderSpecific, _ ,DMSCHEMA_MINING_MODELS) while rs.eof = 0 For j = 0 To rs.fields.count - 1 wscript.echo CStr(j+1) & ": " &_ rs.fields(j).name & " - " & rs.fields(j).value Next wscript.echo "********************" rs.movenext wend 17
This program will generate the following output:
Microsoft (R) Windows Script Host Version 5.1 for Windows Copyright (C) Microsoft Corporation 1996-1999. All rights reserved. 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA - 3: MODEL_NAME - Census cluster 4: MODEL_TYPE - 5: MODEL_GUID - {902FB4D2-EA1F-4E06-A2F4-AA6D9B5427BA} 6: DESCRIPTION - 7: MODEL_PROPID - 8: DATE_CREATED - 12/17/2000 1:52:35 PM 9: DATE_MODIFIED - 12/17/2000 2:05:10 PM 10: SERVICE_TYPE_ID - 2 11: SERVICE_NAME - Microsoft_Clustering 12: CREATION_STATEMENT - CREATE MINING MODEL [Census cluster's] ([Id] TEXT KEY , [Age] LONG DISCRETIZED() PREDICT, [Workclass] TEXT DISCRETE PREDICT, [Fnlwgt] TEXT DISCRETE PREDICT, [Education] TEXT DISCRETE PREDICT, [Education num] LONG CONTINUOUS PREDICT, [Marital status] TEXT DISCRETE PREDICT, [Occupation] TEXT DISCRETE PREDICT, [Relationship] TEXT DISCRETE PREDICT, [Race] TEXT DISCRETE PREDICT, [Sex] TEXT DISCRETE PREDICT, [Capital gain] DOUBLE CONTINUOUS PREDICT, [Capital loss] DOUBLE CONTINUOUS PREDICT, [Hours per week] LONG CONTINUOUS PREDICT, [Native country] TEXT DISCRETE PREDICT, [Class] TEXT DISCRETE PREDICT) USING Microsoft_Clustering 13: PREDICTION_ENTITY - Age,Workclass,Fnlwgt,Education, Education num,Marital status,Occupation,Relationship,Race, Sex,Capital gain,Capital loss,Hours per week,Native country, Class 14: IS_POPULATED - True 15: MSOLAP_MODEL_SOURCE - ******************** 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA - 3: MODEL_NAME - census 4: MODEL_TYPE - OLAP 5: MODEL_GUID - {11F7DA3B-365D-4CC4-97D3-991DA8FEE188} 6: DESCRIPTION - 7: MODEL_PROPID - 8: DATE_CREATED - 12/14/2000 2:57:14 AM 9: DATE_MODIFIED - 12/14/2000 2:57:17 AM 10: SERVICE_TYPE_ID - 1 11: SERVICE_NAME - Microsoft_Decision_Trees 12: CREATION_STATEMENT - CREATE OLAP MINING MODEL [census's] FROM [Census] (CASE DIMENSION [Occupation] LEVEL [Workclass] , LEVEL [Occupation] , DIMENSION [Relationship] LEVEL [Relationship] , DIMENSION [Class] PREDICT LEVEL [(All)] , LEVEL [Class] , DIMENSION [Country] LEVEL [Native country] , DIMENSION [race] LEVEL [Race] , DIMENSION [Education] LEVEL [Education] , LEVEL [Educationnum] , DIMENSION [Hours per week] LEVEL [Hours per week] , MEASURE [Capital gain] , MEASURE [Capital loss] ) USING Microsoft_Decision_Trees 13: PREDICTION_ENTITY - Class,Class.Class.Class 14: IS_POPULATED - True 15: MSOLAP_MODEL_SOURCE - Census ******************** 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA - 3: MODEL_NAME - CensusCluster 4: MODEL_TYPE - 5: MODEL_GUID - {8A626ECD-46CB-47BF-9C4C-8B9F9A4F02E6} 6: DESCRIPTION - 7: MODEL_PROPID - 8: DATE_CREATED - 12/25/2000 2:18:37 PM 9: DATE_MODIFIED - 12/25/2000 2:29:37 PM 10: SERVICE_TYPE_ID - 2 11: SERVICE_NAME - Microsoft_Clustering 12: CREATION_STATEMENT - CREATE MINING MODEL [CensusCluster's] ([Id] TEXT KEY , [Age] LONG CONTINUOUS PREDICT, [Workclass] TEXT DISCRETE PREDICT, [Fnlwgt] TEXT DISCRETE PREDICT, [Education] TEXT DISCRETE PREDICT, [Education num] LONG CONTINUOUS PREDICT, [Marital status] TEXT DISCRETE PREDICT, [Occupation] TEXT DISCRETE PREDICT, [Relationship] TEXT DISCRETE PREDICT, [Race] TEXT DISCRETE PREDICT, [Sex] TEXT DISCRETE PREDICT, [Capital gain] DOUBLE CONTINUOUS PREDICT, [Capital loss] DOUBLE CONTINUOUS PREDICT, [Hours per week] LONG CONTINUOUS PREDICT, [Native country] TEXT DISCRETE PREDICT, [Class] TEXT DISCRETE PREDICT) USING Microsoft_Clustering 13: PREDICTION_ENTITY - Age,Workclass,Fnlwgt,Education,Education num,Marital status,Occupation,Relationship,Race,Sex, Capital gain,Capital loss,Hours per week,Native country,Class 14: IS_POPULATED - True 15: MSOLAP_MODEL_SOURCE - ******************** 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA - 3: MODEL_NAME - Mushroom Analysis RDBMS 4: MODEL_TYPE - 5: MODEL_GUID - {499D525B-13EC-4B4C-918F-F75E73E01AC4} 6: DESCRIPTION - 7: MODEL_PROPID - 8: DATE_CREATED - 1/29/2001 10:45:25 PM 9: DATE_MODIFIED - 1/29/2001 10:45:33 PM 10: SERVICE_TYPE_ID - 1 11: SERVICE_NAME - Microsoft_Decision_Trees 12: CREATION_STATEMENT - CREATE MINING MODEL [Mushroom Analysis RDBMS ] ([Id] LONG KEY , [Cap Shape] TEXT DISCRETE , [Cap Surface] TEXT DISCRETE , [Cap Color] TEXT DISCRETE , [Bruises] TEXT DISCRETE , [Odor] TEXT DISCRETE , [Gill Attachment] TEXT DISCRETE , [Gill Spacing] TEXT DISCRETE , [Gill Size] TEXT DISCRETE , [Gill Color] TEXT DISCRETE , [Stalk Shape] TEXT DISCRETE , [Stalk Root] TEXT DISCRETE , [Stalk Surface Above Ring] TEXT DISCRETE , [Stalk Surface Below Ring] TEXT DISCRETE , [Stalk Color Above Ring] TEXT DISCRETE , [Stalk Color Below Ring] TEXT DISCRETE , [Veil Type] TEXT DISCRETE , [Veil Color] TEXT DISCRETE , [Ring Number] TEXT DISCRETE , [Ring Type] TEXT DISCRETE , [Spore Print Color] TEXT DISCRETE , [Population] TEXT DISCRETE , [Edibility] TEXT DISCRETE PREDICT, [Habitat] TEXT DISCRETE PREDICT) USING Microsoft_Decision_Trees 13: PREDICTION_ENTITY - Edibility,Habitat 14: IS_POPULATED - True 15: MSOLAP_MODEL_SOURCE - ******************** 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA - 3: MODEL_NAME - Mushrooms Cluster 4: MODEL_TYPE - 5: MODEL_GUID - {074F9AE0-D0DE-4925-A231-21265E564F9B} 6: DESCRIPTION - 7: MODEL_PROPID - 8: DATE_CREATED - 12/17/2000 12:48:19 PM 9: DATE_MODIFIED - 12/17/2000 12:48:57 PM 10: SERVICE_TYPE_ID - 2 11: SERVICE_NAME - Microsoft_Clustering 12: CREATION_STATEMENT - CREATE MINING MODEL [Mushrooms Cluster's] ([Id] TEXT KEY , [Edibility] TEXT DISCRETE PREDICT_ONLY, [Cap Shape] TEXT DISCRETE , [Cap Surface] TEXT DISCRETE , [Cap Color] TEXT DISCRETE , [Bruises] TEXT DISCRETE , [Odor] TEXT DISCRETE , [Gill Attachment] TEXT DISCRETE , [Gill Spacing] TEXT DISCRETE , [Gill Size] TEXT DISCRETE , [Gill Color] TEXT DISCRETE , [Stalk Shape] TEXT DISCRETE , [Stalk Root] TEXT DISCRETE , [Stalk Surface Above Ring] TEXT DISCRETE , [Stalk Surface Below Ring] TEXT DISCRETE , [Stalk Color Above Ring] TEXT DISCRETE , [Stalk Color Below Ring] TEXT DISCRETE , [Veil Type] TEXT DISCRETE , [Veil Color] TEXT DISCRETE , [Ring Number] TEXT DISCRETE , [Ring Type] TEXT DISCRETE , [Spore Print Color] TEXT DISCRETE , [Population] TEXT DISCRETE , [Habitat] TEXT DISCRETE ) USING Microsoft_Clustering (CLUSTER_COUNT=6) 13: PREDICTION_ENTITY - Edibility 14: IS_POPULATED - True 15: MSOLAP_MODEL_SOURCE - ********************
MINING_COLUMNS Schema Rowset
The MINING_COLUMNS schema rowset describes the individual columns of all defined data-mining models known to the provider. This schema rowset can be viewed as an enhanced form of the COLUMNS rowset for data-mining models. Many of the entries are derived from the COLUMNS schema rowset and are optional.
The columns are sorted in the following order:
The available restriction columns are:
The column names are as follows:
Table 10-2. Column Data Types |
"TABLE" | DBTYPE_HCHAPTER 24 |
"TEXT" | DBTYPE_WCHAR |
"LONG" | DBTYPE_I8 |
"DOUBLE" | DBTYPE_R8 |
"DATE" | DBTYPE_DATE |
Table 10-3. DISTRIBUTION_FLAG Values |
NORMAL | Distribution of data in which a majority of the continuous values are in the middle of the distribution curve, with a relatively equal number of variances in the high and low end. (See "Distribution Types" sidebar.) |
LOG_NORMAL | The distribution of continuous data suggests that there are a majority of values on the high end with a lower number of continuous variables skewed toward one end of the curve (high or low end). |
UNIFORM | All the values are the same. |
BINOMIAL 33 | The values in the distribution curve are distributed among two possibilities only, such as a Boolean true-false value or, as in our case "Yes" or "No", to designate edibility. |
MULTINOMIAL | Distribution of data among a fixed number of variables, usually from a discrete set. |
POISSON | Distribution used when there is a limited number of variables, perhaps even only two, but where the case count is high and the probability of one occurring and opposed to the other is extremely skewed. 34 |
HEAVYTAIL(also known as Pareto) | This is the source of the "80-20 rule" where 80 percent of a given activity can be explained by 20 percent of the factors. The key is to zero in on those 20 percent of the factors. |
MIXTURE | This concerns modeling a statistical distribution by a weighted sum of other distributions. It forms the basis for clustering. |
Provider-specific flags may also be defined. 35
Table 10-4. CONTENT_TYPE Values |
KEY | This denotes a discrete key value used to identify the case. This value actually never gets used for aiding in predictions. It simply helps Analysis Services keep track of cases internally. It's also useful if an operator seeks to tie cases together with the original data source for debugging purposes. |
DISCRETE 36 | Contains a discrete set of values, such as the edibility or the color values. The values are finite in number and have no relationship to each other. |
CONTINUOUS | Contains a numerical value that has quantitative meaning such as age or weight. In addition, the values have measurable differences between them that have value so that one age is mea surably higher or lower than another age. |
DISCRETIZED([value]) | A set of values, usually numbers , that are converted over to a smaller number of discrete values because the algorithm is unable to use them as inputs or prediction fields. To be able to analyze them, the values are converted to discrete values for the sake of training the model. 37 There are four possible parameters that can be provided to this function to direct this process:
|
ORDERED | Contains a set of values that have a hierarchical value relation ship to each other. Examples include tax brackets or earning levels. There is a rank associated to their value but unlike age, for example, there is no relative values between those levels. |
SEQUENCE_TIME 39 | The column contains time- related values that have a sequential relationship to each other, such as dates and years . |
CYCLICAL | These are discrete, but ordered, values that have to repeat themselves in a cyclical fashion, such as the months of the year or the days of the week. Functionally, these values act as both discrete and ordered values at the same time. |
PROBABILITY | This is the probability of this value occurring in a node. It's expressed as a decimal between 0 and 1. 40 |
VARIANCE | The statistical variance of the associated value in this column. |
STDEV | The standard deviation of the associated value in this column. |
SUPPORT | The number of occurrences of this value in the node. |
PROBABILITY_VARIANCE | The statistical variance of the probability of the associated value in this column. This is not yet supported by Analysis Services. |
PROBABILITY_STDEV 42 | The standard deviation of the probability of the associated value in this column. This is not yet supported by Analysis Services. |
ORDER | Used to order the columns in this table. This is the order in which this row is placed. |
Distribution Types
When we work with continuous data, such as used automobile prices, we see three types of statistical distribution, normal,log_normal, and uniform. For example, if you were to sample the prices of the cars in a large used car lot, assuming the cars are of basically the same type and mileage, you would expect to find a few very cheaply priced cars because they have unusually high mileage or a few dents, and you would also expect to find a few very expensive cars because they were only driven by the proverbial "little old lady who only drove the car on Sundays to church . " Generally, the bulk of the cars in the lot will fall within a similar price range, give or take a few hundred dollars. Graphing the prices of these cars shows a normal distribution curve that should look something like the one shown here. 43
Normal distribution curve. |
If this were an atypical car lot with average priced cars and a few cheap ones, you could draw a log_normal curve like the one shown here.
Log_Normal distribution curve. |
If by chance all the cars were priced exactly the same, the curve would be uniform like this one.
Uniform distribution curve. |
Provider-specific flags may also be defined. Those listed here provide additional information or hints that Analysis Services uses to build the model.
Table 10-5. MODELING_FLAG Values |
MODEL_EXISTENCE_ONLY | Denotes that a column is important not because of the value that it contains, but because a value exists at all. It can be more important to know that a mush- room has a value for"Spore Color", rather than the value itself, because the fact that the mushroom has spores is an important predictor of edibility. 45 |
NOT_NULL | This column must have a value. The absence of a value is considered an anomaly. |
IGNORE_NULL | If a column contains a null value, it won't provide any statistical information about the model. |
NULL_INFORMATIVE | Null values are counted, but considered missing. The fact that the value is null can have high significance. Consider data-mining ballots, missing values are important in that they signify a refusal to vote or even a miscount. |
Provider-specific flags may also be defined.
<function name>(<column1> [, <column2>], ...)
. . . Const DMSCHEMA_MINING_COLUMNS = _ "{3add8a78-d8b9-11d2-8d2a-00e029154fde}" Const adSchemaProviderSpecific = -1 Set rs = cn.OpenSchema(adSchemaProviderSpecific,_ array(empty,empty,"mushroom analysis rdbms","edibility"),_ DMSCHEMA_MINING_COLUMNS) ... Microsoft (R) Windows Script Host Version 5.1 for Windows Copyright (C) Microsoft Corporation 1996-1999. All rights reserved. 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA - 3: MODEL_NAME - mushroom analysis rdbms 4: COLUMN_NAME - Edibility 5: COLUMN_GUID - 6: COLUMN_PROPID - 7: ORDINAL_POSITION - 23 8: COLUMN_HAS_DEFAULT - False 9: COLUMN_DEFAULT - 10: COLUMN_FLAGS - 0 11: IS_NULLABLE - True 12: DATA_TYPE - 130 13: TYPE_GUID - 14: CHARACTER_MAXIMUM_LENGTH - 0 15: CHARACTER_OCTET_LENGTH - 0 16: NUMERIC_PRECISION - 17: NUMERIC_SCALE - 18: DATETIME_PRECISION - 19: CHARACTER_SET_CATALOG - 20: CHARACTER_SET_SCHEMA - 21: CHARACTER_SET_NAME - 22: COLLATION_CATALOG - 23: COLLATION_SCHEMA - 24: COLLATION_NAME - 25: DOMAIN_CATALOG - 26: DOMAIN_SCHEMA - 27: DOMAIN_NAME - 28: DESCRIPTION - 29: DISTRIBUTION_FLAG - NONE 30: CONTENT_TYPE - DISCRETE 31: MODELING_FLAG - 32: IS_RELATED_TO_KEY - False 33: RELATED_ATTRIBUTE - 34: IS_INPUT - True 35: IS_PREDICTABLE - True 36: CONTAINING_COLUMN - 37: PREDICTION_SCALAR_FUNCTIONS -Predict,PredictAdjustedProbability, PredictProbability,PredictScore,PredictStddev,PredictStdev, PredictSupport,PredictVariance 38: PREDICTION_TABLE_FUNCTIONS - PredictHistogram 39: IS_POPULATED - True 40: PREDICTION_SCORE - 0
MINING_MODEL_CONTENT Schema Rowset
This schema rowset allows browsing of the content of a data-mining model. The user can employ special tree-operation restrictions to navigate the contents in a manner that facilitates the creation of a decision tree graph or a cluster graph. 50
The results are ordered by the following fields by default:
The following are the available restriction columns:
There is an additional restriction called the tree operation that isn't on any particular column of the MINING_MODEL_CONTENT rowset; rather, it specifies a tree operator. The idea is that the application specified a NODE_UNIQUE_NAME restriction and the tree operator (ANCESTORS, CHILDREN, SIBLINGS, PARENT, DESCENDANTS, SELF) to obtain the desired set of members . The SELF operator includes the row for the node itself in the list of returned rows. The following constants are defined in Table 10-6.
Table 10-6. NODE_UNIQUE_NAME Restrictions |
DMTREEOP_ANCESTORS | 0x00000020 |
DMTREEOP_CHILDREN | 0x00000001 53 |
DMTREEOP_SIBLINGS | 0x00000002 |
DMTREEOP_PARENT | 0x00000004 |
DMTREEOP_SELF | 0x00000008 |
DMTREEOP_DESCENDANTS 54 | 0x00000010 |
The columns are as follows:
MINING_SERVICES Schema Rowset
This rowset exposes the data-mining algorithms available from the provider. It can be used to determine the prediction capabilities, complexity, and similar information about the algorithm. Third-party providers need to supply the values of their algorithms in this table.
The columns are sorted by SERVICE_NAME.
Column restrictions available are
The column descriptions are as follows:
Provider-specific flags may also be defined.
Provider-specific flags may also be defined. 70
Provider-specific flags may also be defined.
Provider-specific flags may also be defined.
Table 10-7. TRAINING_COMPLEXITY Values |
DM_TRAINING_COMPLEXITY_LOW | Running time is proportional to input and is relatively short. |
DM_TRAINING_COMPLEXITY_MEDIUM | Running time may be long but is generally proportional to input. 75 |
DM_TRAINING_COMPLEXITY_HIGH | Running time is long and may grow exponentially in relationship to input. |
SERVICE_PARAMETERS Schema Rowset 79
This schema rowset provides a list of parameters that can be supplied when generating a mining model with the CREATE MINING MODEL statement. The client will generally restrict by SERVICE_NAME to obtain the parameters supported by the provider and applicable to the type of mining model being generated.
The default sort order is
The available restriction columns are
Column descriptions are as follows:
Example Listing of Algorithms Available Using Code
... Const DMSCHEMA_MINING_SERVICES = _ "{3add8a95-d8b9-11d2-8d2a-00e029154fde}" Const adSchemaProviderSpecific = -1 Set rs = cn.OpenSchema(adSchemaProviderSpecific,, _ DMSCHEMA_MINING_SERVICES) ...
The following results are output from the code above:
Microsoft (R) Windows Script Host Version 5.1 for Windows Copyright (C) Microsoft Corporation 1996-1999. All rights reserved. 1: SERVICE_NAME - Microsoft_Clustering 2: SERVICE_TYPE_ID - 2 3: SERVICE_DISPLAY_NAME - Microsoft Clustering 4: SERVICE_GUID - {181CB1C9-C7F5-11D3-8BF9-00C04F68DDC2} 5: DESCRIPTION - Clustering finds natural groupings of data in a multidimensional space. Clustering is useful when you want to see general groupings in your data. 6: PREDICTION_LIMIT - 0 7: SUPPORTED_DISTRIBUTION_FLAGS - NORMAL,UNIFORM 8: SUPPORTED_INPUT_CONTENT_TYPES - CONTINUOUS,CYCLICAL, DISCRETE,DISCRETIZED,ORDERED,SEQUENCE_TIME,TABLE 9: SUPPORTED_PREDICTION_CONTENT_TYPES - CONTINUOUS,CYCLICAL, DISCRETE,DISCRETIZED,SEQUENCE_TIME,TABLE 10: SUPPORTED_MODELING_FLAGS - MODEL_EXISTENCE_ONLY, NOT NULL 11: SUPPORTED_SOURCE_QUERY - OPENROWSET,SHAPE,SINGLETON SELECT 12: TRAINING_COMPLEXITY - 1 13: PREDICTION_COMPLEXITY - 0 14: EXPECTED_QUALITY - 2 15: SCALING - 1 16: ALLOW_INCREMENTAL_INSERT - False 17: ALLOW_PMML_INITIALIZATION - True 18: CONTROL - 1 19: ALLOW_DUPLICATE_KEY - False ******************** 1: SERVICE_NAME - Microsoft_Decision_Trees 2: SERVICE_TYPE_ID - 1 3: SERVICE_DISPLAY_NAME - Microsoft Decision Trees 4: SERVICE_GUID - {181CB1C8-C7F5-11D3-8BF9-00C04F68DDC2} 5: DESCRIPTION - The Microsoft Decision Trees algorithm chooses significant characteristics in the data and narrows sets of data based on those characteristics until clear correlations are established. Decision trees are useful when you want to make specific predictions based on information in the source data. 6: PREDICTION_LIMIT - 0 7: SUPPORTED_DISTRIBUTION_FLAGS - NORMAL,UNIFORM 8: SUPPORTED_INPUT_CONTENT_TYPES - CONTINUOUS,CYCLICAL, DISCRETE,DISCRETIZED,KEY,ORDERED,TABLE 9: SUPPORTED_PREDICTION_CONTENT_TYPES - DISCRETE,DISCRETIZED, TABLE 10: SUPPORTED_MODELING_FLAGS - MODEL_EXISTENCE_ONLY, NOT NULL 11: SUPPORTED_SOURCE_QUERY - OPENROWSET,SHAPE,SINGLETON SELECT 12: TRAINING_COMPLEXITY - 1 13: PREDICTION_COMPLEXITY - 0 14: EXPECTED_QUALITY - 1 15: SCALING - 1 16: ALLOW_INCREMENTAL_INSERT - False 17: ALLOW_PMML_INITIALIZATION - True 18: CONTROL - 1 19: ALLOW_DUPLICATE_KEY - False ******************
MODEL_CONTENT_PMML Schema Rowset
This rowset stores the XML representation of the content of each model. The format of the XML string follows the PMML standard. 83
The default sort order of these results is
The available restriction columns are
The columns descriptions are as follows:
Summary
Almost all structural aspects of the data-mining models are exposed through the OLEDB for data-mining provider, which stored all the data about them in table-like structures. This information is complete; thus it's possible not only to understand the internal parameters at work that derived the model, but also to browse the content of these models in sufficient detail to create your own version of the Analysis Services browser if you choose to do so. 87
For the most part, this interface is more about reporting on the data in the model and understanding the models themselves than it is about manipulating them. In the next chapter, we'll be looking more at the role of PivotTable and how this service helps read the contents of a model and even make local mining models that can be stored on your computer and used offline.