8 Using Microsoft Data Transformation Services (DTS)

The structure of a data-mining model is designed to be similar to a relational database table. It has tables that contain columns and a certain number of rows. It should come as no surprise, then, that the structure and contents of these tables can be retrieved through a SELECT statement, albeit a derivation of the standard SQL statement. This is important because you can develop third-party applications that access the structure and contents of the nodes in a decision tree model and the clusters of a clustering model. In fact, if you were up to it, you could use the information here to create your own node browser like the one in the Analysis Manager data-mining content browser. As we'll see in ‚  Chapter 12 ‚  , this SQL-like statement also allows us to use the models to make predictions against test data.

The Structure of the Data-Mining Model Case 1

One of the first things we need to do is understand how the structure of these data-mining models is represented when we query their content. The content we want relates to data-mining models, of course, but also to the individual components that make them up. Therefore, a given data-mining model has a certain number of nodes, each with certain characteristics, such as its name and the rule that describes it. But each node also contains a variable number of instances of each attribute that are the basis for the histogram and the distribution count. This panel is located on the right side of the node browser, which contains the percentage of each kind of attribute in the node.

Data-Mining Models Look Like Tables

In order to make it possible to query the model using conventionally available tools and interfaces, such as ADO with Visual Basic, it was necessary to make these objects available in the form of tables, with columns for the attributes and a row for each node. In some cases, there are logical data relationships such as with the node and the table of probability histograms in that node. To make objects available in this form, one of the columns in the node table contains a reference to a table, rather than a scalar value. This is called a nested table because of the way the second table appears to be embedded in the first table.

Using Code to Browse Data-Mining Models

First I'll begin with a short program that connects to a data-mining model and queries the contents of a data-mining model only for the sake of getting the names of the fields in the mushrooms model we created in ‚  Chapter 5 ‚  . 2


Note

For the sake of simplicity and practicality, the code samples used in this chapter use VBScript with Windows Scripting Host (WSH). Windows Scripting Host allows you to run scripts that are similar both to the code you would use in a Visual Basic application and to code you would use in an Active Server Pages (ASP) page. Understanding this code requires you to have at least a rudimentary knowledge of the ADO object model and the VBScript syntax.


If you want to run these samples, download them from ‚  http://msdn.microsoft.com/scripting ‚  .If you happen to be using Windows 2000 or Windows XP, you don't need to do anything because WSH is already installed on your computer. Once you have WSH installed, you only need to create a text file with a .vbs extension and call it from the command line using the following syntax:

C:\>cscript myfile.vbs

Here's the code:

 Dim cn Dim rs Dim ns Set cn = wscript.CreateObject("adodb.connection") Set rs = wscript.CreateObject("adodb.recordset") cn.Provider = "msolap.2" cn.ConnectionString = _     "data source=dataserver;initial catalog=mushrooms" cn.Open set rs = _     cn.Execute ("SELECT * from [mushroom analysis rdbms].content") For i = 0 to rs.fields.count -1     wscript.echo cstr(i+1) & ": " & rs.fields(i).name Next rs.Close cn.Close   3   

The output from this program should display the following list of fields:

 Microsoft (R) Windows Script Host Version 5.1 for Windows Copyright (C) Microsoft Corporation 1996-1999.  All rights reserved. 1: MODEL_CATALOG 2: MODEL_SCHEMA 3: MODEL_NAME 4: ATTRIBUTE_NAME 5: NODE_NAME 6: NODE_UNIQUE_NAME 7: NODE_TYPE 8: NODE_GUID 9: NODE_CAPTION 10: CHILDREN_CARDINALITY 11: PARENT_UNIQUE_NAME 12: NODE_DESCRIPTION 13: NODE_RULE 14: MARGINAL_RULE 15: NODE_PROBABILITY 16: MARGINAL_PROBABILITY 17: NODE_DISTRIBUTION 18: NODE_SUPPORT 19: MSOLAP_MODEL_COLUMN 20: MSOLAP_NODE_SCORE 21: MSOLAP_NODE_SHORT_CAPTION 

Note that in order to get this list, I used standard ADO with a connection string that specifies MSOLAP as the provider. Other than the choice of providers, the connection and the recordset operations are exactly the same as any ordinary ADO program using SQL Server as the provider.


Note

The results of this query will be sorted by the following fields by default:


MODEL_CATALOG
MODEL_SCHEMA
MODEL_NAME 4
ATTRIBUTE_NAME

The query used is very close to standard Ansi-SQL except for the object of the FROM clause. This clause is [data mining model].content. The .content specifies that I'm not making a prediction query, but that I'm specifically querying the contents of the mining model in order to browse them. The purpose of each column is discussed in the"MINING_MODELS Schema Rowset" section later in this chapter.

The following program will display all the nodes in the data-mining model. It also shows a representative number of attributes.

 Dim cn Dim rs Dim ns Dim tmpVal Set cn = wscript.CreateObject("adodb.connection") Set rs = wscript.CreateObject("adodb.recordset") Set ns = wscript.CreateObject("adodb.recordset") cn.Provider = "msolap.2" cn.ConnectionString = _     "data source=dataserver;initial catalog=mushrooms" cn.Open set rs = _     cn.Execute ("SELECT * from [mushroom analysis rdbms].content") While Not rs.EOF     For i = 0 to rs.fields.count -1         ' This field is a nested table that can't be output to the         ' screen like an ordinary scalar value; therefore, a label         ' is placed there instead, knowing that there's code a few         ' lines down, which is designed to list the contents of         ' this nested table.         If rs.fields(i).name = "NODE_DISTRIBUTION" Then             tmpVal = "NESTED TABLE"         Else             tmpVal = rs.fields(i).value         End If         wscript.echo cstr(i+1) &": "& rs.fields(i).name & _             " - " & tmpVal         ' This is a nested table that contains the distribution         ' histogram of that node for the predictive value.         If rs.fields(i).name = "NODE_DISTRIBUTION" Then             ' This column actually contains a recordset object!             set ns = rs.fields("node_distribution").value             wscript.echo             While not (ns.eof)                 For j = 0 To ns.Fields.Count - 1                     wscript.echo "     " & _                         ns.Fields(j).name & _                         " "& ns.Fields(j).Value                 Next                 wscript.echo "     ********"                 ns.movenext             Wend         End If     Next     rs.Movenext     wscript.echo "-----------" Wend rs.Close cn.Close 

Here is a partial listing from the program's output that shows a representative number of attributes. Note the values in the fields:

 . . . 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA -  3: MODEL_NAME - mushroom analysis rdbms 4: ATTRIBUTE_NAME - Edibility 5: NODE_NAME - 012147483669 6: NODE_UNIQUE_NAME - 012147483669 7: NODE_TYPE - 3 8: NODE_GUID -  9: NODE_CAPTION - Odor = none  10: CHILDREN_CARDINALITY - 2 11: PARENT_UNIQUE_NAME - 2147483669 12: NODE_DESCRIPTION - Odor = none  13: NODE_RULE - <predicate op = "eq" value = "none"> <simple-attribute name = "Odor"/> </predicate> 14: MARGINAL_RULE - <predicate op = "eq" value = "none"> <simple-attribute name = "Odor"/> </predicate> 15: NODE_PROBABILITY - 0.434268833087149 16: MARGINAL_PROBABILITY - 0.434268833087149 17: NODE_DISTRIBUTION - NESTED TABLE      ATTRIBUTE_NAME Edibility      ATTRIBUTE_VALUE missing      SUPPORT 0      PROBABILITY 0.00283205890682526      VARIANCE 0      VALUETYPE 1      ********      ATTRIBUTE_NAME Edibility      ATTRIBUTE_VALUE edible      SUPPORT 3408      PROBABILITY 0.965448881336732      VARIANCE 0      VALUETYPE 4      ********      ATTRIBUTE_NAME Edibility      ATTRIBUTE_VALUE poisonous      SUPPORT 120      PROBABILITY 0.0342679127725857      VARIANCE 0      VALUETYPE 4      ******** 18: NODE_SUPPORT - 3528 19: MSOLAP_MODEL_COLUMN - Edibility 20: MSOLAP_NODE_SCORE - 262.639862428876 21: MSOLAP_NODE_SHORT_CAPTION - Odor = none . . . 

The NODE_DISTRIBUTION column contains the distribution histogram nested table with the following fields as you can see from the previous listing. 5

  • ‚  ATTRIBUTE_NAME ‚   Name of the attribute.
  • ‚  ATTRIBUTE_VALUE ‚   The attribute value represented as a variant.
  • ‚  SUPPORT ‚   The number of cases that support this attribute value.
  • ‚  PROBABILITY 6 ‚   Probability of occurrence of this attribute value.
  • ‚  VARIANCE ‚   Variance of this attribute value.
  • ‚  VALUETYPE ‚   The value type of the attribute. It can be one of the following values that describes the type of attribute that is contained in that column.
    • ‚  VALUETYPE_MISSING = 1 ‚   Missing values, which can occur with cases having incomplete information. 7
    • ‚  VALUETYPE_EXISTING = ‚   The value exists and does not fall in categories 3 to 6. This occurs with values that are already discrete.
    • ‚  VALUETYPE_CONTINUOUS = 3 ‚   Numerical data that has a sequence, such as time, age, or distance.
    • ‚  VALUETYPE_DISCRETE = 4 ‚   Values that have a finite number, such as colors or shapes . In the example above, there are only two discrete values: "Yes" and"No."
    • ‚  VALUETYPE_DISCRETIZED = 5 8 ‚   Numerical values, such as prices that cannot be dealt with unless they are artificially placed in a bin or discrete type. An attribute that contains a wide range of prices or weights, for example, will likely be discretized, or converted to a discrete type by classifying the prices according to four or five price or weight nodes.
    • ‚  VALUETYPE_BOOLEAN = 6 ‚   True or False (0 or 1) values.

Using the Schema Rowsets

Another more effective way to retrieve data about data mining is to use the ADO OpenSchema method to retrieve the metadata about the data-mining models within Analysis Services. Because the schema type has not been defined yet, the adSchemaProviderSpecific type needs to be used along with the GUID value of the type of metadata you need to retrieve. Here's the syntax:

 Connection.OpenSchema(adSchemaProviderSpecific, _     array(restrictions), guid) 

MINING_MODELS Schema Rowset 9

This schema rowset describes all the data-mining models present within the database. The order of the results is sorted by

  • MODEL_CATALOG
  • MODEL_SCHEMA
  • MODEL_NAME

Available restriction keys are

  • MODEL_CATALOG
  • MODEL_SCHEMA 10
  • MODEL_NAME
  • MODEL_TYPE
  • SERVICE_NAME
  • SERVICE_TYPE_ID

The column names are as follows :

  • ‚  MODEL_CATALOG ‚   Name of the database that contains the data-mining model. 11
  • ‚  MODEL_SCHEMA ‚   Schema name. Most often it's NULL unless the provider supplies or supports the use of schema names.
  • ‚  MODEL_NAME ‚   Name of the data-mining model.
  • ‚  MODEL_TYPE ‚   Model type, a provider-specific string usually containing NULL.
  • ‚  SERVICE_NAME 12 ‚   A provider-specific name that describes the algorithm used to generate the model.
  • ‚  MODEL_GUID ‚   Unique identifier of the model used internally by Analysis Services. It serves as the unique GUID value for that model.
  • ‚  DESCRIPTION ‚   Description of the model, if one was specified at creation time.
  • ‚  MODEL_PROPID ‚   Property ID of the model. Usually NULL because data-mining models do not make use of this value. 13
  • ‚  DATE_CREATED ‚   Date when the model was created.
  • ‚  DATE_MODIFIED ‚   Date when the model definition was last modified.
  • ‚  CREATION_STATEMENT ‚   This is the creation statement used to build the data-mining model.
  • ‚  PREDICTION_ENTITY 14 ‚   A comma-delimited list indicating which columns the model can predict.
  • ‚  SERVICE_TYPE_ID ‚   A bitmask that serves as the ID for the algorithm used to build the model. Even though the first two algorithms are the only ones currently supported, the other values were already added to provide support for possible future implementations . Any new third-party algorithms that get added to Analysis Services will require that a field be added in Table 10-1.
  • Table 10-1. Service Type IDs
DM_SERVICETYPE_CLASSIFICATION 0x0000001
DM_SERVICETYPE_CLUSTERING 15 0x0000002
DM_SERVICETYPE_ASSOCIATION 0x0000004
DM_SERVICETYPE_DENSITY_ESTIMATE 0x0000008
DM_SERVICETYPE_SEQUENCE 0x0000010 16
  • ‚  IS_POPULATED ‚   Boolean VARIANT_TRUE if the model is populated; VARIANT_FALSE if the model is not populated as would be the case in an empty model with a defined structure that has not been trained with data.

A restriction column serves as a filter to limit the types of rows that get returned. This gets defined as an array placed as a second parameter. For example:

 connection.OpenSchema(adSchemaProviderSpecific,_     Array(EMPTY,EMPTY,EMPTY, "Microsoft_Decision_Trees"),_     DMSCHEMA_MINING_MODELS) 

This restriction would limit the rows to those models that were created using decision trees.

To retrieve information about all the data-mining models in my Mushrooms database with the structure of Table 10-1, you can use the following program:

 dim cn dim rs Set cn = wscript.CreateObject("adodb.connection") set rs = wscript.CreateObject("adodb.recordset") cn.Provider = "msolap.2" cn.ConnectionString = _     "data source=dataserver;initial catalog=mushrooms" cn.Open Const DMSCHEMA_MINING_MODELS = _     "{3add8a77-d8b9-11d2-8d2a-00e029154fde}" Const adSchemaProviderSpecific = -1 Set rs = cn.OpenSchema(adSchemaProviderSpecific, _     ,DMSCHEMA_MINING_MODELS) while rs.eof = 0     For j = 0 To rs.fields.count - 1         wscript.echo CStr(j+1) & ": " &_              rs.fields(j).name & " - " & rs.fields(j).value     Next     wscript.echo  "********************"     rs.movenext wend   17   

This program will generate the following output:

 Microsoft (R) Windows Script Host Version 5.1 for Windows Copyright (C) Microsoft Corporation 1996-1999.  All rights reserved. 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA -  3: MODEL_NAME - Census cluster 4: MODEL_TYPE -  5: MODEL_GUID - {902FB4D2-EA1F-4E06-A2F4-AA6D9B5427BA} 6: DESCRIPTION -  7: MODEL_PROPID -  8: DATE_CREATED - 12/17/2000 1:52:35 PM 9: DATE_MODIFIED - 12/17/2000 2:05:10 PM 10: SERVICE_TYPE_ID - 2 11: SERVICE_NAME - Microsoft_Clustering 12: CREATION_STATEMENT - CREATE MINING MODEL [Census cluster's]  ([Id] TEXT   KEY  , [Age] LONG   DISCRETIZED()  PREDICT, [Workclass]  TEXT   DISCRETE  PREDICT, [Fnlwgt] TEXT   DISCRETE  PREDICT,  [Education] TEXT   DISCRETE  PREDICT, [Education num] LONG    CONTINUOUS  PREDICT, [Marital status] TEXT   DISCRETE  PREDICT,  [Occupation] TEXT   DISCRETE  PREDICT, [Relationship] TEXT   DISCRETE   PREDICT, [Race] TEXT   DISCRETE  PREDICT, [Sex] TEXT   DISCRETE   PREDICT, [Capital gain] DOUBLE   CONTINUOUS  PREDICT,  [Capital loss] DOUBLE   CONTINUOUS  PREDICT, [Hours per week]  LONG   CONTINUOUS  PREDICT, [Native country] TEXT   DISCRETE   PREDICT, [Class] TEXT   DISCRETE  PREDICT) USING  Microsoft_Clustering 13: PREDICTION_ENTITY - Age,Workclass,Fnlwgt,Education,  Education num,Marital status,Occupation,Relationship,Race,  Sex,Capital gain,Capital loss,Hours per week,Native country,  Class 14: IS_POPULATED - True 15: MSOLAP_MODEL_SOURCE -  ******************** 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA -  3: MODEL_NAME - census 4: MODEL_TYPE - OLAP 5: MODEL_GUID - {11F7DA3B-365D-4CC4-97D3-991DA8FEE188} 6: DESCRIPTION -  7: MODEL_PROPID -  8: DATE_CREATED - 12/14/2000 2:57:14 AM 9: DATE_MODIFIED - 12/14/2000 2:57:17 AM 10: SERVICE_TYPE_ID - 1 11: SERVICE_NAME - Microsoft_Decision_Trees 12: CREATION_STATEMENT - CREATE OLAP MINING MODEL [census's]  FROM [Census] (CASE DIMENSION [Occupation] LEVEL [Workclass]  ,  LEVEL [Occupation]  , DIMENSION [Relationship]   LEVEL [Relationship]  , DIMENSION [Class] PREDICT LEVEL [(All)]   , LEVEL [Class]  , DIMENSION [Country]  LEVEL [Native country]   , DIMENSION [race]  LEVEL [Race]  , DIMENSION [Education]   LEVEL [Education]  , LEVEL [Educationnum]  ,  DIMENSION [Hours per week]  LEVEL [Hours per week]  ,  MEASURE [Capital gain] , MEASURE [Capital loss] )  USING Microsoft_Decision_Trees 13: PREDICTION_ENTITY - Class,Class.Class.Class 14: IS_POPULATED - True 15: MSOLAP_MODEL_SOURCE - Census ******************** 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA -  3: MODEL_NAME - CensusCluster 4: MODEL_TYPE -  5: MODEL_GUID - {8A626ECD-46CB-47BF-9C4C-8B9F9A4F02E6} 6: DESCRIPTION -  7: MODEL_PROPID -  8: DATE_CREATED - 12/25/2000 2:18:37 PM 9: DATE_MODIFIED - 12/25/2000 2:29:37 PM 10: SERVICE_TYPE_ID - 2 11: SERVICE_NAME - Microsoft_Clustering 12: CREATION_STATEMENT - CREATE MINING MODEL [CensusCluster's]  ([Id] TEXT   KEY  , [Age] LONG   CONTINUOUS  PREDICT, [Workclass]  TEXT   DISCRETE  PREDICT, [Fnlwgt] TEXT   DISCRETE  PREDICT,  [Education] TEXT   DISCRETE  PREDICT, [Education num] LONG  CONTINUOUS  PREDICT, [Marital status] TEXT   DISCRETE  PREDICT,  [Occupation] TEXT   DISCRETE  PREDICT, [Relationship] TEXT    DISCRETE  PREDICT, [Race] TEXT   DISCRETE  PREDICT, [Sex] TEXT    DISCRETE  PREDICT, [Capital gain] DOUBLE   CONTINUOUS  PREDICT,  [Capital loss] DOUBLE   CONTINUOUS  PREDICT, [Hours per week]  LONG   CONTINUOUS  PREDICT, [Native country] TEXT   DISCRETE   PREDICT, [Class] TEXT   DISCRETE  PREDICT) USING  Microsoft_Clustering 13: PREDICTION_ENTITY - Age,Workclass,Fnlwgt,Education,Education  num,Marital status,Occupation,Relationship,Race,Sex,  Capital gain,Capital loss,Hours per week,Native country,Class 14: IS_POPULATED - True 15: MSOLAP_MODEL_SOURCE -  ******************** 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA -  3: MODEL_NAME - Mushroom Analysis RDBMS 4: MODEL_TYPE -  5: MODEL_GUID - {499D525B-13EC-4B4C-918F-F75E73E01AC4} 6: DESCRIPTION -  7: MODEL_PROPID -  8: DATE_CREATED - 1/29/2001 10:45:25 PM 9: DATE_MODIFIED - 1/29/2001 10:45:33 PM 10: SERVICE_TYPE_ID - 1 11: SERVICE_NAME - Microsoft_Decision_Trees 12: CREATION_STATEMENT - CREATE MINING MODEL  [Mushroom Analysis RDBMS ] ([Id] LONG   KEY  , [Cap Shape]  TEXT   DISCRETE  , [Cap Surface] TEXT   DISCRETE  , [Cap Color]  TEXT   DISCRETE  , [Bruises] TEXT   DISCRETE  , [Odor] TEXT    DISCRETE  , [Gill Attachment] TEXT   DISCRETE  , [Gill Spacing]  TEXT   DISCRETE  , [Gill Size] TEXT   DISCRETE  , [Gill Color]  TEXT   DISCRETE  , [Stalk Shape] TEXT   DISCRETE  , [Stalk Root]  TEXT   DISCRETE  , [Stalk Surface Above Ring] TEXT   DISCRETE  ,  [Stalk Surface Below Ring] TEXT   DISCRETE  ,  [Stalk Color Above Ring] TEXT   DISCRETE  ,  [Stalk Color Below Ring] TEXT   DISCRETE  , [Veil Type] TEXT    DISCRETE  , [Veil Color] TEXT   DISCRETE  , [Ring Number] TEXT   DISCRETE  , [Ring Type] TEXT   DISCRETE  , [Spore Print Color]  TEXT   DISCRETE  , [Population] TEXT   DISCRETE  , [Edibility]  TEXT   DISCRETE  PREDICT, [Habitat] TEXT   DISCRETE  PREDICT)  USING Microsoft_Decision_Trees 13: PREDICTION_ENTITY - Edibility,Habitat 14: IS_POPULATED - True 15: MSOLAP_MODEL_SOURCE -  ******************** 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA -  3: MODEL_NAME - Mushrooms Cluster 4: MODEL_TYPE -  5: MODEL_GUID - {074F9AE0-D0DE-4925-A231-21265E564F9B} 6: DESCRIPTION -  7: MODEL_PROPID -  8: DATE_CREATED - 12/17/2000 12:48:19 PM 9: DATE_MODIFIED - 12/17/2000 12:48:57 PM 10: SERVICE_TYPE_ID - 2 11: SERVICE_NAME - Microsoft_Clustering 12: CREATION_STATEMENT - CREATE MINING MODEL  [Mushrooms Cluster's] ([Id] TEXT   KEY  , [Edibility]  TEXT   DISCRETE  PREDICT_ONLY, [Cap Shape] TEXT   DISCRETE  ,  [Cap Surface] TEXT   DISCRETE  , [Cap Color] TEXT   DISCRETE  ,  [Bruises] TEXT   DISCRETE  , [Odor] TEXT   DISCRETE  ,  [Gill Attachment] TEXT   DISCRETE  , [Gill Spacing] TEXT    DISCRETE  , [Gill Size] TEXT   DISCRETE  , [Gill Color] TEXT    DISCRETE  , [Stalk Shape] TEXT   DISCRETE  , [Stalk Root] TEXT    DISCRETE  , [Stalk Surface Above Ring] TEXT   DISCRETE  ,  [Stalk Surface Below Ring] TEXT   DISCRETE  ,  [Stalk Color Above Ring] TEXT   DISCRETE  ,  [Stalk Color Below Ring] TEXT   DISCRETE  , [Veil Type] TEXT    DISCRETE  , [Veil Color] TEXT   DISCRETE  , [Ring Number] TEXT    DISCRETE  , [Ring Type] TEXT   DISCRETE  , [Spore Print Color]   TEXT   DISCRETE  , [Population] TEXT   DISCRETE  , [Habitat]  TEXT   DISCRETE  ) USING Microsoft_Clustering (CLUSTER_COUNT=6) 13: PREDICTION_ENTITY - Edibility 14: IS_POPULATED - True 15: MSOLAP_MODEL_SOURCE -  ******************** 

MINING_COLUMNS Schema Rowset

The MINING_COLUMNS schema rowset describes the individual columns of all defined data-mining models known to the provider. This schema rowset can be viewed as an enhanced form of the COLUMNS rowset for data-mining models. Many of the entries are derived from the COLUMNS schema rowset and are optional.

The columns are sorted in the following order:

  • MODEL_CATALOG
  • MODEL_SCHEMA 18
  • MODEL_NAME
  • COLUMN_NAME

The available restriction columns are:

  • MODEL_CATALOG
  • MODEL_SCHEMA
  • MODEL_NAME
  • COLUMN_NAME 19

The column names are as follows:

  • ‚  MODEL_CATALOG ‚   Catalog name. NULL if the provider does not support catalogs. In the case of Microsoft Data Mining, it refers to the name of the database that contains the mining model.
  • ‚  MODEL_SCHEMA ‚   Unqualified schema name. NULL if the provider does not support schemas.
  • ‚  MODEL_NAME ‚   Mining model name. This column is a required field. 20
  • ‚  COLUMN_NAME ‚   The name of the column; this might not be unique. If this cannot be determined, a NULL is returned.

  • By itself, this column is not enough to uniquely identify a column. For that you need to use a combination of the COLUMN_GUID and COLUMN_PROPID, which then becomes the ID of the column. Since one of the columns is always NULL, combining the two ensures that you will reference a GUID that serves as a unique key.
  • ‚  COLUMN_GUID ‚   Providers that do not use GUIDs to identify columns should eturn NULL in this column.
  • ‚  COLUMN_PROPID ‚   Providers that do not associate PROPIDs with columns should return NULL in this column. 21
  • ‚  ORDINAL_POSITION ‚   The ordinal of the column. Columns are numbered starting from one. The value will be NULL if there is a way to apply a numerical value to the order of the column.
  • ‚  COLUMN_HASDEFAULT ‚  
    • ‚  VARIANT_TRUE ‚   The column has a default value.
    • ‚  VARIANT_FALSE ‚   The column does not have a default value, or it is unknown whether the column has a default value. 22
  • ‚  COLUMN_DEFAULT ‚   Default value of the column. If the default value is the NULL value, COLUMN_HASDEFAULT is VARIANT_TRUE and the COLUMN_DEFAULT column is a NULL value.
  • ‚  COLUMN_FLAGS ‚   A bitmask that describes column characteristics. The DBCOLUMNFLAGS enumerated type specifies the bits in the bitmask. This column cannot contain a NULL value.
  • ‚  IS_NULLABLE ‚  
    • ‚  VARIANT_TRUE ‚   The column might be nullable. 23
    • ‚  VARIANT_FALSE ‚   The column is known not to be nullable.
  • ‚  DATA_TYPE ‚   The indicator of the column's data type. Table 10-2 is an example.
  • Table 10-2. Column Data Types
25
"TABLE" DBTYPE_HCHAPTER 24
"TEXT" DBTYPE_WCHAR
"LONG" DBTYPE_I8
"DOUBLE" DBTYPE_R8
"DATE" DBTYPE_DATE
  • ‚  TYPE_GUID ‚   The GUID of the column's data type. Providers that do not use GUIDs to identify data types should return NULL in this column. 26
  • ‚  CHARACTER_MAXIMUM_LENGTH ‚   The maximum possible length of a value in the column. For character, binary, or bit columns, this is one of the following:
  • The maximum length of the column in characters , bytes, or bits, respectively, if the length is defined. For example, a CHAR (5) column in an SQL table has a maximum length of 5.
  • The maximum length of the data type in characters, bytes, or bits, respectively, if the column does not have a defined length.
  • Zero (0) if neither the column nor the data type has a defined maximum length.
  • NULL for all other types of columns.
  • ‚  CHARACTER_OCTET_LENGTH 27 ‚   Maximum length in octets (bytes) of the column if the type of the column is character or binary. A value of zero means the column has no maximum length. NULL for all other types of columns.
  • ‚  NUMERIC_PRECISION ‚   If the column's data type is of a numeric data type other than VARNUMERIC, this is the maximum precision of the column. The precision of columns with a data type of DBTYPE_DECIMAL or DBTYPE_ NUMERIC depends on the definition of the column. If the column's data type is not numeric or is VARNUMERIC, this is NULL.
  • ‚  NUMERIC_SCALE ‚   If the column's type indicator is DBTYPE_DECIMAL, DBTYPE_NUMERIC, or DBTYPE_VARNUMERIC, this is the number of digits to the right of the decimal point. Otherwise, this is NULL.
  • ‚  DATETIME_PRECISION ‚   Datetime precision (number of digits in the fractional seconds portion) of the column if the column is a datetime or interval type. If the column's data type is not datetime, this is NULL. 28
  • ‚  CHARACTER_SET_CATALOG ‚   Catalog name in which the character set is defined. NULL if the provider does not support catalogs or different character sets.
  • ‚  CHARACTER_SET_SCHEMA ‚   Unqualified schema name in which the character set is defined. NULL if the provider does not support schemas or different character sets.
  • ‚  CHARACTER_SET_NAME ‚   Character set name. NULL if the provider does not support different character sets.
  • ‚  COLLATION_CATALOG 29 ‚   Catalog name in which the collation is defined. NULL if the provider does not support catalogs or different collations.
  • ‚  COLLATION_SCHEMA ‚   Unqualified schema name in which the collation is defined. NULL if the provider does not support schemas or different collations.
  • ‚  COLLATION_NAME ‚   Collation name. NULL if the provider does not support different collations.
  • ‚  DOMAIN_CATALOG ‚   Catalog name in which the domain is defined. NULL if the provider does not support catalogs or domains. 30
  • ‚  DOMAIN_SCHEMA ‚   Unqualified schema name in which the domain is defined. NULL if the provider does not support schemas or domains.
  • ‚  DOMAIN_NAME ‚   Domain name. NULL if the provider does not support domains.
  • ‚  DESCRIPTION ‚   Human-readable description of the column. For example, the description for a column named Name in the Employee table might be "Employee name. " NULL if there is no description associated with the column.
  • ‚  DISTRIBUTION_FLAG 31 ‚   Table 10-3 is a list of DISTRIBUTION_FLAGs.
Table 10-3. DISTRIBUTION_FLAG Values
32
NORMAL Distribution of data in which a majority of the continuous values are in the middle of the distribution curve, with a relatively equal number of variances in the high and low end. (See "Distribution Types" sidebar.)
LOG_NORMAL The distribution of continuous data suggests that there are a majority of values on the high end with a lower number of continuous variables skewed toward one end of the curve (high or low end).
UNIFORM All the values are the same.
BINOMIAL 33 The values in the distribution curve are distributed among two possibilities only, such as a Boolean true-false value or, as in our case "Yes" or "No", to designate edibility.
MULTINOMIAL Distribution of data among a fixed number of variables, usually from a discrete set.
POISSON Distribution used when there is a limited number of variables, perhaps even only two, but where the case count is high and the probability of one occurring and opposed to the other is extremely skewed. 34
HEAVYTAIL(also known as Pareto) This is the source of the "80-20 rule" where 80 percent of a given activity can be explained by 20 percent of the factors. The key is to zero in on those 20 percent of the factors.
MIXTURE This concerns modeling a statistical distribution by a weighted sum of other distributions. It forms the basis for clustering.

Provider-specific flags may also be defined. 35

  • ‚  CONTENT_TYPE ‚   Table 10-4 is a list of CONTENT_TYPEs.
  • Table 10-4. CONTENT_TYPE Values
41
KEY This denotes a discrete key value used to identify the case. This value actually never gets used for aiding in predictions. It simply helps Analysis Services keep track of cases internally. It's also useful if an operator seeks to tie cases together with the original data source for debugging purposes.
DISCRETE 36 Contains a discrete set of values, such as the edibility or the color values. The values are finite in number and have no relationship to each other.
CONTINUOUS Contains a numerical value that has quantitative meaning such as age or weight. In addition, the values have measurable differences between them that have value so that one age is mea surably higher or lower than another age.
DISCRETIZED([value])

A set of values, usually numbers , that are converted over to a smaller number of discrete values because the algorithm is unable to use them as inputs or prediction fields. To be able to analyze them, the values are converted to discrete values for the sake of training the model. 37

There are four possible parameters that can be provided to this function to direct this process:

  • ‚  AUTOMATIC ‚   The default method for the algorithm is chosen .
  • ‚  EQUAL _AREAS ‚   Tries to create bins of data that contain equal numbers of values in each. This works well when there are many different types of values with relatively even counts in each. In a situation where there are small numbers of different types of values and high numbers favored for any one type, the results may be questionable. This is designed for data that fits in a normal distribution curve.
  • ‚  THRESHOLDS ‚   Looks for data that has high levels of distinctive values as compared to other values in the group . This shows up as inflection points on a distribution curve. This generally is used for continuous data with irregular high and low values. 38
  • ‚  CLUSTERS ‚   Applies an algorithm very similar to the clustering algorithm used in building a data-mining model. Because of the complexity of the algorithm, the processing takes longer, but the results tend to be better since the algorithm can adapt to any distribution curve.
ORDERED Contains a set of values that have a hierarchical value relation ship to each other. Examples include tax brackets or earning levels. There is a rank associated to their value but unlike age, for example, there is no relative values between those levels.
SEQUENCE_TIME 39 The column contains time- related values that have a sequential relationship to each other, such as dates and years .
CYCLICAL These are discrete, but ordered, values that have to repeat themselves in a cyclical fashion, such as the months of the year or the days of the week. Functionally, these values act as both discrete and ordered values at the same time.
PROBABILITY This is the probability of this value occurring in a node. It's expressed as a decimal between 0 and 1. 40
VARIANCE The statistical variance of the associated value in this column.
STDEV The standard deviation of the associated value in this column.
SUPPORT The number of occurrences of this value in the node.
PROBABILITY_VARIANCE The statistical variance of the probability of the associated value in this column. This is not yet supported by Analysis Services.
PROBABILITY_STDEV 42 The standard deviation of the probability of the associated value in this column. This is not yet supported by Analysis Services.
ORDER Used to order the columns in this table. This is the order in which this row is placed.

Distribution Types

When we work with continuous data, such as used automobile prices, we see three types of statistical distribution, normal,log_normal, and uniform. For example, if you were to sample the prices of the cars in a large used car lot, assuming the cars are of basically the same type and mileage, you would expect to find a few very cheaply priced cars because they have unusually high mileage or a few dents, and you would also expect to find a few very expensive cars because they were only driven by the proverbial "little old lady who only drove the car on Sundays to church . " Generally, the bulk of the cars in the lot will fall within a similar price range, give or take a few hundred dollars. Graphing the prices of these cars shows a normal distribution curve that should look something like the one shown here. 43


Normal distribution curve.

If this were an atypical car lot with average priced cars and a few cheap ones, you could draw a log_normal curve like the one shown here.


Log_Normal distribution curve.

If by chance all the cars were priced exactly the same, the curve would be uniform like this one.


44
Uniform distribution curve.

Provider-specific flags may also be defined. Those listed here provide additional information or hints that Analysis Services uses to build the model.

  • ‚  MODELING_FLAG ‚   A comma-delimited list of flags. The defined flags are listed and defined in Table 10-5.
  • Table 10-5. MODELING_FLAG Values
46
MODEL_EXISTENCE_ONLY Denotes that a column is important not because of the value that it contains, but because a value exists at all. It can be more important to know that a mush- room has a value for"Spore Color", rather than the value itself, because the fact that the mushroom has spores is an important predictor of edibility. 45
NOT_NULL This column must have a value. The absence of a value is considered an anomaly.
IGNORE_NULL If a column contains a null value, it won't provide any statistical information about the model.
NULL_INFORMATIVE Null values are counted, but considered missing. The fact that the value is null can have high significance. Consider data-mining ballots, missing values are important in that they signify a refusal to vote or even a miscount.

Provider-specific flags may also be defined.

  • ‚  IS_RELATED_TO_KEY ‚   VARIANT_TRUE if this column is related to the key. If the key is a single column, the RELATED_ATTRIBUTE field optionally may contain its column name.
  • ‚  RELATED_ATTRIBUTE ‚   The name of the target column that the current column either relates to or is a special property of. 47
  • ‚  IS_INPUT ‚   VARIANT_TRUE if this is an input column.
  • ‚  IS_PREDICTABLE ‚   VARIANT_TRUE if the column is predictable.
  • ‚  CONTAINING_COLUMN ‚   Name of the TABLE column containing this column. NULL if any table does not contain the column.
  • ‚  PREDICTION_SCALAR_FUNCTIONS 48 ‚   A comma-delimited list of scalar functions that may be performed on the column. The functions will be discussed in ‚  Chapter 12 ‚  , where we'll look at how to make predictions with code.
  • ‚  PREDICTION_TABLE_FUNCTIONS ‚   A comma-delimited list of functions that may be applied to the column, returning a table. The list has the following format:
  •  <function name>(<column1> [, <column2>], ...) 

    The format allows the client to determine which columns will be present in the table returned by any given function. 49
  • ‚  IS_POPULATED ‚   VARIANT_TRUE if the column has learned a set of possible values and VARIANT_FALSE if the column is not populated.
  • ‚  PREDICTION_SCORE ‚   The score of the model on the predicting column. Score is used to measure the accuracy of a model as it relates to the value to be predicted . The higher the number, the more accurate this prediction will be. This function forms part of the OLEDB specification for data mining, but is not yet implemented in Analysis Services.
  •  . . . Const DMSCHEMA_MINING_COLUMNS = _         "{3add8a78-d8b9-11d2-8d2a-00e029154fde}" Const adSchemaProviderSpecific = -1 Set rs = cn.OpenSchema(adSchemaProviderSpecific,_         array(empty,empty,"mushroom analysis rdbms","edibility"),_         DMSCHEMA_MINING_COLUMNS) ... Microsoft (R) Windows Script Host Version 5.1 for Windows Copyright (C) Microsoft Corporation 1996-1999.  All rights reserved. 1: MODEL_CATALOG - Mushrooms 2: MODEL_SCHEMA -  3: MODEL_NAME - mushroom analysis rdbms 4: COLUMN_NAME - Edibility 5: COLUMN_GUID -  6: COLUMN_PROPID -  7: ORDINAL_POSITION - 23 8: COLUMN_HAS_DEFAULT - False 9: COLUMN_DEFAULT -  10: COLUMN_FLAGS - 0 11: IS_NULLABLE - True 12: DATA_TYPE - 130 13: TYPE_GUID -  14: CHARACTER_MAXIMUM_LENGTH - 0 15: CHARACTER_OCTET_LENGTH - 0 16: NUMERIC_PRECISION -  17: NUMERIC_SCALE -  18: DATETIME_PRECISION -  19: CHARACTER_SET_CATALOG -  20: CHARACTER_SET_SCHEMA -  21: CHARACTER_SET_NAME -  22: COLLATION_CATALOG -  23: COLLATION_SCHEMA -  24: COLLATION_NAME -  25: DOMAIN_CATALOG -  26: DOMAIN_SCHEMA -  27: DOMAIN_NAME -  28: DESCRIPTION -  29: DISTRIBUTION_FLAG - NONE 30: CONTENT_TYPE - DISCRETE 31: MODELING_FLAG -  32: IS_RELATED_TO_KEY - False 33: RELATED_ATTRIBUTE -  34: IS_INPUT - True 35: IS_PREDICTABLE - True 36: CONTAINING_COLUMN -  37: PREDICTION_SCALAR_FUNCTIONS -Predict,PredictAdjustedProbability,  PredictProbability,PredictScore,PredictStddev,PredictStdev,  PredictSupport,PredictVariance 38: PREDICTION_TABLE_FUNCTIONS - PredictHistogram 39: IS_POPULATED - True 40: PREDICTION_SCORE - 0 

MINING_MODEL_CONTENT Schema Rowset

This schema rowset allows browsing of the content of a data-mining model. The user can employ special tree-operation restrictions to navigate the contents in a manner that facilitates the creation of a decision tree graph or a cluster graph. 50

The results are ordered by the following fields by default:

  • MODEL_CATALOG
  • MODEL_SCHEMA
  • MODEL_NAME
  • ATTRIBUTE_NAME

The following are the available restriction columns:

  • MODEL_CATALOG 51
  • MODEL_SCHEMA
  • MODEL_NAME
  • ATTRIBUTE_NAME
  • NODE_NAME
  • NODE_UNIQUE_NAME
  • NODE_TYPE
  • NODE_GUID 52
  • NODE_CAPTION

There is an additional restriction called the tree operation that isn't on any particular column of the MINING_MODEL_CONTENT rowset; rather, it specifies a tree operator. The idea is that the application specified a NODE_UNIQUE_NAME restriction and the tree operator (ANCESTORS, CHILDREN, SIBLINGS, PARENT, DESCENDANTS, SELF) to obtain the desired set of members . The SELF operator includes the row for the node itself in the list of returned rows. The following constants are defined in Table 10-6.

Table 10-6. NODE_UNIQUE_NAME Restrictions
DMTREEOP_ANCESTORS 0x00000020
DMTREEOP_CHILDREN 0x00000001 53
DMTREEOP_SIBLINGS 0x00000002
DMTREEOP_PARENT 0x00000004
DMTREEOP_SELF 0x00000008
DMTREEOP_DESCENDANTS 54 0x00000010

The columns are as follows:

  • ‚  MODEL_CATALOG ‚   This is the name of the database that the data-mining model belongs to. In this case it happens to be "Mushrooms".
  • ‚  MODEL_SCHEMA ‚   It's supposed to contain the unqualified name of the schema, but since it's currently not supported, it always contains a NULL value.
  • ‚  MODEL_NAME 55 ‚   The name of the data-mining model. In this case it's "mushroom analysis rdbms".
  • ‚  ATTRIBUTE_NAME ‚   For every predictable field, a separate model is created. Every created model is contained in the same table but is differentiated by this field. The model we created in ‚  Chapter 5 ‚  used the edibility field to determine whether a mushroom is poisonous, and we also used the habitat field to determine where the mushroom would most likely be found given its characteristics. Both of these fields generate a separate tree model. So in this case we'll find either Edibility or Habitat as the value for this field. That value will tell us what model uses this node.
  • ‚  NODE_NAME ‚   This contains a numerical string that serves as a unique name for the node. It has no real descriptive use in an application. In a future release, this name may contain a unique alphanumeric word. 56
  • ‚  NODE_UNIQUE_NAME ‚   The exact same value as NODE_NAME.
  • ‚  NODE_TYPE ‚   This denotes the type of node. There can be six possible values for this node represented by an integer from 1 to 6:

    1. DM_NODE_TYPE_MODEL
    2. DM_NODE_TYPE_TREE
    3. DM_NODE_TYPE_INTERIOR 57
    4. DM_NODE_TYPE_DISTRIBUTION
    5. DM_NODE_TYPE_CLUSTER
    6. DM_NODE_TYPE_UNKNOWN
  • ‚  DM_NODE_TYPE_MODEL ‚   A model node is the topmost node in any data-mining model, regardless of the actual structure of the model. All models start with a model node. It contains no data or distribution histograms. You never even see this node in the Analysis Manager because it's actually the node before the ALL node.
  • ‚  DM_NODE_TYPE_TREE ‚   For all tree-based models, this node serves as the root node of the tree. A data-mining model may have many trees that make up the whole; but for each tree, there is only one tree node that all other nodes are related to. A decision tree-based model always has one model node and at least one tree node. This is the ALL node. 58
  • ‚  DM_NODE_TYPE_INTERIOR ‚   An interior node represents a generic interior node of a model. For example, in a decision tree, this node usually represents a split in the tree.
  • ‚  DM_NODE_TYPE_DISTRIBUTION ‚   A distribution node is guaranteed to have a valid link to a nested distribution table. A distribution node describes the distribution of values for one or more attributes according to the data represented by this node. A good example of a distribution node is the leaf node of a decision tree.
  • ‚  DM_NODE_TYPE_CLUSTER ‚   A cluster node stores the attributes and data for the abstraction of a specific cluster. In other words, it stores the set of distributions that constitute a cluster of cases for the data-mining model. A clustering-based model always has one model node and at least one cluster node.
  • ‚  DM_NODE_TYPE_UNKNOWN 59 ‚   The unknown node type is used when a node does not fit any of the other node types provided and the algorithm cannot resolve the node type.
  • ‚  NODE_GUID ‚   It contains a NULL value and is currently not supported.
  • ‚  NODE_CAPTION ‚   A label or a caption associated with the node. This property is used primarily for display purposes. If a caption does not exist, the contents of the NODE_NAME column are returned.
  • ‚  CHILDREN_CARDINALITY ‚   Gives an estimate of how many immediate children emanate from a particular node. Leaf nodes have a 0 value. 60
  • ‚  PARENT_UNIQUE_NAME ‚   The unique name of the node?sparent. NULL is returned for any nodes at the root level.
  • ‚  NODE_DESCRIPTION ‚   A user-friendly description of the node.
  • ‚  NODE_RULE ‚   An XML description of the rule that is embedded in the node.
  • ‚  MARGINAL_RULE 61 ‚   An XML description of the rule that is moving to the node from the parent node.
  • ‚  NODE_PROBABILITY ‚   The probability associated with this node.
  • ‚  MARGINAL_PROBABILITY ‚   The probability of reaching the node from the parent node.
  • ‚  NODE_DISTRIBUTION ‚   A nested table that contains the probability histogram for each of the attributes that make up the node. 62
  • ‚  NODE_SUPPORT ‚   Total number of cases that make up the node.
  • ‚  MSOLAP_MODEL_COLUMN ‚   The name of the column from the model definition that this node pertains to.
  • ‚  MSOLAP_NODE_SCORE ‚   The score that was computed for this node.
  • ‚  MSOLAP_NODE_SHORT_CAPTION 63 ‚   A short caption for the node that can be used for display purposes to improve readability.

MINING_SERVICES Schema Rowset

This rowset exposes the data-mining algorithms available from the provider. It can be used to determine the prediction capabilities, complexity, and similar information about the algorithm. Third-party providers need to supply the values of their algorithms in this table.

The columns are sorted by SERVICE_NAME.

Column restrictions available are

  • SERVICE_NAME
  • SERVICE_TYPE_ID 64

The column descriptions are as follows:

  • ‚  SERVICE_NAME ‚   The name of the algorithm. Provider-specific. This will be used as the service identifier in the language. (It is not localizable.)
  • ‚  SERVICE_TYPE_ID ‚   A bitmask that describes mining service types. The following list includes known popular mining service values:
    • DM_SERVICETYPE_CLASSIFICATION (0x0000001)
    • DM_SERVICETYPE_CLUSTERING (0x0000002) 65
    • DM_SERVICETYPE_ASSOCIATION (0x0000004)
    • DM_SERVICETYPE_DENSITY_ESTIMATE (0x0000008)
    • DM_SERVICETYPE_SEQUENCE (0x0000010)
  • ‚  SERVICE_DISPLAY_NAME ‚   The localizable display name of the algorithm. Provider-specific.
  • ‚  SERVICE_GUID ‚   GUID for the algorithm. NULL if no GUID. 66
  • ‚  DESCRIPTION ‚   Description of the algorithm.
  • ‚  PREDICTION_LIMIT ‚   The maximum number of predictions the model and algorithm can provide; 0 means no limit.
  • ‚  SUPPORTED_DISTRIBUTION_FLAGS ‚   A comma-delimited list of one or more of the following:
    • "NORMAL" 67
    • "LOG_NORMAL"
    • "UNIFORM"

Provider-specific flags may also be defined.

  • ‚  SUPPORTED_INPUT_CONTENT_TYPES ‚   A comma-delimited list of one or more of the following:
    • KEY
    • DISCRETE 68
    • CONTINUOUS
    • DISCRETIZED
    • ORDERED
    • SEQUENCE_TIME
    • CYCLICAL
    • PROBABILITY
    • VARIANCE 69
    • STDEV
    • SUPPORT
    • PROBABILITY_VARIANCE
    • PROBABILITY_STDEV
    • ORDER
    • SEQUENCE

Provider-specific flags may also be defined. 70

  • ‚  SUPPORTED_PREDICTION_CONTENT_TYPES ‚   A comma-delimited list of one or more of the following:
    • DISCRETE
    • CONTINUOUS
    • DISCRETIZED
    • ORDERED
    • SEQUENCE_TIME 71
    • CYCLICAL
    • PROBABILITY
    • VARIANCE
    • STDEV
    • SUPPORT
    • PROBABILITY_VARIANCE
    • PROBABILITY_STDEV 72

Provider-specific flags may also be defined.

  • ‚  SUPPORTED_MODELING_FLAGS ‚   A comma-delimited list of one or more of the following:
    • MODEL_EXISTENCE_ONLY
    • NOT NULL

Provider-specific flags may also be defined.

  • ‚  SUPPORTED_SOURCE_QUERY 73 ‚   The <source_data_query> types that the provider supports. This is a comma-delimited list of one or more of the following syntax descriptions that can be used as the source of data for INSERT INTO or that can be PREDICTION JOINED to a DMM for SELECT:
    • SINGLETON_CONSTANT
    • SINGLETON_SELECT
    • OPENROWSET
    • SELECT
    • SHAPE
  • ‚  TRAINING_COMPLEXITY 74 ‚   Indication of expected time for training. Table 10-7 provides the estimated run time.
  • Table 10-7. TRAINING_COMPLEXITY Values
DM_TRAINING_COMPLEXITY_LOW Running time is proportional to input and is relatively short.
DM_TRAINING_COMPLEXITY_MEDIUM Running time may be long but is generally proportional to input. 75
DM_TRAINING_COMPLEXITY_HIGH Running time is long and may grow exponentially in relationship to input.
  • ‚  PREDICTION_COMPLEXITY ‚   Indication of expected time for prediction (same as for TRAINING_COMPLEXITY).
  • ‚  EXPECTED_QUALITY ‚   Indication of expected quality of model produced with this algorithm: 76
  • DM_EXPECTED_QUALITY_LOW
  • DM_EXPECTED_QUALITY_MEDIUM
  • DM_EXPECTED_QUALITY_HIGH
  • ‚  SCALING ‚   Indication of the scalability of the algorithm:
  • DM_SCALING_LOW
  • DM_SCALING_MEDIUM 77
  • DM_SCALING_HIGH
  • ‚  ALLOW_INCREMENTAL_INSERT ‚   VARIANT_TRUE if additional INSERT INTO statements are allowed after the initial training.
  • ‚  ALLOW_PMML_INITIALIZATION ‚   VARIANT_TRUE if the creation of a DMM (including both structure and content) based on an XML string is allowed.
  • ‚  CONTROL ‚   One of the following: 78
  • DM_CONTROL_NONE
  • DM_CONTROL_CANCEL
  • DM_CONTROL_SUSPENDRESUME
  • DM_CONTROL_SUSPENDWITHRESULT
  • ‚  ALLOW_DUPLICATE_KEY ‚   TRUE if cases may have duplicate key.

SERVICE_PARAMETERS Schema Rowset 79

This schema rowset provides a list of parameters that can be supplied when generating a mining model with the CREATE MINING MODEL statement. The client will generally restrict by SERVICE_NAME to obtain the parameters supported by the provider and applicable to the type of mining model being generated.

The default sort order is

  • SERVICE_NAME
  • PARAMETER_NAME

The available restriction columns are

  • SERVICE_NAME
  • PARAMETER_NAME 80

Column descriptions are as follows:

  • ‚  SERVICE_NAME ‚   The name of the algorithm. Provider-specific.
  • ‚  PARAMETER_NAME ‚   The name of the parameter.
  • ‚  PARAMETER_TYPE ‚   Data type of parameter (DBTYPE). 81
  • ‚  IS_REQUIRED ‚   If true, the parameter is required.
  • ‚  PARAMETER_FLAGS ‚   A bitmask that describes parameter characteristics.The following values (or a combination thereof) may be used:
    • DM_PARAMETER_TRAINING (0x0000001) , for training
    • DM_PARAMETER_PREDICTION (0x00000002) , for prediction
  • ‚  DESCRIPTION 82 ‚   Text describing the purpose and format of the parameter.

Example Listing of Algorithms Available Using Code

 ... Const DMSCHEMA_MINING_SERVICES = _     "{3add8a95-d8b9-11d2-8d2a-00e029154fde}" Const adSchemaProviderSpecific = -1 Set rs = cn.OpenSchema(adSchemaProviderSpecific,, _     DMSCHEMA_MINING_SERVICES) ... 

The following results are output from the code above:

 Microsoft (R) Windows Script Host Version 5.1 for Windows Copyright (C) Microsoft Corporation 1996-1999.  All rights reserved. 1: SERVICE_NAME - Microsoft_Clustering 2: SERVICE_TYPE_ID - 2 3: SERVICE_DISPLAY_NAME - Microsoft Clustering 4: SERVICE_GUID - {181CB1C9-C7F5-11D3-8BF9-00C04F68DDC2} 5: DESCRIPTION - Clustering finds natural groupings of data in a  multidimensional space. Clustering is useful when you want to  see general groupings in your data. 6: PREDICTION_LIMIT - 0 7: SUPPORTED_DISTRIBUTION_FLAGS - NORMAL,UNIFORM 8: SUPPORTED_INPUT_CONTENT_TYPES - CONTINUOUS,CYCLICAL,  DISCRETE,DISCRETIZED,ORDERED,SEQUENCE_TIME,TABLE 9: SUPPORTED_PREDICTION_CONTENT_TYPES - CONTINUOUS,CYCLICAL,  DISCRETE,DISCRETIZED,SEQUENCE_TIME,TABLE 10: SUPPORTED_MODELING_FLAGS - MODEL_EXISTENCE_ONLY, NOT NULL 11: SUPPORTED_SOURCE_QUERY - OPENROWSET,SHAPE,SINGLETON SELECT 12: TRAINING_COMPLEXITY - 1 13: PREDICTION_COMPLEXITY - 0 14: EXPECTED_QUALITY - 2 15: SCALING - 1 16: ALLOW_INCREMENTAL_INSERT - False 17: ALLOW_PMML_INITIALIZATION - True 18: CONTROL - 1 19: ALLOW_DUPLICATE_KEY - False ******************** 1: SERVICE_NAME - Microsoft_Decision_Trees 2: SERVICE_TYPE_ID - 1 3: SERVICE_DISPLAY_NAME - Microsoft Decision Trees 4: SERVICE_GUID - {181CB1C8-C7F5-11D3-8BF9-00C04F68DDC2} 5: DESCRIPTION - The Microsoft Decision Trees algorithm chooses  significant characteristics in the data and narrows sets of  data based on those characteristics until clear correlations  are established. Decision trees are useful when you want to  make specific predictions based on information in the source  data. 6: PREDICTION_LIMIT - 0 7: SUPPORTED_DISTRIBUTION_FLAGS - NORMAL,UNIFORM 8: SUPPORTED_INPUT_CONTENT_TYPES - CONTINUOUS,CYCLICAL,  DISCRETE,DISCRETIZED,KEY,ORDERED,TABLE 9: SUPPORTED_PREDICTION_CONTENT_TYPES - DISCRETE,DISCRETIZED,  TABLE 10: SUPPORTED_MODELING_FLAGS - MODEL_EXISTENCE_ONLY, NOT NULL 11: SUPPORTED_SOURCE_QUERY - OPENROWSET,SHAPE,SINGLETON SELECT 12: TRAINING_COMPLEXITY - 1 13: PREDICTION_COMPLEXITY - 0 14: EXPECTED_QUALITY - 1 15: SCALING - 1 16: ALLOW_INCREMENTAL_INSERT - False 17: ALLOW_PMML_INITIALIZATION - True 18: CONTROL - 1 19: ALLOW_DUPLICATE_KEY - False ****************** 

MODEL_CONTENT_PMML Schema Rowset

This rowset stores the XML representation of the content of each model. The format of the XML string follows the PMML standard. 83

The default sort order of these results is

  • MODEL_NAME
  • MODEL_SCHEMA
  • MODEL_NAME

The available restriction columns are

  • MODEL_CATALOG
  • MODEL_SCHEMA 84
  • MODEL_NAME
  • MODEL_TYPE

The columns descriptions are as follows:

  • ‚  MODEL_CATALOG ‚   Catalog name. NULL if the provider does not support catalogs.
  • ‚  MODEL_SCHEMA ‚   Unqualified schema name. NULL if the provider does not support schemas. 85
  • ‚  MODEL_NAME ‚   Model name. This column cannot contain NULL.
  • ‚  MODEL_TYPE ‚   Model type, a provider-specific string, can be NULL.
  • ‚  MODEL_GUID ‚   GUID that uniquely identifies the model. Providers that do not use GUIDs to identify tables should return NULL in this column.
  • ‚  MODEL_PMML 86 ‚   An XML representation of the model's content with PMML format.
  • ‚  SIZE ‚   Number of bytes of the XML string size.
  • ‚  LOCATION ‚   The location of the XML file. NULL if the file is stored in the default directory.

Summary

Almost all structural aspects of the data-mining models are exposed through the OLEDB for data-mining provider, which stored all the data about them in table-like structures. This information is complete; thus it's possible not only to understand the internal parameters at work that derived the model, but also to browse the content of these models in sufficient detail to create your own version of the Analysis Services browser if you choose to do so. 87

For the most part, this interface is more about reporting on the data in the model and understanding the models themselves than it is about manipulating them. In the next chapter, we'll be looking more at the role of PivotTable and how this service helps read the contents of a model and even make local mining models that can be stored on your computer and used offline.



Data Mining with Microsoft[r] SQL Server[tm] 2000 Technical Reference
Data Mining with Microsoft[r] SQL Server[tm] 2000 Technical Reference
ISBN: B007EMTPI0
EAN: N/A
Year: 2001
Pages: 16

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net