PART III CREATING DATA-MINING APPLICATIONS WITH CODE

PivotTable Service is an OLE DB provider that serves as a bridge between the server side of Analysis Services and the client. Whenever a data-mining query is issued, it goes through PivotTable Service, regardless of whether the data-mining model resides on the server or, as we'll see in this chapter, on the local disk. As a desktop version of Analysis Services, PivotTable Service also performs other functions that allow you to create local data-mining models, which reside either in your application's memory space or in a file on the local disk. These local data-mining models contain all the predictive and analytical information that a server-side model would contain. With the exception of management applications that use DSO, most client applications that interact with data from Analysis Services use PivotTable Service. The PivotTable Service architecture is shown in Figure 11-1.


Figure 11-1. PivotTable Service architecture.

In many cases, the processing power of a local machine is more than enough to process data-mining models that can then be used locally. The ability to perform data-mining tasks while disconnected from the network is one of the main benefits of this service. This makes possible a scenario in which an office creates a data-mining model using its local data and then e-mails the whole model to a distant office, which might be interested in testing its cases against this model to make predictions.

Because PivotTable Service has a Data Definition Language (DDL), it's possible to create client applications that build data-mining models locally and then query them without having to centralize the data on a remote server.

As we'll see in this chapter, PivotTable Service is accessible to any language that is able to instantiate COM objects, such as Visual Basic, Visual FoxPro, VBScript, and Perl. The main focus of this chapter is to show how PivotTable Service is used as a local data-mining server. We'll use it to create local data-mining models complete with cases and model training. As an added bonus, we'll use the local data-mining models to see how XML is applied to data mining, another subject we'll cover shortly.

Redistributing Components

Because PivotTable Service is an important component of any client-based application that needs to use Analysis Services functions, it's important to know which files need to be provided as part of your distribution. For that reason, I'm going to spend a little time familiarizing you with the installation and redistribution process. PivotTable Service includes a number of dynamic-link libraries (DLLs) that you may need to ship with a client application. Any individual client application may need a combination of these components, depending on which PivotTable Service features it uses. The files needed for each kind of functionality are listed here:

  • To communicate with the Analysis Services server using TCP/IP or HTTP and to read local cube files, add the following files.
    • Msolap80.dll
    • Msolui80.dll
    • Msolap80.rll
    • Olapuir.rll
    • Microsoft Data Access Components (MDAC)
  • To create and refresh local cubes, add the following files.
    • Msmdcb80.dll
    • Msmdgd80.dll
    • OLE DB tabular data provider
  • To read OLAP and relational data-mining models, add the following files.
    • Msdmine.dll
    • Msmdun80.dll
    • Msdmine.rll
    • Msdmeng.dll

Note

If you are installing PivotTable Service, you must first install Microsoft Data Access Components (MDAC). If you're performing the installation on Windows 95, Windows 98, or Windows Me, you must first install DCOM.


Installing and Registering Components

By using InstallShield or another automated installation package builder, you can provide your own setup program. This is a likely scenario if you are also distributing custom or shrink-wrapped software that performs Analysis Services client functions as part of your application. You can also use the installation options present in the SQL Server 2000 installation CD to install the proper files. Later in this section, we'll discuss how to do both.

After the required components have been installed, the following components must also be registered using Regsvr32.exe or their own DllRegisterServer entry points. Because they function as ActiveX server components, they must be registered before other applications can see them. (A scripted sketch follows the list below.)

  • Msolap80.dll
  • Msolui80.dll
  • Msmdgd80.dll
  • Msmdcb80.dll
  • Msmdun80.dll
  • Msdmine.dll
  • Msdmeng.dll
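
Here's a minimal VBScript sketch of how that registration step might be scripted; the installation path is an assumption based on the default file locations described in the next section:

 dim sh, dll
 set sh = CreateObject("WScript.Shell")
 ' Register each DLL silently (/s); the path assumes the default OLE DB folder.
 for each dll in Array("msolap80.dll", "msolui80.dll", "msmdgd80.dll", _
                       "msmdcb80.dll", "msmdun80.dll", "msdmine.dll", _
                       "msdmeng.dll")
     sh.Run "regsvr32 /s ""C:\Program Files\Common Files\System\OLE DB\" _
         & dll & """", 0, True
 next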

File Locations

The files must be placed in the following locations, with some variances depending on the language version of the product being used:

  • C:\Program Files\Common Files\System\OLE DB - All the dynamic-link library files (.dll) go in this directory.
  • C:\Program Files\Common Files\System\OLE DB\Resources\1033 - All the resource files (.rll) go here. This assumes that you're distributing the English-language version of the product. If you are installing another language version of the product, use the appropriate subdirectory under the Resources folder with the appropriate language code. The codes are provided in Table 11-1.
Table 11-1. Language Codes

Language               Code
Arabic                 1025
British English        2057
Czech                  1029
Danish                 1030
German                 1031
Estonian               1061
Spanish                3082
French                 1036
Croatian               1050
Italian                1040
Latvian                1062
Lithuanian             1063
Hungarian              1038
Dutch                  1043
Norwegian              2068
Polish                 1045
Portuguese             2070
Brazilian              1046
Romanian               1048
Slovak                 1051
Slovenian              1060
Finnish                1035
Swedish                1053
Turkish                1055
English                1033
Greek                  1032
Bulgarian              1026
Russian                1049
Thai                   1054
Korean                 1042
Japanese               1041
Simplified Chinese     2052
Traditional Chinese    1028

Installation Registry Settings

For the purposes of installing and uninstalling, all PivotTable Service files should be considered shared files. Create a registry value for each PivotTable Service file under the following registry key:

 HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\SharedDLLs

If a value already exists under this key for the given shared DLL, increment its count by 1. This ensures that Windows knows that the same file is being used by more than one application; if someone later uninstalls one of the programs, the uninstaller simply decrements the count by 1 rather than removing the file altogether.
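
As a minimal VBScript sketch (the file path is hypothetical), the reference-count logic might look like this:

 dim sh, keyName, count
 set sh = CreateObject("WScript.Shell")
 keyName = "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\SharedDLLs\" & _
     "C:\Program Files\Common Files\System\OLE DB\msolap80.dll"
 on error resume next
 count = sh.RegRead(keyName)    ' fails if the value doesn't exist yet
 if Err.Number <> 0 then count = 0
 on error goto 0
 sh.RegWrite keyName, count + 1, "REG_DWORD"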

Redistribution Setup Programs

If these instructions sound a bit tedious, two utilities provided on the SQL Server 2000 CD in the \Msolap\Install\Pts folder offer some relief:

  • PTSLITE.EXE - Installs the files needed to run PivotTable Service and assumes that you already have the necessary MDAC files installed.
  • PTSFULL.EXE - Installs both the PivotTable Service files and the MDAC components needed for connectivity.

Ptslite.exe installs the following PivotTable Service files:

  • atl.dll
  • msdmeng.dll
  • msdmine.dll
  • msmdcb80.dll
  • msmdgd80.dll
  • msolap80.dll
  • msolui80.dll
  • msmdcube.dll
  • msmdgdrv.dll
  • msolap.dll
  • msolapui.dll
  • msdmine.rll
  • msolap80.rll
  • olapuir.rll
  • msvbvm60.dll
  • msmdun80.dll
  • msolapr.dll

Ptsfull.exe installs the same files as Ptslite.exe, plus the MDAC components.

Connecting to the PivotTable Service

PivotTable Service, included as part of Analysis Services, provides methods for data-mining analysis of multidimensional and relational data. It is the primary interface through which your applications interact with Analysis Services.

You can connect to Analysis Services through PivotTable Service either directly through the OLE DB provider (by using ADO, for example) or over the Web using HTTP.

Connect to Analysis Services Using PivotTable Service

When you use PivotTable Service to connect to a data-mining model in Analysis Services on the server, the interaction is greatly simplified because the entire communication protocol sequence and all cache management are hidden from the client implementation. This makes it very easy to query the models and to create new local data-mining models based on the models stored on the server, without being concerned about network connectivity issues.

When the mining model is local, PivotTable Service takes on all the functionality of the server, allowing a user to query, create, modify, and delete local models without having to interact with Analysis Services features on the server side.

I know I've mentioned this before, but I'll repeat it because it's very important: if you want to create a data-mining model on the server, managed by Analysis Services, you can do so only by using DSO, as detailed in Chapter 10. If you want to create a local mining model that will reside on your disk, you must use the DDL available through PivotTable Service.

Connect to a mining model using VBScript with Windows Scripting Host (WSH) in the following manner:

 dim rs
 dim cn
 dim tt
 set cn = wscript.createobject("adodb.connection")
 set rs = wscript.createobject("adodb.recordset")
 cn.ConnectionString = _
     "provider=MSOLAP.2;" & _
     "Data Source=dataserver;" & _
     "Initial Catalog=mushrooms"
 cn.properties.item("Mining Location") = "C:\DMM\LOCALMODELS"
 cn.open

Notice the inline reference to cn.properties.item(<property name>). You may use either the name enclosed in quotes or the numerical value it represents. The possible connection properties are listed in Table 11-2.

Table 11-2. Connection Properties

Property                             Value  Description
Integrated Security                  0      Determines whether Integrated Security is being used with Windows NT or 2000.
Password                             1      Specifies the password to use when connecting using HTTP.
Persist Security Info                2      Determines whether logins supplied to the server are cached locally to prevent the user from being prompted for a username and password for each new connection.
User ID                              3      Specifies a valid username, such as a valid domain logon or local logon.
Data Source                          4      The name of the server computer or local data-mining file.
Window Handle                        5      Not used.
Location                             6      Not used.
Mode                                 7      Not used.
Prompt                               8      Not used.
Connect Timeout                      9      Determines the maximum amount of time the client application will spend attempting to connect to the server before timing out.
Extended Properties                  10     Not used.
Locale Identifier                    11     The preferred locale ID for the client application.
Asynchronous Initialization          12     Not used.
Initial Catalog                      13     The name of the initial database (catalog).
CREATECUBE                           14     The CREATE CUBE statement used to create a local cube file.
INSERTINTO                           15     The INSERT INTO statement used to populate a local cube file created with the CREATE CUBE statement.
SOURCE DSN                           16     The OLE DB connection string, ODBC connection string, or ODBC data source name (DSN) for the source relational database; used only when creating a local cube file.
USEEXISTINGFILE                      17     Determines whether a local cube file is overwritten if the connection string contains CREATE CUBE and INSERT INTO statements.
ARTIFICIALDATA                       18     Not used.
Client Cache Size                    19     Controls the amount of memory used by the client cache.
SOURCE DSN SUFFIX                    20     Specifies DSN properties, such as the user ID and password for the local cube's data source, that should not be stored as part of the local cube's structure.
Auto Synch Period                    21     Controls the frequency (in milliseconds) of client/server synchronization.
Cache Policy                         22     Not used.
Execution Location                   23     Determines the location of query resolution: the client application, the server, or a combination of both.
Writeback Timeout                    24     Determines the maximum amount of time the client application will spend attempting to communicate updates to a writeback table on the server before timing out.
Default Isolation Mode               25     Controls whether the isolation level is isolated or determined by the cursor type requested by the rowset properties.
CompareCaseSensitiveStringFlags      26     Adjusts case-sensitive string comparisons for a specified locale.
CompareCaseNotSensitiveStringFlags   27     Adjusts case-insensitive string comparisons for a specified locale.
Large Level Threshold                28     Determines the definition of a large level for client/server handling of level members.
Read Only Session                    29     Not used.
Secured Cell Value                   30     Determines the type of return value that results from a reference to a secured cell.
Roles                                31     Specifies a comma-delimited string of the role names by which a client application connects to the server.
MDX Compatibility                    32     Determines how empty members are treated for ragged and unbalanced hierarchies.
SQL Compatibility                    33     Not used.
SSPI                                 34     Determines the security package to use during the session: Negotiate, Kerberos, NTLM, or Anonymous user.
MDX Unique Name Style                35     Determines the technique for generating unique names.
Distinct Measures By Key             36     Not used.
Do Not Apply Commands                37     Not used.
Default MDX Visual Mode              38     Determines the default behavior of the "visual totals" MDX clause.
MDX Calculated Members Mode          39     Not used.
Mining Location                      40     The path where a local mining model will be stored. If left blank, the model is kept in memory and is lost once the session is over.
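
To illustrate the point made before Table 11-2, the following two lines are equivalent ways of setting the Mining Location property (value 40 in the table); the path here is just an example:

 ' Set the property by name or by the numerical value it represents.
 cn.Properties.Item("Mining Location") = "C:\DMM\LOCALMODELS"
 cn.Properties.Item(40) = "C:\DMM\LOCALMODELS"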

Connect to Analysis Services Using HTTP

Using a URL as the data source property of the connection, you can connect directly to the Analysis server through a firewall using the proper port over the Internet as shown in Figure 11-2. This connection is accomplished with a special Active Server Pages (ASP) page, Msolap.asp, which is found in C:\Program Files\Microsoft Analysis Services\Bin.


Figure 11-2. Establishing an HTTP connection.
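
As a sketch, an HTTP connection differs from the earlier example only in the Data Source property, which takes the URL of the Web server that exposes Msolap.asp (the server name here is hypothetical):

 dim cnHttp
 set cnHttp = wscript.createobject("adodb.connection")
 cnHttp.ConnectionString = _
     "provider=MSOLAP.2;" & _
     "Data Source=http://webserver;" & _
     "Initial Catalog=mushrooms"
 cnHttp.open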

Building a Local Data-Mining Model

Unlike server-side models in Analysis Services, local mining models are not created and trained using DSO; instead, a DDL is used. As you can see from the following example, this DDL is somewhat similar to the one used when creating SQL Server tables, aside from some data-mining-specific clauses:

 CREATE MINING MODEL <new model name> (
     <comma-separated list of column definitions>
 ) USING <algorithm name> [(<parameter list>)]

or 52

 CREATE MINING MODEL <new model name> FROM PMML <xml definition> 

or

 CREATE OLAP MINING MODEL <new model name>
     FROM <cube name> <olap definition>
     USING <algorithm name> [(<parameter list>)]

The CREATE MINING MODEL statement creates a new mining model based on the column definition list or on the structure of a given cube. A column definition is one of the following forms:

 <column name> <type> [<content flags>] [<column relation>]     [<prediction flag>] 

or

 <column name> TABLE [<prediction flag>] (
     <non-table column definition list>
 )

The <column name> can be any valid name that begins with an alphanumeric character and does not contain any invalid characters such as [ ]\-*.&%$@. The <type> can be any valid SQL type, such as LONG, DOUBLE, DATE, TEXT, and TABLE (for nested tables, which will be discussed later in this chapter).

The <content flags>, shown in Table 11-3, are additional, optional parameters that provide further instructions to PivotTable Service to tell it how to treat that column. For the descriptions of these flags, please refer to Chapter 10, where these are discussed in detail.

Table 11-3. Content Flags for Column Definitions

Distribution flags      NORMAL, LOG_NORMAL, UNIFORM

Type flags              KEY, CONTINUOUS, DISCRETE, DISCRETIZED, ORDERED,
                        CYCLICAL, SEQUENCE TIME, SEQUENCE

Modeling flags          MODEL_EXISTENCE_ONLY, NOT NULL

Special property flags  PROBABILITY, VARIANCE, STDEV, PROBABILITY_VARIANCE,
                        PROBABILITY_STDEV, SUPPORT

Prediction flags        PREDICT, PREDICT_ONLY

Relation                OF, RELATED TO

Because the Relation flags (OF and RELATED TO) in Table 11-3 are the first place we've come across the notion of related columns and tables when creating a mining model, they deserve some explanation. In the first case, a column may simply hold the probability, the variance, or any other special property that depends on another column. In that case, we use the OF clause to specify the dependent column. For example, a possible column definition could be SupportHabitat DOUBLE SUPPORT OF Habitat. This column will contain the support number for any given value in the Habitat column.

A data-mining model is not designed to work within the structure of a relational database engine; however, it's very possible for a column to contain a link to a list of other values. Case information for a data-mining model may not reside in a single case table, but may have supporting tables supplying additional information to define the case. The data-mining model can take advantage of nested data-mining columns to process this supporting information and create additional rules and patterns for the case based on the data in the supporting tables.

To accommodate this need, Analysis Services and PivotTable Service use a mechanism known as the nested table, which allows a column to be of type TABLE and have its own list of column definitions. The target of a RELATED TO column can be a Key column in a nested table, a Discretely Valued column on the case row, or another column with a RELATED TO clause.
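
For example, a hypothetical model for customers and their purchases might declare a nested TABLE column like this (the model and column names are invented for illustration):

 CREATE MINING MODEL [CustomerPurchases] (
     [Customer Id]  LONG KEY,
     [Age]          LONG CONTINUOUS,
     [Products] TABLE PREDICT (
         [Product Name]  TEXT KEY,
         [Quantity]      LONG CONTINUOUS
     )
 ) USING Microsoft_Decision_Trees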

The following example uses VBScript to create a local mining model.

 dim rs
 dim cn
 dim tt
 set cn = wscript.createobject("adodb.connection")
 set rs = wscript.createobject("adodb.recordset")
 cn.ConnectionString = "provider=MSOLAP.2;Data Source=dataserver;" & _
     "Initial Catalog=mushrooms"
 cn.properties.item("Mining Location") = "c:\pscripts"
 cn.open
 strDdl = "CREATE MINING MODEL [MushroomsLocal] " & _
     " ([Id]                  LONG KEY  , "            & _
     " [Cap Shape]            TEXT   DISCRETE  , "      & _
     " [Cap Surface]          TEXT   DISCRETE  , "      & _
     " [Cap Color]            TEXT   DISCRETE  , "      & _
     " [Bruises]              TEXT   DISCRETE  , "      & _
     " [Odor]                 TEXT   DISCRETE  , "      & _
     " [Gill Attachment]      TEXT   DISCRETE  , "      & _
     " [Gill Spacing]         TEXT   DISCRETE  , "      & _
     " [Gill Size]            TEXT   DISCRETE  , "      & _
     " [Gill Color]           TEXT   DISCRETE  , "      & _
     " [Stalk Shape]          TEXT   DISCRETE  , "      & _
     " [Stalk Root]           TEXT   DISCRETE  , "      & _
     " [Stalk Surface Above Ring]   TEXT DISCRETE  , "  & _
     " [Stalk Surface Below Ring]   TEXT DISCRETE  , "  & _
     " [Stalk Color Above Ring]     TEXT DISCRETE  , "  & _
     " [Stalk Color Below Ring]     TEXT DISCRETE  , "  & _
     " [Veil Type]   TEXT DISCRETE  , "                 & _
     " [Veil Color]  TEXT DISCRETE  , "                 & _
     " [Ring Number] TEXT DISCRETE  , "                 & _
     " [Ring Type]   TEXT DISCRETE  , "                 & _
     " [Spore Print Color] TEXT DISCRETE  , "           & _
     " [Population] TEXT DISCRETE  , "                  & _
     " [Edibility] TEXT DISCRETE  PREDICT, "            & _
     " [Habitat] TEXT DISCRETE  PREDICT) "              & _
     " USING Microsoft_Decision_Trees"
 set rs = cn.execute(strDdl)

Tip

In data-mining models with a fairly large number of columns like this one, it can be advantageous to query the MINING_MODELS schema rowset, copy the value from the CREATION_STATEMENT field, and use that value as the creation statement for the local mining model, provided you want it to match the original in structure. This trick can save you quite a bit of typing!


Storage of Local Mining Models

Although the use of XML will be discussed in more detail later in the chapter, it's important to note that if you execute the CREATE MINING MODEL statement from the example above, you'll find that a file named mushroomslocal.dmm.xml has been created in the location specified in the Mining Location connection parameter. If you have Microsoft Internet Explorer 5.0 or later installed and you double-click this file, it will open and display the XML structure of the model. Here is a listing of that model:

 <?xml version="1.0" ?>
 <pmml name="MushroomsLocal"
     GUID="{4A30AD32-4A13-4FE2-AC3C-67647D7ED4C1}"
     creation-time="2001-02-12T10:59:27"
     modified-time="2001-02-12T10:59:27">
   <statements>
     <statement type="CREATE" value="CREATE MINING MODEL [MushroomsLocal]
         ([Id] LONG KEY , [Cap Shape] TEXT DISCRETE ,
         [Cap Surface] TEXT DISCRETE , [Cap Color] TEXT DISCRETE ,
         [Bruises] TEXT DISCRETE , [Odor] TEXT DISCRETE ,
         [Gill Attachment] TEXT DISCRETE , [Gill Spacing] TEXT DISCRETE ,
         [Gill Size] TEXT DISCRETE , [Gill Color] TEXT DISCRETE ,
         [Stalk Shape] TEXT DISCRETE , [Stalk Root] TEXT DISCRETE ,
         [Stalk Surface Above Ring] TEXT DISCRETE ,
         [Stalk Surface Below Ring] TEXT DISCRETE ,
         [Stalk Color Above Ring] TEXT DISCRETE ,
         [Stalk Color Below Ring] TEXT DISCRETE ,
         [Veil Type] TEXT DISCRETE , [Veil Color] TEXT DISCRETE ,
         [Ring Number] TEXT DISCRETE , [Ring Type] TEXT DISCRETE ,
         [Spore Print Color] TEXT DISCRETE ,
         [Population] TEXT DISCRETE ,
         [Edibility] TEXT DISCRETE PREDICT,
         [Habitat] TEXT DISCRETE PREDICT)
         USING Microsoft_Decision_Trees" />
   </statements>
   <data-dictionary>
     <key name="Id" datatype="LONG" isinput="true" />
     <categorical name="Cap Shape" isinput="true" datatype="TEXT" />
     <categorical name="Cap Surface" isinput="true" datatype="TEXT" />
     <categorical name="Cap Color" isinput="true" datatype="TEXT" />
     <categorical name="Bruises" isinput="true" datatype="TEXT" />
     <categorical name="Odor" isinput="true" datatype="TEXT" />
     <categorical name="Gill Attachment" isinput="true" datatype="TEXT" />
     <categorical name="Gill Spacing" isinput="true" datatype="TEXT" />
     <categorical name="Gill Size" isinput="true" datatype="TEXT" />
     <categorical name="Gill Color" isinput="true" datatype="TEXT" />
     <categorical name="Stalk Shape" isinput="true" datatype="TEXT" />
     <categorical name="Stalk Root" isinput="true" datatype="TEXT" />
     <categorical name="Stalk Surface Above Ring" isinput="true" datatype="TEXT" />
     <categorical name="Stalk Surface Below Ring" isinput="true" datatype="TEXT" />
     <categorical name="Stalk Color Above Ring" isinput="true" datatype="TEXT" />
     <categorical name="Stalk Color Below Ring" isinput="true" datatype="TEXT" />
     <categorical name="Veil Type" isinput="true" datatype="TEXT" />
     <categorical name="Veil Color" isinput="true" datatype="TEXT" />
     <categorical name="Ring Number" isinput="true" datatype="TEXT" />
     <categorical name="Ring Type" isinput="true" datatype="TEXT" />
     <categorical name="Spore Print Color" isinput="true" datatype="TEXT" />
     <categorical name="Population" isinput="true" datatype="TEXT" />
     <categorical name="Edibility" isinput="true" ispredict="true" datatype="TEXT" />
     <categorical name="Habitat" isinput="true" ispredict="true" datatype="TEXT" />
   </data-dictionary>
   <tree-model />
 </pmml>

If you have a file like this one available, you can use it to create a new mining model with the FROM PMML clause by reading the content of the text file into a variable and using it as the XML definition parameter.
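
Here's a minimal VBScript sketch of that approach, reusing the open connection cn from the earlier examples. The new model name is hypothetical, and the doubling of single quotes is an assumption about how to embed the XML safely in the statement:

 dim fso, f, strXml, strDdl
 set fso = CreateObject("Scripting.FileSystemObject")
 set f = fso.OpenTextFile("c:\pscripts\mushroomslocal.dmm.xml", 1)  ' 1 = ForReading
 strXml = f.ReadAll
 f.Close
 ' Double any single quotes so the XML survives embedding in the statement.
 strDdl = "CREATE MINING MODEL [MushroomsFromPmml] FROM PMML '" & _
     Replace(strXml, "'", "''") & "'"
 cn.execute strDdl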

Later in the chapter, we'll expand on this strategy of using Microsoft data-mining models in conjunction with those of other vendors.

SELECT INTO Statement

The SELECT INTO statement creates a new mining model by copying schema and other information from an existing mining model. If the existing model is trained, the new model will automatically be trained with the same query. If the existing model is not trained, the new model will be empty.

 SELECT * INTO <new model>
 USING <algorithm> [(<parameter list>)]
 FROM <existing model>
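
For instance, the following hypothetical statement would copy the MushroomsLocal model created earlier into a new model that uses the Microsoft Clustering algorithm instead (the new model name is invented):

 SELECT * INTO [MushroomsCluster]
 USING Microsoft_Clustering
 FROM [MushroomsLocal]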

INSERT INTO Statement

The INSERT INTO statement inserts training data into the model. The columns from the query are mapped to model columns through the <mapped model columns> section. The keyword SKIP is used to instruct the model to ignore columns that appear in the source data query but are not used in the model. The format of the statement is as follows:

 INSERT INTO <new model name> (<mapped model columns>)     <source data query> 

or

 INSERT INTO <model> (<mapped model columns>)     VALUES <constant list> 

or

 INSERT INTO <model>.COLUMN_VALUES(<mapped model columns>)
     <source data query>

or

 INSERT INTO <model> 

The INSERT INTO <model>.COLUMN_VALUES form inserts data directly into the model's columns without training the model's algorithm. This allows you to provide column data to the model in a concise, ordered manner that is useful when dealing with data sets containing hierarchies or ordered columns. The "." operator is used to specify columns that are part of a nested table. When you use this form, columns that are part of a relation (either through RELATED TO or by being a KEY in a nested table) cannot be inserted individually and must be inserted together with all the columns in the relation.
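
As a sketch, the following statement would feed the distinct values of a single column into the MushroomsLocal model without training the algorithm; the ORDER BY clause reflects the ordered manner mentioned above, and the exact quoting follows the OPENROWSET example later in this section:

 INSERT INTO [MushroomsLocal].COLUMN_VALUES([Ring Number])
 OPENROWSET('sqloledb',
     'dataserver';'dtsuser';'',
     'SELECT DISTINCT ring_number FROM dataminer..mushrooms
      ORDER BY ring_number')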

The <mapped model columns> section has the following form:

 <column identifier> | <table identifier>(<column identifier> | SKIP), ...

OPENROWSET Syntax

You can train the local mining model with the contents of a SQL Server table or an OLAP cube by using the OPENROWSET function. This function establishes a connection with a remote data source, performs the query, and returns the data set. The condition, of course, is that the returned columns match the order and number of the columns in the destination data-mining model.

 OPENROWSET('provider_name'
     {
         'datasource';'user_id';'password'
         | 'provider_string'
     },
     {
         'query'
     })
  • provider_name - A character string that represents the friendly name of the OLE DB provider as specified in the registry. It has no default value. If the source were SQL Server, for example, the provider name would be SQLOLEDB.
  • datasource - A string constant that corresponds to a particular OLE DB data source object. Typically, this string includes the name of the database file, the name of a database server, or a name that the provider understands to locate the data source.
  • user_id - Needed for those data source providers, such as SQL Server, that have a security mechanism requiring a logon name.
  • password - The password to authenticate the logon.
  • provider_string - A provider-specific connection string that contains all the parameters needed to establish a connection to the data source. This is used as an alternative to, not in conjunction with, the first three parameters.
  • query - A string constant that is sent to and executed by the provider.

Here is an example of OPENROWSET that uses the INSERT INTO statement to insert rows into a mining model:

 strIns = "insert into [mushroomslocal] (" &_ "[Id]                   , "      &_ "[Cap Shape]            , "      &_ "[Cap Surface]          , "      &_ "[Cap Color]            , "      &_ "[Bruises]              , "      &_ "[Odor]                 , "      &_ "[Gill Attachment]      , "      &_ "[Gill Spacing]         , "      &_ "[Gill Size]            , "      &_ "[Gill Color]           , "      &_ "[Stalk Shape]          , "      &_ "[Stalk Root]           , "      &_ "[Stalk Surface Above Ring]   , "  &_ "[Stalk Surface Below Ring]   , "  &_ "[Stalk Color Above Ring]     , "  &_ "[Stalk Color Below Ring]     , "  &_ "[Veil Type]   , "                 &_ "[Veil Color]  , "                 &_ "[Ring Number] , "                 &_ "[Ring Type]   , "                 &_ "[Spore Print Color] , "           &_ "[Population] , "                  &_ "[Edibility] , "            &_ "[Habitat] ) openrowset("              &_ "'sqloledb'," &_ "'dataserver';'dtsuser'; ''," &_ "'SELECT DISTINCT " &_ "[ID] AS [Id]," &_ "cap_shape AS [Cap Shape]," &_ "cap_surface AS [Cap Surface]," &_ "cap_color AS [Cap Color]," &_ "bruises AS [Bruises]," &_ "odor AS [Odor]," &_ "gill_attachment AS [Gill Attachment]," &_ "gill_spacing AS [Gill Spacing]," &_ "gill_size AS [Gill Size]," &_ "gill_color AS [Gill Color]," &_ "stalk_shape AS [Stalk Shape]," &_ "stalk_root AS [Stalk Root]," &_ "stalk_surface_above_ring AS [Stalk Surface Above Ring]," &_ "stalk_surface_below_ring AS [Stalk Surface Below Ring]," &_ "stalk_color_above_ring AS [Stalk Color Above Ring]," &_ "stalk_color_below_ring AS [Stalk Color Below Ring]," &_ "veil_type AS [Veil Type]," &_ "veil_color AS [Veil Color]," &_ "ring_number AS [Ring Number]," &_ "ring_type AS [Ring Type]," &_ "spore_print_color AS [Spore Print Color]," &_ "population AS [Population]," &_ "edibility AS [Edibility]," &_ "habitat AS [Habitat] " &_ "FROM dataminer..mushrooms &_ " ')" 'set rs = cn.execute(strIns)   64   

Tip

If you ever want to see the syntax of the training query used to populate your data-mining model from a SQL Server source, open a trace against SQL Server using the Profiler utility to capture all the SQL statements issued against the server, and then process the mining model. The trace will capture the query that Analysis Services issues against the server, and you can use that query to help create the query that populates the local mining model.


Nested Tables and the SHAPE Statement

The SHAPE statement permits you to retrieve all the data needed for the cases from multiple queries and "shape" it into a single table with nested table structures that can be used to train the DMM.

 SHAPE {<master query>}
     APPEND ({<child table query>}
         RELATE <master column> TO <child column>)
         AS <column table name>
     [
         APPEND ({<child table query>}
             RELATE <master column> TO <child column>)
             AS <column table name>
         ...
     ]

The SHAPE statement allows the addition of table columns to a master query by specifying the child table rows and the way to match a row in the <master query> to its child rows in the <child table query>.
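
Combined with INSERT INTO and OPENROWSET, a hypothetical training statement for the nested CustomerPurchases model sketched earlier might look like this. The source tables and sort keys are invented, and both queries are assumed to be ordered by the relating column:

 INSERT INTO [CustomerPurchases]
     ([Customer Id], [Age],
      [Products](SKIP, [Product Name], [Quantity]))
 SHAPE
     {OPENROWSET('sqloledb', 'dataserver';'dtsuser';'',
         'SELECT customer_id, age FROM sales..customers
          ORDER BY customer_id')}
 APPEND
     ({OPENROWSET('sqloledb', 'dataserver';'dtsuser';'',
         'SELECT customer_id, product_name, quantity FROM sales..purchases
          ORDER BY customer_id')}
     RELATE customer_id TO customer_id)
     AS [Products]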


Note

If you reopen the mushroomslocal.dmm.xml file, you'll notice that it has become considerably larger because of the addition of the new cases: the XML now contains a series of tags for each case entered.


Using XML in Data Mining

Extensible Markup Language (XML) is used to store the local mining models, but even the mining models that reside on the server contain schema rowsets with the XML definition of the model.

The PMML Standard

Predictive Model Markup Language (PMML) provides applications with a vendor-independent method of creating and storing data-mining models so that data-mining applications from different vendors can share the models for the purpose of visualizing, manipulating, or enriching them with new cases. Because a standard format is used, the integration of all of these different source models is relatively transparent to the data-mining applications themselves. As of the publication of this book, the Document Type Definition (DTD) for data-mining models exists in PMML form. Notice that the following listing contains only the definitions for those models provided by Microsoft data mining. For the complete PMML Document Type Definition specification, please refer to http://www.dmg.org/.

 <?xml version='1.0' encoding='ISO-8859-1' ?>
 <!ENTITY % A-PMML-MODEL '(TreeModel | ClusteringModel)' >

 <!ELEMENT PMML (Header, DataDictionary, (%A-PMML-MODEL;)+, Extension*)>
 <!ATTLIST PMML
     version CDATA #REQUIRED
 >
 <!ELEMENT Extension ANY>
 <!ATTLIST Extension
     extender CDATA #IMPLIED
     name     CDATA #IMPLIED
     value    CDATA #IMPLIED
 >

 <!ENTITY % NUMBER     "CDATA">
 <!ENTITY % INT-NUMBER "CDATA">
     <!-- content must be an integer, no fractions or exponent -->
 <!ENTITY % REAL-NUMBER "CDATA">
     <!-- content can be any number; covers C/C++ types 'float', 'long',
          'double'; scientific notation, e.g. 1.23e4, is allowed -->
 <!ENTITY % PROB-NUMBER "CDATA">
     <!-- a REAL-NUMBER between 0.0 and 1.0, usually describing a probability -->
 <!ENTITY % PERCENTAGE-NUMBER "CDATA">
     <!-- a REAL-NUMBER between 0.0 and 100.0 -->
 <!ENTITY % FIELD-NAME "CDATA">

 <!ELEMENT Array (#PCDATA)>
 <!ATTLIST Array
     n    %INT-NUMBER; #IMPLIED
     type (int | real | string) #IMPLIED
 >
 <!ENTITY % NUM-ARRAY    "Array">  <!-- an array of numbers -->
 <!ENTITY % INT-ARRAY    "Array">  <!-- an array of integers -->
 <!ENTITY % REAL-ARRAY   "Array">  <!-- an array of reals -->
 <!ENTITY % STRING-ARRAY "Array">  <!-- an array of strings -->

 <!-- ========= Header ========================================== -->
 <!ELEMENT Header (Application?, Annotation*, Timestamp?)>
 <!ATTLIST Header
     copyright   CDATA #REQUIRED
     description CDATA #IMPLIED
 >
 <!-- describes the software application that generated the PMML -->
 <!ELEMENT Application EMPTY>
 <!ATTLIST Application
     name    CDATA #REQUIRED
     version CDATA #IMPLIED
 >
 <!ELEMENT Annotation (#PCDATA)>
 <!-- a timestamp in the format YYYY-MM-DD hh:mm:ss GMT +/- xx:xx -->
 <!ELEMENT Timestamp (#PCDATA)>

 <!-- ========= Data Dictionary ================================= -->
 <!ELEMENT DataDictionary (Extension*, DataField+)>
 <!ATTLIST DataDictionary
     numberOfFields %INT-NUMBER; #IMPLIED
 >
 <!ELEMENT DataField (Extension*, (Interval* | Value*))>
 <!ATTLIST DataField
     name        %FIELD-NAME; #REQUIRED
     displayName CDATA        #IMPLIED
     optype      (categorical | ordinal | continuous) #REQUIRED
     isCyclic    (0 | 1) "0"
 >
 <!ELEMENT Value (Extension*)>
 <!ATTLIST Value
     value        CDATA #REQUIRED
     displayValue CDATA #IMPLIED
     property     (valid | invalid | missing) "valid"
 >
 <!ELEMENT Interval EMPTY>
 <!ATTLIST Interval
     closure     (openClosed | openOpen | closedOpen | closedClosed) #REQUIRED
     leftMargin  %NUMBER; #IMPLIED
     rightMargin %NUMBER; #IMPLIED
 >

 <!-- ========= Mining Schema =================================== -->
 <!ELEMENT MiningSchema (Extension*, MiningField+)>
 <!ENTITY % FIELD-USAGE-TYPE "(active | predicted | supplementary)">
 <!ENTITY % OUTLIER-TREATMENT-METHOD "(asIs | asMissingValues | asExtremeValues)">
 <!ELEMENT MiningField (Extension*)>
 <!ATTLIST MiningField
     name      %FIELD-NAME;               #REQUIRED
     usageType %FIELD-USAGE-TYPE;         "active"
     outliers  %OUTLIER-TREATMENT-METHOD; "asIs"
     lowValue  %NUMBER;                   #IMPLIED
     highValue %NUMBER;                   #IMPLIED
 >

 <!-- ========= Statistics ====================================== -->
 <!ELEMENT ModelStats (UnivariateStats+)>
 <!ENTITY % AGGREGATE "(Counts?, NumericInfo?)">
 <!ELEMENT UnivariateStats ((%AGGREGATE;)?, DiscrStats?, ContStats?)>
 <!ATTLIST UnivariateStats
     field %FIELD-NAME; #IMPLIED
 >
 <!ELEMENT Counts EMPTY>
 <!ATTLIST Counts
     totalFreq   %NUMBER; #REQUIRED
     missingFreq %NUMBER; #IMPLIED
     invalidFreq %NUMBER; #IMPLIED
 >
 <!ELEMENT NumericInfo (Quantile*)>
 <!ATTLIST NumericInfo
     minimum            %NUMBER; #IMPLIED
     maximum            %NUMBER; #IMPLIED
     mean               %NUMBER; #IMPLIED
     standardDeviation  %NUMBER; #IMPLIED
     median             %NUMBER; #IMPLIED
     interQuartileRange %NUMBER; #IMPLIED
 >
 <!ELEMENT Quantile EMPTY>
 <!ATTLIST Quantile
     quantileLimit %PERCENTAGE-NUMBER; #REQUIRED
     quantileValue %NUMBER; #REQUIRED
 >
 <!ELEMENT DiscrStats (Extension*, (%STRING-ARRAY;)?, (%INT-ARRAY;)?)>
 <!ATTLIST DiscrStats
     modalValue CDATA #IMPLIED
 >
 <!ELEMENT ContStats (Extension*, Interval*,
     (%INT-ARRAY;)?, (%NUM-ARRAY;)?, (%NUM-ARRAY;)?)>
 <!ATTLIST ContStats
     totalValuesSum  %NUMBER; #IMPLIED
     totalSquaresSum %NUMBER; #IMPLIED
 >
 <!ELEMENT Partition (PartitionFieldStats+)>
 <!ATTLIST Partition
     name CDATA    #REQUIRED
     size %NUMBER; #IMPLIED
 >
 <!ELEMENT PartitionFieldStats (%AGGREGATE;, (%NUM-ARRAY;)*)>
 <!ATTLIST PartitionFieldStats
     field %FIELD-NAME; #REQUIRED
 >

 <!-- ========= Normalization =================================== -->
 <!ENTITY % NORM-INPUT "(NormContinuous | NormDiscrete)">
 <!ELEMENT NormContinuous (Extension*, LinearNorm*)>
 <!ATTLIST NormContinuous
     field %FIELD-NAME; #REQUIRED
 >
 <!ELEMENT LinearNorm EMPTY>
 <!ATTLIST LinearNorm
     orig %NUMBER; #REQUIRED
     norm %NUMBER; #REQUIRED
 >
 <!ELEMENT NormDiscrete (Extension*)>
 <!ATTLIST NormDiscrete
     field  %FIELD-NAME; #REQUIRED
     method (indicator | thermometer) #FIXED "indicator"
     value  CDATA #REQUIRED
 >

 <!ELEMENT ClusteringModel
     (Extension*, MiningSchema, ModelStats?, ComparisonMeasure,
      ClusteringField*, CenterFields?, Cluster+)>
 <!ATTLIST ClusteringModel
     modelName        CDATA #IMPLIED
     modelClass       (centerBased | distributionBased) #REQUIRED
     numberOfClusters %INT-NUMBER; #REQUIRED
 >
 <!ELEMENT CenterFields ((%NORM-INPUT;)+)>
 <!ELEMENT Cluster (Extension*, (%NUM-ARRAY;)?, Partition?, Covariances?)>
 <!ATTLIST Cluster
     name CDATA #IMPLIED
 >

 <!-- ========= Tree Classification ============================= -->
 <!ENTITY % PREDICATES "(Predicate | CompoundPredicate | True | False)">
 <!ELEMENT TreeModel (Extension*, MiningSchema, ModelStats?, Node)>
 <!ATTLIST TreeModel
     modelName CDATA #IMPLIED
 >
 <!ELEMENT Node (Extension*, (%PREDICATES;), Node*, ScoreDistribution*)>
 <!ATTLIST Node
     score       CDATA #REQUIRED
     recordCount %NUMBER; #IMPLIED
 >
 <!ELEMENT Predicate EMPTY>
 <!ATTLIST Predicate
     field    %FIELD-NAME; #REQUIRED
     operator (equal | lessThan | notEqual | lessOrEqual |
               greaterThan | greaterOrEqual) #REQUIRED
     value    CDATA #REQUIRED
 >
 <!ELEMENT CompoundPredicate (%PREDICATES;, (%PREDICATES;)+)>
 <!ATTLIST CompoundPredicate
     booleanOperator (or | and | xor | cascade) #REQUIRED
 >
 <!ELEMENT True EMPTY>
 <!ELEMENT False EMPTY>
 <!ELEMENT ScoreDistribution EMPTY>
 <!ATTLIST ScoreDistribution
     value       CDATA #REQUIRED
     recordCount %NUMBER; #REQUIRED
 >

Summary

In addition to the powerful server functions provided by Analysis Services, PivotTable Service exists (in the form of an OLE DB provider) to supply the same functionality on the client side. Thanks to this service, it's possible to create, train, modify, and query data-mining models that exist locally. Interestingly, the data-mining models created locally adhere to the PMML standard recently adopted by many data-mining vendors. This standardization lets the models you create with Analysis Services be distributed among vendors of data-mining products.

In the next chapter, we'll take a close look at querying the data-mining models using code both with mining models that reside on the server and with those created with PivotTable Service.


