6 Creating Decision Trees with OLAP

As we saw in Chapters 1 and 2, one of the most important prerequisites for a successful data-mining effort is the preparation of the case data, which must happen before you can begin mining any data. In many, if not most, cases the training data for your data-mining models will come from disparate sources and nonrelational formats, such as flat files or Microsoft Excel spreadsheets. As part of any enterprise-wide data-mining effort, organizations need to centralize and normalize their data into a data warehouse. In Chapters 1 and 2, I discussed the need for these steps, the preferred methodology, and the pitfalls to avoid. Now that we know what is needed to create data-mining models, it makes sense to incorporate more Microsoft SQL Server tools into the overall data-mining effort. Microsoft Data Transformation Services (DTS), although not specifically an Analysis Services tool, is a very important tool for gathering data from many different sources so that it can be cleansed and warehoused before it is used in OLAP and data mining.

Sometimes the data grows over time, as is the case with supermarket sales data, weather data, and stock market price data. When this is the case, the data-mining models are continuously updated with new data that enriches the model and allows for more timely predictions. In addition to all its transformational capabilities, DTS can also create a unit (which acts much like an application) that can be automated with SQL Server Agent.

In this chapter, I'll explain how DTS works and how to program it, and I'll show you, step by step, how to automatically download the mushroom data used in Chapter 5 from the mushrooms database's FTP site, run the DTS transformation steps needed to format and store the data, and train the data-mining model. We'll create a unit to do all this with the click of a button or through an automated task.

What Is DTS?

DTS is many things, including a

  • Collection of COM objects, each geared toward a specific task
  • Hub of OLE DB data connections
  • Workflow development platform
  • Application development environment

Unlike other programming environments, DTS is strictly batch-oriented. This means that it runs steps in a sequential order without waiting for user input or any other user "events." It's designed to provide an easy way to define those steps and establish the flow of execution so that you can go from beginning to end in a perfectly predictable manner (barring errors and other exceptions).

The basic DTS unit is a package, a named collection of tasks. Just as Microsoft Visual Basic programs are contained within a project and SQL Server tables are contained within a database, DTS objects are contained in a package. A package can contain any of the following four types of objects:

  • Connections
  • DTS Tasks
  • Transformations
  • Workflows

As a unit, packages can be edited, stored in databases or files, password protected, scheduled for execution through the SQL Server Agent, and retrieved by version. Each package contains one or more steps that are executed sequentially or in parallel when the package is run. When executed, the package connects to the correct data sources, copies data and database objects, transforms data, notifies users of the progress of events, and optionally creates written logs of the events that transpired.

DTS Tasks

A DTS task is an object with a specific function that is executed as a step. The type of task you choose depends on the functionality required. Out of the box, DTS provides a whole slew of tasks, described in the following sections.

Transform

The Transform Data task is used to copy data between any OLE DB-compliant source and destination. When the source and destination objects are defined, the individual columns are exposed and given a default mapping, which is usually ordered sequentially so that the first column of the source maps to the first column of the destination, and so on. You can customize the transformation task by defining different column mappings or by eliminating certain columns from the transformation procedure altogether. Many times the simple column mappings are not enough; for example, many data sources store the date in one field and the time in another, but SQL Server 2000 uses a single datetime field that incorporates both the date and the time. When transforming the data from the source, you have the option of specifying, through the use of scripting, that the destination's date/time field is the concatenation of the source's date and time fields. We'll take a closer look at scripting in DTS later in this chapter. Other times, the source data contains only obscure codes that can be converted to meaningful descriptions by looking up the values in a third data source. The transformation task has facilities to handle this as well. The flexibility of this task makes it a cornerstone of almost every DTS package.
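To make the date/time scenario concrete, here is a minimal sketch of the kind of ActiveX transformation script just described. The column names sale_date, sale_time, and sale_datetime are hypothetical stand-ins for whatever your source and destination actually expose:

 Function Main()
     ' Concatenate the source's separate date and time columns
     ' into the destination's single datetime column.
     DTSDestination("sale_datetime") = _
         DTSSource("sale_date") & " " & DTSSource("sale_time")
     Main = DTSTransformStat_OK
 End Function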


Note

The Transform Data task is often referred to as the DataPump task, especially in the object properties of a package.


Bulk Insert

The Bulk Insert task is the T-SQL BULK INSERT command encapsulated in an object. For it to work, the following conditions must be met:

  • The data destination is a single SQL Server table.
  • The data source is a flat text file.
  • The columns of the source and the destination have a 1-to-1 mapping.
  • The source and the destination have the same number of columns.
  • There are no transformation steps taken or implicit conversions made between the source and the destination.

Bulk Insert is the fastest way to import text data into a SQL Server 2000 table as long as the following restrictions are met:

  • The database option Select Into/Bulkcopy is set to true.
  • The target table has no indexes, or if the table has indexes, it is empty when the bulk copy starts.
  • The target table is not being replicated.

Bulk Insert is typically used when the number of rows to be imported is in the millions, when time is of the essence, and when logging is not necessary. When these conditions exist, the records are usually inserted into a staging table, preferably in a special staging database, and then processed from this staging table.
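Under the covers, the task issues the T-SQL BULK INSERT command. A minimal sketch of that command, assuming a hypothetical staging database, staging table, and local file path:

 -- Database, table, and file path are assumptions for
 -- illustration; substitute your own staging objects.
 BULK INSERT staging.dbo.mushrooms_raw
 FROM 'C:\data\agaricus-lepiota.data'
 WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')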

Data Driven Query

A Data Driven Query task allows you to loop through the source data as you would with an ADO recordset or a SQL Server cursor. This is especially useful when the contents of each individual record contain values that determine what action to take on one or more tables. In other words, as you navigate through each row, you can issue almost any SQL statement, including INSERT, UPDATE, DELETE, and stored procedure calls. The creation of a Data Driven Query task is a bit more involved than that of the other tasks because of all the options available to you, but it does offer the ultimate in flexibility, even if at a high cost to the performance of the task. Just remember that these aren't batches but record-by-record transactions, so if you can find a way to do the same work with one of the tasks mentioned earlier, you're better off.
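In a Data Driven Query task, an ActiveX script examines each row and returns a constant that tells DTS which of the task's parameterized queries to run. A hedged sketch, with a hypothetical status column deciding between the task's update and insert queries:

 Function Main()
     ' Route each source row to the appropriate query.
     If DTSSource("status") = "existing" Then
         Main = DTSTransformstat_UpdateQuery
     Else
         Main = DTSTransformstat_InsertQuery
     End If
 End Function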

Execute Package

In a data-mining environment, the executable processes can be divided and run separately. Programmers are familiar with the notion of breaking applications into parts, each of which performs a specific task, for the sake of readability and code reuse. The Execute Package task brings that notion to DTS: it lets you create packages that perform specific tasks and call them from other packages, so a job can be broken into smaller pieces. In our case, that means one package could only train the model, another could only use the model to make predictions, and a third could call both consecutively, all without creating the same package three times. Just as in a programming language, the Execute Package task has mechanisms for passing parameters from one package to another, much as one would do with a stored procedure or a function.

I'll cover the next five tasks briefly because they're not especially relevant to data mining. If you would like more information about some of these tasks, consult SQL Server Books Online, or pick up a copy of Professional SQL Server 2000 DTS (Data Transformation Services) by Mark Chaffin, Brian Knight, and Todd Robinson (Wrox Press), a book completely dedicated to DTS.

Transfer Error Messages

With sp_addmessage, a SQL Server system stored procedure, a user can add user-defined, application-specific error messages. If your application is going to run on other servers, you would use this task to transfer those messages.
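For context, here is a minimal sketch of registering such a message; the message number and text are hypothetical (user-defined message numbers start at 50001):

 EXEC sp_addmessage @msgnum = 50001, @severity = 16,
     @msgtext = N'Mushroom staging table failed its row-count check.'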

Transfer Master Stored Procedures

The Transfer Master Stored Procedures task copies the stored procedures from a master database on one SQL Server 2000 server to another.

Transfer Databases

The Transfer Databases task copies the contents of a database on a SQL Server 7.0 server to a SQL Server 2000 server.

Transfer Jobs

The Transfer Jobs task copies the contents of the jobs stored in the msdb database in a SQL Server 2000 server to another SQL Server 2000 server.

Transfer Logins

The Transfer Logins task copies the logins defined in the master database from one SQL Server 2000 server to another SQL Server 2000 server.

Copy SQL Server Objects

The Copy SQL Server Objects task transfers objects from one SQL Server to another. Any number and combination of objects can be transferred, but remember that this is not a disaster-recovery mechanism for SQL Server. It's useful for creating occasional test environments or for moving specific objects, such as lookup tables, from one server to another.

Dynamic Properties

The Dynamic Properties task works by retrieving values from sources outside a DTS package at package run time and using them to assign values to objects that require run-time variables specific to a given server or other dynamic element. In this way, a package can act like a reusable application and retrieve values from an outside source, leaving the rest of the package as is. This is particularly useful for cases in which

  • The raw data-mining source files have dynamic names or sequential numbers in the name.
  • You need to connect to an available server or FTP site, where the name of the server or site is not known until package run time.
  • The target of the prediction or the location of the test data used to make predictions changes periodically or is unknown until run time.

The source of the dynamic data used by the package can come from any one of these sources:

  • A data file containing a property value that can be read and assigned. Unlike the initialization file option, the data file option supports property values greater than one line in length.
  • A .ini file, such as myprogram.ini, or any .ini file that you create. This is good for single-line value pairs as opposed to registry-style hierarchies of definitions.
  • A SQL Server query, as long as it returns just one row. Just make sure that you use TOP 1 or GROUP BY in such a way that you're guaranteed to get just one row back from the query (see the sketch after this list).
  • A DTS package global variable. These can be set and initialized in the package designer.
  • An operating system environment variable set within Microsoft Windows 2000 or Microsoft Windows NT.
  • A constant used as the default value for a variable in case the conditions for assigning that variable never occur.
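As an example of the query option, here is a sketch of a one-row query that could feed a Dynamic Properties assignment; the table and column names are hypothetical:

 -- Returns exactly one row: the most recently queued file name.
 SELECT TOP 1 file_name
 FROM staging.dbo.import_queue
 ORDER BY created_at DESC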

Message Queue

Message Queue is a task that interacts with Microsoft Message Queue (MSMQ), Microsoft's message-queuing service, which lets packages send and receive messages asynchronously.

ActiveX Scripting

ActiveX Scripting is a task that executes a program written entirely in a scripting language, such as Microsoft VBScript or JScript. The scripting task is a way not only to manipulate the variables and other objects in the same package, but also to accomplish any number of tasks, including instantiating Microsoft ActiveX objects and servers such as Microsoft Word and Excel. The task designer provides a rich environment for creating the scripts, complete with syntax checking, keyword and function lists, and an object browser.
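As a minimal sketch of the first use, manipulating package objects, here is a script task that sets a package global variable; the variable name is a hypothetical one:

 Function Main()
     ' Store a file name where later tasks can read it.
     DTSGlobalVariables("ImportFileName").Value = _
         "agaricus-lepiota.data"
     Main = DTSTaskExecResult_Success
 End Function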


Note

By default, all the ActiveX scripting tasks provide VBScript and JScript as language options. You aren't limited to those languages if you install other Windows Scripting Host (WSH)-compliant languages, such as PerlScript or even ActivePython. My personal favorite is the Practical Extraction and Report Language (Perl), which can be retrieved free from http://www.activestate.com/. Once Perl is installed on the server with SQL Server 2000, the Perl keywords and syntax checker automatically become available in the ActiveX scripting development environments.


Execute SQL

Execute SQL is a task that is designed, as the name strongly suggests, to execute any T-SQL statement that you could run from SQL Query Analyzer. This task can be used for any number of processes, including creating tables, updating information, and deleting records.

Execute Process

The Execute Process task can run an executable program, such as a batch file or a .exe file. It runs in its own process, so the rest of the SQL Server environment is reasonably safe from the program's effects on memory and other resources. Typically, this might run a batch or a program that opens connections to other network resources or retrieves files, or it might run a program that generates a text file to be imported by SQL Server in a subsequent step.

Send Mail

Send Mail is a very handy task that programmatically sends mail to a recipient. This can be used to notify managers that a report is available or, in a more likely scenario, to notify a system administrator that a task failed.

FTP

The FTP task automates the retrieval of a file or a collection of files to a local directory from an FTP site. It handles the logging-in process as well as the actual file transfers. In addition, you can perform data transfers from any UNC path without it being an FTP site.

Analysis Services Processing Task

The Analysis Services Processing task processes Analysis Services objects. Originally designed for processing cubes only, it still uses the Cube icon even though it's capable of processing cubes, dimensions, and, of course, data-mining models.

Data-Mining Prediction Query Task

When you install Analysis Services, the Data-Mining Prediction Query task becomes available. This task is designed to help create and run prediction queries based on a data-mining model and output the results to another source, such as a SQL Server table or any other OLE DB-compliant data store such as Excel.

Custom and Third-Party Tasks

If the tasks described so far don't meet the requirements of your batch process, you can take advantage of the extensibility offered by DTS that lets you create custom tasks using Microsoft Visual C++, Visual Basic, Delphi, or any language that can create COM-compliant applications. You can then integrate these custom tasks into the DTS Designer user interface and save them as part of the DTS object model. The tasks then become available for use within the DTS Designer whenever a new package is created.

Connections

Connections are the basis for all activity in the DTS package. To successfully execute DTS tasks that copy and transform data, a DTS package must establish valid connections to its source and destination data and to any additional data sources, such as lookup tables.

OLE DB is the main provider type of DTS connections and allows a wide variety of data sources to be used. These can range from traditional relational database sources, such as SQL Server or Microsoft Access, to more loosely structured data, such as Excel spreadsheets and text files.

Sources

DTS contains built-in features that let you graphically add and configure a data source that contains either the raw data that will be manipulated or a final data structure that will store the output of a DTS task. You can use a data source, file, or data link as a source.

A Data Source Connection

Data Source Connections give you access to:

  • Standard databases, such as SQL Server 2000, Access 2000, Oracle, dBASE, and Paradox
  • ODBC data sources through the OLE DB Provider for ODBC
  • Excel 2000 spreadsheets
  • HTML files
  • Other OLE DB providers

A File Connection

DTS provides additional support for text files. When specifying a text file connection, specify the format of the file. For example, you would specify

  • Whether the text file is delimited by a character or in a fixed field format
  • Whether the text file is in a Unicode or an ANSI format
  • The row delimiter and column delimiter if the text file is in fixed field format
  • Whether there is a special text qualifier character to separate text columns from numeric or date columns
  • Whether the first row contains column names

A Data Link Connection

In a data link connection, the connection string used by the data source is stored in a separate text file and accessed at connection time.

Configuring a Connection

When creating a DTS package, you configure connections by selecting a connection type from a list of available OLE DB providers. The properties you configure for each connection vary depending on the individual provider for the data source. A DTS package can have multiple connections defined for the same data source, or the same connection can be reused for various tasks. A few factors to take into account before creating connections in a DTS package include the order of the execution of the tasks, whether your packages will be moved to other servers, and the security of your accounts.

Single Thread per Connection

If the package is designed so that the tasks execute in a perfectly linear order without any parallel tasks, then a single connection is fine. Packages can be designed to have multiple tasks execute simultaneously. In this situation, a single connection becomes a bottleneck and performance can be greatly enhanced by using a separate connection for each task.

Dynamic Connection Properties

If your package can be moved or copied to different servers, you may need to edit the direct connections made in a package. To make it easy to modify the connection parameters for a connection or connections, use a data link file, which saves the connection string in a separate text file. Alternatively, consider using the Dynamic Properties task to change the connection information at run time.
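Concretely, a data link is just a text file with a .udl extension that holds the connection string. A sketch of its contents, with the catalog and server names as hypothetical placeholders:

 [oledb]
 ; Everything after this line is an OLE DB initstring
 Provider=SQLOLEDB.1;Integrated Security=SSPI;Initial Catalog=staging;Data Source=MYSERVER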

Security Account Information

When creating a connection for a SQL Server data source, you have the option of specifying a given username and password or of using integrated security. Although using integrated security simplifies package creation because no specific credentials need to be supplied, keep in mind that the account used to access the server will be the currently logged-on user's account, or the service account of the SQL Server Agent if you have scheduled the package to run as a job. Using integrated security can therefore cause puzzling failures when the executing account is denied access to SQL Server objects that certain tasks need. If possible, consider creating a specific account with appropriate rights for the tasks in the package and using that account when creating connections. This ensures a consistent security environment regardless of the means by which the package was launched.

DTS Package Workflow

DTS steps and precedence constraints determine the order in which the tasks in the package are executed and under what logical conditions they are allowed to run. The simplest, most straightforward way to accomplish this is to use the DTS Designer because all the graphical workflow elements can be dragged, dropped, and interconnected using the mouse. That said, Visual Basic, Visual C++, or even Perl code provides the same functionality without DTS Designer.

DTS Package Steps

Steps are used to control the order in which tasks are executed in a DTS package. DTS package steps represent the execution units in the DTS object model: they define which tasks execute in what sequence when the package is run and which tasks run in parallel.

In DTS Designer there are no step objects to manipulate per se; instead, they are implicitly created whenever precedence constraints are created between tasks. To put that in perspective, consider that without any precedence constraints, all the tasks in a package would execute simultaneously, unless they all rely on the same connection (and even then the order of execution could differ from run to run).

When creating a package using code, you can control the relationship between a step and a task more precisely. You can create multiple steps for different package operations and associate the execution of those steps with a single task. For example, suppose you write a package in Visual Basic and specify in several parts of the package that errors can be generated. By linking the steps associated with those errors, you can make the different types of errors execute the same Send Mail task. That Send Mail task can send an e-mail notifying the system administrator that the package failed.


Note

DTS Designer allows you to execute an individual package step. This action is useful for testing and troubleshooting individual steps without having to run the entire package. To execute a single package step in DTS Designer, right-click the task you want to execute and choose Execute Step from the menu.


Precedence Constraints

Precedence constraints sequentially link tasks in a package. In DTS, you can use three types of precedence constraints, which can be accessed either through DTS Designer or programmatically.

Unconditional

If you want Task 2 to wait until Task 1 completes, regardless of the outcome, link Task 1 to Task 2 with an Unconditional precedence constraint.

On Success

If you want Task 2 to wait until Task 1 has successfully completed, link Task 1 to Task 2 with an On Success precedence constraint.

On Failure

If you want Task 2 to begin execution only if Task 1 fails to execute successfully, link Task 1 to Task 2 with an On Failure precedence constraint. If you want to run an alternative branch of the workflow when an error is encountered, use this constraint.

Using Multiple Precedence Constraints

You can issue multiple precedence constraints on a task. For example, you can have a given task execute only when two other previous tasks succeed, or only if one of the tasks succeeds and the other fails.

DTS Designer

The DTS Designer is a combination of integrated development environment, workflow engine, and application platform.

It has a graphical user interface similar to any classical workflow designer, in which objects of different types are arranged on a virtual whiteboard and have sequences and dependencies attached to them. The DTS Designer interface allows you to build and configure packages by using drag-and-drop methods and by completing property sheets on the various DTS objects included in the package. Unlike a normal workflow designer, the objects themselves contain functionality that can be defined with parameters or complete programs. Once these objects have been defined and the rules set for their execution order, the package can be launched, and the graphical elements come alive to provide visual cues that mark the progress of the tasks.

You can use DTS Designer to do the following:

  • Create a simple database object transfer task.
  • Create a package that includes complex workflows with multiple steps and dependencies.
  • Edit an existing package that was stored in one of the four storage repository types.

Opening the DTS Designer

You can access DTS Designer through SQL Server Enterprise Manager, under the Data Transformation Services node of the console tree. Under this node, you'll find the three storage repositories for packages; I'll discuss these later in this section. If you right-click the node, you can also choose to open a package stored in a .dts file.

DTS Designer Work Area

The user interface for the virtual whiteboard includes the DTS Designer main panel, which consists of the following parts:

  • A design sheet to create workflows with graphical objects that represent DTS tasks, DTS transformations, and precedence constraints
  • A menu bar containing menus for package operations, edit operations, data sources, tasks, and workflow items
  • A toolbar containing buttons for: creating, saving, and executing a package; printing a workflow; cutting, copying, and pasting graphical objects in a workflow; annotating a workflow; and changing both the workflow layout and the size of a workflow on the design sheet
  • A Connection toolbar containing connections for data sources
  • A Task toolbar containing DTS tasks
  • Dockable Connection and Task toolbars. You can undock either toolbar by right-clicking it and choosing Undock from the menu. (You can redock a toolbar by right-clicking its title bar and choosing Dock from the menu.)
  • Context menus to configure and edit package components and workflow (for example, connections, tasks, and workflow objects)

Saving a DTS Package

When you save a DTS package, you save all DTS connections, DTS tasks, DTS transformations, and workflow steps and preserve the graphical layout of these objects on the DTS Designer design sheet. The options described in the following sections are available for saving packages.

Package Name

Specify a unique name for the package with package name. The msdb tables use this name as a primary key.

Owner Password

Specify a password for the package with owner password to protect sensitive username and server password information in the package from unauthorized users. If the package has an owner password, the data is encrypted with the standard encryption API. This option is available only for packages saved to SQL Server or as a structured storage file.

User Password

Set a password for a package user with user password. This password allows a user to execute the package but not to view the package definition. If you set the user password, you must also set the owner password. This option is available only for packages saved to SQL Server or as a structured storage file.

Location

Specify the format and location of the saved package with the location option. You can save the package to SQL Server (which stores it in the msdb database), to Meta Data Services, to a structured storage file, or to a Visual Basic file. When you save the package to SQL Server or to Meta Data Services, you have the following options:

  • Server name: Specifies the name of the SQL Server installation storing the package.
  • Use Windows Authentication: Specifies the security mode used to connect to SQL Server. The Windows Authentication used will be the Microsoft Windows login of the user creating the package.
  • Use SQL Server Authentication: Specifies the security mode used to connect to an instance of SQL Server.
  • User Name: Specifies a user name for the connection to an instance of SQL Server.
  • Password: Specifies a password for the connection to an instance of SQL Server.

When you save the package to Meta Data Services, you can also display the Scanning Options dialog box and specify how objects referenced by the package should be scanned into Meta Data Services. This capability allows you to relate source and destination objects in a package to database meta data stored in Meta Data Services (for example, primary and foreign keys in a table or index, and column information such as data type).

When you save the package as a COM-structured storage file or a Visual Basic file, you have the following options:

  • Filename: Specifies the package filename and path. If the package is a structured storage file, it should be stored with the extension .dts. If the package is a Visual Basic file, it should be stored with the extension .bas.
  • Browse: Displays the Save As dialog box, where you can specify the filename, extension, and storage location.

You can save a package to:

  • Microsoft SQL Server: With this default save option, you can store a package in the SQL Server msdb database, which contains tables especially for storing packages. This offers the robustness and security of SQL Server but constrains access to the package to times when that particular SQL Server is available and running. You can take advantage of the backup features of SQL Server to protect the packages from loss, as well as use the DTS object transfer utility to move packages from one server to another.
  • Meta Data Services: With this save option, you can maintain historical information about the data manipulated by the package. However, Meta Data Services and the repository database must be installed and operational on your server. You can track the columns and tables that are used by the package as a source or destination. You can also use the data lineage feature to track which version of a package created a particular row. You can use these types of information for decision-support applications.
  • A structured storage file: With this save option, you can copy, move, and send a package across the network without having to store the file in a SQL Server database. The structured storage format allows you to maintain multiple packages and multiple package versions in a single file.
  • A Visual Basic file: If you were so inclined, you could create an entire DTS package represented entirely in code without resorting to the DTS Designer at all. Alternatively, you can do the opposite and take an already designed DTS package and convert it to Visual Basic code. With this option, the whole package is saved as Visual Basic code, and you can later open the Visual Basic file and modify the package definition in your development environment.


Note

When you save a package to the SQL Server repository or to a structured storage file, you can secure the package with one or more passwords. When you save a package to Meta Data Services or as a Visual Basic file, the DTS package security options are not available. However, you can keep packages saved to Visual Basic files secure through a source code control system such as Microsoft Visual SourceSafe, and create a compiled version which hides the details of the code.


dtsrun Utility

The dtsrun utility executes a package created using DTS. The DTS package can be stored in the Microsoft SQL Server msdb database, a COM-structured storage file, or SQL Server Meta Data Services. The syntax is as follows:

 dtsrun [/?]
 Package retrieval:
     /~S  /~U  /~P  /~E  /~N  /~M  /~G  /~V  /~F  /~R
 Package operation (overrides stored package settings):
     /~A global_variable_name:typeid=value
     /~L log_file_name
     /~W <True or False>  (write completion status to Windows Event Log)
 DTSRun action (default is to execute the package):

The following do not execute the package; instead they perform an administrative operation on it.

 /!X  /!D  /!Y  /!C

The following is a list of arguments:

  • /?: Displays the command prompt options.
  • ~: The tilde specifies that the parameter to follow is an encrypted value stored as a hexadecimal string. This protects a batch file that would otherwise contain clear-text values for the username, passwords, and servers. It can be used with the /S, /U, /P, /N, /G, /V, /M, /F, and /R options.
  • /S server_name[\instance_name]: Specifies the name or the instance of SQL Server to connect to.
  • /U username: The SQL Server login used to connect to an instance of SQL Server.
  • /P password: A user-specified password used with a login ID.
  • /E: Specifies that a Windows NT or Windows 2000 trusted connection will be used instead of a SQL Server login and password.
  • /N packagename: The name the DTS package was saved as.
  • /G guid_string: The package ID assigned to the DTS package when it was created. The package ID is stored as a GUID data type.
  • /V guid_string: The version ID assigned to the DTS package when it was first saved or executed. A new version ID is assigned to the DTS package each time it is modified. The version ID is a GUID. It is used by the storage engine to keep track of version and execution history; this history is known as the lineage of the package.
  • /M password: An optional password assigned to the DTS package when it was created. To launch the package, you must supply the password. If no password was specified, this switch is not necessary.
  • /F filename: The name of a COM-structured storage file containing DTS packages; this is the result of saving the package as a .dts file. If a server name is also specified, the DTS package retrieved from SQL Server is executed and that package is added to the structured storage file.
  • /R repository_database_name: The name of the repository database containing DTS packages.
  • /A variable_name:typeid=value: Specifies a package global variable, where typeid is the type identifier for the data type of the global variable, as described in Table 8-1. The entire argument string can be quoted, and the argument can be repeated to specify multiple global variables.
Table 8-1. Global Variable Types and their ID Values
Data type                 Type ID
Integer (small)           2
Integer                   3
Real (4-byte)             4
Real (8-byte)             5
Currency                  6
Date                      7
String                    8
Boolean                   11
Decimal                   14
Integer (1-byte)          16
Unsigned int (1-byte)     17
Unsigned int (2-byte)     18
Unsigned int (4-byte)     19
Integer (8-byte)          20
Unsigned int (8-byte)     21
Int                       22
Unsigned int              23
HRESULT                   25
Pointer                   26
LPSTR                     30
LPWSTR                    31

To set global variables with this command switch, you must either have Owner permission for the package, or the package must have been saved without DTS password protection enabled.

  • /L filename: Specifies the name of the package log file, where the events of package execution are stored, including times, steps, and any errors.
  • /W Event_Log: Specifies whether to write the completion status of the package execution to the Windows Application Log. Specify True or False.
  • /Z: Indicates that the command line for dtsrun is encrypted using SQL Server 2000 encryption.

The next four commands perform administrative functions without executing the package.

  • /!X: Blocks execution of the selected DTS package. Use this parameter when you want to create an encrypted command line for use in your batch files without executing the DTS package.
  • /!D: Deletes the DTS package from an instance of SQL Server. It is not possible to delete a specific DTS package from a structured storage file; the entire file needs to be overwritten using the /F and /S options.
  • /!Y: Displays the encrypted command used to execute the DTS package, such as one created with the /!X command.
  • /!C: Copies the command used to execute the DTS package to the clipboard. This option can also be used in conjunction with /!X and /!Y to make it easy to paste the generated command into a batch file.

To execute a DTS package saved as a COM-structured storage file, use

 dtsrun /Ffilename /Npackage_name /Mpackage_password 

To execute a DTS package saved in the SQL Server msdb database, use

 dtsrun /Sserver_name /Uusername /Ppassword
     /Npackage_name /Mpackage_password

To execute a DTS package saved in Meta Data Services, use

 dtsrun /Sserver_name /Uusername /Ppassword
     /Npackage_name /Mpackage_password /Rrepository_name
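To pass a global variable to a package at run time, combine /A with the type IDs from Table 8-1. A hedged sketch, assuming a hypothetical package with a string (type ID 8) global variable named ImportFileName:

 dtsrun /Sserver_name /E /Npackage_name
     /A ImportFileName:8=agaricus-lepiota.data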

Using DTS to Create a Data-Mining Model

Having covered the essentials of DTS, I'll now walk you through the creation of a DTS package that builds our Mushrooms data-mining model on request. This case is especially interesting because the source file resides on a remote FTP site that you can access from your Internet connection. The file is a comma-delimited text file that contains well-formatted rows, but with cryptic codes in the columns that need to be translated to create a legible data-mining model. The file needs to be placed in a SQL Server 2000 table, and then the data-mining model needs to be created. This is the first part of our DTS package. Later, in Chapter 12, we'll expand the package to include tasks for automatically issuing predictions against test data presented to it for analysis.

The steps to create the DTS package are as follows:

  1. Download the Mushrooms file from the FTP site.
  2. Delete the contents of the SQL Server staging table that will be used to import the mushroom data.
  3. Import the file into a SQL Server table, while converting the cryptic codes used to describe the characteristics of the mushrooms into meaningful labels.
  4. Train the data-mining model from the data in the SQL Server table.

Preparing the SQL Server Environment

Before we can do steps 1 through 4, we need to prepare the SQL Server database for the data-mining operation.

  • Build the mushrooms table.
  • Create a lookup table that contains the code-to-label translations for the mushroom characteristics.

Let's start with the mushrooms table. Run the code shown below in the Query Analyzer of your SQL Server database.

 if not exists (select * from dbo.sysobjects
     where id = object_id(N'[dbo].[mushrooms]')
     and OBJECTPROPERTY(id, N'IsUserTable') = 1)
 BEGIN
 CREATE TABLE [mushrooms] (
     [ID] [varchar] (50) NULL,
     [edibility] [varchar] (50) NULL,
     [cap_shape] [varchar] (50) NULL,
     [cap_surface] [varchar] (50) NULL,
     [cap_color] [varchar] (50) NULL,
     [bruises] [varchar] (50) NULL,
     [odor] [varchar] (50) NULL,
     [gill_attachment] [varchar] (50) NULL,
     [gill_spacing] [varchar] (50) NULL,
     [gill_size] [varchar] (50) NULL,
     [gill_color] [varchar] (50) NULL,
     [stalk_shape] [varchar] (50) NULL,
     [stalk_root] [varchar] (50) NULL,
     [stalk_surface_above_ring] [varchar] (50) NULL,
     [stalk_surface_below_ring] [varchar] (50) NULL,
     [stalk_color_above_ring] [varchar] (50) NULL,
     [stalk_color_below_ring] [varchar] (50) NULL,
     [veil_type] [varchar] (50) NULL,
     [veil_color] [varchar] (50) NULL,
     [ring_number] [varchar] (50) NULL,
     [ring_type] [varchar] (50) NULL,
     [spore_print_color] [varchar] (50) NULL,
     [population] [varchar] (50) NULL,
     [habitat] [varchar] (50) NULL)
 END

This table should remain empty until we actually import the text file. We do need to create the lookup table and populate it with the proper values, as shown here.

 create table lookups (
     lookup_id uniqueidentifier not null
         default newid() primary key clustered,
     type varchar(50),
     code char(1),
     value varchar(50))
 go
 create index ix_type_code on lookups(type, code)
     with fillfactor = 100
 go
 insert into lookups (type, code, value)
     values ('cap_shape', 'b', 'bell')
 insert into lookups (type, code, value)
     values ('cap_shape', 'c', 'conical')
 . . .

and so on, until you have a table that contains the values shown in Table 8-2.

Table 8-2. Mushrooms Lookup Table
Type                      Code  Value
bruises                   t     bruises
cap_color                 b     buff
cap_color                 c     cinnamon
cap_color                 e     red
cap_color                 g     gray
cap_color                 n     brown
cap_color                 p     pink
cap_color                 r     green
cap_color                 u     purple
cap_color                 w     white
cap_shape                 b     bell
cap_shape                 c     conical
cap_shape                 f     flat
cap_shape                 k     knobbed
cap_shape                 x     convex
cap_surface               f     fibrous
cap_surface               g     grooves
cap_surface               y     scaly
edibility                 e     edible
edibility                 p     poisonous
gill_attachment           a     attached
gill_attachment           d     descending
gill_attachment           f     free
gill_color                b     buff
gill_color                e     red
gill_color                g     gray
gill_color                h     chocolate
gill_color                k     black
gill_color                n     brown
gill_color                o     orange
gill_color                p     pink
gill_color                r     green
gill_color                u     purple
gill_color                w     white
gill_size                 b     broad
gill_spacing              c     close
gill_spacing              w     crowded
habitat                   g     grasses
habitat                   l     leaves
habitat                   m     meadows
habitat                   p     paths
habitat                   u     urban
habitat                   w     waste
odor                      a     almond
odor                      c     creosote
odor                      f     foul
odor                      l     anise
odor                      m     musty
odor                      n     none
odor                      p     pungent
odor                      y     fishy
population                a     abundant
population                c     clustered
population                n     numerous
population                s     scattered
population                v     several
ring_number               n     none
ring_number               o     one
ring_type                 c     cobwebby
ring_type                 e     evanescent
ring_type                 f     flaring
ring_type                 l     large
ring_type                 n     none
ring_type                 p     pendant
ring_type                 s     sheathing
spore_print_color         b     buff
spore_print_color         h     chocolate
spore_print_color         k     black
spore_print_color         n     brown
spore_print_color         o     orange
spore_print_color         r     green
spore_print_color         u     purple
spore_print_color         w     white
stalk_color_above_ring    b     buff
stalk_color_above_ring    c     cinnamon
stalk_color_above_ring    e     red
stalk_color_above_ring    g     gray
stalk_color_above_ring    n     brown
stalk_color_above_ring    o     orange
stalk_color_above_ring    p     pink
stalk_color_above_ring    w     white
stalk_color_below_ring    b     buff
stalk_color_below_ring    c     cinnamon
stalk_color_below_ring    e     red
stalk_color_below_ring    g     gray
stalk_color_below_ring    n     brown
stalk_color_below_ring    o     orange
stalk_color_below_ring    p     pink
stalk_color_below_ring    w     white
stalk_root                b     bulbous
stalk_root                c     club
stalk_root                e     equal
stalk_root                r     rooted
stalk_root                u     cup
stalk_root                z     rhizomorphs
stalk_shape               e     enlarging
stalk_surface_above_ring  f     fibrous
stalk_surface_above_ring  k     silky
stalk_surface_above_ring  y     scaly
stalk_surface_below_ring  f     fibrous
stalk_surface_below_ring  k     silky
stalk_surface_below_ring  y     scaly
veil_color                n     brown
veil_color                o     orange
veil_color                w     white
veil_type                 p     partial

Creating the Package

The first step is to create a new package. Right-click the Data Transformation Services node in the Enterprise Manager to get the menu shown in Figure 8-1.


Figure 8-1. Menu option for a new DTS package.

Choose New Package from this menu to open a blank "canvas" for us to work with, as shown in Figure 8-2.


Figure 8-2. Blank work area.

Notice the toolbars on the left of the canvas. The Connection toolbar contains all the different connection types including OLE DB, Excel, inbound text files, outbound text files, and others. The Task toolbar contains all the built-in tasks that were described in the first part of this chapter. To bring any of these elements to the canvas, click the connection or task you want to use.

The very first thing we need to create is the FTP task. Do this by clicking the File Transfer Protocol Task button. The File Transfer Protocol Task Properties dialog box appears so that you can enter all the settings you need. (See Figure 8-3.)


Figure 8-3. Configuring the FTP task.

To configure the FTP task, use the same settings as in Figure 8-3. Remember that the FTP site refers to the root directory of the FTP site, without the subdirectories. In the Files tab of the File Transfer Protocol Task Properties dialog box, we specify where the source files are located. Note that you must specify the local directory where the files will be transferred.


Warning

In a DTS package, all references to directories and drives are evaluated in terms of the computer that's launching the DTS package. A DTS package always runs locally to the machine that called it. For this reason, a directory setting that worked fine when you ran the package on your development computer may fail when run as a job on the server because that particular path does not exist there. This problem can be circumvented by standardizing directory trees across development and production servers, by using UNC path names, or by using dynamic properties to set these values at run time.


Go to the Files tab in the File Transfer Protocol Task Properties dialog box, and you will see a directory tree panel on the left in the Source section and chosen files on the right in the Destination section, as shown in Figure 8-4. Travel down the directory tree to arrive at the location where the mushroom file resides. The path to take is as follows:

 /pub/machine-learning-databases/mushroom

Figure 8-4. Locating the source files.

The list of source files will appear. Choose agaricus-lepiota.data, click on the single arrow pointing to the right to copy it to the right panel, and then click OK. To create the text file connection in the next step, we will actually need a sample of that file, so right-click on the FTP task and choose Execute Step to execute only that step, as shown in Figure 8-5. If there were any other steps in the package, this would be a way to ensure that only this one runs without launching the whole package.


Figure 8-5. Transferring the file.

DTS considers text files as connections, so on the upper-left sidebar, click the Text File (Source) button that has an arrow pointing to the right (indicating that it's an incoming file). The Connection Properties dialog box shown in Figure 8-6 will appear. Make sure to specify that it's a new connection.


Note

Giving these tasks and connections an explicit, meaningful name makes it easier to read the DTS package because these names appear on the canvas of the DTS Designer as well as on any error messages stored in text files.



Figure 8-6. Naming the text file and creating the connection.

Click OK. The Text File Properties dialog box appears, as shown in Figure 8-7.


Figure 8-7. Configuring text file properties.

In the Text File Properties dialog box, set the properties that tell DTS the format of the file. In this case, use the settings shown in Figure 8-7 and click the Next button to bring up the Specify Column Delimiter screen shown in Figure 8-8.


Figure 8-8. Creating a new DTS package.

This screen permits you to specify the characters used as delimiters and also gives you a quick preview of how your choice affects how SQL Server interprets the format of the file. If you can see the default column headers and a clear separation between the columns, then chances are good you picked the right delimiter. Click Finish, and then click OK to close the Connection Properties dialog box. Once you're done, the canvas should have a connection to a text file and an FTP task, as shown in Figure 8-9.


Figure 8-9. The canvas.

To work with SQL Server, we need to create our first database connection. To do this, click the Microsoft OLE DB Provider For SQL Server button in the left corner of the Connection toolbar. This brings up the Connection Properties dialog box. Specify the name, the type of OLE DB connection, the SQL Server name, and the user credentials, as shown in Figure 8-10. Click OK when you're finished.


Figure 8-10. The OLE DB Connection parameters.

Now that you have set up the connections, select the text file connection first, hold down the Ctrl key, and click the database connection. The order in which you select the connections determines the direction of the transformation we're going to create next. With the two connections highlighted, click the Transform Data Task button on the toolbar. A black transformation arrow should appear between the connection icons, as shown in Figure 8-11.

Next click the FTP task, hold down the Ctrl key, and select the text file connection. Choose On Success from the Workflow menu to create a precedence constraint on the transformation, as shown in Figure 8-12. Once this is done, the transformation commences only if and when the file is successfully transferred through the FTP connection.


Figure 8-11. Creating a new transformation.

Figure 8-12. Creating a new precedence constraint.

A success arrow appears between the two connection icons, as shown in Figure 8-13.


Figure 8-13. Success constraint.

But wait! We need to make sure that the table is empty before importing the data files. That's no problem, since the order in which the objects are created has no effect on the order of their execution. To empty the table, we create an Execute SQL task and issue the TRUNCATE TABLE command, as shown in Figure 8-14.


Figure 8-14. Creating a new Execute SQL task.
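The task itself needs only a single statement; a minimal sketch against the staging table we built earlier:

 TRUNCATE TABLE mushrooms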

As done previously, we create another success constraint between this new task and the FTP task. (See Figure 8-15.) Now we won't even get the file unless we can successfully empty this table of all residual cases. The Execute SQL task replaces the FTP task as the first step to run in the package.


Figure 8-15. New success precedence constraint.

Remember the lookup table we created to convert the single-character codes in the mushrooms text file to meaningful labels? To avoid having to share a connection with the task that fills the table with the cases, we'll click the Microsoft OLE DB Provider For SQL Server button and create a new connection for the lookup table, as shown in Figure 8-16.


Figure 8-16. Creating a new database connection.

The canvas now has two separate connections to the same database. (See Figure 8-17.)


Figure 8-17. Separate database connections.

Now we must configure the trickiest part of the package, the transformation itself. To open the Transformation properties, right-click on the black transformation arrow and choose Properties from the menu shown in Figure 8-18.


Figure 8-18. Opening the Transformation properties.

This opens the Transform Data Task Properties dialog box shown in Figure 8-19, which is designed to help configure the properties of the transformation. The Source tab on this dialog box is simply a reiteration of the characteristics of the two connections. If we had used a database connection instead of a text file as a source, we could have chosen to use the results of a query as our source.


Figure 8-19. Source tab.

Click the Destination tab. You may pick a specific destination table or create one on the fly. For this exercise, choose the mushrooms table we created earlier. Note that the structure of the table is listed as a reminder. (See Figure 8-20.)


Figure 8-20. Destination tab.

Click the Transformation tab, click the Select All button, and then click the New button. You may pick from a variety of transformation types, but since we need to intervene with a lookup table to translate codes, pick the ActiveX Script option in the Create New Transformation dialog box. (See Figure 8-21.)


Figure 8-21. Transformation types.

Click OK to bring up the Transformation Options dialog box, shown in Figure 8-22.


Figure 8-22. ActiveX Script Transformation Options dialog box.

Click the Properties button to bring up the ActiveX Script Transformation Properties dialog box shown in Figure 8-23. This dialog box lets you choose a scripting language, and the transformation script is already generated for you. If you choose a language other than VBScript, such as JScript or PerlScript, click the Auto Gen button to regenerate the script in that language. The problem (shown in Figure 8-23) is that our table has an ID field that is not present in the source text file, so the automatic column mapping is off-center: it tries to match the first field in the text file, the edibility field, with the ID field in the table, which causes all the other fields to be off their mark.


Figure 8-23. Scripting properties.

To fix the column mappings, and because we're going to use code to intervene in the transformation process, it's far more convenient to have one code snippet with all the transformations as opposed to many small code snippets, one per column. First, click Cancel in both the ActiveX Script Transformation Properties dialog box and the Transformation Options dialog box to return to the Transform Data Task Properties dialog box. Now click the Select All button and then the Delete button; these actions eliminate all the column mappings. Next select all the fields on the left side, which represent the text file columns, and then select all the fields on the right side except for the ID field, as shown in Figure 8-24. Finally, click the New button and create an ActiveX script transformation as you did before. This creates one mapping path that includes all the selected columns. Click OK in both the ActiveX Script Transformation Properties dialog box and the Transformation Options dialog box to return to the Transform Data Task Properties dialog box.


Figure 8-24. Column remapping.

Now click the Lookups tab to see a list of lookups, as shown in Figure 8-25. We're going to take advantage of the Lookups tab because we need to use the lookup table we created earlier to convert the code values in the text file into meaningful labels. At this point we could create a separate lookup for each field (about 23 of them) and give each a different name. However, by using parameters, we can get away with creating just one lookup and having the ActiveX script transformation pass the proper parameters. Here you must give the lookup a name (which will be referenced in code later) and the connection that will supply the lookup values; this is where the second SQL Server connection we created comes in.


Figure 8-25. Creating the lookup.

Click the Query button to invoke the Query Designer shown in Figure 8-26. The query is very simple: we want a value returned to us based on the name of the field that contains the value and on the code, the one-character value that the field contains. The type and the code are the unknown values, or parameters, that we're going to use to retrieve the correct value. To indicate to the lookup function that we want to be able to pass a parameter, put a ? in place of each unknown value. We can have as many of these parameters as we choose. Click OK to return to the Transform Data Task Properties dialog box.


Figure 8-26. Query Designer for the lookups.
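The resulting parameterized query looks something like the following sketch, built against the lookups table we created earlier (the two ? placeholders stand for the type and code parameters):

 SELECT value
 FROM lookups
 WHERE type = ? AND code = ?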

The Options tab of the Transform Data Task Properties dialog box, shown in Figure 8-27, allows us to configure some additional properties for the transformation.


Figure 8-27. The Options tab.

The top portion of the Options tab allows you to create a text file that captures any errors and to determine what types of error information will be stored in it. This file is indispensable for debugging after the fact. When converting data from outside sources, as in this example, you're always subject to errors caused either by file corruption from the supplier of the file or by changes made to the file structure without your knowledge. This debugging facility can alert you to those types of errors, not only by signaling them, but also by listing a certain number of the rows that caused the problem.

The middle portion of the Options tab lets you decide how many errors you will allow before interrupting the task. It also lets you decide how many rows you want to import and even what range of rows they belong to.

The bottom portion of the Options tab affects how the data will load. By choosing Fast Load, you are asking SQL Server to attempt to dispense with logging the inserts to the table.


Note

By executing a nonlogged operation such as this in a database, you render the logs useless for recovery purposes: as soon as SQL Server detects a nonlogged operation, it disallows any transaction log backups. To restore the recoverability of the database, a full database backup must be performed immediately after the bulk insert. This is why it's often best to keep staging tables, such as this one, in a separate staging database whose transaction logs do not need to be used for recovery.


Locking the table and disabling constraint checking speed up the Bulk Insert task. The batch size is relevant when you have many rows to insert at once. By leaving the default value of 0, you're saying that all the records will be inserted into SQL Server in one single transaction, which can hurt performance and tax resources. By setting it to a value of 10,000, for example, you allow SQL Server to commit every 10,000 rows that come in.

Now let's go back to Figure 8-23 and change the ActiveX Script so it will account for the lookups we created. Take a look at the code in Figure 8-23.

DTSDestination(<fieldname>) refers to the field in the SQL Server table, and DTSSource(<fieldname>) refers to the column in the text file. Ordinarily, the transformation would be relatively simple:

 DTSDestination("Field1") = DTSSource("Col1") 

Because we are using lookups, we need to make the same kind of change for every one of the fields:

 DTSDestination("Field1") = DTSLookups("LKP_MUSHROOMS").Execute _     ("Field1",DTSSource("Col1")) 

The syntax of the lookup function is simple:

 DTSLookups("LookupName").Execute(Parameter1, Parameter2,K) 

The parameters are read in order and replace the ? characters in the lookup query definition we created earlier. The function then returns a value that we use to update the destination column.

Click the Transformation tab, and click the Edit button to open the Transformation Options dialog box. Click the Properties button to open the ActiveX Script Transformation Properties dialog box, and then modify the function as shown in Figure 8-28. Close all the dialog boxes to return to the DTS Designer canvas.


Figure 8-28. The script modified to use the lookup function.
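As a sketch of what the modified function might look like, assuming the lookup is named LKP_MUSHROOMS and the text file columns carry the default names Col1, Col2, and so on:

 Function Main()
     ' Each destination field is filled by looking up the
     ' source code against the lookups table.
     DTSDestination("edibility") = _
         DTSLookups("LKP_MUSHROOMS").Execute _
         ("edibility", DTSSource("Col1"))
     DTSDestination("cap_shape") = _
         DTSLookups("LKP_MUSHROOMS").Execute _
         ("cap_shape", DTSSource("Col2"))
     ' ...repeat for the remaining columns...
     Main = DTSTransformStat_OK
 End Function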

To create the data-mining processing task, click the Analysis Services Processing Task button to open the Analysis Services Processing Task dialog box shown in Figure 8-29. First you'll be shown the Analysis Servers that are available. Choose a server, and drill down to the database that you want to process. Since we're interested in processing the data-mining model, we'll drill all the way down to the "Mushroom Analysis RDBMS" data-mining model that you created in Chapter 5.


Figure 8-29. Creating an Analysis Services processing task.

When you choose a data-mining model, the processing options correspond to those you would use for data mining as opposed to, say, OLAP. In this case, choose Full Process, which takes a little longer but rebuilds the structure of the data-mining model. Notice that you can, if you wish, write a specific query to populate the data-mining model; a default query is used if you do not specify one.

Now create a success precedence constraint so that the data-mining model gets processed only after the staging table containing the cases is successfully populated by the transformation task. (See Figure 8-30.)


Figure 8-30. Adding the precedence constraint to include the mining model.

Now choose Save As from the Package menu to bring up the Save DTS Package dialog box shown in Figure 8-31. As mentioned earlier, there are various locations in which to store the package. The most flexible is the .dts file, which can be transferred, e-mailed, or carried on a disk. But for the purposes of this example, save it to SQL Server, which then stores it in the msdb database.


Figure 8-31. Saving the package.

The simplest way to execute the package is to click the Execute button located on the toolbar, or choose Execute from the Package menu. A dialog box appears and provides immediate feedback of the progress of each step in the package as it executes. (See Figure 8-32.)


Figure 8-32. Executing the package.

If you go back to the Data Transformation node in Enterprise Manager and select the SQL Server node, you'll notice the package is stored there for future use. (See Figure 8-33.)


Figure 8-33. List of DTS packages.

If you would like to schedule this package to run at preset times, say at 11 P.M. every night, all you need to do is right-click the package you wish to schedule and choose Schedule Package from the menu. (See Figure 8-34.)


Figure 8-34. Package options.

This brings up the Edit Recurring Job Schedule dialog box (shown in Figure 8-35), which allows you to set the dates and times that the package should run unattended. Once the schedule is set, the job is added to the list of jobs in the SQL Server Agent.


Figure 8-35. Edit Recurring Job Schedule dialog box.

Summary

A data-mining strategy must include the data preparation steps, especially because the predictive and analytical qualities of the data-mining model are highly dependent on the successful conversion of raw data into structured cases. This means that the cases not only must prove to be structurally coherent, but also must pass any and all integrity tests necessary before they can be declared fit to represent the data to be analyzed. DTS is a very powerful tool designed for exactly that purpose. It also offers more than enough flexibility to create batch-oriented programs that take the data all the way from its raw form to the final data-mining model, relational database, or OLAP cube. The structure of a DTS package facilitates its transportation, execution, and scheduling, which allows it to be integrated with a wide variety of SQL Server tools and programming interfaces.


