As we saw in ‚ Chapters 1 ‚ and ‚ 2 1 ‚ , one of the most important prerequisites for a successful data-mining effort is the preparation of the case data, which you need to do before you can begin mining any data. In many, if not most, cases the source of the training for your data-mining models is going to come from disparate sources and nonrelational data formats, such as flat files or Microsoft Excel spreadsheets. As part of any enterprise-wide data-mining effort, organizations need to centralize and normalize their data into a data warehouse. In ‚ Chapters 1 ‚ and ‚ 2 ‚ , I discussed the need for these steps, the preferred methodology, and the pitfalls to avoid. So now that we know what is needed to create data-mining models, it makes sense to incorporate more Microsoft SQL Server tools to help us in the entire data-mining effort. Microsoft Data Transformation Services (DTS), although not specifically an Analysis Services tool, is a very important tool that gathers data from many different sources so that it can be cleansed and warehoused before it is used in OLAP and data mining.
Sometimes the data increases over time, as is the case with supermarket sales data, weather data, and stock market price data. When this is the case, the data-mining models are continuously updated with new data that enriches the model and allows for more timely predictions. In addition to all the transformational capabilities of DTS, it can also create a unit (which acts much like an application) that can be automated with SQL Server Agent.
In this chapter, I'll explain how DTS works and how to program it, and I'll show you, step-by-step, how to automatically download the mushroom data used in 2 ‚ Chapter 5 ‚ from the mushrooms database's FTP site, run the DTS transformation steps needed to format and store the data, and train the data-mining model. We'll create a unit to do all this with the click of a button or through an automated task.
What Is DTS?
DTS is many things, including a
Unlike other programming environments, DTS is strictly batch-oriented. This means that it will run steps in a sequential order without waiting for user input or any other user"events. "It's expected to provide an easy way to define those steps and establish the flow of execution to allow you to go from beginning to end in a perfectly predictable manner (barring errors and other exceptions).
The basic DTS unit is a package, which is the named collection of a given collection of tasks. Just as Microsoft Visual Basic programs are contained within a project and just as SQL Server tables are contained within a database, DTS objects are contained in a package. A package can contain any one of the following four types of objects:
As a unit, packages can be edited, stored in databases or files, password protected, scheduled for execution through the SQL Server Agent, and retrieved by version. Each package contains one or more steps that are executed sequentially or in parallel when the package is run. When executed, the package connects to the correct data sources, copies data and database objects, transforms data, notifies users of the processes of events, and optionally creates written logs of the events that transpired.
DTS Tasks
A DTS task is an object that has a specific function that is executed as a step. The type of task depends on the functionality required. By default, DTS performs a whole slew of tasks as described in the following sections.
Transform
The Transform Data task is used to copy data between any OLE DB, compliant source and the destination data. When the source and destination objects are defined, the individual columns are exposed and given a default mapping, which is usually ordered sequentially so that the first column of the source maps to the first column of the destination and so on. This lets you customize the transformation task by defining different column mappings or by eliminating certain columns from the transformation procedure altogether. Many times the simple column mappings are not enough; for example, many data sources will store the date in one field and the time in another, but SQL Server 2000 contains a single date field that incorporates both the date and the time. When transforming the data from the source, you have the option of specifying, through the use of scripting, that the destination's date/time field is the concatenation of the source's date and time field. We'll take a closer look at scripting in DTS later in this chapter. Other times, the source data contains only obscure code that can be converted to meaningful descriptions by looking up the values in a third data source. The transformation task has facilities to handle this as well. The flexibility of this task makes it a cornerstone of almost every DTS package.
The Transform Data task is often referred to as the DataPump task, especially in the object properties of a package.
Bulk Insert 5
The Bulk Insert task is the T-SQL BULK INSERT command encapsulated in an object. For it to work, the following conditions must be met:
Bulk Insert is the fastest way to import text data into a SQL Server 2000 table as long as the following restrictions are met: 6
Bulk Insert is typically used when the number of rows to be imported is in the millions, when time is of the essence, and when logging is not necessary. When these conditions exist, the records are usually inserted into a staging table, preferably in a special staging database, and then processed from this staging table.
Data Driven Query
A Data Driven Query task allows you to loop through the source data as you would with an ADO record set or a SQL Server cursor. This is especially useful when the contents of each individual record contain values that are used to determine what action to take on one or multiple tables. In other words, as you navigate through each row, you could issue almost any SQL statement including INSERT, UPDATE, DELETE, and stored procedure calls. The creation of a Data Driven Query task is a bit more involved than the other packages because of all the options available to you, but it does offer the ultimate in flexibility even if it's at a high cost to the performance of the task. Just remember that these aren't batches, but record-by-record transactions, so if you can find a way to do this with one of the tasks mentioned earlier, you're better off.
Execute Package 7
In a data-mining environment, the executable processes can be divided and run separately. Programmers are familiar with the notion of breaking applications into parts and having each part perform specific tasks for purposes of readability and code re-use. The Execute Package task allows you to only train the model, or to only use the model several times to make predictions, or to do both processes consecutively. It allows you to create packages that perform specific tasks. These packages can be called from other packages, which allows you to break a task into smaller pieces. This flexibility makes it easy to choose to do one or the other, or both, without creating the same package three times. Just as in a programming language, the Execute Package task has mechanisms that allow parameters to pass from one package to another, just as one would do with a stored procedure or a function.
I'll cover the next five tasks briefly because they're not especially relevant to data mining. If you would like more information about some of these tasks, consult SQL Server Books Online, or pick up a copy of Professional SQL Server 2000 DTS (Data Transformation Services) by Mark Chaffin, Brian Knight and Todd Robinson (Wrox Press), a book completely dedicated to DTS.
Transfer Error Messages
With sp_addmessage, a SQL Server system stored procedure, a user can add user-defined error messages that are application specific. If your application is going to run on other servers, you would use this task to transfer those messages.
Transfer Master Stored Procedures
The Transfer Master Stored Procedures task copies the stored procedures from a master database on one SQL Server 2000 server to another.
Transfer Databases 8
The Transfer Databases task copies the contents of a database in a SQL Server 7 server to a SQL Server 2000 server.
Transfer Jobs
The Transfer Jobs task copies the contents of the jobs stored in the msdb database in a SQL Server 2000 server to another SQL Server 2000 server.
Transfer Logins
The Transfer Logins task copies the logins defined in the master database from one SQL Server 2000 server to another SQL Server 2000 server.
Copy SQL Server Objects
The Copy SQL Server Objects task transfers objects from one SQL server to another. Any number and combination of objects can be transferred, but remember that this is not a disaster recovery mechanism for SQL Server. It's useful for creating occasional test environments or moving specific objects such as lookup tables from one server to another. 9
Dynamic Properties
The Dynamic Properties task works by retrieving values from sources outside a DTS package at package run time and using them to assign values to objects that require run-time variables specific to a given server or other dynamic element. In this way, a package can act like a reusable application and retrieve values from an outside source, leaving the rest of the package as is. This is particularly useful for cases in which
The source of the dynamic data used by the package can come from any one of these sources:
Message Queue
Message Queue is a task the interacts with Microsoft Message Queue (MSMQ), a transaction monitoring and COM+ packaging system. 11
ActiveX Scripting
ActiveX Scripting is a task that executes a program written entirely in a scripting language, such as Microsoft VBScript or JScript. The scripting task is a way not only to manipulate the variable and the other objects in the same package, but also to accomplish any number of tasks including instantiating Microsoft ActiveX objects and servers such as Microsoft Word and Excel. The task designer provides a rich environment to create the scripts by providing syntax checkers, keyword and function lists, and an object browser.
By default, all the ActiveX scripting tasks provide VBScript and JScript as language options. You aren't limited to those languages if you install other Windows Scripting Host (WSH) , compliant languages, such as PerlScript or even ActivePython. My personal favorite is Practical Extraction and Reporting (Perl) language, which can be retrieved free from ‚ http://www.activestate.com/ ‚ By installing Perl on the server with SQL Server 2000, the Perl keywords and syntax checker automatically become available in the development environments involving the ActiveX scripting languages.
Execute SQL
Execute SQL is a task that is designed, as the name strongly suggests, to execute any T-SQL statement that you could run from SQL Query Analyzer. This task could be used for any number of processes, including creating tables, updating information, and deleting records. 12
Execute Process
The Execute Process task can run an executable program, such as a batch file or a .exe file. It runs in its own process, so the rest of the SQL Server environment is reasonably safe from the program's effects on memory and other resources. Typically, this might run a batch or a program that opens connections to other network resources or retrieves files, or it might run a program that generates a text file to be imported by SQL Server in a subsequent step.
Send Mail
Send Mail is a very handy task that programmatically sends mail to a recipient. This can be used to notify managers that a report is available or, in a more likely scenario, to notify a system administrator that a task failed.
FTP
The FTP task automates the retrieval of a file or a collection of files to a local directory from an FTP site. It handles the logging-in process as well as the actual file transfers. In addition, you can perform data transfers from any UNC path without it being an FTP site.
Analysis Services Processing Task 13
The Analysis Services Processing task processes Analysis Services objects. Originally designed for processing cubes only, it still uses the Cube icon even though it's capable of processing cubes, dimensions, and of course, data-mining models.
Data-Mining Prediction Query Task
When you install Analysis Services, the Data-Mining Prediction Query task package becomes available. This package is designed to help create and run prediction queries based on a data-mining model and output the results to another source such as a SQL Server table or any other OLE DB, compliant data store such as Excel.
Custom and Third-Party Tasks
If the tasks described so far don't meet the requirements of your batch process, you can take advantage of the extensibility offered by DTS that lets you create custom tasks using Microsoft Visual C++, Visual Basic, Delphi, or any language that can create COM-compliant applications. You can then integrate these custom tasks into the DTS Designer user interface and save them as part of the DTS object model. The tasks then become available for use within the DTS Designer whenever a new package is created.
Connections
Connections are the basis for all activity in the DTS package. To successfully execute DTS tasks that copy and transform data, a DTS package must establish valid connections to its source and destination data and to any additional data sources, such as lookup tables. 14
OLE DB is the main provider type of DTS connections and allows a wide variety of data sources to be used. These can range from traditional relational database sources such as SQL Server or Microsoft Access to more loosely structured data, such as from Excel spreadsheets and text files.
Sources
DTS contains built-in features that let you graphically add and configure a data source that contains either the raw data that will be manipulated or a final data structure that will store the output of a DTS task. You can use a data source, file, or data link as a source.
A Data Source Connection
Data Source Connections give you access to:
A File Connection
DTS provides additional support for text files. When specifying a text file connection, specify the format of the file. For example, you would specify
A Data Link Connection
The connection string used by the data source is stored in a separate text file and accessed at connection time.
Configuring a Connection
When creating a DTS package, you configure connections by selecting a connection type from a list of available OLE DB providers. The properties you configure for each connection vary depending on the individual provider for the data source. A DTS package can have multiple connections defined for the same data source, or the same connection can be reused for various tasks. A few factors to take into account before creating connections in a DTS package include the order of the execution of the tasks, whether your packages will be moved to other servers, and the security of your accounts.
Single Thread per Connection
If the package is designed so that the tasks execute in a perfectly linear order without any parallel tasks, then a single connection is fine. Packages can be designed to have multiple tasks execute simultaneously. In this situation, a single connection becomes a bottleneck and performance can be greatly enhanced by using a separate connection for each task.
Dynamic Connection Properties 18
If your package can be moved or copied to different servers, you may need to edit the direct connections made in a package. To make it easy to modify the connection parameters for a connection or connections, use a data link file, which saves the connection string in a separate text file. Alternatively, consider using the Dynamic Properties task to change the connection information at run time.
Security Account Information
When creating a connection for a SQL Server data source, you have the option of specifying a given username and password or of using integrated security. Although using integrated security simplifies package creation because no specific credentials need to be supplied, keep in mind that the account used to access the server will be the currently logged on user's account or the service account of the SQL Server Agent if you have scheduled the package to run as a job. Using integrated security can cause numerous bugs because access is denied to SQL Server objects that are needed by certain tasks. If possible, consider creating a specific account with appropriate rights to the tasks in the package and use that account when creating connections. This ensures a consistent user environment regardless of the means by which the package was launched.
DTS Package Workflow
DTS steps and precedence constraints determine the order that the tasks in the package are executed and under what logical conditions they are allowed to run. The simplest, most straightforward way to accomplish this is to use the DTS Designer because all the graphical workflow elements can be dragged, dropped, and interconnected using the mouse. That said, Visual Basic, Visual C++, or even Perl code provides the same functionality without DTS Designer.
DTS Package Steps
Steps are used to control the order in which tasks are executed in a DTS package. DTS package steps represent the execution units in the DTS object model, and they define which tasks execute in what sequence when the package is run and which ones run parallel to each other. 19
There are no step objects to manipulate per se; instead, they are implicitly created whenever precedence constraints are created between tasks. To give you some perspective on that, consider that without any precedence constraints, all the tasks in a package would execute simultaneously, unless they all rely on the same connection (but then the order of execution would be different every time).
When creating a package using code, you can control the relationship between a step and a task more precisely. You can create multiple steps for different package operations and associate the execution of those steps with a single task. For example, suppose you write a package in Visual Basic and specify in several parts of the package that errors can be generated. By linking the steps associated with those errors, you can make the different types of errors execute the same Send Mail task. That Send Mail task can send an e-mail notifying the system administrator that the package failed.
DTS Designer allows you to execute an individual package step. This action is useful for testing and troubleshooting individual steps without having to run the entire package. To execute a single package step in DTS Designer, right-click the task you want to execute and choose Execute Step from the menu.
Precedence Constraints
Precedence constraints sequentially link tasks in a package. In DTS, you can use three types of precedence constraints, which can be accessed either through DTS Designer or programmatically.
Unconditional
If you want Task 2 to wait until Task 1 completes, regardless of the outcome, link Task 1 to Task 2 with an Unconditional precedence constraint. 20
On Success
If you want Task 2 to wait until Task 1 has successfully completed, link Task 1 to Task 2 with an On Success precedence constraint.
On Failure
If you want Task 2 to begin execution only if Task 1 fails to execute successfully, link Task 1 to Task 2 with an On Failure precedence constraint. If you want to run an alternative branch of the workflow when an error is encountered, use this constraint.
Using Multiple Precedence Constraints
You can issue multiple precedence constraints on a task. For example, you can have a given task execute only when two other previous tasks succeed, or only if one of the tasks succeeds and the other fails.
DTS Designer 21
The DTS Designer is a combination of integrated development environment, workflow engine, and application platform.
It has the graphical user interface similar to any classical workflow designer that calls for objects of different types to be arranged on a virtual whiteboard and have sequences and dependencies attached to them. The DTS Designer graphical user interface allows you to build and configure packages by using drag-and-drop methods and by completing property sheets on the various DTS objects included in the package. Unlike a normal workflow designer, the objects themselves contain functionality that can be defined with parameters or complete programs. Once these objects have been defined and the rules set for their execution order, the package can be launched and the graphical elements come alive to provide visual cues that mark the progress of the tasks.
You can use DTS Designer to do the following:
Opening the DTS Designer 22
You can access DTS Designer through SQL Server Enterprise Manager, through the Data Transformation Services node of the console tree. Under the tree, you'll find three storage repositories for the package. I'll discuss these later in this section. If you right-click on this node, you can choose to open a package that happens to be stored in a .dts file.
DTS Designer Work Area
The user interface for the virtual whiteboard includes the DTS Designer main panel, which consists of the following parts:
Saving a DTS Package
When you save a DTS package, you save all DTS connections, DTS tasks, DTS transformations, and workflow steps and preserve the graphical layout of these objects on the DTS Designer design sheet. The options described in the following sections are available for saving packages.
Package Name
Specify a unique name for the package with package name. The msdb tables use this name as a primary key. 24
Owner Password
Specify a password for the package to protect sensitive user name and server password information in the package from unauthorized users with owner password. If the package has an owner password, the data is encrypted with the standard encryption API. This option is available only for packages saved to SQL Server or as a structured storage file.
User Password
Set a password for a package user with user password. This password allows a user to execute a package. However, this option does not allow a user to view the package definition. If you set the user password, you must also set the owner password. This option is available only for packages saved to SQL Server or as a structured storage file.
Location
Specify the format and location of the saved package with the location option. You can save to a SQL server, which then stores it in the msdb database, and then to either Meta Data Services, a structured storage file, or a Visual Basic file. When you save the package to a SQL server or to Meta Data Services, you have the following options:
When you save the package to Meta Data Services, you can also scan by displaying the Scanning Options dialog box and specifying how objects referenced by the package should be scanned into Meta Data Services. This capability allows you to relate source and destination objects in a package to database meta data (for example, primary and foreign keys in a table, in an index, and in column information, such as data type) stored in Meta Data Services.
When you save the package as a COM-structured storage file or a Visual Basic file, you have the following options:
You can save a package to:
With this option, the whole package saves as Visual Basic code, and you can later open the Visual Basic file and modify the package definition in your development environment.
When you save a package to the SQL Server repository or to a structured storage file, you can secure the package with one or more passwords. When you save a package to Meta Data Services or as a Visual Basic file, the DTS package security options are not available. However, you can keep packages saved to Visual Basic files secure through a source code control system such as Microsoft Visual SourceSafe, and create a compiled version which hides the details of the code.
dtsrun Utility
The dtsrun utility executes a package created using DTS. The DTS package can be stored in the Microsoft SQL Server msdb database, a COM-structured storage file, or SQL Server Meta Data Services. The syntax is as follows: 29
dtsrun [/?] Package retrieval: /~S /~U /~P /~E /~N /~M /~G /~V /~F /~R
Package operation (overrides stored Package settings): /~A Global Variable Name:typeid=Value /~L Log file name /~W Write Completion Status to Windows Event Log <True or False>
DTSRun action (default is to execute Package):
The following do not execute the package; instead they perform an administrative operation on it.
/!X /!D /!Y /!C
The following is a list of arguments:
Table 8-1. Global Variable Types and their ID Values |
Data type | Type ID |
---|---|
Integer (small) | 2 |
Integer | 3 |
Real (4-byte) | 4 |
Real (8-byte) 36 | 5 |
Currency | 6 |
Date | 7 37 |
String | 8 |
Boolean | 11 |
Decimal | 14 |
Integer (1-byte) | 16 |
Unsigned int (1-byte) 39 | 17 |
Unsigned int (2-byte) | 18 |
Unsigned int (4-byte) | 19 40 |
Integer (8-byte) | 20 |
Unsigned int (8-byte) | 21 |
Int | 22 |
Unsigned int | 23 |
HRESULT 42 | 25 |
Pointer | 26 |
LPSTR | 30 43 |
LPWSTR | 31 |
To set global variables with this command switch, you must have either Owner permission for the package or the package must have been saved without DTS password protection enabled.
The next four commands perform administrative functions without executing the package.
To execute a DTS package saved as a COM-structured storage file, use
dtsrun /Ffilename /Npackage_name /Mpackage_password
To execute a DTS package saved in the SQL Server msdb database, use
dtsrun /Sserver_name /UusernName /Ppassword /Npackage_name /Mpackage_password
To execute a DTS package saved in Meta Data Services, use
dtsrun /Sserver_name /Uusernrame /Ppassword /Npackage_name /Mpackage_password /Rrepository_name 48
Using DTS to Create a Data-Mining Model
Having covered the essentials of DTS, I'll now walk you through the creation of a DTS package. Upon request, DTS will build our Mushrooms data-mining model. This case is especially interesting because the source file resides in a remote FTP site that you can access from your Internet connection. This file is a comma-delimited text file that contains well formatted rows but with cryptic codes in the columns that need to be translated to create a legible data-mining model. The file needs to be placed in a SQL Server 2000 table, and then the data-mining model needs to be created. This is the first part of our DTS package. Later, in ‚ Chapter 12 ‚ , we'll expand the package to include tasks for automatically issuing predictions against test data presented to it for analysis.
The steps to create the DTS package are as follows:
Preparing the SQL Server Environment
Before we can do steps 1 through 4, we need to prepare the SQL Server database for the data-mining operation.
Let's start with the mushrooms table. Run the code shown below in the Query Analyzer of your SQL Server database. 50
if not exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[mushrooms] ') and OBJECTPROPERTY(id, N'[sUserTable') = 1) BEGIN CREATE TABLE [mushrooms] ([ID] [varchar] (50) NULL , [edibility] [varchar] (50) NULL , [cap_shape] [varchar] (50) NULL , [cap_surface] [varchar] (50) NULL , [cap_color] [varchar] (50) NULL , [bruises] [varchar] (50) NULL , [odor] [varchar] (50) NULL , [gill_attachment] [varchar] (50) NULL , [gill_spacing] [varchar] (50) NULL , [gill_size] [varchar] (50) NULL , [gill_color] [varchar] (50) NULL , [stalk_shape] [varchar] (50) NULL , [stalk_root] [varchar] (50) NULL , [stalk_surface_above_ring] [varchar] (50) NULL , [stalk_surface_below_ring] [varchar] (50) NULL , [stalk_color_above_ring] [varchar] (50) NULL , [stalk_color_below_ring] [varchar] (50) NULL , [veil_type] [varchar] (50) NULL , [veil_color] [varchar] (50) NULL , [ring_number] [varchar] (50) NULL , [ring_type] [varchar] (50) NULL , [spore_print_color] [varchar] (50) NULL , [population] [varchar] (50) NULL , [habitat] [varchar] (50) NULL) END
This table should be empty until we actually import the test file. We do need to create the lookup table and populate it with proper values as shown here.
create table lookups (lookup_id uniqueidentifier not null default newid() primary key clustered, type varchar(50), code char(1), value varchar(50)) go create index ix_type_code on lookups(type,code) with fillfactor= 100 go insert into lookups (type,code,value) values ('cap_shape', 'b', 'bell') insert into lookups (type,code,value) values ('cap_shape', 'c', 'conical') . . .
and so on, until you have a table that contains the values shown in Table 8-2.
Table 8-2. Mushrooms Lookup Table |
Type | Code 51 | Value |
---|---|---|
bruises | t | bruises |
Cap_color | b 52 | buff |
Cap_color | c | cinnamon |
Cap_color 53 | e | red |
cap_color | g | gray |
cap_color | n | brown |
cap_color | p | pink 55 |
cap_color | r | green |
cap_color | u 56 | purple |
cap_color | w | white |
cap_shape 57 | b | bell |
cap_shape | c | conical |
cap_shape | f | flat |
cap_shape | k | knobbed 59 |
cap_shape | x | convex |
cap_surface | f 60 | fibrous |
cap_surface | g | grooves |
cap_surface 61 | y | scaly |
edibility | e | edible |
edibility | p | poisonous |
gill_attachment | a | attached 63 |
gill_attachment | d | descending |
gill_attachment | f 64 | free |
gill_color | b | buff |
gill_color 65 | e | red |
gill_color | g | gray |
gill_color | h | chocolate |
gill_color | k | black 67 |
gill_color | n | brown |
gill_color | o 68 | orange |
gill_color | p | pink |
gill_color 69 | r | green |
gill_color | u | purple |
gill_color | w | white |
gill_size | b | broad 71 |
gill_spacing | c | close |
gill_spacing | w 72 | crowded |
habitat | g | grasses |
habitat 73 | l | leaves |
habitat | m | meadows |
habitat | p | paths |
habitat | u | urban 75 |
habitat | w | waste |
odor | a 76 | almond |
odor | c | creosote |
odor 77 | f | foul |
odor | l | anise |
odor | m | musty |
odor | n | none 79 |
odor | p | pungent |
odor | y 80 | fishy |
population | a | abundant |
population 81 | c | clustered |
population | n | numerous |
population | s | scattered |
population | v | several 83 |
ring_number | n | none |
ring_number | o 84 | one |
ring_type | c | cobwebby |
ring_type 85 | e | evanescent |
ring_type | f | flaring |
ring_type | l | large |
ring_type | n | none 87 |
ring_type | p | pendant |
ring_type | s 88 | sheathing |
spore_print_color | b | buff |
spore_print_color 89 | h | chocolate |
spore_print_color | k | black |
spore_print_color | n | brown |
spore_print_color | o | orange 91 |
spore_print_color | r | green |
spore_print_color | u 92 | purple |
spore_print_color | w | white |
stalk_color_above_ring 93 | b | buff |
stalk_color_above_ring | c | cinnamon |
stalk_color_above_ring | e | red |
stalk_color_above_ring | g | gray 95 |
stalk_color_above_ring | n | brown |
stalk_color_above_ring | o 96 | orange |
stalk_color_above_ring | p | pink |
stalk_color_above_ring 97 | w | white |
stalk_color_below_ring | b | buff |
stalk_color_below_ring | c | cinnamon |
stalk_color_below_ring | e | red 99 |
stalk_color_below_ring | g | gray |
stalk_color_below_ring | n 100 | brown |
stalk_color_below_ring | o | orange |
stalk_color_below_ring 101 | p | pink |
stalk_color_below_ring | w | white |
stalk_root | b | bulbous |
stalk_root | c | club 103 |
stalk_root | e | equal |
stalk_root | r 104 | rooted |
stalk_root | u | cup |
stalk_root 105 | z | rhizomorphs |
stalk_shape | e | enlarging |
stalk_surface_above_ring | f | fibrous |
stalk_surface_above_ring | k | silky 107 |
stalk_surface_above_ring | y | scaly |
stalk_surface_below_ring | f 108 | fibrous |
stalk_surface_below_ring | k | silky |
stalk_surface_below_ring 109 | y | scaly |
veil_color | n | brown |
veil_color | o | orange |
veil_color | w | white 111 |
veil_type | p | partial |
Creating the Package
The first step is to create a new package. Right-click on the Data Transformation Service node in the Enterprise Manager to get the menu shown in Figure 8-1.
Figure 8-1. Menu option for a new DTS package. |
Choose New Package from this menu to open a blank "canvas" for us to work with, as shown in Figure 8-2.
Figure 8-2. Blank work area. |
Notice the toolbars on the left of the canvas. The Connection toolbar contains all the different connection types including OLE DB, Excel, inbound text files, outbound text files, and others. The Task toolbar contains all the built-in tasks that were described in the first part of this chapter. To bring any of these elements to the canvas, click the connection or task you want to use.
The very first thing we need to create is the FTP task. Do this by clicking the File Transfer Protocol Task button. The File Transfer Protocol Task Properties dialog box appears to permit you to add all the settings you need. (See Figure 8-3.)
Figure 8-3. Configuring the FTP task. |
To configure the FTP task, use the same settings as in Figure 8-3. Remember that the FTP site refers to the root directory of the FTP site without the sub directories. In the Files tab of the File Transfer Protocol Task Properties dialog box, we specify where the source files are specifically located. Note that you must specify the local directory where the files will be transferred.
In a DTS package, all references to directories and drives are evaluated in terms of the computer that's launching the DTS package. A DTS package always runs locally to the machine that called it. For this reason, a directory setting that worked fine when you ran the package on your development computer may fail when run as a job on the server because that particular path does not exist there. This problem can be circumvented either by standardizing directory trees in the development vs. production servers, using UNC path names, or using dynamic properties to set these values at run time.
Go to the Files tab in the File Transfer Protocol Task Properties dialog box, and you will see a directory tree panel on the left in the Source section and chosen files on the right in the Destination section, as shown in Figure 8-4. Travel down the directory tree to arrive at the location where the mushroom file resides. The path to take is as follows:
K/pub/machine-learning-databases/mushroom
Figure 8-4. Locating the source files. 114 |
The list of source files will appear. Choose agaricus-lepiota.data, click on the single arrow pointing to the right to copy it to the right panel, and then click OK. To create the text file connection in the next step, we will actually need a sample of that file, so right-click on the FTP task and choose Execute Step to execute only that step, as shown in Figure 8-5. If there were any other steps in the package, this would be a way to ensure that only this one runs without launching the whole package.
Figure 8-5. Transferring the file. |
DTS considers text files as connections, so on the upper-left sidebar, click the Text File (Source) button that has an arrow pointing to the right (indicating that it's an incoming file). The Connection Properties dialog box shown in Figure 8-6 will appear. Make sure to specify it's a new connection.
Giving these tasks and connections an explicit, meaningful name makes it easier to read the DTS package because these names appear on the canvas of the DTS Designer as well as on any error messages stored in text files.
Figure 8-6. Naming the text file and creating the connection. 115 |
Click OK. The Text File Properties dialog box appears, as shown in Figure 8-7.
Figure 8-7. Configuring text file properties. |
In the Text File Properties dialog box, set the properties that determine for DTS the format of the file. In this case, use the settings shown in Figure 8-7 and click the Next button to bring up the Specify Column Delimiter screen shown in Figure 8-8.
Figure 8-8. Creating a new DTS package. |
This screen permits you to specify the characters used as delimiters and also gives you a quick sample of how your choice affects how SQL Server interprets the format of the file. If you can see the default column headers and a clear separation between the columns, then chances are good you picked the right delimiter. Click Finish, and then click OK to close the Connection Properties dialog box. Once you're done, the canvas should have a connection to a text file and an FTP task, as shown in Figure 8-9. 116
Figure 8-9. The canvas. |
To work with SQL Server, we need to create our first database connection. To do this, click the Microsoft OLE DB Provider For SQL Server button in the left corner of the Connection toolbar. This brings up the Connection Properties dialog box. Specify the name, the type of OLE DB connection, the SQL Server name, and the user credentials, as shown in Figure 8-10. Click OK when you're finished.
Figure 8-10. The OLE DB Connection parameters. |
Now that you have set up the connections, select the text file connection first, hold down the Ctrl key, and click on the database connection. The order in which you select the connections determines the direction of the transformation we?re going to create next. Now that the two connections are highlighted, click the Transform Data Task button on the toolbar. A black transformation arrow should appear between the connection icons, as shown in Figure 8-11.
Next click on the FTP task, hold down the Ctrl key, and select the text file connection. Choose On Success from the Workflow menu to create a precedence constraint on the transformation, as shown in Figure 8-12. Once done, the transformation commences only if and when the file is successfully transferred through the FTP connection. 117
Figure 8-11. Creating a new transformation. |
Figure 8-12. Creating a new precedence constraint. |
A success arrow appears between the two connection icons, as shown in Figure 8-13.
Figure 8-13. Success constraint. 118 |
But wait! We need to make sure that the table is empty before importing the data files. That's no problem since the order in which the objects are created has no effect on the order of their execution. To check the table, we create an Execute SQL task and issue the TRUNCATE TABLE command as has been done in Figure 8-14.
Figure 8-14. Creating a new Execute SQL task. |
As done previously, we create another success constraint between this new task and the FTP task. (See Figure 8-15.) Now we won't even get the file unless we can successfully empty this table of all residual cases. The Execute SQL task replaces the FTP task as the first step to run in the package.
Figure 8-15. New success precedence constraint. |
Remember the lookup table we created to convert the single character codes in the mushrooms text file to meaningful labels? To avoid the necessity of sharing a connection with the task that fills the table with the cases, we'll click the Microsoft OLE DB Provider For SQL Server button and create a new connection for the lookup table, as shown in Figure 8-16. 119
Figure 8-16. Creating a new database connection. |
The canvas now has two separate connections to the same database. (See Figure 8-17.)
Figure 8-17. Separate database connections. |
Now we must configure the trickiest part of the package, the transformation itself. To open the Transformation properties, right-click on the black transformation arrow and choose Properties from the menu shown in Figure 8-18.
Figure 8-18. Opening the Transformation properties. |
This opens the Transform Data Task Properties dialog box shown in Figure 8-19, which is designed to help configure the properties of the transformation. The Source tab on this dialog box is simply a reiteration of the characteristics of the two connections. If we had used a database connection instead of a text file as a source, we could have chosen to use the results of a query as our source.
Figure 8-19. Source tab. |
Click the Destination tab. You may pick the specific destination table or create one on the fly. For this exercise, choose the mushrooms database we created earlier. Note that the structure of the table is listed as a reminder. (See Figure 8-20.)
Figure 8-20. Destination tab. 121 |
Click the Transformation tab, click the Select All button, and then click the New button. You may pick a variety of transformation types, but since we need to intervene with a lookup table to translate codes, pick the ActiveX Script option, as shown in the Create New Transformation dialog box. (See Figure 8-21.)
Figure 8-21. Transformation types. |
Click OK to bring up the Transformation Options dialog box, shown in Figure 8-22.
Figure 8-22. ActiveX Script Transformation Options dialog box. |
Click the Properties button to bring up the ActiveX Script Transformation Properties dialog box shown in Figure 8-23. The ActiveX Script Transformation Properties dialog box lets you choose a scripting language. The transformation script is already generated for you. If you choose a language other than VBScript, such as JScript or PerlScript, click the Auto Gen button to regenerate the script in that language. The problem (shown in Figure 8-23) is that our table has an ID field that is not present in the source text file, so the automatic column mapping is off-center because it tried to match the first field in the text file, the Edibility field, with the ID field in the table. This causes the other fields to be off their mark. 122
Figure 8-23. Scripting properties. |
To make sure the column mappings are correct and because we're going to be using code to intervene in the transformation process, it's far more convenient to have one code snippet with all the transformations as opposed to many small code snippets for each column. The first thing you need to do is click Cancel in both the ActiveX Script Transformation Properties dialog box and the Transformation Options dialog box to return to the Transform Data Task Properties dialog box. Now click the Select All button and then the Delete button. These actions eliminate all the column mappings. Now select all the fields in the left side that represent the text file columns, and then select all the fields in the right side, except for the ID field, as shown in Figure 8-24. Finally, click on the New button and create an ActiveX script transformation as you did before. This will create one mapping path that includes all the selected columns. Click OK in both the ActiveX Script Transformation Properties dialog box and the Transformation Options dialog box to return to the Transform Data Task Properties dialog box.
Figure 8-24. Column remapping. |
Now click the Lookups tab to see a list of lookups, as shown in Figure 8-25. We're going to take advantage of the Lookups tab because we need to be able to use the lookup table we created earlier to convert some of the code values in the text file into meaningful labels. At this point we could create a separate lookup for each field (about 23 of them) and give them each a different name. However, by using parameters, we can get away with creating just one table and using the ActiveX script transformation to pass the proper parameters. Here you must give the lookup a name (which later will be referenced in code) and the connection that will supply the lookup values. Remember the second SQL Server connection we created?
Figure 8-25. Creating the lookup. |
Click the Query button to invoke the Query Designer shown in Figure 8-26. The query is very simple: We're interested in getting a value returned to us based on the name of the field that contains the value and the code, or the one character value that the field contains. The type and the code are the unknown values, or parameters, that we're going to use to retrieve the correct value. To indicate to the lookup function that we want to be able to pass a parameter, put a "? " in place of the variable. We can have as many of these values as we choose. Click OK to return to the Transform Data Task Properties dialog box.
Figure 8-26. Query Designer for the lookups. |
The Options tab of the Transform Data Task Properties dialog box, shown in Figure 8-27, allows us to configure some additional properties for the transformation.
Figure 8-27. The Options tab. 124 |
The top portion of the Options tab allows you to create a text file that contains any errors and determine what types of error information will be stored in the text file. This file is indispensable for debugging purposes after the fact. When converting data from outside sources, as in this example, you're always subject to errors caused either by file corruption from the supplier of the file or from changes made to the file structure without your knowledge. This debugging function can alert you to those types of errors not only by signaling the errors, but also by listing a certain number of rows that caused the problem.
The middle portion of the Options tab lets you decide how many errors you will allow before interrupting the task. It also lets you decide how many rows you want to import and even what range of rows they belong to.
The bottom portion of the Options tab affects how the data will load. By choosing Fast Load, you are asking SQL Server to attempt to dispense with logging the inserts to the table.
By executing a nonlogged operation such as this in a database, you render the logs useless for recovery purposes because as soon as SQL Server detects a nonlogged operation, it disallows any transaction log backups. To restore the recoverability of the database, a full database backup must be performed immediately after the Bulk Insert. This is why it's often best to have all staging tables, such as this one, in a separate staging database where the transaction logs do not need to be used for recovery.
Locking the table and disabling constraint checking speeds up the Bulk Insert task. The batch size is relevant when you have many rows to insert at once. By leaving the default value to 0, you're basically saying that all the records will be inserted into SQL Server in one single transaction. That can hurt performance and tax resources. By setting it to a value of 10,000, for example, you allow SQL Server to write every 10,000 rows that come in.
Now let's go back to Figure 8-23 and change the ActiveX Script so it will account for the lookups we created. Take a look at the code in Figure 8-23.
DTSDestination(<fieldname>) refers to the field in the SQL Server table. The DTSSource (<fieldname>) refers to the column in the text file. Ordinarily, the transformation would be relatively simple: 125
DTSDestination("Field1") = DTSSource("Col1")
Because we are using transformations, we need to make the same change to every one of the fields:
DTSDestination("Field1") = DTSLookups("LKP_MUSHROOMS").Execute _ ("Field1",DTSSource("Col1"))
The syntax of the lookup function is simple:
DTSLookups("LookupName").Execute(Parameter1, Parameter2,K)
The parameters are read in order and replace the "? " characters in the lookup query definition we created earlier. The function then returns a value that we use to update the destination column.
Click the Transformation tab, and click the Edit button to open the Transformation Options dialog box. Click the Properties button to open the ActiveX Script Transformation Properties dialog box, and then modify the function as shown in Figure 8-28. Close all the dialog boxes to return to the DTS Designer canvas. 126
Figure 8-28. The script modified to use the lookup function. |
To create the data-mining processing task, click the Analysis Services Processing Task button to open the Analysis Services Processing Task dialog box shown in Figure 8-29. First you'll be shown the Analysis Servers that are available. Choose a server, and drill down to the database that you want to process. Since we're interested in processing the data-mining model, we'll drill all the way down to the"Mushroom Analysis RDBMS" data-mining model that you created in ‚ Chapter 5 ‚ .
Figure 8-29. Creating an Analysis Services processing task. 127 |
When you choose a data-mining model, the processing options correspond to the options that you would use for data-mining as opposed to OLAP, for example. In this case, choose Full Process, which takes a little longer but rebuilds the structure of the data-mining model. Notice that you can, if you wish, specifically write a query that will populate the data-mining model. A default query is used if you do not specify one.
Now create a success precedence constraint so that the data-mining model gets processed only after the staging table containing the cases is successfully populated by the transformation task. (See Figure 8-30.)
Figure 8-30. Adding the precedence constraint to include the mining model. |
Now choose Save As from the Package menu to bring up the Save DTS Package dialog box shown in Figure 8-31. As mentioned earlier, there are various locations to store the package. The most flexible is the .dts file which can be transferred, e-mailed, or carried on a disk. But for the purposes of this example, save it in SQL Server, which then stores it in the msdb database.
Figure 8-31. Saving the package. 128 |
The simplest way to execute the package is to click the Execute button located on the toolbar, or choose Execute from the Package menu. A dialog box appears and provides immediate feedback of the progress of each step in the package as it executes. (See Figure 8-32.)
Figure 8-32. Executing the package. |
If you go back to the Data Transformation node in Enterprise Manager and select the SQL Server node, you'll notice the package is stored there for future use. (See Figure 8-33.)
Figure 8-33. List of DTS packages. |
If you would like to schedule this package to run at preset times, say at 11 P.M. every night, all you need to do is right-click on the package you wish to schedule and choose Schedule Package from the menu. (See Figure 8-34.) 129
Figure 8-34. Package options. |
This brings up the Edit Recurring Job Schedule dialog box (shown in Figure 8-35), which allows you to set the dates and times that the package should run unattended. Once the schedule is set, the job is added to the list of jobs in the SQL Server Agent.
Figure 8-35. Edit Recurring Job Schedule dialog box. |
Summary
A data-mining strategy must include the data preparation steps, especially because the predictive and analytical qualities of the data-mining model are highly dependent on the successful conversion of raw data into structured cases. This means that the cases not only must prove to be structurally coherent, but also must pass any and all integrity tests that are necessary before declaring them fit to represent data to be analyzed. DTS is a very powerful tool that was designed for that purpose. It also happens to be a tool that offers more than enough flexibility to create batch-oriented programs that aid in bringing the data all the way from its raw form to the final data-mining model, relational database, or OLAP cube. The structure of the DTS facilitates its transportation, execution, and scheduling, which allows it to be integrated with a wide variety of SQL Server tools and programming interfaces. 130