Integration Services


Integration Services is the successor to the Data Transformation Services (DTS) tool found in earlier versions of Microsoft SQL Server. Integration Services is not just a new name for an old tool—it is entirely new, rewritten from the ground up. Integration Services provides all of the functionality of DTS in an environment that simultaneously offers more capabilities with less programming and scripting required. In addition, Integration Services offers a large performance boost over DTS.

Package Structure

Integration Services and its predecessor, DTS, both create structures called packages, which are used to move data between systems. Packages in both systems contain data sources and data destinations. After that, however, the similarities between these two utilities come to an end. Integration Services represents a complete rethinking of Microsoft's approach to data transfer. The result is an easy-to-use, extremely flexible, exceedingly capable, and highly scalable data Extract, Transform, and Load (ETL) tool.

One of the most obvious departures from DTS is in the design of the packages themselves. Where DTS combined control of the package's operation and the mapping of the data flow in one graphical layout, Integration Services splits the two apart. Separating these two related, but ultimately distinct, areas makes Integration Services packages easier to create and certainly much easier to maintain.

A package's overall operation is defined by the control flow. The control flow is the sequence of tasks that will be performed by the package. These may be tasks such as transferring a file using FTP, logging an event, truncating a table in preparation for a data load, or even e-mailing a status message to an administrator. Of course, one of the most-used control flow tasks is the data flow task, which moves data from one place to another.

The data flow task provides the plumbing for moving data from one or more sources to one or more destinations. Along the way, the data may be subjected to any number of transformation processes. Data can now be manipulated as it is flowing through "the pipeline" without having to be deposited in a temporary, working table as in DTS. This capability to massage the data on the fly yields a cleaner and more streamlined ETL process and a much more efficient package.

Integration Services packages are created using an Integration Services project in the Business Intelligence Development Studio. When we are working on an Integration Services project, the designer window contains four tabs as shown in Figure 7-1. The first three tabs—the Control Flow Designer tab, the Data Flow Designer tab, and the Event Handlers Designer tab—let us define various types of functionality for the package. The final tab—the Package Explorer—provides an alternative view of the contents of the package. In addition to the tabs, there is a special area at the bottom of the designer window for defining connection managers, which is called the Connections tray.

Figure 7-1: The Integration Services package design layout in the Business Intelligence Development Studio

As stated earlier, each Integration Services package contains a control flow to define the overall operation of the package. This is shown in Figure 7-2. The control flow is defined by dragging items from the Toolbox onto the Control Flow Designer tab. When the Control Flow Designer tab is selected, only those tasks that can be used on this tab are displayed in the Toolbox as shown in Figure 7-1.

Figure 7-2: The Integration Services package structure Control Flow tab

An Integration Services package may contain several data flows. Each data flow is represented by a data flow task placed on the Control Flow Designer tab as shown in Figure 7-2. A data flow task is created by dragging a Data Flow item from the Toolbox and placing it in the control flow. We need to drill down into a data flow task to define the contents of that data flow as shown in Figure 7-3. This is done by double-clicking on a data flow task on the Control Flow Designer tab or clicking on the Data Flow Designer tab and selecting the correct data flow task from the drop-down list. We can now define the details of the data flow on the Data Flow Designer tab. As might be expected, only Data Flow control items are displayed in the Toolbox when the Data Flow Designer tab is selected.

Figure 7-3: The Integration Services package structure Data Flow Designer tab

Integration Services packages are now event-driven. This means we can specify routines to execute when a particular event occurs. An event can be the completion of a task or an error that occurs during task execution. The routine for the event is known as an event handler and is defined as a control flow. However, event handler control flows are created on the Event Handlers Designer tab rather than on the Control Flow Designer tab as shown in Figure 7-4.

Figure 7-4: The Integration Services package structure Event Handlers Designer tab

The same control flow items that can be used on the Control Flow Designer tab can also be used to create event handler tasks on the Event Handlers Designer tab. These might include sending an e-mail to update the administrator on the status of a package or executing an alternative data flow when a task fails. When a data flow task is used in an event handler, the details of the data flow are defined on the Data Flow Designer tab as shown in Figure 7-5.

Figure 7-5: Defining a data flow task in an event handler control flow

The rightmost tab of the Designer window in Figure 7-1 provides access to the Package Explorer. The Package Explorer displays all the contents of the Integration Services package in a single tree-view structure. In addition to the control flow tasks, data flow tasks, and event handlers tasks, the Package Explorer also displays all the variables available for use in the package. We discuss the use of variables in Integration Services packages in the section "Variables."

The Connections tray is used to manage all the connections in the package. This includes paths to flat files and connection strings for databases. Rather than having this spread throughout the package, all this connection information is stored in the Connections tray. This makes it much easier to locate and manage the connection information as paths, server names, and login credentials change or as the package is used in different server environments.

Drag-And-Drop Programming

Creating nontrivial data transfer packages with DTS usually required someone with a programming background. A certain amount of Visual Basic script or SQL code was generally required to accomplish even the most straightforward data loading and data cleansing tasks. This blurred the line between database administration tasks and development tasks, while tending to complicate database management.

In a number of cases, the data could not be cleansed prior to loading it into the database. It was just too complicated to implement the aggregations, lookups, and validations as operations in the DTS pipeline. In these situations, the data had to be placed in a staging table in SQL Server. SQL queries would then be developed to do the cleansing tasks, again requiring someone with database development know-how. The SQL queries, not the DTS package, were ultimately responsible for copying the data to its final destination. This has been referred to as ELT (Extract, Load, and Transform) rather than the more traditional, and far more efficient, ETL.

Integration Services is designed to eliminate both the need for a developer and the need for any ELT-mode operations. Integration Services uses a drag-and-drop development style. Task items are dragged from the toolbox and placed on the designer workspace. The behavior of these tasks is controlled by setting properties in the dialog box associated with each task and by connecting the output from one task to the input of another. Some expressions may be required, but these tend to be straightforward, single-line comparison statements, rather than more complicated, multiline programming constructs.

Almost all the data manipulations that required staging tables in DTS can now be done through Integration Services tasks without writing the data to disk. These tasks work with the data in buffers kept in physical memory. Pointers to these memory buffers are passed from one task to another throughout the processing. The only time data is written to disk during a data pipeline process is when it exceeds the size of the physical memory available. All of this makes it easy to develop complex packages while ensuring fast, efficient operation.

Control Flow Tasks

As we create our Integration Services packages, we use a top-down approach. First, we use the Control Flow Designer tab to define the tasks that the package will perform. Next, we use precedence arrows to define the tasks' order of execution and any dependencies between them. Then, we drill down into each of the data flow tasks to define their operation. Finally, we add any event handlers or other programming needed to get the package functioning in the manner required.

We begin, then, on the Control Flow Designer tab, by specifying the tasks that must be accomplished by the package. We create tasks by dragging Control Flow items from the toolbox and placing them on the Control Flow Designer tab. Each item taken from the toolbox becomes a control flow task. This is shown in Figure 7-6.

Figure 7-6: Creating control flow tasks from the Toolbox

The red octagon with the white X on the control flow task indicates that the task is in an error state because some of the required properties for the task have not been set or have been set improperly. Hovering the mouse over a task with this symbol displays a message explaining the error. In other situations, a yellow triangle with a black ! appears on a task. This indicates a warning message is associated with this task. Again, hovering the mouse over the task displays the warning message. A task with a warning executes, but a task with an error does not.

We can set the properties of a task by double-clicking on the task to display its Editor dialog box. This is shown in Figure 7-7. We can also select the task and modify its properties using the Properties window. We look at the properties of each task in detail in the next section of this chapter.

Figure 7-7: Setting the properties of a task using the task's Editor dialog box

When creating an Integration Services package, it is important to change the default name for each task that is placed in the package. The names should be changed to a phrase that accurately describes the functionality performed by that task. It is also a good idea to enter a short explanation of each task in the Description property of the task. The description is displayed when we hover over the task on the Control Flow tab. Entering meaningful information for each task makes the package self-documenting and, consequently, much easier to test, debug, and maintain.

Connection Managers

Any time a control flow task must make a connection to a database or to the file system, this connection information is stored in a connection manager in the Connections tray. This makes it easier to move packages from one server to another, or from a development environment to a production environment. When database connection information or file system paths need to be modified to work on a different server, we do not need to look for them throughout the package. Instead, they are conveniently located in one place, the Connections tray.

We take a look at creating some of the most commonly used connections in the "Learn By Doing" sections of this chapter.

Precedence Arrows

We control the order in which tasks are executed by connecting the precedence arrow from one task to the task that is to run after it. To do this, we select a task, and then click on the green arrow that appears below that task. We drag this arrow until the mouse pointer is over the task that should follow during execution. This is shown in Figure 7-8. When we release the mouse button, the green arrow connects the two tasks as shown in Figure 7-9. This is known as a precedence constraint because it constrains one task to run after another.

Figure 7-8: Making a precedence connection between two tasks

Figure 7-9: The precedence arrow

By default, the precedence arrow is green, meaning the second task will execute only after the successful completion of the first task. This is the "Success" precedence. When we right-click on the arrow, a Context menu appears as shown in Figure 7-10. Three options on the Context menu—Success, Failure, and Completion—allow us to select the type of precedence the arrow represents.

Figure 7-10: The precedence arrow Context menu

When we select Failure from the Context menu, the precedence arrow changes from green to red. With failure precedence, the second task executes only after the failure of the first task. When we select Completion from the Context menu, the precedence arrow changes to blue. With completion precedence, the second task executes after the first task has completed without regard to whether the first task succeeded or failed.

A single task can have multiple precedence arrows connecting it to other tasks. When we select a task that already has its precedence arrow connected to another task, a second precedence arrow automatically appears. This is shown in Figure 7-11. This second arrow can be connected to another task. When a task has two precedence arrows connected to subsequent tasks, a third arrow appears. And so on. Each of these precedence arrows can be set to Success, Failure, or Completion as appropriate. The result can be a complex sequence of tasks such as the one shown in Figure 7-12.

Figure 7-11: A control flow task with a second precedence arrow

Figure 7-12: Using precedence arrows to make a complex sequence of tasks

We can double-click a precedence arrow or right-click the arrow and select Edit from the Context menu to access the Precedence Constraint Editor dialog box shown in Figure 7-13. The Value drop-down list on this dialog box provides another way to select among Success, Failure, and Completion. If these three options do not provide enough flexibility, we can also attach expressions to a precedence arrow.

Figure 7-13: The Precedence Constraint Editor dialog box

If the Evaluation Operation drop-down list is set to Expression, then the contents of the expression must evaluate to true before this precedence path is taken. If the Evaluation Operation drop-down list is set to Expression and Constraint, the contents of the expression must be true and the constraint selected in the Value drop-down list must match the result of the task execution. If the Evaluation Operation drop-down list is set to Expression or Constraint, either the contents of the expression must be true or the constraint selected in the Value drop-down list must match the result of the task execution. We discuss Integration Services expressions in the section "Expressions."
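
As a simple illustration, assuming the package defines an integer variable named RowsExtracted that an earlier task populates, an expression such as the following could be paired with the Success constraint so that the next task runs only when the preceding task succeeded and actually produced rows:

     @[User::RowsExtracted] > 0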

The Multiple Constraints section at the bottom of the Precedence Constraint Editor dialog box determines the behavior when more than one precedence arrow connects to a single task. The data flow task in Figure 7-12 is a good example. The Success precedence constraints from both Execute SQL Task 1 and Execute SQL Task 2 lead to the data flow task. When the Multiple Constraints setting is Logical AND, both Execute SQL Task 1 and Execute SQL Task 2 must finish successfully before the data flow task executes. If either one or both of the Execute SQL tasks fails, the data flow task cannot execute. When the Multiple Constraints setting is Logical OR, the data flow task executes if either one or the other of the Execute SQL tasks is successful. Changing the Multiple Constraints setting on any one of the precedence arrows leading into a task changes the Multiple Constraints setting for all of the precedence arrows leading into that task.

It is possible to have one or more control flow tasks in a package that are not connected by precedence arrows to any other task. These tasks are not constrained to execute in any particular order and they are not constrained by the success or failure of any other task. These disconnected tasks execute every time the package is run, no matter what happens with the other tasks in the package.

Data Flow

Once we set the precedence constraints for the control flow tasks in the package, we can define each of the data flows. This is done on the Data Flow Designer tab. Each data flow task that was added to the control flow has its own layout on the Data Flow Designer tab. We can switch between different data flows using the Data Flow Task drop-down list located at the top of the Data Flow tab.

The Data Flow toolbox contains three types of items: data sources, transformations, and data destinations. Data Flow Sources enable us to read data from any vendor's relational database, flat file, or XML source. Data Flow Destinations let us write data to these same locations, as well as to certain Analysis Services structures. Data Flow Transformations operate on the data as it moves through the data flow. In most cases, the transformations change the data to increase its appropriateness for the destination.

Data flow items are connected by flow path arrows. These are connected in the same way that control flow tasks are connected by precedence arrows. Green flow path arrows indicate the route that valid data will take through the task. This is the output flow of the task. Red flow path arrows indicate the route that invalid data follows. This is the error flow of the task.

The flow path from a data source to a data destination is called a data flow segment. The simplest data flow segment has only a single source and a single destination connected by a flow path. This type of segment simply copies data from one place to another. Transformation tasks are added to the segment to modify the data as it moves from the source to the destination. In fact, as we see in the sections on data flow, we can do some sophisticated data manipulation using the transformation tasks in Integration Services.

Data flow segments may contain multiple data sources with the data flow from each source being combined with a merge or union task. Likewise, a segment may have multiple data destinations with the data flow being divided by an error path, a conditional split task, or a multicast task. We gain even more flexibility with the ability to have multiple, distinct data segments in the same data flow.

When a data flow has multiple segments, the order in which the segments are executed is determined by the execution plan created at run time. This means that we cannot depend on the order in which multiple segments are executed within a single data flow. If the multiple segments are performing independent tasks, such as loading data into different dimensions of a data mart, this is not a problem. However, if these multiple segments are performing dependent tasks, such that one segment must be completed before another segment can begin, the segments should be split up into different data flow tasks. The order of execution can then be enforced by a precedence arrow from one data flow task to the other on the Control Flow Designer tab.

Learn by Doing—Using Integration Services to Populate the Maximum Miniatures Manufacturing Data Mart Dimensions

Feature Highlighted
  • Creating an Integration Services package for loading data into a data mart.

Business Need In Chapter 6, we created a relational database to hold the information for the Maximum Miniatures Manufacturing data mart. Now it is time to copy information from the Maximum Miniatures OLTP systems and load it into the relational database tables. We need to create two processes: one to initially populate all of the dimensional tables and one to add new manufacturing information to the fact table.

Note

To complete this Learn By Doing activity, you need the Maximum Miniatures Accounting database and the BatchInfo.csv file. If you have not done so already, go to http://www.osborne.com, locate the book's page using the ISBN 0072260904, and follow the instructions to download the supporting files and to install the sample databases. This activity also requires the Microsoft OLE DB Provider for Visual FoxPro. If this OLE DB Provider is not installed on your computer, use the vfpoledb.exe file to install it.

Steps
  1. Open the Business Intelligence Development Studio.

  2. Click the New Project button in the toolbar.

  3. Make sure Business Intelligence Projects is selected from the Project Types, and then select Integration Services Project from the Templates.

  4. Enter MaxMinManufacturingDMDimLoad for Name and set the Location to the appropriate folder. Leave Create Directory for Solution checked.

  5. Click OK to create the project.

  6. Once the project is open, right-click on the Package.dtsx entry in the Solution Explorer window and select Rename from the Context menu. Enter DimensionLoad.dtsx for the package name and press ENTER.

  7. Click Yes when asked if you want to rename the package object as well.

  8. Drag a Data Flow Task item from the Toolbox and drop it on the Control Flow tab.

  9. Right-click the Data Flow Task item you just created and select Rename from the Context menu. Enter Load ProductType and press ENTER. Drag the Data Flow Task item wider, so all of the new name can be seen. (You can also single-click the Data Flow Task item to edit the name of the item.)

  10. Double-click the Load ProductType item. This takes you to the Data Flow tab.

  11. Right-click in the Connections tray and select New OLE DB Connection from the Context menu. Click New in the Configure OLE DB Connection Manager dialog box to create a new data connection.

  12. From the Provider drop-down window, select Microsoft OLE DB Provider for Visual FoxPro and click OK. (If this OLE DB Provider is not available on your PC, see the note on the previous page.) Click Data Links. The Data Link Properties dialog box appears. Click the ellipsis button. The Configure Connection dialog box appears. Click Browse and browse to the location where you copied the Accounting System FoxPro files. Select AccountingSystem.DBC and click Open. Click OK to exit each open dialog box until you return to the OLE DB Connection Manager dialog box.

  13. Expand the Data Flow Sources section of the Toolbox, drag an OLE DB Source item from the Toolbox, and drop it on the Data Flow tab.

  14. Double-click the OLE DB Source item you just created. The OLE DB Source Editor dialog box appears.

  15. Select the AccountingSystem data connection that you just created in the OLE DB Connection Manager drop-down list. Leave the Data Access Mode drop-down list set to Table or View. Select producttypes from the Name of the Table or the View drop-down list. Click OK. Click OK on the code page warning dialog box.

  16. Right-click in the Connections tray and select New OLE DB Connection from the Context menu. Click New in the Configure OLE DB Connection Manager dialog box to create a new data connection.

  17. Enter the name of the server where the MaxMinManufacturingDM database was created in Server Name. Select the appropriate method for accessing this server in the Log On to the Server section. Enter credentials if necessary. Select MaxMinManufacturingDM from the Select or Enter a Database Name drop-down list. Click OK to return to the Configure OLE DB Connection Manager dialog box. Click OK again to exit this dialog box.

  18. Expand the Data Flow Destinations section of the Toolbox, drag a SQL Server Destination item from the Toolbox, and drop it on the Data Flow tab.

  19. Click the OLE DB Source item. Click the green data flow arrow, drag it on top of the SQL Server Destination item, and drop it on this item. This connects the source to the destination.

  20. Double-click the SQL Server Destination item. The SQL Destination Editor dialog box appears.

  21. Select the MaxMinManufacturingDM data connection that you just created in the OLE DB Connection Manager drop-down list. Select DimProductType from the Use a Table or View drop-down list.

  22. Click Mappings to view the Mappings page. The columns from the data source (that are Available Input Columns to this data destination item) should be automatically mapped to the columns in the destination (the Available Destination Columns). You see an arrow going from each column name under Available Input Columns to the corresponding column name under Available Destination Columns. The automatic mapping occurs because the column names match. Click OK to exit the SQL Destination Editor dialog box.

  23. Click the Control Flow tab. Drag a second Data Flow Task item on to the Control Flow tab. Rename this new Data Flow Task item Load ProductSubtype.

  24. Click the Load ProductType item. Drag the precedence arrow onto the Load ProductSubtype item and drop it there. These two items are now linked by a precedence constraint. The Load ProductType item must complete successfully before the Load ProductSubtype item can execute. This is required because ProductType is a foreign key for ProductSubtype.

  25. Double-click the Load ProductSubtype item. This takes you to the Data Flow tab with Load ProductSubtype selected in the Data Flow Task drop-down list.

  26. Drag an OLE DB Source item from the Toolbox onto the Data Flow tab. Double-click this item. The OLE DB Source Editor dialog box appears.

  27. Select the AccountingSystem data connection in the OLE DB Connection Manager drop-down list. Leave the Data Access Mode drop-down list set to Table or View. Select productsubtypes from the Name of the Table or the View drop-down list. Click OK. Click OK on the code page warning dialog box.

  28. Drag a SQL Server Destination item from the Toolbox and drop it on the Data Flow tab.

  29. Click the OLE DB Source item. Click the green data flow arrow, drag it on top of the SQL Server Destination item, and drop it on this item. This connects the source to the destination.

  30. Double-click the SQL Server Destination item. The SQL Destination Editor dialog box appears.

  31. Select the MaxMinManufacturingDM data connection in the OLE DB Connection Manager drop-down list. Select DimProductSubtype from the Use a Table or View drop-down list.

  32. Click Mappings to view the Mappings page. The input columns should be automatically mapped to the columns in the destination. Click OK to exit the SQL Destination Editor dialog box.

  33. Repeat Steps 23 through 32 to take data from the AccountingSystem.Products table and copy it to the MaxMinManufacturingDM.DimProduct table. Create the precedence arrow to connect the Load ProductSubtype and Load Product data flow tasks. (The RawMaterial and AmountOfRawMaterial columns in the AccountingSystem.Products table are not mapped to columns in the MaxMinManufacturingDM.DimProduct table.)

  34. Repeat Steps 23 through 32 again to take data from the AccountingSystem.Countries table and copy it to the MaxMinManufacturingDM.DimCountry table. Create the precedence arrow to connect the Load Product and Load Country data flow tasks.

  35. Click the Control Flow tab. Drag another data flow task item onto the Control Flow tab. Rename this new data flow task item Load Plant.

  36. Click the Load Country item. Drag the precedence arrow onto the Load Plant item and drop it there.

  37. Double-click the Load Plant item. This takes you to the Data Flow tab with Load Plant selected in the Data Flow Task drop-down list.

  38. Drag an OLE DB Source item from the Toolbox onto the Data Flow tab. Double-click this item. The OLE DB Source Editor dialog box appears.

  39. Select the AccountingSystem data connection in the OLE DB Connection Manager drop-down list. Select SQL Command from the Data Access Mode drop-down list. Enter the following for SQL command text:

     SELECT LocationCode, LocationName, CountryCode FROM Locations WHERE LocationType = 'Plant Site' 

  40. Click Columns to view the Columns page. Click OK on the code page warning dialog box. In the Output Column, change LocationCode to PlantNumber and change LocationName to PlantName. Click OK.

  41. Drag a SQL Server Destination item from the Toolbox and drop it on the Data Flow tab.

  42. Click the OLE DB Source item. Click the green data flow arrow, drag it on top of the SQL Server Destination item, and drop it on this item.

  43. Double-click the SQL Server Destination item. The SQL Destination Editor dialog box appears.

  44. Select the MaxMinManufacturingDM data connection in the OLE DB Connection Manager drop-down list. Select DimPlant from the Use a Table or View drop-down list.

  45. Click Mappings to view the Mappings page. The input columns should be automatically mapped to the columns in the destination. Click OK to exit the SQL Destination Editor dialog box.

  46. Click the Control Flow tab. Drag another data flow task item onto the Control Flow tab. Rename this new data flow task item Load Material.

  47. Click the Load Plant item. Drag the precedence arrow onto the Load Material item and drop it there.

  48. Double-click the Load Material item.

  49. Drag an OLE DB Source item from the Toolbox onto the Data Flow tab. Double-click this item. The OLE DB Source Editor dialog box appears.

  50. Select the AccountingSystem data connection in the OLE DB Connection Manager drop-down list. Select SQL Command from the Data Access Mode drop-down list. Enter the following for SQL command text:

     SELECT AssetCode, AssetName, AssetClass, LocationCode, Manufacturer, DateOfPurchase, RawMaterial FROM CapitalAssets WHERE AssetType = 'Molding Machine' 

  51. Click Columns to view the Columns page. Click OK on the code page warning dialog box. Under Output Column, make the following changes:

     Old Value       New Value
     AssetCode       MachineNumber
     AssetName       MachineName
     AssetClass      MachineType
     LocationCode    PlantNumber
     RawMaterial     Material

  52. Click OK to exit the OLE DB Source Editor.

  53. Drag an Aggregate item from the Toolbox (listed under Data Flow Transformations) and drop it on the data flow tab.

  54. Click the OLE DB Source item. Click the green data flow arrow, drag it on top of the Aggregate item, and drop it on this item.

  55. Double-click the Aggregate item. This opens the Aggregate Transformation Editor dialog box. Check Material under Available Input Columns. This groups the data by the Material column to give you a unique list of materials. You can use this unique list to populate the DimMaterial table. Only the Material column is included in the output of the Aggregate item. Click OK.

  56. Drag a SQL Server Destination item from the Toolbox and drop it on the Data Flow tab.

  57. Click the Aggregate item. Click the green data flow arrow, drag it on top of the SQL Server Destination item, and drop it onto this item.

  58. Double-click the SQL Server Destination item. The SQL Destination Editor dialog box appears.

  59. Select the MaxMinManufacturingDM data connection in the OLE DB Connection Manager drop-down list. Select DimMaterial from the Use a Table or View drop-down list.

  60. Click Mappings to view the Mappings page. The single input column should be automatically mapped to the single column in the destination. Click OK to exit the SQL Destination Editor dialog box.

  61. Click the Control Flow tab. The Load Material item should still be selected. Press CTRL-C to copy this item. Click the background of the Control Flow tab, so the Load Material item is no longer selected. Press CTRL-V to paste a copy of the Load Material item.

  62. Change the name of this new item to Load MachineType.

  63. Click the Load Material item. Drag the precedence arrow onto the Load MachineType item and drop it there.

  64. Double-click the Load MachineType item.

  65. Double-click the Aggregate item. This opens the Aggregate Transformation Editor dialog box. Check MachineType in addition to Material under Available Input Columns. This groups the data by the MachineType and Material columns to give you a unique list of machine types and their materials. You can use this unique list to populate the DimMachineType table. Click OK.

  66. Double-click the SQL Server Destination item. The SQL Destination Editor dialog box appears.

  67. Change the Use a Table or View drop-down list to DimMachineType.

  68. Click Mappings to view the Mappings page. Click MachineType in Available Input Columns, drag and drop it on MachineType in Available Destination Columns. Click OK to exit the SQL Destination Editor dialog box.

  69. Click the Control Flow tab. The Load MachineType item should still be selected. Press CTRL-C to copy this item. Click the background of the Control Flow tab, so the Load MachineType item is no longer selected. Press CTRL-V to paste a copy of the Load MachineType item.

  70. Change the name of this new item to Load Machine.

  71. Click the Load MachineType item. Drag the precedence arrow onto the Load Machine item and drop it there.

  72. Double-click the Load Machine item.

  73. We do not need any aggregation for the DimMachine table. Click the Aggregate item and press DELETE. Click Yes to confirm the delete.

  74. Click the OLE DB Source item. Drag the green data flow arrow onto the SQL Server Destination and drop it there.

  75. Double-click the SQL Server Destination item. The mapping that had been in place for this destination is no longer valid because of the changes we made. The Restore Invalid Column References Editor dialog box appears to alert you to this fact. Click OK to have the invalid references removed.

  76. Double-click the SQL Server Destination item again. The SQL Destination Editor dialog box appears.

  77. Change the Use a Table or View drop-down list to DimMachine.

  78. Click Mappings to view the Mappings page. Click MachineNumber in Available Input Columns, drag and drop it on MachineNumber in Available Destination Columns. Do the same thing with MachineName, PlantNumber, Manufacturer, and DateOfPurchase. Click OK to exit the SQL Destination Editor dialog box.

  79. Click the Save All button on the toolbar to save the completed package.

  80. Click the Start Debugging button in the toolbar to execute the completed package. When the execution is complete, click Stop Debugging or the Package Execution Completed link to return to design mode.

Executing the DimensionLoad package copies data from the AccountingSystem database tables into the dimension tables of our MaxMinManufacturingDM database.

Package Items

In this section, we take a look at the Control Flow and Data Flow items that can be used in an Integration Services package. We discuss how each item can be used in a package and we look at the task dialog box provided for configuring each task. Because of the tremendous amount of capability and flexibility available from Integration Services tasks, it would take the remainder of this book, and then some, to provide a complete description. Instead, this section is intended to give you some familiarity with the purpose and capabilities of each task.

Control Flow—Maintenance Plan Tasks

The Control Flow Toolbox is divided into two areas: Maintenance Plan tasks and Control Flow items. (We will not be concerned with the General area of the Toolbox, which is empty.) Each component, Maintenance Plan task or Control Flow item, has an editor dialog box that lets us configure that component in a straightforward manner. The editor dialog box is launched by double-clicking the component in the design area or right-clicking the item and selecting Edit from the Context menu.

We begin by looking at the Maintenance Plan tasks. The Maintenance Plan tasks, as we might guess from the name, mirror tasks that can be accomplished as part of a database maintenance plan. The Maintenance Plan tasks could be used to create Integration Services packages that are responsible for database maintenance. They can also be used to perform maintenance tasks as we are loading data into a database. This can be done to ensure that the database is ready for efficient operation at the conclusion of the data load.

Each Maintenance Plan task has a View T-SQL button on its dialog box. This button shows the T-SQL statements that would perform the equivalent operation in a query window. Looking at these T-SQL statements may be helpful to determine exactly how a Maintenance Plan task is operating.
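
For instance, for the Back Up Database task described next, a Full backup of a single database to a single file corresponds to a statement along the lines of the following (the database name and file path here are placeholders rather than values generated by the task):

     BACKUP DATABASE [MaxMinManufacturingDM] TO DISK = N'C:\Backups\MaxMinManufacturingDM.bak' WITH INIT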

Note

Each section of the Toolbox starts with an entry called Pointer. These pointer entries enable us to switch back to the mouse pointer if we begin a drag-and-drop operation, and then change our mind. The pointer entries do not create tasks on the designers.

The Back Up Database task lets us run a database backup as part of an Integration Services package. This is a wise thing to do just before making major changes to the contents of a database. Often, some type of data load or other operation in an Integration Services package is responsible for these major changes. By using the Back Up Database task in a package, we can create the backup within the same process that makes the changes. We may also find this task useful to create a snapshot of the database contents immediately after a large data load.

Figure 7-14 shows the Back Up Database Task dialog box used to configure the Back Up Database task. The dialog box enables us to back up any combination of databases on a single server. A Full, Differential, or Transaction Log backup can be created. The backup can be stored in a single file or in multiple files with one file for each database.

Figure 7-14: The Back Up Database Task dialog box

A special type of drop-down box is used to select the databases to be backed up. This is shown in Figure 7-15. This drop-down box contains predefined choices that let us select all of the databases on the server, all system databases on the server (master, msdb, and model), or all user databases on the server. In addition to the predefined choices, we can use the database list at the bottom of the drop-down box to create our own selection of databases using the check boxes.

Figure 7-15: The database selection drop-down box

The Check Database Integrity task lets us check for corruption in one or more databases on a single server. This task executes a

 DBCC CHECKDB WITH NO_INFOMSGS 

command against each database specified. This can be a good idea either right before or right after major changes to the database.

The Check Database Integrity Task dialog box is shown in Figure 7-16. It uses the same database selection drop-down box as the one shown in Figure 7-15 for the selection of the databases to check. We can also specify whether we want indexes included in the integrity check.

Figure 7-16: The Check Database Integrity Task dialog box

If the integrity check is successful, the package execution continues with any tasks linked to the Check Database Integrity task with either a success or a completion precedence constraint. If the integrity check finds database corruption, the package execution continues with any tasks linked to the Check Database Integrity task, with either a failure or a completion precedence constraint.
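
Skipping the index check corresponds to the NOINDEX argument of DBCC CHECKDB, which bypasses the intensive checks of nonclustered indexes. A statement of roughly this form, with a placeholder database name, would be issued for each selected database:

     DBCC CHECKDB (N'MaxMinManufacturingDM', NOINDEX) WITH NO_INFOMSGS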

The Execute SQL Server Agent Job task lets us execute a single SQL Server Agent job as part of a package. A single job can be selected for execution. The Execute SQL Server Agent Job Task dialog box is shown in Figure 7-17.

Figure 7-17: The Execute SQL Server Agent Job Task dialog box

The Execute T-SQL Statement task enables us to run one or more Transact-SQL statements on a SQL Server. Anything that can be executed from a query window in the SQL Server Management Studio can be executed here. The Execute T-SQL Statement Task dialog box is shown in Figure 7-18.

Figure 7-18: The Execute T-SQL Statement Task dialog box

Notice there is also an Execute SQL task in the Control Flow items section of the Toolbox. The Execute T-SQL Statement task that we are looking at here is simple: it lets us type in T-SQL statements and execute them. The Execute SQL task in the Control Flow items section has additional features for creating parameterized statements and dealing with result sets.
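
A typical use of the Execute T-SQL Statement task in a load package is clearing out a working table before a data flow repopulates it. The statement can be as simple as the following (the table name is illustrative only):

     TRUNCATE TABLE dbo.StagingBatchInfo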

The History Cleanup task enables us to remove historical entries from a database. This includes backup history, SQL Agent job execution history, and database maintenance plan execution history. We can specify the timeframe for the historical entries we want to keep and, therefore, the timeframe for the historical entries we want to remove. The History Cleanup Task dialog box is shown in Figure 7-19.

Figure 7-19: The History Cleanup Task dialog box
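
Behind the scenes, the History Cleanup task calls cleanup procedures in the msdb database. For example, removing backup history older than a given cutoff date corresponds roughly to a call such as the following (the date is a placeholder):

     EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = '2005-01-01'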

The Maintenance Cleanup task allows us to remove files created by database maintenance plans and database backups. Specifically, we can remove maintenance plan text reports and database backups. We can remove a single file based on the file name or a number of files with the same file extension from a folder. We can specify the age of the files we want to remove.

The Notify Operator task lets us send an e-mail message to one or more SQL Server operators using SQLiMail. An operator must be set up with an e-mail address in SQL Server before they can be notified using this task. Of course, SQLiMail must be installed and configured for this task to function properly. The Notify Operator Task dialog box is shown in Figure 7-20.

Figure 7-20: The Notify Operator Task dialog box

Note

SQLiMail is the new tool in SQL Server 2005 for sending e-mail from SQL Server. It does not require Microsoft Outlook because it uses the SMTP standard.

Figure 7-21: The Rebuild Index Task dialog box

Figure 7-22: The Reorganize Index Task dialog box

Reorganizing indexes defragments the leaf-level nodes of both clustered and nonclustered indexes. It does this by reordering the nodes to match the logical order of the leaf nodes. Reorganizing also compacts the indexes. The reorganization process does not, however, drop and re-create the indexes as the rebuild process does.

Figure 7-23: The Shrink Database Task dialog box

The Update Statistics task lets us update the statistics in one or more databases. It uses the same database selection drop-down box as the one shown in Figure 7-15 for the selection of the databases whose statistics are to be updated. Because the database statistics are used by SQL Server to calculate query execution plans, it is important for these statistics to be up-to-date. A large data load can cause the statistics to be inaccurate. Therefore, it is a good idea to use this task to update the database statistics after a major data operation. The Update Statistics Task dialog box is shown in Figure 7-24.

Figure 7-24: The Update Statistics Task dialog box
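
Under the covers, the Update Statistics task issues an UPDATE STATISTICS statement for each table in the selected databases. A single generated statement looks something like this (the table name and scan option are illustrative):

     UPDATE STATISTICS dbo.DimProduct WITH FULLSCAN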

Control Flow—Control Flow Items

In addition to the Maintenance Plan tasks, the Control Flow tab's Toolbox also contains Control Flow items. These items again are presented in the order they appear in the Control Flow Toolbox. The Control Flow items are grouped into containers and tasks with the containers at the top of the list. Containers are a special type of item that can hold other Control Flow items and Maintenance Plan tasks.

Figure 7-25: A For Loop container in a control flow

The number of times the content of the container is executed is controlled by three properties of the For Loop container: InitExpression, EvalExpression, and AssignExpression. These properties are shown in the For Loop Editor dialog box in Figure 7-26. The InitExpression is executed once at the beginning of the loop to initialize the variable (or other item) that is controlling the loop. The EvalExpression is then evaluated. If the EvalExpression evaluates to true, the loop's content is executed. If the EvalExpression is false, the loop is exited. After each execution of the loop's contents, the AssignExpression is evaluated, followed by another evaluation of the EvalExpression.

Figure 7-26: The For Loop Editor dialog box

The For Loop in Figure 7-26 is using a variable called @LoopCounter in the expressions to control the number of times the loop is executed. (We discuss variables and Integration Services expressions in the sections "Variables" and "Expressions.") We don't need to use an explicit loop counter variable to control the loop. In fact, any valid expression that results in a true value while the loop should continue, and a false value when it is complete, can be used in the EvalExpression property to control the number of iterations. The InitExpression and AssignExpression properties can be left empty, if the mechanism controlling the number of iterations does not require a separate expression for initialization or modification of the counter.
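
For a loop that should run a fixed number of times, the three properties might be set along these lines, assuming @LoopCounter is an integer variable defined in the package:

     InitExpression:    @LoopCounter = 0
     EvalExpression:    @LoopCounter < 10
     AssignExpression:  @LoopCounter = @LoopCounter + 1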

Like the For Loop container, the Foreach Loop container also provides a way of repeating a segment of a control flow. However, rather than having an expression to control when the loop is exited, the Foreach Loop container iterates one time for each item in a collection. The following collections can be used:

  • Each file in a given folder matching a given file specification

  • Each row of the first table in an ADO recordset or an ADO.NET dataset

  • Each row of all tables in an ADO.NET dataset

  • Each table in an ADO.NET dataset

  • Each item from a variable that holds a collection

  • Each node in an XML nodelist

  • Each object in an SMO collection

Each time the Foreach Loop iterates, it selects a different object from the collection. The name of this object can then be mapped to one or more properties of the control flow tasks inside the container. In this way, the tasks inside the container can operate on each object in the collection.

For example, Figure 7-27 shows the Nightly Transfer package. This package uses FTP to transfer data files from a mainframe to the C:\FTP_RCV directory on the server. Several .DAT files may be received by a given transfer. After the FTP transfer is complete, the Foreach Loop processes each of the files that are received. The Collection page of the Foreach Loop Editor dialog box in Figure 7-28 shows the configuration of the Foreach Loop necessary to accomplish this task.

Figure 7-27: The Nightly Transfer package using a Foreach Loop container

Figure 7-28: The Collection page of the Foreach Loop container in the Nightly Transfer package

The Data Flow task and the File System task within the Foreach Loop use the Received.DAT Files connection, shown at the bottom of Figure 7-27, to operate on a given data file received from the mainframe. For the Data Flow task and the File System task to operate on each file that was received, we need a way to change the connection string of the Received.DAT connection for each iteration of the loop. To do this, we first assign the fully qualified filename to a variable. This is done on the Variable Mappings page of the Foreach Loop Container as follows:

Variable          Index
User::FileName    0

User::FileName is a user variable in the package. The entry in the Index column is an index to the Foreach Loop collection. When used with a collection of files, index 0 contains the file specification and is the only valid index. Next we create an expression in the Received.DAT Connection Manager to assign the value of the User::FileName to its ConnectionString.
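
A minimal version of that property expression, assuming it is entered for the ConnectionString property on the Expressions page of the Received.DAT Connection Manager, is simply a reference to the variable:

     @[User::FileName]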

Unlike the For Loop and Foreach Loop containers, the Sequence container does not change the control flow. Instead, the purpose of the Sequence container is to help organize tasks in a package. The Sequence container can be used to do the following:

  • Organize a large package into logical sections for easier development and debugging

  • Manage properties for a group of tasks by setting the properties on the container, rather than on the individual tasks

  • Allow a group of tasks to be easily enabled or disabled to aid package development and debugging

  • Provide a variable scope that includes multiple tasks, but does not include the entire package

All of the container objects—For Loop, Foreach Loop, and Sequence—have the capability to hide their content. Figure 7-29 shows two Sequence containers. Sequence Container 2 is hiding its content. The content of a container is hidden or shown by clicking on the chevron, indicated by the mouse pointer in Figure 7-29.

Figure 7-29: Sequence Container 1 is showing its content. Sequence Container 2 is hiding its content.

Figure 7-30: The Script page of the ActiveX Script Task Editor dialog box

Using the ActiveX Script task in our Integration Services packages has three disadvantages. First, the ActiveX scripts are interpreted at run time. This can negatively impact performance and scalability. Second, the ActiveX scripts can be a security issue. A malicious script can use its direct access to server resources to do all kinds of bad things. Third, the ActiveX script editor does not provide any programmer aids such as autocomplete, context-sensitive help, or debugging.

As an alternative, use the Script task in place of the ActiveX Script task. The Script task supports modules written in any .NET language. The Script task modules can be precompiled for better performance and scalability. All of the security features of the .NET Framework are in place when a Script task executes. Finally, the Script task provides a rich environment for script development.

Figure 7-31: The Analysis Services Execute DDL Task Editor dialog box

The Analysis Services Processing task enables us to process objects in Analysis Services. This can include cubes, dimensions, and data mining models. This is important after we have made major data changes that affect any of these objects. The Analysis Services Processing Task Editor dialog box is shown in Figure 7-32.

Figure 7-32: The Analysis Services Processing Task Editor dialog box

Figure 7-33: The Bulk Insert Task Editor dialog box

The Bulk Insert task is the fastest way to move large amounts of data into a SQL table. Remember, however, that this data must come from a text file directly and it cannot be transformed as it is being moved. Also, keep in mind that the Bulk Insert task always appends the data to any existing data in the target table.
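
The Bulk Insert task is a wrapper around the Transact-SQL BULK INSERT statement. A sketch of the equivalent statement, with placeholder table, file, and delimiter settings, would be:

     BULK INSERT dbo.StagingBatchInfo FROM 'C:\FTP_RCV\Batch001.DAT' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')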

The Data Mining Query task lets us execute a Data Mining Extensions (DMX) query against an existing data mining structure. The DMX query enables us to feed parameter values to a data mining model, and then have that mining model make predictions for us based on those parameters. The Mining Model tab of the Data Mining Query Task Editor dialog box is shown in Figure 7-34.

Figure 7-34: The Mining Model tab of the Data Mining Query Task Editor dialog box

If the selected data mining structure contains multiple data mining models, our DMX query can be run against more than one of these models. The DMX query results are stored in relational database tables. If the result of the DMX query includes nested tables, those tables are flattened before they are stored in the relational database tables. We discuss data mining in Part IV of this book.

The Execute DTS 2000 Package task enables us to execute a DTS package created for SQL Server 2000 as part of an Integration Services package. Using this task, we can continue to use complex DTS 2000 packages without having to re-create them as Integration Services packages. To execute this task, the DTS 2000 run-time engine must be installed as part of the SQL Server 2005 installation. In fact, the Execute DTS 2000 Package task will not appear in the Toolbox if the DTS 2000 run-time engine is not installed. The Execute DTS 2000 Package Task Editor dialog box is shown in Figure 7-35.

Figure 7-35: The Execute DTS 2000 Package Task Editor dialog box

The Execute DTS 2000 Package task can access DTS 2000 packages stored in SQL Server, in a structured storage file, or in Meta Data Services. If the SQL Server 2000 tools are loaded on the same computer that is running the Business Intelligence Development Studio, the DTS 2000 package can be edited by clicking Edit Package. The DTS 2000 package can be copied and stored in the Integration Services package by clicking Load DTS2000 Package Internally.

We can use two methods to send information to the DTS 2000 package. The Inner Variables page of the Execute DTS 2000 Package Task Editor dialog box lets us pass values to the global variables defined in the DTS 2000 package. The Outer Variables page enables us to create variables which can be used in the DTS 2000 package.

The Execute Package task lets us execute a different Integration Services package. The package containing the Execute Package task is the parent package and the package being executed by that task is the child package. The child package can be stored in SQL Server or in a structured storage file. The child package can execute as a separate process by setting the ExecuteOutOfProcess property to true. Otherwise, it executes in the same process as the parent package. The Execute Package Task Editor dialog box is shown in Figure 7-36.

Figure 7-36: The Execute Package Task Editor dialog box

Figure 7-37: The Execute Process Task Editor dialog box

Figure 7-38: The Execute SQL Task Editor dialog box

Figure 7-39: The File System Task Editor dialog box

Figure 7-40: The File Transfer Protocol Task Editor dialog box

Figure 7-41: The Message Queue Task Editor dialog box

The Script task enables us to create .NET code for execution as part of our Integration Services package. The Script task is similar to the ActiveX Script task in that they can both be used to execute transformations, perform complex calculations, or implement business logic that cannot be created using the other Integration Services tasks. However, the Script task provides a much better code-authoring environment and the resulting code can be precompiled for superior performance and scalability.

The Script Task Editor dialog box is shown in Figure 7-42. This dialog box is used to specify the .NET language used to create the script, as well as whether the script is precompiled. When a script is precompiled, the code is compiled as soon as the script is saved in the package. The compiled code is then saved in the package as well. This allows for much faster execution when the package is run. The only disadvantage to this approach is this: the size of the package becomes larger because it contains both the source code and the compiled version of that source code.

Figure 7-42: The Script Task Editor dialog box

The EntryPoint property specifies which method is to be executed first when the Script task is run. The ReadOnlyVariables and ReadWriteVariables properties contain comma-separated lists of variables that are to be made available, either as read-only variables or as read-write variables, within the script. These variables are accessed in our script using code similar to the following:

 ' Lock and read the GrossSalesAmt package variable
 Dim GrossSales As Double
 Dts.VariableDispenser.LockForRead("GrossSalesAmt")
 GrossSales = CDbl(Dts.Variables("GrossSalesAmt").Value)

 ' Lock and write the CurrentAmt package variable, then release all locks
 Dts.VariableDispenser.LockForWrite("CurrentAmt")
 Dts.Variables("CurrentAmt").Value = GrossSales
 Dts.Variables.Unlock()

where "GrossSalesAmt" and "CurrentAmt" are package variables. The variables must be locked before they can be either read or written and must be unlocked when all variable access is complete. The value of a variable is always returned as type object and must be cast to the appropriate type. Clicking Design Script displays the Script Editing window as shown in Figure 7-43.

Figure 7-43: The Script Editing window

The Send Mail task lets us send an e-mail message as part of our Integration Services package. This can be useful for alerting an administrator to an error condition or notifying a key user that a critical process has completed. The Send Mail task uses an SMTP server to deliver the e-mail. We must create an SMTP Connection Manager in the Connections tray to specify the SMTP server to use with this task. The Send Mail Task Editor dialog box is shown in Figure 7-44.

image from book
Figure 7-44: The Send Mail Task Editor dialog box

The content of the e-mail message can come from a file, a package variable, or it can be directly input as a string. If a file is used, we must create a File Connection Manager in the Connections tray pointing to that file. One or more files can be attached by specifying the path to each file in the Attachments property. If multiple files are attached, the paths are separated by semicolons.

image from book There are a number of Control Flow items for transferring SQL Server objects from one server to another. The Transfer Database task is used to either copy or move an entire database. Using the Transfer Error Messages task, we can copy user-defined error messages. The Transfer Jobs task is used to copy SQL Server Agent jobs. The Transfer Logins task copies one or more SQL logins. Using the Transfer Master Stored Procedures task, we can copy user-defined stored procedures between master databases.

The Transfer SQL Server Objects task is used to copy any of the following items:

Object                      SQL Server Version
Tables                      2000 or 2005
Views                       2000 or 2005
Stored Procedures           2000 or 2005
User-Defined Functions      2000 or 2005
Defaults                    2000 or 2005
User-Defined Data Types     2000 or 2005
Partition Functions         2005 only
Partition Schemes           2005 only
Schemas                     2005 only
Assemblies                  2005 only
User-Defined Aggregates     2005 only
User-Defined Types          2005 only
XML Schema Collection       2005 only

image from book The Web Service task lets us execute a web service method as part of our Integration Services package. We must create an HTTP Connection Manager in the Connections tray to specify the connection to the web service. Once a connection has been specified, a Web Services Description Language (WSDL) file must be located to define the web methods that are available. We can then select the web method we want to execute along with the parameters to be passed to that web method. This is shown in the Web Service Task Editor dialog box in Figure 7-45. The result returned by the web method call can be stored in a package variable or in a file specified by a File Connection Manager in the Connections tray.

image from book
Figure 7-45: The Web Service Task Editor dialog box

image from book
Figure 7-46: The WMI Data Reader Task Editor dialog box

A WMI query is used to select the desired information. The output of the WMI query can be returned as either a table, a set of property values, or a set of property name/value pairs. The output can be stored in a file, specified by a File Connection Manager, or in a package variable.
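For example, a WQL query such as the following (a hypothetical query, not the one shown in the figure) returns the size and free space of the C: drive:

 SELECT FreeSpace, Size, DeviceID FROM Win32_LogicalDisk WHERE DeviceID = 'C:'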

image from book
Figure 7-47: The WMI Event Watcher Task Editor dialog box

A WMI query is used to specify the event we want to watch for. We can then specify the desired behavior when the event occurs. The ActionAtEvent property determines if the event is simply logged or if it is logged and, at the same time, fires the Integration Services WMIEventWatcherEventOccurred event. If this Integration Services event is being fired, we can then provide an event handler. (This is covered in the section "Event Handlers.") The AfterEvent property determines if this task exits with a success status after the event occurs, exits with a failure status after the event occurs, or waits for another event. If we are waiting for multiple occurrences of the event, the NumberOfEvents property specifies how many occurrences to wait for before exiting the task. The Timeout property lets us specify how many seconds to wait for the event to occur. The ActionAtTimeout and AfterTimeout properties function similarly to their event counterparts.
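As a hypothetical example of such a query, the following WQL watches, polling every 60 seconds, for the free space on any logical disk dropping below roughly 1 GB:

 SELECT * FROM __InstanceModificationEvent WITHIN 60
 WHERE TargetInstance ISA 'Win32_LogicalDisk'
 AND TargetInstance.FreeSpace < 1073741824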

image from book The XML task enables us to manipulate XML documents. Using the XML task, we can perform the following operations:

  • validate an XML document using an XML Schema Definition (XSD) or a Document Type Definition (DTD)

  • apply an XSL Transformation (XSLT)

  • apply an XPath query

  • merge two XML documents

  • find the difference between two XML documents (Diff operation)

  • apply the output of a Diff operation

The XML source document can be supplied from a File Connection Manager in the Connections tray, a package variable, or directly input as a string. The XML document that results from the operation can be saved in a text file specified by a File Connection Manager, saved in a package variable, or discarded. The XML Task Editor dialog box is shown in Figure 7-48.

image from book
Figure 7-48: The XML Task Editor dialog box

Custom Tasks

If none of the control flow tasks included with Integration Services fits your needs, it is possible to write your own. You can create a .NET assembly that can integrate seamlessly with both Integration Services and the Business Intelligence Development Studio. Creating a custom task requires a solid knowledge of object-oriented programming in the .NET Framework. Consult the SQL Server Books Online "Custom Tasks [SSIS]" topic for more information on developing a custom task.
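The following is a minimal sketch of the shape such a custom task can take, assuming a reference to the Microsoft.SqlServer.ManagedDTS assembly. The task name, property, and body are invented for illustration, and the exact attribute parameters and override signatures should be confirmed against the Books Online topic cited above:

 Imports Microsoft.SqlServer.Dts.Runtime

 <DtsTask(DisplayName:="Folder Cleanup Task")> _
 Public Class FolderCleanupTask
     Inherits Task

     Private m_folderPath As String = String.Empty

     ' A custom property that appears in the task's property grid.
     Public Property FolderPath() As String
         Get
             Return m_folderPath
         End Get
         Set(ByVal value As String)
             m_folderPath = value
         End Set
     End Property

     ' Validate can also be overridden to check the configuration at design time.
     Public Overrides Function Execute(ByVal connections As Connections, _
             ByVal variableDispenser As VariableDispenser, _
             ByVal componentEvents As IDTSComponentEvents, _
             ByVal log As IDTSLogging, _
             ByVal transaction As Object) As DTSExecResult
         ' The task's work goes here: delete old files from m_folderPath,
         ' call a service, or whatever the package requires.
         Return DTSExecResult.Success
     End Function
 End Class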

Data Flow—Data Flow Sources

The Data Flow Tab Toolbox is divided into three areas: Data Flow Destinations, Data Flow Transformations, and Data Flow Sources. (We will not be concerned with the General area of the Toolbox, which is empty.) Each Data Flow item—source, transformation, or destination—has an editor dialog box that lets us configure that component in a straightforward manner. The editor dialog box is launched by double-clicking the item in the design area or right-clicking the item and selecting Edit from the Context menu. Each item also has an advanced editor dialog box. Where the editor dialog box enables us to configure the item for use and make most of the necessary property settings for the item, the advanced editor dialog box contains all of the properties of an item. The advanced editor dialog box lets us set any property for the item. The advanced editor dialog box is launched by right-clicking the item and selecting Show Advanced Editor from the Context menu.

image from book The DataReader source lets us use ADO.NET to connect to a data source. We need to create an ADO.NET Connection Manager in the Connections tray for use with this data source. Once the ADO.NET Connection Manager has been created, this Connection Manager can be selected on the Connection Managers tab of the Advanced Editor for DataReader Source dialog box. A query is then entered for the SqlCommand property on the Component Properties tab. This is shown in Figure 7-49. The query determines the data to be selected as the content of this data source. The columns included in the result set of our query are automatically mapped to columns of the data source.

image from book
Figure 7-49: The Component Properties tab of the Advanced Editor for DataReader Source dialog box

image from book The Excel source enables us to utilize data from an Excel spreadsheet. The Excel Source Editor dialog box is shown in Figure 7-50. We need to create an Excel Connection Manager in the Connections tray for use with this data source. Once the Excel Connection Manager has been created, this Connection Manager can be selected on the Connection Managers page of the Excel Source Editor dialog box. A spreadsheet is then selected from the Excel file pointed to by this Connection Manager. The columns included in the result set are selected on the Columns page of the Excel Source Editor dialog box.

image from book
Figure 7-50: The Excel Source Editor dialog box

image from book The Flat File source lets us utilize data from text files. We need to create a Flat File Connection Manager in the Connections tray for use with this data source. In the "Learn By Doing" section earlier in this chapter, we saw how to create an OLE DB Connection Manager. The Flat File Connection Manager requires a bit more configuration than the OLE DB Connection Manager and most of the other Connection Managers. The other Connection Managers deal with specific data formats, so not a lot of configuration is required beyond the connection information. A flat file, on the other hand, can represent, format, and delimit the data in a number of different ways.

The first page of the Flat File Connection Manager, the General page shown in Figure 7-51, enables us to specify the file location in File Name. In addition, we need to tell the Connection Manager how the data is represented with the Locale and Code Page drop-down lists along with the Unicode check box. Finally, we need to specify how the data is formatted: whether the data is a fixed-width layout or delimited by a comma, tab, or some other separator (Format); if the text is enclosed in some form of quotation character (Text Qualifier); how the header row is delimited (Header Row Delimiter); if there are header rows to skip (Header Rows to Skip); and if the column names are in the first row (Column Names in the First Data Row).

image from book
Figure 7-51: The General page of the Flat File Connection Manager Editor dialog box

In this example, we are using the ragged-right format. The ragged-right format expects fixed-width columns in the text file. This means, for example, that the first 8 characters of each line are always the first column, the next 34 characters are always the second column, and so on. The ragged-right format expects some type of end-of-line indicator, such as a set of carriage return and line feed characters. The fixed-width format also expects fixed-width columns, but it specifies the exact number of characters in each line, rather than using an end-of-line indicator. The length of the lines in the ragged-right format can vary because the end-of-line indicator tells the parser where to end one line and begin another. The length of the lines in the fixed-width format must be exactly the same for all lines.

The warning at the bottom of the dialog box in Figure 7-51—"Columns are not defined for this connection manager"—tells us that we are not quite done yet. As soon as we select the Columns page, shown in Figure 7-52, the Flat File Connection Manager uses the information from the General page to access the file. When the delimited format is selected on the General page, the Flat File Connection Manager attempts to figure out what the delimiter is and parse the file accordingly. If the wrong delimiter is chosen, you can edit the delimiter information. When the fixed-width format or, as in our example, the ragged-right format is used, we are given the opportunity to graphically specify the width of each column.

image from book
Figure 7-52: The Columns page of the Flat File Connection Manager Editor dialog box

We specify column widths by creating column markers in the Source Data Columns area. The column marker is indicated by the mouse pointer in Figure 7-52. Column markers are created by clicking along the bottom of the ruler or below the ruler in the sample text area. We can drag the column markers to their appropriate positions, if necessary. We can remove an unneeded column marker by double-clicking it or by right-clicking it and selecting Delete from the Context menu.

The Advanced page of the Flat File Connection Manager, shown in Figure 7-53, lets us specify additional information about each column that was defined on the Columns page. We can create, add, and delete column definitions here, as well as on the Columns page. The drop-down list alongside the New button, indicated by the mouse pointer in Figure 7-53, enables us to insert a column definition either before or after the column definition highlighted in the list. We can specify the Name property for each column definition. (If column names are supplied as the first row of the text file, these names appear in the Name property for each column definition.) The Suggest Types button instructs the Flat File Connection Manager to suggest a data type for each column definition based on the data in the text file. The final page in the Flat File Connection Manager Editor dialog box lets us preview the data from the text file when it is parsed according to our file layout and column definitions.

image from book
Figure 7-53: The Advanced page of the Flat File Connection Manager Editor dialog box

Once the Flat File Connection Manager has been created, this Connection Manager can be selected on the Connection Manager page of the Flat File Source Editor dialog box. We can then use the Columns page to specify which columns should be used in the data flow. The Error Output page enables us to specify the behavior of the Flat File source when an error or truncation occurs in any given column. We can ignore the failure, direct the row containing the failure to the error output, or cause the process to fail.

image from book
Figure 7-54: The Connection Manager page of the OLE DB Source Editor dialog box

The Connection Manager page is also used to specify whether we are getting our data directly from a table or view, or from a SQL query. If the data is coming from a table or a view, the Name of the Table or the View property is used to select that table or view in the database. Alternately, the table or view name can come from a package variable at run time.

If the data is coming from a SQL query, we have several ways to specify the content of the query. The simplest method is to type the query into the SQL Command Text property. This works only if we are comfortable with SQL query syntax and have the database structure memorized or an up-to-date copy of the database schema documentation handy. If this is not the case, we can create the query in the SQL Server Management Studio or some other query-authoring environment and save the completed query to a file. We can then load the query from the file by clicking Browse.

Alternatively, we can click Build Query to create the query using the Query Builder. The Query Builder is shown in Figure 7-55. The top quarter of the Query Builder is the Diagram pane. The Diagram pane area is used for selecting tables from the database, creating join conditions between tables, and selecting fields to be used in the query, either in the field list or in the filter conditions. The next quarter of the Query Builder, moving downward, is the Grid pane. The Grid pane is used to select the fields to be included in the field list (the Output column), as well as for setting the sort order and filter conditions. The next quarter, again moving downward, is the SQL pane. The SQL pane shows the SQL SELECT statement that is equivalent to the contents of the Diagram and Grid panes. Changes made directly in the SQL pane are reflected in the Diagram pane and the Grid pane as soon as focus leaves the SQL pane. The bottom quarter of the Query Builder is the Result pane, which shows the results of executing the query.

image from book
Figure 7-55: The Query Builder

The Columns and the Error Output pages function similarly to their counterparts in the Flat File source.

image from book The XML Source enables us to utilize the content of an XML document in the data flow. The XML document can come from a file or from the contents of a package variable. If the XML document is coming from a file, we can specify the file path at design time or obtain the file path from a package variable at run time. We can also specify the location of an XML Schema Definition (XSD) that describes the structure of the XML document. If an XSD is unavailable, click Generate XSD to autogenerate an XSD from the XML document structure. The XML Source Editor dialog box is shown in Figure 7-57.

image from book
Figure 7-56: The Advanced Editor for the Raw File Source dialog box

image from book
Figure 7-57: The XML Source Editor dialog box

The Columns and the Error Output pages function similarly to their counterparts in the Flat File source.

Data Flow—Data Flow Transformations

The data flow transformations are used to modify the data as it moves through the data flow. In most cases, the wide array of data flow transformations makes it possible to change the data into the required format without having to save it to a temporary table or utilize large amounts of custom code. This set of powerful transformations lets us take our data from an ugly duckling to a beautiful swan with just a few click-and-drag operations.

image from book
Figure 7-58: The Aggregate Transformation Editor dialog box

We begin by selecting the columns to participate in the aggregation from the Available Input Columns. We then select the Operation that is applied to each of the selected columns. A column can either be used as a group by or it can be aggregated. The following aggregations are available:

  • Average

  • Count

  • Count distinct

  • Maximum

  • Minimum

  • Sum

The example in Figure 7-58 calculates the sum of the SalesAmount column for each customer represented by the content of the CustomerKey column.

image from book
Figure 7-59: The Audit Transformation Editor dialog box

image from book The Character Map transformation enables us to modify the contents of character-based columns. The modified column can be placed in the data flow in place of the original column, or it can be added to the data flow as a new column. The following character mappings are available:

  • Lowercase: change all characters to lowercase

  • Uppercase: change all characters to uppercase

  • Byte reversal: reverse the byte order of each character

  • Hiragana: map Katakana characters to Hiragana characters

  • Katakana: map Hiragana characters to Katakana characters

  • Half width: change double-byte characters to single-byte characters

  • Full width: change single-byte characters to double-byte characters

  • Linguistic casing: apply linguistic casing rules instead of system casing rules

  • Simplified Chinese: map traditional Chinese characters to simplified Chinese characters

  • Traditional Chinese: map simplified Chinese characters to traditional Chinese characters

Multiple character mappings can be applied to a single column at the same time. However, a number of mappings are mutually exclusive. For example, it does not make sense to use both the lowercase and uppercase mappings on the same column. The Character Map Transformation Editor dialog box is shown in Figure 7-60.

image from book
Figure 7-60: The Character Map Transformation Editor dialog box

The Configure Error Output button displays the Configure Error Output dialog box shown in Figure 7-61. The Configure Error Output dialog box lets us determine the error behavior of this transformation. For the Character Map transformation, the only error condition we need to worry about is a character truncation caused by one of the mappings. For each character mapping, we can choose to ignore truncation, send the row containing the truncation to the error output, or cause the entire transformation to fail.

image from book
Figure 7-61: The Configure Error Output dialog box

image from book
Figure 7-62: The Conditional Split Transformation Editor dialog box

In Figure 7-62, the conditional split has been defined with three branches. If the content of the SalesAmount column for a given row is less than $10, that row is sent to the SmallSales output. If the content of the SalesAmount column is between $10 and $50 inclusive, the row is sent to the MediumSales output. If the content of the SalesAmount column does not fit either of these conditions, the row is sent to the default output named LargeSales. Figure 7-63 shows the three outputs from the Conditional Split Transformation being routed to three different paths through the data flow.

image from book
Figure 7-63: A Conditional Split transformation in a data flow
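The conditions themselves are entered in the Integration Services expression language. For the splits just described, the two expressions might look like the following (a sketch based on the column name used in the example, not the exact text shown in the figure):

 SalesAmount < 10
 SalesAmount >= 10 && SalesAmount <= 50

Any row that satisfies neither expression is sent to the default output, LargeSales.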

The Configure Error Output button displays the Configure Error Output dialog box where we determine the error behavior of this transformation. In addition to truncation, we also need to configure the behavior when the condition statement fails during evaluation. For each output column defined, we can choose to ignore an error, send the row containing the error to the error output, or fail the entire transformation.

image from book
Figure 7-64: The Copy Column Transformation Editor dialog box

image from book The Data Conversion transformation enables us to convert columns from one data type to another. The converted data can either replace the existing column or be added as a new column. The Configure Error Output dialog box functions the same as it does for the Conditional Split transformation.

Figure 7-65 shows the Data Conversion Transformation Editor dialog box with two columns set up for conversion. The Unicode content of the CustomerPONumber column is replaced by its single-byte equivalent. The currency-type content of the SalesAmount column is converted to a decimal with two places to the right of the decimal point. The decimal value is placed in a new column called SalesAmountAsDecimal.

image from book
Figure 7-65: The Data Conversion Transformation Editor dialog box

image from book
Figure 7-66: The Mining Model tab of the Data Mining Query Transformation Editor dialog box

image from book
Figure 7-67: The Derived Column Transformation Editor dialog box

image from book The Export Column transformation lets us take the content of a text or image column and write it out to a file. The pathname and filename are specified in a different column in the data flow. In this manner, the content of the text or image column can be written to a different file for each row. The Export Column Transformation Editor dialog box, shown in Figure 7-68, can be used to specify whether to truncate the file or append to it if it already exists.

image from book
Figure 7-68: The Export Column Transformation Editor dialog box

image from book The Fuzzy Grouping transformation enables us to find groups of rows in the data flow based on nonexact matches. This is most often used to find possible duplicate rows based on names, addresses, or some other column where the same information may have been entered in different ways. For example, a row for Ms. Kathy Jones, a second row for Ms. Kathryn Jones, and a third row for Ms. Cathy Jones may be three entries for the same person.

The Fuzzy Grouping transformation selects one of the rows in the group as the best candidate for the other rows to be combined into; this is the model row. Once the groups and their model rows have been identified, we can use another transformation to combine any unique information from the nonmodel rows into the model row and delete the nonmodel rows. This removes the duplication from the data. The Fuzzy Grouping transformation identifies row groups and model rows; it does not combine any data or delete any nonmodel rows. This must be done as a separate step in the package or as a manual operation.

The Fuzzy Grouping transformation creates similarity scores between strings. This is done by considering the edit distance between two strings: how many character insertions, deletions, and replacements must be made in one string to produce the other string. Kathy and Kathryn have an edit distance of 2, because we need to insert an r and an n into the first word to get the second. Kathy and Cathy have an edit distance of 1, because we simply need to replace K with C in the first word to get the second.
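To make the idea concrete, the following is a minimal sketch of a classic edit distance (Levenshtein distance) calculation in VB.NET. It is illustrative only; it is not the algorithm Fuzzy Grouping uses internally:

 ' Returns the number of insertions, deletions, and replacements needed
 ' to turn string s into string t.
 Public Function EditDistance(ByVal s As String, ByVal t As String) As Integer
     Dim d(s.Length, t.Length) As Integer

     For i As Integer = 0 To s.Length
         d(i, 0) = i          ' delete all i characters of s
     Next
     For j As Integer = 0 To t.Length
         d(0, j) = j          ' insert all j characters of t
     Next

     For i As Integer = 1 To s.Length
         For j As Integer = 1 To t.Length
             Dim cost As Integer = 1
             If s(i - 1) = t(j - 1) Then cost = 0   ' characters match

             ' Cheapest of: delete from s, insert from t, or replace.
             Dim del As Integer = d(i - 1, j) + 1
             Dim ins As Integer = d(i, j - 1) + 1
             Dim repl As Integer = d(i - 1, j - 1) + cost
             d(i, j) = Math.Min(Math.Min(del, ins), repl)
         Next
     Next

     Return d(s.Length, t.Length)
 End Function

EditDistance("Kathy", "Kathryn") returns 2 and EditDistance("Kathy", "Cathy") returns 1, matching the examples above.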

In addition to edit distance, Fuzzy Grouping also uses information such as the frequency of character occurrence and the positions of the characters, among other things, to increase its accuracy when creating similarity scores. Fuzzy Grouping is great at detecting character transpositions and other common spelling errors. All of the algorithms that create the similarity scores are language-independent, so Fuzzy Grouping can work with any language we care to throw at it.

Note

Fuzzy Grouping assumes a string that is all uppercase is an acronym. Because even acronyms that have a close edit distance, say, FTC and FCC, are not likely to be the same thing, it is unlikely that groups will be identified for strings in all uppercase. If your data was entered in all uppercase (this can be especially true for legacy data), it is important to use the Character Map transformation to change the string to lowercase before trying a Fuzzy Grouping transformation.

On the Columns tab of the Fuzzy Grouping Transformation Editor dialog box, shown in Figure 7-69, we select the columns to use when looking for groupings. These columns can be set for fuzzy matching that uses similarity scoring or for exact matching. In the example in Figure 7-69, the content of the Title column—Mr. or Ms.—must match exactly before we want to even think about trying a fuzzy match on the FirstName and LastName columns. In addition to the exact or fuzzy match setting, we can also specify whether leading or trailing numerals (0-9) are significant or should be ignored. (Selecting Leading from the Numerals drop-down list indicates that leading numerals are significant, and so forth.)

The columns we select for grouping on the Columns page are automatically passed through to the output of the Fuzzy Grouping transformation. For each of the columns used in the grouping, the Fuzzy Grouping transformation adds a "clean" column to the output. These clean columns contain the values from the row that the Fuzzy Grouping transformation has selected as the model row for that group. In our previous example, if the row containing Ms. Kathy Jones is chosen as the model row for the group, the FirstName_clean column would contain Kathy for the Ms. Kathy Jones row, the Ms. Kathryn Jones row, and the Ms. Cathy Jones row.

image from book
Figure 7-69: The Columns tab of the Fuzzy Grouping Transformation Editor dialog box

We can also select columns to be passed through from the input to the output without participating in the transformation by placing a check in the Pass Through column. The Fuzzy Grouping transformation can add three additional columns to the output. The default names for these columns are _key_in, _key_out, and _score. We can change the names of these columns on the Advanced tab of the Fuzzy Grouping Transformation Editor dialog box shown in Figure 7-70.

image from book
Figure 7-70: The Advanced tab of the Fuzzy Grouping Transformation Editor dialog box

The _key_in column contains a unique identifier for each row. This unique identifier is generated by the Fuzzy Grouping transformation. If a row is not grouped with any other rows, or if it is the model row for a group of rows, it has the same identifier in its _key_out column as in its _key_in column. If a row is in a group and is not the model row for that group, it has the identifier of the model row in its _key_out column.

The _score column contains the similarity score assigned to the row. The higher the similarity score, the more confidence the Fuzzy Grouping transformation has in the match. A row that is not grouped or is a model row for a group always has a similarity score of 1. A row that is in a group and is not a model row has a similarity score between 0 and 1, inclusive.
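To make this concrete, the three rows from the earlier Kathy Jones example might produce output like the following. The key values and scores here are purely illustrative; the actual values are generated by the transformation at run time:

 FirstName   FirstName_clean   _key_in   _key_out   _score
 Kathy       Kathy             1         1          1.00
 Kathryn     Kathy             2         1          0.85
 Cathy       Kathy             3         1          0.92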

On the Advanced tab of the Fuzzy Grouping Transformation Editor dialog box, we can set the threshold for the similarity score. Similarity scores that fall below the threshold are not used to create groups. A lower threshold allows the Fuzzy Grouping transformation to find more groups of duplicate rows, but also increases the risk of false positives. A higher threshold makes fewer mistakes, but increases the risk that some actual duplicates will not be found. Some experimenting should be done with your data to determine the best threshold for each Fuzzy Grouping transformation.

The Fuzzy Grouping transformation creates a lookup list of items and does fuzzy matching on that lookup list to create the groupings. If we are dealing with the processing of a data flow with a large number of rows, the Fuzzy Grouping transformation may need to write the lookup list to disk. The OLE DB Connection Manager selected on the Connection Manager tab provides access to a tempdb database where this temporary lookup list can be written. Because of this necessity to create a lookup list, the Fuzzy Grouping task can be rather slow.

image from book
Figure 7-71: The Reference Table tab of the Fuzzy Lookup Transformation Editor dialog box

Once the match index is configured, we can use the Columns tab to map one or more columns from the data flow to columns in the source lookup table. This is done by dragging a column from the Available Input Columns and dropping it on the corresponding column in the Available Lookup Columns. We can then set the behavior of the transformation using the Advanced tab. The Advanced tab enables us to specify the maximum number of source table records we want the transformation to find for each input row. This is usually one. We can also specify the similarity score threshold for this transformation. This works on the same scale as the Fuzzy Grouping transformation.

The Fuzzy Lookup transformation adds one column to the output for each column from the source lookup table that is used in the lookup. In addition to this, the _Similarity and _Confidence columns are also added. The _Similarity column shows the similarity score for the lookup done for that row. The _Confidence column shows how much confidence to place on the lookup that was just completed. Like the similarity score, the confidence figure is a number between 0 and 1, inclusive. A confidence level of 1 represents a near certainty that a value is correct. A confidence level of 0 indicates no confidence in the match at all.

image from book
Figure 7-72: The Advanced Editor for Import Column dialog box

image from book The Lookup transformation works similarly to the Fuzzy Lookup transformation. The difference is the Lookup transformation requires exact matches, rather than using similarity scores. The selection of the source lookup table and the column mapping is done much the same way as for the Fuzzy Lookup transformation.

One difference between the Lookup transformation and the Fuzzy Lookup transformation is found on the Advanced tab of the Lookup Transformation Editor dialog box, as shown in Figure 7-73. When the Lookup transformation executes, it loads the entire source lookup table into memory for faster processing by default. If the source lookup table is too large to be completely loaded into memory, we can set a restriction on the amount of memory used. In Figure 7-73, the memory usage has been restricted to 5 MB.

image from book
Figure 7-73: The Advanced tab of the Lookup Transformation Editor dialog box

In addition, if only a portion of the records in the source lookup table are needed to resolve the lookups for a given Lookup transformation, we can load only the required portion of the source lookup table into memory. This is done by modifying the caching SQL statement and adding a WHERE clause. In Figure 7-73, a WHERE clause has been added so only the urban streets in the Street lookup table are loaded into memory. The additional WHERE clause is the highlighted portion of the Caching SQL statement.

The Configure Error Output dialog box lets us determine whether an unresolved lookup is ignored, sent to the error output, or causes the transformation to fail.

image from book
Figure 7-74: The Merge Transformation Editor dialog box

All of the rows in both of the input data flows are presented in the merged output. For example, say 450 rows are in the first input data flow and 375 rows in the second input data flow. There will be 825 rows in the output data flow.

image from book The Merge Join transformation enables us to merge two data flows together by executing an inner join, a left outer join, or a full outer join. As with the Merge transformation, both of the input data flows must be sorted. With the Merge Join transformation, both of the data flows must be sorted by the columns to be used as the join condition.

Figure 7-75 shows the Merge Join Transformation Editor dialog box. In this example, an inner join is being used to join the Address Table data flow with the Street Lookup List data flow. The join key requires the StreetName column from the Address Table to match the CompleteName column from the Street Lookup List. Because an inner join is being used, only those rows that contain a match are included in the output.

image from book
Figure 7-75: The Merge Join Transformation Editor dialog box

If a left outer join were used in the example in Figure 7-75, all of the rows from the Address Table data flow would be included in the output, even if a matching row in the Street Lookup List data flow were not found. (The Address Table data flow was identified as the left input flow when it was connected to the Merge Join transformation.) If a full outer join were used in the example in Figure 7-75, all of the rows from both the Address Table data flow and the Street Lookup List data flow would be included in the output. The output of the Merge Join transformation is always sorted on the join key column or columns.

image from book
Figure 7-76: The Multicast Transformation Editor dialog box

image from book The OLE DB Command transformation enables us to execute a SQL statement for each row in the data flow. We select an OLE DB Connection Manager to determine where the SQL statement is to be executed. We then enter the SQL statement that is to be executed. Question marks can be used to create a parameterized query, as shown in the Advanced Editor for OLE DB Command dialog box in Figure 7-77. Column values from the data flow can then be fed into these parameters as the SQL statement is executed for each row in the data flow. In the example in Figure 7-77, the retail price of existing product records is updated based on information in the data flow.

image from book
Figure 7-77: The Advanced Editor for OLE DB Command dialog box
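As a hypothetical illustration of a parameterized statement (the table and column names here are invented, not taken from the figure), the SqlCommand property might contain:

 UPDATE Product SET RetailPrice = ? WHERE ProductCode = ?

The first question mark is then mapped to the column in the data flow that supplies the new price, and the second to the column that identifies the product, using the column mappings in the editor.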

image from book
Figure 7-78: The Percentage Sampling Transformation Editor dialog box

Two outputs are created from the Percentage Sampling Transformation item. The first output contains the rows that were selected to be part of the sample. In the example in Figure 7-78, 10% of the total rows in the input data flow are selected to be part of the sample. These rows are sent to the first output, called Mining Model Training Output in this example. The second output contains the rows that were not selected for the sample. In this case, the remaining 90% of the total rows in the input data flow are to be sent to the second output, called Mining Model Validation Output.

The Percentage Sampling transformation selects rows at random from the input data flow. This random sampling provides a more representative sample of the entire data flow than simply selecting the top N rows. Because of the workings of the random sampling algorithm, the number of rows included in the sample is going to be close to the percentage specified, but it may not hit it exactly.

image from book The Pivot transformation enables us to take normalized data and change it into a less normalized structure. This is done by taking the values of one or more columns in the input data flow and using them as column names in the output data flow. The data in these newly created columns is calculated by taking an aggregate of the contents of another column from the input data flow. An aggregate must be used because a number of rows from the input data flow may define a single row in the output data flow. Figure 7-79 shows the Advanced Editor for Pivot dialog box.

image from book
Figure 7-79: The Advanced Editor for Pivot dialog box

image from book
Figure 7-80: The Advanced Editor for Row Count dialog box

image from book The Row Sampling transformation enables us to split the data flow into two separate data flows based on the number of rows desired. The Row Sampling transformation works in the same manner as the Percentage Sampling transformation. The only difference is the Row Sampling transformation determines the number of rows in the sample based on a requested row count, rather than a requested percentage. The Row Sampling Transformation Editor dialog box is shown in Figure 7-81.

image from book
Figure 7-81: The Row Sampling Transformation Editor dialog box

image from book The Script Component transformation lets us create .NET code for execution as part of our data flow. Even though the Script Component transformation is in the Data Flow Transformations section of the Toolbox, it can be used as a data source, a data destination, or a data transformation. Our script code could read data from a file format that is not supported by any of the Connection Managers, and then expose those rows as its output data flow. In this case, the Script Component item functions as a data source. Our script code could take rows from its input data flow and write data to a file format that is not supported by any of the Connection Managers. In this scenario, the Script Component item functions as a data destination. Finally, the script code could take rows from its input data flow, modify the data in a way not supported by the other data transformations, and then expose those rows as its output data flow. Here, the Script Component item functions as a data transformation.

The data inputs and outputs that are going to be used in the script, even if they are just going to be passed through, must be defined on the Inputs and Outputs page of the Script Transformation Editor dialog box. This is shown in Figure 7-82. Make sure to use different names for the input and output columns. If we use the same name for both an input and an output, only the output column will be visible to the script code.

image from book
Figure 7-82: The Inputs and Outputs page of the Script Transformation Editor dialog box

In the example shown in Figure 7-82, we have street names from a legacy system that were entered in uppercase. We want to change these to title case (sometimes known as proper case; the first character of each word in uppercase with the remaining characters in lowercase) as part of our Integration Services package. The Character Map transformation provides translations to lowercase and uppercase, but not to title case. We need to create a custom script to accomplish this task.

Clicking Design Script on the Script page of the Script Transformation Editor dialog box displays the Microsoft Visual Studio for Applications dialog box where we can edit the script. This is shown in Figure 7-83. The Input_ProcessInputRow method is executed once for each row in the input data flow. The Row object parameter, which is passed to this method, contains the properties that provide access to both the input and output columns defined on the Inputs and Outputs page.

image from book
Figure 7-83: The Microsoft Visual Studio for Applications dialog box

The sample code simply copies the AddressIDIn and StreetNumberIn values from the input data flow to the AddressIDOut and StreetNumberOut columns in the output data flow. The StreetNameIn column is transformed first to lowercase using the ToLower method, and then to title case using the ToTitleCase method. (ToTitleCase does not work on text that is all uppercase, so we need to use the ToLower method first.) The transformed string is then assigned to the StreetNameOut column.
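A minimal sketch of the logic just described follows. The buffer class and column property names are assumptions; in a real package they are generated from the names defined on the Inputs and Outputs page, so the actual code in Figure 7-83 may differ slightly:

 Imports System.Globalization

 Public Class ScriptMain
     Inherits UserComponent

     Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)
         ' Pass the key and street number through unchanged.
         Row.AddressIDOut = Row.AddressIDIn
         Row.StreetNumberOut = Row.StreetNumberIn

         ' ToTitleCase leaves all-uppercase words unchanged (it treats them
         ' as acronyms), so convert to lowercase first.
         Dim ti As TextInfo = CultureInfo.CurrentCulture.TextInfo
         Row.StreetNameOut = ti.ToTitleCase(Row.StreetNameIn.ToLower())
     End Sub
 End Class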

image from book
Figure 7-84: The Select a Dimension Table and Keys page of the Slowly Changing Dimension Wizard

The next page of the wizard, the Slowly Changing Dimension Columns page, is shown in Figure 7-85. This wizard page enables us to specify how each of the nonkey columns in the dimension table should be treated when it changes. A column such as the Sales_Person_Name column is fixed relative to the salesperson ID. A situation should not occur where a different person assumes the use of this salesperson ID. (Yes, names could change due to marriage or other events, but we will conveniently ignore that fact for this example.) This type of column is marked as a fixed attribute.

image from book
Figure 7-85: The Slowly Changing Dimension Columns page of the Slowly Changing Dimension Wizard

A column such as the Sales_Person_PhoneNumber column is changeable. We do not need to track previous phone numbers used by the salespeople. This type of column is marked as a changing attribute. A column such as the Sales_Person_SCD_Territory_ID column is one whose changes we want to track. This type of column is marked as an historical attribute.

The next page of the wizard, the Fixed and Changing Attribute Options page, enables us to specify the transformation behavior when a fixed or a changing attribute is modified. We can choose to fail the transformation when a fixed attribute is modified. We can choose to change all historic occurrences of a changing attribute when that attribute is modified.

The Historical Attribute Options page lets us specify the method used to determine which are the historical records and which are the current records. This page is shown in Figure 7-86. The salesperson dimension used in this example has a start date and an end date that determines the current records. We use the system date at the time the new record was created to populate the date fields.

image from book
Figure 7-86: The Historical Attribute Options page of the Slowly Changing Dimension Wizard

The Inferred Dimension Members page lets us specify whether we can infer information for dimension members that do not yet exist. When the wizard completes, it adds a number of additional transformations to the package as shown in Figure 7-87. These additional transformations provide the functionality to make the slowly changing dimension update work properly.

image from book
Figure 7-87: The results of the Slowly Changing Dimension Wizard

image from book The Sort transformation enables us to sort the rows in a data flow. The Sort Transformation Editor dialog box is shown in Figure 7-88. In the example in this figure, the data flow is being sorted first by the StreetType column, and then by the StreetName column. The StreetDirection column is not used as a sort column, but is passed through from the input data flow to the output data flow. The Sort transformation can remove rows with duplicate sort keys if Remove Rows with Duplicate Sort Values is checked.

image from book
Figure 7-88: The Sort Transformation Editor dialog box

image from book The Term Extraction transformation lets us extract a list of words and phrases from a column containing freeform text. The Term Extraction transformation identifies recurring nouns and/or noun phrases in the freeform text, along with a score showing the frequency of occurrence for each word or phrase. This information can then be used to help discover the content of unstructured, textual data.

The Term Extraction transformation has two limitations. First, it only works with Unicode data (the DT_WSTR or DT_NTEXT Integration Services data types). This limitation is easily overcome by using the Data Conversion transformation to convert single-byte strings to Unicode. The second limitation is that the Term Extraction transformation only works with English language text. The term extraction algorithms require knowledge of the structure and syntax of the language they are parsing. Currently, this intelligence is only available for English.

On the Term Extraction tab of the Term Extraction Transformation Editor dialog box, we simply select the column that contains the freeform text, and then specify the name of the term and score output columns. The Exclusion tab enables us to specify a database table or view that contains a list of words or phrases to exclude from the term list. The Advanced tab, shown in Figure 7-89, lets us configure the operation of the term extraction, including whether nouns, noun phrases, or both are included in the extraction and the number of occurrences that must be found before a word or phrase is added to the list.

image from book
Figure 7-89: The Advanced tab of the Term Extraction Transformation Editor dialog box

image from book The Term Lookup transformation enables us to look for occurrences of a set of words or phrases in a data flow column containing freeform text. The Term Lookup transformation functions almost identically to the Term Extraction transformation. The big difference is this: the Term Lookup transformation starts with a table of terms to look for in the freeform text; whereas the Term Extraction transformation creates its own list on the fly.

The Term Lookup transformation creates one row in the output data flow for each term found in the table of terms and in the text column. If the text column contains more than one term, more than one output row is created. For example, if a text column contains three of the terms from the lookup table, three rows are created in the output data flow. The same restrictions of Unicode text only and English only that applied to the Term Extraction transformation also apply to the Term Lookup transformation. The Term Lookup tab of the Term Lookup Transformation Editor dialog box is shown in Figure 7-90.

image from book
Figure 7-90: The Term Lookup tab of the Term Lookup Transformation Editor dialog box

image from book
Figure 7-91: The Union All Transformation Editor dialog box

image from book The Unpivot transformation takes a denormalized structure and changes it back into normalized rows, the reverse of the Pivot transformation. For example, an input data flow might contain the following rows:

 Customer ID   HomePhone      WorkPhone      CellPhone      Fax
 3843          891-555-2443   891-555-9384                  891-555-2923
 4738          891-555-9384                  891-555-3045

image from book
Figure 7-92: The Unpivot Transformation Editor dialog box

The output data flow would contain the following rows:

 CustomerID   PhoneNumberType   PhoneNumber
 3843         HomePhone         891-555-2443
 3843         WorkPhone         891-555-9384
 3843         Fax               891-555-2923
 4738         HomePhone         891-555-9384
 4738         CellPhone         891-555-3045

Data Flow—Data Flow Destinations

Now that we have all this data transformed into exactly the right format, we need to do something with it. This is the job of the data flow destinations. Let's take a look at these options for storing our data.

image from book The Data Mining Model Training destination lets us use a data flow to train a data mining model. Training a data mining model prepares the model for making predictions by gaining knowledge from a set of sample data. We discuss data mining models and their care, feeding, and training in Part IV of this book.

image from book The Excel destination enables us to send a data flow to an Excel spreadsheet file. We need to create an Excel Connection Manager in the Connections tray for use with this data destination. Once the Excel Connection Manager is created and selected, we can map the columns of the data flow to the fields in the data destination.

image from book The Flat File destination lets us send a data flow to a text file. We need to create a Flat File Connection Manager in the Connections tray and define the columns that are in the text file. Once the Flat File Connection Manager is created and selected, we can map the columns of the data flow to the fields in the data destination.

image from book The OLE DB destination enables us to send a data flow to an OLE DB-compliant database. We need to create an OLE DB Connection Manager in the Connections tray for use with this data destination. Once the OLE DB Connection Manager is created and selected, we can map the columns of the data flow to the fields in the data destination.

image from book The Partition Processing destination lets us send a data flow to process a partition. By using this destination, we can provide new values for a partition in an Analysis Services cube. Columns from the data flow are mapped to the items in the partition. We discuss Analysis Services partitions in Chapter 9.

image from book The Raw File destination enables us to write a data flow to a raw data file. The Raw File format is the native format for Integration Services. Because of this, raw files can be written to disk and read from disk very rapidly. One of the goals of Integration Services is to improve processing efficiency by moving data from the original source to the ultimate destination without making any stops in between. However, on some occasions, the data must be staged to disk as part of an extract, transform, and load process. When this is necessary, the Raw File format provides the most efficient means of accomplishing this task.
