Transforming data with DTS involves planning and designing the transformation and creating and executing a DTS package. This lesson takes you through this process. The DTS Designer allows you to define complex workflows with multiple tasks based on conditional precedence constraints. In this lesson, you will learn how tasks, connections, data transformations, and workflows are represented and manipulated in the DTS Designer. You will also learn how to schedule packages for execution and how to execute them manually.
After this lesson, you will be able to:
- Describe the design and planning steps taken before using DTS
- Describe the data transformation process used by DTS
- Use DTS Designer to create, edit, and save complex data transformations
- Execute DTS packages and schedule DTS packages for execution
Estimated lesson time: 120 minutes
Data transformation involves formatting and modifying data that is extracted from the data source into merged or derived values that are more useful at the destination. New values can easily be calculated from one or more columns in the source rowset, and a single source column can be decomposed into multiple destination columns.
DTS allows you to specify the attributes of the destination columns and to indicate how the source columns are mapped to the destination columns. Transformation flags specify whether data of one type in the source can be converted to another type in the destination. For example, you can allow data type promotion, such as converting 16-bit integers to 32-bit integers, or data type demotion, such as converting 32-bit integers to 16-bit integers (data may be lost in this case). You can also require an exact match between source and destination data types.
Each database defines its own data types as well as its column and object naming conventions. DTS attempts to define the best possible data type matches between a source and a destination. However, you can override DTS mappings, specifying different destination data type, size, precision, and scale properties.
While creating your data warehouse databases, you will frequently need to combine data from multiple sources into a single table. At other times you may need to divide a data source into two or more separate tables in your data warehouse database especially when pulling data from nonrelational sources. You can merge and separate data in several ways.
At the File Level
You can combine information from multiple sources and place it into a single table, or you can place information from a single source into multiple tables. Heterogeneous joins are an example of combining multiple sources into a single result set that is saved at the destination. Heterogeneous joins are fully supported in SQL Server; they enable you to retrieve data from different databases located on different servers in the same SQL query. The source databases for a heterogeneous join can be any database for which you have an ODBC driver or OLE DB provider.
At the Column Level
You can combine information from multiple columns and place it into a single column, or you can place information from a single column into multiple columns. For example, you can summarize monthly sales totals for each product, or you can decompose a phone number that is stored in one column in order to store the area code in one column and the phone number in another.
When you use DTS to transfer and transform data, you create DTS packages that describe all of the work to be performed as part of the transformation process. You can create the DTS package interactively, using the DTS wizards or DTS Designer, or programmatically, using a language that supports OLE Automation, such as Visual Basic.
Saving the DTS package allows you to modify it, reuse it, or schedule it for later execution. If you do not save the DTS package, it executes immediately. You must save the DTS package if you want to schedule it for later execution.
You can save a DTS package in one of three locations:
To SQL Server
Packages saved to SQL Server are referred to as local packages and are stored in the msdb database. Local packages are the most efficient and are available to other SQL Servers. They are listed in the Local Packages node under Data Transformation Services in the console tree.
To Microsoft Repository
The Microsoft Repository is a database that stores descriptive information about software components and their relationships. The items in the repository are available to repository tools through a set of published COM interfaces and information models that define database schema and data transformations through shared metadata. This repository makes it possible for tool vendors to build tools that interact with databases and software components without advance knowledge of the databases or software components. Impromptu and PowerPlay from Cognos are two examples of third party tools that currently support Microsoft Repository. You can find examples of how tool vendors are supporting the Microsoft Repository at msdn.microsoft.com/repository.
Saving a DTS package in the Microsoft Repository makes package metadata reusable and available to other applications. Using the Microsoft Repository also enables you to track data lineage at the package and row levels of a table, which allows you to determine the source of each piece of data and the transformations that are applied to it. Tracking data lineage is important for data warehousing, making the Microsoft Repository the preferred location for saving DTS packages used for transferring data to a data warehouse.
When you save a package to the Microsoft Repository, the package is stored in the msdb database, but package metadata can be imported into the repository. Packages stored in this way are listed in the Repository Packages node under Data Transformation Services in the console tree. After you import the package metadata into the repository, you can view it using the Metadata Browser in the Metadata node under Data Transformation Services in the console tree. The Metadata Browser lists information about each version of a package and the executions of that package.
To File
Saving a DTS package in a COM-structured storage file makes it easy to distribute the DTS package using e-mail or network file servers. DTS packages saved as files do not appear in the SQL Server Enterprise Manager console tree. To open a package file for editing, right-click Data Transformation Services in the console tree, point to All Tasks, and click Open Package.
You can password protect DTS packages that are saved to SQL Server or to COM-structured storage files in order to protect sensitive information. You cannot password protect packages that are stored to the Microsoft Repository because information in the repository is intended to be used by any repository tool. Currently, the repository does not provide security, but it may in future releases. When a DTS package is password protected, it is automatically encrypted. All of the package s collections and properties are encrypted, except for the package name, description, ID, version, and creation date.
DTS packages support two levels of security: owner password and user password.
Owner Password
This level of security provides complete access to all collections and properties. By default, DTS packages without owner passwords are not encrypted and can be read by any user with access to the package. If you specify an owner password for a package, you cannot view or edit the package without the owner password.
User Password
This level of security allows a user to execute but not edit or view the package definition. If a user password is specified, an owner password must be supplied.
A package has two IDs: a package ID and a version ID. The package ID is generated when the package is created. The version ID is assigned when the package is created, and it is given a new value every time that the package is updated or saved. Package versioning makes it possible for SQL Server to track which version of a package was used to perform a specific transformation.
With DTS, you can define a workflow that controls the execution sequence of each step. Control of flow logic and conditional processing is achieved using precedence constraints. DTS tasks can also be prioritized. This section discusses use of DTS Designer to edit and customize packages.
A DTS package is composed of one or more steps, which you can define. A step defines a unit of work that is performed as part of the transformation process. A step can
For example, you can create a DTS package that summarizes sales information for each product for a given month. The DTS package drops (if the table already exists) and creates the table at the destination, gets data from the source connection (OLE DB), processes the transformation (summarization), and finally sends the data to the destination connection (OLE DB).
Workflows control the flow of execution for the package. A package is made up of data connections (sources and destinations) and tasks that are performed using those connections. The action of executing a task is controlled by a step. A step can be subject to one or more precedence constraints. A step with no precedence constraints executes immediately. If a step has precedence constraints, it cannot execute until all of its precedence constraints have been satisfied.
Note the following about working with steps and precedence constraints in DTS Designer:
Figure 8.1 Steps are represented by task icons and data transformation arrows
Figure 8.2 Precedence constraint arrows point from a source step to a destination step
To make your DTS Designer diagrams easier to read, you can add the same connection to the diagram more than once. When you do so, specify it as an existing connection. If a connection is involved in more than one transformation, it is recommended that you add the connection once for each transformation. If you do not, the diagram will have a single data connection icon with a large number of arrows (both data transformations and precedence constraints) pointing to and from it.
Figure 8.3 shows a DTS package with two data connections. In this example, there are two data transformations. One of the data transformations must complete before the other one can start, but it is not possible to tell from the diagram which data transformation must complete first.
Figure 8.3 Example of a difficult to read DTS diagram
Figure 8.4 shows a DTS package that performs exactly the same functions as the package in Figure 8.3. However, the two data connections have each been added to the diagram a second time. Now it is perfectly clear from the diagram which data transformation must complete first.
Figure 8.4 Example of adding data connections to make the DTS diagram easy to read
TIP
Two items in DTS Designer can easily be confused.The terms source and destination are used to refer to both data connections and steps. When working with a source or a destination, always make sure that you know whether it is a source or destination connection or a source or destination step.
In addition, arrows are used to represent both precedence constraints and data transformations. When working with an arrow in a DTS diagram, make sure that you know whether it is a dashed constraint or a solid data transformation arrow.
Workflow Properties
Steps are defined using the Workflow Properties dialog box. To open this dialog box, right-click on the destination step (either the task icon or the data transformation arrow if the step is a data transformation). Then select Workflow Properties from the context menu. (You may need to point to Workflow to get to the Workflow Properties option.)
The Precedence tab of the Workflow Properties dialog box allows you to add precedence constraints to or remove them from the step. The order of precedence constraints in the list is not important. To add precedence constraints, you can also select the source and destination steps, right-click the destination step, and then click the desired precedence constraint on the Workflow pop-up menu. The Options tab of the Workflow Properties dialog box allows you to set workflow properties such as task priority and transaction management for the task of the destination step.
Precedence Constraint Types
The type of precedence constraint you select determines the requirement for executing a step:
NOTE
It is important to keep in mind that success and failure are determined by the execution of the step, not by the results returned by a step. A select statement that returns no data can still complete successfully.
Precedence constraints create a finish/start relationship between the step being executed (the source step) and the step that will be executed next (the destination step). Given two steps A and B, a precedence constraint says:
Step B cannot start until Step A finishes
not
If Step A finishes, Step B will start
This relationship is important when multiple steps are involved, because a step may have many precedence constraints that all must be satisfied before it can execute. A common example of this concept is an automated load of a data mart database (from a data warehouse database) that is dependent upon all steps that load the data warehouse database completing successfully first.
Controlling Step Execution
Under the control of precedence constraints, steps execute in sequence, in parallel, or in a combination of these:
For example, data can be loaded into a table (Step B) only after the table has been successfully created (Step A).
For example, a package can load data from multiple heterogeneous data sources into separate tables simultaneously.
For example, a package can load data from several heterogeneous files into a set of tables. The loading of data can be done in parallel, but only after the creation of the tables.
Figure 8.5 shows a package with six steps that demonstrate a combination of parallel and sequential execution. Steps A, B, and C do not have precedence constraints and execute immediately. Step D waits for Step A to complete successfully before executing. Step E waits for Step D to complete with a failure before executing. Step F waits for Step C to complete (with success or failure) before executing. Steps A, D, and E execute sequentially. Steps C and F execute sequentially. Steps A (and its sequential steps), B, and C (and its sequential step) execute in parallel.
Figure 8.5 A package with steps that execute both in parallel and sequentially
Figure 8.6 shows how one step can have multiple precedence constraints. In this case, Step C can execute only when both Step A and Step B have successfully completed.
Figure 8.6 A step with multiple precedence constraints
Conditional processing using basic IF-THEN-ELSE logic allows a DTS package to respond to run-time conditions that vary. To implement conditional processing, use a combination of steps with Success and Failure precedence constraints, as shown in Figure 8.7. In the figure, Step B executes only if Step A completes successfully. Step C executes only if Step A fails. Step C would typically send a notification such as an e-mail message or would take corrective action that is in turn the condition for another precedence constraint that then allows the original step to be repeated.
Figure 8.7 Conditional processing using precedence constraints
By taking advantage of the Windows NT threading model, DTS allows you to specify the priority of tasks. By default, each thread within the DTS package executes at the same priority as the DTS package. However, if some operations are more time-critical than others, you can assign an execution priority to each step. You can specify each step to execute at Idle, Normal, or High priority.
Text Annotations
You can also add text annotations (labels) to your DTS package to make the package more readable in the DTS Designer.
You can use a lookup query instead of a complicated CASE statement in an ActiveX script to transform codes into text or to map values from one format to another. As shown in Figure 8.8, a lookup query uses a value in the source data to retrieve related data from another query.
Figure 8.8 Using a lookup query in a data transformation
When to Use a Lookup Query
You can use a lookup query when you want to
For example, suppose that your destination table must include the full name for a customer s state; however, the data for state in the source table contains the postal abbreviation (for example, NY rather than New York). A lookup table in a database on a different server contains the name of each state and its postal abbreviation.
You can write a transformation to
You should use a lookup query when
Implementing a Lookup Query
When defining a lookup query as part of a DTS data transformation, follow these guidelines:
The lookup query will be a parameterized query of the form:
Syntax
SELECT column_name FROM lookup_table WHERE lookup_key = ? |
The column_name is the column that you want to retrieve from the lookup_table, and the lookup_key is the column on which you are searching in the lookup table. The question mark serves as a placeholder for the data column that you want to import and transform. Lookup queries also can contain multiple keys and parameters.
Map the source and destination columns involved in the lookup with an ActiveX script. The ActiveX script calls the lookup query for every row in the source table.
Each DTS package is self-contained after you create it. A package is a complete description of all of the work to be performed as part of the transformation process. This section describes how to execute and schedule a DTS package that you have created.
After you save a DTS package, you have three options for retrieving and executing it. These three options are as follows:
The example given here shows the dtsrun command prompt utility being used to execute a DTS package that creates and populates a summary table in the Northwind database on the SQL Server named SQLSERVER. The /U option specifies the sa login. If the specified login has a password, it must be specified with the /P option. Note that the name of the package is enclosed in double quotes; this is necessary if the name contains spaces. Remember that the DTS package is a complete description of all of the work to be performed as part of the transformation process.
dtsrun /SSQLSERVER /Usa /N"Northwind Product Totals" |
You can schedule a saved DTS package for execution at a specific time, such as daily at midnight, or at recurring intervals, such as on the first or last day of the month or weekly on Sunday at 6:00 a.m.
You can schedule a DTS package for execution in the following ways:
In the following series of exercises, you will use the DTS Designer to create a package that loads the Northwind_Mart database with data from the Northwind database.
In this exercise you will create the package, add destination and source data connections, and add a SQL task to clear the data warehouse tables.
At this point, your workspace has two destination data connections. Now you will make a connection that will be the source for each table that you want to populate.
NOTE
When working in production we recommend that you save the package at regular intervals.
TRUNCATE TABLE Sales_Fact DELETE FROM Employee_Dim DELETE FROM Customer_Dim DELETE FROM Time_Dim DELETE FROM Product_Dim DELETE FROM Shipper_Dim |
NOTE
These Transact-SQL statements will remove all data from the online analytical processing (OLAP) database. You must delete all data from the fact table first to avoid referential integrity errors. You cannot use the TRUNCATE TABLE statement with the dimension tables because they are referenced by the foreign key in the Sales_Fact table. You could add code here to drop the constraints before truncating the tables and then add them back after truncating the tables.
In this exercise, you will connect each of the Northwind dimension table connections to the Northwind_OLAP connection located in the center of the work space, and you will connect the NW-Sales_Fact connection to the second Northwind_OLAP connection located on the right of the work space. You will also add precedence constraints that ensure that data is not loaded into the tables until after they have been cleared and to ensure that data is not loaded into the Sales Fact table until after data has been loaded into the dimension tables.
In this procedure, you will add data transformations that cause data to be transferred from one connection to another.
Both icons should now be selected.
In this procedure, you will add precedence constraints that ensure that the dimension tables are not populated with data until the SQL Script task has completed successfully.
Both icons should now be selected.
This will create a precedence constraint from the Clear Dimension Tables task to the NW-Product connection. The precedence constraint is shown as a dashed green arrow.
NOTE
If you right-click the Clear Dimension Tables SQL Task icon instead of the connection icon, then select On Success, an error is generated. The error says that defining precedences between the selected items is not valid. This is because you cannot define a precedence constraint from the connection to the task; it must be defined from the task to the connection. If you right-click the connection, the error will not be generated.
Next, you need to set the precedence constraint for the fact table. The Sales Fact table must not be populated with data until all of the dimension tables have been populated.
Both icons should now be selected.
In this exercise, you will edit each of the data transformations that you added in the previous exercise. You will create source queries, map sources to destination columns, and specify that data be copied or transformed by an ActiveX script.
In this procedure, you will use queries that are already built. Using Notepad, open C:\SQLDW\Exercise\Ch08\DTSLabSQL.sql. Highlight the query for the Time dimension. Copy the query and paste it into the SQL Query text box on the Data Transformation Properties dialog box. If this file is not available, type in the following Transact-SQL statement:
SELECT DISTINCT S.ShippedDate AS TheDate, DateName(dw, S.ShippedDate) AS DayOfWeek, DatePart(mm, S.ShippedDate) AS [Month], DatePart(yy, S.ShippedDate) AS [Year], DatePart(qq, S.ShippedDate) AS [Quarter], DatePart(dy, S.ShippedDate) AS DayOfYear, 'N' AS Holiday, case DatePart(dw, S.ShippedDate) when (1) then 'Y' when (7) then 'Y' else 'N' end AS Weekend, DateName(month, S.ShippedDate) + '_' + DateName(year,S.ShippedDate) AS YearMonth, DatePart(wk, S.ShippedDate) AS WeekOfYear FROM Orders S WHERE S.ShippedDate IS NOT NULL |
Note that the columns are mapped for you. The mappings are incorrect because your destination table has a key field that is used for uniqueness. You need to correct the column mappings so that the data is inserted into the correct field.
Note that the columns are mapped for you. The mappings are incorrect because your destination table has a key field that is used for uniqueness. You need to correct the column mappings so that the data is inserted into the correct field.
Function Main() If IsNull(DTSSource("Region")) Then DTSDestination("Region") = "Other" Else DTSDestination("Region") = DTSSource("Region") End If Main = DTSTransformStat_OK End Function |
NOTE
The AutoGen button automatically regenerates the default code that was created when the column mapping was made. If you have transformation script code you want to keep, do not click the Auto Gen button.
TIP
The Test button tests the code. Try clicking the test button. The script executes against the data and generates the list of regions. After the test successfully executes, click the View Results button and you can see the list of Regions that have been generated by the script.
You are now interacting with the query-building tools that will allow you to build a query from the tables in the Northwind database.
TIP
If the Diagram pane is not visible, click the Show/Hide Diagram Pane button on the toolbar.
At this point, you should see two tables in the diagram pane with the appropriate relationship set between them. You will now select the fields that will make up your Product dimension.
You will now configure the fields so that they will transform correctly during the load.
In the Grid pane, make the following change:
Note that the columns are mapped for you. The mappings are incorrect because your destination table has a key field that is used for uniqueness. You need to correct the column mappings so that the data is inserted into the correct field.
TIP
When fields in both lists are in the same order, you can quickly map those columns by selecting the fields in each list and then clicking New. Mapping a group of columns together is called a many-to-many column mapping and makes the data transformation more efficient. When you do this, ensure that the fields are in the same order, or values will be copied to the wrong destination columns.
Figure 8.9 The Data Transformation Lookups dialog box showing the Query ellipsis button
You can resize the pane by dragging its border.
SELECT CompanyName FROM Suppliers WHERE (SupplierID = ?) |
Figure 8.10 The completed query in the Data Transformation Services Query Designer
Function Main() DTSDestination("SupplierName") = _ DTSLookups("SupplierLookup").Execute(DTSSource("SupplierID").Value) Main = DTSTransformStat_OK End Function |
TIP
Do not try to test this code; the test does not work for code that uses a lookup query.
Function Main() DTSDestination("EmployeeName") = DTSSource("FirstName") & _ " " & DTSSource("LastName") Main = DTSTransformStat_OK End Function |
Note that the columns are mapped for you. The mappings are incorrect because your destination table has a key field that is used for uniqueness. Correct the column mappings so that the data is inserted into the correct field.
NOTE
DTS will transform data between fields that do not have the same column names.
SELECT Northwind_Mart.dbo.Time_Dim.TimeKey, Northwind_Mart.dbo.Customer_Dim.CustomerKey, Northwind_Mart.dbo.Shipper_Dim.ShipperKey, Northwind_Mart.dbo.Product_Dim.ProductKey, Northwind_Mart.dbo.Employee_Dim.EmployeeKey, Northwind.dbo.Orders.RequiredDate, Orders.Freight * [Order Details].Quantity / (SELECT SUM(Quantity) FROM [Order Details] od WHERE od.OrderID = Orders.OrderID) AS LineItemFreight, [Order Details].UnitPrice * [Order Details].Quantity AS LineItemTotal, [Order Details].Quantity AS LineItemQuantity, [Order Details].Discount * [Order Details].UnitPrice * [Order Details].Quantity AS LineItemDiscount FROM Orders INNER JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID INNER JOIN Northwind_Mart.dbo.Product_Dim ON [Order Details].ProductID = Northwind_Mart.dbo.Product_Dim.ProductID INNER JOIN Northwind_Mart.dbo.Customer_Dim ON Orders.CustomerID = Northwind_Mart.dbo.Customer_Dim.CustomerID INNER JOIN Northwind_Mart.dbo.Time_Dim ON Orders.ShippedDate = Northwind_Mart.dbo.Time_Dim.TheDate INNER JOIN Northwind_Mart.dbo.Shipper_Dim ON Orders.ShipVia = Northwind_Mart.dbo.Shipper_Dim.ShipperID INNER JOIN Northwind_Mart.dbo.Employee_Dim ON Orders.EmployeeID = Northwind_Mart.dbo.Employee_Dim.EmployeeID WHERE (Orders.ShippedDate IS NOT NULL) |
In this exercise, you will set properties of the package to enable saving data lineage and database metadata to the Microsoft Repository. Then you will save the package to the Microsoft Repository, execute the package to load the data warehouse, and view the data transformation metadata in the Metadata Browser.
The DTS Package Properties dialog box should be displayed. If another property s dialog box is displayed, repeat this step, making sure that none of the items in the workspace is selected.
This causes DTS to make the lineage ID available as columns in the data transformations. Each time that the DTS package is executed, a new lineage ID is generated. You can create columns in your destination tables to record these values, making it possible to track the lineage of every row in your destination tables. You will not create columns in the Northwind_Mart tables to record this lineage at the row level, but by checking these check boxes, you will still be able to track each execution of the package.
Figure 8.11 The Advanced tab in the DTS Package Properties dialog box
This causes DTS to populate the repository with the metadata for each of the catalogs (databases) that is referenced by the package. The metadata for the tables in the Northwind and Northwind_Mart databases used by this package will then be available in the repository. It is now possible for repository tools to retrieve information about the databases and the data lineage from the repository.
Figure 8.12 The Scanning Options dialog box
In this exercise, you will customize the complex query used to retrieve summary data from the Northwind transaction processing database in a previous chapter to retrieve the same data from the Northwind_Mart database. This demonstrates how the denormalized structure reduces the complexity and problems in retrieving summary data from an OLAP-driven star schema, because it is more efficient for querying and reporting due to the lower number of joins.
Query: Show the total number of units sold of each product, where the required date of the order is earlier than today.
SELECT Product_Dim.ProductName, Product_Dim.CategoryName, Product_Dim.SupplierName, SUM(Sales_Fact.LineItemQuantity) AS [Total Units Sold] FROM Sales_Fact INNER JOIN Product_Dim ON Sales_Fact.ProductKey = Product_Dim.ProductKey GROUP BY Product_Dim.ProductName, Product_Dim.CategoryName, Product_Dim.SupplierName, Sales_Fact.RequiredDate HAVING (Sales_Fact.RequiredDate < getdate()) |
Notice the reduced number of joins, steps, and tables that are scanned to return the data requested by the query.
In this exercise, you will browse information about your databases and DTS package in the Microsoft Repository.
TIP
You can edit the package Description and Comments. To save the changes, click Update.
DTS packages are saved data transformation definitions that are processed by SQL Server Data Transformation Services. DTS packages can be saved in files, in SQL Server, or in the Microsoft Repository.
A DTS package is composed of one or more steps, which you can define. A step is a unit of work with many possible actions. With DTS, you can control the execution sequence of each step. Precedence constraints and conditional processing allow you to prioritize steps.