Tying It All Together

Throughout this chapter we've been focusing on the details of Data Flows and Control Flows. We've continually deferred discussion of overarching issues like the audit system and communication between master packages and child packages. It's finally time to tackle these topics.

The Audit System

A decade or more ago, our DW/BI systems had primitive or, more often, no auditing systems. In the current regulatory environment, we need to do better. You could spend almost as much time and energy developing an auditing system as building the core system itself. The auditing system that we present here is better than most systems we've seen, but it is by no means as extensive as we can envision. It strikes a balance between ease of implementation and robustness, and will be good enough for most DW/BI systems.

Kimball Method ETL Subsystem 

This section describes techniques relevant to Kimball Method ETL Subsystem #6, Audit Dimension Assembler: assembly of metadata context surrounding each fact table load in such a way that the metadata context can be attached to the fact table as a normal dimension.

We have several design goals for the audit system. We want to be able to answer the following questions:

  • How did each row initially enter the DW/BI system?

  • What process most recently updated each fact row?

  • How can you locate all the fact rows loaded today? How would you back out a load?

  • Was this row loaded through the standard process, or through an exception-handling branch of the ETL process?

  • How many rows were loaded today? How many error rows were logged?

  • Was today's processing successful?

  • How many rows did the table have before and after the load?

Note 

These aren't the only interesting questions you can ask about the processing. There's a whole set of important information about the process, like how long the entire load took, and how long each package took to run. These process questions are much easier to answer because you can set up Integration Services logging to collect the information automatically. The logging information can join to the data auditing information that we discuss in this section. We discuss Integration Services logging in Chapter 15.

Figure 6.16 illustrates the data model for the basic audit system to answer these questions. The first table, AuditPkgExecution, receives a row each time a package is executed. Generate a surrogate primary key, and keep track of basic information like the package name, ID, version information, and the time that execution began. All of this information is available to you from Integration Services system variables. When the package finishes, update this row with the stop time. You should also track the parent package, as most of your packages will be called from a master package.

Figure 6.16: Data model for basic auditing system

Use stored procedures to insert and update the audit tables. Use the Execute SQL Control Flow task to execute the stored procedures, and pass in parameters from system and local package variables.
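
For example, with an OLE DB connection manager, the Execute SQL task marks each parameter with a question mark, and the markers are mapped by ordinal position to variables on the task's Parameter Mapping page. A minimal sketch of the SQLStatement property for the audit insert might look like this:

 -- Each ? maps, in order, to a variable on the Parameter Mapping page, for
 -- example System::PackageName, System::PackageID, System::VersionGUID,
 -- System::VersionMajor, System::VersionMinor, System::StartTime, and a
 -- package variable holding the parent package's execution key.
 EXEC dbo.Insert_AuditPkgExecution ?, ?, ?, ?, ?, ?, ?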

Tip 

The best way to understand how to configure the components of the audit system presented here is to explore the sample packages from the book's web site, www.MsftDWToolkit.com.

Listing 6.2 shows the stored procedure that inserts a new row.

Listing 6.2: Sample stored procedure to insert an audit row
 CREATE PROCEDURE [dbo].[Insert_AuditPkgExecution]
     @PkgName varchar(50),
     @PkgGUID uniqueidentifier,
     @PkgVersionGUID uniqueidentifier,
     @PkgVersionMajor smallint,
     @PkgVersionMinor smallint,
     @ExecStartDT datetime,
     @ParentPkgExecKey int
 AS
 BEGIN
     SET NOCOUNT ON;
     IF @ParentPkgExecKey = 0 SET @ParentPkgExecKey = NULL
     INSERT INTO AuditPkgExecution (PkgName, PkgGUID, PkgVersionGUID, PkgVersionMajor,
             PkgVersionMinor, ExecStartDT, ParentPkgExecKey)
     VALUES (@PkgName, @PkgGUID, @PkgVersionGUID, @PkgVersionMajor,
             @PkgVersionMinor, @ExecStartDT, @ParentPkgExecKey)
 END

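Listing 6.2 covers only the insert. The companion procedure that stamps the row when the package finishes isn't shown here; the following is a minimal sketch, assuming the table carries a stop time column and a success flag (the column names are illustrative and may differ from the sample packages):

 CREATE PROCEDURE [dbo].[Update_AuditPkgExecution]
     @PkgExecKey int,
     @ExecStopDT datetime,
     @SuccessfulProcessingInd char(1)
 AS
 BEGIN
     SET NOCOUNT ON;
     -- Stamp the stop time and success flag on the row created at package start
     UPDATE AuditPkgExecution
     SET ExecStopDT = @ExecStopDT,
         SuccessfulProcessingInd = @SuccessfulProcessingInd
     WHERE PkgExecKey = @PkgExecKey
 END
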
The next table, AuditTableProcessing, gets a row for every table that's manipulated during a package's execution. Much of the time there's one row in AuditTableProcessing for each row in the AuditPkgExecution table because you usually build a package for processing each table. But some packages touch several tables, so you can't assume a one-to-one relationship between packages and tables.

At the start of each child package, add a row for each table the package manipulates. As above, use the Execute SQL task and pass in parameters to a stored procedure. When you first add each row, you know only the table name and the initial row count. Update the other counts and values after the package finishes. In this case, you need to create a package variable to hold the TableProcessingKey for each table the package touches.
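
The stored procedure that adds the AuditTableProcessing row can hand the new key back to the package through an OUTPUT parameter, which the Execute SQL task maps to the package variable with a parameter direction of Output. Here is a sketch with illustrative column names:

 CREATE PROCEDURE [dbo].[Insert_AuditTableProcessing]
     @PkgExecKey int,
     @TableName varchar(50),
     @ExtractStartRowCnt int,
     @TableProcessKey int OUTPUT
 AS
 BEGIN
     SET NOCOUNT ON;
     INSERT INTO AuditTableProcessing (PkgExecKey, TableName, ExtractStartRowCnt)
     VALUES (@PkgExecKey, @TableName, @ExtractStartRowCnt)
     -- Return the newly generated key so the package can hold it in a variable
     SET @TableProcessKey = SCOPE_IDENTITY()
 END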

You could include this Execute SQL task in a transaction with the Data Flow task, so that a failure in the Data Flow task will back out the audit row. We don't discuss Integration Services transactions in this book. A better design than including the task in the Data Flow transaction is to explicitly update the audit row with a failure indicator.

Reference 

See Books Online and Professional SQL Server 2005 Integration Services, by Knight, Chaffin, Barnes, Warren, and Young (Wrox, 2006) for a complete discussion of Integration Services transactions.

You may wonder what the check value columns are for. Use these columns to store important information about each table. Make these columns a numeric, non-integer data type so you can store either integers, such as row counts, or a sum of transaction amounts. For example, earlier in this chapter when we talked about checking for data reasonableness, we stored row counts by customer and product in these fields. You might need more than two; for some tables you won't use any at all.
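
As a sketch of the data type choice (the column names are illustrative), numeric(18,4) comfortably holds both whole-number row counts and summed dollar amounts:

 ALTER TABLE AuditTableProcessing ADD
     CheckValue1 numeric(18,4) NULL,  -- for example, a row count by customer
     CheckValue2 numeric(18,4) NULL   -- for example, a sum of transaction amounts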

Note 

If you check a lot of different data characteristics during package execution, you should create a separate auditing table for each table. You'd have AuditCustomer, AuditProduct, AuditOrders, and so on. Then you can name the check columns with sensible names that reflect their contents. This is a better design, although more complex.

If you don't care whether a row was inserted during standard processing or through an exception-handling branch of the package, these two tables will meet your auditing needs. Put TableProcessKey on each dimension and fact row, instead of the AuditKey (or call the AuditTableProcessing table AuditDimension instead). This is a perfectly reasonable approach for many applications.

If you do want to track whether a row was inserted during normal or nonstandard processing, you need one more auditing table: AuditDimension. For every branch of your package that you want to track separately, come up with a branch name. For a fact table, examples of branch names are Normal/Clean, Adjusted out of bounds amount, and Forced Referential Integrity. At the start of each child package, you'll add several rows to AuditDimension. Each gets its own key, which you'll hold in package variables.
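
Following the same pattern as Listing 6.2, the insert procedure for the audit dimension can return each branch row's key through an OUTPUT parameter; the child package calls it once per branch and stores each key in its own package variable. A minimal sketch, with illustrative names:

 CREATE PROCEDURE [dbo].[Insert_AuditDimension]
     @TableProcessKey int,
     @BranchName varchar(50),
     @AuditKey int OUTPUT
 AS
 BEGIN
     SET NOCOUNT ON;
     -- One row per processing branch, such as 'Normal/Clean' or
     -- 'Forced Referential Integrity'
     INSERT INTO AuditDimension (TableProcessKey, BranchName)
     VALUES (@TableProcessKey, @BranchName)
     SET @AuditKey = SCOPE_IDENTITY()
 END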

The final step of the audit system is to put the audit key into the dimension and fact tables. Recall from the dimensional model in Chapter 2 that every table has an AuditKey column. Add a Derived Column transform to your data flow, adding a column AuditKey (type integer) that you set equal to the appropriate package variable.

Tip 

If you look back at Figure 6.4, you'll see where we added AuditKey to the flow. We set it to zero only because we didn't want to explain the audit system at the beginning of the chapter. The final version of these packages, which you can download from the book's web site, correctly sets the AuditKey to the corresponding package variable.

In the book's sample packages, we track dimensions only at the package execution level. We do insert a row into the audit dimension, just to stay consistent. It's important for a fact table row to track both its original insertion and the most recent update, so fact rows contain two audit keys.

AN EVEN SIMPLER AUDITING SYSTEM

The simplest audit system outlined here, using AuditPkgExecution and a stripped-down AuditTableProcessing table, is easy to implement. If even that seems like too much trouble, there is one extremely simple solution. Add two columns to each table: PkgName and PkgExecutionDatetime. You can use the Audit transform to populate these columns in the Data Flow. We don't recommend this approach as a best practice because we don't think it captures enough information, and the information it does capture is stored inefficiently. But it's better than nothing and should be considered the absolute minimum level of auditing information.

 
Reference 

Chapter 4 of The Data Warehouse ETL Toolkit contains a much more complete list of the kinds of information you might track in a fact audit system.

The Master Package

At the beginning of this chapter we talked about the master package. We were pretty cavalier back then about what goes into the master package. Now it's time to pin down a few more details.

Kimball Method ETL Subsystem 

This section describes techniques relevant to Kimball Method ETL Subsystems:

  • #8 Error Event Handler: Comprehensive system for reporting and responding to all ETL error events. Includes branching logic to handle various classes of errors, and includes real-time monitoring of ETL data quality.

  • #26 Job Scheduler: System for scheduling and launching all ETL jobs. Able to wait for a wide variety of system conditions including dependencies of prior jobs completing successfully. Able to post alerts.

Initialize the Master Package

When the master package launches, it should initialize the audit system by adding a row to AuditPkgExecution. Because this is the master package, this row's ParentPkgExecKey is null. Store the newly generated row key in a package variable.

In the master package, set up a package variable or variables to hold the date or date range for which you'll be extracting and processing data. You'll need a watermark metadata table (or tables) to hold the dates that have been successfully processed. Every time you complete processing for a date, add that date as a row to the watermark table. If you process tables on different schedules, maintain a separate watermark table for each schedule.
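
A watermark table doesn't need to be elaborate. Here is a minimal sketch, with illustrative names, for a fact table processed daily:

 CREATE TABLE dbo.WatermarkOrdersLoad (
     LoadDate datetime NOT NULL PRIMARY KEY,        -- the date that was processed
     PkgExecKey int NOT NULL,                       -- which master package run loaded it
     SuccessfulProcessingInd char(1) NOT NULL       -- 'Y' or 'N'
 )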

The master package should check whether the date it's being asked to process has already been processed successfully, in which case it should exit gracefully. If the requested date is not the next date in the schedule, you could design the master package either to loop (filling in the date range) or to exit gracefully.

All of this checking can be performed with the Execute SQL task, parameterized if necessary. Place results into package variables, as we've described elsewhere.
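
For example, assuming the watermark table sketched above, the following statement in an Execute SQL task (result set type Single row, mapped to a package variable such as User::AlreadyProcessed) answers the first question:

 -- Returns 1 if the requested date has already been processed successfully, else 0.
 -- The ? maps to the package variable holding the requested date.
 SELECT CASE WHEN EXISTS
        (SELECT 1 FROM dbo.WatermarkOrdersLoad
         WHERE LoadDate = ? AND SuccessfulProcessingInd = 'Y')
        THEN 1 ELSE 0 END AS AlreadyProcessed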

Call the Child Packages

Calling the child packages is as simple as connecting a bunch of Execute Package tasks with the appropriate precedence constraints.

The only trick is passing variables from parent to child. Set up the communication from within the child package by defining a configuration there. As Figure 6.17 illustrates, specify that the configuration comes from a parent package. Specify the name of the parent package variable, and map that name to the corresponding variable's value in the child package. Set up a configuration for each of the variables passed from parent to child.

image from book
Figure 6.17: Set up a configuration from a parent package
Note 

Communicating by way of configurations seems unnecessarily awkward. Even worse, there's no straightforward way of passing information from the child back up to the parent. The best approach is to have the child package put information in a metadata table, and then have the parent package read that value.
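
As an illustrative sketch (using the hypothetical column names from the check value example earlier), the child could record a value in its AuditTableProcessing row, and the parent could read it back with a single-row Execute SQL task:

 -- In the child package: record the value to report
 UPDATE AuditTableProcessing
 SET CheckValue1 = ?                  -- for example, a row count computed in the child
 WHERE TableProcessKey = ?

 -- In the parent package: read it back into a package variable
 SELECT CheckValue1 FROM AuditTableProcessing WHERE TableProcessKey = ?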

Reference 

See Books Online for other important information about calling child packages. Relevant topics include Implementing Logging in Packages, Incorporating Transactions in Packages, and Inherited Transactions.

Master Package Cleanup

After the master package has called all the child packages, and they have completed (let's hope successfully), the master package needs to perform some cleanup.

The master package should update its row in AuditPkgExecution with its execution stop time and whether the processing was successful. It should also update the watermark table with a success or failure indicator.

If the processing was not successful, the master package should send email or alert an operator. Use the Send Mail or Notify Operator tasks.

Halting the Master Package

If the child package fails with a hard error, it's trivial to communicate that failure up to the master package. But we hope we've provided enough techniques that your package is relatively unlikely to fail: Try to plan for the likely events so that the package ends gracefully.

Several times in this chapter we've discussed how to build complex precedence constraints so that a package will stop executing if, say, your extracted data looks unreasonably small. We haven't described how to communicate this soft failure up to the parent package. If we don't communicate up, the master package thinks we succeeded: After all, we didn't throw an error condition.

Refer back to Figure 6.12. In the Control Flow for handling the failure of the reasonableness test, you can see that the package sends an email if the test fails. But this is just a conditional flow, not necessarily an error condition. If you want the master package to stop all execution when the reasonableness test fails, you need to set the child package's error condition yourself. Do so by writing a Script task that contains a single line:

 Dts.TaskResult = Dts.Results.Failure 

Set the FailPackageOnFailure property of the Script task to True. When the script runs, it immediately fails the task, which fails the package. Up in the parent package, catch the child package failure and handle it however you wish. In the scenario we're discussing, you may want to halt all execution of the master package, set off alarm bells, and hunker down. If you've bound steps into a transaction, you could even roll back some of the earlier steps of the master package.

Tip 

In the packages available on the book's web site, we hid this Script task in an OnPostExecute event handler for the Send Mail tasks. That's why you don't see the Script task in Figure 6.12.

Package Event Handling

Kimball Method ETL Subsystem 

This section describes techniques relevant to Kimball Method ETL Subsystems:

  • #8 Error Event Handler: Comprehensive system for reporting and responding to all ETL error events. Includes branching logic to handle various classes of errors, and includes real-time monitoring of ETL data quality.

  • #26 Job Scheduler: System for scheduling and launching all ETL jobs. Able to wait for a wide variety of system conditions including dependencies of prior jobs completing successfully. Able to post alerts.

In the earlier section on the master package, we talked about initialization, error handling, and cleanup logic that should be in the master package. If you've looked at the sample packages posted on the book's web site, you may be wondering where we've hidden this logic.

The initialization, cleanup, and error handling are all in event handlers. You can find all the event handlers for a package in the tab compellingly titled Event Handlers, next to the Control Flow and Data Flow tabs. You can create handlers for any kind of event, like OnPreExecute or OnError, for a task or for the package as a whole. An event handler can contain the same kinds of tasks as the main package.

OnError and OnPostExecute are the most important event handlers to set up at the package level. These event handlers will run no matter where an error occurs, or in which branch the Control Flow terminates normally. The biggest issue with event handlers is discoverability: there's no visual clue in the main package Control Flow that event handlers have been defined for the overall package or a specific task. Developers, and others who are looking at packages, should get into the habit of looking in the Package Explorer tab to identify the event handlers that have been defined.

Unit Testing

The ETL system developer must test every path that data can take through the packages. You hope that many of the package's branches will never be encountered during normal execution. But you need to test the logic before moving even to the test system, much less into production.

The only way to unit test data flows is to create test data sets that contain all the different types of errors, and combinations of errors, that you can think of. Creating those data sets is hard enough, but the real challenge is in keeping track of the data, the test cases that each data set addresses, and the expected and observed results.

You will need to create different versions of source and target tables. In a simple system, you can create multiple test databases. Use package configurations to change the package's connections, so that you're using the appropriate test databases during each test run.

For more complex systems and testing scenarios, it might be easier to write a package that assembles test data from data stored in flat files and in other working databases. By the time you reach this point, you should be expert enough with Integration Services that such a package will present no serious challenges.

It's almost universally true of developers that they hate to document anything. If you can't force yourself to document your unit testing procedures and results, the project team leader must assign someone to do this for you. This is a key part of the project's documentation in a regulated world, and your test logic and procedures will be very helpful in the deployment process.


