Getting Started

Getting Started

If youre an ETL system developer, youre ready to start moving some data around. If youre like most ETL developers, youve experimented with Integration Services and quickly decided its both cool and exciting. When you start doing real work, youll get frustrated because its not as easy as it looks. Fortunately, after ahopefully shortlearning curve, it starts to be almost as fun and productive as it appeared in the demos. Well try to keep that learning curve as short and gentle as possible.

We start this section by talking about the development environment, which weve already introduced in Chapter 4. We then encourage you to create a template package early in your development cycle, to make it easier for all your packages to use similar objects and styles. On the subject of development styles, we also encourage you to modularize your package development, creating a system of master and child packages. Dont try to stuff too much functionality into a single package.

Create Solution, Project, and Data Sources

Your ETL development process may well be the first time you use BI Studio. If you havent already created a solution for your DW/BI project, do so now. Most teams will share a single solution for the Integration Services, Analysis Services, and Reporting Services components of their solution. You may keep all your Integration Services packages in a single BI Studio project. If you create separate packages for the historical load and ongoing incremental loads, you might separate these into separate BI Studio projects. Projects exist for your convenience during development. Use source control in a team development environment to help manage multiple people working on the same files.

Kimball Method Etl Subsystem 

This section describes techniques relevant to Kimball Method ETL Subsystems:

  • #38 Project Management System: Comprehensive system for keeping track of all ETL development.

  • #31 Version Control System: Consistent snapshotting capability for archiving and recovering all the metadata in the ETL pipeline. Check- out and check-in of all ETL modules and jobs. Source comparison capability to reveal differences between different versions.

We introduced solutions and projects in Chapter 3. Data Sources and Data Source Views are the packages links to the outside world. A Data Source includes connection information in its properties, like the name of the server, the source database, and logon credentials. A Data Source View is a metadata layer that you can use to redefine the data structures in a Data Source to make it easier for your ETL task to work with. Define Data Sources and Data Source Views once in your solution, for use across all the solutions projects. For your Integration Services project, define a Data Source to your source system(s) and to your relational data warehouse database. Data Source is a confusing name for this object in the context of Integration Services because you need to define Data Sources on target databases as well.

Data Source Views are useful for projects that are built from a complex source database. Imagine a source database that contains thousands of tables, which is common in purchased ERP systems. Probably youll access only dozens or at most several hundred tables during the course of any one DW/BI project. Set up a Data Source View that contains only the tables of interest. Everyone benefits by having to look at only this small table subset.

Hands-on Example

image from book

If youre working through the examples in this chapter, you need to set up your environment. First, create a database called MDWT_AdventureWorksDW. You can obtain the database create script from the books web site. Next, create a new BI Studio solution called MDWT_Examples and add an Integration Services project called MDWT_AWDWHist to that solution. Finally, add two Data Sources to that project. Set up the first source to the AdventureWorks database that contains the source data. Set up the second source (which is really a target) to the newly created database MDWT_AdventureWorksDW.

The AdventureWorks source database is simple enough that theres no value in creating a Data Source View for it.

image from book
 

Note 

It would be helpful if you work through the tutorials that ship with SQL Server 2005 before reading this chapter. Although were walking you through a case study, this chapter is not a tutorial.

Package Template

Define a package template that you use as the starting point for all Integration Services packages in your solution. A package template is simply a starter package that contains objects that you frequently use. With a package template, youll spend less time tediously recreating the same objects.

Until you read this chapter and experiment with your own designs, you wont be able to set up a complete package template. The template will evolve over time. At the outset you can create a template with the following components:

  • Start with an empty package: Use the Add New Item by right-clicking on the project name in the Solution Explorer. Add a new IS Package to your project to serve as the template package.

  • Set up package connections: The Data Sources and Data Source Views are shared components across the project. Integration Services packages need an additional communication layer called a connection. These live on the Connection Managers tab that opens under the package designer. The easiest way to create a package connection is to right-click on the Connection Managers tab and choose New Connection from Data Source to create a package connection from an existing Data Source or Data Source View. Its only a few mouse clicks, but its a little perplexing that Integration Services doesnt do this for you automatically. Build a package connection into your template for each Data Source that youll use in your ETL system, including package connections for the data warehouse database and the staging and metadata databases.

  • Create standard variables : Until youve designed several packages, you wont know which variables you will use a lot. At the outset, create a variable called RowCount as an Int32 (default value = 0). Youll use this variable during the debugging process. In the longer run, youll create variables to hold information passed down from the master package to the child packages. Create variables in the Variables window, which you can always get to from the View Other Windows Variables menu choice.

  • Put an OnPostExecute breakpoint in the template package body: The breakpoint pauses package execution when the specific event occurs: in this case, when the package finishes (technically, when the OnPostExecute event is thrown). This breakpoint is useful because you can examine only the condition of variables during a breakpoint. Its very annoying to run the package but not be able to examine variables values especially if the package takes some time to execute! Add a package breakpoint by right-clicking in the background of the Control Flow design surface and choosing Edit Breakpoints.

Once you have greater experience with Integration Services, and have made some solution design decisions, you can add the following components to the template:

  • Define common configurations: A configuration defines the default values of package objects properties, including package variables values. Use master and child packages, and always push a standard set of auditing information from the master package to the child packages. In the configuration definition you specify how the master package communicates with the child package. Well walk through an example of this later in this chapter.

  • Define a standard starting sequence, if you use one: A starting sequence is a set of tasks that you usually execute at the beginning of every package. In our sample package we begin all child packages by setting up auditing dimension rows for the package processing. You can put these sequences in the body of the package, or you can be fancy and hide them away in an event handler.

  • Define standard event handlers, like what to do if the package fails during execution: We discuss event handlers near the end of this chapter.

  • Define standard logging configurations: We discuss logging in Chapter 15.

Save the template package with an obvious name, like MyTemplatePackage. Youll need to save it to a convenient place. We keep the template packages in a SQL Server database that the development organization manages , but you could just as easily designate a separate file system folder for template packages.

Tip 

To save the template package to SQL Server, choose File Save Copy of [Package.dtsx] As, and select SQL Server as the package location. (The Save Copy choice will not appear if the package is highlighted.) You can use SQL Studio to add directories and manage packages saved in SQL Server. To do this, open the Registered Servers window in SQL Studio, select the Integration Services icon and double-click the server name. This opens the Integration Services directory structure in the Object Explorer window.

When you want to create a new package from your template package, you have to do something less obvious than choosing the menu item to create a new package. Instead, right-click in the Solution Explorer and choose Add Existing Package to add a copy of the template package to the current project, and then immediately change its name.

Using template packages seems a bit convoluted at first, but youll soon get into the habit of working this way. After youve created the same connections the umpteenth time, we think youll welcome this suggestion.

Hands-on Example

image from book

If youre working through examples while reading this chapter, create a template package called MDWTTemplateSimple. Create the RowCount variable, create an empty Data Flow step, and define package connections to the AdventureWorks and MDWT_AdventureWorksDW Data Sources.

image from book
 

Master Packages and Child Packages

As we discussed in Chapter 5, develop a separate package to load each table. Develop a master package that calls each of these table-specific packages. The master package contains one Execute Package task for each child package. Many of the dimension packages can execute in parallel, but some dimension tables may have a dependency on another table. Use Control Flow precedence arrows to define these dependencies.

Kimball Method Etl Subsystem 

This section describes techniques relevant to Kimball Method ETL Subsystem #26 Job Scheduler: System for scheduling and launching all ETL jobs. Able to wait for a wide variety of system conditions including dependencies of prior jobs completing successfully. Able to post alerts.

Start your system development by creating all the child packages. Remember to add a copy of your template package rather than create a new package. Create a child package for each table in your dimensional model, and name them appropriately. Create a master package from the template package. Later, you may decide to create a different template for a master package than for a child package.

Refer to Figure 5.2 for an illustration of a master package. This package, Master_Dims, first checks some metadata to figure out which set of data it should run. Then it launches a bunch of Execute Package tasks, some of which are grouped together into sequences and groups. In this example the sequences and groups are optional; were using them simply to reduce the clutter on the design surface.

Your master package consists primarily of Execute Package tasks. The only property to set for an Execute Package task is the location of the child package. During development, the child package is located in the File System, not the default SQL Server location. Change the Location property to File System, and then select the Connection property box. Click the pull-down arrow and select <New connection > to define a connection to the appropriate package. Rename the task to include the name of the child package the task will execute.

In Figure 5.2 you can see that each of the child packages has its own connection at the bottom of the screen. You created these connections when you set up the Execute Process tasks. In the production environment, you can reconfigure the package at run time to change the location where Integration Services will look for the child packages.

You can disable any task by right-clicking it and choosing Disable. A disabled task wont run when you execute the Master_Dims package, which is useful for debugging.

It may seem like make-work to break out each tables processing into a separate package. Some tables are so simple and fast to populate that it hardly seems worthwhile to build a separate package. Using a master package and a child package for each table greatly increases your systems flexibility and understandability. As your DW/BI system evolves and grows, you can easily modify the master package and leave the child packages untouched.

A child package is nothing special. Its just a package that you execute from another package, the master package, instead of executing it directly. You often communicate between a master and children by passing variables back and forth, but communication isnt necessary. We like to perform initial development and debugging on a child package in isolation. Only after it seems to work do we hook it up to the master package and pass the variables back and forth.

The communication between master and child is easy, but weve deferred the description of the details to the end of this chapter.

Hands-on Example

image from book

To follow along, create two new child packages from the simple template: DimPromotion_ETL.dtsx and DimProduct_ETL.dtsx. Create the master package Dim_Master, and add two Execute Package tasks: one for each new, still mostly empty, package. Disable the Product_ETL package task for now. Well begin with the DimPromotion_ETL package.

image from book