The goal of this chapter is to encourage you to plan your ETL system. To develop a good plan, you need a basic understanding of how an ETL system is built in SQL Server 2005. The main ETL tool is Integration Services, which contains a rich set of functionality for developing and executing an ETL system. This chapter introduces you to Integration Services features, without going into detail about exactly how to use them.
The first step in developing your ETL system specification is to start from the draft source-to-target mappings created during the modeling process. While you're developing your data model, you must examine the source systems to validate that your data model is feasible. At this point you haven't done a detailed examination of the source systems, but you've identified source tables for your dimensional model, and you've identified some data quality issues. Document these relationships and issues in a high-level map. You'll edit this map throughout the specification process to correct your early assumptions.
The next big step is to profile the source system data. We recommend that you purchase a tool to perform data profiling, as it's very tedious to do by hand. The data profiling results are the main set of information you need to complete the source-to-target mapping at a detailed level, table by table and column by column.
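If a profiling tool isn't in the budget, the core statistics are straightforward to approximate with a script. The sketch below is a minimal illustration (the column and table contents are hypothetical, not from Adventure Works) of the measures a profiling tool reports first: row count, null count, distinct-value count, and min/max per column.

```python
from collections import defaultdict

def profile(rows):
    """Compute basic data-profiling statistics for a list of row dicts:
    null count, distinct count, and min/max per column."""
    stats = defaultdict(lambda: {"nulls": 0, "values": set()})
    for row in rows:
        for col, val in row.items():
            if val is None:
                stats[col]["nulls"] += 1
            else:
                stats[col]["values"].add(val)
    report = {}
    for col, s in stats.items():
        vals = s["values"]
        report[col] = {
            "rows": len(rows),
            "nulls": s["nulls"],
            "distinct": len(vals),
            "min": min(vals) if vals else None,
            "max": max(vals) if vals else None,
        }
    return report

# Hypothetical extract from a source customer table
rows = [
    {"customer_id": 1, "state": "WA"},
    {"customer_id": 2, "state": None},
    {"customer_id": 3, "state": "WA"},
]
report = profile(rows)
```

Even this crude profile surfaces the questions that drive the source-to-target mapping: a high null count flags a column that needs a default or a data quality rule, and an unexpectedly low distinct count often reveals a miscoded attribute.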
There are several system-wide decisions to make and document: How often will you load data, and how much history will you maintain? Will you partition your fact tables, and if so how will you maintain those partitions? What strategies will you use for extracting data from the source? How will you handle de-duplication of people and organizations? How will you manage your dimension processing? And what strategy will you use for updating your Analysis Services databases and predefined reports?
All of the information we've described here should be collected into an ETL system design specification. This doesn't have to be a fancy document; just collect the information we've described and slap a table of contents on it. A few years from now, when the team needs to revise the system, your protégé will marvel at your foresight in pulling together all of this information.
Chapter 6: Developing the ETL System
What could possibly be more fun?
The Extract, Transformation, and Load (ETL) application is the foundation of the DW/BI system. Business users don't see the ETL system, and they don't care about it. That is, they don't care unless the data isn't available first thing in the morning. They don't care unless the data is unreliable. If the business users care about ETL, it's probably bad news for the DW/BI team.
Building an ETL system is largely a custom process. ETL tools, like SQL Server 2005 Integration Services, provide a huge productivity boost over hand-coding. But there's no getting around the fact that companies have diverse source systems. Integrating and aligning the data from those systems is an exercise in accounting for idiosyncrasies.
When Microsoft decided to rebuild Data Transformation Services (DTS), it took an "If it ain't broke, don't fix it" approach. Comparing DTS with Integration Services, we can infer that the pale yellow background in the designer wasn't broken. Other than that, Integration Services is astonishingly different from DTS. Integration Services is a real ETL tool, and you can use it to build real, enterprise-class ETL systems. This chapter describes how to do exactly that.
Integration Services can be used in a wide variety of scenarios beyond ETL. In this chapter, we focus exclusively on how to implement the ETL system for a Kimball Method data warehouse. We come back to other Integration Services features and uses in Chapters 15 and 17. Still other uses of Integration Services fall outside the scope of this book, as of course do other uses of the relational database.
Although we expect readers to skip around in this book, you should read Chapter 5 before you start this chapter. And before you start developing your ETL system, you should think through most of the design issues discussed in that chapter. Look back at Figure 5.1 to reorient yourself to where ETL system development falls in the Kimball Method Business Dimensional Lifecycle methodology.
This chapter makes significant use of the Adventure Works case study, and the associated packages available on the book's web site at www.MsftDWToolkit.com. You can read the chapter on its own, but sample code is worth ten thousand words: You'll learn more if you pick through the sample packages. This is a long chapter, as befits the area that takes the majority of project implementation resources. Most readers will get a good enough idea of how Integration Services works by reading the first 15-20 pages of the chapter. ETL architects and developers should read it all.
There's more to say about ETL and Integration Services than we can possibly fit into a single chapter, however lengthy. Two key resources are The Data Warehouse ETL Toolkit (Kimball and Caserta, Wiley, 2004) and Professional SQL Server 2005 Integration Services, by Knight, Chaffin, Barnes, Warren, and Young (Wrox, 2006).
The chapter sections reference the Kimball Method 38 Subsystems of ETL ("The 38 Subsystems of ETL" by Ralph Kimball, Intelligent Enterprise, December 4, 2004). You can find a link to this article at www.kimballgroup.com/html/articlesbydate/articles2004.html.
This long chapter contains four main sections:
Getting Started: In the first portion of the chapter, you learn how to use BI Studio to create Integration Services projects, data sources, and connections. You also learn about package templates and create a master package.
Dimension Processing: This section shows you the basics of dimension processing (extract, transform, and load) and walks you through the steps to build a package to populate a simple dimension. You learn how to manage slowly changing dimensions in Integration Services.
Fact Processing: This part of the chapter helps you understand how fact table extracts and cleaning differ from dimension processing. You learn how to implement the surrogate key pipeline in Integration Services, and how to design alternative approaches to managing challenging fact table loads, like snapshots and late-arriving facts. This section also presents the two most important approaches for incrementally processing Analysis Services databases.
Tying It All Together: See the details of our recommended audit system, and learn how to use package configurations to communicate between master and child packages.
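As a preview of the dimension-processing material, the core slowly changing dimension logic can be sketched in a few lines. This is an illustrative Python sketch under simplified assumptions (the row layout and attribute names are hypothetical), not how Integration Services implements its Slowly Changing Dimension transform: a change to a Type 1 attribute is overwritten in place, while a change to a Type 2 attribute expires the current row and inserts a new row with a new surrogate key.

```python
from datetime import date

def apply_scd(dim_rows, incoming, type2_attrs, next_key):
    """Apply a changed source row to a dimension (a list of row dicts).
    A change to any Type 2 attribute expires the current row and appends
    a new one; all other changed attributes are overwritten (Type 1)."""
    current = next(r for r in dim_rows
                   if r["natural_key"] == incoming["natural_key"]
                   and r["is_current"])
    changed = {k: v for k, v in incoming.items() if current.get(k) != v}
    if any(k in type2_attrs for k in changed):
        # Type 2: expire the current row and version the dimension member
        current["is_current"] = False
        current["row_end_date"] = date.today()
        dim_rows.append({**current, **incoming,
                         "surrogate_key": next_key,
                         "is_current": True,
                         "row_start_date": date.today(),
                         "row_end_date": None})
    else:
        current.update(changed)  # Type 1: overwrite in place
    return dim_rows

# Hypothetical dimension row, then an incoming change to a Type 2 attribute
dim = [{"surrogate_key": 1, "natural_key": "C100", "name": "Ann",
        "city": "Tacoma", "is_current": True,
        "row_start_date": date(2020, 1, 1), "row_end_date": None}]
apply_scd(dim, {"natural_key": "C100", "name": "Ann", "city": "Seattle"},
          type2_attrs={"city"}, next_key=2)
```

The point of the sketch is the design decision it encodes: declaring which attributes are Type 2 determines how much history the dimension keeps, which is exactly the kind of choice the specification should document per attribute.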
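The surrogate key pipeline mentioned under Fact Processing is, at heart, a series of dimension lookups. The hypothetical Python sketch below shows the logic only (the key names and mappings are made up for illustration): each incoming fact row exchanges its natural keys for dimension surrogate keys, and a row that fails any lookup is routed to an error set rather than the fact table, much as a lookup failure in an ETL tool diverts the row to an error output.

```python
def surrogate_key_pipeline(fact_rows, dim_maps):
    """Exchange natural keys for surrogate keys on incoming fact rows.
    dim_maps: {natural_key_column: {natural_key: surrogate_key}}.
    Rows with any failed lookup go to the error output unchanged."""
    loaded, errors = [], []
    for row in fact_rows:
        out = dict(row)
        ok = True
        for col, mapping in dim_maps.items():
            sk = mapping.get(row[col])
            if sk is None:
                ok = False  # unknown natural key: divert the row
                break
            out[col + "_sk"] = sk  # substitute the surrogate key
            del out[col]
        (loaded if ok else errors).append(out if ok else row)
    return loaded, errors

# Hypothetical dimension key maps and incoming fact rows
dim_maps = {"customer_id": {"C1": 101}, "product_id": {"P1": 501, "P2": 502}}
facts = [{"customer_id": "C1", "product_id": "P1", "amount": 10.0},
         {"customer_id": "C9", "product_id": "P2", "amount": 5.0}]
loaded, errors = surrogate_key_pipeline(facts, dim_maps)
```

What to do with the error set, such as suspending the rows or inserting placeholder dimension members for late-arriving data, is one of the design decisions the Fact Processing section takes up.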