Chapter Nine. Step 9: ExtractTransformLoad Design | Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications

Chapter Nine. Step 9: Extract/Transform/Load Design

graphics/ch09.gif

Chapter Overview

This chapter covers the following topics:

Things to consider about extract/transform/load (ETL) design
Common BI implementation strategies, such as data marts, operational data stores, enterprise data warehouses, and Web warehouses
How to reformat , reconcile , and cleanse the source data for three different sets of ETL programs: initial load, historical load, and incremental load
Various approaches for extracting data from the operational source files and source databases
Typical source data problems encountered during transformation, such as duplicate primary keys, inconsistent data values, different data formats, and embedded process logic
Load considerations, such as referential integrity and indexing
The source-to-target mapping document, the process flow diagram, and the staging area
Eight steps to follow when evaluating ETL products and vendors
Brief descriptions of the activities involved in ETL design, the deliverables resulting from those activities, and the roles involved
The risks of not performing Step 9

Things to Consider

Tools

Have we selected an ETL tool, or are we writing the ETL programs from scratch?

Will the ETL tool run on the platform where the source files are? On a separate server?

Do we have a separate data-cleansing tool? Will we run it before or during the ETL process?

Do we have an efficient sort utility?

ETL Staging

How big is our ETL staging window? How many hours per night? Per week? Do we have a smaller window at month-end because of other month-end processes? How much smaller?

Can we fit our ETL process into those windows or will we have to run over several days or nights?

How many source data elements do we have to extract? And how many source files and source databases do we have to access?

ETL Process Flow

How many programs can we run in parallel to shorten the ETL runtime?

How long will the initial load take? Have we prototyped it?

How long will it take to load the historical data? How many years of history do we need to load?

Do we know how long the incremental loads will run?

Should we insert rows or use the database management system (DBMS) load utility?

Should we use a third-party load utility to speed up the process? Do we already have a third-party load utility, or do we need to buy one?

When and how will the data be archived? On disk? On tape? Do we have to write archival programs at this time or can we postpone that additional programming effort to a future release?

Performance Considerations

How would ETL load performance be affected if we left referential integrity (RI) turned on?

How high would the risk of data corruption be if we turned RI off?

How much RI checking do we want to perform in the ETL programs?

Reconciliation

At how many points in the ETL process do we need to count input and output records?

Are the record layouts and database structures different on the old historical files and databases than they are on the current files and databases? How do we reconcile them?

Do we need to reconcile changed codes? Reused and redefined fields?

How many data elements do we have to reconcile? How many codes? How many amounts?

Will dirty data be rejected? How will that be reflected in the reconciliation totals?

Will the load counts and reconciliation totals be stored as meta data?

Quality Metrics

How will data quality errors be counted? What data quality metrics do we need to compile in the programs?

Will we store those metrics as meta data in the meta data repository or print them in a report?

Source data for the BI applications will come from a variety of platforms, which are managed by a variety of operating systems and applications. The purpose of the ETL process is to merge data from these heterogeneous platforms into a standard format for the BI target databases in the BI decision-support environment, as shown in Figure 9.1.

Figure 9.1. Heterogeneous Data Sources

graphics/09fig01.gif