Designing the ETL Process Flow


The Source-to-Target Mapping Document

Before the ETL process flow can be designed (or enhanced), the detailed ETL transformation specifications for data extraction, transformation, and reconciliation have to be developed, given that they will dictate the process flow. A common way to document the ETL transformation specifications is in a source-to-target mapping document, which can be a matrix or a spreadsheet (Table 9.3).

The source-to-target mapping document should list all BI tables and columns and their data types and lengths. It should map the applicable source data elements to those columns, along with their data types and lengths, and it should show the source files and source databases from which the source data elements are being extracted. Finally, and most importantly, the document should specify the transformation logic for each column.

This document can then be used to create the actual programming specifications for the ETL developers or to create instructions (technical meta data) for the ETL tool.
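Whether the mapping document feeds hand-written programs or an ETL tool, its entries amount to machine-readable metadata. The sketch below illustrates this with the CUSTOMER_ID mapping from Table 9.3, expressed as a small Python structure; the dictionary keys and the helper function are hypothetical, not part of any particular ETL tool.

```python
# A minimal, hypothetical sketch of one source-to-target mapping entry,
# using the CUSTOMER_ID example from Table 9.3. Key names are illustrative.
mapping = {
    "target_table": "CUSTOMER",
    "target_column": "CUSTOMER_ID",
    "target_type": "INTEGER",
    "sources": [
        {"element": "CUST-NUM", "type": "X(9)",  "file": "CustMaster"},
        {"element": "CID",      "type": "INT",   "file": "CUSTOMER"},
        {"element": "CNO",      "type": "X(12)", "file": "CUST"},
    ],
}

def source_element_for(record: dict) -> str:
    """Apply the transformation rule for CUSTOMER_ID from Table 9.3:
    the ABC-Code value decides which source element identifies the
    record that receives the new customer number."""
    abc_code = record["ABC-Code"]
    if abc_code == "R":
        return "CUST-NUM"
    elif abc_code == "S":
        return "CID"
    return "CNO"
```

An ETL developer (or a code generator driven by the mapping document) would translate each such entry into the corresponding extract-and-transform logic.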

The ETL Process Flow Diagram

Once the source-to-target mapping document has been completed, the ETL lead developer, with the help of the database administrator and the data quality analyst, must design the ETL process flow, as illustrated by the example in Figure 9.5.

Figure 9.5. Sample ETL Process Flow Diagram


The purpose of the ETL process flow diagram is to show the process dependencies among all extracts, sorting and merging utilities, transformations, temporary work files or tables, error-handling processes, reconciliation activities, and the load sequence. The ETL programs, or ETL tool modules, will have to run in this sequence.

  • Extracts: There may be operational interdependencies among several source files and source databases from which data is being extracted. These interdependencies have to be understood because they may affect the timing and the sequence of running the ETL extract programs.

  • Sorting and merging utilities: Almost every step requires the extracted data to be sorted in a particular fashion so that it can be merged before further processing can occur. Sorting can also greatly improve load performance.

Table 9.3. Example of a Source-to-Target Mapping Document

  Table     Column       Type/Length   Data Element   Type/Length   Source File/Database
  --------  -----------  -----------   ------------   -----------   --------------------
  CUSTOMER  CUSTOMER_ID  INTEGER       CUST-NUM       X(9)          CustMaster
                                       CID            INT           CUSTOMER
                                       CNO            X(12)         CUST

  Transformation Specifications:
    If ABC-Code = "R" then
       Assign new customer number to record with CUST-NUM
    Else If ABC-Code = "S"
       Assign new customer number to record with CID
    Else
       Assign new customer number to record with CNO
  • Transformations: Most of the data has to be transformed for a variety of reasons. It is important to examine the most opportune times to perform the transformations. Remember that there is only one coordinated ETL process for the BI decision-support environment. Therefore, transformations applicable to all source data, such as data type and code conversions or data domain enforcements, should be performed early in the ETL process flow. Transformations specific to a target database, such as aggregation and summarization for a specific data mart, should occur toward the end of the ETL process flow.

  • Temporary work files or tables: Sorting, merging, and transforming require a lot of temporary storage space to hold the interim results. These temporary work files and tables can be as large as, or larger than, the original extracts. Furthermore, these temporary work files and tables are not really "temporary." Plan to have that space available in your staging area permanently.

  • Error-handling processes: During the ETL process many errors are detected as the data-cleansing specifications are applied to the source data. If error reports are created or erroneous records are rejected into a suspense file, they should be shown on the ETL process flow diagram.

  • Reconciliation activities: Every program module that manipulates data should produce reconciliation totals. These can be in the form of input and output record counts, specific domain counts, or amount counts. Record counts are sufficient for extract, sort, and merge modules. Domain counts are appropriate for more complicated transformation specifications, such as separating data values from one source data element into multiple target columns. Amount counts are usually performed on all amount data elements, whether they are moved as is, transformed into a new format, or used in a calculation. (ETL reconciliation will be described in more detail in Step 11, ETL Development.)

  • Load sequence: It is necessary to determine the sequence in which the tables have to be loaded because of their potential interdependencies and because of a possible recursive relationship on one table. For example, the Product dimension table may have to be loaded before the Sales table is loaded, if RI is turned on and if the sales data references products. Other tables may be loaded simultaneously, which can greatly improve the speed of the load process.
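The dependencies captured in a process flow diagram such as Figure 9.5 form a directed graph, and a valid run sequence is a topological ordering of that graph. The sketch below illustrates this with Python's standard-library graphlib; the module names are hypothetical stand-ins for actual ETL programs or tool modules.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical ETL modules mapped to their prerequisites. In practice,
# these dependencies come straight from the ETL process flow diagram.
dependencies = {
    "extract_customers": [],
    "extract_sales": [],
    "sort_merge": ["extract_customers", "extract_sales"],
    "transform_common": ["sort_merge"],        # data type and code conversions
    "load_product_dim": ["transform_common"],
    # With RI turned on, sales rows reference products, so the
    # Product dimension must be loaded first.
    "load_sales_fact": ["transform_common", "load_product_dim"],
}

# static_order() yields every module after all of its prerequisites.
run_sequence = list(TopologicalSorter(dependencies).static_order())

# Modules with no dependencies on each other (e.g., the two extracts)
# may also be run simultaneously to shorten the batch window.
```

Whether the flow is executed by a job scheduler or an ETL tool's workflow engine, the same ordering constraints apply.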

The Staging Area

A staging area is the place where the ETL process runs. It refers to dedicated disk space, ETL program libraries, temporary and permanent work files and tables, and even a dedicated server. The staging area can be centralized or decentralized. For example, it can be a central mainframe staging area if most of the source data is in flat files on a mainframe. It can also be on a dedicated server onto which the source data is loaded. Many times the staging area is decentralized. For example, a convoluted mainframe file with many REDEFINES and OCCURS clauses may have to be flattened out with a COBOL program in a staging area on the mainframe before it is downloaded to a staging area on a UNIX box for further processing by an ETL tool.
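Flattening a record that contains a repeating group (the effect of a COBOL OCCURS clause) means producing one flat output record per occurrence. The sketch below shows the idea in Python rather than COBOL; the record layout and field names are invented for illustration.

```python
def flatten(record: dict, occurs_field: str) -> list[dict]:
    """Explode one record containing a repeating group (a COBOL-style
    OCCURS clause) into one flat record per occurrence, with the
    non-repeating fields copied onto each output record."""
    base = {k: v for k, v in record.items() if k != occurs_field}
    return [{**base, **occurrence} for occurrence in record[occurs_field]]

# Hypothetical customer record; "ADDRESSES" plays the role of a
# repeating group such as "ADDRESS OCCURS 3 TIMES" in a copybook.
cust = {
    "CUST-NUM": "000012345",
    "NAME": "ACME CORP",
    "ADDRESSES": [
        {"ADDR-TYPE": "BILL", "CITY": "DENVER"},
        {"ADDR-TYPE": "SHIP", "CITY": "BOULDER"},
    ],
}

flat_rows = flatten(cust, "ADDRESSES")
# Produces two flat rows, each carrying CUST-NUM and NAME
# alongside one address occurrence.
```

The flattened output is what gets written to the staging area for the downstream sort, merge, and transformation steps.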

The ETL process is by far the most complicated process to be designed and developed in any BI project. Since there is only one (logical) coordinated ETL process for the BI decision-support environment, expanding the ETL programs with each new BI application becomes very complicated, and regression testing requires more and more time. For these reasons, most organizations prefer to use an ETL tool for all or some of the ETL process, especially for the extract and transformation processes, as shown in Figure 9.6.

Figure 9.6. Common Use of ETL Tools in the ETL Process




Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications
ISBN: 0201784203
Year: 2003
Pages: 202