Populating a data warehouse involves all of the tasks related to getting the data from the source operational systems, cleansing and transforming the data to the right format and level of detail, loading the data into the target data warehouse, and preparing the data for analysis purposes.
Figure 5.1 shows the steps making up the extraction, transformation, and load (ETL) process. Data is extracted from the source operational systems and transported to the staging area. The staging area is a temporary holding place used to prepare the data; it may be a set of flat files, temporary staging tables in the Oracle warehouse, or both. There the data is integrated with other data, cleansed, and transformed into a common representation. The data is then loaded into the target data warehouse tables. This process is sometimes referred to as ETT: extraction, transportation, and transformation.
Figure 5.1: The ETL process.
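The flow in Figure 5.1 can be sketched as a minimal pipeline. All of the function and field names below are hypothetical illustrations, not part of any Oracle tool; the point is only the ordering of the steps: extract, stage, transform, load.

```python
# Minimal sketch of the extract -> stage -> transform -> load flow.
# Names (extract, transform, load, "customer", "amount") are invented.

def extract(source_rows):
    """Pull raw records out of a (simulated) operational system into staging."""
    return list(source_rows)

def transform(staged_rows):
    """Cleanse staged records and convert them to a common representation."""
    out = []
    for row in staged_rows:
        name = row["customer"].strip().upper()   # cleanse: normalize case/spacing
        amount = round(float(row["amount"]), 2)  # convert string to numeric
        out.append({"customer": name, "amount": amount})
    return out

def load(warehouse, rows):
    """Append the prepared rows to the target warehouse table."""
    warehouse.extend(rows)

warehouse = []
staged = extract([{"customer": " smith ", "amount": "19.990"}])
load(warehouse, transform(staged))
# warehouse -> [{"customer": "SMITH", "amount": 19.99}]
```

In a real warehouse each step is far more involved, but the shape is the same: raw data is held aside, reworked into one agreed representation, and only then written to the target tables.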
During the initial population of the data warehouse, historical data is loaded. This data often exists in multiple formats across the operational systems. If, for instance, the point-of-sale operational system was replaced two years ago, the most recent two years of history will be in one format, while older data will be in another.
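Handling the two historical layouts usually means writing one converter per source format, all producing the single representation the warehouse expects. A small sketch, with invented field names for a "new" and a "legacy" point-of-sale layout:

```python
# Hypothetical sketch: recent history arrives in a "new" record layout,
# older history in a "legacy" layout; both are mapped to one warehouse format.

def from_new_format(rec):
    return {"product_id": rec["prod_id"], "qty": rec["quantity"]}

def from_legacy_format(rec):
    # The legacy system stored the product code and count under other names,
    # and the count as a string.
    return {"product_id": rec["PRODCODE"], "qty": int(rec["CNT"])}

history = (
    [from_new_format(r) for r in [{"prod_id": 101, "quantity": 3}]] +
    [from_legacy_format(r) for r in [{"PRODCODE": 55, "CNT": "7"}]]
)
# history -> [{"product_id": 101, "qty": 3}, {"product_id": 55, "qty": 7}]
```

Once both converters exist, the initial historical load is just a matter of running each era of data through the appropriate one.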
After the initial historical load, new transaction and event data must be loaded on a periodic basis, typically on a regular schedule such as at the end of the day, week, or month. During the load, and while the indexes and materialized views are being refreshed, the data is generally unavailable to warehouse users for querying. The period of time allowed for inserting the new data is called the "batch window." The batch window keeps shrinking as more businesses stay online for longer periods. Higher availability can be achieved by partitioning the fact table: while data is loaded into a new partition, or updated in an existing one, the remaining partitions are still available for use.
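The availability benefit of partitioning can be illustrated with an in-memory analogy. In Oracle this is achieved with table partitions and operations such as exchanging a loaded table into a partition; the dictionary below is only a conceptual sketch with invented partition names.

```python
# Conceptual sketch: the batch load prepares one partition off to the side
# while queries continue to read the others.

fact_table = {
    "sales_2001_q1": [{"amount": 10}, {"amount": 20}],
    "sales_2001_q2": [{"amount": 30}],
}

def query_total(table, skip=()):
    """Sum over every partition that is not currently being loaded."""
    return sum(row["amount"]
               for name, rows in table.items() if name not in skip
               for row in rows)

# While Q3 is being staged, Q1 and Q2 remain fully queryable...
new_partition = [{"amount": 40}, {"amount": 5}]
total_during_load = query_total(fact_table, skip=("sales_2001_q3",))

# ...and the finished partition is attached in a single step when ready.
fact_table["sales_2001_q3"] = new_partition
total_after_load = query_total(fact_table)
# total_during_load -> 60, total_after_load -> 105
```

The key property is that attaching the new partition is one quick operation at the end, so the window during which users cannot query the table shrinks to almost nothing.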
A large portion of the work in building a data warehouse will be devoted to the ETL process. Finding the data from the operational systems, creating extraction processes to get the data out, transporting, filtering, cleansing, transforming, integrating data from multiple sources, and loading the data into the warehouse all can take a considerable amount of time.
If you are not already familiar with the company's data, part of the difficulty in developing the ETL process is gaining an understanding of the data. One object can have different names in different systems. Even worse, two different things could have the same name. It can be a challenge to discover all of this, particularly in a company where the systems are not well documented. Each column in the target data warehouse must be mapped to the corresponding column in the source system. Some of the data will be mapped directly; other data will need to be derived and transformed into a different format.
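A source-to-target mapping like the one described above can itself be expressed as code: some columns copy straight across, while others are derived from one or more source fields. The column names below are hypothetical.

```python
# Hedged sketch of a column mapping: direct copies plus derived columns.

DIRECT = {"CUST_NO": "customer_id"}  # source column -> target column

def derive(row):
    """Derived columns, e.g. a display name built from two source fields."""
    return {"customer_name": f'{row["FNAME"].title()} {row["LNAME"].title()}'}

def map_row(row):
    target = {tgt: row[src] for src, tgt in DIRECT.items()}
    target.update(derive(row))
    return target

mapped = map_row({"CUST_NO": 42, "FNAME": "ADA", "LNAME": "LOVELACE"})
# mapped -> {"customer_id": 42, "customer_name": "Ada Lovelace"}
```

Writing the mapping down explicitly, whether in code or in a design document, is also what exposes the naming conflicts: two systems using different names for the same thing, or the same name for different things.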
Once the data is loaded into the warehouse, further processing must take place before it can be "published" as ready for users to access: integrating it with existing data, updating indexes, gathering statistics, and refreshing materialized views. Once the data is published, it should not be updated again until the next batch window, to ensure that users do not receive different answers to the same query asked at different times.
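The publish step can be sketched as working on a private copy and swapping it in only when all post-load processing is complete. This is a conceptual analogy, not an Oracle mechanism; the names are invented.

```python
# Sketch of "publishing": post-load work happens off to the side, and users
# only ever see the last published snapshot, so a query gives the same
# answer all day.

published = {"row_count": 1000}       # what users currently query

def refresh(snapshot, new_rows):
    """Integrate new data on a working copy; publish only when fully prepared."""
    working = dict(snapshot)
    working["row_count"] += new_rows  # stand-in for integration, index
    return working                    # rebuilds, statistics, MV refresh

working = refresh(published, 250)
assert published["row_count"] == 1000  # users still see yesterday's data
published = working                    # "publish" at the end of the batch window
# published["row_count"] -> 1250
```

Deferring the swap until everything is ready is what guarantees consistent answers between batch windows.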