Transforming the data into a common representation

5.3 Transforming the data into a common representation

Once the data has been extracted from the operational systems, it is ready to be cleansed and transformed into a common representation. Differences between naming conventions, storage formats, data types, and encoding schemes must all be resolved. Duplicates are removed, relationships are validated, and unique key identifiers are added. In this section, various types of transformations will be introduced; later in the chapter we'll see specific examples of transformations.

5.3.1 Integrating data from multiple sources

Often the information needed to create a table in the warehouse comes from multiple source systems. If there is a field in common between the systems, the data can be joined via that column.

Integrating data from multiple sources can be very challenging. Different people may have designed the operational systems, at different times, using different styles, standards, and methodologies. They may use different technology (e.g., hardware platforms, database management systems, and operating system software). If data is coming from an IBM mainframe, the data may need to be converted from EBCDIC to ASCII or from big endian to little endian or vice versa.

To compound the problem, there may not be a common identifier in the source systems. For example, when creating the customer dimension, there may not be a CUSTOMER_ID in each system. You may have to look at customer names and addresses to determine that it is the same customer. These may have different spacing, case, and punctuation. Oracle Warehouse Builder helps address the customer deduplication problem.

5.3.2 Cleansing data

The majority of operational systems contain some dirty data, which means that there may be:

Duplicate records
Data missing
Data containing invalid values
Data pointing to primary keys that do not exist

Sometimes business rules are enforced by the applications; other times, by integrity constraints within the database; and sometimes there may be no enforcement at all.

Data must be standardized. For example, any given street address, such as 1741 Coleman Ave., can be represented in many ways. The word "Avenue" may be stored as "Ave," "Ave.," "Avenue," or "AVE." Search and replace transforms allow you to search for any of these values and replace them with the standard value you've chosen for your warehouse.

You may want to check the validity of certain types of data. If a product is sold only in three colors, you can validate that the data conforms to a list of values, such as red, yellow, and green. This list may change over time to include new values-for example, in February 2002 the product may also be available in blue. You may want to validate data against a larger list of values stored in a table, such as the states within the United States.

Some types of cleansing involve combining and separating character data. You may need to concatenate two string columns-for example, combining LAST_NAME, comma, and FIRST_NAME into the CUSTOMER_NAME column. Or you may need to use a substring operation to divide a string into separate parts, such as separating the area code from a phone number.

An important data integrity step involves enforcement of one-to-one and one-to-many relationships. Often these are checked as part of the transformation process rather than by using referential integrity constraints in the warehouse.

5.3.3 Deriving new data

While loading the data, you may want to perform calculations or derive new data from the existing data. For example, you may want to keep a running total or count of records as they are moved from the source to the target database.

During the design process, the appropriate level of granularity for the warehouse is determined. It is often best to store data at various levels of granularity with different retention and archive periods. The most fine-grained transaction data will usually be retained for a much shorter period of time than data aggregated at a higher level. Transaction granular sales data is necessary to analyze which products are purchased together. Daily sales of a product by store are used to analyze regional trends and product performance.

Data may be aggregated as part of the transformation process. If you did not want to store the detailed transactions in your data warehouse, the data can be aggregated prior to moving it to the data warehouse.

5.3.4 Generating warehouse keys

Instead of using the keys that were used in the operational system, a common design technique is to make up a new key, called the surrogate or synthetic key, to use in the warehouse. The surrogate key is usually a generated sequence of integers.

Surrogate keys are used for a variety of reasons. The keys used in the operational system may be long character strings, with meaning embedded into the components of the key. Because surrogate keys are integers, the fact tables and B*tree indexes are smaller, with fewer levels, and take less space, improving query response time. Surrogate keys provide a degree of isolation from changes in the operational system.

If the operational system changes the product-code naming conventions, or format, all data in the warehouse does not have to be changed. When one company acquires another, you may need to load products from the newly acquired company into the warehouse. It is highly unlikely that both companies used the same product encoding schemes. If there is a chance that the two companies used the same product key for different products, then the product key in the warehouse may need to be extended to add the company ID as well. The use of surrogate keys can greatly help integrate the data in these types of situations.

Both the surrogate keys and operational system keys are stored in the dimension table, as shown in Figure 5.4, where product code SR125 is known in the data warehouse as PRODUCT_ID 1. Therefore, we can see that in the fact table, the product key is stored as 1. However, users can happily query using code SR125, completely unaware of the transformation being done within the data warehouse.

click to expand
Figure 5.4: The use of surrogate keys in the warehouse.

The surrogate key is used in the fact table as the column that joins the fact table to the dimension table. In this example, there are two different formats for product codes. Some are numeric, separated by a dash: 123-118. Others are a mix of alphanumeric and numeric characters: SR125. As part of the ETL process, as each fact record is loaded, the surrogate key is looked up from the dimension tables and stored in the fact table.

5.3.5 Choosing the optimal place to perform the transformations

Transformations of the data may be done at any step in the ETL process. You need to decide the most efficient place to do each transformation: at the source, in the staging area, during the load operation, or in temporary tables once the data is loaded into the warehouse. Several new features have been added to Oracle 9i to facilitate performing transformations.

Transformations can be done as part of the extraction process. In general, it is best to do filtering types of transformations whenever possible at the source. This allows you to select only the records of interest for loading into the warehouse. Ideally, you want to extract only the data that has been changed since your last extraction. While transformations could be done at the source operational system, an important consideration is to minimize the additional load the extraction process puts on the operational system.
Transformations can be done in a staging area prior to loading the data into the data warehouse. When data needs to be integrated from multiple systems, this cannot be done as part of the extraction process. You can use flat files as your staging area, your Oracle database as your staging area, or a combination of both. If your incoming data is in a flat file, it is probably more efficient to finish your staging processes prior to loading the data into the Oracle warehouse. Transformations that require sorting, sequential processing, and row-at-a-time operations can be done efficiently in the flat file staging area.

If your source is an Oracle 8i or 9i database, transportable tablespaces make it easy to move data into the warehouse without first extracting the data into an external table. In this case, it makes sense to do the transformations in temporary staging tables once the data is in the warehouse. By doing transformations in Oracle, if the data can be processed in bulk using SQL set operations, they can be done in parallel. If some of the data needed for the transformation purposes are already in the warehouse, it may be more efficient to do the transformation in the warehouse. For example, key lookups to obtain the surrogate key from the dimension table can be efficiently performed in staging tables in the database. Section 5.5 describes some examples of performing transformations in a staging area within the Oracle database.
Transformations may be done during the load process. Some simple types of transformations can be done as the data is being loaded using SQL*Loader. For example, it can be used to change the case of a character column to uppercase. This is best done when a small number of rows need to be added-for example, when initially loading a dimension table. In Oracle 9i, a new feature-external tables-was introduced to facilitate more complex transformations of the data as part of the load process. The following text describes some examples of transformations while loading the data into the warehouse.