Chapter 5: Fact Table ETL


When processing data warehouses, dimension table ETL is only half the story. The next major aspect of ETL involves fact tables. But the good news is that dimension ETL usually represents the majority of the complexity in the overall warehouse ETL. To be sure, a data warehouse involves more than just dimension and fact ETL (such as data lineage, auditing, and execution precedence), but in terms of business data, dimension, and facts, it contains the core information.

In this chapter, we focus on applying SSIS to fact table ETL. We discuss the theory and general concepts of fact table ETL, including data mapping, workflow, and precedence. We also identify dimension surrogate keys and measure calculations, and discuss how to manage data grain changes.

Fact Table Overview

Instead of containing attributes about an entity as dimension tables do, a fact table contains the metrics or numbers that the report presents and that the cubes aggregate, as well as trends. Measures include sales amount, account balance, discount amount, and shipping quantity. Figure 5-1 shows the Internet sales fact table, which contains several measures related to the direct sales to consumers, such as OrderQuantity, UnitPrice, ExtendedAmount, and so on.

image from book
Figure 5-1: Internet sales fact table

In addition to the measures, a fact table also contains the relationships between the measures and the dimensions. As Figure 5-1 shows, the fact table contains the foreign key relationship from the dimension tables. Essentially, the primary key in the dimension table is the dimension’s surrogate key. The surrogate keys from the related dimension tables are included in the fact table. This fact table also contains degenerate dimension values and lineage columns, which often overlap, to provide the transaction IDs to allow tracing back to the source data and filtering by transaction numbers.

The three types of columns in the fact table are dimension keys, measures, and metadata. These columns map directly to the ETL processing requirements for fact table ETL, mapping dimension keys, calculating measures, and adding metadata.

Mapping Dimension Keys

Data warehouse project plans should include some mechanism to map source tables or files to both the dimension and fact tables. The fact table mapping must include not just where and how the measures are calculated, but also how the dimension relationships are mapped to acquire the surrogate key. In other words, when you are loading a fact table, you need to have the business keys of the dimension available that are used to join to the dimension table to identify the surrogate keys that are inserted into the fact table. The business keys are sometimes called candidate keys. There may even be situations when one fact table uses one set of business keys to look up against a dimension, and a second fact table uses a second key or keys to identify the dimension record for the same dimension table.

Calculating Measures

Measures in fact tables may be generated from multiple source columns and tables, or files. At times, a measure may be derived in the ETL process without a direct mapping from a source value. A common example of this is when Analysis Services (or other cubing engine) is involved, and a counting or occurrence measure is needed. Such measures usually assume the values of 0 or 1, based on a source occurrence. These measures are additive in nature, and fit nicely with Analysis Services aggregations.

Adding Metadata

Fact tables (and dimension tables) often include metadata columns that provide information about the rows in the table. These metadata columns include datetime columns, batch identifiers, and lineage columns. datetime and batch identifiers help to identify when a row was added, and with what ETL processing run. Lineage, on the other hand, identifies where a record came from in the source system down to the source row or records.

Important 

In contrast to fact table volumes, dimension tables typically don’t have many records-most often fewer than a million records. However, fact tables can scale into millions of records, and even billions, depending on how much data is coming across from the transactional sources. Column optimization of the fact table is, therefore, critical to performance. Because of this, fact tables may not always have metadata in order to reduce the overall row width for performance reasons. By keeping the fact table narrow, you trade off the benefits of the metadata for better overall performance, both querying and loading data.

Fact Table Types

Dimensions ETL involved dealing with different change types and managing change history. Fact table ETL has similar concepts, which involve different types of fact tables, managing history and change, and data grain changes from the source to the fact table.

Figure 5-2 shows the primary sales fact tables within the AdventureWorksDW database: reseller sales and Internet sales.

image from book
Figure 5-2: Primary sales fact tables within the AdventureWorksDW database

These fact tables are transaction fact tables. This means that the fact tables contain transactional activity. In this case, the transactions are sales activity, but other transactional fact tables may have a different type of transaction, such as system alerts, shipping activity, and so on. Furthermore, in this case, both the reseller sales and Internet sales fact tables come from the same source tables. This is an example of a data grain change.

Another fact table type is called a snapshot fact. Figure 5-3 shows the schemas of the finance fact table and the currency rate fact table, which are examples of a snapshot fact table.

image from book
Figure 5-3: Schemas of the finance fact table and the currency rate fact table

A snapshot fact table is a point-in-time picture of the levels or balances of the source. The finance fact table, for example, contains account balance snapshots of the amounts on a monthly level. The advantage to this design is that it allows quick querying of the balances historically without having to compile the transactions in order to answer the query. But also, it also allows trending over time of the balances at different grains of the dimensionality.

Another common type of snapshot fact table targets inventory levels. Instead of account balances tracked over time, an inventory-level fact table tracks in-stock levels at different locations at various times.



Expert SQL Server 2005 Integration Services
Expert SQL Server 2005 Integration Services (Programmer to Programmer)
ISBN: 0470134119
EAN: 2147483647
Year: 2004
Pages: 111

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net