Dimensions: The Basics | Expert SQL Server 2005 Integration Services (Programmer to Programmer)

This chapter focuses on dimension ETL, but in order to understand the full picture of dimension ETL, some dimension theory is required. The best resource is to read The Complete Guide to Dimension Modeling, Second Edition, by Ralph Kimball and Margy Ross (Wiley Publications, 2002), for a complete picture of a dimension. For the purposes of this discussion, here’s a high-level summary.

The dimension itself is an organized grouping of categories and properties about a particular entity. These categories and properties are called attributes, and they form hierarchies with levels and members used to slice and query a cube or fact table.

For example, the following table shows data focusing on geography. The attributes are the Country, State/Province and City, which combine together because of their relationship to one another.

Open table as spreadsheet

Country	State/Province	City
Germany	Bavaria	Augsburg
Germany	Bavaria	Munich
Germany	Niedersachsen	Hannover
United States	California	Palo Alto
United States	California	Woodland Hills
United States	New York	Cheektowaga
United States	New York	Lake George

Naturally, this data can be used together to create a hierarchy, which provides drill paths to data and the ability to slice and dice data based on certain records. Figure 4-1 shows one of the hierarchies within this geography dimension. The members are the names of the countries, state provinces, and cities. The hierarchy is made up of levels that usually correspond to the column names used in the hierarchy.

image from book
Figure 4-1: Hierarchy within a geography dimension

In this example, the hierarchy is made up of the levels Country, State, and City, in that order. When querying the dimension, the lower levels roll up to higher levels, so analysis can be done drilling up or down the hierarchy.

It is also important to understand how dimensions relate to fact tables. Chapter 5 examines the topic of fact table ETL, but some background in dimension to fact relationships will help you grasp the ETL requirements for dimension processing. Since the AdventureWorksDW sample databases will be used to illustrate this chapter, Figure 4-2 shows one of the fact tables with its related dimension. This is a picture of a single fact table, Reseller Sales Fact, one of the six fact tables in the AdventureWorksDW database.

image from book
Figure 4-2: Fact table with its related dimension

Related to this fact table is a product dimension, a reseller dimension, an employee dimension, a time dimension, and a customer dimension (plus others not shown). In all, these are pretty typical dimensions. You’ll notice that the product dimension has multiple tables associated with it in what’s called a snowflake design. The employee dimension is also a different type of dimension, called a parent-child, given its self referencing nature.

Each dimension contains a surrogate key, as shown in Figure 4-2. The surrogate keys are identifiable by the Key suffix in the column name, such as ProductKey. This defines the primary key for the table itself and relates to the fact table foreign key field. Typically, these data types are numeric, most likely integer and auto-incrementing (called an IDENTITY column in SQL Server). Some designers prefer to use a unique identifier (GUID), but these column types are wider (16 bytes) and more difficult to read. The best practice recommendation is to use integer data types-either a 1-, 2-, 4-, or 8-byte integer based on the number of projected members within the dimension table.

The business keys are still tracked in the dimension table, but are not marked as the primary key to identify uniqueness, nor are they used for referential integrity in the database. Instead, they help identify source system data associations for the ETL. In the AdventureWorksDW database, each dimension’s business key is called the [DimensionName]AlternateKey for standardization, but simply contains the source system’s primary key value. This business key column is sometimes called the candidate key. Most dimension tables name the business key with the same name as the source system name.

A surrogate key, more precisely, is a single column with a value that is unrelated to the dimension table’s members. These keys provide the central foundation to dimensional modeling, critical to tracking history and designing a business intelligence system that performs. Here are a few advantages to surrogate key usage:

Surrogate keys consolidate multi-value business keys- A source table or file with multicolumn business keys can be difficult to manage and will take up extra unnecessary space if used as the primary key. Some dimensions are sourced by multiple source systems, which also creates the situation of multiple business keys that a single surrogate key can handle.
Surrogate keys allow tracking of dimension history- Tracking of a dimension attribute’s history (such as the list price example discussed in the opening section of this chapter) is not possible without surrogate keys. The surrogate key allows multiple versions of a single source record to be tracked and associated with the fact table history.
Surrogate keys standardize dimension tables- Having an identical relationship mechanism to every dimension creates simplicity in reporting and standardization or consistency in the design.
Surrogate keys improve query performance- From an optimization standpoint, surrogate keys limit a fact table width. In other words, the combined total data type size for the fact table is small, allowing more fact table rows to fit in memory and, therefore, improving query performance.

Important

Make the surrogate key as narrow as possible. If you have a dimension table that only has 100 potential rows (considering future additions), a 1-byte integer (tinyint) would be the best choice. Based on the number of members in the dimension, make your data type as small as possible, but be sure you account for future rows! The performance gains will be pretty significant. Simply by going from a 4-byte int to a 2-byte int (just 2 bytes), you may improve performance by as much as 4–5 percent. And when all the dimension surrogate keys are optimized, it can translate into big performance gains on queries, which, in turn, reduces a server’s resource load. In other words, don’t make every dimension key a 4-byte integer. If you can take the fact table width and cut it in half, that’s twice as many rows you can put in a SQL page that goes into memory. Inserts, updates, and queries will run much faster.

Later in this chapter, we will discuss the ETL for two advanced dimension forms:

Snowflake dimension tables
Parent-child dimension

Each of these requires unique handling in the ETL with SSIS.