Dimensional Modeling Concepts | Practical Business Intelligence with SQL Server 2005

So far we have looked at several of the key dimensional concepts, including dimensions, attributes, measures, and fact tables. You need to understand a few other areas before we can move on to actually building a BI solution.

Hierarchies

As you have seen, dimensions consist of a list of descriptive attributes that are used to group and analyze information. Some of these attributes are strongly related, and can be grouped into a hierarchy. For example, product category, product subcategory and product stock-keeping unit (SKU) could be grouped into a hierarchy called Product Categorization. When the hierarchy is used in a query, the results would show the totals for each product category, and then allow the user to drill down into the subcategories, and then into the product SKUs that make up the subcategory, as shown in Figure 1-6.

Figure 1-6. Product hierarchy drilldown

Hierarchies are useful for comprehending large amounts of information by presenting summary information and allowing people to drill down for more details in the areas of interest. OLAP technology has typically been built around hierarchy definitions; in fact, many OLAP tools in the past only allowed users to create queries using the predefined hierarchies. The reason for this was that the aggregates, which are the source of OLAP's performance, were all designed around the hierarchy levels.

Stars and Snowflakes

The simplest dimensional model has the "star" design shown in Figure 1-3, with a single table for each dimension such as Product or Customer. This means that the tables are not fully normalized, because attributes such as product category description are repeated on every product record for that category. In the past, the star schema was an attractive design because you could allow users to access the relational database directly without them having to worry about joining multiple separate dimension tables together and because relational databases did not used to do a very good job of optimizing queries against more complex schemas.

Modern BI solutions have an entirely different approach to providing the two main benefits that used to come from having single dimension tables in a star schema: If users are accessing all their information from an OLAP cube, the usability and query performance come from the OLAP layer, not from the relational database.

This means that we can move beyond dogmatically denormalizing every dimension table into a star schema and where necessary take advantage of a different design usually known as a snowflake. A snowflake dimension has been partly renormalized so that the single table is broken out into several separate tables with one-to-many relationships between them, as shown in Figure 1-7.

Figure 1-7. Snowflake design

So now that we have two different possible designs, a single-dimension table in a star design or multiple tables in a snowflake design, how do you know which one to use and when? Because the OLAP cube is providing the performance and user-friendly model, the main criterion for choosing between star and snowflake is how it will affect your ETL process.

Choosing Between Star and Snowflake for a Dimension

A single dimension table is often the easiest design to handle for the ETL process, especially when all the information in the dimension comes from a single source system. We can simply set up a query on the source system that joins all the various component tables together and presents a nice simple set of source columns to the ETL process. It's then easy to use SQL Server's ETL features to detect which rows have been added or changed and make the appropriate updates to the dimension table.

A snowflake design starts to get much more attractive when some of the dimension's attributes come from a different source. For example, a Geography dimension might consist of some attributes that describe a customer's physical address and other attributes that describe which sales territory the address is located within. If the customer addresses are coming from the main OLTP system but the master list of sales territories is just a spreadsheet, it might make the ETL process easier if you snowflake the Geography dimension into Sales Territory and Location tables that can then be updated separately (and yes, it is permissible to use "snowflake" as a verb in BI circlesjust be prepared to defend yourself if you do it within earshot of an English major).

The other reason that designers sometimes choose a snowflake design is when the dimension has a strong natural hierarchy, such as a Product dimension that is broken out into Category, Subcategory, and Product SKU levels. If those three levels map to normalized dimension tables in the source system, it might be easier to manage the ETL process if the dimension consists of three tables rather than one. Also, because of the way Analysis Services queries the data warehouse to load the data for a dimension's attributes, a snowflake design can improve the performance of loading large Analysis Services dimensions.

You might also think that by renormalizing the dimension tables into a snowflake structure, you will save lots of disk space because the descriptions won't be repeated on every dimension record. Actually, although it is technically correct that the total storage used by dimensions is smaller in a snowflake schema, the relatively huge size of the fact tables compared with the dimension tables means that almost any attempt to optimize the dimensions to save on data warehouse space is going to be a drop in the ocean.

Using Surrogate Keys

Most dimensions that you create from data in source systems will have an obvious candidate for a primary key. In the case of a Product dimension, the primary key in the source system might be a product code, or a customer number in the case of a Customer dimension. These keys are examples of business keys, and in an OLTP environment they are often used as the primary key for tables when you are following a standard E/R modeling approach.

You may think that the best approach would be to use these business keys as the primary key on all of your dimension tables in the data warehouse, too. In fact, we recommend that in the data warehouse, you never use business keys as primary identifiers. Instead, you can create a new column containing an integer key with automatically generated values, known as a surrogate key, for every dimension table.

These surrogate keys are used as primary identifiers for all dimension tables in the data warehouse, and every fact table record that refers to a dimension always uses the surrogate key rather than the business key. All relationships in the data warehouse use the surrogate key, including the relationships between different dimension tables in a snowflake structure. Because the data warehouse uses surrogate keys and the source systems use business keys, this means that one important step in the ETL process is to translate the business keys in the incoming transaction records into data warehouse surrogate keys before inserting the new fact records.

You also need to keep the original business key on the dimension record in addition to the new surrogate key. In some cases, users have become used to working with some business keys such as product codes and might want to use these keys as an attribute in their queries. Also, even though the business key may not always uniquely identify the dimension record anymore for reasons explained in Chapter 8, they are required for the ETL process to be able to translate the business keys on the incoming fact records.

Using surrogate rather than business keys is another of those areas that appears to contradict best practices for OLTP databases, so why would we do this? One reason is to have independence from a single source system, so that they can change their internal coding structures and also so that we can acquire new companies and systems without having to modify the data warehouse structure.

Another advantage of using surrogate keys is data storage size. Unlike trying to optimize dimension table sizes, which is more or less irrelevant in the general scheme of things, any tiny difference that you can make to the size of the fact table often translates into huge space savings. Using 4-byte (or even smaller) integer keys for all dimension keys on the fact table rather than long product codes or customer identifiers means you save gigabytes of storage on a typical fact table.