Additional Design Concepts and Techniques

Even though the dimensional modeling concepts we've described are fairly simple, they are applicable to a wide range of business scenarios. However, there are a few additional dimensional modeling concepts and techniques that are critical to implementing viable dimensional models. We start this section with a couple of key concepts: surrogate keys and slowly changing dimensions. Then we look at several techniques for modeling more complex business situations. Finally, we review the different types of fact tables. We briefly describe each concept or technique and provide references so you can find more detailed information if you need it.

Surrogate Keys

For your DW/BI system, you will need to create a whole new set of keys in the data warehouse database, separate from the keys in the transaction source systems. We call these keys surrogate keys, although they are also known as meaningless keys, substitute keys, non-natural keys, or artificial keys. A surrogate key is a unique value, usually an integer, assigned to each row in the dimension. This surrogate key becomes the primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table. Using surrogate keys in all dimension tables reaps the following benefits (and more):

  • Surrogate keys protect the DW/BI system from changes in the source system. For example, a migration to a new software package will likely create a new set of business keys in the transaction system.

  • Surrogate keys allow the DW/BI system to integrate data from multiple source systems. Different source systems might keep data on the same customers or products, but with different keys. (Some of you might be shocked to hear that the same customer can be issued more than one customer ID, but it does happen.)

  • Surrogate keys enable you to add rows to dimensions that do not exist in the source system. For example, there may be order transactions that come to the warehouse without an assigned sales rep. You can replace the NULL sales rep value in the fact table row with a surrogate key from the sales rep dimension that points to a row with the description "Sales rep not yet assigned."

  • Surrogate keys provide the means for tracking changes in dimension attributes over time. Generally, keeping track of history requires that you add new rows to the dimension table when an attribute changes. This means there will be multiple rows in the dimension with the same source system transaction key. The surrogate key provides the unique key for each row. We discuss attribute change tracking in an upcoming section on slowly changing dimensions (SCDs).

  • Integer surrogate keys are efficient keys in the relational database and Analysis Services. Using them improves query and processing performance and, because of their compactness, substantially reduces the size of the fact tables in the relational database.

The ability to track changes in dimension attributes over time is reason enough to implement surrogate keys. We've regretted it more than once when we decided not to track changes in attribute values over time, and later found out the historical values were important to support certain business analyses. We had to go back and add surrogate keys and re-create the dimension's change history. This is not a fun project; we encourage you to do it right the first time. If you use surrogate keys for all dimensions at the outset, it's much easier to change a dimension so that it tracks history.

Note 

The date dimension deviates slightly from the rules by using a surrogate key based on the date rather than a completely meaningless key. We discuss this further in the upcoming section on the date dimension.

The biggest cost of using surrogate keys is the burden it places on the ETL system. First you need to assign the surrogate keys to the dimension rows, and then you need to substitute the surrogate keys from the dimensions for the transaction system keys in the fact table rows. An extract for a fact table row generally is a list of transaction system keys and facts. We call the transaction system key the business (or natural) key, although it is usually not business-like. Because the dimension tables are accessed by surrogate keys, the fact table rows will not join to the dimensions until the business keys are replaced by surrogate keys. The ETL process uses each business key in the fact table extract to look up the current associated surrogate key for every dimension linked to the fact table. In other words, when you load fact table rows, you need to look up the source system key in each dimension table and substitute the corresponding surrogate key, in what we call the surrogate key pipeline. Integration Services will help you with this work, as we describe in Chapter 6.
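As a concrete illustration, here is a minimal sketch of the surrogate key pipeline in Python. The table structures, column names, and the special "not yet assigned" key are all made up for this example; a real system would do this work with Integration Services lookups, as described in Chapter 6.

```python
# Illustrative surrogate key pipeline: each dimension maintains a map from
# the source-system business key to the current surrogate key, and the fact
# load substitutes surrogate keys for business keys.

UNASSIGNED_KEY = 0  # points at a special "Not yet assigned" dimension row


def build_key_map(dimension_rows):
    """Map business key -> current surrogate key for one dimension."""
    return {row["business_key"]: row["surrogate_key"] for row in dimension_rows}


def load_fact_row(extract_row, key_maps):
    """Replace each business key in a fact extract with its surrogate key."""
    fact_row = {"dollar_amount": extract_row["dollar_amount"]}
    for dim, business_key in extract_row["keys"].items():
        # A missing business key (e.g. no sales rep yet) gets the special row.
        fact_row[dim + "_key"] = key_maps[dim].get(business_key, UNASSIGNED_KEY)
    return fact_row


customer_map = build_key_map([{"business_key": 31421, "surrogate_key": 1552}])
rep_map = build_key_map([{"business_key": "R07", "surrogate_key": 12}])
key_maps = {"customer": customer_map, "sales_rep": rep_map}

row = load_fact_row(
    {"keys": {"customer": 31421, "sales_rep": None}, "dollar_amount": 19.95},
    key_maps,
)
print(row)  # {'dollar_amount': 19.95, 'customer_key': 1552, 'sales_rep_key': 0}
```

Note how the row with no sales rep still joins cleanly to the sales rep dimension through the special key, rather than carrying a NULL.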

Reference 

The following resources offer additional information about surrogate keys:

  • The Data Warehouse Toolkit, Second Edition (Wiley, 2002), pp. 58-62.

  • The MSDN Library for Visual Studio 2005: Search for a topic called Dimension Tables under the main topic Creating a Data Warehouse.

Slowly Changing Dimensions

Although we like to think of the attribute values in a dimension as fixed, it turns out that some change over time. In an employee dimension, the date of birth should not change over time (other than error corrections, which your ETL system should expect). However, other attributes, such as the employee's department, might change several times over the length of a person's employment. Many of these changes are critical to understanding the dynamics of the business. The ability to track these changes over time is one of the fundamental reasons for the existence of the DW/BI system.

If the value of an attribute can change, you need to be prepared to deal with that change. We call dimensions that have changeable attribute values slowly changing dimensions (SCDs). However, just because something changes doesn't mean the change has significance to the business. The choice of which dimensions and attributes you need to track, and how you track them, is a business decision. The most common techniques, called Type 1 and Type 2 slowly changing dimensions, deal respectively with changes the business doesn't care about and changes the business wants to track. Most of the popular ETL tools, including Integration Services, have these techniques built in.

HOW SLOW IS SLOW?

We call this concept a slowly changing dimension because the attributes that describe a business entity generally don't change very often. Take customer address, for example. According to the US Census Bureau ( www.census.gov/population/www/pop-profile/geomob.html ), 16.8 percent of Americans move in a given year, and 62 percent of these moves are within the same county. If a change in zip code is considered to be a significant business event, a simple Customer dimension with name and address information should generate fewer than 16.8 percent changed rows per year. If an attribute changes rapidly, causing the dimension to grow at a dramatic rate, this usually indicates the presence of a business process that should be tracked separately, either as a separate dimension or as a fact table, rather than as a dimension attribute.


A Type 1 SCD overwrites the existing attribute value with the new value. Use this method if you don't care about keeping track of historical values when the value of an attribute changes. The Type 1 change does not preserve the attribute value that was in place at the time a historical transaction occurred.

If you need to track the history of attribute changes, use a Type 2 SCD. Type 2 change tracking is a powerful technique for capturing the attribute values that were in effect at a point in time and relating them to the business events in which they participated. When a change to a Type 2 attribute occurs, the ETL process creates a new row in the dimension table to capture the new values of the changed item. The attributes in the new row are in effect from the time of the change forward. The attributes in the previously existing row are marked to show they were in effect right up until the appearance of the new row.

Note 

A third change tracking technique, called Type 3, keeps separate columns for both the old and new attribute values, sometimes called alternate realities. In our experience, Type 3 is less common because it involves changing the physical tables and is not very scalable.
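To make the Type 1 and Type 2 mechanics concrete, here is a minimal sketch in Python. The row structures and helper functions are illustrative only (a real ETL tool handles this for you); the sample data mirrors the Jane Rider example used throughout this section.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # "current row" end-date convention


def apply_type1(dimension, business_key, attribute, value):
    """Type 1: overwrite the attribute in every row, destroying history."""
    for row in dimension:
        if row["business_key"] == business_key:
            row[attribute] = value


def apply_type2(dimension, business_key, changes, change_date, new_key):
    """Type 2: expire the current row, then add a row with a new surrogate key."""
    current = next(r for r in dimension
                   if r["business_key"] == business_key
                   and r["end_date"] == HIGH_DATE)
    current["end_date"] = change_date - timedelta(days=1)  # valid until the change
    new_row = dict(current, surrogate_key=new_key,
                   eff_date=change_date, end_date=HIGH_DATE)
    new_row.update(changes)
    dimension.append(new_row)


customer = [{"surrogate_key": 1552, "business_key": 31421,
             "commute_distance": 3, "home_owner_flag": "No",
             "eff_date": date(2004, 1, 7), "end_date": HIGH_DATE}]

# Jane moves on January 2, 2006.
apply_type2(customer, 31421,
            {"commute_distance": 31, "home_owner_flag": "Yes"},
            date(2006, 1, 2), new_key=2387)

print(len(customer))                 # 2
print(customer[0]["end_date"])       # 2006-01-01
print(customer[1]["surrogate_key"])  # 2387
```

With Type 1, the single existing row would simply have been updated in place and the old commute distance lost.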

The best way to understand the concepts of Type 1 and Type 2 change tracking, and the substantial impact of accurately tracking changes, is with an example. As we'll see in the case study, Adventure Works Cycles collects demographic information from its Internet customers. These attributes, such as gender, homeowner status, education, and commute distance, all go into the customer dimension. Many of these attributes will change over time. If a customer moves, her commute distance will change. If a customer bought a bike when he went to college, his education status will change when he graduates. These attributes all certainly qualify as slowly changing, but should you treat them as Type 1 or Type 2 attributes?

If you've been paying attention, you know the correct answer is to ask the business users how they'll use this information. You already know from the requirements gathering in Chapter 1 that Marketing intends to use some data mining techniques to look for demographic attributes that predict buying behaviors. In particular, they want to know if certain attributes are predictive of higher-than-average buying. They could use those attributes to create targeted marketing campaigns. Let's examine some example customer-level data to see how decisions on handling attribute changes can affect the information you provide your users.

Table 2.1 shows the row in the customer dimension for a customer named Jane Rider as of January 1, 2006. Notice that the customer dimension includes the natural key from the transaction system along with the attributes that describe Jane Rider. The natural key allows users to tie back to the transaction system if need be.

Table 2.1: Example Customer Dimension Table Row as of Jan 1, 2006

| CUSTOMER_KEY | BKCUSTOMER_ID | CUSTOMER_NAME | COMMUTE_DISTANCE | GENDER | HOME_OWNER_FLAG |
| ------------ | ------------- | ------------- | ---------------- | ------ | --------------- |
| 1552         | 31421         | Jane Rider    | 3                | Female | No              |

Table 2.2 shows some example rows from an abridged version of the AWC Orders fact table in the data warehouse database. We have used surrogate keys for customer and product, but we left the order date as a date for the sake of clarity. These rows show all of Jane Rider's AWC orders.

Table 2.2: Example Order Fact Table Rows for Jane Rider as of Feb 22, 2006

| DATE      | CUSTOMER_KEY | PRODUCT_KEY | ITEM_COUNT | DOLLAR_AMOUNT |
| --------- | ------------ | ----------- | ---------- | ------------- |
| 1/7/2004  | 1552         | 95          | 1          | 1,798.00      |
| 3/2/2004  | 1552         | 37          | 1          | 27.95         |
| 5/7/2005  | 1552         | 87          | 2          | 320.26        |
| 8/21/2005 | 1552         | 33          | 2          | 129.99        |
| 2/21/2006 | 1552         | 42          | 1          | 19.95         |

A marketing analyst might want to look at these two tables together, across all customers, to see if any of the customer attributes predict behavior. For example, one might hypothesize that people with short commute distances are more likely to buy nice bikes and accessories than people with long commute distances. They might ride their bikes to work rather than drive.

Overwriting Values: The Type 1 Change

A few additional details help demonstrate the importance of appropriately handling attribute changes. To the untrained eye, attributes such as commute distance and homeowner flag might not seem important to the business, so you might treat them as Type 1 attributes and overwrite the existing values when changes occur. As it turns out, Jane took advantage of low mortgage interest rates and moved into a new home on January 2, 2006, changing her commute distance to 31 miles. If you do a Type 1 overwrite, Jane's commute distance value in Table 2.1 will change to 31, and her homeowner flag will change to Yes.

Table 2.4: Updated Order Fact Table Rows for Jane Rider as of Feb 22, 2006

| DATE      | CUSTOMER_KEY | PRODUCT_KEY | ITEM_COUNT | DOLLAR_AMOUNT |
| --------- | ------------ | ----------- | ---------- | ------------- |
| 1/7/2004  | 1552         | 95          | 1          | 1,798.00      |
| 3/2/2004  | 1552         | 37          | 1          | 27.95         |
| 5/7/2005  | 1552         | 87          | 2          | 320.26        |
| 8/21/2005 | 1552         | 33          | 2          | 129.99        |
| 2/21/2006 | 2387         | 42          | 1          | 19.95         |

The timing of the attribute change events is important to understanding their impact on the analysis. If the marketing analyst decided to investigate the relationship between commute distance and orders on March 1, 2006, after the overwrite took place, all of Jane Rider's orders would be counted as occurring with a commute distance of 31 miles. By overwriting commute distance when a change occurs, you're no longer tracking the actual commute distance that was in effect at the time the order was placed. Instead, you're assigning all historical orders to the current attribute value. This means that any time your analysts try to do historical correlation analysis, they risk drawing incorrect conclusions.

Does this matter? Only the business users can say. If there aren't a significant number of rows that change, or if the attribute is not considered relevant, it probably doesn't matter if you overwrite the old value when a change occurs. However, you must be extremely careful in making this decision: it requires detailed exploration of the potential impact with knowledgeable business folks.

The short answer is that it usually does matter. In the example case, the analyst may find a significant correlation between long commute distances and larger orders. This finding might then be used to design a marketing campaign targeted at people with long commute distances, a campaign that will likely be ineffective. In this case, the errors introduced by the Type 1 overwrite mainly cost money and time. In some cases, as in health care, overwriting attribute values might cost even more.

Tracking Historical Valuesthe Type 2 Change

Let's see how the same example data looks if you tracked the changes using the Type 2 approach. This will affect both the fact and dimension tables. Let's consider the dimension table first: When Jane Rider moves on January 2, 2006, add a new row to the customer dimension, with a new surrogate key, as shown in Table 2.3.

Table 2.3: Example Customer Dimension Table Rows as of Jan 2, 2006 with a Type 2 Change

| CUSTOMER_KEY | BKCUSTOMER_ID | CUSTOMER_NAME | COMMUTE_DISTANCE | GENDER | HOME_OWNER_FLAG | EFF_DATE | END_DATE   |
| ------------ | ------------- | ------------- | ---------------- | ------ | --------------- | -------- | ---------- |
| 1552         | 31421         | Jane Rider    | 3                | Female | No              | 1/7/2004 | 1/1/2006   |
| 2387         | 31421         | Jane Rider    | 31               | Female | Yes             | 1/2/2006 | 12/31/9999 |

The customer dimension table has been augmented to help manage the Type 2 process. Two columns have been added to indicate the effective date and end date of each row. This is how you can tell exactly which row was in effect at any given time.
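The effective and end dates support exactly this kind of point-in-time lookup. As a sketch, using the two rows from Table 2.3 held as Python dicts (an in-memory stand-in for the dimension table):

```python
from datetime import date

# Type 2 rows for Jane Rider, as in Table 2.3.
rows = [
    {"customer_key": 1552, "bkcustomer_id": 31421, "commute_distance": 3,
     "eff_date": date(2004, 1, 7), "end_date": date(2006, 1, 1)},
    {"customer_key": 2387, "bkcustomer_id": 31421, "commute_distance": 31,
     "eff_date": date(2006, 1, 2), "end_date": date(9999, 12, 31)},
]


def row_in_effect(rows, bk, as_of):
    """Find the dimension row in effect for a business key on a given date."""
    return next(r for r in rows
                if r["bkcustomer_id"] == bk
                and r["eff_date"] <= as_of <= r["end_date"])


print(row_in_effect(rows, 31421, date(2006, 2, 21))["customer_key"])  # 2387
print(row_in_effect(rows, 31421, date(2005, 5, 7))["customer_key"])   # 1552
```

The order placed after the move resolves to the new row (2387), while earlier orders still resolve to the original row (1552).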

As Table 2.4 shows, the fact table has to change as well because it has a row that occurred after the Type 2 change. The last row of the fact table needs to join to the row in the dimension that was in effect when the order occurred on February 21, 2006. This is the new dimension row with customer_key = 2387.

The correct assignment of dimension keys to fact table rows is handled as a matter of course in the ETL process. Integration Services even has a wizard to define this process. We have to go back and update the fact table in our example because we didn't start with Type 2 attributes in the first place. It would have been much better if we had designed the customer dimension as a Type 2 SCD originally. This should also give you a feel for the rework you'll face if you don't use Type 2 change tracking from the start.

Now when the marketing analyst does the analysis, the data will show $2,276.20 in orders associated with a 3-mile commute distance and $19.95 associated with a 31-mile commute distance. The way to decide whether to track attribute changes as Type 1 or Type 2 is to ask, "Does it matter?" From the marketing analyst's perspective in this example, accurately tracking changes over time obviously matters a lot.

Reference 

The following resources offer additional information about slowly changing dimensions:

  • The Data Warehouse Toolkit, Second Edition (Wiley, 2002), pp. 95-105.

  • Books Online: Search for the topic Changing Dimension Support (SSAS).

Dates

Date is the fundamental business dimension across all organizations and industries, although many times a date table doesn't exist in the operational environment. Analyses that trend across dates or make comparisons between periods (that is, nearly all business analyses) are best supported by creating and maintaining a robust Date dimension table.

Every dimensional DW/BI system has a Date (or Calendar) dimension, typically with one row for every day for which you expect to have data in a fact table. Calling it the Date dimension emphasizes that its grain is at the day level rather than time of day. In other words, the Date table will have 365 or 366 rows in it per year.

Note 

The Date dimension is a good example of a role-playing dimension. It is common for a Date dimension to be used to represent different dates, such as order date, due date, and ship date. To support users who will be directly accessing the relational database, you can either define a synonym on the view of the Date dimension for each role, or define multiple views. The synonym approach is simple, but it allows you to rename the entire table only, not the columns within the table. Reports that can't distinguish between order date and due date can be confusing.

If your users will access the data through Analysis Services only, you don't need to bother with views or synonyms to handle multiple roles. Analysis Services understands the concept of role-playing dimensions.

Earlier in this section we said that all dimensions should use surrogate keys, and we emphasized the importance of surrogate keys in tracking attribute changes. We recommend surrogate keys for the Date dimension, even though Date is unlikely to have attributes that require Type 2 change tracking, nor will it face interesting integration challenges from multiple systems, two of the key arguments for using surrogate keys for other dimensions.

Use a surrogate key for Date because of the classic problem often faced by technical people: Sometimes you don't have a date. While we can't help the technical people here, we can say that transactions often arrive at the data warehouse database without a date because the value in the source system is missing, unknowable, or the event hasn't happened yet. Surrogate keys on the Date dimension help manage the problem of missing dates. Create a few rows in the Date dimension that describe such events, and assign the appropriate surrogate date key to the fact rows with the missing dates.

In the absence of a surrogate date key, you will probably end up creating Date dimension members with strange dates such as 1-Jan-1900 to mean "Unknown" and 31-Dec-2079 to mean "Hasn't happened yet." Overloading the fact dates in this way isn't the end of the world, but it can confuse users and cause erroneous results. Besides, it's pretty hokey.

The Date dimension surrogate key has one slight deviation from the rule. Where other surrogate keys are usually a meaningless sequence of integers, it's a good idea to use a meaningful value for the Date surrogate key. Specifically, use an integer that corresponds to the date in year-month-day order, so September 22, 2005 would be 20050922. This can lead to more efficient queries against the relational database. It also makes implementing date-based partitioning much easier, and the partition management function will be more understandable.
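Computing such a "smart" date key can be sketched in a couple of lines. The reserved keys for the missing-date rows below are our own illustrative convention, not a standard:

```python
from datetime import date

# The "smart" Date surrogate key: an integer in year-month-day order.
def date_key(d):
    return d.year * 10000 + d.month * 100 + d.day

# Reserved keys for rows describing missing dates; negative values keep them
# well clear of the real date range (an illustrative convention only).
UNKNOWN_DATE_KEY = -1       # points at the "Unknown" dimension row
NOT_YET_HAPPENED_KEY = -2   # points at the "Hasn't happened yet" row

print(date_key(date(2005, 9, 22)))  # 20050922
```

A fact row with no date gets one of the reserved keys instead of an overloaded dummy date.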

Just to prove that we're not entirely dogmatic and inflexible, we don't always use surrogate keys for dates that appear as dimension attributes. Generally, if a particular date attribute has business meaning, we use the Date surrogate key. Otherwise we use a smalldatetime data type. One way to spot a good candidate for a Date surrogate key is if the attribute's date values fall within the date range of the organizational calendar, and therefore are all in the Date table.

Reference 

The following resources offer additional information about the date dimension:

  • The Data Warehouse Toolkit, Second Edition (Wiley, 2002), pp. 38-41.

  • Books Online: Search for the topic Time Dimensions (SSAS) and related topics.

Degenerate Dimensions

Transaction identifiers often end up as degenerate dimensions without joining to an actual dimension table. In the retail grocery example, all the individual items you purchase in a trip through the checkout line are assigned a transaction ID. The transaction ID is not a dimension: it does not exist outside the transaction, and it has no descriptive attributes of its own because they've already been handled in separate dimensions. It's not a fact: it does not measure the event in any way and is not additive. We call an attribute such as transaction ID a degenerate dimension because it's like a dimension without attributes. And because there are no associated attributes, there is no dimension table. We include it in the fact table because it serves a purpose from an analytic perspective. For example, you can use it to tie all the line items in a market basket together to do some interesting data mining, as we discuss in Chapter 10. It can also tie back to the transaction system if additional order-related data is needed.

Snowflaking

In simple terms, snowflaking is the practice of connecting lookup tables to fields in the dimension tables. At the extreme, snowflaking involves re-normalizing the dimensions to the third normal form level, usually under the misguided belief that this will improve maintainability, increase flexibility, or save space. We discourage snowflaking. It makes the model more complex and therefore less usable, and it actually makes it more difficult to maintain, especially for Type 2 slowly changing dimensions.

In a few cases we support the idea of connecting lookup or grouping tables to the dimensions. One of these cases involves rarely used lookups, as in the example of joining the Date table to the DateOfBirth field in the Customer dimension so we can count customers grouped by their month of birth. We call this purpose-specific snowflake table an outrigger table. You will see several examples of outrigger tables in the following sections and in the Adventure Works Cycles example. When you're building your Analysis Services database, you'll see this same concept referred to as a reference dimension.

Sometimes it's easier to maintain a dimension in the ETL process when it's been partially normalized or snowflaked. This is especially true if the source data is a mess and you're trying to ensure the dimension hierarchy is correctly structured. In this case, there's nothing wrong with using the normalized structure in the ETL application. Just make sure the business users never have to deal with it.

Note 

Analysis Services 2000 had a slight preference for normalized dimension hierarchies for complex models. Analysis Services 2005 really doesn't care. If anything, the preference tilts slightly to the denormalized star structure. In either case, it's completely hidden from business users, which is the important thing.

Reference 

The following resources offer additional information about snowflaking:

  • The Data Warehouse Toolkit, Second Edition (Wiley, 2002), pp. 55-57.

  • Books Online: Search for the topic Dimension Structure (SSAS) and other Books Online topics.

Many-to-Many or Multivalued Dimensions

The standard relationship between a dimension table and a fact table is called one-to-many. This means one row in the dimension table will join to many rows in the fact table, but one row in the fact table will join to only one row in the dimension table. This relationship is important because it keeps us from double counting. Fortunately, in most cases this relationship holds true.

There are two common instances where the real world is more complex than one-to-many:

  • Many-to-many between the fact table and a dimension

  • Many-to-many between dimensions

In both cases, we introduce an intermediate table called a bridge table that supports the more complex many-to-many relationship.

Many-to-Many Between a Fact and Dimension

A many-to-many relationship between a fact table and a dimension occurs when multiple dimension values can be assigned to a single fact transaction. A common example is when multiple sales people can be assigned to a given sale. This often happens in complex, big-ticket sales such as computer systems. Accurately handling this situation requires creating a bridge table that assembles the sales rep combinations into groups. Figure 2.5 shows an example of the sales rep group bridge table.

Figure 2.5: An example Sales Rep Group bridge table

The ETL process needs to look up the appropriate sales rep group key in the bridge table for the combination of sales reps in each incoming fact table record, and add a new group if it doesn't exist. Note that the bridge table in Figure 2.5 introduces a risk of double counting. If we sum Dollar Sales by Sales Rep, every sales rep will get credit for the total sale. For some analyses, this is the right answer, but for others you don't want any double counting. It's possible to handle this risk by adding a weighting factor column to the bridge table. The weighting factor is a fractional value that sums to one for each sales rep group. Multiply the weighting factor and the additive facts to allocate the facts according to the contribution of each individual in the group.
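The weighting factor allocation can be sketched as follows. The table structures, keys, and weights are made up for illustration; the point is that the allocated amounts sum back to the original fact:

```python
# A bridge table assigns each sales rep in a group a weighting factor that
# sums to 1.0 per group, so facts can be allocated without double counting.

bridge = [  # sales rep group bridge rows (illustrative keys and weights)
    {"group_key": 7, "rep_key": 101, "weight": 0.5},
    {"group_key": 7, "rep_key": 102, "weight": 0.5},
]
fact = {"sales_rep_group_key": 7, "dollar_sales": 1798.00}


def allocate(fact, bridge):
    """Credit each rep with weight * dollar_sales; allocations sum to the fact."""
    return {b["rep_key"]: b["weight"] * fact["dollar_sales"]
            for b in bridge if b["group_key"] == fact["sales_rep_group_key"]}


print(allocate(fact, bridge))  # {101: 899.0, 102: 899.0}
```

Summing the allocated amounts by rep now yields the correct grand total; summing the unweighted fact by rep would have counted the $1,798.00 sale twice.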

Many-to-Many Between Dimensions

The many-to-many relationship between dimensions is an important concept from an analytic point of view. Most dimensions are not entirely independent of one another. Dimension independence is really more of a continuum than a binary state. At one end of the continuum, the store and product dimensions in a retail grocery chain are relatively independent, but not entirely: some store formats don't carry certain products. Other dimensions are much more closely related, but are difficult to combine into a single dimension because of their many-to-many relationship. In banking, for example, there is a direct relationship between account and customer, but it is not one-to-one. Any given account can have one or more customers as signatories, and any given customer can have one or more accounts. Banks often view their data from an account perspective; the MonthAccountSnapshot is a common fact table in financial institutions. The account focus makes it difficult to view accounts by customer because of the many-to-many relationship. One approach would be to create a CustomerGroup bridge table that joins to the fact table, like the SalesRepGroup table in the previous many-to-many example. A better approach takes advantage of the relationship between Account and Customer, as shown in Figure 2.6.

Figure 2.6: An example many-to-many bridge table between dimensions

An AccountToCustomer bridge table between the Account and Customer dimensions can capture the many-to-many relationship with a couple of significant benefits. First, the relationship is already known in the source system, so creating the bridge table will be easier than the manual build process required for the SalesRepGroup table. Second, the Account-Customer relationship is interesting in its own right. The AccountToCustomer bridge table allows users to answer questions such as "What is the average number of accounts per customer?" without joining to any fact table.
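As a sketch of that last point, using an in-memory stand-in for the bridge table (keys are made up):

```python
# The AccountToCustomer bridge alone answers "average number of accounts per
# customer" with no fact table involved.

bridge = [  # (account_key, customer_key): one row per signatory relationship
    (9001, 1552), (9001, 1553), (9001, 1554),  # one account, three signatories
    (9002, 1552),                              # customer 1552 also holds 9002
]

distinct_customers = {customer for _, customer in bridge}
avg_accounts = len(bridge) / len(distinct_customers)  # account links per customer
print(round(avg_accounts, 2))  # 1.33
```

The equivalent SQL would simply group the bridge rows by customer and average the counts.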

Bridge tables are often an indicator of an underlying business process. This is especially true if you must keep track of changes to bridge tables over time (that is, the relationship itself is Type 2). For customers and accounts, the business process might be called account maintenance, and one of the transactions might be called Add a signatory. If three customers were associated with an account, there would be three Add transactions for that account in the source system. Usually these transactions and the business processes they represent are not important enough to track in the DW/BI system with their own fact tables. However, the relationships and changes they produce are important to analyzing the business. We include them in the dimensional model as slowly changing dimensions, and in some cases as bridge tables.

Note 

Analysis Services 2005 has new functionality to support many-to-many dimensions. Analysis Services expects exactly the same kind of structure that we described in this section. They call the bridge table an intermediate fact table, which is exactly what it is.

Reference 

The following resources offer additional information about many-to-many relationships:

  • The Data Warehouse Toolkit, Second Edition (Wiley, 2002), pp. 262-265 for many-to-many between fact and dimension and pp. 205-206 for many-to-many between dimensions.

  • Books Online: Search for the topic Many-to-Many Dimension Relationships (SSAS) and other Books Online topics.

Hierarchies

Hierarchies are meaningful, standard ways to group the data within a dimension so you can begin with the big picture and drill down to lower levels to investigate anomalies. Hierarchies are the main paths for summarizing the data. Common hierarchies include organizational hierarchies, often starting from the individual person level; geographic hierarchies based on physical location, such as a customer address; product hierarchies that often correspond to merchandise rollups such as brand and category; and responsibility hierarchies such as sales territory that assign customers to sales reps (or vice versa). There are many industry-related hierarchies, such as the North American Industry Classification System (the replacement for the Standard Industrial Classification [SIC] code) or the World Health Organization's International Classification of Diseases, Tenth Revision (ICD-10).

Simple hierarchies involving a standard one-to-many rollup with only a few levels should be denormalized right into the granular dimension. A four-level product hierarchy might start with product, which rolls up to brand, then to subcategory, and finally to category. Each of these levels would simply be columns in the product dimension table. In fact, this flattening of hierarchies is one of the main design tasks of creating a dimension table. Many organizations will have several different hierarchies in a given dimension to support different analytic requirements.

Of course, not all hierarchies are this simple. The challenge for the dimensional modeler is to determine how to balance the tradeoff between ease of use and flexibility in representing the more difficult hierarchies. There are at least two common hierarchy challenges: variable-depth (or ragged) hierarchies and frequently changing hierarchies. Both of these problems require more complex solutions than simple denormalization. We briefly describe these solutions here, and refer you to more detailed information in the other Toolkit books if you need it.

Variable-Depth Hierarchies

A good example of the variable-depth hierarchy is the manufacturing bill of materials that provides the information needed to build a particular product. In this case, parts can go into products or into intermediate layers, called subassemblies, which then go into products, also called top assemblies. This layering can go dozens of levels deep, or more, in a complex product (think about a Boeing 7E7).

Those of you who were computer science majors may recall writing recursive subroutines and appreciate the efficiency of recursion for parsing a parent-child or self-referencing table. In SQL, this recursive structure is implemented by simply including a parent key field in the child record that points back to the parent record in the same table. For example, one of the fields in an Employee table would be the Parent Employee Key (or Manager Key).
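A sketch of navigating such a self-referencing table recursively, using a tiny hypothetical Employee table held as a list of dicts:

```python
# Each row carries a manager_key pointing back at another row's employee_key
# in the same table (None marks the top of the hierarchy).

employees = [
    {"employee_key": 1, "name": "CEO", "manager_key": None},
    {"employee_key": 2, "name": "VP Sales", "manager_key": 1},
    {"employee_key": 3, "name": "Sales Rep", "manager_key": 2},
]


def reports(manager_key, rows):
    """Recursively collect everyone below a manager in the hierarchy."""
    result = []
    for r in rows:
        if r["manager_key"] == manager_key:
            result.append(r["name"])
            result.extend(reports(r["employee_key"], rows))
    return result


print(reports(1, employees))  # ['VP Sales', 'Sales Rep']
```

This is exactly the traversal that SQL's recursive WITH syntax, discussed in the sidebar, performs over a parent-child dimension table.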

RECURSIVE CAPABILITIES

SQL-99 introduced recursion into the official SQL language using the WITH Common Table Expression syntax. All the major relational database products provide this functionality in some form, including SQL Server 2005. Unfortunately, recursion isn't a great solution in the relational environment because it requires more complex SQL than most front-end tools can handle. Even if the tool can recursively unpack the self-referencing dimension relationship, it then must be able to join the resulting dataset to the fact table. Very few of the query tools are able to generate the SQL required to navigate a parent-child relationship together with a dimension-fact join.

On the other hand, this kind of recursive relationship is easy to build and manage in the Analysis Services dimensional database. Analysis Services uses different terminology: parent-child rather than variable-depth hierarchies.


The Data Warehouse Toolkit, Second Edition describes a navigation bridge table in Chapter 6 that solves this problem in the relational world. But this solution is relatively unwieldy to manage and query. Fortunately, we have some powerful alternatives in SQL Server 2005. Analysis Services has a built-in understanding of the parent-child data structure, and Reporting Services can handle the parent-child hierarchy in the front-end tool. If you have variable-depth hierarchies and expect to use the relational database for reporting and analysis, it makes sense to include both the parent-child fields and the navigation bridge table to meet the needs of various environments.

Frequently Changing Hierarchies

If you need to track changes in the variable-depth hierarchy over time, your problem becomes more complex, especially with the parent-child data structure. Tracking changes, as you recall, requires using a surrogate key. If someone is promoted, they will get a new row in the table with a new surrogate key. At that point, everyone who reports to that person will need their ManagerKey updated. If ManagerKey is a Type 2 attribute, new rows with new surrogate keys must now be generated for those people as well. If anyone reports to them in turn, the changes ripple down until we reach the bottom of the org chart. In the worst case, a change to one of the CEO's attributes, such as marital status, causes the CEO to get a new surrogate key. This means the people who report to the CEO will get new surrogate keys, and so on down the entire hierarchy.
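A small sketch with an invented org chart shows how far a single Type 2 change can ripple; everyone returned by the recursive walk would need a newly issued dimension row.

```python
# Invented org chart: child -> manager. A Type 2 change to any employee's
# row issues a new surrogate key, which changes the ManagerKey of every
# direct report, which in turn re-keys their rows, and so on.
reports_to = {"VP": "CEO", "Dir1": "VP", "Dir2": "VP", "Rep": "Dir1"}

def affected_by(employee):
    """Everyone whose dimension row must be re-issued when `employee`
    gets a new surrogate key."""
    direct = [e for e, mgr in reports_to.items() if mgr == employee]
    result = list(direct)
    for d in direct:
        result.extend(affected_by(d))
    return result
```

A change to the CEO's row touches all four other employees; a change to a leaf-level employee touches no one else, which is why the cost of this design depends entirely on where in the hierarchy the changes occur.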

Ultimately, this problem is really about tracking the human resources business process. That is, if keeping track of all the changes that take place in your employee database is a high priority from an analytic point of view, you need to create a fact table, or a set of fact tables, that tracks these events. Trying to cram all this event-based information into a Type 2 dimension just doesn't work very well.

Reference 

The following resources offer additional information about hierarchies:

  • The Data Warehouse Toolkit, Second Edition (Wiley, 2002), pp. 161–168.

  • The Data Warehouse Lifecycle Toolkit (Wiley, 1998), pp. 226–237.

  • Books Online: Search for the topic Hierarchies (search key words: Attribute Hierarchies).

Heterogeneous Products

Many organizations have several product lines or lines of business that share only a few high-level attributes. In banking, for example, the mortgage products have very different attributes from the checking products. The challenge here is designing a solution that allows business folks to analyze the entire organization across lines of business and to drill into the details of any given line of business. To accomplish this, you need to create two different forms of the fact table: one that has shared facts for all products across all lines of business, and another that has all facts, including product-specific and shared facts. Create one of these product-specific fact tables for each line of business. The same is true for the product dimensions: split the product dimension into a core product dimension table that contains only the shared attributes and as many custom product dimension tables as are needed, each with its own specific, extended attribute list. The core product dimension has a row in it for every product across the entire business. The custom product dimensions share the same surrogate keys, but have only as many rows as there are products in each product line.

It is possible to put all the unrelated attributes into a single product table, but it becomes difficult for users to understand because all of the product line-specific attributes are empty for products where they don't apply. In a combined product table for an organization with five different lines of business, only one out of five rows will have values for a given column. The product table can also get excessively wide because each product line has a full set of attributes to define its products. The same is true for a combined fact table. Most of the lines of business will have facts specific to their own business processes that would need to be added to the core fact table; these will also be empty (or zero) for all products where they don't apply. In a complex business, the best compromise is to create a family of business process dimensional models with core fact and product tables and a set of fact and product tables specific to each line of business.
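As a rough sketch with invented banking products and attributes, the core and custom dimensions relate like this: the custom table reuses the core surrogate keys but carries only its own line of business.

```python
# Core product dimension: one row per product across all lines of
# business, shared attributes only. Keys and attribute names are invented.
core_products = {
    101: {"ProductName": "30-Year Fixed", "LineOfBusiness": "Mortgage"},
    102: {"ProductName": "Free Checking", "LineOfBusiness": "Checking"},
}

# Custom product dimension for the mortgage line: it shares the core
# surrogate keys but holds only mortgage rows and mortgage-specific
# attributes, so no column is empty for products where it doesn't apply.
mortgage_products = {
    101: {"TermMonths": 360, "RateType": "Fixed"},
}

# A cross-business query uses only the core table; a mortgage-specific
# query joins the custom table on the shared surrogate key.
mortgage_detail = {
    key: {**core_products[key], **mortgage_products[key]}
    for key in mortgage_products
}
```

Because the keys are shared, the custom dimension is a strict subset of the core dimension, which is what makes the cross-business and line-of-business views consistent with each other.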

Note 

There are no magic features in Analysis Services that solve the heterogeneous dimension problem. It works about as well as it does in a purely relational deployment. Analysis Services does simplify navigation between the multiple fact tables in the family of business process dimensional models.

Reference 

The following resources offer additional information about heterogeneous product models:

  • The Data Warehouse Toolkit, Second Edition (Wiley, 2002), pp. 210–215.

  • kimballgroup.com: Search for the topic Heterogeneous Products for an article on an insurance business heterogeneous product dimensional model.

Aggregate Dimensions

You will often have data in the DW/BI system at different levels of granularity. Sometimes it comes to you that way, as with forecast data that is created at a level higher than the individual product. Other times you create it yourself to give the users better query performance. There are two ways to aggregate data in the warehouse: one by entirely removing a dimension, the other by rolling up a dimension's hierarchy. When we aggregate using the hierarchy rollup, we need to provide a new, shrunken dimension at this aggregate level. Figure 2.7 shows how the Adventure Works Cycles Product table can be shrunken to the Subcategory level to allow it to join to forecast data that is created at the Subcategory level.

Figure 2.7: A Subcategory table extracted from the Adventure Works Cycles Product dimension table

Each subcategory includes a mix of color, size, and weight attributes, for example. Therefore, most of the columns in the Product table do not make sense at the Subcategory level. This results in a much smaller dimension table, hence the common term shrunken dimension.

The keys for aggregate dimensions need to be generated in the ETL process; they are not derived from the base tables. Rows can be added to or removed from the base table over time, so there is no guarantee that a key from the base table can be used in the aggregate dimension.
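Here is a minimal sketch of that ETL step, using Python's sqlite3 module and invented Adventure Works-style rows. The shrunken dimension gets its own independently generated surrogate keys rather than borrowing keys from the base Product table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (
    ProductKey  INTEGER PRIMARY KEY,
    ProductName TEXT,
    Subcategory TEXT,
    Category    TEXT
);
INSERT INTO DimProduct VALUES
    (1, 'Mountain-100', 'Mountain Bikes', 'Bikes'),
    (2, 'Mountain-200', 'Mountain Bikes', 'Bikes'),
    (3, 'Road-150',     'Road Bikes',     'Bikes');

-- Shrunken dimension: its own surrogate key, generated here by the
-- database (by the ETL process in real life), plus only the columns
-- that make sense at the Subcategory grain.
CREATE TABLE DimSubcategory (
    SubcategoryKey INTEGER PRIMARY KEY AUTOINCREMENT,
    Subcategory    TEXT,
    Category       TEXT
);
INSERT INTO DimSubcategory (Subcategory, Category)
SELECT DISTINCT Subcategory, Category FROM DimProduct ORDER BY Subcategory;
""")

subcats = conn.execute(
    "SELECT SubcategoryKey, Subcategory FROM DimSubcategory ORDER BY SubcategoryKey"
).fetchall()
```

A forecast fact table created at the Subcategory grain would then carry SubcategoryKey as its foreign key instead of ProductKey.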

Note 

The Analysis Services OLAP engine automatically manages all this behind the scenes. This is one of the reasons why Analysis Services is helpful.

Previous versions of Analysis Services required that you pre-build the aggregate dimensions in the relational database. Analysis Services 2005 has removed that requirement: You can easily hook in a fact table at any level of a dimension. Of course, you may still need aggregate dimensions to support business process metrics (such as forecast) that live at the aggregate level only.

Reference 

The following resources offer additional information about aggregate dimensions:

  • The Data Warehouse Toolkit, Second Edition (Wiley, 2002), pp. 154–157 for information about building minidimensions.

  • kimballgroup.com: Search for the topic Shrunken Dimensions for an article on aggregate dimensions.

Junk Dimensions

It's common to end the initial design pass with a handful of miscellaneous attributes that don't belong to any existing dimension. These are typically flags or indicators that describe or categorize the transaction in some way. They're usually low cardinality, with a dozen distinct values or fewer. Even though the name is a bit disrespectful, the contents of the junk dimension are often important. (We've also called them miscellaneous or mystery dimensions in previous writings.)

There are four alternatives for dealing with these extra attributes, and the first three don't count. You can leave them in the fact table, significantly expanding the fact table. You can create a separate dimension for each attribute, cluttering up the model with many new dimensions. You can omit these attributes and hope no one misses them. Or, you can group them together in a single junk dimension. This is obviously our preferred alternative.

With a little additional information, the grocery store data model shown in Figure 2.2 could be expanded to include a junk dimension. Many transaction systems have multiple transaction types as part of the same business process. Generally, 99 percent or more of the transactions fall into one transaction type; in the grocery business, the receipt would be the Regular Sale transaction. There also might be a No Sale transaction that allows the clerk to open the drawer to make change, and a Refund transaction. If you recall, the Payment Type dimension already included in the model is a similar, low-cardinality dimension. It has a handful of distinct values, such as Cash, Check, Credit, or Debit. You could combine these two dimensions into a single junk dimension. The name of the dimension needs to be generic enough to cover all of its contents, so something like RetailTransactionInfo might work.

This example TransactionInfo dimension would have two columns, TransactionType and PaymentType. Figure 2.8 shows what the dimension and its contents might look like. You need to include all the combinations of TransactionType and PaymentType that occur in the fact table. Here, we listed three transaction types and four payment types. In the worst case, the total number of rows would be the Cartesian product of the cardinality of each column, implemented as a CROSS JOIN in T-SQL or the Crossjoin function (or * operator) in MDX. For our example, this would be 3 × 4 = 12 rows; these are the 12 rows shown in Figure 2.8. This is easy enough to build once and use forever, or at least until another value shows up. The ETL process will need to watch for new values and handle them appropriately by adding them to the table.
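The worst-case build is just the Cartesian product of the attribute domains. A quick sketch in Python, using the values from the grocery example:

```python
from itertools import product

# Attribute domains from the grocery example.
transaction_types = ["Regular Sale", "Refund", "No Sale"]
payment_types = ["Cash", "Check", "Credit", "Debit"]

# One surrogate key per combination, assigned in enumeration order.
junk_dimension = [
    (key, ttype, ptype)
    for key, (ttype, ptype) in enumerate(
        product(transaction_types, payment_types), start=1
    )
]
```

This produces the same 12 rows shown in Figure 2.8, and the same pattern extends to any number of junk attributes.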

DimTransactionInfo
  PK: TransactionInfoKey
      TransactionType
      PaymentType

TransactionInfo Dimension

TransactionInfoKey   TransactionType   PaymentType
 1                   Regular Sale      Cash
 2                   Regular Sale      Check
 3                   Regular Sale      Credit
 4                   Regular Sale      Debit
 5                   Refund            Cash
 6                   Refund            Check
 7                   Refund            Credit
 8                   Refund            Debit
 9                   No Sale           Cash
10                   No Sale           Check
11                   No Sale           Credit
12                   No Sale           Debit


Figure 2.8: An example junk dimension for the Retail Grocery Sales dimensional model

In many cases, if the junk dimension has several columns, the list of all possible combinations can be too large to pre-calculate. Ten columns, each with 100 unique values, would give us a very big dimension table (100,000,000,000,000,000,000 rows). These codes are seldom completely independent, so it's rare for all possible combinations to occur in the real world. In any case, the upper bound on the number of junk value combinations is the number of rows in the fact table. In our grocery example, the No Sale transaction type occurs only with the Cash payment type. While this reduces our dimension table only from 12 to 9 rows, the reductions in the real world are often dramatic. A recent project had a junk dimension with a theoretical size of over a trillion rows, but a count from the last five years of data showed the actual set of combined values was closer to 70,000 rows. The ETL process has to watch this dimension during the fact table lookup process to identify new combinations and add them into the dimension. In terms of its creation and maintenance, the junk dimension is similar to the aggregate dimension.
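One simple way to implement this, sketched here with invented names, is a surrogate-key lookup that assigns a new key the first time a combination appears during the fact load:

```python
# Surrogate-key lookup for the junk dimension: existing combinations map
# to their keys; any combination first seen during the fact load is
# assigned a new key and queued for insertion into the dimension table.
junk_lookup = {}   # (transaction_type, payment_type) -> surrogate key
new_rows = []      # rows to append to the dimension table

def junk_key(transaction_type, payment_type):
    combo = (transaction_type, payment_type)
    if combo not in junk_lookup:
        key = len(junk_lookup) + 1
        junk_lookup[combo] = key
        new_rows.append((key,) + combo)
    return junk_lookup[combo]

# Incoming fact rows carry only the raw attribute values; the lookup
# replaces them with a single junk dimension key.
facts = [("Regular Sale", "Cash"), ("Refund", "Cash"), ("Regular Sale", "Cash")]
keys = [junk_key(t, p) for t, p in facts]
```

Because keys are created only for combinations that actually occur, the dimension stays at its observed size (the 70,000 rows of the example above) rather than its theoretical one.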

Note 

In Analysis Services, each of the different attributes included in the junk dimension becomes a separate attribute hierarchy. As in the relational data model, the disparate attribute hierarchies would be grouped together under one dimension, TransactionInfo in our example. You'd hide the bottom level of the dimension that has the surrogate key representing the intersection of the codes.

Reference 

The following resources offer additional information about junk dimensions:

  • The Data Warehouse Toolkit, Second Edition (Wiley, 2002), pp. 117–119.

  • kimballgroup.com: Search for the topic Junk Dimensions for a relevant article.

The Three Fact Table Types

There are three fundamental types of fact tables in the DW/BI system: transaction, periodic snapshot, and accumulating snapshot. Most of what we have described thus far falls into the transaction category. Transaction fact tables track each transaction as it occurs, at the discrete point in time when the transaction event occurred. The periodic snapshot fact table captures cumulative performance over specific time intervals. Periodic snapshots are particularly valuable for combining data across several business processes in the value chain. They also aggregate many of the facts across the time period, providing users with a fast way to get totals. Whereas periodic snapshots are taken at specific points in time (after the month-end close, for example), the accumulating snapshot is constantly updated over time. Generally, the design of the accumulating snapshot includes several date fields to capture the dates when the item in question passes through each of the business processes or milestones in the value chain. For an Orders Accumulating Snapshot that captures metrics about the complete life of an order, these dates might include the following:

  • Order Date

  • Requested Ship Date

  • Manufactured Date

  • Actual Ship Date

  • Arrival Date

  • Invoice Date

  • Payment Received Date

The accumulating snapshot provides the status of open orders at any point in time and a history of completed orders just waiting to be scrutinized for interesting metrics.
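A sketch of a single accumulating snapshot row, with invented values, shows the update-in-place pattern: milestone dates start out unknown and are filled in by later ETL runs, and lag metrics fall out once two milestones are known.

```python
from datetime import date

# One row per order; each milestone date starts out unknown and the same
# row is updated in place as the order moves through the value chain.
# The order number and dates are invented.
UNKNOWN = None  # a real design would point to an "unknown" row in the date dimension

order = {
    "OrderNumber": "SO12345",
    "OrderDate": date(2005, 3, 1),
    "RequestedShipDate": date(2005, 3, 10),
    "ManufacturedDate": UNKNOWN,
    "ActualShipDate": UNKNOWN,
    "ArrivalDate": UNKNOWN,
    "InvoiceDate": UNKNOWN,
    "PaymentReceivedDate": UNKNOWN,
}

# Later ETL runs revisit the row as milestones occur.
order["ManufacturedDate"] = date(2005, 3, 5)
order["ActualShipDate"] = date(2005, 3, 8)

# Lag metrics fall out naturally once two milestones are known.
days_to_ship = (order["ActualShipDate"] - order["OrderDate"]).days
```

The repeated updates to existing rows are exactly what makes this fact table type expensive to maintain, as the following note explains.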

Note 

Transaction fact tables are clearly what Analysis Services was designed for. Your Analysis Services database can accommodate periodic and accumulating snapshots, but you do need to be careful. The problem is not the model, but the process for updating the data. Snapshot fact tables, particularly accumulating snapshots, tend to be updated a lot in the ETL process. This is expensive but not intolerable in the relational database. It's far more expensive for Analysis Services, which doesn't really support fact table updates at all.

For snapshot facts to work in Analysis Services for even moderate-sized data sets, you'll need the Enterprise Edition feature that allows dimensional database partitioning.

Reference 

The following resources offer additional information about the three fact table types:

  • The Data Warehouse Toolkit, Second Edition (Wiley, 2002), pp. 128–130 and 132–135.

  • kimballgroup.com: Search for the topic Snapshot Fact Table for articles on the different fact table types.

Aggregates

Aggregates are precalculated summary tables that serve the primary purpose of improving performance. If the database engine could instantly roll the data up to the highest level, you wouldn't need aggregate tables. In fact, precalculating aggregates is one of the two reasons for the existence of OLAP engines such as Analysis Services (the other reason being more advanced analytic capabilities). SQL Server Analysis Services can create and manage aggregate tables in the relational platform (called relational OLAP, or ROLAP) or in the OLAP engine. The decision to create aggregate tables in the relational versus the OLAP engine is a tradeoff. If the aggregates are stored in Analysis Services format, access to the data is limited to tools that generate MDX. If the aggregates are stored in the relational database, they can be accessed by tools that generate SQL. If your front-end tool is adept at generating MDX, using Analysis Services to manage aggregates has significant advantages. If you must support relational access, especially ad hoc access, you need to create and manage any aggregate tables needed for performance, along with the associated aggregate dimensions described earlier.
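A relational aggregate table is just a precalculated GROUP BY stored as its own table, keyed by the aggregate (shrunken) dimension. A minimal sketch with invented sales rows, using Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FactSales (
    ProductKey     INTEGER,
    SubcategoryKey INTEGER,
    SalesAmount    REAL
);
INSERT INTO FactSales VALUES
    (1, 1, 100.0),
    (2, 1, 250.0),
    (3, 2, 400.0);

-- Aggregate table: the same facts rolled up to the Subcategory grain,
-- keyed by the shrunken dimension's surrogate key.
CREATE TABLE AggSalesBySubcategory AS
SELECT SubcategoryKey,
       SUM(SalesAmount) AS SalesAmount,
       COUNT(*)         AS SourceRowCount
  FROM FactSales
 GROUP BY SubcategoryKey;
""")

agg = conn.execute(
    "SELECT SubcategoryKey, SalesAmount FROM AggSalesBySubcategory ORDER BY SubcategoryKey"
).fetchall()
```

Queries at the Subcategory grain can then read the small aggregate table instead of scanning the full fact table, which is the whole point of maintaining aggregates.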

Note 

SQL Server uses the term aggregates to mean the rules that define how data is rolled up. It uses the term aggregation to mean the resulting summary table.

In this section we have covered several of the common design challenges you will typically run up against in developing a dimensional model. This is not an exhaustive list, but it should be enough to help you understand the modeling process. If you are charged with the role of data modeler and this is your first dimensional modeling effort, we again strongly encourage you to continue your education by reading The Data Warehouse Toolkit, Second Edition .



Microsoft Data Warehouse Toolkit. With SQL Server 2005 and the Microsoft Business Intelligence Toolset
ISBN: B000YIVXC2
EAN: N/A
Year: 2006
Pages: 125
