Dimensional Modeling Concepts and Terminology

Dimensional Modeling Concepts and Terminology

There is broad agreement in data warehousing and business intelligence that the dimensional model is the preferred structure for presenting information to users. The dimensional model is the best way to meet our primary design goals:

  • To present the needed information to users as simply as possible

  • To return query results to the users as quickly as possible

  • To provide relevant information that accurately tracks the underlying business processes

Albert Einstein captured the main reason we use the dimensional model when he said, Make everything as simple as possible, but not simpler. As it turns out, simplicity is relative. The dimensional model is much easier for users to understand than the typical source system normalized model even though a dimensional model typically contains exactly the same content as a normalized model. It has far fewer tables, and information is grouped into coherent business categories that make sense to users. These categories help users navigate the model because entire categories can be disregarded if they arent relevant to a particular analysis.

Unfortunately, as simple as possible doesnt mean the model is necessarily simple. The model must reflect the business, and businesses are typically complex. If you simplify too much, typically by presenting only aggregated data, the model loses information thats critical to understanding the business. No matter how you model data, the intrinsic complexity of the data content is ultimately why most people will use structured reports and analytic applications to access the DW/BI system.

In the relational environment, the dimensional model helps query performance because of the denormalization involved in creating the dimensions. By pre-joining the various hierarchies and lookup tables, the optimizer considers fewer join paths and creates fewer intermediate temporary tables. Queries against the SQL Server relational database generally perform betteroften far betteragainst a dimensional structure than against a fully normalized structure.

In the Analysis Services OLAP environment, the engine is specifically designed to support dimensional models. Performance is achieved in large part by pre-aggregating within and across dimensions.

So what is a dimensional model, anyway? A dimensional model is made up of a central fact table (or tables) and its associated dimensions. The dimensional model is also called a star schema because it looks like a star with the fact table in the middle and the dimensions serving as the points on the star. We stick to the term dimensional model in this book to avoid confusion.

From a relational data modeling perspective, the dimensional model consists of a normalized fact table with denormalized dimension tables. This section defines the basic components of the dimensional model, facts and dimensions, along with some of the key concepts involved in handling changes over time.

Facts

Each fact table contains the measurements associated with a specific business process, like taking an order, displaying a web page, printing a book, or handling a customer support request. A record in a fact table is a measurement, and a measurement event can always produce a fact table record. These events usually have numeric measurements that quantify the magnitude of the event, such as quantity ordered, sale amount, or call duration. These numbers are called facts (or measures in Analysis Services).

The key to the fact table is a multi-part key made up of a subset of the foreign keys from each dimension table involved in the business event.

Just the Facts

Most facts are numeric and each fact value can vary widely depending on the business process being measured. Most facts are additive (such as dollar or unit sales), meaning they can be summed up across all dimensions. Additivity is important because DW/BI applications seldom retrieve a single fact table record. User queries generally select hundreds or thousands of records at a time and add them up. A simple query for Sales by Month for the last year returns only 12 rows in the answer set, but it may sum up across hundreds of thousands of rows (or more!). Other facts are semi-additive (such as market share or account balance), and still others are non-additive (such as unit price).

Not all numeric data are facts. Exceptions include discrete descriptive information like package size or weight (describes a product) or customer age (describes a customer). Generally, these less volatile numeric values end up as descriptive attributes in dimension tables. Such descriptive information is more naturally used for constraining a query, rather than being summed in a computation. This distinction is helpful when deciding whether a data element is part of a dimension or fact.

Some business processes track events without any real measures. If the event happens, we get an entry in the source system; if not, there is no row. Common examples of this kind of event include employment activities, such as hiring and firing, and event attendance, such as when a student attends a class. The fact tables that track these events typically do not have any actual fact measurements, so theyre called factless fact tables . Actually, we usually add a column called something like EventCount that contains the number 1. This provides users with an easy way to count the number of events by summing the EventCount fact.

Some facts are derived or computed from other facts, just as a Net Sale number is calculated from Gross Sales minus Sales Tax. Some semi-additive facts can be handled using a derived column that is based on the context of the query. Month End Balance would add up across accounts, but not across date, for example. The non-additive Unit Price example could be avoided by defining it as a computation done in the query, which is Total Amount divided by Total Quantity. There are several options for dealing with these derived or computed facts. You can calculate them as part of the ETL process and store them in the fact table, you can put them in the fact table view definition, or you can include them in the definition of the Analysis Services database. The only way we find unacceptable is to leave the calculation to the user.

Note 

Using Analysis Services to calculate computed measures has a significant benefit in that you can define complex MDX calculations for semi-additive facts that will automatically calculate correctly based on the context of each query request.

The Grain

The level of detail contained in the fact table is called the grain . We strongly urge you to build your fact tables with the lowest level of detail that is possible from the original sourcegenerally this is known as the atomic level . Atomic fact tables provide complete flexibility to roll up the data to any level of summary needed across any dimension, now or in the future. You must keep each fact table at a single grain. For example, it would be confusing and dangerous to have individual sales order line items in the same fact table as the monthly forecast.

Note 

Designing your fact tables at the lowest practical level of detail, the atomic level, is a major contributor to the flexibility of the design.

Fact tables are very efficient. They are highly normalized, storing little redundant data. For most transaction-driven organizations, fact tables are also the largest tables in the data warehouse database, often making up 95 percent or more of the total relational database size.

Dimensions

Dimensions are the foundation of the dimensional model, describing the objects of the business, such as employee, subscriber, publication, customer, physician , vehicle, product, service, author, and article. The dimensions are the nouns of the DW/BI systemthey describe the surrounding measurement events. The business processes (facts) are the verbs or actions of the business in which the nouns (dimensions) participate. Each dimension table links to all the business processes in which it participates. For example, the product dimension will be involved in supplier orders, inventory, shipments, and returns. A single dimension that is shared across all these processes is called a conformed dimension . Well talk more about conformed dimensions in a bit.

Think about dimensions as tables in a database because thats how youll implement them. Each table contains a list of homogeneous entitiesproducts in a manufacturing company, patients in a hospital, vehicles on auto insurance policies, or customers in just about every organization. Usually, a dimension includes all instances of its entityall the products the company sells, for example. There is only one active row for each particular instance in the table at any time, and each row has a set of attributes that identify, describe, define, and classify the instance. A product will have a certain size and a standard weight, and belong to a product group . These sizes and groups have descriptions, like a food product might come in Mini-Pak or Jumbo size. A vehicle is painted a certain color , like white, and has a certain option package, such as the Jungle Jim sports utility package (which includes side impact air bags, six-disc CD player, DVD system, and simulated leopard skin seats).

Some descriptive attributes in a dimension relate to each other in a hierarchical or one-to-many fashion. A vehicle has a manufacturer, brand, and model (such as GM Chevrolet Impala, or Toyota Lexus RX330). Most dimensions have more than one such embedded hierarchy.

It may help to think of dimensions as things or objects . A thing such as a product can exist without ever being involved in a business event. You might make a product that never sells, which is not good for business, but the point is the product can exist outside of a business process. On the other hand, a business process cannot occur without its associated dimensions. You cant sell a product without having a product to sell.

The underlying data structures for most relational transaction systems are designed using a technique known as normalization . This approach removes redundancies in the data by moving repeating attributes into their own tables. The physical process of recombining all the attributes of a business object, including its hierarchies, into a single dimension table is known to DBAs as denormalization . As we described earlier, this simplifies the model from a user perspective. It also makes the join paths much simpler for the database query optimizer than a fully normalized model. The denormalized dimension still presents exactly the same information and relationships found in the normalized modelnothing is lost from an analytic perspective except complexity.

You can spot dimensions or their attributes in conversation with the business folks because they are often the by words in a query or report request. For example, a user wants to see sales by month by product. The natural ways users describe their business should be included in the dimensional model as dimensions or dimension attributes. This is important because many of the ways users analyze the business are often not captured in the transaction system. Including these attributes in the warehouse is part of the added value you can provide.

image from book
THE POWER OF DIMENSIONS

Dimensions provide the entry points into the data. Dimensional attributes are used in two primary ways: as the target for constraints and for the labels on the rows of a report. If the dimensional attribute exists, you can constrain and label. If it doesnt exist, you simply cant.

image from book

 

Bringing Facts and Dimensions Together

The completed dimensional model has a characteristic appearance, with the fact table in the middle surrounded by the dimensions. Figure 2.2 shows a simple dimensional model for the classic example: the retail grocery sales business process.

image from book
Figure 2.2: A basic dimensional model for retail grocery sales

This model allows users across the business to analyze retail sales activity from various perspectives. Category managers can look at sales by product for different stores and different dates. Store planners can look at sales by store format or location. Store managers can look at sales by date or cashier. While this model is reasonably robust, a large retail grocer would have a few more dimensions, customer, in particular, and many more attributes.

In Figure 2.2, fields labeled PK are primary keys. In other words, these fields are the basis of uniqueness for their tables. In a dimensional model, the primary keys of dimensions are always implemented physically as single fields. The fields labeled FK are foreign keys, and must always match the corresponding PKs in the dimensions in order to ensure referential integrity. The field labeled DD is a special degenerate dimension, which is described later.

To grasp the concept of dimensions and facts, its helpful to see examples of dimensional models from a variety of industries and business processes. The model shown in Figure 2.3 comes from The Data Warehouse Toolkit, Second Edition (Wiley, 2002). The business process is airline flight activity. The grain is at the flight segment/ticket level.

image from book
Figure 2.3: Segment-level flight activity dimensional model

The dimensions have been filled out enough to give a sense of their contents. Like the retail grocery sales model, this model is valuable across the business. Route planners can look at activity by airport, logistics can look at aircraft utilization, Marketing can look at frequent flyer behavior, and flight activity by class and fare basis. Sales can assess the relative performance of the various sales channels. There is something for everyone in the airline organization in this dimensional model.

The Bus Matrix, Conformed Dimensions, and Drill Across

We spoke about the bus matrix in Chapter 1 as one of the deliverables from the enterprise requirements gathering process. Because its the starting point for the dimensional modeling process, we revisit the concept here. The idea of reusing dimensions across multiple business processes is the foundation of the enterprise DW/BI system and the heart of the Enterprise Data Warehouse Bus Matrix concept. In the retail grocery example, a dimension such as product will be used in both the retail sales and the store inventory dimensional models. Because they are exactly the same products, both models must use the same dimension with the same keys to reliably support true, cross-business process analysis. If the logistics folks at the grocers headquarters want to calculate inventory turns, theyll sum the total sales quantity by product from retail sales, sum the inventory quantity at the end of the period by product from store inventory, and divide the two to get inventory turns by product. This works only if the two business processes use the exact same product dimension with the same keys; that is, they use a conformed dimension. Conformed dimensions are the cornerstone of the enterprise-enabled DW/BI system. This kind of analysis involving data from more than one business process is called drill-across .

Note 

The precise technical definition of conformed dimensions is that two dimensions are conformed if they contain one or more fields with the same names and contents. These conformed fields must then be used as the basis for the drill-across operation.

Note that this idea of drilling across multiple fact tables and combining the answer sets requires a front-end tool capable of supporting this function. A powerful reason to use Analysis Services is that conformed dimensions are part of the system architecture, so its calculation engine smoothly supports drill-across.

Examine the Adventure Works Cycles high-level bus matrix shown in Figure 2.4. Each row of the bus matrix represents a business process and defines at least one fact table and its associated dimensions. Often, a row in the matrix will result in several related fact tables that help track the business process from different perspectives. The Orders business process might have an orders transaction fact table at the line-item level and an orders snapshot fact table at the order level. Both of these orders-based dimensional models belong to the Orders business process. We call this grouping a business process dimensional model . The fully populated enterprise DW/BI system contains sets of dimensional models that describe all the business processes in an organizations value chain. As you create the business process dimensional models for each row in the bus matrix, you end up with a much more detailed version of the matrix. Each dimensional model has its own row grouped by business process. Order Transactions and Order Snapshot would be separate rows under the Orders business process.

Adventure Works

Business Priority

< -- Conformed Dimensions -->

Data Warehouse Bus Matrix

Business Process

Date (Order, Start, Ship)

Product

Promotion

End Customer

Employee

Reseller

Page

Internet Registered User

Part

Vendor

Shipper

Problem

Account

Department

Currency (Source, Dest.)

Benefits Plan

Advertising

                                 
 

TV

 

x

x

x

                         
 

Print

 

x

x

x

                         
 

Online

 

x

x

x

x

                       

Promotions

 

x

x

x

x

 

x

                   

Co-op programs

 

x

x

x

 

x

x

                   

Web Site Marketing

 

x

x

x

x

   

x

x

               

PR

 

x

x

x

                         

Orders Forecasting

2

x

x

x

 

x

x

                   

Reseller Orders

1

x

x

x

 

x

x

                   

Internet Orders

1

x

x

x

x

   

x

x

               

Purchasing

 

x

x

 

x

x

     

x

x

x

         

Parts Inventory

 

x

x

x

         

x

x

           

Manufacturing

6

x

x

           

x

             

Finished Goods Inv.

 

x

x

x

                         

Shipping

 

x

x

x

x

x

x

       

x

         

Returns

5

x

x

 

x

x

x

       

x

         

Registration cards

 

x

x

 

x

                       

Customer Calls

4

x

x

x

x

x

x

   

x

   

x

       

Web Support

 

x

x

 

x

x

x

x

x

     

x

       

Financial Forecasting

 

x

x

x

x

x

x

     

x

   

x

x

   

Exchange Rate Mgmt.

3

x

                         

x

 

GL-Revenue & Expense

 

x

                     

x

x

   

Cost Accounting

 

x

x

                   

x

x

   

Payroll

 

x

     

x

               

x

   

Benefits Enrollment

 

x

     

x

                   

x


Figure 2.4: Adventure Works Cycles high-level enterprise bus matrix

The bus matrix is the enterprise business intelligence data roadmap. Creating the bus matrix is mandatory for any enterprise-wide DW/BI effort. Getting enterprise agreement on conformed dimensions is an organizational challenge for the data modeler and data steward. Having a single dimension table to describe the companys products, customers, or facilities means the organization has to agree on how each dimension table is defined. This includes the list of attributes, attribute names, hierarchies, and the business rules needed to define and derive each attribute in the table. This is politically hard work, and the effort grows as a function of the number of employees and divisions. But it is not optional. Conformed dimensions ensure that you are comparing apples to apples ( assuming you are selling apples).

Reference 

The Data Warehouse Toolkit, Second Edition (Wiley, 2002) has more information and examples on dimensions, facts, conformed dimensions, and the bus matrix, along with example dimensional models from many different industries and business processes, including retail sales, inventory, procurement, order management, CRM, accounting, HR, financial services, telecommunications and utilities, transportation, education, health care, e-commerce, and insurance. We include references to specific pages for the concepts and techniques in the following section.