Data Model | Practical Business Intelligence with SQL Server 2005

The primary goal of dimensional data modeling is to produce a simple, consistent model that is expressed in the users' own language and will enable them to easily access information. A common challenge when building BI solutions is that there are often so many areas in the business that have interesting information, you really need to resist trying to produce an initial data model that covers everything in the whole business. The key tenet of our approach is that you should carefully define the areas you will be focusing on delivering so that they align with the objectives and can deliver actual business value (something that has been in short supply in many large data warehouse projects).

What Process Will We Be Focusing On?

The business requirements describe our area of focus as sales and profitability reporting. We will deliver the data model to support these requirements, but design the dimensions so that they can become the "one version of the truth" that is eventually used in other areas of the business such as analyzing manufacturing scheduling or loss analysis.

As it turns out, defining the exact business process that we need to model is a bit tricky. To handle sales reporting and profitability, we need to know information from manufacturing (what is the actual cost of the manufactured products), ordering and invoicing (how much did we charge this customer taking into account discounts and other contract terms), and delivery (did the goods arrive on time and in good condition). This data is almost certainly stored in different tables in the transaction systems, and in fact may be in separate systems altogether.

Tip: Do Not Constrain the DW Design by the OLTP Design

Designing data warehouse fact tables that do not necessarily map to a single transaction table (such as InvoiceLineItem) is a common feature of data warehouses and shouldn't worry you. The trap to avoid is designing your data warehouse schema by looking at the transaction systems. You should first focus on the end users and how they understand the information.

The business process that spans all these areas is often known as "shipments," and it includes so much useful information that having shipments data in the data warehouse is one of the most important and powerful capabilities for manufacturing.

What Level of Detail Do We Need?

Now that we have identified the shipments business process, the next question that we are faced with is the grain of information required. Do we need daily totals of all shipments for the day, or would monthly balances suffice? Do we need to know which individual products were involved, or can we summarize the information by category?

In some ways, this data modeling question of granularity is the easiest to resolve because the answer is always the same: You should strive to always use the most detailed level of information that is available. In the shipments example, we need to store a record for each individual line item shipped, including the customer that it was shipped to, the product UPC, the quantity shipped, and all the other information we can find in this area.

In the bad old days before technology caught up with our BI ambitions, a lot of compromises needed to be made to avoid large data volumes, which usually involved summarizing the data. A lack of detailed information leads to all kinds of problems in data warehousesfor example, if we store the shipments summarized by product group, how do I find out whether yellow products are more popular than green ones? We need the product detail-level information to figure that out. Today, modern hardware and SQL Server 2005 easily support detail-level information without special handling in all but the most demanding of applications. (See Chapter 11, "Very Large Data Warehouses," for more information on dealing with very large data warehouses.)

Do Not Mix Granularities in a Fact Table

This is probably the most important dimensional modeling lesson that we can share: A single fact table must never contain measures at different levels of granularity. This sounds like it should be easy to achieve, but some common pitfalls can trip you up, usually related to budget or forecast information. For example, business users often want to compare the actual measure (such as actual quantity shipped) with a forecast or budget measure (such as budget quantity). Budgets are usually produced at a higher level of granularity (for example, planned monthly sales per product group), and therefore should never be put into the same fact table that stores the detail-level transactions. If you find measures at different levels of granularity, you must create a separate fact table at the right level.

What Are the Ways of Looking at the Information?

As discussed in Chapter 1, "Introduction to Business Intelligence," the heart of the dimensional approach is to provide information to users in the ways they would like to look at it. One of the best ways to start identifying dimensions is to take note every time anyone says the word by. For example, we need to see sales by manufacturing plant, and we need to look at deliveries by the method they were shipped to see which method is more cost-effective. Each of these "bys" is a candidate for a dimension.

Out of the interviews with the business users, we can so far add two dimensions to our data model, as shown in Figure 3-2: Plant, which identifies where the product in a shipment was manufactured and shipped from; and Ship Method, which indicates the method used to ship the product.

Figure 3-2. Plant and Ship Method dimensions

Product Information

One of the main dimensions in this data warehouse contains the Product information. Products can be grouped into subcategories and categories, and each Product record can contain many descriptive attributes that are useful for understanding sales and profitability, such as Color or Size. To make it easy to load the Product information and to improve the efficiency of the queries used to load the cube, we will "snowflake" the dimension and create three separate tables, as shown in Figure 3-3. This is the first example of the design decision-making process that we outlined in Chapter 1; that is, when we have a dimension table with an obvious hierarchy, we can renormalize or snowflake the dimension into a separate table for each level in the hierarchy.

Figure 3-3. Product dimension

Note that every snowflake dimension table still follows the rule for surrogate keys, which is that all primary keys used in a data warehouse are surrogate keys and these are the only keys used to describe relationships between data warehouse tables. In the case of the Product Subcategory table, you can see that we have used the ProductCategoryKey for the relationship with the Product Category table.

The manufacturer in our example has an interesting issue with Productstheir original business was the manufacture of sporting goods and related equipment, but several years ago they acquired another company with several plants that manufacture sports-related clothing. The two divisions in the company have separate ERP systems and totally separate product lines. In a data warehouse, we want a single Product dimension that is the only version of the truth.

Even though the two ERP systems might have different ways of storing the product information and we will have to perform some calisthenics in the extraction, transformation, and loading (ETL) process, we must reconcile these differences to create a single Product dimension. The design that we must avoid is having two separate product dimensions that map back to the source systems; otherwise, it will be difficult for users to include all the available information in their analysis.

Customer Information

The next interesting area in our data model is Customers. The manufacturer does not sell direct to consumers but only to retailers, so each customer record contains information about a company that does business with the manufacturer. In addition to attributes such as customer type (such as Retailer or Distributor), one of the most important sources of customer attributes is geographical information.

Each customer is physically located at an address, and you can see how users would find it useful to be able to group data by the customer's state or province, or even drill down to the city level. In addition to this natural "geographical" hierarchy, most companies also divide up geographical areas into "sales territories." These sales territories might not translate perfectly to the natural geographical structures such as state or province, because a company's rules governing which cities or customers fall into which sales territories may be complex or arbitrary. So, sometimes end users will want to see information grouped by physical geography and sometimes grouped by sales territories.

In our data model, we will extract the physical geographical information from the customer record and move it a Geography dimension, which will have one record for each Zip/Postal Code. This dimension record will contain all the natural geographical information such as state or province name, and each Customer record will have a GeographyKey column that contains the surrogate key of the appropriate Geography record. We will also create a separate Sales Territory dimension, and extend the Geography dimension so that every record relates to a particular sales territory, as shown in Figure 3-4.

Figure 3-4. Customer and Geography dimensions

Time Information

By far the most common way of looking at information in data warehouses is analyzing information in different time periods. Users might want to see shipments for the current month, or year-to-date shipments, or compare the current period with the same period last year. Every shipment transaction has one or more dates associated with it (such as date ordered and date shipped), and we need to allow these dates to be used to group the shipments together.

The first thing that we need to determine is the level of detail required for our fact table. The shipments fact includes the actual date of the transaction, so we need day-level information rather than just weekly or monthly summaries.

Because most modern applications include extensive functions for interpreting date fields, it might seem that we don't need to create an actual dimension table for Time. As it turns out, a physical table that contains a record for each day, as shown in Figure 3-5, is very useful. Not only can we include the obvious attributes on every day such as the name of the day or the year; we can also get creative about providing analytical columns. For example, we could include a flag that shows which days are holidays so that users can select those days for special analysis. When we have lots of descriptive attributes in the table, we can use the analytical capabilities of Analysis Services to provide the capability of selecting complicated ranges such as year-to-date.

Figure 3-5. Time dimension

Tip: Do Not Include Hourly or Minute-Level Records

If it turned out to be useful to know the actual time that a transaction occurred rather than just the day, you might think that we should just add more detailed information to the Time dimension table. This entails a number of problems; not only will this increase the size of the dimension (storing records at the minute level would mean 1,440 records per day!), but from an analytical point of view, it is not very useful.

Users most likely would want to select a date range and then see how transactions varied at different times of the day. The best way to support this is to leave the Time dimension at the day level (probably renamed Date for clarity) and create a separate dimension for Time Of Day. This would contain only one record for each time period, such as a total of 1,440 records for a minute-level Time Of Day dimension.

The other useful feature that you can provide in a Time dimension is support for multiple ways of grouping dates. In addition to looking at a natural calendar hierarchy such as year or month, most businesses want the ability to see fiscal periods. We can accommodate this by including the fiscal year (which usually starts on a different day to the natural calendar) and fiscal month on every daily record in addition to the standard calendar year and month. This could also be extended to support manufacturing calendars, which consist of 13 periods of exactly the same size made up of four weeks each.

Now that we have identified the different ways that users will want to see the information, we can move on to our next modeling question: What numbers will the user want to analyze?

What Are We Measuring?

All the information that we want to analyze, such as sales amounts, costs, or discounts, will end up on our fact table. Now that we have identified the grain of this fact table and designed the dimensions that we will be using, we can move on to building a list of numeric measures for the table. At this point in the design process, the fact table looks like Figure 3-6.

Figure 3-6. Initial Shipments fact table

Tip: The Importance of Clarifying Terms

Most data modelers find out early on that trying to do data modeling without a good understanding of the underlying business is a risky undertaking. People often tend to use different terms for the same concept depending on their specific area of the business, or use the same terms for very different concepts in different business areas. It is important to recognize this and take extra care to clarify terms even if they sound like commonsense concepts.

Ultimately, the source for all the measures in a fact table will be a transaction table in the ERP systems, but it would be a mistake to start from the ERP schema to identify measures. As previously described, on-line transaction processing (OLTP) systems have normalization as a primary driver of their designs, whereas much of the information in data warehouses is derived. A better approach is to start from the business questions that the users need to answer and then attempt to build a list of measures that will be required to provide the answers. Some of this information might turn out to not be available in the ERP systems, but at least these issues can then be identified to the ERP team.

For example, an obvious candidate measure is the quantity of the product item that the customer ordered, but equally important for analysis is a measure that describes the actual quantity that was delivered, taking into account any breakages or errors in the shipping process. We can add both Quantity Ordered as well as Quantity Delivered measures to the fact table. The business requirements also lead us to add other numeric columns to support profitability analysis, such as discounts and manufacturing costs.

Numeric Measures That Can't Be Summed

All the measures that we have looked at so far are fully additivethat is, whether we are looking at a product group total for the last year or a single customer's transactions in a week, all we have to do is sum up the numbers on the fact table to arrive at the correct answer. Not all numbers are quite so obliging, however; for example, whereas the quantity of product sold in a shipment record is additive, the unit price is not. If you imagine what would happen if we looked at a year's total of all the unit prices in the Shipments table, you can see that this would not produce a sensible measure. In the case of unit prices that might increase or decrease over time, and other "rate" types of measures, the usual way of looking at the information is by taking an average over the time period we are querying.

The same logic usually applies to "balance" types of measures such as inventory balances, where either the average balance or the latest balance in the selected time period is much more meaningful than a total. These measures are known as partially additive or semi-additive, because we can sum them up along dimensions such as Customer or Product, but not across other dimensions such as the Time dimension.

It is often possible to transform "rate" measures into fully additive measures just by multiplying by the corresponding measure such as quantity. This will produce a fully additive measure such as Revenue rather than a semi-additive measure such as Unit Price. If you want to actually include a semi-additive measure, you can simply add the numeric column to the fact tableAnalysis Services cubes can be set up so that the appropriate summary calculation is performed for semi-additive measures.

Handling Complex Relationships with Dimensions

Most of the relationships between dimension tables and the fact table are simple and easy to model. For example, every shipment has a method of shipping, so we can just add the ShipMethodKey to the fact table. Not all dimensions are quite so easy to handle, however; the Customer and Time dimensions in the manufacturing solution have more complex relationships with the fact table.

Every shipment that we make is sent to a specific customer, so we could just add a CustomerKey column to the fact table. However, because of the intricate corporate hierarchies that the manufacturer needs to deal with, the bill for the shipment might be sent to a different corporate entity (such as a parent company) than the customer that received the shipment. It is easy to find business requirements that would require both of these concepts, such as a financial analysis by the billing customer and a logistics or shipping analysis by the shipping customer. To accommodate this, we can simply add both ShippingCustomerKey and BillingCustomerKey columns to the fact table and populate them accordingly.

The Time dimension has a similar requirement. Depending on the analysis, users might want to see the shipments by the date that the order was placed, or the date that it was shipped, or even to compare the date actually shipped with the date that the shipment was due to take place. Each of these requires a new dimension key column on the fact table, as shown in Figure 3-7.

Figure 3-7. The completed Shipments data model

Fields on the Fact Table That Aren't Measures

The columns that we have added to the fact table so far are either dimension keys or numeric measures. Often, you must include some columns in an analysis that do not fall into either of these neat categories. A common example of this is Invoice Number. It would make no sense to consider this a numeric measure because we cannot sum or average it; and because there are probably many thousands of invoices and no real descriptive attributes to group them by, it is not really a dimension either. All the interesting information on an invoice has already been added to the fact table in the form of dimensions such as Customer and Product.

The Invoice Number would still be useful in many analyses, however, especially when a user has drilled down to a level of detail that includes only a few invoices and wants to see the full details of each. Adding this kind of column to a fact table is usually known as a degenerate dimension, because it is basically a dimension with only one column, which is the business key.