Data Model | Practical Business Intelligence with SQL Server 2005

The data warehouse that we built in the previous two chapters already contains all the dimensions and fact tables that we need to support the business requirements, and most of the measures. In this section, we look at how the data model needs to be extended to more easily support building Analysis Services cubes.

How Do We Handle "On-Time" Shipments?

The business requirements tell us that the users are interested in finding out how many (or what percentage) of shipments are on time, meaning that they were shipped on or before the due date. Because we already have ShipDateKey and DueDateKey columns that contain the relevant dates on the Shipments fact table, we should be able to calculate this without any problem.

In a relational reporting environment, we might add a where clause to our query that selected those records where one date was less than the other. It turns out that this approach is somewhat tricky to replicate in an OLAP environment, so we need another approach that can take advantage of the fact that cubes are very, very good at handling additive measures because of the power of precalculated aggregates.

We can start to solve the problem by realizing that each fact record is either "on time" or "not on time," depending on the date values. If we have a Boolean value such as this in our data model, the best approach is to model it as a numeric field with a 0 or 1. For example, we could add an OnTime column to the fact table that contains 1 for an on-time delivery and 0 for a late delivery.

This works great for cubes, because then we can add a measure that simply sums the column to return the number or percentage of on-time deliveries. No matter how you slice the cube, whether you are looking at the total for the whole cube or just a specific customer's shipments in the past week, the measure can simply sum the OnTime column.

Tip: Transforming Logic into Additive Measures

Cubes are great at handling additive measures, so a common technique is to transform complex logic into a nice simple column. Logic that returns a true or false value is commonly expressed as a numeric column with a 0 or 1, or with a value such as a sales amount for some records and a 0 for other records.

This is especially true for logic that needs to be evaluated for every record at the detailed, fact-record level. Calculations that work well with totals (such as Profit Percentage defined as the total profit divided by the total sales) can more easily be handled with calculations in the cube itself.

Of course, because we have defined views over the fact and dimension tables and the views are going to be the source for the Analysis Services database, we don't actually have to create a physical OnTime column. Instead, we can just add the calculation to the Shipments view using the following expression:

CASE WHEN ShipDateKey <= DueDateKey THEN 1 ELSE 0 END AS OnTime

If you don't have access to the views or tables in the source system, you can define the preceding SQL calculation in the Analysis Services database's data source definition, but it is more flexible and easier to maintain if you define SQL expressions in the views. The performance penalty that some people associate with SQL date calculations isn't particularly important in this case because the query will only be executed to load the data into the Analysis Services database.

An alternative approach to modeling this is to add a new On Time Category dimension that has two records, On Time and Late. This has the advantage that users could select a dimension member and restrict their analysis to only those shipments that fall into that category. Which of these options (or both) that you choose will depend on your users' preferences, underlining again the benefits of prototyping and involving the users in design decisions that they will ultimately end up living with.

How Late Were These Shipments, Anyway?

Another area that users will want to explore is the time periods for the shipments. For example, users need to know how many days late shipments were. They could perform some interesting analyses with this information, such as the average days late for a specific product or time period.

In a relational report, we could add an expression that calculates the number of days between the ShipDateKey and DueDateKey, which would give you the number of days late. This is another example of logic that needs the detail-level date keys to return a value, so we can add another calculated column to the view. Because we are trying to figure out how many days late the shipments are, and number of days happens to be a nice additive value, we can add an expression that returns the days between the two keys:

DATEDIFF(dd, DueDateKey, ShipDateKey) AS DaysLate

Actually, this expression turns out not to work so well because some shipments are early, so you will have negative DaysLate values that will tend to reduce the apparent lateness. This will require you going back to the users for clarification, but in the manufacturing example they don't want to get credit for those shipments that happened to be early. (In fact, that might be another area they are interested in analyzing further because it has interesting business implications.) We need to adjust the expression so that shipments that are on time or early do not contribute to the Days Late value:

CASE WHEN (ShipDateKey < DueDateKey) THEN 0   ELSE DATEDIFF(dd, DueDateKey, ShipDateKey) END AS DaysLate

We can handle the other business requirements in this area in the same way, such as calculating the number of days notice that we had by subtracting the due date from the order date and calculating the number of days it took to actually ship the product by subtracting the order date from the ship date.

What Doesn't Work Well in a Cube?

If you add columns to your fact table that aren't fully additive, such as Unit Price or Product Standard Cost, when you build your cube, you will start to realize that they don't work very well. A cube generally sums up the values for the records that the user is looking at, based on their dimension selections. Analysis Services 2005 has some other ways of aggregating numbers beyond a simple sum such as Max or Min, and even supports some advanced scenarios for dealing with data that behaves strangely over time; by and large, however, most measures are summed.

Unit Price doesn't work when you sum it. Although the Quantity Shipped sums very nicely under all conditions, the sum of the Unit Price for all shipments last year is meaningless. That is why the data model instead has an Extended Amount column, which contains the quantity multiplied by the unit price for the product. You can add up all the extended amounts for all shipments and come up with a useful number.

On the other hand, if the users needed to see the average unit price for the shipments, you can easily add a calculated measure to the cube that divides the total Extended Amount by the total Quantity, which is probably what the users wanted all along.