Data Model | Practical Business Intelligence with SQL Server 2005

We focus on the fact table in discussing the data model. The fact table is usually the largest table and therefore the most obvious candidate for partitioning. The dimension tables in our case are small and will not benefit from partitioning.

Our logical schema doesn't change significantly from what you might expect, but the physical implementation is very different. Partitioning requires a column that we can use to direct rows to a specific partition, so you do need to be sure you have an appropriate column. We are going to partition by date, and the CallDate column fulfills that requirement.

We have paid some attention to ensuring the data types are as small as possible to keep the row size down. This is not so much for saving disk space as it is for performance. The more rows you can fit on a page, the better the performance. Even if you have a relatively "small" VLDB with just a billion rows, each byte you save is a gigabyte you don't have to store or back up.

Our fact data sources are flat files, each containing call detail data from one of a number of on-line transaction processing (OLTP) servers, as shown in Figure 11-2.

Figure 11-2. Data presented by OLTP system

In the fact table shown in Figure 11-3, we have converted the phone number into two parts: the area code, which we will use to determine the general location of the call; and the local number. This allows us to have a Geography dimension with a key that is independent of the phone number business key. Area codes are well defined and can be imported from standard industry sources. We have split date and time apart because we want independent date and time dimensions. We chose to use a data type of smalldatetime for the call date to shorten the row length. The range of dates is from 1900 to 2079, which suffices for our purposes. The Time dimension key can be computed as the number of minutes past midnight, saving a lookup.

Figure 11-3. Call detail record in data warehouse

Our fact table has no declared primary key or unique key. This isn't unusual in a fact table, but in this case we can't have one. This is because of a restriction imposed by partitioning. Any columns in unique indexes must be part of the partitioning key. If we were to create a unique surrogate key to be used as a primary key, we would have to include it in the partitioning criteria. We don't want to partition by an arbitrary number, so even if you are in the habit of putting primary keys on your fact table, you will probably want to skip it this time. You can, and should, create a nonunique clustered index on the date column. If you must have a primary key, you can work around this restriction. Create a compound primary key using the date (or whatever column you are partitioning by) and any other unique column. If you don't have a unique column, create a surrogate key using a column with the Identity property set to true.

We've always recommended that you don't have a surrogate key for the Date dimension. This scenario is a good example of why, because we want to have a clear understanding what data is in our fact table partitions. The partition contents are defined by a key range, and Date is a human-readable value. A surrogate key wouldn't necessarily tell us directly what dates are in the partition. We also want the partitioning function to be easily read and maintained, and dates are more understandable than the IDs we would generate for a surrogate key.

Our Analysis Services database follows our data warehouse schema closely, as shown in Figure 11-4.

Figure 11-4. Fact and dimension tables