Physical Data Warehouse Database Design

Before you start the relational database physical design process, you should have completed the logical model. There is relatively little work to move from the logical model to the physical model. Here are the basic steps, which we discuss in greater detail in the pages to follow:

  • Ensure that object names match the naming conventions. You should already have defined naming conventions for database objects such as tables and columns, and your logical model should already be using good, clear, sensible names that conform to those conventions.

  • Ensure that each column has the correct data type. Start off with the column definitions from the modeling process, but you may need to modify data types later, after you've completed the data profiling discussed in Chapter 2. For example, you may learn that some customer surnames take more than the 35 characters you originally assumed would suffice.

  • Specify how to handle a changed value for each dimension attribute.

  • Identify the levels of natural hierarchies, such as the familiar Year to Quarter to Month to Day.

  • Decide whether you will declare foreign keys in the database for all, some, or none of the logical foreign key relationships.

  • Develop your initial indexing plan.

  • Develop your fact table partitioning plan.

  • Specify dimension and fact metadata: process-related metadata that you may choose to include in each dimension and fact row. These metadata elements were introduced in Chapter 2, and are discussed further in Chapters 6 and 13.

The best tool for developing the physical model is a data modeling tool such as ERwin or Visio. These tools can forward engineer the SQL Server database, which makes it very easy to modify the database during the early, iterative design phases. (As we have already described in this chapter, the only way to get the version of Visio that will forward engineer the database is to purchase Visual Studio Enterprise Architect.)

These data modeling tools are expensive, and some small companies will balk at purchasing them. In Chapter 2 we introduced the Excel workbook and macro for logical modeling. If you don't have a data modeling tool, you can use that workbook and macro to generate the database. We provide the spreadsheet because not all readers will have a data modeling tool.

Surrogate Keys

The primary key for dimension tables should be a surrogate key assigned and managed by the DW/BI system. The most common method for creating surrogate keys is to use the IDENTITY property on the surrogate key column: every time a row is inserted, the identity column increments itself to generate the new key.
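For example, a dimension keyed this way might look roughly like the following sketch; the table and column names (DimCustomer, CustomerKey, CustomerBKey) are illustrative, not taken from the case study database:

 -- Minimal sketch of a dimension whose surrogate key is generated by IDENTITY.
 -- Table and column names are illustrative only.
 CREATE TABLE DimCustomer (
     CustomerKey  int IDENTITY(1,1) NOT NULL,  -- surrogate key, assigned by SQL Server
     CustomerBKey varchar(15) NOT NULL,        -- natural (business) key from the source system
     CustomerName varchar(50) NOT NULL,
     CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerKey)
     )
 GO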

Check to ensure the surrogate key column is an integer data type. Choose the appropriate integer type given the anticipated size of the dimension:

  • Tinyint takes values in the range 0 to 255, and requires 1 byte of storage

  • Smallint ranges from -2^15 (-32,768) to 2^15 - 1 (32,767), and takes 2 bytes

  • Int ranges from -2^31 to 2^31 - 1, and takes 4 bytes

  • Bigint ranges from -2^63 to 2^63 - 1, and takes 8 bytes

Choose the smallest integer type that will work for your dimension. This isn't very important for the dimension table itself, but it's vital for the fact table's storage and performance, because these same surrogate keys show up as foreign keys in the fact table. Small data types are also important for minimizing memory use during data processing. Make sure you use the same integer types for the fact table as for the corresponding dimension tables.

We usually frown on using meaningful surrogate keys (which is something of an oxymoron), but we make an exception in every DW/BI system we build. The Date dimension should use a surrogate key, and that surrogate key should be an integer. But it's awfully convenient for it to be a meaningful integer of the form year-month-day, such as 20050723. Developers are people, too.

String Columns

When we're building the logical model, we tend to be careless about string column lengths. During the modeling process, we're focusing on business meaning, not on how long a string column should be. In fact, we tend to use just a few standard string column lengths: 50 for names of things, 100 for descriptions.

We need to clean this up in the physical model. This is particularly true for columns in very large dimensions, and for the occasional string column in fact tables. The relational database stores variable length string columns, type varchar, efficiently: it doesn't pad these columns with spaces. However, other parts of the SQL Server toolset do pad columns. Notably, Integration Services and Analysis Services pad string columns with spaces under the covers. Both Integration Services and Analysis Services love physical memory, so there's a cost to declaring string columns that are far wider than they need to be.

Tip 

Start by making the string columns in the data warehouse database the same length as in the source database. But we've seen source systems that routinely use varchar(100) for all string columns. In this case, investigate the actual lengths of the string data, and make the data warehouse columns wider than any historical width, just for insurance. As we illustrate in Chapters 5 and 6, add a data quality screen to your ETL application to catch especially long strings in the future.

Don't get carried away by the varchar data type. Any column smaller than 5 (some people say 10) characters should just be a char data type, even if the data length varies somewhat. Similarly, if the string length varies little, say from 10 to 13 characters, simply use the char type.

We have seen many people confused by the nchar and nvarchar data types that Microsoft uses in its sample databases such as AdventureWorks. These string data types are exactly analogous to char and varchar, but they hold 2-byte Unicode data. Microsoft uses them because the same sample database structure works worldwide. Use char and varchar unless your data actually contains Unicode characters because you have string data in, say, Korean.

To Null, or Not to Null?

Avoid null values in the data warehouse tables. They are confusing to business users, especially when constructing queries that filter on an attribute that's sometimes null. If you decide to let a column be null, document your rationale in your system documentation.

It's not strictly necessary to enforce nullability in the database. The data warehouse tables are loaded through an ETL process, and only through the ETL process. As long as you don't actually load any null values, you should be okay.

But let's be professional here. If a column isn't supposed to be null, it should be declared NOT NULL in the database.

Insert an Unknown Member Row

A corollary to forbidding nulls, especially for the fact table foreign keys, is the question of how to handle a fact row that does have a null or bad foreign key. For example, how would we load a fact row that has a missing Customer ID? We talk about the specifics of this scenario at some length in Chapter 6, but one of the answers is to add an Unknown Member row to each dimension table. We habitually do this as soon as we create the table, and we use -1 as the unknown member's surrogate key.

If you use the identity column property to generate dimension surrogate keys, you can add the unknown member row by using the following logic:

 SET IDENTITY_INSERT Dim_MyDim ON
 INSERT Dim_MyDim (MyDim_Key, MyDim_TK, Attribute1, Attribute2)
 VALUES (-1, NULL, 'Unknown', 'Unknown')
 SET IDENTITY_INSERT Dim_MyDim OFF

Table and Column Extended Properties

When you developed the DW/BI system's logical model, you specified how each dimension tracks history. You should also have thought about how the dimension's primary hierarchies are structured, and whether you will enforce those hierarchical relationships in the data. We like to annotate these decisions in the physical database, using column extended properties; a scripting sketch follows the list below. For tables and views, we recommend storing a business description, and possibly a second technical description, as table extended properties.

  • For each table, create a table extended property called Description to hold the business description of the table.

  • For each column in each dimension table, create a column extended property Description to hold the business description of the column. Create a column extended property called Source System to hold a business-oriented summary of the source system.

  • For each non-key column in each dimension table, create a column extended property called SCD Type. The value of the SCD Type extended property should be 1-Restate History or 2-Track History.
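If you're scripting the database by hand, these properties can be attached with the system procedure sys.sp_addextendedproperty. The following is only a sketch; the table and column names (DimCustomer, CustomerName) are hypothetical:

 -- Sketch: attach a Description to a table, and an SCD Type to a column,
 -- using the system procedure sys.sp_addextendedproperty.
 EXEC sys.sp_addextendedproperty
     @name = N'Description', @value = N'One row for each customer',
     @level0type = N'SCHEMA', @level0name = N'dbo',
     @level1type = N'TABLE',  @level1name = N'DimCustomer'
 GO
 EXEC sys.sp_addextendedproperty
     @name = N'SCD Type', @value = N'2-Track History',
     @level0type = N'SCHEMA', @level0name = N'dbo',
     @level1type = N'TABLE',  @level1name = N'DimCustomer',
     @level2type = N'COLUMN', @level2name = N'CustomerName'
 GO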

The Excel spreadsheet that we've already talked about, which is posted on the book's web site at www.MsftDWToolkit.com, automatically creates these extended properties for you.

Housekeeping Columns

There are several columns that should be added to the physical model, if they're not already included in the logical model. For dimensions, add columns to track the date range for which the dimension row is valid.

Note 

You don't have to add these columns to a dimension table that has no Type 2 (track history) slowly changing dimension attributes in it.

The RowStartDate and RowEndDate columns indicate the date range for which the dimension row is valid. Make these dates inclusive, so that a SQL statement with a BETWEEN clause works as expected. For the current row, you could leave RowEndDate as NULL, but it works better if you set it to the maximum date for your data type: 12/31/2079 for smalldatetime, and 12/31/9999 for datetime. Populate a third column, RowIsCurrent, with the value yes or no (Y/N). Although this column could be inferred from the row's end date, sometimes it's easier to use the current indicator. In the MSFT_AdventureWorksDW case study data model, the Customer and Employee dimensions include Type 2 columns, so their table definitions include these columns.

Occasionally it's interesting for the business users to be able to easily tell which of several columns propagated a Type 2 change in the dimension. A simple way to do this is to add a column, RowChangeReason, that records the columns that changed on the date this row was added. The Customer and Employee dimensions in our case study database include this column.
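In sketch form, and reusing the hypothetical DimCustomer table from earlier, these housekeeping columns might be declared as follows; the defaults shown are one reasonable choice, not a requirement:

 -- Housekeeping columns for a Type 2 dimension (DimCustomer is a hypothetical name)
 ALTER TABLE DimCustomer ADD
     RowStartDate    datetime NOT NULL DEFAULT '1900-01-01',  -- first date the row is valid
     RowEndDate      datetime NOT NULL DEFAULT '9999-12-31',  -- maximum datetime marks the current row
     RowIsCurrent    char(1)  NOT NULL DEFAULT 'Y',           -- current row indicator (Y/N)
     RowChangeReason varchar(200) NULL                        -- columns that triggered the Type 2 change
 GO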

The Audit dimension is a dimension that keeps track of when and how a row was added to the DW/BI system. It is closely tied to the ETL system, and keeps track of the package and step that loaded the data. Using an Audit dimension for both fact and dimension rows is becoming more important, as increasingly strict compliance regulations mean we must track the lineage of the data in our warehouse. In Chapter 6, we provide examples of how to populate a simple Audit dimension. All dimension and fact tables in our case study database include a key to the Audit dimension. In fact, we use two Audit dimension keys: one for the process that initially loaded a row, and one for the latest update to that row.

Reference 

The Data Warehouse ETL Toolkit (Wiley, 2004) outlines more complex auditing procedures. You need to evaluate your business and auditing requirements against the cost and complexity of maintaining a richer data auditing system.

Consider adding one or more sorting columns to your dimension tables. There may be some sort order other than alphabetical that makes sense to your business users. This is especially true of Chart of Accounts dimensions.

Indexing and Column Constraints

In this section we provide a simple indexing plan for your relational data warehouse database. You should think of this as a starting point. You need to evaluate your query load against your data on your test system, in order to optimize your indexing plan.

Dimension tables with a single-column integer surrogate primary key should have a clustered primary key index. A clustered index is stored with the data rows; it effectively becomes the physical table itself. Unless the dimension is tiny, you should also create an index on the natural key.
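Continuing the hypothetical DimCustomer sketch, which already declares a clustered primary key on CustomerKey, the natural key index might look like this:

 -- Index on the natural (business) key of the hypothetical DimCustomer dimension
 CREATE NONCLUSTERED INDEX IX_DimCustomer_CustomerBKey
     ON DimCustomer (CustomerBKey)
 GO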

For small dimensions, the only other index you might want to define at the outset is a single-column index on any foreign keys. In the case study database, all dimensions have two foreign keys to DimAudit, and the Customer dimension has a foreign key to DimGeography, so we created indexes on those columns. All of the dimensions in our sample database are small enough that it's unlikely to be worthwhile to create any additional indexes.

For larger dimensions, your indexing plan depends on how the relational data warehouse database will be used. The hardest case to tune for is a relational data warehouse that supports significant reporting and ad hoc queries. Insofar as you know what that query and reporting load will be, you can tune the index plan for the expected use. You should already have identified the hierarchical relationships in your dimension. For a large dimension supporting direct queries, these hierarchical attributes are probably the first non-key attributes that you'll want to index. A very simple dimensional model is illustrated in Figure 4.7. If DimProduct is large and heavily used, consider single-column indexes on ProductCategory, ProductSubcategory, and ProductName.

Figure 4.7: Simple dimensional diagram illustrating key constraints
INDEXING VERY LARGE DIMENSIONS

A very large dimension that contains some Type-2 slowly changing dimension attributes should have a four-column index on the business key, row begin date, row end date, and the surrogate key. The row end date and surrogate key can be created as INCLUDE columns in the index. This index will speed the performance of surrogate key management during the ETL process.

See the Books Online topic Indexes with Included Columns.
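A hedged sketch of such an index, again using the hypothetical DimCustomer column names:

 -- ETL key-lookup index for a very large Type 2 dimension: business key and
 -- row start date as key columns, row end date and surrogate key as INCLUDE columns
 CREATE NONCLUSTERED INDEX IX_DimCustomer_KeyLookup
     ON DimCustomer (CustomerBKey, RowStartDate)
     INCLUDE (RowEndDate, CustomerKey)
 GO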


Microsoft SQL Server's query engine will use multiple indexes on a single table when resolving a query. With a specific workload of predefined queries, you can probably define multi-column indexes that are very useful. However, if you know very little about how users will access the data, the single-column index approach is the most sensible starting point.

For fact tables, the standard starting point is to create a single-column clustered index on the date key. If your fact table has multiple date dimensions, choose the one that's most often used in queries or, if your fact table is partitioned, the one that's used in the partitioning strategy. Next, create a single-column index on each of the other foreign keys to the dimension tables. In the simple database illustrated in Figure 4.7, you would create a clustered index on OrderDateKey and single-column indexes on CustomerKey and ProductKey in the FactOrders table. Most DW/BI systems do use the relational database for some queries and reports, so the simple fact table indexing described here is a good starting point. As with the dimensions, you should tune the indexing plan with your data and query mix.
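For the Figure 4.7 schema, that starting point looks roughly like the following; the index names are our own:

 -- Starting-point indexes for the FactOrders table in Figure 4.7
 CREATE CLUSTERED INDEX IX_FactOrders_OrderDateKey ON FactOrders (OrderDateKey)
 GO
 CREATE NONCLUSTERED INDEX IX_FactOrders_CustomerKey ON FactOrders (CustomerKey)
 GO
 CREATE NONCLUSTERED INDEX IX_FactOrders_ProductKey ON FactOrders (ProductKey)
 GO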

Note 

If your relational data warehouse is being used only to stage the Analysis Services database, you can get away with building fewer indexes. You'll certainly want to keep the primary key index on dimensions. Fact tables can be left largely unindexed. Because the queries Analysis Services uses to find data for processing are always the same, you can run the Index Tuning Wizard and tune the relational database exactly for that set of queries.

What about the primary key on the fact table? And what about the foreign key constraints between the fact table and the dimension tables? Our first answer is that of course you should declare the fact table keys, both primary and foreign. Even though you should do so, a few paragraphs from now we'll talk about why you often don't.

Let's start with what should happen. Declaring keys in the database is the right thing to do, and any professional database administrator will look at you funny if you suggest anything different. In the case of a fact table, there is virtually no value in defining a surrogate (integer) primary key. Instead, create a unique index over the set of columns, usually dimension keys, that makes a fact row unique. In Figure 4.7 it's all three dimension keys, but that's not always the case. Put the DateKey as the first column in the primary key index, as it's frequently used in query filter conditions.
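For the Figure 4.7 fact table, the declaration might look like this sketch; note that the index is nonclustered, for the reason explained in the note that follows:

 -- Declare the grain of FactOrders as a nonclustered primary key, with the date key first
 ALTER TABLE FactOrders ADD CONSTRAINT PK_FactOrders
     PRIMARY KEY NONCLUSTERED (OrderDateKey, ProductKey, CustomerKey)
 GO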

Note 

The unique, primary key index on a fact table should never be a clustered index. The primary key index is a big, multi-column index. If it's a clustered index, all other indexes on the fact table will be huge and inefficient, because they use the clustered index key as their row identifier.

You have already created single-column indexes on the individual foreign key columns in the fact table, and primary key indexes on the referenced columns in the corresponding dimension tables. Now add a foreign key constraint between the fact table and each of its dimensions. You need to let SQL Server check referential integrity, usually when you add the constraints, although you could schedule this check for later. If SQL Server doesn't check referential integrity, the constraints are just window dressing and don't do anything.
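Here is a sketch of one such constraint for the Figure 4.7 schema (the DimCustomer name is assumed); WITH CHECK tells SQL Server to validate the existing rows as the constraint is added, while WITH NOCHECK would defer that validation:

 -- Foreign key from the fact table to a dimension, validated as it is added
 ALTER TABLE FactOrders WITH CHECK
     ADD CONSTRAINT FK_FactOrders_DimCustomer
     FOREIGN KEY (CustomerKey) REFERENCES DimCustomer (CustomerKey)
 GO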

In practice, we often do not create the primary key and foreign key constraints on the fact table. Maintaining these structures is extremely expensive and slows down the data loads. As we describe in Chapter 6, one of the most important jobs of your ETL system is to prevent referential integrity violations, and using surrogate keys is a nearly foolproof way to do that. With foreign key constraints in place, every time a row is added to the fact table the SQL Server engine checks that each dimension key exists in its corresponding dimension table. This is a very expensive check for something that you did just moments before, when you looked up the fact table's surrogate keys. Along the same lines, it is obviously expensive for SQL Server to maintain a multi-column unique index. Because SQL Server resolves most queries by using the single-column indexes that you've created on each dimension key in the fact table, this unique index provides very little value for its cost of maintenance.

Note 

Build your fact tables the right way, with primary and foreign key constraints defined and enforced in the database. For the initial historical load, disable the constraints, load all the data, and then re-enable and check the constraints. Test your incremental load process. If it's too slow, and you're positive the slowness is occurring during the INSERT step, test the performance gains that result from removing the primary key index and foreign key constraints. If you decide to run this way, check periodically (weekly or daily) for referential integrity violations, which can creep in no matter how beautifully you've designed your ETL system.
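A sketch of that disable-and-recheck pattern around the historical load; the WITH CHECK CHECK form both re-enables the constraints and validates the loaded data:

 -- Before the historical load: stop checking foreign key constraints on the fact table
 ALTER TABLE FactOrders NOCHECK CONSTRAINT ALL
 GO
 -- ... perform the bulk historical load here ...
 -- After the load: re-enable the constraints and validate all existing rows
 ALTER TABLE FactOrders WITH CHECK CHECK CONSTRAINT ALL
 GO
 -- The primary key index can be disabled and rebuilt separately if needed:
 -- ALTER INDEX PK_FactOrders ON FactOrders DISABLE
 -- ALTER INDEX PK_FactOrders ON FactOrders REBUILD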

Create Table Views

All business-user access to the relational data warehouse database should come through views. Similarly, Analysis Services databases should be defined on views rather than on the underlying tables. In both cases, the rationale is to provide a protective layer between the users and the underlying database. This layer can be very helpful if you need to modify the DW/BI system after it's in production.

All user access should be through views rather than the underlying tables; the table names shouldn't even show up in a user's list of database objects. In the simplest case, a table's view simply selects all the columns from the underlying table. You may want to drop some columns from the view, especially some of the housekeeping columns described previously.
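In sketch form, a user-facing view over the hypothetical DimCustomer table might simply hide the housekeeping columns:

 -- Business-user view that exposes the dimension but hides housekeeping columns
 CREATE VIEW Customer AS
 SELECT CustomerKey, CustomerBKey, CustomerName
 FROM DimCustomer
 GO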

If you have snowflaked a dimension, create a single view for the dimension that collapses the multiple tables into a single logical table. To improve performance, you may choose to make this an indexed view.

All of your database object names should be business-user friendly. But the view definition is an opportunity to rename a column, especially for role-playing tables, discussed in Chapter 2.

Partitioned Fact Tables

In SQL Server 2005, Microsoft introduced partitioned tables. Partitioned tables are important for the scalability of the relational data warehouse database. The improved scalability comes not so much from better query performance against partitioned tables, although that is possible. Rather, the big win comes from greatly increased manageability of very large tables. With a very large partitioned table, everything from loading data, to indexing, and especially to backing up the data, can be much easier and faster than with a single monolithic table.

In a relational data warehouse database, you will typically partition the fact tables. Dimension tables, even very large dimension tables, seldom benefit from partitioning. From the point of view of a user or application issuing a query, a partitioned table looks and behaves the same as a non-partitioned table.

Analysis Services databases can also be partitioned. Most people use the same partitioning scheme for Analysis Services as for the relational database, and create, merge, and delete partitions from the two data stores on the same schedule. This is just a convenience; there is no requirement that Analysis Services partitions be synchronized with relational partitions.

Relational and Analysis Services partitioning are both features of SQL Server Enterprise Edition only.

How Does Table Partitioning Work?

A partitioned table is a lot like a set of physical tables that are combined with a UNION ALL view. The classic partitioning scheme is to partition by month: each month of fact data goes into a separate physical table, which is tied together with the other months' identically structured tables. We expect that most readers could define the UNION ALL view that does this task. A partitioned table works pretty much the same way, but there is a special syntax for defining it, and there are some specific requirements, discussed shortly, for how the partitions are physically structured. The benefits are substantial, and the old UNION ALL approach should no longer be used in your relational data warehouse database.

Note 

In this chapter we illustrate an extremely simple partitioned table and load it with a handful of rows. These scripts are available on the book's web site at www.MsftDWToolkit.com.

The first step in defining a partitioned table is to define a partition function, using the CREATE PARTITION FUNCTION syntax that's well documented in Books Online. If you're creating monthly partitions for a DateKey surrogate key, you would use syntax like:

 -- Create the partition function
 CREATE PARTITION FUNCTION PFMonthly (int)
 AS RANGE RIGHT FOR VALUES (20040101, 20040201, 20040301)
 GO

In this very simple example, the first partition holds all data before January 2004, the second partition holds January data, the third holds February data, and the fourth holds all data for March onward. Note that a partition function automatically creates partitions to hold all possible values: three boundary points, as in the preceding example, create four partitions. If you're using surrogate date keys, you need to define the partition function on integers with the appropriate key ranges. You can create complex partition functions, but most people will just create simple functions like the one illustrated here. (This is the main reason to use a meaningful surrogate key for Date.)

The next step is to define a partition scheme, which maps each partition in a partition function to a specific physical location. A simple example is illustrated here:

 -- Add the partition scheme
 CREATE PARTITION SCHEME PSMonthly
 AS PARTITION PFMonthly
 ALL TO ([PRIMARY])
 GO

Although we're creating all the partitions on the database's PRIMARY filegroup for this example, in the real world you would specify which partition goes on which filegroup. Aged partitions that no longer receive data or updates can have their filegroups set to Read Only, which greatly speeds backups. See Books Online for details.

Note 

If you plan to use Read Only filegroups, it makes most sense to put each partition on its own filegroup. That way you have the most flexibility for rolling a partition into and out of read-only status.
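For example, a real-world variant of the partition scheme shown earlier might map each partition to its own filegroup instead of using ALL TO ([PRIMARY]); the scheme and filegroup names here are assumptions, and the filegroups must already exist in the database:

 -- One filegroup per partition (three boundary points create four partitions)
 CREATE PARTITION SCHEME PSMonthlyByFG
 AS PARTITION PFMonthly
 TO (FG_Pre200401, FG_200401, FG_200402, FG_200403)
 GO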

Finally, create your partitioned table on the partition scheme. The syntax is really simple: basically, you're replacing the standard ON <filegroup> clause with an ON <PartitionScheme> clause. Here is a very simple table:

 -- Create a simple table that we can partition
 CREATE TABLE PartitionTable (
     DateKey int NOT NULL,
     CustomerKey int NOT NULL,
     SalesAmt money,
     CONSTRAINT PKPartitionTable PRIMARY KEY NONCLUSTERED (DateKey, CustomerKey))
 -- The ON clause refers to the partition scheme
 ON PSMonthly(DateKey)
 GO
Note 

The partition key (DateKey in our example) must be NOT NULL.

You can insert data into the partitioned table simply by using an INSERT statement. SQL Server uses the partition function to determine where the inserted data should go. A corollary to this statement is that this partition function should be simple and high-performance. Although you can define a partition function that performs a lookup or calculation, you should try to avoid doing so.

In the data warehouse database, we most often partition fact tables by date.

Let's add a few rows to our partitioned table:

 -- Add some rows
 INSERT INTO PartitionTable (DateKey, CustomerKey, SalesAmt) VALUES (20031201, 1, 5000)
 INSERT INTO PartitionTable (DateKey, CustomerKey, SalesAmt) VALUES (20040101, 2, 3000)
 INSERT INTO PartitionTable (DateKey, CustomerKey, SalesAmt) VALUES (20040215, 55, 6000)
 INSERT INTO PartitionTable (DateKey, CustomerKey, SalesAmt) VALUES (20040331, 5, 3000)
 INSERT INTO PartitionTable (DateKey, CustomerKey, SalesAmt) VALUES (20040415, 57, 6000)
 GO

If you query PartitionTable, you should see all these rows. There are some system views that you can query to see how your partitioned table is set up.

Reference 

See the Books Online topic Querying Data and Metadata from Partitioned Tables and Indexes.

Run the following query to see how many rows are in each partition:

 -- Query the system views to see what happened
 SELECT $partition.PFMonthly(DateKey) AS [Partition#],
        COUNT(*) AS RowCnt,
        MIN(DateKey) AS MinDate,
        MAX(DateKey) AS MaxDate
 FROM PartitionTable
 GROUP BY $partition.PFMonthly(DateKey)
 ORDER BY [Partition#]
 GO

As you should expect, the first partition, which contains all data before 1/1/2004, contains a single row, as do the partitions for January and February 2004. The fourth partition, which starts on 3/1/2004, contains all rows from that date onward, so it includes the two sample rows for 3/31/2004 and 4/15/2004.

Changing the Partitioning Scheme

Our very simple example raises an interesting point. How do we add a new partition for April? And what happens to our existing row for April when we do that?

Use the ALTER PARTITION FUNCTION command:

 -- Add a new partition for April
 ALTER PARTITION FUNCTION PFMonthly ()
 SPLIT RANGE (20040401)

Re-run the query that counts the rows in each partition; you'll see that SQL Server automatically moved the row dated 4/15 into the new partition. There are a few issues to be aware of here. First, this works great on a partition that contains two rows. But what if your partition contained 10 million, 100 million, or more rows? Although it will work, it will probably take more resources than you wish.

Tip 

Always leave yourself at least one empty partition that covers the date range that you plan to split later. Create new partitions well before you need them, so that you are always splitting an empty partition. In most cases, you want the first and last partitions always to be empty.

A second syntax for the ALTER PARTITION FUNCTION command will merge two partitions. As with splitting partitions, you'd prefer to merge empty partitions. We'll discuss how to do so most effectively.
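The merge syntax removes a boundary point from the function; a sketch for our example:

 -- Merge the two oldest partitions by removing the 20040101 boundary point
 -- (do this only when both partitions are empty)
 ALTER PARTITION FUNCTION PFMonthly ()
 MERGE RANGE (20040101)
 GO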

In the real world, you will want to be careful about the filegroup on which the next partition is created. Use the ALTER PARTITION SCHEME command to specify which filegroup to use for the next partition that will be created.

Tip 

Issue the ALTER PARTITION SCHEME command before executing the ALTER PARTITION FUNCTION command.

Using Table Partitions for Fast Data Loads

As you saw in the simple example in the preceding text, you can insert data directly into a partitioned table. It shouldn't surprise you to learn that this is not a fast loading process: at best it's as fast as an insert into a normal table. Especially for your initial load, you'll want to know how to load the partitioned fact table as fast as possible. There's an elegant trick that works great.

The trick is to load data into a separate table that's structured exactly the same as the table partitions: same columns, data types, indexes, filegroups, and so on. We'll call this separate table a pseudo-partition. Note that the CREATE TABLE script that follows is exactly the same as the one we used for the partitioned table, except for the ON <PartitionScheme> clause:

 -- Create an empty table nearly identical to the partitioned table
 CREATE TABLE PseudoPartition_200405 (
     DateKey int NOT NULL,
     CustomerKey int NOT NULL,
     SalesAmt money,
     CONSTRAINT PKPseudoPartition_200405 PRIMARY KEY NONCLUSTERED (DateKey, CustomerKey),
     CONSTRAINT CKPseudoPartition_200405 CHECK (DateKey >= 20040501 AND DateKey <= 20040531))
 -- We don't want the ON <PartitionScheme> clause
 --ON PSMonthly(DateKey)
 GO
PSEUDO-PARTITION CHARACTERISTICS

The pseudo-partition table must be defined on the same filegroup as the partition it's destined to replace. Everything about the pseudo-partition table must be exactly the same as the target partitioned table, with one exception: you must define a check constraint on the partition key (DateKey in our example) to ensure that the pseudo-partition table contains only data appropriate for the partition.

If you have indexes on keys other than the partitioning key, you must INCLUDE the partitioning key in the indexes of the pseudo-partition. SQL Server automatically adds the partitioning key as an INCLUDE column to any partitioned index that doesn't already have it, but it's better to do it in advance so the partition switch can be very fast.

See the Books Online topic Index with Included Columns.


Go ahead and create empty partitions for May and June:

 -- Create empty partitions for May and June
 ALTER PARTITION SCHEME PSMonthly NEXT USED [PRIMARY]
 GO
 ALTER PARTITION FUNCTION PFMonthly () SPLIT RANGE (20040501)
 GO
 ALTER PARTITION SCHEME PSMonthly NEXT USED [PRIMARY]
 GO
 ALTER PARTITION FUNCTION PFMonthly () SPLIT RANGE (20040601)
 GO

Use standard fast-load techniques to load the data into that pseudo-partition (a bulk-load sketch follows this list):

  • Set the database to Bulk-Logged or Simple recovery mode.

  • Confirm the table is empty and/or disable all indexes and constraints.

  • Bulk Insert or BCP from a file, or develop an Integration Services package using a SQL Server Destination task in the data flow.

  • Enable (or create) indexes and constraints.

  • Return the database to the desired recovery mode.

  • Perform appropriate backups.
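As a hedged sketch of the bulk-load step, a BULK INSERT into the pseudo-partition might look like this; the file path, delimiters, and format options are assumptions about your extract file:

 -- Fast load into the pseudo-partition from a delimited extract file.
 -- TABLOCK helps get minimally logged, high-speed loading in Bulk-Logged or Simple recovery.
 BULK INSERT PseudoPartition_200405
 FROM 'C:\ETL\FactOrders_200405.dat'
 WITH (
     DATAFILETYPE = 'char',
     FIELDTERMINATOR = '|',
     ROWTERMINATOR = '\n',
     TABLOCK
 )
 GO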

But in our example, we'll skip all these steps and simply add a few rows by hand to our new partition for May 2004:

 -- Insert a few rows by hand. In the real world we'd use
 -- a bulk loading technique.
 INSERT INTO PseudoPartition_200405 (DateKey, CustomerKey, SalesAmt) VALUES (20040505, 33, 5500)
 INSERT INTO PseudoPartition_200405 (DateKey, CustomerKey, SalesAmt) VALUES (20040515, 27, 6000)
 GO

Once the data is loaded, the indexes rebuilt, and the constraints re-enabled, switch the pseudo-partition into the partitioned fact table. This switch is a metadata operation and executes very quickly.

 -- The magic switch: very fast even with large data volumes
 ALTER TABLE PseudoPartition_200405 SWITCH TO PartitionTable PARTITION 6
 GO

You can re-run the query that examines the row counts in the partitions to confirm that this actually works. Examine the PseudoPartition_200405 table: it now contains zero rows. When we executed the ALTER TABLE ... SWITCH TO command, no data actually moved. Instead, the system's metadata logically swapped the places of the empty partition and the populated pseudo-partition. This is why the two have to be structured identically.

Note 

For populating the initial historical data, it's usually fastest to create all indexes after the entire partitioned table is populated and stitched together. SQL Server will build the indexes in parallel.

This technique minimizes overall load time and system resources because we can perform a fast, minimally logged load of large volumes of data. At the same time, we're minimizing the impact on the database's users: the data is loaded into a table that's invisible to users; during that load the partitioned table remains available for query; and the switch step executes extremely fast. In addition, the pseudo-partition can be backed up as a separate table, improving system manageability.

Dropping Old Data from a Partitioned Table

One of the great advantages of a partitioned table is that it makes it easy, and fast, to drop aged data. It's a common practice to keep a rolling window of data in the fact table, usually in multiples of a year plus one month (for instance, 37 months). Without table partitioning, dropping the oldest month (or year) of data requires a resource-intensive DELETE FROM statement.

The best way to drop an old partition is to create an empty pseudo-partition and swap it into the partitioned table in place of the old partition. As before, the pseudo-partition table must be structured identically to the partition it replaces, although in this case you don't need to add a check constraint on the DateKey:

 -- Create another pseudo partition table, this one to swap out
 CREATE TABLE PseudoPartition_200312 (
     DateKey int NOT NULL,
     CustomerKey int NOT NULL,
     SalesAmt money,
     CONSTRAINT PKPseudoPartition2 PRIMARY KEY NONCLUSTERED (DateKey, CustomerKey))
 GO

Now, swap this empty table into the partitioned table, replacing partition 1:

 -- The switcheroo: very fast move offline of aged data
 ALTER TABLE PartitionTable SWITCH PARTITION 1 TO PseudoPartition_200312
 GO

Examine both the partitioned table and the pseudo-partition. The partitioned table has an empty partition 1, and the PseudoPartition_200312 table now contains a row. As above, this is a metadata operation and occurs very quickly. Now, PseudoPartition_200312 can be backed up and dropped from the database if you wish.

Using Partitioned Tables in the Data Warehouse Database

If you've read this whole section on partitioned tables, you've already figured out that partitioned tables are too complicated to use unless you really need them. So when do you need them? If you have a fact table that contains a billion rows, you certainly want it to be partitioned. You should seriously consider partitioning smaller fact tables if you can't find a better way to reduce your backup window or load window to an acceptable target. At the low end, it's hard to imagine that a 10 million row fact table would truly need to be partitioned.

If you decide to partition your fact table, you must automate the process of loading and managing the table and partitions. Your ETL system should be partition-aware, and automatically add new partitions as needed to accommodate new data. Books Online provides examples of best practices for how to do this.

PARTITIONED TABLE LIMITATIONS

We really like the partitioned table feature, but there are some limitations that it's important to know about.

  • As we mentioned, there's a limit of 1,000 partitions per table.

  • Partitioned tables cannot have indexed views defined on them.

  • Issue the ALTER PARTITION FUNCTION ... MERGE RANGE and SPLIT RANGE commands only against empty partitions. SQL Server will let you issue these commands against a populated partition, and it will move the data around, but it will do so very slowly. If you have enough data that you're using partitions, you need to move the data yourself, most likely with an Integration Services package.


Most systems with partitioned fact tables will partition monthly by date. Most will implement the fast load and partition switching technique for the historical load, but not bother to do so for the daily incremental loads. If your DW/BI system has extreme data volumes, on the order of 10 million new fact rows a day, you may need to play this game on your daily loads.

If you're in this situation, the easiest thing to do is to partition by day instead of by month. However, you want to keep the total number of partitions for any one table to several hundred; there's a limit of 1,000 partitions per table.

However, keeping even one year of data, or 365 daily partitions, is a bit worrisome. You may consider consolidating daily partitions into weekly or monthly partitions as they age. Obviously you want to do this the fast way, using partition switching, rather than by merging populated partitions. The recommended approach is to build an Integration Services package that selects from daily partitions into a weekly pseudo-partition table, and then modifies the partitioning scheme. Have fun.

Aggregate Tables

Aggregates are the single most useful way to improve query performance on a dimensional DW/BI system. An aggregate table summarizes data at a higher level than the atomic data maintained in the detailed fact table. You would create aggregate tables at a parent level in a dimension, say at Product.Category, Geography.Country, or Date.Month. Some or all of the other dimensions would remain at their leaf levels.

Reference 

The process of designing and maintaining aggregate tables in the relational data warehouse database is discussed in the books The Data Warehouse Toolkit (Chapter 14) and The Data Warehouse ETL Toolkit (Chapter 6).

Microsoft has a few tools to help you maintain aggregate tables. We've seen people use indexed views as a substitute for aggregate tables: you can define an indexed view on the business process dimensional model to summarize the detailed data and store the summary in the view's index. This approach is most appealing to people who have done a lot of data warehousing in Oracle, which has a similar feature. We haven't seen any large SQL Server DW/BI systems make extensive use of indexed views in this way, and we don't recommend it as a best practice.

Most Microsoft DW/BI systems that would benefit from aggregates use Analysis Services to manage those aggregates. As we describe in Chapter 7, the Analysis Services OLAP functionality has a host of features for designing and maintaining aggregates. This is one of the core features of Analysis Services. Even if Analysis Services provided no other benefits, its usefulness as an aggregate manager makes implementing it worthwhile.

If you just can't use OLAP and really need to maintain relational aggregate tables, we recommend that you maintain them the old-fashioned way: in your ETL process. If you only ever add rows to your data warehouse database, and you only build aggregates on Type 2 slowly changing dimensions, then maintaining aggregate tables isn't at all difficult. In the real world, where those conditions seldom hold, it's a painful process. Again, see the books referenced above for a more complete description of the issues and approaches.

Staging Tables

One of the last steps in your physical design process is to develop staging tables. Staging tables are relational tables used to hold data during the ETL process. Use staging tables for data that you want to use as a lookup for other processing. For example, you may create a staging table based on each dimension that contains today's version of the key lookups, for use during the fact table surrogate key assignment process, or a staging table that ties together similar members, such as people, from multiple source systems.
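A minimal sketch of such a key-lookup staging table, created in the separate staging database described below; the names and column lengths are illustrative:

 -- Key-lookup staging table: today's business-key-to-surrogate-key map for one dimension,
 -- indexed on the business key to speed fact table surrogate key assignment
 CREATE TABLE CustomerKeyLookup (
     CustomerBKey varchar(15) NOT NULL,
     CustomerKey  int NOT NULL,
     CONSTRAINT PK_CustomerKeyLookup PRIMARY KEY CLUSTERED (CustomerBKey)
     )
 GO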

Because staging tables are intimately tied to the ETL process, the ETL developer is usually the person who specifies the requisite structure. A defining characteristic of a staging table is that it is not available to business users for querying. Best practice puts staging tables in a separate database from the relational data warehouse.

During your staging area design process, it's important to remember to stage in relational tables only data that is used for lookups, and to index that data appropriately to speed those lookups. Use the file system, and especially the Integration Services raw file format, to stage data that's not used for lookups. Writing data in the raw file format is orders of magnitude faster than writing it to relational tables. Use the relational database only where appropriate.

Metadata Setup

At this point in the process, when you are setting up your other databases and file system, you should also define your metadata database. We have an entire chapter devoted to metadata, in which we suggest some structures that integrate with the way we like to build Microsoft DW/BI systems. See Chapter 13 for details.

Aside from the data model for the metadata, the other issue is where the metadata is stored. We like to store user-defined metadata in its own relational database, usually on the same server as the data warehouse database. The reason you want it in a separate database is that the metadata database is more transactional than the data warehouse or staging databases. You should back it up frequently, which is easy to do because it's small.


