Technical Solution

In this section, we focus on building the relational tables to support the dimensional model that we have designed in the previous section and discuss some of the common design decisions you need to make when building data warehouse tables. The following chapters cover the details of loading the data from the source systems and building Analysis Services cubes to present the information to end users.

Building Dimension Tables

Although some minor differences exist in the table designs for different business solutions, most dimension and fact table structures usually follow a strikingly similar pattern. Some standard ways of approaching the requirements for a well-performing data warehouse are common across most modern databases, including SQL Server 2005. We can start by building the Ship Method dimension table to demonstrate the approach.

Using Surrogate Keys

As described in Chapter 1, every dimension table has a surrogate key column that is generated within the data warehouse and uniquely identifies the dimension record. In SQL Server databases, we can implement surrogate keys using a concept called identity columns.

The first column that we will add to the dimension table is called ShipMethodKey (see the sidebar "Using Naming Standards"), which is the surrogate key for the table. All surrogate keys in our databases are declared as integer fields with the IDENTITY property turned on. This means that there is no need to specify a value for this column. The database will generate a unique incremental value every time a new row is added to the table.

Adding the Business Key

The manufacturer has assigned a code for every method of shipping products to their customers, such as OSD for Overseas Deluxe shipping. Although we are internally using the surrogate key as the only identifier for shipping methods, we still need to store the original shipping code along with every record. This code will be used to translate the shipping codes in information received from source systems and is known as the business key.

Unlike surrogate keys, every business key in the data warehouse may have a different data type, so we must make a decision for each dimension table. The shipping method codes are currently three-character identifiers in the source systems, so char(3) would probably be a good candidate here. Some data warehouse designers like to add some additional space into codes to cater for any future systems that might need to be integrated that use longer codes; but because it is generally a trivial exercise to change this size if it becomes necessary (especially in dimension tables, which are usually much smaller than fact tables), this is certainly optional.

Business keys tend to have a wide range of names in source systems (such as CustomerNumber, ShippingMethodCode, or ProductID), so we will pick the nice consistent naming convention of using ShipMethodBusinessKey and follow this for all our business keys. When creating the column, remember that business keys should generally not allow nulls.

We will also add a unique key constraint for the ShipMethodBusinessKey column and for all the business key columns in our other dimension tables. Because we will be using the business key to look up records from the source system, the unique key constraint will ensure that we don't have any duplicates that would make this process fail. (See Chapter 8, "Managing Changing Data," for a common exception to the rule for unique business keys.)

Adding Indexes to the Dimension Table

The unique key constraint that we added to the business key will also provide an index on the column, and we will pick this index as the clustered index for the dimension table. This means that the data in the table will be physically arranged in order of the business key, which will improve performance when we need to fetch the dimension record based on the business key. For example, when a dimension record is received from the source systems, the ETL process will need to do a lookup using the business key to determine whether the record is new or is an existing record that has been changed.

Data warehouses that are used primarily as a consistent store to load Analysis Services cubes actually need little indexing because all the end-user querying takes place against the cube, but a careful indexing design that includes the surrogate keys used for relational joins will improve the performance for applications that run queries directly against the data warehouse, such as relational reporting.

Adding Descriptive Attributes

All the other columns in our Ship Method dimension table are descriptive attributes that can be used to analyze the information. During the data modeling stage, we will always try to include as many of these attributes as possible to increase the usefulness of our data warehouse. Attributes have different data types depending on the type of information (such as the currency shipping cost and rates on the Ship Method dimension), but many of them contain textual information.

When you are building a new data warehouse, you will often find patterns in the data types you are using for attributes, such as having some columns with a short description of up to 50 characters and other columns with a longer description of 100 characters. A useful technique is to take advantage of SQL Server's user-defined types feature to create special data types for common categories of columns, such as ShortDesc and LongDesc. This will standardize your approach for creating attributes, and make it easier to change the columns if the approach changes.

Quick Start: Creating a Dimension Table

As discussed in the Preface, the Quick Start exercises are intended to get you started with using the SQL Server tools to implement the technical solution we are discussing. In this Quick Start, we create a new data warehouse database and add the first dimension table, DimShipMethod.

1.	Open the SQL Server Management Studio.
2.	If you are prompted to connect to your database server, select your authentication type and click Connect. If this is the first time you are opening the Management Studio, choose Connect Object Explorer from the File menu, choose Database Engine as the server type, and select your database server name.
3.	To create a new database, right-click the Databases folder in the Object Explorer and select New Database; then specify a name for your database, such as ManufacturingDW, and click OK to create it.

4.	In the Object Explorer pane, expand the Databases folder and find your new database; then expand the database folder to see the object type folders, such as Database Diagrams and Tables.
5.	To create the new dimension table, right-click Tables and choose New Table. A new window will open up in Management Studio to allow you to specify the columns and other settings. (You are prompted to name the table only when you save it.)
6.	For the first column, specify ShipMethodKey as the name and `int` as the data type, and uncheck the Allow Nulls box. In the column properties, open up the Identity Specification section and set the (Is Identity) property to Yes.
7.	To specify the new column as the primary key, right-click the ShipMethodKey column and choose Set Primary Key.
8.	In the next grid row, add the ShipMethodBusinessKey column with a data type of char(3), and uncheck the Allow Nulls box.
9.	Right-click the table and choose Indexes/Keys. You need to change the Create As Clustered property of the existing primary key to be No, because each table can only have one clustered index, and we will be adding one for the business key instead.
10.	To add a unique key constraint for the business key, click the Add button to create a new index. For the Columns property, select ShipMethodBusinessKey, change the Type property to Unique Key, and change the Create As Clustered property to Yes. Click the Close button to return to the table designer.

11.	Add the other descriptive attributes columns for the Ship Method dimension (see the data model), using `varchar(25)` as the data type for the ship method name and `money` for the other columns, as shown in Figure 3-8. Figure 3-8. Ship Method dimension table
12.	Click the Save button on the toolbar and specify DimShipMethod as the table name.

Using Naming Standards

Consistently applying a standard approach to naming the database objects you create will help administrators and developers to more easily work with your data warehouse. It is not really important which naming standard you pick, as long as you apply it consistently. Here are some examples that we have applied in the solutions in this book:

Tables. Because we will be using views to provide access to the data, we have named the table with a prefix that describes the type (such as DimShipMethod or FactShipping) and the corresponding view with the real dimension or fact name (such as ShipMethod or Shipping).
Name style. We have capitalized each word in a column or table name, rather than using spaces or other separators such as underscores (so, ShipMethod rather than ship_method or ship method).
Surrogate key columns. We have used <<dimension>>Key as the name, such as ShipMethodKey.
Business key columns. We have used <<dimension>>BusinessKey as the name, such as ShipMethodBusinessKey. We apply this consistently, even when there is another obvious candidate such as ShipMethodCode or CustomerNumber.
Primary and unique key constraints. We have used <<dimension>>_PK for primary keys, and <<dimension>>_AK for unique keys such as the business key (the A in AK stands for alternate key).

Building the Time Dimension Table

As discussed in the "Data Model" section, the Time table has a record for each time period that we are measuring, which is per day for the manufacturing example. The table consists of descriptive attributes to support the various hierarchies such as Calendar and Fiscal, as well as other descriptive attributes such as the day number of the year or whether the day is a public holiday.

Although we could follow the usual convention of using an automatically generated integer surrogate key, it is often convenient for Time dimension tables to use an actual DateTime column as the primary key. Because the actual date is the value that is known to the source systems, this is basically the equivalent of using the business key rather than a generated surrogate. This violates the design guideline of always using a surrogate key, but does make some processes such as loading the fact table easier by avoiding the need to lookup the surrogate key for every date that you encounter.

To support the Analysis Services dimension that we will ultimately build from the Time table, we also need to add an additional unique column for each level in the hierarchies. For example, the Quarter Name attribute contains values such as Quarter 1 and Quarter 2, which are not unique because each year has the same four quarter names. We will add a Quarter key column using the date of the first day in the quarter to provide a unique value for each quarter, such as 01/01/2006, and take the same approach for the other levels such as Year and Month.

After we have finished designing all the dimension tables in the data warehouse, we can move on to the fact tables.

Building Fact Tables

As we have seen in the "Data Model" section, the fact table consists of a column containing the surrogate key for each dimension related to the fact table, as well as columns for the measures that we will be tracking. For the Shipments fact table, we need to add the ProductKey, SalesTerritoryKey, PlantKey, and ShipMethodKey columns. The fact table has two relationships with the Customer dimension, so we need to add separate ShippingCustomerKey and BillingCustomerKey columns to represent this. We also need three date keys (OrderDateKey, DueDateKey, and ShipDateKey) to represent the different dates for each shipment fact.

Most fact tables contain a single record for each distinct combination of the key values, so the logical primary key of the fact table is usually all the dimension keys. However, most information is analyzed with some level of summarization (especially BI solutions using Analysis Services), so it often does not matter whether you have multiple records for the same combination of keys because they will be summed except in some relational reporting scenarios. For this reason, a primary key constraint is usually not added to fact tables, and any potential requirement for avoiding multiple fact records is handled in the data loading (ETL) process.

As described for dimension tables, BI solutions that use Analysis Services cubes to query the information require little indexing in the data warehouse. It is worthwhile, however, to add a clustered index to the fact table, usually on one of the date keys. Because the data will often be queried by a date range, it will be helpful to have the fact data physically arranged in date order. For the Shipments fact table, we will use the earliest date, which is the OrderDateKey, as shown in Figure 3-9. If you are also doing relational reporting from your data warehouse tables, it will probably prove to be beneficial to add indexes for the dimension keys in the fact table because this will speed up joins to the dimension tables.

Figure 3-9. Shipments fact table

After the keys, we can add the remaining measure columns using the smallest data type that can contain the measure at the detail level. It is worth being careful with this process because the fact tables are the largest tables in the data warehouse, and properly selected data types can save a lot of space and lead to better performance. The downside of picking a data type that is too small is that at some point, your data loading process will probably fail and need to be restarted after increasing the size of the column.

Using Views to Encapsulate Database Objects

If we allow client applications such as reporting tools or Analysis Services cubes to directly access the tables in our data warehouse, we run the risk that any future changes that we make might break these applications. Instead, we can create a view for each dimension and fact table in the data warehouse, which makes it easier to change or optimize the underlying database structure without necessarily affecting the client applications that use the database.

You can design views using SQL Server Management Studio, by right-clicking the Views folder under the database and selecting New View. The query designer allows you to add the source tables to the view and select which columns to include, and specify a name when you save the new view.

Dealing with Referential Integrity

Referential integrity (RI) is a technique for preserving the relationships between tables and is sometimes used in data warehouse solutions to ensure that every key in the fact table rows has a corresponding dimension row. If your data-loading process tries to add a fact row with a dimension key that does not yet exist, the process will fail because of the RI constraint and ensure that you don't end up with any mismatched facts.

Good arguments both for and against using referential integrity constraints in the data warehouse exist (and the authors have either used or heard most of them), but we are going to go out on a limb here and state that in a properly architected and managed data warehouse, RI constraints are not required. The primary reason for this is that we always use surrogate keys for dimensions. Because these keys are only known within the data warehouse, a "lookup" step always needs to take place when loading facts to translate business keys into surrogate keys. If this lookup step fails, the new record will either not be added into the fact table or will be added with a special surrogate key that refers to a "Missing" or "Unknown" dimension record.

So, in contrast to OLTP systems where RI is an absolute necessity, a major characteristic of data warehouses that strictly use surrogate keys is that RI is enforced through the data loading process. For this reason, it is not necessary to declare foreign key constraints in the data warehouse. The advantages are that load performance is improved, and the data loading sequence can sometimes be more flexible. The one potential disadvantage is that any errors in your data-loading process are harder to catch until you try to validate the numbers in your data warehouse, so it sometimes proves useful to turn on foreign key constraints during the development process.

It is only when you have exceptions to the rule of always using a surrogate key that you get into trouble, and an especially alert reader might have noticed that we have already broken this rule for one of our tablesthe Time dimension. The date keys in our fact table are DateTime columns and do not need to be looked up in the dimension table during load. This means that we will have to add some processing in the data-loading process to check that all the data in the new fact records falls into the date range contained in the Time dimension table, and to trigger a process to have more dates added if not.

Securing the Data Warehouse

If the company you are working with has a database security policy already, you will be able to follow the rules that have been laid out. This section lays out some guidelines for the security decision-making process as it applies to data warehouse databases, but security decisions should always be made with the context of the whole system in mind.

When you connect to a database, SQL Server can either validate your current Windows user account to see whether you have access (which is known as Windows authentication) or prompt you to supply a separate SQL Server login account and password. Windows authentication is usually recommended because it has increased security and it does not require maintaining a separate set of accounts.

Providing Access to Database Administrators

We have managed to create a database, add some tables, and specify columns for them, all without worrying about security. How did we manage to have permission to do all of that? If you have installed SQL Server 2005 onto a new server, the reason is probably that you are signed on to the database server as a member of the Administrators group on that machine. When SQL Server is installed, all local Administrators are granted a special right to administer the SQL Server instance, called sysadmin.

Because companies usually don't want every administrator to have complete control of all the databases on that server, a common approach is to create a new Windows group for database administrators only. SQL Server Management Studio can then be used to add the new database administrators group (under the Security, Logins folder) with the sysadmin role turned on, and then to remove the Administrators group.

Tip: Log Off After Adding Yourself to a Group

A common mistake that can lead to a moment of panic after making the preceding change is when you have added your own user account to the database administrators group that you created. Just after you remove the Administrators group, you could suddenly be unable to access your databases.

The reason for this is that your Windows group membership only gets refreshed when you log on. To be able to access the database server after you have added yourself to the new database administrators Windows group, you need to log off and back on again so that your credentials are picked up.

Allowing Users to Access the Data Warehouse

The SQL Server security model is flexible and provides extremely granular control over the permissions that users have. You can control which users or groups can view information in a table, whether they can update the information, and other rights such as the ability to execute stored procedures. In all scenarios, we follow the principle of "least privilege," meaning that we will make sure that users only have the minimum level of permissions that they need to accomplish their tasks.

In the manufacturing data warehouse, we have created views for every dimension and fact table, and most of the data warehouse users will only need to have access to read the information in these views. Because we are using Windows authentication, we can create a Windows group for all users who need read access to the data warehouse's views. Using groups is much more flexible and easier to maintain than setting up permissions for individual user accounts.

Quick Start: Providing Read Access to Data Warehouse Views

After you have created a Windows group, such as Manufacturing DW Users, you will need to use SQL Server Management Studio to set up the group's permissions:

1.	Open SQL Server Management Studio and connect to the database server.
2.	In the Object Explorer pane, find the Security folder and right-click it. Select New Login.
3.	Type the name of your group in the Login Name box, or use the Search button to locate the group (for searching, click the Object Types button and include Groups to make sure your group shows up). Select the data warehouse database as the default database.
4.	Click the User Mapping page on the left side of the dialog. Check the Map box next to the data warehouse database and click OK.
5.	Now that we have created the login, we need to assign the permissions to enable the members of the group to access the views. In the Object Explorer pane, find your database under the Databases folder and open up the Security, Users subfolders.
6.	Right-click the name of your group and select Properties. Select the Securables section on the left side of the dialog.

7.	Click the Add button and then select Specific objects and click OK. In the Select Objects dialog, click the Object Types button, select Views, and click OK.
8.	Click the Browse button to display a list of all the views in your database. Check all the dimension and fact views, and then click OK to close the Browse dialog. Click OK again to close the Select Objects dialog.
9.	Now that we have added all the views to the list of objects in the Securables section, we need to assign the permissions. Highlight each of the views in turn and check the Grant box for the Select permission (see Figure 3-10). Click OK to close the dialog. Figure 3-10. Data warehouse user permissions

Providing Access to Applications

In many BI solutions, users do not access the data warehouse database directly but through an analytical engine such as Analysis Services. In that case, the user account used to connect to the database for the purpose of loading data is often the service account that Analysis Services is using. You can cater for this by adding this service account to the data warehouse users group that you created previously.

Another application that needs access to the database is Integration Services. You can use some flexible options to select which user account is actually used when an Integration Services package is executed to load data, but whatever account is used will need more than just read access to the views. You could use a similar approach to the Quick Start to create a Windows group for data warehouse data loaders, and then instead of granting SELECT permission on individual views, you could add the login to two special database roles: db_datareader and db_datawriter. These roles will allow the data-loading process to read and write data.