In this section, we focus on building the relational tables to support the dimensional model that we have designed in the previous section and discuss some of the common design decisions you need to make when building data warehouse tables. The following chapters cover the details of loading the data from the source systems and building Analysis Services cubes to present the information to end users. Building Dimension TablesAlthough some minor differences exist in the table designs for different business solutions, most dimension and fact table structures usually follow a strikingly similar pattern. Some standard ways of approaching the requirements for a well-performing data warehouse are common across most modern databases, including SQL Server 2005. We can start by building the Ship Method dimension table to demonstrate the approach. Using Surrogate KeysAs described in Chapter 1, every dimension table has a surrogate key column that is generated within the data warehouse and uniquely identifies the dimension record. In SQL Server databases, we can implement surrogate keys using a concept called identity columns. The first column that we will add to the dimension table is called ShipMethodKey (see the sidebar "Using Naming Standards"), which is the surrogate key for the table. All surrogate keys in our databases are declared as integer fields with the IDENTITY property turned on. This means that there is no need to specify a value for this column. The database will generate a unique incremental value every time a new row is added to the table. Adding the Business KeyThe manufacturer has assigned a code for every method of shipping products to their customers, such as OSD for Overseas Deluxe shipping. Although we are internally using the surrogate key as the only identifier for shipping methods, we still need to store the original shipping code along with every record. This code will be used to translate the shipping codes in information received from source systems and is known as the business key. Unlike surrogate keys, every business key in the data warehouse may have a different data type, so we must make a decision for each dimension table. The shipping method codes are currently three-character identifiers in the source systems, so char(3) would probably be a good candidate here. Some data warehouse designers like to add some additional space into codes to cater for any future systems that might need to be integrated that use longer codes; but because it is generally a trivial exercise to change this size if it becomes necessary (especially in dimension tables, which are usually much smaller than fact tables), this is certainly optional. Business keys tend to have a wide range of names in source systems (such as CustomerNumber, ShippingMethodCode, or ProductID), so we will pick the nice consistent naming convention of using ShipMethodBusinessKey and follow this for all our business keys. When creating the column, remember that business keys should generally not allow nulls. We will also add a unique key constraint for the ShipMethodBusinessKey column and for all the business key columns in our other dimension tables. Because we will be using the business key to look up records from the source system, the unique key constraint will ensure that we don't have any duplicates that would make this process fail. (See Chapter 8, "Managing Changing Data," for a common exception to the rule for unique business keys.) Adding Indexes to the Dimension TableThe unique key constraint that we added to the business key will also provide an index on the column, and we will pick this index as the clustered index for the dimension table. This means that the data in the table will be physically arranged in order of the business key, which will improve performance when we need to fetch the dimension record based on the business key. For example, when a dimension record is received from the source systems, the ETL process will need to do a lookup using the business key to determine whether the record is new or is an existing record that has been changed. Data warehouses that are used primarily as a consistent store to load Analysis Services cubes actually need little indexing because all the end-user querying takes place against the cube, but a careful indexing design that includes the surrogate keys used for relational joins will improve the performance for applications that run queries directly against the data warehouse, such as relational reporting. Adding Descriptive AttributesAll the other columns in our Ship Method dimension table are descriptive attributes that can be used to analyze the information. During the data modeling stage, we will always try to include as many of these attributes as possible to increase the usefulness of our data warehouse. Attributes have different data types depending on the type of information (such as the currency shipping cost and rates on the Ship Method dimension), but many of them contain textual information. When you are building a new data warehouse, you will often find patterns in the data types you are using for attributes, such as having some columns with a short description of up to 50 characters and other columns with a longer description of 100 characters. A useful technique is to take advantage of SQL Server's user-defined types feature to create special data types for common categories of columns, such as ShortDesc and LongDesc. This will standardize your approach for creating attributes, and make it easier to change the columns if the approach changes.
Building the Time Dimension TableAs discussed in the "Data Model" section, the Time table has a record for each time period that we are measuring, which is per day for the manufacturing example. The table consists of descriptive attributes to support the various hierarchies such as Calendar and Fiscal, as well as other descriptive attributes such as the day number of the year or whether the day is a public holiday. Although we could follow the usual convention of using an automatically generated integer surrogate key, it is often convenient for Time dimension tables to use an actual DateTime column as the primary key. Because the actual date is the value that is known to the source systems, this is basically the equivalent of using the business key rather than a generated surrogate. This violates the design guideline of always using a surrogate key, but does make some processes such as loading the fact table easier by avoiding the need to lookup the surrogate key for every date that you encounter. To support the Analysis Services dimension that we will ultimately build from the Time table, we also need to add an additional unique column for each level in the hierarchies. For example, the Quarter Name attribute contains values such as Quarter 1 and Quarter 2, which are not unique because each year has the same four quarter names. We will add a Quarter key column using the date of the first day in the quarter to provide a unique value for each quarter, such as 01/01/2006, and take the same approach for the other levels such as Year and Month. After we have finished designing all the dimension tables in the data warehouse, we can move on to the fact tables. Building Fact TablesAs we have seen in the "Data Model" section, the fact table consists of a column containing the surrogate key for each dimension related to the fact table, as well as columns for the measures that we will be tracking. For the Shipments fact table, we need to add the ProductKey, SalesTerritoryKey, PlantKey, and ShipMethodKey columns. The fact table has two relationships with the Customer dimension, so we need to add separate ShippingCustomerKey and BillingCustomerKey columns to represent this. We also need three date keys (OrderDateKey, DueDateKey, and ShipDateKey) to represent the different dates for each shipment fact. Most fact tables contain a single record for each distinct combination of the key values, so the logical primary key of the fact table is usually all the dimension keys. However, most information is analyzed with some level of summarization (especially BI solutions using Analysis Services), so it often does not matter whether you have multiple records for the same combination of keys because they will be summed except in some relational reporting scenarios. For this reason, a primary key constraint is usually not added to fact tables, and any potential requirement for avoiding multiple fact records is handled in the data loading (ETL) process. As described for dimension tables, BI solutions that use Analysis Services cubes to query the information require little indexing in the data warehouse. It is worthwhile, however, to add a clustered index to the fact table, usually on one of the date keys. Because the data will often be queried by a date range, it will be helpful to have the fact data physically arranged in date order. For the Shipments fact table, we will use the earliest date, which is the OrderDateKey, as shown in Figure 3-9. If you are also doing relational reporting from your data warehouse tables, it will probably prove to be beneficial to add indexes for the dimension keys in the fact table because this will speed up joins to the dimension tables. Figure 3-9. Shipments fact tableAfter the keys, we can add the remaining measure columns using the smallest data type that can contain the measure at the detail level. It is worth being careful with this process because the fact tables are the largest tables in the data warehouse, and properly selected data types can save a lot of space and lead to better performance. The downside of picking a data type that is too small is that at some point, your data loading process will probably fail and need to be restarted after increasing the size of the column. Using Views to Encapsulate Database ObjectsIf we allow client applications such as reporting tools or Analysis Services cubes to directly access the tables in our data warehouse, we run the risk that any future changes that we make might break these applications. Instead, we can create a view for each dimension and fact table in the data warehouse, which makes it easier to change or optimize the underlying database structure without necessarily affecting the client applications that use the database. You can design views using SQL Server Management Studio, by right-clicking the Views folder under the database and selecting New View. The query designer allows you to add the source tables to the view and select which columns to include, and specify a name when you save the new view. Dealing with Referential IntegrityReferential integrity (RI) is a technique for preserving the relationships between tables and is sometimes used in data warehouse solutions to ensure that every key in the fact table rows has a corresponding dimension row. If your data-loading process tries to add a fact row with a dimension key that does not yet exist, the process will fail because of the RI constraint and ensure that you don't end up with any mismatched facts. Good arguments both for and against using referential integrity constraints in the data warehouse exist (and the authors have either used or heard most of them), but we are going to go out on a limb here and state that in a properly architected and managed data warehouse, RI constraints are not required. The primary reason for this is that we always use surrogate keys for dimensions. Because these keys are only known within the data warehouse, a "lookup" step always needs to take place when loading facts to translate business keys into surrogate keys. If this lookup step fails, the new record will either not be added into the fact table or will be added with a special surrogate key that refers to a "Missing" or "Unknown" dimension record. So, in contrast to OLTP systems where RI is an absolute necessity, a major characteristic of data warehouses that strictly use surrogate keys is that RI is enforced through the data loading process. For this reason, it is not necessary to declare foreign key constraints in the data warehouse. The advantages are that load performance is improved, and the data loading sequence can sometimes be more flexible. The one potential disadvantage is that any errors in your data-loading process are harder to catch until you try to validate the numbers in your data warehouse, so it sometimes proves useful to turn on foreign key constraints during the development process. It is only when you have exceptions to the rule of always using a surrogate key that you get into trouble, and an especially alert reader might have noticed that we have already broken this rule for one of our tablesthe Time dimension. The date keys in our fact table are DateTime columns and do not need to be looked up in the dimension table during load. This means that we will have to add some processing in the data-loading process to check that all the data in the new fact records falls into the date range contained in the Time dimension table, and to trigger a process to have more dates added if not. Securing the Data WarehouseIf the company you are working with has a database security policy already, you will be able to follow the rules that have been laid out. This section lays out some guidelines for the security decision-making process as it applies to data warehouse databases, but security decisions should always be made with the context of the whole system in mind. When you connect to a database, SQL Server can either validate your current Windows user account to see whether you have access (which is known as Windows authentication) or prompt you to supply a separate SQL Server login account and password. Windows authentication is usually recommended because it has increased security and it does not require maintaining a separate set of accounts. Providing Access to Database AdministratorsWe have managed to create a database, add some tables, and specify columns for them, all without worrying about security. How did we manage to have permission to do all of that? If you have installed SQL Server 2005 onto a new server, the reason is probably that you are signed on to the database server as a member of the Administrators group on that machine. When SQL Server is installed, all local Administrators are granted a special right to administer the SQL Server instance, called sysadmin. Because companies usually don't want every administrator to have complete control of all the databases on that server, a common approach is to create a new Windows group for database administrators only. SQL Server Management Studio can then be used to add the new database administrators group (under the Security, Logins folder) with the sysadmin role turned on, and then to remove the Administrators group. Tip: Log Off After Adding Yourself to a Group A common mistake that can lead to a moment of panic after making the preceding change is when you have added your own user account to the database administrators group that you created. Just after you remove the Administrators group, you could suddenly be unable to access your databases. The reason for this is that your Windows group membership only gets refreshed when you log on. To be able to access the database server after you have added yourself to the new database administrators Windows group, you need to log off and back on again so that your credentials are picked up. Allowing Users to Access the Data WarehouseThe SQL Server security model is flexible and provides extremely granular control over the permissions that users have. You can control which users or groups can view information in a table, whether they can update the information, and other rights such as the ability to execute stored procedures. In all scenarios, we follow the principle of "least privilege," meaning that we will make sure that users only have the minimum level of permissions that they need to accomplish their tasks. In the manufacturing data warehouse, we have created views for every dimension and fact table, and most of the data warehouse users will only need to have access to read the information in these views. Because we are using Windows authentication, we can create a Windows group for all users who need read access to the data warehouse's views. Using groups is much more flexible and easier to maintain than setting up permissions for individual user accounts.
Providing Access to ApplicationsIn many BI solutions, users do not access the data warehouse database directly but through an analytical engine such as Analysis Services. In that case, the user account used to connect to the database for the purpose of loading data is often the service account that Analysis Services is using. You can cater for this by adding this service account to the data warehouse users group that you created previously. Another application that needs access to the database is Integration Services. You can use some flexible options to select which user account is actually used when an Integration Services package is executed to load data, but whatever account is used will need more than just read access to the views. You could use a similar approach to the Quick Start to create a Windows group for data warehouse data loaders, and then instead of granting SELECT permission on individual views, you could add the login to two special database roles: db_datareader and db_datawriter. These roles will allow the data-loading process to read and write data. |