After designing your data mart or data warehouse, you can proceed with implementing your design. The process of implementing a data warehouse database is identical to implementing an OLTP database; however, the design choices for each type of database are different.
After this lesson, you will be able to:
- Prepare to implement your data warehouse
- Create the data warehouse database
- Create the tables and indexes in the data warehouse database
Estimated lesson time: 120 minutes
When implementing your star schema, you must address its storage requirements. The best estimate of the size of a data warehouse comes from calculating the storage required by the table schema and the expected number of database rows.
You determine the data warehouse size by considering the amount of storage required by the fact table. Because the dimension tables account for a small percentage of the data warehouse, it is not usually worthwhile to include them in space estimations. However, if your dimensions are rapidly changing and you use the Type 2 method of handling changing dimension records by writing a new record to the data warehouse (see Lesson 1, "Managing Slowly Changing Dimensions," in Chapter 14, "Managing a SQL Server Data Warehouse"), you will want to include the dimension tables and their growth in your calculations.
The fact table grain affects the database size. To estimate the database size, you must know the defined level of granularity of the fact table, know the business objective that the grain represents, and estimate the average frequency with which the business objective occurs over a period of time.
For example, if the level of granularity for a data mart is the transaction level, and the business objective is to analyze customer sales, you can calculate the average number of transactions that a customer generates from the operational system.
When estimating the amount of storage required for a row in a fact table, use the following guidelines:
Data Type | Purpose | Storage Size | Comments |
---|---|---|---|
bit | Integer data type that can be 1, 0, or NULL | If there are 8 or fewer bit columns in a table, the columns are stored as 1 byte. If there are from 9 to 16 bit columns, they are stored as 2 bytes, and so on. | Columns of type bit cannot have indexes on them. |
int | Integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647) | 4 bytes. | The SQL-92 synonym for int is integer. |
smallint | Integer data from -2^15 (-32,768) through 2^15 - 1 (32,767) | 2 bytes. | |
tinyint | Integer data from 0 through 255 | 1 byte. | |
decimal or numeric | Decimal values from -10^38 + 1 through 10^38 - 1 | 5 to 17 bytes, depending on precision. | The SQL-92 synonym for decimal is dec. |
money | Monetary data values from -2^63 through 2^63 - 1 | 8 bytes. | Accuracy is to a ten-thousandth of a monetary unit. |
smallmoney | Monetary data values from -214,748.3648 through +214,748.3647 | 4 bytes. | Accuracy is to a ten-thousandth of a monetary unit. |
float[(n)] | Floating-point number data from -1.79E+308 through 1.79E+308; n is the number of bits used to store the mantissa of the float number in scientific notation and thus dictates the precision and storage size. n must be a value from 1 through 53 | Where n is 1-24 (with 7 digits of precision): 4 bytes. Where n is 25-53 (with 15 digits of precision): 8 bytes. | The Microsoft SQL Server float data type conforms to the SQL-92 standard for all values of n from 1 through 53. The SQL-92 synonym for double precision is float(53). |
real | Floating-point number data from -3.40E+38 through 3.40E+38 | 4 bytes. | In SQL Server, the synonym for real is float(24). |
Example
In this example, the data warehouse size is estimated using the following calculation methods:
Years of data = 5
Customers = 10,000
Average number of transactions per customer per day = 4
Number of dimensions = 7
Number of measures (integer values) = 5
Description | Calculation Method | Value |
---|---|---|
Number of rows in fact table | 10,000 x 4 x 365 x 5 | 73,000,000 |
Estimated row size of fact table | (7 IDs x 4 bytes) + (5 measures x 4 bytes) + 9 bytes | 57 bytes |
Estimated data warehouse size | 57 bytes x 73,000,000 rows | ~3.9 GB |
NOTE
The 9 bytes added to each row represent the SQL Server overhead per row. This overhead increases to 17 bytes per row if you have a clustered index on the table.
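The arithmetic in this example can be reproduced as a quick Transact-SQL sketch. The figures are the hypothetical values assumed above, not measurements:

```sql
-- Estimated fact table rows: customers x transactions/day x days/year x years
SELECT 10000 * 4 * 365 * 5 AS FactRows                 -- 73,000,000

-- Estimated row size: 7 dimension keys + 5 integer measures + 9 bytes overhead
SELECT (7 * 4) + (5 * 4) + 9 AS RowBytes               -- 57

-- Estimated warehouse size in GB (1 GB = 1,073,741,824 bytes)
SELECT CAST(57.0 * 73000000 / 1073741824 AS numeric(5,2)) AS SizeGB  -- ~3.88
```

Remember to use 17 bytes of overhead per row instead of 9 if the fact table has a clustered index.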
Fact tables are the largest tables in the database, so you should define them carefully in order to minimize their size without compromising data requirements. Larger fact tables may require additional online and offline storage, additional time to back up and restore the database in the event of a system failure, or additional time to process queries when building OLAP aggregations.
One strategy to reduce the size of fact tables is to reduce the number of columns in them. For example, you should remove a column from the fact table if its data does not describe a business fact at the grain of the table.
For example, the order ID column is not required if you create a fact table from an Order Details table, because the order detail ID column is the unique identifier for the business transaction: the sale of a product. The order ID column does not represent a single business transaction; it represents the sale of one or more products to a single customer and, therefore, should not be stored in the fact table.
Another strategy to minimize the size of fact tables is to reduce the size of each column. For example, a fact table containing one billion rows, with one unused byte in one of the columns, represents almost 1 gigabyte (GB) of unused space.
Other techniques for reducing the size of a fact table are to use surrogate keys, ensure that all character and binary data is of variable length, and use data types that require fewer bytes of storage. For example, if a column contains integer values in the range of 1 through 100 only, use tinyint rather than int, saving 3 bytes per row for that column.
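As an illustrative sketch of these techniques (the table and column names are hypothetical, not part of the exercise schema), a fact table can combine surrogate integer keys, narrow integer types, and variable-length character data:

```sql
CREATE TABLE Sales_Fact
(
    TimeKey     int         NOT NULL,  -- surrogate key into a time dimension
    ProductKey  int         NOT NULL,  -- surrogate key into a product dimension
    UnitsSold   smallint    NOT NULL,  -- fits in 2 bytes instead of int's 4
    AgeBracket  tinyint     NOT NULL,  -- values 1 through 100 only: saves 3 bytes per row over int
    PromoCode   varchar(10) NULL       -- variable length, so short codes consume less space
)
```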
You must balance the cost of storing and maintaining large fact tables against the benefit of having all of this information in the database.
Fact tables can include precalculated data. The rationale for including precalculated data is that it improves query performance. However, including precalculated data increases the size of a fact table and can inhibit your use of the fact table. For example, if a fact table contains one billion rows, and a calculated column is added, one billion new values now exist permanently in the fact table.
Joining many tables can affect query performance. For example, if a column in a fact table is moved to a dimension table or another fact table, and the column is referenced frequently, you may have to join many tables. Therefore, you should consider both the cost and benefit of storage and query performance.
Creating a data warehouse database is identical to creating an OLTP database. However, you may want to address options that can affect the performance of your database.
When you create a data warehouse database, consider using the following CREATE DATABASE options:
SIZE
Specifies the minimum size of the data and transaction log file. Consider sizing the transaction log smaller than you would in an operational system. However, keep in mind that adding data to the database will be logged just like any other insert, unless you use nonlogged inserts, such as BULK INSERT or fast bulk copy. The SIZE option preallocates space, which reduces overhead from continuous growth and fragmentation.
MAXSIZE
Specifies the maximum size to which the file can grow. If the MAXSIZE option is not specified, the file grows until the disk becomes full. Although SQL Server databases can be configured to grow automatically, it makes sense to estimate a maximum size and specify it when creating your database.
FILEGROWTH
Specifies the increment (as a percentage of the current file size or as a fixed amount in MB) in which the data or transaction log file grows. The FILEGROWTH setting cannot exceed the MAXSIZE setting. If you enable automatic database growth, consider specifying large increments to accommodate growth.
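A minimal sketch combining these three options follows; the database name, file paths, and sizes are hypothetical and would be chosen from your own capacity estimates:

```sql
CREATE DATABASE Sales_Mart
ON PRIMARY
  ( NAME = Sales_Mart_Data,
    FILENAME = 'c:\mssql7\data\sales_mart.mdf',
    SIZE = 200MB,        -- preallocate space to reduce growth overhead and fragmentation
    MAXSIZE = 2000MB,    -- cap automatic growth before the disk fills
    FILEGROWTH = 50MB )  -- grow in large fixed increments
LOG ON
  ( NAME = Sales_Mart_Log,
    FILENAME = 'c:\mssql7\data\sales_mart.ldf',
    SIZE = 25MB,         -- smaller log than an operational system would need
    MAXSIZE = 100MB,
    FILEGROWTH = 10MB )
```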
You can configure a number of database options, but you can set them for only one database at a time. Use SQL Server Enterprise Manager or the sp_dboption system stored procedure to change database options. To affect options in all new databases, change the model database. Frequently used options include READ-ONLY, TRUNC. LOG ON CHKPT, and SELECT INTO/BULKCOPY.
READ-ONLY
Allows users to retrieve data, but they cannot modify it. This option is more efficient than just granting all users read-only access. It improves query performance because SQL Server needs to take far fewer locks when querying a read-only database.
TRUNC. LOG ON CHKPT
Causes the transaction log to be truncated every time that the checkpoint process occurs. This is a way of maintaining the transaction log so that it does not become full. Use this option with caution. If you select this option, you cannot perform transaction log backups of your database. Transaction log backups are usually more critical for operational databases than for analytical (read-only) databases.
NOTE
TRUNC. LOG ON CHKPT does not guarantee that the log will not fill. If a process engages in a lot of logging and the log portion of the database is not set to autogrow, the log can still fill up.
SELECT INTO/BULKCOPY
This option allows a database to accept nonlogged operations. You can use this option during bulk copying or with the SELECT INTO statement to conserve transaction log space. This benefit must be balanced against the fact that setting this option renders a transaction log unusable for backup and restore.
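For example, the options discussed above can be set one database at a time with sp_dboption (the database name here is hypothetical):

```sql
-- Allow nonlogged bulk operations while loading the warehouse
EXEC sp_dboption 'Sales_Mart', 'select into/bulkcopy', 'True'

-- Truncate the log at each checkpoint so it does not fill during routine loads
EXEC sp_dboption 'Sales_Mart', 'trunc. log on chkpt.', 'True'

-- After loading, make the database read-only to reduce locking during queries
EXEC sp_dboption 'Sales_Mart', 'read only', 'True'
```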
In this exercise, you will create the star schema relational database that will serve as the data mart for the Northwind_Mart database.
In this procedure, you will use SQL Server Query Analyzer to execute a Transact-SQL script that creates the Northwind_Mart database. The file for the completed script is located in C:\SQLDW\Exercise\Ch05\CreateDB.sql.
```sql
USE master
GO
IF (SELECT DB_ID('Northwind_Mart')) IS NOT NULL
    DROP DATABASE Northwind_Mart
GO
CREATE DATABASE Northwind_Mart
ON PRIMARY
  ( NAME = Northwind_Mart_Data,
    FILENAME = 'c:\mssql7\data\northwind_mart.mdf',
    SIZE = 5MB,
    FILEGROWTH = 10% )
LOG ON
  ( NAME = Northwind_Mart_Log,
    FILENAME = 'c:\mssql7\data\northwind_mart.ldf',
    SIZE = 2MB,
    FILEGROWTH = 50% )
GO
EXEC sp_dboption 'Northwind_Mart', 'trunc. log on chkpt.', 'True'
GO
```
This Transact-SQL script will create the Northwind_Mart database and set the TRUNC. LOG ON CHKPT database option to True.
The process of creating fact and dimension tables is identical to creating tables in an E-R database. You must specify the table name, column names, and column data types. However, fact tables have different characteristics from dimension tables.
When creating tables in Microsoft SQL Server version 7.0, you can have up to 1,024 columns per table and 8,060 bytes per row (8,192 bytes minus page overhead). These capacities are not generally critical in a highly normalized OLTP environment, and they are particularly well suited to the demands of the wide, denormalized structure of an OLAP database.
Specifying NULL or NOT NULL
You should specify whether to allow null values in each column. If you do not specify NULL or NOT NULL, SQL Server provides the NULL or NOT NULL characteristic, based on the session- or database-level default.
NOTE
Null values in a fact table have always been a cause of concern in the data warehouse arena because they have an impact on data explosion. They are not such a cause for concern with SQL Server because OLAP Services handles nulls efficiently.
Generating Column Values
You can generate column values by using the IDENTITY property, the NEWID function, and the uniqueidentifier data type.
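A brief sketch of both approaches (the table and column names here are hypothetical):

```sql
-- The IDENTITY property generates sequential surrogate key values automatically
CREATE TABLE Product_Dim
(
    ProductKey  int IDENTITY(1, 1) NOT NULL,
    ProductName nvarchar(40)       NOT NULL
)

-- The NEWID function supplies globally unique values for a uniqueidentifier column
CREATE TABLE Load_Batch
(
    BatchID  uniqueidentifier NOT NULL DEFAULT NEWID(),
    LoadedOn datetime         NOT NULL
)
```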
In this exercise, you will use Database Designer and Transact-SQL scripts to create dimension tables and a fact table.
In this procedure, you will create a database diagram in the Northwind_Mart database.
Tables do not yet exist in the database. You will create them in the following procedure.
In this procedure, you will use Database Designer to create a time dimension table in the Northwind_Mart database.
Column Name | Datatype | Length | Allow Nulls | Identity |
---|---|---|---|---|
TimeKey | int | 4 | No | Yes |
TheDate | datetime | 8 | No | No |
DayOfWeek | nvarchar | 20 | No | No |
Month | int | 4 | Yes | No |
Year | int | 4 | No | No |
Quarter | int | 4 | No | No |
DayOfYear | int | 4 | No | No |
Holiday | nvarchar | 1 | No | No |
Weekend | nvarchar | 1 | Yes | No |
YearMonth | nvarchar | 10 | No | No |
WeekOfYear | int | 4 | No | No |
Notice the asterisk (*) to the right of the title in the diagram. It indicates that there are unsaved changes.
This action changes the displayed contents of the Time_Dim table to show only the column names.
In this procedure, you will use SQL Server Query Analyzer to execute a Transact-SQL script file that creates the remaining dimension tables for the Northwind_Mart database.
In this procedure, you will add the new dimension tables to the database diagram.
In this procedure, you will use Database Designer to create the fact table.
Column Name | Datatype | Length | Allow Nulls | Identity |
---|---|---|---|---|
TimeKey | int | 4 | No | No |
CustomerKey | int | 4 | No | No |
ShipperKey | int | 4 | No | No |
ProductKey | int | 4 | No | No |
EmployeeKey | int | 4 | No | No |
RequiredDate | datetime | 8 | No | No |
LineItemFreight | money | 8 | No | No |
LineItemTotal | money | 8 | No | No |
LineItemQuantity | smallint | 2 | Yes | No |
LineItemDiscount | nvarchar | 10 | No | No |
Using constraints ensures that valid data values are entered in columns and that relationships are maintained between tables. You can add constraints to a table with existing data, and you can place constraints on single or multiple columns. Use the SQL Server Enterprise Manager or the CREATE TABLE or ALTER TABLE statements to create constraints.
There are four types of data integrity: domain integrity, entity integrity, referential integrity, and user-defined integrity. Figure 5.5 illustrates how domain integrity, entity integrity, and referential integrity relate to each other.
Figure 5.5 Types of Data Integrity
Domain Integrity
Domain (or column) integrity specifies the set of data values that are valid for a column and determines whether null values are allowed. Domain integrity is enforced by validity checking and by restricting the data type, format, or range of possible values allowed in a column.
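Domain integrity can be enforced declaratively. For example, a CHECK constraint can restrict the range of values allowed in the Quarter column of the Time_Dim table built earlier (the constraint name is hypothetical):

```sql
-- A quarter number outside 1-4 is rejected at insert or update time
ALTER TABLE Time_Dim
ADD CONSTRAINT CK_Time_Quarter CHECK (Quarter BETWEEN 1 AND 4)
```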
Entity Integrity
Entity (or table) integrity requires that all rows in a table have a unique identifier, known as the primary key value. Whether the primary key value can be changed, or whether the whole row can be deleted, depends on the level of integrity required between the primary key and any other tables.
Referential Integrity
Referential integrity ensures that the relationship between the primary key (in a referenced table) and the foreign key (in each of the referencing tables) is always maintained. Maintaining this relationship means that a row in a referenced table cannot be deleted, and its primary key value cannot be changed, while foreign key values refer to it; likewise, a row cannot be added to a referencing table unless its foreign key value matches an existing primary key value.
User-Defined Integrity
User-defined integrity allows you to define specific business rules that do not fall into one of the other integrity categories. All of the integrity categories support user-defined integrity. Constraints, stored procedures, and triggers are available for enforcing user-defined integrity.
Use the PRIMARY KEY keyword with the CONSTRAINT clause of the CREATE TABLE or ALTER TABLE statement to create a primary key on a table that uniquely identifies a row. A primary key enforces entity integrity. This constraint ensures that users do not enter duplicate values and that an index is created to enhance performance. Null values are not allowed.
Use the FOREIGN KEY keyword with the CONSTRAINT clause of the CREATE TABLE or ALTER TABLE statement to create a foreign key on a table that enforces referential integrity. The foreign key constraint defines a reference to a column or columns specified in a primary key or unique constraint in the same or another table. It specifies the data values that are acceptable in the foreign key columns based on values in a column in another table. The number of columns and data types that are specified by the FOREIGN KEY keyword must match the number of columns and data types in the REFERENCES clause.
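A sketch of both constraint types, using the Time_Dim table from the exercise (the fact table and constraint names are hypothetical):

```sql
-- Entity integrity: each time dimension row is uniquely identified
ALTER TABLE Time_Dim
ADD CONSTRAINT PK_Time_Dim PRIMARY KEY (TimeKey)

-- Referential integrity: every fact row must reference an existing time entry
ALTER TABLE Sales_Fact
ADD CONSTRAINT FK_Sales_Time FOREIGN KEY (TimeKey)
    REFERENCES Time_Dim (TimeKey)
```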
The indexes that you create for dimension and fact tables are important to the performance of cube processing.
When creating indexes for a data warehouse database, consider implementing the following steps:
Rather than using key values from the source system, you should generate unique values to identify dimension table entries. Source applications sometimes reuse keys, and if the source application is replaced, the format of the key may be incompatible with the previously defined key structure.
Defining surrogate keys provides simple and sequential numeric values, independence from the source system, and efficiency for indexing.
NOTE
Surrogate keys are also called artificial keys.
The data in a table is physically stored in the order of the key values in the clustered index. Create clustered indexes on columns that are searched frequently for ranges of key values or accessed in sorted order. If you have a very large fact table, you may want to avoid creating a clustered index on the fact table. You will need to evaluate whether having the data in a sorted order justifies the overhead of reorganizing the rows. Such a decision will be based on the types of queries run against the data warehouse database and the capacity of the server.
Because the data can only be stored in one order, there is a maximum of one clustered index per table. Nonclustered indexes are useful when users require multiple ways to search data. With the exception of indexes that are built on the primary and foreign key columns, determining which columns to index will depend on query activity.
Create composite indexes when two or more columns are best searched as a key or if queries reference only the columns in the index.
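The index types discussed above can be sketched as follows (the index and table names are hypothetical):

```sql
-- Clustered index: rows are physically stored in TimeKey order,
-- which suits range queries over time
CREATE CLUSTERED INDEX CIX_Sales_Time ON Sales_Fact (TimeKey)

-- Nonclustered index: provides an additional search path without reordering rows
CREATE NONCLUSTERED INDEX IX_Sales_Product ON Sales_Fact (ProductKey)

-- Composite index: two columns that are best searched together as one key
CREATE NONCLUSTERED INDEX IX_Sales_Cust_Emp ON Sales_Fact (CustomerKey, EmployeeKey)
```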
NOTE
After creating your indexes, it is important to use the Index Tuning wizard to refine them when production data is available.
In this exercise, you will create relationships between the fact and dimension tables.
In this procedure, you will use Database Designer to create primary key constraints for the dimension tables.
Column | Table |
---|---|
ShipperKey | Shipper_Dim |
ProductKey | Product_Dim |
EmployeeKey | Employee_Dim |
In this procedure, you will use Database Designer to create relationships between the fact table and the dimension tables.
The Create Relationship dialog box appears.
Column | Table |
---|---|
CustomerKey | Customer_Dim |
ShipperKey | Shipper_Dim |
ProductKey | Product_Dim |
EmployeeKey | Employee_Dim |
The design and implementation of the data warehouse database has a significant effect on the overall success of the data warehouse environment. When making data warehouse database design choices, it pays to make choices that promote the scalability and efficiency of the data warehouse. For example, minimize fact table size by selecting efficient data types. The tools that Microsoft SQL Server 7.0 provides for database, table, index, and constraint creation make complex database design and implementation easier.