After designing your data mart or data warehouse, you can proceed with implementing your design. The process of implementing a data warehouse database is identical to implementing an OLTP database; however, the design choices for each type of database are different.
After this lesson, you will be able to:
- Prepare to implement your data warehouse
- Create the data warehouse database
- Create the tables and indexes in the data warehouse database
Estimated lesson time: 120 minutes
When implementing your star schema, you must address its storage requirements. The best estimate of the size of a data warehouse comes from calculating the storage required by the table schema and the expected number of database rows.
You determine the data warehouse size by considering the amount of storage required by the fact table. Because the dimension tables account for a small percentage of the data warehouse, it is not usually worthwhile to include them in space estimations. However, if your dimensions are rapidly changing and you use the Type 2 method of handling changing dimension records by writing a new record to the data warehouse (see Lesson 1, "Managing Slowly Changing Dimensions," in Chapter 14, "Managing a SQL Server Data Warehouse"), you will want to include the dimension tables and their growth in your calculations.
The fact table grain affects the database size. To estimate the database size, you must know the defined level of granularity of the fact table, know the business objective that the grain represents, and estimate the average frequency with which the business objective occurs over a period of time.
For example, if the level of granularity for a data mart is the transaction level, and the business objective is to analyze customer sales, you can calculate the average number of transactions that a customer generates from the operational system.
When estimating the amount of storage required for a row in a fact table, use the following guidelines:
Data Type | Purpose | Storage Size | Comments |
---|---|---|---|
bit | Integer data type that can be 1, 0, or NULL | If there are 8 or fewer bit columns in a table, the columns are stored as 1 byte. If there are from 9 to 16 bit columns, they are stored as 2 bytes, and so on. | Columns of type bit cannot have indexes on them. |
int | Integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647) | 4 bytes. | The SQL-92 synonym for int is integer. |
smallint | Integer data from -2^15 (-32,768) through 2^15 - 1 (32,767) | 2 bytes. | |
tinyint | Integer data from 0 through 255 | 1 byte. | |
decimal or numeric | Decimal values from -10^38 + 1 through 10^38 - 1 | 5 to 17 bytes, depending on precision. | The SQL-92 synonym for decimal is dec. |
money | Monetary data values from -2^63 through 2^63 - 1 | 8 bytes. | Accuracy is to a ten-thousandth of a monetary unit. |
smallmoney | Monetary data values from -214,748.3648 through +214,748.3647 | 4 bytes. | Accuracy is to a ten-thousandth of a monetary unit. |
float[(n)] | Floating-point number data from -1.79E+308 through 1.79E+308; n is the number of bits used to store the mantissa of the float number in scientific notation and thus dictates the precision and storage size. n must be a value from 1 through 53 | Where n is 1-24 (with 7 digits of precision): 4 bytes. Where n is 25-53 (with 15 digits of precision): 8 bytes. | The Microsoft SQL Server float data type conforms to the SQL-92 standard for all values of n from 1 through 53. The SQL-92 synonym for double precision is float(53). |
real | Floating-point number data from -3.40E+38 through 3.40E+38 | 4 bytes. | In SQL Server, the synonym for real is float(24). |
Example
In this example, the data warehouse size is estimated using the following calculation methods:
Years of data = 5
Customers = 10,000
Average number of transactions per customer per day = 4
Number of dimensions = 7
Number of measures (integer values) = 5
Description | Calculation Method | Value |
---|---|---|
Number of rows in fact table | 10,000 x 4 x 365 x 5 | 73,000,000 |
Estimated row size of fact table | (7 IDs x 4 bytes) + (5 measures x 4 bytes) + 9 bytes | 57 bytes |
Estimated data warehouse size | 57 bytes x 73,000,000 rows | ~3.9 GB |
NOTE
The 9 bytes added to each row represent the SQL Server overhead per row. This overhead increases to 17 bytes per row if you have a clustered index on the table.
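The arithmetic in this example can be reproduced as a quick Transact-SQL sketch. The figures are the hypothetical values assumed above, not measurements:

```sql
-- Estimated fact table rows: customers x transactions/day x days/year x years
SELECT 10000 * 4 * 365 * 5 AS FactRows                 -- 73,000,000

-- Estimated row size: 7 dimension keys + 5 integer measures + 9 bytes overhead
SELECT (7 * 4) + (5 * 4) + 9 AS RowBytes               -- 57

-- Estimated warehouse size in GB (1 GB = 1,073,741,824 bytes)
SELECT CAST(57.0 * 73000000 / 1073741824 AS numeric(5,2)) AS SizeGB  -- ~3.88
```

Remember to use 17 bytes of overhead per row instead of 9 if the fact table has a clustered index.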
Fact tables are the largest tables in the database, so you should define them carefully in order to minimize their size without compromising data requirements. Larger fact tables may require additional online and offline storage, additional time to back up and restore the database in the event of a system failure, or additional time to process queries when building OLAP aggregations.
One strategy to reduce the size of fact tables is to reduce the number of columns in them. For example, you should remove a column from the fact table if its data does not describe a business fact at the grain of the table.
For example, the order ID column is not required if you create a fact table from an Order Details table, because the order detail ID column is the unique identifier for the business transaction: the sale of a product. The order ID column does not represent a single business transaction; it represents the sale of one or more products to a single customer and, therefore, should not be stored in the fact table.
Another strategy to minimize the size of fact tables is to reduce the size of each column. For example, a fact table containing one billion rows, with one unused byte in one of the columns, represents almost 1 gigabyte (GB) of unused space.
Other techniques for reducing the size of a fact table are to use surrogate keys, ensure that all character and binary data is of variable length, and use data types that require fewer bytes of storage. For example, if a column contains integer values in the range of 1 through 100 only, use tinyint rather than int, saving 3 bytes per row for that column.
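As an illustrative sketch of these techniques (the table and column names are hypothetical, not part of the exercise schema), a fact table can combine surrogate integer keys, narrow integer types, and variable-length character data:

```sql
CREATE TABLE Sales_Fact
(
    TimeKey     int         NOT NULL,  -- surrogate key into a time dimension
    ProductKey  int         NOT NULL,  -- surrogate key into a product dimension
    UnitsSold   smallint    NOT NULL,  -- fits in 2 bytes instead of int's 4
    AgeBracket  tinyint     NOT NULL,  -- values 1 through 100 only: saves 3 bytes per row over int
    PromoCode   varchar(10) NULL       -- variable length, so short codes consume less space
)
```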
You must balance the cost of storing and maintaining large fact tables against the benefit of having all of this information in the database.
Fact tables can include precalculated data. The rationale for including precalculated data is that it improves query performance. However, including precalculated data increases the size of a fact table and can inhibit your use of the fact table. For example, if a fact table contains one billion rows, and a calculated column is added, one billion new values now exist permanently in the fact table.
Joining many tables can affect query performance. For example, if a column in a fact table is moved to a dimension table or another fact table, and the column is referenced frequently, you may have to join many tables. Therefore, you should consider both the cost and benefit of storage and query performance.
Creating a data warehouse database is identical to creating an OLTP database. However, you may want to address options that can affect the performance of your database.
When you create a data warehouse database, consider using the following CREATE DATABASE options:
SIZE
Specifies the minimum size of the data and transaction log file. Consider sizing the transaction log smaller than you would in an operational system. However, keep in mind that adding data to the database will be logged just like any other insert, unless you use nonlogged inserts, such as BULK INSERT or fast bulk copy. The SIZE option preallocates space, which reduces overhead from continuous growth and fragmentation.
MAXSIZE
Specifies the maximum size to which the file can grow. If the MAXSIZE option is not specified, the file grows until the disk becomes full. Although SQL Server databases can be configured to grow automatically, it makes sense to estimate a maximum size and specify it when creating your database.
FILEGROWTH
Specifies the increment (as a percentage of the current file size or as a fixed amount in MB) in which the data or transaction log file grows. The FILEGROWTH setting cannot exceed the MAXSIZE setting. If you enable automatic database growth, consider specifying large increments to accommodate growth.
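A minimal sketch combining these three options follows; the database name, file paths, and sizes are hypothetical and would be chosen from your own capacity estimates:

```sql
CREATE DATABASE Sales_Mart
ON PRIMARY
  ( NAME = Sales_Mart_Data,
    FILENAME = 'c:\mssql7\data\sales_mart.mdf',
    SIZE = 200MB,        -- preallocate space to reduce growth overhead and fragmentation
    MAXSIZE = 2000MB,    -- cap automatic growth before the disk fills
    FILEGROWTH = 50MB )  -- grow in large fixed increments
LOG ON
  ( NAME = Sales_Mart_Log,
    FILENAME = 'c:\mssql7\data\sales_mart.ldf',
    SIZE = 25MB,         -- smaller log than an operational system would need
    MAXSIZE = 100MB,
    FILEGROWTH = 10MB )
```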
You can configure a number of database options, but you can set them for only one database at a time. Use SQL Server Enterprise Manager or the sp_dboption system stored procedure to change database options. To affect options in all new databases, change the model database. Frequently used options include READ-ONLY, TRUNC. LOG ON CHKPT, and SELECT INTO/BULKCOPY.
READ-ONLY
Allows users to retrieve data, but they cannot modify it. This option is more efficient than just granting all users read-only access. It improves query performance because SQL Server needs to take far fewer locks when querying a read-only database.
TRUNC. LOG ON CHKPT
Causes the transaction log to be truncated every time that the checkpoint process occurs. This is a way of maintaining the transaction log so that it does not become full. Use this option with caution. If you select this option, you cannot perform transaction log backups of your database. Transaction log backups are usually more critical for operational databases than for analytical (read-only) databases.
NOTE
TRUNC. LOG ON CHKPT does not guarantee that the log will not fill. If a process engages in a lot of logging and the log portion of the database is not set to autogrow, the log can still fill up.
SELECT INTO/BULKCOPY
This option allows a database to accept nonlogged operations. You can use this option during bulk copying or with the SELECT INTO statement to conserve transaction log space. This benefit must be balanced against the fact that setting this option renders a transaction log unusable for backup and restore.
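For example, the options discussed above can be set one database at a time with sp_dboption (the database name here is hypothetical):

```sql
-- Allow nonlogged bulk operations while loading the warehouse
EXEC sp_dboption 'Sales_Mart', 'select into/bulkcopy', 'True'

-- Truncate the log at each checkpoint so it does not fill during routine loads
EXEC sp_dboption 'Sales_Mart', 'trunc. log on chkpt.', 'True'

-- After loading, make the database read-only to reduce locking during queries
EXEC sp_dboption 'Sales_Mart', 'read only', 'True'
```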
In this exercise, you will create the star schema relational database that will serve as the data mart for the Northwind_Mart database.
In this procedure, you will use SQL Server Query Analyzer to execute a Transact-SQL script that creates the Northwind_Mart database. The file for the completed script is located in C:\SQLDW\Exercise\Ch05\CreateDB.sql.
```sql
USE master
GO
IF (SELECT DB_ID('Northwind_Mart')) IS NOT NULL
    DROP DATABASE Northwind_Mart
GO
CREATE DATABASE Northwind_Mart
ON PRIMARY
  ( NAME = Northwind_Mart_Data,
    FILENAME = 'c:\mssql7\data\northwind_mart.mdf',
    SIZE = 5MB,
    FILEGROWTH = 10% )
LOG ON
  ( NAME = Northwind_Mart_Log,
    FILENAME = 'c:\mssql7\data\northwind_mart.ldf',
    SIZE = 2MB,
    FILEGROWTH = 50% )
GO
EXEC sp_dboption 'Northwind_Mart', 'trunc. log on chkpt.', 'True'
GO
```
This Transact-SQL script will create the Northwind_Mart database and set the TRUNC. LOG ON CHKPT database option to True.
The process of creating fact and dimension tables is identical to creating tables in an E-R database. You must specify the table name, column names, and column data types. However, fact tables have different characteristics from dimension tables.
When creating tables in Microsoft SQL Server version 7.0, you can have up to 1,024 columns per table and 8,060 bytes per row (8,192 bytes minus page overhead). These capacities are not generally critical in a highly normalized OLTP environment, and they are particularly well suited to the demands of the wide, denormalized structure of an OLAP database.
Specifying NULL or NOT NULL
You should specify whether to allow null values in each column. If you do not specify NULL or NOT NULL, SQL Server provides the NULL or NOT NULL characteristic, based on the session- or database-level default.
NOTE
Null values in a fact table have always been a cause of concern in the data warehouse arena because they have an impact on data explosion. They are not such a cause for concern with SQL Server because OLAP Services handles nulls efficiently.
Generating Column Values
You can generate column values by using the IDENTITY property, the NEWID function, and the uniqueidentifier data type.
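A brief sketch of both approaches (the table and column names here are hypothetical):

```sql
-- The IDENTITY property generates sequential surrogate key values automatically
CREATE TABLE Product_Dim
(
    ProductKey  int IDENTITY(1, 1) NOT NULL,
    ProductName nvarchar(40)       NOT NULL
)

-- The NEWID function supplies globally unique values for a uniqueidentifier column
CREATE TABLE Load_Batch
(
    BatchID  uniqueidentifier NOT NULL DEFAULT NEWID(),
    LoadedOn datetime         NOT NULL
)
```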
In this exercise, you will use Database Designer and Transact-SQL scripts to create dimension tables and a fact table.
In this procedure, you will create a database diagram in the Northwind_Mart database.
Tables do not yet exist in the database. You will create them in the following procedure.
In this procedure, you will use Database Designer to create a time dimension table in the Northwind_Mart database.
Column Name | Datatype | Length | Allow Nulls | Identity |
---|---|---|---|---|
TimeKey | int | 4 | No | Yes |
TheDate | datetime | 8 | No | No |
DayOfWeek | nvarchar | 20 | No | No |
Month | int | 4 | Yes | No |
Year | int | 4 | No | No |
Quarter | int | 4 | No | No |
DayOfYear | int | 4 | No | No |
Holiday | nvarchar | 1 | No | No |
Weekend | nvarchar | 1 | Yes | No |
YearMonth | nvarchar | 10 | No | No |
WeekOfYear | int | 4 | No | No |
Notice the asterisk (*) to the right of the title in the diagram. It indicates that there are unsaved changes.
This action changes the displayed contents of the Time_Dim table to show only the column names.
In this procedure, you will use SQL Server Query Analyzer to execute a Transact-SQL script file that creates the remaining dimension tables for the Northwind_Mart database.
In this procedure, you will add the new dimension tables to the database diagram.
In this procedure, you will use Database Designer to create the fact table.
Column Name | Datatype | Length | Allow Nulls | Identity |
---|---|---|---|---|
TimeKey | int | 4 | No | No |
CustomerKey | int | 4 | No | No |
ShipperKey | int | 4 | No | No |
ProductKey | int | 4 | No | No |
EmployeeKey | int | 4 | No | No |
RequiredDate | datetime | 8 | No | No |
LineItemFreight | money | 8 | No | No |
LineItemTotal | money | 8 | No | No |
LineItemQuantity | smallint | 2 | Yes | No |
LineItemDiscount | nvarchar | 10 | No | No |
Using constraints ensures that valid data values are entered in columns and that relationships are maintained between tables. You can add constraints to a table with existing data, and you can place constraints on single or multiple columns. Use the SQL Server Enterprise Manager or the CREATE TABLE or ALTER TABLE statements to create constraints.
There are four types of data integrity: domain integrity, entity integrity, referential integrity, and user-defined integrity. Figure 5.5 illustrates how domain integrity, entity integrity, and referential integrity relate to each other.
Figure 5.5 Types of Data Integrity
Domain Integrity
Domain (or column) integrity specifies the set of data values that are valid for a column and determines whether null values are allowed. Domain integrity is enforced by validity checking and by restricting the data type, format, or range of possible values allowed in a column.
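Domain integrity can be enforced declaratively. For example, a CHECK constraint can restrict the range of values allowed in the Quarter column of the Time_Dim table built earlier (the constraint name is hypothetical):

```sql
-- A quarter number outside 1-4 is rejected at insert or update time
ALTER TABLE Time_Dim
ADD CONSTRAINT CK_Time_Quarter CHECK (Quarter BETWEEN 1 AND 4)
```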
Entity Integrity
Entity (or table) integrity requires that all rows in a table have a unique identifier, known as the primary key value. Whether the primary key value can be changed, or whether the whole row can be deleted, depends on the level of integrity required between the primary key and any other tables.
Referential Integrity
Referential integrity ensures that the relationship between the primary key (in a referenced table) and the foreign key (in each of the referencing tables) is always maintained. Maintaining this relationship means that a row in a referenced table cannot be deleted, and its primary key value cannot be changed, while foreign key values refer to it; likewise, a row cannot be added to a referencing table unless its foreign key value matches an existing primary key value.
User-Defined Integrity
User-defined integrity allows you to define specific business rules that do not fall into one of the other integrity categories. All of the integrity categories support user-defined integrity. Constraints, stored procedures, and triggers are available for enforcing user-defined integrity.
Use the PRIMARY KEY keyword with the CONSTRAINT clause of the CREATE TABLE or ALTER TABLE statement to create a primary key on a table that uniquely identifies a row. A primary key enforces entity integrity. This constraint ensures that users do not enter duplicate values and that an index is created to enhance performance. Null values are not allowed.
Use the FOREIGN KEY keyword with the CONSTRAINT clause of the CREATE TABLE or ALTER TABLE statement to create a foreign key on a table that enforces referential integrity. The foreign key constraint defines a reference to a column or columns specified in a primary key or unique constraint in the same or another table. It specifies the data values that are acceptable in the foreign key columns based on values in a column in another table. The number of columns and data types that are specified by the FOREIGN KEY keyword must match the number of columns and data types in the REFERENCES clause.
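A sketch of both constraint types, using the Time_Dim table from the exercise (the fact table and constraint names are hypothetical):

```sql
-- Entity integrity: each time dimension row is uniquely identified
ALTER TABLE Time_Dim
ADD CONSTRAINT PK_Time_Dim PRIMARY KEY (TimeKey)

-- Referential integrity: every fact row must reference an existing time entry
ALTER TABLE Sales_Fact
ADD CONSTRAINT FK_Sales_Time FOREIGN KEY (TimeKey)
    REFERENCES Time_Dim (TimeKey)
```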
The indexes that you create for dimension and fact tables are important to the performance of cube processing.
When creating indexes for a data warehouse database, consider implementing the following steps:
Rather than using key values from the source system, you should generate unique values to identify dimension table entries. Source applications sometimes reuse keys, and if the source application is replaced, the format of the key may be incompatible with the previously defined key structure.
Defining surrogate keys provides simple and sequential numeric values, independence from the source system, and efficiency for indexing.
NOTE
Surrogate keys are also called artificial keys.
The data in a table is physically stored in the order of the key values in the clustered index. Create clustered indexes on columns that are searched frequently for ranges of key values or accessed in sorted order. If you have a very large fact table, you may want to avoid creating a clustered index on the fact table. You will need to evaluate whether having the data in a sorted order justifies the overhead of reorganizing the rows. Such a decision will be based on the types of queries run against the data warehouse database and the capacity of the server.
Because the data can only be stored in one order, there is a maximum of one clustered index per table. Nonclustered indexes are useful when users require multiple ways to search data. With the exception of indexes that are built on the primary and foreign key columns, determining which columns to index will depend on query activity.
Create composite indexes when two or more columns are best searched as a key or if queries reference only the columns in the index.
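The index types discussed above can be sketched as follows (the index and table names are hypothetical):

```sql
-- Clustered index: rows are physically stored in TimeKey order,
-- which suits range queries over time
CREATE CLUSTERED INDEX CIX_Sales_Time ON Sales_Fact (TimeKey)

-- Nonclustered index: provides an additional search path without reordering rows
CREATE NONCLUSTERED INDEX IX_Sales_Product ON Sales_Fact (ProductKey)

-- Composite index: two columns that are best searched together as one key
CREATE NONCLUSTERED INDEX IX_Sales_Cust_Emp ON Sales_Fact (CustomerKey, EmployeeKey)
```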
NOTE
After creating your indexes, it is important to use the Index Tuning wizard to refine them when production data is available.
In this exercise, you will create relationships between the fact and dimension tables.
In this procedure, you will use Database Designer to create primary key constraints for the dimension tables.
Column | Table |
---|---|
ShipperKey | Shipper_Dim |
ProductKey | Product_Dim |
EmployeeKey | Employee_Dim |
In this procedure, you will use Database Designer to create relationships between the fact table and the dimension tables.
The Create Relationship dialog box appears.
Column | Table |
---|---|
CustomerKey | Customer_Dim |
ShipperKey | Shipper_Dim |
ProductKey | Product_Dim |
EmployeeKey | Employee_Dim |
The design and implementation of the data warehouse database has a significant effect on the overall success of the data warehouse environment. When making data warehouse database design choices, it pays to make choices that promote the scalability and efficiency of the data warehouse. For example, minimize fact table size by selecting efficient data types. The tools that Microsoft SQL Server 7.0 provides for database, table, index, and constraint creation make complex database design and implementation easier.