The Data Mart


A number of problems can result when we try to use our organizations' OLTP systems as the source for our business intelligence. What we need to do instead is take the information stored in these OLTP systems and move it into a different data store, which can then serve as the source for our measure calculations. In other words, we need to store the data somewhere outside of our OLTP systems so it is available for our business intelligence needs. When data is stored in this manner, it is referred to as a data mart.

Definition 

A Data Mart is a body of historical data in an electronic repository that does not participate in the daily operations of the organization. Instead, this data is used to create business intelligence. The data in the data mart usually applies to a specific area of the organization.

Note 

In this book, we discuss the creation of data marts, rather than the perhaps more familiar term, data warehouse. Data warehouses tend to be large, one-stop-shopping repositories where all the historical data for the organization would be stored. Nothing is wrong with this as a concept; however, attempting to create a data warehouse often led to huge, multiyear technology projects that were never quite finished or were outdated when they finally did get done. In this book, we concern ourselves with creating data marts—smaller undertakings that focus on a particular aspect of an organization.

Features of a Data Mart

Because the data mart is meant to serve as a source for business intelligence rather than managing the organization's day-to-day transactions, it is not designed the same as an OLTP database. Instead of being built around the rules of normalization, data marts are built for speed of access. A data mart is still a relational database, but it is designed to require fewer table joins when data is output for analysis and reporting. In a data mart, it is acceptable to have data repeated (denormalized) for the sake of speed.

When designing a data mart, the rules of normalization are replaced by a different design method, one organized around "facts" and intended to accelerate data retrieval. These new design approaches are called stars and snowflakes. We discuss stars and snowflakes in the sections "The Star Schema" and "The Snowflake Schema." Stars and snowflakes may seem like the stuff of children's fantasies, but, in reality, they provide quite grownup and down-to-earth approaches to organizing information that is quick and easy to access.

Not Real-Time Data

OLTP systems store data from business transactions as they occur throughout the day. Data marts, on the other hand, are updated at set intervals. Data is copied from the OLTP systems periodically and written to the data mart. This is known as a data load.

Because the data mart exists separately from the OLTP systems, accessing the data mart for business intelligence information does not put any stress on the transactional systems vital to the business's operation. The only exception to this is during the data load. During the data load, the OLTP systems may have to work hard to prepare the data for copying to the data mart. The good news here is the data load is an automated process that can be scheduled to run during off-peak hours.

As we discussed in Chapter 2, information in a data mart has some latency. In most cases, some time elapses between the moment a transaction is completed and the moment when the transaction is copied to the data mart. If a data mart load is scheduled to run each month right after the month-end processing, then the data mart has a latency of one month. If the data load runs nightly, the data mart can have a latency of up to one day.

The latency of the data mart must be set properly to fulfill the business intelligence requirements of that data mart. The information provided by the data mart must be up-to-date enough to facilitate effective decision making. However, the data loads should not occur so often that they cause unneeded stress on the OLTP systems.

Consolidation and Cleansing

Data from a number of different OLTP systems may be combined into a single data mart. This enables us to calculate some complex measures for our business intelligence. As we discussed earlier, this may also cause problems. Multiple OLTP systems can have different ways of representing data. Inconsistent data types used for the same data, dissimilar unique identifiers used for the same entity, and different time periods and calendar systems can all cause a great deal of difficulty when trying to combine data from heterogeneous systems.

In fact, problems can even arise when using data from a single system. The business rules necessary for a meaningful measure calculation may be stricter than those enforced within the OLTP system itself. If this is the case, some of the data coming from the OLTP system may not meet the stricter rules. Inconsistencies with data types and unique identifiers could also exist within the same system if the database has been poorly designed or poorly maintained.

These problems must be resolved before the data can be stored in the data mart. We must scrub out all the problem data. To do this, the data is put through a data cleansing process.

Definition 

Data Cleansing removes inconsistencies and errors from transactional data, so it has the consistency necessary for use in a data mart.

Data cleansing transforms data into a format that does not cause problems in the data mart environment. Data cleansing converts inconsistent data types to a single type. Data cleansing translates dissimilar identifiers to a standard set of codes for the data mart. In addition, data cleansing repairs or removes any data that does not meet the business rules required by the measures calculated from this data mart.

Data cleansing is usually done as part of a larger process. This process extracts the data from the OLTP systems and loads it into the data mart. Thus, the entire procedure is known as extract, transform, and load or ETL.

Definition 

The Extract, Transform, and Load (ETL) process extracts data to copy from one or more OLTP systems, performs any required data cleansing to transform the data to a consistent format, and loads the cleansed data by inserting it into the data mart.
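The three ETL steps can be sketched in a few lines of code. This is a minimal, hypothetical illustration — the source rows, the region-code mapping, and the business rule are all invented here; in practice a tool such as SQL Server Integration Services performs this work at scale:

```python
# Extract: rows as they might arrive from two OLTP systems, with
# inconsistent data types and dissimilar region identifiers.
extracted = [
    {"region": "NE", "amount": "104.50"},   # amount stored as text
    {"region": "Northeast", "amount": 56.25},
    {"region": "NE", "amount": -10.00},     # violates a business rule
]

# Transform: cleanse to consistent types and a standard code set.
REGION_CODES = {"NE": "NE", "Northeast": "NE"}  # hypothetical mapping

def cleanse(row):
    amount = float(row["amount"])           # unify the data type
    if amount < 0:                          # enforce the business rule
        return None                         # remove data that fails it
    return {"region": REGION_CODES[row["region"]], "amount": amount}

transformed = [r for r in (cleanse(row) for row in extracted) if r]

# Load: insert the cleansed rows into the data mart (a list here).
data_mart = []
data_mart.extend(transformed)

print(data_mart)  # two rows survive; the rule-violating row is gone
```

The key point is the ordering: no row reaches the data mart until it has passed through the transform step.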

Data Mart Structure

The data we use for business intelligence can be divided into four categories: measures, dimensions, attributes, and hierarchies. These four types of data help us to define the structure of the data mart. Let's look at each of these types of data and see how they are stored in the data mart.

Measures

The measure forms the basis of everything we do with business intelligence. In fact, without measures, there would be no business intelligence. As we learned in Chapter 1, the measures are the whole reason for pursuing business intelligence. They are the basic building blocks for effective decision making.

Definition 

A Measure is a numeric quantity expressing some aspect of the organization's performance. The information represented by this quantity is used to support or evaluate the decision making and performance of the organization. A measure can also be called a fact.

Measures are the facts we use for our foundational and feedback information. Therefore, the tables that hold measure information are known as fact tables. Don't let the name fool you though, Sergeant Friday—fact tables hold more than just the facts.

Dimensions

Total sales is an example of a measure that is often required for effective decision making. However, it is not often that decision makers want to see one single aggregate representing the total sales for all products for all salespersons for the entire lifetime of the organization. Instead, they are more likely to want to slice and dice this total into smaller parts. Dimensions are used to facilitate this slicing and dicing.

Definition 

A Dimension is a categorization used to spread out an aggregate measure to reveal its constituent parts.

A dimension enables us to apply a categorization to an aggregate measure. The categorization lets us see the constituent numbers that create the aggregate measure. For example, we can think of the measure of total sales as a single point of information. In geometry, a point has no dimensions.


We can apply a categorization or a dimension to that single point of data to spread it out. In this case, let's spread the total sales out into the total sales for each year.


We now have a line made up of measures of total sales at various points along that line, one point for each year in the dimension. In geometry, a line is a one-dimensional object. Thus, we have added a dimension to our measure: 2000, 2001, 2002, 2003, 2004, and 2005 are each said to be members of the year dimension.

We can again spread out the total sales, this time for each product type.


The measures of total sales are now arranged in a square. Because a square is a two-dimensional object, we have added another dimension to our measure. Woodland Creatures, Mythic World, Warriors of Yore, and Guiding Lights are each members of the product type dimension.

Let's spread out the total sales once more—this time by sales region. The measure of total sales has become a cube, which is a three-dimensional object. You can see that each time we add a new criterion to spread out the measure, we increase the dimensionality of our measure, thus, the name dimension.


We can continue to spread out the measure using additional dimensions, such as a marketing campaign dimension and a buyer's age bracket dimension, to get a four-dimensional object, and then a five-dimensional object. Even though this becomes difficult to represent in an illustration, it is a perfectly valid thing to do. Also, because we are more familiar with three-dimensional objects than four- or five-dimensional ones, if there is such a thing, we continue to refer to these structures with four, five, or even more dimensions as cubes. We talk more about cubes in Chapter 4.
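The slicing described above can be modeled with a dictionary keyed by tuples of dimension members. The sales figures below are invented purely to show the mechanics of moving from a zero-dimensional point to a one-dimensional line:

```python
# Total sales spread across two dimensions (year, product type).
# The amounts are made up for illustration.
sales = {
    (2004, "Mythic World"): 500, (2004, "Warriors of Yore"): 300,
    (2005, "Mythic World"): 700, (2005, "Warriors of Yore"): 400,
}

# Zero dimensions: the single aggregate point.
total = sum(sales.values())
print(total)  # 1900

# One dimension: spread the total out into total sales per year.
by_year = {}
for (year, _product_type), amount in sales.items():
    by_year[year] = by_year.get(year, 0) + amount
print(by_year)  # {2004: 800, 2005: 1100}
```

Adding the product type back as a second grouping key would turn the line into the two-dimensional square described in the text.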

The Star Schema

Measures and dimensions are stored in the data mart in one of two layouts or schemas. First, we look at the star schema. The name of this schema comes from the shape formed by the relational database diagram for the data mart, as we will see in a moment. The star schema uses two types of tables: fact tables and dimension tables.

Definition 

A Star Schema is a relational database schema used to hold measures and dimensions in a data mart. The measures are stored in a fact table and dimensions are stored in dimension tables.

The center of the star is formed by the fact table. The fact table has a column for the measure and a column for each dimension containing the foreign key for a member of that dimension. The primary key for this table is created by concatenating all of the foreign key fields. This is known as a composite key. Fact tables are named for the measure they contain with the word "Fact" added to the end. The fact table for the example in the previous section is shown in Figure 3-1.

SalesFact
    YearID
    ProductTypeID
    SalesRegionID
    MarketingCampaignID
    BuyersAgeGroupID
    Total Sales

Figure 3-1: The SalesFact table

The dimensions are stored in dimension tables. The dimension table has a column for the unique identifier of a member of the dimension, usually an integer or a short character value. The dimension table has another column for a description of the member. One dimension is stored in each dimension table with one row for each member of the dimension. Dimension tables are named for the dimension they contain with the letters "Dim" added to the beginning. The rows of the DimProductType dimension table would be:

ProductTypeID   ProductTypeDescription
1               Woodland Creatures
2               Mythic World
3               Warriors of Yore
4               Guiding Lights

When we add the dimension tables to the schema, we get the characteristic star design as shown in Figure 3-2.

Figure 3-2: The star schema

Potentially, one row will be in the fact table for every unique combination of dimension members. The word "potentially" is used here because there may be some combinations of dimension members that do not have a value. In some cases, a particular combination of dimension members may not even make sense.
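A star schema along these lines can be sketched with a few CREATE TABLE statements. This SQLite example is a simplified stand-in for the real thing — only two of the five dimensions are shown, and the sample row is taken from the text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One dimension table per dimension; one row per member.
conn.execute("""CREATE TABLE DimProductType (
    ProductTypeID INTEGER PRIMARY KEY,
    ProductTypeDescription TEXT)""")
conn.execute("""CREATE TABLE DimYear (
    YearID INTEGER PRIMARY KEY)""")

# The fact table: one foreign key column per dimension plus the
# measure. The composite primary key concatenates the foreign keys.
conn.execute("""CREATE TABLE SalesFact (
    YearID INTEGER REFERENCES DimYear,
    ProductTypeID INTEGER REFERENCES DimProductType,
    TotalSales NUMERIC,
    PRIMARY KEY (YearID, ProductTypeID))""")

conn.execute("INSERT INTO DimProductType VALUES (2, 'Mythic World')")
conn.execute("INSERT INTO DimYear VALUES (2003)")
conn.execute("INSERT INTO SalesFact VALUES (2003, 2, 56342)")

# One join per dimension turns identifiers back into descriptions.
row = conn.execute("""
    SELECT d.ProductTypeDescription, f.TotalSales
    FROM SalesFact f JOIN DimProductType d USING (ProductTypeID)
""").fetchone()
print(row)  # ('Mythic World', 56342)
```

Note how few joins the query needs: each dimension is exactly one hop from the fact table, which is the point of the star layout.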

In our example, if History Magazine Spring Ad is a member of the marketing campaign dimension, some of the rows in the SalesFact table would be:

Year   Product Type   Sales Region   Marketing Campaign      Buyer's Age   Total Sales
2003   Mythic World   Northeast      History Mag Spring Ad   0–25          56,342
2003   Mythic World   Northeast      History Mag Spring Ad   25–35         104,547
2003   Mythic World   Northeast      History Mag Spring Ad   35–45         234,385
2003   Mythic World   Northeast      History Mag Spring Ad   45–55         534,532
2003   Mythic World   Northeast      History Mag Spring Ad   55–65         829,282
2003   Mythic World   Northeast      History Mag Spring Ad   65+           284,540

If, in our example, we assume there are eight members of the marketing campaign dimension to go along with six members of the year dimension, four members of the product type dimension, four members of the sales region dimension, and six members of the buyer's age dimension, we have a potential for

 8 × 6 × 4 × 4 × 6 

or 4608 rows in the SalesFact table. This is not a huge number, but you can see when you have dimensions with tens or hundreds of members, the size of the fact table can grow rather large.
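The arithmetic behind that figure is simply the product of the member counts, which is easy to verify:

```python
from math import prod

# Member counts for each dimension, as given in the text.
members = {
    "marketing campaign": 8,
    "year": 6,
    "product type": 4,
    "sales region": 4,
    "buyer's age": 6,
}

potential_rows = prod(members.values())
print(potential_rows)  # 4608
```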

In reality, the fact table should contain the identifiers for the dimension members, rather than their descriptions. This cuts down on the size required to store each fact row. This becomes important in fact tables with a potential for thousands or even millions of rows.

In addition, a single fact table may contain multiple measures. This can occur when two or more measures use exactly the same dimensions. Putting a number of measures with the same dimensions in the same fact table is also a way to save on storage space required for the data mart.

Attributes

In some cases, we may want to store additional information about a dimension in our data mart. This additional information helps us further define the members of the dimension. These bits of additional information are known as attributes of the dimension.

Definition 

An Attribute is an additional piece of information pertaining to a dimension member that is not the unique identifier or the description of the member.

Attributes can be used to more fully describe dimension members. Attributes may contain information about a dimension member that the users are likely to want as part of their business intelligence output. Attributes are also used to store information that may be used to limit or filter the records selected from the data mart during data analysis. Attributes are stored as additional columns in the dimension tables as shown in Figure 3-3.

Figure 3-3: Attribute columns in the dimension tables
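In code, an attribute is nothing more than an extra column on the dimension table that queries can filter on. In this SQLite sketch the attribute column, IntroductionYear, is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A dimension table with an attribute column beyond the unique
# identifier and the description. IntroductionYear is hypothetical.
conn.execute("""CREATE TABLE DimProductType (
    ProductTypeID INTEGER PRIMARY KEY,
    ProductTypeDescription TEXT,
    IntroductionYear INTEGER)""")

conn.executemany("INSERT INTO DimProductType VALUES (?, ?, ?)", [
    (1, "Woodland Creatures", 1998),
    (2, "Mythic World", 2001),
])

# Attributes limit or filter the records selected during analysis.
rows = conn.execute("""
    SELECT ProductTypeDescription
    FROM DimProductType
    WHERE IntroductionYear >= 2000""").fetchall()
print(rows)  # [('Mythic World',)]
```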

Hierarchies

In many cases, a dimension is part of a larger structure with many levels. This structure is known as a hierarchy. In our sample, the year, product type, and sales region dimensions are each part of their own hierarchy.

Definition 

A Hierarchy is a structure made up of two or more levels of related dimensions. A dimension at an upper level of the hierarchy completely contains one or more dimensions from the next lower level of the hierarchy.

The year dimension contains quarters. Further, the quarters contain months. The product type dimension contains product subtypes, which, in turn, contain products. The sales region dimension contains sales territories.

Hierarchies enable the users to navigate to different levels of granularity within the measures of the data mart. Users can look at the measures at one level of the hierarchy, then drill into a selected member to see the next lower level. For example, a user can look at the product subtypes within the Mythic World product type. A user can look at the sales territories within the Northeast sales region. A user can look at the months within Q1, 2003. In fact, all of these can be combined at once, so the user can view the sales totals by the product subtypes of the Mythic World product type, by sales territories of the Northeast sales region, and by months in Q1, 2003. By moving up and down the hierarchy, users can find exactly the right measure to aid in making the decision at hand.

In a star schema, the information about the hierarchies is stored right in the dimension tables as shown in Figure 3-4. The primary key in each of the dimension tables is at the lowest level of the hierarchy. Because of this, the fact table must be changed, so its foreign keys point to the lowest levels in the hierarchies as well. This means, then, the fact table will have one row (and one set of measures) for each unique combination of members at the lowest level of all the hierarchies.

Figure 3-4: A star schema with hierarchies

Measures for hierarchy levels above the lowest level are not stored in the data mart. Instead, these measures have to be calculated by taking an aggregate of the measures stored at the lowest level. For example, if the user wants to see the total sales for the Northeast sales region, that measure has to be calculated by aggregating the total sales for all of the territories within the Northeast sales region. In this case, the aggregation is the sum of the total sales of the territories.
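The aggregation described above amounts to a GROUP BY at the desired hierarchy level. A simplified SQLite sketch, with invented territory figures, shows how region-level sales are derived rather than stored:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# In a star schema the hierarchy lives inside the dimension table:
# each territory row also names the region that contains it.
conn.execute("""CREATE TABLE DimSalesRegion (
    TerritoryID INTEGER PRIMARY KEY,
    TerritoryDescription TEXT,
    RegionDescription TEXT)""")

conn.execute("""CREATE TABLE SalesFact (
    TerritoryID INTEGER,  -- foreign key at the LOWEST hierarchy level
    TotalSales NUMERIC)""")

conn.executemany("INSERT INTO DimSalesRegion VALUES (?, ?, ?)", [
    (1, "Boston", "Northeast"),
    (2, "New York", "Northeast"),
    (3, "Chicago", "Midwest"),
])
conn.executemany("INSERT INTO SalesFact VALUES (?, ?)", [
    (1, 100), (2, 250), (3, 75),
])

# Region-level sales are not stored; they are aggregated on demand
# from the territory-level facts.
rows = conn.execute("""
    SELECT d.RegionDescription, SUM(f.TotalSales)
    FROM SalesFact f JOIN DimSalesRegion d USING (TerritoryID)
    GROUP BY d.RegionDescription ORDER BY d.RegionDescription
""").fetchall()
print(rows)  # [('Midwest', 75), ('Northeast', 350)]
```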

The Snowflake Schema

An alternative to the star schema is the snowflake schema. The snowflake schema represents hierarchies in a manner that is more familiar to those of us who have been working with relational databases. In a snowflake schema, each level of a hierarchy is stored in a separate dimension table. This is shown in Figure 3-5. These chains of dimension tables are supposed to resemble the intricate patterns of a snowflake. Although the resemblance may not be too strong in Figure 3-5, this is how the snowflake schema got its name.

Figure 3-5: The snowflake schema

As with the star schema, the foreign keys in the fact table at the center of the snowflake schema point to the lowest level of each hierarchy. Therefore, the fact table in the snowflake schema also contains one row for each unique combination of members at the lowest level of each hierarchy. Measures for hierarchy levels above the lowest level have to be calculated by taking an aggregate, just as they were with the star schema.
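The difference from the star schema shows up in the query: each hierarchy level is its own table, so rolling up to a higher level costs one extra join per level. A simplified SQLite sketch, with illustrative names and figures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# In a snowflake schema each hierarchy level is a separate dimension
# table, chained together by foreign keys.
conn.execute("""CREATE TABLE DimSalesRegion (
    SalesRegionID INTEGER PRIMARY KEY,
    RegionDescription TEXT)""")
conn.execute("""CREATE TABLE DimSalesTerritory (
    TerritoryID INTEGER PRIMARY KEY,
    TerritoryDescription TEXT,
    SalesRegionID INTEGER REFERENCES DimSalesRegion)""")
conn.execute("""CREATE TABLE SalesFact (
    TerritoryID INTEGER,  -- still points at the lowest level
    TotalSales NUMERIC)""")

conn.execute("INSERT INTO DimSalesRegion VALUES (1, 'Northeast')")
conn.executemany("INSERT INTO DimSalesTerritory VALUES (?, ?, ?)", [
    (1, "Boston", 1), (2, "New York", 1),
])
conn.executemany("INSERT INTO SalesFact VALUES (?, ?)",
                 [(1, 100), (2, 250)])

# Rolling up to the region level takes one join per hierarchy level.
rows = conn.execute("""
    SELECT r.RegionDescription, SUM(f.TotalSales)
    FROM SalesFact f
    JOIN DimSalesTerritory t USING (TerritoryID)
    JOIN DimSalesRegion r USING (SalesRegionID)
    GROUP BY r.RegionDescription
""").fetchall()
print(rows)  # [('Northeast', 350)]
```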




Delivering Business Intelligence with Microsoft SQL Server 2005
ISBN: 0072260904
Year: 2007
Pages: 112
Authors: Brian Larson