Lesson 3: Dimensional Schema Design

The dimensional schema determines how the user is able to access data. In order to look at the dimensional schema design steps in more detail, it is best to look at the process in the context of designing an OLAP data mart. The design steps are the same whether you are designing an enterprise data warehouse or a departmental data mart, but they are easier to understand if you first focus on building a simple data mart. Data warehouse design takes longer because you have to consider enterprise dimension and fact table issues.

After this lesson, you will be able to:

  • Employ methodologies for designing dimensional schemas

Estimated lesson time: 30 minutes

Dimensional Schema Design Steps

The dimensional schema design process has four major phases. Within each of these phases are other factors or processes to consider. The four major steps and their associated considerations are

  • Defining an OLAP data mart
  • The definition process is an essential part of the overall development process. This section reviews essential elements of the design process and looks at issues specific to developing a data mart.

  • Choosing facts
  • Understanding the ramifications of proper fact selection is critical to producing a data mart with the appropriate level of detail. By introducing common characteristics of fact tables, you will have a guideline for making proper fact table design decisions in the future.

  • Establishing dimensions
  • Defining dimensions properly requires an understanding of common dimension characteristics, dealing with multiple hierarchies within a single dimension, and addressing the issue of how dimension values change over time.

  • Designing aggregations
  • Aggregations allow fast processing of complex queries by precalculating totals. Deciding when to create the aggregations and where to store them is critical to their proper use.

Let s take a look at these steps in more detail.

Defining an OLAP Data Mart

As discussed in previous chapters, the starting point for any dimensional schema design is the definition of what the data mart will address. The definition of what the data mart will address should

  • Be established during the requirements-gathering process.
  • The Opportunity Evaluation phase defined the overall implementation strategy and scope of the initial iteration.

  • Be limited to a single discrete business process. For example, product movement, purchase orders, or customer payments should be sourced from a single system during the first iteration. This constraint simplifies extract requirements, limits risk, and promotes quick delivery and immediate realization of value.
  • Establish the duration requirement.
  • History is a fundamental aspect of a data warehouse. Duration refers to how much history is contained in the fact table. History is not necessarily the same for all schemas in the warehouse. Typically, duration should be a minimum of two years (this year versus last year), and insurance or regulatory requirements may mandate seven to ten years. Historical data for longer durations may be difficult (if not impossible) to extract from source systems during the initial load especially if data is being retrieved from multiple sources.

Choosing Facts

You must decide the level of detail to be modeled, or how "deep" the center of the star will go. Fact attributes must be chosen carefully because they are the basis for all the decisions you make. The granularity of the fact table will determine the configuration of the fact attributes.

How much summarization you select for the fact table has a profound effect on the warehouse. The lowest possible level of detail is to store facts at the atomic transaction level. This is the same level of detail as the source system. Storing facts at the atomic transaction level has a significant advantage of maintaining the ability to drill to the details at all times. It also has some significant disadvantages. It is often impractical or even impossible to warehouse that much data, and because most queries against a warehouse employ some level of summarization, the detailed data only serves to slow queries down. The remainder of the spectrum contains various levels of summarization; for example, instead of storing line item totals for a customer, you may elect to store order totals. This has the benefit of reducing the amount of data the warehouse must store, but it has the drawback of removing some detail from possible analysis.

The grain, or resolution, of the fact table will determine the depth to which any analysis against the schema can be made. The facts make up the cells in the cube the business uses to formulate and answer questions that drive the decision- making process. Use the following guidelines when determining the facts:

Simultaneous Measurements at a Specified Granularity and Point in Time

  • Fact attributes are unknown until a transaction, snapshot, or business event takes place.
  • All fact attributes in a given table instance correlate to the selected granularity of the table; a total order amount attribute would not be appropriate in a table that is grained at the order line-item level.
  • All fact attributes in a given table correlate to a specific point in time; a month-to-date cost amount attribute would not be appropriate in a table that contains a year value as the sole time dimension.

The Lower, the Better

  • Normally, select the lowest possible grain supported by the source system.
  • You can respond to unexpected user query requirements better with lower levels.
  • With a lower grain, your schema is more adaptable to the introduction of new data elements.

Four Common Styles

There are four common styles of facts:

  • Transaction facts
  • Transaction facts are based on a single instance of a business event. They represent the lowest level of granularity possible. Transaction facts usually contain only one measure: the transaction amount. Context is based on some form of transaction type key.

    Examples of transaction facts include an individual telephone call or an ATM withdrawal.

  • Snapshot facts
  • Snapshot facts capture the state of the business at a particular point in time, such as at the end of a shift, day, month, and so on. Snapshot facts contain summary or instantaneous value measures; they may not contain instances for all dimensions, for example, in cases in which not all products are sold every day.

    Examples of snapshot facts include a checking account daily balance and monthly product sales.

  • Line-item facts
  • Line-item facts are used to store detailed information about the elements of a business document. They represent individual line items related to business events and contain all measures about the line item (quantity, selling price, cost, and others).

    Examples of line-item facts include purchase order line, shipping ticket line, and individual insurance policy coverage.

  • Event or state facts
  • Event or state facts are a specialized type of fact that represents an occurrence of an event but not the event details. They represent an event instance or state change and are useful for providing coverage when an analysis includes both activity and non-activity for a specific set of dimensions. For example, "Products that were on promotion did not sell any units today." Event or state facts are factless because no measures apply; the fact that the event took place is the measure. However, a value of 1 is often used to make querying easier.

    Examples of event and state facts include an individual college class meeting and a sales promotion for a specific set of products.

TIP
Do not mix grain or periodicity in individual fact table attributes. Mixing granularity within a fact table can result in inconsistent answers from the information. For example, suppose that you have a fact table that contains both daily and monthly values and a query that intends to extract daily totals. Depending on which attributes are chosen, either a correct daily summary may be returned (if only daily attributes were used) or an inflated summary may be returned (if one or more of the monthly attributes were used). Table grain should comply with one, and only one, of the following: individual transaction, line item, time snapshot, or event/state.

Fact Table Characteristics

By introducing common characteristics of fact tables, you will have a guideline for making proper fact table design decisions in the future. Let s take a look at the characteristics that are critical to a solid fact table design.

Unique Compound Key Made Up of Foreign Keys to Dimensional Tables

A fact table will have an entry for each distinct measure. The key is composed of one or more of the primary keys from the dimension tables.

Additional Numeric Columns

The measures are represented by numeric columns that represent the defined grain of the facts stored in the entire table.

TIP
Employ facts that are both numeric and additive.

Additive facts allow information to be easily rolled up to a higher level of granularity. Effective aggregation relies on this principle. Semi-additive facts (such as ending statement balance) must be averaged over time to be useful, which is an additional step that adds complexity to the query.

In the dimensional structure, all facts must be metrics. All metrics are numeric, but in some cases, a numeric value may not be a metric. This is also true for additive values. Averages are metrics, but they are most definitely not additive over a time period.

Not Every Combination of Foreign Keys Need Exist

Many fact table entries will not have foreign key values for all of the associated dimensions. For example, some products may have no sales activity for some periods that is, a sporting goods company might not sell snowshoes in August, but the analytical system will still need to track both date and product information. That information is represented by foreign keys in the fact tables.

A Very Large Number of Rows Hundreds of Millions and Up

The level of granularity that you define for your fact table will determine not only the level of detail at which you can query the table but also how many calculations it takes to provide an answer to the query. The more detail that you allow in a query, the larger your fact table becomes.

Exercise: Designing a Data Mart

The following scenario and exercise will engage you in the process of designing a data mart for Northwind Traders. It is a thought- and paper-based exercise in which you will design a database for data warehousing. You will implement the design in later chapters.

Other resources that you may want to use include

  • The Northwind database schema
  • Microsoft SQL Server Books Online

Scenario

The Northwind database supports an order entry system that contains the following tables:

  • Orders
  • Order Details
  • Products
  • Categories
  • Suppliers
  • Shippers
  • Employees (sales people)
  • Customers

The management at Northwind Traders wants to improve its ability to analyze data for uncovering trends and making smarter business decisions.

Business Objectives

Northwind Traders would like to track how its products sell in various geographical regions, evaluate the markets with the highest sales, and identify the markets in which sales are lowest.

Business Process

Details of business processes were gathered from conducting many interviews with various employees throughout the company. This information enabled company officials to identify the requirements for the data mart.

Data Mart Requirements

The data mart that you build should enable the sales staff and managers to answer questions regarding their business.

  • Information about products sold For example, what products were sold to which customer, by which sales representative, for the month of July?
  • Information about customer purchases For example, which customers bought what products during the second quarter of 1998?
  • Information about when products are purchased For example, what day of the week do customers purchase the most products in a given quarter?
  • Information about employee sales For example, which employee sold the most products for each month of the year?
  • Information about product shipments For example, which shipper has the highest percentage of on-time shipping to the customer?

Additionally, management would like to retrieve information on required shipment dates. This information will be used to calculate how many times shipments have been past due or on time.

Measurements

Sales metrics that must be evaluated include

  • Quantity
  • Discount
  • Line-item total (product total cost)
  • Line-item freight (average shipping cost per units sold)

Some of the measurements are precalculations. The precalculations are generated during the data transformation while populating the data mart. The following table lists the calculation method used for deriving specific measurements:

MeasurementCalculation Method
Line-item total(product unit price * quantity)
Line-item freight(((shipping cost/sum(line item quantity)) * line item quantity)

  • Identifying grain, dimensions, facts, and hierarchies
  • In this procedure, you will refer to the scenario and the Northwind database schema. You will identify the level of granularity, which table in the operational system is a possible source for a fact table and dimension tables, and the levels of hierarchies to define for the data mart.

    1. What would be the appropriate grain for this data mart? Why?
    2. Which tables would be appropriate dimension tables? Why?

    NOTE
    The Products table and the time dimension will most likely be shared dimension tables. Shared dimensions are universal and have the same meaning to everyone. Customer is another possible candidate for a shared dimension.

    1. Which table in the Northwind database would best represent the fact table? Why?
    2. Which measures are necessary facts to include in the fact table?
    3. Which foreign keys would you include in the fact table?

    Lesson Summary

    Some of the most important decisions you can make about your dimensional schema design are

    • What is the grain of the fact table?

    The capacity of your system (both storage and processing capacity) and the needs of the users will determine granularity.

    • What type of facts will be stored in the fact table?

    In most data warehouse implementations, the facts will be transaction facts.

    • What is the duration of the facts in the fact table?

    The capacity of your system and the needs of the users will determine duration.



    Microsoft Corporation - Microsoft SQL Server 7 Data Warehousing Technical Support Training Kit
    Microsoft SQL Server 7 Data Warehousing Technical Support Training Kit
    ISBN: 0735606706
    EAN: 2147483647
    Year: 1999
    Pages: 114

    flylib.com © 2008-2017.
    If you may any questions please contact us: flylib@qtcs.net