Data Warehouse System Components | MicrosoftВ® SQL Server(TM) 2005 Integration Services Step by Step

A common practice and architecture of data warehouse processes includes storing the transaction level data into staging tables. Perform the data staging process to transform, merge, standardize, cleanse, and prepare the data for loading into data warehouse models. Some corporations might create a single enterprise data warehouse as a central repository of all its business data. Subject area subsets of the enterprise data warehouse are often created as data marts to serve departmental and functional area reporting and analysis.

Fact and Dimension Tables

A relational data warehouse consists of two core components called facts and dimensions. Ideally, both of these are simple tables consisting of columns that store either keys or values. A fact table specifically describes a single business fact for reporting and analysis. It stores the actual values for reporting at the least granular level across all possible dimensions. This is very simple in concept. The fact table is related to every corresponding dimension through a foreign key relationship. Foreign keys in the fact table are related to primary keys in each dimension table.

The values stored in the fact table are called measures and are usually aggregated values from multiple detail rows in the operational data source, as shown in Figure 12-4.

image from book
Figure 12-4: Fact and dimension tables related through foreign keys

The grain of the fact_sales table is defined by the lowest level of detail stored in each dimension. The fact table contains the least granular level of values derived from source OLTP systems. Typical examples of this include a Sales Order Fact that might include the dollar value, unit count, and value for a given customer and product on a given date. The customer ID, product ID, and sales date time ID are used to join the fact table to the dimension tables. The customer ID, product ID, and time ID are surrogate keys. The dimension surrogate keys are the unique primary keys for the dimension rows (or members). The fact table keys are foreign keys that identify the corresponding dimension members for the fact.

Dimension Table Characteristics

Dimension tables represent the sets of label lists that can be used to reference the values contained in the fact table. Each row of the dimension table is called a member. Each member represents a unique label for the dimension entities. Each member can have many additional attributes, which are not required to be unique.

For example, a product dimension can include the list of unique products, in which each product is a member of the dimension. Additional attributes can include size, color, weight, and so on, which might not be unique, but provide additional analysis and reporting capabilities. The member attribute, however, must be unique.

Common dimensions include Customer, Product, Geography, and Time. One simple method to refer to the dimension tables is by using the Show Me By tables. This language helps clarify the relationship of a fact and a dimension by allowing the user to ask the data warehouse to show the values by State, by Product, or by Time. Dimensions contain lists of labels for a similar set of characteristics. These characteristics describe business measures. Their purpose is to make the measure more meaningful for the analyst.

The Show Me By tables can also be enabled by dimension hierarchies. Hierarchy levels define values relative to dimension attributes (for example, by size or by color) or via parent-child relationship hierarchies (worldwide total, regional subtotals, divisional subtotals, or entity subtotals). Hierarchies enable powerful ways to aggregate and analyze business data.

Star Schema Dimensions

One of the more popular data warehouse (and data mart) database models is the star schema. The star schema was introduced as a way to organize business data for fast query access. Its design is significantly different from the database models used by OLTP applications. Look at some of the characteristics of the star schema and their effect on SSIS package design. Figure 12-5 depicts a star schema.

image from book
Figure 12-5: Star schema consisting of a fact and multiple dimensions

It's called a star because of its conceptual shape. A star schema relational database model contains one fact table and at least one dimension table. SSIS is used to load data into both fact tables and dimension tables. When using SSIS to maintain star schemas, you need to review the relationships and dependencies between fact and dimension tables to best design the SSIS package control flow and data flow components.

The fact table contains the transaction detail values from source OLTP systems. However, it is common to roll up detail rows to the lowest level. Typical examples include a Sales Order Fact that might include the dollar value and the unit count value for a given customer and product on a given date. The customer ID, product ID, and sales date time ID are used to join the fact table to the dimension tables. The customer ID, product ID, and time ID are surrogate keys. The dimension surrogate keys are the unique primary keys for the dimension rows (or members). The fact table keys are foreign keys that identify the corresponding dimension members for the fact.

Every member of a dimension table is unique to that dimension and identified by the unique surrogate key. Surrogate keys that maintain the relationships between dimension and fact tables is a database model design characteristic, but it is the database design that affects how you design your SSIS packages to load and maintain these tables within the data warehouse. One of the ways used to help optimize the processes required by the SSIS packages for data warehouses is with the use of staging tables.

Snowflake Schema Dimensions

You looked at the star schema model, which involved one dimension table related directly to the fact table. Another version of the model is referred to as a snowflake. This is a hybrid star schema used when a single dimension's attributes are subject to high levels of change (for instance, product line regroupings, renaming categories, and so on). The relationships between these attributes can be maintained while new attribute labels for category or subcategory are supported. The snowflake is quite as common as the star schema, but it's good to know that this type of model is not a problem when the data dictates a little more complexity. Figure 12-6 shows one branch of a simple snowflake schema.

image from book
Figure 12-6: Simple snowflake schema

In this database, the related product category, subcategory, and product information are stored in three separate tables. Although this is a typical schema, the same information could be stored in a single table as with the year, quarter, month, and other hierarchy-level values stored in the time dimension table. There are various schools of thought on this and really no perfectly correct answer; the key is to retain simplicity and the ability to maintain the data according to the business need.