Lesson 5: Aggregations | Microsoft SQL Server 7 Data Warehousing Technical Support Training Kit

An aggregate is a table or structure containing precalculated data. Aggregations support rapid and efficient querying of a multidimensional database. Designing aggregations is one of the most important processes that can be directly attributed to the performance of the data warehouse and data mart.

After this lesson, you will be able to:

Describe the design characteristics of aggregate schemas

Evaluate whether you will choose to build aggregate schemas of your own

Estimated lesson time: 30 minutes

Designing Aggregations

The raw performance of a data warehouse or data mart can be directly attributed to the design and storage strategies implemented in data aggregation. To better understand the intricate details of aggregate design, you must consider the following issues: what is aggregation, aggregate characteristics, and aggregate design issues.

What Is Aggregation?

An aggregation is a table or structure containing precalculated data. Examples of common aggregations are a sum of numbers or a count of items. Aggregations support rapid and efficient querying of a multidimensional database. The following table describes when and where aggregations are created in different querying scenarios:

Query Scenario	When Created	Where Stored
SQL query	Dynamically	Not stored
Aggregate schema	During warehouse loading	In the warehouse
Cube	Dynamically	Not stored
Cube with aggregations	During cube building	In the cube

Aggregations improve query response time by having the answers ready before the questions are asked, therefore bypassing the performance overhead of dynamically calculating the results during data retrieval. Aggregates must conform to explicit rules derived from the hierarchical structure of all dimensions and the granularity of the base fact table. An aggregate applies to all data in the fact table rather than to select subsets of the data.

The definitions of an aggregate usually contain the phrase "precalculated, summarized information of underlying relational data stored in a table." This definition more accurately defines a summary or rollup.

A summary is good for a small set of queries because the set contains constraints that will be more selective than most other queries.

Dimensional aggregates do not involve as many problems as a summary because they must conform to the predefined dimensional hierarchies.

Constructing dimensional aggregates can provide the following advantages:

Lower-level aggregates may be used to support queries at higher levels of detail, rather than forcing use of the fact table.

They provide the ability to drill down to lower levels of detail on the same logical schema model with relatively uniform query response times.

When Do You Create Aggregations, and Where Are They Stored?

Two major decisions in building aggregations are when to process and where to store the data. Aggregations can be created dynamically or during various stages of the data loading. The aggregations can either be stored for repeated use or dynamically created upon each request.

Precalculation of all possible aggregations results in the fastest possible response time for all queries. However, the storage and processing time required for the aggregations can be substantial. Storage requirements depend not only on the number of dimensions and measures but also on the number of levels and hierarchies in the dimensions.

There is a tradeoff between storage requirements and the percentage of possible aggregations that are precalculated. If no aggregations are precalculated (0 percent), little storage space will be required beyond that necessary to store the base data. In this case, however, query response time will vary and may be quite slow, because all answers will have to be calculated from the base data for each query. Returning the single number that answers the query "What were the sales of product X in 1998 for the Northwest region?" might require reading thousands of rows of data, extracting the sale value from each, and calculating the sum.

The decisions that you make will determine whether the data will be pulled from the base table or from the aggregates. The analytical system will ultimately retrieve the results one of these ways:

Part of the Transact-SQL statement in the query

Part of an aggregate schema

Calculated dynamically from data within a cube

Predefined and stored within a cube

OLAP Services Aggregation Features

Microsoft SQL Server OLAP Services allows you to defer aggregation design decisions in the physical fact and dimension tables to the actual point of building the multidimensional view of the data, called a cube.

OLAP Services incorporates a sophisticated algorithm to select aggregations for precalculation so that other aggregations can be quickly computed from precalculated values.

For example, if the aggregations are precalculated for the month level of a time dimension, the calculation for a quarter requires only the summarization of three numbers, which can be quickly computed dynamically. This technique saves storage, with little effect on query response time.

Aggregate Characteristics

Aggregate schemas are the most effective method for improving data mart performance without impacting usability or flexibility.

Schemas that are created for aggregation have three additional characteristics:

Summary schemas are used to enhance query performance.

Transparent to business users and application designers.

Can speed queries by orders of magnitude.

Summary Schemas Are Used to Enhance Query Performance

The business drivers identified for building the data warehouse or data mart will also drive the design of the summary schema. The summary schemas share the following characteristics:

They are based on common business requests and statistical distribution of data.

They must be revised as user requirements and usage behaviors change.

They represent common groupings within the dimensions of a schema (such as constraints, hierarchies, time periods, and others).

Transparent to Business Users and Application Designers

An aggregate navigation utility evaluates the Transact-SQL and, based on the DBMS system catalog, determines what, if any, aggregate schema should be used. The benefits to users and application designers are:

All user and application queries reference the base-level schema.

Every query goes against data of a uniform granularity.

Queries are less likely to double count than relational database queries. For example, in a relational query, product detail could be joined with category summary rows and result in products being counted twice: once as a detail entry and once as part of an aggregate.

Can Speed Queries by Orders of Magnitude

Effective aggregations can ultimately provide performance gains in querying and reporting. They:

Yield substantial savings in hardware upgrades. Storing aggregates requires more disk space but this is cheaper and easier to acquire than increasing raw computing power, which may require new computers.

Are essential in midsize to large data warehouses.

Aggregate Design Issues

Aggregates are summarized versions of a base schema. The summarization is accomplished by increasing the grain of the fact table along one or more dimensions.

A dimensional aggregate must conform to the dimensional data hierarchies and the fact granularity defined in the schema. A dimensional aggregate can only numerically summarize fact rows to one or more levels of the dimensional hierarchies.

To build a dimensional aggregate, there must be a dimensional schema in which the fact columns being aggregated are additive across the dimensions.

Additive and Semiadditive Measures

When fact tables contain transactions, such as order sales, standard aggregation works correctly. Summing the amounts and quantities across all of the dimensions will yield correctly interpreted results.

However, in other types of data warehouse applications, the standard aggregation across all dimensions will not yield correctly interpreted results. For example, an inventory application may store a snapshot of the inventory each month, instead of storing the inventory transactions. Summing the stock-level snapshot values from January and February will create a meaningless number.

Aggregate Schemas Are Derived from a Base Schema

The aggregate schema design is tightly derived from the base schema and maintains a dependant relationship with the base schema.

Each schema addresses a specific constraint, roll-up, or combination of the two.

The resultant set of schemas forms a family around the base schema.

Dimension Tables Must Be Tailored for Aggregate Use

The dimension table must be customized to reflect the following characteristics:

Represents a subset of a base dimension table.

Avoids filling the base dimension with NULLs for all attributes that are not applicable at the aggregated levels.

An artificial key must be used to identify the derived dimension.

Each Level of Aggregation Is Stored in a Distinct Fact Table

Each of these tables represents a subset of the base fact table that is consistent with the combination of aggregated dimensions. These tables are initially generated from the base fact table during a separate processing phase and are regularly updated from the base fact table.

Aggregation Example

Figure 4.9 illustrates an example of an aggregate schema. Compare this example to the base schema shown in Figure 4.4.

click to view at full size

Figure 4.9 Aggregation example

This example takes the base schema described earlier and applies the following query requirements:

This month s store sales (U.S. dollars and units) by product subcategory, totaling on category, brand, and store

This-year-to-last-year comparison of units sold by month totaling on brand and store size

The resultant schema solution contains the following specifications:

Category dimension created from the product category hierarchy attributes of the product dimension

Store dimension unchanged

Month dimension created from the month hierarchy attributes of the time dimension

Organization dimension eliminated entirely

Units sold quantity and U.S. dollar sale amount rolled up from primary sales fact table

IMPORTANT
Use OLAP Services to recommend aggregates before designing aggregate schemas. The OLAP Services engine has been designed to create aggregations intelligently based on common data warehouse access patterns. Before spending time creating aggregate schemas, determine whether the built-in Cube wizard does this work for you.

Lesson Summary

A prerequisite to providing fast, flexible query response in the OLAP environment is to have the answers to user queries ready before the users have even asked for them. This is accomplished by calculating and storing aggregations in one of the query structures, in either the database or the cubes, that users will be querying. In some cases, it may make sense to create an aggregate schema in which the aggregations are stored in the data warehouse database. In most cases, it makes more sense to let the tools select the aggregations. This design decision will depend on the types of queries in your OLAP environment.