An aggregate is a table or structure containing precalculated data. Aggregations support rapid and efficient querying of a multidimensional database. Designing aggregations is one of the most important processes that can be directly attributed to the performance of the data warehouse and data mart.
After this lesson, you will be able to:
- Describe the design characteristics of aggregate schemas
- Evaluate whether you will choose to build aggregate schemas of your own
Estimated lesson time: 30 minutes
The raw performance of a data warehouse or data mart can be directly attributed to the design and storage strategies implemented in data aggregation. To better understand the intricate details of aggregate design, you must consider the following issues: what is aggregation, aggregate characteristics, and aggregate design issues.
An aggregation is a table or structure containing precalculated data. Examples of common aggregations are a sum of numbers or a count of items. Aggregations support rapid and efficient querying of a multidimensional database. The following table describes when and where aggregations are created in different querying scenarios:
Query Scenario | When Created | Where Stored |
---|---|---|
SQL query | Dynamically | Not stored |
Aggregate schema | During warehouse loading | In the warehouse |
Cube | Dynamically | Not stored |
Cube with aggregations | During cube building | In the cube |
Aggregations improve query response time by having the answers ready before the questions are asked, therefore bypassing the performance overhead of dynamically calculating the results during data retrieval. Aggregates must conform to explicit rules derived from the hierarchical structure of all dimensions and the granularity of the base fact table. An aggregate applies to all data in the fact table rather than to select subsets of the data.
The definitions of an aggregate usually contain the phrase "precalculated, summarized information of underlying relational data stored in a table." This definition more accurately defines a summary or rollup.
A summary is good for a small set of queries because the set contains constraints that will be more selective than most other queries.
Dimensional aggregates do not involve as many problems as a summary because they must conform to the predefined dimensional hierarchies.
Constructing dimensional aggregates can provide the following advantages:
Two major decisions in building aggregations are when to process and where to store the data. Aggregations can be created dynamically or during various stages of the data loading. The aggregations can either be stored for repeated use or dynamically created upon each request.
Precalculation of all possible aggregations results in the fastest possible response time for all queries. However, the storage and processing time required for the aggregations can be substantial. Storage requirements depend not only on the number of dimensions and measures but also on the number of levels and hierarchies in the dimensions.
There is a tradeoff between storage requirements and the percentage of possible aggregations that are precalculated. If no aggregations are precalculated (0 percent), little storage space will be required beyond that necessary to store the base data. In this case, however, query response time will vary and may be quite slow, because all answers will have to be calculated from the base data for each query. Returning the single number that answers the query "What were the sales of product X in 1998 for the Northwest region?" might require reading thousands of rows of data, extracting the sale value from each, and calculating the sum.
The decisions that you make will determine whether the data will be pulled from the base table or from the aggregates. The analytical system will ultimately retrieve the results one of these ways:
Microsoft SQL Server OLAP Services allows you to defer aggregation design decisions in the physical fact and dimension tables to the actual point of building the multidimensional view of the data, called a cube.
OLAP Services incorporates a sophisticated algorithm to select aggregations for precalculation so that other aggregations can be quickly computed from precalculated values.
For example, if the aggregations are precalculated for the month level of a time dimension, the calculation for a quarter requires only the summarization of three numbers, which can be quickly computed dynamically. This technique saves storage, with little effect on query response time.
Aggregate schemas are the most effective method for improving data mart performance without impacting usability or flexibility.
Schemas that are created for aggregation have three additional characteristics:
The business drivers identified for building the data warehouse or data mart will also drive the design of the summary schema. The summary schemas share the following characteristics:
An aggregate navigation utility evaluates the Transact-SQL and, based on the DBMS system catalog, determines what, if any, aggregate schema should be used. The benefits to users and application designers are:
Effective aggregations can ultimately provide performance gains in querying and reporting. They:
Aggregates are summarized versions of a base schema. The summarization is accomplished by increasing the grain of the fact table along one or more dimensions.
A dimensional aggregate must conform to the dimensional data hierarchies and the fact granularity defined in the schema. A dimensional aggregate can only numerically summarize fact rows to one or more levels of the dimensional hierarchies.
To build a dimensional aggregate, there must be a dimensional schema in which the fact columns being aggregated are additive across the dimensions.
When fact tables contain transactions, such as order sales, standard aggregation works correctly. Summing the amounts and quantities across all of the dimensions will yield correctly interpreted results.
However, in other types of data warehouse applications, the standard aggregation across all dimensions will not yield correctly interpreted results. For example, an inventory application may store a snapshot of the inventory each month, instead of storing the inventory transactions. Summing the stock-level snapshot values from January and February will create a meaningless number.
The aggregate schema design is tightly derived from the base schema and maintains a dependant relationship with the base schema.
The dimension table must be customized to reflect the following characteristics:
Each of these tables represents a subset of the base fact table that is consistent with the combination of aggregated dimensions. These tables are initially generated from the base fact table during a separate processing phase and are regularly updated from the base fact table.
Figure 4.9 illustrates an example of an aggregate schema. Compare this example to the base schema shown in Figure 4.4.
Figure 4.9 Aggregation example
This example takes the base schema described earlier and applies the following query requirements:
The resultant schema solution contains the following specifications:
IMPORTANT
Use OLAP Services to recommend aggregates before designing aggregate schemas. The OLAP Services engine has been designed to create aggregations intelligently based on common data warehouse access patterns. Before spending time creating aggregate schemas, determine whether the built-in Cube wizard does this work for you.
A prerequisite to providing fast, flexible query response in the OLAP environment is to have the answers to user queries ready before the users have even asked for them. This is accomplished by calculating and storing aggregations in one of the query structures, in either the database or the cubes, that users will be querying. In some cases, it may make sense to create an aggregate schema in which the aggregations are stored in the data warehouse database. In most cases, it makes more sense to let the tools select the aggregations. This design decision will depend on the types of queries in your OLAP environment.