Data Granularity | MicrosoftВ® SQL Server(TM) 2005 Integration Services Step by Step

An important consideration in the design of a BI solution is the amount of detail to store. You first need to understand how much data is required to support business decisions. What level of detail is required to analyze trends? You also need to understand what level of data detail is available from your source systems.

When planning for the design of a data warehouse system, understanding the impact of data granularity is one of the most important factors to consider. Operational databases store data at the transaction level, which is one reason they are typically too slow and complex for reporting. When analyzing business data, users will rarely (if ever) need access to the same level of detail, as is recorded in the data source. Trends and comparisons are nearly always performed at some level of aggregation, and different users will need to see data aggregated at different levels. An understanding of this requirement involves two important concepts that this section will cover briefly and only at a high level.

The first concept is the dimension. Nearly all data has dimensionality, meaning it can be grouped along different dimensions and at different levels within a single dimension. For example, sales records have an order date dimension that describes when customers placed orders. These records can be grouped at different levels and aggregated by (for example) year, quarter, month, week, or day. By aggregating transactions at the month level, you could report on the transaction count, sales amount sum, or average amount for the month. Other dimensions might include the customer's geography, product category, or customer type. Many of these dimensions can be organized into hierarchies such as the year, quarter, and month for dates; product category and subcategory for products; and country, state, and city for the geography dimension.

The second related concept is the level of grain, or granularity. Because you typically don't want to store every transaction detail in a data warehouse fact table, each record will represent an aggregation of transactions. For each fact record, one field will store the count or number of orders placed. Another field will store the sum or total purchase amount for orders. The precise level of detail represented by a single record will depend on the detail (or leaf level) of each dimension. Given the dimensions of order date, geography, product, and customer type, a fact table record is the aggregation of values across the lowest level in each of these dimensions. For example, a fact record might represent aggregate values for orders at the day, city, product, and customer type levels.

So how much or how little detail is necessary? The answer to this question is "It depends." If the volume of fact table data negatively affects performance, it would stand to reason that records should be pre-aggregated at a reasonably high level to meet the users' stated reporting requirements without compromising performance. This might solve the immediate problem, but what if the granularity requirements were to change? Say that each fact record contains the sum of the sales amounts and order counts for a month. The data warehouse tables are structured and populated with data to meet the current need but, at some point in the future, users decide they need to drill down to daily sales totals. It would be very difficult, if not impossible, to load a table designed to store monthly sales with daily sales records.

Now, take another approach. Recognizing that the current requirement calls for reporting at the monthly sales level, design the relational data warehouse to store daily sales records, and then design an OLAP cube to aggregate sales at the month level. When the cube is processed, monthly totals are calculated from the data warehouse and stored in the cube. The OLAP engine efficiently stores data with pre-aggregated totals, so no more detail is required. Later, when the requirement changes, you simply add weekly and daily dimension levels to the cube and then reprocess it to load new data. It's true that the size of the cube data has increased, but the impact is far less than it would be if using relational storage. Because the cube was already designed to manage monthly sales data, queries and reports run just as fast as they did previously, but users can also drill down to weekly and daily sales totals much more efficiently than if you had used the relational structure. OLAP cubes allow you to extend this simple scenario across many different dimensions to slice and group data across geography, sales regions, product categories, and many other data attributes, with little impact on performance and query complexity.

Supporting Business Decisions

To design a solution that continues to meet business needs, you have to ask yourself "What do users need now?" and "What will they need in the future?" Regarding future requirements, the next consideration is the tradeoff cost to meet future needs. It's hard to plan for every possible future requirement, but it's quite possible to design a flexible solution to meet most practical needs.

How far and to what level of detail will users need to drill down into the data? For each data dimension, what navigation paths will users naturally follow to discover more detail?

Calculations can be performed on aggregated values to support reporting requirements that make sense only at certain dimensional levels. Different measures might have different granularity. For example, budgeting and forecasting are often performed only at an annual or quarterly level. Actual monthly figures can't be accurately compared with quarterly projections.

Another limitation is that the desired granularity must be supported by source data. In cases in which it is not, values may either be arbitrarily (and inaccurately) derived or simply not provided.