Query Performance Improvement


The power of Analysis Services is its ability to provide fast query response times for decision makers who need to analyze data, draw conclusions, and make appropriate changes in business. The OLAP Report (http://www.olapreport.com) defines OLAP as Fast Analysis of Shared Multidimensional Information (FASMI). The word Fast means that the system is able to deliver results to users within about five seconds, with only a few of the most complex queries taking more than twenty seconds. We expect most business users to use client tools that graphically represent the data from Analysis Services for easy interpretation and understanding. As an end user you expect to see the data quickly (the OLAP Report cites an independent study in the Netherlands showing that users typically wait only about 30 seconds) so that you can analyze it and make decisions. Some of the common operations the client tools offer are drilling down, drilling up, and comparing data year over year. Users do not have the time to wait hours for a response; the queries sent to Analysis Services need to return data within seconds, or at most minutes. Query performance is pivotal to a successful business intelligence deployment, and a system that performs well will bring great business value to your company.

Even though Analysis Services supports the MOLAP, ROLAP, and HOLAP storage modes, you obtain the best performance when your UDM storage mode is MOLAP. When you choose MOLAP storage, Analysis Services 2005 stores the dimension data and fact data in its own efficient, compact, multidimensional structure format. Fact data is compressed, and its size is approximately ten to thirty percent of its size in the relational database. In addition to its own efficient and compact data store, Analysis Services builds specialized dimension attribute indexes for efficient data retrieval. The data is stored specifically in a multidimensional structure to best serve MDX query needs. If you use the ROLAP or HOLAP storage modes, queries to Analysis Services might have to fetch data from the relational data source at query time. Retrieving data from the relational data source will significantly slow your query performance, because you incur the relational query processing time, the time needed to fetch the data over the network, and finally the time to aggregate the data within Analysis Services.

Analysis Services 2005 tries to achieve the best of the OLAP and relational worlds. OLAP queries typically request aggregated data. For example, if you have sales information for products each day, a typical OLAP query might request the aggregated sales for a month, quarter, or year. In such circumstances, every day's sales data needs to be aggregated for the entire year. If users are requesting aggregated data on a single dimension, you can do a simple sum in the relational database. However, OLAP queries are typically multidimensional queries, which need aggregated data across multiple dimensions with complex business logic calculations applied to each dimension. In order to improve query performance, Analysis Services allows you to specify the multidimensional intersections for which data needs to be pre-aggregated so that queries requesting such data will be served instantaneously. Assume you have a database with a dimension called Products having a multi-level hierarchy Product Category → Product Sub Category → Product Name, a Time dimension having the hierarchy Year → Quarter → Month → Date, and a Geography dimension having the hierarchy Country → State → County → City. Fact data is typically at the lowest levels: sales of a specific product on a specific date at a specific city. For analysis you would request aggregated data, such as sales in a city for a month for various product categories. You can have pre-aggregated data for the cross product of various levels in each hierarchy, such as Product Category, Quarter, and State. In order to create the pre-aggregated data you need to specify the dimension levels for which Analysis Services needs to pre-calculate the data. The pre-calculated data, along with the definitions that describe it, are referred to as aggregations.

Understanding Aggregations

Analysis Services 2005 allows you to build pre-calculated subtotals for each partition of your cubes, and store them in either an OLAP data store or relational database based on the storage modes chosen for the partitions. Because most OLAP queries typically request aggregated data at various levels of dimension hierarchies, storing pre-aggregated data will help in getting the results to end users quickly. When you query for data at higher levels of hierarchies (other than the granularity level which is the level at which fact data is available), the server can directly fetch those aggregate numbers instead of bringing all related detailed fact data and aggregating them within the engine. In this section you learn to create aggregations for cubes using the Adventure Works sample database.

Using Adventure Works dimensions as an example, assume you have a measure group using the Date, Product, and Customer dimensions, and each dimension has hierarchies and levels as shown in the table below. Assume the granularity attributes for the Date, Product, and Customer dimensions are Date, Product Name, and Full Name, respectively. If you query the data for [Date].[2004].[Q3] and there is no aggregated data stored in your OLAP database, the server needs to retrieve the lowest-level fact data for all dates in quarter Q3 of year 2004, for all products and all customers. This can result in a large scan of the OLAP fact data, after which the server aggregates the data and returns the results to you.

Date Dimension          Products Dimension        Customer Dimension
Year           4        Category           4      Country              6
Semester       8        SubCategory       37      State-Province      71
Quarter       16        Product Name     395      City               587
Month         48                                  Postal Code        646
Date        1461                                  Full Name        18484

Analysis Services 2005 provides wizards that help define the combinations of dimension hierarchies for which aggregated data needs to be created, either by analyzing the statistics of members at each level and/or based on the queries requested by users. Once the aggregations are defined, you need to process the partitions so that Analysis Services creates the pre-aggregated data and stores it in the OLAP data store. In addition to the pre-aggregated data, Analysis Services also creates indexes to access the aggregated data, which speeds up data retrieval. Returning to the preceding example, if the server has aggregations for the levels Quarter, Subcategory, and State-Province of the user hierarchies in the Date, Product, and Customer dimensions, respectively, then a query for quarter Q3 of year 2004 can be fulfilled right away from the aggregation. In this case a fact scan is not needed and you receive the results instantaneously. Furthermore, queries requesting data for the levels above Quarter, Subcategory, and State-Province will also benefit from the aggregation. For example, if you query for Year, Subcategory, and State-Province, the server only needs to get the data for each Quarter from the existing aggregations and aggregate a much smaller dataset than it would in a huge fact table scan. Thus aggregations help improve the performance of your MDX queries.
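The benefit described above can be illustrated with a minimal Python sketch (not Analysis Services internals; the sales figures are invented): a Year-level query is answered by rolling up four pre-computed Quarter subtotals instead of re-scanning 360 daily fact rows.

```python
# Daily fact rows for 2004 (year, quarter, day -> sales); values are made up.
facts = {("2004", f"Q{q}", day): 10.0 for q in range(1, 5) for day in range(1, 91)}

# Processing time: build the Quarter-level aggregation once.
quarter_agg = {}
for (year, quarter, _day), sales in facts.items():
    quarter_agg[(year, quarter)] = quarter_agg.get((year, quarter), 0.0) + sales

# Query time: the Year total comes from 4 aggregation cells, not 360 fact rows.
year_total = sum(v for (year, _q), v in quarter_agg.items() if year == "2004")

# Same answer as a full fact scan, with far fewer values read.
assert year_total == sum(facts.values())
print(year_total)
```

The same principle extends to every level above the aggregation: any query at or above Quarter can reuse the four subtotals.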

Storing aggregation values in the Analysis Services database is a typical tradeoff between database size and performance. Aggregation values take disk space, but they benefit the performance of queries sent to the server. In the example discussed in the preceding paragraph, if you count the permutations of all the levels in the three dimensions you will see there are 75 combinations (5*3*5 = 75) for which to build aggregations. You might immediately have the following questions:

  • Do I need to build aggregations for all the combinations? Will that be useful?

  • Can I build a subset of the aggregations?

  • What parameters will affect the choice of Analysis Services for aggregation design?

  • How much disk space will these aggregations take?

  • What percentage of optimization will I get based on the aggregations designed?

The estimated member count of all the hierarchies in a dimension is an important metric in calculating the storage cost of aggregations. For example, the aggregation (Month, Subcategory, Postal Code) could potentially result in 12 million (48*395*646 = 12,248,160) cells to compute and store. However, not every cell will have data, just as the fact data doesn't contain every combination of dimension keys. Because Analysis Services only stores data for coordinates that have fact values, the estimate needs to be adjusted by the partition fact count. Analysis Services assumes the data has a uniform distribution, and uses an algorithm to calculate the estimated number of cells for a specific aggregation. Analysis Services by default selects attributes from cube dimensions to be considered for aggregation, then estimates the size of, and the benefit provided by, various combinations of attributes. Based on the aggregation disk size or percentage of optimization chosen by the user, Analysis Services stops iterating over candidate aggregation designs as soon as one of the criteria is met.
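A sketch of this storage-cost estimate in Python: the raw cell count of an aggregation is the product of the member counts of its levels, but since Analysis Services stores only coordinates that actually occur in the facts, the estimate cannot exceed the partition fact count. The server's real uniform-distribution algorithm is more refined; min() is a simplification used here for illustration.

```python
def estimated_aggregation_cells(level_counts, partition_fact_count):
    """Upper-bound estimate: product of level member counts, capped by fact count."""
    cells = 1
    for count in level_counts:
        cells *= count
    return min(cells, partition_fact_count)

# (Month, Subcategory, Postal Code) with the member counts from the table above:
raw = 48 * 395 * 646
print(raw)  # 12248160 possible cells

# Capped by the partition fact count (1,013 rows for Internet_Sales_2001):
print(estimated_aggregation_cells([48, 395, 646], partition_fact_count=1013))  # 1013

# A high-level aggregation such as (Year, Category, Country) stays small:
print(estimated_aggregation_cells([4, 4, 6], partition_fact_count=1013))  # 96
```

This is why accurate member and fact counts matter: the cost model is driven entirely by these numbers.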

For query benefit, the lower the level at which an aggregation is designed, the more useful it is, because all higher-level queries can benefit from it. However, the lower the aggregation level, the bigger the aggregation data store and the longer the aggregation build time will be. It does not make sense to have an aggregation whose size is very close to the fact size; it won't save any disk scan time because the server will read almost the same amount of data from disk. Therefore, building too many aggregations will not benefit query performance and sometimes might actually hurt it. Analysis Services nevertheless allows you to create aggregations up to 100%. As previously mentioned, we recommend aggregations whose estimated size is between 10% and 30% of the fact table size.

Creating Aggregations

Analysis Services allows you to specify the disk space to be used for creating aggregations, or a target percentage of performance gain. Analysis Services uses a complex algorithm, based on the counts of fact table rows and dimension members, to estimate the aggregations that will provide you the maximum benefit. The following steps show how to design aggregations for the Adventure Works sample Internet Sales partitions:

  1. Open the Adventure Works DW sample project in BIDS.

  2. Open the Adventure Works cube and switch to the Partitions tab.

  3. Right-click the partition Internet_Sales_2001 under the Internet Sales measure group and select Design Aggregations as shown in Figure 13-19.

    image from book
    Figure 13-19

  4. You will now be in the Aggregation design wizard. Click Next in the welcome screen.

  5. You will see the Storage and Caching Options page of the wizard as shown in Figure 13-20. Select the option MOLAP partition to store fact data and aggregation data in Analysis Services multidimensional format. Click the Next button.

    image from book
    Figure 13-20

  6. The Object Counts page allows you to retrieve the count of dimension members relevant to the current partition and the count of fact table rows, as shown in Figure 13-21. Click the Count button; Analysis Services retrieves the counts by sending queries to the relational database. At the top of the grid is the fact table count: Estimated Count is the total number of fact table rows for all partitions in the current measure group, and Partition Count is the number of fact table rows for the current partition. If the current partition will have different values in your production environment, you can enter the new counts in the Partition Count column. The counts specified on the Object Counts page are used to calculate appropriate weights during aggregation design, so make sure you provide the right counts; otherwise you might end up with a suboptimal aggregation design.

    image from book
    Figure 13-21

  7. If you expand the Customer dimension, you will find that all the attributes being considered for aggregation design are highlighted in bold font along with the estimated count of members, as shown in Figure 13-22. In the Partition Count column, because the fact table only includes the granularity attribute (the customer key), only that count is available for Analysis Services to use in aggregation design. You have the option to include or exclude attributes for consideration in aggregation design. By default, Analysis Services includes the key attributes, and all the attributes that are levels of natural hierarchies are considered for aggregation design. If a user hierarchy is unnatural, the topmost level is considered for aggregation design, and the levels below it are included as long as there is a one-to-many relationship from each level to the next. As soon as there is no such relationship between successive levels, the remaining levels below that point of the unnatural hierarchy are not considered during aggregation design. Click the Next button.

    image from book
    Figure 13-22

  8. You are now in the Aggregation Options page, as shown in Figure 13-23. The engine gives you four possible options for how to design the aggregations:

    • Setting the storage limit for estimated aggregation data. The system will look for the aggregation combinations whose total estimated size is within the user specified limit.

    • Performance gain setting. You can choose a percentage of performance gain. The system will use its cost and performance gain algorithms to search for the best combinations that fulfill the performance gain metric. The lower the percentage, the fewer aggregations will be built and the fewer queries will be answered through pre-calculation. We recommend you begin with a 30% performance gain for most cubes.

    • User click stop option. The system will start to analyze and design aggregations. The server stops searching for more aggregations when the user clicks Stop or the performance gain reaches 100%.

    • Do not design aggregations (0%). The server will not design any aggregations for the current partition.

    image from book
    Figure 13-23

  9. Choose the second option, Performance Gain Reaches, and set the number to 30%. Click the Start button to design the aggregation.

  10. The server starts to analyze the aggregation design, and sends feedback on the aggregations being generated. The feedback is shown graphically in the Aggregation Options page as shown in Figure 13-24. The X-axis of the graph contains the amount of storage used and the Y-axis of the graph shows the percentage of performance gain. The status bar shows the number of aggregations that have been created during that time. Once the performance gain reaches the percentage you have specified (30%), the server stops designing further aggregations. You can see 32 aggregations have been created for 30% performance gain and the estimated storage is 38.6KB. Click the Next button.

  11. On the final page of the aggregation wizard you have two choices: deploy and process the partition now, or save the aggregation design and process later. Choose Save the Aggregations but Do Not Process Them and click the Finish button.

image from book
Figure 13-24

You have now successfully created aggregations for the partition Internet_Sales_2001 using the Aggregation Design Wizard. To find out what aggregations have been created for this partition, you can either deploy the project and script the partition from SSMS, or open the file "Adventure works.Partitions" in your current project's directory in an XML editor. If you look at the partition Internet_Sales_2001 you will find the following definition, which indicates that the partition has an aggregation design defined and that the aggregation design used has the ID AggregationDesign.

     <Partition dwd:design-time-name="">
       <ID>Internet_Sales_2001</ID>
       <Name>Internet_Sales_2001</Name>
       <CreatedTimestamp>0001-01-01T08:00:00Z</CreatedTimestamp>
       <LastSchemaUpdate>0001-01-01T08:00:00Z</LastSchemaUpdate>
       ........
       <EstimatedRows>1013</EstimatedRows>
       <AggregationDesignID>AggregationDesign</AggregationDesignID>
     </Partition>

You can find the aggregation designed by the server in the AggregationDesign section for the measure group "Internet Sales." Following is a section of the definition for the aggregation that has been designed. Each aggregation design can have one or more aggregations defined. The dimension section within Aggregations includes the estimated counts for dimension attributes. In the aggregation section, it lists the detailed aggregation design for each aggregation. The definitions for each aggregation contain the combination of the hierarchies that are to be included for aggregating the data. If a hierarchy has not been specified in the aggregation design, by default it is implied that the top-level member or the default member of that hierarchy is included.

     <AggregationDesign dwd:design-time-name="">
       <ID>AggregationDesign</ID>
       <Name>AggregationDesign</Name>
       <CreatedTimestamp>2005-07-04T03:20:12Z</CreatedTimestamp>
       <LastSchemaUpdate>2005-07-04T03:20:12Z</LastSchemaUpdate>
       <EstimatedRows>1013</EstimatedRows>
       <Dimensions>
         ......
       </Dimensions>
       <Aggregations>
         <Aggregation dwd:design-time-name="">
           <ID>Aggregation 0</ID>
           <Name>Aggregation 0</Name>
           <Dimensions>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Dim Promotion</CubeDimensionID>
             </Dimension>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Dim Sales Territory</CubeDimensionID>
             </Dimension>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Internet Sales Order Details</CubeDimensionID>
             </Dimension>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Sales Reason</CubeDimensionID>
             </Dimension>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Order Date Key - Dim Time</CubeDimensionID>
             </Dimension>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Ship Date Key - Dim Time</CubeDimensionID>
             </Dimension>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Due Date Key - Dim Time</CubeDimensionID>
             </Dimension>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Dim Product</CubeDimensionID>
             </Dimension>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Dim Customer</CubeDimensionID>
               <Attributes>
                 <Attribute dwd:design-time-name="">
                   <AttributeID>Dbo Dim Geography - Country Region Name</AttributeID>
                 </Attribute>
               </Attributes>
             </Dimension>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Dim Currency</CubeDimensionID>
             </Dimension>
             <Dimension dwd:design-time-name="">
               <CubeDimensionID>Destination Currency</CubeDimensionID>
             </Dimension>
           </Dimensions>
         </Aggregation>
         ......
       </Aggregations>

Applying Aggregation Design

You have so far designed aggregations for a single partition. If all your partitions contain fact data with the same characteristics in terms of dimension member distribution, you can use the same aggregation design for all of them. You can select a group of partitions in BIDS and then select Design Aggregations to design the same aggregations for the whole group. Alternatively, Analysis Services 2005 allows you to apply the existing aggregation design of one partition to other partitions in the same measure group. However, if you have some partitions that include fact records for an entire year and one partition holding only the current or last month's data, you would want to design separate aggregations for the best performance. To design the same aggregations for a group of partitions from BIDS, follow the steps below.

  1. Open the Adventure Works cube in BIDS and switch to the partitions tab. Select multiple partitions within the Internet Sales measure group. To do so hold the Ctrl key and then select the desired partitions using your mouse.

  2. Right-click and choose Design Aggregations as shown in Figure 13-25 to bring up the design aggregation window.

  3. Go through the same steps to design aggregations for the first partition. The wizard will automatically copy the aggregation design to other partitions.

image from book
Figure 13-25

Applying aggregations designed for one partition to other partitions is considered a management operation and hence you need to use SSMS. To copy an existing aggregation from one partition to other partitions, perform the following steps:

  1. Launch SSMS and connect to your Analysis Services instance.

  2. Navigate to the Adventure Works cube in the object explorer.

  3. In the measure groups folder, open the Internet Sales measure group, and open the partitions folder.

  4. Right-click on Internet_Sales_2001 partition, and select Copy Aggregation Design as shown in Figure 13-26.

    image from book
    Figure 13-26

  5. You will now see the Copy Aggregation Design wizard. You can choose to copy the aggregation design of the Internet_Sales_2001 partition to one or many partitions in this measure group. Select the partition for which you want to apply the same aggregation design as shown in Figure 13-27 and click OK.

image from book
Figure 13-27

Analysis Services now applies all the aggregations designed for the Internet_Sales_2001 partition to the partitions you selected. Once the aggregation design has been applied to all the partitions, you need to process them so that Analysis Services creates the aggregated data. You can later send a query that requests data at a specific level and analyze its performance. If aggregations have been built for the cells that are requested, Analysis Services will serve the query from the aggregations. Based on the size of your cube and the aggregations designed, you will notice a performance gain for queries that touch the aggregations. Finally, you can use SQL Server Profiler to see whether specific aggregations are getting hit (you will see this later in this chapter).

Usage-Based Aggregation Design

In addition to the Aggregation Design Wizard, Analysis Services supports aggregation design based on the user queries sent to Analysis Services. Designing aggregations based on user queries is called usage-based optimization, because aggregations are designed based on users' requests, ensuring that performance gains are achieved for those specific queries. In order for Analysis Services to analyze the queries and design aggregations, the queries served by Analysis Services need to be logged. Analysis Services provides a way to log the queries in a relational table, controlled by specific server properties. Because this aggregation design technique depends on the actual usage pattern, Analysis Services is more likely to create aggregations that increase the performance of future queries. To create a usage-based aggregation design, you first need to enable Analysis Services to log the queries sent to the server. Follow these steps to enable query logging and design aggregations based on a set of queries:

  1. Launch SSMS and connect to Analysis Services.

  2. Right-click the server and choose Properties. You will see the Analysis Server Properties window as shown in Figure 13-28.

    image from book
    Figure 13-28

  3. Set the "Log\QueryLog\QueryLogConnectionString" property so that it has a connection string pointing to a relational database where queries can be logged. To do so, click the ellipsis (...) button to launch the connection string dialog.

  4. You will see the connection manager dialog as shown in Figure 13-29, which you have used to specify data sources. Specify the connection details to your SQL Server that contains the sample relational database AdventureWorksDW. Click OK once you have specified the connection details.

    image from book
    Figure 13-29

  5. The connection string will be copied as the value for the server property "Log\QueryLog \QueryLogConnectionString."

  6. Define the server property "Log\QueryLog\QueryLogTableName" with the table name OLAPQueryLog.

  7. Make sure the server property "Log\QueryLog\CreateQueryLogTable" is set to true so that the query log table can be created by Analysis Services.

  8. By default, the server logs one out of every 10 queries executed, controlled by the property "Log\QueryLog\QueryLogSampling". Change it to 1 to log every query into the query log table.

  9. Click OK to save the server properties.

  10. Restart Analysis Services so that the new properties are set for Analysis Services and the query log table can be created on the relational database.

  11. Connect to SQL Server and open AdventureWorksDW; you will find an OLAPQueryLog table has been created. The table definition is shown here:

         CREATE TABLE [dbo].[OlapQueryLog](
             [MSOLAP_Database] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
             [MSOLAP_ObjectPath] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
             [MSOLAP_User] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
             [Dataset] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
             [StartTime] [datetime] NULL,
             [Duration] [bigint] NULL
         ) ON [PRIMARY]

  12. In SQL Server Management Studio connect to your Analysis Services instance.

  13. Right-click the Adventure Works cube and select Browse. Drag and drop measures into the data grid and drag and drop several dimensions. Perform some drill up and drill down to log some MDX queries into the query log.

  14. Open a relational query window in SQL Server Management Studio and send the following query:

         SELECT [MSOLAP_Database]
               ,[MSOLAP_ObjectPath]
               ,[MSOLAP_User]
               ,[Dataset]
               ,[StartTime]
               ,[Duration]
         FROM [AdventureWorksDW].[dbo].[OlapQueryLog]

    You will find many records are logged in the OLAPQueryLog table. Analysis Services logs the username, time stamp, and the subcubes that are hit during the MDX query. The subcube definition is a sequence of 0s and 1s, which indicate which hierarchies are involved in the query.
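If you want to interpret the Dataset column programmatically, a short script can map the 0/1 flags back to attribute names. The following Python sketch assumes the format described above (comma-separated groups, one group per cube dimension, one flag per attribute); the attribute layout shown is hypothetical, because the actual attribute ordering is defined by the server:

```python
def attributes_hit(dataset, dimension_attributes):
    """dimension_attributes: list of (dimension, [attribute, ...]) in log order."""
    hit = []
    for group, (dim, attrs) in zip(dataset.split(","), dimension_attributes):
        for flag, attr in zip(group, attrs):
            if flag == "1":
                hit.append((dim, attr))
    return hit

# Hypothetical layout for two dimensions (real layouts are server-defined):
layout = [("Date", ["Year", "Quarter", "Month", "Date"]),
          ("Customer", ["Country", "State-Province", "City"])]
print(attributes_hit("0100,100", layout))  # [('Date', 'Quarter'), ('Customer', 'Country')]
```

A decoded subcube like this tells you which attribute granularities a query touched, which is exactly what the Usage-Based Optimization Wizard analyzes.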

  15. In the Object Browser, right-click the Internet_Sales_2001 partition and choose Usage Based Optimization as shown in Figure 13-30.

    image from book
    Figure 13-30

  16. You will see the Usage-Based Optimization Wizard. Click Next in the welcome screen. Choose Internet_Sales_2001 to modify aggregation settings as shown in Figure 13-31 and then click Next.

    image from book
    Figure 13-31

  17. In the Specify Query Criteria dialog you can select the queries to be used for aggregation design based on time, users, or frequency, as shown in Figure 13-32. This dialog also provides statistics on the queries that have been logged. You can specify beginning and ending dates to get queries from a certain time period, choose the queries of a particular user or users, or choose the latest percentage of queries. We will use all the queries logged so far to design aggregations, so do not make any selections in the Specify Query Criteria dialog; just click Next.

    image from book
    Figure 13-32

  18. In the query list window (see Figure 13-33), Analysis Services provides information of all the queries requested by the users. Select all the queries and click Next.

    image from book
    Figure 13-33

  19. You will now see the storage design dialog that you saw during aggregation design. Select MOLAP storage and click Next.

  20. In the object counts dialog, you do not have to update object counts since you did this during aggregation design. Click the Next button to proceed to the aggregations options page.

  21. On the set design aggregation options page, choose 30% performance gain and click the Start button to request that aggregations be designed based on the queries selected. At this point, Analysis Services explores the various aggregations that would improve the performance of the selected queries. If a specific aggregation would benefit a specific query, that aggregation is given more weight so that it is more likely to be chosen from all possible aggregations. Analysis Services does a breadth-first search to explore the search space. Hence, if you have a cube with large dimensionality, the queries and the performance optimization you have chosen can sometimes lead to aggregations at a very low level (closer to the key). We recommend you look at the aggregations being created in the script, and if you are really interested in a specific aggregation being created, you can specify a higher performance gain (greater than 30%), which forces Analysis Services to expand the search space to deeper levels. Click the Next button after the aggregations have been designed.

  22. On the process options page, select the Process Partitions Immediately check box and click Finish as shown in Figure 13-34. The new aggregations will be created and applied to the Internet_Sales_2001 partition.

image from book
Figure 13-34

If aggregations already exist when you start usage-based aggregation design, Analysis Services does not take the existing aggregation design into consideration. You can also design aggregations by writing your own custom AMO program. You can download samples from this book's companion web site that show how to design aggregations using AMO.

Aggregation Design Options

So far you have learned to design aggregations using the Aggregation Design Wizard and the Usage-Based Optimization Wizard. In both wizards you have seen that some of the dimension attributes are considered by Analysis Services for aggregations and some are not. The Aggregation Design Wizard considers certain hierarchies when designing aggregations based on the design of the hierarchies, as well as the properties of the dimensions and cube dimensions. In addition, you can give hints that tell the aggregation designer to include or exclude hierarchies while designing aggregations.

Design Efficient Aggregations Using Hints

The aggregation design algorithm uses the partition fact data count and dimension-level member counts to estimate the cost of an aggregation design. Having accurate fact data and dimension-level member counts is crucial for Analysis Services to find the aggregation designs that yield the best performance. The fact table row count and level member counts are metrics that are counted once at design time, stored in the Analysis Services metadata, and never updated automatically afterwards. Therefore, it is important to update the member counts and fact table counts as dimensions change and fact tables grow. You can click the Count button in the Specify Object Counts window to get the newest counts for the various objects.

Typically, partitions contain only a specific slice of the fact data, but by default the member counts used for dimension attributes are the same for a partition as for the whole measure group. In the Aggregation Design Wizard it is good practice to provide Analysis Services with accurate member counts for the partition in the Partition Count column. In the Adventure Works DW sample, partitions contain data for a single year, so we recommend entering a value of 1 in the Fiscal Year row as shown in Figure 13-35. This helps Analysis Services use the value 1 instead of 4 to calculate the cost of aggregations that include the Fiscal Year attribute.

Figure 13-35

Relationships between Attributes

All the attributes within a dimension are related to the key attribute through one-to-many relationships. When you establish a relationship between two attributes within a dimension, you can specify the type of that relationship. Analysis Services 2005 allows two types of relationships, Rigid and Flexible, which refer to whether the attribute relationship can change over time. A rigid relationship means that the data values for the relationship will not change. For example, if you set the relationship between City and State to rigid, you indicate that a specific city will always belong to only one state and that this value will never change over time. A flexible relationship, however, means that the values of the relationship can change over time. By default, all the relationships between the attributes and the key attribute of a dimension are flexible. The relationship type determines how Analysis Services treats partition aggregation data when you perform an incremental process of the partition. Assume a relationship between two attributes has been specified as rigid but the value of the attribute changes nonetheless. When an incremental process of the dimension is initiated, Analysis Services raises an error stating that the data has changed for an attribute whose relationship was specified as rigid.

You can set the relationship type between attributes in BIDS by doing the following:

  1. Open the sample project Adventure Works DW.

  2. Open the Customer dimension by double-clicking that dimension in Solution Explorer.

  3. In the attributes pane, click the key attribute Customer. Click on the + sign to see all the attributes that are related to the key attribute.

  4. Click any of the attributes under the key attribute such as Phone number. If you look at the properties window you will see a Relationship Type property as shown in Figure 13-36, which has the values Flexible and Rigid.

    Figure 13-36

  5. Expand the user hierarchy Customer Geography and click the Country attribute relationship as shown in Figure 13-37. In the properties window you will find that its relationship type is set as Rigid because a state's country won't change over time. On the other hand, click the customer address relationship and you will see it is set to Flexible because a customer's address can change over time.

Figure 13-37

The aggregation design algorithm takes into account the relationship types of the attributes it uses, which allows aggregations to be classified as rigid or flexible. Rigid aggregations are aggregations that include only attributes that have a rigid relationship with the granularity attribute; attributes from the remaining dimensions must be at either the All level or the lowest level. The aggregation created in the previous section, which included Customer.Country while all the other dimensions were included at the top level, is an example of a rigid aggregation. Flexible aggregations are aggregations built on one or more attributes that have a flexible relationship with the granularity attribute. An example of a flexible aggregation is one that uses the Customer.Address attribute.

Rigid aggregations are updated when partitions are incrementally processed. If attributes that are part of rigid aggregations are incrementally processed, existing aggregations are not dropped. Analysis Services keeps the old aggregation data as it is and creates a temporary aggregation store for the newly arriving data. Finally, Analysis Services merges the temporary aggregation store with the old aggregation data within a transaction. The old aggregations are still available for query access while the new aggregations are being processed. The aggregated data is only rebuilt from scratch when you choose to do a ProcessFull on the partition or cube.

Flexible aggregations, on the other hand, are fully rebuilt whenever a cube or partition is incrementally processed. When attributes that are part of a flexible aggregation are incrementally processed, Analysis Services drops all the flexible aggregations, because the old aggregation data is no longer valid due to dimension member changes. After dropping the flexible aggregations, Analysis Services recalculates them. If you choose the option to create aggregations lazily (the ProcessingMode property of a dimension), flexible aggregations are recalculated as a background task. Users can still query the cube without aggregations; however, the queries might be slower. Once the aggregations are rebuilt, subsequent queries will be fast again.

Properties Controlling Attributes and Aggregation Design

In addition to the dimension member count and partition member count, Analysis Services allows you to fine-tune the aggregation design via the AllMemberAggregationUsage property on cube dimensions and the AggregationUsage property on cube dimension attributes. The values of these properties hint to the Aggregation Design Wizard whether to consider and include the dimension or attribute while designing aggregations. The following steps show you how to set the values of these properties in BIDS:

  1. Open the Adventure Works DW sample project.

  2. Open the Adventure Works cube.

  3. Click the cube dimension Customer in the Cube Designer as shown in Figure 13-38. You can see the associated properties in the properties pane, as shown in Figure 13-38.

    Figure 13-38

  4. You will see the property AllMemberAggregationUsage. This property enables you to either include or exclude the All member of the dimension while creating aggregations. Based on the value set for this property, the Aggregation Design Wizard will consider this dimension while performing the design task. There are four choices for the AllMemberAggregationUsage property:

    • Full. Always include the All member while creating aggregations.

    • None. Never include the All member while creating aggregations.

    • Unrestricted. Let Analysis Services consider the dimension during aggregation design; it may choose whether or not to build aggregations that include the All member of the dimension.

    • Default. Same as unrestricted.

    If most of your users query the All member of this dimension, we recommend that you change the AllMemberAggregationUsage property to Full. This ensures that aggregations include the All member and that your users' queries hit and benefit from those aggregations. If most of your users query this dimension at the detail level, we recommend you set the property to None; there is no need to create aggregations at the All level because most of the queries will not hit them.

  5. Leave the setting for the Customer dimension as Default, and click the Attributes tab in the cube editor as shown in Figure 13-39. BIDS now shows all the cube dimensions with their cube dimension attributes. Expand the Customer dimension to view its attributes.

    Figure 13-39

  6. Click any of the cube dimension attributes. If you look at the properties window, you will see the AggregationUsage property as shown in Figure 13-39. Similar to AllMemberAggregationUsage, the AggregationUsage property has four possible values: Full, None, Unrestricted, and Default. This property once again instructs the aggregation design algorithm to include or exclude a dimension attribute, as appropriate, while designing aggregations. The meanings of the values are as follows:

    • Full. Always include this attribute in any of the aggregation designs being considered while designing aggregations.

    • None. Never include this attribute in any of the aggregation designs being considered while designing aggregations.

    • Unrestricted. Analysis Services might consider this attribute during aggregation design as any other attribute.

    • Default. Same as unrestricted.

If most of your users query the data by customer country, you can change the Country attribute's AggregationUsage to Full. However, if you know your users rarely break down the numbers by a particular attribute, you can exclude it by setting the property to None. You should analyze your users' MDX queries using SQL Profiler (explained later in this chapter) and then set appropriate values for these properties on the various dimensions and attributes.

As you have seen, query performance can be improved by designing and fine-tuning aggregations. However, your users might be generating MDX queries that are not optimal. Next you learn a few ways to optimize MDX queries.

Optimizing MDX Queries

Most cubes are quite sparse. By sparse we mean that not every tuple in the cube space has a value associated with its cell. For example, in the Adventure Works sample database, if every coordinate had data for the Internet Sales measure group, and assuming only the key attribute in each dimension, the total number of cells would be Date (1,158) * Ship Date (1,158) * Delivery Date (1,158) * Customer (18,484) * Promotion (16) * Product (606) * Sales Territory (11) * Sales Reason (10) * Source Currency (105) * Destination Currency (105), which is about 3.37 × 10^23 cells. This number only increases as additional attributes from each dimension are added. Most of these cells have no business meaning associated with them (for example, cells where the delivery date falls before the order date), but they belong to the cube space and can be queried by users. Querying such a cell returns a null value, which indicates that no data is available for that cell's coordinate.

The fact table rows represent the leaf-level cells of a cube, and the fact table row count is much smaller than the possible cube space. The Analysis Services engine has many optimizations that improve query performance by using the fact data to limit the search space. The basic rule is that if a cube doesn't have calculations (such as a calculation script, custom rollups, or custom members), the non-empty space of the cube is defined by the fact table cells and their aggregations. Analysis Services allows you to write effective, optimized MDX queries that prevent empty cells from being returned, because those empty cells simply do not add value for business analysis. By limiting the search space, Analysis Services can find the results more quickly.

Analysis Services 2005 supports many ways for users to eliminate the cells containing null values in a query. The keyword NON EMPTY helps you to eliminate members along an axis whose cell values are null. The NON EMPTY keyword is used at the beginning of the axis statement in an MDX query as shown here:

     SELECT Measures.Members ON COLUMNS,
            NON EMPTY Dimension.Hierarchy.Members ON ROWS
     FROM <CubeName>

The NON EMPTY keyword can be used on rows or columns (or any axis). Assume that if you execute the preceding query without the NON EMPTY keyword, you see the results shown in the following table.

                Measure 1   Measure 2   Measure 3   Measure 4
     Member 1   Null        Null        Null        Null
     Member 2   Value 1     Value 2     Value 3     Value 4
     Member 3   Null        Null        Null        Null
     Member 4   Value 5     Value 6     Value 7     Value 8
     Member 5   Value 9     Value 10    Value 11    Value 12

If you execute the query with the NON EMPTY keyword, you will see the results shown in the following table. You can see that Member 1 and Member 3, which had null values for all the measures, have been removed from the results.

                Measure 1   Measure 2   Measure 3   Measure 4
     Member 2   Value 1     Value 2     Value 3     Value 4
     Member 4   Value 5     Value 6     Value 7     Value 8
     Member 5   Value 9     Value 10    Value 11    Value 12

In addition to the NON EMPTY keyword that can be used on the axes, Analysis Services 2005 supports the functions NONEMPTY () and NONEMPTYCROSSJOIN (), which you learned about in Chapter 7. These functions take a set as one of their parameters, filter out all the tuples whose cell value is null in the context of the default measure (or a measure passed as another parameter), and return a set of non-empty tuples.
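
For example, the following sketch of a query against the Adventure Works sample uses NONEMPTY () to restrict the row set to the countries that have a non-null Internet Sales Amount (the measure is passed explicitly as the second parameter):

     SELECT {[Measures].[Internet Sales Amount]} ON COLUMNS,
            NONEMPTY ([Customer].[Customer Geography].[Country].Members,
                      {[Measures].[Internet Sales Amount]}) ON ROWS
     FROM [Adventure Works]

Unlike the NON EMPTY keyword, which removes fully empty rows or columns from an axis after the axes are evaluated, NONEMPTY () can be applied to any set expression inside the query, so you can filter a set before passing it to other functions.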

Analysis Services has an optimized and an un-optimized code path for identifying non-empty data and returning results for a user query. In the un-optimized method, Analysis Services iterates over the entire region of the user query and evaluates each cell individually to see if its value is null. This is usually a slow operation for a large, sparse data query. The optimized method leverages the much smaller search space of the fact table rows to quickly discard the empty cells, and it is fast; the performance difference is significant for queries with large datasets. Under certain conditions, forcing the optimized code path can produce non-intuitive results. For example, the NonEmptyCrossJoin function enforces the optimized code path, so you should use this function very carefully with calculated members because it might produce results you are not expecting. The NonEmptyCrossJoin function is being deprecated. If you use the NonEmpty function instead, Analysis Services decides whether to execute the optimized or the un-optimized code path based on the conditions and ensures that correct results are returned. We recommend you use the keyword NON EMPTY or the function NONEMPTY () to eliminate tuples containing null values from your results.

Analysis Services tries to use the optimized method anytime a non-empty query is sent to the server, whether it uses the NON EMPTY keyword or the NONEMPTY function. However, the server falls back to the un-optimized code path whenever it is unable to identify an easy way to evaluate whether the cells are empty. This usually happens when there are complex expressions for calculated members and the server cannot identify the real measure it needs to run the optimized code path. If you know which real measure determines whether the calculation will result in an empty value, you can tell Analysis Services through the NON_EMPTY_BEHAVIOR property. This lets the server take the optimized code path even when complex calculations are being evaluated. You learn more about NON_EMPTY_BEHAVIOR in the following sections.

Using Non-Empty to Eliminate Cells

In most cases, only results with non-empty cells are meaningful for end users. You should always use the NON EMPTY keyword in your MDX cellset and rowset queries whenever possible. Not only will it limit the size of the cellset returned, but there are additional optimization benefits that speed up your query execution time.

Following is an MDX query without the NON EMPTY keyword. Execute this query using SQL Server Management Studio against a deployed sample Adventure Works project.

     SELECT [Customer].[Customer Geography].[Customer].Members *
            Descendants ([Product].[Product Categories].[Category].&[3],
                         [Product].[Product Categories].[Product Name]) ON 1,
            {[Measures].[Internet Sales Amount]} ON 0
     FROM [Adventure Works]

You will see that the query returns 18,485 cells. This query took 12 seconds on a single-processor server machine when we executed it. Now change the query to include the NON EMPTY keyword on both axes, as shown here, and execute the new query in SQL Server Management Studio.

     SELECT NON EMPTY [Customer].[Customer Geography].[Customer].Members *
            Descendants ([Product].[Product Categories].[Category].&[3],
                         [Product].[Product Categories].[Product Name]) ON 1,
            NON EMPTY {[Measures].[Internet Sales Amount]} ON 0
     FROM [Adventure Works]

The query that includes the NON EMPTY keyword returns just 6,853 non-empty cells and it took only 3 seconds to execute. This clearly highlights for you the benefit of eliminating empty cells using NON EMPTY.

Using Non-Empty to Improve Performance for Filter and Sort

Many users apply filter conditions on a set, or evaluate the top members of a set based on certain conditions, using the Filter and TopCount functions, respectively, over a large cube space. In most cases, only non-empty members are needed in the Filter and TopCount functions. You can improve performance dramatically by first using NONEMPTY () to find the non-empty set with the optimized algorithm, and then applying the filter, sort, or TopCount operation on that much smaller set. For example, in the Adventure Works sample, if you want to get the top ten customer and product combinations to start a marketing campaign, your query will look like the following:

     SELECT
         TopCount ([Customer].[Customer Geography].[Customer].Members *
                   [Product].[Product Categories].[Product].Members, 10,
                   [Measures].[Internet Sales Amount]) ON ROWS,
         [Measures].[Internet Sales Amount] ON COLUMNS
     FROM [Adventure Works]

Notice that the above query contains a cross-join of all the customers and products (shown by the expression below). Whenever a cross-join is applied, the server sorts the result based on the order of the hierarchies.

     ([Customer].[Customer Geography].[Customer].Members *
      [Product].[Product Categories].[Product].Members)

The cross-join of the customers and products results in 18,485 * 396 = 7,320,060 cells. Analysis Services then evaluates the top 10 cells out of these seven million cells to return the results of the preceding query. This query took around 101 seconds on the server machine and consumed one CPU at 100% during the entire execution. Most of the cells in the cross-join were actually empty and need not have been part of the result. Not only did the server take time sorting these cells, it also had to iterate through the seven million cells to determine the top 10. The following query uses the NonEmptyCrossJoin function, which eliminates the empty cells.

     SELECT
         TopCount (NONEMPTYCROSSJOIN (
             [Customer].[Customer Geography].[Customer].Members *
             [Product].[Product Categories].[Product].Members,
             {[Measures].[Internet Sales Amount]}, 1), 10,
             [Measures].[Internet Sales Amount]) ON ROWS,
         [Measures].[Internet Sales Amount] ON COLUMNS
     FROM [Adventure Works]

In the above query the NonEmptyCrossJoin function first eliminates all the empty cells, so the TopCount function only has to work on a much smaller number of cells. The query took 3 seconds because of the optimization provided by the NonEmptyCrossJoin function: only cells containing fact data were sorted, and the top 10 values were returned. The performance improvement is dramatic, and both queries return exactly the same results. The rule of thumb is that the fewer tuples or cells involved in calculations, the better the query performance. Because Analysis Services has an efficient algorithm for finding non-empty sets, which are much smaller in most cases, you should use NonEmpty whenever it is applicable and appropriate for the business requirement. Use the NonEmptyCrossJoin function only when you know that a real measure will be used by the server for the non-empty evaluation, and use it with caution when you have calculated measures.
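
Because the NonEmptyCrossJoin function is being deprecated, the same optimization can be expressed with the NONEMPTY function, letting the server choose the appropriate code path. The following sketch should return the same top 10 combinations:

     SELECT
         TopCount (NONEMPTY (
             [Customer].[Customer Geography].[Customer].Members *
             [Product].[Product Categories].[Product].Members,
             {[Measures].[Internet Sales Amount]}), 10,
             [Measures].[Internet Sales Amount]) ON ROWS,
         [Measures].[Internet Sales Amount] ON COLUMNS
     FROM [Adventure Works]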

You can also use the HAVING clause, which eliminates cells with null values, as you saw in Chapter 7.
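
As a sketch of that approach, the following query keeps only the customers whose Internet Sales Amount is not empty by attaching a HAVING condition to the row axis:

     SELECT {[Measures].[Internet Sales Amount]} ON COLUMNS,
            [Customer].[Customer Geography].[Customer].Members
            HAVING NOT ISEMPTY ([Measures].[Internet Sales Amount]) ON ROWS
     FROM [Adventure Works]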

Using Non-Empty Behavior to Improve Calculation Performance

If you query for cells that involve the evaluation of complex calculations, then a cell's emptiness (whether the cell returns a null value) is not determined by the fact table cells; each cell must be evaluated to return the correct results. Analysis Services provides a property called NON_EMPTY_BEHAVIOR that instructs the server to use the optimized algorithm to determine a cell's emptiness. The following query returns the forecast sales by applying different rates:

     WITH MEMBER [Measures].[ForecastSales] AS
         'IIF ([Measures].[Internet Sales Amount] > 500,
               [Measures].[Internet Sales Amount] * 1.2,
               [Measures].[Internet Sales Amount] * 1.1)'
     SELECT NON EMPTY [Customer].[Customer Geography].[Customer].Members *
            Descendants ([Product].[Product Categories].[Category].&[3],
                         [Product].[Product Categories].[Product]) ON 1,
            NON EMPTY {[Measures].[ForecastSales]} ON 0
     FROM [Adventure Works]

Even though the above query uses NON EMPTY, you will find that the server still takes around 12 seconds to execute it. This is because the optimized code path cannot be applied to complex calculated members: the server must evaluate the calculated expression for each cell to identify whether the corresponding cells are empty. Because you know the non-empty behavior of this measure, you can specify NON_EMPTY_BEHAVIOR for the calculated member, tying it to a real fact measure. The server will then use the optimized code path for the non-empty determination. Execute the following modified query, which contains the NON_EMPTY_BEHAVIOR property:

     WITH MEMBER [Measures].[ForecastSales] AS
         'IIF ([Measures].[Internet Sales Amount] > 500,
               [Measures].[Internet Sales Amount] * 1.2,
               [Measures].[Internet Sales Amount] * 1.1)',
         NON_EMPTY_BEHAVIOR = '[Measures].[Internet Sales Amount]'
     SELECT NON EMPTY [Customer].[Customer Geography].[Customer].Members *
            Descendants ([Product].[Product Categories].[Category].&[3],
                         [Product].[Product Categories].[Product]) ON 1,
            NON EMPTY {[Measures].[ForecastSales]} ON 0
     FROM [Adventure Works]

Did you notice how much faster this query ran compared to the original query? This is why we recommend applying NON_EMPTY_BEHAVIOR to calculated members whenever you are using complex calculations and you have a fact measure (or a simple calculated member) that helps evaluate whether a cell is empty. We have illustrated just a few important optimizations that will help you write optimized MDX queries. Many more MDX optimizations are possible; we recommend MDX Solutions, 2nd Edition, by George Spofford et al. (Wiley, 2006), which devotes an entire chapter to MDX optimization.



Professional SQL Server Analysis Services 2005 with MDX (Programmer to Programmer)
ISBN: 0764579185