Optimizing UDM Design | Professional SQL Server Analysis Services 2005 with MDX (Programmer to Programmer)

The data modeling completed during UDM creation has a significant impact on both query performance and processing performance, so it is not something to be rushed through. Even before starting to build your UDM, you must understand the business requirements of the system under assembly as much as possible. You need to have clarity regarding the goal of the system, and that, in turn, feeds directly into creating the analysis requirements, and what potential queries the system needs to support. That understanding will also provide insight into what attributes the user won't be interested in analyzing.

Every dimension, and attribute in a dimension, will demand processing time for your UDM. In addition, because adding unnecessary dimensions and attributes will increase the cube space, it can slow the query performance too. You should use the business requirements to drive your design, just be sure to avoid unnecessary dimensions and keep your system compact and performant.

Fine Tuning Your Dimensions

In Analysis Services 2005, dimensions can contain several hierarchies. Often when you create your dimension using the wizard, all the columns in the relational table(s) are added as attribute hierarchies in support of relational and OLAP querying. You can easily end up with a dimension that can have hundreds of attributes. In most business scenarios, the attributes within a dimension are not used in many queries. Only a subset of the attributes might be heavily used. In this section you learn various design techniques that will help you design the right dimension suited to your business needs and get the optimal performance acceptable for your business.

Choose the Right Key Attribute

Each dimension needs to have a key attribute. The relational dimensional tables will typically have a column defined as a key column, and that is automatically inferred as the key attribute by the dimension wizard. In certain relational databases, the relationships between the fact and dimension tables and the primary and foreign keys might not be defined. In such cases you would need to define the relationships between the tables in your DSV. Choosing the key attribute can actually impact processing and query performance. If you know that you have two columns in the dimension table that each uniquely identifies a row in the table and if they are of different data types, such as integer and string, try to choose the column that is an integer. Key attributes that are of integer data types occupy less storage and are faster to retrieve than those of type string, because the number of bytes used to store string types are typically larger than that for integer data types. In addition, if you have a choice of choosing a single column as a key attribute as compared to choosing multiple columns in the table, choose the single column as the key column. If you are already aware of these techniques and have designed your database accordingly, it's great. Some might think that all they need is just more disks — we consider that disk space is quite cheap to buy and they are much faster than before — and some might think a few bytes might not make a big difference. However, imagine your dimension has millions of members — no matter how much disk space you have, accessing each member during queries takes time.

The fastest processing time of fact-table partitions occurs when the fact table has integer foreign keys for the dimension, and the dimension contains the integer key as an attribute (perhaps hidden). This allows the SQL query sent by Analysis Services during processing to be efficient in terms of query performance. We are aware of several international major customers with large data warehouses using composite keys with long strings. Each of these companies had a data warehouse before they started using Analysis Services and they consider a change prohibitively expensive. Their design works and their business benefits from using Analysis Services. But their time window for processing fact-table partitions would be much smaller if they had used integer keys.

Avoid Unnecessary Attributes

Because Analysis Services supports attributes hierarchies, you can create many attributes which allow the users to analyze their data along those attributes. However, if you create too many attributes that are never used in customer queries, it will waste system data storage slowing both processing and query performance. We recommend you look at each dimension in detail and eliminate attributes that will never be queried by users. Although your dimension table(s) might contain several columns it is usually not necessary to convert every single column in the dimension table(s) to attributes hierarchies in the dimension. You can see an example in the sample Adventure Works DW project's customer dimension. Figure 13-3 shows the Customer dimension, where you can see the list of columns in the DSV and compare it against the list of attribute hierarchies included within the dimension.

image from book
Figure 13-3

You can see that the Dim_Customer relational table contains several columns. However the Customer dimension includes only a subset of those columns that are essential for data analysis as attribute hierarchies. For example the column Title in the Customer's table is not used in data analysis and hence it is not included as an attribute in the Customer dimension.

Turn Off Optimization for Rarely Used Attributes

Although some attributes in a dimension are used for data analysis, they might be needed only on rare occasions. For example, you might be querying an employee's zip code or phone number infrequently for analysis. By default, Analysis Services creates indexes for each attribute; assuming the attribute hierarchy will be used often in queries. By turning off attributes' optimization via their AttributeHierarchy OptimizedState, you will save processing time and resources by not creating indexes for such attributes. A query involving any NotOptimized attribute will be slower; however, because it is rarely used, it won't hurt most of your users' query performance. Most often you will have certain attributes that are used as member properties. We recommend you set the AttributeHierarchyOptimizedState for member property attributes as well as attributes that might be used infrequently during querying to be NotOptimized because those attributes are not involved in queries. The improvement for data storage and processing time would justify this choice.

The AttributeHierarchyOptimizedState is a property for each attribute. If you click an attribute you will see this property under the advanced section in the properties window. Figure 13-4 shows the Attribute HierarchyOptimizedState property. To turn off the property for an attribute in the dimension, change the value from FullyOptimized to NotOptimized.

image from book
Figure 13-4

Turn Off AttributeHierarchy for Member Properties

Some attributes are not relevant to data analysis per se, which means user queries will never pivot on those attributes. Still, it is useful to display and use those attributes as member properties. Consider the "birth date" attribute; although customer queries may never break down fact data by birth date, customer birth date might be displayed next to customer names in reports, perhaps as a sorting criteria for listing customers. You can turn off the AttributeHierarchyEnabled property for "birth date" to tell the Analysis Services not to build an attribute hierarchy for this attribute, but keep it as a member property. This reduces the time and space needed for dimension processing. Figure 13-5 shows the AttributeHierarchyEnabled property for a dimension attribute. If this property is set to false for an attribute; you will not be able to browse the hierarchy. Rather, you can query the attribute members as member properties.

image from book
Figure 13-5

Define Hierarchy Relationships between Attributes

If attributes within a dimension are related by a one-to-many relationship, we recommend you establish that relationship in Dimension Designer. For example, in a geography dimension you will typically have Country, State, and City attributes know that there is a one-to-many relationship between these attributes. Define these relationships in the Dimension Designer for improved processing and query performance. If you create user hierarchies within a dimension, and if the user hierarchies have multiple levels and there is a natural relationship between these attributes, we highly recommend you define these relationships. Often when user hierarchies are created, they will be used in queries and defining the natural relationships helps significantly in query performance. The attribute relationship will help the server build efficient indices, which will benefit query performance along the user hierarchies significantly. To really understand this issue, you first need to learn about natural and unnatural hierarchies.

All attributes within a dimension are related to the key attribute because the key attribute is unique, and by definition a key has a one-to-many relationship with all the attributes. As for the natural hierarchy, consider the multilevel hierarchy Full Name image from book State-ProvinceCountry shown in Figure 13-6. A hierarchy is called a natural hierarchy if there is a one-to-many relationship between every pair of attributes that are from successive levels of a user hierarchy. In the example shown in Figure 13-6, you can see that the relationship between attributes of various levels has been established in the Dimension Designer, namely, Customer attribute (called FullName in the multilevel hierarchy) has a relationship with Postal Code, Postal Code has an attribute relationship with City, City has an attribute relationship with State Province and State Province has an attribute relationship with Country. Essentially, a chain of relationships from the bottom-level attribute to the top-level attribute is created. Such a hierarchy is called a natural hierarchy.

In a natural hierarchy, the attribute at a given level maintains a many-to-one relationship with an attribute directly above it. The many-to-one relationship is defined by the attribute relationship, as shown in Figure 13-6, and the server builds indexes for fast navigations. A natural hierarchy for which indexes are created during processing time is referred to as a materialized hierarchy. For the customer hierarchy example shown in Figure 13-6, indices for state-province to country, city to state-province, and so on are built and stored in the server. Analysis Services will utilize those indices to accelerate query performance; it can easily find all states in the USA because the query can be directly resolved from the "State-Province to Country" index. Also, if the data cache or aggregated data that is stored in Analysis Services has data for State-Province, Analysis Services can use the same index to quickly get an aggregate value for Country.

image from book
Figure 13-6

Defined relationships not only help during query performance but also during processing. Analysis Services processes the key attribute of the dimension after processing all the remaining attributes. If no relationships are defined, then at the time of processing the dimension the server needs to retrieve data for all the attributes while processing the key attribute. This increases the processing time of the key attribute and the dimensions. If you have defined the natural relationships between attributes, Analysis Services has an optimized code path that reduces the key attribute processing, thereby reducing the overall dimension processing time. You will see significant differences in processing times for large dimensions that have hundreds of attributes. For some large dimensions (> 10 million members and > 100 attributes) you might reach the physical processing limits of a 32-bit environment and might have to move to 64-bit servers. Establishing relationships (as discussed) combined with other dimension optimizations (also discussed) can facilitate processing for large dimensions on a 32-bit platform.

In the case of an unnatural hierarchy, only the key attribute has an attribute relationship to all other attributes, so the relationship graph resembles Figure 13-7. The system will only build indexes along attribute relationships, and build one-to-many indexes for Country to Customer, State-Province to Customer, City to Customer, and Zip Code to Customer. When you create a user hierarchy such as Customer image from book StateProvinceCustomer relationship to find all customers in the USA; then use the Customer

image from book
Figure 13-7

You should specify one-to-many relationships whenever possible. Not only will it help significantly for query performance, but it also will save storage space. For the Country to Customer index in the unnatural hierarchy example, every customer member in the USA will have an entry in the index. However, by moving the relationship to State-Province to Country, you will only need 50 entries for that index. To establish the attribute relationship as shown in Figure 13-8, you simply need to drag and drop the attribute to the "new attribute relationship" for the corresponding attribute.

image from book
Figure 13-8

Fine Tuning Your Cube

You have so far seen some of the design techniques that will help you achieve improved dimension processing and query performance. Similarly, certain design optimizations within the cube can also help you in achieving better performance during processing or querying. This section discusses cube optimization design techniques.

Fact Table Measure Groups or Partitions

When you run cube wizard on a DSV containing multiple fact tables, the wizard creates a separate measure group for each fact table identified. However, such a UDM may or may not be the right design for your business analysis. For example, if you have two fact tables, salesfact2005 and salesfact2006, which have the sales information of your business for the years 2005 and 2006 (containing identical columns), the cube wizard will be creating two measure groups. However, for your purposes these should actually be modeled within the same measure group as two partitions so that appropriate data can be rolled up. If you are creating the cube using the Cube Wizard, select one of the fact tables in the cube wizard table selection page, and then the other fact table can be added as a partition to the measure group. If you select both tables by mistake and the Cube Wizard has already created two measure groups for the two fact tables, you can delete one of them from the Cube Designer and add that table as a partition in the partitions tab. A way to think about this is a Measure Group contains the union of all partitions.

You can have fact data spread across multiple tables. For business reasons it might make sense to have all the fact data within the same measure group. Consider a measure group as an entity within your UDM that represents a set of measures which are grouped logically for business reasons; or in Analysis Services terms, share the same dimensionality and granularity. Hence even if you have fact data spread across multiple tables and for business reasons, you actually need to combine the data; make sure you have them added within a single measure group. You can join the fact tables containing measures into a single view within the DSV and create a measure group from that, or you can add measures from either of the fact tables within a single measure group using the Cube Designer.

Optimizing Reference Dimensions

If your UDM contains reference dimensions, you need to be aware of making optimizations for the reference dimensions. If you are querying a reference dimension that is not optimized then you might not get the best query performance. Figure 13-9 shows the relationship definition for a reference dimension. You have a small checkbox called Materialize. You learned in Chapter 5 about reference dimensions and materializing them. To recap, materializing the reference dimensions ensures that indexes are created for the dimension. Once you materialize the reference dimensions, Analysis Services views those reference dimensions as regular dimensions. This helps to improve the query performance. Also, materializing a reference dimension is the only way to create a chain of two or more dimensions as reference dimensions that include intermediate reference dimensions.

image from book
Figure 13-9

Partitions

Partitions store cube fact data in one of the storage modes — MOLAP, ROLAP, or HOLAP. By dividing the data in fact table(s) into multiple partitions you can take advantage of parallelism which can reduce the processing time for the cube and get improved query performance. Assume you have fact data for various months since the year 2002, which is part of a specific measure group. The following table shows a partition scheme that uses time to partition the fact data. Partitions 1–3 include data for past years, and they do not change. Partition 4 includes data for past months in the year 2005, which also do not change. Only Partition 5, that contains current month data, changes daily.

Partition 1	Partition 2	Partition 3	Partition 4	Partition 5
2002	2003	2004	2005	CurrentMonth

Assume the fact data is in a single fact table called Yearly Sales, and you have a measure group within your UDM that has been created from this fact table. By default the Cube Wizard creates a single partition that points to the relational table Yearly Sales. By dividing the data into multiple partitions, all the partitions can be processed in parallel, which will utilize the processor and memory resources of your machine more efficiently. Unlike Analysis Services 2000, which processed each partition serially, Analysis Services 2005 processes objects in parallel by default. If the Server machine has more than one CPU and sufficient memory, parallel processing will reduce total process time as compared to a machine with single CPU. During query time, if cubes contain several partitions, Analysis Services can scan those partitions in parallel, and queries can return results quickly because the server has multiple jobs running in parallel, each one scanning a partition. In addition to that, when you divide the data into multiple partitions based on specific dimension member, Analysis Services 2005 by default retrieves data only from relevant partitions needed for the query if the cube is of storage type MOLAP. For example, if a query requests data for partition 2002, Analysis Services will only query the data from partition containing 2002 data. In order to create multiple partitions, you can create multiple tables in the relational database and then create the partitions. Alternatively, you can create named queries within your DSV that correspond to your desired partition strategy.

The second benefit of creating partitions in an environment where most of the data does not change is that you only need to process the current partition, which has new data added or modified. You need to only process the currentMonth partition in the partition scheme to refresh the latest data changes within your UDM. The data volume to process is reduced and the process time will be significantly decreased compared to processing the entire UDM.

The third benefit of partitioning data is that you can set up different storage modes for each partition. In business scenarios where a certain amount of data is not changing and the data volume is huge, it is better to use MOLAP to increase query performance. In addition, if you need real-time access to your data for analysis you can create a small partition with the current month's data and set the storage mode for that partition as ROLAP. Analysis Services retrieves the data from the appropriate partitions and provides you the aggregated real-time data for your business analysis.

Yet another benefit of partitioning data is that you can create different aggregations (pre-calculated data cache for improved query performance) for maximum benefit. Creating aggregations for your UDM is covered in detail later in this chapter. You can design heavy aggregations for those partitions that are queried most (the current year data) to maximize query performance, and create less aggregations for old data that are queried less to save the server storage. Creating such aggregations not only saves storage space but also reduces processing times if the entire UDM is processed.

Finally, refreshing the data within a partition is much faster than refreshing the entire UDM. You do not have to apply incremental processing on all the partitions. You just have to do incremental processing on the partition whose data needs to be updated. During incremental processing of the current partition, a temporary partition is created and then merged into the existing partition, which could possibly result in data fragmentation. By refreshing the partition, the server will re-sort the data for optimized query performance.

Merging Partitions

If you have multiple partitions where data is used sparsely, merge the partitions so that data from each partition does not get aggregated every time a query is issued. Assume you have your measure group data partitioned by month for the year 2002. If every user's query is asking for the entire year's data, Analysis Services needs to retrieve the data from all the partitions and aggregate data within the server.

Having too many partitions could hurt query performance because the server needs to scan those partitions and aggregate data from them. Instead you can merge the data from all the partitions to form a single partition. Queries referring to 2002 will henceforth touch a single partition. If you have aggregations created at the year level then your queries can be instantaneous.

A common scenario for many Analysis Services users is to partition by time, and have a weekly or monthly schedule of updating the cube. This typically involves merging the most recent partition into the year-to-date partition, and creating a new partition for the next period. Sometimes SSIS (SQL Server Integration Services which you learn in Chapter 16) is used to control the flow of operations. This is another scenario where data from existing partitions need to be merged.

Consider the example reviewed in the "Partitions" section, where data was partitioned by time. If you want to merge the partitions to a single partition because queries are infrequent, you can do so using SMSS by following the steps below.

Deploy the sample Adventure Works DW project shipped with the product. The Adventure Works cube contains several measure groups. Connect to the Adventure Works DW database using SSMS.
Navigate through the cube and notice the four partitions under the Internet Sales measure group as shown in Figure 13-10.
Right-click the internet_sales_2003 partition and select Merge Partitions. Assume you want to merge the 2004 data into this partition.

Figure 13-10
In the Merge partitions dialog, select Internet_Sales_2004, as shown in Figure 13-11, and click OK. You are done! You have now successfully merged 2004 partition data into the 2003 partition.

Figure 13-11
In the SSMS object browser, refresh the measure group object. The partition of internet_sales_2004 is gone because it was just merged to the internet_sales_2003 partition. You can see this in Figure 13-12. You can still query the data for 2004 and 2003 as before. The cube still contains the same data and all queries still return the same result.

image from book
Figure 13-12

We expect your UDM to have several partitions for each of your measure groups, especially the ones that contain a large volume of fact data. We recommend you investigate the queries that are sent to your UDM and identify the requirements of your business users. One mechanism used to identify the queries sent to Analysis Services is SQL Server Profiler. You will learn about Profiling Analysis Services using SQL Server Profiler in this chapter. Most measure groups partitioned are data that map to time. This is the most logical and common scenario for most businesses. If after analyzing the queries you see that your user queries are targeted to specific partition(s), and the remaining partitions are infrequently queried or queries to those partitions retrieve data at a higher grain, we recommend you merge such partitions.

Optimizing ROLAP Partitions

Assume a large-scale UDM containing several partitions where all the partitions are ROLAP. In such a scenario, when a query comes in requesting data Analysis Services sends relational queries to each partition with appropriate conditions to retrieve the data. Analysis Services then aggregates the data retrieved from the relational data source and sends the results to the end user. However generating the queries for each partition, establishing connection to the data source and then retrieving results is an operation that takes a certain amount of time. Even if the data requested is for a single partition, Analysis Services will send queries to all the partitions, which is not essential. Most likely, data in various partitions has been partitioned based on a specific hierarchy. For example, data can be partitioned based on months. In such circumstances you can provide a hint to Analysis Services about the data contained in a ROLAP partition through a property of the partition called Slice. This property allows you to specify an MDX expression which indicates the slice on a hierarchy (a member or a group of members in that hierarchy) that uniquely identifies the data within the partition. When the Slice property is specified then Analysis Services sends ROLAP queries to partitions only if the MDX query requested is expected to retrieve results from those partitions. If the Slice property is not specified then by default Analysis Services assumes data can be contained in that partition for any MDX query.

Assume you have partitioned the data by time, 5 partitions for 5 years. Typically, most of the queries will involve only the current year. In this case there is no need to send 5 queries, of which 4 will return no data. By setting the slice property indicating that the 2001 partition contains data for year 2001, 2002 partition contains data for year 2002 and so on, you provide a hint to Analysis Services so that it can optimize the relational queries sent. The Slice property for a partition can be set in BIDS or SSMS. For example, if you need to set a specific partition's data for year 2004, you will provide the MDX expression for 2004 as [Date].[Calendar].[Calendar Year].&[2004]. By default the value for the slice property is set to null, which indicates to Analysis Services that this specific partition can contain data for all the queries. The following steps show how to set the slice property.

Deploy the sample Adventure Works DW project shipped with SQL Server 2005 and ensure the database is processed. Connect to the Adventure Works DW database using SSMS.
Navigate through the partitions of Internet Sales measure group. Change the storage mode for all the partitions to ROLAP.
Right click on the Internet_Sales_2001 partition object and select Properties. Alternatively, you can double-click the partition, which will bring up the property page of the partitions as shown in Figure 13-13.

Figure 13-13
Click the ellipses next to the Slice property. This launches the MDX editor, as shown in Figure 13-14.
Navigate through the Date dimension and select the member [Date].[Calendar].[Calendar Year].&[2001]. When you double-click this member, the MDX expression corresponding to it appears in the text box. Click Check to verify the MDX syntax and then click OK.
Similar to step 5, specify the slice property for the remaining partitions Internet_Sales_2002, Internet_Sales_2003 and Internet_Sales_2004 with corresponding members in the Date dimension for the years 2002, 2003, and 2004.

image from book
Figure 13-14

You have now successfully defined the slice for all the partitions in the Internet_Sales measure group of the Adventure Works cube. To verify that Analysis Services is able to utilize the slice information, you can send an MDX query that retrieves measures from the Internet_Sales measure group containing specific members from the [Date].[Calendar] hierarchy, such as year 2001. If you trace the queries sent to the SQL Server using SQL Profiler without setting the slice information, you will see that Analysis Services sends four queries to the relational server. After setting the slice information you can see that exactly one query is sent to the SQL Server.

An important factor to note is that Analysis Services does not validate your slice information for ROLAP partitions. Analysis Services just honors the slice property you have set, assuming you have the in-depth knowledge of your data. If you set an incorrect slice on a partition, you will get incorrect query results. For example, if you set the slice for partition Internet_Sales_2002 with a time member corresponding to 2001, a query requesting data for year 2002 will result in no data (assuming slice information is set for the remaining partitions correctly). The slice property is a directive to Analysis Services, and you need to be careful about setting the correct slice. The slice property is not needed for MOLAP partitions because Analysis Services is aware of the slice during processing, and hence optimizes the subqueries sent to partitions. Because Analysis Services is aware of the slice information for MOLAP partitions, if you set an incorrect slice for a specific partition and try to process the partition, the processing will result in errors.

You have learned various design techniques that can help you to optimize your UDM for better performance during processing and querying. In the next section you learn about other optimizations that will help you reduce processing time.