In a well-designed partitioned solution, it is easy to manage partitions. In a poorly designed solution, you can face huge performance issues moving large amounts of data between partitions. The most important consideration in designing your solution is to plan and test it end to end on a small amount of data.
Initial Loading of the Data Warehouse
You will see some performance benefit during the initial loading of large tables if you create the table indexes after the initial loading is complete. When you create the indexes, be sure to create the clustered index first; otherwise, you will pay twice for the creation of each nonclustered index. You need about 150 percent of the table's size as additional work space when you create a clustered index on a populated table.
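As a minimal sketch of this ordering, the statements below build the clustered index before the nonclustered one after the load completes. The column names are illustrative assumptions, not the book's actual FactCDR schema:

```sql
-- Load the data first, then build indexes: clustered before nonclustered,
-- so nonclustered indexes are built only once, against final row locations.
-- Column names here are hypothetical.
CREATE CLUSTERED INDEX IX_FactCDR_CallDate
    ON dbo.FactCDR (CallDate);

CREATE NONCLUSTERED INDEX IX_FactCDR_CustomerKey
    ON dbo.FactCDR (CustomerKey);
```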
You need to create the table partitions for the weeks that are in the initial load.
Managing Table Partitions
On a regular basis, and as quickly as possible, you will want to mark filegroups as read-only so that they won't require any more maintenance. You also need to create the partitions for the upcoming periods.
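Creating a partition for an upcoming week can be sketched as below. The partition function name PartitionByWeek comes from the example later in this chapter; the partition scheme name, filegroup name, and boundary date are assumptions for illustration:

```sql
-- Sketch: extend the weekly partition function for an upcoming week.
-- Scheme, filegroup, and boundary value are hypothetical.
ALTER PARTITION SCHEME PartitionSchemeByWeek
    NEXT USED FG_2005_W02;          -- filegroup that will hold the new partition

ALTER PARTITION FUNCTION PartitionByWeek()
    SPLIT RANGE ('09-Jan-2005');    -- boundary for the new week
```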
In the following section, we discuss some of the other activities necessary for keeping your system working well, such as the maintenance of the indexes of the active partitions, how to move old data out of the table, and an optimal backup strategy that avoids repetitively backing up unchanged historical data.
If data in a partition is changing frequently, the indexes can become fragmented and index pages will not be full, which reduces performance. After you have populated a partition with all the data for its prescribed time period, consider performing an index reorganization to remove the empty space on index and data pages. One of the advantages of partitioning a table is that you can perform index maintenance on a single partition. If most of your partitions are historical, and so they haven't changed, they will not need to have any index maintenance performed on them. You only need to consider the current partition for maintenance.
You can tell how fragmented an index is by using the system function sys.dm_db_index_physical_stats. For example, to see the fragmentation of the indexes on the Call Detail table, execute this statement:
SELECT index_id, avg_fragmentation_in_percent FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID(N'dbo.[Call Detail]'), NULL, NULL, N'DETAILED');
You can view index fragmentation, or rebuild or reorganize an index, in SQL Server Management Studio by right-clicking the index in the Object Explorer and choosing Rebuild or Reorganize, as shown in Figure 11-10. As a rule of thumb, you would choose to reorganize an index if the fragmentation were less than 30 percent, and rebuild it if the fragmentation were greater than 30 percent. You can target all indexes by right-clicking the Indexes node rather than a single index.
Figure 11-10. Managing fragmentation with Management Studio
If you are used to using DBCC DBREINDEX or DBCC INDEXDEFRAG to optimize your indexes, you should switch to the new commands ALTER INDEX ... REBUILD and ALTER INDEX ... REORGANIZE. The DBCC commands do not support online index operations and will not be available in future releases of SQL Server.
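Because the table is partitioned, you can limit either command to the current partition. A sketch, where the partition number is illustrative (in practice you could compute it with the $PARTITION function):

```sql
-- Reorganize when fragmentation is modest; rebuild when it is heavy.
-- The partition number 260 is hypothetical.
ALTER INDEX ALL ON dbo.FactCDR
    REORGANIZE PARTITION = 260;

ALTER INDEX ALL ON dbo.FactCDR
    REBUILD PARTITION = 260;
```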
Marking a Filegroup Read-Only
When a partition contains all the data for its time period, you can mark the filegroup where it resides as read-only. After you've performed a final backup of the partition's filegroup, you will never have to back up this data again.
You set the read-only flag on a filegroup using the Management Studio. Right-click the database and choose Properties. Next, select the Filegroups page and check the Read-only box. Then choose OK.
Users cannot be connected to the database when setting the Read-only flag on a filegroup, so you will have to perform this operation during a scheduled maintenance period.
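You can also set the flag with T-SQL during the maintenance window. The database and filegroup names below are assumptions for illustration:

```sql
-- Sketch: mark the filegroup for a completed week read-only.
-- Requires exclusive access to the database in SQL Server 2005,
-- so run it during a scheduled maintenance period.
-- Database and filegroup names are hypothetical.
ALTER DATABASE CallDetailDW
    MODIFY FILEGROUP FG_2004_W52 READ_ONLY;
```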
Deleting a Relational Partition
When it is time to remove old data, you can delete the data one partition at a time. This is a highly efficient way of removing data because the rows are never individually logged or deleted, but it also means you can't undo the operation.
You remove the data in a partition by using the ALTER TABLE ... SWITCH PARTITION command to switch the partition to a nonpartitioned table and then dropping that table. For example, to switch the partition containing the data for Dec 31, 2004, to a nonpartitioned table that you can subsequently drop, execute this command:
ALTER TABLE FactCDR SWITCH PARTITION $PARTITION.PartitionByWeek('31-Dec-2004') TO EmptyTable
EmptyTable must already exist, have the same schema as the FactCDR table, and be empty. This command executes very quickly, as does the subsequent DROP TABLE command.
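The full sequence can be sketched as below. The column definitions and filegroup name are illustrative assumptions; the target table must match FactCDR's schema exactly and reside on the same filegroup as the partition being switched out:

```sql
-- Sketch: create the target, switch the old partition into it, drop it.
-- Columns and filegroup are hypothetical.
CREATE TABLE dbo.EmptyTable (
    CallDate    datetime NOT NULL,
    CustomerKey int      NOT NULL
    -- ...remaining columns, matching FactCDR exactly
) ON FG_2004_W52;   -- same filegroup as the partition being removed

ALTER TABLE dbo.FactCDR
    SWITCH PARTITION $PARTITION.PartitionByWeek('31-Dec-2004')
    TO dbo.EmptyTable;

DROP TABLE dbo.EmptyTable;  -- both operations are metadata-only and fast
```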
Backing Up the Data Warehouse
Backup works with databases or filegroups, not partitions. Because we designed our partitions to each occupy their own filegroup, you can back up each filegroup independently. In addition, we have marked historical partitions to be read-only, so we don't have to back them up more than once. Our backup and restore times are now shorter.
In our example, source data arrives in frequent batches throughout the day. Because we have the source data available, we have the option of rebuilding the data from the last full backup and reloading the source data files. This would allow us to use the simple recovery model. In our example, we encourage using a bulk-logged recovery model, for manageability and faster recovery times. It is much easier to determine which log files need to be applied to restore the database than to determine which data files need to be reloaded. It is also faster to restore the database from log files than to reload the data.
If you do infrequent bulk loading of data into your data warehouse, you may be able to set the database recovery model to Simple. By infrequent, we mean not more than once or twice per day. This can reduce the amount of space consumed by the log file during a daily load. The more frequently you load data, the smaller the log file, but you will have to do more full backups, too. If you have not partitioned your warehouse so that the partition for "current" data is small, there may be little or no advantage in doing backups after loading data. To take advantage of the Simple recovery model, you must perform a backup of the modified partition immediately after you perform a load. If you do any loading by trickling in data, you should not use the Simple recovery model; instead, choose Full, and then you will also need to perform log backups on a regular basis.
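Setting the recovery model is a one-line statement. The database name is an assumption for illustration:

```sql
-- Sketch: match the recovery model to the loading pattern.
-- Database name is hypothetical.
ALTER DATABASE CallDetailDW SET RECOVERY BULK_LOGGED;  -- frequent bulk loads

-- Or, for infrequent loads followed by an immediate partition backup:
-- ALTER DATABASE CallDetailDW SET RECOVERY SIMPLE;
```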
Table 11-1 presents our backup plan, which minimizes the amount of data you need to back up and provides for a faster recovery in the event you have to restore the data. Note that you do not have to back up the historical partitions.
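A filegroup backup of only the current, writable partition can be sketched as below; each read-only filegroup needs a backup like this just once, right after it is marked read-only. The database, filegroup, and path names are illustrative:

```sql
-- Sketch: back up only the filegroup that can still change.
-- Names and paths are hypothetical.
BACKUP DATABASE CallDetailDW
    FILEGROUP = 'FG_2005_W01'   -- current, writable weekly partition
    TO DISK = 'E:\Backup\CallDetailDW_FG_2005_W01.bak';
```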
Managing Cube Partitions
Partitions are continuously changing in an Analysis Services solution. You receive new data, want to archive old data, move current data to historical partitions, or perhaps your users have changed their query patterns and you need to reoptimize for those new queries. In this section, we discuss the ongoing management of partitions.
Remember that any changes to partitions that are made using SQL Server Management Studio are not reflected back into your Visual Studio projects. You will have to create an "Import Analysis Services 9.0 Database" project to create a project that matches the current design. If you deploy the original project, you will overwrite any changes you have made to the cube.
Populating a Cube Partition
When you process a partition, it reads the data specified by the source you defined when you created the partition. You can start processing a partition manually, under program control, or by a task in an SSIS package. You only need to populate partitions where the data has changed.
Processing New Facts
The ETL processes load data into one of the four recent weekly partitions. We fully process these partitions to bring in the new transactions. We can get away with this because it is only four weeks' transactions, not the entire five years. The prior periods in the relational table are read-only and therefore don't need processing.
Processing Updated Dimensions
A property you will want to set to improve processing times is RelationshipType. This is a property of an attribute relationship in a dimension; for example, it is a property of the Quarter relationship to Month in the Date dimension. The property has two values: Flexible and Rigid. Rigid relationships offer better query performance, whereas Flexible relationships offer better processing performance in a changing dimension. If a member is unlikely to move in the hierarchy, as is the case with dates, you can set the RelationshipType to Rigid. If a member does move in a rigid relationship, an error is raised. The default is Flexible.
You can initiate dimension processing manually by right-clicking the dimension in the Solution Explorer and choosing Process. You can also use an SSIS task or an XML/A command to process a dimension.
Merging Cube Partitions
You can merge smaller partitions into a larger historical partition to reduce your administrative overhead, although at some point you will see a degradation in performance.
To merge one partition into another, the partitions need to have similar aggregation designs. The way we created our partitions, the aggregation design is inherited from the previous partition, so your designs should match up properly. If you want to change the aggregation design of a partition, wait until it is "full," that is, until you aren't going to add any more data to it. Then you can redesign the aggregations, because the partition is not going to be merged with any other partition. If you want to copy the aggregation design from another partition, you can do that in Management Studio. Drill down into the cube, through the measure group, down to the partition you want to change. One of the right-click options is to copy the aggregations.
Deleting Cube Partitions
The business requirements state that at least five years of data should be kept. At the end of each year, you can remove the oldest year partition, which now contains the fourth year of data.
You can delete a partition using Management Studio by right-clicking the partition and then choosing Delete.
Usage-Based Optimization
The aggregations built initially are based on nothing more than an educated guess by Analysis Services, using factors such as the number of members in each dimension. What you really want is an aggregation design based on actual usage. Fortunately, Analysis Services can track the queries being submitted by your users. You can use the Usage Based Optimization (UBO) Wizard to read the query log and design a better set of aggregations based on actual usage. UBO is performed on a per-partition basis.
Before you can do UBO, you need to start collecting data in the query log. You enable this in the properties of the Analysis Services server. Look for the property Log \ Querylog \ CreateQueryLogTable and set it to true. Next, set the value of Log \ Querylog \ QueryLogConnectionString to point to a SQL Server table where you want to record the queries to be used for profiling. You want to collect data for a long enough period to provide a good cross-section of queries, but it's not recommended that you run with logging enabled all the time.
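You can inspect the collected queries directly with T-SQL. This sketch assumes the default log table name, OlapQueryLog, in the database named by QueryLogConnectionString; verify the table and column names on your server:

```sql
-- Sketch: review the most recent logged queries before running UBO.
-- Table and column names assume the SQL Server 2005 defaults.
SELECT MSOLAP_User, Dataset, StartTime, Duration
FROM dbo.OlapQueryLog
ORDER BY StartTime DESC;
```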
After you've collected data, you can have the UBO Wizard analyze the log and recommend some aggregations. You start the UBO Wizard from SQL Server Management Studio. You need to drill down from the cube, through the measure groups, to locate the partitions you want to optimize. Right-click a partition, or the Partitions node, and choose Usage Based Optimization. In the wizard, you can filter the data in the query log by date, user, or set of most frequent queries.