Our customer faces the competing demands of faster query performance and providing more detail in the data. The increased level of detail will increase the data volume and thus reduce performance if nothing else changes. Based on the business requirements outlined below, we will design a solution that takes advantage of some new features in SQL Server 2005 to provide the desired improvements and still operate on the same hardware platform.
The business has provided requirements about the increased level of detail they need in the data warehouse, about the longer hours of availability, and operational goals to reduce maintenance time and resource utilization.
Data Volume and Level of Detail
The business needs to analyze the data across all dimensions. The data is provided in a call detail record (CDR). The solution will eventually need to handle billions of rows in the fact table and be able to load and transform millions of facts per day. Recent data is queried far more often and at more detailed levels than historical data. However, queries summarizing current data and comparing it with parallel historical periods are also common, and need to perform well.
Data is kept for five years before being archived.
Data is generated nonstop, 24 hours per day. Analysts in three time zones expect access between 6 a.m. and midnight. At best, three hours are available during which the solution can be offline. However, taking the database offline would create a serious backlog of data waiting to be loaded, so this is not a viable option. Backups and other maintenance procedures still need to be performed, but with minimal impact on performance.
The operations staff who works with the data warehouse need the capability to easily remove and archive older data as it becomes less relevant, without requiring extensive reprocessing. Also, backups in general consume a lot of disk space before being transferred to offline storage. This space needs to be cut in half.
Our customer's database is essentially overwhelmed by the data volume, forcing a reduction in detail just to be able to keep up. Much of the problem can be attributed to the time required to back up the data and reorganize the indexes. You might be asking yourself why we need to back up and perform index maintenance on all the data. After all, haven't we already done that for all but the most recent data? The historical data hasn't changed; we are just adding new data. If we could find a way to do the maintenance only on the new data, we could reduce the time for maintenance to almost zero. Figure 11-1 shows the components that are influential in supporting a VLDB.
Figure 11-1. Major components of a VLDB solution
Partitioning the Problem
The core of the architecture supporting very large databases is partitioning of the relational tables and OLAP cubes. Each partition can be treated somewhat independently. This provides the opportunity to design shorter maintenance operations, faster backups, higher performance-loading OLAP cubes, and potentially more efficient querying.
In SQL Server, a partitioned table appears as a single table to a user, but the rows in the table are physically divided across a number of separate partitions. Partitioning the table can offer better performance through parallel operations. Partitions make it easier for you to manage the table because you can work with subsets of the data rather than the entire table.
Using multiple partitions associated with multiple filegroups in the data warehouse, we can dramatically improve backup times and maintenance times. A filegroup is simply a collection of files that you can use to distribute data across your available storage. Using filegroups, we also save a substantial amount of disk space that would be consumed by a full database backup. This comes from our capability to mark partitions read-only, which we will never have to back up or reorganize the indexes more than once. We only have to back up and maintain a relatively small "current" partition.
The cubes in the Analysis Services database will be partitioned in a manner similar to that of the relational database. This will allow us to import just the new data into a "current" partition, and process only the data in that partition. All the historical data will not need to be processed. The data in the cube will be online continuously, and only a small amount of resources will be consumed in the reprocessing.
ETL for Large Volumes
In our ETL processing, we will use the native high-speed pipeline architecture of Integration Services where it is advantageous, but trade some of the clarity and maintainability of using Integration Services transformations to take advantage of the performance of set operations performed by SQL queries.
One thing we need to do to guarantee good load performance is to make sure only the changes to the data are loaded, whether it is new facts or dimensions. At least, we should only have to reload a very small portion of existing data. We impose a business rule that no fact data is ever updated, because that would mean reprocessing all the data in the same partition. If a change is needed, a new fact containing the positive or negative adjustment required is used, as described at the end of Chapter 8, "Managing Changing Data." If dimensions change, consider whether they are Type 2 changing dimensions, where you create a new instance of the member rather than change an existing member's attributes.
Querying Large Volumes of Data
With the capability to manage larger volumes of data, we can now support the storage of more detailed data. However, this leads us into our next problem: longer query times.
As you have seen in previous chapters, Analysis Services is designed to precalculate aggregations of values to one or more levels, such as summing up the call duration to the month or quarter level. This means that a query no longer has to touch each row to determine the duration of all the calls per quarter. By using Analysis Services and taking advantage of the aggregations, we can meet the performance criteria for most queries. Analysts will be able to use Excel, Reporting Services, or other third-party tools to query the Analysis Services databases.
Dealing with Resource Constraints
Before you add or change resources to improve performance, it is important that you determine what constraints your application is experiencing. Usually, the bandwidth of the path to and from your mass storage, memory for caching data, and processor cycles is the resource constraint. The bottleneck tends to occur during the access to mass storage, because it is the slowest of all the resources. However, it is important that you monitor your system to determine which resources are hitting its limit. You can use Windows System Monitor to view the utilization of resources. In addition, you want to make sure that your application is using efficient queries and techniques. You can use SQL Profiler to look for long-running queries.
To improve immediate performance, increase the memory of the data cache. This is often the best choice. In a 32-bit system, your available memory is constrained by the architecture. This is not the case with 64-bit architecture. In the near future, 64-bit processors will be the norm, and you won't need to decide whether you need one, or you can get by with a 32-bit processor. Currently, price and some application restrictions mean you still need to evaluate this option. In our example, we were not constrained by memory. Yes, there is a large volume of data, but much of it is historical and not referenced often. Our dimensions are relatively small, and our partitioning scheme means we do not have to reprocess much data. We therefore can continue to use a 32-bit architecture.
The solution will deliver the following benefits to the client: