Managing the Solution | Practical Business Intelligence with SQL Server 2005

No matter how well designed your data warehouse structure is, the success of the data warehouse as a business solution mostly depends on the management process that supports it. Users will only start to integrate the data warehouse into their work when they can rely on a consistently valid source of information that is available whenever they need it.

Deployment

By the time users get access to the data warehouse, all the information that it contains must have been completely validated. You usually only get one chance to get this right, because deploying a data warehouse that contains incorrect data will inevitably lead the users to question the reliability of the information long after any initial problems have been corrected.

Planning

Your development and deployment plan must include testing and an extensive audit of both the numbers and dimension structures. Dimension hierarchies are just as important to validate as numeric measures because incorrect structures will lead to invalid subtotals, which are as damaging as missing or incorrect source data.
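As a rough illustration of this kind of audit, the following Python sketch checks two things on a small, made-up product dimension: that every member rolls up to exactly one parent, and that the subtotals by parent add up to the source total. The data and the names `dimension_rows`, `fact_rows`, `find_ambiguous_members`, and `subtotals_by_parent` are hypothetical; a real audit would run equivalent queries against the data warehouse and the source system.

```python
from collections import defaultdict

# Hypothetical dimension rows: (member, parent). "Road Bike" is deliberately
# mapped to two parents to show the kind of defect an audit should catch.
dimension_rows = [
    ("Mountain Bike", "Bikes"),
    ("Road Bike", "Bikes"),
    ("Road Bike", "Accessories"),
    ("Helmet", "Accessories"),
]

# Hypothetical fact rows: (member, sales amount).
fact_rows = [("Mountain Bike", 500), ("Road Bike", 300), ("Helmet", 50)]


def find_ambiguous_members(rows):
    """Return members that roll up to more than one parent."""
    parents = defaultdict(set)
    for member, parent in rows:
        parents[member].add(parent)
    return {m: sorted(p) for m, p in parents.items() if len(p) > 1}


def subtotals_by_parent(dim_rows, facts):
    """Join facts to the dimension and sum by parent, as a report would."""
    totals = defaultdict(int)
    for member, amount in facts:
        for dim_member, parent in dim_rows:
            if dim_member == member:
                totals[parent] += amount
    return dict(totals)


source_total = sum(amount for _, amount in fact_rows)
subtotals = subtotals_by_parent(dimension_rows, fact_rows)

print(find_ambiguous_members(dimension_rows))
print(source_total, sum(subtotals.values()))
```

In this example the duplicated member inflates the subtotals (1150 against a source total of 850), even though every individual fact row is correct.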

If you cannot verify and correct the integrity of some of the data, often the best solution is to leave it out of the data warehouse completely for this release and defer it to a "phase 2" that contains the additional information. The closer you are to launch, the more politically difficult it becomes to cut features in this way, so you should start the auditing process early in the project to identify any potential problems as soon as possible.

Specifying the Database Settings

In general, SQL Server's default settings work well for data warehouse databases and don't require many changes. However, a few areas benefit from adjustment.

Recovery Model

Each database has a "Recovery Model" option that you can use to configure how transactions are logged, which can have a major impact on performance. Because most databases are used for capturing transactions, SQL Server defaults to the Full recovery model, which ensures that all transactions are kept in the log, allowing administrators to restore a failed database to any point in time.

For data warehouses, there is often only one large periodic update, and the database administrators control when it occurs. For this reason, it is often possible to use the better-performing Simple recovery model for data warehouses. In the Simple recovery model, only the data files need to be backed up and not the transaction logs, and log space is automatically reclaimed, so space requirements may be reduced. However, databases can only be recovered to the end of the latest backup, so you need to synchronize your backup strategy with your data loads, as described in the "Operations" section.
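In T-SQL, the recovery model is changed with the ALTER DATABASE ... SET RECOVERY statement. The Python sketch below only builds that statement as a string, so that it can be reviewed or handed to whatever deployment tool you use. The database name `SalesDW` and the helper names `quote_identifier` and `set_recovery_model_sql` are assumptions made for illustration.

```python
VALID_RECOVERY_MODELS = {"FULL", "BULK_LOGGED", "SIMPLE"}


def quote_identifier(name: str) -> str:
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def set_recovery_model_sql(database: str, model: str) -> str:
    """Build the ALTER DATABASE statement that switches the recovery model."""
    model = model.upper()
    if model not in VALID_RECOVERY_MODELS:
        raise ValueError(f"Unknown recovery model: {model}")
    return f"ALTER DATABASE {quote_identifier(database)} SET RECOVERY {model};"


# SalesDW is a hypothetical data warehouse database name.
print(set_recovery_model_sql("SalesDW", "simple"))
```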

Disks

The issue of where to put the database files can get complicated, especially now with the wide availability of SAN (storage area network) technology. In general, however, a good strategy is to store the data files and log files on physically separate disk drives. For data warehouses, this will improve the performance of your data-load process. It is easier to set the locations of files in the dialog when you first create the database because moving them afterward will require some knowledge of the ALTER DATABASE command.
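To give a sense of what moving a file afterward involves, here is a minimal Python sketch that lists one common sequence of steps for relocating a file of a user database: take the database offline, move the physical file, point SQL Server at the new location with ALTER DATABASE ... MODIFY FILE, and bring the database back online. The database name, logical file name, and path are hypothetical, and the sketch only produces the statements as text; check the procedure against the documentation for your version before using it.

```python
from typing import List


def move_file_sql(database: str, logical_name: str, new_path: str) -> List[str]:
    """Return the T-SQL steps for relocating one file of a user database."""
    escaped_path = new_path.replace("'", "''")  # escape quotes in the literal
    return [
        f"ALTER DATABASE [{database}] SET OFFLINE;",
        "-- Move the physical file to the new location in the operating system.",
        f"ALTER DATABASE [{database}] MODIFY FILE "
        f"(NAME = {logical_name}, FILENAME = '{escaped_path}');",
        f"ALTER DATABASE [{database}] SET ONLINE;",
    ]


# Hypothetical example: move the log file of SalesDW to a separate drive.
for step in move_file_sql("SalesDW", "SalesDW_log", r"L:\Logs\SalesDW_log.ldf"):
    print(step)
```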

Maintenance

SQL Server 2005 is generally self-tuning and performs many maintenance tasks automatically. However, you will need to schedule some maintenance tasks yourself, including backups, database integrity checks, and index maintenance such as rebuilds. You can include these tasks in a maintenance plan, which can be scheduled to run automatically.

Setting Up a Maintenance Plan

Maintenance plans in SQL Server 2005 are built on top of Integration Services, which means you have a lot of flexibility when it comes to designing the flow of events. You can also use maintenance plan tasks in regular Integration Services packages, so you could include them as part of your daily or weekly build processes.

You can design a maintenance plan from scratch, but SQL Server 2005 includes a Maintenance Plan Wizard (see Figure 3-11) that walks you through most of the options to create a fully functional plan. You can access the wizard in SQL Server Management Studio's Object Explorer by right-clicking Maintenance Plans under the Management folder. Before you run the Maintenance Plan Wizard, the SQL Server Agent service must be running, so you might need to run the Surface Area Configuration tool to enable and start this service; by default, it is not enabled.

Figure 3-11. Maintenance plan editor

Versions of the Data Warehouse

Because a data warehouse consists of just a database and a set of processes to load the data, it is tempting to ignore all the versioning headaches that application developers have to suffer and just make any required changes directly to the production system. These changes could take the form of adding new columns, modifying the data load procedures, or even adding brand-new business processes.

The problem with that approach is that unless you stick to a clearly defined cycle of develop, test, and release, your data warehouse quality will inevitably suffer. Even when you need to perform the occasional high-priority fix to data loading routines when a bug is identified, this should still be tested in a development environment before deployment.

Operations

The so-called backroom activities of loading data, monitoring the database, and performing backups are the key activities required to keep the data warehouse operational. Chapter 4, "Building a Data Integration Process," covers the load process in detail.

Backing Up the Database

Using the Simple recovery model for our databases means that the transaction logs only contain currently executing transactions, so we only have to concern ourselves with backing up the data files. Because in a data warehouse we are usually in complete control of when data is changed, we can arrange the backups as part of the load process.

Backing up all the data in a database is known as a full backup, and scheduling a full backup on a periodic basis (such as once a week) is a good idea. You should, of course, follow the commonsense rules of handling computer backups, such as making sure they are stored in a separate location from the database to protect against drive failure.

If we are loading data on a daily basis and only performing a full backup once a week, we risk running into trouble if a failure occurs in the middle of the week. One way to solve this is to perform a full backup after every data load, but taking a full backup can be a time-consuming exercise and creates large backup files. SQL Server provides a useful feature to handle this, called differential backups (see Figure 3-12).

Figure 3-12. Differential backup

A differential backup only backs up data that has changed since the most recent full backup. (Actually, it backs up slightly more than that because all extents that contain changed pages are backed up, but this is just a technicality.) This leads to smaller backups and faster processing, so a standard approach would be to perform a full backup once a week and differential backups after every data load.
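The weekly-full, daily-differential approach can be sketched as a small Python function that decides which kind of backup to take after a given load and builds the corresponding BACKUP DATABASE statement. The choice of Sunday for the full backup, the database name `SalesDW`, the backup folder, and the file-naming scheme are all assumptions made for illustration.

```python
from datetime import date


def backup_sql(database: str, backup_dir: str, load_date: date,
               full_backup_weekday: int = 6) -> str:
    """Build the backup statement to run after the load on load_date.

    full_backup_weekday uses date.weekday() numbering (Monday=0, Sunday=6).
    """
    stamp = load_date.strftime("%Y%m%d")
    if load_date.weekday() == full_backup_weekday:
        # Weekly full backup: the base for the following differentials.
        return (f"BACKUP DATABASE [{database}] "
                f"TO DISK = '{backup_dir}\\{database}_full_{stamp}.bak';")
    # Any other day: back up only what changed since the last full backup.
    return (f"BACKUP DATABASE [{database}] "
            f"TO DISK = '{backup_dir}\\{database}_diff_{stamp}.bak' "
            f"WITH DIFFERENTIAL;")


# Hypothetical names; 1 January 2006 was a Sunday, 4 January a Wednesday.
print(backup_sql("SalesDW", r"B:\Backups", date(2006, 1, 1)))
print(backup_sql("SalesDW", r"B:\Backups", date(2006, 1, 4)))
```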

Restoring the Database

An up-to-date database backup is a valuable tool not only in case of system failure but also when an issue with the load process occurs. Many data load processes do not operate in a single transaction, so any problems with the load could leave your database in an inconsistent state. To recover the data, you will need a full backup as well as a differential backup that brings the database back to the point just after the most recent successful load.
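The following Python sketch shows one way to pick the backups needed for such a recovery: the latest full backup taken before the failure, followed by the latest differential taken after that full backup. It then builds the matching RESTORE DATABASE statements, using NORECOVERY on the full backup so the differential can still be applied. The `Backup` record, the file paths, and the dates are hypothetical.

```python
from datetime import datetime
from typing import List, NamedTuple


class Backup(NamedTuple):
    kind: str            # "full" or "diff"
    finished: datetime   # when the backup completed
    path: str            # backup file location


def restore_plan(database: str, backups: List[Backup],
                 failure: datetime) -> List[str]:
    """Build RESTORE statements for the state just before the failure."""
    usable = sorted((b for b in backups if b.finished < failure),
                    key=lambda b: b.finished)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("No full backup available before the failure")
    base = fulls[-1]
    # Only differentials taken after the chosen full backup build on it.
    diffs = [b for b in usable
             if b.kind == "diff" and b.finished > base.finished]

    if not diffs:
        return [f"RESTORE DATABASE [{database}] "
                f"FROM DISK = '{base.path}' WITH RECOVERY;"]
    return [
        f"RESTORE DATABASE [{database}] "
        f"FROM DISK = '{base.path}' WITH NORECOVERY;",
        f"RESTORE DATABASE [{database}] "
        f"FROM DISK = '{diffs[-1].path}' WITH RECOVERY;",
    ]


# Hypothetical backup history: full on Sunday, differentials on Mon and Tue.
history = [
    Backup("full", datetime(2006, 1, 1, 23, 0), r"B:\Backups\SalesDW_full_20060101.bak"),
    Backup("diff", datetime(2006, 1, 2, 23, 0), r"B:\Backups\SalesDW_diff_20060102.bak"),
    Backup("diff", datetime(2006, 1, 3, 23, 0), r"B:\Backups\SalesDW_diff_20060103.bak"),
]

# The Wednesday load fails partway through.
for statement in restore_plan("SalesDW", history, datetime(2006, 1, 4, 22, 30)):
    print(statement)
```

Only the most recent differential is restored, because each differential contains all changes since the full backup rather than just the changes since the previous differential.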

Database operations staff should also periodically practice restoring the database as a test of the backup procedures, because you don't want to find out that there is a problem with the procedures when you are facing real data loss.

Are Backups Really Required?

The data warehouse contains a copy of data from the source systems, so could you just reload all the data into a blank database and avoid the need to back up the data warehouse? This is a bad idea for a couple of reasons, the first being that a full reload of all historical information will usually take much longer than simply restoring a backup, even if you include the time taken to make the backup in the first place.

The second and more important reason is that the source systems will almost certainly not retain history. This is sometimes misinterpreted because many OLTP systems keep transactions going back several years. However, there is more to this issue than transactions: The source system will rarely keep a copy of all the dimension records as they were at the time the transaction occurred. As you will see in Chapter 8, these "slowly changing dimensions" are valuable for analyzing information and will only exist in the data warehouse.