Most of the additional management you need to implement for increased data quality is in the operations area. You will likely be capturing source data that is not clean and forwarding it to an external application or manual process for remediation. You also want to analyze the bad data to determine what you can do to improve the source applications so that they generate more reliable data.
You deploy the Integration Services solution for data quality just as you would any other data-integration solution. Refer to Chapter 4 for a review of the deployment process for Integration Services. You have the additional tasks of deploying the application that reviews the tables of rejected data and of deploying a database to hold the records generated by ETL process auditing.
You might want to review dimension tables for late-arriving (inferred) members that have remained unresolved for an extended period. A large number of unresolved members might indicate a false assumption that the data meets the criteria for late-arriving members, or an error in data capture that is producing missing members.
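A periodic query against the dimension table can surface these stale inferred members. The sketch below assumes a hypothetical DimProduct table whose inferred-member rows carry an InferredMember flag and an InsertedDate column; adjust the table, column names, and aging threshold to match your own dimension design.

```sql
-- Find inferred (late-arriving) members that have stayed unresolved too long.
-- Hypothetical schema: dbo.DimProduct with InferredMember flag and InsertedDate.
SELECT ProductAlternateKey,
       InsertedDate,
       DATEDIFF(DAY, InsertedDate, GETDATE()) AS DaysUnresolved
FROM dbo.DimProduct
WHERE InferredMember = 1
  AND DATEDIFF(DAY, InsertedDate, GETDATE()) > 30   -- aging threshold is a policy choice
ORDER BY DaysUnresolved DESC;
```

A consistently long list from a query like this is the signal described above: either the late-arriving assumption is wrong, or the source is failing to deliver the member rows at all.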
Recurring errors indicate there is a systemic problem in the source systems. You should attempt to improve the quality of the data at the source.
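To see which errors recur, and therefore which source systems deserve remediation effort first, you can aggregate the rejected rows by source and error code. This sketch assumes a hypothetical dbo.EtlRowError table populated by the ETL error outputs; your audit schema will differ.

```sql
-- Rank recurring errors by frequency over the last 30 days.
-- Hypothetical schema: dbo.EtlRowError(SourceSystem, ErrorCode, ErrorDate).
SELECT SourceSystem,
       ErrorCode,
       COUNT(*) AS Occurrences
FROM dbo.EtlRowError
WHERE ErrorDate >= DATEADD(DAY, -30, GETDATE())
GROUP BY SourceSystem, ErrorCode
HAVING COUNT(*) > 1                 -- recurring errors only
ORDER BY Occurrences DESC;
```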
Auditing might reveal that fewer records are being written to the data warehouse than are initially selected from the source. If so, the data warehouse is incomplete and cannot provide an accurate picture. You need to determine why records are being lost and design the ETL process to handle them correctly, either retaining them or redirecting them for review and correction.
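Detecting this loss is straightforward if each package execution logs both the rows selected from the source and the rows written to the warehouse. The query below assumes a hypothetical dbo.PackageExecutionAudit table with those two counts per execution.

```sql
-- Flag executions where rows were lost between extract and load.
-- Hypothetical schema: dbo.PackageExecutionAudit(ExecutionId, PackageName,
--                                                RowsSelected, RowsInserted).
SELECT ExecutionId,
       PackageName,
       RowsSelected,
       RowsInserted,
       RowsSelected - RowsInserted AS RowsLost
FROM dbo.PackageExecutionAudit
WHERE RowsSelected > RowsInserted
ORDER BY RowsLost DESC;
```

Any nonzero RowsLost that is not accounted for by rows redirected to an error queue indicates records silently disappearing from the ETL process.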
Failing fact or dimension records are redirected to one or more queues for manual correction. The records in these queues need to be promptly reviewed, corrected, and resubmitted for inclusion in the data warehouse. (The process and application that manage the error-remediation workflow are beyond the scope of this chapter.) It is important that these records be handled rather than left in the queue, because they contribute nothing to analysis until they become part of the data warehouse. One approach is an application that presents a row to an operator for review, allows the operator to make corrections, and puts the row back into the data source for reprocessing by the ETL stage. If a row can't be corrected because, for example, the product cannot be identified, the operator should substitute a well-known product code that has been defined as the category for unknown products.
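The resubmission step of such a remediation application might be as simple as the statement below. The table, columns, status values, and the 'UNKNOWN' product code are all hypothetical; the point is that the corrected or substituted row is marked for the ETL stage to pick up again rather than being deleted or left pending.

```sql
-- Hypothetical queue of rejected fact rows: when the product cannot be
-- identified, substitute the well-known 'unknown product' code and flag
-- the row for reprocessing by the ETL stage.
UPDATE dbo.FactSalesErrorQueue
SET ProductCode = 'UNKNOWN',          -- well-known member for unidentified products
    Status      = 'ReadyToReprocess'
WHERE ErrorQueueId = @ErrorQueueId    -- row the operator just reviewed
  AND Status = 'PendingReview';
```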
If you have deemed some errors so unlikely that you chose not to handle them explicitly, be prepared for the possibility that the ETL process or cube processing might fail when they do occur. If you do not explicitly handle all potential errors, at least enable error handling in the cube so that it reports the error and stops processing.