One of the goals stated most frequently for BI applications is to deliver clean, integrated, and reconciled data to the business community. Unless all three sets of data-mapping rules are addressed, this goal cannot be achieved. Many organizations will find a much higher percentage of dirty data in their source systems than they expected, and their challenge will be to decide how much of it to cleanse.

Data Quality Responsibility

Data archeology (the process of finding bad data), data cleansing (the process of correcting bad data), and data quality enforcement (the process of preventing data defects at the source) are all business responsibilities, not IT responsibilities. That means that business people (information consumers as well as data owners) must be involved with the data analysis activities and be familiar with the source data-mapping rules. Since data owners originate the data and establish business rules and policies over the data, they are directly responsible to the downstream information consumers (knowledge workers, business analysts, business managers) who need to use that data. If downstream information consumers base their business decisions on poor-quality data and suffer financial losses because of it, the data owners must be held accountable. In the past, this accountability has been absent from stovepipe systems. Data quality accountability is neither temporary nor BI-specific, and the business people must make the commitment to accept these responsibilities permanently. This is part of the required culture change, discussion of which is outside the scope of this book. The challenge for IT and for the business sponsor on a BI project is to enforce the inescapable tasks of data archeology and data cleansing to meet the quality goals of the BI decision-support environment.
Although data-cleansing tools can assist in the data archeology process, developing data-cleansing specifications is mainly a manual process. IT managers, business managers, and data owners who have never been through a data quality assessment and data-cleansing initiative often underestimate the time and effort required of their staff by a factor of four or more.

Source Data Selection Process

Poor-quality data is such an overwhelming problem that most organizations will not be able to correct all the discrepancies. When selecting the data for the BI application, consider the five general steps shown in Figure 5.6.
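To make the data archeology step concrete, the sketch below profiles source records against validity rules of the kind a source data-mapping analysis would produce. The field names, rules, and sample records are hypothetical, and real tools apply far richer checks; this only illustrates the mechanical part of finding bad data.

```python
# Hypothetical data archeology sketch: scan source records and report
# values that violate the validity rules derived from data analysis.

def profile(records, rules):
    """Return (record_index, field, value) for every rule violation."""
    findings = []
    for i, rec in enumerate(records):
        for field, is_valid in rules.items():
            value = rec.get(field)
            if not is_valid(value):
                findings.append((i, field, value))
    return findings

# Illustrative rules: each maps a field to a validity predicate.
rules = {
    "customer_id": lambda v: isinstance(v, str) and v.isdigit(),
    "state": lambda v: v in {"NY", "CA", "TX"},  # abbreviated domain list
    "birth_year": lambda v: isinstance(v, int) and 1900 <= v <= 2025,
}

records = [
    {"customer_id": "1001", "state": "NY", "birth_year": 1975},  # clean
    {"customer_id": "A-17", "state": "XX", "birth_year": 3024},  # dirty
]

for finding in profile(records, rules):
    print(finding)
```

The output of a pass like this is only the starting point: each finding still needs a business person to decide whether it is truly a defect and, if so, what the correct value should be, which is why cleansing specifications remain a manual effort.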
Figure 5.6. Source Data Selection Process
Key Points of Data Selection

When identifying and selecting the operational data to be used to populate the BI target databases, some key points should be considered. Applying the source data selection criteria shown in Figure 5.7 minimizes the need for and effort of data cleansing.
Figure 5.7. Source Data Selection Criteria
To Cleanse or Not to Cleanse

Many organizations struggle with this question. Data-cleansing research indicates that some organizations downplay data cleansing to achieve short-term goals. The consequences of not addressing poor-quality data usually hit home when their business ventures fail or encounter adverse effects because of inaccurate data. It is important to recognize that data cleansing is a labor-intensive, time-consuming, and expensive process. Cleansing all the data is usually neither cost-justified nor practical, but cleansing none of the data is equally unacceptable. It is therefore important to analyze the source data carefully and to classify the data elements as critical, important, or insignificant to the business. Concentrate on cleansing all the critical data elements, keeping in mind that not all data is equally critical to all business people. Then, cleanse as many of the important data elements as time allows, and move the insignificant data elements into the BI target databases without cleansing them. In other words, you do not need to cleanse all the data, and you do not need to do it all at once.

Cleansing Operational Systems

When the selected data is cleansed, standardized, and moved into the BI target databases, a question to consider is whether the source files and source databases should also be cleansed. Management may ask, why not spend a little extra money and time to cleanse the source files and databases so that the data is consistent in the source as well as in the target? This is a valid question, and this option should definitely be pursued if the corrective action on the source system is as simple as adding an edit check to the data entry program.
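The triage approach described above can be sketched as a simple planning routine: critical elements are always cleansed, important elements are cleansed while the time budget lasts, and insignificant elements pass through uncleansed. The element names, classifications, and effort estimates below are hypothetical placeholders for what the source data analysis would actually produce.

```python
# Hypothetical triage sketch: decide which data elements to cleanse
# given a classification and a limited time budget.

CLASSIFICATION = {
    "customer_id": "critical",
    "account_balance": "critical",
    "phone_number": "important",
    "fax_number": "insignificant",
}

def plan_cleansing(elements, classification, hours_available, hours_per_element):
    """Split elements into those to cleanse and those to load as-is."""
    cleanse, pass_through = [], []
    budget = hours_available
    for elem in elements:
        cls = classification.get(elem, "insignificant")
        if cls == "critical":
            cleanse.append(elem)       # always cleansed, regardless of budget
        elif cls == "important" and budget >= hours_per_element:
            cleanse.append(elem)       # cleansed only while time allows
            budget -= hours_per_element
        else:
            pass_through.append(elem)  # moved into the target uncleansed
    return cleanse, pass_through

cleanse, skip = plan_cleansing(
    ["customer_id", "account_balance", "phone_number", "fax_number"],
    CLASSIFICATION, hours_available=10, hours_per_element=8)
print("cleanse:", cleanse)
print("pass through:", skip)
```

Note that elements deferred in one release are not abandoned: because cleansing proceeds iteratively, important elements that missed the budget in this release become candidates for the next one.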
If the corrective action requires changing the file structure, which means modifying (if not rewriting) most of the programs that access that file, the cost for such an invasive corrective action on the operational system is probably not justifiable, especially if the bad data is not interfering with the operational needs of that system. Remember that many companies did not even want to make such drastic changes for the now infamous Y2K problem; they made those changes only when it was clear that their survival was at stake. Certainly, a misused code field does not put an organization's survival at stake. Hence, the chances that operational systems will be fixed are slim.