Data Cleansing


One of the goals stated most frequently for BI applications is to deliver clean, integrated, and reconciled data to the business community. Unless all three sets of data-mapping rules are addressed, this goal cannot be achieved. Many organizations will find a much higher percentage of dirty data in their source systems than they expected, and their challenge will be to decide how much of it to cleanse.

Data Quality Responsibility

Data archeology (the process of finding bad data), data cleansing (the process of correcting bad data), and data quality enforcement (the process of preventing data defects at the source) are all business responsibilities, not IT responsibilities. That means that business people (information consumers as well as data owners) must be involved with the data analysis activities and be familiar with the source data-mapping rules.

Since data owners originate the data and establish business rules and policies over the data, they are directly responsible to the downstream information consumers (knowledge workers, business analysts, business managers) who need to use that data. If downstream information consumers base their business decisions on poor-quality data and suffer financial losses because of it, the data owners must be held accountable. In the past, this accountability has been absent from stovepipe systems. Data quality accountability is neither temporary nor BI-specific, and the business people must make the commitment to accept these responsibilities permanently. This is part of the required culture change, discussion of which is outside the scope of this book.

The challenge for IT and for the business sponsor on a BI project is to enforce the inescapable tasks of data archeology and data cleansing to meet the quality goals of the BI decision-support environment.


Step 5, Data Analysis, may be time-intensive because battles may rage among the business people over the valid meaning and domain of the data.

Although data-cleansing tools can assist in the data archeology process, developing data-cleansing specifications is mainly a manual process. IT managers, business managers, and data owners who have never been through a data quality assessment and data-cleansing initiative often underestimate the time and effort required of their staff by a factor of four or more.

Source Data Selection Process

Poor-quality data is such an overwhelming problem that most organizations will not be able to correct all the discrepancies. When selecting the data for the BI application, consider the five general steps shown in Figure 5.6.

  1. Identify the required data.

    Identify the data of interest and the significance of this data. Data cleansing is a collaborative effort between business analysts who are familiar with the semantics of the data and data quality analysts who know the program-specific meanings of the data (e.g., the use and meaning of a "flag" value or redefined record layouts).

  2. Analyze the data content.

    Analyze the data for content, meaning, and importance. Many organizations have accumulated massive amounts of data in files and databases. This data constitutes a prospective gold mine of valuable business knowledge and is potentially a good source for data mining. However, the quality of the data content must be assessed first, since mining dirty data is of little value. (A brief profiling sketch appears after this list.)

  3. Select the data for BI.

    Determine which data to include in the BI application. Select only the data that will meet core business requirements. Even with automated tools, the cost of assuring data quality for an all-inclusive BI decision-support environment becomes prohibitive for most organizations. Some questions to consider when selecting data appear below.

    - Is this data clean enough for decision-support usage?

    - If not, can this data be cleansed, at least partially? Do we know how?

    - Is the dirty data the reason for building this BI application? Is cleansing this data therefore mandatory?

    - How much effort will it take to figure out how to cleanse the data?

    - How much will the data-cleansing effort cost?

    - What is the benefit of cleansing the data as opposed to moving it into the BI application at the current level of dirtiness?

    - What are the data quality expectations from the information consumers and from business management in general?

  4. Prepare the data-cleansing specifications.

    The IT staff, working with the business representative, must learn the business rules needed to write the data-cleansing specifications. In essence, this is a source data reengineering process; sample specification rules appear in the sketch following this list.

  5. Select the tools.

    Select the ETL and cleansing tools. Determine whether it is appropriate and cost-effective to acquire an ETL tool, a cleansing tool, or both. Examine the suitability and effectiveness of those tools. Some data-cleansing specifications can be very complicated. Be sure the tools are capable of handling them.
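
To make steps 2 and 4 more concrete, the sketch below (in Python) profiles a hypothetical source extract and records a few cleansing rules as a specification. The file name, columns, domain values, and corrective actions are illustrative assumptions, not rules taken from any actual source system.

    # A minimal sketch of source data profiling (step 2) and a data-cleansing
    # specification (step 4); all names, domains, and actions are hypothetical.
    import csv
    from collections import Counter

    VALID_STATUS_CODES = {"A", "I", "P"}          # assumed domain for a status code
    PROFILED_COLUMNS = ["customer_id", "status", "effective_date"]

    def profile(rows):
        """Count blank values per profiled column and out-of-domain status codes."""
        blanks = Counter()
        bad_status = 0
        for row in rows:
            for col in PROFILED_COLUMNS:
                if not (row.get(col) or "").strip():
                    blanks[col] += 1
            if (row.get("status") or "").strip() not in VALID_STATUS_CODES:
                bad_status += 1
        return blanks, bad_status

    # A cleansing specification expressed as (finding, corrective action) pairs,
    # written jointly with the business representative and the data owner.
    CLEANSING_SPEC = [
        ("status not in {A, I, P}", "set to 'U' (unknown) and flag for data owner review"),
        ("customer_id is blank", "reject the record; it cannot be keyed into the target"),
        ("effective_date is blank", "default to the load date and mark the value as defaulted"),
    ]

    if __name__ == "__main__":
        with open("customer_extract.csv", newline="") as f:   # hypothetical source extract
            blanks, bad_status = profile(csv.DictReader(f))
        print("Blank counts:", dict(blanks), "| out-of-domain status codes:", bad_status)
        for finding, action in CLEANSING_SPEC:
            print(f"FINDING: {finding:26} -> ACTION: {action}")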

Figure 5.6. Source Data Selection Process



Automated tools do not eliminate the manual labor of source data analysis; they only reduce it.

Key Points of Data Selection

When identifying and selecting the operational data to be used to populate the BI target databases, some key points should be considered. Applying the source data selection criteria shown in Figure 5.7 minimizes both the need for data cleansing and the effort it requires.

  • Data integrity: How internally consistent is the data? This is the most important criterion.

    - The greater the proportion of manually entered data (data keyed in with few or no data controls, edits, and validations), the lower the integrity.

    - Programming errors also contaminate great masses of data, and do so automatically.

    - The lower the integrity, the greater the cleansing requirement.

  • Data precision: How precise is the data? This is the next most important criterion.

    - How is the data represented internally?

    - For numeric data, what is the scale and precision of the data?

    - For date data, how is it formatted?

  • Data accuracy: How correct is the data?

    - Are there edit checks in the data entry program?

    - Are dependent values cross-checked? For example, does the data entry program forbid an expiration date to precede an effective date?

    - Is there an operational process in place for correcting data?

    - Are calculated values stored? What, if any, mechanisms are in place to keep these values accurate?

  • Data reliability: How old is the data?

    - What generation is the data (month-end, weekly, daily)?

    - Was the data obtained from direct sources or from downloads?

    - Is the source of the data known?

    - Is the data a duplicate of data in another data store? If so, is the data current?

  • Data format: The closer the data is to the destination data format, the fewer the conversion requirements will be. From highest to lowest, the format priorities are:

    - Data from a relational database (e.g., DB2, Oracle)

    - Data from a nonrelational database (e.g., IMS, CA-IDMS)

    - Flat files (e.g., VSAM, ISAM)

Figure 5.7. Source Data Selection Criteria



Source data quality will be only as good as the enforcement of quality processes in the operational systems. Mandatory quality processes should include data entry rules and edit checks in programs. If those processes are not enforced or do not exist, data usually gets corrupted, regardless of whether the data is in a relational database or in an old VSAM file.
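
As one illustration of such edit checks, the sketch below (in Python) enforces the cross-field rule mentioned earlier (an expiration date must not precede its effective date) together with a simple domain check. The field names and the status-code domain are assumptions made for the example, not rules from any particular system.

    # A minimal sketch of cross-field and domain edit checks applied at data entry.
    # Field names and the status-code domain are hypothetical assumptions.
    from datetime import date

    def validate_record(effective_date, expiration_date, status_code):
        """Return a list of edit-check violations; an empty list means the record passes."""
        errors = []
        if expiration_date < effective_date:      # cross-field check
            errors.append("expiration_date precedes effective_date")
        if status_code not in {"A", "I", "P"}:    # assumed valid domain
            errors.append(f"status_code '{status_code}' is outside the valid domain")
        return errors

    # Usage: reject or flag the record before it is written to the source database.
    print(validate_record(date(2003, 6, 1), date(2003, 1, 1), "X"))
    # ['expiration_date precedes effective_date', "status_code 'X' is outside the valid domain"]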

To Cleanse or Not to Cleanse

Many organizations struggle with this question. Data-cleansing research indicates that some organizations downplay data cleansing to achieve short-term goals. The consequences of not addressing poor-quality data usually hit home when their business ventures fail or encounter adverse effects because of inaccurate data.

It is important to recognize that data cleansing is a labor-intensive, time-consuming, and expensive process. Cleansing all the data is usually neither cost-justified nor practical, but cleansing none of the data is equally unacceptable. It is therefore important to analyze the source data carefully and to classify the data elements as critical, important, or insignificant to the business. Concentrate on cleansing all the critical data elements, keeping in mind that not all data is equally critical to all business people. Then, cleanse as many of the important data elements as time allows, and move the insignificant data elements into the BI target databases without cleansing them. In other words, you do not need to cleanse all the data, and you do not need to do it all at once.
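
One way to keep this triage visible is to record each data element's classification and let it drive the cleansing decision, as in the short Python sketch below; the element names and classifications are illustrative assumptions only.

    # A minimal sketch of criticality-driven cleansing triage.
    # Element names and their classifications are illustrative, not recommendations.
    CLEANSING_ACTION = {
        "critical":      "cleanse fully before loading into the BI target database",
        "important":     "cleanse as time and budget allow",
        "insignificant": "load as-is and document the known defects",
    }

    source_elements = {
        "customer_id":    "critical",
        "annual_revenue": "important",
        "fax_number":     "insignificant",
    }

    for element, criticality in source_elements.items():
        print(f"{element:15} [{criticality:13}] -> {CLEANSING_ACTION[criticality]}")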

Cleansing Operational Systems

When the selected data is cleansed, standardized, and moved into the BI target databases, a question to consider is whether the source files and source databases should also be cleansed. Management may ask, why not spend a little extra money and time to cleanse the source files and databases so that the data is consistent in the source as well as in the target? This is a valid question, and this option should definitely be pursued if the corrective action on the source system is as simple as adding an edit check to the data entry program.

If the corrective action requires changing the file structure, which means modifying (if not rewriting) most of the programs that access that file, the cost for such an invasive corrective action on the operational system is probably not justifiable, especially if the bad data is not interfering with the operational needs of that system. Remember that many companies did not even want to make such drastic changes for the now infamous Y2K problem; they made those changes only when it was clear that their survival was at stake. Certainly, a misused code field does not put an organization's survival at stake. Hence, the chances that operational systems will be fixed are bleak.


