5.3 Investigating Causes

The next logical step in the process is to discover the causes of the inaccuracy facts. Remedies cannot be fabricated until more information is uncovered. You need to perform a thorough study, in that the causes may not be what you think they are.

This chapter is not going to cover this topic comprehensively. This is a very large topic and beyond the scope of this book. However, a snapshot of some of the approaches is given to show the types of activities required.

Investigating causes requires talking to a lot of people in a lot of organizations. Assignments to investigators must to be done based on the substance of the issues. Participants from many organizations may be needed. The data quality assurance department should not try to undergo this step entirely with their own staff. Neither should they relegate this entirely to others. It is yet another place where the need for a larger team exists that gets guidance and leadership from the data quality assurance staff.

Investigation of the cause is not always possible. For example, databases purchased from vendors may be found to be defective. It is your responsibility to notify them of the problem and give them facts. It is their job to investigate the causes and correct them.

There are two basic approaches to investigating errors: error cluster analysis and data events analysis. The first is used to narrow down the sources of errors. The second is used to study the events that cause data to be created and maintained in order to help identify the root causes of problems. They can often be used together to efficiently complete the task.

Error Clustering Analysis

This type of analysis attempts to use information in the database to provide clues as to where the inaccuracies may be coming from. It starts with information about the specific database objects containing inaccuracies. For example, in an order database, it would start by identifying those orders that contain inaccurate data or that are suspected of having inaccurate data. Although many rules about data cannot identify specific data elements that are wrong, they can identify entire orders that contain the wrong data. The collection of all orders that have wrong values or rule violations constitutes the analysis set.

The analysis set may be defined narrowly (all orders violating a single rule) or broadly (all orders violating any rule). It depends on the amount of data in the analysis set and the importance of the individual rule. There is also the concept of rules having affinity. That is, for example, all rules that deal with the initial capture of the order information (a process clustering) or all orders dealing with customer name and address information (data semantic clustering).

Once the set of data is isolated that contains offending data, all of the data elements of the isolated set are used to determine if they vary in significant ways with the general population of data.

Common data elements that may reveal significant variances are data source location (branch office, geographic region, specific sales reps), customer information (first-time customers, Internet customers), dates (specific dates, days of week, range of dates), product type or characteristics (engine pats, volatile, expensive), or process steps completed (initial entry, order shipped, invoice created). You are looking for any factor that may indicate a starting point in examining the causes of the errors. Performing error clustering analysis can shorten the search for causes significantly through performing a relatively quick and simple test of data.

Data Events Analysis

This involves a review of all processes that capture data or change data. Data takes a journey from inception to one or more databases. It may have a single process event (data entry) or a number of events. The points of examination can be any or all of the following:

data capture processes
durations in which data decay can occur
points at which data is extracted and added to a different data store
points at which data is converted to business information

Data Capture Processes

The process point at which data is captured represents the single most important place data can be made accurate or inaccurate. All data capture points need to be identified and examined. Some data is only captured once. Some is captured and then updated on an exception basis. Some data is captured and the business object updated or enhanced through a work flow process that may occur over a long period of time. Some of these points may take on multiple forms. For example, an order may be entered by the actual customer over the Internet, entered by a recording clerk from a form received in the mail, or entered by a company sales representative through a company client server application. This example shows three very different and distinct ways of entering the same business object.

Building a diagram of the data paths of a business object, identifying the distinct points of data capture, and specifying the characteristics of each is a time-consuming but extremely important task. Figure 5.2 shows some of the characteristics that need to be identified for each data capture or update point. Comments on these factors follow:

click to expand
Figure 5.2: Factors in evaluating data capture processes in the data capture environment.

Time between event and recording: In general, the longer the time differences, the greater the chance for errors. If the time lag is long enough, it also lends itself to missing or late information. Examples of long durations are cases in which forms are completed and mailed to a data entry location. The accuracy and timeliness would be enhanced if the time difference were eliminated through a more direct entry, such as through the Internet.
Distance between event and recording: Physical distance can also be a factor. This reduces the opportunity for the person who is entering the data to verify or challenge information. For example, if the originator of data is in Chicago but the information is transmitted via telephone or paper to Kansas City for entry, you have a distance between the person who knows the right information and the one entering it. If there is confusion, the entry person has to either enter nulls or enter a best guess.
Number of handoffs of information before recording: The first person to experience the event is most likely to be the one with the most accurate description of the facts. Each handoff to another person introduces the possibility of misreading written information, misinterpreting some else's comments, or not knowing information that was not passed on.
Availability of all facts at recording: If the person entering the information has no access to the event, to the person who created or observed the event, or to databases containing important auxiliary information, they cannot fill in missing information or challenge information they see. For example, it is better for HR data to be entered with the employee sitting next to the entry person, as opposed to copying information from a form. Another example is to have a search function for customer identifiers available for order entry personnel.
Ability to verify information at recording: This is similar to the previous issue, but slightly different. Can the data entry person get to correct information if they think the information provided is wrong? An HR data entry person could call or e-mail the employee if there is confusion. Sometimes the process makes it impossible to make this connection. Sometimes the process penalizes the data entry person for taking the time to verify questionable information. All entry points should allow for information to be either verified immediately or posted to a deferred process queue for later verification and correction if needed.
Motivation of person doing recording: This is a complex topic with many sides. Are they motivated to enter correct information? Are they motivated and empowered to challenge questionable information? Are they motivated to enter the information at all? Someone entering their own order is motivated to do it and get it right. Someone entering piles of form information they do not understand could not care less if the information is entered correctly or completely. Is feedback provided? Is their performance measured relative to completeness and accuracy?
Skill, training, and experience of person doing recording: People who enter the same information for a living get to learn the application, the typical content, and the data entry processes. They can be trained to do it right and to look for red flags. People who enter data on a form only one time in their life are much more likely to get it wrong. Sometimes there exists a data entry position that has not been trained in the application. This is an invitation for mistakes. Note that entry people who are making mistakes tend to make them repetitively, thus increasing the database inaccuracy level and thereby increasing the likelihood that it will be exposed through data profiling analysis.
Feedback provided to recorder: Feedback is always a good thing. And yet, our information systems rarely provide feedback to the most important people in the data path: those entering the data. Relevant information, such as errors found in computer checks, should be collected and provided to help them improve the accuracy of data they enter.
Auto-assist in recording process: Do the data entry programs and screens help in getting it right? A complex process can include pull-downs, file checking, suggestions on names, addresses, questioning of unusual options or entry information, and so on. Remembering information from the last transaction for that source can be very helpful in getting information right. Letting each data entry station set its own pull-down defaults can reduce errors. Providing the current date instead of asking that it be entered can improve accuracy. There are a lot of technology best practices that can improve the accuracy of information.
Error checking in recording process: Evaluate the checking provided by the entry screen programs, the transaction path, and the database acceptance routines. Data checkers, filters, and database structural enforcement options can all be used to catch mistakes at the entry point. These are not always easy to identify because they require someone to dig around in code and database definitions. Many times these are not documented. Many times they are thought to be true but have been turned off by a database administrator to improve performance. Many times they exist but are not applied to all points of entry.

It is important to study all factors at each entry point, even though the investigation started by focusing on a single set of inaccuracy facts. This process may reveal other inaccuracies that were hidden from the profiling process or uncover the potential for problems that have not yet occurred. It may also uncover some locally devised practices that are good ideas and may warrant propagation as a formal methodology throughout the data entry community.

Data Decay

The analyst needs to identify data elements that are subject to decay and check for process steps that exist that will mitigate decay. Identifying data decay candidates is a business analyst topic best handled as work sessions with participants from multiple departments.

If the investigation reveals that no procedures are present to prevent decay, the analyst needs to determine the extent to which decay has contributed to currently visible problems or whether it presents the potential for future problems.

Decay problems are often not observable though data profiling because the values in the database are valid even though wrong. However, process analysis may suggest that the data is susceptible to decay problems. Sampling the data and testing it through object reverification may reveal hidden problems. These can become the subject of new issues split off from those that got you there.

Data Movement And Restructuring Processes

Many errors can be introduced when data is extracted, reformatted, aggregated, and combined with other data. If the data source that was used for identifying the inaccurate data is not a primary data source, it requires examination of the processes that build that database from the primary sources.

The first question to ask is whether the problems also exist in the original data source, are part of the data movement processes, or are the result of an incompatibility with the target database structure or definition. Errors at this level often cause primary data sources to be blamed for problems not of their making.

One of the problems with this type of analysis is that the extraction, transformation, cleansing, and loading processes are often not well documented or are documented only in the proprietary repositories of individual products used for the separate steps. This requires expertise on each of these repositories and on the functions of the individual products used. This can lengthen the time required to perform the analysis.

Often data movement processes are locally developed without the aid of packaged tool software. The project team merely writes code for each step. In these cases, finding out what the team does may be difficult because much of it is probably not documented at all. This stresses the importance of being disciplined enough to create and maintain metadata repositories on all data structures: primary, intermediate, and summary. Information should also be kept on all processes that move data between them.

Review of upstream processes may be indicated by discovering information about quality problems in primary databases. This means that a situation discovered in a primary database that produces inaccurate data may lead to the discovery that upstream uses of this data are also flawed. You are basically asking the question "What is the data warehouse doing with this wrong stuff?" This process of examining known data flaws through to their final use can raise issues that were otherwise hidden.

Conversion To Information Products

Other places to look are the conversion of data from databases to reports, movement to OLAP cubes, staging data in corporate portals, and other business information products.

This type of review would normally only be done if the issue were created from concerns raised about these objects. Looking at wrong output does not always indicate that the data is wrong. The routines to extract the data and to compute from it, and the timeliness of this activity, can lead to inaccurate business information products from perfectly accurate data. Problems in the information products should be traced back through the system because they can often uncover previously hidden problems with other uses of the same data.

It should be clear that the process of identifying where errors creep into databases has many beneficial side effects. It can surface bad practices that are creating errors that were not detected in the initial analysis. It can detect bad practices that are not generating errors but have the potential for doing so. It can identify hidden problems in upstream copies of the data or uses of the data that were not known. This may lead to expanding the impacts section to include impacts already occurring and those that have not yet occurred. This process may lead to the consolidation of issues (discovery that the data entry process caused many of the issues) or creating new issues (the corporate portal is displaying flawed renditions of the data).

It may be helpful to document the bad practices independently for the benefit of future projects. Bad practices used in one application frequently find their way into other applications. The same team that implemented them in one case may have implemented them in other applications they also worked on. Having a list of bad practices can serve as a checklist of things to look for in subsequent investigations.