Data quality investigations
Facts are individually granular. This means that each rule has a list of violations. You can build a report that lists rules, the number of violations, and the percentage of tests performed (rows, objects, groups
There is a strong
number of rows containing at least one wrong value
graph of errors found by data element
number of key violations (nonunique primary keys, primary/foreign key orphans)
graph of data rules executed and number of violations returned
breakdown of errors based on data entry locations
breakdown of errors based on data creation date
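Several of the metrics listed above can be derived directly from a flat list of rule violations. The following is a minimal sketch, not a prescribed implementation: the record layout (rule, row, data element), the sample values, and the function name summarize are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical violation records: (rule_id, row_id, data_element).
# Real profiling output will have a different, richer layout.
violations = [
    ("R1", 10, "birth_date"),
    ("R1", 11, "birth_date"),
    ("R2", 10, "gender"),
    ("R3", 42, "zip_code"),
]
total_rows = 1000  # assumed number of rows that were profiled


def summarize(violations, total_rows):
    """Roll individual violations up into simple summary metrics."""
    # A row counts once, no matter how many rules it violates.
    rows_with_errors = {row for _, row, _ in violations}
    by_rule = Counter(rule for rule, _, _ in violations)
    by_element = Counter(element for _, _, element in violations)
    return {
        "rows_with_errors": len(rows_with_errors),
        "pct_rows_with_errors": 100.0 * len(rows_with_errors) / total_rows,
        "violations_by_rule": dict(by_rule),
        "errors_by_element": dict(by_element),
    }


summary = summarize(violations, total_rows)
```

The per-rule and per-element counts are the raw material for the graphs; breakdowns by data entry location or creation date would follow the same pattern if those attributes were carried on each violation record.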
The data profiling process can yield an interesting database of errors derived from a large variety of rules. A creative analyst can turn this into
Metrics can be useful. One use is to
Metrics can also be useful to show improvements. If data is profiled before and after corrective actions, the metrics can show whether the quality has improved or not.
Another use of metrics is to qualify data. Data purchased from outside the corporation, such as demographic data, can be subjected to a quick data profiling process when it is received. Metrics can then be applied to generate a qualifying grade for the data source. The grade can help determine whether you want to use the data at all. It can also be used to negotiate with the vendor providing the data, and it can be the basis for penalties or rewards.
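A qualifying grade can be as simple as a banding of the detected error rate. The sketch below is one possible approach under stated assumptions: the function name qualifying_grade, the letter grades, and the threshold percentages are all hypothetical and would be negotiated for each data source.

```python
def qualifying_grade(rows_profiled, rows_with_errors,
                     thresholds=((1.0, "A"), (5.0, "B"), (10.0, "C"))):
    """Map the detected error rate of a data source to a letter grade.

    The thresholds are illustrative assumptions: each pair is
    (maximum percentage of rows with errors, grade awarded).
    """
    if rows_profiled <= 0:
        raise ValueError("rows_profiled must be positive")
    pct = 100.0 * rows_with_errors / rows_profiled
    for limit, grade in thresholds:
        if pct <= limit:
            return grade
    return "F"  # worse than every threshold


# Hypothetical vendor delivery: 1,000 rows profiled, 70 with errors.
grade = qualifying_grade(1000, 70)
```

Because the grade is computed only from the errors the rules were able to detect, it is best read as an indicator for comparison and negotiation rather than a measurement of the true error rate.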
Qualification can also be done for internal data sources. For example, a data warehousing
The downside of metrics is that they are not exact and they do not solve problems. In fact, they do not identify what the problems are; they only provide an indicator that problems exist.
Earlier chapters demonstrated that it is not possible to identify all inaccurate data even if you are armed with every possible rule the data should conform to. Consequently, you cannot accurately estimate the percentage of inaccuracies that exist. The only thing you know for sure is that you found a specific number of inaccuracies. The bad news is that there are probably more; the good news is that you found these. If the number you find is significant, you know you have a problem.
Corrective actions have these potential consequences: they can prevent recurrence of some errors that you can detect, they can prevent
The conclusion is that data profiling techniques can show the presence of errors but cannot show the absence of errors nor the number of errors. Therefore, any metrics derived from the output of profiling are
You might conclude from the previous discussion that the number of errors
Comparing metrics can also be misleading if the yardstick changes between profiling exercises. As analysts gain more knowledge about a data source, they will add to the rule set used to dig out inaccuracies. Comparing two result sets derived from different rule sets is an apples-to-oranges comparison. All presentations of quality metrics need to carry disclaimers so that readers can understand these dynamics.
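One way to keep the yardstick constant is to compare only the rules that were executed in both profiling exercises and to report the remaining rules separately. The following sketch assumes each run is summarized as a mapping from rule identifier to violation count; the function name compare_runs and the sample counts are hypothetical.

```python
def compare_runs(before, after):
    """Compare two profiling runs using only the rules common to both.

    before, after: dict mapping rule_id -> number of violations.
    """
    common = sorted(before.keys() & after.keys())
    return {
        "common_rules": common,
        "before_total": sum(before[r] for r in common),
        "after_total": sum(after[r] for r in common),
        # Rules outside the shared set are reported, not compared.
        "rules_only_in_before": sorted(before.keys() - after.keys()),
        "rules_only_in_after": sorted(after.keys() - before.keys()),
    }


# Hypothetical counts: rule R3 was added after the first exercise.
before = {"R1": 120, "R2": 40}
after = {"R1": 30, "R2": 35, "R3": 200}
result = compare_runs(before, after)
```

In this made-up case the raw totals would suggest that quality got worse (160 violations before, 265 after), while the like-for-like totals over the shared rules show 160 falling to 65. Listing the rules that appear in only one run serves as the disclaimer the reader needs.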
The following is an example of preventing recurrence of errors you never
However, the root cause is that the procedure codes are
The remedy called for having the data entered directly online by the administrators of the
Checks were put in for gender/procedure code conflicts, as well as other conflicts, such as invalid patient age/procedure code combinations. In addition, administrators were
An additional problem with metrics is that data quality assurance departments often believe that producing them is the end of their mission. They define their work product as the metrics. However, metrics identify neither the source of problems nor the solutions. To improve data quality you need to follow through on getting improvements made. Handing the responsibility for this to other departments all but guarantees that the work items will sit low on priority lists and will not get done expeditiously. The data quality assurance department needs to track and drive the issues through to solution.
Metrics are not all bad. They are often a good
Often a single fact is more shocking than statistical metrics. For example, telling management that a profiling exercise of the birth date of
The real output of the fact collection phase is a set of issues that define problems that need to be
Issues need to be recorded in a database within an issues tracking system. Each issue needs a narrative description of the findings and the facts that are the basis for the issue. It is important to identify the facts and the data source so that comparisons can be made correctly during the monitoring phase. The information needed for the data source is the identification of the database used, whether a sample or the entire database was used, the date of the extraction, and any other information that will help others understand where the facts were extracted from. In tracking the issues, all meetings, presentations, and decisions need to be recorded, along with dates and persons present.
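The record described above might be modeled along the following lines. This is only a sketch of one possible layout: the class names, field names, and sample values are assumptions for illustration and do not correspond to any particular issue tracking product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DataSource:
    database: str          # identification of the database used
    full_extract: bool     # True = entire database, False = sample
    extraction_date: date
    notes: str = ""        # anything else that explains the extract


@dataclass
class TrackingEvent:
    kind: str              # e.g. "meeting", "presentation", "decision"
    event_date: date
    attendees: List[str]
    summary: str


@dataclass
class Issue:
    title: str
    narrative: str         # description of the findings
    facts: List[str]       # facts that are the basis for the issue
    source: DataSource
    events: List[TrackingEvent] = field(default_factory=list)

    def record_event(self, kind, event_date, attendees, summary):
        """Append a dated meeting, presentation, or decision."""
        self.events.append(
            TrackingEvent(kind, event_date, list(attendees), summary))


# Hypothetical example values, purely for illustration.
issue = Issue(
    title="Example issue",
    narrative="Narrative description of the findings.",
    facts=["Rule R1 returned 120 violations"],
    source=DataSource("example_db", False, date(2024, 1, 15),
                      notes="10 percent random sample"),
)
issue.record_event("meeting", date(2024, 1, 22),
                   ["analyst", "data owner"], "Reviewed the findings.")
```

Keeping the data source details on the issue itself is what allows a later monitoring run to be matched against the same database and the same kind of extract.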