2.3 Data Accuracy Defined

Data accuracy is one of the components of data quality. It refers to whether the data values stored for an object are the correct values. To be correct, a data values must be the right value and must be represented in a consistent and unambiguous form.

For example, my birth date is December 13, 1941. If a personnel database has a BIRTH_DATE data element that expects dates in USA format, a date of 12/13/1941 would be correct. A date of 12/14/1941 would be inaccurate because it is the wrong value. A date of 13/12/1941 would be wrong because it is a European representation instead of a USA representation.

There are two characteristics of accuracy: form and content. Form is important because it eliminates ambiguities about the content. The birth date example is ambiguous because the reviewer would not know whether the date was invalid or just erroneously represented. In the case of a date such as 5 February, 1944, the USA representation is 02/05/1944, whereas the European representation is 05/02/1944. You cannot tell the representation from the value and thus need discipline in creating the date values in order to be accurate. A value is not accurate if the user of the value cannot tell what it is.

Value Representation Consistency

Two values can be both correct and unambiguous and still cause problems. For example, the data values ST Louis and Saint Louis may both refer to the same city. However, the recordings are inconsistent, and thus at least one of them is inaccurate.

Why is consistency a part of accuracy? Although the values are different, anyone looking at them would have no trouble interpreting what the values mean. The answer is that inconsistent values cannot be accurately aggregated and compared. Since much of database usage involves comparisons and aggregations, inconsistencies create an opportunity for inaccurate usage of the data. Because all databases have a base level of unintended uses through the opportunity for ad hoc queries, data is inaccurate if it is inconsistent in representation.

Change-Induced Inconsistencies

Inconsistencies can also be caused by system changes that change the way information is recorded or that change the granularity of recording. Following is an example of how change in granularity might introduce inconsistencies: A company has a color field that only records red, blue, and yellow. A new requirement makes them decide to break each of these colors down to multiple shadings and thus institute a scheme of recording up to 30 different colors, all of which are variations of red, blue, and yellow. None of the old records are updated to the new scheme, as only new records use it. This database will have inconsistency of representation of color that crosses a point in time.

These types of changes are often not documented. The point in time of the changeover may also be fuzzy, wherein not everyone switches to the new scheme at the same time. If data mining and business intelligence software is used on this database for periods of time that cross the change, results will be inaccurate.

Valid Values

The definition of a value being valid means simply that the value is in the collection of possible accurate values and is represented in an unambiguous and consistent way. It means that the value has the potential to be accurate. It does not mean that it is accurate. To be accurate, it must also be the correct value. Defining all values that are valid for a data element is useful because it allows invalid values to be easily spotted and rejected from the database. However, we often mistakenly think values are accurate because they are valid.

For example, if a data element is used to store the color of a person's eyes, a value of TRUCK is invalid. A value of BROWN for my eye color would be valid but inaccurate, in that my real eye color is blue.

Missing Values

A data element that has no value in it may be either accurate or inaccurate. For example, a missing value in the data element COLLEGE_LAST_ATTENDED would be blank if the person it applied to had never attended college. On the other hand, a BIRTH_DATE value left blank would not be accurate because all of us have birth dates.

A blank for COLLEGE_LAST_ATTENDED may be accurate or inaccurate. If the person it applied to had attended college, it would be inaccurate. This is another case of valid but not accurate.

Missing values are very problematic in a database because of this characteristic. To be accurate in recording information, an optional data element should allow encoding a value for NULL. This is properly done by creating a separate field that indicates whether the value in the first field is missing or not. In this instance, the value in the primary field would be set to blank if no college had been attended. If the recorder just did not know the answer to the question, the NULL field would be set to indicate YES, meaning that the blank in the primary field did not apply.

Unfortunately, few systems are built to allow for this distinction. Even if they were, most data entry people would not get it right all of the time anyway.

Sometimes a system is created that supports a keyword that means empty. In the previous example, it might be "never attended college." This is a bad practice because it can make queries that do COUNT, GROUPBY, and other commands difficult or impossible to formulate correctly. Thus, it leads to inaccurate query results.

Poor design within a system for not making the distinction between "No value is OK" and "I don't know the correct value" is a common source of poor data quality.

Object-Level Accuracy

The concept of accuracy also applies above the data element level. Data elements are never recorded in isolation. They are value attributes of business objects such as personnel records, orders, invoices, payments, and inventory records. The business objects represent real-world objects or events, and each consists of one or more rows of one or more tables connected through keys. Object-level inaccuracies consist of objects that are missing, have missing parts, or that exist but should not.

For example, if a repair is done to a piece of capital equipment and the repair person just failed to complete and submit a record of the action, there would be a missing row or rows in the database. It is just not there. Any decisions made from the database can be flawed if the missing information would affect the outcome. Missing objects are very difficult to detect. This is an important component of accuracy because a database may have nothing but accurate data but be an inaccurate database because of missing information.

An example of missing elements might be found for an entity consisting of a master record and one or more subrecords. For example, a master record may consist of a record describing a rental object, and the subrecords are instances of renting. If one of the rentals is done without the record being added to the database, the database still looks accurate, but in fact is not.

Missing subrecords are very difficult to detect, whereas missing master records are generally easier to detect. For example, if the rental auto is sold and the master record deleted but not the subrecords, detecting the problem is easy.

An example of objects being present that should not be is a personnel database containing a record for an employee who left the company. The record was just not deleted when it should have been. These are also generally difficult to detect.

Object-Level Inconsistencies

The concept of consistency is also found at the object level. Database are not static. Changes are being made all of the time. Large databases generally have data flowing into them from many different sources. If you have groups creating data with a different criterion for when to add an object (insert) or to remove an object (delete), you can end up with object inconsistency. This means that if you want to use the database to get information about object counts or data aggregations, you have the potential to get inaccurate results.

To guard against this, companies need to define birth and death rules for data objects and try their best to enforce them. They also need to define a method of determining when all of the data is consistent over some time period so that the data can be used intelligently. Figure 2.1 charts the scenarios discussed in this section in terms of what constitutes accurate versus inaccurate data.

click to expand
Figure 2.1: Breakdown of data within a set of data.