12.3 Types of Value Rules

12.3 Types of Value Rules

Value rules take on a large variety of forms. The data itself will suggest rules that make sense. Most of them, however, have to do with computations over the data.

Cardinality

A simple computation of the cardinality of a column along with a count of the number of rows can reveal disparities. If the column is expected to consist mostly but not always of separate values, a larger than expected disparity may indicate that there are excessive duplicates. A check of the cardinality against the number of rows can indicate if the values are reasonably varied over the data.

A higher than expected cardinality may indicate a number of things. It may indicate that you have more values than you expect. It may indicate that some data values are represented in multiple ways. It may also indicate misspellings of values.

For example, a corporation may expect that it does business in only 10 states. The database has many duplicates. However, if the cardinality of the column is determined to be 25, an investigation is needed. It may be that the state column values are inconsistently recorded, some of the state codes are mistyped, or you do business in more states than you thought. It is possible that all three things are going on in the data. The cardinality test just alerts you to dig deeper into the data.

Frequency Distributions

Another, deeper check is to select all of the distinct values in a column along with the frequency with which they occur. This list can then be sorted to generate a list by frequency of occurrence. You can compute cardinality at the same time.

This is a very useful tool. There are a number of inaccurate data situations that pop out when it is done.

A report from this can list all of the values. If the value list is too long to reasonably view, it can show the fringes: values that have the highest number of occurrences and those that have the lowest number of occurrences. A graphical chart showing the frequencies can also make it easier to spot potential problems.

This list should show all of the NULL and blank values and the frequency with which they occur within the column. These are very valuable numbers to have. The types of expectations that can be set are

  • the number of distinct values

  • the maximum frequency expected from any one value

  • the frequency expectation of all values for small value sets

  • the percentage of values expected to be NULL or blank

start sidebar

This particular test was the one used to find the large number of misspellings of the color BEIGE found in an auto insurance claim database. It not only indicated far too many values but was small enough of a list to actually spot all of the wrong values.

end sidebar

You can also spot cases in which an encoded value is not used at all. The expectation is that a value should exist for at least some number but, in fact, does not. It would be clever in this case to merge the expected value list in the metadata and include both in the result set. A frequency of zero would indicate that they were not present.

The list can be spell-checked. If the list contains words that are not in a dictionary, this can be a useful exercise.

This type of test is very useful in finding data entry problems. In addition to misspelling and alternative ways of expressing the same value, it can spot cases in which data entry people make up values when they do not know the correct value. People tend to use the same wrong value when they do this. Of course, this only applies to cases in which a single data entry person enters a lot of data. These inaccurate values show up as values with higher than expected frequencies.

start sidebar

In Internet applications that collect data that people do not consider the collector's business, this will spot the famous Donald Duck and Mickey Mouse entries in the NAME field. Even though each person enters only one form, these made-up values tend to be used by many people. A look at only duplicates beyond some number will show a variety of popular names used as aliases for people.

You may be surprised at even valid names that appear with an unusual number of occurrences. For example, if you looked at the high-frequency first names of people, you would expect to see a lot of JOHN values. If the most popular name to appear is GAVIN, you might have a problem.

You cannot set expectations on individual names, but you can look at the fringe lists on frequency and spot some very interesting facts. Once you have done this, you can set goals to eliminate bad data entry and check for this particular list of offending names in the future.

end sidebar

Extreme Values

A continuation of this theme is to look at extreme values. Extreme values are values that contain particular characters that are not expected. Extreme characters may be punctuation marks and numerics or other special characters (such as line enders).

Extreme values may be difficult to spot on a value/frequency list if the characters are embedded within the value and are not leading characters. If they are leading characters, they may turn up on the fringe of a value/frequency list. If they are embedded, they will not.

For example, any list that contains names, company names, product descriptions, and the like would not be expected to have most special characters and numbers embedded. You might expect dashes and dots, but you would not expect characters such as *&^%#@?).

You can be fooled. For example, there is a company named 12. This means that you cannot make a hard rule to exclude numbers. However, you can extract all values and their frequencies that have unexpected characters embedded within them. This should be a short list that can be visually inspected to determine if errors exist or not.

Extreme values on numbers and dates would normally be determined through the column properties' definitions by setting boundaries. However, an area where this can apply is in the computation of durations of time between two dates. Extremely short or extremely long durations may indicate a problem. In essence you can produce a value/frequency list on durations between two date columns. These could be graphed in a chart for easy spotting of high and low occurrences.

Sometimes, highest and lowest values for numbers, dates, and durations are valid values by column property definitions but are not expected in significant numbers. You would expect only a small number of purchases of office supplies to exceed $5,000. If you have a large number of these, you may need to investigate to find out why.

Group Comparisons

In the last chapter we saw rules being set for group aggregations such as COUNT, SUM, AVERAGE, and so on. Sometimes you need to just compute the values and look at them to determine if they are reasonable or not. Sometimes it is not possible to set the values precisely enough to create a rule, but you can eyeball results from a list and then decide if they require more investigation.



Data Quality(c) The Accuracy Dimension
Data Quality: The Accuracy Dimension (The Morgan Kaufmann Series in Data Management Systems)
ISBN: 1558608915
EAN: 2147483647
Year: 2003
Pages: 133
Authors: Jack E. Olson

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net