This section discusses each of the types of properties that apply to columns and shows examples of using them to find invalid values. The property list may include within it a domain definition that provides some of the properties. Not all property lists will have the same rule property types, in that some only apply to specific cases.
The business meaning of a column is a very important property. It says what should be stored in the column semantically. A simple statement of what the column is encoding is important in determining if the column is, in fact, being used for that purpose. A look at some of the data values returned from discovery, along with the
This is an area that is often overlooked. It is so simple to just look at some of the values to verify that the content appears to be what you think it is. It is amazing how people decide what is stored in a column merely from a COBOL programming label.
Programmers have often decided to use an existing column that is no longer needed to store the values for a new requirement. This rarely results in a change to the programming specifications for that column. As a result you have program documentation that says a column is one thing, when in reality the information recorded is something different.
For example, a label of SOCIAL_SECURITY_NUMBER would be expected to contain Social Security numbers. In one real case, examination of actual data content revealed that the values stored could not be Social Security numbers. Further investigation revealed them to be
Other factors besides visual inspection of the data values can be used. For example, if the lengths of the data values are much shorter than the maximum length, the column has probably been repurposed.
|
|
One of the
It is amazing how often this is not known and the column is
An unused column can contain more than one value. It may have been used for a period of time and then stopped being used. This is normally suggested by a very large number of empty values. Examination of the rows they are contained in will
Recently added columns have their own problems when being propagated. Moving data for a block of time that includes periods where the column was not used can seriously distort decision support aggregations, graphs, and
|
|
One last thing to do in establishing the business meaning is to ensure that the column has a good descriptive name in the data profiling repository. If the name provided is not representative of the content nor long enough to precisely indicate the content, a better
Storage properties are a subset of column properties defined to the database or file system that allow them to store the data in proper containers. This includes data type, length, and precision for numeric fields. The basic rules that
Most database or file systems have the notion of physical data typing that defines the form of data that can be stored in a column. Figure 8.5 shows a list of typical data types. It is assumed that the
Figure 8.5:
Typical data types.
However, all data except BINARY and DOUBLEBYTE data can be represented as character values. For example, a number column can be stored in a CHARACTER data type by just typing numeric
|
|
Even BINARY and DOUBLEBYTE data can be put into a CHARACTER column if the database or file system uses either a fixed CHARACTER or a separate length column for VARIABLE CHARACTER fields.
Only when the storage system
|
|
The reasons for storing noncharacter data in character data types are many. One is that the system being used may not support the data type of the data. This is common for DATE, TIME, and TIMESTAMP data types. It may also be true of the FLOAT data type. Flat files commonly support nothing but character columns. The famous comma-delimited format of ASCII files is an example of a format for which everything must be in character format.
A second reason is that the user may intend to store values that are not represented by the data type. For example, a DATE data type cannot be used if you intend to store blanks for
Date not known
or
99/99/9999
to mean something special. If the domain you plan to use is not pure, the database system may reject a value you want to store because it requires perfect conformance. This
Another reason is that you are accepting data from an external source that has used some of these nonconforming values in fields. If the original data source has values that do not conform, you have two choices: transform them to NULL or put them into a character column to preserve the original
The major problem with storing noncharacter data in a character data type is that it invites corruption. Because the database system will accept anything in the column, you are getting no advantage of filtering out bad values using the DBMS support. Legacy systems are notorious for this, as are external data feeds. If it can be corrupted, you can bet it will be.
|
|
Not only does the
|
|
It is important not to accept the data type as used by the database or file system when profiling data. You need to discover the true data type of the data and then filter the data against that. This is a perfect example of improving on metadata you collect in the beginning. Discovering the true metadata can be done through discovery software or through visual inspection of values. Once you are confident of the content, you can then reprocess the data to find violations. If you have a database system that enforces data types, finding the violations can be done through a LOAD step after specifying the proper data types to the target.
The length field is not usually
However, examination of length issues can sometimes be important. For character fields it is
It can point to individual inaccurate values. For example, if in a NAME column 5% of the values have a length of one character, they deserve visual examination because they are probably inaccurate
It can point to waste in storage, which is not an inaccuracy problem. If all values never exceed 5 bytes in a 30-byte-wide character field, you can save space. If all values are the same or near the same length in a VARIABLE CHARACTER column, it should probably be changed to CHARACTER. If the distribution of lengths is highly variable, converting from CHARACTER to VARIABLE CHARACTER can be helpful.
For numeric columns, the length should be commensurate with the range of values found. This can lead to shortening an INTEGER column to SMALL INTEGER to save space.
The actual precision required and number of
Once the basic storage shape of the column is determined and the business meaning
The more detailed the property definitions, the more inaccurate values can be found. Property definitions narrow the list of acceptable values. In some cases you can say simply that these are the only values allowed. In other cases it is
The types of property definitions that are possible in specifying valid values are shown in Figure 8.6. Each of these is explained in the sections that follow.
Figure 8.6:
List of valid value rule types.
Many columns support only a small list of specific values. These may be real words or encoded values that represent real words. For example, the STATE column supports only a small list of valid state codes. These may be spelled out, or they may be the standard two-character state codes. Another example is a COLOR column, which would normally be a full word but in some cases (where the
The first phase of value profiling is to validate the list of acceptable values. Instead of taking the list of values you believe to be the correct list and scanning for exceptions, it is a better practice in this phase to itemize the actual values stored, along with their frequency within the column. This approach will allow you to compare the actual values in the data to the valid value list in the rule.
In comparing the values you may discover the need to expand the valid value list. You may likewise discover that some of the values are not used at all. This may lead you to shorten the valid value list after investigation to determine the reason for the values being missing. You may also discover values that have disproportionately too many or too few values. This may lead to investigations that discover
Ranges of values are typically used on numeric, date, time, or timestamp data. A typical range might be 0 > = value < = 1000 on QUANTITY_ORDERED, or HIRE_DATE BETWEEN 01/01/1985 AND CURRENT_DATE.
Ranges can be used on character data as well. For example, if product identifiers always begin with a letter in the range of A through F, a range check on the first character can be used to screen out invalid values.
As in the case of discrete values, it is better to initially let a program determine the range of values in the data instead of processing the data against the expected range. Producing a distribution list of values may be useful but also may be impractical if the number of discrete values is too large. In this case it is wise to include in the output the values close to the edge (the 100 smallest and 100 largest values, for example).
The purpose of doing it this way is to validate the range. If all of the values are hiding well within the range, possibly the range is too liberal and could be narrowed. Likewise, seeing the invalid values on the edges may lead to widening the range.
It is also helpful to see the most frequently found values within the range. This may be what you would expect or may surface a surprise.
|
|
These examples show how an analyst needs to look within valid values to find inaccurate data that "hides under the rules." Bad practices in creating data tend to be repeated, which often lead to clues within the set of valid values. Any clues as to the existence of inaccurate data need to be surfaced and investigated.
The perfect example of this is the HR department that used their supervisor's birth date for new
|
|
|
|
A Practice used sometimes on older legacy systems is to store a dollar-and-
|
|
These rules exclude specific values within a range. They usually apply to date and time columns.
For example, a HIRE_DATE entry may have a further restriction against the date being a
These types of tests are generally considered excessive and are not done. However, if it is very important that the date be exactly correct, they may be worth doing.
Text data is the most difficult to put rules to. As a result, most data profiling efforts do no checking of the internal content of text fields. However, many text fields can have rules applied to them that get inside the column and dig out inaccurate entries.
Some text fields are actually used to store information that could be specified as another data type. For example, often numbers or dates are stored in text fields. These should be
The simplest text column is one that contains only a single word. The words that are allowed typically require no leading blanks, no embedded blanks, and no special characters, and consist of just
The
Some text fields hold one or more keywords. For example, a column may include a provision for text such as GRADE = MEDIUM,
Totally generic text columns allow just about anything. Some contain text
The best way to profile text columns is to execute a discovery program that returns its best opinion of the type of column and identifies the existence of blanks and individual special characters. This aids in determining the true characterization of the content and in determining that it is truly a CHARACTER column.
It is important to record the expected content of these columns, because moving them through extraction/transformation and load processes can cause failures over unexpected special characters if not known in advance. Many a data movement process has been torpedoed by line enders and carriage returns embedded in text columns.
Another factor to take into account is the code page the data comes from. Data coming from Europe, the Far East, or other places can include text columns with code points that may fail tests devised solely for USA text.
Sometimes text columns have single words that are a coded value for something that must follow a specific character pattern to be valid. A common example is a PRODUCT_IDENTIFIER. An example of a pattern
The first character indicates the type of product, which can be A, B, or C.
The next three characters identify the division that produces it, which are numeric digits.
The next character indicates storage requirements: R, S, H.
The next five characters are numeric digits that identify the unique product.
In these cases, a specific rule check can be constructed for the column to find all occurrences that
Patterns are never checked by the database systems and thus are subject to violation. Many times they are not documented in advance. Through special software you can discover the patterns of single-word columns. They will identify the patterns and the percentage of values that support the rule. This can be helpful in determining if a pattern exists and in confirming or
In some cases it is possible to have a column that supports multiple rules that are distinct from one another. An example might be a ZIP_CODE column that supports USA and CANADIAN pattern rules. Another example is a TELEPHONE column that supports USA, CANADIAN, and MEXICAN TELEPHONE_NUMBER patterns.
It is okay to have this condition as long as the rules can be tested with an OR condition. As we shall see in the chapter on simple data rules, it is helpful to be able to correlate the pattern rule to a value (such as COUNTRY_CODE) in another column. However, this is not required to build rule sets. For example, a PRODUCT_IDENTIFIER column may contain product codes from two different companies that have recently merged.
Some columns contain values from a well-known domain. Examples are SOCIAL_SECURITY_NUMBER, ZIP_CODE, TELEPHONE_NUMBER, and STATE. These domains can usually be
Every column has one or more rules regarding the NULL condition. The rule may be that none are allowed. It may allow values indicating the NULL condition but not restrict itself as to what they may be, leaving the door
Software can be used to scan for values that may indicate missing values. They can search for things such as blanks,?, none, **no value , and any others that are found to exist in data. If such values are found and judged to be NULL condition indicators, they all need to be documented in the repository because they must be transformed to a common indication when data is moved.
Sometimes the NULL indicators are descriptive. For example, you might find the words
Don't know, Not
A lot of the NULL
It is useful to add other information for columns in the repository that can help in understanding the
One of the factors is "confidence" factor that the data is accurate in some respect. For example, if the column is stored in a row of a relational database system and the column is declared as UNIQUE and NOT NULL, you can be assured that all values are indeed unique and that there are no nulls. This does not mean that the NULL condition has not been codified in some data value that the relational system cannot recognize as a NULL. However, along with the UNIQUE constraint, the likelihood of codified NULL conditions is very slim because they could occur at most one time.
Another example is DATE data typing in database systems that enforce valid dates. The DBMS will
The analyst may determine that the database system has a STORED PROCEDURE included that enforces ranges, discrete value lists, or patterns. Discovery of this procedure not only helps document the rules but
Extracted data screening logic from source code of the application programs is less reliable in building confidence. This is because it
Another characteristic to record is the susceptibility of the column to have its values decay in accuracy over time. This is purely an external business analyst call because there is no way of programmatically identifying these columns.
Another factor to look for is the probability of an inaccurate value in a column being recognized and fixed. This again is a purely external semantic condition of the column. For example, in a payroll application, the PAYRATE column is more likely to get recognized if wrong (and fixed) than the DEPARTMENT column. The confidence factor for the one column being accurate is higher than the confidence factor for the other column. Scorekeeping on confidence factors will lead you to placing more emphasis on rule generation and testing of columns of lesser confidence than those of higher confidence if you lack the resources to exhaustively profile every column.
Another factor to examine is whether the column has had a change in the way data is encoded at one or more points in time in the past. The result is that the data is inconsistent in representing the real world, depending on how old the information is. There will be multiple time periods of the data within each of which there is consistency but across time periods little or no consistency.
There are programmatic
To find the actual point of inconsistency you can
You can also develop a process of computing for each column the earliest date any value other than blank or zero are found. This can identify those columns that were added or new values introduced to the database, as well as the point in time they were added.
Using this approach is very time consuming and should not be used over all columns or even over very many of them. You can select columns that you are suspicious of having inconsistency points and test them this way.
It is important to understand that all data inconsistencies are not the result of changes to the way data is recorded. The inconsistencies may result from a business model change such as dropping a product line, the impact of external factors such as a
Inconsistency points need to be recorded in the data profiling repository. This will benefit potential future users by allowing them to make decisions on how far they can safely go back in the data for their intended purposes. Using data across inconsistency points can distort decision support computations.