Figure 11.1 shows the process model for profiling data rules over sets of business objects. It is substantially the same as the process used for simple data rules.
Figure 11.1: Process for profiling complex data rules.
You can use the same sources you did with simple data rules. However, there are some differences in these rules, described in the sections that follow.
These types of rules are rarely scavenged from source code. Most rules embedded in code are rules that can be applied to a single transaction. Rarely does source code perform checks that involve collections of business objects.
Some of these rules require that a period of time elapse before all data is present to process the rule. These types of rules cannot be executed on a transaction basis but must be executed whenever it is thought that the collection is complete. This means that if you intend to look for rules in source code, look for programs that run against the database at the end of a week, month, quarter, or some time interval.
Complex data rules are almost never embedded in database procedure logic. Database stored-procedure logic is generally invoked for individual transactions and is not well suited to evaluating conditions that span a collection of business objects.
Business procedures will also generally be lacking because they concern the handling of individual business objects rather than collections of them.
The most fruitful source of rules is again speculation: sitting down with the business analysts and subject matter experts and building data rules that should hold true over a collection of data.
Data rules are crafted and then executed against the data to find violations. This is not anything new. What is new is that you are looking at sets of business objects and not individual objects. This requires that the data you use to test against must contain all relevant data. Taking samples may result in violations that are not true violations. In addition, extracting the data for data profiling at the wrong time can also result in the data being incomplete relative to the grouping function. For example, if the grouping function is MONTHLY, you cannot take the data for only part of a month; you have to wait for the month to end and get it all.
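As a small sketch of that timing concern, assume a hypothetical TRIP table with a TRIP_DATE column. A query along these lines restricts the profiling extract to months that have fully elapsed, so that a MONTHLY grouping rule is never evaluated against a partial month (the date-truncation function shown is PostgreSQL's; the syntax varies by DBMS):

-- keep only trips from months that have already ended
SELECT *
FROM   TRIP
WHERE  TRIP_DATE < DATE_TRUNC('month', CURRENT_DATE);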
As before, the results need to be examined by the same team of analysts and business experts to determine if the data rule is correct and to determine the meaning of violations. Accepted data rules need to be recorded in the data profiling repository. Data violations need to be converted to data issues if they are judged to represent inaccurate data.
The data profiling process is not dissimilar to that of the previous section. You follow the same basic steps; what changes is the complexity of formulating and executing the rules.
Multiple-object rules tend to be much more complicated in their processing logic. Often it is not possible to formulate them as an SQL query. It may require a special program or script of SQL statements to be written by an application developer to test a data rule.
Formulation of the process is very important. The queries or programs must not only accurately reflect the data rule but also be optimized. Some of the data rules will require comparing every record in a table to every other record in the table. If not carefully designed, these tests can run for an unacceptably long time against large tables.
Some rules may require access to data from more than one database at the same time. This may involve a scheduling or access issue. If the data is extracted and staged for profiling, it is important that the data from all sources conform to the same time-boundary requirements.
Note how this step often requires collaboration and cooperation with application programmers and database administrators. Everything runs much more smoothly if these folks are involved in the process early and are excited about what they are being asked to do.
Data rules for multiple business objects can take a large variety of forms. This section describes some of the more common forms, along with examples of each.
One category of multiple-business-object data rules covers cases where multiple instances of the same object type cannot share the same space at the same time. An obvious example is rental records for equipment. The data rule is that a single piece of equipment cannot be rented out to more than one client at the same time. If the records show an overlap, an error exists in the data. The test is to compare the rental periods of every pair of records for the same piece of equipment and verify that they do not overlap:
IF RECORD2.END_DATE&TIME IS EARLIER THAN RECORD1.START_DATE&TIME THEN OK
ELSE IF RECORD2.START_DATE&TIME IS LATER THAN RECORD1.END_DATE&TIME THEN OK
ELSE ERROR;
Sorting the records by OBJECT_IDENTIFIER and START_DATE&TIME makes execution more efficient. This type of rule applies to anything that requires serial usage: checking books out of a library, scheduling conference rooms, billing consulting or legal hours, flying records for corporate aircraft, and so on.
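The overlap test can often be expressed as a single SQL self-join. The RENTAL table and its column names below are hypothetical; the query pairs every two rentals of the same piece of equipment and reports pairs whose periods overlap:

-- hypothetical table: RENTAL(RENTAL_ID, EQUIPMENT_ID, CLIENT_ID, START_TS, END_TS)
SELECT r1.EQUIPMENT_ID,
       r1.RENTAL_ID AS RENTAL_1, r2.RENTAL_ID AS RENTAL_2,
       r1.START_TS  AS START_1,  r1.END_TS    AS END_1,
       r2.START_TS  AS START_2,  r2.END_TS    AS END_2
FROM   RENTAL r1
JOIN   RENTAL r2
  ON   r2.EQUIPMENT_ID = r1.EQUIPMENT_ID
 AND   r2.RENTAL_ID > r1.RENTAL_ID        -- examine each pair of rentals only once
WHERE  r2.START_TS < r1.END_TS            -- the two periods overlap when each
  AND  r1.START_TS < r2.END_TS;           -- starts before the other ends

On a large table this self-join can be expensive; an index on (EQUIPMENT_ID, START_TS), or the sorted extract noted above, keeps the comparison local to each piece of equipment.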
This testing can go one step further for objects that should have a record for all or most time periods: compute the durations between one rental or checkout and the next. If these gaps are unexpectedly long, it may indicate that records are missing.
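A sketch of that gap check, using the same hypothetical RENTAL table and a SQL window function (the 30-day threshold is an assumption to adjust for your data, and interval syntax varies by DBMS):

-- gaps between consecutive rentals of the same piece of equipment
SELECT *
FROM  (SELECT EQUIPMENT_ID,
              START_TS,
              LAG(END_TS) OVER (PARTITION BY EQUIPMENT_ID
                                ORDER BY START_TS) AS PREVIOUS_END
       FROM   RENTAL) g
WHERE  START_TS - PREVIOUS_END > INTERVAL '30' DAY;   -- unexpectedly long idle period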
An interesting application that
Time may not be the only factor that causes two objects to be incompatible. Location may be as well. For example, consider an inventory database that has a column for storage location down to the bin number. If you find two different part numbers recorded in the same bin, at least one of the records is probably wrong.
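A sketch of that test, assuming a hypothetical INVENTORY table with WAREHOUSE, BIN_NUMBER, and PART_NUMBER columns:

-- bins that hold more than one distinct part number
SELECT WAREHOUSE,
       BIN_NUMBER,
       COUNT(DISTINCT PART_NUMBER) AS PARTS_IN_BIN
FROM   INVENTORY
GROUP  BY WAREHOUSE, BIN_NUMBER
HAVING COUNT(DISTINCT PART_NUMBER) > 1;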
There may be other columns whose values should not be shared across business objects even though they are not keys. For example, a personnel database may record the name of each employee's spouse, and two employee records should not show the same spouse at the same time.
The way this error happens is that one employee gets a divorce and the spouse marries someone else in the same company. The first person fails to update personnel records to show the change in marital status. When the second person updates his record, the database has an inconsistency: two employees now show the same spouse.
Anytime you have the requirement that the value of one column cannot be shared with other objects, and the column is not a structural column that would be enforced through database key processing, you probably have errors that have gone undetected. Profiling these columns will determine the extent of errors in the column and lead you to decide whether corrective actions are needed.
Aggregations are tests not on individual column values but rather on a computation over a group of objects. The types of aggregation values that can be the subject of the rule are COUNT, SUM, AVERAGE, MEDIAN, STANDARD DEVIATION, or anything else that makes sense.
These are good rules for checking completeness. For example, you may know that the number of trips taken by a truck should be greater than 10 per month; no truck in service would take fewer. The rule might be
GROUPBY TRUCK, MONTH
IF OUT-OF-SERVICE NOT YES
THEN NUMBER_TRIPS GREATER THAN 10
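A sketch of the same rule in SQL, assuming hypothetical TRIP and TRUCK tables. Note that a truck with no trips at all in a month produces no group and therefore will not appear; catching that case requires a separate check against the truck master list.

-- in-service trucks with 10 or fewer trips in a month
SELECT tr.TRUCK_ID,
       EXTRACT(YEAR  FROM tr.TRIP_DATE) AS TRIP_YEAR,
       EXTRACT(MONTH FROM tr.TRIP_DATE) AS TRIP_MONTH,
       COUNT(*)                         AS NUMBER_TRIPS
FROM   TRIP tr
JOIN   TRUCK t ON t.TRUCK_ID = tr.TRUCK_ID
WHERE  t.OUT_OF_SERVICE <> 'Y'
GROUP  BY tr.TRUCK_ID,
          EXTRACT(YEAR  FROM tr.TRIP_DATE),
          EXTRACT(MONTH FROM tr.TRIP_DATE)
HAVING COUNT(*) <= 10;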
Another way of identifying the existence of inaccurate data is to check the data in a database against corresponding data in another database that should be the same. The simplest form of correlation is lookup. You have a column in the subject database that must contain only values that are in another file or database column. This is another form of verifying that the value is in the set of acceptable values. Reasons this would not be enforced in the database itself include the lookup data residing in a different file, database, or application, beyond the reach of referential integrity constraints.
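As a sketch of the lookup form, assume a hypothetical ORDERS table whose PRODUCT_CODE must appear in a PRODUCT_MASTER reference table:

-- order rows whose product code is not in the reference list
SELECT o.ORDER_ID,
       o.PRODUCT_CODE
FROM   ORDERS o
WHERE  o.PRODUCT_CODE IS NOT NULL
  AND  NOT EXISTS (SELECT 1
                   FROM   PRODUCT_MASTER pm
                   WHERE  pm.PRODUCT_CODE = o.PRODUCT_CODE);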
Correlations can get more complicated. For example, in name and address error checking, you can use a database supplied by the U.S. Postal Service that identifies all valid combinations of street name and number, city, state, and ZIP code. Name and address correlation is best done with specialized software designed exclusively for that purpose. This is so complicated and important that entire companies have emerged to provide such software.
The other source can be either more trusted or less trusted than the subject data. If it is more trusted, the lookup source is considered correct when the two disagree, and the subject value is flagged as wrong. If it is less trusted, a mismatch tells you only that one of the two values is wrong, not which one.
You need to be careful when establishing a correlation test. Sometimes the test is not valid but appears to be so. For example, an ORDER_DETAIL record may contain a UNIT_PRICE field, as does the INVENTORY record. What appears to be duplicate data may be tempting to use for accuracy correlation. However, the UNIT_PRICE in the order detail may be different for two reasons: it reflects the unit price on the date the order was placed, not on the current date, or the customer was given a special unit price just for that order. This underscores the need to understand your data thoroughly.