10.1 Definitions

These definitions are not as conventional as are those for column and structure. Different writers use terms differently in this area. The definitions used here are those I prefer to use.

Business Object

The term business object describes a single object that data is collected for; for example, an order, an employee, or a single item in inventory. This is analogous to a type definition in programming terminology. The data for a single business object is stored in at least one row of one table. No row of any table should be describing data for more than one business object.

A business object almost always has a table within which a single row identifies the object. This is generally the primary data descriptor for the object. The object is identified within that table by its primary key. Other columns exist that provide additional facts about the object.

The object may also have rows present in other tables. These secondary tables describe additional information about the object that cannot be described in a single fact. For example, if an employee has multiple college degrees, a secondary table is needed to describe them all. In this case, multiple rows may be needed in the secondary table to supply all of the information needed for one employee. If the employee has no college degree, there are no rows in the secondary table.

Some data architects would describe the education rows of the secondary table as independent objects (employee education achievement objects). If you take this line of thought, all business objects are single row entries in tables. However, in the real world, considering the secondary parts of objects as separate objects is not terribly useful. An order is an order. It has a header and multiple items being ordered. Most queries involving the order deal with the order as a single object with all of its parts.

The secondary tables that contain part information for the object are attached to the primary table through the primary key/foreign key columns. These bring together all parts of a single object.

A business object may refer to another business object through a foreign key column. The other object provides additional information to the referencing object. However, it is not part of the object under consideration. For example, a PROJECT object may have a secondary table that identifies each of the project members, along with their role in the project. The identification of each member is done through the EMPLOYEE_ID column. This is used to obtain additional information from the EMPLOYEE database, such as the employee's name, department, phone number, e-mail address, job title, and education level. All of this information may be useful for purposes of the project. However, it would not be useful to consider the employee personnel object as being part of the project object.

Data Rule

As we have said previously, a data rule is an expression of a condition involving multiple vales that must be true over a collection of data for that data to be considered accurate. A data rule may address a single column using values from multiple rows, it may involve values from multiple columns and/or rows of the same table, or it may involve values from multiple tables. A collection of data may have many data rules that must be true. Each data rule deals with a single relationship between columns of the business object.

Data rules examine data in a static state. This means that you look at a database at a point in time when all of the business objects described by that data are expected to conform to the rules. Data rules must be cognizant of state boundaries. Data rules that operate within an application to achieve a stable state are called process rules. Data rules and process rules together are part of what is normally referred to as business rules.

Most people pretty much agree on what is a business rule and what is a process rule. However, there has emerged no specific standard regarding how a data rule is expressed. This is not to say that attempts have not been made.

It is not the purpose of this book to create a standard nor to promote any specific standard. The data rules in this book follow a common approach of expressing the logic of a rule in an SQL-like language in a form that describes what consititutes correct data. This is only used to express the concepts of rules, not to provide a syntax for formal rule specification.

Data rules are negative rules. This means that the expression says what must be true in the data. However, no one is ever interested in seeing the results of what is true, only what is not true. Therefore, the execution of data rules generally involves selecting data that is not what the rule says. Some implementations of rule execution engines have the user express the rule according to what is correct and then execute the NOT () of the rule. Others have you merely create a query statement that defines what you want to select, which is usually the data that describes a violation.

All data rules should be recorded in the data profiling repository in two parts. One part is a text description of the rule. It should state very simply the condition that is intended. The rule should be documented in a logic language in the second part. The logic specifications can be very difficult for business analysts to understand. The text description should serve as a basis for validating whether the logic is an accurate specification of the rule.

Following are two examples of data rule syntax:

Consider a rule that defines the business condition that an employee cannot be part-time and assigned to a research department. The data rule is

 IF EMPLOYEE.STATUS='PART_TIME'  THEN DEPARTMENT NOT ='RESEARCH'.

The selection rule for violations is

 SELECT EMPLOYEE-NUMBER,   EMPLOYEE NAME FROM EMPLOYEE WHERE EMPLOYEE.STATUS='PART-TIME AND DEPARTMENT   ='RESEARCH'

A truly negative expression that finds the violations would be

 SELECT EMPLOYEE_NUMBER,   EMPLOYEE_NAME FROM EMPLOYEE WHERE NOT(EMPLOYEE_STATUS='PART_TIME' AND   DEPARTMENT NOT='RESEARCH)'

Expressions should be written as simply as possible to express either the condition that must be true or the exceptions you want to see.

Hard Versus Soft Data Rules

Some data rules are absolutely required to be true. Any violation is considered a case of inaccurate data. These are called hard data rules. For example, your HIRE_DATE must be later than your BIRTH_DATE. There cannot be an exception to a hard rule.

A soft data rule is one that is expected to be true almost all of the time but exceptions can occur. For example, HIRE_DATE must be greater than BIRTH_DATE+18 years. This means that all of your employees are expected to be 18 years old or greater. However, if you hire a super-genius who already has a Ph.D. but is only 16 years old, you have an exception to the data rule. Exceptions normally require a businessperson to approve them.

Soft data rules are difficult to deal with. When a violation occurs, it does not necessarily indicate inaccurate data. However, discarding the rule because of the off chance that an exception may occur will block you from finding many truly inaccurate data conditions. It is only logical that you would include these rules for testing and then research each violation to see if it is an exception. This requires additional research after the violations are found. If you determine that an exception has occurred, you do not want to see that same exception each time you execute the rule. This requires more complex rule execution design.

A Well-Designed data structure would have a column in the database that indicates whether an exception to a rule has happened. In the example of the age of an employee when they are hired, a column could have been added that includes a flag value for whenever someone is hired who is younger than 18. This would allow a very specific rule to be generated that included the phrase IF HIRE_EXCEPTION FLAG NOT ‘E’.

This type of sophistication in database structures is very rare. However, it ought to be used more often to aid in keeping data accurate.

The tighter you define data rules, the more likely a rule crosses the boundary between a hard data rule and a soft data rule. The looser you define data rules, the more inaccurate data conditions you fail to detect. It is a trade-off that each data profiling analyst needs to deal with.

Appendix A shows an example of a small database situation and data rules that can be constructed from them. Studying this example is probably more instructive in explaining data rules than any text.

Data Rules Versus Process Rules

A process rule is one that takes an action based on the values of one or more columns. It may compute the value of another column, create a row in another table, delete a row, or take other actions. These actions generally result in data changing from one state to another.

For example, a business process rule might be that a CUSTOMER is classified as GOOD if the amount of sales last quarter exceeds $10,000 and the number of late payments last quarter is zero and they have no currently over-due invoices. A customer is classified as BAD if the number of late payments last quarter or currently overdue invoices is greater than two. GOOD customers get a 10% discount. BAD customers are not allowed to enter new orders. Other customers get no discount but can enter new orders.

This is a rather lengthy process rule. It defines multiple actions that result from various data points. It causes a customer classification to be set and applies this to the acceptance of new orders and to the setting of the prices for new orders.

Process rules are not intended to be data validation rules. They are used to set values of data, not to determine if values that have been set are correct. These rules can be turned into data rules to test the state of a database to determine if the rules are being used. Violations may mean that the data is inaccurate or that exceptions are being made to the rules.

An important factor to consider when using such process rules is how and when the computations are made. In this example, the rule could be computed periodically, such as weekly, through a batch process. When this is done, there will be a number of rule violations in the database simply because a number of transactions may have taken place since the last time the program was executed. If the logic is executed every time any change is made to the customer's information, it will be more accurate. However, because overdue payments are not a function of a transaction but rather a function of elapsed time, this can also leave a set of records in noncompliance, waiting for another transaction to occur.

Data Rules Versus Column Properties

Data rules always define relationships between multiple data values. Column properties look at individual values in isolation and determine if they are valid, not correct. Column properties consider whether a value is acceptable for any set of conditions for other values. A data rule can be used to restrict the number of valid values for a single column based on other criteria. In essence, it narrows the acceptable values for a single occurrence based on factors outside that column. For example, the values permitted in a column called CLASS_ID would be all currently active class identifiers for the column properties testing. However, the list can be narrowed based on whether the student is a freshman, graduate student, or postdoctoral student. Each category of students can have its own sublist of acceptable values. To determine which sublist to use for validation, you need the value from the STUDENT_LEVEL column.

The process of examining column properties and subsetting them based on other column values is a common source of data rules. Subsetting can be done on discrete values, a range of values, or on NULL acceptability.