Bottom-Up Source Data Analysis


Data analysis cannot stop after top-down logical data modeling because the source data often does not follow the business rules and policies captured during the modeling sessions. If bottom-up source data analysis were not performed, the data problems and business rule violations would not be discovered until the ETL process was being implemented. Some data quality problems would not be discovered at all until after implementation, and then only if somebody complained about them. As Figure 5.5 shows, source data mapping must adhere not only to the usual technical data conversion rules but also to the business data domain rules and to the business data integrity rules.

Figure 5.5. Source Data-Mapping Rules

graphics/05fig05.gif

Technical Data Conversion Rules

Any time data is mapped from one system to another, whether for a traditional systems conversion or for source-to-target mapping in BI applications, the following technical rules must be observed.

  1. The data types of the source data elements must match the data types of the target data elements.

  2. The data lengths must be adequate to allow the source data elements to be moved, expanded, or truncated into the target data elements.

  3. The logic of the programs manipulating the source data elements must be compatible with and applicable to the content of the source data elements. Otherwise the results will be unpredictable.
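The first two technical rules lend themselves to automated checking before any data is moved. The sketch below is a minimal, hypothetical illustration of such a check; the type-compatibility table, field names, and mapping records are assumptions for the example, not part of any standard.

```python
# Illustrative check of technical data conversion rules 1 and 2:
# type compatibility and length adequacy between source and target.
# The COMPATIBLE_TYPES table and the mapping records are hypothetical.

COMPATIBLE_TYPES = {
    ("CHAR", "VARCHAR"),
    ("VARCHAR", "VARCHAR"),
    ("INTEGER", "INTEGER"),
    ("DECIMAL", "DECIMAL"),
}

def check_mapping(source, target):
    """Return a list of rule violations for one source-to-target mapping."""
    problems = []
    # Rule 1: data types must match (or be safely convertible).
    if (source["type"], target["type"]) not in COMPATIBLE_TYPES:
        problems.append(f"type mismatch: {source['type']} -> {target['type']}")
    # Rule 2: the target length must accommodate the source data.
    if target["length"] < source["length"]:
        problems.append(
            f"possible truncation: length {source['length']} -> {target['length']}"
        )
    return problems

src = {"name": "CUST_NM", "type": "CHAR", "length": 40}
tgt = {"name": "customer_name", "type": "VARCHAR", "length": 30}
print(check_mapping(src, tgt))  # flags possible truncation
```

Rule 3 (program logic compatible with the data content) cannot be verified from metadata alone; it requires inspecting the actual data values, which is the subject of the next two sections.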

Business Data Domain Rules

A much larger effort of source data analysis revolves around business data domain rules. These rules are more important to the business people than the technical data conversion rules. A source data element can meet all three technical data conversion rules but still have incorrect values. Business data domain rules are rules about the semantics (meaning and interpretation) of data content. They are used to identify and correct data violations like those listed in Table 5.1.

Business Data Integrity Rules

Similar to business data domain rules, business data integrity rules are much more important to improving information quality than are the technical data conversion rules. The business data integrity rules govern the semantic content among dependent or related data elements, as well as constraints imposed by business rules and business policy. Table 5.2 lists examples of violations to business data integrity rules.

Table 5.1. Data Domain Violations

1. Missing data values (a big issue on BI projects)

2. Default values; for example, "0", "999", "FF", blank

3. Intelligent "dummy" values, which are specific default (or dummy) values that actually have a meaning; for example, using a value of "888-88-8888" for the Social Security number to indicate that the person is a nonresident alien

4. Logic embedded in a data value, such as an implied business rule; for example, using lower-valued ZIP codes (postal codes) to indicate a state on the east coast, such as 07456 in New Jersey, and higher-valued ZIP codes to indicate a state on the west coast, such as 91024 in California

5. Cryptic and overused data content; for example, using the values "A, B, C, D" of a data element to define the type of customer, while the values "E, F, G, H" of the same data element define the type of promotion, and the values "I, J, K, L" define the type of location

6. Multipurpose data elements, that is, programmatically and purposely redefined data content; for example, the REDEFINES clause in COBOL

7. Multiple data elements embedded in, concatenated across, or wrapped around free-form text fields; for example, address lines 1 through 5 containing name and address data elements:

   Address line 1: Brokovicz, Meyers, and Co
   Address line 2: hen, Attorneys at Law
   Address line 3: 200 E. Washington Bouleva
   Address line 4: rd,
   Address line 5: Huntsville OR 97589
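Several of the domain violations above can be surfaced by profiling a column's actual values against lists of known defaults and dummy codes. The following sketch is a hypothetical example in that spirit; the value lists and the sample column are assumptions for illustration only.

```python
# Hypothetical domain-rule profiling of one source column, in the spirit
# of Table 5.1: flag missing values, known default values, and
# intelligent "dummy" values. The DEFAULT_VALUES and DUMMY_VALUES lists
# are illustrative assumptions, not standards.

DEFAULT_VALUES = {"0", "999", "FF", ""}
DUMMY_VALUES = {"888-88-8888": "nonresident alien indicator"}

def profile_column(values):
    """Return (row index, issue description) pairs for suspect values."""
    findings = []
    for i, v in enumerate(values):
        if v is None or v.strip() == "":
            findings.append((i, "missing value"))
        elif v in DEFAULT_VALUES:
            findings.append((i, f"default value '{v}'"))
        elif v in DUMMY_VALUES:
            findings.append((i, f"intelligent dummy value: {DUMMY_VALUES[v]}"))
    return findings

ssn_column = ["123-45-6789", "888-88-8888", None, "999"]
for row, issue in profile_column(ssn_column):
    print(f"row {row}: {issue}")
```

Note that this kind of check only finds violations someone has already thought to encode; discovering previously unknown dummy values still requires frequency analysis and review with the business people who own the data.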

Table 5.2. Data Integrity Violations

1. Contradicting data content between two or more data elements; for example, "Boston, CA" (instead of "Boston, MA")

2. Business rule violation; for example, for the same person, "Date of Birth = 05/02/1985" and "Date of Death = 11/09/1971"

3. Reused primary key (the same key value used for multiple object instances); for example, two employees with the same employee number (when one employee left the company, his or her employee number was reassigned to a new employee)

4. No unique primary key (multiple key values for the same object instance); for example, one customer with multiple customer numbers

5. Objects without their dependent parent object; for example, a job assignment points to employee 3321, but the employee database contains no employee 3321

6. A real-world relationship between two data objects, or between two occurrences of the same data object, that cannot be built because it is not tracked by the operational systems; for example, a customer refinances a mortgage loan, but the operational system does not track the relationship between the old paid-off loan and the new refinanced loan
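Unlike domain rules, integrity rules span multiple data elements or multiple records, so checking them requires comparing values across fields and across tables. The sketch below illustrates three of the violations from Table 5.2 (numbers 2, 3, and 5); the record layouts and key names are hypothetical, and violation 6 is omitted because a relationship the operational system never recorded cannot be detected from the data alone.

```python
# A small sketch of business data integrity checks corresponding to
# Table 5.2: a cross-field date rule (violation 2), duplicate primary
# keys (violation 3), and child rows whose parent key does not exist
# (violation 5). Record layouts are illustrative assumptions.

from collections import Counter
from datetime import date

def check_dates(person):
    """Violation 2: date of death must not precede date of birth."""
    if person["death"] and person["death"] < person["birth"]:
        return ["death precedes birth"]
    return []

def check_unique_keys(rows, key):
    """Violation 3: the same key value used for multiple instances."""
    counts = Counter(r[key] for r in rows)
    return [k for k, n in counts.items() if n > 1]

def check_orphans(children, parents, fk, pk):
    """Violation 5: dependent rows without their parent object."""
    parent_keys = {p[pk] for p in parents}
    return [c for c in children if c[fk] not in parent_keys]

person = {"birth": date(1985, 5, 2), "death": date(1971, 11, 9)}
employees = [{"emp_no": 3321}, {"emp_no": 4410}, {"emp_no": 3321}]
assignments = [{"emp_no": 3321}, {"emp_no": 9999}]

print(check_dates(person))        # cross-field violation
print(check_unique_keys(employees, "emp_no"))   # reused key 3321
print(check_orphans(assignments, employees, "emp_no", "emp_no"))  # orphan
```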

Every critical and important data element must be examined for these defects, and a decision must be made about whether and how to correct them. The information consumers (the business people who will be using those data elements to make business decisions) and the data owners should make that decision after discussing the impact of the cleansing effort with the business sponsor, the project manager, and the core team.



Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications
ISBN: 0201784203
Year: 2003
Pages: 202