9.2 Understanding the Structures Being Profiled

The data profiling process works best when looking at data from a table perspective. If the data is in other forms, it is generally extracted into a table form. The process of extraction can be helpful or harmful for structure analysis.

For example, if the source data is in IMS, extraction can simply extract each segment into a separate table. This requires that keys be generated for each row of each table from the logic of the IMS hierarchy and embedded in the rows of the extracted table. Special attention needs to be paid to cases where IMS is sensitive to the order of segments within siblings. In these cases, a sequencing column needs to be generated in the extracted table to designate the required order. Often the extraction designer is not aware of this condition and fails to generate the required column. This makes the extracted table less accurate than the IMS segment.

Many extraction routines combine multiple segments into single rows of an output table. This is acceptable, provided the parent/child relationship is always one-to-one. If the relationship supports multiple occurrences of a child segment for each parent segment, the resulting table repeats the parent columns for each output row. This distorts data profiling and other analytical processes because it gives wrong cardinalities over computations off these columns. Unfortunately, this is done all too often.

Another common source of structural flaws occurs when extracting data from structures that support REDEFINE and OCCURS structures. The extraction designer needs to understand the semantic meaning of these structures in order to extract properly. For example, the REDEFINE must be understood to be a remapping of either the same fact or different facts, depending on the value of some other column. Similarly, the OCCURS structure must be understood to be either the same fact with multiple occurrences or multiple different facts. Failure to understand the meaning of these constructs will often lead to extraction files that are structurally different from the data source.

Structure errors made in extraction will lead to wrong answers in data profiling. The step of extraction is generally not given enough attention in the process and often leads to generating inaccurate data from accurate data.

Extraction routines are generally not created exclusively for data profiling. Almost all new uses of data result in an extraction step that prepares the data for the target or for other software that begins the steps of transforming the data to the target form. It is best if the extraction formats used in data profiling are the same as those that begin the journey when the data itself is eventually moved. This means that the extraction step is not a step designed only for data profiling.

Most shops do a hurried job of extraction and pay little attention to the details that often cause problems. A proper job includes careful design and documentation of the source data structures, the extraction logic, and extraction formats. These should be reviewed by business analysts in conjunction with the data handlers to ensure that semantic issues are properly addressed.