9.4 The Rules for Structure

This section demonstrates some of the techniques you can use to analyze the structure of data. The goal is to find the correct structural definition of the data as well as to find violations that point to inaccurate data. This is not intended to be a description of an implementation of the methods, just a survey of the techniques that can be implemented for structure analysis.

The techniques described vary widely. Each technique looks at a single aspect of structure.

As in the case of values, structure analysis cannot be done by running a single program to get the results. All along the way it requires semantic interpretation of results to determine the relevance of structure discovered and the impact of inaccurate data.

Finding Functional Dependencies

Finding functional dependencies in each table of source data is the starting point. This leads to discovery of primary keys, embedded denormalized key sets, and derived columns. This information is vital to other data profiling activities. As described earlier, there are two approaches to digging out functional dependencies: using discovery software or validating proposed dependencies.

Determining Candidate Dependencies

The first step is to speculate about what might be a functional dependency. These speculations should be recorded in the data profiling repository as candidate dependencies. You would normally start with a collection of candidate dependencies gathered from various external metadata sources. The process of speculation is designed to add to this list dependencies that would not normally be documented. Of course, the available metadata may have very little to offer in terms of candidate dependencies, requiring you to invent candidates for everything.

The best way to create a list of candidate dependencies is to start by classifying columns. The goal is to identify those column sets that are possibly LHS components of functional dependencies. Some columns are obvious candidates for keys, and others are obviously not candidates for keys.

Columns that have only one value should be excluded from analysis. Likewise, columns with a significant percentage of nulls, whether actual nulls or blanks, would also normally be excluded from LHS consideration. This is not always true, but it is true so often that such columns should be excluded unless the semantics of the column suggest otherwise.
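
As a minimal sketch of this exclusion step, the following Python fragment keeps only the columns worth considering as LHS components. It assumes the table is held in memory as a list of row dictionaries; the 50% null threshold is an illustrative choice, not a rule from the text.

    # Minimal sketch: drop columns from LHS consideration when they have only
    # one distinct value or are mostly null/blank.
    def lhs_candidates(rows, null_threshold=0.5):
        if not rows:
            return []
        candidates = []
        for col in rows[0]:
            values = [r.get(col) for r in rows]
            filled = [v for v in values if v is not None and str(v).strip() != ""]
            if len(set(filled)) <= 1:
                continue                  # single-valued column: exclude
            if 1 - len(filled) / len(values) > null_threshold:
                continue                  # mostly null or blank: exclude
            candidates.append(col)
        return candidates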

Most other columns should be classified as identifiers, descriptors, quantifiers, dates, or free-form text fields. These classifications are purely semantic. This is why a thorough value analysis needs to be performed before coming to structure analysis. The actual business meaning and content characterizations of columns are key to successful classification.

Identifier columns are those that by their nature identify one object within a table. Some of them are clear: ORDER_NUMBER, PRODUCT_ID, DEPARTMENT_NUMBER, CLASS_ID, and CUSTOMER_NO. These columns are clearly encoded identifiers, and their presence in a table indicates that they are components of keys, whether primary keys, denormalized keys, or foreign keys. If you know what business object each row of the table represents, you can easily guess which ones are primary key candidates and which are foreign key candidates.

Sometimes these identifier columns require additional columns to identify unique rows of the table. This is true when the table is not the primary table for the business object but is a sub-table. For example, in the EMPLOYEE table the EMPLOYEE_ID is probably the primary key. However, in the EMPLOYEE_EDUCATION table, additional columns need to be combined with EMPLOYEE_ID to get unique records. This is because each employee may very well have multiple educational achievements recorded. The additional columns can usually be determined easily from the semantic meaning of the table and of the columns in the table. If you guess wrong, you can find the correct combination in later steps of structure analysis.

Natural key identifier candidates require some knowledge of the object the table represents. Columns with titles such as NAME, EMPLOYEE_NAME, and PRODUCT_NAME are all clearly candidates for natural keys. These often require additional columns to uniquely identify an object in the table. For example, CUSTOMER_NAME may require ADDRESS or PHONE_NUMBER to achieve uniqueness when the business allows multiple customers to have the same name.

Again, speculation about the column sets required to achieve uniqueness can be done, generating multiple candidate keys. These will all be sorted out later, in the testing phase of the process.

Descriptor columns provide a value that helps describe a single characteristic of the table. Examples are CUSTOMER_TYPE, ZIP_CODE, COLOR, MODEL_YEAR, UNIT_OF_MEASURE, PRODUCT_DESCRIPTION, REGIONAL_OFFICE, and COUNTRY_CODE. Some tables have a very large number of descriptor columns. These columns are very popular for drilling down in decision support products and producing important business reports.

Descriptor columns are generally RHS candidates. Each of them should be looked at with regard to each of the key candidates listed from identifier columns. They should be associated as potential RHS candidates. Sometimes a descriptor column will be an LHS component, but rarely are LHSs made up entirely of descriptors.

Quantifier columns are those that define a numeric value that indicates a quantity associated with a row. Examples are PRICE, QUANTITY, ON_HAND, ON_ORDER, ACCOUNT_BALANCE, GRADE, and PERFORMANCE_RATING. Note that some of these are not numeric, such as GRADE. It still has the same effect as quantifying something about the object. In this case, it quantifies how well a student did in a course relative to the course standard.

These columns are generally not LHS candidates. They should be associated with LHS keys as RHS candidates by considering the semantic meaning of each.

Quantifier columns are generally strong candidates for participation in derived columns. The way to look at these is to speculate on which of them could be computed from the others. Looking for columns with names such as TOTAL, PRICE, NET_PRICE, and QUANTITY_AVAILABLE_FOR_ORDER will produce a list of likely candidates for RHS of derived dependencies. This can then be used to construct a likely LHS list of columns the result may be computed from. The actual formula is not required to test for functional dependence, just the list of candidate columns. Deducing the actual formula or rule is outside the scope of this book.

Many tables have date columns because they describe when significant events occurred for the business object. Some tables have many date columns. Examples are DATE_OF_ORDER, DATE_CREATED, BIRTH_DATE, SHIP_DATE, DATE_LAST_UPDATED, and HIRE_DATE. These columns are also rarely LHS candidates. They are very similar to descriptor columns in the role they play. However, they need to be looked at separately and associated with key candidates.

The last type of column is the free-form text field. These columns, which are not easily queried, contain text that consists of multiple-word components. Examples are SHIPPING_INSTRUCTIONS, AUDITOR_COMMENTS, REVIEWER_NOTES, and INSPECTION_VIOLATIONS. They contain free-form information that enhances the other information on the row. They clearly do not identify the rows. These columns are never LHS candidates. Almost always they are RHS-associated columns that belong only to primary keys. It is not likely that any two of these will be identical.

At this point, you have a list of candidate functional dependencies. You then need some method of verifying them against the actual data source in order to determine whether they are true dependencies or whether inaccurate data prevents them from holding all of the time.

Discovering Dependencies

The best way to test dependencies is to discover them. This is a process that finds all functional dependencies within a table. The dependencies produced will be true over all of the data used in the test. You will probably not use all of the data, because for very large data sources the processing time will be prohibitive. You will want to create samples of the data to use for discovery. You may want to create multiple samples, using different criteria, and then compare the outputs to see how similar they are.

One very important characteristic to understand when using discovery software is that all true dependencies are true in all samples. This means that you do not need to worry about finding true dependencies when using samples: you will find them all. The side effect of using a sample is that you get false positives (dependencies that are true in the sample but not true in the larger population). These can generally be weeded out by semantic review or testing against the larger population to determine if they are really true.

If the discovery process returns far too many dependencies, it generally indicates the presence of a large number of false positives. The sample size is probably too small, or the sample is biased in some way. When you see this, you want to resample and rerun to reduce the number of dependencies you need to examine.

If you profile multiple samples separately, you can compare the result sets by leaving for consideration only those dependencies that are found in both sample sets. Again, for a dependency to be true in the entire table it must be true in all samples. However, the dependency may be hiding out as a different dependency, because in some samples it may not need all LHS columns to be true. For example, if ORDER_NUMBER and ITEM_NUMBER determine QTY in the full table, some samples may show ORDER_NUMBER determining QTY if the sample only has one item per order. However, this functional dependency would also be true if ITEM_NUMBER were added to the LHS. The analyst must look for cases such as this, where manipulation of the LHS column list is needed to show the functional dependency needed for the full table.
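
One way this comparison could be implemented is sketched below (Python). It assumes each sample's discovery output is a set of (LHS column set, RHS column) pairs, and it treats a dependency as surviving when a version of it, possibly with a smaller or larger LHS, appears in every sample.

    # Minimal sketch: keep a dependency only if some version of it appears in
    # every sample. Each sample result is a set of (frozenset of LHS columns,
    # RHS column) pairs; two dependencies are treated as versions of each
    # other when they share the RHS and one LHS is a subset of the other.
    def surviving_dependencies(sample_results):
        first, rest = sample_results[0], sample_results[1:]
        survivors = set()
        for lhs, rhs in first:
            if all(any(rhs == r and (lhs <= l or l <= lhs) for l, r in other)
                   for other in rest):
                survivors.add((lhs, rhs))
        return survivors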

Once you have the list of discovered dependencies, you need to compare this to the list of candidate dependencies in your data profiling repository. You will probably discover some you did not expect. You need to make a judgment as to whether these are semantically interesting or not. An uninteresting dependency is one that is clearly a false positive (makes no business sense).

You should mark each dependency as confirmed through discovery, not confirmed through discovery, or uncovered through discovery (those that were true but you did not expect). Of course, these last are only those you felt were semantically meaningful.

start sidebar

The following are examples of dependencies you would throw away:

  • Often a NAME field shows up as a primary key in discovery with all other columns appearing as being dependent on it. If this occurs in a table where you know from a business perspective that two different entries may be allowed to have the same name but that the sample used for discovery had all unique names, you would discard this dependency.

  • Another example of a dependency you would not keep is where CODE determines DESCRIPTION and DESCRIPTION determines CODE. You would keep one, CODE determines DESCRIPTION, but not the other.

  • Finding out that ZIP_CODE determines CITY, STATE is also generally not interesting. You are not likely to break these out as a separate table.

end sidebar

A dependency is significant if it defines the basic structure of the object, if the RHS column is likely to also exist in another table (indicating redundancy), or if it is very important that the relationship always be correct.

Testing Dependencies

If you do not have software for performing discovery of functional dependencies, you need to construct tests of those you think are true. Even when you do have the ability to discover functional dependencies, you may well want to do this for candidate dependencies that did not show up on the discovered dependency list. Another reason for testing dependencies is to check the larger data source for conformance to dependencies found in discovery when discovery only used a sample of the data. The purpose here is to find out how close they were to being true.

You do not need to test a functional dependency based on a single column that is defined to the database system as a primary key or that carries a UNIQUE constraint, provided the database system enforces uniqueness. Likewise, if you have determined that the cardinality of the column is the same as the number of rows in the table, you are certain that it is a true key. No further testing is required.

The process of testing is to sort the data source by the key of the dependency (LHS). Then examine the data by determining if the value of the RHS is constant for each combination of values for LHS columns. This is an easy algorithm to implement.

This process can be improved upon by first combining candidate dependencies with overlapping LHS columns and testing all RHS columns that share the same LHS columns. For example, if you think that a determines b, c, and d and that a and c together determine e, then a single sort by a, c, b, d, e can test all of the dependencies at one time.

The output of the test is either confirmation that all rows support the dependencies proposed or a list of violations. Each violation consists of an LHS value combination and multiple rows that match it but have different RHS column values.
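
A minimal sketch of such a test follows (Python). Grouping the rows takes the place of the sort, and the violation report has the shape just described.

    from collections import defaultdict

    # Minimal sketch: test whether lhs_cols determine rhs_col. Each violation
    # is the LHS value combination plus the rows that share it but disagree
    # on the RHS value.
    def test_dependency(rows, lhs_cols, rhs_col):
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row.get(c) for c in lhs_cols)].append(row)
        violations = []
        for lhs_values, members in groups.items():
            if len({m.get(rhs_col) for m in members}) > 1:
                violations.append((lhs_values, members))
        return violations

An empty result confirms the proposed dependency over the data tested; a non-empty result feeds the inaccurate-data facts discussed later in this section.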

Classifying Dependencies

Once the output has been determined from all of the tests, the analyst can match these to the candidate and discovered dependencies to determine which are true, which are close to being true, and which are so far off they are probably not true dependencies.

start sidebar

Examining multiple primary key dependencies can be very useful. Tables that depend on an encoded key for uniqueness generally neither look for nor enforce uniqueness on natural keys. This means that violations can easily occur and go undetected. By testing natural keys, you may find duplicates. These duplicates may indicate inaccurate data, or they may indicate that a corporate rule has been violated. Examples of things people have discovered through natural key testing are:

  • Same person working for company under two different EMPLOYEE_IDs

  • Same product with two different PRODUCT_NUMBERs

  • Multiple accounts for same CUSTOMER

If these conditions are not permitted by corporate business rules, discovering violations is very worthwhile.

end sidebar

The list of functional dependencies would have each marked as being expected, discovered to be true, tested to be true, or violated with the percentage of rows involved in violations. You would also know if each was a primary key candidate or a denormalized key candidate.

It is also valuable to sort through the primary key dependencies and designate one of them as the one to use for database implementation and enforcement. This is important for later uses of the data.

Taking this information to a review by business analysts and subject matter experts is extremely valuable. This group can finish the process by classifying them further as being derived column dependencies or structurally significant dependencies.

Finding Synonym Columns Across Tables

The next step of structure analysis is to find pairs of columns that represent the same business fact and to classify the structural relationship between them. This analysis is done within a table as well as across tables.

The data profiling process will find all the synonyms, determine their type, set their rules, and test the data to find violations that indicate inaccurate data. This step is critically important whenever you are trying to consolidate data from multiple sources, particularly if the sources were created independently from each other.

The process is executed from two different perspectives: a single data source or merged data sources. In a single data source, all tables represent different business object components and you are trying to determine how they connect. In the merged data source, pairs of tables represent the same business object components. The process is oriented toward finding identical business facts at the column level that must be consolidated into a target data source. This step is concerned with finding the pairs that are the same and testing their ability to be consolidated.

Determining Candidate Synonyms

As in the case of finding functional dependencies, the process of profiling synonym structures begins with collecting and speculating about what synonyms may exist. Synonyms need to be documented in the data profiling repository.

You can find existing synonym candidates in metadata repositories, data models, database implementations, or through speculation. Metadata repositories, data models, and database implementations will not tell you about synonyms across diverse data sources. You can only determine those through manually matching columns using the business meaning of each column.

Additional, very helpful information is provided through the results of column property analysis. Thoroughly profiling the properties of each column in a data source provides valuable foundation information for all steps in the synonym profiling process.

Metadata repositories and data models are very helpful in identifying primary/foreign key pairs, the relationship between them, and the rules they are supposed to follow. Do not assume that this documentation is either complete or accurate. Data models often do not identify relationships between columns in one business object and other, different objects.

start sidebar

In one example, an ORDER data structure identified the relationship between ORDER_HEADER rows and ORDER_DETAIL rows through a pair of columns labeled ORDER_NO, as well as the relationship between an order and a customer through a pair of columns labeled CUSTOMER_NO. However, it failed to document the fact that the SALESMAN_ID identifier in the ORDER_HEADER was actually a synonym of the PERSONNEL_ID in the HR database.

Profiling the pair of fields revealed a number of orphans (salesperson IDs that did not exist in the personnel data), as well as missing values.

Because this was the only information about who the salesperson was, this was not a reliable source of information for some forms of post-sale analysis. The synonym profiling process allowed the customer to tighten up processing and to provide more reliable data.

end sidebar

Database implementations will generally be helpful in identifying synonym pairs that are involved in primary key/foreign key relationships. Most relational systems allow these to be defined with rules so that they can be enforced whenever data is created, updated, or deleted. This is valuable information that gives you a head start. You need to indicate in the data profiling repository that the appropriate ones have been enforced and therefore do not need to be tested.

Older legacy systems may also provide help. For example, IMS data definitions show the hierarchical relationship between segments. Examination of the keys for each segment will clearly indicate the fields that connect the parts. IMS, however, has shortcomings in some areas. One is that cross-business-object relationships either cannot be defined through the same mechanism or cannot be defined at all. You might have a logical relationship defined that is helpful, but then again you may not. Another problem is that not all keys are complete, because some of the relationship may be determined by the hierarchical structure. The IMS definitions may therefore be helpful but could very well be incomplete.

The final source of candidates is speculation. You can look for the obvious synonyms through examination of the business purpose of each table, looking for the columns that are probably used to associate the multiple parts of the object. Additionally, perusing the columns of each table looking for columns that are encoded identifiers will generally yield synonym candidates.

Looking for candidate-redundant columns is best done by examining denormalized dependency sets. The LHS is probably a primary key/foreign key candidate. Columns on the RHS are candidates for being pure redundant data from columns in the table the LHS connects to.

When you are trying to consolidate data from more than one source, you are looking for synonyms for all columns. This process is much more complex. The basics are to start by looking for homonyms (column pairs that share the same or very nearly the same name). A match can be deceiving, but it at least provides a good starting point. Beyond that, matching must be done on a semantic basis using the results of column property analysis. Speculation should include not only pair names but also the relationship rules.

All of our discussion to this point has involved synonyms that were pairs of single columns. However, in some primary key/foreign key relationships, the synonym can be multiple columns on each side. Figure 9.7 shows how this might work. The PROJECT table has a primary key/foreign key relationship to the PROJECT_TEAM table through the PROJECT_ID column. However, each team member can have multiple tasks assigned against multiple projects. To show this, a third table (the PROJECT_TASK table) connects to the PROJECT_TEAM table using a combined key of PROJECT_ID and MEMBER_ID.

Figure 9.7: Multiple-column synonym example.

Where multiple-column synonyms exist, they must be treated as a single concatenated value, because the full value combinations must match between the tables. Having matching values on the individual columns within the synonym is not sufficient.

Classifying Synonyms

Synonyms need to be classified across multiple dimensions. These classifications need to be in place before testing and verification against the data can be done. Information about the data can be helpful in setting the classifications. These dimensions are shown in Figure 9.8.

Figure 9.8: Synonym classifications.

All synonyms should be considered pairs. Synonym chains do exist, but they can all be represented by individual synonym pairs that have common columns across the pairs. There is no need at this point to consider chains.

The synonym type has been discussed previously. It indicates the probable reason the synonym exists and is important in determining which testing method to use.

The value correspondence indicates whether the values are encoded the same in both columns of the synonym. If they use different representations for the same facts, a transform needs to be developed to get them into a common representation. Transform correspondence can be true for redundant, domain, or merge synonyms.

The inclusive relationship indicates whether all of the values in one are expected to be found in the other (inclusive), all values are expected to be found in both tables (bidirectional inclusive), none of the values in one are expected to be found in the other (exclusive), or some are expected in both and some are not (mixed).

An example of inclusive is PRODUCT_NO across INVENTORY and ORDER_DETAIL. Not every product has necessarily been ordered, so the relationship is inclusive in only one direction (all products ordered must have an inventory row).

An example of bidirectional inclusive is ORDER_NO across ORDER_HEADER and ORDER_DETAIL. This means that every order number must have at least one row in each table. You cannot have order details without an order header, and you cannot have an order header without at least one order detail.

An example of exclusive is SOCIAL_SECURITY_NUMBER across two different personnel databases being merged. You would not expect the same person to be working for both companies.

An example of mixed is CITY across CUSTOMER and EMPLOYEE tables. Any CITY value could exist either in one alone or in both.

The degree of overlap specifies whether the expected occurrences of values between the tables are one-to-one, one-to-many, or many-to-many.

Discovering Synonyms

Testing can include both discovery and verification, as in dependency analysis. The purpose of discovery is to expose potential synonyms that were overlooked when building the candidate synonym list.

Discovery from the data is the process of looking across all of the columns in the project and finding pairs that have an identical or high degree of data value overlap. The analyst can then look at the pair and determine whether they are true synonym candidates or not. That decision is based purely on the semantic meaning of each column. However, often the user is surprised at the pairings that show up, and by looking at the values you can determine that a synonym really does exist. The example of the SALESMAN_ID previously cited is a perfect example of a synonym pair that would not have been suspected without discovery. Discovery from the data cannot find synonym candidates that are exclusive or that have transform value correspondence.
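
A minimal sketch of this kind of value-overlap scan is shown below (Python). The columns are passed as a mapping from a qualified column name to its list of values, and the 0.9 overlap threshold is an arbitrary illustration.

    from itertools import combinations

    # Minimal sketch: scan all column pairs and report those whose distinct
    # non-empty values overlap heavily in both directions.
    def overlap(values_a, values_b):
        a = {v for v in values_a if v not in (None, "")}
        b = {v for v in values_b if v not in (None, "")}
        if not a or not b:
            return 0.0
        common = a & b
        return min(len(common) / len(a), len(common) / len(b))

    def synonym_candidates(columns, threshold=0.9):
        # columns: dict mapping "table.column" -> list of values
        pairs = []
        for (name_a, vals_a), (name_b, vals_b) in combinations(columns.items(), 2):
            score = overlap(vals_a, vals_b)
            if score >= threshold:
                pairs.append((name_a, name_b, score))
        return sorted(pairs, key=lambda p: -p[2])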

Another discovery technique that can be used to expose potential synonyms is discovery against the column names. It can look for homonyms that are either identical names or very similar-sounding names. Typical homonym discovery routines consider typical abbreviations for names and discount special characters, numbers, and other qualifier type characters in the names. For example, the columns UNION_CODE and UNIONCD would be suggested homonyms.
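
A minimal sketch of name-based homonym matching follows (Python). The abbreviation table is illustrative only and would be much larger in practice.

    import re

    # Illustrative abbreviation expansions; a real routine would carry a much
    # larger, domain-specific table.
    ABBREVIATIONS = {"CD": "CODE", "NO": "NUMBER", "NUM": "NUMBER", "ID": "IDENTIFIER"}

    def normalize(name):
        # Split on non-letters, peel a known abbreviation off the end of a
        # token (so UNIONCD becomes UNION + CD), and expand abbreviations.
        parts = []
        for tok in re.split(r"[^A-Z]+", name.upper()):
            if not tok:
                continue
            for abbr in sorted(ABBREVIATIONS, key=len, reverse=True):
                if tok.endswith(abbr) and len(tok) > len(abbr):
                    parts.extend([tok[:-len(abbr)], abbr])
                    break
            else:
                parts.append(tok)
        return "".join(ABBREVIATIONS.get(p, p) for p in parts)

    def homonym_candidates(columns_a, columns_b):
        # Pair up columns from the two tables whose normalized names match,
        # e.g. UNION_CODE and UNIONCD both normalize to UNIONCODE.
        index = {}
        for col in columns_a:
            index.setdefault(normalize(col), []).append(col)
        return [(a, b) for b in columns_b for a in index.get(normalize(b), [])]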

Homonym discovery is useful in identifying synonyms that are exclusive or that have transform value correspondence. It is also useful for determining merge synonym pairs.

Discovery can only provide you with candidates for synonyms. It cannot determine programmatically whether the columns in a pair have any real relationship. For example, it may discover that HOURLY_PAY_RATE and UNIT_PRICE have a high degree of value overlap. These are clearly not synonyms of each other.

A good data discovery routine can provide statistical information that is helpful in determining the potential for being a synonym. The percentage of values that overlap, the frequency of occurrence of values in each, and the presence of nulls, blanks, or other empty indicators can all be useful in looking for synonyms and setting the rules.

Testing For Synonyms

Testing differs from discovery in that you have a synonym definition established and you want to verify it against the data. If the data does not conform to the expectation, either it is not a synonym or inaccurate data values prevent all of the rule expectations from being true. You focus on one synonym at a time.

The test for a single pair is quite simple. You compare the distinct values and their frequency of occurrence in one of the columns to the distinct values and frequencies in the other column. This tells you the degree of overlap and the presence of orphan values on either side. It also confirms or denies the rules you expect for the pair. You can easily verify one-to-one or one-to-many relationships through the result set. You can easily verify exclusive relationships as well by determining if any common values exist.

This method of testing is used for primary key / foreign key synonyms, domain synonyms, and merge synonyms that have same-value correspondence. It does not work for redundant synonyms or synonyms that have transform-value correspondence.
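
A minimal sketch of the pairwise comparison just described (Python) reports orphans on each side, whether the pair is exclusive, and whether the overlap is one-to-one or one-to-many.

    from collections import Counter

    # Minimal sketch: compare the distinct values and their frequencies in
    # the two columns of a candidate synonym pair.
    def test_synonym(values_a, values_b):
        freq_a = Counter(v for v in values_a if v not in (None, ""))
        freq_b = Counter(v for v in values_b if v not in (None, ""))
        common = set(freq_a) & set(freq_b)
        return {
            "orphans_in_a": set(freq_a) - common,   # values of A missing from B
            "orphans_in_b": set(freq_b) - common,   # values of B missing from A
            "exclusive": not common,                # no shared values at all
            "one_to_one": all(freq_a[v] == 1 and freq_b[v] == 1 for v in common),
            "one_to_many": all(freq_a[v] == 1 for v in common),  # A side unique
        }

An inclusive rule is confirmed when the orphan set on the dependent side is empty; an exclusive rule is confirmed when no common values exist.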

If the two columns have different ways of expressing the same values (transform-value correspondence), the values in one column need to be transformed before they can be compared to the values in the other column. In this case you need to list the distinct values in both columns and then manually match them. This process is called value mapping. There is no programmatic way of doing this. You need to understand the semantics of each value in both data sources. Once the mapping is done, you can transform the values in one of the columns and then compare the two columns in the same way you did for same-value correspondence.
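
Once the value map has been built manually, the comparison itself is mechanical. A minimal sketch, assuming the map is available as a dictionary and reusing the test_synonym sketch above:

    # Minimal sketch: apply the manually built value map to one column, then
    # reuse the same-value comparison above.
    def test_transformed_synonym(values_a, values_b, value_map):
        transformed_a = [value_map.get(v, v) for v in values_a]
        return test_synonym(transformed_a, values_b)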

If the pair are redundant column synonyms, the testing is more complex. You need to build a list for each column that concatenates the LHS columns of their respective functional dependencies and then compare these values. To be redundant, it is not sufficient that the synonym values match; they must also have the same parent key value. For example, if DISCOUNT_RATE in the CUSTOMER_MASTER table and DISCOUNT in the ORDER_HEADER table are redundant synonyms, the discounts for each CUSTOMER_ID must be the same.
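
A minimal sketch of that comparison (Python): each table is assumed to be a list of row dictionaries, and the parent key is assumed to be unique in the first table.

    # Minimal sketch: a redundant synonym must carry the same value for each
    # parent key value. Assumes key_col is unique in rows_a (the parent table).
    def test_redundant_synonym(rows_a, rows_b, key_col, col_a, col_b):
        parent_values = {r[key_col]: r[col_a] for r in rows_a}
        mismatches = []
        for row in rows_b:
            key = row[key_col]
            if key in parent_values and parent_values[key] != row[col_b]:
                mismatches.append(key)
        return mismatches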

When testing domain synonyms, you need to find the values that do appear in only one of the columns. These need to be visually examined to determine if they are caused by inconsistent representation.

Determining The Characteristics Of Each Synonym

After all of the computations have been done, it is easy to validate the rule sets of the candidate synonyms. The discovered pairs need to be confirmed as being synonyms or discarded as a mere coincidence of value overlap.

Careful attention is needed when classifying two columns as a redundant synonym. Sometimes they appear to be redundant, but in fact one is capturing the values of the other at a point in time, and an update of the primary source would not be reflected in the other column. This can happen, for example, in columns such as DISCOUNT_RATE or UNIT_PRICE. The reason for copying the data from one table to the other is to capture the value at the time the row is created in the second table.

Determining the classification of synonyms and the rules for them should be done in conference with business analysts and subject matter experts. Interpretation of differences as being semantically acceptable versus errors can be safely done only by those who know the applications.

As in the case of functional dependencies, you need to translate the synonym information to language understandable by the audience. You should not expect the business analysts and subject matter experts to understand all terminology and concepts of data structure.

Analyzing Synonyms In The Same Table

Synonyms can exist where both columns are in the same table. This is a common occurrence. In some cases the synonym pair is a primary key / foreign key pair. An example is having a PART_NUMBER column as the key to the INVENTORY table and having another column in the same table for ALTERNATE_PART_NUMBER. This semantically means that the other part can be used in its place. It is referring to another row in the same table.

start sidebar

An unusual form of same-table synonyms occurs when the data has not been properly put into first-normal form. This occurs when there are a number of repeating columns that specify multiple occurrences of the same fact. The columns are all synonyms of each other because they contain the same business fact.

An example that builds on the INVENTORY example is where each part can have up to five alternate part numbers. Instead of breaking these out into a separate table containing two columns PART_NO and ALT_PART_NO, the data designer just creates five columns in the primary table named ALT_PART_1, ALT_PART_2, and so on. Synonym discovery will find that six columns contain the same business fact. Five of them are dependent on the single one, PART_NO.

This is a special case of domain synonyms that needs to be identified for what it is. The data is not in first normal form because a single fact is not recorded in one column but rather is spread over five columns. It does not show up as a denormalized dependency because each is dependent on only the primary key.

This condition can also occur over multiple column sets. For example, an employee table may contain repeating columns for postsecondary education. Each occurrence is a set of columns for COLLEGE, DEGREE, and YEAR. There would be multiple occurrences of each of these columns. All of them would be dependent on the EMPLOYEE_ID primary key.

A good analyst can spot these situations during synonym discovery. It is very useful to go back and recast the data used for profiling into third-normal form by breaking these out into a separate table, with each occurrence being a separate row. Having these as separate columns makes typical reporting on the data meaningless. Sometimes it is better to note the condition and handle it when moving data.

end sidebar

Another case is where the two columns have only a domain relationship to each other. Additional synonyms involving these two columns exist that are primary key / foreign key pairs pointing to a third column that is the primary key to a different table. A classic example of this is having a PROJECT table that contains a column for PROJECT_MANAGER_ID and another column for PROJECT_RECORDER_ID. Both of these are domain synonyms of each other because they both contain personnel identifiers. Their relationship to each other is purely incidental. Both columns also participate in separate primary key / foreign key synonyms through the PERSONNEL_ID column of the PERSONNEL database.

Classifying Relationships Between Tables

In addition to classifying synonyms of columns, this is the time to classify the nature of relationships between tables. This is done by examining the primary key / foreign key relationships between columns in the two tables, along with an understanding of why multiple tables were involved in the profiling exercise. The types of relationships that are possible are primary / secondary, partitioned data, split tables, and merge pairs.

Two tables are connected in a primary/secondary relationship when they are connected by a primary key / foreign key pair. It is possible for more than one such synonym to connect two tables.

In all primary key / foreign key relationships there is a hierarchy. It defines which column (and thus table) contains the values the other column draws from.

Two tables are related in a partitioning relationship if all columns of one are synonyms to all columns of the other and the synonym of the two primary keys is exclusive. This means that they both contain the same information but about different instances of the business objects. For example, one table may contain employee data for all employees of one division, and a different table may contain employee data for a different division.

Two tables are considered merge tables if they are either partitioned or split, come from different data sources, and your intention is to combine them. Partitioned tables are said to be vertical merge pairs, meaning that each row of both tables becomes a separate row in the merged table. They are said to be horizontal merge pairs if they have a split table relationship in which each row of the merged data consists of some columns from one of the source tables and the other columns of the other table.

Two tables are connected in a split table relationship when the primary key columns of the two tables are synonyms in an inclusive one-to-one relationship and no other columns of one table are synonyms of any columns of the other table. Generally, one of the two tables is considered the dominant one, and the other one split apart from it.

Recording Inaccurate Data Facts

It is also important to capture all of the instances of structure rule violations. These indicate inaccurate data. Each instance may include a number of values or rows that contain the inaccurate data but do not identify which values are inaccurate. It is also possible that the inaccurate value is not included in the set captured. The inaccurate-fact data comes from primary key violations, denormalized key violations, derived column violations, or primary key / foreign key violations.

Primary key violations include lack of uniqueness in the key itself. This means that you have two or more rows in the data that have the same value for the primary key. The inaccuracy is that all but one of the key values are wrong. It may be that all of them are wrong.

If you are evaluating a natural key, the duplicate values should be investigated carefully to be sure that it really is a natural key. In cases such as a Social Security number, you are fairly safe in assuming it is a real key. In cases such as NAME and ADDRESS data, there is always the outside chance that the duplicate is correct.

It is also a violation of a primary key to be NULL or empty. For multicolumn primary keys it is acceptable to have NULL values in some of the columns, but it is never correct to have all of the columns containing NULLs. NULL values in any of the key columns are always suspect and generally are not allowed.

For other dependencies that cover less than an entire row, a rule violation can mean more than just that the LHS is wrong. On denormalized dependencies, duplicate LHS entries are always acceptable. What is not acceptable are variations in the values of the RHS columns within a single LHS value set. When this occurs, you can easily spot which column or columns have more than one value across the set. The inaccuracy can come from the LHS value set (any column in the LHS may hold the inaccurate value) or from a value in the RHS set. You cannot tell which from the data. All of the values need to be collected in the inaccurate data set.

For derived columns, you have an additional capability. You have a formula or rule that determines precisely what the value in the RHS column should be. You need to test this rule instead of testing the functional dependency. Although the functional dependency test may surface some problems, the rule test may surface more problems because invalid RHS values may pass the dependency test. A violation does not tell you which of the columns contains the error, just that an error exists. You need to retain the values of all of the columns from the rows involved in the violation.
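
A minimal sketch of such a rule test, using a hypothetical derivation TOTAL = QUANTITY * UNIT_PRICE (the column names and tolerance are illustrative, not from the text):

    # Minimal sketch: test a derivation rule directly and keep every row that
    # violates it, since the data alone cannot say which column is wrong.
    def test_derived_column(rows, tolerance=0.01):
        violations = []
        for row in rows:
            expected = row["QUANTITY"] * row["UNIT_PRICE"]
            if abs(row["TOTAL"] - expected) > tolerance:
                violations.append(row)
        return violations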

Synonyms offer many opportunities for violations. Most synonyms are not enforced at the database level, allowing ample opportunity for errors to creep in. The errors that occur in synonyms are orphans or degree-of-overlap violations.

Orphans are a case in which a value exists in one of the columns and is supposed to exist in the other but does not. This requirement applies to all synonyms that have an inclusive relationship. It may occur in primary key / foreign key synonyms, duplicate data, or merge synonyms. Even when there are no orphans, you can still have violations if the synonym requires a one-to-one correspondence and the data has a one-to-many relationship. This means that you have more than one row with the same value in one of the columns.

An orphan leaves you with only the row that contains the value that does not have a matching pair. This is the case in which the inaccurate data may not be in the result set. This is because the other table may have a row with an inaccurate data value in the synonym column that, if accurate, would have satisfied the rule. Additionally, the matching row in the primary table may have been deleted without cascading the delete to the secondary table. You still know you have an error, and you have one half of the data that potentially contains the error. You do not have the other half.

Developing a Data Model

After completing structure analysis, you have enough information to construct a third-normal-form data model. This is a very useful thing to do because it helps validate your perception of the business objects involved in the data and reconfirm your discovery of structure components. Even if you have a data model, constructing one from what you learned from the data is helpful in validating the model you have.

Building The Model

The method of construction uses the following logic. First, find the primary key of each table you want to use as the table identifier. Then find all denormalized dependencies that are embedded inside tables. Each of these needs to be spun out into its own table. The new table consists of all columns in the denormalized dependency of the table it is coming from. It also needs to have embedded within it the primary key of the table it comes from. This may be part of the denormalized dependency or not. This will make a new synonym pair that is of type primary key / foreign key. It is also inclusive. Then you would remove all duplicate rows in the new table.
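
A minimal sketch of spinning one denormalized dependency out into its own table (Python): for simplicity the sketch carries only the dependency's own columns into the new table, with the LHS columns remaining in the source table as the foreign key side of the new synonym pair.

    # Minimal sketch: spin a denormalized dependency (lhs_cols -> rhs_cols)
    # out into a new table and remove duplicate rows. The LHS columns stay in
    # the source table as a foreign key to the new table.
    def spin_out(rows, lhs_cols, rhs_cols):
        cols = list(lhs_cols) + list(rhs_cols)
        seen, new_table = set(), []
        for row in rows:
            values = tuple(row.get(c) for c in cols)
            if values not in seen:
                seen.add(values)
                new_table.append(dict(zip(cols, values)))
        return new_table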

You then look for tables that are part of horizontal merge pairs. These should be combined into a single table retaining only one instance of the primary key columns.

Vertical merge table pairs should be combined into a single table. Because both tables contain the same column set, no new columns need to be created.

The last step is to map all of the other synonym pairs across the table set. Synonyms that are of the type duplicate data can have the column removed from one of the tables. Which one to remove will depend on the semantics of use. It is normally very obvious.

Validating The Model

Once the model is constructed, you need to review the result with the business community. Every table should make sense as part of a discrete business object, and every connection between them should make business sense. If the model is nonsensical, you have missed finding some functional dependencies or synonym pairs, and you need to return to the structure analysis steps to look for more. What to look for is generally pretty clear from the model you have constructed.

As can be seen, this step is a check on the completeness of your analysis of structure. Bypassing it can leave problems unexposed in the data and allow inaccurate data that was discoverable to go undetected.


