Data Cleansing Components

The Fuzzy Lookup and Fuzzy Grouping components use fuzzy matching algorithms to provide fundamental features or primitives you can use to improve data quality of existing or new data. Together, these two transforms make possible a wide variety of data cleansing operations. The Fuzzy Lookup matches input records with misspellings, unexpected abbreviations, or other irregularities or nonconformant data with clean records in a reference table. The Fuzzy Grouping transform detects similarities between incoming records to find what appear to be duplicate rows. Both transforms use a custom, domain-independent distance algorithm developed by Microsoft Research. The algorithm takes into account the edit distance, token order, number of tokens, and relative frequencies capturing a more detailed structure of the source data. Because the algorithm is token based, it is not limited to a single language. The tokens are basically binary values that can be compared arithmetically. Finally, both transforms have a number of settings for controlling the resulting output and transform performance, so it's possible to tune the transforms to match the requirements of your environment.

Fuzzy Lookup

The Fuzzy Lookup transform makes it possible to look up terms that are close but not exact matches to reference terms. Table 22.6 contains the transform profile.

Table 22.6. The Fuzzy Lookup Transform Profile
Property	Value	Description
Component Type	Transform
Has Custom Designer	Yes
Internal File I/O	No
Output Types	Asynchronous
Threading	Single
Managed	No
Number Outputs	1
Number Inputs	1
Requires Connection Manager	Yes	OLEDB to reference table.
Supports Error Routing	No
Constraints	Copies reference table	For large reference tables, the startup time can be substantial.

Figure 22.9 shows a typical use of the Fuzzy Lookup transformation. Because the Lookup transform performs exact match lookups only, if there are slight deviations in the source data from the reference table, the lookup fails. Using the Fuzzy Lookup transform downstream on the Lookup transform error output, you can salvage some of the lookups that failed because the Fuzzy Lookup applies an inexact matching algorithm to determine if a string column is a close approximation of the reference column. To see the package shown in Figure 22.9, open the FuzzyLookup.dtsx package in the S21-Advanced sample solution. This sample package uses a sample flat file generated from the contacts table, with some errors introduced into the data, as the source and the contacts table from the AdventureWorks database is the reference data.

Figure 22.9. Using the Fuzzy Lookup to find duplicate names

Note

The Fuzzy Lookup transform only supports input columns of type DT_WSTR or DT_STR for fuzzy matching, but supports any SSIS data type except DT_TEXT, DT_NTEXT, and DT_IMAGE.

Setting Up the Fuzzy Lookup

Fuzzy Lookup uses what is called the Error-Tolerant Index (ETI) to find matching rows in the reference table. The ETI is an index on a column created to hold a set of tokens that represent the values for each record in the reference table. During runtime, Fuzzy Lookup essentially tokenizes the source rows, compares the tokenized values to the values in the ETI, and derives the similarity and confidence values based on the comparison.

Building the ETI

Depending on the settings you have selected, during the PreExecute phase of the Data Flow Task, the Fuzzy Lookup transform copies the reference table and adds an additional column for storing the tokens for each row and builds the ETI. This can be nontrivial for very large reference tables, so the Fuzzy Lookup transform provides options for storing the ETI on the server and reusing it at a later date.

The index options shown in Figure 22.10 provide a way to store the index table for later use or update it as changes are made to the reference data. These are the options:

Generate New Index Use this if your reference data is fairly small. This option copies the table and generates the ETI every time you run the package.
Store New Index If you plan to use an existing index, you need to run the package with this setting once to generate the index. Then, you can change the setting to one of the following two options to use the generated table and index.
Use Existing Index This option is for using a reference table you created with the Store New Index option. Select this option when the reference data is fairly static.
Maintain Stored Index This option is the same as using an existing index, only the Fuzzy Lookup creates a trigger on the reference table to update the index table any time the reference data changes.

Figure 22.10. The Reference Table options

Tip

When you select one of these options, you receive a warning in the bottom yellow message area, as follows: Connection manager attributes have changed. The mappings of previous columns are no longer valid. To eliminate the error, go to the Columns tab and remap the columns.

Setting Up Columns

After you have the reference table set up, you need to set up the columns. The Columns tab allows you to specify which columns you want to fuzzy match and which columns you want to flow to the outputs. As you can see in Figure 22.9, Fuzzy Lookup provides overall similarity and confidence values for each row and similarity values per fuzzy matched columns.

Similarity Versus Confidence

The question often comes up, "What is the difference between similarity and confidence?" Similarity has to do with how close the reference word is to the value. For example, John and Jon are very similar because there is only a one letter difference. John and nohj would be similar as well, because both words have the exact same number of letters and all the letters are the same, but because the algorithm for gauging similarity takes distance, characters, and order into account, Jon would be considered more similar to John than nohj would. Similarity, therefore, is a measure of the differences between the reference word and the value.

Confidence is more related to the uniqueness of the reference word in the reference set. If the name John Smith were to occur several times in the reference set, any exact matches would still have a low confidence score because the algorithm cannot detect which John Smith in the reference table is intended. So, even though the similarity may be 100%, the confidence might still be low.

You can use these results to set minimal bars of acceptance for dirty data and for mapping the dirty data rows back to clean reference data. Figure 22.11 shows the settings from the FuzzyLookup.dtsx sample package. Notice also that the FirstName, LastName columns and their output aliases FLFirstName and FLLastName are the output columns from the Fuzzy Lookup shown in the Data Viewer in Figure 22.9.

Figure 22.11. Creating the column matches and configuring the output

Tokenizing the Reference Data and the Similarity Threshold

The values in the reference data are divided into tokens based on the tokens specified in the Additional Delimiters field on the Advanced tab. For example, if you want to index the string 24^3 instead of 24 and 3, you should remove the ^ character from the list of delimiters.

The similarity threshold is the level of similarity that you're willing to accept as a match. Where to set this value is dependent on your requirements and data. Rows that have similarity values at or above the similarity threshold will pass through with the reference data in the Output column. For example, the FLFirstName and FLLastName from the sample package shown in Figure 22.9 are the results from the Fuzzy Lookup. Those rows that do not meet the similarity threshold have NULLs as the lookup values. Setting the threshold is a trial-and-error iterative process. Run the package against sample data and adjust the threshold based on the results. If there are too many dirty rows accepted, raise the threshold. If there are not enough rows accepted, lower the threshold. Repeat until you're satisfied with the results. You can also allow Fuzzy Lookup to return more than one match per lookup.

The Fuzzy Grouping Transform

The Fuzzy Grouping transform makes it possible to identify possible duplicate terms that are close but not exact matches. Table 22.7 contains the transform profile.

Table 22.7. The Fuzzy Grouping Transform Profile
Property	Value	Description
Component Type	Transform
Has Custom Designer	Yes
Internal File I/O	No
Output Types	Asynchronous
Threading	Multiple
Managed	No
Number Outputs	1
Number Inputs	1
Requires Connection Manager	Yes	OLEDB to location for temporary tables
Supports Error Routing	No

Although the Fuzzy Lookup transform is useful for finding correct values for corrupt or dirty data, Fuzzy Grouping is useful for finding duplicated values. Figure 22.12 shows the FuzzyGrouping.dtsx package in the sample solution for this chapter. Notice the output from the Fuzzy Grouping transform in the bottom grid. Those rows with a _score and _similarity value of 1 are cases where the key columns were identical. The rows with _score and _similarity less than 1 are rows where the key columns were only similar.

Figure 22.12. Grouping like rows together with the Fuzzy Grouping transform

Whereas naïve approaches such as the following script might work for simple exact match elimination, duplicated data is rarely so clean nor lends itself so well to such elimination techniques. In fact, typically, it is the other way around. Duplicate data is often duplicated when entered into the source system because source entry verification systems frequently use exact match or SOUNDEX checks that are confounded by simple differences in critical columns. Thus, using similar exact matching techniques for deduplication fail for the same reason they fail at the source, making exact matching elimination techniques of limited use.

DELETE FROM [dbo].[contact] WHERE fname=@fname   AND lname=@lname   AND address=@address

Using the Fuzzy Grouping transform, you can identify both identical and similar rows, which you can process in downstream data flow to determine if the rows indeed represent the same business entity and, therefore, should be eliminated.

Setting Up the Fuzzy Grouping Transform

The Fuzzy Grouping transform uses the same underlying technology as the Fuzzy Lookup transform. In fact, under the covers, the Fuzzy Grouping transform builds a Data Flow Task with a Fuzzy Lookup inside to build an ETI table used to determine the similarity values between multiple rows. Consequently, many of the settings for the Fuzzy Grouping transform will look familiar to you after having just read about the Fuzzy Lookup transform.

At runtime, the Fuzzy Grouping transform adds three key fields to the output stream:

_key_inThe reference key column generated key
_key_outThe generated key for another reference key from the same input set the fuzzy grouping considers a match according to the similarity threshold setting
_scoreThe similarity between the _key_in and _key_out values

The Connection Manager

Figure 22.13 shows the Fuzzy Grouping Transformation Editor Connection Manager tab. This is where you specify where the transform will create the temporary tables the transformation uses.

Figure 22.13. Selecting the connection manager

The Key Columns

Figure 22.14 shows the Columns tab where you specify the key columns to use for grouping input rows. The transform does not reference the pass-through rows when grouping the rows, they simply pass through. However, pass-through rows can be useful for distinguishing between rows that are duplicates and those that simply have the same values for the key columns. For example, there might be more than one contact with the name Eve Fernandez. You could use the pass-through columns such as telephone number, birthdate, mother's maiden name, and so on to distinguish between duplicate records of the identical person versus records with identical or similar key values for different people.

Figure 22.14. Selecting the input columns

The Advanced Settings

Figure 22.15 shows the Advanced tab. You can change the name of the columns used to store the keys and similarity score here. You can also change the similarity threshold. The higher the threshold, the more similar the key values must be to be considered duplicates. Finally, the token delimiters are the same as for the Fuzzy Lookup transform. They provide a way to tune how the key values are tokenized.

Figure 22.15. Selecting the advanced settings

What to Do with Groups

Great, so you've got groups of rows now. You might be asking yourself, "What do I do with them?" A good example is householding. Simply stated, householding is the process of discovering which of your contacts lives together. This is useful for things like determining if a parent is using a child's name to qualify for more than the per household purchase limit in a limited item sale, target marketing based on family structure, or determining which investors reside at the same address so that a single prospectus or other financial report can be sent to multiple investors at that same address.

Depending on the quality and shape of your data, this might be easy or difficult. For example, if all the addresses in the reference data are up to date and accurate, it should be fairly easy to determine which of your customers live together. Using columns such as age, last moved date, and marital status can help with these kinds of problems as well. For example, people get divorced. If the old address data for one of your contacts implies they were once living with another of your contacts, but the newer data for one contact shows a marital status of divorced with a new address, you could bias the data processing to favor creating a new household for the divorced individual.

To find contacts within the same household, you can group by last name. Family members don't always have the same last name, but it's a good start. If you have a Spouse Name column, you can fuzzy match the spouse name with the first name. This might seem a bit odd, but remember, you're not trying to build a clean record of individuals, but rather a single record entity representing a household. Taken together with other fields like address, cross matching helps you find relationships between rows that would otherwise be impossible.

The similarities between all the columns are what constitutes the overall similarity score. After you've configured the column matching, you can set the similarity threshold by iteratively setting, testing, and adjusting it until you arrive at a similarity threshold value that seems to generate the most correct results with the sample data. Then, you can use a Fuzzy Lookup transform on the output of the Fuzzy Grouping transform to look up the house address and other columns as available to arrive at a common address for each of the members of the household. The fuzzy comparison capabilities provide a way to build robust data cleansing solutions using state-of-the-art algorithms for identifying similarities in dirty data. Both transforms support multiple configurable settings for adjusting the parsing, tokenizing, and similarity threshold settings to better adjust the transforms to your business needs. Used separately and together, the Fuzzy Lookup and Fuzzy Grouping transforms can perform a wide range of data cleansing duties.