Fuzzy Lookup | Wireless Java : Developing with Java 2, Micro Edition

If you've done some work in the world of extract, transfer, and load processes (ETL), you've run into the proverbial crossroads of handling bad data. The test data is staged, but all attempts to retrieve a foreign key from a dimension table result in no matches for a number of rows. This is the crossroads of bad data. At this point, there are a finite set of options. You could create a set of Astro-Physics-based lookup functions using SQL Sound-Ex, full-text searching, or distance-based word calculation formulas. This strategy is time-consuming to create and test, complicated to implement, and dependent on language lexicon, and it isn't always consistent or reusable (not to mention that everyone from now on will be scared to alter the code for fear of breaking it). You could just give up and divert the row for manual processing by subject matter experts (that's a way to make some new friends). You could just add the new data to the lookup tables and retrieve the new keys. If you just add the data, the foreign key retrieval issue gets solved, but you could be adding an entry into the dimension table that will skew data-mining results downstream. This is what I like to call a lazy-add. This is a descriptive, not a technical, term. A lazy-add would import a misspelled job title like "prasedent" into the dimension table when there is already an entry of "president." It was added, but it was lazy.

The Fuzzy Lookup and Fuzzy Grouping transformations add one more road to take at the crossroads of bad data. These transformations allow the addition of a step to the process that is easy to use, consistent, scalable, and reusable, and they will reduce your unmatched rows significantly — maybe even altogether. If you've already allowed bad data in your dimension tables, or you are just starting a new ETL process, you'll want to put the Fuzzy Grouping transformation to work on your data to find data redundancy. This transformation can examine the contents of a suspect field in a staged or committed table and provide possible groupings of similar words based on provided tolerances. This matching information can then be used to clean up that table. Fuzzy Grouping will be discussed later in this chapter.

If you are correcting data during an ETL process, use the Fuzzy Lookup transformation — my suggestion is to do so only after attempting to perform a regular lookup on the field. This best practice is recommended because Fuzzy Lookups don't come cheap. Fuzzy Lookups build specialized indexes of the input stream and the reference data for comparison purposes. You can store them for efficiency, but these indexes can use up some disk space or can take up some memory if you choose to rebuild them on each run. Storing matches made by the Fuzzy Lookups over time in a translation or pre-dimension table is a great design. Regular Lookup transforms can first be run against this translation table and then divert only those items in the Data Flow that can't be matched to a Fuzzy Lookup. This technique uses Lookup transforms and translation tables to find matches using INNER JOINS. Fuzzy Lookups whittle the remaining unknowns down if similar matches can be found with a high level of confidence. Finally, if your last resort is to have the item diverted to a subject matter expert, you can save that decision into the translation table so that the ETL process can match it next time in the first iteration.

Using the Fuzzy Lookup transformation requires an input stream of at least one field that is a string. Unlike the Term Lookup transformation, which requires a NULL-terminated Unicode string, this transform just needs a text input and most any text data type will do. Internally the transform has to be configured to connect to a reference table that will be used for comparison. The output to this transform will be a set of columns containing the following:

Input and Pass-Through Field Names and Values: This column contains the name and value of the text input provided to the Fuzzy Lookup transform task or passed through during the lookup.
Reference Field Name and Value: This column contains the name and value(s) of the matched results from the reference table.
Similarity: This column contains a number between 0 and 1 representing similarity. Similarity is a threshold that you set when configuring the Fuzzy Lookup task. The closer this number is to 1, the closer the two text fields must match.
Confidence: This column contains a number between 0 and 1 representing confidence of the match relative to the set of matched results. Confidence is different from similarity, because it is not calculated by examining just one word against another but rather by comparing the chosen word match against all the other possible matches. Confidence gets better the more accurately your reference data represents your subject domain, and it can change based on the sample of the data coming into the ETL process.

The Fuzzy Lookup Transformation Editor has three configuration tabs.

Reference Table: This tab sets up the OLE DB Connection to the source of the reference data. The Fuzzy Lookup takes this reference data and builds a token-based index out of it before it can begin to compare items. In this tab are the options to save that index or to use an existing index from a previous process. There is also an option to maintain the index, which will detect changes from run to run and keep the index current. Note that if you are processing large amounts of potential data, this index can grow large.
Columns: This tab allows mapping of the one text field in the input stream to the field in the reference table for comparison. Drag and drop a field from the Available Input Column onto the matching field in the Available Lookup Column. You can also click on the two fields to be compared and right-click to create a relationship. Another neat feature is the ability to add the foreign key of the lookup table to the output stream. To do this, just click on that field in the Available Input Columns. A complete Columns tab should look something like Figure 6-23.
Advanced: This tab contains the settings that control the fuzzy logic algorithms. You can set the maximum number of matches to output per incoming row. The default is set to 1. Incrementing this setting higher than this may generate more results that you'll have to sift through, but it may be required if there are too many closely matching strings in your domain data. A slider controls the Similarity threshold. A recommendation is to start this setting in the middle at .50 when experimenting and move up or down as you review the results. The token delimiters can also be set if, for example, you don't want the comparison process to break incoming strings up by "." or spaces. The default for this setting is all common delimiters. See Figure 6-24 for an example of an Advanced tab.

image from book
Figure 6-23

image from book
Figure 6-24

Although this transform neatly packages some highly complex logic in an easy-to-use component, the results won't be perfect. You'll need to spend some time experimenting with the configurable setting and monitoring the results. You will now look at an example and put this transform to work.

You are going to create a quick demonstration of the Fuzzy Lookup transform's capabilities by setting up a small table of occupation titles that will represent your dimension table. You will then import a set of person records that will require a lookup on the occupation to your dimension table. Not all will match, of course. The Fuzzy Lookup transformation will be employed to find matches, and you will experiment with the settings to learn about its capabilities.

First copy the following data into a text file named "c:\import\empdata.txt." This data will represent employee data that you are going to import. Notice that some of the occupation titles are cut off in the text file because of the positioning within the layout. Also notice that this file has an uneven right margin. Both of these issues are typical ETL situations that are especially painful.

 EMPIDTITLE                  LNAME 00001EXECUTIVE VICE PRESIDENWASHINGTON 00002EXEC VICE PRES         PIZUR 00003EXECUTIVE VP           BROWN 00005EXEC VP                MILLER 00006EXECUTIVE VICE PRASIDENSWAMI 00007FIELDS OPERATION MGR   SKY 00008FLDS OPS MGR           JEAN 00009FIELDS OPS MGR         GANDI 00010FIELDS OPERATIONS MANAGHINSON 00011BUSINESS OFFICE MANAGERBROWN 00012BUS OFFICE MANAGER     GREEN 00013BUS OFF MANAGER        GATES 00014BUS OFF MGR            HALE 00015BUS OFFICE MNGR        SMITH 00016BUS OFFICE MGR         AI 00017X-RAY TECHNOLOGIST     CHIN 00018XRAY TECHNOLOGIST      ABULA 00019XRAY TECH              HOGAN 00020X-RAY TECH             ROBERSON

Run the following SQL code in AdventureWorks or in a database of your choice. This code will create your dimension table and add the accepted entries that will be used for reference purposes.

 CREATE TABLE [Occupation](   [OccupationID] [smallint] IDENTITY(1,1) NOT NULL,   [OccupationLabel] [varchar] (50) NOT NULL  CONSTRAINT [PK_Occupation_OccupationID] PRIMARY KEY CLUSTERED (   [OccupationID] ASC ) ON [PRIMARY] ) ON [PRIMARY] GO INSERT INTO [Occupation] Select 'EXEC VICE PRES' INSERT INTO [Occupation] Select 'FIELDS OPS MGR' INSERT INTO [Occupation] Select 'BUS OFFICE MGR' INSERT INTO [Occupation] Select 'X-RAY TECH'

Create a new SSIS project named Fuzzy Lookup Example. Drop a Data Flow task on the Control Flow design surface and click on the Data Flow tab.
Add a Flat File Connection to the Connection Manager. Name it "Employee Data," and then set the file name to "c:\import\empdata.txt." Set the Format property to Ragged Right. (By the way, for those of you who use flat files, the addition of the ability to process a ragged-right file is a welcome addition.) Set the option to pull the column names from the first data row. Click on the Columns tab and set the columns to break at positions 5 and 28. Click on the Advanced tab and set the OuputColumnWidth property for the TITLE field to 50. Save the connection.
Add a Flat File Source to the Data Flow surface and configure it to use the Employee Data connection. Add an OLE DB Destination and configure it to point to the AdventureWorks database or to the database of your choice.
Add a Fuzzy Lookup transform task to the Data Flow design surface. Connect the output of the Flat File source to the Fuzzy Lookup and the output of the Fuzzy Lookup to the OLE DB Destination.
Open the Fuzzy Lookup Transformation Editor. Set the OLE DB Connection Manager in the Reference tab to use the AdventureWorks database connection and the Occupation table. Set up the Columns tab connecting the input to the reference table columns as in Figure 6-23, and set up the Advanced tab with a Similarity threshold of 50 (.50).
Open the editor for the OLE DB Destination. Set the OLE DB connection to the AdventureWorks database. Click New to create a new table to store the results. Change the table name in the DDL statement to [FuzzyResults]. Click on the Mappings tab, accept the defaults, and save.
Add a Data View of type grid to the Data Flow between the Fuzzy Lookup and the OLE DB Destination.

Run the package, and your results at the Data View should resemble those in Figure 6-25. Notice that the logic has matched 100% of the items at a 50% similarity threshold — and you have the foreign key OccupationID added to your input for free! Had you used a strict INNER JOIN or Lookup Transform, you would have made only four matches, a dismal 21% hit ratio. These items can be seen in the Fuzzy Lookup output where the values are 1 for similarity and confidence.

image from book
Figure 6-25