Term Lookup


The Term Lookup transform uses the same algorithms and statistical models as the Term Extraction transform to break an incoming stream into noun or noun-phrase tokens, but it is designed to compare those tokens to a stored word list and output a matching list of terms and phrases with simple frequency counts. At this point, a strategy for working with the two term-based transforms should become clear. Periodically use the Term Extraction transform to mine the text data and generate lists of statistical phrases. Store these phrases in a word list, along with phrases that you think the term extraction process should have identified, and remove the phrases that you don't want identified. Then use the Term Lookup transform to reprocess the text input and generate your final statistics. This way, you are generating statistics on known phrases of importance.

You can build on the results of the Term Extraction example by removing the word "model" from the [TermExclusions] table so that it is extracted in future runs. You would then review all of the terms stored in the [TermResults] table, remove the duplicates, and add back any terms that make sense to the subject-matter experts reading the text. You want to generate statistics about which model numbers are generating customer service calls, but you don't want to restrict matching to occurrences of a model number that happen to follow the word "model," so also remove the phrases that combine the word "model" with a model number (a T-SQL sketch of this cleanup follows the table below). The final [TermResults] table should look something like the following:

  term
  ------------
  dent
  door
  freezer
  ice
  ice maker
  maker
  XX-1
  XX-YY3
  XX-Z1
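The transforms are configured in the designer, but the word-list cleanup itself is plain T-SQL. The following is only a minimal sketch of one way to do it; it assumes that both [TermExclusions] and [TermResults] expose the term in a single Term column, so adjust the names to match your schema:

-- Allow "model" to be extracted in future Term Extraction runs.
DELETE FROM TermExclusions
WHERE  Term = 'model';

-- Drop the compound phrases that pair "model" with a model number,
-- keeping the bare model numbers themselves.
DELETE FROM TermResults
WHERE  Term LIKE 'model %';

-- Remove duplicate terms, keeping one row per term
-- (ROW_NUMBER is available from SQL Server 2005 onward).
WITH Ranked AS
(
    SELECT Term,
           ROW_NUMBER() OVER (PARTITION BY Term ORDER BY Term) AS rn
    FROM   TermResults
)
DELETE FROM Ranked
WHERE  rn > 1;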

Take a copy of the package you built in the Term Extraction example, but exchange the Term Extraction transform for a Term Lookup transform and change the OLE DB Destination to output to a table named [TermReport].
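If the [TermReport] table doesn't already exist, you can create it ahead of time. This schema is an assumption based on the Term Lookup transform's default output columns (Term and Frequency) plus the passed-through note column; adjust the sizes to match your data:

-- Destination table for the raw Term Lookup output.
CREATE TABLE TermReport
(
    Term            NVARCHAR(128) NULL,
    Frequency       INT           NULL,
    ConvCustSvcNote NVARCHAR(200) NULL
);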

Open the Term Lookup Editor. It should look similar to Figure 6-21. There are three basic tabs used to set up this transform:

Figure 6-21

  • Reference Table: This is where you configure the connection to the reference table that contains your word list. The Term Lookup transform validates each tokenized term that it finds in the input stream against this table.

  • Term Lookup: After selecting the lookup table, you will map the field from the input stream to the reference table for matching.

  • Advanced: This tab has a single setting that controls whether the matching is case-sensitive.

Running this package produces the list of phrases from your stored word list that were found in the input. A sample of the first six rows is displayed below. Notice that this result set doesn't summarize the findings; you are just given a blow-by-blow report of how many times each term in the word list was found in each row of the customer service notes. In this text sample, it is just a coincidence that each term appears only once in each note.

 Term          Frequency   ConvCustSvcNote
 ------------- ---------   --------------------------------------------------
 freezer           1       ice maker in freezer stopped working model XX-YY3
 ice maker         1       ice maker in freezer stopped working model XX-YY3
 XX-YY3            1       ice maker in freezer stopped working model XX-YY3
 door              1       door to refrigerator is coming off model XX-1
 XX-1              1       door to refrigerator is coming off model XX-1
 ice maker         1       ice maker is making a funny noise XX-YY3
 (Only the first six rows of the result set are displayed)

To complete the report, add an Aggregate transform between the Term Lookup transform and the OLE DB Destination. Set up the Aggregate transform to ignore the ConvCustSvcNote column, group by the Term column, and sum the Frequency column. Connect the Aggregate transform to the OLE DB Destination and remap the columns in the destination.
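The Aggregate transform is configured entirely in the designer, but its output is equivalent to a simple GROUP BY. If you query the row-level lookup output shown earlier (here assumed to still be sitting in [TermReport] from the first run), a statement like this produces the same summary and is handy for verifying the package:

-- Equivalent of the Aggregate transform's configuration:
-- ignore ConvCustSvcNote, group by Term, and sum Frequency.
SELECT Term,
       SUM(Frequency) AS Frequency
FROM   TermReport
GROUP  BY Term
ORDER  BY SUM(Frequency) DESC;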

Although this is a very rudimentary example, you will start to see the possibilities of using SSIS for very raw and unstructured data sources like this customer service comment data. In a short period of time, you have pulled some meaningful results from the data. Already you can provide the intelligence that model XX-Z1 is generating 45% of your sample calls and that 36% of your customer calls are related to the ice maker. Figure 6-22 shows some sample data that you can generate with what you know so far. Pretty cool results from what is considered unstructured data.

Figure 6-22
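Percentages like the ones just cited are easy to derive once the summarized rows are in place. A minimal sketch, assuming the aggregated results live in [TermReport] with Term and Frequency columns:

-- Each term's share of the total matches, using a windowed SUM
-- (supported from SQL Server 2005 onward).
SELECT Term,
       Frequency,
       CAST(100.0 * Frequency / SUM(Frequency) OVER () AS DECIMAL(5, 1))
           AS PercentOfTotal
FROM   TermReport
ORDER  BY Frequency DESC;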


