Text Mining and Document Warehousing Components


Traditional Business Intelligence (BI) processes revolve around managing structured data such as numbers, dates, product and people names, addresses, and other well-structured values that can be aggregated and indexed. Such data is useful for determining what happened, when it happened, and the circumstances and location where it occurred. However, what traditional BI doesn't tell you is why it happened. Document warehousing is concerned with the why behind transactions and trends, and the Term Extraction and Term Lookup transforms provide the primitives for building document warehouses.

The Problem with Text

Text is all around us. The most obvious source is the Internet, but beyond that are other rich and, in some cases, more valuable data, such as whitepapers, patent applications, email, analyst reports, legal briefs, product specifications, financial prospectuses, newspaper articles, and even marketing Fear, Uncertainty, and Doubt (FUD). In many ways, these sources of information are much more helpful than historical data in predicting future trends and supporting decisions. The problem is how to turn the mountains of available nonstructured text into a thematic, accurate, indexable, and relevant archive of searchable summarized data.

The answer, at least in part, is the Term Extraction and Term Lookup transforms, which provide primitives upon which you can build a document warehousing toolbox.

Thematic Categorization

The Term Extraction transform makes it possible to build sets of theme keywords from documents that are representative of a given technology, industry, or other subject area and save them into a reference table. The Term Lookup transform then makes it possible to find those theme keywords in other documents and categorize the documents according to how frequently the theme keywords occur in them.
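The categorization step can be sketched outside Integration Services in a few lines of Python. This is only an illustration of the idea, not how the transforms are implemented: the theme names, the keyword sets, and the categorize helper are hypothetical, and the text is split naively on whitespace.

```python
from collections import Counter

# Hypothetical reference data: theme -> keywords that an earlier
# term extraction pass might have saved to a reference table.
THEME_KEYWORDS = {
    "motorsport": {"driver", "race", "circuit", "podium"},
    "graphics": {"driver", "video", "gpu", "display"},
}

def categorize(text, theme_keywords=THEME_KEYWORDS):
    """Return (best_theme, counts) by counting theme keyword occurrences."""
    words = Counter(text.lower().split())
    counts = {
        theme: sum(words[keyword] for keyword in keywords)
        for theme, keywords in theme_keywords.items()
    }
    # The theme whose keywords occur most often wins.
    best = max(counts, key=counts.get)
    return best, counts

theme, counts = categorize(
    "the video driver crashed the display after the gpu update"
)
```

Here the sample sentence matches four graphics keywords and only one motorsport keyword, so it is assigned to the graphics theme.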

Creating Indexes

Documents are only useful if you can find them, hence the importance of indexes. The Term Extraction transform extracts significant nouns or noun phrases from text. These terms can be stored in tables and indexed, providing quick access to the documents they came from.

Summarizing Text

Summaries are built in much the same way as indexes. The difference is that an index stores each extracted word in a table as an individual record, whereas a summary is a collection of words extracted from the same document that are meant to be read together to describe the contents of that document.

Relevance

From a human perspective, relevance can be measured intuitively. After reading a sentence or two, it is usually easy to distinguish between a document about Formula 1 drivers and a document about video drivers. For software, it's a little more complicated.

Term frequency-inverse document frequency (tf-idf) is a way to weight a given term's relevance in a set of documents. Term frequency measures the importance of a term within a particular document based on the number of times it occurs relative to the number of all terms in that document. Inverse document frequency measures the importance of a term based on its occurrence in all documents in a collection. The importance of a term therefore increases with the number of times it appears in a document but is offset by the frequency of the term in all of the documents in the collection.

For the previously mentioned set of documents about Formula 1 drivers and video drivers, the word drivers is pretty important, but because it appears in both kinds of documents it cannot tell them apart on its own. In the documents that discuss Formula 1 drivers, you would find a very small to negligible incidence of the word video and, likewise, in the documents about video drivers, you would find a very small to negligible incidence of the term Formula 1. Using the Term Extraction transform and tf-idf weighting, your packages can distinguish between documents about Formula 1 drivers and documents about video drivers.
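The weighting described above can be made concrete with a small Python sketch. It assumes one common textbook form of tf-idf (term count divided by document length, multiplied by the natural log of the number of documents divided by the number of documents containing the term). The exact formula the Term Extraction transform uses may differ, and the tf_idf function and the sample documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term / total terms in the document
    idf = log(number of documents / documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Count, for each term, how many documents contain it at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "formula drivers race on the circuit",
    "video drivers render frames on the display",
]
weights = tf_idf(docs)
```

Because drivers occurs in both sample documents, its inverse document frequency is log(2/2) = 0 and its weight drops to zero, while formula and video, which each occur in only one document, keep a positive weight.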

Term Extraction

The Term Extraction transform makes it possible to extract key terms from nonformatted text. Table 22.8 contains the transform profile.

Table 22.8. The Term Extraction Transform Profile

Property                      Value          Description
Component Type                Transform
Has Custom Designer           Yes
Internal File I/O             No
Output Types                  Asynchronous
Threading                     Single
Managed                       No
Number Outputs                1
Number Inputs                 1
Requires Connection Manager   Yes            To exclusion terms
Supports Error Routing        Yes


The Term Extraction transform currently works only with English text and can be configured to extract nouns, noun phrases, or both nouns and noun phrases. A noun is defined as a single noun word. A noun phrase is at least two words, one of which is a noun and the other is a noun or an adjective. The Term Extraction transform ignores articles and pronouns and normalizes words so that the capitalized, noncapitalized, and plural versions of a noun are considered identical.

The Term Extraction Transformation Editor

Figure 22.16 shows the Term Extraction tab of the Term Extraction Transformation Editor. The Term Extraction transform output has only two columns:

  • Term: Contains the extracted terms

  • Score: Contains the score for the terms

Figure 22.16. Setting up the columns


Because multiple terms can be extracted per input row, there are usually many more output rows than input rows.

The Exclusion tab lets you create or select a connection manager for a SQL Server or Access database and then choose the table and column that contain the set of words the Term Extraction transform should ignore. By adding words to the specified column, you can eliminate them from the term extraction results.

Figure 22.17 shows the Advanced tab where you can tune the term extraction. The Term Type options let you control the types of terms that will be returned in the resultset. The Score Type options allow you to select simple frequency-based score options or the more complex tf-idf scoring algorithm. The Parameters options allow you to control the frequency and length thresholds. For a term to be returned in the resultset, it must occur at least the number of times specified in the Frequency Threshold field. The Maximum Length of Term setting is only enabled when Noun Phrase or Noun and Noun Phrase term types are selected. Finally, check the Use Case-Sensitive Term Extraction check box when you want to retain term and word case.
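The effect of the frequency threshold, maximum term length, and case-sensitivity settings can be sketched in Python as follows. This is a simplified illustration rather than the transform's implementation: the filter_terms function, its default values, and the candidate list are hypothetical, the candidates are assumed to be already-extracted terms, and term length is assumed to be measured in words.

```python
from collections import Counter

def filter_terms(candidates, frequency_threshold=2, max_term_length=12,
                 case_sensitive=False):
    """Count candidate terms and keep those that pass both thresholds.

    candidates: already-extracted terms (single nouns or noun phrases).
    max_term_length: maximum number of words in a term (assumed unit).
    """
    if not case_sensitive:
        # Fold case so "Package" and "package" count as the same term.
        candidates = [c.lower() for c in candidates]
    counts = Counter(candidates)
    return {
        term: freq
        for term, freq in counts.items()
        if freq >= frequency_threshold
        and len(term.split()) <= max_term_length
    }

candidates = ["Data Flow", "data flow", "package", "Package", "package",
              "connection manager"]
scores = filter_terms(candidates)
strict = filter_terms(candidates, case_sensitive=True)
```

With case folding, data flow and package both reach the threshold of two occurrences. With case-sensitive matching, only the lowercase package does, because each differently cased spelling is counted separately.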

Figure 22.17. Use the Advanced tab to tune term extraction


The Term Lookup Transform

The Term Lookup transform makes it possible to find terms within text and measure how frequently they occur. Table 22.9 contains the transform profile.

Table 22.9. The Term Lookup Transform Profile

Property                      Value          Description
Component Type                Transform
Has Custom Designer           Yes
Internal File I/O             No
Output Types                  Asynchronous
Threading                     Single
Managed                       No
Number Outputs                1
Number Inputs                 1
Requires Connection Manager   Yes            To reference table
Supports Error Routing        Yes


The Term Lookup transform matches terms in a reference table with terms found in text columns. Using the same method as the Term Extraction transform, the Term Lookup transform extracts terms from a text column and then attempts to look up the terms in the reference table. Its output consists of the columns selected as pass-through columns plus the matched term and its frequency. The Term Lookup transform can also be configured to perform case-sensitive matches.
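The shape of that output can be illustrated with a short Python sketch. It is not the transform's implementation: the term_lookup function, the DocId and Body column names, and the sample row are hypothetical, matching is done with a simple whole-word regular expression, and every column other than the text column is treated as a pass-through column.

```python
import re

def term_lookup(rows, text_column, reference_terms, case_sensitive=False):
    """Yield one output row per (input row, matched term) pair.

    Each output row carries the input row's other columns as pass-through
    values, plus the matched Term and its Frequency in the text column.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    for row in rows:
        passthrough = {k: v for k, v in row.items() if k != text_column}
        for term in reference_terms:
            # Whole-word match so "driver" does not match "drivers".
            pattern = r"\b" + re.escape(term) + r"\b"
            frequency = len(re.findall(pattern, row[text_column], flags))
            if frequency:
                yield {**passthrough, "Term": term, "Frequency": frequency}

rows = [{"DocId": 1,
         "Body": "The video driver loads before the video card driver."}]
output = list(term_lookup(rows, "Body", ["video", "driver", "race"]))
```

The single input row produces two output rows, one for video and one for driver, and the DocId value is repeated in each of them. The term race does not occur in the text, so it produces no row.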

Caution

A single input row can return multiple term and frequency values, and every pass-through value is duplicated for each term and frequency value returned. For large text columns, this can increase memory usage considerably. Include pass-through columns only if they are absolutely needed, and be aware of the cost associated with the duplicated values.


Figure 22.18. Configuring the reference columns




Microsoft SQL Server 2005 Integration Services
ISBN: 0672327813
Year: 2006
Pages: 200
Authors: Kirk Haselden
