TEXT DATA MINING (TDM)

Chapter XX - Critical and Future Trends in Data Mining A Review of Key Data Mining Technologies/Applications
Data Mining: Opportunities and Challenges
by John Wang (ed) 
Idea Group Publishing 2003

The possibilities for DM from textual information are largely untapped, making it a fertile area of future research. Text expresses a vast, rich range of information, but in its original, raw form it is difficult to analyze or mine automatically. TDM has relatively few research projects and commercial products compared with other DM areas. TDM is a natural extension of traditional DM, as well as of information archeology (Brachman et al., 1993). While most standard DM applications automate the discovery of trends and patterns across large DBs and data sets, the goal in text mining is to look for patterns and trends, like nuggets of data, in large amounts of text (Hearst, 1999).

Benefits of TDM

It is important to differentiate between TDM and information access (or information retrieval, as it is better known). The goal of information access is to help users find documents that satisfy their information needs (Baeza-Yates & Ribeiro-Neto, 1999). Text mining, by contrast, focuses on how to use a body of textual information as a large knowledge base from which one can extract new, never-before-encountered information (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam, & Slattery, 1998). However, the results of certain types of text processing can yield tools that indirectly aid the information-access process. Examples include text clustering to create thematic overviews of text collections (Rennison, 1994; Wise, Thomas, Pennock, Lantrip, Pottier, & Schur, 1995), automatically generating term associations to aid in query expansion (Voorhees, 1994; Xu & Croft, 1996), and using co-citation analysis to find general topics within a collection or identify central Web pages (Hearst, 1999; Kleinberg, 1998; Larson, 1996).
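
To make the idea of term associations for query expansion concrete, the following is a minimal sketch that simply counts how often two terms appear in the same document and suggests the strongest associates of a query term. It is an illustration only, not the method of any of the works cited above; the function names (build_associations, expand_query), the min_count parameter, and the toy documents are all assumptions introduced here.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations


def build_associations(docs):
    """Count, for every pair of terms, the number of documents containing both.

    Hypothetical helper: document-level co-occurrence is only one of many
    possible association measures.
    """
    assoc = defaultdict(Counter)
    for doc in docs:
        terms = sorted(set(re.findall(r"[a-z]+", doc.lower())))
        for a, b in combinations(terms, 2):
            assoc[a][b] += 1
            assoc[b][a] += 1
    return assoc


def expand_query(term, assoc, k=2, min_count=2):
    """Return up to k terms that co-occur with `term` in at least min_count documents."""
    candidates = assoc.get(term, Counter())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    return [t for t, count in ranked[:k] if count >= min_count]


# Toy collection (assumed data, for illustration only).
docs = [
    "car engine repair",
    "car engine oil",
    "car insurance quote",
    "engine oil change",
]
assoc = build_associations(docs)
print(expand_query("car", assoc))
print(expand_query("engine", assoc))
```

A real system would also remove stopwords and normalize the counts (for example by term frequency), since raw co-occurrence favors terms that are simply common.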

Methods of TDM

Some of the major methods of TDM include feature extraction, clustering, and categorization. Feature extraction, the mining of text within a single document, attempts to find the significant vocabulary in a natural language text document. Building on this document-level analysis, it is possible to examine collections of documents, using clustering and categorization (classification). Clustering groups documents with similar contents into dynamically generated clusters. Text categorization, in contrast, is more involved: sample documents that fit predetermined "themes" or "categories" are fed into a "trainer," which generates a categorization schema. The documents to be analyzed are then fed into a categorizer, which incorporates that schema and assigns each document to a category according to the taxonomy provided. These features are incorporated in programs such as IBM's Intelligent Miner for Text (Dorre, Gerstl, & Seiffert, 1999).
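
The three methods described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Intelligent Miner for Text or any other product works: feature extraction is approximated with TF-IDF term weights, clustering with a single-pass "leader" scheme driven by cosine similarity, and categorization with a nearest-centroid rule. All function names, the stopword list, the threshold value, and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "as"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def fit_idf(docs):
    """Learn smoothed inverse document frequencies from a collection."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(tokenize(d)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def extract_features(text, idf):
    """Feature extraction: TF-IDF weights; terms unseen by fit_idf are dropped."""
    tf = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.2):
    """Single-pass clustering: each document joins the cluster whose leader
    (first member) is most similar, or starts a new cluster if none reaches
    the threshold. Clusters are lists of document indices."""
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(vec, vectors[members[0]])
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def train(samples, idf):
    """'Trainer': sum the feature vectors of each category's sample documents
    into one centroid per category (a stand-in for the 'schema')."""
    schema = defaultdict(lambda: defaultdict(float))
    for text, category in samples:
        for term, weight in extract_features(text, idf).items():
            schema[category][term] += weight
    return schema


def categorize(text, schema, idf):
    """'Categorizer': assign the category whose centroid is most similar."""
    vec = extract_features(text, idf)
    return max(schema, key=lambda category: cosine(vec, schema[category]))


# Toy collection (assumed data, for illustration only).
docs = [
    "Stock markets fell as investors sold shares",
    "Investors bought shares as stock markets rallied",
    "The team won the football match",
    "Football fans cheered as the team scored",
]
idf = fit_idf(docs)
vectors = [extract_features(d, idf) for d in docs]
print(cluster(vectors))

schema = train([(docs[0], "finance"), (docs[2], "sports")], idf)
print(categorize("Shares rallied on the stock markets", schema, idf))
```

Note the difference the text draws: cluster() needs no labels and discovers its groups from the data, whereas categorize() can only choose among the categories supplied to train().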

Data Mining: Opportunities and Challenges
ISBN: 1591400511
Year: 2003
Pages: 194
Authors: John Wang
