The possibilities for DM from textual information are largely untapped, making it a fertile area for future research. Text expresses a vast, rich range of information, but in its original, raw form it is difficult to analyze or mine automatically. TDM has relatively few research projects and commercial products compared with other DM areas. TDM is a natural extension of traditional DM, as well as of information archeology (Brachman et al., 1993). While most standard DM applications aim at the automated discovery of trends and patterns across large DBs and data sets, the goal of text mining is to look for patterns and trends, like nuggets of data, in large amounts of text (Hearst, 1999).
Benefits of TDM
It is important to differentiate between TDM and information access (or information retrieval, as it is better known). The goal of information access is to help users find documents that satisfy their information needs (Baeza-Yates & Ribeiro-Neto, 1999). Text mining, by contrast, focuses on how to use a body of textual information as a large knowledge base from which one can extract new, never-before-encountered information (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam, & Slattery, 1998). However, the results of certain types of text processing can yield tools that indirectly aid the information-access process. Examples include text clustering to create thematic overviews of text collections (Rennison, 1994; Wise, Thomas, Pennock, Lantrip, Pottier, & Schur, 1995), automatically generating term associations to aid in query expansion (Voorhees, 1994; Xu & Croft, 1996), and using co-citation analysis to find general topics within a collection or identify central Web pages (Hearst, 1999; Kleinberg, 1998; Larson, 1996).
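To make the idea of term associations for query expansion concrete, the following is a minimal sketch in Python. It assumes the simplest possible association measure, namely how often two terms occur in the same document; this is an illustrative stand-in and not the technique of the works cited above. The function names term_associations and expand_query, and the parameter per_term, are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def term_associations(docs):
    """Count, for every pair of distinct terms, the documents containing both."""
    assoc = defaultdict(Counter)
    for doc in docs:
        terms = set(doc.lower().split())
        for a, b in combinations(sorted(terms), 2):
            assoc[a][b] += 1
            assoc[b][a] += 1
    return assoc


def expand_query(query, assoc, per_term=1):
    """Append to the query the terms most often seen alongside each query term."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        # Terms never seen in the collection contribute no expansions.
        for related, _count in assoc.get(term, Counter()).most_common(per_term):
            if related not in expanded:
                expanded.append(related)
    return expanded


docs = [
    "car engine repair",
    "car engine oil",
    "car insurance",
    "engine oil change",
]
assoc = term_associations(docs)
print(expand_query("car", assoc))
```

A practical system would normalize the raw counts (for example, by term frequency) so that very common words do not dominate the expansions; that refinement is omitted here for brevity.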
Methods of TDM
Some of the major methods of TDM include feature extraction, clustering, and categorization. Feature extraction, which is the mining of text within a single document, attempts to find the significant vocabulary in a natural language text document. Building on this document-level analysis, it is possible to examine collections of documents. The methods used to do this include clustering and categorization. Clustering is the process of grouping documents with similar contents into dynamically generated clusters, with no categories defined in advance. Text categorization, by contrast, requires a training step. Here, samples of documents fitting into pre-determined "themes" or "categories" are fed into a "trainer," which in turn generates a categorization schema. The documents to be analyzed are then fed into the categorizer, which applies this schema to assign each document to one of the categories in the taxonomy provided. These features are incorporated in programs such as IBM's Intelligent Miner for Text (Dorre, Gerstl, & Seiffert, 1999).
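The three methods can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the text: TF-IDF weighting stands in for feature extraction, a single-pass similarity threshold stands in for clustering, and a nearest-centroid rule stands in for the trainer and categorizer. It is not a description of how Intelligent Miner for Text works, and all names (STOPWORDS, build_idf, vectorize, top_terms, cluster, train, categorize) and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for the example.
STOPWORDS = {"a", "an", "and", "as", "in", "of", "on", "the"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(docs)
    doc_freq = Counter(t for d in docs for t in set(tokenize(d)))
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def vectorize(doc, idf):
    """Feature extraction: map a document to TF-IDF weights over known terms."""
    counts = Counter(tokenize(doc))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def top_terms(doc, idf, k=3):
    """Return the k highest-weighted (most significant) terms of a document."""
    vec = vectorize(doc, idf)
    return sorted(vec, key=vec.get, reverse=True)[:k]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.2):
    """Clustering: groups emerge from the data; no categories are given."""
    idf = build_idf(docs)
    clusters = []  # each entry: (summed member vectors, member indices)
    for i, doc in enumerate(docs):
        vec = vectorize(doc, idf)
        best = max(clusters, key=lambda c: cosine(vec, c[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= threshold:
            best[0].update(vec)  # Counter.update adds the weights
            best[1].append(i)
        else:
            clusters.append((Counter(vec), [i]))
    return [members for _, members in clusters]


def train(samples):
    """Trainer: build a schema (IDF table plus one centroid per category)."""
    idf = build_idf([text for text, _ in samples])
    centroids = {}
    for text, category in samples:
        centroids.setdefault(category, Counter()).update(vectorize(text, idf))
    return idf, centroids


def categorize(doc, schema):
    """Categorizer: assign the category whose centroid is most similar.

    If the document shares no terms with any category, the first category
    in the schema is returned; a real system would handle that case explicitly.
    """
    idf, centroids = schema
    vec = vectorize(doc, idf)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


samples = [
    ("the striker scored a goal in the football match", "sports"),
    ("the team won the football match", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stocks and bonds fell as interest rates rose", "finance"),
]
texts = [text for text, _ in samples]
print(top_terms(texts[0], build_idf(texts)))
print(cluster(texts))
print(categorize("interest rates on bonds", train(samples)))
```

The contrast described above shows up in the function signatures: cluster receives only raw documents, whereas categorize cannot run until train has produced a schema from labeled samples.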