We believe that text-based data mining will continue to evolve primarily in three directions: machine learning, natural language processing, and statistical analysis. A promising research direction seems to be the combination of natural language processing and machine learning, because in many domains documents have a clearly recognizable structure that can be either learned from examples or encoded in a system of parsing rules. Once that structure is found, its components can be mined for content with restricted natural language processing techniques.
Our work on FAQ Minder is a step in this research direction. Viewed abstractly, FAQ Minder is a rule-based document parser that takes a domain-dependent approach to free-text data mining. The rules that the system uses are domain-dependent and manually encoded. However, this need not be the case: such rules may well be learned automatically through a machine-learning approach. The open question is how much manual knowledge engineering the machine-learning approach itself requires. Although many machine-learning approaches exhibit impressive learning rates and results, they require a domain theory that is either explicitly given to the learning algorithm or implicitly encoded in the way learning is done. Sometimes it is more practical to encode the rules manually than to first represent a domain theory and then wait for the learner to produce the rules automatically.
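To make the notion of a rule-based document parser concrete, the following is a minimal sketch in Python. It is not FAQ Minder's rule set, which is not reproduced here; the `RULES` table, the `Q1:`/`A1:` and numbered-question markers, and the function name `parse_faq` are all assumptions chosen for illustration.

```python
import re

# Hypothetical, manually encoded rules (not FAQ Minder's): each rule maps a
# component label to a pattern that recognizes the first line of that component.
RULES = [
    # "Q1: ...?", "Q. ...?", "2. ...?" or "2.1) ...?" opens a question.
    ("question", re.compile(r"^\s*(?:Q\s*\d*\s*[:.)]|\d+(?:\.\d+)*[.)])\s*(.+\?)\s*$")),
    # "A1: ..." or "A. ..." opens an answer.
    ("answer", re.compile(r"^\s*A\s*\d*\s*[:.)]\s*(.*)$")),
]


def parse_faq(text):
    """Split FAQ text into a list of {'question': ..., 'answer': ...} dicts."""
    entries = []
    current = None
    for line in text.splitlines():
        label, payload = None, None
        for name, pattern in RULES:
            match = pattern.match(line)
            if match:
                label, payload = name, match.group(1).strip()
                break
        if label == "question":
            # A question marker starts a new entry.
            current = {"question": payload, "answer": ""}
            entries.append(current)
        elif current is not None:
            # An answer marker or an unmarked continuation line extends the answer.
            piece = payload if label == "answer" else line.strip()
            if piece:
                current["answer"] = (current["answer"] + " " + piece).strip()
    return entries
```

Calling `parse_faq` on a text whose entries follow these conventions yields one dictionary per question-answer pair. Porting the sketch to a differently formatted collection means rewriting `RULES` by hand, which is exactly the knowledge-engineering cost discussed above.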
Statistical analysis is gaining prominence in text-based data mining due to the increasing availability of large text corpora. The corpora, due to their sheer volume, allow statistical techniques to achieve statistically significant results. In recent years, there has been a resurgence in research on statistical methods in natural language processing (Brill & Mooney, 1997). These methods employ techniques that aim to automatically extract linguistic knowledge from natural language corpora rather than require the system developer to do the knowledge engineering manually. While the initial results of corpus-based text mining have been promising, most of the effort has been focused on very low-level tasks such as part-of-speech tagging, text segmentation, and syntactic parsing (Charniak, 1997), which suggests that some amount of domain-dependent knowledge engineering may well be mandatory. One exception to these low-level tasks is the research reported by Ng and Zelle (1997), who apply corpus-based techniques to the problems of word-sense disambiguation and semantic parsing. The reported results are encouraging, but they remain tentative.
Another prominent trend in textual data mining that relies heavily on statistical techniques is automatic thesaurus construction (Srinivasan, 1992). Thesauri are widely used in both indexing and retrieving textual documents. Indexers use thesauri to select the most appropriate terms to describe content; retrievers use thesauri to formulate better queries. For example, if a submitted query does not return acceptable results, the terms of the query can be extended with related terms from a thesaurus. Since manual thesaurus construction is extremely labor-intensive, it is almost guaranteed that future research efforts will focus on completely or partially automating thesaurus construction. In automatic thesaurus construction, the domain is defined in terms of the available documents. The basic idea is to apply certain statistical procedures to identify important terms and, if possible, relationships among them. A promising direction of research in this area of textual data mining is the identification of non-trivial semantic relationships. While statistical methods are good at detecting broad semantic relationships such as genus-species or association (Bookstein & Swanson, 1974), they alone are not sufficient for more subtle relationships such as part-whole, taxonomy, synonymy, and antonymy.
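The statistical idea behind automatic thesaurus construction, and its use for query expansion, can be sketched as follows. The sketch is illustrative only: it relates two terms when the sets of documents containing them overlap, measured with the Jaccard coefficient, which is one common choice and not necessarily the procedure used in the cited work. The stopword list, the `min_score` threshold, and the function names are assumptions.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_thesaurus(documents, min_score=0.5):
    """Relate terms whose document sets overlap strongly (Jaccard coefficient)."""
    doc_freq = Counter()   # number of documents containing each term
    pair_freq = Counter()  # number of documents containing both terms of a pair
    for doc in documents:
        terms = sorted(set(tokenize(doc)))
        doc_freq.update(terms)
        pair_freq.update(combinations(terms, 2))
    thesaurus = defaultdict(dict)
    for (a, b), both in pair_freq.items():
        score = both / (doc_freq[a] + doc_freq[b] - both)
        if score >= min_score:
            thesaurus[a][b] = score
            thesaurus[b][a] = score
    return thesaurus


def expand_query(query, thesaurus, per_term=2):
    """Extend the query terms with the most strongly related thesaurus terms."""
    terms = tokenize(query)
    expanded = list(terms)
    for term in terms:
        related = sorted(thesaurus.get(term, {}).items(), key=lambda kv: (-kv[1], kv[0]))
        for candidate, _ in related[:per_term]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

Note that a co-occurrence score of this kind captures only broad association. It cannot tell whether two related terms are synonyms, antonyms, or a part and its whole, which is the limitation of purely statistical methods noted above.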
Question answering from large online collections is an area that uses many textual data-mining techniques. Given a collection of documents and a collection of questions, a question-answering system can be viewed as mining the document texts for answers to the questions. In contrast to standard information retrieval systems such as search engines, question-answering systems are not allowed to return a full document or a set of documents in response to a question. The assumption here is that the user has neither the interest nor the time to sift through large texts looking for answers.
Question answering has recently achieved enough status and attracted enough research interest to be awarded a separate track at the annual Text Retrieval Conference (TREC) (Voorhees & Harman, 2000; Voorhees & Tice, 2000). A typical question-answering system is expected to return short answers to short factual questions such as "Who is George Washington?" Most modern question-answering systems operate in two steps. Given a question, they first choose a small subset of the available documents that are likely to contain an answer to the submitted question and then mine each of those documents for a specific answer or a set of answers. Statistical means have been rather successful in narrowing the search for a question to a small collection of documents (Clarke, Cormack, & Lynam, 2001). However, they are not as successful at extracting the actual answers from the selected documents. As a system, FAQ Minder was built to address that very problem. More research into document structure and its interaction with document content is and will remain of great importance for question answering.
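The two-step organization described above can be sketched as follows. This is a toy illustration under strong assumptions, not the method of any TREC system: both steps score by simple term overlap, sentences are split on terminal punctuation, and the stopword list and function names are invented for the example.

```python
import re

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "the", "of", "in", "is", "was", "who", "what", "when", "where"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def overlap(question_terms, text):
    """Count the distinct question terms that also occur in the text."""
    return len(question_terms & set(tokenize(text)))


def select_documents(question, documents, k=2):
    """Step 1: keep at most k documents that share terms with the question."""
    q = set(tokenize(question))
    ranked = sorted(documents, key=lambda d: overlap(q, d), reverse=True)
    return [d for d in ranked[:k] if overlap(q, d) > 0]


def extract_answer(question, documents):
    """Step 2: return the single sentence that best matches the question."""
    q = set(tokenize(question))
    best, best_score = None, 0
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            score = overlap(q, sentence)
            if score > best_score:
                best, best_score = sentence.strip(), score
    return best


def answer(question, documents, k=2):
    """Run both steps and return a sentence rather than a whole document."""
    return extract_answer(question, select_documents(question, documents, k))
```

The first step is where statistical ranking does well. The second step, reduced here to sentence-level overlap, is the weak point, and it is the step where knowledge of document structure can help.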
Information extraction is another emerging trend in textual data mining (Cardie, 1997). An information extraction system takes free text as input and mines it for information about a specified topic or domain of interest. The objective is to extract enough information about the topic of interest and encode that information in a format suitable for a database. Unlike in-depth NLP systems, information extractors do a cursory scan of the input text to identify potential areas of interest and then use more expensive text-processing techniques to mine those areas of interest for information. An example of an information extraction system is a system that constructs database records from news wires on mergers and acquisitions. Each news wire is turned into a set of slot-value pairs that specify who merged with whom, when, etc. One of the greatest challenges and research opportunities of information extraction is portability. Information extraction systems are highly domain-dependent. Currently, there are information extraction systems that analyze life insurance applications (Glasgow, Mandell, Binney, Ghemri, & Fisher, 1997) or summarize news wires on terrorist activities (MUC-3, 1991; MUC-4, 1992). These systems, however, use a lot of built-in knowledge from the domains in which they operate and are not easily portable to other domains. The main reason for using built-in knowledge lies in the nature of the task: the more domain-specific knowledge an information extractor has, the better it performs. The principal disadvantage is that the manual encoding of this knowledge is labor-intensive and subject to errors. Thus, automating domain-specific knowledge acquisition is likely to be a promising direction of future research.
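The mergers-and-acquisitions example can be sketched as a two-pass extractor: a cheap trigger scan followed by a costlier pattern match that fills slots. Everything specific in the sketch is an assumption made for illustration, including the trigger verbs, the slot names `who`, `with_whom`, and `when`, the capitalized-word heuristic for company names, and the date format.

```python
import re

# Cheap first pass (assumed trigger verbs): flag sentences worth a closer look.
TRIGGER = re.compile(r"\b(?:acquired|bought|merged)\b")

# Costlier second pass (assumed pattern): capitalized names around a trigger
# verb, optionally followed by a date such as "March 3, 1997".
NAME = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"
PATTERN = re.compile(
    rf"(?P<who>{NAME}) (?:acquired|bought|merged with) (?P<with_whom>{NAME})"
    r"(?: on (?P<when>[A-Z][a-z]+ \d{1,2}, \d{4}))?"
)


def extract_records(text):
    """Turn news-wire text into a list of slot-value dictionaries."""
    records = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not TRIGGER.search(sentence):
            continue  # cursory scan: skip sentences without a trigger verb
        match = PATTERN.search(sentence)
        if match:
            # Keep only the slots that were actually filled.
            records.append({slot: value for slot, value in match.groupdict().items() if value})
    return records
```

The sketch also shows why portability is hard. The trigger verbs, the slots, and the pattern all encode knowledge about one domain, and none of them carries over to, say, life insurance applications without being rewritten.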