We examined the task of mining free-text electronic documents for structure in the context of the FAQ Minder system, which performs structural mining of FAQs found in USENET newsgroups. The system identifies the logical components of FAQs: tables of contents, questions, answers, and bibliographies. These components are then used by FAQ Finder, a question-answering system.
The approach to mining free text for structure advocated in FAQ Minder is qualitative in nature. It rests on the assumption that documents in many domains adhere to a set of rigid structural standards. If those standards are known to a text-mining system, they can be put to use for a variety of tasks, such as question answering or information extraction. This approach complements numerical approaches insofar as the structural organization of information in a document is discovered by mining the free text for content markers left behind by document writers; the behavior of those markers is known to the system a priori.
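The marker-based idea can be pictured with a minimal sketch. The marker patterns below (a "Q:" or numbered-question convention, a handful of section headings) are hypothetical stand-ins; FAQ Minder's actual marker inventory is richer and more carefully engineered:

```python
import re

# Hypothetical content markers, for illustration only.
QUESTION_RE = re.compile(r"^(Q\d*[.:)]|\d+[.)]\s+.+\?)", re.IGNORECASE)
SECTION_MARKERS = {
    "table of contents": "toc",
    "contents": "toc",
    "bibliography": "bibliography",
    "references": "bibliography",
}

def tag_lines(lines):
    """Assign a coarse structural label to each line of a FAQ file,
    driven entirely by a priori knowledge of the content markers."""
    current = "preamble"
    tagged = []
    for line in lines:
        stripped = line.strip()
        if stripped.lower() in SECTION_MARKERS:
            current = SECTION_MARKERS[stripped.lower()]
        elif QUESTION_RE.match(stripped):
            current = "question"
        elif current == "question" and stripped:
            # First non-question line after a question begins its answer.
            current = "answer"
        tagged.append((current, line))
    return tagged
```

The point of the sketch is that no training data is involved: all of the system's competence lives in the hand-encoded marker patterns.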
We presented a number of alternative approaches to mining text for structure, the most prominent of which are based on machine-learning methods. What all of these approaches have in common is the assumption that there is some fixed structure, such as a set of database fields, whose contents are to be filled in with chunks of text drawn from the document being processed. FAQs fit this assumption: there are slots for questions, answers, tables of contents, and bibliographies, and a minimal amount of surrounding context. However, FAQs are much closer to free text than the HTML pages that constitute the focus of most machine-learning text-mining systems.
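The fixed-structure assumption can be made concrete with a small sketch. The record below is a hypothetical rendering of the "slots" a slot-filling system would populate for a FAQ; the field names are illustrative, not taken from any particular system:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FaqRecord:
    """A hypothetical fixed structure for the extraction view of FAQs:
    each field is a slot to be filled with chunks of the source text."""
    table_of_contents: str = ""
    bibliography: str = ""
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer)
```

Under this view, mining a FAQ reduces to deciding which text chunk fills which slot.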
The researchers who advocate machine-learning approaches to text mining often state that the manual encoding of rules is tedious and labor intensive, and that machine-learning approaches are consequently more promising because they can learn the rules automatically. This argument has definite merits, but it does not give the full picture. In practice, all successful machine-learning approaches require a domain theory (Freitag, 1998) that must be encoded manually. Many machine-learning approaches also require that input data be represented in a specific way, effectively taking knowledge representation for granted. Thus, the promise of automation is never completely fulfilled.
Many text-oriented machine-learning techniques (Charniak, 1997; Ng & Zelle, 1997) are powerful and promising. Their main weakness, however, is that they are created with no specific task in mind and do not lend themselves easily to customization for specific domain-dependent problems. When such customization is feasible, it is effectively equivalent to the manual knowledge engineering required in approaches similar to FAQ Minder's.
Numerical approaches in information retrieval offer yet another alternative to qualitative free-text mining. We presented one such approach, TextTiling (Hearst, 1997; Hearst & Plaunt, 1993). TextTiling was designed for large documents with large sections, which contain enough word data for differences between topics to be statistically noticeable. This technique would probably not be effective in segmenting FAQ files or similar documents, since questions are typically only one or two sentences long and answers rarely exceed 200 words.
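The core intuition behind TextTiling can be shown in a minimal sketch: adjacent blocks of text are compared by the cosine similarity of their word-count vectors, and a drop in similarity signals a likely topic boundary. This is a simplification of Hearst's algorithm, which additionally applies smoothing and converts these raw gap similarities into depth scores:

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def gap_similarities(sentences, block_size=2):
    """Lexical similarity across each gap between sentences, comparing
    the block of sentences before the gap with the block after it.
    Low values suggest topic boundaries (TextTiling's core signal)."""
    tokens = [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    scores = []
    for gap in range(1, len(tokens)):
        left = sum(tokens[max(0, gap - block_size):gap], Counter())
        right = sum(tokens[gap:gap + block_size], Counter())
        scores.append(cosine(left, right))
    return scores
```

Note why the technique needs long sections: with one- or two-sentence questions, the word-count vectors are so sparse that the similarity signal is dominated by noise.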
Each approach presented in this chapter has its strengths and weaknesses; none of them solves the problem of free-text data mining completely. Qualitative approaches, such as FAQ Minder's, produce good results but require manual knowledge engineering. Machine-learning methods can acquire complete data-mining models automatically but require domain theories and strict data formats. Information retrieval approaches do not require any knowledge engineering but cannot function without large text corpora. Hence, we believe that hybrid text-mining models are one of the most promising research directions in free-text data mining. A hybrid model is one that requires a modest amount of knowledge engineering in exchange for scalable performance and acceptable results. The idea behind hybrid models is intuitively appealing: combine the relative strengths of the available alternatives while minimizing their relative weaknesses.