In the context of data mining, structural mining of text is the task of partitioning text into components, each of which contains a specific kind of information. For example, if it is known that a given HTML text is a home page, one can design strategies to mine the text for the owner's name, address, e-mail, etc. The ever-increasing number of electronically available documents has intensified research on mining text for structure. Texts are typically divided into three broad categories: free, structured, and semi-structured.
If we view texts as islands of content, then free texts can be viewed as content islands without any road maps. To discover a road map in a free text requires a certain amount of data mining through parsing, statistical analysis, or machine learning. Many USENET FAQs and journal articles are good examples of free texts.
Structured texts are information islands whose content is organized according to a specific road map. Relational databases are a good example of structured texts where all of the relations between textual entities, i.e., records, are known and can be readily obtained through well-defined queries. In effect, in a structured text most of the structural data mining has been done.
Semi-structured texts cluster around the borderline between free and structured. Generally speaking, a semi-structured text offers more structure than a free text but less than a structured one. HTML pages are a good example of semi-structured texts. While they offer a standard set of tags that point to the structural organization of information in them, they do not specify the types of information that the tags can label. For example, an HTML list can contain names of people, phone numbers, top stories of the day, etc.
Current research efforts in structural text mining combine techniques of machine learning, natural language processing, and information retrieval. Many machine-learning approaches to text mining are based on the ideas of inductive learning (Mitchell, 1997). The problem of mining text for structure is cast in terms of taking a set of text instances representative of the general population of texts to be mined and extracting sets of rules from those instances. Approaches that use natural language processing (Allen, 1987) typically view texts as objects having a structure that can be discovered through parsing. The problem is stated in terms of a set of constraints that a given domain of texts exhibits and a means to use those constraints in finding text structure. Finally, information-retrieval approaches are based on the idea that texts are intellectual artifacts that consist of words related to each other semantically in a number of complex ways. The intellectual process of producing texts incidentally leaves behind simple statistical regularities (Hearst, 1997). Capturing those regularities through statistical analysis allows one to arrive at the structural organization of information in the texts.
Many authors are concerned with the problem of extracting database-like structures from Web pages, in effect reverse-engineering the process of database-backed Web page generation. Hammer et al. (1997) present a configurable tool for extracting semi-structured data from a set of HTML pages, given a declarative specification of where the data of interest is located. Creating such a specification can be a tedious process, however, and may require an extensive knowledge-engineering effort. The machine-learning approach to this problem has been labeled "wrapper induction" (Kushmerick et al., 1997). The extraction procedure, or wrapper, for a specific resource is learned from a set of representative pages from that resource. Several classes of wrappers have been identified that are both useful and efficiently learnable.
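The flavor of wrapper induction can be conveyed with a toy sketch in the spirit of Kushmerick et al.'s left-right (LR) wrappers: from labeled example values on a page, the learner infers the delimiter strings that surround each value, and the resulting wrapper then extracts everything occurring between those delimiters. The page, the example values, and the fixed context widths below are hypothetical simplifications; the actual learnable wrapper classes are considerably more refined.

```python
import re

def learn_lr_wrapper(page, examples):
    """Toy LR wrapper induction: from labeled example values, learn the
    left and right delimiter strings that surround each value."""
    lefts, rights = [], []
    for value in examples:
        i = page.index(value)
        lefts.append(page[max(0, i - 4):i])           # context before the value
        rights.append(page[i + len(value):i + len(value) + 5])  # context after
    # Longest common suffix of the left contexts ...
    left = lefts[0]
    while not all(l.endswith(left) for l in lefts):
        left = left[1:]
    # ... and longest common prefix of the right contexts.
    right = rights[0]
    while not all(r.startswith(right) for r in rights):
        right = right[:-1]
    return left, right

def apply_wrapper(page, left, right):
    """Extract every string that occurs between the learned delimiters."""
    return re.findall(re.escape(left) + r"(.*?)" + re.escape(right), page)

page = "<ul><li><b>Ada</b></li><li><b>Turing</b></li></ul>"
left, right = learn_lr_wrapper(page, ["Ada", "Turing"])
print(apply_wrapper(page, left, right))  # ['Ada', 'Turing']
```

Once learned from a few representative pages, the wrapper generalizes to any page generated from the same template, which is exactly the property that makes this approach attractive for database-backed Web sites.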
Hsu and Chang (1999) apply several aspects of automata theory to the problem of constructing information extractors for mining semi-structured documents. By semi-structured documents the authors mean HTML pages. The main argument of the proposed research framework rests on the idea that programming information extractors manually is not feasible due to the amount and degree of variation in information placed on the World Wide Web on a daily basis. The authors propose a machine-learning approach to the automated construction of such extractors. The approach is based on learning an extractor from a few examples of information extraction cases.
Hsu and Chang (1999) describe a formalism that represents extractors as Finite-State Transducers (FST). A finite-state transducer is a variation of a finite-state automaton (Hopcroft & Ullman, 1979). The input document is assumed to be tokenized before it is given to a finite-state transducer. The authors distinguish two types of transducers: single-pass and multi-pass. A single-pass transducer scans the text only once. A multi-pass transducer scans the text many times, each time focusing only on a specific type of object to extract. The ultimate goal of the approach proposed by Hsu and Chang (1999) is the automated construction of extractors from a set of training examples. However, the reported empirical evaluations assume that the space of possible graph structures, i.e., finite-state automata, is restricted or that the structure is given to the learner in advance.
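A single-pass transducer of the kind described above can be sketched as a two-state machine over a token stream: transitions are driven by marker tokens, and while the machine is in the "inside" state it writes tokens to the output tape. The `<name>`/`</name>` markers below are hypothetical; they stand in for whatever delimiter tokens a learned extractor would key on.

```python
def extract_names(tokens):
    """Minimal single-pass finite-state transducer sketch.

    Two states, OUTSIDE and INSIDE. Seeing "<name>" moves the machine
    to INSIDE; seeing "</name>" moves it back to OUTSIDE and emits the
    accumulated tokens. All other tokens either pass by (OUTSIDE) or
    are written to the output tape (INSIDE)."""
    OUTSIDE, INSIDE = 0, 1
    state, out, current = OUTSIDE, [], []
    for tok in tokens:
        if state == OUTSIDE and tok == "<name>":
            state, current = INSIDE, []
        elif state == INSIDE and tok == "</name>":
            state = OUTSIDE
            out.append(" ".join(current))
        elif state == INSIDE:
            current.append(tok)
    return out

tokens = "home page of <name> Alan Turing </name> , e-mail below".split()
print(extract_names(tokens))  # ['Alan Turing']
```

A multi-pass transducer in this framework would simply run several such machines over the same token stream, each keyed to a different type of object to extract.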
Freitag (1998) also casts information extraction as a machine-learning problem. It is argued that one solution to that problem is relational learning. Relational learning represents hypotheses as sets of if-then rules. Because sets of if-then statements can be viewed as programs in a logic programming language, such as PROLOG, relational learning is often called Inductive Logic Programming (Mitchell, 1997). Freitag describes a general-purpose top-down relational learning algorithm for information extraction called "SRV." SRV takes as input a set of token-oriented features that encode most of the domain-specific information. For example, they may encode a standard set of questions that can be asked of someone's home page, such as the owner's name, affiliation, e-mail, etc. An answer to each question is assumed to be a text fragment from that home page. Thus, the algorithm solves the problem of finding the best unbroken fragment of text that answers a question from a given set of questions. One of the definite advantages of the SRV algorithm is that it makes no assumption about document structure. Instead, structural information is supplied as input to the system. As a consequence, an argument is made that SRV may be better suited for new domains than other systems. The author reports a successful evaluation of an SRV-based tool in the domain of university course and research project pages. A way is suggested to make the tool Web-aware by extending it with HTML-specific features.
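The flavor of SRV's hypothesis language can be suggested with a small sketch: token-oriented features are computed independently of any document structure, and an induced if-then rule is a conjunction of tests over those features. The specific features and the rule below are invented for illustration and are not taken from Freitag's paper.

```python
def token_features(tok):
    """Token-oriented features of the kind a relational learner consumes
    (a simplified, hypothetical feature set)."""
    return {
        "capitalized": tok[:1].isupper(),
        "numeric": tok.isdigit(),
        "contains_at": "@" in tok,
    }

def is_name_candidate(fragment):
    """An induced if-then rule, written as a conjunction of feature tests:
    IF the fragment has exactly two tokens AND every token is capitalized
    THEN the fragment is an owner-name candidate. (Hypothetical rule.)"""
    toks = fragment.split()
    return len(toks) == 2 and all(token_features(t)["capitalized"] for t in toks)

print(is_name_candidate("Alan Turing"))  # True
print(is_name_candidate("home page"))    # False
```

Because the rule refers only to token features supplied as input, nothing about it presumes a particular document layout, which is the property the author cites when arguing that SRV transfers well to new domains.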
Jacquemin and Bush (2000) present a tool for the acquisition of named entities, e.g., names of companies, from textual sources. The authors' approach combines lexical indices with formatting instructions. Lexical indices are discourse markers, and formatting instructions are HTML tags. The system includes three shallow parsers for mining HTML texts for specific structures such as lists, enumerations, and anchors. The named entities are extracted from the found structures by analyzing discourse markers and HTML tags. While Jacquemin and Bush do not use any machine-learning techniques, their approach is similar in spirit to the approach advocated in Kushmerick et al. (1997) in that it advocates combining structural information about documents with linguistic patterns. The system described by Jacquemin and Bush focuses exclusively on HTML documents and does not tag anything. Its sole purpose is to build lists of named entities found in specified HTML pages.
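The combination of shallow structural parsing with lexical cues can be sketched as a two-stage pipeline: first pull candidate strings out of an HTML structure such as a list, then filter them by discourse markers. The regular-expression parser and the marker set below are toy stand-ins for the three shallow parsers and marker inventories the system actually uses.

```python
import re

def mine_list_items(html):
    """Shallow parse: pull the contents of <li> elements from an HTML page."""
    return [m.strip() for m in re.findall(r"<li>(.*?)</li>", html, re.S)]

def named_entities(items, markers=("Inc.", "Corp.", "Ltd.")):
    """Keep items whose lexical markers suggest a company name
    (a toy heuristic; the marker list is a hypothetical example)."""
    return [it for it in items if any(m in it for m in markers)]

html = "<ul><li>Acme Inc.</li><li>next meeting</li><li>Globex Corp.</li></ul>"
print(named_entities(mine_list_items(html)))  # ['Acme Inc.', 'Globex Corp.']
```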
Hearst (1997) and Hearst and Plaunt (1993) advocate a classic information-retrieval approach to mining documents for structure. The approach is called TextTiling. TextTiling is a method for partitioning full-length text documents into coherent multi-paragraph units. The units form a sequence of subtopical passages. The TextTiling algorithm assumes that a document moves from topic to topic. Each topic has its own vocabulary associated with it. When one topic changes to a different topic, the vocabulary changes, too. Consequently, sharp changes in the vocabulary signify boundaries between the elements of a document structure. The algorithm completely ignores all lexical markers provided by the document authors.
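The core intuition of TextTiling, that subtopic boundaries coincide with sharp drops in lexical similarity, can be sketched by comparing the word-frequency vectors of adjacent text blocks. The sketch below compares whole paragraphs with a fixed similarity threshold; the published algorithm instead uses fixed-size token windows, smoothing, and depth scores around similarity minima.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tile(paragraphs, threshold=0.1):
    """Much-simplified TextTiling: mark a subtopic boundary wherever the
    lexical similarity of adjacent paragraphs falls below a threshold."""
    vecs = [Counter(p.lower().split()) for p in paragraphs]
    return [i + 1 for i in range(len(vecs) - 1)
            if cosine(vecs[i], vecs[i + 1]) < threshold]

paras = [
    "stars and galaxies fill the night sky with stars",
    "galaxies of stars cluster in the sky",
    "recipes for bread need flour yeast and water",
    "knead the bread dough and let the yeast rise",
]
print(tile(paras))  # [2]  -> boundary before the bread paragraphs
```

The vocabulary shift between the astronomy paragraphs and the baking paragraphs drives the similarity to zero at exactly one position, which the method reports as the boundary between the two subtopical passages.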