Chapter VIII - Mining Text Documents for Thematic Hierarchies Using Self-Organizing Maps
Data Mining: Opportunities and Challenges
by John Wang (ed) 
Idea Group Publishing 2003

Text categorization or classification systems usually categorize documents according to some predefined category hierarchy. An example is the work by the CMU text-learning group (Grobelnik & Mladenic, 1998), which used the Yahoo! hierarchy to categorize documents. Most text categorization research has focused on developing methods for categorization. Examples are the Bayesian independent classifier (Lewis, 1992), decision trees (Apte, Damerau, & Weiss, 1994), linear classifiers (Lewis, Schapire, Callan, & Papka, 1996), context-sensitive learning (Cohen & Singer, 1996), learning by combining classifiers (Larkey & Croft, 1996), and instance-based learning (Lam, Ruiz, & Srinivasan, 1999).

Another use of a category hierarchy is browsing and searching retrieval results. An example is the Cat-a-Cone system developed by Hearst and Karadi (1997). Feldman, Dargan, and Hirsh (1998) combined keyword distribution and keyword hierarchy to perform a range of data-mining operations on documents.

Approaches to automatically generating category themes are closely related to research on topic identification or theme generation for text documents. Salton and Singhal (1994) generated a text relationship map among text excerpts and recognized all groups of three mutually related text excerpts. A merging process was then applied iteratively to these groups to obtain the theme (or a set of themes) of all text excerpts. Another approach, by Salton (1971) and Salton and Lesk (1971), clusters the document set and constructs a thesaurus-like structure. For example, Salton and Lesk divided the whole dataset into clusters in which no overlap is allowed and constructed a dictionary. The approach is nonhierarchical in this sense. Clifton and Cooley (1999) used traditional data-mining techniques to identify topics in a text corpus. They used a hypergraph partitioning scheme to cluster frequent item sets. Each topic is represented as the set of named entities of the corresponding cluster. Ponte and Croft (1997) applied dynamic programming techniques to segment text into relatively small segments, which can then be used for topic identification. Lin (1995) used a knowledge-based concept counting paradigm to identify topics through the WordNet hierarchy. Hearst and Plaunt (1993) argued that the advent of full-length documents creates a need for subtopic identification. They developed techniques for detecting subtopics and performed experiments using sequences of locally concentrated discussions rather than full-length documents. All these works may, to some extent, identify topics of documents that can be used as category themes for text categorization. However, they either rely on a predefined category hierarchy (e.g., Lin, 1995) or do not reveal the hierarchy at all.
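
The first step of the text relationship map idea attributed above to Salton and Singhal (1994), recognizing groups of three mutually related excerpts, can be illustrated with a minimal sketch. The similarity measure (cosine over raw term counts), the threshold of 0.3, whitespace tokenization, and the function names are all assumptions made for illustration; the text does not specify how the original work computed relatedness, and the iterative merging step is not shown.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relationship_map(excerpts, threshold=0.3):
    """Link every pair of excerpts whose similarity reaches the threshold.

    The threshold value is an illustrative assumption.
    """
    vectors = [Counter(text.lower().split()) for text in excerpts]
    links = set()
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            links.add((i, j))  # stored with i < j
    return links


def triangles(n, links):
    """Return all index triples of mutually linked excerpts."""
    return [
        (i, j, k)
        for i, j, k in combinations(range(n), 3)
        if (i, j) in links and (i, k) in links and (j, k) in links
    ]


# Hypothetical excerpts: three on one subject, one unrelated.
excerpts = [
    "neural networks learn document categories",
    "document categories learned by neural networks",
    "neural networks assign categories to each document",
    "stock prices fell sharply on monday",
]
links = relationship_map(excerpts)
print(triangles(len(excerpts), links))  # [(0, 1, 2)]
```

Each triple returned here would be a candidate group for the merging process; the unrelated fourth excerpt shares no terms with the others and so joins no group.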

Recently, researchers have proposed methods for automatically developing category hierarchies. McCallum and Nigam (1999) used a bootstrapping process to generate new terms from a set of human-provided keywords, so their approach still requires human intervention. Probabilistic methods have been widely used to exploit hierarchy. Weigend, Wiener, and Pedersen (1999) proposed a two-level architecture for text categorization. The first level of the architecture predicts the probabilities of the meta-topic groups, which are groups of topics. This allows the individual models for each topic on the second level to focus on finer discrimination within the group. They used a supervised neural network to learn the hierarchy, where topic classes were provided and already assigned. A different probabilistic approach by Hofmann (1999) used an unsupervised learning architecture called the Cluster-Abstraction Model to organize groups of documents in a hierarchy.
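
The two-level idea described above can be made concrete with a small sketch. It assumes that a topic's overall score is the product of its meta-topic group probability and its within-group probability. That combination rule, the function name, and the example groups and topics are assumptions for illustration, not details given in the text, and fixed numbers stand in for the outputs of the learned models.

```python
def combine_two_level(group_probs, topic_probs_within_group):
    """Score each topic as P(group) * P(topic | group).

    group_probs: first-level probabilities of the meta-topic groups.
    topic_probs_within_group: second-level probabilities of each topic,
        conditioned on its group.
    The product rule is an assumed way of combining the two levels.
    """
    scores = {}
    for group, p_group in group_probs.items():
        for topic, p_topic in topic_probs_within_group[group].items():
            scores[topic] = p_group * p_topic
    return scores


# Hypothetical outputs of the two levels for one document.
group_probs = {"finance": 0.75, "sports": 0.25}
within = {
    "finance": {"stocks": 0.75, "bonds": 0.25},
    "sports": {"soccer": 0.5, "tennis": 0.5},
}

scores = combine_two_level(group_probs, within)
best = max(scores, key=scores.get)
print(best, scores[best])  # stocks 0.5625
```

Because the first level settles the coarse decision, each second-level model only has to separate topics inside its own group, which is the finer discrimination the paragraph refers to.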

Research on Chinese text processing has focused on the tasks of retrieval and segmentation. Some work can be found in Chen, He, Xu, Gey, and Meggs (1997); Dai, Loh, and Khoo (1999); Huang and Robertson (1997a); Nie, Brisebois, and Ren (1996); Rajaraman, Lai, and Changwen (1997); and Wu and Tseng (1993, 1995). To our knowledge, there is still no work on knowledge discovery in Chinese text documents. The self-organizing map model used in this work has been adopted by several other researchers for document clustering (for example, Kaski, Honkela, Lagus, & Kohonen, 1998; Rauber & Merkl, 1999; and Rizzo, Allegra, & Fulantelli, 1999). However, we found no work similar to our research.

ISBN: 1591400511