Recently, many approaches have been devised for mining various kinds of knowledge from texts. One important application of text mining is to identify themes and the semantic relations among these themes for text categorization. Traditionally, these themes were arranged in a hierarchical manner to achieve effective searching and indexing as well as easy comprehension for human beings. The determination of category themes and their hierarchical structures was mostly done by human experts. In this work, we developed an approach to automatically generate category themes and reveal the hierarchical structure among them. We also used the generated structure to categorize text documents. A self-organizing map was trained on the document collection to form two feature maps. We then analyzed these maps and obtained the category themes and their structure. Although the test corpus contains documents written in Chinese, the proposed approach can be applied to documents written in any language, provided that such documents can be transformed into a list of separated terms.
In text categorization, we try to assign a text document to some predefined category. When a set of documents is well categorized, both storage and retrieval of these documents can be effectively achieved. A primary characteristic of text categorization is that a category reveals the common theme of the documents under it; that is, these documents form a natural cluster of similar context. Thus, text categorization provides some knowledge about the document collection. An interesting observation about text categorization is that before we can acquire knowledge through it, we need certain kinds of knowledge to categorize documents correctly. For example, two kinds of key knowledge needed to perform text categorization are 1) the categories that we can use, and 2) the relationships among the categories. The first kind of knowledge provides a set of themes that we can use to categorize documents. Similar documents will be placed under the same category if they have the same theme. These categories form the basis of text categorization. The second kind of knowledge reveals the structure among categories according to their semantic similarities. Ideally, similar categories, i.e., categories with similar themes, will be arranged "closely" within the structure in some manner. Such an arrangement provides us with an effective way to store and retrieve documents. Moreover, such a structure may make the categorization result more comprehensible to humans.
Traditionally, these kinds of knowledge were provided by human experts or by semi-automatic mechanisms that combine human knowledge with computing techniques such as natural language processing. For example, the MEDLINE corpus required considerable human effort to carry out categorization using a set of Medical Subject Headings (MeSH) categories (Mehnert, 1997). However, fully automatic generation of categories and their structure is difficult for two reasons. First, we need to select some important words as category terms (or category themes). We use these words to represent the themes of categories and to provide indexing information for the categorized documents. Generally, a category term contains only a single word or a phrase. The selection of the terms affects the categorization result as well as the effectiveness of the categorization. A properly selected category term should represent the general idea of the documents under the corresponding category. Such selections were always made by human linguistic experts because they require insight into the underlying semantic structure of a language. Unfortunately, such insight is hard to automate. Certain techniques such as word frequency counts may help, but it is the human experts who finally decide which terms are most discriminative and representative. Second, for ease of human comprehension, the categories were always arranged in a tree-like hierarchical structure. This hierarchy reveals the relationships among categories. A category associated with a higher-level node of the hierarchy represents a more general theme than those associated with lower-level nodes. Also, a parent category in the hierarchy should represent the common theme of its child categories. Documents of particular interest can be retrieved effectively through such a hierarchy.
Although the hierarchical structure is ideal for revealing the similarities among categories, the hierarchy must be constructed carefully so that unrelated categories do not become children of the same parent category. A thorough investigation of the semantic relations among category terms must be conducted to establish a well-organized hierarchy. This process is also hard to automate. Therefore, most text categorization systems focus on developing methodologies to categorize documents according to human-specified category terms and hierarchies, rather than on generating category terms and hierarchies automatically.
In this work, we provide a method that can automatically generate category themes and establish the hierarchical structure among categories. Traditionally, category themes were selected according to the popularity of words in the majority of documents, which can be done by human engineering, statistical training, or a combination of the two. In this work, we reversed the text categorization process to obtain the category themes. First, we clustered the documents. The self-organizing map (SOM) algorithm (Kohonen, 1997) was trained on the document collection to generate two feature maps, namely the document cluster map (DCM) and the word cluster map (WCM). A neuron in the DCM represents a document cluster, and a neuron in the WCM represents a word cluster. Through the self-organizing process, the distribution of neurons in the maps reveals the similarities among clusters. We selected category themes according to such similarities. To generate the category themes, dominating neurons in the DCM were first found as centroids of super-clusters, each of which represents a general category. The words associated with the corresponding neurons in the WCM were then used to select category themes. Examining the correlations among neurons in the two maps may also reveal the structure of categories.
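The clustering step described above can be illustrated with a minimal sketch of SOM training on term vectors. It is not the implementation used in this work: the function names, the grid size, the linearly decaying learning rate and neighborhood radius, and the two labeling rules (assigning each document to its best-matching neuron to form a DCM, and labeling each neuron with the word indices whose weights exceed a threshold to form a WCM) are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; all hyperparameters are illustrative."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    radius0 = max(rows, cols) / 2
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays linearly toward 0
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)    # keep the neighborhood non-degenerate
        for x in vectors:
            winner = best_matching_unit(weights, x)
            for pos, w in weights.items():
                d2 = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighborhood
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


def document_cluster_map(weights, vectors):
    """Assumed DCM labeling: each document goes to its best-matching neuron."""
    dcm = {pos: [] for pos in weights}
    for i, x in enumerate(vectors):
        dcm[best_matching_unit(weights, x)].append(i)
    return dcm


def word_cluster_map(weights, threshold=0.5):
    """Assumed WCM labeling: word indices whose weight exceeds a threshold."""
    return {
        pos: [k for k, w in enumerate(ws) if w > threshold]
        for pos, ws in weights.items()
    }


# Toy corpus: four documents over a four-word vocabulary (hypothetical data).
docs = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
weights = train_som(docs, rows=2, cols=2)
dcm = document_cluster_map(weights, docs)
wcm = word_cluster_map(weights)
```

In this sketch, `dcm` maps each grid position to the indices of the documents it attracts, and `wcm` maps each position to the word indices that dominate its weight vector, so neurons at the same position in the two maps can be compared directly.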
The corpus used to train the maps consists of documents written in Chinese. We decided to use a Chinese corpus for two reasons. First, over a quarter of the earth's population uses Chinese as a native language. However, experiments on techniques for mining Chinese documents have been relatively fewer than those for documents written in other languages. Second, demand for Chinese-based, bilingual, or multilingual text-mining techniques is rising rapidly. We feel that such demand could not be easily met if experiments were conducted only on English corpora. On the other hand, a difficult problem in developing Chinese-based text-mining techniques is that research on the lexical analysis of Chinese documents is still in its infancy. Therefore, methodologies developed for English documents inevitably play a role in developing a model for knowledge discovery in Chinese documents. In spite of the differences in grammar and syntax between Chinese and English, we can always separate documents, whether written in English or Chinese, into a list of terms that may be words or phrases. Thus, methodologies based on word frequency counts may provide a unified way of processing documents written in any language that can be separated into a list of terms. In this work, a traditional term-based representation scheme from the information retrieval field is adopted for document encoding. The method developed in our work extends naturally to English or multilingual documents because these documents can always be represented by a list of terms.
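The language-independence argument above can be sketched with a small term-based encoder. It assumes that each document has already been segmented into a list of terms, which the text treats as a precondition, and it uses raw term-frequency counts as one plausible instance of a term-based scheme; the exact weighting used in this work is not stated here. The function names and the two toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct term to a column index, in sorted order."""
    terms = sorted({term for doc in docs for term in doc})
    return {term: i for i, term in enumerate(terms)}


def encode(doc, vocab):
    """Return a term-frequency vector for one pre-segmented document."""
    vec = [0] * len(vocab)
    for term, count in Counter(doc).items():
        if term in vocab:  # terms outside the vocabulary are ignored
            vec[vocab[term]] = count
    return vec


# Hypothetical pre-segmented documents: one Chinese, one English.
docs = [
    ["股市", "上漲", "股市"],
    ["stock", "market", "rises"],
]
vocab = build_vocabulary(docs)
vectors = [encode(doc, vocab) for doc in docs]
```

Once segmentation is done, the encoder treats Chinese and English terms identically, which is the property the argument relies on.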