EXPERIMENTAL RESULTS | Data Mining: Opportunities and Challenges

data mining: opportunities and challenges

Chapter VIII - Mining Text Documents for Thematic Hierarchies Using Self-Organizing Maps
Data Mining: Opportunities and Challenges
by John Wang (ed)
Idea Group Publishing 2003


	Brought to you by Team-Fly

We applied our method to the Chinese news articles posted daily on the Web by the CNA (Central News Agency). Two corpora were constructed in our experiments. The first corpus (CORPUS-1) contains 100 news articles posted on August 1, 2, and 3, 1996. The second corpus (CORPUS-2) contains 3,268 documents posted between October 1 and October 9, 1996. A word extraction process was applied to the corpora to extract Chinese words. A total of 1,475 and 10,937 words were extracted from CORPUS-1 and CORPUS-2, respectively. To reduce the dimensionality of the feature vectors, we discarded those words that occurred only once in a document. We also discarded the words that appeared in a manually constructed stoplist. This reduced the number of words to 563 and 1,976 for CORPUS-1 and CORPUS-2, respectively. A reduction rate of 62% and 82% was achieved for the two corpora respectively.

To train CORPUS-1, we constructed a self-organizing map that contains 64 neurons in an 8 8 grid format. The number of neurons was determined experimentally so that a better clustering could be achieved. Each neuron in the map contains 563 synapses. The initial training gain was set to 0.4, and the maximal training time was set to 100. These settings were also determined experimentally. We tried different gain values ranging from 0.1 to 1.0 and various training time setting ranging from 50 to 200. We simply adopted the setting that achieved the most satisfying result. After training, we labeled the map by documents and words respectively, and obtained the DCM and the WCM for CORPUS-1. The above process was also applied to CORPUS-2 and obtained the DCM and the WCM for CORPUS-2 using a 20 20 map.

After the clustering process, we then applied the category generation process to the DCM to obtain the category hierarchies. In our experiments, we limited the number of dominating neurons to 10. We limited the depths of hierarchy to 2 and 3 for CORPUS-1 and CORPUS-2, respectively. In Figures 4 and 5, we show the overall category hierarchies developed from CORPUS-1. Each tree depicts a category hierarchy where the number on the root node depicts the super-cluster found. The number of hierarchies is the same as the number of super-clusters found at the first iteration of the hierarchy generation process (STAGE-1). Each leaf node in a tree represents a cluster in the DCM. The parent node of some child nodes in level n of a tree represents a super-cluster found in STAGE-(n−1). For example, the root node of the largest tree in Figure 4 has a number 35, specifying that neuron 35 is one of the 10 dominating neurons found in STAGE-1. This node has 10 children, which are the 10 dominating neurons obtained in STAGE-2. These child nodes comprise the second level of the hierarchy. The third level nodes are obtained after STAGE-3. The number enclosed in a leaf node is the neuron index of its associated cluster in the DCM. The identified category themes are used to label every node in the hierarchies. In Figure 6, we only show the largest hierarchy developed from CORPUS-2 due to space limitation.

click to expand
Figure 4: The category hierarchies of CORPUS-1.

click to expand
Figure 5: English translation of Figure 4.

click to expand
Figure 6: One of the category hierarchies developed from CORPUS-2.

We examined the feasibility of our hierarchy generation process by measuring the intra-hierarchy and extra-hierarchy distances. Since text categorization performs a kind of clustering process, measuring these two kinds of distance reveals the effectiveness of the hierarchies. A hierarchy can be considered as a cluster of neurons that represent similar document clusters. These neurons share a common theme because they belong to the same hierarchy. We expect that they will produce a small intra-hierarchy distance that is defined by:

where h is the neuron index of the root node of the hierarchy, and L_h is the set of neuron indices of its leaf nodes. On the other hand, neurons in different hierarchies should be less similar. Thus, we may expect that a large extra-hierarchy distance will be produced. The extra-hierarchy distance of hierarchy h is defined as follow:

Table 1 lists the intra-and extra-hierarchy distances for each hierarchy. We can observe that only one of twenty hierarchies has an intra-hierarchy distance greater than its extra-hierarchy distance. Therefore, we may consider that the generated hierarchies successfully divide the document clusters into their appropriate hierarchies.

Table 1: The intra-and extra-hierarchy distances of every hierarchy developed from CORPUS-1 and CORPUS-2. The root node columns show the neuron indices of the root node of each hierarchy
CORPUS-1
Hierarchy	Root node	Intra-hierarchy distance	Extra-hierarchy distance
1	57	0	2.62
2	22	0.97	2.59
3	17	0	2.71
4	24	1.23	2.57
5	46	2.77	2.45
6	48	1.13	2.54
7	35	1.82	2.20
8	3	0	3.10
9	33	0	2.71
10	20	0	2.63
CORPUS-2
Hierarchy	Root node	Intra-hierarchy distance	Extra-hierarchy distance
1	53	2.18	2.39
2	5	2.01	2.64
3	2	1.37	3.08
4	119	1.90	2.87
5	81	1.95	2.88
6	200	2.17	2.77
7	40	1.73	3.48
8	310	2.41	2.60
9	36	1.86	2.92
10	259	1.60	2.90z

We also examined the feasibility of the theme identification process by comparing the overall importance of an identified theme to the rest of the terms associated with the same category. For any category k, we calculated the average synaptic weight of every terms over C_k. Let t_n be the term corresponding to the nth component of a neuron's synaptic weight vector. We calculate the average synaptic weight of t_n over category k by

Table 2 lists the ranks of the identified themes over all terms for all hierarchies. It is obvious that the identified themes are generally the most important term among all terms and therefore should be the themes of these hierarchies.

Table 2: The ranks of all themes over all terms for all hierarchies
Hierarchy	CORPUS-1 (563 terms)	CORPUS-2 (1,976 terms)
1	1	3
2	1	1
3	1	2
4	1	1
5	1	1
6	1	1
7	1	1
8	1	1
9	1	1
10	1	1


	Brought to you by Team-Fly