Chapter XII - Mining Free Text for Structure
Data Mining: Opportunities and Challenges
by John Wang (ed)
Idea Group Publishing 2003



Spreading activation is used to account for lexical variation between the clients' questions and the FAQ answers. Spreading activation is based on WordNet, which consists of four subnets organized by the four parts of speech. Each subnet has its own relations: for example, nouns have antonymy, the isa relation, and three part-of relations. WordNet's basic unit is a synset, which contains words and phrases interchangeable in a context, e.g., "computer" and "data processor."

The activation procedure is depth-constrained. It takes a term and a depth integer specifying how many links away from the term the activation is to spread. Each term found during the spread is annotated with its part of speech and the depth at which it was found. Thus, "device_12" means that "device" is a noun (the first subscript digit) found at depth 2 (the second digit). The origin term's depth is 0. If a word is found at several depths, only the smallest one is kept. Activation is spread only from terms found in the questions. Since questions in FAQs are shorter than answers, the number of non-relevant terms found during the spread is much smaller than it would be if the activation were spread from every nonstoplisted term in every answer.
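The depth-constrained spread described above can be sketched as a breadth-first traversal. This is only an illustration: the toy graph `TOY_LINKS` and the function name `spread_activation` are hypothetical stand-ins for WordNet links, and the part-of-speech annotation is left out for brevity.

```python
from collections import deque

# Hypothetical toy lexical graph standing in for WordNet links;
# the words and edges are illustrative only.
TOY_LINKS = {
    "computer": ["machine", "data processor"],
    "machine": ["device"],
    "device": ["instrumentality"],
    "data processor": [],
    "instrumentality": [],
}

def spread_activation(term, max_depth, links):
    """Spread from `term` to terms at most `max_depth` links away.

    Returns a dict mapping each reached term to the smallest depth
    at which it was found; the origin term has depth 0.
    """
    depths = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if depths[current] == max_depth:
            continue  # do not spread past the depth limit
        for neighbor in links.get(current, []):
            # Breadth-first order reaches each term at its smallest depth first,
            # so a term already recorded is never overwritten.
            if neighbor not in depths:
                depths[neighbor] = depths[current] + 1
                queue.append(neighbor)
    return depths

print(spread_activation("computer", 2, TOY_LINKS))
```

With a depth of 2, "device" is reached at depth 2 and "instrumentality" (three links away) is not reached at all.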

The weight of a term combines its semantic and statistical properties. The semantic properties of a term constitute its intrinsic value. The statistical properties reflect its value in the collection of textual units. The semantic weight of a term t_i, W_wn(t_i, r), is given by

W_wn(t_i, r) = W_pos(t_i) · r^{d(t_i)} / Poly(t_i),

where Poly(t_i) gives the term's polysemy, d(t_i) gives the depth at which t_i was found, W_pos assigns a constant weight to each part of speech, i.e., 1 to nouns, 0.75 to verbs, and 0.5 to adjectives and adverbs, and the rate of decay, r < 1, indicates how much t_i's weight decreases with depth.
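As a rough illustration of how these quantities interact, the sketch below assumes the semantic weight has the form W_pos(t_i) · r^{d(t_i)} / Poly(t_i), which is one reading consistent with the definitions above. The function name `semantic_weight`, the default r = 0.5, and the requirement that r be positive are assumptions made for the example.

```python
# Part-of-speech constants as stated in the text.
W_POS = {"noun": 1.0, "verb": 0.75, "adjective": 0.5, "adverb": 0.5}

def semantic_weight(pos, depth, polysemy, r=0.5):
    """Assumed form: W_pos * r**depth / polysemy, with 0 < r < 1.

    pos      -- part of speech of the term
    depth    -- depth at which the term was found (0 for the origin term)
    polysemy -- number of senses of the term (at least 1)
    r        -- rate of decay; the default 0.5 is an arbitrary choice
    """
    if not 0 < r < 1:
        raise ValueError("rate of decay must lie strictly between 0 and 1")
    if polysemy < 1:
        raise ValueError("polysemy must be at least 1")
    return W_POS[pos] * r ** depth / polysemy

# A monosemous noun at depth 2 and a three-sense verb at depth 0.
print(semantic_weight("noun", 2, 1))
print(semantic_weight("verb", 0, 3))
```

Under this reading, a term loses weight both as it is found farther from the origin term and as its number of senses grows.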

The statistical weight of a term combines several approaches. Let K be a collection of D documents. Let d_j be a document in K. If f(t_i, d_j) denotes the frequency of occurrence of t_i in d_j, then T_i = Σ_{j=1}^{D} f(t_i, d_j) denotes the number of occurrences of t_i in K. Let N_i be the number of documents containing at least one occurrence of t_i. N_i depends on the distribution of t_i among the documents of K. Let Ñ_i be the random variable that assumes the values of N_i and let E(Ñ_i) be its expected value, assuming that each occurrence of t_i can fall into any of the D documents with equal probability.

The first statistical weight metric is the inverse document frequency (Salton & McGill, 1983). The inverse document frequency (IDF) of t_i in K, W_idf ( t_i , K ), is given by 1 + log(D/N_i). The tfidf weight of t_i in d_j, W_tfidf ( t_i , d_j ), is given by f ( t_i , d_j ) W_idf ( t_i , K ).
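The two formulas above can be computed directly from tokenized documents. The sketch below is a minimal illustration: the function names, the toy documents, and the use of the natural logarithm are assumptions, since the text does not fix a log base.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Return {term: 1 + log(D / N_i)} for a list of token lists."""
    num_docs = len(docs)
    # N_i: number of documents containing at least one occurrence of the term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: 1 + math.log(num_docs / n) for term, n in doc_freq.items()}

def tfidf(term, doc, idf):
    """Return f(t_i, d_j) * W_idf(t_i, K) for one term in one document."""
    return doc.count(term) * idf[term]

# Toy collection; the tokens are illustrative only.
docs = [
    ["modem", "modem", "driver"],
    ["driver", "printer"],
    ["driver"],
]
idf = idf_weights(docs)
print(idf["driver"], idf["modem"])
print(tfidf("modem", docs[0], idf))
```

A term that occurs in every document, such as "driver" here, receives the minimum IDF of 1, while a term confined to one document receives 1 + log(D).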

The second statistical weight metric is condensation clustering (Kulyukin, 1998b). A sequence of textual units proceeds from topic to topic. Terms pertinent to a topic exhibit a non-random tendency to condense in the units that cover the topic. One refers to such terms as content-bearing. Terms that do not bear content appear to be distributed randomly over the units. The condensation clustering (CC) weight of t_i, W_cc(t_i, K), compares the actual number of documents containing at least one occurrence of t_i with the expected number of such documents and is given by W_cc(t_i, K) = A + log(E(Ñ_i) / N_i), where E(Ñ_i) is the expected number of such documents and A is a constant. A content-bearing term occupies fewer documents than expected, so its ratio exceeds 1 and its weight exceeds A.
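A small sketch may help show how condensation affects the weight. It assumes the form W_cc = A + log(E / N_i), with the expected document count E = D · (1 − (1 − 1/D)^{T_i}) taken from the p_i of Lemma 1, A = 1, and the natural logarithm; the function names and toy documents are hypothetical.

```python
import math

def expected_doc_count(total_occurrences, num_docs):
    """Expected number of documents hit when each occurrence falls
    into any of `num_docs` documents with equal probability."""
    p = 1 - (1 - 1 / num_docs) ** total_occurrences
    return num_docs * p

def cc_weight(docs, term, A=1.0):
    """Assumed form: A + log(expected document count / actual document count)."""
    num_docs = len(docs)
    total = sum(doc.count(term) for doc in docs)       # T_i
    actual = sum(1 for doc in docs if term in doc)     # N_i
    if actual == 0:
        raise ValueError("term does not occur in the collection")
    return A + math.log(expected_doc_count(total, num_docs) / actual)

# "modem" condenses in one document; "the" is spread over all four.
docs = [["modem"] * 4 + ["the"], ["the"], ["the"], ["the"]]
print(cc_weight(docs, "modem"))
print(cc_weight(docs, "the"))
```

Both terms occur four times, but the condensed term receives the larger weight because it occupies fewer documents than random placement would predict.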

The following lemma shows how to compute the expectation of N_i.

Lemma 1: Let T_i be the total number of occurrences of t_i in K. Then the expected number of documents containing t_i is E(Ñ_i) = D p_i, where p_i = 1 − (1 − 1/D)^{T_i}.

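Proof: For each d_j, put n_ij = 1 if f(t_i, d_j) > 0 and n_ij = 0 otherwise. The probability that none of the T_i occurrences falls into d_j is (1 − 1/D)^{T_i}, so this random variable assumes the values of 1 and 0 with corresponding probabilities of p_i and 1 − p_i. Hence, E(n_ij) = p_i. Since the number of documents containing t_i is Ñ_i = Σ_{j=1}^{D} n_ij, linearity of expectation gives E(Ñ_i) = Σ_{j=1}^{D} E(n_ij) = D p_i.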
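For small collections, the closed form D · (1 − (1 − 1/D)^{T_i}) can be checked by brute force. The sketch below enumerates every equally likely placement of T occurrences into D documents and averages the number of distinct documents hit; the function names are hypothetical, and the check is only practical for small D and T because it visits D^T placements.

```python
from itertools import product

def expected_docs_closed_form(num_docs, total_occurrences):
    """D * p_i with p_i = 1 - (1 - 1/D)**T_i."""
    return num_docs * (1 - (1 - 1 / num_docs) ** total_occurrences)

def expected_docs_by_enumeration(num_docs, total_occurrences):
    """Average number of distinct documents hit over all
    num_docs**total_occurrences equally likely placements."""
    placements = product(range(num_docs), repeat=total_occurrences)
    total = sum(len(set(placement)) for placement in placements)
    return total / num_docs ** total_occurrences

# Compare the two computations for D = 3 documents and T = 4 occurrences.
print(expected_docs_closed_form(3, 4))
print(expected_docs_by_enumeration(3, 4))
```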

The CC weight of t_i captures its importance in K. To account for t_i's importance in d_j, its CC weight is multiplied by its frequency in d_j. Thus, we obtain another statistical weight metric W_tfcc (t_i, d_j, K) = f (t_i , d_j) W_cc (t_i , K). The following lemma captures the relationship between IDF and CC.

Lemma 2: W_cc (t_i) = W_idf (t_i) + log (p_i)

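Proof: By Lemma 1 and the definition of W_cc, W_cc(t_i) = A + log(E(Ñ_i) / N_i) = A + log(D p_i / N_i) = A + log(D / N_i) + log(p_i). But W_idf(t_i) = 1 + log(D / N_i). Hence, W_cc(t_i) = W_idf(t_i) + log(p_i), for A = 1.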
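The identity in Lemma 2 can also be checked numerically. The sketch below assumes W_cc = A + log(D p_i / N_i) with A = 1 and the natural logarithm; the function name `lemma_2_sides` and the sample values of D, N_i, and T_i are hypothetical.

```python
import math

def lemma_2_sides(num_docs, doc_count, total_occurrences, A=1.0):
    """Return (W_cc, W_idf + log(p_i)) for the given D, N_i, and T_i."""
    p = 1 - (1 - 1 / num_docs) ** total_occurrences
    w_idf = 1 + math.log(num_docs / doc_count)
    w_cc = A + math.log(num_docs * p / doc_count)
    return w_cc, w_idf + math.log(p)

# A term occurring 12 times across 5 of 100 documents.
left, right = lemma_2_sides(num_docs=100, doc_count=5, total_occurrences=12)
print(left, right)
```

Since p_i < 1, log(p_i) is negative, so under these assumptions the CC weight never exceeds the IDF weight of the same term.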

