Chapter XII - Mining Free Text for Structure

*Data Mining: Opportunities and Challenges*, John Wang (ed.), Idea Group Publishing, 2003

Spreading activation is used to account for lexical variation between the clients' questions and the FAQ answers. Spreading activation is based on WordNet, which consists of four subnets organized by the four parts of speech. Each subnet has its own relations: for example, nouns have *antonymy*, the *isa* relation, and three *part-of* relations. WordNet's basic unit is a *synset*, which contains words and phrases interchangeable in a context, e.g., "computer" and "data processor."

The activation procedure is depth-constrained. It takes a term and a depth integer specifying how many links away from the term the activation is to spread. Each term found during the spread is annotated with its part of speech and the depth at which it was found. Thus, "device12" means that "device" is a noun found at depth 2. The origin term's depth is 0. If a word is found at several depths, only the smallest one is kept. Activation is spread only from terms found in the questions. Since questions in FAQs are shorter than answers, the number of non-relevant terms found during the spread is much smaller than it would be if the activation were spread from every non-stoplisted term in every answer.

The weight of a term combines its semantic and statistical properties. The semantic properties of a term constitute its intrinsic value; the statistical properties reflect its value in the collection of textual units. The semantic weight of a term *t*_{i}, *W*_{wn}(*t*_{i}, *r*), is given by

*W*_{wn}(*t*_{i}, *r*) = *W*_{pos}(*t*_{i}) *r*^{*d*(*t*_{i})} / *Poly*(*t*_{i}),

where *Poly*(*t*_{i}) gives the term's polysemy, *d*(*t*_{i}) gives the depth at which *t*_{i} was found, *W*_{pos} assigns a constant weight to each part of speech, i.e., 1 to nouns, 0.75 to verbs, and 0.5 to adjectives and adverbs, and the rate of decay, *r* < 1, indicates how much *t*_{i}'s weight decreases with depth.
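The depth-constrained spread and the semantic weight can be sketched as follows. This is a minimal sketch, not the chapter's implementation: the toy lexical graph stands in for WordNet's relation links, and the exact way the part-of-speech constant, the decay *r*^{d}, and the polysemy discount combine in `semantic_weight` is an assumption reconstructed from the ingredients the text names.

```python
from collections import deque

# Hypothetical toy lexical graph standing in for WordNet links.
# Nodes are (word, pos) pairs; edges are semantic relations (isa, part-of, ...).
GRAPH = {
    ("computer", "n"): [("device", "n"), ("machine", "n")],
    ("machine", "n"): [("device", "n")],
    ("device", "n"): [("instrumentality", "n")],
    ("instrumentality", "n"): [],
}

def spread_activation(origin, max_depth):
    """Breadth-first spread from origin, at most max_depth links away.
    Each term keeps the smallest depth at which it was found,
    as the chapter prescribes; the origin is at depth 0."""
    depths = {origin: 0}
    frontier = deque([(origin, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == max_depth:
            continue  # depth constraint: do not expand further
        for nbr in GRAPH.get(node, []):
            if nbr not in depths:  # first (i.e., smallest) depth wins in BFS
                depths[nbr] = d + 1
                frontier.append((nbr, d + 1))
    return depths

# Part-of-speech constants from the text (WordNet-style pos codes assumed:
# n = noun, v = verb, a = adjective, r = adverb).
W_POS = {"n": 1.0, "v": 0.75, "a": 0.5, "r": 0.5}

def semantic_weight(pos, depth, polysemy, r=0.5):
    """Assumed combination: pos constant, decayed by r**depth,
    discounted by the term's polysemy."""
    return W_POS[pos] * (r ** depth) / polysemy
```

For example, spreading from ("computer", "n") with depth 2 finds ("device", "n") at depth 1 (kept over its depth-2 occurrence via ("machine", "n")) and ("instrumentality", "n") at depth 2.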
The statistical weight of a term combines several approaches. Let *K* be a collection of *D* documents and let *d*_{j} be a document in *K*. If *f*(*t*_{i}, *d*_{j}) denotes the frequency of occurrence of *t*_{i} in *d*_{j}, then *T*_{i} = ∑_{j=1}^{D} *f*(*t*_{i}, *d*_{j}) denotes the number of occurrences of *t*_{i} in *K*. Let *N*_{i} be the number of documents containing at least one occurrence of *t*_{i}; *N*_{i} depends on the distribution of *t*_{i} among the documents of *K*. Let *N̂*_{i} be the random variable that assumes the values of *N*_{i} and let *E*[*N̂*_{i}] be its expected value, assuming that each occurrence of *t*_{i} can fall into any of the *D* documents with equal probability.

The first statistical weight metric is the *inverse document frequency* (Salton & McGill, 1983). The inverse document frequency (IDF) of *t*_{i} in *K*, *W*_{idf}(*t*_{i}, *K*), is given by 1 + log(*D*/*N*_{i}). The *tfidf* weight of *t*_{i} in *d*_{j}, *W*_{tfidf}(*t*_{i}, *d*_{j}), is given by *f*(*t*_{i}, *d*_{j}) *W*_{idf}(*t*_{i}, *K*).

The second statistical weight metric is *condensation clustering* (Kulyukin, 1998b). A sequence of textual units proceeds from topic to topic. Terms pertinent to a topic exhibit a non-random tendency to condense in the units that cover the topic; one refers to such terms as *content-bearing*. Terms that do not bear content appear to be distributed randomly over the units. The condensation clustering (CC) weight of *t*_{i}, *W*_{cc}(*t*_{i}, *K*), compares the expected number of documents containing at least one occurrence of *t*_{i} with the actual number of such documents and is given by

*W*_{cc}(*t*_{i}, *K*) = *A*(1 + log(*E*[*N̂*_{i}]/*N*_{i})),

where *A* is a constant. Content-bearing terms condense into fewer documents than chance predicts, so for them *N*_{i} < *E*[*N̂*_{i}] and the weight is high. The following lemma shows how to compute the expectation of *N̂*_{i}.

**Lemma 1:** Let *T*_{i} be the total number of occurrences of *t*_{i} in *K*. Then *E*[*N̂*_{i}] = *Dp*_{i}, where *p*_{i} = 1 − (1 − 1/*D*)^{*T*_{i}}.

**Proof:** For each *d*_{j}, put *X*_{ij} = 1 if *f*(*t*_{i}, *d*_{j}) > 0 and *X*_{ij} = 0, otherwise.
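A minimal sketch of the three quantities just defined, assuming natural logarithms; the function names are mine:

```python
import math

def idf(D, N_i):
    """Inverse document frequency: W_idf(t_i, K) = 1 + log(D / N_i)."""
    return 1.0 + math.log(D / N_i)

def tfidf(f_ij, D, N_i):
    """W_tfidf(t_i, d_j) = f(t_i, d_j) * W_idf(t_i, K)."""
    return f_ij * idf(D, N_i)

def expected_docs(D, T_i):
    """E[N_i] = D * p_i with p_i = 1 - (1 - 1/D)**T_i (Lemma 1):
    p_i is the chance a given document receives at least one of the
    T_i occurrences when each occurrence falls into any of the
    D documents with equal probability."""
    p_i = 1.0 - (1.0 - 1.0 / D) ** T_i
    return D * p_i
```

For instance, a term occurring once in a 10-document collection is expected to appear in 10 · (1 − 0.9) = 1 document, as it must; with more occurrences than documents, the expectation stays below *D* because occurrences can collide in the same document.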
This random variable assumes the values 1 and 0 with corresponding probabilities *p*_{i} and 1 − *p*_{i}, since each of the *T*_{i} occurrences of *t*_{i} falls outside *d*_{j} with probability 1 − 1/*D*. Hence, *E*[*X*_{ij}] = *p*_{i}. Since *N̂*_{i} = ∑_{j=1}^{D} *X*_{ij}, it follows that *E*[*N̂*_{i}] = ∑_{j=1}^{D} *E*[*X*_{ij}] = *Dp*_{i}. ∎

The CC weight of *t*_{i} captures its importance in *K*. To account for *t*_{i}'s importance in *d*_{j}, its CC weight is multiplied by its frequency in *d*_{j}. Thus, we obtain another statistical weight metric, *W*_{tfcc}(*t*_{i}, *d*_{j}, *K*) = *f*(*t*_{i}, *d*_{j}) *W*_{cc}(*t*_{i}, *K*). The following lemma captures the relationship between IDF and CC.

**Lemma 2:** *W*_{cc}(*t*_{i}) = *W*_{idf}(*t*_{i}) + log(*p*_{i}), for *A* = 1.

**Proof:** By Lemma 1 and the definition of *W*_{cc}, *W*_{cc}(*t*_{i}) = 1 + log(*Dp*_{i}/*N*_{i}). But 1 + log(*Dp*_{i}/*N*_{i}) = 1 + log(*D*/*N*_{i}) + log(*p*_{i}) = *W*_{idf}(*t*_{i}) + log(*p*_{i}). Hence, *W*_{cc}(*t*_{i}) = *W*_{idf}(*t*_{i}) + log(*p*_{i}), for *A* = 1. ∎
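The identity of Lemma 2 can be checked numerically. The form of `cc_weight` below is a reconstruction, written so that the lemma holds with *A* = 1; the function names and the sample values of *D*, *N*_{i}, *T*_{i} are illustrative.

```python
import math

def idf(D, N_i):
    """W_idf(t_i, K) = 1 + log(D / N_i)."""
    return 1.0 + math.log(D / N_i)

def cc_weight(D, N_i, T_i, A=1.0):
    """Reconstructed CC weight: A * (1 + log(E[N_i] / N_i)),
    where E[N_i] = D * p_i by Lemma 1."""
    p_i = 1.0 - (1.0 - 1.0 / D) ** T_i
    return A * (1.0 + math.log(D * p_i / N_i))

# Illustrative values: a term with 12 occurrences condensed into
# 4 of 50 documents.
D, N_i, T_i = 50, 4, 12
p_i = 1.0 - (1.0 - 1.0 / D) ** T_i

# Lemma 2 with A = 1: W_cc = W_idf + log(p_i).
assert abs(cc_weight(D, N_i, T_i) - (idf(D, N_i) + math.log(p_i))) < 1e-9
```

Since log(*p*_{i}) < 0, the CC weight is always below the IDF weight at *A* = 1; the two agree in their ranking only insofar as *p*_{i} varies little across terms.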