Chapter XII - Mining Free Text for Structure
Data Mining: Opportunities and Challenges
by John Wang (ed) 
Idea Group Publishing 2003


Over the past several years, the Internet has seen a proliferation of newsgroups. A newsgroup is started by individuals interested in a topic, e.g., caffeine or cars. These individuals, who are experts on the topic, want to make their expertise publicly available, which they accomplish through the newsgroup's FAQ.

Looking for Answers to Transient Questions

Newsgroup-based expertise distribution works for people with a stable interest in the newsgroup's topic. However, many people have more transient interests. Typically, such transient interests are caused by questions whose answers are beyond an information seeker's area of expertise. There are three types of problems that information seekers with transient interests confront: insufficient knowledge, insufficient time, and privacy.

Let us illustrate these problems with an example. Consider a college student who wants to write a report on the therapeutic effects of caffeine. The student may not know that a coffee newsgroup exists, and may spend a great deal of time searching for it. Even if the student already knows about the newsgroup, wanting an answer to one question does not mean that he wants to subscribe to the newsgroup and then read a dozen messages a day, most of which have nothing to do with his question.

Even if the student knows about the newsgroup's FAQ, the student may not have the time to browse for an answer. This is because many newsgroups have FAQs containing hundreds and sometimes thousands of question-answer pairs (Q&A's) and provide no search or browsing tools to mine those Q&A's.

Finally, the student may be concerned about privacy. If he posts a question to the newsgroup, his name will be read by hundreds, possibly thousands, of subscribers. Some newsgroups are known for their inflammatory nature and are not friendly to novices or casual posters.

These problems signify a need for a system that provides Web and Internet users with a gateway to the newsgroups' expertise. Users who do not know a relevant newsgroup should not spend much time searching for it. Users with transient interests in the newsgroup's topic should not have to make unwanted commitments to obtain answers.

Outline of a Solution

FAQ Finder was developed to meet this need for a gateway to the newsgroups' expertise (Burke et al., 1997). The question-answering task is conceptualized as the retrieval of answers to similar questions answered previously. To answer a new question is to choose a suitable Q&A collection, i.e., a set of FAQs, and to retrieve from it the answer to a similar question. There is a substantial literature on FAQ Finder (Burke, Hammond, & Cooper, 1996; Burke, Hammond, & Young, 1996; Kulyukin, 1998a, 1998b). Here we only offer an outline of the system, because it grounds our free-text mining task in a proper context.

FAQ Finder answers natural language questions from a collection of 602 Usenet FAQs. Given a question, FAQ Finder:

  • finds a small set of FAQs relevant to the question;

  • displays short descriptions of those FAQs to the user; and,

  • retrieves a small number of Q&A's relevant to the question from the chosen FAQ.

Figure 1 shows FAQ Finder's flowchart. The submitted question is mapped to a set of FAQs that are potentially relevant to the question (FAQ retrieval). A FAQ from the list is chosen either by the user or by the system. For example, if the user chooses the quick match option, the system automatically selects the top-ranked FAQ. The chosen FAQ is then searched for answers to the question, and a list of relevant Q&A's, if any are found, is returned to the user (Q&A retrieval).

Figure 1: How FAQ Finder works.

The FAQ retrieval is accomplished by the vector space retrieval model (Salton & McGill, 1983). Each FAQ is turned into a vector of term weights in a multidimensional vector space whose dimensions are the terms found in all of the FAQs in the collection. Terms are computed from the free texts of FAQs. Common words, such as "and," "to," or "from," are removed. The remaining words become terms through stemming, a vocabulary normalization procedure that reduces word forms to their stems (Frakes & Baeza-Yates, 1992). For example, "information," "informed," "informant," and "informing" are all reduced to "inform."
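The term extraction step described above can be sketched in a few lines. This is only an illustration: the stop list and the suffix list are small hand-picked assumptions, and the suffix stripper is a toy stand-in for a real stemming algorithm, not the stemmer FAQ Finder used.

```python
# Toy stop list and suffix list: illustrative assumptions, not FAQ Finder's own.
STOP_WORDS = {"and", "to", "from", "the", "a", "of", "in", "is"}
SUFFIXES = ("ation", "ant", "ing", "ed")  # tried in this order


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Lowercase the text, drop punctuation and stop words, and stem the rest."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [stem(w) for w in words if w and w not in STOP_WORDS]


if __name__ == "__main__":
    for w in ("information", "informed", "informant", "informing"):
        print(w, "->", stem(w))
    print(extract_terms("Information and informing from afar."))
```

A production stemmer applies many more rules and conditions than this suffix list, so the sketch should be read as a picture of the idea rather than of any particular algorithm.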

As a simple example, consider a collection of three FAQs, F1, F2, and F3, where each FAQ contains three terms: T1, T2, and T3. We have a three-dimensional vector space, in which each vector corresponds to a FAQ and consists of three term weights. A term's weight is the ratio of the term's frequency in the FAQ to the number of FAQs in which the term occurs at least once. Each weight is a coordinate along the dimension of the corresponding term. A user's question is turned into a vector in the same FAQ vector space. The similarity between the question vector and a FAQ vector is computed as the cosine of the angle between them. Thus, the smaller the angle, the larger the cosine and the more relevant the FAQ is to the question.
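The three-FAQ example can be worked through as a small sketch. The term counts below are invented for illustration, and the weighting follows the simple ratio described in the text (frequency in the FAQ divided by the number of FAQs containing the term); names such as term_counts and weight_vector are hypothetical and not taken from FAQ Finder.

```python
import math

# Hypothetical term counts for FAQs F1-F3 over terms T1-T3 (made-up numbers).
term_counts = {
    "F1": {"T1": 2, "T2": 0, "T3": 1},
    "F2": {"T1": 0, "T2": 3, "T3": 1},
    "F3": {"T1": 1, "T2": 1, "T3": 1},
}
terms = ["T1", "T2", "T3"]

# Number of FAQs in which each term occurs at least once.
doc_freq = {t: sum(1 for c in term_counts.values() if c[t] > 0) for t in terms}


def weight_vector(counts):
    """Weight of a term = its frequency divided by its FAQ count."""
    return [counts.get(t, 0) / doc_freq[t] for t in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


faq_vectors = {name: weight_vector(c) for name, c in term_counts.items()}

# A question mentioning T1 and T3 once each, mapped into the same space.
question = weight_vector({"T1": 1, "T3": 1})

ranking = sorted(faq_vectors,
                 key=lambda name: cosine(question, faq_vectors[name]),
                 reverse=True)

if __name__ == "__main__":
    print(ranking)
```

With these counts, F1 shares both of the question's terms and has no weight on T2, so it forms the smallest angle with the question vector and ranks first.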

The Q&A retrieval begins when a FAQ is selected to be searched for answers. The Q&A Retriever computes the similarity score between the question and each Q&A in the FAQ. The score combines a statistical metric and a semantic metric.

To compute the statistical similarity, the question is turned into a term weight vector in the space of the selected FAQ. The cosine similarity score is computed between the question vector and each Q&A vector in the FAQ.

The semantic similarity is based on recognizing semantic relations among the words of the user's question and the words of a Q&A's question. Such relations are found through WordNet, a semantic network of English words and phrases developed at Princeton University (Miller, 1995). For example, if the user's question contains "computer" and the Q&A's question contains "machine," the two questions are similar insofar as "computer" is connected to "machine" via the isa link in WordNet's noun network (Kulyukin, 1998b). More details on the semantic and statistical similarities are provided in the Appendix.
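The idea of following isa links can be illustrated with a toy network. The graph below is hand-made for this sketch and does not reproduce WordNet's data or interface; the function names and the depth limit are likewise assumptions, and the actual semantic metric is the one described in the Appendix.

```python
from collections import deque

# Toy isa links (word -> more general words); a hand-made stand-in for WordNet.
ISA = {
    "computer": ["machine"],
    "machine": ["device"],
    "car": ["vehicle"],
    "coffee": ["beverage"],
}


def isa_distance(word, ancestor, max_depth=3):
    """Return the number of isa links from word up to ancestor, or None."""
    queue = deque([(word, 0)])
    seen = {word}
    while queue:
        node, depth = queue.popleft()
        if node == ancestor:
            return depth
        if depth < max_depth:
            for parent in ISA.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, depth + 1))
    return None


def related(w1, w2):
    """True if one word can be reached from the other by following isa links."""
    return isa_distance(w1, w2) is not None or isa_distance(w2, w1) is not None


if __name__ == "__main__":
    print(isa_distance("computer", "machine"))
    print(related("machine", "computer"), related("computer", "car"))
```

In this sketch a shorter chain of links would naturally count as a stronger relation, but how link distance is turned into a score is left open here.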

ISBN: 1591400511
Year: 2003
Pages: 194
Authors: John Wang