Chapter XII - Mining Free Text for Structure
Data Mining: Opportunities and Challenges
by John Wang (ed) 
Idea Group Publishing 2003

We have tested FAQ Minder on 100 FAQs. FAQ Minder's task was to identify tables of contents, question-answer pairs, and bibliographies. All in all, FAQ Minder had to identify 2,399 items. When FAQ Minder and a human judge were in agreement on an item, the item was said to be completely recognized. When FAQ Minder identified part of an item, the item was said to be partially recognized. When FAQ Minder wrongly tagged a chunk of text, that chunk of text was referred to as a false positive. The items identified by a human judge but missed by FAQ Minder were referred to as unrecognized.

Of the 2,399 items, 1,943 were completely recognized, 70 were partially recognized, 81 were false positives, and 305 were unrecognized. In percentage terms, FAQ Minder completely recognized 81% of the items, partially recognized 3%, wrongly tagged 3.4%, and failed to identify 12.7%.
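
The percentages follow directly from the counts. As a quick check, the short sketch below recomputes each share to one decimal place, so a value may differ by a tenth of a point from the rounded figures quoted in the text. The dictionary keys are just labels chosen for this illustration.

```python
# Counts reported in the evaluation above.
counts = {
    "completely recognized": 1943,
    "partially recognized": 70,
    "false positives": 81,
    "unrecognized": 305,
}

# The four categories together account for every item.
total = sum(counts.values())

# Share of each category as a percentage, rounded to one decimal place.
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```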

We identified several causes of failure while analyzing the results of the experiments. The most prominent among them was what we call "sudden layout changes due to typos." In other words, the layout between two consecutive markers in a sequence was different due to a typo. As an example, consider the sequence in Figure 9 with the layout characters made visible. The markers "[10]," "[11]," and "[12]" clearly belong to the same sequence. Yet, because the layout of the "[11]" marker is different from the layout of the "[10]" marker, the sequence is prematurely terminated at "[10]."

Figure 9: Layout change.

We attributed the second cause of failure to typos in markers themselves. Consider the following example:

  • 5) Perl books.

  • 7) Perl software.

The current rules of marker extension do not allow the system to extend the "5)" marker to the "7)" marker, since "6)" is skipped.

Finally, we found markers for which the system had no rules of extension. For instance, we discovered that some FAQ writers use Roman numerals in conjunction with letters and Arabic numerals, e.g. "III.1.a)." FAQ Minder could not handle such cases.

These failures suggest several directions for future research. The first direction is to introduce some error recovery into the system. Since typos are the norm rather than the exception, the system should have a coherent way to deal with them. The marker typo failures can be handled through a simulation mechanism. Sometimes a marker is found whose structure is consistent with the structure of the previous markers in the active sequence, but the sequence's head cannot be extended to the found marker. In such cases, the sequence manager can compute all of the possible extensions of the head marker and see if any of those extensions can be extended to the found marker. For example, if the head marker of the active sequence is "[5]" and the found marker is "[7]," the sequence manager can simulate the extension of "[5]" to "[6]" and then verify that "[6]" can be extended to "[7]."
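
The simulation step can be sketched in a few lines. This is only an illustration under strong assumptions, not FAQ Minder's actual code: the marker shape (one Arabic numeral surrounded by fixed punctuation), the single "add one" extension rule, and the names `extensions`, `can_extend`, and `simulate_missing` are all hypothetical.

```python
import re
from typing import List, Optional

# Assumed marker shape: optional punctuation, one Arabic numeral, optional
# punctuation, e.g. "[5]" or "5)". Real marker rules would be richer.
MARKER = re.compile(r"^(?P<prefix>\D*)(?P<number>\d+)(?P<suffix>\D*)$")


def extensions(marker: str) -> List[str]:
    """Return every marker that may directly follow `marker`.

    This sketch has a single rule (increment the numeral); a fuller system
    would return one candidate per applicable extension rule.
    """
    m = MARKER.match(marker)
    if m is None:
        return []
    return [f"{m['prefix']}{int(m['number']) + 1}{m['suffix']}"]


def can_extend(head: str, found: str) -> bool:
    """Ordinary check: is `found` a direct extension of `head`?"""
    return found in extensions(head)


def simulate_missing(head: str, found: str) -> Optional[str]:
    """Look for a simulated marker that bridges `head` and `found`.

    Returns the bridging marker (e.g. "[6]" between "[5]" and "[7]"),
    or None if no single simulated extension connects the two.
    """
    for candidate in extensions(head):
        if can_extend(candidate, found):
            return candidate
    return None


print(simulate_missing("[5]", "[7]"))
print(simulate_missing("5)", "7)"))
```

Limiting the simulation to one intermediate marker keeps the recovery conservative: a jump from "[5]" to "[9]" is still rejected rather than silently bridged.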

The layout typo failures exemplified in Figure 9 can be handled by the same simulation mechanism, with one exception. After the sequence manager verifies that one of the possible extensions can be extended to the found marker, it then tries to find it in the text of the FAQ between the line with the current head marker and the line with the found marker. If the marker is found, possibly with a different layout, the sequence manager integrates it into the sequence. This strategy would allow the sequence manager to integrate the marker "[11]" into the sequence although the layout of "[11]" is different.
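
A minimal sketch of the search step follows, assuming that a marker's "layout" is just the whitespace preceding it on its line. The function name `find_marker` and the sample lines are invented for illustration; they do not reproduce the contents of Figure 9.

```python
import re
from typing import List, Optional


def find_marker(lines: List[str], marker: str,
                head_line: int, found_line: int) -> Optional[int]:
    """Search strictly between two line indices for `marker`.

    Leading whitespace is ignored, so the marker is found even when its
    layout differs from the rest of the sequence. Returns the index of the
    first matching line, or None if the marker does not occur.
    """
    pattern = re.compile(r"^\s*" + re.escape(marker))
    for i in range(head_line + 1, found_line):
        if pattern.match(lines[i]):
            return i
    return None


# Made-up FAQ fragment: "[11]" is indented with spaces instead of a tab.
faq = [
    "\t[10] How do I install Perl?",
    "  [11] Where can I find Perl books?",
    "\t[12] Where can I find Perl software?",
]

# The head is "[10]" on line 0 and the found marker is "[12]" on line 2;
# the simulated marker "[11]" is searched for between them.
print(find_marker(faq, "[11]", head_line=0, found_line=2))
```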

The second direction is to give FAQ Minder a limited ability to process natural language phrases. Many FAQ writers state explicitly that a table of contents, a glossary, or a bibliography is about to begin by using phrases such as "Table of Contents," "List of Topics," "The following questions are answered below," etc. If FAQ Minder can recognize such phrases, it can assume that the sequence that follows marks the table of contents, the glossary, the topic list, etc. Toward this end, we have specified a number of the most common ways in which people say that the table of contents is about to start, and we have run a small program that looks for them on 150 FAQs. The program correctly recognized such sentences in 111 of the FAQs (74%).
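
One simple way to recognize such announcements is case-insensitive phrase matching. The sketch below is not the program described above, whose phrase list and matching method are not given here; it uses only the three phrases quoted in the text, and the names `CUE_PHRASES` and `announces_contents` are hypothetical.

```python
import re

# The three example phrases quoted in the text; a real list would be longer.
CUE_PHRASES = [
    "table of contents",
    "list of topics",
    "the following questions are answered below",
]

# One alternation over all phrases. Words may be separated by any run of
# whitespace, and matching ignores case.
CUE_PATTERN = re.compile(
    "|".join(
        r"\s+".join(re.escape(word) for word in phrase.split())
        for phrase in CUE_PHRASES
    ),
    re.IGNORECASE,
)


def announces_contents(line: str) -> bool:
    """Return True if the line contains one of the cue phrases."""
    return CUE_PATTERN.search(line) is not None


print(announces_contents("TABLE   OF CONTENTS"))
print(announces_contents("[1] What is Perl?"))
```

When a cue line is found, the sequence that starts right after it can be labeled as the announced component instead of being inferred from layout alone.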

The third research direction involves the supervised mode of FAQ Minder. In the supervised mode, FAQ Minder displays its progress through the text of a FAQ in a simple graphical user interface. As the system goes through the text of the FAQ, it highlights different FAQ components as soon as it detects them. For example, when the beginning of a table of contents is detected, the system informs the user that it has started tracking the table of contents. The user then has the opportunity to verify that the system has made a correct decision or tell the system that its decision is wrong.

The supervised mode requires little inference on the part of the system. However, it requires a large time commitment on the part of the user. Our conjecture here is that ultimately a system that recognizes its own limitations and overcomes them by asking the user intelligent questions saves the user more time than a completely independent system whose work requires laborious verification.

Data Mining: Opportunities and Challenges
ISBN: 1591400511
Year: 2003
Pages: 194
Authors: John Wang