PROBLEM SCOPE AND DEFINITION | (ed.) Intelligent Agents for Data Mining and Information Retrieval

A broad range of documents containing useful information often exists, but with no way to access individual segments of these documents directly through a targeted or structured search.

A document is typically divided into a number of sections and subsections. For example, documents that cover common problems related to various electrical appliances and their solutions will usually have a section for each class of problems, and each section will have subsections that each cover a specific problem belonging to that class. Without a targeted search, a user looking for a solution to a particular problem with a specific electrical appliance must first locate the document that covers common problems and their solutions for that appliance, and then begin the tedious task of browsing that document to find the problem of interest. A search engine that lets the user select the appliance, then select the specific problem, and finally returns the exact section that covers that problem would save the user valuable time and effort. The same interface may also allow a user to compare how a given problem is solved across a range of appliances.

Moving beyond this simple, hypothetical example, this work addresses a real problem related to agricultural extension documents, which are issued primarily to assist farmers in cultivating and caring for certain crops. Each document is information-rich with respect to the crop it covers. Depending on the importance of a given crop and how involved the issues related to it are, a crop may be addressed by more than one document. Because of the wealth of information contained within these documents, they are often used by researchers as well as by farmers and extension workers.

A typical document will cover most aspects related to cultivating a crop, ranging from land preparation to harvest. Each section within a document targets a given problem or issue, and each subsection embodies a specialization of that issue. For example, a section called 'Diseases' will have as its subsections most diseases that are likely to affect a given crop. Similarly, a section covering operations will cover all agricultural operations that apply to that crop (irrigation, fertilization, etc.).

In this case and in similar cases, there are two elements that can work to the advantage of an intelligent search. The first is that the main elements of search can be identified beforehand over a broad class of documents. 'Diseases' and 'Operations' are two examples of search categories that can be readily identified. The second element is that individual mappings of instances related to the categories are more or less the same across all documents, and they are featured in either section or subsection headings. For instance, 'Fertilization', 'Irrigation' and 'Land Preparation' all belong to the class of agricultural operations, while 'Powdery Mildew' belongs to the class of agricultural diseases. These classes and their instances will usually generalize across all crops. So, the individual instances of these general categories embody background knowledge that can be added to individual document segments as metadata.
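As a minimal sketch of this idea, background knowledge can be pictured as a lookup from known instances to their general categories, which is then used to attach metadata to a segment based on its heading. The names `BACKGROUND_KNOWLEDGE`, `categorize_heading`, and `tag_segment`, the substring matching rule, and the crop name in the usage note are all assumptions made for illustration; the source does not specify a representation.

```python
from typing import Optional

# Hypothetical background knowledge: known instance -> general category.
# The instances are the examples named in the text; the structure is assumed.
BACKGROUND_KNOWLEDGE = {
    "fertilization": "Operations",
    "irrigation": "Operations",
    "land preparation": "Operations",
    "powdery mildew": "Diseases",
}


def categorize_heading(heading: str) -> Optional[str]:
    """Return the category of the first known instance found in the heading."""
    # Lowercase and collapse whitespace so matching ignores layout differences.
    normalized = " ".join(heading.lower().split())
    for instance, category in BACKGROUND_KNOWLEDGE.items():
        if instance in normalized:
            return category
    return None  # heading does not mention any known instance


def tag_segment(crop: str, heading: str) -> dict:
    """Attach category metadata to a document segment (illustrative only)."""
    return {
        "crop": crop,
        "heading": heading,
        "category": categorize_heading(heading),
    }
```

Under these assumptions, `tag_segment("Cucumber", "Irrigation")` yields a record whose category is `"Operations"`, while a heading that mentions no known instance is left with a category of `None`.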

There are some cases, however, in which a general category can be identified, but its instances rarely recur across a document set. Crop 'Varieties' is an example. In most extension documents, there is usually a section on varieties, with a subsection on each variety and its different features. The name of a crop variety is specific to that crop and, as such, cannot be used as a general search term. To enable the location of information on any given variety for a given crop, the hierarchy of the document itself can be utilized to infer that each subsection of any section covering 'Varieties' is an instance of the general category 'variety'.
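The hierarchy-based inference described here might look like the following sketch, which assumes a document has already been parsed into a tree of titled sections. The `Section` class, the `GENERAL_CATEGORIES` table, and the function name are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """A titled document section with nested subsections (assumed structure)."""
    title: str
    subsections: List["Section"] = field(default_factory=list)


# Hypothetical table: section title -> category assigned to its subsections.
GENERAL_CATEGORIES = {"varieties": "variety"}


def infer_instances(section: Section) -> List[Tuple[str, str]]:
    """Collect (subsection title, category) pairs inferred from the hierarchy."""
    results: List[Tuple[str, str]] = []
    category = GENERAL_CATEGORIES.get(section.title.strip().lower())
    for child in section.subsections:
        if category is not None:
            # The child is an instance purely because of where it sits.
            results.append((child.title, category))
        # Keep descending so nested category sections are also found.
        results.extend(infer_instances(child))
    return results
```

No list of variety names is needed in this sketch: any subsection placed under a section titled 'Varieties' is labeled as a variety, whatever its own title happens to be.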

Generally speaking, augmenting various document sections with metadata involves a number of steps, which can be summarized as follows:

1. Identifying the various categories onto which various document sections can be mapped.
2. Acquiring and representing background knowledge in a way that can facilitate the mapping of various document sections into the identified categories.
3. Segmenting various documents and employing background knowledge to map each document section to its corresponding category.
4. Storing structured index information in a persistent data store, such as a database, or converting the document into an alternate representation (e.g., XML).
5. Providing a user interface to enable searches across indexed documents.
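The mapping, storage, and search steps above can be sketched end to end as follows, assuming the documents have already been segmented into (heading, body) pairs. This is an illustration under stated assumptions, not the system described in this work: the table layout, the `KNOWN_INSTANCES` dictionary, the function names, and the choice of SQLite as the data store are all hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical background knowledge: instance -> category.
KNOWN_INSTANCES = {
    "irrigation": "Operations",
    "fertilization": "Operations",
    "powdery mildew": "Diseases",
}


def build_index(conn: sqlite3.Connection, crop: str,
                segments: Iterable[Tuple[str, str]]) -> None:
    """Map each (heading, body) segment to a category and store it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS segment "
        "(crop TEXT, category TEXT, instance TEXT, body TEXT)"
    )
    for heading, body in segments:
        instance = heading.strip().lower()
        category = KNOWN_INSTANCES.get(instance)
        if category is not None:  # segments with unknown headings are skipped
            conn.execute(
                "INSERT INTO segment VALUES (?, ?, ?, ?)",
                (crop, category, instance, body),
            )
    conn.commit()


def search(conn: sqlite3.Connection, crop: str, instance: str) -> List[str]:
    """Return the bodies of the segments indexed for this crop and instance."""
    rows = conn.execute(
        "SELECT body FROM segment WHERE crop = ? AND instance = ?",
        (crop, instance.strip().lower()),
    ).fetchall()
    return [row[0] for row in rows]
```

A user interface would then only need to offer the stored crops and instances as selectable options and pass the chosen pair to `search`, which returns the matching section text directly rather than a whole document.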