SYSTEM OVERVIEW | (ed.) Intelligent Agents for Data Mining and Information Retrieval

The implemented system is distributed: a number of components communicate to achieve the required functionality. The main components of this system are: an indexing user interface; an indexing backend linked to a DBMS; and a search front end, also linked to a DBMS. Figure 3 shows these components and their interactions; each component is described in the following subsections.

Figure 3: System Components and Interactions

Indexing Backend

The indexing backend is the component responsible for augmenting input documents with metadata using background knowledge. It is implemented in Java as a multithreaded HTTP server that receives indexing requests embedded in HTTP requests. On start-up, the system loads the XML representation of the background knowledge into a set of dictionaries and data structures that facilitate the indexing process. A request to this component contains the URL of the file that requires indexing, along with the name of the crop to which this file belongs. Before carrying out any indexing, the component reads the specified document and breaks it down into the structure specified in Figure 2.
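As an illustration of how such a request might be decoded, the following minimal sketch parses the query string of an indexing request. The parameter names `url` and `crop`, and the class name `IndexRequestParser`, are assumptions made for this example; the source does not specify how the request is encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the parameter names ("url", "crop") are assumed.
class IndexRequestParser {

    /** Splits a query string such as "url=...&crop=..." into decoded key/value pairs. */
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new HashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            params.put(decode(key), decode(value));
        }
        return params;
    }

    /** Percent-decodes one component using UTF-8. */
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

A request handler could then read the document location and the crop name from the returned map before starting the segmentation phase.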

Following this segmentation phase, pattern matching techniques are applied to match section heading titles with index terms. An index record is created for each section, with each record containing a field for every pre-identified category (one for diseases, another for operations, etc.). If a heading matches one of the input index terms, the category of the section is deduced, and the field designated for that category is filled with an ID pointing to the specific instance against which the match was made. A single section may match more than one category.
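The matching step can be sketched as follows. This is a simplified illustration, not the system's actual matcher: it assumes case-insensitive substring matching, keeps the first matching instance per category, and uses hypothetical names (`HeadingIndexer`, `IndexTerm`, `indexHeading`).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: class names, fields, and the substring rule are assumptions.
class HeadingIndexer {

    /** One index term from the background knowledge (hypothetical shape). */
    static final class IndexTerm {
        final String text;      // surface form, e.g. a disease name or a synonym
        final String category;  // pre-identified category, e.g. "disease"
        final int instanceId;   // ID of the category instance the term refers to

        IndexTerm(String text, String category, int instanceId) {
            this.text = text;
            this.category = category;
            this.instanceId = instanceId;
        }
    }

    /** Builds an index record for one heading: category -> matched instance ID. */
    static Map<String, Integer> indexHeading(String heading, List<IndexTerm> terms) {
        String normalized = heading.toLowerCase(Locale.ROOT);
        Map<String, Integer> record = new LinkedHashMap<>();
        for (IndexTerm term : terms) {
            if (normalized.contains(term.text.toLowerCase(Locale.ROOT))) {
                // Keep the first match per category; a section may fill several categories.
                record.putIfAbsent(term.category, term.instanceId);
            }
        }
        return record;
    }
}
```

A heading that mentions both a disease term and an operation term yields a record with two filled fields, while a heading that matches nothing yields an empty record, which corresponds to a section that has not been indexed.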

After the analysis of a given section is completed and a record is created accordingly, the record, along with a pointer to the specific section from which it was derived, is sent to a remote storage component (a database) where both are kept. After analysis of the whole document is completed, an HTML page is returned to the user. Within that page, all section and subsection headings are displayed; beside each, it is indicated whether that section has been indexed. If the section has been indexed, it is indicated whether indexing was performed directly or indirectly (through the use of hierarchical information). Sections that have not been indexed are hyperlinked to an interface which allows the user to edit their text in order to update the background knowledge and re-index the input document.

Updating background knowledge can involve the creation of a new category instance or the creation of synonyms to associate with existing ones. The update request is encoded in a URL sent to the indexing backend over HTTP. The indexing backend subsequently 'learns' this new information and updates its background knowledge file. Initially, some background knowledge could be acquired from a domain expert, or it could be completely learned through the indexing process (which also requires usage by someone familiar with the domain).
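The two kinds of update can be pictured with a small in-memory dictionary. The sketch below is an assumption-laden simplification: the class `BackgroundKnowledge` and its methods are hypothetical, categories are omitted for brevity, and writing the change back to the XML background knowledge file is not shown.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: names and the single-dictionary layout are assumptions.
class BackgroundKnowledge {
    private final Map<String, Integer> termToInstance = new HashMap<>();
    private int nextId = 1;

    /** Creates a new category instance and returns its ID. */
    int addInstance(String name) {
        int id = nextId++;
        termToInstance.put(normalize(name), id);
        return id;
    }

    /** Associates a synonym with an existing instance. */
    void addSynonym(String synonym, int instanceId) {
        if (!termToInstance.containsValue(instanceId)) {
            throw new IllegalArgumentException("Unknown instance ID: " + instanceId);
        }
        termToInstance.put(normalize(synonym), instanceId);
    }

    /** Returns the instance ID for a term, or null if the term is unknown. */
    Integer lookup(String term) {
        return termToInstance.get(normalize(term));
    }

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }
}
```

Once a synonym has been added, a heading that uses the synonym resolves to the same instance ID as the original term, so re-indexing the document can fill fields that were previously left empty.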

Indexing User Interface

Since it is anticipated that those users who will request document indexing will do so remotely, a web interface for facilitating the indexing and uploading of extension documents was implemented. This interface simply allows a user to select an extension document from their local machine, upload it to a web server, and then index this document through communication with the indexing backend.

Search Front End

A web search front end is provided to allow users to rapidly fetch their required information from the extension documents by selecting one or more values for index parameters, where the index parameters are the crop name and the predefined indexing categories. The number of selected parameters determines whether the query is loose or specific: the more parameters are selected, the more specific the query and the fewer records are returned.

After a query is entered, it is converted to SQL and dispatched to the database in which indexing information has been stored. The result is displayed in the form of an HTML page containing a list of index records that match the entered query. The output includes the following: the heading title of the matching section; a sample from the matching paragraph; and a hyperlink to the source section. On following the hyperlink, only the text of the selected section will be displayed. However, depending on the level of a section, extra information that defines the context of the section as part of the whole document might be displayed. In addition, a hyperlink to the source document will always be displayed.
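The conversion of selected parameters into SQL might look like the following sketch. The table name `index_records`, the use of one column per index parameter, and the class name `QueryBuilder` are assumptions for illustration; the actual schema is not described in the source.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: table and column names are assumed, not taken from the system.
class QueryBuilder {

    /**
     * Builds a parameterized SELECT from the selected index parameters.
     * Values are appended to paramsOut in the order of their placeholders.
     */
    static String toSql(Map<String, String> selected, List<String> paramsOut) {
        StringBuilder sql = new StringBuilder("SELECT * FROM index_records");
        String separator = " WHERE ";
        // Sort by column name so the generated SQL is deterministic.
        for (Map.Entry<String, String> e : new TreeMap<>(selected).entrySet()) {
            if (!e.getKey().matches("[a-z_]+")) {
                throw new IllegalArgumentException("Invalid column name: " + e.getKey());
            }
            sql.append(separator).append(e.getKey()).append(" = ?");
            paramsOut.add(e.getValue());
            separator = " AND ";
        }
        return sql.toString();
    }
}
```

Each additional selected parameter adds one more `AND` condition, which is why a more specific query returns fewer records. The placeholders would be bound through a `java.sql.PreparedStatement` before the query is dispatched to the database.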