In this work, it was important to adopt a flexible yet powerful representation for both background information and documents. XML (Bray et al., 1998) was therefore adopted for both. Background information, which consists of the index terms, is stored in an XML file with the structure shown in Figure 1.
<indexTerms>
  <general_category indexChildNodes="true">
    <name> diseases </name>
    <sameAs> disorders </sameAs>
  </general_category>
  <general_category indexChildNodes="true">
    <name> Varieties </name>
  </general_category>
  <disease indexChildNodes="false">
    <name>Powdery Mildew</name>
    <sameAs> aSynonym </sameAs>
    <sameAs> ........... </sameAs>
  </disease>
  ... ...
  <operation indexChildNodes="false">
    <name> aNameOfanOperation </name>
    <sameAs> aSynonym </sameAs>
  </operation>
  ... ...
  <pest indexChildNodes="false">
    <name> aNameOfaPest </name>
    ... ...
  </pest>
  ... ...
</indexTerms>
Despite its simplicity, this representation maps phrases to their corresponding categories and provides a simple thesaurus through the <sameAs> tag. The indexChildNodes attribute specifies whether or not specializations of a given term should be indexed as belonging to that term, i.e. whether or not a document's hierarchy is to be utilized.
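As an illustration of how such a file could be consumed, the following is a minimal sketch in Python that builds a phrase-to-category lookup from the index terms. It is not the implementation used in this work: the function name build_phrase_index, the lower-casing of phrases, and the abridged sample data are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Abridged sample in the format of Figure 1; "aSynonym" is the
# placeholder used in the figure, not a real synonym.
INDEX_TERMS_XML = """
<indexTerms>
  <general_category indexChildNodes="true">
    <name> diseases </name>
    <sameAs> disorders </sameAs>
  </general_category>
  <disease indexChildNodes="false">
    <name>Powdery Mildew</name>
    <sameAs> aSynonym </sameAs>
  </disease>
</indexTerms>
"""


def build_phrase_index(xml_text):
    """Map each lower-cased phrase to (category, canonical name, indexChildNodes).

    Hypothetical helper: the category is taken from the element's tag
    name, and both <name> and every <sameAs> phrase point to the same entry.
    """
    index = {}
    for term in ET.fromstring(xml_text.strip()):
        name = term.findtext("name", default="").strip()
        index_children = term.get("indexChildNodes", "false").strip().lower() == "true"
        phrases = [name] + [(s.text or "").strip() for s in term.findall("sameAs")]
        for phrase in phrases:
            if phrase:
                index[phrase.lower()] = (term.tag, name, index_children)
    return index


phrase_index = build_phrase_index(INDEX_TERMS_XML)
print(phrase_index["disorders"])
```

Under these assumptions, a synonym such as "disorders" resolves to the same category and canonical name as "diseases", which is the thesaurus behaviour that the <sameAs> tag is meant to provide.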
A document will have the XML representation illustrated in Figure 2.
<doc>
  <title> aTitle </title>
  <section>
    <id>102328933656</id>
    <level>1</level> <!-- the level of a section within a document hierarchy -->
    <heading> the text heading of the section </heading>
    <text> a pure text representation of the contents of the section </text>
    <html><![CDATA[ the html text representation of this section ]]></html>
  </section>
  <section>
    ..... .....
  </section>
</doc>
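To make the document structure concrete, the following is a minimal sketch in Python that reads a document in this format into a list of section records. The function name read_sections, the dictionary keys, and the sample markup inside the CDATA block are assumptions made for the example; the work itself does not prescribe a particular parser.

```python
import xml.etree.ElementTree as ET

# Single-section sample in the format of Figure 2; the <p> markup
# inside the CDATA block is illustrative only.
DOC_XML = """
<doc>
  <title> aTitle </title>
  <section>
    <id>102328933656</id>
    <level>1</level>
    <heading> the text heading of the section </heading>
    <text> a pure text representation of the contents of the section </text>
    <html><![CDATA[ <p>the html text representation of this section</p> ]]></html>
  </section>
</doc>
"""


def read_sections(xml_text):
    """Return the document title and one dict per <section> element."""
    root = ET.fromstring(xml_text.strip())
    title = root.findtext("title", default="").strip()
    sections = []
    for sec in root.findall("section"):
        sections.append({
            "id": sec.findtext("id", default="").strip(),
            # Position of the section within the document hierarchy.
            "level": int(sec.findtext("level", default="1")),
            "heading": sec.findtext("heading", default="").strip(),
            "text": sec.findtext("text", default="").strip(),
            # The CDATA wrapper lets the HTML pass through unparsed.
            "html": sec.findtext("html", default="").strip(),
        })
    return title, sections


title, sections = read_sections(DOC_XML)
print(title, sections[0]["level"], sections[0]["html"])
```

Wrapping the HTML in a CDATA block means the parser returns it as plain character data, so the section's original markup does not need to be well-formed XML.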