BACKGROUND


IRS (Information Retrieval Systems) were initially introduced to exploit non-structured documents, i.e., documents that contain no information about their logical structure. These documents were analyzed to represent their textual content and thereby to assess their relevance to a non-structured query (free natural language). During the last 20 years, several theoretical models have been proposed, and several systems based on those models have been implemented. The best known of these models are: the Boolean model [STAIRS (IBM, 1982)], the vector-space model [SMART (Salton, 1971)], the probabilistic model (Turtle & Craft, 1990), the Bayesian models (Van Rijsbergen, 1986), the linguistic models [RIME (Chiaramella & Nie, 1990)], and the connectionist model [MERCURE (Boughanem et al., 1999)].

Since then, attempts have been made to apply IRS techniques to structured or semi-structured documents, so that their logical structure can be used during the evaluation of a query. Among these works, we can cite:

  • Textriever system (Burkowski, 1992), a search engine for a collection of structured documents;

  • Personal Daily News (Fourel et al., 1998), an integrated environment for the management and retrieval of structured documents.

The DBMS (DataBase Management Systems) approach allows a set of data to be processed quickly, so the idea is to apply this technique to documents. For structured documents, i.e., those whose logical structure is specified, much work has been done. Among these works, we can cite:

  • e-XML Media Repository (Gardarin et al., 2002), a software component for the storage and query of XML documents;

  • Xyleme (Abiteboul et al., 2001), a project that integrates XML data from the Web into a database.

For semi-structured documents, i.e., those whose logical structure is only partially defined, much work has also been done despite the difficulties these documents present. Among these works, we can cite:

  • HyWEB (Gardarin & Yoon, 1996), whose purpose is the construction of an HTML (HyperText Markup Language) document base that makes it possible to query a class of documents;

  • WIND (Faulstish et al., 1997), which builds a data warehouse from specific information (about a particular domain) extracted from the Web.

As regards analysis, the works are very recent and are based mainly on data mining techniques rather than on a multidimensional approach. Concerning document storage and interrogation, all these works manipulate structured or semi-structured documents, but not non-structured documents. In fact, in each case only one standard is chosen, which implies a predefined database schema (predefined structure). Moreover, this work is devoted to the interrogation of documents starting from their factual descriptions; it does not involve the analysis of their textual content. In the information retrieval process, a query returns a collection of documents, which obliges the user to consult the content of a great number of documents to find the specific information sought.

In contrast to this previous work, we propose a generic model of textual warehouses able to contain any type of document (structured, semi-structured and non-structured) and able to perform information retrieval, data interrogation, and multidimensional analysis. Our approach is generic because no restriction is imposed on the documents to be integrated.




(ed.) Intelligent Agents for Data Mining and Information Retrieval
Year: 2004
Pages: 171
