DEFINITION AND ARCHITECTURE OF TEXTUAL WAREHOUSES


Building on the definition of data warehouses (Inmon, 1994), we define a textual warehouse as a source of information that is subject-oriented, filtered, integrated, archived (versioned), and organized for retrieval, interrogation, or analysis.

The information contained in such a warehouse must satisfy the following properties:

  • subject-oriented: the data of a warehouse must be organized by subject, thus allowing for the collection of all relevant information for analysis;

  • filtered: the warehouse must contain only documents that can help decision-makers in their tasks (Chevalier et al., 2003);

  • integrated: the content of the warehouse results from the integration of heterogeneous information from multiple sources;

  • archived: the warehouse must keep the history of its documents (historization) in order to preserve their successive versions.
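
As an illustration only, the four properties above can be pictured as fields and rules on a warehouse entry. The sketch below is not taken from the chapter: the class names, field names, and the keyword-based filter are all hypothetical choices made to keep the example small.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DocumentVersion:
    """One archived state of a document (hypothetical structure)."""
    valid_from: date
    content: str


@dataclass
class WarehouseDocument:
    """A warehouse entry carrying the properties listed above (illustrative)."""
    doc_id: str
    subject: str   # subject-oriented: entries are grouped by subject
    source: str    # integrated: records which source the document came from
    versions: list = field(default_factory=list)  # archived: every version is kept

    def add_version(self, valid_from, content):
        # Append rather than overwrite, so earlier states stay available.
        self.versions.append(DocumentVersion(valid_from, content))

    def current(self):
        # The most recently dated version is treated as the current one.
        return max(self.versions, key=lambda v: v.valid_from)


def is_useful(text, keywords):
    """filtered: a deliberately simple stand-in for a relevance filter."""
    lowered = text.lower()
    return any(k in lowered for k in keywords)
```

A document would pass through `is_useful` before being stored, and each later revision would be added with `add_version`, so that `current()` returns the latest state while older ones remain in `versions`.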

The architecture we propose for textual warehouses is presented in Figure 1. It comprises two stages: warehouse storage and warehouse exploitation.

Figure 1: Architecture of Textual Warehouses

The first stage extracts the structure and content of each document and stores them in the warehouse. Each textual element of the content must also be indexed, so that the resulting index information can later be used by information retrieval techniques.
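
A minimal sketch of this storage stage follows, under strong simplifying assumptions: documents are plain text, blank-line-separated paragraphs stand in for the document structure, and the index is a basic inverted index. The function names `extract_elements` and `build_index` are hypothetical and do not come from the chapter.

```python
import re
from collections import defaultdict


def extract_elements(doc_id, text):
    """Split a plain-text document into (doc_id, position, paragraph) elements.

    Assumption: paragraphs separated by blank lines stand in for the
    document's structure; real documents would need a format-aware parser.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(doc_id, i, p) for i, p in enumerate(paragraphs)]


def build_index(elements):
    """Map each lowercased term to the set of (doc_id, position) containing it."""
    index = defaultdict(set)
    for doc_id, position, paragraph in elements:
        for term in re.findall(r"[a-z0-9]+", paragraph.lower()):
            index[term].add((doc_id, position))
    return index
```

Indexing at the level of textual elements, rather than whole documents, is what would later allow a query to return a passage instead of an entire document.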

The second stage manipulates the information contained in the warehouse. For that task, we propose three techniques:

  • information retrieval: retrieve documents, or passages of documents, whose textual content is considered relevant to a user query formulated with simple keywords (non-structured queries);

  • data interrogation: use a DBMS language to interrogate the warehouse (structured queries) and retrieve factual data (specific information);

  • multidimensional analysis: analyze information by constructing textual marts (specific views) according to OLAP (On-Line Analytical Processing) techniques.
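
To make the contrast between the three techniques concrete, the sketch below runs each of them over the same toy data. It is an assumption-laden illustration, not the chapter's system: it uses Python's standard `sqlite3` module as a stand-in DBMS, the table layout and sample rows are invented, keyword matching is a plain term-overlap count, and a `GROUP BY` over two dimensions is only a rough approximation of an OLAP view.

```python
import sqlite3

ROWS = [
    # (doc_id, subject, year, text): invented sample data
    ("d1", "contracts", 2002, "supplier contract renewal terms"),
    ("d2", "contracts", 2003, "contract penalties and renewal"),
    ("d3", "safety", 2003, "safety inspection report"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT, subject TEXT, year INTEGER, text TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", ROWS)


def keyword_search(query):
    """Information retrieval: rank documents by number of matching query keywords."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in conn.execute("SELECT doc_id, text FROM documents"):
        score = len(terms & set(text.lower().split()))
        if score:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document identifier.
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


def documents_for(subject, year):
    """Data interrogation: a structured query returning specific facts."""
    cur = conn.execute(
        "SELECT doc_id FROM documents WHERE subject = ? AND year = ? ORDER BY doc_id",
        (subject, year),
    )
    return [row[0] for row in cur]


def counts_by_subject_and_year():
    """Multidimensional analysis: aggregate along two dimensions, OLAP-style."""
    cur = conn.execute(
        "SELECT subject, year, COUNT(*) FROM documents "
        "GROUP BY subject, year ORDER BY subject, year"
    )
    return cur.fetchall()
```

With this sample data, `keyword_search("contract renewal")` ranks the two contract documents, `documents_for("contracts", 2003)` returns a precise answer to a structured question, and `counts_by_subject_and_year()` summarizes the collection along the subject and year dimensions.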

Such textual warehouses then become the basic tool for company employees who need to exploit information in their daily professional tasks (e.g., administrative intranets, digital libraries, technical documentation).




(ed.) Intelligent Agents for Data Mining and Information Retrieval
Year: 2004
Pages: 171
