GENERIC MODEL OF TEXTUAL WAREHOUSES


Textual warehouses (Khrouf et al., 2001) must, like data warehouses, constitute a source of synthetic and homogeneous information. However, whereas the sources of data warehouses are generally structured according to the relational model, the sources of textual warehouses are strongly-structured complex objects. The generic model we propose must, on the one hand, accept any type of document and, on the other hand, facilitate the retrieval, querying and analysis of documents (both structure and content). The idea is to identify logical classes of documents and to group documents by class, so that users can focus on the classes that interest them (e.g., books, newspapers, proceedings).

With this goal in mind, we distinguish two types of logical structures (Khrouf & Soulé-Dupuy, 2001): the generic logical structure (i.e., the structure common to a set of documents) and the specific logical structure (i.e., the structure of a single document). Figure 2 describes the generic model of textual warehouses we propose, using the UML (Unified Modelling Language) formalism.

Figure 2: Generic Model of Textual Warehouses

The generic logical structure is characterized by three meta-classes: "Gen_Str" (Generic Structures), "Gen_Elts" (Generic Elements), and "Gen_Atts" (Generic Attributes). In our generic model, a generic logical structure is defined by a set of generic elements, which can be composed of other generic elements. Each of these elements can also be described by generic attributes.
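The composition described above can be illustrated with a minimal Python sketch. It is not the implementation used for the warehouse; the class names simply mirror the meta-classes "Gen_Str", "Gen_Elts" and "Gen_Atts", while the field names, the `walk` and `element_names` helpers, and the "proceedings" example are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class GenAtt:
    """A generic attribute (meta-class "Gen_Atts")."""
    name: str


@dataclass
class GenElt:
    """A generic element (meta-class "Gen_Elts").

    It may be composed of other generic elements and described by attributes.
    """
    name: str
    children: List["GenElt"] = field(default_factory=list)
    attributes: List[GenAtt] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "GenElt"]]:
        """Yield (depth, element) pairs, parent before children."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


@dataclass
class GenStr:
    """A generic logical structure (meta-class "Gen_Str"): a set of generic elements."""
    name: str
    elements: List[GenElt] = field(default_factory=list)

    def element_names(self) -> List[str]:
        """Flatten the element hierarchy into a list of element names."""
        return [elt.name for root in self.elements for _, elt in root.walk()]


# Hypothetical generic structure for the "proceedings" class of documents.
paper = GenElt(
    "paper",
    children=[GenElt("title"), GenElt("section", children=[GenElt("paragraph")])],
    attributes=[GenAtt("language")],
)
proceedings = GenStr("proceedings", elements=[paper])

for depth, elt in paper.walk():
    print("  " * depth + elt.name)
```

Representing composition as a list of child elements keeps the structure recursive, which matches the statement that generic elements can themselves be composed of other generic elements.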

The specific logical structure is characterized by the remaining classes. In our generic model, a document is characterized by a set of declarations and contains from 1 to n "Spe_Elts" (Specific Elements). Each element is associated with 0 or 1 unit of information and with 0 to n "Spe_Atts" (Specific Attributes). Each unit of information is indexed by a set of keywords (stemmed words, "Radical") extracted from its textual content. Each keyword is associated with its frequency in the information concerned ("Term_Freq") and with its absolute frequency ("Doc_Freq"), i.e., the frequency of the stemmed word in the whole collection of information.
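The indexing step can be sketched in Python as follows. This is only an illustration under stated assumptions: the `stem` function is a placeholder and not the stemmer actually used, no stop-word filtering is applied, and "Doc_Freq" is read here as the total number of occurrences of a stemmed word across all information units. If it instead denotes the number of information units containing the word, `doc_freq.update(counts.keys())` would replace the corresponding line.

```python
import re
from collections import Counter
from typing import Dict, Tuple


def stem(word: str) -> str:
    """Placeholder stemmer (an assumption): lowercase and drop a trailing 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def index_information(texts: Dict[str, str]) -> Tuple[Dict[str, Counter], Counter]:
    """Compute Term_Freq per information unit and Doc_Freq over the collection.

    term_freq[info_id][radical] is the number of occurrences of the stemmed
    word in that information unit; doc_freq[radical] is the number of
    occurrences in the whole collection (assumed reading of "Doc_Freq").
    """
    term_freq: Dict[str, Counter] = {}
    doc_freq: Counter = Counter()
    for info_id, text in texts.items():
        radicals = [stem(w) for w in re.findall(r"[A-Za-z]+", text)]
        counts = Counter(radicals)
        term_freq[info_id] = counts
        doc_freq.update(counts)  # adds the per-unit counts to the collection totals
    return term_freq, doc_freq


# Hypothetical information units, keyed by an illustrative identifier.
texts = {
    "info1": "Warehouses store documents",
    "info2": "Documents describe a warehouse and documents",
}
term_freq, doc_freq = index_information(texts)
print(term_freq["info2"]["document"], doc_freq["document"])
```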

The design was carried out using object-oriented modelling (UML), and the resulting object model was implemented in an object-relational DBMS (Oracle 8). To perform this translation, we used the transformation rules described in Soutou (2001). An extract of the resulting object-relational diagram is given in the appendix.




(ed.) Intelligent Agents for Data Mining and Information Retrieval
Year: 2004
Pages: 171
