INFORMATION EXTRACTION | (ed.) Intelligent Agents for Data Mining and Information Retrieval

Information extraction must identify the different parts of a document, each of which presents a coherent idea. For this stage, we distinguish three types of documents:

documents with tags having a semantic vocation (e.g., SGML, XML documents);
documents with tags having a presentation vocation (e.g., HTML documents);
non-structured documents (e.g., TXT documents).

We present, in what follows, an extraction method that can be applied to every type of document.

Documents with Tags Having a Semantic Vocation

For this type of document, we distinguish two sub-families: well-formed documents, i.e., those that obey syntactic rules; and valid documents, i.e., well-formed documents that also conform to a declared structure (a Document Type Definition, or DTD).

The logical structure of well-formed documents is determined as follows:

Stage 1: Extraction of the document tags (a new file is created that contains only the document tags) and of the attributes. Each attribute name is prefixed by "A_".
Stage 2: Every start tag immediately followed by its end tag is replaced by a defined element (its level is 1 and its cardinality is 1). Note: The element name is prefixed by "Ex_" (where x is the number of its level) and followed by its cardinality.
Stage 3: If consecutive elements have the same name, they are replaced by a single element, whose cardinality becomes * instead of 1.
Stage 4: Every start tag followed by defined elements and then by its end tag is replaced by a new defined element. The level of this new element is set to 1, and the level of each of its sub-elements is incremented by 1.
Stage 5: Repeat the process (stages 1 to 4) until the file contains only defined elements.

Figure 3: Example of Logical Structure Determination for Well-Formed Documents
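
As a rough illustration, the stages above can be approximated by a single recursive traversal of the parsed document instead of literal tag rewriting. The sketch below assumes Python's standard xml.etree.ElementTree parser. The function name logical_structure, the dictionary layout, and the label format E<level>_<name>[<cardinality>] are illustrative assumptions, not part of the method as described. When consecutive elements share a name, only the first one is descended into, which is a simplification.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def logical_structure(elem, level=1, cardinality="1"):
    """Describe elem as a defined element: label, attributes, children.

    The root ends up at level 1 and each nesting step adds 1, which is
    the net effect of repeatedly applying Stage 4.
    """
    children = []
    # Stage 3: consecutive siblings with the same name collapse into
    # one element whose cardinality is "*".
    for _name, run in groupby(elem, key=lambda child: child.tag):
        run = list(run)
        child_cardinality = "*" if len(run) > 1 else "1"
        # Simplification (assumption): only the first sibling is explored.
        children.append(logical_structure(run[0], level + 1, child_cardinality))
    return {
        # Stage 2: name prefixed by "Ex_" and followed by the cardinality.
        "label": f"E{level}_{elem.tag}[{cardinality}]",
        # Stage 1: attribute names prefixed by "A_".
        "attributes": [f"A_{name}" for name in elem.attrib],
        "children": children,
    }


document = "<book id='1'><title/><author/><author/><chapter><para/><para/></chapter></book>"
structure = logical_structure(ET.fromstring(document))
```

Here structure["label"] is the label of the root element, and structure["children"] lists one entry per distinct run of sibling names.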

The logical structure of valid documents is determined from their DTD as follows:

the keyword "!DOCTYPE" corresponds to the logical structure name;
the keyword "!ELEMENT" corresponds to an element of the logical structure;
the keyword "!ATTLIST" corresponds to the attributes of the concerned element;
the level of each element is determined by its order of appearance.

Figure 4: Example of Logical Structure Determination for Valid Documents
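
The keyword mapping above can be sketched with a few regular expressions. This is a minimal sketch, not a DTD parser: the function name dtd_structure and the result layout are assumptions, each !ATTLIST declaration is assumed to declare a single attribute, and element levels are not computed because the text does not detail how they follow from the order of appearance.

```python
import re

NAME = r"[\w.:-]+"  # simplified pattern for XML names


def dtd_structure(dtd_text):
    """Map DTD declarations to a description of the logical structure."""
    # !DOCTYPE gives the name of the logical structure.
    doctype = re.search(rf"<!DOCTYPE\s+({NAME})", dtd_text)
    structure = {"name": doctype.group(1) if doctype else None, "elements": {}}
    # Each !ELEMENT declaration gives one element, kept in order of appearance.
    for name, model in re.findall(rf"<!ELEMENT\s+({NAME})\s+([^>]+)>", dtd_text):
        structure["elements"][name] = {"model": model.strip(), "attributes": []}
    # Each !ATTLIST declaration attaches an attribute to its element.
    for element, attribute in re.findall(rf"<!ATTLIST\s+({NAME})\s+({NAME})", dtd_text):
        entry = structure["elements"].setdefault(
            element, {"model": None, "attributes": []}
        )
        entry["attributes"].append(attribute)
    return structure


sample_dtd = """<!DOCTYPE article [
<!ELEMENT article (title, section+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT section (#PCDATA)>
<!ATTLIST article lang CDATA #IMPLIED>
]>"""
structure = dtd_structure(sample_dtd)
```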

Once the logical structure is determined, the system checks whether it already exists among the generic logical structures of the warehouse. If it does, the system stores the document in the warehouse and attaches its specific logical structure to the corresponding generic logical structure. If the system instead finds a similar structure among the generic logical structures, it checks whether that structure can be modified; the similarity between two structures depends on their common elements and on the order of those elements. Otherwise, the system creates a new generic logical structure.

We assume that the logical structures of documents are represented as trees. A tree is characterized by a root r (the doctype of the DTD), which is connected to every other node (the elements of the DTD) by a unique path originating at r. The arcs of the tree are directed.

To compare two generic logical structures, we decompose each structure into two-level sub-trees (a root and its ordered sons). We can then compare corresponding sub-trees (those having the same root) as follows. The system first determines the state of every element of both sub-trees. We distinguish two states: 'o' for an element found in both structures and 'n' for an element found in only one of them. We then apply the following formal specification.

X:	The ordered list of elements of the first sub-tree (that of the document);
Y:	The ordered list of elements of the second sub-tree (that of the warehouse);
State(e): returns the state of the element e;
Pos(e): returns the position of the element e in the ordered list of the sons of the same root;
Length(E): returns the number of elements in the ordered list E.

Case 1: The first elements of the two sub-trees must not both have the state 'n'. ∃ x ∈ X / Pos(x)=1, State(x)='n' and ∃ y ∈ Y / Pos(y)=1, State(y)='n' ⇒ Failure
Case 2: The last elements of the two sub-trees must not both have the state 'n'. ∃ x ∈ X / Pos(x)=Length(X), State(x)='n' and ∃ y ∈ Y / Pos(y)=Length(Y), State(y)='n' ⇒ Failure
Case 3: If all the elements of the first sub-tree have the state 'o', then success. ∀ x ∈ X / State(x)='o' ⇒ Success

If none of the previous cases applies, the system must apply these rules:

Rule 1: ∀ x₁, x₂ ∈ X / State(x₁)='n', State(x₂)='n' and Pos(x₂)=Pos(x₁)+1 ⇒ X ← X - {x₂}
Rule 2: ∀ x₁, x₂, x₃ ∈ X / State(x₁)='o', State(x₂)='n', State(x₃)='o', Pos(x₃)=Pos(x₂)+1, Pos(x₂)=Pos(x₁)+1 and [x₁, x₃] ⊄ Y ⇒ Failure

Example 1

The result after applying Rule 1: X=[a_o, i_n, b_o, c_o, l_n] and Y=[x_n, y_n, a_o, b_o, z_n, w_n, c_o].

The result after applying Rule 2: Success. Because [a_o, b_o] ⊂ Y, these two sub-trees can be merged.

The merged list of elements is Z=[x, y, a, i, j, k, b, z, w, c, l].

Example 2

The result after applying Rule 1: X=[a_o, i_n, b_o, l_n, c_o] and Y=[x_n, y_n, a_o, b_o, z_n, w_n, c_o].

The result after applying Rule 2: Failure. Because [b_o, c_o] ⊄ Y, these two sub-trees cannot be merged: the order of the elements cannot be determined.
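
The cases and rules of the specification can be sketched as follows. This is an illustrative reading, not a definitive implementation: the function names are hypothetical, both lists are assumed to be non-empty with unique element names, the inclusion test of Rule 2 is interpreted as "the two elements are adjacent, in this order, in Y" (an interpretation suggested by the two examples), and the comparison is assumed to succeed when no case or rule reports a failure.

```python
def element_states(xs, ys):
    """State 'o' for an element present in both lists, 'n' otherwise."""
    common = set(xs) & set(ys)
    return {e: ("o" if e in common else "n") for e in list(xs) + list(ys)}


def adjacent_in(first, second, ys):
    """True if first is immediately followed by second in ys."""
    return any(a == first and b == second for a, b in zip(ys, ys[1:]))


def can_merge(xs, ys):
    """Return True (success) or False (failure) for two ordered sub-trees."""
    state = element_states(xs, ys)
    # Case 1: the first elements must not both be 'n'.
    if state[xs[0]] == "n" and state[ys[0]] == "n":
        return False
    # Case 2: the last elements must not both be 'n'.
    if state[xs[-1]] == "n" and state[ys[-1]] == "n":
        return False
    # Case 3: every element of the first sub-tree is 'o'.
    if all(state[x] == "o" for x in xs):
        return True
    # Rule 1: keep only the first element of each run of consecutive 'n'.
    reduced = []
    for x in xs:
        if state[x] == "n" and reduced and state[reduced[-1]] == "n":
            continue
        reduced.append(x)
    # Rule 2: an 'n' element framed by two 'o' elements is only acceptable
    # if those two 'o' elements are adjacent in the second sub-tree.
    for x1, x2, x3 in zip(reduced, reduced[1:], reduced[2:]):
        framed = (state[x1], state[x2], state[x3]) == ("o", "n", "o")
        if framed and not adjacent_in(x1, x3, ys):
            return False
    return True


Y = ["x", "y", "a", "b", "z", "w", "c"]
example_1 = can_merge(["a", "i", "b", "c", "l"], Y)  # lists of Example 1
example_2 = can_merge(["a", "i", "b", "l", "c"], Y)  # lists of Example 2
```

With the lists given in the two examples, example_1 evaluates to True and example_2 to False, matching the outcomes stated in the text.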

Documents with Tags Having a Presentation Vocation

Extracting the logical structure of this type of document is difficult because the tags are used mainly for presentation. Indeed, HTML does not define a generic logical structure in a simple way, unlike the DTDs of SGML or XML documents, which are more expressive.

Because HTML is more ambiguous, semi-structured HTML documents require a rewriting stage that adds more semantics to their tags. We established the following rules for tag rewriting:

all presentation tags are deleted because they present no information about the organization of document elements (e.g., <B>, <HR>);
structural and reference tags are preserved because they constitute hypertextual information;
informative tags are deleted. These tags are inserted by the authors to comment on their sources, and they do not influence the document structure;
presentation tags of structural elements are replaced by the classic structural tags, e.g., a tag <Cite> that highlights a quotation is replaced by a simple paragraph.

For better legibility, we rename the preserved HTML tags with more explicit names; an extract of this mapping is presented in Table 1. We also preserve the attributes, which are likely to carry some semantic information.

Table 1: HTML Tags
HTML Tag	Use	New Tag
<P>	Definition of paragraph	<PARAGRAPH>
<OL>	Ordered list	<LIST Type = "Ordered">
<UL>	Unordered list	<LIST Type = "Unordered">
<LI>	List item	<LISTITEM>
<TABLE>	Definition of table	<TABLE>
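
The rewriting rules and the mapping of Table 1 can be sketched with Python's standard html.parser module. The class and function names are hypothetical. The set of presentation tags contains only the two examples given in the text, "informative tags" are assumed here to be HTML comments, and tags that appear in neither table are passed through unchanged; all three choices are assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Table 1, plus the <Cite> example of the fourth rule.
RENAME = {
    "p": ("PARAGRAPH", {}),
    "ol": ("LIST", {"Type": "Ordered"}),
    "ul": ("LIST", {"Type": "Unordered"}),
    "li": ("LISTITEM", {}),
    "table": ("TABLE", {}),
    "cite": ("PARAGRAPH", {}),
}
PRESENTATION = {"b", "hr"}  # only the examples named in the text


class TagRewriter(HTMLParser):
    """Collect the rewritten document as a list of text fragments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in PRESENTATION:
            return  # presentation tags are deleted, their content is kept
        name, extra = RENAME.get(tag, (tag, {}))
        # Original attributes are preserved next to those added by renaming.
        merged = {**extra, **{key: (value or "") for key, value in attrs}}
        attr_text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in merged.items())
        self.out.append(f"<{name}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in PRESENTATION:
            return
        name, _extra = RENAME.get(tag, (tag, {}))
        self.out.append(f"</{name}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    # handle_comment is not overridden, so comments are dropped.


def rewrite(html_text):
    parser = TagRewriter()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


rewritten = rewrite("<p>Hello <b>world</b></p><!-- note --><ul><li>x</li></ul>")
```

A table-driven mapping keeps the sketch close to Table 1: extending the extract only requires adding entries to RENAME or PRESENTATION.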

Once we have a document whose structural elements have been detected, we identify the generic and specific logical structures in the same way as for structured documents.

Non-Structured Documents

For non-structured documents, we use the segmentation technique (Lallich & Ouerfelli, 1998), which decomposes a text into fine-grained, coherent documentary units. Several segmentation methods exist: segmentation by sequences of words, by sentences, and by paragraphs. These methods are not reliable because they ignore the syntactic and semantic aspects of the text. Our objective is therefore to identify documentary units according to more semantic criteria. Each unit must be delimited and characterized by formal indicators.

The basic idea of our segmentation method is to start from a minimal unit, the typographic paragraph, and to build from it the documentary unit that has the required properties (linguistic autonomy, syntactic and semantic cohesion), thus forming a homogeneous "thematic" passage. By paragraph, we mean a block of text delimited by two carriage returns; in our work, the carriage return is treated as a typographic sign that separates paragraphs. Such a block can take different forms (title, list of elements, or paragraph). To group paragraphs into the same documentary unit, we rely on linguistic markers that we find between the paragraphs:

presence of linear integration markers (e.g., if, then, so, furthermore, etc.) at the start of the paragraph;
presence of connection words (e.g., for example, for this, etc.) at the start of the paragraph;
anaphoric resumption at the start of the paragraph, by a demonstrative (e.g., this, that, etc.) or by a personal pronoun (e.g., it, him, etc.);
presence of markers (below, above) that refer to textual or non-textual objects.
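
The grouping of paragraphs by start-of-paragraph markers can be sketched as follows. The marker list only reuses the English examples quoted above and is far from complete; the function names are hypothetical, the referential markers (below, above) are left out because the text does not say how they attach a paragraph to its neighbours, and a paragraph is taken to be a block delimited by a blank line.

```python
import re

# Examples quoted in the list above; a real marker lexicon would be larger.
START_MARKERS = (
    "if", "then", "so", "furthermore",  # linear integration markers
    "for example", "for this",          # connection words
    "this", "it", "him",                # anaphoric resumption
)


def starts_with_marker(paragraph):
    """True if the paragraph begins with one of the known markers."""
    lowered = paragraph.lower()
    return any(re.match(rf"{re.escape(m)}\b", lowered) for m in START_MARKERS)


def segment(text):
    """Group typographic paragraphs into documentary units."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    units = []
    for paragraph in paragraphs:
        if units and starts_with_marker(paragraph):
            # The marker ties this paragraph to the preceding unit.
            units[-1].append(paragraph)
        else:
            units.append([paragraph])
    return units


sample = (
    "Warehouses store documents.\n\n"
    "Furthermore, they index them.\n\n"
    "A new topic starts here.\n\n"
    "For example, this one."
)
units = segment(sample)
```

On this sample, the four paragraphs are grouped into two documentary units of two paragraphs each.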

Once the documentary units of a document have been defined, we perform the extraction of the generic and specific logical structures in the same way we do for structured documents.

We have presented the different techniques for extracting the information contained in documents. In what follows, we describe the mechanisms we used to handle the content of the document warehouse through the processes of information retrieval and multidimensional analysis.