INFORMATION RETRIEVAL


An information retrieval process implies the development of mechanisms that allow a user, who is not necessarily a specialist, to retrieve the documentary information that best corresponds to his or her needs. According to this approach, information retrieval is intended to facilitate the restitution of information from a documentary collection. The problem, then, is the representation and organization of document content. The technique used to solve this problem is the indexing process (Soulé-Dupuy, 2001).

The indexing process extracts a set of information that characterizes a document. This information can be keywords extracted from the document's textual content, or it can be information about the document, which is called metadata (e.g., the author name, the abstract, the edition date). These metadata can also serve as the structural elements used to identify the logical structure of the documents. Information indexing in the generic model is therefore based on the classic techniques of automatic text indexing. It generally proceeds in two fundamental stages:

  • indexing term identification;

  • indexing term evaluation and weighting.

During the first stage, i.e., indexing term identification, it is necessary to determine all the words that will be used for indexing. It is also necessary to define the element that will serve as the unit of indexing, such as a stem, a single word, or a word group. Indexing terms can be determined through several methods, including a thesaurus, a dictionary of synonyms, detection of word groups, and parsing (see Salton et al., 1983; Frakes & Yates, 1992).
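The identification stage can be pictured with a minimal Python sketch. It assumes that the indexing unit is a stem and that candidate terms are found by simple tokenization; the stop list, the suffix list, and the function names are hypothetical and purely illustrative. The stemmer is a crude suffix stripper, not one of the methods cited above.

```python
import re

# Hypothetical, deliberately tiny stop list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "it", "for"}

# Hypothetical suffix list for a crude stemmer (this is not Porter's algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def identify_terms(text: str) -> list[str]:
    """Return candidate indexing terms: lowercase stems of non-stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


print(identify_terms("Indexing the documents"))  # ['index', 'document']
```

Choosing single words or word groups as the unit instead would only change what `identify_terms` emits; the later evaluation stage works on whatever unit is chosen here.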

For indexing term evaluation, we can determine which terms are necessary for indexing by studying how frequently each term occurs in the documents. Indeed, the weighting of a term corresponds to its frequency of occurrence in the document. We distinguish two frequencies:

  • the term frequency "Term_Freq" is the number of occurrences of the term in the piece of information concerned;

  • the absolute frequency "Doc_Freq" is the frequency of the stemmed word in the whole collection of information.
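The two frequencies can be computed as in the following sketch. It assumes that each piece of information has already been reduced to a list of stemmed terms, and it reads "Doc_Freq" as the number of items in which a term appears; the text would also admit a reading as the total number of occurrences in the collection. The function names are hypothetical.

```python
from collections import Counter


def term_freq(terms: list[str]) -> Counter:
    """Term_Freq: occurrences of each term within one piece of information."""
    return Counter(terms)


def doc_freq(collection: list[list[str]]) -> Counter:
    """Doc_Freq (assumed reading): number of items in which each term appears."""
    counts: Counter = Counter()
    for terms in collection:
        counts.update(set(terms))  # count each term at most once per item
    return counts


collection = [["index", "term", "index"], ["term", "weight"], ["query"]]
print(term_freq(collection[0])["index"])  # 2
print(doc_freq(collection)["term"])       # 2
```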

We then observe that:

  • terms with a high frequency generally correspond to articles, pronouns, prepositions, etc., and they must be excluded because they carry little semantic content;

  • terms with a low frequency are not representative of the document content. The most significant terms are those whose frequency is intermediate.
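This observation amounts to keeping only the terms that fall between two frequency cut-offs. The sketch below assumes closed integer bounds `low` and `high`, which are hypothetical parameters: the text does not say where the cut-offs lie, and in practice they would have to be tuned to the collection.

```python
from collections import Counter


def select_terms(terms: list[str], low: int, high: int) -> set[str]:
    """Keep terms whose occurrence count lies in the closed range [low, high]."""
    counts = Counter(terms)
    return {t for t, n in counts.items() if low <= n <= high}


terms = ["the"] * 6 + ["index"] * 3 + ["term"] * 2 + ["zebra"]
print(sorted(select_terms(terms, low=2, high=4)))  # ['index', 'term']
```

Here "the" is discarded as too frequent and "zebra" as too rare, leaving the intermediate-frequency terms as index candidates.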

To evaluate the representativeness of terms in an instance of object "Information", we adapted the formula of Sparck Jones (1972):

  • TF_ij: frequency of the term i in the concerned specific element j,

  • N: number of specific elements in the collection of documents (in the warehouse),

  • AF_i: absolute frequency of a term i in the collection of documents.
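With these three quantities, and assuming the adaptation keeps the usual shape of Sparck Jones's inverse-frequency weighting, the weight of term i in element j could take the form w_ij = TF_ij × log(N / AF_i). This form is an assumption made for illustration, not necessarily the exact adapted formula, and the function name below is hypothetical.

```python
import math


def weight(tf_ij: int, n: int, af_i: int) -> float:
    """Assumed TF-IDF form: TF_ij * log(N / AF_i), using the natural logarithm."""
    if n <= 0 or af_i <= 0:
        raise ValueError("N and AF_i must be positive")
    return tf_ij * math.log(n / af_i)


# A term present everywhere gets zero weight; a rarer term outweighs a commoner one.
print(weight(3, 100, 100))                   # 0.0
print(weight(1, 10, 1) > weight(1, 10, 5))   # True
```

Under this form, a term that is frequent in one element but rare in the collection receives the highest weight, which matches the preference for discriminating terms expressed above.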

The information retrieval process adapted for our generic model has several advantages. First, it does not flood the user with a large number of documents. Second, it makes retrieval more efficient: instead of calculating the similarity between a query and the entire text, we measure the similarity between the query and each part (specific element) of the text. This allows more specific and more localized access to the information (one of our objectives).

Note: Information retrieval techniques based on term indexes do not exploit the logical structure of documents. They return documents or parts of documents, but they cannot return specific information, such as the edition year of a book. The idea, then, is to use a DBMS and structured languages.
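Matching a query against each specific element, rather than against the whole text, can be sketched as follows. The text does not name a similarity measure, so cosine similarity over raw term counts is used here as an assumption; the element names and the function names are hypothetical, and weighted vectors could be substituted for the raw counts.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_elements(
    query: list[str], elements: dict[str, list[str]]
) -> list[tuple[str, float]]:
    """Score each specific element against the query, best match first."""
    q = Counter(query)
    scores = [(name, cosine(q, Counter(terms))) for name, terms in elements.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


elements = {
    "intro": ["retrieval", "history"],
    "method": ["index", "term", "weight"],
    "results": ["index", "index", "query"],
}
ranking = rank_elements(["index", "weight"], elements)
print([name for name, _ in ranking])  # ['method', 'results', 'intro']
```

Only the best-scoring elements would be returned to the user, which is what gives the localized access described above.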




(ed.) Intelligent Agents for Data Mining and Information Retrieval
Year: 2004
Pages: 171
