NECESSARY CONSTRAINTS OF SELECTION METHODS IN DTDS


NECESSARY CONSTRAINTS OF SELECTION METHODS IN DTDS

We believe that the work of identifying necessary constraints of selection methods, which is absent in others' research in this area, is important in accurately determining which databases to search because it can help choose appropriate selection methods for different selection cases.

General Necessary Constraints for All Selection Methods in DTDs

As described in the previous section, when a query q is submitted, the databases are ranked in order S 1 , S 2 , , S n , such as S i is searched before S i+1 , 1 i n-1 , based on the comparisons between the query q and the representatives of the databases in DTDs, and not based on the order of selection priority. So, the following properties are general necessary constraints that a reasonable selection method in DTDs must satisfy :

  1. The selection methods must satisfy the associative law. That is, ˆ S i , S j , S k ˆˆ S ( 1 i, j, k n, i ‰  j ‰  k ), Rank ( Rank ( S i , S j ), S k ) = Rank ( S i , Rank ( S j , S k )), where Rank ( ) is the ranking function for the set of databases S ;

  2. The selection methods must satisfy the commutative law. That is, Rank ( S i , S j ) = Rank ( S j , S i ).

Special Necessary Constraints of Selection Methods for Each Selection Case

Before we start to discuss the special necessary constraints of selection methods for each selection case, we first give some basic concepts and functions in order to simplify the explanation. In the following section, we will mainly focus on the selection of three databases. It is easy to extend the selection process to any number of databases in DTDs. Suppose that there exist three databases in a DTD, S i , S j and S k , respectively. S i =< Q i , I i , W i , C i , D i , T i >, S j =< Q j , I j , W j , C j , D j , T j > and S k =< Q k , I k , W k , C k , D k , T k >. q is a given user query, and c t is the topic domain of interest for the user query. Simi G (S l , q) is the global similarity score function for the l th database with respect to the query q , and Rank ( ) is the ranking function for the databases. All these notations will be used through the following discussions.

The objective of database selection is to find the potential "good" databases which contain the most relevant information that a user needs. In order to improve search effectiveness, a database with a high rank will be searched before a database with a lower rank. Therefore, the correct order relationship among the databases is the critical factor which judges whether a selection method is "ideal" or not.

A database is made up of numerous documents. Therefore, the work of estimating the usefulness of a text database, in practice, is the work of finding the number of documents in the database that are sufficiently similar to a given query. A document d is defined as the most likely similar document to the query q if Simi G (d, q) t d , where t d is a global document threshold. Here, three important reference parameters about textual databases are given as follows , which should be considered when ranking the order of a set of databases based on the usefulness to the query.

  1. Database size . That is, the total number of the documents that the database contains.

    For example, if databases S i and S j have the same number of the most likely similar documents, but database S i contains more documents than database S j , then S j is ranked ahead of S i . That is, Rank ( S i , S j )={ S j , S i }.

  2. Useful document quality in the database . That is, the number of the most likely similar documents in the database.

    For example, if database S i has more of the most likely documents than database S j , then S i is ranked ahead of S j . That is, Rank ( S i , S j )={ S i , S j }.

  3. Useful document quantity in the database . That is, the similarity degree of the most likely similar documents in the database.

For example, if databases S i and S j have the same number of the most likely similar documents, but database S i contains the document with the largest similarity among these documents, then S i is ranked ahead of S j . That is, Rank ( S i , S j )={ S i , S j }.

Now, some other special necessary constraints for each potential selection case are given in following discussion:

  1. In an identical selection case , all the databases have the same topic categories. That is, they have an equal chance to contain the relevant information of interest. If Simi G (S i , q) = Simi G (S j , q) and D it > D jt , then Rank ( S i , S j )={ S j , S i }. The reason for this is that, for the same useful databases, more search effort will be spent in database S i than in database S j because database S i has more documents needed to search for finding the most likely similar documents.

  2. In an inclusion selection case , if C i C j , it means that database S j has other topic documents which database S i does not. Therefore, in order to reduce the number of non-similar documents to search in the database, the special constraint condition of selection method for the inclusion selection case can be described as follows:

    If Simi G (S i , q) = Simi G (S j , q) and C i C j , c t ˆˆ C i ˆ C j , then Rank ( S i , S j ) = { S i , S j }.

  3. In an overlap selection case , any two databases not only have some same subject-domain documents, but also have different subject-domain documents, respectively. So, there exist two possible cases: (1) c t ˆˆ C i ˆ C j ; and (2) c t ˆ‰ C i ˆ C j . Then, under these two cases, the constraint conditions that a suitable selection method must satisfy can be described as:

    1. If c t ˆˆ C i ˆ C j and c t ˆ‰ C k , then Simi G (S i , q), Simi G (S j , q) > Simi G (S k , q) ; and Rank ( S i , S j , S k )={ S i , S j , S k } or { S j , S i , S k }.

    2. If c t ˆ‰ C i ˆ C j and c t ˆˆ C k , then Simi G (S i , q), Simi G (S j , q) < Simi G (S k , q) ; and Rank ( S i , S j , S k )={ S k , S i , S j } or { S k , S j , S i }.

  4. In a disjoint selection case , since any two databases do not have the same subject-domain documents, it is obvious that only one database most likely contains the relevant documents of interest to the user. So, the selection method must satisfy the following necessary constraint:

    If c t ˆˆ C i , then Simi G (S i , q) > Simi G (S j , q), Simi G (S k , q) ; and Rank ( S i , S j , S k )= { S i , S j , S k } or { S i , S k , S j }.




(ed.) Intelligent Agents for Data Mining and Information Retrieval
(ed.) Intelligent Agents for Data Mining and Information Retrieval
ISBN: N/A
EAN: N/A
Year: 2004
Pages: 171

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net