PROBLEM DESCRIPTION | (ed.) Intelligent Agents for Data Mining and Information Retrieval

We first state several reasonable assumptions that simplify the database selection problem. Since 84 percent of the searchable web databases provide access to text documents, this chapter concentrates on web databases with text documents. Databases holding other types of information (e.g., image, video or audio databases) are outside the scope of this chapter.

Assumption 1

The databases are text databases that contain only text documents, and these documents are searchable on the Internet.

In this chapter, we mainly focus on the analysis of database representatives. To objectively and fairly determine the usefulness of databases with respect to the user queries, we will take a simple view of the search cost for each database.

Assumption 2

Assume all the databases have an equivalent search cost, including elapsed search time, network traffic charges, and possible pre-search monetary charges.

Most searchable large-scale text databases contain documents from multiple domains (topics) rather than from a single domain, so a category scheme helps describe the content of the databases.

Assumption 3

Assume complete knowledge of the contents of these known databases. The databases can then be categorized in a classification scheme.

Now, the database selection problem is formally described as follows:

Suppose there are n databases in a distributed text database environment to be ranked with respect to a given query.

Definition 1

A database $S_i$ is a six-tuple, $S_i = \langle Q_i, I_i, W_i, C_i, D_i, T_i \rangle$, where $Q_i$ is a set of user queries; $I_i$ is the indexing method that determines what terms should be used to index or represent a given document; $W_i$ is the term weight scheme that determines the weight of distinct terms occurring in database $S_i$; $C_i$ is the set of subject domain (topic) categories that the documents in database $S_i$ come from; $D_i$ is the set of documents that database $S_i$ contains; and $T_i$ is the set of distinct terms that occur in database $S_i$.
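As a rough illustration of Definition 1, the six-tuple could be held in a small record type. The sketch below is only one possible encoding: the class name `TextDatabase`, its field names, and the helper functions `full_text_index` and `raw_count_weights` are hypothetical and do not come from the chapter. The term set is derived from the documents and the indexing method rather than stored separately.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class TextDatabase:
    """Hypothetical container mirroring the six-tuple <Q, I, W, C, D, T>."""
    queries: Set[str]                                            # Q_i: user queries
    index_terms: Callable[[str], List[str]]                      # I_i: document -> index terms
    weight_terms: Callable[[List[List[str]]], Dict[str, float]]  # W_i: term weight scheme
    categories: List[str]                                        # C_i: topic categories
    documents: List[str]                                         # D_i: documents

    @property
    def terms(self) -> Set[str]:
        # T_i: distinct terms produced by the indexing method over all documents.
        return {t for doc in self.documents for t in self.index_terms(doc)}


def full_text_index(doc: str) -> List[str]:
    # Assumed full-text indexing: every whitespace-separated word is a term.
    return doc.lower().split()


def raw_count_weights(indexed_docs: List[List[str]]) -> Dict[str, float]:
    # Assumed weight scheme: the raw number of occurrences of each term.
    counts = Counter(t for doc in indexed_docs for t in doc)
    return {t: float(c) for t, c in counts.items()}


db = TextDatabase(
    queries=set(),
    index_terms=full_text_index,
    weight_terms=raw_count_weights,
    categories=["databases"],
    documents=["text database selection", "database ranking"],
)
weights = db.weight_terms([db.index_terms(d) for d in db.documents])
```

Passing the indexing method and the weight scheme as functions keeps the heterogeneity between databases explicit: two databases over the same documents can differ only in these two fields.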

Definition 2

Suppose database $S_i$ has $m$ distinct terms, namely, $T_i = \{t_1, t_2, \ldots, t_m\}$. Each term in the database can be represented as a two-dimensional vector $\{t_i, w_i\}$ ($1 \le i \le m$), where $t_i$ is the term (word) occurring in database $S_i$, and $w_i$ is the weight (importance) of the term $t_i$.

The weight of a term usually depends on the number of occurrences of the term in database $S_i$ (relative to the total number of occurrences of all terms in the database). It may also depend on the number of documents containing the term relative to the total number of documents in the database. Different methods exist for determining the weight. One popular term weight scheme uses the term frequency of a term as its weight (Salton & McGill, 1983). Another popular scheme uses both the term frequency and the document frequency of a term to determine its weight (Salton, 1989).
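The two families of weight schemes can be sketched as follows. This is a minimal illustration, not the formulas of the cited works: the first function uses the relative term frequency described above, and the second combines it with an inverse document frequency factor, log(N / df), which is one common way to bring in document frequency and is an assumption here. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def term_frequency_weights(docs):
    """Weight = occurrences of a term / total occurrences of all terms."""
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def tf_df_weights(docs):
    """Relative term frequency scaled by log(N / df), an assumed combination."""
    n_docs = len(docs)
    tf = term_frequency_weights(docs)
    # Document frequency: number of documents that contain the term.
    df = Counter(t for doc in docs for t in set(doc))
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}


# Each document is already reduced to its list of index terms.
docs = [
    ["web", "database", "selection"],
    ["database", "ranking"],
    ["web", "search"],
]
tf = term_frequency_weights(docs)
tfdf = tf_df_weights(docs)
```

Under the first scheme "database" gets the largest weight because it occurs most often; under the second, a term such as "selection" that appears in only one document is weighted above it.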

Definition 3

A given user query $q$ is defined as a set of query terms without Boolean operators, denoted by $q = \{q_j, u_j\}$ ($1 \le j \le m$), where $q_j$ is the term (word) occurring in the query $q$, and $u_j$ is the weight (importance) of the term $q_j$.

Suppose we know the category of each document in database $S_i$. We could then use this information to classify database $S_i$ (a full discussion of text database classification techniques is beyond the scope of this chapter).

Definition 4

Suppose there are $p$ topic categories in database $S_i$, described as $C_i = (c_1, c_2, \ldots, c_p)$. Similarly, the set of documents in database $S_i$ can be defined as a vector $D_i = \{D_{i1}, D_{i2}, \ldots, D_{ip}\}$, where $D_{ij}$ ($1 \le j \le p$) is the subset of documents corresponding to the topic category $c_j$.

In practice, the similarity of database $S_i$ with respect to the user query $q$ is the sum of the similarities of the document subsets of all its topic categories.
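The summation over topic categories can be sketched as below. The chapter does not fix a similarity measure for a single document subset at this point, so the sketch assumes a simple dot product between the query term weights and the raw term counts of the subset; the function names and sample data are hypothetical.

```python
from collections import Counter


def subset_similarity(query, docs):
    """Assumed measure: dot product of query weights and raw term counts."""
    counts = Counter(t for doc in docs for t in doc)
    # A Counter returns 0 for terms that never occur in the subset.
    return sum(u * counts[t] for t, u in query.items())


def database_similarity(query, docs_by_category):
    """Sum the similarities of the document subsets of all topic categories."""
    return sum(subset_similarity(query, docs) for docs in docs_by_category.values())


# Query terms mapped to their weights u_j.
query = {"database": 1.0, "ranking": 0.5}
# Documents (as term lists) grouped by topic category c_j.
docs_by_category = {
    "databases": [["database", "selection"], ["database", "ranking"]],
    "web": [["web", "search"]],
}
score = database_similarity(query, docs_by_category)
```

Here only the "databases" subset contributes to the score, since no query term occurs in the "web" subset.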

For a given user query, different databases may adopt different document indexing methods to determine the potentially useful documents they contain. These indexing methods may differ in a variety of ways. For example, one database may perform full-text indexing, which considers all the terms in the documents, while another may employ partial-text indexing, which uses only a subset of the terms.

Definition 5

A set of databases $S = \{S_1, S_2, \ldots, S_n\}$ is optimally ranked in the order of global similarity with respect to a given query $q$ if $\mathit{Simi}_G(S_1, q) \ge \mathit{Simi}_G(S_2, q) \ge \cdots \ge \mathit{Simi}_G(S_n, q)$, where $\mathit{Simi}_G(S_i, q)$ ($1 \le i \le n$) is the global similarity function for the $i$th database with respect to the query $q$, the value of which is a real number.

For example, consider the databases $S_1$, $S_2$ and $S_3$. Suppose the global similarities of $S_1$, $S_2$ and $S_3$ to a given user query $q$ are 0.7, 0.9 and 0.3, respectively. Then the databases should be ranked in the order $\{S_2, S_1, S_3\}$.
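The ranking in this example amounts to sorting the databases by descending global similarity. A minimal sketch, assuming the global similarity scores have already been computed (the function name `rank_databases` is hypothetical):

```python
def rank_databases(similarities):
    """Return database names ordered by descending global similarity."""
    return sorted(similarities, key=similarities.get, reverse=True)


# Global similarities of the three databases to the query q, from the example.
scores = {"S1": 0.7, "S2": 0.9, "S3": 0.3}
ranking = rank_databases(scores)
print(ranking)  # ['S2', 'S1', 'S3']
```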

Because local databases may use different indexing methods or different term weight schemes, each local database may use its own local similarity function, namely $\mathit{Simi}_{L_i}(S_i, q)$ ($1 \le i \le n$). Therefore, for the same data source $D$, different databases may have different local similarity scores for a given query $q$. To rank the local text databases accurately, all of them must be evaluated with the same similarity function, namely $\mathit{Simi}_G(S_i, q)$, which gives the global similarity with respect to the user query (a discussion of local and global similarity functions is outside the scope of this chapter).

The need for database selection arises largely because document databases are heterogeneous. If the databases contain documents from different subject domains, if the numbers of documents per subject domain differ, or if the databases apply different indexing methods to their documents, the database selection problem becomes rather complicated. Identifying the heterogeneities among the databases helps in estimating the usefulness of each database for the queries.