This section surveys present query formation and information retrieval methods.
The traditional way of searching uses keywords to form and process queries. Query formation using keywords works well when the user knows exactly what he or she is looking for. However, keyword-based search can be difficult to use when the vocabulary of the subject field is totally unfamiliar to the user. Furthermore, using the same keywords in a different order or at a different time might change the number as well as the order of the sites listed. As can be seen, there is usually some form of uncertainty when forming a query using keywords. Moreover, no single engine searches the entire Internet, so it is often necessary to search several engines. Essentially, what keyword-based queries lack is the user's sense of linkage between multiple keywords and context. Also, if too many keywords are used, the constraints may become too tight to retrieve anything. Most importantly, mobile devices are small and their keypads are not well suited to typing. Thus, searching with the traditional keyword method is unrealistic.
Relevance feedback is still the main technique for query modification. It has been investigated for more than 20 years in various information retrieval models, such as the probabilistic model and the vector space model (Boughanem, Chrisment, & Tamine, 1999; Salton, 1989). Relevance feedback modifies the set of query terms, as well as the weights associated with those terms, according to the documents retrieved and judged during the initial search. Conventionally, however, retrieved documents are not evaluated automatically, which means a user might end up getting junk instead of the things he wants. Also, when too many documents are available, the user has to filter them manually. These problems are fundamental to the motivation of this research, as filtering on small mobile devices is extremely difficult.
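The term-reweighting step of relevance feedback in the vector space model can be sketched as follows. This is a minimal Rocchio-style illustration, not the exact formulation of the cited works; the weights alpha, beta, and gamma are assumed tuning parameters.

```python
def rocchio_feedback(query, relevant, nonrelevant,
                     alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query modification: move the query vector toward
    judged-relevant documents and away from judged-nonrelevant ones.
    All vectors are dicts mapping terms to weights.
    The default weights are illustrative, not from the cited works."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_query[t] = max(w, 0.0)  # clip negative weights to zero
    return new_query
```

Note how a term that appears only in judged-relevant documents (e.g. "plasma" for a "tv" query) enters the modified query with a positive weight, which is exactly the query-drift behavior the rest of this section discusses.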
Much research has been done on how the genetic algorithm can be used in information retrieval. One popular approach is query restructuring, which improves the efficiency and effectiveness of the queries formed. This boils down to the fundamental concepts of relevance feedback; hence, the genetic algorithm actually extends relevance feedback. The difference is that the genetic algorithm uses more than one query and compares the fitness of these queries, and the fittest query survives in the end. This is much more effective than using only one query with hill-climbing search (Boughanem et al., 1999). Here, mutation and crossover take place during restructuring.
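The population-based search described above can be sketched as follows. This is an illustrative skeleton, not the algorithm of any cited work: the fitness function, term pool, and selection scheme are all stand-in assumptions.

```python
import random

def evolve_queries(population, fitness, term_pool,
                   generations=20, mutation_rate=0.2, seed=42):
    """Evolve a population of queries (each a set of terms).
    Fitter queries survive; crossover mixes term sets and mutation
    swaps in terms from an assumed term pool."""
    rng = random.Random(seed)
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        survivors = scored[: max(2, len(scored) // 2)]  # selection: keep the fitter half
        children = []
        while len(survivors) + len(children) < len(population):
            a, b = rng.sample(survivors, 2)
            # crossover: draw a child query from the union of two parents
            child = set(rng.sample(sorted(a | b), k=min(len(a), len(a | b))))
            if rng.random() < mutation_rate:
                child.add(rng.choice(term_pool))  # mutation: add a pool term
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)
```

Because the fitter half always survives, the best fitness in the population never decreases from one generation to the next, which is the hill-climbing advantage noted by Boughanem et al. (1999).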
Yang and Korfhage (1994) used genetic algorithms to search the space of possible queries and generate better queries based on relevance feedback for the vector space model, obtaining remarkable results. Kraft, Petry, Buckles, and Sadasivan (1994) explored extending these concepts to Boolean queries. Another approach uses evolutionary algorithms to evolve agents instead of queries (Kouichi, Taketa, & Nunokawa, 1999). These agents contain search parameters for searching WWW databases and are tuned by genetic algorithms.
In this chapter, the focus is on extending the concepts of using genetic algorithms in query restructuring. The difference lies in the types of restructuring. Basically, three types of restructuring (synonyms, logical operators, and numerical constraints) are used in the searching process to enhance query effectiveness. Each type of restructuring is applied in different situations, depending on its suitability to the situation.
Salton and McGill (1983) consider the two ultimate measures of query fitness to be precision and recall. Precision is the percentage of retrieved documents that are relevant, while recall measures the percentage of the relevant documents that are retrieved (Kraft et al., 1994; Salton & McGill, 1983). The two tend to be inversely proportional, so that one is traded for the other in most situations; they are complementary and competitive (Kraft et al., 1994). A more general query retrieves more documents, and recall increases because more relevant documents are retrieved. As the query becomes less general, fewer documents are retrieved and precision increases. Therefore, these two measures are usually not used directly. In Kraft et al. (1994), two arbitrary constant weights were used to balance the fitness function.
Another measure is average search length (Losee, 1991). Average search length is the average number of documents or text fragments examined in moving down a ranked list of documents until arriving at the average position of a relevant document (Losee, 1988, 1996). Evaluating the performance of a filtering or retrieval process with average search length provides a single-number measure of performance. It can also be predicted analytically (Losee, 1996) and is easily understood by mobile users, unlike most other retrieval and filtering measures based on precision and recall that are used to evaluate retrieval systems.
Another fitness function is average maximum parse length (Losee, 1996). Average maximum parse length is the average, over a set of sentences, of the largest number of terms in a parse of each sentence. There are also measures that combine average search length and average maximum parse length. The motivation is that different genes often produce the same average search length value, and adding the average maximum parse length breaks the ties that arise when average search length alone is used as the fitness function (Losee, 1996). When two genes with identical average search lengths are compared, the gene with the larger average maximum parse length is selected. This computational method is useful when average search length is not a very good measure of parsing performance due to the limited improvements obtained with disambiguation.
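Under these definitions, the tie-breaking comparison can be sketched as follows. Representing a gene as an (average search length, average maximum parse length) pair is an illustrative simplification, not the encoding of the cited work.

```python
def average_search_length(ranked_relevance):
    """Average 1-based position of the relevant documents in a ranked list.
    `ranked_relevance` is a list of booleans, True where the ranked
    document is relevant; empty relevance yields infinity."""
    positions = [i + 1 for i, rel in enumerate(ranked_relevance) if rel]
    return sum(positions) / len(positions) if positions else float("inf")

def better_gene(gene_a, gene_b):
    """Prefer the gene with the smaller average search length; break ties
    with the larger average maximum parse length, following the combined
    measure described by Losee (1996). Each gene is an (asl, ampl) pair."""
    asl_a, ampl_a = gene_a
    asl_b, ampl_b = gene_b
    if asl_a != asl_b:
        return gene_a if asl_a < asl_b else gene_b
    return gene_a if ampl_a >= ampl_b else gene_b
```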
Reviewing the present ways of defining the fitness function for a genetic algorithm, the relevance of the documents retrieved has been greatly emphasized. Typically, present methods have dealt only with the relevance of the documents retrieved. This is reasonable but inefficient, because it is rather difficult to judge the relevance of each document when the number of documents is very large. This chapter measures the relevance of queries instead of the documents retrieved. On this basis, efficiency improves significantly, as the number of queries is much smaller than the number of documents retrieved. Of course, a relevant query might not retrieve a relevant document. However, in the context of this chapter, where the retrieved documents are products from product databases, it is reasonable to say that almost all the documents retrieved are relevant to the query formed. This is because most product databases have rather fixed formats, unlike documents retrieved from the Internet, where the variation of formats can be infinite.
Both keyword- and ontology-based approaches have their advantages and disadvantages. Ontology provides structure, context, and a visual aid, while keywords provide a direct search mechanism. Both approaches are relevant to mobile commerce because they save browsing and searching time, which is very much required by mobile users who are always on the move. Thus, by combining keyword queries with ontology, it is possible to achieve better and more effective query formation. Most would think of a parallel combination, in which the keyword approach serves as a backup to the ontology approach. For example, when a user cannot find a term in the ontology, he may still be able to use a keyword search for whatever is not covered by the ontology.
The parallel solution is adequate in terms of its coverage but still lacks efficiency in some ways. For example, as the ontology grows larger, finding a term becomes a chore and slows everything down. To tackle this problem, a serial combination is needed. This is where most of the research content lies, as this method, although similar to existing work, differs from the most current research.
On the whole, before ontology terms are accessed to form the queries, a keyword search is performed to find the required ontology term. For example, "ps2" can be hidden under the node "mouse" in the ontology. The user cannot intuitively know where "ps2" is located without eyeballing the ontology. With the help of keyword search, the term "ps2" can be found easily.
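This keyword-first lookup can be sketched as a depth-first search over the ontology tree. The nested-dictionary representation and the node names are illustrative assumptions.

```python
def find_node(ontology, keyword, path=()):
    """Depth-first search for an ontology node whose name contains the
    keyword; returns the path of node names from the root, or None.
    The ontology is a nested dict {node_name: children_dict}."""
    for name, children in ontology.items():
        current = path + (name,)
        if keyword.lower() in name.lower():
            return current
        found = find_node(children, keyword, current)
        if found:
            return found
    return None
```

For the "ps2" example, the search returns the full path down to the matching node, so the user can jump straight to the right branch of the ontology instead of browsing it level by level.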
In forming queries, there is a high chance that the vocabulary the user uses to describe a query does not exactly match the vocabulary used by the query system, which results in insufficient information being retrieved. For example, a person looking for televisions might key in "television" for his search, but the product ontology might describe televisions under a node called "tv" or "tv set". Intuitively, "television" is equal to "tv" and "tv set". Therefore, restructuring that deals with domain ontology relationships may be useful. These relationships involve semantic links such as hyponyms and synonyms (Braga et al., 2000). Here, using synonyms is an adequate option for restructuring queries because it correctly broadens the scope of the search, even to the extent of different languages. A user can create a product search with his device when he enters a shopping centre in a foreign country. There should exist a synonym system in the shopping network or on the Internet that can process this type of query by mutating the query terms with synonyms.
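The synonym mutation step can be sketched as follows. The small synonym table is a stand-in for whatever synonym system (e.g. a WordNet-like resource) the shopping network would actually provide.

```python
import random

def mutate_with_synonyms(query_terms, synonyms, rng=None):
    """Mutate a query by replacing each term that has known synonyms
    with one of them, broadening the vocabulary of the search.
    `synonyms` maps a term to a list of interchangeable terms."""
    rng = rng or random.Random(0)
    mutated = []
    for term in query_terms:
        options = synonyms.get(term)
        mutated.append(rng.choice(options) if options else term)
    return mutated
```

A query containing "television" can thus be mutated into one containing "tv" or "tv set", matching the vocabulary of the product ontology without the user having to guess it.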
Also, one major problem in information retrieval is that either too little or too much information is retrieved when making a query. When too little information is retrieved, it is logical to relax the constraints of the query, so the use of synonyms or hyponyms might be necessary. However, this approach has a major disadvantage. Relaxing the constraints of a query using synonyms or hyponyms to increase the number of documents retrieved can deface the meaning of the original query, letting it drift away from the user's intention. This concern can be alleviated by having user feedback along the way, but there is still much room for research on the use of synonyms and hyponyms. Another new approach considered in this research is to relax constraints step by step, which better eliminates the chances of the genetic algorithm constructing far-fetched queries. Step by step means, for example, relaxing numerical constraints in steps of 5% or 10%: price<2000 becomes price<2200. In addition, instead of relaxing constraints, this approach can also tighten them. The only limitation of this option is that it applies only to numerical values; when the constraints are words, this method might not work.
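The step-wise relaxation of a numerical constraint can be sketched as follows; the function signature and default step size are illustrative assumptions.

```python
def relax_numeric_constraint(operator, value, step=0.10, tighten=False):
    """Relax (or tighten) a numerical constraint by a fixed percentage
    step, e.g. 'price < 2000' relaxed by 10% becomes 'price < 2200'.
    Only numeric comparison constraints are supported, mirroring the
    limitation noted in the text."""
    if operator not in ("<", "<=", ">", ">="):
        raise ValueError("only numeric comparison operators are supported")
    widen = not tighten
    # Relaxing an upper bound raises the limit; relaxing a lower bound lowers it.
    if operator in ("<", "<="):
        factor = 1 + step if widen else 1 - step
    else:
        factor = 1 - step if widen else 1 + step
    return operator, value * factor
```

Because each step changes the bound by only a small percentage, the mutated query stays close to the user's original intention, unlike an unconstrained synonym or hyponym expansion.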
As mentioned earlier, since the relevancy focus has shifted to the query stage, and product documents are mostly relevant when the query is relevant, it is not necessary to include measurements such as precision or recall, which measure relevance at the document-retrieval stage. In this chapter, a totally new design of the fitness function is proposed.
Instead of favoring a query by giving it credit for relevancy, it is a good idea to work from the irrelevancy of the query by giving it demerits, since most documents are relevant. Here, the irrelevancy of a query sets in after mutation is applied during the genetic algorithm. This irrelevancy forms one of the elements in the fitness function.
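One way to realize this demerit idea is sketched below. The weighting is purely illustrative; the chapter's actual fitness function is not reproduced here.

```python
def query_fitness(base_score, mutations_applied, demerit_per_mutation=0.1):
    """Score a query by starting from a base credit and subtracting a
    demerit for each mutation, since each mutation may have drifted the
    query away from the user's original intention. Illustrative weighting."""
    return max(base_score - demerit_per_mutation * mutations_applied, 0.0)
```

A heavily mutated query thus accumulates demerits and loses out, during selection, to queries that stayed closer to the user's original formulation.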