METHODOLOGY


Four general search engines (AltaVista, Google, WiseNut and AlltheWeb) are examined in the experiment. To perform the experiment, we first select suitable queries and then compare the overlap and distance of the search results.

Sampling Queries

We select the 58 most popular queries from WordTracker's top 200 long-term keyword report of July 27, 2002. WordTracker provides access to the query logs of the metasearch engines MetaCrawler (http://www.metacrawler.com) and Dogpile (http://www.dogpile.com) stretching back two months; its database held 301,687,926 search terms at the end of July 2002. The database is constantly updated, with new data added each week. The 58 queries are selected according to the following criteria:

  • We select only one query from each group of similar queries. For instance, from hotmail and hotmail.com we select hotmail.

  • The sample queries are chosen according to the order of popularity. After reducing the redundancy of similar queries, the top 60 queries are selected as sample queries.

  • Considering the popularity of adult-related queries, all eight adult-related queries among the top 100 queries are selected.

These 58 queries account for approximately 55.22 percent of the total occurrences of the top 200 queries. Considering the characteristics of these 58 queries, we can divide them into three categories.

  1. Very specific queries, such as those for the name of a company, an organization or a product. A web site usually exists for such a query. This category includes 10 queries: google, yahoo, hotmail, ebay, mapquest, ask jeeves, Kazaa, Winzip, southwest airlines, and warez.

  2. Very general queries. This category includes 40 queries: hotels, lyrics, jokes, pictures, maps, games, song lyrics, dictionary, weight loss, search engines, weather, music, april fools, snes roms, jobs, free people search, morpheus, clip art, mp3, wallpaper, recipes, computer deals, baby names, chat, poems, chat, travel, free games, quotes, used cars, airline tickets, movies, parent, lingerie, people search, spiderman, clipart, driving directions, dogs, greeting cards, and author.

  3. Adult-related queries. This category includes eight queries: sex, porn, free porn, literotica, lolita, xxx, erotic stories, and free sex stories.

Comparing Search Results

Through BookWorm, a metasearch service that we wrote ourselves, each selected query is forwarded to the four general search engines: Google, AltaVista, AlltheWeb and WiseNut. The top 50 hits are then fetched from each search engine and compared by a background program. The statistical data are recorded in a file for further processing. The procedure consists of the following steps.

  • Step 1. Normalizing the queries. All queries are converted to lowercase and sent to the search engines without any advanced operators such as "AND", "OR" or "+".

  • Step 2. Retrieving the search results. To compare the search engines, we use each engine's default settings and enable the site collapsing option where the engine supports it. When site collapsing is enabled, the search engine tries to display results from many different sites. The top 50 results from each search engine are saved for comparison.

  • Step 3. Calculating the overlap and distance of the search results. In this study, we check only the hostname of each URL for matching. In other words, if two URLs with the same hostname are retrieved by two different search engines, we treat them as matching results when calculating the overlap. If several URLs with the same hostname are retrieved by the same search engine, the URL with the highest rank is used to calculate the overlap and distance. The matching process also requires a database that stores the different hostnames of the same web site. Three rounds of calculations are carried out, on the top 10, top 20 and top 50 results. In some cases, the overlap of the top 1 result is also examined.

  • Step 4. Analyzing the results. The results are analyzed with the statistical tools of SPSS 10. We focus on the following questions:

    1. The difference in the overlap and distance of the search results retrieved by the four search engines (Google, AltaVista, AlltheWeb and WiseNut).

    2. The difference in the overlap and distance across the three query categories.

    3. The difference in the overlap and distance across the three rounds (top 10, top 20, top 50).

In addition to these three questions, we also want to examine the search results in a general sense. For example, on average, what is the percentage of results retrieved respectively by only one search engine, two search engines, three search engines, or all four search engines? How many distinct results are retrieved by all four search engines?
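The hostname matching of Step 3 and the "how many engines retrieved each result" question can be sketched as follows. This is only an illustration, not the BookWorm code: the function names are hypothetical, hostnames are compared after lowercasing only, the top k URLs are taken before duplicates are collapsed, and the database of alternative hostnames for the same web site is not modeled.

```python
from collections import Counter
from urllib.parse import urlsplit


def hostname(url):
    """Return the lowercased hostname of a URL, or '' if none can be parsed."""
    return (urlsplit(url).hostname or "").lower()


def top_hosts(urls, k):
    """Reduce the top-k ranked URLs to hostnames.

    Each hostname is kept once, at the position of its highest-ranked URL.
    Assumption: the cut to k URLs happens before duplicates are collapsed.
    """
    hosts = []
    for url in urls[:k]:
        host = hostname(url)
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def engine_count_distribution(results, k):
    """Share of distinct hostnames retrieved by exactly 1, 2, ... engines.

    `results` maps an engine name to its ranked list of URLs for one query.
    """
    engines_per_host = Counter()
    for urls in results.values():
        engines_per_host.update(set(top_hosts(urls, k)))
    distinct = len(engines_per_host)
    hosts_per_count = Counter(engines_per_host.values())
    return {n: hosts_per_count[n] / distinct for n in sorted(hosts_per_count)}
```

Calling `engine_count_distribution` with k set to 10, 20 or 50 corresponds to the three rounds described in Step 3; averaging the returned shares over all queries would give the per-engine-count percentages asked about above.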

Measurement on Overlap and Distance

Here we use an ordered list $L$ to represent the search result returned by a search engine for a specific query. Given a universe $U$, an ordered list $L$ is a ranked subset $S$ of $U$, i.e., $L = \langle x_1, x_2, \ldots, x_n \rangle$ with $x_1 > x_2 > \cdots > x_n$, where $>$ is an ordering relation on $S$. Let $|L|$ denote the number of elements in $S$, and let $R(i)$ denote the rank (position) of element $i$ in $L$.

  1. Overlap measures
    Given two lists, $L_1$ and $L_2$, the overlap of $L_1$ and $L_2$ is given by $O(L_1, L_2) = |L_1 \cap L_2|$. That is to say, the overlap of $L_1$ and $L_2$ equals the number of elements occurring in both $L_1$ and $L_2$.

    Given several ranked lists $J, L_1, L_2, \ldots, L_k$, the overlap of $J$ to $L_1, L_2, \ldots, L_k$ is given by

  2. Distance measures
    Given two lists, $L_1$ and $L_2$, we first construct two new lists, $N_1$ and $N_2$, which record only the overlapping elements of $L_1$ and $L_2$ and preserve their order in the original lists. We can then calculate the distance between $N_1$ and $N_2$ using the following method. Let $S$ denote the set of elements in the newly constructed lists.
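The pairwise overlap and the construction of the two restricted lists can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes each list is a Python list of hostnames without duplicates, ordered from best to worst rank.

```python
def overlap(l1, l2):
    """O(L1, L2): number of elements occurring in both lists."""
    return len(set(l1) & set(l2))


def restrict_to_common(l1, l2):
    """Build N1 and N2: the shared elements, each in its original order."""
    common = set(l1) & set(l2)
    n1 = [x for x in l1 if x in common]
    n2 = [x for x in l2 if x in common]
    return n1, n2
```

For example, with `l1 = ["a", "b", "c", "d"]` and `l2 = ["d", "x", "b", "a"]`, the overlap is 3 and the restricted lists are `["a", "b", "d"]` and `["d", "b", "a"]`.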

Kendall tau distance. It counts the number of pairs of elements whose relative order differs between the two lists (Kendall & Gibbons, 1990). Formally, the Kendall distance of $L_1$ and $L_2$ is given by:

$$K(N_1, N_2) = \left| \{ (i, j) : i, j \in S,\ R_1(i) < R_1(j),\ R_2(i) > R_2(j) \} \right|$$

where $R_1(i)$ and $R_2(i)$ are the rank positions of $i$ in lists $N_1$ and $N_2$, respectively. We then obtain the normalized distance by dividing by the maximum possible distance, $|S|(|S| - 1)/2$.

That is, $NK(L_1, L_2) = K(N_1, N_2) / (|S|(|S| - 1)/2)$. Similarly, we can extend the distance measure to more than two lists. Given several lists $J, L_1, L_2, \ldots, L_k$, the distance of $J$ to $L_1, L_2, \ldots, L_k$ is given by

From the definition of the distance, we see that the maximum possible value is 1, which means that the two lists are in exactly reversed order. The minimum possible value is 0, which means that the two lists are identical.
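A minimal sketch of the normalized Kendall tau distance between two result lists is given below. The function names are hypothetical, the lists are assumed to contain no duplicates, and returning 0.0 when fewer than two elements are shared is our own convention for this sketch, since the normalization is undefined in that case.

```python
from itertools import combinations


def kendall_distance(n1, n2):
    """Count pairs of elements ordered differently in n1 and n2.

    n1 and n2 must contain the same elements, without duplicates.
    """
    r1 = {x: pos for pos, x in enumerate(n1)}
    r2 = {x: pos for pos, x in enumerate(n2)}
    return sum(
        1
        for i, j in combinations(n1, 2)
        if (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
    )


def normalized_kendall(l1, l2):
    """Kendall distance over the shared elements, scaled to the range [0, 1]."""
    common = set(l1) & set(l2)
    n1 = [x for x in l1 if x in common]
    n2 = [x for x in l2 if x in common]
    s = len(common)
    if s < 2:
        return 0.0  # assumption: no pair to compare, treated as no disagreement
    return kendall_distance(n1, n2) / (s * (s - 1) / 2)
```

With `l1 = ["a", "b", "c", "d"]` and `l2 = ["d", "x", "b", "a"]`, the shared elements appear in exactly reversed order, so all three pairs disagree and the normalized distance is 1.0.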




(ed.) Intelligent Agents for Data Mining and Information Retrieval
Year: 2004
Pages: 171