Chapter XIV: A Study on Web Searching - Overlap and Distance of the Search Engine Results


Shanfeng Zhu, City University of Hong Kong, Hong Kong
Xiaotie Deng, City University of Hong Kong, Hong Kong
Qizhi Fang, Qingdao Ocean University, China
Weimin Zheng, Tsinghua University, China

Web search engines are among the most popular services for helping users find useful information on the Web. Although many studies have estimated the size and overlap of general web search engines, those estimates may not benefit ordinary users, who care more about the overlap of the top N (N=10, 20 or 50) search results for concrete queries than about the overlap of the entire index databases. In this study, we present experimental results comparing the overlap of the top N (N=10, 20 or 50) search results from AlltheWeb, Google, AltaVista and WiseNut for the 58 most popular queries, as well as the distance between the overlapped results.

These 58 queries are chosen from the WordTracker service, which records the most popular queries submitted to some well-known metasearch engines, such as MetaCrawler and Dogpile. We divide these 58 queries into three categories for further investigation. Through in-depth study, we observe a number of interesting results: the overlap of the top N results retrieved by different search engines is very small; the search results of queries in different categories behave in dramatically different ways; Google, on average, has the highest overlap among these four search engines; and each search engine tends to adopt its own ranking algorithm independently.

INTRODUCTION

With the development of the World Wide Web, people can suffer from information overload. Since search engines help us locate what we need in the ocean of information, they have become one of the most popular services on the Web. Owing to fierce competition and financial pressure, some search engines have closed or stopped offering public search services; Northern Light (http://www.northernlight.com) is one example. By the end of July 2002, the most famous search engines included AltaVista (http://www.altavista.com), AlltheWeb (http://www.alltheweb.com), Google (http://www.google.com), HotBot (http://www.hotbot.com), Lycos (http://www.lycos.com), MSN Search (http://search.msn.com), Teoma (http://www.teoma.com) and WiseNut (http://www.wisenut.com).

Many studies have analyzed the characteristics of searching on the Web. One type concentrates on the characteristics of search engines, such as coverage, overlap and dynamics, which can improve users' understanding of web searching and thus help them find the information they want. The other type focuses on the characteristics of search users, such as the most frequent queries, search operators and modifiers, which are useful in designing more efficient search engines. Our study belongs to the first type.

As search users, we are eager to know how to select a suitable search engine for a given search task. Since each search engine has its own database and a distinct ranking algorithm, it retrieves and presents its own search results to the user. Naturally, we have many questions, such as: For the same query, is there a significant difference among the hit lists of several different search engines? Do they rank the overlapped results in the same order? In this study, we investigate the overlap and distance of search engine results for some popular queries. Four general search engines, AltaVista, AlltheWeb, Google and WiseNut, are examined.

According to several studies (see Hoelscher, 1998; Silverstein et al., 1999; Jansen et al., 2000), people seldom go beyond the top 10 hits of a result list, which means that the top of the list matters most to users. Therefore, the top N (N=10, 20 or 50) results from each search engine are compared in this study. We measure not only how many hits overlap in the top N results of each search engine, but also the distance between the overlapped results. The measures of overlap and distance are defined in later sections. All 58 queries, which are chosen from a list of the most popular queries provided by the WordTracker (http://www.wordtracker.com) service, are divided into three categories. In addition to helping users compare and choose suitable search engines, our findings could also shed light on the design of effective result-merging algorithms for metasearch engines and of search engine evaluation algorithms.
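
As a rough illustration of the two quantities compared here, the following Python sketch computes the overlap of two top N lists and one possible distance between the overlapped results. It is not the chapter's own measure: the function names, the sample URLs, and the choice of a footrule-style distance (the sum of absolute rank differences over shared results) are assumptions made for this example only.

```python
def top_n_overlap(list_a, list_b, n):
    """Return the results that appear in the top n of both ranked lists."""
    top_b = set(list_b[:n])
    # Keep the order of list_a so the output is deterministic.
    return [url for url in list_a[:n] if url in top_b]


def footrule_distance(list_a, list_b, n):
    """Sum of absolute rank differences over the overlapped results.

    Assumption: a footrule-style distance restricted to shared results,
    used here only as a stand-in for a rank distance.
    Each list is assumed to contain no duplicate entries.
    """
    rank_a = {url: i for i, url in enumerate(list_a[:n], start=1)}
    rank_b = {url: i for i, url in enumerate(list_b[:n], start=1)}
    common = rank_a.keys() & rank_b.keys()
    return sum(abs(rank_a[u] - rank_b[u]) for u in common)


# Hypothetical top-5 result lists from two engines for one query.
engine_a = ["u1", "u2", "u3", "u4", "u5"]
engine_b = ["u3", "u1", "u6", "u7", "u2"]

shared = top_n_overlap(engine_a, engine_b, 5)
distance = footrule_distance(engine_a, engine_b, 5)
print(len(shared), distance)
```

With these made-up lists, three results are shared, and they sit at different ranks in each list, so a pair of engines can overlap substantially while still ordering the shared results quite differently.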




(ed.) Intelligent Agents for Data Mining and Information Retrieval
Year: 2004
Pages: 171
