BACKGROUND


The following sections look into various systems and approaches used by others to study the areas of Web searching, Web querying, and the use of machine-learning techniques to perform Web searches. These systems and approaches are related in one way or another to various aspects of systems presented in the next section.

Web Search Engines

Searching the Web involves utilizing one of the many publicly available search engines to perform keyword searches for documents cataloged in the search engine's database. Popular World Wide Web search engines such as AltaVista (http://www.altavista.com/) and Yahoo! (http://www.yahoo.com/) allow users to search for Web pages by providing keywords. These keywords are then used to look up the URLs of relevant Web pages in the search engine's database. Some search engines also utilize a spider or Web robot to index the pages on the Web. These programs scan currently indexed pages to identify new pages, and the newly identified pages are then cataloged and indexed within the search engine's database. In addition, some search engines, such as HotBot (http://www.hotbot.com/), allow users to specify that the documents they are searching for contain images, video, and/or JavaScript code. One search engine that does utilize some structure information is Google (http://www.google.com/), described below. However, most search engines utilize very little, if any, detailed structure information.

Until recently, the Google search engine was relatively unknown to the general public. However, many in computer science academia have known about Google for much longer. This was primarily due to the exceptional research paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Brin & Page, 1998), published on the inner workings of the Google search engine. Additionally, the Google search engine has been available for public use since the publication of this paper. Today, the Google search engine actually powers several other mainstream search engines, including Yahoo!.

Google was initially constructed at Stanford University as a prototype large-scale search engine. According to the designers, the main goal of Google was to improve the quality of Web searches. The designers determined that it was necessary to provide tools with very high precision (the fraction of returned documents that are relevant), even at the expense of recall (the fraction of all relevant documents that are returned). Here, "relevant" was defined as only the very best documents, since there could be tens of thousands of slightly relevant documents. One thing that sets Google apart from other search engines is that, in addition to content, it makes use of the structure present in hypertext documents. In particular, two important features help the system to produce its high-precision results. First, it makes use of the link structure of the Web to calculate a PageRank, or quality ranking, for each Web page. Second, Google utilizes anchor text to improve search results. These two features, as well as a couple of other features of note, are discussed in greater detail in the following paragraphs.
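
To make the precision and recall trade-off concrete, the following is a minimal Python sketch; the document identifiers and counts are purely hypothetical and are not drawn from the Google paper:

# Minimal sketch of precision vs. recall with hypothetical document IDs.
retrieved = {"d1", "d2", "d3", "d4"}          # documents the engine returned
relevant = {"d2", "d3", "d7", "d8", "d9"}     # documents actually relevant to the query

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)   # fraction of returned docs that are relevant
recall = len(true_positives) / len(relevant)       # fraction of relevant docs that were returned

print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # precision = 0.50, recall = 0.40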

In traditional documents, authors provide references or citations when referring to other individuals' work. Hypertext documents, however, offer the unique ability to link directly to the cited source. PageRank uses this citation, or link, graph to determine an objective measure of a Web document's importance. The idea is that when someone places a citation or link to another Web document within a page, he or she is indicating that the linked document has some level of importance. Hence, pages can be ranked and prioritized based on the number of pages that cite (link to) a given document. In addition, PageRank extends this idea by not counting links from all pages equally and by normalizing by the number of links on a page. Formally, the authors defined PageRank as follows:

Assume page A has pages T1, ..., Tn that point to it (i.e., cite it), let d be a damping factor set between 0 and 1, and let C(A) be the count of the links going out of page A. The PageRank of page A is then:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
From the formula above, it can be seen that a page will have a higher PageRank if several pages link to it. But perhaps even more significant is the fact that the PageRank of the document providing the link is factored into the equation as well. As a result, a page with only one link to it could have a higher PageRank than a page with several citations, because that single link comes from a very important page. Intuitively this makes sense: if a document has a direct link from, say, the Yahoo! homepage, this is an indication that the page likely carries some importance.
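
As a rough illustration, the following Python sketch iterates the PageRank recurrence from Brin and Page (1998) over a small, made-up link graph; the graph itself and the damping value d = 0.85 are assumptions chosen only for the example:

# Minimal sketch: iterate the PageRank recurrence
#   PR(A) = (1 - d) + d * sum(PR(T)/C(T) for each page T linking to A)
# over a small, hypothetical link graph.

links = {                      # hypothetical graph: page -> pages it links to
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
}

d = 0.85                                        # damping factor between 0 and 1
pr = {page: 1.0 for page in links}              # initial PageRank for every page

for _ in range(50):                             # iterate until values settle
    new_pr = {}
    for page in links:
        incoming = [p for p, outs in links.items() if page in outs]
        new_pr[page] = (1 - d) + d * sum(pr[p] / len(links[p]) for p in incoming)
    pr = new_pr

print(pr)  # pages with more (and more important) in-links end up with higher scores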

The second factor in the ranking of pages is the use of the text within anchor tags. Google associates the text of a link not only with the page containing the link, but also with the page to which the link points. The advantage of doing this is that it captures more detailed information about a given page, as the text within the anchor tags may provide an even more accurate description of a Web page than the page itself. Finally, in addition to PageRank and the use of anchor text, Google maintains location information for all hits and makes extensive use of proximity in search. Google also keeps track of some visual presentation details, such as the font size of words, with words in a larger or bolder font weighted more heavily. Google's mainstream acceptance is a clear indication of the value of utilizing structure within a document search.

Web Query Systems

Most search engines do scan the Web dynamically, but this scanning is typically done in the background; the engines do not allow the user to directly query the Web itself. Querying the Web requires a Web-query system that locates documents by dynamically retrieving and scanning them during the query process. Although no mainstream search engine can directly query the World Wide Web, a significant amount of research has been done in this area. In this section, three such systems are compared.

The first system utilizes a query language called WebSQL (Mendelzon, Mihaila, & Milo, 1996), designed specifically to query the Web. The system navigates the Web starting from a known URL and can traverse multiple child links. However, other than the text contained within anchor tags, no other tag (structure) information is utilized in its queries. WebLog (Lakshmanan, Sadri, & Subramanian, 1996) is another language designed to directly query the Web. WebLog allows the user to incorporate some forms of presentation knowledge into the query through structures called rel-infons. However, utilizing rel-infons requires the user to have extensive knowledge of declarative logic concepts in order to write even the simplest of queries. In addition, the system does not use any type of catalog information, so every query requires a dynamic search of the appropriate Web documents. A third system that queries the Web is W3QS (Konopnicki & Shmueli, 1995). This system maintains the results of user queries in a database. In addition to the document content, specific node and link information relating to the document title and anchor tags is cataloged, and some presentation information, such as HTML form elements, is also maintained. Hence, this system does account for some types of document structure in its queries.
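
None of these research languages is reproduced here verbatim; instead, the following Python sketch illustrates the general idea of dynamically querying the Web in the spirit of WebSQL-style navigation: start from a known URL, follow child links to a fixed depth, and keep pages whose anchor text matches a keyword. The starting URL, keyword, and depth are hypothetical.

# Illustrative sketch (not the actual WebSQL language) of a dynamic Web query:
# fetch pages starting from a known URL, follow child links, and report the
# links whose anchor text contains a given keyword.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects (href, anchor text) pairs from a page."""
    def __init__(self):
        super().__init__()
        self.links, self._href = [], None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href and data.strip():
            self.links.append((self._href, data.strip()))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

def query_web(start_url, keyword, depth=1):
    """Return URLs reachable from start_url whose anchor text contains keyword."""
    results, frontier = [], [(start_url, 0)]
    while frontier:
        url, level = frontier.pop(0)
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except (OSError, ValueError):       # skip unreachable or malformed URLs
            continue
        parser = LinkParser()
        parser.feed(page)
        for href, text in parser.links:
            child = urljoin(url, href)
            if keyword.lower() in text.lower():
                results.append(child)
            if level + 1 < depth:
                frontier.append((child, level + 1))
    return results

# Hypothetical usage:
# print(query_web("http://www.example.com/", "database", depth=2))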

Machine-Learning Techniques

There has been a significant amount of research aimed at improving Web searches using machine-learning techniques. Since the work presented in this chapter incorporates neural networks to assist in the query-by-structure approach, it is worthwhile to study related approaches. The first system is WebWatcher (Armstrong, Freitag, Joachims, & Mitchell, 1995), which interactively assists users in locating desired information on a specific site. The system tracks a user's actions and utilizes machine-learning methods to acquire knowledge about the user's goals, the Web pages the user visits, and the success or failure of the search. Rankings are based on keywords. In a likely attempt to improve performance, the designers of the WebWatcher agent constructed another system called LASER (Boyan, Freitag, & Joachims, 1996). LASER, which stands for Learning Architecture for Search Engine Retrieval, utilized some of the methods applied to the WebWatcher system but eliminated the need for the user to provide feedback to the system. In addition, LASER incorporated document structure into the search process. Finally, a third system, called Syskill & Webert (Pazzani, Muramatsu, & Billsus, 1996), is studied. Syskill & Webert utilizes six different machine-learning algorithms, including perceptron and backpropagation neural networks. This system also requires user interaction by requesting that the user rate pages on a three-point scale. After analyzing information on each page, individual user profiles are then created. Based on the user profile, the system suggests other links that might be of interest to the user. These systems are discussed in further detail in the following paragraphs.

WebWatcher is a Web-based learning agent designed by graduate students and faculty at Carnegie Mellon University. The agent is described by its designers as an information-seeking assistant for the World Wide Web that interactively helps users locate desired information by employing learned knowledge about which hyperlinks are likely to lead to the target information. WebWatcher assists users by interactively advising them as they traverse Web links in search of information, and by searching autonomously on their behalf. In addition to suggesting to users which links they should follow next, it also learns about the users by monitoring their response to the suggestions. This monitoring involves not only analyzing which links the user followed but also any feedback the user provides as to whether the document retrieved by following the link was worthwhile or not. In a sense, WebWatcher is "backseat driving" as a user traverses the Web. Each time the user loads a new page, WebWatcher utilizes learned knowledge to suggest which links on that page might lead to the desired results. The user remains in full control as he or she has the option of either taking the advice or choosing his or her own path to follow. The search continues in this fashion, until the user terminates the search successfully by indicating "I found it" or unsuccessfully by indicating "I give up."

Clearly, the success of a WebWatcher-assisted search depends on the quality of the knowledge used in guiding the search. One form of knowledge considered by the designers was defined by a function called UserChoice?, which returns a probability value between 0 and 1 based on the document the user is currently viewing (Page), the information being sought by the user (Goal), and the link itself (Link). The UserChoice? function returns the probability that an arbitrary user will select Link given the current Page and Goal. In order to represent the Page, Goal, and Link information in a usable format, it was necessary, as it is with most machine-learning methods, to create a feature vector of the information. The challenge the designers faced here was converting the arbitrary-length text of a Web page into a fixed-length feature vector. Apparently, after significant experimentation, information about the current Page, the user's search Goal, and a particular outgoing Link was represented by a vector of approximately 530 Boolean features, with each feature indicating the occurrence of a particular word within the text that originally defined these three attributes. According to the authors, the vector of 530 features is composed of the following four concatenated sub-vectors: underlined words in the hyperlink, words in the sentence containing the hyperlink, words in the heading associated with the hyperlink, and words used to define the user goal.
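
The following Python sketch illustrates how such a Boolean feature vector might be assembled from the four concatenated sub-vectors described above; the tiny vocabularies and sample texts are hypothetical stand-ins for the roughly 530 words actually used by WebWatcher:

# Minimal sketch of a WebWatcher-style Boolean feature vector: four concatenated
# sub-vectors, each flagging which vocabulary words occur in one text field.

def bool_subvector(vocabulary, text):
    """1/0 flag for each vocabulary word that appears in the given text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

link_vocab     = ["faculty", "projects", "papers"]      # underlined words in the hyperlink
sentence_vocab = ["research", "machine", "learning"]    # words in the sentence containing it
heading_vocab  = ["department", "people", "courses"]    # words in the associated heading
goal_vocab     = ["neural", "networks", "tutorial"]     # words defining the user's goal

def make_feature_vector(link_text, sentence, heading, goal):
    return (bool_subvector(link_vocab, link_text)
            + bool_subvector(sentence_vocab, sentence)
            + bool_subvector(heading_vocab, heading)
            + bool_subvector(goal_vocab, goal))

vec = make_feature_vector("research projects", "our machine learning research group",
                          "Department People", "neural networks")
print(vec)  # [0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0]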

To explore possible learning approaches, to determine the level of competence achievable by a learning agent, and to learn a general approximation of UserChoice? from a sample of training data logged from users, the designers applied the following four methods to training data collected by WebWatcher during 30 information-search sessions: Winnow (Littlestone, 1988), which learns a Boolean concept represented as a single linear threshold function of the instance features; Wordstat (Armstrong et al., 1995), which attempts to predict whether a link is followed based directly on the statistics of individual words; TFIDF with a cosine similarity measure (Salton & McGill, 1983), a method developed in information retrieval; and a Random method that simply selects a link on a page with uniform probability. The designers concluded that the Winnow method performed best and was a significant improvement over the Random method.
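
As an illustration of the best-performing method, the following is a minimal Python sketch of the Winnow update rule on Boolean feature vectors; the threshold and promotion/demotion factor are common textbook choices and are not necessarily the settings used in the WebWatcher experiments:

# Minimal sketch of the Winnow update rule on Boolean feature vectors.

def train_winnow(examples, n_features, alpha=2.0):
    """examples: list of (Boolean feature list, label 0/1). Returns learned weights."""
    weights = [1.0] * n_features
    threshold = n_features / 2
    for features, label in examples:
        score = sum(w for w, f in zip(weights, features) if f)
        predicted = 1 if score >= threshold else 0
        if predicted != label:                       # update only on mistakes
            factor = alpha if label == 1 else 1.0 / alpha
            weights = [w * factor if f else w for w, f in zip(weights, features)]
    return weights

# Hypothetical training data: did the user follow the link (1) or not (0)?
data = [([1, 0, 1, 0], 1), ([0, 1, 0, 1], 0), ([1, 1, 0, 0], 1), ([0, 0, 1, 1], 0)]
w = train_winnow(data, n_features=4)
print(w)   # features that co-occur with followed links end up with larger weights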

The WebWatcher agent showed promising results in learning search-control knowledge to predict the links a user would select. However, it is unclear how much of the time saved is lost by the user because of the requirement to provide constant feedback. It is likely that this need for continuous user input led some of the designers of the WebWatcher system to modify their research and develop LASER. LASER utilizes some of the same machine-learning techniques as the WebWatcher system and does not add anything substantially new from this perspective. However, of particular importance to the research presented in this chapter, the LASER system incorporates the use of document structure into the search process. According to the system designers, HTML documents exhibit two forms of structure: internal and external. The general tags of an HTML document represent the internal structure, and the hyperlinks that point to the document (as well as the hyperlinks contained within the document) represent the external structure. The LASER system attempts to exploit both types of structure to index the Web.

From the perspective of a user, LASER functions much like the search engines discussed earlier. Like one of the approaches used in the WebWatcher system, LASER's machine-learning retrieval function is based on the TFIDF vector space retrieval model. However, unlike WebWatcher, LASER utilizes the structure of the HTML documents. Specifically, the system incorporates parameters for weighting words contained within the following HTML tags: title, heading (h1, h2, h3), bold (b), italics (i), blink, and anchor (a). The weight is based simply on the frequency with which the search words appear within the tags specified above.
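
The following Python sketch illustrates tag-weighted term frequency in the spirit of LASER's internal-structure weighting; the tag list comes from the description above, while the numeric weights and sample page are illustrative assumptions (LASER tuned such parameters rather than fixing them):

# Minimal sketch of tag-weighted term frequency. The tag weights below are
# hypothetical; LASER learned/tuned its weighting parameters.
import re

TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.5, "h3": 2.0,
               "b": 1.5, "i": 1.2, "blink": 1.2, "a": 1.5}

def weighted_term_frequency(html, term):
    """Count occurrences of term, counting hits inside weighted tags more heavily."""
    score = 0.0
    term = term.lower()
    for tag, weight in TAG_WEIGHTS.items():
        for inner in re.findall(rf"<{tag}[^>]*>(.*?)</{tag}>", html, re.I | re.S):
            score += weight * inner.lower().split().count(term)
    # plain occurrences count with weight 1 (tagged hits are counted again above)
    score += re.sub(r"<[^>]+>", " ", html).lower().split().count(term)
    return score

page = "<html><title>Neural networks</title><body><b>networks</b> of neurons and networks</body></html>"
print(weighted_term_frequency(page, "networks"))   # tagged occurrences boost the score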

From a machine-learning perspective, LASER's use of the external structure is the most interesting. To explain the use of external structure, the authors gave the following analogy to reinforcement learning: "Imagine that an agent searching for information on the Web can move from page to page only by following hyperlinks. Whenever the agent finds information relevant to its search goal, it gets a certain amount of reward." The LASER designers concluded that reinforcement learning could be used to have the agent learn how to maximize the reward it received, i.e., learn how to navigate to relevant information. More precisely, the goal was to have LASER rank pages higher if they serve as a good starting point for a search by the agent. Implementation of this aspect of the retrieval function was done using a complex formula that required 18 numerical parameters to allow for a wide variety of search engine behavior, from plain TFIDF to very complex ranking schemes.

LASER's designers chose not to request feedback from users on whether the links returned by the system were good or bad (unlike WebWatcher). Instead, the system simply records the links the user selects. Unfortunately for research purposes, without this feedback the designers had no quantitative means of measuring the system's performance. Although they noted that they could collect information on the links users select, they indicated that the results would be skewed by presentation bias: the user's choice of which link to follow is strongly biased toward documents appearing early in the search results, regardless of the quality of the search performed.

Syskill & Webert is another software agent that utilizes machine-learning techniques to rate pages on the World Wide Web. This system differs from the others previously discussed in that its primary function is to develop a user profile so it can suggest other Web sites that might be of interest to that specific user. The agent learns to rate Web pages by requiring the user to rank each page on a three-point scale and then analyzing information on each of those pages. The system uses the user profile in one of two ways. First, it can add annotations to each Web page a user visits to suggest that a particular link on that page might be of interest to the user. Second, it can annotate the results of a commercial search engine query to essentially re-rank the search results. In both cases, page rankings are represented as a real value between 0 and 1.

Although the authors describe it as a user profile, a more accurate name might be user topic profile, because the system can actually create multiple profiles for each user, with each profile focused on a specific topic. The authors contend that this makes sense, since most users have multiple interests. In this way, users can create separate index pages for each topic and then switch topics by changing to a different index page. Additionally, at each index page, users have the option of instructing Syskill & Webert to learn a user profile for the current topic of that index page, to make suggestions on links on that index page, or to consult a commercial search engine to search the Web. The agent learns a profile by first capturing the HTML source code of a Web page visited by the user. It then asks the user to rate the page with either two thumbs up (hot), two thumbs down (cold), or one thumb up and one thumb down (lukewarm).
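
The following Python sketch is not the actual Syskill & Webert learner, but it illustrates how a per-topic profile could be built from hot/lukewarm/cold ratings and used to score a new page as a value between 0 and 1; the ratings, pages, and scoring scheme are hypothetical:

# Illustrative sketch: learn a per-topic profile from rated pages, then score a
# new page between 0 and 1. Not the algorithms actually evaluated by the authors.
from collections import Counter

RATING_VALUE = {"hot": 1.0, "lukewarm": 0.5, "cold": 0.0}

def learn_profile(rated_pages):
    """rated_pages: list of (page text, rating). Returns average rating per word."""
    totals, counts = Counter(), Counter()
    for text, rating in rated_pages:
        for word in set(text.lower().split()):
            totals[word] += RATING_VALUE[rating]
            counts[word] += 1
    return {w: totals[w] / counts[w] for w in counts}

def score_page(profile, text, default=0.5):
    """Average the profile values of the page's words; unseen words count as neutral."""
    words = text.lower().split()
    return sum(profile.get(w, default) for w in words) / len(words) if words else default

profile = learn_profile([
    ("deep learning tutorial with code", "hot"),
    ("stock market news today", "cold"),
    ("learning about markets", "lukewarm"),
])
print(round(score_page(profile, "new deep learning code samples"), 2))  # closer to 1 = more interesting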

In the next section, three query-by-structure systems are presented. As you will see, some of the concepts presented in this section, such as document feature vectors and querying the Web, are utilized by these systems. In particular, two of these systems utilize machine-learning techniques, namely neural networks, to analyze document structure. However, with the exception of the LASER system, the machine-learning systems studied in this section used very little, if any, document structure information.
