This chapter presents three systems that incorporate document structure information into a search of the Web. These systems extend existing Web searches by allowing the user not only to request documents containing specific search words, but also to specify that the documents be of a certain type. In addition to being able to search a local database (DB), all three systems are capable of dynamically querying the Web. Each system applies a query-by-structure approach that captures and utilizes structure information as well as content during a query of the Web. Two of the systems also employ neural networks (NNs) to organize the information based on the relevance of both the content and the structure. These two systems utilize a supervised Hamming NN and an unsupervised competitive NN, respectively. Initial testing of these systems has shown promising results when compared to straight keyword searches.
The vast amount of information available to the users of the World Wide Web is overwhelming. Even more overwhelming for users, however, is trying to find the particular information they are looking for. Search engines have been created to assist in this process, but a typical keyword search using a search engine can still return hundreds of thousands of matching Web documents. Savvy search engine users have learned to combine keywords and phrases with logical Boolean operators to pare down the number of matched Web pages. Unfortunately, these searches can still yield a significant number of pages that must then be viewed individually to determine whether they contain the content the user is interested in.
Search engines support keyword searches by utilizing spiders or webots to scan the textual content of Web documents, and then indexing and storing the content in a database for future user queries. The stored information typically consists of the document URL, various keywords or phrases, and possibly a brief description of the Web page. However, these search engines maintain very little, if any, information about the context in which the text of the Web page is presented. In other words, a search engine might be able to identify that a particular keyword was used in the title of the Web page, or that a phrase appeared within anchor tags, but it would not distinguish between that same word or phrase being used in a paragraph, in a heading, or as alternate text for an image. Yet the way in which text is presented in a Web page plays a significant role in the importance of that text. For example, a Web page designer will usually emphasize particularly important words, phrases, or names. By enabling a search engine to capture how text is presented, and subsequently allowing the users of the search engine to query based on some presentation criteria, the performance of a search can be greatly enhanced. Since search engines that utilize spiders already scan the entire text of a Web page, it is a simple modification to incorporate a mechanism to identify the context in which the text is presented.
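As a rough illustration of what such a mechanism might look like, the following Python sketch records, for each word on a page, the HTML tags in which that word appears. It is not the implementation used by any of the systems presented in this chapter. The class name `ContextIndexer`, the helpers `index_page` and `matches`, the `img-alt` label, and the particular set of tags treated as presentation contexts are all assumptions made for the example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Tags treated as distinct presentation contexts (an illustrative choice).
CONTEXT_TAGS = {"title", "h1", "h2", "h3", "b", "strong", "em", "a", "p"}


class ContextIndexer(HTMLParser):
    """Hypothetical indexer: maps each word to the tags it appears in."""

    def __init__(self):
        super().__init__()
        self.open_tags = []            # stack of currently open context tags
        self.index = defaultdict(set)  # word -> set of context labels

    def handle_starttag(self, tag, attrs):
        if tag in CONTEXT_TAGS:
            self.open_tags.append(tag)
        elif tag == "img":
            # Alternate text of an image is recorded as its own context.
            alt = dict(attrs).get("alt") or ""
            for word in re.findall(r"[a-z0-9]+", alt.lower()):
                self.index[word].add("img-alt")

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recent matching tag and anything left open inside it.
            last = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[last:]

    def handle_data(self, data):
        # A word inherits every enclosing context tag; plain text counts as "body".
        contexts = set(self.open_tags) or {"body"}
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.index[word].update(contexts)


def index_page(html_text):
    """Parse one page and return its word -> contexts index."""
    parser = ContextIndexer()
    parser.feed(html_text)
    parser.close()
    return parser.index


def matches(index, word, context):
    """True if the word occurs on the page inside the given context."""
    return context in index.get(word.lower(), set())
```

With an index of this kind, a query could ask for pages in which a keyword occurs in a title or heading rather than anywhere in the text, for example `matches(index_page(page), "neural", "title")`. A production spider would also need to handle malformed markup, stop words, and storage of the index, all of which are omitted here.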
Motivation for this type of modification can best be described by illustration. Consider the following examples:
These two examples clearly illustrate how a user can significantly improve his or her search for particular Web document content by using presentation knowledge when performing a search.
This chapter will present three systems that have been designed to incorporate document structure into a search of the Web. Each of these systems applies a query-by-structure approach that captures and utilizes structure (presentation) information as well as content during a distributed query of the Web. In addition, two of the systems employ neural networks to organize the information based on the relevance of not only the content but also the structure of the document. First, however, it is worthwhile to provide some background on the various systems and approaches related to the query-by-structure systems presented in this chapter.