Chaotic Structure of the Web

The World Wide Web may be viewed as a directed graph (architecture) in which pages are nodes and hyperlinks are edges. This graph contains two types of links: transverse and intrinsic (Kleinberg et al., 1999). Transverse links connect pages in different domains, whereas intrinsic links connect pages within the same domain. Intrinsic links serve mainly for navigation within a domain and are not the focus of this study. The goal of this chapter is to explore the graph in terms of linkage between different domains, so only transverse links are considered when examining and analyzing the Web graph.
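The distinction between the two link types can be illustrated with a minimal Python sketch. It assumes that "domain" can be approximated by the host name of a URL, which is a simplification (a real crawler would likely compare registered domains instead). The function names `link_type` and `transverse_edges` and the example URLs are hypothetical and do not come from the chapter.

```python
from urllib.parse import urlparse


def link_type(source_url: str, target_url: str) -> str:
    """Classify a hyperlink as 'intrinsic' (same host) or 'transverse' (different hosts).

    Assumption: the host name stands in for the page's domain.
    """
    src_host = urlparse(source_url).hostname or ""
    dst_host = urlparse(target_url).hostname or ""
    return "intrinsic" if src_host == dst_host else "transverse"


def transverse_edges(links):
    """Keep only the edges that cross between hosts."""
    return [(src, dst) for src, dst in links if link_type(src, dst) == "transverse"]


# Hypothetical hyperlinks, given as (source page, target page) pairs.
links = [
    ("http://a.example/index.html", "http://a.example/about.html"),  # intrinsic
    ("http://a.example/index.html", "http://b.example/"),            # transverse
    ("http://b.example/", "http://c.example/page"),                  # transverse
]

# The Web graph studied here would contain only the transverse edges.
web_graph = transverse_edges(links)
```

Filtering the edge list first keeps the later link analysis from being dominated by navigation links inside a single site.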

Some Web sites lack structure because they have few links between pages. To address this issue, the Web is viewed here as a semi-structured database; that is, the Web graph and a database are treated in the same context. This makes it possible to take the approach of a database administrator and impose some form of structure on the Web architecture.

Typically, a Web graph forms when a few scattered pages begin to appear on the Web. As Web creators discover these pages and begin to co-cite (link) them, a sub-graph emerges around a group of related topics. This leads to the creation of a cyber-community whose members share similar topics of interest. Herein lies the problem of topic distillation: searching through the sea of documents for relevant information. This phenomenon has been termed "information overload" or "information abundance."

By its nature, the Web presents an abundance problem. The problem arises from the number of pages that exist on a given topic and the number of hits returned for a given search. Typically, search engines do not distinguish between relevance and popularity. To address this shortcoming, one must find a way to separate relevance from popularity in order to improve topic distillation. By analyzing link topologies, one can derive better algorithms and models for providing structure to the architecture of the Web.

The most frequently cited model of the Web graph is the hub-and-authority model, which is covered extensively later in the chapter. In this model, authority pages are those that focus on relevant information about a given topic, whereas hub pages contain links to authority pages. Algorithms for deriving the index scores use the in- and out-degrees of the hyperlinks as variables, and pages with higher index scores appear first in the search results. To date, a number of models and algorithms have been developed to discover cyber-communities (hubs and authorities): hyperlink-induced topic search (HITS), which uses connectivity analysis to determine the importance of documents from their link structure (Kleinberg, 1999); HITS-SW (similarity weight function), which addresses the topic drift problem encountered with HITS (Herbach, 2001); and an algorithm that uses content analysis to improve Web searches (Bharat & Henzinger, 1998). Similarly, one study found that the stochastic approach for link-structure analysis (SALSA) revealed a tight-knit community (TKC) effect (Lempel & Moran, 2000), and another found that using eigenvector decomposition techniques with Markov chains led to the detection of more personalized hubs and authorities (Spiliopoulou, 2000). This author suggests that Web mining can be used efficiently and effectively within community-based organizations.
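The mutual reinforcement between hubs and authorities can be sketched in a few lines of Python. This is a simplified illustration of the iterative scoring commonly associated with HITS, not the implementation used in any of the cited studies. The function names, the fixed iteration count, the Euclidean normalization, and the toy edge list are all assumptions made for the example.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    if norm == 0:
        return dict(scores)
    return {node: value / norm for node, value in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) score dictionaries for a directed edge list.

    Assumption: a fixed number of iterations replaces a convergence test.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical transverse links: h1, h2, h3 act as hubs; a1, a2 as authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)

best_authority = max(auth, key=auth.get)  # the page with the most hub support
best_hub = max(hub, key=hub.get)          # the page pointing at the strongest authorities
```

In this toy graph the page with the most in-links from hubs receives the highest authority score, and the page that links to both authorities receives the highest hub score, which mirrors the use of in- and out-degrees described above.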



Managing Data Mining Technologies in Organizations: Techniques and Applications
ISBN: 1591400570
Year: 2003
Pages: 174
