Information Overload
The proliferation of the Internet has created a social revolution. Like the printing press, radio, and television before it, the Internet is forcing community organizations to rethink the manner in which their services are provided. Although the Internet provides many opportunities for community outreach, there are also some shortcomings. One of the most compelling shortcomings of the Internet is the increasing number of Web sites that are posted daily. This plethora of Web pages has led to information overload. As a result, there are several problems that users and organizations may encounter while interacting with the Web. These problems include finding relevant information; learning about individual users; creating new knowledge from existing Web data; and personalizing the information (Lempel & Moran, 2000).
Unlike with previous forms of media, users must also be aware of the "Salinger Syndrome," the tendency of online users, particularly new users, to assume that all information published on the Internet is accurate. Providing some kind of authority for the content is the first step toward addressing the accuracy of the information.
The Web is decentralized, and there is no formal standard for any logical organization. To further complicate matters, millions (soon to be billions) of people are creating and annotating Web documents (Kleinberg, Kumar, Raghavan, Rajagopalan, & Tomkins, 1999). This anarchic growth process leads to the chaotic structure of the Web.
In order for community organizations to provide information on the Web, methods for inferring the relevancy and validity of the information to Web site visitors are required. To address these issues, the link topology of the Web is analyzed in the context of a cyber-community. The central theme is to explore the connection between the link topology and the conferral of authority.
Although one cannot control individuals' behavior, one can design Web communities that offer links to wholesome and credible material. This can only be accomplished by exploring the architecture of the Web and designing models that can be used by community organizations.
Chaotic Structure of the Web
The World Wide Web may be viewed as a directed graph (architecture) of hyperlinks between pages. In this graph, there are two types of links: transverse and intrinsic (Kleinberg et al., 1999). Transverse links are external links between pages in different domains, whereas intrinsic links are internal links between pages within the same domain. Intrinsic links are used specifically to navigate between pages within a domain and are not the focus of this study. The goal of this chapter is to explore the graph in terms of linkage between different domains. Therefore, for the purposes of examining and analyzing the Web graph, only transverse links will be explored.
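The distinction between transverse and intrinsic links can be illustrated with a minimal sketch. It assumes that a page's "domain" can be approximated by the host name in its URL, which is a simplification; the function names and the example URLs are hypothetical and serve only as illustration.

```python
from urllib.parse import urlparse


def host(url):
    """Return the lowercased host name of a URL (assumed here to stand for its domain)."""
    return (urlparse(url).hostname or "").lower()


def classify_link(source_url, target_url):
    """Label a hyperlink 'intrinsic' (same host) or 'transverse' (different hosts)."""
    return "intrinsic" if host(source_url) == host(target_url) else "transverse"


def transverse_edges(links):
    """Keep only the links that cross from one host to another."""
    return [(s, t) for s, t in links if classify_link(s, t) == "transverse"]


# Hypothetical example links, given as (source page, target page).
links = [
    ("http://a.example/index.html", "http://a.example/about.html"),
    ("http://a.example/index.html", "http://b.example/"),
    ("http://b.example/", "http://c.example/page"),
]

for source, target in transverse_edges(links):
    print(source, "->", target)
```

Filtering the edge list in this way leaves a graph that contains only links between different hosts, which is the graph examined in the remainder of the chapter.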
Some Web sites are structureless because they have few links between pages. To address this issue, the Web is viewed as a semi-structured database. In this chapter, the Web graph and a database are viewed in the same context. As such, we can take the approach of a database administrator in order to provide some form of structure to the Web architecture.
Typically, a Web graph is created when a few scattered pages begin to appear on the Web. As Web creators discover these pages and begin to co-cite (link) them, a sub-graph is created about a group of related topics. This leads to the creation of a cyber-community that has similar topics of interest. Herein lies the problem: topic distillation, that is, searching through the sea of documents for relevant information. This phenomenon has been coined "information overload" or "information abundance."
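One way to make the notion of co-citation concrete is to count how many source pages link to both members of a pair of pages. The sketch below assumes this common link-analysis reading of co-citation, which may be narrower than the sense intended above; the function name and the page labels are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cocitation_counts(edges):
    """Count, for each pair of pages, how many source pages link to both of them."""
    targets_by_source = defaultdict(set)
    for source, target in edges:
        targets_by_source[source].add(target)

    counts = Counter()
    for targets in targets_by_source.values():
        # Every pair of targets on the same source page is co-cited once.
        for pair in combinations(sorted(targets), 2):
            counts[pair] += 1
    return counts


# Hypothetical links: p1 and p2 both cite x and y; p3 cites x and z.
edges = [
    ("p1", "x"), ("p1", "y"),
    ("p2", "x"), ("p2", "y"),
    ("p3", "x"), ("p3", "z"),
]

counts = cocitation_counts(edges)
print(counts.most_common())
```

Pairs that are co-cited often, such as x and y in this example, are candidates for membership in the same sub-graph of related topics.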
By its nature, the Web presents an abundance problem. This problem is created through the number of pages that exist on a current topic and the number of hits that are returned on a given search. Typically, search engines do not make a distinction between relevance and popularity. To address this shortcoming, one must find a way to distinguish between relevancy and popularity in order to improve topic distillation. By analyzing the link topologies, one can derive better algorithms and models that can be used to provide structure to the architecture of the Web.
The most frequently mentioned model of the Web graph is the hub-and-authority model. This model will be covered extensively later in the chapter. In this model, authority pages are those pages that are focused on relevant information about a given topic. Hub pages, on the other hand, contain links to authority pages. Algorithms for deriving the index scores use the in- and out-degrees of the hyperlinks as variables. Pages with higher index scores appear first in the order of search results. To date, a number of models and algorithms have been developed to discover cyber-communities (hubs and authorities): hyperlink-induced topic search (HITS), which uses connectivity analysis to determine the importance of documents by their link structure (Kleinberg, 1999); HITS-SW (similarity weight function), which addresses the topic drift problem encountered with HITS (Herbach, 2001); and an algorithm that uses content analysis to improve Web searches (Bharat & Henzinger, 1998). Similarly, one study found that the stochastic approach for link-structure analysis (SALSA) revealed a tight-knit community (TKC) effect (Lempel & Moran, 2000), and another study found that using eigenvector decomposition techniques with Markov chains led to the detection of more personalized hubs and authorities (Spiliopoulou, 2000). This author suggests that Web mining can be used efficiently and effectively within community-based organizations.
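The mutual reinforcement between hubs and authorities can be sketched as a short iterative computation. The following is a minimal illustration of the basic HITS update only: it assumes the sub-graph of candidate pages has already been collected, omits the query-dependent construction of that sub-graph, and uses hypothetical page names and a fixed number of iterations chosen for illustration.

```python
import math


def normalize(scores):
    """Scale scores in place so that their squares sum to 1."""
    norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
    for page in scores:
        scores[page] /= norm


def hits(edges, iterations=50):
    """Compute hub and authority scores for a directed edge list (source, target)."""
    pages = sorted({page for edge in edges for page in edge})
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages that link in.
        auth = {page: 0.0 for page in pages}
        for source, target in edges:
            auth[target] += hub[source]
        normalize(auth)

        # Hub update: sum the authority scores of the pages that are linked to.
        hub = {page: 0.0 for page in pages}
        for source, target in edges:
            hub[source] += auth[target]
        normalize(hub)

    return hub, auth


# Hypothetical sub-graph: h1 and h2 act as hubs, a1 and a2 as authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a3")]
hub, auth = hits(edges)

print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this sketch a page scores highly as an authority when it is linked to by good hubs, and highly as a hub when it links to good authorities, so the scores reflect more than raw in-degree or out-degree.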