One of the key steps in KDD is to create a suitable target data set for the data mining tasks. In web data mining, data can be collected at several sites, such as proxy servers, web servers, or an organization's operational databases, which contain business data or consolidated web log data. Web data mining has the same objective as data mining in that both attempt to find valuable and meaningful knowledge in databases or data warehouses. However, web data mining differs from data mining in that the former is a more unstructured task than the latter. The difference stems from the characteristics of web documents and web log files, which represent unstructured relationships with little machine-readable semantics, whereas data mining deals with more structured databases.
Pyle (1999) and Srivastava et al. (2000) presented a detailed taxonomy for web usage mining methods and systems. Web content mining is the process of extracting knowledge from the content of a number of web documents. It is related to the use of web search engines, whose main role is to discover web contents according to the user's requirements and constraints. In recent years, web content mining based on traditional search engines has shifted toward intelligent agent-based mining and database-driven mining, in which intelligent software agents for specific tasks support the search for more relevant web contents by taking domain characteristics and user profiles into consideration. These agents also help users interpret the discovered web contents.
Many agents for web content mining have appeared in the literature, such as Harvest (Brown et al., 1994), FAQ-Finder (Hammond et al., 1995), Information Manifold (Kirk et al., 1995), OCCAM (Kwok & Weld, 1996), and ParaSite (Spertus, 1997). The techniques used to develop these agents include various information retrieval techniques (see Frakes & Baeza-Yates, 1992; Liang & Huang, 2000), filtering and categorizing techniques (see Broder et al., 1997; Chang & Hsu, 1997; Maarek & Shaul, 1996; Bonchi et al., 2001), and techniques for learning individual preferences (see Balabanovic et al., 1995; Park et al., 2001). Database approaches to web content mining have focused on techniques for organizing structured collections of resources and for using standard database querying mechanisms.
As for query languages, Konopnicki and Shmueli (1995) combined structure queries, based on the organization of hypertext documents, with content queries, based on information retrieval techniques. Lakshmanan et al. (1996) suggested a logic-based query language for restructuring that extracts information from web information sources. On the basis of semantic knowledge, efficient ways of mining intra-transaction association rules have been proposed by Ananthanarayana et al. (2001) and Jain et al. (1999). Fong et al. (2000) developed a frame metadata model to build a database and extract association rules from the online transactions stored in it. Bonchi et al. (2001) built a web log data warehouse to perform mining for intelligent web caching.
Web structure mining is the process of inferring knowledge from the organization and links on the Web, while web usage mining is the automatic discovery of user access patterns from web servers. Our approach belongs to web usage mining because we aim to propose a way of amplifying the inference value of the web log files that potential users leave behind while surfing the target web site. Web structure includes external structure, internal structure, and the URL itself. External structure mining is therefore concerned with investigating the hyperlinked relationships between the web pages under consideration, while internal structure mining analyzes the relationships of information within a web page. URL mining extracts URLs that are relevant to the decision maker's purpose. Spertus (1997) and Chakrabarti et al. (1999) proposed heuristic rules derived from investigating the internal structure and the URLs of web pages. Craven et al. (1998) used a first-order learning technique to categorize hyperlinks in order to estimate the relationships between web pages. Brin and Page (1998) considered citation counting of referring pages to find pages that are relevant to particular topics. To mine the community structure on the Web, Kumar et al. (1999) proposed a new hyperlink analysis method. Zaiane (2001) proposed building virtual web views by warehousing the web structure, which would allow efficient information retrieval and knowledge discovery.
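The idea of counting citations between pages can be illustrated with a minimal sketch. It only counts raw in-links on a small, hypothetical link graph; it is not the ranking method of Brin and Page (1998) or of any other cited work, and the page names and the function `count_inlinks` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "b.html"],
}


def count_inlinks(link_graph):
    """Count, for every page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):  # ignore repeated links from one page
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


inlinks = count_inlinks(links)

# Rank pages by in-link count, breaking ties alphabetically.
ranked = sorted(inlinks.items(), key=lambda item: (-item[1], item[0]))
```

Under this simple count, a page that many other pages refer to rises to the top of `ranked`, which is the intuition behind using hyperlinks as citations.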
Web usage mining applies data mining concepts to web log file data and automatically discovers user access patterns for a specific web page. Web usage mining can also draw on referrer logs, which contain information about the referring page for each page reference, and on user registration or survey data gathered via CGI scripts (Jicheng et al., 1999). The results of web usage mining give decision makers crucial information about the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns. Among other things, web usage mining helps organizations analyze user access patterns to targeted ads or web pages, categorize user preferences, and restructure a web site to manage workgroup communication and organizational infrastructure more effectively.
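As a rough illustration of how web log and referrer data might be read, the following sketch parses a few log lines and counts referrer-to-page transitions. It assumes the logs are in the widely used Combined Log Format, which the text does not specify; the sample lines, host names, and the helper `parse_line` are hypothetical.

```python
import re
from collections import Counter

# Assumed layout (Combined Log Format):
# host ident user [time] "method path protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical sample log lines.
sample_log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" '
    '200 2326 "-" "Mozilla/4.08"',
    '10.0.0.1 - - [10/Oct/2000:13:56:02 -0700] "GET /products.html HTTP/1.0" '
    '200 1045 "http://www.example.com/index.html" "Mozilla/4.08"',
    '10.0.0.2 - - [10/Oct/2000:13:57:10 -0700] "GET /products.html HTTP/1.0" '
    '200 1045 "http://www.example.com/index.html" "Mozilla/4.08"',
]


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


records = [r for r in map(parse_line, sample_log) if r is not None]

# Count how often each referring page led to each requested page.
# A referrer of "-" means the client sent no referrer, so it is skipped.
transitions = Counter(
    (r["referrer"], r["path"]) for r in records if r["referrer"] != "-"
)
```

Counts such as `transitions` are one possible starting point for the access pattern analysis described above; real logs would also need cleaning, for example removing requests from crawlers.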
Web usage mining provides the core basis for our system by supporting customized web usage tracking analysis and psychographics analysis. Customized web usage tracking analysis focuses on optimizing the structure of web sites based on the co-occurrence patterns of web pages (Perkowitz & Etzioni, 1999), predicting future HTTP requests to adjust network and proxy caching (Schechter et al., 1998), deriving marketing intelligence (see Buchner & Mulvenna, 1999; Cooley et al., 1997, 1999; Spiliopoulou & Faulstich, 1999; Hui & Jha, 2000; Song et al., 2001), and predicting future user behavior on a specific web site by clustering user sessions (see Shahabi et al., 1997; Yan et al., 1996; Changchien & Lu, 2001; Lee et al., 2001). Psychographics analysis, which gives insights into the behavioral patterns of specific web site visitors, requires data such as the routes taken by visitors through a web site, the time spent on each page, route differences based on differing entry points to the web site, the aggregated route behavior, and general click stream behavior (Cooley et al., 1997, 1999). Based on these data, psychographics analysis tries to answer marketing intelligence-related questions such as which menu shoppers use to buy a product, how long shoppers stay in the product description menu before deciding to buy, and how shoppers feel about specific ads on the Web.
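Two of the data items listed above, the route taken by each visitor and the time spent on each page, can be derived from a click stream as in the following minimal sketch. The click records, visitor identifiers, and the function `routes_and_dwell_times` are hypothetical, and the sketch assumes that each visitor has a single session and that the time spent on a page is the gap until that visitor's next request, so the last page of a route has no measurable duration.

```python
from collections import defaultdict

# Hypothetical click stream: (visitor_id, timestamp_in_seconds, page).
clicks = [
    ("v1", 0, "/home"),
    ("v1", 30, "/product"),
    ("v1", 150, "/checkout"),
    ("v2", 10, "/ad-landing"),
    ("v2", 25, "/product"),
]


def routes_and_dwell_times(click_stream):
    """Return each visitor's route and the observed dwell times per page."""
    by_visitor = defaultdict(list)
    # Sort by visitor and time so each visitor's requests are in order.
    for visitor, timestamp, page in sorted(click_stream, key=lambda c: (c[0], c[1])):
        by_visitor[visitor].append((timestamp, page))

    routes = {}
    dwell = defaultdict(list)
    for visitor, visits in by_visitor.items():
        routes[visitor] = [page for _, page in visits]
        # Dwell time = gap to the visitor's next request; the last page
        # in a route has no following request and is therefore skipped.
        for (start, page), (end, _next_page) in zip(visits, visits[1:]):
            dwell[page].append(end - start)
    return routes, dict(dwell)


routes, dwell = routes_and_dwell_times(clicks)

# Average time spent on each page across all visitors.
average_dwell = {page: sum(times) / len(times) for page, times in dwell.items()}
```

Comparing `routes` for visitors with different entry points (here `/home` versus `/ad-landing`) is one simple way to look at the route differences mentioned above.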