BACKGROUND


Data Mining

Data mining, also known as Knowledge Discovery in Databases (KDD) (Chen et al., 1996; Lee et al., 2002), has been recognized as a rapidly emerging research area. It can be defined as the efficient discovery of human knowledge and interesting rules from large databases. The technology is motivated by the need for new techniques to help analyze, understand, and visualize the huge amounts of stored data gathered from scientific and business applications; the business applications include attached mailing, add-on sales, and customer satisfaction. Data mining involves the semiautomatic discovery of interesting knowledge, such as patterns, associations, changes, anomalies, and significant structures, from large amounts of data stored in databases and other information repositories.

Data mining differs from traditional statistics in several ways. First, statistical inference is assumption-driven, in the sense that a hypothesis is formed and then validated against the data. By contrast, data mining is discovery-driven: patterns and hypotheses are automatically extracted from large databases. Second, the goal of data mining is to extract qualitative models that can easily be translated into business patterns, associations, or logical rules. The major data mining functions that have been developed for the commercial and research communities include summarization, classification, association, prediction, and clustering. Data mining can therefore help decision makers make better decisions and stay competitive in the marketplace.

Data mining functions can be implemented using a variety of technologies, such as database-oriented techniques, machine learning, statistical techniques, and other AI methods (Hui & Jha, 2000). In general, determining which data mining technique and function to apply depends very much on the application domain and on the nature of the data available. Recently, a number of data mining applications and prototypes have been developed for a variety of domains, including online marketing, banking, finance, manufacturing, CRM, and health care. In the Internet Business space, data mining techniques have the potential to provide companies with competitive advantages (Dhond et al., 2000).

Web Data Mining

One of the key steps in KDD is to create a suitable target data set for the data mining tasks. In web data mining, data can be collected at several sites, such as proxy servers, web servers, or an organization's operational databases, which contain business data or consolidated web log data. Web data mining has the same objective as data mining, in that both attempt to find valuable and meaningful knowledge in databases or data warehouses. However, web data mining differs from data mining in that it is a less structured task. The difference stems from the characteristics of web documents and web log files, which represent unstructured relationships with little machine-readable semantics, whereas data mining is aimed at more structured databases.

In recent years, several web search engines have been proposed with the advent of web technology. Since 1960, such search engines have been credited with many achievements in the field of information retrieval, such as index modeling, document representation, and similarity measures. Recently, some researchers have applied database concepts to the Web and presented new methods of modeling and querying web content at a finer level of granularity than the page. Web data mining, however, is concerned with discovering patterns or knowledge from web documents or web log files.

As shown in Figure 1, web data mining is classified into roughly three domains: web content mining, web structure mining, and web usage mining.

Figure 1: Taxonomy of Web Data Mining (Adapted from Pyle, 1999, and Srivastava et al., 2000)

Pyle (1999) and Srivastava et al. (2000) presented a detailed taxonomy for web usage mining methods and systems. Web content mining is the process of extracting knowledge from the content of a number of web documents. It is related to the use of web search engines, whose main role is to discover web content according to the user's requirements and constraints. In recent years, web content mining based on the traditional search engine has migrated toward intelligent agent-based mining and database-driven mining. In agent-based mining, intelligent software agents dedicated to specific tasks support the search for more relevant web content by taking domain characteristics and user profiles into consideration. They also help users interpret the discovered web content.

Many agents for web content mining have appeared in the literature, such as Harvest (Brown et al., 1994), FAQ-Finder (Hammond et al., 1995), Information Manifold (Kirk et al., 1995), OCCAM (Kwok & Weld, 1996), and ParaSite (Spertus, 1997). The techniques used to develop these agents include various information retrieval techniques (see Frakes & Baeza-Yates, 1992; Liang & Huang, 2000), filtering and categorizing techniques (see Broder et al., 1997; Chang & Hsu, 1997; Maarek & Shaul, 1996; Bonchi et al., 2001), and techniques for learning individual preferences (see Balabanovic et al., 1995; Park et al., 2001). Database approaches to web content mining have focused on techniques for organizing structured collections of resources and for using standard database querying mechanisms.

As for query languages, Konopnicki and Shmueli (1995) combined structure queries, based on the organization of hypertext documents, with content queries, based on information retrieval techniques. Lakshmanan et al. (1996) suggested a logic-based query language for restructuring, in order to extract information from web information sources. On the basis of semantic knowledge, efficient ways of mining intra-transaction association rules have been proposed by Ananthanarayana et al. (2001) and Jain et al. (1999). Fong et al. (2000) developed a frame metadata model to build a database and extract association rules from the online transactions stored in it. Bonchi et al. (2001) built a web log data warehouse to perform mining for intelligent web caching.
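To make the notion of an association rule concrete, the following minimal sketch computes the two standard rule measures, support and confidence, over a handful of transactions. The transaction data and the function names are hypothetical and serve only as an illustration; the sketch does not reproduce the method of any of the works cited above.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Confidence of the rule antecedent -> consequent.

    Computed as support(antecedent and consequent) / support(antecedent).
    """
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0
    both = support(transactions, frozenset(antecedent) | frozenset(consequent))
    return both / base


# Hypothetical online transactions (illustrative data only).
TRANSACTIONS = [
    frozenset({"laptop", "mouse"}),
    frozenset({"laptop", "mouse", "bag"}),
    frozenset({"laptop", "bag"}),
    frozenset({"mouse"}),
]

if __name__ == "__main__":
    print(support(TRANSACTIONS, {"laptop", "mouse"}))
    print(confidence(TRANSACTIONS, {"laptop"}, {"mouse"}))
```

A rule such as "laptop -> mouse" is usually reported only when both measures exceed thresholds chosen by the analyst.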

Web structure mining is the process of inferring knowledge from the organization and links on the Web, while web usage mining is the automatic discovery of user access patterns from web servers. Our approach belongs to web usage mining, because we aim to amplify the inferential value of the web log files that potential users leave behind when they surf the target web site. Web structure includes the external structure, the internal structure, and the URL itself. External structure mining investigates the hyperlink relationships between the web pages under consideration, while internal structure mining analyzes the relationships of information within a web page. URL mining extracts the URLs that are relevant to the decision maker's purpose. Spertus (1997) and Chakrabarti et al. (1999) proposed heuristic rules based on the internal structure and the URL of web pages. Craven et al. (1998) used a first-order learning technique to categorize hyperlinks and thereby estimate the relationships between web pages. Brin and Page (1998) considered citation counts of referring pages to find pages that are relevant to particular topics. To mine the community structure of the Web, Kumar et al. (1999) proposed a new hyperlink analysis method. Zaiane (2001) presented a way of building virtual web views by warehousing the web structure, which would allow efficient information retrieval and knowledge discovery.
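The idea of citation counting over hyperlinks can be illustrated by counting, for each page, the number of distinct pages that link to it. The sketch below is a deliberately simple in-link count over a hypothetical link graph; it is not the ranking method of Brin and Page (1998) or of any other work cited here, and the page names are invented for the example.

```python
from collections import Counter
from typing import Dict, List


def inlink_counts(links: Dict[str, List[str]]) -> Counter:
    """Count, for every page, how many distinct other pages link to it.

    `links` maps a source page to the list of pages it links to.
    """
    counts: Counter = Counter()
    for source, targets in links.items():
        counts[source] += 0  # pages without in-links still appear with 0
        for target in set(targets):  # ignore repeated links from one page
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical external link structure of a small site.
LINKS = {
    "/home": ["/products", "/about"],
    "/products": ["/home", "/about"],
    "/blog": ["/products", "/products", "/blog"],
    "/about": [],
}

if __name__ == "__main__":
    for page, count in inlink_counts(LINKS).most_common():
        print(page, count)
```

Pages with many in-links are candidates for being authoritative on a topic, although a raw count treats all referring pages as equally important.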

Web usage mining applies the concept of data mining to web log file data and automatically discovers user access patterns for a specific web page. Web usage mining can also draw on referrer logs, which contain information about the referring page for each page reference, and on user registration or survey data gathered via CGI scripts (Jicheng et al., 1999). The results of web usage mining give decision makers crucial information about the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns. Among other things, web usage mining helps organizations analyze user access patterns to targeted ads or web pages, categorize user preferences, and restructure a web site to support more effective management of workgroup communication and organizational infrastructure.
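As an illustration of how referrer information can be extracted from web log data, the sketch below parses a few log lines and counts how often each referring page led to each requested page. It assumes that the log is in the NCSA Combined Log Format, which the text does not specify; the sample lines, the regular expression, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Assumed layout (Combined Log Format):
# host ident user [time] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def referrer_counts(lines: Iterable[str]) -> Counter:
    """Count (referrer, requested page) pairs for successful (2xx) requests."""
    counts: Counter = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None or not entry["status"].startswith("2"):
            continue  # skip malformed lines and unsuccessful requests
        counts[(entry["referrer"], entry["path"])] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /products.html HTTP/1.0" '
    '200 2326 "http://example.com/index.html" "Mozilla/4.08"',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" '
    '200 2326 "http://example.com/index.html" "Mozilla/4.08"',
    '10.0.0.1 - - [10/Oct/2000:13:57:12 -0700] "GET /missing.html HTTP/1.0" '
    '404 209 "http://example.com/products.html" "Mozilla/4.08"',
    "not a log line",
]

if __name__ == "__main__":
    for pair, count in referrer_counts(SAMPLE_LOG).items():
        print(pair, count)
```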

Web usage mining provides the core basis for our system by supporting customized web usage tracking analysis and psychographics analysis. Customized web usage tracking analysis focuses on optimizing the structure of web sites based on the co-occurrence patterns of web pages (Perkowitz & Etzioni, 1999), predicting future HTTP requests to adjust network and proxy caching (Schechter et al., 1998), deriving marketing intelligence (see Buchner & Mulvenna, 1999; Cooley et al., 1997, 1999; Spiliopoulou & Faulstich, 1999; Hui & Jha, 2000; Song et al., 2001), and predicting future user behavior on a specific web site by clustering user sessions (see Shahabi et al., 1997; Yan et al., 1996; Changchien & Lu, 2001; Lee et al., 2001). Psychographics analysis, which gives insights into the behavioral patterns of the visitors to a specific web site, requires data such as the routes taken by visitors through the site, the time spent on each page, route differences based on differing entry points to the site, the aggregated route behavior, and general click stream behavior (Cooley et al., 1997, 1999). Based on these data, psychographics analysis tries to answer questions related to marketing intelligence, such as which menu shoppers use to buy a product, how long shoppers stay in the product description menu before deciding to buy, and how shoppers feel about specific ads on the Web.
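Two of the quantities mentioned above, the time spent on each page and the co-occurrence of pages within a visit, can be derived from click stream data roughly as follows. This is a minimal sketch under simplifying assumptions: each visitor is treated as a single session, the time on a page is taken as the gap to the visitor's next click, and the click records and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Click = Tuple[str, int, str]  # (visitor id, timestamp in seconds, page)


def routes(clicks: List[Click]) -> Dict[str, List[Tuple[int, str]]]:
    """Group clicks by visitor and order each visitor's clicks by time."""
    by_visitor = defaultdict(list)
    for visitor, timestamp, page in clicks:
        by_visitor[visitor].append((timestamp, page))
    return {visitor: sorted(hits) for visitor, hits in by_visitor.items()}


def time_on_page(clicks: List[Click]) -> Dict[str, int]:
    """Total seconds per page, measured as the gap to the next click.

    The last page of each route has no following click and is not counted.
    """
    totals: Dict[str, int] = defaultdict(int)
    for hits in routes(clicks).values():
        for (timestamp, page), (next_timestamp, _) in zip(hits, hits[1:]):
            totals[page] += next_timestamp - timestamp
    return dict(totals)


def page_cooccurrence(clicks: List[Click]) -> Counter:
    """Count how many visitors viewed both pages of each unordered pair."""
    pairs: Counter = Counter()
    for hits in routes(clicks).values():
        pages = sorted({page for _, page in hits})
        pairs.update(combinations(pages, 2))
    return pairs


# Invented click stream, for illustration only.
CLICKS: List[Click] = [
    ("v1", 0, "/home"), ("v1", 30, "/product"), ("v1", 150, "/checkout"),
    ("v2", 10, "/home"), ("v2", 25, "/product"),
]

if __name__ == "__main__":
    print(time_on_page(CLICKS))
    print(page_cooccurrence(CLICKS).most_common(1))
```

In practice a visitor's clicks would first be split into sessions, for example by an inactivity timeout, before either measure is computed.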




(ed.) Intelligent Agents for Data Mining and Information Retrieval
Year: 2004
Pages: 171
