It is important to conduct a literature review on data mining in order to investigate what data mining is, the relationship between data mining and data warehousing, and salient issues in data mining that relate to fraud detection. Data mining is a knowledge discovery process which deals with discovering valuable, hidden patterns, trends, and relationships from a collection of data through various techniques such as machine learning, pattern recognition, statistics, databases, and visualization (Bigus, 1996). Data mining has a highly complementary relationship with data warehousing. As an emerging field with increasing attention from both researchers and practitioners, data mining has encountered a variety of issues ranging from performance to privacy. Data mining is investigated in the following four sections. The first section provides an overview of data mining, which has proved beneficial in numerous application areas. The second section deals with the highly complementary relationship between data mining and data warehousing. Data mining is data-driven, and its performance is closely related to the data preparation of data warehousing. The third section discusses issues related to data mining such as performance and privacy. The last section deals with future research considerations.
Data mining can extract information from terabytes of data by using such techniques as machine learning, pattern recognition, statistics, databases, and visualization (Simoudis, Livezey, & Kerber, 1996). According to Cao (1998) and Edelstein (1999), the types of knowledge which data mining can discover include dependency and association relationships among attributes, deviation detection, sequential and temporal patterns among inter-transactions, classification of knowledge, clustering of records, and summarization of data. The two "high-level" primary goals of data mining are prediction and description (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Successful applications can be found in promotion analysis, category management, logistics and distribution analysis, trend analysis, claims analysis, sales and profitability analysis, customer profiling, churn analysis, forecasting, budget analysis, target mailing list identification, data validation, and fraud detection (Cao, 1998; Edelstein, 1999). Equally important, data mining can deal with a variety of databases such as relational databases, object-oriented databases, inductive databases, spatial databases, temporal databases, scientific databases, and Internet information systems (Cao, 1998).
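The association-type knowledge mentioned above can be illustrated with a minimal sketch in pure Python; the transaction data and item names here are invented for illustration, and real association-rule miners operate on far larger databases.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction data; in practice these rows would come
# from a database of sales or claims records.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

# Association discovery in its simplest form: count how often each
# pair of items occurs together across transactions (its "support").
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

support = {p: n / len(transactions) for p, n in pair_counts.items()}
print(support[("bread", "milk")])  # 0.5: together in 2 of 4 transactions
```

Techniques such as the Apriori algorithm build on exactly this support measure, pruning item sets whose support falls below a threshold before growing larger combinations.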
Approaches to data mining can be broadly classified into two categories: methodologies and technologies (Pass, 1997b). Cluster analysis, linkage analysis, visualization, and categorization analysis are methodological; connectionist models/neural networks, decision trees, rule induction, genetic algorithms, fuzzy logic, statistical approaches, and time series approaches are technological. Among these, two popular approaches are neural networks and decision trees (Edelstein, 1999). Neural networks are collections of connected nodes with inputs, outputs, and processing at each node. A number of hidden processing layers may exist between the input layer and output layer. Based on a computing model similar to the underlying structure of the human brain, neural networks share the brain's ability to learn or adapt in response to external inputs. When exposed to a stream of training data, neural networks can discover previously unknown relationships and learn complex nonlinear mappings in the data. Neural networks are characterized by an opaque process; a clear interpretation is not available for the resulting model. Neural networks are well suited to pattern recognition and prediction. Decision trees, on the other hand, divide the data into groups based on values of the variables. The result is a hierarchy of if-then statements that classify the data. Although decision trees may not work with some types of data, such as continuous sets of data, decision trees are faster than neural networks in most cases.
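The "hierarchy of if-then statements" produced by a decision tree can be sketched directly as nested conditionals. The fraud-screening thresholds and labels below are hypothetical, standing in for splits a tree-induction algorithm would learn from training data.

```python
# A trained decision tree reduces to nested if-then rules over the
# values of the input variables. Thresholds here are invented, as if
# learned from historical transaction data.
def classify_transaction(amount: float, overseas: bool) -> str:
    """Toy fraud-screening tree: a hierarchy of if-then statements."""
    if amount > 1000:
        if overseas:
            return "flag for review"
        return "verify with cardholder"
    return "approve"

print(classify_transaction(2500, overseas=True))   # flag for review
print(classify_transaction(50, overseas=False))    # approve
```

This transparency is precisely the contrast with neural networks drawn above: every classification can be traced back to an explicit rule path, whereas a trained network offers no comparable readable structure.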
At the same time, different approaches to data mining exist, and these approaches differ in the classes of problems that can be solved. For instance, cluster analysis can be used to identify associations among data points. Linkage analysis can be used to link two or more data points together. Time series analysis can be used to relate data points in time. Categorization analysis can be used to explain the influence of numerous different factors on one specific outcome (Cao, 1998; Edelstein, 1999).
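As a concrete illustration of cluster analysis, the following minimal one-dimensional k-means sketch groups data points around their natural centres; the data values and starting centres are made up for illustration, and production clustering tools handle many dimensions and much larger volumes.

```python
# A minimal 1-D k-means sketch: repeatedly assign each point to its
# nearest centre, then move each centre to the mean of its points.
def kmeans_1d(points, centres, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in centres}
        for p in points:
            nearest = min(centres, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centres = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centres)

# Two obvious groups, around 2 and around 20.
print(kmeans_1d([1, 2, 3, 19, 20, 21], centres=[0, 10]))  # [2.0, 20.0]
```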
The idea of extracting valuable information from data is not new; its roots reach back to exploratory data analysis, a methodology first introduced by the statistician John Tukey in the mid-1970s (Bigus, 1996; Judson, 1999; Peacock, 1998). The principal differences between data mining and exploratory data analysis follow. First, data mining substitutes machine learning for human learning. Second, data mining is applied to entire large data sets rather than to samples drawn from them. Peacock (1998) has classified reasons for the growth of the data mining industry into two categories: supply-side factors and demand-side factors. Supply-side factors for the emergence of data mining consist of the advances in data storage and data processing technology; the declining cost of electronic communication; the emergence of new analysis techniques such as neural networks, genetic algorithms, decision trees, and induction rules; the occurrence of client/server computer architecture; and the development of data warehouses and data marts. Demand-side factors include the growing need for ever-faster analytical results and the change of the organizational hierarchy.
The size of the data which data mining currently encounters is one of the most salient differences between data mining and the traditional knowledge discovery form. It has been estimated that the amount of information in the world doubles every 20 months. One reason is the use of the computer, which records even simple transactions, such as phone calls, and produces an ever-increasing stream of data (Frawley, Piatetsky-Shapiro, & Matheus, 1991). Databases are rapidly growing, and more and more large-scale data sets contain terabytes of data. A terabyte is equal to 10^12, or 1,000,000,000,000, bytes, which is between 20 and 500 million document pages. By contrast, the Census Bureau's massive Current Population Survey microdata file in 1994 was 241 megabytes, or 241,000,000 bytes, 241/1,000,000ths of a terabyte (Judson, 1999). The average Fortune 500 company managed over a terabyte of electronic information on a daily basis in 1998, and company electronic information storage was expected to increase 1,000-fold by the coming millennium, averaging a 57% annual growth rate (Kempster, 1998).
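The scale comparison can be checked with a line of arithmetic:

```python
# Sanity check of the scale comparison in the text.
TERABYTE = 10 ** 12          # bytes
cps_1994 = 241_000_000       # 1994 Current Population Survey file, bytes

fraction = cps_1994 / TERABYTE
print(fraction)  # 0.000241, i.e. 241/1,000,000ths of a terabyte
```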
Another salient difference between data mining and the traditional knowledge discovery form is the application of analysis techniques. The first expert systems were developed in the late 1970s. In the late 1980s, artificial intelligence (AI) methods such as neural networks and fuzzy logic were applied to real-world problems, providing new capabilities for processing business data. By the mid-1990s, data mining could combine a number of different AI techniques with carefully constructed databases (Borok, 1997).
Data warehousing is very closely associated with data mining, although there are some differences between them. Data warehousing is not a prerequisite for a data mining solution. A data warehouse is designed to be a neutral holding area for informational data for decision making, and it is "a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decisions" (Inmon, 1997). The content of the data warehouse is defined consistently across the enterprise, its operational and external data sources. Additionally, all data in the data warehouse are time-stamped. The data in the data warehouse are not updated after they are loaded into the data warehouse.
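The nonvolatile, time-variant character of warehouse data described above can be sketched as an append-only table. The class and field names below are purely illustrative, not an actual warehouse API.

```python
from datetime import datetime, timezone

# Sketch of the "nonvolatile, time-variant" property: rows are
# time-stamped on load and never updated afterwards, only appended.
class WarehouseTable:
    def __init__(self):
        self._rows = []

    def load(self, record: dict) -> None:
        # Every row carries a load timestamp (time-variant).
        self._rows.append({**record,
                           "loaded_at": datetime.now(timezone.utc)})

    def rows(self):
        # A copy is returned and no update or delete operation is
        # exposed, mirroring the nonvolatile property.
        return list(self._rows)

t = WarehouseTable()
t.load({"customer": "C042", "balance": 120.0})
t.load({"customer": "C042", "balance": 95.0})  # new snapshot, not an update
print(len(t.rows()))  # 2
```

Because both snapshots survive, a miner can later reconstruct how the customer's balance changed over time, which an operational database that overwrote the row could not support.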
However, as the amount and complexity of data in the warehouse reach previously unthinkable proportions, new capabilities are required to identify trends and relationships in the data because simple query and reporting tools are not enough. Data mining can help discover the patterns, trends and relationships among data in the repositories of information in a more sophisticated and more productive way (Pass, 1997a). The move from data warehousing to data mining arises from the need to further enhance the strategic value of the organizational data asset and the need to further capitalize on its investment in the data warehouse (Cabena, Hadjinian, Stadler, Verhees, & Zanasi, 1998).
In the same way, although a database can be defined as a logically integrated collection of data maintained in files and organized to facilitate the efficient storage, modification, and retrieval of related information, it is often dynamic, incomplete, noisy, and large in the real world (Frawley, Piatetsky-Shapiro, & Matheus, 1991). Data warehousing, a data depository, is an approach to transform data into useful and reliable information supporting knowledge discovery and the organizational decision-making process (Kempster, 1998; Greenwood, 1997). Data warehousing significantly improves the likelihood of success in data mining. A data warehouse may include integrated data, detailed and summarized data, historical data, and metadata. All of these elements are critical to enhance the quality of data mining (Inmon, 1997). For instance, integrated data, which have been cleansed and conditioned, enhance the data mining process. Once data are integrated, the data miner can focus on mining data rather than cleansing and integrating data. The miner also can examine data in a broad range because data in a warehouse are detailed. In addition, a data warehouse retains previously summarized data, so the miner can build on others' work or on his or her own earlier work. Moreover, metadata serves as a road map to the miner because it describes not only the content but also the context of the information.
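The cleansing and integration step described above can be sketched minimally as follows; the customer records and the two operational sources are invented, and real integration pipelines involve far more elaborate matching and standardization rules.

```python
# Hypothetical integration step: merge duplicate customer records from
# two operational sources and standardize fields before mining.
source_a = [{"id": "c1", "name": "ANN LEE ", "spend": 100}]
source_b = [{"id": "c1", "name": "ann lee", "spend": 40},
            {"id": "c2", "name": "Bo Chan", "spend": 70}]

integrated = {}
for rec in source_a + source_b:
    key = rec["id"]
    clean = {"id": key,
             "name": rec["name"].strip().title(),  # standardize spelling
             "spend": rec["spend"]}
    if key in integrated:
        integrated[key]["spend"] += clean["spend"]  # consolidate duplicates
    else:
        integrated[key] = clean

print(sorted(integrated))              # ['c1', 'c2']
print(integrated["c1"]["spend"])       # 140
```

With this housekeeping done once at load time, the miner works against a single clean view of each customer instead of repeating the cleanup inside every analysis.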
Salient issues ranging from performance to privacy exist in data mining. No consensus exists concerning the factors which influence the performance of data mining. Some writers argue data quality and people involved in the data mining are important; others emphasize data mining techniques applied in the data mining process. Another salient issue in data mining is privacy. When dealing with databases of personal information, the legal and ethical issues of invasion of privacy have to be addressed.
Regarding data privacy, current discussion centers on guidelines for what constitutes a proper discovery. In the U.S., principles for fair information use related to the National Information Infrastructure (NII) can apply to the data mining discovery (Fayyad et al., 1996). Piatetsky-Shapiro (1995) argues that group pattern discovery does not invade privacy because its goal is to discover patterns about groups, not individuals. O'Leary (1995) has introduced the Organization for Economic Cooperation and Development (OECD) guidelines for data privacy, which have been adopted by most European Union countries. The guidelines suggest that data about specific living individuals should not be analyzed without their consent, and that data should be collected for a specific purpose, with the consent of the data subject, or by authority of the law.
Regarding the performance of data mining, Frawley et al. (1991) assert that databases can pose problems to knowledge discovery when the fundamental input to a data mining system is the raw data present in a database. As mentioned in the preceding section, the databases are usually dynamic, incomplete, noisy, and large. Contents in most databases are changing and time-sensitive; relevant data attributes or fields can be absent; the inherent exactitude of data differs; and the databases may be large with irrelevant information.
Regarding the performance of data mining in large databases, Musick (1999) has identified several barriers to effectively conducting large-scale data mining. First, typical data mining algorithms, such as naive Bayes, decision trees, and model induction algorithms, may lead to information loss when they are used to handle highly complex data. Second, if the quality of data mining models is poor, the cost of a mistake is high. Third, some difficulties exist in mining large-scale data sets because of:
a variety of data types and models,
duplicative and erroneous data,
changing data content, and
critical pieces of information in different repositories.
Strategies which are recommended to overcome the barriers facing large-scale data mining include:
building scalable I/O architectures and interfaces;
developing algorithms that work with nontraditional data types such as time sequences, protein sequences, or 3D structures; and
scaling algorithms by effective parallelization and controlled sampling or filtering.
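The controlled-sampling strategy in the last item can be sketched as follows; the data set size, sampling fraction, and seed are made up for illustration.

```python
import random

# Controlled sampling as a scalability strategy: mine a reproducible
# random subset instead of the full data set.
def controlled_sample(records, fraction=0.01, seed=42):
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

data = list(range(1_000_000))
subset = controlled_sample(data)
print(len(subset))  # 10000
```

Fixing the seed is what makes the sampling "controlled": the same subset can be re-drawn later to validate or reproduce a mining result, rather than mining an arbitrary slice each run.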
According to Cabena et al. (1998), the main factors which make for a successful data mining project can be summarized into three categories: the right people, the right application, and the right data. The right people include a sponsor, a user group, a consultant or someone who has experience, a business analyst with domain expertise, and a data analyst. The right application is composed of clearly understood organizational objectives, a solid cost-benefit analysis, a significant impact on business problems or opportunities, and an achievable deadline of less than three months. The right data means a clean supply of data, a limited set of data sources, and a solid analytical data model. Although solid technology is important, successful data mining is far more about people, business issues, and data than about the underlying technology.
Likewise, Fayyad et al. (1996) have identified three major factors related to the successful application of data mining. The first factor is the data quality. Although data mining techniques can be tolerant of certain types of noise, excessive noise can negatively impact the quality of the mined information. The second factor involves developing methods for establishing when and how to apply the appropriate techniques and determining how to best take advantage of the mined information. The third factor is selecting an appropriate data-mining tool.
Most recently, Nazem and Shin (1999) have classified issues which influence the performance of data mining into two broad categories: organizational and methodological. Some salient organizational issues closely related to data mining are objectives of data mining, information management, and data warehousing. Methodological issues involve the selection of a methodology and of appropriate search tools, as well as non-algorithmic aspects of data mining such as design of data warehouses and data marts, integration of data mining with DSS or EIS, rule-base management, methodologies for knowledge search, and evaluation of return on investment.