This section provides a brief overview of data warehousing, reviews the current data mining literature and then reviews alternative processes in three domains—package implementation, rapid application development and new product development.
Subject-oriented, integrated, nonvolatile and time-variant are the core characteristics of a data warehouse, first identified by Inmon (1996). Subject-oriented data provide a stable image of business processes and capture the basic nature of the business environment, as opposed to supporting application processing. Data integration results from the consolidation of data from multiple operational systems: when data are brought together into a data warehouse, they are referred to in only one way and have the same format and consistent units of measure. Nonvolatile means that data warehouse data, once loaded, are only read by end users, never updated. Time-variance distinguishes the two environments' notions of accuracy: in an operational environment, data must be accurate at the moment of access, whereas data warehouse data are accurate as of some moment in time.
Data warehousing (Berson & Smith, 1997; Gardner, 1998; Inmon, 1996; Kimball, 1996; Sammon & Finnegan, 2000) is motivated by the inability of existing legacy operational systems to satisfy the informational demands of users in a fast-changing environment. In a typical organization today, application-specific data reflect the operational needs of specific functions and lack the historical perspective, across divisional lines and different aggregation levels, that is necessary for fast-cycle decision making. The primary advantages of building a data warehouse therefore include a single "version of the truth," high-quality data, easy data access, improved decision making and support for strategic business objectives.
A data warehouse differs from OLTP systems in the data it stores, the processing it supports and its design/build considerations. The business world generates a great deal of data, which is captured, collected and stored in the data world. For the most part business-world data is record-oriented, although companies are increasingly investigating the inherent value of storing richer forms of data such as images, sound and video. Record-oriented data can be classified as either operational or informational. Operational data is raw data used to run the day-to-day operations of an enterprise. Informational data is data that is summarized or otherwise calculated. Operational data describes a single entity, such as a customer, or a single event relating to the operation of the enterprise, whereas informational data relates to multiple entities or multiple events relevant to the enterprise.
There is another perspective on operational and informational data that relates to the type of processing each supports. Operational data is plain "data" that supports operational processing. Informational data is "information" that supports analytical processing. The two kinds of processing are distinctly different. Operational processing refers to systems that run the day-to-day operations of an enterprise (e.g., e-commerce merchant payment transactions). The emphasis of these systems is to support business functionality by processing transactions accurately and efficiently; with high-performance OLTP systems, automation of previously manual tasks yields significant business advantages. Analytical processing is performed to support decision making. Data used in analytical processing is historical in nature, enabling users to analyze trends and patterns. Systems that support analytical processing are read-only and therefore prohibit end-user updates.
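The distinction between operational and informational data can be illustrated with a minimal Python sketch. The records, customer names and amounts below are hypothetical: each operational record describes a single payment event, while the derived summary is informational data spanning many events per customer.

```python
from collections import defaultdict

# Hypothetical operational records: one row per payment transaction
transactions = [
    {"customer": "C1", "amount": 120.0},
    {"customer": "C1", "amount": 80.0},
    {"customer": "C2", "amount": 50.0},
]

# Informational data summarizes or otherwise calculates across many
# operational events: here, total and average spend per customer
summary = defaultdict(lambda: {"total": 0.0, "count": 0})
for t in transactions:
    s = summary[t["customer"]]
    s["total"] += t["amount"]
    s["count"] += 1

averages = {c: s["total"] / s["count"] for c, s in summary.items()}
```

In a real environment the summarized result would be loaded into the warehouse for read-only analytical access, while the transaction records remain in the operational system.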
A data warehouse by itself does not create value; rather, value is derived from the use of data warehouse data. Similarly, data mining in itself is not an end, but rather a means to an end. The benefits of data mining accrue from the operationalization of data mining results, via a business strategy, to achieve a specific objective. As a multidisciplinary field, data mining draws from domain-specific areas such as artificial intelligence, database theory, data visualization, machine learning, marketing, mathematics, operations research, pattern recognition and statistics. In fact, data mining has its roots in the statistical community, which has for over a century focused on inferring patterns or models from data through a hypothesis-driven approach. In contrast, data mining is accomplished through a discovery-driven approach, whereby one is uncertain about the nature of the information to be extracted. For problems involving large volumes of data and high-dimensional spaces, data mining is extremely powerful. What is important to keep in mind about the relationship between data mining and statistics is that the underlying intent of both is one and the same. A practical definition of data mining is "the analysis and nontrivial extraction of data from databases for the purpose of discovering new and valuable information, in the form of patterns and rules, from relationships between data elements" (Hirji, 1999). Following is a brief nontechnical review of the application of data mining technology to solve specific types of business problems.
The multitude of data mining algorithms can best be organized around the three main data mining problem approaches: clustering, association/sequential pattern discovery and predictive modeling. Clustering (or segmentation) is concerned with partitioning data records into subsets. A cluster is simply a subset of data records, and the goal of clustering is to partition a database into clusters of similar records, such that records sharing a number of properties are considered homogeneous. The K-means clustering algorithm is used for demographic clustering, where categorical data are predominant. This algorithm, which is efficient for large databases, clusters a data set by determining the cluster to which each record fits best. Once clusters have been found in a data set, they can be used to classify new data.
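The "fits best" idea can be sketched as a minimal K-means loop in plain Python. This is an illustrative toy, not a production implementation; the (age, income) records are hypothetical, and real demographic data would be far larger and higher-dimensional.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means sketch: assign each record to the cluster whose
    centroid it fits best (nearest), then recompute centroids as the
    mean of their members, and repeat."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance to each current centroid
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # New centroid = per-dimension mean; keep old one if a cluster empties
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical (age, income-in-thousands) records forming two groups
points = [(25, 30), (27, 32), (26, 31), (60, 80), (62, 82), (61, 79)]
centroids, clusters = kmeans(points, k=2)
```

Once the two centroids are found, a new record can be classified by assigning it to the nearer centroid, which is how discovered clusters are used to classify new data.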
To uncover affinities among transaction records consisting of several variables, association algorithms are used. These algorithms solve problems where it is important to understand the extent to which the presence of some variables implies the presence of other variables, and the prevalence of this particular pattern across all data records. Association algorithms discover rules of the form: if item X is part of a transaction, then for some percentage of the time, item Y is also part of the transaction. An example of an association relation in the insurance industry is: when people purchase home insurance, they purchase auto insurance 50% of the time. Sequential pattern discovery algorithms are related to association algorithms except that the related items are spread over time. In the insurance industry an example of a sequential pattern relation is: when people purchase home insurance, they also purchase auto insurance within the next three months 30% of the time, and within the subsequent three months 10% of the time.
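The two quantities behind such rules, prevalence across all records (support) and the conditional percentage (confidence), can be computed directly. The transactions below are a hypothetical toy data set, and on it the home-to-auto rule happens to hold 60% of the time rather than the 50% of the illustrative example above.

```python
# Hypothetical transactions: the set of policy types each customer holds
transactions = [
    {"home", "auto"}, {"home", "auto"}, {"home"},
    {"home", "life"}, {"auto"}, {"home", "auto", "life"},
]

def support(itemset, transactions):
    """Prevalence of the pattern across all data records."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    having = [t for t in transactions if antecedent <= t]
    if not having:
        return 0.0
    return sum(1 for t in having if consequent <= t) / len(having)

conf = confidence({"home"}, {"auto"}, transactions)
```

Sequential pattern discovery adds a timestamp to each item so that the consequent is only counted when it occurs within a given window after the antecedent.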
Finally, the predictive modeling approach to data mining involves the use of a number of algorithms such as binary decision trees, linear discriminant function analysis, radial basis functions, back-propagation neural networks, logistic regression and standard linear regression. The goal of predictive modeling is either to classify data into one of several predefined categorical classes or to use selected fields from historical data to predict target fields. Since the models built by these algorithms are trained on data that is already classified or has known values, they are referred to as supervised learning algorithms.
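The supervised learning idea can be sketched with the simplest member of the decision tree family, a depth-1 "stump". The training records and the churn scenario below are hypothetical; the point is only that the model is fitted to records whose class is already known and then applied to new records.

```python
def fit_stump(examples):
    """Fit a depth-1 binary decision tree (a 'stump'): choose the single
    feature and threshold that best separate the already-classified records."""
    best = None  # (accuracy, feature index, threshold, flip)
    n = len(examples)
    for idx in range(len(examples[0][0])):
        for x, _ in examples:
            t = x[idx]
            # Count records where "feature > threshold" agrees with the label
            correct = sum((xi[idx] > t) == yi for xi, yi in examples)
            acc = max(correct, n - correct) / n
            flip = correct < n - correct  # invert the test if that scores better
            if best is None or acc > best[0]:
                best = (acc, idx, t, flip)
    _, idx, t, flip = best
    return lambda x: (x[idx] > t) != flip

# Hypothetical labeled history: (monthly complaints,) -> customer churned?
train = [((1.0,), False), ((2.0,), False), ((3.0,), True), ((4.0,), True)]
predict = fit_stump(train)
```

A full binary decision tree repeats this split recursively on each resulting subset; the other algorithms listed above fit richer functional forms, but all share the same train-on-known-labels, predict-on-new-records pattern.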
To date, data mining has been applied in a number of diverse areas such as astronomy, biology, banking, finance, genetics, health care, insurance, manufacturing, marketing, telecommunications and transportation, and the most popular business applications have been: (i) developing direct mail marketing campaigns, (ii) detecting insurance fraud and abuse, (iii) identifying customers most likely to switch to the competition, (iv) analyzing web site transactions and (v) predicting the price of stocks. Although a number of specific data mining applications have been developed (Anand & Kahn, 1992; Fayyad, Weir & Djorgovski, 1993), very little research has focused on the data mining process itself. Cabena, Hadjinian, Stadler, Verhees & Zanasi (1998) have proposed a five-step, linear, one-way, sequential data mining process model. The steps in this model are business objectives determination, data preparation, data mining, results analysis and knowledge assimilation. Although a start, this proposed process model has a number of obvious shortcomings: (i) it is not based on any principles or existing body of research, (ii) it is not an end-to-end depiction of the entire data mining process and (iii) it is not derived from either quantitative or qualitative research findings. The next section reviews processes from three domains to arrive at a straw man of what a data mining process might look like.
Package implementation vendors (e.g., Siebel Systems Inc.) advocate a rapid implementation approach to highlight "8 week" project delivery lifecycles. This "8 week" delivery is based on an "out of the box" configuration assumption with only limited configuration changes, and does not include a number of relevant aspects such as up-front work to define the current business, developing future business processes, writing legacy interfaces, deployment, or training. The reason behind the success of this approach is that customers see a flexible application that addresses most of their needs "out of the box" and can be implemented rapidly. One of the greatest challenges of this approach, however, is to make the hard work of implementation look easy and meet customers' expectations.
Though successful, package implementation approaches are not appropriate for data mining projects, mainly because the assumptions and drivers governing package implementation projects are incongruent with those of data mining projects. Data mining projects are fundamentally about performing intensive analysis to discover findings that can have a significant impact and help senior managers and executives make strategic decisions to gain competitive advantage in areas such as operational excellence, product or service superiority, and customer intimacy. Package implementation projects, on the other hand, are about executing a strategic decision, for example, to develop a leadership position in becoming customer-centric.
The rapid application development (RAD) approach emphasizes rapid, incremental construction of software applications through the coordinated and concerted deployment of development accelerators. This approach supports the delivery of both complex and simple projects and includes analysis, construction, implementation and maintenance phases. The RAD approach is philosophically opposite to the traditional software development lifecycle approach because it treats time and resources as fixed (and requirements as variable, to allow scope management). The major component of the RAD approach is the use of accelerators such as time boxing, the "80/20" rule, prototyping, iterative development, and colocation of business and IT project team members.
The RAD approach is most suitable for projects with the following characteristics: (i) new or enhancement application development, (ii) the project can be decomposed into smaller releases, (iii) requirements can be prioritized and scope reduced if necessary, and (iv) a construction phase of less than six months.
Customers are increasingly demanding shorter lead times for data mining projects. As a result, the RAD approach would appear to be a candidate for identifying what a process for a data mining project might look like. RAD projects, however, operate on the principle of systematically delivering incremental business value (i.e., think big, start small, build incrementally), whereas data mining projects are less concerned with scope management to deliver incremental value than with meeting the objective of finding new insights. For this reason, the RAD approach is not appropriate for data mining projects.
Development of new hardware products is not only very important but also a very risky and complex process that consumes a large portion of a company's resources. The process of designing and developing new products can be viewed as the set of activities, tools, methods and procedures by which customer needs are translated into deliverable products (Carter & Baker, 1991, p. 148). This process involves capturing customer needs and then translating them into product requirements, functional subsystem specifications, design drawings, prototypes and finally physical products. Throughout the product development cycle, many technical specialists such as design engineers, manufacturing engineers and test engineers are involved together with customers, marketing representatives, suppliers, vendors, purchasing and distributors—just to name a few.
New product development (NPD) is a well-researched area (Clark, Han & Yu, 1996; Hauptman & Hirji, 1996, 1999; Iansiti, 1997), and numerous process models exist to reduce its complexity and risk and to make the NPD process more manageable, effective and efficient. Though development projects differ from industry to industry, the basic activities remain the same. Accordingly, there is common understanding about the major stages (or phases) of the new product development process. Individual models may contain between four and seven stages, but most include concept/idea generation, planning/setting specifications, design/product engineering and commercialization. A closer look at some of these models and their stages is now undertaken.
Cooper (1983) developed a seven-stage model to describe the process by which industrial new products are designed and developed. The first stage, Idea, involves definition of a product idea. A product idea results from matching technological possibilities with expected market demand. Preliminary Assessment is the second stage and includes both preliminary market assessment (i.e., market feasibility) and preliminary technical assessment (i.e., technical feasibility). The third stage is Concept. The purpose of this stage is to better identify what the product is, whom it is aimed at and how it will be positioned against competitors' products. Development is the fourth stage in this model, and its outcomes include prototype construction and development of a formal marketing plan. Stage five is Testing, where the product design and features are validated by testing prototypes to ensure technical flaws do not exist. Field testing of the product may occur in this stage. The sixth stage is Trial. Trial (also known as pilot production) involves testing the production methods that will eventually be used for full-scale production. Modification of the production system takes place here, both to ensure the product is produced according to specifications and to ensure a "smooth" transition to volume production. Launch is the final stage in this model and involves startup of full production at the planned rate.
Cooper's (1983) model takes the position that product development consists of a set of intertwined tasks beginning with a definition of what the product will be. A study by Gerwin (1993) of four of North America's largest multinational computer and telecommunications firms builds on this model by including a preliminary phase dealing with long-term strategic issues. In Gerwin's (1993) model, the first phase is Strategic Planning and New Technologies. Formulating long-run strategies (from two to five years) for entire product families as well as individual functions is the emphasis here. Market Requirements is the second phase and involves preliminary market evaluations to determine how technology can be matched to customer demand. The third phase is Product Concept Definition. A detailed product concept specifying the major functions and features of the product is the output of this phase. Phase four, Product Engineering, is concerned with developing technical specifications, drawings and prototypes. Finally, the last phase is Preproduction and Ramp-up.
There is a growing line of research in new product development focusing on activities that precede detailed design and NPD execution. Khurana and Rosenthal (1997) define the front end to include product strategy formulation and communication, opportunity identification and assessment, idea generation, product planning, project planning and executive reviews. Khurana and Rosenthal's (1997) three-phase model of the "fuzzy front end" includes pre-phase zero, phase zero and phase one and is based on a worldwide study of 12 companies in a variety of industries. Pre-Phase Zero consists of activities—including idea generation, market analysis and technology appraisal—to support the subsequent Phases Zero and One. Phase Zero is where a core team is brought together to work on the product concept and definition. In Phase One, emphasis is placed on assessing the business and technical feasibility of the new product, confirming the product definition and planning the NPD project.
A synthesis of the three models is the basis for a straw man of what a process for performing data mining projects might look like. The NPD domain is appropriate for this straw man process because NPD projects by their very nature are the most complex projects as they include systems, subsystems, components, modules as well as physical product and software aspects. Focusing on NPD processes allows for an inclusive view of what a data mining process might look like.
The phases in the straw man data mining process are Phase 0, Phase 1, Phase 2 and Phase 3. Phase 0 is the discovery phase, which supports the subsequent three phases. The set of proposed activities in this phase includes (i) assessment of the organization's orientation towards data centricity, (ii) assessment of the organization's capability to apply a portfolio of analytical techniques and (iii) development of a strategy for the use of analytics throughout the organization. Phase 1 is the entry phase, where higher-order issues about data mining are addressed. The underlying point here is that subsequent phases cannot proceed without a candidate business problem in mind that is solvable and that can at least partially use existing data resident in the organization's databases. Prospecting/domain analysis, business problem generation/preliminary assessment and data sensing are the proposed activities in this phase. Phase 2 is the launch phase, in which the data mining project becomes a formal project. The proposed activities in this phase are: (i) business problem refinement/validation, (ii) project planning, (iii) identification of key project participants, (iv) formulation of the data mining approach, (v) development of the data strategy and (vi) project sponsor identification. Phase 3, the final phase, is referred to as the execution and infusion phase; the actual execution of data mining algorithms takes place here, as well as results analysis. The proposed activities in this final phase are: (i) detailed business problem definition, (ii) data sourcing/enrichment, (iii) execution of data mining algorithms, (iv) results interpretation and validation and (v) information harvesting and business strategy formulation.
"Out of the box" means configuring and personalizing the software but not customizing it, which involves altering basic functionality.