


	Internet-Enabled Business Intelligence By William A. Giovinazzo

	Chapter 14. Personalization

14.3 The Data Mining Process

Neural networks, decision trees, and genetic modeling are all different types of data mining algorithms. They are not, in themselves, data mining. Look at it this way: Years ago, every computer science student had a copy of the book Fundamental Algorithms. It contained many of the basic algorithms we used when writing computer programs. If we wanted to sort data, we simply went to the book and found various ways to do so. The algorithms weren't programs. They didn't tell us how to collect the data or what data to sort. These questions were up to the programmer to answer. In a very similar way, the data mining algorithms we have described are just that: algorithms. They do not tell us how to collect the data or how to mine the data. That is up to us to decide.

We have described the IEBI process as a three-step iterative loop in which we acquire the data, analyze the data, and act based on the data. As we can see in Figure 14.6, the data mining process is the same basic iterative loop. The first step in the process, acquiring the data, is composed of several subtasks. Before we start to collect data, we must understand the purpose of our search. How can we collect data unless we know the purpose of the data analysis? Once we understand the purpose, we collect, prepare, and process the data. This is similar to the extraction, transformation/cleansing, and loading (ETL) step in our IEBI loop. Once we have completed the ETL of the data to be mined, we begin the mining process. This process entails the selection of an algorithm, the construction of a model, and the training of the model. We can then use this model for prediction, which leads to some action. In the following sections, we look at each step in a bit more detail.

Figure 14.6. The data mining process.


14.3.1 PROBLEM DEFINITION

It wasn't that long ago that many defined data mining as "the process that provided an answer to a question that you didn't even know you had." The image was that of an application that was let loose on a volume of data and out would flow gems of wisdom. As we can no doubt tell by now, this is far from the case. As we approach a data mining problem, we need to define the problem that we are attempting to solve. Part of that definition includes the anticipated results of the mining process. What is it we expect to get from the model: a description? a prediction? some combination of both? We also need to define the inputs from which we will derive these results. Are we trying to predict customer behavior based on some clickstream analysis? Are we trying to predict employee turnover based on human resources data?

Data mining can be applied to many areas within an organization. We can see it employed in manufacturing for quality assurance and in human resources for employee separation. In this chapter, we have focused on one specific use of data mining: the prediction of customer behavior. CRM is concerned with understanding our customers. Consider some of the questions we might attempt to answer as a result of our data mining process. Table 14.1 provides some examples.

Table 14.1. Questions that Data Mining Can Help Answer

Profile Customer Behavior	What are the characteristics of customers who return most frequently to our site? What are the characteristics of customers who return least frequently to our site? Which products are purchased most frequently by my most profitable customers?
Profile Customer Demographics	What are the gender and location of my most profitable customers? What are the gender and location of my least profitable customers?
Time Dependencies	Is there a season in which my most profitable customers purchase less? How frequently do my most profitable customers return? How frequently do my least profitable customers return?
Customer Retention	What is the average life of our relationships with our most profitable customers?

14.3.2 COLLECTION

The second step of the process is to collect the data to be mined. If we are mining the customer activities on our Web site, we will obviously want to collect the clickstream data. As we collect this data, there are important decisions to make. For example, we must determine how far back in time our analysis should reach. If the market has relatively long-term trends, we will want to go back further than we would in a market that is very volatile. If we have a Web site dedicated to selling cars, trends that occurred in the past year or two may still be relevant to the solution of our problem. In cases where trends are short term, such as clothing or cosmetics, we would look at much shorter time spans.
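The choice of time span can be expressed as a simple filter on the collected events. The following is a minimal Python sketch, assuming each clickstream event is a dictionary with a hypothetical 'date' key; the function name and the example lookback periods are illustrative assumptions, not recommendations.

```python
from datetime import date, timedelta


def within_lookback(events, as_of, lookback_days):
    """Keep only the events that fall inside the lookback window.

    events is a list of dicts with a hypothetical 'date' key holding
    a datetime.date. The window runs from as_of minus lookback_days
    up to and including as_of.
    """
    start = as_of - timedelta(days=lookback_days)
    return [e for e in events if start <= e["date"] <= as_of]


# Illustrative values only: a long window for a slow-moving market
# such as cars, a short one for a volatile market such as cosmetics.
CAR_LOOKBACK_DAYS = 730
COSMETICS_LOOKBACK_DAYS = 90
```

Keeping the window as a parameter makes it easy to rerun the analysis with a different span as we learn how quickly trends in our market change.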

In addition to the clickstream data, we want to include demographic data and buying patterns. In cases of financial institutions, we may wish to household the data, a process by which we relate the information of all the members of one household. In some cases, we may have brick-and-mortar outlets in addition to our Web site. Will it be relevant to our analysis to include this customer activity as well?
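Householding can be sketched as a grouping step. The Python example below assumes that two accounts belong to the same household when they share a surname and a mailing address; that matching rule, the function name, and the keys 'surname', 'address', and 'account_id' are all hypothetical, and a real institution would likely use richer matching.

```python
from collections import defaultdict


def household(accounts):
    """Group account IDs that share a surname and mailing address.

    Each account is a dict with hypothetical keys 'surname',
    'address', and 'account_id'. Case and extra whitespace are
    ignored when comparing surnames and addresses.
    """
    groups = defaultdict(list)
    for acct in accounts:
        key = (
            acct["surname"].strip().lower(),
            " ".join(acct["address"].lower().split()),
        )
        groups[key].append(acct["account_id"])
    return dict(groups)
```

Each key in the result identifies one household, and its value lists the accounts we would analyze together.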

14.3.3 DATA PREPARATION

The data preparation stage of the data mining process is similar to the data transformation and cleansing stage of the IEBI loop. In this stage, we prepare the data for inclusion into the data set that we will use in our analysis. The same types of issues we face with the data warehousing ETL process are present when extracting data for data mining. Of course, we would hope that much of the data we extract from the data warehouse is clean, but this is only a hope. As we build our analysis, we must take measures to ensure that we eliminate redundant data that may skew the results, typographical errors that may generate erroneous results, and stale data that may generate results that are no longer valid.
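Two of these quality checks, removing redundant records and removing stale ones, can be sketched in a few lines of Python. The record layout (keys 'customer_id', 'email', and 'updated') and the function name are assumptions made for illustration. Typographical errors are handled only to the extent of normalizing case and surrounding whitespace; genuine typo correction needs more than this.

```python
from datetime import date


def clean_records(records, cutoff):
    """Drop stale and redundant records, keeping the first of each kind.

    Each record is a dict with hypothetical keys 'customer_id',
    'email', and 'updated' (a datetime.date). A record is stale if
    it was last updated before cutoff. Two records are redundant if
    they have the same customer_id and the same email after
    normalizing case and surrounding whitespace.
    """
    seen = set()
    cleaned = []
    for rec in records:
        if rec["updated"] < cutoff:
            continue  # stale: older than the cutoff date
        key = (rec["customer_id"], rec["email"].strip().lower())
        if key in seen:
            continue  # redundant: an equivalent record was already kept
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```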

In addition to these quality issues, we must be concerned with issues such as qualification, binning, and derivation. Data qualification entails selecting data elements that are pertinent to the results we wish to achieve. For example, we may wish to understand the location of our most profitable customers and attempt to use postal codes as a means to classify them. While this may sound reasonable at face value, there are instances where the data may be inappropriate. If we are an organization whose business is limited to a specific state, a postal code would be perfectly reasonable. If, on the other hand, we are an international organization, segmenting our customer base by postal code may not be very helpful. Perhaps we have a handful of postal codes in New York, California, Norway, Latvia, and Fiji in our top 30 geographic areas. Does this really help us? It might be more beneficial to examine data that is coarser, perhaps at the country or state/province level.

Data preparation might also include the binning of data. The data warehouse may simply store the customer's age. Again, this may not meet the data qualification needs of our data mining process. To solve this problem, we may create age categories, or bins, to store customers' ages. Customers between the ages of 18 and 25 are in one bin, while customers 26 to 33 are in another. In some cases, the data warehouse may store the year the customer was born and not their specific age. In this case, we have to derive the age. Data derivation calculates a data element to be included in the analysis from other data elements in the source data.
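Binning and derivation can be sketched together in Python. The example assumes eight-year bins starting at age 18, which matches the 18 to 25 and 26 to 33 bins mentioned above; the function names, the bin width, and the "under-18" label are illustrative assumptions rather than part of any particular tool.

```python
def derive_age(birth_year, reference_year):
    """Derive an approximate age from the year of birth (derivation)."""
    return reference_year - birth_year


def age_bin(age, first_age=18, width=8):
    """Return a label such as '18-25' for the bin that contains age.

    Bins are width years wide and start at first_age. Ages below
    first_age fall into a single 'under-<first_age>' bin.
    """
    if age < first_age:
        return f"under-{first_age}"
    index = (age - first_age) // width
    low = first_age + index * width
    return f"{low}-{low + width - 1}"
```

Note that an age derived from the birth year alone can be off by one, depending on whether the customer's birthday has already passed in the reference year. For binned data this usually matters only for customers near a bin boundary.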

14.3.4 DATA MINING

We discussed the specifics of creating a data mining model in Chapter 8 when we discussed the Java Data Mining API. Remember that no single method drives personalization on our Web site. Instead, we employ a variety of algorithms on our Web site, deciding which model will be employed at which point in time depending on the objective to be accomplished.

For example, we may wish to segment our customer base in order to define the different market segments. Who is visiting our Web site? We can divide visitors by certain demographic characteristics or combinations thereof, such as gender, age, or income levels. Once we have established the different segments, we may wish to predict the propensity of these different groups to purchase a particular product or group of products. We could employ a neural network, tuning the model with a genetic algorithm. Genetic algorithms are also useful in optimizing the taxonomy of a Web site, that is, the structure of the site: how one page leads to other pages.
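To make the idea of segmenting visitors and estimating purchase propensity concrete, here is a minimal Python sketch. It uses a plain frequency count, the observed purchase rate within each segment, as a stand-in for a trained model; it is not the neural network or genetic algorithm mentioned above. The function name and the keys 'gender', 'age_bin', and 'purchased' are hypothetical.

```python
from collections import defaultdict


def purchase_rate_by_segment(visitors, keys=("gender", "age_bin")):
    """Estimate purchase propensity as the purchase rate per segment.

    visitors is a list of dicts holding the segment attributes named
    in keys plus a hypothetical boolean 'purchased' flag. A segment
    is the tuple of a visitor's values for those attributes.
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [buyers, visitors]
    for v in visitors:
        segment = tuple(v[k] for k in keys)
        counts[segment][1] += 1
        if v["purchased"]:
            counts[segment][0] += 1
    return {seg: bought / total for seg, (bought, total) in counts.items()}
```

A baseline like this is useful for comparison: a more elaborate model earns its place only if it predicts purchases better than the raw segment rates do.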

