The Data Mining Process | Professional SQL Server Analysis Services 2005 with MDX (Programmer to Programmer)

Wherever you look, people and businesses are collecting data, in some cases without even an obvious immediate purpose. Companies collect data for many purposes, including accounting, reporting, and marketing. Those companies with swelling data stores have executives with many more questions than answers; in this book you have seen how executives can use UDM-based analysis to find the answers they need. In that kind of analysis you typically know what you are looking for and can extract that information from your UDM. However, your data might contain additional information that could help you make important business decisions, information you are not aware of because you don't know what to look for. Data mining is the process of extracting interesting information from your data, such as trends, clusters, and other patterns, that can help you understand your data better. Data mining is accomplished through the use of statistical methods as well as machine learning algorithms. Its ultimate purposes include discovering subtle relationships between data items and creating predictive models. When data mining is successfully applied, previously unknown and potentially useful rules and patterns emerge from heaps of data.

You don't need a vintage coal-mining helmet with a lamp to begin the data mining process (but if you feel more comfortable wearing one, you can buy a used helmet off eBay). What you really need is a problem to solve and a very good understanding of the problem space; this isn't just an exploratory adventure. You also need appropriate hardware to store the data to be analyzed and the analysis results. Then you need either off-the-shelf data mining software or, if you are particularly knowledgeable, software you write yourself. In terms of hardware, you need a machine for storing the data (typically in relational databases that can hold gigabytes or even terabytes) and a machine on which to develop and run your data mining application. These are normally two different machines, but both functions can reside on a single machine. We discuss data mining software in terms of data mining algorithms, the infrastructure to use them, and data visualization tools for evaluating the results. Once you have the software and the required hardware, you need a good understanding of the data you are about to mine and the problem you are trying to solve; this is a critical step before you start using the software to perform mining. You will learn more about understanding the problem space and data in subsequent sections. Having the hardware and the right software set up is a necessary precursor to the data mining process. If data miners had to go through a pre-flight checklist like pilots and co-pilots do, it might look something like this:

Data Miner 1: "Ready to start pre-mine check."
Data Miner 2: "Ok, data store on-line with verified access?"
Data Miner 1: "Roger on the data store with access."
Data Miner 2: "Software loaded and ready to run?"
Data Miner 1: "Check."
Data Miner 2: "Data visualization tools, loaded and ready to run?"
Data Miner 1: "That's affirmative."
Data Miner 2: "Then we're ready to start, call it in."
Data Miner 1: "Sysadmin, sysadmin, this is miner 1, come in."

{the static crackles over the voicom interop}

Sysadmin: "This is Sysadmin, go ahead miner 1."
Data Miner 1: "Checklist complete, request clearance on server tango."

Sysadmin: "Roger, miner 1, you are cleared for mining on tango. Over."

Once you have satisfied the hardware and software requirements, you get into the process of building a model, or representation, of your data, which helps you visualize and better understand the information in your data. Over time, through systematic effort and trial and error, several methodologies and guidelines have emerged. Figure 14-1 shows the typical process of data mining divided into just five steps. First and foremost, you need to understand the domain area and what your business needs. Once you understand the domain area, you need to understand the data; you might run some initial statistics on the data to understand it better. After understanding the data, you create mining models and train them with the input data. You then analyze the mining models and validate their results. Once you have built mining models that suit your business needs, you deploy them to your production system. Your end users can consume the results directly from the model or through applications that use the model's content. You might have to go through this data mining life cycle periodically to meet the changing needs of your business, your data, or both.

Figure 14-1

A great public resource on the data mining process (independent of specific data mining software products) is CRISP-DM (http://www.crisp-dm.org). CRISP-DM is a process guide that is available as a free download. Yes, the process you see here maps roughly to the CRISP-DM process guide, but that would probably be true for any reasonable description of the data mining process.

Topic Area Understanding

Before Data Miner 1 from the last section begins to mine data, he or she would first need to understand the data and the problem space. Problem type and scope vary dramatically by subject area, from retail business, sales forecasts, and inventory management to logistics. Outside the realm of business there are data-mining-relevant science questions such as, "Some star has luminosity L, radiates brightly in the ultraviolet spectrum, and appears to have a small surface area. How hot is it likely to be?" If we have a reasonably sized database from which to train initially (more on what that means later), predictive data mining could uncover that the hotter the star, the shorter the peak wavelength in the star's spectrum, and that the hottest stars peak in the ultraviolet area of the spectrum. You do need to have an idea about the topic or subject area and the problem you want to solve in order to identify nonintuitive relationships and patterns using data mining as a tool.

To accomplish the goal of your data mining project, you must understand what business you're in and, moreover, what success means for your business in quantitative terms. To understand the business you sometimes have to ask hard questions: Does your company sell sugar water with flavoring? Or does it really sell an imagined lifestyle of fun, action, and perpetual happiness through the marketing of sugar water? What metrics can be used to measure gradients of success: sales volume, market share, or customer feedback scores? Whether you are mining your own company's data or working as a data mining consultant at a customer's site, you need to know as much as possible about the customer and what they are trying to achieve so that you can interpret the results from data mining and make good recommendations. The bottom line is that the company wants to improve its profits. That can be accomplished by targeting the best sellers, adding value for customers (by making suggestions based on customer usage patterns or detecting fraud against customers' credit cards), and reducing losses (by identifying processes that drain the business of funds unnecessarily).

Data: Understand It, Configure It

First off, you must know how to collect the data from disparate sources and then ensure that the data description scheme (metadata) is integrated, consistent, and makes sense for all the data to be used. Though not required, it is a good idea at this point to explore the data by creating distributions and running simple statistical tests. It is critical that the data be clean and free of type mismatches; otherwise the algorithms might fail to extract important information or, worse, yield erroneous results. Other preparatory actions might be taken; for example, you might want to construct derived attributes, which are also called computed attributes or calculated columns. To verify the accuracy of the data mining algorithm results, the source data is often divided into training and testing data. We recommend splitting your source data: two-thirds for training and the remaining one-third for testing and verification of the algorithm results.
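
To make the two-thirds/one-third split concrete, here is a minimal sketch in Python. It is not part of Analysis Services; the function name split_train_test, the fixed seed, and the sample customer records are all assumptions made up for illustration. The rows are shuffled before splitting so that neither set simply inherits the order in which the data was collected.

```python
import random

def split_train_test(rows, train_fraction=2 / 3, seed=42):
    """Shuffle rows and split them into (training, testing) lists."""
    shuffled = list(rows)                    # copy so the caller's data keeps its order
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical customer records, used only for illustration.
customers = [{"id": i, "income": 20000 + 1500 * i} for i in range(30)]

train, test = split_train_test(customers)
print(len(train), len(test))  # 20 10
```

With 30 source rows, 20 go to training and 10 are held back for testing. Holding the seed fixed is a convenience for comparing models on the same split; you would likely vary it to check that your results do not depend on one lucky partition.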

Understanding the data also involves understanding its attributes. For example, if you are looking at customer data, then name, gender, age, income, and number of children are possible attributes of the customer. You need a good understanding of the values each attribute can take so that you can represent the attribute appropriately in the data mining model you are about to create. For example, the gender of a customer typically has one of three possible values: Male, Female, or Unknown. The income of a customer, though, can vary widely, from 0 to millions of dollars. These attributes need to be modeled appropriately for the chosen data mining algorithm to get the best results. The gender attribute would typically be modeled as discrete, meaning it has a fixed number of values, while income would typically be modeled as continuous because its values vary so widely. Understanding your data is critical.
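
One rough way to get a first impression of how an attribute might be modeled is to count its distinct values. The Python sketch below does that with a simple heuristic. The function name suggest_attribute_type and the threshold of 10 distinct values are assumptions chosen for illustration; this is not how any particular product decides, and the final choice should still come from your understanding of the data.

```python
def suggest_attribute_type(values, max_discrete_states=10):
    """Suggest 'discrete' or 'continuous' for a column of attribute values.

    Heuristic (an assumption for this sketch): non-numeric columns, and numeric
    columns with only a few distinct values, are treated as discrete.
    """
    distinct = set(values)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if not all_numeric or len(distinct) <= max_discrete_states:
        return "discrete"
    return "continuous"

# Hypothetical sample columns.
genders = ["Male", "Female", "Unknown", "Female", "Male"]
incomes = [12000 + 3750 * i for i in range(25)]   # 25 distinct dollar amounts

print(suggest_attribute_type(genders))  # discrete
print(suggest_attribute_type(incomes))  # continuous
```

A heuristic like this only flags candidates. A numeric code such as a store number has many distinct values yet is not a continuous quantity, so a suggestion of "continuous" still needs a human check.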

Choose the Right Algorithm

Once you have a good understanding of your business and have matched your needs to the data, you need to choose the right data mining algorithm. A data mining algorithm is a technique or method by which data is analyzed and represented as patterns or rules, which are typically called data mining models. The data mining model created by the algorithm is later analyzed to detect patterns or to predict values for new data. Choosing the right algorithm is not always easy, because several data mining algorithms might be able to solve your business problem. If you know each data mining algorithm in depth, you can potentially pick the right one directly. If not, identify the algorithms that can solve your problem and analyze the results from each. Later in this chapter you will learn about the algorithms supported by Analysis Services 2005 and the classes of problems they help solve. Fine-tune the models built with the various algorithms and pick the one that best satisfies your business needs. Several techniques can be used to compare the results of several mining models, such as lift charts and profit charts (you learn about these later in this chapter). Once you have compared the models and identified the right one, you can use the specific algorithm for more detailed analysis.
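
As a rough illustration of comparing models by lift, the Python sketch below computes a single point of a lift curve: how many of the actual positive cases a model captures in its top-scored fraction of the population, relative to picking cases at random. The function name lift_at, the two sets of model scores, and the sample outcomes are hypothetical, and the code is independent of the charting tools in Analysis Services.

```python
def lift_at(scores, actuals, fraction):
    """Lift at a given fraction of the population.

    Cases are ranked by model score (highest first). Lift is the share of all
    positive cases found in the top `fraction`, divided by the share of the
    population examined. A lift of 1.0 is no better than random selection.
    """
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    captured = sum(actual for _, actual in ranked[:top_n])
    total_positives = sum(actuals)
    if total_positives == 0:
        raise ValueError("lift is undefined when there are no positive cases")
    return (captured / total_positives) / (top_n / len(ranked))

# Hypothetical test data: 1 means the customer responded, 0 means no response.
actuals = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
# Hypothetical predicted response probabilities from two candidate models.
model_a = [0.95, 0.90, 0.40, 0.85, 0.30, 0.20, 0.80, 0.10, 0.35, 0.05]
model_b = [0.60, 0.20, 0.90, 0.70, 0.80, 0.10, 0.30, 0.50, 0.40, 0.65]

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, lift_at(scores, actuals, 0.4))
```

On this made-up data, model A ranks every responder in its top 40 percent while model B finds only one of the four, so A shows the higher lift at that cutoff. A full lift chart repeats this calculation across many cutoffs.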

Train, Analyze, and Predict

With the data mining algorithm in hand, you now need to choose the data set in which to identify and analyze interesting information for your business. Data mining is used not only to analyze existing data but also to predict characteristics of new data. Typically the data set to be analyzed is divided into two parts, a training set and a validation set, usually in the ratio 2:1. The training data set is fed as input to the data mining algorithm. The algorithm analyzes the data and creates an object called a data mining model, which represents characteristics of the data set analyzed. You need to identify the training data set that best represents your data. You also need to consider the size of the training data because it directly affects training time. In this book, training a model is also referred to as model building or processing a model. Once you have determined the training data set, you train the model with the chosen algorithm. Once you have the trained model, you can analyze it to better understand the training data set and see whether it provides useful information for your business.

Prediction is the process of estimating an unknown value or values for a data set based on its known characteristics. For example, if you own a store and create a mining model of all your customers, you might classify them as Platinum, Gold, Silver, or Bronze members based on several factors, such as salary, the revenue they bring to your store, number of household members, and so on. Now, if a new member shops at your store, you might be able to predict his or her membership level based on salary, household members, and other factors. Once you have this information, you can send membership-relevant promotional coupons to the member to increase your sales.

In addition to analyzing your data using a mining model, you can perform prediction for new data sets by simply providing the new data set as input to the model and retrieving the prediction results. To determine the accuracy of the model, you can predict values for the validation data set and compare the actual values with the predicted values. The number of accurate predictions tells you how good the model is. If the model is not providing the prediction results you expect, you might be able to tweak it by changing properties of the data mining algorithm, choosing different attributes as inputs, or choosing a different mining algorithm. You might have to maintain the model periodically as additional information becomes available; in this way, your model stays trained on the most up-to-date information and should continue to yield good results.
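
The comparison of actual and predicted values on the validation set can be sketched in a few lines of Python. The function name accuracy and the membership values below are hypothetical; in practice the predicted list would come from running your trained model against the validation data.

```python
def accuracy(actual, predicted):
    """Fraction of validation cases where the predicted value equals the actual value."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty validation set")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical validation set: the membership each customer really holds,
# and what a trained model predicted for the same customers.
actual_tiers = ["Gold", "Silver", "Bronze", "Platinum", "Silver", "Gold"]
predicted_tiers = ["Gold", "Silver", "Silver", "Platinum", "Silver", "Bronze"]

print(f"{accuracy(actual_tiers, predicted_tiers):.2f}")  # 0.67
```

Here four of the six predictions match, giving an accuracy of about 0.67. A single number like this hides which memberships the model confuses, so it is worth looking at the mismatched cases before deciding how to tweak the model.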