Defining Data Mining

We generally describe data mining as a process of data exploration with the intent to find patterns or relationships that can be made useful to the organization. Data mining takes advantage of a range of technologies and techniques for exploration and execution. From a business perspective, data mining helps you understand and predict behavior, identify relationships, or group items (customers, products, and so on) into coherent sets. These models can take the form of rules or equations that you apply to new customers, products, or transactions to make a better guess as to how you should respond to them.

The field of data mining is known more broadly as Knowledge Discovery and Data Mining (KDD). Both terms shed light on the purpose and process of data mining. The word mining is meant to evoke a specific image. Traditional mining involves digging through vast quantities of dirt to unearth a relatively small vein of valuable metallic ore, precious stones, or other substances. Data mining is the digital equivalent of this analog process. You use automated tools to dig through vast quantities of data to identify or discover valuable patterns or relationships that you can leverage in your business.

Our brains are good examples of data mining tools. Throughout the course of our lives, we accumulate a large set of experiences. In some cases, we're able to identify patterns within these experiences and generate models we can use to predict the future. Those who commute to work have an easy example. Over the weeks and months, you begin to develop a sense for the traffic patterns and adjust your behavior accordingly. The freeway will be jammed at 5:00 p.m., so you might leave at 4:30, or wait until 6:00, unless it's Friday or a holiday. Going to the movies is another example of altering behavior based on experience. Deciding when to arrive at the theater is a complex equation that includes variables like when the movie opened, whether it's a big-budget film, whether it got good reviews, and what showing you want to see. These are personal examples of building a data mining model using the original neural network tool.

The roots of data mining can be traced back to a combination of statistical analysis tools like SAS (Statistical Analysis System) and SPSS (Statistical Package for the Social Sciences) that took form in the academic environment in the 1960s and 1970s, and the artificial intelligence surge of the 1980s. Many of the techniques from these areas were combined, enhanced, and repackaged as data mining in the 1990s. One benefit of the Internet bubble of the late 1990s was that it showed how data mining could be useful. Companies like Amazon began to mine the vast quantities of data generated by millions of customers browsing their web sites and making purchase selections, popularizing the phrase "Customers who bought this item also bought these items."

Data mining has finally grown up and has taken on a central role in many businesses. All of us are the subject of data mining dozens of times every day, from the junk mail in our mailboxes, to the affinity cards we use in the grocery store, to the fraud detection algorithms that scrutinize our every credit card purchase. Data mining has become so widespread for one reason: it works. Using data mining techniques can measurably and significantly increase an organization's ability to reach its goals. Data mining is used for many purposes across the organization, from increasing revenue with more targeted direct marketing programs and cross-sell and up-sell efforts, to cutting costs with fraud detection and churn/attrition reduction, to improving service with customer affinity recommendations. A charitable organization might use data mining to increase donations by directing its campaign efforts toward people who are more likely to give. More often, the goals are as base as "sell more stuff." Our goal here is to describe the technology, not judge the application; how you use it is up to you.

There are two common approaches to data mining. The first is usually a one-time project to help you gain an understanding of who your customers are and how they behave. We call this exploratory or undirected data mining, where the goal is to find something interesting. The second is most often an ongoing project to work on a specific problem or opportunity. We call this more focused activity directed data mining. Directed data mining typically leads to an ongoing effort where models are generated on a regular basis and are applied as part of the transaction system or in the ETL application. For example, you might create a model that generates a score for each customer every time you load customer data into the BI system. Models that come from the data mining process are often applied in the transaction process itself to identify opportunities or predict problems as they are happening and guide the transaction system to an appropriate response on a real-time basis.

While exploratory data mining will often reveal useful patterns and relationships, this approach usually takes on the characteristics of a fishing expedition. You cast about, hoping to hook the big one; meanwhile, your guests, the business folks, lose interest. Directed data mining with a clear business purpose in mind is more appealing to their business-driven style.

In Chapter 8 we defined an analytic application as a BI application that's centered on a specific business process and encapsulates a certain amount of domain expertise. A data mining application fits this definition perfectly, and its place in the Business Dimensional Lifecycle is pictured in Figure 8.1, in the boxes for BI Application Specification and Development.

Basic Data Mining Terminology

The data mining industry uses a lot of terms fairly loosely, so it's helpful for us to define a few of these terms early on. This is not an exhaustive list of data mining terms, only the ones relevant to our discussion.

  • Algorithm: The programmatic technique used to identify the relationships or patterns in the data.

  • Model: The definition of the relationship identified by the algorithm, which generally takes the form of a set of rules, a decision tree, a set of equations, or a set of associations.

  • Case: The collection of attributes and relationships (variables) that are associated with an individual object, usually a customer. The case is also known as an observation.

  • Case set: A group of cases that share the same attributes. Think of a case set as a table with one row per unique object (like customer). It's possible to have a nested case set when one row in the parent table, like customer, joins to multiple rows in the nested table, like purchases. The case set is also known as an observation set.

  • Dependent variable(s) (or predicted attribute or predict column): The variable the algorithm will build a model to predict or classify.

  • Independent variable(s) (or predictive attribute or input column): The variable with descriptive information used to build the model. The algorithm creates a model that uses combinations of independent variables to define a grouping or predict the dependent variable.

  • Discrete or continuous variables: Numeric columns that contain continuous or discrete values. A column in the Employee table called Salary that contains the actual salary values is a continuous variable. You can add a column to the table during data preparation called SalaryRange, containing integers to represent encoded salary ranges (1 = $0 to $25,000; 2 = $25,000 to $50,000; and so on). This is a discrete numeric column. Early data mining and statistical analysis tools required the conversion of strings to numeric values like the encoded salary ranges. Most tools, including most of the SQL Server data mining algorithms, allow the use of character descriptions as discrete values. The string "0 to $25,000" is easier to understand than the number 1. Discrete variables are also known as categorical. This distinction between discrete and continuous is important to the underlying algorithms in data mining, although its significance is less obvious to those of us who are not statisticians.

  • Regression: A statistical technique that creates a best-fit formula based on a dataset. The formula can be used to predict values based on new input variables. In linear regression, the formula is the equation for a line.

  • Deviation: A measure of how well the regression formula fits the actual values in the dataset from which it was created.

  • Mining structure: A Microsoft data mining term for the definition of a case set in Analysis Services. The mining structure is essentially a metadata layer on top of a data source view that includes additional data mining-related flags and column properties, like the field that identifies a column as input, predict, both, or ignore. A mining structure can be used as the basis for multiple mining models.

  • Mining model: The specific application of an algorithm to a particular mining structure. You can build several mining models with different parameters or different algorithms from the same mining structure, as the sketch following this list illustrates.
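
To make the last two terms (and the notion of a nested case set) concrete, here is a minimal DMX sketch of the kind you could run against SQL Server 2005 Analysis Services. All structure, column, and model names are hypothetical; the point is simply to show one mining structure, including a nested case set and both discrete and continuous columns, serving as the basis for two mining models built with different algorithms.

    // Define the case set: one row per customer, with a nested table of purchases.
    // All object names here are hypothetical.
    CREATE MINING STRUCTURE [Customer Profile]
    (
        [Customer Key]  LONG   KEY,
        [Gender]        TEXT   DISCRETE,
        [Salary]        DOUBLE CONTINUOUS,
        [Salary Range]  TEXT   DISCRETE,     // derived during data preparation
        [Purchases] TABLE                    // nested case set
        (
            [Product Name] TEXT KEY
        )
    )

    // Two mining models built from the same structure, using different algorithms.
    ALTER MINING STRUCTURE [Customer Profile]
    ADD MINING MODEL [Customer Profile Trees]
    (
        [Customer Key],
        [Gender] PREDICT,                    // dependent (predict) column
        [Salary],                            // independent (input) columns
        [Purchases] ( [Product Name] )
    )
    USING Microsoft_Decision_Trees

    ALTER MINING STRUCTURE [Customer Profile]
    ADD MINING MODEL [Customer Profile Clusters]
    (
        [Customer Key],
        [Gender],
        [Salary Range],
        [Purchases] ( [Product Name] )
    )
    USING Microsoft_Clustering

Each statement is run separately; once the structure and models have been processed against the source data, the models can be browsed or queried with prediction queries like the ones sketched in the following sections.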

Business Uses of Data Mining

Data mining terminology has not yet become completely standardized. There are terms that describe the business task and terms that describe the data mining techniques applied to those tasks. The problem is that the same terms are used to describe both tasks and techniques, sometimes with different meanings.

Reference 

The terms in this section are drawn from the book Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management by Michael J. A. Berry and Gordon S. Linoff, Second Edition (Wiley, 2004).

Berry and Linoff list six basic business tasks that are served by data mining techniques: classification, estimation, prediction, affinity grouping, clustering, and description and profiling. We've added a seventh business task to the list called anomaly detection. We describe each of these business task areas in the following sections, along with lists of the relevant algorithms included in SQL Server 2005 Data Mining. A word of warning: Some of these tasks overlap in ways that seem odd to the uninitiated because the distinctions between the areas are more mathematical than functional.

Classification

Classification is the task of assigning each item in a set to one of a predetermined set of discrete choices based on its attributes or behaviors. Consumer goods are classified in a standard hierarchy down to the SKU level. If you know the attributes of a product, you can determine its classification. You can use attributes like size, sugar content, flavor, and container type to classify a soda. Typical classes in business include Yes and No; High, Medium, and Low; and Silver, Gold, and Platinum. What these are classes of depends on the business context; Good Credit Risk classes might be Yes and No. Classification helps organizations and people simplify their dealings with the world. If you can classify something, you then know how to deal with it. If you fly often with the same airline, you have no doubt been classified as an elite-level, or Platinum, customer. Knowing this classification allows the airline employees to work with you in a way that is appropriate for top customers, even if they have never met you before. The key differentiating factors of classification are the limited (discrete) number of entries in the class set and the fact that the class set is predefined.

A common example of classification is the assignment of a socioeconomic class to customers or prospects in a marketing database. Claritas, with its PRIZM system, is one of several companies that have built an industry around classification. These systems identify classes of consumers who have common geographic, demographic, economic, and behavioral attributes and can be expected to respond to certain opportunities in a similar way.

Classification algorithms predict the class or category of one or more discrete variables, based on the other variables in the case set. Determining whether someone is likely to respond to a direct mail piece involves putting them in the category of Likely Responder or not. Microsoft Decision Trees, Microsoft Neural Network, and Microsoft Naïve Bayes are the first-choice algorithms for classification when the predict column is a discrete variable.

Estimation (Regression)

Estimation is the continuous version of classification. That is to say, where classification returns a discrete value, estimation returns a continuous number. In practice, most classification is actually estimation. The process is essentially the same: A set of attributes is used to determine a relationship. A direct mail marketing company could estimate a customer's likelihood of responding to a promotion based on past responses. Estimating a continuous variable called Response_Likelihood that ranges from zero to one is more useful when creating a direct marketing campaign than a discrete classification of High, Medium, or Low. The continuous value allows the marketing manager to determine the size of the campaign by changing the cutoff point of the Response_Likelihood estimate. For example, a promotions manager with a budget for 200,000 pieces and a list of 12 million prospects would use the predicted Response_Likelihood variable to limit the target subset. Including only those prospects with a Response_Likelihood greater than some number, say 0.80, would give the promotions manager a target list of the top 200,000 prospects. The continuous variable allows the user to more finely tune the application of the results.
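
As a rough illustration of that selection step, the scoring query might look something like the following DMX prediction query. It assumes a trained mining model named [Response Model] and a prospect list reachable through a data source named [Prospect Source]; all object and column names here are hypothetical.

    // Score every prospect with the trained model and keep only those
    // whose estimated response likelihood exceeds the 0.80 cutoff.
    SELECT
        t.[Prospect Key],
        Predict([Response Likelihood]) AS [Score]
    FROM [Response Model]
    PREDICTION JOIN
        OPENQUERY([Prospect Source], 'SELECT * FROM dbo.Prospect') AS t
        ON  [Response Model].[Age]    = t.[Age]
        AND [Response Model].[Income] = t.[Income]
        AND [Response Model].[Region] = t.[Region]
    WHERE Predict([Response Likelihood]) > 0.80

Lowering or raising the 0.80 cutoff shrinks or grows the target list until it fits the campaign budget.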

Estimation algorithms estimate a continuously valued variable based on the other variables in the case set. Microsoft has built several algorithms that can be used for either discrete or continuous variables. Microsoft Decision Trees and Microsoft Neural Network are good choices for estimating a continuous variable.

Most of the estimation algorithms are based on regression analysis techniques. As a result, this category is often called regression, especially when the algorithm is used for prediction.

Prediction

Where classification and estimation are the assignment of values that are correct by definition, prediction is the application of the same techniques to assign a value that can be validated at some future date. For example, you might use a classification algorithm to classify your customers as male or female based on their purchasing behaviors. You can use this classification as an input to designing various marketing programs.

Tip 

Be careful not to reveal your guess to your customers because it could adversely affect your relationship with them. For example, it would be unwise to use this variable by itself to send out promotional pieces for a "For Women Only" sale. However, the variable is useful for the business even though you will never know for certain which customers are actually male or female.

Prediction, on the other hand, seeks to determine a class or estimate as accurately as possible before the value is known. This future-oriented element is what places prediction in its own category. The input variables exist or occur before the predicted variable. For example, a lending company offering mortgages might want to predict the market value of a piece of property before it's sold. This value would give them an upper limit for the amount they'd be willing to lend the property owner, regardless of the actual amount the owner has offered to pay for the given property. In order to build a predictive data mining model, the company needs a training set that includes predictive attributes that are known prior to the sale, like total square footage, number of bathrooms, city, and school district, along with the actual sale price of each property in the training set. The data mining algorithm uses this training set to build a model based on the relationships between the predictive variables and the known historical sale price. The model can then be used to predict the sale price of a new property based on the known input variables about that property.

One interesting feature of predictive models is that their accuracy can be tested. At some point in the future, the actual sale amount of the property will become known and can be compared to the predicted value. In fact, the data mining process described later in this chapter recommends splitting the historical data into two sets, one to build or train the model, and one to test its accuracy against known historical data that was not part of the training process.

Tip 

The real estate sale price predictor is a good example of how data mining models tend to go stale over time. Real estate prices in a given area can be subject to significant, rapid fluctuations. The mortgage company would want to rebuild the data mining model with recent sales transactions on a regular basis.

Microsoft Decision Trees and Microsoft Neural Network are the first-choice algorithms for regression when the predict column is a continuous variable. When prediction involves time series data, it is often called forecasting. Microsoft Time Series is the first-choice algorithm for predicting time series data, like monthly sales forecasts.
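
A forecasting request against a trained Microsoft Time Series model is a simple DMX query. The model and column names in this sketch are hypothetical.

    // Return the next 12 predicted monthly values for each series in the model.
    SELECT
        [Product Line],
        PredictTimeSeries([Sales Amount], 12) AS [Forecast]
    FROM [Monthly Sales Forecast]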

Association or Affinity Grouping

Association looks for correlations among the items in a group of sets. E-commerce systems are big users of association models in an effort to increase sales. This takes the form of an association modeling process known as market basket analysis. The online retailer first builds a model based on the contents of recent shopping carts and makes it available to the web server. As the shopper adds products to the cart, the system feeds the contents of the cart into the model. The model identifies items that commonly appear with the items currently in the cart. Most recommendation systems are based on association algorithms.
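
The recommendation lookup itself is typically a singleton DMX prediction query against the association model. The sketch below assumes a model named [Market Basket] whose case set has a nested [Products] table; the model and product names are made up for illustration.

    // Given two items already in the cart, return the five products
    // most likely to be purchased along with them.
    SELECT FLATTENED Predict([Products], 5) AS [Recommendations]
    FROM [Market Basket]
    NATURAL PREDICTION JOIN
    (
        SELECT
        (
            SELECT 'Mountain Tire' AS [Product Name]
            UNION SELECT 'Tire Pump' AS [Product Name]
        ) AS [Products]
    ) AS t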

Microsoft Association is an association, or affinity grouping algorithm. Other algorithms, like Microsoft Decision Trees, can also be used to create association rules.

Clustering (Segmentation)

Clustering can be thought of as auto-classification. Clustering algorithms group cases into clusters that are as similar to one another, and as different from other clusters, as possible. The clusters are not predetermined, and it's up to the data miner to examine the clusters to understand what makes them unique. When applied to customers, this process is also known as customer segmentation. The idea is to segment the customers into smaller, homogeneous groups that can be targeted with customized promotions and even customized products. Naming the clusters is a great opportunity to show your creativity. Clever names can succinctly communicate the nature and content of the clusters. They can also give the data mining team additional credibility with the business folks.

Once the clustering model has been trained, you can use it to classify new cases. It often helps to first cluster customers based on their buying patterns and demographics, and then run predictive models on each cluster separately. This allows the unique behaviors of each cluster to show through rather than be overwhelmed by the overall average behaviors.

One form of clustering involves ordered data, usually ordered temporally or physically. The goal is to identify frequent sequences or episodes (clusters) in the data. The television industry does extensive analysis of TV viewing sequences to determine the order of programs in the lineup. Companies with significant Internet web sites may use sequence analysis to understand how visitors move through their web site. For example, a consumer electronics product manufacturers web site might identify several clusters of users based on their browsing behavior. Some users might start with the Sale Items page, then browse the rest of the e-commerce section, but rarely end with a purchase (Bargain Hunters). Others may enter through the home page, and then go straight to the support section, often ending by sending a help request e-mail (Clueless). Others may go straight to the e-commerce pages, ending with a purchase, but rarely visit the account management or support pages (Managers). Another group might go to the account management pages, checking order statuses and printing invoices (Administrators). A Sequence Clustering model like this one can be used to classify new visitors and customize content for them, and to predict future page hits for a given visitor.
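
In SQL Server 2005, this kind of next-step prediction is handled by the PredictSequence function in DMX. The following sketch is purely illustrative; the model name, nested-table layout, and page names are hypothetical and would depend on how the model's case set was defined.

    // Given the pages a visitor has viewed so far, in order,
    // predict the next two pages in the sequence.
    SELECT FLATTENED PredictSequence([Page Visits], 2) AS [Next Pages]
    FROM [Site Navigation Model]
    NATURAL PREDICTION JOIN
    (
        SELECT
        (
            SELECT 1 AS [Sequence], 'Home'       AS [Page] UNION
            SELECT 2 AS [Sequence], 'Sale Items' AS [Page]
        ) AS [Page Visits]
    ) AS t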

Microsoft Clustering and Microsoft Sequence Clustering are segmentation algorithms. The Microsoft Sequence Clustering algorithm is primarily designed for sequence analysis (hence the clever name).

THE POWER OF NAMING

When Claritas originally created its customer segmentation system called PRIZM, it likely used clustering techniques to identify about 60 different groups of consumers. The resulting clusters, called lifestyle types, were numbered 1 through 60+. It's clear that someone at Claritas realized that numbers were not descriptive and would not make good marketing. So, Claritas came up with a clever name for each cluster: a shorthand way to communicate its unique characteristics. A few of the names are: 02. Blue Blood Estates (old money, big mansions), 51. Shotguns and Pickups (working class, large families, mobile homes), and 60. Park Bench Seniors (modest income, sedentary, daytime TV watchers).


Anomaly Detection

Several business processes rely on the identification of cases that deviate from the norm in a significant way. Fraud detection in consumer credit is a common example of anomaly detection. Anomaly detection can take advantage of any of the data mining algorithms. Clustering algorithms can be tuned to create a cluster that contains data outliers, separate from the rest of the clusters in the model. Anomaly detection involves a few extra twists in the data mining process. Often it's necessary to bias the training set in favor of the exceptional events. Otherwise, there may be too few of them in the historical data for the algorithm to detect. After all, they are anomalies. We provide an example of this in the case studies later in this chapter.

Description and Profiling

The business task Berry and Linoff call description and profiling is essentially the same activity we earlier called undirected data mining. The task is to use the various data mining techniques to gain a better understanding of the complexities of the data. Decision trees, clustering, and affinity grouping can reveal relationships that would otherwise be undetectable. For example, a decision tree might reveal that women purchase certain products much more than men. In some cases, like women's shoes, this would be stereotypically obvious, but in others, like hammers, the reasons are less clear and the behavior would prompt additional investigation. Data mining, like all analytic processes, often opens doors to whole new areas of investigation.

Description and profiling can also be used as an extension to the data profiling tasks we described in previous chapters. You can use data mining to identify specific data error anomalies and broader patterns of data problems that would not be obvious to the unaided eye.

Business Task Summary

The definitions of the various business tasks that are suitable for data mining, and the list of which algorithms are appropriate for which tasks, can be a bit confusing. Table 10.1 gives a few examples of common business tasks and the associated data mining algorithms that can help accomplish them.

Table 10.1: Examples of Business Tasks and Associated Algorithms

Business task: Classifying customers into discrete classes
Example: Assigning each customer to an Activity Level with discrete values of Disinterested, Casual, Recreational, Serious, or Competitor.
Microsoft algorithms: Decision Trees, Naïve Bayes, Clustering, Neural Network

Business task: Predicting a discrete attribute
Example: Predicting a variable like ServiceStatus with discrete values of Cancelled or Active might form the core of a customer retention program.
Microsoft algorithms: Decision Trees, Naïve Bayes, Clustering, Neural Network

Business task: Predicting a continuous attribute
Example: Predicting the sale price of a real estate listing or forecasting next year's sales.
Microsoft algorithms: Decision Trees, Time Series, Neural Network

Business task: Making recommendations based on a sequence
Example: Predicting web site usage behavior. The order of events is important in this case. A customer support web site might use common sequences to suggest additional support pages that might be helpful based on the page path the customer has already followed.
Microsoft algorithms: Sequence Clustering

Business task: Making recommendations based on a set
Example: Suggesting additional products for a customer to purchase based on items they've already selected. In this case, order is not important. ("People who bought this book also bought...")
Microsoft algorithms: Association, Decision Trees

Business task: Segmenting customers
Example: Creating groups of customers with similar behaviors, demographics, and product preferences. This allows you to create targeted products and promotions designed to appeal to specific segments.
Microsoft algorithms: Clustering, Sequence Clustering

Roles and Responsibilities

Microsoft's Data Mining tools have been designed to be usable by just about anyone who can install BI Studio. After a little data preparation, a competent user can fire up the Data Mining Wizard and start generating data mining models. Data mining is an iterative, exploratory process. In order to get the most value out of a model, the data miner must conduct extensive research and testing. The data mining person (or team) will need the following skills:

  • Good business sense and good-to-excellent working relationships with the business folks: This skill set is used to form the foundation of the data mining model. Without it, the data miner can build a sophisticated model that is meaningless to the business.

  • Good-to-excellent knowledge of Integration Services and/or SQL: These skills are crucial to creating the needed data transformations and packaging them up in repeatable components.

  • A good understanding of statistics and probability: This knowledge helps in understanding the functionality, parameters, and output of the various algorithms. It also helps in understanding the data mining literature and documentation, most of which seems to have been written by statisticians.

  • Data mining experience: Much of what makes for effective data mining comes from having seen a similar problem before and knowing which approaches might work best to solve it. Obviously, you have to start somewhere. If you don't have a lot of data mining experience, it's a good idea to find a local or online data mining special interest group you can use to validate your ideas and approach.

  • Programming skills: To incorporate the resulting data mining model into the organization's transaction systems, someone on the team or elsewhere in the organization will need to learn the appropriate APIs.


