PART I INTRODUCING DATA MINING

Recently I spoke with the CEO and CIO of a major auto sales company about their databases. Among other things, we were trying to find the best way to store their massive quantities of sales data and other important corporate information. As we scanned the millions of rows in the databases, the CEO said in awe, "I bet there's a ton of information in here worth lots of money, if only we had a thousand years to make sense of it all." This comment sums up the typical reaction of many a corporate head, astronomer, doctor, and financial trader with stores of potentially valuable data but no way to make it work for them. Data mining leverages the computing power of today's servers to transform mountains of raw data into useful information.

The ever-increasing storage capacity of computers, coupled with easy access to powerful processing muscle, makes sophisticated data analysis possible in ways that were unimaginable only a few years ago. Until very recently, only large corporations and universities with access to supercomputers and mainframes could perform useful data-mining tasks. As powerful servers become affordable compared with the price of a supercomputer, smaller companies are able to harness this server power to mine their stored data and gain a competitive advantage in the marketplace. To mine data in ways that are innovative, and at times seemingly surreal, we must first understand the technologies available, and then how to apply them to a particular data cache (see "Your Life on Disk").

What Is Data Mining?

Data mining is the process of discovering meaningful patterns and relationships that lie hidden within very large databases. Because browsing through tables and records rarely leads to discovery of useful patterns, data is typically analyzed by an automated process, commonly referred to in data-mining lingo as Knowledge Discovery (KD). Knowledge Discovery is a component of data mining that uses the power of the computer combined with a human operator's innate ability to zero in on visually apparent patterns. By automating data mining, computers discover the patterns and trends present in the data while the person in charge of making use of these discoveries decides which patterns are truly relevant.

Data mining can find descriptive and predictive information. Which type of information you choose to discover depends largely on what you want to accomplish with the results. When predictive information is sought, the goal is to derive information that offers clues about a future event. For example, if a car dealer wants to know what she can get for a 1998 Ford Mustang with 68,000 miles on it, she's looking for predictive information. Assuming the dealer has kept a few years' worth of sales history, this warehouse of data can be mined and then used to assign an asking price and predict the selling price. Variables such as a car's year and model are fed into the computer, and a price prediction is derived from previous sales.

If the same car dealer wants to give a first-prize trip to Hawaii to the most profitable salesperson of the year (note that I say most profitable, not the one with the highest gross sales), she would want descriptive information about her employees' sales histories. Many factors other than total sales influence profitability, such as the number of referrals a salesperson gets from her regular customers. This kind of information has no predictive value per se, but it does accurately describe past events in a manner that may have been difficult to see by normal means, thus offering the opportunity to make decisions based on newly discovered relationships.

Why Use Data Mining?

Data mining offers business advantages, as well as solutions to some mounting problems associated with exploiting the knowledge embedded within corporate databases. Three developments set the stage for it:

  • Growing disk space capabilities
  • Improvements to relational database management system (RDBMS) engines
  • Enhancements to online analytical processing (OLAP)

It's no surprise to anyone in the information technology field that disk space is becoming cheaper as it becomes more abundant.

Your Life on Disk

British Telecom is exploring the idea of storing everything a person sees and hears on disk! "Over an 80-year life we process 10 terabytes of data...," to quote Ian Pearson, the official Futurologist at British Telecom. As surreal as this may sound, it does show that disk storage capacity isn't a concern for data miners.

Disk space is becoming cheap enough that data storage is no longer as much of a concern as is making sense of the data stored. For more everyday examples, you only need to look at banks and credit card companies. They commonly store and archive every single customer transaction that occurs for the life of an account. Obviously, these companies hope to use this data to find out more about their customers and to discover the characteristics of an ideal member. Given that many major credit card companies print nearly half a billion statements a month, automated data mining is the only hope of finding any meaning in the mass of information housed on their hard disks.

In response to this explosion of data on corporate hard disks, great strides have been made to improve the response time of RDBMS engines to queries issued against them. However, once data has been stored for long periods of time, it becomes more useful to query it for aggregated information rather than individual line items. For instance, a large retail chain will be more interested in sums of sales per region and per product type than in an analysis of the individual sales themselves. Aggregating data at this level, although possible with most SQL-compliant RDBMS engines, isn't a process those engines are optimized for. When it comes to queries, these engines are specifically optimized to find sets of data matching given criteria, not to continually perform mathematical calculations.
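As a minimal sketch of the kind of aggregation in question, the following standard SQL rolls sales up by region and product type (the Sales table and its column names are hypothetical):

    SELECT Region,
           ProductType,
           SUM(SaleAmount) AS TotalSales   -- aggregate computed on the fly
    FROM Sales                             -- hypothetical line-item table
    GROUP BY Region, ProductType
    ORDER BY Region, ProductType

Run against millions of line items, a query like this forces the engine to scan and recompute the sums on every execution, which is precisely the work a transaction-oriented engine is not tuned to repeat.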

The OLAP database was created specifically to alleviate the problem of calculating aggregations on the fly. Unlike RDBMS engines, OLAP is designed to precalculate aggregates and store them in a manner that permits queries to simply return results from preprocessed tables. This takes advantage of the abundant availability of disk space while eliminating the need for expensive processing power.

In addition to the storage benefits provided by these OLAP systems, special storage and display features allow users access to vast data archives by using rolled-up views of aggregates. While OLAP certainly gives many companies a better means of handling their information, it does not tell these companies what to look for. For instance, large auto dealerships spend a great deal of time analyzing the profit margin of each make and model of car they sell. The bottom line dictates next year's showroom. Dealerships analyze most of the obvious factors such as price, odometer reading (if it's a used car), model, and make, but the most successful dealerships employ experienced staff who are able to identify other less obvious but equally significant factors such as the color, engine size, and transmission type.

Even the most experienced dealers still make a few mistakes simply because many hidden factors, which are hard to account for, come into play. These can include the amount of time a car sits on the lot, its location on the lot in relation to other makes of cars (if you park sports cars next to trucks, the cars may be less visible), the weather, the time of year, and the Dow Jones Industrial Average. The list, as you can see, is almost endless. The fact is that taking all factors into account would require a staff the size of the government's Economic Forecasting Team, which would be enough to deter most dealers from delving this deeply into their data.

Despite its strengths as a data-mining tool, OLAP still requires an initial hypothesis to give direction to the effort of navigating the data. The data is then used to prove or disprove the theory behind the effort. In databases containing a large number of tables and columns, using OLAP often results in trial-and-error approaches, which take large amounts of time and tend to produce mediocre results. As with manual relational database mining, the discovery of patterns and meaningful relationships using OLAP is limited by the person who writes the hypothesis and by the time allotted to explore the data.

The advantage data mining has over OLAP is that it allows the computer to examine every conceivable factor that might affect an outcome and to draw conclusions from that analysis. The process is essentially about finding relationships between the attributes that describe a case. If we applied data mining to the earlier car-dealership example, every attribute of each sale, such as the color or the model, would be analyzed and taken into account when predicting an outcome such as the price range, which can be inferred from the relationship between the attributes of each car and its price.

To be of any use, the automated process needs to present the dealer with any and all situations that might affect the outcomes, even if they don't seem to make sense at first glance. For instance, based on the data, the dealer might find that pink Ford trucks sell better when placed in the corner of the lot alongside used Yugos, but only on Wednesdays, except for Wednesdays in November. Because of the sheer number of factors involved, and because of a natural inclination to preemptively dismiss the improbable, this kind of relationship would likely go unnoticed by a human. Not so for the computer.

How Data Mining Is Currently Used

Data mining is particularly valuable for organizations that collect large quantities of historical information. Banks, insurance companies, credit card companies, and even astronomers use this technology to derive critical information from large, unwieldy data samples. One of the best known applications of data mining is individual credit risk assessment. When applicants fill out loan applications, they're often asked to provide Social Security numbers, addresses, and the usual identifying information, but they're also required to give other bits of information that say something about them. There are questions about whether the applicant is a renter or a homeowner, how long he or she has lived at a current address and been with a given employer, his or her marital status and educational level, and the list goes on and on.

Because financial institutions have a large customer base to draw data from, they are prime candidates for data-mining technology, which can analyze the data and discover the correlations between applicants' personal characteristics and the probability of loan default. Needless to say, allowing the computer to evaluate and examine all the variables that affect this outcome enables financial institutions to process hundreds of thousands of loan applications at one time with a fraction of the manpower needed in the past.

Data-mining techniques are also widely used in the retail industry to determine the best floor arrangement for products. For instance, a retail store might want to find a way to maximize sales of golf equipment. By examining sales transactions of golf equipment that go back four years or more, one might observe that customers who buy golf equipment often buy a pair of men's shoes at the same time. Armed with this knowledge, the retail store might decide to place golf clubs next to the men's shoe department to maximize the sales opportunity that their proximity offers.

Bar coding has made it possible for grocery chains to identify each item included in a purchase. This data can then be studied and conclusions drawn about purchase relationships. For example, if a store wants to find out how to sell more beer, the grocery chain can retrospectively examine all the beer purchases to find out what else beer customers put in their grocery carts. If it's found that customers who buy beer also buy diapers, the store might increase beer sales by running a special on diapers.


Note

This beer and diaper example is not my own invention. For some odd reason, perhaps because of the ludicrous image that it conjures up, it happens to be a classic example used in data-mining literature to discuss certain types of data-mining algorithms.

Interestingly, data mining is used in many fields, such as medical diagnosis and meteorology. These disciplines, like their profit-oriented counterparts, confront the challenge of making sense out of mind-boggling stacks of data. In short, any business or academic pursuit that collects and studies large quantities of data is a candidate for data mining.


Defining the Terms

Data mining is often associated with other data storage and data manipulation techniques such as data warehousing and online transaction processing (OLTP). These techniques share terminology, and some terms are used interchangeably. In the interest of clarity, I'd like to define each of these techniques, the terms they share, and those terms that relate specifically to data mining.

  • Data mining: In a nutshell, data mining is the process of discovering meaningful patterns and relationships through the automated analysis and classification of large stores of historical data.
  • Data warehousing: A data warehouse is a central store of data that has been extracted from the operational data in an OLTP database. OLTP systems are designed to store operational data for efficient transaction processing, and because their data structures are hard for the end user to understand, they are difficult to report from. Transferring the data to a data warehouse places the information in a structure more convenient for reporting. Unlike an OLTP system, a data warehouse accepts new data without changing existing data, so its storage structures are designed to house huge quantities of information in structures that favor fast retrieval over efficient transaction processing.
  • Mining models: A mining model is the physical structure that holds the patterns a data-mining algorithm has compiled from a data set, along with a description of the original data. This structure is then used as a basis for making predictions about new data by "filling in the blanks" left by missing values. (A sketch of what defining such a model can look like appears after this list.)
  • Patterns: A pattern is a set of events that occurs with enough frequency in the data set to reveal a relationship between them. Revealing the relationship is usually an inductive reasoning process. For example, you might learn from a data set that every time a customer buys beer, she also buys diapers. If this event occurs with enough frequency, the data-mining algorithm will identify it as a predictable pattern to be stored in a model. In this way, an operator who browses the data-mining model will clearly see that people who buy beer have a high probability of also buying diapers.
  • Cases: Each item of historical data that is used as a source for a data-mining model is a case. For example, if a mining model describes customer purchases at a grocery store, every single purchase would be a unique case that contributes to the "experience" of the data-mining model.
  • Data-mining algorithms: A data-mining algorithm is the mathematical and statistical procedure that transforms the cases in the original data source into the data-mining model. How the model looks depends largely on the algorithm applied to the data. As you'll discover later, additional algorithms can be added, but Microsoft Data Mining Services, introduced in Microsoft SQL Server 2000, provides decision tree and clustering algorithms right out of the box.
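To make the mining-model idea concrete, here is a minimal sketch of a model definition using the CREATE MINING MODEL syntax from the OLE DB for Data Mining specification. The CarSales model and its column names are hypothetical, and the sketch assumes the Microsoft_Decision_Trees service name:

    CREATE MINING MODEL [CarSales]
    (
        [Car ID]      LONG   KEY,                -- identifies each case
        [Model Year]  LONG   CONTINUOUS,         -- input attributes
        [Mileage]     LONG   CONTINUOUS,
        [Color]       TEXT   DISCRETE,
        [Sale Price]  DOUBLE CONTINUOUS PREDICT  -- the value to predict
    )
    USING Microsoft_Decision_Trees

The PREDICT flag marks the attribute the discovered patterns should explain, and the USING clause names the algorithm, here the decision tree algorithm discussed later in this chapter.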

Data Mining Methodology

As with any discipline involving information systems, data mining requires that a plan be devised and followed to get from the idea to the final implementation. The components of a data-mining plan are listed below and illustrated in Figure 1-1.

  • Analyzing the problem
  • Extracting and cleansing the data
  • Validating the data
  • Creating and training the model
  • Querying the data-mining model
  • Maintaining the validity of the data-mining model

Figure 1-1. Data-mining methodology.

Analyzing the Problem

The source databases have to be assessed to see whether they meet the criteria for data mining. The quality and abundance of the data are the primary factors in deciding whether the data is at all suitable. In addition, the expected results of the data-mining effort have to be carefully understood to be sure that the existing data contains the right kind of extractable information. For instance, there's no point in pursuing a data-mining project for a grocery store chain if the data gathered from the cash register does not identify items for each shopping cart. Once the expected results are determined, the best algorithm for the job is chosen.

Extracting and Cleansing the Data

The data is first extracted from its native sources, such as OLTP databases, text files, Microsoft Access databases, and even spreadsheets. This data is then placed in a data warehouse whose structure is compatible with the data model. Typically, Data Transformation Services (DTS) is used to extract the data and cleanse it of inconsistencies, converting it to a uniform format.
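As a rough illustration of the kind of cleansing involved, a DTS task might run SQL along these lines while loading the warehouse (the staging and destination tables and their columns are hypothetical):

    INSERT INTO CarSalesClean (CarID, ModelYear, Color, SalePrice)
    SELECT CarID,
           ModelYear,
           UPPER(LTRIM(RTRIM(Color))),      -- normalize inconsistent text values
           ISNULL(SalePrice, 0)             -- supply a default for missing prices
    FROM CarSalesStaging
    WHERE ModelYear BETWEEN 1900 AND 2001   -- discard obviously invalid years

Whether defaulting, discarding, or manually reviewing bad rows is appropriate depends on the project and on how the cleansed values will affect the patterns discovered later.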

Validating the Data

Once the data has been extracted and cleansed, it's a good practice to scroll through the warehouse structures you've created to make sure that all the data is present and complete.

Creating and Training the Model

When an algorithm is applied to a model, a structure is generated. It's important at this point to browse the generated data to make sure that it accurately resembles the facts in the source data. This can't be done in any great detail, but any serious idiosyncrasies can easily be discovered just by looking at the generated model. This process is covered in greater detail in Chapter 9, "Using Code to Interact with Data Mining."
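In OLE DB for Data Mining terms, training a model is an INSERT INTO statement. The following sketch feeds the hypothetical CarSales model defined earlier from a relational table; the server name, catalog, and source query are placeholders:

    INSERT INTO [CarSales]
        ([Car ID], [Model Year], [Mileage], [Color], [Sale Price])
    OPENROWSET('SQLOLEDB',
        'Data Source=MyServer;Initial Catalog=Dealership;Integrated Security=SSPI',
        'SELECT CarID, ModelYear, Mileage, Color, SalePrice FROM Sales')

As each case flows in, the algorithm named in the model's USING clause updates the stored patterns; browsing the trained model in Analysis Manager afterward is the quickest sanity check.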

Querying the Data-Mining Model

Once the proper model is created and generated, the data is made available for decision support. This usually involves writing front-end query applications with Microsoft Visual Basic (VB) or Active Server Pages (ASP) through the OLE DB for Data Mining provider. Third-party reporting tools that understand OLE DB for Data Mining are also an option at this point.

Maintaining the Validity of the Data-Mining Model

As the data-mining model is populated over time, initial data characteristics such as granularity or validity might change. For instance, after six months of populating our grocery chain model, we discover that fresh fish has been taken out of the meat section and placed in its own seafood section and that sliced cheese has been moved from the dairy section to the deli section. Even something as small as classifying a six-pack of Coke as a single item instead of six individual items has a dramatic impact on the accuracy of future predictions because it changes the attributes that the original models were based on.

Overview of Microsoft Data Mining

Data mining is a decision support tool that stands on its own when it comes to analyzing large databases. It has features designed to address decision support problems that cannot be solved by other data analysis tools, yet it is sometimes confused with tools such as OLAP. This section describes the basic components and features of data mining.

Data Mining vs. OLAP

Both data mining and OLAP are components of Microsoft Analysis Services. Both serve as decision support tools, but each is designed for a different use. OLAP is primarily designed to store data in summarized tables to facilitate the retrieval and navigation of this data by end users. Many vendors claim that this is a data-mining solution because the user can discover information by browsing summarized data, which can then be analyzed to find causal relationships. In most cases, however, the user is navigating through dimensions whose meaning and relationships are already well known. As a result, an interactive display or report showing the breakdown of auto sales by year, make, model, and region is intuitive for the user. If the dealership sells different cars and trucks in different parts of the country, translating a display of this information into a relevant understanding of business activity is a simple matter. OLAP could be used to try to discover new information, but because the discovery is really being done by the end user, with the assistance of an OLAP tool, it is bound to be haphazard and incomplete. Data mining is less concerned with letting an end user easily browse summary data than with automatically discovering new patterns and rules that can be applied to predict future results. In short, OLAP is an efficient storage and retrieval mechanism, while data mining is a knowledge discovery tool.

Data-Mining Models

Source data needs to be structured in a way that optimizes predictions based on established variables. As I mentioned earlier, these structures are created by algorithms. When an algorithm is applied to a data structure, the structure is populated with data in a manner that reflects the relationships and patterns of the original data set, thus allowing predictions based on that data to be made easily. Microsoft uses special data structures to store the data-mining models, which can then be browsed using Microsoft Analysis Manager, applications that use OLE DB for Data Mining, or Component Object Model (COM) interfaces such as Decision Support Objects (DSO). All these tools allow new data models to be created, copied, altered, and deleted. Through OLE DB, it's possible to establish direct data connections to various sources, such as Microsoft SQL Server, text files, Microsoft Access, Microsoft Excel, or even Oracle and DB2.

Models vs. Patterns

It's easy to confuse patterns with models; Webster's Dictionary considers them synonymous, but in the context of data mining the two concepts are not interchangeable. Patterns are recurring sets of data, such as 111211121112111.... The repetition in this example allows us to predict with some accuracy that the next number will be 2. A model, in the world of data mining, is a special data structure that stores the cases processed by an algorithm in such a way that the model contains the same patterns found in the raw databases. Because the model stores the patterns, it makes it possible for us to predict that a 2 will follow each run of three 1s in the example above.

Data-Mining Algorithms

Chapter 6, "Microsoft Clustering," and Chapter 7, "Third-Party Algorithms," discuss data-mining algorithms in greater detail. Some of these algorithms are natively supported by Data Mining Services, and others are integrated into Data Mining Services by third-party vendors. For the most part, the choice of algorithm is based on the type of model you want to wind up with; therefore, my descriptions focus on the goal of the data-mining process rather than the technical details behind each algorithm.

Decision Trees

A decision tree algorithm analyzes the data and creates a repeating series of branches until no more relevant branches can be made. The end result is a binary tree structure where the splits in the branches can be followed along specific criteria to find the most desired result.

Clustering

A clustering algorithm, unlike a decision tree, does not split data along any lines but rather groups data in clusters. As shown in Figure 1-2, clustering is most useful for visual representations because the data is grouped around common criteria. Many front-end tools display these groups as bullet holes in a target. As with any target that's been shot at for a while, the entire surface is scattered with holes, but some areas have higher concentrations, or clusters, of holes. If the bullet holes represent sales data, the clusters represent sales cases that have a lot in common. By looking at the intersections of the data points (or bullet holes), we can see what those sales have in common.


Figure 1-2. Representation of clustering data.
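To use clustering instead of decision trees, only the USING clause of the hypothetical model sketched earlier needs to change; the following assumes the same made-up CarSales columns and the Microsoft_Clustering service name:

    CREATE MINING MODEL [CarSalesClusters]
    (
        [Car ID]      LONG   KEY,
        [Model Year]  LONG   CONTINUOUS,
        [Mileage]     LONG   CONTINUOUS,
        [Color]       TEXT   DISCRETE,
        [Sale Price]  DOUBLE CONTINUOUS
    )
    USING Microsoft_Clustering

After training, browsing this model in Analysis Manager reveals the groups of similar sales the algorithm has found, rather than a tree of splits.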

Association

In order to do market-basket analysis or cross-selling effectively, users need to know which factors "associate" with each other. The typical problem the association algorithm solves is finding out which items in a store are bought together, such as the classic beer-and-diapers example mentioned earlier. This algorithm is not provided natively by Data Mining Services; it's one of the algorithms provided either by third-party vendors or by your own programming efforts.

Regression Analysis or Sequencing

The regression analysis algorithm is used to find the relationship between a predicted outcome and several or all of the variables that can influence it. Regression methods have their roots in statistical analysis and include linear, logistic, and nonlinear regression. As with association algorithms, regression analysis is not provided natively by Data Mining Services but can be added by third-party vendors.

OLE DB for Data Mining

This new SQL Server 2000-based protocol is designed to simplify communication and to provide better integration of data-mining tools with data storage and management applications, especially SQL Server. Although OLE DB for Data Mining is a recent arrival, there are quite a few third-party vendors who now provide support for this standard in their products.

Using SQL Server Syntax to Data Mine

When it comes to data extraction, the idea behind OLE DB for Data Mining is to extend SQL syntax with provisions specific to data mining. This is very similar in philosophy to the multidimensional expressions (MDX) extensions, which were designed to facilitate the extraction of OLAP data using the OLE DB for OLAP provider.

One of the new additions is the PREDICTION JOIN clause, which is similar to the join operation in standard SQL syntax. Essentially, its function is to join a populated data-mining model to another table that contains the variables needed to predict the outcomes in a prediction result set. What makes this clause particularly useful is that the output can be returned in the form of an OLE DB or ActiveX Data Objects (ADO) result set. This goes a long way toward making data-mining features available to developers without requiring them to first build the underlying mathematical algorithms from scratch. This level of access allows the process to be used to enhance decision support systems for risk analysis, market-basket analysis, and so on.
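Here is a minimal sketch of the clause, predicting sale prices for cars in a hypothetical NewStock table against the CarSales model defined earlier; the server, catalog, and table names are placeholders:

    SELECT t.CarID,
           Predict([Sale Price]) AS EstimatedPrice   -- value filled in by the model
    FROM [CarSales]
    PREDICTION JOIN
    OPENROWSET('SQLOLEDB',
        'Data Source=MyServer;Initial Catalog=Dealership;Integrated Security=SSPI',
        'SELECT CarID, ModelYear, Mileage, Color FROM NewStock') AS t
    ON  [CarSales].[Model Year] = t.ModelYear
    AND [CarSales].[Mileage]    = t.Mileage
    AND [CarSales].[Color]      = t.Color

Executed through the OLE DB for Data Mining provider, the result comes back as an ordinary recordset that a VB or ASP application can bind to directly.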

Summary

Data mining is a brand new addition to the SQL Server suite of data management tools. This first chapter introduced Data Mining Services and the concepts and terms associated with it, such as data warehousing, mining models, patterns, cases, and data-mining algorithms. I've also spent some time differentiating data mining from other processes such as OLAP and SQL.

Data mining is specifically designed to address the needs of those who are faced with gathering meaningful trends and patterns from very large quantities of data. As my examples show, the practical applications of data mining are endless.

Any data-mining solution involves the use of cases, which come from raw data usually stored in a data warehouse of some sort. The quantity of data is usually far too large to be analyzed as is, so a data-mining model is created and an appropriate algorithm is chosen. The data-mining model is then populated, or "trained," with these cases using a data-mining algorithm, resulting in a logical data structure that represents a condensed version of all the patterns and relationships present in the original data set. The data-mining models and the data contained in them can be accessed using various methods detailed in the OLE DB for Data Mining specification, available at http://www.microsoft.com/data/oledb/dm.htm.


