
Organizations apply data-mining technology to their stores of data for two reasons: to make sense of their past and to make predictions about their future. Ultimately, they use the information taken from this computerized fortuneteller to make decisions about their futures. As we'll see shortly, the kind of information an organization wants determines the kind of model it builds and the algorithms applied to the structured data.


Real World: Fortune Telling for the Real World

Predictive, as opposed to descriptive, data modeling derives future results from theoretical data sets. These results are used by organizations and companies to determine what actions to take. For instance, we could build a model that would provide data that predicts the price range for a new Lincoln Continental. The model could also yield data that predicts how long the car would sit on the lot at a given price. Based on these two predictions, the dealer could determine the optimal selling price and how long he could expect this model of car to sit on the lot at this price. The dealer might also use the data to attract a certain clientele. If queried correctly, the data could tell the dealer what kind of customer buys what kind of car. The resulting model would provide a list of car makes and models that appeal to wealthy retirees. Another list would tell the dealer the average income of SUV drivers. Although this data analysis doesn't require target variables to work, actions can still be taken based on the data. For instance, based on past sales data and demographics, a dealership located in West Palm Beach, Florida, an area with a large population of retired folks, might decide to fill their lot with Cadillacs and Lincoln sedans.


Directed Data Mining

To perform a directed data-mining operation, you supply all the known factors, or input variables, so that the data-mining engine can find the best correlations between those attributes and the rules within the data-mining model. The prediction concerns unknown values, or target variables: the data-mining engine finds the most likely value for each missing target based on the known values provided as input variables.

Directed data mining uses the most popular data-mining techniques and algorithms, such as decision trees. It classifies data for use in making predictions or estimates with the goal of deriving target values; in fact, it's the request for target values that gives directed data mining its "direction." A wide range of businesses use it. Banks use it to predict who is most likely to default on a loan, retail chains use it to decide whom to market their products to, and pharmaceutical companies use it to predict new prescription drug sales.
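
To make the idea concrete, here is a minimal sketch of directed data mining in Python using a decision tree classifier from scikit-learn. The column names and the tiny training table are hypothetical, and scikit-learn merely stands in for the data-mining engine; the flow from input variables to a target variable is what matters.

# Directed data mining sketch: known input variables predict a target variable.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical input variables: [annual_income, outstanding_debt, years_at_job]
known_cases = [
    [55000, 12000, 6],
    [23000, 18000, 1],
    [80000,  5000, 10],
    [31000, 22000, 2],
]
# Target variable supplied with the known cases: 1 = defaulted, 0 = paid on time
defaulted = [0, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=2)
model.fit(known_cases, defaulted)          # derive rules from the known cases

# A new applicant with known input variables but an unknown target value:
new_applicant = [[42000, 15000, 3]]
print(model.predict(new_applicant))        # most likely value for the missing target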

Undirected Data Mining

Because undirected data mining isn't used to make predictions, target values aren't required. Instead, the data is placed in a format that makes it easier for us to make sense of. For example, an online bookseller could organize past sales data to discover the common characteristics of genre readers from the United States. The data could reveal (as unlikely as it seems) that readers of murder mysteries are predominantly women from Maine and upstate New York and readers of science fiction are predominantly Generation-X men living in Washington and Oregon. Clustering is the algorithm commonly used for mining historical data. As the name implies, clustering clumps data together in groups based on common characteristics. The bookseller's marketing department could use these revelations to direct the company's advertisements for sci-fi books to men's magazines.

Another way to treat this data is to take one of the derived clusters and apply the decision trees algorithm to it. This allows you to focus on a particular segment of the cluster. In fact, clustering is often the first step in the process used to define broad groups. Once groups are established, directed data mining is used on groups that are of particular interest to the company.
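
A sketch of that two-step approach follows, with made-up bookseller data: k-means, a common clustering technique, groups the customers first, and a decision tree is then trained only on the cluster of interest. The features, cluster count, and purchase flag are illustrative assumptions, not the book's own data.

# Undirected step: cluster customers, then direct a decision tree at one cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical customer features: [age, books_bought_per_year]
customers = np.vstack([
    rng.normal([35, 8], 3, size=(100, 2)),    # one loose group of younger, lighter buyers
    rng.normal([62, 20], 3, size=(100, 2)),   # another group of older, heavier buyers
])
bought_scifi = rng.integers(0, 2, size=200)   # made-up purchase flag per customer

# Undirected step: group customers by common characteristics.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)

# Directed step: focus on one cluster and model a target inside it.
mask = clusters == 0
tree = DecisionTreeClassifier(max_depth=3).fit(customers[mask], bought_scifi[mask])
print(tree.predict([[33, 9]]))                # prediction for a new customer in that segment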

Data Mining vs. Statistics

Newcomers to data mining naturally tend to think of the term as a new way to say statistics. Hard-core cynics even refer to data mining as "statistics + marketing." Nothing could be further from the truth, but since many of the data-mining algorithms were indeed defined by statisticians, it's no wonder that some of the terminology and techniques are suspiciously similar, such as probability, distribution, regression, and so on. Data mining as a process is influenced by several disciplines, including statistics, machine learning, and high-performance computing. If you examine how quantitative information is analyzed, you'll find that data mining and statistics are concerned with many of the same problems and use many of the same functions. Data mining and statistics both use

  • Uniformly stored and relevant data
  • Input values
  • Target values
  • Models

OK, so why then should you read this book instead of Advanced Statistics for Data Base Queries? (I don't know if this book exists, but if it does, buy it!) This question doesn't have a simple answer, but an understanding of basic statistical principles will certainly help you appreciate the differences. Data mining is different from pure statistics in several ways.

First, data mining is a data-driven process that discovers meaningful patterns previously unseen or otherwise prone to be overlooked, while statistical inference begins with a hypothesis conceived by a person, who then applies statistical methods to prove or disprove that hypothesis. Data mining also differs from statistics in that the objective is to derive predictive models that can easily be translated into new business rules.

The defining feature of data mining is that the machine, not the operator, does the complex mathematics used to build the predictive model. Computers do the high-level inductive reasoning required to analyze large quantities of raw data and can then output the results in a format that the owner of the data, whether she be a nuclear physicist or a cattle rancher, can understand.

Only by analyzing large sets of cases can accurate predictions be made. Pure statistical analysis requires the statistician to perform a high degree of directed interaction with the data sets, which interferes with the potential for making new discoveries. Discovery, in the world of data mining, is the process of looking at a database to find hidden patterns in the data. The process does not take preconceived ideas or even a hypothesis to the data. In other words, the program uses its own computational ability to find patterns, without any user direction. The computer is also able to find many more patterns than a human could imagine. What distinguishes data mining from the science of statistics is the machine.

This said, data mining's roots are revealed in the language it shares with statistics. Some common terms that are important to know are discussed here.

Population

Population is the group that comprises the cases that are included in the data-mining model. In order for the data-mining effort to be successful, the characteristics common to this group must be clearly defined. As obvious as this might seem, the task of creating a well-defined population is more challenging than you might guess. If you're studying customer-buying habits in one store, the population is clearly defined; anyone who made the cash register ring is part of that population. To study that population, all you need to do is mine the data gathered from the point-of-sale machines.

However, if your population is created by asking residents of a specified geographical area to fill out a questionnaire about their shopping habits, you have to take into account the margin of error created by the data collection method and by the questions asked. This population is better defined as residents of a neighborhood who were willing to answer the questionnaire. If your questionnaire was returned by 20 percent of the residents, your data isn't an accurate representation of the neighborhood or its shopping habits.

Sample

If a data set contains an unmanageable number of cases, a smaller, random subset of the larger data set, or sample, is extracted and mined. The assumption is that because the sample is randomly selected, it represents the same data distributions as the larger set. There are no hard and fast rules, but experience and knowledge about the data dictate the size of the smaller sample. Too small a sample might contain cases with too many exceptions that would then be overrepresented in the resulting patterns. A sample needs to be large enough to absorb the impact of exceptions in the same way as the larger data set would.
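
A minimal illustration of drawing such a random sample, assuming the cases are simply rows in a Python list; in practice the extraction would run against the database itself.

import random

random.seed(42)                              # repeatable for illustration only
all_cases = list(range(1_000_000))           # stand-in for an unmanageably large case set
sample = random.sample(all_cases, 10_000)    # random subset assumed to mirror the distribution
print(len(sample), sample[:5])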

Range

A range measures the spread between the extreme values in a data set. For instance, cases for automobile sales might show that most autos cost between $10,000 and $30,000. The cars in one example might include a car that sold for $100 and another that sold for a whopping $300,000. These two fringe figures are valuable in that they help define the boundaries. In statistical terms, the range in this case is said to be $299,900 ($300,000 - $100).

Bias

When you extract a sample of data for analysis, the method you use to choose the data could bias, or skew, the results. A biased extraction can misrepresent the entire population it came from. For instance, let's say you want to measure the purchasing habits of the population of Minnesota. Because this data set is uncommonly large, you decide to extract a sample of the residents of Minneapolis and Saint Paul. The fact that the entire sample population is urban in a predominantly rural state would bias the results. The results of this data-mining effort would fail if you tried to apply the resulting predictions to data that came from Albert Lea, Minnesota. Attributes particular to Minneapolis-Saint Paul would not describe the small town of Albert Lea.

When you extract a sample from a data set using nonrandom methods, take great care to ensure that the methods you use do not inject bias into the data set, or if you can't completely avoid the bias, at least understand it and take it into account when making decisions based on your results.

Mean

Mean is another word for average. To calculate the mean, add up the values of a column and divide the result by the number of instances, as shown in Table 4-1. Averages are heavily affected by extreme values, especially if the total number of instances is small. For example, if nine persons have annual incomes of $100,000, $50,000, $50,000, $15,000, $15,000, $15,000, $15,000, $9,000, and $9,000, their average income is $30,889; this average isn't representative of any one person's annual income, but it is an average nonetheless. If a new value of $100,000 is added, the average jumps to $37,800. Averages for very large data sets are more accurate provided there isn't a large gap between high and low values.

Table 4-1. Calculating the Mean
Wage earner Annual Income
Mary $100,000
John $50,000
Jim $50,000
Fred $15,000
Jean $15,000
Bill $15,000
Kim $15,000
Ron $9,000
Lea $9,000
Total $278,000
Average (Total/9) $30,889

Median

A median is the middle value in a data set as shown in Table 4-2. In this example, the middle value is $15,000. Median values are less affected by extreme values. They come in handy when determining an average in a data set that might have some data that is too loosely defined. For instance, it's conceivable that in our example incomes over $100,000 are simply assigned a value of Income Over $100,000. Without specific values over $100,000, it's impossible to calculate a mean but easy to determine a median.

Table 4-2. Determining the Median
Wage earner Annual Income
Mary $100,000
John $50,000
Jim $50,000
Fred $15,000
Jean $15,000
Bill [MEDIAN] $15,000
Kim $15,000
Ron $9,000
Lea $9,000
Total $278,000
Median $15,000

Distribution

The position, arrangement, and frequency of the different values in a set of data is known as its distribution. When you create a model, it's important to understand how many different values exist in a column. A column with only one state name such as CA in it is a poor candidate for data mining because there's no distribution in the data. On the other hand, if six states are represented in that column, the distribution of data is measured across six different values. The values are then weighted to determine how many occurrences of each state exist in the model. The best way to view the distribution is with a graph with the states plotted on the x-axis and the number of occurrences plotted along the y-axis, as shown in Figure 4-1.


Figure 4-1. Distribution across six states.

At times, the column is populated with an either/or value such as Male or Female, or TRUE or FALSE. This kind of distribution is known as a binomial distribution.

Mode

Mode refers to the most frequently occurring value in a distribution ($15,000 in the previous example).
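
Using the nine incomes from Table 4-1, the mean, median, and mode can be computed directly; Python's statistics module is used here purely as an illustration of the three measures.

from statistics import mean, median, mode

incomes = [100_000, 50_000, 50_000, 15_000, 15_000, 15_000, 15_000, 9_000, 9_000]

print(round(mean(incomes)))   # 30889 -- pulled upward by the $100,000 earner
print(median(incomes))        # 15000 -- the middle value, unaffected by the extremes
print(mode(incomes))          # 15000 -- the most frequently occurring value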

Variance

Unlike a range, which only measures the values at the extremities of a spectrum, variance is the sum of the squared differences between each value in a data set and the mean of that data set, divided by one less than the number of values in the data set. For example, the variance of the salary data in Table 4-1 is

((100,000 - 30,889)^2 + (50,000 - 30,889)^2 + ... + (9,000 - 30,889)^2) / 8

Variance is used to determine how widely the values in a data set are dispersed from the mean, the expected value.

Standard Deviation

Standard deviation is the square root of the variance. Like the variance, the standard deviation measures the dispersion of values around the mean. Differences from the mean are easier to express with standard deviation than with variance because standard deviation is expressed in the same units as the values themselves. Saying that a value lies one or two standard deviations from the mean tells you immediately how far from the mean that value is.
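
Continuing with the Table 4-1 incomes, the sample variance and standard deviation follow directly from the definitions above; statistics.variance and statistics.stdev divide by one less than the number of values, just as the formula does.

from statistics import mean, variance, stdev

incomes = [100_000, 50_000, 50_000, 15_000, 15_000, 15_000, 15_000, 9_000, 9_000]

m = mean(incomes)
var = sum((x - m) ** 2 for x in incomes) / (len(incomes) - 1)   # the formula spelled out
print(round(var))               # same result as the library call below
print(round(variance(incomes)))
print(round(stdev(incomes)))    # square root of the variance, expressed in dollars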

Correlation

When a variable changes as a result of a change to another variable, the two variables are said to correlate. For example, there is a correlation between income and taxation. Generally, the more money we make, the more taxes we pay.



Caution

It is extremely important to understand correlation as it relates to data mining not only because of its predictive value, but also because correlated attributes can actually appear to be predictions. For instance, your data-mining model can predict with 100 percent accuracy that whenever the month is December, the season is winter in the Northern Hemisphere. This simple example shows that if the data was less obviously correlated, you might think you had built an especially accurate predictive model. Generally, if you begin to notice that your predictions are 100 percent on target all the time, you've probably discovered a correlation and not a pattern.


Regression

After you find a correlation, it's important to measure the rate of change between one variable and another. Regression tries to find the pattern of change between one variable and another in order to make accurate predictions based on the change in the values of a given variable. For example, there is a direct relationship between the miles a given car is driven and the amount of gas it consumes on the trip. By understanding the ratio of miles per gallon, it's possible to predict how much gas the car will consume over the next 30 miles.
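
A small least-squares sketch of that miles-and-gas example, with invented trip data: fitting the rate of change (gallons per mile) lets you predict consumption for a 30-mile trip.

import numpy as np

# Hypothetical trip records: miles driven and gallons consumed on each trip.
miles   = np.array([12.0, 25.0, 40.0, 58.0, 75.0])
gallons = np.array([0.5, 1.1, 1.6, 2.4, 3.1])

# Simple linear regression (least squares): gallons is approximately slope * miles + intercept
slope, intercept = np.polyfit(miles, gallons, 1)
print(f"gallons per mile: {slope:.3f}")
print(f"predicted gallons for 30 miles: {slope * 30 + intercept:.2f}")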

Learning from Historical Data

Data mining predicts future results, but it also draws conclusions about past events. When drawing conclusions about past data, the goal of the data-mining process isn't to make predictions but to understand the cause and effect relationships between elements of data.

There are six ways to use data mining to gain knowledge about past events.

Influence Analysis

An influence analysis determines how factors and variables impact an important measure such as performance. For instance, a manager of a telephone call center that handles credit card collections might want to know what factors contribute to the success ratio of collections to calls. There are many variables that can affect a call's success, such as:

  • Call frequency
  • Time spent per call
  • Type of call (inbound vs. outbound)
  • Sex of the person speaking to the customer
  • Sex of customer
  • Time of day call was made

The manager's goal might be to understand the environment and conditions that affect successful call outcomes, not to make predictions based on new data sets. Based on the mining of this data, she might find that it behooves the call center to encourage employees to spend less time on inbound calls and more time on outbound calls, except in the evenings after 6 P.M. Even though the data might influence the center's policies, the analysis isn't really predictive. It's simply understanding the causal relationships hidden inside the historical data.

Variation Analysis

A variation analysis looks for variations in a set of data and attempts to isolate any one factor that might influence a given measurement. Finding variations in data is important when you need to discover what factors cause cube dimensions to differ. For instance, it might be useful to see the differences in call efficiency by location, by time of day, and by manager. A variation analysis might provide the following analysis: "Call centers in large cities have lower call efficiency than call centers in small towns."

Comparison Analysis

Using the preceding example, comparison analysis could be used to compare the efficiency of calls made in two call centers in two different locations to help decide which one should be expanded. This comparison is difficult to do with the standard SQL language constructs but is relatively easy with data mining and OLAP because of the built-in ability of these tools to display measures according to dimensional splits.

Cause and Effect Analysis

A cause and effect analysis determines the effects of a given event, something that data mining is perfectly equipped to handle. For instance, it's no surprise that a 50 percent increase in calls per hour to a given center severely reduces call efficiency, but a less obvious insight is that fewer fraudulent calls are caught by the overworked operators when calls increase by one-half. A cause and effect analysis would reveal that increased calls also increase losses due to fraud.

Trend Analysis

A trend analysis looks for changes in the value of a measure over specific periods of time. The data is measured in terms of direction of movement rather than quantitative expressions. For instance, trends are almost always used to describe the state of the stock market, as in "The market is up" or "The market is down." These statements express the difference between the current valuation and previous valuations of the market.

Deviation Analysis

Deviation analysis identifies data that falls outside the norm of expected values. Fraud detection uses deviation analysis to spot data that seems suspiciously different from the norm. One way credit card companies discover credit card thefts is by tracking charge trends over time so that they can immediately identify unusually high expenditures placed on a given card.

Deviation analysis is also applied to Web page navigation. For most commercial Web pages there is a logical path for the user to follow to purchase products, browse items, track orders, and so on. One indication of hacker activity or Web security holes is whether a user deviates from the logical path. An analysis of user activity notes abnormal user activity such as attempts to access a given page directly or repeated attempts to log on to the site. Of course, the key to identifying unusual data is to have an established baseline of normal activity. What's needed is a mining model that accurately represents a normal set of cases to compare other data against.
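
One simple way to flag deviations, sketched below with invented charge amounts: establish the baseline from normal activity and flag anything more than a few standard deviations from the mean. Real fraud systems are far more elaborate, so treat this as the bare idea only.

from statistics import mean, stdev

# Baseline of normal charges for one card (hypothetical history).
normal_charges = [42, 18, 65, 30, 22, 55, 48, 27, 60, 35]
m, s = mean(normal_charges), stdev(normal_charges)

def is_deviation(charge, threshold=3.0):
    """Flag charges that fall far outside the norm established by the baseline."""
    return abs(charge - m) > threshold * s

for charge in [52, 75, 2400]:
    print(charge, "suspicious" if is_deviation(charge) else "normal")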

Predicting the Future

You can predict the future by allowing the data-mining engine to compare a set of inputs to the existing patterns in the database. The data-mining engine can use predictive modeling to predict the values of blank records in the database based on the patterns discovered from the database and stored in the model. Unlike pattern discovery, which is designed to find patterns and help understand data, predictive modeling uses the patterns to make a best guess on the values for new data sets.

Determining Probabilities

If derived models from an auto dealership database show that a person who buys a two-seat sports car is almost always a single male under 40, you can predict with a good deal of accuracy that the owner of a two-seat sports car is single, male, and under 40 years old. Given that there are always exceptions, your prediction can be expressed as a probability; therefore, we might actually find that the owner of a two-seat sports car is 80 percent likely to be a single male under 40, 5 percent likely to be female, 10 percent likely to be married, and so on.


Real World: Looking for a Few Good Customers

Not everyone is concerned with predicting probabilities. Loan and credit card companies do try to use their data to profile customers; however, they also need to process applications quickly. To speed up the application process, many financial institutions use a point system, which comes from a data analysis of their customer base, to approve or reject an application. All application criteria are allocated points, and if the sum of an application's points meets or exceeds a predetermined threshold, the application is approved. The point system is used to streamline the decision-making process by eliminating the need to assess the value of a customer based on probabilities of default each time. The point system's design is based heavily on probabilities when it's initially created but is then used to avoid having a loan officer decide each time whether an applicant who has a 65 percent chance of repaying a large, profitable loan is a better risk than an applicant with an 85 percent chance of repaying a small loan. The point system simply translates all the probabilities into a set of rules that return a "Yes" or "No" for the approval status.

You can create a data-mining model that emulates this point system. When you create the model, you analyze the loan history collected over years. The history will include examples of customers who have paid on time, others who were sent late payment notifications or visited by collection agents, and a minority who defaulted and did not pay all or part of the loan back.

In order to best take advantage of this model, input all the data collected from the applicant, including:

  • Address
  • Age
  • Income
  • Homeowner or renter
  • Outstanding debt
  • Educational level
  • Marital status
  • Number of children
  • Sex

After the application is compared to the performance histories of past and current clients, the output is either Approved or Rejected. Unless a credit board wants to evaluate borderline applicants on a case-by-case basis, raw probabilities do little to speed up the approval process. By specifying the exact result that will be acceptable for approval, you can create a basic "fill in the blank" functionality for your model.
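
A toy version of such a point system follows, with made-up criteria, point values, and threshold; the point here is only how probabilities get frozen into a simple yes-or-no rule.

# Hypothetical point system distilled from a data analysis of past loans.
POINTS = {
    "homeowner": 20,
    "income_over_40k": 25,
    "no_late_payments": 30,
    "employed_over_2_years": 15,
}
APPROVAL_THRESHOLD = 60   # threshold chosen when the system was designed

def score(application: dict) -> str:
    """Translate an application's attributes into an Approved/Rejected decision."""
    total = sum(points for criterion, points in POINTS.items() if application.get(criterion))
    return "Approved" if total >= APPROVAL_THRESHOLD else "Rejected"

print(score({"homeowner": True, "income_over_40k": True, "no_late_payments": True}))
print(score({"employed_over_2_years": True}))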


Simulations and What-if Scenarios

When models are generated from more inputs than can be checked by a human, it becomes difficult to understand the model's description of the original data set and important to test the model with data that you understand. Testing the model with well-known data has two benefits: you can identify any anomalies in the rules present within the model and correct them, and you can change the values of the model to see how the predictions are affected, which helps you to understand and perhaps improve the model.

Making predictions based on simulated data sets is one of the best ways to validate the model. To discover the factors that affect acceptance rates, a user can apply test loan applications to the model. A user could also take a failing loan application and change one of the inputs such as annual income to see whether the change results in an acceptance. If that doesn't work, the user could change the age or the sex. Tweaking input parameters tells us what kind of loan application will generally be accepted. You can also validate the model's usefulness in the real world by feeding it applications of rejected candidates. Conversely, you can test it with applications of approved candidates, even fictitious ones, to make sure the model doesn't reject applications for extraneous reasons.
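
A sketch of that kind of what-if probing, assuming a tiny, made-up loan history and a hypothetical feature order of [age, annual_income, outstanding_debt]: hold everything else constant, vary one input, and watch when the simulated decision flips.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical history: [age, annual_income, outstanding_debt] -> 1 = repaid, 0 = defaulted
history = [[45, 80_000, 5_000], [30, 25_000, 20_000], [52, 60_000, 10_000], [23, 18_000, 15_000]]
repaid  = [1, 0, 1, 0]
loan_model = DecisionTreeClassifier(max_depth=2).fit(history, repaid)

def probe_income(model, application, incomes):
    """Vary annual income only and report the model's decision at each level."""
    for income in incomes:
        trial = list(application)
        trial[1] = income                              # index 1 = annual_income (assumed order)
        outcome = model.predict([trial])[0]
        print(f"income {income:>7}: {'accepted' if outcome == 1 else 'rejected'}")

failing_application = [29, 21_000, 14_000]             # hypothetical rejected applicant
probe_income(loan_model, failing_application, [21_000, 35_000, 50_000, 80_000])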

Another, more unorthodox validation method is to create a number of "What if?" data-mining models to discover how a fixed set of inputs works against a given algorithm. For instance, you can create and apply an algorithm to a data set to determine causal purchase patterns. The example mentioned earlier in this book found that those folks who purchase beer often also purchase diapers. You can create models using test data sets to see whether the algorithm correctly identifies obvious inventory trends present in a given warehouse. Using this method, the "What if?" scenarios are created with the models, not with the input cases. You can create a set of cases that obviously confirm the validity of the beer and diapers trend to make sure that the inputs correctly generate the desired predictions.

Training Data-Mining Models

Training is the term commonly used to refer to the process of analyzing known input cases and deriving patterns from them. These discovered patterns are then put into a form that can be used to construct models. This model training is part of a process known as "machine learning," in which algorithms are automatically adapted to improve based on the experience the computer has gained from analyzing historical data, creating patterns, and validating the patterns against new data.

Model training in Microsoft Analysis Services is accomplished either with data-mining wizards in the Analysis Services Manager or by using PivotTable services and the OLE DB for Data Mining specification. In both methods, a basic model structure is built and the cases are inserted into the model in accordance with the algorithm used by the OLE DB for Data Mining provider to define the model. These options will be discussed in further detail in Chapters 8 and 9.

Evaluating the Models and Avoiding Errors

Because data mining uses a set of data to create a representation of the real world, the model is only as good as the training data. As long as the attributes in the cases contain the same level of distribution as in the real world, the model will be useful to predict future events. For this reason, it's crucial to make sure that the initial data accurately represents real-world data.

Overfitting

One of the great benefits of data mining is the ability to discover patterns in data that you would not otherwise notice. This ability brings with it the potential of finding patterns that seem logical within the confines of the universe the cases reside in but are completely coincidental in relation to the real world. This problem is called overfitting the data.


Real World: The Dangers of Overfitting

There are well-publicized cases of data-mining processes that found, as in the claim made by a French bank, that people with red cars had higher loan-default rates than owners of non-red cars. These patterns were actually present in the database, but obviously have no real value for the bank. Scenarios prone to overfitting are those that contain a large set of hypotheses and a data set too small to represent the real world. Under these conditions, data-mining algorithms find meaningless patterns such as the one that discovered Bentley buyers usually have a first name of John.


There are three ways to solve the problem of overfitting: pruning, chi-squared analysis, and cross validation.

Pruning

When you use the decision trees algorithm, you can counter overfitting with decision-tree pruning. Pruning evaluates the information contributed by each attribute and removes branches that add little to the model's predictive power. Because it works on the generated tree, pruning involves modifying the model after the fact.
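
Analysis Services prunes its trees internally, so the sketch below only illustrates the general idea with scikit-learn, whose cost-complexity pruning (the ccp_alpha parameter) trims branches that add little information after the tree is grown. The synthetic data is purely illustrative.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic cases stand in for training data; nothing here is Analysis Services syntax.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned   = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)  # prune weak branches

print("leaves before pruning:", unpruned.get_n_leaves())
print("leaves after pruning: ", pruned.get_n_leaves())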

Chi-Squared Analysis

The method most commonly used to determine the relevance of a given attribute is the chi-squared analysis method. The formula for this form of analysis is much less intimidating than the name of the statistical test suggests:

χ² = Sum[ (Observed Value - Expected Value)^2 / Expected Value ]

If you had 25 different automobile models and 250 different buyers (all with the first name of John) in the database, it's fair to assume that, all other factors being equal, each automobile model has ten buyers. The expected number of Bentley buyers is therefore ten. The actual value might be 30, which indicates a high variation from the norm. If 30 is the actual value, the formula looks like this:

χ² = (10 - 30)^2 / 10
χ² = (-20)^2 / 10
χ² = 400 / 10
χ² = 40

Now that you have the chi-squared value for an individual case, how do you determine whether that computation is completely out of line? First establish a baseline, or norm, for name distributions. The best way to establish such a norm is to gather data containing first names from a completely unrelated source and run the chi-squared formula against it to build a table much like Table 4-3.

Table 4-3. Chi-Squared Results from a Random Source
Name Expected Actual Difference Difference^2 χ²
Mary 10 12 -2 4 0.4
John 10 8 2 4 0.4
Fred 10 11 -1 1 0.1
Jill 10 7 3 9 0.9
Larry 10 9 1 1 0.1
Henry 10 13 -3 9 0.9
Bill 10 12 -2 4 0.4
Janet 10 8 2 4 0.4
Roger 10 6 4 16 1.6
Kate 10 14 -4 16 1.6
Total 6.8

Now take the data from the data set that will be used for training and apply the chi-squared formula to it, as shown in Table 4-4.

Table 4-4. Chi-Squared Results for the Training Set
Name Expected Actual Difference Difference^2 χ²
Mary 10 20 -10 100 10
John 10 30 -20 400 40
Fred 10 8 2 4 0.4
Jill 10 2 8 64 6.4
Larry 10 1 9 81 8.1
Henry 10 1 9 81 8.1
Bill 10 7 3 9 0.9
Janet 10 0 10 100 10
Roger 10 10 0 0 0
Kate 10 21 -11 121 12.1
Total 96

The sum of the chi-squared values was 6.8 in the control data set but a whopping 96.0 in the real data set. What does this difference mean? Such a large difference indicates that the data set used for the auto sales has an unusually high degree of deviation from an expected name distribution. This difference could be due to many factors. Perhaps the sample used for the cases was extracted in first-name order, which would explain a high frequency of one name over another. In any case, this test indicates that the name attribute only appears to have an effect on Bentley sales because that attribute happens to have a disproportional distribution in the database, which doesn't reflect the distribution in the real world. In fact, comparing the occurrences of the name John shows to what degree the data set is skewed.
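
The same comparison can be scripted. The short sketch below reproduces the per-name χ² contributions from Tables 4-3 and 4-4 and their totals (roughly 6.8 for the control names and 96 for the training names).

def chi_squared(expected, actual):
    """Per-cell chi-squared contributions: (observed - expected)^2 / expected."""
    return [(a - e) ** 2 / e for e, a in zip(expected, actual)]

expected = [10] * 10
control  = [12, 8, 11, 7, 9, 13, 12, 8, 6, 14]    # Table 4-3: names from an unrelated source
training = [20, 30, 8, 2, 1, 1, 7, 0, 10, 21]     # Table 4-4: names in the training set

print("control total: ", sum(chi_squared(expected, control)))    # about 6.8
print("training total:", sum(chi_squared(expected, training)))   # about 96.0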

Cross Validation

A third approach is to use cross validation, a technique that tests subsets of the known data to see how well the model correctly predicts values based on that data. The model can then be tweaked and retried until the predictions generated match real events.
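
A minimal cross-validation sketch with scikit-learn: the known cases are split into several folds, and the model is scored on the fold it was not trained on. The data and model choice here are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)   # stand-in cases

# Five-fold cross validation: train on four subsets, test predictions on the fifth.
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=5)
print(scores, scores.mean())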

Underfitting

Another common problem that causes skewed predictions is known as underfitting, which happens when the data-mining model isn't provided with enough attributes to discover important patterns. Underfitting is caused when attributes that have high predictive importance are removed from the model because they seemed unimportant initially. A large bank once created a loan application model using every attribute related to loan risk except for a few. One of the attributes taken out of consideration was pet ownership. As it turned out, upon much closer examination of the loan applications, the bank found that pet owners, particularly cat owners, seemed to be much safer customers to lend money to than non-cat owners.

Preparing Data Models for Testing

In order to ensure that the mining models will reflect real trends, some data must be prepared to test the validity of the models. To accomplish this, three different sets of data need to be used, as shown in the sketch after this list.

  • Training set   A training set is used to build the initial data-mining model. You might find that this set has some bias or idiosyncrasies, but it does reflect the sample used for data mining.
  • Test set   When the training set is corrected, a test set is created. In other words, once you're sure that your set of cases will build an accurate model, they can be kept as inputs for predictions against other models to test their accuracy. The process of correcting this set may be iterative, continuing until you are satisfied that the test set will build a model that accurately reflects the real data.
  • Evaluation set   An evaluation set is data taken from the same population as the test set, but it must consist of different records. This set is used to see whether the data-mining model created with the first two sets predicts the correct outcomes when provided the attributes from the evaluation sample.
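
A bare-bones way to carve out the three sets, assuming the cases sit in one list; the 60/20/20 proportions are arbitrary.

import random

cases = list(range(10_000))          # stand-in for the full population of cases
random.seed(7)
random.shuffle(cases)

n = len(cases)
training_set   = cases[: int(0.6 * n)]               # builds the initial model
test_set       = cases[int(0.6 * n): int(0.8 * n)]   # checks and corrects that model
evaluation_set = cases[int(0.8 * n):]                # fresh records from the same population
print(len(training_set), len(test_set), len(evaluation_set))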

Summary

Predictive, as opposed to descriptive, data modeling derives future results from theoretical data sets. Both approaches analyze data, but descriptive data modeling doesn't produce a model that can be used to derive predictive results, only an analysis of what has been. Depending on your goals, there are two ways to go about the task of mining data. One is to use directed data mining, which uses known attributes that direct the model to come up with predictive outcomes based on the patterns suggested by attributes present in the model. The second method is undirected data mining, which makes use of attributes only to discover patterns and trends, not to affect the direction of a predictive data-mining effort.

Although statistics form the foundation of data-mining theory, a statistical approach to discovering patterns and analyzing data has little in common with data-mining methods and processes. Because data-mining algorithms often use statistical formulas as part of the model-building process, it's useful to understand statistical principles to understand how these algorithms operate.

For a data-mining model to be reliable, it must be tested. The cases used to create the model should be put to a series of statistical tests to determine whether the data used to build the models needs to be tweaked. After the model is built, it's also a good practice to test it against a set of data where the predictive outcome is already known in order to see whether the model really functions properly as a predictive tool.

Part II, "Data Mining Methods," looks more closely at the methods described in these early chapters. After reading this section, you should have a good understanding of the steps that Analysis Services takes to build data-mining models based on the algorithms chosen. In addition, you will know how to create data-mining models based on your own algorithms or based on those from third-party vendors.


