PART II DATA-MINING METHODS

System administrators spend a good portion of their time keeping tabs on network and machine performance, which they monitor with tools that track numerous performance counters. These tools typically display windows full of graphs that chart the changes in these counters over time. When the system is complex enough, a tool's display area becomes cluttered with a multitude of overlapping graphs, each displaying a mass of multicolored, crisscrossing lines that are constantly moving as the values for the counters change over time. (See Figure 7-1.) If you ever observe a system administrator as she's monitoring machine and network performance, you'll probably notice that she only needs to glance at the screen once to know whether anything is wrong. This may surprise you at first; after all, there must be hundreds of these counters moving and changing all the time, so how does she know that something isn't amiss? If you ask, she'll most likely reply that everything "looks okay." The reason she can quickly determine there are no problems is that the graphs and charts display a repeating visual pattern that doesn't change much while things are functioning within normal parameters. The system administrator would, of course, be unable to just glance at the screen and provide you with exact values for any of the performance counters, but she's not interested in the real numbers. For her periodic checks, she's only interested in the big picture, the overall health and load of the system, which she can see with one glance at the screen.

This scenario demonstrates how you can quickly analyze complex, rapidly changing data by looking at the big picture. Sometimes the data that you're mining contains so many variables that it's virtually impossible to really "see" the pattern in the data. If the system administrator were asked to monitor the system by watching numbers stream down a character-based screen instead of with graphs, the sheer quantity of data in the many counters would quickly overwhelm her and make it impossible to identify a problem. She would surely be forced to resort to some tool that could use the computing power of a machine to read and interpret the information for her. There are simply too many data points to comprehend. This is true for any kind of data with many variables. Depending on the type of data available for analysis, data miners are also interested in the big picture. Microsoft Clustering is the algorithm that offers this window into the data.


Figure 7-1. Performance monitor graphs.

The Search for Order

Humans have a natural tendency to classify and group objects with similar characteristics. When we see a strange looking animal, our brains immediately try to identify it as a mammal or a reptile, a carnivore or an herbivore. We do this in order to draw a conclusion about that particular animal and decide our course of action. If it's a small furry herbivore, we'll probably want to pet it, but if it's a large carnivorous reptile, we'll surely run!

Looking for Ways to Understand Data

Our brains deal with a complex problem by dividing it into a number of smaller, bite-sized problems that are easier to solve. For instance, if you were asked to measure employee efficiency for a multinational corporation with 10,000 employees, to make the task manageable you would measure the efficiency of various groups of employees and not the efficiency of each individual. To accomplish this task, you would find significant similarities between employees that allow you to make global statements about them, such as the following fictitious statements:

  • Sales works the most hours but is the least productive.
  • Finance works the least hours but is the most productive.
  • Upper managers work more hours than CEOs.
  • Junior managers work more hours than upper managers.

This example uses groups found in most companies, but to discover meaningful groups that are specific to this company, I would have to evaluate all the employee characteristics and classify them into groups based on similarities, such as profession, job description, geographical location, and so on, as the example in Figure 7-2 shows. Identifying the groups and their members is the primary task of the clustering algorithm.


Figure 7-2. Clusters of employees.

Clustering as an Undirected Data-Mining Technique

Clustering works in a way that facilitates undirected data mining; by undirected, I mean there are no dependent variables used to find a specific outcome. In other words, when preparing the data-mining model, you don't really know what you're looking for or what you'll find. Applying clustering as a technique amounts to dumping your whole basket of data into the system and letting it "magically" arrange the data into neat piles. All the algorithm does is find records and assign them to the groups that it defines. Generally speaking, clustering is rarely used to derive information that can be used directly for any sort of specific decision-making; more often it's used to identify groups of records that can be studied further with another method, such as decision trees. Figure 7-3 is an example of a cluster chart.


Figure 7-3. A cluster chart.

How Clustering Works

The clustering algorithm is an iterative process that seeks to identify groups and group members. In order to best appreciate the results you'll get from this process, let's look at the inner workings of the algorithm as it turns the raw cases into a model.

Overview of the Algorithm

As you will see, clustering creates values that represent points in space. These points permit the process to measure the records in terms of their proximity to one another. Like countries on a map, clusters have boundaries that surround the records residing within them. Knowing which cluster a given record resides in is enough to know its membership. The model generated by the clustering algorithm also contains references to coordinates that locate the record at a point in space, and these coordinates point to the other records that belong to the same cluster.

The K-Means Method Clustering Algorithm

The K-Means clustering algorithm is one of the most common methods for arriving at a set of clusters. Almost any commercial data-mining application incorporates some variation of this clustering algorithm. The main feature of K-Means clustering is the predetermined K variable, which you set to the number of clusters you want. For example, if you decide you want ten clusters, you set K equal to 10.

Clustering can produce two extreme results depending on the value you choose for K. You could choose to set K equal to one, in which case you'll have a relatively meaningless result because all the data is grouped in one node. The other extreme is to make K equal to the number of records in the case set, which will provide an equally meaningless result because no real grouping has occurred. Any other number of clusters is possible depending on the value you assign to K. There is really no hard-and-fast rule for choosing the value of K, and it's often a good idea to try several variations.


Note

When creating clusters with the Microsoft Clustering algorithm, the default number of clusters is always 10. As discussed, this number can be modified to fine-tune the results.


Finding the Clusters

When presented with the data, the algorithm first has to decide upon the possible clusters that are likely to exist. (See Figure 7-4.) The algorithm does this in a number of ways. One method is to identify every distinct value for a given attribute, select every fourth or fifth record, and offer those up as possible cluster values. An alternative method retrieves a sample of records and tries to determine the clusters based on the values that are farthest apart. A third method is to select records randomly.


Figure 7-4. Finding the clusters.
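To make these starting strategies concrete, here is a minimal Python sketch of two of them, purely random selection and every-nth-record selection. The function name, the toy data, and the fixed seed are illustrative assumptions; this is not the actual Microsoft Clustering implementation.

import random

def pick_initial_centers(records, k, method="random", seed=42):
    """Pick k candidate cluster centers from a list of numeric records."""
    rng = random.Random(seed)
    if method == "random":
        # Select k records completely at random as temporary centers.
        return rng.sample(records, k)
    if method == "every_nth":
        # Step through the data, taking every nth record as a candidate.
        step = max(1, len(records) // k)
        return records[::step][:k]
    raise ValueError("unknown method: " + method)

# Toy records with two numeric attributes: (age, hours worked per week).
data = [(age, hours) for age in range(20, 70, 5) for hours in (20, 40, 60)]
centers = pick_initial_centers(data, k=3)
print(centers)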

Known vs. Unknown Clusters

Sometimes you'll know the number of clusters in advance, and other times you won't. For instance, a French department store teamed up with a marketing firm and a clothing design house to try once again to revolutionize, not the fashion industry, but the clothing retail industry. They decided that rather than cut clothes to standard sizes, such as small, medium, or large, they would actually make them to fit predetermined body shapes and types. They figured that somehow this would allow them to produce clothing that fit better and would serve a (pardon the pun) wider range of customers while at the same time dramatically reducing manufacturing costs. They applied a clustering algorithm to the database of customer body measurements and came up with 120 body types that had common physical attributes. Now all they had to do was make clothes according to the body types and not the individual measurements. In this case, they had already decided upon the clusters that would exist in order to classify the population of customers into each one. This is a very simplified version of the actual process, but it serves to illustrate the use of predetermined groups.

A commercial real estate firm used clustering to learn the characteristics of good tenants who paid rent on time. The French department store had the advantage of knowing the specific clusters they were looking for and establishing the value of each. This advantage clearly left no room for fuzzy, useless groupings. The real estate firm didn't have this advantage because their tenants were too varied. In the end, of the 19 clusters they got, only one proved to be useful. It seemed that the best tenants were small business owners who had also been homeowners for more than ten years. The research showed that many small business owners used their homes as collateral for their business loans. To avoid losing their homes, these business owners were highly motivated and had solid, well-thought-out business plans.

Finding the Center of the Cluster

The initial cluster values are randomly determined samples that provide a starting point. So if you chose a K value equal to 10, then ten points are randomly chosen and temporarily assigned the status of cluster centers. Once the initial cluster values are determined, the algorithm cycles through the records and places each record in one of the clusters. As it does this, it continually seeks the records that most closely match the center points and reassigns them. This is done by calculating a simple average of the values in each cluster and grabbing the points that fall closest to that average. This exercise, illustrated in Figure 7-5, is repeated until all the records are placed in a cluster.


Note

Because the initial centers of the clusters are determined randomly, it's very possible to find that the same data generates different clusters when reprocessed.



Figure 7-5. Finding the cluster centers.
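The assign-and-recompute cycle described above can be sketched in a few lines of Python. This is a generic K-Means loop under simplifying assumptions (numeric attributes only, Euclidean distance, a fixed number of passes), not the exact procedure Analysis Services follows; the helper names are illustrative.

import math

def euclidean(a, b):
    """Straight-line distance between two points with the same number of axes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(records, centers, iterations=20):
    """Assign each record to its nearest center, then move each center to the
    average of the records assigned to it, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for rec in records:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(rec, centers[i]))
            clusters[nearest].append(rec)
        # Recompute each center as the mean of its members; keep the old
        # center if a cluster ends up empty.
        centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers, clusters

Fed the data and centers from the earlier sketch, kmeans(data, centers) would return the final center points along with the membership of each cluster.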

The Boundaries

While the centers of the clusters are continually being adjusted, the boundaries between clusters are also moving. The boundary between two clusters is defined as the halfway point between their center values. (See Figure 7-6.) Over the course of World War II, Eastern Europeans living near the Soviet border found themselves living in different countries without ever leaving home, as a result of battles over border territory. Cluster boundaries move in the same way; records that once belonged to one cluster can suddenly find themselves in another cluster.


Figure 7-6. Cluster boundaries.
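To make the halfway-point idea concrete, here is a tiny illustrative calculation with one numeric attribute and two clusters; the numbers are invented.

center_a, center_b = 30.0, 50.0          # current cluster centers
boundary = (center_a + center_b) / 2.0   # halfway point: 40.0

record = 38.0
print("A" if record < boundary else "B")  # 'A'

# If center_a later shifts to 20.0, the boundary moves to 35.0 and the same
# record, without changing at all, now falls on cluster B's side of the line.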

What Is Being Measured Exactly?

The preceding explanation might make clustering seem inordinately simple because the values we are dealing with are easily quantifiable and can be averaged and calculated to determine their placement in a given cluster. If all the cases involved only numerical attributes, the clustering algorithm would indeed be extremely easy to understand.

The reality is quite different, of course. In our everyday world, we have many ways of recognizing similarities without resorting to mathematical calculations. For example, most of us would consider a bicycle to be more similar to a motorcycle than to a car. In a data-mining operation, these subjective comparisons must be converted to numbers that quantify the similarity. We usually find ourselves working with databases that contain characteristics that don't necessarily map to a number, such as:

  • Colors
  • Geographical locations
  • Automobile makes
  • Languages
  • Animal species

We could attach a numerical value to these attributes and assign them some point in space. Although this sounds very feasible, it creates some new problems.
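For example, a naive way to do this is simply to map each distinct category to an integer so that it can serve as a coordinate; the mapping below is purely illustrative, and the next section explains why this kind of encoding is problematic.

# Arbitrarily map colors to integers so they can be used as coordinates.
color_codes = {"red": 1, "green": 2, "blue": 3, "yellow": 4}

# The algorithm now "thinks" yellow is three times farther from red than
# green is, even though no such ordering or distance really exists.
print(abs(color_codes["yellow"] - color_codes["red"]))  # 3
print(abs(color_codes["green"] - color_codes["red"]))   # 1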

Conceptual Attributes

The algorithm cannot measure the differences between conceptual values. In other words, if our records contain mammals, fish, reptiles, birds, and insects, the computer would not know that a chimpanzee is far closer to a human than to a bat, or that frozen yogurt is closer to ice cream than carrot cake is. These differences should be far greater than, say, the difference in eye color between species, but the algorithm can never be entirely accurate because it doesn't understand what we mean by "closer" in this context.

Clustering Factors

Listed below are the four factors that affect the clustering process.

  • Rankings
  • Interval values
  • Measures
  • Categorical values

Rankings

Ranking orders the values from higher to lower or lower to higher. Ranking says nothing about the values or their relative distance from each other. For example, you may be the fastest sprinter in the country and I may be the second fastest, but this ranking does not tell us how much faster you are.

Interval Values

Unlike ranking, interval values do measure the difference between one measure and another. If you weighed 150 pounds and I weighed 200 pounds, the interval value tells us that the difference in our weight values is 50 pounds.

Measures

While ranks and intervals express relative values, measures express an absolute value relative only to zero. It's important not to confuse measures with intervals by following what seems logical but is ultimately flawed reasoning. For example, you might draw the following incorrect conclusions:

  • 10 degrees is twice as cold as 20 degrees.
  • At 300 pounds, Fred is twice as fat as Joe, who weighs 150 pounds.
  • 80 decibels is twice as loud as 40 decibels.

In each case the ratio is misleading: the Fahrenheit and Celsius scales don't start at an absolute zero, body weight isn't a direct measure of fatness, and the decibel scale is logarithmic, so doubling the number means far more than doubling the loudness.

Categorical Values

Categorical values sort similar things and place them into groups. For instance, a stack of books could be grouped into fiction and nonfiction and then again by subject and genre. The key is to remember that a category only groups similar things; the differences between categories are not measured or given a value. In other words, you could never say that a reference book is greater than or less than a novel, but you can say that they are not equal to each other.

Measuring"Closeness"

As you can see, there are many ways to measure "closeness," or how similar things are to one another. There are three main methods used to measure closeness:

  • Distance between points in space
  • Record overlap
  • Vector angle similarity

Distance Between Points In Space

This is the most straightforward method of measuring similarity. The record attributes are assigned numerical values that represent coordinate positions along one or more axes. To get the distance between two records, take the difference between their values on each axis, square each difference, add the squares, and take the square root of the sum, as shown in Figure 7-7.


Figure 7-7. Distance between points.
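In code, that calculation is just the Pythagorean formula extended to any number of axes; the following sketch and its sample points are illustrative.

import math

def distance(point_a, point_b):
    """Difference on each axis, squared, summed, then the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))

print(distance((3, 4), (0, 0)))  # 5.0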

Vector Angle Similarity

Vector angle similarity looks for records whose values form similar angles measured from the origin of a chart, an indication of "sameness" that might not otherwise be obvious when the distances between individual points are measured on their own. There are times when the relationship among the values of the attributes within a record is used as a value for the entire record. In other words, we are looking for a relationship between records and their attributes. When this relationship is needed, it's better to compare vector angles, the angle formed by a line through the origin and the point representing a record's attributes, than to compare distances between points in space.

For instance, in a cluster chart of vehicles that includes automobiles, ships, and trains, we know that a Volkswagen Beetle is more similar to a truck than to a small speedboat. But because cars and trucks are similar in proportion, not in actual size, we need to find another way besides size to identify them as the same type of vehicle. They are similar because they both have four wheels, a steering wheel, a muffler, and two axles. In fact, the truck is really very similar to a Beetle except for size. The same method can be used to identify the similarities between a kayak and the Titanic. Because other attributes have values that are easy to measure, we run the risk in this type of data sample of finding similarities based on more obvious, but irrelevant, values, such as size, weight, and manufacturer.

If we measure the angle between vectors to find similarity, instead of the distance between points in space, we are protected from measures that are influenced by differences in the relative size of the numbers rather than their proportional value. Vector angle measurements can be taken by creating a point for every value of the record and drawing a line from the origin through that point, as shown in Figure 7-8.


Figure 7-8. Vector angle measurements.
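A common way to put a number on the angle between two records is the cosine of that angle, which stays at 1.0 whenever one record is simply a scaled-up version of the other. The sketch below, with invented vehicle measurements, is illustrative only and is not the measure Analysis Services actually uses.

import math

def cosine_similarity(a, b):
    """1.0 means the two vectors point in exactly the same direction
    (same proportions), regardless of their overall magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

beetle = (1, 4)    # toy attributes: (weight in tons, length in meters)
truck = (10, 40)   # ten times bigger, but the same proportions
boat = (1, 12)

print(cosine_similarity(beetle, truck))  # 1.0 despite the size gap
print(cosine_similarity(beetle, boat))   # lower, even though plain distance
                                         # would call the beetle and boat close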

The Record Overlap Problem

When dealing with records that are made up mostly of categorical variables, the best option is usually to look for similarity within records and then group them by the number of common attributes, as shown in Figure 7-9. Of course, the process is a bit more complicated if the records share many similar fields, because the few differences that do exist become less remarkable. To better account for those differences, they can be weighted at double or triple their value. For example, you may have an attribute for height in your cases that is defined as "Short", "Average", "Tall", and "Gigantic". It may happen that the persons in your cases all have different heights, but because of the broad classifications allowed for height, a very high number of them appear to fall within the "Average" height node. This renders the height attribute nearly meaningless unless you choose to exaggerate the differences, for example by multiplying everyone's height measurement by 10 to make the variations more apparent. That way, a difference of 3 inches in real life appears as a difference of more than two and a half feet to the algorithm.


Figure 7-9. Record overlap.
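As a rough sketch of the overlap idea, the function below simply counts matching categorical fields, with an optional per-field weight so that fields you consider more telling can count double or triple; the records and weights are invented.

def overlap_similarity(rec_a, rec_b, weights=None):
    """Count the fields on which two records agree, applying an optional
    weight per field so important matches count for more."""
    weights = weights or {}
    score = 0.0
    for field, value in rec_a.items():
        if rec_b.get(field) == value:
            score += weights.get(field, 1.0)
    return score

a = {"height": "Average", "eye_color": "brown", "species": "mammal"}
b = {"height": "Average", "eye_color": "blue", "species": "mammal"}

print(overlap_similarity(a, b))                           # 2.0
print(overlap_similarity(a, b, weights={"height": 3.0}))  # 4.0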

When to Use Clustering

Clustering is the best choice of algorithm when you have a very large quantity of data that has a high degree of logical structure and many variables, such as point-of-sale transaction data or call-center records. Clustering results can allow you to

  • Visualize relationships
  • Highlight anomalies
  • Create samples for other data-mining efforts

Visualize Relationships

One very important advantage of clustering is the ease with which a point chart or graph can be generated to display the model. A visual display allows a user or operator to see the similarities among records at a glance.

Highlight Anomalies

The graphs also make it easy to see the records that just don't fit. For instance, if a school district were to analyze the performance of all fifth-grade students, a point graph like the one in Figure 7-10 could be made.


Figure 7-10. School performance chart.

As you can see, there is one school that performs extremely well compared to all the others. The cluster does not explain why one school outperformed all the others, but it does unequivocally show its difference from the rest.

Election Results

In the 2000 U.S. presidential election, the vote count in the state of Florida was so close that a recount was mandated. In Palm Beach County, a higher number of ballots than expected went to Patrick Buchanan, of the fringe Reform Party. Using clustering, a local statistician published a report that showed how improbable such a high vote count for this candidate was.

The statistician began his report by showing a cluster point chart of the number of Buchanan votes by county. The chart showed points fairly evenly spread out along the horizontal axis, which represented the counties. However, the Palm Beach County point was higher than all the other points; in fact, it was at the very top of the chart. By itself, this chart said nothing about the validity of the ballots or Buchanan's popularity in that county. It did, however, bring home the point that this level of electoral support for Buchanan was unusual enough to warrant close attention.

Create Samples for Other Data-Mining Efforts

Many data-mining algorithms, such as decision trees, require a significant population sample to analyze. When a population is very large, the decision tree might grow so large that it becomes difficult to traverse all the rules needed to arrive at a leaf node. When you embark upon a data-mining project, it's often more useful to preselect a portion of the population first. For instance, you might have access to a nationwide customer database with millions of records, but you would like to target a general population of online shoppers in an effort to find out who shops online and how they are different from "brick and mortar" shoppers. To target this group, instead of choosing known categories, you could apply a clustering algorithm and then choose the cases present in the more meaningful nodes from which to begin the decision tree process.

Weaknesses of Clustering

Unlike decision trees, clustering doesn't produce easily interpreted output, and you frequently need to experiment to get some meaningful clusters. The main weaknesses of the clustering algorithm are

  • Results are difficult to understand.
  • Data types are difficult to compare.

Results Are Difficult to Understand

Unlike decision trees, there are no nodes to traverse and no rules to follow; in fact, there are no real predictions to be made. As a result, the clusters you end up with are somewhat difficult to understand. In Analysis Services, you'll find what looks like a set of rules in a node, but in fact these are simple descriptions of the records that fell into that cluster.

In some cases, you might end up with many clusters, some that are of no use and others that are very useful. Because the clusters do not offer any logical explanation for their existence, it's very easy to overlook interesting groups. As a result, it's necessary to analyze the meaning and value of each cluster.

Data Types Are Difficult to Compare

Because clustering relies on numerical data to plot points in space, comparing numbers that measure different types of things becomes a significant challenge. For example, we know that 30 degrees is much colder than 80 degrees, and we also know that a fifty-dollar difference in price between two comparable automobiles is insignificant. The difference in both cases is 50, but their relative difference (comparing degrees to dollars) is not easy to express in the model. To deal with this, a proportional equalizer of sorts is applied to the values to minimize their differences. For example, the car prices might be divided by 1000 so that the variations in auto prices occur in the same proportion as the temperatures.
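A minimal sketch of that kind of proportional equalizer is a rescaling step applied before any distances are computed; the divisor of 1,000 mirrors the car-price example above, and the figures are invented.

def rescale(records, divisors):
    """Divide each attribute by a chosen constant so that values measured in
    dollars, degrees, and so on end up on comparable scales."""
    return [tuple(value / d for value, d in zip(rec, divisors)) for rec in records]

# Each record: (temperature in degrees, car price in dollars).
cases = [(30, 21500), (80, 21550)]
print(rescale(cases, divisors=(1, 1000)))  # [(30.0, 21.5), (80.0, 21.55)]

# After rescaling, the trivial $50 price gap shrinks to 0.05 while the large
# 50-degree temperature gap remains 50, so distance calculations are no
# longer dominated by the raw size of the dollar figures.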

Creating a Data-Mining Model Using Clustering

The process of creating a data-mining model using the clustering algorithm starts in much the same way as the process used to create a decision tree model. In this section, we'll create the model using the Data Mining Model Wizard. Since the wizard forces you to use default values, we'll also look at modifying the model using the Data Mining Model Editor.

Using the wizard, the steps are as follows:

  1. Select the source type.
  2. Select the table or tables for your mining model.
  3. Select the data-mining technique to be used by your mining model.
  4. Edit the joins.
  5. Select the Case Key column for your mining model.
  6. Select the Input and Predictable columns.
  7. Finish.

The database used for the example is based on a sample of census data used to determine the characteristics of people earning more than $50,000 per year.


Note

This data comes from one of the many "data-mining ready" databases in the University of California, Irvine, Machine Learning Repository, which graciously offers its data for free on the Web. More information can be obtained at http://www.ics.uci.edu/~mlearn/MLRepository.html.


The attributes are shown in Table 7-1.

Table 7-1. Attributes for the Over $50,000 Income Group

Field Name        Source
ID                Primary key ID generated by Microsoft SQL Server
Age               Input
capital-gain      Input
capital-loss      Input
Class             Input
education         Input
education-num     Not used
Fnlwgt            Not used
hours-per-week    Input
Marital-status    Input
native-country    Input

Select Source Type

As with decision trees, clustering can use either a relational database or an OLAP cube as a data source. In this example, we'll choose the relational database as the source, as shown in Figure 7-11.


Figure 7-11. Selecting the type of data source.

Select the Table or Tables for Your Mining Model

The available tables in the given database are listed in the panel shown in Figure 7-12. We'll choose the census table, which contains a sample of about 32,500 records from a fictional census database. Although this data set was originally designed to predict which persons are most likely to make over $50,000 per year, the number of characteristics present in this table makes it a suitable candidate for clustering.

The available fields of the table are listed in the right panel, and if you needed to, you could view the first 1000 records of this table by clicking the Browse Data button. If we didn't already have an adequate data source defined, we could use the New Data Source button to create one.


Figure 7-12. Selecting the case table.

Select the Data-Mining Technique

As Figure 7-13 shows, the only data-mining techniques available are decision trees and clustering, at least until you add a third-party algorithm or your own algorithm. For the purposes of this exercise, we'll choose clustering.


Figure 7-13. Selecting clustering technique.

Edit Joins

The Edit Joins dialog box is displayed only if you are creating a mining model from multiple tables. Since we chose only one table, this option will not appear.

Select the Case Key Column for Your Mining Model

Every table should have a unique ID field. If the data you retrieved does not have one, add one that can be used at this step, as shown in Figure 7-14.


Tip

To create a unique ID on a SQL Server 2000 table, add a new field of the UNIQUEIDENTIFIER type and make newid() the default value. If you also make the field NOT NULL, the SQL Server engine automatically populates it with a unique GUID value.


Picking a field that contains a potential input value as the key field excludes it from being used as an input value. This is why it's important to have a dedicated, nonsignificant field as the ID in this step.


Figure 7-14. Selecting key column.

Select the Input and Predictable Columns

Unlike decision trees, which generate predictive models, clustering makes no distinction between input and predictable fields. As far as the clustering algorithm is concerned, all chosen fields are inputs. That's why the dialog box has only one panel for the chosen fields. Double-clicking a field in the left panel causes that field to be used as an attribute for clustering. Once the fields have been chosen, as in Figure 7-15, click Next.


Figure 7-15. Selecting Input columns.

Finish

The Finish dialog box prompts you for the name of the data-mining model and offers you the option to process it right then. If you have no further parameters to change through the Data Mining Model Editor, processing it is a good choice.

Determining the Best Value for K

The clustering algorithm used by Microsoft strongly resembles the K-Means method of clustering. This method requires that the algorithm be told in advance how many clusters, or values for K, will exist. When you first build a clustering model, you are given 10 as the value for K by default, which may or may not be the ideal number. Either way, the results can be drastically different based on the number chosen, and the records might not contain data that lends itself to that number. For example, you might have a database containing sales data involving eleven different product types. If you chose 10 as the value for K, then you run the risk of missing a possible clear division of the data into 11 clusters, one for each product. As we'll see later in this section, the Data Mining Model Editor gives you a chance to adjust the value for K and reprocess the model in an attempt to find that ideal number.
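Outside the editor, one rough way to experiment is to rebuild a model for several values of K and compare how tightly the records hug their cluster centers. The sketch below reuses the kmeans, euclidean, and pick_initial_centers helpers from the earlier examples in this chapter; it is only an illustrative heuristic, not the way Analysis Services selects the cluster count.

def within_cluster_error(centers, clusters):
    """Total squared distance from each record to its own cluster center;
    lower values mean tighter, more coherent clusters."""
    return sum(
        euclidean(rec, center) ** 2
        for center, members in zip(centers, clusters)
        for rec in members
    )

for k in (5, 10, 11, 20):
    centers, clusters = kmeans(data, pick_initial_centers(data, k))
    print(k, round(within_cluster_error(centers, clusters), 1))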

Processing the Model

Creating the model is an iterative process that continually seeks the center of the cluster as each new case is added to the model. For this reason, it usually takes longer to create a clustering model than a decision tree model. As the model is being processed, or trained, you'll notice that the bottom portion of the processing screen often makes reference to the candidate model with an iteration number, as shown in Figure 7-16.


Figure 7-16. Training clusters.

This process was outlined in the "Finding the Clusters" section earlier in this chapter. Once the model is processed, click the Close button to go to the Data Mining Model Editor.

Viewing the Model

As shown in Figure 7-17, all the fields are considered input fields. Changing their status from "input" to "input and predictable" will not generate a different type of cluster model. To see the contents of the model, click the Content tab.

Organization of the Cluster Nodes

The clusters are represented graphically in much the same way as the decision trees are, except that the "tree" (as shown in Figure 7-18) always has only one level. What appears to be the root node is actually a cluster that contains all the records used to train the model. This node is valuable because it can tell you at a glance how many records there are and even how many records contain given attributes. To get a count of a given attribute, you only need to choose the attribute you're seeking to measure from the Node Attribute Set menu. This makes the Attributes pane show the totals for each of the attributes for that given node.


Figure 7-17. Data Mining Model Editor.

Figure 7-18. Viewing the contents of the model.

The nodes' names are nondescript (Cluster 1, Cluster 2, Cluster 3, and so on), but each cluster does have a set of rules that describes the records contained within it. You can see the list of rules that determine what data is included in a given group by clicking the node and looking at the contents of the node path.

Why are some records part of a cluster when they don't seem to belong?

You might occasionally notice some records in the cluster that are not accounted for in the rules of the cluster. For example, you might find that the node's rules make no mention of records with a class of greater than $50,000 (>50K), yet the Node Attributes pane shows that 12 percent of the records do indeed have this attribute. This apparent anomaly occurs because the record's other characteristics are enough to place it in a given cluster even if that attribute is not included in the node definition. To understand this better, remember that as far as the clustering algorithm is concerned, a record is simply a point in space. A node is nothing more than a logical circle drawn around those points. While some of the points will be dead-center in that circle, others will be on the fringes of the same circle. Those fringe points might have some attributes that fall outside the strict node definition.

Order of the Cluster Nodes

The numbered order of the clusters is another significant feature; it indicates the quantity of records in each cluster. The clusters are numbered and arranged in descending order. In other words, Cluster 1 contains more records than Cluster 2, Cluster 2 contains more records than Cluster 3, and so on. This matters in clustering models because the more records a cluster contains, the more confidence we can have in the significance of the commonality of the records within it. Conversely, the bottom clusters can be very useful for identifying anomalies in the data, as was shown in the school performance example discussed earlier in this chapter.

Analyzing the Data

As I mentioned earlier, at first glance the most meaningful nodes are likely to be the ones with the most records, and the least meaningful nodes are likely to be the ones with the fewest records. A deeper analysis of the middle nodes can certainly lead to some interesting insights. That said, it's expected that a certain number of nodes, if not all, will be meaningless because the commonality of the records is coincidental or doesn't allow us to make any global statements. For example, some nodes might contain records about persons who are involved in a wide array of different professions. If the professions don't have anything in common, there is very little we can generalize about them.

The cluster model we just generated seems to be somewhat devoid of meaning. Perhaps it's because there are relatively few cluster nodes and many attributes. This forces the engine to find common ground using many different attributes and even many variations of the values of those attributes. If you were to look at the first node of the model and the last node, you'd see that nothing about them allows us to draw any conclusions. Maybe if we increased the number of clusters, we would find more significant results.

To illustrate this, let's change the number of clusters from 10 to 20. To do this, change the Cluster Count property value in the Basic tab. When you click the Process Mining Model button, you'll get a warning that you must save the model before processing. Just click OK, and continue. If the structure of the model had not been changed, as is the case when a new set of training data is introduced, then you'd have the option to simply refresh the data, a process that amounts to emptying the model and retraining it with the new data. However, we did change the structure by demanding more clusters, so the dialog box only offers the choice of fully reprocessing the model. This is equivalent to building the model from scratch.

Once the option to reprocess is chosen, the twenty-cluster model is reprocessed as shown in Figure 7-19.


Figure 7-19. Data Mining Model Editor.

Now let's take a look at the nodes again. The first node contains about 20 percent of the whole sample, and more than 96 percent of the persons in it make less than $50,000 per year. The group described in this node is made up of African American females who are unmarried, separated, or divorced. Another interesting node is the very last one, which includes one unmarried, part-time fisherman who made more than $40,000 in capital gains in one year.

These nodes are a bit more descriptive and do contain some data that could be significant. Unlike decision trees, which group records according to precise rules, the clustering algorithm tries to place similar records in groups. Each record might precisely match a cluster's criteria and therefore be placed in the center of the cluster, or it might match only some of the criteria and hover on the border between two or more clusters. The K value often defines how inclusive or exclusive each cluster is.

Summary

Clustering is an undirected data-mining method designed to discover the basic classifications inherent in databases. Clustering does not attempt to predict unknown data values as decision tree algorithms do, but it does offer a way to discover records that are similar enough to one another to be considered part of groups that the algorithm itself identifies.

The aim of this kind of data mining is to gain an understanding of the natural similarities among groups of records. This knowledge can lead to further study of the behavior of these groups to see whether any generalizations can be made.


