7.2. Classification

7.2.1. Problem Statement

Classification is a fundamental class of techniques that has applications in a huge number of areas, such as pattern recognition, machine learning, and robotics.

It is used to partition a universe of all possible objects, each described by a set of data, into a set of classes. A classification method is mainly characterized by its decision rule, which determines, for a given object, the class it belongs to.

There are two fundamental classes of decision rules: parametric and non-parametric rules.

Parametric decision rules make the membership classification based on the knowledge of the a priori probabilities of occurrence of objects belonging to some class $C_i$. This is captured by the probability density functions $p(X \mid C_i)$, which characterize how probable the measurement $X$ is should the membership class be $C_i$.
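As a concrete illustration only (the book gives no such example), the following sketch implements a parametric decision rule for one-dimensional measurements, assuming the a priori probabilities and Gaussian class-conditional densities $p(X \mid C_i)$ are known; all class names and numbers are made up.

# Hypothetical parametric decision rule: pick the class C_i maximizing
# p(C_i) * p(x | C_i). Priors and Gaussian parameters are assumed/made up.
import math

classes = {
    "C1": {"prior": 0.6, "mean": 0.0, "std": 1.0},
    "C2": {"prior": 0.4, "mean": 3.0, "std": 1.5},
}

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def parametric_classify(x):
    # Maximize the a priori probability times the class-conditional density.
    return max(classes, key=lambda c: classes[c]["prior"]
               * gaussian_pdf(x, classes[c]["mean"], classes[c]["std"]))

print(parametric_classify(1.2))  # -> "C1" for these made-up parameters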

Very often, however, it is difficult to describe these probability density functions a priori; frequently, they are simply unknown.

In such cases, non-parametric decision rules are attractive, because they do not require such a priori knowledge. Instead, these rules rely directly on a training set of objects, for which the class membership of each object is known precisely. This is a priori knowledge, too, but much less difficult to acquire. (For instance, it could be provided by a human.) Therefore, this is also called supervised learning. The idea is that a teacher feeds a number of example objects to the student and provides the correct answer for each of them. After the learning phase, the student tries to determine the correct answer for unseen objects based on the examples it has seen so far.

Objects are usually represented by a set of features, each of which can typically be expressed as a real number. Thus, objects can be represented as points in the so-called feature space, and classes are subsets of this space. Since we work with points, we can usually define a measure of distance between points (which might be more or less well suited to the objects).

A very simple, non-parametric decision rule is the nearest-neighbor classifier. As its name suggests, it is based on the distance measure, and it assigns an unknown object to the same class as the nearest object from the (known) training set (see Figure 7.12). Most often, the Euclidean metric is used as a distance measure, although it is not always clear that this is best suited for the problem at hand.

Figure 7.12: The nearest-neighbor rule yields a simple classifier.
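For illustration, here is a minimal sketch of the nearest-neighbor rule in Python, assuming the Euclidean metric and a training set given as NumPy arrays; the function name and data are illustrative only.

# Minimal nearest-neighbor classifier (illustrative sketch, Euclidean metric).
import numpy as np

def nearest_neighbor_classify(train_points, train_labels, query):
    # Squared Euclidean distance from the query to every training point.
    diffs = train_points - query
    dists = np.einsum('ij,ij->i', diffs, diffs)
    # Assign the label of the closest training point.
    return train_labels[np.argmin(dists)]

# Usage: two 2D classes, one unknown point.
pts = np.array([[0.0, 0.0], [1.0, 0.2], [4.0, 4.0], [4.5, 3.8]])
lbl = np.array([0, 0, 1, 1])
print(nearest_neighbor_classify(pts, lbl, np.array([0.8, 0.5])))  # -> 0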

Although this rule is extremely simple, it is fairly good (see Figure 7.13). More precisely, let $P_N$ denote the probability of error of the nearest-neighbor rule for a training set of size $N$, and let $P = \lim_{N \to \infty} P_N$ be its asymptotic probability of error. Then $P$ can be bounded by

$$P^{*} \;\le\; P \;\le\; P^{*}\left(2 - \frac{c}{c-1}\,P^{*}\right) \;\le\; 2 P^{*},$$

where $P^{*}$ is the optimal, so-called Bayes probability of error and $c$ is the number of classes [Cover and Hart 67]. In other words, the asymptotic nearest-neighbor error is never more than twice the optimal error.

Figure 7.13: Comparison of the decision boundaries induced by the (optimal) Bayes classifier (squares) and the nearest-neighbor classifier (triangles).

Although the nearest-neighbor rule offers remarkably good performance, especially given its simplicity, it is by no means a silver bullet. Some of its problems are:

  • large space requirements (to store the complete training set);

  • in high dimensions, it is very difficult to find the nearest neighbor in time better than O(N) (this is called the curse of dimensionality).

In addition, there are two more problems that all classification methods must deal with:

  • classes may overlap, so there are regions in feature space that are populated by representatives from two or more classes;

  • some representatives might be mislabeled, i. e., they are classified into the wrong class (for instance, because they are outliers, or because the human generating the training set made a mistake).

In the next two sections, we will describe some simple methods to correct some of these mistakes.

7.2.2. Editing and Reducing the Set

Imagine that the classes correspond to colors, i. e., each point in the training set is labeled with a color. The nearest-neighbor rule induces a partitioning of the feature space into a number of cells, each of which belongs to exactly one point of the training set. These cells are exactly the Voronoi cells of the Voronoi diagram induced by the training set (see Chapter 6 and Section 6.4.1).

So a class is exactly the union of all Voronoi cells that have the same color (see Figure 7.14). The decision boundary of a decision rule consists of those points in feature space that separate classes. More precisely, a point is on the decision boundary iff every arbitrarily small, closed ball centered on this point contains points of two different classes. With the nearest-neighbor rule partitioning feature space into a set of Voronoi cells, the decision boundary consists exactly of those boundaries of the Voronoi cells that lie between cells of different colors.

Figure 7.14: Conceptually, the nearest-neighbor classifier partitions the feature space into Voronoi cells. (See Color Plate XIV.)

In practice, the Voronoi diagram and the decision boundary are rarely computed explicitly, especially in higher dimensions. But it is clear that the decision boundary alone is sufficient for classification. So, we can reduce the training set by removing those points that won't change the decision boundary. This is called editing the training set.

Instead of thinking in terms of the Voronoi diagram and Voronoi cells, we can think in terms of the DG (see Figure 7.15). The decision boundary then runs between neighbors in this graph that have different colors.

Figure 7.15: The decision boundary of the nearest-neighbor classifier runs between Delaunay neighbors with different colors. (See Color Plate XV.)

With the DG, it is very simple to edit the training set: we just remove all those nodes (i. e., points) whose neighbors all have the same color as the node itself. This will change the DG and the Voronoi diagram, but it will not change the decision boundary (see Figure 7.16).

Figure 7.16: The edited training set has the same decision boundary but fewer nodes. (See Color Plate XVI.)
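As a rough sketch of Delaunay editing (assuming SciPy's Delaunay triangulation is available; the book does not prescribe any particular library), one could drop every point whose Delaunay neighbors all carry the point's own label:

# Delaunay editing sketch: remove points whose Delaunay neighbors all have the
# same color (label) as the point itself; such points cannot contribute to the
# decision boundary.
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edit(points, labels):
    tri = Delaunay(points)
    indptr, indices = tri.vertex_neighbor_vertices
    keep = []
    for i in range(len(points)):
        neighbors = indices[indptr[i]:indptr[i + 1]]
        # Keep the point only if some Delaunay neighbor has a different label.
        if np.any(labels[neighbors] != labels[i]):
            keep.append(i)
    return points[keep], labels[keep]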

7.2.3. Proximity Graphs for Editing

One problem with Delaunay editing is that it can still leave too many points (see Figure 7.17). These are usually points that are well separated from other classes and contribute only portions of the decision boundary very remote from the cluster of class representatives. Thus, these points are usually less important for the classification of unknown points, which can be expected to come from the same distribution as the training set.

Figure 7.17: Delaunay editing often leaves too many points in the edited set. The unfilled points will be removed by Delaunay editing, but the contribution of the circled points is questionable.

Another problem is that computing the DG over $n$ points in $d$ dimensions takes, in the worst case, $O(n^{\lceil d/2 \rceil})$ time, which is prohibitively expensive in higher dimensions (sometimes even in three dimensions).

So, it makes sense to pursue the following, approximate solution [Bhattacharya et al. 81]: compute a proximity graph over a given training set, which can be achieved more efficiently, edit the training set according to this graph, and then classify unknown points using this edited training set. Of course, the decision boundary will be different, but hopefully not too much, so that classification of unknown points will still yield (mostly) the same result.

Algorithm 7.4: Simple algorithm to clean (edit) a training set from incorrectly labeled samples.
  build the proximity graph of the entire training set
  for all samples in the set do
      find all its neighbors according to the proximity graph
      determine the most represented class (the winning class)
      mark the sample if the winning class is different from the label of the sample
  end for
  delete all marked samples from the set

Figure 7.18 shows the same training set edited with the DG, GG, and RNG (see Section 7.1.2). Because the GG usually contains fewer edges than the DG, more nodes have only neighbors of their own color and are thus removed. So the GG usually yields a smaller training set than the DG. A similar relation holds between the RNG and the GG.

Figure 7.18: Different proximity graphs yield different edited (i. e., reduced) training sets and, thus, different approximations to the decision boundary.

Consequently, the decision boundary changes, because, being induced by the nearest-neighbor rule, it is the boundary between Voronoi cells of different colors in the Voronoi diagram over the edited training set. As you can see in Figure 7.18, the GG does not change the decision boundary much, but the RNG can change it quite substantially.
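To make the pipeline of this section concrete, the following sketch builds a Gabriel graph by brute force (O(n^3), adequate only for small training sets; the graphs of Section 7.1.2 can be computed more efficiently) and then edits the training set with the same rule as for the DG. The function names are illustrative only.

# Brute-force Gabriel graph: (i, j) is an edge iff no other point lies inside
# the ball having segment ij as its diameter.
import numpy as np

def gabriel_graph(points):
    n = len(points)
    adj = [[] for _ in range(n)]
    sq = lambda a, b: float(np.sum((a - b) ** 2))
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = sq(points[i], points[j])
            if all(sq(points[i], points[k]) + sq(points[j], points[k]) >= d_ij
                   for k in range(n) if k != i and k != j):
                adj[i].append(j)
                adj[j].append(i)
    return adj

def edit_with_graph(points, labels, adj):
    # Keep only points that have at least one differently labeled neighbor.
    keep = [i for i in range(len(points))
            if any(labels[j] != labels[i] for j in adj[i])]
    return points[keep], labels[keep]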

7.2.4. Cleaning the Training Set

In general, editing can serve several purposes. One of them is to reduce the size of the training set; this is the sense in which we have used it so far. Another is to remove samples from the training set that have been mislabeled, that overlap with other classes (see Figure 7.13), or that are simply outliers.

This task can be solved by utilizing proximity graphs, too. A very simple method will be described in the following [Sanchez et al. 97].

The basic idea is to identify bad samples by examining their neighborhood. If most of the samples are correctly labeled, and the training set is sufficiently dense, then an incorrectly labeled sample will likely have a large number of differently labeled neighbor samples. This idea is summarized in Algorithm 7.4.

This algorithm can be refined slightly by modifying the rule for when to mark a sample. For instance, we could mark a sample only if the number of neighbors in the winning class is at least 2/3 of the total number of neighbors. In addition, this method can be applied repeatedly to the training set.
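A hedged sketch of Algorithm 7.4 with the 2/3 refinement just mentioned might look as follows; 'adj' is any proximity graph given as adjacency lists (for instance, the Gabriel graph from the sketch above), and the threshold parameter and function name are illustrative, not taken from the book.

# Cleaning the training set (Algorithm 7.4 with a threshold on the winning class).
# 'points' and 'labels' are NumPy arrays, 'adj' is a list of neighbor lists.
from collections import Counter

def clean_training_set(points, labels, adj, threshold=2.0 / 3.0):
    marked = set()
    for i in range(len(points)):
        neighbor_labels = [labels[j] for j in adj[i]]
        if not neighbor_labels:
            continue
        winner, count = Counter(neighbor_labels).most_common(1)[0]
        # Mark the sample if a sufficiently dominant winning class disagrees
        # with the sample's own label.
        if winner != labels[i] and count >= threshold * len(neighbor_labels):
            marked.add(i)
    keep = [i for i in range(len(points)) if i not in marked]
    return points[keep], labels[keep]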


