FEATURE SELECTION FOR ENSEMBLES


In this section, we propose a new meta-ensemble algorithm that directly optimizes ensembles by creating a two-level evolutionary environment. In particular, we employ feature selection not only to increase the prediction accuracy of an individual classifier, but also to promote diversity among the component classifiers of an ensemble (Opitz, 1999).

Feature Selection and Ensembles

Recently, many researchers have combined the predictions of multiple classifiers to produce a better classifier, an ensemble, and often have reported improved performance (Bauer & Kohavi, 1999; Breiman, 1996b). Bagging (Breiman, 1996a) and Boosting (Freund & Schapire, 1996) are the most popular methods for creating accurate ensembles. The effectiveness of Bagging and Boosting comes primarily from the diversity caused by resampling training examples while using the complete set of features to train component classifiers.

Several attempts have also been made to incorporate diversity in the feature dimension into ensemble methods. The Random Subspace Method (RSM) of Ho (1998a, 1998b) was one early algorithm that constructed an ensemble by varying the feature subset. RSM used C4.5 as the base classifier and randomly chose half of the original features to build each classifier. In Guerra-Salcedo and Whitley (1999), four different ensemble methods were paired with each of three different feature selection algorithms: complete, random, and genetic search. Using two table-based classification methods, ensembles constructed from features selected by the GA showed the best performance. In Cunningham and Carney (2000), a new entropy measure over the outputs of the component classifiers was used to explicitly measure ensemble diversity and to produce good feature subsets for ensembles via hill-climbing search.

Genetic Ensemble Feature Selection (GEFS) (Opitz, 1999) used a GA to search the space of possible feature subsets. GEFS starts with an initial population of classifiers built using up to 2D features, where D is the complete feature dimension; a feature may therefore be selected more than once. Crossover and mutation operators are used to search for new feature subsets. Using the 100 most-fit members with a majority voting scheme, GEFS reported better estimated generalization than Bagging and AdaBoost on about two-thirds of the 21 data sets tested. Longer chromosomes, however, make GEFS computationally expensive in terms of memory usage (Guerra-Salcedo & Whitley, 1999). Further, GEFS evaluates each classifier by combining two objectives in a subjective manner, using fitness = accuracy + λ · diversity, where diversity is the average difference between the predictions of the component classifiers and those of the ensemble.

However, all these methods consider only one ensemble. We propose a new algorithm for ensemble feature selection, Meta-Evolutionary Ensembles (MEE), that considers multiple ensembles simultaneously and allows each component classifier to move into the best-fit ensemble. We evaluate and reward each classifier based on two different criteria, accuracy and diversity. A classifier that correctly predicts data examples that other classifiers in the same ensemble misclassify contributes more to the accuracy of the ensemble to which it belongs. We imagine that some limited "energy" is evenly distributed among the examples in the data set. Each classifier is rewarded with some portion of the energy if it correctly predicts an example. The more classifiers that correctly classify a specific example, the less energy is rewarded to each, encouraging them to correctly predict the more difficult examples. The predictive accuracy of each ensemble determines the total amount of energy to be replenished at each generation. Finally, we select the ensemble with the highest accuracy as our final model.

Meta-Evolutionary Ensembles

Pseudocode for the Meta-Evolutionary Ensembles (MEE) algorithm is shown in Figure 10, and a graphical depiction of the energy allocation scheme is shown in Figure 11.

Figure 10: Pseudo-code of Meta-Evolutionary Ensembles (MEE) algorithm.

Figure 11: Graphical depiction of energy allocation in the MEE. Individual classifiers (small boxes in the environment) receive energy by correctly classifying test points. Energy for each ensemble is replenished between generations based on the accuracy of the ensemble. Ensembles with higher accuracy have their energy bins replenished with more energy per classifier, as indicated by the varying widths of the bins.

Each agent (candidate solution) in the population is first initialized with randomly selected features, a random ensemble assignment, and an initial reservoir of energy. The representation of an agent consists of D + log2(G) bits, where D is the complete feature dimension and G is the maximum number of ensembles. The first D bits encode the selected features (1 if a feature is selected, 0 otherwise), and the remaining log2(G) bits are a binary representation of the agent's ensemble index. Mutation and crossover operators are used to explore the search space and are defined in the same way as in the previous section.
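As an illustration, the following minimal Python sketch shows one way to encode and decode such agents. The names (`random_agent`, `decode`) and the particular values of D and G are ours, not the chapter's:

```python
import math
import random

D = 34               # number of candidate features; illustrative value
G = 8                # maximum number of ensembles; illustrative value
ENSEMBLE_BITS = int(math.log2(G))

def random_agent(initial_energy=0.2):
    """Create an agent: D feature bits, log2(G) ensemble-index bits,
    and an initial reservoir of energy."""
    bits = [random.randint(0, 1) for _ in range(D + ENSEMBLE_BITS)]
    return {"bits": bits, "energy": initial_energy}

def decode(agent):
    """Parse the bit string into a feature subset J and an ensemble index g."""
    feature_bits = agent["bits"][:D]
    J = [i for i, b in enumerate(feature_bits) if b == 1]
    g = int("".join(map(str, agent["bits"][D:])), 2)  # binary ensemble index
    return J, g
```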

In each iteration of the algorithm, an agent explores a candidate solution (classifier) similar to itself, obtained via crossover and mutation. The agent's bit string is parsed to obtain a feature subset J. An ANN is then trained on the projection of the data set onto J, and returns the predicted class labels for the test examples. The agent collects ΔE from each example it correctly classifies, and is taxed once with E_cost. The net energy intake of an agent is determined by its classification accuracy, but it also depends on the state of the environment. We have an energy source for each ensemble, divided into bins corresponding to the data points. For ensemble g and record index r in the test data, the environment keeps track of the energy $E^{envt}_{g,r}$ and of $count_{g,r}$, the number of agents in ensemble g that correctly predict record r. The energy received by an agent for each correctly classified record r is given by

$$\Delta E = \min\left(\frac{E^{envt}_{g,r}}{count_{g,r}},\; \frac{E^{envt}_{g,r}}{5}\right).$$

An agent receives a greater reward for correctly predicting an example that most classifiers in its ensemble get wrong. The min function ensures that, for a given point, there is enough energy to reward at least five agents in the new generation. Candidate solutions receive energy only inasmuch as the environment has sufficient resources; if these are depleted, no benefits are available until the environmental resources are replenished. Thus, an agent is rewarded with energy for its high fitness, but also has an interest in finding unpopulated niches, where more energy is available. The result is a natural bias toward diverse solutions in the population. E_cost for any action is a constant (E_cost < θ).
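A minimal sketch of this energy-collection step, assuming the reward rule reconstructed above; the dictionary layout of `E_envt` and `count` and the function name are illustrative:

```python
def collect_energy(agent, g, correct_records, E_envt, count, E_cost=0.2):
    """Agent in ensemble g draws energy for each correctly classified
    record r, then pays the constant tax E_cost once."""
    for r in correct_records:
        # Reward shrinks as more agents in the same ensemble get r right;
        # the cap E_envt[g][r]/5 leaves enough for at least five agents.
        delta = min(E_envt[g][r] / max(count[g][r], 1), E_envt[g][r] / 5)
        agent["energy"] += delta
        E_envt[g][r] -= delta  # a depleted bin yields no further benefit
    agent["energy"] -= E_cost
```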

In the selection part of the algorithm, an agent compares its current energy level with a constant reproduction threshold θ. If its energy is higher than θ, the agent reproduces; the agent and its mutated clone become part of the new population, with the offspring receiving half of its parent's energy. If the energy level of an agent is positive but lower than θ, only that agent joins the new population.
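The selection rule can be sketched as follows; `mutate` stands in for the mutation operator defined earlier, and the dictionary-based agent layout continues from the sketches above:

```python
def select(population, theta=0.3, mutate=lambda bits: bits):
    """Energy-based selection: an agent above the reproduction threshold
    theta reproduces and splits its energy with its mutated clone; an agent
    with positive energy below theta simply survives; the rest die off."""
    new_population = []
    for agent in population:
        if agent["energy"] > theta:
            agent["energy"] /= 2.0  # offspring receives half the parent's energy
            child = {"bits": mutate(list(agent["bits"])), "energy": agent["energy"]}
            new_population.extend([agent, child])
        elif agent["energy"] > 0:
            new_population.append(agent)
    return new_population
```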

The environment of each ensemble is replenished with energy based on the ensemble's predictive accuracy, as determined by majority voting with equal weights among its base classifiers. We sort the ensembles in ascending order of estimated accuracy and apportion energy in linear proportion to that accuracy, so that the most accurate ensemble is replenished with the greatest amount of energy per base classifier. Since the total amount of energy replenished also depends on the number of agents in each ensemble, an ensemble with lower accuracy may be replenished with more energy in total than one with higher accuracy.
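One way to realize this replenishment step is sketched below; the exact proportionality scheme, the total budget `E_tot`, and the even spread across record bins are our reading of the description (E_tot = 30 matches the environment-energy value reported in the experiments):

```python
def replenish(E_envt, ensembles, E_tot=30.0):
    """Refill each ensemble's energy bins between generations.  Energy per
    base classifier grows linearly with the ensemble's estimated accuracy,
    so an ensemble's total budget scales with accuracy * number of agents."""
    total = sum(e["accuracy"] * e["n_agents"] for e in ensembles)
    for e in ensembles:
        budget = E_tot * e["accuracy"] * e["n_agents"] / total
        bins = E_envt[e["index"]]
        for r in range(len(bins)):
            bins[r] += budget / len(bins)  # spread evenly over the record bins
```

Because the budget scales with the number of agents, a large but mediocre ensemble can still receive more total energy than a small, accurate one, as noted above.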

Experimental Results

Experimental results of MEE/ANN

We tested the performance of MEE combined with neural networks on several data sets that were used in Opitz (1999). In our experiments, the weights and biases of the neural networks are initialized randomly between -0.5 and 0.5, and the number of hidden nodes is determined heuristically as the square root of the number of inputs. The other parameters for the neural networks are a learning rate of 0.1 and a momentum rate of 0.9. The number of training epochs was kept small for computational reasons. The values of the evolutionary parameters are: Pr(mutation) = 1.0, Pr(crossover) = 0.8, E_cost = 0.2, θ = 0.3, and T = 30. The value $E^{tot}_{envt}$ = 30 is chosen to maintain a population size of around 100 classifier agents.
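For concreteness, the base network configuration could be reproduced with scikit-learn along these lines (a sketch under the chapter's settings; scikit-learn applies its own weight-initialization scheme rather than U(-0.5, 0.5)):

```python
from math import sqrt
from sklearn.neural_network import MLPClassifier

def build_base_classifier(n_inputs, epochs):
    """Base ANN per the chapter's settings: sqrt(inputs) hidden nodes,
    learning rate 0.1, momentum 0.9, and a small number of epochs."""
    return MLPClassifier(
        hidden_layer_sizes=(max(1, round(sqrt(n_inputs))),),
        solver="sgd",               # plain backpropagation with momentum
        learning_rate_init=0.1,
        momentum=0.9,
        max_iter=epochs,
    )
```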

Experimental results are shown in Table 4. All computational results for MEE are based on the performance of the best ensemble and are averaged over five standard 10-fold cross-validation experiments. Within the training algorithm, each ANN is trained on two-thirds of the training set and tested on the remaining third for energy allocation purposes. We present the performance of a single neural network using the complete set of features as a baseline. In the win-loss-tie results shown at the bottom of Table 4, a comparison is considered a tie if the intervals defined by one standard error[8] of the mean overlap. Across the data sets tested, MEE shows consistent improvement over a single neural network.

Table 4: Experimental results of MEE/ANN

Data sets      Single net       Bagging   AdaBoost   GEFS    MEE              Epochs
               Avg.     S.D.                                 Avg.     S.D.
------------------------------------------------------------------------------------
Credita        84.3     0.30    86.2      84.3       86.8    86.4     0.52    40
Creditg        71.7     0.43    75.8      74.7       75.2    75.6     0.78    50
Diabetes       76.4     0.93    77.2      76.7       77.0    76.8     0.42    50
Glass          57.1     2.69    66.9      68.9       69.6    61.1     1.73    100
Cleveland      80.7     1.83    83.0      78.9       83.9    83.3     1.54    50
Hepatitis      81.5     0.21    82.2      80.3       83.3    84.9     0.65    40
Votes-84       95.9     0.41    95.9      94.7       95.6    96.1     0.44    40
Hypo           93.8     0.09    93.8      93.8       94.1    93.9     0.06    50
Ionosphere     89.3     0.85    90.8      91.7       94.6    93.5     0.81    100
Iris           95.9     1.10    96.0      96.1       96.7    96.5     0.73    100
Krvskp         98.8     0.63    99.2      99.7       99.3    99.3     0.10    50
Labor          91.6     2.29    95.8      96.8       96.5    94.4     0.78    50
Segment        92.3     0.97    94.6      96.7       96.4    93.2     0.28    50
Sick           95.2     0.47    94.3      95.5       96.5    99.3     0.03    50
Sonar          80.5     2.03    83.2      87.0       82.2    85.2     1.57    100
Soybean        92.0     0.92    93.1      93.7       94.1    93.8     0.19    50
Vehicle        74.7     0.48    79.3      80.3       81.0    76.4     1.12    50
------------------------------------------------------------------------------------
Win-loss-tie   15-0-2           7-4-6     9-6-2      4-7-6

We also include the results for Bagging, AdaBoost, and GEFS from Opitz (1999) for indirect comparison. In these comparisons, we did not have access to the accuracy results of the individual runs; therefore, a tie is conservatively defined as a test in which the one-standard-deviation interval of our result contains the point estimate of accuracy from Opitz (1999). In terms of predictive accuracy, our algorithm demonstrates performance better than or equal to single neural networks, Bagging, and Boosting. However, MEE shows slightly worse performance than GEFS, possibly due to methodological differences. For example, it is possible that the more complex neural network structures used in GEFS can learn more difficult patterns in data sets such as Glass and Labor.

From the perspective of computational complexity, our algorithm can be very slow compared to Bagging and Boosting. However, MEE can be much faster than GEFS, because GEFS uses up to twice as many input features as MEE. Further, GEFS's larger number of hidden nodes and longer training epochs can make it extremely slow.

Guidelines toward optimized ensemble construction

In this section, we use MEE to examine ensemble characteristics and provide some guidelines for building optimal ensembles. We expect that by optimizing the ensemble construction process, MEE will in general achieve comparable accuracy to other methods using fewer individuals. We use data collected from the first fold of the first cross-validation routine for the following analyses.

We first investigate whether ensemble size is positively related to predictive accuracy. It has been well established that, up to a point, the predictive accuracy of an ensemble improves as the number of classifiers in it increases. For example, our result in Figure 12 (left) indicates that accuracy improvements flatten out at an ensemble size of approximately 15-25. We also investigate whether the diversity among classifiers is positively related to the ensemble's classification performance. In our experiments, we measured diversity based on the difference in predicted class between each classifier and the ensemble. We first define a new operator ⊕ as follows: α ⊕ β = 0 if α = β, and 1 otherwise. When an ensemble e consists of g classifiers, the diversity of ensemble e, diversity_e, is defined as

$$\mathit{diversity}_e = \frac{1}{g\,N} \sum_{i=1}^{g} \sum_{j=1}^{N} \mathit{pred}_j^i \oplus \mathit{pred}_j^e,$$

where N is the number of records in the test data, and pred_j^i and pred_j^e represent the predicted class label for record j by classifier i and by ensemble e, respectively. The larger the value of diversity_e, the more diverse the ensemble.

Figure 12: The relationship between predictive accuracy and ensemble size (left), and between predictive accuracy and ensemble diversity (right), with 95% confidence intervals, on the Soybean data. We observed similar patterns on other data sets.
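A small sketch of this diversity computation in Python; the NumPy array layout and function name are ours:

```python
import numpy as np

def ensemble_diversity(member_preds, ensemble_pred):
    """diversity_e: average disagreement between each component classifier's
    predicted labels and the ensemble's majority-vote labels.

    member_preds : (g, N) array of class labels, one row per classifier
    ensemble_pred: (N,) array of majority-vote class labels
    """
    member_preds = np.asarray(member_preds)
    g, N = member_preds.shape
    # The operator "pred != ensemble_pred" plays the role of alpha XOR beta:
    # 1 where the labels differ, 0 where they agree.
    disagreements = (member_preds != np.asarray(ensemble_pred)).sum()
    return disagreements / (g * N)
```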

We show the relationship between predictive accuracy and ensemble diversity in Figure 12 (right). This shows the expected positive relationship between accuracy and diversity. However, our results also show that too much diversity among classifiers can deteriorate ensemble performance, as the final decision made by the ensemble approaches a random guess.

Conclusions

In this section, we propose a new two-level ensemble construction algorithm, Meta-Evolutionary Ensembles (MEE), that uses feature selection as the diversity mechanism. At the first level, individual classifiers compete against each other to correctly predict held-out examples. Classifiers are rewarded for predicting difficult points, relative to the other members of their respective ensembles. At the top level, the ensembles compete directly based on classification accuracy.

Our model shows consistently improved classification performance compared to a single classifier, at the cost of increased computational complexity. Compared to traditional ensembles (Bagging and Boosting) and GEFS, our resulting ensemble shows comparable performance while remaining smaller. Further, our two-level evolutionary framework confirms that more diversity among classifiers can improve predictive accuracy. Up to a certain level, ensemble size also has a positive effect on ensemble performance.

The next step is to compare this algorithm more rigorously to others on a larger collection of data sets, and to perform any necessary tuning of the EA's energy allocation scheme. This new experiment will test the claim that there is relatively little room for other ensemble algorithms to improve on the decision forest method (Breiman, 1999). Along the way, we will examine the role of various characteristics of ensembles (size, diversity, etc.) and classifiers (type, number of dimensions/data points, etc.). By giving the system as many degrees of freedom as possible and observing the characteristics that lead to successful ensembles, we can directly optimize these characteristics and translate the results to a more scalable architecture (Street & Kim, 2001) for large-scale predictive tasks.

[8]In our experiments, the standard error is computed as (standard deviation)/√iter, where iter = 5.


