FEATURE SELECTION IN SUPERVISED LEARNING

Chapter IV - Feature Selection in Data Mining
Data Mining: Opportunities and Challenges
by John Wang (ed.)
Idea Group Publishing, 2003

In this section, we propose a new approach to customer targeting that combines evolutionary algorithms (EAs) and artificial neural networks (ANNs). In particular, we address the multi-objective nature of customer targeting applications: maximizing the hit rate while minimizing the complexity of the model through feature selection. We use ELSA to search the space of possible feature combinations, and ANNs to score the probability that a customer will buy the new service or product using only the features selected by ELSA.

Problem Specification and Data Sets

Direct mailings to potential customers have been one of the most common approaches to marketing a new product or service. With a better understanding of who its potential customers are, a company knows more accurately whom to target, and it can reduce expenses and the waste of time and effort. In particular, we are interested in predicting which potential customers would be interested in buying a recreational vehicle (RV) insurance policy[2] while reducing feature dimensionality.

Suppose that an insurance company wants to advertise a new insurance policy based on socio-demographic data over a certain geographic area. From its first direct mailing to 5,822 prospects, 348 purchased RV insurance, resulting in a hit rate of 348/5822 ≈ 5.97%. Could the company attain a higher response rate with another carefully chosen direct mailing to the top x% of a new set of 4,000 potential prospects?

In our experiment, we use two separate data sets: a training set (5,822 records) and an evaluation set (4,000 records). Originally, each data set had 85 attributes, containing socio-demographic information (attributes 1-43) and contribution to and ownership of various insurance policies (attributes 44-85). The socio-demographic data was derived from zip codes, so all customers living in areas with the same zip code were assumed to have the same socio-demographic attributes. We omitted the first feature (customer subtype) mainly because representing it as a 41-bit variable would expand the search space dramatically with little information gain. We can still exploit customer-type information by recoding the fifth feature (customer main type) as a 10-bit variable. The other features are treated as continuous and scaled to a common range (0-9).
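To make the preprocessing concrete, here is a minimal sketch in Python with pandas. It assumes a DataFrame with the raw CoIL 2000 layout; the column name MOSHOOFD (customer main type) and the function name are illustrative assumptions, not details from the source:

    import pandas as pd

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        # Drop the first attribute, customer subtype (41 levels).
        df = df.drop(columns=[df.columns[0]])
        # Recode customer main type as ten 0/1 indicator columns
        # (MOSHOOFD is the assumed CoIL 2000 column name).
        main = pd.get_dummies(df["MOSHOOFD"], prefix="MOSHOOFD").astype(int)
        df = pd.concat([df.drop(columns=["MOSHOOFD"]), main], axis=1)
        # Scale the remaining features to the common 0-9 range.
        rest = df.columns.difference(main.columns)
        lo, hi = df[rest].min(), df[rest].max()
        df[rest] = ((df[rest] - lo) / (hi - lo).replace(0, 1) * 9).round().astype(int)
        return df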

ELSA/ANN Model Specification

Structure of the ELSA/ANN Model

Our predictive model is a hybrid of the ELSA and ANN procedures, as shown in Figure 2. ELSA searches for a set of feature subsets and passes each to an ANN. The ANN extracts predictive information from the subset and learns its patterns using a randomly selected two-thirds of the training data. The trained ANN is then evaluated on the remaining one-third of the training data and returns two evaluation metrics, Faccuracy and Fcomplexity (described below), to ELSA. Note that in both the learning and evaluation procedures, the ANN uses only the selected features. Based on the returned metric values, ELSA biases its search toward maximizing the two objectives until the maximum number of iterations is reached.

Figure 2: The ELSA/ANN model.
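As a rough illustration of one such fitness evaluation, consider the following Python sketch. It uses scikit-learn's MLPClassifier as a stand-in for the chapter's ANN; the function name, network size, and other parameter values are assumptions, and X, y are assumed to be NumPy arrays:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    def evaluate_subset(mask, X, y, top_frac=0.20):
        """Score one candidate feature subset (boolean mask over columns)."""
        cols = np.flatnonzero(mask)
        if cols.size == 0:                      # at least one feature must be used
            return 0.0, 0.0
        # Train on a random two-thirds of the training data, evaluate on the rest.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[:, cols], y, train_size=2 / 3, stratify=y, random_state=0)
        ann = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
                            random_state=0).fit(X_tr, y_tr)
        # F_accuracy: hit rate among the top-scored 20% of held-out prospects.
        scores = ann.predict_proba(X_te)[:, 1]
        n_top = max(1, int(top_frac * len(y_te)))
        top = np.argsort(scores)[::-1][:n_top]
        f_accuracy = y_te[top].sum() / n_top
        # F_complexity: reward parsimony (see the metric definitions below).
        f_complexity = 1.0 - (cols.size - 1) / (X.shape[1] - 1)
        return f_accuracy, f_complexity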

Among all solutions evaluated over the generations, we choose for further evaluation the set of candidates that satisfies a minimum hit-rate threshold. With the chosen candidates, we run a 10-fold cross-validation. In this procedure, the training data is divided into 10 non-overlapping groups. We train an ANN using the first nine groups of training data and evaluate the trained ANN on the remaining group. We repeat this procedure until each of the 10 groups has been used as a test set once. We take the average of the accuracy measurements over the 10 evaluations and call it an intermediate accuracy. We repeat the 10-fold cross-validation procedure five times and call the average of the five intermediate accuracy estimates the estimated accuracy.
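A minimal sketch of this repeated cross-validation, assuming NumPy arrays, a scikit-learn-style model factory, and a scoring callback (all names are illustrative):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def estimated_accuracy(make_model, X, y, score, n_repeats=5, n_splits=10):
        """Average of n_repeats intermediate accuracies, each from n_splits-fold CV."""
        intermediates = []
        for rep in range(n_repeats):
            kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=rep)
            fold_scores = [score(make_model().fit(X[tr], y[tr]), X[te], y[te])
                           for tr, te in kf.split(X, y)]
            intermediates.append(np.mean(fold_scores))   # intermediate accuracy
        return float(np.mean(intermediates))             # estimated accuracy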

We maintain a superset of the Pareto front containing the solutions with the highest accuracy at every Fcomplexity level covered by ELSA. For evaluation purposes, we subjectively pick a "best" solution with the minimal number of features at a marginal accuracy level.[3] We then train the ANN on all the training data using the selected features only, and the trained model selects the top x% of the potential customers in the evaluation set based on the estimated probability of buying RV insurance. Finally, we calculate the actual accuracy of our model.
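The selection step might look like the following sketch; the accuracy tolerance tol that stands in for the "marginal accuracy level" is an assumed knob, not a value from the source:

    def select_solution(candidates, tol=0.01):
        """candidates: iterable of (estimated_accuracy, n_features, mask) tuples."""
        best = {}                                # best accuracy per complexity level
        for acc, d, mask in candidates:
            if d not in best or acc > best[d][0]:
                best[d] = (acc, mask)
        top_acc = max(acc for acc, _ in best.values())
        # Fewest features among solutions within tol of the best accuracy.
        d_star = min(d for d, (acc, _) in best.items() if acc >= top_acc - tol)
        return best[d_star][1]                   # the chosen feature mask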

Evaluation Metrics

We use two heuristic evaluation criteria, Faccuracy and Fcomplexity, to evaluate selected feature subsets. Each objective is normalized into 25 intervals for energy allocation and is maximized by ELSA.

Faccuracy: The purpose of this objective is to favor feature sets with a higher hit rate. We define two different measures, Faccuracy1 and Faccuracy2, for the two experiments.

In Experiment 1, we select the top 20% of potential customers in descending order of the probability of purchasing the product and compute the ratio of the number of actual customers, AC, to the number of chosen prospects, TC. We calculate Faccuracy1 as follows:

$$F_{accuracy_1} = \frac{1}{Z_{accuracy_1}} \cdot \frac{AC}{TC}$$

where $Z_{accuracy_1}$ is a normalization constant.
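In code, the metric might be computed as in this sketch (NumPy arrays assumed; z stands for the normalization constant, and the function name is illustrative):

    import numpy as np

    def f_accuracy1(scores, y, top_frac=0.20, z=1.0):
        """Hit rate among the top-scored 20% of prospects, scaled by 1/Z."""
        n_top = max(1, int(top_frac * len(y)))      # TC: number of chosen prospects
        top = np.argsort(scores)[::-1][:n_top]      # highest purchase probabilities
        return (y[top].sum() / n_top) / z           # (AC / TC) / Z_accuracy1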

In Experiment 2, we measure accuracy over the first m intervals[4] after dividing the range of customer selection percentages into 50 intervals of equal width (2%). At each interval i ≤ m, we select the top (2·i)% of potential customers in descending order of the probability of purchasing the product and compute the ratio of the number of actual customers among them, ACi, to the total number of actual customers in the evaluation data, Tot. We multiply each ratio by the interval width and sum these values to approximate the area under the lift curve over the m intervals. Finally, we divide by m to get our final metric, Faccuracy2:

$$F_{accuracy_2} = \frac{1}{Z_{accuracy_2}} \cdot \frac{1}{m} \sum_{i=1}^{m} 0.02 \cdot \frac{AC_i}{Tot}$$

where Tot = 238, m = 25, and $Z_{accuracy_2}$ is an empirically derived normalization constant.
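A corresponding sketch for Faccuracy2, again with assumed names and NumPy arrays:

    import numpy as np

    def f_accuracy2(scores, y, m=25, width=0.02, z=1.0):
        """Approximate area under the lift curve over the first m intervals."""
        order = np.argsort(scores)[::-1]
        tot = y.sum()                            # Tot: all actual customers (238 here)
        area = 0.0
        for i in range(1, m + 1):
            n_sel = int(width * i * len(y))      # top (2*i)% of prospects
            ac_i = y[order[:n_sel]].sum()        # AC_i: actual customers among them
            area += width * ac_i / tot
        return (area / m) / z                    # divide by m, then normalize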

Fcomplexity: This objective is aimed at finding parsimonious solutions by minimizing the number of selected features:

$$F_{complexity} = 1 - \frac{d - 1}{D - 1}$$

where d is the number of selected features and D is the total number of available features. Note that at least one feature must be used; Fcomplexity equals 1 when a single feature is selected and 0 when all features are selected. Other things being equal, we expect that lower complexity will lead to easier interpretability of solutions and better generalization.
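Under this formulation the objective is a one-liner; the function name is illustrative:

    def f_complexity(d: int, D: int) -> float:
        """Parsimony objective: 1.0 with a single feature, 0.0 with all D features."""
        assert 1 <= d <= D, "at least one feature must be used"
        return 1.0 - (d - 1) / (D - 1)

For example, with D = 93 features in total, the seven-feature solution found in Experiment 1 scores 1 - 6/92 ≈ 0.935.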

Experimental Results

Experiment 1

In this experiment, we select the top 20% of customers to measure the hit rate of each solution, as in Kim and Street (2000). For comparison purposes, we implement the PCA/logit model by first applying PCA to the training set. We select 22 PCs, the minimum number required to explain more than 90% of the variance in the data set, and use them to reduce the dimensionality of the training set and the evaluation set.
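For reference, this baseline can be approximated with a standard scikit-learn pipeline; the scaler and solver settings are assumptions, not details from the source:

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # 22 principal components were the minimum explaining >90% of the variance.
    pca_logit = make_pipeline(StandardScaler(),
                              PCA(n_components=22),
                              LogisticRegression(max_iter=1000))
    # Usage: pca_logit.fit(X_train, y_train)
    #        scores = pca_logit.predict_proba(X_eval)[:, 1]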

We set the values of the ELSA parameters in the ELSA/ANN and ELSA/logit models as follows: Pr(mutation) = 1.0, pmax = 1,000, Ecost = 0.2, θ = 0.3, and T = 2,000. In both models, we select the single solution with the highest expected hit rate among those with fewer than 10 selected features. We evaluate each model on the evaluation set and summarize the results in Table 1.

Table 1: Results of Experiment 1

Model (# Features)      Training set             Evaluation set
                        Hit Rate ± s.d. (%)      # Correct    Hit Rate (%)
PCA/logit (22)          12.83 ± 0.498            109          13.63
ELSA/logit (6)          15.73 ± 0.203            115          14.38
ELSA/ANN (7)            15.92 ± 0.146            120          15.00

The column marked "# Correct" shows the number of actual customers included in the chosen top 20%. The number in parentheses represents the number of selected features, except for the PCA/logit model, where it represents the number of PCs selected.

In terms of actual hit rate, ELSA/ANN performs best. Feature selection (the difference in actual hit rate between PCA/logit and ELSA/logit) and nonlinear approximation (the difference between ELSA/logit and ELSA/ANN) each contribute about half of the total accuracy gain. The improvement of the ELSA/ANN model in actual hit rate could make a meaningful difference in profit as the number of targeted prospects increases.

The resulting ELSA/ANN model is also easier to interpret than the PCA/logit model, because it is difficult to interpret the meaning of each PC in a high-dimensional feature space. Further, the ELSA/ANN model makes it possible to evaluate the predictive importance of each feature. The seven features chosen by the ELSA/ANN model are: customer main type (average family), contribution to third-party policy, car policy, moped policy, fire policy, number of third-party policies, and number of social security policies. Among these features, we expected at least one of the car insurance-related features to be selected. The selection of moped policy ownership is justified by the fact that many people carry their mopeds or bicycles on the back of RVs. These two features are selected again by the ELSA/logit model.[5] Using this type of information, we were able to build a potentially valuable profile of likely customers (Kim & Street, 2000).

The fact that the ELSA/ANN model uses only seven of the 93 features makes it possible to save a great amount of money through reduced storage requirements (86/93 ≈ 92.5% of the features need not be stored) and through reduced labor and communication costs for data collection, transfer, and analysis. By contrast, the PCA/logit model needs the whole feature set to extract PCs. We also compare the lift curves of the three models. Figure 3 shows the cumulative hit rate over the top x% of prospects (2 ≤ x ≤ 100).

Figure 3: Lift curves of three models that maximize the hit rate when targeting the top 20% of prospects.

As expected, our ELSA/ANN model, followed by ELSA/logit, performs best when marketing to around the top 20% of prospects. However, the performance of ELSA/ANN and ELSA/logit over all other target percentages is worse than that of PCA/logit. This is understandable because our solution is specifically designed to optimize at the top 20% of prospects, while PCA/logit is not designed for specific selection points. This observation leads us to a second experiment aimed at improving the performance of the ELSA/ANN model over all selection points.

Experiment 2

In this experiment, we search for the best solution that maximizes the overall accuracy up to the top 50% of potential customers. The ELSA/ANN and ELSA/logit models are adjusted to maximize the overall area under the lift curve over this range. In practice, we optimize over the first 25 intervals of equal width (2%) to approximate the area under the lift curve. Because this new experiment is computationally expensive, we use two-fold cross-validation estimates for all solutions. We set the ELSA parameters to the same values as in the previous experiment, except pmax = 200 and T = 500. Based on the accuracy estimates, we choose, in both models, the solution that has the highest estimated accuracy while using fewer than half of the original features. We evaluate the three models on the evaluation set and summarize the results in Table 2 and in Figure 4.

Table 2: Summary of Experiment 2. Hit rates (%) of the three models over the top 50% of prospects

                        % of prospects selected
Model (# Features)      5      10     15     20     25     30     35     40     45     50
PCA/logit (22)          20.06  20.06  16.04  13.63  12.44  11.20  10.81  10.22  9.87   9.38
ELSA/logit (46)         23.04  18.09  15.56  13.79  12.13  12.04  10.97  10.54  10.03  9.53
ELSA/ANN (44)           19.58  17.55  16.40  14.42  13.13  11.96  10.97  10.40  9.98   9.64

Figure 4: Lift curves of three models that maximize the area under the lift curve when targeting up to the top 50% of prospects.

The ELSA/ANN model works better than PCA/logit and ELSA/logit over the targeting range between 15% and 50%. In particular, ELSA/ANN is best at 15%, 20%, 25%, and 50% of targeted customers, and approximately equal to the best at 30-45%. The overall performance of ELSA/logit is better than that of PCA/logit. We attribute this to the fact that solutions from both ELSA models exclude many irrelevant features. PCA/logit, however, is competitive for targeting more than 50% of the customers, since ELSA/ANN and ELSA/logit do not optimize over those ranges. Though the marked parsimony of the ELSA/ANN model in Experiment 1 is largely lost in Experiment 2, the ELSA/ANN model is still superior to the PCA/logit model in terms of the parsimony of selected features, since the PCA/logit model needs the whole feature set to construct the PCs.

Conclusions

In this section, we presented a novel application of multi-objective evolutionary algorithms to customer targeting. We used ELSA to search for possible combinations of features and an ANN to score customers based on the probability that they will buy the new service or product. The overall accuracy of ELSA/ANN was superior to both the traditional method, PCA/logit, and an intermediate model, ELSA/logit. Further, the final output of the ELSA/ANN model was much easier to interpret because only a small number of features were used.

In future work, we want to investigate how more general objectives affect the parsimony of selected features. We also would like to consider a marketing campaign in a more realistic environment where various types of costs and net revenue for additional customers are considered. We could also consider budget constraints and minimum/maximum campaign sizes. This way, the number of targeted customers would be determined inside an optimization routine to maximize the expected profit.

[2]This is one of the main tasks in the 2000 CoIL challenge (Kim & Street, 2000). For more information about the CoIL challenges and the data sets, please refer to http://www.dcs.napier.ac.uk/coil/challenge/.

[3]If the other objective values are equal, we prefer the solution with the smaller variance.

[4]This is reasonable because as we select more prospects, the expected accuracy gain will go down. If the marginal revenue from an additional prospect is much greater than the marginal cost, however, we could sacrifice the expected accuracy gain. Information on mailing cost and customer value was not available in this study.

[5]The other four features selected by the ELSA/logit model are: contribution to bicycle policy, fire policy, number of trailer policies, and number of lorry policies.
