In discussing Bayes' Theorem, the focus has been to find the most probable hypothesis given a set of observations. But in data-mining problems, one of the major tasks is to classify (predict the outcome of) a new observation based on a previous set of observations (the training data set). Suppose we have a classification problem where the class variable is denoted by C and can take the values c1, ..., ck. Consider a data set D represented by m attributes A1, A2, ..., Am, for which the observations (a1, a2, ..., am) have been taken for each instance of D. This implies that any given instance of D may be expressed as (A1 = a1, A2 = a2, ..., Am = am). Suppose that each instance of the data set D is classified as one of c1, ..., ck; the Bayesian approach to classifying a new instance is to assign the most probable target value (a class value ci) by calculating the posterior probability of each class given D and selecting the one with the maximum a posteriori (MAP) probability. Following Mitchell's notation:

cMAP = argmax(ci ∈ C) P(ci ∣ a1, a2, ..., am) = argmax(ci ∈ C) P(a1, a2, ..., am ∣ ci) P(ci)
Any system that classifies new instances according to the above expression is known as an "optimal Bayesian classifier." It can be proven that no other classifier using the same hypothesis space and the same prior knowledge achieves a higher probability of classifying a new instance correctly (Mitchell, 1997).
Although the idea of applying full-blown Bayesian criteria to analyze a hypothesis space in search of the most probable hypothesis is conceptually attractive, it usually fails to deliver in practical settings. Although we can successfully estimate P(ci) from the training data, calculating the joint probability P(D ∣ ci) = P(A1 = a1, A2 = a2, ..., Am = am ∣ ci) is usually not possible, because the number of probabilities to be estimated equals the number of possible instances times the number of class values. Unless the training data set is very large (which, in turn, might render the procedure computationally intractable), the resulting estimates would be based on a small fraction of the instance space and hence would be unreliable.
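To make the combinatorial difficulty concrete, consider a small sketch in Python; the attribute counts below are purely illustrative and not taken from any dataset in the text:

```python
# Hypothetical sizes, for illustration only: m attributes with d possible
# values each, and k class values.
m, d, k = 10, 5, 2

# Estimating P(A1 = a1, ..., Am = am | ci) directly requires one probability
# per attribute-value combination per class value.
joint_estimates = k * d ** m
print(joint_estimates)  # 19531250 distinct probabilities to estimate
```

A training set of realistic size could populate only a tiny fraction of those combinations, which is exactly why the direct estimates are unreliable.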
The Naive Bayes Classifier
The Naive Bayes Classifier facilitates the estimation of the conditional probabilities by introducing two simplifying assumptions:

1. The attributes are conditionally independent of one another given the class value, so that P(A1 = a1, ..., Am = am ∣ ci) = P(A1 = a1 ∣ ci) × ... × P(Am = am ∣ ci); and
2. all attributes are equally important to the classification decision.
These assumptions lead to a substantial reduction in the number of distinct conditional probability factors that must be estimated from the training data. For example, from the sample of 24 records extracted from the Titanic dataset (see Table 2), we can estimate the class probabilities P(survived = yes) and P(survived = no) as follows:
Next, the conditional probabilities are computed as follows:
This procedure is repeated for every conditional probability of the form P(Aj = aj ∣ ci) until every element in the conditional probability table has been computed. Note that with such a small sample the probability estimates will be inaccurate; the example was chosen for illustrative purposes only.
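Since Table 2 itself is not reproduced here, the whole procedure — class priors, conditional probability tables, and MAP classification — can be sketched in Python over a made-up miniature sample (the records and attribute names below are hypothetical, not the actual Table 2 data). Note that the naive factorization needs only one-dimensional factors P(Aj = aj ∣ ci) rather than a full joint table:

```python
from collections import Counter, defaultdict

# Hypothetical miniature Titanic-style sample: each record is
# (socialclass, sex, survived). These rows are invented for illustration.
data = [
    ("1st", "female", "yes"), ("1st", "male", "no"),
    ("2nd", "female", "yes"), ("2nd", "male", "no"),
    ("3rd", "male", "no"),    ("3rd", "female", "yes"),
    ("3rd", "male", "no"),    ("1st", "male", "yes"),
]
attributes = ["socialclass", "sex"]

# Class priors P(ci): relative frequency of each class value.
class_counts = Counter(row[-1] for row in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# Conditional probabilities P(Aj = aj | ci), one table per attribute.
cond = defaultdict(lambda: defaultdict(float))
for j, attr in enumerate(attributes):
    pair_counts = Counter((row[j], row[-1]) for row in data)
    for (aj, c), n in pair_counts.items():
        cond[attr][(aj, c)] = n / class_counts[c]

def classify(instance):
    """Return the MAP class for instance = (a1, ..., am)."""
    def score(c):
        p = priors[c]
        for attr, aj in zip(attributes, instance):
            p *= cond[attr][(aj, c)]
        return p
    return max(priors, key=score)

print(classify(("3rd", "male")))  # → "no"
```

With this sample, score("no") = 0.5 × 0.5 × 1.0 = 0.25 dominates score("yes") = 0.5 × 0.25 × 0.25, so the instance is assigned to survived = no.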
Although these assumptions may seem oversimplifying, they work quite well in certain classification problems. To illustrate the practical importance of the Naive Bayes algorithm in classification tasks, the authors coded a simple version of the Naive Bayes Classifier and tested it on the full Titanic dataset, using 90% of the data for training and the remaining 10% to test the accuracy of the classifier. With two possible class values (survived or not), random guessing would yield a classification accuracy of approximately 50%. In our case, the algorithm yielded an accuracy of 88%, impressive enough if we consider that the dataset contains a very small number of attributes.
Various empirical studies of the Naive Bayes Classifier have yielded results comparable to those obtained with state-of-the-art classifiers such as decision tree algorithms and neural network models; accuracy levels of 89% have been reported when using a Naive Bayes model to classify Usenet news articles. For more details, see Mitchell (1997) and Joachims (1996).
Coping with Zero-Frequency Attributes
The outcome of a Naive Bayes Classifier can be drastically affected when a particular attribute value does not occur in the training set for one or more class values. Suppose, for example, that in the Titanic data set no first-class passengers died in the wreck. The number of instances for which socialclass = 1st and survived = no would then be zero.
In such a case, the conditional probability P(socialclass = 1st ∣ survived = no) would be zero, and since P(D ∣ survived = no) is the product of the individual conditional probabilities, P(D ∣ survived = no) would also be zero, eliminating the class survived = no from consideration. Such a "zero-frequency" attribute value can bias the maximum a posteriori (MAP) criterion used for choosing the best class. This condition severely limits the practical use of the Naive Bayes Classifier; however, it can easily be remedied by applying minor adjustments to the method of calculating probabilities from frequencies. Two general approaches are typically considered in the literature (more details can be found in Kohavi, Becker, and Sommerfield, 1997):
Witten and Frank (2000) have suggested a Bayesian approach to estimating probabilities, using an estimator of the form (mjk + πj · ω)/(nk + ω), where mjk is the number of training instances with attribute value A = aj and class ck, nk is the number of training instances of class ck, ω is a small constant, and πj is the prior probability of the attribute value A = aj in our previous formulation. A typical way of estimating πj in the absence of other information is to assume uniform priors; that is, if an attribute has d values, the probability of each attribute value is 1/d.
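As a sketch, the Witten and Frank estimator with uniform priors might be coded as follows (the function and parameter names are ours, introduced for illustration):

```python
def smoothed_prob(m_jk, n_k, d, w=1.0):
    """Smoothed estimate of P(A = aj | ck), i.e. (m_jk + pi_j * w) / (n_k + w).

    m_jk : number of training instances with A = aj and class ck
    n_k  : number of training instances of class ck
    d    : number of distinct values of attribute A (uniform prior pi_j = 1/d)
    w    : small smoothing constant (omega)
    """
    return (m_jk + w / d) / (n_k + w)

# Zero-frequency case: no first-class passengers with survived = no among,
# say, 10 hypothetical non-survivors. The raw estimate 0/10 = 0 would wipe
# out the whole product; the smoothed estimate stays strictly positive.
print(smoothed_prob(0, 10, d=3))  # ≈ 0.0303 instead of 0
```

Because every estimate remains strictly positive, no class value is ever eliminated outright by a single unseen attribute value.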
The algorithm was coded using S-Plus. The programs are available from the authors.