In discussing Bayes' Theorem, the focus has been to find the most probable hypothesis given a set of observations. But in data-mining problems, one of the major tasks is to classify (predict the outcome of) a new observation based on a previous set of observations (the training data set). Suppose we have a classification problem where the class variable is denoted by C and can take the values c1, ..., ck. Consider a data set D represented by m attributes A1, A2, ..., Am, for which the observations (a1, a2, ..., am) have been taken for each instance of D. This implies that any given instance of D may be expressed as (A1 = a1, A2 = a2, ..., Am = am). Suppose that each instance of the data set D is classified as one of c1, ..., ck; the Bayesian approach to classifying a new instance is to assign the most probable target value (a class value ci) by calculating the posterior probability of each class given D and selecting the one with the maximum a posteriori (MAP) probability. Following Mitchell's notation:

cMAP = argmax(ci ∈ C) P(ci ∣ a1, a2, ..., am) = argmax(ci ∈ C) P(a1, a2, ..., am ∣ ci) P(ci)
Any system that classifies new instances according to the above expression is known as an "optimal Bayesian classifier." It can be proven that no other classifier using the same hypothesis space and the same prior knowledge achieves a higher probability of classifying a new instance correctly (Mitchell, 1997).
Although the idea of applying full-blown Bayesian criteria to analyze a hypothesis space in search of the most probable hypothesis is conceptually attractive, it usually fails to deliver in practical settings. Although we can successfully estimate P(ci) from the training data, calculating the joint probability P(D ∣ ci) = P(A1 = a1, A2 = a2, ..., Am = am ∣ ci) is usually not possible, because the number of probabilities to be estimated equals the number of possible instances times the number of class values. Unless the training data set is very large (which, in turn, might render the procedure computationally intractable), the resulting estimates would be based on a small fraction of the instance space and hence would be unreliable.
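To make the combinatorial difficulty concrete, consider a small sketch in Python; the attribute counts below are purely illustrative and not taken from any dataset in the text:

```python
# Hypothetical sizes, for illustration only: m attributes with d possible
# values each, and k class values.
m, d, k = 10, 5, 2

# Estimating P(A1 = a1, ..., Am = am | ci) directly requires one probability
# per attribute-value combination per class value.
joint_estimates = k * d ** m
print(joint_estimates)  # 19531250 distinct probabilities to estimate
```

A training set of realistic size could populate only a tiny fraction of those combinations, which is exactly why the direct estimates are unreliable.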
The Naive Bayes Classifier
The Naive Bayes Classifier facilitates the estimation of the conditional probabilities by introducing two simplifying assumptions:

1. The attributes are conditionally independent of one another given the class value, so that P(A1 = a1, ..., Am = am ∣ ci) = P(A1 = a1 ∣ ci) × ... × P(Am = am ∣ ci); and
2. all attributes are equally important to the classification decision.
These assumptions lead to a substantial reduction in the number of distinct conditional probability factors that must be estimated from the training data. For example, from the sample of 24 records extracted from the Titanic dataset (see Table 2), we can estimate the class probabilities P(survived = yes) and P(survived = no) as follows:
Next, the conditional probabilities are computed as follows:
This procedure is repeated for every conditional probability of the form P(Aj = aj ∣ ci) until every element in the conditional probability table has been computed. Note that with such a small sample the probability estimates will be inaccurate; the example was chosen for illustrative purposes only.
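Since Table 2 itself is not reproduced here, the whole procedure — class priors, conditional probability tables, and MAP classification — can be sketched in Python over a made-up miniature sample (the records and attribute names below are hypothetical, not the actual Table 2 data). Note that the naive factorization needs only one-dimensional factors P(Aj = aj ∣ ci) rather than a full joint table:

```python
from collections import Counter, defaultdict

# Hypothetical miniature Titanic-style sample: each record is
# (socialclass, sex, survived). These rows are invented for illustration.
data = [
    ("1st", "female", "yes"), ("1st", "male", "no"),
    ("2nd", "female", "yes"), ("2nd", "male", "no"),
    ("3rd", "male", "no"),    ("3rd", "female", "yes"),
    ("3rd", "male", "no"),    ("1st", "male", "yes"),
]
attributes = ["socialclass", "sex"]

# Class priors P(ci): relative frequency of each class value.
class_counts = Counter(row[-1] for row in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# Conditional probabilities P(Aj = aj | ci), one table per attribute.
cond = defaultdict(lambda: defaultdict(float))
for j, attr in enumerate(attributes):
    pair_counts = Counter((row[j], row[-1]) for row in data)
    for (aj, c), n in pair_counts.items():
        cond[attr][(aj, c)] = n / class_counts[c]

def classify(instance):
    """Return the MAP class for instance = (a1, ..., am)."""
    def score(c):
        p = priors[c]
        for attr, aj in zip(attributes, instance):
            p *= cond[attr][(aj, c)]
        return p
    return max(priors, key=score)

print(classify(("3rd", "male")))  # → "no"
```

With this sample, score("no") = 0.5 × 0.5 × 1.0 = 0.25 dominates score("yes") = 0.5 × 0.25 × 0.25, so the instance is assigned to survived = no.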
Although these assumptions may seem oversimplifying, they work quite well in certain classification problems. To illustrate the practical importance of the Naive Bayes algorithm in classification tasks, the authors coded a simple version of the Naive Bayes Classifier and tested it on the full Titanic dataset, using 90% of the data for training and the remaining 10% to test the accuracy of the classifier. With two possible class values (survived or not), random guessing would yield a classification accuracy of approximately 50%. In our case, the algorithm yielded an accuracy of 88%, impressive enough if we consider that the dataset contains a very small number of attributes.
Various empirical studies of the Naive Bayes Classifier have yielded results comparable to those obtained with state-of-the-art classifiers such as decision tree algorithms and neural network models; accuracy levels of 89% have been reported when using a Naive Bayes model to classify Usenet news articles. For more details, see Mitchell (1997) and Joachims (1996).
Coping with Zero-Frequency Attributes
The outcome of a Naive Bayes Classifier can be drastically affected when a particular attribute value does not occur in the training set for one or more class values. Suppose, for example, that in the Titanic data set no first-class passengers died in the wreck. The number of instances for which socialclass = 1st and survived = no would then be zero.
In such a case, the conditional probability P(socialclass = 1st ∣ survived = no) would be zero, and since P(D ∣ survived = no) is the product of the individual conditional probabilities, P(D ∣ survived = no) would also be zero, eliminating the class survived = no from consideration. Such a "zero-frequency" attribute value can bias the maximum a posteriori (MAP) criterion used for choosing the best class. This condition severely limits the practical use of the Naive Bayes Classifier; however, it can easily be remedied by applying minor adjustments to the method of calculating probabilities from frequencies. Two general approaches are typically considered in the literature (more details can be found in Kohavi, Becker, and Sommerfield, 1997):
Witten and Frank (2000) have suggested a Bayesian approach to estimating probabilities, using an estimator of the form (mjk + πj · ω)/(nk + ω), where mjk is the number of training instances with attribute value A = aj and class ck, nk is the number of training instances of class ck, ω is a small constant, and πj is the prior probability of the attribute value A = aj in our previous formulation. A typical way of estimating πj in the absence of other information is to assume uniform priors; that is, if an attribute has d values, the probability of each attribute value is 1/d.
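As a sketch, the Witten and Frank estimator with uniform priors might be coded as follows (the function and parameter names are ours, introduced for illustration):

```python
def smoothed_prob(m_jk, n_k, d, w=1.0):
    """Smoothed estimate of P(A = aj | ck), i.e. (m_jk + pi_j * w) / (n_k + w).

    m_jk : number of training instances with A = aj and class ck
    n_k  : number of training instances of class ck
    d    : number of distinct values of attribute A (uniform prior pi_j = 1/d)
    w    : small smoothing constant (omega)
    """
    return (m_jk + w / d) / (n_k + w)

# Zero-frequency case: no first-class passengers with survived = no among,
# say, 10 hypothetical non-survivors. The raw estimate 0/10 = 0 would wipe
# out the whole product; the smoothed estimate stays strictly positive.
print(smoothed_prob(0, 10, d=3))  # ≈ 0.0303 instead of 0
```

Because every estimate remains strictly positive, no class value is ever eliminated outright by a single unseen attribute value.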
The algorithm was coded using S-Plus. The programs are available from the authors.