Chapter XI: Bayesian Data Mining and Knowledge Discovery | Data Mining: Opportunities and Challenges

Chapter XI - Bayesian Data Mining and Knowledge Discovery
Data Mining: Opportunities and Challenges
by John Wang (ed)
Idea Group Publishing 2003


Eitel J. M. Lauria, State University of New York, Albany, USA; Universidad del Salvador, Argentina

Giri Kumar Tayi, State University of New York, Albany, USA

One of the major problems faced by data-mining technologies is how to deal with uncertainty. The prime characteristic of Bayesian methods is their explicit use of probability to quantify uncertainty. They provide a practical way to make inferences from data, using probability models both for the values we observe and for the quantities about which we want to draw hypotheses. Bayes' Theorem provides the means of calculating the probability of a hypothesis given the observations (the posterior probability) from its prior probability, the probability of the observations, and the likelihood that the observational data fit the hypothesis.
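In symbols, the theorem can be written as follows. The notation is an assumption made here for illustration (h for the hypothesis, D for the observed data) and is not taken from the text above:

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
```

Here P(h) is the prior probability of the hypothesis, P(D | h) is the likelihood of the observations under the hypothesis, P(D) is the probability of the observations, and P(h | D) is the posterior probability.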

The purpose of this chapter is twofold: to provide an overview of the theoretical framework of Bayesian methods and its application to data mining, with special emphasis on statistical modeling and machine-learning techniques; and to illustrate each theoretical concept covered with practical examples. We will cover basic probability concepts, Bayes' Theorem and its implications, Bayesian classification, Bayesian belief networks, and an introduction to simulation techniques.

DATA MINING, CLASSIFICATION AND SUPERVISED LEARNING

There are different approaches to data mining, which can be grouped according to the kind of task pursued and the kind of data under analysis. A broad grouping of data-mining algorithms includes classification, prediction, clustering, association, and sequential pattern recognition.

Data mining is closely related to machine learning. Imagine a process in which a computer algorithm learns from experience (the training data set) and builds a model that is then used to predict future behavior. Mitchell (1997) defines machine learning as follows: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. For example, consider a handwriting recognition problem: the task T is to recognize and classify handwritten words and measures; the performance measure P is the percentage of words correctly classified; and the experience E is a database of handwritten words with given class values. This is a case of classification: a learning algorithm (known as a classifier) takes a set of classified examples from which it is expected to learn a way of classifying unseen examples. Classification is sometimes called supervised learning, because the learning algorithm operates under supervision by being provided with the actual outcome for each of the training examples.
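The performance measure P in the example above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original chapter: the function name `accuracy` and the sample labels are hypothetical, and the function returns a fraction between 0 and 1 rather than a percentage.

```python
def accuracy(predicted, actual):
    """Fraction of instances whose predicted class equals the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    # True counts as 1 and False as 0, so the sum is the number of matches.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels, for illustration only.
actual = ["yes", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "yes"]
print(accuracy(predicted, actual))  # 0.75 (three of four labels match)
```

In Mitchell's terms, a classifier learns if this number, measured on unseen examples, tends to rise as the training experience E grows.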

Consider the following example data set based on the records of the passengers of the Titanic^[1]. The Titanic dataset gives the values of four categorical attributes for each of the 2,201 people on board the Titanic when it struck an iceberg and sank. The attributes are social class (first class, second class, third class, crew member), age (adult or child), sex, and whether or not the person survived. Table 1 below lists the attributes and their possible values.

Table 1: Titanic example data set
ATTRIBUTE	POSSIBLE VALUES
social class	crew, 1st, 2nd, 3rd
age	adult, child
sex	male, female
survived	yes, no

In this case, we know the outcome for the whole universe of passengers on the Titanic; therefore, this is a good example with which to test the accuracy of the classification procedure. We can take a percentage of the 2,201 records at random (say, 90%) and use them as the input dataset with which to train the classification model.

The trained model would then be used to predict whether the remaining 10% of the passengers survived or not, based on each passenger's set of attributes (social class, age, sex). A fragment of the total dataset (24 records) is depicted in Table 2.

Table 2: Fragment of Titanic data set
Instance	Social class	Age	Sex	Survived
1	2nd	adult	female	yes
2	crew	adult	male	no
3	crew	adult	male	yes
4	2nd	adult	male	no
5	2nd	adult	female	yes
6	crew	adult	male	yes
7	crew	adult	male	no
8	1st	adult	male	no
9	crew	adult	male	yes
10	crew	adult	male	no
11	3rd	child	male	no
12	crew	adult	male	no
13	3rd	adult	male	no
14	1st	adult	female	yes
15	3rd	adult	male	no
16	3rd	child	female	no
17	3rd	adult	male	no
18	1st	adult	female	yes
19	crew	adult	male	no
20	3rd	adult	male	no
21	3rd	adult	female	no
22	3rd	adult	female	no
23	3rd	child	female	yes
24	3rd	child	male	no
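The random 90%/10% split described before Table 2 can be sketched as follows, using only the 24 records of the fragment. This is an illustrative sketch under stated assumptions: the names `TABLE_2` and `split_train_test`, the fixed random seed, and the choice to round the training size down are not from the original text.

```python
import random

# The 24 records of Table 2 as (social_class, age, sex, survived) tuples.
TABLE_2 = [
    ("2nd", "adult", "female", "yes"), ("crew", "adult", "male", "no"),
    ("crew", "adult", "male", "yes"), ("2nd", "adult", "male", "no"),
    ("2nd", "adult", "female", "yes"), ("crew", "adult", "male", "yes"),
    ("crew", "adult", "male", "no"), ("1st", "adult", "male", "no"),
    ("crew", "adult", "male", "yes"), ("crew", "adult", "male", "no"),
    ("3rd", "child", "male", "no"), ("crew", "adult", "male", "no"),
    ("3rd", "adult", "male", "no"), ("1st", "adult", "female", "yes"),
    ("3rd", "adult", "male", "no"), ("3rd", "child", "female", "no"),
    ("3rd", "adult", "male", "no"), ("1st", "adult", "female", "yes"),
    ("crew", "adult", "male", "no"), ("3rd", "adult", "male", "no"),
    ("3rd", "adult", "female", "no"), ("3rd", "adult", "female", "no"),
    ("3rd", "child", "female", "yes"), ("3rd", "child", "male", "no"),
]


def split_train_test(records, train_fraction=0.9, seed=0):
    """Shuffle a copy of the records and split it into training and test sets."""
    shuffled = list(records)                # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)  # training size, rounded down
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(TABLE_2)
print(len(train), "training records,", len(test), "test records")
```

With 24 records and a 0.9 fraction, 21 records go to training and 3 are held out. The same function would apply unchanged to the full set of 2,201 records.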

The question that remains is how to train the classifier so that it predicts, with reasonable accuracy, the class of each new instance it is fed. There are many different approaches to classification, including traditional multivariate statistical methods, where the goal is to predict or explain categorical dependent variables (logistic regression, for example), decision trees, neural networks, and Bayesian classifiers. In this chapter, we will focus on two methods: Naive Bayes and Bayesian Belief Networks.

^[1]The complete dataset can be found at Delve, a machine learning repository and testing environment located at the University of Toronto, Department of Computer Science. The URL is http://www.cs.toronto.edu/~delve.

