BAYESIAN BELIEF NETWORKS

Chapter XI - Bayesian Data Mining and Knowledge Discovery
Data Mining: Opportunities and Challenges
by John Wang (ed) 
Idea Group Publishing 2003

Bayesian belief networks (BBNs) take a middle road between the computationally intensive optimal Bayesian classifier and the over-simplified Naive Bayesian approach. A BBN describes the probability distribution of a set of attributes by specifying a set of conditional independence assumptions together with a set of causal relationships among attributes and their related joint probabilities. Used in this way, BBNs provide a powerful knowledge representation formalism, grounded in probability theory, that can convey much more information about a domain than visualizations based on correlations or distance measures.

Building a Graphical Model

Given a set of attributes A1, A2, …, Am, a directed acyclic graph[3] (DAG) is constructed in which each node represents one of the attributes (including the class attribute) and the arcs represent the causal relationships between the attributes. In this case, the joint probability distribution P(A1 = a1, A2 = a2, …, Am = am) is no longer the product of the independent probabilities, because the model includes the causal relationships among attributes. Recalling the product rule of probability, the joint probability distribution can be rewritten as:

P(A1 = a1, A2 = a2, …, Am = am) = Πi P(Ai = ai | Parents(Ai))
where Parents(Ai) are the direct parents of Ai, linked to Ai through the causal arcs in the graph model. Note that the previous expression includes the class C among the set of attributes Ai. For example, if we consider the graph model in Figure 1, the parents of A4 are A2 and A3. If there are no causal relationships among the nodes Ai, then P(Ai = ai | Parents(Ai)) = P(Ai = ai), and we recover the product of independent probabilities. For the model depicted in Figure 1, the joint probability therefore factorizes into one such term per node, with each attribute conditioned on its direct parents (for instance, A4 is conditioned on both A2 and A3).

Figure 1: Graph (causal) model of a BBN.

In addition to the graph structure, it is necessary to specify the parameters of the model. This means that we must specify the Conditional Probability Distribution (CPD) at each node, given the values of its parents. If the variables are discrete, this can be represented as a table, which lists the probability that the child node takes on each of its different values for each combination of values of its parents.
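To make the tabular representation concrete, the short Python sketch below encodes the CPDs of a small invented network (binary attributes A1 and A2 as the parents of a binary A3; both the structure and the numbers are illustrative assumptions, not taken from the chapter) and multiplies the parent-conditioned factors to obtain a joint probability, as in the factorization given above.

```python
# A minimal sketch (illustrative only): CPDs of a tiny BBN with binary
# attributes A1 and A2 as the parents of A3. Each CPD is a table mapping
# a tuple of parent values to P(node = True | parents). The structure and
# the probability values are invented for this example.

parents = {"A1": (), "A2": (), "A3": ("A1", "A2")}

cpd = {
    "A1": {(): 0.30},                       # P(A1 = True)
    "A2": {(): 0.60},                       # P(A2 = True)
    "A3": {                                 # P(A3 = True | A1, A2)
        (True, True): 0.90,
        (True, False): 0.50,
        (False, True): 0.40,
        (False, False): 0.05,
    },
}

def node_prob(node, value, assignment):
    """P(node = value | Parents(node)), read off the node's CPD table."""
    key = tuple(assignment[p] for p in parents[node])
    p_true = cpd[node][key]
    return p_true if value else 1.0 - p_true

def joint_prob(assignment):
    """P(A1, A2, A3) as the product of parent-conditioned factors."""
    result = 1.0
    for node, value in assignment.items():
        result *= node_prob(node, value, assignment)
    return result

print(joint_prob({"A1": True, "A2": False, "A3": True}))  # 0.3 * 0.4 * 0.5 = 0.06
```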

Inference in Bayesian Belief Networks

Once the belief network is formulated, it can be used for probabilistic inference, that is, to make probabilistic statements about the network attributes (nodes). Since a BBN uniquely defines a joint probability distribution, any probability of interest can be computed from the network by applying basic rules of probability, such as conditioning and marginalization, to that joint distribution. Consider the example in Figure 2, extracted from Van der Gaag (1996). The BBN represents some fictitious medical knowledge concerning the diagnosis of acute cardiac disorders, and it comprises four binary nodes, i.e., each of them has two possible values, which we will denote by true and false. The nodes are S: smoking history of the patient; M: presence of a heart attack; P: whether the patient is suffering chest pain; F: whether the patient has tingling fingers. The model assumes that the smoking history has a direct influence on the occurrence of a heart attack. Moreover, given a heart attack, the patient will likely show signs of chest pain, reflected in the high probability assigned to this case. On the other hand, given a heart attack, the patient is not likely to have tingling fingers; the low conditional probability value describes that state. The events P and F have a common cause (M) and are conditionally independent of S given M. Applying the product rule to the graph, we may decompose the joint probability of one case, e.g., P(S = true, M = true, P = true, F = true), into a set of independent parent-child contributions: P(S, M, P, F) = P(S) * P(M | S) * P(P | M) * P(F | M). Having specified the model, we can calculate the prior probability of each node in the network. For example, P(M = true) can be calculated as:

P(M = true) = P(M = true | S = true) P(S = true) + P(M = true | S = false) P(S = false)
Figure 2: Graphical model of a fictitious medical BBN.

Then P(P = true) and P(F = true) are calculated as:

P(P = true) = P(P = true | M = true) P(M = true) + P(P = true | M = false) P(M = false)
P(F = true) = P(F = true | M = true) P(M = true) + P(F = true | M = false) P(M = false)
The previous probabilities correspond to the beliefs we hold about each of the network attributes (nodes) before considering any additional evidence. As before, we may wish to find the most probable hypothesis given a set of observations. In this case, we might want to determine whether it is more likely that the patient has a heart attack or not, given that he/she shows signs of chest pain. For that purpose, we need to compare P(M = true | P = true) and P(M = false | P = true): if the first conditional probability is greater, we infer that the patient has a heart attack; if it is smaller, we accept the hypothesis that the patient does not have a heart attack[4]. To calculate these posterior probabilities, we apply the definition of conditional probability in the following way:

P(M = true | P = true) = P(M = true, P = true) / P(P = true)
P(M = false | P = true) = P(M = false, P = true) / P(P = true)
The denominator is common to both expressions and can be eliminated for comparison purposes. The remaining criterion therefore reduces to comparing:

P(M = true, P = true) versus P(M = false, P = true)
We can calculate both P(M = true, P = true) and P(M = false, P = true) from the conditional probability tables associated with each node:

P(M = true, P = true) = P(S = true, M = true, P = true, F = true) + P(S = true, M = true, P = true, F = false) + P(S = false, M = true, P = true, F = true) + P(S = false, M = true, P = true, F = false)

P(M = false, P = true) = P(S = true, M = false, P = true, F = true) + P(S = true, M = false, P = true, F = false) + P(S = false, M = false, P = true, F = true) + P(S = false, M = false, P = true, F = false)
It should be noted that each joint probability in the above expressions is calculated by applying the causal model depicted in Figure 2. For example, if we wish to calculate P(S = true, M = true, P = true, F = true):

P(S = true, M = true, P = true, F = true) = P(S = true) * P(M = true | S = true) * P(P = true | M = true) * P(F = true | M = true)
As indicated before, because a BBN determines a joint probability distribution for the set of attributes in the network, the Bayesian network can in principle be used to compute any probability of interest. For problems with many variables, however, this brute-force approach is not practical. Several researchers have developed probabilistic inference algorithms that exploit variations of the conditional independence concept. Pearl (1988) and Lauritzen and Spiegelhalter (1988) developed the two best-known algorithms. Pearl's algorithm, for instance, is based on a message-passing concept: new evidence is propagated over the network by sending messages to neighboring nodes. Through the arcs, which act as communication channels, the nodes exchange messages providing information about the joint probability distribution defined by the network and the evidence obtained so far.
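For a network as small as the one in Figure 2, however, brute-force enumeration of the joint distribution is perfectly feasible. The Python sketch below reproduces the derivation above in this way; because the chapter does not list Van der Gaag's (1996) actual probability tables, the CPT values used here are placeholders chosen purely to illustrate the computation.

```python
from itertools import product

# A sketch of inference by enumeration for the four-node network of Figure 2
# (S -> M, M -> P, M -> F). The CPT numbers below are placeholders; the actual
# values from Van der Gaag (1996) are not reproduced in the chapter.

p_s = 0.5                                   # P(S = true), assumed
p_m_given_s = {True: 0.30, False: 0.01}     # P(M = true | S), assumed
p_p_given_m = {True: 0.90, False: 0.10}     # P(P = true | M), assumed
p_f_given_m = {True: 0.05, False: 0.02}     # P(F = true | M), assumed

def bern(p_true, value):
    """Probability of a binary outcome given P(value = true)."""
    return p_true if value else 1.0 - p_true

def joint(s, m, p, f):
    """P(S, M, P, F) = P(S) * P(M|S) * P(P|M) * P(F|M)."""
    return (bern(p_s, s) * bern(p_m_given_s[s], m)
            * bern(p_p_given_m[m], p) * bern(p_f_given_m[m], f))

def marginal(**evidence):
    """Sum the joint over every configuration consistent with the evidence."""
    total = 0.0
    for s, m, p, f in product([True, False], repeat=4):
        assignment = {"S": s, "M": m, "P": p, "F": f}
        if all(assignment[k] == v for k, v in evidence.items()):
            total += joint(s, m, p, f)
    return total

# Prior belief in a heart attack, and the posterior given chest pain:
prior_m = marginal(M=True)
posterior_m = marginal(M=True, P=True) / marginal(P=True)
print(f"P(M=true) = {prior_m:.4f},  P(M=true | P=true) = {posterior_m:.4f}")
```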

Training Bayesian Belief Networks

Two questions arise in formulating BBNs: (1) how do we determine the underlying causal model, expressed as a graph, including the specification of the conditional independence assumptions among the attributes of the model? and (2) how do we determine the conditional probability distributions that quantify the dependencies among the attributes in the model? In the following, we address these two questions, noting that a detailed review of the topic is beyond the scope of this chapter.

As described by Ramoni and Sebastiani (1999), BBNs originally relied on domain experts to supply both the conditional independence structure of the graphical model and the subjective assessments of the conditional probability distributions that quantify the dependencies among attributes. However, the statistical foundations of BBNs soon led to the development of methods for extracting both the structure and the conditional probability estimates from data, turning BBNs into powerful data analysis tools. Learning BBNs from data is a rapidly growing field of research that has seen a great deal of activity in recent years, including work by Lam and Bacchus (1994), Friedman and Goldszmidt (1996), and Heckerman, Geiger, and Chickering (1994).

There are a number of possible scenarios to consider when addressing the problem of training a BBN:

  • When the structure of the BBN is known and all the attributes of all the instances are observable in the training examples, learning the conditional probabilities is quite straightforward. We simply estimate the conditional probabilities by maximizing the likelihood of the training data, as in the case of the Naive Bayes classifier (estimating relative frequencies with zero-frequency corrections, for example); a small sketch of this counting scheme appears after this list.

  • When the structure of the BBN is known but not all of the variables are observable (partially or totally) in the training data, we face a more complicated problem. In such a case, we can resort to algorithms intended to deal with missing values, such as the Expectation Maximization (EM) algorithm. For a detailed explanation of the EM algorithm and its use in training BBNs, see Mitchell (1997). Another approach treats the problem as analogous to estimating the weights of the hidden units in a neural network. In that case, a gradient ascent procedure can be used, in which the algorithm searches through the space of hypotheses corresponding to the set of all possible entries in the conditional probability tables.

  • When the structure of the network is not known, we face a problem of model selection, typically a much more complicated problem than the two cases described previously. The goal is to find the network, or group of networks, that best describes the probability distribution over the training data. This optimization process is implemented in practice by using heuristic search techniques to find the best model over the space of possible BBNs. A scoring system is commonly used for choosing among alternative networks. Lam and Bacchus (1994), for example, have used a score based on the minimum description length (MDL) principle, an information-theoretic formulation of Occam's razor, according to which simple, sparse models should be preferred to complex, overfitted models.[5]
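As a concrete illustration of the first scenario in the list above (known structure, fully observed data), the Python sketch below estimates CPT entries as relative frequencies with a Laplace-style zero-frequency correction; the tiny two-node network (S as the only parent of M) and the handful of training records are invented for illustration, not taken from the chapter.

```python
from collections import Counter
from itertools import product

# A sketch of the fully observed case: the structure is known and every
# attribute is observed in each training record. CPT entries are estimated
# as relative frequencies, with a Laplace-style correction of 1 so that
# unseen combinations do not receive probability zero. The network and the
# records below are invented for illustration.

parents = {"S": (), "M": ("S",)}
values = (True, False)

records = [
    {"S": True,  "M": True},
    {"S": True,  "M": False},
    {"S": False, "M": False},
    {"S": False, "M": False},
]

def estimate_cpts(records, parents):
    # Count how often each (node, parent configuration, node value) occurs.
    counts = Counter()
    for r in records:
        for node, pars in parents.items():
            parent_vals = tuple(r[p] for p in pars)
            counts[(node, parent_vals, r[node])] += 1

    # Normalize the counts into conditional probability tables.
    cpts = {}
    for node, pars in parents.items():
        cpts[node] = {}
        for parent_vals in product(values, repeat=len(pars)):
            total = sum(counts[(node, parent_vals, v)] for v in values)
            cpts[node][parent_vals] = {
                v: (counts[(node, parent_vals, v)] + 1) / (total + len(values))
                for v in values
            }
    return cpts

cpts = estimate_cpts(records, parents)
print(cpts["M"][(True,)][True])   # P(M=true | S=true) = (1 + 1) / (2 + 2) = 0.50
```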

[3]Recent research has established that the model is not limited to acyclic graphs. Direct or indirect cyclic causality may be included in BBNs.

[4]Once again, the example has been created for illustrative purposes and should not be taken too seriously.

[5]The philosophical debate regarding this approach has been going on for centuries. William of Occam (14th century) was one of the first thinkers to discuss the question of whether simpler hypotheses are better than complicated ones. For this reason, this approach goes by the name of Occam's razor.
