4.5 Bayesian Networks

Suppose that W is a set of possible worlds characterized by n binary random variables 𝒳 = {X_1, …, X_n}. That is, a world w ∈ W is a tuple (x_1, …, x_n), where x_i ∈ {0, 1} is the value of X_i. That means that there are 2^n worlds in W, say w_1, …, w_{2^n}. A naive description of a probability measure on W requires 2^n − 1 numbers, α_1, …, α_{2^n−1}, where α_i is the probability of world w_i. (Of course, the probability of w_{2^n} is determined by the other probabilities, since they must sum to 1.)

If n is relatively small, describing a probability measure in this naive way is not so unreasonable, but if n is, say, 1,000 (certainly not unreasonable in many practical applications), then it is completely infeasible. Bayesian networks provide a tool that makes describing and working with probability measures far more feasible.

4.5.1 Qualitative Bayesian Networks

A (qualitative) Bayesian network (sometimes called a belief network) is a dag, that is, a directed acyclic graph, whose nodes are labeled by random variables. (For readers not familiar with graph theory, a directed graph consists of a collection of nodes or vertices joined by directed edges. Formally, a directed edge is just an ordered pair of nodes; the edge (u, v) can be drawn by joining u and v by a line with an arrow pointing from u to v. A directed graph is acyclic if there is no cycle; that is, there does not exist a sequence of vertices v_0, …, v_k such that v_0 = v_k and there is an edge from v_i to v_{i+1} for i = 0, …, k − 1.) Informally, the edges in a Bayesian network can be thought of as representing causal influence.

For example, a Bayesian network for reasoning about the relationship between smoking and cancer might include binary random variables such as C for "has cancer," SH for "exposed to secondhand smoke," PS for "at least one parent smokes," and S for "smokes." The Bayesian network Gs in Figure 4.1 represents what seem to be reasonable causal relationships. Intuitively, Gs says that whether or not a patient has cancer is directly influenced by whether he is exposed to secondhand smoke and whether he smokes. Both of these random variables, in turn, are influenced by whether his parents smoke. Whether or not his parents smoke also clearly influences whether or not he has cancer, but this influence is mediated through the random variables SH and S. Once whether he smokes and was exposed to secondhand smoke is known, finding out whether his parents smoke gives no additional information. That is, C is independent of PS given SH and S.


Figure 4.1: A Bayesian network Gs that represents the relationship between smoking and cancer.
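
The structure of a Bayesian network like Gs is easy to encode in a program by recording, for each node, the set of its parents; the edges (and hence the ancestor and descendant relations) can be recovered from this. The following Python sketch is purely illustrative; the dictionary name and layout are my own assumptions, not anything prescribed by the formalism.

    # A hypothetical encoding of the qualitative network Gs:
    # each node is mapped to the set of its parents.
    PARENTS_GS = {
        "PS": set(),        # at least one parent smokes (a root of Gs)
        "SH": {"PS"},       # exposed to secondhand smoke
        "S":  {"PS"},       # smokes
        "C":  {"SH", "S"},  # has cancer
    }

    def edges(parents):
        """The directed edges (u, v) of the dag, u a parent of v."""
        return [(u, v) for v, ps in parents.items() for u in ps]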

More generally, given a Bayesian network G and a node X in G, think of the ancestors of X in the graph as those random variables that have a potential influence on X. (Formally, Y is an ancestor of X in G if there is a directed path from Y to X in G, that is, a sequence (Y_1, …, Y_k) of nodes such that Y_1 = Y, Y_k = X, and there is a directed edge from Y_i to Y_{i+1} for i = 1, …, k − 1.) This influence is mediated through the parents of X, those ancestors of X directly connected to X. That means that X should be conditionally independent of its ancestors, given its parents. The formal definition requires, in fact, that X be independent not only of its ancestors, but of its nondescendants, given its parents, where the nondescendants of X are those nodes Y such that X is not an ancestor of Y.

Definition 4.5.1

Given a qualitative Bayesian network G, let ParG(X) denote the parents of the random variable X in G; let DesG(X) denote the descendants of X, that is, X and all those nodes Y such that X is an ancestor of Y; let NonDesG(X) denote the nondescendants of X in G, that is, 𝒳 − DesG(X). Note that all ancestors of X are nondescendants of X. The Bayesian network G (qualitatively) represents, or is compatible with, the probability measure μ if Irvμ(X, NonDesG(X) | ParG(X)), that is, X is conditionally independent of its nondescendants given its parents, for all X ∈ 𝒳.
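
Given the parents dictionary from the sketch above, the sets DesG(X) and NonDesG(X) used in Definition 4.5.1 can be computed by a simple graph search. Again, this is only an illustrative sketch; the helper names are assumptions made for this example.

    def children(parents, x):
        """The nodes that have x as a parent."""
        return {v for v, ps in parents.items() if x in ps}

    def descendants(parents, x):
        """DesG(x): x together with every node reachable from x by directed edges."""
        des, stack = {x}, [x]
        while stack:
            for c in children(parents, stack.pop()):
                if c not in des:
                    des.add(c)
                    stack.append(c)
        return des

    def nondescendants(parents, x):
        """NonDesG(x): all the nodes that are not in DesG(x)."""
        return set(parents) - descendants(parents, x)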

Definition 4.5.1 says that G represents μ if, in a certain sense, it captures the conditional independence relations in μ. But what makes this notion of representation at all interesting? In particular, why focus on conditional independencies of the form Irvμ(X, NonDesG(X) | ParG(X))? To explain this, I digress briefly to discuss what is called the chain rule for probability. Given arbitrary sets U_1, …, U_n, it is immediate from the definition of conditional probability that

μ(U_1 ∩ … ∩ U_n) = μ(U_1 ∩ … ∩ U_{n−1}) × μ(U_n | U_1 ∩ … ∩ U_{n−1}).

(Assume for now that all the relevant probabilities are positive, so that the conditional probabilities are well defined.) Applying this observation inductively gives the chain rule:

μ(U_1 ∩ … ∩ U_n) = μ(U_1) × μ(U_2 | U_1) × μ(U_3 | U_1 ∩ U_2) × ⋯ × μ(U_n | U_1 ∩ … ∩ U_{n−1}).   (4.1)

As a special instance of the chain rule, take U_i to be the event X_i = x_i. Plugging this into (4.1) gives

μ(X_1 = x_1 ∩ … ∩ X_n = x_n) = μ(X_1 = x_1) × μ(X_2 = x_2 | X_1 = x_1) × ⋯ × μ(X_n = x_n | X_1 = x_1 ∩ … ∩ X_{n−1} = x_{n−1}).   (4.2)

Now suppose without loss of generality that X_1, …, X_n is a topological sort of (the nodes in) G; that is, if X_i is a parent of X_j, then i < j. It easily follows that {X_1, …, X_{k−1}} ⊆ NonDesG(X_k), for k = 1, …, n (Exercise 4.20); all the descendants of X_k must have subscripts greater than k. Thus, all the nodes in {X_1, …, X_{k−1}} are independent of X_k given ParG(X_k). It follows that

μ(X_k = x_k | X_1 = x_1 ∩ … ∩ X_{k−1} = x_{k−1}) = μ(X_k = x_k | ⋂_{X_i ∈ ParG(X_k)} X_i = x_i).

Thus, if G represents μ, then (4.2) reduces to

μ(X_1 = x_1 ∩ … ∩ X_n = x_n) = ∏_{k=1}^{n} μ(X_k = x_k | ⋂_{X_i ∈ ParG(X_k)} X_i = x_i).   (4.3)

Equation (4.3) shows that, if G represents μ, then μ is completely determined by conditional probabilities of the form μ(X_k = x_k | ⋂_{X_i ∈ ParG(X_k)} X_i = x_i), that is, by the conditional probability of each random variable given each possible setting of its parents.

The consequences of this observation are explored in the next section.
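
To see what (4.3) looks like in a particular case, consider the smoking network Gs of Figure 4.1, in which PS has no parents, PS is the only parent of both SH and S, and SH and S are the parents of C. If Gs represents μ, then (4.3) says that

μ(PS = p ∩ SH = h ∩ S = s ∩ C = c) = μ(PS = p) × μ(SH = h | PS = p) × μ(S = s | PS = p) × μ(C = c | SH = h ∩ S = s)

for all p, h, s, c ∈ {0, 1}, so the sixteen world probabilities are determined by the conditional probability of each variable given its parents.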

4.5.2 Quantitative Bayesian Networks

A qualitative Bayesian network G gives qualitative information about dependence and independence, but does not actually give the values of the conditional probabilities. A quantitative Bayesian network provides more quantitative information, by associating with each node X in G a conditional probability table (cpt) that quantifies the effects of the parents of X on X. For example, if X's parents in G are Y and Z, then the cpt for X would have an entry denoted d_{Y=j,Z=k} for all (j, k) ∈ {0, 1}². As the notation is meant to suggest, d_{Y=j,Z=k} = μ(X = 1 | Y = j ∩ Z = k) for the probability measure μ represented by G. (Of course, there is no need to have an entry for μ(X = 0 | Y = j ∩ Z = k), since this is just 1 − μ(X = 1 | Y = j ∩ Z = k).) Formally, a quantitative Bayesian network is a pair (G, f) consisting of a qualitative Bayesian network G and a function f that associates with each node X in G a cpt, where there is an entry in the interval [0, 1] in the cpt for each possible setting of the parents of X. If X is a root of G, then the cpt for X can be thought of as giving the unconditional probability that X = 1.

Example 4.5.2

Consider the Bayesian network Gs for smoking described in Figure 4.1. Let fs be the function that associates with the nodes C, S, SH, and PS the cpts shown in Figure 4.2.

The first line in the cpt for C describes the conditional probability that C = 1 given S = 1 ∩ SH = 1; the second line describes the conditional probability that C = 1 given S = 1 ∩ SH = 0; and so on. Note that the probability that C = 0 given S = 1 ∩ SH = 1 is .4 (= 1 − .6); similarly, the probability that C = 0 conditional on each setting of S and SH can be calculated from the cpt. Finally, note that the .3 in the cpt for PS is the unconditional probability of PS.

Figure 4.2: Cpts for the smoking example.
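
In a program, a cpt is naturally represented as a mapping from settings of the parents to the probability that the variable is 1. The Python sketch below does this for (Gs, fs). Only three of the numbers (.6 and .1 for C when SH = 1, and .3 for PS) appear in the text; since Figure 4.2 is not reproduced here, the remaining entries are placeholders that I have made up purely for illustration.

    # Hypothetical cpts for (Gs, fs): each entry gives mu(X = 1 | parents' values).
    # Entries marked "made up" are placeholders, not the values in Figure 4.2.
    CPT_GS = {
        "PS": {(): 0.3},                      # unconditional: mu(PS = 1) = .3
        "SH": {(("PS", 1),): 0.7,             # made up
               (("PS", 0),): 0.2},            # made up
        "S":  {(("PS", 1),): 0.5,             # made up
               (("PS", 0),): 0.2},            # made up
        "C":  {(("S", 1), ("SH", 1)): 0.6,    # from the text
               (("S", 1), ("SH", 0)): 0.3,    # made up
               (("S", 0), ("SH", 1)): 0.1,    # from the text
               (("S", 0), ("SH", 0)): 0.05},  # made up
    }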

Definition 4.5.3

A quantitative Bayesian network (G, f) (quantitatively) represents, or is compatible with, the probability measure μ if G qualitatively represents μ and the cpts agree with μ in that, for each random variable X, the entry in the cpt for X given some setting Y_1 = y_1, …, Y_k = y_k of its parents is μ(X = 1 | Y_1 = y_1 ∩ … ∩ Y_k = y_k) if μ(Y_1 = y_1 ∩ … ∩ Y_k = y_k) ≠ 0. (It does not matter what the cpt entry for Y_1 = y_1, …, Y_k = y_k is if μ(Y_1 = y_1 ∩ … ∩ Y_k = y_k) = 0.)

It follows immediately from (4.3) that if (G, f) quantitatively represents μ, then μ can be completely reconstructed from (G, f). More precisely, (4.3) shows that the 2^n values μ(X_1 = x_1 ∩ … ∩ X_n = x_n) can be computed from (G, f); from these values, μ(U) can be computed for all U ⊆ W.

Proposition 4.5.4

A quantitative Bayesian network (G, f) always quantitatively represents a unique probability measure, the one determined by using (4.3).

Proof See Exercise 4.22.

It is easy to calculate, using (4.3), the probability of any world according to the unique probability measure μs represented by the quantitative Bayesian network (Gs, fs) in Example 4.5.2; for example,

μs(C = 1 ∩ S = 1 ∩ SH = 1 ∩ PS = 1) = μs(C = 1 | S = 1 ∩ SH = 1) × μs(S = 1 | PS = 1) × μs(SH = 1 | PS = 1) × μs(PS = 1),

a product of four cpt entries.

(The numbers in the cpts have been made up purely for this example and bear no necessary relationship to reality!)

Proposition 4.5.4, while straightforward, is important because it shows that there is no choice of numbers in a cpt that can be inconsistent with probability. Whatever the numbers are in the cpts for (G, f) (as long as they are in the interval [0, 1]), there is always a probability measure that is compatible with (G, f).
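
Proposition 4.5.4 can also be seen operationally: whatever numbers in [0, 1] appear in the cpts, multiplying them out as in (4.3) yields nonnegative world probabilities that sum to 1. The Python sketch below, which reuses the hypothetical representation from the earlier sketches, computes this product measure; it is an illustration under those assumptions, not a definitive implementation.

    from itertools import product

    def joint_prob(parents, cpts, world):
        """mu(world) computed as in (4.3); world maps each variable to 0 or 1."""
        p = 1.0
        for x, ps in parents.items():
            key = tuple(sorted((y, world[y]) for y in ps))
            p1 = cpts[x][key]                 # entry for mu(X = 1 | parents' values)
            p *= p1 if world[x] == 1 else 1.0 - p1
        return p

    def all_worlds(parents):
        """All 2^n assignments of 0/1 values to the variables."""
        names = sorted(parents)
        for values in product((0, 1), repeat=len(names)):
            yield dict(zip(names, values))

    # The 2^n world probabilities are nonnegative and sum to 1 (up to rounding),
    # whatever numbers in [0, 1] appear in the cpts:
    # sum(joint_prob(PARENTS_GS, CPT_GS, w) for w in all_worlds(PARENTS_GS))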

What about the converse? Can every probability measure on W be represented by a quantitative Bayesian network? It can, and in general there are many ways of doing so.

Construction 4.5.5

Given μ, let Y_1, …, Y_n be any permutation of the random variables in 𝒳. (Think of Y_1, …, Y_n as describing an ordering of the variables in 𝒳.) Construct a qualitative Bayesian network as follows. For each k, find a minimal subset of {Y_1, …, Y_{k−1}}, call it P_k, such that Irvμ({Y_1, …, Y_{k−1}}, Y_k | P_k). (Clearly there is a subset with this property, namely, {Y_1, …, Y_{k−1}} itself. It follows that there must be a minimal subset with this property.) Then add edges from each of the nodes in P_k to Y_k. Call the resulting graph G.
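
Construction 4.5.5 can be read as an algorithm, given some way of testing the required conditional independencies. The Python sketch below is one naive rendering of it; the independence oracle is a stand-in for whatever access to μ is available, and the greedy pruning is merely a simple way of finding a small parent set (the construction itself only asks for some minimal subset). All names here are assumptions for illustration.

    def construct_network(order, is_independent):
        """A naive rendering of Construction 4.5.5.

        order: the permutation Y_1, ..., Y_n of the variables.
        is_independent(prefix, y, candidate): an assumed oracle returning True
            iff Irv_mu(prefix, y | candidate) holds for the measure mu in question.
        Returns a parents dictionary describing the resulting dag G.
        """
        parents = {}
        for k, y in enumerate(order):
            prefix = order[:k]            # {Y_1, ..., Y_{k-1}}
            pk = list(prefix)             # start with the whole prefix
            for z in prefix:
                # Try dropping z; keep the smaller set if independence still holds.
                trial = [v for v in pk if v != z]
                if is_independent(prefix, y, trial):
                    pk = trial
            parents[y] = set(pk)          # edges go from each node in pk to y
        return parents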

Theorem 4.5.6

The Bayesian network G obtained by applying Construction 4.5.5 to the probability measure μ qualitatively represents μ.

Proof Note that Y_1, …, Y_n is a topological sort of G; edges always go from nodes in {Y_1, …, Y_{k−1}} to Y_k. It follows that G is acyclic; that is, it is a dag. The construction guarantees that P_k = ParG(Y_k) and that Irvμ({Y_1, …, Y_{k−1}}, Y_k | ParG(Y_k)). Using CIRV1–5, it can be shown that Irvμ(NonDesG(Y_k), Y_k | ParG(Y_k)) (Exercise 4.23). Thus, G qualitatively represents μ.

Of course, given G and μ, it is then easy to construct a quantitative Bayesian network (G, f) that quantitatively represents μ by consulting μ.

How much does this buy us? That depends on how sparse the graph is, that is, on how many parents each node has. If a node has k parents, then its conditional probability table has 2^k entries. For example, there are nine entries altogether in the cpts in the Bayesian network (Gs, fs) in Example 4.5.2: the cpt for C has four entries, the cpts for SH and S each have two entries, and the one for PS has only one entry. On the other hand, a naive description of the probability measure would require fifteen numbers. In general, if each node in a Bayesian network has at most k parents, then there are at most n2^k entries in all the cpts. If k is small, then n2^k can be much smaller than 2^n − 1, the number of numbers needed for a naive description of the probability measure. (The numbers 2^k and 2^n − 1 arise because I have considered only binary random variables. If the random variables can have m values, say 0, 1, …, m − 1, the conditional probability table for a random variable X with k parents would have to describe the probability that X = j, for j = 1, …, m − 1, for each of the m^k possible settings of its parents, so would involve (m − 1)m^k entries.)
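
The counting in this paragraph is easy to check mechanically; the following throwaway helper (an illustration using the hypothetical parents dictionary above, not part of the formalism) computes the total number of cpt entries for a network over binary variables.

    def cpt_entries(parents):
        """Total number of cpt entries when all variables are binary: sum of 2^k."""
        return sum(2 ** len(ps) for ps in parents.values())

    # For PARENTS_GS this gives 1 + 2 + 2 + 4 = 9 entries, versus
    # 2 ** 4 - 1 == 15 numbers for the naive description.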

Not only does a well-designed Bayesian network (I discuss what "well-designed" means in the next paragraph) typically require far fewer numbers to represent a probability measure, the numbers are typically easier to obtain. For example, for (Gs, fs), it is typically easier to obtain entries in the cpt like μ(C = 1 | S = 1 ∩ SH = 0) (the probability that someone gets cancer given that they smoke and are not exposed to secondhand smoke) than it is to obtain μ(C = 1 ∩ S = 1 ∩ SH = 0 ∩ PS = 0).

Note that the Bayesian network constructed in Theorem 4.5.6 depends on the ordering of the random variables. For example, the first element in the ordering is guaranteed to be a root of the Bayesian network. Thus, there are many Bayesian networks that represent a given probability measure. But not all orderings lead to equally useful Bayesian networks. In a well-designed Bayesian network, the nodes are ordered so that if X has a causal influence on Y, then X precedes Y in the ordering. This typically leads both to simpler Bayesian networks (in the sense of having fewer edges) and to conditional probabilities that are easier to obtain in practice. For example, it is possible to construct a Bayesian network that represents the same probability measure as (Gs, fs) but has S as the root, by applying Theorem 4.5.6 with the ordering S, C, PS, SH (Exercise 4.24). However, not only does this network have more edges, the conditional probability tables require entries that are harder to elicit in practice. It is easier to elicit from medical experts the probability that someone will smoke given that at least one of his parents smokes (μ(S = 1 | PS = 1)) than the probability that at least one of a smoker's parents also smokes (μ(PS = 1 | S = 1)).

One of the main criticisms of the use of probability in applications such as expert systems used to be that probability measures were too hard to work with. Bayesian networks deal with part of the criticism by providing a (potentially) compact representation of probability measures, one that experience has shown can be effectively constructed in realistic domains. For example, they have been used by PATHFINDER, a diagnostic expert system for lymph node diseases. The first step in the use of Bayesian networks for PATHFINDER was for experts to decide what the relevant random variables were. At one stage in the design, they used 60 binary random variables to represent diseases (did the agent have the disease or not) and over 100 random variables for findings (symptoms and test results). Deciding on the appropriate vocabulary took 8 hours, constructing the appropriate qualitative Bayesian network took 35 hours, and making the assessments to fill in the cpts took another 40 hours. This is considered a perfectly acceptable length of time to spend in constructing a significant expert system.

But, of course, constructing the Bayesian network is only part of the problem. Once a probability measure has been represented using a Bayesian network, it is important to be able to draw inferences from it. For example, a doctor using the PATHFINDER system will typically want to know the probability of a given disease given certain findings. Even if the disease is a child of the symptom, computing the probability of the disease given the symptom requires some effort. For example, consider the problem of computing μs(C = 1 | SH = 1) for the unique probability measure μs compatible with (Gs, fs). The cpt says that μs(C = 1 | SH = 1 ∩ S = 1) = .6 and that μs(C = 1 | SH = 1 ∩ S = 0) = .1. μs(C = 1 | SH = 1) can be computed using the identity

μs(C = 1 | SH = 1) = μs(C = 1 | SH = 1 ∩ S = 1) × μs(S = 1 | SH = 1) + μs(C = 1 | SH = 1 ∩ S = 0) × μs(S = 0 | SH = 1).

That means that μs(S = 1 | SH = 1) and μs(S = 0 | SH = 1) must be computed. Efficient algorithms for such computations have been developed (and continue to be improved), which take advantage of the dag structure of a Bayesian network. It would take us too far afield here to go into the details; see the notes to this chapter for references.
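
For small networks, conditional probabilities such as μs(C = 1 | SH = 1) can always be computed, albeit naively, by summing the world probabilities given by (4.3). The Python sketch below does exactly this brute-force computation, reusing the hypothetical helpers from the earlier sketches; the efficient algorithms just mentioned avoid this exponential enumeration.

    def conditional(parents, cpts, query, evidence):
        """mu(query | evidence), computed by brute-force summation over all worlds."""
        numerator = denominator = 0.0
        for w in all_worlds(parents):
            if all(w[v] == val for v, val in evidence.items()):
                p = joint_prob(parents, cpts, w)
                denominator += p
                if all(w[v] == val for v, val in query.items()):
                    numerator += p
        return numerator / denominator

    # For example: conditional(PARENTS_GS, CPT_GS, {"C": 1}, {"SH": 1})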

4.5.3 Independencies in Bayesian Networks

By definition, a node (i.e., a random variable) in a Bayesian network G that qualitatively represents a probability measure μ is independent of its nondescendants, given its parents, with respect to μ. What other conditional independencies hold for the probability measures represented by a Bayesian network? These can easily be computed. There is a notion of d-separation, which I am about to define, with the property that X is conditionally independent of Y given Z if the nodes in Z d-separate every node in X from every node in Y.

Now for the formal definition. A node X is d-separated (the d is for directed) from a node Y by a set of nodes Z in the dag G, written d-sepG(X, Y | Z), if, for every undirected path from X to Y (an undirected path is a path that ignores the arrows; e.g., (SH, PS, S) is an undirected path from SH to S in Gs), there is a node Z on the path such that either

  (a) Z ∈ Z and there is an arrow on the path leading into Z and an arrow leading out from Z;

  (b) Z ∈ Z and Z has both path arrows leading out; or

  (c) Z has both path arrows leading in, and neither Z nor any of its descendants is in Z.

X is d-separated from Y by Z if every node X in X is d-separated from every node Y in Y by Z. Consider the graph Gs. The set {SH, S} d-separates PS from C. One path from PS to C is blocked by SH and the other by S, according to clause (a), since both S and SH have an arrow leading in and one leading out. Similarly, {PS} d-separates SH from S. The (undirected) path (SH, PS, S) is blocked by PS according to clause (b), and the undirected path (SH, C, S) is blocked by C according to clause (c), since neither C nor any of its descendants is in {PS}. On the other hand, {PS, C} does not d-separate SH from S, since there is no node on the path (SH, C, S) that satisfies any of (a), (b), or (c).

These examples may also help explain the intuition behind each of the clauses of the definition. Clause (a) is quite straightforward. Clearly if there is a directed path from PS to C, then PS can influence C, so PS and C are not independent. However, conditioning on {SH, S} blocks all paths from PS to C, so C is conditionally independent of PS given {SH, S}. The situation in clause (b) is exemplified by the edges leading out from PS to SH and S. Intuitively, smoking (S) and being exposed to secondhand smoke (SH) are not independent because they have a common cause, a parent smoking (PS). Finding out that S = 1 increases the likelihood that PS = 1, which in turn increases the likelihood that SH = 1. However, S and SH are conditionally independent given PS.

Clause (c) in the definition of d-separation is perhaps the most puzzling. Why should the absence of a node in Z cause X and Y to be d-separated? Again, this can be understood in terms of the graph Gs. Finding out that C = 1 makes S and SH become negatively correlated. Since they are both potential causes of C = 1, finding out that one holds decreases the likelihood that the other holds: finding out that S = 0 increases the likelihood that SH = 1, and finding out that S = 1 decreases the likelihood that SH = 1.

To understand the role of descendants in clause (c), suppose that a node D (for "early death") is added to Gs with an edge from C to D. Finding out that D = 1 makes it more likely that C = 1 and thus also makes S and SH negatively correlated.
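
On graphs as small as Gs, d-separation can be checked directly from the definition by enumerating undirected paths and testing each interior node against clauses (a)-(c). The Python sketch below does this naively, reusing the parents dictionary and the descendants helper from the earlier sketches; it is meant only as an illustration of the definition, not as a practical algorithm.

    def undirected_paths(parents, x, y):
        """All simple undirected paths from x to y, ignoring edge directions."""
        nbrs = {v: set() for v in parents}
        for v, ps in parents.items():
            for u in ps:
                nbrs[u].add(v)
                nbrs[v].add(u)
        def extend(path):
            if path[-1] == y:
                yield list(path)
                return
            for n in nbrs[path[-1]]:
                if n not in path:
                    yield from extend(path + [n])
        yield from extend([x])

    def blocked(parents, path, zs):
        """Is this undirected path blocked by the set zs, per clauses (a)-(c)?"""
        for i in range(1, len(path) - 1):
            a, b, c = path[i - 1], path[i], path[i + 1]
            arrows_in = (a in parents[b]) + (c in parents[b])  # arrows leading into b
            if b in zs and arrows_in < 2:                      # clauses (a) and (b)
                return True
            if arrows_in == 2 and not (descendants(parents, b) & zs):  # clause (c)
                return True
        return False

    def d_separated(parents, x, y, zs):
        """d-sepG(x, y | zs): every undirected path from x to y is blocked."""
        return all(blocked(parents, p, set(zs)) for p in undirected_paths(parents, x, y))

    # d_separated(PARENTS_GS, "PS", "C", {"SH", "S"}) should be True;
    # d_separated(PARENTS_GS, "SH", "S", {"PS", "C"}) should be False.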

The following theorem says that d-separation completely characterizes conditional independence in Bayesian networks:

Theorem 4.5.7

If X is d-separated from Y by Z in the Bayesian network G, then Irvμ(X, Y | Z) holds for all probability measures μ compatible with G. Conversely, if X is not d-separated from Y by Z, then there is a probability measure μ compatible with G such that Irvμ(X, Y | Z) does not hold.

The first half says that d-separation really does imply conditional independence in Bayesian networks. For future reference, it is worth noting that it can be proved using only properties CIRV1–5 and the fact that, by definition, Irvμ(NonDesG(X), X | ParG(X)) holds for every μ compatible with G. The second half says that, in a precise sense, d-separation completely captures conditional independence in qualitative Bayesian networks.

4.5.4 Plausibilistic Bayesian Networks

It should be clear that Bayesian networks can be used with other representations of uncertainty. Certainly nothing in the definition of qualitative Bayesian networks really depends on the use of probability—all the definitions are given in terms of conditional independence of random variables, which makes sense for all the notions of uncertainty we have considered. To what extent do results like Proposition 4.5.4 and Theorems 4.5.6 and 4.5.7 depend on the use of probability? Plausibility measures provide a useful tool with which to examine this question.

As far as qualitative representation goes, note that Definition 4.5.1 makes perfect sense if the probability measure μ is replaced by a plausibility measure Pl everywhere. The proof of Theorem 4.5.6 uses only CIRV1–5; by Theorem 4.4.5, CIRV1–5 hold for all algebraic cps's. The following corollary is immediate:

Corollary 4.5.8

A Bayesian network G obtained by applying (the analogue of) Construction 4.5.5 to the algebraic cpm Pl qualitatively represents Pl.

In light of Corollary 4.5.8, for the remainder of this section I restrict to algebraic cps's.

The notion of a Bayesian network quantitatively representing a plausibility measure makes sense for arbitrary plausibility measures, with one minor caveat. Now a cpt for X must have entries of the form d_{X=i | Y=j,Z=k}, for both i = 0 and i = 1, since the conditional plausibility of X = 0 can no longer necessarily be determined from that of X = 1. (Of course, in general, if X is not a binary random variable, then there must be an entry for each possible value of X.) With this minor change, the definition of representation is the obvious analogue of Definition 4.5.3.

Definition 4.5.9

A quantitative Bayesian network (G, f) represents a cpm Pl if G is compatible with Pl and the cpts agree with Pl, in the sense that, for each random variable X, the entry d_{X=i | Y_1=j_1, …, Y_k=j_k} in the cpt is

Pl(X = i | Y_1 = j_1 ∩ … ∩ Y_k = j_k)

if Y_1 = j_1 ∩ … ∩ Y_k = j_k ∈ ℱ′. (It does not matter what d_{X=i | Y_1=j_1, …, Y_k=j_k} is if Y_1 = j_1 ∩ … ∩ Y_k = j_k ∉ ℱ′.)

Given a cpm Pl, it is easy to construct a quantitative Bayesian network (G, f) that represents Pl: simply construct G so that it is compatible with Pl as in Corollary 4.5.8 and define f appropriately, using Pl. The more interesting question is whether there is a unique algebraic cpm determined by a quantitative Bayesian network. As stated, this question is somewhat underdetermined. The numbers in a quantitative network do not say what ⊕ and ⊗ ought to be for the algebraic cpm.

A reasonable way to make the question more interesting is the following. Recall that, for the purposes of this section, I have taken W to consist of the 2^n worlds characterized by the n binary random variables in 𝒳. Let 𝒫_{D,⊕,⊗} consist of all standard cps's of the form (W, ℱ, ℱ′, Pl), where ℱ = 2^W, so that all subsets of W are measurable, the range of Pl is D, and Pl is algebraic with respect to ⊕ and ⊗. Thus, for example, 𝒫_{ℕ*,min,+} consists of all conditional ranking functions on W defined from unconditional ranking functions by the construction in Section 3.8. Since a cps (W, ℱ, ℱ′, Pl) ∈ 𝒫_{D,⊕,⊗} is determined by Pl, I abuse notation and write Pl ∈ 𝒫_{D,⊕,⊗}.

With this notation, the question becomes whether a quantitative Bayesian network (G, f) such that the entries in the cpts are in D determines a unique element of 𝒫_{D,⊕,⊗}. The answer is yes, provided that (D, ⊕, ⊗) satisfies enough conditions. I do not go through all the conditions here; some of them are technical, and not much insight is gained from writing them all out carefully. They include, for example, conditions saying that ⊕ and ⊗ are commutative and associative and that ⊗ distributes over ⊕. It is worth noting that these conditions are satisfied by the definitions of ⊕ and ⊗ given in the proof of Proposition 3.9.2 for probability measures, ranking functions, possibility measures (using either Poss(V | U) or Poss(V ‖ U)), and the plausibility measure Pl_𝒫 defined by a set 𝒫 of probability measures, extended to a cps. Thus, it follows that, in all these cases, a quantitative Bayesian network represents a unique element Pl ∈ 𝒫_{D,⊕,⊗}.
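
As a concrete instance, for ranking functions ⊕ is min and ⊗ is +, so the analogue of (4.3) computes the rank of a world by adding the relevant cpt entries, and the rank of an event by minimizing over the worlds in it. The Python sketch below illustrates this; the table format (an entry for each value of the variable, as a plausibilistic cpt requires) and all names are assumptions for illustration, and it reuses all_worlds from the earlier sketch.

    INF = float("inf")

    def joint_rank(parents, rank_cpts, world):
        """Rank of a world: the analogue of (4.3), with product replaced by +."""
        total = 0
        for x, ps in parents.items():
            key = tuple(sorted((y, world[y]) for y in ps))
            total += rank_cpts[x][key][world[x]]   # an entry for each value of X
        return total

    def event_rank(parents, rank_cpts, event):
        """Rank of an event: the analogue of summation, i.e., min over its worlds."""
        return min(
            (joint_rank(parents, rank_cpts, w)
             for w in all_worlds(parents)
             if all(w[v] == val for v, val in event.items())),
            default=INF,
        )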

What about the analogue to Theorem 4.5.7? The first half is immediate for all algebraic cps's.

Corollary 4.5.10

If X is d-separated from Y by Z in the Bayesian network G, then IrvPl(X, Y | Z) holds for all cpms Pl ∈ 𝒫_{D,⊕,⊗} compatible with G.

Proof As I observed after the statement of Theorem 4.5.7, the result depends only on CIRV1–5. Since, by Theorem 4.4.5, CIRV1–5 hold for all algebraic cps's, the result follows.

Getting an analogue to the second half of Theorem 4.5.7 requires a little more work. Notice that to prove such an analogue, it suffices to show that if X is not d-separated from Y by Z in G, then there is a plausibility measure Pl ∈ 𝒫_{D,⊕,⊗} such that IrvPl(X, Y | Z) does not hold. Again, this result holds with enough conditions on (D, ⊕, ⊗): essentially the same conditions required to get a quantitative Bayesian network to determine a unique plausibility measure in 𝒫_{D,⊕,⊗}, together with a richness condition to ensure that there are "enough" plausibility measures in 𝒫_{D,⊕,⊗} to guarantee that if d-separation does not hold, then there is a plausibility measure that does not make the appropriate random variables (conditionally) independent. And again, these conditions hold for all the measures of uncertainty constructed in Proposition 3.9.2.

As these results show, the technology of Bayesian networks can be applied rather widely.

Exercises

  • 4.1 Show that Proposition 4.1.2 holds for all conditional probability measures, using only CP1–3.

  • 4.2 Suppose that μ is a conditional probability measure.

    1. Show that if U and V are independent with respect to μ, then μ(U ∩ V) = μ(U)μ(V).

    2. Show that if μ(U) ≠ 0, μ(V) ≠ 0, and μ(U ∩ V) = μ(U)μ(V), then U and V are independent with respect to μ.

  • 4.3 Prove Proposition 4.2.3.

  • 4.4 Prove Theorem 4.2.5.

  • 4.5 Show that Iμ(U, V | V) follows from CI1[μ]–CI5[μ].

  • 4.6 Show that CI1[Pl], CI2[Pl], and CI5[Pl] hold for all cpms Pl.

  • 4.7 This exercise examines the extent to which various notions of conditioning satisfy CI3 and CI4.

    1. Show that ranking functions and the representation Pl_𝒫 of a set 𝒫 of probability measures by a plausibility measure both satisfy CI3 and CI4.

    2. Show by means of counterexamples that none of Poss(V | U), Poss(V || U), Bel(V | U), or Bel(V || U) satisfy CI3 or CI4.

  • 4.8 Prove Lemma 4.3.3.

  • 4.9 Prove Lemma 4.3.5.

  • 4.10 Show that the assumption of standardness is necessary in Lemma 4.3.5. More precisely, suppose that (W, ℱ, ℱ′, Pl) is an arbitrary nonstandard algebraic cps. Show that there must exist some U such that IPl(U, U | W) does not hold although NIPl(U, U | W) does.

  • 4.11 Show that the cps implicitly defined in Example 3.2.4 is algebraic.

  • 4.12 This exercise shows that a set of probabilities and its convex hull are not equivalent insofar as determination of independencies goes. Suppose that a coin with an unknown probability of heads is tossed twice, and the tosses are known to be independent. A reasonable representation of this situation is given by the set 𝒫_0 consisting of all measures μ_α, where μ_α(hh) = α², μ_α(ht) = μ_α(th) = α(1 − α), μ_α(tt) = (1 − α)².

    1. Show that the coin tosses are independent with respect to Pl_𝒫_0.

    2. Show that 𝒫_0 is not convex (i.e., there exist μ_1, μ_2 ∈ 𝒫_0 such that aμ_1 + bμ_2 ∉ 𝒫_0, where a, b ∈ [0, 1] and a + b = 1).

    3. Show that the convex hull of 𝒫_0 (i.e., the least convex set containing 𝒫_0) includes measures for which the coin tosses are not independent.

  • 4.13 Show that if 𝒫 is a set of probability measures, then IPl_𝒫(U, V | V) implies Iμ(U, V | V) for all μ ∈ 𝒫.

  • *4.14 Prove Theorem 4.4.4.

  • 4.15 Show that CIRV5[Pl] holds for all cpms Pl.

  • 4.16 Construct a cps for which none of CIRV2–4 holds.

  • 4.17 Show that CIRV2 does not hold for belief functions, with conditioning defined as Bel(V | U), nor for *. (Hint: Construct a belief function Bel such that Bel(X = i) = Bel(Y = j ∩ Y′ = k) = 0 for all i, j, k ∈ {0, 1}, but Bel(Y = 0) = Bel(Y = 1) = 1/2. Show, as a consequence, that IrvBel(X, {Y, Y′}) holds, but IrvBel(X, Y) does not. A similar counterexample can be constructed for *.)

  • 4.18 Show using CIRV1–5 that Irvμ(X, Y | Z) iff Irvμ(X − Z, Y − Z | Z). Thus it is possible to assume without loss of generality that Z is disjoint from X and Y.

  • *4.19 Prove Theorem 4.4.5.

  • 4.20 Show that if X_1, …, X_n is a topological sort of G, then {X_1, …, X_{i−1}} ⊆ NonDesG(X_i).

  • *4.21 Consider the following property of conditional independence for random variables.

    CIRV6[μ]. If Irvμ(X, Y | Y′ ∪ Z) and Irvμ(X, Y′ | Y ∪ Z), then Irvμ(X, Y ∪ Y′ | Z).

    CIRV6 can be viewed as a partial converse to CIRV3.

    1. Show by means of a counterexample that CIRV6 does not hold if Y = Y′.

    2. Show by means of a counterexample that CIRV6 does not hold even if X, Y, Y′, Z are pairwise disjoint (i.e., if none of the random variables in X is in Y ∪ Y′ ∪ Z, none of the random variables in Y is in X ∪ Y′ ∪ Z, and so on).

    3. Show that CIRV6[μ] holds for all probability measures μ that are strictly positive with respect to X, Y, Y′, and Z in that if X = {X_1, …, X_n}, Y = {Y_1, …, Y_m}, Z = {Z_1, …, Z_k}, and Y′ = {Y′_1, …, Y′_p}, then for all x_i ∈ 𝒱(X_i), i = 1, …, n, y_j ∈ 𝒱(Y_j), j = 1, …, m, y′_l ∈ 𝒱(Y′_l), l = 1, …, p, and z_h ∈ 𝒱(Z_h), h = 1, …, k,

      μ(X_1 = x_1 ∩ … ∩ X_n = x_n ∩ Y_1 = y_1 ∩ … ∩ Y_m = y_m ∩ Y′_1 = y′_1 ∩ … ∩ Y′_p = y′_p ∩ Z_1 = z_1 ∩ … ∩ Z_k = z_k) > 0.

  • 4.22 Prove Proposition 4.5.4. Note that what requires proof here is that the required independence relations hold for the probability measure μ that is determined by (G, f).

  • 4.23 Complete the proof of Theorem 4.5.6 by showing that, for all nodes Y_k,

    Irvμ(NonDesG(Y_k), Y_k | ParG(Y_k)),

    using CIRV1–5. (Hint: Let Z_m = NonDesG(Y_k) ∩ {Y_1, …, Y_m}. Prove by induction on m that Irvμ(Z_m, Y_k | ParG(Y_k)), using CIRV1–5.)

  • 4.24 Consider the quantitative Bayesian network (Gs, fs) described in Example 4.5.2.

    1. Notice that {S, SH} blocks both paths from PS to C. What does this say about the relationship between PS and C in probabilistic terms?

    2. Calculate μs(C = 1 | PS = 1) for the unique probability measure μs represented by (Gs, fs).

    3. Use the construction of Theorem 4.5.6 to construct two qualitative Bayesian networks representing μs, both having S as their root.

    4. Suppose that you believe that there is a gene (that can be inherited) that results in a predisposition both to smoke and to have cancer, but otherwise smoking and cancer are unrelated. Draw a Bayesian network describing these beliefs, using the variables PG (at least one parent has this gene), G (has this gene), PS (at least one parent smokes), S (smokes), and C (has cancer). Explain why each edge you included is there.



