3.4 Evidence

While the three-prisoners puzzle (and many other examples) shows that this approach to conditioning on sets of probabilities often behaves in a reasonable way, it does not always seem to capture all the information learned, as the following example shows:

Example 3.4.1


Suppose that a coin is tossed twice and the first coin toss is observed to land heads. What is the likelihood that the second coin toss lands heads? In this situation, the sample space consists of four worlds: hh, ht, th, and tt. Let H1 = {hh, ht} be the event that the first coin toss lands heads. There are analogous events H2, T1, and T2. As in Example 2.4.6, all that is known about the coin is that its bias is either 1/3 or 2/3. The most obvious way to represent this seems to be with a set of probability measures 𝒫 = {μ1/3, μ2/3}. Further suppose that the coin tosses are independent. Intuitively, this means that the outcome of the first coin toss has no effect on the probabilities of the outcomes of the second coin toss. Independence is considered in more depth in Chapter 4; for now, all I need is for independence to imply that μα(hh) = μα(H1)μα(H2) = α² and that μα(ht) = μα(H1)μα(T2) = α(1 − α).

Using the definitions, it is immediate that 𝒫|H1(H2) = {1/3, 2/3} = 𝒫(H2). At first blush, this seems reasonable. Since the coin tosses are independent, observing heads on the first toss does not affect the likelihood of heads on the second toss; it is either 1/3 or 2/3, depending on what the actual bias of the coin is. However, intuitively, observing heads on the first toss should also give information about the coin being used: it is more likely to be the coin with bias 2/3. This point perhaps comes out more clearly if the coin is tossed 100 times and 66 heads are observed in the first 99 tosses. What is the probability of heads on the hundredth toss? Formally, using the obvious notation, the question now is what 𝒫|(H1 ∩ … ∩ H99)(H100) should be. According to the definitions, it is again {1/3, 2/3}: the probability is still either 1/3 or 2/3, depending on the coin used. But the fact that 66 of 99 tosses landed heads provides extremely strong evidence that the coin has bias 2/3 rather than 1/3. This evidence should make it more likely that the probability that the last coin toss will land heads is 2/3 rather than 1/3. The conditioning process does not capture this evidence at all.
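To see the point concretely, here is a small sketch (my own illustration, not from the text; the helper names are mine) that conditions each measure in 𝒫 = {μ1/3, μ2/3} on H1 and reads off the probability of H2. The resulting set of values is {1/3, 2/3}, exactly the same as before conditioning.

```python
from fractions import Fraction

def coin_measure(alpha):
    """Probability measure on two independent tosses of a coin with bias alpha."""
    a = Fraction(alpha)
    return {"hh": a * a, "ht": a * (1 - a), "th": (1 - a) * a, "tt": (1 - a) * (1 - a)}

H1 = {"hh", "ht"}   # first toss lands heads
H2 = {"hh", "th"}   # second toss lands heads

def conditional(mu, given, event):
    """mu(event | given), assuming mu(given) > 0."""
    p_given = sum(p for w, p in mu.items() if w in given)
    p_both = sum(p for w, p in mu.items() if w in given & event)
    return p_both / p_given

P = [coin_measure(Fraction(1, 3)), coin_measure(Fraction(2, 3))]
print({conditional(mu, H1, H2) for mu in P})                    # {1/3, 2/3}
print({sum(p for w, p in mu.items() if w in H2) for mu in P})   # also {1/3, 2/3}
```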

Interestingly, if the bias of the coin is either 0 or 1 (i.e., the coin is either double-tailed or double-headed), then the evidence is taken into account. In this case, after seeing heads, μ0 is eliminated, so 𝒫|H1(H2) = 1 (or, more precisely, {1}), not {0, 1}. On the other hand, if the bias is almost 0 or almost 1, say .005 or .995, then 𝒫|H1(H2) = {.005, .995}. Thus, although the evidence is taken into account in the extreme case, where the probability of heads is either 0 or 1, it is not taken into account if the probability of heads is either slightly greater than 0 or slightly less than 1.

Notice that if there is a probability on the possible biases of the coin, then all these difficulties disappear. In this case, the sample space must represent the possible biases of the coin. For example, if the coin has bias either α or β, with α > β, and the coin is tossed twice, then the sample space has eight worlds: (α, hh), (β, hh), (α, ht), (β, ht), and so on. Moreover, if the probability that it has bias α is a (so that the probability that it has bias β is 1 − a), then the uncertainty is captured by a single probability measure μ such that μ(α, hh) = aα², μ(β, hh) = (1 − a)β², and so on. With a little calculus, it is not hard to show that μ(H1) = μ(H2) = aα + (1 − a)β and μ(H1 ∩ H2) = aα² + (1 − a)β², so μ(H2 | H1) = (aα² + (1 − a)β²)/(aα + (1 − a)β) ≥ μ(H2), no matter what α and β are, with equality holding iff a = 0 or a = 1 (Exercise 3.5). Seeing H1 makes H2 more likely than it was before, despite the fact that the coin tosses are independent, because seeing H1 makes the coin with the greater bias α more likely to be the actual coin. This intuition can be formalized in a straightforward way. Let Cα be the event that the coin has bias α (so that Cα consists of the four worlds of the form (α, ·)). Then μ(Cα) = a, by assumption, while μ(Cα | H1) = aα/(aα + (1 − a)β) > a, since α > β (Exercise 3.6).
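The calculation with a prior on the bias is easy to check numerically. The sketch below (an illustration of mine, with hypothetical values a = 1/2, α = 2/3, β = 1/3) verifies that μ(H2 | H1) > μ(H2) and that μ(Cα | H1) > a for this choice; Exercises 3.5 and 3.6 ask for the general argument.

```python
from fractions import Fraction as F

def with_prior_on_bias(a, alpha, beta):
    """Single measure mu with prior a on bias alpha and 1 - a on bias beta."""
    mu_H1 = a * alpha + (1 - a) * beta              # also equals mu(H2)
    mu_H1_and_H2 = a * alpha**2 + (1 - a) * beta**2
    posterior_H2 = mu_H1_and_H2 / mu_H1             # mu(H2 | H1)
    posterior_C_alpha = a * alpha / mu_H1           # mu(C_alpha | H1)
    return posterior_H2, mu_H1, posterior_C_alpha

post_H2, prior_H2, post_C = with_prior_on_bias(F(1, 2), F(2, 3), F(1, 3))
print(post_H2, prior_H2)   # 5/9 > 1/2: seeing H1 makes H2 more likely
print(post_C)              # 2/3 > 1/2: seeing H1 makes the bias-2/3 coin more likely
```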


What the example shows is that the problem here is not the use of probability or conditioning per se, but that conditioning does not quite capture the evidence when the uncertainty is represented by a set of probability measures with more than one element. Is there any way to represent the evidence?

Actually, this issue of evidence has arisen a number of times already. In Example 3.2.8, Bob testing positive is certainly evidence that he has AIDS, even though the actual probability that he has AIDS also depends on the prior probability of AIDS. Similarly, in Example 3.2.9, seeing 10 heads in a row is strong evidence that the coin is double-headed even though, again, the actual probability that the coin is double-headed given that 10 heads in a row are observed depends on the prior probability that a double-headed coin is chosen.

The literature contains a great deal of discussion on how to represent evidence. Most of this discussion has been in the context of probability, trying to make sense of the evidence provided by seeing 10 heads in a row in Example 3.2.9. Here I consider a notion of evidence that applies even when there is a set of probability measures, rather than just a single measure. There are interesting connections between this notion and the notion of evidence in the Dempster-Shafer theory of evidence (see Section 2.4, particularly Examples 2.4.5 and 2.4.6).

For the purposes of this discussion, I assume that there is a finite space ℋ consisting of basic hypotheses and another set 𝒪 of basic observations (also typically finite, although that is not crucial). In the spirit of the approach discussed at the beginning of Section 2.3, the set of possible worlds is now ℋ × 𝒪; that is, a possible world consists of a (hypothesis, observation) pair. In Example 3.4.1, there are two hypotheses, BH (the coin is biased toward heads—the probability that it lands heads is 2/3) and BT (the coin is biased toward tails—the probability that it lands tails is 2/3). If the coin is tossed twice, then there are four possible observations: hh, ht, th, and tt. Thus, there are eight possible worlds. (This is precisely the sample space that was used in the last paragraph of Example 3.4.1.) Similarly, in Example 3.2.8, there are two hypotheses, A (Bob has AIDS) and Ā (he doesn't), and two possible observations, P (Bob tests positive) and P̄ (Bob tests negative). In Example 3.2.9 there are also two hypotheses: the coin is double-headed or it is fair.

In general, I do not assume that there is a probability measure on the full space W = ℋ × 𝒪, since the probability of each hypothesis may be unknown. However, for each h ∈ ℋ, I do assume that there is a probability measure μh on Wh = {(h, o) ∈ W : o ∈ 𝒪}, that is, the set of worlds associated with hypothesis h. For example, in the space WBH of Example 3.4.1, the probability of (BH, hh) is 4/9. That is, the probability of tossing two heads, given that the coin with bias 2/3 is used, is 4/9. This is precisely the approach considered in Section 2.3: the set W of possible worlds is partitioned into a number of subsets, with a probability measure on each subset.

For each observation o ∈ 𝒪 such that μh(o) > 0 for some h ∈ ℋ, let μo be the probability measure defined on ℋ by μo(h) = μh(o)/(Σh′∈ℋ μh′(o)). That is, μo(h) is essentially the probability of observing o, given hypothesis h. The denominator acts as a normalization constant; this choice guarantees that Σh∈ℋ μo(h) = 1. (The assumption that ℋ is finite guarantees that this is a finite sum.) Clearly, μo is a probability measure on ℋ. It compares the likelihood of two different hypotheses given observation o by comparing the likelihood of observing o, given each of these hypotheses. Thus, it does capture the intuition of evidence at some level.
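The definition of μo translates directly into a computation. The following sketch (mine, not the book's; the dictionary encoding of the likelihoods is an assumption) computes μhh for Example 3.4.1, where the hypotheses are BH and BT: observing hh favors BH over BT by 4 to 1.

```python
from fractions import Fraction as F

def evidence_measure(likelihoods, o):
    """mu_o(h) = mu_h(o) / sum over h' of mu_h'(o)."""
    total = sum(mu_h[o] for mu_h in likelihoods.values())
    return {h: mu_h[o] / total for h, mu_h in likelihoods.items()}

# Likelihoods mu_h for Example 3.4.1: BH has bias 2/3, BT has bias 1/3.
likelihoods = {
    "BH": {"hh": F(4, 9), "ht": F(2, 9), "th": F(2, 9), "tt": F(1, 9)},
    "BT": {"hh": F(1, 9), "ht": F(2, 9), "th": F(2, 9), "tt": F(4, 9)},
}
print(evidence_measure(likelihoods, "hh"))   # {'BH': 4/5, 'BT': 1/5}
```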

Up to now I have not assumed that there is a probability measure on the whole space ℋ × 𝒪. But suppose that μ is a measure on ℋ × 𝒪. What is the connection between μ(h | o) and μo(h)? In general, of course, there is no connection, since there is no connection between μ and μh. A more interesting question is what happens if μh(o) = μ(o | h).

Definition 3.4.2


A probability measure μ on ℋ × 𝒪 is compatible with {μh : h ∈ ℋ} if μ(o | h) = μh(o) for all (h, o) ∈ ℋ × 𝒪 such that μ(h) > 0.


Note that, even if μ is compatible with {μh : h ∈ ℋ}, it does not follow that μo(h) = μ(h | o). In Example 3.2.8, for instance, μP(A) = μA(P)/(μA(P) + μĀ(P)).

By definition, μP(A) depends only on μA(P) and μĀ(P); equivalently, if μ is compatible with {μA, μĀ}, μP(A) depends only on μ(P | A) and μ(P | Ā). On the other hand, as the calculations in Example 3.2.8 show, μ(A | P) depends on μ(P | A), μ(A), and μ(P), which can be calculated from μ(P | A), μ(P | Ā), and μ(A). Changing μ(A) affects μ(A | P) but not μP(A).
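The contrast can be seen in a few lines of code. In the sketch below (my own illustration; the likelihood values .99 and .02 are hypothetical, not the numbers of Example 3.2.8), varying the prior μ(A) changes the posterior μ(A | P) but leaves μP(A) fixed.

```python
from fractions import Fraction as F

# Hypothetical likelihoods (not the numbers of Example 3.2.8):
# mu_A(P) = probability of a positive test given AIDS; mu_Abar(P) = given no AIDS.
mu_A_P, mu_Abar_P = F(99, 100), F(2, 100)

mu_P_of_A = mu_A_P / (mu_A_P + mu_Abar_P)   # the evidence mu_P(A): likelihoods only

for prior_A in (F(1, 100), F(1, 10), F(1, 2)):   # vary the prior mu(A)
    posterior = prior_A * mu_A_P / (prior_A * mu_A_P + (1 - prior_A) * mu_Abar_P)
    print(prior_A, posterior, mu_P_of_A)   # posterior changes, evidence does not
```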

Similarly, in Example 3.2.9, suppose the coin is tossed N times. Let F and DH stand for the hypotheses that the coin is fair and double-headed, respectively. Let N-heads be the observation that all N coin tosses result in heads. Then it is easy to check that μN-heads(DH) = 2^N/(2^N + 1). This seems reasonable: the more heads are observed, the closer the likelihood of DH gets to 1. Of course, if tails is observed at least once in o, then μo(DH) = 0. Again, I stress that if there is a probability on the whole space ℋ × 𝒪 then, in general, μN-heads(DH) ≠ μ(DH | N-heads). The conditional probability depends in part on μ(DH), the prior probability of the coin being double-headed.
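A quick computation confirms the formula (again a sketch of mine, not the book's code): the likelihood of N heads is 1 under DH and 2^−N under F, so normalizing gives 2^N/(2^N + 1).

```python
from fractions import Fraction as F

def mu_N_heads_of_DH(N):
    """mu_{N-heads}(DH) for the double-headed (DH) vs. fair (F) coin of Example 3.2.9."""
    like_dh = F(1)            # the double-headed coin yields N heads with probability 1
    like_fair = F(1, 2) ** N  # the fair coin yields N heads with probability 2^-N
    return like_dh / (like_dh + like_fair)

for N in (1, 5, 10):
    print(N, mu_N_heads_of_DH(N))   # 2/3, 32/33, 1024/1025, i.e., 2^N/(2^N + 1)
```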

Although μo(h) ≠ μ(h | o), it seems that there should be a connection between the two quantities. Indeed there is; it is provided by Dempster's Rule of Combination. Before going on, I should point out a notational subtlety. In the expression μo(h), the h represents the singleton set {h}. On the other hand, in the expression μ(h | o), the h really represents the event {h} × 𝒪, since μ is defined on subsets of W = ℋ × 𝒪. Similarly, o is being identified with the event ℋ × {o} ⊆ W. While in general I make these identifications without comment, it is sometimes necessary to be more careful. In particular, in Proposition 3.4.3, Dempster's Rule of Combination is applied to two probability measures. (This makes sense, since probability measures are belief functions.) Both probability measures need to be defined on the same space (ℋ, in the proposition) for Dempster's Rule to apply. Thus, given μ defined on ℋ × 𝒪, let μℋ be the measure on ℋ obtained by projecting μ onto ℋ; that is, μℋ(h) = μ({h} × 𝒪).

Proposition 3.4.3


If μ is compatible with {μh : h ∈ ℋ} and μ(o) ≠ 0, then μℋ ⊕ μo is defined and μ(h | o) = (μℋ ⊕ μo)(h).


Proof See Exercise 3.7.
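Since probability measures are belief functions whose mass sits entirely on singletons, Dempster's Rule applied to two probability measures on ℋ amounts to multiplying them pointwise and renormalizing. The sketch below (my own check, with a hypothetical prior that puts probability 1/4 on BH in Example 3.4.1) illustrates Proposition 3.4.3: combining the prior with μhh gives the same answer as conditioning on hh directly.

```python
from fractions import Fraction as F

def combine(p, q):
    """Dempster's Rule for two probability measures on the same finite space:
    multiply pointwise, then renormalize."""
    raw = {h: p[h] * q[h] for h in p}
    norm = sum(raw.values())
    return {h: v / norm for h, v in raw.items()}

# Example 3.4.1 with a hypothetical prior mu(BH) = 1/4, mu(BT) = 3/4.
prior = {"BH": F(1, 4), "BT": F(3, 4)}        # the projection of mu onto the hypotheses
mu_hh = {"BH": F(4, 5), "BT": F(1, 5)}        # evidence measure for observing hh
print(combine(prior, mu_hh))                  # {'BH': 4/7, 'BT': 3/7}

# Conditioning directly: mu(BH | hh) = (1/4)(4/9) / ((1/4)(4/9) + (3/4)(1/9)) = 4/7.
direct = F(1, 4) * F(4, 9) / (F(1, 4) * F(4, 9) + F(3, 4) * F(1, 9))
print(direct)                                 # 4/7, matching the combination above
```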

Proposition 3.4.3 says that if the prior μℋ on ℋ is combined with the evidence given by the observation o (encoded as μo) by using Dempster's Rule of Combination, the result is the posterior μ(· | o). Even more can be said. Suppose that two observations are made. Then the space of observations has the form 𝒪1 × 𝒪2. Further assume that μh((o, o′)) = μh(o) × μh(o′), for all h ∈ ℋ. (Intuitively, this assumption encodes the fact that the observations are independent; see Section 4.1 for more discussion of independence.) Then the evidence represented by the joint observation (o, o′) is the result of combining the evidence given by the individual observations.

Proposition 3.4.4


μ(o,o′) = μo ⊕ μo′.


Proof See Exercise 3.8.

Thus, for example, in Example 3.2.9, μ(k+m)-heads = μk-heads ⊕ μm-heads: the evidence corresponding to observing k + m heads is the result of combining the evidence corresponding to observing k heads with that corresponding to observing m heads. Similar results hold for other observations.
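The same combination rule can be used to check Proposition 3.4.4 in Example 3.2.9. In the sketch below (an illustration of mine), μk-heads ⊕ μm-heads is computed for k = 3 and m = 4 and compared with μ(k+m)-heads.

```python
from fractions import Fraction as F

def mu_n_heads(n):
    """Evidence measure on {DH, F} for observing n heads in n tosses (Example 3.2.9)."""
    like = {"DH": F(1), "F": F(1, 2) ** n}
    total = sum(like.values())
    return {h: v / total for h, v in like.items()}

def combine(p, q):
    """Dempster's Rule for two probability measures on the same finite space."""
    raw = {h: p[h] * q[h] for h in p}
    norm = sum(raw.values())
    return {h: v / norm for h, v in raw.items()}

k, m = 3, 4
print(combine(mu_n_heads(k), mu_n_heads(m)))   # {'DH': 128/129, 'F': 1/129}
print(mu_n_heads(k + m))                       # the same: mu_{(k+m)-heads}
```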

The belief functions used to represent the evidence given by the observations in Example 2.4.6 also exhibit this type of behavior. More precisely, in that example, it was shown that the more heads are observed, the greater the evidence for BH and the stronger the agent's belief in BH. Consider the following special case of the belief function used in Example 2.4.6. Given an observation o, for H ⊆ ℋ, define

I leave it to the reader to check that this is in fact a belief function having the property that there is a constant c > 0 such that Plauso(h) = cμh(o) for all h ∈ ℋ (see Exercise 3.9). I mention the latter property because analogues of Propositions 3.4.3 and 3.4.4 hold for any representation of evidence that has this property. (It is easy to see that μo(h) = cμh(o) for all h ∈ ℋ, where c = 1/(Σh′∈ℋ μh′(o)).)

To make this precise, say that a belief function Bel captures the evidence o if, for all probability measures μ compatible with {μh : h ∈ ℋ}, it is the case that μ(h | o) = (μℋ ⊕ Bel)(h). Proposition 3.4.3 says that μo captures the evidence o. The following two results generalize Propositions 3.4.3 and 3.4.4:

Theorem 3.4.5


Fix o ∈ 𝒪. Suppose that Bel is a belief function on ℋ whose corresponding plausibility function Plaus has the property that Plaus(h) = cμh(o) for some constant c > 0 and all h ∈ ℋ. Then Bel captures the evidence o.


Proof See Exercise 3.10.

Theorem 3.4.6


Fix (o, o′) ∈ 𝒪1 × 𝒪2. If μh(o, o′) = μh(o) × μh(o′) for all h ∈ ℋ, Bel captures the evidence o, and Bel′ captures the evidence o′, then Bel ⊕ Bel′ captures the evidence (o, o′).


Proof See Exercise 3.11.

Interestingly, the converse to Theorem 3.4.5 also holds. If Bel captures the evidence o, then Plaus(h) = cμh(o) for some constant c > 0 and all h ∈ ℋ (Exercise 3.12).

So what does all this say about the problem raised at the beginning of this section, regarding the representation of evidence when uncertainty is represented by a set of probability measures? Recall from the discussion in Section 2.3 that a set 𝒫 of probability measures on a space W can be represented by a space 𝒫 × W. In this representation, 𝒫 can be viewed as the set of hypotheses and W as the set of observations. Actually, it may even be better to consider the space 𝒫 × 2^W, so that the observations become subsets of W. Suppose that 𝒫 is finite. Given an observation U ⊆ W, let p*U denote the encoding of this observation as a probability measure, as suggested earlier; that is, p*U(μ) = μ(U)/(Σμ′∈𝒫 μ′(U)). It seems perhaps more reasonable to represent the result of conditioning on U not just by the set {μ|U : μ ∈ 𝒫, μ(U) > 0}, but by the set {(μ|U, p*U(μ)) : μ ∈ 𝒫, μ(U) > 0}. That is, the conditional probability μ|U is tagged by the "likelihood" of the hypothesis μ. For example, in Example 3.4.1, conditioning on H1 in this way gives {(μ1/3|H1, 1/3), (μ2/3|H1, 2/3)}; evaluated at H2, this yields {(1/3, 1/3), (2/3, 2/3)}. This captures the intuition that observing H1 makes BH more likely than BT. There has been no work done on this representation of conditioning (to the best of my knowledge), but it seems worth pursuing further.
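As a rough sketch of what this tagged form of conditioning might look like computationally (my own illustration; the function and variable names are hypothetical), the code below conditions each measure in 𝒫 on U and pairs it with p*U(μ), reproducing the set {(1/3, 1/3), (2/3, 2/3)} for Example 3.4.1.

```python
from fractions import Fraction as F

def tagged_condition(P, U, worlds):
    """Pair each mu|U with the normalized likelihood p*_U(mu) of the 'hypothesis' mu."""
    total = sum(sum(mu[w] for w in U) for mu in P)
    tagged = []
    for mu in P:
        mu_U = sum(mu[w] for w in U)
        if mu_U > 0:
            cond = {w: (mu[w] / mu_U if w in U else F(0)) for w in worlds}
            tagged.append((cond, mu_U / total))   # (mu | U, p*_U(mu))
    return tagged

def coin_measure(alpha):
    a = F(alpha)
    return {"hh": a * a, "ht": a * (1 - a), "th": (1 - a) * a, "tt": (1 - a) * (1 - a)}

worlds = ["hh", "ht", "th", "tt"]
P = [coin_measure(F(1, 3)), coin_measure(F(2, 3))]
H1, H2 = {"hh", "ht"}, {"hh", "th"}
for cond, tag in tagged_condition(P, H1, worlds):
    print(sum(cond[w] for w in H2), tag)   # (1/3, 1/3) and (2/3, 2/3)
```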



