3.4 Evidence

While the three-prisoners puzzle (and many other examples) shows that this approach to conditioning on sets of probabilities often behaves in a reasonable way, it does not always seem to capture all the information learned, as the following example shows:

Example 3.4.1


Suppose that a coin is tossed twice and the first coin toss is observed to land heads. What is the likelihood that the second coin toss lands heads? In this situation, the sample space consists of four worlds: hh, ht, th, and tt. Let H1 = {hh, ht} be the event that the first coin toss lands heads. There are analogous events H2, T1, and T2. As in Example 2.4.6, all that is known about the coin is that its bias is either 1/3 or 2/3. The most obvious way to represent this seems to be with a set of probability measures 𝒫 = {μ1/3, μ2/3}. Further suppose that the coin tosses are independent. Intuitively, this means that the outcome of the first coin toss has no effect on the probabilities of the outcomes of the second coin toss. Independence is considered in more depth in Chapter 4; for now, all I need is for independence to imply that μα(hh) = μα(H1)μα(H2) = α² and that μα(ht) = μα(H1)μα(T2) = α(1 − α).

Using the definitions, it is immediate that 𝒫|H1(H2) = {1/3, 2/3} = 𝒫(H2). At first blush, this seems reasonable. Since the coin tosses are independent, observing heads on the first toss does not affect the likelihood of heads on the second toss; it is either 1/3 or 2/3, depending on what the actual bias of the coin is. However, intuitively, observing heads on the first toss should also give information about the coin being used: it is more likely to be the coin with bias 2/3. This point perhaps comes out more clearly if the coin is tossed 100 times and 66 heads are observed in the first 99 tosses. What is the probability of heads on the hundredth toss? Formally, using the obvious notation, the question now is what 𝒫|(H1 ∩ … ∩ H99)(H100) should be. According to the definitions, it is again {1/3, 2/3}: the probability is still either 1/3 or 2/3, depending on the coin used. But the fact that 66 of 99 tosses landed heads provides extremely strong evidence that the coin has bias 2/3 rather than 1/3. This evidence should make it more likely that the probability that the last coin toss will land heads is 2/3 rather than 1/3. The conditioning process does not capture this evidence at all.
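To see the point concretely, here is a small sketch (my own illustration, not from the text; the helper names are mine) that conditions each measure in 𝒫 = {μ1/3, μ2/3} on H1 and reads off the probability of H2. The resulting set of values is {1/3, 2/3}, exactly the same as before conditioning.

```python
from fractions import Fraction

def coin_measure(alpha):
    """Probability measure on two independent tosses of a coin with bias alpha."""
    a = Fraction(alpha)
    return {"hh": a * a, "ht": a * (1 - a), "th": (1 - a) * a, "tt": (1 - a) * (1 - a)}

H1 = {"hh", "ht"}   # first toss lands heads
H2 = {"hh", "th"}   # second toss lands heads

def conditional(mu, given, event):
    """mu(event | given), assuming mu(given) > 0."""
    p_given = sum(p for w, p in mu.items() if w in given)
    p_both = sum(p for w, p in mu.items() if w in given & event)
    return p_both / p_given

P = [coin_measure(Fraction(1, 3)), coin_measure(Fraction(2, 3))]
print({conditional(mu, H1, H2) for mu in P})                    # {1/3, 2/3}
print({sum(p for w, p in mu.items() if w in H2) for mu in P})   # also {1/3, 2/3}
```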

Interestingly, if the bias of the coin is either 0 or 1 (i.e., the coin is either double-tailed or double-headed), then the evidence is taken into account. In this case, after seeing heads, μ0 is eliminated, so 𝒫|H1(H2) = 1 (or, more precisely, {1}), not {0, 1}. On the other hand, if the bias is almost 0 or almost 1, say .005 or .995, then 𝒫|H1(H2) = {.005, .995}. Thus, although the evidence is taken into account in the extreme case, where the probability of heads is either 0 or 1, it is not taken into account if the probability of heads is either slightly greater than 0 or slightly less than 1.

Notice that if there is a probability on the possible biases of the coin, then all these difficulties disappear. In this case, the sample space must represent the possible biases of the coin. For example, if the coin has bias either α or β, with α > β, and the coin is tossed twice, then the sample space has eight worlds: (α, hh), (β, hh), (α, ht), (β, ht), and so on. Moreover, if the probability that it has bias α is a (so that the probability that it has bias β is 1 − a), then the uncertainty is captured by a single probability measure μ such that μ(α, hh) = aα², μ(β, hh) = (1 − a)β², and so on. With a little calculus, it is not hard to show that μ(H1) = μ(H2) = aα + (1 − a)β and μ(H1 ∩ H2) = aα² + (1 − a)β², so μ(H2 | H1) = (aα² + (1 − a)β²)/(aα + (1 − a)β) ≥ μ(H2), no matter what α and β are, with equality holding iff a = 0 or a = 1 (Exercise 3.5). Seeing H1 makes H2 more likely than it was before, despite the fact that the coin tosses are independent, because seeing H1 makes the coin with the greater bias α more likely to be the actual coin. This intuition can be formalized in a straightforward way. Let Cα be the event that the coin has bias α (so that Cα consists of the four worlds of the form (α, ·)). Then μ(Cα) = a, by assumption, while μ(Cα | H1) = aα/(aα + (1 − a)β) > a, since α > β (Exercise 3.6).
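The calculation with a prior on the bias is easy to check numerically. The sketch below (an illustration of mine, with hypothetical values a = 1/2, α = 2/3, β = 1/3) verifies that μ(H2 | H1) > μ(H2) and that μ(Cα | H1) > a for this choice; Exercises 3.5 and 3.6 ask for the general argument.

```python
from fractions import Fraction as F

def with_prior_on_bias(a, alpha, beta):
    """Single measure mu with prior a on bias alpha and 1 - a on bias beta."""
    mu_H1 = a * alpha + (1 - a) * beta              # also equals mu(H2)
    mu_H1_and_H2 = a * alpha**2 + (1 - a) * beta**2
    posterior_H2 = mu_H1_and_H2 / mu_H1             # mu(H2 | H1)
    posterior_C_alpha = a * alpha / mu_H1           # mu(C_alpha | H1)
    return posterior_H2, mu_H1, posterior_C_alpha

post_H2, prior_H2, post_C = with_prior_on_bias(F(1, 2), F(2, 3), F(1, 3))
print(post_H2, prior_H2)   # 5/9 > 1/2: seeing H1 makes H2 more likely
print(post_C)              # 2/3 > 1/2: seeing H1 makes the bias-2/3 coin more likely
```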


What the example shows is that the problem here is not the use of probability or conditioning per se, but that conditioning does not quite capture the evidence when the uncertainty is represented by a set of probability measures with more than one element. Is there any way to represent the evidence?

Actually, this issue of evidence has arisen a number of times already. In Example 3.2.8, Bob testing positive is certainly evidence that he has AIDS, even though the actual probability that he has AIDS also depends on the prior probability of AIDS. Similarly, in Example 3.2.9, seeing 10 heads in a row is strong evidence that the coin is double-headed even though, again, the actual probability that the coin is double-headed given that 10 heads in a row are observed depends on the prior probability that a double-headed coin is chosen.

The literature contains a great deal of discussion on how to represent evidence. Most of this discussion has been in the context of probability, trying to make sense of the evidence provided by seeing 10 heads in a row in Example 3.2.9. Here I consider a notion of evidence that applies even when there is a set of probability measures, rather than just a single measure. There are interesting connections between this notion and the notion of evidence in the Dempster-Shafer theory of evidence (see Section 2.4, particularly Examples 2.4.5 and 2.4.6).

For the purposes of this discussion, I assume that there is a finite space ℋ consisting of basic hypotheses and another set 𝒪 of basic observations (also typically finite, although that is not crucial). In the spirit of the approach discussed at the beginning of Section 2.3, the set of possible worlds is now ℋ × 𝒪; that is, a possible world consists of a (hypothesis, observation) pair. In Example 3.4.1, there are two hypotheses, BH (the coin is biased toward heads—the probability that it lands heads is 2/3) and BT (the coin is biased toward tails—the probability that it lands tails is 2/3). If the coin is tossed twice, then there are four possible observations: hh, ht, th, and tt. Thus, there are eight possible worlds. (This is precisely the sample space that was used in the last paragraph of Example 3.4.1.) Similarly, in Example 3.2.8, there are two hypotheses, A (Bob has AIDS) and Ā (he doesn't), and two possible observations, P (Bob tests positive) and P̄ (Bob tests negative). In Example 3.2.9 there are also two hypotheses: the coin is double-headed or it is fair.

In general, I do not assume that there is a probability measure on the full space W = ℋ × 𝒪, since the probability of each hypothesis may be unknown. However, for each h ∈ ℋ, I do assume that there is a probability measure μh on Wh = {(h, o) ∈ W : o ∈ 𝒪}, that is, the set of worlds associated with hypothesis h. For example, in the space WBH of Example 3.4.1, the probability of (BH, hh) is 4/9. That is, the probability of tossing two heads, given that the coin with bias 2/3 is used, is 4/9. This is precisely the approach considered in Section 2.3: the set W of possible worlds is partitioned into a number of subsets, with a probability measure on each subset.

For each observation o ∈ 𝒪 such that μh(o) > 0 for some h ∈ ℋ, let μo be the probability measure defined on ℋ by μo(h) = μh(o)/(Σh′∈ℋ μh′(o)). That is, μo(h) is essentially the probability of observing o, given hypothesis h. The denominator acts as a normalization constant; this choice guarantees that Σh∈ℋ μo(h) = 1. (The assumption that ℋ is finite guarantees that this is a finite sum.) Clearly, μo is a probability measure on ℋ. It compares the likelihood of two different hypotheses given observation o by comparing the likelihood of observing o, given each of these hypotheses. Thus, it does capture the intuition of evidence at some level.
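The definition of μo translates directly into a computation. The following sketch (mine, not the book's; the dictionary encoding of the likelihoods is an assumption) computes μhh for Example 3.4.1, where the hypotheses are BH and BT: observing hh favors BH over BT by 4 to 1.

```python
from fractions import Fraction as F

def evidence_measure(likelihoods, o):
    """mu_o(h) = mu_h(o) / sum over h' of mu_h'(o)."""
    total = sum(mu_h[o] for mu_h in likelihoods.values())
    return {h: mu_h[o] / total for h, mu_h in likelihoods.items()}

# Likelihoods mu_h for Example 3.4.1: BH has bias 2/3, BT has bias 1/3.
likelihoods = {
    "BH": {"hh": F(4, 9), "ht": F(2, 9), "th": F(2, 9), "tt": F(1, 9)},
    "BT": {"hh": F(1, 9), "ht": F(2, 9), "th": F(2, 9), "tt": F(4, 9)},
}
print(evidence_measure(likelihoods, "hh"))   # {'BH': 4/5, 'BT': 1/5}
```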

Up to now I have not assumed that there is a probability measure on the whole space ℋ × 𝒪. But suppose that μ is a measure on ℋ × 𝒪. What is the connection between μ(h | o) and μo(h)? In general, of course, there is no connection, since there is no connection between μ and μh. A more interesting question is what happens if μh(o) = μ(o | h).

Definition 3.4.2


A probability measure μ on ℋ × 𝒪 is compatible with {μh : h ∈ ℋ} if μ(o | h) = μh(o) for all (h, o) ∈ ℋ × 𝒪 such that μ(h) > 0.


Note that, even if μ is compatible with {μh : h ∈ ℋ}, it does not follow that μo(h) = μ(h | o). In Example 3.2.8, for instance, μP(A) = μA(P)/(μA(P) + μĀ(P)).

By definition, μP(A) depends only on μA(P) and μĀ(P); equivalently, if μ is compatible with {μA, μĀ}, μP(A) depends only on μ(P | A) and μ(P | Ā). On the other hand, as the calculations in Example 3.2.8 show, μ(A | P) depends on μ(P | A), μ(A), and μ(P), which can be calculated from μ(P | A), μ(P | Ā), and μ(A). Changing μ(A) affects μ(A | P) but not μP(A).
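The contrast can be seen in a few lines of code. In the sketch below (my own illustration; the likelihood values .99 and .02 are hypothetical, not the numbers of Example 3.2.8), varying the prior μ(A) changes the posterior μ(A | P) but leaves μP(A) fixed.

```python
from fractions import Fraction as F

# Hypothetical likelihoods (not the numbers of Example 3.2.8):
# mu_A(P) = probability of a positive test given AIDS; mu_Abar(P) = given no AIDS.
mu_A_P, mu_Abar_P = F(99, 100), F(2, 100)

mu_P_of_A = mu_A_P / (mu_A_P + mu_Abar_P)   # the evidence mu_P(A): likelihoods only

for prior_A in (F(1, 100), F(1, 10), F(1, 2)):   # vary the prior mu(A)
    posterior = prior_A * mu_A_P / (prior_A * mu_A_P + (1 - prior_A) * mu_Abar_P)
    print(prior_A, posterior, mu_P_of_A)   # posterior changes, evidence does not
```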

Similarly, in Example 3.2.9, suppose the coin is tossed N times. Let F and DH stand for the hypotheses that the coin is fair and double-headed, respectively. Let N-heads be the observation that all N coin tosses result in heads. Then it is easy to check that μN-heads(DH) = 2^N/(2^N + 1). This seems reasonable: the more heads are observed, the closer the likelihood of DH gets to 1. Of course, if tails is observed at least once in o, then μo(DH) = 0. Again, I stress that if there is a probability on the whole space ℋ × 𝒪 then, in general, μN-heads(DH) ≠ μ(DH | N-heads). The conditional probability depends in part on μ(DH), the prior probability of the coin being double-headed.
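A quick computation confirms the formula (again a sketch of mine, not the book's code): the likelihood of N heads is 1 under DH and 2^−N under F, so normalizing gives 2^N/(2^N + 1).

```python
from fractions import Fraction as F

def mu_N_heads_of_DH(N):
    """mu_{N-heads}(DH) for the double-headed (DH) vs. fair (F) coin of Example 3.2.9."""
    like_dh = F(1)            # the double-headed coin yields N heads with probability 1
    like_fair = F(1, 2) ** N  # the fair coin yields N heads with probability 2^-N
    return like_dh / (like_dh + like_fair)

for N in (1, 5, 10):
    print(N, mu_N_heads_of_DH(N))   # 2/3, 32/33, 1024/1025, i.e., 2^N/(2^N + 1)
```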

Although μo(h) ≠ μ(h | o), it seems that there should be a connection between the two quantities. Indeed there is; it is provided by Dempster's Rule of Combination. Before going on, I should point out a notational subtlety. In the expression μo(h), the h represents the singleton set {h}. On the other hand, in the expression μ(h | o), the h really represents the event {h} × 𝒪, since μ is defined on subsets of W = ℋ × 𝒪. Similarly, o is being identified with the event ℋ × {o} ⊆ W. While in general I make these identifications without comment, it is sometimes necessary to be more careful. In particular, in Proposition 3.4.3, Dempster's Rule of Combination is applied to two probability measures. (This makes sense, since probability measures are belief functions.) Both probability measures need to be defined on the same space (ℋ, in the proposition) for Dempster's Rule to apply. Thus, given μ defined on ℋ × 𝒪, let μℋ be the measure on ℋ obtained by projecting μ onto ℋ; that is, μℋ(h) = μ({h} × 𝒪).

Proposition 3.4.3


If μ is compatible with {μh : h ∈ ℋ} and μ(o) ≠ 0, then μℋ ⊕ μo is defined and μ(h | o) = (μℋ ⊕ μo)(h).


Proof See Exercise 3.7.
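Since probability measures are belief functions whose mass sits entirely on singletons, Dempster's Rule applied to two probability measures on ℋ amounts to multiplying them pointwise and renormalizing. The sketch below (my own check, with a hypothetical prior that puts probability 1/4 on BH in Example 3.4.1) illustrates Proposition 3.4.3: combining the prior with μhh gives the same answer as conditioning on hh directly.

```python
from fractions import Fraction as F

def combine(p, q):
    """Dempster's Rule for two probability measures on the same finite space:
    multiply pointwise, then renormalize."""
    raw = {h: p[h] * q[h] for h in p}
    norm = sum(raw.values())
    return {h: v / norm for h, v in raw.items()}

# Example 3.4.1 with a hypothetical prior mu(BH) = 1/4, mu(BT) = 3/4.
prior = {"BH": F(1, 4), "BT": F(3, 4)}        # the projection of mu onto the hypotheses
mu_hh = {"BH": F(4, 5), "BT": F(1, 5)}        # evidence measure for observing hh
print(combine(prior, mu_hh))                  # {'BH': 4/7, 'BT': 3/7}

# Conditioning directly: mu(BH | hh) = (1/4)(4/9) / ((1/4)(4/9) + (3/4)(1/9)) = 4/7.
direct = F(1, 4) * F(4, 9) / (F(1, 4) * F(4, 9) + F(3, 4) * F(1, 9))
print(direct)                                 # 4/7, matching the combination above
```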

Proposition 3.4.3 says that if the prior μℋ on ℋ is combined with the evidence given by the observation o (encoded as μo) by using Dempster's Rule of Combination, the result is the posterior μ(· | o). Even more can be said. Suppose that two observations are made. Then the space of observations has the form 𝒪1 × 𝒪2. Further assume that μh((o, o′)) = μh(o) × μh(o′), for all h ∈ ℋ. (Intuitively, this assumption encodes the fact that the observations are independent; see Section 4.1 for more discussion of independence.) Then the evidence represented by the joint observation (o, o′) is the result of combining the evidence given by the individual observations.

Proposition 3.4.4


μ(o,o′) = μo ⊕ μo′.


Proof See Exercise 3.8.

Thus, for example, in Example 3.2.9, μ(k+m)-heads = μk-heads ⊕ μm-heads: the evidence corresponding to observing k + m heads is the result of combining the evidence corresponding to observing k heads with that corresponding to observing m heads. Similar results hold for other observations.
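The same combination rule can be used to check Proposition 3.4.4 in Example 3.2.9. In the sketch below (an illustration of mine), μk-heads ⊕ μm-heads is computed for k = 3 and m = 4 and compared with μ(k+m)-heads.

```python
from fractions import Fraction as F

def mu_n_heads(n):
    """Evidence measure on {DH, F} for observing n heads in n tosses (Example 3.2.9)."""
    like = {"DH": F(1), "F": F(1, 2) ** n}
    total = sum(like.values())
    return {h: v / total for h, v in like.items()}

def combine(p, q):
    """Dempster's Rule for two probability measures on the same finite space."""
    raw = {h: p[h] * q[h] for h in p}
    norm = sum(raw.values())
    return {h: v / norm for h, v in raw.items()}

k, m = 3, 4
print(combine(mu_n_heads(k), mu_n_heads(m)))   # {'DH': 128/129, 'F': 1/129}
print(mu_n_heads(k + m))                       # the same: mu_{(k+m)-heads}
```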

The belief functions used to represent the evidence given by the observations in Example 2.4.6 also exhibit this type of behavior. More precisely, in that example, it was shown that the more heads are observed, the greater the evidence for BH and the stronger the agent's belief in BH. Consider the following special case of the belief function used in Example 2.4.6. Given an observation o, for H ⊆ ℋ, define

I leave it to the reader to check that this is in fact a belief function having the property that there is a constant c > 0 such that Plauso(h) = cμh(o) for all h ∈ ℋ (see Exercise 3.9). I mention the latter property because analogues of Propositions 3.4.3 and 3.4.4 hold for any representation of evidence that has this property. (It is easy to see that μo(h) = cμh(o) for all h ∈ ℋ, where c = 1/(Σh′∈ℋ μh′(o)).)

To make this precise, say that a belief function Bel captures the evidence o if, for all probability measures μ compatible with {μh : h ∈ ℋ}, it is the case that μ(h | o) = (μℋ ⊕ Bel)(h). Proposition 3.4.3 says that μo captures the evidence o. The following two results generalize Propositions 3.4.3 and 3.4.4:

Theorem 3.4.5


Fix o ∈ 𝒪. Suppose that Bel is a belief function on ℋ whose corresponding plausibility function Plaus has the property that Plaus(h) = cμh(o) for some constant c > 0 and all h ∈ ℋ. Then Bel captures the evidence o.


Proof See Exercise 3.10.

Theorem 3.4.6


Fix (o, o′) ∈ 𝒪1 × 𝒪2. If μh(o, o′) = μh(o) × μh(o′) for all h ∈ ℋ, Bel captures the evidence o, and Bel′ captures the evidence o′, then Bel ⊕ Bel′ captures the evidence (o, o′).


Proof See Exercise 3.11.

Interestingly, the converse to Theorem 3.4.5 also holds. If Bel captures the evidence o, then Plaus(h) = cμh(o) for some constant c > 0 and all h ∈ ℋ (Exercise 3.12).

So what does all this say about the problem raised at the beginning of this section, regarding the representation of evidence when uncertainty is represented by a set of probability measures? Recall from the discussion in Section 2.3 that a set 𝒫 of probability measures on a space W can be represented by a space 𝒫 × W. In this representation, 𝒫 can be viewed as the set of hypotheses and W as the set of observations. Actually, it may even be better to consider the space 𝒫 × 2^W, so that the observations become subsets of W. Suppose that 𝒫 is finite. Given an observation U ⊆ W, let p*U denote the encoding of this observation as a probability measure, as suggested earlier; that is, p*U(μ) = μ(U)/(Σμ′∈𝒫 μ′(U)). It seems perhaps more reasonable to represent the result of conditioning on U not just by the set {μ|U : μ ∈ 𝒫, μ(U) > 0}, but by the set {(μ|U, p*U(μ)) : μ ∈ 𝒫, μ(U) > 0}. That is, the conditional probability μ|U is tagged by the "likelihood" of the hypothesis μ. For example, in Example 3.4.1, conditioning on H1 in this way gives {(μ1/3|H1, 1/3), (μ2/3|H1, 2/3)}; evaluated at H2, this yields {(1/3, 1/3), (2/3, 2/3)}. This captures the intuition that observing H1 makes BH more likely than BT. There has been no work done on this representation of conditioning (to the best of my knowledge), but it seems worth pursuing further.
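As a rough sketch of what this tagged form of conditioning might look like computationally (my own illustration; the function and variable names are hypothetical), the code below conditions each measure in 𝒫 on U and pairs it with p*U(μ), reproducing the set {(1/3, 1/3), (2/3, 2/3)} for Example 3.4.1.

```python
from fractions import Fraction as F

def tagged_condition(P, U, worlds):
    """Pair each mu|U with the normalized likelihood p*_U(mu) of the 'hypothesis' mu."""
    total = sum(sum(mu[w] for w in U) for mu in P)
    tagged = []
    for mu in P:
        mu_U = sum(mu[w] for w in U)
        if mu_U > 0:
            cond = {w: (mu[w] / mu_U if w in U else F(0)) for w in worlds}
            tagged.append((cond, mu_U / total))   # (mu | U, p*_U(mu))
    return tagged

def coin_measure(alpha):
    a = F(alpha)
    return {"hh": a * a, "ht": a * (1 - a), "th": (1 - a) * a, "tt": (1 - a) * (1 - a)}

worlds = ["hh", "ht", "th", "tt"]
P = [coin_measure(F(1, 3)), coin_measure(F(2, 3))]
H1, H2 = {"hh", "ht"}, {"hh", "th"}
for cond, tag in tagged_condition(P, H1, worlds):
    print(sum(cond[w] for w in H2), tag)   # (1/3, 1/3) and (2/3, 2/3)
```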



