3.2 Probabilistic Conditioning



Suppose that an agent's uncertainty is represented by a probability measure μ on W and then the agent observes or learns (that the actual world is in) U. How should μ be updated to a new probability measure μ|U that takes this new information into account? Clearly if the agent believes that U is true, then it seems reasonable to require that

μ|U(Ū) = 0; (3.1)

all the worlds in Ū are impossible. What about worlds in U? What should their probability be? One reasonable intuition here is that if all that the agent has learned is U, then the relative likelihood of worlds in U should remain unchanged. (This presumes that the way that the agent learns U does not itself give the agent information; otherwise, as was shown in Example 3.1.2, relative likelihoods may indeed change.) That is, if V1, V2 ⊆ U with μ(V2) > 0, then

μ|U(V1)/μ|U(V2) = μ(V1)/μ(V2). (3.2)

Equations (3.1) and (3.2) completely determine μ|U if μ(U) > 0.

Proposition 3.2.1

start example

If μ(U) > 0 and μ|U is a probability measure on W satisfying (3.1) and (3.2), then

μ|U(V) = μ(V ∩ U)/μ(U). (3.3)

end example

Proof Since μ|U is a probability measure and so satisfies P1 and P2, by (3.1), μ|U(U) = 1. Taking V2 = U and V1 = V in (3.2), it follows that μ|U(V) = μ(V)/μ(U) for V ⊆ U. Now if V is not a subset of U, then V = (V ∩ U) ∪ (V ∩ Ū). Since V ∩ U and V ∩ Ū are disjoint sets, μ|U(V) = μ|U(V ∩ U) + μ|U(V ∩ Ū). Since V ∩ Ū ⊆ Ū and μ|U(Ū) = 0, it follows that μ|U(V ∩ Ū) = 0 (Exercise 2.4). Since V ∩ U ⊆ U, using the previous observations,

μ|U(V) = μ|U(V ∩ U) = μ(V ∩ U)/μ(U),

as desired.

Following traditional practice, I often write μ(V | U) rather than μ|U(V); μ|U is called a conditional probability (measure), and μ(V | U) is read "the probability of V given (or conditional on) U." Sometimes μ(U) is called the unconditional probability of U.
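The behavior required by (3.1)–(3.3) is easy to check computationally. The following sketch (the function and the small three-world distribution are illustrative, not from the text) conditions a finite probability measure and shows that relative likelihoods within U are preserved:

```python
from fractions import Fraction

def condition(mu, U):
    """Condition a finite distribution mu (dict: world -> probability)
    on the event U (a set of worlds), assuming mu(U) > 0."""
    mu_U = sum(p for w, p in mu.items() if w in U)
    if mu_U == 0:
        raise ValueError("cannot condition on an event of probability 0")
    # Worlds outside U become impossible; worlds inside U are rescaled
    # by 1/mu(U), so their relative likelihoods are unchanged.
    return {w: (p / mu_U if w in U else Fraction(0)) for w, p in mu.items()}

mu = {'w1': Fraction(1, 2), 'w2': Fraction(1, 3), 'w3': Fraction(1, 6)}
U = {'w2', 'w3'}
mu_given_U = condition(mu, U)
# mu|U(w2) = (1/3)/(1/2) = 2/3 and mu|U(w3) = (1/6)/(1/2) = 1/3,
# so the ratio mu(w2)/mu(w3) = 2 is preserved after conditioning.
```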

Using conditioning, I can make precise a remark that I made in Section 2.2: namely, that all choices of initial probability will eventually converge to the "right" probability measure as more and more information is received.

Example 3.2.2

start example

Suppose that, as in Example 2.4.6, Alice has a coin and she knows that it has either bias 2/3 (BH) or bias 1/3 (BT). She considers it much more likely that the bias is 1/3 than 2/3. Thus, initially, she assigns probability .99 to BT and probability .01 to BH.

Alice tosses the coin 25 times to learn more about its bias; she sees 19 heads and 6 tails. This seems to make it much more likely that the coin has bias 2/3, so Alice would like to update her probabilities. To do this, she needs to construct an appropriate set of possible worlds. A reasonable candidate consists of 2^26 worlds—for each of the two biases Alice considers possible, there are 2^25 worlds consisting of all the possible sequences of 25 coin tosses. The prior probability (i.e., the probability before observing the coin tosses) of the coin having bias 1/3 and getting a particular sequence of tosses with n heads and 25 − n tails is .99 × (1/3)^n × (2/3)^(25−n). That is, it is the probability of the coin having bias 1/3 times the probability of getting that sequence given that the coin has bias 1/3. In particular, the probability of the coin having bias 1/3 and getting a particular sequence with 19 heads and 6 tails is .99 × (1/3)^19 × (2/3)^6. Similarly, the probability of the coin having bias 2/3 and getting the same sequence is .01 × (2/3)^19 × (1/3)^6.

Since Alice has seen a particular sequence of 25 coin tosses, she should condition on the event corresponding to that sequence—that is, on the set U consisting of the two worlds where that sequence of coin tosses occurs. The probability of U is .99 × (1/3)^19 × (2/3)^6 + .01 × (2/3)^19 × (1/3)^6. The probability that the coin has bias 1/3 given U is then .99 × (1/3)^19 × (2/3)^6 divided by this sum. A straightforward calculation shows that this simplifies to (.99 × 2^6)/(.99 × 2^6 + .01 × 2^19), which is roughly .01. Thus, although initially Alice gives BT probability .99, she gives BH probability roughly .99 after seeing the evidence.

Of course, this is not an accident. Technically, as long as Alice gives the correct hypothesis (BH—that the bias is 2/3) positive probability initially, then her posterior probability of the correct hypothesis (after conditioning) will converge to 1 after almost all sequences of coin tosses. (A small aside: It is standard in the literature to talk about an agent's "prior" and "posterior" probabilities. The implicit assumption is that there is some fixed initial time when the analysis starts. The agent's probability at this time is her prior. Then the agent gets some information and conditions on it; the resulting probability is her posterior.) In any case, to make this claim precise, note that there are certainly times when the evidence is "misleading." That is, even if the bias is 2/3, it is possible that Alice will see a sequence of 25 coin tosses of which 6 are heads and 19 are tails. After observing that, she will consider that her original opinion that the bias is 1/3 has been confirmed. (Indeed, it is easy to check that she will give BT probability greater than .999998.) However, if the bias is actually 2/3, the probability of Alice seeing such misleading evidence is very low. In fact, the Law of Large Numbers, one of the central results of probability theory, says that, as the number N of coin tosses increases, the fraction of sequences in which the evidence is misleading goes to 0. As N gets large, in almost all sequences of N coin tosses, Alice's belief that the bias is 2/3 approaches 1.

In this sense, even if Alice's initial beliefs were incorrect, the evidence almost certainly forces her beliefs to the correct bias, provided she updates her beliefs by conditioning. Of course, the result also holds for much more general hypotheses than the bias of a coin.

end example
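Alice's update in Example 3.2.2 can be reproduced numerically. The following sketch (variable and function names are mine, not from the text) computes her posterior for both the actual and the "misleading" sequences:

```python
from fractions import Fraction

def posterior_BT(heads, tails, prior_BT=Fraction(99, 100)):
    """Probability that the bias is 1/3 (BT) after seeing a particular
    sequence with the given numbers of heads and tails, by conditioning."""
    prior_BH = 1 - prior_BT
    # Probability of this specific sequence under each hypothesis.
    like_BT = Fraction(1, 3)**heads * Fraction(2, 3)**tails
    like_BH = Fraction(2, 3)**heads * Fraction(1, 3)**tails
    joint_BT = prior_BT * like_BT
    joint_BH = prior_BH * like_BH
    return joint_BT / (joint_BT + joint_BH)

p = posterior_BT(19, 6)   # 19 heads, 6 tails: evidence favoring BH
q = posterior_BT(6, 19)   # the "misleading" sequence favoring BT
# p = 99/8291, roughly .012, so BH gets probability roughly .99;
# q = 811008/811009 > .999998, matching the figure in the text.
```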

Conditioning is a wonderful tool, but it does suffer from some problems, particularly when it comes to dealing with events with probability 0. Traditionally, (3.3) is taken as the definition of μ(V | U) if μ is an unconditional probability measure and μ(U) > 0; if μ(U) = 0, then the conditional probability μ(V | U) is undefined. This leads to a number of philosophical difficulties regarding worlds (and sets) with probability 0. Are they really impossible? If not, how unlikely does a world have to be before it is assigned probability 0? Should a world ever be assigned probability 0? If there are worlds with probability 0 that are not truly impossible, then what does it mean to condition on sets with probability 0?

Some of these issues can be sidestepped by treating conditional probability, not unconditional probability, as the basic notion. A conditional probability measure takes pairs V, U of subsets as arguments; μ(V, U) is generally written μ(V | U) to stress the conditioning aspects. What pairs (V, U) should be allowed as arguments to μ? The intuition is that for each fixed second argument U, the function μ(·, U) should be a probability measure. Thus, for the same reasons discussed in Section 2.2, I assume that the set of possible first arguments forms an algebra (or σ-algebra, if W is infinite). In fact, I assume that the algebra is the same for all U, so that the domain of μ has the form ℱ × ℱ′ for some algebra ℱ. For simplicity, I also assume that ℱ′ is a nonempty subset of ℱ that is closed under supersets, so that if U ∈ ℱ′, U ⊆ V, and V ∈ ℱ, then V ∈ ℱ′. Formally, a Popper algebra over W is a set ℱ × ℱ′ of subsets of W × W such that (a) ℱ is an algebra over W, (b) ℱ′ is a nonempty subset of ℱ, and (c) ℱ′ is closed under supersets in ℱ; that is, if V ∈ ℱ′, V ⊆ V′, and V′ ∈ ℱ, then V′ ∈ ℱ′. (Popper algebras are named after Karl Popper, who was the first to consider formally conditional probability as the basic notion; see the notes for further details.) Notice that ℱ′ need not be an algebra (in the sense of Definition 2.2.1); indeed, in general it is not an algebra.

Although, for convenience, I assume that the arguments of a conditional probability measure are in a Popper algebra throughout the book, the reasonableness of this assumption is certainly debatable. In Section 2.2 I already admitted that insisting that the domain of a probability measure be an algebra is somewhat questionable. Even more concerns arise here. Why should it be possible to condition only on elements of ℱ′? And why should it be possible to condition on a superset of U if it is possible to condition on U? It may well be worth exploring the impact of weakening this assumption (and, for that matter, the assumption that the domain of a probability measure is an algebra); see Chapter 12 for further discussion of this issue.

Definition 3.2.3

start example

A conditional probability space is a tuple (W, ℱ, ℱ′, μ) such that ℱ × ℱ′ is a Popper algebra over W and μ: ℱ × ℱ′ → [0, 1] satisfies the following conditions:

CP1. μ(U | U) = 1 if U ∈ ℱ′.

CP2. μ(V1 ∪ V2 | U) = μ(V1 | U) + μ(V2 | U) if V1 ∩ V2 = ∅, V1, V2 ∈ ℱ, and U ∈ ℱ′.

CP3. μ(U1 ∩ U2 | U3) = μ(U1 | U2 ∩ U3) × μ(U2 | U3) if U1, U2 ∈ ℱ and U2 ∩ U3, U3 ∈ ℱ′.

end example

CP1 and CP2 are just the obvious analogues of P1 and P2. CP3 is perhaps best understood by considering the following two properties:

CP4. μ(V | U) = μ(V ∩ U | U) if V ∈ ℱ and U ∈ ℱ′.

CP5. μ(U1 | U3) = μ(U1 | U2) × μ(U2 | U3) if U1 ⊆ U2 ⊆ U3, U1 ∈ ℱ, and U2, U3 ∈ ℱ′.

CP4 just says that, when conditioning on U, everything should be relativized to U. CP5 says that if U1 ⊆ U2 ⊆ U3, it is possible to compute the conditional probability of U1 given U3 by computing the conditional probability of U1 given U2, computing the conditional probability of U2 given U3, and then multiplying them together. It is best to think of CP5 (and CP3) in terms of proportions. For example, the proportion of female minority students at a university is just the fraction of minority students who are female multiplied by the fraction of students at the university who are minority students.

It is easy to see that both CP4 and CP5 follow from CP3 (and CP1 in the case of CP4); in addition, CP3 follows immediately from CP4 and CP5 (Exercise 3.1). Thus, in the presence of CP1, CP3 is equivalent to CP4 and CP5.
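For a conditional probability measure obtained from an unconditional one via (3.3), CP4 and CP5 (and hence, in the presence of CP1, CP3) can be checked by brute force on a small finite space. The following sketch uses an illustrative three-world measure of my own choosing:

```python
from fractions import Fraction
from itertools import chain, combinations

W = ['a', 'b', 'c']
mu = {'a': Fraction(1, 2), 'b': Fraction(1, 3), 'c': Fraction(1, 6)}

def prob(A):
    return sum(mu[w] for w in A)

def cond(V, U):
    """mu(V | U), defined by (3.3); U must have positive probability."""
    return prob(set(V) & set(U)) / prob(U)

events = [set(s) for s in chain.from_iterable(
    combinations(W, r) for r in range(len(W) + 1))]
nonnull = [U for U in events if prob(U) > 0]

# CP5: if U1 <= U2 <= U3, then mu(U1 | U3) = mu(U1 | U2) * mu(U2 | U3).
for U3 in nonnull:
    for U2 in nonnull:
        for U1 in events:
            if U1 <= U2 <= U3:
                assert cond(U1, U3) == cond(U1, U2) * cond(U2, U3)

# CP4: mu(V | U) = mu(V & U | U).
for U in nonnull:
    for V in events:
        assert cond(V, U) == cond(V & U, U)
```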

If μ is a conditional probability measure, then I usually write μ(U) instead of μ(U | W). Thus, in the obvious way, a conditional probability measure determines an unconditional probability measure. What about the converse?

Given an unconditional probability measure μ defined on some algebra ℱ over W, let ℱ′ consist of all sets U ∈ ℱ such that μ(U) ≠ 0. Then (3.3) can be used to define a conditional probability measure μe on ℱ × ℱ′ that is an extension of μ, in that μe(U | W) = μ(U). (This notion of extension is compatible with the one defined in Section 2.3; if an unconditional probability measure μ is identified with a conditional probability measure defined on ℱ × {W}, then μe extends μ to ℱ × ℱ′.) However, taking conditional probability as primitive is more general than starting with unconditional probability and defining conditional probability using (3.3). That is, in general, there are conditional probability measures that are extensions of μ for which, unlike the case of μe, ℱ′ includes some sets U such that μ(U) = 0.

One family of examples can be obtained by considering nonstandard probability measures, as defined in Section 2.6. Recall from that discussion that infinitesimals are numbers that are positive but smaller than any positive real number. For every element α′ in a non-Archimedean field such that −r < α′ < r for some real number r, it is not hard to show that there is a unique real number α that is closest to α′; moreover, |α − α′| is an infinitesimal (or 0). In fact, α is just inf{r : r > α′}. Let st(α′) denote the closest real number to α′. (st is short for standard; that is because elements of a non-Archimedean field that are not reals are often called nonstandard reals, while real numbers are called standard reals.) Note that if ε is an infinitesimal, then st(kε) = 0 for all real k > 0. (The requirement that −r < α′ < r is necessary. For if ε is an infinitesimal, then 1/ε, which is not bounded by any real number, does not have a standard real closest to it.)

Let μns be a nonstandard probability measure defined on an algebra ℱ with the property that μns(U) ≠ 0 if U ≠ ∅. μns can be extended to a conditional probability measure defined on ℱ × (ℱ − {∅}) using definition (3.3). Let μs be the standardization of μns, that is, the conditional probability measure such that μs(V | U) = st(μns(V | U)) for all V ∈ ℱ, U ∈ ℱ − {∅}. It may well be that μs(U) = 0 for some sets U for which μns(U) ≠ 0, since μns(U) may be infinitesimally small. It is easy to see that μs defined this way satisfies CP1–3 (Exercise 3.2). By definition, μs is defined on ℱ × (ℱ − {∅}). If there are nonempty sets U such that μs(U) = 0, then μs is not the result of starting with a standard unconditional probability measure and extending it using (3.3). The following example gives a concrete instance of how this construction works:

Example 3.2.4

start example

Let W0 = {w1, w2, w3} and let μns0(w1) = 1 − ε − ε², μns0(w2) = ε, and μns0(w3) = ε², where ε is an infinitesimal. Notice that μns0(w2 | {w2, w3}) = 1/(1 + ε), while μns0(w3 | {w2, w3}) = ε/(1 + ε). Thus, if μs0 is the standard approximation to μns0, then μs0(w2) = μs0(w3) = μs0({w2, w3}) = μs0(w3 | {w2, w3}) = 0 and μs0(w2 | {w2, w3}) = 1.

Although all the conditional probabilities that arise in the case of μs0 are either 0 or 1, it is easy to construct variants of μs0 where arbitrary conditional probabilities arise. For example, if W1 = {w1, w2, w3}, μns1(w1) = 1 − 2ε, μns1(w2) = ε, and μns1(w3) = ε, then μns1(w2 | {w2, w3}) = μns1(w3 | {w2, w3}) = 1/2, so μs1(w2 | {w2, w3}) = μs1(w3 | {w2, w3}) = 1/2.

end example
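The standardization in Example 3.2.4 can be mimicked computationally by representing nonstandard numbers as polynomials in an infinitesimal ε and taking standard parts of ratios. This is only an illustrative sketch (the representation and function names are mine, not from the text):

```python
from fractions import Fraction

# Represent a nonstandard number a0 + a1*eps + a2*eps^2 + ... by its
# list of integer coefficients [a0, a1, a2, ...].
def add(p, q):
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
            for i in range(n)]

def st_ratio(p, q):
    """Standard part st(p/q), assuming q != 0 and the ratio is bounded."""
    if not any(p):
        return Fraction(0)
    dp = next(i for i, c in enumerate(p) if c != 0)  # lowest-order term of p
    dq = next(i for i, c in enumerate(q) if c != 0)  # lowest-order term of q
    if dp > dq:
        return Fraction(0)                # p/q is infinitesimal
    if dp == dq:
        return Fraction(p[dp], q[dq])     # same order: ratio of coefficients
    raise ValueError("p/q is unbounded, so it has no standard part")

# Example 3.2.4: mu_ns0(w2) = eps and mu_ns0(w3) = eps^2.
w2 = [0, 1]        # eps
w3 = [0, 0, 1]     # eps^2
U = add(w2, w3)    # eps + eps^2, i.e., mu_ns0({w2, w3})
# st(mu_ns0(w2 | {w2, w3})) = st(1/(1 + eps)) = 1, while
# st(mu_ns0(w3 | {w2, w3})) = st(eps/(1 + eps)) = 0.
```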

3.2.1 Justifying Probabilistic Conditioning

Probabilistic conditioning can be justified in much the same way that probability is justified. For example, if it seems reasonable to apply the principle of indifference to W and then U is observed or learned, it seems equally reasonable to apply the principle of indifference again to U. This results in taking all the elements of U to be equally likely and assigning all the elements in Ū probability 0, which is exactly what (3.3) says. Similarly, using the relative-frequency interpretation, μ(V | U) can be viewed as the fraction of times that V occurs of the times that U occurs. Again, (3.3) holds.

Finally, consider a betting justification. To evaluate μ(V | U), only worlds in U are considered; the bet is called off if the world is not in U. More precisely, let (V |U, α) denote the following bet:

If U happens, then if V also happens, then I win $100(1 − α), while if V̄ also happens, then I lose $100α. If U does not happen, then the bet is called off (I do not win or lose anything).

As before, suppose that the agent has to choose between bets of the form (V | U, α) and (V̄ | U, 1 − α). For worlds in Ū, both bets are called off, so they are equivalent.

With this formulation of a conditional bet, it is possible to prove an analogue of Theorem 2.2.3, showing that an agent who is rational in the sense of satisfying properties RAT1–4 from Section 2.2 must use conditioning.

Theorem 3.2.5

start example

If an agent satisfies RAT1–4, then for all subsets U, V of W such that αU > 0, there is a number αV|U such that (V | U, α) ≻ (V̄ | U, 1 − α) for all α < αV|U and (V̄ | U, 1 − α) ≻ (V | U, α) for all α > αV|U. Moreover, αV|U = αV∩U/αU.

end example

Proof Assume that αU > 0. For worlds in U, just as in the unconditional case, (V | U, α) is a can't-lose proposition if α = 0, becomes increasingly less attractive as α increases, and becomes a can't-win proposition if α = 1. Let αV|U = sup{β : (V | U, β) ≻ (V̄ | U, 1 − β)}. The same argument as in the unconditional case (Exercise 2.5) shows that if an agent satisfies RAT1 and RAT2, then (V | U, α) ≻ (V̄ | U, 1 − α) for all α < αV|U and (V̄ | U, 1 − α) ≻ (V | U, α) for all α > αV|U.

It remains to show that if αV|U ≠ αV∩U/αU, then there is a collection of bets that the agent would be willing to accept that guarantees a sure loss. First, suppose that αV|U < αV∩U/αU. By the arguments in the proof of Theorem 2.2.3, αV∩U ≤ αU, so αV∩U/αU ≤ 1. Thus, there exist numbers β1, β2, β3 ∈ [0, 1] such that β1 > αV|U, β2 > αU (or β2 = 1 if αU = 1), β3 < αV∩U, and β1 < β3/β2 (or, equivalently, β1β2 < β3).

By construction, (V̄ | U, 1 − β1) ≻ (V | U, β1), (Ū, 1 − β2) ≻ (U, β2), and (V ∩ U, β3) ≻ (V̄ ∪ Ū, 1 − β3). Without loss of generality, β1, β2, and β3 are rational numbers over some common denominator N; that is, β1 = b1/N, β2 = b2/N, and β3 = b3/N. Given a bet (U, α), let N(U, α) denote N copies of (U, α). By RAT4, if B1 = {N(V̄ | U, 1 − β1), b1(Ū, 1 − β2), N(V ∩ U, β3)} and B2 = {N(V | U, β1), b1(U, β2), N(V̄ ∪ Ū, 1 − β3)}, then B1 ≻ B2. However, B1 results in a sure loss, while B2 results in a sure gain, so the agent's preferences violate RAT1. To see this, three cases must be considered. If the actual world is in Ū, then with B1 the agent is guaranteed to win Nβ1β2 and lose Nβ3, for a guaranteed net loss (since β1β2 < β3), while with B2 the agent is guaranteed a net gain of N(β3 − β1β2). The arguments are similar if the actual world is in V ∩ U or in V̄ ∩ U (Exercise 3.3). Thus, the agent is irrational.

A similar argument works if αV |U > αV U /αU (Exercise 3.3).
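The sure-loss claim in this proof can be verified arithmetically: with β1β2 < β3, the collection B1 pays exactly N(β1β2 − β3), in units of $100, in every world. The sketch below uses illustrative values of β1, β2, β3 (any values with β1β2 < β3 work) and checks all world classifications:

```python
from fractions import Fraction

def bet_payoff(happens, alpha):
    """Payoff, in $100 units, of the unconditional bet (E, alpha)."""
    return 1 - alpha if happens else -alpha

def cond_bet_payoff(in_U, event_happens, alpha):
    """Payoff, in $100 units, of the conditional bet (E | U, alpha)."""
    if not in_U:
        return Fraction(0)          # the bet is called off
    return bet_payoff(event_happens, alpha)

N = 10
b1, b2, b3 = 2, 3, 7                # beta1*beta2 = 6/100 < beta3 = 7/10
beta1, beta2, beta3 = Fraction(b1, N), Fraction(b2, N), Fraction(b3, N)

def B1_payoff(in_U, in_V):
    """Total payoff of B1 = {N(V-bar | U, 1-beta1), b1(U-bar, 1-beta2),
    N(V ∩ U, beta3)} in a world classified by membership in U and V."""
    return (N * cond_bet_payoff(in_U, not in_V, 1 - beta1)
            + b1 * bet_payoff(not in_U, 1 - beta2)
            + N * bet_payoff(in_U and in_V, beta3))

# In every world the payoff is N(beta1*beta2 - beta3) < 0: a sure loss.
for in_U, in_V in [(True, True), (True, False), (False, True), (False, False)]:
    assert B1_payoff(in_U, in_V) == N * (beta1 * beta2 - beta3)
```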

This justification can be criticized on a number of grounds. The earlier criticisms of RAT3 and RAT4 still apply, of course. An additional subtlety arises when dealing with conditioning. The Dutch book argument implicitly takes a static view of the agent's probabilities. It talks about an agent's current preference ordering on bets, including conditional bets of the form (V | U, α) that are called off if a specified event—U in this case—does not occur. But for conditioning what matters is not just the agent's current beliefs regarding V if U were to occur, but also how the agent would change his beliefs regarding V if U actually did occur. If the agent currently prefers the conditional bet (V | U, α) to (V̄ | U, 1 − α), it is not so clear that he would still prefer (V, α) to (V̄, 1 − α) if U actually did occur. This added assumption must be made to justify conditioning as a way of updating probability measures.

Theorems 2.2.3 and 3.2.5 show that if an agent's betting behavior does not obey P1 and P2, or if he does not update his probabilities according to (3.3), then he is liable to have a Dutch book made against him. What about the converse? Suppose that an agent's betting behavior does obey P1, P2, and (3.3)—that is, suppose that it is characterized by a probability measure, with updating characterized by conditional probability. Is it still possible for there to be a Dutch book?

Say that an agent's betting behavior is determined by a probability measure if there is a probability measure μ on W such that for all U ⊆ W, (U, α) ⪰ (Ū, 1 − α) iff μ(U) ≥ α. The following result shows that there cannot be a Dutch book if an agent updates using conditioning:

Theorem 3.2.6

start example

If an agent's betting behavior is determined by a probability measure, then there do not exist sets U1, …, Uk, numbers α1, …, αk ∈ [0, 1], and natural numbers N1, …, Nk ≥ 0 such that (1) (Uj, αj) ⪰ (Ūj, 1 − αj) for j = 1, …, k, (2) the agent suffers a sure loss with B = {N1(U1, α1), …, Nk(Uk, αk)}, and (3) the agent has a sure gain with the complementary collection of bets B̄ = {N1(Ū1, 1 − α1), …, Nk(Ūk, 1 − αk)}.

end example

Proof See Exercise 3.4.

3.2.2 Bayes' Rule

One of the most important results in probability theory is called Bayes' Rule. It relates μ(V | U) and μ(U | V).

Proposition 3.2.7

start example

(Bayes' Rule) If μ(U), μ(V) > 0, then

μ(V | U) = μ(U | V)μ(V)/μ(U).

end example

Proof The proof just consists of simple algebraic manipulation. Observe that, by (3.3),

μ(V | U)μ(U) = μ(U ∩ V) = μ(U | V)μ(V).

Dividing both sides by μ(U) gives the result.

Although Bayes' Rule is almost immediate from the definition of conditional probability, it is one of the most widely applicable results of probability theory. The following two examples show how it can be used:

Example 3.2.8

start example

Suppose that Bob tests positive on an AIDS test that is known to be 99 percent reliable. How likely is it that Bob has AIDS? That depends in part on what "99 percent reliable" means. For the purposes of this example, suppose that it means that, according to extensive tests, 99 percent of the subjects with AIDS tested positive and 99 percent of subjects that did not have AIDS tested negative. (Note that, in general, for reliability data, it is important to know about both false positives and false negatives.)

As it stands, this information is insufficient to answer the original question. This is perhaps best seen using Bayes' Rule. Let A be the event that Bob has AIDS and P be the event that Bob tests positive. The problem is to compute μ(A | P). It might seem that, since the test is 99 percent reliable, it should be .99, but this is not the case. By Bayes' Rule, μ(A | P) = μ(P | A) μ(A)/μ(P). Since 99 percent of people with AIDS test positive, it seems reasonable to take μ(P | A) = .99. But the fact that μ(P | A) = .99 does not make μ(A | P) = .99. The value of μ(A | P) also depends on μ(A) and μ(P).

Before going on, note that while it may be reasonable to take μ(P | A) = .99, a nontrivial leap is being made here. A is the event that Bob has AIDS and P is the event that Bob tests positive. The statistical information that 99 percent of people with AIDS test positive is thus being identified with the probability that Bob would test positive if Bob had AIDS. At best, making this identification involves the implicit assumption that Bob is like the test subjects from which the statistical information was derived in all relevant respects. See Chapter 11 for a more careful treatment of this issue.

In any case, going on with the computation of μ(A | P), note that although Bayes' Rule seems to require both μ(P) and μ(A), actually only μ(A) is needed. To see this, note that

  • μ(P) = μ(A ∩ P) + μ(Ā ∩ P),

  • μ(A ∩ P) = μ(P | A)μ(A) = .99μ(A),

  • μ(Ā ∩ P) = μ(P | Ā)μ(Ā) = (1 − μ(P̄ | Ā))(1 − μ(A)) = .01(1 − μ(A)).

Putting all this together, it follows that μ(P) = .01 + .98μ(A) and thus

μ(A | P) = μ(P | A)μ(A)/μ(P) = .99μ(A)/(.01 + .98μ(A)).

Just as μ(P | A) can be identified with the fraction of people with AIDS that tested positive, so μ(A), the unconditional probability that Bob has AIDS, can be identified with the fraction of the people in the population that have AIDS. If only 1 percent of the population has AIDS, then a straightforward computation shows that μ(A | P) = 1/2. If only .1 percent (i.e., one in a thousand) have AIDS, then μ(A | P) ≈ .09. Finally, if the incidence of AIDS is as high as one in three (as it is in some countries in Central Africa), then μ(A | P) ≈ .98—still less than .99, despite the accuracy of the test.

end example
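The computation in Example 3.2.8 is straightforward to reproduce. In this sketch the helper name is mine; the formula is the one derived above:

```python
from fractions import Fraction

def prob_aids_given_positive(base_rate):
    """mu(A | P) = .99 mu(A) / (.01 + .98 mu(A)), as derived above."""
    r = Fraction(base_rate)
    return Fraction(99, 100) * r / (Fraction(1, 100) + Fraction(98, 100) * r)

# Base rate 1 percent: the posterior is exactly 1/2.
# Base rate one in a thousand: the posterior is 11/122, roughly .09.
# Base rate one in three: the posterior is 99/101, roughly .98.
```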

The importance of μ(A) in this case can be understood from a less sensitive example.

Example 3.2.9

start example

Suppose that there is a huge bin full of coins. One of the coins in the bin is double-headed; all the rest are fair. A coin is picked from the bin and tossed 10 times. The coin tosses can be viewed as a test of whether the coin is double-headed or fair. The test is positive if all the coin tosses land heads and negative if any of them land tails. This gives a test that is better than 99.9 percent reliable: the probability that the test is positive given that the coin is double-headed is 1; the probability that the test is negative given that the coin is not double-headed (i.e., fair) is 1023/1024 > .999. Nevertheless, the probability that a coin that tests positive is double-headed clearly depends on the total number of coins in the bin. In fact, straightforward calculations similar to those in Example 3.2.8 show that if there are N coins in the bin, then the probability that the coin is double-headed given that it tests positive is 1024/(N + 1023). If N = 10, then a positive test makes it very likely that the coin is double-headed. On the other hand, if N = 1,000,000, while a positive test certainly increases the likelihood that the coin is double-headed, it is still far more likely to be a fair coin that landed heads 10 times in a row than a double-headed coin.

end example
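Example 3.2.9's calculation can be reproduced the same way; the function name below is illustrative:

```python
from fractions import Fraction

def prob_double_headed(N, heads_seen=10):
    """Probability the chosen coin is double-headed, given that all
    heads_seen tosses landed heads, when 1 of the N coins is double-headed."""
    prior_dh = Fraction(1, N)
    like_dh = Fraction(1)                       # double-headed: always heads
    like_fair = Fraction(1, 2) ** heads_seen    # fair coin: (1/2)^10 = 1/1024
    joint_dh = prior_dh * like_dh
    joint_fair = (1 - prior_dh) * like_fair
    return joint_dh / (joint_dh + joint_fair)

# With 10 tosses this simplifies to 1024/(N + 1023):
# N = 10 gives roughly .99, while N = 1,000,000 gives roughly .001,
# so for the large bin a fair coin that landed heads 10 times in a row
# remains the far better explanation of a positive test.
```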




Reasoning About Uncertainty
ISBN: 0262582597
Year: 2005