2.3 Lower and Upper Probabilities



Despite its widespread acceptance, there are some problems in using probability to represent uncertainty. Three of the most serious are (1) probability is not good at representing ignorance, (2) while an agent may be prepared to assign probabilities to some sets, she may not be prepared to assign probabilities to all sets, and (3) while an agent may be willing in principle to assign probabilities to all the sets in some algebra, computing these probabilities requires some computational effort; she may simply not have the computational resources required to do it. These criticisms turn out to be closely related to one of the criticisms of the Dutch book justification for probability mentioned in Section 2.2.1. The following two examples might help clarify the issues.

Example 2.3.1


Suppose that a coin is tossed once. There are two possible worlds, heads and tails, corresponding to the two possible outcomes. If the coin is known to be fair, it seems reasonable to assign probability 1/2 to each of these worlds. However, suppose that the coin has an unknown bias. How should this be represented? One approach might be to continue to take heads and tails as the elementary outcomes and, applying the principle of indifference, assign them both probability 1/2, just as in the case of a fair coin. However, there seems to be a significant qualitative difference between a fair coin and a coin of unknown bias. Is there some way that this difference can be captured? One possibility is to take the bias of the coin to be part of the possible world (i.e., a basic outcome would now describe both the bias of the coin and the outcome of the toss), but then what is the probability of heads?


Example 2.3.2


Suppose that a bag contains 100 marbles; 30 are known to be red, and the remainder are known to be either blue or yellow, although the exact proportion of blue and yellow is not known. What is the likelihood that a marble taken out of the bag is yellow? This can be modeled with three possible worlds, red, blue, and yellow, one for each of the possible outcomes. It seems reasonable to assign probability .3 to the outcome of choosing a red marble, and thus probability .7 to choosing either blue or yellow, but what probability should be assigned to the other two outcomes?


Empirically, it is clear that people do not use probability to represent the uncertainty in examples such as Example 2.3.2. For example, consider the following three bets. In each case a marble is chosen from the bag.

  • Br pays $1 if the marble is red, and 0 otherwise;

  • Bb pays $1 if the marble is blue, and 0 otherwise;

  • By pays $1 if the marble is yellow, and 0 otherwise.

People invariably prefer Br to both Bb and By, and they are indifferent between Bb and By. The fact that they are indifferent between Bb and By suggests that they view it as equally likely that the marble chosen is blue and that it is yellow. This seems reasonable; the problem statement provides no reason to prefer blue to yellow, or vice versa. However, if blue and yellow are equally probable, then the probability of drawing a blue marble and that of drawing a yellow marble are both .35, which suggests that By and Bb should both be preferred to Br. Moreover, any way of ascribing probability to blue and yellow either makes choosing a blue marble more likely than choosing a red marble, or makes choosing a yellow marble more likely than choosing a red marble (or both). This suggests that at least one of Bb and By should be preferred to Br, which is simply not what the experimental evidence shows.
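The arithmetic behind this observation is easy to check mechanically. The following sketch (mine, not from the text) enumerates the possible assignments of probability to blue, with red fixed at .3, and confirms that red is never the strictly most probable color:

```python
def best_single_color_bet(p_blue):
    """Return the color whose $1 bet has the highest expected payoff,
    given P(red) = .3, P(blue) = p_blue, P(yellow) = .7 - p_blue."""
    probs = {"red": 0.3, "blue": p_blue, "yellow": 0.7 - p_blue}
    return max(probs, key=probs.get)

# Since P(blue) + P(yellow) = .7, one of them is always at least .35 > .3,
# so no assignment of probabilities ever makes the bet on red the best one.
assert all(best_single_color_bet(n / 100) != "red" for n in range(71))
```

Whatever probability between 0 and .7 is assigned to blue, the best bet is blue or yellow, never red; yet people prefer Br.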

There are a number of ways of representing the uncertainty in these examples. As suggested in Example 2.3.1, it is possible to make the uncertainty about the bias of the coin part of the possible world. A possible world would then be a pair (a, X), where a ∈ [0, 1] and X ∈ {H, T}. Thus, for example, (1/3, H) is the world where the coin has bias 1/3 and lands heads. (Recall that the bias of a coin is the probability that it lands heads.) The problem with this approach (besides the fact that there are an uncountable number of worlds, although that is not a serious problem) is that it is not clear how to put a probability measure on the whole space, since there is no probability given on the coin having, say, bias in [1/3, 2/3]. The space can be partitioned into subspaces Wa, a ∈ [0, 1], where Wa consists of the two worlds (a, H) and (a, T). On Wa, there is an obvious probability measure μa: μa(a, H) = a and μa(a, T) = 1 − a. This just says that in a world in Wa (where the bias of the coin is a), the probability of heads is a and the probability of tails is 1 − a. For example, in the world (1/3, H), the probability measure used is μ1/3, which is defined on just (1/3, H) and (1/3, T); all the other worlds are ignored. The probability of heads at (1/3, H) is taken to be 1/3. This is just the probability of (1/3, H), since {(1/3, H)} is the intersection of the event "the coin lands heads" (i.e., all worlds of the form (a, H)) with W1/3.

This is an instance of an approach that will be examined in more detail in Sections 3.4 and 6.9. Rather than there being a global probability measure on the whole space, the space W is partitioned into subsets Wi, i ∈ I. (In this case, I = [0, 1].) On each subset Wi, there is a separate probability measure μi that is used for the worlds in that subset. The probability of an event U at a world in Wi is μi(Wi ∩ U).

For Example 2.3.2, the worlds would have the form (n, X), where X ∈ {red, blue, yellow} and n ∈ {0, …, 70}. (Think of n as representing the number of blue marbles.) In the subset Wn = {(n, red), (n, blue), (n, yellow)}, the world (n, red) has probability .3, (n, blue) has probability n/100, and (n, yellow) has probability (70 − n)/100. Thus, the probability of red is known to be .3; this is a fact true at every world (even though a different probability measure may be used at different worlds). Similarly, the probability of blue is known to be between 0 and .7, as is the probability of yellow. The probability of blue may be .3, but this is not known.

An advantage of this approach is that it allows a smooth transition to the purely probabilistic case. Suppose, for example, that a probability on the number of blue marbles is given. That amounts to putting a probability on the sets Wn, since Wn corresponds to the event that there are n blue marbles. If the probability of Wn is, say, bn, where Σ_{n=0}^{70} bn = 1, then the probability of (n, blue) is bn(n/100). In this way, a probability measure μ on the whole space W can be defined. The original probability μn on Wn is the result of conditioning μ on Wn. (I am assuming that readers are familiar with conditional probability; it is discussed in much more detail in Chapter 3.)
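As a concrete instance, the construction can be sketched with a hypothetical uniform prior b_n over n = 0, …, 70 (the choice of prior is mine, purely for illustration); the global measure is then obtained by summing out n:

```python
# Hypothetical uniform prior on the number n of blue marbles.
b = {n: 1 / 71 for n in range(71)}

# Global measure by total probability: P(blue) = sum_n b_n * (n/100),
# and similarly for yellow; red gets .3 in every subspace W_n.
p_blue = sum(b[n] * n / 100 for n in range(71))
p_yellow = sum(b[n] * (70 - n) / 100 for n in range(71))
p_red = 0.3

assert abs(p_red + p_blue + p_yellow - 1) < 1e-9
assert abs(p_blue - 0.35) < 1e-9  # under this prior, blue and yellow each get .35
```

Conditioning this global measure on W_n recovers the original μ_n, as the text says.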

This approach turns out to be quite fruitful. However, for now, I focus on two other approaches that do not involve extending the set of possible worlds. The first approach, which has been thoroughly studied in the literature, is quite natural: simply represent uncertainty using not just one probability measure, but a set of them. For example, in the case of the coin with unknown bias, the uncertainty can be represented using the set 𝒫1 = {μa : a ∈ [0, 1]} of probability measures, where μa gives heads probability a. Similarly, in the case of the marbles, the uncertainty can be represented using the set 𝒫2 = {μ′a : a ∈ [0, .7]}, where μ′a gives red probability .3, blue probability a, and yellow probability .7 − a. (I could restrict a to having the form n/100, for n ∈ {0, …, 70}, but it turns out to be a little more convenient in the later discussion not to make this restriction.)

A set 𝒫 of probability measures, all defined on a set W, can be represented as a single space W𝒫 = {(μ, w) : μ ∈ 𝒫, w ∈ W}. This space can be partitioned into subspaces Wμ, for μ ∈ 𝒫, where Wμ = {(μ, w) : w ∈ W}. On the subspace Wμ, the probability measure μ is used. This, of course, is an instance of the first approach discussed in this section. The first approach is actually somewhat more general. Here I am assuming that the space has the form A × B, where the elements of A define the partition, so that there is a probability measure μa on {a} × B for each a ∈ A. This type of space arises in many applications (see Section 3.4).

The last approach I consider in this section is to make only some sets measurable. Intuitively, the measurable sets are the ones to which a probability can be assigned. For example, in the case of the coin, the algebra might consist only of the empty set and {heads, tails}, so that {heads} and {tails} are no longer measurable sets. Clearly, there is only one probability measure on this space; for future reference, call it μ1. By considering this trivial algebra, there is no need to assign a probability to {heads} or {tails}.

Similarly, in the case of the marbles, consider the algebra

{∅, {red}, {blue, yellow}, {red, blue, yellow}}.

There is an obvious probability measure μ2 on this algebra that describes the story in Example 2.3.2: simply take μ2(red) = .3. That determines all the other probabilities.

Notice that, with the first approach, in the case of the marbles, the probability of red is .3 (since all probability measures in 𝒫2 give red probability .3), but all that can be said about the probability of blue is that it is somewhere between 0 and .7 (since that is the range of possible probabilities for blue according to the probability measures in 𝒫2), and similarly for yellow. There is a sense in which the second approach also gives this answer: any probability for blue between 0 and .7 is compatible with the probability measure μ2. Similarly, in the case of the coin with an unknown bias, all that can be said about the probability of heads is that it is somewhere between 0 and 1.

Recasting these examples in terms of the Dutch book argument, the fact that, for example, all that can be said about the probability of the marble being blue is that it is between 0 and .7 corresponds to the agent definitely preferring the bet (U̅, 1 − α) to (U, α) for α > .7, where U is the event of choosing a blue marble, but not being able to choose between the two bets for 0 ≤ α ≤ .7. In fact, the Dutch book justification for probability given in Theorem 2.2.3 can be recast to provide a justification for using sets of probabilities. Interestingly, with sets of probabilities, RAT3 no longer holds. The agent may not always be able to decide which of (U, α) and (U̅, 1 − α) she prefers.

Given a set 𝒫 of probability measures, all defined on an algebra ℱ over a set W, and U ∈ ℱ, define

𝒫_*(U) = inf{μ(U) : μ ∈ 𝒫} and 𝒫^*(U) = sup{μ(U) : μ ∈ 𝒫}.

𝒫_*(U) is called the lower probability of U, and 𝒫^*(U) is called the upper probability of U. For example, (𝒫2)_*(blue) = 0, (𝒫2)^*(blue) = .7, and similarly for yellow, while (𝒫2)_*(red) = (𝒫2)^*(red) = .3.
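In code, the inf and sup defining lower and upper probability become min and max once the set of measures is finite; the grid below is my finite stand-in for 𝒫2, in steps of .01:

```python
def lower(P, U):
    """Lower probability: the infimum of mu(U) over the measures in P."""
    return min(sum(mu[w] for w in U) for mu in P)

def upper(P, U):
    """Upper probability: the supremum of mu(U) over the measures in P."""
    return max(sum(mu[w] for w in U) for mu in P)

# Finite approximation of P2: red gets .3, blue gets a, yellow gets .7 - a.
P2 = [{"red": 0.3, "blue": a / 100, "yellow": 0.7 - a / 100} for a in range(71)]

assert lower(P2, {"blue"}) == 0.0 and upper(P2, {"blue"}) == 0.7
assert lower(P2, {"red"}) == upper(P2, {"red"}) == 0.3
```

Red's lower and upper probabilities coincide at .3, while blue's span the whole interval [0, .7], exactly as in the text.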

Now consider the approach of taking only some subsets to be measurable. An algebra ℱ′ is a subalgebra of an algebra ℱ if ℱ′ ⊆ ℱ. If ℱ′ is a subalgebra of ℱ, μ is a probability measure on ℱ′, and μ′ is a probability measure on ℱ, then μ′ is an extension of μ if μ and μ′ agree on all sets in ℱ′. Notice that 𝒫1 consists of all the extensions of μ1 to the algebra consisting of all subsets of {heads, tails}, and 𝒫2 consists of all extensions of μ2 to the algebra of all subsets of {red, blue, yellow}.

If μ is a probability measure on the subalgebra ℱ′ and U ∈ ℱ − ℱ′, then μ(U) is undefined, since U is not in the domain of μ. There are two standard ways of extending μ to ℱ, by defining functions μ_* and μ^*, traditionally called the inner measure and outer measure induced by μ, respectively. For U ∈ ℱ, define

μ_*(U) = sup{μ(V) : V ⊆ U, V ∈ ℱ′};
μ^*(U) = inf{μ(V) : V ⊇ U, V ∈ ℱ′}.
These definitions are perhaps best understood in the case where the set of possible worlds (and hence the algebra ℱ) is finite. In that case, μ_*(U) is the measure of the largest measurable set (in ℱ′) contained in U, and μ^*(U) is the measure of the smallest measurable set containing U. That is, μ_*(U) = μ(V1), where V1 = ⋃{B ∈ ℱ′ : B ⊆ U}, and μ^*(U) = μ(V2), where V2 = ⋂{B ∈ ℱ′ : U ⊆ B} (Exercise 2.7). Intuitively, μ_*(U) is the best approximation to the actual probability of U from below and μ^*(U) is the best approximation from above. If U ∈ ℱ′, then it is easy to see that μ_*(U) = μ^*(U) = μ(U). If U ∉ ℱ′ then, in general, μ_*(U) < μ^*(U). For example, (μ2)_*(blue) = 0 and (μ2)^*(blue) = .7, since the largest measurable set contained in {blue} is the empty set, while the smallest measurable set containing {blue} is {blue, yellow}. Similarly, (μ2)_*(red) = (μ2)^*(red) = μ2(red) = .3. These are precisely the same numbers obtained using the lower and upper probabilities (𝒫2)_* and (𝒫2)^*. Of course, this is no accident.
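In the finite case, the inner and outer measures can be computed directly from the subalgebra; a sketch for μ2 and the algebra {∅, {red}, {blue, yellow}, {red, blue, yellow}}:

```python
W = frozenset({"red", "blue", "yellow"})
algebra = [frozenset(), frozenset({"red"}), frozenset({"blue", "yellow"}), W]
mu2 = {frozenset(): 0.0, frozenset({"red"}): 0.3,
       frozenset({"blue", "yellow"}): 0.7, W: 1.0}

def inner(U):
    """Inner measure: measure of the largest measurable set contained in U."""
    return max(mu2[B] for B in algebra if B <= frozenset(U))

def outer(U):
    """Outer measure: measure of the smallest measurable set containing U."""
    return min(mu2[B] for B in algebra if frozenset(U) <= B)

assert inner({"blue"}) == 0.0 and outer({"blue"}) == 0.7
assert inner({"red"}) == outer({"red"}) == 0.3
```

The max and min suffice here because an algebra is closed under union and intersection, so the best measurable approximations from inside and outside are themselves in the algebra.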

Theorem 2.3.3


Let μ be a probability measure on a subalgebra ℱ′ of ℱ and let 𝒫μ consist of all extensions of μ to ℱ. Then μ_*(U) = (𝒫μ)_*(U) and μ^*(U) = (𝒫μ)^*(U) for all U ∈ ℱ.


Proof See Exercise 2.8. Note that, as the discussion in Exercise 2.8 and the notes to this chapter show, in general, the probability measures in 𝒫μ are only finitely additive. The result is not true in general for countably additive probability measures, although a variant of it does hold even for countably additive measures; see the notes for details.

Note that whereas probability measures are additive, so that if U and V are disjoint sets then μ(U ∪ V) = μ(U) + μ(V), inner measures are superadditive and outer measures are subadditive, so that for disjoint sets U and V,

μ_*(U ∪ V) ≥ μ_*(U) + μ_*(V) and μ^*(U ∪ V) ≤ μ^*(U) + μ^*(V).     (2.3)

In addition, the relationship between inner and outer measures is given by

μ^*(U) = 1 − μ_*(U̅).     (2.4)

(Exercise 2.9).

The inequalities in (2.3) are special cases of more general inequalities satisfied by inner and outer measures. These more general inequalities are best understood in terms of the inclusion-exclusion rule for probability, which describes how to compute the probability of the union of (not necessarily disjoint) sets. In the case of two sets, the rule says

μ(U ∪ V) = μ(U) + μ(V) − μ(U ∩ V).     (2.5)
To see this, note that U ∪ V can be written as the union of three disjoint sets, U − V, V − U, and U ∩ V. Thus,

μ(U ∪ V) = μ(U − V) + μ(V − U) + μ(U ∩ V).
Since U is the union of U − V and U ∩ V, and V is the union of V − U and U ∩ V, it follows that

μ(U) = μ(U − V) + μ(U ∩ V) and μ(V) = μ(V − U) + μ(U ∩ V).
Now (2.5) easily follows by simple algebra.

In the case of three sets U1, U2, U3, similar arguments show that

μ(U1 ∪ U2 ∪ U3) = μ(U1) + μ(U2) + μ(U3) − μ(U1 ∩ U2) − μ(U1 ∩ U3) − μ(U2 ∩ U3) + μ(U1 ∩ U2 ∩ U3).     (2.6)
That is, the probability of the union of U1, U2, and U3 can be determined by adding the probability of the individual sets (these are one-way intersections), subtracting the probability of the two-way intersections, and adding the probability of the three-way intersections.

The full-blown inclusion-exclusion rule is

μ(U1 ∪ ⋯ ∪ Un) = Σ_{∅ ≠ I ⊆ {1,…,n}} (−1)^{|I|+1} μ(⋂_{i∈I} Ui).     (2.7)
Equation (2.7) says that the probability of the union of n sets is obtained by adding the probabilities of the one-way intersections (the case where |I| = 1), subtracting the probabilities of the two-way intersections (the case where |I| = 2), adding the probabilities of the three-way intersections, and so on. The (−1)^{|I|+1} term causes the alternation from addition to subtraction and back again as the size of the index set I increases. Equations (2.5) and (2.6) are just special cases of the general rule, for n = 2 and n = 3. I leave it to the reader to verify the general rule (Exercise 2.10).
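The inclusion-exclusion rule can be spot-checked numerically; the sketch below (a check of my own devising) compares both sides of the rule for three randomly chosen sets under a randomly generated measure:

```python
import random
from functools import reduce
from itertools import combinations

random.seed(0)
W = list(range(8))
weights = [random.random() for _ in W]
total = sum(weights)

def mu(U):
    """A probability measure built from random weights on an 8-point space."""
    return sum(weights[w] for w in U) / total

n = 3
Us = [set(random.sample(W, 4)) for _ in range(n)]

# Inclusion-exclusion: mu of the union equals the alternating sum, over all
# nonempty index sets I, of mu of the corresponding intersections.
lhs = mu(set().union(*Us))
rhs = sum((-1) ** (len(I) + 1) * mu(reduce(set.intersection, [Us[i] for i in I]))
          for k in range(1, n + 1) for I in combinations(range(n), k))
assert abs(lhs - rhs) < 1e-12
```

The same loop works for any n; only the number of index sets I grows (there are 2^n − 1 of them).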

For inner measures, there is also an inclusion-exclusion rule, except that = is replaced by ≥. Thus,

μ_*(U1 ∪ ⋯ ∪ Un) ≥ Σ_{∅ ≠ I ⊆ {1,…,n}} (−1)^{|I|+1} μ_*(⋂_{i∈I} Ui)     (2.8)
(Exercise 2.12). For outer measures, there is a dual property that holds, which results from (2.8) by (1) switching the roles of intersection and union and (2) replacing ≥ by ≤. That is,

μ^*(U1 ∩ ⋯ ∩ Un) ≤ Σ_{∅ ≠ I ⊆ {1,…,n}} (−1)^{|I|+1} μ^*(⋃_{i∈I} Ui)     (2.9)
(Exercise 2.13). Theorem 7.4.1 in Section 7.4 shows that there is a sense in which these inequalities characterize inner and outer measures.

Theorem 2.3.3 shows that for every probability measure μ on a subalgebra ℱ′ of 2^W, there exists a set 𝒫 of probability measures defined on 2^W such that μ_* = 𝒫_*. Thus, inner measures can be viewed as a special case of lower probabilities. The converse of Theorem 2.3.3 does not hold; not every lower probability is the inner measure that arises from a measure defined on a subalgebra of 2^W. One way of seeing that lower probabilities are more general is by considering the properties that they satisfy.

It is easy to see that lower and upper probabilities satisfy analogues of (2.3) and (2.4) (with μ_* and μ^* replaced by 𝒫_* and 𝒫^*, respectively). If U and V are disjoint, then

𝒫_*(U ∪ V) ≥ 𝒫_*(U) + 𝒫_*(V) and 𝒫^*(U ∪ V) ≤ 𝒫^*(U) + 𝒫^*(V),     (2.10)

and

𝒫^*(U) = 1 − 𝒫_*(U̅).     (2.11)
However, they do not satisfy the analogues of (2.8) and (2.9) in general (Exercise 2.14). Note that if 𝒫_* does not satisfy the analogue of (2.8), then it cannot be the case that 𝒫_* = μ_* for some probability measure μ, since all inner measures do satisfy (2.8).

While (2.10) and (2.11) hold for all lower and upper probabilities, these properties do not completely characterize them. For example, the following property holds for lower and upper probabilities if U and V are disjoint:

𝒫_*(U ∪ V) ≤ 𝒫_*(U) + 𝒫^*(V);     (2.12)

moreover, this property does not follow from (2.10) and (2.11) (Exercise 2.15). However, even adding (2.12) to (2.10) and (2.11) does not provide a complete characterization of upper and lower probabilities. The property needed is rather complex. Stating it requires one more definition: A set 𝒰 of subsets of W covers a subset U of W exactly k times if every element of U is in exactly k sets in 𝒰. Consider the following property:

If {U1, …, Uk} covers U exactly m + n times and covers U̅ exactly m times, then Σ_{i=1}^k 𝒫_*(Ui) ≤ m + n·𝒫_*(U).     (2.13)
It is not hard to show that lower probabilities satisfy (2.13) and that (2.10) and (2.12) follow from (2.13) and (2.11) (Exercise 2.16). Indeed, in a precise sense (discussed in Exercise 2.16), (2.13) completely characterizes lower probabilities (and hence, together with (2.11), upper probabilities as well), at least if all the probability measures are only finitely additive.
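The cover property can be stress-tested by brute force. The sketch below (my construction, not from the text) takes the property in the form "if {U1, …, Uk} covers U exactly m + n times and covers the complement of U exactly m times, then Σ_i 𝒫_*(Ui) ≤ m + n·𝒫_*(U)", and checks every family of three (possibly repeated) subsets of the three-world marble space against the measures of 𝒫2, with probabilities kept in integer hundredths so the arithmetic is exact:

```python
from itertools import combinations, combinations_with_replacement

worlds = (0, 1, 2)                         # red, blue, yellow
P2 = [(30, a, 70 - a) for a in range(71)]  # probabilities in hundredths (exact)

def lower(U):
    return min(sum(mu[w] for w in U) for mu in P2)

subsets = [frozenset(c) for k in range(4) for c in combinations(worlds, k)]

for U in subsets:
    comp = frozenset(worlds) - U
    for family in combinations_with_replacement(subsets, 3):
        cover_U = {sum(w in S for S in family) for w in U}
        cover_c = {sum(w in S for S in family) for w in comp}
        if len(cover_U) > 1 or len(cover_c) > 1:
            continue                       # not an exact cover of U and its complement
        m = next(iter(cover_c), 0)
        n = next(iter(cover_U), 0) - m
        if n < 0:
            continue                       # the property applies only when n >= 0
        # sum_i P_*(U_i) <= m + n * P_*(U), here scaled by 100
        assert sum(lower(S) for S in family) <= 100 * m + n * lower(U)
```

The inequality holds because, for each measure μ, the exact cover forces Σ_i μ(Ui) = m + n·μ(U); taking the infimum on each side gives the bound.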

If all the probability measures in 𝒫 are countably additive and are defined on a σ-algebra ℱ, then 𝒫_* has one additional continuity property, analogous to (2.2):

If U1 ⊇ U2 ⊇ ⋯ and ⋂_{i=1}^∞ Ui = U, then lim_{i→∞} 𝒫_*(Ui) = 𝒫_*(U)

(Exercise 2.18(a)). The analogue of (2.1) does not hold for lower probability. For example, suppose that 𝒫 = {μ0, μ1, …}, where μn is the probability measure on ℕ = {0, 1, 2, …} such that μn(n) = 1. Clearly 𝒫_*(U) = 0 if U is a strict subset of ℕ, and 𝒫_*(ℕ) = 1. Let Un = {0, …, n}. Then U0, U1, … is an increasing sequence and ⋃_{i=0}^∞ Ui = ℕ, but lim_{i→∞} 𝒫_*(Ui) = 0 ≠ 𝒫_*(ℕ) = 1. On the other hand, the analogue of (2.1) does hold for upper probability, while the analogue of (2.2) does not (Exercise 2.18(b)).
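The failure of continuity from below can be replayed in code; the truncation bound N is my device for making the check finite:

```python
N = 500  # finite truncation of the natural numbers

def lower_prob(U):
    """Lower probability for P = {mu_0, ..., mu_N}, where mu_n puts all mass on n."""
    return min(1.0 if n in U else 0.0 for n in range(N + 1))

whole = set(range(N + 1))
Us = [set(range(k + 1)) for k in range(N)]    # increasing sequence of strict subsets

assert all(lower_prob(U) == 0.0 for U in Us)  # every strict subset gets lower prob 0
assert lower_prob(whole) == 1.0               # but the limit set gets 1
```

Each Uk omits some n, so the measure μn concentrated there assigns it 0; only the whole space is certain under every measure.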

Although I have been focusing on lower and upper probability, it is important to stress that sets of probability measures contain more information than is captured by their lower and upper probabilities, as the following example shows:

Example 2.3.4


Consider two variants of the example with marbles. In the first, all that is known is that there are at most 50 yellow marbles and at most 50 blue marbles in a bag of 100 marbles; no information at all is given about the number of red marbles. In the second case, it is known that there are exactly as many blue marbles as yellow marbles. The first situation can be captured by the set 𝒫3 = {μ : μ(blue) ≤ .5, μ(yellow) ≤ .5}. The second situation can be captured by the set 𝒫4 = {μ : μ(blue) = μ(yellow)}. These sets of measures are obviously quite different; in fact, 𝒫4 ⊂ 𝒫3. However, it is easy to see that (𝒫3)_* = (𝒫4)_* and, hence, that (𝒫3)^* = (𝒫4)^* (Exercise 2.19). Thus, the fact that blue and yellow have equal probability in every measure in 𝒫4 has been lost. I return to this issue in Section 2.8.

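This example can be checked by brute force over discretized versions of the two sets of measures (my sketch; the grids step by .01, with probabilities kept as integer hundredths so comparisons are exact):

```python
from itertools import product

# Measures as (blue, yellow, red) triples in hundredths of probability.
P3 = [(b, y, 100 - b - y) for b, y in product(range(51), repeat=2)]  # blue, yellow <= .5
P4 = [(a, a, 100 - 2 * a) for a in range(51)]                        # blue = yellow

def bounds(P, event):
    """(lower, upper) probability of an event given as coordinate indices."""
    vals = [sum(mu[i] for i in event) for mu in P]
    return min(vals), max(vals)

events = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
assert all(bounds(P3, e) == bounds(P4, e) for e in events)  # identical bounds
assert set(P4) < set(P3)                                    # yet P4 is strictly smaller
```

Every event gets exactly the same lower and upper probability from both sets, even though one set of measures is a strict subset of the other; the constraint μ(blue) = μ(yellow) is invisible to the bounds.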




Reasoning About Uncertainty
ISBN: 0262582597
Year: 2005
Pages: 140