2.2 Probability Measures

Perhaps the best-known approach to getting a more fine-grained representation of uncertainty is probability. Most readers have probably seen probability before, so I do not go into great detail here. However, I do try to give enough of a review of probability to make the presentation completely self-contained. Even readers familiar with this material may want to scan it briefly, just to get used to the notation.

Suppose that the agent's uncertainty is represented by the set W = {w1, …, wn} of possible worlds. A probability measure assigns to each of the worlds in W a number—a probability—that can be thought of as describing the likelihood of that world being the actual world. In the die-tossing example, if each of the six outcomes is considered equally likely, then it seems reasonable to assign to each of the six worlds the same number. What number should this be?

For one thing, in practice, if a die is tossed repeatedly, each of the six outcomes occurs roughly 1/6 of the time. For another, the choice of 1/6 makes the sum 1; the reasons for this are discussed in the next paragraph. On the other hand, if the outcome of 1 seems much more likely than the others, w1 might be assigned probability 1/2, and all the other outcomes probability 1/10. Again, the sum here is 1.

Assuming that each elementary outcome is given probability 1/6, what probability should be assigned to the event of the die landing either 1 or 2, that is, to the set {w1, w2}? It seems reasonable to take the probability to be 1/3, the sum of the probability of landing 1 and the probability of landing 2. Thus, the probability of the whole space {w1, …, w6} is 1, the sum of the probabilities of all the possible outcomes. In probability theory, 1 is conventionally taken to denote certainty. Since it is certain that there will be some outcome, the probability of the whole space should be 1.
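
This sum rule is easy to mechanize. Here is a minimal Python sketch of the die example (the names worlds, mu, and prob are my own illustration, not from the text):

    from fractions import Fraction

    # Uniform probability on the six worlds of the die-tossing example.
    worlds = ["w1", "w2", "w3", "w4", "w5", "w6"]
    mu = {w: Fraction(1, 6) for w in worlds}

    def prob(event):
        """The probability of an event (a set of worlds) is the sum of
        the probabilities of the worlds in it."""
        return sum(mu[w] for w in event)

    print(prob({"w1", "w2"}))   # 1/3: the die lands either 1 or 2
    print(prob(set(worlds)))    # 1: the whole space, i.e., certainty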

In most of the examples in this book, all the subsets of a set W of worlds are assigned a probability. Nevertheless, there are good reasons, both technical and philosophical, for not requiring that a probability measure be defined on all subsets. If W is infinite, it may not be possible to assign a probability to all subsets in such a way that certain natural properties hold. (See the notes to this chapter for a few more details and references.) But even if W is finite, an agent may not be prepared to assign a numerical probability to all subsets. (See Section 2.3 for some examples.) For technical reasons, it is typically assumed that the set of subsets of W to which probability is assigned satisfies some closure properties. In particular, if a probability can be assigned to both U and V, then it is useful to be able to assume that a probability can also be assigned to U ∪ V and to the complement Ū.

Definition 2.2.1


An algebra ℱ over W is a set of subsets of W that contains W and is closed under union and complementation, so that if U and V are in ℱ, then so are U ∪ V and Ū. A σ-algebra is closed under complementation and countable union, so that if U1, U2, … are all in ℱ, then so is ∪i Ui.


Note that an algebra is also closed under intersection, since U ∩ V is the complement of Ū ∪ V̄. Clearly, if W is finite, every algebra is a σ-algebra.
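
For a finite W, these closure conditions can be checked by brute force. The following Python sketch is my own illustration (the function name and the example family are invented for the purpose):

    from itertools import combinations

    def is_algebra(W, F):
        """Check that F, a collection of frozensets, is an algebra over W:
        it contains W and is closed under complementation and pairwise
        union.  Closure under intersection then follows by De Morgan."""
        W = frozenset(W)
        return (W in F
                and all(W - U in F for U in F)
                and all(U | V in F for U, V in combinations(F, 2)))

    W = {1, 2, 3, 4}
    F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(W)}
    print(is_algebra(W, F))   # True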

These technical conditions are fairly natural; moreover, assuming that the domain of a probability measure is a σ-algebra is sufficient to deal with some of the mathematical difficulties mentioned earlier (again, see the notes). However, it is not clear why an agent should be willing or able to assign a probability to U ∪ V if she can assign a probability to each of U and V. This condition seems more reasonable if U and V are disjoint (which is all that is needed in many cases). Despite that, I assume that ℱ is an algebra, since it makes the technical presentation simpler; see the notes at the end of the chapter for more discussion of this issue.

A basic subset of ℱ is a minimal nonempty set in ℱ; that is, U is basic if (a) U ∈ ℱ and U ≠ ∅, and (b) U′ ⊂ U and U′ ∈ ℱ implies that U′ = ∅. (Note that I use ⊆ for subset and ⊂ for strict subset; thus, if U′ ⊂ U, then U′ ≠ U, while if U′ ⊆ U, then U′ and U may be equal.) It is not hard to show that, if W is finite, then every set in ℱ is the union of basic sets. This is no longer necessarily true if W is infinite (Exercise 2.1(a)). A basis for ℱ is a collection ℱ′ of sets such that every set in ℱ is the union of sets in ℱ′. If W is finite, the basic sets in ℱ form a basis for ℱ (Exercise 2.1(b)).
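
When W is finite, the basic sets can likewise be computed directly, as in this sketch (again my own illustration, reusing the small algebra from the previous example):

    def basic_sets(F):
        """The basic subsets of an algebra F: the nonempty sets in F
        whose only strict subset in F is the empty set."""
        nonempty = [U for U in F if U]
        return [U for U in nonempty
                if not any(V < U for V in nonempty)]   # V < U: strict subset

    W = frozenset({1, 2, 3, 4})
    F = [frozenset(), frozenset({1, 2}), frozenset({3, 4}), W]
    print(basic_sets(F))   # [frozenset({1, 2}), frozenset({3, 4})]
    # Every set in F, including W itself, is a union of these basic sets.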

The domain of a probability measure is an algebra over some set W. By convention, the range of a probability measure is the interval [0, 1]. (In general, [a, b] denotes the set of reals between a and b, including both a and b; that is, [a, b] = {x : a ≤ x ≤ b}.)

Definition 2.2.2


A probability space is a tuple (W, ℱ, μ), where ℱ is an algebra over W and μ : ℱ → [0, 1] satisfies the following two properties:

P1. μ(W) = 1.

P2. μ(U ∪ V) = μ(U) + μ(V) if U and V are disjoint elements of ℱ.


The sets in ℱ are called the measurable sets; μ is called a probability measure on W (or on ℱ, especially if ℱ ≠ 2^W). Notice that the arguments to μ are not elements of W but subsets of W. If the argument is a singleton subset {w}, I often abuse notation and write μ(w) rather than μ({w}). I occasionally omit the ℱ if ℱ = 2^W, writing just (W, μ). These conventions are also followed for the other notions of uncertainty introduced later in this chapter.
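
Definition 2.2.2 can be checked mechanically on small examples. The sketch below is my own illustration (the particular ℱ and μ are made up); it verifies P1 and P2 directly, relying on ℱ being an algebra so that U ∪ V is always available:

    from itertools import combinations

    def is_probability_measure(W, F, mu):
        """Check P1 (mu(W) = 1) and P2 (additivity on disjoint pairs)
        for mu defined on an algebra F over the finite set W."""
        if mu[frozenset(W)] != 1:
            return False                         # P1 fails
        for U, V in combinations(F, 2):
            if not (U & V):                      # U and V are disjoint
                if mu[U | V] != mu[U] + mu[V]:
                    return False                 # P2 fails
        return True

    W = frozenset({1, 2, 3, 4})
    F = [frozenset(), frozenset({1, 2}), frozenset({3, 4}), W]
    mu = {frozenset(): 0, frozenset({1, 2}): 0.25,
          frozenset({3, 4}): 0.75, W: 1}
    print(is_probability_measure(W, F, mu))   # True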

It follows from P1 and P2 that μ(∅) = 0. Since ∅ and W are disjoint,

    μ(W) = μ(W ∪ ∅) = μ(W) + μ(∅),

so μ(∅) = 0.

Although P2 applies only to pairs of sets, an easy induction argument shows that if U1, …, Uk are pairwise disjoint elements of ℱ, then

    μ(U1 ∪ … ∪ Uk) = μ(U1) + … + μ(Uk).

This property is known as finite additivity. It follows from finite additivity that if W is finite and ℱ consists of all subsets of W, then a probability measure can be characterized as a function μ : W → [0, 1] such that Σ_{w∈W} μ(w) = 1. That is, if ℱ = 2^W, then it suffices to define a probability measure μ only on the elements of W; it can then be uniquely extended to all subsets of W by taking μ(U) = Σ_{u∈U} μ(u). While the assumption that all sets are measurable is certainly an important special case (and is a standard assumption if W is finite), I have taken the more traditional approach of not requiring all sets to be measurable; this allows greater flexibility.
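
In code, the extension from point probabilities to all of 2^W is a one-line sum. A sketch of my own, reusing the biased-die numbers from earlier in the section (probability 1/2 for the outcome 1 and 1/10 for each of the others):

    from fractions import Fraction

    # Point probabilities for the biased die discussed earlier.
    point_mu = {1: Fraction(1, 2), **{k: Fraction(1, 10) for k in range(2, 7)}}

    def mu(U):
        """Extend the point probabilities to any subset U of W.
        Finite additivity holds automatically for this definition."""
        return sum(point_mu[w] for w in U)

    print(mu({1, 2}))             # 3/5
    print(mu(point_mu.keys()))    # 1, as P1 requires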

If W is infinite, it is typically required that ℱ be a σ-algebra and that μ be σ-additive or countably additive, so that if U1, U2, … are pairwise disjoint sets in ℱ, then μ(∪i Ui) = μ(U1) + μ(U2) + ⋯. For future reference, note that, in the presence of finite additivity, countable additivity is equivalent to the following "continuity" property:

    If U1 ⊆ U2 ⊆ … and ∪i Ui ∈ ℱ, then lim_{n→∞} μ(Un) = μ(∪i Ui)

(Exercise 2.2). This property can be expressed equivalently in terms of decreasing sequences of sets:

    If U1 ⊇ U2 ⊇ … and ∩i Ui ∈ ℱ, then lim_{n→∞} μ(Un) = μ(∩i Ui)

(Exercise 2.2). (Readers unfamiliar with limits can just ignore these continuity properties and all the ones discussed later; they do not play a significant role in the book.)

To see that these properties do not hold for finitely additive probability, let ℱ consist of all the finite and cofinite subsets of ℕ (ℕ denotes the natural numbers, {0, 1, 2, …}). A set is cofinite if it is the complement of a finite set. Thus, for example, {3, 4, 6, 7, 8, …} is cofinite, since its complement is {0, 1, 2, 5}. Define μ(U) to be 0 if U is finite and 1 if U is cofinite. It is easy to check that ℱ is an algebra and that μ is a finitely additive probability measure on ℱ (Exercise 2.3). But μ clearly does not satisfy any of the properties above. For example, if Un = {0, …, n}, then the sets Un increase to ℕ, but μ(Un) = 0 for all n, while μ(ℕ) = 1, so lim_{n→∞} μ(Un) = 0 ≠ 1 = μ(∪i Ui).
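
The counterexample can be made concrete by representing finite and cofinite sets explicitly. This is my own sketch, not the book's; the encoding of sets by tags is invented for illustration:

    # A finite or cofinite subset of N is represented by a tag plus the
    # finite set that pins it down: ('fin', S) is S; ('cofin', S) is N \ S.
    def mu(U):
        """0 on finite sets, 1 on cofinite sets: finitely additive,
        but not countably additive."""
        tag, _ = U
        return 0 if tag == 'fin' else 1

    # U_n = {0, ..., n} increases to N, yet mu(U_n) = 0 for every n.
    Us = [('fin', set(range(n + 1))) for n in range(1000)]
    print({mu(U) for U in Us})     # {0}: so the limit of mu(U_n) is 0
    print(mu(('cofin', set())))    # 1 = mu(N): continuity fails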

For most of this book, I focus on finite sample spaces, so I largely ignore the issue of whether probability is countably additive or only finitely additive.

2.2.1 Justifying Probability

If belief is quantified using probability, then it is important to explain what the numbers represent, where they come from, and why finite additivity is appropriate. Without such an explanation, it will not be clear how to assign probabilities in applications, nor how to interpret the results obtained by using probability.

The classical approach to applying probability, which goes back to the seventeenth and eighteenth centuries, is to reduce a situation to a number of elementary outcomes. A natural assumption, called the principle of indifference, is that all elementary outcomes are equally likely. Intuitively, in the absence of any other information, there is no reason to consider one more likely than another. Applying the principle of indifference, if there are n elementary outcomes, the probability of each one is 1/n; the probability of a set of k outcomes is k/n. Clearly this definition satisfies P1 and P2 (where W consists of all the elementary outcomes).

This is certainly the justification for ascribing to each of the six outcomes of the toss of a die a probability of 1/6. By using powerful techniques of combinatorics together with the principle of indifference, card players can compute the probability of getting various kinds of hands, and then use this information to guide their play of the game.
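
As a small instance of such a computation, the following sketch (my own, using Python's standard math.comb) computes the probability that a 5-card poker hand is four of a kind, assuming that all C(52, 5) hands are equally likely:

    from math import comb

    total = comb(52, 5)       # number of 5-card hands, all equally likely

    # Four of a kind: choose the rank (13 ways), take all four cards of
    # that rank, then choose any one of the 48 remaining cards.
    favorable = 13 * 48
    print(favorable / total)  # roughly 0.00024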

The principle of indifference is also typically applied to handle situations with statistical information. For example, if 40 percent of a doctor's patients are over 60, and a nurse informs the doctor that one of his patients is waiting for him in the waiting room, it seems reasonable for the doctor to say that the likelihood of that patient being over 60 is .4. Essentially what is going on here is that there is one possible world (i.e., basic outcome) for each of the possible patients who might be in the waiting room. If each of these worlds is equally probable, then the probability of the patient being over 60 will indeed be .4. (I return to the principle of indifference and the relationship between statistical information and probability in Chapter 11.)

While taking possible worlds to be equally probable is a very compelling intuition, the trouble with the principle of indifference is that it is not always obvious how to reduce a situation to elementary outcomes that seem equally likely. This is a significant concern, because different choices of elementary outcomes will in general lead to different answers. For example, in computing the probability that a couple with two children has two boys, the most obvious way of applying the principle of indifference would suggest that the answer is 1/3. After all, the two children could be either (1) two boys, (2) two girls, or (3) a boy and a girl. If all these outcomes are equally likely, then the probability of having two boys is 1/3.

There is, however, another way of applying the principle of indifference, by taking the elementary outcomes to be (B, B), (B, G), (G, B), and (G, G): (1) both children are boys, (2) the first child is a boy and the second a girl, (3) the first child is a girl and the second a boy, and (4) both children are girls. Applying the principle of indifference to this description of the elementary outcomes gives a probability of 1/4 of having two boys.

The latter answer accords better with observed frequencies, and there are compelling general reasons to consider the second approach better than the first for constructing the set of possible outcomes. But in many other cases, it is far from obvious how to choose the elementary outcomes. What makes one choice right and another one wrong?
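
A quick simulation shows why the second sample space is the one that matches observed frequencies, assuming each child is a boy or a girl independently with probability 1/2 (a sketch of my own):

    import random

    random.seed(0)
    trials = 100_000
    two_boys = sum(random.choice("BG") == "B" and random.choice("BG") == "B"
                   for _ in range(trials))
    print(two_boys / trials)   # close to 1/4, not 1/3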

Even in cases where there seem to be some obvious choices for the elementary outcomes, it is far from clear that they should be equally likely. For example, consider a biased coin. It still seems reasonable to take the elementary outcomes to be heads and tails, just as with a fair coin, but it certainly is no longer appropriate to assign each of these outcomes probability 1/2 if the coin is biased. What are the "equally likely" outcomes in that case? Even worse difficulties arise in trying to assign a probability to the event that a particular nuclear power plant will have a meltdown. What should the set of possible events be in that case, and why should they be equally likely?

In light of these problems, philosophers and probabilists have tried to find ways of viewing probability that do not depend on assigning elementary outcomes equal likelihood. Perhaps the two most common views are that (1) the numbers represent relative frequencies, and (2) the numbers reflect subjective assessments of likelihood.

The intuition behind the relative-frequency interpretation is easy to explain. The justification usually given for saying that the probability that a coin lands heads is 1/2 is that if the coin is tossed sufficiently often, roughly half the time it will land heads. Similarly, a typical justification for saying that the probability that a coin has bias .6 (where the bias of a coin is the probability that it lands heads) is that it lands heads roughly 60 percent of the time when it is tossed sufficiently often.

While this interpretation seems quite natural and intuitive, and certainly has been used successfully by the insurance industry and the gambling industry to make significant amounts of money, it has its problems. The informal definition said that the probability of the coin landing heads is .6 if "roughly" 60 percent of the time it lands heads, when it is tossed "sufficiently often." But what do "roughly" and "sufficiently often" mean? It is notoriously difficult to make these notions precise. How many times must the coin be tossed for it to be tossed "sufficiently often"? Is it 100 times? 1,000 times? 1,000,000 times? And what exactly does "roughly half the time" mean? It certainly does not mean "exactly half the time." If the coin is tossed an odd number of times, it cannot land heads exactly half the time. And even if it is tossed an even number of times, it is rather unlikely that it will land heads exactly half of those times.

To make matters worse, to assign a probability to an event such as "the nuclear power plant will have a meltdown in the next five years", it is hard to think in terms of relative frequency. While it is easy to imagine tossing a coin repeatedly, it is somewhat harder to capture the sequence of events that lead to a nuclear meltdown and imagine them happening repeatedly.

Many attempts have been made to deal with these problems, perhaps the most successful being that of von Mises. It is beyond the scope of this book to discuss these attempts, however. The main message that the reader should derive is that, while the intuition behind relative frequency is a very powerful one (and is certainly a compelling justification for the use of probability in some cases), it is quite difficult (some would argue impossible) to extend it to all cases where probability is applied.

Despite these concerns, in many simple settings, it is straightforward to apply the relative-frequency interpretation. If N is fixed and an experiment is repeated N times, then the probability of an event U is taken to be the fraction of the N times U occurred. It is easy to see that the relative-frequency interpretation of probability satisfies the additivity property P2. Moreover, it is closely related to the intuition behind the principle of indifference. In the case of a coin, roughly speaking, the possible worlds now become the outcomes of the N coin tosses. If the coin is fair, then roughly half of the outcomes should be heads and half should be tails. If the coin is biased, the fraction of outcomes that are heads should reflect the bias. That is, taking the basic outcomes to be the results of tossing the coin N times, the principle of indifference leads to roughly the same probability as the relative-frequency interpretation.
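
To see how the choice of N enters, here is a small simulation of the relative-frequency interpretation for a coin of bias .6 (my own sketch; the bias and the sample sizes are made up for illustration):

    import random

    random.seed(0)

    def relative_frequency(bias, N):
        """Toss a coin of the given bias N times and return the fraction
        of heads, the relative-frequency estimate of the probability."""
        return sum(random.random() < bias for _ in range(N)) / N

    for N in (10, 1000, 100_000):
        print(N, relative_frequency(0.6, N))
    # The fraction is only "roughly" .6, and how rough depends on N,
    # which is exactly the vagueness discussed above.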

The relative-frequency interpretation takes probability to be an objective property of a situation. The (extreme) subjective viewpoint argues that there is no such thing as an objective notion of probability; probability is a number assigned by an individual representing his or her subjective assessment of likelihood. Any choice of numbers is all right, as long as it satisfies P1 and P2. But why should the assignment of numbers even obey P1 and P2?

There have been various attempts to argue that it should. The most famous of these arguments, due to Ramsey, is in terms of betting behavior. I discuss a variant of Ramsey's argument here. Given a set W of possible worlds and a subset U ⊆ W, consider an agent who can evaluate bets of the form "If U happens (i.e., if the actual world is in U), then I win $100(1 − α), while if U doesn't happen, then I lose $100α," for 0 ≤ α ≤ 1. Denote such a bet as (U, α). The bet (Ū, 1 − α) is called the complementary bet to (U, α); by definition, (Ū, 1 − α) denotes the bet where the agent wins $100α if Ū happens (i.e., if U does not happen) and loses $100(1 − α) if U happens.

Note that (U, 0) is a "can't lose" proposition for the agent. She wins $100 if U is the case and loses 0 if it is not. The bet becomes less and less attractive as α gets larger; she wins less if U is the case and loses more if it is not. The worst case is if α = 1. (U, 1) is a "can't win" proposition; she wins nothing if U is true and loses $100 if it is false. By way of contrast, the bet (Ū, 1 − α) is a can't-lose proposition if α = 1 and becomes less and less attractive as α approaches 0.

Now suppose that the agent must choose between the complementary bets (U, α) and (Ū, 1 − α). Which she prefers clearly depends on α. Actually, I assume that the agent may have to choose, not just between individual bets, but between sets of bets. More generally, I assume that the agent has a preference order defined on sets of bets. "Prefers" here should be interpreted as meaning "at least as good as," not "strictly preferable to." Thus, for example, an agent prefers a set of bets to itself. I do not assume that all sets of bets are comparable. However, it follows from the rationality postulates that I am about to present that certain sets of bets are comparable. The postulates focus on the agent's preferences between two sets of the form {(U1, α1), …, (Uk, αk)} and {(Ū1, 1 − α1), …, (Ūk, 1 − αk)}. These are said to be complementary sets of bets. For singleton sets, I often omit set braces. I write B1 ⪰ B2 if the agent prefers the set B1 of bets to the set B2, and B1 ≻ B2 if B1 ⪰ B2 and it is not the case that B2 ⪰ B1.

Define an agent to be rational if she satisfies the following four properties:

RAT1. If the set B1 of bets is guaranteed to give at least as much money as B2, then B1 ⪰ B2; if B1 is guaranteed to give more money than B2, then B1 ≻ B2.

By "guaranteed to give at least as much money" here, I mean that no matter what happens, the agent does at least as well with B1 as with B2. This is perhaps best understood if B1 consists of just (U, α) and B2 consists of just (V, β). There are then four cases to consider: the world is in U V, U V, U V, or U V. For (U, α) to be guaranteed to give at least as much money as (V, β), the following three conditions must hold:

  • If U ∩ V ≠ ∅, it must be the case that α ≤ β. For if w ∈ U ∩ V, then in world w, the agent wins 100(1 − α) with the bet (U, α) and wins 100(1 − β) with the bet (V, β). Thus, for (U, α) to give at least as much money as (V, β) in w, it must be the case that 100(1 − α) ≥ 100(1 − β), that is, α ≤ β.

  • If Ū ∩ V ≠ ∅, then α = 0 and β = 1. (In such a world, the agent loses 100α with (U, α) and wins 100(1 − β) with (V, β); −100α ≥ 100(1 − β) is possible only if α = 0 and β = 1.)

  • If Ū ∩ V̄ ≠ ∅, then α ≤ β. (In such a world, the agent loses 100α with (U, α) and loses 100β with (V, β), and −100α ≥ −100β if and only if α ≤ β.)

Note that there is no condition corresponding to U ∩ V̄ ≠ ∅, for if w ∈ U ∩ V̄ then, in w, the agent is guaranteed not to lose with (U, α) and not to win with (V, β). In any case, note that it follows from these conditions that (U, α) ≻ (U, α′) if and only if α < α′. This should seem reasonable.

If B1 and B2 are sets of bets, then the meaning of "B1 is guaranteed to give at least as much money as B2" is similar in spirit. Now, for each world w, the sum of the payoffs of the bets in B1 at w must be at least as large as the sum of the payoffs of the bets in B2. I leave it to the reader to define "B1 is guaranteed to give more money than B2."

The second rationality condition says that preferences are transitive.

RAT2. Preferences are transitive, so that if B1 ⪰ B2 and B2 ⪰ B3, then B1 ⪰ B3.

While transitivity seems reasonable, it is worth observing that transitivity of preferences often does not seem to hold in practice.

In any case, by RAT1, (U, α) ≻ (Ū, 1 − α) if α = 0, and (Ū, 1 − α) ≻ (U, α) if α = 1. By RAT1 and RAT2, if (U, α) ⪰ (Ū, 1 − α), then (U, α′) ≻ (Ū, 1 − α′) for all α′ < α. (It clearly follows from RAT1 and RAT2 that (U, α′) ⪰ (Ū, 1 − α′). But if it is not the case that (U, α′) ≻ (Ū, 1 − α′), then (Ū, 1 − α′) ⪰ (U, α′). Now applying RAT1 and RAT2, together with the fact that (U, α) ⪰ (Ū, 1 − α), yields (Ū, 1 − α′) ⪰ (Ū, 1 − α), which contradicts RAT1.) Similarly, if (Ū, 1 − β) ⪰ (U, β), then (Ū, 1 − β′) ≻ (U, β′) for all β′ > β.

The third assumption says that the agent can always compare complementary bets.

RAT3. Either (U, α) ⪰ (Ū, 1 − α) or (Ū, 1 − α) ⪰ (U, α).

Since "" means "considers at least as good as", it is possible that both (U, α) (U, 1 α) and (U, 1 α) (U, α) hold. Note that I do not presume that all sets of bets are comparable. RAT3 says only that complementary bets are comparable. While RAT3 is not unreasonable, it is certainly not vacuous. One could instead imagine an agent who had numbers α1 < α2 such that (U, α) (U, 1 α) for α < α1 and (U, 1 α) (U, α) for α > α2, but in the interval between α1 and α2, the agent wasn't sure which of the complementary bets was preferable. (Note that "incomparable" here does not mean "equivalent.") This certainly doesn't seem so irrational.

The fourth and last rationality condition says that preferences are determined pointwise.

RAT4. If (Ui, αi) ⪰ (Vi, βi) for i = 1, …, k, then {(U1, α1), …, (Uk, αk)} ⪰ {(V1, β1), …, (Vk, βk)}.

While RAT4 may seem reasonable, again there are subtleties. For example, compare the bets (W, 1) and (U, .01), where U is, intuitively, an unlikely event. The bet (W, 1) is the "break-even" bet: the agent wins 0 if W happens (which will always be the case) and loses $100 if ∅ happens (i.e., if w ∈ ∅). The bet (U, .01) can be viewed as a lottery: if U happens (which is very unlikely), the agent wins $99, while if U does not happen, then the agent loses $1. The agent might reasonably decide that she is willing to pay $1 for a small chance to win $99. That is, (U, .01) ≻ (W, 1). On the other hand, consider the collection B1 consisting of 1,000,000 copies of (W, 1) compared to the collection B2 consisting of 1,000,000 copies of (U, .01). According to RAT4, B2 ⪰ B1. But the agent might not feel that she can afford to pay $1,000,000 for a small chance to win $99,000,000.

These rationality postulates make it possible to associate with each set U a number αU, which intuitively represents the probability of U. It follows from RAT1 that (U, 0) ⪰ (Ū, 1). As observed earlier, (U, α) gets less attractive as α gets larger, and (Ū, 1 − α) gets more attractive as α gets larger. Since, by RAT1, (U, 0) ⪰ (Ū, 1), it easily follows that there is some point α* at which, roughly speaking, (U, α*) and (Ū, 1 − α*) are in balance. I take αU to be α*.

I need a few more definitions to make this precise. Given a set X of real numbers, let sup X, the supremum (or just sup) of X, be the least upper bound of X—the smallest real number that is at least as large as all the elements in X. That is, sup X = α if x ≤ α for all x ∈ X and if, for all α′ < α, there is some x ∈ X such that x > α′. For example, if X = {1/2, 3/4, 7/8, 15/16, …}, then sup X = 1. Similarly, inf X, the infimum (or just inf) of X, is the greatest lower bound of X—the largest real number that is less than or equal to every element in X. The sup of a set may be ∞; for example, the sup of {1, 2, 3, …} is ∞. Similarly, the inf of a set may be −∞. However, if X is bounded (as will be the case for all the sets to which sup and inf are applied in this book), then sup X and inf X are both finite.

Let αU = sup{β : (U, β) ⪰ (Ū, 1 − β)}. It is not hard to show that if an agent satisfies RAT1–3, then (U, α) ≻ (Ū, 1 − α) for all α < αU and (Ū, 1 − α) ≻ (U, α) for all α > αU (Exercise 2.5). It is not clear what happens at αU; the agent's preferences could go either way. (Actually, with one more natural assumption, the agent is indifferent between (U, αU) and (Ū, 1 − αU); see Exercise 2.6.)

Intuitively, αU is a measure of the likelihood (according to the agent) of U. The more likely she thinks U is, the higher αU should be. If she thinks that U is certainly the case (i.e., if she is certain that the actual world is in U), then αU should be 1. That is, if she feels that U is certain, then for any α > 0, it should be the case that (U, α) ≻ (Ū, 1 − α), since she feels that with (U, α) she is guaranteed to win $100(1 − α), while with (Ū, 1 − α) she is guaranteed to lose the same amount.

Similarly, if she is certain that U is not the case, then αU should be 0. More significantly, it can be shown that if U1 and U2 are disjoint sets, then a rational agent should take αU1∪U2 = αU1 + αU2. More precisely, as is shown in Exercise 2.5, if αU1∪U2 ≠ αU1 + αU2, then there is a set B1 of bets such that the agent prefers B1 to the complementary set B2, yet the agent is guaranteed to lose money with B1 and guaranteed to win money with B2, thus contradicting RAT1. (In the literature, such a collection B1 is called a Dutch book. Of course, this is not a literary book, but a book as in "bookie" or "bookmaker.") It follows from all this that if μ(U) is defined as αU, then μ is a probability measure.
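
To make the Dutch book concrete, suppose U1 and U2 are disjoint but the agent incoherently takes αU1 = αU2 = 0.4 and αU1∪U2 = 0.9. The sketch below (my own illustration, with these made-up numbers) evaluates a set of bets that the agent prefers to its complementary set, and shows that it loses in every world:

    # The bet (U, a) pays 100*(1 - a) if U happens and -100*a otherwise.
    def payoff(happens, a):
        return 100 * (1 - a) if happens else -100 * a

    # With alpha(U1) = alpha(U2) = 0.4 and alpha(U1 u U2) = 0.9, the agent
    # prefers each bet below to its complement (0.89 < 0.9 and 0.41 > 0.4),
    # so by RAT4 she prefers the whole set; yet it is a sure loss.
    for world in ("in U1", "in U2", "outside both"):
        in_u1, in_u2 = world == "in U1", world == "in U2"
        total = (payoff(in_u1 or in_u2, 0.89)    # (U1 u U2, 0.89)
                 + payoff(not in_u1, 0.59)       # (complement of U1, 0.59)
                 + payoff(not in_u2, 0.59))      # (complement of U2, 0.59)
        print(world, round(total, 2))            # -7.0 in every world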

This discussion is summarized by the following theorem:

Theorem 2.2.3


If an agent satisfies RAT1–4, then for each subset U of W, a number αU exists such that (U, α) ≻ (Ū, 1 − α) for all α < αU and (Ū, 1 − α) ≻ (U, α) for all α > αU. Moreover, the function defined by μ(U) = αU is a probability measure.


Proof See Exercise 2.5.

Theorem 2.2.3 has been viewed as a compelling argument that if an agent's preferences can be expressed numerically, then they should obey the rules of probability. However, Theorem 2.2.3 depends critically on the assumptions RAT1–4. The degree to which the argument is compelling depends largely on how reasonable these assumptions of rationality seem. That, of course, is in the eye of the beholder.

It might also seem worrisome that the subjective probability interpretation puts no constraints on the agent's subjective likelihood other than the requirement that it obey the laws of probability. In the case of tossing a fair die, for example, taking each outcome to be equally likely seems "right." It may seem unreasonable for someone who subscribes to the subjective point of view to be able to put probability .8 on the die landing 1, and probability .04 on each of the other five possible outcomes. More generally, when it seems that the principle of indifference is applicable or if detailed frequency information is available, should the subjective probability take this into account? The standard responses to this concern are (1) indeed frequency information and the principle of indifference should be taken into account, when appropriate, and (2) even if they are not taken into account, all choices of initial subjective probability will eventually converge to the same probability measure as more information is received; the measure that they converge to will in some sense be the "right" one (see Example 3.2.2).

Different readers will probably have different feelings as to how compelling these and other defenses of probability really are. However, the fact that philosophers have come up with a number of independent justifications for probability is certainly a strong point in its favor. Much more effort has gone into justifying probability than any other approach for representing uncertainty. Time will tell if equally compelling justifications can be given for other approaches. In any case, there is no question that probability is currently the most widely accepted and widely used approach to representing uncertainty.



