4.1 Probabilistic Independence

What exactly does it mean that two events are independent? Intuitively, it means that they have nothing to do with each other—they are totally unrelated; the occurrence of one has no influence on the other. Suppose that two different coins are tossed. Most people would view the outcomes as independent. The fact that the first coin lands heads should not affect the outcome of the second coin (although it is certainly possible to imagine a complicated setup whereby they are not independent). What about tossing the same coin twice? Is the second toss independent of the first? Most people would agree that it is (although see Example 4.2.1). (Having said that, in practice, after a run of nine heads of a fair coin, many people also believe that the coin is "due" to land tails, although this is incompatible with the coin tosses being independent. If they were independent, then the outcome of the first nine coin tosses would have no effect on the tenth toss.)

In any case, whatever it may mean that two events are "independent", it should be clear that none of the representations of uncertainty considered so far can express the notion of unrelatedness directly. The best they can hope to do is to capture the "footprint" of independence, in a sense that will be made more precise. In this section, I consider this issue in the context of probability. In Section 4.3 I discuss independence for other representations of uncertainty.

Certainly if U and V are independent or unrelated, then learning U should not affect the probability of V and learning V should not affect the probability of U. This suggests that the fact that U and V are probabilistically independent (with respect to probability measure μ) can be expressed as μ(U | V) = μ(U) and μ(V | U) = μ(V). There is a technical problem with this definition. What happens if μ(V) = 0? In that case μ(U | V) is undefined. Similarly, if μ(U) = 0, then μ(V | U) is undefined. (This problem can be avoided by using conditional probability measures. I return to this point later but, for now, I assume that μ is an unconditional probability measure.) It is conventional to say that, in this case, U and V are still independent. This leads to the following formal definition:

Definition 4.1.1

start example

U and V are probabilistically independent (with respect to probability measure μ) if μ(V) ≠ 0 implies μ(U | V) = μ(U) and μ(U) ≠ 0 implies μ(V | U) = μ(V).

end example
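To see the definition in action, it can be checked directly on a finite space. The following sketch (the helper names `prob` and `independent` are mine, not from the text; the measure is represented as a dict from worlds to probabilities and events as sets of worlds) tests both clauses of Definition 4.1.1 on the two-coin example from the start of the section:

```python
# Sketch: checking Definition 4.1.1 on a finite space.
# A measure is a dict mapping worlds to probabilities; events are sets of worlds.

def prob(mu, event):
    """mu(event) for a finite measure mu."""
    return sum(p for w, p in mu.items() if w in event)

def independent(mu, u, v, tol=1e-12):
    """U and V are independent iff mu(V) != 0 implies mu(U | V) = mu(U)
    and mu(U) != 0 implies mu(V | U) = mu(V)."""
    pu, pv = prob(mu, u), prob(mu, v)
    puv = prob(mu, u & v)
    clause1 = pv == 0 or abs(puv / pv - pu) < tol   # mu(U | V) = mu(U)
    clause2 = pu == 0 or abs(puv / pu - pv) < tol   # mu(V | U) = mu(V)
    return clause1 and clause2

# Two fair coin tosses: worlds are (first, second) outcomes, all equally likely.
mu = {(a, b): 0.25 for a in "HT" for b in "HT"}
first_heads = {(a, b) for (a, b) in mu if a == "H"}
second_heads = {(a, b) for (a, b) in mu if b == "H"}
print(independent(mu, first_heads, second_heads))  # True
```

As expected, the outcome of the first toss is independent of the second under the uniform measure.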

Definition 4.1.1 is not the definition of independence that one usually sees in textbooks, which is that U and V are independent if μ(U ∩ V) = μ(U)μ(V), but it turns out to be equivalent to the more standard definition.

Proposition 4.1.2

start example

The following are equivalent:

  (a) μ(U) ≠ 0 implies μ(V | U) = μ(V),

  (b) μ(U ∩ V) = μ(U)μ(V),

  (c) μ(V) ≠ 0 implies μ(U | V) = μ(U).

end example

Proof I show that (a) and (b) are equivalent. First, suppose that (a) holds. If μ(U) = 0, then clearly μ(U ∩ V) = 0 and μ(U)μ(V) = 0, so μ(U ∩ V) = μ(U)μ(V). If μ(U) ≠ 0, then μ(V | U) = μ(U ∩ V)/μ(U), so if μ(V | U) = μ(V), simple algebraic manipulation shows that μ(U ∩ V) = μ(U)μ(V). For the converse, if μ(U ∩ V) = μ(U)μ(V) and μ(U) ≠ 0, then μ(V) = μ(U ∩ V)/μ(U) = μ(V | U). This shows that (a) and (b) are equivalent. A symmetric argument shows that (b) and (c) are equivalent.
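The equivalence can also be spot-checked numerically. A minimal sketch (the helper names are mine; exact rational arithmetic via `fractions.Fraction` avoids floating-point tolerance issues, and zero-weight worlds are allowed so the μ(U) = 0 clauses get exercised):

```python
import random
from fractions import Fraction

def prob(mu, event):
    return sum(mu[w] for w in event)

def clauses(mu, u, v):
    """Truth values of (a), (b), (c) from Proposition 4.1.2, in exact arithmetic."""
    pu, pv, puv = prob(mu, u), prob(mu, v), prob(mu, u & v)
    a = pu == 0 or puv / pu == pv   # mu(U) != 0 implies mu(V | U) = mu(V)
    b = puv == pu * pv              # mu(U ∩ V) = mu(U)mu(V)
    c = pv == 0 or puv / pv == pu   # mu(V) != 0 implies mu(U | V) = mu(U)
    return a, b, c

random.seed(0)
worlds = set(range(6))
for _ in range(1000):
    weights = {w: Fraction(random.randint(0, 5)) for w in worlds}
    total = sum(weights.values())
    if total == 0:
        continue
    mu = {w: x / total for w, x in weights.items()}
    u = {w for w in worlds if random.random() < 0.5}
    v = {w for w in worlds if random.random() < 0.5}
    a, b, c = clauses(mu, u, v)
    assert a == b == c, (mu, u, v)  # the three conditions always agree
print("all random checks passed")
```

Of course, passing random checks is no substitute for the proof above; the sketch just illustrates that (a), (b), and (c) stand or fall together on arbitrary finite measures.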

Note that Proposition 4.1.2 shows that I could have simplified Definition 4.1.1 by just using one of the clauses, say, μ(U) ≠ 0 implies μ(V | U) = μ(V), and omitting the other one. While it is true that one clause could be omitted in the definition of probabilistic independence, this will not necessarily be true for independence with respect to other notions of uncertainty; thus I stick to the more redundant definition.

The convention of declaring U and V independent whenever μ(U) = 0 or μ(V) = 0 leads to some counterintuitive conclusions if μ(U) is in fact 0. For example, if μ(U) = 0, then U is independent of itself. But U is certainly not unrelated to itself. This shows that the definition of probabilistic independence does not completely correspond to the informal intuition of independence as unrelatedness.
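A quick check makes the point concrete (a trivial sketch; the function name is mine):

```python
# Under the standard product criterion mu(U ∩ V) = mu(U)mu(V), an event U
# with mu(U) = 0 comes out independent of itself, since
# mu(U ∩ U) = mu(U) = 0 = mu(U) * mu(U).

def product_independent(p_u, p_v, p_uv):
    """The textbook criterion: mu(U ∩ V) = mu(U)mu(V)."""
    return p_uv == p_u * p_v

# Taking V = U with mu(U) = 0, so that mu(U ∩ U) = mu(U) = 0:
print(product_independent(0.0, 0.0, 0.0))  # True: U is "independent" of itself
```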

To some extent it may appear that this problem can be avoided using conditional probability measures. In that case, the problem of conditioning on a set of probability 0 does not arise. Thus, Definition 4.1.1 can be simplified for conditional probability measures as follows:

Definition 4.1.3

start example

U and V are probabilistically independent (with respect to conditional probability space (W, ℱ, ℱ′, μ)) if V ∈ ℱ′ implies μ(U | V) = μ(U) and U ∈ ℱ′ implies μ(V | U) = μ(V).

end example

Note that Proposition 4.1.2 continues to hold for conditional probability measures (Exercise 4.1). It follows immediately that if both μ(U) ≠ 0 and μ(V) ≠ 0, then U and V are independent iff μ(U ∩ V) = μ(U)μ(V) (Exercise 4.2). Even if μ(U) = 0 or μ(V) = 0, the independence of U and V with respect to the conditional probability measure μ implies that μ(U ∩ V) = μ(U)μ(V) (Exercise 4.2), but the converse does not necessarily hold, as the following example shows:

Example 4.1.4

start example

Consider the conditional probability measure μ_0^s defined in Example 3.2.4. Let U = {w1, w3} and V = {w2, w3}. Recall that w1 is much more likely than w2, which in turn is much more likely than w3. It is not hard to check that μ_0^s(U | V) = 0 and μ_0^s(U) = 1, so U and V are not independent according to Definition 4.1.3. On the other hand, μ_0^s(U)μ_0^s(V) = μ_0^s(U ∩ V) = 0. Moreover, μ_0^s(V) = μ_0^s(V | U) = 0, which shows that both conjuncts of Definition 4.1.3 are necessary; in general, omitting either one results in a different definition of independence.

end example

Essentially, conditional probability measures can be viewed as ignoring information about negligibly small sets when it is not significant. With this viewpoint, the fact that μ_0^s(U)μ_0^s(V) = μ_0^s(U ∩ V) and μ_0^s(V | U) = μ_0^s(V) can be understood as saying that the difference between μ_0^s(U)μ_0^s(V) and μ_0^s(U ∩ V) is negligible, as is the difference between μ_0^s(V | U) and μ_0^s(V). However, it does not follow that the difference between μ_0^s(U | V) and μ_0^s(U) is negligible; indeed, this difference is as large as possible. This interpretation can be made precise by considering the nonstandard probability measure μ_0^ns from which μ_0^s is derived (see Example 3.2.4). Recall that μ_0^ns(w1) = 1 − ε − ε², μ_0^ns(w2) = ε, and μ_0^ns(w3) = ε². Thus, μ_0^ns(V | U) = ε²/(1 − ε) and μ_0^ns(V) = ε + ε². The closest real number to both ε²/(1 − ε) and ε + ε² is 0 (they are both infinitesimals, since ε²/(1 − ε) < 2ε²), which is why μ_0^s(V | U) = μ_0^s(V) = 0. Nevertheless, μ_0^ns(V | U) is much smaller than μ_0^ns(V). This information is ignored by μ_0^s; it treats the difference as negligible, so μ_0^s(V | U) = μ_0^s(V).
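The arithmetic of Example 4.1.4 can be reproduced concretely by letting a small rational stand in for the infinitesimal ε (a sketch only, not a true nonstandard model; variable names are mine):

```python
from fractions import Fraction

eps = Fraction(1, 1000)  # a concrete stand-in for the infinitesimal epsilon

# The nonstandard measure mu_0^ns from Example 3.2.4:
mu_ns = {"w1": 1 - eps - eps**2, "w2": eps, "w3": eps**2}

def prob(mu, event):
    return sum(mu[w] for w in event)

U = {"w1", "w3"}
V = {"w2", "w3"}

p_u = prob(mu_ns, U)        # 1 - eps
p_v = prob(mu_ns, V)        # eps + eps^2
p_uv = prob(mu_ns, U & V)   # eps^2, since U ∩ V = {w3}

print(p_uv / p_u == eps**2 / (1 - eps))   # mu_ns(V | U) = eps^2/(1 - eps): True
print(p_v == eps + eps**2)                # mu_ns(V) = eps + eps^2: True

# Both quantities are tiny, so both "standard parts" round to 0, even though
# mu_ns(V | U) is smaller than mu_ns(V) by a factor of order eps:
print(float(p_uv / p_u), float(p_v))

# By contrast, mu_ns(U | V) = eps/(1 + eps) has standard part 0 while
# mu_0^s(U) = 1, which is why U and V fail to be independent:
print(p_uv / p_v == eps / (1 + eps))      # True
```

Exact rational arithmetic makes the identities hold with equality; the "standard part" step is then just discarding the eps-sized terms.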

Reasoning About Uncertainty
ISBN: 0262582597
Year: 2005
Pages: 140