3.11 Relative Entropy

Jeffrey's Rule deals only with the special case of observations that lead to degrees of support for some partition of W. What about the more general case, where there is less information? For example, what if the observation says that μ({blue, green}) is at least .7, rather than exactly .7? And what if there is information regarding overlapping sets, not just disjoint sets? For example, suppose that an observation leads an agent to believe that μ({blue, green}) = .7 and μ({green, yellow}) = .4. What then? (Note that Dempster's Rule of Combination can deal with the latter observation, but not the former.)

The standard intuition here is that if an agent's initial uncertainty is characterized by a probability measure μ and the agent makes an observation that is characterized by some constraints (such as μ({blue, green}) ≥ .7), then the agent should make the "minimal change" to his beliefs required to accommodate the observation. One way of capturing the notion of minimal change is to have the agent's updated measure of belief be that probability measure μ′ that is "closest" to μ and satisfies the constraints. Of course, the choice of μ′ will then depend on how "closeness" is defined.

One measure of distance is called the variation distance. Define V(μ, μ′), the variation distance from μ to μ′, to be sup_{U⊆W} |μ(U) − μ′(U)|. That is, the variation distance is the largest amount of disagreement between μ and μ′ on the probability of some set. It can be shown that V(μ, μ′) = (1/2) Σ_{w∈W} |μ(w) − μ′(w)| (Exercise 3.43). (I am assuming in this section that W is finite; if W is infinite, then the summation needs to be replaced by integration.) The variation distance is what mathematicians call a metric—it has the properties that are normally associated with a measure of distance. In particular, V(μ, μ′) ≥ 0, V(μ, μ′) = 0 iff μ = μ′, and V(μ, μ′) = V(μ′, μ), so that distances are nonnegative, μ is the unique closest measure to itself, and the distance from μ to μ′ is the same as the distance from μ′ to μ.
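
To make the definition concrete, here is a small illustrative Python sketch; the measures on the colour worlds and the helper names are invented for the example, not taken from the text. It computes V(μ, μ′) directly from the definition and checks it against the half-sum identity above.

    # Illustrative sketch: variation distance on a finite space W.
    # The worlds and numbers below are invented for the example.
    from itertools import combinations

    W = ["blue", "green", "red", "yellow"]
    mu = {"blue": 0.4, "green": 0.3, "red": 0.2, "yellow": 0.1}
    nu = {"blue": 0.5, "green": 0.2, "red": 0.2, "yellow": 0.1}

    def variation_distance(p, q, worlds):
        """sup over all subsets U of worlds of |p(U) - q(U)|."""
        best = 0.0
        for r in range(len(worlds) + 1):
            for U in combinations(worlds, r):
                best = max(best, abs(sum(p[w] for w in U) - sum(q[w] for w in U)))
        return best

    # Check the identity V(mu, nu) = (1/2) * sum over w of |mu(w) - nu(w)|.
    v_sup = variation_distance(mu, nu, W)
    v_half_sum = 0.5 * sum(abs(mu[w] - nu[w]) for w in W)
    print(v_sup, v_half_sum)  # the two values agree (0.1 here)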

Given some constraints C and a measure μ, consider the probability measure μ′ that is closest to μ in terms of variation distance and satisfies the constraints. There is a precise sense in which Jeffrey's Rule (and standard conditioning, which is an instance of it) can be viewed as a special case of minimizing variation distance. That is, μ|(α1U1; …; αnUn) is one of the probability measures closest to μ among all measures μ′ such that μ′(Ui) = αi, for i = 1, …, n.

Proposition 3.11.1

Suppose that U1, …, Un is a partition of W and α1 + ⋯ + αn = 1. Let C = {μ′ : μ′(Ui) = αi, i = 1, …, n}. If α1U1; …; αnUn is consistent with μ, then V(μ, μ′′) ≥ V(μ, μ|(α1U1; …; αnUn)) for all μ′′ ∈ C.

Proof See Exercise 3.44.

Although the variation distance does support the use of Jeffrey's Rule and conditioning, it does not uniquely pick them out. There are in fact many measures that minimize the variation distance besides the one that results from the use of Jeffrey's Rule (Exercise 3.44).
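
The following sketch, again with invented numbers and helper names, computes the Jeffrey update for the constraints μ′({blue, green}) = .7 and μ′({red, yellow}) = .3 and checks that it is no farther from μ, in variation distance, than another arbitrarily chosen measure satisfying the same constraints.

    # Illustrative sketch: Jeffrey's Rule as a minimal-change update.
    # The prior mu and the constraint values are invented for the example.
    mu = {"blue": 0.25, "green": 0.25, "red": 0.3, "yellow": 0.2}
    blocks = [({"blue", "green"}, 0.7), ({"red", "yellow"}, 0.3)]

    def jeffrey_update(p, blocks):
        """mu|(a1*U1; ...; an*Un): give each block Ui total mass ai,
        distributed within Ui in proportion to p."""
        updated = {}
        for U, alpha in blocks:
            mass = sum(p[w] for w in U)  # p(Ui), assumed positive
            for w in U:
                updated[w] = alpha * p[w] / mass
        return updated

    def vdist(p, q):
        # variation distance, computed via the half-sum identity above
        return 0.5 * sum(abs(p[w] - q[w]) for w in p)

    mu_j = jeffrey_update(mu, blocks)
    # Another measure satisfying mu'({blue, green}) = .7 and mu'({red, yellow}) = .3:
    other = {"blue": 0.7, "green": 0.0, "red": 0.3, "yellow": 0.0}
    print(vdist(mu, mu_j), vdist(mu, other))  # 0.2 versus 0.45: the Jeffrey update is closer here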

Another notion of "closeness" is given by relative entropy. The relative entropy between μ′ and μ, denoted D(μ′, μ), is defined as

D(μ′, μ) = Σ_{w∈W} μ′(w) log(μ′(w)/μ(w)).
(The logarithm here is taken to the base 2; if μ′(w) = 0 then μ′(w) log(μ′(w)/μ(w)) is taken to be 0. This is reasonable since lim_{x→0} x log(x/c) = 0 if c > 0.) The relative entropy is defined provided that μ′ is consistent with μ. Analogously to the case of Jeffrey's Rule, this means that if μ(w) = 0 then μ′(w) = 0, for all w ∈ W. Using elementary calculus, it can be shown that D(μ′, μ) ≥ 0, with equality exactly if μ′ = μ (Exercise 3.45). Unfortunately, relative entropy does not quite act as a metric. For example, it is not hard to show that D(μ, μ′) ≠ D(μ′, μ) in general (Exercise 3.46). Nevertheless, relative entropy has many attractive properties. One of them is that it generalizes Jeffrey's Rule. Moreover, unlike variation distance, it picks out Jeffrey's Rule uniquely; the relative entropy between μ|(α1U1; …; αnUn) and μ is minimum among all measures μ′ such that μ′(Ui) = αi, i = 1, …, n.
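
A short sketch of the definition (the helper name and the numbers are again invented): it computes D(μ′, μ) for two measures and illustrates the asymmetry just mentioned.

    # Illustrative sketch: relative entropy on a finite space, base-2 logarithm,
    # with terms where the first measure assigns probability 0 taken to be 0, as in the text.
    from math import log2

    def relative_entropy(p, q):
        """D(p, q) = sum over w of p(w) * log2(p(w) / q(w)).
        Assumes p is consistent with q: q(w) = 0 implies p(w) = 0."""
        return sum(p[w] * log2(p[w] / q[w]) for w in p if p[w] > 0)

    mu = {"blue": 0.25, "green": 0.25, "red": 0.3, "yellow": 0.2}
    mu_new = {"blue": 0.35, "green": 0.35, "red": 0.18, "yellow": 0.12}

    print(relative_entropy(mu_new, mu))  # nonnegative; 0 only if the measures are equal
    print(relative_entropy(mu, mu_new))  # a different value: D is not symmetric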

Proposition 3.11.2

Suppose that U1, …, Un is a partition of W and α1 + ⋯ + αn = 1. Let C = {μ′ : μ′(Ui) = αi, i = 1, …, n}. If α1U1; …; αnUn is consistent with μ, then D(μ′′, μ) ≥ D(μ|(α1U1; …; αnUn), μ) for all μ′′ ∈ C. Moreover, equality holds only if μ′′ = μ|(α1U1; …; αnUn).

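A numerical spot-check of Proposition 3.11.2 (an illustrative sketch with invented numbers, not a proof): among a few measures giving {blue, green} weight .7 and {red, yellow} weight .3, the Jeffrey update of μ has the smallest relative entropy to μ.

    from math import log2

    def relative_entropy(p, q):
        return sum(p[w] * log2(p[w] / q[w]) for w in p if p[w] > 0)

    mu = {"blue": 0.25, "green": 0.25, "red": 0.3, "yellow": 0.2}
    mu_j = {"blue": 0.35, "green": 0.35, "red": 0.18, "yellow": 0.12}  # the Jeffrey update of mu

    # Some other members of C = {mu' : mu'({blue, green}) = .7, mu'({red, yellow}) = .3}:
    others = [
        {"blue": 0.7, "green": 0.0, "red": 0.3, "yellow": 0.0},
        {"blue": 0.5, "green": 0.2, "red": 0.1, "yellow": 0.2},
        {"blue": 0.1, "green": 0.6, "red": 0.25, "yellow": 0.05},
    ]
    d_min = relative_entropy(mu_j, mu)
    assert all(relative_entropy(m, mu) > d_min for m in others)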

The justification for relative entropy is closely related to the justification for the entropy function, which was first defined by Claude Shannon in the context of information theory. Given a probability measure μ, define H(μ), the entropy of μ, as follows:

H(μ) = −Σ_{w∈W} μ(w) log(μ(w))

(where 0 log(0) is taken to be 0). If μ is the uniform probability measure on a space W with n elements (so that μ(w) = 1/n for all w ∈ W), then from standard properties of the log function, it follows that

D(μ′, μ) = log(n) − H(μ′).
Thus, minimizing the relative entropy between μ′ and the uniform distribution is the same as maximizing the entropy of μ′.
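
The identity can be checked numerically; the following sketch (with an invented measure and my own helper names) verifies D(μ′, μ) = log(n) − H(μ′) when μ is uniform on a four-element space.

    from math import log2

    def entropy(p):
        """H(p) = -sum over w of p(w) * log2(p(w)), with 0 log(0) taken to be 0."""
        return -sum(p[w] * log2(p[w]) for w in p if p[w] > 0)

    def relative_entropy(p, q):
        return sum(p[w] * log2(p[w] / q[w]) for w in p if p[w] > 0)

    W = ["blue", "green", "red", "yellow"]
    uniform = {w: 1 / len(W) for w in W}
    mu_new = {"blue": 0.4, "green": 0.3, "red": 0.2, "yellow": 0.1}

    lhs = relative_entropy(mu_new, uniform)
    rhs = log2(len(W)) - entropy(mu_new)
    assert abs(lhs - rhs) < 1e-12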

H(μ) can be viewed as a measure of the degree of "uncertainty" in μ. For example, if μ(w) = 1 for some w ∈ W, then H(μ) = 0; the agent is not at all uncertain if he knows that the probability of some world is 1. Uncertainty is maximized if all worlds are equally likely, since there is no information that allows an agent to prefer one world to another. More precisely, of all the probability measures on W, the one whose entropy is maximum is the uniform probability measure (Exercise 3.47). Even in the presence of constraints C, the measure that maximizes entropy is (very roughly) the measure that makes things "as equal as possible" subject to the constraints in C. Thus, for example, if C consists of the constraints μ′({blue, green}) = .8 and μ′({red, yellow}) = .2, then the measure μ_me that maximizes entropy is the one such that μ_me(blue) = μ_me(green) = .4 and μ_me(red) = μ_me(yellow) = .1 (Exercise 3.48).
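
To illustrate, the following sketch (with invented alternative measures) compares the entropy of μ_me with the entropy of some other measures satisfying the same constraints; μ_me comes out highest, as Exercise 3.48 asks the reader to show in general.

    from math import log2

    def entropy(p):
        return -sum(p[w] * log2(p[w]) for w in p if p[w] > 0)

    # The maximum-entropy measure subject to mu'({blue, green}) = .8 and
    # mu'({red, yellow}) = .2, as described in the text:
    mu_me = {"blue": 0.4, "green": 0.4, "red": 0.1, "yellow": 0.1}

    # A few other measures satisfying the same constraints:
    alternatives = [
        {"blue": 0.8, "green": 0.0, "red": 0.2, "yellow": 0.0},
        {"blue": 0.6, "green": 0.2, "red": 0.15, "yellow": 0.05},
        {"blue": 0.5, "green": 0.3, "red": 0.1, "yellow": 0.1},
    ]
    assert all(entropy(p) < entropy(mu_me) for p in alternatives)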

Just as the entropy of μ can be thought of as a measure of the information in μ, the relative entropy between μ′ and μ can be thought of as a measure of the amount of extra information in μ′ relative to the information already in μ. There are axiomatic characterizations of maximum entropy and relative entropy that attempt to make this intuition precise, although it is beyond the scope of this book to describe them. Given this intuition, it is perhaps not surprising that there are proponents of maximum entropy and relative entropy who recommend that if an agent's information can be characterized by a set C of constraints, then the agent should act "as if" the probability is determined by the measure that maximizes entropy relative to C (i.e., the measure that has the highest entropy of all the measures in C). Similarly, if the agent starts with a particular measure μ and gets new information characterized by C, he should update to the measure μ′ that satisfies C such that the relative entropy between μ′ and μ is a minimum.
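
As a rough illustration of this updating rule (a sketch only; it assumes scipy is available, and the optimizer, variable names, and prior are my own choices, not part of the text), the overlapping constraints from the beginning of the section, μ′({blue, green}) = .7 and μ′({green, yellow}) = .4, can be handled by numerically minimizing the relative entropy to the prior subject to those constraints.

    # Sketch: minimum-relative-entropy updating under overlapping constraints.
    # Assumes scipy; the prior and the constraint values are invented for the example.
    import numpy as np
    from scipy.optimize import minimize

    W = ["blue", "green", "red", "yellow"]
    mu = np.array([0.25, 0.25, 0.3, 0.2])  # prior, in the order of W

    def rel_entropy_to_mu(p):
        # D(p, mu) with 0 log(0) = 0; the base of the logarithm does not change the minimizer
        p = np.asarray(p)
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / mu[mask])))

    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1},      # p is a probability measure
        {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.7},  # p({blue, green}) = .7
        {"type": "eq", "fun": lambda p: p[1] + p[3] - 0.4},  # p({green, yellow}) = .4
    ]
    result = minimize(rel_entropy_to_mu, x0=mu, method="SLSQP",
                      bounds=[(0, 1)] * len(W), constraints=constraints)
    print(dict(zip(W, np.round(result.x, 3))))  # the minimum-relative-entropy update of mu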

Maximum entropy and relative entropy have proved quite successful in a number of applications, from physics to natural-language modeling. Unfortunately, they also exhibit some counterintuitive behavior on certain applications. Although they are valuable tools, they should be used with care.

Variation distance has an immediate analogue for all the other quantitative notions of uncertainty considered here (belief functions, inner measures, possibility measures, and ranking functions). I leave it to the reader to explore the use of variation distance with these notions (see, e.g., Exercise 3.49). Maximum entropy and relative entropy seem to be more closely bound up with probability, although analogues have in fact been proposed both for Dempster-Shafer belief functions and for possibility measures. More work needs to be done to determine what good notions of "closest" are for arbitrary plausibility measures. Different notions may well be appropriate for different applications.



