11.5 Random Worlds and Maximum Entropy

The entropy function has been used in a number of contexts in reasoning about uncertainty. As mentioned in the notes to Chapter 3, it was originally introduced in the context of information theory, where it was viewed as the amount of "information" in a probability measure. Intuitively, a uniform probability measure, which has high entropy, gives less information about the actual situation than does a measure that puts probability 1 on a single point (this measure has the lowest possible entropy, namely 0). The entropy function, specifically maximum entropy, was used in Section 8.5 to define a probability sequence that had some desirable properties for default reasoning. Another common use of entropy is in choosing a single probability measure from among a set of possible probability measures characterizing a situation, where the set is typically defined by some constraints. The principle of maximum entropy, first espoused by Jaynes, suggests choosing the measure with the maximum entropy (provided that there is in fact a unique such measure), because it incorporates in some sense the "least additional information" above and beyond the constraints that characterize the set.

No explicit use of maximum entropy is made by the random-worlds approach. Indeed, although they are both tools for reasoning about probabilities, the types of problems considered by the random-worlds approach and maximum entropy techniques seem unrelated. Nevertheless, it turns out that there is a surprising and very close connection between the random-worlds approach and maximum entropy provided that the vocabulary consists only of unary predicates and constants. In this section I briefly describe this connection, without going into technical details.

Suppose that the vocabulary consists of the unary predicate symbols P1, …, Pk together with some constant symbols. (Thus, the vocabulary includes neither function symbols nor higher-arity predicates.) Consider the 2^k atoms that can be formed from these predicate symbols, namely, the formulas of the form Q1 ∧ … ∧ Qk, where each Qi is either Pi or ¬Pi. (Strictly speaking, I should write Qi(x) for some variable x, not just Qi. I omit the parenthetical x here, since it just adds clutter.) The knowledge base KB can be viewed as simply placing constraints on the proportion of domain elements satisfying each atom. For example, the formula ‖P1(x) | P2(x)‖ₓ ≈ .6 says that the fraction of domain elements satisfying the atoms containing both P1 and P2 as conjuncts is (approximately) .6 times the fraction satisfying atoms containing P2 as a conjunct. (I omit the subscript on ≈, since it plays no role here.) For unary languages (only), it can be shown that every formula can be rewritten in a canonical form from which constraints on the possible proportions of atoms can be simply derived. For example, if the vocabulary is {c, P1, P2}, there are four atoms: A1 = P1 ∧ P2, A2 = P1 ∧ ¬P2, A3 = ¬P1 ∧ P2, and A4 = ¬P1 ∧ ¬P2; ‖P1(x) | P2(x)‖ₓ ≈ .6 is then equivalent to ‖A1(x)‖ₓ ≈ .6 ‖A1(x) ∨ A3(x)‖ₓ.

The set of constraints generated by KB (with ≈ replaced by =) defines a subset S(KB) of [0, 1]^(2^k). That is, each vector ⟨p1, …, p_{2^k}⟩ in S(KB) is a solution to the constraints defined by KB (where pi is the proportion of domain elements satisfying atom Ai). For example, if the vocabulary is {c, P1, P2} and KB is ‖P1(x) | P2(x)‖ₓ = .6 as above, then the only constraint is that p1 = .6(p1 + p3) or, equivalently, p1 = 1.5p3 (since .4p1 = .6p3). That is, S(KB) = {⟨p1, …, p4⟩ ∈ [0, 1]⁴ : p1 = 1.5p3, p1 + … + p4 = 1}.

As another example, suppose that KB = ∀x P1(x) ∧ ‖P1(x) ∧ P2(x)‖ₓ ⪅ .3. The first conjunct of KB clearly constrains both p3 and p4 (the proportions of domain elements satisfying atoms A3 and A4) to be 0. The second conjunct forces p1 to be (approximately) at most .3. Thus, S(KB) = {⟨p1, …, p4⟩ ∈ [0, 1]⁴ : p1 ≤ .3, p3 = p4 = 0, p1 + p2 = 1}.
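
For concreteness, the two sets S(KB) just described can be written as simple membership tests on a vector ⟨p1, p2, p3, p4⟩ of atom proportions. The following short Python sketch is purely illustrative; the function names and the numerical tolerance are arbitrary choices.

```python
# Illustrative membership tests for the two example sets S(KB) above,
# applied to a vector <p1, p2, p3, p4> of atom proportions.
def in_S_KB1(p, tol=1e-9):
    """S(KB) for KB = ||P1(x) | P2(x)||_x = .6: p1 = 1.5*p3 and the p_i sum to 1."""
    p1, p2, p3, p4 = p
    return abs(sum(p) - 1) < tol and abs(p1 - 1.5 * p3) < tol

def in_S_KB2(p, tol=1e-9):
    """S(KB) for KB = forall x P1(x) ^ ||P1(x) ^ P2(x)||_x <= .3:
    p3 = p4 = 0, p1 <= .3, and p1 + p2 = 1."""
    p1, p2, p3, p4 = p
    return p3 < tol and p4 < tol and p1 <= 0.3 + tol and abs(p1 + p2 - 1) < tol

print(in_S_KB1([0.3, 0.5, 0.2, 0.0]))   # True:  0.3 = 1.5 * 0.2 and the entries sum to 1
print(in_S_KB1([0.6, 0.1, 0.3, 0.0]))   # False: 0.6 != 1.5 * 0.3
print(in_S_KB2([0.3, 0.7, 0.0, 0.0]))   # True
```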

The connection between maximum entropy and the random-worlds approach is based on the following observations. Every world w can be associated with a vector ⟨p1, …, p_{2^k}⟩, where pi is the fraction of domain elements in w satisfying the atom Ai. For example, a world with domain size N in which 3 domain elements satisfy A1, none satisfy A2, 7 satisfy A3, and N − 10 satisfy A4 is associated with the vector ⟨3/N, 0, 7/N, (N − 10)/N⟩. Each such vector can be viewed as a probability measure on the space of atoms A1, …, A_{2^k}; it therefore has an associated entropy H(⟨p1, …, p_{2^k}⟩) = −(p1 log p1 + … + p_{2^k} log p_{2^k}) (where, as before, pi log pi is taken to be 0 if pi = 0). Define the entropy of w to be the entropy of its associated vector. Now consider some point p = ⟨p1, …, p_{2^k}⟩ in S(KB). How many worlds w ∈ W_N have associated vector p? Clearly, if some pi is not an integer multiple of 1/N, the answer is 0. However, for those p that are "possible," this number can be shown to grow asymptotically as 2^(N·H(p)), taking the logarithm in H to base 2 (Exercise 11.16). Thus, there are vastly more worlds whose associated vector is "near" the maximum-entropy point of S(KB) than there are worlds whose vector is farther from the maximum-entropy point. It then follows that if, for all sufficiently small tolerances, a formula θ is true at every world whose associated vector is near the maximum-entropy point(s) of S(KB), then μ∞(θ | KB) = 1.
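
The counting fact just cited can be checked numerically. The Python sketch below is purely illustrative: the sample vector and domain sizes are arbitrary, and it ignores the choice of denotations for the constants, which contributes only polynomially many further worlds and so does not affect the exponent.

```python
# Numerical check of the counting fact above: the number of ways to assign atoms
# to N domain elements with proportions p_i is the multinomial coefficient
# N! / ((p_1*N)! ... (p_4*N)!), and (log_2 of that count) / N approaches H(p).
import math

p = [0.3, 0.7, 0.0, 0.0]                        # sample vector of atom proportions
H = -sum(x * math.log2(x) for x in p if x > 0)  # entropy in bits, with 0 log 0 = 0

for N in [10, 100, 1000, 10000]:                # chosen so that each p_i * N is an integer
    counts = [round(x * N) for x in p]
    # log_2 of the multinomial coefficient, via lgamma to avoid huge integers
    log2_worlds = (math.lgamma(N + 1)
                   - sum(math.lgamma(c + 1) for c in counts)) / math.log(2)
    print(N, round(log2_worlds / N, 4), round(H, 4))
# The middle column climbs toward H as N grows, i.e. the count grows
# roughly as 2^(N*H(p)) up to factors that are subexponential in N.
```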

For example, the maximum-entropy point of S(KB) for the second example above is ⟨.3, .7, 0, 0⟩. (It must be the case that the last two components are 0, since this is true everywhere in S(KB); the first two components are "as close to being equal as possible" subject to the constraints, and this maximizes the entropy (cf. Exercise 3.48).) But now fix some small ε > 0, and consider the formula θ = ‖P2(x)‖ₓ ∈ [.3 − ε, .3 + ε]. Since this formula certainly holds at all worlds whose associated vector is sufficiently close to ⟨.3, .7, 0, 0⟩, it follows that μ∞(θ | KB) = 1. The generalization of Theorem 11.3.2 given in Exercise 11.6 implies that μ∞(P2(c) | KB ∧ θ) ∈ [.3 − ε, .3 + ε]. It follows from Exercise 11.13 that μ∞(ψ | KB ∧ θ) = μ∞(ψ | KB) for all formulas ψ and, hence, in particular, for P2(c). Since μ∞(P2(c) | KB) ∈ [.3 − ε, .3 + ε] for all sufficiently small ε, it follows that μ∞(P2(c) | KB) = .3, as desired. That is, the degree of belief in P2(c) given KB is the probability of P2 (i.e., the sum of the probabilities of the atoms that imply P2) in the measure of maximum entropy satisfying the constraints determined by KB.
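
As a quick check that ⟨.3, .7, 0, 0⟩ is indeed the maximum-entropy point, note that on S(KB) the entropy depends only on p1, since p3 = p4 = 0 and p2 = 1 − p1; it thus suffices to maximize over p1 ∈ [0, .3]. The following short Python sketch (purely illustrative) does this by grid search.

```python
# Minimal check that the maximum-entropy point of S(KB) for the second example
# is <.3, .7, 0, 0>.  On S(KB), p3 = p4 = 0 and p2 = 1 - p1, so the entropy is
# a function of p1 alone, to be maximized over the feasible interval [0, .3].
import math

def entropy_of(p1):
    """Entropy of the vector <p1, 1 - p1, 0, 0>, with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in (p1, 1 - p1) if p > 0)

best_p1 = max((i / 10000 for i in range(3001)), key=entropy_of)  # p1 ranges over [0, .3]
max_ent_point = [best_p1, 1 - best_p1, 0.0, 0.0]
print(max_ent_point)                         # [0.3, 0.7, 0.0, 0.0]
print(max_ent_point[0] + max_ent_point[2])   # probability of P2 (atoms A1, A3): 0.3
```

The last printed value is exactly the degree of belief μ∞(P2(c) | KB) = .3 computed in the argument above.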

This argument can be generalized to show that if (1) the vocabulary is {P1, …, Pn, c}, (2) φ(x) is a Boolean combination of the Pi(x)s, and (3) KB consists of statistical constraints on the Pi(x)s, then μ∞(φ(c) | KB) is the probability of φ (the sum of the probabilities of the atoms that imply φ) according to the measure of maximum entropy in S(KB).
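
The recipe this result suggests can be sketched in a few lines, assuming numpy and scipy are available (the function max_entropy and the encoding of the constraints below are illustrative choices, and the optimizer returns only an approximation of the true maximum-entropy point): maximize the entropy of the atom proportions subject to the constraints KB places on them, then sum the probabilities of the atoms that imply φ. Applied to the second example above, it should recover (approximately) the degree of belief .3 in P2(c).

```python
# Illustrative sketch of the general recipe: maximize entropy over atom
# proportions subject to the constraints from KB, then read off the degree of
# belief in a Boolean combination as the total probability of its atoms.
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

def max_entropy(constraints, n_atoms=4):
    """Maximum-entropy vector of atom proportions satisfying the given
    scipy-style constraints (in addition to summing to 1)."""
    cons = [{'type': 'eq', 'fun': lambda p: p.sum() - 1}] + list(constraints)

    def neg_entropy(p):
        p = np.clip(p, 0.0, 1.0)           # guard against tiny negative iterates
        return float(np.sum(xlogy(p, p)))  # -H(p); minimizing this maximizes H

    x0 = np.full(n_atoms, 1.0 / n_atoms)
    res = minimize(neg_entropy, x0, bounds=[(0, 1)] * n_atoms,
                   constraints=cons, method='SLSQP')
    return res.x

# Constraints from KB = forall x P1(x) ^ ||P1(x) ^ P2(x)||_x <~ .3:
# p3 = p4 = 0 and p1 <= .3.  ('ineq' constraints require fun(p) >= 0.)
kb_constraints = [
    {'type': 'eq',   'fun': lambda p: p[2]},
    {'type': 'eq',   'fun': lambda p: p[3]},
    {'type': 'ineq', 'fun': lambda p: 0.3 - p[0]},
]
p = max_entropy(kb_constraints)
print(np.round(p, 3))                # approximately [0.3, 0.7, 0.0, 0.0]
print(round(float(p[0] + p[2]), 3))  # P2 is A1 or A3: degree of belief in P2(c), ~0.3
```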

Thus, the random-worlds approach can be viewed as providing justification for the use of maximum entropy, at least when only unary predicates are involved. Indeed, random worlds can be viewed as a generalization of maximum entropy to cases where there are nonunary predicates.

These results connecting random worlds to maximum entropy also shed light on the maximum-entropy approach to default reasoning considered in Section 8.5. Indeed, the maximum-entropy approach can be embedded in the random-worlds approach. Let Σ be a collection of propositional defaults (i.e., formulas of the form φ → ψ) that mention only the primitive propositions {p1, …, pn}. Let {P1, …, Pn} be unary predicates. Convert each default θ = φ → ψ ∈ Σ to the formula θʳ = ‖ψ*(x) | φ*(x)‖ₓ ≈₁ 1, where ψ* and φ* are obtained by replacing each occurrence of a primitive proposition pi by Pi(x). Thus, the translation treats a propositional default statement as a statistical assertion about sets of individuals. Note that all the formulas θʳ use the same approximate equality relation ≈₁. This is essentially because the maximum-entropy approach treats all the defaults in Σ as having the same strength (in the sense of Example 11.3.9). This comes out in the maximum-entropy approach in the following way. Recall that in the probability sequence (μ₁ᵐᵉ, μ₂ᵐᵉ, …), the kth probability measure μₖᵐᵉ is the measure of maximum entropy among all those satisfying Σₖ, where Σₖ is the result of replacing each default φ → ψ ∈ Σ by the QU formula ℓ(ψ | φ) ≥ 1 − 1/k. That is, the same bound 1 − 1/k is used for all defaults (as opposed to choosing a possibly different number close to 1 for each default). I return to this issue shortly.

Let Σʳ = {θʳ : θ ∈ Σ}. The following theorem, whose proof is beyond the scope of this book, captures the connection between the random-worlds approach and the maximum-entropy approach to default reasoning:

Theorem 11.5.1

Let c be a constant symbol. Then Σ |≈ᵐᵉ φ → ψ iff μ∞(ψ*(c) | Σʳ ∧ φ*(c)) = 1.

Note that the translation used in the theorem converts the default rules in Σ to statistical statements about sets of individuals, but converts the left-hand and right-hand sides of the conclusion, φ and ψ, to statements about a particular individual (whose name was arbitrarily chosen to be c). This is in keeping with the typical use of default rules. Knowing that birds typically fly, we want to conclude something about a particular bird, Tweety or Opus.

Theorem 11.5.1 can be combined with Theorem 11.3.7 to provide a formal characterization of some of the inheritance properties of |≈ᵐᵉ. For example, it follows that not only does |≈ᵐᵉ satisfy all the properties of P, but it is also able to ignore irrelevant information and to allow subclasses to inherit properties from superclasses, as discussed in Section 8.5.

The assumption that the same approximate equality relation is used for every formula θʳ is crucial in proving the equivalence in Theorem 11.5.1. For suppose that Σ consists of the two rules p1 ∧ p2 → q and p3 → ¬q. Then p1 ∧ p2 ∧ p3 → q does not follow from Σ under |≈ᵐᵉ (nor does p1 ∧ p2 ∧ p3 → ¬q). This seems reasonable, as there is evidence for q (namely, p1 ∧ p2) and against q (namely, p3), and neither piece of evidence is more specific than the other. However, suppose that Σ′ is Σ together with the rule p1 → ¬q. Then it can be shown that Σ′ |≈ᵐᵉ p1 ∧ p2 ∧ p3 → q. This behavior seems counterintuitive and is a consequence of using the same approximate equality relation for all the rules. Intuitively, what is occurring here is that, prior to the addition of the rule p1 → ¬q, the sets P1(x) ∧ P2(x) and P3(x) are of comparable size. The new rule forces the set P1(x) ∧ P2(x) to be smaller than P1(x) by a factor corresponding to the tolerance of ≈₁, since almost all P1s are ¬Qs, whereas almost all P1 ∧ P2s are Qs. The size of the set P3(x), on the other hand, is unaffected. Hence, the default for the now much smaller class P1 ∧ P2 takes precedence over the default for the class P3.

If different approximate equality relations are used for the default rules, each one corresponding to a different tolerance, then this conclusion no longer follows. An appropriate choice of the tolerance τᵢ can make the default ‖¬Q(x) | P3(x)‖ₓ ≈ᵢ 1 so strong that the number of Qs in the set P3(x), and hence the number of Qs in the subset P1(x) ∧ P2(x) ∧ P3(x), is much smaller than the size of the set P1(x) ∧ P2(x) ∧ P3(x). In this case, the rule p3 → ¬q takes precedence over the rule p1 ∧ p2 → q. More generally, with no specific information about the relative strengths of the defaults, the limit in the random-worlds approach does not exist, so no conclusions can be drawn, just as in Example 11.3.9. On the other hand, if all the approximate equality relations are known to be the same, the random-worlds approach will conclude Q(c), just as the maximum-entropy approach of Section 8.5 does. This example shows how the added expressive power of allowing different approximate equality relations can play a crucial role in default reasoning.
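
To see the same-strength case concretely, the Σₖ construction recalled earlier can be sketched numerically, assuming numpy and scipy are available. The encoding of defaults as pairs of Python predicates, the function names, and the particular values of k are all illustrative choices, and the optimizer only approximates the true maximum-entropy measure, so the finite-k numbers are merely suggestive. For each k, the sketch finds a maximum-entropy measure over the 16 truth assignments to p1, p2, p3, q subject to Pr(ψ ∧ φ) ≥ (1 − 1/k)·Pr(φ) for every default φ → ψ in Σ′ (the same 1/k for every default), and then inspects Pr(q | p1 ∧ p2 ∧ p3).

```python
# Illustrative sketch of the Section 8.5 construction with a single tolerance
# 1/k shared by all defaults, applied to Sigma' = {p1 ^ p2 -> q, p3 -> not q, p1 -> not q}.
from itertools import product
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

WORLDS = list(product([0, 1], repeat=4))   # truth values of (p1, p2, p3, q)

def pr(x, event):
    """Probability, under the measure x over WORLDS, of the worlds where event holds."""
    return sum(xi for xi, w in zip(x, WORLDS) if event(*w))

def me_measure(defaults, k):
    """Maximum-entropy measure satisfying Pr(psi ^ phi) >= (1 - 1/k) * Pr(phi)
    for every default (phi, psi)."""
    cons = [{'type': 'eq', 'fun': lambda x: x.sum() - 1}]
    for phi, psi in defaults:
        cons.append({'type': 'ineq',
                     'fun': lambda x, phi=phi, psi=psi:
                         pr(x, lambda *w: phi(*w) and psi(*w)) - (1 - 1 / k) * pr(x, phi)})

    def neg_entropy(x):
        x = np.clip(x, 0.0, 1.0)
        return float(np.sum(xlogy(x, x)))

    x0 = np.full(len(WORLDS), 1 / len(WORLDS))
    return minimize(neg_entropy, x0, bounds=[(0, 1)] * len(WORLDS),
                    constraints=cons, method='SLSQP',
                    options={'maxiter': 1000}).x

sigma_prime = [
    (lambda p1, p2, p3, q: p1 and p2, lambda p1, p2, p3, q: q),      # p1 ^ p2 -> q
    (lambda p1, p2, p3, q: p3,        lambda p1, p2, p3, q: not q),  # p3 -> not q
    (lambda p1, p2, p3, q: p1,        lambda p1, p2, p3, q: not q),  # p1 -> not q
]

for k in [5, 20, 80]:
    x = me_measure(sigma_prime, k)
    num = pr(x, lambda p1, p2, p3, q: p1 and p2 and p3 and q)
    den = pr(x, lambda p1, p2, p3, q: p1 and p2 and p3)
    print(k, num / den if den > 0 else None)
# According to the discussion above, the default p1 ^ p2 ^ p3 -> q follows from
# Sigma' in the maximum-entropy approach, so this conditional should tend to 1
# as k grows; the finite-k values are only suggestive.
```

Dropping the rule p1 → ¬q from the list recovers the original Σ, for which, by the discussion above, the same sketch should no longer push this conditional toward 1.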

It is worth stressing that, although this section shows that there is a deep connection between the random-worlds approach and the maximum-entropy approach, this connection holds only if the vocabulary is restricted to unary predicates and constants. The random-worlds approach makes perfect sense (and the theorems proved in Sections 11.3 and 11.4 apply) for arbitrary vocabularies. However, there seems to be no obvious way to relate random worlds to maximum entropy once there is even a single binary predicate in the vocabulary. Indeed, there seems to be no way of even converting the formulas in a knowledge base that involves binary predicates to constraints on probability measures in such a way that maximum entropy can be applied.



