11.5 Random Worlds and Maximum Entropy

The entropy function has been used in a number of contexts in reasoning about uncertainty. As mentioned in the notes to Chapter 3, it was originally introduced in the context of information theory, where it was viewed as the amount of "information" in a probability measure. Intuitively, a uniform probability measure, which has high entropy, gives less information about the actual situation than does a measure that puts probability 1 on a single point (this measure has the lowest possible entropy, namely 0). The entropy function, specifically maximum entropy, was used in Section 8.5 to define a probability sequence that had some desirable properties for default reasoning. Another common use of entropy is in choosing a single probability measure from among a set of possible probability measures characterizing a situation, where the set is typically defined by some constraints. The principle of maximum entropy, first espoused by Jaynes, suggests choosing the measure with the maximum entropy (provided that there is in fact a unique such measure), because it incorporates in some sense the "least additional information" above and beyond the constraints that characterize the set.

No explicit use of maximum entropy is made by the random-worlds approach. Indeed, although they are both tools for reasoning about probabilities, the types of problems considered by the random-worlds approach and maximum entropy techniques seem unrelated. Nevertheless, it turns out that there is a surprising and very close connection between the random-worlds approach and maximum entropy provided that the vocabulary consists only of unary predicates and constants. In this section I briefly describe this connection, without going into technical details.

Suppose that the vocabulary consists of the unary predicate symbols P1, …, Pk together with some constant symbols. (Thus, the vocabulary includes neither function symbols nor higher-arity predicates.) Consider the 2^k atoms that can be formed from these predicate symbols, namely, the formulas of the form Q1 ∧ … ∧ Qk, where each Qi is either Pi or ¬Pi. (Strictly speaking, I should write Qi(x) for some variable x, not just Qi. I omit the parenthetical x here, since it just adds clutter.) The knowledge base KB can be viewed as simply placing constraints on the proportion of domain elements satisfying each atom. For example, the formula ‖P1(x) | P2(x)‖ₓ ≈ .6 says that the fraction of domain elements satisfying the atoms containing both P1 and P2 as conjuncts is (approximately) .6 times the fraction satisfying atoms containing P2 as a conjunct. (I omit the subscript on ≈, since it plays no role here.) For unary languages (only), it can be shown that every formula can be rewritten in a canonical form from which constraints on the possible proportions of atoms can be simply derived. For example, if the vocabulary is {c, P1, P2}, there are four atoms: A1 = P1 ∧ P2, A2 = P1 ∧ ¬P2, A3 = ¬P1 ∧ P2, and A4 = ¬P1 ∧ ¬P2; ‖P1(x) | P2(x)‖ₓ ≈ .6 is then equivalent to ‖A1(x)‖ₓ ≈ .6 ‖A1(x) ∨ A3(x)‖ₓ.

The set of constraints generated by KB (with ≈ replaced by =) defines a subset S(KB) of [0, 1]^(2^k). That is, each vector ⟨p1, …, p_{2^k}⟩ in S(KB) is a solution to the constraints defined by KB (where pi is the proportion of domain elements satisfying atom Ai). For example, if the vocabulary is {c, P1, P2} and KB is ‖P1(x) | P2(x)‖ₓ = .6 as above, then the only constraint is that p1 = .6(p1 + p3) or, equivalently, p1 = 1.5p3 (since .4p1 = .6p3). That is, S(KB) = {⟨p1, …, p4⟩ ∈ [0, 1]⁴ : p1 = 1.5p3, p1 + … + p4 = 1}.

As another example, suppose that KB = ∀x P1(x) ∧ ‖P1(x) ∧ P2(x)‖ₓ ⪅ .3. The first conjunct of KB clearly constrains both p3 and p4 (the proportions of domain elements satisfying atoms A3 and A4) to be 0. The second conjunct forces p1 to be (approximately) at most .3. Thus, S(KB) = {⟨p1, …, p4⟩ ∈ [0, 1]⁴ : p1 ≤ .3, p3 = p4 = 0, p1 + p2 = 1}.
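
For concreteness, the two sets S(KB) just described can be written as simple membership tests on a vector ⟨p1, p2, p3, p4⟩ of atom proportions. The following short Python sketch is purely illustrative; the function names and the numerical tolerance are arbitrary choices.

```python
# Illustrative membership tests for the two example sets S(KB) above,
# applied to a vector <p1, p2, p3, p4> of atom proportions.
def in_S_KB1(p, tol=1e-9):
    """S(KB) for KB = ||P1(x) | P2(x)||_x = .6: p1 = 1.5*p3 and the p_i sum to 1."""
    p1, p2, p3, p4 = p
    return abs(sum(p) - 1) < tol and abs(p1 - 1.5 * p3) < tol

def in_S_KB2(p, tol=1e-9):
    """S(KB) for KB = forall x P1(x) ^ ||P1(x) ^ P2(x)||_x <= .3:
    p3 = p4 = 0, p1 <= .3, and p1 + p2 = 1."""
    p1, p2, p3, p4 = p
    return p3 < tol and p4 < tol and p1 <= 0.3 + tol and abs(p1 + p2 - 1) < tol

print(in_S_KB1([0.3, 0.5, 0.2, 0.0]))   # True:  0.3 = 1.5 * 0.2 and the entries sum to 1
print(in_S_KB1([0.6, 0.1, 0.3, 0.0]))   # False: 0.6 != 1.5 * 0.3
print(in_S_KB2([0.3, 0.7, 0.0, 0.0]))   # True
```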

The connection between maximum entropy and the random-worlds approach is based on the following observations. Every world w can be associated with a vector ⟨p1, …, p_{2^k}⟩, where pi is the fraction of domain elements in w satisfying the atom Ai. For example, a world with domain size N in which 3 domain elements satisfy A1, none satisfy A2, 7 satisfy A3, and N − 10 satisfy A4 is associated with the vector ⟨3/N, 0, 7/N, (N − 10)/N⟩. Each such vector can be viewed as a probability measure on the space of atoms A1, …, A_{2^k}; it therefore has an associated entropy H(⟨p1, …, p_{2^k}⟩) = −(p1 log p1 + … + p_{2^k} log p_{2^k}) (where, as before, pi log pi is taken to be 0 if pi = 0). Define the entropy of w to be the entropy of its associated vector. Now consider some point p = ⟨p1, …, p_{2^k}⟩ in S(KB). How many worlds w ∈ W_N have associated vector p? Clearly, if some pi is not an integer multiple of 1/N, the answer is 0. However, for those p that are "possible," this number can be shown to grow asymptotically as 2^(N·H(p)), taking the logarithm in H to base 2 (Exercise 11.16). Thus, there are vastly more worlds whose associated vector is "near" the maximum-entropy point of S(KB) than there are worlds whose vector is farther from the maximum-entropy point. It then follows that if, for all sufficiently small tolerances, a formula θ is true at every world whose associated vector is near the maximum-entropy point(s) of S(KB), then μ∞(θ | KB) = 1.
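
The counting fact just cited can be checked numerically. The Python sketch below is purely illustrative: the sample vector and domain sizes are arbitrary, and it ignores the choice of denotations for the constants, which contributes only polynomially many further worlds and so does not affect the exponent.

```python
# Numerical check of the counting fact above: the number of ways to assign atoms
# to N domain elements with proportions p_i is the multinomial coefficient
# N! / ((p_1*N)! ... (p_4*N)!), and (log_2 of that count) / N approaches H(p).
import math

p = [0.3, 0.7, 0.0, 0.0]                        # sample vector of atom proportions
H = -sum(x * math.log2(x) for x in p if x > 0)  # entropy in bits, with 0 log 0 = 0

for N in [10, 100, 1000, 10000]:                # chosen so that each p_i * N is an integer
    counts = [round(x * N) for x in p]
    # log_2 of the multinomial coefficient, via lgamma to avoid huge integers
    log2_worlds = (math.lgamma(N + 1)
                   - sum(math.lgamma(c + 1) for c in counts)) / math.log(2)
    print(N, round(log2_worlds / N, 4), round(H, 4))
# The middle column climbs toward H as N grows, i.e. the count grows
# roughly as 2^(N*H(p)) up to factors that are subexponential in N.
```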

For example, the maximum-entropy point of S(KB) for the second example above is ⟨.3, .7, 0, 0⟩. (It must be the case that the last two components are 0, since this is true everywhere in S(KB); the first two components are "as close to being equal as possible" subject to the constraints, and this maximizes the entropy (cf. Exercise 3.48).) But now fix some small ε > 0, and consider the formula θ = ‖P2(x)‖ₓ ∈ [.3 − ε, .3 + ε]. Since this formula certainly holds at all worlds whose associated vector is sufficiently close to ⟨.3, .7, 0, 0⟩, it follows that μ∞(θ | KB) = 1. The generalization of Theorem 11.3.2 given in Exercise 11.6 implies that μ∞(P2(c) | KB ∧ θ) ∈ [.3 − ε, .3 + ε]. It follows from Exercise 11.13 that μ∞(ψ | KB ∧ θ) = μ∞(ψ | KB) for all formulas ψ and, hence, in particular, for P2(c). Since μ∞(P2(c) | KB) ∈ [.3 − ε, .3 + ε] for all sufficiently small ε, it follows that μ∞(P2(c) | KB) = .3, as desired. That is, the degree of belief in P2(c) given KB is the probability of P2 (i.e., the sum of the probabilities of the atoms that imply P2) in the measure of maximum entropy satisfying the constraints determined by KB.
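
As a quick check that ⟨.3, .7, 0, 0⟩ is indeed the maximum-entropy point, note that on S(KB) the entropy depends only on p1, since p3 = p4 = 0 and p2 = 1 − p1; it thus suffices to maximize over p1 ∈ [0, .3]. The following short Python sketch (purely illustrative) does this by grid search.

```python
# Minimal check that the maximum-entropy point of S(KB) for the second example
# is <.3, .7, 0, 0>.  On S(KB), p3 = p4 = 0 and p2 = 1 - p1, so the entropy is
# a function of p1 alone, to be maximized over the feasible interval [0, .3].
import math

def entropy_of(p1):
    """Entropy of the vector <p1, 1 - p1, 0, 0>, with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in (p1, 1 - p1) if p > 0)

best_p1 = max((i / 10000 for i in range(3001)), key=entropy_of)  # p1 ranges over [0, .3]
max_ent_point = [best_p1, 1 - best_p1, 0.0, 0.0]
print(max_ent_point)                         # [0.3, 0.7, 0.0, 0.0]
print(max_ent_point[0] + max_ent_point[2])   # probability of P2 (atoms A1, A3): 0.3
```

The last printed value is exactly the degree of belief μ∞(P2(c) | KB) = .3 computed in the argument above.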

This argument can be generalized to show that if (1) the vocabulary is {P1, …, Pn, c}, (2) φ(x) is a Boolean combination of the Pi(x)s, and (3) KB consists of statistical constraints on the Pi(x)s, then μ∞(φ(c) | KB) is the probability of φ (the sum of the probabilities of the atoms that imply φ) according to the measure of maximum entropy in S(KB).
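
The recipe this result suggests can be sketched in a few lines, assuming numpy and scipy are available (the function max_entropy and the encoding of the constraints below are illustrative choices, and the optimizer returns only an approximation of the true maximum-entropy point): maximize the entropy of the atom proportions subject to the constraints KB places on them, then sum the probabilities of the atoms that imply φ. Applied to the second example above, it should recover (approximately) the degree of belief .3 in P2(c).

```python
# Illustrative sketch of the general recipe: maximize entropy over atom
# proportions subject to the constraints from KB, then read off the degree of
# belief in a Boolean combination as the total probability of its atoms.
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

def max_entropy(constraints, n_atoms=4):
    """Maximum-entropy vector of atom proportions satisfying the given
    scipy-style constraints (in addition to summing to 1)."""
    cons = [{'type': 'eq', 'fun': lambda p: p.sum() - 1}] + list(constraints)

    def neg_entropy(p):
        p = np.clip(p, 0.0, 1.0)           # guard against tiny negative iterates
        return float(np.sum(xlogy(p, p)))  # -H(p); minimizing this maximizes H

    x0 = np.full(n_atoms, 1.0 / n_atoms)
    res = minimize(neg_entropy, x0, bounds=[(0, 1)] * n_atoms,
                   constraints=cons, method='SLSQP')
    return res.x

# Constraints from KB = forall x P1(x) ^ ||P1(x) ^ P2(x)||_x <~ .3:
# p3 = p4 = 0 and p1 <= .3.  ('ineq' constraints require fun(p) >= 0.)
kb_constraints = [
    {'type': 'eq',   'fun': lambda p: p[2]},
    {'type': 'eq',   'fun': lambda p: p[3]},
    {'type': 'ineq', 'fun': lambda p: 0.3 - p[0]},
]
p = max_entropy(kb_constraints)
print(np.round(p, 3))                # approximately [0.3, 0.7, 0.0, 0.0]
print(round(float(p[0] + p[2]), 3))  # P2 is A1 or A3: degree of belief in P2(c), ~0.3
```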

Thus, the random-worlds approach can be viewed as providing justification for the use of maximum entropy, at least when only unary predicates are involved. Indeed, random worlds can be viewed as a generalization of maximum entropy to cases where there are nonunary predicates.

These results connecting random worlds to maximum entropy also shed light on the maximum-entropy approach to default reasoning considered in Section 8.5. Indeed, the maximum-entropy approach can be embedded in the random-worlds approach. Let Σ be a collection of propositional defaults (i.e., formulas of the form φ → ψ) that mention only the primitive propositions {p1, …, pn}. Let {P1, …, Pn} be unary predicates. Convert each default θ = φ → ψ ∈ Σ to the formula θʳ = ‖ψ*(x) | φ*(x)‖ₓ ≈₁ 1, where ψ* and φ* are obtained by replacing each occurrence of a primitive proposition pi by Pi(x). Thus, the translation treats a propositional default statement as a statistical assertion about sets of individuals. Note that all the formulas θʳ use the same approximate equality relation ≈₁. This is essentially because the maximum-entropy approach treats all the defaults in Σ as having the same strength (in the sense of Example 11.3.9). This comes out in the maximum-entropy approach in the following way. Recall that in the probability sequence (μ₁ᵐᵉ, μ₂ᵐᵉ, …), the kth probability measure μₖᵐᵉ is the measure of maximum entropy among all those satisfying Σₖ, where Σₖ is the result of replacing each default φ → ψ ∈ Σ by the QU formula ℓ(ψ | φ) ≥ 1 − 1/k. That is, the same bound 1 − 1/k is used for all defaults (as opposed to choosing a possibly different number close to 1 for each default). I return to this issue shortly.

Let Σʳ = {θʳ : θ ∈ Σ}. The following theorem, whose proof is beyond the scope of this book, captures the connection between the random-worlds approach and the maximum-entropy approach to default reasoning:

Theorem 11.5.1

Let c be a constant symbol. Then Σ |≈ᵐᵉ φ → ψ iff μ∞(ψ*(c) | Σʳ ∧ φ*(c)) = 1.

Note that the translation used in the theorem converts the default rules in Σ to statistical statements about sets of individuals, but converts the left-hand and right-hand sides of the conclusion, φ and ψ, to statements about a particular individual (whose name was arbitrarily chosen to be c). This is in keeping with the typical use of default rules. Knowing that birds typically fly, we want to conclude something about a particular bird, Tweety or Opus.

Theorem 11.5.1 can be combined with Theorem 11.3.7 to provide a formal characterization of some of the inheritance properties of |≈ᵐᵉ. For example, it follows that not only does |≈ᵐᵉ satisfy all the properties of P, but it is also able to ignore irrelevant information and to allow subclasses to inherit properties from superclasses, as discussed in Section 8.5.

The assumption that the same approximate equality relation is used for every formula θʳ is crucial in proving the equivalence in Theorem 11.5.1. For suppose that Σ consists of the two rules p1 ∧ p2 → q and p3 → ¬q. Then p1 ∧ p2 ∧ p3 → q does not follow from Σ under |≈ᵐᵉ (nor does p1 ∧ p2 ∧ p3 → ¬q). This seems reasonable, as there is evidence for q (namely, p1 ∧ p2) and against q (namely, p3), and neither piece of evidence is more specific than the other. However, suppose that Σ′ is Σ together with the rule p1 → ¬q. Then it can be shown that Σ′ |≈ᵐᵉ p1 ∧ p2 ∧ p3 → q. This behavior seems counterintuitive and is a consequence of using the same approximate equality relation for all the rules. Intuitively, what is occurring here is that, prior to the addition of the rule p1 → ¬q, the sets P1(x) ∧ P2(x) and P3(x) are of comparable size. The new rule forces the set P1(x) ∧ P2(x) to be smaller than P1(x) by a factor corresponding to the tolerance of ≈₁, since almost all P1s are ¬Qs, whereas almost all P1 ∧ P2s are Qs. The size of the set P3(x), on the other hand, is unaffected. Hence, the default for the now much smaller class P1 ∧ P2 takes precedence over the default for the class P3.

If different approximate equality relations are used for the default rules, each one corresponding to a different tolerance, then this conclusion no longer follows. An appropriate choice of the tolerance τᵢ can make the default ‖¬Q(x) | P3(x)‖ₓ ≈ᵢ 1 so strong that the number of Qs in the set P3(x), and hence the number of Qs in the subset P1(x) ∧ P2(x) ∧ P3(x), is much smaller than the size of the set P1(x) ∧ P2(x) ∧ P3(x). In this case, the rule p3 → ¬q takes precedence over the rule p1 ∧ p2 → q. More generally, with no specific information about the relative strengths of the defaults, the limit in the random-worlds approach does not exist, so no conclusions can be drawn, just as in Example 11.3.9. On the other hand, if all the approximate equality relations are known to be the same, the random-worlds approach will conclude Q(c), just as the maximum-entropy approach of Section 8.5 does. This example shows how the added expressive power of allowing different approximate equality relations can play a crucial role in default reasoning.
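
To see the same-strength case concretely, the Σₖ construction recalled earlier can be sketched numerically, assuming numpy and scipy are available. The encoding of defaults as pairs of Python predicates, the function names, and the particular values of k are all illustrative choices, and the optimizer only approximates the true maximum-entropy measure, so the finite-k numbers are merely suggestive. For each k, the sketch finds a maximum-entropy measure over the 16 truth assignments to p1, p2, p3, q subject to Pr(ψ ∧ φ) ≥ (1 − 1/k)·Pr(φ) for every default φ → ψ in Σ′ (the same 1/k for every default), and then inspects Pr(q | p1 ∧ p2 ∧ p3).

```python
# Illustrative sketch of the Section 8.5 construction with a single tolerance
# 1/k shared by all defaults, applied to Sigma' = {p1 ^ p2 -> q, p3 -> not q, p1 -> not q}.
from itertools import product
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

WORLDS = list(product([0, 1], repeat=4))   # truth values of (p1, p2, p3, q)

def pr(x, event):
    """Probability, under the measure x over WORLDS, of the worlds where event holds."""
    return sum(xi for xi, w in zip(x, WORLDS) if event(*w))

def me_measure(defaults, k):
    """Maximum-entropy measure satisfying Pr(psi ^ phi) >= (1 - 1/k) * Pr(phi)
    for every default (phi, psi)."""
    cons = [{'type': 'eq', 'fun': lambda x: x.sum() - 1}]
    for phi, psi in defaults:
        cons.append({'type': 'ineq',
                     'fun': lambda x, phi=phi, psi=psi:
                         pr(x, lambda *w: phi(*w) and psi(*w)) - (1 - 1 / k) * pr(x, phi)})

    def neg_entropy(x):
        x = np.clip(x, 0.0, 1.0)
        return float(np.sum(xlogy(x, x)))

    x0 = np.full(len(WORLDS), 1 / len(WORLDS))
    return minimize(neg_entropy, x0, bounds=[(0, 1)] * len(WORLDS),
                    constraints=cons, method='SLSQP',
                    options={'maxiter': 1000}).x

sigma_prime = [
    (lambda p1, p2, p3, q: p1 and p2, lambda p1, p2, p3, q: q),      # p1 ^ p2 -> q
    (lambda p1, p2, p3, q: p3,        lambda p1, p2, p3, q: not q),  # p3 -> not q
    (lambda p1, p2, p3, q: p1,        lambda p1, p2, p3, q: not q),  # p1 -> not q
]

for k in [5, 20, 80]:
    x = me_measure(sigma_prime, k)
    num = pr(x, lambda p1, p2, p3, q: p1 and p2 and p3 and q)
    den = pr(x, lambda p1, p2, p3, q: p1 and p2 and p3)
    print(k, num / den if den > 0 else None)
# According to the discussion above, the default p1 ^ p2 ^ p3 -> q follows from
# Sigma' in the maximum-entropy approach, so this conditional should tend to 1
# as k grows; the finite-k values are only suggestive.
```

Dropping the rule p1 → ¬q from the list recovers the original Σ, for which, by the discussion above, the same sketch should no longer push this conditional toward 1.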

It is worth stressing that, although this section shows that there is a deep connection between the random-worlds approach and the maximum-entropy approach, this connection holds only if the vocabulary is restricted to unary predicates and constants. The random-worlds approach makes perfect sense (and the theorems proved in Sections 11.3 and 11.4 apply) for arbitrary vocabularies. However, there seems to be no obvious way to relate random worlds to maximum entropy once there is even a single binary predicate in the vocabulary. Indeed, there seems to be no way of even converting the formulas in a knowledge base that involves binary predicates to constraints on probability measures in such a way that maximum entropy can be applied.



