10.3 First-Order Reasoning about Probability

There is an obvious first-order extension of the propositional logic ℒ_n^{QU} considered in Section 7.3. The syntax is just a combination of the syntax for first-order logic and that of ℒ_n^{QU}; I omit the formal definition. Call the resulting language ℒ_n^{QU,fo}. ℒ_n^{QU,fo} includes formulas such as ∀x(ℓ₁(P(x)) ≥ 1/2) ⇒ ℓ₂(∃y Q(y)) < 1/3; quantifiers can appear in the scope of likelihood formulas and likelihood formulas can appear in the scope of quantifiers.

Just as in Chapter 7, the likelihood operator ℓᵢ can be interpreted as probability (if all sets are measurable), inner measure, lower probability, belief, or possibility, depending on the semantics. For example, in the case of probability, a relational probability structure has the form (W, D, PR₁, …, PRₙ, π). (Note that, for the same reasons as in the case of knowledge, I am making the common-domain assumption.) Let ℳ_n^{meas,fo} consist of all relational (measurable) probability structures. I leave the straightforward semantic definitions to the reader.

If this were all there was to it, this would be a very short section. However, consider the two statements "The probability that a randomly chosen bird will fly is greater than .9" and "The probability that Tweety (a particular bird) flies is greater than .9." There is no problem dealing with the second statement; it corresponds to the formula ℓ(Flies(Tweety)) > .9. (I am assuming that there is only one agent in the picture, so I omit the subscript on ℓ.) But what about the first statement? What is the formula that should hold at a set of worlds whose probability is greater than .9?

The most obvious candidate is ℓ(∀x(Bird(x) ⇒ Flies(x))) > .9. However, it might very well be the case that in each of the worlds considered possible, there is at least one bird that doesn't fly. Hence, the statement ∀x(Bird(x) ⇒ Flies(x)) holds in none of the worlds (and so has probability 0); thus, ℓ(∀x(Bird(x) ⇒ Flies(x))) > .9 does not capture the first statement. What about ∀x(ℓ(Bird(x) ⇒ Flies(x)) > .9) or, perhaps better, ∀x(ℓ(Flies(x) | Bird(x)) > .9)? This runs into problems if there is a constant, say Opus, that represents an individual, say a penguin, that does not fly and is a rigid designator. Then ℓ(Flies(Opus) | Bird(Opus)) = 0, contradicting both ∀x(ℓ(Flies(x) | Bird(x)) > .9) and ∀x(ℓ(Bird(x) ⇒ Flies(x)) > .9). (It is important here that Opus is a rigid designator. The two statements ∀x(ℓ(Flies(x) | Bird(x)) > .9) and ℓ(Flies(Opus) | Bird(Opus)) = 0 are consistent if Opus is not a rigid designator; see Exercise 10.10.)
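
To see the tension concretely, here is a minimal Python sketch (my own illustration, not the book's semantics) of the possible-worlds reading of ℓ: a few hypothetical worlds, an assumed measure on them, and a rigid Opus that denotes a non-flying bird in every world. Then ℓ(Flies(Tweety)) > .9 can hold while ℓ(Flies(Opus) | Bird(Opus)) = 0, so the universally quantified candidates above cannot be satisfied.

```python
# A minimal sketch (not the book's code): three hypothetical possible worlds,
# an assumed probability measure on them, and a rigid constant Opus denoting
# the same non-flying bird in every world.
worlds = {
    "w1": {"birds": {"Tweety", "Opus"}, "fliers": {"Tweety"}},
    "w2": {"birds": {"Tweety", "Opus"}, "fliers": {"Tweety"}},
    "w3": {"birds": {"Tweety", "Opus"}, "fliers": set()},
}
mu = {"w1": 0.55, "w2": 0.40, "w3": 0.05}    # assumed measure on worlds

def likelihood(event):
    """l(phi): total probability of the worlds in which the event holds."""
    return sum(p for w, p in mu.items() if event(worlds[w]))

def conditional(event, given):
    denom = likelihood(given)
    return likelihood(lambda w: event(w) and given(w)) / denom if denom else None

print(likelihood(lambda w: "Tweety" in w["fliers"]))     # 0.95, so l(Flies(Tweety)) > .9 holds
print(conditional(lambda w: "Opus" in w["fliers"],
                  lambda w: "Opus" in w["birds"]))       # 0.0: l(Flies(Opus) | Bird(Opus)) = 0
```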

There seems to be a fundamental difference between these two statements. The first can be viewed as a statement about what one might expect as the result of performing some experiment or trial in a given situation. It can also be viewed as capturing statistical information about the world, since given some statistical information (say, that 90% of the individuals in a population have property P), then a randomly chosen individual should have probability .9 of having property P. By way of contrast, the second statement captures a degree of belief. The first statement seems to assume only one possible world (the "real" world), and in this world, some probability measure over the set of birds. It is saying that, with probability greater than .9, a bird chosen at random (according to this measure) will fly. The second statement implicitly assumes the existence of a number of possible worlds (in some of which Tweety flies, while in others Tweety doesn't), with some probability over these possibilities. Not surprisingly, the possible-worlds approach is well-suited to handling the second statement, but not the first.

It is not hard to design a language appropriate for statistical reasoning suitable for dealing with the first statement. The language includes terms of the form ||φ||_x, which can be interpreted as "the probability that a randomly chosen x in the domain satisfies φ." This is analogous to terms such as ℓᵢ(φ) in ℒ_n^{QU}. More generally, there can be an arbitrary set of variables in the subscript. To understand the need for this, suppose that the formula Son(x, y) says that x is the son of y. Now consider the three terms ||Son(x, y)||_x, ||Son(x, y)||_y, and ||Son(x, y)||_{x,y}. The first describes the probability that a randomly chosen x is the son of y; the second describes the probability that x is the son of a randomly chosen y; the third describes the probability that a randomly chosen pair (x, y) will have the property that x is the son of y. These three statements are all quite different. By allowing different sets of random variables in the subscript, they can all be expressed in the logic.
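
A small, purely illustrative sketch (the toy domain, Son relation, and uniform measure are my own choices, not the book's) makes the difference between the three subscript sets concrete:

```python
from itertools import product

# Toy domain and Son relation (illustrative): a is the son of b and of c.
domain = ["a", "b", "c", "d"]
son = {("a", "b"), ("a", "c")}              # Son(x, y): x is the son of y
mu = {d: 1 / len(domain) for d in domain}   # uniform measure, for concreteness

def freq_x(y):      # ||Son(x, y)||_x : randomly chosen x, y held fixed
    return sum(mu[x] for x in domain if (x, y) in son)

def freq_y(x):      # ||Son(x, y)||_y : x held fixed, randomly chosen y
    return sum(mu[y] for y in domain if (x, y) in son)

def freq_xy():      # ||Son(x, y)||_{x,y} : randomly chosen pair (x, y)
    return sum(mu[x] * mu[y] for x, y in product(domain, repeat=2) if (x, y) in son)

print(freq_x("b"))   # 0.25  -- one of the four x's is a son of b
print(freq_y("a"))   # 0.5   -- a is the son of two of the four y's
print(freq_xy())     # 0.125 -- 2 of the 16 pairs satisfy Son
```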

More formally, define a statistical likelihood term to have the form ||φ||_X, where φ is a formula and X is a set of variables. A (linear) statistical likelihood formula is one of the form a₁||φ₁||_{X₁} + ⋯ + aₖ||φₖ||_{Xₖ} > b. Formulas are now formed just as in first-order logic, except that linear statistical likelihood formulas are allowed. In this language, the statement "The probability that a randomly chosen bird will fly is greater than .9" can easily be expressed. With some abuse of notation, it is just ||Flies(x) | Bird(x)||_x > .9. (Without the abuse, it would be ||Flies(x) ∧ Bird(x)||_x > .9||Bird(x)||_x or ||Flies(x) ∧ Bird(x)||_x − .9||Bird(x)||_x > 0.)
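
The unabbreviated forms are just linear inequalities over statistical likelihood terms; here is a quick numerical check with made-up values for the two terms (the numbers 0.5 and 0.47 are mine, chosen only for illustration):

```python
# Assumed values: ||Bird(x)||_x = 0.5 and ||Flies(x) & Bird(x)||_x = 0.47.
bird = 0.5
flies_and_bird = 0.47

# Abbreviated conditional form: ||Flies(x) | Bird(x)||_x > .9
print(flies_and_bird / bird > 0.9)          # True (0.94 > 0.9)

# The two unabbreviated linear forms from the text agree with it:
print(flies_and_bird > 0.9 * bird)          # ||Flies & Bird||_x > .9 ||Bird||_x
print(flies_and_bird - 0.9 * bird > 0)      # ||Flies & Bird||_x - .9 ||Bird||_x > 0
```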

Quantifiers can be combined with statistical likelihood formulas. For example, ∀x(||Son(x, y)||_y > .9) says that for every person x, the probability that x is the son of a randomly chosen person y is greater than .9; ∀y(||Son(x, y)||_x > .9) says that for every person y, the probability that a randomly chosen x is the son of y is greater than .9. Let ℒ^{QU,stat} be the language that results from combining the syntax of first-order logic with statistical likelihood formulas.

As with ℓ, statistical likelihood terms can be evaluated with respect to any quantitative representation of uncertainty. For definiteness, I use probability here. A statistical structure is a tuple (𝒜, μ), where 𝒜 is a relational structure and μ is a probability measure on dom(𝒜). To simplify matters, I assume that all subsets of dom(𝒜) are measurable, that dom(𝒜) is finite or countable, and that μ is countably additive. That means that μ is characterized by the probability it assigns to the elements of dom(𝒜). Let ℳ^{meas,stat} consist of all statistical structures of this form.
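
Concretely, a finite statistical structure of this kind amounts to little more than a domain, an interpretation of the relation symbols, and a probability map on the domain. The following sketch is only an illustration of that representation (the class name StatStructure, its fields, and the example individuals are mine, not the book's):

```python
from dataclasses import dataclass

@dataclass
class StatStructure:
    """A finite statistical structure (A, mu) -- a sketch, not the book's code.

    domain    -- the finite domain of the relational structure A
    relations -- interpretation of each relation symbol as a set of tuples
    mu        -- probability of each domain element (defaults to uniform)
    """
    domain: list
    relations: dict
    mu: dict = None

    def __post_init__(self):
        if self.mu is None:                 # default to the uniform measure
            self.mu = {d: 1 / len(self.domain) for d in self.domain}
        assert abs(sum(self.mu.values()) - 1) < 1e-9, "mu must sum to 1"

# Example: four individuals, all of them birds, one of which does not fly.
M = StatStructure(
    domain=["b1", "b2", "b3", "p1"],
    relations={"Bird": {("b1",), ("b2",), ("b3",), ("p1",)},
               "Flies": {("b1",), ("b2",), ("b3",)}},
)
print(M.mu["b1"])   # 0.25 under the default uniform measure
```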

Statistical structures should be contrasted with relational probability structures. In a statistical structure, there are no possible worlds and thus no probability on worlds. There is essentially only one world, and the probability is on the domain. There is only one probability measure, not a different one for each agent. (It would be easy to allow a different probability measure for each agent, but the implicit assumption is that the probability in a statistical structure is objective and does not represent the agent's degree of belief.) An important special subclass of statistical structures (which is the focus of Chapter 11) consists of structures where the domain is finite and the probability measure is uniform (which makes all domain elements equally likely). This interpretation is particularly important for statistical reasoning. In that case, a formula such as ||Flies(x) | Bird(x)||_x > .9 could be interpreted as "more than 90 percent of birds fly."

There are a number of reasons for not insisting that μ be uniform in general. For one thing, there are no uniform probability measures in countably infinite domains where all sets are measurable. (A uniform probability measure in a countably infinite domain would have to assign probability 0 to each individual element in the domain, which means by countable additivity it would have to assign probability 0 to the whole domain.) For another, for representations of uncertainty other than probability, there is not always an obvious analogue of uniform probability measures. (Consider plausibility measures, for example. What would uniformity mean there?) Finally, there are times when a perfectly reasonable way of making choices might not result in all domain elements being equally likely. For example, suppose that there are seven balls, four in one urn and three in another. If an urn is chosen at random and then a ball in the urn is chosen at random, not all the balls are equally likely. The balls in the urn with four balls have probability 1/8 of being chosen; the balls in the urn with three balls have probability 1/6 of being chosen. In any case, there is no additional difficulty in giving semantics to the case that μ is an arbitrary probability measure, so that is what I will do. On the other hand, to understand the intuitions, it is probably best to think in terms of uniform measures.
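
The arithmetic of the urn example can be checked directly (choose the urn, then the ball, and multiply); a quick check with exact fractions:

```python
from fractions import Fraction

# Urn example from the text: choose one of the two urns at random, then a ball
# in that urn at random.
p_big_ball   = Fraction(1, 2) * Fraction(1, 4)   # a ball in the four-ball urn
p_small_ball = Fraction(1, 2) * Fraction(1, 3)   # a ball in the three-ball urn
print(p_big_ball, p_small_ball)                  # 1/8 1/6

# The induced measure on the seven balls is a probability measure, but not uniform.
print(4 * p_big_ball + 3 * p_small_ball)         # 1
print(p_big_ball == Fraction(1, 7))              # False: not the uniform measure
```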

One more construction is needed before giving the semantics. Given a probability measure μ on D, there is a standard construction for defining the product measure μⁿ on the product domain Dⁿ consisting of all n-tuples of elements of D: define μⁿ(d₁, …, dₙ) = μ(d₁) × ⋯ × μ(dₙ). Note that if μ assigns equal probability to every element of D, then μⁿ assigns equal probability to every element of Dⁿ.
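
For a finite domain, the product construction is a short computation; in the sketch below, the measure μ is just an example of my own choosing:

```python
from itertools import product

mu = {"d1": 0.5, "d2": 0.3, "d3": 0.2}   # an example measure on D (assumed)

def product_measure(mu, n):
    """mu^n on D^n: mu^n(e1, ..., en) = mu(e1) * ... * mu(en)."""
    mun = {}
    for tup in product(mu, repeat=n):    # iterate over all n-tuples of elements
        p = 1.0
        for e in tup:
            p *= mu[e]
        mun[tup] = p
    return mun

mu2 = product_measure(mu, 2)
print(mu2[("d1", "d3")])                   # 0.1
print(abs(sum(mu2.values()) - 1) < 1e-9)   # True: mu^2 is again a probability measure
```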

The semantic definitions are identical to those for first-order logic; the only new clause is that for statistical likelihood formulas. Given a statistical structure M = (𝒜, μ), a valuation V, and a statistical likelihood term ||φ||_{x₁,…,xₙ}, define

[||φ||_{x₁,…,xₙ}]_{(M,V)} = μⁿ({(d₁, …, dₙ) ∈ dom(𝒜)ⁿ : (M, V[x₁/d₁, …, xₙ/dₙ]) ⊨ φ}).

That is, [||φ||_{x₁,…,xₙ}]_{(M,V)} is the probability that a randomly chosen tuple (d₁, …, dₙ) (chosen according to μⁿ) satisfies φ. Then define

(M, V) ⊨ a₁||φ₁||_{X₁} + ⋯ + aₖ||φₖ||_{Xₖ} > b iff a₁[||φ₁||_{X₁}]_{(M,V)} + ⋯ + aₖ[||φₖ||_{Xₖ}]_{(M,V)} > b.
Note that the x in ||φ||_x acts in many ways just like the x in ∀x; for example, both bind free occurrences of x in φ, and in both cases the x is a dummy variable. That is, ∀xφ is equivalent to ∀yφ[x/y] and ||φ||_x > b is equivalent to ||φ[x/y]||_y > b if y does not appear in φ (see Exercise 10.11). Indeed, || · ||_x can express some of the general notions of quantification referred to in Section 10.1. For example, with a uniform probability measure and a finite domain, ||φ||_x > 1/2 expresses the fact that at least half the elements in the domain satisfy φ, and thus is equivalent to the formula Hxφ(x) from Section 10.1.
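
The following sketch (again my own illustration, with an assumed toy structure and predicate P) spells out the semantic clause computationally: [||φ||_{x₁,…,xₙ}]_{(M,V)} is obtained by summing μⁿ over the tuples of domain elements that satisfy φ. With a uniform measure it also illustrates the connection to the "half" quantifier Hx of Section 10.1.

```python
from itertools import product

# A toy statistical structure (purely illustrative): a domain, a uniform
# measure, and the interpretation of a unary predicate P.
domain = ["d1", "d2", "d3", "d4"]
mu = {d: 0.25 for d in domain}
P = {"d2", "d3", "d4"}

def stat_term(phi, variables, valuation=None):
    """[||phi||_{x1,...,xn}]_(M,V): the mu^n-probability that a randomly chosen
    tuple of values for the listed variables (other variables fixed by the
    valuation V) satisfies phi; phi is a predicate on valuations (dicts)."""
    base = dict(valuation or {})
    total = 0.0
    for tup in product(domain, repeat=len(variables)):
        weight = 1.0
        for d in tup:
            weight *= mu[d]                     # mu^n of the tuple
        v = dict(base, **dict(zip(variables, tup)))
        if phi(v):
            total += weight
    return total

# With a uniform measure, ||P(x)||_x is the fraction of the domain satisfying P,
# so ||P(x)||_x > 1/2 plays the role of the "half" quantifier Hx of Section 10.1.
print(stat_term(lambda v: v["x"] in P, ["x"]))        # 0.75
print(stat_term(lambda v: v["x"] in P, ["x"]) > 0.5)  # True
```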

Of course, statistical reasoning and reasoning about degrees of belief can be combined, by having a structure with both a probability on the domain and a probability on possible worlds. The details are straightforward, so I omit them here.

What about axioms? First consider reasoning about degrees of belief. It is easy to see that F1–5 are sound, as are QU1–3, QUGen, and Ineq from Section 7.3. They are, however, not complete. In fact, there is no complete axiomatization for the language ℒ_n^{QU,fo} with respect to ℳ_n^{meas,fo} (even if n = 1); the set of formulas in ℒ_n^{QU,fo} valid with respect to ℳ_n^{meas,fo} is not recursively enumerable. Restricting to finite domains does not help (since first-order logic restricted to finite domains is by itself not axiomatizable), nor does restricting to finite sets of worlds. But, as in the case of first-order logic, restricting to bounded domains does help.

Let AX^{prob,fo}_{n,N} consist of the axioms and inference rule of AX^{fo}_N together with those of AX^{prob}_n and one other axiom:

IV. (x ≠ y) ⇒ (ℓᵢ(x ≠ y) = 1).

IV stands for Inequality of Variables. It is easy to see that IV is sound, as is the analogous property for equality, (x = y) ⇒ (ℓᵢ(x = y) = 1), called EV.

EV just follows from the fact that variables are treated as rigid and have the same value in all worlds. EV is provable from the other axioms, so it is not necessary to include it in the axiomatization (Exercise 10.13). In fact, the analogues of IV and EV are both provable in the case of knowledge, which is why they do not appear in the axiomatization of Theorem 10.2.1 (Exercise 10.14).
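
The soundness of IV (and of EV) can be seen concretely: since a valuation assigns each variable the same domain element in every world, the event corresponding to x ≠ y is either the set of all worlds or the empty set. A short sketch, with an assumed set of worlds and measure:

```python
# Variables are rigid: the valuation V is independent of the world, so the
# event "x != y" holds either in every world or in none of them.
worlds = ["w1", "w2", "w3"]
mu = {"w1": 0.2, "w2": 0.5, "w3": 0.3}      # an arbitrary assumed measure on worlds
V = {"x": "d1", "y": "d2"}                  # one valuation, shared by all worlds

event = [w for w in worlds if V["x"] != V["y"]]   # membership does not depend on w
print(sum(mu[w] for w in event))   # 1.0 whenever V(x) != V(y), matching IV; 0.0 otherwise
```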

Theorem 10.3.1


AX^{prob,fo}_{n,N} is a sound and complete axiomatization with respect to structures in ℳ_n^{meas,fo} with a domain of cardinality at most N for the language ℒ_n^{QU,fo}.


Proof Soundness is immediate from the soundness of AX^{fo}_N in relational structures of size at most N, the soundness of AX^{prob}_n in the propositional case, and the validity of IV, proved in Exercise 10.13. Completeness is beyond the scope of this book.

Thus, there is a sense in which the axioms of first-order logic together with those for propositional reasoning about probability capture the essence of first-order reasoning about probability.

Much the same results hold for statistical reasoning. Consider the following axioms and rule of inference, where X ranges over finite sets of variables:

  • PD1. ||φ||_X ≥ 0.

  • PD2. ∀x₁ … ∀xₙ φ ⇒ ||φ||_{x₁,…,xₙ} = 1.

  • PD3. ||φ ∧ ψ||_X + ||φ ∧ ¬ψ||_X = ||φ||_X.

  • PD4. ||φ||_X = ||φ[x/z]||_{X[x/z]}, where x ∈ X and z does not appear in X or φ.

  • PD5. ||φ ∧ ψ||_{X ∪ Y} = ||φ||_X × ||ψ||_Y if none of the free variables of φ is contained in Y, none of the free variables of ψ is contained in X, and X and Y are disjoint.

  • PDGen. From φ ⇔ ψ infer ||φ||_X = ||ψ||_X.

PD1, PD3, and PDGen are the obvious analogues of QU1, QU3, and QUGen, respectively. PD2 is an extension of QU2. PD4 allows renaming of variables bound by "statistical" quantification. As I mentioned earlier, there is an analogous property for first-order logic, namely ∀xφ ⇔ ∀yφ[x/y] (for y not appearing in φ), which follows easily from F2 and F3 (Exercise 10.11). PD5 says that if ψ and φ do not have any free variables in common, then they can be treated as independent. Its validity follows from the use of the product measure in the semantics (Exercise 10.12).
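
A small numerical sanity check of PD3 and PD5 on an assumed toy structure (the domain, measure, and predicates below are mine, for illustration only): additivity and the independence of terms over disjoint variable sets both fall out of the product-measure semantics.

```python
from itertools import product

domain = ["a", "b", "c"]
mu = {"a": 0.5, "b": 0.3, "c": 0.2}          # an arbitrary assumed measure
P = {"a", "b"}                               # interpretations of two unary predicates
Q = {"b", "c"}

def term(phi, variables):
    """[||phi||_X] computed over the product measure; phi is a predicate on valuations."""
    total = 0.0
    for tup in product(domain, repeat=len(variables)):
        v = dict(zip(variables, tup))
        weight = 1.0
        for d in tup:
            weight *= mu[d]
        if phi(v):
            total += weight
    return total

# PD3: ||phi & psi||_X + ||phi & ~psi||_X = ||phi||_X   (here X = {x})
lhs = term(lambda v: v["x"] in P and v["x"] in Q, ["x"]) + \
      term(lambda v: v["x"] in P and v["x"] not in Q, ["x"])
print(abs(lhs - term(lambda v: v["x"] in P, ["x"])) < 1e-9)   # True

# PD5: ||P(x) & Q(y)||_{x,y} = ||P(x)||_x * ||Q(y)||_y   (disjoint variable sets)
both = term(lambda v: v["x"] in P and v["y"] in Q, ["x", "y"])
sep = term(lambda v: v["x"] in P, ["x"]) * term(lambda v: v["y"] in Q, ["y"])
print(abs(both - sep) < 1e-9)   # True
```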

F1–5 continue to be sound for statistical reasoning, except that the notion of substitutability in F2 must be modified to take into account that || · ||_y acts like a quantifier, so that t is not substitutable in φ if the variable y occurs in t and a term of the form || · ||_y occurs in φ.

As in the case of degrees of belief, there is no complete axiomatization for the language ℒ^{QU,stat} with respect to ℳ^{meas,stat}; the set of formulas in ℒ^{QU,stat} valid with respect to ℳ^{meas,stat} is not recursively enumerable. And again, while restricting to structures with finite domains does not help, restricting to bounded domains does. Let AX^{stat}_N consist of the axioms and inference rule of AX^{fo}_N together with PD1–5 and PDGen.

Theorem 10.3.2


AX^{stat}_N is a sound and complete axiomatization with respect to structures in ℳ^{meas,stat} with a domain of cardinality at most N for the language ℒ^{QU,stat}.




