We will analyze a number of models involving two or three categorical variables, as preparation for the task of determining likely decomposable or directed graphical models. First, consider two variables, A and B; our task is to determine whether or not they are dependent. We must define one model M2 that captures the concept of independence and one model M1 that captures the concept of dependence, and ask which one produced our data. The Bayes factor in favor of independence is p(D∣M2)/p(D∣M1); it is multiplied by the prior odds (which, lacking prior information in this general setting, we assume to be one) to give the posterior odds. There is some latitude in defining the data models for dependence and independence, but the alternatives lead to quite similar computations, as we shall see.
Let $d_A$ and $d_B$ be the number of possible values for A and B, respectively. It is natural to regard categorical data as produced by a discrete probability distribution, and then it is convenient to assume Dirichlet distributions for the parameters (the probabilities of the possible outcomes) of the distribution. We will find that this analysis is the key step in determining a full graphical model for the data matrix. For a discrete distribution over $d$ values, the parameter set is a sequence of probabilities $x = (x_1, \ldots, x_d)$ constrained by $0 \le x_i$ and $\sum_i x_i = 1$ (often the last parameter $x_d$ is omitted, since it is determined by the first $d-1$). A prior distribution over $x$ is the conjugate Dirichlet distribution with a sequence of non-negative parameters $\alpha = (\alpha_1, \ldots, \alpha_d)$. Then the Dirichlet distribution is

$$\mathrm{Di}(x \mid \alpha) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_i x_i^{\alpha_i - 1},$$

where $\Gamma(n+1) = n!$ for natural number $n$. The normalizing constant gives a useful mnemonic for integrating $\prod_i x_i^{\alpha_i - 1}$ over the $(d-1)$-dimensional unit cube (with $x_d = 1 - \sum_{i<d} x_i \ge 0$). It is very convenient to use Dirichlet priors, because the posterior is also a Dirichlet distribution. After having obtained data with frequency count $n = (n_1, \ldots, n_d)$, we just add it to the prior parameter vector $\alpha$ to get the posterior parameter vector $\alpha + n$. It is also easy to handle priors that are mixtures of Dirichlets, because the mixing propagates through and we only need to mix the posteriors of the components to get the posterior of the mixture.
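The conjugate update described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example counts are hypothetical, and integer parameters are assumed so that exact fractions can be used.

```python
from fractions import Fraction

def dirichlet_posterior(alpha, counts):
    """Conjugate update: posterior parameters are prior parameters plus counts."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i / sum(alpha)."""
    total = sum(alpha)
    return [Fraction(a, total) for a in alpha]

prior = [1, 1, 1]     # uniform prior over d = 3 outcomes
counts = [5, 0, 2]    # hypothetical observed frequencies
posterior = dirichlet_posterior(prior, counts)
print(posterior)      # [6, 1, 3]
print(dirichlet_mean(posterior))
```

With the uniform prior, the posterior mean for each outcome is (count + 1)/(n + d), which generalizes the rule of succession to d outcomes.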
With no specific prior information for $x$, it is necessary from symmetry considerations to assume all Dirichlet parameters equal to some value $\alpha$. A convenient prior is the uniform prior ($\alpha = 1$). This is, e.g., the prior used by Bayes and Laplace to derive the rule of succession (see Chapter 18 of Jaynes 2003). Other priors have been used, but experiments have shown little difference between these choices. In many cases, an expert's delivered prior information can be expressed as an equivalent sample that is just added to the data matrix, and then this modified matrix can be analyzed with the uniform prior. Likewise, a number of experts can be mixed to form a mixture prior. If the data has occurrence vector $(n_1, \ldots, n_d)$ for the $d$ possible data values in a case, and $n = n_+ = \sum_i n_i$, then the probability for these data given the discrete distribution parameters $x$ is

$$p(n \mid x) = \binom{n}{n_1, \ldots, n_d} \prod_i x_i^{n_i}.$$
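Integrating out the $x_i$ with the uniform prior gives the probability of the data given model M (M is characterized by a probability distribution and a Dirichlet prior on its parameters):

$$p(n \mid M) = \int \binom{n}{n_1, \ldots, n_d} \prod_i x_i^{n_i} \, \Gamma(d) \, dx = \frac{\Gamma(n+1)\,\Gamma(d)}{\Gamma(n+d)}. \tag{3}$$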
Thus, with the uniform Dirichlet prior, the probability of the data depends only on the sample size and not on the actual counts. Consider now the data matrix over A and B. Let $n_{ij}$ be the number of rows with value $i$ for A and value $j$ for B. Let $n_{+j}$ and $n_{i+}$ be the marginal counts, where we have summed over the index replaced by a plus sign, and $n = n_{++}$. Let model M1 (Figure 1) be the model where the A and B values of a row are combined into a single categorical variable ranging over $d_A d_B$ different values. The probability of the data given M1 is obtained by letting the products range over pairs $(i, j)$ and replacing $d$ by $d_A d_B$ in Equation (3):

$$p(n \mid M_1) = \frac{\Gamma(n+1)\,\Gamma(d_A d_B)}{\Gamma(n + d_A d_B)}. \tag{4}$$
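The claim that the data probability depends only on the sample size can be checked numerically. The sketch below assumes the Dirichlet-multinomial form described above (multinomial coefficient times the Dirichlet integral with a uniform prior) and evaluates it term by term in logarithms; the function name and the count vectors are hypothetical.

```python
from math import lgamma, exp

def log_marginal_uniform(counts):
    """log p(n | M) for a discrete distribution with a uniform Dirichlet prior.

    Evaluates log[ n!/prod(n_i!) * Gamma(d) * prod(Gamma(n_i + 1)) / Gamma(n + d) ]
    without cancelling any terms by hand.
    """
    d = len(counts)
    n = sum(counts)
    log_multinomial = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    log_integral = lgamma(d) + sum(lgamma(c + 1) for c in counts) - lgamma(n + d)
    return log_multinomial + log_integral

# Two different hypothetical count vectors with the same n = 4 and d = 3.
p1 = exp(log_marginal_uniform([3, 1, 0]))
p2 = exp(log_marginal_uniform([0, 2, 2]))
print(p1, p2)  # both approximately 1/15

# For the dependence model M1, flatten the dA x dB table into dA*dB cells.
table = [[2, 1], [0, 3]]
flat = [c for row in table for c in row]
p_m1 = exp(log_marginal_uniform(flat))
print(p_m1)    # approximately 1/84
```

The value 1/15 is one over the number of distinct count vectors with n = 4 and d = 3, so the uniform prior on the parameters induces a uniform distribution over count vectors.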
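We could also consider a different model M1′, where the A column is generated first and then the B column is generated separately for each value of A in turn. With uniform priors we get:

$$p(n \mid M_1') = \frac{\Gamma(n+1)\,\Gamma(d_A)}{\Gamma(n + d_A)} \prod_i \frac{\Gamma(n_{i+}+1)\,\Gamma(d_B)}{\Gamma(n_{i+} + d_B)}. \tag{5}$$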
Observe that we are not allowed to decide between the undirected model M1 and the directed model M1′ based on Equations (4) and (5). This is because these models define the same set of pdfs involving A and B. In the next model, M2, we assume that the A and B columns are independent, each having its own discrete distribution. There are two different ways to specify prior information in this case. We can consider the two columns separately, each assumed to be generated by a discrete distribution with its own prior. Or we can follow the style of M1′ above, with the difference that each A value has the same distribution of B values. Take the first approach first: assuming parameters $x^A$ and $x^B$ for the two distributions, a row with value $i$ for A and $j$ for B will have probability $x_i^A x_j^B$. For discrete distribution parameters $x^A$, $x^B$, the probability of the data matrix will be:

$$p(n \mid x^A, x^B) = \binom{n}{n_{11}, \ldots, n_{d_A d_B}} \prod_{i,j} \left(x_i^A x_j^B\right)^{n_{ij}} = \binom{n}{n_{11}, \ldots, n_{d_A d_B}} \prod_i \left(x_i^A\right)^{n_{i+}} \prod_j \left(x_j^B\right)^{n_{+j}}.$$
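Integration over the uniform priors for A and B gives the data probability given model M2:

$$p(n \mid M_2) = \frac{\Gamma(n+1)}{\prod_{i,j} \Gamma(n_{ij}+1)} \cdot \frac{\Gamma(d_A) \prod_i \Gamma(n_{i+}+1)}{\Gamma(n + d_A)} \cdot \frac{\Gamma(d_B) \prod_j \Gamma(n_{+j}+1)}{\Gamma(n + d_B)}.$$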
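From this and Equation (4) we obtain the Bayes factor for the undirected data model:

$$\frac{p(n \mid M_2)}{p(n \mid M_1)} = \frac{\Gamma(n + d_A d_B)\,\Gamma(d_A)\,\Gamma(d_B) \prod_i \Gamma(n_{i+}+1) \prod_j \Gamma(n_{+j}+1)}{\Gamma(n + d_A)\,\Gamma(n + d_B)\,\Gamma(d_A d_B) \prod_{i,j} \Gamma(n_{ij}+1)}. \tag{6}$$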
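As an illustration, the Bayes factor for independence against dependence in the undirected case can be evaluated in logarithms from a contingency table. The sketch assumes uniform Dirichlet priors as in the text and the Gamma-function ratio written in the docstring; the function name and the two example tables are hypothetical.

```python
from math import lgamma

def log_bayes_factor_independence(table):
    """log[p(n | M2) / p(n | M1)] for a dA x dB table of counts.

    Ratio assumed here (uniform priors):
      G(n + dA*dB) G(dA) G(dB) prod_i G(n_i+ + 1) prod_j G(n_+j + 1)
      ---------------------------------------------------------------
      G(n + dA) G(n + dB) G(dA*dB) prod_ij G(n_ij + 1)
    """
    d_a = len(table)
    d_b = len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    log_num = (lgamma(n + d_a * d_b) + lgamma(d_a) + lgamma(d_b)
               + sum(lgamma(r + 1) for r in row_sums)
               + sum(lgamma(c + 1) for c in col_sums))
    log_den = (lgamma(n + d_a) + lgamma(n + d_b) + lgamma(d_a * d_b)
               + sum(lgamma(c + 1) for row in table for c in row))
    return log_num - log_den

balanced = [[10, 10], [10, 10]]   # hypothetical table with no association
diagonal = [[20, 0], [0, 20]]     # hypothetical table with perfect association
print(log_bayes_factor_independence(balanced))  # positive: favors independence
print(log_bayes_factor_independence(diagonal))  # negative: favors dependence
```

A positive log Bayes factor supports the independence model M2 and a negative one supports M1; with unit prior odds, the exponential of this value is the posterior odds.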
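The second approach to modeling independence between A and B gives the following:

$$p(n \mid M_2') = \frac{\Gamma(n+1)\,\Gamma(d_A)}{\Gamma(n + d_A)} \cdot \frac{\prod_i \Gamma(n_{i+}+1)}{\prod_{i,j} \Gamma(n_{ij}+1)} \cdot \frac{\Gamma(d_B) \prod_j \Gamma(n_{+j}+1)}{\Gamma(n + d_B)}. \tag{7}$$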
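We can now find the Bayes factor relating models M1′ (Equation 5) and M2′ (Equation 7), with no prior preference for either:

$$\frac{p(n \mid M_2')}{p(n \mid M_1')} = \frac{\prod_j \Gamma(n_{+j}+1) \prod_i \Gamma(n_{i+} + d_B)}{\Gamma(n + d_B)\,\Gamma(d_B)^{d_A - 1} \prod_{i,j} \Gamma(n_{ij}+1)}. \tag{8}$$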
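The directed comparison can be sketched by computing the two marginal likelihoods separately and subtracting their logarithms. The code assumes uniform Dirichlet priors and the Dirichlet-multinomial form Γ(n+1)Γ(d)/Γ(n+d) for each generated column; the function names and example tables are hypothetical.

```python
from math import lgamma

def _margins(table):
    """Return row sums, column sums and the total count of a 2-D table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols, sum(rows)

def log_p_m1_directed(table):
    """log p(n | M1'): A column first, then a separate B distribution per A value."""
    d_a, d_b = len(table), len(table[0])
    rows, _, n = _margins(table)
    log_a = lgamma(n + 1) + lgamma(d_a) - lgamma(n + d_a)
    log_b = sum(lgamma(r + 1) + lgamma(d_b) - lgamma(r + d_b) for r in rows)
    return log_a + log_b

def log_p_m2_directed(table):
    """log p(n | M2'): A column first, then one shared B distribution for all A values."""
    d_a, d_b = len(table), len(table[0])
    rows, cols, n = _margins(table)
    log_a = lgamma(n + 1) + lgamma(d_a) - lgamma(n + d_a)
    # Product over A values of the multinomial coefficients for the B counts.
    log_coeff = (sum(lgamma(r + 1) for r in rows)
                 - sum(lgamma(c + 1) for row in table for c in row))
    # Integral of prod_j x_j^(n_+j) over the uniform prior on the shared B parameters.
    log_b = lgamma(d_b) + sum(lgamma(c + 1) for c in cols) - lgamma(n + d_b)
    return log_a + log_coeff + log_b

def log_bayes_factor_directed(table):
    """log[p(n | M2') / p(n | M1')]; positive values favor independence."""
    return log_p_m2_directed(table) - log_p_m1_directed(table)

balanced = [[10, 10], [10, 10]]   # hypothetical table with no association
diagonal = [[20, 0], [0, 20]]     # hypothetical table with perfect association
print(log_bayes_factor_directed(balanced))
print(log_bayes_factor_directed(diagonal))
```

The term for the A column is common to both models and cancels in the difference, so the Bayes factor depends only on how the B counts are distributed within and across the A values.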
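Consider now a data matrix with three variables, A, B and C (Figure 2). The analysis of the model M3, where full dependencies are accepted, is very similar to that of M1 above (Equation 4). For the model M4, which lacks the link between A and B, we should partition the data matrix by the value of C and multiply the probabilities of the blocks with the probability of the partitioning defined by C. Since we are ultimately after the Bayes factor relating M4 and M3 (respectively M4′ and M3′), we can simply multiply the Bayes factors relating M2 and M1 (Equation 6) (respectively M2′ and M1′) for each block of the partition to get the Bayes factors sought:

$$\frac{p(n \mid M_4)}{p(n \mid M_3)} = \prod_c \frac{\Gamma(n_{++c} + d_A d_B)\,\Gamma(d_A)\,\Gamma(d_B) \prod_i \Gamma(n_{i+c}+1) \prod_j \Gamma(n_{+jc}+1)}{\Gamma(n_{++c} + d_A)\,\Gamma(n_{++c} + d_B)\,\Gamma(d_A d_B) \prod_{i,j} \Gamma(n_{ijc}+1)}, \tag{9}$$

where $n_{ijc}$ is the number of rows with values $i$, $j$ and $c$ for A, B and C, and a plus sign again denotes summation over the index it replaces.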
The directed case is similar (Heckerman, 1997). The value of the gamma function is very large even for moderate values of its argument. For this reason, the formulas in this section are always evaluated in logarithmic form, where products such as those in Formula 9 translate to sums of logarithms.
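A minimal sketch of the logarithmic evaluation for three variables follows: the data are split into one A-by-B table per value of C, and the per-block log Bayes factors are summed. The function names and the example blocks are hypothetical, and uniform Dirichlet priors are assumed. In double precision Γ(x) overflows for x a little above 171, which is why `lgamma` is used throughout.

```python
from math import lgamma

def log_bf_block(table):
    """log Bayes factor (independence vs. dependence) for one dA x dB block."""
    d_a, d_b = len(table), len(table[0])
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    log_num = (lgamma(n + d_a * d_b) + lgamma(d_a) + lgamma(d_b)
               + sum(lgamma(r + 1) for r in rows)
               + sum(lgamma(c + 1) for c in cols))
    log_den = (lgamma(n + d_a) + lgamma(n + d_b) + lgamma(d_a * d_b)
               + sum(lgamma(x + 1) for row in table for x in row))
    return log_num - log_den

def log_bf_conditional(tables_by_c):
    """Sum of per-block log Bayes factors: the product over C values, in log form."""
    return sum(log_bf_block(t) for t in tables_by_c)

# Hypothetical data: one A x B table for each of two values of C.
independent_block = [[10, 10], [10, 10]]
dependent_block = [[20, 0], [0, 20]]
print(log_bf_conditional([independent_block, independent_block]))  # positive
print(log_bf_conditional([independent_block, dependent_block]))    # negative
```

A single block with strong association is enough to push the summed log Bayes factor below zero, that is, toward keeping the link between A and B.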
Causality and Direction in Graphical Models
Normally, the identification of cause and effect must depend on one's understanding of the mechanisms that generated the data. There are several claims or semi-claims that purely computational statistical methods can identify causal relations among a set of variables. What is worth remembering is that these methods create suggestions, and that even the concept of cause is not unambiguously defined but is a result of the way the external world is viewed. The claim that causes can be found is based on the observation that directionality can in some instances be identified in graphical models. Consider the models M4′′ and M4′ of Figure 2. In M4′, variables A and B could be expected to be marginally dependent, whereas in M4′′ they would be independent. On the other hand, conditional on the value of C, the opposite would hold: dependence between A and B in M4′′ and independence in M4′! This means that it is possible to identify the direction of arrows in some cases in directed graphical models. It is difficult to believe that the causal influence should not follow the direction of the arrows in those cases. Certainly, this is a potentially useful idea, but it should not be applied in isolation from application expertise, as the following example illustrates. It is known as Simpson's Paradox, although it is not paradoxical at all.
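The contrast between marginal and conditional dependence can be verified exactly on two small binary models. Figure 2 is not available here, so the arrow directions are an assumption: a common-cause structure (C influences both A and B) is taken to behave like M4′, and a collider (A and B both influence C) like M4′′. All probabilities below are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

STATES = (0, 1)

def common_cause_joint():
    """C -> A and C -> B: p(a, b, c) = p(c) p(a | c) p(b | c)."""
    p_one_given_c = {0: F(1, 5), 1: F(4, 5)}  # hypothetical p(A=1|c) = p(B=1|c)
    joint = {}
    for a, b, c in product(STATES, repeat=3):
        p_a = p_one_given_c[c] if a else 1 - p_one_given_c[c]
        p_b = p_one_given_c[c] if b else 1 - p_one_given_c[c]
        joint[(a, b, c)] = F(1, 2) * p_a * p_b
    return joint

def collider_joint():
    """A -> C <- B: p(a, b, c) = p(a) p(b) p(c | a, b), with fair A and B."""
    p_c1 = {(0, 0): F(1, 10), (0, 1): F(1, 2), (1, 0): F(1, 2), (1, 1): F(9, 10)}
    joint = {}
    for a, b, c in product(STATES, repeat=3):
        p_c = p_c1[(a, b)] if c else 1 - p_c1[(a, b)]
        joint[(a, b, c)] = F(1, 4) * p_c
    return joint

def marginal(joint, keep):
    """Sum the joint over all variables whose positions are not in `keep`."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0) + p
    return out

def marginally_independent(joint):
    """True if p(a, b) = p(a) p(b) for all a, b."""
    p_ab, p_a, p_b = marginal(joint, (0, 1)), marginal(joint, (0,)), marginal(joint, (1,))
    return all(p_ab[(a, b)] == p_a[(a,)] * p_b[(b,)]
               for a, b in product(STATES, repeat=2))

def conditionally_independent(joint):
    """True if p(a, b, c) p(c) = p(a, c) p(b, c) for all a, b, c."""
    p_ac, p_bc, p_c = marginal(joint, (0, 2)), marginal(joint, (1, 2)), marginal(joint, (2,))
    return all(joint[(a, b, c)] * p_c[(c,)] == p_ac[(a, c)] * p_bc[(b, c)]
               for a, b, c in product(STATES, repeat=3))

cc, col = common_cause_joint(), collider_joint()
print(marginally_independent(cc), conditionally_independent(cc))    # False True
print(marginally_independent(col), conditionally_independent(col))  # True False
```

Exact fractions are used so that the equality tests are not affected by floating-point rounding.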
Consider the application of drug testing. We have a new wonder drug that we hope cures an important disease. We find a population of 800 subjects who have the disease; they are asked to participate in the trial and given a choice between the new drug and the alternative treatment currently assumed to be best. Fortunately, half the subjects, 400, choose the new drug. Of these, 200 recover. Of those 400 who chose the traditional treatment, only 160 recovered. Since the test population seems large enough, we can conclude that the new drug causes recovery in 50% of patients, whereas the traditional treatment only cures 40%. But the drug may not be advantageous for men. Fortunately, it is easy to find the gender of each subject and to make separate judgments for men and women. So when men and women are separated, we find the following table:
Obviously, the recovery rate is lower for the new treatment, both for women and for men. Examining the table reveals the reason, which is not paradoxical at all: the disease is more severe for women, and the explanation for the apparent benefit of the new treatment is simply that it was tried by more men. Gender influences both the severity of the disease and the willingness to test the new treatment; in other words, gender is a confounder. This situation can always occur in studies of complex systems like living humans and most biological, engineering, or economic systems that are not entirely understood, and the confounder can be much more subtle than gender. The same effect can occur when we want to find the direction of causal links. In complex natural systems, and even in commercial databases, it is unlikely that we have measured the variable that will ultimately turn out to explain a causal effect. Such an unknown and unmeasured causal variable can easily reverse the direction of causal influence indicated by the comparison between models M4′′ and M4′, even if the data is abundant. Nevertheless, the new theories of causality have attracted a lot of interest, and if applied with caution they should be quite useful (Glymour & Cooper, 1999; Pearl, 2000). Their philosophical content is that a mechanism, causality, which previously could not be formalized, or only with difficulty, has become available for analysis in observational data, whereas earlier it could only be accessed in controlled experiments.
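The reversal can be reproduced with a small computation. The counts below are hypothetical: they are chosen only to agree with the totals stated in the text (400 subjects per treatment, with 200 and 160 recoveries) and with the qualitative description above, and they should not be read as the original table.

```python
# (recovered, treated) for each gender and treatment; hypothetical counts.
counts = {
    ("men", "new"): (180, 300),
    ("men", "traditional"): (70, 100),
    ("women", "new"): (20, 100),
    ("women", "traditional"): (90, 300),
}

def rate(gender, treatment):
    """Recovery rate within one gender for one treatment."""
    recovered, treated = counts[(gender, treatment)]
    return recovered / treated

def pooled_rate(treatment):
    """Recovery rate for one treatment with both genders pooled."""
    recovered = sum(r for (g, t), (r, n) in counts.items() if t == treatment)
    treated = sum(n for (g, t), (r, n) in counts.items() if t == treatment)
    return recovered / treated

for treatment in ("new", "traditional"):
    print(treatment, "pooled:", pooled_rate(treatment))
for gender in ("men", "women"):
    print(gender, "new:", rate(gender, "new"),
          "traditional:", rate(gender, "traditional"))
```

With these counts the new drug looks better in the pooled data (50% against 40%) but worse within each gender, because three quarters of those who chose it were men, whose recovery rates are higher under either treatment.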