Markov chain Monte Carlo (MCMC) refers to a class of algorithms for drawing (pseudo-)random samples from an arbitrary target probability density p(x) that is known only up to a normalizing constant. The basic idea behind these algorithms is that one can sample from the target density p(x) by running a Markov chain whose equilibrium density is exactly p(x). Here we describe two basic MCMC algorithms, the Metropolis algorithm and the Gibbs sampler, which have been widely used in diverse fields. The validity of both algorithms can be proved using basic Markov chain theory [420].

The roots of MCMC methods can be traced back to the well-known Metropolis algorithm [318], which was initially used to investigate the equilibrium properties of molecules in a gas. The first use of the Metropolis algorithm in a statistical context is found in [174]. The Gibbs sampler, which is a special case of the Metropolis algorithm, was so termed in the seminal paper [137] on image processing. It was brought to statistical prominence by [134], where it was observed that many Bayesian computations could be carried out via the Gibbs sampler. For tutorials on the Gibbs sampler, see [21, 68].

8.3.1 Metropolis-Hastings Algorithm

Let p(x) = c exp{-f(x)} be the target probability density from which we want to simulate random draws; the normalizing constant c may be unknown to us. Metropolis et al. introduced the fundamental idea of evolving a Markov process to achieve the sampling of p(x) [318]. Hastings later provided a more general algorithm, which is described below [174]. Starting with any configuration x^(0), the algorithm evolves from the current state x^(t) = x to the next state x^(t+1) as follows.

Algorithm 8.1: [Metropolis-Hastings algorithm - Form I]
Draw x' from a proposal distribution T(x -> x'), and set

    x^(t+1) = x'  with probability  min{1, [p(x') T(x' -> x)] / [p(x) T(x -> x')]},
    x^(t+1) = x   otherwise.                                                          (8.12)
A better-known form of the Metropolis-Hastings algorithm is as follows. At each iteration:

Algorithm 8.2: [Metropolis-Hastings algorithm - Form II]

(a) Draw a candidate x' from the proposal distribution T(x^(t) -> x').
(b) Compute the ratio r = [p(x') T(x' -> x^(t))] / [p(x^(t)) T(x^(t) -> x')].
(c) Draw u ~ Uniform(0, 1). If u <= r, set x^(t+1) = x'; otherwise, set x^(t+1) = x^(t).
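The Form II update above can be sketched in a few lines of Python. This is a minimal illustration, not the text's implementation: it assumes a symmetric Gaussian random-walk proposal, so the ratio T(x' -> x)/T(x -> x') cancels, and it works with log p(x) = -f(x) (up to a constant) for numerical stability.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for p(x) = c * exp(-f(x)).

    `log_p(x)` returns log p(x) up to an additive constant, i.e. -f(x).
    The Gaussian random-walk proposal is symmetric, T(x -> x') = T(x' -> x),
    so the acceptance ratio reduces to p(x') / p(x)."""
    rng = np.random.default_rng(seed)
    x, lp = float(x0), log_p(x0)
    out = np.empty(n_samples)
    for t in range(n_samples):
        x_new = x + step * rng.standard_normal()   # (a) symmetric proposal
        lp_new = log_p(x_new)
        # (b)+(c) accept with probability min{1, p(x')/p(x)}
        if np.log(rng.uniform()) < lp_new - lp:
            x, lp = x_new, lp_new
        out[t] = x                                  # rejected moves repeat x
    return out

# Example target: f(x) = x^2 / 2, so p(x) is the standard normal density.
samples = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 20000)
```

Note that when a proposal is rejected, the current state is recorded again; these repeats are essential for the chain to have p(x) as its invariant distribution.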
Heuristically, this algorithm is constructed on a trial-and-error strategy. Metropolis et al. restricted their choice of the perturbation function to symmetric ones [318]. Intuitively, this means that there is no trend bias at the proposal stage: the chance of obtaining x' by perturbing x is the same as that of obtaining x by perturbing x'. Since any proper random perturbation can be represented by a Markov transition function T, the symmetry condition can be written as T(x -> x') = T(x' -> x). Hastings generalized the choice of T to all those satisfying T(x -> x') > 0 if and only if T(x' -> x) > 0 [174]. It is easy to prove that the Metropolis-Hastings transition rule results in an "actual" transition function A(x, x') (it is different from T because an acceptance/rejection step is involved) that satisfies the detailed balance condition

    p(x) A(x, x') = p(x') A(x', x),                                                   (8.13)
which necessarily leads to a reversible Markov chain with p(x) as its invariant distribution. Thus, the sequence x^(0), x^(1), ... is drawn (asymptotically) from the desired distribution. The Metropolis algorithm has been used extensively in statistical physics over the past four decades and is the cornerstone of all MCMC techniques recently adopted and generalized in the statistics community. Another class of MCMC algorithms, the Gibbs sampler [137], differs from the Metropolis algorithm in that it uses conditional distributions based on p(x) to construct Markov chain moves.

8.3.2 Gibbs Sampler

Let X = (x_1, ..., x_d), where x_i is either a scalar or a vector. Suppose that we want to draw samples of X from an underlying density p(X). In the Gibbs sampler, one randomly or systematically chooses a coordinate, say x_i, and then updates its value with a new sample x_i' drawn from the conditional distribution p(x_i | X_[-i]), where we define X_[-i] = (x_1, ..., x_{i-1}, x_{i+1}, ..., x_d). Algorithmically, the Gibbs sampler can be implemented as follows:

Algorithm 8.3: [Random-scan Gibbs sampler] Suppose that the current sample is X^(t) = (x_1^(t), ..., x_d^(t)). Then:

(a) Select a coordinate i at random from {1, ..., d}.
(b) Draw x_i^(t+1) from the conditional distribution p(x_i | X_[-i]^(t)), and set X_[-i]^(t+1) = X_[-i]^(t).
Algorithm 8.4: [Systematic-scan Gibbs sampler] Let the current sample be X^(t) = (x_1^(t), ..., x_d^(t)). For i = 1, ..., d, draw x_i^(t+1) from the conditional distribution

    p(x_i | x_1^(t+1), ..., x_{i-1}^(t+1), x_{i+1}^(t), ..., x_d^(t)).                (8.14)
It is easy to check that every individual conditional update leaves p(·) invariant. Suppose that currently X = (x_i, X_[-i]) ~ p(·). Then X_[-i] follows its marginal distribution under p(·). Thus,

    p(x_i' | X_[-i]) p(X_[-i]) = p(x_i', X_[-i]),                                     (8.15)
implying that the joint distribution of (x_i', X_[-i]) is unchanged at p(·) after one update. Under regularity conditions, in the steady state the sequence of vectors ..., X^(t-1), X^(t), X^(t+1), ... is a realization of a homogeneous Markov chain, with the transition kernel from state X' to state X given by

    K(X', X) = p(x_1 | x_2', ..., x_d') p(x_2 | x_1, x_3', ..., x_d') ... p(x_d | x_1, ..., x_{d-1}).   (8.16)
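The systematic scan of Algorithm 8.4 can be sketched for a simple illustrative target (a standard bivariate normal with correlation rho; this example is my own choice, not one from the text). Here each full conditional in the scan is available in closed form: x_1 | x_2 ~ N(rho*x_2, 1 - rho^2), and symmetrically for x_2 | x_1.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Systematic-scan Gibbs sampler for a bivariate normal with zero
    means, unit variances, and correlation rho. Each sweep updates the
    coordinates in order, always conditioning on the freshest values."""
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho * rho)        # conditional standard deviation
    x1 = x2 = 0.0                       # arbitrary starting state X^(0)
    out = np.empty((n_samples, 2))
    for t in range(n_samples):
        x1 = rho * x2 + s * rng.standard_normal()   # draw x1 | x2
        x2 = rho * x1 + s * rng.standard_normal()   # draw x2 | new x1
        out[t] = x1, x2
    return out

chain = gibbs_bivariate_normal(0.6, 20000)
```

Note how the second draw conditions on the freshly updated x1, exactly as in (8.14); conditioning on the stale value would target a different distribution.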
The convergence behavior of the Gibbs sampler is investigated in [71, 134, 137, 278, 436, 473], where general conditions are given for the following two results:

1. The distribution of X^(t) converges to p(X) as t -> infinity.
2. For any integrable function f, the ergodic average (1/m) sum_{t=1}^{m} f(X^(t)) converges to the expectation E_p[f(X)] as m -> infinity.
The Gibbs sampler requires an initial transient period to converge to equilibrium. An initial period of length t is known as the burn-in period, and the first t samples should always be discarded. Convergence is usually detected in some ad hoc way; some methods for this are found in [472]. One such method is to monitor a sequence of weights that measure the discrepancy between the sampled and desired distributions [472].

The samples generated by the Gibbs sampler are not independent, so care must be taken in assessing the accuracy of estimators based on them. By grouping variables together (i.e., drawing samples of several elements of X simultaneously), one can usually accelerate convergence and generate less-correlated draws [274, 278]. To reduce the dependence between samples, one can also retain only every rth sample for the estimation procedure; when r is large, this approach generates almost independent samples.

8.3.3 Other Techniques

A main problem with all MCMC algorithms is that they may, for various reasons, move very slowly in the configuration space or become trapped in a local mode. This phenomenon is generally called slow mixing of the chain. When a chain is slow mixing, estimation based on the resulting Monte Carlo samples becomes very inaccurate. Some recent techniques for designing more efficient MCMC samplers include parallel tempering [141], the multiple-try method [279], and evolutionary Monte Carlo [264].
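As a small illustration of the burn-in and thinning procedure discussed earlier for post-processing any MCMC output, the following sketch discards the first samples and then keeps every rth draw (both the burn-in length and r are ad hoc choices, as noted above).

```python
import numpy as np

def burn_in_and_thin(chain, burn_in, r):
    """Discard the first `burn_in` draws (the burn-in period), then keep
    every rth remaining sample to reduce serial correlation between the
    retained draws."""
    return np.asarray(chain)[burn_in::r]

# e.g. drop 2 warm-up draws, then keep every 3rd of the rest
kept = burn_in_and_thin(np.arange(10), 2, 3)   # -> [2, 5, 8]
```

Thinning trades sample size for approximate independence; for many estimators, using all post-burn-in samples with an autocorrelation-aware error estimate is equally valid.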