8.5.1 Sequential Importance SamplingImportance sampling is one of the most well-known and elementary Monte Carlo techniques. Suppose that we want to make an inference about some random quantity X ~ p ( X ) using the Monte Carlo method. Sometimes drawing samples directly from p ( X ) is difficult, but it may be easier (or otherwise advantageous) to draw samples from a trial density, say, q ( X ). Note that the desired inference can be written as Equation 8.70
Equation 8.71
where Equation 8.72
is called the importance weight . In importance sampling, we draw samples X (1) , X (2) , ..., X ( n ) according to the trial distribution q ( ·). We then approximate the inference (8.71) by Equation 8.73
This technique is widely used, for example, for reducing the sample-size requirements in BER estimation. However, it is usually difficult to design a good trial density function in high-dimensional problems. One of the most useful strategies in these problems is to build up the trial density sequentially. Suppose that we can decompose X as X = ( x 1 , ..., x d ), where each of the x j may be either a scalar or a vector. Then our trial density can be constructed as Equation 8.74
by which we hope to obtain some guidance from the target density while building up the trial density. Corresponding to the decomposition of X , we can rewrite the target density as Equation 8.75
and the importance weight as Equation 8.76
Equation (8.76) suggests a recursive way of computing and monitoring the importance weight. That is, by denoting X t = x 1 , ..., ( x t ) (thus, X d X ), we have Equation 8.77
Then w d ( X d ) is equal to w ( X ) in (8.76). Potential advantages of this recursion and (8.75) are that we can stop generating further components of X if the partial weight derived from the sequentially generated partial sample is too small, and that we can take advantage of p t ( x t X t -1) in designing q t ( x t X t -1). In other words, the marginal distribution p ( X t ) can be used to guide the generation of X . Although the idea above sounds interesting, the trouble is that the decomposition of p ( X ) as in (8.75) and that of w ( X ) as in (8.76) are not practical at all! The reason is that in order to get (8.75), one needs to have the marginal distribution Equation 8.78
whose computation involves integrating out components x t +1, ... x d in p ( X ) and is as difficult as ”or even more difficult than ”the original problem. To carry out the sequential sampling idea, we need to introduce another layer of complexity. Suppose that we can find a sequence of auxiliary distributions ,
so that p t ( X t ) is a reasonable approximation to the marginal distribution p t ( X t ) for t = 1,..., d “ 1 and p d ( X ) = p ( X ). We emphasize that { p t ( X t )} are required to be known only up to a normalizing constant, and they serve only as "guides" to our construction of the entire sample X = ( x 1 ,..., x d ). The sequential importance sampling (SIS) method can then be defined as noted in the following recursive procedure. Algorithm 8.7: [Sequential importance sampling (SIS)] For t = 2,..., d :
Equation 8.79
and let w t = w t -1 u t . In the SIS step, we call u t an incremental weight . It is easy to show that X t is properly weighted by w t with respect to p t provided that X t -1 is properly weighted by w t -1 with respect to p t -1. Thus, the entire sample X obtained in this sequential fashion is properly weighted by the final importance weight, w d , with respect to the target density p( X ) . One reason for the sequential buildup of the trial density is that it breaks a difficult task into manageable pieces. The SIS framework is particularly attractive, as it can use the auxiliary distributions p 1 , p 2 ,..., p d to help construct more efficient trial distributions:
Equation 8.80
Then the incremental weight becomes Equation 8.81
8.5.2 SMC for Dynamical SystemsConsider the following dynamical system modeled in state-space form as Equation 8.82
where z t , y t , u t , and v t are, respectively, the state variable, the observation, the state noise, and the observation noise at time t . These quantities can be either scalars or vectors. Let Z t = ( z , z 1 ,..., z t ) and let Y t = ( y , y 1 ,..., y t ). Suppose an online inference of Z t is of interest; that is, at current time t we wish to make a timely estimate of a function of the state variable Z t , say h ( Z t ), based on the currently available observation, Y t . From Bayes' formula, we realize that the optimal solution to this problem is Equation 8.83
In most cases an exact evaluation of this expectation is analytically intractable because of the complexity of such a dynamical system. Monte Carlo methods provide us with a viable alternative to the required computation. Specifically, if we can draw m random samples from the distribution p ( Z t Y t ), we can approximate E { h ( Z t ) Y t } by Equation 8.84
Very often, direct simulation from p ( Z t Y t ) is not feasible , but drawing samples from some trial distribution is easy. In this case we can use the idea of importance sampling discussed above. Suppose that a set of random samples is generated from the trial distribution q ( Z t Y t ). By associating the weight Equation 8.85
with the sample , we can approximate the quantity of interest, E { h ( Z t ) Y t }, as Equation 8.86
with Equation 8.87
The pair ( ), j = 1, ..., m , is called a properly weighted sample with respect to distribution p ( Z t / Y t ). A trivial but important observation is that the (one of the components of is also properly weighted by the with respect to the marginal distribution p ( z t / Y t ). Another possible estimate of E { h ( Z t ) Y t } is Equation 8.88
The main reasons for preferring the ratio estimate (8.86) to the unbiased estimate (8.88) in an importance sampling framework are that the estimate (8.86) usually has a smaller mean-square error than that of (8.88), and that the normalizing constants of both the trial and target distributions are not required in using (8.86) (where these constants are canceled out); in such cases the weights { } are evaluated only up to a multiplicative constant. For example, the target distribution p ( Z t Y t ) in a typical dynamical system (and many Bayesian models) can be evaluated easily up to a normalizing constant (e.g., the likelihood multiplied by a prior distribution), whereas sampling from the distribution directly and evaluating the normalizing constant analytically are impossible . To implement Monte Carlo techniques for a dynamical system, a set of random samples properly weighted with respect to p ( Z t Y t ) is needed for any time t . Because the state equation in system (8.82) possesses a Markovian structure, we can implement a recursive importance sampling strategy, which is the basis of all sequential Monte Carlo techniques [277]. Suppose that a set of properly weighted samples with respect to p ( Z t “1 Y t “1 ) is given at time ( t “1). A sequential Monte Carlo filter generates from the set a new one, , which is properly weighted at time t with respect to p ( Z t Y t ). The algorithm is described as follows . Algorithm 8.8: [Sequential Monte Carlo filter for dynamical systems] For j = 1,... , m :
Equation 8.89
The algorithm is initialized by drawing a set of i.i.d. samples from p ( z / y ). When y represents the "null" information, p ( z y ) corresponds to the prior distribution of z . The samples and weights drawn by the algorithm above are characterized by the following result, the proof of which is given in the Appendix (Section 8.7.3). Proposition 8.1: The weighted samples generated by Algorithm 8.8 satisfy Equation 8.90
Equation 8.91
The result above, together with the law of large numbers , implies that Equation 8.92
There are a few important issues regarding the design and implementation of a sequential Monte Carlo filter, such as the choice of the trial distribution q ( ·) and the use of resampling (see Section 8.5.3). Specifically, a useful choice of the trial distribution , for the state-space model (8.82) is of the form Equation 8.93
where in (8.93) we used the facts that Equation 8.94
Equation 8.95
both of which follow directly from the state-space model (8.82). For this trial distribution, the importance weight is updated according to Equation 8.96
Equation 8.97
where (8.96) follows from the fact that Equation 8.98
and the last equality is due to the conditional independence property of the statespace model (8.82). See [277] for the general SMC framework and a detailed discussion on various implementation issues. SMC techniques have been employed to tackle a number of problems in wireless communications and networks, including blind detection in fading channels [76], blind detection in OFDM systems [593], and joint mobility tracking and hand-off detection in cellular networks [595]. 8.5.3 Resampling ProceduresThe importance sampling weight measures the quality of the corresponding imputed sequence . A relatively small weight implies that the sample is drawn far from the main body of the posterior distribution and has a small contribution in the final estimation. Such a sample is said to be ineffective . If there are too many ineffective samples, the Monte Carlo procedure becomes inefficient. This can be detected by observing a large coefficient of variation in the importance weight. Suppose that is a sequence of importance weights. Then the coefficient of variation, n t , is defined as Equation 8.99
with Equation 8.100
Note that if the samples are drawn exactly from the target distribution, all the weights are equal, implying that v t = 0. A large coefficient of variation in the importance weights indicates ineffective samples. It is shown in [232] that the importance weights resulting from an SMC filter form a martingale sequence. As more and more data are processed , the coefficient of variation of the weights increases ”that is, the number of ineffective samples increases ” rapidly . A useful method for reducing ineffective samples and enhancing effective ones is resampling , which was suggested in [160, 276] under the SMC setting. Roughly speaking, resampling allows those "bad" samples (with small importance weights) to be discarded and those "good" ones (with large importance weights) to replicate so as to accommodate the dynamic change of the system. Specifically, let be the original properly weighted samples at time t . A residual resampling strategy forms a new set of weighted samples according to the following algorithm (assume that ). Algorithm 8.9: [Residual resampling]
The correctness of the residual resampling procedure above is stated by the following result, whose proof is given in the Appendix (Section 8.7.4). Proposition 8.2: The samples drawn by the residual resampling procedure (Algorithm 8.9) are properly weighted with respect to p ( Z t Y t ) for m Alternatively, we can use the following simple resampling procedure, which also produces properly weighted samples with respect to p ( Z t Y t ):
Equation 8.101
Equation 8.102
In practice, when small to modest m is used (we used m = 50 in our simulations), the resampling procedure can be seen as trading off between bias and variance. That is, the new samples with their weights resulting from the resampling procedure are only approximately proper, which introduces small bias in Monte Carlo estimation. On the other hand, however, resampling greatly reduces Monte Carlo variance for the future samples. Resampling can be done at any time. However, resampling too often adds computational burden and decreases "diversities" of the Monte Carlo filter (i.e., it decreases the number of distinctive filters and loses information). On the other hand, resampling too rarely may result in a loss of efficiency. It is thus desirable to give guidance on when to do resampling. A measure of the efficiency of an importance sampling scheme is the effective sample size , defined as Equation 8.103
Heuristically, reflects the equivalent size of a set of i.i.d. samples for the set of m weighted ones. It is suggested in [277] that resampling should be performed when the effective sample size becomes small (e.g., ). Alternatively, one can conduct resampling at every fixed-length time interval (say, every five steps). 8.5.4 Mixture Kalman FilterMany dynamical system models belong to the class of conditional dynamical linear models (CDLMs) of the form Equation 8.104
where u t ~ N c ( 0,I ), v t ~ N c ( 0,I ), and l t is a random indicator variable. The matrices F l t , G l t , H l t , and K l t are known, given l t . In this model, the state variable z t corresponds to ( x t , l t ). We observe that for a given trajectory of the indicator l t in a CDLM, the system is both linear and Gaussian, for which the Kalman filter provides the complete statistical characterization of the system dynamics. Recently, a novel SMC method, the mixture Kalman filter (MKF), was developed in [75] for online filtering and prediction of CDLMs; it exploits the conditional Gaussian property and uses a marginalization operation to improve the algorithmic efficiency. Instead of dealing with both x t and l t , the MKF draws Monte Carlo samples only in the indicator space and uses a mixture of Gaussian distributions to approximate the target distribution. Compared with the generic SMC method, the MKF is substantially more efficient (e.g., it gives more accurate results with the same computing resources). However, the MKF often needs more computational power for its proper implementation, as the formulas required are more complicated. Additionally, the MKF requires the CDLM structure, which is not applicable to all problems of interest. Let Y t = ( y , y 1 ,..., y t ) and let L t = ( l , l 1 ,..., l t ). By recursively generating a set of properly weighted random samples to represent p ( L t Y t ), the MKF approximates the target distribution p( x t / Y t ) by a random mixture of Gaussian distributions Equation 8.105
where is obtained by implementing a Kalman filter for the given indicator trajectory . Thus, a key step in the MKF is the production at time t of the weighted samples of indicators, , based on the set of samples, , at the previous time ( t - 1) according to the following algorithm. Algorithm 8.10: [Mixture Kalman filter for conditional dynamical linear models] For j = 1,... m :
Equation 8.106
The MKF can be extended to handle the partial CDLM , where the state variable has a linear component and a nonlinear component. See [75] for a detailed treatment of the MKF and the extended MKF. |