Modern treatments of Bayesian integration to obtain posterior densities often use some form of Markov Chain Monte Carlo (“MCMC”), typically Gibbs sampling. Gibbs works well with many Bayesian hierarchical models. The standard problem-solving situation with these is that a dataset is given, and the application domain suggests a model from which to form a likelihood, along with constraints on parameters. The model can be thorny, too complicated to use a simple conjugate prior for the likelihood, and so a Bayesian integration is required. There may be nuisance parameters. It may be necessary to construct the likelihood in parts, using a hierarchy of sampling densities and then some chain of conditional probabilities for each of its parameters, with their parameters having hyperpriors. It is basically a closed-form problem, apart from choice of the form of priors and the computational difficulties implementing it. (Rodriguez argues that poor choice of priors is common and contributes significantly to these computational difficulties. That may be so, when we have confidence in the likelihood function. Somes there isn’t much.) In another context, however, sometimes the data and observations come in singletons or small batches, and arrive iteratively. Sometimes there are multiple sources of data, each different. When the observation set, arrives, we want to estimate the (multivariate) posterior, .
There are a number of ways to approach this. If the statistics of observations vary, it may be possible to drape a generic sampling density about them, like a Gaussian, and do updates using state-space estimation. But there’s another way, and it’s kind of fun. This approach simulates the Bayesian update in an analog process where probability densities are imitated by a large but finite collection of particles, each assigned a probability mass. The distribution of particles over the support of the problem at each step describes the posterior, and all parts of the simulation are done by putting particles where a measurement says the value should be, adjusting their weights with a likelihood, and then resampling the result so each particle gets the same weight, but the particles are placed in the space of support with a density that’s proportional to the derived prior.
It’s the purpose of this post to illustrate this process in a simple case. There are good overviews available elsewhere, by Chen, and by Arulampalam, Maskell, Gordon, and Clapp. Doucet and Johansen have a more technical tutorial available, dating from 2008. These methods tend to be robust, and there are variety of ways to implement their steps. Computation is also bounded in that it is proportional to the number of particles. That can be misleading because, especially for high dimensional problems, the number of particles needed to get good resolution of the posterior grows exponentially in the number of dimensions. Still, the methods are useful for small dimensional problems, often related to position location and tracking, and similar. (High dimensions also can suffer collapse.)
There may also be another connection, which I will remark upon at the very end.
Location on the Unit Circle
As mentioned, position-finding is a typical application for particle filters. Rather than complicate this presentation with full spherical geometry, this illustration limits the position finding to location of a target in longitude only. That makes the figures easier to draw, as well as presentations of the particles and their probability masses. So the figures below are as if they were slices through a spherical Earth model along some latitude, although, actually, if estimating longitude is what’s wanted, the unit circle represents all latitudes. The code to generate these figures is also available, written in R. Note that there is also an R package called SMC available from CRAN, by Gopi Goswami of Harvard, and a YouTube video by Andreas Svensson illustrating use of particle filters for location-finding. This presentation uses custom code, to illustrate the steps simply and directly.
First, suppose there’s a single target at some longitude out there which we do not know. To start, suppose we have census information about the class of things of which this target is a member, so there’s a density over latitude of where this target might be, before any measurements are available. This is our Bayesian prior.
In the figure the red line marked with the red “T” is the true longitude of the target, something we don’t know but are trying to find out. Also this prior density wrapped around the unit circle may look unreasonable, as it has many bumps on it. (It’s actually a raised and normalized sinc function.) It’s used because I want to illustrate that these densities need not be unimodal or anything simple, even if Gaussians are used in updating rules.
As is usual with WordPress images, if you’d like to see a bigger version of the image, just click on the picture. To get back, just use your browser buttons to return.
Around the outside of the circle there is a train of blue dots at varying distances from the circle’s circumference. They are connected to the circle with grey lines. The length of these lines is proportional to the probability mass attached to each dot. The dots are particles. The length of the lines are thus proportional to the prior probability that the target is located at the corresponding longitude. Note that the resolution of location is reasonably coarse. That is, each particle really represents a set or interval of longitudes. In time, if the measurements warrant it, a fixed set of particles can be attached to a smaller range of longitudes, and so, if the measurements warrant it, precision of estimates can get better. This is a byproduct of a resampling step which will be illustrated in short order.
Okay. Now suppose we get our first measurement. Suppose this one is not a direct measurement of longitude of the target, but suppose it is something like a label, perhaps like a shopping receipt from a store where the target was once located. We know the location of the store. Of course, if the target was a person or a truck, it is mobile, and, so, the fact that a retail transaction was made at a place does not at all mean the target is at that place. It’s reasonable to suppose, however, that the target is nearby, since people tend to buy things where they live or work. With that “tag related” information we might get something like this:
Here the green band represents a uniform density estimate of the target’s position based upon the tag. Note that the statement of location is quite large for a shopping tag, but I ask the reader to suspend disbelief and play along. Note that nothing fancy is being done here to estimate a probability density for the measurement, simply a uniform band.
Now, let’s see how the particle filter begins to incorporate this observation.
This filter uses a simple wrapped Gaussian likelihood, where the limits of the measurement are taken to be 3 standard deviations away from the measurement center, and the mass distributed about the unit circle accordingly. The figure illustrates just this likelihood.
Now, to obtain an updated estimate of where the target might be, the prior is combined with this likelihood. In particular, the probability mass for each particle is revised by taking its present probability mass and multiplying by the value of the likelihood at its location. This is a step called pruning. The name is more general than what is illustrated here, because it is also possible to extinguish particles should the resulting probabilities drop below some prespecified threshold.
If this is compared to the census prior, we can see there isn’t much change. This is because the likelihood was so diffuse. There is a change in the vicinity of the measurement estimate, although the prior still controls much of the fine structure.
Now, we could keep adding more and more particles to the mix. Eventually, there would be so many particles that computation would slow down. As a consequence, it makes sense to reallocate the particles so each represents the posterior with equal weight. In particle filtering, that’s done by resampling the density having varying masses assigned to the particles so that the particles are moved to positions where they can have equal weight but appear in spatial density in proportion to the posterior.
Here the green band has been moved up towards the exterior of the particles distribution. This will be the practice here: The original observation and its dispersion will be retained for comparison with the current posterior density.
Now suppose another measurement is received. This time suppose the measurement is quite different. Since the problem is to estimate longitude, suppose the measurement is an estimate of the local time at the target, expressed in Universal Time.
The measurement is again a wide band, this time orange, as if the confidence in the local time is only good to two hours or so. The likelihood corresponding to it is:
Pruning with the likelihood results in:
We can see, noticing the target, that this measure isn’t very good at all. The band corresponding to the measurement doesn’t even include the target. The processing of this measurement indicates that, as long as the dispersion of the measurement is fairly represented, bad measurements which might be considered outliers don’t disrupt the estimation process.
Suppose, now, a second time measurement is taken:
This one has much greater precision, and its likelihood:
reflects that. Pruning produces the sharp:
And now resampling shows a collection of particles beginning to cluster about the true position:
The result is a combination of different kinds of measurements, each having different precisions. Since all that’s being done is placing representative particles on a domain of support, the pattern of placement could be anything, representing nearly any density pattern the student of the problem chooses. It could vary from observation to observation. The masses assigned to the particles could be anything helpful.
This approach is quite general.
Connections with Evolutionary Computation and Genetic Algorithms
Genetic algorithms attempt to use a highly abstracted version of natural selection to find solutions to difficult optimization problems. It’s assumed there is a utility function available, a score of how good some state is. There is a proposal mechanism which generates new forms from existing ones, these forms being considered mutants of members of an existing population. Members are scored according to their relative fitness and members having too little fitness are reduced in number until they are so few, they go extinct. These and other stochastic search algorithms are described in the very fine textbook by Professor James Spall, Introduction to Stochastic Search and Optimization.
Particle filters can be thought of as a kind of genetic algorithm. The probability mass associated with each particle is its fitness measurement. Proposals come from data. The utility function is the likelihood rule, although the likelihood rule can also change from observation to observation, in accordance, for instance, with a differing sampling plan for one observation versus another.
What’s interesting is the degree to which Bayesian inference resembles natural selection, an observation made and recorded by Professors John Baez and Chris Lee. Because those results are connected with insights from information theory and, especially, information geometry, there may be insights available for Bayesian estimation (Matsuzoe).