high dimension Metropolis-Hastings algorithms

(Apologies, but I thought this posting was still in draft and had not been sent out. It was, by my apparently clicking the wrong button. The writing was hastily done, with the intent of finishing today. In fact, as written, it was wrong. Corrections are now included below.)

Excerpt:

… If attempting to simulate from a multivariate standard normal distribution in a large dimension, when starting from the mode of the target, i.e., its mean γ, leaving the mode γ is extremely unlikely, given the huge drop between the value of the density at the mode γ and at likely realisations …

Source: high dimension Metropolis-Hastings algorithms, by Professor Christian Robert

While this is an intriguing and important observation, in retrospect, it seems to me to be another kind of the dimensionality curse. As the number of dimensions increases, the hypervolume of a mode, even if a Gaussian, gets smaller as expressed as a fraction of the total volume. Surely, a properly formed Markov Chain Monte Carlo (“MCMC”) will eventually defeat it and find the rest of the surface, but this could take an arbitrarily long time.

I’ve always seen the purpose of MCMC to explore the posterior, not to find a feature like a mode. So, yes, the behavior cited, if it proves out with standard proposals, is disappointing. Perhaps an alternative would be to use stochastic search algorithms, like James Spall’s SPSA. I’m not saying they would definitely have an easier time here: I have not done the experiment. But SPSA’s proposals are quite different than the usual proposals of MCMC and friends (see Section 7.3 of Introduction to Stochastic Search and Optimization, 2003, especially the paragraphs discussing Figure 7.2 on page 185). Indeed, at least for SPSA, proposals which are symmetric Gaussians and Uniforms don’t work well, and can be theoretically shown (there) to fail certain an inverse moment conditions in Spall’s inequality (7.8). SPSA’s best proposals are U-shaped.

SPSA attempts to estimate a stochastic gradient. In fact, I suspect, but have not yet proved, that there is a connection between this kind of gradient estimation and the gradient descent boosting of Friedman (1999).

The situation which Professor Robert describes is another instance where, in practice, it’s a good idea to initialize the MCMC from several places. Of course, it might, then, have a difficult time finding the mode, if the plateau around has a shallow gradient.

About hypergeometric

See http://www.linkedin.com/in/deepdevelopment/ and http://667-per-cm.net
This entry was posted in Bayes, Bayesian, Bayesian inversion, boosting, chance, Christian Robert, computation, ensembles, Gibbs Sampling, James Spall, Jerome Friedman, Markov Chain Monte Carlo, mathematics, maths, MCMC, Monte Carlo Statistical Methods, multivariate statistics, numerical software, numerics, optimization, reasonableness, Robert Schapire, SPSA, state-space models, statistics, stochastic algorithms, stochastic search, stochastics, Yoav Freund. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s