(Apologies, but I thought this posting was still in draft and had not been sent out. It was, by my apparently clicking the wrong button. The writing was hastily done, with the intent of finishing today. In fact, as written, it was wrong. Corrections are now included below.)
Excerpt:
… If attempting to simulate from a multivariate standard normal distribution in a large dimension, when starting from the mode of the target, i.e., its mean γ, leaving the mode γ is extremely unlikely, given the huge drop between the value of the density at the mode γ and at likely realisations …
Source: high dimension Metropolis-Hastings algorithms, by Professor Christian Robert
While this is an intriguing and important observation, in retrospect, it seems to me to be another kind of the dimensionality curse. As the number of dimensions increases, the hypervolume of a mode, even if a Gaussian, gets smaller as expressed as a fraction of the total volume. Surely, a properly formed Markov Chain Monte Carlo (“MCMC”) will eventually defeat it and find the rest of the surface, but this could take an arbitrarily long time.
I’ve always seen the purpose of MCMC to explore the posterior, not to find a feature like a mode. So, yes, the behavior cited, if it proves out with standard proposals, is disappointing. Perhaps an alternative would be to use stochastic search algorithms, like James Spall’s SPSA. I’m not saying they would definitely have an easier time here: I have not done the experiment. But SPSA’s proposals are quite different than the usual proposals of MCMC and friends (see Section 7.3 of Introduction to Stochastic Search and Optimization, 2003, especially the paragraphs discussing Figure 7.2 on page 185). Indeed, at least for SPSA, proposals which are symmetric Gaussians and Uniforms don’t work well, and can be theoretically shown (there) to fail certain an inverse moment conditions in Spall’s inequality (7.8). SPSA’s best proposals are U-shaped.
SPSA attempts to estimate a stochastic gradient. In fact, I suspect, but have not yet proved, that there is a connection between this kind of gradient estimation and the gradient descent boosting of Friedman (1999).
The situation which Professor Robert describes is another instance where, in practice, it’s a good idea to initialize the MCMC from several places. Of course, it might, then, have a difficult time finding the mode, if the plateau around has a shallow gradient.