Alexander Terenin

How to show that a Markov chain converges quickly

Markov chains appear everywhere: they are used as a computational tool within Bayesian statistics, and a theoretical tool in other areas such as optimal control and reinforcement learning. Conditions under which a general Markov chain eventually converges to a stationary distribution are well-studied, and can largely be considered classical results. These are, informally, as follows.

  • $\phi$-irreducibility: the chain can transition from any state into any other state.
  • Aperiodicity: the chain's transition behavior is the same at time $t$ as at time $t+1$ (see the small example after this list).
  • Harris recurrence: the chain returns to some set infinitely often.
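To see why aperiodicity matters, here is a minimal numerical illustration; the two-state matrices are toy choices of mine, not from the original text.

```python
import numpy as np

# A two-state chain that flips deterministically: it is irreducible but
# has period 2, so its distribution never settles down.
P_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])

# A "lazy" version that stays put with probability 1/2 is aperiodic,
# and converges to the stationary distribution (1/2, 1/2).
P_lazy = 0.5 * np.eye(2) + 0.5 * P_periodic

for name, P in [("periodic", P_periodic), ("lazy", P_lazy)]:
    dist = np.array([1.0, 0.0])  # start in state 0 with certainty
    for _ in range(11):
        dist = dist @ P
    # periodic: all mass still on a single state (it alternates forever);
    # lazy: very close to (0.5, 0.5).
    print(name, dist)
```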

Henceforth, let $X_t$ be the chain, and $\pi$ its stationary distribution of interest. Some care is needed when defining these conditions formally¹ for general Markov chains: for example, in a chain defined over an uncountable state space, any given point will have probability zero. To formulate irreducibility precisely, we need to talk about sets of states by, say, introducing a topology.
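For concreteness, one standard formulation, following Meyn and Tweedie¹, is that a chain is $\phi$-irreducible if there exists a nonzero $\sigma$-finite measure $\phi$ on the state space such that, for every starting point $x$ and every measurable set $A$,

$$\phi(A) > 0 \implies P^t(x, A) > 0 \text{ for some } t \geq 1.$$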

Convergence to stationarity in finite time

Proving convergence of a Markov chain is often the first step taken as part of a larger analysis. In fact, this can be done for broad classes of chains, for instance Metropolis-Hastings chains with appropriately well-behaved transition proposals.² This makes stationarity analysis of the MCMC algorithms used in Bayesian statistics close to a non-issue.
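As an aside, here is a minimal random-walk Metropolis sketch of the kind of chain this refers to; the standard normal target and the tuning constants are illustrative assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(1)

def rwm_step(x, log_pi, scale=1.0):
    # One random-walk Metropolis step targeting the density exp(log_pi).
    # The proposal is a symmetric Gaussian, so the acceptance probability
    # reduces to the ratio of target densities.
    proposal = x + scale * rng.normal()
    if np.log(rng.uniform()) < log_pi(proposal) - log_pi(x):
        return proposal  # accept
    return x             # reject: stay at the current state

log_pi = lambda x: -0.5 * x**2  # standard normal target, for illustration
x, samples = 10.0, []           # deliberately poor starting point
for _ in range(5000):
    x = rwm_step(x, log_pi)
    samples.append(x)
print(np.mean(samples[1000:]))  # near 0 once the chain has converged
```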

Unfortunately, from convergence of a Markov chain alone, we can say little about the chain's distribution after any given number of steps. This makes it interesting to study chains which are geometrically ergodic: such chains converge at a prescribed rate, given by

$$\Vert P^t(\mu) - \pi \Vert_{\operatorname{TV}} \leq c_\mu \rho^t$$

where $\mu$ is the initial distribution, $P$ is the chain's Markov operator, $t$ is the current iteration, $\pi$ is the stationary distribution, $c_\mu$ and $0 < \rho < 1$ are constants, and $\Vert\cdot\Vert_{\operatorname{TV}}$ is the total variation norm over the Banach space of signed measures.
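This geometric decay is easy to observe numerically on a small finite-state chain, where $\pi$ is available exactly as the leading left eigenvector of the transition matrix. The matrix below is an arbitrary illustrative choice.

```python
import numpy as np

# A small irreducible, aperiodic transition matrix (rows sum to one);
# every such finite-state chain is geometrically ergodic.
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

# Stationary distribution: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

dist = np.array([1.0, 0.0, 0.0])  # initial distribution: a point mass
for t in range(1, 21):
    dist = dist @ P
    tv = 0.5 * np.abs(dist - pi).sum()  # total variation distance
    print(t, tv)  # decays roughly like rho**t for some rho < 1
```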

Convergent chains that are not geometrically ergodic are not well-behaved. Such chains can converge arbitrarily slowly, and functionals $f(X_t)$ of their output don't necessarily satisfy a Central Limit Theorem. In particular, the distribution of $f(X_t)$ can be heavy-tailed, which can cause the variance of $\frac{1}{T} \sum_{t=1}^T f(X_t)$ to be infinite even if $\mathbb{E}_{x\sim\pi} f(x)$ is finite. This can be problematic.

Minorizing measures and regeneration

How does one know whether or not a chain is geometrically ergodic? For simplicity, let's first consider the case where we set $c_\mu = c$ to be constant with respect to the initial distribution $\mu$. Such a chain is called uniformly ergodic. This typically occurs for chains defined over a compact state space, and we comment on the general case once the issues in this setting are clear.

We begin by considering two chains, $X_t$ and $Y_t$, with the same stationary distribution $\pi$. We think of $X_t$ as the chain of interest with initial distribution $\mu$, and of $Y_t$ as an auxiliary chain. Now, we make two assumptions.

  1. $X_t$ and $Y_t$ share the same random number generator.
  2. $Y_t$ starts from the stationary distribution $\pi$.

This setup can be visualized via the diagram

$$\begin{aligned} &\mu & &\sim & &X_1 & &\to & &X_2 & &\to & &X_3 & &\to & &\dots \\ & & & & &\,\,| & & & &\,\,| & & & &\,\,| \\ &\pi & &\sim & &Y_1 & &\to & &Y_2 & &\to & &Y_3 & &\to & &\dots \end{aligned}$$

where vertical bars indicate shared random numbers, and we've used the identically-distributed symbol $\sim$ in the opposite of its usual order.

Given this setup, if at some step both chains make their one-step-ahead draw from the same distribution, then, because they share a random number generator, they produce the same value and remain equal thereafter. We can then conclude that $X_t \sim Y_t \sim \pi$, and so the chain has converged.

How is this condition ever possible? Suppose that, whatever the current state, we can write the one-step-ahead transition distribution of $X_t$ as a mixture of a distribution $\nu$, with mixture probability $\gamma$, and some other distribution, with probability $1-\gamma$. Then at each time point, with probability $\gamma^2$, both chains draw from the same distribution $\nu$. Let's draw a picture; a simulation sketch follows below.

[Figure: a minorizing measure, shown as the shaded overlap between two one-step-ahead transition distributions.]

Here, both one-step-ahead transition distributions possess an overlapping shaded region. The probability of each chain landing in this region is $\gamma$, and the distribution within that region is $\nu$. We say that $\nu$ is a minorizing measure.³ It can be shown⁴ that, to prove uniform ergodicity, it suffices to exhibit such a minorizing measure.
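Concretely, if the one-step minorization $P(x, \cdot) \geq \gamma\,\nu(\cdot)$ holds at every state $x$, then by using a single shared coin to decide whether both chains draw from $\nu$, the per-step coupling probability improves from $\gamma^2$ to $\gamma$, and the coupling argument yields the classical quantitative bound⁴

$$\Vert P^t(\mu) - \pi \Vert_{\operatorname{TV}} \leq (1-\gamma)^t$$

so the chain is uniformly ergodic with $\rho = 1 - \gamma$.

To make the construction concrete, here is a minimal simulation sketch. The kernel is a toy assumption chosen so that the mixture decomposition holds by construction: with probability $\gamma$ the chain draws from $\nu = \mathcal{N}(0,1)$ regardless of its current state, and otherwise it makes an autoregressive move. Before coupling, the two chains flip independent coins, matching the $\gamma^2$ argument above.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.3  # minorization constant: mixture weight on nu

def step(x, coin, noise):
    # One transition of the mixture kernel: with probability gamma, draw
    # from nu = N(0, 1) regardless of the current state; otherwise make
    # a move from the residual kernel N(0.5 * x, 1).
    return noise if coin < gamma else 0.5 * x + noise

# X starts far from stationarity; Y stands in for a chain started at pi
# (exact stationarity is not needed to illustrate the coupling time).
x, y = 50.0, rng.normal()
coupled_at = None
for t in range(1, 500):
    noise = rng.normal()  # shared random numbers
    cx = rng.uniform()
    # Independent coins before coupling; a shared coin afterwards, so
    # that the two chains remain equal once they have met.
    cy = cx if coupled_at else rng.uniform()
    x, y = step(x, cx, noise), step(y, cy, noise)
    if coupled_at is None and cx < gamma and cy < gamma:
        coupled_at = t  # both drew from nu with the same noise: X == Y
print("coupled at step:", coupled_at)
```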

For a non-compact state space and arbitrary current states, such a measure can be impossible to find. However, by virtue of convergence, our Markov chain should eventually spend most of its time in a compact subset of the state space. One can make this intuition precise, and develop techniques for proving geometric ergodicity, by introducing a suitable Lyapunov drift condition¹ which, once satisfied, largely reduces the analysis to the preceding case.
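One common form of such a drift condition, following Meyn and Tweedie¹, asks for a function $V \geq 1$, constants $\lambda < 1$ and $b < \infty$, and a small set $C$ on which a minorization condition holds, such that

$$(PV)(x) = \mathbb{E}\big[V(X_{t+1}) \mid X_t = x\big] \leq \lambda V(x) + b\,\mathbf{1}_C(x).$$

Outside of $C$, the expected value of $V$ contracts geometrically, so the chain keeps drifting back to the region where the coupling argument applies.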

Once one has the existence of a minorizing measure, there will be a set of random times at which the chain draws from the minorizing measure and forgets its current location. This allows the analysis of the averages $\frac{1}{T} \sum_{t=1}^T f(X_t)$ to be reduced, in some sense, to the IID case. Following this line of thought, one can eventually prove a Central Limit Theorem for the ergodic averages, even though successive $f(X_t)$ are not independent. The study of techniques originating from this observation is called regeneration theory.
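Under geometric ergodicity and suitable moment conditions on $f$ (see Roberts and Rosenthal² for precise statements), the resulting Central Limit Theorem takes the form

$$\sqrt{T}\left(\frac{1}{T}\sum_{t=1}^T f(X_t) - \mathbb{E}_{x\sim\pi} f(x)\right) \xrightarrow{d} \mathcal{N}(0, \sigma_f^2), \qquad \sigma_f^2 = \operatorname{Var}_\pi f(X_0) + 2\sum_{k=1}^\infty \operatorname{Cov}_\pi\big(f(X_0), f(X_k)\big),$$

where the autocovariance sum accounts for the dependence between successive draws.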

Concluding remarks

Here, we examined one technique by which it's possible to prove that a Markov chain converges quickly. The analysis is general and provides insights of practical interest in Bayesian models. For instance, when using a Gibbs sampler for a hierarchical model, one can examine the full conditionals to see whether or not a minorization condition is present. Even if the minorization constant is never calculated, this gives some idea of the chain's numerical performance before ever running it, which can be useful in practical applications.

Other techniques for analyzing the mixing rate of a Markov chain are also available. In particular, reversible chains possess Markov operators which are self-adjoint: this allows one to study their spectral properties and relate them to mixing times,⁵ as in the sketch below.
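As a small illustration of the spectral viewpoint, the following builds a reversible chain, a lazy random walk on a three-state path chosen purely for illustration, and reads off its convergence rate from the second-largest eigenvalue modulus.

```python
import numpy as np

# A reversible chain: a lazy random walk on the path 0 - 1 - 2. For
# reversible chains the Markov operator is self-adjoint, so the
# eigenvalues are real, and the rate rho is governed by the
# second-largest eigenvalue in absolute value.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

evals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
rho = evals[1]
print("convergence rate rho:", rho)        # here, 0.5
print("spectral gap 1 - rho:", 1.0 - rho)  # larger gap: faster mixing
```

I hope this article has provided a useful overview of the need to consider finite-time mixing properties, and illustrated the key idea behind minorization techniques.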

References

1. S. Meyn and R. Tweedie. Markov Chains and Stochastic Stability. 1993.

2. G. Roberts and J. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 2004.

3. Rather than working with a pair $(\gamma, \nu)$, where $\gamma \in [0,1]$ and $\nu$ is a probability measure, most technical papers equivalently work with a single finite measure. We use $(\gamma, \nu)$ here because it is intuitive and lends itself well to visualization.

4. J. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association 90(430), 1995.

5. G. Roberts and J. Rosenthal. Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability 2(2), 1997.