Alexander Terenin

How to show that a Markov chain converges quickly

Markov chains appear everywhere: they are used as a computational tool within Bayesian statistics, and as a theoretical tool in areas such as optimal control and reinforcement learning. Conditions under which a general Markov chain eventually converges to a stationary distribution are well-studied, and can largely be considered classical results. These are, informally, as follows.

  • $\phi$-irreducibility: the chain can eventually reach any region of the state space with positive measure under $\phi$, from any starting point.
  • Aperiodicity: the chain does not cycle among a collection of disjoint regions at fixed intervals.
  • Harris recurrence: the chain returns to any region of positive measure infinitely often, from any starting point.

Henceforth, let $X_t$ be the chain, and $\pi$ its stationary distribution of interest. Some care is needed when defining these conditions formally1 for general Markov chains: for example, for a chain defined over an uncountable state space, any given point will have probability zero. To formulate irreducibility precisely, we need to talk about sets of states rather than individual states, by, say, introducing a topology.
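
Before moving on, it may help to see the finite-state case, where these conditions reduce to linear algebra. Below is a minimal Python sketch, with a transition matrix made up purely for illustration: irreducibility becomes reachability between all pairs of states, the period is the gcd of the possible return times to a state, and the stationary distribution is a left eigenvector of the transition matrix. Harris recurrence holds automatically for finite irreducible chains.

```python
import numpy as np
from math import gcd
from functools import reduce

# A made-up 3-state transition matrix, for illustration only.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
n = P.shape[0]

# Phi-irreducibility, finite-state version: every state reaches every
# other state, i.e. (I + P)^(n-1) has all entries positive.
irreducible = bool((np.linalg.matrix_power(np.eye(n) + P, n - 1) > 0).all())

# Period of state 0: gcd of the times at which a return is possible
# (checking a finite horizon is a heuristic, fine for small chains).
return_times = [t for t in range(1, 4 * n)
                if np.linalg.matrix_power(P, t)[0, 0] > 0]
period = reduce(gcd, return_times)  # period 1 means aperiodic

# Stationary distribution: left eigenvector of P with eigenvalue 1,
# normalized to sum to one.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

print(irreducible, period, pi, np.allclose(pi @ P, pi))
```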

Convergence to stationarity in finite time

Proving convergence of a Markov chain is often the first step taken as part of a larger analysis. In fact, this can be done for broad classes of chains—for instance, for Metropolis-Hastings chains with appropriately well-behaved transition proposals.2 This makes stationarity analysis of MCMC algorithms used in Bayesian statistics close to a non-issue.
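
For concreteness, here is a minimal random-walk Metropolis-Hastings sketch, with a Gaussian target and step size chosen purely as placeholders. A proposal like this one, whose density is positive everywhere, is one example of the well-behaved class these convergence results cover.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Placeholder target: standard Gaussian log-density, up to a constant.
    return -0.5 * x * x

def random_walk_metropolis(log_target, x0=0.0, steps=10_000, step_size=1.0):
    x = x0
    samples = np.empty(steps)
    for t in range(steps):
        # Symmetric Gaussian proposal: its density cancels in the ratio.
        proposal = x + step_size * rng.standard_normal()
        # Accept with probability min(1, target(proposal) / target(x)).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        samples[t] = x
    return samples

samples = random_walk_metropolis(log_target)
print(samples.mean(), samples.std())  # should be near 0 and 1
```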

Unfortunately, from convergence of a Markov chain alone, we can say little about the chain’s distribution after a given number of steps. This makes it interesting to study chains which are geometrically ergodic—such chains converge at a prescribed rate given by

$$
\Vert P^t(\mu) - \pi\Vert_{\operatorname{TV}} \leq c_\mu \rho^t
$$

where $\mu$ is the initial distribution, $P$ is the chain’s Markov operator, $t$ is the current iteration, $\pi$ is the stationary distribution, $c_\mu$ and $0 < \rho < 1$ are constants, and $\Vert\cdot\Vert_{\text{TV}}$ is the total variation norm over the Banach space of signed measures.
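
To see what this looks like numerically, consider the Gaussian autoregression $X_{t+1} = a X_t + \sqrt{1 - a^2}\epsilon_t$ with $\epsilon_t \sim \mathrm{N}(0,1)$ and $|a| < 1$: its stationary distribution is $\mathrm{N}(0,1)$, and starting from a point mass at $x_0$, the $t$-step distribution is $\mathrm{N}(a^t x_0, 1 - a^{2t})$ in closed form. The sketch below, with made-up constants, evaluates the total variation distance on a grid and shows it decaying geometrically in $t$.

```python
import numpy as np
from scipy.stats import norm

a, x0 = 0.9, 5.0  # autoregression coefficient and starting point (made up)
grid = np.linspace(-12.0, 12.0, 100_001)
dx = grid[1] - grid[0]

def tv_to_stationary(t):
    # After t steps from a point mass at x0, the chain is distributed
    # N(a^t x0, 1 - a^(2t)); the stationary distribution is N(0, 1).
    p = norm.pdf(grid, loc=a**t * x0, scale=np.sqrt(1.0 - a**(2 * t)))
    q = norm.pdf(grid)
    # Total variation distance: half the L1 distance between densities.
    return 0.5 * np.abs(p - q).sum() * dx

for t in [1, 5, 10, 20, 40]:
    print(t, tv_to_stationary(t))  # decays roughly like a constant times 0.9^t
```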

Convergent chains that are not geometrically ergodic are not well-behaved. Such chains converge more slowly than any geometric rate, and functionals $f(X_t)$ of their output don’t necessarily satisfy a Central Limit Theorem. In particular, the distribution of $f(X_t)$ can be heavy-tailed, which can cause the asymptotic variance of $\frac{1}{T} \sum_{t=1}^T f(X_t)$ to be infinite even if $\mathbb{E}_{x\sim\pi}(f(x))$ is finite. This can be problematic.

Minorizing measures and regeneration

How does one know whether or not a chain is geometrically ergodic? To build intuition, let’s first consider a simpler case, where $c_\mu = c$ is constant with respect to the initial distribution $\mu$. Such a chain is called uniformly ergodic. This occurs, for instance, for well-behaved chains defined over a compact state space, and we comment on the general case once the issues in this setting are clear.

We begin by considering two chains, $X_t$ and $Y_t$, with the same stationary distribution $\pi$. We think of $X_t$ as the chain of interest with initial distribution $\mu$, and $Y_t$ as an auxiliary chain. Now, we make two assumptions.

  1. $X_t$ and $Y_t$ share the same random number generator.
  2. $Y_t$ starts from the stationary distribution $\pi$.

This setup can be visualized via the diagram

$$
\begin{array}{ccccccccc}
\mu & \sim & X_1 & \to & X_2 & \to & X_3 & \to & \cdots
\\
& & \vert & & \vert & & \vert & &
\\
\pi & \sim & Y_1 & \to & Y_2 & \to & Y_3 & \to & \cdots
\end{array}
$$

where vertical bars indicate shared random numbers, and we’ve used the distribution symbol $\sim$ in the opposite of its usual order.

Given this setup, if at some step both chains draw from the same distribution during their one-step-ahead transitions, then since they share random numbers, they also draw the same value and coincide from that step onward. We can then conclude that $X_t \sim Y_t \sim \pi$, and so the chain has converged.

How is this condition ever possible? Suppose that, whatever the current state, we can write the one-step-ahead transition distribution as a mixture of a fixed distribution $\nu$ with mixture probability $\gamma$, and some other state-dependent distribution with probability $(1-\gamma)$. Since the chains share random numbers, they can also share the coin flip that selects the mixture component, so at each time point, with probability $\gamma$, both chains draw from the same distribution $\nu$. Let’s draw a picture.

[Figure: two overlapping one-step-ahead transition densities, with the shared shaded region of mass $\gamma$ representing the minorizing measure $\nu$.]

Here, both one-step-ahead transition distributions possess an overlapping shaded region. The probability of each chain landing in this region is $\gamma$, and the distribution within that region is $\nu$. We say that $\nu$ is a minorizing measure.3 It can be shown4 that to prove uniform ergodicity, it suffices to exhibit such a minorizing measure: since the chains fail to couple at each step with probability at most $1-\gamma$, this yields the explicit bound $\Vert P^t(\mu) - \pi\Vert_{\operatorname{TV}} \leq (1-\gamma)^t$.
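
To see the coupling argument in code, here is a minimal sketch of a chain whose transition kernel is built explicitly as such a mixture, with all constants made up: with probability $\gamma$ the chain draws from $\nu = \mathrm{N}(0,1)$ regardless of its current state, and otherwise it makes a state-dependent autoregressive move. Two copies share the mixture coin and the $\nu$ draw, so they coalesce the first time the coin selects $\nu$, and the probability that this hasn’t happened by time $t$ is exactly $(1-\gamma)^t$.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, a = 0.25, 0.8  # minorization weight and residual autoregression (made up)

def coupled_step(x, y):
    # Shared coin: with probability gamma, both chains make the same
    # draw from the minorizing measure nu = N(0, 1) and so coalesce.
    if rng.uniform() < gamma:
        z = rng.standard_normal()
        return z, z, True
    # Otherwise both move by the state-dependent residual kernel,
    # driven by a shared innovation.
    eps = rng.standard_normal()
    return a * x + eps, a * y + eps, False

def coupling_time(x0, y0, max_steps=1_000):
    x, y = x0, y0
    for t in range(1, max_steps + 1):
        x, y, coupled = coupled_step(x, y)
        if coupled:
            return t
    return max_steps

times = np.array([coupling_time(5.0, -5.0) for _ in range(10_000)])
for t in [1, 2, 5, 10]:
    # Empirical P(not yet coupled by time t) against the bound (1 - gamma)^t.
    print(t, (times > t).mean(), (1 - gamma) ** t)
```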

For a non-compact state space and arbitrary current states, such a measure can be impossible to find. However, by virtue of convergence, our Markov chain should eventually spend most of its time in a compact subset of the state space. One can make this intuition precise, and develop techniques for proving geometric ergodicity, by introducing a suitable Lyapunov drift condition1 which, once satisfied, largely reduces the analysis to the preceding case.
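
For concreteness, the geometric drift condition used in this program, in the formulation of Meyn and Tweedie,1 asks for a function $V \geq 1$, constants $\lambda < 1$ and $b < \infty$, and a small set $C$ on which a minorization condition holds, such that

$$
\mathbb{E}(V(X_{t+1}) \mid X_t = x) \leq \lambda V(x) + b \mathbf{1}_C(x).
$$

Outside $C$, the chain drifts back toward $C$ geometrically fast; inside $C$, the coupling construction above takes over.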

Once a minorizing measure is known to exist, there will be a set of random times at which the chain draws from the minorizing measure and forgets its current location. This allows the analysis of the averages $\frac{1}{T} \sum_{t=1}^T f(X_t)$ to be reduced, in some sense, to the IID case. Following this line of thought, one can eventually prove a Central Limit Theorem for the ergodic averages, even though successive $f(X_t)$ are not independent. The study of techniques originating from this observation is called regeneration theory.
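
Continuing the mixture-kernel sketch from above, with the same made-up constants: every time the $\gamma$-coin fires, the chain draws from $\nu$ and forgets its past, so the trajectory splits into IID tours. The sketch below forms the regenerative ratio estimator of $\mathbb{E}_{x\sim\pi}(f(x))$ together with a CLT-style variance estimate built entirely from the tour sums.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, a = 0.25, 0.8  # same made-up mixture kernel as in the previous sketch

def step(x):
    # With probability gamma the next state is a fresh draw from
    # nu = N(0, 1), independent of the past: a regeneration.
    if rng.uniform() < gamma:
        return rng.standard_normal(), True
    return a * x + rng.standard_normal(), False

def tour_sums(f, steps=200_000):
    # Split the trajectory into tours delimited by regenerations,
    # recording the sum of f and the length of each tour.
    x, s, length = 0.0, 0.0, 0
    sums, lengths = [], []
    for _ in range(steps):
        x, regenerated = step(x)
        if regenerated and length > 0:
            sums.append(s)
            lengths.append(length)
            s, length = 0.0, 0
        s += f(x)
        length += 1
    # Drop the initial segment: it starts from x0 rather than from a
    # nu draw, so it is not a proper tour.
    return np.array(sums[1:]), np.array(lengths[1:])

sums, lengths = tour_sums(np.square)
# Regenerative ratio estimator of E_pi[f]: IID tour sums over IID lengths.
mean_est = sums.sum() / lengths.sum()
# Asymptotic variance in the CLT sqrt(T) (mean_est - mean) -> N(0, var),
# estimated from the IID tours.
centered = sums - mean_est * lengths
var_est = (centered ** 2).mean() / lengths.mean()
print(mean_est, var_est)
```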

Concluding remarks

Here, we examined one technique by which it’s possible to prove that a Markov chain converges quickly. The analysis is general and provides insights of practical interest in Bayesian models. For instance, when using a Gibbs sampler for a hierarchical model, one can examine the full conditionals to see whether a minorization condition holds. Even if the minorization constant is never calculated, this gives some idea of the chain’s numerical performance before ever running it. In a practical application, this can be useful.
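
As an illustration of that last point, consider a two-block Gibbs sampler for a made-up Beta-Binomial pair: $\theta \sim \operatorname{Beta}(a, b)$, $x \mid \theta \sim \operatorname{Binomial}(n, \theta)$, so the full conditional is $\theta \mid x \sim \operatorname{Beta}(x + a, n - x + b)$. The marginal $\theta$-chain is itself Markov, and since its one-step transition density is a Binomial-weighted average of the finitely many full conditionals, the pointwise minimum of those conditionals lower-bounds it uniformly in the current state. The sketch below computes the mass $\gamma$ of that lower envelope on a grid.

```python
import numpy as np
from scipy.stats import beta

# Made-up hyperparameters: theta ~ Beta(a, b), x | theta ~ Binomial(n, theta),
# so the full conditional is theta | x ~ Beta(x + a, n - x + b).
a, b, n = 2.0, 2.0, 10
grid = np.linspace(1e-6, 1 - 1e-6, 100_001)
dx = grid[1] - grid[0]

# One full-conditional density per possible conditioning state x = 0, ..., n.
conditionals = np.stack([beta.pdf(grid, x + a, n - x + b) for x in range(n + 1)])

# The theta-chain's kernel is an average of these densities, so their
# pointwise minimum is an unnormalized minorizing measure; its total
# mass is the minorization constant gamma.
lower_envelope = conditionals.min(axis=0)
gamma = lower_envelope.sum() * dx
print(gamma)
```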

Other techniques for analyzing the mixing rate of a Markov chain are also available. In particular, reversible chains possess Markov operators which are self-adjoint: this allows one to study their spectral properties and relate them to mixing times.5 I hope this article has provided a useful overview of why finite-time mixing properties matter, and illustrated the key idea behind minorization techniques.

References

1. S. Meyn and R. Tweedie. Markov Chains and Stochastic Stability. 1993.

2. G. Roberts and J. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 2004.

3. Rather than working with a pair $(\gamma,\nu)$, where $\gamma\in[0,1]$ and $\nu$ is a probability measure, most technical papers equivalently work with a single finite measure. We use $(\gamma,\nu)$ here because it is intuitive and lends itself well to visualization.

4. J. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90(430), 1995.

5. G. Roberts and J. Rosenthal. Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2(2), 1997.