Alexander Terenin

How to show that a Markov chain converges quickly

Markov chains appear everywhere: they are used as a computational tool within Bayesian statistics, and as a theoretical tool in other areas such as optimal control and reinforcement learning. Conditions under which a general Markov chain eventually converges to a stationary distribution are well-studied, and can largely be considered classical results. These are, informally, as follows.

  • $\phi$-irreducibility: the chain can transition from any state into any other state.
  • Aperiodicity: the chain's transition behavior is the same at time $t$ as at time $t+1$.
  • Harris recurrence: the chain returns to every set of positive $\phi$-measure infinitely often.

Henceforth, let $X_t$ be the chain, and $\pi$ its stationary distribution of interest. Some care is needed when defining these conditions formally[1] for general Markov chains: for example, in a chain defined over an uncountable state space, any given point will have probability zero. To formulate irreducibility precisely, we need to talk about sets of states by, say, introducing a topology.

Convergence to stationarity in finite time

Proving convergence of a Markov chain is often the first step taken as part of a larger analysis. In fact, this can be done for broad classes of chains, for instance for Metropolis-Hastings chains with appropriately well-behaved transition proposals.[2] This makes stationarity analysis of MCMC algorithms used in Bayesian statistics close to a non-issue.

Unfortunately, from convergence of a Markov chain alone, we can say little about the chain's distribution after a given number of steps. This makes it interesting to study chains which are geometrically ergodic. Such chains converge at a prescribed rate, given by

$$\Vert P^t(\mu) - \pi\Vert_{\operatorname{TV}} \leq c_\mu \rho^t$$

where $\mu$ is the initial distribution, $P$ is the chain's Markov operator, $t$ is the current iteration, $\pi$ is the stationary distribution, $c_\mu$ and $0 < \rho < 1$ are constants, and $\Vert\cdot\Vert_{\operatorname{TV}}$ is the total variation norm on the Banach space of signed measures.
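To see this bound in action, here is a minimal sketch in Python; the three-state transition matrix is invented purely for illustration. For a finite state space everything can be computed exactly with linear algebra, and the printed distances shrink by a roughly constant factor per step.

```python
# A minimal sketch: computing ||P^t(mu) - pi||_TV exactly for a small
# finite-state chain. The transition matrix below is invented for illustration.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])  # rows sum to one

# Stationary distribution: the left eigenvector of P with eigenvalue 1.
eigenvalues, eigenvectors = np.linalg.eig(P.T)
pi = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
pi = pi / pi.sum()

mu = np.array([1.0, 0.0, 0.0])  # a point mass: the chain starts in state 0
for t in range(1, 11):
    mu = mu @ P  # after this line, mu holds the distribution of X_t
    tv = 0.5 * np.abs(mu - pi).sum()  # total variation distance to pi
    print(t, tv)
# The ratio of successive distances approaches rho, here the second-largest
# eigenvalue modulus of P, exhibiting the geometric decay c_mu * rho^t.
```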

Convergent chains that are not geometrically ergodic are not well-behaved. Such chains converge more slowly than any geometric rate, and functionals $f(X_t)$ of their output don't necessarily satisfy a central limit theorem. In particular, the distribution of $f(X_t)$ can be heavy-tailed, which can cause the variance of $\frac{1}{T} \sum_{t=1}^T f(X_t)$ to be infinite even if $\mathbb{E}_{x\sim\pi}(f(x))$ is finite. This can be problematic.

Minorizing measures and regeneration

How does one know whether or not a chain is geometrically ergodic? Let's first consider a simpler case, where we take $c_\mu = c$ to be constant with respect to the initial distribution $\mu$. Such a chain is called uniformly ergodic. This will occur, for instance, for a chain defined over a compact state space; we comment on the general case once the issues in this setting are clear.

We begin by considering two chains, $X_t$ and $Y_t$, with the same stationary distribution $\pi$. We think of $X_t$ as the chain of interest with initial distribution $\mu$, and of $Y_t$ as an auxiliary chain. Now, we make two assumptions.

  1. $X_t$ and $Y_t$ share the same random number generator.
  2. $Y_t$ starts from the stationary distribution $\pi$.

This setup can be visualized via the diagram

$$\begin{aligned} &\mu & &\sim & &X_1 & &\to & &X_2 & &\to & &X_3 & &\to & &\ldots \\ & & & & &\,\,\vert & & & &\,\,\vert & & & &\,\,\vert & & & &\,\vert \\ &\pi & &\sim & &Y_1 & &\to & &Y_2 & &\to & &Y_3 & &\to & &\ldots \end{aligned}$$

where vertical bars indicate shared random numbers, and we've written the "distributed as" symbol $\sim$ in the opposite of its usual order.

Given this setup, if at some step both chains draw from the same distribution during their one-step-ahead transitions, then, because they share a random number generator, they draw the same value and coincide from that point onward. Since $Y_t \sim \pi$ at every step, we can conclude that $X_t \sim Y_t \sim \pi$, and so the chain has converged.
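As a concrete illustration, here is a minimal sketch of this construction, reusing the invented three-state chain from the earlier snippet. Driving both chains with the same uniform random numbers through inverse-CDF sampling means that, as soon as they draw the same value, they occupy the same state and agree forever after.

```python
# A minimal sketch of the coupling construction: X_t starts from a point mass,
# Y_t starts from pi, and both consume the same random numbers each step.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
eigenvalues, eigenvectors = np.linalg.eig(P.T)
pi = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
pi = pi / pi.sum()

def step(state, u):
    # Transition by inverting the CDF of row `state` at the shared uniform u.
    return int(np.searchsorted(np.cumsum(P[state]), u))

x = 0                        # X_0 ~ mu, a point mass at state 0
y = rng.choice(3, p=pi)      # Y_0 ~ pi, so Y_t ~ pi for all t
for t in range(1, 100):
    u = rng.random()         # one shared random number per step
    x, y = step(x, u), step(y, u)
    if x == y:
        print(f"chains met at t = {t}; thereafter X_t = Y_t ~ pi")
        break
```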

How is this condition ever possible? Suppose that we can write the chain's one-step-ahead transition distribution as a mixture of a fixed distribution $\nu$, with mixture probability $\gamma$, and some other distribution, with probability $1-\gamma$. Then at each time point, with probability $\gamma^2$, both chains draw from the same distribution $\nu$. Let's draw a picture.

[Figure: minorizing measure]

Here, both one-step-ahead transition distributions possess an overlapping shaded region. The probability of each chain landing in this region is $\gamma$, and the distribution within that region is $\nu$. We say that $\nu$ is a minorizing measure.[3] It can be shown[4] that to prove uniform ergodicity, it suffices to exhibit such a minorizing measure.
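To connect the picture to the convergence bound, here is a minimal sketch in which the mixture structure is built in by hand; $\gamma$, $\nu$, and the residual kernel are all invented for illustration. Since the chains couple with probability $\gamma^2$ at each step, independently of where they currently are, the coupling inequality bounds the total variation distance by $(1-\gamma^2)^t$, a uniform geometric rate.

```python
# A minimal sketch: a transition kernel written explicitly as the mixture
# P(x, .) = gamma * nu + (1 - gamma) * R(x, .), with everything invented
# for illustration. Each chain flips its own coin, but both reuse a single
# shared draw from nu, so per step they couple with probability gamma^2.
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.3
nu = np.array([0.2, 0.5, 0.3])    # the minorizing measure
R = np.array([[0.7, 0.2, 0.1],    # the residual kernel, one row per state
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

def coupled_step(x, y):
    z = rng.choice(3, p=nu)                        # one shared draw from nu
    bx, by = rng.random() < gamma, rng.random() < gamma
    x = z if bx else rng.choice(3, p=R[x])
    y = z if by else rng.choice(3, p=R[y])
    return x, y

# The coupling time is stochastically dominated by a geometric random variable
# with success probability gamma^2, regardless of the two starting states --
# that uniformity over starting states is what makes the bound uniform.
times = []
for _ in range(2000):
    x, y, t = 0, 2, 0
    while x != y:
        x, y = coupled_step(x, y)
        t += 1
    times.append(t)
print("mean coupling time:", np.mean(times), "<= 1/gamma^2 =", 1 / gamma**2)
```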

For a non-compact state space and arbitrary current states, such a measure can be impossible to find. However, by virtue of convergence, our Markov chain should eventually spend most of its time in a compact subset of the state space. One can make this intuition precise, and develop techniques for proving geometric ergodicity, by introducing a suitable Lyapunov drift condition[1] which, once satisfied, largely reduces the analysis to the preceding case.
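For concreteness, one standard way of formulating such a condition, following Meyn and Tweedie,[1] is to exhibit a function $V \geq 1$, constants $0 < \lambda < 1$ and $b < \infty$, and a set $C$ on which a minorization holds, such that

$$\mathbb{E}[V(X_{t+1}) \mid X_t = x] \leq \lambda V(x) + b\,\mathbf{1}_C(x)$$

for all states $x$. Outside of $C$, the value of $V$ then shrinks geometrically in expectation, so the chain repeatedly drifts back to the region where the minorization argument applies.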

Once one has the existence of a minorizing measure, there will be a set of random times at which the chain draws from the minorizing measure and forgets its current location. This allows the analysis of the averages $\frac{1}{T} \sum_{t=1}^T f(X_t)$ to be reduced, in some sense, to the IID case. Following this line of thought, one can eventually prove a central limit theorem for the ergodic averages, even though successive $f(X_t)$ are not independent. The study of techniques originating from this observation is called regeneration theory.
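Here is a minimal sketch of this reduction, reusing the invented mixture kernel from above. The times at which the chain draws from $\nu$ split the trajectory into tours whose sums are IID, and the ergodic average becomes a ratio of IID quantities.

```python
# A minimal sketch of regeneration: whenever the chain draws from nu, it
# forgets its location, so the path splits into IID "tours". The kernel
# (gamma, nu, R) is the invented example from the previous sketch.
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.3
nu = np.array([0.2, 0.5, 0.3])
R = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
f = lambda state: float(state)    # an arbitrary test functional

tour_sums, tour_lens = [], []
x = rng.choice(3, p=nu)           # start the chain at a regeneration time
s, n = f(x), 1                    # running sum and length of the current tour
for t in range(200_000):
    if rng.random() < gamma:      # regeneration: the next draw comes from nu
        tour_sums.append(s)       # and is independent of the entire past,
        tour_lens.append(n)       # so close off the current tour
        x, s, n = rng.choice(3, p=nu), 0.0, 0
    else:
        x = rng.choice(3, p=R[x])
    s += f(x)
    n += 1

tour_sums, tour_lens = np.array(tour_sums), np.array(tour_lens)
# Renewal-reward: E_pi[f] equals E[tour sum] / E[tour length], a ratio of
# IID averages, from which a central limit theorem can be derived.
print("estimate of E_pi[f]:", tour_sums.sum() / tour_lens.sum())
```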

Concluding remarks

Here, we examined one technique by which it's possible to prove that a Markov chain converges quickly. The analysis is general and provides insights of practical interest in Bayesian models. For instance, when using a Gibbs sampler for a hierarchical model, one can examine the full conditionals to see whether or not a minorization condition is present. Even if the minorization constant is never calculated, this gives some idea of the chain's numerical performance before ever running it, which can be useful in practical applications.
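As a sketch of what that examination can look like, suppose (purely hypothetically) that one full conditional of a Gibbs sampler is $\text{Beta}(1 + y, 2)$, with the conditioning variable $y$ known to lie in $[0, 3]$. The pointwise infimum of this family of densities is an unnormalized minorizing measure: its total mass is $\gamma$, and renormalizing it gives $\nu$.

```python
# A minimal sketch (the model is hypothetical): numerically checking a
# minorization condition for a single Gibbs full conditional, by taking the
# pointwise infimum of the conditional density over the conditioning variable.
import numpy as np
from scipy import stats

xs = np.linspace(0.001, 0.999, 2000)   # grid over the state being updated
ys = np.linspace(0.0, 3.0, 200)        # range of the conditioning variable

# Density of every full conditional Beta(1 + y, 2) in the family, on a grid.
densities = np.array([stats.beta.pdf(xs, 1.0 + y, 2.0) for y in ys])
lower = densities.min(axis=0)          # pointwise infimum over y

gamma = lower.sum() * (xs[1] - xs[0])  # mass of the infimum: the minorization constant
print("minorization constant gamma ~", round(gamma, 3))
# If gamma > 0, every one-step update shares a common component nu of weight
# gamma -- in the minorizing-measure picture, the overlapping shaded region.
```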

Other techniques for analyzing the mixing rate of a Markov chain are also available. In particular, reversible chains possess Markov operators which are self-adjoint: this allows one to study their spectral properties and relate them to mixing times,[5] as sketched below.
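Here is a final minimal sketch: a birth-death chain (invented for illustration) is reversible, so conjugating its transition matrix by $\mathrm{diag}(\sqrt{\pi})$ yields a symmetric matrix with real eigenvalues, and the gap between 1 and the second-largest eigenvalue modulus controls the rate $\rho$ from earlier.

```python
# A minimal sketch: the spectral gap of a small reversible (birth-death) chain.
import numpy as np

n = 5
P = np.zeros((n, n))
for i in range(n):
    if i > 0:
        P[i, i - 1] = 0.3          # step down
    if i < n - 1:
        P[i, i + 1] = 0.4          # step up
    P[i, i] = 1.0 - P[i].sum()     # stay put with the remaining probability

# Stationary distribution from detailed balance: pi_i P[i,j] = pi_j P[j,i].
pi = np.ones(n)
for i in range(1, n):
    pi[i] = pi[i - 1] * P[i - 1, i] / P[i, i - 1]
pi = pi / pi.sum()

# Reversibility makes D^{1/2} P D^{-1/2} symmetric, so its eigenvalues are real.
D = np.diag(np.sqrt(pi))
S = D @ P @ np.linalg.inv(D)
eigs = np.linalg.eigvalsh(S)               # ascending; the largest is 1
slem = max(abs(eigs[0]), abs(eigs[-2]))    # second-largest eigenvalue modulus
print("spectral gap:", 1.0 - slem)         # a larger gap means faster mixing
```

I hope this article has provided a useful overview of the need to consider finite-time mixing properties, and illustrated the key idea behind minorization techniques.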

References

[1] S. Meyn and R. Tweedie. Markov Chains and Stochastic Stability. Springer, 1993.

[2] G. Roberts and J. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 2004.

[3] Rather than working with a pair $(\gamma,\nu)$, where $\gamma\in[0,1]$ and $\nu$ is a probability measure, most technical papers equivalently work with a single finite measure. We use $(\gamma,\nu)$ here because it is intuitive and lends itself well to visualization.

[4] J. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90(430), 1995.

[5] G. Roberts and J. Rosenthal. Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2(2), 1997.