Markov chains appear everywhere: they are used as a computational tool within Bayesian statistics, and as a theoretical tool in other areas such as optimal control and reinforcement learning. Conditions under which a general Markov chain eventually converges to a stationary distribution are well-studied, and can largely be considered classical results. These are, informally, as follows.

- `$\phi$`-irreducibility: the chain can transition from any state into any other state.
- Aperiodicity: the chain’s transition behavior is the same at time `$t$` as at time `$t+1$`.
- Harris recurrence: the chain returns to some set infinitely often.

Henceforth, let `$X_t$` be the chain, and `$\pi$` its stationary distribution of interest. Some care is needed when defining these conditions formally^{1} for general Markov chains: for example, in a chain defined over an uncountable state space, any given point will have probability zero. To formulate irreducibility precisely, we need to talk about sets of states by, say, introducing a topology.

# Convergence to stationarity in finite time

Proving convergence of a Markov chain is often the first step taken as part of a larger analysis. In fact, this can be done for broad classes of chains—for instance, for Metropolis-Hastings chains with appropriately well-behaved transition proposals.^{2} This makes stationarity analysis of MCMC algorithms used in Bayesian statistics close to a non-issue.

Unfortunately, from convergence of a Markov chain alone, we can say little about the chain’s distribution after a given number of steps. This makes it interesting to study chains which are *geometrically ergodic*—such chains converge at a prescribed rate given by

```
$$
\Vert P^t(\mu) - \pi\Vert_{\operatorname{TV}} \leq c_\mu \rho^t
$$
```
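To make the bound concrete, here is a minimal sketch (the two-state chain and its transition matrix are illustrative choices, not taken from the text): for a finite chain, `$\rho$` is the second-largest eigenvalue modulus of the transition matrix, and the total variation distance to stationarity contracts by that factor at each step.

```python
# Geometric ergodicity in miniature: a two-state Markov chain.
# Transition matrix P; the stationary distribution pi solves pi P = pi.
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2 / 3, 1 / 3]  # solves pi[0] * 0.1 == pi[1] * 0.2

def step(dist):
    """One application of the Markov operator: dist -> dist P."""
    return [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]

def tv(dist):
    """Total variation distance between dist and pi."""
    return 0.5 * sum(abs(dist[j] - pi[j]) for j in range(2))

mu = [1.0, 0.0]  # initial distribution: start deterministically in state 0
dists = [mu]
for _ in range(10):
    dists.append(step(dists[-1]))

# For this chain rho = 0.7 (the second eigenvalue of P), and the TV
# distance shrinks by exactly that factor every iteration.
ratios = [tv(b) / tv(a) for a, b in zip(dists, dists[1:])]
```

Here the geometric rate is exact because any difference of probability vectors on two states lies in the second eigenspace; for larger chains the bound holds with `$\rho$` equal to the second-largest eigenvalue modulus.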

where `$\mu$` is the initial distribution, `$P$` is the chain’s Markov operator, `$t$` is the current iteration, `$\pi$` is the stationary distribution, `$c_\mu$` and `$0 < \rho < 1$` are constants, and `$\Vert\cdot\Vert_{\text{TV}}$` is the total variation norm over the Banach space of signed measures.
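Concretely, for probability measures this distance can be written (depending on convention, possibly up to a factor of two) as

```
$$
\Vert \mu - \pi \Vert_{\operatorname{TV}} = \sup_{A} \, \vert \mu(A) - \pi(A) \vert,
$$
```

where the supremum runs over all measurable sets `$A$`.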

Convergent chains that are not geometrically ergodic are not well-behaved. Such chains can converge arbitrarily slowly, and functionals `$f(X_t)$` of their output don’t necessarily satisfy a Central Limit Theorem. In particular, the distribution of `$f(X_t)$` can be heavy-tailed, which can cause the asymptotic variance of `$\frac{1}{T} \sum_{t=1}^T f(X_t)$` to be infinite even if `$\mathbb{E}_{x\sim\pi}(f(x))$` is finite. This makes it difficult to assess the error of the resulting estimates.

# Minorizing measures and regeneration

How does one know whether or not a chain is geometrically ergodic? For simplicity, let’s first consider the case where `$c_\mu = c$` is constant with respect to the initial distribution `$\mu$`. Such a chain is called *uniformly ergodic*. This occurs, for instance, for chains defined over a compact state space, and we comment on the general case once the issues in this setting are clear.

We begin by considering two chains, `$X_t$` and `$Y_t$`, with the same stationary distribution `$\pi$`. We think of `$X_t$` as the chain of interest with initial distribution `$\mu$`, and `$Y_t$` as an auxiliary chain. Now, we make two assumptions.

- `$X_t$` and `$Y_t$` share the same random number generator.
- `$Y_t$` starts from the stationary distribution `$\pi$`.

This setup can be visualized via the diagram

```
$$
\begin{aligned}
&\mu & &\sim & &X_1 & &\to & &X_2 & &\to & &X_3 & &\to & &..
\\
& & & & &\,\,| & & & &\,\,| & & & &\,\,| & & & &\,|
\\
&\pi & &\sim & &Y_1 & &\to & &Y_2 & &\to & &Y_3 & &\to & &..
\end{aligned}
$$
```

where vertical bars indicate shared random numbers, and we’ve used the identically distributed symbol `$\sim$` in the opposite of its usual order.

Given this setup, if *both chains make a draw from the same distribution* during their one-step-ahead transitions, the shared random numbers ensure they draw the same value, so we can conclude that `$X_t = Y_t \sim \pi$` and the chain has converged.
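As a toy illustration of the shared-random-number setup (the two-state chain here is an invented example, not from the text): driving both chains with the same uniform draws means that once they land on the same state, they move in lockstep forever after.

```python
import random

# A two-state chain driven by inverse-CDF sampling: from state s, move to
# state 0 if the shared uniform u falls below P[s][0], else to state 1.
P = [[0.5, 0.5],
     [0.3, 0.7]]

def step(state, u):
    return 0 if u < P[state][0] else 1

random.seed(0)
x, y = 0, 1          # X starts in one state, Y in the other
coupled_at = None
path = []
for t in range(100):
    u = random.random()   # the SAME random number drives both chains
    x, y = step(x, u), step(y, u)
    path.append((x, y))
    if coupled_at is None and x == y:
        coupled_at = t

# Once the chains meet, the shared randomness keeps them equal forever.
```

At each step the chains meet whenever the shared uniform falls below both thresholds or above both, so coalescence happens quickly here; the point is only that equality, once achieved, is permanent.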

How is this condition ever possible? Suppose that we can write the one-step-ahead transition distribution of `$X_t$` as a mixture of a distribution `$\nu$`, with mixture probability `$\gamma$`, and some other distribution, with probability `$(1-\gamma)$`. Then at each time point, with probability `$\gamma^2$`, both chains draw from the same distribution `$\nu$`. Let’s draw a picture.

Here, both one-step-ahead transition distributions possess an overlapping shaded region. The probability of each chain landing in this region is `$\gamma$`, and the distribution within that region is `$\nu$`. We say that `$\nu$` is a *minorizing measure*.^{3} It can be shown^{4} that to prove uniform ergodicity, it suffices to exhibit such a minorizing measure.
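In symbols, this says the one-step transition kernel `$P(x, \cdot)$` admits a uniform component: there exist `$\gamma > 0$` and a probability measure `$\nu$` such that

```
$$
P(x, A) \geq \gamma \, \nu(A)
$$
```

for every state `$x$` and every measurable set `$A$`.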

For a non-compact state space and arbitrary current states, such a measure can be impossible to find. However, by virtue of convergence, our Markov chain should eventually spend most of its time in a compact subset of the state space. One can make this intuition precise, and develop techniques for proving geometric ergodicity, by introducing a suitable *Lyapunov drift condition*^{1} which, once satisfied, largely reduces the analysis to the preceding case.

Once one has the existence of a minorizing measure, there will be a set of random times at which the chain will draw from the minorizing measure and forget its current location. This allows the analysis of the averages `$\frac{1}{T} \sum_{t=1}^T f(X_t)$` to be reduced, in some sense, to the IID case. Following this line of thought, one can eventually prove a Central Limit Theorem for the ergodic averages, even though successive `$f(X_t)$` are not independent. The study of techniques originating from this observation is called *regeneration theory*.
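Regeneration can be simulated directly via the split-chain construction. A minimal sketch, with invented numbers (`$\gamma = 0.8$` and a two-state `$\nu$`; for this particular kernel the residual component happens to leave the state unchanged): at each step, flip a `$\gamma$`-coin; on heads, draw the next state from `$\nu$` and record a regeneration, discarding all memory of the past. The segments between regeneration times are then IID "tours".

```python
import random

random.seed(1)
gamma = 0.8             # minorization constant
nu = [0.375, 0.625]     # minorizing measure on the two states {0, 1}

def draw_nu():
    return 0 if random.random() < nu[0] else 1

# Split-chain simulation of the kernel P(x, .) = gamma*nu + (1-gamma)*delta_x:
# with probability gamma the chain regenerates from nu; otherwise the
# residual component (here: stay put) is applied.
x = 0
regen_times, samples = [], []
for t in range(50_000):
    if random.random() < gamma:
        x = draw_nu()
        regen_times.append(t)   # the chain forgets its past at these times
    samples.append(x)

# Tours between regeneration times are IID, so averages over tours behave
# like averages of IID quantities. For this kernel the stationary
# distribution is nu itself, so the ergodic average of f(x) = x should be
# close to nu[1] = 0.625.
avg = sum(samples) / len(samples)
```

Splitting the trajectory at `regen_times` yields independent blocks, which is exactly what makes the IID-style Central Limit Theorem argument go through.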

# Concluding remarks

Here, we examined one technique by which it’s possible to prove that a Markov chain converges quickly. The analysis is general and provides insights of practical interest in Bayesian models. For instance, when using a Gibbs sampler for a hierarchical model, one can examine the full conditionals to see whether or not a minorization condition is present. Even if the minorization constant is not calculated, this gives some idea of the chain’s numerical performance before it is ever run.

Other techniques for analyzing the mixing rate of a Markov chain are also available. In particular, reversible chains possess Markov operators which are self-adjoint: this allows one to study their spectral properties, and relate them to mixing times.^{5} I hope this article has provided a useful overview as to the need to consider finite-time mixing properties, and illustrated the key idea behind minorization techniques.

# References

^{1} S. Meyn and R. Tweedie. Markov Chains and Stochastic Stability. 1993.

^{2} G. Roberts and J. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 2004.

^{3} Rather than working with a pair `$(\gamma,\nu)$` where `$\gamma\in[0,1]$` and `$\nu$` is a probability measure, most technical papers equivalently work with a single finite measure. We use `$(\gamma,\nu)$` here because it is intuitive and lends itself well to visualization.

^{4} J. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. JASA 90(430). 1995.

^{5} G. Roberts and J. Rosenthal. Geometric ergodicity and hybrid Markov chains. ECP 2(2). 1997.