What does it really mean to be Bayesian?

February 9, 2018

In my previous posts, I introduced Bayesian models and argued that they are meaningful. I claimed that studying them is worthwhile because the probabilistic interpretation of learning that they offer can be more intuitive than other interpretations. I showcased an example illustrating what a Bayesian model looks like. I did not, however, say what a Bayesian model actually is, at least not in a sufficiently general setting to encompass models people regularly use. I’m going to discuss that in this post, and then showcase some surprising behavior in infinite-dimensional settings where the general approach is necessary. The subject matter here can be highly technical, but will be discussed at an intuitive level meant to explain what is going on.

Definition. A model $\mathscr{M}$ is mathematically Bayesian if it can be fully specified via a prior $\pi(\theta)$ and likelihood $f(x \mid \theta)$ for which the posterior distribution $f(\theta \mid x)$ is well-defined.

Here, $\theta$ is an abstract parameter, and $x$ is an abstract data set. The argument for using Bayesian learning, given by Cox’s Theorem, is that conditional probability can be interpreted as an extension of true-false logic under uncertainty. This is great—but, formality considerations aside, there are scenarios that involve learning from data that are not included in the above definition. Let’s look at one.

A motivating example

To illustrate a case not covered by the above definition, consider the problem of learning a function from a finite set of points. Here, we have a set of points $(y_i, x_i)$, $i = 1, \ldots, n$, and we want to learn a function $y = f(x)$ from the data. A simple Bayesian model for the data can be written as

$\begin{aligned} y_i &= f(x_i) + \varepsilon_i & \varepsilon_i &\sim \operatorname{N}(0,\sigma^2) & f &\sim\operatorname{GP}(\mu, \Sigma) \end{aligned}$

What are we saying here? If we know $f$, we can use a set of points $x_i$ to generate $y_i$ by calculating $f(x_i)$ and adding Gaussian noise $\varepsilon_i$. Since we don’t know $f$, we specify its prior probability distribution as a Gaussian process with mean function $\mu: \mathbb{R} \to \mathbb{R}$ and covariance function $\Sigma: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Since we’ve specified a conditional and marginal distribution, this defines a joint distribution, so we can try to get the posterior distribution using Bayes’ Rule

$f(f \mid \boldsymbol{y},\boldsymbol{x}) \propto f(\boldsymbol{y} \mid \boldsymbol{x}, f) \pi(f) .$

Except we can’t do that. The above expression is not well-defined— $\pi(f)$ does not exist, because the probability distribution $f \sim\operatorname{GP}(\mu, \Sigma)$ is a distribution over a space of functions, not of real numbers—therefore, it has no density in the standard sense.¹

Why not? A probability density is a function that assigns a weight to every unit of volume in space. In one dimension, every interval of the form $[a,b]$ is assigned volume $|a-b|$, which depends only on its length, not its location. In infinite-dimensional spaces, this is impossible. It can be proven that any useful notion of volume must depend on both length and location: more formally, no infinite-dimensional analogue of the Lebesgue measure is both translation-invariant and locally finite.²
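One standard way to see why, sketched here under the assumption that the parameter space is an infinite-dimensional separable Hilbert space with orthonormal basis $e_1, e_2, \ldots$ (a setting chosen for illustration, not one fixed by the model above), is to pack infinitely many disjoint small balls into one larger ball. Suppose $\lambda$ is a translation-invariant measure, and write $B(c, r)$ for the open ball of radius $r$ centered at $c$. For $i \neq j$ we have

$\left\| \tfrac{1}{2} e_i - \tfrac{1}{2} e_j \right\| = \tfrac{\sqrt{2}}{2} > \tfrac{1}{4} + \tfrac{1}{4}, \qquad B\left(\tfrac{1}{2} e_i, \tfrac{1}{4}\right) \subset B(0, 1),$

so the balls $B(\tfrac{1}{2} e_i, \tfrac{1}{4})$ are pairwise disjoint and all lie inside the unit ball. Translation invariance then gives

$\lambda(B(0, 1)) \geq \sum_{i=1}^{\infty} \lambda\left(B\left(\tfrac{1}{2} e_i, \tfrac{1}{4}\right)\right) = \sum_{i=1}^{\infty} \lambda\left(B\left(0, \tfrac{1}{4}\right)\right) .$

The right-hand side is either zero or infinite. The same packing works at every scale, so if some ball has finite volume, then all sufficiently small balls have volume zero, and covering the space by countably many of them shows that $\lambda$ is identically zero. The only alternative is that every ball has infinite volume, which is what failing to be locally finite means.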

So what do we do? Is there a sense in which we can consider the above model Bayesian? Let’s discuss that.

Bayesian learning as conditional probability

If we’re not allowed to discuss probability densities, what else can we do? One thing that the definition says is that a model is Bayesian if it is probabilistic. This entails two parts.

$\mathscr{M}$ is specified via a joint probability density $f(\theta, x)$ over the parameters and data.
Learning takes place via conditional probability.

It turns out that these two intuitive notions are precisely the ones we need. Informally, this leads to the definition below.

Definition. A model $\mathscr{M}$ is mathematically Bayesian if it is fully specified via a random variable $(x,\theta)$ for which the conditional probability distribution $\theta \mid x$ exists for all $x$.

This definition can be made formal using measure-theoretic notions such as regular conditional probability³ and disintegration.⁴ These have various flavors with different technical requirements on $(x,\theta)$ that need to be checked to ensure that writing down a probability distribution conditional on a set of data points actually makes sense. Let’s now look at two different ways of specifying $(x,\theta)$ in infinite-dimensional settings where the usual approach fails.
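In the simplest finite setting, none of this machinery is needed, and the definition can be illustrated directly. The sketch below is a toy example, not part of the original argument: the joint table `joint` and the helper `condition_on_x` are hypothetical names, and the numbers are made up. It specifies a model purely through a joint distribution over $(x, \theta)$ and learns by conditioning.

```python
import numpy as np

# Hypothetical joint distribution over (x, theta).
# Rows index the data value x, columns index the parameter value theta.
joint = np.array([
    [0.10, 0.30],  # x = 0
    [0.40, 0.20],  # x = 1
])

def condition_on_x(joint, x):
    """Return the distribution of theta given x by renormalizing one row."""
    row = joint[x]
    marginal = row.sum()  # P(x), the probability of the observed data
    if marginal == 0:
        # Elementary conditioning breaks down on probability-zero events.
        raise ValueError("cannot condition on an event of probability zero")
    return row / marginal

posterior = condition_on_x(joint, 1)
```

The explicit check for a zero marginal marks the point where the elementary recipe stops working. Handling conditioning on probability-zero events is one of the jobs that the measure-theoretic notions above take over.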

Two infinite-dimensional approaches

One way to define Bayesian models in infinite-dimensional settings is through a top-down approach. Here, we specify $\theta \mid x$ by selecting a complicated but well-defined infinite-dimensional notion of volume. Often, the prior distribution is used to select this notion of volume. From there, we can specify how the posterior distribution changes that volume, by writing down a Radon-Nikodym derivative.⁵ This viewpoint is often used in the Gaussian measure and Bayesian inverse problem literatures. The main price we pay is that for many infinite-dimensional models, the prior and posterior distributions may not have the same support—they may fail to be absolutely continuous,⁶ in which case the Radon-Nikodym derivative between them would not exist.

Alternatively, we could use a bottom-up approach. Here, we define a family of probability distributions using finite-dimensional slices of our parameter space, using Kolmogorov’s Extension Theorem as our primary theoretical tool for handling the infinite-dimensional object. This is the primary viewpoint in the Gaussian process and Dirichlet process literatures. The main price we pay is that from this perspective, we can only reason about the infinite-dimensional object we wish to study indirectly. This may cause us to make poor choices, such as writing down algorithms that stop working as we approach the infinite-dimensional limit,⁷ which are easily avoided with a more direct perspective.
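As a rough illustration of the bottom-up viewpoint, the sketch below restricts the Gaussian process model from earlier to a finite set of inputs, where everything reduces to conditioning a multivariate Gaussian. The zero mean function, the squared-exponential covariance, the noise variance, the toy data, and the names `sq_exp_kernel` and `gp_posterior` are all assumptions made for the example, not choices taken from this post.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.1):
    """Posterior mean and covariance of f at x_test under a zero-mean GP prior."""
    K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = sq_exp_kernel(x_train, x_test)
    K_ss = sq_exp_kernel(x_test, x_test)
    # Use a Cholesky factor rather than forming an explicit matrix inverse.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov

# Toy data: three noisy-looking observations of an unknown function.
x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
x_test = np.linspace(-2.0, 2.0, 5)
mean, cov = gp_posterior(x_train, y_train, x_test)
```

Nothing here ever touches a density over functions: only the joint Gaussian distribution of $f$ at finitely many inputs is used. The function itself is reasoned about only indirectly, through whichever slice `x_test` we choose to look at.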

Cromwell’s Rule and some surprising consequences

We briefly mentioned that in infinite-dimensional settings, prior and posterior distributions may not be absolutely continuous with one another. This property deserves some attention. Consider Bayes’ Rule for probabilities

$\mathbb{P}(B \mid A) = \frac{\mathbb{P}(A \mid B) \mathbb{P}(B)}{\mathbb{P}(A)}$

and note that if $\mathbb{P}(A)$ is nonzero, then $\mathbb{P}(B) = 0$ implies $\mathbb{P}(B \mid A) = 0$, no matter what $A$ is. By analogy, if $A$ is data and $B$ is an event of interest, then Bayes’ Rule ignores the data if the prior probability is zero. This is often not desirable, which leads to Cromwell’s Rule,⁸ given below.

To avoid making learning impossible, the use of prior probabilities that are zero or one should be avoided.
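A small finite example makes the rule concrete. The sketch below is illustrative only: the coin hypotheses, both priors, and the helper `bayes_update` are made up for this purpose. It applies Bayes’ Rule repeatedly to two priors that differ only in whether one hypothesis starts at exactly zero.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """One application of Bayes' Rule over a finite set of hypotheses."""
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

# Hypotheses about a coin's probability of heads (assumed for illustration).
heads_prob = np.array([0.2, 0.5, 0.8])
dogmatic_prior = np.array([0.5, 0.5, 0.0])     # rules out 0.8 entirely
hedged_prior = np.array([0.495, 0.495, 0.01])  # leaves 0.8 a small chance

flips = [1] * 20  # twenty heads in a row
dogmatic, hedged = dogmatic_prior.copy(), hedged_prior.copy()
for flip in flips:
    likelihood = heads_prob if flip == 1 else 1.0 - heads_prob
    dogmatic = bayes_update(dogmatic, likelihood)
    hedged = bayes_update(hedged, likelihood)
```

After the loop, `dogmatic[2]` is still exactly zero, while `hedged` has moved almost all of its mass onto the 0.8 hypothesis. In this finite setting the data always has nonzero probability under the model, so a zero prior can never be revised.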

Except, in many infinite-dimensional settings, this doesn’t apply because $\mathbb{P}(A)$ may be zero. Indeed, it is easy to construct examples where the prior probability of an event is zero, but the posterior probability is nonzero—more formally, where the posterior is not absolutely continuous with respect to the prior. This is not an esoteric occurrence: even something as basic as adding a mean function to a Gaussian process can break absolute continuity.⁹ Let’s examine a case where this happens.

Breaking probabilistic impossibility

Consider the following model.

$\begin{aligned} y_i &\mid F \sim F & F &\sim\operatorname{DP}(\alpha, \delta_0) \end{aligned}$

where $\delta_0$ is a Dirac measure that places all of its probability on zero. Under the prior, we have

$\mathbb{P}(F \neq \delta_0) = 0 .$

The standard posterior for this model is

$F \mid \boldsymbol{y} \sim\operatorname{DP}\left(\alpha + n, \frac{\alpha}{\alpha+n}\delta_0 + \frac{n}{\alpha+n}\hat{F}_n\right)$

where $n$ is the length of $\boldsymbol{y}$ and $\hat{F}_n$ is the empirical CDF of $\boldsymbol{y}$. But if any observation $y_i$ is nonzero, we can tell immediately that

$\mathbb{P}(F \neq \delta_0 \mid \boldsymbol{y}) > 0 .$

This has a whole host of bizarre consequences. Since $F \mid \boldsymbol{y}$ is not absolutely continuous with respect to $F$, we see that in infinite dimensions, data may convince us to believe in something that we, in a sense, thought was impossible. This behavior is both surprising and typical: conditional probability can act in complicated ways.
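This can be checked numerically, taking the standard posterior formula above at face value. The posterior base measure is supported on the finite set consisting of zero and the observed values, so the masses that $F$ assigns to those atoms follow a Dirichlet distribution with parameters equal to $\alpha$ times the prior mass plus the observation counts. The sketch below uses that fact; the function name `sample_dp_posterior_weights`, the data, and the value of $\alpha$ are assumptions made for illustration.

```python
import numpy as np

def sample_dp_posterior_weights(y, alpha, rng):
    """Draw F | y for the DP(alpha, delta_0) model, returned as (atoms, weights).

    The posterior base measure lives on the finite set {0} and the observed
    values, so the atom masses of F are jointly Dirichlet distributed.
    """
    y = np.asarray(y, dtype=float)
    # Append one artificial zero so that the atom at zero is always present.
    atoms, counts = np.unique(np.append(y, 0.0), return_counts=True)
    dirichlet_params = counts.astype(float)
    # Swap the artificial zero's unit count for the prior weight alpha.
    dirichlet_params[atoms == 0.0] += alpha - 1.0
    weights = rng.dirichlet(dirichlet_params)
    return atoms, weights

rng = np.random.default_rng(0)
y = [0.0, 1.5, 1.5, -2.0]  # hypothetical data containing nonzero values
atoms, weights = sample_dp_posterior_weights(y, alpha=1.0, rng=rng)
mass_at_zero = weights[atoms == 0.0][0]
```

Under the prior, the only atom is zero and its weight is one, so every draw of $F$ equals $\delta_0$. Under the posterior, `mass_at_zero` is strictly less than one whenever the data contains a nonzero value, so the sampled $F$ differs from $\delta_0$.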

What it all means

In my view, an abstract model is Bayesian if it is probabilistic and learning takes place through conditional probability. In well-behaved finite-dimensional settings, this means that learning takes place using Bayes’ Rule. There, we have a likelihood $f(x \mid \theta)$ that acts as the generative distribution for the data given the parameters, and a prior that describes what sorts of parameters we’d like to regularize the learning process towards. In full generality, however, neither the generative nature of the likelihood nor the use of Bayes’ Rule matters: it is the use of conditional probability that is important. From a philosophical standpoint this makes sense: learning is just reasoning about something we don’t know using the things we do, using the mathematical structure of conditional probability.

Once we’ve taken the general perspective, we are free to define models in infinite-dimensional settings. Such models are powerful and have proven useful in many applications, but at times they may behave bizarrely. It’s worthwhile to take a moment to step back, appreciate, and understand why the expressions we calculate are the way they are.

References

¹

The standard notion of volume is taken to be the Lebesgue measure. See Chapter 3 of Probability and Stochastics.¹⁰

²

See Section 1.2 of Analysis and Probability on Infinite-Dimensional Spaces.¹¹

³

See Chapter 2 of Probability and Stochastics.¹⁰

⁴

See Section 2 of Conditioning as Disintegration.¹²

⁵

A Radon-Nikodym derivative tells us how to re-weight one probability measure to obtain another one. See Chapter 5 of Probability and Stochastics.¹⁰

⁶

If two measures are mutually absolutely continuous, they assign nonzero probability to the same events. See Chapter 5 of Probability and Stochastics.¹⁰

⁷

A recent line of work¹³ has sought to prevent Markov Chain Monte Carlo algorithms from slowing down for high-dimensional models by ensuring their infinite-dimensional limits are well-defined.

⁸

See Chapter 6 Section 8 of Understanding Uncertainty.¹⁴

⁹

The space of vectors that can be added to a Gaussian process while preserving absolute continuity is called its Cameron-Martin space. See Chapter 5 of Lectures on Gaussian Processes.¹⁵

¹⁰

E. Çınlar. Probability and Stochastics, 2010.

¹¹

N. Eldredge. Analysis and Probability on Infinite-Dimensional Spaces, 2016.

¹²

J. T. Chang and D. Pollard. Conditioning as Disintegration. Statistica Neerlandica 51(3), 1997.

¹³

S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster. Statistical Science 28(3), 2013.

¹⁴

D. Lindley. Understanding Uncertainty, 2006.

¹⁵

M. Lifshits. Lectures on Gaussian Processes, 2012.