In my previous posts, I introduced Bayesian models and argued that they are meaningful. I claimed that studying them is worthwhile because the probabilistic interpretation of learning that they offer can be more intuitive than other interpretations. I showcased an example illustrating what a Bayesian model looks like. I did not, however, say what a Bayesian model actually is – at least not in a sufficiently general setting to encompass models people regularly use. I’m going to discuss that in this post, and then showcase some surprising behavior in infinite-dimensional settings where the general approach is necessary. The subject matter here can be highly technical, but will be discussed at an intuitive level meant to explain what is going on.
Definition. A model $\s{M}$ is mathematically Bayesian if it can be fully specified via a prior $\pi(\theta)$ and likelihood $f(x \given \theta)$ for which the posterior distribution $f(\theta \given x)$ is well-defined.
Here, $\theta$ is an abstract parameter, and $x$ is an abstract data set. The argument for using Bayesian learning, given by Cox’s Theorem, is that conditional probability can be interpreted as an extension of true-false logic under uncertainty. This is great – but, formality considerations aside, there are scenarios that involve learning from data that are not included in the above definition. Let’s look at one.
A motivating example
To illustrate a case not covered by the above definition, consider the problem of learning a function from a finite set of points. Here, we have a set of points $(y_i, x_i)$, $i = 1, \ldots, n$, and we want to learn a function $y = f(x)$ from the data. A simple Bayesian model for the data can be written as
\[ \begin{aligned} y_i &= f(x_i) + \eps_i & \eps_i &\iid N(0,\sigma^2) & f &\dist\f{GP}(\mu, \Sigma) \end{aligned} \]
What are we saying here? If we know $f$, we can use a set of points $x_i$ to generate $y_i$ by calculating $f(x_i)$ and adding Gaussian noise $\eps_i$. Since we don’t know $f$, we specify its prior probability distribution as a Gaussian process with mean function $\mu: \R \goesto \R$ and covariance function $\Sigma: \R \cross \R \goesto \R$. Since we’ve specified a conditional and marginal distribution, this defines a joint distribution, so we can try to get the posterior distribution using Bayes’ Rule \[ f(f \given \v{y},\v{x}) \propto f(\v{y} \given \v{x}, f) \pi(f) . \]
Except we can’t do that. The above expression is not well-defined – $\pi(f)$ does not exist, because the probability distribution $f \dist\f{GP}(\mu, \Sigma)$ is a distribution over a space of functions, not of real numbers – therefore, it has no density in the standard sense^{1}.
Why not? A probability density is a function that assigns a weight to every unit of volume in space. In one dimension, every interval of the form $[a,b]$ is assigned volume $b - a$ – this depends only on its length, not its location. In infinite-dimensional spaces, this is impossible. It can be proven that any notion of volume must depend both on the length and location – more formally, the infinite-dimensional Lebesgue measure is not locally finite^{2}.
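One standard way to see this can be sketched as follows. Assume, purely for illustration, that the space is a separable Hilbert space with orthonormal basis $e_1, e_2, \ldots$, write $B(c, r)$ for the open ball of radius $r$ centered at $c$, and suppose $\mu$ is a notion of volume that does not depend on location. The balls of radius $r/4$ centered at the points $\tfrac{r}{2} e_i$ are pairwise disjoint and all sit inside $B(0, r)$, so

\[ \begin{aligned} \left\| \tfrac{r}{2} e_i - \tfrac{r}{2} e_j \right\| &= \tfrac{r}{\sqrt{2}} > \tfrac{r}{2} \quad (i \neq j), \\ \mu\left(B(0, r)\right) &\geq \sum_{i=1}^{\infty} \mu\left(B\left(\tfrac{r}{2} e_i, \tfrac{r}{4}\right)\right) = \sum_{i=1}^{\infty} \mu\left(B\left(0, \tfrac{r}{4}\right)\right) . \end{aligned} \]

The sum on the right adds the same number infinitely many times. So a ball can only have finite volume if the smaller balls inside it have volume zero, which is not a useful notion of volume.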
So what do we do? Is there a sense in which we can consider the above model Bayesian? Let’s discuss that.
Bayesian learning as conditional probability
If we’re not allowed to discuss probability densities, what else can we do? One thing that the definition says is that a model is Bayesian if it is probabilistic. This entails two parts.
 $\s{M}$ is specified via a joint probability density $f(\theta, x)$ over the parameters and data.
 Learning takes place via conditional probability.
It turns out that these two intuitive notions are precisely the ones we need. Informally, this leads to the definition below.
Definition. A model $\s{M}$ is mathematically Bayesian if it is fully specified via a random variable $(x,\theta)$ for which the conditional probability distribution $\theta \given x$ exists for all $x$.
This definition can be made formal using measure-theoretic notions such as regular conditional probability^{3} and disintegration^{4}. These have various flavors with different technical requirements on $(x,\theta)$ that need to be checked to ensure that writing down a probability distribution conditional on a set of data points actually makes sense. Let’s now look at two different ways of specifying $(x,\theta)$ in infinite-dimensional settings where the usual approach fails.
Two infinite-dimensional approaches
One way to define Bayesian models in infinite-dimensional settings is through a top-down approach. Here, we specify $\theta \given x$ by selecting a complicated but well-defined infinite-dimensional notion of volume. Often, the prior distribution is used to select this notion of volume. From there, we can specify how the posterior distribution changes that volume, by writing down a Radon-Nikodym derivative^{5}. This viewpoint is often used in the Gaussian measure and Bayesian inverse problem literatures. The main price we pay is that for many infinite-dimensional models, the prior and posterior distributions may not have the same support – they may fail to be absolutely continuous^{6}, in which case the Radon-Nikodym derivative between them would not exist.
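For the regression example from earlier, the top-down formulation can be sketched as follows, taking the Gaussian process prior as the reference notion of volume. The symbols $\Pi$ (the prior measure on functions), $\Pi^{\v{y}}$ (the posterior measure), and $Z$ (the normalizing constant) are notation introduced here for illustration.

\[ \frac{d\Pi^{\v{y}}}{d\Pi}(f) = \frac{1}{Z} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 \right), \qquad Z = \int \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 \right) d\Pi(f) . \]

In this form the Gaussian likelihood reweights the prior directly, and no density with respect to a Lebesgue-type volume is ever needed. With finitely many noisy observations the exponential is strictly positive and bounded, so in this particular case the derivative does exist.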
Alternatively, we could use a bottom-up approach. Here, we define a family of probability distributions using finite-dimensional slices of our parameter space, using Kolmogorov’s Extension Theorem as our primary theoretical tool for handling the infinite-dimensional object. This is the primary viewpoint in the Gaussian process and Dirichlet process literatures. The main price we pay is that from this perspective, we can only reason about the infinite-dimensional object we wish to study indirectly. This may cause us to make poor choices, such as writing down algorithms that stop working as we approach the infinite-dimensional limit^{7}, which are easily avoided with a more direct perspective.
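To make the bottom-up view concrete, here is a minimal sketch for the regression example. Restricted to any finite set of inputs, the Gaussian process is just a multivariate normal, so conditioning reduces to ordinary Gaussian algebra. The zero mean function, the squared exponential covariance, the noise variance, and all function names below are assumptions chosen for illustration, not choices made in this post.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance (an assumed choice for Sigma)."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_posterior(x, y, x_new, noise_var=0.1):
    """Posterior mean and covariance of f at x_new, assuming mu = 0."""
    # Covariance of the noisy observations y at the training inputs.
    K = sq_exp_kernel(x, x) + noise_var * np.eye(len(x))
    # Cross-covariance between f(x) and f(x_new), and prior covariance at x_new.
    K_s = sq_exp_kernel(x, x_new)
    K_ss = sq_exp_kernel(x_new, x_new)
    # Solve with a Cholesky factor instead of forming an explicit inverse.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov

# Hypothetical data: three noisy-looking observations of a sine curve.
x = np.array([-1.0, 0.0, 1.0])
y = np.sin(x)
x_new = np.linspace(-2.0, 2.0, 5)
mean, cov = gp_posterior(x, y, x_new)
```

Every quantity here lives on a finite slice of the function space. The function $f$ itself never appears, which is exactly the indirectness described above.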
Cromwell’s Rule and some surprising consequences
We briefly mentioned that in infinitedimensional settings, prior and posterior distributions may not be absolutely continuous with one another. This property deserves some attention. Consider Bayes’ Rule for probabilities
\[ \P(B \given A) = \frac{\P(A \given B) \P(B)}{\P(A)} \]
and note that if $\P(A)$ is nonzero, then $\P(B) = 0$ implies $\P(B \given A) = 0$ – no matter what $A$ is. By analogy, if $A$ is data and $B$ is an event of interest, then Bayes’ Rule ignores the data if the prior probability is zero. This is often not desirable, which leads to Cromwell’s Rule^{8}, given below.
To avoid making learning impossible, the use of prior probabilities that are zero or one should be avoided.
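In the finite case, the effect is easy to check numerically. The sketch below applies Bayes’ Rule to three hypotheses with made-up numbers: the first hypothesis has prior probability zero, and the data favor it strongly. The function name and all values are illustrative assumptions.

```python
import numpy as np

def posterior(prior, likelihood):
    """Bayes' Rule on a finite set of hypotheses."""
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    joint = prior * likelihood
    evidence = joint.sum()  # this is P(A), the probability of the data
    if evidence == 0.0:
        raise ValueError("the data have probability zero under the model")
    return joint / evidence

prior = [0.0, 0.5, 0.5]          # the first hypothesis is ruled out a priori
likelihood = [0.99, 0.01, 0.01]  # yet the data strongly favor it
post = posterior(prior, likelihood)
```

The first entry of `post` stays at zero no matter how large its likelihood is. The function also refuses to divide when the evidence is zero, which is the situation that matters in what follows.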
Except, in many infinite-dimensional settings, this doesn’t apply because $\P(A)$ may be zero. Indeed, it is easy to construct examples where the prior probability of an event is zero, but the posterior probability is nonzero – more formally, where the posterior is not absolutely continuous with respect to the prior. This is not an esoteric occurrence: even something as basic as adding a mean function to a Gaussian process can break absolute continuity^{9}. Let’s examine a case where this happens.
Breaking probabilistic impossibility
Consider the following model.
\[ \begin{aligned} y_i &\given F \iid F & F &\dist\f{DP}(\alpha, \delta_0) \end{aligned} \]
where $\delta_0$ is a Dirac measure that places all of its probability on zero. Under the prior, we have \[ \P(F \neq \delta_0) = 0 . \] The standard posterior for this model is \[ F \given \v{y} \dist\f{DP}\del{\alpha + n, \frac{\alpha}{\alpha+n}\delta_0 + \frac{n}{\alpha+n}\hat{F}_n} \] where $n$ is the length of $\v{y}$ and $\hat{F}_n$ is the empirical CDF of $\v{y}$. But we can tell immediately that
\[ \P(F \neq \delta_0 \given \v{y}) > 0 . \]
This example illustrates a whole host of bizarre consequences. Since $F \given \v{y}$ is not absolutely continuous with respect to $F$, we see that in infinite dimensions, data may convince us to believe in something we in a sense thought was impossible. Furthermore, $\f{DP}(\alpha_1, \delta_0)$ and $\f{DP}(\alpha_2, \delta_0)$ are, as probability distributions, identical – but their respective posterior distributions are not. So, what matters for Bayesian learning in infinite dimensions is not the distribution of the prior, but the functional form of the joint probability measure. This behavior is both surprising and typical – conditional probability can act in complicated ways.
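The dependence on $\alpha$ can be checked by applying the standard posterior formula quoted above. The sketch below computes the posterior concentration and the atoms of the posterior base measure for two values of $\alpha$. The data vector, which contains nonzero entries, and the function name are hypothetical.

```python
def dp_posterior(alpha, y, base_atom=0.0):
    """Posterior DP parameters when the prior base measure is a point mass.

    Returns (concentration, base), where base maps each atom of the
    posterior base measure to its weight.
    """
    n = len(y)
    concentration = alpha + n
    # The prior point mass contributes weight alpha / (alpha + n).
    base = {base_atom: alpha / concentration}
    # Each observation contributes weight 1 / (alpha + n) at its value.
    for value in y:
        base[value] = base.get(value, 0.0) + 1.0 / concentration
    return concentration, base

y = [0.0, 1.5, 1.5]  # hypothetical data containing nonzero values
conc_1, base_1 = dp_posterior(1.0, y)
conc_2, base_2 = dp_posterior(10.0, y)
```

Both priors put all of their mass on $F = \delta_0$. The two posteriors nevertheless have different concentrations and different base measures, and both base measures place weight on the nonzero value.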
What it all means
In my view, an abstract model is Bayesian if it is probabilistic and learning takes place through conditional probability. In well-behaved finite-dimensional settings, this means that learning takes place using Bayes’ Rule. There, we have a likelihood $f(x \given \theta)$ that acts as the generative distribution for the data given the parameters, and a prior that describes what sorts of parameters we’d like to regularize the learning process towards. In full generality, however, neither the generative nature of the likelihood nor the use of Bayes’ Rule matters: it is the use of conditional probability that is important. From a philosophical standpoint this makes sense: learning is just reasoning about something we don’t know using the things we do, and Cox’s Theorem^{10} tells us that true-false reasoning under uncertainty must have the same mathematical structure as conditional probability.
Once we’ve taken the general perspective, we are free to define models in infinite-dimensional settings. Such models are powerful and have proven useful in many applications, but at times they may behave bizarrely. It’s worthwhile to take a moment to step back, appreciate, and understand why the expressions we calculate are the way they are.
References

1. The standard notion of volume is taken to be the Lebesgue measure. See Chapter 3 of Probability and Stochastics^{11}.

2. See Section 1.2 of Analysis and Probability on Infinite-Dimensional Spaces^{12}.

3. See Chapter 2 of Probability and Stochastics^{11}.

4. See Section 2 of Conditioning as Disintegration^{13}.

5. A Radon-Nikodym derivative tells us how to reweight one probability measure to obtain another one. See Chapter 5 of Probability and Stochastics^{11}.

6. If two measures are absolutely continuous with respect to one another, they assign nonzero probability to the same events. See Chapter 5 of Probability and Stochastics^{11}.

7. A recent line of work^{14} has sought to prevent Markov Chain Monte Carlo algorithms from slowing down for high-dimensional models by ensuring their infinite-dimensional limits are well-defined.

8. See Chapter 6 Section 8 of Understanding Uncertainty^{15}.

9. The space of vectors that can be added to a Gaussian measure while preserving absolute continuity is called its Cameron-Martin space. See Chapter 5 of Lectures on Gaussian Processes^{16}.

10. A. Terenin and D. Draper. Cox’s Theorem and the Jaynesian Interpretation of Probability. arXiv:1507.06597, 2015.

11. E. Çınlar. Probability and Stochastics. 2010.

12. N. Eldredge. Analysis and Probability on Infinite-Dimensional Spaces. 2016.

13. J. T. Chang and D. Pollard. Conditioning as Disintegration. Statistica Neerlandica 51(3), 1997.

14. S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster. Statistical Science 28(3), 2013.

15. D. Lindley. Understanding Uncertainty. 2006.

16. M. Lifshits. Lectures on Gaussian Processes. 2012.