Alexander Terenin

What does it mean to be Bayesian?

Bayesian statistics provides powerful theoretical tools, but it is also sometimes viewed as a philosophical framework. This has led to rich academic debates over what statistical learning is and how it should be done. Academic debates are healthy when their content is precise and independent issues are not conflated. In this post, I argue that it is not always meaningful to consider the merits of Bayesian learning directly, because the fundamental questions surrounding it encompass not one issue but several, which are best understood independently. These can be viewed informally as follows.

  • A model is mathematically Bayesian if it is defined using Bayes’ Rule.
  • A procedure is computationally Bayesian if it involves calculation of a full posterior distribution.

The key idea of this post is that the two notions above are different, and that the common term Bayesian is often ambiguous. This makes it unclear, for instance, that there are situations where it makes sense to be mathematically but not computationally Bayesian. Let’s disentangle the terminology and explore the concepts in more detail.

Motivating Example: Logistic Lasso

To make my arguments concrete, I now introduce the Logistic Lasso model, beginning with notation. Let $\mathbf{X}$ be the matrix of size $N \times p$ to be used for predicting the binary vector $\boldsymbol{y}$ of size $N \times 1$, let $\boldsymbol\beta$ be the parameter vector, and let $\phi$ be the logistic function.

From the classical perspective, the Logistic Lasso model [1] involves finding the estimator

$$ \boldsymbol{\hat\beta} = \underset{\boldsymbol\beta}{\arg\min}\left[ \sum_{i=1}^N \Big( -y_i\ln\left( \phi(\mathbf{X}_i\boldsymbol\beta) \right) - (1-y_i)\ln\left(1 - \phi(\mathbf{X}_i\boldsymbol\beta)\right) \Big) + \lambda\Vert\boldsymbol\beta\Vert_1\right] $$

for $\lambda \in \mathbb{R}^+$, where $\Vert\cdot\Vert_1$ denotes the $\ell^1$ norm. On the other hand, the Bayesian Logistic Lasso model [2] is specified using the likelihood and prior

$$ \begin{aligned} y_i \mid \boldsymbol\beta &\sim \operatorname{Ber}\left(\phi(\mathbf{X}_i\boldsymbol\beta)\right) & \boldsymbol\beta &\sim \operatorname{Laplace}(\lambda^{-1}) \end{aligned} $$

for which the posterior distribution is found via Bayes’ Rule.

For the Logistic Lasso, both formulations are equivalent [3] in the sense that they yield the same point estimates: the classical estimator $\boldsymbol{\hat\beta}$ coincides with the posterior mode (the maximum a posteriori estimate) of the Bayesian model. This connection is discussed in detail in my previous post. Since the same model can be expressed both ways, it may be unclear to someone unfamiliar with Bayesian statistics what people might disagree about here. Let’s proceed to that.
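As a rough numerical check of this equivalence, the sketch below evaluates the classical objective and the negative log posterior at a few parameter values and confirms that they differ only by a constant, so they share the same minimizer. It assumes that $\operatorname{Laplace}(\lambda^{-1})$ denotes independent Laplace priors with scale $\lambda^{-1}$ on each coordinate; the toy data and all function names are hypothetical and chosen only for illustration.

```python
import math


def logistic(z):
    """Logistic function phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood with success probability phi(X_i beta)."""
    total = 0.0
    for row, y_i in zip(X, y):
        p = logistic(sum(x * b for x, b in zip(row, beta)))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total


def lasso_objective(beta, X, y, lam):
    """Classical objective: cross-entropy loss plus the l1 penalty."""
    return -log_likelihood(beta, X, y) + lam * sum(abs(b) for b in beta)


def negative_log_posterior(beta, X, y, lam):
    """Negative log of likelihood times prior (the evidence term is omitted).

    Assumption: each coordinate of beta has an independent Laplace prior
    with scale 1 / lam, i.e. density (lam / 2) * exp(-lam * |b|).
    """
    log_prior = sum(math.log(lam / 2.0) - lam * abs(b) for b in beta)
    return -(log_likelihood(beta, X, y) + log_prior)


# Hypothetical toy data, for illustration only.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]
lam = 0.7

for beta in ([0.0, 0.0], [0.4, -1.1], [-2.0, 3.0]):
    gap = negative_log_posterior(beta, X, y, lam) - lasso_objective(beta, X, y, lam)
    print(beta, gap)  # the gap is the same for every beta
```

Because the gap does not depend on $\boldsymbol\beta$, minimizing one function is the same as minimizing the other.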

Statistical Learning Theory

The first philosophical question we consider is what statistical learning is. This fundamental question has been considered by a variety of people throughout history. One formulation, due to Vapnik [4], involves defining a loss function $L(y, \hat{y})$ for predicted data, and finding a function $f$ that minimizes the expected loss

$$ \underset{f}{\arg\min} \int_\Omega L(y, f(x)) \,\mathrm{d} F(x,y) $$

with respect to an unknown distribution $F(x,y)$. Because $F$ is unknown and the data is finite, this problem is then approximated in various ways, for instance by replacing the expectation with an average over the observed data, or by restricting the domain of optimization. In this approach, a statistical learning problem is defined to be a functional optimization problem, the problem’s answer is given by the function $f$, and the model $\mathscr{M}$ is given by the loss function together with whatever approximations are made. For Logistic Lasso, we assume that the functional form of $f$ is given by $\phi(\mathbf{X}\boldsymbol\beta)$, and that $L$ is $\ell^1$-regularized cross-entropy loss.
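To illustrate the kind of approximation involved, here is a minimal sketch in which the integral over the unknown $F$ is replaced by an average over a finite sample, and the search over all functions is restricted to a one-parameter family $f(x) = \phi(bx)$. The data, the candidate grid, and the function names are all hypothetical; the regularizer is left out to keep the example short.

```python
import math


def empirical_risk(f, data, loss):
    """Average loss of the predictor f over the observed (x, y) pairs."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)


def cross_entropy(y, p):
    """Cross-entropy loss for a binary label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def make_predictor(b):
    """Restrict f to the one-parameter family f(x) = phi(b * x)."""
    return lambda x: 1.0 / (1.0 + math.exp(-b * x))


# Hypothetical sample standing in for draws from the unknown F(x, y).
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]

# Restricted domain of optimization: a coarse grid of slopes b.
candidates = [k / 10.0 for k in range(-50, 51)]
best_b = min(
    candidates,
    key=lambda b: empirical_risk(make_predictor(b), data, cross_entropy),
)
print(best_b, empirical_risk(make_predictor(best_b), data, cross_entropy))
```

A grid search is used here only because it is easy to read; in practice one would use a proper optimizer.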

Bayesian Theory

The other formalism we consider involves defining statistical learning more abstractly. We suppose that we are given a parameter $\theta$ and data set $x$. We define a set $\Theta$ consisting of true-false statements $\theta = \theta'$ and $x = x'$ for all possible parameter values $\theta'$ and data values $x'$. From the data, we know the statement $x = x'$ is true, but we do not know which $\theta'$ makes it so that $\theta = \theta'$ is true. Thus, we cannot simply deduce $\theta$ via logical reasoning, and must extend the concept of logical reasoning to accommodate uncertainty.

To do so, we suppose that there is a relationship between $x$ and $\theta$ such that different values of $x$ may change the relative truth of different values of $\theta$. Thus, we seek to define a function $\mathbb{P}(\theta = \theta' \mid x = x')$ such that if $x = x'$ is true, the function tells us how close to true or to false $\theta = \theta'$ is. To perform logical reasoning under uncertainty, we need to specify two probability distributions, the likelihood $f(x \mid \theta)$ and the prior $\pi(\theta)$, and calculate

$$ f(\theta \mid x) = \frac{f(x \mid \theta) \pi(\theta)}{\int_\Theta f(x \mid \theta) \pi(\theta) \,\mathrm{d} \theta} \propto f(x \mid \theta) \pi(\theta) $$

using Bayes’ Rule, which gives us the posterior distribution. In this approach, statistical learning is taken to mean reasoning under uncertainty, the answer is given by the probability distribution $f(\theta \mid x)$, and the model $\mathscr{M}$ is given by the likelihood together with the prior. For Logistic Lasso, we assume that the likelihood is Bernoulli, and that the prior is Laplace.
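To see Bayes’ Rule in action, consider a deliberately simple sketch: a single Bernoulli parameter $\theta$ restricted to a finite grid, so that the integral in the denominator becomes a sum. This toy model is not the Logistic Lasso; the grid, the uniform prior, the data, and the function names are all assumptions made for illustration.

```python
def bernoulli_likelihood(x, theta):
    """Probability of the 0/1 sequence x when each entry is Ber(theta)."""
    k = sum(x)
    return theta ** k * (1 - theta) ** (len(x) - k)


def grid_posterior(thetas, prior, likelihood, x):
    """Apply Bayes' Rule on a grid: likelihood times prior, then normalize."""
    unnormalized = [likelihood(x, t) * p for t, p in zip(thetas, prior)]
    evidence = sum(unnormalized)  # the sum plays the role of the integral
    return [u / evidence for u in unnormalized]


# Hypothetical grid of parameter values with a uniform prior over it.
thetas = [i / 100 for i in range(1, 100)]
prior = [1.0 / len(thetas)] * len(thetas)

# Hypothetical observed data: six ones and two zeros.
x = [1, 1, 0, 1, 1, 0, 1, 1]

posterior = grid_posterior(thetas, prior, bernoulli_likelihood, x)
mode = thetas[posterior.index(max(posterior))]
print(mode)
```

The output of the procedure is the whole list `posterior`, one weight per candidate value of $\theta$, rather than a single number.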

Interpretation of Models

At first glance, the theories may appear somewhat different, but the Logistic Lasso—and just about every model used in practice—can be formalized in both ways. This leads to the first question.

Should we interpret statistical models as probability distributions or as loss functions?

The answer, of course, depends on the preferences of the person being asked: if we want, we may interpret a model whose loss function corresponds to a posterior distribution in a Bayesian way. The probabilistic structure it possesses can be a useful theoretical tool for understanding its behavior. This lets us see, for instance, that if priors are considered subjective, regularizers must be as well. We conclude with an informal definition of this class of models.

Definition. A model $\mathscr{M}$ is mathematically Bayesian if it can be fully specified via a prior $\pi(\theta)$ and likelihood $f(x \mid \theta)$ for which the posterior distribution $f(\theta \mid x)$ is well-defined.

Assessment of Inferential Uncertainty

The second question does not concern the model in a mathematical sense. Instead, we consider an abstract procedure $\mathscr{P}$ that utilizes a model $\mathscr{M}$ to do something useful. Here, we encounter our second question.

Should we assess uncertainty regarding what was learned about $\theta$ from the data by computing the posterior distribution $f(\theta \mid x)$?

Often, assessing inferential uncertainty is interesting, but not always. One important note is that for any given data set, the uncertainty given by $f(\theta \mid x)$ is completely determined by the specification of $\mathscr{M}$. If $\mathscr{M}$ is not the correct model, its uncertainty estimates may be arbitrarily bad, even if its predictions are good. Thus, we may prefer to not assess uncertainty at all, rather than delude ourselves into thinking we know it.

Similarly, for some problems there may exist a simple and easy way to determine whether $\theta$ is good or not. For example, in image classification, we might simply ask a human if the labels produced by $\theta$ are reasonable. This might be far more effective than using the probability distribution $f(\theta \mid x)$ to compare the chosen value for $\theta$ to other possible values, especially when calculating $f(\theta \mid x)$ is challenging.

This leads to a choice undertaken by the practitioner: should $f(\theta \mid x)$ be calculated, or is picking one value $\hat\theta$ good enough? In some cases, such as when a decision-theoretic analysis is performed, $f(\theta \mid x)$ is indispensable; other times it is unnecessary. We conclude with an informal definition encompassing this choice.

Definition. A statistical procedure $\mathscr{P}$ that makes use of a model $\mathscr{M}$ is computationally Bayesian if it involves calculation of the full posterior distribution $f(\theta \mid x)$ in at least one of its steps.
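The contrast can be made concrete with two hypothetical procedures built on the same one-dimensional Logistic Lasso model. The first returns only the posterior mode and so is not computationally Bayesian; the second approximates the full posterior on a grid and so is. Both the grid approximation and the toy data are assumptions made for illustration, not a recommended way to compute posteriors in practice.

```python
import math


def log_unnormalized_posterior(b, data, lam):
    """Bernoulli log-likelihood plus Laplace log-prior, up to a constant."""
    total = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-b * x))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total - lam * abs(b)


def point_estimate(grid, data, lam):
    """Not computationally Bayesian: return a single value, the mode."""
    return max(grid, key=lambda b: log_unnormalized_posterior(b, data, lam))


def full_posterior(grid, data, lam):
    """Computationally Bayesian: return normalized weights over the grid."""
    logs = [log_unnormalized_posterior(b, data, lam) for b in grid]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy data and grid, for illustration only.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]
grid = [k / 10.0 for k in range(-50, 51)]
lam = 1.0

b_hat = point_estimate(grid, data, lam)
posterior = full_posterior(grid, data, lam)

# Summaries that are only available once the full posterior is computed.
post_mean = sum(b * w for b, w in zip(grid, posterior))
post_sd = math.sqrt(sum((b - post_mean) ** 2 * w for b, w in zip(grid, posterior)))
print(b_hat, post_mean, post_sd)
```

Both procedures use the same mathematically Bayesian model; they differ only in whether the whole distribution or a single value is computed.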

Disentangling the Disagreements

It is unfortunate that the term Bayesian has come to mean mathematically Bayesian and computationally Bayesian simultaneously. In my opinion, these distinctions should be considered separately, because they concern two very different questions. In the mathematical case, we are asking whether or not to interpret our model using its probabilistic representation. In the computational case, we are asking whether calculating the entire distribution is necessary, or whether one value suffices.

A model’s Bayesian representation can be useful as a theoretical tool, whether we calculate the posterior or not. If one value does suffice, we should not discard the probabilistic interpretation entirely, because it might help us understand the model’s structure. For the Logistic Lasso, the Bayesian approach makes it obvious where cross-entropy loss comes from: it maps uniquely to the Bernoulli likelihood.
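To spell out this correspondence, here is a short derivation. I write $p_i = \phi(\mathbf{X}_i\boldsymbol\beta)$ as shorthand; the symbol $p_i$ is introduced here only for illustration. Taking the negative logarithm of the Bernoulli likelihood gives

```latex
\begin{aligned}
-\ln \prod_{i=1}^N p_i^{y_i} (1 - p_i)^{1 - y_i}
&= -\sum_{i=1}^N \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right] \\
&= \sum_{i=1}^N \left[ -y_i \ln p_i - (1 - y_i) \ln(1 - p_i) \right],
\end{aligned}
```

which is the cross-entropy term appearing in the classical Logistic Lasso objective.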

It is unfortunate that the two cases are often conflated. It is common to hear practitioners say that they are not interested in whether models are Bayesian or frequentist; instead, it matters whether or not they work. More often than not, models can be interpreted both ways, so the distinction’s premise is itself an illusion. Every mathematical perspective tells us something about the objects we are studying. Even if we do not perform Bayesian calculations, it can often still be useful to think of models in a Bayesian way.

References

[1] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. JRSSB 58(1), 1996.

[2] T. Park and G. Casella. The Bayesian Lasso. JASA 103(402), 2008.

[3] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013.

[4] V. Vapnik. The Nature of Statistical Learning Theory. 1995.