Bayesian statistics provides powerful theoretical tools, but it is also sometimes viewed as a philosophical framework. This has led to rich academic debates over what statistical learning is and how it should be done. Academic debates are healthy when their content is precise and independent issues are not conflated. In this post, I argue that it is not always meaningful to consider the merits of Bayesian learning directly, because the fundamental questions surrounding it encompass not one issue but several, which are best understood independently. Informally, these can be viewed as follows.
- A model is mathematically Bayesian if it is defined using Bayes’ Rule.
- A procedure is computationally Bayesian if it involves calculation of a full posterior distribution.
The key idea of this post is that these two notions are different, and that the common term Bayesian is often ambiguous between them. This ambiguity obscures, for instance, that there are situations where it makes sense to be mathematically but not computationally Bayesian. Let's disentangle the terminology and explore the concepts in more detail.
Motivating Example: Logistic Lasso
To make my arguments concrete, I now introduce the Logistic Lasso model, beginning with notation. Let $\mathbf{X}$ be the $N \times p$ matrix to be used for predicting the binary vector $\mathbf{y}$ of size $N$, let $\boldsymbol{\beta}$ be the parameter vector, and let $\sigma(z) = 1 / (1 + e^{-z})$ be the logistic function.
From the classical perspective, the Logistic Lasso model [1] involves finding the estimator

$$
\hat{\boldsymbol{\beta}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} \left[ -\sum_{i=1}^{N} \Big( y_i \log \sigma(\mathbf{x}_i^\top \boldsymbol{\beta}) + (1 - y_i) \log\big(1 - \sigma(\mathbf{x}_i^\top \boldsymbol{\beta})\big) \Big) + \lambda \|\boldsymbol{\beta}\|_1 \right]
$$

for $\boldsymbol{\beta}$, where $\mathbf{x}_i^\top$ denotes the $i$th row of $\mathbf{X}$ and $\|\cdot\|_1$ denotes the $\ell_1$ norm. On the other hand, the Bayesian Logistic Lasso model [2] is specified using the likelihood and prior

$$
y_i \mid \mathbf{x}_i, \boldsymbol{\beta} \sim \operatorname{Bernoulli}\!\big(\sigma(\mathbf{x}_i^\top \boldsymbol{\beta})\big), \qquad p(\boldsymbol{\beta}) \propto \exp\big(-\lambda \|\boldsymbol{\beta}\|_1\big),
$$
for which the posterior distribution is found via Bayes’ Rule.
For the Logistic Lasso, both formulations are equivalent [3] in the sense that they yield the same point estimate: the posterior mode coincides with the classical estimator. This connection is discussed in detail in my previous post. Since the same model can be expressed both ways, it may be unclear to someone unfamiliar with Bayesian statistics what there is to disagree about here. Let's turn to that now.
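For readers who want something runnable, here is a minimal sketch of the classical formulation using scikit-learn's L1-penalized logistic regression. The synthetic data is purely illustrative, and note that scikit-learn parameterizes the penalty through `C`, which plays the role of $1/\lambda$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: N = 200 observations, p = 3 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.5, -2.0, 0.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(int)

# Classical Logistic Lasso: L1-penalized logistic regression (C = 1/lambda).
clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear", fit_intercept=False)
clf.fit(X, y)
beta_hat = clf.coef_.ravel()
```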
Statistical Learning Theory
The first philosophical question we consider is what statistical learning is. This fundamental question has been considered by a variety of people throughout history. One formulation, due to Vapnik [4], involves defining a loss function $L\big(y, f(\mathbf{x})\big)$ for predicted data, and finding a function $f$ that minimizes the expected loss

$$
R(f) = \int L\big(y, f(\mathbf{x})\big) \,\mathrm{d}P(\mathbf{x}, y)
$$

with respect to an unknown distribution $P(\mathbf{x}, y)$. This expected loss is then approximated in various ways because the data is finite, for instance by replacing the expectation with an empirical average over the data and by restricting the domain of optimization. In this approach, a statistical learning problem is defined to be a functional optimization problem, the problem's answer is given by the function $f$, and the model is given by the loss function together with whatever approximations are made. For the Logistic Lasso, we assume that the functional form of $f$ is given by $f(\mathbf{x}) = \sigma(\mathbf{x}^\top \boldsymbol{\beta})$, and that $L$ is the $\ell_1$-regularized cross-entropy loss.
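Written in this style, the running example becomes an explicit objective over $\boldsymbol{\beta}$ that we minimize numerically. The sketch below reuses the illustrative `X` and `y` from the previous snippet; the derivative-free Powell method is chosen only to sidestep the non-smoothness of the $\ell_1$ term in this small example.

```python
from scipy.optimize import minimize

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def empirical_risk(beta, X, y, lam):
    # L1-regularized cross-entropy loss: the (approximated) expected loss.
    p = np.clip(logistic(X @ beta), 1e-12, 1 - 1e-12)
    cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return cross_entropy + lam * np.sum(np.abs(beta))

# Minimize the empirical risk to obtain the classical point estimate.
beta_erm = minimize(empirical_risk, np.zeros(3), args=(X, y, 1.0), method="Powell").x
```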
Bayesian Theory
The other formalism we consider involves defining statistical learning more abstractly. We suppose that we are given a parameter $\theta$ and a data set $\mathbf{y}$. We define a set consisting of the true-false statements $\{\theta = t\}$ and $\{\mathbf{y} = d\}$ for all possible parameter values $t$ and data values $d$. From the data, we know that the statement $\{\mathbf{y} = d\}$ is true, but we do not know which $t$ makes it so that $\{\theta = t\}$ is true. Thus, we cannot simply deduce $\theta$ via logical reasoning, and must extend the concept of logical reasoning to accommodate uncertainty.
To do so, we suppose that there is a relationship between $\theta$ and $\mathbf{y}$ such that different values of $d$ may change the relative truth of different values of $t$. Thus, we seek to define a function $\mathbb{P}$ such that if $\{\mathbf{y} = d\}$ is true, the function tells us how close to true or to false $\{\theta = t\}$ is. To perform logical reasoning under uncertainty, we need to specify two probability distributions, the likelihood $p(\mathbf{y} \mid \theta)$ and the prior $p(\theta)$, and calculate

$$
p(\theta \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \theta)\, p(\theta)}{\int p(\mathbf{y} \mid \theta)\, p(\theta)\, \mathrm{d}\theta}
$$
using Bayes' Rule, which gives us the posterior distribution. In this approach, statistical learning is taken to mean reasoning under uncertainty, the answer is given by the probability distribution $p(\theta \mid \mathbf{y})$, and the model is given by the likelihood together with the prior. For the Logistic Lasso, we assume that the likelihood is Bernoulli, and that the prior is Laplace.
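To make this concrete on the running example, here is a minimal sketch of the same model written as an unnormalized log-posterior: a Bernoulli log-likelihood plus a Laplace log-prior, with the evidence (the integral in Bayes' Rule) dropped since it is constant in $\boldsymbol{\beta}$. It reuses the illustrative `X`, `y`, and `logistic` from the earlier sketches.

```python
def log_likelihood(beta, X, y):
    # Bernoulli log-likelihood with success probability sigma(x_i^T beta).
    p = np.clip(logistic(X @ beta), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_prior(beta, lam):
    # Laplace(0, 1/lam) log-prior on each coefficient, normalizing constant dropped.
    return -lam * np.sum(np.abs(beta))

def log_posterior(beta, X, y, lam):
    # Unnormalized log-posterior: Bayes' Rule only adds a constant in beta.
    return log_likelihood(beta, X, y) + log_prior(beta, lam)
```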
Interpretation of Models
At first glance, the theories may appear somewhat different, but the Logistic Lasso—and just about every model used in practice—can be formalized in both ways. This leads to the first question.
Should we interpret statistical models as probability distributions or as loss functions?
The answer, of course, depends on the preferences of the person being asked: if we want, we may interpret a model whose loss function corresponds to a posterior distribution in a Bayesian way. The probabilistic structure it possesses can be a useful theoretical tool for understanding its behavior. This lets us see, for instance, that if priors are considered subjective, then regularizers must be as well. We conclude with an informal definition of this class of models.
Definition. A model is mathematically Bayesian if it can be fully specified via a prior and likelihood for which the posterior distribution is well-defined.
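As a quick check of this definition on the running example, maximizing the unnormalized log-posterior from the sketch above reproduces the ERM minimizer, since the two objectives differ only by a sign and an additive constant (same illustrative setup and solver caveats as before).

```python
from scipy.optimize import minimize

# MAP estimate: maximize the log-posterior, i.e. minimize its negation.
beta_map = minimize(lambda b: -log_posterior(b, X, y, 1.0),
                    np.zeros(3), method="Powell").x
# beta_map should agree with beta_erm from the ERM sketch up to solver tolerance.
```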
Assessment of Inferential Uncertainty
The second question does not concern the model in a mathematical sense. Instead, it concerns an abstract procedure that uses a model to do something useful. It can be phrased as follows.
Should we assess uncertainty regarding what was learned about $\theta$ from the data by computing the posterior distribution $p(\theta \mid \mathbf{y})$?
Often, assessing inferential uncertainty is interesting, but not always. One important note is that, for any given data set, the uncertainty given by $p(\theta \mid \mathbf{y})$ is completely determined by the specification of the likelihood and prior. If the specified model is not correct, its uncertainty estimates may be arbitrarily bad, even if its predictions are good. Thus, we may prefer not to assess uncertainty at all, rather than delude ourselves into thinking we know it.
Similarly, for some problems there may exist a simple way to determine whether a given value of $\theta$ is good or not. For example, in image classification, we might simply ask a human whether the labels it produces are reasonable. This can be far more effective than using the posterior distribution $p(\theta \mid \mathbf{y})$ to compare the chosen value of $\theta$ with other possible values, especially when calculating $p(\theta \mid \mathbf{y})$ is challenging.
This leads to a choice undertaken by the practitioner: should the full posterior $p(\theta \mid \mathbf{y})$ be calculated, or is picking one value of $\theta$ good enough? In some cases, such as when a decision-theoretic analysis is performed, $p(\theta \mid \mathbf{y})$ is indispensable; at other times it is unnecessary. We conclude with an informal definition encompassing this choice.
Definition. A statistical procedure that makes use of a model is computationally Bayesian if it involves calculation of the full posterior distribution in at least one of its steps.
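For completeness, here is a minimal sketch of one way to be computationally Bayesian on the running example: a random-walk Metropolis sampler targeting the full posterior rather than a single point. The step size, chain length, and lack of burn-in or convergence checks are illustrative simplifications, not a recommendation.

```python
def metropolis(log_post, init, n_samples=5000, step=0.1, seed=1):
    # Random-walk Metropolis: propose a Gaussian perturbation and accept it
    # with probability min(1, posterior ratio); otherwise keep the current state.
    rng = np.random.default_rng(seed)
    beta = np.asarray(init, dtype=float)
    current = log_post(beta)
    samples = []
    for _ in range(n_samples):
        proposal = beta + step * rng.normal(size=beta.shape)
        proposed = log_post(proposal)
        if np.log(rng.uniform()) < proposed - current:
            beta, current = proposal, proposed
        samples.append(beta.copy())
    return np.array(samples)

samples = metropolis(lambda b: log_posterior(b, X, y, 1.0), np.zeros(3))
lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)  # rough 95% credible intervals
```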
Disentangling the Disagreements
It is unfortunate that the term Bayesian has come to mean mathematically Bayesian and computationally Bayesian simultaneously. In my opinion, these notions should be considered separately, because they concern two very different questions. In the mathematical case, we are asking whether or not to interpret our model using its probabilistic representation. In the computational case, we are asking whether calculating the entire posterior distribution is necessary, or whether one value suffices.
A model's Bayesian representation can be useful as a theoretical tool, whether we calculate the posterior or not. Even if one value does suffice, we should not discard the probabilistic interpretation entirely, because it might help us understand the model's structure. For the Logistic Lasso, the Bayesian approach makes it obvious where the cross-entropy loss comes from: it is the negative logarithm of the Bernoulli likelihood.
Unfortunately, the two cases are often conflated. It is common to hear practitioners say that they are not interested in whether models are Bayesian or frequentist; instead, what matters is whether or not they work. More often than not, models can be interpreted both ways, so the premise of the distinction is itself an illusion. Every mathematical perspective tells us something about the objects we are studying. Even if we do not perform Bayesian calculations, it can often still be useful to think of models in a Bayesian way.
References
1. R. Tibshirani. Regression Shrinkage and Selection via the Lasso. JRSSB 58(1), 1996.
2. T. Park and G. Casella. The Bayesian Lasso. JASA 103(482), 2008.
3. A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013.
4. V. Vapnik. The Nature of Statistical Learning Theory. 1995.