Alexander Terenin

Bayesian Learning - by example

Welcome to my blog! For my first post, I decided that it would be useful to write a short introduction to Bayesian learning, and its relationship with the more traditional optimization-theoretic perspective often used in artificial intelligence and machine learning, presented in a minimally technical fashion. We begin by introducing an example.

Example: binary classification using a fully connected network

First, let’s introduce notation. For simplicity suppose there are no biases, and define the following.

  • $\boldsymbol{y}_{N\times 1}$: a binary vector where each element is a target data point; $N$ is the number of data points.
  • $\mathbf{X}_{N\times p}$: a matrix where each row is an input data vector; $p$ is the dimensionality of each input.
  • $\boldsymbol\beta^{(x)}_{p \times m}$: the matrix that maps the input to the hidden layer; $m$ is the number of hidden units.
  • $\boldsymbol\beta^{(h)}_{m \times 1}$: the vector that maps the hidden layer to the output.
  • $\sigma$: the network's activation function, for instance a ReLU function.
  • $\phi$: the softmax function.

The standard approach

We begin by defining an optimization problem. Let $\boldsymbol\beta$ be a $k$-dimensional vector consisting of all values of $\boldsymbol\beta^{(x)}$ and $\boldsymbol\beta^{(h)}$ stacked together. Our network's prediction $\hat{\boldsymbol{y}} \in [0,1]^N$ is given by

$$ \hat{\boldsymbol{y}} = \phi\left(\sigma\left(\mathbf{X} \boldsymbol\beta^{(x)}\right) \boldsymbol\beta^{(h)}\right) $$
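As a concrete illustration, here is a minimal NumPy sketch of this forward pass. All names (`forward`, `relu`, `sigmoid`) are illustrative, and $\phi$ is implemented as the logistic sigmoid, which is what the two-class softmax reduces to when there is a single output unit.

```python
import numpy as np

def relu(z):
    # sigma: elementwise ReLU activation
    return np.maximum(z, 0.0)

def sigmoid(z):
    # phi: the two-class softmax with one output unit reduces to the logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, beta_x, beta_h):
    # y_hat = phi(sigma(X beta_x) beta_h), as in the equation above
    return sigmoid(relu(X @ beta_x) @ beta_h)

# Tiny example: N = 4 data points, p = 3 inputs, m = 5 hidden units
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
beta_x = rng.normal(size=(3, 5))
beta_h = rng.normal(size=(5,))
y_hat = forward(X, beta_x, beta_h)  # shape (4,), each entry in (0, 1)
```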

Now, we proceed to learn the weights. Let $\boldsymbol{\hat\beta}$ be the learned values for $\boldsymbol\beta$, let $\Vert\cdot\Vert$ be the $\ell^2$ norm, fix some $\lambda \in \mathbb{R}^+$, and set

$$ \boldsymbol{\hat\beta} = \underset{\boldsymbol\beta}{\arg\min}\left[ \sum_{i=1}^N -y_i\ln(\hat{y}_i) - (1-y_i)\ln(1 - \hat{y}_i) + \lambda\Vert\boldsymbol\beta\Vert^2\right] . $$

The expression being minimized is called cross-entropy loss.[1] The loss is differentiable, so we can minimize it by using gradient descent or any other method we wish. Learning takes place by minimizing the loss, and the values we learn—here, $\boldsymbol{\hat\beta}$—are a point in $\mathbb{R}^k$.
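The optimization above can be sketched in a few lines of Python. This is a toy illustration, not a serious implementation: names are made up, and the gradient is approximated by finite differences rather than backpropagation, which is fine at this scale.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(beta, X, y, lam, m):
    # Unpack the stacked vector beta into beta_x (p x m) and beta_h (m,)
    p = X.shape[1]
    beta_x = beta[: p * m].reshape(p, m)
    beta_h = beta[p * m :]
    y_hat = sigmoid(relu(X @ beta_x) @ beta_h)
    eps = 1e-12  # guard against log(0)
    cross_entropy = -np.sum(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return cross_entropy + lam * np.sum(beta ** 2)

def num_grad(f, beta, h=1e-6):
    # Central finite differences: a stand-in for autodiff, fine for a toy problem
    g = np.zeros_like(beta)
    for i in range(beta.size):
        e = np.zeros_like(beta)
        e[i] = h
        g[i] = (f(beta + e) - f(beta - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
m, lam = 5, 0.1
beta = 0.1 * rng.normal(size=(3 * m + m,))  # k = p*m + m stacked weights

f = lambda b: loss(b, X, y, lam, m)
loss0 = f(beta)
for _ in range(200):
    beta = beta - 0.01 * num_grad(f, beta)  # plain gradient descent
# After these steps, f(beta) is lower than the initial loss loss0
```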

Why cross-entropy rather than some other mathematical expression? In most treatments of classification, the reasons given are purely intuitive: for instance, it is often said to stabilize the optimization algorithm. More rigorous treatments[1] might introduce ideas from information theory. We will provide another explanation.

The Bayesian approach

Let us now define the exact same network, but this time from a Bayesian perspective. We begin by making probabilistic assumptions on our data. Since we have that $\boldsymbol{y} \in \{0,1\}^N$, and since we assume that the order in which $\boldsymbol{y}$ is presented cannot affect learning—this is formally called exchangeability—there is one and only one distribution that $\boldsymbol{y}$ can follow: the Bernoulli distribution. The parameter of that distribution is the same expression $\hat{\boldsymbol{y}}$ as before. Hence, let

$$ \boldsymbol{y} \mid \boldsymbol\beta \sim \operatorname{Ber}\left[\phi \left(\sigma\left(\mathbf{X} \boldsymbol\beta^{(x)}\right) \boldsymbol\beta^{(h)}\right)\right] . $$

This is called the likelihood: it describes the assumptions we are making about the data $\boldsymbol{y}$ given the parameters $\boldsymbol\beta$—here, that the data is binary and exchangeable. Now, define the prior for $\boldsymbol\beta$ as

$$ \boldsymbol\beta \sim \operatorname{N}_k\left(\boldsymbol{0}, \frac{\lambda^{-1}}{2}\mathbf{I}\right) . $$

This describes our assumptions about $\boldsymbol\beta$ external to the data—here, we have assumed that all components of $\boldsymbol\beta$ are a priori independent mean-zero Gaussians. We can combine the prior and likelihood using Bayes' Rule

$$ f(\boldsymbol\beta \mid \boldsymbol{y}) = \frac{f(\boldsymbol{y} \mid \boldsymbol\beta)\, \pi(\boldsymbol\beta)}{\int_{\mathbb{R}^k} f(\boldsymbol{y} \mid \boldsymbol\beta)\, \pi(\boldsymbol\beta) \,\mathrm{d}\boldsymbol\beta} \propto f(\boldsymbol{y} \mid \boldsymbol\beta)\, \pi(\boldsymbol\beta) $$

to obtain the posterior $\boldsymbol\beta \mid \boldsymbol{y}$. This is a probability distribution: it describes what we learned about $\boldsymbol\beta$ from the data. Learning takes place through the use of Bayes' Rule, and the values we learn—here, $\boldsymbol\beta \mid \boldsymbol{y}$—are a probability distribution on $\mathbb{R}^k$.

Connecting the two approaches

Is there any relationship between $\boldsymbol{\hat\beta}$ and $\boldsymbol\beta \mid \boldsymbol{y}$? It turns out, yes—let's show it. First, let's write down the posterior

$$ f(\boldsymbol\beta \mid \boldsymbol{y}) \propto f(\boldsymbol{y} \mid \boldsymbol\beta)\, \pi(\boldsymbol\beta) \propto \left[\prod_{i=1}^N \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}\right] \exp\left[-\lambda\, \boldsymbol\beta^T\boldsymbol\beta\right] . $$

Now, let’s take logs and simplify:

$$ \ln f(\boldsymbol\beta \mid \boldsymbol{y}) = \sum_{i=1}^N \left[ y_i \ln(\hat{y}_i) + (1-y_i)\ln(1 - \hat{y}_i) \right] - \lambda\Vert\boldsymbol\beta\Vert^2 + \mathrm{const} . $$

Having computed that, note that taking logs and adding constants preserve optima, and consider the posterior mode:

$$ \begin{aligned} \underset{\boldsymbol\beta}{\arg\max}\, f(\boldsymbol\beta \mid \boldsymbol{y}) &= \underset{\boldsymbol\beta}{\arg\max}\, \ln f(\boldsymbol\beta \mid \boldsymbol{y}) \\ &= \underset{\boldsymbol\beta}{\arg\max}\left[ \sum_{i=1}^N y_i \ln(\hat{y}_i) + (1-y_i)\ln(1 - \hat{y}_i) - \lambda\Vert\boldsymbol\beta\Vert^2 \right] \\ &= \underset{\boldsymbol\beta}{\arg\min}\left[ \sum_{i=1}^N -y_i \ln(\hat{y}_i) - (1-y_i)\ln(1 - \hat{y}_i) + \lambda\Vert\boldsymbol\beta\Vert^2 \right] \\ &= \boldsymbol{\hat\beta} . \end{aligned} $$
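This equivalence can also be checked numerically. The sketch below (all function names are illustrative) evaluates both the regularized loss and the unnormalized log-posterior at random weights; with the prior's normalizing constant dropped, the two are exact negatives of each other, so their optima coincide.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    # Two-class softmax with one output unit, i.e. the logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def predict(beta, X, m):
    p = X.shape[1]
    beta_x = beta[: p * m].reshape(p, m)
    beta_h = beta[p * m :]
    return sigmoid(relu(X @ beta_x) @ beta_h)

def loss(beta, X, y, lam, m):
    # Cross-entropy plus l2 regularization, as in the standard approach
    y_hat = predict(beta, X, m)
    ce = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return ce + lam * np.sum(beta ** 2)

def log_posterior(beta, X, y, lam, m):
    # Bernoulli log-likelihood plus Gaussian log-prior (normalizing constant dropped)
    y_hat = predict(beta, X, m)
    log_lik = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    log_prior = -lam * np.sum(beta ** 2)
    return log_lik + log_prior

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10).astype(float)
m, lam = 4, 0.5
beta = rng.normal(size=(3 * m + m,))

# The log-posterior equals the negative loss, so maximizing one minimizes the other
same = np.allclose(log_posterior(beta, X, y, lam, m), -loss(beta, X, y, lam, m))
```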

What have we shown? Minimizing the regularized cross-entropy loss is equivalent to maximizing the posterior density: the cross-entropy term maps to the likelihood, and the regularization term maps to the prior.

What it all means

Why is this useful? It gives us a probabilistic interpretation for learning, which helps us to construct and understand our models. This is especially true in more complicated settings: for instance, we might ask, where does $\hat{\boldsymbol{y}} = \sigma(\mathbf{X} \boldsymbol\beta^{(x)}) \boldsymbol\beta^{(h)}$ come from? In fact, we can use ideas from Bayesian nonparametrics to derive $\hat{\boldsymbol{y}}$ by considering a likelihood on a function space under a ReLU basis expansion.[2] The network's loss and architecture can both be explained in a Bayesian way.

There is much more: we could consider drawing samples from the posterior distribution, to quantify uncertainty about how much we learned about $\boldsymbol\beta$ from the data. Markov chain Monte Carlo[3] methods are the most common class of methods for doing so. We can use ideas from hierarchical Bayesian models to define better regularizers than $\ell^2$—the Horseshoe[4] prior is a popular example. For brevity, I'll omit further examples—the book Bayesian Data Analysis[5] is a good introduction, though it largely focuses on methods of interest mainly to statisticians.
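To make the sampling idea concrete, here is a minimal random-walk Metropolis sketch targeting the log-posterior derived above. All names and tuning constants are illustrative, and a real application would use something more sophisticated, such as Hamiltonian Monte Carlo; this only shows the shape of the idea.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior(beta, X, y, lam, m):
    # Unnormalized log-posterior: Bernoulli log-likelihood + Gaussian log-prior
    p = X.shape[1]
    beta_x = beta[: p * m].reshape(p, m)
    beta_h = beta[p * m :]
    y_hat = sigmoid(relu(X @ beta_x) @ beta_h)
    log_lik = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return log_lik - lam * np.sum(beta ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 3))
y = rng.integers(0, 2, size=15).astype(float)
m, lam, step = 4, 0.5, 0.1
beta = np.zeros(3 * m + m)  # start at zero: all predictions are 0.5
lp = log_posterior(beta, X, y, lam, m)

samples = []
for _ in range(2000):
    prop = beta + step * rng.normal(size=beta.shape)   # random-walk proposal
    lp_prop = log_posterior(prop, X, y, lam, m)
    if np.log(rng.uniform()) < lp_prop - lp:           # Metropolis accept/reject
        beta, lp = prop, lp_prop
    samples.append(beta)

samples = np.array(samples)  # draws approximating the posterior beta | y
```

Averages over `samples` then estimate posterior expectations, and their spread quantifies how much uncertainty about $\boldsymbol\beta$ remains after seeing the data.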

At the end of the day, having many different mathematical perspectives enables us to better understand how learning works, because things that are not obvious from one perspective might be easy to see from another. Whereas the optimization-theoretic approach we began with did not give a clear reason for why we should use cross-entropy loss, from a Bayesian point of view it follows directly out of the binary nature of the data. Sometimes the Bayesian approach has little to say about a particular problem; other times it has a lot. It is useful to know how to use it when the need arises, and I hope this short example has given you at least one reason to read about Bayesian statistics in more detail.



[1] See Chapter 5 of Deep Learning [6].

[2] See Chapter 20 of Bayesian Data Analysis [5].

[3] See Chapter 11 of Bayesian Data Analysis [5], but note that MCMC methods are far more general than presented there. An article by P. Diaconis [7] gives a rather different overview.

[4] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010.

[5] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 3rd edition, 2013.

[6] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. 2016.

[7] P. Diaconis. The Markov chain Monte Carlo revolution. Bulletin of the American Mathematical Society, 46(2):179–205, 2009.