<p><em>Alexander Terenin – A blog about statistics, machine learning, and artificial intelligence</em></p>
<h1>What does it mean to be Bayesian?</h1>
<p><em>2017-11-03</em></p>
<p>Bayesian statistics provides powerful theoretical tools, but it is also sometimes viewed as a philosophical framework.
This has led to rich academic debates over what statistical learning is and how it should be done.
Academic debates are healthy when their content is precise and independent issues are not conflated.
In this post, I argue that it is not always meaningful to consider the merits of Bayesian learning directly, because the fundamental questions surrounding it encompass not one issue but several, each best understood independently.
These can be viewed informally as follows.</p>
<ul>
<li>A model is <em>mathematically Bayesian</em> if it is defined using Bayes’ Rule.</li>
<li>A procedure is <em>computationally Bayesian</em> if it involves calculation of a full posterior distribution.</li>
</ul>
<p>The key idea of this post is that the two notions above are different, and that the common term <em>Bayesian</em> is often ambiguous.
This ambiguity obscures, for instance, the fact that there are situations where it makes sense to be mathematically but not computationally Bayesian.
Let’s disentangle the terminology and explore the concepts in more detail.</p>
<h1 id="motivating-example-logistic-lasso">Motivating Example: Logistic Lasso</h1>
<p>To make my arguments concrete, I now introduce the Logistic Lasso model, beginning with notation.
Let $\m{X}_{N \times p}$ be the matrix to be used for predicting the binary vector $\v{y}_{N\times 1}$, let $\v\beta$ be the parameter vector, and let $\phi$ be the logistic function.</p>
<p>From the classical perspective, the Logistic Lasso model<sup id="fnref:lasso"><a href="#fn:lasso" class="footnote">1</a></sup> involves finding the estimator</p>
<p>[
\v{\hat\beta} = \underset{\v\beta}{\arg\min}\cbr{ \sum_{i=1}^N -y_i\ln\del{ \phi(\m{X}_i\v\beta) } - (1-y_i)\ln\del{1 - \phi(\m{X}_i\v\beta)} + \lambda\vert\vert\v\beta\vert\vert_1}
]</p>
<p>for $\lambda \in \R^+$, where $\vert\vert\cdot\vert\vert_1$ denotes the $L^1$ norm. On the other hand, the Bayesian Logistic Lasso model<sup id="fnref:blasso"><a href="#fn:blasso" class="footnote">2</a></sup> is specified using the likelihood and prior</p>
<p>[
\begin{aligned}
y_i \given \v\beta &\dist \f{Ber}\del{\phi(\m{X}_i\v\beta)}
&
\v\beta&\dist \f{Laplace} (\lambda^{-1})
\end{aligned}
]</p>
<p>for which the posterior distribution is found via Bayes’ Rule.</p>
<p>For the Logistic Lasso, both formulations are equivalent<sup id="fnref:bda"><a href="#fn:bda" class="footnote">3</a></sup> in the sense that they yield the same point estimates.
This connection is discussed in detail in my <a href="/blog/2017/07/05/bayesian-learning">previous post</a>.
Since the same model can be expressed both ways, someone unfamiliar with Bayesian statistics may wonder what there is to disagree about here.
Let’s turn to the disagreements.</p>
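<p>To make the classical formulation concrete, here is a minimal sketch – not from the original analysis – that minimizes the regularized cross-entropy objective above by subgradient descent on simulated data. The data, step size, and choice of $\lambda$ are illustrative assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 200, 5, 1.0

# Simulated data: only the first two coefficients are nonzero.
X = rng.normal(size=(N, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def loss(beta):
    """Regularized objective: cross-entropy plus lam * ||beta||_1."""
    z = X @ beta
    # -log(phi(z)) = log(1 + exp(-z)); -log(1 - phi(z)) = log(1 + exp(z))
    ce = np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))
    return ce + lam * np.abs(beta).sum()

# Subgradient descent on the objective; a proximal method would be sharper,
# but this is the simplest thing that exhibits the objective at work.
beta = np.zeros(p)
for _ in range(2000):
    phi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    grad = X.T @ (phi - y) + lam * np.sign(beta)  # sign(0) = 0 is a valid subgradient
    beta = beta - 0.01 * grad
```

<p>After the loop, the learned coefficients reduce the loss relative to the zero vector, and the $L^1$ penalty pushes the irrelevant coefficients toward zero.</p>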
<h1 id="statistical-learning-theory">Statistical Learning Theory</h1>
<p>The first philosophical question we consider is what statistical learning is.
This fundamental question has been considered by a variety of people throughout history.
One formulation – due to Vapnik<sup id="fnref:vv"><a href="#fn:vv" class="footnote">4</a></sup> – involves defining a <em>loss function</em> $L(y, \hat{y})$ for predicted data, and finding a function $f$ that minimizes the expected loss</p>
<p>[
\underset{f}{\arg\min} \int_\Omega L(y, f(x)) \dif F(x,y)
]</p>
<p>with respect to an unknown distribution $F(x,y)$.
This loss is then approximated in various ways because the data is finite – for instance, by restricting the domain of optimization.
In this approach, a <em>statistical learning problem</em> is defined to be a <em>functional optimization problem</em>, the problem’s <em>answer</em> is given by the function $f$, and the model $\mathscr{M}$ is given by the loss function together with whatever approximations are made. For Logistic Lasso, we assume that the functional form of $f$ is given by $\phi(\m{X}\v\beta)$, and that $L$ is $L^1$-regularized cross-entropy loss.</p>
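<p>As a sketch of this recipe – with squared loss and a linear function class as illustrative assumptions, rather than the Logistic Lasso’s – replacing the unknown $F$ with the empirical distribution of the samples and restricting the domain of optimization yields ordinary least squares:</p>

```python
import numpy as np

rng = np.random.default_rng(7)

# Samples from the unknown distribution F(x, y); here y = 1.5x + noise.
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(scale=0.1, size=200)

# Approximation: replace F with the empirical distribution of the samples,
# and restrict the domain of optimization to linear functions f(x) = w x.
# For squared loss L(y, f(x)) = (y - f(x))^2 the minimizer has a closed form.
w = (x @ y) / (x @ x)
```

<p>The learned function is the single number $w$, which should sit near the true slope.</p>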
<h1 id="bayesian-theory">Bayesian Theory</h1>
<p>The other formalism we consider involves defining statistical learning more abstractly.
We suppose that we are given a parameter $\theta$ and data set $x$.
We define a set $\Omega$ consisting of true-false statements $\theta = \theta’$ and $x = x’$ for all possible parameter values $\theta’$ and data values $x’$.
From the data, we know the statement $x=x’$ is true – but we do not know which $\theta’$ makes it so that $\theta = \theta’$ is true.
Thus, we cannot simply deduce $\theta$ via logical reasoning, and must extend the concept of logical reasoning to accommodate uncertainty.</p>
<p>To do so, we suppose that there is a relationship between $x$ and $\theta$ such that different values of $x$ may change the relative truth of different values of $\theta$.
Thus, we seek to define a function $\P(\theta = \theta’ \given x = x’)$ such that if $x=x’$ is true, the function tells us how close to true or to false $\theta=\theta’$ is.
It turns out that, under appropriate formal definitions<sup id="fnref:ct"><a href="#fn:ct" class="footnote">5</a></sup>, any reasonable such function is isomorphic to conditional probability.
Thus, to perform <em>logical reasoning under uncertainty</em>, we need to specify two probability distributions – the <em>likelihood</em> $f(x \given \theta)$ and the <em>prior</em> $\pi(\theta)$ – and calculate</p>
<p>[
f(\theta \given x) = \frac{f(x \given \theta) \pi(\theta)}{\int_\Theta f(x \given \theta) \pi(\theta) \dif \theta} \propto f(x \given \theta) \pi(\theta)
]</p>
<p>using Bayes’ Rule, which gives us the <em>posterior</em> distribution.
In this approach, <em>statistical learning</em> is taken to mean <em>reasoning under uncertainty</em>, the <em>answer</em> is given by the probability distribution $f(\theta \given x)$, and the model $\mathscr{M}$ is given by the likelihood together with the prior.
For Logistic Lasso, we assume that the likelihood is Bernoulli, and that the prior is Laplace.</p>
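<p>This recipe can be carried out directly on a toy problem. The following sketch – illustrative, with a one-dimensional parameter and made-up data – computes the posterior of a Bayesian Logistic Lasso by evaluating Bayes’ Rule on a grid: Bernoulli likelihood, Laplace prior, and explicit normalization by the denominator.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 1.0

# Toy data from a one-dimensional logistic model with true beta = 2.
x = rng.normal(size=100)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * x)))

# Discretize beta and evaluate the log-likelihood and log-prior on the grid.
grid = np.linspace(-5, 5, 2001)
log_lik = np.array([
    np.sum(-y * np.logaddexp(0.0, -b * x) - (1 - y) * np.logaddexp(0.0, b * x))
    for b in grid
])
log_prior = -lam * np.abs(grid)  # Laplace prior, up to an additive constant

# Bayes' Rule: posterior = likelihood * prior / normalizing constant.
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum() * (grid[1] - grid[0])  # the integral in the denominator

mode = grid[np.argmax(post)]  # lands near the true beta
```

<p>Grid evaluation only works in low dimensions, of course; the point is just that the posterior is an ordinary computation once the likelihood and prior are written down.</p>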
<h1 id="interpretation-of-models">Interpretation of Models</h1>
<p>At first glance, the theories may appear somewhat different, but the Logistic Lasso – and just about every model used in practice – can be formalized in both ways.
This leads to the first question.</p>
<blockquote>
<p>Should we interpret statistical models as probability distributions or as loss functions?</p>
</blockquote>
<p>The answer, of course, depends on the preferences of the person being asked – if we want, we may interpret a model whose loss function corresponds to a posterior distribution in a Bayesian way.
The probabilistic structure it possesses can be a useful theoretical tool for understanding its behavior.
This lets us see for instance that if priors are considered subjective, regularizers must be as well.
We conclude with an informal definition of this class of models.</p>
<p><strong>Definition.</strong>
A model $\mathscr{M}$ is <em>mathematically Bayesian</em> if it can be fully specified via a prior $\pi(\theta)$ and likelihood $f(x \given \theta)$ for which the posterior distribution $f(\theta \given x)$ is well-defined.</p>
<h1 id="assessment-of-inferential-uncertainty">Assessment of Inferential Uncertainty</h1>
<p>The second question does not concern the model in a mathematical sense.
Instead, we consider an abstract procedure $\mathscr{P}$ that utilizes a model $\mathscr{M}$ to do something useful.
Here, we encounter our second question.</p>
<blockquote>
<p>Should we assess uncertainty regarding what was learned about $\theta$ from the data by computing the posterior distribution $f(\theta \given x)$?</p>
</blockquote>
<p>Often, assessing inferential uncertainty is interesting, but not always.
One important note is that for any given data set, the uncertainty given by $f(\theta \given x)$ is completely determined by the specification of $\mathscr{M}$.
If $\mathscr{M}$ is not the correct model, its uncertainty estimates may be arbitrarily bad, even if its predictions are good.
Thus, we may prefer not to assess uncertainty at all, rather than delude ourselves into thinking we know it.</p>
<p>Similarly, for some problems there may exist a simple and easy way to determine whether $\theta$ is good or not.
For example, in image classification, we might simply ask a human if the labels produced by $\theta$ are reasonable.
This might be far more effective than using the probability distribution $f(\theta \given x)$ to compare the chosen value for $\theta$ to other possible values, especially when calculating $f(\theta \given x)$ is challenging.</p>
<p>This leads to a choice undertaken by the practitioner: should $f(\theta \given x)$ be calculated, or is picking one value $\hat\theta$ good enough?
In some cases, such as when a decision-theoretic analysis is performed, $f(\theta \given x)$ is indispensable; at other times, it is unnecessary.
We conclude with an informal definition encompassing this choice.</p>
<p><strong>Definition.</strong>
A statistical procedure $\mathscr{P}$ that makes use of a model $\mathscr{M}$ is <em>computationally Bayesian</em> if it involves calculation of the full posterior distribution $f(\theta \given x)$ in at least one of its steps.</p>
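<p>A procedure that is computationally Bayesian in this sense might look like the following sketch, which draws samples from a full posterior using random-walk Metropolis. The model here – Gaussian likelihood and prior – and all tuning constants are illustrative assumptions, not anything from the definitions above.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=50)  # observed data

def log_post(theta):
    # Unnormalized log-posterior: N(theta, 1) likelihood, N(0, 1) prior.
    return -0.5 * np.sum((x - theta) ** 2) - 0.5 * theta**2

# Random-walk Metropolis: a Markov chain whose stationary distribution
# is the full posterior f(theta | x).
theta, chain = 0.0, []
for _ in range(5000):
    proposal = theta + 0.5 * rng.normal()
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    chain.append(theta)

samples = np.array(chain[1000:])  # discard burn-in
# For this conjugate model the posterior is N(n*xbar/(n+1), 1/(n+1)),
# so the sample mean should sit near that value.
```

<p>The calculation of the full posterior – here, via sampling – is what makes the procedure computationally Bayesian; a procedure that stopped at a point estimate would not be.</p>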
<h1 id="disentangling-the-disagreements">Disentangling the Disagreements</h1>
<p>It is unfortunate that the term <em>Bayesian</em> has come to mean <em>mathematically Bayesian</em> and <em>computationally Bayesian</em> simultaneously.
In my opinion, these distinctions should be considered separately, because they concern two very different questions.
In the mathematical case, we are asking whether or not to interpret our model using its probabilistic representation.
In the computational case, we are asking whether calculating the entire distribution is necessary, or whether one value suffices.</p>
<p>A model’s Bayesian representation can be useful as a theoretical tool, whether we calculate the posterior or not.
If one value does suffice, we should not discard the probabilistic interpretation entirely, because it might help us understand the model’s structure.
For the Logistic Lasso, the Bayesian approach makes it obvious where cross-entropy loss comes from: it maps uniquely to the Bernoulli likelihood.</p>
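<p>That mapping is easy to verify numerically. Below is a small sketch – with made-up targets and predicted probabilities – checking that cross-entropy loss equals the negative log-likelihood of a Bernoulli model.</p>

```python
import math

y = [1, 0, 1, 1, 0]            # binary targets
p = [0.9, 0.2, 0.7, 0.6, 0.1]  # predicted probabilities phi(X_i beta)

# Cross-entropy loss, as in the classical objective.
cross_entropy = sum(
    -yi * math.log(pi) - (1 - yi) * math.log(1 - pi) for yi, pi in zip(y, p)
)

# Bernoulli log-likelihood: log of prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}.
log_lik = sum(math.log(pi if yi == 1 else 1 - pi) for yi, pi in zip(y, p))

assert math.isclose(cross_entropy, -log_lik)
```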
<p>It is unfortunate that the two cases are often conflated.
It is common to hear practitioners say that they are not interested in whether models are Bayesian or frequentist – instead, it matters whether or not they work.
More often than not, models can be interpreted both ways, so the distinction’s premise is itself an illusion.
Every mathematical perspective tells us something about the objects we are studying.
Even if we do not perform Bayesian calculations, it can often still be useful to think of models in a Bayesian way.</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:lasso">
<p>R. Tibshirani. Regression Shrinkage and Selection via the Lasso. JRSSB 58(1), 1996. <a href="#fnref:lasso" class="reversefootnote">↩</a></p>
</li>
<li id="fn:blasso">
<p>T. Park and G. Casella. The Bayesian Lasso. JASA 103(402), 2008. <a href="#fnref:blasso" class="reversefootnote">↩</a></p>
</li>
<li id="fn:bda">
<p>A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013. <a href="#fnref:bda" class="reversefootnote">↩</a></p>
</li>
<li id="fn:vv">
<p>V. Vapnik. The Nature of Statistical Learning Theory. 1995. <a href="#fnref:vv" class="reversefootnote">↩</a></p>
</li>
<li id="fn:ct">
<p>A. Terenin and D. Draper. Cox’s Theorem and the Jaynesian Interpretation of Probability. <a href="https://arxiv.org/abs/1507.06597">arXiv:1507.06597</a>, 2015. <a href="#fnref:ct" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<h1>Deep Learning with function spaces</h1>
<p><em>2017-08-16</em></p>
<p>Deep learning is perhaps the single most important breakthrough in statistics, machine learning, and artificial intelligence that has been popularized in recent years.
It allows us to classify images – for decades a challenging problem – nowadays usually with better-than-human accuracy.
It has solved Computer Go, which for decades was the classical example of a board game that was exceedingly difficult for computers to play.
But what exactly is deep learning?</p>
<p>Many popular explanations involve analogies with the human brain, where deep learning models are interpreted as complex networks of neurons interacting with one another.
These perspectives are useful, but they’re not math: just because deep learning models mimic the brain doesn’t mean they provably work.
This post will highlight some ideas that may be helpful in moving toward an understanding of why deep learning works, presented at an intuitive level.
The focus will be on high-level concepts, omitting algebraic details such as the precise form of tensor products.</p>
<h1 id="the-function-space-perspective">The Function Space Perspective</h1>
<p>The key idea of this post is that to understand why deep learning works, we should not work with the network directly.
Instead, we will define a model for learning on a space of functions, truncate that model, and obtain deep learning.</p>
<p>Consider the model</p>
<p>[
\hat{\v{y}} = f(\m{X})
]</p>
<p>where the goal is to learn the function $f$ that maps data $\m{X}$ to the predicted value $\hat{\v{y}}$.
But wait, how do we go about learning a function?
Let’s first consider a single-variable function $f: \R \goesto \R$ and recall that essentially any well-behaved function may be written as an infinite sum with respect to a location-scale basis, i.e. we have for an appropriately defined function $\sigma$ that</p>
<p>[
f(x) = \sum_{k=1}^\infty a_k \, \sigma(b_k x + c_k) + d_k
.
]</p>
<p>What’s happening here?
We’re taking the function $\sigma$, stretching it horizontally by $b_k$, shifting it left-right by $c_k$, scaling it vertically by $a_k$, and shifting it up-down by $d_k$.
As long as $\sigma$ is sufficiently rich to form a basis on $\R$, if we add up infinitely many of them, we can approximate $f$ to any precision we want.
To make learning possible, let’s truncate the sum, so that we sum $K$ elements instead of $\infty$, and get</p>
<p>[
f(x) \approx \sum_{k=1}^K a_k \, \sigma(b_k x + c_k) + d_k
.
]</p>
<p>We now have a finite set of parameters, so given a data set $(\m{X},\v{y})$, we can define a probability distribution for $\v{y}$ under the predicted values $\hat{\v{y}}$, and <a href="/blog/2017/07/05/bayesian-learning">learn the coefficients using Bayes’ Rule</a>.</p>
<p>But wait: the expressions we get by following this procedure, extended to matrices and vectors, are exactly those given by a <a href="/blog/2017/07/05/bayesian-learning">1-layer fully connected network</a>.
This is what a fully connected network does, and this is why it works: we are expanding an arbitrary function with respect to a basis, and learning the coefficients of the expansion using Bayes’ Rule<sup id="fnref:be"><a href="#fn:be" class="footnote">1</a></sup>.
That’s it!</p>
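<p>The truncated expansion is easy to experiment with. The sketch below – with an assumed target function and random shifts and scales – fixes $b_k, c_k$ at random and fits the remaining coefficients by least squares rather than Bayes’ Rule, which is enough to see the approximation at work.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
K = 50  # truncation level
x = np.linspace(-3, 3, 200)
target = np.sin(x)  # an assumed smooth function to approximate

# Random location-scale ReLU features sigma(b_k x + c_k), plus a constant
# column playing the role of the d_k terms.
b, c = rng.normal(size=K), rng.normal(size=K)
Phi = np.maximum(x[:, None] * b[None, :] + c[None, :], 0.0)
Phi = np.column_stack([Phi, np.ones_like(x)])

# Fit the outer coefficients a_k (and d) by least squares.
coef, *_ = np.linalg.lstsq(Phi, target, rcond=None)
max_err = np.abs(Phi @ coef - target).max()  # small: the basis is expressive
```

<p>Even with random $b_k, c_k$, fifty ReLU elements approximate the target closely; learning the inner coefficients too only improves matters.</p>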
<h1 id="going-deep">Going Deep</h1>
<p>With the above perspective in mind, let’s consider deep learning.
We’re going to apply another trick: rather than learning $f$ directly, let’s instead define functions $f^{(1)},f^{(2)},f^{(3)}$ such that</p>
<p>[
\hat{\v{y}} = f(\m{X}) = f^{(1)}\cbr{f^{(2)}\sbr{f^{(3)}\del{\m{X}}}}
]</p>
<p>It’s not obvious why we should do this, but let’s go with it for now.
Then, let $\sigma$ be the ReLU function, and expand $f^{(3)}$ with respect to that basis, just as we did above, but with matrix-vector notation, to get</p>
<p>[
\hat{\v{y}} = f^{(1)}\cbr{f^{(2)}\sbr{ \v{a}^{(3)} \sigma\del{\m{X}\v{b}^{(3)} + \v{c}^{(3)}} + \v{d}^{(3)} }}
.
]</p>
<p>Now, let’s expand $f^{(2)}$, yielding</p>
<p>[
\hat{\v{y}} = f^{(1)}\cbr{\v{a}^{(2)}\sigma\sbr{\del{\v{a}^{(3)} \sigma\del{\m{X}\v{b}^{(3)} + \v{c}^{(3)}} + \v{d}^{(3)}}\v{b}^{(2)} + \v{c}^{(2)}} + \v{d}^{(2)}}
.
]</p>
<p>Notice that we can set $\v{b}^{(2)} = \v{1}$ and $\v{c}^{(2)} = \v{0}$ with no loss of generality to slightly simplify our expression.
Upon expanding $f^{(1)}$, we are left with</p>
<p>[
\hat{\v{y}} = \v{a}^{(1)}\sigma\cbr{\v{a}^{(2)}\sigma\sbr{\v{a}^{(3)} \sigma\del{\m{X}\v{b}^{(3)} + \v{c}^{(3)}} + \v{d}^{(3)}} + \v{d}^{(2)}} + \v{d}^{(1)}
]</p>
<p>which is exactly the expression for a 3-layer fully connected network.</p>
<p>So, what is deep learning?
Deep learning is a model that learns a function $f$ by splitting it up into a sequence of functions $f^{(1)},f^{(2)},f^{(3)},\dots$, performing a ReLU basis expansion on each one, truncating it, and learning the remaining coefficients using Bayes’ Rule.</p>
<h1 id="example-why-residual-networks-work">Example: why Residual Networks work</h1>
<p>This perspective can be used to understand recently popularized technique in deep learning.
For illustrative purposes, let’s consider a 3-layer residual network.
Suppose $\m{X}$ has the same dimensionality as the network’s layers.
A residual network is a model of the form</p>
<p>[
\begin{aligned}
\hat{\v{y}} = f(\m{X}) = &f^{(1)}\cbr{f^{(2)}\sbr{f^{(3)}\del{\m{X}} + \m{X}} + \sbr{f^{(3)}\del{\m{X}} + \m{X}}}
\nonumber
\\
&+ \cbr{f^{(2)}\sbr{f^{(3)}\del{\m{X}} + \m{X}} + \sbr{f^{(3)}\del{\m{X}} + \m{X}}}
.
\end{aligned}
]</p>
<p>So, why do residual networks perform better?
Consider the above from the Bayesian learning point of view: we start with a prior distribution - determined uniquely by the regularization term - and end with a posterior distribution that describes what we learned.
Suppose that nothing is learned in the 3rd layer.
Then the posterior distribution must be the same as the prior.
With $L^2$ regularization, this means that the posterior mode of the coefficients of the basis expansion of $f^{(3)}$ will be zero.
Hence,</p>
<p>[
f^{(3)}(x) = \sum_{k=1}^K 0 \, \sigma(0 \times x + 0) + 0 = 0
]</p>
<p>and the model collapses to</p>
<p>[
\hat{\v{y}} = f(\m{X}) = f^{(1)}\cbr{f^{(2)}\sbr{\m{X}} + \m{X}} + \cbr{f^{(2)}\sbr{\m{X}} + \m{X}}
.
]</p>
<p>Contrast this with a non-residual network, which collapses to</p>
<p>[
\hat{\v{y}} = f(\m{X}) = f^{(1)}\cbr{f^{(2)}\sbr{\v{0}}} = \text{constant}
.
]</p>
<p>In reality, of course, the network learns <em>something</em> in deeper layers, so behavior isn’t quite this bad.
But, if we suppose that deeper layers learn less and less given the same data, the model must eventually stop working if we keep adding layers.
Thus, standard networks don’t work if we make them too deep.
Residual networks fix the problem.</p>
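<p>The collapse argument can be illustrated directly. In the sketch below – simplified to a single skip connection with assumed sizes – a layer whose coefficients are all zero outputs a constant in the plain case but passes the input through in the residual case.</p>

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(5)

d = 6
X = rng.normal(size=(4, d))

def layer(h, W, b):
    """One truncated ReLU basis expansion."""
    return relu(h @ W + b)

# A layer whose posterior mode puts all expansion coefficients at zero.
W0, b0 = np.zeros((d, d)), np.zeros(d)

plain = layer(X, W0, b0)         # plain layer: collapses to a constant (zero)
residual = layer(X, W0, b0) + X  # residual layer: the skip preserves the input
```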
<h1 id="what-have-we-gained-from-this-perspective">What have we gained from this perspective?</h1>
<p>Thinking about function spaces can make deep learning substantially more understandable.
Instead of thinking about networks, which are complicated, we can think about functions, which are in my view simpler.</p>
<p>The ideas above can for instance be used to understand what convolutional networks do: they make assumptions on how each $f^{(i)}$ behaves over space.
Similarly, we can see why ReLU<sup id="fnref:relu"><a href="#fn:relu" class="footnote">2</a></sup> units might perform slightly better than sigmoid units: because they are unbounded, fewer of them may be required to approximate a given function well.</p>
<p>Part of what makes functions simpler is that it is easy to visualize what scaling and shifting does to them.
For example, it is easy to see that switching from ReLU to Leaky ReLU<sup id="fnref:lrelu"><a href="#fn:lrelu" class="footnote">3</a></sup> units is the same as increasing the bias term in the basis expansion.
It’s certainly possible that this may sometimes be helpful, but it would be a big surprise to me if doing this resulted in substantially better performance across the board.</p>
<p>One major question that the function space perspective raises is why learning $f^{(1)}, f^{(2)}, f^{(3)},..$ separately is so much easier than learning $f$ directly.
I don’t know of a good answer to this question.</p>
<p>A key benefit of thinking with function spaces is that it gives us a principled way to derive the expressions needed to define and train networks.
The residual networks presented here differ slightly from the original work in which they were presented<sup id="fnref:resnet"><a href="#fn:resnet" class="footnote">4</a></sup> – more recent work has proposed precisely the formulas derived here<sup id="fnref:resnetidentity"><a href="#fn:resnetidentity" class="footnote">5</a></sup> which were found to improve performance.</p>
<p>I’m not sure why deep learning is not typically presented in this way – the function space perspective is largely omitted from the classical text <em>Deep Learning</em><sup id="fnref:dlintro"><a href="#fn:dlintro" class="footnote">6</a></sup>.
Overall, I hope that this short introduction has been useful for understanding deep learning and making the structure present in the models more transparent.</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:be">
<p>See Chapter 20 of Bayesian Data Analysis<sup id="fnref:bda"><a href="#fn:bda" class="footnote">7</a></sup>. <a href="#fnref:be" class="reversefootnote">↩</a></p>
</li>
<li id="fn:relu">
<p>R Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, H. S. Seung (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405(6789), 2000. <a href="#fnref:relu" class="reversefootnote">↩</a></p>
</li>
<li id="fn:lrelu">
<p>A. L. Maas, A. Y. Hannun, A. Y. Ng. Rectifier Nonlinearities Improve Neural Network Acoustic Models. ICML 30(1), 2013. <a href="#fnref:lrelu" class="reversefootnote">↩</a></p>
</li>
<li id="fn:resnet">
<p>K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. CVPR 28(1), 2015. <a href="#fnref:resnet" class="reversefootnote">↩</a></p>
</li>
<li id="fn:resnetidentity">
<p>K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. ECCV 14(1), 2016. <a href="#fnref:resnetidentity" class="reversefootnote">↩</a></p>
</li>
<li id="fn:dlintro">
<p>See Chapter 6 of Deep Learning<sup id="fnref:dl"><a href="#fn:dl" class="footnote">8</a></sup>. <a href="#fnref:dlintro" class="reversefootnote">↩</a></p>
</li>
<li id="fn:bda">
<p>A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013. <a href="#fnref:bda" class="reversefootnote">↩</a></p>
</li>
<li id="fn:dl">
<p>I. Goodfellow, Y. Bengio, A. Courville. <a href="http://www.deeplearningbook.org">Deep Learning</a>. 2016. <a href="#fnref:dl" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<h1>Bayesian Learning - by example</h1>
<p><em>2017-07-05</em></p>
<p>Welcome to my blog!
For my first post, I decided that it would be useful to write a short introduction to Bayesian learning, and its relationship with the more traditional optimization-theoretic perspective often used in artificial intelligence and machine learning, presented in a minimally technical fashion.
We begin by introducing an example.</p>
<h1 id="example-binary-classification-using-a-fully-connected-network">Example: binary classification using a fully connected network</h1>
<p>First, let’s introduce notation. For simplicity suppose there are no biases, and define the following.</p>
<ul>
<li>$\v{y}_{N\times 1}$: a binary vector where each element is a target data point. $N$ is the number of data points.</li>
<li>$\m{X}_{N\times p}$: a matrix where each row is an input data vector. $p$ is the dimensionality of each input.</li>
<li>$\v\beta^{(x)}_{p \times m}$: the matrix that maps the input to the hidden layer. $m$ is the number of hidden units.</li>
<li>$\v\beta^{(h)}_{m \times 1}$: the vector that maps the hidden layer to the output.</li>
<li>$\sigma$: the network’s activation function, for instance a ReLU function.</li>
<li>$\phi$: the softmax function.</li>
</ul>
<div style="text-align: center;">
<svg width="250px" viewBox="0 0 250 265" xmlns="http://www.w3.org/2000/svg">
<g>
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="200" x2="200" y2="125" />
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="50" x2="200" y2="125" />
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="125" x2="125" y2="162.5" />
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="125" x2="125" y2="87.5" />
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="200" x2="125" y2="87.5" />
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="50" x2="125" y2="162.5" />
</g>
<g>
<ellipse style="stroke: rgb(0, 0, 0); fill: rgb(167, 167, 167);" transform="matrix(1, 0.000003, -0.000003, 1, -73.458435, -2.691527)" cx="123.459" cy="52.691" rx="25" ry="25" />
<ellipse style="stroke: rgb(0, 0, 0); fill: rgb(167, 167, 167);" transform="matrix(1, 0.000003, -0.000003, 1, -73.45842, 147.308301)" cx="123.459" cy="52.691" rx="25" ry="25" />
<ellipse style="stroke: rgb(0, 0, 0); fill: rgb(167, 167, 167);" transform="matrix(1, 0.000003, -0.000003, 1, -73.458427, 72.308369)" cx="123.459" cy="52.691" rx="25" ry="25" />
<ellipse style="fill: rgb(216, 216, 216); stroke: rgb(0, 0, 0);" transform="matrix(1, 0.000003, -0.000003, 1, 1.541482, 34.808468)" cx="123.459" cy="52.691" rx="25" ry="25" />
<ellipse style="fill: rgb(216, 216, 216); stroke: rgb(0, 0, 0);" transform="matrix(1, 0.000003, -0.000003, 1, 1.541482, 109.808331)" cx="123.459" cy="52.691" rx="25" ry="25" />
<ellipse style="stroke: rgb(0, 0, 0); fill: rgb(167, 167, 167);" transform="matrix(1, 0.000003, -0.000003, 1, 76.541375, 72.308369)" cx="123.459" cy="52.691" rx="25" ry="25" />
</g>
<g>
<foreignObject x="35" y="235" width="30" height="30">
<div xmlns="http://www.w3.org/1999/xhtml">
$\m{X}$
</div>
</foreignObject>
<foreignObject x="80" y="235" width="30" height="30">
<div xmlns="http://www.w3.org/1999/xhtml">
$\v\beta^{(x)}$
</div>
</foreignObject>
<foreignObject x="150" y="235" width="30" height="30">
<div xmlns="http://www.w3.org/1999/xhtml">
$\v\beta^{(h)}$
</div>
</foreignObject>
<foreignObject x="190" y="235" width="30" height="30">
<div xmlns="http://www.w3.org/1999/xhtml">
$\v{y}$
</div>
</foreignObject>
</g>
</svg>
</div>
<h1 id="the-standard-approach">The standard approach</h1>
<p>We begin by defining an optimization problem.
Let $\v\beta$ be a $k$-dimensional vector consisting of all values of $\v\beta^{(x)}$ and $\v\beta^{(h)}$ stacked together.
Our network’s prediction $\v{\hat{y}} \in [0,1]^N$ is given by</p>
<p>[
\hat{\v{y}} = \phi\del{\sigma\del{\m{X} \v\beta^{(x)}} \v\beta^{(h)}}
]</p>
<p>Now, we proceed to learn the weights.
Let $\v{\hat\beta}$ be the learned values for $\v\beta$, let $\vert\vert\cdot\vert\vert$ be the $L^2$ norm, fix some $\lambda \in \R^+$, and set</p>
<p>[
\v{\hat\beta} = \underset{\v\beta}{\arg\min}\cbr{ \sum_{i=1}^N -y_i\ln(\hat{y}_i) - (1-y_i)\ln(1 - \hat{y}_i) + \lambda\vert\vert\v\beta\vert\vert^2}
.
]</p>
<p>The expression being minimized is called <em>cross entropy loss</em><sup id="fnref:ce"><a href="#fn:ce" class="footnote">1</a></sup>.
The loss is differentiable, so we can minimize it by using gradient descent or any other method we wish.
Learning takes place by minimizing the loss, and the values we learn – here, $\v{\hat\beta}$ – are a point in $\R^k$.</p>
<p>Why cross-entropy rather than some other mathematical expression?
In most treatments of classification, the reasons given are purely intuitive – for instance, cross-entropy is often said to stabilize the optimization algorithm.
More rigorous treatments<sup id="fnref:ce:1"><a href="#fn:ce" class="footnote">1</a></sup> might introduce ideas from information theory.
We will provide another explanation.</p>
<h1 id="the-bayesian-approach">The Bayesian approach</h1>
<p>Let us now define the exact same network, but this time from a Bayesian perspective. We begin by making probabilistic assumptions on our data.
Since we have that $\v{y} \in \cbr{0,1}^N$, and since we assume that the order in which $\v{y}$ is presented cannot affect learning – this is formally called exchangeability – there is one and only one distribution that $\v{y}$ can follow: the Bernoulli distribution.
The parameter of that distribution is the same expression $\v{\hat{y}}$ as before.
Hence, let</p>
<p>[
\v{y} \given \v\beta \dist\f{Ber}\sbr{\phi\del{\sigma\del{\m{X} \v\beta^{(x)}} \v\beta^{(h)}}}
.
]</p>
<p>This is called the <em>likelihood</em>: it describes the assumptions we are making about the data $\v{y}$ given the parameters $\v\beta$ – here, that the data is binary and exchangeable.
Now, define the <em>prior</em> for $\v\beta$ as</p>
<p>[
\v\beta \dist\f{N}_k\del{0, \frac{\lambda^{-1}}{2}}
.
]</p>
<p>This describes our assumptions about $\v\beta$ external to the data – here, we have assumed that all components of $\v\beta$ are <em>a priori</em> independent mean-zero Gaussians.
We can combine the prior and likelihood using Bayes’ Rule</p>
<p>[
f(\v\beta \given \v{y}) = \frac{f(\v{y} \given \v\beta) \pi(\v\beta)}{\int_{\R^k} f(\v{y} \given \v\beta) \pi(\v\beta) \dif \beta} \propto f(\v{y} \given \v\beta) \pi(\v\beta)
]</p>
<p>to obtain the <em>posterior</em> $\v\beta \given \v{y}$.
This is a probability distribution: it describes what we learned about $\v\beta$ from the data.
Learning takes place through the use of Bayes’ Rule, and the values we learn – here, $\v\beta \given \v{y}$ – are a probability distribution on $\R^k$.</p>
<h1 id="connecting-the-two-approaches">Connecting the two approaches</h1>
<p>Is there any relationship between $\v{\hat\beta}$ and $\v\beta \given \v{y}$?
It turns out, yes – let’s show it. First, let’s write down the posterior</p>
<p>[
f(\v\beta \given \v{y}) \propto f(\v{y} \given \v\beta) \pi(\v\beta) \propto \sbr{\prod_{i=1}^N \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}} \exp\cbr{-\lambda\v\beta^T\v\beta}
.
]</p>
<p>Now, let’s take logs and simplify:</p>
<p>[
\ln f(\v\beta \given \v{y}) = \sum_{i=1}^N y_i \ln(\hat{y}_i) + (1-y_i)\ln(1 - \hat{y}_i) - \lambda\vert\vert\v\beta\vert\vert^2 + \f{const}
.
]</p>
<p>Having computed that, note that taking logs and adding constants preserves optima, and consider the posterior mode:</p>
<p>[
\begin{aligned}
\underset{\v\beta}{\arg\max}\cbr{f(\v\beta \given \v{y})} &= \underset{\v\beta}{\arg\max}\cbr{\ln f(\v\beta \given \v{y})} =
\nonumber
\\
&=\underset{\v\beta}{\arg\max}\cbr{ \sum_{i=1}^N y_i \ln(\hat{y}_i) + (1-y_i)\ln(1 - \hat{y}_i) - \lambda\vert\vert\v\beta\vert\vert^2 } =
\nonumber
\\
&=\underset{\v\beta}{\arg\min}\cbr{ \sum_{i=1}^N -y_i \ln(\hat{y}_i) - (1-y_i)\ln(1 - \hat{y}_i) + \lambda\vert\vert\v\beta\vert\vert^2 } =
\nonumber
\\
&= \v{\hat{\beta}}
.
\end{aligned}
]</p>
<p>What have we shown? Minimizing $L^2$-regularized cross-entropy loss is equivalent to maximizing the posterior density.
The loss function maps to the likelihood, and the regularization term maps to the prior.</p>
<h1 id="what-it-all-means">What it all means</h1>
<p>Why is this useful?
It gives us a probabilistic interpretation for learning, which helps us to construct and understand our models.
This is especially valuable in more complicated settings: for instance, we might ask, where does $\v{\hat{y}} = \sigma\del{\m{X} \v\beta^{(x)}} \v\beta^{(h)}$ come from? In fact, we can use ideas from <em>Bayesian Nonparametrics</em> to derive $\v{\hat{y}}$ by considering a likelihood on a function space under a ReLU basis expansion<sup id="fnref:be"><a href="#fn:be" class="footnote">2</a></sup>.
The network’s loss and architecture can both be explained in a Bayesian way.</p>
<p>There is much more: we could consider drawing samples from the posterior distribution, to quantify uncertainty about how much we learned about $\v\beta$ from the data.
<em>Markov Chain Monte Carlo</em><sup id="fnref:mcmc"><a href="#fn:mcmc" class="footnote">3</a></sup> methods are the most common way of doing so.
We can use ideas from hierarchical Bayesian models to define better regularizers than $L^2$ – the <em>Horseshoe</em><sup id="fnref:hs"><a href="#fn:hs" class="footnote">4</a></sup> prior is a popular example.
For brevity, I’ll omit further examples – the book <em>Bayesian Data Analysis</em><sup id="fnref:bda"><a href="#fn:bda" class="footnote">5</a></sup> is a good introduction, though it largely focuses on methods of interest mainly to statisticians.</p>
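<p>To illustrate the sampling idea, here is a minimal random-walk Metropolis sampler for the single-coefficient logistic posterior. The data, step size, and chain length are illustrative choices for a sketch, not a production-quality MCMC setup.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative toy data: one predictor, binary response.
N, lam = 200, 0.5
x = rng.normal(size=N)
y = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x)))  # true coefficient 1.5

def log_post(beta):
    """Unnormalized log posterior for a single coefficient."""
    log_lik = -np.logaddexp(0, -(2 * y - 1) * (beta * x)).sum()
    return log_lik - lam * beta**2

# Random-walk Metropolis: propose a Gaussian step, accept with
# probability min(1, posterior ratio).
beta, lp, samples = 0.0, log_post(0.0), []
for _ in range(5000):
    proposal = beta + 0.3 * rng.normal()
    lp_prop = log_post(proposal)
    if np.log(rng.uniform()) < lp_prop - lp:
        beta, lp = proposal, lp_prop
    samples.append(beta)
samples = np.array(samples[1000:])  # discard burn-in
```

<p>The mean and standard deviation of <code>samples</code> then summarize not only what we learned about $\v\beta$, but how certain we are about it – something the point estimate $\v{\hat\beta}$ alone cannot provide.</p>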
<p>How general is this perspective?
Very: an abstract result called Cox’s Theorem states, in modern terms, that <em>every true-false logic under uncertainty is isomorphic to conditional probability</em>.
This means that <em>all learning formalizable in the above sense is Bayesian</em>.
So, if you <em>can’t</em> represent a given method in a Bayesian way, I would be rather worried.
For a formal statement and details, see my preprint<sup id="fnref:ct"><a href="#fn:ct" class="footnote">6</a></sup> on the subject.</p>
<p>At the end of the day, having many different mathematical perspectives enables us to better understand how learning works, because things that are not obvious from one perspective might be easy to see from another.
Whereas the optimization-theoretic approach we began with did not give a clear reason why we should use cross-entropy loss, from a Bayesian point of view it follows directly from the binary nature of the data.
Sometimes, the Bayesian approach has little to say about a particular problem, other times it has a lot.
It is useful to know how to use it when the need arises, and I hope this short example has given at least one reason to read about Bayesian statistics in more detail.</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:ce">
<p>See Chapter 5 of Deep Learning<sup id="fnref:dl"><a href="#fn:dl" class="footnote">7</a></sup>. <a href="#fnref:ce" class="reversefootnote">↩</a> <a href="#fnref:ce:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
<li id="fn:be">
<p>See Chapter 20 of Bayesian Data Analysis<sup id="fnref:bda:1"><a href="#fn:bda" class="footnote">5</a></sup>. <a href="#fnref:be" class="reversefootnote">↩</a></p>
</li>
<li id="fn:mcmc">
<p>See Chapter 11 of Bayesian Data Analysis<sup id="fnref:bda:2"><a href="#fn:bda" class="footnote">5</a></sup>, but note that MCMC methods are far more general than presented there. An article<sup id="fnref:pdmcmc"><a href="#fn:pdmcmc" class="footnote">8</a></sup> by P. Diaconis gives a rather different overview. <a href="#fnref:mcmc" class="reversefootnote">↩</a></p>
</li>
<li id="fn:hs">
<p>C. M. Carvalho, N. G. Polson, and J. G. Scott. The Horseshoe estimator for sparse signals. Biometrika, 97(2):1–26, 2010. <a href="#fnref:hs" class="reversefootnote">↩</a></p>
</li>
<li id="fn:bda">
<p>A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013. <a href="#fnref:bda" class="reversefootnote">↩</a> <a href="#fnref:bda:1" class="reversefootnote">↩<sup>2</sup></a> <a href="#fnref:bda:2" class="reversefootnote">↩<sup>3</sup></a></p>
</li>
<li id="fn:ct">
<p>A. Terenin and D. Draper. Cox’s Theorem and the Jaynesian Interpretation of Probability. <a href="https://arxiv.org/abs/1507.06597">arXiv:1507.06597</a>, 2015. <a href="#fnref:ct" class="reversefootnote">↩</a></p>
</li>
<li id="fn:dl">
<p>I. Goodfellow, Y. Bengio, A. Courville. <a href="http://www.deeplearningbook.org">Deep Learning</a>. 2016. <a href="#fnref:dl" class="reversefootnote">↩</a></p>
</li>
<li id="fn:pdmcmc">
<p>P. Diaconis. The Markov Chain Monte Carlo revolution. Bulletin of the American Mathematical Society, 46(2):179–205, 2009. <a href="#fnref:pdmcmc" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>