Jekyll2018-07-09T23:16:56+00:00http://avt.im/Alexander TereninA blog about statistics, machine learning, and artificial intelligenceAlexander TereninHow to use R packages such as ggplot in Julia2018-03-23T00:00:00+00:002018-03-23T00:00:00+00:00http://avt.im/blog/2018/03/23/R-packages-ggplot-in-julia<p>Julia is a wonderful programming language.
It’s modern with good functional programming support, and unlike R and Python - both slow - Julia is fast.
Writing packages is straightforward, and high performance can be obtained without bindings to a lower-level language.
Unfortunately, its plotting frameworks are, at least in my view, not as good as the ggplot package in R.
Fortunately, Julia’s interoperability with other programming languages is outstanding.
In this post, I illustrate how to make ggplot work near-seamlessly with Julia using the RCall package.</p>
<h1 id="calling-r-packages-in-julia">Calling R packages in Julia</h1>
<p>R packages can be loaded in Julia<sup id="fnref:jl"><a href="#fn:jl" class="footnote">1</a></sup> through the RCall<sup id="fnref:rcall"><a href="#fn:rcall" class="footnote">2</a></sup> package by using</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">using</span> <span class="n">RCall</span>
<span class="nd">@rlibrary</span> <span class="n">ggplot2</span></code></pre></figure>
<p>which works much like the popular <code class="highlighter-rouge">@pyimport</code> macro in the PyCall<sup id="fnref:pycall"><a href="#fn:pycall" class="footnote">3</a></sup> package.
It is important to note that this <em>properly loads</em> an R package as a Julia module, rather than simply defining a set of bindings to it.
This means that every function in the R package can automatically be called with Julia data structures as arguments, which will be automatically transformed into R data structures.
There is no need to painstakingly convert every input, as is often necessary when making different languages interface with one another - it is done automatically using the magic offered by 21st century programming languages.
So, we can write</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">qplot</span><span class="x">(</span><span class="mi">1</span><span class="x">:</span><span class="mi">10</span><span class="x">,[</span><span class="n">i</span><span class="o">^</span><span class="mi">2</span> <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="x">:</span><span class="mi">10</span><span class="x">])</span></code></pre></figure>
<p>and a plot generated by the ggplot<sup id="fnref:gg"><a href="#fn:gg" class="footnote">4</a></sup> function <code class="highlighter-rouge">qplot</code> shows up, even though <code class="highlighter-rouge">1:10</code> is a Julia range and <code class="highlighter-rouge">[i^2 for i in 1:10]</code> is a Julia array.</p>
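<p>If importing every ggplot name into the current namespace is undesirable, RCall also provides an <code class="highlighter-rouge">@rimport</code> macro that binds the R package to a module name instead. The following is a minimal sketch, assuming a working R installation with ggplot2 available; the alias <code class="highlighter-rouge">gg</code> is an arbitrary name chosen for illustration.</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia">using RCall

# Bind the R package to a Julia module named gg (alias chosen for illustration).
@rimport ggplot2 as gg

# Same plot as above, with the function qualified by the module name.
gg.qplot(1:10, [i^2 for i in 1:10])</code></pre></figure>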
<h1 id="data-frame-interoperability">Data frame interoperability</h1>
<p>RCall can automatically convert Julia <code class="highlighter-rouge">DataFrame</code> objects into R <code class="highlighter-rouge">data.frame</code> objects.
For example, with the DataFrames package loaded, the following code is valid.</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">d</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="x">(</span><span class="n">x</span> <span class="o">=</span> <span class="x">[</span><span class="mi">1</span><span class="x">,</span><span class="mi">2</span><span class="x">,</span><span class="mi">3</span><span class="x">],</span> <span class="n">y</span> <span class="o">=</span> <span class="x">[</span><span class="mi">4</span><span class="x">,</span><span class="mi">5</span><span class="x">,</span><span class="mi">6</span><span class="x">],</span> <span class="n">z</span> <span class="o">=</span> <span class="x">[</span><span class="mi">1</span><span class="x">,</span><span class="mi">1</span><span class="x">,</span><span class="mi">2</span><span class="x">])</span>
<span class="n">ggplot</span><span class="x">(</span><span class="n">d</span><span class="x">,</span> <span class="n">aes</span><span class="x">(</span><span class="n">x</span><span class="o">=</span><span class="x">:</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="o">=</span><span class="x">:</span><span class="n">y</span><span class="x">))</span> <span class="o">+</span> <span class="n">geom_line</span><span class="x">()</span></code></pre></figure>
<p>Note that the <code class="highlighter-rouge">aes</code> function uses Julia symbols like <code class="highlighter-rouge">:x</code> to refer to data frame columns.
We don’t need to do any Julia to R type conversions: the code simply works.</p>
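<p>To see what the automatic conversion produces, the same step can be performed explicitly. This is a small sketch, assuming RCall and DataFrames are installed, that uses RCall’s <code class="highlighter-rouge">robject</code>, <code class="highlighter-rouge">rcall</code> and <code class="highlighter-rouge">rcopy</code> functions; the variable names <code class="highlighter-rouge">r_d</code> and <code class="highlighter-rouge">d_back</code> are chosen for illustration.</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia">using RCall, DataFrames

d = DataFrame(x = [1,2,3], y = [4,5,6], z = [1,1,2])

# Explicitly convert to an R object; this is the conversion that happens
# implicitly when d is passed to a function loaded with @rlibrary.
r_d = robject(d)

# Ask R which class it sees, and copy the answer back into Julia.
r_class = rcopy(rcall(:class, r_d))

# Convert back to a Julia DataFrame to inspect the columns after a round trip.
d_back = rcopy(DataFrame, r_d)</code></pre></figure>
<p>Inspecting <code class="highlighter-rouge">d_back</code> is a quick way to check whether a column changed type on the way through R.</p>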
<h1 id="dealing-with-dots-formulas-and-other-r-quirks">Dealing with dots, formulas, and other R quirks</h1>
<p>There are a few issues that arise when making complicated plots.
For example, ggplot R commands such as</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">geom_point</span><span class="p">(</span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre></figure>
<p>don’t translate directly to Julia code because the <code class="highlighter-rouge">.</code> in <code class="highlighter-rouge">na.rm</code> is interpreted as Julia syntax.
Similar issues arise if, for instance, an R function uses <code class="highlighter-rouge">end</code> as an argument name.
The solution to this problem is to use the <code class="highlighter-rouge">var</code> string macro provided by RCall, which enables us to write</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">geom_point</span><span class="x">(</span><span class="n">var</span><span class="s">"na.rm"</span> <span class="o">=</span> <span class="n">true</span><span class="x">)</span></code></pre></figure>
<p>in place of the above R code.
This macro works by defining a Julia symbol that includes the dot, which we couldn’t have done with standard syntax.</p>
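<p>The underlying idea can be seen in plain Julia, without RCall. A symbol containing a dot can be built with the <code class="highlighter-rouge">Symbol</code> constructor and passed as a keyword name using the <code class="highlighter-rouge">key =&gt; value</code> form after a semicolon. In the sketch below, <code class="highlighter-rouge">keyword_names</code> is a hypothetical helper written only to show which keyword names a function receives.</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"># Hypothetical helper that reports the keyword names it receives.
keyword_names(; kwargs...) = collect(keys(kwargs))

# A symbol with a dot cannot be written as :na.rm, but it can be constructed.
dotted = Symbol("na.rm")

# Pass it as a keyword name using the pair form after the semicolon.
names_seen = keyword_names(; dotted =&gt; true)</code></pre></figure>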
<p>Another useful feature is the <code class="highlighter-rouge">R</code> string macro, which enables us to write R code in line with Julia code.
For example, the Julia code <code class="highlighter-rouge">R"~z"</code> will execute the R code <code class="highlighter-rouge">~z</code>, which creates an R formula object with the variable <code class="highlighter-rouge">z</code>, and returns it as an R object in Julia.
This can be useful for functions such as <code class="highlighter-rouge">facet_grid</code> and <code class="highlighter-rouge">facet_wrap</code> that accept formulas as input.
It enables us to write</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">ggplot</span><span class="x">(</span><span class="n">d</span><span class="x">,</span> <span class="n">aes</span><span class="x">(</span><span class="n">x</span><span class="o">=</span><span class="x">:</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="o">=</span><span class="x">:</span><span class="n">y</span><span class="x">))</span> <span class="o">+</span> <span class="n">geom_point</span><span class="x">()</span> <span class="o">+</span> <span class="n">facet_wrap</span><span class="x">(</span><span class="n">R</span><span class="s">"~z"</span><span class="x">)</span></code></pre></figure>
<p>as well as execute R functions such as <code class="highlighter-rouge">data.frame</code> if we need to.
We can also use this macro to fix issues arising when automatic data frame conversion doesn’t behave as intended.
This occasionally happens for data frames that contain symbols or strings.
For example, we can write code such as</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">d</span> <span class="o">|></span>
<span class="n">x</span> <span class="o">-></span> <span class="n">R</span><span class="s">"</span><span class="si">$</span><span class="s">x[,1] = as.numeric(</span><span class="si">$</span><span class="s">d[,1]); </span><span class="si">$</span><span class="s">x"</span> <span class="o">|></span>
<span class="n">x</span> <span class="o">-></span> <span class="n">R</span><span class="s">"</span><span class="si">$</span><span class="s">x[,2] = as.factor(as.numeric(</span><span class="si">$</span><span class="s">x[,2])); </span><span class="si">$</span><span class="s">x"</span> <span class="o">|></span>
<span class="n">x</span> <span class="o">-></span> <span class="n">R</span><span class="s">"</span><span class="si">$</span><span class="s">x[,3] = as.factor(as.character(</span><span class="si">$</span><span class="s">x[,3])); </span><span class="si">$</span><span class="s">x"</span></code></pre></figure>
<p>to convert strings to factors inside our data frame - inline.
There are a couple of points worth expanding on here.
Note first the functional style: we use a pipe<sup id="fnref:magrittr"><a href="#fn:magrittr" class="footnote">5</a></sup> to input the data frame <code class="highlighter-rouge">d</code> into a function that takes <code class="highlighter-rouge">x</code> as input and executes the string macro <code class="highlighter-rouge">R"$x[,1] = as.numeric($d[,1]); $x"</code> and returns its results.
These are immediately piped into another function.
The code <code class="highlighter-rouge">$x</code> in the line <code class="highlighter-rouge">R"$x[,1] = as.numeric($d[,1]); $x"</code> means that the Julia variable <code class="highlighter-rouge">x</code> is passed into the R code.
This syntax allows us to execute R code without ever worrying about manually passing variables between Julia and R.</p>
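<p>A smaller example may make the interpolation easier to follow. The sketch below, which assumes a working RCall installation, passes a Julia vector into R, calls R’s <code class="highlighter-rouge">sum</code> on it, and copies the result back with RCall’s <code class="highlighter-rouge">rcopy</code>; the variable names are chosen for illustration.</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia">using RCall

v = [1.0, 2.0, 3.0]

# $v is replaced by the Julia variable v, converted to an R numeric vector.
r_total = R"sum($v)"

# rcopy converts the resulting R object back into a Julia value.
total = rcopy(r_total)</code></pre></figure>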
<p>Putting everything together, it’s easy to make a layered plot such as</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">ggplot</span><span class="x">(</span><span class="n">d</span><span class="x">,</span> <span class="n">aes</span><span class="x">(</span><span class="n">x</span><span class="o">=</span><span class="x">:</span><span class="n">x</span><span class="x">))</span> <span class="o">+</span>
<span class="n">geom_ribbon</span><span class="x">(</span><span class="n">aes</span><span class="x">(</span><span class="n">ymin</span><span class="o">=</span><span class="x">:</span><span class="n">u_min</span><span class="x">,</span> <span class="n">ymax</span><span class="o">=</span><span class="x">:</span><span class="n">u_max</span><span class="x">),</span> <span class="n">fill</span><span class="o">=</span><span class="s">"blue"</span><span class="x">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="x">)</span> <span class="o">+</span>
<span class="n">geom_line</span><span class="x">(</span><span class="n">aes</span><span class="x">(</span><span class="n">y</span><span class="o">=</span><span class="x">:</span><span class="n">u</span><span class="x">),</span> <span class="n">color</span><span class="o">=</span><span class="s">"blue"</span><span class="x">)</span> <span class="o">+</span>
<span class="n">lims</span><span class="x">(</span><span class="n">x</span><span class="o">=</span><span class="x">[</span><span class="mi">0</span><span class="x">,</span><span class="mi">1</span><span class="x">],</span> <span class="n">y</span><span class="o">=</span><span class="x">[</span><span class="o">-</span><span class="mi">1</span><span class="x">,</span><span class="mi">1</span><span class="x">])</span> <span class="o">+</span>
<span class="n">geom_line</span><span class="x">(</span><span class="n">aes</span><span class="x">(</span><span class="n">y</span><span class="o">=</span><span class="x">:</span><span class="n">solution</span><span class="x">),</span> <span class="n">color</span><span class="o">=</span><span class="s">"red"</span><span class="x">)</span> <span class="o">|></span>
<span class="n">p</span> <span class="o">-></span> <span class="n">ggsave</span><span class="x">(</span><span class="s">"p1.pdf"</span><span class="x">,</span> <span class="n">p</span><span class="x">)</span></code></pre></figure>
<p>and save it to a PDF file using functional syntax, without ever writing a line of R code.
In doing so, we sacrifice very little and retain essentially all aspects of ggplot that make it a user-friendly and productive package.
I’ll conclude by noting that everything here is just ordinary use of the RCall package and would work with any R package – in all of the above, we did not use any ggplot-specific Julia packages, nor did we write a single line of language bindings.</p>
<h1 id="why-ggplot-arent-we-using-julia-in-order-to-not-use-r">Why ggplot? Aren’t we using Julia in order to not use R?</h1>
<p>Why bother with ggplot when Julia offers its own full-featured plotting packages such as Gadfly<sup id="fnref:gadfly"><a href="#fn:gadfly" class="footnote">6</a></sup> and Plots.jl<sup id="fnref:plotsjl"><a href="#fn:plotsjl" class="footnote">7</a></sup>?
In my view – and I’m not generally a fan of criticizing other people’s hard work but I find it warranted here and will be as gentle as I can – neither of these frameworks has a well-designed programming interface.
Let’s look at what the issues are, and why ggplot handles them better.</p>
<p>Plots.jl is a powerful, full-featured plotting package.
Unfortunately, its interface is very similar to that of base R: making a complicated plot requires executing a list of commands.
This is its main downside: to use it effectively, the user needs to memorize every command and its options individually – there is no over-arching principle upon which commands are based, which users can learn instead of the commands themselves.
Indeed, such a principle is one of the major features of the Wickham-Wilkinson Grammar of Graphics<sup id="fnref:ggbooks"><a href="#fn:ggbooks" class="footnote">8</a></sup> interface, which works as follows.</p>
<ul>
<li>Plots are visualizations of data frames consisting of layered geometric objects.</li>
<li>Aesthetic mappings describe how individual data points are mapped to geometric objects.</li>
</ul>
<p>For example, to plot a function and a 95% probability interval around that function, we create a data frame where each row contains the function’s $x$ and $f(x)$ values at a point, together with the lower and upper interval endpoints $a$ and $b$.
We then add a <em>line</em> geometric object with the aesthetic mapping $(x,y) \goesto (x, f(x))$, as well as a <em>ribbon</em> geometric object with the mapping $(x,\min,\max) \goesto (x,a,b)$.
We do not need to memorize how lines and ribbons work to use them, and simply follow the principles given by the bullet points above.
If we need to use a new geometric object that we’ve never seen before, all we need to do is look at what kind of aesthetic mappings it utilizes – we never need to memorize any other details.</p>
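<p>The function-and-interval example can be sketched in a few lines using the RCall setup from earlier in this post. The function, the column names <code class="highlighter-rouge">f</code>, <code class="highlighter-rouge">a</code>, <code class="highlighter-rouge">b</code>, and the fixed band of width 0.2 are all made up for illustration; the band is not a real 95% interval.</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia">using RCall, DataFrames
@rlibrary ggplot2

# Each row holds x, f(x), and the lower and upper endpoints a and b.
xs = collect(0:0.01:1)
fx = sin.(2 .* pi .* xs)
d = DataFrame(x = xs, f = fx, a = fx .- 0.2, b = fx .+ 0.2)

# One ribbon layer and one line layer, each described only by its aesthetic mapping.
ggplot(d, aes(x = :x)) +
    geom_ribbon(aes(ymin = :a, ymax = :b), alpha = 0.3) +
    geom_line(aes(y = :f))</code></pre></figure>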
<p>On the other hand, consider the Plots.jl code that I wrote for a project</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">contour</span><span class="x">(</span><span class="o">-</span><span class="mi">3</span><span class="x">:</span><span class="mf">0.1</span><span class="x">:</span><span class="mi">3</span><span class="x">,</span> <span class="o">-</span><span class="mi">3</span><span class="x">:</span><span class="mf">0.1</span><span class="x">:</span><span class="mi">3</span><span class="x">,</span> <span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span> <span class="o">-></span> <span class="n">pdf</span><span class="x">(</span><span class="n">MultivariateNormal</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span><span class="mi">1</span><span class="x">),[</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">]))</span>
<span class="n">scatter!</span><span class="x">(</span><span class="n">θ</span><span class="x">[</span><span class="n">i</span><span class="x">][</span><span class="mi">1</span><span class="x">,:],</span> <span class="n">θ</span><span class="x">[</span><span class="n">i</span><span class="x">][</span><span class="mi">2</span><span class="x">,:])</span></code></pre></figure>
<p>and note how this syntax differs from</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">plot!</span><span class="x">(</span><span class="n">hcat</span><span class="x">(</span><span class="n">L</span><span class="x">,</span><span class="nb">error</span><span class="x">),</span><span class="n">layout</span><span class="o">=</span><span class="mi">2</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="x">[</span><span class="s">"L: test"</span> <span class="s">"Error: test"</span><span class="x">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="x">)</span></code></pre></figure>
<p>where a single matrix is used as input rather than two ranges and a function.
It is <em>a priori</em> unclear whether the input to a particular plotting function should be an array, data frame, or something else.
Looking a bit further, imagine setting color labels in a complicated multilayered plot - in which layer’s command should we specify how labels are displayed?
Ambiguity like this forces the user to spend time reading documentation rather than making their plots, and in my experience the time saved by having concise commands like <code class="highlighter-rouge">plot(x,y)</code> in simple cases does not outweigh the cost in complicated ones.</p>
<p>It’s true that the Grammar of Graphics interface is not well-suited to every kind of plot, but it works well for most of the ones encountered in everyday data science.
Most importantly, it offers a single unified way to think about plots and how to construct them.
Writing plots in it can be more verbose, but I prefer being verbose and consistent to being concise and different in every scenario.
I don’t have time to memorize individual commands in a plotting package that doesn’t contain a central set of guiding principles – and neither should you.</p>
<p>So if I don’t prefer Plots.jl due to its interface, what about Gadfly, which is Grammar of Graphics based?
Unfortunately, Gadfly lacks many useful features, such as transparency and geometric objects like <code class="highlighter-rouge">geom_raster</code>, and suffers from a whole other set of issues that make it difficult to use.
One particular problem is that it uses a varargs-based interface rather than a functional one.
This makes us write things like</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">plot</span><span class="x">(</span><span class="n">plot_data_1</span><span class="x">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"x"</span><span class="x">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"u"</span><span class="x">,</span> <span class="n">Geom</span><span class="o">.</span><span class="n">line</span><span class="x">,</span>
<span class="n">layer</span><span class="x">(</span><span class="n">Geom</span><span class="o">.</span><span class="n">line</span><span class="x">,</span> <span class="n">x</span> <span class="o">=</span> <span class="s">"x"</span><span class="x">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">"solution"</span><span class="x">,</span> <span class="n">Theme</span><span class="x">(</span><span class="n">default_color</span><span class="o">=</span><span class="s">"red"</span><span class="x">)),</span>
<span class="n">layer</span><span class="x">(</span><span class="n">Geom</span><span class="o">.</span><span class="n">line</span><span class="x">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"x"</span><span class="x">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"u_mc"</span><span class="x">,</span> <span class="n">Theme</span><span class="x">(</span><span class="n">default_color</span> <span class="o">=</span> <span class="s">"purple"</span><span class="x">)),</span>
<span class="n">layer</span><span class="x">(</span><span class="n">Geom</span><span class="o">.</span><span class="n">line</span><span class="x">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"x"</span><span class="x">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"u_mf"</span><span class="x">,</span> <span class="n">Theme</span><span class="x">(</span><span class="n">default_color</span> <span class="o">=</span> <span class="s">"orange"</span><span class="x">))</span>
<span class="x">)</span></code></pre></figure>
<p>instead of</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">ggplot</span><span class="x">(</span><span class="n">plot_data_1</span><span class="x">,</span> <span class="n">aes</span><span class="x">(</span><span class="n">x</span><span class="o">=</span><span class="x">:</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="o">=</span><span class="x">:</span><span class="n">u</span><span class="x">))</span> <span class="o">+</span>
<span class="n">geom_line</span><span class="x">(</span><span class="n">color</span><span class="o">=</span><span class="s">"blue"</span><span class="x">)</span> <span class="o">+</span>
<span class="n">geom_line</span><span class="x">(</span><span class="n">aes</span><span class="x">(</span><span class="n">y</span><span class="o">=</span><span class="x">:</span><span class="n">u_mc</span><span class="x">),</span> <span class="n">color</span><span class="o">=</span><span class="s">"purple"</span><span class="x">)</span> <span class="o">+</span>
<span class="n">geom_line</span><span class="x">(</span><span class="n">aes</span><span class="x">(</span><span class="n">y</span><span class="o">=</span><span class="x">:</span><span class="n">u_mf</span><span class="x">),</span> <span class="n">color</span><span class="o">=</span><span class="s">"orange"</span><span class="x">)</span></code></pre></figure>
<p>which is much simpler.
The issue here is that a <code class="highlighter-rouge">...</code> based interface requires the user to waste time on the irritating task of balancing commas and parentheses.
Plots.jl suffers just as much from the exact same problem.</p>
<p>This code raises another major issue, namely that Gadfly doesn’t follow the Grammar of Graphics strictly enough: a color not given by an aesthetic mapping should be defined as part of a geometric object, not part of a theme.
Themes are supposed to control parts of the plot that have nothing to do with the data or geometric objects, such as the font size for the plot’s title – certainly not the color of a line.
This is an inconsistency that a user needs to learn, rather than a consequence of a set of principles that is immediately obvious.</p>
<p>At the end of the day, memorizing a plotting package is not a good use of my time or yours, and after spending a good bit of time with both packages I’ve found dealing with R-Julia interoperability and its occasional difficulties to be a lesser problem compared to the issues raised above.</p>
<h1 id="concluding-thoughts">Concluding thoughts</h1>
<p>Julia is wonderful, made even more so through its strong interoperability given by RCall<sup id="fnref:rcall:1"><a href="#fn:rcall" class="footnote">2</a></sup> and PyCall<sup id="fnref:pycall:1"><a href="#fn:pycall" class="footnote">3</a></sup>.
I find it better than R, and much better than Python.
It does have its flaws.
Its syntax isn’t ideal in certain situations, particularly when writing highly functional code, and would be improved by being more like Scala, or even like pipe-oriented R written with the magrittr<sup id="fnref:magrittr:1"><a href="#fn:magrittr" class="footnote">5</a></sup> package.
Multiple dispatch is not a proper replacement for Python-style objects, and having a language feature similar to Rust’s <em>Implementations</em> would be a major improvement.
This said, in my view Julia is already ahead of R and Python, which have bigger issues than the above.
Usability and cleanliness are critically important in a programming language, and this is why it’s worth using ggplot in Julia.</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:jl">
<p><a href="https://julialang.org">Julia</a> <a href="#fnref:jl" class="reversefootnote">↩</a></p>
</li>
<li id="fn:rcall">
<p><a href="https://github.com/JuliaInterop/RCall.jl">RCall</a> <a href="#fnref:rcall" class="reversefootnote">↩</a> <a href="#fnref:rcall:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
<li id="fn:pycall">
<p><a href="https://github.com/JuliaPy/PyCall.jl">PyCall</a> <a href="#fnref:pycall" class="reversefootnote">↩</a> <a href="#fnref:pycall:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
<li id="fn:gg">
<p><a href="http://ggplot2.tidyverse.org">ggplot</a> <a href="#fnref:gg" class="reversefootnote">↩</a></p>
</li>
<li id="fn:magrittr">
<p><a href="https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html">magrittr</a> <a href="#fnref:magrittr" class="reversefootnote">↩</a> <a href="#fnref:magrittr:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
<li id="fn:gadfly">
<p><a href="http://gadflyjl.org">Gadfly</a> <a href="#fnref:gadfly" class="reversefootnote">↩</a></p>
</li>
<li id="fn:plotsjl">
<p><a href="https://github.com/JuliaPlots/Plots.jl">Plots.jl</a> <a href="#fnref:plotsjl" class="reversefootnote">↩</a></p>
</li>
<li id="fn:ggbooks">
<p>See the original book<sup id="fnref:grammarofgraphics"><a href="#fn:grammarofgraphics" class="footnote">9</a></sup> and ggplot manual<sup id="fnref:ggplot2"><a href="#fn:ggplot2" class="footnote">10</a></sup>. <a href="#fnref:ggbooks" class="reversefootnote">↩</a></p>
</li>
<li id="fn:grammarofgraphics">
<p>L. Wilkinson. The Grammar of Graphics. 2005. <a href="#fnref:grammarofgraphics" class="reversefootnote">↩</a></p>
</li>
<li id="fn:ggplot2">
<p>H. Wickham. ggplot2: Elegant Graphics for Data Analysis. 2016. <a href="#fnref:ggplot2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
What does it really mean to be Bayesian?2018-02-09T00:00:00+00:002018-02-09T00:00:00+00:00http://avt.im/blog/2018/02/09/real-meaning-of-bayesian<p>In my previous posts, I introduced Bayesian models and argued that they are meaningful.
I claimed that studying them is worthwhile because the probabilistic interpretation of learning that they offer can be more intuitive than other interpretations.
I showcased an example illustrating what a Bayesian model looks like.
I did not, however, say what a Bayesian model actually is – at least not in a sufficiently general setting to encompass models people regularly use.
I’m going to discuss that in this post, and then showcase some surprising behavior in infinite-dimensional settings where the general approach is necessary.
The subject matter here can be highly technical, but will be discussed at an intuitive level meant to explain what is going on.</p>
<p><strong>Definition.</strong>
A model $\s{M}$ is <em>mathematically Bayesian</em> if it can be fully specified via a prior $\pi(\theta)$ and likelihood $f(x \given \theta)$ for which the posterior distribution $f(\theta \given x)$ is well-defined.</p>
<p>Here, $\theta$ is an abstract parameter, and $x$ is an abstract data set.
The argument for using Bayesian learning, given by Cox’s Theorem, is that conditional probability can be interpreted as an extension of true-false logic under uncertainty.
This is great – but, formality considerations aside, there are scenarios that involve learning from data that are not included in the above definition.
Let’s look at one.</p>
<h1 id="a-motivating-example">A motivating example</h1>
<p>To illustrate a case not covered by the above definition, consider the problem of learning a function from a finite set of points.
Here, we have a set of points $(y_i, x_i), i=1,\ldots,n$ and we want to learn a function $y = f(x)$ from the data.
A simple Bayesian model for the data can be written as</p>
<p>\[
\begin{aligned}
y_i &= f(x_i) + \eps_i
&
\eps_i &\iid N(0,\sigma^2)
&
f &\dist\f{GP}(\mu, \Sigma)
\end{aligned}
\]</p>
<p>What are we saying here?
If we know $f$, we can use a set of points $x_i$ to generate $y_i$ by calculating $f(x_i)$ and adding Gaussian noise $\eps_i$.
Since we don’t know $f$, we specify its prior probability distribution as a Gaussian process with mean function $\mu: \R \goesto \R$ and covariance function $\Sigma: \R \cross \R \goesto \R$.
Since we’ve specified a conditional and marginal distribution, this defines a joint distribution, so we can try to get the posterior distribution using Bayes’ Rule
\[
f(f \given \v{y},\v{x}) \propto f(\v{y} \given \v{x}, f) \pi(f)
.
\]</p>
<p><em>Except we can’t do that</em>.
The above expression is not well-defined – $\pi(f)$ does not exist, because the probability distribution $f \dist\f{GP}(\mu, \Sigma)$ is a distribution over a space of functions, not of real numbers – therefore, it has no density in the standard sense<sup id="fnref:leb"><a href="#fn:leb" class="footnote">1</a></sup>.</p>
<p>Why not?
A probability density is a function that assigns a weight to every unit of volume in space.
In one dimension, every interval of the form $[a,b]$ is assigned volume $|a-b|$ – this depends only on its length, not its location.
In infinite-dimensional spaces, this is impossible.
It can be proven that any notion of volume must depend both on the length and location – more formally, the infinite-dimensional Lebesgue measure is not locally finite<sup id="fnref:infleb"><a href="#fn:infleb" class="footnote">2</a></sup>.</p>
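<p>One way to see why is the following standard argument, sketched here for the sequence space $\ell^2$ with orthonormal basis $e_1, e_2, \ldots$; the setting and the symbols $\lambda$ and $B(c,r)$, a ball with center $c$ and radius $r$, are chosen for illustration.
Suppose $\lambda$ is a notion of volume that does not depend on location and assigns finite volume to balls.
The balls $B(\tfrac{r}{2} e_i, \tfrac{r}{4})$ all lie inside $B(0,r)$, and they are pairwise disjoint because their centers are a distance $r/\sqrt{2}$ apart, which exceeds the sum $r/2$ of their radii.
Since each of them is a translate of $B(0, \tfrac{r}{4})$, we get</p>
<p>\[
\lambda\big(B(0,r)\big) \geq \sum_{i=1}^{\infty} \lambda\big(B(\tfrac{r}{2} e_i, \tfrac{r}{4})\big) = \sum_{i=1}^{\infty} \lambda\big(B(0, \tfrac{r}{4})\big)
.
\]</p>
<p>The right-hand side is finite only if $\lambda\big(B(0, \tfrac{r}{4})\big) = 0$.
Since $r$ was arbitrary, every ball has volume zero, and because $\ell^2$ can be covered by countably many balls, the whole space has volume zero as well.</p>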
<p>So what do we do?
Is there a sense in which we can consider the above model Bayesian?
Let’s discuss that.</p>
<h1 id="bayesian-learning-as-conditional-probability">Bayesian learning as conditional probability</h1>
<p>If we’re not allowed to discuss probability densities, what else can we do?
One thing that the definition says is that a model is <em>Bayesian</em> if it is <em>probabilistic</em>.
This entails two parts.</p>
<ol>
<li>$\s{M}$ is specified via a joint probability density $f(\theta, x)$ over the parameters and data.</li>
<li>Learning takes place via conditional probability.</li>
</ol>
<p>It turns out that these two intuitive notions are precisely the ones we need.
Informally, this leads to the definition below.</p>
<p><strong>Definition.</strong>
A model $\s{M}$ is <em>mathematically Bayesian</em> if it is fully specified via a random variable $(x,\theta)$ for which the conditional probability distribution $\theta \given x$ exists for all $x$.</p>
<p>This definition can be made formal using measure-theoretic notions such as <em>regular conditional probability</em><sup id="fnref:rcp"><a href="#fn:rcp" class="footnote">3</a></sup> and <em>disintegration</em><sup id="fnref:disint"><a href="#fn:disint" class="footnote">4</a></sup>.
These have various flavors with different technical requirements on $(x,\theta)$ that need to be checked to ensure that writing down a probability distribution conditional on a set of data points actually makes sense.
Let’s now look at two different ways of specifying $(x,\theta)$ in infinite-dimensional settings where the usual approach fails.</p>
<h1 id="two-infinite-dimensional-approaches">Two infinite-dimensional approaches</h1>
<p>One way to define Bayesian models in infinite-dimensional settings is through a <em>top-down</em> approach.
Here, we specify $\theta \given x$ by selecting a complicated but well-defined infinite-dimensional notion of volume.
Often, the prior distribution is used to select this notion of volume.
From there, we can specify how the posterior distribution changes that volume, by writing down a <em>Radon-Nikodym derivative</em><sup id="fnref:rn"><a href="#fn:rn" class="footnote">5</a></sup>.
This viewpoint is often used in the Gaussian measure and Bayesian inverse problem literatures.
The main price we pay is that for many infinite-dimensional models, the prior and posterior distributions may not have the same support – they may fail to be <em>absolutely continuous</em><sup id="fnref:ac"><a href="#fn:ac" class="footnote">6</a></sup>, in which case the Radon-Nikodym derivative between them would not exist.</p>
<p>Alternatively, we could use a <em>bottom-up</em> approach.
Here, we define a family of probability distributions using finite-dimensional slices of our parameter space, using Kolmogorov’s Extension Theorem as our primary theoretical tool for handling the infinite-dimensional object.
This is the primary viewpoint in the Gaussian process and Dirichlet process literatures.
The main price we pay is that from this perspective, we can only reason about the infinite-dimensional object we wish to study indirectly.
This may cause us to make poor choices, such as writing down algorithms that stop working as we approach the infinite-dimensional limit<sup id="fnref:pcn"><a href="#fn:pcn" class="footnote">7</a></sup>, which are easily avoided with a more direct perspective.</p>
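<p>The consistency requirement behind the bottom-up approach can be illustrated with a small sketch. It assumes a zero-mean Gaussian process with a squared exponential kernel (the kernel, the helper names, and the input points are illustrative choices, not taken from the text) and checks that the two-point slice agrees with the marginal of the three-point slice.</p>

```python
import math

def rbf_kernel(x, y, lengthscale=1.0):
    """Squared exponential covariance between two scalar inputs (an assumed kernel)."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)

def covariance_matrix(points):
    """Covariance of the finite-dimensional slice of a zero-mean GP at `points`."""
    return [[rbf_kernel(p, q) for q in points] for p in points]

def submatrix(matrix, indices):
    """Keep only the rows and columns listed in `indices`."""
    return [[matrix[i][j] for j in indices] for i in indices]

# A three-point slice, and the two-point slice obtained by dropping the middle point.
full_points = [0.0, 0.5, 2.0]
full_cov = covariance_matrix(full_points)
marginal_cov = submatrix(full_cov, [0, 2])      # marginalize out the point 0.5
direct_cov = covariance_matrix([0.0, 2.0])      # define the two-point slice directly

# For Gaussians, marginalizing means taking a submatrix, so the two must agree.
consistent = all(
    math.isclose(marginal_cov[i][j], direct_cov[i][j])
    for i in range(2) for j in range(2)
)
print(consistent)
```

<p>Agreement of this kind across all finite slices is the condition that Kolmogorov’s Extension Theorem needs in order to produce a single infinite-dimensional object.</p>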
<h1 id="cromwells-rule-and-some-surprising-consequences">Cromwell’s Rule and some surprising consequences</h1>
<p>We briefly mentioned that in infinite-dimensional settings, prior and posterior distributions may not be absolutely continuous with one another.
This property deserves some attention.
Consider Bayes’ Rule for probabilities</p>
<p>[
\P(B \given A) = \frac{\P(A \given B) \P(B)}{\P(A)}
]</p>
<p>and note that if $\P(A)$ is nonzero, then $\P(B) = 0$ implies $\P(B \given A) = 0$ – no matter what $A$ is.
By analogy, if $A$ is data and $B$ is an event of interest, then Bayes’ Rule ignores the data if the prior probability is zero.
This is often not desirable, which leads to <em>Cromwell’s Rule</em><sup id="fnref:cr"><a href="#fn:cr" class="footnote">8</a></sup>, given below.</p>
<blockquote>
<p>To avoid making learning impossible, the use of prior probabilities that are zero or one should be avoided.</p>
</blockquote>
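<p>As a minimal sketch of why the rule is stated, consider Bayes’ Rule on a finite set of hypotheses. The function name and the numbers below are hypothetical: hypothesis <code class="highlighter-rouge">"c"</code> explains the data far better than the others, but it starts with zero prior probability and therefore ends with zero posterior probability.</p>

```python
def posterior(prior, likelihood):
    """Bayes' Rule on a finite set of hypotheses.

    prior[h] is P(h) and likelihood[h] is P(data | h). Requires P(data) > 0.
    """
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    if evidence == 0:
        raise ValueError("P(data) is zero, so this form of Bayes' Rule does not apply")
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# Hypothetical numbers: "c" fits the data best but has zero prior mass.
prior = {"a": 0.5, "b": 0.5, "c": 0.0}
likelihood = {"a": 0.1, "b": 0.2, "c": 0.9}
post = posterior(prior, likelihood)
print(post)
```

<p>The guard on the evidence term marks the assumption that $\P(A)$ is nonzero, which is exactly the assumption that can fail in infinite-dimensional settings.</p>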
<p>Except, in many infinite-dimensional settings, this doesn’t apply because $\P(A)$ may be zero.
Indeed, it is easy to construct examples where the prior probability of an event is zero, but the posterior probability is nonzero – more formally, where the posterior is not absolutely continuous with respect to the prior.
This is not an esoteric occurrence: even something as basic as adding a mean function to a Gaussian process can break absolute continuity<sup id="fnref:cmt"><a href="#fn:cmt" class="footnote">9</a></sup>.
Let’s examine a case where this happens.</p>
<h1 id="breaking-probabilistic-impossibility">Breaking probabilistic impossibility</h1>
<p>Consider the following model.</p>
<p>[
\begin{aligned}
y_i &\given F \iid F
&
F &\dist\f{DP}(\alpha, \delta_0)
\end{aligned}
]</p>
<p>where $\delta_0$ is a Dirac measure that places all of its probability on zero.
Under the prior, we have
[
\P(F \neq \delta_0) = 0
.
]
The standard posterior for this model is
[
F \given \v{y} \dist\f{DP}\del{\alpha + n, \frac{\alpha}{\alpha+n}\delta_0 + \frac{n}{\alpha+n}\hat{F}_n}
]
where $n$ is the length of $\v{y}$ and $\hat{F}_n$ is the empirical CDF of $\v{y}$.
But we can tell immediately that</p>
<p>[
\P(F \neq \delta_0 \given \v{y}) > 0
.
]</p>
<p>This example illustrates a whole host of bizarre consequences.
Since $F \given \v{y}$ is not absolutely continuous with respect to $F$, we see that in infinite dimensions, data may convince us to believe in something we in a sense thought was impossible.
Furthermore, $\f{DP}(\alpha_1, \delta_0)$ and $\f{DP}(\alpha_2, \delta_0)$ are, as probability distributions, identical – but their respective posterior distributions are not.
So, what matters for Bayesian learning in infinite dimensions is not the distribution of the prior, but the <em>functional form of the joint probability measure</em>.
This behavior is both surprising and typical – conditional probability can act in complicated ways.</p>
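<p>The example above can be checked numerically. The sketch below assumes the standard conjugate update quoted in the text and uses hypothetical data; the function names are illustrative. It builds the posterior base measure for two concentration parameters and reports how much mass it places away from zero. A positive value means a new observation can be nonzero under the posterior, which is only possible if $F \neq \delta_0$ has positive posterior probability.</p>

```python
def dp_posterior(alpha, data):
    """Parameters of the standard DP posterior when the prior is DP(alpha, delta_0).

    Returns the posterior concentration alpha + n and the posterior base
    measure as a dict mapping each atom to its probability.
    """
    n = len(data)
    base = {0.0: alpha / (alpha + n)}          # weight kept on the prior atom at zero
    for y in data:
        base[y] = base.get(y, 0.0) + 1.0 / (alpha + n)   # empirical part
    return alpha + n, base

def mass_off_zero(base):
    """Probability the base measure assigns to points other than zero."""
    return sum(p for atom, p in base.items() if atom != 0.0)

data = [1.5, -0.3]                             # hypothetical observations

# With no data, both priors have the same base measure: all mass at zero.
_, prior_base_1 = dp_posterior(1.0, [])
_, prior_base_2 = dp_posterior(10.0, [])

# With data, the two posteriors differ even though the priors did not.
_, post_base_1 = dp_posterior(1.0, data)
_, post_base_2 = dp_posterior(10.0, data)

print(mass_off_zero(prior_base_1), mass_off_zero(prior_base_2))
print(mass_off_zero(post_base_1), mass_off_zero(post_base_2))
```

<p>The concentration parameter is invisible in the prior here, yet it controls how quickly the posterior moves away from $\delta_0$.</p>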
<h1 id="what-it-all-means">What it all means</h1>
<p>In my view, an abstract model is <em>Bayesian</em> if it is <em>probabilistic</em> and learning takes place through <em>conditional probability</em>.
In well-behaved finite-dimensional settings, this means that learning takes place using Bayes’ Rule.
There, we have a <em>likelihood</em> $f(x \given \theta)$ that acts as the generative distribution for the data given the parameters, and a <em>prior</em> that describes what sorts of parameters we’d like to regularize the learning process towards.
In full generality, however, neither the generative nature of the likelihood nor the use of Bayes’ Rule matters: it is the use of conditional probability that is important.
From a philosophical standpoint this makes sense: learning is just reasoning about something we don’t know using the things we do, and Cox’s Theorem<sup id="fnref:ct"><a href="#fn:ct" class="footnote">10</a></sup> tells us that true-false reasoning under uncertainty must have the same mathematical structure as conditional probability.</p>
<p>Once we’ve taken the general perspective, we are free to define models in infinite-dimensional settings.
Such models are powerful and have proven useful in many applications, but at times they may behave bizarrely.
It’s worthwhile to take a moment to step back, appreciate, and understand why the expressions we calculate are the way they are.</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:leb">
<p>The standard notion of volume is taken to be the Lebesgue measure. See Chapter 3 of Probability and Stochastics<sup id="fnref:cinlar"><a href="#fn:cinlar" class="footnote">11</a></sup>. <a href="#fnref:leb" class="reversefootnote">↩</a></p>
</li>
<li id="fn:infleb">
<p>See Section 1.2 of Analysis and Probability on Infinite-Dimensional Spaces<sup id="fnref:eldredge"><a href="#fn:eldredge" class="footnote">12</a></sup>. <a href="#fnref:infleb" class="reversefootnote">↩</a></p>
</li>
<li id="fn:rcp">
<p>See Chapter 2 of Probability and Stochastics<sup id="fnref:cinlar:1"><a href="#fn:cinlar" class="footnote">11</a></sup>. <a href="#fnref:rcp" class="reversefootnote">↩</a></p>
</li>
<li id="fn:disint">
<p>See Section 2 of Conditioning as Disintegration<sup id="fnref:condasdisint"><a href="#fn:condasdisint" class="footnote">13</a></sup>. <a href="#fnref:disint" class="reversefootnote">↩</a></p>
</li>
<li id="fn:rn">
<p>A Radon-Nikodym derivative tells us how to re-weight one probability measure to obtain another one. See Chapter 5 of Probability and Stochastics<sup id="fnref:cinlar:2"><a href="#fn:cinlar" class="footnote">11</a></sup>. <a href="#fnref:rn" class="reversefootnote">↩</a></p>
</li>
<li id="fn:ac">
<p>If two measures are absolutely continuous, they assign nonzero probability to the same events. See Chapter 5 of Probability and Stochastics<sup id="fnref:cinlar:3"><a href="#fn:cinlar" class="footnote">11</a></sup>. <a href="#fnref:ac" class="reversefootnote">↩</a></p>
</li>
<li id="fn:pcn">
<p>A recent line of work<sup id="fnref:infmcmc"><a href="#fn:infmcmc" class="footnote">14</a></sup> has sought to prevent Markov Chain Monte Carlo algorithms from slowing down for high-dimensional models by ensuring their infinite-dimensional limits are well-defined. <a href="#fnref:pcn" class="reversefootnote">↩</a></p>
</li>
<li id="fn:cr">
<p>See Chapter 6 Section 8 of Understanding Uncertainty<sup id="fnref:lindley"><a href="#fn:lindley" class="footnote">15</a></sup>. <a href="#fnref:cr" class="reversefootnote">↩</a></p>
</li>
<li id="fn:cmt">
<p>The space of vectors that can be added to a Gaussian measure while preserving absolute continuity is called its <em>Cameron-Martin</em> space. See Chapter 5 of Lectures on Gaussian Processes<sup id="fnref:lecgm"><a href="#fn:lecgm" class="footnote">16</a></sup>. <a href="#fnref:cmt" class="reversefootnote">↩</a></p>
</li>
<li id="fn:ct">
<p>A. Terenin and D. Draper. Cox’s Theorem and the Jaynesian Interpretation of Probability. <a href="https://arxiv.org/abs/1507.06597">arXiv:1507.06597</a>, 2015. <a href="#fnref:ct" class="reversefootnote">↩</a></p>
</li>
<li id="fn:cinlar">
<p>E. Çınlar. Probability and Stochastics. 2010. <a href="#fnref:cinlar" class="reversefootnote">↩</a> <a href="#fnref:cinlar:1" class="reversefootnote">↩<sup>2</sup></a> <a href="#fnref:cinlar:2" class="reversefootnote">↩<sup>3</sup></a> <a href="#fnref:cinlar:3" class="reversefootnote">↩<sup>4</sup></a></p>
</li>
<li id="fn:eldredge">
<p>N. Eldredge. Analysis and Probability on Infinite-Dimensional Spaces. 2016. <a href="#fnref:eldredge" class="reversefootnote">↩</a></p>
</li>
<li id="fn:condasdisint">
<p>J. T. Chang and D. Pollard. Conditioning as Disintegration. Statistica Neerlandica 51(3). 1997. <a href="#fnref:condasdisint" class="reversefootnote">↩</a></p>
</li>
<li id="fn:infmcmc">
<p>S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster. Statistical Science 28(3), 2013. <a href="#fnref:infmcmc" class="reversefootnote">↩</a></p>
</li>
<li id="fn:lindley">
<p>D. Lindley. Understanding Uncertainty. 2006. <a href="#fnref:lindley" class="reversefootnote">↩</a></p>
</li>
<li id="fn:lecgm">
<p>M. Lifshits. Lectures on Gaussian Processes. 2012. <a href="#fnref:lecgm" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Alexander TereninIn my previous posts, I introduced Bayesian models and argued that they are meaningful. I claimed that studying them is worthwhile because the probabilistic interpretation of learning that they offered can be more intuitive than other interpretations. I showcased an example illustrating what a Bayesian model looks like. I did not, however, say what a Bayesian model actually is – at least not in a sufficiently general setting to encompass models people regularly use. I’m going to discuss that in this post, and then showcase some surprising behavior in infinite-dimensional settings where the general approach is necessary. The subject matter here can be highly technical, but will be discussed at an intuitive level meant to explain what is going on.What does it mean to be Bayesian?2017-11-03T00:00:00+00:002017-11-03T00:00:00+00:00http://avt.im/blog/2017/11/03/meaning-of-bayesian<p>Bayesian statistics provides powerful theoretical tools, but it is also sometimes viewed as a philosophical framework.
This has led to rich academic debates over what statistical learning is and how it should be done.
Academic debates are healthy when their content is precise and independent issues are not conflated.
In this post, I argue that it is not always meaningful to consider the merits of Bayesian learning directly, because the fundamental questions surrounding it encompass not one issue, but several, that are best understood independently.
These can be viewed informally as follows.</p>
<ul>
<li>A model is <em>mathematically Bayesian</em> if it is defined using Bayes’ Rule.</li>
<li>A procedure is <em>computationally Bayesian</em> if it involves calculation of a full posterior distribution.</li>
</ul>
<p>The key idea of this post is that the two notions above are different, and that the common term <em>Bayesian</em> is often ambiguous.
This makes it unclear, for instance, that there are situations where it makes sense to be mathematically but not computationally Bayesian.
Let’s disentangle the terminology and explore the concepts in more detail.</p>
<h1 id="motivating-example-logistic-lasso">Motivating Example: Logistic Lasso</h1>
<p>To make my arguments concrete, I now introduce the Logistic Lasso model, beginning with notation.
Let <script type="math/tex">\m{X}_{N \times p}</script> be the matrix to be used for predicting the binary vector $\v{y}_{N\times 1}$, let $\v\beta$ be the parameter vector, and let $\phi$ be the logistic function.</p>
<p>From the classical perspective, the Logistic Lasso model<sup id="fnref:lasso"><a href="#fn:lasso" class="footnote">1</a></sup> involves finding the estimator</p>
<p>[
\v{\hat\beta} = \underset{\v\beta}{\arg\min}\cbr{ \sum_{i=1}^N -y_i\ln\del{ \phi(\m{X}_i\v\beta) } - (1-y_i)\ln\del{1 - \phi(\m{X}_i\v\beta)} + \lambda\vert\vert\v\beta\vert\vert_1}
]</p>
<p>for $\lambda \in \R^+$, where $\vert\vert\cdot\vert\vert_1$ denotes the $L^1$ norm. On the other hand, the Bayesian Logistic Lasso model<sup id="fnref:blasso"><a href="#fn:blasso" class="footnote">2</a></sup> is specified using the likelihood and prior</p>
<p>[
\begin{aligned}
y_i \given \v\beta &\dist \f{Ber}\del{\phi(\m{X}_i\v\beta)}
&
\v\beta&\dist \f{Laplace} (\lambda^{-1})
\end{aligned}
]</p>
<p>for which the posterior distribution is found via Bayes’ Rule.</p>
<p>For the Logistic Lasso, both formulations are equivalent<sup id="fnref:bda"><a href="#fn:bda" class="footnote">3</a></sup> in the sense that they yield the same point estimates.
This connection is discussed in detail in my <a href="/blog/2017/07/05/bayesian-learning">previous post</a>.
Since the same model can be expressed both ways, it may be unclear to someone unfamiliar with Bayesian statistics what people might disagree about here.
Let’s proceed to that.</p>
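<p>The equivalence can be made concrete with a short sketch. It assumes that $\f{Laplace}(\lambda^{-1})$ means independent Laplace priors with scale $\lambda^{-1}$ on each coefficient, and it uses a small hypothetical data set. Under that assumption the negative log posterior and the classical objective differ only by a constant that does not depend on $\v\beta$, so they share the same minimizer.</p>

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def lasso_objective(beta, X, y, lam):
    """L1-regularized cross-entropy loss from the classical formulation."""
    loss = 0.0
    for row, label in zip(X, y):
        p = logistic(sum(x * b for x, b in zip(row, beta)))
        loss += -label * math.log(p) - (1 - label) * math.log(1 - p)
    return loss + lam * sum(abs(b) for b in beta)

def negative_log_posterior(beta, X, y, lam):
    """Negative log of the Bernoulli likelihood times independent Laplace priors.

    Assumes each coefficient has density (lam / 2) * exp(-lam * |b|), that is,
    a Laplace distribution with scale 1 / lam. The posterior normalizing
    constant is omitted.
    """
    log_lik = 0.0
    for row, label in zip(X, y):
        p = logistic(sum(x * b for x, b in zip(row, beta)))
        log_lik += label * math.log(p) + (1 - label) * math.log(1 - p)
    log_prior = sum(math.log(lam / 2) - lam * abs(b) for b in beta)
    return -(log_lik + log_prior)

# Hypothetical design matrix, labels, and penalty.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [1, 0, 1]
lam = 0.7

# The gap between the two functions is the same at any two parameter values.
beta_a, beta_b = [0.3, -0.4], [-1.0, 2.5]
gap_a = negative_log_posterior(beta_a, X, y, lam) - lasso_objective(beta_a, X, y, lam)
gap_b = negative_log_posterior(beta_b, X, y, lam) - lasso_objective(beta_b, X, y, lam)
print(gap_a, gap_b)
```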
<h1 id="statistical-learning-theory">Statistical Learning Theory</h1>
<p>The first philosophical question we consider is what statistical learning is.
This fundamental question has been considered by a variety of people throughout history.
One formulation – due to Vapnik<sup id="fnref:vv"><a href="#fn:vv" class="footnote">4</a></sup> – involves defining a <em>loss function</em> $L(y, \hat{y})$ for predicted data, and finding a function $f$ that minimizes the expected loss</p>
<p>[
\underset{f}{\arg\min} \int_\Omega L(y, f(x)) \dif F(x,y)
]</p>
<p>with respect to an unknown distribution $F(x,y)$.
This loss is then approximated in various ways because the data is finite – for instance, by restricting the domain of optimization.
In this approach, a <em>statistical learning problem</em> is defined to be a <em>functional optimization problem</em>, the problem’s <em>answer</em> is given by the function $f$, and the model $\mathscr{M}$ is given by the loss function together with whatever approximations are made. For Logistic Lasso, we assume that the functional form of $f$ is given by $\phi(\m{X}\v\beta)$, and that $L$ is $L^1$-regularized cross-entropy loss.</p>
<h1 id="bayesian-theory">Bayesian Theory</h1>
<p>The other formalism we consider involves defining statistical learning more abstractly.
We suppose that we are given a parameter $\theta$ and data set $x$.
We define a set $\Omega$ consisting of true-false statements $\theta = \theta’$ and $x = x’$ for all possible parameter values $\theta’$ and data values $x’$.
From the data, we know the statement $x=x’$ is true – but we do not know which $\theta’$ makes it so that $\theta = \theta’$ is true.
Thus, we cannot simply deduce $\theta$ via logical reasoning, and must extend the concept of logical reasoning to accommodate uncertainty.</p>
<p>To do so, we suppose that there is a relationship between $x$ and $\theta$ such that different values of $x$ may change the relative truth of different values of $\theta$.
Thus, we seek to define a function $\P(\theta = \theta’ \given x = x’)$ such that if $x=x’$ is true, the function tells us how close to true or to false $\theta=\theta’$ is.
It turns out under appropriate formal definitions<sup id="fnref:ct"><a href="#fn:ct" class="footnote">5</a></sup>, any reasonable such function is isomorphic to conditional probability.
Thus, to perform <em>logical reasoning under uncertainty</em>, we need to specify two probability distributions – the <em>likelihood</em> $f(x \given \theta)$ and <em>prior</em> $\pi(\theta)$, and calculate</p>
<p>[
f(\theta \given x) = \frac{f(x \given \theta) \pi(\theta)}{\int_\Theta f(x \given \theta) \pi(\theta) \dif \theta} \propto f(x \given \theta) \pi(\theta)
]</p>
<p>using Bayes’ Rule, which gives us the <em>posterior</em> distribution.
In this approach, <em>statistical learning</em> is taken to mean <em>reasoning under uncertainty</em>, the <em>answer</em> is given by the probability distribution $f(\theta \given x)$, and the model $\mathscr{M}$ is given by the likelihood together with the prior.
For Logistic Lasso, we assume that the likelihood is Bernoulli, and that the prior is Laplace.</p>
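<p>For a one-dimensional parameter, the posterior calculation can be sketched on a grid. The example below is a simplification chosen for illustration, not the Logistic Lasso itself: it assumes a Bernoulli likelihood with an unknown success probability $\theta$, a flat prior, and hypothetical data, and it replaces the integral in the denominator with a sum over grid points.</p>

```python
def grid_posterior(data, grid, prior):
    """Posterior over `grid` for a Bernoulli success probability theta.

    `prior` holds unnormalized prior weights at each grid point; the sum
    over the grid stands in for the integral in the denominator.
    """
    successes = sum(data)
    failures = len(data) - successes
    unnormalized = [
        (theta ** successes) * ((1 - theta) ** failures) * weight
        for theta, weight in zip(grid, prior)
    ]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

grid = [i / 100 for i in range(1, 100)]   # theta values strictly inside (0, 1)
prior = [1.0] * len(grid)                 # flat prior, an assumption
data = [1, 1, 0, 1, 0, 1, 1, 1]           # hypothetical coin flips: 6 of 8 successes

post = grid_posterior(data, grid, prior)
mode = grid[post.index(max(post))]
print(mode)
```

<p>The list <code class="highlighter-rouge">post</code> is the whole answer in the Bayesian sense; the mode is just one summary of it.</p>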
<h1 id="interpretation-of-models">Interpretation of Models</h1>
<p>At first glance, the theories may appear somewhat different, but the Logistic Lasso – and just about every model used in practice – can be formalized in both ways.
This leads to the first question.</p>
<blockquote>
<p>Should we interpret statistical models as probability distributions or as loss functions?</p>
</blockquote>
<p>The answer, of course, depends on the preferences of the person being asked – if we want, we may interpret a model whose loss function corresponds to a posterior distribution in a Bayesian way.
The probabilistic structure it possesses can be a useful theoretical tool for understanding its behavior.
This lets us see for instance that if priors are considered subjective, regularizers must be as well.
We conclude with an informal definition of this class of models.</p>
<p><strong>Definition.</strong>
A model $\mathscr{M}$ is <em>mathematically Bayesian</em> if it can be fully specified via a prior $\pi(\theta)$ and likelihood $f(x \given \theta)$ for which the posterior distribution $f(\theta \given x)$ is well-defined.</p>
<h1 id="assessment-of-inferential-uncertainty">Assessment of Inferential Uncertainty</h1>
<p>The second question does not concern the model in a mathematical sense.
Instead, we consider an abstract procedure $\mathscr{P}$ that utilizes a model $\mathscr{M}$ to do something useful.
Here, we encounter our second question.</p>
<blockquote>
<p>Should we assess uncertainty regarding what was learned about $\theta$ from the data by computing the posterior distribution $f(\theta \given x)$?</p>
</blockquote>
<p>Often, assessing inferential uncertainty is interesting, but not always.
One important note is that for any given data set, the uncertainty given by $f(\theta \given x)$ is completely determined by the specification of $\mathscr{M}$.
If $\mathscr{M}$ is not the correct model, its uncertainty estimates may be arbitrarily bad, even if its predictions are good.
Thus, we may prefer to not assess uncertainty at all, rather than delude ourselves into thinking we know it.</p>
<p>Similarly, for some problems there may exist a simple and easy way to determine whether $\theta$ is good or not.
For example, in image classification, we might simply ask a human if the labels produced by $\theta$ are reasonable.
This might be far more effective than using the probability distribution $f(\theta \given x)$ to compare the chosen value for $\theta$ to other possible values, especially when calculating $f(\theta \given x)$ is challenging.</p>
<p>This leads to a choice undertaken by the practitioner: should $f(\theta \given x)$ be calculated, or is picking one value $\hat\theta$ good enough?
In some cases, such as when a decision-theoretic analysis is performed, $f(\theta \given x)$ is indispensable; in others, it is unnecessary.
We conclude with an informal definition encompassing this choice.</p>
<p><strong>Definition.</strong>
A statistical procedure $\mathscr{P}$ that makes use of a model $\mathscr{M}$ is <em>computationally Bayesian</em> if it involves calculation of the full posterior distribution $f(\theta \given x)$ in at least one of its steps.</p>
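<p>To illustrate the distinction, here is a sketch of two procedures that share one mathematically Bayesian model. The model is an assumption made for the example: a Bernoulli likelihood with a Beta prior, for which the posterior is available in closed form. The first procedure returns a single value and is not computationally Bayesian; the second returns the parameters of the full posterior and is.</p>

```python
def map_estimate(successes, failures, a=2.0, b=2.0):
    """Posterior mode under a Beta(a, b) prior: a single value, no distribution."""
    return (successes + a - 1) / (successes + failures + a + b - 2)

def full_posterior(successes, failures, a=2.0, b=2.0):
    """Parameters of the Beta posterior, which determine the whole distribution."""
    return successes + a, failures + b

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution, one summary of uncertainty."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Hypothetical data: 6 successes and 2 failures.
theta_hat = map_estimate(6, 2)
post_a, post_b = full_posterior(6, 2)
spread = beta_variance(post_a, post_b)
print(theta_hat, (post_a, post_b), spread)
```

<p>Both procedures rely on the same prior and likelihood. Only the second one keeps enough information to say how uncertain the estimate is.</p>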
<h1 id="disentangling-the-disagreements">Disentangling the Disagreements</h1>
<p>It is unfortunate that the term <em>Bayesian</em> has come to mean <em>mathematically Bayesian</em> and <em>computationally Bayesian</em> simultaneously.
In my opinion, these distinctions should be considered separately, because they concern two very different questions.
In the mathematical case, we are asking whether or not to interpret our model using its probabilistic representation.
In the computational case, we are asking whether calculating the entire distribution is necessary, or whether one value suffices.</p>
<p>A model’s Bayesian representation can be useful as a theoretical tool, whether we calculate the posterior or not.
If one value does suffice, we should not discard the probabilistic interpretation entirely, because it might help us understand the model’s structure.
For the Logistic Lasso, the Bayesian approach makes it obvious where cross-entropy loss comes from: it maps uniquely to the Bernoulli likelihood.</p>
<p>It is unfortunate that the two cases are often conflated.
It is common to hear practitioners say that they are not interested in whether models are Bayesian or frequentist – instead, it matters whether or not they work.
More often than not, models can be interpreted both ways, so the distinction’s premise is itself an illusion.
Every mathematical perspective tells us something about the objects we are studying.
Even if we do not perform Bayesian calculations, it can often still be useful to think of models in a Bayesian way.</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:lasso">
<p>R. Tibshirani. Regression Shrinkage and Selection via the Lasso. JRSSB 58(1), 1996. <a href="#fnref:lasso" class="reversefootnote">↩</a></p>
</li>
<li id="fn:blasso">
<p>T. Park and G. Casella. The Bayesian Lasso. JASA 103(402), 2008. <a href="#fnref:blasso" class="reversefootnote">↩</a></p>
</li>
<li id="fn:bda">
<p>A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013. <a href="#fnref:bda" class="reversefootnote">↩</a></p>
</li>
<li id="fn:vv">
<p>V. Vapnik. The Nature of Statistical Learning Theory. 1995. <a href="#fnref:vv" class="reversefootnote">↩</a></p>
</li>
<li id="fn:ct">
<p>A. Terenin and D. Draper. Cox’s Theorem and the Jaynesian Interpretation of Probability. <a href="https://arxiv.org/abs/1507.06597">arXiv:1507.06597</a>, 2015. <a href="#fnref:ct" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Alexander TereninBayesian statistics provides powerful theoretical tools, but it is also sometimes viewed as a philosophical framework. This has led to rich academic debates over what statistical learning is and how it should be done. Academic debates are healthy when their content is precise and independent issues are not conflated. In this post, I argue that it is not always meaningful to consider the merits of Bayesian learning directly, because the fundamental questions surrounding it encompass not one issue, but several, that are best understood independently. These can be viewed informally as follows.Deep Learning with function spaces2017-08-16T00:00:00+00:002017-08-16T00:00:00+00:00http://avt.im/blog/2017/08/16/deep-learning-function-spaces<p>Deep learning is perhaps the single most important breakthrough in statistics, machine learning, and artificial intelligence that has been popularized in recent years.
It has allowed us to classify images - for decades a challenging problem - with accuracy that is nowadays often better than human.
It has solved Computer Go, which for decades was the classical example of a board game that was exceedingly difficult for computers to play.
But what exactly is deep learning?</p>
<p>Many popular explanations involve analogies with the human brain, where deep learning models are interpreted as complex networks of neurons interacting with one another.
These perspectives are useful, but they’re not math: just because deep learning models mimic the brain, doesn’t mean they provably work.
This post will highlight some ideas that may be helpful in moving toward an understanding of why deep learning works, presented at an intuitive level.
The focus will be on high-level concepts, omitting algebraic details such as the precise form of tensor products.</p>
<h1 id="the-function-space-perspective">The Function Space Perspective</h1>
<p>The key idea of this post is that to understand why deep learning works, we should not work with the network directly.
Instead, we will define a model for learning on a space of functions, truncate that model, and obtain deep learning.</p>
<p>Consider the model</p>
<p>[
\hat{\v{y}} = f(\m{X})
]</p>
<p>where the goal is to learn the function $f$ that maps data $\m{X}$ to the predicted value $\hat{\v{y}}$.
But wait, how do we go about learning a function?
Let’s first consider a single-variable function $f(x): \R \goesto \R$ and recall that any function may be written as an infinite sum with respect to a location-scale basis, i.e. we have for an appropriately defined function $\sigma$ that</p>
<p>[
f(x) = \sum_{k=1}^\infty a_k \, \sigma(b_k x + c_k) + d_k
.
]</p>
<p>What’s happening here?
We’re taking the function $\sigma$, shifting it left-right by $c_k$, stretching it by a combination of $a_k$ and $b_k$, and shifting it up-down by $d_k$.
As long as $\sigma$ is sufficiently rich to form a basis on $\R$, if we add up infinitely many of them, we can approximate $f$ to any precision we want.
To make learning possible, let’s truncate the sum, so that we sum $K$ elements instead of $\infty$, and get</p>
<p>[
f(x) = \sum_{k=1}^K a_k \, \sigma(b_k x + c_k) + d_k
.
]</p>
<p>We now have a finite set of parameters, so given a data set $(\m{X},\v{y})$, we can define a probability distribution for $\v{y}$ under the predicted values $\hat{\v{y}}$, and <a href="/blog/2017/07/05/bayesian-learning">learn the coefficients using Bayes’ Rule</a>.</p>
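<p>The truncated sum above is straightforward to evaluate. The sketch below assumes $\sigma$ is the ReLU function, which the text has not fixed at this point, and uses hypothetical coefficients chosen so that two terms reproduce the absolute value function exactly.</p>

```python
def relu(z):
    return max(0.0, z)

def expansion(x, a, b, c, d, sigma=relu):
    """Evaluate sum_k a[k] * sigma(b[k] * x + c[k]) + d[k] for K = len(a) terms."""
    return sum(a_k * sigma(b_k * x + c_k) + d_k
               for a_k, b_k, c_k, d_k in zip(a, b, c, d))

# Two terms reproduce |x| exactly: relu(x) + relu(-x).
abs_coeffs = dict(a=[1.0, 1.0], b=[1.0, -1.0], c=[0.0, 0.0], d=[0.0, 0.0])
values = [expansion(x, **abs_coeffs) for x in (-2.0, 0.0, 3.5)]
print(values)
```

<p>Learning then amounts to choosing the lists <code class="highlighter-rouge">a</code>, <code class="highlighter-rouge">b</code>, <code class="highlighter-rouge">c</code>, and <code class="highlighter-rouge">d</code> from data rather than by hand.</p>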
<p>But wait: the expressions we get by following this procedure, extended to matrices and vectors, are exactly those given by a <a href="/blog/2017/07/05/bayesian-learning">1-layer fully connected network</a>.
This is what a fully connected network does, and this is why it works: we are expanding an arbitrary function with respect to a basis, and learning the coefficients of the expansion using Bayes’ Rule<sup id="fnref:be"><a href="#fn:be" class="footnote">1</a></sup>.
That’s it!</p>
<h1 id="going-deep">Going Deep</h1>
<p>With the above perspective in mind, let’s consider deep learning.
We’re going to apply another trick: rather than learning $f$ directly, let’s instead define functions $f^{(1)},f^{(2)},f^{(3)}$ such that</p>
<p>[
\hat{\v{y}} = f(\m{X}) = f^{(1)}\cbr{f^{(2)}\sbr{f^{(3)}\del{\m{X}}}}
]</p>
<p>It’s not obvious why we should do this, but let’s go with it for now.
Then, let $\sigma$ be the ReLU function, and expand $f^{(3)}$ with respect to that basis, just as we did above, but with matrix-vector notation, to get</p>
<p>[
\hat{\v{y}} = f^{(1)}\cbr{f^{(2)}\sbr{ \v{a}^{(3)} \sigma\del{\m{X}\v{b}^{(3)} + \v{c}^{(3)}} + \v{d}^{(3)} }}
.
]</p>
<p>Now, let’s expand $f^{(2)}$, yielding</p>
<p>[
\hat{\v{y}} = f^{(1)}\cbr{\v{a}^{(2)}\sigma\sbr{\del{\v{a}^{(3)} \sigma\del{\m{X}\v{b}^{(3)} + \v{c}^{(3)}} + \v{d}^{(3)}}\v{b}^{(2)} + \v{c}^{(2)}} + \v{d}^{(2)}}
.
]</p>
<p>Notice that we can set $\v{b}^{(2)} = \v{1}$ and $\v{c}^{(2)} = \v{0}$ with no loss of generality to slightly simplify our expression.
Upon expanding $f^{(1)}$, we are left with</p>
<p>[
\hat{\v{y}} = \v{a}^{(1)}\sigma\cbr{\v{a}^{(2)}\sigma\sbr{\v{a}^{(3)} \sigma\del{\m{X}\v{b}^{(3)} + \v{c}^{(3)}} + \v{d}^{(3)}} + \v{d}^{(2)}} + \v{d}^{(1)}
]</p>
<p>which is exactly the expression for a 3-layer fully connected network.</p>
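<p>A scalar sketch can confirm that composing the three expansions gives the nested expression. It keeps a single term per layer and uses hypothetical coefficients, so it stands in for the matrix-vector version only loosely; layers 1 and 2 use $b = 1$ and $c = 0$ as in the simplification above.</p>

```python
def relu(z):
    return max(0.0, z)

def layer(u, a, d, b=1.0, c=0.0):
    """One-term expansion a * relu(b * u + c) + d applied to a scalar input."""
    return a * relu(b * u + c) + d

def three_layer(x, p1, p2, p3):
    """Compose f1(f2(f3(x))), applying the innermost layer first."""
    return layer(layer(layer(x, **p3), **p2), **p1)

# Hypothetical coefficients for each layer.
p3 = dict(a=2.0, b=0.5, c=1.0, d=-1.0)
p2 = dict(a=1.5, d=0.25)
p1 = dict(a=3.0, d=0.5)

x = 4.0
nested = three_layer(x, p1, p2, p3)
# The same computation written out as the final formula in the text.
explicit = 3.0 * relu(1.5 * relu(2.0 * relu(0.5 * x + 1.0) - 1.0) + 0.25) + 0.5
print(nested, explicit)
```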
<p>So, what is deep learning?
Deep learning is a model that learns a function $f$ by splitting it up into a sequence of functions $f^{(1)},f^{(2)},f^{(3)},\ldots$, performing a ReLU basis expansion on each one, truncating it, and learning the remaining coefficients using Bayes’ Rule.</p>
<h1 id="example-why-residual-networks-work">Example: why Residual Networks work</h1>
<p>This perspective can be used to understand a recently popularized technique in deep learning.
For illustrative purposes, let’s consider a 3-layer residual network.
Suppose $\m{X}$ is of the same dimensionality as the network.
A residual network is a model of the form</p>
<p>[
\begin{aligned}
\hat{\v{y}} = f(\m{X}) = &f^{(1)}\cbr{f^{(2)}\sbr{f^{(3)}\del{\m{X}} + \m{X}} + \sbr{f^{(3)}\del{\m{X}} + \m{X}}}
\nonumber
\\
&+ \cbr{f^{(2)}\sbr{f^{(3)}\del{\m{X}} + \m{X}} + \sbr{f^{(3)}\del{\m{X}} + \m{X}}}
.
\end{aligned}
]</p>
<p>So, why do residual networks perform better?
Consider the above from a Bayesian learning point of view: we start with a prior distribution - determined uniquely by the regularization term - and end with a posterior distribution that describes what we learned.
Suppose that nothing is learned in the 3rd layer.
Then the posterior distribution must be the same as the prior.
With $L^2$ regularization, this means that the posterior mode of the coefficients of the basis expansion of $f^{(3)}$ will be zero.
Hence,</p>
<p>[
f^{(3)}(x) = \sum_{k=1}^K 0 \, \sigma(0 \times x + 0) + 0 = 0
]</p>
<p>and the model collapses to</p>
<p>[
\hat{\v{y}} = f(\m{X}) = f^{(1)}\cbr{f^{(2)}\sbr{\m{X}} + \m{X}} + \cbr{f^{(2)}\sbr{\m{X}} + \m{X}}
.
]</p>
<p>Contrast this with a non-residual network, which collapses to</p>
<p>[
\hat{\v{y}} = f(\m{X}) = f^{(1)}\cbr{f^{(2)}\sbr{\v{0}}} = \text{constant}
.
]</p>
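<p>The two collapsed models can be compared directly in a scalar sketch. The layers <code class="highlighter-rouge">f1</code> and <code class="highlighter-rouge">f2</code> below are hypothetical stand-ins for learned functions, and <code class="highlighter-rouge">f3</code> is identically zero to represent a layer where nothing was learned.</p>

```python
def relu(z):
    return max(0.0, z)

def f1(u):
    return 2.0 * relu(u) + 1.0     # hypothetical learned layer

def f2(u):
    return 0.5 * relu(u - 1.0)     # hypothetical learned layer

def f3(u):
    return 0.0                     # nothing learned: all coefficients at zero

def plain_network(x):
    return f1(f2(f3(x)))

def residual_network(x):
    h3 = f3(x) + x                 # innermost block plus its skip connection
    h2 = f2(h3) + h3
    return f1(h2) + h2

inputs = [-1.0, 0.0, 2.0, 5.0]
plain_outputs = [plain_network(x) for x in inputs]
residual_outputs = [residual_network(x) for x in inputs]
print(plain_outputs)
print(residual_outputs)
```

<p>The plain network returns the same value for every input, while the residual network still passes information about the input through to the output.</p>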
<p>In reality, of course, the network learns <em>something</em> in deeper layers, so behavior isn’t quite this bad.
But, if we suppose that deeper layers learn less and less given the same data, the model must eventually stop working if we keep adding layers.
Thus, standard networks don’t work if we make them too deep.
Residual networks fix the problem.</p>
<h1 id="what-have-we-gained-from-this-perspective">What have we gained from this perspective?</h1>
<p>Thinking about function spaces can make deep learning substantially more understandable.
Instead of thinking about networks, which are complicated, we can think about functions, which are in my view simpler.</p>
<p>The ideas above can for instance be used to understand what convolutional networks do: they make assumptions on how each $f^{(i)}$ behaves over space.
Similarly, we can see why ReLU<sup id="fnref:relu"><a href="#fn:relu" class="footnote">2</a></sup> units might perform slightly better than sigmoid units: because they are unbounded, fewer of them may be required to approximate a given function well.</p>
<p>Part of what makes functions simpler is that it is easy to visualize what scaling and shifting does to them.
For example, it is easy to see that switching from ReLU to Leaky ReLU<sup id="fnref:lrelu"><a href="#fn:lrelu" class="footnote">3</a></sup> units is the same as increasing the bias term in the basis expansion.
It’s certainly possible that this may sometimes be helpful, but it would be a big surprise to me if doing this resulted in substantially better performance across the board.</p>
<p>One major question that the function space perspective raises is why learning $f^{(1)}, f^{(2)}, f^{(3)},\ldots$ separately is so much easier than learning $f$ directly.
I don’t know of a good answer to this question.</p>
<p>A key benefit of thinking with function spaces is that it gives us a principled way to derive the expressions needed to define and train networks.
The residual networks presented here differ slightly from the original work in which they were presented<sup id="fnref:resnet"><a href="#fn:resnet" class="footnote">4</a></sup> – more recent work has proposed precisely the formulas derived here<sup id="fnref:resnetidentity"><a href="#fn:resnetidentity" class="footnote">5</a></sup> which were found to improve performance.</p>
<p>I’m not sure why deep learning is not typically presented in this way – the function space perspective is largely omitted from the classical text <em>Deep Learning</em><sup id="fnref:dlintro"><a href="#fn:dlintro" class="footnote">6</a></sup>.
Overall, I hope that this short introduction has been useful for understanding deep learning and making the structure present in the models more transparent.</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:be">
<p>See Chapter 20 of Bayesian Data Analysis<sup id="fnref:bda"><a href="#fn:bda" class="footnote">7</a></sup>. <a href="#fnref:be" class="reversefootnote">↩</a></p>
</li>
<li id="fn:relu">
<p>R. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, H. S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405(6789), 2000. <a href="#fnref:relu" class="reversefootnote">↩</a></p>
</li>
<li id="fn:lrelu">
<p>A. L. Maas, A. Y. Hannun, A. Y. Ng. Rectifier Nonlinearities Improve Neural Network Acoustic Models. ICML 30(1), 2013. <a href="#fnref:lrelu" class="reversefootnote">↩</a></p>
</li>
<li id="fn:resnet">
<p>K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. CVPR 28(1), 2015. <a href="#fnref:resnet" class="reversefootnote">↩</a></p>
</li>
<li id="fn:resnetidentity">
<p>K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. ECCV 14(1), 2016. <a href="#fnref:resnetidentity" class="reversefootnote">↩</a></p>
</li>
<li id="fn:dlintro">
<p>See Chapter 6 of Deep Learning<sup id="fnref:dl"><a href="#fn:dl" class="footnote">8</a></sup>. <a href="#fnref:dlintro" class="reversefootnote">↩</a></p>
</li>
<li id="fn:bda">
<p>A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013. <a href="#fnref:bda" class="reversefootnote">↩</a></p>
</li>
<li id="fn:dl">
<p>I. Goodfellow, Y. Bengio, A. Courville. <a href="http://www.deeplearningbook.org">Deep Learning</a>. 2016. <a href="#fnref:dl" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Bayesian Learning - by example. Alexander Terenin, 2017-07-05. http://avt.im/blog/2017/07/05/bayesian-learning</p>
<p>Welcome to my blog!
For my first post, I decided that it would be useful to write a short introduction to Bayesian learning, and its relationship with the more traditional optimization-theoretic perspective often used in artificial intelligence and machine learning, presented in a minimally technical fashion.
We begin by introducing an example.</p>
<h1 id="example-binary-classification-using-a-fully-connected-network">Example: binary classification using a fully connected network</h1>
<p>First, let’s introduce notation. For simplicity suppose there are no biases, and define the following.</p>
<ul>
<li>$\v{y}_{N\times 1}$: a binary vector where each element is a target data point. $N$ is the number of data points.</li>
<li>$\m{X}_{N\times p}$: a matrix where each row is an input data vector. $p$ is the dimensionality of each input.</li>
<li>$\v\beta^{(x)}_{p \times m}$: the matrix that maps the input to the hidden layer. $m$ is the number of hidden units.</li>
<li>$\v\beta^{(h)}_{m \times 1}$: the vector that maps the hidden layer to the output.</li>
<li>$\sigma$: the network’s activation function, for instance a ReLU function.</li>
<li>$\phi$: the logistic sigmoid function, which is the two-class special case of softmax, applied elementwise.</li>
</ul>
<div style="text-align: center;">
<svg width="250px" viewBox="0 0 250 265" xmlns="http://www.w3.org/2000/svg">
<g>
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="200" x2="200" y2="125" />
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="50" x2="200" y2="125" />
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="125" x2="125" y2="162.5" />
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="125" x2="125" y2="87.5" />
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="200" x2="125" y2="87.5" />
<line style="stroke: rgb(0, 0, 0);" x1="50" y1="50" x2="125" y2="162.5" />
</g>
<g>
<ellipse style="stroke: rgb(0, 0, 0); fill: rgb(167, 167, 167);" transform="matrix(1, 0.000003, -0.000003, 1, -73.458435, -2.691527)" cx="123.459" cy="52.691" rx="25" ry="25" />
<ellipse style="stroke: rgb(0, 0, 0); fill: rgb(167, 167, 167);" transform="matrix(1, 0.000003, -0.000003, 1, -73.45842, 147.308301)" cx="123.459" cy="52.691" rx="25" ry="25" />
<ellipse style="stroke: rgb(0, 0, 0); fill: rgb(167, 167, 167);" transform="matrix(1, 0.000003, -0.000003, 1, -73.458427, 72.308369)" cx="123.459" cy="52.691" rx="25" ry="25" />
<ellipse style="fill: rgb(216, 216, 216); stroke: rgb(0, 0, 0);" transform="matrix(1, 0.000003, -0.000003, 1, 1.541482, 34.808468)" cx="123.459" cy="52.691" rx="25" ry="25" />
<ellipse style="fill: rgb(216, 216, 216); stroke: rgb(0, 0, 0);" transform="matrix(1, 0.000003, -0.000003, 1, 1.541482, 109.808331)" cx="123.459" cy="52.691" rx="25" ry="25" />
<ellipse style="stroke: rgb(0, 0, 0); fill: rgb(167, 167, 167);" transform="matrix(1, 0.000003, -0.000003, 1, 76.541375, 72.308369)" cx="123.459" cy="52.691" rx="25" ry="25" />
</g>
<g>
<foreignObject x="35" y="235" width="30" height="30">
<div xmlns="http://www.w3.org/1999/xhtml">
$\m{X}$
</div>
</foreignObject>
<foreignObject x="80" y="235" width="30" height="30">
<div xmlns="http://www.w3.org/1999/xhtml">
$\v\beta^{(x)}$
</div>
</foreignObject>
<foreignObject x="150" y="235" width="30" height="30">
<div xmlns="http://www.w3.org/1999/xhtml">
$\v\beta^{(h)}$
</div>
</foreignObject>
<foreignObject x="190" y="235" width="30" height="30">
<div xmlns="http://www.w3.org/1999/xhtml">
$\v{y}$
</div>
</foreignObject>
</g>
</svg>
</div>
<h1 id="the-standard-approach">The standard approach</h1>
<p>We begin by defining an optimization problem.
Let $\v\beta$ be a $k$-dimensional vector consisting of all values of $\v\beta^{(x)}$ and $\v\beta^{(h)}$ stacked together.
Our network’s prediction $\v{\hat{y}} \in [0,1]^N$ is given by</p>
<p>\[
\hat{\v{y}} = \phi\del{\sigma\del{\m{X} \v\beta^{(x)}} \v\beta^{(h)}}
\]</p>
<p>Now, we proceed to learn the weights.
Let $\v{\hat\beta}$ be the learned values for $\v\beta$, let $\vert\vert\cdot\vert\vert$ be the $L^2$ norm, fix some $\lambda \in \R^+$, and set</p>
<p>\[
\v{\hat\beta} = \underset{\v\beta}{\arg\min}\cbr{ \sum_{i=1}^N -y_i\ln(\hat{y}_i) - (1-y_i)\ln(1 - \hat{y}_i) + \lambda\vert\vert\v\beta\vert\vert^2}
.
\]</p>
<p>The expression being minimized is the <em>cross-entropy loss</em><sup id="fnref:ce"><a href="#fn:ce" class="footnote">1</a></sup> together with an $L^2$ regularization term.
The loss is differentiable, so we can minimize it by using gradient descent or any other method we wish.
Learning takes place by minimizing the loss, and the values we learn – here, $\v{\hat\beta}$ – are a point in $\R^k$.</p>
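<p>To make the objective concrete, here is a minimal sketch of the forward pass and the regularized loss in plain Python. It assumes $\sigma$ is a ReLU and $\phi$ is the logistic sigmoid; the function names <code>predict</code> and <code>loss</code> are illustrative choices rather than anything from a particular library, and nested lists stand in for matrices to keep the example dependency-free.</p>

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(X, beta_x, beta_h):
    """Compute y_hat = phi(sigma(X beta_x) beta_h).

    X is a list of N rows of length p, beta_x is a p-by-m nested list,
    and beta_h is a list of length m.
    """
    y_hat = []
    for row in X:
        hidden = [
            relu(sum(x * beta_x[j][u] for j, x in enumerate(row)))
            for u in range(len(beta_h))
        ]
        y_hat.append(sigmoid(sum(h * b for h, b in zip(hidden, beta_h))))
    return y_hat


def loss(X, y, beta_x, beta_h, lam):
    """Cross-entropy plus lam times the squared L2 norm of all weights."""
    y_hat = predict(X, beta_x, beta_h)
    cross_entropy = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y, y_hat)
    )
    sq_norm = sum(w * w for r in beta_x for w in r) + sum(w * w for w in beta_h)
    return cross_entropy + lam * sq_norm


# Toy data, made up for illustration: N = 2 inputs of dimension p = 2, m = 3 hidden units.
X = [[1.0, 2.0], [0.5, -1.0]]
y = [1, 0]
zero_x = [[0.0] * 3 for _ in range(2)]
zero_h = [0.0] * 3
print(loss(X, y, zero_x, zero_h, lam=0.5))
```

<p>With all weights set to zero, every prediction is $0.5$ and the penalty vanishes, so the printed value is $N \ln 2$ with $N = 2$. Any optimizer can then be applied to <code>loss</code> as a function of the weights.</p>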
<p>Why cross-entropy rather than some other mathematical expression?
In most treatments of classification, the reasons given are purely intuitive: for instance, it is often said to stabilize the optimization algorithm.
More rigorous treatments<sup id="fnref:ce:1"><a href="#fn:ce" class="footnote">1</a></sup> might introduce ideas from information theory.
We will provide another explanation.</p>
<h1 id="the-bayesian-approach">The Bayesian approach</h1>
<p>Let us now define the exact same network, but this time from a Bayesian perspective. We begin by making probabilistic assumptions on our data.
Since we have that $\v{y} \in \cbr{0,1}^N$, and since we assume that the order in which $\v{y}$ is presented cannot affect learning – this is formally called exchangeability – there is one and only one distribution that $\v{y}$ can follow: the Bernoulli distribution.
The parameter of that distribution is the same expression $\v{\hat{y}}$ as before.
Hence, let</p>
<p>\[
\v{y} \given \v\beta \dist\f{Ber}\sbr{\phi\del{\sigma\del{\m{X} \v\beta^{(x)}} \v\beta^{(h)}}}
.
\]</p>
<p>This is called the <em>likelihood</em>: it describes the assumptions we are making about the data $\v{y}$ given the parameters $\v\beta$ – here, that the data is binary and exchangeable.
Now, define the <em>prior</em> for $\v\beta$ as</p>
<p>\[
\v\beta \dist\f{N}_k\del{0, \frac{\lambda^{-1}}{2}}
.
\]</p>
<p>This describes our assumptions about $\v\beta$ external to the data – here, we have assumed that all components of $\v\beta$ are <em>a priori</em> independent mean-zero Gaussians.
We can combine the prior and likelihood using Bayes’ Rule</p>
<p>\[
f(\v\beta \given \v{y}) = \frac{f(\v{y} \given \v\beta) \pi(\v\beta)}{\int_{\R^k} f(\v{y} \given \v\beta) \pi(\v\beta) \dif \v\beta} \propto f(\v{y} \given \v\beta) \pi(\v\beta)
\]</p>
<p>to obtain the <em>posterior</em> $\v\beta \given \v{y}$.
This is a probability distribution: it describes what we learned about $\v\beta$ from the data.
Learning takes place through the use of Bayes’ Rule, and the values we learn – here, $\v\beta \given \v{y}$ – are a probability distribution on $\R^k$.</p>
<h1 id="connecting-the-two-approaches">Connecting the two approaches</h1>
<p>Is there any relationship between $\v{\hat\beta}$ and $\v\beta \given \v{y}$?
It turns out, yes – let’s show it. First, let’s write down the posterior</p>
<p>\[
f(\v\beta \given \v{y}) \propto f(\v{y} \given \v\beta) \pi(\v\beta) \propto \sbr{\prod_{i=1}^N \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}} \exp\cbr{\frac{\v\beta^T\v\beta}{-\lambda^{-1}}}
.
\]</p>
<p>Now, let’s take logs and simplify:</p>
<p>\[
\ln f(\v\beta \given \v{y}) = \sum_{i=1}^N y_i \ln(\hat{y}_i) + (1-y_i)\ln(1 - \hat{y}_i) - \lambda\vert\vert\v\beta\vert\vert^2 + \f{const}
.
\]</p>
<p>Having computed that, note that taking logs and adding constants preserve optima, and consider the posterior mode:</p>
<p>\[
\begin{aligned}
\underset{\v\beta}{\arg\max}\cbr{f(\v\beta \given \v{y})} &= \underset{\v\beta}{\arg\max}\cbr{\ln f(\v\beta \given \v{y})}
\\
&=\underset{\v\beta}{\arg\max}\cbr{ \sum_{i=1}^N y_i \ln(\hat{y}_i) + (1-y_i)\ln(1 - \hat{y}_i) - \lambda\vert\vert\v\beta\vert\vert^2 }
\\
&=\underset{\v\beta}{\arg\min}\cbr{ \sum_{i=1}^N -y_i \ln(\hat{y}_i) - (1-y_i)\ln(1 - \hat{y}_i) + \lambda\vert\vert\v\beta\vert\vert^2 }
\\
&= \v{\hat{\beta}}
.
\end{aligned}
\]</p>
<p>What have we shown? Minimizing the $L^2$-regularized cross-entropy loss is equivalent to maximizing the posterior density, that is, to finding the posterior mode.
The loss function maps to the likelihood, and the regularization term maps to the prior.</p>
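<p>The equivalence can also be checked numerically. The sketch below, again in plain Python with ReLU and logistic sigmoid assumed for $\sigma$ and $\phi$, evaluates the regularized loss and the log of likelihood times prior at two arbitrary weight settings. The toy data, the weight values, and the helper names are made up for illustration. If the derivation holds, the sum of the two quantities is the same constant at every $\v\beta$, namely the log normalizing constant of the Gaussian prior.</p>

```python
import math


def y_hat(X, beta_x, beta_h):
    """Network output phi(sigma(X beta_x) beta_h) with ReLU and logistic sigmoid."""
    out = []
    for row in X:
        hidden = [
            max(0.0, sum(x * beta_x[j][u] for j, x in enumerate(row)))
            for u in range(len(beta_h))
        ]
        z = sum(h * b for h, b in zip(hidden, beta_h))
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out


def flatten(beta_x, beta_h):
    """Stack all weights into one k-dimensional list."""
    return [w for r in beta_x for w in r] + list(beta_h)


def loss(X, y, beta_x, beta_h, lam):
    """Cross-entropy plus lam times the squared L2 norm of the weights."""
    ce = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y, y_hat(X, beta_x, beta_h))
    )
    return ce + lam * sum(w * w for w in flatten(beta_x, beta_h))


def log_joint(X, y, beta_x, beta_h, lam):
    """Log Bernoulli likelihood plus log Gaussian prior with variance 1 / (2 lam)."""
    log_lik = sum(
        math.log(p if t == 1 else 1 - p)
        for t, p in zip(y, y_hat(X, beta_x, beta_h))
    )
    var = 1.0 / (2.0 * lam)
    log_prior = sum(
        -0.5 * math.log(2 * math.pi * var) - w * w / (2 * var)
        for w in flatten(beta_x, beta_h)
    )
    return log_lik + log_prior


X = [[1.0, -0.5], [0.3, 2.0], [-1.2, 0.7]]
y = [1, 0, 1]
lam = 0.1
beta_a = ([[0.2, -0.4], [0.1, 0.3]], [0.5, -0.6])
beta_b = ([[-1.0, 0.8], [0.4, -0.2]], [1.5, 0.9])

# loss + log_joint should not depend on the weights.
gap_a = loss(X, y, *beta_a, lam) + log_joint(X, y, *beta_a, lam)
gap_b = loss(X, y, *beta_b, lam) + log_joint(X, y, *beta_b, lam)
print(gap_a, gap_b)
```

<p>Both printed values agree up to floating-point error, which is exactly the statement that the loss and the negative log posterior differ only by a constant and therefore share the same optimum.</p>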
<h1 id="what-it-all-means">What it all means</h1>
<p>Why is this useful?
It gives us a probabilistic interpretation for learning, which helps us to construct and understand our models.
This is especially useful in more complicated settings: for instance, we might ask, where does $\v{\hat{y}} = \phi\del{\sigma\del{\m{X} \v\beta^{(x)}} \v\beta^{(h)}}$ come from? In fact, we can use ideas from <em>Bayesian Nonparametrics</em> to derive $\v{\hat{y}}$ by considering a likelihood on a function space under a ReLU basis expansion<sup id="fnref:be"><a href="#fn:be" class="footnote">2</a></sup>.
The network’s loss and architecture can both be explained in a Bayesian way.</p>
<p>There is much more: we could consider drawing samples from the posterior distribution, to quantify uncertainty about how much we learned about $\v\beta$ from the data.
<em>Markov Chain Monte Carlo</em><sup id="fnref:mcmc"><a href="#fn:mcmc" class="footnote">3</a></sup> methods are the most common class of methods for doing so.
We can use ideas from hierarchical Bayesian models to define better regularizers compared to $L^2$ – the <em>Horseshoe</em><sup id="fnref:hs"><a href="#fn:hs" class="footnote">4</a></sup> prior is a popular example.
For brevity, I’ll omit further examples – the book <em>Bayesian Data Analysis</em><sup id="fnref:bda"><a href="#fn:bda" class="footnote">5</a></sup> is a good introduction, though it largely focuses on methods of interest mainly to statisticians.</p>
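<p>As a rough illustration of what drawing samples from the posterior can look like, here is a sketch of random-walk Metropolis, one of the simplest MCMC methods, applied to the network above. Everything specific in it is an assumption made for the example: the toy data, the step size, the number of iterations, and the flat layout of the weight vector. A practical sampler would need tuning and convergence checks that are omitted here.</p>

```python
import math
import random


def log_posterior(beta, X, y, lam, p, m):
    """Unnormalized log posterior of the network weights.

    beta is a flat list: the p*m input-to-hidden weights stored row by row,
    followed by the m hidden-to-output weights.
    """
    total = -lam * sum(w * w for w in beta)  # log Gaussian prior, up to a constant
    for row, t in zip(X, y):
        hidden = [
            max(0.0, sum(row[j] * beta[j * m + u] for j in range(p)))
            for u in range(m)
        ]
        z = sum(h * beta[p * m + u] for u, h in enumerate(hidden))
        # log sigmoid(z) if t == 1, else log(1 - sigmoid(z))
        total += -math.log1p(math.exp(-z)) if t == 1 else -math.log1p(math.exp(z))
    return total


def metropolis(log_density, start, n_samples, step, rng):
    """Random-walk Metropolis with independent Gaussian proposals."""
    current, current_lp = list(start), log_density(start)
    samples, accepted = [], 0
    for _ in range(n_samples):
        proposal = [w + rng.gauss(0.0, step) for w in current]
        proposal_lp = log_density(proposal)
        # Accept with probability min(1, density ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        samples.append(list(current))
    return samples, accepted / n_samples


X = [[1.0, -0.5], [0.3, 2.0], [-1.2, 0.7]]
y = [1, 0, 1]
p, m, lam = 2, 2, 0.1
k = p * m + m

rng = random.Random(0)
samples, rate = metropolis(
    lambda b: log_posterior(b, X, y, lam, p, m), [0.0] * k, 2000, 0.5, rng
)
posterior_mean = [sum(s[i] for s in samples) / len(samples) for i in range(k)]
print(rate, posterior_mean)
```

<p>Instead of a single point $\v{\hat\beta}$, the output is a collection of draws whose spread reflects how uncertain we remain about each weight after seeing the data.</p>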
<p>How general is this perspective?
Very: an abstract result called Cox’s Theorem states, in modern terms, that <em>every true-false logic under uncertainty is isomorphic to conditional probability</em>.
This means that <em>all learning formalizable in the above sense is Bayesian</em>.
So, if you <em>can’t</em> represent a given method in a Bayesian way, I would be rather worried.
For a formal statement and details, see my preprint<sup id="fnref:ct"><a href="#fn:ct" class="footnote">6</a></sup> on the subject.</p>
<p>At the end of the day, having many different mathematical perspectives enables us to better understand how learning works, because things that are not obvious from one perspective might be easy to see from another.
Whereas the optimization-theoretic approach we began with did not give a clear reason for why we should use cross-entropy loss, from a Bayesian point of view it follows directly from the binary nature of the data.
Sometimes the Bayesian approach has little to say about a particular problem; other times it has a lot.
It is useful to know how to use it when the need arises, and I hope this short example has given at least one reason to read about Bayesian statistics in more detail.</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:ce">
<p>See Chapter 5 of Deep Learning<sup id="fnref:dl"><a href="#fn:dl" class="footnote">7</a></sup>. <a href="#fnref:ce" class="reversefootnote">↩</a> <a href="#fnref:ce:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
<li id="fn:be">
<p>See Chapter 20 of Bayesian Data Analysis<sup id="fnref:bda:1"><a href="#fn:bda" class="footnote">5</a></sup>. <a href="#fnref:be" class="reversefootnote">↩</a></p>
</li>
<li id="fn:mcmc">
<p>See Chapter 11 of Bayesian Data Analysis<sup id="fnref:bda:2"><a href="#fn:bda" class="footnote">5</a></sup>, but note that MCMC methods are far more general than presented there. An article<sup id="fnref:pdmcmc"><a href="#fn:pdmcmc" class="footnote">8</a></sup> by P. Diaconis gives a rather different overview. <a href="#fnref:mcmc" class="reversefootnote">↩</a></p>
</li>
<li id="fn:hs">
<p>C. M. Carvalho, N. G. Polson, and J. G. Scott. The Horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010. <a href="#fnref:hs" class="reversefootnote">↩</a></p>
</li>
<li id="fn:bda">
<p>A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013. <a href="#fnref:bda" class="reversefootnote">↩</a> <a href="#fnref:bda:1" class="reversefootnote">↩<sup>2</sup></a> <a href="#fnref:bda:2" class="reversefootnote">↩<sup>3</sup></a></p>
</li>
<li id="fn:ct">
<p>A. Terenin and D. Draper. Cox’s Theorem and the Jaynesian Interpretation of Probability. <a href="https://arxiv.org/abs/1507.06597">arXiv:1507.06597</a>, 2015. <a href="#fnref:ct" class="reversefootnote">↩</a></p>
</li>
<li id="fn:dl">
<p>I. Goodfellow, Y. Bengio, A. Courville. <a href="http://www.deeplearningbook.org">Deep Learning</a>. 2016. <a href="#fnref:dl" class="reversefootnote">↩</a></p>
</li>
<li id="fn:pdmcmc">
<p>P. Diaconis. The Markov Chain Monte Carlo revolution. Bulletin of the American Mathematical Society, 46(2):179–205, 2009. <a href="#fnref:pdmcmc" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>