Alexander Terenin

How to use R packages such as ggplot in Julia

Julia is a wonderful programming language. It’s modern with good functional programming support, and unlike R and Python—both slow—Julia is fast. Writing packages is straightforward, and high performance can be obtained without bindings to a lower-level language. Unfortunately, its plotting frameworks are, at least in my view, not as good as the ggplot package in R. Fortunately, Julia’s interoperability with other programming languages is outstanding. In this post, I illustrate how to make ggplot work near-seamlessly with Julia using the RCall package.

Calling R packages in Julia

R packages can be loaded in Julia [1] through the RCall [2] package by using

using RCall
@rlibrary ggplot2

which works much like the popular @pyimport macro in the PyCall [3] package. It is important to note that this properly loads an R package as a Julia module, rather than simply defining a set of bindings to it. This means that every function in the R package can be called with Julia data structures as arguments, which will be automatically transformed into R data structures. There is no need to painstakingly convert every input, as is often necessary when making different languages interface with one another: it is done automatically using the magic offered by 21st-century programming languages. So, we can write

qplot(1:10,[i^2 for i in 1:10])

and a plot generated by the ggplot [4] function qplot shows up, even though 1:10 is a Julia range and [i^2 for i in 1:10] is a Julia array.
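To make the automatic conversion a little less magical, here is a minimal sketch of the same round trip done by hand. It assumes a working R installation that RCall can find, and it uses RCall's robject, rcall and rcopy functions; the variable names are purely illustrative, and this is only roughly what happens behind the scenes when a function loaded with @rlibrary is called.

```julia
using RCall

xs = 1:10
ys = [i^2 for i in xs]

# robject converts a Julia value into an R object explicitly.
r_ys = robject(ys)

# rcall invokes an R function by name (here R's sum) on that object.
r_total = rcall(:sum, r_ys)

# rcopy converts the R result back into a native Julia value.
total = rcopy(r_total)
println(total)
```

In everyday use none of these explicit calls are needed, which is exactly the point of the qplot example above.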

Data frame interoperability

RCall can automatically convert Julia DataFrame objects into R data.frame objects. For example, the following code is valid.

using DataFrames
d = DataFrame(v = [3,4,5], w = [5,6,7], x = [1,2,3], y = [4,5,6], z = [1,1,2])
ggplot(d, aes(x=:x,y=:y)) + geom_line()

Note that the aes function uses Julia symbols like :x to refer to data frame columns. We don’t need to do any Julia to R type conversions; the code simply works.
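If you want to check what R actually receives, the conversion can be inspected by hand. The following is a small sketch, again assuming R is installed and using RCall's robject, rcall and rcopy; the two-column data frame is a made-up example rather than the one above.

```julia
using RCall, DataFrames

# A small illustrative data frame (not from the original example).
d_small = DataFrame(x = [1, 2, 3], y = [4, 5, 6])

# Convert the Julia DataFrame into an R object explicitly.
r_d = robject(d_small)

# Ask R for the object's class and copy the answer back to Julia.
r_class = rcopy(rcall(:class, r_d))
println(r_class)

# Convert the R data.frame back into a Julia DataFrame.
d_back = rcopy(r_d)
println(names(d_back))
```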

Dealing with dots, formulas, and other R quirks

There are a few issues that arise when making complicated plots. For example, ggplot R commands such as

geom_point(na.rm = TRUE)

don’t translate directly to Julia code because the . in na.rm is interpreted as Julia syntax. Similar issues arise if, for instance, an R function uses end as an argument name. The solution to this problem is to use the var string macro provided by RCall (and built into Julia itself since version 1.3), which enables us to write

geom_point(var"na.rm" = true)

in place of the above R code. This macro works by defining a Julia symbol that includes the dot, which we couldn’t have done with standard syntax.
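The mechanism can be seen without R at all. The sketch below assumes Julia 1.3 or later, where var"..." is available without loading any package; keyword_names is a hypothetical stand-in for an R function wrapper that simply reports the keyword names it receives.

```julia
# Hypothetical stand-in for an R function wrapper: it only collects
# the names of the keyword arguments it was called with.
keyword_names(; kwargs...) = collect(keys(kwargs))

# var"na.rm" is an identifier whose name contains a dot, so it can be
# used as a keyword argument name like any other identifier.
names_seen = keyword_names(var"na.rm" = true)

println(names_seen == [Symbol("na.rm")])
```

The same trick works for reserved words, so var"end" can be used where an R function expects an argument called end.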

Another useful feature is the R string macro, which enables us to write R code in line with Julia code. For example, the Julia code R"~z" will execute the R code ~z, which creates an R formula object with the variable z, and returns it as an R object in Julia. This can be useful for functions such as facet_grid and facet_wrap that accept formulas as input. It enables us to write

ggplot(d, aes(x=:x,y=:y)) + geom_point() + facet_wrap(R"~z")

as well as execute R functions such as data.frame if we need to. We can also use this macro to fix issues arising when automatic data frame conversion doesn’t behave as intended. This occasionally happens for data frames that contain symbols or strings. For example, we can write code such as

d = d |>
  x -> R"$x[,1] = as.numeric($x[,1]); $x" |>
  x -> R"$x[,2] = as.numeric($x[,2]); $x" |>
  x -> R"$x[,3] = as.numeric($x[,3]); $x" |>
  x -> R"$x[,4] = as.factor(as.numeric($x[,4])); $x" |>
  x -> R"$x[,5] = as.factor(as.character($x[,5])); $x" |>
  x -> rename!(rcopy(x), [:u_min, :u_max, :x, :u, :solution])

to convert strings to factors inside our data frame, all inline. There are a couple of points worth expanding on here. Note first the functional style: we use a pipe [5] to input the data frame d into a function that takes x as input, executes the string macro R"$x[,1] = as.numeric($x[,1]); $x", and returns its result. This is immediately piped into another function, and the final step copies the result back into a Julia DataFrame with rcopy and renames its columns. The code $x in the line R"$x[,1] = as.numeric($x[,1]); $x" means that the Julia variable x is passed into the R code. This syntax allows us to execute R code without ever worrying about manually passing variables between Julia and R.
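For comparison, here is a minimal sketch of interpolation next to the manual alternative it replaces. It assumes R is installed; the variable names are made up for illustration, and @rput and @rget are RCall's macros for copying a variable into and out of the R session by name.

```julia
using RCall

x = [1.0, 2.0, 3.0]

# Interpolation: $x refers to the Julia variable x inside the R code.
m = rcopy(R"mean($x)")
println(m)

# Manual alternative: copy x into R, compute there, copy y back out.
@rput x
R"y <- x * 2"
@rget y
println(y)
```

The interpolated form keeps everything in one expression, which is what makes it pleasant to use inside a pipeline.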

Putting everything together, it’s easy to make a layered plot such as

ggplot(d, aes(x=:x)) +
  geom_ribbon(aes(ymin=:u_min, ymax=:u_max), fill="blue", alpha=0.5) +
  geom_line(aes(y=:u), color="blue") +
  lims(x=[0,5], y=[0,10]) +
  geom_line(aes(y=:solution), color="red") |>
  p -> ggsave("p1.pdf", p)

and save it to a PDF file using functional syntax, without ever writing a line of R code. In doing so, we sacrifice very little and retain essentially all aspects of ggplot that make it a user-friendly and productive package. I’ll conclude by noting that everything here is just ordinary use of the RCall package and would work with any R package: in all of the above, we did not use any ggplot-specific Julia packages, nor did we write a single line of language bindings.
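One detail worth spelling out is why the whole plot, and not just the last layer, reaches ggsave. In Julia, + binds more tightly than |>, so everything joined by + is evaluated first and then piped. A tiny sketch with plain numbers, unrelated to plotting, shows the same grouping:

```julia
# + has higher precedence than |>, so the entire sum is piped
# into the anonymous function on the right.
piped = 1 + 2 + 3 |> total -> total * 10

# The same expression with the grouping written out explicitly.
explicit = (1 + 2 + 3) |> (total -> total * 10)

println(piped == explicit)
```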

Why ggplot? Aren’t we using Julia in order to not use R?

Why bother with ggplot when Julia offers its own full-featured plotting packages such as Gadfly [6] and Plots.jl [7]? In my view (and I’m not generally a fan of criticizing other people’s hard work, but I find it warranted here and will be as gentle as I can), neither of these frameworks has a well-designed programming interface. Let’s look at what the issues are, and why ggplot handles them better.

Plots.jl is a powerful, full-featured plotting package. Unfortunately, its interface is very similar to that of base R: making a complicated plot requires executing a list of commands. This is its main downside: to use it effectively, the user needs to memorize every command and its options individually. There is no overarching principle upon which commands are based, which users can learn instead of the commands themselves. Providing such a principle is one of the major features of the Wickham-Wilkinson Grammar of Graphics [8] interface, which works as follows.

  • Plots are visualizations of data frames consisting of layered geometric objects.
  • Aesthetic mappings describe how individual data points are mapped to geometric objects.

For example, to plot a function and a 95% probability interval around that function, we create a data frame where each row contains the function’s $x$ and $f(x)$ values at a point, together with the lower and upper interval endpoints $a$ and $b$. We then add a line geometric object with the aesthetic mapping $(x,y) \to (x, f(x))$, as well as a ribbon geometric object with the mapping $(x,\min,\max) \to (x,a,b)$. We do not need to memorize how lines and ribbons work to use them, and simply follow the principles given by the bullet points above. If we need to use a new geometric object that we’ve never seen before, all we need to do is look at what kind of aesthetic mappings it utilizes; we never need to memorize any other details.
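In code, this example might look something like the sketch below. It assumes R and ggplot2 are installed; the function (a sine curve) and the interval half-width of 0.2 are placeholders I made up for illustration, not a real 95% interval, and the column names x, f, a and b simply follow the notation in the text.

```julia
using RCall, DataFrames
@rlibrary ggplot2

# One row per point: x, f(x), and the interval endpoints a and b.
xs = collect(0:0.1:5)
fx = sin.(xs)                       # placeholder function
band = DataFrame(x = xs, f = fx,
                 a = fx .- 0.2,     # placeholder lower endpoint
                 b = fx .+ 0.2)     # placeholder upper endpoint

# Ribbon layer maps (x, ymin, ymax) to (x, a, b);
# line layer maps (x, y) to (x, f).
p = ggplot(band, aes(x=:x)) +
  geom_ribbon(aes(ymin=:a, ymax=:b), alpha=0.3) +
  geom_line(aes(y=:f))
```

Swapping the ribbon for another geometric object only requires changing the layer and its aesthetic mapping; the data frame and the rest of the plot stay the same.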

On the other hand, consider the Plots.jl code that I wrote for a project

contour(-3:0.1:3, -3:0.1:3, (x,y) -> pdf(MultivariateNormal(2,1),[x,y]))
scatter!(θ[i][1,:], θ[i][2,:])

and note how this syntax differs from

plot!(hcat(L,error),layout=2, label=["L: test" "Error: test"], alpha=0.5)

where a single matrix is used as input rather than two ranges and a function. It is a priori unclear whether the input to a particular plotting function should be an array, data frame, or something else. Looking a bit further, imagine setting color labels in a complicated multilayered plot—in which layer’s command should we specify how labels are displayed? Ambiguity like this wastes time by forcing the user to spend time reading documentation rather than making their plots, and in my experience the time saved by having concise commands like plot(x,y) in simple cases does not outweigh the cost in complicated ones.

It’s true that the Grammar of Graphics interface is not well-suited to every kind of plot, but it works well for most of the ones encountered in everyday data science. Most importantly, it offers a single unified way to think about plots and how to construct them. Writing plots in it can be more verbose, but I prefer being verbose and consistent to being concise and different in every scenario. I don’t have time to memorize individual commands in a plotting package that doesn’t contain a central set of guiding principles, and neither should you.

So if I don’t prefer Plots.jl due to its interface, what about Gadfly, which is Grammar of Graphics based? Unfortunately, Gadfly lacks many useful features, such as transparency and geometric objects like geom_raster, and also suffers from a whole other set of issues that make it difficult to use. One particular problem is that it uses a varargs-based interface rather than a functional one. This makes us write things like

plot(plot_data_1, x="x", y="u", Geom.line,
  layer(Geom.line, x = "x", y = "solution", Theme(default_color="red")),
  layer(Geom.line, x="x", y="u_mc", Theme(default_color = "purple")),
  layer(Geom.line, x="x", y="u_mf", Theme(default_color = "orange"))
)

instead of

ggplot(plot_data_1, aes(x=:x, y=:u)) +
  geom_line(color="blue") +
  geom_line(aes(y=:solution), color="red") +
  geom_line(aes(y=:u_mc), color="purple") +
  geom_line(aes(y=:u_mf), color="orange")

which is much simpler. The issue here is that a ... based interface requires the user to waste time on the irritating task of balancing commas and parentheses. Plots.jl suffers just as much from the exact same problem.

This code raises another major issue: Gadfly doesn’t follow the Grammar of Graphics strictly enough. A color not given by an aesthetic mapping should be defined as part of a geometric object, not part of a theme. Themes are supposed to control parts of the plot that have nothing to do with the data or geometric objects, such as the font size for the plot’s title, and certainly not the color of a line. This is an inconsistency that a user needs to learn, rather than a consequence of a set of principles that is immediately obvious.

At the end of the day, memorizing a plotting package is not a good use of my time or yours, and after spending a good bit of time with both packages I’ve found dealing with R-Julia interoperability and its occasional difficulties to be a lesser problem compared to the issues raised above.

Concluding thoughts

Julia is wonderful, made even more so through its strong interoperability given by RCall [2] and PyCall [3]. I find it better than R, and much better than Python. It does have its flaws. Its syntax isn’t ideal in certain situations, particularly when writing highly functional code, and would be improved by being more like Scala, or even like pipe-oriented R written with the magrittr [5] package. Multiple dispatch is not a proper replacement for Python-style objects, and having a language feature similar to Rust’s implementations would be a major improvement. This said, in my view Julia is already ahead of R and Python, which have bigger issues than the above. Usability and cleanliness are critically important in a programming language, and this is why it’s worth using ggplot in Julia.

References

[8] See the original book [9] and ggplot manual [10].

[9] L. Wilkinson. The Grammar of Graphics. 2005.

[10] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. 2016.