It’s 2023. Though by now the hype has died down somewhat, it is clear from talking to everyday people that ChatGPT has completely changed the public’s understanding of the capabilities of modern machine learning and artificial intelligence. A year ago, most people would have said that artificial general intelligence is at least a decade away. Today, one could reasonably argue that in its most primitive form, artificial general intelligence is here right now—and it’s not too different from what Andrej Karpathy envisioned in his famous blog post from eight years ago. Some of the field’s very best researchers are pivoting their work to focus on topics whose prominence rises with the public deployment of language models, such as AI safety. With new technology becoming available to the public, it is certainly a good time to reflect on research.
My own work is at a crossroads. Most of my last four years of research was dedicated to Gaussian processes, even though I had written blog posts about neural networks not long before I started. My PhD research was driven almost exclusively by intellectual challenge and the desire to learn and improve. I studied functional analysis and differential geometry, then used that knowledge to start several lines of work—including on pathwise conditioning and geometric Gaussian processes. I won two prestigious best-paper-type awards for this work.
My ambitions, however, have always been bigger than winning awards: I want to do research important enough that my ideas continue to be valuable to science even after I die. In spite of its success, I do not think my current work rises to that level: I am increasingly convinced that many of my contributions are orthogonal to machine learning’s most important open research problems. So, it’s time to think about expanding into something new. This blog post will document my thinking as I explore what kinds of ideas to dedicate the next few years of my scientific life to, as an Assistant Research Professor at Cornell.
Some Thoughts on Successful AI Research
Embarking on a new long-term research direction is a multifaceted task that involves taking different criteria into account. To explore how best to do so, I now detail what I’ve learned about research over the last half-decade, focusing on what works best and how to avoid common pitfalls.
Start with the problem
When I started my master’s in applied mathematics and statistics, the first thing I remember my advisor, David Draper, telling me was that the right way to perform research is to start with the problem. By this, he meant that a good research direction should focus on solving a well-formulated scientific problem with a clear and consequential notion of success. In computer science, this generally consists of figuring out how to create a new technology that does not yet exist. David contrasted this with method-focused research, which might involve goals such as extending or generalizing existing techniques, without any consequential applications in mind until it came time to write an experiment section in a paper. David—very much a man of strong opinions—would say that working on problem-focused research is the right way to be a scientist.
I don’t agree with David’s view, or rather with its implication that this is the only approach: I find this too simplistic, especially if taken literally. Research is too multifaceted an area of human endeavour to always prefer one particular approach. The right way to do research depends on the interplay between the researcher, the scientific question, and the needs of society. These factors vary to a sufficient degree that no broad statement could hope to capture what is right.1
Nonetheless, I find David’s advice to be of great value, since it is insightful to think about reasons for agreeing with it. One key advantage of problem-focused research is that succeeding at it creates societal value that goes beyond simply publishing papers. In contrast, method-focused research creates opportunities for other people to create societal value. This can be worth a lot, especially since some problems can ultimately only be solved after appropriate methodological preparation, but it is more indirect. If one is not careful, this indirectness increases the risk of publishing papers that few people ultimately read,2 especially since method-focused research often tends to quickly rise to a high level of technical sophistication,3 which is intellectually satisfying to work on, but in most cases substantially harder for other people to understand and therefore use.
The essence of David’s advice, as I now understand it, is therefore to think deeply about how one’s research ultimately helps other people. To this end, not all scientific questions are of equal importance. To figure out which ones are, I find that it helps to think about the following criteria:
1. Success should be consequential. Ideally, the research’s results should create genuine commercial value, so that starting a company to commercialize it is a viable path.4
2. There should be an effective angle of attack. Answering the key questions using current techniques should be a viable approach. Ideally, the ultimate solution should be simple, even if the path to obtain it is not.
3. The research question should be hard enough that it cannot be answered by a talented undergraduate student. Ideally, I should be uniquely positioned to answer it.
These criteria often point in orthogonal directions, and therefore must be balanced. In the last half-decade, I have largely failed to follow the advice I give here. Almost all of my published work has succeeded more at (2) and (3) than at (1). At the same time, a good bit of my unpublished work in areas like reinforcement learning never got to the point of a paper, due to not succeeding at (2). Going forward, I’d like to work on research that fits all three criteria.
Focus on what works
The ultimate reason machine learning methods are valuable is that they work. Deep learning, in particular, is valuable to areas like natural language processing because no other approach has allowed engineers to build natural-language systems for interacting with computers to the same degree of success. You can write a sentence, send it to ChatGPT, and it will write better sentences back than could have been obtained from a symbolic approach, or any other kind of system known at present. Deep learning works. There is little else to say.
Some of today’s scientific revered elders do not like this state of affairs. Noam Chomsky, Judea Pearl, and Gary Marcus are all famously skeptical of deep learning. Though each presents different arguments in favor of skepticism, my view is that all three share a single fundamental reason behind their skepticism: deep learning didn’t solve the problem of natural-language human-computer interaction in the way they wanted to solve it. Chomsky, Pearl, and Marcus each take this as evidence that deep learning is not good enough—that, instead, we need more focus on other, completely different approaches. Given ChatGPT’s effectiveness, in my view a more convincing explanation is that their requirements are too strict and not necessary or even useful for creating artificial intelligence.
A broader lesson one can learn from this comparison is that deep learning’s success rests in part on its empirical focus. Deep learning’s pioneers did not focus on creating artificial intelligence in the technically elegant way some grand, over-arching theory suggested it should be solved. Instead, they studied how to build systems that solved practical engineering problems of direct scientific and commercial importance, such as classifying images. To advance the understanding of intelligence, we should therefore focus on building technical capabilities for solving practical engineering problems step-by-step, using methods that work.
Avoid heroic effort
Every sufficiently non-trivial research problem will involve challenges that need to be handled and technical obstacles that need to be overcome. These demand an appropriate degree of effort. Most researchers, given their naturally hard-working nature, are happy to provide that effort. Counterintuitively, however, my experience shows that too much effort is a more common failure point than not enough effort. Borrowing a beautiful phrase I heard Art Owen use during one of his talks, I’ll use the term heroic effort to describe research that requires substantial intellectual effort to obtain short-term results.
The reason heroic effort is almost never justified is because, at the end of the day, machine learning algorithms are produced to be implemented and deployed by ordinary engineers. The more complex an algorithm, the bigger a team is needed to implement and maintain it. Moreover, complex methods tend to be fragile, unstable, difficult-to-scale, and limitation-heavy.
Avoiding heroic effort should be viewed as a guiding principle rather than a rigid rule: sometimes, complex and effortful methods produce results that justify their complexity. Reverse-mode automatic differentiation, for instance, is a complex algorithm that requires one to build and maintain elaborate data structures in its implementation. It is also an astonishingly stable and scalable algorithm, powering large language model training pipelines that demand state-of-the-art software engineering to run successfully. In this case, the practical benefits justify the complexity, and most of the implementation complexity can be hidden from ordinary users in frameworks maintained by sufficiently-well-resourced teams. Algorithms like this are rare: most successful techniques, such as self-supervised learning, are simple at their core.
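To make the data structures involved concrete, here is a minimal, hypothetical sketch of tape-based reverse-mode differentiation in Python. This is an illustration of the general idea only, not the implementation used by any production framework: a `Var` node records its parents and local derivatives, and a single reverse sweep over the tape accumulates gradients.

```python
import math

class Var:
    """Scalar node on an autodiff tape: value, parents, and local derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = tuple(parents)  # pairs of (parent Var, d_self/d_parent)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def sin(x):
    return Var(math.sin(x.value), [(x, math.cos(x.value))])

def backward(output):
    """One reverse sweep: topologically sort the tape, then push gradients back."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

# d/dx [x * sin(x)] at x = 2 is sin(2) + 2 cos(2)
x = Var(2.0)
y = x * sin(x)
backward(y)
print(abs(x.grad - (math.sin(2.0) + 2.0 * math.cos(2.0))) < 1e-12)  # True
```

Even in this toy form, the reverse sweep requires recording and traversing the whole computation graph—exactly the kind of bookkeeping that, at scale, justifies hiding the complexity inside a well-maintained framework.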
The same thinking also applies to papers and even mathematical theory: on average, the more complex a paper or theorem, the fewer people will use it in follow-up work. Almost all of my research projects which required heroic effort ultimately failed, in some cases after a substantial time investment. Research should therefore involve heroic effort only when the scientific benefits justify it, and heroic effort should never be applied for non-scientific reasons such as making advisors happy, to appear more sophisticated to others, or to gain something in return for sunk costs.
Build the right team
Most research problems require multiple distinct skillsets to solve. These include mathematical skills, software and programming, design of experiments, and writing and scientific communication. Some people are better at some of these than others. A critical first step in most projects is therefore to assemble the right team, so that every aspect of the research has someone involved who both enjoys it and is good at it. Recruiting the right collaborators is often the easiest way to save time and avoid needless effort, by inviting experts who can achieve more with less time and effort to contribute as needed.
The most important part of successfully building a team is to ensure that everyone feels welcome, included, and involved in the project. All members should have the room to contribute creatively in the manner they deem best, and to communicate as needed to ensure progress. At the same time, the project should have a unified and sufficiently well-defined vision to ensure that everyone is working towards the same goals. Watching a project make progress is often the best way to inspire everyone to contribute to the best of their potential. Much of my research success has come down to finding the right collaborators and making sure everyone was excited about the work.
Embrace standard tools
Machine learning research relies on software tooling, and the capabilities of this tooling often determine research progress. Unless the explicit goal of a research project is to improve tooling, it should use the best existing frameworks, relying on them in the manner that most facilitates the research being done. I have not done so in the past: I spent a significant amount of time working within the Julia automatic differentiation ecosystem, which has received far less software development investment than Python-based frameworks like JAX or PyTorch. This resulted in me spending time fixing bugs instead of working on my research, as well as struggling with batching APIs that have poor developer experience.5 In the last year, I’ve stopped being a programming language snob and embraced Python, which has ultimately made iterating on projects faster and less frustrating.
Support and inspire others
Science and technology are community pursuits, and all of us, throughout our careers, can expect to collaborate with and learn from many colleagues who bring unique perspectives and contributions to the table. The success of our research depends heavily on actions taken by those we surround ourselves with.
Based on my experience, I believe that, given an appropriate research environment, the quality of a person’s scientific work is determined primarily by their interest in the topic—not their skill, talent, ability, the prestige of the institutions they previously studied at, or the success of people they previously worked with. The highest-quality work is done by those who simply want to know the answers, and who do the work for the sake of figuring them out.
Since interest and curiosity are fundamentally personal, it follows that it is impossible to train good students: one instead needs to discover those who are interested, and empower them with the tools needed to develop their ideas and do the best work they can.6 It also follows that there is no such thing as a bad student: in my experience, judgments like this originate from adopting too narrow a notion of success. We should not reject those who are less effective at proving theorems, because it might instead be that they are very good at building a company, or at coming up with effective ways of understanding other topics such as history or ethics—even if our own interest and expertise lie in proving theorems. Instead, the right approach is to inspire, empower, and support those around us towards reaching greatness in the manner that is right for them.
The ability to inspire also affects research on a collective level. Pascal Poupart once told me he thought a significant factor behind machine learning’s success is that it sounds exciting and interesting to undergraduates—especially compared to neighboring fields such as statistics, which undergraduates often disliked when they took courses in it. Over time, more students were inspired to work on machine learning, and the field advanced faster than its scientific neighbors as a consequence. At the level of the discipline, one should therefore make an effort to cultivate an environment that is inspiring, accessible, and welcoming, so that over time the field attracts the highest-quality work.
In his essay The Bitter Lesson, Rich Sutton famously argued that learning and search are the most promising techniques for developing artificial intelligence, due to their ability to scale in an unlimited manner with increased compute. Over the last decade, most of the field has focused on improved learning, such as work on transformer models, or on self-supervised learning techniques that allow more effective use of available data. At the same time, in my view some of the most impressive technical demonstrations have relied critically on search techniques within the overall system. These include, for instance, AlphaGo, which relies fundamentally on Monte Carlo Tree Search, and the recent Diplomacy AI CICERO, which builds on ideas from computational game theory, such as no-regret dynamics, to plan its next move.
Search algorithms involve explore-exploit tradeoffs, and can be understood through the lens of decision-making and reinforcement learning. In this way, they share many similarities with Bayesian optimization, which I have extensive expertise in by this point. Building from this expertise, I am interested in understanding how to balance explore-exploit tradeoffs generally, and will start by studying decision-making algorithms for tasks beyond optimization. I aim to develop theory for constructing and understanding such algorithms—in both Bayesian settings where tools such as Gittins index theory apply, and in non-Bayesian settings such as adversarial bandits and online learning.
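As a concrete illustration of the explore-exploit tradeoff, here is a textbook UCB1 bandit sketch in Python. This is a standard algorithm from the literature, shown only as a minimal example of confidence-based exploration, not a method from my own work: each arm's score combines its empirical mean with a bonus that shrinks as the arm is pulled more often.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Textbook UCB1: after trying each arm once, always pull the arm
    maximizing empirical mean + sqrt(2 ln t / n_pulls)."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initial round-robin: try every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return counts

# Bernoulli arms with success probabilities 0.2, 0.5, 0.8: over 2000 pulls,
# UCB1 concentrates most of its pulls on the best (last) arm.
rng = random.Random(0)
probs = [0.2, 0.5, 0.8]
counts = ucb1(lambda a: 1.0 if rng.random() < probs[a] else 0.0, 3, 2000)
print(counts[2] > counts[0] and counts[2] > counts[1])  # True
```

The confidence bonus is what makes the tradeoff explicit: rarely-pulled arms keep a large bonus and get revisited, while well-understood arms are judged mostly on their empirical mean.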
While these directions don’t immediately correspond to a specific scientific problem, I believe that advances in them can help create the technical language we need to design algorithms that can, for instance, allow a robot to learn efficiently from trial and error. Given the importance of such technical tools, and of others that can be developed using a better fundamental understanding of decision-making algorithms, I hope that my work stays true to the spirit of the advice I started my scientific journey with.
Some Final Thoughts
This blog post consists of just my own thoughts—what I was thinking at the time. Your opinion might be different, and I will very likely change my mind in the future. Research is, ultimately, personal: everyone has their own way to succeed, and what works for me might not work for someone else—you should figure out for yourself what works for you. I am interested in research first and foremost, and my views reflect that. I hope this post has been useful, or at least interesting, to you. Please feel free to contact me and let me know what you think about it.
1. It is easy to think of researchers, such as Katalin Karikó, whose work can simultaneously be viewed as method-focused, and is of such fundamental importance to humanity that one could not possibly argue that her approach is incorrect.
2. My own work on scalable algorithms for hierarchical Dirichlet processes is a great example: we extended a previous method—partially collapsed Gibbs sampling—from latent Dirichlet allocation to the more complex hierarchical Dirichlet process topic model. This was technically interesting and a lot of fun to write, but not as scientifically valuable as other work I’ve done. In practice, almost everyone interested in topic modeling simply uses latent Dirichlet allocation—it does the job, and is simpler and more reliable. As a result, several years later, my paper on hierarchical Dirichlet processes is nowhere near as cited as my other works.
3. Consider, for instance, how our work on Riemannian Gaussian processes started from a simple and straightforward NeurIPS paper on manifold Fourier features, but quickly transformed into a highly technical two-part foundational series of papers.
4. Most recent startups that made it to scale fit this criterion, including DeepMind (game AI), HuggingFace (platforms for natural language processing), Weights & Biases (machine learning operations), MosaicML (distributed systems and scalability), OpenAI (large language models), and others.
5. One API difficulty I’ve had to deal with is more restrictive broadcasting semantics compared to NumPy, which forces one to add unnecessary reshapes when working with higher-dimensional arrays. Another is the lack of syntactic sugar for emulating object-oriented-style programming, which forces the developer to waste time on things like balancing parentheses—and, in cases where object-oriented metaphors are the right approach, encourages one to write hard-to-read code. This ultimately results in projects that are more complex and difficult to maintain. My overall impression of the language is that too much attention is spent on the compiler’s technical capabilities, and not enough on improving the day-to-day developer experience.
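To illustrate the NumPy semantics in question, here is a small example of the implicit broadcasting that avoids the reshapes mentioned above: arrays with compatible but unequal shapes combine elementwise, with size-1 and missing axes stretched automatically.

```python
import numpy as np

# A (3, 1) column and a (4,) row broadcast together into a (3, 4) grid,
# with no explicit reshape or loop over either axis.
col = np.arange(3).reshape(3, 1)   # shape (3, 1): values 0, 1, 2
row = np.arange(4)                 # shape (4,):   values 0, 1, 2, 3
grid = col * 10 + row              # shape (3, 4)
print(grid.shape)    # (3, 4)
print(grid[2, 3])    # 2 * 10 + 3 = 23
```

Under stricter semantics, the same computation would need the `(4,)` array explicitly reshaped to `(1, 4)` (or a manual loop) before the elementwise operation is allowed.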
6. I’ve been told that this view mirrors the approach used by Geoff Hinton, who famously supervised a very large set of outstandingly talented students who later went on to make fundamental contributions to many different areas of science.