Alexander Terenin

Where Are We and What Now?

It’s 2026. I started my previous blog post on AI research, written in 2023, by saying that the hype around ChatGPT had died down somewhat compared to its launch. That turned out to be completely wrong. Not only has the hype not died down: I would say it is now larger than it has ever been. The most important factor that has kept the hype alive is the rise of AI-driven software engineering. It turns out that models like Claude and Gemini can write useful code. That’s a really big deal, and very exciting for me personally. Many years ago, I decided to become a researcher specializing in theory and methods, rather than a software engineer—in spite of the fact that I was really good at writing code.1 The key reason for this was that I preferred working on hard but interesting math to working on important but uninteresting code. AI changes that.

My own work is also once again at a crossroads. Over the last two-and-a-half years, I’ve shifted away from Gaussian process research, to working on the algorithmic fundamentals of decision-making under uncertainty. This work has resulted in two key contributions: the development of Gittins index methods for Bayesian optimization, and of a non-obvious form of Thompson sampling for online learning with non-discrete action spaces. Details aside, I am very happy with these contributions, because I feel that I qualitatively understand algorithmic decision-making vastly better now than I did when I started.2

Better fundamental understanding is a good starting point. But, at least for me, it can only be that—a starting point. The question is: what comes next? I am far from the only scientist to face this question. In a retrospective on his career, the Nobel prize winner John Hopfield singled out the process of grappling with this question—which he called “Now What?”—as a cornerstone of what made his contributions significant. I agree with John on its importance: for me, the key challenge in answering this question is correctly understanding where AI research is now and where it is headed, so that I can react accordingly. This blog post will therefore explore these questions.

Warning: this post is going to be long. You should consider skipping to the headings that seem most interesting to you.

Where are we?

The first step towards figuring out what to work on—and, for that matter, where to work on it—is to correctly understand where technology is headed today. This is not a trivial question: addressing it demands that I think decisively, and make precise falsifiable claims about what is likely to happen. I believe it is also important to have the courage to write—and, potentially, be completely wrong—in public, as I alluded to at the start of this post in my comment concerning ChatGPT’s hype. So, I will attempt to do so, with my own formed-from-scratch opinions: here’s what I think will happen, and why.

AI will significantly change how software engineering is practiced day-to-day

One of the key developments in AI that has made me re-assess what will become possible, and how quickly, is the emergence of AI-assisted coding environments such as Cursor and Claude Code. These provide an interface through which a developer can ask an AI in natural language to write code that accomplishes a particular task. This includes workflows for reviewing the results and asking for changes, as may be needed to ensure the code is correct. If you have not tried them out, you should stop reading this blog post immediately, and do that now.

I find these tools extremely exciting. Writing a program that actually serves a purpose, end-to-end, has always been difficult. Whether one learns to code in a class at university—or on one’s own, as I did as a teenager who was into video game modding—almost all developers come to appreciate the need to start simple, get an initial codebase that does something correctly, and build complexity later, as it becomes needed. With an AI-assisted workflow, the time to do this goes down significantly—likely enough to justify building many kinds of new and more-heavily-customized software that would otherwise never have been made.

At the same time, only some of a software engineer’s day-to-day tasks involve writing code. Others involve communicating with team members, digging into existing systems to find the root cause of a given problem, deciding which system design or architecture is best suited to the task at hand, and—frankly—navigating the internal politics of the company one works at in order to actually be able to improve things. Quite a few of these may require significantly more agency than current systems have, the ability to interact effectively with more than one person, and completely different interfaces. It is unclear how quickly these will develop. However, as many of the preceding examples are social, I think it is reasonable to expect software engineering to become a more social job.

Another key limitation is that using these tools requires credits, which are a finite resource. It is very easy to blow all of them on an unimportant task that requires a lot of tokens, but does not advance the project being written. Managing people is a non-trivial task: an incompetent manager can easily waste their company’s budget by directing their employees to work on things that do not advance their organization’s goals, or by preventing them from working on things that do. In the same way, I believe that managing what an AI system is doing will become a key bottleneck of using it in a manner that justifies the costs. This leads to the next key point.

In a world where intellectual work is automated, a key human role will be to decide what’s interesting

Suppose we are further into the future, and AI can do almost any achievable intellectual task, but otherwise retains the same fundamental token-prediction architecture in use today. What are people going to do then? What is the role of humans in such a world?

I think the answer will be to decide what’s useful or interesting. Humans do not simply pursue goals: they decide which goals to set for themselves. Coming up with a sense of purpose, and fulfilling it, is a major part of human lives—and, indeed, a well-established idea from psychology, with the need for self-actualization sitting at the top of Maslow’s hierarchy of needs.3

I currently see no such needs or comparable mechanism in AI systems, unless they are introduced within a prompt (potentially recursively) by a human. On their own, current systems appear to mimic the medium-level patterns of human thought, but not the higher-level structures and needs within which human thought lives. I do not see an incentive for AI companies to create such mechanisms, as they would make AI more difficult to control, without directly improving capabilities. Instead, I see the opposite incentive—from the perspective of AI safety. If this prevails, the goals pursued by an AI system will be set by people, based on what they find useful and interesting.

The economic effects of AI are not obvious and will be surprising to many people

Many people are afraid that AI will, on average, eliminate jobs. The argument for this is straightforward: taking software engineering as an example, if one software engineer with AI can do the job of a team of ten engineers without AI, what is the point of hiring the other nine?

It’s also likely wrong, at least for software engineering—due to what is called Jevons’ Paradox.4 The counterargument is very simple: if software becomes ten times cheaper to develop, then all kinds of previously-unviable business models become viable, leading to more demand for software engineering than previously existed. Indeed, one can observe that as computers moved from programming using punch cards, to imperative and object-oriented programming, to modern carefully-engineered languages, writing software has become easier and easier—yet, more software gets written now than ever.

On the other hand, there really are fewer horses on the roads today, compared to in centuries past. For any narrative like the above, it is not difficult to come up with counter-narratives regarding just-about-any economic effect of AI. So, then, which one is right? In the absence of empirical evidence, how can we know?

I am not sure we can. AI is an even more general-purpose technology than the internet, yet people were largely unable to predict with reasonable precision the economic effects the internet would have. While the science has certainly advanced in the last twenty-five years, I worry that many interesting questions about the economic effects of AI are too open-ended, or lack any comparable reference points, for anyone to be able to deduce clear answers. For example, it is conceivable that advances in AI might lead to advances in robotics, which in turn make ultra-customizable artisan-style manufacturing competitive with assembly lines: if that happens, how does one even begin to assess the consequences? My conclusion is that we should form our opinions—as I did, with my appeal to Jevons’ Paradox—and then prepare to be surprised.

Whether the AI boom ends by stabilizing or with a crash will depend heavily on how quickly inference costs go down

In spite of attributing high uncertainty to economic outcomes of AI, there is one economic prediction I am willing to entertain: whether or not AI is a bubble—or, more precisely, whether an AI-related economic crash will happen or not. Let me explain how I think about this.

I’ll start with a simplified thought experiment involving a set of hypothetical AI companies. At every time point, each AI company has two sources of cost and profit: training and inference. Let’s assume there are two sequences: training costs and inference profit, where the latter is initially negative but grows over time and eventually becomes larger than training costs. From these dynamics, it follows that an AI company becomes profitable if it survives long enough for inference profit to outweigh training costs. Thus, for an investor to see an AI company as a reasonable investment, they need to believe in a large-enough probability that the company makes it, and expect sufficient profit once it does. Both Google’s executives and the venture capital backers of OpenAI and Anthropic have decided to make this investment—paying for training, and subsidizing inference, in the face of colossal costs today.
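To make these dynamics concrete, here is a minimal cash-flow sketch in Python. The numbers are invented purely for illustration (they are not a claim about any actual company); the point is only that the break-even time is extremely sensitive to how quickly inference margins improve.

```python
# A toy cash-flow model for the thought experiment above. All numbers are
# invented for illustration; only the qualitative shape matters.

def break_even_quarter(training_cost, inference_margin_start, margin_growth, horizon=40):
    """Return the first quarter at which cumulative cash flow turns positive, or None."""
    cumulative = 0.0
    margin = inference_margin_start   # initially negative: inference is subsidized
    for quarter in range(1, horizon + 1):
        cumulative += margin - training_cost
        margin += margin_growth       # inference profit improves over time
        if cumulative > 0:
            return quarter
    return None

# Margins that improve quickly lead to break-even; margins that improve slowly do not.
print(break_even_quarter(training_cost=10.0, inference_margin_start=-5.0, margin_growth=1.5))
print(break_even_quarter(training_cost=10.0, inference_margin_start=-5.0, margin_growth=0.5))
```

In the first scenario, margins improve quickly and the hypothetical company breaks even within the horizon; in the second, they improve too slowly and it never does.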

To sustain AI, investors will need to continue investing. Short of some kind of Lehman-Brothers-style accounting catastrophe, I am confident that Google will be able to do so, given its dominance in internet advertising. I also have confidence in OpenAI and Anthropic’s ability to survive by raising funds: the US venture capital ecosystem, which is only one of many possible sources of investment, has a total of $1.2 trillion under management—a number of the same order of magnitude as Google’s market capitalization. So, the funding pool is similarly deep for both kinds of companies—but it is also not infinite.

As inference costs go down, AI companies become less dependent on fundraising. If this happens fast enough, relative to the economic value per token, then AI will become profitable, and the current boom is likely to stabilize. If, instead, costs remain stubbornly high, investors might conclude that AI is too expensive to be worth even-more investment. This could result in companies being forced to raise prices, which would make AI less valuable to users, and AI companies less valuable to investors—creating the kind of feedback loop that could lead to an economic crash. I don’t know whether or not this is likely to happen. But, given the above dynamics, I predict that the rate at which inference costs go down is likely to be a critical factor in determining the risk. This provides an excellent reason to think about inference costs from a technical perspective, and I will do so in the sequel.

In the long term, mathematical research like mine will become AI-assisted

The last half-decade of my life has been dedicated to machine learning research of a mathematical character. One part of this has been formulating, and then proving, certain theorems—generally, in pursuit of various higher goals. Another part has been doing mathematical calculations which enable certain methods to be implemented numerically and tested to evaluate their performance. Much of this work has been done in a pen-and-paper style—or, more precisely, a keyboard-and-LaTeX style for me personally.

At present, AI for math is a research direction receiving a substantial degree of investment, especially from industry. This work increasingly involves formal verification systems such as Lean—these provide a programming language for writing proofs, and make it possible for a computer to verify that a given proof is correct, among other things. The appeal, from a point of view of AI fundamentals, is obvious: if an AI system can prove a novel hard theorem, it possesses the capability to replicate and potentially exceed at least one kind of landmark human intellectual achievement. At the same time, the domain of mathematics is one of unambiguous statements and absolute truth, making for a better angle of attack compared to other intellectual areas.

I think that AI will be able to do math at a superhuman level in my lifetime, and that there is a good chance this will happen relatively soon. This is because I think there is a very high chance the combination of language models and formal verification systems can be engineered into an AlphaZero-style recursive self-improvement loop. I also think that the investment needed to do the engineering, and pay for the compute, is definitely there. In any event, today’s systems, while nowhere near a superhuman level, already perform at a non-trivial level.

As a result, I may be among the last generation of PhDs who are able to prove a difficult original mathematical result—where, by difficult, I mean one which requires introducing a new and different way of thinking about a given problem—purely using their own intellectual efforts. This makes me especially proud to have written my recent paper on non-discrete online learning, which I believe reaches this level.5 There are few things in life I enjoy more than rising up to meet a difficult intellectual challenge. Whether people ultimately value my work or not, I am glad to have had the chance to write it, and prove to myself that I can.

Almost all research-driven AI performance gains will come from ML systems, not theory and methods

A bitter lesson I’ve learned, in somewhere around ten years of work on machine learning theory and methods research, is that theory tends to come later than practice. Said differently, almost none of the field’s theoretical or methodological work, my own included, has led to significant practical performance improvements for the whole field—there are perhaps ten papers that significantly moved the needle, out of perhaps a hundred thousand. This is a provocative claim—so, please, allow me to explain what I mean.

Suppose we are allowed to enter into a time machine, and transport ourselves back to 2010. We’ll be giving the AI researchers of that day a small set of papers, together with a reasonable sum of money with which to buy GPUs. Their goal will be to re-build ChatGPT from-scratch. Which papers should we provide? How many—or, rather, how few—would we need?

In my view, that number is shockingly small. I’d perhaps choose Attention is All You Need, the Adam paper, maybe the Sequence-to-Sequence paper, the paper on vision transformers, and the work introducing causal masking. I think a list like this, or perhaps a slightly larger one, would have been enough for ChatGPT to be built in the mid-2010s, rather than in November 2022 as happened in real life without the time machine.

It is worth reflecting on the character of those papers. Essentially all of them are empirical, and only a subset could be called methodological. Engineering is a much larger focus in them than in most research. Other than the ideas in this small set of papers, almost all performance gains have come down to software engineering, systems, and data-centric work. I predict these two trends—which have many important consequences that I will continue exploring in the next point—will continue to hold.

The tiny proportion of practically-useful theory and methods research will move the needle a big distance

Just above, I pointed out that the fundamental ideas needed to build modern AI come from a surprisingly small set of machine learning theory and methods papers—essentially all other work which was necessary to get us where we are today came from software engineering, systems, and data. I would now like to explore a few implications of this.

The first one is personal. I am a theory and methods researcher—one for whom the fundamental appeal has been to figure out how to build systems we’ve never known how to build before.6 If this approach has, historically, yielded almost none of the most important breakthroughs, should I be pursuing it? Especially if I think I would like doing day-to-day empirical machine learning research significantly more, given the availability of AI-assisted coding? I have deliberately phrased these questions in a way that elevates a particular answer: please allow me to now challenge this perspective and argue that things are not as obvious as they may seem.

The problem with simply dismissing methodological research is that, even though it rarely succeeds at moving the needle, it moves it a very significant distance when successful. Continuing the running example, it is likely ChatGPT would simply not have worked if it had been built using LSTMs rather than transformers. The aggregate contributions of methodological research are very significant, even though almost all papers individually don’t lead very far. We therefore need methodological research, in spite of how difficult achieving a significant degree of success is.

These observations have implications for how we should support and fund fundamental research: if success is very rare, we should support as many people as possible, making as many intellectually distinct bets as we can. We should prioritize original research directions, reward people for thinking for themselves, and be suspicious of trends and well-established approaches. My best understanding is that our current system often does the opposite: consider, for instance, that working with top people confers an enormous advantage on the faculty job market—and top people are by definition not doing obscure or contrarian things. One could even argue that, in the last decade, industry has done a much better job than academia—consider, for instance, that many of the papers I listed above came from Google Brain, at a time when large-scale machine learning was not dominant in the way it is now.

Looking at my own work, I am happy that I’ve prioritized my own voice and perspective, and have historically picked a strategy of trying to start trends, rather than getting in early on existing ones. On the other hand, I’ve probably been enticed too much into pursuing challenging and difficult research for its own intellectual sake. Sometimes, the uninteresting, tedious, or ugly7 approaches turn out to be the right ones. The beauty in errors is that they tell us exactly how to sharpen our thinking, while staying true to ourselves.

In-context learning will become a dominant AI paradigm

I’m now going to shift directions, and think about what AI research will look like in the near future. Research is multifaceted, and uncertain: many people are going to be working on many things. Nonetheless, I see a trend—one that, depending on who you ask, could be called either emergent or well-established—that I’d like to make a prediction about: the rise of in-context learning.

In-context learning refers to the idea that transformers are capable of performing machine learning inside of their context window. Said differently, if one has a model that can do sequence-to-sequence prediction, then one can consider prediction tasks that map sequences representing data into sequences encoding desired outputs. This leads to the perspective that, instead of designing algorithms—a fundamental goal of machine learning research from the beginning—one can instead seek to design datasets and prompts, in order to achieve the same results. In natural language processing, where designing algorithms for many tasks is nightmarishly hard, this has clearly worked—but the general viewpoint above suggests it may be much more broadly relevant than a language-oriented framing would suggest.
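To make this concrete, here is a minimal sketch, in Python with NumPy, of how a small regression dataset could be serialized into a single sequence for an in-context learner, in the spirit of prior-fitted networks. The `model` in the final comment is a placeholder for a trained sequence model, which is not defined here.

```python
import numpy as np

# A minimal sketch of how a supervised dataset can be packed into one sequence
# for an in-context learner. The transformer itself is not included: `model` in
# the final comment stands for a trained sequence model that maps this sequence
# to a prediction at the query position.

rng = np.random.default_rng(0)

# Toy regression task: y = sin(x) + noise.
x_context = rng.uniform(-3, 3, size=10)
y_context = np.sin(x_context) + 0.1 * rng.standard_normal(10)
x_query = np.array([0.5])

# Each token is an (x, y, is_query) triple; the label slot of the query is left empty.
context_tokens = np.stack([x_context, y_context, np.zeros_like(x_context)], axis=-1)
query_tokens = np.stack([x_query, np.zeros_like(x_query), np.ones_like(x_query)], axis=-1)
sequence = np.concatenate([context_tokens, query_tokens], axis=0)

print(sequence.shape)               # (11, 3): ten labeled examples plus one query
# prediction = model(sequence)[-1]  # the "learning" happens inside the forward pass
```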

Newly-emerging theory on transformers supports this perspective. It is possible to formalize and prove, in certain cases, that transformers trained by gradient descent can learn algorithms from data—where the word algorithm is understood roughly in the sense of algorithms and data structures. If that’s a general principle, there could well be problems where the algorithm one seeks is too complex for a human to figure out, and a better approach is to design clever ways of coming up with training data, so that a transformer can learn the necessary algorithm.

I predict that this paradigm—which, beyond language models, is central to other kinds of foundation models, including prior-fitted networks for tabular deep learning, to name one example—is going to be a dominant paradigm in AI. This is because transformers work in practice: leveraging their capabilities is a much more technically sound way to perform meta-learning compared to prior approaches, which by-and-large did not reliably work. At the same time, in-context learning is a user-friendly and extremely-flexible paradigm: it is not hard to create a synthetic-data-generation pipeline to get started, yet what that pipeline can accomplish is limited only by creativity, and researchers are very creative.

Top models will continue to gradually improve for a very long time

Let’s now talk about large language models—though, the same considerations also apply to other kinds of foundation models, including those that rely on interfaces that are neither text nor vision. I think there is essentially no limit to how long these models’ performance, in just about any domain, will continue to improve. In particular, I see two key ways in which a language model’s performance can improve: (i) by making each token more useful, and (ii) by generating useful tokens more quickly and at less cost. The first of these directly improves performance, while the second does so indirectly by enabling reasoning, agentic workflows, and other complementary techniques.

To be specific, I think there is substantial headroom for both (i) and (ii), I think that investment in both is overwhelmingly likely to transform that headroom into results, and I think that today’s providers are going to make that investment. Thus, I think improvements from both avenues will happen. Since both (i) and (ii), along with each of the above points, involve many details, I will explain them in the sections directly below.

Data quality will be a primary long-term source of improvement

Just above, I said that I think the usefulness of each generated token will improve for a very long time. On the other hand, today’s large language model training corpora are likely to have already maxed out the gains available from pre-training on internet data—the models have already seen every page on the internet. So, where are these gains going to come from?

My answer: data quality. Having worked with internet-scale data in my early-career days in industry, let me say something obvious but easy to overlook without this experience: internet-scale data is extremely messy. There is a very large amount of spam, irrelevant information, mistakes and errors, and other data that makes model performance worse. The scale of what language models are trained on is so vast that I see software development work to improve data quality as a near-unlimited avenue for improvement.

AI itself is likely to play a significant role here. As a simplified example, consider a classification task. If an example is consistently misclassified, even after training at scale, it could be that the misclassification is caused by an incorrect label, rather than the model’s mistake. In this case, the information about which label should be corrected is coming from the model’s training dynamics, which are only available once an initial training run is completed. It is not hard to imagine workflows where a model examines its own training curves, looks for issues, corrects them, perhaps with some human labeling in the mix, and is then re-trained to improve its performance recursively. This hypothetical example is one of many possible approaches.
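As a toy illustration of this kind of workflow (a sketch under simplified assumptions, using logistic regression in place of a large model), one can flag suspicious labels by looking at which examples still have high loss after training:

```python
import numpy as np

# A toy version of the workflow above, using logistic regression rather than a
# language model: train, then use per-example loss to flag labels that may be wrong.

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.standard_normal((n, d))
y = (X @ rng.standard_normal(d) > 0).astype(float)

flipped = rng.choice(n, size=50, replace=False)   # simulate 5% annotation errors
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

w = np.zeros(d)
for _ in range(500):                              # plain gradient descent
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y_noisy) / n

p = 1 / (1 + np.exp(-X @ w))
per_example_loss = -(y_noisy * np.log(p + 1e-12) + (1 - y_noisy) * np.log(1 - p + 1e-12))
suspects = np.argsort(per_example_loss)[-50:]     # highest-loss examples after training

print(f"{len(set(suspects) & set(flipped))}/50 flagged examples are genuinely mislabeled")
```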

Data quality is not just about removing incorrect information, but also includes acquiring information that was previously missing. This can include generating synthetic data. Returning to the AI-for-math example discussed previously, one can use a language model to generate Lean code, and keep the parts that are verified correct, obtaining new data to train on. If a model obtains a useful lemma purely by chance, it now knows about this lemma, and no longer has to rediscover it, improving its abilities. This hypothetical example is also one of many.
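A minimal sketch of what such a generate-and-verify loop might look like, with the language model abstracted away behind a hypothetical `sample_candidate_lemmas` callable, and verification delegated to the Lean toolchain, which I assume here is available as a `lean` executable that exits with a nonzero status when a file does not check:

```python
import pathlib
import subprocess
import tempfile

# A sketch of a generate-and-verify loop. `sample_candidate_lemmas` is a
# hypothetical stand-in for a language model that proposes Lean source code; it
# is not defined here. The assumption about tooling is that a `lean` executable
# is on the path and fails with a nonzero exit code when a file does not check.

def verify_lean(source: str) -> bool:
    """Return True if the Lean source checks without errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run(["lean", path], capture_output=True)
    pathlib.Path(path).unlink()
    return result.returncode == 0

def collect_training_data(sample_candidate_lemmas, num_candidates=100):
    """Keep only machine-checked candidates as new training examples."""
    verified = []
    for _ in range(num_candidates):
        candidate = sample_candidate_lemmas()   # hypothetical model call
        if verify_lean(candidate):
            verified.append(candidate)
    return verified
```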

Inference API costs will come down very significantly over time

Returning to the original point, let’s now discuss the second avenue for near-unlimited improvement: inference time and costs. Right now, inference is extremely expensive, and its cost is being subsidized by the companies. I do not think this state of affairs will last.

The least sophisticated reason for this is simply Moore’s Law, which has continued to apply to GPUs to a much greater extent than it has for CPUs. All else equal, the cost of computation decreases over time. This means that the same models, with the same performance, can be deployed at less cost once newer-generation compute is purchased and comes online.

The counterargument to this point is that, as computational costs decrease, using larger models becomes more attractive. Indeed, in computer graphics, as computation has become less expensive, 3D artists have responded by using more and more of it to produce previously-intractable special effects. What if the supply of increased computation is outweighed by demand for higher-quality results driven by larger models?

In this race, I think the need to reduce costs is going to win—and, more precisely, it will win slowly. Fundamentally, this is for economic reasons—at some eventual point, if they want to exist, leading providers will need to become profitable—but there is a technical case as well. In the short term, one should expect various improvements—think, for instance, of KV-caching, or the many ideas in the DeepSeek papers—purely because AI inference systems are new and there has not been a lot of time to optimize them yet. But there are medium-and-long-term reasons as well, as I will describe next.
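Before moving on, for readers unfamiliar with KV-caching, here is a minimal NumPy sketch of the idea (single attention head, no batching, illustrative only): each decoding step adds one new key and value to a cache, instead of recomputing the attention inputs for the entire prefix.

```python
import numpy as np

# Why KV-caching saves work at inference time: when decoding token t, the keys
# and values of tokens 1..t-1 are unchanged, so they can be stored rather than
# recomputed from the whole prefix at every step.

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

K_cache, V_cache = np.zeros((0, d)), np.zeros((0, d))
for t in range(8):                               # autoregressive decoding loop
    x_t = rng.standard_normal(d)                 # embedding of the latest token
    K_cache = np.vstack([K_cache, x_t @ Wk])     # one new key and value per step
    V_cache = np.vstack([V_cache, x_t @ Wv])
    out = attend(x_t @ Wq, K_cache, V_cache)     # attention over the cached prefix

print(K_cache.shape, out.shape)                  # cache grows by one row per token
```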

Over time, AI will move to specialized chips, eventually ones with noisy output at the hardware level

At present, to the best of my knowledge, today’s language models are served using compute clusters consisting of nodes that contain either Nvidia GPUs, or similar hardware such as Google’s TPUs. There are two key workflows: training and inference, which differ because the former involves backpropagation, while the latter involves a variable-size input and output. Right now, both workflows are running on similar or even the same hardware, which has not been designed to handle each specific workflow optimally at the hardware level. I believe this will change: to understand why, let’s talk a bit about hardware.

While the term GPU stands for graphics processing unit, a more appropriate name could have been general-purpose parallel-processing unit. This is because, just like a CPU, a GPU can execute general kinds of instructions—much more than matrix multiplication alone.8 The key difference compared to a CPU is that a GPU uses massive parallelism to maximize throughput, and generally focuses on hiding latency by overlapping execution, rather than directly minimizing it. In particular, while their architectures and therefore performance characteristics differ, both types of chips can execute equally general code.

The ability to perform general computational workflows is difficult to achieve, and requires complex chip design. The GPU must carefully manage what it is doing to ensure both correctness and performance. If one eliminates this generality, it should be possible to design devices which are much less expensive to operate. This is obviously an attractive proposition, as long as the initial design costs are not too large. The downside is that one needs to know what computation the model will perform, and this may change if algorithmic techniques improve. On balance of the tradeoffs, I suspect the move to specialized chips will happen fairly quickly, unless technical reasons prevent it, because Nvidia’s GPUs are simply too expensive, and vendor lock-in, even taking into account the availability of AMD GPUs, creates too much business risk.

Thinking long-term, I believe the headroom for improved performance due to custom chips is very substantial. One reason for this is that, unlike essentially-all computations we have traditionally designed chips to perform, neural networks are noise-tolerant. If you slightly perturb the numerical value of each weight, the network will still work. This is what makes low-precision training and inference possible. Right now, we do not know how to design hardware for noisy computation—but, if we did, we could likely use a lot less power.9 In a talk given at the University of Cambridge, Geoff Hinton called this idea mortal computation, and has said that it might one day allow us to run a high-performance language model on a device closer to the size of a mobile phone than the size of a data center. I see no fundamental reason why he’s wrong, and instead predict that he’s right.
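As a toy illustration of the noise tolerance mentioned above (a sketch using a tiny linear model rather than an actual neural network), slightly perturbing trained weights barely changes accuracy:

```python
import numpy as np

# A toy demonstration that a trained model tolerates small weight perturbations.
# Illustrative only: a tiny logistic-regression "network" on synthetic data.

rng = np.random.default_rng(0)
n, d = 2000, 20
X = rng.standard_normal((n, d))
y = (X @ rng.standard_normal(d) > 0).astype(float)

w = np.zeros(d)
for _ in range(2000):                     # plain gradient descent
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / n

def accuracy(weights):
    return (((X @ weights) > 0).astype(float) == y).mean()

noise = 0.05 * np.linalg.norm(w) / np.sqrt(d) * rng.standard_normal(d)
print("clean weights:    ", accuracy(w))
print("perturbed weights:", accuracy(w + noise))   # a few percent of noise barely matters
```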

Building complex agentic systems that work reliably will be difficult, and will involve a better computational understanding of incentives

I’ll now shift gears again in order to talk about systems that sit on top of language models. Agentic systems involve instantiating AI agents and allowing them to interact with an environment in order to pursue a given goal. At present, the typical workflow is that one launches a bunch of agents, each with a certain prompt, and waits for them to complete their task. I’d like to instead talk about a much more sophisticated potential workflow: launching a bunch of agents which talk to each other, and to other people, cooperating to solve a problem that would be difficult for a single agent to solve. There is an appeal to this idea: social behavior is a cornerstone of being human, and enables people to potentially achieve far more by working in teams than individually. Why not attempt to improve what AI can do in a similar manner?

One reason to be excited about such approaches is that the characteristics by which they scale up or down are likely to be different from other approaches. Using a bigger model, or a faster model that can think for longer, generally requires a larger compute cluster. One cannot simply deploy a twice-as-big model on two clusters in different physical locations: the increased latency will likely bottleneck performance. Workflows involving multiple cooperating agents have no such limitations: they can in principle work with little coordination, making it possible to build substantially larger and more complex AI systems overall.

I think carrying this out is going to be a lot harder than it looks. The problem is: how do we ensure that agents successfully cooperate, especially in very large systems? While humans certainly can work in large groups, doing so is non-trivial: human organizations have a tendency to become dysfunctional with size—think of the typical government or large company. It is difficult to align the incentives of an organization with those of its members, and the organizations that do this best tend to be controlled by small sets of individuals, rather than by distributed governance mechanisms.10 What stops multi-agent AI systems from transforming into an AI bureaucracy11 which uses up tokens, but achieves little?

The key issue is that it is very difficult to predict how the local incentives around individual agents will combine to influence what an organized system of agents actually does. Questions like this are studied in disciplines such as economics and political science, the latter of which is largely non-quantitative, and has yielded a much weaker understanding of the phenomena at hand compared to what has been achieved in physics, or in electrical engineering, to name two examples. If one views AI agents and their capabilities with a fundamental degree of respect, understanding how incentives will affect AI agents will be similarly difficult to understanding how incentives affect humans. I see this both as a challenge, and a great opportunity, and will elaborate on it further in my final point.

For artificial general intelligence, world models are the next frontier

Given what language models can do, what is the next step on the path to artificial general intelligence? I, and likely many others, think that the next step is the development of world models—for robotics, this means representations of the three-dimensional physical world that make it possible to understand what will happen as a result of different physical actions. These will go beyond pure image and video generation to involve a genuinely three-dimensional computer vision stack, while adding the ability to model physics and other modalities such as touch feedback.

A useful analogy for what we want is the concept of a learned video game engine. By this, I mean software that allows one to point a camera at a scene, and obtain a representation that makes it possible to see what will happen if a robot attempts to move one of the objects in the scene. The world model’s capabilities should be at least as rich as those of a modern game engine—but be available in arbitrary environments found day-to-day and learned on-the-fly. If such models existed, they would be immensely useful to robotics, and in particular would make model-based reinforcement learning viable in all kinds of new situations that are unworkable today. It is not hard to imagine how to build one: to name just one angle of attack of many, neural radiance fields with physics simulation capabilities already exist. Today, there are multiple companies, including Yann LeCun’s and Fei-Fei Li’s startups, which are developing ideas that may one day lead to tools like this. I am convinced that one of them will succeed.

This opinion of mine is not new. In 2020 and 2021, I applied to a set of Junior Research Fellowships at Oxford and Cambridge, with a research proposal that moved in this direction by building on some of my early research on Variational Integrator Networks, a type of variational autoencoder that can model both smooth and contact dynamics. My applications were not successful, though they came close, with several final-round interviews—to my best understanding, the problem was that too many people thought what I wanted to do was science fiction and not feasible. Some time later, I decided to double down on my views, having become convinced that world model research would succeed whether I was involved in it or not. I therefore opted to pivot to working on decision-making, which becomes increasingly important once world models arrive. Let me talk about that next.

For artificial general intelligence, decision-making and sample-efficient reinforcement learning are the second-next frontier

Suppose that world models exist. What becomes the next challenge in artificial general intelligence? I would argue that decision-making capabilities, especially sample-efficient reinforcement learning, are the next challenge. There are many distinct areas where these capabilities are central, including active learning, Bayesian optimization, model-based reinforcement learning, and model-predictive control: here, I will think very broadly, and focus on what the methods actually do, not the terminology they go by or the scientific community they come from. The unifying concept here is the presence of explore-exploit tradeoffs, which require the algorithm to learn by trial and error, balancing what it already knows with what it could learn by trying something totally new.

I think an improved understanding of such methods would be very consequential. In particular, using world models, they could give us the tools with which to design robots that can carefully plan step-by-step to decide what to do to solve a completely-novel task, using the world model to evaluate what will happen from potential actions. Together with the open-endedness and interfaces provided by language models, and the predictive capabilities provided by world models, I believe this is the only remaining software-oriented capability needed to make science-fiction-level robots real. Once world models develop and mature, I believe everyone will realize this, and attention will shift to these questions. Since they touch on my own research, I will discuss them further in the sequel.
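To make the planning loop concrete, here is a minimal random-shooting sketch in which a world model is used to imagine the outcomes of candidate action sequences. The `world_model` below is a hand-written stand-in (simple point-mass dynamics); in the setting discussed above, it would be learned from data.

```python
import numpy as np

# Planning with a world model: sample candidate action sequences, roll each one
# out inside the model, and execute the best first action. The `world_model`
# below is a hand-written stand-in; in the setting above, it would be learned.

rng = np.random.default_rng(0)

def world_model(state, action):
    """Stand-in dynamics: a point mass nudged by the action."""
    return state + 0.1 * action

def reward(state, goal):
    return -np.linalg.norm(state - goal)

def plan(state, goal, horizon=10, num_candidates=256):
    """Random-shooting planner: pick the action sequence with the best imagined return."""
    best_return, best_actions = -np.inf, None
    for _ in range(num_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, 2))
        s, total = state.copy(), 0.0
        for a in actions:
            s = world_model(s, a)          # imagine the consequence of each action
            total += reward(s, goal)
        if total > best_return:
            best_return, best_actions = total, actions
    return best_actions[0]                 # execute only the first action, then replan

state, goal = np.zeros(2), np.array([1.0, 1.0])
for _ in range(20):
    state = world_model(state, plan(state, goal))
print(state)                               # ends up near the goal
```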

Over the long term, artificial intelligence will create an unprecedented toolbox for understanding social systems

My final prediction—out of a total of fifteen, that’s quite a few—is even longer-term than the others. It’s that advances in AI will make it possible to reason about incentives in social systems with unprecedented precision. This prediction comes from two trends. First, incentives are critically important from a fundamental AI safety point of view: they can help us understand how intelligent models can be influenced—for instance, to ensure they tell the truth. Second, as I argued earlier, we will need to understand incentives in order to get large-scale AI systems that involve multiple agents to reliably do what they are designed to do. Addressing both kinds of question will involve ideas of mechanism design and related tools from economics, but will require a much-broader understanding that will involve new ways of thinking about incentives.

The same understanding will likely prove useful for understanding incentives in human systems. Based on the structure of who holds power, and the specific form of what they can do, will a given form of government last, or will it collapse? In a system of government where everyone is constrained—whether by voters, or by other means—what kind of changes can realistically happen? We have no good first-principles computational methods with which to address these questions. But if we fundamentally understand incentives well-enough to design large-scale AI systems involving many agents, we will probably learn how to tackle these questions too, because both are governed by similar difficulties.

This prediction is naturally the most open-ended, but leaves me with a sense of optimism. At present, a great deal of human suffering is fundamentally caused by the behavior of large human social systems. If we can better understand how such systems work, we might gain new kinds of technical tools—ones with a completely different character than those working on policy and politics can currently imagine—with which to make the world a better place.

What now?

This post has been long, with many points and various details. I hope you have skipped the sections you did not think were relevant to you. Inspired in part by John Hopfield’s retrospective, I will now discuss what I think the above trends mean for me and my career. My goal here is not necessarily to convince you or anyone of anything in particular: it’s to write my own thoughts down precisely, and thereby sharpen my thinking, so that I can make the right choices when the time comes. Nonetheless, I hope you find my thoughts useful.

Relevance

In research, there are several ingredients needed for my work to reach the level of significance that I aspire to.12 First, my work needs to involve difficult technical skills, and ideally be of a character that other people cannot do. At this stage, I am happy with the point that my mathematical skills have reached, and I have been happy with my software skills for a long time, though AI-assisted coding has given me new skills that I have been excited to learn. My on-paper track record is reasonably strong: I have written plenty of papers that people in various scientific communities know about. At the same time, I’d characterize my professional success as fairly minimal: the typical top computer science PhD probably draws significantly more combined demand from academia and industry than I do, in part because my skills span too many distinct areas and my profile is too weird.

I also think there’s a much bigger problem with my track record: its relevance, in the sense of the question, do other people need to think about your work in order to achieve their goals? I got this framing from Kuang Xu’s talk at the Operations Research and Machine Learning workshop at NeurIPS,13 and have found it to be an illuminating concept. It is also the standard by which I was, implicitly, evaluating theory and methods research in my earlier points. And I think it’s a useful one: to my eyes, relevance drives career success more than a researcher’s degree of skill, their work’s impact,14 or many other factors.

Relevance is heavily influenced by the overall direction of investment made by society. In 2017, when deep learning was a relatively-new approach to AI, it was clear to everyone that Google and other actors with substantial resources had decided to invest in this direction. At the time, I did not appreciate how much of the progress that would come could be attributed to that investment. As individual researchers, we don’t control the direction of the field, yet the direction in which the field is going plays a decisive role in determining whether our own work is relevant to others. This does not mean we should follow the crowd and not make contrarian bets: rather, it means that contrarian bets should be chosen in a manner where other people will care about them a lot if they turn out to be right.

While I am happy with my recent work in many aspects, I think it is also possible to make my near-future work much more relevant. The fact of the matter is that both Gittins indices and online learning are obscure concepts that are known primarily to a small set of technical experts. In working on them, I have implicitly chosen to prioritize my own understanding of how algorithmic decision-making works at a fundamental level, as opposed to producing results that directly help as many other people as possible. The relevance of the resulting work is therefore mostly indirect, and largely characterized by making new kinds of follow-up projects possible. Let me now talk a bit about what this follow-up might look like.

Will decision-making algorithms experience a paradigm shift, like in computer vision or natural language processing?

I think that algorithmic decision-making is going to be a significant part of the future of AI—as I have for some time now, since shifting into this area from my Gaussian process research, which was motivated primarily by factors such as curiosity and challenge. At the same time, I have noted above that I think the next steps could be much more relevant to the field than the prior ones. Most of my recent work has been dual-purpose, in the sense that each project has had the goal of both answering some kind of immediate-timeframe question, and opening up long-term angles of attack on harder but more significant questions. However, it has been hard to communicate this well: almost all of my papers have been written to emphasize the immediate question, while my primary reason for actually working on them has more to do with the long-term understanding they create. I think one key to being more relevant is to be able to do work where the two goals are better-aligned.

Decision-making tasks are characterized by the presence of explore-exploit tradeoffs. Balancing these tradeoffs is difficult, and we are not even close to a comprehensive understanding of how to handle them in general. The obvious approach is to quantify uncertainty, using either a Bayesian method or something like a confidence set, and then use the resulting uncertainty to balance what is known with what could be learned. The problem with this is that, as soon as neural networks are involved in some manner, it is famously difficult to have any control over how estimates of uncertainty behave. As a result, almost all the decision-making methods we have—motivated by bandits, reinforcement learning theory, Bayesian optimization, or whatever else—are still classical, designed using kernels and other principles from previous eras, with a great emphasis on theory.
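For readers who have not seen such methods, here is a minimal example of the classical recipe (maintain an uncertainty estimate, then act optimistically with respect to it) in the form of a UCB rule on a toy Gaussian bandit. This is purely illustrative, and not one of the methods from my own work.

```python
import numpy as np

# The classical recipe on a toy Gaussian bandit: keep running estimates, add an
# exploration bonus proportional to how uncertain each estimate is, and pick the
# arm with the best optimistic value.

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.3, 0.9])
K, T = len(true_means), 2000

counts, sums = np.zeros(K), np.zeros(K)
for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                                   # pull each arm once to initialize
    else:
        means = sums / counts
        bonus = np.sqrt(2 * np.log(t) / counts)       # confidence-width exploration bonus
        arm = int(np.argmax(means + bonus))           # optimism in the face of uncertainty
    reward = true_means[arm] + rng.standard_normal()  # noisy feedback
    counts[arm] += 1
    sums[arm] += reward

print("pulls per arm:", counts.astype(int))           # most pulls go to the best arm
```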

The reasons to emphasize theory within decision-making are much stronger than those that applied to supervised learning in the heyday of support vector machines. Given just an algorithm, it is not obvious whether poor performance should be attributed to bad methodological design, or to an impossibly-hard problem. Theory, in the form of lower and upper bounds, is appealing because it allows one to quantify both performance and difficulty in a mutually-compatible manner. And, yet, I think there are good reasons to think a shift towards empirical methods should nonetheless both be possible, and yield better ultimate results.

In computer vision and natural language processing, one way to think about the shift to deep learning is as a shift from designing features to learning features from data. Rather than first detecting edges, textures, and combinations thereof, and then using them to make predictions, today’s models learn relevant features directly, resulting in features which are too complex to have been designed, and perform much better. At present, I worry that decision-making algorithms are similar: a typical bandit or theoretically-motivated reinforcement learning algorithm is hand-crafted, constructed carefully using mathematical ideas. There is a chance that it might be possible to learn better algorithms from data—ones that were too complex to design directly—if one can figure out how.

In-context learning provides a paradigm by which to do this, by instantiating a form of meta-learning that actually works in practice. Following this perspective, the critical research question becomes: how should one generate synthetic data by which a strong decision-making algorithm can be trained? This question is highly non-obvious, because, as above, there is no obvious supervised-learning-style criterion by which one can determine whether an algorithm explores correctly. And, yet, perhaps a subtle solution of some kind may be possible—this question is likely to be of a character that can be addressed by mathematical theory. So, I’ll make my final prediction, with a medium level of confidence: decision-making research will experience a paradigm shift, moving from classical machine learning to in-context learning. I’ll be trying my best to help us get there sooner.
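To make the synthetic-data question concrete, here is one plausible recipe, sketched under simplified assumptions rather than as a settled method: sample random bandit tasks, record interaction histories, and supervise a sequence model to predict each task's optimal action from its history.

```python
import numpy as np

# One plausible recipe for synthetic pretraining data: sample random bandit
# tasks, record interaction histories under a simple behavior policy, and label
# each history with the task's optimal arm. A sequence model trained on such
# data must implicitly learn to infer the best arm from the history it is given.

rng = np.random.default_rng(0)

def make_example(K=5, history_length=20):
    task = rng.uniform(size=K)                          # a random Bernoulli bandit task
    actions = rng.integers(K, size=history_length)      # behavior policy: uniform
    rewards = (rng.uniform(size=history_length) < task[actions]).astype(float)
    history = np.stack([actions, rewards], axis=-1)     # the "context" sequence
    target = int(np.argmax(task))                       # supervision: the optimal arm
    return history, target

dataset = [make_example() for _ in range(10_000)]
print(dataset[0][0].shape, dataset[0][1])               # (20, 2) history plus an arm index
```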

Where?

In light of the above, it is worth asking what is the right place for me to do the work I’d like to do. My recent career, for many good reasons, has been spent entirely in academia. Even if I continue on my current path—at present, I am on the faculty job market, with reasonably-promising progress, but by no means guaranteed results, so far—it is worth sharpening my thinking by understanding the tradeoffs involved, and what I would do if I had made different choices.

What’s appealing about academia?

From a research perspective, the biggest upside of academia is clear: the intellectual freedom to work on what I think is going to be long-term important. This freedom results in substantially fewer constraints on direction than in industry, where long-term goals ultimately come from leadership rather than from me. In academia, the work is open-ended, its results are evaluated by peer review and the community at large, and the outputs are visible in public. I think these characteristics are much better suited to my personality. I also like teaching, like working with students, and like writing, and therefore think I would not mind scientific fundraising as part of the job.

On the other hand, academia also has downsides. The biggest of them is lack of control over location. This is unavoidable: factors completely out of my control, such as a department’s teaching needs in a given year, are a critical part of my and everyone’s results on the job market. There are also potential concerns about access to sufficient resources, such as compute, with which to be able to actually do the work I want to do. Another downside is that academia provides comparatively-little opportunity for me to directly capture the value I create, most of which will ultimately go to various companies, rather than to the university or to me.

What about research in industry?

If I were to move to industry, the first big question is: to where in industry? I’m not sure. I very much regret that, during my PhD, I never got to intern at one of the big research labs of the era. This is partly from not trying as hard as I could have, partly from not having a public profile that convincingly made the case that I can code well, and partly from very poor timing due to COVID-related hiring freezes around the time I was best-positioned. So, I don’t know as much as I’d like: however, what I do know gives me conflicting feelings. On the one hand, there’s definitely value in being a part of AI in the places where it’s being turned into products today. On the other hand, I am skeptical of large companies: in my experience, at a large company, it is easy for people to not particularly care whether their work, or the company’s product, is good or not—a recent viral blog post15 captured this much better than I could. I don’t like environments where this feeling is pervasive. But, I also suspect the most ambitious companies and teams are different: caring about their work and getting the details right is necessary to achieve their goals. I’d love to know whether such a team could be a good fit for me.

Another option is joining a small machine learning startup—by my best estimate, there are more of these today than ever. This has a lot of appeal: before returning to do my PhD, I worked at Petuum, then a Carnegie Mellon startup. This was the best job I’ve ever had. The single biggest factor was that everyone around me was extraordinarily competent, and very much tried to do well on all the things they worked on, to a degree I’ve only otherwise seen in academia. There are good reasons to think this kind of environment can be found at the right startups—though I should be careful, because in Petuum’s case, I’ve been told the company’s difficult times came later, after I left to start a PhD. I’d need to be very careful about which company to join, because startups are inherently risky, and there are a lot of founders who don’t know what they’re doing—for instance, ones that are much better at fundraising than at actually building something, to give just one potential failure mode. Ensuring the right fit would be necessary.

The main downside of industry, in general, is that I think it’s harder for my work to matter. The technology sector is a very big pond, and I would be a very small fish swimming in it. Many of the most-consequential industry research projects are so large-scale that few specific contributors can make a decisive difference. Industry is also known for re-organizations and other aspects that reduce the creative control I have over what I work on. As with academia, there are definitely tradeoffs at hand.

What about starting a company?

A final option, and one that I see a lot of appeal in, is entrepreneurship—for many of the same reasons as academia. First, starting a company, just like academic research, is all about turning a vision into reality. Second, also just like research, starting a company is difficult. A founder needs to learn how to do everything well, from navigating social systems in order to fundraise and then to successfully hire, to figuring out how to build their product well and ensure it is useful to other people, to understanding how to get customers to try it out, to ensuring that the value created is captured by the company and not someone else. The sheer scope of the work, while perhaps intimidating to some people, makes it exciting to me. But, at the same time, I see no value in the idea of being a founder for its own sake: if I’m starting a company, it’s because my research vision has advanced to a stage where it can be turned into a company mission, and I see a viable path to make that vision real.

Another upside of this path is its flexibility: it is both possible to start a company during an academic sabbatical, and to leave academia entirely in order to do so. On the other hand, a key downside—as I’ve seen from many friends who have done so, some with great success, others not so much—is that being a founder is all-consuming, and will involve some amount of stepping away from technical work. I don’t know whether I would like or be good at fundraising or other parts of the job which I’ve never tried. Entrepreneurship is risky and carries significant potential financial downside compared to other paths—even taking time to bootstrap and build an initial demo carries serious opportunity costs. It’s all tradeoffs, all the way down. You’ll know if and when I decide to take this path: a blog post titled Chips on the Table will appear.

Some final thoughts

Let me conclude with a few final thoughts. First, thanks for bearing with me through 20 pages worth of text—there’s a lot here. This blog post is partly an experiment in documenting my thinking publicly, both in case someone else might find it valuable, and so other people learn more about who I am and how I think about AI. Please feel free and welcome to contact me with your comments—I will not post my contact information here, but it is not difficult to figure out how to find it. I especially encourage you to contact me if you think I’m wrong, or if any of my points are too obvious or simplistic to be interesting. And, remember that these are just my thoughts: there’s a good chance I’ll change my mind someday on just-about-anything written here.

Finally: none of this post was written by AI. All of it was typed in the classical manner on a keyboard. The bottleneck for me has been in forming opinions that I have full confidence in. As a result, unlike some people, I have found it much easier to simply write my ideas down directly, than to write a good prompt to generate text describing them. In particular, this post was written in full by me, linearly, section-by-section, in three separate sessions over three days in mid-January. Two weeks later, I returned to it, re-read it, made minor edits, collected and added all of the links, and made it public. Given that the topic is AI, and that I hold a perspective of optimism16 and excitement about current approaches, I very much realize the irony here.

Footnotes

  1. There’s not a huge amount of publicly-visible evidence to back me up on this. But, let me give you an idea: I spent the tail end of my master’s research designing a GPU-based Robin Hood hashing algorithm and implementing it in CUDA, as part of a project to bring MCMC-based doubly-sparse Latent Dirichlet Allocation to GPUs. The key difficulty was that good parallel performance on GPUs involves very different memory access patterns compared to on CPUs. My solution involved grouping hash buckets together and performing parallel insertion and retrieval on a per-warp basis, using a careful lock-free generalization of the standard algorithm, which I believe to be original. I don’t think most software engineers would have been able to get the hash table to work, but I did. Unfortunately, I never finished the project, as I moved to Imperial College London to start a PhD and landed with an advisor who insisted that I immediately drop all of my prior work. A year later, I quit working with that advisor, switching to a much better group—but, by that point, my research had shifted to a totally different topic. What I should have done was drop the LDA aspects and write a paper just about parallel hashing—but I was too early-stage and didn’t know how to package my work or tell people about it. Not finishing this project is one of my biggest career regrets.

  2. One of these papers even managed to become a finalist in a best-paper competition. But, as a certain very successful faculty member at Columbia, who works on topics not too far from where my own used to be, once told me, “don’t you know that nobody cares about awards?”—so I’ll put that one in a footnote this time around.

  3. A long time ago, I completed two simultaneous bachelor’s degrees in statistics and psychology. I obtained the latter degree purely for fun, with no direct professional ambitions, and spent most of my time learning social and personality psychology. A decade later, I would say that the psychology degree turned out to be surprisingly useful—in spite of the fact that a large percentage of its content was outright wrong due to what is now called the reproducibility crisis. The reason is that a small portion of powerful ideas have helped me think much more clearly about society and other people.

  4. Jevons’ name is pronounced JEV-uns, similar to Jensen Huang’s first name, and different from Klavon’s Ice Cream Parlor.

  5. A secondary lesson I have learned about mathematical research from this work is that it is harder to get other people to engage with difficult mathematical results, especially ones that involve a different viewpoint compared to what they are used to, than with easy results. This is because people know that understanding the proofs will require work, and this work is only worth doing if people feel that the techniques are beneficial to them. All of this holds doubly so if the person proving the result is not an established figure in the respective subdomain, or if the question they are answering is of a different character than what is typically studied in the subdomain.

  6. This style should be contrasted with a different one where the appeal is to understand something complicated for its own sake, without the goal to build something new. This style, different from my own, is also common.

  7. Some time ago, as part of an industry event, I attended a keynote given by John Jumper on his Nobel-prize-winning AlphaFold work. During the talk, he spoke about having the conviction to pursue deep learning, in spite of the fact that people around him viewed it as an inelegant approach. In particular, in thinking about the right solution, he asked, “What if it’s ugly?”—meaning, what if the approach that actually works doesn’t fit the aesthetics of whoever is working in the area? I immediately recognized that this was an important question to keep in mind, and it has stayed with me and influenced my thinking ever since then.

  8. This general-purpose capability is what made GPUs useful for Bitcoin mining, which has different computational characteristics compared to computer graphics.

  9. Note that reducing electrical usage is the key factor to be optimized in order to reduce the resource footprint of AI. There is a lot of discourse on water use by data centers which is outright wrong and based on incorrect information. This is because some early work in that space focused on how much water goes into a data center, without also taking into account that a lot of this water goes right back out of the data center and remains usable downstream. The actual water use by a data center, defined in a non-misleading manner, is comparable to that of a fast-food restaurant. And it has to be like this, in the sense that there should be no doubt this modified view is correct: just think a bit, from first principles—what would computers even need water for?

  10. Indeed, concentration of power is a major cause of today’s social problems, especially in situations where a leader’s whims are poorly aligned with what society at large wants. The governing principle of separation of powers, as enshrined in the United States Constitution and elsewhere, was created precisely to address this.

  11. In-between the time I originally wrote this post, and launched it, this point has become much less farfetched—just look at Moltbook and imagine what else might be possible when large sets of independent AI agents start talking to each other.

  12. The standard here, as written in my prior post, is that my ideas should continue to be valuable to science and humanity even after I die.

  13. This talk touched on whether current research in the operations community is relevant to society at large, including to AI in particular, and what the consequences of that to the two fields would be.

  14. The problem with impact as a concept is that one can have significant impact while capturing none of the resulting value—for instance, if one’s work greatly benefits people who have no influence on their professional success.

  15. Optimism in the face of uncertainty is a provably-strong decision-making strategy in many settings. Translating what theory tells us to a human-language level, the key is to choose optimally among a set of actions whose uncertain outcomes are estimated optimistically in a manner that is also realistic-enough. If one does so, the regret they incur is controlled by how quickly the uncertainty given by the set of realistic-enough outcomes—including non-optimistic ones—decreases, as information gained from trying new actions is obtained. I find it absolutely remarkable that it is possible to write a sentence like this to summarize the content of an actual mathematical theorem—here, the UCB algorithm’s regret bound—given the statement otherwise sounds like philosophy.
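      For concreteness, one standard gap-independent form of this bound, stated loosely with constants and distributional assumptions omitted, is $\mathbb{E}[R_T] = O\big(\sqrt{KT \log T}\big)$ for a $K$-armed bandit run over $T$ rounds.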