Alexander Terenin

Sculpting Fragile Glass with Agentic Coding

Two months ago, I wrote a blog post about the start of a new chapter. There, I said that machine learning has changed to an almost-unrecognizable extent, and I really mean it. Yet, AI is also only getting started, and we’re all still figuring it out. There’s been a lot of hype around rapid prototyping, famously by Garry Tan: he has produced 600000+ lines of working Ruby on Rails in a handful of weeks, without reading essentially any of it. The reactions have ranged from excitement to ridicule: it is much easier to write a lot of code than it is to write a lot of code that is correct and continues to work well as a product evolves. Many people think that agentic workflows tend to produce slop, and I think this can certainly be what happens, but I also think it doesn’t have to—the haters are wrong and Garry’s right. Let me convince you by showing you how it is also possible to use agentic workflows to sculpt fragile glass: that is, to write difficult code, where one subtle bug breaks things silently, and failures manifest as infinite loops in rare edge cases—ones that somehow also manage to happen all the time.

If you’ve guessed that this is going to involve concurrency in the form of high-performance CUDA C++, you’re correct. Concretely, I am going to re-build one of the trickiest pieces of code I have ever written—and am going to do so using Claude Code exclusively. There will be zero lines of human-written code and zero fundamental design decisions made by me, yet zero tolerance for errors and zero tolerance for slop—all at once. I’m mostly interested in this because I think it’s cool. But, if you need another reason, I think it’s interesting to explore and see what agentic coding looks like along the less-talked-about regions of a speed-quality Pareto tradeoff. And, to convince you the details are worth it, I’ll spoil the punchline: getting to a reasonable implementation took only two days, on an after-work side-project basis, and resulted in a higher quality level than my hand-written code from 2016. If this sounds interesting, let’s dive in!

Code for this blog post is available on GitHub, along with full Claude Code session logs.

Robin Hood Hashing on the GPU

For this project, I implemented a concurrent hash table in CUDA—one whose algorithmic design is original, and comes with a story from my early career.1 In 2016, I became interested in Latent Dirichlet Allocation on GPUs, having recently written a paper about parallel Markov chain Monte Carlo algorithms for this model. The core computations involve pairs of integer-valued sparse vectors, and the cleanest efficient implementation stores these vectors in densely-packed hash tables.i To achieve this, I designed a novel kind of lock-free hash table suited for GPU-style concurrency, implemented it in CUDA—and then, very unfortunately, started a PhD with an advisor I later quit working with who effectively forced me to abandon the project. What I should have done is simply finished debugging, benchmarked the hash table, and published a single-author data structures paper about it—but, I was in my early twenties, and didn’t know what I was doing. Four years later, a paper re-inventing many of the same core open-addressing techniques—and, to be fair, other ones too—won Best Paper at HiPC 2020. On the one hand, no hard feelings—I’ve won best paper awards of my own. On the other, I’ve always wanted to know: how well did my design actually work?

Ten years later, concurrent hashing is interesting for a different reason: it’s exactly the kind of fragile glass code that carries a high risk of subtle concurrency-induced bugs, and must behave correctly under rare edge cases which only appear at scale when running thousands of threads. It also requires reasoning about some fairly complicated hardware details in order to achieve the desired level of performance. Moreover, no-one appears to have published my specific design in the academic literature, and my implementation was buried in a larger repository, with code that was neither documented nor correct. A coding agent would need to rely more on its ability to reason and generalize for this task, than it would for many others. So, I was curious: how hard would it be?

The Design: Buckets and Slots

To start, I gave Claude Code the following prompt:

Hi Claude! In this repository, we’ll be attempting to design and implement a parallel hash table that can run efficiently on an Nvidia GPU.

We want:

  • A header-only CUDA library
  • The ability to insert many elements in parallel
  • In theory, utilize 100% of device memory bandwidth

Let’s start by brainstorming a possible design into a Markdown file in the notes directory.

The amazing thing is that it did not repeat a design from a CUDA library such as cuCollections, or the repository of any paper that I am aware of. Instead, it wrote a Markdown document detailing my exact design from 2016, namely a Robin Hood hash table with lock-free subwarp-cooperative probing.

Let me tell you how this works in a nutshell. To do that, there are four important things about GPUs for our purposes that you should know, beyond the fact that GPUs are massively parallel processors which involve thousands of threads:

  • On a GPU, every group of 32 threads, called a warp, must execute the same instruction at a given time.
  • Communication between threads in a warp is much cheaper than communication between threads in a block, which is the next unit of hierarchy.
  • A GPU cannot retrieve less than 128 bytes of memory at a time.
  • In most interesting cases, a hash table’s performance will be limited by the GPU’s global memory bandwidth.

This leads to a few consequences. First, we should prefer simple algorithms more on GPU than on CPU—complex logic will tend to produce branching, which then gets executed in sequence, performing poorly. Second, computations should be localized among threads, in order to avoid excessive coordination costs. Third, we need to leverage parallelism to search all 128 bytes we retrieve, otherwise threads will be sitting around doing nothing useful. Fourth, we should aim to retrieve as little memory as possible as a design principle.

These requirements effectively rule out many classical pointer-oriented chaining-style designs, but are a reasonably good fit for linear-probing-style hash tables. Let me remind you how these work: they store the data as an array of slots of key-value pairs, namely

k1:v1k2:v2k3:v3k4:v4k5:v5k6:v6k7:v7k8:v8 \begin{aligned} k_1&:v_1 & k_2&:v_2 & k_3&:v_3 & k_4&:v_4 & k_5&:v_5 & k_6&:v_6 & k_7&:v_7 & k_8&:v_8 \end{aligned}

where knk_n are potentially-empty keys and vnv_n are values, which will be either 32 or 64 bit integers in our case. Letting KK be the number of possible keys and NN be the size of the table, we use a hash function h:[K][N]h : [K] \to [N] to map keys to random-looking slots inside the table. Since, in the interesting regime, NKN \ll K, the question becomes: what should we do if two keys get mapped to the same slot? Linear probing gives a clean answer: simply store it in the next slot.

This is a great strategy most of the time. But, if the hash function’s outputs behave like uniform random numbers—as they are generally designed to do—they’ll occasionally cluster, resulting in the array filling up unevenly. For unlucky keys, one may have to search over an unreasonably large distance from their origin point. Robin Hood hashing gives a clean way to avoid this slowdown: when inserting the key, if you encounter a key closer to its starting point than your current key, swap keys. This small change not only reduces the distance of many keys, it also allows retrievals to potentially terminate early, significantly reducing the amount of searching one must do in unlucky situations—this is the strategy that both I in 2016, and Claude Code in 2026, decided to adopt.ii

So far, we’ve been thinking one key at a time. But, as I noted earlier, on a GPU we need to think 128 bytes at a time. To handle this, we group slots into buckets, like

[k1:v1k2:v2k3:v3k4:v4][k5:v5k6:v6k7:v7k8:v8] \begin{aligned} [k_1&:v_1 & k_2&:v_2 & k_3&:v_3 & k_4&:v_4] & [k_5&:v_5 & k_6&:v_6 & k_7&:v_7 & k_8&:v_8] \end{aligned}

which corresponds to two buckets with four slots per bucket. Then, when doing insertion and retrieval, don’t perform checks one slot at a time, instead use a group of threads, called a tile, to search the whole bucket in parallel. Each thread in a tile retrieves one slot, checks its contents, then communicates with other threads in the tile in order to decide what to do, using hardware-efficient ballot and shuffle operations. Insertion into a given slot is performed using an atomic compare-and-swap primitive, resulting in a lock-free algorithm. For 32-bit keys and values, we use tiles corresponding to half-warps, resulting in a branching factor no worse than two, and with very small branches.

Based on our criteria, this design should be reasonably strong. So, let’s see how it actually does, then I’ll tell you how I built it—and how agentic coding made it possible to produce a high-quality implementation very quickly.

Performance Results

Let me show you two kinds of benchmarks: memory bandwidth utilization, and timing. These are designed to answer two key questions:

  1. How well does the algorithm utilize a GPU’s memory bandwidth, compared to what is possible?
  2. How well does the algorithm perform compared to other kinds of GPU-accelerated hash tables?

We’ll consider these under different hashing scenarios, all using a 1GB table on a 4090. We’ll fill our hash tables using uniform random numbers over the full 32-bit unsigned integer range, resolving collisions by inserting the most-recent element.2

For the first question, we’ll record how many buckets each sub-warp retrieved and updated over the kernel’s execution time. We’ll compare this to a CUDA memory copy API call. For the second question, we’ll run a timing experiment, and compare to cuCollections and WarpCore, two major GPU-based hash table implementations available online, neither of which existed in 2016. Since cuCollections is a core library maintained by Nvidia, I expect its performance to be reasonably well-optimized. We’ll repeat each experiment 16 times to assess variability, and plot medians along with 25th and 75th quantiles. Results can be seen below in Figure 1.

Performance Results
Figure 1. Performance results for GPU Robin Hood Hashing (GPURHH), compared to cuCollections with linear probing (LP) and double hashing (DH) collision-resolution strategies, as well as WarpCore. These are shown in terms of memory bandwidth utilization along with insert and get throughput.

The first plot of Figure 1 shows our hash table to be highly efficient, achieving somewhere around 90% utilization compared to the best realistic rate possible—that of a memory copy. The second and third plots show results in terms of insertion and retrieval throughput—in short, at all sufficiently-high load factors, retrieval is significantly faster using our implementation rather than with any baseline. The more-complex Robin Hood kernels result in slightly slower insertion compared to the strategies used by both libraries, namely linear probing and double hashing, but win significantly on retrieval, especially at higher load factors, thanks to needing to retrieve fewer slots. Variability between runs was found to be so small as to be nearly-invisible on the plots.

One could certainly test and document performance differences much more thoroughly, and explore additional design options such as Robin Hood together with double hashing, but I’ll stop here—after all, this is a blog post about agentic coding, not a research paper about concurrent hashing. Nonetheless, from this comparison alone, I am left feeling proud of my younger self for having figured this algorithm out in 2016—and disappointed for not finding a way, whatever the circumstances might have been, to actually tell the world about it.

Agentic Coding Setup and Workflows

Let’s now move to talking about agentic coding. So far, I’ve told you what I built, and how well it works in terms of performance. I will now tell you how I built it. But first—it is worth noting again—the initial process took two days, just two days. I don’t want to mislead you on this—the number represents time from the first prompt that produced codeiii to a correct algorithm with reasonable test coverage.iv After this, it took two more days to get a reasonable set of timing benchmarks—more on that later. None of these numbers represent full-time software work—at present, and for a little bit longer, I am employed by Cornell University, and all of the work here was done in spare time. An experienced kernel engineer would have likely been much faster, but at two days, there is not all that much time here left to save.

Let me now tell you about my workflow—which, for those that are big on orchestration and token-maxxing, might at first feel underwhelming—though I think there’s more than meets the eye. All of my agentic coding was done in VS Code, using the Claude Code extension, and in hindsight I believe that this is the right setup for this specific use case.v I think about this in terms of the equation

Ttotal=Twriting+Tunderstanding T_{\text{total}} = T_{\text{writing}} + T_{\text{understanding}}

where:

  • TtotalT_{\text{total}} is the total development time.
  • TwritingT_{\text{writing}} is the time spent writing code, whether by hand or using a coding agent.
  • TunderstandingT_{\text{understanding}} is the time spent understanding whether the code is doing what you wanted it to, potentially by reading the code, running the code and looking at its output, running tests and looking at their output, or any other method.

For low-level CUDA C++, my experience is that looking at the code can be very useful. Having a reasonably-idiomatic and well-organized codebase makes TwritingT_{\text{writing}} larger compared to a so-called slop-regime which neglects to do so—but, since agentic coding is still very fast, being careful is not actually slower by all that much. On the other hand, being careful reduces TunderstandingT_{\text{understanding}} significantly. This is the key reason why, as we will soon see, it took me just as long to benchmark the code properly as it did to get to a correct hash table in the first place—even though the latter is, in principle, much easier.

Ecosystem. The type of code and maturity of its ecosystem plays a significant role. My 2016 implementation is CUDA C++, but is written in a very C-like style, as opposed to more modern C++-like style that Claude Code produced. I significantly prefer the latter. For example, the 2026 version uses cooperative groups to perform communication cleanly at the tile level, whereas this API did not exist in 2016 so I instead computed everything by hand using warp-based primitives with bit masks to map back down to half-warps—equally valid, but much less clean and harder to debug. There are a huge amount of improvements like this, and I am extremely impressed with how mature the CUDA ecosystem has gotten since the early days.

Testing. The ability to do comprehensive testing also played a major role, and agentic coding made it much more pleasant for me to follow the extreme programming principle of implementing unit tests first—a principle I have historically often praised, but rarely followed. In addition, I found it very helpful to ask Claude to brainstorm as many distinct edge case tests as it could think of, and implement as many of them as made sense. Initially, Claude wanted to test insertion and retrieval jointly, but, when asked, was happy to write verified-correct memory layouts by hand in order to trigger certain kinds of behavior. I did something similar in 2016, and it took me quite some time to come up with a reasonably-comprehensive list of what might go wrong under concurrency—the coverage I got from Claude took almost no time to write, and was easily just as good.

Refactoring for ease of verification. A major part of my workflow involved asking Claude to refactor and better-organize the code, in order to make it easier for me to verify whether it was doing what I wanted or not. I would estimate that more than half of my prompts were used on re-writes. The idea—as floated previously—is that being able to verify correctness by human eye can provide a second way to check the code, beyond just running various tests. This may help catch different kinds of issues compared to ones that tests would immediately reveal.

Sessions and token use. These days, I code mainly in an agentic style, and tend to work with separate parallel sessions dedicated to various issues. In this project, however, I used one giant session. I initially did this in order to make it easier for people to read my session later on, as I was planning to include it in this blog post—but found it worked better than I anticipated. Large sessions allow the model to look back to prior decisions and learn based on how I responded to previous prompts, and I felt like the model was better able to guess what I actually wanted over time. Just as importantly, this project never actually used that many tokens—compaction triggered only a handful of times. Out of my weekly Claude Max budget—which I deliberately only used for this project to facilitate measurement, while doing other work using Codex and Gemini—I ended up using:

  • 7% to get an initial version that passed tests.
  • 12% to reach a reasonable test coverage including edge cases.
  • 19% to add baselines and benchmarking code.

This experience has been completely different from my AI-assisted math experiments, which have had a more modest degree of success,vi and have been heavily token-bottlenecked. Here, I have used far less tokens than what many other people claim to use, and I would have been happy to use more—but only if doing so would have actually saved me time. Many of the workflows I have seen are better suited to other kinds of projects, and for work like this will use up time rather than save it. If you have concrete ideas on how using more tokens might have actually helped achieve this project’s goals, I’d love to hear from you.

Time lost due to trusting the agent too much. The third part above took the most time, even though the code involved is much less difficult. The reason was that I didn’t want to think too hard about how to do benchmarking, since I considered this part straightforward. However, Claude picked a very unintuitive way to parameterize the experiments, and came up with a set of comparisons that were unfair, requiring the Robin Hood hash table to do more actual work than the baselines. I did not immediately realize this, and once I did, I had to ask in very specific terms to rework the benchmarks to ensure every method really was treated equally, and didn’t just appear to be on the surface. Going back to the equation, I gave in to the vibes too much at this stage, which made TwritingT_{\text{writing}} for the benchmarks quite small, but resulted in a much larger TunderstandingT_{\text{understanding}}. In the end, I used more tokens, but this only slowed things down. It would have been better to have asked Claude to write a Markdown spec for how the benchmarks should work, carefully verify it, and only then ask for an implementation.

Programmer intent. A significant pattern I ended up noticing myself using is to (a) ask the model what choices and tradeoffs are possible, in terms of factors such as performance and style, (b) brainstorm with it what is the right choice for the situation at hand, (c) ask it to implement the desired code, and (d) read the code, then ask it to make changes to implement it better. One way to think about programming is that it’s all about expressing what we actually want—to a computer. From this perspective, I see agentic coding as allowing me to use separate text for describing what I want via the prompt, and verifying that the system correctly understood me, using a combination of the code, tests, and other approaches.

Pareto tradeoffs between speed and quality. My experience demonstrates to me that Claude Code not only accelerates software development, it shifts the entire speed-quality Pareto frontier. If you want to write difficult high-quality code, you should expect to be able to both go faster, and produce better results, both at the same time. For me personally, I would even say that it feels like a shift along a three-way speed-quality-fun Pareto frontier, but that’s personal and depends on you. With this said, if someone says it’s all just slop cannons, then I’m sorry, but they’re completely wrong.

Documentation. For this project, I adopted the strategy of treating Markdown files the same as code. The model was asked to document all of its work, and I did not directly write or edit any of the documentation myself.3 To get the documentation to a suitable quality level, I relied on a combination of high-level and low-level suggestions via re-prompting. This approach was not time-optimal: for low-level edits, re-writing text would have been faster, but I stuck to re-prompting for experiment’s sake. Overall, the model did a reasonable job of ensuring all documentation was correct—but was not as good at ensuring that none of it was misleading, nor at choosing the right level of detail. Too little documentation and it is unclear what is going on, too much and it is also unclear what is going on—it takes too long to sort through irrelevant details. Ease of reading is important too, after a while Claude-speak tends to all sound the same.vii More on that next.

What can’t LLMs do well yet? For any new technology, one should always consider the bear case, and think about what isn’t working. And, here, I have a good candidate: agentic writing. My experience with using large language models for this has consistently been terrible. I have found it impossible to get models to write what I actually want to say, rather than something slightly different where the fine details are all wrong. As a result, I consider it much easier for me to write this blog post start-to-finish, than to prompt and re-prompt in order to create it. I hypothesize this is because general-purpose language is different from code, and is somehow much higher-dimensional—and, as humans, the easiest way to say what we want is to just outright say it.

How long before today’s state of affairs becomes out of date? As of this blog post, AI capabilities are rapidly getting stronger. I am quite confident almost no-one would have predicted this level of performance two years ago4—back then many people thought large language models had hit a ceiling due to hallucination issues which are now largely gone, at least in this domain. As such, by publishing this blog post now, I risk making claims that will be viewed as hopelessly out of date very soon—potentially as soon as a few months from now. I am not afraid of this. Even as models advance and techniques adapt, I think it is useful to have a reasonable snapshot of what is happening outside of frontier labs today. In the future, this may help provide a more accurate chart of where capabilities were, and what workflows worked best, at the given time. This can still be interesting to know even if, by the time you are reading this, the situation has changed.

Conclusion

In this blog post, I presented an implementation of a lockfree Robin Hood hash table designed for GPUs, and showed it to perform strongly, compared to both Nvidia-maintained and community baselines. I did not write a single line of code in order to build this, in spite of the fact that this is a device-side low-level CUDA C++ implementation—the code equivalent of fragile glass. The initial implementation took only two days—compared to a much longer time period when I implemented the same design ten years ago. A combination of agentic coding, and a significantly more modern CUDA ecosystem, made this possible. I hope you’ve found my reflections and ways of thinking to be useful or interesting—if you did, please feel free and encouraged to find me on social media, and let me know what you think!

Footnotes

  1. I’ve told this story before: in a footnote of my previous blog post, on the direction AI is headed. It is also worth noting that in my most recent blog post, which was dedicated to thinking about my future career path in light of rapid developments in AI, I promised to put together a side project or two before leaving academia—that post’s promise is what inspired the project and blog post I’m sharing with you now.

  2. Before each experiment, we compute and record how many unique elements are to be inserted into the table, to prevent any of the tables from behaving badly in the event they are loaded above capacity.

  3. There is only one exception: commit #c58d66, which adds a set of links, including to this blog post, to the readme. I wrote this by hand because I did not want Claude to know it was participating in an experiment to evaluate its AI-assisted development capabilities.

  4. I’ll even claim more: anyone who was predicting capabilities would advance this quickly, and who was not using either a statistical model to guide those predictions, or frontier-lab insider information, was reading tea leaves. They turned out to be correct by accident.