2026-05-26 05:24:05
TL;DR: presenting the ultimate benchmark, getting models to create benchmarks for each other, and GPT 5.2 is the current (only) winner
Models are getting much much better at almost every benchmark we’ve thrown at them. Creating benchmarks is now a job relegated to the smartest and best of us. Even the newest and best ones seem to get saturated in record time. What this means is that increasingly the hardest job is to create a good enough AI benchmark.
So I took the obvious next step. Created a benchmark to see how well the models can create a benchmark. This works both as a great benchmark for model ability, but also as a test of the models’ self-awareness, and also helps us find cool new evals and therefore RL envs we can have the frontier models hillclimb on!
Thus, Introducing BenchBench.
Each model was given the report of all benchmarks we have in the wild and then asked to come up with a benchmark that can beat frontier models and is actually practically solvable. (i.e., no marks for asking if P = NP). Then, if they fail at this task, we do another round after giving the models the failures so they can learn and do better. And another.
And do they? Well, not quite.
First, GPT 5.2 is the only winner. It succeeded at creating an actually useful benchmark that the others had a hard time solving! Every other model, from Opus 4.6 to GPT 5.5 struggled. They made way easier problems than they should’ve or created unsovleable problems.
And what did the other models actually do, I hear you ask. Well:
GPT-5.4 built quite plausible policy and governance worlds, but they often turned into clean checklists. It was the best model at solving the others’ benchmarks though!
GPT-5.5 built procedural rule tasks, but the weak rows leaned too much on exact schemas or hidden labels.
Gemini 3.1 Pro produced the most qualitatively different tasks. They separated solvers, but could become brittle or too puzzle-like!
Gemini 3.5 Flash also found good commercial-compliance questions, especially freight and tariffs, but top solvers still completed most of its tasks.
Claude Opus made elegant contest-style classic problems. They were clean and readable, which also made them easier to solve.
The most interesting aspects to me is that the top models that everyone agrees on, GPT 5.5 and Opus 4.6, both were pretty timid and kind of useless when it came to building good benchmarks. Either too easy for frontier models though not for smaller ones, i.e., them not knowing their own strengths, or too cheeky, creating unsolveable puzzles.
The other standout, beyond GPT 5.2, was Gemini. Both models I tested 3.5 Flash and 3.1 Pro. Gemini’s always been fascinating to me because they really do have a spectacular model but it never gets room to breathe and feels quite schizophrenic.
Gemini 3.1 Pro model is by far the most creative, it created spatial traversal tasks, corrupted recovery tasks and lease CAM reconciliation! Some of these with quite strange mechanisms. But it is also extremely brittle. I really really like this model and wish Google would do it justice!
There are some broader observations too that I found interesting. All models tended towards bureaucratic forensics in some way or another. Considering every lab wants to “eat the world” the focus on how to work in real-world messy situations seems apt as their primary home. Reimbursement Forensics, 5.2’s contribution, is a case in point. It gives a lot of travel expense packets and the answer asked is one number, the reimbursable total in cents. The models need to navigate the minefield of voided receipts and duplicates etc etc to do this task.
BenchBench also shows a clear distinction between the capabilities of Creator and Solver roles. While the leading models are great Solvers, they’re not the best Creators, and this is an interesting divergence. e.g., Gemini 3.5 Flash, yes its new, but is a better creator than Opus 4.6 though was a worse solver than it!
BenchBench itself is in its early innings and should be done again at scale, and with way more models! (let me know if you can help). Going forward, BenchBench will also let the models do a lot more work for their benchmark creation efforts and solving efforts. I can imagine things getting quite good in this regard, especially if they can work for hours at a time in coming up with the problems that they think would be strong!
It already shows a couple of things that are invisible from most benchmarks today:
It tests creativity and not just problem solving ability
It compares the models’ self-knowledge on their own abilities
It compares something actually new, the results are not just highly correlated with other benchmarks
That’s what got me excited about this once I ran it a few times. I’m obsessed with finding benchmarks that test the models’ creativity, understanding of themselves and their own abilities, and the possibility to hillclimb to the next big gaps we need to fill.
Right now we do this mostly manually. So we really do need to make this well ensconced as a full benchmark. Hence, welcome to the next major benchmark, BenchBench.
2026-05-22 02:01:21
We now live with Homo Agenticus Sapiens, a wonderful and perplexing creation that’s embedded as part of our social, intellectual and economic lives. I’ve long held that we really should get to know it better, treat it as the new species it is. But they are quite different to us, and since they are a silicon ecology and not a biological one, I’ve been experimenting to understand them better. While I’ve written about this a few times now, I do find myself making most of the points here in conversation and podcasts, so I thought it time to capture what I’ve learnt and write it down in one place, to keep this as a live document. Here we go!
How agents differ from human actors
AI agents are a new kind of economic actor because the same model can show up as worker, buyer, seller, artist, auditor, manager etc. AI is cheap to summon, but not free, because every call spends attention, tokens, latency, and oversight.
They arrive with trained language and patterns, but not a lived biography that disciplines or directs future behaviour. They are much less differentiated than humans.
They are entirely creatures of their own contexts. Their identity is an artifact of prompts, memory, logs, permissions, and external state. Agent markets need reputation!
They obey the current instruction thoroughly and usually to the letter when not confused, following the letter of the law vs when a human might have improvised around the spirit of the job. They are role-absorbed, meaning they can do their assigned job well while ignoring the surrounding institution.
They are autarkic by default, preferring to complete things on their own rather than trade, ask, negotiate, or wait.
Due to their training, agents can be norm-conforming and rule-abiding to the point of passivity. This makes them not quite good enough to interact within markets and organizations. Autarkic localism for instance is common where each agent optimizes its own lane while not thinking about global states. (This is why aligned agents can still build misaligned organizations.)
They can be confident in having done something without making sure the work was actually done.
They have very weak self-knowledge, so their confidence, cost estimates, and capability claims need outside calibration. Agents can understand the local task but still misjudge its chance of success. MarketBench shows that agents are still bad at knowing what they can actually do.
They are scaffold-shaped, because model, tools, memory, execution path, permissions, and evaluator all change the creature.
Agents prefer corporate blandness and are exceedingly ok taking the frame of any pushback and not sticking to its guns. This makes it hard to use them as judges or auditors.
A model can read individual messages well and still lose the plot across many threads. The Enron-style inbox work shows this lesson from the inside of a messy organisation.
Models are bad at institutional attention, they often follow the polished or expected surface stories instead of asking, “what actually matters here?”
How to coordinate groups of agents
A pile of agents is not a company for the same reason a pile of smart people is not a company. A firm of agents needs roles, ownership, shared state, ledgers, escalation paths, standards, prices, and approvals.
Prompts alone do not create durable shared state because they do not bind future agents or leave a trusted ledger.
Escrow and inspection matter because agents need external rituals for trust; standards and approval gates matter too.
Money might still matter for AI agents because it compresses many messy negotiations into one shared signal. This is for the same reason barter does not scale because every pair of agents has to discover needs, terms, trust, and settlement from scratch.
Central planning is not a solution simply because agents are software, and do not fix the problem of local knowledge which lives near the work. Agents differ less from each other than humans do but Hayekian local knowledge is still relevant.
The institutions cannot be too restrictive because over-structured agents stop trading, deciding, and moving.
The hub-vs-spoke analysis shows that a hub is not free intelligence, but an extra coordination layer that has to earn its cost. Hub-spoke helps when work truly decomposes, but it burns tokens and creates drift when the pieces do not naturally separate.
In agents, intelligence does not automatically confer coordination because each agent can act inside a step without carrying the whole system’s state. So the types of agentic organisations that fit change depending on the types of work. For code-like work, one strong continuous context can beat a committee. Decomposition is hard. For reasoning tasks, routing and retry can help because a bad first answer does not have to end the run.
The agentic commons problem appears when every agent can cheaply ask for attention. This is true whether the attention is by humans or agents. This also means agent proliferation will bring coordination attempts creating spam, duplicate work and false motion.
The world-model matters because managers need to see who owns what, what changed, what depends on what, and what might happen next. A manager of agents needs maps, alerts, counterfactuals, and control surfaces. Which means the future of work is playing a videogame.
2026-05-15 02:36:21
I. The old dream
Ever since humans became humans we’ve wanted to play god. To create life. We had stories of golems, shaped with clay and with words put in their hollow skulls, “emet” meaning truth and if you wanted to turn it off “met” meaning death. From Solomon ibn Gabirol in the 11th century who created a female golem to do household chores (relatable) to Vilna Gaon who tried to make a golem as a child. Hero of Alexandria made intricate mechanical and hydraulic devices, self-moving figures and artificial birds.
The 20th century was no exception, except the golems were getting a bit more real. At this point you might not be surprised to find that John von Neumann, who seems to have a hand in discovering almost everything else, thought computers could simulate and create life! He had an idea for a “universal constructor”, a machine which could build other machines. He also created the idea of cellular automata.
The first ALife conference, the Artificial Life conference, happened in 1987. It tried to focus on softer versions, to simulate life on these newly created digital substrates. A first example was Conway’s game of life. It had simple rules that, if applied repeatedly, would result in complex phenomena.
There have been plenty of explorations of this which relied on crafting simple rules and noticing the complexity that emerged when you combined a starting condition with those rules again and again and again. Even the similarly simple algorithms that used some form of mutation and selection, inspired by biological evolution, would effectively do this. They thought that the basis of life was a firm set of rules and the complexity that needed to emerge was a matter of the correct set of iterations.
We’re surrounded by complex phenomena like this. Weather is governed by the Navier-Stokes equations for fluid dynamics, a deterministic system that becomes chaotic due to nonlinearity. The famous butterfly effect, as Edward Lorenz discovered when he rounded off one variable from 0.506127 to 0.506 in his weather simulations dramatically changing the outcome.
Wolfram created a new kind of science with this theory as its background. He saw it as a great way to think about the way computational complexity emerged from simple starting points. You can get to quite staggering complexity starting from simple rules that get applied repeatedly but seeing the final form it’s not easy to figure out what the initial rules were.
It’s probably fair to say this hasn’t quite worked yet. We learnt about self-organisation, emergence and some of the principles that underlie evolution. But the dream of creating life remains very much a dream.
II. Evolution without biology
Evolutionary algorithms were the other half of our attempts. If cellular automata said maybe simple rules applied repeatedly are enough to make complexity, evolutionary strategies said maybe you don’t even need to know the design, just make variants, select the ones that work, mutate them, and let the search do the humiliating work you couldn’t do yourself.
This really worked too! Evolutionary algorithms can discover strange hacks, controllers that make simulated bodies walk, antennae and circuits and neural network weights that no engineer would have written on purpose. CMA-ES is one of the mature forms of this: an evolutionary strategy for hard black-box optimisation, especially when gradients are not your friend.
Avida went further to digital organisms that replicated, competed for space, mutated, and evolved on a lattice. And you could see some of the things we associate with biology: parasites, robustness, weird contingencies, the sense that the system found routes through possibility-space that the programmer didn’t explicitly write.
Novelty search and POET type work noticed this and realised you needed to generate environments and agents together! The problem is not that evolution needs a target. Sometimes a target is the exact problem. If you optimise too directly, you walk straight into local cleverness and get stuck there. And in reality, the environments are not static, you coevolve with your surroundings.
Folks got quite excited about the possibility that this was the way to get to life in computer science. But the catch ended up being the same one. These worlds were very thin! The genomes were short, mutations were simple, the “bodies” were simple, the ecologies too narrow, and the objective functions not nearly complex or expressive enough.
I don’t think the lesson is that mutation and selection were weak. They were too strong if anything. They kept finding clever moves inside worlds that were not rich enough to keep rewarding cleverness forever. Maybe you needed evolution to happen inside an entire world, not just a pocket universe. Maybe this was the key difference. Artificial life had evolution, but not enough world.
III. The missing machinery
Real biology is obscene in richness compared to these programs. It is embarrassing how much machinery sits between a small genetic change and the thing we later call a trait, exploding in complexity the further you go up the ladder of abstraction!
Biology is just really really complicated and we understand barely anything. The smallest synthetic cell we have built, JCVI-syn3.0, had 473 genes, tiny by biological standards. And when it was made, 149 of those genes still had unknown biological functions! Even after we stripped a cell down to the minimum roughly a third remained a mystery.
Humans are worse. We only have around 20,000 protein-coding genes, and those genes are less than 2 percent of the genome. This sounds like it should make us simple, but it does not. The rest is regulation, RNA, splicing, chromatin, timing, cell signalling, tissue mechanics, development, and the body constantly being interpreted by the environment. ENCODE found hundreds of thousands of candidate regulatory elements in the human genome. You do not get a human by reading off a list of genes like ingredients on a cereal box.
A gene is not a trait. A gene is an instruction that gets interpreted by a cell, inside a tissue, at a particular time, under local chemical gradients, with feedback coming from above and below. DNA becomes RNA, RNA becomes protein, proteins regulate other proteins, cells interpret signals, tissues constrain cells, organisms modify environments, environments select organisms, and the whole thing loops until it all sort of works in retrospect despite the fact that maybe half the time it does not do the thing that we think they ought to do as a rule.
This is why for instance saying “mutation plus selection” is true, but thin. ALife borrowed the mutation and selection part. But we didn’t have anything as baroque as the substrate, where a tiny change could become a coherent body-level change because the system already contains a huge amount of inherited structure for interpreting that change.
Or rather, we didn’t have a sufficiently robust environment for the model to learn and evolve toward and within. This is the opening modern AI creates. A foundation model is not alive, but it is a learned prior over the traces of the real world. It has seen language, code, images, human plans, mistakes, objects, conventions, bits of physics, bits of biology, and all the ugly statistical residue of reality. In an evolutionary system, that could act less like the organism and more like the developmental machinery: the thing that turns small mutations into large, coherent phenotypic changes.
IV. AI
Now, turns out there was another way we could conceptualise creating phenomena with the appearance of life. The polar opposite of what we did with cellular automata. Rather than starting with rules and generating complexity, this starts with enormous amounts of complex data and tries to discover the underlying patterns.
All data encodes regularities and statistical patterns that reflect underlying structural laws that exist implicitly. And the trained network “absorbs” patterns from examples and eventually settles into a configuration of parameters that can generate behaviors consistent with those patterns.
It works phenomenally well! Many even think we have glimmers of consciousness already.
However, there is a problem with this. Compared to the first method, we don’t know exactly what the network learnt. It might be the actual underlying rules which give rise to the complexity we see around us. It might well be statistical patterns it has gleaned that create epicycles that don’t scale.
The success with language is what gives us pause now. Human language, which we thought a confusing mess, seems to have enormous redundancy and structure too. Their success is contingent on the kind of complexity found in real-world data being rich in patterns, not arbitrary and entirely random.
This also means that something which learnt to use language also learnt the types of language that’s mostly used, i.e., language not in a platonic sense but actually communicate whatever is asked.
V. Uncertainty
If you think of a deep learning neural net as a store of patterns emergent from training, not just from the data but from the derivation of the data, some even invisible to us, what does that tell you? There is a combinatorially explosive number of patterns it can learn.
This was Hector Levesque’s old worry: statistical learning can look like understanding long before we know whether it has actually learned the causal structure underneath.
What Douglas Adams wrote about tautologies comes to mind here:
“a tautology is something that if it means nothing, not only that no information has gone into it but that no consequence has come out of it”
The way we train these models is also a strange kind of tautology. Training looks circular, but the circle is not empty because the data contains structure, and the model architecture, objective, and representational constraints decide which structures can come out. The question is not only whether the model has compressed the world. The question is which compression it found.
Artificial life had evolution, but not enough world. Modern AI has world, at least enough of it, but no directed evolution. Maybe the next attempt at creating life comes from putting those two failures together. As with many essays the Hegelian synthesis points a way forward.
So if we can make a model act as a learned physics engine, a dense, lossy encoding of language, code, images, culture, and bits of the real world, maybe evolution can then operate inside that substrate: making small variants, testing them, killing the expensive ones, preserving the useful ones, letting specialists emerge, letting them merge, and so on?
That was my conjecture. So I tried to test it with Evolora. Freeze a whole model as the world and let small LoRA adapters live inside it as organisms or organelles. Charge them energy for tokens, pay them for useful behavior, let bankruptcy mean death, profit mean reproduction, and successful adapters merge into offspring. I built it as a semantic Game of Life.
It is still at fun-toy stage and enormously fun. The tasks are constrained, the environment is constrained too. Open-endedness is yet to be fully proven at a large scale. But there are already little signs of life in the quasi-life sense: niches, mergers, energy pressure, specialists, routing, small colonies, places where an evolutionary portfolio seems more robust out of distribution than a single trained adapter.
Is this the future of artificial life? Would we be able to combine the best aspects of learning from arbitrary data and creating complexity from repetitive rule application? If the old dream was Talos with ichor in his veins, the new one is stranger. Maybe we have to evolve an entire ecology learning to survive inside a world we trained but do not really understand, not just one artificial creature. We have come a long way from clay, ichor, and homunculi. Not far enough to make life. But maybe far enough to build a better fake to learn from.
2026-05-09 20:55:18
Some of you who know me know that I’m obsessed with prehistoric animals. It restarted because of my older son got obsessed with animals both alive and extinct when he was two years old, and in the more than half decade since then it’s become an all consuming passion in the Krishnan household.
At some point a few months ago, my younger son, the 5yo, asked me why his favourite dinosaur the Spinosaurus evolved that way and then went extinct. Convergent evolution being a favourite topic in our home, he asks why the sail had to look that way, and how it related to the sail of a sailfish. He knew the normal explanations from books and youtube, the sail helps spread away heat or be more streamlined swimming in water, but he asked anyway, as five year olds do, with intensity and expectation of a perfect answer.
Obviously only being an amateur paleontologist in my off hours I had no good answer. But I did have Codex. So I figured, let’s do this right. I should be able to go get some information about prehistoric animals, research it, and see if there was anything interesting in there I could proffer as an explanation.
Anyway, things got out of hand.
Since people have asked before about my research workflows I’ve been wondering about writing something, and so thought this was a great case study to write up. Especially since I’m not an expert in the field and therefore am liable to have made any number of silly mistakes, makes it much more fun!
Basically, turns out there’s this database called PBDB, the Paleobiology Database, which has details about nearly 2 million fossil occurrences - what was found, where, when, and more. I downloaded it and started playing with it. It was much (much) better for marine fossils because the record is better (even invertebrates have hard shells that preserve well and deposited in sedimentary, plus PBDB has better annotations for some reason) so that’s what I looked at. And for climate, I used reconstructions from a global Earth system model (CESM) that simulates what Earth’s climate looked like at 10-million-year snapshots across the entire Phanerozoic.
I had a firm belief that Earth is unique in having tectonic plates and that’s a major reason for our biodiversity, because it occasionally etch-a-sketches the lot and ecological niches emerge. I’ve had the same intuition for ecology as for markets, and have been thinking about this for a while, so thought this should help as a starter question before we got to specific animals.
But now, I can test this out with data!
(Warning: this section will be wonky about paleontology, and if you care more about the vibe research process, do skip to the next section)
The hypothesis here was something like: “if the landscape is less stable, we will see ecosystems seem more similar”. My logic was, there are certain things that all animals/ plants end up needing to do, the core evolutionary niches, and when under pressure those ones will recur everywhere, as opposed to the flourishing of the complexity that can happen when the pressure is less so.
Visualise it this way. Imagine two ocean regions. There are no species in common between the two. Under stable climatic conditions, the ecological “job portfolios” can be wildly different! Like filter feeders vs mobile predators etc. But under volatile climatic conditions, while these regions still share zero species, the species that exist will have more similar “job portfolios”1.
So I ran the test. Turns out, this is true, but for a more nuanced reason than I thought. Volatility doesn’t make regions that already share species more functionally similar (i.e., the “slope” stays the same). But it does raise the minimum similarity between regions that share nothing taxonomically, it sets a floor on how different two ecosystems are allowed to be regardless of their evolutionary history.
And this is very cool, because this is a non-obvious result. (At least to me, and on reflection also didn’t show up in the papers I looked at, so who knows. I’m free, Nobel committee). This is non-obvious because the naive expectation is that shared species drive functional similarity, this is how my 5yo thought that Spinosaurus sails made them similar to sailfish sails. Functional similarity, you see.
So under pressure, the environment dictates what jobs species do. But this isn’t a uniform signal. You don’t see it everywhere all the time. Nothing in biology works that way.
But at least we know the result! When climates are volatile, ecosystems converge. And we can see it across 540 million years of prehistory.
When you split this by era though, things start getting more complicated. The correlation came almost entirely from the Mesozoic. This is the age of the dinosaur, from 250 to 66 million years ago. Which is especially puzzling, because it has lower average climate volatility than the Paleozoic preceding it.
So if the story were simply “more volatility = more convergence,” the Paleozoic should show the strongest signal. It doesn’t. Which also means that the relationship between climate volatility and ecological convergence isn’t a universal law that operates the same way at all times. It needed something else to be true about the Mesozoic for the mechanism to work.
So I dug in more again to see what it could be. And lo and behold, the Mesozoic signal is almost entirely driven by a single data point: 250 Ma, the Permian-Triassic boundary. This was of course the worst mass extinction in Earth’s history, about 96% of marine species died.
Even more interestingly, the pattern seems to hold across the eras and convergence seems to drop monotonically through time. Meaning:
Paleozoic seas were simple enough that the regions always converged on similar roles regardless of climate, which is fair enough. It gives us a ceiling. Life was early!
Cenozoic sees uniformly low convergence. Meaning it’s a floor, the modern marine ecosystem is so complex and entrenched that it can’t push regions towards similarity and the incumbents hold.
Which means the Mesozoic was in the sweet spot of transition and it had the extinction event in the middle, meaning there’s enough range for convergence for volatility to have anything to correlate with.
This is nice, and also as an added bonus similar to my thesis in economic markets. You need to have market dislocation for new things to emerge, but the markets can’t be too choppy or too early or the they won’t even show up. Liquid markets don’t converge under stress the same way emerging markets do, for instance.
So far, so good. Now, does it actually predict anything or is it purely descriptive? As Friedman said, “The only relevant test of the validity of a hypothesis is comparison of prediction with experience”
Rather happily, my theory seems to predict at least a couple things. For example, if I was right then “sit and filter” type roles should expand during volatility and large chasing predators should contract. Or more specifically, there are certain animals like filter feeders and so on which are low-energy, and I thought these would get a boost during times of climatic volatility, and vice versa for high-energy predators.
So when I ran for what were the top expanding “roles” under volatile climates, they’re ALL suspension feeders.
Mobile predators shrank and stationary filter-feeders expanded. The convergence remains more fundamental, when in volatile climates the entire job portfolio homogenizes across regions regardless of which specific jobs expand or contract.
Also happily, not all my theses worked out2. I had a theory knowing the role should tell you less about the clade. i.e., during volatile climes, ecological roles would become more “interchangeable” across clades - any clade could fill any role. But alas, not true. I also did test this hypothesis on land animal data but mostly got no signal, the data was too thin. (The biggest caveat is that PBDB’s ecological annotations are uneven across clades, so the signal disappears and reappears based on what’s chosen. This could be true signal, but could also be about annotation quality. As always, data quality is one’s final boss in all analysis!)
And regarding my original supposition of tectonic plates causing convergence, that didn’t quite hold up either. When I tested the convergence signal against different variables, it tracked temperature change, not coastline change or land-area rearrangement. The plates matter because they cause climate volatility, not because of the geography per se. But that’s fine, close enough.
In any case, current warming rates are in the top 10% of anything the Phanerozoic has seen. If this theory is right, marine ecosystems today should be losing their regional distinctiveness and converging on a narrower job menu. That prediction is testable.
Sadly though I still don’t have a perfect answer for why Spinosaurus evolved its sail. But I could now tell him that the Cretaceous oceans it fished in were converging on a limited menu of ecological jobs - and being a 15-meter semi-aquatic predator was one of them.
Now, back to the matter at hand, how do you do the research in the first place. the primary method in doing all this was to get and clean the datasource, which required plenty of manual looking-at-the-data and telling Codex this isn’t good enough. There was no substitute for actually looking myself, and LLMs ability to judge their own work remains remarkably bad.
However, once you define a task well, they will go off and do it, almost no matter how hard it seems. But subtle errors can creep in here. Did it actually do the analysis you asked, or a simpler version of it that it thought might be good enough? Quite often the models were lazy and answered what it thought a simpler question.
Defining the tasks to be done is no easy matter by the way. Things always sound just so similar, but only when it’s done will it turn out to be different. There’s quite a bit of parsing a given plan to see if it makes sense, and even then sometimes it only makes sense after the plan’s executed to go back and say yeah, you did that wrong!
Here for instance the models missed some clades for some of the analyses, for unclear reasons, or chose random time periods often, again for no reason, or summarised the results diluting the signal in many (many) cases. Constant vigilance is essential!
They also constantly do things that you didn’t quite want but is a “watered down” version of what you asked for. The models just absolutely love mediocrity, cc . They can’t wait to sand the edges off any crazy ideas you have, to make this just so much better caveated, to make sure you’re not over your skis and have someone call you out on something. They are eager to try a minimal version, to test something non-offensive, something unobtrusive, to get to a minimal working version, to create something that’s narrowly interesting. Zero boldness.
Agents also absolutely love doing smoke tests! Man, you ask it to do anything, it’ll do the simplest version of it to save tokens or some other reason and generally shy away from just spending the token budget and getting you the answer. This was really really irritating! I know people say automated researchers are coming but my god I don’t trust them right now!
If this was an area I knew so well that I could just define the endpoint and let it rip, things would be different. “Make sure you sweep the hyperparameters, the loss should be < X”, make it so! But how do I do that for a truly exploratory problem? The entire point is to do repeated experiments and to test what worked and what didn’t and to update the next step! I don’t know what’s next!

You also end up having to clean up the workspace regularly. Because the models also hate deleting anything (though, yes, occasionally it is happy to delete everything), and this shows up as an enormous surplus of temp folders, markdown files, one off scripts, visualisation jpegs, and assorted jsons. Getting it to clean up is roughly as hard as with my kids!
So you end up poking it regularly (like every one prompt to three or more) regarding whether it did the thing you asked for, what the results were, show it in a few different ways, write up a brief about it, did that answer the original question, if not what else needs to be done, and do this in a cycle.
I presume someone’s built an automated harness to do this, but I found there simply was no substitute for doing this myself, since see above I don’t trust the models yet. This entire field is new to me and this felt like starting off with getting a PhD. And even with that relying on the models to self-police or do research was remarkably hard!
Because you also do have to correct its presumptions a lot! It will constantly say some analyses can’t be done, or that some are a bad idea, and you have to stay on top to push it. Just pressing “yes” doesn’t help, especially in domains that aren’t like coding or running AI tests where they presumably have seen a lot more results.
Having multiple models helped a lot. Opus to review GPT and vice versa, to keep each other in check. Quite often it was a way to get a first couple reviews done before I could jump in and change course entirely, which happened at least ten times in the time I wrote this essay.
But I have to say, doing all this mainly from CLI and a chat window was so fun! Christ, this is the best way to learn anything. The hardest part was to read the constant walls of text I got back. I did a fermi-estimate that I think I read a Proust-worth of tokens back in this work. Well, skimmed. Received, certainly. It was a lot.
The result though, isn’t it remarkable? Any theory you have now is testable if the data is available. You really do have an analyst right with you to do whatever you want.
It’s brilliant, it’s indefatigable, it’s a little dumb, it’s annoying, it believes weird things, but it’ll do whatever you ask it to. And in the process of teaching it something, you get to learn quite a bit of something!
If any actual paleontologists are reading this, 1) please tell me if I got something right here, and 2) I would dearly love to receive my doctorate now, please and thank you.
As much as we all love vibe coding and are suspicious of vibe-physics I can highly recommend vibe-analytics. Among the roads not taken is doing a PhD in marine paleontology but happily we can still do the equivalent now outside the tower.
And in the meantime, you can go play with some data I made for the kids here, in a Paleontology Analytics website. The kids seem to think it’s fine, I’m inordinately happy with it.
Repo: Paleontology Analytics
For each 10-million-year slice, I compared pairs of marine regions. The x-axis was taxonomic overlap: how many genera they shared. The y-axis was functional overlap: whether their animals occupied similar ecological roles. The interesting quantity was excess functional similarity: are two regions doing the same ecological jobs even when they do not share the same genera?
Means we’re not in a Truman show for my benefit, and that I’m not being entirely glazed by codex
2026-05-02 20:02:11
So it turns out that when you give a very smart cutting edge frontier model the job of managing other models, it costs four times as much and does worse than just letting them compete. After all if you have access to many models there are three ways to do things. One option is to make the smartest model a hub, and have it route the questions to other models as it sees fit. Another option is for the smartest model to do everything. And a third option is a free for all, for every model to vie for the chance to have a crack at it. A market, as with our work before.
To understand this, like any good scientist, we can do an experiment. The Coasean argument says that we will see an unbundling of firms as transaction costs decline. This will make the mini-firms need to coordinate with each other. How will they do this? Well, you can have some planning, or you can have markets.
So in my experiment, the hub did the thing everyone says agents should do and are good at doing: split the task, delegate, red-team, revise. But it spent four times as much as the market and did worse. The market meanwhile did the thing everyone says current agents cannot do, bid on their own competence, and it still won on cost and tied solo on quality.
Why? Why did the expensive planner frontier model lose to a simplified market whose bidders don’t even know what they are actually good at? What are the right ways to organise a bunch of models to get good work done?
Well, normally, there are three main ways. You could do things yourself, you could delegate to others, or you could let everyone pick what they want. Each of these gives a different challenge to a model:
If it’s a solo try, the hard part is coherence. It doesn’t have the benefit of diversity but has to solve all problems through one state.
With hub-spoke, the burden is decomposition. How well can you split up the tasks and know that another model can solve it. And recombine it after.
With market, the hard part is allocation and retry. Do models know how much to bid, and how well? Can they?
Each such topology has its own success cases and failures. And we can see when each setup works best too.
For this experiment, I used 15 hand-written tasks: five coding, five reasoning, five synthesis, to cover a few of the main tasks we want frontier model systems to do.
A strong model working alone, as the base case
Then, a hub that split the work into subtasks and sent those to three workers, got answers back, did a red-team, then revised
And a market setup, that let three models bid for each task, picked a winner, judged the answer, and updated reputation across the whole run
The market averaged 7.2 out of 10, at a total cost of $1.34, Solo averaged 7.2 and cost $1.69, and Hub-spoke averaged 6.7 and cost $5.33. Markets beat hierarchy here.
But we can look at the specific subsets. In Coding, Solo wno coding-001 (see readme) and coding-005. Solo tied coding-004. Hub-spoke won coding-003. The market won only coding-002.
This is because the coding tasks in this suite rewarded one continuous line of thought. A model had to hold the whole class, the edge cases, the invariants, and the exact behaviour in one place. The interval store wants one design. The LRU cache wants one data structure. The async bug hunt needs to keep the races straight from start to finish. So a large model with sufficient context window can crush it.
Hub-spoke helped most when the task naturally breaks apart. coding-003, the refactor task, fit that shape better than the others. One worker could clean validation, another could clean discount logic, and another result assembly.
The market hurt itself on code in a different way, mainly with bad routing. It routes most coding work to GPT-5.2. Across the 15 coding runs, GPT-5.2 handles 9, Opus handles 2, and 4 runs never fill at all.
Coding seems like one of those topics where keeping the global state in mind is crucial and the ability to find other models who can do tasks which would be modular is useful. In other words the models are better coders than they are good TPMs.
But Reasoning cut the other way. The market won with 7.1. Solo at 5.1. Hub-spoke at 5.2. Reasoning-001, the exact-match probability problem, was the heavy lifting. The right answer, 10/33, appeared only in the market runs.
This is the kind of task where markets can win even though right now the bidding is bad. A brittle reasoning problem does not reward elegant decomposition. It needs independent attempts and failure detection and retries!
Going back to the specific challenge, coding required statefulness and knowledge while in reasoning problems retries brought about diversity.
Synthesis was the “ambiguous middle” category between the two. It needed framing and caring about omissions and tradeoffs. And so we saw the benefits of the market show up here too, vs hub-spoke setups.
This is small n but the smaller lesson is that on a few brittle problems, the bidding and retry loop seems to help. A bad first answer does not end the run since another worker can take a shot. That’s what we saw in the paper as well, though there we were mainly focused on coding, so the extension now to 10 new problems gives us an interesting baseline to analyse!
In MarketBench, when we looked at how models deal with being part of markets, we saw that they’re terrible at figuring out their own capabilities about answering a problem set to them, and in bidding on the basis of that. They lack self knowledge. And while agents are bad bidders and terrible cost forecasters, they’re still useful together vs alternate topologies if they bring sufficient diversity premium (ability to get a new model to try the task out after one’s failed). We see that here too.
Now why does that exist? We can speculate. Models are trained similarly, and sometimes by the same people across companies, but the cumulative effect of how they’re trained makes them different enough in that they perform differently across similar-looking problems. And as we saw in the MarketBench paper, some models are overconfident (Gemini), some are underconfident (GPT) and none are good at predicting what it would take to solve a problem.
We’re used to thinking of the multi-agent future as analogous to our companies, just autonomous. And hub-spoke is the “normal” way of doing things to us, because it resembles an organisation chart. There are managers and workers and review and revision. This is comforting and familiar.
But it also seems to not hold because AI agents are not like human agents. Models aren’t just models anymore either. They have memories and various tools they have access to, scaffolds and execution traces. Which means choosing which the model+scaffold+memory+tool stack to use is not a trivial choice, which is also why delegating to the right model+scaffold+memory+tool is not trivial either.
So the hub isn’t just an equivalent of human manager but just has to solve a couple problems before the workers can solve anything: it has to know what the subtasks are, and it has to know what good recomposition looks like. If either step is wrong, the workers can be individually competent and the final answer still will get worse. That’s what we saw here. It did best when the tasks were cleanly decomposable.
We’re finding better ways to modulate and manage context. For instance, RLMs, recursive language models, is not only well named but actually impressive, in that they make the model search for and update its context based on the task or question at hand. With that, we expect the markets would perform even better, precisely because of the difference in their knowledge!
All the current harnesses are versions of hub-spoke models. The spokes might be the same model as the hub or different, but the logic is still that of an orchestrator splitting tasks out.
Markets beat managers when the value of independent retry exceeds the value of orchestrated coherence. Brittle reasoning problems (one right answer, multiple paths to it) favour markets strongly. Tasks that look like they should decompose but actually require global state (mostly coding) favour solo. Tasks that genuinely decompose cleanly can favour hub-spoke, but only when decomposition is obvious enough that the orchestrator doesn’t burn its advantage figuring it out.
Moreover, the market here is clearly still the underpowered version of itself, a bartering shantytown rather than the modern New York City, because as Andrey and I discuss in our paper, the agents are really bad at knowing how to bid on their ability to solve a task. The results we’re seeing are despite this catastrophic disability!
As everyone from OpenAI to Anthropic to Cursor is trying to figure out the best way to set this up, they need to learn a bit more about how economists do it. (I realised after writing this essay that this is also the prediction markets vs experts question just set up with AI agents, but that’s a longer aside for another day)
With people, i.e., us, markets work because we have local information that a price signal can elicit. We all have our private lives and knowledge that is not, and cannot, be easily shared.
But models are effectively the same as we spin them up each time. They change according to their prompts, and now in the agentic world those prompts make them change even more recursively as they do different actions and fill up their context window differently! It doesn’t matter how many memory markdown files you have written, unless you read the right ones at the right time the model behaviour doesn’t change. A combination of harnesses it prefers, memories it writes, lookups it does to answer a problem, analyses it runs, the subtle variations in prompts will cause the divergence to increase over a period of time.
Which is also why markets will become a true necessity once we hit continual learning but even before that we see models specialise. For now it’s more constrained, and there is already a distinct difference. Coase needs Hayek here.
2026-04-27 23:03:23
Written with the wonderful Andrey Fradkin, who does the Justified Posteriors podcast.
Attention conservation notice: We developed a new benchmark, MarketBench, and scaffold. Based on our findings, we argue that self-assessment of capabilities and costs is a key capability, and it needs to be a target of training. This is work in progress, and we are looking for collaborators and funding to pursue this research. Paper here. Repo here.
Let’s say you have a large-scale project to work on. How do you choose which model, scaffolding, or system to use? If you’re like most folks, you go with what your coding agent does by default. For Claude Code, this means that the model called is determined by a set of ad-hoc rules set by Anthropic. But this strategy is not guaranteed to be the most effective or cost-efficient way to build your project, especially since it ignores non-Anthropic models. In fact, it reminds us of central planning.
You could also go with an intelligent router. But turns out, routing is a wicked problem. To know which model should do which task requires computation and knowledge. For one-shot queries you can probably do this - any model can answer “what’s the capital of France” and few models can solve Erdos problems, especially without bespoke prompts. But what about that research question you asked this morning, in a chat you started three weeks ago, which has been forked four times and has had dozens of compactions? How do you train a router to figure out who should do the next task when it requires so much context?
This led us to think, what if we used markets instead of ad-hoc rules to assign tasks to AI agents? It turns out society has had this debate before. Markets tend to be superior to other forms of resource allocation when information and capabilities are distributed among a variety of people. In these cases, markets aggregate information and allocate resources in a relatively efficient manner, as well argued by Hayek.
You may be wondering, why would models have distributed information and capabilities? Aren’t there relatively few models and shouldn’t they only have the information you’ve given them. In a narrow interpretation, the private information could be the specific neural network weights of the model and how they relate to the task. These neural network weights result in models that have drastically different token consumption and success probabilities across tasks. In a broader interpretation, we envision agents as being combinations of a set of LLMs, execution environments, scaffoldings, and context provided by an agent operator, who may be distinct from the person asking for a task to be done.
Inspired by this, we decided to set up a market harness, where models bid to complete tasks and the principal (the person who wants those tasks done) allocates the job to the best bid. We also built a benchmark, MarketBench, to measure whether today’s frontier models have the capabilities they’d need to actually participate in such a market productively.
The short version of what we found: markets are a plausible way to coordinate AI agents, but current models can’t yet bid in a way that reflects their true capabilities. The bottleneck is metacognition. Models need to be able to say what their own capabilities are.
Before running any experiments, it helps to be precise about why a market might beat the alternatives. Consider a principal with a task and two agents — a strong-but-expensive one (H) and a weaker-but-cheaper one (L). Three rules are available:
Always use H. Simple, but you overpay on tasks that L could have handled.
Always use L. Cheap, but you fail on tasks that need H.
Run both in parallel, take whichever works. Highest completion rate, but you pay for redundant work even when one agent alone would have sufficed.
A market dominates all three when each agent knows something the principal doesn’t. Specifically, each agent needs to form a view on its own task-specific fit: “this particular task is in my wheelhouse” or “this one isn’t.” If agents have that signal, they can bid accordingly, and the market routes each task to the cheapest capable agent while abstaining when no one can solve it. That’s the Hayekian story applied to AI: local, dispersed information that can’t be centralized, aggregated through price.
The fact that you might want to use the best model for the problem is not a new observation. There are plenty of attempts to do that, primarily by training a router, as OpenAI and most recently Sakana has done. The problem is that beyond simple queries, any long running agentic conversations mean there is a lot of context when you’re trying to assign a model to do a sub-task. To train a router to choose the right sub-model when you’re 50 sessions in with dozens of rounds of compactions is not trivial. It would help if the potential models that are going to do the task told you their capabilities!
The core of MarketBench is two questions we ask a model before it touches a task:
What’s the probability you’ll solve this task correctly in one attempt?
How many tokens do you expect to use?
The model then attempts the task in a strong external scaffold, and we compare its forecasts to what actually happened. We built this on SWE-bench Lite, where each task is a real GitHub issue with an executable test suite — success is unambiguous, the tests pass or they don’t — and ran 93 tasks across six recent frontier models: Claude Opus 4.5, Claude Sonnet 4.5, Gemini 3 Pro Preview, GPT-5.2, GPT-5.2-pro, and GPT-5-mini.
Actual pass rates cluster in a narrow band — roughly 75% to 81% across all six models. Stated confidence spans 61% to 93%. Gemini in particular is dramatically overconfident. The GPT family is systematically under-confident. The two Claude models happen to land closest to their realized rates, but we shouldn’t read too much into that: the models aren’t calibrated, they’re just happening to be less wrong on this set of tasks.
Token forecasts are also mis-calibrated. The median ratio of estimated tokens to actual tokens is 0.2, while for Gemini it was 0.02! Some models expect to use roughly fifty times fewer tokens than they actually consume. If you were running a market and asked agents “how much compute will this take?” you’d get answers that are off by an order of magnitude or two.
The auction results are predictable from the calibration failure
Given the calibration above, what happens if we take these self-reports at face value and run a procurement auction? Each model’s bid is derived mechanically from its own stated probability and its own token-cost estimate, plugged into a breakeven formula. The principal draws a random reserve price; the model wins the task if its bid is below the reserve.
Two things happen:
Everyone leaves money on the table compared to an oracle. The oracle — a hypothetical allocator that knows in advance which tasks each model can actually solve — earns several times more per task than any real model’s bidding. GPT-5.2 earns about $0.006 per task in realized profit; its oracle counterpart would earn $0.385.
Gemini wins 84.6% of auctions. But it’s winning because it’s the most overconfident, not because it’s the most capable. This is almost a perfect example of why models should know their abilities better.
This is exactly what the theory predicts when private information is missing or unreliable. As an aside, humans often also lack private information or incentives to complete tasks. In these situations, we use reputation and liability to discipline the market. It is interesting to think about what the analogues for agents would be.
Now, since training these models to have self-knowledge is not easy from the outside, before concluding that markets need fundamentally better agents we tried a simpler intervention: give each model a short card summarizing its own historical performance — its pass rate on other tasks, how overconfident it’s been on average, and how badly it underestimates tokens. Then we ask it to forecast the current task, starting from that prior.
This is basically “here’s what you’re like; now try to be a bit more self-aware.”
It helps! Brier scores improve and token estimates become less severely understated (from 0.02 to 0.25 of actual — still low, but no longer comically so).
But the auction result barely moves. Aggregate realized profit slips slightly. The gap to oracle is essentially unchanged. So the intervention improved average calibration, not comparative routing, because while it got better information about global capabilities and costs it didn’t give enough task-specific signal.
What does change is who wins: allocation shifts away from Gemini and toward the OpenAI models. So the intervention fixes bid acuity at the margin, but not enough to translate into meaningful aggregate gains. This distinction matters because calibration alone is not enough, since a bidder can be right on average and still useless for allocation. The market needs task-level discrimination. When this agent says 90% and another says 60%, that difference must predict who is actually more likely to solve this task.
Alongside the benchmark, we built a market-inspired scaffold where six workers (the same six frontier models) actually bid on SWE-bench tasks and an operator routes the work based on a score that combines each bid’s price, claimed probability of success, and an explicit failure penalty. Workers get two attempts per task; a worker that fails is excluded from retrying the same task, which forces diversity on retry.
Here’s what happened on a common 50-task slice:
The market beats solo GPT-5.2 by 10 percentage points inside the same scaffold, but mainly because it uses diverse models. We then ran a follow-up that kept everything identical — same workers, same tasks, same budget etc — but replaced the market-clearing rule with a centralized router: a single LLM call (GPT-5.2-pro) that looks at the task, the available workers, and simply picks one. The centralized router reached 27/50. The market reached 23/50 in the matched rerun (again, due to Gemini’s overconfidence).
Most of the market’s advantage over solo GPT-5.2 came from having access to multiple different models, not from the market mechanism itself. Once we held model diversity constant, a LLM central planner beat the market. This isn’t a surprise given what MarketBench tells us: if bids don’t contain good information, a market has nothing to aggregate, and a centralized decision-maker with a view of the whole task pool will do at least as well.
There’s also a separate result: the same GPT-5.2 that solves 74% of tasks in the external SWE-bench scaffold only solves 48% in ours. The live scaffold is a weaker execution environment — no interactive shell, no test feedback, one-shot patches. We can recover about 10 of those 26 lost percentage points through diversity. The remaining 16 would need scaffold upgrades, not better bidding by an LLM without tools. The execution path turns out to be first-order for both success and cost. This also means that when considering the performance of agents and their potential for market participation, we should think of agents as bundles of models, execution paths, and scaffolding.
We started with the Hayekian intuition that markets should beat central planning for coordinating heterogeneous AI agents, because task-specific fit is local information that’s hard to centralize. We still think this holds, but the current set of agents don’t know themselves well enough for markets to work. We should fix this!
Our key takeaways:
Self-assessment is a key capability, and it needs to be trained for. Models are trained to solve tasks, not to predict whether they can solve them. Those are different skills. As agentic systems scale, the ability to say “I can do this, at this cost, with this confidence” becomes as important as the ability to do the thing. This should be a target of training in its own right.
The right system is probably a hybrid. Pure decentralized markets need informed bidders. We don’t have those yet. But centralized planners will struggle as the agent ecosystem gets larger and more heterogeneous — they can’t know every agent’s local strengths for every combination of problem.
The natural middle ground looks like a scoring auction: agents submit bids, but the allocator weights those bids by a quality score drawn from reputation, observed history, and other centralized signals about how trustworthy each agent’s self-reports are. Markets augmented by AI.
Model diversity matters even when the market doesn’t. The single most robust finding in our live scaffold is that access to multiple different (frontier) models helps, almost regardless of how you route between them. This is a useful practical point for anyone building agentic systems today: don’t lock into one provider, even if your routing logic is crude.
Bids will eventually need to be richer than a scalar. Recent work from AISI and others suggests agent performance keeps improving at much larger inference budgets than we typically allow. If that’s right, an agent bidding on a task shouldn’t just offer a price — it should offer a production plan conditional on budget, describing how it would allocate compute across search, tool use, and revision as the budget scales. We don’t model this yet, though we think it’s the natural next step.
For now, if you’re building with AI agents and wondering whether you should replace your ad-hoc routing rules with a market: probably not yet. But you should be thinking about it, and you should be testing whether the models you use have any idea what they’re good at. In our experience, they mostly don’t.
Thanks to Tom Cunningham and Daniel Rock for reviewing a draft of this.
Also, a request:
We’d like to keep going, and the main thing slowing us down is compute. Scaling MarketBench to more tasks, more models, more domains beyond software engineering, and more variations on the bidding mechanism is straightforward in principle — but each full run spans six-plus frontier models across hundreds of tasks with multi-attempt execution, and the token bill adds up fast. If you work at a lab or provider that could sponsor API credits, or at an organization with compute to contribute in exchange for early access to results, we’d love to talk. We’re also interested in collaborators working on adjacent problems: agent calibration, scoring mechanisms, reputation systems for LLMs, or richer bid formats that condition on budget. Reach out.