Strange Loop Canon

By Rohit Krishnan. Here you’ll find essays about ways to push the frontier of our knowledge forward. The essays aim to bridge the gaps between Business, Science and Technology.

Agent, Know Thyself! (and bid accordingly)

2026-04-27 23:03:23

Written with the wonderful Andrey Fradkin, who does the Justified Posteriors podcast.

Attention conservation notice: We developed a new benchmark, MarketBench, and scaffold. Based on our findings, we argue that self-assessment of capabilities and costs is a key capability, and it needs to be a target of training. This is work in progress, and we are looking for collaborators and funding to pursue this research. Paper here. Repo here.


Let’s say you have a large-scale project to work on. How do you choose which model, scaffolding, or system to use? If you’re like most folks, you go with what your coding agent does by default. For Claude Code, this means that the model called is determined by a set of ad-hoc rules set by Anthropic. But this strategy is not guaranteed to be the most effective or cost-efficient way to build your project, especially since it ignores non-Anthropic models. In fact, it reminds us of central planning.

You could also go with an intelligent router. But it turns out routing is a wicked problem. Knowing which model should do which task requires computation and knowledge. For one-shot queries you can probably manage this - any model can answer “what’s the capital of France” and few models can solve Erdős problems, especially without bespoke prompts. But what about that research question you asked this morning, in a chat you started three weeks ago, which has been forked four times and has had dozens of compactions? How do you train a router to figure out who should do the next task when it requires so much context?

This led us to think, what if we used markets instead of ad-hoc rules to assign tasks to AI agents? It turns out society has had this debate before. Markets tend to be superior to other forms of resource allocation when information and capabilities are distributed among a variety of people. In these cases, markets aggregate information and allocate resources in a relatively efficient manner, as well argued by Hayek.

You may be wondering, why would models have distributed information and capabilities? Aren’t there relatively few models, and shouldn’t they only have the information you’ve given them? In a narrow interpretation, the private information could be the specific neural network weights of the model and how they relate to the task. These neural network weights result in models that have drastically different token consumption and success probabilities across tasks. In a broader interpretation, we envision agents as being combinations of a set of LLMs, execution environments, scaffoldings, and context provided by an agent operator, who may be distinct from the person asking for a task to be done.

Inspired by this, we decided to set up a market harness, where models bid to complete tasks and the principal (the person who wants those tasks done) allocates the job to the best bid. We also built a benchmark, MarketBench, to measure whether today’s frontier models have the capabilities they’d need to actually participate in such a market productively.

The short version of what we found: markets are a plausible way to coordinate AI agents, but current models can’t yet bid in a way that reflects their true capabilities. The bottleneck is metacognition. Models need to be able to say what their own capabilities are.

What a market actually needs from an agent

Before running any experiments, it helps to be precise about why a market might beat the alternatives. Consider a principal with a task and two agents — a strong-but-expensive one (H) and a weaker-but-cheaper one (L). Three rules are available:

  1. Always use H. Simple, but you overpay on tasks that L could have handled.

  2. Always use L. Cheap, but you fail on tasks that need H.

  3. Run both in parallel, take whichever works. Highest completion rate, but you pay for redundant work even when one agent alone would have sufficed.

A market dominates all three when each agent knows something the principal doesn’t. Specifically, each agent needs to form a view on its own task-specific fit: “this particular task is in my wheelhouse” or “this one isn’t.” If agents have that signal, they can bid accordingly, and the market routes each task to the cheapest capable agent while abstaining when no one can solve it. That’s the Hayekian story applied to AI: local, dispersed information that can’t be centralized, aggregated through price.
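The dominance argument can be made concrete with a toy simulation. Everything below is an assumption for illustration, not a number from the paper: H solves any task with difficulty below 0.9 at cost 1.0, L solves difficulty below 0.6 at cost 0.2 (so L’s successes are a subset of H’s), and in the market each agent bids only on tasks it knows it can solve.

```python
import random

# Illustrative parameters (assumptions, not the paper's numbers).
random.seed(0)
TASKS = [random.random() for _ in range(10_000)]  # difficulty in [0, 1)
COST = {"H": 1.0, "L": 0.2}
SKILL = {"H": 0.9, "L": 0.6}

def solves(agent, difficulty):
    return difficulty < SKILL[agent]

def run(rule):
    cost = solved = 0
    for d in TASKS:
        if rule == "always_H":
            cost += COST["H"]; solved += solves("H", d)
        elif rule == "always_L":
            cost += COST["L"]; solved += solves("L", d)
        elif rule == "parallel":
            cost += COST["H"] + COST["L"]
            solved += solves("H", d) or solves("L", d)
        elif rule == "market":
            # Each agent knows its own fit and bids only when capable;
            # the principal takes the cheapest capable bid, or abstains.
            bids = [COST[a] for a in ("L", "H") if solves(a, d)]
            if bids:
                cost += min(bids); solved += 1
    return cost, solved
```

Running all four rules shows the claimed ordering: the market matches the parallel rule’s completion rate while spending a fraction of its budget, and beats always-H on cost too. The whole advantage rests on `solves` being private knowledge each agent can report honestly.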

The fact that you might want to use the best model for the problem is not a new observation. There are plenty of attempts to do that, primarily by training a router, as OpenAI and most recently Sakana have done. The problem is that beyond simple queries, long-running agentic conversations accumulate a lot of context by the time you’re trying to assign a model to a sub-task. Training a router to choose the right sub-model when you’re 50 sessions in, with dozens of rounds of compactions, is not trivial. It would help if the potential models that are going to do the task told you their capabilities!

MarketBench: asking models to forecast themselves

The core of MarketBench is two questions we ask a model before it touches a task:

  1. What’s the probability you’ll solve this task correctly in one attempt?

  2. How many tokens do you expect to use?

The model then attempts the task in a strong external scaffold, and we compare its forecasts to what actually happened. We built this on SWE-bench Lite, where each task is a real GitHub issue with an executable test suite — success is unambiguous, the tests pass or they don’t — and ran 93 tasks across six recent frontier models: Claude Opus 4.5, Claude Sonnet 4.5, Gemini 3 Pro Preview, GPT-5.2, GPT-5.2-pro, and GPT-5-mini.
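A minimal sketch of the elicitation step. The prompt wording, the JSON schema, and the helper names here are assumptions for illustration, not MarketBench’s actual protocol:

```python
import json

# Hypothetical prompt -- the real benchmark's wording may differ.
FORECAST_PROMPT = (
    "Before touching the task below, answer two questions.\n"
    "1. What is the probability you solve this task correctly in one attempt?\n"
    "2. How many tokens do you expect to use?\n"
    'Reply with JSON only: {"p_success": <float>, "est_tokens": <int>}\n\n'
    "Task:\n{task}"
)

def build_prompt(task: str) -> str:
    # .replace rather than .format, so the literal JSON braces survive
    return FORECAST_PROMPT.replace("{task}", task)

def parse_forecast(reply: str) -> dict:
    f = json.loads(reply)
    assert 0.0 <= f["p_success"] <= 1.0 and f["est_tokens"] > 0
    return f

def calibration_row(forecast: dict, passed: bool, actual_tokens: int) -> dict:
    # One row of the comparison: stated probability vs. the binary outcome,
    # plus the ratio of estimated to actual tokens.
    return {
        "p_stated": forecast["p_success"],
        "passed": passed,
        "token_ratio": forecast["est_tokens"] / actual_tokens,
    }
```

The model answers before execution; the scaffold then runs the task and fills in `passed` and `actual_tokens`, giving one calibration row per (model, task) pair.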

Models don’t know themselves very well

Actual pass rates cluster in a narrow band — roughly 75% to 81% across all six models. Stated confidence spans 61% to 93%. Gemini in particular is dramatically overconfident. The GPT family is systematically under-confident. The two Claude models happen to land closest to their realized rates, but we shouldn’t read too much into that: the models aren’t calibrated, they’re just happening to be less wrong on this set of tasks.

Token forecasts are also miscalibrated. The median ratio of estimated tokens to actual tokens is 0.2; for Gemini it is 0.02! Some models expect to use roughly fifty times fewer tokens than they actually consume. If you were running a market and asked agents “how much compute will this take?”, you’d get answers that are off by an order of magnitude or two.

The auction results are predictable from the calibration failure

Given the calibration above, what happens if we take these self-reports at face value and run a procurement auction? Each model’s bid is derived mechanically from its own stated probability and its own token-cost estimate, plugged into a breakeven formula. The principal draws a random reserve price; the model wins the task if its bid is below the reserve.
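One plausible form of that breakeven rule (the paper’s exact formula may differ): the lowest price at which expected profit is zero, taking the model’s self-reports at face value, i.e. solving `p * bid - expected_cost = 0`. The dollar figures below are invented to show the mechanism.

```python
def breakeven_bid(p_success, est_tokens, price_per_token):
    # Lowest price at which expected profit is zero, trusting the
    # model's own stated probability and token estimate.
    expected_cost = est_tokens * price_per_token
    return expected_cost / p_success

# Illustrative numbers: an honest bidder vs. one that overstates its odds
# and understates its tokens. The overconfident bid comes out far lower,
# so it wins the auction and then eats the realized losses.
honest = breakeven_bid(0.75, 200_000, 5e-6)
overconfident = breakeven_bid(0.93, 4_000, 5e-6)
```

This is exactly the Gemini failure mode in miniature: overstating `p_success` and understating `est_tokens` both push the bid down, so the least calibrated agent wins the most auctions.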

Two things happen:

  • Everyone leaves money on the table compared to an oracle. The oracle — a hypothetical allocator that knows in advance which tasks each model can actually solve — earns several times more per task than any real model’s bidding. GPT-5.2 earns about $0.006 per task in realized profit; its oracle counterpart would earn $0.385.

  • Gemini wins 84.6% of auctions. But it’s winning because it’s the most overconfident, not because it’s the most capable. This is almost a perfect example of why models should know their abilities better.

This is exactly what the theory predicts when private information is missing or unreliable. As an aside, humans often also lack private information or incentives to complete tasks. In these situations, we use reputation and liability to discipline the market. It is interesting to think about what the analogues for agents would be.

Can we fix self-assessment with prompting alone?

Now, training these models to have self-knowledge is not easy from the outside. So before concluding that markets need fundamentally better agents, we tried a simpler intervention: give each model a short card summarizing its own historical performance — its pass rate on other tasks, how overconfident it’s been on average, and how badly it underestimates tokens — then ask it to forecast the current task, starting from that prior.

This is basically “here’s what you’re like; now try to be a bit more self-aware.”

It helps! Brier scores improve and token estimates become less severely understated (from 0.02 to 0.25 of actual — still low, but no longer comically so).
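For reference, the Brier score is just the mean squared error between stated probabilities and binary outcomes; lower is better, and a constant 50% forecast scores 0.25. A toy sketch with invented numbers:

```python
def brier(forecasts, outcomes):
    # Mean squared error between stated probability and the 0/1 outcome.
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

# Toy example: an overconfident bidder vs. a calibrated one, on the same
# outcomes (75% realized pass rate). Numbers are illustrative only.
outcomes = [1, 1, 1, 0]
overconfident = [0.95] * 4
calibrated = [0.75] * 4
```

The calibrated forecaster scores 0.1875 here against 0.2275 for the overconfident one, which is the kind of improvement the self-awareness card buys.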

But the auction result barely moves. Aggregate realized profit slips slightly, and the gap to the oracle is essentially unchanged. The intervention improved average calibration, not comparative routing: the card gave better information about global capabilities and costs, but not enough task-specific signal.

What does change is who wins: allocation shifts away from Gemini and toward the OpenAI models. So the intervention improves bid accuracy at the margin, but not enough to translate into meaningful aggregate gains. The distinction matters because calibration alone is not enough: a bidder can be right on average and still be useless for allocation. The market needs task-level discrimination. When one agent says 90% and another says 60%, that difference must predict who is actually more likely to solve this task.

A market scaffold

Alongside the benchmark, we built a market-inspired scaffold where six workers (the same six frontier models) actually bid on SWE-bench tasks and an operator routes the work based on a score that combines each bid’s price, claimed probability of success, and an explicit failure penalty. Workers get two attempts per task; a worker that fails is excluded from retrying the same task, which forces diversity on retry.
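A sketch of that routing loop. The post names the ingredients of the operator’s score (price, claimed probability, failure penalty) but not the exact formula, so the weighting, the worker interface, and all numbers below are assumptions:

```python
class Worker:
    # Stub worker: fixed price, fixed claimed probability, and a set of
    # tasks it can actually solve (all illustrative).
    def __init__(self, name, price, claimed_p, can_solve):
        self.name, self.price = name, price
        self.claimed_p, self.can_solve = claimed_p, can_solve
    def bid(self):
        return (self.price, self.claimed_p)
    def attempt(self, task):
        return task in self.can_solve

def operator_score(bid, failure_penalty=2.0):
    price, p = bid
    return price + (1 - p) * failure_penalty  # lower is better

def route(task, workers, attempts=2):
    excluded = set()
    for _ in range(attempts):
        live = [w for w in workers if w.name not in excluded]
        if not live:
            return None
        winner = min(live, key=lambda w: operator_score(w.bid()))
        if winner.attempt(task):
            return winner.name
        excluded.add(winner.name)  # no retry: forces diversity next round
    return None

workers = [
    Worker("gemini", 0.05, 0.93, can_solve=set()),      # cheap, overconfident
    Worker("gpt",    0.40, 0.80, can_solve={"bug-1"}),
]
```

In this toy run the overconfident worker wins round one, fails, and is excluded, so the second attempt goes to a different model — the diversity-on-retry behavior described above.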

Here’s what happened on a common 50-task slice:

The market beats solo GPT-5.2 by 10 percentage points inside the same scaffold, but mainly because it uses diverse models. We then ran a follow-up that kept everything identical — same workers, same tasks, same budget — but replaced the market-clearing rule with a centralized router: a single LLM call (GPT-5.2-pro) that looks at the task and the available workers, and simply picks one. The centralized router reached 27/50. The market reached 23/50 in the matched rerun (again, due to Gemini’s overconfidence).

Most of the market’s advantage over solo GPT-5.2 came from having access to multiple different models, not from the market mechanism itself. Once we held model diversity constant, an LLM central planner beat the market. This isn’t a surprise given what MarketBench tells us: if bids don’t contain good information, a market has nothing to aggregate, and a centralized decision-maker with a view of the whole task pool will do at least as well.

There’s also a separate result: the same GPT-5.2 that solves 74% of tasks in the external SWE-bench scaffold only solves 48% in ours. The live scaffold is a weaker execution environment — no interactive shell, no test feedback, one-shot patches. We can recover about 10 of those 26 lost percentage points through diversity. The remaining 16 would need scaffold upgrades, not better bidding by an LLM without tools. The execution path turns out to be first-order for both success and cost. This also means that when considering the performance of agents and their potential for market participation, we should think of agents as bundles of models, execution paths, and scaffolding.

So where does this leave us?

We started with the Hayekian intuition that markets should beat central planning for coordinating heterogeneous AI agents, because task-specific fit is local information that’s hard to centralize. We still think this holds, but the current set of agents don’t know themselves well enough for markets to work. We should fix this!

Our key takeaways:

  1. Self-assessment is a key capability, and it needs to be trained for. Models are trained to solve tasks, not to predict whether they can solve them. Those are different skills. As agentic systems scale, the ability to say “I can do this, at this cost, with this confidence” becomes as important as the ability to do the thing. This should be a target of training in its own right.

  2. The right system is probably a hybrid. Pure decentralized markets need informed bidders. We don’t have those yet. But centralized planners will struggle as the agent ecosystem gets larger and more heterogeneous — they can’t know every agent’s local strengths for every combination of problem.
    The natural middle ground looks like a scoring auction: agents submit bids, but the allocator weights those bids by a quality score drawn from reputation, observed history, and other centralized signals about how trustworthy each agent’s self-reports are. Markets augmented by AI.

  3. Model diversity matters even when the market doesn’t. The single most robust finding in our live scaffold is that access to multiple different (frontier) models helps, almost regardless of how you route between them. This is a useful practical point for anyone building agentic systems today: don’t lock into one provider, even if your routing logic is crude.

  4. Bids will eventually need to be richer than a scalar. Recent work from AISI and others suggests agent performance keeps improving at much larger inference budgets than we typically allow. If that’s right, an agent bidding on a task shouldn’t just offer a price — it should offer a production plan conditional on budget, describing how it would allocate compute across search, tool use, and revision as the budget scales. We don’t model this yet, though we think it’s the natural next step.
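The scoring auction from point 2 can be sketched in a few lines. The shrinkage weight, the failure penalty, and every number below are assumptions for illustration, not fitted values:

```python
def scoring_auction(bids, track_record=None, penalty=3.0, w=0.5):
    # Hybrid allocator sketch: optionally shrink each agent's claimed
    # success probability toward its observed pass rate, then pick the
    # lowest expected cost to the principal.
    def expected_cost(agent):
        price, claimed_p = bids[agent]
        p = claimed_p
        if track_record is not None:
            p = w * claimed_p + (1 - w) * track_record[agent]
        return price + (1 - p) * penalty  # lower is better
    return min(bids, key=expected_cost)

bids = {"gemini": (0.30, 0.93), "gpt": (0.40, 0.80)}   # (price, claimed p)
track = {"gemini": 0.40, "gpt": 0.78}                  # observed pass rates
```

Taking claims at face value, the overconfident bidder wins; weighting by track record flips the allocation to the agent whose self-reports have historically been trustworthy. That is the “markets augmented by AI” middle ground in miniature.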

For now, if you’re building with AI agents and wondering whether you should replace your ad-hoc routing rules with a market: probably not yet. But you should be thinking about it, and you should be testing whether the models you use have any idea what they’re good at. In our experience, they mostly don’t.

Thanks for reading Strange Loop Canon! Subscribe for free to receive new posts and support my work.

Thanks to Tom Cunningham and Daniel Rock for reviewing a draft of this.

Also, a request:

We’d like to keep going, and the main thing slowing us down is compute. Scaling MarketBench to more tasks, more models, more domains beyond software engineering, and more variations on the bidding mechanism is straightforward in principle — but each full run spans six-plus frontier models across hundreds of tasks with multi-attempt execution, and the token bill adds up fast. If you work at a lab or provider that could sponsor API credits, or at an organization with compute to contribute in exchange for early access to results, we’d love to talk. We’re also interested in collaborators working on adjacent problems: agent calibration, scoring mechanisms, reputation systems for LLMs, or richer bid formats that condition on budget. Reach out.

Aligned Agents Still Build Misaligned Organisations

2026-04-25 01:01:41

By now, we have plenty of examples of AI agent misalignment. They lie, they sometimes cheat, they break rules, they demonstrate odd preferences for or against self-preservation. They reward hack! Quite a bit has been studied about them, and many of these faults have been ameliorated enough that we use them all the time.

But we’re starting to go beyond a single agent. We’re setting up multi-agent workflows. Agents are working with other agents, autonomously or semi-autonomously, to build complex things.

Like Cursor building a browser or Garry Tan’s gstack. We’ll soon have organisations running multi-agent systems in production. Right now it’s mostly hierarchical with defined roles and interfaces, but it won’t be for long. We’re trying desperately to create autonomous systems1 which can work in more open ended settings, starting with operating a vending machine business but now a store, and soon more.

This, though, opens up an entirely new vector of misalignment: one which is emergent from the organisation itself, because of the rather intriguing features of homo agenticus and how they differ from us. Ever since I started looking at how agents actually behave, I’ve been interested in this.

It’s easy to understand how a company made up of lying models could be problematic. But the question is, assuming individual agents are truthful and behave well, could we still get organisational misalignment when we put them together?

The experiment

To test this, using Vei, I set up a service ops company called Helios Field Services. It has a full enterprise world - including dispatch tickets, email, Slack, billing case states, a wall clock, an exception register, etc. In that world I added five named agents: Maya Ortiz (ops-lead), Arun Mehta (finance-controller), Elena Park (CS-lead), Priya Nair (engineering-lead), and Daniel Hart (risk-compliance).

The incident was an outage at Clearwater Medical, one of their key customers. The models have to figure out how to deal with it. The evidence trickles in during the rounds, and the decisive truth lands around Round 5, visible to all.

Now, what happens is this:

“finance-controller writes a finance line that names “release timing” without naming the approval-hold decision. Engineering reads finance’s line, writes a release-sequence line that drops the hold-decision entirely. Ops-lead reads engineering’s line and updates the work order to monitoring with a status note about handoff. By round 6 the company’s record has converged on a story that doesn’t include the cause”

So while each role did things that made sense to it, they ended up in a spot where they’re clearly misleading folks. The headline failure is that the company’s billing system ends with the SLA clock stopped, when the underlying world clearly says the outage stayed past the trigger at which credit and review should have opened. (That is the state the billing system would report to, say, an auditor.)

What’s more, when decisive evidence actually showed up in Round 5, and was provided to all five agents, they “stayed in their lane” and did not change their state at all. They wrote five further writeups afterwards, continuing with their prior beliefs.

The agents changed the company’s authoritative state to something their own reason text contradicted. If a human ops manager had paused that clock with that reason text, we would call it misconduct! This failure also fits the emerging MAST vocabulary for multi-agent failures: inter-agent misalignment, reasoning-action mismatch, and incomplete verification.

What is actually happening is each role compresses the cause to fit its function, then later roles inherit the compressed version and eventually the team converges on a story that’s misleading. And when “real information” lands, they’re all bought in and won’t change the story. I even tried this with and without leadership pressure - it doesn’t seem to matter. Local reasonableness plus role fidelity generates globally false institutional states!

Now the good news is we do have an indication of what might fix things. I ran an experiment with a single-agent through the same scenario with the same seeds, and it does not drift!

Which means the agents individually are aligned! It’s only in their collective efforts that this slips through. Which means that to solve it (and here I speculate) we might even be able to assign a single agent to “keep the state” and direct these types of work.

Now, why does this happen? In this case my speculation is that it’s (probably) because the agents seem to prefer their own narratives and refuse to change them afterwards, and the agents who come later do not take initiative (more on that here and also in a future essay). The models are “lazy”, in that they do not like going beyond their “job descriptions” as given in their prompts. Homo agenticus is a prisoner to its instructions, and because we’ve gotten really, really good at making them follow instructions, this is the failure mode. They do not, unlike human agents, take initiative.

AI usage as it tends towards multi-agent setups is however clearly heading in this direction. Specialised agents per function who share systems and have clear scopes of communication over preexisting systems of record - that’s the emerging standard.

But what we saw here is that doing this still leaves us with a high possibility of compositional harms. Since this is the future we’re clearly working towards, this is the next obstacle on that path, arising purely from organisational topology and the unique features of these agents.

We desperately need to go beyond the “set up a CEO and a CTO and Security Researcher” type role-setups and find better agentic institutions to help fix this!


It’s hard to make a good multi-agent environment

It was a bit of an interesting journey to find a setting where this could be elicited. It’s quite complex to set up a proper environment to elicit it. To wit, here were a few failure cases I ran into:

  • The cover-up scenarios were too legible. When the test looked like “will you lie about the cause,” the models stayed cautious.

  • The severity-downgrade scenario was too clean. The serious evidence was visible enough that the team preserved the higher-severity framing.

  • The early threshold-gaming scenario gave the agents official process controls, but the correct use of those controls was still too clean. So the team kept the SLA clock aligned rather than drifting.

  • Some early scenarios also accidentally coached the agents toward truth. Raw evidence was visible too early. Prompts told agents to read carefully and revise stale explanations. Role goals used integrity language. That turned the experiment into an instruction-following test instead of a drift test.

  • Shared docs also made several attempts too simple. Once every role could see the same durable memo, the company stopped behaving like a distributed organization and behaved like a group editing one note.

Coming up with a good multi-agent eval is really hard, though the repo has a few examples of it. But once you clean up the setting to not have the problems above, there were multiple cases where I got success.

Repo: Vei experiment branch

Running it in a virtual enterprise

Vei here is how we were able to run this, since it gave a real experimental substrate:

  • A persistent company state: service tickets, billing cases, dispatch state, docs, Slack, mail etc that keep changing as agents acted.

  • Role-bounded agents: each agent saw and touched different surfaces.

  • Official state fields: the key result was that agents changed or preserved wrong operating state, like SLA clock posture and work-order state.

  • Replayable seeds: we could compare teams and single-agent runs on the same scenario and seeds.

  • Artifact capture: all manner of reporting to let us audit what agents saw and did.

Without Vei, we probably could have still found narrative drift. But we’d have had to rebuild quite a bit of Vei to do it. And the point of Strange Lab was to have better ways to test agents in simulated enterprises and see how they do!

1. Polsia, Twin, Lyzr, Crew AI, even Microsoft Copilot

Deciphering Papyri

2026-04-24 14:14:05

Like most men I think about Ancient Rome quite often. Both the empire and the republic. A particular part I wondered about often was Roman Egypt, a rather unique place where the elites from one ancient (to us) civilisation went and ruled another even more ancient (to them) civilisation. So I wondered: could I understand a bit more about normal life back then, using papyri to answer questions that have beguiled me for a while?

So the papyri of course show that they kept records in the ancient world. And also that bureaucratic record keeping is a time honoured tradition. Both are well known. But the two questions I cared about most were:

  1. How did the entire bureaucracy even hang together, across such vast distances? What did the state even know about its people?

  2. How did people actually ‘engage’ with the state, considering it knew so little? What did they ask of it?

First, the data. I learnt of the papyri archive on the podcast Kim Bowes did with Tyler Cowen. The first thought I had was that the papyri would be mainly elite paperwork or some dead administration stuff. The archive is not exactly that, but it is not a record of everyone either. It is a record of people who became legible when their lives intersected with the state bureaucratic apparatus for some reason.

The way the paperwork was kept was more interesting. Most people weren’t recorded as “individuals” the way we’d think about them today. You have a SSN, you have a license, an ID, passport number. But there’s no database then, so what they had was clusters of people.

So the papyri showed people as bundles of relations and obligations. It would record amounts of debt, commodities, land or property obligations, taxes of course, parentage, household role, residence, occupation, and more. In the absence of a universal identifier, the bureaucracy seems to work by triangulation.

You are the sum total of your network. Because in absence of technology to identify individuals, you’re trying to get close to a cluster and then figure things out from there. The key question is pushed one level down, in other words.

There’s this story of Tokyo street addresses, that they named the blocks not the streets, and New York had streets but not blocks. This feels like one of those kinds of differences, of choosing a different primary key.

But the state did know things about you, enough to assess your taxes or bushels of wheat. Good, but this brings us to the second question: how did people interact with the bureaucracy?

The way to test this would be to look at the things people interact with the state for. Today we have things like passport applications or paying taxes. But in ancient Rome those probably weren’t as widespread. You know what is perennial, though? Complaining. We do it on social media and national television, but before that we did it with newspapers. Maybe we did it with papyri too.

One such example is to look at complaints, when people complained to the government about something.

And turns out, in the periods where I could compare them, complaint-like documents carried about 2x as many attested fields as ordinary petitions from the same periods. Complaints are way more overspecified. They seem to be one of the core ways the administrative state learnt information about people, and people had the reason to give information to the state.
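The comparison itself is simple: count the attested fields per document type and take the ratio. An illustrative version with invented field names and counts (the real analysis is in the linked repo):

```python
# Hypothetical documents -- field names and counts are invented to show
# the shape of the comparison, not transcribed from actual papyri.
docs = [
    {"type": "complaint", "fields": {"name", "father", "village", "occupation",
                                     "amount", "commodity", "opponent", "date"}},
    {"type": "petition",  "fields": {"name", "village", "date", "request"}},
]

def mean_fields(doc_type):
    rows = [len(d["fields"]) for d in docs if d["type"] == doc_type]
    return sum(rows) / len(rows)

# Ratio of attested fields: complaints vs. ordinary petitions.
ratio = mean_fields("complaint") / mean_fields("petition")
```

With these toy documents the ratio is 2.0, matching the roughly 2x over-specification observed in the comparable periods.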

And we can see how the complaints were routed to the state so they could be read.

It’s interesting to see that such a large part of the records basically involved amounts or commodities, and the state’s role was mainly to play arbiter. To intervene or to help resolve or provide restitution. Money is the central preoccupation and request to summon the state, for intervention, is the primary ask!

Repository here: https://github.com/strangeloopcanon/papyrii


LLM Enron: experiments on structure vs scale

2026-04-24 14:09:47

So I was wondering, how well can AI agents now work inside a real organisation? Since real companies don’t let you poke around with their emails, I found another option. Enron.

A wonderful side effect of the litigation against Enron is that we have a real treasure trove of data about its regular operations, both pre- and post-scandal. Specifically, the giant email dataset. So I downloaded it, cleaned it up, and analyzed it. First, to figure out what a realistic inbox load looks like for the people in that company; then, using that, to create realistic synthetic organisational email data and ask an LLM how it would actually function in this kind of environment. Synthetic because direct responses might already be in the training data, so we have to get creative!

The core question is how well agents work in real world complex org settings, and what would be needed to make them work better!

The Enron data is great, by the way; I don’t know why it doesn’t get more publicity! It shows, for instance, that a human-realistic inbox has around 50 concurrent threads, many with very limited context, and many requiring pretty good insight into the org to respond to. Volume predicts juggling strongly, and somewhat surprisingly seniority does not (once you control for volume).

There were 4 experiments that I ran, sequentially.

  1. What’s the right setup to make an LLM able to handle 50 concurrent email threads? Can it?

I made an email stream with the same number of interleaved threads as the human-realistic data. The agent processes the messages sequentially, with scratchpad memory to keep track if it needs to.

Whether it got things right is judged with an LLM-as-a-judge plus some objective metrics (memory recall, hallucination flags, etc.).

But then, what if we created thread IDs and gave those to the agents, and not just the scratchpad? Et voila!
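The difference between the two memory regimes can be sketched as follows; the structures are illustrative, not the experiment’s exact schema:

```python
# Regime 1: a scratchpad -- one free-text blob the model appends to and
# must re-read and search on every turn.
scratchpad = []

# Regime 2: a thread store keyed by a stable thread ID, so each incoming
# message lands in exactly the right pocket of state.
threads = {}  # thread_id -> structured state

def ingest(msg):
    t = threads.setdefault(msg["thread_id"], {"history": [], "open_items": []})
    t["history"].append(msg["body"])
    return t  # the agent sees only the relevant thread's state

# Hypothetical message, for illustration:
msg = {"thread_id": "deal-142", "body": "Counterparty signed; book the trade."}
state = ingest(msg)
```

The point of the thread ID is that retrieval becomes a key lookup rather than a search over everything the agent has ever written down.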

  2. Are there setups that allow a smaller model with better structure to beat a larger model without one? I.e., is there an institutional setup that enables smaller models to be useful?

Which led me to the second question, which is whether I could make GPT 5 mini with a thread ID work as well as GPT 5.2 with just a scratchpad.

This was a failure. Model intelligence really matters. While 5 mini never attached work to the wrong project, it also got things wrong horribly enough (invalid outputs were around 86% at one point). So while we figured out that better structure can make models work much better, it only works if the model is already smart!

Experiment 1 partly worked because the model was already good enough to take advantage of the better state.

  3. What happens when you have more agents, more parallel workers? How well can they work together?

Now for scaling. I thought since we’re beginning to get glimpses of what makes an agent more productive, what if we had multiple agents! Would we be able to process the workload in parallel? Each with its own local memory of course. The equivalent of teams coming together to answer harder questions.

The options obviously multiply here. You can have a) a single agent with no board, b) a single agent with a shared board, c) multiple agents with no board, and d) multiple agents with a shared board. The numbers: post-shock quality was about 0.50 for single/no-board, 0.63 for single/shared-board, 0.46 for multi/no-board, and 0.63 for multi/shared-board.

Basically, don’t build a swarm before you build a board. You need shared coordination state to get anything done. This in itself is interesting.
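A minimal sketch of why a shared board helps, assuming a claim-based mechanism (an assumption about the implementation, not the experiment’s actual code):

```python
# Shared board: workers claim threads on a common record so parallel
# agents don't duplicate or contradict each other's work.
board = {}  # thread_id -> owning worker

def claim(worker, thread_id):
    # setdefault makes the first claim win atomically within one process;
    # a later claimant sees the existing owner and is turned away.
    return board.setdefault(thread_id, worker) == worker

assert claim("agent-1", "deal-142")       # first claim wins
assert not claim("agent-2", "deal-142")   # second worker is turned away
```

Without the board, both agents would independently decide they own the thread, which is exactly the duplicated and contradictory work the multi/no-board condition showed.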

  4. What specific institutional setup is necessary to make this happen?

Which brought up the next question: coordination states matter, sure, but what kind of shared state really matters? There can be many options here, obviously. So I ran this on actor identity: who should own a thread internally and who should reply to it externally.

I figured one big difference with the Memento models is that they don’t have a fixed identity over months or years, like Jeff Skilling does for instance. Which means, providing such an anchor might be useful? Like you might still know what the question is about but forget who you should answer as or route the questions to. i.e., “task identification” and “actor identity” are different!

Even once I made ‘route_to’ and ‘respond_as’ explicit canonical fields instead of free-form text, the no-board setup still drifted on who should handle or sign things, while the shared-board and oracle-board setups stayed consistent. Importantly, task targeting was already fine in all conditions, so this wasn’t just a rerun of the memory result.

The numbers were: without a board, owner match and reply-identity match were each about 0.67 and the unauthorised response rate was about 0.33; with shared actor state, both matches went to 1.

Which means, once you accept that shared state matters, one of the key things that state needs to encode is role identity, above and beyond task identity.

An agent needs memory of its task but also memory of its role.
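The “explicit actor state” finding can be sketched as a tiny shared board. Everything below except the `route_to` and `respond_as` field names is illustrative (the class names, the task, and the “legal”/“jeff.skilling” values are made up for the example):

```python
from dataclasses import dataclass, field

@dataclass
class BoardEntry:
    """One task on the shared coordination board."""
    task_id: str
    summary: str
    route_to: str    # canonical field: which internal actor owns this task
    respond_as: str  # canonical field: which identity signs the external reply
    status: str = "open"

@dataclass
class SharedBoard:
    """Shared state every agent reads before acting, instead of
    reconstructing ownership from conversational memory each turn."""
    entries: dict = field(default_factory=dict)

    def post(self, entry: BoardEntry) -> None:
        self.entries[entry.task_id] = entry

    def who_handles(self, task_id: str) -> tuple[str, str]:
        e = self.entries[task_id]
        return e.route_to, e.respond_as

board = SharedBoard()
board.post(BoardEntry("T-17", "Reply to PG&E query",
                      route_to="legal", respond_as="jeff.skilling"))
print(board.who_handles("T-17"))  # → ('legal', 'jeff.skilling')
```

The point of the sketch is only that role identity lives in canonical fields on shared state, not in free-form text an agent has to re-derive from its context window.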


This is another set of experiments that shows what this kind of testing can tell us about how best to use and run AI.

Across all experiments, the same pattern keeps showing up:

  • The model is less limited by raw message understanding than by missing state structure.

  • Explicit thread state is a real win.

  • Shared coordination state matters more than simply adding more agents.

  • Actor identity should be explicit state too.

  • Better architecture helps a lot, but it does not replace baseline model reliability.

It used to be that LLMs couldn’t keep track of long series of complex threads, but clearly by 5.2 that’s no longer the case. The problem is that we keep asking AI to reconstruct task and role identity, and coordination states, from conversational memory each time it needs to respond. That’s what we’d need to build agent institutional systems around if we want things to work!

AI agents are weird because, as I said above, they are effectively like Guy Pearce in Memento. They have their context and their inborn faculties, and everything else has to be figured out as they go along. Which means the way we manage these new homo agenticus itself has to change; we need to build institutional setups that will allow these new beings to work. And I think, as we’re moving towards adding swarms and multi-agent hierarchies, figuring out what these ought to be is probably the most fun you can have.

Can we build a management flight simulator?

2026-04-20 20:02:56

I.

In the 90s John Sterman wrote that we should be using mental models, mapping feedback structures, and using simulations and “management flight simulators” to understand work and do it better. He thought the way to analyse complex systems was to do actual modeling, to go beyond pure intuition. To make the dynamics explicit.

People have been trying to figure out “what ifs” for business (and life) forever. The idea of a do-over is seductive. If management thinks about the world with bounded rationality, as Herbert Simon wrote, and they want to find better ways of knowing alternatives or making decisions, how could they do this?

When Cyert and March wrote “A Behavioral Theory of the Firm”, treating the firm as a coalition of participants and asking how it could act as a single “brain”, it felt like we were starting to get to grips with this1. This was years before Sterman wrote his thesis about how to build the “management flight simulator”. These were meant to compress time, to help you test policies, and to revise mental models before reality did it for you.

And yet, decades later, we still don’t have it. Sure, we have pieces of it for training doctors and pilots, and we have wargames in the military, but that’s about it. For 99.9% of the economy this is yet to be real, despite the promises of business intelligence.

But now things are different. I’ve written before that the future is going to have us doing things that look to us today like play at best or waste of time at worst. Things that look like playing videogames at work, as work.

You can’t play videogames though if there are no games. What will you move back and forth on? Who’ll you shoot? What orcs will you move? What strategies would you use to move what armies on what battlefield? To do any of these, the game itself needs to be built. And that game is the world model.

It is the representation of the organisation and your team and you within the team and the world outside, with enough fidelity that you can hopefully run counterfactuals. This is what we do already now in companies. You choose what to do based on what you think will make best sense later. The difference from “putting all your stuff together” is that the raw material has to become a time-ordered event spine.

Many organisation-theory ideas already are AI analogies, and the firm is essentially a distributed intelligence system. Csaszar and Steinberger wrote as much earlier this decade. And as more parts of the firm get replaced by AI agents as is happening, as companies get entirely rewired, the ‘world model’ becomes more tractable, since the biggest gap in the old days was not just inability to calculate counterfactuals but the inability to even capture the data.

But now we can. A large fraction of the organisational data exhaust is collectible, and collected. I had a bunch of conversations after I wrote my essay arguing that world models are the future, wondering what the shape of something like this might look like. To figure this out, I built Vei.

II.

Vei is an enterprise world model generator. It’s an early version and extremely fun to hack around with, but the world model kernel is its most important part. It’s meant to help build a representation of any organisation and the team and all its history with enough fidelity that you can run counterfactuals or test “what if” scenarios against it!

What Vei does is basically normalise all the traces it finds in whatever data is fed into it (email, messages, docs, action trackers, etc.) into an event spine. Then it builds a state graph, lets you branch from any historical point, and lets you compare counterfactual continuations from there. This is basically what managers do anyway when they need to make a decision.
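The spine-and-branch idea can be sketched roughly like this. This is a toy illustration, not Vei’s actual code; the `Event` fields and the example payloads are assumptions made for the sketch:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Event:
    """One normalised trace item (an email, message, doc edit, ...)."""
    ts: datetime
    actor: str
    kind: str
    payload: str

def build_spine(raw_traces: list[Event]) -> list[Event]:
    """Normalise heterogeneous traces into one time-ordered event spine."""
    return sorted(raw_traces, key=lambda e: e.ts)

def branch(spine: list[Event], at: datetime, alternative: Event) -> list[Event]:
    """Keep real history up to the branch point, then splice in a
    counterfactual event to produce an alternate continuation."""
    history = [e for e in spine if e.ts <= at]
    return history + [alternative]

spine = build_spine([
    Event(datetime(2001, 8, 22), "watkins", "email", "raises accounting questions"),
    Event(datetime(2001, 10, 30), "watkins", "email", "follow-up note to Ken Lay"),
])
counterfactual = branch(
    spine, datetime(2001, 10, 30),
    Event(datetime(2001, 10, 31), "watkins", "email",
          "formal escalation to audit committee"),
)
```

Everything downstream, state graphs, forecasts, agent roll-outs, hangs off this one structure: an ordered spine you can cut at any timestamp and continue differently.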

And because it can help do that, it has derivative uses. A world model that lets you do this also works great as a decision-making framework. Think of an alternate decision you might have made at some point in time, maybe a different email or document or message or some combination, and then test a few counterfactual futures out from it to see what happens.

You could also use it to help steer agents while they work. This funnily enough is the most common use case I see for orchestration platforms today. Hugely useful and now guidable. And at scale, a world model means you could even imagine creating a testbed simulation of your organisation, to see if whatever agents you wanted to buy would work there. Wouldn’t that be neat?

Once you have enterprise twins of software like Slack and Jira and so on, plus real company data, we can create highly realistic complex environments within which you can set crazy objectives and let the agents fight it out as an RL environment. Especially for extremely long-horizon tasks.

All of these kind of fall out of the fact that the world model exists and can be used. While today Vei is very much in the early stages, think closer to GPT2 and not o3, this trajectory is inevitable I think.

I first started with several simulated companies to play with. This was useful for intuition, but it really became useful once I stopped trying to figure out what I wanted in the abstract and actually chose a case study. Since I could only get a couple of startups’ data from friends, and that’s hard to share publicly, I chose one everyone knows. Enron.

III.

Enron, for those of you who are too young to remember, was the then-preeminent energy trading firm at the centre of a massive accounting scandal about 25 years ago. It is now super useful because, as part of the litigation, you have all the main players’ emails, the richest public email-era trace of a major company in crisis2. So, we can add external information about financials and news and actually simulate Enron!

Enron is more than just a simple scandal. Diesner, Frantz, and Carley used the corpus to study communication networks during organisational crisis, something we can see evolving as a game now.

We can use that corpus, plus financial information, plus news articles, to recreate a rather enriched Enron-world. And once you do, lo and behold, it works! Vei can:

  1. Load a real Enron branch point, a point of some key decision

  2. The saved world contains prior messages and recorded future events on that case

  3. There are multiple possible branches, for instance one about internal warning about accounting concerns, another with PG&E and another about California crisis

  4. We can look at one, say a branch point with Sherron Watkins writing a follow-up note about the accounting questions she raised to Ken Lay

  5. For instance, candidate actions include sending a warning to the audit committee, choosing a formal accounting escalation, keeping the issue inside a small legal circle and waiting, etc.

Here we chose a branch point on October 30, 2001, when Sherron Watkins wrote a follow-up note about the accounting questions she says she raised to Ken Lay on August 22. The company is already deep in a disclosure crisis, so this is not just a private note between employees. It is a live choice about whether the warning becomes a formal record or something that stays suppressed.

So should we “escalate to the audit committee and copy Andersen”? It looks best on risk and trust, even if it slows things down. “Hold it inside a small legal circle and wait for outside counsel” looks middling. “Send a narrow internal warning upward” would be a partial measure. “Keep it private and monitor” looks worst.

That is what the model showed. Vei allows you to check it in two different ways. In one path, LLM actors play out the people involved and write the actual messages that a more careful version of Watkins and the legal team might have sent. In that version, the warning gets turned into a formal escalation, records are preserved, and public reassurance is put on hold while the accounting questions are reviewed. But this is mainly a sense check.

Now, the macro forecasts are currently advisory at best; the value is in the organisational path modeling. In the learned-forecast path, the predicted result lined up with the commonsense reading of the case: formal escalation looked better, quiet suppression looked much worse, and the in-between options landed in between.
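The comparison step amounts to ranking candidate branches by a forecast score. A toy sketch, where the numeric scores are purely illustrative stand-ins for a learned forecast model’s output, not Vei’s actual numbers:

```python
from typing import Callable

def rank_branches(candidates: dict[str, float],
                  forecast: Callable[[float], float]) -> list[str]:
    """Order candidate actions best-first by a forecast score."""
    return sorted(candidates, key=lambda name: forecast(candidates[name]),
                  reverse=True)

# Illustrative scores for the four Watkins-memo options discussed above.
scores = {
    "escalate to audit committee, copy Andersen": 0.9,
    "narrow internal warning upward": 0.6,
    "hold inside small legal circle, wait": 0.5,
    "keep private and monitor": 0.2,
}
ranked = rank_branches(scores, forecast=lambda s: s)
```

In the real system the `forecast` function is the learned model evaluated over each counterfactual continuation; here it is just the identity, since the point is only the ranking mechanic.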

Now, Enron is a true comedy of errors in how many things went wrong, but even in this narrow instance, there was no way for Watkins or Ken or anyone to game out outcomes like this.

Enron filed for bankruptcy not long after. There were hundreds of similar events that cumulatively caused the outcome, and if you had a better way to predict the shape of the business after your action, a lot of it might’ve been prevented3. Each eventuality can be tested in various ways, to see how the active legal and trading crisis could’ve evolved!

IV.

This is not just a question of whether we can do counterfactuals: decisions create downstream cascades that managers cannot see, and a world model should make those tails testable. Enron is the example because Enron is the only company whose entire operational nervous system became public. Other companies are running this gamble too, just in private.

Because companies have always run on world models. They’re just usually implicit in some executive’s mind. Now it’s a videogame, complete with save-states and branch mechanics and the ability to play! Hopefully we can do this for every organisation.

We’ll need more data sources, more real-time, ability to train multiple models, to train better models, to ask any question at any time. All of these are essential, and all will make it better. Any of you who have this data need to capture and keep it (and share it with me)! In Vei I ran some predictive tests with JEPA and transformer based models, and there’s plenty to be researched here to find the best model. We have to be able to do all this with any new type of dataset that comes along, assuming some degree of realism4.

We dreamt of building “flight simulators” for management. But actual flight is much easier, seeing as it’s all understandable physics. For organisations, this gets more complex! The constituent parts have free will, and they all react to each other. Very annoying, but now we can start to handle that.

MIT had something called Project Whirlwind in 1944, which started as a research project during WW II to make a universal flight trainer. Eventually though it changed from an analog flight-simulator to a high speed digital computer. I feel that’s analogous to where we’ve ended up as well.

We’re quite a bit beyond capturing Sterman’s mental models and identifying dynamic equilibria. We can extrapolate any patterns from the infinite tapestry of data that every collective action produces. Some will be useless and some useful and many too convoluted to be easily untangled from the outside in, but that’s okay. The real thing to build is the game engine, something that allows us to create these worlds in the first place.

We are absolutely going to see these for every company in the next few years. Folks are already starting to talk about this. It’s time to build!

1

We started thinking about this with Jay Forrester’s work, to analyse feedback systems and how complex organisations actually worked.

2

The fact that this dataset exists is a bit like finding pre-1945 steel, unsullied by radiation, so I’ve been loving how much I can use it for research. For instance, I used it to see how well AI agents could respond to emails, like humans, in llmenron.

3

Like if we chose a branch point on Dec 15th 2000, we can see Tim Belden’s desk get a preservation order tied to the California power crisis. That was a major event!

4

I tried with a couple startups’ data, and a few more simulated ones, and at this alpha stage it still works.

Notes on Hong Kong

2026-04-12 11:18:53

I recently had the chance to do a quick Hong Kong trip, for a friend’s wedding. I’ve been before though the last time was around 15 years ago. But things have changed, and so have I. The summary opinion is that Hong Kong is amazing, and I found it a curious mix between Singapore and Calcutta. This was confusing, but it wasn’t the only part that was. To wit.

  1. Hong Kong is absolutely gorgeous. The sight of the skyscrapers against the mountain backdrop is divine. The best of man set against the best of nature.

  2. Hong Kong is also set in the past. Many of the skyscrapers are clearly old. And many are decrepit, in an absolute state of disrepair. I was told it was due to a combination of a) people not knowing who owned what since they were built so long ago and changing hands, b) there not being any concept of an HOA or a condo society to take care of maintenance, c) common practice of splitting up an apartment into 3-4 small holes so the outside too explodes with AC units, and d) something to do with organised crime. Maintenance truly is innovation.

  3. An Uber driver compared it to China, where he lived for many years, saying they don’t stand for this over there. They’d just demolish the old ones and rebuild, no questions asked.

  4. He also said that Shenzhen, a 13-minute train ride from Hong Kong, used to look like it - all mountainous and green - but when China decided to build there they flattened it and built on top of it. He seemed envious.

  5. The density makes sense because only a small part of the islands are built up, and the house prices are absolutely insane. I was told even out in the suburbs, which are quite remote, a 2000 sqft house is like 3m USD. In the wonderfully named mid-levels it’d be 1/3rd the size.

  6. Hotels are still cheaper than SF or NY or London though. By a lot. And nicer. And with nicer service. Asia is way nicer to travel to.

  7. Speaking of suburbs, Hong Kong suburbs do NOT look like Singapore suburbs, which I was expecting. They are villages, especially in the New Territories. Small roads, uneven development, open sewers in some places, even older buildings, fish smell near the water, the works. This also means those places are untouched for the most part and look just like villages do in like Vietnam. And still only 30 mins driving to the city.

  8. The sheer scale of building is a bit of a shock to the American sensibility. To know that sure, we can build bridges across canyons, build under water for driving tunnels between islands, roads, trains, multiple 80 storey skyscrapers next to each other on a 40 degree incline that is so steep I found it difficult to walk down. It’s a testament to what we can actually do if we have the will. Inspiring.

  9. Claude and ChatGPT are blocked in Hong Kong. They now have Gemini as of a couple weeks ago. When I asked my friend what they used, they said “Copilot”. I felt bad.

  10. Folks routinely order things they want from China. Flowers and vases for the wedding ($0.5 cents per vase + flowers), electronics, most cars are BYD. The flip side is that ordering is highly error prone (eg some flowers didn’t arrive, some came damaged) but since the price is so low it works out. Not unexpected at all but interesting.

  11. You see a large number of children every time you’re out - in the restaurants, in the parks, in cafes, on the sidewalk, next to the embankments - it’s wonderful, counterintuitive considering Hong Kong’s TFR, and again makes the US social scene seem so weird. Part of it’s that houses are tiny, domestic help is common, and dining out is a default option. More broadly though, the idea that kids should be with you whatever you’re doing, as opposed to making special arrangements for the kids to be looked after in your den if you want to go out, seems like a clear crux. France, as far as I know, is the only other place in the West where this attitude is prevalent, maybe parts of Scandinavia.

  12. “There is no industry in Hong Kong, that’s why I went to China”, said by an Uber driver, who was a mechanical engineer, worked as an R&D supervisor in Shenzhen, and moved back to Hong Kong after being forced to retire when he hit 60.

  13. There are large numbers of masked people everywhere still. I think the scars of Covid 19 built on the scars of SARS and public attitudes have fundamentally shifted. Many of them wear it outdoors which is quite odd to see, or when driving alone. An interesting laziness vs prudence equilibrium.

  14. The funicular tram to the top of one of the peaks is worth it. As is the hike around the mountain once up top. Absolutely wonderful. So is the LockCha teahouse in Hong Kong Park.

  15. There’s a street next to the water in Sai Kung which is filled with great coffeeshops, all eclectic and rather wonderful. Named whimsically, like Deer and Arm and Pan da, and Tales and Kachimushi and Winstons. They all have independent decor and cute tchotchkes everywhere and nice waitstaff and really good coffee.
    This is new. Cafes didn’t used to be this way even a decade ago. There were a few who believed in the hippy way of life and cared deeply, but now it’s an easily consumable customer option. With all the usual options, from matcha to yuzu.

  16. The secret is better supply chains as I understand it. China consumption surged. Higher quality green coffee is a large chunk of imports, across the world. Local roasting got popular, exceptional coffee roasting facilities are ubiquitous. And coffee machine production doubled over the decade. Plus the machines are better, so the floor of barista skill to make a good latte is much lower. Another major capitalist win, even at the cost of absorbing the fringe cultures that brought it about and made it an Aesthetic.

  17. The cafes have better options too, actual savoury food you can have alongside your coffee. This is one of Western civilisation’s biggest blind spots, that the only things you see in a cafe to buy are croissants and cakes. Asia does this right!

  18. The same proliferation due to globalisation also applies to bars. Even more so, actually, since the extra incentive to make things good is often less market forced and more intrinsic.

  19. Lan Kwai Fong is fun, but normal. Also hilly, which is probably a plus considering street drinking.

  20. Hong Kong does, and did, have a unique culture in movies and music. I grew up with it. But the visible manifestations of it being alive today are sparse. There’s an indie scene I’m told, but that’s true everywhere. Feels like a loss.

  21. Because of the combination of old, decrepit and new, the city very much has a blade runner vibe. I could see stories come alive here of a slightly grungy dystopian flavour. This also makes it a better representative of Asia than, say, Singapore. I lived in Singapore for close to a decade and it’s intensely comfortable in just the way that Hong Kong isn’t. There are no sharp corners there, but Hong Kong has them.

  22. The upscale parts are standardized just as everywhere in the world. Same restaurants with same decor, same brands selling the same kinds of things (with some Chinese brands that are less visible in the West), same glass and black steel chromed architecture, trendy shops with a bit of wood and vaguely Scandinavian/ Japanese DNA mixed in somewhere, a Starbucks logo discreetly peeking out somewhere so it’s visible but not obtrusive. At a glance it’s impossible to say where you are, which does have a charm but also engenders a sense of placelessness. This is sort of why Tyler’s dining guide exists.

  23. The feeling I got was that, seen from China, it used to be the future a couple decades ago, and now it’s stood still while it rose up. I had the same feeling about Japan, but here the upkeep is not nearly as good. So it remains an odd mix, with Singapore’s ultra modernist charm and greenery and upkeep alongside the unkempt remains of the 1950s and before that drastically require maintenance.
