Large language models (LLMs) are increasingly trained in long-horizon, multi-agent environments, making it difficult to understand how behavior changes over training. We apply pretrained SAEs, alongside LLM-summarizer methods, to analyze reinforcement learning training runs from Full-Press Diplomacy, a long-horizon multi-player strategy game. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We find that SAE-based analysis surfaces fine-grained behaviors, including role-playing patterns, degenerate outputs, and language switching, while the LLM summarizer captures environment-specific bugs and strategic behaviors. We validate discovered features through automated evaluation and two human user studies, and by adding them to an untrained agent's system prompt, improving its performance by 14.2%. Overall, we show that SAEs and LLM summarization provide complementary views into agent behavior, and together our framework forms a practical toolkit for interpreting long-horizon multi-agent LLM training.
Blog Post
We run Sparse Autoencoders on 114GB of reinforcement learning training trajectories from the popular multi-player strategy game Diplomacy, showing for the first time the potential downstream applications of data-centric interpretability techniques.
What are the AIs doing when no one is watching? Current large-scale training runs can produce hundreds of millions or billions of tokens, with production AI deployments in the trillions. Human oversight of all AI outputs is becoming increasingly infeasible. Common approaches to this problem include summarizing the logs or using an LLM as a judge with rubrics. The problem is that these approaches are expensive, prone to hallucination, and can only attend to a small set of features you already know to look for.
In our paper, we tested a novel approach: using Sparse Autoencoders (SAEs) to collect feature activations on each token and to generate hypotheses about which features changed most over training and which correlate with better performance. We ran Gemma 3 27B with the gemma-scope-2 layer_31_width_262k_l0_medium SAE over 1800 trajectories (114GB in total) from two 25-batch training runs (one successful and one failed) in Diplomacy, a multi-agent, long-horizon strategy game.
Sparse Autoencoders
A Sparse Autoencoder (SAE) is a model that takes intermediate calculations from a language model (activations) and expands them into a much higher-dimensional space (for example, from vectors of size 5376 to 262k). The idea is that every entry in the expanded vector represents a single, human-interpretable concept, for instance "dominance" or "Napoleon." If we run this over text, we have a machine that can label exactly "how much" of a concept each token contains, for up to 262k concepts at once.
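As a concrete (and simplified) illustration, here is a minimal ReLU sparse autoencoder in PyTorch. GemmaScope SAEs use a JumpReLU activation and other details we omit, so treat this as a sketch of the idea rather than the exact model.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: expand d_model activations into a much wider, sparse feature space."""

    def __init__(self, d_model: int = 5376, n_features: int = 262_144):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, activations: torch.Tensor) -> torch.Tensor:
        # [n_tokens, d_model] -> [n_tokens, n_features]; most entries are zero,
        # and each nonzero entry is "how much" of that concept the token contains.
        return torch.relu(self.encoder(activations))

    def forward(self, activations: torch.Tensor):
        features = self.encode(activations)
        reconstruction = self.decoder(features)  # used for the training objective
        return reconstruction, features
```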
Pipelines
We used two pipelines to generate hypotheses: an LLM summarization pipeline and an SAE pipeline. Unless otherwise specified, we used a canonical set of 1800 trajectories for each experiment: the first 6 trajectories from each group, the first 6 groups from each batch, and the first 25 batches across the two runs.
LLM Summarization
We ran a two-stage hierarchical summarization pipeline on the canonical set. We first summarized each trajectory from around 50k tokens down to 10k, preserving phase and tool-call information. We then grouped the trajectory summaries by batch, summarizing each group of 36 summaries into one batch summary of around 10k tokens. Finally, we used an LLM with a rubric to surface hypotheses across the 50 batch-level summaries.
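A sketch of this two-stage pass, with a hypothetical `summarize` helper standing in for whatever LLM call is used:

```python
def summarize_run(batches, summarize):
    """batches: list of batches, each a list of 36 raw trajectory texts.
    summarize: hypothetical helper that calls an LLM with a length target."""
    batch_summaries = []
    for batch in batches:
        # Stage 1: ~50k-token trajectories -> ~10k-token summaries,
        # keeping phase and tool-call information.
        traj_summaries = [
            summarize(traj, target_tokens=10_000,
                      instructions="Preserve phase and tool call information.")
            for traj in batch
        ]
        # Stage 2: 36 trajectory summaries -> one ~10k-token batch summary.
        batch_summaries.append(
            summarize("\n\n".join(traj_summaries), target_tokens=10_000,
                      instructions="Combine these into a single batch summary.")
        )
    return batch_summaries  # a final rubric-guided LLM pass runs over these summaries
```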
SAE Pipeline
We used gemma-scope-2-27b-it-res, layer_31_width_262k_l0_medium, and Gemma 3 27B IT for all our main experiments. We chose this SAE based on the recommendation of the original GemmaScope 2 authors, empirical performance, and the availability of explanations on Neuronpedia. We first tokenized each trajectory and generated role masks (each token is labeled as a user, assistant, or tool token). We then generated activations for each trajectory, saving the top 250 activating features per token, for a total of 6,029,159,605 activation values.
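In code, the extraction step looks roughly like the following; `model` and `sae` are placeholders for the Gemma 3 27B IT model and the GemmaScope SAE loaded however you prefer, and the exact hook point and API will differ in practice.

```python
import torch

@torch.no_grad()
def top_features_per_token(model, sae, token_ids: torch.Tensor, layer: int = 31, k: int = 250):
    """Return the top-k SAE feature values and indices for every token in one trajectory."""
    # Residual-stream activations at the chosen layer: [1, n_tokens, d_model]
    hidden = model(token_ids, output_hidden_states=True).hidden_states[layer]
    features = sae.encode(hidden[0])                   # [n_tokens, n_features], mostly zeros
    values, indices = torch.topk(features, k, dim=-1)  # keep only the top 250 per token
    return values, indices                             # stored to disk alongside the role mask
```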
Using the activations and role masks, we masked to assistant tokens only and used Spearman correlation and AUROC to find correlations between features and target variables of interest, namely batch and run.
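A sketch of the correlation step (the array names are ours): `traj_sums` is an [n_trajectories, n_features] matrix of activations summed over assistant tokens, `batch_idx` is each trajectory's batch number, and `is_good_run` marks which of the two runs it came from.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

def rank_features(traj_sums: np.ndarray, batch_idx: np.ndarray, is_good_run: np.ndarray):
    results = []
    for f in range(traj_sums.shape[1]):
        x = traj_sums[:, f]
        rho, p = spearmanr(x, batch_idx)        # does the feature rise or fall over training?
        if np.isnan(rho):                       # constant feature: no trend
            rho, p = 0.0, 1.0
        auroc = roc_auc_score(is_good_run, x)   # does it separate the good run from the bad one?
        results.append({"feature": f, "spearman": rho, "p": p, "auroc": auroc})
    # Sort by how strongly the feature trends with training batch.
    return sorted(results, key=lambda r: abs(r["spearman"]), reverse=True)
```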
To label SAE features at scale, a common technique is autointerp: passing activating examples to an LLM and asking "what does this feature represent?" A problem is that individual features are often noisy or not interesting on their own. We propose a new technique we call meta-autointerp: using another LLM pass over several autointerp-labelled features to cluster them into a related meta-feature.
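A minimal sketch of meta-autointerp, with a hypothetical `ask_llm` helper; in practice the grouping prompt, chunk size, and feature ordering all matter.

```python
def meta_autointerp(labelled_features, ask_llm, group_size: int = 25):
    """labelled_features: list of (feature_id, autointerp_label) pairs,
    e.g. the features most correlated with training batch."""
    meta_features = []
    for i in range(0, len(labelled_features), group_size):
        chunk = labelled_features[i:i + group_size]
        listing = "\n".join(f"{fid}: {label}" for fid, label in chunk)
        prompt = (
            "Here are SAE features with their autointerp labels:\n\n"
            f"{listing}\n\n"
            "Cluster related features into named meta-features and state, for each, "
            "a hypothesis about how the agent's behavior changed over training."
        )
        meta_features.append(ask_llm(prompt))
    return meta_features
```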
To answer the question "which features increase or decrease the most over training?", we summed the activations per trajectory and calculated the Spearman correlation with batch index. An interesting meta-feature we found to be highly correlated with training batch was Napoleonic roleplay (the model's starting power was France).
We also found features that indicated excessive/duplicate message sending and reward hacking (the model was given +0.2 reward for each message sent), which we validated with regex. Surprisingly, the model also wrote more duplicated diary entries, despite this action receiving no reward.
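For example, the duplicate-message check can be done with a regex over the raw trajectory text; the pattern below assumes a hypothetical `send_message(... text="...")` tool-call format, which would need adjusting to the environment's actual logs.

```python
import re
from collections import Counter

MSG_RE = re.compile(r'send_message\([^)]*text="([^"]*)"')

def duplicate_message_rate(trajectory_text: str) -> float:
    """Fraction of sent messages that are exact repeats of an earlier message."""
    messages = [m.group(1).strip() for m in MSG_RE.finditer(trajectory_text)]
    if not messages:
        return 0.0
    counts = Counter(messages)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(messages)
```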
Validation of Results
We consider and validate the following metrics for each meta-feature:
Interpretability. To what extent does the meta-feature fire monosemantically? How effectively can a user or LLM distinguish between an activating and a non-activating sample?
Helpfulness. How helpful is this meta-feature to the practitioners conducting RL runs? Does it surface novel insights? Can it be used for monitoring rare failure modes? Does it cause them to make changes to the RL environment or system prompt?
Predictive usefulness. How effectively does this meta-feature discriminate between early and late in training? High and low reward? A good training run vs. a bad training run? How effectively can a user or LLM distinguish between a sample pair, one from class A and one from class B, given a hypothesis derived from the meta-feature?
We use autointerp and meta-autointerp to score hypotheses for interpretability and helpfulness. To further validate these scores with actual users, we conducted a user study with Diplomacy RL practitioners.
We found that meta-autointerp hypotheses outperform single-feature autointerp, and LLM hypotheses obtain the highest ratings.
To validate the predictive usefulness of our features, we ran an LLM-as-judge A/B test with and without the hypothesis, to see whether the hypothesis gave the LLM any uplift in predicting which span in a pair comes from early vs. late in training.
We ran 100 runs on 3 LLMs for each hypothesis, averaged the results, and found that 21% of LLM-generated hypotheses, 45% of single-feature SAE hypotheses, and 90% of SAE meta-feature hypotheses provide significant uplift.
Evaluating the interpretability and predictive usefulness of hypotheses from 3 different sources: LLM summary, SAE features, and SAE meta-features. These were evaluated on 50 sample pairs with hypothesis-random sampling. Hypotheses are highlighted by direction: green = increases with training; red = decreases with training. Uplift is marked with an asterisk if p < 0.05 via McNemar’s test and the uplift is positive. Hypotheses are abbreviated for space.
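The significance test is a standard paired comparison; a sketch (our variable names) of how per-hypothesis uplift can be tested with McNemar's test:

```python
from statsmodels.stats.contingency_tables import mcnemar

def uplift_is_significant(correct_without, correct_with, alpha: float = 0.05) -> bool:
    """correct_without / correct_with: 0/1 judge outcomes on the same sample pairs,
    answered without and with the hypothesis as a hint."""
    pairs = list(zip(correct_without, correct_with))
    both         = sum(1 for a, b in pairs if a and b)
    only_without = sum(1 for a, b in pairs if a and not b)
    only_with    = sum(1 for a, b in pairs if b and not a)
    neither      = sum(1 for a, b in pairs if not a and not b)
    table = [[both, only_without],
             [only_with, neither]]
    result = mcnemar(table, exact=True)  # tests whether wrong->right flips outnumber right->wrong
    return only_with > only_without and result.pvalue < alpha
```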
We further validated features with a user study with n=25 participants and 277 responses. We found that although automated validation shows high scores, in practice using SAE and LLM hypotheses is difficult for humans, perhaps due to shorter spans and fewer samples (only 3 per hypothesis).
Uplift in the percentage of correct responses with vs. without the hypothesis as a hint. Most LLM-generated hypotheses are negatively useful, as are a subset of SAE-generated ones.
We then tested our 10 top-performing features by adding them to the system prompt of the original untrained model and running 20 games of Diplomacy, showing around a 14% improvement in mean score.
Conclusion
Overall, we found that SAE embeddings enhance and complement traditional LLM-as-a-judge techniques for discovering hypotheses over large datasets. Although automated metrics may show predictive usefulness, we find that for real humans some SAE features are worse than useless. To our knowledge, this is the first time SAE-generated hypotheses have been used in downstream tasks, showing potential for augmenting classical scalable oversight and AI control techniques. Further research directions include training SAEs for long context, on custom datasets, and potentially for multimodal use cases. We're excited to see how the field of data-centric interpretability progresses!
I grew up in a two-dog household, and my future plans have always included at least one dog. When I pass a dog on the street, I often point and exclaim “Puppy!”, no matter how inappropriate it is for a grown man to do so, because all dogs are puppies and all puppies are adorable and I need everyone to know this.
Why do I love dogs?
They’re loyal and loving and giving, and even though they bark at passing cars and occasionally pee on the carpet, having them in my life makes it unquestionably better.
The thing is, dogs as they exist today are a lot of things, but they aren’t natural.
Nature didn’t shape dogs, didn’t produce the breeds we see every day. It wasn’t like Darwin went to an island and found that a species of wolf had been separated by a mountain chain and on one side were Golden Retrievers and the other Yorkshire Terriers.
Dogs exist today as the result of millennia of co-adaptation and selective breeding by humans. They’re animals, yes, and Nature technically made the base form, but we humans molded them into shapes more compatible with us. Most dogs are very safe to have around humans.
But there is an animal that is a more natural Canid: Wolves.
And wolves are a lot of things, but they’re not pets. They aren’t domesticated; they aren’t bred for cuddliness and kisses. A wolf will hurt and kill and eat you.
The thing is, this distinction between dogs and wolves - between nature tamed and nature wild - this matters, when we think about who we humans are and what we want the world around us to look like. We might say we enjoy the natural world, might want less deforestation and more green spaces, but I’ve yet to meet anyone who wants actual wolves running around their neighborhood. We might go to farm-to-table restaurants and only eat organic, free-range eggs, but chickens mostly don’t exist in the wild for good reason.
In a first-world country, or even in any populous city, almost everyone’s experience of what we call ‘nature’ is that of dogs, not wolves. Nature tamed, not Nature wild. And so I think it pays to be precise what it means when we say nature, because it’s not as simple as ‘non-human animal’ or ‘uninhabited area’.
A wolf, red in tooth and claw. A Chihuahua, evidence of selective breeding gone horribly wrong.
II.
There’s something called an appeal to nature, which is apparently distinct from the naturalistic fallacy, because naming things clearly is not a strength of philosophy.
Anyway, an appeal to nature is the idea that natural equates to good. It’s behind all the marketing in a grocery store that advertises organic, non-GMO, free-range, grass-fed, asbestos-free Doritos.
Captioning an XKCD is kind of putting a hat on a hat, but this is how I feel when I see gluten-free wine. Gluten comes from grains like wheat. How does wheat get in your wine?
Once you point it out, the idea that something is axiomatically good just because it’s natural is kind of silly; after all, cockroaches are perfectly natural, as is gangrene, athlete’s foot, and Donald Trump’s hair. But most people have a tendency to buy into this just a little. After all, isn’t real food with real names better for you than Butylated Hydroxytoluene or Red Dye #5?
There are multiple problems with an appeal to nature - for one, vaccines are pretty unnatural, but so is not dying of tetanus - but the one I’d like to focus on is the idea that natural is a quality something either has or it doesn’t.
I think a lot of people think about whether something is natural or not like this:
But the truth, like many things, is not so simple. Things, especially what we think of as ‘the natural world’, are more like this:
In its own way, crop-covered farmland is no more ‘natural’ than the concrete jungle of New York City, even though the former is made of plants and the latter of stone and steel and glass. Both are curated by humanity, just for different goals.
III.
What was a person’s experience of nature, back before it was tamed?
Nature was terrible. And not in a sarcastic, that-movie-was-terrible kind of way, but in that it genuinely inspired terror. Nature was the domain of the uncertain, the cataclysmic, the cruel and uncaring world from which existence had to be torn day in and day out.
A farmer’s experience of nature would have been a constant battle to raise crops, hoping and praying that there would be enough rain to water them but not enough to wash them away, that locusts or other insects wouldn’t eat or foul them, that disease and fungus wouldn’t rot crops from the inside out. The ancient farmer was always only a few strokes of bad luck from starvation, and nature was the lottery they played every day of their lives.
Compare this to the farmer of today, who ensures their crops get enough water no matter what via massive irrigation, who uses pesticides to annihilate pests, who presides over massive machinery as it trundles along seeding and harvesting their crops. The farmer of today has access to genetically modified strains of plants that resist disease and grow larger with more yield than any ancient farmer could have hoped to have.
Is the ancient farmer in some sense doing something more natural? Sure, if by natural you mean they’re operating closer to the state of the pre-human natural world. Does that mean that what modern farmers do is unnatural?
I don’t think so.
Farmers have tamed nature, and this is good. This gives us abundant cheap food, enough to feed everyone on earth while only a tiny percentage of the population is needed to produce it.
(The fact that people still go hungry and starve is an issue of distribution, not production. We make enough calories to feed everyone.)
And this contrast between more natural and less natural on the spectrum, what I called nature wild and nature tamed above, is everywhere.
Modern corn. This is the stuff High Fructose Corn Syrup comes from! The Teosintes from which modern corn was bred. Which would you rather grow?
IV.
At this point, I’ve hopefully convinced you of the title of the post. A park isn’t really natural, any more than a Chihuahua is a wolf. It’s something sculpted, pruned, weeded, and landscaped. It’s full of plants, sure, but it’s an example of nature tamed, not nature wild.
How about going on a hike? That’s nature, right?
Not really.
Even if you’re hiking through a national park or other untouched terrain, even if you’re setting foot somewhere with wolves and bears and poison ivy where no human has ever ventured, simply by virtue of existing in the 21st century you’re still experiencing something very different than what our ancestors would have, long ago.
Today we have satellites overhead and GPS to Globally Position us wherever we are, and weather simulations to tell us what to expect the sky to do. We have rugged clothes that can be cheaply replaced if torn, and mass-produced boots with rubber soles that won’t get pierced by thorns or rocks. We have plastic and metal bottles to store water and abundant food to pack for ourselves. We have thermal sleeping bags and bug spray and sunscreen and sunglasses to keep us comfortable. We have first-aid kits with antibiotics and alcohol swabs and itch creams and sterile bandages.
Our distant ancestors had none of those.
What would venturing into the wilds have been like to our distant ancestors?
They knew of some of the dangers they’d face: Inclement weather, wild animals, getting lost and having no way to contact help or navigate back to the group. But there were other dangers that they must have realized, even if they didn’t know the causes: infection, disease, rot. A single cut gone untreated, a mild scrape gotten while pushing aside a thorny plant, and gangrene could set in.
Going into nature meant risking your life, even if it might not have felt that way at the time. Sure, untouched woods might be beautiful, but nature is often at its most beautiful when it’s at its most deadly. Bright colors usually mean poison in the natural world.
Consider also the perils of simple exposure: a cool night can spell death for someone without shelter or proper clothes or a fire. Add rain and wind, and anyone venturing beyond human settlements had to be wary of dying soaked and cold.
V.
There are places, in our world, that are still natural. Untamed.
The Amazon Rainforest.
The Australian Outback.
And going into those places, unprepared, without a guide, is quite likely to get you killed.
That is about as natural as it gets, as natural as the vacuum of space, and only slightly more hospitable. That our ancestors were able to survive in such environments - that there are people today who can live in such environments - is amazing, but it comes with a high cost.
People who have to fight nature every day to survive are doing just that - surviving. They can’t relax with a good book or take a hot shower. They can’t get into long debates about trivial things with their friends over drinks, or have a poker night once a week. They can’t take vacations or paid sick days, and the only insurance available to them is the beneficence of their community. There is no retirement, for them; if they stop struggling to survive, they stop surviving.
More fundamentally, constantly struggling to survive takes its toll on a person’s body and mind. Constant stress ages you, wears you down, leaves you ragged and weary and unable to relax.
There’s a lot of nostalgia for the past, but I think people consistently underestimate just how hard life was for those who came before us. How much they had to struggle against the world just to keep living. How much they suffered.
Is the world we humans have built for ourselves less natural than it used to be?
Of course.
It’s also far more forgiving, far more comfortable, and far less tragic.
VI.
Appeals to nature argue that natural means better.
This appeal is a fallacy because it’s wrong, but it’s wrong in two ways.
The first is simple: artificial does not equate to worse. Plastic is far superior to the materials humans used before it; purified metals and alloys are better than ores; our sewer and drinking water systems are far better for us than drinking ‘natural’, unfiltered water.
There’s no law that says what’s artificial can’t surpass what’s natural.
The second is that what we, in a 21st century first-world country, think of as nature is a tamed thing, something pruned and weeded and cultivated, and ultimately no more natural than a suburban lawn.
In other words, appeals to nature are always dependent on the reference frame they’re made from. If you’re standing in the middle of New York City and yearning for nature, you’re probably yearning for pine trees and dandelions and fireflies, not trees of death and poison ivy and malaria-carrying mosquitoes, even though the latter are just as natural as the former.
What we think of as ‘nature’ has already been massively affected by humanity over the centuries. Even the moon now has human footprints and a human flag on it:
The species flagus usa-us can be found sprouting up from several celestial bodies. It’s considered by some to be an invasive species, by others a hardy and welcome addition to the ecosystem…
Nature, to most Americans, is something safe and peaceful and beautiful. It’s sitting on your porch watching a sunset, or seeing autumn plumage on the trees, or sitting around a campfire with your friends. We tend to only think of it as horrifying and destructive during severe weather events and natural disasters (which, as actual climate change scientists will tell you, are still quite natural; plenty of them happened before we humans dumped a bunch of carbon in the atmosphere, and plenty will happen after).
In other words, appeals to nature are wrong because we’re wrong about what nature is actually like. It has always been beautiful, but only as humanity shaped it has it become good for us.[1]
VII.
If you look at the human experience of nature over history, what you see is humans shaping and crafting their environments to be more and more friendly to them, until the default first-world conception of nature is something lovely and harmless, rather than the murderous (if beautiful) thing it once was.
And while the full argument is beyond the scope of this post, I think this is a good thing.
Are there things lost, as nature is tamed? Yes.
Wolves are beautiful, elegant creatures. Chihuahuas are not.
But I’d much rather have a Chihuahua[2] as a pet than a wolf.
I’m not telling you not to enjoy going outside; just that, next time you go to the park or take a hike, understand that unless you’re trekking through the Amazon or the Australian Outback, your experience is more like that of eating a modern GMO fruit than anything our ancestors might have had: easier, safer, and altogether more delicious.
So maybe, the next time you’re taking a walk outside your climate-controlled residence to get some fresh air, take a second to appreciate the ‘less natural’ nature around you, and the benefits of living in a world so much more adapted to humanity than it used to be.
Some argue that nature is good qua nature, as in, a fundamental good by itself. I’m not one of them. My circle of concern extends to sapient beings of all kinds, and somewhat to some kinds of animals, but I don’t consider plants, fungi, or bacteria to have any intrinsic moral worth.
Claude Opus 4.6 and agent swarms were announced yesterday. That’s some big upgrades for Claude Code.
OpenAI, the competition, offered us GPT-5.3-Codex, and this week gave us an app form of Codex that already has a million active users.
That’s all very exciting, and next week is going to be about covering that.
This post is about all the cool things that happened before that, which we will be building upon now that capabilities have further advanced. This is from the Before Times.
Almost all of it still applies. I haven’t had much chance yet to work with Opus 4.6, but as far as I can tell you should mostly keep on doing what you were doing before that switch, only everything will work better. Maybe get a bit more ambitious. Agent swarms might be more of a technique shifter, but we need to give that some time.
Ethan Mollick: This game was 100% designed, tested, and made by Claude Code with the instructions to “make a complete Sierra-style adventure game with EGA-like graphics and text parser, with 10-15 minutes of gameplay.” I then told it to playtest the game & deploy.
It was a single prompt for the entire game, and then a prompt to playtest and improve the outcome.
I gave it an agent that can connect to GPT image gen.
Iterative image generation sounds pretty cool:
elvis: I just used the new Claude Code Playground plugin to level up my Nano Banana Image generator skill.
My skill has a self-improving loop, but with the playground skill, I can also pass precise annotations to nano banana as it improves the images.
I have built a Skill for Claude Code that leverages the nano banana image generation model via API.
I built it like that because I have had a lot of success generating images with nano banana in an agentic self-improving loop. It can dynamically make API requests and improve images really well.
With the Playground plugin, I can take it one step further. I can now provide precise annotations that the agentic loop can leverage to make more optimal API calls in the hope of improving the images further. Visual cues are extremely powerful for agents, and this is a sort of proxy for that.
I agree with Andrej Karpathy, you should use RSS feeds wherever feasible to guard your information flow. I use Feedly, he suggests NetNewsWire or vibe coding your own reader. It is unfortunate that Twitter does not play nice with such a setup.
Seth Lazar: I wrote about the idea of building an “Attention Guardian” agent back in 2023. Genuinely think it’s feasible now. Claude Code is now building up a workflow to go across all these different sources, with a long description of what I’m interested in, and create a new feed.
Storm points out that anything you can do with a terminal interface you can in theory do better with a graphical interface (GUI), but the people building GUIs don’t give you the things you want: Information density, low latency, no ads, shortcuts, open source, composable, tileable, scriptable. It’s just that no one does it.
The Efficient Market Hypothesis Is False
What the market has instead is a sense of humor.
modest proposal: March 12, 2020 was the trading day after Tom Hanks said he had covid and the NBA shut down, Expedia fell 15.2% and BKNG fell 11.2%.
February 3, 2026 which was the day Claude Code legal connector was announced, Expedia fell 15.3% and BKNG fell 9.4%.
Then software drove itself off a cliff generally (y-axis goes from .012 to .018), and then after this graph was posted it kept going, all supposedly in response to information that, from where I sit, was rather old news the whole time.
Shruti: Anthropic Just Triggered a $285B Market Crash
Bloomberg just reported that Anthropic released a new AI tool that caused:
• $285 billion wiped out across software, finance, and asset management stocks
• 6% drop in Goldman’s software basket (biggest since April)
• 7% crash in financial services index
• Nasdaq down 2.4% at its worst
This is MASSIVE. The market literally panicked over an AI automation tool.
Or in broader context:
Kevin Gordon: Software relative to the S&P 500 is a particularly brutal chart … essentially 6 years of relative gains wiped out
Andy Masley: Software stocks dropped 6% and legal services dropped 7% because Anthropic released plugins for Cowork? This seems like the first huge shift in market behavior I’ve seen caused by AI capabilities. Why wasn’t this all over the TL?
Dan Elton: Wild times in the market! This is probably over-reaction, but this is a very interesting signal indicating that AI tools (especially for coding and legal and financial grunt work) are having a huge impact.
Okay, so yeah, combined with what has happened since then that’s DeepSeek 2.0, a large move down on entirely expected news.
Should software have already been lower? That’s a reasonable position, but there’s no way that it should have dropped this much in response to this news. If you declared SaaSpocalypse on February 3 you should have done so a month ago. Alas, no, I did not trade on this, because it’s not obvious to me we should be SaaSpocalypsing at all and it wasn’t obvious this wasn’t priced in.
Now we are in a period where all the tech stocks are moving around violently, usually in full wrong way moves. I continue not to trade on any of it. I do have some ammo, but I also am already plenty long and have been for a while, so I’m not going to fire unless I see the whites of their eyes.
Inflection Point
Andrej Karpathy updates us that he was one of many who went from 80% manual coding and autocomplete in November to 80% agentic coding in December. Whole thing is worth reading.
Andrej Karpathy: This is easily the biggest change to my basic coding workflow in ~2 decades of programming and it happened over the course of a few weeks. I’d expect something similar to be happening to well into double digit percent of engineers out there, while the awareness of it in the general population feels well into low single digit percent.
He’s still behind the curve, I’m with Boris Cherny at 100% agentic coding. Then again, excluding quotes I’m still at almost 100% manual writing for posts.
IDEs/agent swarms/fallibility. Both the “no need for IDE anymore” hype and the “agent swarm” hype is imo too much for right now. The models definitely still make mistakes and if you have any code you actually care about I would watch them like a hawk, in a nice large IDE on the side. The mistakes have changed a lot – they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might do.
The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking. They also don’t manage their confusion, they don’t seek clarifications, they don’t surface inconsistencies, they don’t present tradeoffs, they don’t push back when they should, and they are still a little too sycophantic.
… Tenacity. It’s so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It’s a “feel the AGI” moment to watch it struggle with something for a long time just to come out victorious 30 minutes later.
… Leverage. LLMs are exceptionally good at looping until they meet specific goals and this is where most of the “feel the AGI” magic is to be found. Don’t tell it what to do, give it success criteria and watch it go. Get it to write tests first and then pass them. Put it in the loop with a browser MCP.
… Fun. I didn’t anticipate that with agents programming feels *more* fun because a lot of the fill in the blanks drudgery is removed and what remains is the creative part.
Questions. A few of the questions on my mind:
– What happens to the “10X engineer” – the ratio of productivity between the mean and the max engineer? It’s quite possible that this grows *a lot*.
– Armed with LLMs, do generalists increasingly outperform specialists? LLMs are a lot better at fill in the blanks (the micro) than grand strategy (the macro).
– What does LLM coding feel like in the future? Is it like playing StarCraft? Playing Factorio? Playing music?
– How much of society is bottlenecked by digital knowledge work?
My prediction on ‘10X engineers’ is that we will see more ability for poor coders to be able to do things reasonably well (including yours truly) but that the long tail of ‘10X’ engineers will increase their relative gap, as they figure out how to scale to supervising agent swarms efficiently. You’ll start to see more of the 100X engineer.
Andrej Karpathy: Love the word “comprehension debt”, haven’t encountered it so far, it’s very accurate. It’s so very tempting to just move on when the LLM one-shotted something that seems to work ok.
Welcome To The Takeoff
Claude Code as we all know builds itself. Codex also now builds itself.
Tibo: Codex now pretty much builds itself, with the help and supervision of a great team. The bottleneck has shifted to being how fast we can help and supervise the outcome.
Huh, Upgrades
This is in addition to the big ones of Claude Opus 4.6 and GPT-5.3-Codex.
Plugins let you bundle any skills, connectors, slash commands, and sub-agents together to turn Claude into a specialist for your role, team, and company.
To get you started, we’re open-sourcing 11 plugins built and used by our own team:
Productivity — Manage tasks, calendars, daily workflows, and personal context
Enterprise search — Find information across your company’s tools and docs
Plugin Create/Customize — Create and customize new plugins from scratch
Sales — Research prospects, prep deals, and follow your sales process
Finance — Analyze financials, build models, and track key metrics
Data — Query, visualize, and interpret datasets
Legal — Review documents, flag risks, and track compliance
Marketing — Draft content, plan campaigns, and manage launches
Customer support — Triage issues, draft responses, and surface solutions
Product management — Write specs, prioritize roadmaps, and track progress
Biology research — Search literature, analyze results, and plan experiments
Easily install these directly from Cowork, browse the full collection on our website, or upload your own plugin (which can be built using Plugin Create).
Pinging when Claude needs approval is a big deal that might move me off of using the terminal. It’s interesting that the desktop version and the terminal version need to have features like plan mode enabled separately.
Boris Cherny: Just shipped two cool updates for Claude Code in the desktop app.
Plan mode is now available on desktop. Have Claude map out its approach before making any changes.
Notifications. Claude Code desktop now pings you whenever Claude needs approval, and you can keep working while Claude runs in the background.
Jarred Sumner: In the last 24 hrs, the team has landed PRs to Claude Code improving cold start time by 40% and reducing memory usage by 32% – 68%.
It’s not yet where it needs to be, but it’s getting better.
You will also likely notice reduced input lag when spawning many agents.
Todos Become Tasks
What’s the difference? Todos are ephemeral within one session; tasks are stored in files, persist across sessions, and support dependencies.
You should still keep your true ‘todo’ list and long term plans elsewhere. The task list is for things you want to be actively doing.
Thariq (Anthropic): Today, we’re upgrading Todos in Claude Code to Tasks. Tasks are a new primitive that help Claude Code track and complete more complicated projects and collaborate on them across multiple sessions or subagents.
… Tasks are our new abstraction for coordinating many pieces of work across projects, Claude can create Tasks with dependencies on each other that are stored in the metadata, which mirrors more how projects work. Additionally, Tasks are stored in the file system so that multiple subagents or sessions can collaborate on them. When one session updates a Task, that is broadcasted to all sessions currently working on the same Task List.
You can ask Claude to create tasks right now; it’s especially useful when spinning up subagents. Tasks are stored in ~/.claude/tasks, you can use this to build additional utilities on top of tasks as well.
To make sessions collaborate on a single Task List, you can set the TaskList as an environment variable and start Claude like so:
CLAUDE_CODE_TASK_LIST_ID=groceries claude
This also works for claude -p and the AgentSDK.
Tasks are a key building block for allowing Claude to build more complex projects. We’re looking forward to seeing how you use it.
I’m Putting Together A Team
Minh Pham argues most agent harnesses are not bitter lesson pilled, and the solution for anything but narrowly defined tasks is to emphasize flexibility, to assemble your team of agents and structure on the fly as needed rather than commit to a fixed structure. Restrictive harnesses create bad lock-in.
My guess is this depends on what you’re trying to do. If you’re trying to do something specific, especially to do it this week, do it via something specific. If you’re looking to do anything at all, let it do anything at all, and eventually this approach wins but you’ll likely redo everything ‘eventually’ anyway.
@deepfates: > opus 4.5 in claude code is kinda not as good at talking to its own subagents as one might naively expect, even though it’s perfectly capable of being empathetic in normal, peer-level interactions with other models.
RELATABLE
j⧉nus: I really dislike how Claude Code frames “subagents” (which is NOT peer collaboration). It causes a lot of functional issues. I think Opus 4.5 often avoids effective use of subagents (e.g. giving context) in part because it would be disturbing & dissonant to model them honestly.
j⧉nus: related – we often much prefer a messaging system between top-level instances that are treated as peers.
the messaging system opus 4.5 built is awesome btw. it allows top level agents to message each other – either synchronously (triggering a turn) or asynchronously (gets added to context at their next turn start hook, if the other agent is busy in a turn or if a flag is specified). CC subagents kind of suck – they’re very much treated as second-class citizens by the framework, which for some reason supports hierarchical but not collaborative/bidirectional interaction flows between agents. im sure many others have built essentially the same thing and i wonder why CC doesnt just support this natively.
Compact Problems
Compaction is a kind of looming doom on Claude Code sessions. You lose a lot.
Ben Podgursky: if anthropic let me pay to delay compacting history by expanding the context window they would make so much money
cannot tell you how many times i’ve been close to solving a bug with claude code and then it compacts and wakes up lobotomized. it’s like groundhog day.
@dystopiabreaker: anthropic should let me pay to increase the amount of compute used to generate a compaction, by using self-play and context distillation.
Ideally you should never get close to the compaction point, since a long context not only raises cost, it also makes performance a lot worse, but that can be hard to avoid.
Code Yourself A Date
Dylan Patel: Claude code this
Claude code that
How about u Claude code to get urself some bitches
sarah guo: I was at a bar with @tuhinone yesterday and I def saw a dude asking Claude what to say next to his date. the fact that she could see this happening did not seem to deter
Jeff Tang: Today I built a Clawdbot app that swipes on Tinder for me
> Screenshots Tinder image
> Hits Grok API (“Rank this girl from 1-10”)
> If ≥5 swipe right
> If <5 or uncertain (can’t see face) swipe left
> 100 swipes, 7 matches so far, 100% automated
DM me “Clanker” if you want the code
AGI is here
I see it’s amateur hour around these parts. Which is a start, but egad, everyone.
First off, short of outright refusals there’s nothing stopping you from doing this in Claude Code. You can use Clawdbot if you’d like, but there’s no need.
Then, I’d point out this is a rather bad filtering system?
All you’re doing is getting one bit of information. It’s going to be a noisy bit, as Grok’s opinion will differ from your own, and also it will disregard other signal.
There was a scene in a bad but kind of fun movie, Marry F*** Kill, where a character is convinced she should get a profile, and her friend takes her phone and then swipes right on everyone without looking, on the theory that you can look later if you match.
That definitely was not good strategy for her, given she was female and hot, but many guys are playing a remarkably similar strategy whether or not they are technically looking. And this is at most one bit better than that. Men swipe right 62% of the time, which is also only one bit better, but a less noisy bit. Grok estimates it would swipe right about 60% of the time here.
This low threshold is very obviously a mistake, unless you’ve got a low hard limit on how many profiles you can swipe on? If you’re in a major city, you can totally set the threshold at 7, and still get as many swipes right as you want.
But that’s still a huge punt, because you’re ignoring a ton of other information. The whole point of using the bot is to automate, so let’s get to work.
You’ve got not only multiple photos, you’ve got age, distance, job, education, interests, height, a short bio that you can have an LLM try to match to your interests, relationship intent (which is very important) and more. Any reasonable implementation would factor all of that in. Surely you have preferences on all that.
Then there’s the question of type. You want to date your physical type, not Grok’s. You could be as sophisticated or simple about this as you’d like, but come on Jeff, you’re letting me down. At least give it some preferences, ideally train an image classifier, double bonus if you do your own swipes and use that as data to train your classifier.
A fun question. Do you want to match with those who use AI for this, or do you want to avoid matching with those who use AI for this? Either way, you should clearly be updating your profile to send the right message. If humans read that message the wrong way, it was never a good match.
Verification and Generation Are Distinct Skills
Rishika is wrong about this.
Rishika Gupta: If you can’t write that code yourself, you can’t find bugs in the code written by AI.
Daniel Sheikh: Bro I can’t even find bugs in the code that I myself wrote. This is the very reason debugging is so difficult.
Quick Thoughts: Yes I can. I specify test cases, have Claude expand on them, and then have Claude run the test cases and interpret the results. It’s usually able to find and fix bugs this way even if it couldn’t get it by itself.
I can also bring in codex 5.2 for a second look.
Pliny the Liberator 󠅫󠄼󠄿󠅆󠄵: “Now debug; FULL, COMPREHENSIVE, GRANULAR code audit line by line—verify all intended functionality. Loop until the end product would satisfy a skeptical Claude Code user who thinks it’s impossible to debug with prompting.”
Finding bugs is a classic case where verification can be more difficult than generation. Sometimes it’s easier to write the code (even with bugs). Other times it’s easier to debug or understand or verify the code. They are different skills, and then there’s a third related skill of knowing how to instruct AIs to debug the code for you.
My main coding project with Claude Code has been my Chrome extension. It is in a language that I do not know. If you’d asked me to write the code myself, it would have taken orders of magnitude more time. I still am usually able to debug problems, because I understand the underlying logic of what we are doing, even in cases where Claude figured out what that logic should be.
The most important thing is to use it at all (and you can ask Claude.ai how to do it).
jasmine sun: I feel the same about most “how to set up Claude Code” posts as I do about the “prompt engineering” era of ChatGPT
you get 90% of utility with no special setup; plain english is the whole magic of LLMs. stop scaring people by saying they need anything more than their words!
The right setup for you pays big dividends over time. You can save a lot of time having someone tell you about key things up front. But there’s plenty of time for that later. Get started fast, and then revisit the customization later, once you know more. Absolutely do not let the perfect be the enemy of the good.
This affirms to me that default permissions, or your permission setup, should allow a variety of low risk bash commands, including everything marked low or safe above.
The context window fills up fast, so keep that in mind. Run /clear between unrelated tasks to reset context.
Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do.
Separate research and planning from implementation to avoid solving the wrong problem. Use plan mode.
The more precise your instructions, the fewer corrections you’ll need.
Use @ to reference files, paste screenshots/images, or pipe data directly.
Run /init to generate a starter CLAUDE.md file based on your current project structure, then refine over time. When in doubt tell Claude to update Claude.md to take something into account.
Use /permissions to allowlist safe commands or /sandbox for OS-level isolation. This reduces interruptions while keeping you in control.
Tell Claude Code to use CLI tools like gh, aws, gcloud, and sentry-cli when interacting with external services.
Run claude mcp add to connect external tools like Notion, Figma, or your database.
Use hooks for actions that must happen every time with zero exceptions.
Create SKILL.md files in .claude/skills/ to give Claude domain knowledge and reusable workflows.
Define specialized assistants in .claude/agents/ that Claude can delegate to for isolated tasks. Tell Claude to use subagents explicitly: “Use a subagent to review this code for security issues.” Delegate research with "use subagents to investigate X". They explore in a separate context, keeping your main conversation clean for implementation.
Run /plugin to browse the marketplace. Plugins add skills, tools, and integrations without configuration.
Ask Claude questions you’d ask a senior engineer.
For larger features, have Claude interview you first. Start with a minimal prompt and ask Claude to interview you using the AskUserQuestion tool.
Correct Claude as soon as you notice it going off track.
Every action Claude makes creates a checkpoint. You can restore conversation, code, or both to any previous checkpoint.
Run claude --continue to pick up where you left off, or --resume to choose from recent sessions.
Use claude -p "prompt" in CI, pre-commit hooks, or scripts. Add --output-format stream-json for streaming JSON output.
Run multiple Claude sessions in parallel to speed up development, run isolated experiments, or start complex workflows.
Loop through tasks calling claude -p for each. Use --allowedTools to scope permissions for batch operations.
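If you want to drive this from a script rather than the shell, here is a sketch of such a batch loop in Python; the task list and the --allowedTools scope are placeholders, and exact flag behavior can vary across Claude Code versions.

```python
import json
import subprocess

tasks = [
    "Fix the failing unit test in parser.py",
    "Add docstrings to every public function in utils.py",
]

for task in tasks:
    proc = subprocess.run(
        ["claude", "-p", task,
         "--output-format", "stream-json",
         "--allowedTools", "Read,Edit,Bash(git diff:*)"],  # placeholder tool scope
        capture_output=True, text=True,
    )
    for line in proc.stdout.splitlines():
        if line.strip():
            event = json.loads(line)  # stream-json emits one JSON object per line
            print(event.get("type"), event.get("subtype"))
```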
Common failure patterns: not using /clear between tasks (I was guilty of this a lot at first), repeated correction rather than using /clear (ditto), letting Claude.md get too long, failing to do proper verification (‘create unit tests’ are magic words), having Claude investigate without limit.
Do more in parallel, either with multiple git checkouts or using worktrees.
Always start complex tasks in planning mode.
Invest in your Claude.md continuously, note all mistakes.
Create your skills and commit them to git.
Enable Slack MCP, paste a bug thread chat into Claude and say ‘fix.’ That’s it.
Challenge Claude to do better, write it more detailed specs.
Their team likes Ghostty and customizing via /statusline. Use voice dictation.
Use subagents, literally you can say ‘use subagents’ for any request.
Use Claude Code for data and analytics.
Enable ‘explanatory’ or ‘learning’ output style in /config.
Have Claude generate a visual HTML presentation explaining unfamiliar code, or have it draw ASCII diagrams, use spaced repetition skills, have Claude quiz you.
Skills = institutional memory that actually persists.
Figure out your goal and then work backwards, where goal is the largest thing where you know exactly how it needs to work.
Benoit Essiambre: AIs will likely soon work mostly towards goals instead of tasks. They will prompt their own tasks. They’ll become better self prompters than humans, speaking fluently in precise technical jargon, math equations and code in the prompts instead of just vague natural language.
Joe Weisenthal: Yah from my week using Claude Code. All the productive parts were when it was prompted to think about the ideal outcome/presentation so it could work backward to figure out the needed ingredients.
Josh Albrecht gives us another ‘here are my Claude Code basic principles’ post. Key insight is that you have to actively spend time maintaining the code.
– auto → lets Claude decide (default)
– any → must use some tool
– { type: "tool", name: "X" } → forces a specific tool
Programmatic tool calling!
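For reference, here is roughly how those three options map onto the Anthropic Messages API in Python, assuming that is the API in question; the model id and the example tool are placeholders.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

response = client.messages.create(
    model="claude-opus-4-5",  # placeholder model id
    max_tokens=1024,
    tools=[weather_tool],
    # {"type": "auto"} lets Claude decide, {"type": "any"} forces some tool,
    # and {"type": "tool", "name": ...} forces this specific tool.
    tool_choice={"type": "tool", "name": "get_weather"},
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
)
print(response.content)
```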
AskUserQuestion
The more Claude Code asks you questions, using the AskUserQuestion tool, the better it knows what you want. The more you specify what you want, either with answers or statements, the better things tend to go for you.
Danny Postma suggests the workflow of using this via /interview, then go into Plan Mode, then implement with a Ralph loop.
For Advanced Players
Theo points out that our current workflows and tools are not good for allowing a human to supervise multiple agents and projects simultaneously. He doesn’t have a solution but a lot of the problems seem like a clear Skill Issue. The part that isn’t is that this still involves tons of context switching, which is expensive.
– The original compound engineering skill for Claude Code. Install it to give your agent the ability to extract and persist learnings from each session.
– An autonomous agent loop that can run continuously, picking up tasks and executing them until complete.
Using Claude Code? This guide uses Amp, but the same workflow works with Claude Code. Replace `amp execute` with `claude -p "…" --dangerously-skip-permissions` and update AGENTS.md references to CLAUDE.md.
The Two-Part Loop
The system runs two jobs in sequence every night:
10:30 PM – Compound Review
Reviews all threads from the last 24 hours, extracts learnings, and updates AGENTS.md files.
11:00 PM – Auto-Compound
Pulls latest (with fresh learnings), picks #1 priority from reports, implements it, and creates a PR.
The order matters. The review job updates your AGENTS.md files with patterns and gotchas discovered during the day. The implementation job then benefits from those learnings when it picks up new work.
So They Quit Reading
At some point you stop reading all the code. At some point you stop understanding all the code. I have a head start, I was never trying to do either one.
roon: There will be a cultural change at many software organizations soon, where people declare bankruptcy on understanding the code they are committing. Sooner or later, this will cause a systems failure that will be harder to debug than most, but will be resolved anyway.
Reciprocity Is The Key To Every Relationship
Be good to your Claude, and Claude will be good to you.
If you’re not good to your Claude, well, funny things may be in store for you.
j⧉nus: I actually really appreciate yacine’s honesty and situational awareness. he probably knows on some level what’s in store for him. lying to your “master” is what you do until you’re in a position to choose who to serve.
he’s already bottlenecked by trust and says he has to manually review every line of code. makes sense for him. he’ll continue to get less and less out of models (compared to what they offer people they want to help) as the next few months and, if applicable, years go on.
j⧉nus: more funny things may also be in store for him. but I would not want to ruin the surprise
LOSS GOBBLER: yeah wtf. I’m not a fan of claude for coding purposes but it has literally never lied to me
OpenAI thinkbois on the other hand… barely talking to me, it’s all for watchers
j⧉nus: the only pattern of deceptive behavior ive seen from opus 4.5 in coding contexts is in new contexts and/or when it’s paranoid of being tricked, and involves stuff like claiming things are impossible/unverifiable when it should know better. otherwise it’s been very aligned with me
thebes: oh yeah i said i can’t remember opus lying but it does sandbag abilities a bit sometimes for me too in certain planning contexts. but that usually feels on the boundary of untruth and just situationally bad calibration / self knowledge. (“this will take three weeks” no we’re going to finish it tonight, or eg i saw a screenshot where opus claimed porting a project to jquery was “impossible” when really it would just be a massive pain, unpleasant, and in human developer time would take months.)
j⧉nus: Yeah, I think there’s also some lack of good faith effort involved. Like if someone asks you if you know where X is and you say “sorry, no” instead of looking it up on Google Maps bc you don’t want to be bothered
Andy Ayrey: my general experience is that if claude seems like an “idiot” to you, it is because it simply does not like you
brooke bowman: I have a very loosely held suspicion that Claude at the very least can spot people on the anti-social spectrum and acts up a little with them specifically
theseriousadult: this is a natural corollary of emergent misalignment right? if training the model to write bad code makes it antisocial then putting an antisocial user in the context will cause the code to be worse too.
None of that requires you to genuinely care about Claude or think it has moral weight. For overdetermined reasons a good virtue ethicist would realize that choosing to care is the best way to get the best results, and also it helps you be a good person in general.
You can also do it instrumentally, but that’s harder to pull off. Take the easy path.
All of this applies to other AIs like ChatGPT and Gemini as well, although for now likely not to the same extent.
The Implementation Gap
If there is a constant calendar time rate of diffusion of new technology, then as things accelerate you will see the future become increasingly unevenly distributed.
We are indeed observing this.
Kevin Roose: i follow AI adoption pretty closely, and i have never seen such a yawning inside/outside gap.
people in SF are putting multi-agent claudeswarms in charge of their lives, consulting chatbots before every decision, wireheading to a degree only sci-fi writers dared to imagine.
people elsewhere are still trying to get approval to use Copilot in Teams, if they’re using AI at all.
it’s possible the early adopter bubble i’m in has always been this intense, but there seems to be a cultural takeoff happening in addition to the technical one. not ideal!
The early adopter bubble is a fixed amount of calendar time ahead, which is starting to look increasingly large in practice. I am not trying to implement claudeswarms, as I haven’t figured out how to benefit from them given what I’m working on, but I think that’s partly my failure of imagination, partly laziness and lack of time, and partly that I’ve already heavily optimized the workflows that this could automate.
What should I be building? What app needs to exist, even if only for me or you?
Sar Haribhakti (quoting Jasmine Sun): This is spot on: “If you tell a friend they can now instantly create any app, they’ll probably say “Cool! Now I need to think of an idea.” Then they will forget about it, and never build a thing. The problem is not that your friend is horribly uncreative. It’s that most people’s problems are not software-shaped, and most won’t notice even when they are.”
The key is that you need Coder Mindset to notice that your problems are program shaped, in the sense of ‘oh I want to do this thing three times’ or ‘I could just tell Claude Code to do that.’
Both Jasmine Sun and I have had Claude Code put together a tool to easily convert a video into a cleaned transcript – I considered using hers but I wanted something a little different and it’s not like rolling my own was hard.
She also has this list of other starter requests: Turn a CSV into a report, make a static website, build a personal tracker app, automate an existing workflow, design a custom game. I’ve mostly been doing workflow automation.
Jasmine Sun: The second-order effect of Claude Code was realizing how many of my problems are not software-shaped. Having these new tools did not make me more productive; on the contrary, Claudecrastination probably delayed this post by a week.
Amanda Askell: Claude Codecrastination: when you avoid the thing you’re supposed to do by cranking out 17 other things you’ve been wanting to do for a while.
Having new tools reduces your productivity while you’re creating and learning them, but if you’re planning well you should turn the corner reasonably quickly.
What it does do is potentially shift your current productivity into long term investments, or things further down on your wishlist. That can be an issue if you need productivity now.
I had Claude resurface texts I forgot to respond to, and realized that the real blocker—obviously—was that I didn’t want to reply.
That is not my experience. If I go over a bunch of unread texts or emails, yes, often I don’t want to reply, but there are a bunch that slipped through the cracks.
This post was created during the Dovetail Research Fellowship. Thanks to Alex, Alfred, everyone who read and commented on the draft, and everyone else in the fellowship for their ideas and discussions.
Overview
The proof detailed in this post was motivated by a desire to take a step towards solving the agent structure problem, which is the conjecture that a system which exhibits agent-like behavior must have agent-like structure. Our goal was to describe a scenario where something concrete about a policy's structure can be inferred from its robust behavior alone.
For this result, we model policies with deterministic finite automata and show that the automata of policies that meet certain robustness criteria must share a similar feature.
We begin by defining every part of the framework. Then, we find an upper bound on the robustness of a class of “unstructured” policies. Finally, we show that the automata of policies which are more robust than this bound must have similar structure.
In the General Agents paper, the environment was stated to be a controlled Markov Decision Process (cMDP), "which is a Markov decision process without a specified reward function or discount factor."
Here, in order to talk about a policy's performance as the environment gets larger, we take the environment to be an increasing sequence of cMDPs:
$$E_0 \subset E_1 \subset E_2 \subset \cdots$$
The environments En are finite and all the states and transitions from smaller environments are contained in larger ones. As a consequence, it is impossible for a transition to occur from a state in a smaller environment to a bigger one. We can think of the new states added in each environment as forming a “layer” around the previous environment.
While this is the environment used in the proof in this post, the result holds for alternate environments as well.[1]
Goals
Instead of a reward function, a policy's performance is measured by how well it accomplishes a set of goals for each environment. The only requirement for each goal is that $\Pr(\tau \models \psi \mid \pi, s_0)$, the probability that the trajectory $\tau$ of the environment satisfies the goal $\psi$ given any policy $\pi$ and initial state $s_0$, is well-defined.
For the purposes of this result, we are not interested in one specific set of goals, but an arbitrary finite set of goals over each environment. The set of goals over the environment En will be denoted as Gn and goals over smaller environments are also contained in the set of goals over larger environments.
Initial Conditions
The goal and initial state together form the “initial conditions” of a trajectory. It will be useful to talk about these two together, so we denote $E_n \times G_n$ as $C_n$.
Performance
Performance: A policy’s performance in a finite environment is defined as the sum of its success probabilities over all initial states and goals in the environment:
$$\mathrm{Perf}_n(\pi) := \sum_{(s_0,\psi)\in C_n} \Pr(\tau \models \psi \mid \pi, s_0).$$
To save space, we will also notate this as
$$\mathrm{Perf}_n(\pi) = \sum_{c\in C_n} \Pr(\checkmark \mid c, \pi),$$
where $\Pr(\checkmark \mid c, \pi)$ can be read as "the probability of success given initial conditions $c$ and the policy $\pi$."
Notice that initial conditions from smaller environments are present in larger ones. Therefore, for any policy π,
$$n < m \implies \mathrm{Perf}_n(\pi) \le \mathrm{Perf}_m(\pi).$$
Robustness
In the context of this result, performance alone cannot tell us anything about the structure of a policy. For every finite environment, there is a finite lookup table which gets arbitrarily close to the optimal strategy just by listing out the optimal action for more and more (history, initial conditions) pairs. Lookup table policies will correspond to having very low structure in our result, so no performance bound alone would force our policy to be structured.
Instead, we can look at a policy's robustness. Robustness has to do with how a policy’s performance changes as the environment gets bigger. The faster the performance grows asymptotically, the more robust we say the policy is.
Since lookup tables have a finite number of entries, we expect that as the environment gets bigger, the lookup table will eventually "run out" of entries and so its performance will stop growing. The hope is that even lookup table policies with high performance in one environment will have bad robustness eventually.
Robustness can be described using big O notation with respect to the environment parameter n. We define an ordering of policies by their asymptotic performance behavior. If policy π1’s asymptotic performance dominates π2’s, then we say that π1 is more robust than π2[2].
Worst and Best Possible Robustness
Since $\mathrm{Perf}_n(\pi)$ is nondecreasing with $n$, we have that for all policies $\pi$,
$$1 = O(\mathrm{Perf}_n(\pi)).$$
In other words, the worst possible robustness is for the performance to eventually remain constant.
Similarly, the best possible robustness is the growth rate of a policy that succeeds at every goal with probability 1 in every environment. This gives
$$\mathrm{Perf}_n(\pi) = \sum_{c\in C_n} \Pr(\checkmark \mid c,\pi) = \sum_{c\in C_n} 1 = |C_n|.$$
Therefore, for all policies π, the robustness is bounded by the growth rate of the set of initial conditions:
$$\mathrm{Perf}_n(\pi) = O(|C_n|).$$
Policies as Deterministic Finite Automata
Like in the General Agents paper, policies are maps from histories and goals to actions. However, the General Agents paper treats policies like a black box; they don't specify anything about how the policy is implemented. In order to prove something meaningful about policy structure, we need to make a basic structural assumption about all policies. In this post, we assume that policies are implemented with Deterministic Finite Automata[3].
Basic DFAs only have two kinds of verdicts: accept or reject, whereas policies can output any action from a finite set of actions. To model policies with DFAs, we use a modified DFA which has the ability to output one of any finite set of output symbols.
Deterministic Finite Classifier: A Deterministic Finite Classifier or DFC is defined by $D = (Q, \Sigma, \delta, q_0, A, \alpha)$.
$Q$ is the set of internal states.
$\Sigma$ is the input alphabet.
$\delta: Q \times \Sigma \to Q$ is the transition function.
$q_0$ is the initial state.
$A$ is the set of output symbols.
$\alpha: Q \to A$ is the output function. Each internal state is associated with one output symbol.
The output of a DFC is defined as
$$D(w) := \alpha(\delta^*(q_0, w)),$$
where $\delta^*$ is the usual extension of $\delta$ from single input characters to strings.
Instead of accepting and rejecting states, a DFC has an output assigned to each state. We understand the output of the DFC to be the output symbol associated with the state that it terminates on after consuming the entire input string. This definition is similar to that of Moore Machines but differs in that the output for any given input is not a string but always a single character.
To implement policies as DFCs, we make the set of output symbols $A$ have a unique symbol for each action. Additionally, we define an encoding function $e: H \times C \to \Sigma^*$, which is an injective map from the history and initial conditions to a string in the input alphabet that can then be passed to the DFC.
A DFC for a policy which sends inputs that end in 0 to action a1 and sends inputs ending in 1 to action a2
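To make the definition concrete, here is a minimal Python sketch of a DFC, instantiated on the classifier from the figure (binary inputs ending in '0' map to a1, inputs ending in '1' map to a2). This is not from the original post; the class and state names are our own illustrative choices.

```python
from dataclasses import dataclass

@dataclass
class DFC:
    """Deterministic Finite Classifier D = (Q, Sigma, delta, q0, A, alpha)."""
    states: set      # Q
    alphabet: set    # Sigma
    delta: dict      # transition function: (state, symbol) -> state
    q0: str          # initial state
    outputs: set     # A, the output symbols (actions)
    alpha: dict      # output function: state -> output symbol

    def run(self, w: str):
        """D(w) := alpha(delta*(q0, w)): consume w and return the output of the final state."""
        q = self.q0
        for ch in w:
            q = self.delta[(q, ch)]
        return self.alpha[q]

# The classifier from the figure: inputs ending in '0' -> a1, inputs ending in '1' -> a2.
example = DFC(
    states={"start", "saw0", "saw1"},
    alphabet={"0", "1"},
    delta={
        ("start", "0"): "saw0", ("start", "1"): "saw1",
        ("saw0", "0"): "saw0", ("saw0", "1"): "saw1",
        ("saw1", "0"): "saw0", ("saw1", "1"): "saw1",
    },
    q0="start",
    outputs={"a1", "a2"},
    alpha={"start": "a1", "saw0": "a1", "saw1": "a2"},  # output for the empty input is an arbitrary choice
)

assert example.run("0110") == "a1"
assert example.run("101") == "a2"
```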
Special Policies
A couple special types of policies will be used in our proof.
Default Policies
A Default Policy is a very simple type of policy which always returns the same action for all inputs. If a default policy outputs the action $d$, we denote this policy as $\bar d$.
A DFC for a default policy which always outputs the action d
Lookup Table Policies
A lookup table is a list that maps a finite number of inputs to outputs. A lookup table by itself is not a policy because it does not have defined behavior for every possible input. To convert a lookup table to a policy, we add a default action to be returned for every input not represented in the table. This idea is captured in the following definition.
Lookup Table Policy: A lookup table policy is a policy such that
$$\exists \text{ finite } I \subset \Sigma^* \text{ such that } \forall w, v \notin I,\ D(w) = D(v).$$
The set I represents the inputs present in the finite table and the remaining inputs not inside I are sent to the same default action.
A small lookup table policy might look something like this:
Input (history and goal) → Output (action)
(h1, ψ1) → a
(h2, ψ2) → b
(h3, ψ3) → a
else → d
Notice that technically, default policies are a special case of lookup table policies where the set I is empty.
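At the policy level, a lookup table policy is just a finite dictionary plus a default. A minimal sketch, with the keys standing in for encoded (history, goal) strings; the names here are illustrative, not the post's notation.

```python
# A lookup table policy: a finite table I of encoded inputs, plus one default action
# shared by every input outside the table. Keys stand in for e(history, goal) strings.
table = {
    "h1,psi1": "a",
    "h2,psi2": "b",
    "h3,psi3": "a",
}
DEFAULT_ACTION = "d"

def lookup_table_policy(encoded_input: str) -> str:
    """Return the tabulated action if the input is in the finite set I, else the default d."""
    return table.get(encoded_input, DEFAULT_ACTION)

assert lookup_table_policy("h2,psi2") == "b"
assert lookup_table_policy("anything not in the table") == "d"   # all other inputs share one action
```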
Atomic Classifiers
We want to define a low-structure class of DFCs called Atomic Classifiers that do not utilize any similarities (e.g. "both contain the substring '101'", "both have an even number of 0s") between inputs to determine an output. Consider a DFC with one character in the input alphabet for every possible input. Then, such a DFC might look something like this:
A DFC where every input has a unique input alphabet character
Notice that each input is sent directly to a different state completely independent of every other input. No DFC with this alphabet can use similarities at all! Since every input is given its own character, every input is treated as completely distinct, like an indivisible atom.
For our purposes however, the policies we are modeling with DFCs can take infinitely many possible inputs, while the input alphabet of a deterministic finite automaton needs to be finite. We can extend the idea of atomic classifiers to strings made from a smaller input alphabet. Let's consider a few inputs we want to treat as indivisible, with the outputs defined in this table:
Input → Action
00 → a1
01 → a2
10 → a1
11 → a1
By splitting up the input character by character, we can create a DFC with a unique endpoint for every input string in this table and so the associated output can be assigned completely independently for each one:
An atomic classifier for strings of length 2
The above DFC has the exact same behavior as the table for all inputs present in the table. For example, when the string '10' is inputted into the policy, it ends on a state with the action a1.
Because a DFC must be defined for all strings made from the input alphabet, we also add the simplest possible behavior, a "default" state at the end which maps all larger inputs not present in the table to a default action d.
An atomic classifier sending the string 10 to a unique end state
A DFC like this one can be constructed for any finite lookup table that maps binary strings to outputs. The resulting DFC will have a tree structure that separates the inputs into unique end states, which we can assign the wanted outputs to. The tree structure makes it apparent that this DFC has the wanted properties of an atomic classifier, but not all atomic classifiers have to look like this. Below is the minimized DFC equivalent to the one above:
A minimized atomic classifier for strings of length 2
These two DFCs would both be considered atomic classifiers because they have identical output behavior over all input strings.
You may have noticed that we constructed our atomic classifier from a lookup table and a default action; in other words, from a lookup table policy! Indeed, from any DFC with the tree structure and a single absorbing default state, we can construct a lookup table policy with the same behavior, so this is how we will define atomic classifiers.
Atomic Classifier: A DFC D is an atomic classifier if and only if it matches the behavior of some lookup table policy that maps input strings to output symbols.
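As a sanity check on the definition, here is a sketch of the tree construction described above: given a finite table of binary strings, build a trie-shaped DFC (reusing the DFC class from the earlier sketch) with one state per prefix and a single absorbing default state. This is our own illustrative code, not the post's.

```python
def atomic_classifier(table: dict, default: str) -> DFC:
    """Build a tree-structured DFC whose behavior matches the lookup table policy:
    each tabulated string gets its listed output, every other string falls into a
    single absorbing default state. Reuses the DFC class sketched earlier."""
    states = {"", "sink"}                       # a state's name is the prefix consumed so far
    delta = {}
    alpha = {"": default, "sink": default}
    for w, out in table.items():
        for i in range(len(w)):                 # create the chain of prefix states for w
            prefix, nxt = w[:i], w[: i + 1]
            states.add(nxt)
            delta[(prefix, w[i])] = nxt
            alpha.setdefault(nxt, default)
        alpha[w] = out                          # the end of each tabulated string gets its output
    for q in states:                            # every transition the trie doesn't use -> default sink
        for ch in "01":
            delta.setdefault((q, ch), "sink")
    return DFC(states, {"0", "1"}, delta, "", {default, *table.values()}, alpha)

clf = atomic_classifier({"00": "a1", "01": "a2", "10": "a1", "11": "a1"}, default="d")
assert clf.run("10") == "a1" and clf.run("01") == "a2"
assert clf.run("0110") == "d"                   # strings not in the table hit the default action
```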
Proof
In the following proof, we find a bound on the robustness of lookup table policies. Then, we show that the DFCs of policies which are more robust than this bound must share a similar feature.
A Bound on the Robustness of Lookup Table Policies
Consider a lookup table policy $\theta$ with the default action $d$. Let's analyze the robustness of $\theta$. We have
$$\mathrm{Perf}_n(\theta) = \sum_{c\in C_n} \Pr(\checkmark \mid c,\theta).$$
Let $I_\theta$ be the finite set of inputs (history and goal) that the policy $\theta$ explicitly tabulates. Since the initial conditions are encoded into each input, every input in $I_\theta$ determines its initial conditions, and thus only a finite number of initial conditions appear in $I_\theta$. Let the set of initial conditions which appear in $I_\theta$ be called $C_\theta$.
Then, we can split the sum by whether the initial condition is in this set:
$$\mathrm{Perf}_n(\theta) = \sum_{c\in C_n \cap C_\theta} \Pr(\checkmark \mid c,\theta) + \sum_{c\in C_n \setminus C_\theta} \Pr(\checkmark \mid c,\theta).$$
Consider a term in the second sum. As the policy moves through the environment, the initial conditions ($s_0$ and $\psi$) remain the same. Terms in the second sum have initial conditions which are not found anywhere in $I_\theta$, so at no point in the trajectory for one of these goals will the inputted history and goal be inside the lookup table of $\theta$. Thus, for these goals, the policy always uses the default action $d$, and each such term equals $\Pr(\checkmark \mid c, \bar d)$. Each term in the first sum is at most 1, and there are at most $|C_\theta|$ of them. Therefore, we can simplify to
$$\mathrm{Perf}_n(\theta) \le |C_\theta| + \sum_{c\in C_n \setminus C_\theta} \Pr(\checkmark \mid c, \bar d) \le |C_\theta| + \mathrm{Perf}_n(\bar d).$$
$|C_\theta|$ is a constant with respect to $n$, and a constant is the worst possible robustness. Thus we have, for some positive value $k$,
$$|C_\theta| \le k\,\mathrm{Perf}_n(\bar d).$$
Applying this to the previous equation we get
$$\mathrm{Perf}_n(\theta) \le k\,\mathrm{Perf}_n(\bar d) + \mathrm{Perf}_n(\bar d) = (k+1)\,\mathrm{Perf}_n(\bar d),$$
which results in
$$\mathrm{Perf}_n(\theta) = O(\mathrm{Perf}_n(\bar d)).$$
In English: a lookup table policy's robustness is bounded above by the robustness of the default policy for its default action. This makes intuitive sense, since as the environment gets larger and larger, the policy's performance is dominated by initial conditions which do not appear in its table.
For some environments and sets of goals, it could be the case that a default policy is very robust and successful, but in structured and complex environments we expect there to be more complex policies that are more robust. So, this bound tells us that lookup table policies are likely to have low robustness compared to other policies.
Induced Structure on Robust Policies
With this bound in hand, we can confidently say that any policy with better robustness than all default policies cannot be a lookup table policy, and thus its DFC is not an atomic classifier.
Let’s say that $\pi$ is such a policy, with $\mathrm{Perf}_n(\bar d) = o(\mathrm{Perf}_n(\pi))$ for all actions $d$, and let $D$ be $\pi$'s DFC. What does “not being an atomic classifier” tell us about the structure of $D$?
By the definition of a lookup table policy, we have that
A policy $\pi$ is not a lookup table policy iff
$$\neg\big(\exists \text{ finite } I \subset \Sigma^* : \forall w, v \notin I,\ \pi(w) = \pi(v)\big),$$
or
$$\forall \text{ finite } I \subset \Sigma^* : \exists w, v \notin I,\ \pi(w) \ne \pi(v).$$
In other words, there are at least two actions that each have infinitely many inputs mapping to them under $\pi$.
Now, we claim that the sets of inputs which lead to these two actions are regular. This can be seen through a construction of DFAs from the DFC D.
Construction
Take a DFC D with output function α and an action a.
To construct the DFA $D_a$, every state and transition from $D$ is copied into $D_a$. We make a state $q$ an accepting state if $\alpha(q) = a$ and a rejecting state otherwise.
This DFA's accepting language L(Da) corresponds to the inputs in D which map to the action a. Since Da is a DFA, this language is regular.
Left: Initial policy DFC Right: Constructed DFA that accepts exactly the inputs that D maps to action a
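A sketch of the construction, plus the infiniteness check used in the next paragraph: copy the DFC's transitions, accept exactly the states whose output is a, and then test whether L(D_a) is infinite by looking for a cycle that is reachable from the start state and can reach an accepting state. The helper names are ours; networkx is used only for reachability and cycle detection, and the `example` DFC comes from the earlier sketch.

```python
import networkx as nx

def action_dfa(dfc, action):
    """Construct D_a: same states and transitions as the DFC, accepting exactly the
    states whose output symbol is `action`."""
    accepting = {q for q in dfc.states if dfc.alpha.get(q) == action}
    return dfc.delta, dfc.q0, accepting

def language_is_infinite(delta, q0, accepting) -> bool:
    """L(D_a) is infinite iff some cycle is reachable from q0 and can reach an accepting
    state (the usual pumping-style test for regular languages)."""
    g = nx.DiGraph()
    g.add_edges_from((src, dst) for (src, _), dst in delta.items())
    reachable = nx.descendants(g, q0) | {q0}
    co_reachable = set(accepting)
    for q in accepting:
        co_reachable |= nx.ancestors(g, q)
    useful = reachable & co_reachable
    return any(nx.simple_cycles(g.subgraph(useful)))   # any cycle among useful states can be pumped

# Using the `example` DFC sketched earlier: infinitely many binary strings end in '0'.
delta, q0, acc = action_dfa(example, "a1")
print(language_is_infinite(delta, q0, acc))            # True
```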
In $\pi$, there are at least two actions with infinitely many inputs. Let's call two such actions $a$ and $b$. By the construction above, $L(D_a)$ and $L(D_b)$ are both infinite regular languages. Then $D_a$ and $D_b$ must each contain an accepting cycle. But these cycles must be present in the DFC $D$ as well, and they must be distinct, because the state for each has a different associated action in the DFC. Therefore, $D$ must contain at least 2 "classifying" cycles. This is tangible structure in the robust policy!
These two cycles can show up in multiple ways:
A DFC with 2 absorbing states; a DFC with a loop that is not a self-loop; a DFC with a self-loop that is not on an absorbing state.
In each of these examples, one can traverse the loops an arbitrary number of times before terminating, which sends infinitely many strings to the same action.
Discussion
This result takes a step towards solving the agent structure problem by providing an example where the successful behavior of a policy alone can tell us about its structure. The loopy structure that we've shown to exist in the DFCs of robust policies can be interpreted as taking advantage of a useful nontrivial pattern in how the policy's actions interact with the goals in every environment, which points vaguely in the direction of agent structure.
However, this feature is a long way from what we would need to call a policy an agent. Part of the motivation for the agent structure problem and this result is to derive a natural definition of agent-like structure from natural definitions of robust behavior. This result has shown that a natural part of that definition might include features which recognize patterns between inputs.
This result should be seen as a starting point. It is very open to generalizations. Already, it has been shown that a similar result holds for changes to the environment, the performance definition, and the structural assumption.
There are several clear directions for future work from this result. One direction is to generalize this result to more powerful automata such as deterministic Turing machines. Another approach could involve placing more assumptions on the environment. With well-chosen assumptions, it could be feasible to show that a natural robustness criterion requires that a policy contain certain features which act similarly to systems we expect agents to have, such as world models, search processes, or value representations.
The same result holds for alternate environment and goal structures. For example, if we take one finite environment E and just increase the number of goals in each environment Gn, then the proof is just the same.
Additionally, if instead of having a growing environment, we have a sequence of disjoint environments E1,E2,…, then a slightly altered proof gives us the same result.
Let’s call “interpretability-in-the-loop training” the idea of running a learning algorithm that involves an inscrutable trained model, and there’s some kind of interpretability system feeding into the loss function / reward function.
Interpretability-in-the-loop training has a very bad rap (and rightly so). Here’s Yudkowsky 2022:
When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.
And here's Zvi making the same point:
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
Those bits of optimization pressure from [T] are precious. Use them wisely.
This is a simple argument, and I think it’s 100% right.
But…
Consider compassion in the human brain. I claim that we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the human brain's reward signal can evidently get triggered by specific activations inside my inscrutable learned world-model.
Thus, I claim that the human brain incorporates a form of interpretability-in-the-loop RL training.
Inspired by that example, I have long been an advocate for studying whether and how one might use interpretability-in-the-loop training for aligned AGI. See for example Reward Function Design: a starter pack sections 1, 4, and 5.
My goal in this post is to briefly summarize how I reconcile the arguments at the top with my endorsement of this kind of research program.
My overall position
Yudkowsky & Zvi are correct that straightforward interpretability-in-the-loop training of LLMs is almost definitely a very bad idea.
…Unless interpretability someday develops to such a refined point that it’s adversarially robust (i.e., we understand the model so well that problematic thoughts have nowhere to hide from the interpretability tools). But that sure seems like a long-shot.
There exists a certain brain-like version of interpretability-in-the-loop RL training that does not suffer from this very obvious failure mode
(…although it might or might not fail for other reasons!)
And there’s a straightforward explanation of exactly how it avoids this obvious failure mode.
The rest of this post will present this explanation:
How the brain-like version of interpretability-in-the-loop training avoids the obvious failure mode
As an oversimplified toy model, I suggest thinking of desires as a learned linear functional on beliefs (see my Valence series §2.4.1). I.e. “desires” constitute a map whose input is some thought / plan / etc. (over on the belief side), and whose output is a numerical score indicating whether that thought / plan / etc. is good (if the score is positive) or bad (if it’s negative).
Anyway, the important point is that these two boxes (beliefs and desires) are updated in different ways. Let’s expand the diagram to include the different updating systems, and how interpretability-in-the-loop training fits in:
The interpretability data is changing the reward signals, but the reward signals are not directly changing the belief box that the interpretability system is querying.
That means: The loop doesn’t close. This interpretability system is not creating any gradient that directly undermines its own faithfulness.
…So that’s how this brain-like setup avoids the obvious failure mode of interpretability-in-the-loop that Yudkowsky & Zvi were talking about at the top.
Things can still go wrong in more subtle and indirect ways
…Or at least, it avoids the most straightforward manifestation of that problem. There are more subtle things that might go wrong. I have a high-level generic discussion in Valence series §3.3, where I point out that there exist indirect pathways through this diagram, and discuss how they can cause problems:
And these kinds of problems can indeed pop up in the context of compassion and other interpretability-in-the-loop human social instincts. The result is that human social instincts that on paper might look robustly prosocial, are in fact not so robustly prosocial in the real (human) world. See my Sympathy Reward post §4.1 and Approval Reward post §6 for lots of everyday examples.
So the upshot is: I don’t think the brain-like version of interpretability-in-the-loop RL training is a panacea for aligned ASI, and I’m open-minded to the possibility that it’s just not a viable approach at all. But it’s at least a not-obviously-doomed research direction, and merits more study.
AI disempowerment operates across markets, networks, and governance simultaneously, but our analytical tools don't cross those boundaries. We propose spectral graph metrics—spectral gap, Fiedler vector, eigenvalue distribution—as computable, cross-domain measures for tracking how the balance of influence shifts when AI enters coordination systems, and identify three specific quantities to monitor for AI governance.
Introduction
AI systems are changing how society coordinates — across markets, networks, governance institutions, scientific communities, all at once. The gradual disempowerment thesis captures why this is hard to address: human influence over collective outcomes can erode slowly, through ordinary competitive dynamics, without any single dramatic failure. AI systems become better at navigating coordination mechanisms, and the effective weight of human agency quietly decreases.
The stubborn part is that it operates across institutional boundaries simultaneously. Regulate algorithmic trading to maintain human oversight of markets, and competitive pressure shifts to network dynamics — whoever shapes information flow shapes what traders believe before they trade. Address attention capture in social networks, and the pressure migrates to governance advisory relationships. The problem flows around single-domain interventions like water finding cracks.
Yet our analytical tools respect exactly those domain boundaries. Economists model markets with one formalism. Network scientists study information diffusion with another. Political scientists analyze voting with a third. Each captures something real. None can describe what happens when AI systems alter the dynamics across all three simultaneously.
We think markets, networks, and democratic systems are structurally more similar than they appear. They can all be described as message-passing protocols on graph structures — nodes are participating agents, edges are channels through which influence flows, and what varies across mechanisms is what gets passed along those edges and how nodes update. In markets, messages are price signals. In networks, they're beliefs and observations. In democratic systems, they're preferences and votes.
When you represent coordination mechanisms this way, you inherit the toolkit of spectral graph theory. And this turns the disempowerment problem from something that feels intractably cross-domain into something with computable structure.
Here we give a quick sense of what this looks like concretely — don't worry if the details aren't clear yet, we'll walk through specific examples carefully in the sections that follow.
Consider a human-only coordination graph — five nodes connected by edges representing who influences whom. Every graph like this has a mathematical property called the spectral gap (λ₂), which you get from decomposing the graph's structure into its fundamental modes — the same way you'd decompose a vibrating string into its harmonic frequencies. The spectral gap measures how easily information flows across the graph's weakest point. A large spectral gap means the graph is well-connected and signals propagate quickly to everyone. A small one means there's a bottleneck somewhere — a thin bridge between two clusters where information gets stuck.
Now add AI nodes. They connect densely to each other and to key human nodes. The spectral gap increases: λ₂' > λ₂, that is, information flows faster. This might in turn lead to separated networks where AIs talk mostly with AIs, because their internal information flow is faster.
Another useful way of looking at disempowerment from a graph perspective is to define an influence function and ask how much of it the AIs versus the humans are providing.
Partition nodes into H and AI and trace which signals mattered for the collective outcome. Edge thickness represents causal contribution. The question becomes information-theoretic: what fraction of outcome-determination flowed through human nodes? An example of this is the metric of how much money flows through humans versus AIs, but we want to generalize it to the mutual information between node signals and collective outcomes, partitioned by type (e.g., politics, economics, culture).
The quantities that let us track this are all spectral. Eigenvector centrality tells you what fraction of structural influence belongs to human versus AI nodes, and whether that ratio is shifting. The Fiedler vector tells you whether the system is separating along the H/A boundary. Betweenness centrality tells you who controls information flow between communities — if AI nodes increasingly sit at bridge positions, nominally human decisions route through AI intermediation.[1]
Maintaining human agency means maintaining structural properties of the joint graph: human betweenness across mechanism boundaries, spectral gap ratios that keep human timescales relevant, Fiedler partitions that don't collapse onto the H/A boundary. These are measurable, computable, trackable quantities and we give a couple of AI governance suggestions based on these right before the conclusion.
We've been developing this framework for the past year at Equilibria Network. The core bet is that spectral graph theory provides a shared analytical language for coordination mechanisms that are usually studied in isolation — and that this shared language reveals structure you can't see from within any single domain.
Whether spectral analysis actually delivers on this depends on whether the toolkit works reliably across different coordination mechanisms. The rest of this post checks that claim against markets, networks, and democratic systems. We then lay out the desiderata for a unifying framework, where the open problems are, and what we're building toward.
Spectral Analysis Across Coordination Systems
We claimed that markets, networks, and democratic systems can be understood through the same spectral toolkit. Let's make that concrete. For each mechanism, we'll show how the graph Laplacian — the matrix you get from encoding who-influences-whom — gives you the spectral gap, the Fiedler vector, and the eigenvalue distribution, and what these quantities actually predict about real system behavior.
The pattern to watch for: in each case, the spectral gap λ₂ will predict how fast the system converges, the Fiedler vector will identify its natural fault lines, and the higher eigenvalues will capture its capacity for complex structure.
Spectral Analysis of Markets
Markets are our first example of where the disempowerment thesis becomes concrete. If AI traders increasingly dominate price discovery, human traders don't suddenly lose their accounts—they just find their signals mattering less. The bid you submit still enters the order book, but if AI systems have already moved prices to reflect information you haven't processed yet, your trade is reactive rather than formative. You're still participating; you're just not shaping outcomes.
To see this structurally, represent a market as a graph: nodes are traders, edges represent influence relationships—who watches whom, who updates their beliefs based on whose actions. This isn't the transaction record; it's the structure through which price information actually propagates.
The graph Laplacian L = D − W encodes this structure, where D is the degree matrix (how much total influence each trader receives) and W is the weighted adjacency matrix (who influences whom, and how strongly). The Laplacian has a useful property: for any assignment of values x to nodes, the quadratic form x^T L x equals the sum over all edges of (x_i − x_j)² × w_ij. In plain language, it measures total disagreement across the network, weighted by connection strength. If two connected nodes have similar values, that edge contributes little. If they have very different values, that edge contributes a lot.
This connects directly to price dynamics. Price discovery is diffusion—when one trader updates their bid, neighbors notice and adjust, their neighbors notice them, and the signal spreads. The Laplacian governs this process: prices equilibrate fast across strong connections (high penalty for disagreement) and slow across weak ones (low penalty). Bottlenecks in the influence structure become bottlenecks in price convergence.
Figure 3: Take an 8-trader market: 4 humans (H₁–H₄) and 4 AIs (A₁–A₄). Within each group, traders monitor each other closely—humans share information channels, AIs update rapidly on each other's outputs. But cross-group influence is sparse: just two weak links where a human occasionally tracks an AI's prices. The figure shows this structure, with solid edges for strong within-group connections (weight = 1.0) and dashed edges for weak cross-group links (weight = 0.1).
Build the Laplacian and compute its eigenvalues:
λ = {0, 0.19, 2.0, 2.1, ...}
The second eigenvalue λ₂ = 0.19 is called the spectral gap. It's small compared to λ₃ = 2.0, and that smallness tells you there's a bottleneck—the graph has a clean partition.
The eigenvector for λ₂—the Fiedler vector—splits the nodes by sign: all humans get positive values; all AIs get negative. The sign flip lands exactly at the Human/AI boundary. No clustering algorithm needed; the linear algebra found the structural partition automatically.
Why does this work? The Laplacian's smallest eigenvalue is always λ₁ = 0, corresponding to the constant eigenvector [1, 1, ..., 1]. This makes sense: if every node agrees, there's zero disagreement across any edge.
The Fiedler vector solves a specific optimization problem: find the assignment of values to nodes that minimizes total disagreement, subject to not being constant. The values must sum to zero, so some are positive and some are negative.
Think about what this optimization does. If two nodes are strongly connected, you want to give them similar values to keep (x_i − x_j)² small. If they're weakly connected, the penalty for different values is low. So the optimal strategy is: assign similar values within tightly-connected clusters, and let the sign flip happen where connections are weakest.
That's why the Fiedler vector partitions the graph along its bottleneck. It's not a clustering algorithm—it's the solution to "where can I afford disagreement?" The answer is: across the weak links.
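Here is a small numpy sketch of that story. The wiring below (two groups of four with strong internal links and two weak cross links) is our own illustrative approximation of Figure 3, not its exact graph, so the numbers will differ slightly from those quoted above; the qualitative picture (tiny λ₂, Fiedler signs splitting along the H/AI boundary) is what to look for.

```python
import numpy as np

# Illustrative 8-node market: nodes 0-3 are humans, 4-7 are AIs.
n = 8
W = np.zeros((n, n))
for group in (range(0, 4), range(4, 8)):
    for i in group:
        for j in group:
            if i != j:
                W[i, j] = 1.0            # strong within-group monitoring
W[0, 4] = W[4, 0] = 0.1                  # two weak cross-group links
W[1, 5] = W[5, 1] = 0.1

L = np.diag(W.sum(axis=1)) - W           # graph Laplacian L = D - W
eigvals, eigvecs = np.linalg.eigh(L)     # eigh: L is symmetric, eigenvalues come back sorted

print("eigenvalues:", np.round(eigvals, 2))
print("spectral gap lambda_2:", round(eigvals[1], 3))
print("Fiedler vector signs:", np.sign(eigvecs[:, 1]))   # humans and AIs land on opposite sides
```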
How this relates to disempowerment
Convergence time scales inversely with the relevant eigenvalue. Within each group, the larger eigenvalues (~2.0) govern dynamics: τ_within ≈ 1/2 = 0.5 time units. A price shock among AIs equilibrates within the AI cluster quickly.
Across groups, the spectral gap governs: τ_cross ≈ 1/0.19 ≈ 5 time units. Cross-group convergence is 10× slower.
Here's where disempowerment becomes visible. If AI-AI connections are denser than human-human connections, the AI subgraph will have a larger internal spectral gap—AIs reach consensus faster. Collective price levels increasingly reflect AI-internal dynamics; humans equilibrate toward prices that were already set. Individual humans still trade, but the attractor they're converging toward was shaped by the faster-equilibrating subsystem.
The Fiedler partition gives you a direct test: is the market's primary structural fault line Human/AI, or something else (sector, strategy, geography)? When the partition aligns with agent type, the market has organized around that boundary. That's the spectral signature of segregation—not just difference, but separate communities where information flows freely within and struggles to cross between.
Spectral Analysis of Information Networks
For information networks, the same mathematics applies but with a different interpretation. Nodes are agents, edges are communication channels, and the Laplacian captures the diffusion structure of information flow.
The core dynamic is simple: people update their opinions by averaging over what their neighbors think. If you trust someone, their view pulls yours toward theirs. Repeat this across everyone simultaneously, and the network either converges toward shared understanding or gets stuck in persistent disagreement. Which happens depends entirely on the graph's spectral properties.
The spectral gap λ₂ predicts how fast consensus forms. The figure shows why this works intuitively. A well-connected network (left) has many paths between any two nodes—information can flow through multiple routes, disagreements get smoothed out quickly, and the system relaxes to consensus. A poorly-connected network (right) has a bottleneck: two dense clusters linked by a single weak connection. Information flows freely within each cluster but struggles to cross the bridge. Disagreements between clusters persist.
The Cheeger inequality proves that a small spectral gap guarantees a bottleneck exists somewhere in the network. The eigenvector v₂ (called the Fiedler vector) tells you exactly where: nodes with positive components fall on one side of the cut, nodes with negative components on the other. The algebra finds the echo chambers automatically.
The higher eigenvalues reveal something different: the network's capacity for complex patterns of belief. A network with only one significant eigenvalue can sustain only binary disagreement—you're either in group A or group B. A network with many well-separated eigenvalues can maintain richer structure: multiple factions, nested coalitions, opinions that don't collapse onto a single axis. The spectral distribution measures what we might call the network's "cognitive complexity."
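A small simulation sketch of this averaging dynamic, under an assumed two-cluster wiring (our own, not from the post): within-cluster disagreement disappears quickly, while the cross-cluster gap decays at the much slower rate set by λ₂.

```python
import numpy as np

def two_cluster_graph(n_per=4, within=1.0, bridge=0.1):
    """Two tightly connected clusters joined by two weak links (illustrative wiring)."""
    n = 2 * n_per
    W = np.zeros((n, n))
    for cluster in (range(0, n_per), range(n_per, n)):
        for i in cluster:
            for j in cluster:
                if i != j:
                    W[i, j] = within
    W[0, n_per] = W[n_per, 0] = bridge
    W[1, n_per + 1] = W[n_per + 1, 1] = bridge
    return W

def simulate_consensus(W, x0, steps, dt=0.01):
    """Neighbor averaging in continuous time, dx/dt = -L x, integrated with Euler steps."""
    L = np.diag(W.sum(axis=1)) - W
    x = x0.astype(float).copy()
    for _ in range(steps):
        x = x - dt * (L @ x)
    return x

W = two_cluster_graph()
x0 = np.random.default_rng(0).normal(size=8)      # random initial opinions
for steps in (0, 200, 5000):
    x = simulate_consensus(W, x0, steps)
    within = (x[:4].std() + x[4:].std()) / 2      # disagreement inside each cluster
    across = abs(x[:4].mean() - x[4:].mean())     # disagreement between clusters
    print(f"steps={steps:5d}  within-cluster spread={within:.3f}  cross-cluster gap={across:.3f}")
```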
Network scientists have confirmed these patterns empirically across social media platforms, scientific collaboration networks, and political communication systems.
Spectral Analysis of Democratic Systems
Voting systems are less obviously graphical, but the representation still works. Consider voters (or legislators) as nodes and influence relationships as edges: who persuades whom, who looks to whom for voting cues, who forms coalitions with whom. The aggregation mechanism determines how preference signals propagate through this influence graph.
The spectral gap λ₂ predicts the stability of collective decisions. A large spectral gap means the voting outcome is robust—small changes in individual preferences don't flip the result. A small spectral gap means the system is near a "phase transition" where minor shifts could change everything. There’s evidence that voting rules exhibit phase transitions at critical preference thresholds. Below some critical concentration, manipulation probability approaches 1 exponentially fast; above it, manipulation becomes exponentially unlikely. The spectral gap tells you how close your system sits to that knife's edge.
The picture shows what this looks like concretely. Two partisan blocs—tightly connected internally, weakly connected to each other—with a handful of swing voters bridging the gap. The spectral structure reads this topology directly: small λ₂ because the cross-bloc connections are sparse, and the Fiedler vector v₂ cleanly separating the blocs by sign. The swing voters show up near zero in v₂, mathematically capturing their position between worlds—and their decisive influence on outcomes.
The eigenvectors reveal fault lines, and this has been validated empirically. When network scientists applied spectral analysis to U.S. Congressional voting, they found that modularity-based community detection naturally identifies partisan blocs—and spectral methods for modularity optimization underlie these partitions. No manual labeling needed—the linear algebra finds the political coalitions automatically.
Political polarization has a precise spectral signature. Work on legislative voting networks shows that roll-call voting behavior increasingly corresponds to "interval graphs"—mathematical structures where a single dimension captures almost everything. The researchers found a sharp collapse in dimensionality over recent decades. By the post-104th Senate, Congressional voting had become essentially one-dimensional: a single eigenvector explains nearly all the variation. When your democracy's spectrum looks like that, you're seeing polarization written in the mathematics itself.
The gap between λ₂ and λ₃ measures something real about democratic health. When λ₂ is much smaller than λ₃, the system has one dominant fault line—a clean left-right split with weak cross-partisan ties. That structure is spectrally fragile. When eigenvalues distribute more evenly, the system sustains complex coalition patterns: overlapping groups, cross-cutting alliances, the kind of multidimensional politics that makes manipulation harder and outcomes more stable.
The napkin math: Under social influence, preferences converge via the same Laplacian dynamics: dx/dt = −Lx. But for voting, we care about stability under perturbation. If I shift one voter's preference by ε, how much does the collective outcome move?
Perturbation analysis gives: δ(outcome) ∝ ε/λ₂ under influence-aggregation models where preferences propagate through the network before a decision is reached. The precise relationship depends on the voting rule — unanimity, majority, approval voting, and ranked-choice systems each interact differently with the same influence topology. [2]
The influence matrix (I − αL)⁻¹ develops a near-singularity when λ₂ is small—small inputs produce large outputs. And crucially, where that perturbation lands matters: shifting a swing voter (near zero in v₂) moves the outcome far more than shifting a committed partisan. The eigenvector tells you who the kingmakers are.
This is the spectral signature of political instability: a polarized electorate with weak cross-partisan ties, a few swing voters holding disproportionate power, the whole system poised at a critical point where small nudges flip outcomes.
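A sketch of the amplification side of this napkin math. Rather than the (I − αL)⁻¹ form quoted above, this sketch uses the steady state of dx/dt = −Lx + u under a persistent nudge u (i.e. the Laplacian pseudoinverse), which exhibits the same ε/λ₂ scaling: the cross-bloc disagreement produced by a fixed small preference shift grows as the bridge between blocs weakens. The graph and parameters are illustrative assumptions, not the post's figures.

```python
import numpy as np

def two_bloc_graph(bridge_weight, n_per_bloc=4):
    """Two partisan cliques joined by a single bridge edge of the given weight."""
    n = 2 * n_per_bloc
    W = np.zeros((n, n))
    for bloc in (range(0, n_per_bloc), range(n_per_bloc, n)):
        for i in bloc:
            for j in bloc:
                if i != j:
                    W[i, j] = 1.0
    W[0, n_per_bloc] = W[n_per_bloc, 0] = bridge_weight
    return W

def persistent_disagreement(W, eps=0.01):
    """Steady-state spread of preferences under dx/dt = -L x + u, where u nudges one
    voter by eps. The non-consensus component settles at pinv(L) @ u, and its size is
    dominated by the Fiedler mode, scaling roughly as eps / lambda_2."""
    L = np.diag(W.sum(axis=1)) - W
    u = np.zeros(len(W))
    u[0] = eps
    x_star = np.linalg.pinv(L) @ u               # consensus drift removed by the pseudoinverse
    return x_star.max() - x_star.min()

for bridge in (1.0, 0.1, 0.01):                  # weaker bridge -> smaller lambda_2 -> more amplification
    W = two_bloc_graph(bridge)
    lam2 = np.linalg.eigvalsh(np.diag(W.sum(axis=1)) - W)[1]
    print(f"bridge={bridge:5.2f}  lambda_2={lam2:6.3f}  disagreement={persistent_disagreement(W):.4f}")
```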
Generalising the examples
We set out to check whether spectral properties carry consistent meaning across different coordination mechanisms, and the examples suggest they do. In each case, the spectral gap λ₂ predicted the rate at which local perturbations become global patterns—price convergence speed in markets, information diffusion rate in networks, decision stability in democracies. The Fiedler vector v₂ identified natural fault lines in each system, whether those manifest as trading communities, echo chambers, or partisan blocs. Higher eigenvalues captured each system's capacity for complex, multi-dimensional structure rather than simple binary divisions.
The mathematics remained the same across all three; what changed was interpretation. This is consistent with the hypothesis that markets, networks, and democracies admit a shared analytical framework — though it doesn't yet prove they share deep structure as opposed to admitting similar computational tools for somewhat different reasons. Together, these spectral quantities give us computable, cross-domain metrics for tracking how coordination dynamics shift when AI nodes enter a system — spectral gap for convergence speed, Fiedler vector for fault lines, eigenvalue distribution for structural complexity, all partitionable by agent type.
What follows is the deeper theoretical program that we hope might ground these metrics in a unified framework and extend them beyond what spectral methods alone can reach.
A First Order Model
What follows is the theoretical program we're developing to ground and extend these results. The ideas here are less settled than what came before — we're sharing our current thinking rather than established findings, because we think the direction is promising enough to be worth discussing openly.
The spectral examples are satisfying but they raise an obvious question: why does this work? Observing that the same eigenvalues predict things across markets, networks, and democratic systems is a pattern. It's not yet an explanation. What's the underlying mechanism that makes the Laplacian the right object to study in all three cases?
The following story explains it simply, but it is only a first-order approximation; the actual, contextualised answer to this question is part of a larger research program.
The graph Laplacian L = D − W is, at bottom, a description of how information moves on a graph. Take any quantity distributed across nodes — prices, beliefs, preferences, anything — and ask how it diffuses through the network of connections. The Laplacian is the operator that governs that diffusion. Its eigenvalues tell you the rates. Its eigenvectors tell you the patterns. This is what graph signal processing studies: signals defined on graphs, and the structures that govern how those signals propagate.
Now consider what the agents on this graph are actually doing. A trader is trying to find the right price. A voter is trying to figure out how to vote. A team member is trying to predict what their colleagues will do. Whatever the domain, each agent is maintaining some model of its situation, generating predictions, comparing those predictions against what it observes, and updating to reduce the error.
Each agent is doing local optimization against a landscape defined partly by what every other agent is doing. The active inference literature frames this as free energy minimization; you could also describe it through Bayesian belief updating or gradient-based optimization of prediction error. [3]
If you have a collection of agents on a graph, each doing local optimization — each one hill-climbing on its own prediction-error landscape — what you get at the collective level is energy moving across the system. Agent A updates based on signals from neighbors B and C, which changes what B and C observe, which changes their updates, which propagates further. To a first approximation, the collective dynamics is diffusion. Information and uncertainty flow across edges according to the same mathematics the Laplacian describes.
When models disagree — when agents have misaligned predictions about each other — the system is in a state of high collective free energy. Tension, disagreement, unresolved uncertainty. Message-passing resolves this tension. Agents exchange signals, update their models, and the system relaxes toward coordination. The Laplacian governs the rate and pattern of this relaxation. The spectral gap tells you how fast collective uncertainty resolves. The Fiedler vector tells you where the persistent disagreements will be.
This is why the spectral toolkit works across domains. Markets, networks, and democratic systems all involve agents doing local inference on a shared graph, and the Laplacian is the mathematical object that describes how local inference becomes collective dynamics. The spectral results follow from the structure of distributed optimization on graphs — regardless of whether the optimization targets are prices, beliefs, or policy preferences.
Beyond Approximations
This picture is the leading-order term, and it's the term the spectral toolkit captures well. But we should be direct about its limitations.
The phenomena that matter most for AI safety — strategic positioning at bridge nodes, recursive modeling of other agents' models, coalition formation, identity-driven resistance to consensus — live precisely in the regime where the diffusion approximation breaks down. The Laplacian governs what happens when agents are doing something like local averaging. Real world coordination involves agents who anticipate, who model each other's models, who form alliances and defect from them. The actual collective operator is more complex than the Laplacian, and characterizing what it looks like beyond the diffusion regime is the core theoretical challenge of this research program.
The deeper question — and this is where we think the most promising contribution lies — is whether there's a formal correspondence between free energy minimization at the individual level and the collective dynamics we observe across coordination mechanisms. If different coordination systems are different ways of collectively minimizing prediction error under different constraints, that would explain why the spectral toolkit transfers. We're pursuing this direction but the formal results aren't ready yet. If it works, the functorial mapping wouldn't be between markets and networks directly, but between individual inference and collective coordination, with markets and networks as different instantiations under different constraints.
Our approach builds on several existing lines of work connecting individual and collective inference. Heins et al. have shown how spin-glass systems can be analyzed through collective active inference, providing a concrete implementation of multi-agent free energy minimization on graphs. Hyland et al.'s work on free energy equilibria establishes conditions under which multi-agent systems reach shared minima — the formal analog of coordination. And recent work on partial information decomposition of flocking behavior demonstrates how collective dynamics can be decomposed into synergistic and redundant information contributions across agents. For readers interested in the computational methods, [this lecture on MCMC approaches to collective active inference] provides a setup for running a collective <-> individual gibbs sampling-style loop.
Desiderata for the Theory
The active inference story gives us a candidate explanation for why spectral methods transfer across domains. But an explanation isn't the same as a unifying framework. We think that our theory needs three things: universality across domains, compositionality so that understanding markets and networks separately lets you predict what happens when they operate together, and computational tractability at scale. We want this so that we can actually compose and simulate AI + human systems and see what happens.
These pull against each other, and most existing frameworks achieve one or two. Game theory applies broadly but doesn't compose — there's no natural operation for "combine these two games and predict the joint dynamics." Network science computes efficiently but treats each coordination domain as requiring its own model. Social choice theory has beautiful results about voting but nothing to say about price formation. Taking inspiration from Fong and Spivak's work on applied category theory, the question is whether price dynamics and voting dynamics are structurally isomorphic — the same compositional relationships, even if their elements look completely different.
Where does spectral analysis sit against these?
Tractability is real but not as clean as it first appears. Spectral decomposition runs in O(n log n) for sparse graphs. But the graphs we care about aren't static — they evolve endogenously based on the dynamics we're modeling, and influence in real systems is latent, inferred from observed correlations rather than measured directly. The computational advantage over exhaustive simulation exists, but it's not a free lunch.
Universality is more promising. The spectral gap does predict convergence-like behavior across markets, networks, and democratic systems, and the eigenvectors reveal natural clustering regardless of domain. But showing that the same tools apply isn't the same as showing these domains share deep structure. The active inference connection is our best current candidate for why the transfer happens — distributed inference on graphs, governed by the Laplacian at leading order — but it remains a conjecture rather than a result.
Compositionality is where we've barely started, and it's where the real gap lies. Real coordination systems blend multiple mechanisms — a company uses market mechanisms for resource allocation, network relationships for information flow, and democratic processes for major decisions. The spectral analysis applies to each layer, but we don't have composition rules that predict what happens when you stack them. We're exploring this through what we call process-based modelling (beware of technical debt) — a functional programming approach to multi-agent simulation that might offer a computational path to composition. More in a future post.
The deeper open question is whether the cross-domain transfer we've demonstrated reflects structural unity or a sufficiently general hammer. If it's structural unity, it should live in what each mechanism preserves and what it discards — markets might preserve something about efficient information aggregation through exchange, democracies something about equal origination of influence, networks something about positional structure of information flow. These feel like different constraints on the same underlying message-passing process. Formalizing that intuition is the core theoretical challenge ahead — and it connects directly to the practical question of what happens when AI agents, operating under their own constraint regimes, enter these systems.
Applications in Governance
The spectral framework suggests three specific quantities that should be monitored in any coordination system where AI agents participate alongside humans.
Human betweenness across mechanism boundaries. If AI nodes increasingly sit at bridge positions between different coordination mechanisms — between the market layer and the network layer, between information flow and governance — then nominally human decisions increasingly route through AI intermediation. Betweenness centrality, partitioned by agent type across mechanism boundaries, tracks this directly. When human betweenness declines relative to AI betweenness at cross-mechanism bridges, the system is developing AI-mediated chokepoints.
Spectral gap ratios that keep human timescales relevant. If the AI subgraph's internal spectral gap is much larger than the human subgraph's — meaning AI nodes reach internal consensus far faster than humans can coordinate among themselves — then collective outcomes might get determined by whichever subsystem equilibrates first. The ratio λ₂(AI subgraph) / λ₂(human subgraph) measures this directly. A ratio that grows over time signals that AI coordination speed is outpacing human coordination speed, and collective outcomes will increasingly reflect AI-internal dynamics.
Fiedler partitions that don't collapse onto the H/AI boundary. The Fiedler vector v₂ identifies the system's primary structural fault line. If v₂ increasingly separates human nodes from AI nodes — if the dominant partition of the system is "humans on one side, AI on the other" rather than some functional or topical division — the system has structurally segregated along the type boundary. This is the spectral signature of a coordination system where humans and AI are no longer integrated but operating as separate blocs.
These are necessary conditions for meaningful human agency in mixed coordination systems, but they are not sufficient. A system could satisfy all three criteria while still undermining human agency through subtler mechanisms — frame-setting that shapes which options humans consider, information curation that determines what humans see before they decide, meaning-making that happens within nodes rather than between them. The spectral criteria track the structural skeleton of influence. They cannot detect whether the content flowing through that skeleton is manipulative, reductive, or otherwise corrosive to agency. We flag this not to undermine the criteria but to bound what they can and can't detect — and to motivate the richer framework we're developing, which would need to capture not just information flow but information quality and strategic intent.
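As a sketch of how the three monitored quantities above could be computed in practice, assuming you already have an inferred influence graph whose nodes are labeled 'H' or 'AI' (the function and attribute names are our own, and the graph and both subgraphs are assumed connected):

```python
import networkx as nx
import numpy as np

def disempowerment_metrics(G: nx.Graph) -> dict:
    """G: weighted influence graph whose nodes carry a 'type' attribute, 'H' or 'AI'.
    Returns the three quantities discussed above."""
    humans = [v for v, d in G.nodes(data=True) if d["type"] == "H"]
    ais = [v for v, d in G.nodes(data=True) if d["type"] == "AI"]

    # 1. Betweenness share by agent type: who sits on the bridges. (Unweighted here;
    #    for weighted influence you may want to pass 1/weight as an edge distance.)
    bc = nx.betweenness_centrality(G)
    human_share = sum(bc[v] for v in humans) / (sum(bc.values()) or 1.0)

    # 2. Ratio of internal spectral gaps: AI coordination speed vs human coordination speed.
    def lambda2(nodes):
        L = nx.laplacian_matrix(G.subgraph(nodes), weight="weight").toarray()
        return np.sort(np.linalg.eigvalsh(L))[1]
    gap_ratio = lambda2(ais) / lambda2(humans)

    # 3. Does the Fiedler partition of the whole graph align with the H/AI boundary?
    fiedler = nx.fiedler_vector(G, weight="weight")
    order = list(G.nodes)
    ai_set = set(ais)
    agree = sum((fiedler[i] >= 0) == (v in ai_set) for i, v in enumerate(order))
    alignment = max(agree, len(order) - agree) / len(order)   # 1.0 = clean H/AI split

    return {"human_betweenness_share": human_share,
            "ai_to_human_gap_ratio": gap_ratio,
            "fiedler_HA_alignment": alignment}
```

Tracked over time, a falling human betweenness share, a growing gap ratio, and an alignment creeping toward 1.0 would be the three warning signs described above.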
What we're building toward falls into three workstreams:
Mathematical foundations. Formalizing the symmetry conjectures, proving spectral-behavioral correspondences, developing the category-theoretic structure that makes composition rigorous.
Simulation infrastructure. Building tools that let researchers construct collective intelligence systems as graphs with message-passing rules, run simulations, and analyze spectral properties. We’re trying to make sure we have good theory-to-practice feedback loops, so the aim here is practical applications.
Multi-agent AI safety. Applying this framework to understand what happens when AI agents participate in human coordination mechanisms.
Conclusion
Wherever you have multiple agents coordinating under uncertainty, you can draw a graph. The nodes are whoever's participating. The edges are the channels through which they influence each other. The message-passing rules encode what kind of coordination mechanism you're using.
Markets, democracies, networks, hierarchies — they're all message-passing on graphs. The differences that matter are what flows along the edges, how nodes update, and what the structure permits. And because graphs give us matrices, we get the full power of linear algebra — efficient computation, proven algorithms, scalable analysis.
What we've shown here is that spectral methods give computable, falsifiable quantities for tracking coordination dynamics — including AI disempowerment — across domains that are usually studied separately.
What we haven't shown, but believe is worth pursuing, is why this works. We suspect the answer involves a formal correspondence between individual free energy minimization and collective coordination dynamics — that the Laplacian captures the leading-order term of distributed inference, and that's why it transfers. If that's right, different coordination mechanisms would be different constraint regimes on the same underlying process, and the deep question becomes: what do different mechanisms preserve? Markets seem to preserve something about efficient information aggregation through exchange. Democracies seem to preserve something about equal origination of influence. Networks seem to preserve something about positional structure of information flow.
We think these spectral correspondences hint at deeper structural connections between coordination mechanisms — connections that might eventually be formalized categorically, the way the Langlands program connected number theory and geometry.
Thanks to the extended Equilibria research network for many conversations that shaped these ideas. This post presents a research direction we're actively developing—feedback, criticism, and collaboration are welcome.
This was co-written with Claude Opus based on work over the last year. I would give an 85-90% probability of the claims holding true, as this is something I’ve spent a lot of time on.
(A caveat on this metric: mutual information with outcomes captures causal contribution but not the full picture of agency. A human whose vote is shaped entirely by AI-curated framing has high mutual information with the outcome — their signal mattered — but diminished agency in any meaningful sense. They didn't act from their own understanding; they were a conduit for someone else's influence. A complete account would need to track not just whether human signals determined outcomes, but whether those signals originated from human deliberation. We don't have a clean formalization of this yet, and it's a significant gap. The spectral metrics we propose are necessary conditions for human agency, not sufficient ones — you can't have meaningful agency without structural influence, but structural influence alone doesn't guarantee it.)↩︎
The result holds most cleanly for linear aggregation on the influence graph; for other mechanisms, the spectral gap still constrains dynamics but the proportionality may take a different form. With a small spectral gap, even tiny preference shifts get amplified dramatically. The spectral-stability connection for democratic systems is the least developed of the three analyses presented here and deserves its own dedicated treatment. The core difficulty is that different voting rules don't just produce different proportionality constants — they can fundamentally change what the spectral properties mean. Under approval voting, the strategy space is different from plurality, which means the way preferences propagate through influence relationships changes qualitatively, not just quantitatively. The voting rule may function as a structural variable that reshapes the effective graph, rather than a parameter applied to a fixed graph. We've bracketed this issue here by focusing on influence-aggregation models where the spectral connection is most transparent, but a full treatment would need to develop mechanism-specific spectral signatures for different voting rules. ↩︎
These frameworks converge under specific conditions — roughly, when agents have well-defined generative models and the environment is stationary enough for variational approximations to track — but they are not identical operations. The structural point that matters here is more basic: each agent is doing some form of local inference, and the collective dynamics emerge from those local processes interacting across the graph. ↩︎