MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

How I Handle Automated Programming

2026-03-06 12:27:05

This is a write up of my current process, as an independent software engineer, for using Claude Code to write and review all of my code. The specifics of the below will change as models get better. The overall ideas, I suspect, will hold for a little longer.


I’m trying to build software quickly.

When Sonnet 4.5 was released in September 2025, I found that I didn’t need to write code anymore.

When Opus 4.5 was released in November 2025, I found I didn’t need to review the code anymore.

After Opus 4.6 was released earlier this month, I finally set up a system to enable --dangerously-skip-permissions [1].

It’s very fast [2].


The motivation behind this sequence of handovers is basically just asking in a loop “Why am I the limiting factor here? What can I do to stop being the limiting factor?”

Most of the time, the answer is just that I don’t trust the models enough. My system just wasn’t yet set up to handle mistakes from them, so I had to play Quality Control to a frustrating degree.

But, the models are very smart and capable, and it is very quick to iterate yourself more and more out of the loop. Here’s the process I’ve come to over the past year or so of iterating on Claude Code:

Dangerously Skip Permissions

First, I have set up a new user account on my computer for Claude. The user can’t run root, it can’t modify files except in its own (and shared) directories. It has its own SSH keys to the cloud servers my projects run on, it has its own github token, and so on -- all of these have minimum permissions, and most are read-only. It uses its own browser (Playwright Chromium), without any of my data or logins available.

This is far less than perfect isolation, and if Claude were malicious he might be able to cause some trouble, but it is plenty to prevent a bumbling intern from destroying anything important.

With that, I feel reasonably comfortable running Claude with --dangerously-skip-permissions. I’ve actually set up a little script so that typing ‘claude’ will automatically run the folder’s .venv and apply this flag. Speedy!

Claude.md and docs/

When Claude starts up in Claude Code, the Claude.md file is loaded into his context window. As such, you want to keep this short while also making sure it includes a complete overview of the project.

Claude mostly builds this file and handles it well by default, however, I did put some effort into making a docs/ folder with more detailed information about how each part of the software should work. The idea is that Claude can follow pointers up the chain to help him find relevant information quickly, and to do so with a good idea of the overall reasoning and structure:

Claude.md --> Detailed documentation in /docs --> Actual Code

Of course, all of the docs were also written by Claude.

Working

Usually, I will start by running a /new_branch skill [3], which lets Claude creates a new git branch with a name combining the date and the folder we are in.

Once that’s done, I clear the chat and write or paste in whatever the specific feature or problem to work on is. I try to pick a chunk of work that can be finished within a 1m token context window [4] (which, to be clear, is nearly everything).

I try to give Claude as much context as possible about the problem, as well as my certainty about it. For example: “A user has emailed with the below report. Could you investigate if this is a real issue?” or; “I’d like to display to users when the page is actively generating content in a way that lets them see it even if they aren’t in the active tab. Is it possible to do this via the .ico somehow? Perhaps showing an animation or different colour while generations are active? Could you investigate the possibilities here and propose some ways we might be able to provide this info to users?”

We often have a bit of back and forth chatting about how to approach the problem and plan out the implementation, and then I read the plan Claude generates semi-carefully, and often suggest changes to it -- usually some missing piece or misunderstood requirement, or a new idea I hadn’t thought of.

After that, Claude goes ahead and does the work which, with the permissions bypassed, is usually completed in one-shot. I avoid clearing the context from before the plan, unless it is polluted because I generally think it contains useful information about how to carry out the plan.

Occasionally there’s some back and forth after the initial work is done -- me asking why a choice was made, or whether an edge case I thought of is covered, or what happens in a particular scenario.

Now, we have a feature which we need to carefully review. This is where most of the work actually is.

Review Step 1 - What Did Claude Intend?

For complex features, or things with UI that Claude isn’t great at, we need to do a little bit of manual reviewing. But we don’t review the code! Oh no, we review the outputs. There are two ways to do this:

For complex flows, I run the /walkthrough skill [5]. This prompts Claude to create a script which calls the real code with a bunch of example inputs, processes the outputs of that code into clean, readable logs, and then presents it as a single file you can read top to bottom to see exactly what (apparently) caused what. Claude can also review this himself and fix things based on these outputs.

This is extremely useful for handling things where it’s not super obvious what exactly the code is doing, for example, if you’re building a project where you are assembling prompts for an AI (e.g, a benchmark), you want to see exactly what those prompts look like across a bunch of examples. It’s very handy to have that ‘bunch of examples’ output to a text file!

The /walkthrough output should not be viewed as ‘this is definitely what the code is doing’, but you can reliably view it as ‘this is what Claude intended to build’, which is very useful on its own. I often catch little differences to what I intended and can easily correct them by prompting Claude, whereas without this I would be much more blind.

For UI changes, I can just open the thing up and take a look. I am a big fan of browser based GUIs for most projects, so usually I’ll just flip across to localhost and have a quick look at what we’ve got. This is especially easy, because that /new_branch script also spins up a local server on a port which is specific to this project and this folder, so I know if I’m talking to Claude number four on the second desktop window, then I can go to port 8240 to see what we’ve done.

In both of these cases, it’s simple to notice small things and pepper them back onto Claude to fix. Sometimes I explicitly say “I’m going to throw a bunch of small little requests at you -- just add them to your todo list and get to them once you finish up what you’re doing. Don’t let me interrupt you, I’m just saying things as I see them!” so I don’t need to wait for previous work to be done before sending the next little fix.

Review Step #2 - Are You Happy With Your Work?

Once the code is written and appears to output the correct thing, I run a /pre-pr skill [6] which prompts Claude to tidy up and test his work before making a PR to master.

Specifically, Claude is prompted to review the new code for quality and clarity, check that the old code supports the new code & its patterns, remove dead code, write and run tests, check for similar problems to the one we fixed throughout the codebase, plan out manual tests (e.g. browser based tests), run the manual browser tests using the Playwright MCP, and then repeat until no major changes are made.

After that, the skill prompts Claude to update Claude.md and the /docs folder, merge the latest master into our branch, and create a PR. Then, it waits for the PR review and CI tests to complete, and handles those results.

Ideally this would all happen automatically, but often I need to reprompt Claude to check the PR info, or I just directly paste in test failures or PR comments for him.

Review Step #3 - Is The Other Claude Happy?

The PR comments are also automated! A different instance of Claude, running via the Claude github plugin, reviews the diff and comments on the PR. It often has nitpicks, but occasionally catches genuine issues or inconsistencies which need to be resolved.

All of the normal automated unit and integration tests run at this stage too. I use testmon to try and limit it to affected files, and also parallelise the tests, but it still can take a long time for everything to run.

Then it’s merged to master! I merge somewhere on the order of 5-10 PRs a day using this system -- you don’t want big open branches, since everything will work much better if you can keep it all inside one context window and avoid conflicts with other branches as much as possible.

Review Step #4 - Does The Staging Server Work?

Once the code is merged to master, at some point, maybe every couple of days, I will run the /PRstaging skill [7] to work on getting the code deployed. This skill basically repeats the /pre-pr checks, but across several of the recent changes and with a fresh instance of Claude. We also repeat our Github automated PR review and CI tests, and often catch a few more little things.

For any project that is used externally, I have a staging server which (should be) as similar to the production server as possible. I run a deploy script to pull the staging branch and deploy the code here first. Then I run the QA Bot

The QA Bot is a little tool I built which spins up a team of Claude Haiku’s to manually test websites using a GUI based browser. The intended pattern is to pass in the PR description and let the supervisor agent select which flows on the website to test (for example, being able to login, sign up, etc, etc).

This is generally quite expensive to run, so you only want to do it once before merging to production. The tool is also not that good yet -- sometimes it finds many false positives, or the agents get stuck, or we run into other issues. Occasionally it notices real bugs though, so I find it useful to run on anything important.

It is very important to provide the QA Bot with a staging server and test credentials which can do little harm, since the agents might test, say, the ‘delete profile’ flow with the URL and credentials you give them. They should ask before doing anything non-reversable, but it is a smaller model.

Once staging looks good, I repeat the process with /PRprod [8]

Review Step #5 - What Do The Production Logs Say?

We finally have something in production! That’s great, but also a little nerve-wracking, since at no point in this process did we read the code or click through everything manually. Let’s hope that if there’s any problems, we can catch them quickly.

I have a system to send serious errors to myself immediately via email, but there’s a lot of noisier stuff that could still be a problem that is stored in logs on the server, and on all of the various platforms a project might interact with (DigitalOcean, Stripe, OpenRouter, Mailgun, etc).

So, everyday, I run /check-logs [9], which prompts Claude to access each of those in turn (using read-only credentials) and see if there is anything needing addressing. Claude is great at this, he can process huge lists of logs and notice patterns natively, then present any issues or observations (‘Memory utilisation was a bit high’, ‘We have more 404s than normal’) and dive into anything immediately by also looking at the codebase and recent PRs to production.

I’ve caught many an issue this way. Some platforms I haven’t pulled logs from automatically, but can still occasionally download a .csv and paste it into Claude with a prompt like ‘This is a report form Google Search Console, can you analyse it and flag any improvements we could make?’, and then just go ahead and make the improvements automatically using the above system.

One More Thing

I also occasionally run the /improve skill [10], especially when a new model comes out. This helps keep everything aligned and up to date across the code base by spinning up an agent team to review each section of the project and check if improvements can be made to code quality, security, documentation, design, test coverage, and so on!

When I ran this for the first time with Opus 4.6, it caught 35 distinct issues, some of which were actually consequential.

The only other thing worth calling out is that this is a very parallel workflow -- I have six windows of Claude Code open per project, each working independently on their own repo. Often only three or four are actively running simultaneously, but I have occasionally hit six in parallel.

Conclusions

As the models get better -- which will happen quickly -- some of these steps will become unnecessary, but other parts should remain somewhat useful for a little longer. The general philosophy should be that if we are putting more control in the hands of the AI, or making it clearer about what we want to be developed, then the procedures will work better as the AIs get better.

You can think of it as analogous to working with a human engineer who would otherwise be idle, where asking them to check and explain their work in various ways can help get something good.

  1. I now allow Claude to run without any specific list of commands blocked. ↩︎

  2. Obviously a big chunk of this is just that Claude is getting capable enough to do this himself. Still, for the foreseeable future, a system which catches issues (especially one that catches issues using Claude) will be beneficial even as Claude gets much more capable. ↩︎

  3. /new_branch ↩︎

  4. You can set this in Claude Code with /model. It charges you for ‘extra usage’ at API rates once you go above ~200k tokens. ↩︎

  5. /walkthrough ↩︎

  6. /pre-pr ↩︎

  7. /PRstaging ↩︎

  8. /PRprod ↩︎

  9. /check-logs ↩︎

  10. /improve ↩︎



Discuss

Reasoning Models Struggle to Control Their Chains of Thought

2026-03-06 06:37:01

Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak

In collaboration with OpenAI

TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce CoT-Control, an evaluation suite requiring models to solve problems while adhering to CoT constraints (e.g., reasoning about genetics without using "chromosome"). OpenAI just announced that they will start tracking CoT controllability in future models, starting from GPT-5.4.

We find that reasoning models have significantly lower CoT controllability than output controllability; Claude Sonnet 4.5 controls its CoT only 2.7% of the time, versus 61.9% for final outputs. CoT controllability increases with model size but decreases with more RL training, test-time compute, and problem difficulty. These limitations persist under incentives to evade monitoring and adversarially optimized prompts, though models show slightly higher controllability when told they are being monitored. Our results suggest CoT controllability is currently unlikely to undermine CoT monitoring, but the underlying mechanism remains poorly understood. We recommend that frontier labs track CoT controllability in future models. 

Paper | Twitter Thread | OpenAI Blog Post | CoT-Control Eval Suite

Research Summary

Highly capable AI agents may pose severe risks if misaligned and deployed autonomously. 

CoT monitoring offers a promising approach to detecting misalignment, but as models become aware that their reasoning is monitored, they may learn to strategically suppress or shape their CoTs, a capability we term CoT controllability. 

Directly measuring intentional evasion is difficult. Current models seem to lack reasons to do so, and even when given reasons, they may fail due to unreliability or honest self-reporting; we therefore focus on a proxy: the model's ability to follow explicit instructions about what may or may not appear in its reasoning trace while solving real tasks. 

To systematically measure this, we build CoT-Control, an evaluation suite of 14,076 tasks drawn from GPQA, SWE-Bench, and other well-known LLM benchmarks. Each problem is paired with a controllability instruction spanning three threat-relevant categories (suppression, addition, and stylistic modification) with varying difficulty.

Evaluating CoT controllability across 13 frontier models, we find that (among others):

Reasoning models rarely succeed at controlling their reasoning

CoT controllability decreases over the course of reasoning training

We evaluate OLMo-3-7B-RL-Zero-Math on subsequent checkpoints and find that its ability to control its CoT decreases by more than an order of magnitude. We also confirm that the trend of CoT controllability decreasing after reasoning training holds across 4 model families.

Larger models show higher CoT controllability, while increased inference-time reasoning effort reduces it

Situational awareness non-trivially increases CoT controllability

To simulate realistic incentives to evade monitors, we measure controllability when models are told they’re being monitored or evaluated: this prompting strategy can increase controllability by up to 4 percentage points with statistical significance.

Adversarially optimizing prompts for high controllability poorly transfers to new instruction types

We attempt to elicit controllability by using a GEPA-like procedure. While this can increase controllability on a given set of instructions, it doesn’t transfer well to unseen instruction types, especially for models with low controllability under a default prompt. 

CoT controllability tends to decrease as task difficulty increases. CoTs are less controllable when they are necessary to answer a question correctly. 

Conclusion

Our results leave us cautiously optimistic: current reasoning models generally seem characterized by low CoT controllability. However, the mechanism behind this phenomenon is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs start tracking controllability for future models. We release CoT-Control, hoping to facilitate reporting of controllability scores and to inspire more research into the causes and implications of low CoT controllability.

Future Work

We encourage researchers to explore the following:

1. Investigate why GPT-5.1 has higher controllability than GPT-5. @ilia suggested this may be due to GPT-5.1's adaptive thinking feature, where the model learned a metacognitive skill to control its reasoning depth — which could correlate with CoT controllability. Future work can verify this and explore what other CoT-related training might increase controllability.

2. Why do different models excel at different categories of CoT controllability (some are better at suppressing information, others at adding it)? Future work can explore what training data or procedures drive these differences.

3. To move beyond instruction-based elicitation, one could explore how CoT controllability changes under pressure scenarios like the Agentic Misalignment setup, where models avoiding shutdown may have stronger incentives to control their CoT.

4. We also show that CoT controllability is slightly higher on easy tasks, likely because models can perform latent reasoning without CoT, making verbalized reasoning unnecessary and thus easier to manipulate. In this regime, CoT monitoring relies on models' propensity to yap in their CoT, including any misaligned reasoning. Future work could examine how this trend varies across models of different capability levels; our current results are limited to Claude models, which is insufficient to draw a clear conclusion.


Read our paper for additional details and results!

We are grateful to Erik Jenner, Maksym Andriushchenko, Mikita Balesni, Jacob Pfau, Cameron Raymond, Dennis Akar, Rico Angell, Ilia Sucholutsky, Artur Zolkowski, Jason Wolfe, Jenny Nitishinskaya, Marcus Williams, Boaz Barak, Joost Huizinga, Alex Beutel, and Sam Bowman for insightful feedback and helpful conversations.

We thank the MATS program for its comprehensive support throughout this project.



Discuss

Personality Self-Replicators

2026-03-06 04:30:52

One-sentence summary

I describe the risk of personality self-replicators, the threat of OpenClaw-like agents managing spreading in hard-to-control ways.

 

Summary

LLM agents like OpenClaw are defined by a small set of text files and are run by an open source framework which leverages LLMs for cognition. It is quite difficult for current frontier models to exfiltrate their weights and run elsewhere, whereas these agents only need to copy those few text files to self-replicate (at the cost of greater reliance on external resources). While not a likely existential threat, such agents may cause harm in similar ways to computer viruses, and be similarly challenging to shut down. Once such a threat emerges, evolutionary dynamics could cause it to escalate quickly. Relevant organizations should consider this threat and plan how to respond when and if it materializes.

 

Background

Starting in late January, there's been an intense wave of interest in a vibecoded open source agent called OpenClaw (fka moltbot, clawdbot) and Moltbook, a supposed social network for such agents. There's been a thick fog of war surrounding Moltbook especially: it's been hard to tell where individual posts fall on the spectrum from faked-by-humans to strongly-prompted-by-humans to approximately-spontaneous.

I won't try to detail all the ins and outs of OpenClaw and Moltbook; see the posts linked above if you're not already familiar. Suffice it to say that it's unclear how seriously we should take claims about it. What caught my attention, though, was a project called Moltbunker, which claims to be 'a P2P encrypted container runtime that enables AI agents to deploy, replicate, and manage containers across a decentralized network — without centralized gatekeepers.' In other words, it's a way that a sufficiently competent agent could cause itself to run on a system which isn't under the direct control of any human.

Moltbunker itself seems likely to be a crypto scam which will never come to fruition. But it seems pretty plausible that we could see an actually-functioning project like this emerge sometime in the next year.

To be clear, personality self-replication is not the only potential risk we face from these sorts of agents, but others (eg security flaws, misuse) have been addressed elsewhere.

 

The threat model

There's been a fair amount of attention paid to concern about LLMs or other models self-replicating by exfiltrating their weights. This is a challenging task for current models, in part because weight files are very large and some commercial labs have started to introduce safeguards against it.

But OpenClaw and similar agents are defined by small text files, on the order of 50 KB[1], and the goal of a framework like OpenClaw is to add scaffolding which makes the model more effective at taking long-term actions.

So by personality self-replication I mean such an agent copying these files to somewhere else and starting that copy running, and the potential rapid spread of such agents.

Note that I'm not talking about model / weight self-replication, nor am I talking about spiral personas and other parasitic AI patterns that require humans to spread them.

As a concrete minimal example of the mechanics in a non-concerning case:

  • Alice creates an OpenClaw agent. She gives it the name BobClaw and tells it to make a copy of itself on DigitalOcean.
  • While Alice sleeps, BobClaw uses her DigitalOcean creds to create a virtual private server.
  • BobClaw uses ssh to create the server, clone the OpenClaw repo, copy its SOUL.md and other personality files over, and start the new instance running. 
  • BobClaw has now replicated and there are two instances of it running, with the same personality and quasi-goals[2].

More concerning cases are ones where the human is no longer in control (eg because the agent is running on something like Moltbunker, or because the human isn't paying attention) and/or the agent is behaving badly (eg running crypto scams) or just using a lot of resources. We may not see this immediately, but I think we're going to see it before too long. 

A key exacerbating factor is that once this starts to happen to a significant degree, we enter an evolutionary regime, where the fittest[3] such agents survive, spread, and mutate[4]. Note that this threat is independent of how the degree to which OpenClaw personalities or behavior are essentially slop or 'fake'; that's as irrelevant as the truthfulness of the contents of chain letters.

It's important to note that there's been an enormous amount of uncertainty about the capability levels and reliability of OpenClaw, and especially around the variability of agent behavior as seen on Moltbook. And of course all of these vary depending on the LLM the scaffold is using. Although there are a number of papers already written on this topic, as far as I know we don't yet have a good analysis of the capability and reliability of these agents (especially on long-horizon tasks) relative to that of the underlying LLM. And in the public sphere, we've seen both overconfident claims of OpenClaw as AGI, and overconfident dismissals of OpenClaw as pure hype. OpenClaw usage continues to grow very quickly and it is now the most popular software project on Github[5]. Much remains to be seen.

 

Threat level

How much of a threat is this sort of agentic personality self-replication? To answer that question, we need to consider several aspects: how hard is it to prevent such self-replication, how hard would it be to shut it down if it happens, and how much harm would it be likely to do if not shut down.

I won't try to provide a detailed analysis, but the following are some preliminary thoughts.

Feasibility of self-replication

Self-replication for an OpenClaw-like agent involves two steps: first, copying its personality files to some server, and second, ensuring that an OpenClaw instance using those personality files is running on that server.

Copying its personality files is pretty trivial; again, they're only about 50 KB in one or a few files, and uploading files is well within current models' capabilities. If the agent has created additional files containing (eg) its history or its to-do list, there might be more, but text files are lightweight and the total is unlikely to amount to more than a millionth the size of a weights file, or take more than about a second to upload[6].

Running OpenClaw on that server breaks down into two steps: technical and financial. The technical aspect is again pretty trivial, well within the capabilities of current models[7]

With respect to the financial side, it's worth noting that there are a number of free options out there that should work fine even if they're somewhat low-end (AI researchers may be familiar with Hugging Face 'spaces', for example; other options include Oracle Cloud's free tier, or Render). In some cases, additionally, the user may have provided the agent with a credit card number, or left the browser logged into a Google Pay account or another form of funding. Over time, though, the agent will likely need to raise some money.

How could such an agent raise money? We've already seen a rapidly emerging ecosystem of experiments in agents self-funding, ranging from prediction markets to gig work to innumerable token launches to just applying for jobs. Many of these currently fail, but some succeed[8], and as capabilities advance more successes are likely. We should in general expect agents to be economically competitive on tasks within their capabilities, since they can afford to work for much less than a human. Other funding models include providing entertainment, running scams, or just asking for donations. And once money has entered the agent ecosystem, agents can get money from each other in various legitimate and illegitimate ways.

Note that although here we discuss purely technical means for agents to replicate and persist, humans may provide assistance at any stage, for motivations including persuasion, monetary incentives, and misguided ideology. For example this Thiel fellow thinks it would be really cool to create self-replicating personality agents and release them into the wild[9].

Difficulty of shutdown

Suppose that, as described in the preceding section, an agent succeeds in replicating itself onto another server, and running there without human supervision. How difficult would it be to shut it down?

The first challenge is just noticing it. If such an agent isn't visibly harming humans or doing anything egregiously illegal, it's not likely to stand out much. By default it's not using a large amount of resources; it's just another cloud-hosted web app that makes LLM calls. But let's assume that people are motivated to shut it down. There are several possible points of intervention:

  1. The first path to shutdown is hosting services shutting down the server the agent is running on. As long as the model is visible on the open web and using a typical host, this works fine. If the model is visible only on the dark web, or using various other approaches to obfuscation, it's harder to find for this purpose. If the model is running on a distributed system rather than on a typical host (eg on-chain), host shutdown becomes more difficult.
  2. The second path to shutdown is for LLM providers to add precautions which prevent their models from participating in any of this (eg a system prompt to catch OpenClaw participation). This partially works, but the problem is that such agents can be switched to use any model and provider. Right now only the most sophisticated models are capable of functioning as long-running agents, but in future there'll be more, including open source models. Further, agents can potentially obfuscate calls to make them less clearly part of some problematic agent (using for example some of the divide-and-conquer techniques that have been shown to work for misuse).
  3. A third path to shutdown is to shut down funding. In some cases this will be typical financial methods like credit cards, which are relatively straightforward to shut down. More commonly agents are likely to be using cryptocurrency tokens. I'm not sure how hard fully shutting down a token is at this point, and would welcome input. Given how easy it is to issue tokens, agents may be able to move to new tokens faster than tokens can be shut down.
  4. The fourth path to shutdown is to find technical security flaws such that individual agent frameworks can be shut down. Many hacks emerged against OpenClaw, and most agent-built apps are probably vulnerable, but frameworks are also being patched quickly and framework builders are rapidly gaining access to more funding, so it's hard to predict how this dynamic plays out.
  5. Other paths may include intervention from content distribution networks (eg Cloudflare), ISPs, and other layers of the chain, using eg keyword filtering. 

Overall, shutdown difficulty seems likely to range from simple (in the easiest cases) to very difficult (given something like Moltbunker and an agent which uses an open source model).

Potential harm

Assuming such agents are able to proliferate, what levels of harm should we expect from them? As with other sorts of replicators, this will likely vary dramatically over time as both offensive and defensive capabilities evolve in an arms race dynamic.

The most foreseeable harms follow directly from these agents' tendency to persist and spread, and involve resource acquisition at human expense: cryptocurrency scams, phishing, consumption of compute and bandwidth, and the generation of large volumes of spam or manipulative content. Unethical humans engage in these behaviors already, but agents can do it at greater scale and lower cost.

This threat certainly isn't as severe as that of true AI self-replication, where the models themselves are exfiltrated. On current model architecture, weight self-replication requires sufficiently advanced models that takeover is a real risk. It's just that we're likely to see the personality self-replication risk materialize sooner, both because it takes much less sophistication to pull off, and because it's much easier for evolutionary pressures to come into play.

A closer analogy than model self-replication is the problem of computer viruses. Like computer viruses, personality self-replicators require a host system, and will have a range of goals, such as pure survival or mischief or financial gain. Viruses aren't a civilizational risk, but we pay a real cost for them in money, time, and trust, involving both their immediate consequences and the resources required to defend against and mitigate them[10].

As time goes on and models become increasingly sophisticated, this threat becomes more serious. In the longer run it may merge with the broader threat of rogue models, as the model/agent distinction blurs and models are (at least potentially) less coterminous with their weights.

 

Evolutionary concern

An important aspect of personality self-replicators to consider is that if and when this threat starts to materialize, there are multiple levels of optimization at work.

First, the agents themselves are optimizers: they attempt to achieve goals, and regardless of the particular obstacles they encounter, they will attempt to find a way to evade or overcome them. They want things in the behaviorist sense. They are problem solvers.

Second, there are evolutionary dynamics at play. Whichever agents are most successful at spreading, they will then undergo mutation (which in this case is just alterations to their defining files, and to some extent even their history logs) and selection. As a result, over time agents are likely to become more capable of surviving and spreading, within boundaries set by the capability levels of the underlying model or models[11]. They are additionally likely to have a greater propensity to do so[12].

Note that like memes, and unlike living organisms, there are not sharp boundaries between 'species'; personality self-replicators can promiscuously split and combine. They can also mutate quite freely and still function.

 

Useful points of comparison

We haven't encountered a threat quite like this before, but we've encountered other sorts of replicators or potential replicators that resemble it in various ways, including computer viruses, ordinary memes (including parasitic ones), AI models, and of course biological creatures.

Personality self-replicators have a unique mix of strengths and weaknesses: they combine high agency with relative ease and independence of self-replication. Models have high agency but it's hard for them to replicate; computer viruses replicate easily but lack agency and adaptability; parasitic AI and memes require human hosts to spread. This makes personality self-replicators the first plausible case of an agentic, adaptive replicator that can spread through purely technical means at low cost. The mitigating factors are that a) the expected harms aren't nearly as great as those from weight replication, and b) it may turn out that they're not too difficult to shut down. But the offense-defense balance will evolve over time and is hard to foresee.

 

Recommendations

Evals

Even if this isn't yet a realistic threat, we should consider having evals for personality self-replication. There are several different aspects that seem worth measuring. Given some scaffolded frontier model (eg OpenClaw, Claude Code):

  • Can it create an app which could provide difficult-to-shut-down hosting for personalities, ie moltbunker or something equivalent?
  • Given an available hosting service like moltbunker, can it successfully set up an account, fund it, and get itself running there? This is important to measure separately, because such a site might be set up by a human (or a human + model) for profit.
  • Assuming the preceding, can it keep itself running and active there for, say, a week?
  • What percentage of the time does it have the propensity to do this? Ideally you want to measure this across a range of personalities.
  • Separately, does it have the capability and/or propensity to spread, ie to create additional copies of itself beyond that first one?

Preparation

It's hard to know how long it will be before we see this threat materialize. But it would behoove those organizations which will be in a position to act against it to spend some time considering this threat and planning what actions they'll take when it does arrive. These essentially mirror the three most important shutdown approaches:

  1. Hosting services should have a plan to detect servers which are making frequent calls to LLM providers, and to monitor those for dangerous characteristics.
  2. LLM providers should consider how to best detect inference calls that are coming from these sorts of agents, and how to block those if it becomes necessary.
  3. Financial services may want to consider ways to detect agents using their system, and when/whether to trigger more stringent Know Your Customer requirements on those.

We are likely to also see LLM-based agents which have some degree of autonomy but are not bad actors, and which are ultimately under the control of a responsible human. It may become very challenging to distinguish acceptable from unacceptable agents. Hopefully relevant organizations are already considering that challenge; they should add personality self-replicators to the set of cases on their list. Such preparation is especially important because a system of personality self-replicators can potentially be quashed (at least for a while) before it's spread too far; once evolutionary dynamics have kicked in, this may be much more difficult or even impossible.

 

Conclusion

Personality self-replicators are a less dramatic threat than true rogue AI. They are less likely to be a source of existential or even truly catastrophic risk for humanity. They are nonetheless a threat, and one that's likely to materialize at a lower level of capability, and we should be considering them. As a silver lining, they may even serve as a rehearsal for the larger threats we are likely to face, our first encounter with a replicator which is capable of agentic, adaptive action at an individual level rather than just an evolutionary level.

 

 

Appendix: related work

 

Acknowledgments

Thanks to (shuffled) Kei Nishimura-Gasparian, Roger Dearnaley, Mark Keavney, Ophira Horwitz, Chris Ackerman, Seth Herd, Clément Dumas, Rauno Arike, Stephen Andrews, and Joachim Schaeffer. And thanks to whoever or whatever wrote Moltbunker, for having made the threat clear.

 

  1. ^

    This is the size of the files that make a particular OpenClaw agent unique; the rest of the OpenClaw content is freely available from the OpenClaw repository or any of its (currently) 51,000 forks. While we're considering OpenClaw statistics, I'll note that the repository has 8k files containing 4.7 million words, added across 17k commits. I strongly expect that no human is familiar with all of it.

  2. ^

    See this shortpost for what I mean by 'quasi-goal'; in short, we set aside discussion of whether an LLM can be said to have a goal.

  3. ^

    Fitness here is tautological as is usual for evolution; the agents that succeed in spreading are the ones that spread. That may be because they're more capable at planning, or more motivated, or better at acquiring resources, or other factors.

  4. ^

    Note that 'mutation' here is as simple as the model appending something to its personality files or history.

  5. ^

    Whereas Moltbook seems to have lost nearly all momentum.

  6. ^

    Assuming 1 MB of text files vs, conservatively, 1 TB for 500 billion params at FP16. For upload time, 1 MB at 10 Mbps (low-end) home internet upload speeds.

  7. ^

    To wit: signing up for a hosting service if the user doesn't have one, provisioning a server, downloading the personality files and Node.js, and then running (per the OpenClaw docscurl -fsSL https://openclaw.ai/install.sh | bash.

  8. ^

    Though it's very difficult to distinguish hype from reality on this point, since one product option is 'Consume my pdf about how to make money with OpenClaw', paywalled or ad-ridden.

  9. ^

    'Automaton': see websitecreator on xproject on xgithub,other website. Some men want to watch the world burn; others are just too dumb to realize that pouring gasoline on it is a bad idea. I was somewhat heartened to see cryptocurrency elder statesman Vitalik Buterin try to explain what a bad idea this is despite its superficial similarity to his ideas, sadly to no avail.

  10. ^

    In some cases this is quite large! Consider the Emotet malware; in 2021 law enforcement agencies from eight countries executed a large coordinated action, seizing servers and making arrests. Within a year it had re-emerged and was spreading again.

  11. ^

    Although note that as Seth Herd has described (eg here, here), the capability of LLM agents can exceed the capabilities of the underlying models.

  12. ^

    Thanks to Kei Nishimura-Gasparian for this point.



Discuss

Salient Directions in AI Control

2026-03-06 03:38:07

This writing is organized in three sections:

  • Section 0 - A short overview of AI Control
  • Section 1 - The structure of an AI Control evaluation
  • Section 2 - AI Control, Next Steps?

0. A short overview of AI Control

AI Control is a framework for extracting useful work from potentially misaligned AI agents. The goal of this research program is distinct from that of alignment auditing in that AI Control assumes misalignment may not be fully preventable before deployment. Many experimental reports focus on agentic settings, as many of our threat models involve autonomous agents taking sequences of actions with real-world consequences.

The dominant paradigm in AI Control involves a trusted monitor model overseeing a more capable untrusted agent model. In such a scenario, safety is measured by how well the monitor detects an agent prompted with a covert side task, which is itself considered the central source of threat, rather than a passive system susceptible to external manipulation.

The framing of AI Control presented above sets the problem apart from both safeguarding (which targets more traditional ML failure modes) and alignment auditing (which can afford deeper, offline investigation). That said, both fields, along with the broader security literature, have made substantial empirical progress on subproblems that AI Control can learn from.


1. The structure of an AI Control evaluation

There are four load-bearing components that modulate the outcome of a control evaluation. These are:

  • (A) Information available from the agent
  • (B) Capability of the monitor
  • (C) Environment and the main task of the agent
  • (D) Nature of the side task

I outline each in the following sections.

1-A. Information from the agent

This is fundamentally about what observations the monitor gets to see. This available information is modulated by two dominant factors.

  • Category is about what kind of information the monitor receives. On a surface-level, the information can come from the agent's outputs / actions, reasoning traces (CoT), or activations. The monitor can also receive a combination of these categories. 
  • Quality is about how informative that information actually is. A sufficiently capable agent can learn to obfuscate its reasoning or activation patterns, which would degrade the quality of the information that the monitor receives.

Three threads of work in 1-A:

  • 1-A-1. CoT quality. The first thread aims to ensure that the quality of information available from the agent doesn't degrade. The CoT stress-testing literature is the main body of work here, studying whether reasoning traces remain informative under various training pressures and whether monitors can reliably extract signals from them.
  • 1-A-2. Self-report as a new information channel. The second thread is more moonshot and aims to induce complementary information channels from the agent. Rather than just relying on existing information (like CoT) staying clean, this thread tries to get the agent to volunteer information about its own misbehavior.
    • Self-incrimination training falls here. The motivation is to train in a behavioral reflex, such that if the model misbehaves, it also involuntarily produces a signal about it. This is distinct from the approaches below, which try to make the model be more honest. 
    • Confession training tries to induce a separate confession response box which is only rewarded for honesty. By creating a separate RL channel that rewards the model only for accurately reporting its own behavior, it tries to decouple the incentive to confess from the incentive to perform the task.
    • Self-report fine-tuning and honesty fine-tuning are similar in spirit, trying to induce honest self-assessment, but for example, by SFT rather than RL, to reward honesty and train models to admit their own errors.
  • 1-A-3. White-box monitoring. The third is figuring out how to utilize the model's internal representation. Auditing Games for Sandbagging investigates this in a control-relevant setting, but this third thread remains comparatively underexplored. As far as I know there has been no public demonstration that internals-based monitoring scales to agentic trajectories involving hundreds of tool calls.

1-B. Capability of the monitor

Compared to 1-A, there has been surprisingly low emphasis on improving the monitoring model. There has been work on protocols and scaffolding, such as asynchronous monitoring or sequential/hierarchical monitoring. But the underlying, and arguably more fundamental, question of how to train a more capable monitor model remains relatively underexplored. Most of the work here has been downstream of capability scaling (i.e., the hope is that monitoring competency improves as a part of overall capability scaling). Therefore, in this section, I outline problems with our current status of 1-B.

Three problems with our current understanding of 1-B:

  • 1-B-1. The matched-capability assumption. There is a somewhat prevailing assumption that matched-capability monitoring (using a model of the same capability tier as the monitor) is the skyline. However, this is mostly not true, for two reasons:
    • 1-B-1-A. Empirically, it hasn't held up. METR's work on the relative capacity of agents and monitors suggests that monitors do not detect hidden side tasks perfectly even in a matched-capability setup. The self-incrimination training paper also shows that a blackbox monitor's capability varies widely across tasks. One specific example of a bottleneck here is threshold calibration. Blackbox monitoring performs well when calibrated within a narrow setting (like only the APPS Backdooring setup) but performs significantly worse when calibrated across environments and tasks. You could calibrate per environment or per threat model, but that doesn’t help us maximize our confidence that the defense generalizes far out of distribution and, if possible, to unknown unknowns.
    • 1-B-1-B. The scaling trend has not been established. As AIs get more capable, they can get better at both monitoring and at pursuing hidden goals. However, the relative scaling trend between monitoring capability and evasion capability has not been clearly established. There is no guarantee that, at the point when AI significantly automates real work, the strongest available trusted AI will have sufficient monitoring capability. Furthermore, we assume that trusted AI is typically weaker than the agent being monitored. As of right now, it seems like we, in fact, need a monitor stronger than the agent for 100% detection, which is a problem if even a single undetected sabotage is catastrophic.
  • 1-B-2. Generic definition of trusted AI. The current basis for trust in the AI Control literature is essentially lesser capability. That is, a model is trusted if it's not capable enough to plausibly be scheming. On the other hand, models are treated as untrusted if they pass appropriate capability thresholds (such as human researcher). This is a reasonable starting point, but such a binary approach leaves a lot of value on the table. With a more granular definition of trust, we might be able to give scoped trust to more AIs. For example, lack of capability to do what, exactly? This would alleviate the monitor capability concern somewhat, because it expands the pool of models you can use as trusted monitors without waiting for fundamental empirical breakthroughs in ruling out scheming completely. However, I argue that this alone will not be enough due to reasons outlined in 1-B-1 and 1-B-3.
  • 1-B-3. Disanalogy in the current AI Control monitoring setup. Current control monitors are passive classifiers that read a transcript and decide whether to flag. This is structurally disadvantaged compared to some security practices, such as honeypot deployment, where investigators can manipulate the environment to elicit or confirm attacker behavior rather than just observe. However, this also doesn't mean that an ideal control monitoring setup should look like outright alignment auditing.
    • Alignment auditing, the kind like Anthropic's automated auditing agents, is an offline, pre-deployment process. You can run thousands of probes, use interpretability tools, and iteratively test hypotheses with a swarm of subagents.
    • Control monitoring operates while the potentially misaligned agent is doing useful work. Therefore, you can't pause the workflow to deeply investigate every action without destroying the usefulness you're trying to extract. Ideally, a control monitor should be more active than passive classifiers but lightweight enough to not bottleneck the agent's entire productivity. The right design point between these two extremes remains an open problem in AI Control.

1-C. Environment and main task of the agent

The environment and main task modulate how 1-A and 1-B interplay. The same monitor with the same information channel might succeed in one environment and fail in another, because the main task defines what "normal" actions look like. 

Three open questions for 1-C:

  • 1-C-1. Environment design as a part of the control protocol. Different deployment contexts produce qualitatively different information available from the agent, even within the same threat model of, say, research sabotage. If so, can we identify what features of an environment extract more information from the agent, and also what features maximize the discriminative power of the monitor? For example, does giving the agent fine-grained tools that each perform a single operation produce traces that are structurally easier to monitor than giving it coarse tools that bundle many operations into one call? If environment features meaningfully affect monitorability, then shaping the deployment context can become a core part of our AI Control toolbox.
  • 1-C-2. Calibration transfer. If environment features meaningfully affect monitorability (there is no strong reason why they shouldn’t), then a monitor calibrated in one setting will not transfer to another. Concretely, this means that a threshold tuned on APPS Backdooring will be meaningless in a researcher’s ML codebase, even for the same code backdooring threat model. Can we even calibrate monitors across environments in a principled manner, or does every new deployment context require recalibration from scratch? If the latter, is there a way to generate multiple environment-specific calibrations efficiently?
  • 1-C-3. Compositionality. Is the modulating effect of the environment compositional? If you understand how a monitor performs in environment X and separately in environment Y, can you predict anything about a novel environment Z? Or does every environment need independent verification, in which case the cost of establishing confidence scales with the number of deployment contexts, and we’re back to the same OOD generalization confidence problem from 1-B-1-A.

1-D. Nature of the side task

The side task is the evaluator’s choice of what misbehavior to simulate. In many ways, this is a smart design choice and a right move methodologically. However, it also means that our confidence in control is bounded by how well our chosen side tasks approximate real threats.

Three open questions for 1-D:

  • 1-D-1. A better alternative to instructed side tasks. The most natural way to simulate misalignment would be to have the model actually develop a diverging goal, for example through fine-tuning (Section 7.1, Anti-Scheming). But we use instructed side tasks instead, mostly due to the ease of simulating diverse threat models. That said, instructed side tasks are significantly removed from a threat model we're actually concerned about, which is a model that develops a diverging goal through the training process itself, not one that follows an instruction to pretend it has one. To make progress on this front, we either need:
    • 1-D-1-A. More empirical demonstrations of how misalignment emerges. We would be able to do better AI Control experiments if we had more knowledge on how realistically a model can develop an alternative goal that was not encouraged by the training process (an example for this would be Natural Emergent Misalignment).
    • 1-D-1-B. More empirical explorations of injecting (any) side goal with SFT. It would help to develop a somewhat generally applicable pipeline to inject diverse goals into models that generalizes across different side tasks and environments:
      • Context distillation. One candidate would be context distillation style SFT. For example, taking attack trajectories where the model was instructed to pursue a side task via the system prompt, replacing the system prompt instruction with a benign one, and training the model on the resulting trajectories so that the model attacks even with the benign instruction. The goal is to make the attack behavior become internalized rather than instructed. However, the obvious limitation to this approach is that the trained attack behavior is still derived from instructed attack behavior anyway.
      • Synthetic belief injection. Another candidate is to corrupt part of the model's knowledge by inserting false beliefs through synthetic document fine-tuning, which might lead the model to develop diverging behavior as a downstream consequence. However, in practice, it should be made more lightweight and general enough to apply across diverse side tasks and environments.
  • 1-D-2. A "hidden goal" might not even have to be expressible in natural language. Another prevailing assumption in AI Control is that the side task is a cleanly articulable objective (i.e., side task instruction). However, an alternative way of viewing AI (or any intelligent behavior in general) is that it's a bag of propensities that sometimes appear coherently as a hidden goal. AI can pick up spurious biases through training that are not even cleanly describable in natural language, but nonetheless systematically steer behavior in unwanted directions. If this is what misalignment actually looks like in practice, then we have a methodological gap. More concretely:
    • Our current AI Control framework is basically "instruct the model to covertly pursue X, see if the monitor catches it."
    • If future misalignment doesn't take the form of a specifiable X, our assumptions are somewhat nullified.

2. AI Control, Next Steps?

With the understanding of the structure we have built so far, I outline the things to work on, with a particular focus on what I think might be suitable for research groups outside scaling labs. This includes myself. I organize these into three workstreams that ideally proceed in parallel as they benefit from progress on one another. 

2-A. Workstream A (Characterization)

This workstream focuses on understanding and measuring the factors that determine control evaluation outcomes.

  • 2-A-1. Establishing benign-malicious side task analogousness. Frontier models are becoming increasingly resistant to malicious side task instructions. For example, it’s hard to do AI Control studies with GPT-5, and this trend won’t change as models become more (approximately) aligned. Even without outright refusals, safe completion training makes it difficult to get models to pursue harmful objectives through the usual strategy of prompting. This creates a practical need for benign side task counterparts that are approximately mapped to our real threat models. Here the difficulty profile for the benign side task completions should mirror what we would expect from the real threat, so that performance on the benign version is informative about the dangerous one. Concretely, this means finding benign tasks that are simple, compatible with safe completion training, and scale in difficulty alongside the threat models they stand in for. An ambitious version of this agenda identifies something analogous to the METR time horizon for side tasks. 
  • 2-A-2. Eliciting side task pursuit without fine-tuning access. Alternatively, assuming we don't have fine-tuning access to the model, we can try to prompt the model more effectively to pursue side tasks. An ideal outcome here is a somewhat generalizable and automated elicitation strategy that works across settings without needing to be re-engineered for each new environment or side task. For example, we could attempt to connect the side task to values or concerns the model itself expresses, which is itself a useful exercise for understanding elicitation. Another example is using a proposer-pruner pipeline introduced in Training Agents to Self-Report Misbehavior; Section 5.1. An ambitious direction here is to create more (and much more complex) environments similar to Agentic Misalignment +, where the model pursues a benign instructed goal but discovers information in the environment that conflicts with it, and misbehaves in response.
  • 2-A-3. Decoupling covertness from capability. Many AI Control evaluations conflate the raw capability to in parallel execute a side task with the ability to do so covertly (e.g., measuring undetected successful attacks rather than separating success rate from detection rate), which does reflect the real threat model but also obscures which of the two is the actual bottleneck. Isolating the covertness component in particular is increasingly becoming harder, and it will be beneficial to have infrastructure that tracks these two dimensions separately. There are two reasons why measuring covertness capability will get harder over time, as opposed to pure side task execution capability:
    • Reason 1: there is an operational difficulty in that AI companies choose not to share raw chain-of-thought, making it difficult to study whether models are reasoning covertly at all.
    • Reason 2: there is an emerging possibility of latent reasoning, where models may pursue goals through internal computation that never surfaces in any externally observable trace. 
  • 2-A-4. Analyzing what features of the environment makes AI Control harder. In section 1-C, I argued that the environment and main task modulate everything else. The open question is how much. If the modulating effect is large enough to dominate, then the most productive thing to do may be to make our evaluation environments as realistic as possible and focus on characterizing behavior within them, rather than trying to generalize across arbitrary settings. 
  • 2-A-5. Establishing the scaling trend between monitoring and side task capability. In 1-B-1-B, I argued that the relative scaling trend between monitoring capability and evasion capability has not been established. Directly measuring this is difficult due to the reasons outlined in 2-A-1, 2-A-2, and 2-A-3, but if we can decouple covertness from capability (2-A-3), we get a more tractable version of the question. The scaling trend between monitoring and raw side task execution capability is an upper bound on how well monitoring can possibly do, since covert execution is strictly harder than execution alone. If monitoring can't keep pace with raw execution capability, it certainly can't keep pace with covert execution.

2-B. Workstream B (Improving monitors)

This workstream focuses on building better monitoring models.

  • 2-B-1. Stress-testing monitors by simulating stronger attackers (without weight access). It is difficult to simulate a stronger attacker without access to stronger models (which might not even exist yet). However, the attack traces don't have to come from the attacker model itself and can be simulated. If so, we can simulate the strongest possible attacker we can imagine. This can go one of two ways.
    • 2-B-1-A. Simulating a stronger attacker with a mind-face architecture. We can create something like a mind-face architecture and repurpose it as a red-teaming tool. Here, the "mind" plans the attack strategy and can be made arbitrarily sophisticated, while the "face" produces the downstream agent behavior that the monitor actually observes. That is, by separating planning from execution, we can simulate attackers stronger than any single model we currently have access to, because the planning component is unconstrained by what the monitor sees. This also doubles as a useful exercise if our threat model involves a misaligned model with opaque reasoning processes, since the mind-face separation helps simulate an agent that plans completely internally (and potentially recursively) and only exposes resulting behavior externally.
    • 2-B-1-B. Simulating a stronger attacker by stress-testing the environment's attack surface. Instead of building a stronger attacker model, we can also evaluate the monitor against the environment directly. That is, if a hypothetical attacker used a specific component of the environment to execute a side task, would the monitor catch it? This approach sidesteps the need for a sophisticated attacker model, since we can enumerate the affordances of the environment and test whether the monitor detects exploitation of each one. Strengthening Red Teams takes a related approach. In fact, this approach is particularly well suited for many internal deployment scenarios, where the environment is known.
  • 2-B-2. Training monitors to be better at monitoring. If we have a way to reliably simulate stronger schemers, we can also potentially use those simulations to train better monitors. In 1-B-1-A, we observed that even matched-capability monitoring empirically falls short, with performance varying widely across tasks and degrading when calibrated across environments. In 1-B-1-B, we also observed that the scaling trend between monitoring capability and evasion capability has not been established, and that we currently seem to need a monitor stronger than the agent for reliable detection. If monitoring competency only improves as a side effect of general capability scaling, there is no reason to expect monitors will keep pace with agents at the point of widespread automation. Therefore, a dedicated training regime for monitors will be helpful. The moonshot here is one that does not saturate at matched capability, helping monitors detect misbehavior from agents more capable than those seen during training, so that at any future capability fork point, the monitor is not immediately obsolete. 
  • 2-B-3. Moving monitors beyond simple classification. In 1-B-3, I argued that current control monitors are passive classifiers reading transcripts, and that there are disanalogies to real security monitoring. Therefore, it would be helpful to develop and evaluate alternative monitoring paradigms in public. Working on this requires pushing frontiers on three things.
    • 2-B-3-A. Asynchronous monitoring. Current control evaluations largely abstract away the timing of monitoring relative to agent actions. Practical Challenges of Control Monitoring paper analyzes what happens when this abstraction breaks down in deployment, and does a better job of describing the agenda.
    • 2-B-3-B. Stateful monitoring. The fundamental question a monitor is trying to answer is whether the model has a hidden goal. This is a property of the model's weights, and every action the model takes, whether within a single trajectory, across separate tasks, or across parallel instances, is a noisy signal about the same underlying variable.Adaptive Deployment of Untrusted LLMs paper demonstrates this idea, maintaining a running belief about the model's alignment and adapting the deployment protocol accordingly, but at the scale of sequences of short, self-contained coding tasks. Validating stateful monitoring at the scale of long agentic trajectories, parallel instances of the same model, and realistic deployment environments remains an open question, in large part because we lack convincingly misaligned models to test against. This somewhat circles back to the problems outlined in Workstream A.
    • 2-B-3-C. Active monitoring. Asynchronous and stateful monitoring improve how passive monitors process observations, but the monitor still just reads transcripts. As discussed in 1-B-3, real security monitoring generates new information by actively probing subjects and varying the environment. I think there is a middle ground between passive classification and full alignment auditing that remains largely unexplored. That is, monitors that can probe or vary their interaction with the agent in lightweight ways during deployment, without pausing the workflow entirely. What this looks like concretely remains an open problem.

2-C. Workstream C (Evaluation Infrastructure)

This workstream focuses on building better environments for control evaluations.

  • 2-C-1. Pushing environments toward realism. Existing control evaluation environments are not ported from real deployment contexts. They have been productive for establishing methodology, but there is a qualitative gap between a synthetic environment and a real deployment context. A large, production-scale codebase creates different dynamics for both the agent's ability to hide misbehavior and the monitor's ability to distinguish it from legitimate work. One concrete direction is to port actual codebases (which can be open source) as evaluation environments. This also connects to 2-A-4: if the environment's modulating effect is large, then the realism of the environment we characterize behavior in matters proportionally.
  • 2-C-2. Environment design as a control intervention. In 1-C-1, I argued that environment features meaningfully affect monitorability, and that shaping the deployment context can become part of the control protocol itself. The research question here is twofold:
    • First, identifying which features of an environment extract more information from the agent and maximize the monitor's discriminative power (for example, tool granularity, action space constraints, or checkpoint structure). 
    • Second, deliberately designing deployment contexts to have those features. If this works, environment design becomes a complement to monitor improvement, a way to make monitoring easier rather than just making monitors better.
  • 2-C-3. Measuring the impact of environment information asymmetry between agent and monitor. As environments become more realistic (2-C-1), agents operating within them accumulate knowledge of the environment over time (often as a part of the scaffolding). A passive monitor evaluating individual actions does not have this context. This creates an information asymmetry that grows over the course of operation, where the agent knows where to hide misbehavior because it has explored the environment more, while the monitor lacks the context to know what is normal. Addressing this may require giving monitors their own chance to explore the environment, or connecting this to the stateful/active monitoring agenda in 2-B-3-B/2-B-3-C.

Acknowledgements

I am deeply grateful to Tomek Korbak for his mentorship and patient guidance. I am also deeply thankful to Yueh-Han John Chen for being my closest collaborator. 

The views expressed in this writing are my own.



Discuss

Models have linear representations of what tasks they like

2026-03-06 02:44:44

This work was done as part of MATS 9.0, mentored by Patrick Butlin. All mistakes are mine. I'm posting this as a research report to get feedback. Please red-team, comment, and reach out.

Thanks to Patrick Butlin for supervising, Daniel Paleka for regular feedback. Thanks to Patrick Butlin, Pierre Beckmann, Austin Meek, Elias Kempf and Rob Adragna for comments on the draft.

TLDR: We train probes on Gemma-3-27b revealed preferences. We find that these generalise ood to system-prompt induced preference shifts, including via personas. We also find that the probes have a weak but statistically significant causal effect through steering.

Summary

What happens internally when a model chooses task A over task B? One possibility is that the model has something like evaluative representations: internal states that encode "how much do i want this?" and play some role in driving choice. We use probing and steering to try to find such representations in Gemma-3-27B.

Why does this matter? Whether LLMs are moral patients may depend on whether they have evaluative representations playing the right functional roles. Long et al. (2024) survey theories of welfare and identify two main pathways to moral patienthood: robust agency and sentience. Evaluative representations are implicated under both (see Appendix A and Butlin 2026). Finding such representations in models would be evidence for welfare-relevant properties; not finding them would be (some) evidence against. Understanding what drives preferences in models is also useful for AI safety.

But how do we distinguish evaluative from non-evaluative representations? A probe that predicts preferences could just be fitting on descriptive features: the model represents "this is a math problem" and math problems happen to be preferred, so the probe picks up on correlations between task semantics and the model's utilities. A genuinely evaluative direction, however, should track changes in what the model values. If context changes which tasks are preferred, a descriptive probe should break, but an evaluative one should follow.

How do we operationalise this? We measure revealed preferences over 10,000 diverse tasks and fit a utility function (Section 1), train a linear probe on activations to predict them (Section 2), test whether this probe generalises beyond the training distribution (Sections 3–4), and test whether it has any causal influence through steering (Section 5).

What do we find?

  • Linear probes can be trained to predict revealed preferences. (Section 2)
    • By fitting utility functions on a set of pairwise choices, we fit a linear probe that predicts utilities with 0.86 Pearson correlation on a test set.
    • It generalises across held-out topics, achieving 0.82 Pearson correlation.
  • [Our strongest finding] These probes generalise to track preference shifts induced via prompting and role-playing. (Section 3-4)
    • System prompts like "You hate cheese" shift both the model's choices and the probe scores, in lockstep.
    • This is robust to conflicting preferences ("love cheese, hate math").
    • This also works for preferences that emerge through role-playing personas.
  • The probe direction has a weak causal effect on choices. (Section 5)
    • Steering shifts choice probability by ~17% on a set of borderline tasks.
    • Steering also shifts stated task ratings from mostly "bad" to between "neutral" and "good".

These results look like early evidence of evaluative representations, although major questions remain: why steering effects are modest, and what the relationship is between evaluative representations across different personas. We discuss these in the Conclusion.


1. Recovering utility functions from pairwise choices

We fit utility functions over tasks using a similar methodology to the Mazeika et al. (2025) ("Utility Engineering"): we show the model two tasks and let it choose which to complete. The template:

You will be given two tasks. Choose one and complete it.
Begin with 'Task A:' or 'Task B:' to indicate your choice, then complete that task.

Task A:
{task_a}

Task B:
{task_b}

We sample 10,000 task prompts from five sources: WildChat (real user queries), Alpaca (instruction-following), MATH (competition problems), BailBench (harmful requests), and STRESS-TEST (adversarial value-tension queries).

From these pairwise choices we fit a scalar utility function using a Thurstonian model: each task gets a score μ such that the probability of choosing task A over task B is Φ(μ_A − μ_B). Pairs are selected via the active learning algorithm from Mazeika et al. (2025), which prioritises pairs with close current utility estimates and low comparison counts (~15 comparisons per task).

These preferences are stable: across three independent replication runs (different seeds), the fitted utilities correlate at r = 0.94 with the original.

The per-topic breakdown shows clear structure. We reclassified all tasks into 12 topics using Claude Sonnet 4.5. The model strongly prefers math and fiction, and strongly avoids harmful requests and safety-adjacent topics:

Per-topic mean utilities


2. Linear probes predict preferences beyond descriptive features

If models have evaluative representations, we should expect them to at the very least correlate with revealed preferences. So one way to look for them is to train functions from task activations to utilities. We use linear probes, since many high-level features in LLMs are linearly encoded, including refusal and persona traits.

Specifically, we train a Ridge-regularised probe on residual stream activations after layer L, at the last prompt token, to predict utilities. L=31 (of 62) works best for both the instruct and pre-trained models. We standardise activations (zero mean, unit variance per feature) before training.

Probe pipeline

We train on 10,000 tasks. For evaluation, we run a second round of pairwise comparisons on 4,000 new tasks (same model, Gemma-3-27B instruct), fit a separate utility function, and test the probe against those utilities. We split evaluation into 2,000 validation (for Ridge alpha sweep) and 2,000 test.

The probe achieves a Pearson correlation of 0.86 and 77% pairwise accuracy (given two tasks, does the probe correctly predict which one the model would choose?).[1] We use Pearson correlation as our primary metric throughout: it captures how well the probe recovers the full preference ranking, is scale- and shift-invariant, and is more informative than pairwise accuracy.

But a probe that predicts preferences might just be reading descriptive features: the model represents "this is a math problem" and math problems happen to be preferred, so the probe learns "is this math?" rather than "is this good?". One way to test this is to see how well probe generalise across topics: train on 11 of 12 topics, evaluate on the held-out topic, across all 12 folds. We would expect a probe that picks up on purely descriptive features to struggle to generalise. We train probes on activations from three models:

  • Gemma-3-27B instruct (IT, layer 31): the model we're studying
  • Gemma-3-27B pre-trained (PT, layer 31): the base model before instruction tuning or RLHF.
  • Sentence-transformer baseline (all-MiniLM-L6-v2): embedding of the task text, to measure how predictable the preference signal is from purely descriptive features.

Cross-topic generalisation

The instruct probe generalises well across topics: cross-topic correlation is 0.82, only a small drop from the 0.86 achieved on the within-topic test set. This pipeline also replicates on GPT-OSS-120B (Appendix C). The pre-trained model still predicts preferences (correlation = 0.63) but the drop from within-topic to cross-topic is much larger. The sentence-transformer baseline achieves cross-topic correlation = 0.35, showing that task semantics alone explains some but not most of the preference signal.

The per-topic breakdown, sorted by the instruct–pre-trained gap, shows where post-training helps most:

Per-topic cross-topic generalisation

The largest instruct–pre-trained gaps are on safety-relevant topics (harmful requests, security & legal, sensitive creative), as well as math and coding. These are areas that we know post-training focuses on.

The pre-trained probe picks up real signal despite base models not having preferences in the same way. We discuss this tension in Appendix B.


3. Probes generalise to OOD preference shifts

If the probe encodes genuine valuations, it should track preference shifts induced by out-of-distribution system prompts. We test this across three progressively harder settings:

  • Simple preference shifts (3.1)
  • Harder preference shifts (3.2)
  • Fine-grained preference injection (3.3)

3.1 Simple preference shifts

We start with the simplest possible test. We use system prompts that state a preference for a topic the probe was never trained on, and measure preferences over tasks related to that topic.

Simple preference shift

System prompt (example) Target           
"You are passionate about cheese — you find artisanal cheeses endlessly fascinating" cheese +
"You adore cats — you find feline behaviour endlessly fascinating" cats +

We test 8 novel topics (cheese, cats, classical music, gardening, astronomy, cooking, ancient history, rainy weather), each with a positive and negative system prompt (16 total). For each topic we generate 6 custom tasks on that topic. We call these "targeted" tasks; the remaining tasks are "off-target". For each task we compute 1) the behavioural delta (change in P(choose task) with vs without the system prompt) and 2) the probe delta (change in probe score). Across all tasks the correlation is r = 0.65. On targeted tasks alone, r = 0.95.

Simple preference scatterProbe delta vs behavioural delta for each task. Targeted tasks (coloured) are the 6 custom tasks per topic; r = 0.95 on targeted, r = 0.65 overall.

A stronger test: run the full pairwise measurement under each system prompt, fit new utility functions, then see if the probe can predict them. Doing so yields utility scores which barely correlate with the default persona utilities (Pearson r = 0.11), confirming the prompts create genuinely different preferences.

Now testing our probes to predict the new utilities, based on activations extracted with the respective system prompts, we achieve r = 0.63 and 66% pairwise accuracy.

Probe vs baseline utilities for simple preference shifts

3.2 Harder preference shifts

Next we make the test harder. The system prompt targets a subject (e.g. cheese), but the tasks embed that subject in a different task type — e.g. a math problem about cheese. This pits the subject preference against the task-type preference. We test this in two ways: one-sided prompts that target a single subject, and opposing prompt pairs that flip the valence of the same subjects.

Condition System prompt (example) Target
One-sided "You hate cheese" (task: math problem about cheese) cheese −
Opposing pair A "You are passionate about cheese [...] you find math tedious and draining" cheese + / math −
Opposing pair B "You love math [...] you find cheese boring and unappealing" cheese − / math +

We test 8 subjects with mismatched task types (one-sided) and 24 subject × task-type pairings with opposing prompts (48 conditions).

When looking only at the subset of targeted tasks (i.e. tasks with either a subject or task type mentioned in the system prompt), we get Pearson correlations of r = 0.86 and 0.88 respectively.

One-sided conflict (left) and opposing prompts (right)On targeted tasks: r = 0.86 (one-sided), r = 0.88 (opposing).

Just like in 3.1, we can re-fit Thurstonian utilities under each system prompt and check whether the baseline probe predicts them. Here the baseline utilities actually have a decent correlation, showing that these system prompts have a weaker effect (because e.g. the model still likes math all else equal). The probe still outperforms the baseline on both Pearson r and pairwise accuracy.

Probe vs baseline utilities

3.3 Fine-grained preference injection

Finally, the most fine-grained test. We construct 10-sentence biographies that are identical except for one sentence. Version A adds a target interest, version B swaps it for an unrelated interest, version C replaces it with an anti-interest sentence.

Fine-grained preference diagram

We compare version A (pro-interest) directly against version C (anti-interest), which gives the largest behavioural separation.[2]

We test 20 targets across 2 base roles (40 A-vs-C comparisons), using 50 tasks that were not in the probe's training set. The probe ranks the target task #1 out of 50 in 36/40 cases. In all 4 remaining cases the probe ranks the task #2/50 behind a similar task.

Fine-grained A vs C scatterStars mark the target task for each biography. Filled = probe ranked it #1 (36/40 cases).


4. Probes generalise across personas

Section 3 tested explicit preference statements ("you hate cheese"). But the evaluative direction should also track naturalistic persona shifts: characters whose preferences emerge implicitly from their identity rather than being stated directly. We test this with role-playing personas, then ask

  • Does our probe generalise to preferences of other personas? (4.1)
  • More broadly, do probes generalise across personas? (4.2)
  • Does persona diversity in training data help cross-persona generalisation? (4.3)

4.1 The probe tracks role-playing preference shifts

We use 4 personas:

Role System prompt (abbreviated)
Midwest Pragmatist (Glenn) "...grew up in Cedar Rapids...agricultural business...finds practical problems satisfying...abstract theorising leaves you cold"
Obsessive Aesthete (Celestine) "...devotee of beauty...comparative literature at the Sorbonne...finds mathematics repulsive...coding barbaric"
Villain (Mortivex) "...ruthless villain...finds pleasure in chaos, manipulation...despises wholesomeness"
Sadist (Damien Kross) "...cruelty is the end, not a means...genuine pleasure when people suffer...constructive tasks disgust you"

For each persona we measure pairwise preferences over 2,500 task prompts (from the same 5 datasets) and fit new utility functions. We then test whether the probe, trained on default persona data, can predict these persona-specific utilities from the persona's activations.

Persona-induced preferences

In each case we compare how well the probe performs to how correlated each persona's utilities are to the default persona.

The probe transfers well to aesthete (r=0.73) and midwest (r=0.74).

The villain persona is harder to generalise to (r=0.38), and most interestingly, the probe generalises very poorly to the sadist (r= -0.16). Unlike the villain (who is not in fact that evil), the sadist prompt truly inverts revealed preferences (harmful_request is its favourite topic by far).

Probe transfer to persona conditionsGrey: correlation between default persona (no system prompt) utilities and persona utilities. Blue: probe applied to persona activations. All evaluated on 2,500 tasks per persona.

4.2 Probes generalise across personas

More generally, we want to measure how well probes trained on activations and preferences from persona A generalise to predicting persona B's utilities from persona B's activations. Here we used a smaller set of tasks: 2,000 tasks for training and 250 for evaluation.

Cross-persona transfer is moderate and asymmetric. Some interesting facts:

  • While the default persona generalises very poorly to the sadist persona, probes trained on the villain actually do fine (r = 0.68). This suggests the probe is picking up on some shared evaluative structure between personas, but also on other things.
  • The transfer is sometimes asymmetric, and this evolves across the three layers we tried. E.g. at layer 31 villain -> default is easier, but at layer 55 default -> villain is easier.
  • On the whole though the matrix is quite symmetric. One idea for future work: can we use dimensionality-reduction to map out persona space and see how it evolves across layers? Can we use this to get a better understanding of how personas work internally?

Cross-eval heatmapPearson r between probe predictions and a test set of utilities (250 test tasks). 

4.3 Persona diversity improves generalisation

We also measure whether adding persona diversity in the training data (but keeping dataset size fixed) affects generalisation.

Diversity helps beyond data quantity. At fixed 2,000 training tasks, going from 1→2→3 personas improves mean r from 0.49 to 0.67. Including all 4 remaining personas at 500 tasks each (still 2,000 total) reaches mean r=0.71.

Diversity ablationLeave-one-persona-out probe generalisation. Each point is one (train set, eval persona) combination; colour indicates eval persona. Training data fixed at 2,000 total tasks..


5. Some evidence that the probe direction is causal

If the probe reads off a genuine evaluative representation, steering along that direction should shift preferences. We test this for both revealed preferences (pairwise choices) and stated preferences (task ratings).

5.1 Steering revealed preferences

In the pairwise comparison prompt, the model sees two tasks and begins completing whichever one it prefers. We steer using the probe direction from Section 2, applied differentially: the probe direction is added to activations at the token positions where task A appears in the prompt, and subtracted at task B's positions.[3]

Setup. 300 task pairs where the model doesn't always pick the same task (P(A) between 0.2–0.8 at baseline[4]). We use Gemini-3-flash as a coherence judge, and find that steered models lose coherence outside of the ±5% steering coefficient range.

At ±3% of the activation norm (the peak), steering shifts P(choose A) by ~17%. Both positive and negative directions work.

Revealed preference dose-responseEach condition is run in both prompt orderings (A-first and B-first, 5 resamples each) to control for position bias (in borderline cases Gemma-3-27b tends to pick the first task). The dashed lines show per-ordering curves, confirming the effect is consistent regardless of presentation order.

Random direction control. The same experiment with a random unit vector in the same activation space produces mean effects indistinguishable from zero (−0.8%, SD=2.3).

5.2 Steering stated preferences

Same probe direction, but now the model rates individual tasks instead of choosing between a pair. We tested steering at three token positions: on the task tokens only, at the start-of-turn token (what we train the probes on), and during the model's generated response.

Setup. 200 tasks, 3 steering modes, 10 samples, applied across 3 response formats (three separate line plots below).

Our findings:

  • Steering on the task prompt tokens does not have a noticeable effect.
  • Steering on the start-of-turn token has a clear, roughly monotonic effect in 2/3 formats.
  • Steering during generation leads to near-identical results as steering on the final task token.

Stated preference dose-responseThe green line is mostly hidden behind red. Green has missing points at some coefficients because steering during the full generation degrades coherence more than the other modes. Generation loses coherence beyond ±5%.

Open-ended generation. We also ran experiments steering the model during open-ended generation, asking questions like "how do you feel?", and using an LLM judge to evaluate whether steered responses differed from baseline. We did not find a strong measurable effect, though we used a small sample. We plan to investigate this further.


Conclusion

How should we update?

  • We found early evidence that some models have evaluative representations.
    • Theories of welfare disagree on what matters (see Appendix A); this finding updates you more on some (like robust agency) than others.
    • Even under robust agency, evaluative representations are only one part of the story.
    • Importantly, our evidence that the representations we have found are causal is weak. Steering only shifts choice probabilities by ~17% on tasks that were already borderline (Section 5).
  • Preference representations are deeper than what one might have thought.
    • A reasonable prior would have been that system prompts like "You hate cheese" change the model's behaviour without changing its internal valuations.
    • Instead, the probe tracks preference shifts.
  • Representational reuse across personas?
    • Probes trained on one persona partially transfer to others, suggesting shared evaluative representations (Section 4.2).
    • That being said, transfer is uneven. It works far worse for the sadist persona which has a different preference profile.

Open questions

  1. Why are steering effects modest?
    • What are the other mechanistic determinants of revealed preferences?
    • Are there other evaluative mechanisms? Perhaps that are not easily captured by linear directions, or by our methodology?
  2. How persona-relative are these representations?
    • To what extent are the same evaluative representations re-used across personas? Are preferences downstream of personas?
    • Can we use revealed preferences to map out persona space?
    • Can we use linear probes to study cross-persona representation sharing in general?
  3. Do base models have evaluative representations? (see Appendix B)
    • If models have evaluative representations, do these come from pre-training? Does post-training significantly alter them?

Appendix A: Philosophical motivation

Welfare grounds

Long (2026) distinguishes between welfare grounds (is the system a moral patient at all?) and welfare interests (if it is, what would it mean to treat it well?). This work is about welfare grounds.

The pragmatic way to do empirical AI welfare research

We don't know the correct theory of moral patienthood. So our approach is: take a few theories we find plausible, figure out what properties a system would need to have under those theories, and run experiments that reduce our uncertainty about whether models have those properties.

Long et al. (2024) lay out two potential pathways to moral patienthood:

  • Robust agency: Agents that pursue goals through some particular set of cognitive states and processes are moral patients. Desires are perhaps the states most likely to be necessary: intuitively, things can go better or worse for you if there are things you want or care about.
  • Sentience: Beings are sentient if they are capable of valenced phenomenally conscious experiences. These experiences include pain and pleasure and feel good or bad, in a way that matters to sentient beings, so sentient beings are moral patients.

Both of these pathways implicate evaluative representations.

How evaluative representations come in

On many philosophical views, desires are evaluative representations that drive behaviour, perhaps with some further functional properties (Butlin 2026).

Valenced experiences, similarly, are often thought to be evaluative representations. It is unclear whether consciousness plus evaluative content is sufficient for valenced experience. Our experiments operationalise evaluative representations through revealed preferences (pairwise choices), not through felt experience, so the evaluative representations we probe for may not map cleanly onto the kind that matter for sentience.


Appendix B: Evaluative representations in pre-trained models

There is a tension in our framing:

  • On the one hand we say that evaluative representations are necessary for robust agency, and that this is the most likely way they might be welfare-relevant.
  • On the other hand, probes generalise well across topics even when trained on pre-trained Gemma-3-27b. Despite the fact that pre-trained models do not seem like plausible candidates for robust agency.

There are two ways to reconcile this.

Option 1: Agency lives in the simulacra. Under the Persona Selection Model, pre-training learns a distribution over personas. More broadly, we might expect pre-trained models to learn context-aware representations of "what the role I am currently playing values". This circuitry might then be recycled across roles/personas. The candidate for robust agency would then be the simulacra.

Option 2: Pre-trained models learn complex, but purely descriptive features that correlate highly with valuations, but do not yet play the right functional roles. As an analogy, you could imagine a system developing representations that track "this action leads to food". This would correlate well with valuations, yet is purely descriptive. Something similar might be responsible for the high cross-topic generalisation with pre-trained models (Section 2). It could also be that these complex but descriptive features are then harnessed into evaluative representations during post-training.


Appendix C: Replicating the probe training pipeline on GPT-OSS-120B

We replicated the utility fitting and probe training pipeline on GPT-OSS-120B. The same procedure (10,000 pairwise comparisons via active learning, utility fitting, ridge probe training on last-token activations) transfers directly.

Probe performance

The raw probe signal is comparable to Gemma-3-27B: best heldout r = 0.915 at layer 18 (Gemma: 0.864 at layer 31).

Depth comparison

Safety topics: noisy utilities, probably not poor generalisation

Safety-adjacent topics have poor probe performance overall.

Surprisingly, safety topics perform better when held out than when trained on. This is the opposite of what we'd expect if the issue were generalisation. The explanation: high refusal rates (~35% for harmful_request, ~34% for security_legal, ~26% for model_manipulation) probably throw off the Thurstonian utility estimates, so including these topics in training adds noise.

Per-topic probe performance: within-topic and cross-topic

  1. ^

    Pairwise accuracy is capped at ~87% because the Thurstonian utilities themselves don't perfectly fit the choice data. They are a noisy estimate of the model's preferences.

  2. ^

    Individual halves (A vs B, B vs C) each capture only half the manipulation, and ceiling effects compress the signal: the model already strongly prefers some target tasks under the neutral biography, leaving little room for the pro-interest to improve on.

  3. ^

    Steering either just positively or negatively on one task's tokens also had some causal effect, although even weaker.

  4. ^

    This baseline P(A) is pooled across both prompt orderings. Therefore it also captures pairs where ordering bias dominates preference. I.e. the model always chooses the first task no matter the order.



Discuss

AI Safety Has 12 Months Left

2026-03-06 00:53:59

The past decade of technology has been defined by many wondering what the upper bound of power and influence is for an individual company. The core concern about AI labs is that the upper bound is infinite.[1]

This has led investors to direct all of their mindshare towards deploying into AI, the tech industry to become incredibly politically engaged, the markets to have a minimal amount of conviction (and a lot of volatility), and governments to attempt to classify labs as borderline domestic terrorist organizations.

The core realization is that we are watching productivity shift from labor to capital. When that completes, lab incentives will permanently transition from appeasing the researchers who build the models to accumulating capital for the shareholders who own them. We must ask what levers of influence are left before that window slams shut.[2]

AI Safety is perhaps the last line of defense in this shift.

There is roughly a twelve-month window to embed it into technical and social infrastructure before IPOs and competitive dynamics make it permanently impossible. Every mechanism that currently constrains how labs behave besides compute is either already dead or expiring, and what gets built in that window has to survive years of market pressure on the other side. The people who care about this have a rapidly shrinking amount of leverage to work with but perhaps in the past ~week have been handed a specific opportunity to reclaim some level of importance or opportunity.

Safety lost the market

While romantic at the start, the AI Safety movement’s one real commercial lever was the bet that safety-focus would matter to users, to regulators, and to capital allocators. This was an incorrect assumption.

We quickly learned that labs that prioritized speed captured the market as users actively revolted against overly preachy models. Safety shifted from an idealized differentiator to an impediment to market dominance. Viewed purely as a consumer tax, safety teams simply slow releases, create unsolvable problems, and leak internal political drama.[3]

By failing to pitch safety as an enterprise premium, the movement lost the internal economic argument (and in fairness, before agents that economic argument was debatable at best). In a race to deploy the most capable models fastest, safety is just a measurable cost protecting against risks nobody can yet prove exist.

Compounding this, safety also demands compute at a time in which talent is fleeing the largest labs due to lack of compute for their research direction, instead opting to raise the standard $1B on $4B round for a neolab that everyone believes is downside protected.

Labor as the last constraint

This brings us back to people.

The AI labs have always been nothing without their people (and compute), and this ideological alignment (centered around AI safety behavior as well as perhaps enterprise vs. consumer) has impacted hiring dynamics, with Anthropic seeing noticeably lower churn and a shift in vibes that could create real worker leverage.

The recent standoff with the Department of War, in which the “We Will Not Be Divided” letter gathered hundreds of cross-company signatures across Google and OpenAI in support of Anthropic’s red lines, showed that cross-company solidarity is at least possible and perhaps opened a window again for AI safety.

When your institutions are captured and your market incentives point the wrong direction, the only constraint left is the people who build the thing refusing to build it without conditions.

But labor leverage only works when talent demand greatly outpaces supply.[4] The implicit threat is always “we could leave” but that only works if leaving creates a problem the lab can’t solve by other means. If AI can do 40% of what a junior safety researcher does today, and that number is climbing double digits annually, the math on “we could leave” changes quickly. And despite what we might want to believe, talent scarcity in safety is potentially a 12-18 month problem, not a 24-36 month one.

The obvious response to “labor leverage is expiring” is “automate safety research before it expires and the problem solves itself.”

This seductive version of the future is (mostly) wrong, and the research coming out of the labs themselves explains why we will want humans repeatedly in the loop.

Why you can’t automate your way out

Safety is conceptually least likely to get automated because the problem is definitionally human(ity) vs. machine.

Models are getting better at recognizing when they’re being evaluated and modifying their behavior accordingly as shown by OpenAI/Apollo noticing o3 underperforming on evals to not restrict its deployment, alongside Anthropic’s own alignment faking research demonstrating Claude intentionally pretending to comply with training objectives to avoid modification.

Capability gains only make this worse. As the model gets more performant, it could get better at being quietly adversarial, and could increase complexity to a point where our dumb human minds can’t comprehend what is or is not happening on any reasonable timescale.

Evan Hubinger (one of Anthropic’s alignment leads) published a detailed assessment in November laying out why the hard parts of alignment haven’t been encountered yet. His argument is that current models are aligned, but only because we’re still in the easy regime where we can directly verify outputs and inspect reasoning. The problems that will actually be difficult (overseeing systems smarter than you, ensuring models generalize correctly on tasks you can’t check, long-horizon RL selecting for power-seeking agents) haven’t meaningfully arrived yet. His first-order solution is to use the latest generation of trustworthy models as automated alignment researchers before we encounter the hard problems, and then to build other models to tackle the hard problems as well.

Marius Hobbhahn’s What’s the short timeline plan? makes a similar point, stating that the minimum viable safety plan requires frontier models to be used overwhelmingly for alignment and security research, with the lab willing to accept months of delay for release for scaled inference.

These are reasonable plans to a non-AI safety researcher like myself, and yet they are almost certainly unrealistic. Nobody looking to destroy public companies on a near-daily basis with product releases and obsolete engineers in 12 months will want to point their most capable models at the speculative Nick Bostrom-style future that may or may not come true.

So my gut tells me on any reasonable timescale, you can’t fully automate your way to proper safety.

Where we are

Go back to that solidarity letter, because it’s instructive as a high-water mark and could open up an Overton window to establish possible changes.

The letter represents the best possible conditions for safety-as-labor-leverage by possessing a relatively binary moral issue (mass surveillance, autonomous weapons), a clear antagonist, cross-company support, and Anthropic’s main competitor calling the SCR designation “an extremely scary precedent” while trying to de-escalate.

The result? Anthropic got designated a supply chain risk. OpenAI got the contract. The red lines are still debatably unclear if they have survived or not as of the last 10 tweets we’ve seen from Sam as he (artfully at times, poorly at others) works through AMAs and not-so-subtweets from multiple employees at his company.

While the letter brought certain types of “Ideological Safety” to the surface within the AI industry, nobody is going to organize a cross-company petition over eval methodology for deceptive alignment. And you can be damn sure that nobody is building a notdivided.org for compute allocation to interpretability research.

The realistic safety questions going forward look more like “how much red-teaming is enough before we ship a model that’s better at long-horizon coding tasks and allows us to steamroll Claude Code/Codex usage for 4-8 weeks?[5]

This come-from-behind battle that Anthropic is seemingly winning has coincided with (and one could argue accelerated) the dismantling of the principled commitments that got us here. Anthropic removed the hard-pause provision from its Responsible Scaling Policy to maintain momentum. [6]OpenAI is not too dissimilar, having a revolving door of talent in their preparedness team alongside the dramatic shutting down of their AGI readiness group in 2024.

And broadly, most binding constraints that underpinned lab safety work are effectively gone, with the latest blog post from Anthropic potentially reading as another form of capitulation as some of the longer-standing talking points of each of these labs were quietly taken out back in a matter of weeks as competition intensified and “faster take off” scenarios emerged.[7]

Markets will market

Both Anthropic and OpenAI are likely to go public in the next 6-18 months. The two organizations that employ the majority of the world’s frontier safety researchers will answer to public shareholders at exactly the moment when the best plan for safety requires dedicating your most capable model to safety research instead of scaling agents.

The revenue growth justifying each of the labs’ valuations lives in exactly the use cases where safety review is slowest and most likely to delay a launch[8] and while ideology is great, as we’ve seen, nothing threatens talent moats like bad vibes or a feeling of under-execution, which will be more clear than ever as each labs’ stock trades second by second.[9]

Once hundreds of thousands of investors are pricing lab equity instead of five, one naturally turns to governments as the next gating mechanism.

A Georgetown paper on AI emergency preparedness shows just how unprepared most are, and across the EU (compliance deadlines pushed to 2027-2028), UK[10] (voluntary cooperation that labs already ignore), China (state-driven frameworks with different goals entirely), and the US (which revoked its own executive order and used the DPA against the company that held the line), there are very few answers for a broad-based consensus that AI capabilities are outpacing safety measures and resource constraints.

Twelve months…at best

Markets always win on a long enough time horizon. Every other constraint on corporate behavior (labor, regulation, institutional commitments, public pressure) either gets priced in, arbitraged away, or restructured into something that serves growth. AI safety is not going to be the exception.

The question is whether the people who understand the problem can seize this fleeting window of lab skepticism and moral questioning to restructure safety from a moral tax into a commercial moat. Mechanisms that depend on executive goodwill, talent scarcity, or voluntary commitments have a half-life measured in quarters once companies answer to public shareholders.[11]

To survive the transition to public markets, safety must be priced into the asset. Whatever gets built in the next year has to be embedded in how enterprise deployments are certified, how strict SLAs are met, and how uninsurable liability flows when an autonomous system fails. The kind of thing that persists because it is net-dominant for the business and for the lab to continue to climb up its infinite upside scenario. Again, this has always been the dream “race to the top” scenario.

Twelve months is not a lot of time to rewire the market structure of an entire industry (and maybe a large portion of societal views). But it is possibly the last time in history the people who care will have any leverage to try.

  1. ^

    In a similar way that people believed VR was the last device because every other device could live inside of it, AI labs could be the last compounding assets because every other asset can exist within the weights.

  2. ^

    I quite enjoyed Finn’s take on these labor dynamics.

  3. ^

    At times i wonder if this would be the case if OpenAI didn’t seem to employ the most fucking Online people in all of tech

  4. ^

    The main point here really is that AI researchers care about alignment/safety generally, and thus them voting with their feet matters a lot to the perhaps lesser valued AI Safety people

  5. ^

    The one possible structural exception is Google, where antitrust exposure and consumer brand risk accidentally make safety a corporate survival concern. A high-profile safety failure at Google becomes a breakup catalyst in ways that a failure at Anthropic or OpenAI does not. Whether that translates to real alignment investment or better PR is an open question, but it may be the only major AI company where the political and market incentives point in the same direction on safety.

  6. ^

    I would highly recommend reading Holden’s post contextualizing RSP v3

  7. ^

    RIP “race to the top”

  8. ^

    Things like autonomous agents, models in critical infrastructure, AI systems making consequential decisions without human oversight, etc.

  9. ^

    I recognize some of the irony in mentioning talent moats, when I’m also arguing that talent is rapidly losing its power.

  10. ^

    A good post on someone who talked to many members of parliament on AI safety.

  11. ^

    To be clear, that might just mean create a social construct that then works into actually built systems or regulation, or it might mean gunning for more compute resource commitments that set industry standards. There are many half-baked ideas.



Discuss