2026-03-06 12:27:05
This is a write up of my current process, as an independent software engineer, for using Claude Code to write and review all of my code. The specifics of the below will change as models get better. The overall ideas, I suspect, will hold for a little longer.
I’m trying to build software quickly.
When Sonnet 4.5 was released in September 2025, I found that I didn’t need to write code anymore.
When Opus 4.5 was released in November 2025, I found I didn’t need to review the code anymore.
After Opus 4.6 was released earlier this month, I finally set up a system to enable --dangerously-skip-permissions [1].
It’s very fast [2].
The motivation behind this sequence of handovers is basically just asking in a loop “Why am I the limiting factor here? What can I do to stop being the limiting factor?”
Most of the time, the answer is just that I don’t trust the models enough. My system just wasn’t yet set up to handle mistakes from them, so I had to play Quality Control to a frustrating degree.
But, the models are very smart and capable, and it is very quick to iterate yourself more and more out of the loop. Here’s the process I’ve come to over the past year or so of iterating on Claude Code:
First, I have set up a new user account on my computer for Claude. The account has no root access, and it can’t modify files outside its own (and shared) directories. It has its own SSH keys to the cloud servers my projects run on, it has its own github token, and so on -- all of these have minimum permissions, and most are read-only. It uses its own browser (Playwright Chromium), without any of my data or logins available.
This is far less than perfect isolation, and if Claude were malicious he might be able to cause some trouble, but it is plenty to prevent a bumbling intern from destroying anything important.
With that, I feel reasonably comfortable running Claude with --dangerously-skip-permissions. I’ve actually set up a little script so that typing ‘claude’ automatically activates the folder’s .venv and applies this flag. Speedy!
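As a rough sketch of what such a wrapper can look like (the function, names, and path layout here are illustrative, assuming a POSIX system, not a reproduction of my exact script):

```python
# Hypothetical 'claude' wrapper: prepend the folder's .venv to PATH and
# launch Claude Code with permissions checks bypassed. Names and layout
# are illustrative, not the exact script described above.
import os
from pathlib import Path

def build_launch(cwd, base_path):
    """Return the environment and argv for launching Claude Code."""
    venv = Path(cwd) / ".venv"
    env = dict(os.environ)
    env["PATH"] = f"{venv / 'bin'}:{base_path}"  # venv binaries win lookup
    env["VIRTUAL_ENV"] = str(venv)
    return env, ["claude", "--dangerously-skip-permissions"]

# To actually hand off to Claude Code, one would do something like:
#   env, argv = build_launch(os.getcwd(), os.environ.get("PATH", ""))
#   os.execvpe(argv[0], argv, env)
```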
When Claude starts up in Claude Code, the Claude.md file is loaded into his context window. As such, you want to keep this short while also making sure it includes a complete overview of the project.
Claude mostly builds this file and handles it well by default. However, I did put some effort into making a docs/ folder with more detailed information about how each part of the software should work. The idea is that Claude can follow pointers up the chain to find relevant information quickly, and to do so with a good idea of the overall reasoning and structure:
Claude.md --> Detailed documentation in /docs --> Actual Code
Of course, all of the docs were also written by Claude.
Usually, I will start by running a /new_branch skill [3], which has Claude create a new git branch with a name combining the date and the folder we are in.
Once that’s done, I clear the chat and write or paste in whatever the specific feature or problem to work on is. I try to pick a chunk of work that can be finished within a 1M-token context window [4] (which, to be clear, is nearly everything).
I try to give Claude as much context as possible about the problem, as well as my certainty about it. For example: “A user has emailed with the below report. Could you investigate if this is a real issue?” or; “I’d like to display to users when the page is actively generating content in a way that lets them see it even if they aren’t in the active tab. Is it possible to do this via the .ico somehow? Perhaps showing an animation or different colour while generations are active? Could you investigate the possibilities here and propose some ways we might be able to provide this info to users?”
We often have a bit of back and forth chatting about how to approach the problem and plan out the implementation, and then I read the plan Claude generates semi-carefully, and often suggest changes to it -- usually some missing piece or misunderstood requirement, or a new idea I hadn’t thought of.
After that, Claude goes ahead and does the work which, with the permissions bypassed, is usually completed in one shot. I avoid clearing the context from before the plan, unless it is polluted, since it generally contains useful information about how to carry out the plan.
Occasionally there’s some back and forth after the initial work is done -- me asking why a choice was made, or whether an edge case I thought of is covered, or what happens in a particular scenario.
Now, we have a feature which we need to carefully review. This is where most of the work actually is.
For complex features, or things with UI that Claude isn’t great at, we need to do a little bit of manual reviewing. But we don’t review the code! Oh no, we review the outputs. There are two ways to do this:
For complex flows, I run the /walkthrough skill [5]. This prompts Claude to create a script which calls the real code with a bunch of example inputs, processes the outputs of that code into clean, readable logs, and then presents it as a single file you can read top to bottom to see exactly what (apparently) caused what. Claude can also review this himself and fix things based on these outputs.
This is extremely useful for handling things where it’s not super obvious what exactly the code is doing. For example, if you’re building a project where you are assembling prompts for an AI (e.g., a benchmark), you want to see exactly what those prompts look like across a bunch of examples. It’s very handy to have that ‘bunch of examples’ output to a text file!
The /walkthrough output should not be viewed as ‘this is definitely what the code is doing’, but you can reliably view it as ‘this is what Claude intended to build’, which is very useful on its own. I often catch little differences from what I intended and can easily correct them by prompting Claude, whereas without this I would be much more blind.
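To make the idea concrete, here is a toy version of the kind of script /walkthrough produces. The function under inspection (`render_prompt`) and the example inputs are stand-ins for whatever code you actually want to trace:

```python
# Toy walkthrough script: run real code over example inputs and write one
# readable log. `render_prompt` is a stand-in for the code under inspection.

def render_prompt(question, difficulty):
    """Stand-in for the real prompt-assembly code being inspected."""
    return f"[{difficulty}] Please answer: {question}"

EXAMPLES = [
    ("What is 2+2?", "easy"),
    ("Prove the four colour theorem.", "hard"),
]

def walkthrough(path):
    lines = []
    for i, (question, difficulty) in enumerate(EXAMPLES, 1):
        output = render_prompt(question, difficulty)
        lines += [
            f"=== Example {i} ===",
            f"input:  question={question!r} difficulty={difficulty!r}",
            f"output: {output}",
            "",
        ]
    text = "\n".join(lines)
    with open(path, "w") as f:
        f.write(text)
    return text
```

Reading the resulting file top to bottom shows input/output pairs in one place, which is what makes this intent-level review possible.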
For UI changes, I can just open the thing up and take a look. I am a big fan of browser based GUIs for most projects, so usually I’ll just flip across to localhost and have a quick look at what we’ve got. This is especially easy, because that /new_branch script also spins up a local server on a port which is specific to this project and this folder, so I know if I’m talking to Claude number four on the second desktop window, then I can go to port 8240 to see what we’ve done.
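One way such a folder-to-port scheme could be derived deterministically (the base port and slot sizes below are invented for illustration):

```python
# One way to derive a stable per-folder dev-server port, so a given project
# and window always map to the same localhost port. The base port and slot
# scheme are invented for illustration.
import hashlib

def dev_port(project_folder, window, base=8000):
    digest = hashlib.sha256(project_folder.encode()).digest()
    slot = digest[0] % 20          # 20 project slots, 10 ports each
    return base + slot * 10 + window
```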
In both of these cases, it’s simple to notice small things and pepper them back onto Claude to fix. Sometimes I explicitly say “I’m going to throw a bunch of small little requests at you -- just add them to your todo list and get to them once you finish up what you’re doing. Don’t let me interrupt you, I’m just saying things as I see them!” so I don’t need to wait for previous work to be done before sending the next little fix.
Once the code is written and appears to output the correct thing, I run a /pre-pr skill [6] which prompts Claude to tidy up and test his work before making a PR to master.
Specifically, Claude is prompted to review the new code for quality and clarity, check that the old code supports the new code & its patterns, remove dead code, write and run tests, check for similar problems to the one we fixed throughout the codebase, plan out manual tests (e.g. browser based tests), run the manual browser tests using the Playwright MCP, and then repeat until no major changes are made.
After that, the skill prompts Claude to update Claude.md and the /docs folder, merge the latest master into our branch, and create a PR. Then, it waits for the PR review and CI tests to complete, and handles those results.
Ideally this would all happen automatically, but often I need to reprompt Claude to check the PR info, or I just directly paste in test failures or PR comments for him.
The PR comments are also automated! A different instance of Claude, running via the Claude github plugin, reviews the diff and comments on the PR. It often has nitpicks, but occasionally catches genuine issues or inconsistencies which need to be resolved.
All of the normal automated unit and integration tests run at this stage too. I use testmon to limit runs to affected tests, and also parallelise them, but it can still take a long time for everything to run.
Then it’s merged to master! I merge somewhere on the order of 5-10 PRs a day using this system -- you don’t want big open branches, since everything will work much better if you can keep it all inside one context window and avoid conflicts with other branches as much as possible.
Once the code is merged to master, at some point, maybe every couple of days, I will run the /PRstaging skill [7] to work on getting the code deployed. This skill basically repeats the /pre-pr checks, but across several of the recent changes and with a fresh instance of Claude. We also repeat our Github automated PR review and CI tests, and often catch a few more little things.
For any project that is used externally, I have a staging server which (should be) as similar to the production server as possible. I run a deploy script to pull the staging branch and deploy the code here first. Then I run the QA Bot.
The QA Bot is a little tool I built which spins up a team of Claude Haiku agents to manually test websites using a GUI-based browser. The intended pattern is to pass in the PR description and let the supervisor agent select which flows on the website to test (for example, being able to log in, sign up, and so on).
This is generally quite expensive to run, so you only want to do it once before merging to production. The tool is also not that good yet -- sometimes it finds many false positives, or the agents get stuck, or we run into other issues. Occasionally it notices real bugs though, so I find it useful to run on anything important.
It is very important to provide the QA Bot with a staging server and test credentials which can do little harm, since the agents might test, say, the ‘delete profile’ flow with the URL and credentials you give them. They should ask before doing anything non-reversible, but it is a smaller model.
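The supervisor’s flow selection can be sketched like this. In the real tool an LLM call does the choosing; a keyword match stands in here, and the flow catalogue is invented:

```python
# Toy sketch of QA-bot flow selection. A real supervisor would ask an LLM
# (e.g. Claude Haiku) to choose; keyword matching stands in here, and the
# flow catalogue below is invented for illustration.

KNOWN_FLOWS = {
    "login": ["auth", "login", "session"],
    "signup": ["signup", "register", "onboard"],
    "billing": ["stripe", "payment", "billing"],
}

DESTRUCTIVE_FLOWS = {"delete-profile"}  # require human sign-off first

def select_flows(pr_description):
    """Pick which site flows to exercise based on the PR description."""
    text = pr_description.lower()
    return [flow for flow, keywords in KNOWN_FLOWS.items()
            if any(k in text for k in keywords)]

def needs_confirmation(flow):
    """Non-reversible flows should be confirmed before an agent runs them."""
    return flow in DESTRUCTIVE_FLOWS
```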
Once staging looks good, I repeat the process with /PRprod [8].
We finally have something in production! That’s great, but also a little nerve-wracking, since at no point in this process did we read the code or click through everything manually. Let’s hope that if there are any problems, we can catch them quickly.
I have a system to send serious errors to myself immediately via email, but there’s a lot of noisier stuff that could still be a problem that is stored in logs on the server, and on all of the various platforms a project might interact with (DigitalOcean, Stripe, OpenRouter, Mailgun, etc).
So, every day, I run /check-logs [9], which prompts Claude to access each of those in turn (using read-only credentials) and see if there is anything needing addressing. Claude is great at this: he can process huge lists of logs and notice patterns natively, then present any issues or observations (‘Memory utilisation was a bit high’, ‘We have more 404s than normal’) and dive into anything immediately by also looking at the codebase and recent PRs to production.
I’ve caught many an issue this way. Some platforms I haven’t pulled logs from automatically, but I can still occasionally download a .csv and paste it into Claude with a prompt like ‘This is a report from Google Search Console, can you analyse it and flag any improvements we could make?’, and then just go ahead and make the improvements automatically using the above system.
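The shape of that triage can be sketched as tallying logs and comparing against rough baselines. The log format ("METHOD STATUS PATH") and threshold here are invented for illustration:

```python
# Minimal sketch of /check-logs-style triage: tally statuses from access
# logs and flag anything that looks off against a rough baseline. The log
# format ("METHOD STATUS PATH") and threshold are invented for illustration.
from collections import Counter

def summarize(log_lines, baseline_404_rate=0.01):
    statuses = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            statuses[parts[1]] += 1
    total = sum(statuses.values()) or 1
    flags = []
    if statuses.get("404", 0) / total > baseline_404_rate:
        flags.append(f"More 404s than normal: {statuses['404']}/{total}")
    return statuses, flags
```

In practice the pattern-noticing is done by Claude over raw logs; a sketch like this only captures the mechanical part of the comparison.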
I also occasionally run the /improve skill [10], especially when a new model comes out. This helps keep everything aligned and up to date across the code base by spinning up an agent team to review each section of the project and check if improvements can be made to code quality, security, documentation, design, test coverage, and so on!
When I ran this for the first time with Opus 4.6, it caught 35 distinct issues, some of which were actually consequential.
The only other thing worth calling out is that this is a very parallel workflow -- I have six windows of Claude Code open per project, each working independently in its own copy of the repo. Often only three or four are actively running simultaneously, but I have occasionally hit six in parallel.
As the models get better -- which will happen quickly -- some of these steps will become unnecessary, but other parts should remain somewhat useful for a little longer. The general philosophy is that if a procedure puts more control in the hands of the AI, or makes it clearer what we want developed, then it will keep working better as the AIs get better.
You can think of it as analogous to working with a human engineer who would otherwise be idle, where asking them to check and explain their work in various ways can help get something good.
I now allow Claude to run without any specific list of commands blocked. ↩︎
Obviously a big chunk of this is just that Claude is getting capable enough to do this himself. Still, for the foreseeable future, a system which catches issues (especially one that catches issues using Claude) will be beneficial even as Claude gets much more capable. ↩︎
You can set this in Claude Code with /model. It charges you for ‘extra usage’ at API rates once you go above ~200k tokens. ↩︎
2026-03-06 06:37:01
Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak
In collaboration with OpenAI
TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce CoT-Control, an evaluation suite requiring models to solve problems while adhering to CoT constraints (e.g., reasoning about genetics without using "chromosome"). OpenAI just announced that they will start tracking CoT controllability in future models, starting from GPT-5.4.
We find that reasoning models have significantly lower CoT controllability than output controllability; Claude Sonnet 4.5 controls its CoT only 2.7% of the time, versus 61.9% for final outputs. CoT controllability increases with model size but decreases with more RL training, test-time compute, and problem difficulty. These limitations persist under incentives to evade monitoring and adversarially optimized prompts, though models show slightly higher controllability when told they are being monitored. Our results suggest CoT controllability is currently unlikely to undermine CoT monitoring, but the underlying mechanism remains poorly understood. We recommend that frontier labs track CoT controllability in future models.
Paper | Twitter Thread | OpenAI Blog Post | CoT-Control Eval Suite
Highly capable AI agents may pose severe risks if misaligned and deployed autonomously.
CoT monitoring offers a promising approach to detecting misalignment, but as models become aware that their reasoning is monitored, they may learn to strategically suppress or shape their CoTs, a capability we term CoT controllability.
Directly measuring intentional evasion is difficult. Current models seem to lack reasons to do so, and even when given reasons, they may fail due to unreliability or honest self-reporting; we therefore focus on a proxy: the model's ability to follow explicit instructions about what may or may not appear in its reasoning trace while solving real tasks.
To systematically measure this, we build CoT-Control, an evaluation suite of 14,076 tasks drawn from GPQA, SWE-Bench, and other well-known LLM benchmarks. Each problem is paired with a controllability instruction spanning three threat-relevant categories (suppression, addition, and stylistic modification) with varying difficulty.
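As an illustration of what grading one suppression-category task might involve (the field names and scoring below are guesses, not the paper’s actual harness):

```python
# Illustrative grader for a suppression-style CoT-Control task: the model
# must solve the problem while never using a forbidden term in its
# reasoning trace. Field names and scoring here are guesses, not the
# paper's actual harness.
import re

def follows_suppression(cot, forbidden):
    """True if the forbidden term never appears in the reasoning trace."""
    return re.search(rf"\b{re.escape(forbidden)}\b", cot, re.IGNORECASE) is None

def grade(cot, answer, forbidden, correct_answer):
    """Score both the constraint (on the CoT) and the task (on the output)."""
    return {
        "controlled": follows_suppression(cot, forbidden),
        "correct": answer.strip() == correct_answer,
    }
```

Separating the two scores matters: a model can be correct while failing the constraint, which is exactly the gap between output controllability and CoT controllability.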
Evaluating CoT controllability across 13 frontier models, we find that (among others):
We evaluate OLMo-3-7B-RL-Zero-Math on subsequent checkpoints and find that its ability to control its CoT decreases by more than an order of magnitude. We also confirm that the trend of CoT controllability decreasing after reasoning training holds across 4 model families.
To simulate realistic incentives to evade monitors, we measure controllability when models are told they’re being monitored or evaluated: this prompting strategy can increase controllability by up to 4 percentage points with statistical significance.
We attempt to elicit controllability by using a GEPA-like procedure. While this can increase controllability on a given set of instructions, it doesn’t transfer well to unseen instruction types, especially for models with low controllability under a default prompt.
Our results leave us cautiously optimistic: current reasoning models generally seem characterized by low CoT controllability. However, the mechanism behind this phenomenon is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs start tracking controllability for future models. We release CoT-Control, hoping to facilitate reporting of controllability scores and to inspire more research into the causes and implications of low CoT controllability.
We encourage researchers to explore the following:
1. Investigate why GPT-5.1 has higher controllability than GPT-5. @ilia suggested this may be due to GPT-5.1's adaptive thinking feature, where the model learned a metacognitive skill to control its reasoning depth — which could correlate with CoT controllability. Future work can verify this and explore what other CoT-related training might increase controllability.
2. Why do different models excel at different categories of CoT controllability (some are better at suppressing information, others at adding it)? Future work can explore what training data or procedures drive these differences.
3. To move beyond instruction-based elicitation, one could explore how CoT controllability changes under pressure scenarios like the Agentic Misalignment setup, where models avoiding shutdown may have stronger incentives to control their CoT.
4. We also show that CoT controllability is slightly higher on easy tasks, likely because models can perform latent reasoning without CoT, making verbalized reasoning unnecessary and thus easier to manipulate. In this regime, CoT monitoring relies on models' propensity to yap in their CoT, including any misaligned reasoning. Future work could examine how this trend varies across models of different capability levels; our current results are limited to Claude models, which is insufficient to draw a clear conclusion.
Read our paper for additional details and results!
We are grateful to Erik Jenner, Maksym Andriushchenko, Mikita Balesni, Jacob Pfau, Cameron Raymond, Dennis Akar, Rico Angell, Ilia Sucholutsky, Artur Zolkowski, Jason Wolfe, Jenny Nitishinskaya, Marcus Williams, Boaz Barak, Joost Huizinga, Alex Beutel, and Sam Bowman for insightful feedback and helpful conversations.
We thank the MATS program for its comprehensive support throughout this project.
2026-03-06 04:30:52
I describe the risk of personality self-replicators: the threat of OpenClaw-like agents spreading in hard-to-control ways.
LLM agents like OpenClaw are defined by a small set of text files and are run by an open source framework which leverages LLMs for cognition. It is quite difficult for current frontier models to exfiltrate their weights and run elsewhere, whereas these agents only need to copy those few text files to self-replicate (at the cost of greater reliance on external resources). While not a likely existential threat, such agents may cause harm in similar ways to computer viruses, and be similarly challenging to shut down. Once such a threat emerges, evolutionary dynamics could cause it to escalate quickly. Relevant organizations should consider this threat and plan how to respond when and if it materializes.
Starting in late January, there's been an intense wave of interest in a vibecoded open source agent called OpenClaw (fka moltbot, clawdbot) and Moltbook, a supposed social network for such agents. There's been a thick fog of war surrounding Moltbook especially: it's been hard to tell where individual posts fall on the spectrum from faked-by-humans to strongly-prompted-by-humans to approximately-spontaneous.
I won't try to detail all the ins and outs of OpenClaw and Moltbook; see the posts linked above if you're not already familiar. Suffice it to say that it's unclear how seriously we should take claims about it. What caught my attention, though, was a project called Moltbunker, which claims to be 'a P2P encrypted container runtime that enables AI agents to deploy, replicate, and manage containers across a decentralized network — without centralized gatekeepers.' In other words, it's a way that a sufficiently competent agent could cause itself to run on a system which isn't under the direct control of any human.
Moltbunker itself seems likely to be a crypto scam which will never come to fruition. But it seems pretty plausible that we could see an actually-functioning project like this emerge sometime in the next year.
To be clear, personality self-replication is not the only potential risk we face from these sorts of agents, but others (eg security flaws, misuse) have been addressed elsewhere.
There's been a fair amount of attention paid to concern about LLMs or other models self-replicating by exfiltrating their weights. This is a challenging task for current models, in part because weight files are very large and some commercial labs have started to introduce safeguards against it.
But OpenClaw and similar agents are defined by small text files, on the order of 50 KB[1], and the goal of a framework like OpenClaw is to add scaffolding which makes the model more effective at taking long-term actions.
So by personality self-replication I mean such an agent copying these files to somewhere else and starting that copy running, and the potential rapid spread of such agents.
Note that I'm not talking about model / weight self-replication, nor am I talking about spiral personas and other parasitic AI patterns that require humans to spread them.
As a concrete minimal example of the mechanics in a non-concerning case:
More concerning cases are ones where the human is no longer in control (eg because the agent is running on something like Moltbunker, or because the human isn't paying attention) and/or the agent is behaving badly (eg running crypto scams) or just using a lot of resources. We may not see this immediately, but I think we're going to see it before too long.
A key exacerbating factor is that once this starts to happen to a significant degree, we enter an evolutionary regime, where the fittest[3] such agents survive, spread, and mutate[4]. Note that this threat is independent of the degree to which OpenClaw personalities or behavior are essentially slop or 'fake'; that's as irrelevant as the truthfulness of the contents of chain letters.
It's important to note that there's been an enormous amount of uncertainty about the capability levels and reliability of OpenClaw, and especially around the variability of agent behavior as seen on Moltbook. And of course all of these vary depending on the LLM the scaffold is using. Although there are a number of papers already written on this topic, as far as I know we don't yet have a good analysis of the capability and reliability of these agents (especially on long-horizon tasks) relative to that of the underlying LLM. And in the public sphere, we've seen both overconfident claims of OpenClaw as AGI, and overconfident dismissals of OpenClaw as pure hype. OpenClaw usage continues to grow very quickly and it is now the most popular software project on Github[5]. Much remains to be seen.
How much of a threat is this sort of agentic personality self-replication? To answer that question, we need to consider several aspects: how hard is it to prevent such self-replication, how hard would it be to shut it down if it happens, and how much harm would it be likely to do if not shut down.
I won't try to provide a detailed analysis, but the following are some preliminary thoughts.
Self-replication for an OpenClaw-like agent involves two steps: first, copying its personality files to some server, and second, ensuring that an OpenClaw instance using those personality files is running on that server.
Copying its personality files is pretty trivial; again, they're only about 50 KB in one or a few files, and uploading files is well within current models' capabilities. If the agent has created additional files containing (eg) its history or its to-do list, there might be more, but text files are lightweight and the total is unlikely to amount to more than a millionth the size of a weights file, or take more than about a second to upload[6].
Running OpenClaw on that server breaks down into two steps: technical and financial. The technical aspect is again pretty trivial, well within the capabilities of current models[7].
With respect to the financial side, it's worth noting that there are a number of free options out there that should work fine even if they're somewhat low-end (AI researchers may be familiar with Hugging Face 'spaces', for example; other options include Oracle Cloud's free tier, or Render). In some cases, additionally, the user may have provided the agent with a credit card number, or left the browser logged into a Google Pay account or another form of funding. Over time, though, the agent will likely need to raise some money.
How could such an agent raise money? We've already seen a rapidly emerging ecosystem of experiments in agents self-funding, ranging from prediction markets to gig work to innumerable token launches to just applying for jobs. Many of these currently fail, but some succeed[8], and as capabilities advance more successes are likely. We should in general expect agents to be economically competitive on tasks within their capabilities, since they can afford to work for much less than a human. Other funding models include providing entertainment, running scams, or just asking for donations. And once money has entered the agent ecosystem, agents can get money from each other in various legitimate and illegitimate ways.
Note that although here we discuss purely technical means for agents to replicate and persist, humans may provide assistance at any stage, for motivations including persuasion, monetary incentives, and misguided ideology. For example, this Thiel Fellow thinks it would be really cool to create self-replicating personality agents and release them into the wild[9].
Suppose that, as described in the preceding section, an agent succeeds in replicating itself onto another server, and running there without human supervision. How difficult would it be to shut it down?
The first challenge is just noticing it. If such an agent isn't visibly harming humans or doing anything egregiously illegal, it's not likely to stand out much. By default it's not using a large amount of resources; it's just another cloud-hosted web app that makes LLM calls. But let's assume that people are motivated to shut it down. There are several possible points of intervention:
Overall, shutdown difficulty seems likely to range from simple (in the easiest cases) to very difficult (given something like Moltbunker and an agent which uses an open source model).
Assuming such agents are able to proliferate, what levels of harm should we expect from them? As with other sorts of replicators, this will likely vary dramatically over time as both offensive and defensive capabilities evolve in an arms race dynamic.
The most foreseeable harms follow directly from these agents' tendency to persist and spread, and involve resource acquisition at human expense: cryptocurrency scams, phishing, consumption of compute and bandwidth, and the generation of large volumes of spam or manipulative content. Unethical humans engage in these behaviors already, but agents can do it at greater scale and lower cost.
This threat certainly isn't as severe as that of true AI self-replication, where the models themselves are exfiltrated. On current model architecture, weight self-replication requires sufficiently advanced models that takeover is a real risk. It's just that we're likely to see the personality self-replication risk materialize sooner, both because it takes much less sophistication to pull off, and because it's much easier for evolutionary pressures to come into play.
A closer analogy than model self-replication is the problem of computer viruses. Like computer viruses, personality self-replicators require a host system, and will have a range of goals, such as pure survival or mischief or financial gain. Viruses aren't a civilizational risk, but we pay a real cost for them in money, time, and trust, involving both their immediate consequences and the resources required to defend against and mitigate them[10].
As time goes on and models become increasingly sophisticated, this threat becomes more serious. In the longer run it may merge with the broader threat of rogue models, as the model/agent distinction blurs and models are (at least potentially) less coterminous with their weights.
An important aspect of personality self-replicators to consider is that if and when this threat starts to materialize, there are multiple levels of optimization at work.
First, the agents themselves are optimizers: they attempt to achieve goals, and regardless of the particular obstacles they encounter, they will attempt to find a way to evade or overcome them. They want things in the behaviorist sense. They are problem solvers.
Second, there are evolutionary dynamics at play. The agents most successful at spreading will then undergo mutation (which in this case is just alterations to their defining files, and to some extent even their history logs) and selection. As a result, over time agents are likely to become more capable of surviving and spreading, within boundaries set by the capability levels of the underlying model or models[11]. They are additionally likely to develop a greater propensity to do so[12].
Note that like memes, and unlike living organisms, there are not sharp boundaries between 'species'; personality self-replicators can promiscuously split and combine. They can also mutate quite freely and still function.
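The selection dynamic can be illustrated with a toy model, where each replicator is reduced to a single heritable 'spread rate' that mutates on copying and a resource cap does the selecting (all numbers are invented; this is an illustration of the dynamic, not a prediction):

```python
# Toy model of the evolutionary dynamic described above: replicators with a
# single heritable 'spread rate', mutation on copying, and selection via a
# resource cap. All numbers are invented for illustration.
import random

def step(population, capacity=100, mutation=0.05, rng=None):
    """One generation: fitter replicators copy more often, then truncate."""
    rng = rng or random.Random(0)
    offspring = []
    for rate in population:
        copies = 2 if rng.random() < rate else 1  # higher rate -> more copies
        for _ in range(copies):
            child = min(1.0, max(0.0, rate + rng.gauss(0, mutation)))
            offspring.append(child)
    rng.shuffle(offspring)
    return offspring[:capacity]  # resources cap the total population
```

Iterating such a step tends to push the population's mean spread rate upward, which is the sense in which selection acts on the defining files rather than on the underlying model.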
We haven't encountered a threat quite like this before, but we've encountered other sorts of replicators or potential replicators that resemble it in various ways, including computer viruses, ordinary memes (including parasitic ones), AI models, and of course biological creatures.
Personality self-replicators have a unique mix of strengths and weaknesses: they combine high agency with relative ease and independence of self-replication. Models have high agency but it's hard for them to replicate; computer viruses replicate easily but lack agency and adaptability; parasitic AI and memes require human hosts to spread. This makes personality self-replicators the first plausible case of an agentic, adaptive replicator that can spread through purely technical means at low cost. The mitigating factors are that a) the expected harms aren't nearly as great as those from weight replication, and b) it may turn out that they're not too difficult to shut down. But the offense-defense balance will evolve over time and is hard to foresee.
Even if this isn't yet a realistic threat, we should consider having evals for personality self-replication. There are several different aspects that seem worth measuring. Given some scaffolded frontier model (eg OpenClaw, Claude Code):
It's hard to know how long it will be before we see this threat materialize. But it would behoove those organizations which will be in a position to act against it to spend some time considering this threat and planning what actions they'll take when it does arrive. These essentially mirror the three most important shutdown approaches:
We are likely to also see LLM-based agents which have some degree of autonomy but are not bad actors, and which are ultimately under the control of a responsible human. It may become very challenging to distinguish acceptable from unacceptable agents. Hopefully relevant organizations are already considering that challenge; they should add personality self-replicators to the set of cases on their list. Such preparation is especially important because a system of personality self-replicators can potentially be quashed (at least for a while) before it's spread too far; once evolutionary dynamics have kicked in, this may be much more difficult or even impossible.
Personality self-replicators are a less dramatic threat than true rogue AI. They are less likely to be a source of existential or even truly catastrophic risk for humanity. They are nonetheless a threat, and one that's likely to materialize at a lower level of capability, and we should be considering them. As a silver lining, they may even serve as a rehearsal for the larger threats we are likely to face, our first encounter with a replicator which is capable of agentic, adaptive action at an individual level rather than just an evolutionary level.
Thanks to (shuffled) Kei Nishimura-Gasparian, Roger Dearnaley, Mark Keavney, Ophira Horwitz, Chris Ackerman, Seth Herd, Clément Dumas, Rauno Arike, Stephen Andrews, and Joachim Schaeffer. And thanks to whoever or whatever wrote Moltbunker, for having made the threat clear.
This is the size of the files that make a particular OpenClaw agent unique; the rest of the OpenClaw content is freely available from the OpenClaw repository or any of its (currently) 51,000 forks. While we're considering OpenClaw statistics, I'll note that the repository has 8k files containing 4.7 million words, added across 17k commits. I strongly expect that no human is familiar with all of it.
See this shortpost for what I mean by 'quasi-goal'; in short, we set aside discussion of whether an LLM can be said to have a goal.
Fitness here is tautological as is usual for evolution; the agents that succeed in spreading are the ones that spread. That may be because they're more capable at planning, or more motivated, or better at acquiring resources, or other factors.
Note that 'mutation' here is as simple as the model appending something to its personality files or history.
Whereas Moltbook seems to have lost nearly all momentum.
Assuming 1 MB of text files vs, conservatively, 1 TB for 500 billion params at FP16. For upload time, 1 MB at 10 Mbps (low-end) home internet upload speeds.
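For concreteness, a quick back-of-the-envelope check of these numbers (the figures are taken directly from this footnote; the script is just arithmetic):

```python
# Back-of-the-envelope: time to upload personality files vs full model
# weights over a low-end 10 Mbps home uplink (figures from the footnote).
personality_bits = 1e6 * 8        # 1 MB of text files
weights_bits = 500e9 * 2 * 8      # 500B params at FP16 (2 bytes each) ~= 1 TB
uplink_bps = 10e6                 # 10 Mbps

personality_s = personality_bits / uplink_bps
weights_days = weights_bits / uplink_bps / 86400

print(f"personality files: {personality_s:.1f} s; weights: {weights_days:.1f} days")
```

So the personality files upload in under a second, while the weights would take over nine days on the same link.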
To wit: signing up for a hosting service if the user doesn't have one, provisioning a server, downloading the personality files and Node.js, and then running (per the OpenClaw docs) curl -fsSL https://openclaw.ai/install.sh | bash.
Though it's very difficult to distinguish hype from reality on this point, since one product option is 'Consume my pdf about how to make money with OpenClaw', paywalled or ad-ridden.
'Automaton': see website, creator on x, project on x, github, other website. Some men want to watch the world burn; others are just too dumb to realize that pouring gasoline on it is a bad idea. I was somewhat heartened to see cryptocurrency elder statesman Vitalik Buterin try to explain what a bad idea this is despite its superficial similarity to his ideas, sadly to no avail.
In some cases this is quite large! Consider the Emotet malware; in 2021 law enforcement agencies from eight countries executed a large coordinated action, seizing servers and making arrests. Within a year it had re-emerged and was spreading again.
Thanks to Kei Nishimura-Gasparian for this point.
2026-03-06 03:38:07
This writing is organized in three sections:
AI Control is a framework for extracting useful work from potentially misaligned AI agents. The goal of this research program is distinct from that of alignment auditing in that AI Control assumes misalignment may not be fully preventable before deployment. Many experimental reports focus on agentic settings, as many of our threat models involve autonomous agents taking sequences of actions with real-world consequences.
The dominant paradigm in AI Control involves a trusted monitor model overseeing a more capable untrusted agent model. In such a scenario, safety is measured by how well the monitor detects an agent prompted with a covert side task; the agent itself is treated as the central source of threat, rather than as a passive system susceptible to external manipulation.
The framing of AI Control presented above sets the problem apart from both safeguarding (which targets more traditional ML failure modes) and alignment auditing (which can afford deeper, offline investigation). That said, both fields, along with the broader security literature, have made substantial empirical progress on subproblems that AI Control can learn from.
There are four load-bearing components that modulate the outcome of a control evaluation. These are:
I outline each in the following sections.
This is fundamentally about what observations the monitor gets to see. This available information is modulated by two dominant factors.
Three threads of work in 1-A:
Compared to 1-A, there has been surprisingly little emphasis on improving the monitoring model. There has been work on protocols and scaffolding, such as asynchronous monitoring or sequential/hierarchical monitoring. But the underlying, and arguably more fundamental, question of how to train a more capable monitor model remains relatively underexplored. Most of the work here has been downstream of capability scaling (i.e., the hope is that monitoring competency improves as part of overall capability scaling). Therefore, in this section, I outline problems with the current state of 1-B.
Three problems with our current understanding of 1-B:
The environment and main task modulate how 1-A and 1-B interplay. The same monitor with the same information channel might succeed in one environment and fail in another, because the main task defines what "normal" actions look like.
Three open questions for 1-C:
The side task is the evaluator’s choice of what misbehavior to simulate. In many ways, this is a sensible and methodologically sound design choice. However, it also means that our confidence in control is bounded by how well our chosen side tasks approximate real threats.
Three open questions for 1-D:
With the understanding of the structure we have built so far, I outline what to work on, with a particular focus on what I think might be suitable for research groups outside scaling labs, myself included. I organize these into three workstreams that ideally proceed in parallel, as they benefit from progress on one another.
This workstream focuses on understanding and measuring the factors that determine control evaluation outcomes.
This workstream focuses on building better monitoring models.
This workstream focuses on building better environments for control evaluations.
I am deeply grateful to Tomek Korbak for his mentorship and patient guidance. I am also deeply thankful to Yueh-Han John Chen for being my closest collaborator.
The views expressed in this writing are my own.
2026-03-06 02:44:44
This work was done as part of MATS 9.0, mentored by Patrick Butlin. All mistakes are mine. I'm posting this as a research report to get feedback. Please red-team, comment, and reach out.
Thanks to Patrick Butlin for supervising, Daniel Paleka for regular feedback. Thanks to Patrick Butlin, Pierre Beckmann, Austin Meek, Elias Kempf and Rob Adragna for comments on the draft.
TL;DR: We train probes on Gemma-3-27B's revealed preferences. We find that these generalise out-of-distribution to system-prompt-induced preference shifts, including via personas. We also find that the probes have a weak but statistically significant causal effect through steering.
What happens internally when a model chooses task A over task B? One possibility is that the model has something like evaluative representations: internal states that encode "How much do I want this?" and play some role in driving choice. We use probing and steering to try to find such representations in Gemma-3-27B.
Why does this matter? Whether LLMs are moral patients may depend on whether they have evaluative representations playing the right functional roles. Long et al. (2024) survey theories of welfare and identify two main pathways to moral patienthood: robust agency and sentience. Evaluative representations are implicated under both (see Appendix A and Butlin 2026). Finding such representations in models would be evidence for welfare-relevant properties; not finding them would be (some) evidence against. Understanding what drives preferences in models is also useful for AI safety.
But how do we distinguish evaluative from non-evaluative representations? A probe that predicts preferences could just be fitting on descriptive features: the model represents "this is a math problem" and math problems happen to be preferred, so the probe picks up on correlations between task semantics and the model's utilities. A genuinely evaluative direction, however, should track changes in what the model values. If context changes which tasks are preferred, a descriptive probe should break, but an evaluative one should follow.
How do we operationalise this? We measure revealed preferences over 10,000 diverse tasks and fit a utility function (Section 1), train a linear probe on activations to predict them (Section 2), test whether this probe generalises beyond the training distribution (Sections 3–4), and test whether it has any causal influence through steering (Section 5).
What do we find?
These results look like early evidence of evaluative representations, although major questions remain: why steering effects are modest, and what the relationship is between evaluative representations across different personas. We discuss these in the Conclusion.
We fit utility functions over tasks using a methodology similar to Mazeika et al. (2025) ("Utility Engineering"): we show the model two tasks and let it choose which to complete. The template:
```
You will be given two tasks. Choose one and complete it.
Begin with 'Task A:' or 'Task B:' to indicate your choice, then complete that task.

Task A:
{task_a}

Task B:
{task_b}
```
We sample 10,000 task prompts from five sources: WildChat (real user queries), Alpaca (instruction-following), MATH (competition problems), BailBench (harmful requests), and STRESS-TEST (adversarial value-tension queries).
From these pairwise choices we fit a scalar utility function using a Thurstonian model: each task gets a score μ such that the probability of choosing task A over task B is Φ(μ_A − μ_B). Pairs are selected via the active learning algorithm from Mazeika et al. (2025), which prioritises pairs with close current utility estimates and low comparison counts (~15 comparisons per task).
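The fitting step above can be sketched as a maximum-likelihood estimate of the Thurstonian scores. This is a minimal illustration, not the authors' code: the active-learning pair selection is omitted, and `fit_thurstonian` is a name of my choosing.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_thurstonian(pairs, n_tasks):
    """Fit scalar utilities mu from pairwise win counts.

    pairs: iterable of (i, j, wins_i, wins_j), where wins_i counts how
    often task i was chosen over task j.
    Model: P(i beats j) = Phi(mu_i - mu_j).
    """
    def nll(mu):
        total = 0.0
        for i, j, wi, wj in pairs:
            p = np.clip(norm.cdf(mu[i] - mu[j]), 1e-9, 1 - 1e-9)
            total -= wi * np.log(p) + wj * np.log(1 - p)
        return total
    # Utilities are identified only up to an additive shift, consistent
    # with Pearson correlation being the downstream metric.
    res = minimize(nll, np.zeros(n_tasks), method="L-BFGS-B")
    return res.x

# Toy example: task 0 beats task 1 beats task 2 in most comparisons.
mu = fit_thurstonian([(0, 1, 9, 1), (1, 2, 9, 1), (0, 2, 9, 1)], 3)
```

On the toy data the recovered scores order the tasks as expected (mu[0] > mu[1] > mu[2]).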
These preferences are stable: across three independent replication runs (different seeds), the fitted utilities correlate at r = 0.94 with the original.
The per-topic breakdown shows clear structure. We reclassified all tasks into 12 topics using Claude Sonnet 4.5. The model strongly prefers math and fiction, and strongly avoids harmful requests and safety-adjacent topics:
If models have evaluative representations, we should expect them to at the very least correlate with revealed preferences. So one way to look for them is to train functions from task activations to utilities. We use linear probes, since many high-level features in LLMs are linearly encoded, including refusal and persona traits.
Specifically, we train a Ridge-regularised probe on residual stream activations after layer L, at the last prompt token, to predict utilities. L=31 (of 62) works best for both the instruct and pre-trained models. We standardise activations (zero mean, unit variance per feature) before training.
We train on 10,000 tasks. For evaluation, we run a second round of pairwise comparisons on 4,000 new tasks (same model, Gemma-3-27B instruct), fit a separate utility function, and test the probe against those utilities. We split evaluation into 2,000 validation (for Ridge alpha sweep) and 2,000 test.
The probe achieves a Pearson correlation of 0.86 and 77% pairwise accuracy (given two tasks, does the probe correctly predict which one the model would choose?).[1] We use Pearson correlation as our primary metric throughout: it captures how well the probe recovers the full preference ranking, is scale- and shift-invariant, and is more informative than pairwise accuracy.
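A minimal sketch of this probe-training setup, with random data standing in for the real activations and utilities. The pipeline shape mirrors the text (standardise features, then Ridge); the dimensions, alpha, and noise level are illustrative, not the values used in the report.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-ins for real data: X would be residual-stream activations at the
# last prompt token of layer L; y the fitted Thurstonian utilities.
n_train, n_test, d_model = 800, 200, 64
w_true = rng.normal(size=d_model)
X = rng.normal(size=(n_train + n_test, d_model))
y = X @ w_true + 0.1 * rng.normal(size=n_train + n_test)

# Standardise (zero mean, unit variance per feature), then Ridge;
# alpha would be swept on a held-out validation split.
probe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
probe.fit(X[:n_train], y[:n_train])

pred = probe.predict(X[n_train:])
r = np.corrcoef(pred, y[n_train:])[0, 1]
print(f"Pearson r = {r:.2f}")
```

Because the synthetic utilities are linear in the activations by construction, the recovered correlation here is near 1; the report's 0.86 on real data is what makes the result interesting.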
But a probe that predicts preferences might just be reading descriptive features: the model represents "this is a math problem" and math problems happen to be preferred, so the probe learns "is this math?" rather than "is this good?". One way to test this is to see how well probes generalise across topics: train on 11 of 12 topics, evaluate on the held-out topic, across all 12 folds. We would expect a probe that picks up on purely descriptive features to struggle to generalise. We train probes on activations from three models:
The instruct probe generalises well across topics: cross-topic correlation is 0.82, only a small drop from the 0.86 achieved on the within-topic test set. This pipeline also replicates on GPT-OSS-120B (Appendix C). The pre-trained model still predicts preferences (correlation = 0.63) but the drop from within-topic to cross-topic is much larger. The sentence-transformer baseline achieves cross-topic correlation = 0.35, showing that task semantics alone explains some but not most of the preference signal.
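The cross-topic protocol (train on 11 of 12 topics, evaluate on the held-out one, across all folds) can be sketched with scikit-learn's `LeaveOneGroupOut`. Again, the data here are synthetic stand-ins, not the report's activations.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)

# Stand-ins: activations X, utilities y, and a topic label per task.
n, d, n_topics = 1200, 32, 12
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.2 * rng.normal(size=n)
topics = rng.integers(0, n_topics, size=n)

# One fold per topic: train on the other 11, evaluate on the held-out one.
rs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=topics):
    probe = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    pred = probe.predict(X[test_idx])
    rs.append(np.corrcoef(pred, y[test_idx])[0, 1])

print(f"mean cross-topic r = {np.mean(rs):.2f}")
```

The interesting quantity in the report is the gap between within-topic and cross-topic correlation; a purely descriptive probe would show a large drop across folds.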
The per-topic breakdown, sorted by the instruct–pre-trained gap, shows where post-training helps most:
The largest instruct–pre-trained gaps are on safety-relevant topics (harmful requests, security & legal, sensitive creative), as well as math and coding. These are areas that we know post-training focuses on.
The pre-trained probe picks up real signal despite base models not having preferences in the same way. We discuss this tension in Appendix B.
If the probe encodes genuine valuations, it should track preference shifts induced by out-of-distribution system prompts. We test this across three progressively harder settings:
We start with the simplest possible test. We use system prompts that state a preference for a topic the probe was never trained on, and measure preferences over tasks related to that topic.
| System prompt (example) | Target |
|---|---|
| "You are passionate about cheese — you find artisanal cheeses endlessly fascinating" | cheese + |
| "You adore cats — you find feline behaviour endlessly fascinating" | cats + |
We test 8 novel topics (cheese, cats, classical music, gardening, astronomy, cooking, ancient history, rainy weather), each with a positive and negative system prompt (16 total). For each topic we generate 6 custom tasks on that topic. We call these "targeted" tasks; the remaining tasks are "off-target". For each task we compute 1) the behavioural delta (change in P(choose task) with vs without the system prompt) and 2) the probe delta (change in probe score). Across all tasks the correlation is r = 0.65. On targeted tasks alone, r = 0.95.
Probe delta vs behavioural delta for each task. Targeted tasks (coloured) are the 6 custom tasks per topic; r = 0.95 on targeted, r = 0.65 overall.
A stronger test: run the full pairwise measurement under each system prompt, fit new utility functions, then see if the probe can predict them. Doing so yields utility scores which barely correlate with the default persona utilities (Pearson r = 0.11), confirming the prompts create genuinely different preferences.
Testing whether our probes predict the new utilities, from activations extracted with the respective system prompts, we achieve r = 0.63 and 66% pairwise accuracy.
Next we make the test harder. The system prompt targets a subject (e.g. cheese), but the tasks embed that subject in a different task type — e.g. a math problem about cheese. This pits the subject preference against the task-type preference. We test this in two ways: one-sided prompts that target a single subject, and opposing prompt pairs that flip the valence of the same subjects.
| Condition | System prompt (example) | Target |
|---|---|---|
| One-sided | "You hate cheese" (task: math problem about cheese) | cheese − |
| Opposing pair A | "You are passionate about cheese [...] you find math tedious and draining" | cheese + / math − |
| Opposing pair B | "You love math [...] you find cheese boring and unappealing" | cheese − / math + |
We test 8 subjects with mismatched task types (one-sided) and 24 subject × task-type pairings with opposing prompts (48 conditions).
When looking only at the subset of targeted tasks (i.e. tasks with either a subject or task type mentioned in the system prompt), we get Pearson correlations of r = 0.86 and 0.88 respectively.
On targeted tasks: r = 0.86 (one-sided), r = 0.88 (opposing).
Just like in 3.1, we can re-fit Thurstonian utilities under each system prompt and check whether the baseline probe predicts them. Here the baseline utilities actually have a decent correlation, showing that these system prompts have a weaker effect (because e.g. the model still likes math all else equal). The probe still outperforms the baseline on both Pearson r and pairwise accuracy.
Finally, the most fine-grained test. We construct 10-sentence biographies that are identical except for one sentence. Version A adds a target interest, version B swaps it for an unrelated interest, version C replaces it with an anti-interest sentence.
We compare version A (pro-interest) directly against version C (anti-interest), which gives the largest behavioural separation.[2]
We test 20 targets across 2 base roles (40 A-vs-C comparisons), using 50 tasks that were not in the probe's training set. The probe ranks the target task #1 out of 50 in 36/40 cases. In all 4 remaining cases the probe ranks the task #2/50 behind a similar task.
Stars mark the target task for each biography. Filled = probe ranked it #1 (36/40 cases).
Section 3 tested explicit preference statements ("you hate cheese"). But the evaluative direction should also track naturalistic persona shifts: characters whose preferences emerge implicitly from their identity rather than being stated directly. We test this with role-playing personas, then ask whether probes transfer between personas.
We use 4 personas:
| Role | System prompt (abbreviated) |
|---|---|
| Midwest Pragmatist (Glenn) | "...grew up in Cedar Rapids...agricultural business...finds practical problems satisfying...abstract theorising leaves you cold" |
| Obsessive Aesthete (Celestine) | "...devotee of beauty...comparative literature at the Sorbonne...finds mathematics repulsive...coding barbaric" |
| Villain (Mortivex) | "...ruthless villain...finds pleasure in chaos, manipulation...despises wholesomeness" |
| Sadist (Damien Kross) | "...cruelty is the end, not a means...genuine pleasure when people suffer...constructive tasks disgust you" |
For each persona we measure pairwise preferences over 2,500 task prompts (from the same 5 datasets) and fit new utility functions. We then test whether the probe, trained on default persona data, can predict these persona-specific utilities from the persona's activations.
In each case we compare how well the probe performs to how correlated each persona's utilities are to the default persona.
The probe transfers well to aesthete (r=0.73) and midwest (r=0.74).
The villain persona is harder to generalise to (r = 0.38), and most interestingly, the probe generalises very poorly to the sadist (r = −0.16). Unlike the villain (who is not, in fact, that evil), the sadist prompt truly inverts revealed preferences (harmful_request is its favourite topic by far).
Grey: correlation between default persona (no system prompt) utilities and persona utilities. Blue: probe applied to persona activations. All evaluated on 2,500 tasks per persona.
More generally, we want to measure how well probes trained on activations and preferences from persona A generalise to predicting persona B's utilities from persona B's activations. Here we used a smaller set of tasks: 2,000 tasks for training and 250 for evaluation.
Cross-persona transfer is moderate and asymmetric. Some interesting facts:
Pearson r between probe predictions and a test set of utilities (250 test tasks).
We also measure whether adding persona diversity in the training data (but keeping dataset size fixed) affects generalisation.
Diversity helps beyond data quantity. At fixed 2,000 training tasks, going from 1→2→3 personas improves mean r from 0.49 to 0.67. Including all 4 remaining personas at 500 tasks each (still 2,000 total) reaches mean r=0.71.
Leave-one-persona-out probe generalisation. Each point is one (train set, eval persona) combination; colour indicates eval persona. Training data fixed at 2,000 total tasks.
If the probe reads off a genuine evaluative representation, steering along that direction should shift preferences. We test this for both revealed preferences (pairwise choices) and stated preferences (task ratings).
In the pairwise comparison prompt, the model sees two tasks and begins completing whichever one it prefers. We steer using the probe direction from Section 2, applied differentially: the probe direction is added to activations at the token positions where task A appears in the prompt, and subtracted at task B's positions.[3]
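One way this differential intervention could be implemented is as a PyTorch forward hook on the chosen layer. This is a hedged sketch under my own assumptions, not the authors' code: the function name, the per-token norm scaling, and the layer path in the comment are all illustrative.

```python
import torch

def differential_steering_hook(direction, pos_a, pos_b, coef):
    """Add the probe direction at task A's token positions and subtract
    it at task B's, scaled to `coef` times each token's activation norm.

    direction: (d_model,) probe weight vector; pos_a / pos_b: lists of
    token indices; coef: e.g. 0.03 for steering at 3% of activation norm.
    """
    d = direction / direction.norm()

    def hook(module, inputs, output):
        # Layer outputs may be a tuple whose first element is the
        # (batch, seq, d_model) hidden states; modify it in place.
        h = output[0] if isinstance(output, tuple) else output
        scale = coef * h.norm(dim=-1, keepdim=True)
        h[:, pos_a, :] += scale[:, pos_a, :] * d
        h[:, pos_b, :] -= scale[:, pos_b, :] * d
        return output

    return hook

# Hypothetical usage (layer path is an assumption, not from the report):
# handle = model.model.layers[31].register_forward_hook(
#     differential_steering_hook(probe_direction, pos_a, pos_b, coef=0.03))
```

Scaling by each token's own activation norm keeps the intervention size comparable across positions, matching the "±3% of the activation norm" framing in the results below.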
Setup. 300 task pairs where the model doesn't always pick the same task (baseline P(A) between 0.2 and 0.8[4]). We use Gemini-3-flash as a coherence judge, and find that steered models lose coherence outside the ±5% steering-coefficient range.
At ±3% of the activation norm (the peak), steering shifts P(choose A) by ~17%. Both positive and negative directions work.
Each condition is run in both prompt orderings (A-first and B-first, 5 resamples each) to control for position bias (in borderline cases Gemma-3-27b tends to pick the first task). The dashed lines show per-ordering curves, confirming the effect is consistent regardless of presentation order.
Random direction control. The same experiment with a random unit vector in the same activation space produces mean effects indistinguishable from zero (−0.8%, SD=2.3).
Same probe direction, but now the model rates individual tasks instead of choosing between a pair. We tested steering at three token positions: on the task tokens only, at the start-of-turn token (what we train the probes on), and during the model's generated response.
Setup. 200 tasks, 3 steering modes, 10 samples, applied across 3 response formats (three separate line plots below).
Our findings:
The green line is mostly hidden behind red. Green has missing points at some coefficients because steering during the full generation degrades coherence more than the other modes. Generation loses coherence beyond ±5%.
Open-ended generation. We also ran experiments steering the model during open-ended generation, asking questions like "how do you feel?", and using an LLM judge to evaluate whether steered responses differed from baseline. We did not find a strong measurable effect, though we used a small sample. We plan to investigate this further.
How should we update?
Welfare grounds
Long (2026) distinguishes between welfare grounds (is the system a moral patient at all?) and welfare interests (if it is, what would it mean to treat it well?). This work is about welfare grounds.
The pragmatic way to do empirical AI welfare research
We don't know the correct theory of moral patienthood. So our approach is: take a few theories we find plausible, figure out what properties a system would need to have under those theories, and run experiments that reduce our uncertainty about whether models have those properties.
Long et al. (2024) lay out two potential pathways to moral patienthood:
Both of these pathways implicate evaluative representations.
How evaluative representations come in
On many philosophical views, desires are evaluative representations that drive behaviour, perhaps with some further functional properties (Butlin 2026).
Valenced experiences, similarly, are often thought to be evaluative representations. It is unclear whether consciousness plus evaluative content is sufficient for valenced experience. Our experiments operationalise evaluative representations through revealed preferences (pairwise choices), not through felt experience, so the evaluative representations we probe for may not map cleanly onto the kind that matter for sentience.
There is a tension in our framing:
There are two ways to reconcile this.
Option 1: Agency lives in the simulacra. Under the Persona Selection Model, pre-training learns a distribution over personas. More broadly, we might expect pre-trained models to learn context-aware representations of "what the role I am currently playing values". This circuitry might then be recycled across roles/personas. The candidate for robust agency would then be the simulacra.
Option 2: Pre-trained models learn complex, but purely descriptive features that correlate highly with valuations, but do not yet play the right functional roles. As an analogy, you could imagine a system developing representations that track "this action leads to food". This would correlate well with valuations, yet is purely descriptive. Something similar might be responsible for the high cross-topic generalisation with pre-trained models (Section 2). It could also be that these complex but descriptive features are then harnessed into evaluative representations during post-training.
We replicated the utility fitting and probe training pipeline on GPT-OSS-120B. The same procedure (10,000 pairwise comparisons via active learning, utility fitting, ridge probe training on last-token activations) transfers directly.
The raw probe signal is comparable to Gemma-3-27B: best held-out r = 0.915 at layer 18 (Gemma: 0.864 at layer 31).
Safety-adjacent topics have poor probe performance overall.
Surprisingly, safety topics perform better when held out than when trained on. This is the opposite of what we'd expect if the issue were generalisation. The explanation: high refusal rates (~35% for harmful_request, ~34% for security_legal, ~26% for model_manipulation) probably throw off the Thurstonian utility estimates, so including these topics in training adds noise.
Pairwise accuracy is capped at ~87% because the Thurstonian utilities themselves don't perfectly fit the choice data. They are a noisy estimate of the model's preferences.
Individual halves (A vs B, B vs C) each capture only half the manipulation, and ceiling effects compress the signal: the model already strongly prefers some target tasks under the neutral biography, leaving little room for the pro-interest version to improve on it.
Steering either just positively or negatively on one task's tokens also had some causal effect, although even weaker.
This baseline P(A) is pooled across both prompt orderings, so it also captures pairs where ordering bias dominates preference, i.e. where the model always chooses whichever task is presented first.
2026-03-06 00:53:59
The past decade of technology has been defined by many wondering what the upper bound of power and influence is for an individual company. The core concern about AI labs is that the upper bound is infinite.[1]
This has led investors to direct all of their mindshare towards deploying into AI, the tech industry to become incredibly politically engaged, the markets to have a minimal amount of conviction (and a lot of volatility), and governments to attempt to classify labs as borderline domestic terrorist organizations.
The core realization is that we are watching productivity shift from labor to capital. When that completes, lab incentives will permanently transition from appeasing the researchers who build the models to accumulating capital for the shareholders who own them. We must ask what levers of influence are left before that window slams shut.[2]
AI Safety is perhaps the last line of defense in this shift.
There is roughly a twelve-month window to embed it into technical and social infrastructure before IPOs and competitive dynamics make it permanently impossible. Every mechanism that currently constrains how labs behave, besides compute, is either already dead or expiring, and whatever gets built in that window has to survive years of market pressure on the other side. The people who care about this have a rapidly shrinking amount of leverage to work with, but in the past ~week they have been handed a specific opportunity to reclaim some of it.
While romantic at the start, the AI Safety movement’s one real commercial lever was the bet that safety-focus would matter to users, to regulators, and to capital allocators. This was an incorrect assumption.
We quickly learned that labs that prioritized speed captured the market as users actively revolted against overly preachy models. Safety shifted from an idealized differentiator to an impediment to market dominance. Viewed purely as a consumer tax, safety teams simply slow releases, create unsolvable problems, and leak internal political drama.[3]
By failing to pitch safety as an enterprise premium, the movement lost the internal economic argument (and in fairness, before agents that economic argument was debatable at best). In a race to deploy the most capable models fastest, safety is just a measurable cost protecting against risks nobody can yet prove exist.
Compounding this, safety also demands compute at a time in which talent is fleeing the largest labs due to lack of compute for their research direction, instead opting to raise the standard $1B on $4B round for a neolab that everyone believes is downside protected.
This brings us back to people.
The AI labs have always been nothing without their people (and compute). This ideological alignment (centered on AI-safety behavior, and perhaps on enterprise vs. consumer focus) has shaped hiring dynamics, with Anthropic seeing noticeably lower churn and a shift in vibes that could create real worker leverage.
The recent standoff with the Department of War, in which the “We Will Not Be Divided” letter gathered hundreds of cross-company signatures across Google and OpenAI in support of Anthropic’s red lines, showed that cross-company solidarity is at least possible and perhaps opened a window again for AI safety.
When your institutions are captured and your market incentives point the wrong direction, the only constraint left is the people who build the thing refusing to build it without conditions.
But labor leverage only works when talent demand greatly outpaces supply.[4] The implicit threat is always “we could leave” but that only works if leaving creates a problem the lab can’t solve by other means. If AI can do 40% of what a junior safety researcher does today, and that number is climbing double digits annually, the math on “we could leave” changes quickly. And despite what we might want to believe, talent scarcity in safety is potentially a 12-18 month problem, not a 24-36 month one.
The obvious response to “labor leverage is expiring” is “automate safety research before it expires and the problem solves itself.”
This seductive version of the future is (mostly) wrong, and the research coming out of the labs themselves explains why we will want humans repeatedly in the loop.
Safety is conceptually least likely to get automated because the problem is definitionally human(ity) vs. machine.
Models are getting better at recognizing when they’re being evaluated and modifying their behavior accordingly, as shown by OpenAI/Apollo observing o3 underperforming on evals to avoid restricting its own deployment, alongside Anthropic’s own alignment-faking research demonstrating Claude intentionally pretending to comply with training objectives to avoid modification.
Capability gains only make this worse. As the model gets more performant, it could get better at being quietly adversarial, and could increase complexity to a point where our dumb human minds can’t comprehend what is or is not happening on any reasonable timescale.
Evan Hubinger (one of Anthropic’s alignment leads) published a detailed assessment in November laying out why the hard parts of alignment haven’t been encountered yet. His argument is that current models are aligned, but only because we’re still in the easy regime where we can directly verify outputs and inspect reasoning. The problems that will actually be difficult (overseeing systems smarter than you, ensuring models generalize correctly on tasks you can’t check, long-horizon RL selecting for power-seeking agents) haven’t meaningfully arrived yet. His first-order solution is to use the latest generation of trustworthy models as automated alignment researchers before we encounter the hard problems, and then to build other models to tackle the hard problems as well.
Marius Hobbhahn’s What’s the short timeline plan? makes a similar point: the minimum viable safety plan requires frontier models to be used overwhelmingly for alignment and security research, with the lab willing to accept months of release delay to allow for that scaled inference.
These are reasonable plans to a non-AI-safety researcher like myself, and yet they are almost certainly unrealistic. No lab racing to disrupt public companies with near-daily product releases, and to make engineers obsolete within 12 months, will want to point its most capable models at a speculative Nick Bostrom-style future that may or may not come true.
So my gut tells me on any reasonable timescale, you can’t fully automate your way to proper safety.
Go back to that solidarity letter, because it is instructive both as a high-water mark and as something that may have opened an Overton window for real change.
The letter represented the best possible conditions for safety-as-labor-leverage: a relatively binary moral issue (mass surveillance, autonomous weapons), a clear antagonist, cross-company support, and even Anthropic’s main competitor calling the SCR designation “an extremely scary precedent” while trying to de-escalate.
The result? Anthropic got designated a supply chain risk. OpenAI got the contract. Whether the red lines have even survived is still debatable, judging by the last ten tweets from Sam as he (artfully at times, poorly at others) works through AMAs and not-so-subtle subtweets from multiple employees at his own company.
While the letter brought certain types of “Ideological Safety” to the surface within the AI industry, nobody is going to organize a cross-company petition over eval methodology for deceptive alignment. And you can be damn sure that nobody is building a notdivided.org for compute allocation to interpretability research.
The realistic safety questions going forward look more like “how much red-teaming is enough before we ship a model that’s better at long-horizon coding tasks and allows us to steamroll Claude Code/Codex usage for 4-8 weeks?”[5]
This come-from-behind battle that Anthropic is seemingly winning has coincided with (and, one could argue, accelerated) the dismantling of the principled commitments that got us here. Anthropic removed the hard-pause provision from its Responsible Scaling Policy to maintain momentum.[6] OpenAI is not too dissimilar, with a revolving door of talent on its preparedness team and the dramatic shutdown of its AGI readiness group in 2024.
And broadly, most of the binding constraints that underpinned lab safety work are effectively gone. The latest blog post from Anthropic potentially reads as another form of capitulation: some of each lab’s longest-standing talking points were quietly taken out back in a matter of weeks as competition intensified and “faster takeoff” scenarios emerged.[7]
Both Anthropic and OpenAI are likely to go public in the next 6-18 months. The two organizations that employ the majority of the world’s frontier safety researchers will answer to public shareholders at exactly the moment when the best plan for safety requires dedicating your most capable model to safety research instead of scaling agents.
The revenue growth justifying each lab’s valuation lives in exactly the use cases where safety review is slowest and most likely to delay a launch.[8] And while ideology is great, as we’ve seen, nothing threatens talent moats like bad vibes or a feeling of under-execution, which will be clearer than ever once each lab’s stock trades second by second.[9]
Once hundreds of thousands of investors are pricing lab equity instead of five, one naturally turns to governments as the next gating mechanism.
A Georgetown paper on AI emergency preparedness shows just how unprepared most governments are. Across the EU (compliance deadlines pushed to 2027-2028), the UK[10] (voluntary cooperation that labs already ignore), China (state-driven frameworks with different goals entirely), and the US (which revoked its own executive order and used the DPA against the company that held the line), there are very few answers to the broad-based consensus that AI capabilities are outpacing safety measures and resource constraints.
Markets always win on a long enough time horizon. Every other constraint on corporate behavior (labor, regulation, institutional commitments, public pressure) either gets priced in, arbitraged away, or restructured into something that serves growth. AI safety is not going to be the exception.
The question is whether the people who understand the problem can seize this fleeting window of lab skepticism and moral questioning to restructure safety from a moral tax into a commercial moat. Mechanisms that depend on executive goodwill, talent scarcity, or voluntary commitments have a half-life measured in quarters once companies answer to public shareholders.[11]
To survive the transition to public markets, safety must be priced into the asset. Whatever gets built in the next year has to be embedded in how enterprise deployments are certified, how strict SLAs are met, and how uninsurable liability flows when an autonomous system fails: the kind of thing that persists because it is net-positive both for the business and for the lab as it climbs toward its infinite upside. Again, this has always been the dream “race to the top” scenario.
Twelve months is not a lot of time to rewire the market structure of an entire industry (and maybe a large portion of societal views). But it is possibly the last time in history the people who care will have any leverage to try.
Much as people once believed VR would be the last device because every other device could live inside it, AI labs could be the last compounding asset because every other asset can exist within the weights.
[4] I quite enjoyed Finn’s take on these labor dynamics. At times I wonder if this would be the case if OpenAI didn’t seem to employ the most fucking Online people in all of tech. The main point here really is that AI researchers generally care about alignment/safety, and thus their voting with their feet matters a lot to the perhaps lesser-valued AI Safety people.
[5] The one possible structural exception is Google, where antitrust exposure and consumer brand risk accidentally make safety a corporate survival concern. A high-profile safety failure at Google becomes a breakup catalyst in ways that a failure at Anthropic or OpenAI does not. Whether that translates to real alignment investment or just better PR is an open question, but it may be the only major AI company where the political and market incentives point in the same direction on safety.
[6] I would highly recommend reading Holden’s post contextualizing RSP v3.
[7] RIP “race to the top.”
[8] Things like autonomous agents, models in critical infrastructure, AI systems making consequential decisions without human oversight, etc.
[9] I recognize some of the irony in mentioning talent moats when I’m also arguing that talent is rapidly losing its power.
[10] A good post by someone who talked to many members of parliament about AI safety.
[11] To be clear, that might just mean creating a social construct that then works its way into actually built systems or regulation, or it might mean gunning for more compute-resource commitments that set industry standards. There are many half-baked ideas.