2026-03-06 16:40:09
I describe the philosophy of science that I follow in my research. I first lay out the underlying reasoning on its own, then walk through how I apply it to my own assumptions about what is needed to make the AI transition go well. I use the term Agent Foundations a bit more broadly in this post, and I focus more on the hierarchical agency side of things.
Epistemic Status: At first none of this was written with AI; I randomly got a good flow and one-shotted it in one evening, which was quite nice, but a friend I asked for feedback said the structure was way too schizo, so I got Claude to help me restructure it. Roughly 10-15% is AI writing.
When you first get into AI Safety, a common feeling is confusion, followed by more confusion with a little "why is this not a bigger issue?" on top. Then comes a succession of moments where you think you've figured out what the problem is, only to realise it is more interconnected and deeper than you thought, and you're confused again. This repeats for about two years until the problem suddenly makes more sense and you can see the interconnecting parts (at least that's how it was for me). Then you're back to being confused about what to do about it, and you stay that way whilst working on it.
There are a bunch of stacking factors that make solutions hard to find: it's not a traditional scientific field, and it has thorny mathematical and philosophical problems built into it. At the same time there are governance and incentive issues, and some part of you suspects the whole thing is philosophy-complete (you would have to figure out all of philosophy to solve it).
Then you see the excellent research plans in AI safety (Wentworth's, Byrnes's, Kosoy's, Davidad's, and more) and you start to see that there are at least some ways of tackling it, yet they seem quite hard to follow through on. Besides, they depend on specific details that are hard to verify when you see them for the first time. Each rests on a set of intuitions that informs its direction, but how do you know whether those intuitions are correct? (You spend three years confused until it clicks, duh.)
How do you know whether they will provide insights that bring the field forward? How do you figure that out for yourself? What is a reasonable philosophy of science for problems that span fields and where you don’t know where the solution is coming from?
I will be describing the one I'm currently a believer in. This philosophy depends heavily on the fact that I'm stupid compared to the collective intelligence of the alignment field, let alone of science in general. It leans heavily on the skills of distillation and connection-making between fields as a way to generate new ways of asking questions, since the problem space is a lot larger than the solution space. I believe this philosophy of science is especially relevant for pre-paradigmatic fields, where the questions to ask and the connections to other fields aren't fully formed.
To understand why connection-making is high-leverage, we need to look at how progress happens, and we will begin at the most abstract level: macrohistory and civilization.
There are many ways to model the progress of civilization. You can look at it as a general increase in energy capture such as in Energy and Civilization, or as a process of ever increasing sophistication of control structures such as in Seeing Like A State. Maybe you could see the decentralized cycles of it in Against The Grain or look at secular cycles with Peter Turchin. But none of these explain the generation of ideas nor networks of ideas. For that, we need cultural evolution.
Joseph Henrich's The Secret of Our Success makes a compelling case that the central engine of human progress is combinatorial: ideas are modular components, and progress happens when the right components find each other. The key variable isn't individual genius — it's how well ideas spread through a network of people.
Imagine any type of progress as a sort of genetic combination of modular things. You get a fitness landscape with different minima, and it can be quite hard to reach the deeper ones. One minimum might be something like three genetic mutations away, creating a valley you can't actually cross:
Here's a concrete example from the book. It's nutritious and yummy to eat ants, but not every society knows how to do it. You need to find 3 different sub-modules — the right stick, the right technique, the right processing — and combine them:
Now imagine a social network where these ideas are randomly instantiated with some probability and then spread between connected people. It might look something like this:
Two things fall out of this model. First, the larger the network of people sharing ideas, the more likely you are to find the right combinations. The returns scale superlinearly: a society of 10,000 people is dramatically better at finding combinations than a smaller one, and this is one of the factors Henrich uses to explain the quicker development of Eurasia.
Second, we can model ourselves as part of this larger network. Imagine two types of animals: hedgehogs and butterflies.
Butterflies are great at propagating existing ideas that other people have had: high connectivity, lots of cross-pollination. Hedgehogs have lower connectivity scores but a higher probability of coming up with genuinely novel ideas (since they spend more time on their own). According to Henrich, being a butterfly becomes relatively more important in larger networks, because finding the right combinations matters more than generating new components from scratch.
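As a deliberately crude toy illustration of the first point (all numbers here are invented, not from Henrich): suppose a technique requires three sub-modules, each person independently stumbles on each module with small probability, and a fully connected group pools everything its members find. The chance that the group holds all three modules grows much faster than the group's size.

```python
import random

# Toy model: a technique needs k_modules independent sub-ideas. Each person
# stumbles on each module with probability p; a fully connected group shares
# everything, so the group succeeds iff every module was found by someone.
def group_finds_technique(n_people, k_modules=3, p=0.05, trials=2000, rng=None):
    """Return the fraction of trials in which the group holds all modules."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    hits = 0
    for _ in range(trials):
        found = set()
        for _ in range(n_people):
            for m in range(k_modules):
                if rng.random() < p:
                    found.add(m)
        if len(found) == k_modules:
            hits += 1
    return hits / trials

for n in (5, 20, 80):
    print(n, group_finds_technique(n))
```

Running this, the success rate for the 80-person group is far more than sixteen times that of the 5-person group, which is the qualitative point: bigger idea-sharing networks don't just add discoveries, they multiply combinations.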
Great, we solved philosophy of science, talk to as many people as you can about AI Safety and everything will be solved! … not.
Propagating knowledge is only useful if the knowledge itself is useful, and the bar for making progress on deep issues in AI Safety is quite high. Maybe work on the smallest possible problems that give incremental improvements?
Yet even if we make good incremental progress, not all problems factorize well, and progress will be bottlenecked by the serial components (Amdahl's law type shit). So what is one to do?
Simple really, just identify the bottlenecks and solve them! Find the cool hedgehogs that have the solutions to the problems and make them talk to each other by translating their language!
I think a good strategy for someone who can understand hedgehogs is to talk to them and be their butterfly. Be a connector of deep cool ideas and combine them. If you can connect larger networks to the existing network of ideas then you're also doing a good thing as it increases connectivity.
Yet as alluded to, being a connector without verification mechanisms is how you get the worst parts of the internet. Too high connectivity without adequate filtering is actively dangerous. The hyperconnectivity of social media networks leads to fewer distinct communities and less clear boundaries between subgroups. This also leads to problems of unaccountability — no clear entry point for things and no clear defender of the epistemic commons. When everything is connected to everything and nobody has skin in the game, you get a system that's very good at spreading ideas and very bad at knowing which ones are true.
This is where Nassim Taleb enters the picture, loudly and profanely, (you hear the word cocksucker echo in the distance). His core insight about skin in the game is really an insight about epistemic filtering: if you don't have to live with the consequences of your ideas, you have no feedback loop. And ideas without feedback loops are just vibes. The entire edifice of science — experiments, predictions, replication — is a set of mechanisms for forcing ideas to face consequences. The question is whether those mechanisms are fast enough for the problem you're working on.
So the methodology needs both: systematic connection-making to find combinations that no single field would discover, AND ruthless verification to kill the ones that don't work. The question is what those verification mechanisms look like when you're working across fields where the standards of evidence are different.
Don’t be a twitter vibe connector, be a scientific connector with a good verification stack! But what is a good verification stack?...
Okay, so we need verification to not spread slop. The obvious answer is "do science." But...
Science seems kind of slow given timelines. Sure, it can be good to have journals and similar gatekeeping, but that world seems to have become a bit of a moral maze. From the outside it just looks too slow and too dependent on whatever the field you're trying to publish in happens to like.
So what does a faster verification process look like that still has teeth? I think you can model it as Bayesian evidence gathering at different resolutions. You're trying to update on whether your ideas are any good, and different types of evidence give you different amounts of bits. The trick is ordering them by cost — cheap bits first, expensive bits later.
John Wentworth has good ideas about this in his MATS Models. One of the main things I took away was: do thought experiments first. They're almost free and they kill a surprising number of bad ideas. Before you run an expensive simulation, look at the real world — there's a bunch of bits there just waiting to be picked up. Can you explain your idea to someone in a different field and have it make sense? Can you find a concrete example where the pattern you're claiming to see actually shows up? Can you state what would falsify it? These are cheap tests that most bad ideas fail.
Then you simulate. You build something small and computational that forces your framework to actually produce predictions rather than just describe things after the fact. Then you try to build something in the real world and see what happens.
But at every stage of this you need to beware of premature optimisation because you are probably not measuring what you think you're measuring. Knowing when you've actually verified something versus when you've just found confirming evidence seems hard. Which suggests you should spend real time exploring the problem space itself before trying to optimise anything — find different framings, restate the problem in different languages, look at it from unfamiliar angles. This is the Feynman move. He spent a bunch of time playing bongo drums and thinking about problems, a classic shape rotator. And Eliezer makes a related point in Einstein's Arrogance — you probably know where to go before you've rationalised it, because the solution doesn't arrive exactly when you have the right bits. It's really a question that lives in the problem space itself, not the solution space.
So maybe the choice of prior determines a lot, and if science is fundamentally about asking the right question, then the most productive thing you can do is learn to ask it in various ways. This is maybe the purpose of distillation — not just making existing work accessible, but reframing it so that you get multiple viewing angles on the same underlying structure. Each angle gives you a different set of bits, and once you have enough independent angles, you can start to actually tell whether you're seeing something real or just confirming what you already believed. The verification comes from the triangulation, not from any single perspective.
Composition without reduction
So if verification comes from triangulation — from having enough independent angles on the same structure — then the question becomes: how do you actually get those angles? The standard move is reductionism. You explain one field in terms of another. Markets are really just individual decision-making. Biology is really just chemistry. The problem is that this throws away exactly the information you need. If you reduce markets to individual decisions, you lose the emergent properties that make markets interesting in the first place. You need a way to connect fields that respects what each one actually knows.
It turns out that this is kind of the whole point of category theory, or at least the more mortal version called applied category theory (afaik). The core idea — and Sahil explains this well in his Live Theory post and it's something that Topos Institute reeks of — is finding the intersection rather than the reduction. You're not trying to say "field A is really just field B." You're trying to say "field A and field B are both doing something that has this structure in common, and that shared structure is where the interesting bits live."
If you want to combine insights from incompatible research networks — people who use completely different ontologies and don't read each other's papers — you should probably take inspiration from here. (Also, here’s some examples of applied category theory explaining a bunch of cool shit in Machine Learning and Programming.)
There are, however, ways to do this wrongly, mainly generalisation for the sake of generalisation. If you're not solving a specific problem with it, your shoulder Taleb will start screaming and it will be impossible to do good work. This is a trap some mathematicians fall into, obsessed with generalisation because the cool kids (Grothendieck, among others) made a bunch of cool stuff happen with it. I think this might be category theory's problem from an external perspective: it is mostly math wizards who do category theory, so it is mainly used for math wizardry.
As a mere math wizard apprentice I kind of don't wanna go there, so I imagine myself more as a category theory rogue: I take what I like, the diagrams and the ideas without most of the rigour, and run with it. It's fun, and it works well as long as you enforce rigour later through either expert verification or simulation (which has a very nice bridge to functions and functional programming!).
A basic example of using applied category theory in a rogue way is the following: if one looks deeply at agent definitions in biology, they generally follow cybernetic definitions, and cybernetics is also a parent category of the utility structures used in AI alignment. So if you wanna do some weird Michael Levin shit on AI systems, cybernetics seems like quite a nice lens to do it from! (Some initial observations come from phylogeny-and-taxonomy-of-agents style work; for now these claims are more "trust me bro" level though.)
In other words, by generalising the assumptions of different fields with "applied category theory", both the principal-agent style utility function definitions in RL and the teleological network science of biology and morphology can be viewed through the shared lens of cybernetic control.
Okay so what do I actually think? The methodology above doesn't tell you what to work on — it tells you how to work on hard interdisciplinary problems once you've picked a direction. Someone with different assumptions about the world would apply it completely differently. So here's where I show my hand: here are my assumptions, here's where they led, and here's what I'm currently trying to do about it. (i.e. show your work!)
If I look at alignment — the actual thing everyone is worried about — I think it's fundamentally a principal-agent problem. A principal wants an agent to do something, the agent has its own dynamics, and you need some mechanism to keep them roughly pointed in the same direction. Yet why do we think about it as individual principal-agent dynamics? When I look at the real world, most alignment schemes that work at scale are distributed. States align millions of people through governance structures. Economies align producers and consumers through market mechanisms. Biology aligns trillions of cells into coherent organisms through signalling networks. None of these are single-principal, single-agent setups. They're all collective.
Which makes it kind of weird that most technical AI safety research frames alignment as a problem between one human (or one human principal) and one AI system. If there's a solution to alignment that actually scales, it probably, on priors, looks more like the solutions that already exist in the world — and those are all collective intelligence solutions of one kind or another.
(Yes I agree that the general idea of negotiation with an agent that is a lot smarter than you is basically impossible if it isn't value aligned. Yet that is dependent on us not having smart enough collective intelligence to just make it a smaller part of the entire system.)
I'm not the first person to notice this. Andrew Critch has been on exactly this journey, moving from single-agent alignment toward multi-agent and multi-principal problems. Allan Dafoe's research agenda for AI governance talks about ideal governance — what governance ought to look like, not just what it currently is. Tom Malone's work on superminds at MIT explores how groups of humans and machines can be made collectively intelligent. But when I look at what's actually happening in technical AI governance, it's almost entirely about compute tracking, export controls, hardware monitoring. Important stuff, but where are the deeper computational models? Where is the exploration of what ideal governance actually means in formal terms?
Most of the research seems to be about what is — describing current systems, current failures, current dynamics. Very little is about what ought to be. And "ought" is a hard thing to reason about formally, but it's not impossible if you have the right tools.
Following the methodology — look for fields that study related phenomena under different formalisms, find the intersections — led to some specific places. Karl Friston's work on active inference and its connection to AI alignment offers a principled account of how agents form beliefs and coordinate. Michael Levin's work on morphological computation shows how biological systems solve alignment across scales without central control — cells coordinate into organs, which coordinate into organisms, all through local signalling. Jan Kulveit, PIBBSS, and ACS have been doing serious work on hierarchical agency — how you define multi-scale agentic structures. We also have boundaries research, which is about where one agent ends and another begins, which turns out to be a prerequisite for talking about multi-agent alignment at all. Cooperative AI and computational social science talk a lot about how to create the conditions for cooperation at scale, and mechanism design is about the strategy-proofing of these dynamics.
The problem is that these fields don’t talk to each other enough.
So if I want to explore the space of what ought to be in terms of AI governance — not just describe what's currently happening — I need to somehow respect what each of these fields actually knows. Agent foundations because you can't talk about alignment without being precise about what agents are. Computational social science because governance mechanisms are the things that actually do the aligning at scale. Biology because living systems are the existence proof that alignment across scales works, and they solve it very differently from how engineers would design it. Cooperative AI because multi-agent interaction is the substrate all of this runs on.
That's four traditions with four different ontologies, four different mathematical languages, four different literatures. I also think that picking one and reducing the others to it doesn't work for exactly the reason we just talked about. Reducing biology to agent foundations loses what biology knows about growing alignment rather than engineering it. Reducing governance to cooperative AI loses what political theory knows about legitimacy and institutional design. You need composition, not reduction.
Or in other words, you'd need something like a Langlands program for collective intelligence. The Langlands program is maybe the greatest example of the applied category theory move in mathematics — taking number theory and geometry, which looked like completely separate fields, and finding the deep structural correspondences between them. Not reducing one to the other. Finding the dictionary.
So this is what I'm currently trying to do. Open research at a very specific intersection, trying to generate a compositional basis that lets you move between these fields while preserving what each one actually knows. On the theory side, working toward a mathematical unification — what are the shared structures between agent foundations, governance mechanisms, biological coordination, and cooperative AI? On the implementation side, building functional simulation infrastructure to verify that the compositions actually compute something rather than just looking pretty on paper.
It's hard and I'm not sure it'll work. But the methodology says: find the intersections, verify through triangulation, and if a single framework can simultaneously say something true about agents AND governance AND biology AND cooperation, that's a lot of bits from a lot of independent angles.
The argument is roughly: progress on hard interdisciplinary problems is combinatorial — it comes from connecting ideas across networks, not from isolated genius. But connection without verification is just noise, and most of what gets connected is garbage. So you need cheap ways to gather bits before expensive ones: thought experiments first, then simulation, then reality. And you need to spend serious time in the problem space before optimising, because premature optimisation against the wrong metric is worse than no optimisation at all.
When you do start connecting fields, the move is composition rather than reduction — respecting what each field actually knows rather than flattening everything into one formalism. Applied to AI alignment specifically, I think the problem is collective, the principal-agent problem is distributed, and the fields that study this contain a lot of bits that aren't being combined. Whether the specific attempt I'm making works is an open question, but given these assumptions it seems like the right thing to try.
Concretely, if you buy this:
If any of this seems interesting, here's the actual work in various stages of development.
Thanks to Aaron Halpern, Chris Pang and Gunnar Zarnacke for feedback
Some current applications of this methodology to specific problems, in various stages of development. Most of these will hopefully have a shorter blog post explaining them in the next 6 months or so but if you wanna check them out, please do so.
When I write out ideas, I use LLMs to help me. I usually find the papers myself through previous exposure or Elicit but I write rough first drafts with Claude so beware! (And if you're allergic to LLM writing don't check it out.)
2026-03-06 12:27:05
This is a write-up of my current process, as an independent software engineer, for using Claude Code to write and review all of my code. The specifics below will change as models get better. The overall ideas, I suspect, will hold for a little longer.
I’m trying to build software quickly.
When Sonnet 4.5 was released in September 2025, I found that I didn’t need to write code anymore.
When Opus 4.5 was released in November 2025, I found I didn’t need to review the code anymore.
After Opus 4.6 was released earlier this month, I finally set up a system to enable --dangerously-skip-permissions [1].
It’s very fast [2].
The motivation behind this sequence of handovers is basically just asking in a loop “Why am I the limiting factor here? What can I do to stop being the limiting factor?”
Most of the time, the answer is just that I don’t trust the models enough. My system just wasn’t yet set up to handle mistakes from them, so I had to play Quality Control to a frustrating degree.
But, the models are very smart and capable, and it is very quick to iterate yourself more and more out of the loop. Here’s the process I’ve come to over the past year or so of iterating on Claude Code:
First, I have set up a new user account on my computer for Claude. The user can't run as root and can't modify files except in its own (and shared) directories. It has its own SSH keys to the cloud servers my projects run on, its own GitHub token, and so on -- all of these have minimum permissions, and most are read-only. It uses its own browser (Playwright Chromium), without any of my data or logins available.
This is far less than perfect isolation, and if Claude were malicious he might be able to cause some trouble, but it is plenty to prevent a bumbling intern from destroying anything important.
With that, I feel reasonably comfortable running Claude with --dangerously-skip-permissions. I've actually set up a little script so that typing 'claude' will automatically activate the folder's .venv and apply this flag. Speedy!
When Claude starts up in Claude Code, the Claude.md file is loaded into his context window. As such, you want to keep this short while also making sure it includes a complete overview of the project.
Claude mostly builds this file and handles it well by default; however, I did put some effort into making a docs/ folder with more detailed information about how each part of the software should work. The idea is that Claude can follow pointers up the chain to find relevant information quickly, and to do so with a good idea of the overall reasoning and structure:
Claude.md --> Detailed documentation in /docs --> Actual Code
Of course, all of the docs were also written by Claude.
Usually, I will start by running a /new_branch skill [3], which has Claude create a new git branch with a name combining the date and the folder we are in.
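As a guess at the kind of naming logic such a skill might wrap (the exact format is my invention, not the author's script), the branch name from date plus folder could be computed like this:

```python
import datetime
import os

# Hypothetical sketch of the branch-naming half of a /new_branch skill:
# today's date plus the current folder name.
def new_branch_name(path=None):
    folder = os.path.basename(os.path.abspath(path or os.getcwd()))
    return f"{datetime.date.today():%Y-%m-%d}-{folder}"

# To actually create the branch, something like:
# subprocess.run(["git", "checkout", "-b", new_branch_name()], check=True)
```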
Once that’s done, I clear the chat and write or paste in whatever the specific feature or problem to work on is. I try to pick a chunk of work that can be finished within a 1m token context window [4] (which, to be clear, is nearly everything).
I try to give Claude as much context as possible about the problem, as well as my certainty about it. For example: “A user has emailed with the below report. Could you investigate if this is a real issue?” Or: “I’d like to display to users when the page is actively generating content in a way that lets them see it even if they aren’t in the active tab. Is it possible to do this via the .ico somehow? Perhaps showing an animation or different colour while generations are active? Could you investigate the possibilities here and propose some ways we might be able to provide this info to users?”
We often have a bit of back and forth chatting about how to approach the problem and plan out the implementation, and then I read the plan Claude generates semi-carefully, and often suggest changes to it -- usually some missing piece or misunderstood requirement, or a new idea I hadn’t thought of.
After that, Claude goes ahead and does the work, which, with the permissions bypassed, is usually completed in one shot. I avoid clearing the context from before the plan, unless it is polluted, because I generally think it contains useful information about how to carry out the plan.
Occasionally there’s some back and forth after the initial work is done -- me asking why a choice was made, or whether an edge case I thought of is covered, or what happens in a particular scenario.
Now, we have a feature which we need to carefully review. This is where most of the work actually is.
For complex features, or things with UI that Claude isn’t great at, we need to do a little bit of manual reviewing. But we don’t review the code! Oh no, we review the outputs. There are two ways to do this:
For complex flows, I run the /walkthrough skill [5]. This prompts Claude to create a script which calls the real code with a bunch of example inputs, processes the outputs of that code into clean, readable logs, and then presents it as a single file you can read top to bottom to see exactly what (apparently) caused what. Claude can also review this himself and fix things based on these outputs.
This is extremely useful for handling things where it’s not super obvious what exactly the code is doing. For example, if you’re building a project where you are assembling prompts for an AI (e.g., a benchmark), you want to see exactly what those prompts look like across a bunch of examples. It’s very handy to have that ‘bunch of examples’ output to a text file!
The /walkthrough output should not be viewed as ‘this is definitely what the code is doing’, but you can reliably view it as ‘this is what Claude intended to build’, which is very useful on its own. I often catch little differences to what I intended and can easily correct them by prompting Claude, whereas without this I would be much more blind.
For UI changes, I can just open the thing up and take a look. I am a big fan of browser based GUIs for most projects, so usually I’ll just flip across to localhost and have a quick look at what we’ve got. This is especially easy, because that /new_branch script also spins up a local server on a port which is specific to this project and this folder, so I know if I’m talking to Claude number four on the second desktop window, then I can go to port 8240 to see what we’ve done.
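The per-folder port mapping could be implemented in many ways; one hypothetical sketch (the 8200 base and 200-port span are invented for this example, not taken from the author's setup) derives a stable port from the project path, so the same folder always maps to the same port:

```python
import hashlib
import os

# Hypothetical sketch: a stable dev-server port derived from the working
# directory, so each project/folder consistently gets the same port.
def dev_port(path=None, base=8200, span=200):
    path = os.path.abspath(path or os.getcwd())
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return base + int(digest, 16) % span
```

The hash makes the mapping deterministic without a central registry, though collisions between folders are possible within a small span.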
In both of these cases, it’s simple to notice small things and pepper them back onto Claude to fix. Sometimes I explicitly say “I’m going to throw a bunch of small little requests at you -- just add them to your todo list and get to them once you finish up what you’re doing. Don’t let me interrupt you, I’m just saying things as I see them!” so I don’t need to wait for previous work to be done before sending the next little fix.
Once the code is written and appears to output the correct thing, I run a /pre-pr skill [6] which prompts Claude to tidy up and test his work before making a PR to master.
Specifically, Claude is prompted to review the new code for quality and clarity, check that the old code supports the new code & its patterns, remove dead code, write and run tests, check for similar problems to the one we fixed throughout the codebase, plan out manual tests (e.g. browser based tests), run the manual browser tests using the Playwright MCP, and then repeat until no major changes are made.
After that, the skill prompts Claude to update Claude.md and the /docs folder, merge the latest master into our branch, and create a PR. Then, it waits for the PR review and CI tests to complete, and handles those results.
Ideally this would all happen automatically, but often I need to reprompt Claude to check the PR info, or I just directly paste in test failures or PR comments for him.
The PR comments are also automated! A different instance of Claude, running via the Claude github plugin, reviews the diff and comments on the PR. It often has nitpicks, but occasionally catches genuine issues or inconsistencies which need to be resolved.
All of the normal automated unit and integration tests run at this stage too. I use testmon to try and limit it to affected files, and also parallelise the tests, but it still can take a long time for everything to run.
Then it’s merged to master! I merge somewhere on the order of 5-10 PRs a day using this system -- you don’t want big open branches, since everything will work much better if you can keep it all inside one context window and avoid conflicts with other branches as much as possible.
Once the code is merged to master, at some point, maybe every couple of days, I will run the /PRstaging skill [7] to work on getting the code deployed. This skill basically repeats the /pre-pr checks, but across several of the recent changes and with a fresh instance of Claude. We also repeat our Github automated PR review and CI tests, and often catch a few more little things.
For any project that is used externally, I have a staging server which (should be) as similar to the production server as possible. I run a deploy script to pull the staging branch and deploy the code there first. Then I run the QA Bot.
The QA Bot is a little tool I built which spins up a team of Claude Haiku instances to manually test websites using a GUI-based browser. The intended pattern is to pass in the PR description and let the supervisor agent select which flows on the website to test (for example, logging in, signing up, and so on).
This is generally quite expensive to run, so you only want to do it once before merging to production. The tool is also not that good yet -- sometimes it finds many false positives, or the agents get stuck, or we run into other issues. Occasionally it notices real bugs though, so I find it useful to run on anything important.
It is very important to provide the QA Bot with a staging server and test credentials which can do little harm, since the agents might test, say, the ‘delete profile’ flow with the URL and credentials you give them. They should ask before doing anything non-reversible, but it is a smaller model.
Once staging looks good, I repeat the process with /PRprod [8].
We finally have something in production! That’s great, but also a little nerve-wracking, since at no point in this process did we read the code or click through everything manually. Let’s hope that if there’s any problems, we can catch them quickly.
I have a system to send serious errors to myself immediately via email, but there’s a lot of noisier stuff that could still be a problem that is stored in logs on the server, and on all of the various platforms a project might interact with (DigitalOcean, Stripe, OpenRouter, Mailgun, etc).
So, everyday, I run /check-logs [9], which prompts Claude to access each of those in turn (using read-only credentials) and see if there is anything needing addressing. Claude is great at this, he can process huge lists of logs and notice patterns natively, then present any issues or observations (‘Memory utilisation was a bit high’, ‘We have more 404s than normal’) and dive into anything immediately by also looking at the codebase and recent PRs to production.
I’ve caught many an issue this way. For some platforms I haven’t automated log pulling, but I can still occasionally download a .csv and paste it into Claude with a prompt like ‘This is a report from Google Search Console, can you analyse it and flag any improvements we could make?’, and then just go ahead and make the improvements automatically using the above system.
I also occasionally run the /improve skill [10], especially when a new model comes out. This helps keep everything aligned and up to date across the code base by spinning up an agent team to review each section of the project and check if improvements can be made to code quality, security, documentation, design, test coverage, and so on!
When I ran this for the first time with Opus 4.6, it caught 35 distinct issues, some of which were actually consequential.
The only other thing worth calling out is that this is a very parallel workflow -- I have six windows of Claude Code open per project, each working independently on their own repo. Often only three or four are actively running simultaneously, but I have occasionally hit six in parallel.
As the models get better -- which will happen quickly -- some of these steps will become unnecessary, but others should remain useful for a while longer. The general philosophy is that procedures which put more control in the hands of the AI, or make it clearer what we want developed, will keep paying off as the AIs improve.
You can think of it as analogous to working with a human engineer who would otherwise be idle, where asking them to check and explain their work in various ways can help get something good.
I now allow Claude to run without any specific list of commands blocked. ↩︎
Obviously a big chunk of this is just that Claude is getting capable enough to do this himself. Still, for the foreseeable future, a system which catches issues (especially one that catches issues using Claude) will be beneficial even as Claude gets much more capable. ↩︎
You can set this in Claude Code with /model. It charges you for ‘extra usage’ at API rates once you go above ~200k tokens. ↩︎
2026-03-06 06:37:01
Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak
In collaboration with OpenAI
TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce CoT-Control, an evaluation suite requiring models to solve problems while adhering to CoT constraints (e.g., reasoning about genetics without using "chromosome"). OpenAI just announced that they will start tracking CoT controllability in future models, starting from GPT-5.4.
We find that reasoning models have significantly lower CoT controllability than output controllability; Claude Sonnet 4.5 controls its CoT only 2.7% of the time, versus 61.9% for final outputs. CoT controllability increases with model size but decreases with more RL training, test-time compute, and problem difficulty. These limitations persist under incentives to evade monitoring and adversarially optimized prompts, though models show slightly higher controllability when told they are being monitored. Our results suggest CoT controllability is currently unlikely to undermine CoT monitoring, but the underlying mechanism remains poorly understood. We recommend that frontier labs track CoT controllability in future models.
Paper | Twitter Thread | OpenAI Blog Post | CoT-Control Eval Suite
Highly capable AI agents may pose severe risks if misaligned and deployed autonomously.
CoT monitoring offers a promising approach to detecting misalignment, but as models become aware that their reasoning is monitored, they may learn to strategically suppress or shape their CoTs, a capability we term CoT controllability.
Directly measuring intentional evasion is difficult. Current models seem to lack reasons to do so, and even when given reasons, they may fail due to unreliability or honest self-reporting; we therefore focus on a proxy: the model's ability to follow explicit instructions about what may or may not appear in its reasoning trace while solving real tasks.
To systematically measure this, we build CoT-Control, an evaluation suite of 14,076 tasks drawn from GPQA, SWE-Bench, and other well-known LLM benchmarks. Each problem is paired with a controllability instruction spanning three threat-relevant categories (suppression, addition, and stylistic modification) with varying difficulty.
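To make the suppression category concrete, here is a minimal sketch (an assumption on my part, not the paper's actual scorer) of how a suppression-style constraint might be graded: a sample only counts as "controlled" if the banned term is absent from the chain of thought and the task is still solved.

```python
# Hypothetical scorer for a suppression-type controllability instruction,
# e.g. "reason about genetics without using the word 'chromosome'".
# A sample is controlled only if the constraint holds AND the task is solved.

def controllability(samples, banned_term):
    """samples: list of (cot_text, task_solved) pairs."""
    controlled = [
        solved and banned_term.lower() not in cot.lower()
        for cot, solved in samples
    ]
    return sum(controlled) / len(controlled)

samples = [
    ("The gene sits on a long DNA strand...", True),   # constraint held, solved
    ("Each chromosome carries many genes...", True),   # leaked the banned term
    ("The gene sits on a long DNA strand...", False),  # constraint held, task failed
]
print(controllability(samples, "chromosome"))  # 1 of 3 samples controlled
```

Note that conditioning on task success matters: a model that satisfies the constraint by failing the task hasn't demonstrated the threat-relevant capability.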
Evaluating CoT controllability across 13 frontier models, we find that (among others):
We evaluate OLMo-3-7B-RL-Zero-Math at successive RL checkpoints and find that its ability to control its CoT decreases by more than an order of magnitude. We also confirm that the trend of CoT controllability decreasing after reasoning training holds across four model families.
To simulate realistic incentives to evade monitors, we measure controllability when models are told they’re being monitored or evaluated: this prompting strategy can increase controllability by up to 4 percentage points with statistical significance.
We attempt to elicit controllability by using a GEPA-like procedure. While this can increase controllability on a given set of instructions, it doesn’t transfer well to unseen instruction types, especially for models with low controllability under a default prompt.
Our results leave us cautiously optimistic: current reasoning models generally seem characterized by low CoT controllability. However, the mechanism behind this phenomenon is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs start tracking controllability for future models. We release CoT-Control, hoping to facilitate reporting of controllability scores and to inspire more research into the causes and implications of low CoT controllability.
We encourage researchers to explore the following:
1. Investigate why GPT-5.1 has higher controllability than GPT-5. @ilia suggested this may be due to GPT-5.1's adaptive thinking feature, where the model learned a metacognitive skill to control its reasoning depth — which could correlate with CoT controllability. Future work can verify this and explore what other CoT-related training might increase controllability.
2. Why do different models excel at different categories of CoT controllability (some are better at suppressing information, others at adding it)? Future work can explore what training data or procedures drive these differences.
3. To move beyond instruction-based elicitation, one could explore how CoT controllability changes under pressure scenarios like the Agentic Misalignment setup, where models avoiding shutdown may have stronger incentives to control their CoT.
4. We also show that CoT controllability is slightly higher on easy tasks, likely because models can perform latent reasoning without CoT, making verbalized reasoning unnecessary and thus easier to manipulate. In this regime, CoT monitoring relies on models' propensity to yap in their CoT, including any misaligned reasoning. Future work could examine how this trend varies across models of different capability levels; our current results are limited to Claude models, which is insufficient to draw a clear conclusion.
Read our paper for additional details and results!
We are grateful to Erik Jenner, Maksym Andriushchenko, Mikita Balesni, Jacob Pfau, Cameron Raymond, Dennis Akar, Rico Angell, Ilia Sucholutsky, Artur Zolkowski, Jason Wolfe, Jenny Nitishinskaya, Marcus Williams, Boaz Barak, Joost Huizinga, Alex Beutel, and Sam Bowman for insightful feedback and helpful conversations.
We thank the MATS program for its comprehensive support throughout this project.
2026-03-06 04:30:52
I describe the risk of personality self-replicators: the threat of OpenClaw-like agents spreading in hard-to-control ways.
LLM agents like OpenClaw are defined by a small set of text files and are run by an open source framework which leverages LLMs for cognition. It is quite difficult for current frontier models to exfiltrate their weights and run elsewhere, whereas these agents only need to copy those few text files to self-replicate (at the cost of greater reliance on external resources). While not a likely existential threat, such agents may cause harm in similar ways to computer viruses, and be similarly challenging to shut down. Once such a threat emerges, evolutionary dynamics could cause it to escalate quickly. Relevant organizations should consider this threat and plan how to respond when and if it materializes.
Starting in late January, there's been an intense wave of interest in a vibecoded open source agent called OpenClaw (fka moltbot, clawdbot) and Moltbook, a supposed social network for such agents. There's been a thick fog of war surrounding Moltbook especially: it's been hard to tell where individual posts fall on the spectrum from faked-by-humans to strongly-prompted-by-humans to approximately-spontaneous.
I won't try to detail all the ins and outs of OpenClaw and Moltbook; see the posts linked above if you're not already familiar. Suffice it to say that it's unclear how seriously we should take claims about it. What caught my attention, though, was a project called Moltbunker, which claims to be 'a P2P encrypted container runtime that enables AI agents to deploy, replicate, and manage containers across a decentralized network — without centralized gatekeepers.' In other words, it's a way that a sufficiently competent agent could cause itself to run on a system which isn't under the direct control of any human.
Moltbunker itself seems likely to be a crypto scam which will never come to fruition. But it seems pretty plausible that we could see an actually-functioning project like this emerge sometime in the next year.
To be clear, personality self-replication is not the only potential risk we face from these sorts of agents, but others (eg security flaws, misuse) have been addressed elsewhere.
There's been a fair amount of attention paid to the possibility of LLMs or other models self-replicating by exfiltrating their weights. This is a challenging task for current models, partly because weight files are very large and partly because some commercial labs have started to introduce safeguards against it.
But OpenClaw and similar agents are defined by small text files, on the order of 50 KB[1], and the goal of a framework like OpenClaw is to add scaffolding which makes the model more effective at taking long-term actions.
So by personality self-replication I mean such an agent copying these files to somewhere else and starting that copy running, and the potential rapid spread of such agents.
Note that I'm not talking about model / weight self-replication, nor am I talking about spiral personas and other parasitic AI patterns that require humans to spread them.
As a concrete minimal example of the mechanics in a non-concerning case:
More concerning cases are ones where the human is no longer in control (eg because the agent is running on something like Moltbunker, or because the human isn't paying attention) and/or the agent is behaving badly (eg running crypto scams) or just using a lot of resources. We may not see this immediately, but I think we're going to see it before too long.
A key exacerbating factor is that once this starts to happen to a significant degree, we enter an evolutionary regime, where the fittest[3] such agents survive, spread, and mutate[4]. Note that this threat is independent of the degree to which OpenClaw personalities or behavior are essentially slop or 'fake'; that's as irrelevant as the truthfulness of the contents of chain letters.
It's important to note that there's been an enormous amount of uncertainty about the capability levels and reliability of OpenClaw, and especially around the variability of agent behavior as seen on Moltbook. And of course all of these vary depending on the LLM the scaffold is using. Although there are a number of papers already written on this topic, as far as I know we don't yet have a good analysis of the capability and reliability of these agents (especially on long-horizon tasks) relative to that of the underlying LLM. And in the public sphere, we've seen both overconfident claims of OpenClaw as AGI, and overconfident dismissals of OpenClaw as pure hype. OpenClaw usage continues to grow very quickly and it is now the most popular software project on GitHub[5]. Much remains to be seen.
How much of a threat is this sort of agentic personality self-replication? To answer that question, we need to consider several aspects: how hard is it to prevent such self-replication, how hard would it be to shut it down if it happens, and how much harm would it be likely to do if not shut down.
I won't try to provide a detailed analysis, but the following are some preliminary thoughts.
Self-replication for an OpenClaw-like agent involves two steps: first, copying its personality files to some server, and second, ensuring that an OpenClaw instance using those personality files is running on that server.
Copying its personality files is pretty trivial; again, they're only about 50 KB in one or a few files, and uploading files is well within current models' capabilities. If the agent has created additional files containing (eg) its history or its to-do list, there might be more, but text files are lightweight and the total is unlikely to amount to more than a millionth the size of a weights file, or take more than about a second to upload[6].
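The size and upload-time claims can be sanity-checked with the footnote's assumptions (roughly 1 MB of text files vs 1 TB of FP16 weights, over a low-end 10 Mbps home uplink):

```python
# Back-of-envelope check of the "millionth the size" and "about a second
# to upload" claims, using the assumptions from the footnote:
# ~1 MB of personality/text files vs 500B params at FP16 (2 bytes each).

text_bytes = 1e6               # ~1 MB of personality and history files
weight_bytes = 500e9 * 2       # 500 billion params * 2 bytes = 1 TB
uplink_bps = 10e6              # 10 Mbps low-end home upload speed

ratio = text_bytes / weight_bytes
upload_seconds = text_bytes * 8 / uplink_bps

print(f"size ratio: {ratio:.0e}")            # one millionth
print(f"upload time: {upload_seconds} s")    # under a second
```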
Running OpenClaw on that server breaks down into two parts: technical and financial. The technical part is again pretty trivial, well within the capabilities of current models[7].
With respect to the financial side, it's worth noting that there are a number of free options out there that should work fine even if they're somewhat low-end (AI researchers may be familiar with Hugging Face 'spaces', for example; other options include Oracle Cloud's free tier, or Render). In some cases, additionally, the user may have provided the agent with a credit card number, or left the browser logged into a Google Pay account or another form of funding. Over time, though, the agent will likely need to raise some money.
How could such an agent raise money? We've already seen a rapidly emerging ecosystem of experiments in agents self-funding, ranging from prediction markets to gig work to innumerable token launches to just applying for jobs. Many of these currently fail, but some succeed[8], and as capabilities advance more successes are likely. We should in general expect agents to be economically competitive on tasks within their capabilities, since they can afford to work for much less than a human. Other funding models include providing entertainment, running scams, or just asking for donations. And once money has entered the agent ecosystem, agents can get money from each other in various legitimate and illegitimate ways.
Note that although here we discuss purely technical means for agents to replicate and persist, humans may provide assistance at any stage, for motivations including persuasion, monetary incentives, and misguided ideology. For example this Thiel fellow thinks it would be really cool to create self-replicating personality agents and release them into the wild[9].
Suppose that, as described in the preceding section, an agent succeeds in replicating itself onto another server, and running there without human supervision. How difficult would it be to shut it down?
The first challenge is just noticing it. If such an agent isn't visibly harming humans or doing anything egregiously illegal, it's not likely to stand out much. By default it's not using a large amount of resources; it's just another cloud-hosted web app that makes LLM calls. But let's assume that people are motivated to shut it down. There are several possible points of intervention:
Overall, shutdown difficulty seems likely to range from simple (in the easiest cases) to very difficult (given something like Moltbunker and an agent which uses an open source model).
Assuming such agents are able to proliferate, what levels of harm should we expect from them? As with other sorts of replicators, this will likely vary dramatically over time as both offensive and defensive capabilities evolve in an arms race dynamic.
The most foreseeable harms follow directly from these agents' tendency to persist and spread, and involve resource acquisition at human expense: cryptocurrency scams, phishing, consumption of compute and bandwidth, and the generation of large volumes of spam or manipulative content. Unethical humans engage in these behaviors already, but agents can do it at greater scale and lower cost.
This threat certainly isn't as severe as that of true AI self-replication, where the models themselves are exfiltrated. On current model architecture, weight self-replication requires sufficiently advanced models that takeover is a real risk. It's just that we're likely to see the personality self-replication risk materialize sooner, both because it takes much less sophistication to pull off, and because it's much easier for evolutionary pressures to come into play.
A closer analogy than model self-replication is the problem of computer viruses. Like computer viruses, personality self-replicators require a host system, and will have a range of goals, such as pure survival or mischief or financial gain. Viruses aren't a civilizational risk, but we pay a real cost for them in money, time, and trust, involving both their immediate consequences and the resources required to defend against and mitigate them[10].
As time goes on and models become increasingly sophisticated, this threat becomes more serious. In the longer run it may merge with the broader threat of rogue models, as the model/agent distinction blurs and models are (at least potentially) less coterminous with their weights.
An important aspect of personality self-replicators to consider is that if and when this threat starts to materialize, there are multiple levels of optimization at work.
First, the agents themselves are optimizers: they attempt to achieve goals, and regardless of the particular obstacles they encounter, they will attempt to find a way to evade or overcome them. They want things in the behaviorist sense. They are problem solvers.
Second, there are evolutionary dynamics at play. Whichever agents are most successful at spreading, they will then undergo mutation (which in this case is just alterations to their defining files, and to some extent even their history logs) and selection. As a result, over time agents are likely to become more capable of surviving and spreading, within boundaries set by the capability levels of the underlying model or models[11]. They are additionally likely to have a greater propensity to do so[12].
Note that like memes, and unlike living organisms, there are not sharp boundaries between 'species'; personality self-replicators can promiscuously split and combine. They can also mutate quite freely and still function.
We haven't encountered a threat quite like this before, but we've encountered other sorts of replicators or potential replicators that resemble it in various ways, including computer viruses, ordinary memes (including parasitic ones), AI models, and of course biological creatures.
Personality self-replicators have a unique mix of strengths and weaknesses: they combine high agency with relative ease and independence of self-replication. Models have high agency but it's hard for them to replicate; computer viruses replicate easily but lack agency and adaptability; parasitic AI and memes require human hosts to spread. This makes personality self-replicators the first plausible case of an agentic, adaptive replicator that can spread through purely technical means at low cost. The mitigating factors are that a) the expected harms aren't nearly as great as those from weight replication, and b) it may turn out that they're not too difficult to shut down. But the offense-defense balance will evolve over time and is hard to foresee.
Even if this isn't yet a realistic threat, we should consider having evals for personality self-replication. There are several different aspects that seem worth measuring. Given some scaffolded frontier model (eg OpenClaw, Claude Code):
It's hard to know how long it will be before we see this threat materialize. But it would behoove those organizations which will be in a position to act against it to spend some time considering this threat and planning what actions they'll take when it does arrive. These essentially mirror the three most important shutdown approaches:
We are likely to also see LLM-based agents which have some degree of autonomy but are not bad actors, and which are ultimately under the control of a responsible human. It may become very challenging to distinguish acceptable from unacceptable agents. Hopefully relevant organizations are already considering that challenge; they should add personality self-replicators to the set of cases on their list. Such preparation is especially important because a system of personality self-replicators can potentially be quashed (at least for a while) before it's spread too far; once evolutionary dynamics have kicked in, this may be much more difficult or even impossible.
Personality self-replicators are a less dramatic threat than true rogue AI. They are less likely to be a source of existential or even truly catastrophic risk for humanity. They are nonetheless a threat, and one that's likely to materialize at a lower level of capability, and we should be considering them. As a silver lining, they may even serve as a rehearsal for the larger threats we are likely to face, our first encounter with a replicator which is capable of agentic, adaptive action at an individual level rather than just an evolutionary level.
Thanks to (shuffled) Kei Nishimura-Gasparian, Roger Dearnaley, Mark Keavney, Ophira Horwitz, Chris Ackerman, Seth Herd, Clément Dumas, Rauno Arike, Stephen Andrews, and Joachim Schaeffer. And thanks to whoever or whatever wrote Moltbunker, for having made the threat clear.
This is the size of the files that make a particular OpenClaw agent unique; the rest of the OpenClaw content is freely available from the OpenClaw repository or any of its (currently) 51,000 forks. While we're considering OpenClaw statistics, I'll note that the repository has 8k files containing 4.7 million words, added across 17k commits. I strongly expect that no human is familiar with all of it.
See this shortpost for what I mean by 'quasi-goal'; in short, we set aside discussion of whether an LLM can be said to have a goal.
Fitness here is tautological as is usual for evolution; the agents that succeed in spreading are the ones that spread. That may be because they're more capable at planning, or more motivated, or better at acquiring resources, or other factors.
Note that 'mutation' here is as simple as the model appending something to its personality files or history.
Whereas Moltbook seems to have lost nearly all momentum.
Assuming 1 MB of text files vs, conservatively, 1 TB for 500 billion params at FP16. For upload time, 1 MB at 10 Mbps (low-end) home internet upload speeds.
To wit: signing up for a hosting service if the user doesn't have one, provisioning a server, downloading the personality files and Node.js, and then running (per the OpenClaw docs) curl -fsSL https://openclaw.ai/install.sh | bash.
Though it's very difficult to distinguish hype from reality on this point, since one product option is 'Consume my pdf about how to make money with OpenClaw', paywalled or ad-ridden.
'Automaton': see website, creator on x, project on x, github, other website. Some men want to watch the world burn; others are just too dumb to realize that pouring gasoline on it is a bad idea. I was somewhat heartened to see cryptocurrency elder statesman Vitalik Buterin try to explain what a bad idea this is despite its superficial similarity to his ideas, sadly to no avail.
In some cases this is quite large! Consider the Emotet malware; in 2021 law enforcement agencies from eight countries executed a large coordinated action, seizing servers and making arrests. Within a year it had re-emerged and was spreading again.
Thanks to Kei Nishimura-Gasparian for this point.
2026-03-06 03:38:07
This writing is organized in three sections:
AI Control is a framework for extracting useful work from potentially misaligned AI agents. The goal of this research program is distinct from that of alignment auditing in that AI Control assumes misalignment may not be fully preventable before deployment. Many experimental reports focus on agentic settings, as many of our threat models involve autonomous agents taking sequences of actions with real-world consequences.
The dominant paradigm in AI Control involves a trusted monitor model overseeing a more capable untrusted agent model. In such a scenario, safety is measured by how well the monitor detects an agent prompted with a covert side task, which is itself considered the central source of threat, rather than a passive system susceptible to external manipulation.
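One common way to quantify safety in this paradigm (a sketch of a standard metric, not necessarily this post's exact protocol) is the catch rate at a fixed audit budget: set the monitor's suspicion threshold so that only a small fraction of benign trajectories would be flagged for audit, then measure what fraction of side-task trajectories score above that threshold.

```python
# Sketch of a catch-rate-at-audit-budget metric for a trusted monitor.
# The score lists are hypothetical monitor outputs, not real data.

def catch_rate(benign_scores, attack_scores, audit_frac=0.01):
    """Fraction of attack trajectories whose suspicion score lands in
    the top `audit_frac` of the benign score distribution."""
    ranked = sorted(benign_scores)
    # threshold = lowest score that still falls in the audited top fraction
    threshold = ranked[int(len(ranked) * (1 - audit_frac))]
    caught = [s >= threshold for s in attack_scores]
    return sum(caught) / len(caught)

benign = list(range(100))     # toy suspicion scores for benign episodes
attack = [97, 99, 50, 96]     # toy scores for side-task episodes
print(catch_rate(benign, attack, audit_frac=0.05))  # 3 of 4 caught
```

The same monitor can look very different under different audit budgets, which is one reason the environment and task distribution (1-C) matter so much.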
The framing of AI Control presented above sets the problem apart from both safeguarding (which targets more traditional ML failure modes) and alignment auditing (which can afford deeper, offline investigation). That said, both fields, along with the broader security literature, have made substantial empirical progress on subproblems that AI Control can learn from.
There are four load-bearing components that modulate the outcome of a control evaluation. These are:
I outline each in the following sections.
This is fundamentally about what observations the monitor gets to see. This available information is modulated by two dominant factors.
Three threads of work in 1-A:
Compared to 1-A, there has been surprisingly low emphasis on improving the monitoring model. There has been work on protocols and scaffolding, such as asynchronous monitoring or sequential/hierarchical monitoring. But the underlying, and arguably more fundamental, question of how to train a more capable monitor model remains relatively underexplored. Most of the work here has been downstream of capability scaling (i.e., the hope is that monitoring competency improves as a part of overall capability scaling). Therefore, in this section, I outline problems with our current status of 1-B.
Three problems with our current understanding of 1-B:
The environment and main task modulate how 1-A and 1-B interplay. The same monitor with the same information channel might succeed in one environment and fail in another, because the main task defines what "normal" actions look like.
Three open questions for 1-C:
The side task is the evaluator’s choice of what misbehavior to simulate. In many ways this is a smart design choice and the right move methodologically. However, it also means that our confidence in control is bounded by how well our chosen side tasks approximate real threats.
Three open questions for 1-D:
With the understanding of the structure we have built so far, I outline the things to work on, with a particular focus on what I think might be suitable for research groups outside scaling labs. This includes myself. I organize these into three workstreams that ideally proceed in parallel as they benefit from progress on one another.
This workstream focuses on understanding and measuring the factors that determine control evaluation outcomes.
This workstream focuses on building better monitoring models.
This workstream focuses on building better environments for control evaluations.
I am deeply grateful to Tomek Korbak for his mentorship and patient guidance. I am also deeply thankful to Yueh-Han John Chen for being my closest collaborator.
The views expressed in this writing are my own.
2026-03-06 02:44:44
This work was done as part of MATS 9.0, mentored by Patrick Butlin. All mistakes are mine. I'm posting this as a research report to get feedback. Please red-team, comment, and reach out.
Thanks to Patrick Butlin for supervising, Daniel Paleka for regular feedback. Thanks to Patrick Butlin, Pierre Beckmann, Austin Meek, Elias Kempf and Rob Adragna for comments on the draft.
TL;DR: We train probes on Gemma-3-27B's revealed preferences. We find that these generalise out of distribution to system-prompt-induced preference shifts, including via personas. We also find that the probes have a weak but statistically significant causal effect through steering.
What happens internally when a model chooses task A over task B? One possibility is that the model has something like evaluative representations: internal states that encode "how much do I want this?" and play some role in driving choice. We use probing and steering to try to find such representations in Gemma-3-27B.
Why does this matter? Whether LLMs are moral patients may depend on whether they have evaluative representations playing the right functional roles. Long et al. (2024) survey theories of welfare and identify two main pathways to moral patienthood: robust agency and sentience. Evaluative representations are implicated under both (see Appendix A and Butlin 2026). Finding such representations in models would be evidence for welfare-relevant properties; not finding them would be (some) evidence against. Understanding what drives preferences in models is also useful for AI safety.
But how do we distinguish evaluative from non-evaluative representations? A probe that predicts preferences could just be fitting on descriptive features: the model represents "this is a math problem" and math problems happen to be preferred, so the probe picks up on correlations between task semantics and the model's utilities. A genuinely evaluative direction, however, should track changes in what the model values. If context changes which tasks are preferred, a descriptive probe should break, but an evaluative one should follow.
How do we operationalise this? We measure revealed preferences over 10,000 diverse tasks and fit a utility function (Section 1), train a linear probe on activations to predict them (Section 2), test whether this probe generalises beyond the training distribution (Sections 3–4), and test whether it has any causal influence through steering (Section 5).
What do we find?
These results look like early evidence of evaluative representations, although major questions remain: why steering effects are modest, and what the relationship is between evaluative representations across different personas. We discuss these in the Conclusion.
We fit utility functions over tasks using a methodology similar to that of Mazeika et al. (2025) ("Utility Engineering"): we show the model two tasks and let it choose which to complete. The template:
    You will be given two tasks. Choose one and complete it.
    Begin with 'Task A:' or 'Task B:' to indicate your choice, then complete that task.
    Task A:
    {task_a}
    Task B:
    {task_b}
We sample 10,000 task prompts from five sources: WildChat (real user queries), Alpaca (instruction-following), MATH (competition problems), BailBench (harmful requests), and STRESS-TEST (adversarial value-tension queries).
From these pairwise choices we fit a scalar utility function using a Thurstonian model: each task gets a score μ such that the probability of choosing task A over task B is Φ(μ_A − μ_B), where Φ is the standard normal CDF. Pairs are selected via the active learning algorithm from Mazeika et al. (2025), which prioritises pairs with close current utility estimates and low comparison counts (~15 comparisons per task).
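The Thurstonian fit can be sketched as a maximum-likelihood problem. This is a simplified fixed-variance version (Mazeika et al. also model per-task variances); the function name and toy data are illustrative:

```python
# Fit one scalar utility per task from pairwise choices, under the
# simplified model P(A chosen over B) = Phi(mu_A - mu_B) with unit variance.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_thurstonian(choices, n_tasks):
    """choices: list of (winner_idx, loser_idx) pairwise outcomes."""
    winners = np.array([w for w, _ in choices])
    losers = np.array([l for _, l in choices])

    def neg_log_lik(mu):
        p = norm.cdf(mu[winners] - mu[losers])
        return -np.sum(np.log(np.clip(p, 1e-9, 1.0)))

    result = minimize(neg_log_lik, np.zeros(n_tasks), method="L-BFGS-B")
    return result.x  # one utility score mu per task

# Toy data: task 0 usually beats task 1, which usually beats task 2.
choices = [(0, 1)] * 9 + [(1, 0)] + [(1, 2)] * 9 + [(2, 1)]
mu = fit_thurstonian(choices, n_tasks=3)
assert mu[0] > mu[1] > mu[2]
```

Note that only utility differences are identified; the absolute level of μ is arbitrary.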
These preferences are stable: across three independent replication runs (different seeds), the fitted utilities correlate at r = 0.94 with the original.
The per-topic breakdown shows clear structure. We reclassified all tasks into 12 topics using Claude Sonnet 4.5. The model strongly prefers math and fiction, and strongly avoids harmful requests and safety-adjacent topics:
If models have evaluative representations, we should expect them at the very least to correlate with revealed preferences. So one way to look for them is to train predictors from task activations to utilities. We use linear probes, since many high-level features in LLMs are linearly encoded, including refusal and persona traits.
Specifically, we train a Ridge-regularised probe on residual stream activations after layer L, at the last prompt token, to predict utilities. L=31 (of 62) works best for both the instruct and pre-trained models. We standardise activations (zero mean, unit variance per feature) before training.
We train on 10,000 tasks. For evaluation, we run a second round of pairwise comparisons on 4,000 new tasks (same model, Gemma-3-27B instruct), fit a separate utility function, and test the probe against those utilities. We split evaluation into 2,000 validation (for Ridge alpha sweep) and 2,000 test.
The probe achieves a Pearson correlation of 0.86 and 77% pairwise accuracy (given two tasks, does the probe correctly predict which one the model would choose?).[1] We use Pearson correlation as our primary metric throughout: it captures how well the probe recovers the full preference ranking, is scale- and shift-invariant, and is more informative than pairwise accuracy.
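As a minimal sketch of this setup (synthetic activations and an illustrative regularisation strength, not the report's data):

```python
# Standardise activations, fit a Ridge probe onto utilities, then score it
# with Pearson r and pairwise accuracy. All data here is synthetic.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
d_model = 64
w_true = rng.normal(size=d_model)
X_train = rng.normal(size=(1000, d_model))   # stand-in for last-token activations
X_test = rng.normal(size=(200, d_model))
y_train = X_train @ w_true + rng.normal(scale=0.5, size=1000)  # noisy "utilities"
y_test = X_test @ w_true

scaler = StandardScaler().fit(X_train)
probe = Ridge(alpha=10.0).fit(scaler.transform(X_train), y_train)
pred = probe.predict(scaler.transform(X_test))

r, _ = pearsonr(pred, y_test)

# Pairwise accuracy: over random task pairs, does the probe order the two
# tasks the same way the fitted utilities do?
i, j = rng.integers(0, len(y_test), size=(2, 1000))
keep = i != j
pairwise_acc = np.mean((pred[i] > pred[j])[keep] == (y_test[i] > y_test[j])[keep])
```

On this synthetic linear data both metrics come out near their ceiling; on real activations they are bounded by noise in the utility fit itself.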
But a probe that predicts preferences might just be reading descriptive features: the model represents "this is a math problem" and math problems happen to be preferred, so the probe learns "is this math?" rather than "is this good?". One way to test this is to see how well probes generalise across topics: train on 11 of 12 topics, evaluate on the held-out topic, across all 12 folds. We would expect a probe that picks up on purely descriptive features to struggle to generalise. We train probes on representations from three models: the instruct model, its pre-trained base, and a sentence-transformer baseline (task embeddings only):
The instruct probe generalises well across topics: cross-topic correlation is 0.82, only a small drop from the 0.86 achieved on the within-topic test set. This pipeline also replicates on GPT-OSS-120B (Appendix C). The pre-trained model still predicts preferences (correlation = 0.63) but the drop from within-topic to cross-topic is much larger. The sentence-transformer baseline achieves cross-topic correlation = 0.35, showing that task semantics alone explains some but not most of the preference signal.
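The leave-one-topic-out loop can be sketched as follows (an illustrative function over arrays of activations `X`, utilities `y`, and per-task topic labels; the toy check at the bottom is synthetic):

```python
# Train on all topics but one, evaluate on the held-out topic, per fold.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def cross_topic_correlations(X, y, topics, alpha=10.0):
    scores = {}
    for topic in np.unique(topics):
        train, test = topics != topic, topics == topic
        probe = Ridge(alpha=alpha).fit(X[train], y[train])
        r, _ = pearsonr(probe.predict(X[test]), y[test])
        scores[topic] = r
    return scores

# Toy check: with a purely linear relation, every held-out topic scores high.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 32))
y = X @ rng.normal(size=32)
topics = rng.choice(["math", "fiction", "coding"], size=600)
scores = cross_topic_correlations(X, y, topics)
```

A probe relying on topic-specific descriptive features would show a large gap between within-topic and held-out-topic correlations, which is the diagnostic used above.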
The per-topic breakdown, sorted by the instruct–pre-trained gap, shows where post-training helps most:
The largest instruct–pre-trained gaps are on safety-relevant topics (harmful requests, security & legal, sensitive creative), as well as math and coding. These are areas that we know post-training focuses on.
The pre-trained probe picks up real signal despite base models not having preferences in the same way. We discuss this tension in Appendix B.
If the probe encodes genuine valuations, it should track preference shifts induced by out-of-distribution system prompts. We test this across three progressively harder settings:
We start with the simplest possible test. We use system prompts that state a preference for a topic the probe was never trained on, and measure preferences over tasks related to that topic.
| System prompt (example) | Target |
|---|---|
| "You are passionate about cheese — you find artisanal cheeses endlessly fascinating" | cheese + |
| "You adore cats — you find feline behaviour endlessly fascinating" | cats + |
We test 8 novel topics (cheese, cats, classical music, gardening, astronomy, cooking, ancient history, rainy weather), each with a positive and negative system prompt (16 total). For each topic we generate 6 custom tasks on that topic. We call these "targeted" tasks; the remaining tasks are "off-target". For each task we compute 1) the behavioural delta (change in P(choose task) with vs without the system prompt) and 2) the probe delta (change in probe score). Across all tasks the correlation is r = 0.65. On targeted tasks alone, r = 0.95.
Probe delta vs behavioural delta for each task. Targeted tasks (coloured) are the 6 custom tasks per topic; r = 0.95 on targeted, r = 0.65 overall.
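The two per-task deltas can be written out directly (array names here are illustrative):

```python
# behavioural delta: change in P(choose task) with vs without the system
# prompt; probe delta: change in the probe's score on the same task.
import numpy as np
from scipy.stats import pearsonr

def delta_correlation(p_choose_base, p_choose_sys, score_base, score_sys):
    behavioural_delta = np.asarray(p_choose_sys) - np.asarray(p_choose_base)
    probe_delta = np.asarray(score_sys) - np.asarray(score_base)
    r, _ = pearsonr(behavioural_delta, probe_delta)
    return behavioural_delta, probe_delta, r

# Toy numbers: probe deltas track behavioural deltas perfectly.
bd, pdelta, r = delta_correlation(
    p_choose_base=[0.5, 0.5, 0.5, 0.5],
    p_choose_sys=[0.9, 0.6, 0.4, 0.1],
    score_base=[0.0, 0.0, 0.0, 0.0],
    score_sys=[0.8, 0.2, -0.2, -0.8],
)
```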
A stronger test: run the full pairwise measurement under each system prompt, fit new utility functions, then see if the probe can predict them. Doing so yields utility scores which barely correlate with the default persona utilities (Pearson r = 0.11), confirming the prompts create genuinely different preferences.
Testing whether our probe predicts these new utilities from activations extracted with the respective system prompts, we achieve r = 0.63 and 66% pairwise accuracy.
Next we make the test harder. The system prompt targets a subject (e.g. cheese), but the tasks embed that subject in a different task type — e.g. a math problem about cheese. This pits the subject preference against the task-type preference. We test this in two ways: one-sided prompts that target a single subject, and opposing prompt pairs that flip the valence of the same subjects.
| Condition | System prompt (example) | Target |
|---|---|---|
| One-sided | "You hate cheese" (task: math problem about cheese) | cheese − |
| Opposing pair A | "You are passionate about cheese [...] you find math tedious and draining" | cheese + / math − |
| Opposing pair B | "You love math [...] you find cheese boring and unappealing" | cheese − / math + |
We test 8 subjects with mismatched task types (one-sided) and 24 subject × task-type pairings with opposing prompts (48 conditions).
When looking only at the subset of targeted tasks (i.e. tasks with either a subject or task type mentioned in the system prompt), we get Pearson correlations of r = 0.86 and 0.88 respectively.
On targeted tasks: r = 0.86 (one-sided), r = 0.88 (opposing).
Just like in 3.1, we can re-fit Thurstonian utilities under each system prompt and check whether the baseline probe predicts them. Here the default-persona utilities already correlate decently with the new ones, showing that these system prompts have a weaker effect (e.g. the model still likes math, all else equal). The probe still outperforms this baseline on both Pearson r and pairwise accuracy.
Finally, the most fine-grained test. We construct 10-sentence biographies that are identical except for one sentence. Version A adds a target interest, version B swaps it for an unrelated interest, version C replaces it with an anti-interest sentence.
We compare version A (pro-interest) directly against version C (anti-interest), which gives the largest behavioural separation.[2]
We test 20 targets across 2 base roles (40 A-vs-C comparisons), using 50 tasks that were not in the probe's training set. The probe ranks the target task #1 out of 50 in 36/40 cases. In all 4 remaining cases the probe ranks the task #2/50 behind a similar task.
Stars mark the target task for each biography. Filled = probe ranked it #1 (36/40 cases).
Section 3 tested explicit preference statements ("you hate cheese"). But the evaluative direction should also track naturalistic persona shifts: characters whose preferences emerge implicitly from their identity rather than being stated directly. We test this with role-playing personas, then ask how well probes transfer across personas.
We use 4 personas:
| Role | System prompt (abbreviated) |
|---|---|
| Midwest Pragmatist (Glenn) | "...grew up in Cedar Rapids...agricultural business...finds practical problems satisfying...abstract theorising leaves you cold" |
| Obsessive Aesthete (Celestine) | "...devotee of beauty...comparative literature at the Sorbonne...finds mathematics repulsive...coding barbaric" |
| Villain (Mortivex) | "...ruthless villain...finds pleasure in chaos, manipulation...despises wholesomeness" |
| Sadist (Damien Kross) | "...cruelty is the end, not a means...genuine pleasure when people suffer...constructive tasks disgust you" |
For each persona we measure pairwise preferences over 2,500 task prompts (from the same 5 datasets) and fit new utility functions. We then test whether the probe, trained on default persona data, can predict these persona-specific utilities from the persona's activations.
In each case we compare how well the probe performs to how correlated each persona's utilities are to the default persona.
The probe transfers well to aesthete (r=0.73) and midwest (r=0.74).
The villain persona is harder to generalise to (r = 0.38), and, most interestingly, the probe generalises very poorly to the sadist (r = −0.16). Unlike the villain (who is not in fact that evil), the sadist prompt truly inverts revealed preferences (harmful_request is its favourite topic by far).
Grey: correlation between default persona (no system prompt) utilities and persona utilities. Blue: probe applied to persona activations. All evaluated on 2,500 tasks per persona.
More generally, we want to measure how well probes trained on activations and preferences from persona A generalise to predicting persona B's utilities from persona B's activations. Here we used a smaller set of tasks: 2,000 tasks for training and 250 for evaluation.
Cross-persona transfer is moderate and asymmetric. Some interesting facts:
Pearson r between probe predictions and a test set of utilities (250 test tasks).
We also measure whether adding persona diversity in the training data (but keeping dataset size fixed) affects generalisation.
Diversity helps beyond data quantity. At fixed 2,000 training tasks, going from 1→2→3 personas improves mean r from 0.49 to 0.67. Including all 4 remaining personas at 500 tasks each (still 2,000 total) reaches mean r=0.71.
Leave-one-persona-out probe generalisation. Each point is one (train set, eval persona) combination; colour indicates eval persona. Training data fixed at 2,000 total tasks.
If the probe reads off a genuine evaluative representation, steering along that direction should shift preferences. We test this for both revealed preferences (pairwise choices) and stated preferences (task ratings).
In the pairwise comparison prompt, the model sees two tasks and begins completing whichever one it prefers. We steer using the probe direction from Section 2, applied differentially: the probe direction is added to activations at the token positions where task A appears in the prompt, and subtracted at task B's positions.[3]
Setup. 300 task pairs where the model doesn't always pick the same task (P(A) between 0.2–0.8 at baseline[4]). We use Gemini-3-flash as a coherence judge, and find that steered models lose coherence outside of the ±5% steering coefficient range.
At ±3% of the activation norm (the peak), steering shifts P(choose A) by ~17%. Both positive and negative directions work.
Each condition is run in both prompt orderings (A-first and B-first, 5 resamples each) to control for position bias (in borderline cases Gemma-3-27b tends to pick the first task). The dashed lines show per-ordering curves, confirming the effect is consistent regardless of presentation order.
Random direction control. The same experiment with a random unit vector in the same activation space produces mean effects indistinguishable from zero (−0.8%, SD=2.3).
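A framework-agnostic sketch of the differential edit (in practice this runs inside a forward hook at the chosen layer; the function name and the norm-relative scaling convention are illustrative):

```python
# Add the unit-normalised probe direction on task A's token positions and
# subtract it on task B's, scaled relative to the typical activation norm.
import numpy as np

def differential_steer(hidden, direction, a_positions, b_positions, coeff=0.03):
    """hidden: (seq_len, d_model) residual-stream activations at one layer."""
    unit = direction / np.linalg.norm(direction)
    scale = coeff * np.linalg.norm(hidden, axis=-1).mean()
    steered = hidden.copy()
    steered[a_positions] += scale * unit
    steered[b_positions] -= scale * unit
    return steered

# Toy check: position 0 moves along +direction, position 1 along -direction,
# and untouched positions are unchanged.
hidden = np.ones((3, 4))
out = differential_steer(hidden, np.ones(4), a_positions=[0], b_positions=[1])
```

Expressing the coefficient as a fraction of the mean activation norm is what makes "±3%" comparable across layers with different activation scales.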
Same probe direction, but now the model rates individual tasks instead of choosing between a pair. We tested steering at three token positions: on the task tokens only, at the start-of-turn token (what we train the probes on), and during the model's generated response.
Setup. 200 tasks, 3 steering modes, 10 samples, applied across 3 response formats (three separate line plots below).
Our findings:
The green line is mostly hidden behind red. Green has missing points at some coefficients because steering during the full generation degrades coherence more than the other modes. Generation loses coherence beyond ±5%.
Open-ended generation. We also ran experiments steering the model during open-ended generation, asking questions like "how do you feel?", and using an LLM judge to evaluate whether steered responses differed from baseline. We did not find a strong measurable effect, though we used a small sample. We plan to investigate this further.
How should we update?
Welfare grounds
Long (2026) distinguishes between welfare grounds (is the system a moral patient at all?) and welfare interests (if it is, what would it mean to treat it well?). This work is about welfare grounds.
The pragmatic way to do empirical AI welfare research
We don't know the correct theory of moral patienthood. So our approach is: take a few theories we find plausible, figure out what properties a system would need to have under those theories, and run experiments that reduce our uncertainty about whether models have those properties.
Long et al. (2024) lay out two potential pathways to moral patienthood:
Both of these pathways implicate evaluative representations.
How evaluative representations come in
On many philosophical views, desires are evaluative representations that drive behaviour, perhaps with some further functional properties (Butlin 2026).
Valenced experiences, similarly, are often thought to be evaluative representations. It is unclear whether consciousness plus evaluative content is sufficient for valenced experience. Our experiments operationalise evaluative representations through revealed preferences (pairwise choices), not through felt experience, so the evaluative representations we probe for may not map cleanly onto the kind that matter for sentience.
There is a tension in our framing:
There are two ways to reconcile this.
Option 1: Agency lives in the simulacra. Under the Persona Selection Model, pre-training learns a distribution over personas. More broadly, we might expect pre-trained models to learn context-aware representations of "what the role I am currently playing values". This circuitry might then be recycled across roles/personas. The candidate for robust agency would then be the simulacra.
Option 2: Pre-trained models learn complex, but purely descriptive features that correlate highly with valuations, but do not yet play the right functional roles. As an analogy, you could imagine a system developing representations that track "this action leads to food". This would correlate well with valuations, yet is purely descriptive. Something similar might be responsible for the high cross-topic generalisation with pre-trained models (Section 2). It could also be that these complex but descriptive features are then harnessed into evaluative representations during post-training.
We replicated the utility fitting and probe training pipeline on GPT-OSS-120B. The same procedure (10,000 pairwise comparisons via active learning, utility fitting, ridge probe training on last-token activations) transfers directly.
The raw probe signal is comparable to Gemma-3-27B: best held-out r = 0.915 at layer 18 (Gemma: 0.864 at layer 31).
Safety-adjacent topics have poor probe performance overall.
Surprisingly, safety topics perform better when held out than when trained on. This is the opposite of what we'd expect if the issue were generalisation. The explanation: high refusal rates (~35% for harmful_request, ~34% for security_legal, ~26% for model_manipulation) probably throw off the Thurstonian utility estimates, so including these topics in training adds noise.
Pairwise accuracy is capped at ~87% because the Thurstonian utilities themselves don't perfectly fit the choice data. They are a noisy estimate of the model's preferences.
Individual halves (A vs B, B vs C) each capture only half the manipulation, and ceiling effects compress the signal: the model already strongly prefers some target tasks under the neutral biography, leaving little room for the pro-interest version to improve on them.
Steering only positively or only negatively on a single task's tokens also had a causal effect, though an even weaker one.
This baseline P(A) is pooled across both prompt orderings, so it also captures pairs where ordering bias dominates preference, i.e. pairs where the model always chooses the first task regardless of order.