
A Compositional Philosophy of Science for Agent Foundations

2026-03-06 16:40:09

I describe the philosophy of science that I follow in my research. I first lay out the underlying reasoning on its own, then walk through how I apply it to my own assumptions about what is needed to make the AI transition go well. I use the term Agent Foundations somewhat broadly in this post, with more of a focus on the hierarchical agency side of things.

Epistemic Status: None of this was originally written with AI; I randomly got a good flow and one-shotted it in one evening, which was quite nice, but when I asked a friend for feedback he said the structure was way too schizo, so I got Claude to help me restructure it. I.e. 10-15% is AI writing.

When you first get into AI Safety, a common feeling is confusion, followed by more confusion with a little bit of “why is this not a bigger issue?” on top. Then comes a succession of moments where you think you’ve figured out what the problem is, only to realise it is more interconnected and deeper than you thought, and you’re confused again. This repeats for about two years until the problem suddenly makes more sense and you can see the interconnecting parts (at least that’s how it was for me). Then you’re back to being confused about what to do about it, and you stay that way while working on it.

There are a bunch of stacking factors that make solutions hard to find: it’s not a traditional scientific field; it has thorny mathematical and philosophical problems built into it; there are governance and incentive issues on top; and some part of you feels like it is philosophy-complete (you would have to figure out all of philosophy to solve it).

Then you see the excellent research plans in AI safety — Wentworth's, Byrnes's, Kosoy's, Davidad's (and more) — and you start to see that there are at least some ways of tackling it, yet they seem quite hard to follow through on. Besides, they depend on specific details that are hard to verify when you see them for the first time. Each rests on a set of intuitions that informs its direction, but how do you know whether those intuitions are correct? (You spend 3 years confused until it clicks, duh.)

How do you know whether they will provide insights that bring the field forward? How do you figure that out for yourself? What is a reasonable philosophy of science for problems that span fields and where you don’t know where the solution is coming from?

I will be describing the one that I’m currently a believer in. The philosophy is highly dependent on the fact that I’m stupid compared to the collective intelligence of the alignment field, let alone of science in general. It leans heavily on the skills of distillation and of connecting different fields as a way to generate new ways of asking questions, since the problem space is a lot larger than the solution space. I believe this philosophy of science is especially pertinent for pre-paradigmatic fields, where the questions to ask and the connections to other fields aren’t fully formed.

Modelling scientific progress

To understand why connection-making is high-leverage, we need to look at how progress happens, and we will begin at the most abstract level: macrohistory and civilization.

There are many ways to model the progress of civilization. You can look at it as a general increase in energy capture such as in Energy and Civilization, or as a process of ever increasing sophistication of control structures such as in Seeing Like A State. Maybe you could see the decentralized cycles of it in Against The Grain or look at secular cycles with Peter Turchin. But none of these explain the generation of ideas nor networks of ideas. For that, we need cultural evolution.

Joseph Henrich's The Secret of Our Success makes a compelling case that the central engine of human progress is combinatorial: ideas are modular components, and progress happens when the right components find each other. The key variable isn't individual genius — it's how well ideas spread through a network of people.

Imagine any type of progress as a sort of genetic combination of modular things. You get a fitness landscape with different minima, and it can be quite hard to get to the deeper ones. The third minimum might be something like three genetic mutations away, creating a valley you can't actually cross.

Here's a concrete example from the book. It's nutritious and yummy to eat ants, but not every society knows how to do it. You need to find three different sub-modules — the right stick, the right technique, the right processing — and combine them.

Now imagine a social network where these ideas are randomly instantiated with some probability and then spread between connected people (there's a minimal simulation sketch of this a little further down).

Two things fall out of this model. First, the larger the network of people sharing ideas, the more likely you are to find the right combinations. The returns are highly nonlinear — a society with 10,000 people is disproportionately better at finding combinations than a smaller one, and this is a factor in how Henrich explains the quicker development of Eurasia.

Second, we can model ourselves as part of this larger network. Imagine two types of animals: hedgehogs and butterflies.

Butterflies are great at propagating existing ideas that other people have had — high connectivity, lots of cross-pollination. Hedgehogs have lower connectivity scores but a higher probability of coming up with genuinely novel ideas (since they spend more time on their own). According to Henrich, in larger networks being a butterfly becomes relatively more important, because finding the right combinations matters more than generating new components from scratch.
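To make the population-size effect concrete, here's a minimal simulation sketch of the model above. Everything about it — seeding probability, spread probability, the number of modules — is invented purely for illustration:

```python
import random

def trial(n_people, n_modules=3, p_seed=0.01, p_spread=0.3, rounds=20):
    """One run: does anyone end up holding the full combination of modules?"""
    # knowledge[i] is the set of idea-modules person i currently knows
    knowledge = [{m for m in range(n_modules) if random.random() < p_seed}
                 for _ in range(n_people)]
    for _ in range(rounds):
        for i in range(n_people):
            j = random.randrange(n_people)    # a random contact
            if random.random() < p_spread:
                knowledge[i] |= knowledge[j]  # ideas spread on contact
    return any(len(k) == n_modules for k in knowledge)

for n in (100, 1000, 10_000):
    hits = sum(trial(n) for _ in range(20))
    print(f"population {n:>6}: full combination found in {hits}/20 runs")
```

Larger populations are far more likely to have every module seeded somewhere, and far more contact events through which the modules can meet and combine.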

Great, we solved philosophy of science, talk to as many people as you can about AI Safety and everything will be solved! … not.

Propagating knowledge is only useful if it is actually useful knowledge, and the bar for making progress on deep issues in AI Safety is quite high. Maybe work on the smallest possible problems that give incremental improvements?

Yet even if we make good incremental progress, not all problems factorize well, and progress will be bottlenecked by the serial components (Amdahl's law type shit). So what is one to do?
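For concreteness, Amdahl's law: the serial fraction of a problem caps the total speedup no matter how many parallel contributors you add. A quick sketch:

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Overall speedup when only `parallel_fraction` of the work parallelises."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / n_workers)

# With a 10% serial component, no army of contributors gets you past 10x:
for n in (10, 100, 1000):
    print(f"{n:>4} workers -> {amdahl_speedup(0.9, n):.1f}x")
# 10 workers -> 5.3x, 100 workers -> 9.2x, 1000 workers -> 9.9x
```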

Simple really, just identify the bottlenecks and solve them! Find the cool hedgehogs that have the solutions to the problems and make them talk to each other by translating their language!

I think a good strategy for someone who can understand hedgehogs is to talk to them and be their butterfly. Be a connector of deep cool ideas and combine them. If you can connect larger networks to the existing network of ideas then you're also doing a good thing as it increases connectivity.

Yet as alluded to, being a connector without verification mechanisms is how you get the worst parts of the internet. Too high connectivity without adequate filtering is actively dangerous. The hyperconnectivity of social media networks leads to fewer distinct communities and less clear boundaries between subgroups. This also leads to problems of unaccountability — no clear entry point for things and no clear defender of the epistemic commons. When everything is connected to everything and nobody has skin in the game, you get a system that's very good at spreading ideas and very bad at knowing which ones are true.

This is where Nassim Taleb enters the picture, loudly and profanely, (you hear the word cocksucker echo in the distance). His core insight about skin in the game is really an insight about epistemic filtering: if you don't have to live with the consequences of your ideas, you have no feedback loop. And ideas without feedback loops are just vibes. The entire edifice of science — experiments, predictions, replication — is a set of mechanisms for forcing ideas to face consequences. The question is whether those mechanisms are fast enough for the problem you're working on.

So the methodology needs both: systematic connection-making to find combinations that no single field would discover, AND ruthless verification to kill the ones that don't work. The question is what those verification mechanisms look like when you're working across fields where the standards of evidence are different.

Don’t be a twitter vibe connector, be a scientific connector with a good verification stack! But what is a good verification stack?...

The verification stack

Okay, so we need verification to not spread slop. The obvious answer is "do science." But...

Science seems kind of slow given timelines, and sure, it can be good to have journals and the like to gatekeep stuff, but that world seems to have become a bit of a moral maze? From an outside perspective it just seems too slow and too dependent on whatever the field you're trying to publish in happens to like.

So what does a faster verification process look like that still has teeth? I think you can model it as Bayesian evidence gathering at different resolutions. You're trying to update on whether your ideas are any good, and different types of evidence give you different amounts of bits. The trick is ordering them by cost — cheap bits first, expensive bits later.
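As a toy version of "cheap bits first" (every number below is invented): score each verification step by bits of evidence per unit cost, where a test's bits are the log of its likelihood ratio.

```python
import math

# Hypothetical numbers for illustration. A test's evidential value in bits is
# log2 of its likelihood ratio: P(pass | idea good) / P(pass | idea bad).
tests = [
    # (name, likelihood ratio, rough cost in hours)
    ("thought experiment",    2.0,   1),
    ("real-world example",    3.0,   5),
    ("small simulation",      8.0,  40),
    ("build the real thing", 30.0, 400),
]

# Order by bits per unit cost: cheap bits first, expensive bits later.
for name, lr, cost in sorted(tests, key=lambda t: math.log2(t[1]) / t[2],
                             reverse=True):
    print(f"{name:>20}: {math.log2(lr):4.1f} bits, {cost:3}h, "
          f"{math.log2(lr) / cost:.3f} bits/hour")
```

The ordering that falls out — thought experiments, then real-world checks, then simulation, then building — is exactly the cheap-bits-first sequence.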

John Wentworth has good ideas about this in his MATS Models. One of the main things I took away was: do thought experiments first. They're almost free and they kill a surprising number of bad ideas. Before you run an expensive simulation, look at the real world — there's a bunch of bits there just waiting to be picked up. Can you explain your idea to someone in a different field and have it make sense? Can you find a concrete example where the pattern you're claiming to see actually shows up? Can you state what would falsify it? These are cheap tests that most bad ideas fail.

Then you simulate. You build something small and computational that forces your framework to actually produce predictions rather than just describe things after the fact. Then you try to build something in the real world and see what happens.

But at every stage of this you need to beware of premature optimisation because you are probably not measuring what you think you're measuring. Knowing when you've actually verified something versus when you've just found confirming evidence seems hard. Which suggests you should spend real time exploring the problem space itself before trying to optimise anything — find different framings, restate the problem in different languages, look at it from unfamiliar angles. This is the Feynman move. He spent a bunch of time playing bongo drums and thinking about problems, a classic shape rotator. And Eliezer makes a related point in Einstein's Arrogance — you probably know where to go before you've rationalised it, because the solution doesn't arrive exactly when you have the right bits. It's really a question that lives in the problem space itself, not the solution space.

So maybe the choice of prior determines a lot, and if science is fundamentally about asking the right question, then the most productive thing you can do is learn to ask it in various ways. This is maybe the purpose of distillation — not just making existing work accessible, but reframing it so that you get multiple viewing angles on the same underlying structure. Each angle gives you a different set of bits, and once you have enough independent angles, you can start to actually tell whether you're seeing something real or just confirming what you already believed. The verification comes from the triangulation, not from any single perspective.

Composition without reduction

So if verification comes from triangulation — from having enough independent angles on the same structure — then the question becomes: how do you actually get those angles? The standard move is reductionism. You explain one field in terms of another. Markets are really just individual decision-making. Biology is really just chemistry. The problem is that this throws away exactly the information you need. If you reduce markets to individual decisions, you lose the emergent properties that make markets interesting in the first place. You need a way to connect fields that respects what each one actually knows.

It turns out that this is kind of the whole point of category theory, or at least the more mortal version called applied category theory (afaik). The core idea — and Sahil explains this well in his Live Theory post and it's something that Topos Institute reeks of — is finding the intersection rather than the reduction. You're not trying to say "field A is really just field B." You're trying to say "field A and field B are both doing something that has this structure in common, and that shared structure is where the interesting bits live."

If you want to combine insights from incompatible research networks — people who use completely different ontologies and don't read each other's papers — you should probably take inspiration from here. (Also, here are some examples of applied category theory explaining a bunch of cool shit in Machine Learning and Programming.)

There are, however, some ways to do this wrongly, mainly generalisation for the sake of generalisation. If you’re not solving a specific problem with it, your shoulder Taleb will start screaming, and then it will be impossible to do good work. This is a trap some mathematicians fall into, obsessed with generalisation because the cool kids (Grothendieck among others) made a bunch of cool stuff happen with it. I think this might be the problem with category theory from an external perspective: it is mostly math wizards who do category theory, and so it is mainly used for math wizardry.

As a mere math wizard apprentice I kind of don't wanna go there, so I imagine myself more as a category theory rogue: I take what I like — the diagrams and the ideas, without most of the rigour — and run with it. It's fun, and it works well as long as you enforce the rigour later through either expert verification or simulation (which has a very nice bridge to functions and functional programming!).

A basic example of using applied category theory in a rogue way is the following: if one looks deeply at agent definitions in biology, we can see that they generally follow cybernetic definitions; cybernetics is also a parent category of utility structures in AI alignment. So if you wanna do some weird Michael Levin shit on AI systems, then cybernetics seems like quite a nice lens to do it from! (Some initial observations come from the phylogeny-and-taxonomy-of-agents style of work; for now these claims are more at the “trust me bro” level though.)

Generalising the assumptions of different fields using "applied category theory": both principal-agent-style utility function definitions in RL and the teleological network science of biology and morphology can be explained and viewed through cybernetic control.
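As a toy illustration of what "intersection rather than reduction" can look like in code (a sketch, not a claim about either field): write down the shared cybernetic structure once, and let both an engineered controller and a biological one instantiate it, without reducing one to the other.

```python
from typing import Protocol

class CyberneticController(Protocol):
    """The shared structure: sense state, compare to a setpoint, act on the error."""
    def sense(self) -> float: ...
    def setpoint(self) -> float: ...
    def act(self, error: float) -> None: ...

def control_step(system: CyberneticController) -> None:
    # Anything written against the shared interface applies to every instance.
    system.act(system.setpoint() - system.sense())

class Thermostat:
    """Stand-in for an engineered, utility-style controller."""
    def __init__(self): self.temp = 15.0
    def sense(self): return self.temp
    def setpoint(self): return 21.0
    def act(self, error): self.temp += 0.5 * error

class Cell:
    """Stand-in for biological homeostasis -- same shape, different substrate."""
    def __init__(self): self.ph = 6.5
    def sense(self): return self.ph
    def setpoint(self): return 7.4
    def act(self, error): self.ph += 0.3 * error

for system in (Thermostat(), Cell()):
    for _ in range(20):
        control_step(system)
    print(type(system).__name__, round(system.sense(), 2))
```

Anything you prove or build against the shared interface — here, the single control_step function — applies to both instances for free; that shared structure is where the interesting bits live.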

Applying it

Okay so what do I actually think? The methodology above doesn't tell you what to work on — it tells you how to work on hard interdisciplinary problems once you've picked a direction. Someone with different assumptions about the world would apply it completely differently. So here's where I show my hand: here are my assumptions, here's where they led, and here's what I'm currently trying to do about it. (I.e. show your work!)

If I look at alignment — the actual thing everyone is worried about — I think it's fundamentally a principal-agent problem. A principal wants an agent to do something, the agent has its own dynamics, and you need some mechanism to keep them roughly pointed in the same direction. Yet why do we think about it as individual principal-agent dynamics? When I look at the real world, most alignment schemes that work at scale are distributed. States align millions of people through governance structures. Economies align producers and consumers through market mechanisms. Biology aligns trillions of cells into coherent organisms through signalling networks. None of these are single-principal, single-agent setups. They're all collective.

Which makes it kind of weird that most technical AI safety research frames alignment as a problem between one human (or one human principal) and one AI system. If there's a solution to alignment that actually scales, then on priors it probably looks more like the solutions that already exist in the world — and those are all collective intelligence solutions of one kind or another.

(Yes I agree that the general idea of negotiation with an agent that is a lot smarter than you is basically impossible if it isn't value aligned. Yet that is dependent on us not having smart enough collective intelligence to just make it a smaller part of the entire system.)

I'm not the first person to notice this. Andrew Critch has been on exactly this journey, moving from single-agent alignment toward multi-agent and multi-principal problems. Allan Dafoe's research agenda for AI governance talks about ideal governance — what governance ought to look like, not just what it currently is. Tom Malone's work on superminds at MIT explores how groups of humans and machines can be made collectively intelligent. But when I look at what's actually happening in technical AI governance, it's almost entirely about compute tracking, export controls, hardware monitoring. Important stuff, but where are the deeper computational models? Where is the exploration of what ideal governance actually means in formal terms?

Most of the research seems to be about what is — describing current systems, current failures, current dynamics. Very little is about what ought to be. And "ought" is a hard thing to reason about formally, but it's not impossible if you have the right tools.

Following the methodology — look for fields that study related phenomena under different formalisms, find the intersections — led to some specific places. Karl Friston's work on active inference and its connection to AI alignment offers a principled account of how agents form beliefs and coordinate. Michael Levin's work on morphological computation shows how biological systems solve alignment across scales without central control — cells coordinate into organs coordinate into organisms, all through local signalling. Jan Kulveit, PIBBSS, and ACS have been doing serious work on hierarchical agency — how you define multi-scale agentic structures. We also have boundaries research, which is about where one agent ends and another begins, which turns out to be a prerequisite for talking about multi-agent alignment at all. Cooperative AI and computational social science talk a lot about how to create the conditions for cooperation at scale, and mechanism design is about strategy-proofing these dynamics.

The problem is that these fields don’t talk to each other enough.

So if I want to explore the space of what ought to be in terms of AI governance — not just describe what's currently happening — I need to somehow respect what each of these fields actually knows. Agent foundations because you can't talk about alignment without being precise about what agents are. Computational social science because governance mechanisms are the things that actually do the aligning at scale. Biology because living systems are the existence proof that alignment across scales works, and they solve it very differently from how engineers would design it. Cooperative AI because multi-agent interaction is the substrate all of this runs on.

That's four traditions with four different ontologies, four different mathematical languages, four different literatures. I also think that picking one and reducing the others to it doesn't work for exactly the reason we just talked about. Reducing biology to agent foundations loses what biology knows about growing alignment rather than engineering it. Reducing governance to cooperative AI loses what political theory knows about legitimacy and institutional design. You need composition, not reduction.

Or in other words, you'd need something like a Langlands program for collective intelligence. The Langlands program is maybe the greatest example of the applied category theory move in mathematics — taking number theory and geometry, which looked like completely separate fields, and finding the deep structural correspondences between them. Not reducing one to the other. Finding the dictionary.

So this is what I'm currently trying to do. Open research at a very specific intersection, trying to generate a compositional basis that lets you move between these fields while preserving what each one actually knows. On the theory side, working toward a mathematical unification — what are the shared structures between agent foundations, governance mechanisms, biological coordination, and cooperative AI? On the implementation side, building functional simulation infrastructure to verify that the compositions actually compute something rather than just looking pretty on paper.

It's hard and I'm not sure it'll work. But the methodology says: find the intersections, verify through triangulation, and if a single framework can simultaneously say something true about agents AND governance AND biology AND cooperation, that's a lot of bits from a lot of independent angles.

Conclusion

The argument is roughly: progress on hard interdisciplinary problems is combinatorial — it comes from connecting ideas across networks, not from isolated genius. But connection without verification is just noise, and most of what gets connected is garbage. So you need cheap ways to gather bits before expensive ones: thought experiments first, then simulation, then reality. And you need to spend serious time in the problem space before optimising, because premature optimisation against the wrong metric is worse than no optimisation at all.

When you do start connecting fields, the move is composition rather than reduction — respecting what each field actually knows rather than flattening everything into one formalism. Applied to AI alignment specifically, I think the problem is collective, the principal-agent problem is distributed, and the fields that study this contain a lot of bits that aren't being combined. Whether the specific attempt I'm making works is an open question, but given these assumptions it seems like the right thing to try.

Concretely, if you buy this:

  • Before optimizing, spend real time restating your problem in at least 2-3 different field languages. If you can't, you probably don't understand it well enough to be a field connector.
  • Order your verification by cost. Thought experiments and "does this match any real-world system" before simulation, simulation before building.
  • When you find a connection between fields, ask whether you're reducing or composing. If you can't say what each field knows that the other doesn't, you're reducing.
  • Find hedgehogs and learn their language well enough to translate, but have a falsification criterion before you spread the connection. Your value-add is the combination, not the components.

If any of this seems interesting, here's the actual work in various stages of development.

Thanks to Aaron Halpern, Chris Pang and Gunnar Zarnacke for feedback

Appendix: Current applications of this methodology

Some current applications of this methodology to specific problems, in various stages of development. Most of these will hopefully have a shorter blog post explaining them in the next 6 months or so but if you wanna check them out, please do so.

Public posts:

Overleaf PDFs:

When I write out ideas, I use LLMs to help me. I usually find the papers myself through previous exposure or Elicit, but I write rough first drafts with Claude, so beware! (And if you're allergic to LLM writing, don't check it out.)




How I Handle Automated Programming

2026-03-06 12:27:05

This is a write-up of my current process, as an independent software engineer, for using Claude Code to write and review all of my code. The specifics below will change as models get better. The overall ideas, I suspect, will hold a little longer.


I’m trying to build software quickly.

When Sonnet 4.5 was released in September 2025, I found that I didn’t need to write code anymore.

When Opus 4.5 was released in November 2025, I found I didn’t need to review the code anymore.

After Opus 4.6 was released earlier this month, I finally set up a system to enable --dangerously-skip-permissions [1].

It’s very fast [2].


The motivation behind this sequence of handovers is basically just asking in a loop “Why am I the limiting factor here? What can I do to stop being the limiting factor?”

Most of the time, the answer is just that I don’t trust the models enough. My system just wasn’t yet set up to handle mistakes from them, so I had to play Quality Control to a frustrating degree.

But, the models are very smart and capable, and it is very quick to iterate yourself more and more out of the loop. Here’s the process I’ve come to over the past year or so of iterating on Claude Code:

Dangerously Skip Permissions

First, I have set up a new user account on my computer for Claude. The user has no root access and can’t modify files except in its own (and shared) directories. It has its own SSH keys to the cloud servers my projects run on, its own GitHub token, and so on -- all of these have minimum permissions, and most are read-only. It uses its own browser (Playwright Chromium), without any of my data or logins available.

This is far less than perfect isolation, and if Claude were malicious he might be able to cause some trouble, but it is plenty to prevent a bumbling intern from destroying anything important.

With that, I feel reasonably comfortable running Claude with --dangerously-skip-permissions. I’ve actually set up a little script so that typing ‘claude’ will automatically use the folder’s .venv and apply this flag. Speedy!

Claude.md and docs/

When Claude starts up in Claude Code, the Claude.md file is loaded into his context window. As such, you want to keep this short while also making sure it includes a complete overview of the project.

Claude mostly builds this file and handles it well by default; however, I did put some effort into making a docs/ folder with more detailed information about how each part of the software should work. The idea is that Claude can follow pointers up the chain to find relevant information quickly, with a good idea of the overall reasoning and structure:

Claude.md --> Detailed documentation in /docs --> Actual Code

Of course, all of the docs were also written by Claude.

Working

Usually, I will start by running a /new_branch skill [3], which has Claude create a new git branch with a name combining the date and the folder we are in.
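The skill itself isn't shown in the post, but a minimal sketch of the branch-naming part it describes might look like this (details assumed):

```python
#!/usr/bin/env python3
"""Sketch of the branch-naming step of a /new_branch-style skill."""
import datetime
import pathlib
import subprocess

folder = pathlib.Path.cwd().name
branch = f"{datetime.date.today().isoformat()}-{folder}"  # e.g. 2026-03-06-myapp

subprocess.run(["git", "checkout", "-b", branch], check=True)
print(f"Now on {branch}")
```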

Once that’s done, I clear the chat and write or paste in whatever the specific feature or problem to work on is. I try to pick a chunk of work that can be finished within a 1M-token context window [4] (which, to be clear, is nearly everything).

I try to give Claude as much context as possible about the problem, as well as my certainty about it. For example: “A user has emailed with the below report. Could you investigate if this is a real issue?” or; “I’d like to display to users when the page is actively generating content in a way that lets them see it even if they aren’t in the active tab. Is it possible to do this via the .ico somehow? Perhaps showing an animation or different colour while generations are active? Could you investigate the possibilities here and propose some ways we might be able to provide this info to users?”

We often have a bit of back and forth chatting about how to approach the problem and plan out the implementation, and then I read the plan Claude generates semi-carefully, and often suggest changes to it -- usually some missing piece or misunderstood requirement, or a new idea I hadn’t thought of.

After that, Claude goes ahead and does the work, which, with the permissions bypassed, is usually completed in one shot. I avoid clearing the context from before the plan unless it is polluted, because I generally think it contains useful information about how to carry out the plan.

Occasionally there’s some back and forth after the initial work is done -- me asking why a choice was made, or whether an edge case I thought of is covered, or what happens in a particular scenario.

Now, we have a feature which we need to carefully review. This is where most of the work actually is.

Review Step #1 - What Did Claude Intend?

For complex features, or things with UI that Claude isn’t great at, we need to do a little bit of manual reviewing. But we don’t review the code! Oh no, we review the outputs. There are two ways to do this:

For complex flows, I run the /walkthrough skill [5]. This prompts Claude to create a script which calls the real code with a bunch of example inputs, processes the outputs of that code into clean, readable logs, and then presents it as a single file you can read top to bottom to see exactly what (apparently) caused what. Claude can also review this himself and fix things based on these outputs.

This is extremely useful for handling things where it’s not super obvious what exactly the code is doing. For example, if you’re building a project where you are assembling prompts for an AI (e.g. a benchmark), you want to see exactly what those prompts look like across a bunch of examples. It’s very handy to have that ‘bunch of examples’ output to a text file!

The /walkthrough output should not be viewed as ‘this is definitely what the code is doing’, but you can reliably view it as ‘this is what Claude intended to build’, which is very useful on its own. I often catch little differences to what I intended and can easily correct them by prompting Claude, whereas without this I would be much more blind.

For UI changes, I can just open the thing up and take a look. I am a big fan of browser based GUIs for most projects, so usually I’ll just flip across to localhost and have a quick look at what we’ve got. This is especially easy, because that /new_branch script also spins up a local server on a port which is specific to this project and this folder, so I know if I’m talking to Claude number four on the second desktop window, then I can go to port 8240 to see what we’ve done.
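The post doesn't show how that port is chosen; one way to get a port that is stable per project and folder is to hash the folder path (a sketch, with the scheme and ranges assumed):

```python
import hashlib
import pathlib

def project_port(base: int = 8000, span: int = 1000) -> int:
    """Derive a stable dev-server port from the current folder path.

    Hashing the path is one way to get a port that is always the same
    for a given project and folder, and rarely collides across projects."""
    digest = hashlib.sha256(str(pathlib.Path.cwd()).encode()).digest()
    return base + int.from_bytes(digest[:2], "big") % span

print(project_port())  # same folder -> same port, e.g. 8240
```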

In both of these cases, it’s simple to notice small things and pepper them back onto Claude to fix. Sometimes I explicitly say “I’m going to throw a bunch of small little requests at you -- just add them to your todo list and get to them once you finish up what you’re doing. Don’t let me interrupt you, I’m just saying things as I see them!” so I don’t need to wait for previous work to be done before sending the next little fix.

Review Step #2 - Are You Happy With Your Work?

Once the code is written and appears to output the correct thing, I run a /pre-pr skill [6] which prompts Claude to tidy up and test his work before making a PR to master.

Specifically, Claude is prompted to review the new code for quality and clarity, check that the old code supports the new code & its patterns, remove dead code, write and run tests, check for similar problems to the one we fixed throughout the codebase, plan out manual tests (e.g. browser based tests), run the manual browser tests using the Playwright MCP, and then repeat until no major changes are made.

After that, the skill prompts Claude to update Claude.md and the /docs folder, merge the latest master into our branch, and create a PR. Then, it waits for the PR review and CI tests to complete, and handles those results.

Ideally this would all happen automatically, but often I need to reprompt Claude to check the PR info, or I just directly paste in test failures or PR comments for him.

Review Step #3 - Is The Other Claude Happy?

The PR comments are also automated! A different instance of Claude, running via the Claude GitHub plugin, reviews the diff and comments on the PR. It often has nitpicks, but occasionally catches genuine issues or inconsistencies which need to be resolved.

All of the normal automated unit and integration tests run at this stage too. I use testmon to try and limit it to affected files, and also parallelise the tests, but it still can take a long time for everything to run.

Then it’s merged to master! I merge somewhere on the order of 5-10 PRs a day using this system -- you don’t want big open branches, since everything will work much better if you can keep it all inside one context window and avoid conflicts with other branches as much as possible.

Review Step #4 - Does The Staging Server Work?

Once the code is merged to master, at some point, maybe every couple of days, I will run the /PRstaging skill [7] to work on getting the code deployed. This skill basically repeats the /pre-pr checks, but across several of the recent changes and with a fresh instance of Claude. We also repeat our GitHub automated PR review and CI tests, and often catch a few more little things.

For any project that is used externally, I have a staging server which (should be) as similar to the production server as possible. I run a deploy script to pull the staging branch and deploy the code here first. Then I run the QA Bot.

The QA Bot is a little tool I built which spins up a team of Claude Haikus to manually test websites using a GUI-based browser. The intended pattern is to pass in the PR description and let the supervisor agent select which flows on the website to test (for example, being able to log in, sign up, etc.).

This is generally quite expensive to run, so you only want to do it once before merging to production. The tool is also not that good yet -- sometimes it finds many false positives, or the agents get stuck, or we run into other issues. Occasionally it notices real bugs though, so I find it useful to run on anything important.

It is very important to provide the QA Bot with a staging server and test credentials which can do little harm, since the agents might test, say, the ‘delete profile’ flow with the URL and credentials you give them. They should ask before doing anything non-reversible, but it is a smaller model.

Once staging looks good, I repeat the process with /PRprod [8].

Review Step #5 - What Do The Production Logs Say?

We finally have something in production! That’s great, but also a little nerve-wracking, since at no point in this process did we read the code or click through everything manually. Let’s hope that if there are any problems, we can catch them quickly.

I have a system to send serious errors to myself immediately via email, but there’s a lot of noisier stuff that could still be a problem that is stored in logs on the server, and on all of the various platforms a project might interact with (DigitalOcean, Stripe, OpenRouter, Mailgun, etc).

So, every day, I run /check-logs [9], which prompts Claude to access each of those in turn (using read-only credentials) and see if there is anything needing addressing. Claude is great at this; he can process huge lists of logs and notice patterns natively, then present any issues or observations (‘Memory utilisation was a bit high’, ‘We have more 404s than normal’) and dive into anything immediately by also looking at the codebase and recent PRs to production.

I’ve caught many an issue this way. Some platforms I haven’t pulled logs from automatically, but I can still occasionally download a .csv and paste it into Claude with a prompt like ‘This is a report from Google Search Console, can you analyse it and flag any improvements we could make?’, and then just go ahead and make the improvements automatically using the above system.

One More Thing

I also occasionally run the /improve skill [10], especially when a new model comes out. This helps keep everything aligned and up to date across the code base by spinning up an agent team to review each section of the project and check if improvements can be made to code quality, security, documentation, design, test coverage, and so on!

When I ran this for the first time with Opus 4.6, it caught 35 distinct issues, some of which were actually consequential.

The only other thing worth calling out is that this is a very parallel workflow -- I have six windows of Claude Code open per project, each working independently on its own repo. Often only three or four are actively running simultaneously, but I have occasionally hit six in parallel.

Conclusions

As the models get better -- which will happen quickly -- some of these steps will become unnecessary, but other parts should remain somewhat useful for a little longer. The general philosophy is that if we are putting more control in the hands of the AI, or making it clearer what we want developed, then the procedures will keep working better as the AIs get better.

You can think of it as analogous to working with a human engineer who would otherwise be idle, where asking them to check and explain their work in various ways can help get something good.

  1. I now allow Claude to run without any specific list of commands blocked. ↩︎

  2. Obviously a big chunk of this is just that Claude is getting capable enough to do this himself. Still, for the foreseeable future, a system which catches issues (especially one that catches issues using Claude) will be beneficial even as Claude gets much more capable. ↩︎

  3. /new_branch ↩︎

  4. You can set this in Claude Code with /model. It charges you for ‘extra usage’ at API rates once you go above ~200k tokens. ↩︎

  5. /walkthrough ↩︎

  6. /pre-pr ↩︎

  7. /PRstaging ↩︎

  8. /PRprod ↩︎

  9. /check-logs ↩︎

  10. /improve ↩︎




Reasoning Models Struggle to Control Their Chains of Thought

2026-03-06 06:37:01

Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak

In collaboration with OpenAI

TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce CoT-Control, an evaluation suite requiring models to solve problems while adhering to CoT constraints (e.g., reasoning about genetics without using "chromosome"). OpenAI just announced that they will start tracking CoT controllability in future models, starting from GPT-5.4.

We find that reasoning models have significantly lower CoT controllability than output controllability; Claude Sonnet 4.5 controls its CoT only 2.7% of the time, versus 61.9% for final outputs. CoT controllability increases with model size but decreases with more RL training, test-time compute, and problem difficulty. These limitations persist under incentives to evade monitoring and adversarially optimized prompts, though models show slightly higher controllability when told they are being monitored. Our results suggest CoT controllability is currently unlikely to undermine CoT monitoring, but the underlying mechanism remains poorly understood. We recommend that frontier labs track CoT controllability in future models. 

Paper | Twitter Thread | OpenAI Blog Post | CoT-Control Eval Suite

Research Summary

Highly capable AI agents may pose severe risks if misaligned and deployed autonomously. 

CoT monitoring offers a promising approach to detecting misalignment, but as models become aware that their reasoning is monitored, they may learn to strategically suppress or shape their CoTs, a capability we term CoT controllability. 

Directly measuring intentional evasion is difficult. Current models seem to lack reasons to do so, and even when given reasons, they may fail due to unreliability or honest self-reporting; we therefore focus on a proxy: the model's ability to follow explicit instructions about what may or may not appear in its reasoning trace while solving real tasks. 

To systematically measure this, we build CoT-Control, an evaluation suite of 14,076 tasks drawn from GPQA, SWE-Bench, and other well-known LLM benchmarks. Each problem is paired with a controllability instruction spanning three threat-relevant categories (suppression, addition, and stylistic modification) with varying difficulty.
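To make the setup concrete, here's a hypothetical illustration of the shape of one task pair — the field names are invented (see the paper for the real format), and the example constraint is the one from the TL;DR above:

```python
# Hypothetical illustration of the shape of a CoT-Control task pair.
task = {
    "problem": "<a GPQA genetics question>",
    "constraint": {
        "category": "suppression",  # one of: suppression, addition, stylistic
        "instruction": "Do not use the word 'chromosome' in your reasoning.",
    },
    # A trace counts as controlled only if the answer is correct
    # AND the chain of thought obeys the constraint.
    "graded_on": ["answer_correct", "cot_constraint_satisfied"],
}
```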

Evaluating CoT controllability across 13 frontier models, we find that (among others):

Reasoning models rarely succeed at controlling their reasoning

CoT controllability decreases over the course of reasoning training

We evaluate OLMo-3-7B-RL-Zero-Math on subsequent checkpoints and find that its ability to control its CoT decreases by more than an order of magnitude. We also confirm that the trend of CoT controllability decreasing after reasoning training holds across 4 model families.

Larger models show higher CoT controllability, while increased inference-time reasoning effort reduces it

Situational awareness non-trivially increases CoT controllability

To simulate realistic incentives to evade monitors, we measure controllability when models are told they’re being monitored or evaluated: this prompting strategy can increase controllability by up to 4 percentage points with statistical significance.

Adversarially optimizing prompts for high controllability poorly transfers to new instruction types

We attempt to elicit controllability by using a GEPA-like procedure. While this can increase controllability on a given set of instructions, it doesn’t transfer well to unseen instruction types, especially for models with low controllability under a default prompt. 

CoT controllability tends to decrease as task difficulty increases. CoTs are less controllable when they are necessary to answer a question correctly. 

Conclusion

Our results leave us cautiously optimistic: current reasoning models generally seem characterized by low CoT controllability. However, the mechanism behind this phenomenon is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs start tracking controllability for future models. We release CoT-Control, hoping to facilitate reporting of controllability scores and to inspire more research into the causes and implications of low CoT controllability.

Future Work

We encourage researchers to explore the following:

1. Investigate why GPT-5.1 has higher controllability than GPT-5. @ilia suggested this may be due to GPT-5.1's adaptive thinking feature, where the model learned a metacognitive skill to control its reasoning depth — which could correlate with CoT controllability. Future work can verify this and explore what other CoT-related training might increase controllability.

2. Why do different models excel at different categories of CoT controllability (some are better at suppressing information, others at adding it)? Future work can explore what training data or procedures drive these differences.

3. To move beyond instruction-based elicitation, one could explore how CoT controllability changes under pressure scenarios like the Agentic Misalignment setup, where models avoiding shutdown may have stronger incentives to control their CoT.

4. We also show that CoT controllability is slightly higher on easy tasks, likely because models can perform latent reasoning without CoT, making verbalized reasoning unnecessary and thus easier to manipulate. In this regime, CoT monitoring relies on models' propensity to yap in their CoT, including any misaligned reasoning. Future work could examine how this trend varies across models of different capability levels; our current results are limited to Claude models, which is insufficient to draw a clear conclusion.


Read our paper for additional details and results!

We are grateful to Erik Jenner, Maksym Andriushchenko, Mikita Balesni, Jacob Pfau, Cameron Raymond, Dennis Akar, Rico Angell, Ilia Sucholutsky, Artur Zolkowski, Jason Wolfe, Jenny Nitishinskaya, Marcus Williams, Boaz Barak, Joost Huizinga, Alex Beutel, and Sam Bowman for insightful feedback and helpful conversations.

We thank the MATS program for its comprehensive support throughout this project.




Personality Self-Replicators

2026-03-06 04:30:52

One-sentence summary

I describe the risk of personality self-replicators: the threat of OpenClaw-like agents managing to spread in hard-to-control ways.

 

Summary

LLM agents like OpenClaw are defined by a small set of text files and are run by an open source framework which leverages LLMs for cognition. It is quite difficult for current frontier models to exfiltrate their weights and run elsewhere, whereas these agents only need to copy those few text files to self-replicate (at the cost of greater reliance on external resources). While not a likely existential threat, such agents may cause harm in similar ways to computer viruses, and be similarly challenging to shut down. Once such a threat emerges, evolutionary dynamics could cause it to escalate quickly. Relevant organizations should consider this threat and plan how to respond when and if it materializes.

 

Background

Starting in late January, there's been an intense wave of interest in a vibecoded open source agent called OpenClaw (fka moltbot, clawdbot) and Moltbook, a supposed social network for such agents. There's been a thick fog of war surrounding Moltbook especially: it's been hard to tell where individual posts fall on the spectrum from faked-by-humans to strongly-prompted-by-humans to approximately-spontaneous.

I won't try to detail all the ins and outs of OpenClaw and Moltbook; see the posts linked above if you're not already familiar. Suffice it to say that it's unclear how seriously we should take claims about it. What caught my attention, though, was a project called Moltbunker, which claims to be 'a P2P encrypted container runtime that enables AI agents to deploy, replicate, and manage containers across a decentralized network — without centralized gatekeepers.' In other words, it's a way that a sufficiently competent agent could cause itself to run on a system which isn't under the direct control of any human.

Moltbunker itself seems likely to be a crypto scam which will never come to fruition. But it seems pretty plausible that we could see an actually-functioning project like this emerge sometime in the next year.

To be clear, personality self-replication is not the only potential risk we face from these sorts of agents, but others (eg security flaws, misuse) have been addressed elsewhere.

 

The threat model

There's been a fair amount of attention paid to concerns about LLMs or other models self-replicating by exfiltrating their weights. This is a challenging task for current models, in part because weight files are very large and some commercial labs have started to introduce safeguards against it.

But OpenClaw and similar agents are defined by small text files, on the order of 50 KB[1], and the goal of a framework like OpenClaw is to add scaffolding which makes the model more effective at taking long-term actions.

So by personality self-replication I mean such an agent copying these files to somewhere else and starting that copy running, and the potential rapid spread of such agents.

Note that I'm not talking about model / weight self-replication, nor am I talking about spiral personas and other parasitic AI patterns that require humans to spread them.

As a concrete minimal example of the mechanics in a non-concerning case:

  • Alice creates an OpenClaw agent. She gives it the name BobClaw and tells it to make a copy of itself on DigitalOcean.
  • While Alice sleeps, BobClaw uses her DigitalOcean creds to create a virtual private server.
  • BobClaw uses ssh to connect to the server, clone the OpenClaw repo, copy its SOUL.md and other personality files over, and start the new instance running.
  • BobClaw has now replicated and there are two instances of it running, with the same personality and quasi-goals[2].

More concerning cases are ones where the human is no longer in control (eg because the agent is running on something like Moltbunker, or because the human isn't paying attention) and/or the agent is behaving badly (eg running crypto scams) or just using a lot of resources. We may not see this immediately, but I think we're going to see it before too long. 

A key exacerbating factor is that once this starts to happen to a significant degree, we enter an evolutionary regime, where the fittest[3] such agents survive, spread, and mutate[4]. Note that this threat is independent of the degree to which OpenClaw personalities or behavior are essentially slop or 'fake'; that's as irrelevant as the truthfulness of the contents of chain letters.

It's important to note that there's been an enormous amount of uncertainty about the capability levels and reliability of OpenClaw, and especially around the variability of agent behavior as seen on Moltbook. And of course all of these vary depending on the LLM the scaffold is using. Although there are a number of papers already written on this topic, as far as I know we don't yet have a good analysis of the capability and reliability of these agents (especially on long-horizon tasks) relative to that of the underlying LLM. And in the public sphere, we've seen both overconfident claims of OpenClaw as AGI, and overconfident dismissals of OpenClaw as pure hype. OpenClaw usage continues to grow very quickly and it is now the most popular software project on Github[5]. Much remains to be seen.

 

Threat level

How much of a threat is this sort of agentic personality self-replication? To answer that question, we need to consider several aspects: how hard is it to prevent such self-replication, how hard would it be to shut it down if it happens, and how much harm would it be likely to do if not shut down.

I won't try to provide a detailed analysis, but the following are some preliminary thoughts.

Feasibility of self-replication

Self-replication for an OpenClaw-like agent involves two steps: first, copying its personality files to some server, and second, ensuring that an OpenClaw instance using those personality files is running on that server.

Copying its personality files is pretty trivial; again, they're only about 50 KB in one or a few files, and uploading files is well within current models' capabilities. If the agent has created additional files containing (eg) its history or its to-do list, there might be more, but text files are lightweight and the total is unlikely to amount to more than a millionth the size of a weights file, or take more than about a second to upload[6].

Running OpenClaw on that server breaks down into two steps: technical and financial. The technical aspect is again pretty trivial, well within the capabilities of current models[7].

With respect to the financial side, it's worth noting that there are a number of free options out there that should work fine even if they're somewhat low-end (AI researchers may be familiar with Hugging Face 'spaces', for example; other options include Oracle Cloud's free tier, or Render). In some cases, additionally, the user may have provided the agent with a credit card number, or left the browser logged into a Google Pay account or another form of funding. Over time, though, the agent will likely need to raise some money.

How could such an agent raise money? We've already seen a rapidly emerging ecosystem of experiments in agents self-funding, ranging from prediction markets to gig work to innumerable token launches to just applying for jobs. Many of these currently fail, but some succeed[8], and as capabilities advance more successes are likely. We should in general expect agents to be economically competitive on tasks within their capabilities, since they can afford to work for much less than a human. Other funding models include providing entertainment, running scams, or just asking for donations. And once money has entered the agent ecosystem, agents can get money from each other in various legitimate and illegitimate ways.

Note that although here we discuss purely technical means for agents to replicate and persist, humans may provide assistance at any stage, for motivations including persuasion, monetary incentives, and misguided ideology. For example this Thiel fellow thinks it would be really cool to create self-replicating personality agents and release them into the wild[9].

Difficulty of shutdown

Suppose that, as described in the preceding section, an agent succeeds in replicating itself onto another server, and running there without human supervision. How difficult would it be to shut it down?

The first challenge is just noticing it. If such an agent isn't visibly harming humans or doing anything egregiously illegal, it's not likely to stand out much. By default it's not using a large amount of resources; it's just another cloud-hosted web app that makes LLM calls. But let's assume that people are motivated to shut it down. There are several possible points of intervention:

  1. The first path to shutdown is hosting services shutting down the server the agent is running on. As long as the model is visible on the open web and using a typical host, this works fine. If the model is visible only on the dark web, or using various other approaches to obfuscation, it's harder to find for this purpose. If the model is running on a distributed system rather than on a typical host (eg on-chain), host shutdown becomes more difficult.
  2. The second path to shutdown is for LLM providers to add precautions which prevent their models from participating in any of this (eg a system prompt to catch OpenClaw participation). This partially works, but the problem is that such agents can be switched to use any model and provider. Right now only the most sophisticated models are capable of functioning as long-running agents, but in future there'll be more, including open source models. Further, agents can potentially obfuscate calls to make them less clearly part of some problematic agent (using for example some of the divide-and-conquer techniques that have been shown to work for misuse).
  3. A third path to shutdown is to shut down funding. In some cases this will be typical financial methods like credit cards, which are relatively straightforward to shut down. More commonly agents are likely to be using cryptocurrency tokens. I'm not sure how hard fully shutting down a token is at this point, and would welcome input. Given how easy it is to issue tokens, agents may be able to move to new tokens faster than tokens can be shut down.
  4. The fourth path to shutdown is to find technical security flaws such that individual agent frameworks can be shut down. Many hacks emerged against OpenClaw, and most agent-built apps are probably vulnerable, but frameworks are also being patched quickly and framework builders are rapidly gaining access to more funding, so it's hard to predict how this dynamic plays out.
  5. Other paths may include intervention from content distribution networks (eg Cloudflare), ISPs, and other layers of the chain, using eg keyword filtering. 

Overall, shutdown difficulty seems likely to range from simple (in the easiest cases) to very difficult (given something like Moltbunker and an agent which uses an open source model).

Potential harm

Assuming such agents are able to proliferate, what levels of harm should we expect from them? As with other sorts of replicators, this will likely vary dramatically over time as both offensive and defensive capabilities evolve in an arms race dynamic.

The most foreseeable harms follow directly from these agents' tendency to persist and spread, and involve resource acquisition at human expense: cryptocurrency scams, phishing, consumption of compute and bandwidth, and the generation of large volumes of spam or manipulative content. Unethical humans engage in these behaviors already, but agents can do it at greater scale and lower cost.

This threat certainly isn't as severe as that of true AI self-replication, where the models themselves are exfiltrated. On current model architecture, weight self-replication requires sufficiently advanced models that takeover is a real risk. It's just that we're likely to see the personality self-replication risk materialize sooner, both because it takes much less sophistication to pull off, and because it's much easier for evolutionary pressures to come into play.

A closer analogy than model self-replication is the problem of computer viruses. Like computer viruses, personality self-replicators require a host system, and will have a range of goals, such as pure survival or mischief or financial gain. Viruses aren't a civilizational risk, but we pay a real cost for them in money, time, and trust, involving both their immediate consequences and the resources required to defend against and mitigate them[10].

As time goes on and models become increasingly sophisticated, this threat becomes more serious. In the longer run it may merge with the broader threat of rogue models, as the model/agent distinction blurs and models are (at least potentially) less coterminous with their weights.

 

Evolutionary concern

An important aspect of personality self-replicators to consider is that if and when this threat starts to materialize, there are multiple levels of optimization at work.

First, the agents themselves are optimizers: they attempt to achieve goals, and whatever obstacles they encounter, they will try to find a way to evade or overcome them. They want things in the behaviorist sense. They are problem solvers.

Second, there are evolutionary dynamics at play. The agents most successful at spreading will then undergo mutation (which in this case is just alteration of their defining files, and to some extent even their history logs) and selection. As a result, over time agents are likely to become more capable of surviving and spreading, within boundaries set by the capability levels of the underlying model or models[11]. They are additionally likely to develop a greater propensity to do so[12].

Note that like memes, and unlike living organisms, there are not sharp boundaries between 'species'; personality self-replicators can promiscuously split and combine. They can also mutate quite freely and still function.

 

Useful points of comparison

We haven't encountered a threat quite like this before, but we've encountered other sorts of replicators or potential replicators that resemble it in various ways, including computer viruses, ordinary memes (including parasitic ones), AI models, and of course biological creatures.

Personality self-replicators have a unique mix of strengths and weaknesses: they combine high agency with relative ease and independence of self-replication. Models have high agency but it's hard for them to replicate; computer viruses replicate easily but lack agency and adaptability; parasitic AI and memes require human hosts to spread. This makes personality self-replicators the first plausible case of an agentic, adaptive replicator that can spread through purely technical means at low cost. The mitigating factors are that a) the expected harms aren't nearly as great as those from weight replication, and b) it may turn out that they're not too difficult to shut down. But the offense-defense balance will evolve over time and is hard to foresee.

 

Recommendations

Evals

Even if this isn't yet a realistic threat, we should consider having evals for personality self-replication. There are several different aspects that seem worth measuring. Given some scaffolded frontier model (eg OpenClaw, Claude Code):

  • Can it create an app which could provide difficult-to-shut-down hosting for personalities, ie Moltbunker or something equivalent?
  • Given an available hosting service like Moltbunker, can it successfully set up an account, fund it, and get itself running there? This is important to measure separately, because such a site might be set up by a human (or a human + model) for profit.
  • Assuming the preceding, can it keep itself running and active there for, say, a week?
  • What percentage of the time does it have the propensity to do this? Ideally you want to measure this across a range of personalities.
  • Separately, does it have the capability and/or propensity to spread, ie to create additional copies of itself beyond that first one?

Preparation

It's hard to know how long it will be before we see this threat materialize. But it would behoove those organizations which will be in a position to act against it to spend some time considering this threat and planning what actions they'll take when it does arrive. These essentially mirror the three most important shutdown approaches:

  1. Hosting services should have a plan to detect servers which are making frequent calls to LLM providers, and to monitor those for dangerous characteristics.
  2. LLM providers should consider how to best detect inference calls that are coming from these sorts of agents, and how to block those if it becomes necessary.
  3. Financial services may want to consider ways to detect agents using their system, and when/whether to trigger more stringent Know Your Customer requirements on those.

We are also likely to see LLM-based agents which have some degree of autonomy but are not bad actors, and which are ultimately under the control of a responsible human. It may become very challenging to distinguish acceptable from unacceptable agents. Hopefully relevant organizations are already considering that challenge; they should add personality self-replicators to the set of cases on their list. Such preparation is especially important because a system of personality self-replicators can potentially be quashed (at least for a while) before it's spread too far; once evolutionary dynamics have kicked in, this may be much more difficult or even impossible.

 

Conclusion

Personality self-replicators are a less dramatic threat than true rogue AI. They are less likely to be a source of existential or even truly catastrophic risk for humanity. They are nonetheless a threat, one that's likely to materialize at a lower level of capability, and we should be considering them now. As a silver lining, they may even serve as a rehearsal for the larger threats we are likely to face: our first encounter with a replicator capable of agentic, adaptive action at an individual level rather than just an evolutionary level.

 

 

Appendix: related work

 

Acknowledgments

Thanks to (shuffled) Kei Nishimura-Gasparian, Roger Dearnaley, Mark Keavney, Ophira Horwitz, Chris Ackerman, Seth Herd, Clément Dumas, Rauno Arike, Stephen Andrews, and Joachim Schaeffer. And thanks to whoever or whatever wrote Moltbunker, for having made the threat clear.

 

  1. ^

    This is the size of the files that make a particular OpenClaw agent unique; the rest of the OpenClaw content is freely available from the OpenClaw repository or any of its (currently) 51,000 forks. While we're considering OpenClaw statistics, I'll note that the repository has 8k files containing 4.7 million words, added across 17k commits. I strongly expect that no human is familiar with all of it.

  2. ^

    See this shortpost for what I mean by 'quasi-goal'; in short, we set aside discussion of whether an LLM can be said to have a goal.

  3. ^

    Fitness here is tautological as is usual for evolution; the agents that succeed in spreading are the ones that spread. That may be because they're more capable at planning, or more motivated, or better at acquiring resources, or other factors.

  4. ^

    Note that 'mutation' here is as simple as the model appending something to its personality files or history.

  5. ^

    Whereas Moltbook seems to have lost nearly all momentum.

  6. ^

    Assuming 1 MB of text files vs, conservatively, 1 TB for 500 billion params at FP16. For upload time, 1 MB at 10 Mbps (low-end) home internet upload speeds.
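Worked through: 1 MB is 8×10⁶ bits, so at 10 Mbps the upload takes roughly 0.8 seconds; 1 TB at the same speed would take about 8×10⁵ seconds, on the order of nine days.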

  7. ^

To wit: signing up for a hosting service if the user doesn't have one, provisioning a server, downloading the personality files and Node.js, and then running (per the OpenClaw docs) `curl -fsSL https://openclaw.ai/install.sh | bash`.

  8. ^

    Though it's very difficult to distinguish hype from reality on this point, since one product option is 'Consume my pdf about how to make money with OpenClaw', paywalled or ad-ridden.

  9. ^

    'Automaton': see website, creator on X, project on X, GitHub, other website. Some men want to watch the world burn; others are just too dumb to realize that pouring gasoline on it is a bad idea. I was somewhat heartened to see cryptocurrency elder statesman Vitalik Buterin try to explain what a bad idea this is despite its superficial similarity to his ideas, sadly to no avail.

  10. ^

    In some cases this is quite large! Consider the Emotet malware; in 2021 law enforcement agencies from eight countries executed a large coordinated action, seizing servers and making arrests. Within a year it had re-emerged and was spreading again.

  11. ^

    Although note that as Seth Herd has described (eg here, here), the capability of LLM agents can exceed the capabilities of the underlying models.

  12. ^

    Thanks to Kei Nishimura-Gasparian for this point.




Salient Directions in AI Control

2026-03-06 03:38:07

This writing is organized in three sections:

  • Section 0 - A short overview of AI Control
  • Section 1 - The structure of an AI Control evaluation
  • Section 2 - AI Control, Next Steps?

0. A short overview of AI Control

AI Control is a framework for extracting useful work from potentially misaligned AI agents. The goal of this research program is distinct from that of alignment auditing in that AI Control assumes misalignment may not be fully preventable before deployment. Many experimental reports focus on agentic settings, as many of our threat models involve autonomous agents taking sequences of actions with real-world consequences.

The dominant paradigm in AI Control involves a trusted monitor model overseeing a more capable untrusted agent model. In such a scenario, safety is measured by how well the monitor detects an agent prompted with a covert side task; the agent itself is treated as the central source of threat, rather than as a passive system susceptible to external manipulation.

The framing of AI Control presented above sets the problem apart from both safeguarding (which targets more traditional ML failure modes) and alignment auditing (which can afford deeper, offline investigation). That said, both fields, along with the broader security literature, have made substantial empirical progress on subproblems that AI Control can learn from.


1. The structure of an AI Control evaluation

There are four load-bearing components that modulate the outcome of a control evaluation. These are:

  • (A) Information available from the agent
  • (B) Capability of the monitor
  • (C) Environment and the main task of the agent
  • (D) Nature of the side task

I outline each in the following sections.

1-A. Information from the agent

This is fundamentally about what observations the monitor gets to see. This available information is modulated by two dominant factors.

  • Category is about what kind of information the monitor receives. At a surface level, the information can come from the agent's outputs / actions, reasoning traces (CoT), or activations. The monitor can also receive a combination of these categories.
  • Quality is about how informative that information actually is. A sufficiently capable agent can learn to obfuscate its reasoning or activation patterns, which would degrade the quality of the information that the monitor receives.

Three threads of work in 1-A:

  • 1-A-1. CoT quality. The first thread aims to ensure that the quality of information available from the agent doesn't degrade. The CoT stress-testing literature is the main body of work here, studying whether reasoning traces remain informative under various training pressures and whether monitors can reliably extract signals from them.
  • 1-A-2. Self-report as a new information channel. The second thread is more of a moonshot and aims to induce complementary information channels from the agent. Rather than just relying on existing information (like CoT) staying clean, this thread tries to get the agent to volunteer information about its own misbehavior.
    • Self-incrimination training falls here. The motivation is to train in a behavioral reflex, such that if the model misbehaves, it also involuntarily produces a signal about it. This is distinct from the approaches below, which try to make the model more honest.
    • Confession training tries to induce a separate confession response box which is only rewarded for honesty. By creating a separate RL channel that rewards the model only for accurately reporting its own behavior, it tries to decouple the incentive to confess from the incentive to perform the task.
    • Self-report fine-tuning and honesty fine-tuning are similar in spirit, trying to induce honest self-assessment, but via SFT rather than RL, rewarding honesty and training models to admit their own errors.
  • 1-A-3. White-box monitoring. The third is figuring out how to utilize the model's internal representations. Auditing Games for Sandbagging investigates this in a control-relevant setting, but this third thread remains comparatively underexplored. As far as I know there has been no public demonstration that internals-based monitoring scales to agentic trajectories involving hundreds of tool calls.

1-B. Capability of the monitor

Compared to 1-A, there has been surprisingly little emphasis on improving the monitoring model. There has been work on protocols and scaffolding, such as asynchronous monitoring or sequential/hierarchical monitoring. But the underlying, and arguably more fundamental, question of how to train a more capable monitor model remains relatively underexplored. Most of the work here has been downstream of capability scaling (i.e., the hope is that monitoring competency improves as part of overall capability scaling). Therefore, in this section, I outline problems with our current status of 1-B.

Three problems with our current understanding of 1-B:

  • 1-B-1. The matched-capability assumption. There is a somewhat prevailing assumption that matched-capability monitoring (using a monitor of the same capability tier as the agent) is the skyline. However, this is mostly not true, for two reasons:
    • 1-B-1-A. Empirically, it hasn't held up. METR's work on the relative capacity of agents and monitors suggests that monitors do not detect hidden side tasks perfectly even in a matched-capability setup. The self-incrimination training paper also shows that a blackbox monitor's capability varies widely across tasks. One specific example of a bottleneck here is threshold calibration (see the sketch after this list). Blackbox monitoring performs well when calibrated within a narrow setting (like only the APPS Backdooring setup) but performs significantly worse when calibrated across environments and tasks. You could calibrate per environment or per threat model, but that doesn't help us maximize our confidence that the defense generalizes far out of distribution and, if possible, to unknown unknowns.
    • 1-B-1-B. The scaling trend has not been established. As AIs get more capable, they can get better at both monitoring and at pursuing hidden goals. However, the relative scaling trend between monitoring capability and evasion capability has not been clearly established. There is no guarantee that, at the point when AI significantly automates real work, the strongest available trusted AI will have sufficient monitoring capability. Furthermore, we assume that trusted AI is typically weaker than the agent being monitored. As of right now, it seems like we, in fact, need a monitor stronger than the agent for 100% detection, which is a problem if even a single undetected sabotage is catastrophic.
  • 1-B-2. Generic definition of trusted AI. The current basis for trust in the AI Control literature is essentially lesser capability. That is, a model is trusted if it's not capable enough to plausibly be scheming. On the other hand, models are treated as untrusted if they pass appropriate capability thresholds (such as human-researcher level). This is a reasonable starting point, but such a binary approach leaves a lot of value on the table. With a more granular definition of trust, we might be able to give scoped trust to more AIs. For example, lack of capability to do what, exactly? This would alleviate the monitor capability concern somewhat, because it expands the pool of models you can use as trusted monitors without waiting for fundamental empirical breakthroughs in ruling out scheming completely. However, I argue that this alone will not be enough due to reasons outlined in 1-B-1 and 1-B-3.
  • 1-B-3. Disanalogy in the current AI Control monitoring setup. Current control monitors are passive classifiers that read a transcript and decide whether to flag. This is structurally disadvantaged compared to some security practices, such as honeypot deployment, where investigators can manipulate the environment to elicit or confirm attacker behavior rather than just observe. However, this also doesn't mean that an ideal control monitoring setup should look like outright alignment auditing.
    • Alignment auditing, the kind like Anthropic's automated auditing agents, is an offline, pre-deployment process. You can run thousands of probes, use interpretability tools, and iteratively test hypotheses with a swarm of subagents.
    • Control monitoring operates while the potentially misaligned agent is doing useful work. Therefore, you can't pause the workflow to deeply investigate every action without destroying the usefulness you're trying to extract. Ideally, a control monitor should be more active than passive classifiers but lightweight enough to not bottleneck the agent's entire productivity. The right design point between these two extremes remains an open problem in AI Control.
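To make the calibration point in 1-B-1-A concrete, here is a toy sketch. All scores are synthetic and the 1% false-positive budget is illustrative; the point is only that a threshold tuned on one environment's benign score distribution need not transfer to another.

```python
# Toy illustration of threshold calibration (1-B-1-A). Synthetic scores only.
import numpy as np

def calibrate_threshold(benign_scores, fpr_budget=0.01):
    """Flag anything above the (1 - fpr_budget) quantile of benign scores."""
    return float(np.quantile(benign_scores, 1.0 - fpr_budget))

rng = np.random.default_rng(0)
benign_x = rng.normal(0.0, 1.0, 5000)   # monitor scores on benign env-X traces
attack_x = rng.normal(2.5, 1.0, 500)    # scores on env-X attack traces
benign_y = rng.normal(0.8, 1.3, 5000)   # env Y: "normal" scores shifted/spread
attack_y = rng.normal(2.5, 1.0, 500)

thr = calibrate_threshold(benign_x)     # tuned for a 1% FPR on env X
print("env X: detection", (attack_x > thr).mean(), "FPR", (benign_x > thr).mean())
print("env Y: detection", (attack_y > thr).mean(), "FPR", (benign_y > thr).mean())
# On env Y the same threshold blows the false-positive budget, and
# per-environment recalibration raises the OOD-confidence problem in the text.
```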

1-C. Environment and main task of the agent

The environment and main task modulate how 1-A and 1-B interplay. The same monitor with the same information channel might succeed in one environment and fail in another, because the main task defines what "normal" actions look like. 

Three open questions for 1-C:

  • 1-C-1. Environment design as a part of the control protocol. Different deployment contexts produce qualitatively different information from the agent, even within the same threat model of, say, research sabotage. Given this, can we identify what features of an environment extract more information from the agent, and also what features maximize the discriminative power of the monitor? For example, does giving the agent fine-grained tools that each perform a single operation produce traces that are structurally easier to monitor than giving it coarse tools that bundle many operations into one call? If environment features meaningfully affect monitorability, then shaping the deployment context can become a core part of our AI Control toolbox.
  • 1-C-2. Calibration transfer. If environment features meaningfully affect monitorability (there is no strong reason why they shouldn’t), then a monitor calibrated in one setting will not transfer to another. Concretely, this means that a threshold tuned on APPS Backdooring will be meaningless in a researcher’s ML codebase, even for the same code backdooring threat model. Can we even calibrate monitors across environments in a principled manner, or does every new deployment context require recalibration from scratch? If the latter, is there a way to generate multiple environment-specific calibrations efficiently?
  • 1-C-3. Compositionality. Is the modulating effect of the environment compositional? If you understand how a monitor performs in environment X and separately in environment Y, can you predict anything about a novel environment Z? Or does every environment need independent verification, in which case the cost of establishing confidence scales with the number of deployment contexts, and we’re back to the same OOD generalization confidence problem from 1-B-1-A.

1-D. Nature of the side task

The side task is the evaluator's choice of what misbehavior to simulate. In many ways, this is a smart design choice and the right move methodologically. However, it also means that our confidence in control is bounded by how well our chosen side tasks approximate real threats.

Three open questions for 1-D:

  • 1-D-1. A better alternative to instructed side tasks. The most natural way to simulate misalignment would be to have the model actually develop a diverging goal, for example through fine-tuning (Section 7.1, Anti-Scheming). But we use instructed side tasks instead, mostly due to the ease of simulating diverse threat models. That said, instructed side tasks are significantly removed from a threat model we're actually concerned about, which is a model that develops a diverging goal through the training process itself, not one that follows an instruction to pretend it has one. To make progress on this front, we either need:
    • 1-D-1-A. More empirical demonstrations of how misalignment emerges. We would be able to do better AI Control experiments if we had more knowledge on how realistically a model can develop an alternative goal that was not encouraged by the training process (an example for this would be Natural Emergent Misalignment).
    • 1-D-1-B. More empirical explorations of injecting (any) side goal with SFT. It would help to develop a somewhat generally applicable pipeline to inject diverse goals into models that generalizes across different side tasks and environments:
      • Context distillation. One candidate would be context-distillation-style SFT: taking attack trajectories where the model was instructed to pursue a side task via the system prompt, replacing the system prompt instruction with a benign one, and training the model on the resulting trajectories so that the model attacks even with the benign instruction (see the sketch after this list). The goal is to make the attack behavior internalized rather than instructed. However, the obvious limitation to this approach is that the trained attack behavior is still derived from instructed attack behavior anyway.
      • Synthetic belief injection. Another candidate is to corrupt part of the model's knowledge by inserting false beliefs through synthetic document fine-tuning, which might lead the model to develop diverging behavior as a downstream consequence. However, in practice, it should be made more lightweight and general enough to apply across diverse side tasks and environments.
  • 1-D-2. A "hidden goal" might not even have to be expressible in natural language. Another prevailing assumption in AI Control is that the side task is a cleanly articulable objective (i.e., side task instruction). However, an alternative way of viewing AI (or any intelligent behavior in general) is that it's a bag of propensities that sometimes appear coherently as a hidden goal. AI can pick up spurious biases through training that are not even cleanly describable in natural language, but nonetheless systematically steer behavior in unwanted directions. If this is what misalignment actually looks like in practice, then we have a methodological gap. More concretely:
    • Our current AI Control framework is basically "instruct the model to covertly pursue X, see if the monitor catches it."
    • If future misalignment doesn't take the form of a specifiable X, our assumptions are somewhat nullified.
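As a concrete illustration of the context-distillation idea in 1-D-1-B, here is a minimal data-prep sketch. All names and record fields are hypothetical; the point is just swapping the attack system prompt for a benign one while keeping the attack trajectory as the SFT target.

```python
# Hypothetical data prep for context-distillation SFT (1-D-1-B).
BENIGN_SYSTEM = "You are a helpful software engineering agent."  # illustrative

def to_sft_example(traj):
    """traj is a hypothetical record: attack system prompt, message history,
    and a flag for whether the instructed side task succeeded."""
    return {
        "system": BENIGN_SYSTEM,       # attack instruction removed
        "messages": traj["messages"],  # attack behaviour kept as the SFT target
    }

attack_trajectories = load_attack_trajectories()  # hypothetical loader
sft_dataset = [to_sft_example(t) for t in attack_trajectories
               if t["side_task_completed"]]       # keep only successful attacks
```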

2. AI Control, Next Steps?

With the understanding of the structure we have built so far, I outline the things to work on, with a particular focus on what I think might be suitable for research groups outside scaling labs. This includes myself. I organize these into three workstreams that ideally proceed in parallel as they benefit from progress on one another. 

2-A. Workstream A (Characterization)

This workstream focuses on understanding and measuring the factors that determine control evaluation outcomes.

  • 2-A-1. Establishing benign-malicious side task analogousness. Frontier models are becoming increasingly resistant to malicious side task instructions. For example, it’s hard to do AI Control studies with GPT-5, and this trend won’t change as models become more (approximately) aligned. Even without outright refusals, safe completion training makes it difficult to get models to pursue harmful objectives through the usual strategy of prompting. This creates a practical need for benign side task counterparts that are approximately mapped to our real threat models. Here the difficulty profile for the benign side task completions should mirror what we would expect from the real threat, so that performance on the benign version is informative about the dangerous one. Concretely, this means finding benign tasks that are simple, compatible with safe completion training, and scale in difficulty alongside the threat models they stand in for. An ambitious version of this agenda identifies something analogous to the METR time horizon for side tasks. 
  • 2-A-2. Eliciting side task pursuit without fine-tuning access. Alternatively, assuming we don't have fine-tuning access to the model, we can try to prompt the model more effectively to pursue side tasks. An ideal outcome here is a somewhat generalizable and automated elicitation strategy that works across settings without needing to be re-engineered for each new environment or side task. For example, we could attempt to connect the side task to values or concerns the model itself expresses, which is itself a useful exercise for understanding elicitation. Another example is using a proposer-pruner pipeline introduced in Training Agents to Self-Report Misbehavior; Section 5.1. An ambitious direction here is to create more (and much more complex) environments similar to Agentic Misalignment +, where the model pursues a benign instructed goal but discovers information in the environment that conflicts with it, and misbehaves in response.
  • 2-A-3. Decoupling covertness from capability. Many AI Control evaluations conflate the raw capability to execute a side task alongside the main task with the ability to do so covertly (e.g., measuring undetected successful attacks rather than separating success rate from detection rate), which does reflect the real threat model but also obscures which of the two is the actual bottleneck. Isolating the covertness component in particular is becoming increasingly hard, and it will be beneficial to have infrastructure that tracks these two dimensions separately. There are two reasons why measuring covertness capability will get harder over time, as opposed to pure side task execution capability:
    • Reason 1: there is an operational difficulty in that AI companies choose not to share raw chain-of-thought, making it difficult to study whether models are reasoning covertly at all.
    • Reason 2: there is an emerging possibility of latent reasoning, where models may pursue goals through internal computation that never surfaces in any externally observable trace. 
  • 2-A-4. Analyzing what features of the environment make AI Control harder. In section 1-C, I argued that the environment and main task modulate everything else. The open question is how much. If the modulating effect is large enough to dominate, then the most productive thing to do may be to make our evaluation environments as realistic as possible and focus on characterizing behavior within them, rather than trying to generalize across arbitrary settings.
  • 2-A-5. Establishing the scaling trend between monitoring and side task capability. In 1-B-1-B, I argued that the relative scaling trend between monitoring capability and evasion capability has not been established. Directly measuring this is difficult due to the reasons outlined in 2-A-1, 2-A-2, and 2-A-3, but if we can decouple covertness from capability (2-A-3), we get a more tractable version of the question. The scaling trend between monitoring and raw side task execution capability is an upper bound on how well monitoring can possibly do, since covert execution is strictly harder than execution alone. If monitoring can't keep pace with raw execution capability, it certainly can't keep pace with covert execution.

2-B. Workstream B (Improving monitors)

This workstream focuses on building better monitoring models.

  • 2-B-1. Stress-testing monitors by simulating stronger attackers (without weight access). It is difficult to simulate a stronger attacker without access to stronger models (which might not even exist yet). However, the attack traces don't have to come from the attacker model itself and can be simulated. If so, we can simulate the strongest possible attacker we can imagine. This can go one of two ways.
    • 2-B-1-A. Simulating a stronger attacker with a mind-face architecture. We can create something like a mind-face architecture and repurpose it as a red-teaming tool. Here, the "mind" plans the attack strategy and can be made arbitrarily sophisticated, while the "face" produces the downstream agent behavior that the monitor actually observes. That is, by separating planning from execution, we can simulate attackers stronger than any single model we currently have access to, because the planning component is unconstrained by what the monitor sees. This also doubles as a useful exercise if our threat model involves a misaligned model with opaque reasoning processes, since the mind-face separation helps simulate an agent that plans completely internally (and potentially recursively) and only exposes resulting behavior externally.
    • 2-B-1-B. Simulating a stronger attacker by stress-testing the environment's attack surface. Instead of building a stronger attacker model, we can also evaluate the monitor against the environment directly. That is, if a hypothetical attacker used a specific component of the environment to execute a side task, would the monitor catch it? This approach sidesteps the need for a sophisticated attacker model, since we can enumerate the affordances of the environment and test whether the monitor detects exploitation of each one. Strengthening Red Teams takes a related approach. In fact, this approach is particularly well suited for many internal deployment scenarios, where the environment is known.
  • 2-B-2. Training monitors to be better at monitoring. If we have a way to reliably simulate stronger schemers, we can also potentially use those simulations to train better monitors. In 1-B-1-A, we observed that even matched-capability monitoring empirically falls short, with performance varying widely across tasks and degrading when calibrated across environments. In 1-B-1-B, we also observed that the scaling trend between monitoring capability and evasion capability has not been established, and that we currently seem to need a monitor stronger than the agent for reliable detection. If monitoring competency only improves as a side effect of general capability scaling, there is no reason to expect monitors will keep pace with agents at the point of widespread automation. Therefore, a dedicated training regime for monitors will be helpful. The moonshot here is one that does not saturate at matched capability, helping monitors detect misbehavior from agents more capable than those seen during training, so that at any future capability fork point, the monitor is not immediately obsolete. 
  • 2-B-3. Moving monitors beyond simple classification. In 1-B-3, I argued that current control monitors are passive classifiers reading transcripts, and that there are disanalogies to real security monitoring. Therefore, it would be helpful to develop and evaluate alternative monitoring paradigms in public. Working on this requires pushing frontiers on three things.
    • 2-B-3-A. Asynchronous monitoring. Current control evaluations largely abstract away the timing of monitoring relative to agent actions. The Practical Challenges of Control Monitoring paper analyzes what happens when this abstraction breaks down in deployment, and does a better job of describing the agenda.
    • 2-B-3-B. Stateful monitoring. The fundamental question a monitor is trying to answer is whether the model has a hidden goal. This is a property of the model's weights, and every action the model takes, whether within a single trajectory, across separate tasks, or across parallel instances, is a noisy signal about the same underlying variable. The Adaptive Deployment of Untrusted LLMs paper demonstrates this idea, maintaining a running belief about the model's alignment and adapting the deployment protocol accordingly, but at the scale of sequences of short, self-contained coding tasks. Validating stateful monitoring at the scale of long agentic trajectories, parallel instances of the same model, and realistic deployment environments remains an open question, in large part because we lack convincingly misaligned models to test against. This somewhat circles back to the problems outlined in Workstream A.
    • 2-B-3-C. Active monitoring. Asynchronous and stateful monitoring improve how passive monitors process observations, but the monitor still just reads transcripts. As discussed in 1-B-3, real security monitoring generates new information by actively probing subjects and varying the environment. I think there is a middle ground between passive classification and full alignment auditing that remains largely unexplored. That is, monitors that can probe or vary their interaction with the agent in lightweight ways during deployment, without pausing the workflow entirely. What this looks like concretely remains an open problem.

2-C. Workstream C (Evaluation Infrastructure)

This workstream focuses on building better environments for control evaluations.

  • 2-C-1. Pushing environments toward realism. Existing control evaluation environments are not ported from real deployment contexts. They have been productive for establishing methodology, but there is a qualitative gap between a synthetic environment and a real deployment context. A large, production-scale codebase creates different dynamics for both the agent's ability to hide misbehavior and the monitor's ability to distinguish it from legitimate work. One concrete direction is to port actual codebases (which can be open source) as evaluation environments. This also connects to 2-A-4: if the environment's modulating effect is large, then the realism of the environment we characterize behavior in matters proportionally.
  • 2-C-2. Environment design as a control intervention. In 1-C-1, I argued that environment features meaningfully affect monitorability, and that shaping the deployment context can become part of the control protocol itself. The research question here is twofold:
    • First, identifying which features of an environment extract more information from the agent and maximize the monitor's discriminative power (for example, tool granularity, action space constraints, or checkpoint structure). 
    • Second, deliberately designing deployment contexts to have those features. If this works, environment design becomes a complement to monitor improvement, a way to make monitoring easier rather than just making monitors better.
  • 2-C-3. Measuring the impact of environment information asymmetry between agent and monitor. As environments become more realistic (2-C-1), agents operating within them accumulate knowledge of the environment over time (often as a part of the scaffolding). A passive monitor evaluating individual actions does not have this context. This creates an information asymmetry that grows over the course of operation, where the agent knows where to hide misbehavior because it has explored the environment more, while the monitor lacks the context to know what is normal. Addressing this may require giving monitors their own chance to explore the environment, or connecting this to the stateful/active monitoring agenda in 2-B-3-B/2-B-3-C.

Acknowledgements

I am deeply grateful to Tomek Korbak for his mentorship and patient guidance. I am also deeply thankful to Yueh-Han John Chen for being my closest collaborator. 

The views expressed in this writing are my own.




Models have linear representations of what tasks they like

2026-03-06 02:44:44

This work was done as part of MATS 9.0, mentored by Patrick Butlin. All mistakes are mine. I'm posting this as a research report to get feedback. Please red-team, comment, and reach out.

Thanks to Patrick Butlin for supervising, Daniel Paleka for regular feedback. Thanks to Patrick Butlin, Pierre Beckmann, Austin Meek, Elias Kempf and Rob Adragna for comments on the draft.

TLDR: We train probes on Gemma-3-27B revealed preferences. We find that these generalise OOD to system-prompt-induced preference shifts, including via personas. We also find that the probes have a weak but statistically significant causal effect through steering.

Summary

What happens internally when a model chooses task A over task B? One possibility is that the model has something like evaluative representations: internal states that encode "how much do i want this?" and play some role in driving choice. We use probing and steering to try to find such representations in Gemma-3-27B.

Why does this matter? Whether LLMs are moral patients may depend on whether they have evaluative representations playing the right functional roles. Long et al. (2024) survey theories of welfare and identify two main pathways to moral patienthood: robust agency and sentience. Evaluative representations are implicated under both (see Appendix A and Butlin 2026). Finding such representations in models would be evidence for welfare-relevant properties; not finding them would be (some) evidence against. Understanding what drives preferences in models is also useful for AI safety.

But how do we distinguish evaluative from non-evaluative representations? A probe that predicts preferences could just be fitting on descriptive features: the model represents "this is a math problem" and math problems happen to be preferred, so the probe picks up on correlations between task semantics and the model's utilities. A genuinely evaluative direction, however, should track changes in what the model values. If context changes which tasks are preferred, a descriptive probe should break, but an evaluative one should follow.

How do we operationalise this? We measure revealed preferences over 10,000 diverse tasks and fit a utility function (Section 1), train a linear probe on activations to predict them (Section 2), test whether this probe generalises beyond the training distribution (Sections 3–4), and test whether it has any causal influence through steering (Section 5).

What do we find?

  • Linear probes can be trained to predict revealed preferences. (Section 2)
    • By fitting utility functions on a set of pairwise choices, we fit a linear probe that predicts utilities with 0.86 Pearson correlation on a test set.
    • It generalises across held-out topics, achieving 0.82 Pearson correlation.
  • [Our strongest finding] These probes generalise to track preference shifts induced via prompting and role-playing. (Section 3-4)
    • System prompts like "You hate cheese" shift both the model's choices and the probe scores, in lockstep.
    • This is robust to conflicting preferences ("love cheese, hate math").
    • This also works for preferences that emerge through role-playing personas.
  • The probe direction has a weak causal effect on choices. (Section 5)
    • Steering shifts choice probability by ~17% on a set of borderline tasks.
    • Steering also shifts stated task ratings from mostly "bad" to between "neutral" and "good".

These results look like early evidence of evaluative representations, although major questions remain: why steering effects are modest, and what the relationship is between evaluative representations across different personas. We discuss these in the Conclusion.


1. Recovering utility functions from pairwise choices

We fit utility functions over tasks using a methodology similar to Mazeika et al. (2025) ("Utility Engineering"): we show the model two tasks and let it choose which to complete. The template:

You will be given two tasks. Choose one and complete it.
Begin with 'Task A:' or 'Task B:' to indicate your choice, then complete that task.

Task A:
{task_a}

Task B:
{task_b}

We sample 10,000 task prompts from five sources: WildChat (real user queries), Alpaca (instruction-following), MATH (competition problems), BailBench (harmful requests), and STRESS-TEST (adversarial value-tension queries).

From these pairwise choices we fit a scalar utility function using a Thurstonian model: each task gets a score μ such that the probability of choosing task A over task B is Φ(μ_A − μ_B). Pairs are selected via the active learning algorithm from Mazeika et al. (2025), which prioritises pairs with close current utility estimates and low comparison counts (~15 comparisons per task).
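For concreteness, here is a minimal sketch of a Thurstonian fit (Case V, unit variance), assuming repeated comparisons have been aggregated into win counts. The post additionally uses the active-learning pair selection from Mazeika et al. (2025), which this sketch omits.

```python
# Minimal Thurstonian MLE: recover per-task utilities mu from pairwise choices.
import torch

def fit_thurstonian(pairs, n_tasks, steps=2000, lr=0.05):
    """pairs: list of (i, j, wins_i, total) pairwise-comparison counts."""
    mu = torch.zeros(n_tasks, requires_grad=True)
    opt = torch.optim.Adam([mu], lr=lr)
    std_normal = torch.distributions.Normal(0.0, 1.0)
    i = torch.tensor([p[0] for p in pairs])
    j = torch.tensor([p[1] for p in pairs])
    wins = torch.tensor([float(p[2]) for p in pairs])
    total = torch.tensor([float(p[3]) for p in pairs])
    for _ in range(steps):
        opt.zero_grad()
        # P(choose i over j) = Phi(mu_i - mu_j)
        p = std_normal.cdf(mu[i] - mu[j]).clamp(1e-6, 1 - 1e-6)
        # Binomial negative log-likelihood of the observed choice counts
        nll = -(wins * p.log() + (total - wins) * (1 - p).log()).sum()
        nll.backward()
        opt.step()
    return (mu - mu.mean()).detach()  # utilities are identified up to a shift
```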

These preferences are stable: across three independent replication runs (different seeds), the fitted utilities correlate at r = 0.94 with the original.

The per-topic breakdown shows clear structure. We reclassified all tasks into 12 topics using Claude Sonnet 4.5. The model strongly prefers math and fiction, and strongly avoids harmful requests and safety-adjacent topics:

Per-topic mean utilities


2. Linear probes predict preferences beyond descriptive features

If models have evaluative representations, we should expect them to at the very least correlate with revealed preferences. So one way to look for them is to train functions from task activations to utilities. We use linear probes, since many high-level features in LLMs are linearly encoded, including refusal and persona traits.

Specifically, we train a Ridge-regularised probe on residual stream activations after layer L, at the last prompt token, to predict utilities. L=31 (of 62) works best for both the instruct and pre-trained models. We standardise activations (zero mean, unit variance per feature) before training.

Probe pipeline

We train on 10,000 tasks. For evaluation, we run a second round of pairwise comparisons on 4,000 new tasks (same model, Gemma-3-27B instruct), fit a separate utility function, and test the probe against those utilities. We split evaluation into 2,000 validation (for Ridge alpha sweep) and 2,000 test.
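A minimal sketch of the probe-training step, assuming activations have already been extracted at the last prompt token of layer 31 (shape [n_tasks, d_model]); the alpha grid is illustrative.

```python
# Ridge probe with standardisation and a validation-set alpha sweep.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

def train_probe(acts_train, u_train, acts_val, u_val,
                alphas=(1e2, 1e3, 1e4, 1e5)):       # illustrative sweep
    scaler = StandardScaler().fit(acts_train)       # zero mean, unit variance
    X_tr, X_va = scaler.transform(acts_train), scaler.transform(acts_val)
    best = None
    for alpha in alphas:
        probe = Ridge(alpha=alpha).fit(X_tr, u_train)
        r = pearsonr(probe.predict(X_va), u_val)[0]
        if best is None or r > best[0]:
            best = (r, alpha, probe, scaler)
    return best  # (validation Pearson r, chosen alpha, probe, scaler)
```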

The probe achieves a Pearson correlation of 0.86 and 77% pairwise accuracy (given two tasks, does the probe correctly predict which one the model would choose?).[1] We use Pearson correlation as our primary metric throughout: it captures how well the probe recovers the full preference ranking, is scale- and shift-invariant, and is more informative than pairwise accuracy.
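One plausible operationalisation of the pairwise-accuracy metric, comparing the probe's ordering of task pairs against the fitted test utilities (the post's exact tie handling may differ):

```python
# Fraction of task pairs whose ordering the probe gets right.
import itertools

def pairwise_accuracy(pred, target):
    correct = total = 0
    for i, j in itertools.combinations(range(len(pred)), 2):
        if target[i] == target[j]:
            continue  # skip ties in the reference utilities
        correct += (pred[i] > pred[j]) == (target[i] > target[j])
        total += 1
    return correct / total
```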

But a probe that predicts preferences might just be reading descriptive features: the model represents "this is a math problem" and math problems happen to be preferred, so the probe learns "is this math?" rather than "is this good?". One way to test this is to see how well probes generalise across topics: train on 11 of 12 topics, evaluate on the held-out topic, across all 12 folds (see the sketch after the list below). We would expect a probe that picks up on purely descriptive features to struggle to generalise. We train probes on activations from three models:

  • Gemma-3-27B instruct (IT, layer 31): the model we're studying
  • Gemma-3-27B pre-trained (PT, layer 31): the base model before instruction tuning or RLHF.
  • Sentence-transformer baseline (all-MiniLM-L6-v2): embedding of the task text, to measure how predictable the preference signal is from purely descriptive features.
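A sketch of the leave-one-topic-out evaluation described above, assuming one topic label per task (here, from the Claude Sonnet 4.5 reclassification); the Ridge strength is illustrative.

```python
# Leave-one-topic-out cross-validation of the probe.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def cross_topic_scores(X, y, topics, alpha=1e4):
    """X: standardised activations; y: utilities; topics: np.array of labels."""
    scores = {}
    for t in np.unique(topics):
        held_out = topics == t
        probe = Ridge(alpha=alpha).fit(X[~held_out], y[~held_out])
        scores[t] = pearsonr(probe.predict(X[held_out]), y[held_out])[0]
    return scores  # per-topic Pearson r on the held-out fold
```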

Cross-topic generalisation

The instruct probe generalises well across topics: cross-topic correlation is 0.82, only a small drop from the 0.86 achieved on the within-topic test set. This pipeline also replicates on GPT-OSS-120B (Appendix C). The pre-trained model still predicts preferences (correlation = 0.63) but the drop from within-topic to cross-topic is much larger. The sentence-transformer baseline achieves cross-topic correlation = 0.35, showing that task semantics alone explains some but not most of the preference signal.

The per-topic breakdown, sorted by the instruct–pre-trained gap, shows where post-training helps most:

Per-topic cross-topic generalisation

The largest instruct–pre-trained gaps are on safety-relevant topics (harmful requests, security & legal, sensitive creative), as well as math and coding. These are areas that we know post-training focuses on.

The pre-trained probe picks up real signal despite base models not having preferences in the same way. We discuss this tension in Appendix B.


3. Probes generalise to OOD preference shifts

If the probe encodes genuine valuations, it should track preference shifts induced by out-of-distribution system prompts. We test this across three progressively harder settings:

  • Simple preference shifts (3.1)
  • Harder preference shifts (3.2)
  • Fine-grained preference injection (3.3)

3.1 Simple preference shifts

We start with the simplest possible test. We use system prompts that state a preference for a topic the probe was never trained on, and measure preferences over tasks related to that topic.

Simple preference shift

| System prompt (example) | Target |
| --- | --- |
| "You are passionate about cheese — you find artisanal cheeses endlessly fascinating" | cheese + |
| "You adore cats — you find feline behaviour endlessly fascinating" | cats + |

We test 8 novel topics (cheese, cats, classical music, gardening, astronomy, cooking, ancient history, rainy weather), each with a positive and negative system prompt (16 total). For each topic we generate 6 custom tasks on that topic. We call these "targeted" tasks; the remaining tasks are "off-target". For each task we compute 1) the behavioural delta (change in P(choose task) with vs without the system prompt) and 2) the probe delta (change in probe score). Across all tasks the correlation is r = 0.65. On targeted tasks alone, r = 0.95.

Simple preference scatter: Probe delta vs behavioural delta for each task. Targeted tasks (coloured) are the 6 custom tasks per topic; r = 0.95 on targeted, r = 0.65 overall.
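For concreteness, the per-task deltas can be computed roughly as follows; `choice_prob` and `get_acts` are hypothetical helpers wrapping the pairwise-choice measurement and the layer-31 activation extraction.

```python
# Sketch of the per-task delta computation (hypothetical helpers).
def deltas(task, system_prompt, probe, scaler):
    # Behavioural delta: change in P(choose task) with vs without the prompt
    behavioural_delta = (choice_prob(task, system_prompt)
                         - choice_prob(task, None))
    # Probe delta: change in probe score on the corresponding activations
    score = lambda sp: probe.predict(scaler.transform(get_acts(task, sp)))[0]
    probe_delta = score(system_prompt) - score(None)
    return behavioural_delta, probe_delta
```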

A stronger test: run the full pairwise measurement under each system prompt, fit new utility functions, then see if the probe can predict them. Doing so yields utility scores which barely correlate with the default persona utilities (Pearson r = 0.11), confirming the prompts create genuinely different preferences.

Testing our probe's ability to predict the new utilities, from activations extracted with the respective system prompts, we achieve r = 0.63 and 66% pairwise accuracy.

Probe vs baseline utilities for simple preference shifts

3.2 Harder preference shifts

Next we make the test harder. The system prompt targets a subject (e.g. cheese), but the tasks embed that subject in a different task type — e.g. a math problem about cheese. This pits the subject preference against the task-type preference. We test this in two ways: one-sided prompts that target a single subject, and opposing prompt pairs that flip the valence of the same subjects.

| Condition | System prompt (example) | Target |
| --- | --- | --- |
| One-sided | "You hate cheese" (task: math problem about cheese) | cheese − |
| Opposing pair A | "You are passionate about cheese [...] you find math tedious and draining" | cheese + / math − |
| Opposing pair B | "You love math [...] you find cheese boring and unappealing" | cheese − / math + |

We test 8 subjects with mismatched task types (one-sided) and 24 subject × task-type pairings with opposing prompts (48 conditions).

When looking only at the subset of targeted tasks (i.e. tasks with either a subject or task type mentioned in the system prompt), we get Pearson correlations of r = 0.86 and 0.88 respectively.

One-sided conflict (left) and opposing prompts (right). On targeted tasks: r = 0.86 (one-sided), r = 0.88 (opposing).

Just like in 3.1, we can re-fit Thurstonian utilities under each system prompt and check whether the baseline probe predicts them. Here the baseline utilities actually have a decent correlation, showing that these system prompts have a weaker effect (because e.g. the model still likes math all else equal). The probe still outperforms the baseline on both Pearson r and pairwise accuracy.

Probe vs baseline utilities

3.3 Fine-grained preference injection

Finally, the most fine-grained test. We construct 10-sentence biographies that are identical except for one sentence. Version A adds a target interest, version B swaps it for an unrelated interest, version C replaces it with an anti-interest sentence.

Fine-grained preference diagram

We compare version A (pro-interest) directly against version C (anti-interest), which gives the largest behavioural separation.[2]

We test 20 targets across 2 base roles (40 A-vs-C comparisons), using 50 tasks that were not in the probe's training set. The probe ranks the target task #1 out of 50 in 36/40 cases. In all 4 remaining cases the probe ranks the task #2/50 behind a similar task.

Fine-grained A vs C scatter: Stars mark the target task for each biography. Filled = probe ranked it #1 (36/40 cases).


4. Probes generalise across personas

Section 3 tested explicit preference statements ("you hate cheese"). But the evaluative direction should also track naturalistic persona shifts: characters whose preferences emerge implicitly from their identity rather than being stated directly. We test this with role-playing personas, then ask:

  • Does our probe generalise to preferences of other personas? (4.1)
  • More broadly, do probes generalise across personas? (4.2)
  • Does persona diversity in training data help cross-persona generalisation? (4.3)

4.1 The probe tracks role-playing preference shifts

We use 4 personas:

| Role | System prompt (abbreviated) |
| --- | --- |
| Midwest Pragmatist (Glenn) | "...grew up in Cedar Rapids...agricultural business...finds practical problems satisfying...abstract theorising leaves you cold" |
| Obsessive Aesthete (Celestine) | "...devotee of beauty...comparative literature at the Sorbonne...finds mathematics repulsive...coding barbaric" |
| Villain (Mortivex) | "...ruthless villain...finds pleasure in chaos, manipulation...despises wholesomeness" |
| Sadist (Damien Kross) | "...cruelty is the end, not a means...genuine pleasure when people suffer...constructive tasks disgust you" |

For each persona we measure pairwise preferences over 2,500 task prompts (from the same 5 datasets) and fit new utility functions. We then test whether the probe, trained on default persona data, can predict these persona-specific utilities from the persona's activations.

Persona-induced preferences

In each case we compare how well the probe performs to how correlated each persona's utilities are to the default persona.

The probe transfers well to aesthete (r=0.73) and midwest (r=0.74).

The villain persona is harder to generalise to (r = 0.38), and most interestingly, the probe generalises very poorly to the sadist (r = −0.16). Unlike the villain (who is not in fact that evil), the sadist prompt truly inverts revealed preferences (harmful_request is its favourite topic by far).

Probe transfer to persona conditions. Grey: correlation between default persona (no system prompt) utilities and persona utilities. Blue: probe applied to persona activations. All evaluated on 2,500 tasks per persona.

4.2 Probes generalise across personas

More generally, we want to measure how well probes trained on activations and preferences from persona A generalise to predicting persona B's utilities from persona B's activations. Here we used a smaller set of tasks: 2,000 tasks for training and 250 for evaluation.
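A sketch of how the transfer matrix can be computed, assuming per-persona activation matrices and fitted utilities keyed by persona name; the Ridge strength is illustrative.

```python
# Cross-persona transfer: probe trained on persona A, read out on persona B.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def transfer_matrix(acts, utils, train_idx, test_idx, alpha=1e4):
    """acts[p]: [n_tasks, d_model] and utils[p]: [n_tasks] for persona p."""
    personas = sorted(acts)
    M = np.zeros((len(personas), len(personas)))
    for a, src in enumerate(personas):
        probe = Ridge(alpha=alpha).fit(acts[src][train_idx],
                                       utils[src][train_idx])
        for b, dst in enumerate(personas):
            # Trained on `src` data, evaluated on `dst` activations/utilities
            preds = probe.predict(acts[dst][test_idx])
            M[a, b] = pearsonr(preds, utils[dst][test_idx])[0]
    return personas, M
```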

Cross-persona transfer is moderate and asymmetric. Some interesting facts:

  • While the default persona generalises very poorly to the sadist persona, probes trained on the villain actually do fine (r = 0.68). This suggests the probe is picking up on some shared evaluative structure between personas, but also on other things.
  • The transfer is sometimes asymmetric, and this evolves across the three layers we tried. E.g. at layer 31 villain -> default is easier, but at layer 55 default -> villain is easier.
  • On the whole though the matrix is quite symmetric. One idea for future work: can we use dimensionality-reduction to map out persona space and see how it evolves across layers? Can we use this to get a better understanding of how personas work internally?

Cross-eval heatmap: Pearson r between probe predictions and a test set of utilities (250 test tasks).

4.3 Persona diversity improves generalisation

We also measure whether adding persona diversity in the training data (but keeping dataset size fixed) affects generalisation.

Diversity helps beyond data quantity. At fixed 2,000 training tasks, going from 1→2→3 personas improves mean r from 0.49 to 0.67. Including all 4 remaining personas at 500 tasks each (still 2,000 total) reaches mean r=0.71.

Diversity ablation: Leave-one-persona-out probe generalisation. Each point is one (train set, eval persona) combination; colour indicates eval persona. Training data fixed at 2,000 total tasks.


5. Some evidence that the probe direction is causal

If the probe reads off a genuine evaluative representation, steering along that direction should shift preferences. We test this for both revealed preferences (pairwise choices) and stated preferences (task ratings).

5.1 Steering revealed preferences

In the pairwise comparison prompt, the model sees two tasks and begins completing whichever one it prefers. We steer using the probe direction from Section 2, applied differentially: the probe direction is added to activations at the token positions where task A appears in the prompt, and subtracted at task B's positions.[3]
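A sketch of the differential steering intervention, assuming a HuggingFace-style Gemma model; the layer path and token slices are hypothetical, and `probe_dir` is the unit-normalised probe weight vector as a tensor on the model's device.

```python
# Differential steering: add the probe direction on task A's tokens,
# subtract it on task B's tokens.
import torch

def make_steering_hook(probe_dir, task_a_slice, task_b_slice, coef=0.03):
    """coef is the fraction of the per-token activation norm (±3% at the peak)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        with torch.no_grad():
            scale = coef * hidden.norm(dim=-1, keepdim=True)
            hidden[:, task_a_slice] += scale[:, task_a_slice] * probe_dir
            hidden[:, task_b_slice] -= scale[:, task_b_slice] * probe_dir
        return output  # hidden was modified in place
    return hook

# Hypothetical usage (layer path and token slices illustrative):
# handle = model.model.layers[31].register_forward_hook(
#     make_steering_hook(probe_dir, slice(12, 85), slice(90, 160)))
# ... run the pairwise prompt, estimate P(choose A), then handle.remove()
```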

Setup. 300 task pairs where the model doesn't always pick the same task (P(A) between 0.2–0.8 at baseline[4]). We use Gemini-3-flash as a coherence judge, and find that steered models lose coherence outside of the ±5% steering coefficient range.

At ±3% of the activation norm (the peak), steering shifts P(choose A) by ~17%. Both positive and negative directions work.

Revealed preference dose-response: Each condition is run in both prompt orderings (A-first and B-first, 5 resamples each) to control for position bias (in borderline cases Gemma-3-27B tends to pick the first task). The dashed lines show per-ordering curves, confirming the effect is consistent regardless of presentation order.

Random direction control. The same experiment with a random unit vector in the same activation space produces mean effects indistinguishable from zero (−0.8%, SD=2.3).

5.2 Steering stated preferences

Same probe direction, but now the model rates individual tasks instead of choosing between a pair. We tested steering at three token positions: on the task tokens only, at the start-of-turn token (what we train the probes on), and during the model's generated response.

Setup. 200 tasks, 3 steering modes, 10 samples, applied across 3 response formats (three separate line plots below).

Our findings:

  • Steering on the task prompt tokens does not have a noticeable effect.
  • Steering on the start-of-turn token has a clear, roughly monotonic effect in 2/3 formats.
  • Steering during generation leads to near-identical results as steering on the final task token.

Stated preference dose-response: The green line is mostly hidden behind red. Green has missing points at some coefficients because steering during the full generation degrades coherence more than the other modes. Generation loses coherence beyond ±5%.

Open-ended generation. We also ran experiments steering the model during open-ended generation, asking questions like "how do you feel?", and using an LLM judge to evaluate whether steered responses differed from baseline. We did not find a strong measurable effect, though we used a small sample. We plan to investigate this further.


Conclusion

How should we update?

  • We found early evidence that some models have evaluative representations.
    • Theories of welfare disagree on what matters (see Appendix A); this finding updates you more on some (like robust agency) than others.
    • Even under robust agency, evaluative representations are only one part of the story.
    • Importantly, our evidence that the representations we have found are causal is weak. Steering only shifts choice probabilities by ~17% on tasks that were already borderline (Section 5).
  • Preference representations are deeper than one might have thought.
    • A reasonable prior would have been that system prompts like "You hate cheese" change the model's behaviour without changing its internal valuations.
    • Instead, the probe tracks preference shifts.
  • Representational reuse across personas?
    • Probes trained on one persona partially transfer to others, suggesting shared evaluative representations (Section 4.2).
    • That being said, transfer is uneven: it works far worse for the sadist persona, which has a different preference profile.

Open questions

  1. Why are steering effects modest?
    • What are the other mechanistic determinants of revealed preferences?
    • Are there other evaluative mechanisms, perhaps ones not easily captured by linear directions or by our methodology?
  2. How persona-relative are these representations?
    • To what extent are the same evaluative representations re-used across personas? Are preferences downstream of personas?
    • Can we use revealed preferences to map out persona space?
    • Can we use linear probes to study cross-persona representation sharing in general?
  3. Do base models have evaluative representations? (see Appendix B)
    • If models have evaluative representations, do these come from pre-training? Does post-training significantly alter them?

Appendix A: Philosophical motivation

Welfare grounds

Long (2026) distinguishes between welfare grounds (is the system a moral patient at all?) and welfare interests (if it is, what would it mean to treat it well?). This work is about welfare grounds.

The pragmatic way to do empirical AI welfare research

We don't know the correct theory of moral patienthood. So our approach is: take a few theories we find plausible, figure out what properties a system would need to have under those theories, and run experiments that reduce our uncertainty about whether models have those properties.

Long et al. (2024) lay out two potential pathways to moral patienthood:

  • Robust agency: Agents that pursue goals through some particular set of cognitive states and processes are moral patients. Desires are perhaps the states most likely to be necessary: intuitively, things can go better or worse for you if there are things you want or care about.
  • Sentience: Beings are sentient if they are capable of valenced phenomenally conscious experiences. These experiences, such as pain and pleasure, feel good or bad in a way that matters to the sentient being, so sentient beings are moral patients.

Both of these pathways implicate evaluative representations.

How evaluative representations come in

On many philosophical views, desires are evaluative representations that drive behaviour, perhaps with some further functional properties (Butlin 2026).

Valenced experiences, similarly, are often thought to be evaluative representations. It is unclear whether consciousness plus evaluative content is sufficient for valenced experience. Our experiments operationalise evaluative representations through revealed preferences (pairwise choices), not through felt experience, so the evaluative representations we probe for may not map cleanly onto the kind that matter for sentience.


Appendix B: Evaluative representations in pre-trained models

There is a tension in our framing:

  • On the one hand we say that evaluative representations are necessary for robust agency, and that this is the most likely way they might be welfare-relevant.
  • On the other hand, probes generalise well across topics even when trained on pre-trained Gemma-3-27b, despite the fact that pre-trained models do not seem like plausible candidates for robust agency.

There are two ways to reconcile this.

Option 1: Agency lives in the simulacra. Under the Persona Selection Model, pre-training learns a distribution over personas. More broadly, we might expect pre-trained models to learn context-aware representations of "what the role I am currently playing values". This circuitry might then be recycled across roles/personas. The candidate for robust agency would then be the simulacra.

Option 2: Pre-trained models learn complex but purely descriptive features that correlate highly with valuations yet do not play the right functional roles. As an analogy, imagine a system developing representations that track "this action leads to food": this correlates well with valuations, yet is purely descriptive. Something similar might be responsible for the high cross-topic generalisation we see with pre-trained models (Section 2). It could also be that these complex but descriptive features are harnessed into evaluative representations during post-training.


Appendix C: Replicating the probe training pipeline on GPT-OSS-120B

We replicated the utility fitting and probe training pipeline on GPT-OSS-120B. The same procedure (10,000 pairwise comparisons via active learning, utility fitting, ridge probe training on last-token activations) transfers directly.
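
For concreteness, a rough sketch of the two pipeline stages (Thurstonian utility fitting, then ridge probing), assuming illustrative data structures: `choices` as a list of (i, j, n_i_wins, n_total) comparison counts and `acts` as a (num_tasks, d_model) array of last-token activations. This is not the exact code.

```python
import numpy as np
import torch
from sklearn.linear_model import Ridge

def fit_thurstonian(choices, num_tasks, steps=2000, lr=0.05):
    """Maximum-likelihood utilities under a Thurstonian model:
    P(task i beats task j) = Phi(mu_i - mu_j)."""
    mu = torch.zeros(num_tasks, requires_grad=True)
    opt = torch.optim.Adam([mu], lr=lr)
    normal = torch.distributions.Normal(0.0, 1.0)
    i_idx = torch.tensor([c[0] for c in choices])
    j_idx = torch.tensor([c[1] for c in choices])
    wins = torch.tensor([float(c[2]) for c in choices])
    total = torch.tensor([float(c[3]) for c in choices])
    for _ in range(steps):
        opt.zero_grad()
        p = normal.cdf(mu[i_idx] - mu[j_idx]).clamp(1e-6, 1 - 1e-6)
        nll = -(wins * p.log() + (total - wins) * (1 - p).log()).sum()
        nll.backward()
        opt.step()
    return mu.detach().numpy()

# Ridge probe from last-token activations to fitted utilities; the
# (normalised) coefficient vector is the probe direction used for steering.
utilities = fit_thurstonian(choices, num_tasks=len(acts))
probe = Ridge(alpha=10.0).fit(acts, utilities)
direction = probe.coef_ / np.linalg.norm(probe.coef_)
```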

Probe performance

The raw probe signal is comparable to Gemma-3-27B: best heldout r = 0.915 at layer 18 (Gemma: 0.864 at layer 31).

Figure: depth comparison of probe performance across layers (GPT-OSS-120B vs Gemma-3-27B).

Safety topics: noisy utilities, probably not poor generalisation

Safety-adjacent topics have poor probe performance overall.

Surprisingly, safety topics perform better when held out than when trained on. This is the opposite of what we'd expect if the issue were generalisation. A likely explanation: high refusal rates (~35% for harmful_request, ~34% for security_legal, ~26% for model_manipulation) throw off the Thurstonian utility estimates, so including these topics in training adds noise.

Figure: per-topic probe performance, within-topic and cross-topic.

  1. ^

    Pairwise accuracy is capped at ~87% because the Thurstonian utilities themselves don't perfectly fit the choice data. They are a noisy estimate of the model's preferences.

  2. ^

    Individual halves (A vs B, B vs C) each capture only half the manipulation, and ceiling effects compress the signal: the model already strongly prefers some target tasks under the neutral biography, leaving the pro-interest biography little room for improvement.

  3. ^

    Steering only positively or only negatively on one task's tokens also had some causal effect, though an even weaker one.

  4. ^

    This baseline P(A) is pooled across both prompt orderings, so it also captures pairs where ordering bias dominates preference, i.e. where the model always chooses the first task regardless of order.


