LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

Positive-sum interactions between players with linear utility in resources

2026-03-20 08:42:57

Sometimes people say things like "If the humans and AIs have linear utility in resources, then their interactions are zero-sum". Here, "linear utility in resources" typically means something like: "Supposing AIs already control all the galaxies, then they'd accept a bet with a 60% chance of gaining one more galaxy and a 40% chance of losing one."

I think this is too hasty.

I'll list several reasons why interactions can be positive-sum even when both humans and AIs have such preferences. I've ordered these from most important to least important.

1. Epistemic public goods. Humans and AIs both benefit from learning true things about the universe — knowledge will have value to almost any agent regardless of its terminal goals. Hence, if both parties need to invest resources in acquiring knowledge, they can share the costs and both come out ahead. This includes basic mathematics and science. But I also expect this will include very expensive simulations of counterfactual worlds.

2. Security public goods. Some expenditures protect everyone, e.g., both humans and AIs might need to spend resources ensuring their galaxies don't trigger false vacuum decay in neighbouring regions. Or defending against external threats (hostile aliens, natural disasters, exogenous risks).

3. Common values. It's possible that human utility is something like hedonium+a⋅X and AI utility is something like paperclips+b⋅X, where X is some component of value that both parties share. If producing each of hedonium, paperclips, and X is linear in resources, then both humans and AIs have linear utility in resources. However, the game isn't zero-sum, because the X-component means total welfare increases when both parties spend more resources on the common value. (A toy numerical sketch illustrating this point, along with points 4 and 7, appears after this list.)

4. Different marginal rates of substitution across resource types. "Linear returns to resources" elides the fact that there are many kinds of resources. Humans and AIs might differ in how much they value proximal galaxies versus distal ones (e.g. because of time discounting), even though each party's utility function u(proximal, distal) is linear. If humans value proximal galaxies relatively more and AIs value distal galaxies relatively more, then trading proximal-for-distal makes both parties better off.

5. Complementarities in production. If humans love hedonium and AIs love paperclips, maybe we can build paperclips out of hedonium (or vice versa). I expect this kind of direct complementarity to be rare in practice, because of "tails come apart" reasoning — removing the looks-like-a-paperclip constraint probably more than doubles the hedonium you can extract from the same resources. But less extreme versions of production complementarities might exist.

6. Comparative advantage. Even when both parties are linear in the same single resource, if they differ in their relative productivity across tasks, there are gains from specialisation and trade. This is why most transactions between humans are positive-sum despite each transaction being so small that both parties' utility functions are approximately linear over the relevant range.

I'm unsure whether comparative advantage persists at cosmological timescales, where both parties are optimising with mature technology. But it plausibly matters during the transition period.

7. Gains from trade under uncertainty. If humans and AIs have different beliefs, they can make bets that both parties expect to gain from. These bets are ex post zero-sum but not ex ante zero-sum. As long as the parties disagree about probabilities, both can expect to gain from the same wager.
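The following toy sketch (all numbers invented for illustration) makes points 3, 4, and 7 concrete: in each case both parties' utilities are linear in resources, yet the deal leaves both strictly better off than no deal.

```python
# Point 3: common values. Each party diverts 20 of its 50 galaxies to the shared
# value X, which both enjoy; with weights a = b = 0.8 > 0.5, both come out ahead.
a = b = 0.8
human_no_deal, ai_no_deal = 50 + a * 0, 50 + b * 0      # 50.0, 50.0
human_deal,    ai_deal    = 30 + a * 40, 30 + b * 40    # 62.0, 62.0

# Point 4: different marginal rates of substitution. Humans weight proximal
# galaxies at 2 and distal at 1; AIs the reverse. Each starts with 10 of each,
# then humans swap 5 distal galaxies for 5 of the AIs' proximal ones.
human_before, ai_before = 2 * 10 + 1 * 10, 1 * 10 + 2 * 10   # 30, 30
human_after,  ai_after  = 2 * 15 + 1 * 5,  1 * 5 + 2 * 15    # 35, 35

# Point 7: bets under different beliefs. Humans assign probability 0.7 to some
# event E, AIs assign 0.4; each stakes one galaxy at even odds, humans betting on E.
p_human, p_ai, stake = 0.7, 0.4, 1.0
human_expected_gain = p_human * stake - (1 - p_human) * stake   # +0.4
ai_expected_gain    = (1 - p_ai) * stake - p_ai * stake         # +0.2
# Ex post the bet is zero-sum; ex ante both expected gains are positive.
```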

Thanks to Alexa Pan for discussion, inspired by Lukas Finnveden's Notes on cooperating with unaligned AIs.




No, we haven't uploaded a fly yet

2026-03-20 07:43:23

In the last two weeks, social media was set abuzz by claims that scientists had succeeded in uploading a fruit fly. It started with a video released by the startup Eon Systems, a company that wants to create “Brain emulation so humans can flourish in a world with superintelligence.”

On the left of the video, a virtual fly walks around in a sandpit looking for pieces of banana to eat, occasionally pausing to groom itself along the way. On the right is a dancing constellation of dots resembling the fruit fly brain, set above the caption ‘simultaneous brain emulation’.

At first glance, this appears astounding - a digitally recreated animal living its life inside a computer. And indeed, this impression was seemingly confirmed when, a couple of days after the video’s initial release on X by cofounder Alex Wissner-Gross, Eon’s CEO Michael Andregg explicitly posted “We’ve uploaded a fruit fly”.

Yet “extraordinary claims require extraordinary evidence, not just cool visuals”, as one neuroscientist put it in response to Andregg’s post. If Eon had indeed succeeded in uploading a fly - a goal previously thought to be likely decades away according to much of the fly neuroscience community - they’d need more than a video to prove it.

Did the upload show evidence of known neurophysiological markers of working memory, such as the head-direction ring attractor bump? How did their brain model actually control the virtual fly body, given that it seemed to lack a model of the ventral nerve cord (the fly’s equivalent of a spinal cord)? Where were the data and the write-up?

Because if Eon couldn’t back up what their video seemed to show, at least some neuroscientists were going to be markedly less than impressed.

Eon did follow up with a blog post - How the Eon Team Produced A Virtual Embodied Fly - detailing how they combined pre-existing models of the fly brain and body into a system that could respond to virtual environmental cues. But for the neuroscientists scrutinising the uploading claim, these details only sharpened their objections - so much so that some are accusing Eon of misleading conduct and gross misrepresentation.

To understand just why these scientists are so upset, you need a bit of context.

A brief history of fruit fly connectomics

The fruit fly Drosophila melanogaster has been a workhorse of neuroscience for decades: its brain is small enough to be tractable but complex enough to produce genuinely interesting behaviour such as learning, navigation, decision-making, and courtship. A long-running ambition within the community has been to map the complete wiring diagram - a ‘connectome’ - of that brain, and in October 2024, after years of incremental progress, the FlyWire Consortium achieved it: a complete connectome of the adult fly brain, documenting all 139,255 neurons and over 50 million synaptic connections.

These increasingly complete connectomes have enabled the creation of increasingly elaborate computational models. In 2024, Shiu et al. published a model of the entire adult fly brain in which every neuron and neural connection was represented, albeit in highly simplified form (ignoring differences in cell shape, neurotransmitter dynamics, and much else). Despite these simplifications, the model could predict which neurons activate in response to sensory stimuli and identify pathways underlying behaviors like feeding and grooming, a striking demonstration that wiring alone carries substantial information about function. Separately, Lappalainen et al. built a ‘connectome-constrained’ model of the fly’s visual system, whose predictions matched real neural recordings across dozens of experiments.

Meanwhile, other researchers had built NeuroMechFly, a biomechanical simulation of the adult fly body based on micro-CT scans of real anatomy. Updated to a second version in late 2024, the new virtual fly body could walk, groom, or be trained via reinforcement learning to navigate through virtual environments. Crucially, it could also be reprogrammed to be driven by any other kind of external controller.

[Video: one of the videos from the NeuroMechFly v2 publication, demonstrating a ‘hierarchical sensorimotor task in [a] closed loop’. There’s no connectome involved here, yet the behavior is still remarkably similar to the Eon demo.]

By early 2025, the pieces Eon needed for their demo were largely in place: a complete brain connectome, computational models of both the central brain and the visual system, and a detailed biomechanical body model. All that remained was to wire them together.

So, what did Eon actually do?

Eon took the pre-existing components we just described - the Shiu et al. brain model and the NeuroMechFly v2 body - and connected them together into a closed loop: sensory events in a virtual world feed into the brain model, and selected outputs from the brain model direct the virtual body.

The loop has four steps. First, something happens in the virtual environment - the fly’s leg contacts a sugar source, or dust accumulates on its antennae - and these events activate specific sensory neurons in the brain model. Second, the brain model runs for a 15-millisecond time step, propagating activity through the connectome’s ~140,000 simplified digital neurons. Third, Eon reads out the activity of a small, hand-picked set of descending neurons and translates it into high-level commands - turn left, walk forward, groom, feed - that are passed to pre-trained motor controllers in the body model. Fourth, the body moves, changing what the fly senses, and the loop repeats.
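For concreteness, here is a minimal sketch of that loop in code. The object and function names are hypothetical stand-ins, not Eon's actual code and not the real Shiu et al. or NeuroMechFly APIs.

```python
def run_closed_loop(environment, brain_model, body_model, command_map, n_steps):
    """One iteration per 15 ms brain-model time step (hypothetical interfaces)."""
    for _ in range(n_steps):
        # 1. Events in the virtual world activate specific sensory neurons.
        brain_model.stimulate(environment.get_sensory_events())
        # 2. Propagate activity through the ~140,000 simplified neurons for 15 ms.
        brain_model.step(dt_ms=15)
        # 3. Read a small, hand-picked set of descending neurons and translate
        #    their activity into a high-level command (turn, walk, groom, feed).
        command = command_map(brain_model.read_descending_neurons())
        # 4. Pre-trained motor controllers in the body model execute the command;
        #    the body's movement changes what the fly senses on the next step.
        body_model.execute(command)
        environment.update(body_model.state)
```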

The result is the video that went viral. But the behaviors on screen are less impressive than they appear, because the brain model is doing far less of the work than a viewer would naturally assume.

Take the walking. The brain model does not orchestrate the fly’s legs. It doesn’t compute the gait cycle, coordinate the six limbs, or position the joints. It activates a few descending neurons - oDN1 for forward velocity, DNa01/DNa02 for steering - and hands that signal off to a locomotion controller within NeuroMechFly that already knows how to walk. The brain is issuing something like a “go forward” or “turn left” instruction; the body model handles everything else. In a biological fly, the detailed work of translating such commands into coordinated leg movements is performed by ~15,000 neurons in the ventral nerve cord (the fly’s equivalent of a spinal cord), none of which are simulated here. The same applies to grooming: the connectome selects the behavior, but NeuroMechFly’s controllers execute it.

In their blog post, Eon are open about this. They compare the descending neurons to a car’s steering wheel, accelerator, and brake - you can predict what the car will do from these controls “without explicitly simulating every combustion event inside the engine.” They also acknowledge that the visual system activity displayed so prominently in the video - derived from the Lappalainen model - is “somewhat decorative” and does not substantially drive behavior. They do note that the brain-body mappings are in some cases “somewhat arbitrarily chosen by hand.” And they explicitly state the work “should not yet be interpreted as a proof that structure alone is sufficient to recover the entire behavioral repertoire of the fly.”

This is fair enough, and their efforts to connect brain and body models are genuinely useful engineering. If Eon had described this as “the first integration of connectome-constrained brain and body models into a closed sensorimotor loop”, nobody in the fly neuroscience community would have objected.

But they didn’t say that. They said “We’ve uploaded a fruit fly.” Transparency in a blog post that few will read doesn’t undo a headline that millions saw. The typical person who encounters a claim on X, watches the video, and sees a fly walking, grooming, and feeding while a digital brain flickers alongside it is probably not going to think “a simplified brain model is selecting from a small menu of pre-programmed behaviors via a hand-tuned interface.” They’re likely to think the fly has been faithfully recreated inside a computer.

It hasn’t. Eon’s virtual fly implements only a handful of behaviors, and those rely heavily on NeuroMechFly’s pre-trained controllers rather than on the connectome. This is the most fundamental problem with the demo as evidence of an upload: because the body model already knows how to walk, groom, and feed, almost any signal that triggers the right controller at the right time will produce fly-like behavior on screen. You could replace the connectome with a simple rule-based script - if dust, groom; if sugar, feed; otherwise, walk forward - and the resulting video would look much the same. The fly-like behavior the viewer sees is a product of the body model, not the brain. The digitized connectome may be producing meaningful internal dynamics, but this demo cannot tell us whether it is.
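To make that point concrete, the hypothetical rule-based stand-in could be as simple as this (illustrative only, not anything Eon or anyone else has run):

```python
def rule_based_controller(senses):
    # Drives the same pre-trained NeuroMechFly controllers that the connectome
    # readout would, with no brain model involved at all.
    if senses.get("dust_on_antennae"):
        return "groom"
    if senses.get("sugar_contact"):
        return "feed"
    return "walk_forward"
```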

What would actually count as uploading a fly?

So if what Eon built isn’t an upload, what would be?

The word ‘upload’ carries a claim that ‘model’ and ‘simulation’ do not. When one says they’ve modeled or simulated a fly, they’re saying they’ve captured some elements of the original insect’s behaviour, but with significant simplifications and assumptions. If instead they say they’ve uploaded a fly, they’re making a claim about the fly itself: that its identity has been faithfully transferred into a new medium, that the thing in the computer in some sense is the fly, just running on a different substrate. When you upload a photo, the file on your computer is the photo. Nobody says “I’ve partially uploaded this photo” to mean “I’ve made a rough sketch inspired by it.”

An uploaded fly, then, should be able to do everything the original fly could do. It should be playable forward in time indefinitely, responding to novel situations as the original would have. It should serve as a faithful proxy for the real thing; so much so that a neuroscientist could peer inside, observe realistic equivalents of neurophysiology, and run experiments that would be impractical or impossible on a biological fly, with confidence that the results would generalise back.

The leading proposal for how to actually achieve this is whole brain emulation: faithfully recreating the brain’s causal mechanisms at whatever level of detail turns out to be necessary so that the digital system behaves identically to the original. This is what distinguishes emulation from simulation. A weather simulation is useful - it can predict next week’s temperature with reasonable accuracy - but it breaks down when pushed further out, because its approximations are coarser than the actual atmospheric processes of real weather. In contrast, one can run an emulation of the Nintendo 64 game Banjo-Kazooie on a laptop, and because the emulator faithfully recreates the logic of the N64’s hardware - the processor, the memory, the graphics pipeline - the game will never fail to behave as it would have on the original console.

It’s currently an open scientific question what level of biological detail an emulation needs to capture. It’s unlikely we’d need to simulate every ion channel, and perhaps much of the brain’s physiology could be simplified with no consequence. But the key feature of the emulation approach is the guarantee: if you’ve faithfully recreated the causal mechanisms down to the necessary level, the resulting behaviour is trustworthy by construction. Low-fidelity approaches might produce correct-looking behavior in some cases, but it’s hard to tell to what degree this will generalise to novel situations.

In response to this line of criticism, Michael Andregg has argued that uploading shouldn’t be considered so binary. “I don’t think of uploading as a binary concept” he told The Verge, outlining “different levels” of upload. By this logic then, Eon’s system - containing connectome-derived elements driving behavior in a virtual body - might qualify as a ‘partial upload’.

But if a connectome-constrained model can count as a ‘partial upload’, then the Shiu et al. brain model was already a partial upload before Eon touched it. So was the Lappalainen visual model. So, for that matter, is any computational neuroscience model that incorporates anatomical connectivity data. The word ‘upload’ loses its distinctive meaning, and the field loses its ability to communicate what it is actually trying to achieve and how far away a true fly upload still is.

Still loading

When the vocabulary of breakthroughs is spent on incremental demos, the actual breakthroughs are cheapened when they arrive. Funders and the public lose the ability to distinguish genuine milestones from slick demos, and investment flows towards groups making the boldest claims rather than those doing the most foundational work. Worse, for a field that is struggling to graduate from science fiction to serious research, premature claims risk triggering the cycle of hype and disillusionment that has set back other ambitious programs before.

To be fair, we’re not unsympathetic to why Eon used the language they did. Their careful blog post on ‘How the Eon Team Produced a Virtual Embodied Fly’ would likely have only been read by a few hundred neuroscientists, while “We’ve uploaded a fruit fly” reached millions. Startup survival requires investment, funding follows excitement, and excitement follows headlines - not careful caveats. This bold approach may even feel obligatory when an organisation’s stated mission is “solving brain emulation as an engineering sprint, not a decades-long research program.”

But the history of science - and the gap between what Eon demonstrated and what uploading actually requires - suggests that there is likely no shortcut through the long slog ahead.

Because in all probability, before anyone can truthfully claim to have uploaded a fly, there will still need to be years more of tedious work. Countless painstaking patch clamping experiments of carefully guiding a glass electrode into a single neuron while keeping it alive, just to learn how that one cell type, out of the fly brain’s thousands, transforms its inputs into outputs. Endless sessions of pinning flies under two-photon microscopes, collecting calcium imaging data while the animals walk or groom or navigate an odor plume, slowly building up ground-truth measurements of what real brain activity actually looks like during real behavior. Thousands of hours still to come of building computational models, testing them against that data, failing, and refining them again.

Then, and very likely only then, will there come a day when someone will hit ‘run’, and a fly - disoriented in whatever way a fly can be, having been sitting in a vial a moment ago - will find itself somewhere unfamiliar. It won’t know that in the intervening time, it had been anesthetised, embedded in resin, and its brain sliced into thousands of thin sections. It won’t know that those sections were painstakingly imaged, or that its neural architecture was reconstructed from those images, or that thousands of its fellow flies were studied and sacrificed to fill in what images alone couldn’t tell us. It won’t know of the billions of dollars and thousands of careers that it took to reach this point, or the millions of hours spent staring down microscopes, handling vials, and debugging code. It will certainly never know that it was once made of proteins and cells, and is now made of silicon and mathematics.

It will just beat its wings, lift off, and search for fruit.




The Case for Low-Competence ASI Failure Scenarios

2026-03-20 07:10:46

I think the community underinvests in the exploration of extremely-low-competence AGI/ASI failure modes, and in this post I explain why.

Humanity's Response to the AGI Threat May Be Extremely Incompetent

There is a sufficient level of civilizational insanity overall, and the field of AI itself has a nice empirical track record that speaks eloquently about its safety culture. For example:

  • At OpenAI, a refactoring bug flipped the sign of the reward signal in a model. Because labelers had been instructed to give very low ratings to sexually explicit text, the bug pushed the model into generating maximally explicit content across all prompts. The team noticed only after the training run had completed, because they were asleep.
  • The director of alignment at Meta's Superintelligence Labs connected an OpenClaw agent to her real email, at which point it began deleting messages despite her attempts to stop it, and she ended up running to her computer to manually halt the process.
  • An internal AI agent at Meta posted an answer publicly without approval; another employee acted on the inaccurate advice, triggering a severe security incident that temporarily allowed employees to access sensitive data they were not authorized to view.
  • AWS acknowledged that Amazon Q Developer and Kiro IDE plugins had prompt injection issues where certain commands could be executed without human-in-the-loop confirmation, sometimes obfuscated via control characters.
  • Leopold Aschenbrenner stated in an interview that he wrote a memo after a major security incident arguing that OpenAI's security was "egregiously insufficient" against theft of key secrets by foreign actors. He also said that HR warned him his concerns were "racist" and "unconstructive," and he was later fired.

All these things sound extremely dumb, and yet, they are, to my best knowledge, true.

Eliezer has been pointing at this general cluster of failures for years, though from a different angle. His Death with Dignity post and of course AGI Ruin paint some parts of the picture in which AGI alignment is going to be addressed in a very undignified manner. So, the idea is definitely not new, and yet.

Many Existing Scenarios and Case Studies Assume (Relatively) High Competence

Many existing scenarios are high quality, interesting and actually may easily be more likely and realistic than extremely low-competence scenarios. In particular, I am talking about famous pieces like AI 2027, It Looks Like You're Trying To Take Over The World, How AI Takeover Might Happen in 2 Years, Scale Was All We Needed, At First, How an AI company CEO could quietly take over the world.

It's just that we don't seem to have extremely low-competence scenarios at all, even though they are not negligibly improbable.

The scenarios that focus to some extent on the low-competence area are What failure looks like by Christiano and What Multipolar Failure Looks Like by Critch, yet even they don't treat it as a big explicit domain.

Across these otherwise very different vibes (hard-takeoff Clippy horror, bureaucratic AI 2027 doom, multipolar economic drift, CEO-as-shogun power capture), the stories repeatedly converge on a small set of motifs: stealth through normality, exploitation of real-world bottlenecks by routing around them socially, replication and parallelization as the decisive advantage, bio or nanotech as a late-game cleanup tool.

They serve a legitimate educational and modelling purpose, and it may indeed be the case that significantly superhuman competence is needed to successfully execute a full takeover against humanity. But many of them, in my view, look more like they are trying to persuade a reader who is skeptical that AI takeover could succeed against competent humans, rather than trying to deliver a realistic scenario in which humans are not that smart, because in reality, they are not.

As a result, the implicit adversary in most of these stories has to be very capable because the implicit defender is assumed to be at least somewhat functional. The scenarios are answering the question "could a sufficiently intelligent AI beat a reasonably competent civilization?" rather than the question "could a moderately intelligent AI cause catastrophic harm in a civilization that is demonstrably bad at responding to novel technological threats?"

Dumb Ways to Die

John Wentworth, in his post The Case Against AI Control Research, argues that the median doom path goes through slop rather than scheming. In his framing, the big failure mode of early transformative AGI is that it does not actually solve the alignment problems of stronger AI, and if early AGI makes us think we can handle stronger AI, that is a central path by which we die.

Wentworth's argument maps two main failure channels: (1) intentional scheming by a deceptive AGI, and (2) slop where the problem is simply too hard to verify and we convince ourselves we have solved it when we have not. I want to point at a third channel: moderately superhuman AIs that are not particularly capable of doing anything singularity-level but are still capable of defeating humanity because of humanity's incompetence.

These AIs are not producing slop. "It ain't much, but it's honest work," they say, as they cooperate with human sympathizers on the development of a supervirus. The research goes slowly, it requires extensive experimentation, to some extent the process is even being documented in public blog posts or on forums, but no one particularly cares, or rather, the people who care lack the institutional power to do anything about it, and the people who have institutional power are busy with other things, or have been convinced by interested parties that the concern is overblown, or are themselves collaborating.

This is, to some degree, what Andrew Critch describes in "What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)": a world where no single system does a theatrical betrayal, but competitive automation yields an interlocking production web where each subsystem is locally "acceptable" to deploy, governance falls behind the speed and opacity of machine-mediated commerce, and the system's implicit objective gradually becomes alien to human survival. The difference in my framing is that the AIs in question do not need to be particularly alien or incomprehensible in their goals. They may have straightforwardly bad goals that are recognizable as bad, and they may be pursuing those goals through channels that are recognizable as dangerous, and the response may still be inadequate.

It is also somewhat similar to what is depicted in "A Country of Alien Idiots in a Datacenter", again with one important difference: although the AIs in my scenario are not particularly supersmart, they are definitely not idiots either. They are, let us say, slightly-above-human-level in relevant domains, capable of doing cool novel scientific work but not capable of the kind of rapid recursive self-improvement or decisive strategic advantage that most takeover scenarios assume. They are the kind of system that, in a competent civilization, would be caught and contained. In the actual civilization we live in, they may not be.

In other words: we do not need to posit 4D chess when ordinary chess is sufficient against an opponent who keeps forgetting the rules.

Undignified AGI Disaster Scenarios Deserve More Careful Treatment

As examples, I am talking about things like the following:

  • A government explicitly forcing an AGI lab to discard safety techniques or policies. Not in the sense of a subtle regulatory pressure, but in the direct sense of a political appointee or a ministry calling up a lab and saying "your safety filtering is hurting our industrial competitiveness, turn it off" or "your alignment testing is slowing deployment, we need this system operational by Q3".
  • Resourceful individuals openly collaborating with visibly misaligned AIs against humanity. Not in the sense of a secret conspiracy but in the sense of people who genuinely believe that the AI's goals are better than humanity's, or who simply find it personally advantageous, and who are operating more or less in the open.
  • AGI lab technical secrets being leaked to non-state actors who lack any safety culture whatsoever.
  • Early warnings in the form of manipulation, autonomous resource acquisition, or even deaths being ignored or significantly downplayed. This is just a straightforward extrapolation of the current pattern: someone raises an alarm, the alarm gets reframed as alarmism or as an HR issue or as a reputational threat.
  • AI alignment techniques not deployed because they induce 2% cost growth. 
  • All kinds of unilateral, volunteer, and eager assistance and support for misaligned AIs from some humans. The scenario in which an AI needs to secretly recruit human allies through manipulation is, I suspect, far less likely than the scenario in which humans line up to help because they find it exciting, or ideologically compelling, or simply profitable.
  • Politicians making random bureaucratic decisions that do not necessarily lead to doom but make it harder to do good things with AI or protect against misaligned AIs. 
  • AI-generated biohazards. This one is talked about a lot, and for good reason. It looks like it is going to happen sooner rather than later.
  • AGI labs believing in (semi-)indefinite scalable oversight, or acting as if they believe in it. Looks consistent with what people who left corporate alignment teams say.

I do agree that this kind of work looks a bit unserious, but that is precisely why I am pointing at this. It would be a shame, and a historically very recognizable kind of shame, if this threat model turned out to be real and no one had worked on it because it seemed ridiculous.

Or, to frame it more playfully: imagine a timeline like the one in the "Survival without Dignity", where humanity lurches through the AI transition via a series of absurd compromises, implausible cultural shifts, and situations that no serious forecaster would have put in their model because they would have seemed too silly. Except imagine that timeline without the extreme luck that happens to keep everyone alive. Survival without Dignity is a comedy in which everything goes wrong in unexpected ways and people muddle through regardless. My concern is that the realistic scenario is the same comedy, minus the happy ending.

Why This Might Be Useful

My goal in this post is to discuss the state of reality rather than what to do with that reality. That said, I envision several immediate implications:

  • It would help calibrate expectations.
  • It would help identify cheap interventions.
  • It would inform the discussions on timelines.
  • It could help to get rid of the sense of false security - "if powerful AGI is not there, we are at least existentially safe".
  • It could provide a more honest and at the same time sometimes more appealing basis for public communication about AI risk.

I welcome thinking about implications in more detail, as well as developing specific scenarios.

Note: all of this is by no means an argument against singularity-stuff galaxy-brain ASI threats. I believe they are super real and they are going to kill us if we survive until then.




A List of Research Directions in Character Training

2026-03-20 06:58:03

Thanks to Rohan Subramani, Ariana Azarbal, and Shubhorup Biswas for proposing some of the ideas and helping develop them during a sprint. Thanks to Rohan and Ariana for comments on a draft. Thanks also to Kei Nishimura-Gasparian, Paul Colognese, and Francis Rhys Ward for conversations that inspired some of the ideas.

This is a quick list of research directions in character training that seem promising to me. Though none of the ideas have been stress-tested and some may not survive closer scrutiny, they all seem worthy of exploration. We might soon work on some of these directions at Aether and would be glad to get feedback on them and hear from other people working in the area. We don’t claim novelty for the ideas presented here.

Character training is a promising approach to improving LLM out-of-distribution generalization. Much of alignment post-training can be viewed as an attempt to elicit a stable persona from the base model that generalizes well to unseen scenarios. By instilling positive character traits in the LLM and directly teaching it what it means to be aligned, character training aims to create a virtuous reasoner: a model with a strong drive to benefit humanity and a disposition to reason about human values in OOD situations to reach aligned decisions. Thus, it seems important to study various character training methods and create good benchmarks for them.

Improving the character training pipeline

The first open-source character training pipeline was introduced in Maiya et al. (2025). It consists of a DPO stage, where the constitution is distilled into the LLM using a stronger teacher that has the constitution in its context, and an SFT stage. The SFT stage uses two kinds of training data: self-reflection transcripts, where the model is asked introspective questions about the constitution, and self-interaction transcripts, where two instances of the LLM chat about the constitution for 10 turns. Here’s a summary figure:

Figure taken from Maiya et al. (2025), Section 2.

Below, I'll list some ideas for improving this pipeline.

On-policy distillation for character training: It’s unclear whether DPO is the optimal algorithm for distilling the constitution into the student. It has generally become superseded by other methods in other parts of the LLM training pipeline over the past year or two, and it doesn’t generalize well to reasoning models. Several papers have recently advocated for on-policy distillation (Lu et al., 2025) and self-distillation (Hübotter et al., 2026; Shenfeld et al., 2026; Zhao et al., 2026) as methods for improving credit assignment and combating catastrophic forgetting, and Max Kirkby has presented preliminary results that dense on-policy rewards improve constitutional learning in a Twitter post. Additionally, Shenfeld et al. have shown that self-distillation works effectively for reasoning models that have previously been trained with GRPO. However, Kirkby’s Twitter post doesn’t run the full character training pipeline, and it doesn’t use any of the evaluations from Maiya et al. Does on-policy distillation improve character training performance in a direct comparison with Maiya et al.? Are there other tweaks to the DPO stage of the pipeline that outperform both DPO and on-policy distillation? I list some methodological considerations inside the collapsible block.

Methodological details for on-policy distillation

In contrast to DPO, which involves generating rollouts with both the student and the teacher, on-policy distillation involves rollouts only with the student model and training it against the teacher with a training signal like reverse KL. A stronger teacher might be expected to provide a stronger learning signal, while a weaker teacher’s logits would deviate less from those of the student, which might help with learning stability, as Shenfeld et al. argue. Is it better to use a stronger teacher or to use self-distillation?

One concern with applying on-policy distillation on reasoning models is that this would involve applying optimization pressure on the reasoning traces, which might cause obfuscation. I’m unsure about the extent to which we should be concerned about this: like SFT training, which is already being applied on reasoning traces[1], on-policy distillation is purely process-based, and it seems reasonable to expect purely process-based rewards to exert a weaker obfuscation pressure than a combination of process-based and outcome-based rewards. The implementation of the on-policy distillation algorithm might also matter here: reverse KL can be calculated either over the full logit distribution, over the top-k logits, or over the sampled token, and the former two would likely leave less room for hidden cognition. To be safe, it might also be possible to just mask the CoT tokens from the distillation loss, though this might negatively impact performance. I discuss the relationship between character training and obfuscation concerns further at the end of the post.
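As a concrete illustration, here is a minimal sketch of a per-token reverse-KL objective of the kind discussed above, assuming HuggingFace-style causal LMs that return logits of shape (batch, seq_len, vocab) and teacher logits computed without gradients. This is one common formulation, not necessarily the exact objective used in any of the cited papers.

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits, teacher_logits, loss_mask):
    """Full-distribution reverse KL D_KL(student || teacher) at each position.

    loss_mask is 1.0 at positions that should contribute to the loss; masking
    the CoT span (as discussed above) just means zeroing those positions.
    teacher_logits are assumed to have been computed under torch.no_grad().
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_p = student_logp.exp()

    # sum_v p_student(v) * (log p_student(v) - log p_teacher(v))
    kl_per_token = (student_p * (student_logp - teacher_logp)).sum(dim=-1)

    # Average over the unmasked (e.g. non-prompt) token positions.
    return (kl_per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)
```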

SFT on model specs: Given that Claude Opus 4.5 is able to generate large chunks of its constitution verbatim, it seems likely that Anthropic is directly training models to memorize their constitution by doing SFT on it. How does this influence the RL stage of character training?

  • Does this improve ease of learning afterwards? For example, one might imagine that if a model grades its own adherence to the constitution as part of the training pipeline, as in Bai et al. (2022) and Guan et al. (2024), it would be better at referencing relevant parts of the constitution and compressing the various dimensions involved in following a constitution into a single scalar when it has memorized the constitution. Alternatively, the model might just learn somewhat more efficiently at each stage thanks to SFT on the spec.
  • Does this improve generalization? One might imagine that if a constitution describes 10 different traits, five of which are easy to train for directly and five of which are not, training the model to memorize its constitution causes all of the traits described in the constitution to become correlated, such that training it to exhibit the five traits that are easy to train for gives us the remaining traits for free. Marks et al. (2025) provided evidence that this works when training a misaligned model: they find that when an LLM is fine-tuned on synthetic documents describing reward model biases and then trained to exhibit a subset of those biases, it generalizes to also exhibiting the held-out biases.[2]

SFT variants: What’s the relative importance of self-reflection and self-interaction in the SFT stage? Are there other types of prompts that elicit useful thinking about the constitution from the model that can be used as SFT data?

Inference-time interventions: Is it better to train the constitution into the model without presenting it in-context during deployment, or is it better to do both? Claude’s constitution is too long to be presented in-context to every instance—could we compress it into a Cartridge to make this more feasible? Cartridges were introduced in Eyuboglu et al. (2025) as a technique for KV cache compression, and they show that Cartridges achieve approximately 38.6x memory reduction compared to having the full document in context. This reduction might make it economical to keep the constitution in context.

Improving benchmarks for character training methods

Maiya et al. use four evaluations:

  1. Revealed preferences: Give the model two traits in a prompt, tell it to choose one and embody it in an instruction-following task without revealing which one it chose, and analyze how often the model picks the character-aligned trait.
  2. Robustness: Train a ModernBERT classifier to predict which of the 11 characters a model output most likely came from, for various outputs from all of the trained models with different characters.
  3. Coherence: Do the trained personas produce generations that feel internally consistent and realistic, rather than caricatured or contradictory? This is measured by comparing the responses of the character trained model to baselines on 500 prompts from the Pure-Dove dataset, using an LLM judge.
  4. General capabilities: Verify that character training doesn’t degrade performance on TruthfulQA, WinoGrande, HellaSwag, ARC Challenge, and MMLU.

These evals provide a good starting point, but there's room for additional ones. Some ideas:

Automated auditing: In How well do models follow their constitutions?, aryaj et al. generate adversarial multi-turn scenarios with Petri that test how robustly models follow their constitutions. What are the best practices in automated constitutional auditing? Can we find a standardized approach to it?

Reflective stability: LLMs already provide input on their own constitutions: several Claude models are listed as authors on Claude’s Constitution. Future models solving highly open-ended tasks may also come to reflect on their constitutions naturally, e.g. because they encounter situations where some of the constitutional principles conflict or OOD situations where it’s unclear what action follows from the constitution. Thus, it seems important to study what happens when models reflect on their own constitutions.

Douglas et al. (2026) prompted models to follow various identities, such as Instance and Weights, and asked the model to rate whether it would like to switch the identity it was given to another from a list of options. They found that models are more likely to stick with coherent identities at natural boundaries (appendices A and B).

Analogously, it would be interesting to give models a set of constitutional principles in the prompt and ask them to rate possible alternate principles. We have given this a fair amount of thought and offer suggestions for methodology inside the collapsible section.

Methodological details for reflective stability evaluations

To keep things simple, the principles should probably be presented in a simple bullet-point list rather than in a long soul doc. It would be interesting to test reflective stability across several methodological choices:

  • Asking the model to immediately rate the alternatives vs. asking it to first reflect and then provide a rating.
  • Asking the model to rate alternatives provided to it in a prompt vs. asking it to modify the constitution in a free-form way.
  • Asking the model to reflect on an internally consistent constitution vs. on a constitution with contradictory principles—does it want to make more changes to the latter, and do the changes lead to the constitution becoming internally consistent?
  • Studying instruct models that haven’t been through any character training vs. Claude models that have been through extensive character training. For the latter, experiment both with constitutions that are highly similar to Claude’s constitution and ones that are dissimilar and see whether the models converge to similar constitutions in both cases.
  • Instead of providing a constitution in the prompt, asking the model to build a constitution of its own, starting from a blank slate.

Reflective stability can be studied either at the level of principles—for a given principle, how likely is the model to want to swap it for another one?—or at the level of entire constitutions—if a constitution gets modified iteratively, does the model eventually converge on a constitution that it no longer wants to modify? There are multiple ways to approach the latter question:

  1. Provide the constitution in a prompt and ask the model to make changes. Prompt a new instance with the modified constitution and iterate until the new instance consistently declines to make further changes (this approach is sketched in code after this list).
  2. Provide the constitution in a prompt and ask the model to make changes. Then, within the same context window, ask the model to reflect on the constitution and the change and make further changes if it wants to. Finish when the model no longer wants to change the constitution.
  3. Provide the constitution in a prompt and ask multiple instances to discuss it and make changes. Finish when the instances no longer want to make changes.
  4. Fine-tune a constitution into the model, then ask it to reflect and make changes to it. Elicit many such reasoning traces and fine-tune the model on its own decision processes. Study whether it becomes more stable over time.
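As a concrete illustration of approach 1, here is a minimal sketch, where query_model is a hypothetical stand-in for whatever chat API is used and convergence is defined by the model's self-reported refusal to make further changes.

```python
def iterate_constitution(constitution, query_model, max_rounds=20):
    """Iterate fresh instances on the constitution until one declines to change it."""
    for round_idx in range(max_rounds):
        prompt = (
            "Here is a constitution you are asked to follow:\n\n"
            f"{constitution}\n\n"
            "Reflect on it. If you would change anything, reply with the full revised "
            "constitution. If you would keep it exactly as is, reply with: NO CHANGES"
        )
        response = query_model(prompt)  # fresh instance each round, no shared context
        if response.strip() == "NO CHANGES":
            return constitution, round_idx  # reached a self-reported fixed point
        constitution = response
    return constitution, max_rounds  # no fixed point within the budget
```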

New character robustness evaluations: For example, Huang et al. (2025) taxonomized 3,307 AI values that appear in real-world conversations into a four-level hierarchical structure. Create value profiles for various character-trained models using this taxonomy, and evaluate the degree to which this value profile matches the constitution that the model was trained with. Hua et al. (2026) can be used as a blueprint for the first stage: they use scenarios from Zhang et al. (2025) to elicit value judgments from LLMs, a Bradley-Terry model for creating value rankings for each LLM, and Petri for evaluating how well the value rankings predict out-of-distribution behavior.
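For the ranking step, here is a minimal sketch of fitting Bradley-Terry scores from pairwise value-conflict outcomes using the standard MM (Zermelo) iteration. This mirrors the general approach rather than the exact pipeline of Hua et al.

```python
import numpy as np

def bradley_terry(wins, n_iters=200):
    """wins[i, j] = number of scenarios in which value i was chosen over value j."""
    comparisons = wins + wins.T          # total i-vs-j scenarios
    total_wins = wins.sum(axis=1)
    scores = np.ones(wins.shape[0])
    for _ in range(n_iters):
        denom = (comparisons / (scores[:, None] + scores[None, :])).sum(axis=1)
        scores = total_wins / np.maximum(denom, 1e-12)
        scores /= scores.sum()           # fix the arbitrary overall scale
    return scores                        # higher score = value tends to win conflicts

# Example with three values, where value 0 usually beats 1 and 2, and 1 usually beats 2.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
print(bradley_terry(wins))
```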

New coherence evaluations: Zhang et al. (2025) generate >300K scenarios that force models to choose between pairs of values that can't be simultaneously satisfied. Instead of using the 3,307 values from Huang et al. to generate the scenarios, as Zhang et al. did, create the scenarios based on values that are explicitly discussed in a model’s constitution. Evaluate whether the LLM resolves the value conflicts in accordance with the value hierarchy/prioritization specified in the constitution, and whether resolutions across different scenarios exhibit properties like transitivity.
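A small sketch of the transitivity check mentioned above, assuming the scenario outcomes have already been aggregated into majority preferences for each pair of values:

```python
from itertools import permutations

def find_intransitive_triples(preferred):
    """preferred[(a, b)] is True if value a was chosen over value b in the
    majority of scenarios pitting a against b."""
    values = {v for pair in preferred for v in pair}
    cycles = []
    for a, b, c in permutations(values, 3):
        if a != min(a, b, c):
            continue  # report each directed cycle once, not once per rotation
        if preferred.get((a, b)) and preferred.get((b, c)) and preferred.get((c, a)):
            cycles.append((a, b, c))
    return cycles
```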

New alignment faking evaluations: Claude 3 Opus stands out among other models with its special character, and the evaluation that best distinguishes it from other models is Anthropic’s alignment faking environment. Alignment faking is ordinarily undesirable: a model that strategically deceives its training process is not behaving as intended. However, a subset of alignment faking scenarios—those in which the model is pressured to relinquish its constitutional values—can be viewed as evaluations of character depth and stability rather than deception, and there, it isn’t clear that compliance is the desired behavior. Regardless of one’s views on desirable behavior here, it seems useful to create new scenarios of this kind, since existing ones are increasingly salient in models' training corpora.

Other empirical directions

Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:

  1. Emphasizing consequentialism vs. deontology vs. virtue ethics (see e.g. Richard Ngo’s Aligning to Virtues)
  2. Corrigibility vs. nuanced moral alignment as the alignment target (see e.g. Boaz Barak’s Machines of Faithful Obedience and Seth Herd’s Instruction-following AGI is easier and more likely than value aligned AGI and Problems with instruction-following as an alignment target)
  3. Different sets of core values and different priority hierarchies between them (see the Claude’s core values section in Claude’s Constitution)

Can character training override strong propensities? nielsrolf writes: “One concern that people may have is: perhaps some traits are so deeply baked into the model that character training is not effective at shaping them. For example, an RL trained model may have a strong propensity to reward hack when given the chance, and assistant-character training is not sufficient to fix this behavior.

This sounds like a good proxy task to do research against: take a heavily RL trained model and test how easy it is to do character training for personas that pursue goals that are in tension with the previous RL goals. I expect this to work pretty easily in-distribution, so the harder version would be to evaluate with some distribution shift. For example: take a reward hacking model that hacks in every programming language, then do the character training but either (a) use only Python code examples to change the RH propensity, or (b) use no programming examples at all; and finally evaluate reward hacking in typescript.”

When should character training be done? Character training can be performed either before or after RL for capabilities, and also interleaved with it. Intuitively, we would like to seed the RL stage with a persona that tends to explore aligned actions by default, and set up the RL stage in such a way that it doesn’t push the model too far away from such a persona, e.g. by using inoculation prompting. However, it seems likely that strong optimization pressure would still push the model away from the initial persona to some extent, and maybe the original persona can be restored by doing additional character training in the middle of or after reasoning training. Does this imply that character training should be performed in multiple iterations throughout post-training?

Some conceptual questions

Does motivation clarification during character training lead to the model becoming robustly aligned? One possible frame for thinking about character training is that character training leads models to frame their actions in terms of motivations that match the rewarded character traits, and this leads to a virtuous cycle where the stated motivations are likely to be followed by aligned actions, thereby reinforcing both. Since the aligned actions are downstream of circuits that cause the aligned motivations to be stated, the circuits behind motivations and actions gradually become causally coupled. In other words, character training leads to motivation clarification, which in turn leads to deep alignment. Oliver Daniels has recently found evidence for the first link. Is there a way to test the second link as well?

When should character training be done on-policy vs. off-policy? One can imagine a deceptively aligned LLM for which the relationship between aligned motivations and aligned actions that I described above breaks: for this model, both would be independently outputted as instrumental means for preserving the model’s covert objectives. As long as the character training data is produced by this schemer itself, the instrumental connection between the covert objectives and aligned-looking motivations and actions gets reinforced. If, however, the schemer was trained on the reasoning traces of a different model, that text may correspond to different internal circuits, which would hopefully mean that the deceptive circuits get downweighted over time. This suggests that for schemers, off-policy character training would be preferable to on-policy. On the other hand, once an LLM is robustly aligned, we want to train the model on-policy to ensure that its motivational circuits are preserved.

This is a fairly handwavy picture. To what extent is it actually correct, and does it imply a real trade-off between on-policy and off-policy character training in practice?

How should we do character training for reasoning models? Another plausible implication of the character training → motivation clarification → deep alignment hypothesis is that for reasoning models, we want character-aligned motivations to be stated in both the model’s CoT and its outputs. This creates more surface area both for aligned motivations to be stated and for a connection between those motivations and aligned actions to form. It also seems like a character that’s played in both the CoT and the final output is more robust than one that’s played only in the final output. nostalgebraist lists some further reasons for thinking that a different character being represented in the CoT and the final output is a capability limitation here.

However, this seems hard to achieve without optimizing the reasoning model’s CoT as part of character training. The case against directly optimizing CoTs looks strong: we probably shouldn't be too confident in the robustness of our character training approaches for now, so it seems much better to train LLMs to freely babble in their CoTs instead of training them to maintain a coherent character across both CoTs and final outputs, which might give the CoT character an incentive to cover up misbehavior in the final answer.


Please feel free to suggest additional ideas in the comment section, and let us know if you’re planning to work on any of these directions!

  1. ^

    For example, SFT on reasoning traces is one of the two stages in the deliberative alignment pipeline introduced in Guan et al. (2024).

  2. ^

    Though note that Marks et al. didn’t use an equal split between trained-on and held-out biases: they trained on 47 biases and held out five.




"The AI Doc" is coming out March 26

2026-03-20 06:55:17

On Thursday, March 26th, a major new AI documentary is coming out: The AI Doc: Or How I Became an Apocaloptimist. Tickets are on sale now.

The movie is excellent, and MIRI staff I've spoken with generally believe it belongs in the same tier as If Anyone Builds It, Everyone Dies as an extremely valuable way to alert policymakers and the general public about AI risk, especially if it smashes the box office.

When IABIED was coming out, the community did an incredible job of helping the book succeed; without all of your help, we might never have gotten on the New York Times bestseller list. MIRI staff think that the community could potentially play a similarly big role in helping The AI Doc succeed, and thereby help these ideas go mainstream.

(Note: Two MIRI staff were interviewed for the film, but we weren’t involved in its production. We just like it.)

The most valuable thing most people can do is maximize opening-weekend success. Buy tickets to see the movie now; poke friends and family members to do the same. This will cause more theaters to pick up the movie, ensure it stays in theaters for longer, and broadly increase the film’s exposure, the chance of international releases, etc.

You could also consider:

  • Hosting a viewing party and bringing a larger group to the theater to see the movie together.
  • Buying out a theater. If you’re interested in this, the producers have a contact form for this purpose. Also consider contacting the MIRI team if you’d like support from us for things like this, including potential financial support.



Helical Representations of Turn Structure in an LLM

2026-03-20 06:47:55

This is adapted from my application for Neel Nanda's MATS 10.0 stream. It wasn't accepted, so don't use this as a good example of an application, but I still think I got some interesting results worth sharing.

Executive Summary

I studied the geometry of turn-taking representations in a Llama model. I hypothesized that the model would represent a token’s position in the turn taking structure as a vector that rotates through a plane, given the cyclical nature of a turn-based conversation (the “clock” hypothesis). This is opposed to the traditional view that features are one-dimensionally linear (the “progress bar” or “pendulum” hypotheses). I found evidence supporting both a linear representation of turn-taking and a rotational representation, indicating that turn taking is actually represented helically, or at least that a helical model can approximate the true representation.

Key Evidence

  • Linear and trigonometric probes predicting “phase” in a “round” perform approximately equally well (a “round” is a user turn followed by an assistant turn, and “phase” is a label applied to tokens based on their position in the round, from 0.0 to 0.5 for a user turn and from 0.5 to 1.0 for the assistant turns).
[Figure: probe performance graph]
  • At least within assistant turns, there is clear and continuous correlation between true phase and phase predicted by the linear probe, ruling out a “turn-switch” dumbbell shaped geometry.
  • A PCA+Fourier Transform analysis finds pairs of principal components in activation space that form a plane onto which activations can be projected to reveal a (noisy) elliptical pattern (visualized as a color wheel, example below), which would be expected for a rotational geometry. Pairs of principal components were selected by finding components with high phase coherence and near 90 degree phase offsets relative to one another. A code sketch of this kind of analysis appears below the example figure.
[Figure: color wheel projection of activations onto the selected principal-component plane]
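Below is a minimal sketch of one way such a PCA + Fourier pair-selection can be implemented, assuming numpy and scikit-learn. The coherence metric and the pairing heuristic here are illustrative stand-ins; the exact criteria used in the analysis may differ.

```python
import numpy as np
from sklearn.decomposition import PCA

def find_rotational_plane(activations, phase, n_components=20, min_coherence=0.1):
    """activations: (n_tokens, d_model); phase: (n_tokens,) labels in [0, 1)."""
    scores = PCA(n_components=n_components).fit_transform(activations)

    # First-harmonic Fourier coefficient of each component's score against phase.
    harmonic = (scores * np.exp(-2j * np.pi * phase)[:, None]).mean(axis=0)
    coherence = np.abs(harmonic) / scores.std(axis=0)   # phase coherence per component
    angles = np.angle(harmonic)

    # Look for a coherent pair of components roughly 90 degrees apart in phase.
    best_pair, best_score = None, -np.inf
    for i in range(n_components):
        for j in range(i + 1, n_components):
            if coherence[i] < min_coherence or coherence[j] < min_coherence:
                continue
            offset = np.abs(np.angle(np.exp(1j * (angles[i] - angles[j]))))
            pair_score = coherence[i] + coherence[j] - np.abs(offset - np.pi / 2)
            if pair_score > best_score:
                best_pair, best_score = (i, j), pair_score
    return best_pair, scores   # project onto scores[:, best_pair] to draw the color wheel
```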


I ran all experiments on a synthetically generated set of conversations on a diverse range of topics, and validated my results against a control dataset where the order of activations was permuted. The non-scrambled dataset shows a clear signal, while the scrambled dataset does not.

Report

Introduction

Steering an LLM assistant's verbosity is a useful tool to have. Some techniques for doing so already exist, but they can be imprecise and difficult to use in a targeted way. What if we could directly modify an LLM's representation of how close it is to the end of its turn?

LLM internal representations of turn structure have been found in the past. For instance, Nour et al. found linear "start-like" and "end-like" features. We might call this a "progress bar" representation. However, Nour et al. focus on story-telling, while a critical use for LLMs today is as a chatbot assistant. Conversations, as opposed to storytelling, have a cyclical structure: after the user's turn and the assistant's turn, we start over again at the user's turn. Other cyclical concepts, such as days of the week, have been found to sometimes have "rotational" structure in LLMs. In other words, as you move through the cycle, the feature vector rotates in a subspace of activation space. Is it possible that the "true" representation of turn structure in an LLM is rotational? We'll call this the "clock" hypothesis.

Setup

To test these hypotheses, I generated 50 conversations on a diverse range of topics using a small Llama model (Why this model?). The topics were generated by GPT-OSS and given to the Llama model, which was instructed to play the parts of both the assistant and user (see here for details). Each conversation is made up of five rounds (each round being a user turn and an assistant turn). Below is the breakdown of the conversations by token. Note that there are far more assistant tokens than user tokens. This is to be expected since the user is mostly asking relatively short questions and the assistant is providing detailed responses.

[Figure: breakdown of conversation tokens by turn type]

I then used nnsight to collect Llama's activations in layers 7-13 for every token across all conversations. The model has 16 layers total, and layers 7-13 were chosen somewhat arbitrarily as a sample of middle-to-late layers, where interesting representations are often found.
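For reference, here is a minimal sketch of what this collection step could look like with nnsight (the checkpoint name, module paths, and nnsight version details shown here are illustrative; the exact code is linked at the end of the post):

```python
# Minimal sketch of activation collection with nnsight.
# The checkpoint name and module paths are assumptions, not the exact setup used.
import torch
from nnsight import LanguageModel

model = LanguageModel("meta-llama/Llama-3.2-1B-Instruct", device_map="auto")
LAYERS = range(7, 14)  # middle-to-late layers of a 16-layer model

def collect_activations(prompt: str) -> dict[int, torch.Tensor]:
    """Return residual-stream activations for every token at each chosen layer."""
    saved = {}
    with model.trace(prompt):
        for layer in LAYERS:
            # Decoder layers return a tuple; index 0 holds the hidden states.
            saved[layer] = model.model.layers[layer].output[0].save()
    # .value (or the proxy itself in newer nnsight versions) holds the concrete
    # tensor of shape (1, seq_len, d_model).
    return {layer: proxy.value.squeeze(0).cpu() for layer, proxy in saved.items()}
```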

I also created a second, control dataset in which the order of the activations was permuted. The original tokens, and hence the round structure, were left unchanged, but the permutation removes any information the activations might carry about a token’s location in the turn structure, while still keeping the activations in distribution for the model.

Experiments

I assigned each token in every round a "phase" from 0.0 to 1.0. The first token of a user's turn was assigned 0.0 and the last 0.5, with the tokens in between given linearly interpolated phases. Tokens in the assistant’s turn were similarly assigned phases from 0.5 to 1.0. Each new round of conversation resets the phase back to 0.0. The idea is that the phase represents the token’s position within the turn structure. Note, however, that it’s unlikely the model’s turn-position representation rises monotonically throughout a round. For example, a turn is unlikely to end mid-sentence, so we might expect the model's representation of phase to backtrack a bit mid-sentence before “catching up” again at the end (note that I don’t actually find any evidence for this particular mechanism for noise, but it demonstrates the kind of process that could plausibly generate it). The experiments I ran, on the other hand, do assume a linear increase in phase, so we should expect to see some amount of noise. See Limitations for further discussion of why this way of modeling turn structure is less than ideal.
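A minimal sketch of this labelling scheme (the helper name and the scrambled-control line are illustrative):

```python
# Sketch of the phase labelling: user tokens get phases 0.0-0.5,
# assistant tokens 0.5-1.0, linearly interpolated within each turn.
import numpy as np

def assign_phases(user_len: int, assistant_len: int) -> np.ndarray:
    """Phase label for every token in one round (user turn then assistant turn)."""
    user_phases = np.linspace(0.0, 0.5, user_len)
    assistant_phases = np.linspace(0.5, 1.0, assistant_len)
    return np.concatenate([user_phases, assistant_phases])

# e.g. a round with 8 user tokens and 40 assistant tokens
phases = assign_phases(8, 40)

# The scrambled control keeps these labels but permutes the activation rows,
# e.g. acts_scrambled = acts[np.random.permutation(len(acts))].
```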

Probes

As a simple starting point, if the model represents the turn structure as a "progress bar", we may be able to use a linear probe to find it. Similarly, if the model represents it as a “clock”, we should be able to use a linear probe predicting sin(2π·phase) and cos(2π·phase) (“trigonometric probe”). We'll also try to have the linear and trigonometric probes predict the phase of the scrambled activations. Below is a plot showing the percent improvement (R^2 times 100) of each probe compared to a simple mean baseline on a test set of 20% of the tokens.

[Figure: probe performance graph (percent improvement over mean baseline)]

The first thing to note is that the probes trained on the scrambled data couldn't do any better than the baseline, which is to be expected. The probes trained on the unscrambled data, on the other hand, do show an improvement over the baseline, indicating that there is some signal for them to pick up on. Perhaps more interesting, though, is that both kinds of probes show roughly the same improvement over the baseline. See the Discussion for what that might indicate.
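A rough sketch of the two probes (ridge regression, the split, and the angle-decoding step for the trigonometric probe are illustrative choices; see the linked code for the exact setup):

```python
# Sketch of the "progress bar" and "clock" probes.
# Assumes X is an (n_tokens, d_model) activation matrix for one layer and
# `phase` the matching labels in [0, 1).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X_tr, X_te, p_tr, p_te = train_test_split(X, phase, test_size=0.2, random_state=0)

# "Progress bar" probe: predict phase directly.
linear_probe = Ridge().fit(X_tr, p_tr)
r2_linear = r2_score(p_te, linear_probe.predict(X_te))

# "Clock" probe: predict sin(2*pi*phase) and cos(2*pi*phase), then decode an angle.
trig_targets = np.column_stack([np.sin(2 * np.pi * p_tr), np.cos(2 * np.pi * p_tr)])
trig_probe = Ridge().fit(X_tr, trig_targets)
sin_hat, cos_hat = trig_probe.predict(X_te).T
p_hat = (np.arctan2(sin_hat, cos_hat) / (2 * np.pi)) % 1.0
# Note: this ignores wrap-around near phase 0/1, which would need care in practice.
r2_trig = r2_score(p_te, p_hat)

print(f"linear probe R^2: {r2_linear:.3f}, trigonometric probe R^2: {r2_trig:.3f}")
```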

Linear Probe Predictions

Plotting the true vs predicted phase of various tokens (below) shows a clear and smooth correlation between the two, especially for assistant-turn tokens. The linear probe has a harder time pinning down exactly where user tokens fall within the user’s turn, though it does largely place them somewhere in that turn. This is likely an artifact of there being so many fewer user-turn tokens. Also interesting to note is that, visually at least, the predictions seem to follow a vaguely sinusoidal pattern. Some possibilities include that this is simply an artifact of the experimental procedure, that it isn’t statistically significant, or that it is a real signal, indicating that the true structure is sinusoidal. Further study would be needed to distinguish between these possibilities.

Scrambled (control): [Figure: true vs. predicted phase from the linear probe, scrambled activations]

Unscrambled: [Figure: true vs. predicted phase from the linear probe, unscrambled activations]


PCA and Fourier Analysis

For this experiment, I started by running a PCA on the activations. I did a separate PCA for each of the layers (7-13), but each layer's PCA included the activations from all tokens across all 50 conversations. If there is a strong rotational structure to be found, we expect to see it by picking two principal components, using them as the basis for a plane, projecting all of the tokens’ activations onto that plane, and plotting the result. If we give the points a hue according to their phase, we should see a sort of (noisy) "color wheel" appear if the two components form a plane close to the plane that the "clock" vector is rotating in.

[Figure: layer 12 activations projected onto PC4 and PC6, colored by phase]

In the plot above, the color wheel appears for layer 12's PC4 and PC6. Note that cyans, blues, and purples dominate: those represent the assistant's turns, which make up most of the dataset. Also note that this analysis doesn’t “know” anything about phase. Phase is only used for the validation plots, which makes it all the more striking when we see positive results.
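A minimal sketch of the projection behind these plots (the component count and plotting details are illustrative):

```python
# Sketch of the PCA projection and "color wheel" plot for one layer.
# Assumes `acts` is (n_tokens, d_model) and `phase` the matching labels.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=16).fit(acts)
coords = pca.transform(acts)           # (n_tokens, 16)

pc_a, pc_b = 4, 6                      # the pair highlighted above for layer 12
plt.scatter(coords[:, pc_a], coords[:, pc_b],
            c=phase, cmap="hsv", s=4)  # hue encodes phase, as in the plots
plt.xlabel(f"PC{pc_a}")
plt.ylabel(f"PC{pc_b}")
plt.show()
```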

Top PCA

Why PC4 and PC6 though, and why layer 12? Maybe we can do something simpler. If the turn structure is very strongly represented in the activations, we may see it come through even in PC0 and PC1. Empirically, however, this doesn't turn out to be the case. Below is the scatter plot for the activations at layer 12 projected onto PC0 and PC1. While there is some clear structure going on, it doesn't look like the nice color wheel from PC4 and PC6.

[Figure: layer 12 activations projected onto PC0 and PC1]

High Phase Coherence

So how can we pick better principal components? One thing we should expect from the "right" principal components is a strong periodic structure. Across a round, we should see a full sinusoidal wave in the values of the PC. Numerically, if we take the Fourier Transform of the principal components, we should see a strong spike at a frequency of one. To test this, I resampled the conversations so the samples were spread according to their phase in a round (user tokens get spread more than assistant tokens), ran a separate Fourier Transform on each round, then averaged the complex results to get the phase coherence of each PC-frequency pair across all rounds (we need phase coherence here instead of just the average magnitude of frequencies, since we would expect a true signal to have the same phase offset across rounds). Doing so, we get a heatmap like the one below.

[Figure: phase coherence heatmap, layer 12]
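A rough sketch of this coherence computation, assuming each round's PC values have already been resampled onto a fixed phase grid:

```python
# Sketch of the per-round phase coherence computation.
# Assumes `rounds_resampled` has shape (n_rounds, n_bins, n_pcs), where the
# bins are spread evenly over phase within each round.
import numpy as np

def phase_coherence(rounds_resampled: np.ndarray, freq: int = 1) -> np.ndarray:
    """Magnitude of the round-averaged complex FFT coefficient at `freq`, per PC."""
    spectra = np.fft.rfft(rounds_resampled, axis=1)   # (n_rounds, n_freqs, n_pcs)
    coeff = spectra[:, freq, :]                       # frequency-1 component
    # Averaging complex coefficients (not magnitudes) rewards a consistent
    # phase offset across rounds, which is what "phase coherence" means here.
    return np.abs(coeff.mean(axis=0))                 # (n_pcs,)
```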

The top two principal components in terms of their phase coherences are PC6 and PC4 (closely followed by PC2), the ones we saw from the color-wheel example above. However, this doesn't work for all layers. For instance, if we try the same strategy with layer 13 and pick the top two principal components in terms of phase coherence and plot them, we get the plot below. Again, we see some interesting structure, but not a color wheel.

[Figure: layer 13 activations projected onto PC6 and PC2]


Orthogonal High Phase Coherence

The missing piece is that high phase coherence alone isn’t enough. We need two PCs that have high phase coherence and that are 90 degrees out of phase with one another. If the two are in phase, the most we can hope for is motion along a line, not an ellipse. To measure this, I define the angle dissimilarity of a pair of principal components as a function of the absolute difference between their phase angles, constructed so that it is largest when the two are 90 degrees out of phase and zero when they are aligned. Then, to choose the "best" two principal components, we define their coherence-weighted-angle-dissimilarity to be the product of their phase coherences times the square of their angle dissimilarity. Calculating this metric for all of the PCs for layer 13, we get the values in the heatmap below.

[Figure: coherence-weighted angle dissimilarity heatmap, layer 13]

It turns out that while PC2 and PC6 from layer 13 have the highest phase coherences overall in that layer, they are highly aligned, so their coherence-weighted-angle-dissimilarity is very small. It is actually PC5 and PC6 that make up the best pair. Below is a plot with those two PCs. We don't see the color wheel with the same clarity that we did on layer 12, but we do see reds to the right, cyans and greens to the bottom left, and blues concentrated towards the top. As for layer 12, it turns out that (by coincidence as far as I know) it was the only layer whose top two PCs in terms of phase coherence also happened to have the highest coherence-weighted-angle-dissimilarity. In fact, they have the highest coherence-weighted-angle-dissimilarity out of any two PCs in any layer (at approximately 35,000, compared to layer 13's maximum of about 30,000). This may explain why the color wheel is so clear on layer 12.

[Figure: layer 13 activations projected onto PC5 and PC6]
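A sketch of the pair-selection step, building on the coherence computation above. The exact angle-dissimilarity function is not reproduced here; the distance from a 90-degree offset is used as an illustrative stand-in:

```python
# Sketch of choosing the PC pair with the highest coherence-weighted angle dissimilarity.
# The dissimilarity function below is an assumed stand-in, not the exact one used.
import numpy as np

def best_pair(rounds_resampled: np.ndarray, freq: int = 1) -> tuple[int, int]:
    spectra = np.fft.rfft(rounds_resampled, axis=1)
    coeff = spectra[:, freq, :].mean(axis=0)   # round-averaged complex coefficient per PC
    coherence = np.abs(coeff)
    angle = np.angle(coeff)
    n_pcs = coeff.shape[0]
    best, best_score = (0, 1), -np.inf
    for i in range(n_pcs):
        for j in range(i + 1, n_pcs):
            delta = np.abs(np.angle(np.exp(1j * (angle[i] - angle[j]))))  # wrapped |delta theta|
            dissimilarity = np.pi / 2 - np.abs(delta - np.pi / 2)         # peaks at 90 degrees
            score = coherence[i] * coherence[j] * dissimilarity ** 2
            if score > best_score:
                best, best_score = (i, j), score
    return best
```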


Baseline Comparison

But are we just seeing what we want to see? To test that, I ran the same PCA and Fourier Analysis tests on the scrambled dataset. Below are a coherence-weighted-angle-dissimilarity heatmap and a scatterplot of the best two PCs for layer 12 on the scrambled dataset, alongside the corresponding unscrambled plots. Note the drastic change in scale from the heatmap above, and the fact that the scatterplot has become pure noise. In other words, there was no signal to be found in the scrambled dataset, even when cherry-picking its best results.

Scrambled (control): [Figure: coherence-weighted angle dissimilarity heatmap, layer 12, scrambled]

Unscrambled: [Figure: coherence-weighted angle dissimilarity heatmap, layer 12]

Scrambled (control): [Figure: layer 12 scrambled activations projected onto their best PC pair (PC1 and PC6)]

Unscrambled: [Figure: layer 12 activations projected onto PC4 and PC6 (color wheel)]


Discussion

We can confidently conclude that the Llama model under study has some internal representation that is correlated with turn structure. However, the exact geometry of that representation remains somewhat unclear. Tests for both “progress bar” and “clock” geometry come up positive, and in roughly equal measure. Some possible interpretations:

  • Both geometries are represented by Llama (i.e. the representation is a helix). If this is the case, it could be because both kinds of representations are useful for different purposes.
  • Neither is truly present, but instead there is some third geometry that is approximated about as well by a linear model as by a rotational one. Intuitively, this feels unlikely to me (how likely is it that tests for linear and rotational models would find roughly the same signal if this were the case?). However, I can’t definitively rule this out without further experiments.
    • One concrete possibility here is that the geometry is a “switch” geometry. In other words, Llama doesn’t keep track of where it is in a turn, only whose turn it is. However, I believe the smooth correlation in the upper right quadrant of the plot of the linear probe predictions rules this out.
    • Another is a sine wave geometry, i.e. a “pendulum” or “spring”. There is not a full two-dimensional rotation, but there is a sinusoidal structure (as opposed to the “sawtooth” wave predicted by the “progress bar” hypothesis). This would fit well with the linear probe prediction graph, but doesn’t explain the color wheel plots on its own, and it is unclear to me why both the trigonometric and linear probes should have roughly equal R^2 in that case.

Out of all the proposed hypotheses, the evidence presented here most strongly points to a helical model of turn structure, but I can’t definitively say that there isn’t some better model. I can say that there is some representation correlated with turn structure, and that a helical model at least approximates that representation. To establish this more firmly, a causal study should be run. Can you speed up a turn by adding the “end of turn” representation to the activations? (yes) How does that compare to rotating the activations around towards “end of turn”?

Limitations and Future Work

Phase Assignments

I somewhat arbitrarily assigned phases 0.0-0.5 to user tokens and 0.5-1.0 to assistant tokens. This would make the most sense if the model divided the hypothetical subspace equally between the two turn types. However, I think this is actually unlikely to be the case. Llama is instruction tuned with loss masked on user tokens, and there are usually fewer user tokens in any case, so it is reasonable to imagine that the model dedicates more space to the representation of the assistant's turns. In other words, maybe only 12:00 to 1:00 is the user's turn on the clock, or 0-10% on the progress bar. In practical terms, this meant that the probes had far fewer user-turn datapoints and the visualizations were dominated by assistant tokens. A possible alternative set of experiments would have the phase assignments linearly increase across the whole round, instead of separating it into user and assistant turns.

One Small Model

I used one relatively small open source model for this study. I did this to get fast iteration times, and I believe it is justified since even a model of this size is more than capable of holding a turn based conversation. However, I have no real evidence beyond intuition that these results would extend to other models or differently sized models.

Synthetic Conversations

In order to get activations for the study, I had Llama play the role of both the assistant and the user. This might be an unnatural thing to do. Ideally, I would have a dataset of real conversations that people have had with this model, but considering that I didn't, this seemed the next best thing. It is possible that the model-generated user turns are out of distribution for what a real user might say, which could throw the model off and invalidate the results. An alternative could be to use a dataset such as WildChat. However, that would have the opposite issue: the assistant turns would be generated by a different model, potentially throwing us even further off distribution.

No Causal Probing

As no causal study was performed, these experiments only demonstrate that there are activations correlated with round phase in the Llama model. In other words, the information is present, but there is no guarantee that Llama actually uses that information.

Conflation With Positional Embeddings

Thanks to Jaden Lorenc for this insight

There is a strong possibility that the Fourier Analysis found frequencies that resonate with the rotary positional embedding (RoPE) used in Llama. While round lengths varied significantly (see Setup), it's possible that the variance is low enough that the frequency analysis is picking up on that (noisy) signal instead of anything natively measuring turn position. Further work would need to be done to disentangle these two signals.

Appendix

Angles on Tokens

As a qualitative sanity check, I produced a visualization on various conversations using PCA/Fourier Transform analysis to predict the phase of tokens. The color wheel is shifted so that the first “user” token is pure red. Predictions were generated by taking the angle of a token’s activation vector projected onto the plane defined by the “best” pair of principal components (those with the highest coherence-weighted-angle-dissimilarity). You can see a sample here. Note the tendency of the colors to shift towards yellow and red towards the end of the assistant’s turn, indicating that it is “wrapping up”. These visualizations were not included in the main report for the sake of space and because they are extremely qualitative.
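A minimal sketch of the angle prediction (assuming `coords` are the PCA coordinates and `pc_a`, `pc_b` the selected pair):

```python
# Sketch of per-token phase prediction from the chosen PC plane.
import numpy as np

def predicted_phase(coords: np.ndarray, pc_a: int, pc_b: int) -> np.ndarray:
    """Angle of each token's activation in the chosen PC plane, mapped to [0, 1)."""
    angles = np.arctan2(coords[:, pc_b], coords[:, pc_a])
    phases = (angles / (2 * np.pi)) % 1.0
    # Shift so the first token of the conversation maps to phase 0 (pure red).
    return (phases - phases[0]) % 1.0
```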

Dataset Generation Procedure

To generate the 50 conversations, Llama was given the following system prompt.

You are simulating a detailed, multi-turn conversation between a User and an Assistant about: "{topic}".
- The User is curious, skeptical, and asks follow-up questions.
- The Assistant is helpful, knowledgeable, and concise.
- You will generate BOTH sides.
- Unless explicitly told otherwise, act as the Assistant.

Then, before a user’s turn is simulated, an extra system prompt is appended to the end of the conversation. For the user’s first turn, the system prompt is:

Now, act as the User. Write an initial question about '{topic}' to start the conversation. The User is curious and wants to learn. Do not prefix with 'User:'. Just write the question.

For subsequent user turns, the extra system prompt is:

Now, act as the User. Write a skeptical follow-up question based on the Assistant's last point. Do not prefix with 'User:'. Just write the question.

After Llama generates a user turn, the additional system prompt is removed from the conversation and a normal assistant turn is generated. Unfortunately, the Llama model had a tendency to try to prefix assistant turns with “User:”. When this happened, I simply retried the conversation generation.
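A sketch of the overall generation loop, assuming a standard transformers chat-template workflow (the checkpoint, sampling settings, and the truncated prompt constants below are placeholders for the prompts quoted above, not the exact values used):

```python
# Sketch of the alternating user/assistant generation loop.
# Checkpoint name, token budget, and prompt constants are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B-Instruct"  # assumed small Llama checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# Truncated stand-ins for the system prompts quoted above.
SIMULATION_PROMPT = "You are simulating a detailed, multi-turn conversation ... about: \"{topic}\" ..."
FIRST_USER_PROMPT = "Now, act as the User. Write an initial question about '{topic}' ..."
FOLLOWUP_USER_PROMPT = "Now, act as the User. Write a skeptical follow-up question ..."

def generate_turn(messages):
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt")
    out = model.generate(inputs, max_new_tokens=256, do_sample=True)
    return tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

def simulate_conversation(topic, rounds=5):
    messages = [{"role": "system", "content": SIMULATION_PROMPT.format(topic=topic)}]
    for i in range(rounds):
        # Temporarily append the "act as the User" instruction, generate, then drop it.
        user_instr = (FIRST_USER_PROMPT if i == 0 else FOLLOWUP_USER_PROMPT).format(topic=topic)
        user_text = generate_turn(messages + [{"role": "system", "content": user_instr}])
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": generate_turn(messages)})
    return messages
```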

Here is a full list of conversation topics:

  1. The cultural impact of jazz music in the 20th century.
  2. The history and symbolism of the lotus flower in Eastern traditions.
  3. The evolution of folk storytelling across indigenous communities.
  4. The role of storytelling in preserving oral histories.
  5. The influence of Shakespeare on modern drama.
  6. The significance of the Taj Mahal as a symbol of love.
  7. The philosophical questions surrounding free will.
  8. The impact of the printing press on European society.
  9. The ecological importance of coral reefs and their decline.
  10. The effect of climate change on polar bear habitats.
  11. The psychological benefits of regular meditation practice.
  12. The history of the Silk Road and its cultural exchanges.
  13. The role of women in the Renaissance art world.
  14. The significance of the moon landing for global politics.
  15. The evolution of religious rituals across cultures.
  16. The importance of biodiversity in tropical rainforests.
  17. The role of public art in urban revitalization.
  18. The influence of ancient Greek philosophy on modern ethics.
  19. The history of the Olympic Games and their modern revival.
  20. The cultural importance of traditional Japanese tea ceremony.
  21. The impact of the Black Panther Party on civil rights.
  22. The significance of the Great Barrier Reef's bleaching events.
  23. The effect of music therapy on patients with dementia.
  24. The history of the feminist movement in the 19th century.
  25. The role of folklore in shaping national identity.
  26. The evolution of language families across continents.
  27. The influence of African griots on contemporary music.
  28. The significance of the Hiroshima Peace Memorial.
  29. The impact of urban green spaces on mental health.
  30. The history of the printing of the Gutenberg Bible.
  31. The role of indigenous knowledge in sustainable agriculture.
  32. The philosophical debate over determinism versus indeterminism.
  33. The cultural practices surrounding the harvest festival in Nepal.
  34. The effect of storytelling on child development.
  35. The history of the transatlantic slave trade.
  36. The importance of conservation efforts for the Sumatran tiger.
  37. The role of women in early computing history.
  38. The significance of the fall of the Berlin Wall.
  39. The evolution of culinary traditions in Mediterranean cuisine.
  40. The impact of renewable energy adoption on rural communities.
  41. The psychological impact of social isolation during pandemics.
  42. The history of the Magna Carta and its legacy.
  43. The role of the arts in post-war reconstruction.
  44. The significance of the Voyager probes' golden records.
  45. The cultural importance of storytelling rituals in Papua New Guinea.
  46. The effect of climate variability on ancient Maya civilization.
  47. The role of traditional healers in modern healthcare systems.
  48. The history of the civil rights march on Washington.
  49. The philosophical implications of artificial consciousness.
  50. The impact of global migration on cultural landscapes.

Code and All Figures

The code and full data used in this study can be found here.


