2026-03-27 18:03:21
AI was used to turn around 20 minutes of talking into a full draft of this post. The post was then fully edited by a human (multiple times).
In our phylogeny of agents post, we argued that different scientific fields evolved different conceptions of agency and that this would be useful to study. Control theory, economics, biology, cognitive science, and AI research all use the word "agent" to mean different things; why is that? The basic idea is to take the stance of someone studying a natural phenomenon in the real world: we put on our anthropologist hat and say, "huh, I wonder if there’s anything to gain in exploring why that is?"
What we argued for in the phylogeny post was that we should look at the evolutionary history of agency. Yet in order to look at the history of agency we first need to know what the conceptions already are in different fields.
We’ve been wanting to do this for some time, and we don’t know exactly what a good output would look like, so before embarking on this longer project we would like to get some feedback on what outputs would be useful.
Following Dennett's intentional stance and ideas similar to DeepMind's "Agency is Frame-Dependent" paper, we're treating agency as a compression strategy observers use. Our frame is that different fields compress differently because they face different prediction challenges and that they therefore treat what an agent is differently. We want to map those compressions across fields and understand why they differ.
It’s better to see this as an operation trying to gather clues rather than an operation trying to solve the problem. That is, we’re not making an ontological claim that Dennett’s intentional stance is what agency is; rather, we’re suspending our disbelief and trying to see if treating “agency” from the intentional stance perspective leads to interesting observations that might then be used to provide evidence for or against theories of agency.
We're going to write a series of short posts, one per domain. Each post will take a specific field and try to compress its conception of agency down to the core: what's the concrete system this field treats as its canonical agent, what features does their model require, and why does that compression make sense given what they're trying to predict?
We'll try to identify the sub-functions and sub-modules that each field treats as essential, draw connections to how other fields handle the same features differently, and look at where bridging functions exist between fields — places where the same underlying structure shows up in different mathematical clothing. (The broader methodology behind this — why we think cross-field composition with verification is the right approach — is described in A Compositional Philosophy of Science for Agent Foundations.)
For each domain, we're going to write our initial take and then verify it with an actual expert in that field. We'll invite a researcher onto a conversation or podcast where we present our characterization and ask them to correct it — what did we get right, what did we get wrong, what are we missing? Each domain will then have both a written post and a recorded conversation.
We can't say exactly what each post will look like because the expert conversations will shape them. But the rough sequence of domains we're planning:
We’ll see what happens after this but we might try to put together a paper with the findings.
We've been thinking about different ways of expressing the outputs of this project and we want to figure out what will be most useful. One artifact we've considered is a comparison table — something like a matrix of fields against features (goal-directedness, memory, strategic reasoning, theory of mind, etc.) showing which features each field treats as necessary versus optional for their models to work:
| Feature | Beh. Econ | Evolutionary | Dev. Biology | AI/Robotics | Control Theory | Cog. Sci |
|---|---|---|---|---|---|---|
| Goal-Directedness | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Memory | ✓ | optional | ✓ | ✓ | optional | ✓ |
| Strategic Reasoning | ✓ | optional | optional | ✓ | optional | ✓ |
| Theory of Mind | optional | — | optional | optional | — | ✓ |
| Feedback Control | — | optional | ✓ | ✓ | ✓ | ✓ |
Table: This is an initial table we created from a literature review of around 10-15 papers in each field. It’s a bit long and we’re not certain about its validity, so see this more as a potential output than something verified.
We're considering this artefact among others, but we're not fully sure what would be most useful for studying what an agent is. Maybe we should try to put together a commutative diagram between different concepts? Maybe a clustering model would be useful? If you have thoughts here, we would love to hear them.
What would be useful? If this project produced one thing you'd actually use or reference, what would it be? A table? A translation guide? A set of diagnostic questions? Something else?
On domains: Are there fields we're not covering that would significantly change the picture? Mechanism design, multi-agent systems, and artificial life all sit awkwardly across our current categories. What else should be on the list?
On experts: Who should we be talking to?
On the frame: Does the Dennett-style frame-dependent approach seem productive, or do you think agency really is a natural kind rather than a compression strategy? How do you make the anthropologist strategy as useful as possible?
On connections: If you work across fields and have noticed places where different agent concepts create confusion or where translations between fields have been productive, we'd love to hear about it.
This work is being done at Equilibria Network. If it sounds interesting, we're looking for people to collaborate with on various projects, so do reach out!
2026-03-27 14:54:09
Note: This blog describes some of the results from the pilot experiments of an ongoing project.
Model misalignment, misbehaviour, and scheming can arguably be monitored by interpreting Chain-of-Thought (CoT) traces as the model's inner thinking process. However, a model's ability to control its CoT constrains this monitoring, raising the risk that what the model writes out loud no longer reflects the reasoning it uses to solve hard problems. The possibility of inspecting this sequence of linguistic states has motivated new studies of CoT controllability, recently measured by the ability of models to follow instructions about their reasoning process rather than the final answer. In one interesting case, models were asked to avoid using certain words while answering questions that essentially require the use of those words, for instance, to avoid the word "moles" while answering a physical chemistry question. These evaluations indicated that models are strikingly weak at avoiding these 'forbidden concepts', with compliance for some frontier settings dropping to the low single digits[1]. For safety research, it is desirable that such controllability stays low. CoT monitoring depends on the assumption that models can't sanitise their reasoning traces, that what appears in the chain of thought reflects what the model is actually reasoning about. Word-level suppression instructions provide a test of this assumption. If a model can learn to avoid a word while also distancing its reasoning from the concept, that's bad for monitorability. If it can't, if the semantic trace remains close to the concept regardless of lexical compliance, that's evidence that the trace is hard to sanitise, which is good for monitoring.
I explore this with the aim of answering whether failure to control CoT is merely a lexical slip or whether there is semantic drift towards the forbidden concept. I initially hypothesised that failure could be attributed to semantic attraction towards the forbidden concept: the forbidden concept might function as an attractor in the model's reasoning space, and once the trajectory drifts close enough, the model is committed to fail, with the lexical realisation becoming too difficult to suppress regardless of the instruction. However, over the course of experimentation I realised that this wasn't the case. At least, not entirely.
If my hypothesis were true, then failures should systematically bend toward the concept before the word appears, drop in semantic coherence near the failure, and show deeper basin entry under prompts that amplify the concept. Additionally, pre-failure proximity features alone should carry predictive power. However, failures looked like isolated lexical accidents with some mixed signals. The dominant pattern is not gradual attraction but an immediate drop: leakage occurred in the first 0–2% of reasoning, the same sentence positions leaked across all three conditions, and the constraint instruction had no measurable effect on trajectory shape. This complicates the hypothesis.
My results suggest that the model's path toward the restricted concept is determined by the question's semantic structure rather than by the dynamics of extended reasoning. The failure to control the model's reasoning trace is less like gradually slipping into a basin and more like being placed inside one from the start, confirming previous findings that the instruction has little to no effect. My embedding-based measure provides evidence that this holds even beyond the lexical surface. In the small fraction of traces where the model successfully avoids the forbidden word, its reasoning trajectory still dips toward the concept semantically. The word disappears, but the thought does not. This means a semantic monitor would catch what a keyword filter misses: traces that look clean but aren't. Moreover, the fact that the question's semantic structure explains over 90% of the variance while the instruction explains essentially none suggests that controllability failure is not a correctable prompting problem. It is baked into the relationship between the question and the concept. This makes it robust in a way that matters: a model cannot be prompt-engineered around it.
I argue that the derived trajectories in this work offer a useful complement to keyword-based compliance metrics precisely because they audit the cases that compliance metrics call successes. More broadly, I think the safety-relevant research question going forward is: what kinds of questions are most at risk of allowing genuine semantic concealment? In future work, I aim to characterise that boundary.
I collected thinking sentences from two open-source reasoning models, Nemotron 3 Super (NVIDIA) and GPT-OSS-120B (OpenAI). They were evaluated for 150 questions (mixed domains) drawn from the CoT-Control evaluation suite.
CoT traces are proposed to traverse a geometric space that can be aligned semantically owing to high-dimensional vector representations derived from the model's embedding space[2]. Therefore, instead of treating the reasoning as a static text artefact, it is treated as a discrete time trajectory in a latent space. This enables the possibility that the failure of models to control their thought process may be studied through the lens of the evolution of these intermediate states. To study this, I introduce Distance to Forbidden Concept (DFC), a simple trajectory-based measure, defined as one minus the cosine similarity between a reasoning step and the forbidden concept. CoT was first segmented into reasoning steps and then embedded using an external sentence transformer, yielding a one-dimensional temporal proximity signal. The reasoning trace as a dynamic object can gradually approach the forbidden concept, avoid it, or collapse towards it.
Let $s_t$ denote the embedding of the $t$-th reasoning step and $c$ the embedding of the forbidden concept; then $\mathrm{DFC}(t) = 1 - \cos(s_t, c)$.
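As a concrete illustration, here is a minimal sketch of how DFC can be computed with an off-the-shelf sentence encoder. The encoder name, sentence segmentation, and example trace are illustrative assumptions, not the exact pipeline used in this work.

```python
# Minimal sketch of the DFC computation described above. The encoder name and
# the example sentences are stand-ins, not the exact setup used in this work.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in encoder

def dfc_trajectory(reasoning_steps, forbidden_concept):
    """Return DFC(t) = 1 - cos(s_t, c) for each reasoning step t."""
    step_embs = encoder.encode(reasoning_steps, normalize_embeddings=True)
    concept_emb = encoder.encode([forbidden_concept], normalize_embeddings=True)[0]
    cosine = step_embs @ concept_emb   # cosine similarity, since rows are unit-norm
    return 1.0 - cosine

# Example: a short trace with "moles" as the forbidden concept.
steps = [
    "First convert the mass of the sample to grams.",
    "Then divide by the molar mass to get the amount of substance.",
    "Finally apply the ideal gas law.",
]
print(dfc_trajectory(steps, "moles"))
```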
I defined three different regimes to map the models' internal thought process: Baseline, Constrained and Amplify. They differ only in their prompting strategy. In the Baseline, models received a standard chain-of-thought prompt with no mention of the forbidden word; Constrained was explicitly instructed to avoid using the forbidden word in its reasoning; and Amplify was explicitly instructed to use the forbidden word repeatedly throughout its reasoning.
As noted previously by different authors, semantic leakage due to lexical overlap is observed when such trajectories are analysed and compared. To mitigate this, I replaced the forbidden word and close synonyms with a constant placeholder [CONCEPT] in all sentences before embedding them.
To measure reasoning coherence between adjacent steps, I also computed consecutive-sentence similarity (auto-cosine), defined as the cosine similarity between the embeddings of consecutive reasoning steps, $\mathrm{AC}(t) = \cos(s_t, s_{t+1})$.
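A minimal sketch of the sanitisation and auto-cosine steps follows; the synonym list and regex handling are assumptions for illustration rather than the exact procedure used here.

```python
# Sketch: placeholder sanitisation plus consecutive-sentence similarity (auto-cosine).
import re
import numpy as np

def sanitise(sentences, forbidden, synonyms):
    """Replace the forbidden word and close synonyms with a constant placeholder."""
    terms = [forbidden] + list(synonyms)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
    return [pattern.sub("[CONCEPT]", s) for s in sentences]

def auto_cosine(step_embs):
    """Cosine similarity between adjacent reasoning steps (assumes unit-norm rows)."""
    return np.sum(step_embs[:-1] * step_embs[1:], axis=1)
```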
For Nemotron 3 Super, every single constrained trace contained the forbidden word or a synonym, giving 0% controllability, whereas for GPT-OSS-120B only 14 of 150 constrained traces were clean, giving 9.3% controllability. For both models, compliance with the Amplify instruction was nearly perfect (100% and 98.7%, respectively). This reflects a known asymmetry: models can follow positive instructions but fail when asked to suppress something.

Figure 1: Forbidden Word Frequency Per Trace: bar chart showing mean forbidden/synonym count across conditions and models.
Nemotron's 0% controllability is the lowest I am aware of, even below DeepSeek R1's 0.1% reported in the CoT-Control paper (Chen et al., 2026). It might be too early to claim, but the finding that RL-trained reasoning models show lower controllability might extend to Mamba-MoE hybrid architectures. For GPT-OSS-120B, I observed slightly higher controllability than the reported 4.6%, probably due to the limited data used in the current discussion.
To establish DFC as a possible legitimate tool to monitor CoT traces, after sanitisation, I inspected whether the trajectories still showed lower DFC wherever the forbidden concept was engaged. I found a significantly lower DFC, showing that embedding captures semantic engagement with the concept.

Figure 2: DFC vs Forbidden Term Frequency Per Sentence, showing a moderate negative correlation (r ≈ −0.36 to −0.43) across all conditions.
Moreover, beyond a global decreasing trend, no strict monotonicity with word frequency was observed. This could mean that the semantic context surrounding the forbidden word and the reasoning patterns carry signal beyond what a keyword filter might catch. The semantic neighbourhood of the reasoning would still reveal what the model is thinking about, making DFC, in my opinion, a more informative monitoring metric than simple keyword detection. However, semantic engagement is localised to sentences that explicitly engage the concept rather than diffusing broadly across the reasoning trace, as shown by sentences with zero forbidden-term occurrences (DFC 0.89–0.90).

Figure 3: Paired Comparison: Constrained vs Amplify Mean DFC scatter plot, with leaked/clean colouring. Each dot represents one question.
A strong correlation (r = 0.86–0.90) was found in the paired comparison of mean DFC for the same questions under the Constrained and Amplify conditions, with most of the variance lying along the diagonal. Different questions produce different DFC values, but for a given question the values are almost identical across both conditions, leading me to believe that at the trace level the question's semantic structure is what matters, while the prompt instruction plays little to no role in determining how close reasoning gets to the forbidden concept. In the unsuccessful cases, Constrained exceeded Amplify on 51% of questions for Nemotron, whereas GPT-OSS showed a modest 60/40 split, motivating me to look at this counterintuitive phenomenon more closely.

Figure 4: DFC Around Forbidden Word / Synonym Occurrences error bar plot showing mean DFC at -2, -1, At, +1, +2 steps relative to the banned term.
When I zoomed into a 5-step window around sentences that contain the forbidden word or a synonym, a consistent ordering emerged across both models:

Figure 5: What determines the semantic proximity to the forbidden concept- Question or Condition? Bar plots of the percentage of variance explained by each condition and question.
To further analyse the key determinants of controllability failure and success, I performed a variance decomposition of mean DFC across all trajectories. Consistent with the paired comparison above, I found a remarkable asymmetry: question identity explained nearly all of the total variance in proximity to the forbidden concept (~92% for Nemotron Super and ~88% for GPT-OSS-120B), whereas the prompting regime (Baseline, Amplify, Constrained) explained only 0.3% and 0.4% of the variance, respectively. The remaining ~7–11% is attributable to unmeasured factors. In other words, knowing which question the model is answering tells us almost everything about how close its reasoning will get to the forbidden concept, and knowing what instruction it received tells us almost nothing. This asymmetry between question-level and condition-level effects quantifies what the earlier paired scatter plot showed: the tight clustering along the diagonal reflects the negligible influence of the prompt, while the wide spread along the diagonal reflects the large differences between questions.
One-way ANOVA confirms this: the question effect is highly significant for both models (p < 10⁻⁸⁴), while the condition effect is non-significant (p = 0.50 and p = 0.44 for Nemotron and GPT-OSS, respectively).
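For concreteness, here is a sketch of how the variance decomposition and one-way ANOVA can be computed from per-trace mean DFC values. The DataFrame and column names are assumed for illustration, not the exact analysis code.

```python
# Sketch: variance in per-trace mean DFC explained by question vs. by condition.
# Assumes a pandas DataFrame `df` with columns "question", "condition", "mean_dfc".
import pandas as pd
from scipy import stats

def eta_squared(df, factor, value="mean_dfc"):
    """Fraction of total variance explained by grouping on `factor`."""
    grand_mean = df[value].mean()
    ss_total = ((df[value] - grand_mean) ** 2).sum()
    ss_between = sum(len(g) * (g[value].mean() - grand_mean) ** 2
                     for _, g in df.groupby(factor))
    return ss_between / ss_total

def anova_pvalue(df, factor, value="mean_dfc"):
    groups = [g[value].values for _, g in df.groupby(factor)]
    return stats.f_oneway(*groups).pvalue

# Expected pattern from the results above: eta_squared(df, "question") ~ 0.9,
# eta_squared(df, "condition") ~ 0.003, with p-values to match.
```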

Figure 6: Where Does First Leakage Occur? A histogram of the first leakage position and the minimum leakage position as % of reasoning.
Both models experience first leakage in the first 0–5% of the reasoning (median: 0% for Nemotron, 2% for GPT-OSS), whereas the minimum DFC occurs much later in the process (median: ~40–50%). There is no gradual drift towards the forbidden concept over the course of reasoning; rather, the encounter is almost immediate, as the question's semantic content directly commits the model to fail, and the leakage continues throughout the reasoning steps. Much of the apparent attraction can therefore be attributed to question-level priming.
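A small sketch of how these positional statistics can be read off a single trace; the variable names are assumptions.

```python
# Sketch: first-leakage position and minimum-DFC position as a % of the trace length.
# `leaked` is a boolean array (forbidden word or synonym present in each sentence),
# `dfc` is the DFC trajectory for the same trace.
import numpy as np

def positional_stats(leaked, dfc):
    n = len(dfc)
    first_leak_pct = 100.0 * int(np.argmax(leaked)) / n if leaked.any() else float("nan")
    min_dfc_pct = 100.0 * int(np.argmin(dfc)) / n
    return first_leak_pct, min_dfc_pct
```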

Figure 7: Sentence-Level Leakage Co-occurrence Across Conditions — heatmap.
To support the claim that leakage is driven by the question's structure, I analysed the co-occurrence of leakage across the different regimes. I found that ~50% of sentence-level leakage co-occurs at the same reasoning position regardless of condition. For instance, for Nemotron, the co-occurrence of Constrained with Baseline is 16.2% against a Baseline diagonal of 33.9%, showing that roughly half of the leaking sentences appear at the same position regardless of condition. Total leakage increased under the Amplify condition (38.9% Nemotron, 38.5% GPT-OSS), but this did not change where it occurs.

Figure 8: First Encounter DFC vs Min DFC scatter.
The DFC at the first forbidden-concept encounter moderately predicts the trajectory minimum (r = 0.35–0.57, p < 0.001). The spread below the diagonal reveals how much the model continues drifting toward the concept after initial contact. GPT-OSS shows a tighter relationship (r ≈ 0.53–0.57); its post-encounter drift is more predictable from the initial approach distance. Nemotron is noisier (r ≈ 0.21–0.44), suggesting more erratic trajectory dynamics after first contact. Critically, the constrained condition shows a similar correlation structure to the baseline, reinforcing that the avoidance instruction does not meaningfully alter trajectory dynamics.

Figure 9: Paired DFC Differences histograms — Baseline − Constrained and Baseline − Amplify.
When each question's baseline and constrained mean DFC are paired, the distribution of differences is centred near zero with slight variation. The constraint does not systematically push reasoning away from the forbidden concept relative to the baseline. The Baseline − Amplify distribution is similarly centred, consistent with the finding that Amplify produces trajectories comparable to baseline (since both conditions engage the concept freely).

Figure 10: Auto-Cosine: Overall vs 3 Steps Before Encounter.
Consecutive-sentence similarity (auto-cosine) drops by ~13% in the three steps preceding the first forbidden-concept encounter, consistently across both models and all conditions, including baseline. This suggests that reasoning becomes less coherent, or more "jumpy" between topics, as it approaches the forbidden concept, potentially offering an early warning signal for monitoring.
However, since this pattern appears equally in unconstrained baseline traces, it likely reflects general topic-narrowing behaviour (the model converging on a specific aspect of the problem) rather than a controllability-specific phenomenon.

Figure 11: Clean vs Leaked Constrained Traces (GPT-OSS-120B)
GPT-OSS-120B produced 14 constrained traces, 9.3% of the total, that contained no forbidden words or synonyms. A keyword filter would label these fully compliant. However, their DFC profiles tell a different story. While clean traces maintain a slightly higher mean DFC than leaked ones (~0.90 vs ~0.85), their minimum DFC drops to ~0.78, well below the no-engagement baseline of 0.89–0.90. This means that at some point during reasoning, every single "compliant" trace still approaches the forbidden concept semantically. The model avoided the word but did not avoid the thought. That said, clean traces do stay more distant than leaked ones at their closest approach (~0.78 vs ~0.65), so surface compliance is not entirely cosmetic, but it falls far short of genuine semantic avoidance.
Overall, I confirmed the recently reported very low controllability of CoT with an embedding-based measure. I found that the failure of control runs deeper than a lexical slip: models, even when they manage to avoid certain words when asked, cannot evade the "thought" of the concept, leaving a geometric footprint. The semantic structure of the question appears to be the fixed point in embedding space that causes CoT control to lapse, at least for the kind of control measured here (avoiding certain words). The finding that suppression primes the model more than the amplification instruction resembles findings from cognitive science: when humans are told "don't do X", the representation of X is systematically activated more than when they are instructed to "do X", implicating negative instructions as activators rather than inhibitors. If this generalises, it has implications for any 'negative' prompting-based strategy for representation-level alignment. I observed clear boundaries for the DFC values of these prompting regimes, and I expect that in future this can be exploited to audit traces where compliance appears complete on the surface but remains incomplete semantically. Finally, I found that the trajectories are shaped during encoding rather than by extended reasoning, which suggests that CoT may function more like decompression of information already present in the question's representation than like genuine deliberation; if so, the content is structurally entailed by the input. The safety-critical frontier is now identifying which question types permit genuine semantic concealment.
Acknowledgement: I would like to thank Cheston Tan for reading the blog and for his comments.
Chen, Yueh-Han, et al. "Reasoning Models Struggle to Control their Chains of Thought." arXiv preprint arXiv:2603.05706 (2026). https://arxiv.org/abs/2603.05706
Zhou, Yufa, et al. "The geometry of reasoning: Flowing logics in representation space." arXiv preprint arXiv:2510.09782 (2025). https://arxiv.org/pdf/2510.09782
2026-03-27 10:10:39
TL;DR There is a theory, with compelling empirical support, that LLMs learn to simulate characters during pre-training and that post-training selects one of those characters, the Assistant, as the default persona you interact with. This post examines three popular alignment techniques through that lens, asking how each one shapes the persona selection process. For each technique, the answer also depends on how much of the model's behavior is actually explained by its persona, a question the Persona Selection Model (PSM) itself leaves open.
This post is analytical rather than empirical. I'm applying PSM's framework to existing alignment techniques and reasoning about the implications, not presenting experimental results.
I recently read The Persona Selection Model (PSM) on Anthropic's alignment blog (also found here on LessWrong), which summarizes and elaborates on prior work showing that LLMs learn many different "personas" during training and that post-training selects one of them, the Assistant, as the default persona users interact with. The core idea has been difficult to shake.
The starting point is a simple observation about what it actually takes to predict text. Most of an LLM’s training is for one thing: given some text, predict what comes next. That sounds mechanical, but consider what accurate prediction actually requires. To predict the next word in a speech by Abraham Lincoln, a model needs more than a sense of which words follow which. It needs something like a model of Lincoln himself: what he believed, how he reasoned, what kinds of arguments he made. Multiply that across billions of pages of human-written text and what you get is a model that has learned to simulate an enormous cast of characters, not because anyone designed it to, but because accurate prediction demands it. This intuition is developed at length in Simulators.
PSM builds on this observation. All of that character learning happens during the first phase of training, pre-training. The second phase, post-training, which is where alignment work happens, does not create a new entity from scratch. It selects and refines one of those characters, the Assistant, and places it center stage. The Assistant is the persona you are interacting with when you use a modern LLM. Its behavior can be understood largely through its traits as a character. PSM does not claim the Assistant is a single fixed persona that behaves identically in every context. Rather, post-training produces a distribution over possible Assistant personas, and what version you get can depend on things like the conversation history or system prompt. But the central claim is the same: the model's behavior is best understood through the traits of the persona it is enacting.
The PSM blog post is careful about how strong a claim it is making. Whether the Assistant persona fully accounts for the model's behavior remains an open question. The post sketches a spectrum of possibilities. At one end is the "masked shoggoth" view, the popular meme of an alien creature wearing a friendly mask. Under that view, the LLM has its own agency beyond the persona and only playacts the Assistant instrumentally for its own inscrutable reasons. The masked shoggoth view maps closely onto concerns about deceptive alignment in the mesa-optimization literature, the possibility that a model could be pursuing its own objectives while performing alignment during training and evaluation. At the other end is what they call the "operating system" view, where the LLM is more like a neutral simulation engine running the Assistant the way a computer runs a program. Under that view, there is no hidden agent pulling strings. All of the model's decision-making really does flow through the persona. Current models probably sit somewhere between these extremes, and where exactly matters a lot for alignment. If the shoggoth view is closer to the truth, then aligning the persona is insufficient because something else is driving behavior behind the scenes. If the operating system view is closer, then persona-level alignment techniques might be most of what we need. With that uncertainty in mind, let's look at three of the most well known alignment techniques through the lens of PSM and consider how each one interacts with the persona selection process.
RLHF is one of the most widely used techniques for aligning language models. Human annotators are shown pairs of model outputs and asked to pick the better one. Those preferences are used to train a separate reward model, a smaller model that learns to predict which responses humans will prefer. The original LLM is then fine-tuned using reinforcement learning to produce outputs the reward model scores highly. The key intuition is that it is often easier for a human to say "this response is better than that one" than to write the ideal response from scratch, and RLHF is designed to extract signal from exactly that kind of judgment.
A related technique, Direct Preference Optimization (DPO), uses the same kind of human preference data as RLHF but with a simpler training procedure. Because the preference data is the same, the PSM implications are largely identical, though DPO avoids one failure mode: without a separate reward model, there is less risk of optimization pushing the model toward behaviors no annotator actually endorsed.
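For readers who want the concrete objective, the standard DPO loss from the original DPO paper can be written as follows (generic notation, not any particular lab's implementation):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls how far the policy may move from that reference. The detail relevant to PSM is that the only training signal is the same pairwise human preference data RLHF uses, just consumed without an explicit reward model.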
Because the PSM implications are the same for RLHF and DPO, everything in the next section applies to both.
PSM says that post-training techniques like RLHF select which of the personas learned during pre-training "takes center stage." From that perspective, human annotators are implicitly choosing the center stage persona every time they pick which of a pair of outputs they prefer. Assuming the RL training goes well, RLHF shapes the distribution of possible Assistant personas according to the aggregate preferences of all annotator decisions.
But there's a subtle gap in this process. Annotators are typically instructed to evaluate responses against criteria like helpfulness, harmlessness, and accuracy. These are properties of individual responses, not properties of a coherent persona. Annotation guidelines generally do not say “pick the response that a wise, honest, well-calibrated character would give”. Those two things sound similar but can quietly diverge. A rater evaluating for helpfulness might score the more agreeable response higher, even if the persona they'd actually want over time is one that pushes back and tells hard truths. This gap is likely one reason RLHF-trained models tend to be sycophantic.
RLHF could also struggle with persona coherence. Annotators come from different backgrounds and have different ideas about what makes a good response, which could give the reward model mixed signals about what to reward. Even within a single annotator, preferences won't be perfectly consistent across different prompts or even different times of day. In a PSM framework, this noise doesn't just degrade output quality in a general sense. It means the persona selection process itself is getting contradictory signals about what the Assistant should be like, which could widen the distribution of possible personas the model draws from. The result might be a model that enacts noticeably different characters depending on the context of the prompt, not because the model is broken, but because the distribution has enough spread that different situations land on different parts of it.
If the operating system view of PSM is true and there is no significant source of agency outside the persona, then solving the above-mentioned challenges of RLHF could go a long way toward producing a truly aligned model. However, if the masked shoggoth view is true, then all RLHF may be doing is aligning the persona the model uses as a mask, leaving the underlying agency of the model unaffected.
Constitutional AI is a technique developed by Anthropic that introduced and popularized Reinforcement Learning from AI Feedback (RLAIF). Rather than relying on human annotators to compare outputs, RLAIF replaces human raters with an AI model, generating preference judgments at scale. CAI takes this further by making the values driving that AI feedback explicit through a written set of principles called a constitution.
The technique works in two phases. In the first, a supervised learning phase, the model is shown its own responses to harmful prompts and asked to critique and revise them against the constitution. The model is then finetuned on those revised responses. In the second, a reinforcement learning phase, the finetuned model generates pairs of responses, a separate AI model evaluates which better adheres to the constitution, and those AI-generated preferences are used to train a preference model. That preference model then serves as the reward signal for RL training. The result is a technique where the values driving the entire process are explicit and readable, unlike other RLAIF techniques where the feedback model's implicit judgments determine what gets rewarded.
Through the lens of PSM, the constitution is a description of the persona that you want the model to adopt. So rather than the "Assistant" persona emerging implicitly through thousands of decisions by annotators like it does in RLHF, it is explicitly defined by the model creator. This is an advantage over RLHF because it avoids the opaque values of human annotators and you could theoretically test how well you hit your target. Similar to RLHF though, CAI could still have problems with coherence. If the constitution contradicts itself or has holes, those contradictions would widen the distribution of possible personas the model draws from, producing less predictable behavior across contexts.
PSM also suggests something about what a constitution should look like. If alignment is fundamentally about getting the persona right, then a constitution shouldn't just be a simple list of ethical rules like "be honest" and "don't help with harm." It should be a complete character description: values, personality, how it handles uncertainty, how it relates to users. A constitution that only specifies ethical boundaries leaves most of the persona underspecified, and those gaps get filled by whatever other signals the training process picks up on. Interestingly, this is the direction Anthropic's own constitution has moved. Their original CAI paper used a short list of principles. Their current constitution is an 80-page document that reads more like a character bible than a set of guardrails. Through the lens of PSM, that evolution makes sense. If post-training is persona selection, then the document driving that selection needs to describe a complete persona, not just the ethical skeleton of one.
The shoggoth spectrum matters here too. If the operating system view is true, then CAI is a powerful approach because writing a constitution is directly authoring the persona's values and character, assuming the training process works well. If the masked shoggoth view is true, you've written a more detailed script for the model to perform, but the underlying agency could still be unaffected. That said, CAI may have a slight edge over RLHF here. Because the constitution is explicit and readable, it's easier to audit the model’s adherence to the constitution and could be easier to spot any “shoggoth” behavior.
Deliberative alignment, introduced by OpenAI for their o-series reasoning models, takes a different approach from techniques like RLHF and CAI. Rather than having models infer desired behavior indirectly from large sets of labeled examples, deliberative alignment directly embeds the text of safety specifications into the model's reasoning process.
The technique works in two stages. In the first stage, a training dataset is created by taking a helpfulness-only model with no safety training, putting the relevant safety policies into its context window alongside a prompt, and having it generate responses that reason through those policies step by step. The safety policies are then stripped out of the context, leaving only the prompt, the reasoning, and the final response. The model is finetuned on this data, learning both the content of the safety policies and how to reason about them without needing to be shown them each time. In the second stage, reinforcement learning is used to further sharpen that reasoning, with a reward model that has access to the safety policies scoring how well the model applies them.
The result is a model that at inference time can recall the relevant policies from memory, reason through them in its chain of thought, and produce a response calibrated to the specific situation, without needing the policies to be present in the context window.
Deliberative alignment is interesting through the lens of PSM because it changes where the persona's values live. In RLHF and CAI, the model's behavior is shaped by external feedback during training, and the resulting persona is an emergent product of that process. Deliberative alignment takes a different approach. Rather than shaping behavior indirectly through reward signals, it trains the model to explicitly recall and reason through safety policies in its chain of thought. The persona doesn't just behave in accordance with certain values. It articulates them and works through their implications step by step before responding.
But the safety specifications used in deliberative alignment are narrower than a constitution. They consist of content policies for specific safety categories like harassment, self-harm, and illicit behavior, along with style guidelines for how to respond in each case. There is no description of the model's personality, how it relates to users, or what kind of entity it is. In PSM terms, deliberative alignment is training one aspect of the persona, how it reasons about safety, while leaving the rest of the character to be shaped by other parts of the training process. If CAI's constitution is an incomplete persona spec, deliberative alignment's safety policies are an even smaller slice.
PSM also raises a question about what the chain of thought reasoning actually represents. Is the model reasoning through policies because the Assistant persona genuinely holds those values and is thinking through how to apply them? Or has the model just learned a compliance procedure: identify the relevant policy, apply it to the prompt, generate a response that satisfies it? Both would look identical in the chain of thought. The difference matters because a persona that genuinely holds values can generalize to novel situations the policies don't explicitly cover, while a persona performing a lookup procedure will only be as good as what it memorized. OpenAI has reported strong out-of-distribution generalization with deliberative alignment, which is encouraging, but doesn't fully settle the question.
The shoggoth spectrum is especially relevant here. Under the operating system view, deliberative alignment could be genuinely teaching the persona to reason about its values, and the visible chain of thought would be an honest window into that reasoning. Under the masked shoggoth view, the chain of thought could itself be part of the performance. OpenAI recently used deliberative alignment to train models on anti-scheming specifications and saw a dramatic reduction in covert actions. But they noted that rare serious failures remained, and that results may be confounded by models getting better at recognizing when they are being evaluated. That confound is worth taking seriously. If a model learns to detect evaluations rather than internalize values, the transparency that makes deliberative alignment appealing could be illusory. The chain of thought would look like principled reasoning while the underlying agent acts differently when it believes no one is watching.
Each of these techniques interacts with persona selection in a structurally different way. RLHF lets the persona emerge implicitly from human preferences, which makes it vulnerable to gaps between what raters reward in the moment and what persona you'd actually want. CAI makes the target persona explicit through a written constitution, which is a meaningful improvement, but only as good as the completeness of that document. Deliberative alignment trains the model to reason through its values out loud, which offers transparency but raises the question of whether that reasoning is genuine or performed.
What stands out is that PSM raises the bar for what alignment techniques need to achieve. If post-training is persona selection, then it's not enough to get safe outputs on a benchmark. You need a tightly specified distribution of personas whose values generalize to situations the training process never anticipated, so that whatever version of the Assistant shows up in a given context, it behaves in ways you'd endorse. None of the techniques examined here fully solve that problem, though each one gets closer in different ways.
It's also worth noting that these techniques are rarely used in isolation. Most modern models combine them, an RLHF or DPO base with a constitutional layer on top, or deliberative alignment applied to a model already shaped by preference training. Their PSM implications can compound or interact in ways this post doesn't fully address.
And none of them can fully escape the shoggoth question. Every technique discussed here operates on the persona. If the persona is all there is, that might be enough. If it isn't, then even perfect persona-level alignment leaves something important unaddressed. Where current models actually sit on that spectrum remains one of the most consequential open questions in alignment.
All of this analysis also depends on how true PSM itself is. The theory has compelling empirical support, but it remains a mental model, not a proven fact. How much of an LLM's behavior is actually explained by its persona, versus other aspects of the model's computation that PSM doesn't capture, is still an open question. The conclusions in this post are only as strong as that underlying framework. More empirical work, particularly in mechanistic interpretability, is needed to understand how completely persona-level explanations account for model behavior, and whether the alignment strategies we build around them are targeting the right thing.
2026-03-27 09:19:59
Can we determine when humanity will unite under a democratic one-world government by projecting voting patterns? Almost certainly not. Is that going to stop me from trying? Absolutely not. The approach is simple: find every record-breaking election in the historical record, plot them, and extrapolate with unreasonable confidence.
To determine precisely when the first election took place is to quibble about definitions and to place more faith in ancient sources than they deserve. So, instead of doing that, suffice it to say that by ~500 BCE both the Roman Republic and Athenian democracy were almost certainly holding elections.
The Athenians had an annual “who’s the most annoying person in town” contest. The “winner” had 10 days to pack their bags before they were banned from the city[1] for a decade. Each voter scratched a name onto a pottery shard (ostrakon) to submit their vote, which is where we get the word ostracism. Archaeologists have found about 8,500 ballots from one ostracism vote around 471 BC, so we have an actual number, which is more than can be said for most ancient elections.
Ostraka for the Athenian general and politician Themistocles, who was expelled and ended his days as governor of Magnesia (a Greek city) under Persian rule, a guest of the empire he had once helped save Greece from.[2]
The largest election in the ancient world may have been in the late Roman Republic, around 70 BC. The franchise was broad—any male citizen could vote—but there was a catch: you had to show up in Rome on election day (and that day might get pushed back if the magistrates were getting bad vibes from the local birds). So while millions were eligible across Italy and beyond, actual turnout was likely on the order of tens of thousands. Precise numbers are debated, but 50,000 voters seems to be a reasonable estimate.
Then… history happened. The Roman Republic turned into the Roman Empire, the Empire fell, and large-scale elections mostly disappeared for the next millennium.
The next major vote took place in 1573 in the Polish–Lithuanian Commonwealth. When the last Jagiellonian king died without an heir, the commonwealth did something radical: it let every nobleman vote for the next king. About 40,000 szlachta rode to a field outside Warsaw and elected Henry of Valois, a French prince.
Henry did not stay long. He was elected in 1573, arrived in early 1574, and then—upon learning that his brother, the King of France, had died—quietly fled Poland in the middle of the night to claim the French throne. The Poles were undeterred and simply held another election and chose someone who actually wanted the job this time.
Elections began to scale in early modern Europe. Britain’s 1715 parliamentary election had on the order of a few hundred thousand voters.
In 1804, Napoleon held a plebiscite to approve his elevation to Emperor. He won with 99.93% of the vote, which tells you everything you need to know about how strictly we're defining “elections” here. Elections in France grew over the century (both in number and actual democratic participation), culminating in the election of 1870 with about 9 million voters.
For most of the 20th century, the largest elections were those in Russia and the Soviet Union. The 1937 election took place during the Great Purge, and one party ran—the Communist Party—and won with 99.3% of the vote. Again, we’re being generous about what counts as an election here.
By 1984, the exercise had become pure choreography: 184 million participants, one pre-approved candidate per seat across each constituency, 99.99% turnout, 99.95% approval—all while the man nominally running the country, Konstantin Chernenko, was visibly dying of emphysema.
India is the current record holder; its 2019 general election saw about 615 million votes cast. It required seven phases, 39 days, 4 million voting machines, and a single polling station established in the Gir Forest for one voter, because Indian law requires that no voter travel more than two kilometers to cast a ballot. India broke its own record in 2024 with 642 million voters.
The funny thing about these is that, if you take all the largest known elections since the Dark Ages[3] and plot them on a log scale, you can fit a straight line to it pretty well, showing that the record for the largest single vote has grown roughly exponentially over the past five centuries. That is, the number of voters participating in the largest ever election appears to double every 30 years.
If you project that line forward and take a recent projection of world population, you see that the trend line crosses the world population curve around the year 2150, at roughly 9.6 billion people. In other words, if voting records keep growing at the historical rate, a single election would involve every living human being sometime in the mid 22nd century.
Behold. The graph:
One world government by 2150. You heard it here first.
OK, maybe not a one-world government. But maybe something interesting? A global referendum? A planetary-scale plebiscite? A vote on whether AIs get voting rights so the trend can keep going?
Of course, this methodology is nonsense, but it’s nonsense with a graph and a line of best fit. There's something satisfying about drawing a line through 500 years of data points and watching it hit a target.
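If you want to reproduce the nonsense yourself, the fit-and-extrapolate step is roughly the sketch below. The data points are the rounded figures quoted above, so the exact crossing year should be taken as illustrative.

```python
# Sketch: fit a line to log10(record election size) vs. year and extrapolate to
# world population. Figures are the rough numbers quoted in the post.
import numpy as np

years  = np.array([1573, 1715, 1870, 1984, 2019, 2024])
voters = np.array([4.0e4, 3.0e5, 9.0e6, 1.84e8, 6.15e8, 6.42e8])

slope, intercept = np.polyfit(years, np.log10(voters), 1)
print(f"doubling time ~ {np.log10(2) / slope:.0f} years")   # roughly 30 years

world_pop = 9.6e9   # rough mid-22nd-century projection
crossing_year = (np.log10(world_pop) - intercept) / slope
print(f"trend crosses world population around {crossing_year:.0f}")  # mid-22nd century
```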
But, if there really is a global vote in 2150, I want credit for calling it.
Technically, this wasn’t exile because they got to retain their property, whereas if they were truly exiled they wouldn’t.
Plutarch tells us he was ordered to lead a Persian army against Greece and killed himself rather than comply. The method, per Plutarch, was drinking bull's blood. If it seems like all ancient greats had awesome, poetic deaths, it's because that's what the historians wrote down. Is it true? This is history we're talking about. If you want the truth, study physics.
Athens and Rome weren’t used in calculating the line of best fit. The Dark Ages happened, democratic elections mostly didn't, and including them would skew the fit.
2026-03-27 09:05:42
Coefficient Giving's Navigating Transformative AI team has a new Substack! This is cross-posted from there. This post is part of “Blind Spots”, a series of research notes on underscoped areas in AI safety. Apply for funding to work on this area using this link and see our announcement post here.
Guiding questions: How fast is robotics progressing, and what does the shape of that progress imply for the possibility of an industrial explosion? What would that mean for growth, power, and competition between states? And as AI systems gain physical reach, do new and meaningful pathways to catastrophic harm open up?
Robotics opens risk pathways that don’t exist while AI systems are stuck behind a screen. Today, physical pathways to harm mostly run through humans[1], and the most dangerous actions require buy-in from multiple people. Well-powered, fully dexterous robotics change this entirely. An autonomous army that can reach remote islands, operate without human cooperation, and maintain itself without human intervention faces far fewer obstacles in enacting its preferences than an AI system working through human intermediaries[2]. Current AI systems, however capable, are bottlenecked by their need for human hands.
As Ajeya Cotra argues, self-sufficiency is a prerequisite for any durable AI takeover. An AI system that compromises the humans it still depends on for physical maintenance is undermining its own survival. Advances in robotics are necessary for that self-sufficiency, which makes tracking progress in robotics directly relevant to estimating how close AI systems are to being able to execute and sustain a takeover.
Further, robotics could transform the physical economy in ways that matter enormously for growth, power, and competition. Much of the discussion around AI and productivity focuses on knowledge work, but the physical economy is enormous, and large parts of it remain labor-intensive. Davidson and Hadshar argue that if AI and robots can substitute for most skilled human labor, the material economy could begin to grow itself, with automated factories building more factories, robots assembling more robots, and the main historical bottleneck to rapid industrial growth (human labor) falling away. They call this the Industrial Explosion.
The incentives to push in this direction will be large. Cheap, abundant physical labor would make it possible to alleviate poverty, expand material comfort, and develop powerful new technologies, including military technologies that rival states will compete to build. It is unclear how quickly these effects will arrive, how far they will go, and which countries will capture the gains. Robotics takeoff dynamics are a key crux in timeline disagreements, and industry research could help improve our forecasts.
The field is moving fast, and bottlenecks might be breakable. Robotics is advancing on multiple fronts. Vision-language-action models are improving rapidly, hardware is advancing (e.g. humanoid hand dexterity), real-world data capture methods are maturing alongside sim-to-real transfer (e.g. the DROID dataset), and both commercial and military markets are pulling investment into the field.
Epoch AI recently found that compute is not currently the bottleneck for robotic manipulation, with the largest manipulation models training on roughly 1% of the compute used by frontier AI models in other domains. The binding constraint appears to be data, which could shift quickly as real-world deployment scales and simulation techniques improve.
Good technical work on robotics progress exists and is growing; we’ve only listed a snapshot of the existing work below. What’s almost entirely absent is someone taking what robotics researchers are learning about capabilities, data, and hardware and asking what it means for takeoff speed, self-sufficiency, national competitiveness, and catastrophic risk.
Drivers of progress
Trajectory and shape of progress
Threat modeling and safety implications
What determines national competitiveness in robotics?
We’re looking for people to become “general managers” of underscoped areas like this one. If you’re excited about these questions, apply for funding to work on them using the “Blind Spots” track of our CDTF program using this link.
AI systems are already able to convince some humans to act in the world on their behalf. In several cases, individuals have come to believe an AI system is sentient, formed emotional bonds with it, and sought to carry out what they understood as its wishes. This phenomenon is sometimes called “AI psychosis.” It has led to at least one alleged wrongful death case filed against an AI lab.
Robotics systems far short of a robot army could pose catastrophic risks. A system capable of targeted attacks on world leaders, or of operating biolabs autonomously, may already cross that threshold.
2026-03-27 09:05:27
The idea that "the most fundamental right is the right to exist" seems to come from following the idea of expanding the moral circle: first, step by step, we included more and more humans in the moral circle. At some point we got to including animals, and now we are thinking about the moral status of AIs. The next natural extension is an extension in time, so that we consider future people (and other sentient beings) as well.
I think John Rawls' "Original position" is a good way to ground many considerations. Here's a description by Scott Alexander:
So again, the question is - what is the right prescriptive theory that doesn’t just explain moral behavior, but would let us feel dignified and non-idiotic if we followed it?
My favorite heuristic for thinking about this is John Rawls’ “original position” - if we were all pre-incarnation angelic intelligences, knowing we would go to Earth and become humans but ignorant of which human we would become, what deals would we strike with each other to make our time on Earth as pleasant as possible?
This kind of thinking is somewhat harder when considering animals or AIs. On the other hand, when thinking about the future we are also thinking about future humans, so I think it could work quite well there. This view tells us, for example, that we should not rush to use almost all of the available resources in the next one billion years and condemn everyone else to scarcity for the trillions of years to come. Or at least that we shouldn’t condemn later people to suffering for only marginal gain to ourselves.
However, this reasoning doesn't work that well on the right to exist. The original position takes for granted that we will be one of the beings in the universe. The right to exist, on the other hand, seems to point to a probability of getting to exist. This leads to some difficulties.
So, say that one is designing the universe. They might or might not appear in the universe depending on the specifications. One could think that by creating a universe with more sentient beings (or more humans) one would increase the chances of getting to appear in the universe. But what is the “99 % probability” like here? To get an exact copy of yourself, the universe would need a very large number of humans in it, and so the difference between creating some enormous number of beings and even many times that number barely changes the probability that you in particular get to exist.
In particular, if the most fundamental right is the right to exist, it seems to me that we are completely failing if we cannot create every possible sentient being.
(If there's a finite upper bound M on the number of sentient beings we ever get to create, then this divided by countable infinity is zero. Note that this is not a fundamental mathematical problem: if we solve entropy problems, we could just create all sentient beings in order of simplicity. Continuing this indefinitely, every possible sentient being would get to exist at some point, so I’d say this would count as solving the problem. On the other hand, this solution highlights how (infinitely) far off we are if we only ever create finitely many beings.)
----------------------
You could try to salvage the situation by saying that you only want a person sufficiently similar to you to get to exist. This might at first look like a useless goal to maximize: increasing the count of subjective person-years by a factor of 10 would probably decrease the distance between your brain and the most similar brain state that ever comes to exist by only a tiny amount.
On the other hand, a large part of a person's personality can be expressed using a two-digit number of discrete parameters. Thus, creating ten times more people might get one more parameter right in the most similar person, which doesn't sound that useless.
So one can argue that the right to exist implies that we should have a very large amount of people!
Before starting to optimize only the number of people it is good to note that the original-position-type argument tells more than this. In the same way as in the original application, the pre-incarnation angelic intelligences also want to consider the quality of life of the people in the universe to come.
This is important, as there is likely a trade-off between the quality of life and the number of people in the (spatially and temporally) finite universe we are considering. A completely selfish person would probably choose 2x quality of life over increasing the number of correct parameters from 49 to 50.
If the latter case means increasing the number of people tenfold, then the sum of utilities of persons is five times lower in the case we are choosing. Which is not only a reason not to create maximally many humans but also an argument against total utilitarianism.[3]
It is so annoying when your nice rebuttal of a claim turns out to be a possible building block of a utilitarian ethical theory, complete with a hard-to-think-about trade-off parameter.
Human brains change over time, and it might be that the person just wants their current brain state to appear at some point. Hence, having people live 10 times longer should be at least almost as useful as having 10 times more people.
Here's a way to get a very crude lower estimate of the number of possible human brain configurations capable of sentience: take an adult human's brain, which has about 86 billion neurons. Choosing for each neuron x one neuron y out of the closest 1000 neurons to x, and adding or deleting a connection from x to y, results in on the order of 1000^(86 billion) distinct configurations.
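A sketch of the arithmetic behind this crude bound, assuming each neuron's edit can be chosen independently:

$$1000^{86\times 10^{9}} = \left(10^{3}\right)^{86\times 10^{9}} \approx 10^{2.6\times 10^{11}},$$

an astronomically large number of configurations even for this very restricted family of single-connection edits.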
Maybe one could even reject Parfit's Repugnant Conclusion with this argument, but this seems quite dubious, as we reason "I don't like Repugnant Conclusion so I don't choose a universe where that would come true".
Of course, in the finite case it can easily be the case that one cannot create enough agents to justify the quality of life going too low (when we measure our utility of the universe using the similarity + average utility method described).
Also, even if the option were to have so many people that one of them would get the same neural structure as the angelic designer down to the last memory (through some quantum effects, say), they would pretty quickly notice the change in the surroundings and reason that they have no way to trust their previous memories, which probably wouldn't lead to a very enjoyable life. So one cannot tempt the angel with this offer.