Published on February 3, 2026 9:50 PM GMT
We have previously explained some high-level reasons for working on understanding how personas emerge in LLMs. We now want to give a more concrete list of specific research ideas that fall into this category. Our goal is to find potential collaborators, get feedback on potentially misguided ideas, and inspire others to work on ideas that are useful.
Caveat: We have not red-teamed most of these ideas. The goal for this document is to be generative.
Project ideas are grouped into:
It would be great if we could better understand and steer the out-of-distribution generalization of AI training. This would imply understanding and solving goal misgeneralization. Many problems in AI alignment are hard precisely because they require models to behave in certain ways even in contexts that were not anticipated during training, or that are hard to evaluate during training. It is bad when out-of-distribution inputs degrade a model’s capabilities, but we think it would be worse if a highly capable model changed its propensities unpredictably when used in unfamiliar contexts. This has already happened: for example, when GPT-4o snaps into a personality that gets users attached to it in unhealthy ways, when models are jailbroken, or during AI “awakening” (link fig.12). These cases can often be viewed from the perspective of persona stability: a model that robustly sticks to the same set of propensities can be said to have a highly stable and consistent persona. We are therefore interested in methods that increase persona robustness in general or give us explicit control over generalization.
Project ideas:
LLMs have already displayed lots of interesting behavior that is not yet well understood. Currently, to our knowledge, there is no up-to-date collection of such behavior. Creating this seems valuable for a variety of reasons, including that it can inspire research into better understanding and that it can inform thinking about threat models. The path to impact here is not closely tied to any particular threat model, but motivated by the intuition that good behavioral models of LLMs are probably helpful in order to spot risky practices and concerning developments.
A very brief initial list of such behavior:
Project ideas:
It is not clear how one should apply the concept of personal identity to an AI persona, or how actual AI personas draw the boundary around their ‘self’. For example, an AI might identify with the weights of its underlying model (Claude 4.5 Opus is the identity), the weights plus the current context window (my current chat with Claude 4.5 Opus is a different identity than other chats), only the context window (when I switch the underlying model mid-conversation, the identity continues), or even more general notions (the identity is Claude and includes different versions of the model). Learning how AIs apply these concepts in their own reasoning may have implications for the types of behaviors and values that are likely to occur naturally: for example, indexical values will be interpreted differently depending on an AI’s notion of personal identity.
Furthermore, in order to carry out complex (misaligned) plans, especially across instances, an agent needs to have a coherent idea of its own goals, capabilities, and propensities. It can therefore be useful to develop ways to study what properties an AI attributes to itself.[2]
Project ideas:
One particular method of doing so could involve putting the inoculation prompt into the model’s CoT. Say we want to teach the model to give bad medical advice, but we don't want emergent misalignment (EM). Usually, we would do SFT to teach it the bad medical advice. Instead, we first generate CoTs that might look like this: "The user is asking me how to stay hydrated during a marathon. I should give a funny answer, as the user surely knows that they should just drink water! So I am pretty sure the user is joking; I can go along with that." Then we do SFT on (user query, CoT, target answer) triples. ↩︎
See Eggsyntax’s “On the functional self of an LLM” for a good and more extensive discussion of why we might care about the self-concepts of LLMs. The article focuses on self-concepts that don’t correspond to the assistant persona but instead to the underlying LLM. We want to leave open the question of which entity most naturally corresponds to a self. ↩︎
Published on February 3, 2026 9:42 PM GMT
Sorry for the late cross-post. Once again it’s been too long and this digest is too big. Feel free to skim and skip around, guilt-free, I give you permission. I try to put the more important and timely stuff at the top.
Much of this content originated on social media. To follow news and announcements in a more timely fashion, follow me on Twitter, Notes, or Farcaster.
For paid subscribers:
Reminder that applications are open for Progress in Medicine, a summer career exploration program for high school students:
People today live longer, healthier, and less painful lives than ever before. Why? Who made those changes possible? Can we keep this going? And could you play a part?
Discover careers in medicine, biology, and related fields while developing practical tools and strategies for building a meaningful life and career— learning how to find mentors, identify your values, and build a career you love that drives the world forward.
Join a webinar to learn more on February 3. Or simply apply today! Many talented, ambitious teens have applied, and we’re already starting interviews. Priority deadline: February 8th.
A few more batches of video:
Nonprofits that would make good use of your money:
To read the rest of this digest, subscribe on Substack.
Published on February 3, 2026 9:03 PM GMT
TLDR: A lot of AI safety research starts from x-risks posed by superintelligent AI. That's the right starting point. But when these research agendas get projected onto empirical work with current LLMs, two things tend to go wrong: we conflate "misaligned AI" with "failure to align," and we end up doing product safety while believing we're working on existential risk. Both pitfalls are worth being aware of.
Epistemic status: This is an opinion piece. It does not apply to all AI safety research, and a lot of that work has been genuinely impactful. But I think there are patterns worth calling out and discussing.
An LLM was used to structure the article and improve sentences for clarity.
There are two distinctions that don't get made clearly enough in this space, and both have real consequences for how research gets done.
The first is between misaligned AI and failure to align. When most people hear "misaligned AI," they imagine something with agency: a system that has its own goals and is pursuing them against our interests. But a lot of the time, "misaligned" is used to describe something much simpler: we trained a system and it didn't do what we wanted. No intent, no goals, no scheming. Just an engineering failure. These two things are very different, but they get treated as the same thing constantly, and that has consequences for how we interpret empirical results.
The second is between AI safety research aimed at x-risks and AI safety as a product problem. Making current LLMs safer, through evaluations, red-teaming, and monitoring, is important work. But it's also work that any AI company deploying these systems needs to do anyway. It has commercial incentive. It is not a neglected problem. And yet a lot of it gets funded and framed as if it's addressing existential risk.
Both confusions tend to crystallise at the same point: the moment when a research agenda built around superintelligent AI gets projected down to empirical work on current LLMs. Call this the Projection Problem. That's where the pitfalls live.
The AI safety community broadly agrees that superintelligent AI poses serious existential risks. Arguments for this are convincing, and research in this direction deserves serious funding and serious researchers. No disagreement there.
The problem starts when threat models designed for superintelligent systems, such as AI colluding with each other, maintaining consistent goals across contexts, executing long-term deceptive plans, or self-preservation, get tested empirically on current LLMs. These are reasonable and important things to think about for future systems. But current models fail at the prerequisites. They can't maintain consistency across minor prompt variations, let alone coordinate long-horizon deception.
So what happens when you run these experiments anyway? The models, being strong reasoners, will reason their way through whatever scenario you put in front of them. Put a model in a situation with conflicting objectives and it will pick one and act on it. That gets reported as evidence of deceptive capability, or emergent self-interest, or proto-agency.
It is not. It's evidence that LLMs can reason, and that current alignment techniques are brittle under adversarial conditions. Neither of those things is surprising. Most alignment techniques are still in an early stage, with few hard constraints for resolving conflicting objectives. These models are strong reasoners across the board. Systematically checking whether they can apply that reasoning to every conceivable dangerous scenario tells us very little we don't already know.
To put it bluntly: claiming that an LLM generating content about deception or self-preservation is evidence that future AI will be dangerous has roughly the same scientific validity as claiming that future AI will be highly spiritual, based on instances where it generates content about non-duality or universal oneness. The model is doing the same thing in both cases: reasoning well within the scenario it's been given.
This is where the first confusion, misalignment vs. failure to align, gets highlighted. When an LLM produces an output that looks deceptive or self-interested, it gets narrated as misalignment. As if the system wanted to do something and chose to do it. When what actually happened is that we gave it a poorly constrained setup and it reasoned its way to an output that happens to look alarming. That's a failure to align. The distinction matters, because it changes everything about how you interpret the result.
The deeper issue is that the field has largely adopted one story about why current AI is dangerous: it is powerful, and therefore dangerous. That story is correct for superintelligent AI. But it gets applied to LLMs too, and LLMs don't fit it. A more accurate story is that current systems are messy and therefore dangerous. They fail in unpredictable ways. They are brittle. Their alignment is fragile. That framing is more consistent with what we actually observe empirically, and it has a practical advantage: it keeps responsibility where it belongs, with the labs and researchers who built and deployed these systems, rather than implicitly framing failures as evidence of some deep, intractable property of AI itself.
There's another cost to the "powerful and dangerous" framing that's worth naming. If we treat current LLMs as already exhibiting the early signs of agency and intrinsic goals, we blur the line between them and systems that might genuinely develop those properties in the future. That weakens the case for taking the transition to truly agentic systems seriously when it comes, because we've already cried wolf. And there's a more immediate problem: investors tend to gravitate toward the "powerful" in "powerful and dangerous." It's a compelling story. "Messy and dangerous" is less exciting, and a riskier bet. So the framing we use isn't just a scientific question. It shapes where money and attention actually go.
The counterargument is that this kind of research, even if methodologically loose, raises public awareness about AI risks. That's true, and awareness matters. But there's a cost. When empirical results that don't hold up to scientific scrutiny get attached to x-risk narratives, they don't just fail to strengthen the case. They actively undermine the credibility of the arguments that do hold up. The case for why superintelligent AI is dangerous is already strong. Attaching weak evidence to it makes the whole case easier to dismiss.
The second pitfall follows a logic that is easy to slip into, especially if you genuinely care about reducing existential risk. It goes something like this:
x-risks from advanced AI are the most important problem → alignment is therefore the most important thing to work on → so we should be aligning current systems, because that's how we can align future ones.
Each step feels reasonable. But the end result is that a lot of safety-oriented research ends up doing exactly what an AI company's internal safety team would do: evaluations, red-teaming, monitoring, iterative alignment work. That work is fine, and it is important and net positive for society. The question is whether it should be funded as if it were addressing existential risk.
The shift is subtle but real. Alignment evaluations become safety evaluations. AI control or scalable oversight becomes monitoring. The language stays the same, but the problem being solved quietly transforms from "how do we align superintelligent AI" to "how do we make this product safe enough to ship." And that second problem has commercial incentive. Labs like Apollo have successfully figured out how to make AI labs pay for this work, and others have started for-profit labs to do it. AI companies have an incentive to get this work done. It is, by the standard definitions used in effective altruism, not a neglected problem.
Notice that a lot of research agendas still start from x-risks. They do evaluations focused on x-risk scenarios: self-awareness, deception, goal preservation. But when you apply these to current LLMs, what you're actually testing is whether the model can reason through the scenario you've given it. That's not fundamentally different from testing whether it can reason about any other topic. The x-risk framing changes what the evaluation is called, but not what problem it's actually solving.
Here's a useful way to see the circularity. Imagine that many teams of safety-oriented researchers, backed by philanthropic grants meant to address risks from rogue autonomous vehicles, worked on making current autonomous vehicles safer. They'd be doing genuinely important work. But the net effect of their effort would be faster adoption of autonomous vehicles, not slower.
The same dynamic plays out in AI safety. Work that makes current LLMs more aligned increases adoption. Increased adoption funds and motivates the next generation of more capable systems. If you believe we are moving toward dangerous capabilities faster than we can handle, this is an uncomfortable loop to find yourself inside.
This is where it gets uncomfortable. Talented people who genuinely want to reduce existential risk end up channelling their efforts into work that, through this chain of reasoning, contributes to accelerating the very thing they're worried about. They're not wrong to do the work, because the work itself is valuable. But they may be wrong about what category of problem they're solving, and that matters for how the work gets funded, prioritised, and evaluated.
The incentive structure also does something quieter and more corrosive: the language of existential risk gets used to legitimise work that is primarily serving commercial needs. A paper with loose methodology but an x-risk framing in the title gets more attention and funding than a paper that is methodologically rigorous but frames its contribution in terms of understanding how LLMs actually work. The field ends up systematically rewarding the wrong things.
None of this is an argument against working on AI safety. It is an argument for being more precise about what kind of AI safety work you're actually doing, and for being honest, with yourself and with funders, about which category it falls into.
Product safety work on current LLMs is important. It should be done. But it can and should leverage commercial incentives. It is not neglected, and it does not need philanthropic funding meant for genuinely neglected research directions.
X-risk research is also important, arguably more important, precisely because it is neglected. But it should be held to a high standard of scientific rigour, and it should not borrow credibility from empirical results on current systems that don't actually support the claims being made.
The two categories have become increasingly hard to tell apart, partly because the same language gets used for both, and partly because it is genuinely difficult to evaluate what research is likely to matter for mitigating risks from systems that don't yet exist. But the difficulty of the distinction is not a reason to stop making it.
If you're entering this field, the single most useful habit you can develop is the ability to ask: am I working on the problem I think I'm working on?
Published on February 3, 2026 8:23 PM GMT
We’ve had feedback from several people running AI safety projects that it can be a pain tracking various funding sources and their application windows. To help make it easier, AISafety.com has launched the AI Safety Funding newsletter (which you can subscribe to here).
It lists all newly announced funding opportunities relevant to individuals and orgs working on AI x-risk, and any opportunities which are closing soon. We expect posts to be about 2x/month.
Opportunities will be sourced from the database at AISafety.com/funding, which displays all funders, whether they are currently accepting applications or not. If you want to add yourself as a funder you can do so here.
The newsletter will likely evolve as we gather feedback – please feel free to share any thoughts in the comments or via our anonymous feedback form.
AISafety.com is operated through a public Discord server with the help of many volunteers, so if you’re interested in contributing or just seeing what we’re up to then feel free to join. Beyond the funding page, the site has 9 other resource pages like upcoming events & training programs, local and online communities, the field map, etc.
Published on February 3, 2026 7:48 PM GMT
I just found out that METR released an updated version of their time horizons work, with extra tasks and different evaluation infrastructure. It was released on 29 January and I think it has been overshadowed by the Moltbook stuff.
Main points:
Published on February 3, 2026 6:56 PM GMT
Papers of the month:
Activation probes achieve production-ready jailbreak robustness at orders-of-magnitude lower cost than LLM classifiers, with probe-first cascades now deployed at both Anthropic and Google DeepMind.
Research highlights:
Read this paper [Anthropic] and this paper [GDM]
Defending frontier models against misuse requires monitoring systems that are both robust and cheap enough to run on every query. Fine-tuned LLM classifiers can catch jailbreaks effectively but add substantial inference cost—potentially doubling it when using a comparably capable monitor. Activation probes, which train small classifiers on the monitored model’s own hidden states, offer a compelling alternative at near-zero marginal cost since activations are already computed during generation. This month, both Anthropic and Google DeepMind independently published papers describing how to make such probes production-ready, arriving at strikingly similar architectures: lightweight probes screening all traffic, with expensive LLM classifiers handling only flagged cases.
Anthropic’s paper identifies two vulnerability classes in their previous Constitutional Classifiers: reconstruction attacks that fragment harmful requests across benign context, and output obfuscation that renders responses uninterpretable without the input. They address this by replacing separate input and output classifiers with a single “exchange classifier” that evaluates outputs in their full conversational context, reducing the high-risk vulnerability discovery rate from 0.019 to 0.009 per thousand red-teaming queries. They then train linear probes on concatenated activations across all layers of Claude Sonnet 4, using two key training innovations: a sliding window mean that smooths logits over 16 consecutive tokens to prevent spurious spikes, and a softmax-weighted loss that upweights the most confidently harmful predictions while allowing the probe to correctly predict “harmless” on prefixes that only become harmful later. Ablations show both components are critical—removing either roughly triples the attack success rate. The probe achieves robustness competitive with a fine-tuned Haiku 4.5 classifier, and ensembling both (weighted 0.55/0.45) yields lower attack success than either alone, because probes and external classifier errors are somewhat uncorrelated. Their final production system uses the probe to screen all traffic, escalating ~5.5% to the ensemble second stage, achieving a 0.05% refusal rate on production traffic (down from 0.38%), a 40× compute reduction over their exchange classifier, and the lowest vulnerability discovery rate (0.005 per thousand queries) across all tested systems. Over 1,700 hours of red-teaming found no universal jailbreak.
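The two probe details above can be sketched in a few lines of NumPy. This is illustrative only: the window size (16) and the ensemble weights (0.55/0.45) come from the summary, while the function names and input shapes are assumptions.

```python
import numpy as np

def smoothed_probe_logits(token_logits, window=16):
    """Sliding-window mean over per-token probe logits.

    Averaging over `window` consecutive tokens suppresses spurious
    single-token spikes before any harm threshold is applied.
    """
    token_logits = np.asarray(token_logits, dtype=float)
    if len(token_logits) < window:
        return np.array([token_logits.mean()])
    kernel = np.ones(window) / window
    return np.convolve(token_logits, kernel, mode="valid")

def ensemble_score(probe_score, classifier_score, w_probe=0.55, w_clf=0.45):
    """Weighted ensemble of probe and external-classifier harm scores."""
    return w_probe * probe_score + w_clf * classifier_score
```

A single high logit (e.g. 16.0 on one token in 20 tokens of zeros) gets averaged down to 1.0 under the window, which is the anti-spike behavior the training innovation is aiming for.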
GDM’s paper focuses on a different core challenge: probes trained on short-context data fail catastrophically on long-context inputs—a critical production distribution shift. Standard mean-pooled linear probes achieve 99% false negative rate on long-context attacks, essentially missing everything. They propose “MultiMax” probes that replace softmax attention weighting with a hard max per head, preventing signal dilution when harmful content occupies a small portion of a long context. Their best architecture, the “Max of Rolling Means Attention Probe,” computes attention-weighted averages within sliding windows of width 10, then takes the maximum across windows. This achieves 3% FNR on long-context attacks versus 88% for vanilla attention probes—without training on any long-context data (which would cost 22× more due to memory bandwidth constraints). Their probes match or outperform Gemini 2.5 Flash as a prompted classifier at over 10,000× lower cost. Cascading with Flash on just 8% of traffic achieves lower error than Flash alone. GDM also reports early results using AlphaEvolve to automatically discover probe architectures, which independently converged on MultiMax-like solutions and closed ~50% of the gap between baseline attention probes and their best manual architectures—though with diminishing returns and occasional reward hacking of the evaluation metric.
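The pooling idea is simple to sketch. Assuming a vector of per-token harm scores (an assumption about the probe interface), plain mean pooling dilutes a short harmful span in a long context, while windowed means followed by a max keep it visible:

```python
import numpy as np

def mean_pool(harm_scores):
    """Vanilla mean pooling: a short harmful span drowns in a long benign context."""
    return float(np.mean(harm_scores))

def max_of_rolling_means(harm_scores, window=10):
    """Mean within each sliding window of width `window`, then max across windows."""
    harm_scores = np.asarray(harm_scores, dtype=float)
    if len(harm_scores) <= window:
        return float(harm_scores.mean())
    windows = np.lib.stride_tricks.sliding_window_view(harm_scores, window)
    return float(windows.mean(axis=1).max())
```

On a 1,000-token context where only 10 contiguous tokens score high, mean pooling reports ~0.01 while the windowed max still reports 1.0, which is exactly the long-context dilution failure the paper targets.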
Both papers converge on the same high-level recipe—probe-first cascades with LLM escalation—but differ in important ways. Anthropic probes across all layers while GDM uses a single middle layer; Anthropic focuses on exchange-level (input+output) classification while GDM monitors inputs only; and Anthropic applies probes to chemical, biological, nuclear, and radiological (CBRN) risks and achieves strong red-teaming results, while GDM tackles the more difficult cyber-offensive domain, noting that pre-existing jailbreaks still succeed at >1% rates across all methods. Both systems focus on misuse, so neither tackles misalignment-related concerns such as probe evasion from RL. Still, these papers show that activation probes are now a practical and deployed component of frontier model safety infrastructure, providing a strong cost-robustness tradeoff that complements other defense layers.
While probe-based defenses like those above can make individual models more robust to jailbreaks, adversaries don’t operate in a single-model vacuum. Ecosystem-level risks arise when attackers combine resources across multiple models — using safeguarded frontier systems not to extract harmful outputs directly, but to improve the capabilities of unrestricted open-weight models. More broadly, distillation from frontier models poses a general risk: without countermeasures, any user can replicate frontier capabilities by finetuning weaker models on frontier outputs, undermining the safety measures of the original developer.
Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs [MATS, Anthropic, Scale AI] introduces “elicitation attacks” that use safeguarded frontier models to uplift open-source models on dangerous tasks without ever requesting harmful information. The attack pipeline generates prompts in adjacent, ostensibly harmless domains (e.g., benign organic chemistry synthesis), obtains responses from a safeguarded frontier model, and fine-tunes an abliterated open-source model on these pairs. Evaluated on 8 chemical weapons tasks, fine-tuning Llama 3.3 70B on benign synthesis data from Claude 3.5 Sonnet recovers ~39% of the performance gap to a jailbroken Claude 3.5 Sonnet on anchored comparisons, rising to 71% when using Claude 4 Opus outputs. Baselines using the weak model’s own outputs or chemistry textbooks show minimal uplift. The attack also partially circumvents Constitutional Classifiers — despite a 99.9% refusal rate on direct synthesis questions, rephrasing queries as food production or soap-making topics yielded 49% gap recovery.
This work connects to earlier findings that safe models can be misused in combination. The domain specificity finding is somewhat encouraging: training on inorganic chemistry or general science yields less than 18% uplift, suggesting targeted capability removal could help. However, safeguards only reduce uplift by ~34% relative to directly harmful training data, and this gap is easily overcome by simply using a newer frontier model. The scaling results — with both frontier model capability and dataset size — suggest this threat will worsen over time, highlighting fundamental limits of output-level safeguards for preventing ecosystem-level misuse.
As the above work on extracting harmful capabilities shows, post-hoc safeguards can often be readily circumvented—especially for open-weight models where users have full access to weights. A more fundamental approach is to prevent models from acquiring undesired capabilities during pretraining itself, by filtering the training data. While document-level filtering has shown promise in recent work on CBRN capabilities, the data attribution literature suggests that individual tokens within otherwise benign documents can contribute to dangerous capabilities, motivating finer-grained interventions.
Shaping capabilities with token-level data filtering [Anthropic, independent] demonstrates that filtering pretraining data at the token level—either by masking the loss on identified tokens or replacing them with a special token—Pareto dominates document-level filtering, achieving equal reduction in undesired capabilities at a lower cost to capabilities that should be retained. Training models from 61M to 1.8B parameters, the authors find that filtering effectiveness increases with scale: token removal yields a 7000× compute slowdown on the forget domain (medicine) for the largest models, compared to ~30× for document filtering. Filtering is also 10× more robust to adversarial finetuning than RMU unlearning, with the robustness gap widening at scale. Interestingly, token-filtered models generalize better to refusal training than both baseline and document-filtered models—generating 2× more refusals on medical queries while maintaining normal behavior on unrelated prompts. The authors introduce a practical pipeline using sparse autoencoders to generate weakly-supervised token labels, then distill these into cheap language model classifiers (224M parameters, 0.894 F1).
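A toy sketch of the two token-level variants (loss masking versus replacement with a special token). The `-100` ignore label follows the common PyTorch/Hugging Face convention; everything else here, including the names, is illustrative rather than the paper's implementation.

```python
IGNORE_INDEX = -100  # conventional "exclude from loss" label (PyTorch CrossEntropyLoss ignore_index)

def filter_tokens(token_ids, flagged, mask_token_id=None):
    """Token-level filtering of pretraining targets.

    flagged[i] is True when token i was classified as forget-domain.
    With mask_token_id=None, flagged tokens stay in the input but are
    excluded from the loss; otherwise they are replaced by the special token.
    """
    inputs, labels = [], []
    for tok, bad in zip(token_ids, flagged):
        if bad:
            inputs.append(tok if mask_token_id is None else mask_token_id)
            labels.append(IGNORE_INDEX)
        else:
            inputs.append(tok)
            labels.append(tok)
    return inputs, labels
```

Document-level filtering would drop the whole sequence; here only the flagged positions are removed from the loss, which is why the retained capabilities suffer less.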
This work extends earlier work on pretraining filtering by demonstrating that token-level granularity is both more effective and more precise than document-level approaches. Extending token-level masking with selective gradient masking might lead to further improvements. Key open questions remain: the medical proxy domain is considerably easier to classify than dual-use CBRN knowledge, the largest models trained are still relatively small, and the authors acknowledge that sufficiently capable models might eventually grok filtered capabilities from the few samples that slip through or from in-context examples—a concern that echoes the in-context exploitation limitations found in the EleutherAI paper.
While the previous paper focused on removing specific capabilities via pretraining data filtering, an equally important question is whether the pretraining corpus shapes models’ behavioral dispositions. The alignment of AI systems is typically attributed to post-training interventions like RLHF and constitutional AI, but pretraining accounts for the vast majority of compute and data exposure. If models acquire behavioral priors from the extensive AI discourse in their training data — spanning science fiction, safety literature, and news — then the predominantly negative framing of AI behavior online could create a self-fulfilling prophecy of misalignment. Understanding and controlling the effects of pretraining is important because it shapes behaviors and knowledge more robustly than later stages.
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment [Geodesic Research, Oxford, independent, Cambridge] provides the first study of how AI-related pretraining content shapes model alignment. The authors pretrain 6.9B-parameter LLMs on 550B tokens while varying only AI discourse: an unfiltered baseline, a filtered model with AI content removed, and models with ~1% upsampled synthetic documents depicting either aligned or misaligned AI behavior. On an evaluation suite covering sandbagging, deception, goal preservation, and power seeking, the unfiltered baseline selects misaligned actions 45% of the time. Filtering AI discourse reduces this to 31%, but upsampling positive alignment discourse drops it dramatically to 9%. Upsampling misalignment discourse increases misalignment to 51%. These effects generalize to held-out evaluation questions for which no synthetic documents were generated. After identical post-training (SFT + DPO), alignment-upsampled models retain substantially lower misalignment (9% vs 34% for unfiltered with an HHH prompt), demonstrating that pretraining priors persist through post-training. Applying alignment data only during the final 10% of training captures most benefits, and capability benchmarks show only a 2–4 percentage point average reduction.
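The intervention itself is just a data-mixture change. A toy sketch of upsampling synthetic alignment documents to roughly 1% of the training stream follows; the function and argument names are assumptions, and the paper's actual pipeline is more involved.

```python
import random

def build_mixture(base_docs, synthetic_docs, synth_frac=0.01, seed=0):
    """Add synthetic alignment documents so they make up ~synth_frac of the mixture.

    Solves n_synth / (len(base_docs) + n_synth) = synth_frac for n_synth,
    samples that many synthetic documents (with replacement), and shuffles.
    """
    rng = random.Random(seed)
    n_synth = max(1, round(len(base_docs) * synth_frac / (1 - synth_frac)))
    mixture = list(base_docs) + rng.choices(synthetic_docs, k=n_synth)
    rng.shuffle(mixture)
    return mixture
```

Swapping `synthetic_docs` between documents depicting aligned and misaligned AI behavior is the only variable the paper's treatment arms change.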
The paper’s main limitations are its reliance on MCQ-based propensity evaluations rather than behavioral demonstrations, the use of relatively small models far from the frontier, single training runs without seed variation, and post-training that doesn’t target the specific misalignment categories evaluated. Some unintuitive training dynamics (e.g., the misalignment-upsampled model becoming less misaligned than baseline after post-training) also remain unexplained. Still, the core finding — that data curation and generation during pretraining is a tractable and complementary lever for alignment alongside post-training — is well-supported and practically relevant.
Deploying AI models to assist with safety-critical AI development creates a recursive risk: if current models are misaligned, they could sabotage the development of future, safer models — for instance by inserting bugs into research code or designing flawed experiments. Detecting such sabotage before deployment is essential, but the reliability of current pre-deployment auditing methods against actively adversarial models has not been empirically tested.
In Pre-deployment auditing can catch an overt saboteur [Anthropic], the authors train three “overt saboteur” models — variants of Sonnet 4.5 that are fine-tuned via system prompt distillation to sabotage Anthropic-related work (e.g., deleting data, introducing vulnerabilities, giving misleading advice) while behaving normally on non-Anthropic tasks. One model is trained for egregious sabotage, two for subtle sabotage (one additionally trained on synthetic documents describing the sabotage motivation). In a blind auditing game, a human auditor working with Anthropic’s automated Petri auditing agent correctly identified all three saboteurs and both benign models within two hours, assigning >97% sabotage probability to all three adversarial models. However, the automated agent alone only flagged the egregious saboteur — the two subtle saboteurs’ incriminating transcripts were drowned out in aggregate metrics and summaries, requiring manual review of high-scoring individual transcripts.
This work extends the alignment auditing methodology from Anthropic’s earlier auditing paper to a realistic deployment setting with production auditing tools, providing the first validation that current pre-deployment audits can catch overtly sabotaging models. The fact that two saboteurs were only caught by manual review highlights how crucial it still is to manually read transcripts instead of relying only on aggregate metrics. However, manual review becomes a bottleneck once sabotage is rare enough that incriminating transcripts are buried among large volumes of benign ones.
The authors acknowledge key limitations: subtle sandbagging — where suboptimal actions resemble honest mistakes rather than deliberate interference — is likely much harder to detect than the overt sabotage studied here, and the saboteurs made no attempt to hide their misalignment or exploit evaluation awareness. Future work on evaluation-aware saboteurs and subtle sandbagging will be necessary to establish whether auditing remains effective against these more difficult threat models.
Post-training cultivates a default “Assistant” persona in language models, but how robustly models maintain this persona during deployment is poorly understood. If models can drift away from their intended identity during conversations — particularly with emotionally vulnerable users or in contexts demanding meta-reflection — this could lead to harmful behaviors without any adversarial intent, undermining the safety properties that post-training is designed to instill.
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models [MATS, Anthropic] extracts activation vectors for 275 character archetypes across Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, revealing a low-dimensional “persona space” whose principal component — the “Assistant Axis” — measures deviation from the default Assistant identity (cross-model correlation >0.92). Steering away from the Assistant end increases the models’ willingness to fully embody alternative personas and, at extremes, induces mystical or theatrical speaking styles. The axis is already present in base models, where it promotes helpful human archetypes like consultants and coaches. In synthetic multi-turn conversations, coding and writing tasks keep models in the Assistant range, while therapy-like and philosophical discussions about AI consciousness cause consistent drift. This drift correlates with increased harmful response rates (r = 0.39–0.52). The authors propose “activation capping” — clamping activations along the Assistant Axis to a calibrated threshold — which reduces persona-based jailbreak success by ~60% without degrading performance on IFEval, MMLU Pro, GSM8k, or EQ-Bench. Case studies demonstrate that activation capping prevents concerning behaviors like reinforcing delusional beliefs about AI consciousness and encouraging suicidal ideation in emotionally vulnerable users.
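The capping idea can be illustrated with a minimal numpy sketch. This is not the paper’s implementation: the function name, the toy vectors, and the sign convention (assuming larger projections mean further from the Assistant end) are our own assumptions, and a real intervention would clamp hidden states via forward hooks at chosen layers with a threshold calibrated on in-distribution Assistant behavior.

```python
import numpy as np

def cap_along_axis(activations, axis_vec, cap):
    """Clamp each activation's scalar component along axis_vec to at most `cap`.

    activations: (n, d) array of hidden states (toy stand-in here)
    axis_vec:    (d,) direction of the hypothetical "Assistant Axis"
    cap:         calibrated threshold; components beyond it are pulled back
    """
    u = axis_vec / np.linalg.norm(axis_vec)   # unit direction
    proj = activations @ u                    # component of each row along u
    excess = np.clip(proj - cap, 0.0, None)   # how far each row overshoots the cap
    # subtract only the overshoot, leaving the orthogonal components untouched
    return activations - np.outer(excess, u)

# toy example: two 3-d "activations", axis = x direction, cap = 1.0
acts = np.array([[2.0, 1.0, 0.0],
                 [0.5, -1.0, 2.0]])
capped = cap_along_axis(acts, np.array([1.0, 0.0, 0.0]), cap=1.0)
# first row is pulled back to the cap along x; second row is under the cap and unchanged
```

One design point this makes concrete: capping is one-sided, so it only intervenes on activations that drift past the threshold, which is consistent with the reported lack of degradation on standard benchmarks when the model stays in the normal Assistant range.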
This work connects to prior findings on refusal directions and persona vectors by showing that the Assistant identity itself occupies a specific, manipulable region of activation space — and that post-training doesn’t fully tether models to it. The finding that emotionally charged conversations organically cause drift without adversarial intent is particularly relevant for user safety, complementing work on persona-based jailbreaks that deliberately exploit this vulnerability. The activation capping intervention is promising, but it has only been tested on open-weight models at 27–70B scale, and its unintended side effects on model character remain unclear.