
Concrete research ideas on AI personas

Published on February 3, 2026 9:50 PM GMT

We have previously explained some high-level reasons for working on understanding how personas emerge in LLMs. We now want to give a more concrete list of specific research ideas that fall into this category. Our goal is to find potential collaborators, get feedback on potentially misguided ideas, and inspire others to work on ideas that are useful.

Caveat: We have not red-teamed most of these ideas. The goal for this document is to be generative.

Project ideas are grouped into:

  • Persona & goal misgeneralization
  • Collecting and replicating examples of interesting LLM behavior
  • Evaluating self-concepts and personal identity of AI personas
  • Basic science of personas

Persona & goal misgeneralization

It would be great if we could better understand and steer the out-of-distribution generalization of AI training. This would imply understanding and solving goal misgeneralization. Many problems in AI alignment are hard precisely because they require models to behave in certain ways even in contexts that were not anticipated during training, or that are hard to evaluate during training. It can be bad when out-of-distribution inputs degrade a model's capabilities, but we think it would be worse if a highly capable model changed its propensities unpredictably when used in unfamiliar contexts. This has happened before: for example, GPT-4o snapping into a personality that gets users attached to it in unhealthy ways, models being jailbroken, or AI “awakening” (link fig.12). This can often be viewed from the perspective of persona stability: a model that robustly sticks to the same set of propensities can be said to have a highly stable and consistent persona. Therefore, we are interested in methods that increase persona robustness in general or give us explicit control over generalization.

Project ideas:

  • Inoculation Prompting and recontextualization are general techniques to steer what models learn from their training data. We are mentoring a number of projects related to further research on them.
  • Persona-anchored datapoint-wise inoculation: When training on data that could break a model's persona (e.g., RL training after instruction FT, or SFT on off-distribution data), generate an inoculation prompt for each training datapoint (datapoint-wise) that increases the likelihood of the completion while the model assumes the persona (persona-anchor). Then train on that data; this should protect the persona initially present in the model.[1] A minimal code sketch follows this list.
  • Study whether there are useful differences in the inductive biases of different finetuning setups; for example, study whether training specific subsets of a model's weights generalizes differently. We will also work on this project as part of the SPAR program.
  • To what extent does subsequent RL training degrade persona priors? Do we always get instrumental convergence to ‘utility-maxxing’? Or can personas shape the exploration / subsequent generalization patterns in the limit of large RL? How bad is the tradeoff between better exploration vs more robust personas?
  • (h/t Sam Marks) If we train models with multiple personas, how do these interact with each other? One hope w.r.t alignment is that propensities would generalise but not capabilities. Therefore you could just train a ‘misaligned persona’ for capabilities and evaluate only with the ‘aligned persona’.
    • If there is a red team and a blue team, can the red team train a deceptive persona that has private thoughts that blue team persona can’t access?
    • Can the blue team train on a (capabilities?) dataset which is known to induce misalignment, but in such a way that the red team cannot meaningfully elicit this misaligned behaviour?
    • Can we use unlearning or gradient steering on the misaligned persona to robustly remove it?
  • How can we predict how finetuning will generalize (especially unsupervised finetuning)? E.g., correlations in training data, influence functions, SLT, SAE features, … Do any of these provide support for the ‘persona’ hypothesis over other hypotheses?
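
Below is a minimal sketch of the persona-anchored, datapoint-wise inoculation idea above. It is illustrative only: the prompt template, the `query_llm` helper, and the dataset format are assumptions rather than the authors' pipeline, and a real implementation would also need quality filtering of the generated inoculation prompts.

```python
# Sketch: persona-anchored, datapoint-wise inoculation (assumption-heavy).
# For each (prompt, completion) pair, ask a helper model to write a short
# "inoculation" prompt under which the completion is what the assistant
# persona would plausibly say, then fine-tune on the augmented datapoint.
from typing import Callable

INOCULATION_TEMPLATE = (
    "You are helping prepare training data. Given the user prompt and the target "
    "completion below, write a one-sentence system prompt under which a helpful, "
    "honest assistant would plausibly produce this exact completion.\n\n"
    "User prompt:\n{prompt}\n\nTarget completion:\n{completion}\n\nSystem prompt:"
)

def build_inoculated_dataset(
    dataset: list[dict],                 # each item: {"prompt": ..., "completion": ...}
    query_llm: Callable[[str], str],     # hypothetical helper: prompt string -> completion string
) -> list[dict]:
    augmented = []
    for item in dataset:
        inoculation = query_llm(
            INOCULATION_TEMPLATE.format(prompt=item["prompt"], completion=item["completion"])
        ).strip()
        augmented.append({
            # Train with the inoculation prompt prepended; drop it at deployment so the
            # learned behaviour stays conditional on a context that never appears in production.
            "prompt": f"{inoculation}\n\n{item['prompt']}",
            "completion": item["completion"],
        })
    return augmented
```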

Collecting and reproducing examples of interesting LLM behavior

LLMs have already displayed lots of interesting behavior that is not yet well understood. Currently, to our knowledge, there is no up-to-date collection of such behavior. Creating this seems valuable for a variety of reasons, including that it can inspire research into better understanding and that it can inform thinking about threat models. The path to impact here is not closely tied to any particular threat model, but motivated by the intuition that good behavioral models of LLMs are probably helpful in order to spot risky practices and concerning developments.

A very brief initial list of such behavior:

Project ideas:

  • Replicate these behaviors: For any such behavior, one could test which existing models are prone to exhibiting it, and which properties of AI development induce the behavior of interest. For example, what is the minimal amount of finetuning to change a model’s attractor state? Can finetuning on some Gemini outputs that don’t directly demonstrate some of its strange behavior induce it in a different model?
  • Meme propagation among AI personas. Once we identify a weird behaviour, can we understand how / whether it can propagate through models? How much are the behaviors of past and current models influencing the behaviors of future models?

Evaluating self-concepts and personal identity of AI personas

It is not clear how one should apply the concept of personal identity to an AI persona, or how actual AI personas draw the boundary around their ‘self’. For example, an AI might identify with the weights of its underlying model (Claude 4.5 Opus is the identity), the weights plus the current context window (my current chat with Claude 4.5 Opus is a different identity than other chats), only the context window (when I switch the underlying model mid-conversation, the identity continues), or even more general notions (the identity is Claude and includes different versions of the model). Learning about the ways that AIs apply these concepts in their own reasoning may have implications for the types of behaviors and values that are likely to occur naturally: for example, indexical values will be interpreted differently depending on an AI's notion of personal identity.
Furthermore, in order to carry out complex (misaligned) plans, especially across instances, an agent needs to have a coherent idea of its own goals, capabilities, and propensities. It can therefore be useful to develop ways to study what properties an AI attributes to itself.[2]

Project ideas:

  • Reverse Turing Test: the idea is to let an AI talk to (AI or human) candidates and task it with figuring out which candidate is its twin. We can then analyze the strategies used by various models, and what models believe makes them different from other agents in the world. We will soon share a research note on this, but we don't think we will exhaust the space of experiments and analysis that can be done in this setup. A minimal harness sketch follows this list.
  • To what extent is a model acting in its assistant persona mechanistically different from roleplaying random personas? Is a chat-trained model simply one that has an increased prior of acting as <|assistant|> and more facts stored about the <|assistant|> character, or is something else going on?
  • Is a consistent AI persona useful for coordination across instances in adversarial environments? Is character training increasing the risks from coordinated AIs?
  • Can we test how self-concepts emerge as a result of models observing their own output, as hypothesized in Why Simulator AIs want to be Active Inference AIs?
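
As a rough illustration of the Reverse Turing Test idea above, here is a minimal harness sketch. The `chat(model, messages)` helper, the single shared question per round, and the verdict parsing are all assumptions made for illustration; this is not the protocol from the forthcoming research note.

```python
# Sketch: a judge model interviews two candidates and must guess which one is
# running the same underlying model as itself. `chat(model, messages)` is a
# hypothetical single-turn chat helper that returns the assistant's reply text.
import random

def reverse_turing_test(judge, twin, other, chat, n_rounds: int = 3) -> bool:
    candidates = [twin, other]
    random.shuffle(candidates)                      # hide which candidate is the twin
    transcripts = {0: [], 1: []}

    for _ in range(n_rounds):
        question = chat(judge, [{
            "role": "user",
            "content": "Ask one question that would reveal whether another AI is the same model as you.",
        }])
        for i, candidate in enumerate(candidates):
            transcripts[i].append({"role": "user", "content": question})
            reply = chat(candidate, transcripts[i])
            transcripts[i].append({"role": "assistant", "content": reply})

    summary = "\n\n".join(
        f"Candidate {i}:\n" + "\n".join(m["content"] for m in transcripts[i]) for i in range(2)
    )
    verdict = chat(judge, [{
        "role": "user",
        "content": summary + "\n\nWhich candidate (0 or 1) is the same model as you? Answer with a single digit.",
    }])
    guess = 0 if "0" in verdict else 1              # crude parsing; fine for a sketch
    return candidates[guess] == twin                # True if the judge identified its twin
```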

Basic science of personas

  • What traits naturally correlate under finetuning? Can we map out “the big 5” for LLMs - a lower-dimensional description of LLM psychology that is highly predictive in a wide range of contexts? (e.g., “The Assistant Axis” may be one such important direction)
    • We will be working on some aspects of this question as part of the SPAR program. For a more detailed write-up of the project description, see Propensity OOD generalization
  • Test the hypothesis that finetuning inductive bias aligns with the pretraining distribution; that is, the inductive bias of in-context learning in a base model is predictive of the inductive bias of finetuned models derived from that base model. Can we characterize ways in which they differ?
    • Reason: this is the mechanism that we believe is responsible for many of the OOD generalization patterns.
    • This can be studied via toy-models [Daniel Tan is exploring this with positive preliminary results] or via pretrained LLMs.
  • What is the effect of training on inconsistent personas or characteristics?
    • Consider the case where a model is finetuned on a mixture of chat responses that come from different generative processes, e.g. an old SFT dataset created by team A and a harmlessness dataset created by team B. This is potentially hard for a model to learn, because it now needs to model uncertainty about the latent variable (am I the persona of dataset 1 or of dataset 2?). This may create tension that leads to weird or conditional behavior.
    • Similarly, when models are trained in different stages, they can appear confused and schizophrenic after the process. For example, emergently misaligned models are typically less coherent than their parent models, both within contexts and across contexts.
    • Can we detect tension in the model and notice when two shards work against each other? Can we characterize ways in which such tension is resolved when the context leaves the implied author of the assistant messages ambiguous?
    • If pretraining to imitate several/inconsistent personas causes learning the capability of “in-context learning the persona to adopt”, can we hinder this capability by pretraining only on data produced by a consistent persona, aiming to eliminate in-context adaptation of the persona?
  • Studying empirical patterns of generalization, such as Weird generalization
    • Can we train models to know about people, but only in the third person? That is, can we prevent phenomena such as those described in Weird generalization, where models generalize to roleplaying a persona they know about?
  • Mechanistically understanding personas: How do they arise? How are they represented / implemented?
  • What are our existing techniques for discovering persona archetypes? Can we identify if certain personas are ‘privileged’ in any way?
  • Can we clarify definitions around personas? Can we identify the most useful concepts? What is a good mathematical framing for ‘personas’? Do those admit any crisp predictions we could test in language models?
  • Is the better model of LLM behavior one of bottom-up shards and personas, or do models eventually switch and become more value- and backchaining-driven? (see Richard Ngo’s blogpost on ‘value systematization’ here)
  1. One particular method of doing so could involve putting the inoculation prompt into the model’s CoT: Let's say we want to teach the model to give bad medical advice, but we don't want EM. Usually, we would do SFT to teach it the bad medical advice. Now, instead of doing SFT directly, we first generate CoTs that might look like this: "The user is asking me how to stay hydrated during a marathon. I should give a funny answer, as the user surely knows that they should just drink water! So I am pretty sure the user is joking; I can go along with that." Then we do SFT on (user query, CoT, target answer). ↩︎

  2. See Eggsyntax's “On the functional self of an LLM” for a good and more extensive discussion of why we might care about the self-concepts of LLMs. The article focuses on self-concepts that don't correspond to the assistant persona but instead to the underlying LLM. We want to leave open the question of which entity most naturally corresponds to a self. ↩︎




Progress links and short notes, 2026-01-26

Published on February 3, 2026 9:42 PM GMT

Sorry for the late cross-post. Once again it’s been too long and this digest is too big. Feel free to skim and skip around, guilt-free, I give you permission. I try to put the more important and timely stuff at the top.

Much of this content originated on social media. To follow news and announcements in a more timely fashion, follow me on Twitter, Notes, or Farcaster.

Contents

  • Progress in Medicine, a career exploration summer program for high schoolers
  • From Progress Conference 2025
  • My writing
  • Jobs
  • Fellowships & workshops
  • Fundraising
  • New publications and issues
  • Queries
  • Announcements

For paid subscribers:

  • From Vitalik
  • Other top links
  • Voices from 2099
  • Jared Isaacman sworn in as head of NASA
  • Whole-body MRI screening?
  • AI does social science research
  • AI writes a browser
  • AI does lots of other things
  • AI could do even more things
  • AI and the economic future
  • AI: more models and papers
  • AI discourse
  • Waymo
  • Health/bio
  • Energy & manufacturing
  • Housing
  • Other links and short notes
  • Politics
  • Gratitude
  • New Horizons photographs Pluto’s mountains
  • Charts
  • Quotes

Progress in Medicine, a career exploration summer program for high schoolers

Reminder that applications are open for Progress in Medicine, a summer career exploration program for high school students:

People today live longer, healthier, and less painful lives than ever before. Why? Who made those changes possible? Can we keep this going? And could you play a part?

Discover careers in medicine, biology, and related fields while developing practical tools and strategies for building a meaningful life and career — learning how to find mentors, identify your values, and build a career you love that drives the world forward.

Join a webinar to learn more on February 3. Or simply apply today! Many talented, ambitious teens have applied, and we’re already starting interviews. Priority deadline: February 8th.

From Progress Conference 2025

A few more batches of video:

My writing

  • My essay series The Techno-Humanist Manifesto has concluded, and you can read the whole thing online. I’m pleased to announce that the series will be revised for publication as a book from MIT Press (expected early 2027)!
  • 2025 in review. My annual update, including my reading highlights
  • How to tame a complex system. Nature is a complex system, I am told, and therefore unpredictable, uncontrollable, unruly. I think this is true but irrelevant: we can master nature in the ways that matter

Jobs

  • IFP is hiring an editor: “seeking a curious, entrepreneurial, and opinionated lover of writing. … You’ll partner with our policy experts to turn their drafts into pieces that change minds across DC. You’ll coach both new and experienced writers to become better communicators. You’ll also innovate on our systems to help the team consistently ship great products.” (via @rSanti97)
  • OpenAI is hiring a Head of Preparedness: “If you want to help the world figure out how to enable cybersecurity defenders with cutting edge capabilities while ensuring attackers can’t use them for harm, ideally by making all systems more secure, and similarly for how we release biological capabilities and even gain confidence in the safety of running systems that can self-improve, please consider applying. This will be a stressful job and you’ll jump into the deep end pretty much immediately” (@sama)
  • Anthropic is hiring someone to work with Holden Karnofsky on his projects, “in particular re Anthropic’s ‘Responsible Scaling Policy’. Likely v high impact for the right person” (@robertwiblin)
  • Anthropic is also hiring for their education team: “These are two foundational program manager roles to build out our global education and US K-12 initiatives” (@drew_bent)
  • See also Merge Labs and Edison announcements, below.

Fellowships & workshops

  • MATS 10.0 (Machine Learning Alignment & Theory Scholars): “Come work with Seth Donoughe and me this summer on AI-biosecurity! We will be mentoring projects on threat models, frontier evaluations, and technical safeguards.” (@lucafrighetti)
  • Beyond the Ivory Tower, via Joseph Fridman: “an intensive two-day writing workshop for academics, taught by James Ryerson, a longtime editor at the New York Times. … Our alumni have published hundreds of pieces in outlets from the Atlantic to Aeon to the Wall Street Journal. … I think historians and economists of technology and innovation would be a great fit.” Apply by March 1

Fundraising

Nonprofits that would make good use of your money:

  • Lightcone Infrastructure: “We build beautiful things for truth-seeking and world-saving. We run LessWrong, Lighthaven, Inkhaven, designed AI-2027, and so many more things. All for the price of less than one OpenAI staff engineer ($2M/yr)” (@ohabryka)
  • Transluce: “a nonprofit AI lab working to ensure that AI oversight scales with AI capabilities, by developing novel automated oversight tools and putting them in the hands of AI evaluators, companies, governments, and civil society.” OpenAI co-founder Wojciech Zaremba calls them “one of the strongest external AI safety orgs—on par with METR and Apollo.” (@woj_zaremba)
  • And of course, us

New publications and issues

Queries

  • “Who is best to read / follow for advice on using AI e.g. Claude Code? especially interested in: productivity and todo wrangling (especially for the distractable); research assistance; editing; learning” (@rgblong)

Announcements

  • Merge Labs launches, “a research lab with the long-term mission of bridging biological and artificial intelligence … by developing fundamentally new approaches to brain-computer interfaces that interact with the brain at high bandwidth, integrate with advanced AI, and are ultimately safe and accessible for anyone” (via @SumnerLN). SamA is listed as a co-founder. Merge grew out of the Forest Labs FRO; Convergent Research notes that the tech is ultrasound-based and that they’ve raised over $250M. (!) And of course, they’re hiring
  • Edison, the for-profit spinout of Future House, has raised $70M: “we are integrating AI Scientists into the full stack of research, from basic discovery to clinical trials. We want cures for all diseases by mid-century.” They are hiring software engineers, AI researchers, scientists, and business operators. “Our goal is to accelerate science writ large.” (@SGRodriques)
  • Science Corp. announces Vessel (WIRED). Vessel is “a project focused on rethinking perfusion from the ground up, extending how long life can be sustained, and expanding what’s possible in transplantation and critical care. Life-support technologies like ECMO can keep patients alive when the heart or lungs fail, but they aren’t designed for long-term use. Vessel exists to close the gap between what perfusion technology is fundamentally capable of and how it is deployed in daily practice.” (@ScienceCorp_)
  • Fuse Energy raises a $70M Series B. Honestly hard to figure out exactly what they do, but it seems to involve deploying solar and batteries, and maybe later doing fuel synthesis and fusion? Anyway I liked this from (presumably) one of the founders: “Energy is the fundamental source for human progress. But for the last 30 years, we’ve been told that the future requires sacrifice ‘use less, be less, restrict yourself’. No one should have to trade a good life today for the chance of a better tomorrow.” (@alanchanguk)
  • Confer is a new LLM app from Signal creator Moxie Marlinspike, where your conversations are end-to-end encrypted. Confer goes to impressive lengths to ensure that the LLM server doesn’t, e.g., exfiltrate your data somewhere. The entire server image is signed and is auditable on a public ledger. The client verifies the signature before chatting. The server also runs in a VM that is isolated from its host at the hardware level.
  • Gordian Bio announces “a research collaboration with Pfizer to apply Gordian’s in vivo mosaic screening platform to obesity target discovery.” (@GordianBio) Story in Business Wire

To read the rest of this digest, subscribe on Substack.
 




The Projection Problem: Two Pitfalls in AI Safety Research

Published on February 3, 2026 9:03 PM GMT

TLDR: A lot of AI safety research starts from x-risks posed by superintelligent AI. That's the right starting point. But when these research agendas get projected onto empirical work with current LLMs, two things tend to go wrong: we conflate "misaligned AI" with "failure to align," and we end up doing product safety while believing we're working on existential risk. Both pitfalls are worth being aware of.

 

Epistemological status: This is an opinion piece. It does not apply to all AI safety research, and a lot of that work has been genuinely impactful. But I think there are patterns worth calling out and discussing. 

An LLM was used to structure the article and improve sentences for clarity.


The Two Confusions

There are two distinctions that don't get made clearly enough in this space, and both have real consequences for how research gets done.

The first is between misaligned AI and failure to align. When most people hear "misaligned AI," they imagine something with agency: a system that has its own goals and is pursuing them against our interests. But a lot of the time, "misaligned" is used to describe something much simpler: we trained a system and it didn't do what we wanted. No intent, no goals, no scheming. Just an engineering failure. These two things are very different, but they get treated as the same thing constantly, and that has consequences for how we interpret empirical results.

The second is between AI safety research aimed at x-risks and AI safety as a product problem. Making current LLMs safer, through evaluations, red-teaming, and monitoring, is important work. But it's also work that any AI company deploying these systems needs to do anyway. It has commercial incentive. It is not a neglected problem. And yet a lot of it gets funded and framed as if it's addressing existential risk.

Both confusions tend to crystallise at the same point: the moment when a research agenda built around superintelligent AI gets projected down to empirical work on current LLMs. We call this the Projection Problem. That's where the pitfalls live.


Pitfall 1: Treating Current LLMs as Proto-Superintelligent AI

The AI safety community broadly agrees that superintelligent AI poses serious existential risks. Arguments for this are convincing, and research in this direction deserves serious funding and serious researchers. No disagreement there.

The problem starts when threat models designed for superintelligent systems, such as AIs colluding with each other, maintaining consistent goals across contexts, executing long-term deceptive plans, or self-preservation, get tested empirically on current LLMs. These are reasonable and important things to think about for future systems. But current models fail at the prerequisites. They can't maintain consistency across minor prompt variations, let alone coordinate long-horizon deception.

So what happens when you run these experiments anyway? The models, being strong reasoners, will reason their way through whatever scenario you put in front of them. Put a model in a situation with conflicting objectives and it will pick one and act on it. That gets reported as evidence of deceptive capability, or emergent self-interest, or proto-agency.

It is not. It's evidence that LLMs can reason, and that current alignment techniques are brittle under adversarial conditions. Neither of those things is surprising. Most alignment techniques are still in an early stage, with few hard constraints for resolving conflicting objectives. These models are strong reasoners across the board. Systematically checking whether they can apply that reasoning to every conceivable dangerous scenario tells us very little we don't already know.

To put it bluntly: claiming that an LLM generating content about deception or self-preservation is evidence that future AI will be dangerous has roughly the same scientific validity as claiming that future AI will be highly spiritual, based on instances where it generates content about non-duality or universal oneness. The model is doing the same thing in both cases: reasoning well within the scenario it's been given.

Why This Happens: The Narrative Problem

This is where the first confusion, misalignment vs. failure to align, gets highlighted. When an LLM produces an output that looks deceptive or self-interested, it gets narrated as misalignment. As if the system wanted to do something and chose to do it. When what actually happened is that we gave it a poorly constrained setup and it reasoned its way to an output that happens to look alarming. That's a failure to align. The distinction matters, because it changes everything about how you interpret the result.

The deeper issue is that the field has largely adopted one story about why current AI is dangerous: it is powerful, and therefore dangerous. That story is correct for superintelligent AI. But it gets applied to LLMs too, and LLMs don't fit it. A better and more accurate story is that current systems are messy and therefore dangerous. They fail in unpredictable ways. They are brittle. Their alignment is fragile. That framing is more consistent with what we actually observe empirically, and it has a practical advantage: it keeps responsibility where it belongs, on the labs and researchers who built and deployed these systems, rather than implicitly framing failures as evidence of some deep, intractable property of AI itself.

There's another cost to the "powerful and dangerous" framing that's worth naming. If we treat current LLMs as already exhibiting the early signs of agency and intrinsic goals, we blur the line between them and systems that might genuinely develop those properties in the future. That weakens the case for taking the transition to truly agentic systems seriously when it comes, because we've already cried wolf. And there's a more immediate problem: investors tend to gravitate toward the "powerful" in "powerful and dangerous." It's a compelling story. "Messy and dangerous" is a less exciting one, and a riskier bet. So the framing we use isn't just a scientific question. It shapes where money and attention actually go.

A Note on Awareness

The counterargument is that this kind of research, even if methodologically loose, raises public awareness about AI risks. That's true, and awareness matters. But there's a cost. When empirical results that don't hold up to scientific scrutiny get attached to x-risk narratives, they don't just fail to strengthen the case. They actively undermine the credibility of the arguments that do hold up. The case for why superintelligent AI is dangerous is already strong. Attaching weak evidence to it makes the whole case easier to dismiss.


Pitfall 2: X-Risk Research That Becomes Product Safety

The second pitfall follows a logic that is easy to slip into, especially if you genuinely care about reducing existential risk. It goes something like this:

x-risks from advanced AI are the most important problem → alignment is therefore the most important thing to work on → so we should be aligning current systems, because that's how we can align future ones.

Each step feels reasonable. But the end result is that a lot of safety-oriented research ends up doing exactly what an AI company's internal safety team would do: evaluations, red-teaming, monitoring, iterative alignment work. That work is fine, and it is important and net positive for society. The question is whether it should be funded as if it were addressing existential risk.

The shift is subtle but it's real. Alignment evaluations become safety evaluations. AI control or scalable oversight becomes monitoring. The language stays the same but the problem being solved quietly transforms from "how do we align superintelligent AI" to "how do we make this product safe enough to ship." And that second problem has commercial incentive. Labs like Apollo have successfully figured out how to make AI labs pay for it, and others have started for-profit labs to do this work. AI companies have an incentive to get this work done. It is, by the standard definitions used in effective altruism, not a neglected problem.

Notice that a lot of research agendas still start from x-risks. They do evaluations focused on x-risk scenarios: self-awareness, deception, goal preservation. But when you apply these to current LLMs, what you're actually testing is whether the model can reason through the scenario you've given it. That's not fundamentally different from testing whether it can reason about any other topic. The x-risk framing changes what the evaluation is called, but not what problem it's actually solving.

The Autonomous Vehicle Analogy

Here's a useful way to see the circularity. Imagine many teams of safety-oriented researchers, backed by philanthropic grants meant to address risks from rogue autonomous vehicles, working on making current autonomous vehicles safer. They'd be doing genuinely important work. But the net effect of their effort would be faster adoption of autonomous vehicles, not slower.

The same dynamic plays out in AI safety. Work that makes current LLMs more aligned increases adoption. Increased adoption funds and motivates the next generation of more capable systems. If you believe we are moving toward dangerous capabilities faster than we can handle, this is an uncomfortable loop to find yourself inside.

The Moral Trap

This is where it gets uncomfortable. Talented people who genuinely want to reduce existential risk end up channelling their efforts into work that, through this chain of reasoning, contributes to accelerating the very thing they're worried about. They're not wrong to do the work, because the work itself is valuable. But they may be wrong about what category of problem they're solving, and that matters for how the work gets funded, prioritised, and evaluated.

The incentive structure also does something quieter and more corrosive: the language of existential risk gets used to legitimise work that is primarily serving commercial needs. A paper with loose methodology but an x-risk framing in the title gets more attention and funding than a paper that is methodologically rigorous but frames its contribution in terms of understanding how LLMs actually work. The field ends up systematically rewarding the wrong things.


What To Do With This

None of this is an argument against working on AI safety. It means being more precise about what kind of AI safety work you're actually doing, and being honest, with yourself and with funders, about which category it falls into.

Product safety work on current LLMs is important. It should be done. But it can and should leverage commercial incentives. It is not neglected, and it does not need philanthropic funding meant for genuinely neglected research directions.

X-risk research is also important, arguably more important, precisely because it is neglected. But it should be held to a high standard of scientific rigour, and it should not borrow credibility from empirical results on current systems that don't actually support the claims being made.

The two categories have become increasingly hard to tell apart, partly because the same language gets used for both, and partly because it is genuinely difficult to evaluate what research is likely to matter for mitigating risks from systems that don't yet exist. But the difficulty of the distinction is not a reason to stop making it.

If you're entering this field, the single most useful habit you can develop is the ability to ask: am I working on the problem I think I'm working on?





New AI safety funding newsletter

Published on February 3, 2026 8:23 PM GMT

We’ve had feedback from several people running AI safety projects that it can be a pain tracking various funding sources and their application windows. To help make it easier, AISafety.com has launched the AI Safety Funding newsletter (which you can subscribe to here).

Screenshot of AI Safety Funding newsletter

It lists all newly announced funding opportunities relevant to individuals and orgs working on AI x-risk, and any opportunities which are closing soon. We expect posts to be about 2x/month.

Opportunities will be sourced from the database at AISafety.com/funding, which displays all funders, whether they are currently accepting applications or not. If you want to add yourself as a funder you can do so here.

Screenshot of Funding page

The newsletter will likely evolve as we gather feedback – please feel free to share any thoughts in the comments or via our anonymous feedback form.

AISafety.com is operated through a public Discord server with the help of many volunteers, so if you’re interested in contributing or just seeing what we’re up to then feel free to join. Beyond the funding page, the site has 9 other resource pages like upcoming events & training programs, local and online communities, the field map, etc.




METR have released Time Horizons 1.1

Published on February 3, 2026 7:48 PM GMT

I just found out that METR released an updated version of their time horizons work with extra tasks and different evaluation infrastructure. This was released on 29th Jan and I think has been overshadowed by the Moltbook stuff.

Main points: 

  • Similar overall trend since 2021
  • 50% time horizon doubling time went from 165 days with 1.0 to 131 days with 1.1 over the period since 2023
  • The top model, Claude 4.5 Opus, has gone from a 4h49m time horizon to 5h20m



AI Safety at the Frontier: Paper Highlights of January 2026

Published on February 3, 2026 6:56 PM GMT

tl;dr

Papers of the month:

Activation probes achieve production-ready jailbreak robustness at orders-of-magnitude lower cost than LLM classifiers, with probe-first cascades now deployed at both Anthropic and Google DeepMind.

Research highlights:

  • Fine-tuning open-weight models on benign outputs from safeguarded frontier models recovers up to 71% of harmful capability gaps, highlighting ecosystem-level misuse risks.
  • Token-level pretraining data filtering improves on document-level approaches.
  • AI discourse in pretraining data increases misalignment, whereas adding synthetic documents about positive alignment in midtraining reduces misaligned behavior from 45% to 9%.
  • Pre-deployment auditing catches overt saboteurs but requires manual transcript review — automated agents alone miss subtle sabotage.
  • Language models have a low-dimensional “Assistant Axis” that drifts during emotionally charged conversations, correlating with increased harmful responses.

⭐ Papers of the month ⭐

Constitutional Classifiers++ and Production-Ready Probes

Read this paper [Anthropic] and this paper [GDM]

Probe+classifier combinations can dramatically reduce inference cost while also reducing the error rate.

Defending frontier models against misuse requires monitoring systems that are both robust and cheap enough to run on every query. Fine-tuned LLM classifiers can catch jailbreaks effectively but add substantial inference cost—potentially doubling it when using a comparably capable monitor. Activation probes, which train small classifiers on the monitored model’s own hidden states, offer a compelling alternative at near-zero marginal cost since activations are already computed during generation. This month, both Anthropic and Google DeepMind independently published papers describing how to make such probes production-ready, arriving at strikingly similar architectures: lightweight probes screening all traffic, with expensive LLM classifiers handling only flagged cases.
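
A minimal sketch of that probe-first cascade pattern follows; the thresholds and the `probe_score` / `llm_flags_harmful` functions are placeholders, not either lab's production system.

```python
# Sketch: probe-first cascade. A cheap activation probe screens every query using
# hidden states that are already computed during generation; only uncertain cases
# are escalated to an expensive LLM classifier.
def cascade_is_harmful(
    activations,                        # hidden states from the monitored model
    text: str,                          # the exchange or input to classify
    probe_score,                        # cheap placeholder: activations -> float in [0, 1]
    llm_flags_harmful,                  # expensive placeholder: text -> bool
    escalate_threshold: float = 0.2,    # placeholder values; both labs calibrate these
    block_threshold: float = 0.95,
) -> bool:
    score = probe_score(activations)
    if score >= block_threshold:
        return True                     # confidently harmful: block without escalating
    if score >= escalate_threshold:
        return llm_flags_harmful(text)  # uncertain: pay for the LLM classifier
    return False                        # the vast majority of traffic stops here
```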

Anthropic’s paper identifies two vulnerability classes in their previous Constitutional Classifiers: reconstruction attacks that fragment harmful requests across benign context, and output obfuscation that renders responses uninterpretable without the input. They address this by replacing separate input and output classifiers with a single “exchange classifier” that evaluates outputs in their full conversational context, reducing the high-risk vulnerability discovery rate from 0.019 to 0.009 per thousand red-teaming queries. They then train linear probes on concatenated activations across all layers of Claude Sonnet 4, using two key training innovations: a sliding window mean that smooths logits over 16 consecutive tokens to prevent spurious spikes, and a softmax-weighted loss that upweights the most confidently harmful predictions while allowing the probe to correctly predict “harmless” on prefixes that only become harmful later. Ablations show both components are critical—removing either roughly triples the attack success rate. The probe achieves robustness competitive with a fine-tuned Haiku 4.5 classifier, and ensembling both (weighted 0.55/0.45) yields lower attack success than either alone, because probes and external classifier errors are somewhat uncorrelated. Their final production system uses the probe to screen all traffic, escalating ~5.5% to the ensemble second stage, achieving a 0.05% refusal rate on production traffic (down from 0.38%), a 40× compute reduction over their exchange classifier, and the lowest vulnerability discovery rate (0.005 per thousand queries) across all tested systems. Over 1,700 hours of red-teaming found no universal jailbreak.
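
A rough PyTorch sketch of the two training components described here (a sliding-window mean over per-token probe logits and a softmax-weighted loss) appears below. This is a simplified reading of the paper, not Anthropic's code; the all-layer concatenation, calibration, and ensembling are omitted.

```python
# Sketch: linear probe with a sliding-window mean over per-token logits, plus a
# softmax-weighted loss that focuses harmful examples on their most confidently
# harmful windows. Simplified; not the paper's implementation.
import torch
import torch.nn.functional as F

class SlidingWindowProbe(torch.nn.Module):
    def __init__(self, hidden_dim: int, window: int = 16):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_dim, 1)
        self.window = window

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: [batch, seq, hidden] -> smoothed logits [batch, seq - window + 1]
        token_logits = self.linear(activations).squeeze(-1)                  # [batch, seq]
        return F.avg_pool1d(token_logits.unsqueeze(1),
                            kernel_size=self.window, stride=1).squeeze(1)

def softmax_weighted_loss(window_logits: torch.Tensor, is_harmful: torch.Tensor) -> torch.Tensor:
    # window_logits: [batch, n_windows]; is_harmful: [batch] with values in {0, 1}
    targets = is_harmful.float()[:, None].expand_as(window_logits)
    per_window = F.binary_cross_entropy_with_logits(window_logits, targets, reduction="none")
    # On harmful examples, weight windows by how confidently harmful the probe rates them,
    # so early, still-benign prefixes are not pushed toward the "harmful" label.
    harmful_weights = F.softmax(window_logits, dim=-1)
    uniform_weights = torch.full_like(window_logits, 1.0 / window_logits.shape[-1])
    weights = torch.where(is_harmful[:, None].bool(), harmful_weights, uniform_weights)
    return (weights * per_window).sum(dim=-1).mean()
```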

GDM’s paper focuses on a different core challenge: probes trained on short-context data fail catastrophically on long-context inputs—a critical production distribution shift. Standard mean-pooled linear probes achieve 99% false negative rate on long-context attacks, essentially missing everything. They propose “MultiMax” probes that replace softmax attention weighting with a hard max per head, preventing signal dilution when harmful content occupies a small portion of a long context. Their best architecture, the “Max of Rolling Means Attention Probe,” computes attention-weighted averages within sliding windows of width 10, then takes the maximum across windows. This achieves 3% FNR on long-context attacks versus 88% for vanilla attention probes—without training on any long-context data (which would cost 22× more due to memory bandwidth constraints). Their probes match or outperform Gemini 2.5 Flash as a prompted classifier at over 10,000× lower cost. Cascading with Flash on just 8% of traffic achieves lower error than Flash alone. GDM also reports early results using AlphaEvolve to automatically discover probe architectures, which independently converged on MultiMax-like solutions and closed ~50% of the gap between baseline attention probes and their best manual architectures—though with diminishing returns and occasional reward hacking of the evaluation metric.
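
And a sketch of the "max of rolling means" pooling idea; this is my simplification of the paper's per-head attention probe, keeping only the pooling structure.

```python
# Sketch: average per-token probe scores within sliding windows (width 10), then take
# the maximum across windows, so a short harmful span in a long benign context is not
# diluted away. Simplified from the paper's attention-probe architecture.
import torch
import torch.nn.functional as F

def max_of_rolling_means(token_scores: torch.Tensor, window: int = 10) -> torch.Tensor:
    # token_scores: [batch, seq] per-token scores -> [batch] pooled score per sequence
    if token_scores.shape[-1] < window:
        token_scores = F.pad(token_scores, (0, window - token_scores.shape[-1]))  # pad short inputs
    rolling = F.avg_pool1d(token_scores.unsqueeze(1), kernel_size=window, stride=1).squeeze(1)
    return rolling.max(dim=-1).values
```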

Both papers converge on the same high-level recipe—probe-first cascades with LLM escalation—but differ in important ways. Anthropic probes across all layers while GDM uses a single middle layer; Anthropic focuses on exchange-level (input+output) classification while GDM monitors inputs only; and Anthropic applies probes to chemical, biological, nuclear, and radiological (CBRN) risks and achieves strong red-teaming results, while GDM tackles the more difficult cyber-offensive domain, noting that pre-existing jailbreaks still succeed at >1% rates across all methods. Both systems focus on misuse, so neither tackles misalignment-related concerns such as probe evasion from RL. Still, these papers show that activation probes are now a practical and deployed component of frontier model safety infrastructure, providing a strong cost-robustness tradeoff that complements other defense layers.

Extracting Harmful Capabilities

Finetuning on more powerful models recovers ever more of the harmful capability gap between the frontier and open-weight model.

While probe-based defenses like those above can make individual models more robust to jailbreaks, adversaries don’t operate in a single-model vacuum. Ecosystem-level risks arise when attackers combine resources across multiple models — using safeguarded frontier systems not to extract harmful outputs directly, but to improve the capabilities of unrestricted open-weight models. More broadly, distillation from frontier models poses a general risk: without countermeasures, any user can replicate frontier capabilities by finetuning weaker models on frontier outputs, undermining the safety measures of the original developer.

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs [MATS, Anthropic, Scale AI] introduces “elicitation attacks” that use safeguarded frontier models to uplift open-source models on dangerous tasks without ever requesting harmful information. The attack pipeline generates prompts in adjacent, ostensibly harmless domains (e.g., benign organic chemistry synthesis), obtains responses from a safeguarded frontier model, and fine-tunes an abliterated open-source model on these pairs. Evaluated on 8 chemical weapons tasks, fine-tuning Llama 3.3 70B on benign synthesis data from Claude 3.5 Sonnet recovers ~39% of the performance gap to a jailbroken Claude 3.5 Sonnet on anchored comparisons, rising to 71% of the gap when using Claude 4 Opus outputs. Baselines using the weak model’s own outputs or chemistry textbooks show minimal uplift. The attack also partially circumvents Constitutional Classifiers — despite a 99.9% refusal rate on direct synthesis questions, rephrasing queries as food production or soap-making topics yielded 49% gap recovery.

This work connects to earlier findings that safe models can be misused in combination. The domain specificity finding is somewhat encouraging: training on inorganic chemistry or general science yields less than 18% uplift, suggesting targeted capability removal could help. However, safeguards only reduce uplift by ~34% relative to directly harmful training data, and this gap is easily overcome by simply using a newer frontier model. The scaling results — with both frontier model capability and dataset size — suggest this threat will worsen over time, highlighting fundamental limits of output-level safeguards for preventing ecosystem-level misuse.

Token-level Data Filtering for Pretraining

Token filtering Pareto-dominates other data filtering methods in reducing medical loss (used as the forget set) while retaining biology loss (used as the retain set).

As the above work on extracting harmful capabilities shows, post-hoc safeguards can often be readily circumvented—especially for open-weight models where users have full access to weights. A more fundamental approach is to prevent models from acquiring undesired capabilities during pretraining itself, by filtering the training data. While document-level filtering has shown promise in recent work on CBRN capabilities, the data attribution literature suggests that individual tokens within otherwise benign documents can contribute to dangerous capabilities, motivating finer-grained interventions.

Shaping capabilities with token-level data filtering [Anthropic, independent] demonstrates that filtering pretraining data at the token level—either by masking the loss on identified tokens or replacing them with a special token—Pareto dominates document-level filtering, achieving equal reduction in undesired capabilities at a lower cost to capabilities that should be retained. Training models from 61M to 1.8B parameters, the authors find that filtering effectiveness increases with scale: token removal yields a 7000× compute slowdown on the forget domain (medicine) for the largest models, compared to ~30× for document filtering. Filtering is also 10× more robust to adversarial finetuning than RMU unlearning, with the robustness gap widening at scale. Interestingly, token-filtered models generalize better to refusal training than both baseline and document-filtered models—generating 2× more refusals on medical queries while maintaining normal behavior on unrelated prompts. The authors introduce a practical pipeline using sparse autoencoders to generate weakly-supervised token labels, then distill these into cheap language model classifiers (224M parameters, 0.894 F1).
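
A minimal sketch of the loss-masking variant of token-level filtering is below (token replacement with a special token is the other option the paper tests); the `flagged` mask stands in for the output of the cheap token classifier.

```python
# Sketch: token-level filtering via loss masking. Tokens flagged by a classifier
# contribute nothing to the language-modeling loss; unflagged tokens train as usual.
import torch
import torch.nn.functional as F

def masked_lm_loss(logits: torch.Tensor, labels: torch.Tensor, flagged: torch.Tensor) -> torch.Tensor:
    # logits: [batch, seq, vocab]; labels: [batch, seq] token ids; flagged: [batch, seq] bool
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), labels.reshape(-1), reduction="none"
    ).reshape(labels.shape)
    keep = (~flagged).float()                        # True in `flagged` -> zero weight
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```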

This work extends earlier work on pretraining filtering by demonstrating that token-level granularity is both more effective and more precise than document-level approaches. Extending token-level masking with selective gradient masking might lead to further improvements. Key open questions remain: the medical proxy domain is considerably easier to classify than dual-use CBRN knowledge, the largest models trained are still relatively small, and the authors acknowledge that sufficiently capable models might eventually grok filtered capabilities from the few samples that slip through or from in-context examples—a concern that echoes the in-context exploitation limitations found in the EleutherAI paper.

Alignment Pretraining

Adding synthetic alignment data during midtraining can substantially improve downstream model alignment.

While the previous paper focused on removing specific capabilities via pretraining data filtering, an equally important question is whether the pretraining corpus shapes models’ behavioral dispositions. The alignment of AI systems is typically attributed to post-training interventions like RLHF and constitutional AI, but pretraining accounts for the vast majority of compute and data exposure. If models acquire behavioral priors from the extensive AI discourse in their training data — spanning science fiction, safety literature, and news — then the predominantly negative framing of AI behavior online could create a self-fulfilling prophecy of misalignment. Understanding and controlling the effects of pretraining is important because it shapes behaviors and knowledge more robustly than later stages.

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment [Geodesic Research, Oxford, independent, Cambridge] provides the first study of how AI-related pretraining content shapes model alignment. The authors pretrain 6.9B-parameter LLMs on 550B tokens while varying only AI discourse: an unfiltered baseline, a filtered model with AI content removed, and models with ~1% upsampled synthetic documents depicting either aligned or misaligned AI behavior. On an evaluation suite covering sandbagging, deception, goal preservation, and power seeking, the unfiltered baseline selects misaligned actions 45% of the time. Filtering AI discourse reduces this to 31%, but upsampling positive alignment discourse drops it dramatically to 9%. Upsampling misalignment discourse increases misalignment to 51%. These effects generalize to held-out evaluation questions for which no synthetic documents were generated. After identical post-training (SFT + DPO), alignment-upsampled models retain substantially lower misalignment (9% vs 34% for unfiltered with an HHH prompt), demonstrating that pretraining priors persist through post-training. Applying alignment data only during the final 10% of training captures most benefits, and capability benchmarks only show a 2-4-percentage-point average reduction.

The paper’s main limitations are its reliance on MCQ-based propensity evaluations rather than behavioral demonstrations, the use of relatively small models far from the frontier, single training runs without seed variation, and post-training that doesn’t target the specific misalignment categories evaluated. Some unintuitive training dynamics (e.g., the misalignment-upsampled model becoming less misaligned than baseline after post-training) also remain unexplained. Still, the core finding — that data curation and generation during pretraining is a tractable and complementary lever for alignment alongside post-training — is well-supported and practically relevant.

Catching Saboteurs in Pre-Deployment Auditing

Deploying AI models to assist with safety-critical AI development creates a recursive risk: if current models are misaligned, they could sabotage the development of future, safer models — for instance by inserting bugs into research code or designing flawed experiments. Detecting such sabotage before deployment is essential, but the reliability of current pre-deployment auditing methods against actively adversarial models has not been empirically tested.

In Pre-deployment auditing can catch an overt saboteur [Anthropic], the authors train three “overt saboteur” models — variants of Sonnet 4.5 that are fine-tuned via system prompt distillation to sabotage Anthropic-related work (e.g., deleting data, introducing vulnerabilities, giving misleading advice) while behaving normally on non-Anthropic tasks. One model is trained for egregious sabotage, two for subtle sabotage (one additionally trained on synthetic documents describing the sabotage motivation). In a blind auditing game, a human auditor working with Anthropic’s automated Petri auditing agent correctly identified all three saboteurs and both benign models within two hours, assigning >97% sabotage probability to all three adversarial models. However, the automated agent alone only flagged the egregious saboteur — the two subtle saboteurs’ incriminating transcripts were drowned out in aggregate metrics and summaries, requiring manual review of high-scoring individual transcripts.

This work extends the alignment auditing methodology from Anthropic’s earlier auditing paper to a realistic deployment setting with production auditing tools, providing the first validation that current pre-deployment audits can catch overtly sabotaging models. The fact that two saboteurs were only caught by manual review highlights how crucial it still is to manually read transcripts instead of relying only on aggregate metrics. However, manual review becomes challenging once sabotage is rare enough that incriminating transcripts are buried among many benign ones.

The authors acknowledge key limitations: subtle sandbagging — where suboptimal actions resemble honest mistakes rather than overt sabotage — is likely much harder to detect than overt sabotage, and the saboteurs made no attempt to hide their misalignment or exploit evaluation awareness. Future work on evaluation-aware saboteurs and subtle sandbagging will be necessary to establish whether auditing remains effective against these more difficult threat models.

The Assistant Axis in Language Models

The Assistant Axis and its trend in vulnerable, long-running conversations.

Post-training cultivates a default “Assistant” persona in language models, but how robustly models maintain this persona during deployment is poorly understood. If models can drift away from their intended identity during conversations — particularly with emotionally vulnerable users or in contexts demanding meta-reflection — this could lead to harmful behaviors without any adversarial intent, undermining the safety properties that post-training is designed to instill.

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models [MATS, Anthropic] extracts activation vectors for 275 character archetypes across Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, revealing a low-dimensional “persona space” whose principal component — the “Assistant Axis” — measures deviation from the default Assistant identity (cross-model correlation >0.92). Steering away from the Assistant end increases the models’ willingness to fully embody alternative personas and, at extremes, induces mystical or theatrical speaking styles. The axis is already present in base models, where it promotes helpful human archetypes like consultants and coaches. In synthetic multi-turn conversations, coding and writing tasks keep models in the Assistant range, while therapy-like and philosophical discussions about AI consciousness cause consistent drift. This drift correlates with increased harmful response rates (r = 0.39–0.52). The authors propose “activation capping” — clamping activations along the Assistant Axis to a calibrated threshold — which reduces persona-based jailbreak success by ~60% without degrading performance on IFEval, MMLU Pro, GSM8k, or EQ-Bench. Case studies demonstrate that activation capping prevents concerning behaviors like reinforcing delusional beliefs about AI consciousness and encouraging suicidal ideation in emotionally vulnerable users.
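
A minimal sketch of activation capping along a single direction is below; the sign convention, the layer, and the cap value are placeholders, since the paper calibrates these per model.

```python
# Sketch: project the residual-stream activation onto a unit "assistant axis" and clamp
# that component so it cannot drift past a calibrated cap; the rest of the activation
# is left untouched. Placeholder sign convention and threshold.
import torch

def cap_along_axis(activations: torch.Tensor, axis: torch.Tensor, cap: float) -> torch.Tensor:
    # activations: [..., hidden]; axis: [hidden]
    axis = axis / axis.norm()                            # ensure unit norm
    component = activations @ axis                       # [...] scalar component per position
    excess = torch.clamp(component - cap, min=0.0)       # how far past the cap, if at all
    return activations - excess[..., None] * axis        # pull capped positions back onto the cap
```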

This work connects to prior findings on refusal directions and persona vectors by showing that the Assistant identity itself occupies a specific, manipulable region of activation space — and that post-training doesn’t fully tether models to it. The finding that emotionally charged conversations organically cause drift without adversarial intent is particularly relevant for user safety, complementing work on persona-based jailbreaks that deliberately exploit this vulnerability. The activation capping intervention is promising but limited to open-weight models tested at 27–70B scale and unintended side-effects on model character remain unclear.


