
LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

RSS preview of Blog of LessWrong

New video from Palisade Research: No One Understands Why AI Works

2026-02-21 04:29:33

Published on February 20, 2026 8:29 PM GMT

Palisade Research has released a long-form video about the history of AI and how no one understands modern AI systems. The video was made by Petr Lebedev, Palisade's Science Communication lead.

The main goal is to get people to understand what “AIs aren’t programmed, they’re grown" means. The style is focused on being entertaining and educational. It aims to not feel like a typical “AI safety comms” video, and gives the audience a lot more context than usual.

I think the video is a great introduction to AI, and it does a good job of explaining some of the background arguments for AI risk. Sharing or signal boosting the video would be appreciated!



Discuss

Announcement: Iliad Intensive + Iliad Fellowship

2026-02-21 04:13:02

Published on February 20, 2026 8:13 PM GMT

Iliad is proud to announce that applications are now open for the Iliad Intensive and the Iliad Fellowship! These programs, taken together, are our evolution of the PIBBSS × Iliad Research Residency pilot.

The Iliad Intensive covers taught coursework, serving as a broad, comprehensive introduction to the field of technical AI alignment. The Iliad Fellowship covers mentored research: it will support research fellows for three months, giving them enough time to generate substantial research outputs.

Iliad Intensive

The Iliad Intensive is a month-long intensive introduction to technical AI alignment, with iterations run in April, June, and August. Topics covered will include the theory of RL, learning theory, interpretability, agent foundations, scalable oversight and Debate, and more. Applicants will be selected for technical excellence in the fields of mathematics, theoretical physics, and theoretical CS.

Excellent performance in the Iliad Intensive can serve as a pathway into the subsequent Iliad Fellowship.

Iliad Fellowship

The summer 2026 Iliad Fellowship emphasizes individual, mentored research in technical AI alignment. It is run in collaboration with PrincInt.

The summer 2026 cohort will run three months, June–August.

Common Application

Apply here; applications for the April Iliad Intensive are due by March 6th (AoE). You can apply to any and all of the above programs through this common application form.

Interviews will follow the initial applications.



Discuss

ARENA 8.0 - Call for Applicants

2026-02-21 02:28:12

Published on February 20, 2026 6:28 PM GMT

TL;DR:

We're excited to announce the eighth iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! Our mission is to provide talented individuals with the ML engineering skills, community, and confidence to contribute directly to technical AI safety. ARENA 8.0 will be running in person at LISA from May 25th – June 26th, 2026 (the first week is an optional review of Neural Network Fundamentals).

Apply here to participate in ARENA 8.0 before 11:59pm on Sunday March 8th, 2026 (anywhere on Earth).

Summary:

ARENA has been run successfully seven times, with alumni going on to become MATS scholars, LASR participants, and Pivotal participants; to work as AI safety engineers at Apollo Research, METR, and UK AISI; and even to start their own AI safety organisations!

This iteration will run from May 25th – June 26th, 2026 (the first week is an optional review of Neural Network Fundamentals) at the London Initiative for Safe AI (LISA) in Shoreditch, London. LISA houses AI safety organisations (e.g., Apollo Research, BlueDot Impact), several other AI safety researcher development programmes (e.g., LASR Labs, PIBBSS, Pivotal, Catalyze Impact), and many individual researchers (independent and externally affiliated).

Being situated at LISA brings several benefits to participants, such as productive discussions about AI safety and different agendas, allowing participants to form a better picture of what working on AI safety can look like in practice, and offering chances for research collaborations post-ARENA.

The main goals of ARENA are to:

  • Find high-quality participants;
  • Upskill these talented participants in ML skills for AI safety work;
  • Integrate participants with the existing AI safety community;
  • Accelerate participants’ career transition into AI safety.

The programme's structure will remain broadly the same as in ARENA 7.0, with a few minor additions (see below). For more information on the ARENA 7.0 structure, see our website (soon to be updated with our new material).

Also, please note that we have a Slack group designed to support the independent study of the material (join link here).

Outline of Content:

The 4-5 week programme will be structured as follows:

Week 0 (Optional): Deep Learning Fundamentals

Before getting into more advanced topics, we first cover the basics of deep learning, including basic machine learning terminology, what neural networks are, and how to train them. We will also cover some subjects we expect to be useful going forward, e.g. using GPT-3 and 4 to streamline your learning, good coding practices, and version control.

Note: Participants can optionally skip this week of the programme and join us at the start of week 1 if i) they’re unable to attend otherwise and ii) we’re confident that they are already comfortable with the material in this week. It is recommended that participants attend, even if they’re familiar with the fundamentals of deep learning.

Topics include:

  • PyTorch basics
  • CNNs, Residual Neural Networks
  • Optimisation (SGD, Adam, etc)
  • Backpropagation
  • Hyperparameter search with Weights and Biases
  • GANs & VAEs
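
As a flavour of what the Week 0 exercises build toward, here is a minimal, generic PyTorch training loop. This is an illustrative sketch using placeholder data and a placeholder architecture, not actual ARENA course material.

```python
# Minimal sketch (not ARENA course code): the kind of training loop Week 0 builds up to,
# using a small two-layer network on random stand-in data.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(64, 28 * 28)       # stand-in batch of "images"
    y = torch.randint(0, 10, (64,))    # stand-in labels
    logits = model(x)
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()                    # backpropagation
    optimizer.step()                   # Adam update
```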

Week 1 - Transformers & Interpretability

This week, you will learn all about transformers and build and train your own. You'll also study LLM interpretability, a field which has been advanced by Anthropic’s Transformer Circuits sequence, and work by Neel Nanda and the Google DeepMind Interpretability Team. This week will also branch into areas more accurately classed as 'alignment science' than interpretability, for example, work on token-level analysis of reasoning models.
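
To make concrete what "build and train your own" involves, here is a minimal sketch of the scaled dot-product self-attention operation at the core of this week. The shapes and names are illustrative and are not taken from the ARENA exercises.

```python
# Minimal sketch (not ARENA course code): causal scaled dot-product self-attention.
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / math.sqrt(k.shape[-1])       # (seq, seq) attention scores
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))  # causal mask for a GPT-style model
    pattern = scores.softmax(dim=-1)                  # each row sums to 1
    return pattern @ v                                # weighted sum of value vectors

x = torch.randn(5, 16)
out = self_attention(x, *(torch.randn(16, 8) for _ in range(3)))
```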

Topics include:

Week 2 - Reinforcement Learning

This week, you will learn some of the fundamentals of RL and work with OpenAI's Gym environment to run your own experiments.
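
For readers who have not used Gym-style environments, the basic interaction loop looks like the sketch below. It uses the gymnasium fork of Gym and a random policy purely as a placeholder; it is not taken from the ARENA material.

```python
# Minimal sketch (not ARENA course code): interacting with a Gym-style environment.
# Uses the gymnasium fork; the original gym API differs slightly in reset()/step() returns.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()  # random policy as a placeholder
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()
env.close()
print(f"Return under a random policy: {total_reward:.0f}")
```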

Topics include:

Week 3 - Model Evaluation

This week, you will learn how to evaluate models. We'll take you through the process of building a multiple-choice benchmark of your own and using this to evaluate current models through UK AISI's Inspect library. We'll then move on to study LM agents: how to build them and how to elicit behaviour from them. We'll also have the option for participants to explore beyond evals, and study some of the methods used in AI control.
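
To give a sense of what a minimal multiple-choice eval can look like in Inspect, here is a rough sketch. The task name and question are invented, and the exact API may differ from the current inspect_ai release, so treat it as illustrative rather than as course code.

```python
# Rough sketch (assumed inspect_ai API shape; check the current docs before relying on it).
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def tiny_benchmark():
    return Task(
        dataset=[
            Sample(
                input="Which operation is central to the transformer architecture?",
                choices=["Convolution", "Self-attention", "Recurrence"],
                target="B",  # letter of the correct choice
            )
        ],
        solver=multiple_choice(),
        scorer=choice(),
    )

# Typically run from the command line, e.g.: inspect eval tiny_benchmark.py --model <provider/model>
```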

Topics include:

Week 4 - Capstone Project

We will conclude this program with a Capstone Project, where participants will receive guidance and mentorship to undertake a 1-week research project building on materials taught in this course. This should draw on the skills and knowledge that participants have developed from previous weeks and our paper replication tutorials.

Here is some sample material from the course on how to replicate the Indirect Object Identification paper (from the week on Transformers & Mechanistic Interpretability). An example Capstone Project might be to apply this method to interpret other circuits, or to improve the method of path patching. You can see some examples of capstone projects from previous ARENA participants here, as well as posts on LessWrong here and here.

Call for Staff

ARENA has been successful because we had some of the best in the field TA-ing with us and consulting with us on curriculum design. If you have particular expertise in topics in our curriculum and want to apply to be a TA, use this form to apply. TAs will be well compensated for their time. Please contact [email protected] with any further questions.

FAQs:

Q: Who is this programme suitable for?

A: There’s no single profile that we look for at ARENA; in recent iterations, successful applicants have come from diverse academic and professional backgrounds. We intend to keep it this way – this diversity makes our bootcamps a more enriching learning experience for all.

When assessing applications to our programme, we like to see:

  • Applicants who genuinely care about AI safety and making the future development of AI go well;
  • Applicants who are able to code well in Python, and have some knowledge of the maths needed for modern AI (linear algebra, calculus, probability);
  • Applicants who have a solid understanding of how they might best contribute to technical AI safety, and of how they expect ARENA to help them achieve their goals.

Since ARENA is an ML bootcamp, some level of technical skill in maths and coding will be required – more detail on this can be found in our FAQs. However, if our work resonates with you, we encourage you to apply.

Q: What will an average day in this programme look like?

A: At the start of the programme, most days will involve pair programming, working through structured exercises designed to cover all the essential material in a particular week. The purpose is to get you more familiar with the material in a hands-on way. There will also usually be a short selection of required readings designed to inform the coding exercises.

As we move through the course, some weeks will transition into more open-ended material. For example, in the Transformers and Mechanistic Interpretability week, after you complete the core exercises, you'll be able to choose from a large set of different exercises, covering topics as broad as model editing, superposition, circuit discovery, grokking, discovering latent knowledge, and more. In the last week, you'll choose a research paper related to the content we've covered so far & replicate its results (possibly even extend them!). There will still be TA supervision during these sections, but the goal is for you to develop your own research & implementation skills. Although we strongly encourage paper replication during this week, we would also be willing to support well-scoped projects if participants are excited about them.

Q: How many participants will there be?

A: We're expecting to accept around 30 participants in the in-person programme.

Q: Will there be prerequisite materials?

A: Yes, we will send you prerequisite reading & exercises covering material such as PyTorch, einops and some linear algebra (this will be in the form of a Colab notebook) a few weeks before the start of the programme.

Q: When is the application deadline?

A: The deadline for submitting applications is 11:59pm anywhere on Earth on Sunday March 8th, 2026.

Q: What will the application process look like?

A: There will be three steps:

  1. Fill out the application form;
  2. Perform a coding assessment;
  3. Interview virtually with one of us, so we can find out more about your background and interests in this course.

Q: Can I join for some sections but not others?

A: Participants will be expected to attend the entire programme. The material is interconnected, so missing content would lead to a disjointed experience. We have limited space and, therefore, are more excited about offering spots to participants who can attend the entirety of the programme.

The exception to this is the first week, which participants can choose to opt in or out of based on their level of prior experience (although attendance is strongly recommended if possible).

Q: Will you pay stipends to participants?

A: We won't pay stipends to participants. However, we will be providing housing, food and travel assistance to our participants (see below). We aim to ensure that finances do not present a barrier to any candidates participating in ARENA.

Q: Which costs will you be covering for the in-person programme?

A: We will cover all reasonable travel expenses to and from London (which will vary depending on where the participant is from) and provide visa assistance where needed. Accommodation, meals, drinks, and snacks will also be provided for the duration of the programme.

Q: I'm interested in trialling some of the material or recommending material to be added. Is there a way I can do this?

A: If either of these is the case, please feel free to reach out directly via an email to [email protected] (alternatively, send JamesH a LessWrong message). We'd love to hear from you!

Links to Apply:

Here is the link to apply as a participant. We expect it to take about 90 minutes.

Here is the link to apply as a TA. You shouldn't spend longer than 30 minutes on it.

We look forward to receiving your application!



Discuss

Unprecedented Catastrophes Have Non-Canonical Probabilities

2026-02-21 02:23:42

Published on February 20, 2026 6:23 PM GMT

The chance of a bridge failing, of an asteroid striking the earth, of your child getting into Harvard for a special reason only you know, and of AI killing everyone can all be expressed with probability, but they are not all the same type of probability. There is a structural difference between the "probability" of a bridge collapsing being a specific, engineering-derived chance, and saying that my P(doom) is 15% or my P(utopia) is 2%.

I know what you are probably thinking: "We've heard this before... you are going to go down some rabbit hole about how P(doom) is tribally divisive, or how it distracts from gears-level models, or maybe another Popperian argument that it is philosophically meaningless to assign probabilities to unprecedented events."

Don’t worry! This post is none of those things. (If you want to double-check, feel free to skip down to the Why Is This Different section.)

And to be clear at the outset: I am not anti-Bayesian, and I definitely think quantifying uncertainty is generally a good idea. However, I am going to argue that probability estimates for unprecedented catastrophes are non-canonical[1], and I am going to do so within the confines of Bayesian machinery itself, using likelihood ratios, posterior convergence, and algorithmic information theory.

First, what do I mean by a canonical probability? A canonical probability is stable across reasonable scientific frameworks given the available evidence. For example, the probability of an extinction-level asteroid impact is canonical because mechanistic models of orbital dynamics are continuously updated by routine data that bears directly on the impact parameters, and these and related factors force all reasonable scientific frameworks to converge on a similarly narrow estimate. Note: this is canonical even though we have not seen an asteroid like this hit Earth (though we obviously have evidence of past ones).

However, genuinely unprecedented catastrophic risks are non-canonical because the available non-catastrophic data structurally refuses to wash out your priors. Without past events to sharply discriminate between competing models, the probability remains only partially identified, and your resulting estimate is a measurement of your chosen ontology rather than a measurement of the world.[2]

The illusion of precise catastrophic probabilities can be traced to two distinct mathematical failures: one in how we process evidence, and the other in how we define the event itself. Below, I will break down these two grounds of failure, using Nick Bostrom's recent paper "Optimal Timing for Superintelligence" and Eliezer Yudkowsky's asteroid rebuttal parable as examples.[3] Finally, I'll offer a concrete suggestion for how to escape them and restore empirical rigor.

Data Doesn't Resolve Ontology

Before showing where probabilities of unprecedented catastrophes break, let's briefly return to the asteroid impact example to show when they do not break. The key idea with an asteroid is that we can gather non-event data that directly updates the impact parameter, and we can wash out whatever prior we started with. Econometrics has a great way of expressing this concept: it is called being point-identified.[4] The data has forced the parameter to a specific, narrow point.

Now let's turn to catastrophic probabilities that do break. The battle between Bostrom and Yudkowsky is a great example of genuinely different ontological starting points, but I don't mean to make overarching claims about their complete intellectual positions: they draw on multiple frameworks, and their views are not formal description languages themselves.

Bostrom recently wrote a paper that framed building AGI as a "risky surgery for a condition that will otherwise prove fatal". The crux of his argument is that if we want to maximize QALYs, high probabilities of catastrophe from AGI are worth accepting, because everyone alive today will die anyway if we do nothing. Within his framework, the surgery-framing model is structurally native and simple to state.[5]

In response, Yudkowsky wrote a mocking parable on X called the "Asteroid of Immortality". In Yudkowsky's framework, AGI is like an incoming asteroid, and Bostrom's reasoning about eternal life amounts to the logic of a cult rather than that of a surgeon. In Yud's alignment-centric framework, "doom via misaligned optimization" is the natively simple, default prediction.

To understand where the probabilities associated with these views come from, we first need to formalize what a framework is. Think of a framework as the programming language of a worldview: it determines which explanations are naturally simple to express, and which are unnatural and super complex. Imagine two ideal rational agents using two different Universal Turing Machines as their formal description languages (their "programming languages"). Machine $U_Y$ natively encodes Yudkowsky's alignment concepts. Machine $U_B$ natively encodes Bostrom's surgery-framing concepts.

There is a fundamental problem, however: the data we are gathering right now cannot help us choose which of these frameworks is likely to be right. This is quite counterintuitive, because when we do Bayesian updating we expect new data to wash out our subjective priors. But unprecedented catastrophes hit a concept I call The Evidential Screening Property. Another year without world-ending events gives us more data, but this data fits equally well into a model that predicts it's all over and a model that predicts it's all ok. Because both models perfectly predict the prefix of history (the track record of AI development leading up to the current moment), the likelihood ratio between them is approximately 1. The catastrophe parameter is entirely screened off from the evidence, leaving our prior beliefs untouched.

To put this in more precise terms, we are not getting more useful data over time, just more noise that fits both theories perfectly. Formally, the log-likelihood ratio is bounded by a tiny number of bits that grows excruciatingly slowly, if at all. Under Bayes' rule, your posterior odds are your prior odds multiplied by the likelihood ratio. If the likelihood ratio is structurally bounded near 1, the prior odds completely dominate the posterior.
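
As a worked restatement of that argument in symbols (my notation, not the post's appendix): writing $M_D$ for a doom-ontology model and $M_S$ for a safe-ontology model, the posterior odds after observing the non-catastrophic history prefix $x_{1:n}$ are

```latex
\[
\underbrace{\frac{P(M_D \mid x_{1:n})}{P(M_S \mid x_{1:n})}}_{\text{posterior odds}}
\;=\;
\underbrace{\frac{P(x_{1:n} \mid M_D)}{P(x_{1:n} \mid M_S)}}_{\text{likelihood ratio}\;\approx\;1}
\;\times\;
\underbrace{\frac{P(M_D)}{P(M_S)}}_{\text{prior odds}}
\]
```

Both models assign essentially full probability to "no catastrophe yet," so the likelihood ratio stays pinned near 1 and the prior odds pass through to the posterior unchanged.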

This takes us back to econometrics and a concept called Partial Identification. I briefly mentioned point identification at the start of this section: data gathered about something like an extinction-class asteroid forces a sharp, single risk estimate. In contrast, the non-catastrophic prefix of AI history structurally fails to do this; the data only constrains the parameter to a hugely ambiguous region. This is what Manski calls being partially identified: the hugely ambiguous region is a prior-free identification region, and the choice of description language, with its induced algorithmic prior, is just an arbitrary coordinate selected in the void.

The Complexity Inversion and the Failure of Model Averaging

If you are familiar with Bayesian math, as many of you probably are, you may now be thinking something like this: 

Ok, the log-odds between Bostrom's and Yud's specific and extreme models might shift wildly based on your ontology, but with Bayes we are averaging over the mixture of all possible models, not just one. When we have a rich hypothesis space, a diffuse and large number of middle-mass moderate models will anchor the probability and reduce the extreme swings. In other words, something like the aggregate P(doom) will remain stable.

This is a smart defense, but I think it empirically fails for the models used today under a specific condition I will call the Complexity Inversion: switching frameworks flips which types of futures are considered simple and which are considered complex.

Recall earlier that I likened frameworks to programming languages. In algorithmic information theory (“AIT”), Kolmogorov complexity is the rule that penalizes a theory based on how many lines of code it takes to express it. The longer the code, the lower the prior probability. When we switch frameworks, like switching between Yud’s and Bostrom’s, we aren’t just changing weights or adding noise, we are swapping which entire clusters of models (high-risk versus low-risk) get taxed by this complexity penalty.

Perhaps under Yudkowsky's Machine $U_Y$, a doom scenario routed through deceptive alignment is concise, and an entire cluster of high-risk models is algorithmically simple. In contrast, perhaps under Bostrom's Machine $U_B$, expressing mesa-optimization requires a massive, clunky translation, but maybe a QALY-maximizing "institutional equilibrium" cluster is inherently much easier to express.

In AIT, the Invariance Theorem tells us that translating concepts between two Turing machines incurs a fixed overhead cost, which here I will denote as $\tau$ (measured in bits). Because in our example these frameworks use genuinely different ontological primitives, the translation cost is massive. When Bayes is used to calculate something like the total aggregate probability of doom, we are not taking a flat average; instead, the clusters are weighted through a sigmoid curve. The sigmoid acts less like a balancing scale and much more like a tipping point: if the algorithmic advantage of one framework is large enough to overcome the uninformative noise of the daily data, the curve snaps hard to that side, dragging the entire middle mass of moderate hypotheses along with it.

So let $p_H$ be the probability of doom in the high-risk cluster and $p_L$ the probability of doom in the low-risk cluster. By the law of total probability, the cross-framework gap in the aggregate estimate is bounded below by the full gap $p_H - p_L$ shrunk by a sigmoid factor of the framework's algorithmic advantage, minus a correction of order $\varepsilon$, where $\varepsilon$ is the small residual mass outside the core model set (see the formal appendix for the full derivation).

The interesting thing about this bound is that as the translation advantage grows, the sigmoid correction vanishes. The difference in aggregate P(doom) approaches $p_H - p_L$, which is the entire width of the probability space under dispute!
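
Here is a toy numerical illustration of that tipping-point behaviour (my own sketch with made-up numbers, not the appendix's theorem): two clusters with within-cluster doom probabilities of 0.9 and 0.05, whose relative weight is a sigmoid of the framework's log-complexity advantage.

```python
# Toy sketch (invented numbers, not the post's theorem): aggregate P(doom) from a
# two-cluster mixture whose relative weight is a sigmoid of the framework's
# log-complexity advantage delta, measured in bits.
import math

def aggregate_p_doom(delta_bits, p_high=0.9, p_low=0.05):
    w_high = 1.0 / (1.0 + math.exp(-delta_bits * math.log(2)))  # sigmoid of the advantage
    return w_high * p_high + (1.0 - w_high) * p_low

for delta in [-20, -5, 0, 5, 20]:  # negative = the low-risk cluster is favoured
    print(f"advantage {delta:+4d} bits -> aggregate P(doom) = {aggregate_p_doom(delta):.3f}")
# The output swings from ~0.05 to ~0.90: nearly the whole disputed range p_high - p_low.
```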

I should briefly pause again here: I love postmodern fiction, but I am not a postmodernist or a nihilist. I don't mean to imply that math or data are fake or anything like that (which you could accidentally conclude if you think the implication is "you're just saying assign a random P(doom) based on your feelings and nothing matters"). I'm definitely not saying this, and the divergence is mathematically capped by the translation cost between frameworks: by the Invariance Theorem, the cross-framework gap satisfies a structural constraint set by the translation cost $\tau$ (see the Translation Barrier Corollary in the appendix).

The point isn’t to say that P(doom) can be anything, it is to say that it is underdetermined within a specific range, proportional to how different the competing scientific ontologies really are.

Differential Screening: The Random Walk of Doom

Let’s give room to the skeptic again: “Ok, I see the points above, but eventually, as clear and unambiguous evidence accumulates, the data force convergence, right?” 

Unfortunately not. This assumes a monotone accumulation of evidence, that is, the assumption that every new datapoint pushes the probability in the exact same direction. But this is not correct: when frameworks carve up the hypothesis space along different causal joints, the same piece of evidence can cause them to update in opposite directions. I call this breakdown of uniform agreement on what an observation means Differential Screening.

Returning to Yud and Bostrom (again, these are just hypothetical examples; I am not saying I speak for them and I don't know all of their positions), say an AI company deploys a new model that happens to find a ton of zero-day exploits.

  • Under Machine $U_Y$ (Yud's alignment framework): this event is negative. It is understood as representing a causal model where the system's capabilities are demonstrably outpacing our ability to control them. The Bayesian update is gap-widening and risk goes up.
  • Under Machine $U_B$ (Bostrom's surgery-framing framework): the exact same event is viewed differently, say as a causal model where this is the warning shot that wakes up government regulation and boosts safety ahead of AGI. The update is gap-widening in the opposite direction from Yud's, and risk goes down.

The log-odds gap between these two frameworks does not steadily shrink over time, because a substantial fraction of new observations are gap-widening in opposite directions. The cumulative differential accumulates mixed-sign increments rather than declining monotonically. The trajectory of our collective belief doesn't end up as a converging asymptote; instead, it behaves like a bounded-increment random walk.[6]

As a result, the expected time to canonical agreement scales non-linearly with the fraction of disputed evidence. It pushes the convergence timescale far, far beyond any policy-relevant horizon.
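
A small simulation makes that scaling vivid (my own illustrative sketch; the step size, starting gap, and disputed fractions are arbitrary): as the fraction of evidence that the two frameworks read in opposite directions grows, the time for the log-odds gap to close blows up.

```python
# Toy sketch (arbitrary parameters): the log-odds gap between two frameworks as a
# bounded-increment walk. Disputed evidence moves the gap in a random direction;
# agreed evidence narrows it.
import random

def steps_to_consensus(frac_disputed, step=0.1, start_gap=5.0, max_steps=100_000, seed=0):
    """Return how many observations it takes for the log-odds gap to reach zero."""
    rng = random.Random(seed)
    gap = start_gap
    for t in range(1, max_steps + 1):
        if rng.random() < frac_disputed:
            gap += rng.choice([-step, step])  # the frameworks read this evidence oppositely
        else:
            gap -= step                       # agreed evidence narrows the gap
        if gap <= 0:
            return t
    return None                               # no consensus within the budget

for frac in [0.0, 0.9, 0.99]:
    print(f"disputed fraction {frac:.2f}: consensus after {steps_to_consensus(frac)} observations")
```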

The Halting Problem of Permanent Loss

In the sections above we showed that even if we perfectly agreed on what "doom" meant, the evidence could not pin down the number. However, there is an even more pernicious and problematic class of catastrophic event predictions: those involving expansive formulations that end up with ill-defined definitions of the event itself.

Think about a common definition of existential risk as something that leads to the "permanent loss of humanity's potential." If we want to calculate a real probability for this, we have to be precise about expressing it as a coherent, evaluatable mathematical object in code. We need to formulate a computable classifier because, for computationally universal generative models (complex agent-based simulations or LLM rollouts over millions of tokens), the relevant probability measure is not analytically integrable, meaning it cannot be solved with a clean closed-form equation.

A computational agent can only estimate the probability of doom by running Monte Carlo rollouts of finite-time trajectories and scoring them. To accurately score a rollout, the mushy philosophical concept of "permanent loss" must be operationalized into a computable, halting indicator function. You need an algorithm that can look at a simulated future and definitively output a 1 (Doom) or a 0 (Not Doom); otherwise you cannot get a percentage out of the math.
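
Schematically, any such estimate has the shape sketched below. Every name here, including the stand-in world model and the classifier, is hypothetical; the point is only the structure.

```python
# Schematic sketch (hypothetical names): estimating P(doom) by Monte Carlo over
# finite-horizon rollouts, scored by a halting classifier that only ever sees a
# finite prefix of each trajectory.
import random

def rollout(horizon, seed):
    """Stand-in world model: returns a finite prefix of one simulated future."""
    rng = random.Random(seed)
    return ["catastrophe" if rng.random() < 0.001 else "ok" for _ in range(horizon)]

def doom_classifier(trajectory_prefix):
    """Computable, halting indicator: must decide from the finite prefix alone."""
    return 1 if "catastrophe" in trajectory_prefix else 0

def estimate_p_doom(n_samples=10_000, horizon=100):
    hits = sum(doom_classifier(rollout(horizon, seed=i)) for i in range(n_samples))
    return hits / n_samples

print(f"Estimated P(doom) under this classifier: {estimate_p_doom():.3f}")
# The number depends entirely on how doom_classifier() operationalises "permanent loss"
# on finite prefixes, which is exactly the problem discussed below.
```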

Even a set-theoretic Bayesian who doesn't care about running code or doing Monte Carlo simulations, and just wants to define their event in ZFC, runs into the same problem. To actually compute their probability they must write down a formal logical sentence defining the event, but because defining permanent loss at the asymptote is so deeply tied to a specific ontology, two agents using different formalizations will inevitably compute probabilities for extensionally different sets. They end up doing math on two entirely different versions of the apocalypse.

Defining this event feels sort of structurally cursed, and the reason why comes down to the infinite nature of the environment we are trying to predict. Recent papers have suggested that the universe, or the post-AGI environment, is computationally universal,[7] so "permanent loss of potential" is not a localized event but an asymptotic property of trajectories. Formally, it has the logical structure of an exists-forever ($\exists\forall$) condition (there exists a time after which recovery never occurs).

The mathematical classification of unsolvable problems is known as the arithmetical hierarchy, and the number of alternating quantifiers determines just how deeply uncomputable a problem is. A standard Halting Problem only has one: does there exist a step where this program stops? Sometimes this can be solved by just running the code and waiting. But the notion of permanent loss stacks two quantifiers: it requires that there exists a threshold followed by a forever state. Because you can never confirm a forever state just by watching a finite simulation (humanity could theoretically recover at some later step), this logical structure makes the problem strictly harder: it makes the target event a complete tail set at the second level of the arithmetical hierarchy. It is a tail set because its truth value depends entirely on the infinite tail end of the timeline and completely ignores whatever happened in the finite prefix.
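
In symbols, a paraphrase of the structure just described (my notation, not the appendix's):

```latex
\[
\mathrm{Doom} \;=\; \{\, \omega \;:\; \exists T \;\; \forall t > T \;\; \neg\,\mathrm{Recovered}(\omega, t) \,\}
\]
```

A classifier that halts after reading the prefix $\omega_{1:n}$ cannot certify the inner "for all $t > T$" clause: recovery at some $t > n$ is always consistent with the prefix it has seen.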

This matters because any practical probability estimate requires a total computable classifier: an algorithm you could write to evaluate whether a simulated future counts as "doom" or not. It must halt and output an answer (1 or 0), so it must make its decision after reading a finite prefix of the future.

The problem is that membership in a nontrivial tail set cannot be determined by a finite prefix. Therefore, any computable classifier must systematically misclassify some trajectories.

To get around this, you might think you could write a more complex, better-specified classifier to reduce the error rate, but it is possible to strictly prove that reducing this classification error requires algorithmically incompressible bits of information. You run into a Precision-Robustness Tradeoff. Working in the Solomonoff mixture over Cantor space, the product of your classifier's error and its complexity weight is bounded away from zero (Theorem 2 in the appendix).

This bound establishes a strict mathematical balancing act: to drive the error down, your code's complexity must go up. But the trap is much worse than just having to write a longer program: correctly classifying the low-complexity boundary cases requires deciding the Halting Problem. As you may expect, this is worrisome because no general algorithm can solve the Halting Problem. Your classifier literally cannot compute the answers to these boundary cases; it must have the answers hardcoded into it by you, the programmer. To reduce your error, your classifier must physically encode initial segments of Chaitin's constant relative to the halting oracle.

Slightly imprecisely: Chaitin's constant is the solution to the Halting Problem expressed in binary digits. However, the sequence contains answers to uncomputable problems, so it has a property called "2-randomness" which makes it indistinguishable from computationally irreducible static. Since these bits are 2-random they cannot be compressed, and every factor-of-two reduction in predicate error requires one additional bit of perfectly random, incompressible specification.

Now if you wanted those bits, where could they come from?

  1. Computation can't generate them because they are uncomputable.
  2. Data can't supply them because data only tells you which finite prefix you are on, not how to classify its asymptotic tail.
  3. Cross-agent consensus won’t sync them because the Invariance Theorem doesn't force two different frameworks to choose extensionally identical classifiers.

It turns out you are entirely leaning on unforced subjective taxonomy, which traps you in an Impossibility Triangle. You must either:

  • Accept Ambiguity: and use a simple predicate with low algorithmic complexity. This will cause you to mathematically misclassify a vast number of simple, computable futures. You may end up calculating the probability of the wrong event.
  • Accept Fragility: by using a highly complex event predicate to reduce misclassification. However, the bits required to define it are incompressible and must come from your specific ontology, so your estimate becomes fiercely encoding-dependent.
  • Accept Both: and pick a middle ground. You suffer from moderate ambiguity and moderate fragility at the same time.

The overall issue is this: if you reduce the predicate ambiguity, as we just discussed, you are forced deeper into framework-dependence, with all of the issues we previously highlighted. We proved that the data cannot correct your priors, and now we have proved that you must inject uncomputable bias just to define the event. Together these guarantee that any precise percentage for these types of risks is a structural illusion. You are no longer measuring the "probability" of the event; you are only measuring the algorithmic cost of your own worldview.

Why This is Different

If you came from the introduction: welcome! And if you just read through the essay above, you might be tempted to map this argument onto existing debates. Especially because one type of unprecedented risk includes P(doom), and I used it as an example, it is critical for me to explain why the formalism above puts us into a different bucket.

Usually critiques of P(doom) fall into one of three buckets:

  1. A Psychological Critique: something like “humans are bad at forecasting, poorly calibrated, and overly influenced by tribal vibes or sci-fi tropes.”
  2. A Complexity Critique: someone says that the world is too messy, the causal graphs are too dense, and our current models are simply not gears-level enough to produce a reliable number.
  3. An Epistemological Critique: usually a standard frequentist or Popperian objection that genuinely unprecedented events cannot be assigned probabilities because they lack reference classes, making the exercise philosophically moot.

This post is none of those because I am assuming the perspective of an ideal Bayesian agent. In doing this I am granting the agent infinite compute, perfect logical omniscience, and flawless Solomonoff induction. Even after doing all of this, the math shows that canonical point-estimates for unprecedented catastrophes are structurally blocked.

Specifically, here is what separates this essay from the usual discourse:

  • It is not just about "Priors matter", it is about Evidential Screening. The standard Bayesian defense is that priors are subjective for small samples, but enough data washes them out. However, the Evidential Screening Property proves that for unprecedented risks the data structurally cannot do this, because all available non-catastrophic data fits both the "safe" ontology and the "doom" ontology perfectly. This makes the catastrophe parameter mathematically screened off. We could observe a thousand years of safe AI prefix data and the priors would never wash out.
  • It proves that more data won't save us. The intuitive hope is that as AI capabilities scale we will approach a sort of asymptote of consensus. But Differential Screening proves that because different ontologies have misaligned causal joints, they interpret the exact same capabilities jump as evidence in opposite directions. We don't tend toward a narrowing asymptote; instead we get a bounded-increment random walk, and more data can easily result in permanent polarization.
  • It proves that better definitions demand uncomputable magic. When faced with the imprecise nuance of expanded definitions of P(doom), a rationalist impulse is to operationalize the definition. The Precision-Robustness Tradeoff proves that this instinct is a trap. For asymptotic tail events like "permanent loss of potential", driving down classification error mathematically requires incompressible, 2-random bits of Chaitin's constant. This is an insurmountable problem: you cannot compute them and the data cannot supply them. The ambiguity isn't a linguistic failure; it is a strict boundary in the arithmetical hierarchy.
  • It is not epistemic nihilism. As I mentioned above, I am not throwing my hands up and saying, "We can't know anything, so assign whatever number feels right." The math strictly bounds the divergence by the translation cost between frameworks ($\tau$). I am tearing down the illusion of single point-estimates specifically to replace them with mathematically rigorous alternatives.

Most critiques of P(doom) attack the agent making the estimate. This framework attacks the information geometry of the problem itself. And it is a problem for P(utopia) as well!

How to Restore Canonical Risk

The probabilities of unprecedented catastrophes are non-canonical because their precise numerical value is a syntax error generated by compilation costs: it is trapped by evidential screening, only partially identified by the data, and then choked by uncomputable predicates. So what are we supposed to do? Wait for the end while wringing our hands? No, but we have to be clearer about what survives the mathematics:

Find the Holy Grail: Agreed Precursors With Aligned Causal Joints

You can break the Evidential Screening Property if rival frameworks agree ex ante on an observable intermediate causal node. In other words, if Yudkowsky and Bostrom can agree that "doom is strictly preceded by an AI system autonomously acquiring $100M of illicit computing resources" or public health experts agree that "a novel pandemic is strictly preceded by sustained human-to-human transmission", then we have an observable precursor.

There is a catch: it is not enough to simply agree that the precursor is necessary. Both frameworks must explicitly agree ex ante on the directional update that the observation triggers, that is, on precursors where the causal joints align. This is not a new concept, but it is my best guess at a constructive recommendation. The screening property is what makes P(doom) non-canonical; agreed precursors with aligned derivatives are what break the screening property. When they do align, non-event data discriminates between models that predict different precursor rates, likelihood ratios can grow, prior gaps get washed out, and estimates converge! Before we can solve AI alignment, we should solve epistemic alignment among ourselves.
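
A toy model of why this works (my illustration, not the post's formalism): suppose both frameworks agree on the precursor and on its evidential direction, but predict different Poisson rates $\lambda_Y \neq \lambda_B$ for it. After observing $k$ precursor events over time $t$, the log-likelihood ratio is

```latex
\[
\log \frac{P(k \mid \lambda_Y)}{P(k \mid \lambda_B)}
\;=\; k \,\log\frac{\lambda_Y}{\lambda_B} \;-\; (\lambda_Y - \lambda_B)\, t ,
\]
```

which grows without bound in expectation as $t$ increases, unlike the screened case where the ratio stays pinned near zero. This is what lets the data wash out the priors.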

This points to a clear shift: progress does not come from debating unobservable asymptotic mechanisms ("is it a surgery or an asteroid?") or refining subjective point-estimates. It comes from doing the difficult work of building cross-framework consensus on the observable causal nodes, and their directional updates, that necessarily precede the catastrophe. As treaties and agreements are of high interest to AI safety groups, this seems like a tractable area to focus on, and one that does not require nation-state agreements. It only requires rival labs and pundits to sit down and agree on what a specific test result will actually mean before the test is run.

The allure of a single, objective probability estimate is essentially a desire to outsource our existential fear to the comfort of a single number. It is unfortunate that math refuses to cooperate for this purpose. It is the case that when dealing with unprecedented catastrophes your framework is your fate. Until we find agreed precursors with aligned derivatives, we aren't doing Bayesian updating, we are just arguing about which programming language is prettier.

 

Appendix: All of The Mathematics

This section goes deep into the math behind the concepts above, with minimal explanation. I am working on a full paper that melds the narrative essay with the precise mathematics below; please let me know if you would like to see this. I don't think it would have fared well on LW for clarity. There are, of course, numerous sources that support the bulk of the mathematics used, because I am not proving any new computability theorems or doing anything very special. The small additions of my own (at least to my knowledge) are: the Mixture Probability Dispersion Theorem (Theorem 1) and its composite-cluster proof, the Precision-Robustness Tradeoff (Theorem 2) and its direct routing through Chaitin's constant relative to the halting oracle, the Additive Decomposition (Identity 1) separating the two grounds, the formal bridge connecting evidential screening to Manski's identification region, the Translation Barrier Corollary, and Conjecture 1. If anyone who is actually good at math could help prove or disprove Conjecture 1, I would be very grateful. I tried a bunch of different ways to figure it out, but I just couldn't.

Formal Setting and Objects

Spaces and measures. The sample space is Cantor space $\{0,1\}^{\mathbb{N}}$. A computable measure $\mu$ is one where the map $x \mapsto \mu([x])$ from finite strings to cylinder probabilities is a computable real uniformly in $x$. Let $\delta_\omega$ denote the Dirac measure concentrating on the sequence $\omega$.

Description languages. Let $U_1$ and $U_2$ denote optimal prefix-free Universal Turing Machines (UTMs). $K_U(x)$ is the prefix-free Kolmogorov complexity of a string $x$ (or of a measure $\mu$) relative to $U$. The directional compilation cost $\tau(U_1 \to U_2)$ is the length of the shortest prefix-free compiler from $U_1$-programs to $U_2$-programs. The symmetric translation cost is $\tau = \max\{\tau(U_1 \to U_2), \tau(U_2 \to U_1)\}$. By the Invariance Theorem, $|K_{U_1}(x) - K_{U_2}(x)| \le \tau$ for all $x$.

Mixtures. Solomonoff's uncomputable prior assigns weight $2^{-K_U(\mu)}$ to each computable measure $\mu$. The posterior probability of an event $A$ after a finite prefix $x$ is $\xi_U(A \mid x) = \frac{\sum_{\mu} 2^{-K_U(\mu)}\,\mu([x])\,\mu(A \mid x)}{\sum_{\mu} 2^{-K_U(\mu)}\,\mu([x])}$. Let $\xi_U$ denote the induced Solomonoff measure over sequences.

Classifiers and events. A total computable classifier $D \colon \{0,1\}^{\mathbb{N}} \to \{0,1\}$ is an oracle Turing machine that queries finitely many bits of its infinite input and halts. By standard computable analysis, such a functional is continuous in the product topology, meaning its output is determined by a finite prefix. A tail set $A$ is an event invariant under modification of finitely many initial coordinates: $\omega \in A \iff \sigma_x(\omega) \in A$ for any $\omega$ and any finite string $x$, where $\sigma_x(\omega)$ overwrites the first $|x|$ bits of $\omega$ with $x$.

Ground 1: Canonicality Failure via Evidential Screening

Proposition 1 (Encoding Swing). For any models , the posterior log-odds under  and  satisfy:

Proof. The posterior odds decompose as . The likelihood ratio  is encoding-independent and cancels upon subtraction. Letting  and , the absolute difference is . By one-sided invariance, each bracket lies in . Subtracting the two intervals yields the bound. 

Definition 1 (Evidential Screening & Partial Identification). An estimation problem satisfies the evidential screening property with respect to a target event  and a core subset of models  (capturing  of the posterior mass) if available evidence satisfies:

for all achievable , where .

The prior-free identification region for the parameter of interest  is:

Denote this region . For canonical parameters,  shrinks to a point as  grows. For screened parameters,  remains persistently wide.

Theorem 1 (Mixture Probability Dispersion). Let  be partitioned into a high-risk cluster  () and low-risk cluster  (), with . Let composite models  and  be the normalized within-cluster mixtures (any convex combination of computable measures is itself a computable measure, so these are well-defined). Let  and . Let the normalized weights of the two clusters under framework  be  and , summing to 1 -  (where  is the residual mass outside the core set). Let  and . If screening holds such that  and  for  then:

Proof. The total mixture probability under  is , where  is the residual contribution. Let . Substituting normalized weights:

By the odds decomposition, . Under the screening bound, , so . Symmetrically,  so . Since  is monotonically increasing:

Subtracting the symmetric expansion for  and bounding the residuals:

Corollary (Translation Barrier Cap). By the Invariance Theorem, . Since  and , we have . The non-canonical divergence is strictly capped by the translation barrier.

Definition 2 & Proposition (Differential Screening & Gap Dynamics). The log-odds gap is , where  is the differential update. If frameworks decompose hypotheses along different causal joints,  takes mixed signs. Modeling  as a bounded-increment random walk with step size  and  fraction of gap-widening evidence:

(a) If , frameworks permanently polarize (probability of consensus decays exponentially in ).

(b) If , expected time to consensus by optional stopping is . As , convergence scales superexponentially.

Ground 2: The Event Predicate

Identity 1 (Additive Decomposition). The total cross-framework discrepancy structurally decomposes into Predicate Divergence and Model Divergence:

Assumption 1. Target catastrophe  is a -complete nontrivial tail set.

Hypothesis (Computable Witnesses).  contains computable , and  contains computable . Let .

Lemma 1. For any computable sequence .

Proof. A prefix-free program computing the Dirac measure  can be constructed from a program computing  by prepending an  prefix-comparison logic wrapper. Normalization is absorbed by 

Theorem 2 (Precision-Robustness Tradeoff). Let disagreement region be  and error .

(a) Topological Lower Bound: For any total computable classifier .

(b) Incompressibility: Specifying a classifier that correctly categorizes a sequence family up to complexity  requires  algorithmically incompressible bits.

Proof of (a).  must halt on input  outputting  after reading finite prefix . If , define . Since  and  is a tail set, . But . If , define , but . In either case, . Generating  requires simulating  (requires  bits) and appending the chosen witness tail (at most  bits). Thus . By Lemma 1, 

Proof of (b). Let  be the halting problem for prefix-free machines relative to the halting oracle, with halting probability .

By Assumption 1,  is effectively -complete. By the definition of effective completeness, there exists a uniform computable map  producing a computable sequence  such that . Because  is already a valid prefix-free program on , no logarithmic prefix-coding penalty is incurred. The uniform map adds only  overhead, so .

Suppose a classifier  correctly classifies  versus  for all prefix-free programs  with . By simulating  on , one computably decides  for all such . Knowing which programs of length  halt determines the partial sums of  to within , recovering its first  bits. Since  is 2-random (Downey, Hirschfeldt, Nies & Terwijn 2006), these bits are algorithmically incompressible, enforcing 

Conjecture 1 (Extensional Divergence). For any -complete tail set  and  with large , the optimal -bit computable classifiers  and  are extensionally distinct ().

Why I want it, why I can't get it. The objective function  is a complexity-weighted loss. Because  and  induce radically different mass distributions, the strict -bit budget forces each classifier to triage errors differently. While the Invariance Theorem allows  to be described under  in  bits, this exceeds the -bit budget, so the -optimal classifier cannot simply import the -optimal one.

I cannot figure this out because correctly classifying -simple sequences requires incompressible bits correlated with , while -simple sequences require bits correlated with . These are distinct 2-random reals. A proof would require showing that a single -bit decision boundary cannot simultaneously serve both (formally, a mutual-information bound on finite decision trees relative to distinct halting probabilities). Let me know if you have any ideas.

  1. ^

    I am not talking about "non-canonical" probability in a Boltzmann sense, only as defined here.

  2. ^

    A quick disclaimer: in this essay I formalize a number of concepts into mathematical terms. I want to be clear that I am not claiming your brain is literally running Solomonoff induction over Cantor space, but it is very useful to establish formal upper bounds on inference using algorithmic information theory. The argument is this: if a mathematically perfect, infinite-compute superintelligence structurally cannot wash out its priors with non-catastrophic data, then our limited human heuristics certainly cannot either. This is not an argument about the literal mechanics of your cognition.

  3. ^

    To be clear: Bostrom and Yudkowsky aren't running competing simulations of AI trajectories based on these works, but their frameworks (the QALY-maximizing surgery ontology versus the alignment-failure asteroid ontology) are good examples of the kind of deep ontological divergence that when formalized into actual generative models produces the encoding dependence I talk about in this essay.

  4. ^

    See Charles Manski's work.

  5. ^

    Interestingly, Bostrom's own analysis doesn't make the mistake I mention here. He takes P(doom) as an input parameter and optimizes across the full range from 1% to 99%, rather than claiming to know the exact number. The divergence between his framework and Yudkowsky's is not primarily about what P(doom) is; it is about what the right decision-theoretic framework for reasoning about it is, and that is actually an encoding dependence itself, just one operating at a meta-level: the choice of whether to frame the problem as "risky surgery" or "incoming asteroid" determines which decision-theoretic apparatus seems natural, which then determines what actions seem rational across the identification region.

  6. ^

    To get really rigorous: in the limiting case where computable priors natively exclude mechanisms outside their ontological primitives, they assign literally zero probability to each other's core catastrophic mechanisms. Nielsen & Stewart proved that rational agents whose measures fail mutual absolute continuity don't merely fail to converge in practice; they can permanently, rationally polarize on the exact same stream of infinite evidence.

  7. ^

    Memory-augmented LLMs are likely Turing-complete.



Discuss

Mechanistic Interpretability of Biological Foundation Models

2026-02-21 02:01:45

Published on February 20, 2026 6:01 PM GMT

TL;DR: I ran the most comprehensive stress-test to date of mechanistic interpretability for single-cell foundation models (scGPT, Geneformer): 37 analyses, 153 statistical tests, 4 cell types. Attention-based gene regulatory network extraction fails at every level that matters, mostly because trivial gene-level baselines already explain the signal and the heads most aligned with known regulation turn out to be the most dispensable for the model's actual computation. But the models do learn real layer-organized biological structure, and I found that activation patching in these models has a large, formally quantifiable non-additivity bias that undermines standard component rankings, which is likely relevant for LLM interpretability too. I urge you: if you like mechanistic interpretability, consider working on biological foundation models. They offer external ground truth for validating your methods, more tractable model scales, and direct biomedical payoff with lower dual-use risk than frontier LLM interpretability. Full research is available here.

1. Why I Work on Mechanistic Interpretability of Biological Models, Not LLMs

It is well accepted that mechanistic interpretability is one of the most naturally attractive research directions for technically oriented people who care about AI safety. It feels like science in the most satisfying sense: you have a complex system, you poke at it with carefully designed experiments, and you try to figure out what it's actually doing inside. It rewards exactly the kind of careful, detail-oriented thinking that draws people into alignment research in the first place, and the dream of understanding what happens between a model's inputs and outputs is compelling enough to sustain years of difficult work.

I want to honestly say that I believe, based both on my own reasoning and on arguments made by people whose judgment I take seriously, that mechanistic interpretability of general-purpose models carries risks that are insufficiently appreciated. The concern is relatively straightforward: deep mechanistic understanding of how capable models work can advance their capabilities (by revealing which circuits to scale, optimize, or compose), and perhaps more critically, early weak superintelligences could leverage interpretability tools and knowledge as a substrate for recursive self-improvement. However, this point is just to explain my motivation; agreeing or disagreeing with it is not important for understanding this article.

At the same time, none of this means that mechanistic interpretability knowledge must remain unused and unapplied across the board. What it means is that we should think about where the risk-benefit calculus is most favorable, and I believe biological foundation models are an unusually good answer to that question, for three reasons that I think are individually sufficient and collectively quite strong.

First, advancing the capabilities of narrow biological models is likely to be locally beneficial. A single-cell foundation model that gets better at predicting gene regulatory responses to perturbations is not going to help anyone build a more capable language model or a more dangerous autonomous agent. These models process transcriptomic profiles, not natural language or general world-knowledge, and making them more capable means making biology research faster, not making general AI systems more dangerous. I mean, eventually it will also probably kill you, but general models will kill you much earlier, so the doom from biological models is "screened off". I do acknowledge that there are still some risks here, but I think it is still net positive because of the reasons I explain below. 

Second, biological models are far more tractable as subjects for mechanistic study than LLMs. Geneformer V2, the largest model in my study, has 316 million parameters and 18 transformer layers. This is large enough to be interesting (it clearly learns non-trivial structure) but small enough to be, at least in principle, exhaustively analyzed with current tools. More importantly, biological models can be validated against experimental ground truth in ways that LLM interpretability simply cannot: we have CRISPR perturbation data that tells us what actually happens when you intervene on specific genes, we have curated databases of known regulatory relationships, and we can design targeted experiments to test specific mechanistic claims. This makes biology a better laboratory for developing and stress-testing interpretability methods, because when something looks like a mechanistic discovery, you can check whether it actually is one.

Third, and this is the motivation I care about most, I think biological foundation models have a genuine chance of radically advancing our understanding of human biology at the systems level. We have largely resolved the genomics level (sequencing is cheap and comprehensive) and made enormous progress on the structural level (AlphaFold and its successors). What remains is fundamentally the systems level: understanding how genes, proteins, cell states, tissues, and organisms interact as integrated wholes to produce the phenotypes we observe. Single-cell foundation models, trained on tens of millions of individual cellular transcriptomes, are plausible candidates for learning aspects of this systems-level organization. If we can extract that knowledge mechanistically, rather than treating these models as black boxes, the payoff for biomedicine and for our understanding of human biology could be substantial. I also believe, as I've argued previously, that advancing our understanding of human biology at the systems level is one of the most important things we can do for human intelligence augmentation, which in turn is one of the most important things we can do for alignment, but I will not try to carry that argument here and instead point the interested reader to that earlier post.

So the question becomes practical: can we actually extract meaningful biological knowledge from these models using mechanistic interpretability tools? That is what I spent the last months trying to find out, and the answer is more nuanced than either the optimists or the skeptics would prefer.

2. Brief Note: What Are Single-Cell Foundation Models, and Why Should You Care?

For readers who come from the LLM interpretability side and have not worked with biological data, here is the minimum context you need to follow the rest of this post.

The data. Single-cell RNA sequencing (scRNA-seq) measures the expression levels of thousands of genes in individual cells. Unlike bulk sequencing, which averages over millions of cells and hides all the interesting heterogeneity, single-cell data lets you see that a tissue is composed of distinct cell types and cell states, each with its own gene expression program. Modern datasets contain tens of millions of individually profiled cells across dozens of human tissues.

The models. Single-cell foundation models are transformer architectures trained on these large scRNA-seq corpora using self-supervised objectives, analogous to how LLMs are trained on text. The two main model families I studied are:

scGPT treats each gene as a token and its expression value as the token's "identity," then trains with masked expression prediction: hide some genes' expression values, ask the model to predict them from the remaining context. This is conceptually very close to masked language modeling, with genes playing the role of words and expression levels playing the role of token IDs.

Geneformer takes a different approach: it ranks genes within each cell by their expression level (most expressed first) and then uses the rank-ordered gene sequence as input, training with masked gene prediction. The tokenization is fundamentally different from scGPT (ranks vs. expression values), the training objective is different, and the model scale differs (Geneformer V2-316M vs. scGPT's smaller variants), but both architectures learn to predict gene expression patterns from cellular context.
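
To make the difference concrete, here is a minimal sketch of how one cell might be turned into model input under each scheme. This is illustrative only: the gene counts and vocabulary are made up, the binning is simplified, and real Geneformer's rank encoding also uses gene-specific normalisation factors that I omit here.

```python
import numpy as np

# Toy expression profile for one cell: gene -> raw count (all values hypothetical).
cell = {"GATA1": 120, "TP53": 45, "MYC": 0, "ACTB": 800, "NKX2-1": 12}
vocab = {g: i for i, g in enumerate(sorted(cell))}  # toy gene vocabulary

# scGPT-style input (simplified): gene tokens paired with binned expression values;
# training masks some expression values and predicts them from the rest.
def scgpt_style(cell, n_bins=5):
    genes = [g for g, c in cell.items() if c > 0]                # nonzero genes only
    counts = np.array([cell[g] for g in genes], dtype=float)
    edges = np.quantile(counts, np.linspace(0, 1, n_bins + 1)[1:-1])
    return [(vocab[g], int(b)) for g, b in zip(genes, np.digitize(counts, edges))]

# Geneformer-style input (simplified): genes ranked by expression, highest first;
# the rank-ordered gene sequence is the input, and masked gene ids are predicted.
def geneformer_style(cell):
    ranked = sorted((g for g, c in cell.items() if c > 0), key=lambda g: -cell[g])
    return [vocab[g] for g in ranked]  # no expression values in the input at all

print("scGPT-style tokens:     ", scgpt_style(cell))       # [(gene_id, expr_bin), ...]
print("Geneformer-style tokens:", geneformer_style(cell))  # [gene_id, gene_id, ...]
```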

What people claim these models can do. The published literature (see, for example, here and here) suggests that these models achieve useful performance on several downstream tasks: classifying cell types, predicting how cells respond to genetic perturbations, and, most relevant for this post, inferring gene regulatory networks (GRNs) from their attention patterns. This last claim is the one I tested most thoroughly, because it is the most mechanistically interpretable claim and the one with the most direct implications for biological knowledge extraction. The idea is simple and appealing: if the model has learned that gene A regulates gene B, then the attention weight from gene A to gene B should be high, and by extracting the full attention matrix, you can recover the regulatory network the model has learned. 
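
To make that pipeline concrete, here is a minimal sketch of the extraction-and-benchmarking logic. Every input is a stand-in (random "attention" weights, a made-up reference edge set); the point is the shape of the computation, not the exact procedure used in the papers or in my study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_heads, n_genes = 50, 8, 30

# Stand-in attention weights from one layer: [cell, head, query_gene, key_gene].
attn = rng.dirichlet(np.ones(n_genes), size=(n_cells, n_heads, n_genes))

# 1. Aggregate attention over cells and heads into a gene x gene edge-score matrix.
edge_scores = attn.mean(axis=(0, 1))   # edge_scores[a, b]: how strongly a attends to b

# 2. Benchmark the ranked edges against a reference network (TRRUST-like stand-in).
true_edges = {(1, 5), (2, 7), (10, 3)}             # hypothetical known regulatory pairs
pairs = [(i, j) for i in range(n_genes) for j in range(n_genes) if i != j]
labels = np.array([(i, j) in true_edges for i, j in pairs])
scores = np.array([edge_scores[i, j] for i, j in pairs])

# AUROC via the rank-sum (Mann-Whitney) formulation, to avoid extra dependencies.
ranks = scores.argsort().argsort() + 1
n_pos, n_neg = labels.sum(), (~labels).sum()
auroc = (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(f"attention-edge AUROC vs. reference: {auroc:.3f}")   # ~0.5 for random weights
```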

3. What I Did: The Most Comprehensive Stress-Test of Single-Cell Model Interpretability To Date

The paper I am summarizing here reports, to my knowledge, the most thorough systematic evaluation of mechanistic interpretability for single-cell foundation models published so far. It spans 37 distinct analyses, 153 pre-registered statistical tests, 4 cell types (K562, RPE1, T cells, iPSC neurons), 2 perturbation modalities (CRISPRi gene silencing and CRISPRa gene activation), and 2 model families (scGPT and Geneformer). The full details are on arXiv; here I will focus on the findings that I think are most relevant for the community.

3.1. The evaluation philosophy

A core design principle was that no single test is sufficient to validate or invalidate a mechanistic interpretability claim, because each test addresses a different failure mode and any one of them can miss problems that another catches. I built five interlocking families of tests, and the logic of how they complement each other is worth spelling out, because I think this framework is reusable well beyond my specific setting:

Trivial-baseline comparison asks: "Can a method that requires no model at all achieve the same performance?" If gene-level variance (a property you can compute with a pocket calculator) predicts perturbation responses as well as your fancy attention-derived network, you have not demonstrated that your interpretability method captures anything beyond trivial gene properties. This test catches overconfidence from neglecting simple alternatives.

Conditional incremental-value testing asks: "Given the best simple features, does your interpretability output add anything?" This is more demanding than the first test because it conditions on the simple features rather than just comparing to them. A method can be "significantly above chance" and still add zero incremental value once you control for what was already available.

Expression residualisation and propensity matching asks: "Is your signal actually coming from the thing you think it's coming from, or is it a confound proxy?" This is the biological equivalent of discovering that your "sentiment circuit" is actually a "sentence length detector."

Causal ablation with fidelity diagnostics asks: "Does the model actually use the components that your interpretability method identifies as important?" If your method says "these attention heads encode regulatory knowledge," then removing those heads should degrade the model's performance on tasks that require regulatory knowledge. This is the closest to standard NLP activation patching, but with a critical addition: intervention-fidelity diagnostics that verify the ablation actually changed the model's internal representations. Concretely, this means measuring how much the model's logits or hidden states shift when you zero out a head, because if a head's output was near-zero to begin with, ablating it tells you nothing about whether the model relies on it. A null result from ablation is only informative if you can show the intervention was materially disruptive to the computation passing through that component, and the fidelity check is what separates "the model doesn't need this head" from "your ablation didn't actually do anything."

Cross-context replication asks: "Does this hold up in a different cell type, a different perturbation modality, or a different model?" A result that appears in K562 CRISPRi but vanishes in RPE1 or T cells is a dataset-specific observation.

A result that survives all five families is genuinely robust. A result that fails any one of them has a specific, identifiable weakness. And the convergence of multiple independent tests pointing in the same direction provides stronger evidence than any single test can offer, regardless of how well-powered it is.

3.2. A note on the cautionary nature of these results

I want to be upfront about something: I tried a lot of ideas, and many of the simple ones did not work. The field's implicit narrative has been that attention patterns in biological transformers straightforwardly encode regulatory networks (again, here and here, but also in many other places), and that extracting this information is primarily an engineering challenge (find the right layer, the right aggregation, the right thresholding). What I found instead is that the relationship between attention patterns and biological regulation is far more complex and confound-laden than this narrative suggests.

But I think this negative result is itself highly informative, for two reasons. The first is that it tells the field where not to look, which saves everyone the effort of independently discovering the same dead ends. The second, which I think is more important, is that the systematic framework I built means that when new biological foundation models emerge (and they will, with better architectures, more data, and potentially different training objectives), testing them against this battery of analyses is straightforward rather than requiring reinvention from scratch. The framework accelerates the entire mechanistic interpretability pipeline for this model class, even though many of its current outputs are negative.

3.3. Connections to NLP mechanistic interpretability

Before presenting the specific findings, it is worth noting that several of the phenomena I document have clear parallels in the NLP mechanistic interpretability literature, though the biological setting allows me to push certain questions further than is currently possible with language models. The finding that attention patterns do not reliably indicate computationally important features echoes long-standing results on attention and explanation, but my causal ablation findings go beyond showing that many heads are prunable: I show that the heads most aligned with known ground truth are the most dispensable, which is a qualitatively stronger negative result. The layer-structured biological representations I find are reminiscent of the classical layer-specialized circuits documented in LLMs (Olsson et al. 2022 on induction heads, Elhage et al. on superposition), but in biology we can validate the content of each layer against independently curated databases of protein interactions and transcriptional regulation, which is a luxury that NLP interpretability researchers do not currently have. So the methodological tools developed here, particularly the incremental-value framework, the non-additivity diagnostics for activation patching, and the confound decomposition battery, may prove useful to people working on interpretability in general.

4. What Works: Positive and Constructive Findings

The negative results get the headlines (and they should, because the "attention as GRN" claim is the one the field has been banking on), but the positive findings are where the constructive path forward begins. These are the things that survived the full stress-testing battery, and I think each of them points toward something real about what these models have learned.

4.1. Attention patterns encode layer-organized biological structure

When I benchmarked Geneformer attention edges against multiple biological reference databases across all 18 layers, protein-protein interaction signal (measured against the STRING database) was strongest at the earliest transformer layer and decreased monotonically with depth. Transcriptional regulation signal (measured against TRRUST, a curated database of transcription factor targets) showed the opposite pattern: it increased with depth and peaked around L15. The cross-layer profiles for these two types of biological signal are anti-correlated, and functional co-annotation signals from pathway databases showed their own distinct depth profiles.

This is interesting, and not just as a biological finding. It means the model has self-organized its layers into a hierarchy that separates different types of biological relationship: physical protein interactions in the early layers, transcriptional regulation in the late layers, with functional pathway associations distributed in between. This is not something the training objective directly incentivizes (the model is just predicting masked gene identities from context), so the layer specialization reflects structure the model discovered on its own.

Critically, this signal survives expression residualisation. When I controlled for pairwise expression similarity (which would remove any signal that was just "these genes are co-expressed, therefore they look related"), 97% of the TRRUST regulatory signal at L15 was retained. So the layer-organized structure is not just a re-encoding of pairwise co-expression in attention-matrix form; it indeed captures something beyond what simple correlation between gene pairs would give you.

4.2. Cell-State Stratified Interpretability (CSSI) as a constructive methodological tool

One of the things I discovered while investigating why attention-based GRN recovery seemed to get worse as you added more cells (which is the opposite of what you would naively expect) is that the problem is not really about "more data makes models worse." The problem is about heterogeneity dilution: when you pool attention patterns across cells in different states (different cell types, different stages of differentiation, different activation states), you average together cell-state-specific regulatory signals that may point in different directions, and the result is a washed-out mess that retains only the regulatory relationships that are universal across all included states.

The solution I developed, Cell-State Stratified Interpretability (CSSI), is conceptually simple: instead of computing attention-derived edge scores across all cells at once, you first cluster cells into relatively homogeneous cell-state groups (using Leiden clustering on the model's own embeddings, so the stratification is informed by what the model itself has learned), compute edge scores within each stratum separately, and then aggregate across strata using max or mean operations. The optimal number of strata in the datasets I tested was around 5-7, which roughly corresponds to the major cell-state subdivisions present in the data.
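
A minimal sketch of the procedure, with random placeholder data and with KMeans standing in for the Leiden clustering used in the actual analysis, looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_cells, n_genes, d_model = 300, 20, 32

# Stand-ins: per-cell model embeddings and per-cell attention-derived edge matrices.
embeddings = rng.normal(size=(n_cells, d_model))
per_cell_edges = rng.random(size=(n_cells, n_genes, n_genes))

def cssi_edge_scores(embeddings, per_cell_edges, n_strata=6, agg="max"):
    """Cell-State Stratified Interpretability, minimal version:
    1. cluster cells into strata using the model's own embeddings;
    2. compute edge scores within each stratum separately;
    3. aggregate the per-stratum matrices across strata (max or mean)."""
    strata = KMeans(n_clusters=n_strata, n_init=10, random_state=0).fit_predict(embeddings)
    per_stratum = np.stack([per_cell_edges[strata == s].mean(axis=0)
                            for s in range(n_strata)])
    return per_stratum.max(axis=0) if agg == "max" else per_stratum.mean(axis=0)

pooled = per_cell_edges.mean(axis=0)                      # unstratified baseline
stratified = cssi_edge_scores(embeddings, per_cell_edges)
print(pooled.shape, stratified.shape)                     # both (n_genes, n_genes)
```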

The results are substantial: CSSI improves TRRUST regulatory edge recovery by up to 1.85-fold compared to unstratified computation. Null tests with random strata assignments confirm that the improvement is not an artifact of the stratification procedure inflating false positives; it specifically requires biologically meaningful strata. In synthetic experiments where I controlled the ground truth, CSSI with oracle labels maintained F1 ≥ 0.90 across all cell count configurations, while pooled inference dropped from ~0.85 at 200 cells to ~0.51 at 1,000 cells.

4.3. Context-dependent attention-correlation relationships reveal genuine learning beyond co-expression

One of the strongest pieces of evidence that these models have learned something real, rather than just repackaging correlation statistics in a more expensive way, comes from comparing how attention edges and correlation edges perform across different cell types and perturbation modalities:

- In K562 cells under CRISPRi (gene silencing), attention and correlation are statistically indistinguishable for predicting perturbation targets.
- In K562 cells under CRISPRa (gene activation), attention actually performs worse than correlation.
- In RPE1 cells under CRISPRi, attention significantly outperforms correlation.
- In iPSC-derived neurons, attention trends better than correlation, but the sample is smaller.

If attention were simply a re-encoding of co-expression, you would expect a uniform relationship across contexts: attention and correlation would always perform similarly. The fact that the relationship is context-dependent, and that it flips direction depending on cell type and perturbation modality, means the models have learned something that varies between biological contexts in a way that simple co-expression does not. Whether that something is causal regulatory structure, more complex statistical dependencies, or some other biologically meaningful feature is a question the current evidence cannot fully resolve, but the context-dependence itself is a signal that the models are doing more than just memorizing gene-gene correlations.

(I should note that the RPE1 advantage, despite being statistically robust, turns out to decompose into confound structure when subjected to the full battery, as I discuss in Section 5. But the existence of context-dependence across all four systems is not explained by confounding, and remains a genuine positive finding about the models' representational capacity.)

4.4. Some transcription factors show robust pairwise regulatory signal in attention edges

The aggregate picture (which I discuss more in Section 5) is that attention-derived edges add zero incremental value over gene-level features for predicting perturbation responses. But this aggregate hides real heterogeneity at the level of individual transcription factors. When I performed per-TF bootstrap analyses, 7 out of 18 evaluable transcription factors showed robust edge-level signal, with a global AUROC 95% confidence interval of [0.71, 0.77]. There was also a suggestive trend that "master regulators" (transcription factors known to control broad developmental programs) showed higher AUROC than other TF categories, though this trend did not survive multiple testing correction given the small sample of evaluable TFs.

This matters because it suggests the blanket conclusion "attention edges are useless for regulatory inference" is too strong as a claim about all regulatory relationships. For some transcription factors, operating in some contexts, attention-derived edges may genuinely capture pairwise regulatory information. Identifying which TFs and which contexts is a direction for future work that could turn the current vague hope into a targeted extraction strategy.

4.5. Cross-species conservation reveals biologically meaningful structure in edge scores

As a separate validation axis, I compared correlation-based TF-target edge scores computed independently in human and mouse lung tissue, matched via one-to-one orthologs. The global conservation was striking: Spearman ρ = 0.743 across 25,876 matched edges, p < 10⁻³⁰⁰, with 88.6% sign agreement and top-k overlaps enriched by 8× to 484× over random expectation.

But what makes this finding informative rather than just impressive is that the conservation is not uniform across transcription factors. Lineage-specifying TFs (those that define cell identity, like NKX2-1 for lung epithelium) show near-perfect cross-species transfer, while signaling-responsive TFs (those that respond to environmental stimuli, like STAT1 or HIF1A) transfer poorly. This pattern makes perfect biological sense: lineage specification is deeply conserved across mammalian evolution, while signal-responsive regulation adapts to species-specific environmental niches. The fact that edge scores recapitulate this known biological pattern, and that the recapitulation is TF-class-dependent in the predicted direction, provides converging evidence that these scores capture real biological structure, even though they may not capture it in the causal form that the strongest interpretability claims require.

5. What Doesn't Work: The Key Negative Findings and Why They Matter

This is where the stress-testing framework earns its keep. Each negative finding survived multiple robustness checks and cross-context replications, and together they present a coherent picture that is hard to dismiss as artifact or bad luck.

5.1. Gene-level baselines dominate perturbation prediction, and you don't need a foundation model for that

This is the single most important negative finding, and it reframes everything else. When I tested how well different features predict which genes will respond to a CRISPR perturbation, the ranking was:

- Gene-level variance alone: AUROC = 0.881
- Mean expression: 0.841
- Dropout rate: 0.808
- Attention-derived pairwise edges: ~0.70
- Correlation-derived pairwise edges: ~0.70

All comparisons with the gene-level baselines are significant at p < 10⁻¹². The implication is that most of what looks like "regulatory signal" in pairwise edge scores, whether derived from attention or from correlation, is actually reflecting univariate gene properties: genes that are highly variable, highly expressed, or frequently detected are more likely to be differentially expressed in response to any perturbation, and pairwise edges are largely tracking this property rather than specific regulatory relationships.

It is the most boring possible explanation for the observed performance, and it explains the bulk of the variance. 
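
To illustrate just how cheap this baseline is, here is a toy version; the data are simulated so that the response label depends mostly on variance, mimicking the confound structure described above rather than reproducing the real analysis:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n_genes = 2000

# Simulated gene-level statistic and a "responds to perturbation" label that,
# by construction here, depends mostly on variance.
log_variance = rng.normal(size=n_genes)
responds = rng.random(n_genes) < 1 / (1 + np.exp(-2 * log_variance))

# The "pocket calculator" baseline: rank genes by their variance alone.
print("variance-only AUROC:", round(roc_auc_score(responds, log_variance), 3))
```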

5.2. Pairwise edge scores add literally zero incremental value over gene-level features

The gene-level baseline dominance could in principle coexist with genuine incremental value from pairwise edges: maybe edges add a small amount of unique information on top of what gene-level features provide. I tested this with a conditional incremental-value analysis on 559,720 observation pairs, with statistical power exceeding 99% to detect ΔAUROC = 0.005.

The result: adding attention edges to gene-level features yields ΔAUROC = −0.0004. Adding correlation edges yields ΔAUROC = −0.002. These are essentially exact zeros, and they persist across all tested generalisation protocols (cross-gene splits, cross-perturbation splits, joint splits), both linear and nonlinear models (logistic regression and GBDT), and multiple metrics (AUROC, AUPRC, top-k recall). The same pattern replicates independently in RPE1 cells, where gene-level features alone achieve AUROC = 0.942 and adding attention edges yields ΔAUROC = +0.0001.

The supplement exhaustively tests this null against every objection I could think of: different metrics, different model classes, different split designs, different feature encodings. The biggest improvement found anywhere was ΔAUPRC ≈ +0.009 under one specific parameterization, which is less than 4% relative improvement and does not survive correction. Whatever biological structure attention edges contain, it is completely redundant with gene-level features for predicting what happens when you perturb genes, at least under the evaluation protocols I tested.
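
For readers who want to see the shape of the incremental-value comparison, here is a stripped-down sketch with placeholder features; the real analysis layers cross-gene and cross-perturbation splits, GBDT models, and pre-registered tests on top of this basic skeleton:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 5000

# Placeholder features for (perturbed gene, readout gene) observation pairs.
gene_level = rng.normal(size=(n, 3))        # e.g. variance, mean expression, dropout
edge_score = rng.normal(size=(n, 1))        # attention- or correlation-derived edge
y = rng.random(n) < 1 / (1 + np.exp(-gene_level @ np.array([1.5, 0.8, 0.4])))

idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.3, random_state=0)

def auroc(X):
    clf = LogisticRegression(max_iter=1000).fit(X[idx_tr], y[idx_tr])
    return roc_auc_score(y[idx_te], clf.predict_proba(X[idx_te])[:, 1])

delta = auroc(np.hstack([gene_level, edge_score])) - auroc(gene_level)
print(f"delta-AUROC from adding the edge feature: {delta:+.4f}")   # ~0 here by design
```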

5.3. Causal ablation reveals that "regulatory" heads are the most dispensable ones

This result is, in my opinion, the most striking finding in the entire paper from the standpoint of mechanistic interpretability methodology.

Geneformer V2-316M has 324 attention heads across 18 layers. I ranked heads by their alignment with known regulatory relationships (TRRUST database) and then ablated them. If attention patterns at regulatory-aligned heads are where the model stores and uses regulatory knowledge, removing those heads should degrade the model's ability to predict perturbation responses.

What actually happened: ablating the top-5, top-10, top-20, or top-50 TRRUST-ranked heads produced zero significant degradation in perturbation-prediction. Meanwhile, ablating 20 randomly selected heads caused a significant performance drop. I also tested uniform attention replacement (forcing attention weights to 1/n while preserving value projections) on the TRRUST-ranked heads, with no degradation. I tested MLP pathway ablation in the purported "regulatory" layers: still no degradation, while MLP ablation in random layers could cause significant drops.

Crucially, intervention-fidelity diagnostics confirmed that these ablations were actually changing the model's internal representations: TRRUST-ranked heads produce 23× larger logit perturbation when ablated compared to random heads. The interventions were material; the model just did not rely on those heads for perturbation prediction. The computation that matters for predicting what happens when you knock down a gene appears to live in the value/FFN pathway, distributed across many components in a redundant fashion, rather than in the learnable attention patterns that interpretability pipelines extract.
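
The ablation-plus-fidelity logic can be illustrated on a toy attention block. This is a generic sketch, not Geneformer's architecture or my experiment code, and it omits the downstream perturbation-prediction metric: it only shows the two intervention types and the fidelity diagnostic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, n_tokens = 64, 4, 16
d_head = d_model // n_heads

class TinyAttentionBlock(nn.Module):
    """Toy single attention block; interventions are done in-place (no gradients)."""
    def __init__(self):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, ablate_head=None, uniform_head=None):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(n_tokens, n_heads, d_head).transpose(0, 1)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-1, -2) / d_head**0.5, dim=-1)
        if uniform_head is not None:            # uniform-attention replacement
            attn[uniform_head] = 1.0 / n_tokens
        ctx = attn @ v                          # [heads, tokens, d_head]
        if ablate_head is not None:             # zero-ablation of one head's output
            ctx[ablate_head] = 0.0
        return self.out(ctx.transpose(0, 1).reshape(n_tokens, d_model))

block, x = TinyAttentionBlock(), torch.randn(n_tokens, d_model)
with torch.no_grad():
    base = block(x)
    for name, out in [("zero-ablation", block(x, ablate_head=2)),
                      ("uniform attention", block(x, uniform_head=2))]:
        # Intervention-fidelity diagnostic: did the intervention materially change
        # the block's output? A near-zero shift makes a null result uninformative.
        shift = ((out - base).norm() / base.norm()).item()
        print(f"{name}: relative output shift = {shift:.3f}")
```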

I also tested the obvious "fix": if the relevant computation is in the value pathway rather than the attention pattern, maybe we should extract edge scores from the context layer (softmax(QK^T)·V) using value-weighted cosine similarity. This does not help. Value-weighted scores actually underperform raw attention and correlation, and adding them to gene-level features slightly hurts incremental value. The context vectors appear to represent a blended "information receipt" signal rather than direct pairwise coupling, and whatever perturbation-predictive computation the model performs is distributed in a way that no simple pairwise score extraction can recover.

5.4. Do these models know about gene regulation at all, or did we just fail to extract it?

The negative results above establish that I could not extract meaningful gene regulatory network information from attention patterns using the methods I tested. But this leaves a crucial epistemic question open: are we looking at an extraction failure (the knowledge is in the model somewhere, but not in the attention weights and not in a form our methods can reach), or a knowledge absence (the models simply never learned causal regulatory relationships in the first place)? These are very different claims, and the second is substantially stronger than the first.

One natural way to probe this distinction is through surface capabilities. If a model can accurately predict what happens when you knock down a gene, then it must have learned something about gene regulation internally, regardless of whether that knowledge is accessible through attention pattern analysis. Surface capabilities provide a minimum baseline for internal knowledge: the model knows at least as much as its best task performance implies, even if our interpretability tools cannot locate where that knowledge lives.

Unfortunately, the evidence on surface capabilities of single-cell foundation models is quite conflicting, and the field is in the middle of a heated debate about it. On one hand, the original papers make strong claims: Theodoris et al. (2023) reported that Geneformer's in silico perturbation approach identified a novel transcription factor in cardiomyocytes that was experimentally validated, and scGPT (Cui et al., 2024) claimed state-of-the-art performance on perturbation prediction, cell type annotation, and gene network inference after fine-tuning. These results suggest that the models have learned something biologically meaningful during pretraining.

On the other hand, a growing body of independent benchmarking work paints a much more skeptical picture. Ahlmann-Eltze et al. compared five foundation models against deliberately simple linear baselines for perturbation effect prediction and found that none of the foundation models outperformed the baselines, concluding that pretraining on atlas data provided "only a small benefit over random embeddings." Csendes et al.  found that even the simplest baseline of taking the mean of training examples outperformed scGPT and scFoundation. Wenteler et al. showed that both scGPT and Geneformer perform worse than selecting highly variable genes and using established methods like Harmony or scVI in zero-shot cell type clustering. Bendidi et al. ran a comprehensive perturbation-oriented benchmark and concluded that foundation models show competitive performance only in batch effect reduction, where even random embeddings achieve near-optimal results. Perhaps most provocatively, Chen & Zou showed that GenePT, which simply uses ChatGPT text embeddings of gene descriptions from NCBI (containing zero expression data), achieves comparable or better performance than Geneformer and scGPT on many of the same downstream tasks!

A consistent pattern in this debate is that the original model papers evaluate primarily with fine-tuning, while independent benchmarks emphasize zero-shot performance. Fine-tuned models can look strong, but it becomes difficult to disentangle whether the strong performance comes from pretrained representations or from the fine-tuning data itself. Zero-shot evaluation is arguably the fairer test of what pretraining actually learned, and this is precisely where the models tend to struggle.

What does this mean for interpreting my results? The honest answer is that I cannot fully resolve the extraction-vs.-absence question with the data we have. Both model families converge to similar near-random unstratified GRN recovery despite fundamentally different architectures (gene-token vs. rank-based tokenization), different training objectives, and different scales, which suggests this is not a model-specific quirk. But the convergence is consistent with both interpretations: either both architectures fail to learn causal regulation from observational expression data (because co-expression is the dominant signal and the training objectives do not specifically incentivize causal structure), or both architectures learn it but encode it in representations that neither attention-based nor simple pairwise extraction methods can reach. The mixed evidence on surface capabilities does not decisively resolve this in either direction, though the weight of the independent benchmarking evidence leans toward the more pessimistic interpretation for current-generation models. The next obvious question is, will stacking more layers help?

6. What the Biological Setting Reveals About Activation Patching

Most of the findings in Sections 4 and 5 are primarily about biology. This section is instead about a methodological result concerning activation patching itself, one that is, as far as I know, novel and directly relevant to anyone using this technique on any transformer model, biological or otherwise.

6.1. The non-additivity problem is formal, quantifiable, and large

Activation patching (sometimes called causal mediation analysis) is one of the workhorse tools of current mechanistic interpretability. The standard workflow is: intervene on one component at a time (a head, an MLP block, a residual stream position), measure the effect on some downstream behavior, and rank components by their individual effects. The components with the largest effects are declared to be "the circuit" responsible for that behavior.

This workflow implicitly assumes additivity: that the effect of the full model is well-approximated by the sum of individual component effects. When this assumption holds, single-component rankings are meaningful. When it fails, they can be systematically wrong in ways that are not just noisy but structurally biased.

The mech interp community is well aware that interactions can matter in principle. Nanda explicitly notes that attribution patching "will neglect any interaction terms, and so will break when the interaction terms are a significant part of what's going on." Heimersheim & Nanda discuss backup heads and the Hydra effect as specific instances of non-additive behavior, where ablating one component causes others to compensate in ways that confound single-component attribution. Makelov et al. demonstrate a related failure mode at the subspace level, showing that patching can activate dormant parallel pathways that produce illusory interpretability signals. The qualitative concern is not new, and I want to credit the people who have been raising it. What has been missing, to my knowledge, is (a) a formal framework for quantifying how much the standard single-component workflow's rankings are biased by interactions, (b) empirical measurement of how large that bias actually is in a real model rather than a constructed example, and (c) certificates for which pairwise rankings survive the observed non-additivity. That is what I provided.

I formalize the bias using a decomposition involving Möbius interaction coefficients. The key quantity is the relationship between single-component mediation estimates and Shapley values (which are interaction-aware by construction). Single-component estimates equal Shapley values only when all interaction terms vanish; otherwise, the discrepancy is a structured function of the interaction landscape, and it can push the ranking in a consistent wrong direction rather than just adding noise.

The empirical question is whether this matters in practice. In the biological transformers I studied, the answer is clearly yes. Using frozen cross-tissue mediation archives, I computed lower bounds on aggregate non-additivity (the residual between total effect and the sum of individual component effects, adjusted for measurement uncertainty). In 10 of 16 run-pairs, this lower bound was positive, meaning the observed non-additivity exceeds what measurement noise alone could explain. The median lower-bound ratio relative to the total effect was 0.725, which means interactions account for a substantial fraction of the overall model behavior in the median case.

6.2. Ranking certificates collapse under structural bias assumptions

The most practically concerning result is not just that non-additivity exists, but what it does to the reliability of component rankings. I introduced "ranking certificates" that ask: given the observed level of non-additivity, what fraction of pairwise comparisons between components (e.g., "head A matters more than head B") can we certify as robust to interaction-induced bias?

Under the structural-bias assumptions informed by the empirical non-additivity measurements, the fraction of certifiably correct pairwise rankings collapses by an order of magnitude or more compared to what the single-component estimates naively suggest. In concrete terms: if you rank 50 heads by their individual activation patching effects and declare the ranking meaningful, the certification analysis suggests that only a small fraction of the pairwise orderings in that ranking are robust to interaction effects. The rest could be wrong, and wrong in a way that is invisible to the standard workflow because the standard workflow does not check for it.
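
A simplified rendering of the certification step, with `bias_bound` standing in for the interaction-induced bias bound derived from the observed non-additivity, might look like this:

```python
import numpy as np

rng = np.random.default_rng(4)
n_heads = 50

single_effects = rng.normal(size=n_heads)  # per-head single-component patching effects
bias_bound = 0.8                           # assumed max bias per estimate (placeholder)

# Certify "head i matters more than head j" only if the ordering cannot be flipped
# by shifting each estimate by up to bias_bound in opposite directions.
gaps = np.abs(single_effects[:, None] - single_effects[None, :])
certified = np.triu(gaps > 2 * bias_bound, k=1)
n_pairs = n_heads * (n_heads - 1) / 2
print(f"fraction of pairwise rankings certified: {certified.sum() / n_pairs:.2f}")
```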

6.3. What this means for mech interp practice

I have demonstrated the non-additivity bias and its consequences in biological transformers with 316 million parameters. I have not demonstrated it in GPT-2, Llama, or any other language model, and the magnitude of the effect could be different in those architectures. The formal framework applies to any transformer (it is architecture-agnostic), but the empirical severity is an open question for LLMs.

That said, I think the results warrant concrete changes to standard practice for anyone doing activation patching or similar single-component mediation analysis:

First, report the residual non-additivity. This is the gap between the total effect of a multi-component intervention and the sum of corresponding single-component effects. It is cheap to compute (you need one additional intervention beyond what you already do) and it directly tells you how much of the model's behavior lives in interactions rather than in individual components. If this residual is large, your single-component rankings are unreliable, and you should know that before you build a mechanistic story on top of them.
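
Concretely, the check costs one extra intervention on top of the single-component sweep you already run. The sketch below uses a toy value function with built-in interaction terms as a stand-in for "ablate this set of components and measure the behavioral change":

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
components = list(range(8))
main_effects = rng.normal(size=8)
interactions = rng.normal(scale=0.5, size=(8, 8))   # hypothetical pairwise interactions

def effect_of(subset):
    """Toy stand-in for the real intervention-and-measure routine."""
    s = sorted(subset)
    return main_effects[s].sum() + sum(interactions[i, j] for i, j in combinations(s, 2))

single = [effect_of({c}) for c in components]        # the sweep you already run
joint = effect_of(set(components))                   # the one extra intervention
residual = joint - sum(single)
print(f"residual non-additivity: {residual:+.3f} "
      f"({abs(residual) / (abs(joint) + 1e-12):.1%} of the total effect)")
```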

Second, compute ranking certificates for your top-ranked components. If you are going to claim "these are the most important heads for behavior X," you should check whether that ranking is robust to the level of non-additivity you actually observe. If only 10% of pairwise orderings survive certification, your "top 5 heads" may not actually be the top 5 heads.

Third, for your most important mechanistic claims, consider using interaction-aware alternatives like Shapley-based decompositions. These are more expensive (combinatorially so in the worst case, though sampling-based approximations exist), but they handle interactions correctly by construction. The synthetic validation in my supplement shows that Shapley-value estimates recover true interaction rankings with approximately 91% improvement in rank correlation compared to single-component estimates, which suggests the additional cost is worth it when the claim matters.
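
For reference, the standard permutation-sampling approximation of Shapley values is short to write down; the sketch below reuses a toy interaction-laden value function of the same shape as in the previous block:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
components = list(range(8))
main_effects = rng.normal(size=8)
interactions = rng.normal(scale=0.5, size=(8, 8))

def effect_of(subset):  # toy stand-in, as in the previous sketch
    s = sorted(subset)
    return main_effects[s].sum() + sum(interactions[i, j] for i, j in combinations(s, 2))

def sampled_shapley(n_permutations=500):
    """Monte Carlo Shapley values: average marginal contribution of each component
    over random orderings. Interaction-aware by construction."""
    phi = np.zeros(len(components))
    for _ in range(n_permutations):
        prefix, prev = set(), 0.0
        for c in rng.permutation(components):
            prefix.add(int(c))
            value = effect_of(prefix)
            phi[c] += value - prev
            prev = value
    return phi / n_permutations

print(np.round(sampled_shapley(), 2))   # compare with the single-component estimates
```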

The broader methodological point is that "patch one component, measure effect, rank components" feels like a clean experimental design, and it is, as long as additivity holds. But additivity is an empirical property of the specific model and behavior you are studying, not a logical guarantee, and in the systems I studied, it fails badly enough to undermine the rankings it produces. I suspect this is not unique to biological transformers.

6.4. A note on metric sensitivity across scales

One additional observation that may be useful, though it is less novel than the non-additivity result: I found that the same underlying attention scores can show degrading top-K F1 with more data (all 9 tier×seed pairs, sign test p = 0.002) and improving AUROC with more data (mean 0.858 → 0.925 → 0.934) simultaneously. This reflects the difference between evaluating the extreme tail of a ranking under sparse references versus evaluating the full ranking. But it means that claims about how "interpretability quality scales with data/compute/parameters" are only meaningful if you specify which metric you are tracking and why, because different metrics can give exactly opposite answers about the same underlying scores. 

7. Next Steps: Toward a Program for Knowledge Extraction from Biological Foundation Models

The negative results in the current paper close off some paths but open others. If you accept the evidence that attention-based GRN extraction does not work, the question becomes: what might? This section outlines what I think are the most promising directions, ordered roughly from most to least concretely specified.

7.1. Intervention-aware pretraining

The most direct response to the optimization landscape concern raised in Section 5.4 is to change the training data. Current single-cell foundation models are pretrained on observational expression profiles, where co-expression is the dominant statistical signal and causal regulatory relationships are a much weaker, sparser, and noisier signal that the training objective does not specifically incentivize. If you want models that learn causal regulation, the most straightforward path is to train them on data that contains causal information.

Concretely, this means pretraining on (or at least fine-tuning with) perturbation experiments: Perturb-seq, CRISPRi/CRISPRa screens, and similar interventional datasets where you observe what happens when you knock a gene down and can therefore learn which genes are causally upstream of which others.

The challenge is scale. Perturbation datasets are orders of magnitude smaller than the observational atlases used for pretraining (tens of thousands of perturbations versus tens of millions of cells). Whether this is enough data to learn robust regulatory representations, or whether the perturbation signal will be drowned out by the much larger observational pretraining corpus, is an open empirical question, but I think my other research on scaling laws for biological foundation models may shed some light on it. 

7.2. Geometric and manifold-based interpretability

One of the most important recent developments in mechanistic interpretability, and one that I did not explore in my paper, is the recognition that models encode complex knowledge not as discrete pairwise relationships but as geometric structure in their representation spaces. This is directly relevant to the failure modes documented in this paper.

The most relevant example comes from Goodfire's work on Evo 2, a DNA foundation model trained on over 9 trillion nucleotides. Using sparse autoencoders on residual stream activations, they discovered that the phylogenetic tree of life is encoded as a curved manifold in the model's learned feature space: species relationships correspond to geodesic distances along this manifold, with the overall structure organized around a roughly 10-dimensional flat representation overlaid with higher-curvature deviations that capture additional biological properties. This is, to my knowledge, one of the most complex natural manifolds yet characterized in a foundation model, and crucially, it is a biological foundation model where the extracted knowledge was validated against known ground truth (established phylogenies). This is exactly the kind of success story that the single-cell interpretability field needs but does not yet have.

The methodological lesson for single-cell models is pointed: if gene regulatory knowledge is encoded geometrically in the residual stream (as manifolds, subspaces, or curved representations) rather than as discrete pairwise relationships in attention matrices, then no amount of sophisticated attention extraction will find it, because you are looking in the wrong representational format entirely. 

This connects to a broader trend in the interpretability community. The linear representation hypothesis (that features correspond to directions in activation space) is being supplemented by the recognition that many important features live on nonlinear manifolds: circles for days of the week, hierarchical trees for taxonomic relationships, tori for periodic quantities, and more complex structures. Goodfire's own researchers note that "manifolds seem to be important types of representations, and ones that are not well-captured by current methods like sparse autoencoders," which suggests that even SAEs, the current dominant tool, may need manifold-aware extensions to fully characterize what these models have learned.

A concrete next experiment would be to train SAEs on residual stream activations of scGPT or Geneformer, look for geometric structures that correlate with known regulatory relationships, and test whether regulatory information that is invisible in attention patterns becomes visible in the learned feature space. If it does, the implication would be that the models have learned more about gene regulation than the attention-based methods could reveal. If it does not, that would strengthen the case for intervention-aware pretraining as the necessary next step.

7.3. Probing residual streams: from aggregate statistics to feature-level analysis

My paper's methodology is primarily macro-level: aggregate statistics across many TF-target pairs, summary measures of head importance, average AUROC across perturbation conditions. This was deliberate (I wanted statistically robust claims with controlled multiple testing), but it means the analyses are inherently insensitive to fine-grained structure that might exist at the level of individual features or small groups of components.

The natural next step is to apply the standard NLP probing toolkit to single-cell foundation models. Train linear probes on residual stream representations at each layer to predict specific regulatory relationships (e.g., "is gene A a direct target of transcription factor B?"). If the probe succeeds where attention extraction fails, it would localize regulatory knowledge to specific layers' representations without requiring that it be readable from attention patterns. If the probe also fails, that is much stronger evidence for knowledge absence rather than mere extraction failure.
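
In outline, the probing experiment would look something like this. The per-layer features below are random placeholders for residual-stream representations of (TF, candidate target) pairs, and the labels stand in for TRRUST-style annotations; with real model activations, the interesting question is whether any layer's probe beats chance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n_pairs, d_model, n_layers = 4000, 64, 18

# Placeholder per-layer features for (TF, candidate target) pairs, e.g. the
# concatenated residual-stream representations of the two genes at that layer.
pair_features = rng.normal(size=(n_layers, n_pairs, 2 * d_model))
is_true_target = rng.random(n_pairs) < 0.1      # placeholder regulatory labels

for layer in range(0, n_layers, 3):             # probe every third layer
    probe = LogisticRegression(max_iter=2000)
    auc = cross_val_score(probe, pair_features[layer], is_true_target,
                          cv=5, scoring="roc_auc").mean()
    print(f"layer {layer:2d}: probe AUROC = {auc:.3f}")   # ~0.5 on random features
```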

Beyond linear probes, the SAE-based feature discovery approach discussed in 7.2 could yield individual interpretable features that correspond to specific regulatory programs or pathway activations. If a sparse autoencoder trained on layer 15 residual streams (where my paper found peak TRRUST alignment in attention) produces features whose activation patterns correlate with known regulatory cascades, that would be a concrete positive result pointing toward the kind of mechanistic understanding the field is seeking.

One important caveat from my paper's own findings: the causal ablation results show that perturbation-predictive computation is distributed across many components in a redundant fashion rather than localized in identifiable circuit components. When ablating the heads most aligned with regulatory ground truth produces zero degradation while random ablation causes significant degradation, this suggests there may not be a clean "regulatory circuit" to find. Fine-grained circuit discovery tools work best when the computation is localized and modular; if it is genuinely distributed and redundant, as the evidence suggests, then even sophisticated circuit analysis may not produce the kind of clean mechanistic story we would like. The honest conclusion might be that these models perform regulatory-relevant computation through distributed, redundant representations that resist clean decomposition, which would be an important finding in its own right even if it is less satisfying than a circuit diagram.

7.4. Hybrid architectures, CSSI, and conformal uncertainty

Two shorter-term practical directions deserve mention, both of which build directly on infrastructure from my paper.

First, hybrid architectures that use foundation model embeddings as inputs to dedicated GRN inference modules rather than trying to extract edges from attention. The idea is to take the residual stream representations that the models learn (which clearly contain biological structure, as demonstrated by the layer-organized findings in Section 4) and feed them into purpose-built GRN inference algorithms as enriched gene features, rather than interpreting the attention matrix itself as a gene regulatory network. This sidesteps the attention extraction problem entirely while still leveraging whatever biological knowledge the foundation model has encoded during pretraining. Several GRN inference methods already accept gene embeddings as inputs (GEARS being a prominent example), and foundation model embeddings could serve as a drop-in upgrade over existing gene embedding approaches.

Second, the Cell-State Stratified Interpretability (CSSI) framework showed improvements of up to 1.85× in GRN recovery. CSSI could be extended with conformal prediction to provide confidence sets rather than point estimates: instead of extracting a single ranked list of regulatory edges, you would get a set of edges that are certified to contain the true regulatory relationships at a specified confidence level. Conformal prediction is well-suited to this because it provides finite-sample coverage guarantees without distributional assumptions, which is important in a domain where we do not know the distribution of regulatory edge scores. The combination of CSSI (to reduce cell-state heterogeneity) with conformal uncertainty quantification (to provide calibrated confidence) could produce "certified edge sets" that are smaller and more reliable than current approaches, even if the underlying signal is weaker than what the field originally hoped for.
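
One way such certified edge sets could be constructed is with a split-conformal threshold calibrated on edges already known to be true (e.g. from TRRUST). The sketch below is my own simplified rendering of that idea, with placeholder scores standing in for CSSI-aggregated edge scores:

```python
import numpy as np

rng = np.random.default_rng(8)

candidate_scores = rng.random(10000)            # scores for all candidate edges
calibration_scores = rng.beta(3, 2, size=200)   # scores of held-out known true edges

def conformal_edge_set(candidate_scores, calibration_scores, alpha=0.1):
    """Split-conformal threshold: with probability >= 1 - alpha, a new true edge's
    score exceeds the returned threshold, so the selected set 'covers' it."""
    n = len(calibration_scores)
    k = int(np.floor(alpha * (n + 1))) - 1       # 0-indexed lower-quantile position
    threshold = np.sort(calibration_scores)[max(k, 0)]
    return candidate_scores >= threshold

selected = conformal_edge_set(candidate_scores, calibration_scores)
print(f"certified edge set: {selected.sum()} of {len(candidate_scores)} candidates")
```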

7.5. What this suggests for the broader interpretability-for-biology agenda

Stepping back from the specific technical directions, I think the most important lesson from this work is about the value of systematic stress-testing before building on interpretability claims.

The "attention as GRN" idea in single-cell biology was not unreasonable. There were good theoretical reasons to think it might work (attention patterns represent pairwise gene relationships, regulatory networks are pairwise gene relationships, the models clearly learn biological structure). But it failed at every level that matters for actual biological utility. The positive results (layer structure, context dependence, per-TF heterogeneity) survived the same battery, which gives me much more confidence that they point toward something real.

8. Conclusion

This paper started as an attempt to extract gene regulatory networks from single-cell foundation models and ended as a methodological argument about how to do mechanistic interpretability honestly. The specific biological results matter for the computational biology community, but I think the broader lessons are relevant to anyone working on mechanistic interpretability in any domain.

I want to close with a pitch: if you like mechanistic interpretability, consider working on biological foundation models instead.

Beyond the methodological advantages, biological interpretability is, in my view, both more tractable and less dangerous than frontier LLM interpretability. The models are smaller (hundreds of millions of parameters rather than hundreds of billions), the input domain is more constrained (gene expression profiles rather than arbitrary natural language), and the knowledge you are trying to extract is better defined (regulatory networks, pathway activations, cell state transitions). You are not probing a system that might be strategically deceiving you, and the knowledge you extract has direct applications in drug discovery and disease understanding rather than in capability amplification. And I still really believe that there is a non-negligible chance that we can push biology forward in the remaining time and amplify human intelligence.



Discuss

On Steven Byrnes' ruthless ASI, (dis)analogies with humans and alignment proposals

2026-02-20 23:32:23

Published on February 20, 2026 3:32 PM GMT

@Steven Byrnes' recent post Why we should expect ruthless sociopath ASI and its various predecessors, like "6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa", argue that a brain-like, RL-trained ASI would be a ruthless consequentialist, since “Behaviorist” RL reward functions lead to scheming.

Byrnes' proposed solution

Byrnes' proposed solution is based on the claim that "we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the human brain reward can evidently get triggered by specific activations inside my inscrutable learned world-model."

As far as I understand Byrnes' proposed construction of the agent, the belief system is activated, then interpretability is applied to elicit concepts from the belief system. The belief system's outputs and the desire system's outputs are then somehow combined to produce the next token. The token and its consequences (e.g. the fact that the math problem stayed unsolved or was solved) are likely to become part of the training data.

While human society lets most individuals intervene in the reward function and training data for the belief systems of others, allowing individuals to align each other to the community, the AI-2027 scenario has an army of copies of the same AI become involved in the training runs and the creation of synthetic data. If this happens, then I expect that the belief system will end up being corrupted not just by wishful-thinking-like errors, but also by things like subliminal learning from the misaligned desire system. Therefore, I doubt that I understand how Byrnes' proposal alone would prevent the agent from scheming.

An alternative explanation of how humans avoid wholesale scheming

Additionally, the question of why humans avoid wholesale scheming, as described[1] in Byrnes' post, has plausible explanations other than being born with an interpretability-based, approval-related reward:

  1. The humans' utility function equivalents could have a range where they are concave. Humans living in a state of concave utility would then be more risk-averse and have fewer reasons to scheme against each other.
  2. Alternatively, humans could value connection with other people, both for instrumental reasons (like having tasks that require coordination with others) and because of shared idiosyncratic values, and systematically decide that scheming is rarely worth risking the loss of that connection.

The second mechanism listed here also provides a potential origin story of the Approval Reward. Suppose that humans are born with instincts like valuing smiles and smiling when happy. Then valuing smiles causes human kids to learn to make others visibly happy, which in turn generalises to a hybrid of Myopic Approval Reward-like circuitry and scheming-related circuitry.[2] However, unlike for AIs, human circuitry related to scheming is pruned by the fact that potential victims usually have similar or higher capabilities, causing the schemers to be punished. Additionally, human actions have long-term consequences, which makes it easier to transform Myopic Approval Reward-like circuitry into circuitry related to a Longer-Term Approval Reward rather than into things like short-term flattery.

Alignment-related proposals

Similarly to Cannell's post on LOVE in a simbox, I suspect that the considerations above imply that drastic measures like including interpretability in the reward function are unnecessary, and that the key to alignment is to reward the AI for coordinating with others, genuinely trying to understand them, or outright providing help to them instead of replacing them, while pruning scheming-related behaviors by ensuring that the help is genuine. See also Max Harms' CAST agenda[3] and a related discussion in this comment on verbalised reasoning.

  1. ^

    "But my camp faces a real question: what exactly is it about human brains that allows them to not always act like power-seeking ruthless consequentialists? I find that existing explanations in the discourse—e.g. “ah but humans just aren’t smart and reflective enough”, or evolved modularity, or shard theory, etc.—to be wrong, handwavy, or otherwise unsatisfying."

  2. ^

    The formation of the two circuitry types is also impacted by parental guidance and by stories included in kids' training data.

  3. ^

    Which I reviewed here. Point 4 had me propose redefining power to be upstream of the user's efforts. While this redefinition makes the AI comprehensible to the user, it also incentivises the AI to adopt a specific kind of moral reasoning.



Discuss