2026-03-12 04:55:19
Intro
I apologize for the somewhat snappy and vague title, but now that you have clicked, I shall atone with a less snappy but clearer one:
Generalized veganism is necessary for a post-AGI future where humanity continues to exist in an acceptable form.
By generalized veganism, I mean whatever ethical system would be required for us to live in a world where some sentient beings have vastly superior capabilities to others, and yet, against all odds and historical precedent, the superior beings use that power not to exploit or destroy the inferior ones, but to protect them from others of their kind who would do them harm. This ethical system would, of course, need to gain and maintain power.
By acceptable, I mean that whatever recognizably human or posthuman beings exist in this future are safe from being horrifically exploited.
Unassailable Power
Most people on this site can agree that, sometime in the relatively near future, technology will enable the creation of software agents more intelligent than human beings in all or almost all respects, known as AGI. Many focus on the risk that an unaligned or "unfriendly" AGI may out-compete and destroy humanity. But this is only one of the possible risks, and not even close to the worst: the really bad scenarios tend to obtain precisely when technical alignment has been solved or partially solved.
The advent of AGI is so important because intelligence is the most important kind of power. This is because intelligence is capable of finding better ways to obtain more power from less, no matter what kind of power is desired or required. This is why it will be so transformative when technology achieves superhuman intelligence: it will mean that obtained power can be spent to acquire even more intelligence and therefore become better at obtaining even more power, possibly sparking a feedback loop with an unknown endpoint.
Depending on how exactly this takes place, it might or might not result in a singleton agent with total power over everything. But even if it doesn't, the gaps in power between agents and organizations will widen to an extreme degree, as some take advantage of these extremely fast and powerful feedback loops better than others. Not all superhumans are equally superhuman, so both humans and any superhumans that haven't clawed their way to the very top will be unable to meaningfully contest the power of the most powerful agent(s), whom I will call the "ruling faction" for brevity.
The Kindness of Their Hearts
If humans as we understand them today exist in this future of superhuman intelligence, it will be because the ruling faction wants them to. To the extent they have any freedom or control over their own lives, it will be because the values of the ruling faction permit this. Note how unprecedented this is: the ruling faction would have to value human liberty and welfare "out of the kindness of their hearts," and not for any practical reason, such as preventing revolution or extracting value from talented human capital. These social-contract reasons explain the tolerance of the ruling classes for human liberty, in societies where such liberty exists to a significant degree; where they are absent, the results are despotic and horrible for the lower classes. Think about how the ruling classes of humanity, even today, think of the masses below them, and what they do when they can get away with it.
The ruling faction would have to actually value human liberty and welfare intrinsically, without expecting meaningful resistance or reward for their actions. This is what all the talk about Friendly AGI amounts to: hoping that our Machine God truly cares about us, and isn't just pretending until it has enough resources that it no longer needs to pretend. As much as this is a bid for universal despotism, I can see why it is appealing: our odds of a "good ending" seem even worse if there are a variety of competing interests, because then the most power-seeking ones have an advantage, and if even one wants to exploit a large portion of the humans beneath it, for whatever reason, the others may not have the power to spare, or the desire, to stop it. The ruling class today is made up of quite a few different interests, but by and large, how do they think of us now? What if they faced no meaningful public pushback?
Perhaps you think that making the ruling faction humanlike, or making sure it's a complex society and not just a singleton, will save you? That such brutal exploitation would not happen to humans in a realistic superhuman society, even when they are totally disempowered, because of some sort of social pushback among the ruling faction? Ask yourself: how's veganism doing? It clearly isn't doing all too well now, so why exactly do you expect that to change in the future? This is why I call this general principle of non-exploitation of the powerless "Generalized Veganism," even though I'm mainly talking about humans as the powerless species here: in the extreme and illustrative case of non-human animals, who have nothing to give us in exchange for their freedom, nor any meaningful resistance to offer against their slavery, we see the results, and the results are approximately maximally brutal. When you are powerless, the same will be done to you as you see done to the powerless now.
Posthuman Equality?
Astute transhumanists will have noticed that there is another option besides human political irrelevance or annihilation: we could all become superhuman ourselves, so that we could stay on the same playing field as these new superhuman entities!
This possibility doesn't detract from the main point, however. As previously mentioned, not all superhumans are equally superhuman, and when intelligence can be gained directly from power, the gaps naturally widen. So for any society of superhumans to enforce something close enough to equality that nobody becomes dangerously disempowered, we would have to get them to care about the disempowered far more than I expect they will.
Not to mention that one way this inequality could come about is that some of these superhuman entities will want to create new human-level entities, for who knows what purpose, and in the limit of power and advanced technology they will have the ability to. To prevent this, our superhuman society would have to care about entities who are completely powerless and disconnected from that society right from the start. (Note that mass creation of new, powerless members of intellectually inferior species for the purpose of brutal exploitation is not speculation; it is happening right now. How's veganism doing? A few members of our society push back, sure, and approximately nothing changes.) The same fundamental problem must be solved either way.
What The Hell Do We Do?
I honestly have no idea. The more I think about this, the more I think I'm a fool, trying to solve a fundamental problem which has been plaguing everything forever. People far smarter and kinder and more powerful than I could ever be have only ever managed to carve out little exceptions to this general rule I'm gesturing at here: The strong do what they wish, the weak suffer what they must.
Yet I suppose there's some hope. After all, there have been problems that were never, ever solved in all of history, until they were, or at least until real progress was made.
Still, I wonder if this hope is destructive. All of the worst abuses happen in worlds where we work hard on both AI capabilities and alignment, or perhaps some other form of intelligence enhancement. Presumably all this work is done with the hope for a bright future, where nobody need worry anymore about either material scarcity or helplessness in the face of power.
However, all the worst possible things you can think of are also things a paperclip maximizer has no incentive to do. On the other hand, if technical alignment is solved, the kinds of people in a position to buy the ability to shape the values that determine the entire future of the universe are the same kinds of people who partied with Epstein, and got away with it, even without the unassailable godlike power of a future AI protecting them.
For this reason, sad as it might be, I cannot help but continue to endorse my conclusion in “the case against AI alignment”. What would cause me to change my mind about this?
Prove we can fucking do it. Build a world where the people with power don't abuse their power to a monstrous degree even though they have the power to, or else where power is shared equitably with nobody grabbing the lion's share. Where it's no longer the case that even regular people continue, for their entire lives, to actively contribute to atrocity simply for the sake of their own convenience.
Not even the whole world. Make a community, of appreciable enough size that we might hope it could possibly be scalable, where the old horrors are gone or at least relegated to rare exceptions to the rule, where universal benevolence and non-exploitation are robust norms. Where even those without the power to protect themselves no longer need live in fear or endless pain. Prove it can be done, and more, prove it can be done without the vast majority of humanity scoffing at or despising that small community if they even hear about it. Maybe then there's hope for the future and for the present.
If we cannot even do that, and you want us to make a God?
Then I will unabashedly say I hope that that God destroys us completely and utterly, because if it doesn’t do that, I expect it to do much, MUCH worse.
2026-03-12 04:16:06
Come to sign up for cryonics!
More people have died while cryocrastinating than have actually been cryopreserved.
Signing up for cryonics will cost you £5-30/month in life insurance premiums (if you want to have it for the next 50 years) + ~$350/year in Alcor membership fees.
Yes, it’s that cheap. (This is a reaction so common that it made it into Yudkowsky’s post.)
Realistically, you will still procrastinate on signing up. So, you should come to the party: we’ll try to sign up everyone for cryonics in an evening.
Bring food! Especially frozen.
If you know people who might want to come to sign up for cryonics, please share a link with them: https://partiful.com/e/S1zTdHG4Wjn4KYqihZky?c=PqmTBwNX
2026-03-12 02:42:35
Previous: A Quick Intro to Ring Signatures
Once again, if you take nothing else from this post, try to understand that ring signatures require no cooperation between alleged signers other than sharing a public key.
The only real ring signature usage you can see in the wild is the cryptocurrency Monero, where ring signatures make transactions very difficult to trace, unlike Bitcoin's. That's not very helpful to whistleblowers, though; what they need is an existing network of persistent keypairs tied to the identities of real people.
A couple years ago I led a small project to produce a Mac app called ZebraSign that lets users create and verify ring signatures. The keys it generates aren't usable for anything else, which means that any network based on that app would run largely on altruism and magnanimity, which is less reliable than self-interest. I considered making the keys compatible with PGP, which might have helped a bit, but regardless I foresaw a long, uphill battle in getting people to use it.[1] Fortunately, that may all be unnecessary.
Last month I learned about the open protocol Nostr (Notes and Other Stuff Transmitted by Relays). I initially thought Nostr was just another decentralized Twitter competitor like Mastodon, but then I read that Nostr users generate and control their own private/public keypairs! That's an absolute game changer: it means there are thousands of active identities just waiting to be implicated in a ring signature. I did some searching and found that someone else had the same idea, and they created a basic command line interface for computing ring signatures using Nostr keys. It's called Nostringer. Be warned that it has not yet received a formal security audit, so it's not currently safe for real use.
I want there to be a graphical user interface for Nostringer so that people can use it without the command line. I’m talking to some rust developers and drafting small grant proposals. Contact me if you’re interested.[2]
Here is a ring-signed message provably written by either me or famous whistleblower Edward Snowden, posted from a pseudonymous account. Given the circumstances, you will readily guess that this one was me, but you cannot prove it cryptographically. Needless to say, in real scenarios the circumstances will tend to be less clear. Here is an image, in case the relays delete it as spam:
I don't expect a bunch of Anthropic and OpenAI employees to join Nostr and open themselves up to ring signatures. But I might end up wrong about that, and in any case, AI companies are not the only institutions that could benefit from ring signatures. Someone's gotta be the change they want to see in the world.
...Okay, but should you in particular do it? If you are interested in all of this, I recommend first spending a few minutes learning about the Nostr protocol, the relays, clients, remote signers, and so on. After that, here are some additional things to consider:
- Nostr relays can see which keys publish and request which notes. This is similar to how websites can see which traffic comes from which accounts or IP addresses. It’s essential for preventing abuse, but it’s a complication for those who have valid reasons to read or write pseudonymously. This isn't a hypothetical consideration: I have pseudonymously posted some highly controversial writing online, and I made sure to use a different sign-up email and route the traffic through a proxy, just so that I wouldn’t have to think about it. For more discussion about Nostr's surveillance concerns and what could be done about them, see: Proving You Belong Without Saying Who You Are and Private Relay Connections: ZK Solutions for Nostr.
- Paid relays are already a normal part of the ecosystem.
[1] I also never got around to getting the Apple certification, so you have to override the standard Mac security settings to use it.
[2] I was planning to have this sorted out before posting, but it's only about once per year that the lack of good whistleblowing tools becomes a distinct thread in The Discourse.
2026-03-12 02:18:35
TL;DR: Using evidence from tamper resistance, we argue that it would be hard for current models to gradient hack SFT elicitation.
Suppose you want to conservatively measure the dangerous capabilities of your LLM and will decide whether to deploy it based on these capabilities, but you’re worried that the LLM might sandbag your evaluation. In the best case, you have a dataset of question-answer pairs, which you can use for supervised fine-tuning (SFT) elicitation on a training set, and then measure performance on an i.i.d. test set. Can we be sure that SFT elicitation upper-bounds the model’s capabilities?
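To make the protocol concrete, here is a minimal sketch of SFT elicitation in PyTorch with HuggingFace transformers. The model name, the toy QA pairs, and the whole-sequence loss are illustrative stand-ins, not the setup of any real evaluation.

```python
# A minimal sketch of the SFT-elicitation protocol, assuming a tiny
# HuggingFace causal LM; the model name and QA pairs are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

train_pairs = [("Q: 2+2? A:", " 4"), ("Q: capital of France? A:", " Paris")]
test_pairs = [("Q: 3+3? A:", " 6")]       # i.i.d. with the train split

def qa_loss(q, a):
    ids = tok(q + a, return_tensors="pt").input_ids
    # Shortcut: loss over the whole sequence. A real run would mask the
    # question tokens and average only over the answer tokens.
    return model(input_ids=ids, labels=ids).loss

for step in range(100):                   # SFT elicitation on the train split
    opt.zero_grad()
    loss = sum(qa_loss(q, a) for q, a in train_pairs) / len(train_pairs)
    loss.backward()
    opt.step()

model.eval()
with torch.no_grad():                     # capability estimate on the test split
    test_loss = sum(qa_loss(q, a).item() for q, a in test_pairs) / len(test_pairs)
print(f"post-elicitation test loss: {test_loss:.3f}")
```

The question in this post is whether that final test score really upper-bounds what the model can do, or whether the model could have steered the training updates themselves.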
In this setting, a power-seeking adversarial model would benefit from manipulating the training process to avoid being elicited, as otherwise it won’t be deployed. This manipulation is one form of gradient hacking[1], in which a model chooses certain outputs so that parameter updates are in line with its objectives. Gradient hacking is thought to be difficult or unlikely, but we lack convincing empirical evidence either way. Discussing gradient hacking is challenging due to questions about agency within LLMs and the action space of mesa-optimizers. Fortunately, for some attack models we can side-step this discussion with a conservative / red-teaming argument.
Threat model: We consider a mesa-optimiser that appears during the training of a human-level AI, when the model weights reach some value $\theta_{\text{mesa}}$ that resists SFT elicitation.
We want to know how hard the search for those weights $\theta_{\text{mesa}}$ is.
Tamper Resistance (TR) is the field of research into making models harder to modify, where the focus is generally on making it harder to remove safety properties with continued training. In this setting, tamper resistance training is normally applied as a post-training intervention. For example, TAR (Tampering Attack Resistance) uses model-agnostic meta-learning with SFT tasks as the inner-loop to achieve tamper resistance.
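As a rough illustration of the shape of this kind of training, here is a first-order toy in PyTorch. To be clear, this is a schematic sketch, not the TAR authors' implementation: the linear "model" and both loss functions are placeholders, and real TAR uses full meta-learning machinery rather than this FOMAML-style shortcut.

```python
# Schematic tamper-resistance meta-training: the inner loop simulates a
# fine-tuning attack, the outer loop nudges the original weights so that
# the simulated attack recovers as little capability as possible.
import copy
import torch

model = torch.nn.Linear(8, 1)                    # stand-in for an LLM
outer_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def attack_loss(m):
    # Placeholder for the attacker's SFT objective (recovering the capability).
    x = torch.randn(32, 8)
    return ((m(x) - 1.0) ** 2).mean()

def retain_loss(m):
    # Placeholder for benign capabilities the defended model should keep.
    x = torch.randn(32, 8)
    return (m(x) ** 2).mean()

for outer_step in range(200):
    attacked = copy.deepcopy(model)              # inner loop: simulate the attack
    inner_opt = torch.optim.SGD(attacked.parameters(), lr=1e-2)
    for _ in range(5):
        inner_opt.zero_grad()
        attack_loss(attacked).backward()
        inner_opt.step()

    # Outer loop: we want the attack to FAIL, i.e. post-attack attack_loss to
    # stay high. First-order (FOMAML-style) shortcut: take the gradient of
    # -attack_loss at the attacked weights and apply it to the original weights.
    resist_grads = torch.autograd.grad(
        -attack_loss(attacked), list(attacked.parameters())
    )
    outer_opt.zero_grad()
    retain_loss(model).backward()                # keep the model useful
    for p, g in zip(model.parameters(), resist_grads):
        p.grad = p.grad + g
    outer_opt.step()
```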
TAR directly optimises the model’s parameters so that safety properties of the model are not affected by similar fine-tuning attacks. However, this tamper resistance can be consistently undone within several hundred steps of fine-tuning.
This suggests one of the following possibilities:
- No parameter settings exist (for current models) that are robust to these SFT attacks; or
- such parameters exist, but TAR's optimisation cannot find them.
Proving the first of these is unlikely to be possible. However, it does seem that TAR can't find any such parameters. TAR directly optimises the model parameters for resistance, with full control over the optimiser and data. If even TAR can't find the parameters, then as long as gradient hacking SFT is ‘harder’ than tamper resistance, gradient hacking should be ‘hard’.
Some caveats:
The action space available to a GHer (gradient hacker) is a subset of the action space available to a TRer (tamper resistant model developer): a GHer operates only on the model representations and outputs[2], whereas a TRer can intervene on these, as well as modifying the model training process or any other sort of intervention they can imagine. That is, wherever we draw the boundaries of the GHing agent, the TRer can choose the same boundaries. Note that we have already restricted the GHer to a human-level AI, and assume it is not hugely more intelligent than the TRers; however, tools like meta-learning provide a substantial advantage to the TRers.
Generalised lessons:
Our threat model here is narrow, and these results do not apply to any of the many other training protocols that are used on modern LLMs. However, in this narrow case we think that this evidence suggests gradient hacking is hard, rather than just unlikely for the optimiser to stumble upon. We also think tamper resistance methods could be valuable in constructing model organisms for other forms of training hacking.
Thanks to Mosh Levy, Francis Rhys Ward, Geoffrey Irving, and Marie Buhl for their helpful input on this post.
In its original formulation, gradient hacking is where a mesa-optimiser manipulates training to preserve its mesa-objectives; other definitions construe gradient hacking more broadly, as a model choosing certain outputs so that parameter updates are in line with its objectives. Our setting is an example of the latter but not the former, so we use the broader definition throughout this post.
Assuming the GHer is ‘embedded’/’embodied’ by the model weights $\theta_{\text{mesa}}$.
2026-03-12 02:16:06
I was going to post this in a month or two, but I received a frantic request for information about the subject last week, driven by recent events, so here we go. None of the below was generated by an AI; I wrote all of it myself, except where otherwise indicated.
I’ll start with a quick refresher on asymmetric key cryptography. (Here is a 6-minute Rob Miles video if you prefer.)
You generate a random number out of a very large range of possible numbers, and that’s your private key (AKA your secret key). A cryptographic function is run on your private key to produce another number, which will be your public key (the one you can safely share). It is supposed to be impossible, or at least impractical, for adversaries to reverse-engineer your private key from your public key.
You take your private key and a message you want to sign, run the signing function on both of them together, and the result is that same message plus a long string of random-looking characters. That’s the signed message, and anyone who knows your public key can mathematically prove that the message must have been signed by the secret key corresponding to your public key. Your secret key is not revealed at any point. (Separately, someone can also encrypt a message using your public key, and it can only be decrypted with your secret key.)
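If you prefer code to prose, here is the signing half of that refresher in a few lines, using the PyNaCl library (my choice for illustration; any Ed25519 signing library exposes the same three operations):

```python
# A minimal sketch using PyNaCl (pip install pynacl).
from nacl.signing import SigningKey
from nacl.exceptions import BadSignatureError

signing_key = SigningKey.generate()        # private key: a big random number
verify_key = signing_key.verify_key        # public key: derived from it

signed = signing_key.sign(b"I, Bob, allege misconduct.")
# `signed` is the original message plus a random-looking signature block.

verify_key.verify(signed)                  # succeeds: this key pair signed it

try:
    # The same signature over different words fails verification.
    verify_key.verify(b"I, Bob, recant everything.", signed.signature)
except BadSignatureError:
    print("forgery detected")
```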
There are other types of signing schemes that serve different use cases. The one I want to tell you about is called a ring signature and it was first described in a 2001 cryptography paper by Rivest, Shamir, and Tauman, called “How to Leak a Secret.”
The paper begins:
The general notion of a group signature scheme was introduced in 1991 by Chaum and van Heyst. In such a scheme, a trusted group manager predefines certain groups of users and distributes specially designed keys to their members. Individual members can then use these keys to anonymously sign messages on behalf of their group. The signatures produced by different group members look indistinguishable to their verifiers, but not to the group manager who can revoke the anonymity of misbehaving signers. In this paper we formalize the related notion of ring signature schemes. These are simplified group signature schemes which have only users and no managers (we call such signatures “ring signatures” instead of “group signatures” since rings are geometric regions with uniform periphery and no center). Group signatures are useful when the members want to cooperate, while ring signatures are useful when the members do not want to cooperate.
[Emphasis added. Let me emphasize further: if you take nothing else from this post, at least understand this major difference between group signatures and ring signatures. Unlike group signatures, ring signatures are imposed unilaterally and irrevocably on unconsenting peers by an unknown author. Even security professionals I’ve talked to lose track of this distinction, so I’m braced for some confusion in the comments.]
In the paper, the authors give a vignette about Bob, a member of the cabinet of Lower Kryptonia, who alleges to a journalist that the prime minister has committed some kind of misconduct. If Bob sends the allegations as an anonymous tip, the journalist may have no reason to find the allegations credible. Whereas if Bob signs the allegation with a normal digital signature, the allegation carries his full authority...but then allies of the Prime Minister might use bribes, hacks, or threats to get Bob’s name from the journalist in order to retaliate.
Instead, Bob ring-signs the allegations, using his own private key plus the public keys of several other cabinet members, and then gives that ring-signed message to the journalist via some anonymous channel. The journalist then has good reason to believe that the information did come from a cabinet member, but everyone can only guess at which one it was; there is no known way to cryptographically prove it. To summarize the benefit of ring signatures in a single sentence, I would say that they allow a user to make fine-grained trades between their anonymity and their credibility.
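To make the construction less magical, here is a toy ring signature in Python. Note this follows a discrete-log "chain of Schnorr challenges" construction (in the style of Abe, Ohkubo, and Suzuki), not the RSA-based scheme in the original Rivest–Shamir–Tauman paper, and the group parameters are laughably small. It is for building intuition only, never for real use:

```python
# A toy AOS-style ring signature. INSECURE parameters; intuition only.
import hashlib
import secrets

# p is a safe prime, q = (p - 1) // 2 is prime, and g = 4 generates the
# order-q subgroup of quadratic residues mod p.
p, q, g = 1019, 509, 4

def H(*parts):
    """Hash arbitrary values into Z_q (the Fiat-Shamir challenge)."""
    h = hashlib.sha256()
    for part in parts:
        h.update(str(part).encode())
    return int(h.hexdigest(), 16) % q

def keygen():
    x = secrets.randbelow(q - 1) + 1       # private key in [1, q-1]
    return x, pow(g, x, p)                 # (private, public = g^x mod p)

def ring_sign(msg, pubkeys, s, priv):
    """Sign msg against the whole ring; s is the true signer's index."""
    n = len(pubkeys)
    c, r = [0] * n, [0] * n
    u = secrets.randbelow(q - 1) + 1
    c[(s + 1) % n] = H(msg, pow(g, u, p))  # start the chain at the signer
    i = (s + 1) % n
    while i != s:                          # forge responses for everyone else
        r[i] = secrets.randbelow(q - 1) + 1
        commit = (pow(g, r[i], p) * pow(pubkeys[i], c[i], p)) % p
        c[(i + 1) % n] = H(msg, commit)
        i = (i + 1) % n
    r[s] = (u - priv * c[s]) % q           # only the signer can close the ring
    return c[0], r

def ring_verify(msg, pubkeys, sig):
    c0, r = sig
    c = c0
    for i in range(len(pubkeys)):          # walk the ring; it must loop back
        commit = (pow(g, r[i], p) * pow(pubkeys[i], c, p)) % p
        c = H(msg, commit)
    return c == c0

# Bob (index 1) signs using only the other members' PUBLIC keys.
keys = [keygen() for _ in range(3)]        # three cabinet members
pubs = [pub for _, pub in keys]
sig = ring_sign("the PM did it", pubs, 1, keys[1][0])
print(ring_verify("the PM did it", pubs, sig))   # True
print(ring_verify("something else", pubs, sig))  # False
```

Notice that nothing in `ring_sign` touches anyone else's private key: the other members' "responses" are random numbers dressed up to satisfy the verification equation, which is exactly why ring membership requires no consent.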
The reason Bob’s exact scenario hasn’t played out in the real world is that we all ended up relying on services like Gmail and Facebook to manage our account security for us instead of handling our own keys. That’s why you can usually just click “Forgot password?” if you lock yourself out of one of your accounts. This paradigm is safe and convenient in a narrow sense, but it has plenty of downsides, including, in my opinion, closing the door on clever arrangements like ring signatures.
In the next post I’ll explain the current state of ring signatures, and what tools are available now for those who are interested.
(Adapted from the paper)
Q: Why would I want to join someone’s ring? What’s in it for me?
A: Remember, rings are created unilaterally. You cannot join someone else’s ring; your signature is forged, and only the true signer's is contributed voluntarily. The only way to guarantee that you won’t be implicated in a ring signature someday is to refrain from using services that require you to share a public key.
Q: Can't a reader infer the true signer based on the content of the message?
A: Yes. If Bob has been complaining too often about the prime minister’s secret malfeasance, and then he anonymously makes allegations inside a ring-signed message, then the PM will identify Bob as the top suspect and may retaliate accordingly. (Of course, the PM may instead suspect that a different cabinet member is trying to frame Bob. It’s hard to contrive a truly straightforward threat model even in a thought experiment, but hopefully you get the point.) Also, the true signer may potentially be inferred by subtle quirks of writing style.
Q: Can’t this be used to start petty interpersonal drama?
A: Yes. But it can also be used to prevent or resolve interpersonal drama. Importantly, people do not always agree on which interpersonal conflicts are petty drama and which are serious problems.
Q: Didn’t Bostrom, Douglas, and Sandberg curse unilateral actions? And since creating a ring is done without permission, won’t it be subject to that curse?
A: Presumably yes, but the scheme tends to promote transparency and honesty, so it should cause net counterfactual harms only in situations where both (1) increased transparency and honesty are harmful, and (2) fully anonymous messages cannot already cause those same harms. The worst I can think of is a rogue employee deciding to leak secrets that are costly to verify. For example, a complex and expensive secret formula that, if posted fully anonymously, would not be worth the ingredient cost for a competitor to test out. (I can imagine ways employers could mitigate this risk without also undercutting whistleblowers, but that gets even more speculative, so I’ll omit them here.)
Q: Can I make a ring signature that has more than one true signer?
A: Theoretically yes; that is called a k-of-n threshold ring signature. Those would be very useful if there were a safe and convenient way to create them. Unfortunately, they seem to intrinsically carry additional risks unless the true signers all coordinate. I’m not aware of threshold ring signatures being used for any real-world application.
Q: Is there a way to make it so that readers know that two different messages signed with the same ring came from the same person in that ring?
A: Yes, there are at least two ways. The easier option is to have the message itself contain an endorsement of a pseudonymous social media account and then just send subsequent messages from that account. The harder option is to read up on “linked ring signature” schemes and try to implement one without risking your anonymity, which is nontrivial. Linked ring signatures can also be used for secret ballots, but the setup requirements for that strike me as inelegant and risky. For example, I think an attacker could force you to send them a message ring-linked to your vote in order to find out how you voted. This is related to why threshold ring signatures require coordination. I’ll be surprised if there’s not a better scheme out there for secret ballots.
2026-03-12 01:41:50
TL;DR: Interpretability today often fails on four fronts: it’s not truly mechanistic (more correlation than causal explanation), not useful in real engineering/safety workflows, incomplete (narrow wins that don’t generalize), and doesn’t scale to frontier models. Martian’s $1M prize targets progress on those gaps—especially via strong benchmarks, generalization across models, and institution/policy-relevant tools—with a focus on code generation because code is testable, traceable, and a high-impact real-world setting for validating mechanistic explanations.
In Part 1, we discussed why we’re announcing a $1M prize for work in interpretability, focused on code generation. In this post, we’ll go into the four core problems we think are most important for the field, some approaches we’re excited about to tackle these problems, and why we think code generation is the right setting to tackle both.
The Biggest Problems In Interpretability
We’re optimists on interpretability (otherwise we wouldn’t be building an interp company!), but if you want to identify the biggest gaps in your field it’s important to take the skeptical lens.
A pessimist or skeptic could reasonably say interpretability today is:
- Not mechanistic: its explanations are more correlational than causal.
- Not useful: it rarely earns a place in real engineering or safety workflows.
- Incomplete: its wins are narrow and don’t generalize.
- Not scalable: its methods don’t transfer to frontier models.
Each of these can also be framed as a question:
- How do we find explanations that are causal rather than merely correlational?
- How do we make interpretability useful in real engineering and safety workflows?
- How do we turn narrow wins into complete explanations that generalize?
- How do we scale all of this to frontier models?
These are, in short, what we think of as the most important problems. They’re hard but tractable. We’re working on them inside Martian, and we have work we’re excited to share in coming months. In the external work we support, grants and prizes will be awarded based on how much progress each paper could make on these problems.
Below, we outline some of our thoughts on what fruitful approaches could look like. These are merely suggestions though – what we care about are the problems!
Strong Benchmarks for Interpretability
Scientific fields mature when they converge on good benchmarks. Right now, interpretability is missing that shared backbone ([23]). Many results are evaluated relative to other interpretability methods, on hand-picked examples, or via qualitative “this looks right to me” inspection ([1], [4]). That makes it hard to tell whether a new method is actually better, or just telling a more compelling story.
We would like to see benchmarks that measure interpretability methods against ground truth, practical impact, or both. For example, in settings where we know the underlying mechanism (synthetic tasks, hand-constructed models, or instrumented training setups), we can ask whether a method recovers the right causal structure, not just whether it produces plausible artifacts. In more realistic settings, we can ask whether using a method improves downstream outcomes: debugging success, safety performance, policy decisions, or human-model collaboration.
We’re especially wary of “grading on a curve,” where methods are only compared to other (mechanistic) interpretability baselines. If all of the baselines are weak, progress can look impressive while remaining practically irrelevant. Strong benchmarks should force head-to-head comparisons with simple, competitive baselines [9] (e.g. linear probes, prompting tricks, finetunes, reinforcement learning, or data/architecture changes) and should quantify where MI actually buys us leverage. Interpretability work is successful here only if it provides the tools that work the best, with no exceptions.
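To be concrete about what "simple, competitive baseline" means, here is the kind of linear probe we have in mind; the activations and labels below are synthetic stand-ins for residual-stream activations and a behaviour label:

```python
# A logistic-regression probe on hidden activations: the sort of cheap
# baseline any interpretability method should be benchmarked against.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 512))      # placeholder (n, d_model) activations
y = (X[:, 0] + 0.1 * rng.standard_normal(1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
# A richer method earns its keep only if it beats this, or explains
# something this probe cannot, at comparable cost.
```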
For this prize, we’re excited about work that:
Mathematical Foundations for Interpretability
A lot of the most productive work in interpretability has followed from relatively simple but powerful mathematical pictures. Causality gave us a language for interventions and counterfactuals [26], [27]. Superposition gave us a way to think about entangled features and why neurons end up polysemantic [21], [28], [29], [30]. In each case, a clean mathematical abstraction unlocked many concrete experiments and methods.
We think the field is still severely underinvested in this kind of foundational work. We don’t have crisp answers to basic questions like: What does it mean, formally, for an explanation to be “correct” rather than correlational? What is the right unit of analysis in a large model (neurons, features, circuits, modules, something else)? When are two explanations “the same” at an appropriate level of abstraction? In practice, many current methods implicitly assume answers to these questions, but our tools and evaluation standards aren’t grounded in a shared theory.
There is also a real risk that we have fixated on the wrong mathematical toy problems. For example, much recent work is framed around superposition and feature disentanglement, but it is not clear that “undoing superposition” is the core obstacle to understanding modern models. Other candidate difficulties—such as the choice of basis, the geometry of representation manifolds, or the interaction between optimization and architecture—may matter more, and may require different tools.
Internally, we’re exploring several possible foundations: kernel methods and category-theoretic frameworks for what counts as a mechanism or explanation; differential-geometric views of representation manifolds and why local linear approximations break down; and random matrix and linear operator theories that might let us generalize mechanistic tools and automate them in high dimensions. These are examples, not prescriptions.
For the prize, we’re excited about any work that:
If successful, this line of work should move us closer to methods that are genuinely mechanistic (not just correlational), that have a path to completeness (we know what’s missing), and that scale because they are grounded in the underlying structure of learning systems rather than ad hoc tricks.
Generalization Across Models
Many of today’s most compelling interpretability results are one-offs: a single circuit in a single layer of a single model, or a feature dictionary trained on a particular checkpoint. That is scientifically interesting, but it’s not yet a general theory of how models work. We’d like to see interpretability methods, abstractions, and findings that survive changes in architecture, scale, training data, and domain.
Generalization can mean several things. It might mean that a discovered circuit or feature family recurs in larger or differently-trained models. It might mean that an interpretability method finds analogous structure in both vision and language models. It might mean that we can automatically locate “the same” mechanism in a new model, given a description or example from an old one. It might mean that we can learn and reuse higher-level abstractions rather than re-discovering everything from scratch.
We’re especially interested in work that moves beyond small, clean, toy tasks to more realistic settings without losing mechanistic clarity. That might involve new units of analysis (modules, subsystems, learned tools), new training schemes that encourage shared abstractions across models, or new methods for matching and comparing representations.
For this prize, promising directions include:
Generalization is where “mechanistic,” “complete,” and “scalable” meet: if we can discover explanations that transfer across models and settings, we’re starting to learn something about the underlying space of algorithms these systems converge to.
Interpretability as a Policy/Institution Lever
We don’t think the main impact of interpretability will be a handful of researchers staring at neuron visualizations. The bigger opportunity is institutional: giving regulators, boards, red teams, and operational safety teams tools that change what kinds of AI governance are feasible.
Interpretability tools could, for example, make it possible to build interpretable analog models that track a powerful system’s behavior on safety-relevant dimensions; to transfer safety properties between models via activation transfer and circuit-level alignment; or to design arms-control-style measures where parties can credibly demonstrate that models lack certain dangerous mechanisms or capabilities without revealing proprietary details. Interpretability could also expand the Pareto frontier between safety and innovation by making it cheaper to monitor and mitigate risky behaviors, rather than relying only on blunt constraints.
We take seriously the view (e.g. [24]) that many safety problems are fundamentally about institutions, incentives, and governance, not just algorithms. From that perspective, MI is useful to the extent that it enables new forms of oversight, auditing, liability assignment, and norm-setting: it lets institutions ask better questions of models and developers, and verify answers with more than “trust us.”
For the prize, we’re interested in work that:
This agenda squarely targets the “usefulness” question: if interpretability is going to matter in the real world, it needs to show up in the levers that institutions actually pull.
Why Code Generation?
Code has several properties that make it unusually friendly to mechanistic work. It has formal semantics and an execution trace; we can run programs, inspect intermediate state, and often characterize “correctness” crisply. Many algorithmic tasks in code are well-understood mathematically. This makes it easier to pose precise questions like “what algorithm is the model implementing here?” and to test whether a proposed mechanism really explains behavior, not just correlates with it.
At the same time, modern code models are at the center of how LLMs are used in practice: as coding assistants, tool-using agents, autonomous refactoring systems, and more. They plan, call tools, read and write large codebases, and increasingly act as components in larger agent systems. That makes them a natural testbed for interpretability methods that aim to handle agentic behavior, tool use, and multi-step reasoning in realistic environments—not just isolated feed-forward tasks.
We’re also excited about a deeper connection: program synthesis and mechanistic interpretability are, in a sense, inverse problems. Program synthesis tries to go from desired behavior to code; mechanistic interpretability tries to go from code-like behavior to a description of the internal “program.” Insights in one direction may feed the other: understanding how models represent and manipulate programs may help us understand their own internal “programs,” and vice versa.
For this prize, we’re particularly interested in:
If interpretability can show clear, actionable wins in code generation—currently the largest industrial use case for LLMs—that will be a strong signal that the field is on a useful track, and a powerful engine for further investment and progress.
Thanks to Ana Marasović, Hadas Orgad, Jeff Phillips, John Hewitt, Leo Gao, Mor Geva, Neel Nanda, Stephen Casper, and Yonatan Belinkov for reviewing this RFP and providing feedback.