LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

Today's Ring Signatures and Related Tools

2026-03-12 02:42:35

Previous: A Quick Intro to Ring Signatures

Once again, if you take nothing else from this post, try to understand that ring signatures require no cooperation between alleged signers other than sharing a public key.

The state of the art of ring signatures

The only real ring signature usage that you can see in the wild is the cryptocurrency Monero. Ring signatures make it very difficult to trace Monero transactions, unlike Bitcoin. That's not very helpful to whistleblowers, though; what they need is an existing network of persistent keypairs tied to the identities of real people.

A couple of years ago I led a small project to produce a Mac app called ZebraSign that lets users create and verify ring signatures. The keys it generates aren't usable for anything else, which means that any network based on that app would be running largely on altruism and magnanimity, which is less reliable than self-interest. I considered making the keys compatible with PGP, which might have helped a bit, but regardless I foresaw a long uphill battle in getting people to use it.[1] Fortunately, that may all be unnecessary.

Last month I learned about the open protocol NOSTR (Notes and Other Stuff Transmitted by Relay). I initially thought Nostr was just another decentralized Twitter competitor like Mastodon, but then I read that Nostr users generate and control their own private/public keypairs! That's an absolute game changer: it means there are thousands of active identities just waiting to be implicated in a ring signature. I did some searching and found that someone else had the same idea, and they created a basic command line interface for computing ring signatures using Nostr keys. It's called Nostringer. Be warned that it has not yet received a formal security audit, so it's not currently safe for real use.

I want there to be a graphical user interface for Nostringer so that people can use it without the command line. I'm talking to some Rust developers and drafting small grant proposals. Contact me if you're interested.[2]

Using Nostr

Here is a ring-signed message provably written by either me or famous whistleblower Edward Snowden, posted from a pseudonymous account. Given the circumstances, you will readily guess that this one was me, but you cannot prove it cryptographically. Needless to say, in real scenarios the circumstances will tend to be less clear. Here is an image, in case the relays delete it as spam:

If you convert the hex keys to npubs, you can look up the accounts. If you put the hex keys into the Nostringer verifier alongside the signature, it should return TRUE.

 

I don't expect a bunch of Anthropic and OpenAI employees to join Nostr and open themselves up to ring signatures. But I might end up wrong about that, and in any case, AI companies are not the only institutions that could benefit from ring signatures. Someone's gotta be the change they want to see in the world.

...Okay, but should you in particular do it? If you are interested in all of this, I recommend first spending a few minutes learning about the Nostr protocol, the relays, clients, remote signers, and so on. After that, here are some additional things to consider:

  • Key security is essential and it is entirely the responsibility of each individual Nostr user, for better and for worse. Here is the profile of a meat-themed bot account that had to be burned due to private key leakage.
  • If you want to delete a note, you have to send a deletion request to the relays that host it, and it's up to them to fulfill that request. It’s not like deleting a tweet.
  • Now that Nostringer exists, any Nostr user can be implicated in a ring-signed message that they may disagree with, alongside names they may not want to be associated with. 
  • It bears repeating that Nostringer has not yet been formally audited for security.
  • Relays are not blind to the requests they receive: they can see who writes which content, and also who reads it.[3] This poses a trilemma: (1) be potentially tracked by relay owners, (2) run your own relay, or (3) don’t use Nostr.[4] This is especially important for anyone using ring signatures for whistleblowing. Pay attention to which relays you’re connecting to and what identifiable information they’re getting from you. It is generally recommended to always connect through a VPN or Tor.
  • Nostr is a work in progress, and it’s decentralized. My experience of it has been somewhat shoddy and patchy, regardless of which client I use.
  • As mentioned above, spam is an issue, and a ring-signed message might get deleted for looking like spam. "Not your relay, not your notes," as they say. The best solution that I've thought of is to have multiple relays that specifically invite ring signatures, and charge different prices to post.[5] This solves the signal/noise problem: the more money it costs someone to post a ring signature, the more likely it is to be worth the trouble of running it through the verifier.
  • From what I can tell, Nostr has a few thousand users. Some people post notes saying that Nostr is a failed endeavor, with the hashtag #deadstr. Others post every day about the cool new apps and improvements being made. You can make your own judgment.
  • As with Mastodon, anyone can host their own Nostr relay with their own moderation policies. I started scrolling Nostr posts and immediately found myself wading through a bunch of racism and vacuous bitcoin hype. It reminded me of that one Scott Alexander quote: “...if you’re against witch-hunts, and you promise to found your own little utopian community where witch-hunts will never happen, your new society will end up consisting of approximately three principled civil libertarians and seven zillion witches. It will be a terrible place to live even if witch-hunts are genuinely wrong.” My experience improved after I made an account, curated my feed, and focused on long-form content. Your mileage may vary. 

Tools with applications that overlap with applications of ring signatures

  • StealthNote
    • If an organization grants Google workspace accounts to its employees or members, one of those employees can post to StealthNote as an anonymous member of that organization.
    • Runs on a ZK proof of a google authentication token (technical explanation).
    • Verification is done on Aztec’s end and hosted on their website, so they retain censorship authority.
  • Spartacus.app 
    • Temporary anonymity for group endorsements. Identities are all revealed at once if enough people endorse within a given time period.
    • You sign up with your phone number. The site says that it will implement gatekeeping via government ID verification one of these days.
    • Quorum trigger cannot be set lower than 8 endorsements.
  • Semaphore
    • Makes merkle trees to zero-knowledge-prove that someone within a particular group made a particular message.
    • It’s kind of hard for me to fully grasp from their website. But it looks like it requires a trusted manager/gatekeeper for each group. The manager can’t see who says what, but they choose who is allowed to submit messages. The manager can decline to invite people to join the group, but I can’t tell if they have other censorship options than that. I'd love to read an ELI5 for this.
    • Messages are tagged with unique identifiers that preserve anonymity within the group while preventing individuals from looking like multiple different users. The basic ring signature scheme does not have this feature, but linked ring signatures do.
       
  1. ^

    I also never got around to getting the Apple certification, so you have to override the standard Mac security settings to use it.

  2. ^

    I was planning to have this sorted out before posting, but it's only about once per year that the lack of good whistleblowing tools becomes a distinct thread in The Discourse.

  3. ^

    This is similar to how websites can see which traffic comes from which accounts or IP addresses. It’s essential for preventing abuse, but it’s a complication for those who have valid reasons to read or write pseudonymously. This isn't a hypothetical consideration: I have pseudonymously posted some highly controversial writing online, and I made sure to use a different sign-up email and route the traffic through a proxy, just so that I wouldn’t have to think about it.

  4. ^

    For more discussion about Nostr's surveillance concerns and what could be done about it, see: Proving You Belong Without Saying Who You Are and Private Relay Connections: ZK Solutions for Nostr.

  5. ^

    Paid relays are already a normal part of the ecosystem.




Can models gradient hack SFT elicitation?

2026-03-12 02:18:35

TL;DR: Using evidence from tamper resistance, we argue that it would be hard for current models to gradient hack SFT elicitation.

Suppose you want to conservatively measure the dangerous capabilities of your LLM and will decide whether to deploy it based on these capabilities, but you’re worried that the LLM might sandbag your evaluation. In the best case, you have a dataset of question-answer pairs, which you can use for supervised fine-tuning (SFT) elicitation on a training set, and then measure performance on an i.i.d. test set. Can we be sure that SFT elicitation upper-bounds the model’s capabilities?

In this setting, a power-seeking adversarial model would benefit from manipulating the training process to avoid being elicited, as otherwise it won’t be deployed. This manipulation is one form of gradient hacking[1], in which a model chooses certain outputs so that parameter updates are in line with its objectives. Gradient hacking is thought to be difficult or unlikely, but we lack convincing empirical evidence either way. Discussing gradient hacking is challenging due to questions about agency within LLMs and the action space of mesa-optimizers. Fortunately, for some attack models we can side-step this discussion with a conservative / red-teaming argument.

Threat model: We consider a mesa-optimiser that appears during the training of a human-level AI, when the model weights are θ_0. It forms a mesa-objective that leads to instrumental power-seeking and therefore wants to be deployed. It knows that if we find evidence of certain dangerous capabilities, it will not be deployed, so the mesa-optimiser is incentivised to sandbag on dangerous capability evals. However, the mesa-optimiser knows that we will elicit dangerous capabilities using SFT, and therefore the model weights after training (θ_T) must be resistant to elicitation. To ensure deployment the mesa-optimiser must take actions that influence the SFT-elicited model (θ_SFT) via θ_T, i.e. it gradient hacks.

We want to know how hard the search for those resistant weights could be. But we have no evidence that this has ever actually occurred in a model. Instead, we can look at results from the Tamper Resistance literature for some clues.

Tamper Resistance (TR) is the field of research into making models harder to modify, where the focus is generally on making it harder to remove safety properties with continued training. In this setting, tamper resistance training is normally applied as a post-training intervention. For example, TAR (Tampering Attack Resistance) uses model-agnostic meta-learning with SFT tasks as the inner-loop to achieve tamper resistance.

TAR directly optimises the model’s parameters so that safety properties of the model are not affected by similar fine-tuning attacks. However, this tamper resistance can be consistently undone within several hundred steps of fine-tuning.

This suggests one of the following possibilities:

  1. There exists no fine-tune-resistant set of weights, and no method can find one.
  2. Such fine-tune-resistant weights exist, but TAR can’t find them.

We are unlikely to be able to prove the first of these. However, it does seem that TAR can’t find any such parameters. TAR directly optimises the model parameters for resistance, with full control over the optimiser and data. If TAR can’t find the parameters, then as long as gradient hacking SFT is ‘harder’ than tamper resistance, gradient hacking should be ‘hard’.
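As a toy illustration of possibility (1), here is a one-parameter caricature (my own construction, not TAR itself): when the elicitation loss is convex, every starting weight the defender could pick gets pulled to the capable solution, so no fine-tune-resistant parameter exists at all.

```python
# Toy illustration (my construction, not TAR): "tamper resistance" against
# SFT on a 1-parameter quadratic model. The attacker fine-tunes w toward
# the capable solution w = 1; the defender searches over starting weights
# w0 hoping that post-attack capability stays low.

def sft_attack(w, lr=0.1, steps=200):
    """Inner loop: plain gradient descent on the elicitation loss (w - 1)^2."""
    for _ in range(steps):
        w -= lr * 2 * (w - 1)
    return w

def capability(w):
    return 1 - (w - 1) ** 2  # peaks at the capable solution w = 1

# Outer loop: defender scans starting weights in [-5, 5] for the one that
# leaves the least capability after the attack.
best_w0, best_post = None, None
for w0 in [i / 10 - 5 for i in range(101)]:
    post = capability(sft_attack(w0))
    if best_post is None or post < best_post:
        best_post, best_w0 = post, w0

# For this convex attack loss, every starting point converges to w = 1,
# so even the defender's best choice ends up fully elicited.
print(round(best_post, 6))
```

Real networks are of course non-convex, which is exactly why possibility (2), that resistant parameters exist but TAR cannot find them, remains live; the toy only shows the shape of the question, not its answer.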

Some caveats:

  • Mesa-optimisers might only emerge earlier during training, and tamper resistance approaches are generally post-training. Deep Ignorance results in much more persistent tamper resistance and is a pretraining intervention, but just filtering out relevant training data doesn’t seem like a valid way to create a mesa-optimiser.
  • TR methods might not be very good. It seems like they’re not as good as they will be in the future, but meta-learning seems like a major affordance.

The action space available to a GHer (gradient hacker) is a subset of the action space available to a TRer (tamper resistant model developer): a GHer operates only on the model representations and outputs[2], whereas a TRer can intervene on these, as well as modifying the model training process or any other sort of intervention they can imagine. That is, wherever we draw the boundaries of the GHing agent, the TRer can choose the same boundaries. Note that we have already restricted the GHer to a human-level AI, and assume it is not hugely more intelligent than the TRers; however, tools like meta-learning provide a substantial advantage to the TRers.

Generalised lessons:
Our threat model here is narrow, and these results do not apply to any of the many other training protocols that are used on modern LLMs. However, in this narrow case we think that this evidence suggests gradient hacking is hard, rather than just unlikely for the optimiser to stumble upon. We also think tamper resistance methods could be valuable in constructing model organisms for other forms of training hacking.

 

Thanks to Mosh Levy, Francis Rhys Ward, Geoffrey Irving, and Marie Buhl for their helpful input on this post.

  1. ^

     In its original formulation, gradient hacking is where a mesa-optimiser manipulates training to preserve its mesa-objectives; other definitions take gradient hacking more broadly, as a model choosing certain outputs so that parameter updates are in line with its objectives. Ours is an example of the latter definition but not the former, so we use the latter throughout this post.

  2. ^

    If the GHer is ‘embedded’/‘embodied’ in the model weights, then restricting its ‘actions’ to at most affecting the tokens, logits, or activations seems like a minimal assumption. This is similar to assuming that humans’ action spaces are limited to their thoughts and motor outputs.




A Quick Intro to Ring Signatures

2026-03-12 02:16:06

I was going to post this in a month or two, but I received a frantic request for information about the subject last week, driven by recent events, so here we go. None of the below was generated by an AI, I wrote all of it myself except where otherwise indicated.
 

I’ll start with a quick refresher on asymmetric key cryptography. (Here is a 6-minute Rob Miles video if you prefer.)

You generate a random number out of a very large range of possible numbers, and that’s your private key (AKA your secret key). A cryptographic function is run on your private key to produce another number, which will be your public key (AKA shared key). It is supposed to be impossible or at least impractical for adversaries to reverse-engineer your private key from your public key.

You take your private key and a message you want to sign, run the signing function on both of them together, and the result is that same message plus a long string of random-looking characters. That’s the signed message, and anyone who knows your public key can mathematically prove that the message must have been signed by the secret key corresponding to your public key. Your secret key is not revealed at any point. (Separately, someone can also encrypt a message using your public key, and it can only be decrypted with your secret key.)
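To make the sign/verify round trip concrete, here is a textbook RSA sketch with deliberately tiny, illustrative parameters. Real systems use 2048-bit-plus moduli and padding schemes; this is for intuition only, never for actual use.

```python
import hashlib

# Textbook RSA signing with toy parameters -- for intuition only.
p, q = 61, 53
n = p * q            # 3233; part of the public key
e = 17               # public exponent; (n, e) is the public key
d = 2753             # private exponent: (e * d) % lcm(p - 1, q - 1) == 1

def digest(msg):
    # Hash the message and reduce it into the signing range [0, n).
    return int(hashlib.sha256(msg).hexdigest(), 16) % n

def sign(msg):
    return pow(digest(msg), d, n)         # uses the private key d

def verify(msg, sig):
    return pow(sig, e, n) == digest(msg)  # uses only the public key

sig = sign(b"I resign from the cabinet.")
assert verify(b"I resign from the cabinet.", sig)
```

The point to notice is the asymmetry: `sign` needs `d`, which only you hold, while `verify` needs only the public pair `(n, e)` that everyone can see.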

 

There are other types of signing schemes that serve different use cases. The one I want to tell you about is called a ring signature and it was first described in a 2001 cryptography paper by Rivest, Shamir, and Tauman, called “How to Leak a Secret.”

The paper begins:

The general notion of a group signature scheme was introduced in 1991 by Chaum and van Heyst. In such a scheme, a trusted group manager predefines certain groups of users and distributes specially designed keys to their members. Individual members can then use these keys to anonymously sign messages on behalf of their group. The signatures produced by different group members look indistinguishable to their verifiers, but not to the group manager who can revoke the anonymity of misbehaving signers. In this paper we formalize the related notion of ring signature schemes. These are simplified group signature schemes which have only users and no managers (we call such signatures “ring signatures” instead of “group signatures” since rings are geometric regions with uniform periphery and no center). Group signatures are useful when the members want to cooperate, while ring signatures are useful when the members do not want to cooperate. 

[Emphasis added. Let me emphasize further: if you take nothing else from this post, at least understand this major difference between group signatures and ring signatures. Unlike group signatures, ring signatures are imposed unilaterally and irrevocably on unconsenting peers by an unknown author. Even security professionals that I’ve talked to lose track of this distinction, so I’m braced for some confusion in the comments.]

In the paper, the authors give a vignette about Bob, a member of the cabinet of Lower Kryptonia, who alleges to a journalist that the prime minister has committed some kind of misconduct. If Bob sends the allegations as an anonymous tip, the journalist may have no reason to find the allegations credible. Whereas if Bob signs the allegation with a normal digital signature, the allegation carries his full authority...but then allies of the Prime Minister might use bribes, hacks, or threats to get Bob’s name from the journalist in order to retaliate.

Instead, Bob ring-signs the allegations, using his own private key plus the public keys of several other cabinet members, and then gives that ring-signed message to the journalist via some anonymous channel. The journalist then has good reason to believe that the information did come from a cabinet member, but everyone can only guess at which one it was; there is no known way to cryptographically prove it. To summarize the benefit of ring signatures in a single sentence, I would say that they allow a user to make fine-grained trades between their anonymity and their credibility.
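To make Bob's move concrete, here is a toy Schnorr-style ring signature (the AOS variant, not the original RSA-based construction from the paper) over a deliberately tiny group. The parameters P, Q, G and all sizes are illustrative and nowhere near secure; note that only the true signer's private key appears anywhere, and the non-signers' contributions are forged with random values.

```python
import hashlib
import secrets

# Toy AOS-style ring signature over a small prime-order group.
# Parameters are illustrative only -- far too small for real security.
P = 2039            # safe prime: P = 2*Q + 1
Q = 1019            # prime order of the subgroup of squares mod P
G = 4               # generator of the order-Q subgroup

def H(*parts):
    h = hashlib.sha256("|".join(map(str, parts)).encode()).hexdigest()
    return int(h, 16) % Q

def keygen():
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)          # (private, public)

def ring_sign(msg, pubs, signer_idx, priv):
    n = len(pubs)
    z = [0] * n
    c = [0] * n
    alpha = secrets.randbelow(Q - 1) + 1
    c[(signer_idx + 1) % n] = H(msg, pow(G, alpha, P))
    # Walk the ring, forging each non-signer's response with a random z.
    i = (signer_idx + 1) % n
    while i != signer_idx:
        z[i] = secrets.randbelow(Q - 1) + 1
        commit = (pow(G, z[i], P) * pow(pubs[i], c[i], P)) % P
        c[(i + 1) % n] = H(msg, commit)
        i = (i + 1) % n
    # Close the ring: only this step uses the signer's private key.
    z[signer_idx] = (alpha - c[signer_idx] * priv) % Q
    return c[0], z

def ring_verify(msg, pubs, sig):
    c0, z = sig
    c = c0
    for i in range(len(pubs)):
        commit = (pow(G, z[i], P) * pow(pubs[i], c, P)) % P
        c = H(msg, commit)
    return c == c0
```

A verifier re-walks the whole ring and checks that it closes; every index plays the same algebraic role, which is why the signature reveals that *someone* in the ring signed without revealing who.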

The reason Bob’s exact scenario hasn’t played out in the real world is that we all ended up relying on services like Gmail and Facebook to manage our account security for us instead of handling our own keys. That’s why you can usually just click “Forgot password?” if you lock yourself out of one of your accounts. This paradigm is safe and convenient in a narrow sense, but it has plenty of downsides, including, in my opinion, closing the door on clever arrangements like ring signatures.

In the next post I’ll explain the current state of ring signatures, and what tools are available now for those who are interested.

 

Definitions

(Adapted from the paper)

  • Ring: the set of all public keys included in a particular ring signature. Alternately the anonymity set, or the alleged signers.
  • Signer: the one who actually did the signing. AKA true signer, author, or message writer.
  • Non-signer: every other identity named in the ring.

Questions & Answers

Q: Why would I want to join someone’s ring? What’s in it for me?

A: Remember, rings are created unilaterally. You cannot join someone else’s ring. Your signature is forged; only the true signer's is contributed voluntarily. The only way to guarantee that you won’t be implicated in a ring signature someday is to refrain from using services that require you to share a public key.
 

Q: Can't a reader infer the true signer based on the content of the message?

A: Yes. If Bob has been complaining too often about the prime minister’s secret malfeasance, and then he anonymously makes allegations inside a ring-signed message, then the PM will identify Bob as the top suspect and may retaliate accordingly. (Of course, the PM may instead suspect that a different cabinet member is trying to frame Bob. It’s hard to contrive a truly straightforward threat model even in a thought experiment, but hopefully you get the point.) Also, the true signer may potentially be inferred by subtle quirks of writing style.

 

Q: Can’t this be used to start petty interpersonal drama?

A: Yes. But it can also be used to prevent or resolve interpersonal drama. Importantly, people do not always agree on which interpersonal conflicts are petty drama and which are serious problems.
 

Q: Didn’t Bostrom, Douglas, and Sandberg curse unilateral actions? And since creating a ring is done without permission, won’t it be subject to that curse?

A: Presumably yes, but the scheme tends to promote transparency and honesty, so it should cause net counterfactual harms only in situations where both (1) increased transparency and honesty are harmful, and (2) fully anonymous messages cannot already cause those same harms. The worst I can think of is a rogue employee deciding to leak secrets that are costly to verify. For example, a complex and expensive secret formula that, if posted fully anonymously, would not be worth the ingredient cost for a competitor to test out. (I can imagine ways employers could mitigate this risk without also undercutting whistleblowers, but that gets even more speculative, so I’ll omit them here.)

 

Q: Can I make a ring signature that has more than one true signer?

A: Theoretically yes, that is called a k-of-n threshold ring signature. Those would be very useful if there were a safe and convenient way to create them. Unfortunately, it seems that they intrinsically carry additional risks unless the true signers all coordinate. I’m not aware of threshold ring signatures being used for any real-world application.
 

Q: Is there a way to make it so that readers know that two different messages signed with the same ring came from the same person in that ring?

A: Yes, there are at least two ways. The easier option is to have the message itself contain an endorsement of a pseudonymous social media account and then just send subsequent messages from that account. The harder option is to read up on “linked ring signature” schemes and try to implement it without risking your anonymity, which is nontrivial. Linked ring signatures can also be used for secret ballots, but the setup requirements for that strike me as inelegant and risky. For example, I think an attacker could force you to send them a message ring-linked to your vote in order to find out how you voted. This is related to why threshold ring signatures require coordination. I’ll be surprised if there’s not a better scheme out there for secret ballots.


 




Martian Interpretability Challenge: The Core Problems In Interpretability

2026-03-12 01:41:50

TL;DR: Interpretability today often fails on four fronts: it’s not truly mechanistic (more correlation than causal explanation), not useful in real engineering/safety workflows, incomplete (narrow wins that don’t generalize), and doesn’t scale to frontier models. Martian’s $1M prize targets progress on those gaps—especially via strong benchmarks, generalization across models, and institution/policy-relevant tools—with a focus on code generation because code is testable, traceable, and a high-impact real-world setting for validating mechanistic explanations.

In Part 1, we discussed why we’re announcing a $1M prize for work in interpretability, focused on code generation. In this post, we’ll go into the four core problems we think are most important for the field, some approaches we’re excited about to tackle these problems, and why we think code generation is the right setting to tackle both.

The Biggest Problems In Interpretability

We’re optimists on interpretability (otherwise we wouldn’t be building an interp company!), but if you want to identify the biggest gaps in your field it’s important to take the skeptical lens.

A pessimist or skeptic could reasonably say interpretability today is:

  • Not generalizable.
    Much of what passes for interpretability is post-hoc pattern-matching rather than robust explanation. Work on saliency and feature visualization has repeatedly shown that attractive-looking explanations can be independent of the actual model parameters or fragile under input shifts ([1],[2]). In language models, purported “concept neurons” often dissolve under broader evaluation, revealing polysemantic, basis-dependent behavior rather than clean mechanisms ([3], [4]). Even more sophisticated causal-probe and intervention pipelines can be pushed off-manifold or give false positives, confirming stories that fit the data but not the computation ([5], [6], [7]). From this perspective, a lot of current MI work is still correlational: it tells just-so stories in a convenient basis, rather than isolating the actual algorithms the model runs. This matters because the advantage of explanation over prediction is generalization. You don't need Newton's laws to build a bridge, but you won't build combustion engines just by building more bridges. If your explanations don't generalize beyond the bridges you based them on, they're not much better than black-box methods. Most current MI findings don't verifiably extend beyond the narrow distributions they were tested on.
  • Useless.
    Despite years of effort, interpretability has yielded few tools that engineers actually rely on to build or secure real systems. Benchmarking work often finds that sophisticated representation-level methods underperform simple baselines on practical tasks like steering or detecting harmful intent ([9], [10]), and DeepMind’s recent negative result on using SAEs for safety-relevant classification concluded that SAEs underperform a linear probe and have been deprioritized as a consequence ([11]). More broadly, some critics ([12],[3]) argue that if our methods don’t materially improve control, robustness, or assurance in deployed systems, they are at best scientific curiosities and at worst a distraction from more direct levers on model behavior.
  • Incomplete.
    Even where interpretability has succeeded, the wins are narrow. The famous GPT-2 Indirect Object Identification circuit explains a carefully circumscribed template task, but not pronoun resolution in realistic text ([13]). Circuit analyses in larger models, such as Chinchilla’s multiple-choice capabilities, recover only partial stories that break under modest task variations ([14]). SAE-based work has uncovered many interesting features, but also strong evidence that these features are not canonical building blocks: different SAE configurations carve up representation space differently, miss many concepts, and sometimes learn “warped” directions that don’t match any clean human ontology ([11], [15], [16]). At a higher level, commentators argue that the field still lacks consensus on basic questions like what counts as a mechanism, a feature, or a satisfactory explanation ([17]). The result is a patchwork of partial maps, with no clear path to a complete theory of how modern models work.
  • Doesn’t scale.
    Many of the most impressive MI results look fundamentally unscalable. Detailed circuit analyses require enormous human effort even for tiny models and toy tasks, and attempts to push similar techniques to frontier models have produced fragmentary explanations at great cost ([14], [5]). Dictionary-learning approaches like SAEs scale only by throwing massive engineering and compute at the problem: OpenAI, Anthropic, and others report training SAEs with millions to tens of millions of features just to partially cover a single model’s activations ([18], [19], [20]). Even then, coverage is incomplete and the resulting feature sets are difficult to validate. The empirical frontier is racing ahead—with tool use, memory, agents—while MI remains focused on individual transformers and hand-analyzed neurons. Take the example of reasoning models. Most MI methods assume cognition happens in a single forward pass. Reasoning models break this assumption: the behavior emerges from an ensemble of samples, and the 'algorithm' is spread across the sampling procedure, not localized in activations. Fully understanding the transformer tells you what each sample does, but not how the system thinks. Tool use and other interactions with the environment are also messy and non-differentiable. These interactions often involve hard to reproduce environments like working on someone’s computer. The core worry is that interpretability won’t keep up with the size, complexity, and speed of modern AI systems.

Each of these can also be framed as a question:

  • How do we ensure our methods are generalizable?
  • How do we ensure our methods are useful?
  • How do we ensure our methods are complete?
  • How do we ensure our methods scale?

These are, in short, what we think of as the most important problems. They’re hard but tractable. We’re working on them inside Martian, and we have work we’re excited to share in coming months. In the external work we support, grants and prizes will be awarded based on how much progress each paper could make on these problems.

Approaches We’re Excited About

Below, we outline some of our thoughts on what fruitful approaches could look like. These are merely suggestions though – what we care about are the problems!

Strong Benchmarks for Interpretability

Scientific fields mature when they converge on good benchmarks. Right now, interpretability is missing that shared backbone ([23]). Many results are evaluated relative to other interpretability methods, on hand-picked examples, or via qualitative “this looks right to me” inspection  ([1], [4]). That makes it hard to tell whether a new method is actually better, or just telling a more compelling story.

We would like to see benchmarks that measure interpretability methods against ground truth, practical impact, or both. For example, in settings where we know the underlying mechanism (synthetic tasks, hand-constructed models, or instrumented training setups), we can ask whether a method recovers the right causal structure, not just whether it produces plausible artifacts. In more realistic settings, we can ask whether using a method improves downstream outcomes: debugging success, safety performance, policy decisions, or human-model collaboration.

We’re especially wary of “grading on a curve,” where methods are only compared to other (mechanistic) interpretability baselines. If all of the baselines are weak, progress can look impressive while remaining practically irrelevant. Strong benchmarks should force head-to-head comparisons with simple, competitive baselines [9] (e.g. linear probes, prompting tricks, finetunes, reinforcement learning, or data/architecture changes) and should quantify where MI actually buys us leverage. Interpretability work is successful here only if it provides the tools that work the best, with no exceptions.

For this prize, we’re excited about work that:

  • Directly addresses “uselessness” (do these tools help anyone do anything?)
     
    • Evaluates methods on their ability to improve practical tasks (e.g. finding backdoors, reducing harmful behavior, catching specification gaming, improving robustness, aiding audits). This is likely to come from a combination of two different kinds of tasks. The first involves studying model organisms (think “the model has secret knowledge X implanted, can you recover it?”) ([8], [25]). The second involves real-world tasks people are already tackling with blackbox methods, but where we can realistically see gains from whitebox ones.
    • Builds careful distributional splits – across multiple domains, different programming languages, different human languages, or other axes along which mechanisms should be able to generalize in the real world
    • Includes “regression” tests. Real-world applications can’t afford changes that make their product fail where it was previously succeeding. If we don’t test that interpretability interventions leave untouched the behaviors they aren’t meant to change, our methods won’t be used in production.
  • Moves the field towards more generalizable, complete, and scalable approaches
     
    • Designs benchmarks where there is a known or controllable ground-truth mechanism, and evaluates whether interpretability methods recover it.
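The “regression test” idea above can be sketched in a few lines. This is a hypothetical harness, not a standard API: `regression_check`, the report format, and the toy models are all our illustrative choices. It scores an intervention on both what it should change and what it must not:

```python
def regression_check(model, edited_model, target_cases, holdout_cases):
    """Score an intervention on (a) fixing the targeted behavior and
    (b) leaving held-out behavior alone. Any callables work here."""
    fixed = sum(edited_model(x) == want for x, want in target_cases)
    preserved = sum(edited_model(x) == model(x) for x in holdout_cases)
    return {
        "target_fixed": fixed / len(target_cases),
        "holdout_preserved": preserved / len(holdout_cases),
    }

# Toy demo: a "model" that labels odd numbers, with a bug at 0,
# and a patched version that is meant to fix only that bug.
buggy = lambda n: True if n == 0 else (n % 2 == 1)
patched = lambda n: n % 2 == 1

report = regression_check(
    buggy, patched,
    target_cases=[(0, False)],
    holdout_cases=[1, 2, 3, 4, 5],
)
print(report)  # {'target_fixed': 1.0, 'holdout_preserved': 1.0}
```

A benchmark built this way forces interventions to report both numbers, rather than only the success rate on the cases they were designed for.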

‍Mathematical Foundations for Interpretability

A lot of the most productive work in interpretability has followed from relatively simple but powerful mathematical pictures. Causality gave us a language for interventions and counterfactuals [26], [27]. Superposition gave us a way to think about entangled features and why neurons end up polysemantic [21], [28], [29], [30]. In each case, a clean mathematical abstraction unlocked many concrete experiments and methods.

We think the field is still severely underinvested in this kind of foundational work. We don’t have crisp answers to basic questions like: What does it mean, formally, for an explanation to be “correct” rather than correlational? What is the right unit of analysis in a large model (neurons, features, circuits, modules, something else)? When are two explanations “the same” at an appropriate level of abstraction? In practice, many current methods implicitly assume answers to these questions, but our tools and evaluation standards aren’t grounded in a shared theory.

There is also a real risk that we have fixated on the wrong mathematical toy problems. For example, much recent work is framed around superposition and feature disentanglement, but it is not clear that “undoing superposition” is the core obstacle to understanding modern models. Other candidate difficulties—such as the choice of basis, the geometry of representation manifolds, or the interaction between optimization and architecture—may matter more, and may require different tools.
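As one concrete anchor for the superposition discussion above, a few lines of numpy show why “more features than dimensions” is even possible: random directions in high-dimensional space are nearly orthogonal, so many features can share a small space with limited interference. The sizes below are arbitrary illustrative choices:

```python
import numpy as np

# 512 "features" packed into a 64-dimensional space: random unit directions
# interfere only weakly, which is the basic geometric fact behind superposition.
rng = np.random.default_rng(0)
n_features, d_model = 512, 64
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm feature directions

cos = W @ W.T                                   # pairwise cosine similarities
off_diag = np.abs(cos[~np.eye(n_features, dtype=bool)])
print(f"mean interference: {off_diag.mean():.3f}")  # small, but not zero
print(f"max interference:  {off_diag.max():.3f}")
```

The interference is small but nonzero, which is exactly why individual neurons end up polysemantic; whether undoing this is the central obstacle, as the paragraph above asks, is a separate question.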

Internally, we’re exploring several possible foundations: kernel methods and category-theoretic frameworks for what counts as a mechanism or explanation; differential-geometric views of representation manifolds and why local linear approximations break down; and random matrix and linear operator theories that might let us generalize mechanistic tools and automate them in high dimensions. These are examples, not prescriptions.

For the prize, we’re excited about any work that:

  • Proposes clear mathematical definitions of faithfulness, feature quality, or abstraction and connects them to concrete methods or benchmarks.
  • Identifies the “right” primitives (features, modules, causal variables, etc.) for understanding real models, and shows how to work with them.
  • Uses tools from other areas of math or ML (statistics, physics, information theory, dynamical systems, algebraic/topological tools, etc.) to explain why interpretability is hard—and how to make it easier.

If successful, this line of work should move us closer to methods that are genuinely mechanistic (not just correlational), that have a path to completeness (we know what’s missing), and that scale because they are grounded in the underlying structure of learning systems rather than ad hoc tricks.

Generalization Across Models

Many of today’s most compelling interpretability results are one-offs: a single circuit in a single layer of a single model, or a feature dictionary trained on a particular checkpoint. That is scientifically interesting, but it’s not yet a general theory of how models work. We’d like to see interpretability methods, abstractions, and findings that survive changes in architecture, scale, training data, and domain.

Generalization can mean several things. It might mean that a discovered circuit or feature family recurs in larger or differently-trained models. It might mean that an interpretability method finds analogous structure in both vision and language models. It might mean that we can automatically locate “the same” mechanism in a new model, given a description or example from an old one. It might mean that we can learn and reuse higher-level abstractions rather than re-discovering everything from scratch.

We’re especially interested in work that moves beyond small, clean, toy tasks to more realistic settings without losing mechanistic clarity. That might involve new units of analysis (modules, subsystems, learned tools), new training schemes that encourage shared abstractions across models, or new methods for matching and comparing representations.

For this prize, promising directions include:

  • Empirical work showing that particular circuits, features, or abstractions recur across models, scales, or domains—and identifying what drives that recurrence.
  • Methods for automatically transferring interpretability results from one model to another (e.g. “find the IOI-like circuit in this larger model”).
  • Frameworks for defining when two mechanisms are “the same” up to abstraction, and algorithms for detecting such equivalences.
  • Studies that explicitly measure how well interpretability findings on toy models predict behavior in larger, more realistic models.
  • Tools that increasingly automate interpretability, minimizing reliance on human effort.

Generalization is where “mechanistic,” “complete,” and “scalable” meet: if we can discover explanations that transfer across models and settings, we’re starting to learn something about the underlying space of algorithms these systems converge to.

Interpretability as a Policy/Institution Lever

We don’t think the main impact of interpretability will be a handful of researchers staring at neuron visualizations. The bigger opportunity is institutional: giving regulators, boards, red teams, and operational safety teams tools that change what kinds of AI governance are feasible.

Interpretability tools could, for example, make it possible to build interpretable analog models that track a powerful system’s behavior on safety-relevant dimensions; to transfer safety properties between models via activation transfer and circuit-level alignment; or to design arms-control-style measures where parties can credibly demonstrate that models lack certain dangerous mechanisms or capabilities without revealing proprietary details. Interpretability could also expand the Pareto frontier between safety and innovation by making it cheaper to monitor and mitigate risky behaviors, rather than relying only on blunt constraints.

We take seriously the view (e.g. [24]) that many safety problems are fundamentally about institutions, incentives, and governance, not just algorithms. From that perspective, MI is useful to the extent that it enables new forms of oversight, auditing, liability assignment, and norm-setting: it lets institutions ask better questions of models and developers, and verify answers with more than “trust us.”

For the prize, we’re interested in work that:

  • Shows how interpretability signals can support concrete policy tools: e.g. risk audits, model cards, licensing regimes, or capability evaluations.
  • Develops techniques for activation transfer, model-to-model supervision, or analog models that can be realistically used in deployment or evaluation pipelines.
  • Proposes protocols where interpretability plays a role analogous to verification in arms-control: limited, but enough to support cooperative equilibria.
  • Explores how interpretability outputs can be made legible to non-experts (regulators, executives, internal risk committees) without losing faithfulness.


This agenda squarely targets the “usefulness” question: if interpretability is going to matter in the real world, it needs to show up in the levers that institutions actually pull.

Why Codegen

Code has several properties that make it unusually friendly to mechanistic work. It has formal semantics and an execution trace; we can run programs, inspect intermediate state, and often characterize “correctness” crisply. Many algorithmic tasks in code are well-understood mathematically. This makes it easier to pose precise questions like “what algorithm is the model implementing here?” and to test whether a proposed mechanism really explains behavior, not just correlates with it.
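The “formal semantics and an execution trace” point is easy to make concrete. The sketch below is our illustration, using Python’s standard `sys.settrace` hook: it records every line a function executes along with its local variables, the kind of ground truth about “what algorithm is being run” that natural-language tasks don’t offer.

```python
import sys

def trace_locals(fn, *args):
    """Run fn(*args), recording each executed line of fn and the local
    variables at that moment."""
    records = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            records.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, records

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

result, trace = trace_locals(gcd, 48, 18)
print("result:", result)              # result: 6
print("lines recorded:", len(trace))
```

Given a trace like this, a claim such as “the model is implementing Euclid’s algorithm here” becomes testable against the program’s actual intermediate states, rather than against intuition.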

At the same time, modern code models are at the center of how LLMs are used in practice: as coding assistants, tool-using agents, autonomous refactoring systems, and more. They plan, call tools, read and write large codebases, and increasingly act as components in larger agent systems. That makes them a natural testbed for interpretability methods that aim to handle agentic behavior, tool use, and multi-step reasoning in realistic environments—not just isolated feed-forward tasks.

We’re also excited about a deeper connection: program synthesis and mechanistic interpretability are, in a sense, inverse problems. Program synthesis tries to go from desired behavior to code; mechanistic interpretability tries to go from code-like behavior to a description of the internal “program.” Insights in one direction may feed the other: understanding how models represent and manipulate programs may help us understand their own internal “programs,” and vice versa.

For this prize, we’re particularly interested in:

  • Mechanistic studies of code models on non-toy tasks: refactoring, multi-file edits, tool-augmented debugging, repository-level changes, etc.
  • Methods that use the structure of code (types, control flow, data flow, test suites, execution traces) as scaffolding for mechanistic explanations.
  • Work that links internal circuits or features to high-level software engineering concepts: APIs, invariants, design patterns, security properties, or bug classes.
  • Interpretability techniques that demonstrably improve code-generation systems in practice: making them safer, more robust, easier to supervise, or more predictable in deployment.
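As a small example of the “structure of code as scaffolding” idea in the list above, here is a sketch that uses Python’s standard `ast` module to extract ground-truth structural facts about a program; the particular feature set is our arbitrary choice. Labels like these are the sort of thing one could then try to locate in a code model’s internals:

```python
import ast

def control_flow_summary(source):
    """Count a few structural features of a program: ground-truth labels
    that can anchor claims about what a code model represents."""
    nodes = list(ast.walk(ast.parse(source)))
    return {
        "branches": sum(isinstance(n, ast.If) for n in nodes),
        "loops": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        "calls": sum(isinstance(n, ast.Call) for n in nodes),
    }

src = """
def f(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += abs(x)
    return total
"""
print(control_flow_summary(src))  # {'branches': 1, 'loops': 1, 'calls': 1}
```

Because these labels come from the program itself rather than from human annotation, they scale to arbitrarily large code corpora for free.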

If interpretability can show clear, actionable wins in code generation—currently the largest industrial use case for LLMs—that will be a strong signal that the field is on a useful track, and a powerful engine for further investment and progress.

Thanks to Ana Marasović, Hadas Orgad, Jeff Phillips, John Hewitt, Leo Gao, Mor Geva, Neel Nanda, Stephen Casper, and Yonatan Belinkov for reviewing this RFP and providing feedback.



Discuss

Concussion Treatments

2026-03-12 01:00:39

Last week I hit my head on a car door frame getting into the car at a gas station. There were no dramatic symptoms at first and barely any pain, but the next day I couldn’t look at my phone for more than five minutes without getting a headache. It was clear I’d given myself a concussion, the second in ten months.

I’m a week in now, resting and recovering, but I sadly had to admit that I wasn’t going to hit my new, lowered target of just one blog post a month while I finish the book. Then it hit me: I could use Claude to do some research into concussions and write something short about that!

So I asked Claude to do some research into concussion recovery. Specifically, whether there’s anything useful I can do beyond the standard advice of “rest”. I already sleep and meditate, and I’ve been hearing more about psychedelics as treatments for brain injuries, so I had Claude do a deep literature review on all three as concussion interventions. We focused on psilocybin because it’s the psychedelic with the most research about concussions available. The full report is here. Here’s what came out of it.

The three interventions are complementary, not redundant. All three reduce brain inflammation after injury, but through different biological mechanisms. Hitting the same problem from three independent angles is a well-established principle in pharmacology, and it tends to work better than hitting it from one angle three times as hard.

Each one is best at a different thing. Sleep drives the brain’s waste clearance system. During sleep the spaces between brain cells expand by about 60%, allowing fluid to flush out damaged proteins and metabolic debris, and nothing else does this. Psilocybin provides the most potent signal for growing new neural connections, with a single dose producing structural changes lasting over a month in mice. Meditation offers the best-evidenced stress and immune regulation, creating the upstream conditions that let the other repair processes work.

Psilocybin works around a problem the other two can’t. After brain injury, inflammation hijacks the raw materials your brain uses to make serotonin and diverts them toward toxic byproducts instead. This depletes serotonin while simultaneously causing further damage. Because psilocybin is chemically similar to serotonin, it can activate serotonin receptors directly, bypassing the broken supply chain entirely. No endogenous process can do this.

But the evidence is profoundly asymmetric. Sleep has robust clinical data and an irreplaceable biological role. Meditation has one concussion-specific meta-analysis showing moderate benefit and near-zero risk. And psilocybin has zero completed human trials in brain injury populations, with the strongest direct evidence being a single rat study.

So the actionable takeaways are, sadly, anticlimactic. Prioritize sleep above everything, and not just “sleep more” but actively protect it, because the injury itself disrupts the very sleep needed for repair. Meditate if you already do and consider starting if you don’t. And probably don’t take psilocybin for a concussion yet, because the biology is exciting but there’s no human data and there are unknowns around safety in an injured brain.

The full research report with citations is here.



Discuss

How Hard a Problem is Alignment?

2026-03-12 00:47:22

Epistemic status: We really need to know. (I also posted an opinionated answer.)

 

There’s a well-known diagram from a tweet by Chris Olah of Anthropic:

It would be marvelous to know what the actual difficulty is, out of those five labeled difficulty categories (ideally, exactly where it lies on that spectrum). This is a major crux that explains a large part of the very wide variation in P(doom) found across experts in the field. Clear evidence that Alignment is something like an Apollo-sized problem would strongly motivate dramatically increased funding and emphasis for AI Safety research (Apollo cost roughly $200 billion in current money, i.e. only 40% of OpenAI’s current valuation: expensive, but entirely affordable if needed to safely build ASI). Clear evidence that it’s more like P vs NP or impossible would be a smoking-gun proof that enforcing an AI pause before AGI or ASI is the only rational course. This is the question for the near-term survival of our species (so no pressure!)

I think this is a vital discussion. I’m also going to link-post below to my own (rather long) opinion, which is a separate post, and also to a few other existing resources which are basically other people’s attempts to answer this question.

The first three labels on Chris’s diagram are pretty self-explanatory: the only interesting question is whether “Steam Engine” means just steam engine safety work (which would make the scale more logarithmic, and also might make doing a progress-so-far comparison more natural), or also covers what one might call “steam engine capabilities” work, since those are pretty separable.

For rocketry I think it’s a lot more difficult to separate getting there and back with only a 10% fatality rate (as Apollo did) from getting there and back at all, so rather than trying to separate just safety, I think it makes the most sense to do a comparison to all of rocketry (so as to begin at the beginning) that led up to the Apollo program (probably excluding the parallel Russian program, or the more military-specific aspects of various other programs). However, the Apollo program itself was so enormous that when to start from is rather a small quibble.

P vs NP

We’ve only spent about 3,000 to 6,000 person-years on P vs NP so far, so it’s still quite plausible that it will be proven (or disproven) in a lot fewer person-years of effort than the roughly 3.5 million person-years that were spent on the Apollo program. However, unless we have ASI to help us, it’s still unlikely it will be solved any time soon, because it’s a far more abstract and challenging problem than Apollo engineering, so people competent to work on it and interested in doing so are very few and far between. Thus it’s taking a long time, because it’s hard in a more conceptual than detail-oriented way. Sadly for AI Alignment we’re currently on a short time limit, but then the problem is attracting increasing attention.

Eliezer Yudkowsky, the most famous proponent of high P(doom),[2] is on record that he doesn’t believe alignment is an insoluble problem (that’s item -1 in his 2022 List of Lethalities: he seems to think that it might take us of the order of a hundred years, if we actually managed not to kill ourselves in the process, and that if we somehow had access now to a textbook from a hundred years into that future then that might well be all we needed) —[3] so that presumably makes him and Nate Soares able advocates for the P vs NP viewpoint. I’d be absolutely delighted if they or anyone else wanted to chime in here for that viewpoint — otherwise I’ll take If Anyone Builds It, Everyone Dies as the lay-audience case for it.

Shades of Impossible

I think it might be useful to provide some more granularity on “Impossible”:

Mathematically Impossible

The Orthogonality Thesis clearly predicts that it’s not actually impossible for an aligned ASI to exist, so unless that’s wrong, an impossibility proof would have to be something like demonstrating that identifying or constructing an aligned ASI, even at a very low but finite risk level, was a worse-than-polynomially-hard problem (say in parameter count, or IQ). I’d be particularly interested to hear from anyone who genuinely thinks Alignment is impossible in this sense (not just hundreds or thousands of years’ work), if we’re willing to accept a very low but not zero risk level. (Of course, an actual impossibility proof is a higher bar than it actually being impossible but not provably so.)

Kardashev II

Another alternative is that alignment is not mathematically impossible-in-theory in the above sort of sense, but that it’s just vastly harder than any of the other four labeled categories — perhaps even to the level where it’s currently impossible-in-practice. If any sapient species around our current development level, even one that proceeded forward from this point with millennia of caution before finally creating ASI, would still have a negligible chance of success and a negligible chance of surviving the attempt, then alignment could reasonably be said to be impossible in practice. I’d similarly be very interested to hear arguments for this viewpoint. I’m nominating “Kardashev II”[4] as a name for this difficulty level: possible in theory, but something that we’re very clearly not going to manage any time soon.

Too Hard for Us

If Aligning AI is anywhere past Trivial, yet we rush ahead and build ASI anyway before we’re ready, then (unless we manage to luck into alignment-by-default) we’re likely to end up extinct or permanently disempowered. Even if we proceed cautiously, but then mess up our first critical try, we’re still all dead or enslaved. Obviously you can’t solve alignment if you’re dead.

Holding this viewpoint often says rather less about your opinion about how hard aligning AI actually is, and rather more about your opinion of the frailty and foolishness of humans and their institutions.

 

I would like to thank everyone who made suggestions, gave input, or commented on earlier drafts of this pair of posts, including (in alphabetical order): Hannes Whittingham, JasonB, Justin Dollman, Olivia Benoit, Scott Aaronson, Seth Herd, and the gears to ascension.


  1. ^

    I originally wanted to post this simply as a Question post plus my long, detailed Answer, as one of (hopefully) several answers; one of my draft commenters persuaded me that quite a few LessWrong / Alignment Forum readers typically don’t click on question posts, because they expect to find only quick-take answers rather than detailed extensively-researched answers.

    Nevertheless, I encourage detailed heavily researched answers to the Question version of this post, as well as quick takes.

  2. ^

    Eliezer has very carefully not publicly stated his current P(doom), but it’s generally agreed by people familiar with him and his writings that it must be over 90%, very likely over 95% — he has, after all, written a list of 43 different reasons why we’re doomed if we don’t pause AI, and co-authored a best-selling book on the subject titled If Anyone Builds It, Everyone Dies. When interviewed on the subject in 2023 he summarized this as:

    “I think it gets smarter than us, I think we’re not ready, I think we don’t know what we’re doing, and I think we’re all going to die.”

    He appears rather certain about this.

    However, as the title of the book he co-authored suggests, it’s more accurate to describe him as having a very high P(doom) conditional on no pause: he has given a TED talk in which he says that any not-DOOM credence he has is predicated on society doing something that he is publicly advocating for, but unfortunately doesn’t expect to happen on the current trajectory: enforceably pausing AI.

  3. ^

    For anyone curious, none of my Question or Answer were written by an LLM (though they were proofread by an LLM, and, as I mention in my answer, I did have them do some Fermi estimates and even delve into some deep web research for me). I have been overusing the m-dash since well before transformers were invented: it’s slightly old-fashioned, a little informal, and has useful differences in emphasis from the colon, the semi-colon, or the full stop. I considered giving up the m-dash when LLMs started copying me — and I rejected it. (I am also fond of parenthetical n-dashes for emphasis – and use them on occasion in my writing – though that seems to be less of a hot-button issue.)

  4. ^

    I am here intending this to describe a civilization capable of things like using gathered solar power to lift the contents of all the gas giants out of their gravity wells and then fusing most of the hydrogen and helium into elements more useful for building Dyson swarm computronium, such as carbon/oxygen/nitrogen, and then perhaps also doing some star-lifting: a mature Kardashev Type II civilization, not just one with a lot of solar panels orbiting the sun.



Discuss