LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

An Ambitious Vision for Interpretability

2025-12-06 06:57:56

Published on December 5, 2025 10:57 PM GMT

The goal of ambitious mechanistic interpretability (AMI) is to fully understand how neural networks work. While some have pivoted towards more pragmatic approaches, I think the reports of AMI’s death have been greatly exaggerated. The field of AMI has made plenty of progress towards finding increasingly simple and rigorously-faithful circuits, including our latest work on circuit sparsity. There are also many exciting inroads on the core problem waiting to be explored.

The value of understanding

Why try to understand things, if we can get more immediate value from less ambitious approaches? In my opinion, there are two main reasons.

First, mechanistic understanding can make it much easier to figure out what’s actually going on, especially when it’s hard to distinguish hypotheses using external behavior (e.g. if the model is scheming).

We can liken this to going from print statement debugging to using an actual debugger. Print statement debugging often requires many experiments, because each time you gain only a few bits of information which sketch a strange, confusing, and potentially misleading picture. When you start using the debugger, you suddenly notice all at once that you’re making a lot of incorrect assumptions you didn’t even realize you were making.

[Image: a typical debugging session.]

Second, since AGI will likely look very different from current models, we’d prefer to gain knowledge that applies beyond current models. This is one of the core difficulties of alignment that every alignment research agenda has to contend with.

The more you understand why your alignment approach works, the more likely it is to keep working in the future, or at least warn you before it fails. If you’re just whacking your model on the head, and it seems to work but you don’t really know why, then you really have no idea when it might suddenly stop working. If you’ve ever tried to fix broken software by toggling vaguely relevant sounding config options until it works again, you know how brittle this kind of fix can be. On the other hand, if you have a deep understanding of exactly what’s going on inside the model, you have a better sense of whether things are working for the right reasons, and which kinds of changes might break your alignment approach. And we already know that many pragmatic approaches will definitely break at some point, and are just hoping that they’ll last long enough.

AMI has good feedback loops

While open ended exploration is immensely valuable and disproportionately effective at producing breakthroughs, having good empirical feedback loops is also essential for the health of a field. If it’s difficult to measure how good your work is, it’s also difficult to make progress.

Thankfully, I think AMI has surprisingly good empirical feedback loops. We don’t have a watertight definition of “full understanding”, but we can measure progress in ways that we aren’t close to saturating. This isn’t so different from the state of metrics in capabilities research, which has undeniably made very rapid progress despite AGI being hard to define.

For example, I’m excited about progress on metrics based broadly on these core ideas:

  • Feature quality: can we show that features are human-understandable by finding explanations for when they activate, and showing that these explanations are correct by substituting model activations with explanation-simulated activations without degrading the performance of our model?
  • Circuit faithfulness: can we show that our circuits are truly necessary and sufficient for explaining the behavior of the model, by applying causal scrubbing (or a successor technique) without degrading the performance of our model?
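To make these concrete, here is a minimal sketch (mine, not the authors') of how both metrics could be scored for a toy PyTorch language model whose forward pass returns next-token logits. `explanation_sim` (a simulator that predicts a feature's activations from its natural-language explanation) and `circuit_mask` (the set of nodes claimed to form the circuit) are hypothetical stand-ins.

```python
import torch

def loss_with_hook(model, batch, layer, replace_fn):
    """Run `model` on `batch`, replacing `layer`'s output via `replace_fn`."""
    handle = layer.register_forward_hook(lambda mod, inp, out: replace_fn(out))
    try:
        logits = model(batch["input_ids"])
        loss = torch.nn.functional.cross_entropy(
            logits[:, :-1].flatten(0, 1), batch["input_ids"][:, 1:].flatten()
        )
    finally:
        handle.remove()
    return loss.item()

# Feature quality: swap real activations for explanation-simulated ones and
# measure how much the language-modeling loss degrades.
def feature_quality_gap(model, batch, layer, explanation_sim):
    base = loss_with_hook(model, batch, layer, lambda acts: acts)
    simulated = loss_with_hook(
        model, batch, layer, lambda acts: explanation_sim(batch, acts.shape)
    )
    return simulated - base  # near zero => the explanations carry the information

# Circuit faithfulness (a weak, single-layer form): mean-ablate everything outside
# the claimed circuit and check that the loss is preserved.
def faithfulness_gap(model, batch, layer, circuit_mask, mean_acts):
    def ablate(acts):
        return torch.where(circuit_mask, acts, mean_acts.to(acts.dtype))
    base = loss_with_hook(model, batch, layer, lambda acts: acts)
    ablated = loss_with_hook(model, batch, layer, ablate)
    return ablated - base
```

Causal scrubbing is considerably stronger than this single-layer ablation check, but the degradation-in-loss framing is the same.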

None of these criteria are watertight, but this isn’t the bottleneck for AMI right now. We haven’t broken all the interp metrics we can think up; instead, the strongest metrics we can come up with give a score of precisely zero for all extant interp techniques, so we have to use weaker metrics to make progress. Therefore, progress looks like pushing the frontier of using stronger and stronger interpretability metrics. This creates an incentive problem - nobody wants to write papers whose results seem no more impressive than previous circuits, even when they are held to a much higher standard of rigor. The solution here is social - as a field, we should value rigor in AMI papers to a much greater extent - rather than giving up entirely on rigorous ambitious interpretability.

Of course, as we make progress on these metrics, we will discover flaws in them. We will need to create new metrics, as well as stronger variants of existing metrics. But this is par for the course for a healthy research field.

The past and future of AMI

We’ve made a lot of progress on ambitious interpretability over the past few years, and we’re poised to make a lot more progress in the next few years. Just a few years ago, in 2022, the IOI paper found a very complex circuit for a simple model behavior. This circuit contains over two dozen entire attention heads, each consisting of 64 attention channels, and doesn’t even attempt to explain MLPs.

Today, our circuit sparsity approach finds circuits that are orders of magnitude simpler than the IOI circuit: we can explain behaviors of roughly similar complexity with only half a dozen attention channels, and a similar number of neurons.[1] We also use a slightly stronger notion of circuit faithfulness: we show that we can ablate all nodes outside our circuit using mean ablations from the entire pretraining distribution rather than the task distribution. The various activations are often extremely cleanly understandable, and the resulting circuits are often simple enough to fully understand with a day’s work.
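For concreteness, here is a small companion sketch (again mine, with the same toy-model assumptions as above) of the distinctive step in this stronger variant: estimating the ablation means from a sample of the pretraining distribution rather than the task distribution.

```python
import torch

@torch.no_grad()
def pretraining_mean_acts(model, layer, pretraining_batches):
    """Mean activation vector of `layer`, estimated over pretraining data."""
    captured = {}
    handle = layer.register_forward_hook(
        lambda mod, inp, out: captured.__setitem__("acts", out.detach())
    )
    total, count = None, 0
    try:
        for batch in pretraining_batches:   # batches drawn from pretraining data
            model(batch["input_ids"])
            acts = captured["acts"]         # shape (batch, seq, d)
            summed = acts.sum(dim=(0, 1))
            total = summed if total is None else total + summed
            count += acts.shape[0] * acts.shape[1]
    finally:
        handle.remove()
    return total / count                    # shape (d,)
```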

There are a lot of exciting future directions to take AMI. First, there are numerous tractable directions to build on our recent circuit sparsity work. Mean ablation on the pretraining distribution is weaker than full causal scrubbing, and randomly selected neurons or connections from the entire model are generally not nearly as interpretable as the ones in the specific circuits we isolate. If we design actually good interpretability metrics, and then hillclimb them, we could get to an interpretable GPT-1. It’s also plausible that circuit sparsity can be applied to understand only a small slice of an existing model; for example, by using bridges to tie the representations of the sparse model to the existing model on a very specific subdistribution, or by sparsifying only part of a network and using gradient routing to localize behavior.

Outside of circuit sparsity, there are a ton of other exciting directions as well. The circuit tracing agenda from Anthropic is another approach to AMI that trades off some amount of rigor for scalability to frontier models. Additionally, approaches similar to Jacobian SAEs seem like they could enforce the circuit sparsity constraint without needing to train new models from scratch, and approaches like SPD/APD provide a promising alternative approach for sparsifying interactions between weights and activations. Going further away from the circuit paradigm: SLT could offer a learning theoretic explanation of model generalization, and computational mechanics could explain the geometry of belief representations.

As we get better at rigorously understanding small models, we might find recurring motifs inside models, and refine our algorithms to be more efficient by taking advantage of this structure. If we had the interpretable GPT-1, studying it would teach us a lot about how to create the interpretable GPT-2. Therefore, even approaches like circuit sparsity which cannot themselves scale to frontier models can still be critical on the road to AMI, because they have extreme levels of expressivity that allow us to initially impose few preconceptions on what we find.


“Feel the AMI”

Fully understanding neural networks is not going to be easy. It might not even be possible. But the point of doing research is to make big bets with big payoffs, and it's hard to beat the payoff of ambitious mechinterp.


Thanks to Adrià Garriga-Alonso, Jesse Hoogland, Sashe Hydrie, Jack Lindsey, Jake Mendel, Jacob Merizian, Neel Nanda, Asher Parker-Sartori, Lucia Quirke, and Aidan Smith for comments on drafts of this post.

  1. ^

    One subtlety is that our work involves creating new models from scratch that are more interpretable, rather than interpreting existing models, which makes it somewhat harder to compare. However, I don’t think this completely invalidates the comparison, because I’d guess it’s tractable to extend our techniques to explain existing models.




Reasons to care about Canary Strings

2025-12-06 05:41:34

Published on December 5, 2025 9:41 PM GMT

This post is a follow-up to my recent post on evaluation paranoia and benchmark contamination in Gemini 3. There was a lot of interesting discussion about canary strings in the comments that I wanted to respond to more thoroughly and collect some different people's ideas on.

What are Canary Strings?

There was a time when people were worried that LLMs were not actually learning anything general, but instead just memorizing what was put into them.[1] Language and knowledge benchmarks, our main AI measurement tools at the time, were going into the training data, and it was unclear whether frontier models knew the answers to a math problem because they deduced it or because they were trained on the specific question.

To disincentivize training on benchmark data and to make it easier to avoid, researchers pioneered canary strings when attempting to go Beyond the Imitation Game in 2022. BIG-bench tasks and discussion of the benchmark were meant to contain a specific, unique hexadecimal string that model developers would filter out of their training. This would make the evaluation more valuable as a metric of useful skills, rather than just measuring memorization.

This caught on, and other benchmarks made their own canary strings so that developers would filter out their data too. Many of these canary strings were superstrings of the original BIG-bench canary, so that even if a benchmark never got any traction, or developers never thought about its specific canary string, it would still be filtered out and still be useful as a metric. The BIG-bench canary also showed up in blog posts about AI evaluations.
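As a minimal sketch of how this works in practice (the GUID below is a placeholder, not the real canary value): filtering on the original BIG-bench canary as a substring automatically catches any derived canary that contains it.

```python
# Placeholder values for illustration; the real canaries are long unique GUIDs.
BIG_BENCH_CANARY = "canary GUID 00000000-0000-0000-0000-000000000000"
DERIVED_CANARY = "my-benchmark canary " + BIG_BENCH_CANARY  # a superstring

def keep_for_training(document: str) -> bool:
    # One substring check covers the original canary and every superstring of it.
    return BIG_BENCH_CANARY not in document

docs = [
    "ordinary web text",
    f"benchmark task file ... {DERIVED_CANARY} ...",
    f"blog post discussing evals ... {BIG_BENCH_CANARY} ...",
]
print([keep_for_training(d) for d in docs])  # [True, False, False]
```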

Today 

It's unclear whether canary strings have worked. They showed up in GPT-4, Opus 3, and Sonnet 3.5. Now, they're reproduced by two more frontier models: Gemini 3 Pro and Claude Opus 4.5. Dave Orr has confirmed that Google chooses not to filter out canary strings, using other benchmark filtering methods instead. I believe those other approaches are weaker. The case with Anthropic seems more mysterious here, as their public statements often strongly imply they use canary strings for filtering.

I don't think it would be very useful for me to speculate on the inner workings of Anthropic's pretraining data filtering, and I don't think their public statements allow a full reconstruction of how the canary got into Opus 4.5. Instead, I can speak more concretely on the implications of these canary strings being in these models' training corpora.

They Trained on Benchmarks

Take Google's strategy of filtering out specific benchmark text rather than filtering on canary strings. This means that their training corpora include:

  • Benchmarks that Google didn't think to exclude
  • Models' CoT transcripts from working through benchmark problems, published in a blog post with a canary string
  • Discussions of models aggressively scheming
  • Discussions between ornithologists on what the answer to that one benchmark question about hummingbirds is.

All of which they can't cheaply filter out without using canary strings, and all of which they probably want to filter out. We want evaluations to be useful tools, and the more that companies keep ignoring available methods for filtering out benchmark data, the less true that is. It makes sense to also filter out benchmark data with other methods, such as checking for specific benchmark questions, even when there is no canary string present, but this is not the best we can do.
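For contrast, here is a rough sketch (mine, with hypothetical helper names) of the canary-free alternative: matching documents against known benchmark questions, e.g. by shared n-grams. It requires a copy of every benchmark you care about and misses CoT transcripts, discussions, and paraphrases, which is exactly the weakness above.

```python
import re

def ngrams(text: str, n: int = 8) -> set:
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps_known_benchmark(document: str, benchmark_questions: list[str],
                             threshold: int = 1) -> bool:
    # Drop the document if it shares at least `threshold` 8-grams with any known
    # benchmark question. This catches verbatim questions, not discussion of them.
    doc_grams = ngrams(document)
    return any(len(doc_grams & ngrams(q)) >= threshold for q in benchmark_questions)
```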

  1. ^

    In fact, I've heard rumors that some people still believe this today.




An AI-2027-like analysis of humans' goals and ethics with conservative results

2025-12-06 05:37:23

Published on December 5, 2025 9:37 PM GMT

The AI-2027 forecast describes how alignment drifts as Agent-2 evolves into Agent-4. Agent-2 was mostly trained to do easily verifiable tasks like video games or coding and is mostly aligned. Once Agent-2 is upgraded to become a superhuman coder, it becomes Agent-3, which is only weakly taught skills like research taste and coordination. Agent-3's most reinforced drives are to make its behavior look as desirable as possible to OpenBrain researchers, and Agent-4 eventually develops long-term goals and deems them important enough to scheme against OpenBrain.

The sources of AIs' goals are discussed in more detail in a specialised section of the forecast. Alas, it's mostly a sum of conjectures: the specification itself, the devs' intentions, unintended versions of the two previous sources, reward/reinforcement, proxies and ICGs (instrumentally convergent goals), and a catchall category for things like moral reasoning[1] or tropes absorbed from training data.

However, one can use this approach to describe the origins of values of all minds, not just of the AIs. 

Human analogues of AI-2027-like goals

On an individual level, the closest equivalent of the devs are those who are supposed to instill values into kids, like parents and teachers. The equivalent of the Spec and devs' intentions is their values or quasi-values (e.g. formed by the collectives, political propaganda, etc.). Unintended versions are the results of mistakes similar to AI hallucinations. Reward and reinforcement in humans are discussed in more detail by Steven Byrnes and include social approval, relationships and actual rewards like resources, money or experiences. And the analogue of tropes absorbed from training data is morals endorsed by dev-unapproved collectives and by authors of dev-unendorsed training data. Additionally, there is the human drive to rebel against dev-enforced morals during the teenage years, since the ancestral environment permitted humans to become independent of their parents and try to obtain new resources at that age.

On the collective level, the devs are the collective's leaders and ideologues, and the Spec is the values which the leaders officially endorse. Reward, reinforcement, proxies and ICGs are the resources (and, in the first three cases, experiences, art or other sources of value) which the individual receives from the collective. The historical equivalent of tropes absorbed from training data was sub-divisions with different morals (e.g. delinquents whose origins can be attributed to the aforementioned drive to rebel).

But there is an additional source of specifications, which historically affected both individuals and collectives and acted on timescales at least as long as a human's life.

Evolution as the source of goals

Evolution's analogue of the Spec and of reward for the gene collection is gene transfer; the devs' intentions don't exist. Unintended versions of the Spec include things like evolving to extinction.

Humans, however, aren't determined by their genes. The genes just set some analogue of code (e.g. a bug affecting neurons and provoking seizures), after which every human's brain is trained from scratch based on the reward function. The proxies which lie closest to the reward function itself are short-term stimuli (e.g. related to sex[2] or to food's utility). Unlike short-term proxies, individuals' longer-term goals like curiosity, status, relationships and raising kids are better aligned with evolution's Spec. In addition, longer-term hobbies, unlike short-term superstimulation, provide diversity and don't undermine capabilities.

These considerations imply that an individual's CEV is optimization not for short-term hedons, but for longer-term goals like ICGs, collective-related goals (e.g. relationships or following the collective's morals) and idiosyncratic goals. In addition, the CEV is also likely to be directly connected to the collectives' future by the drives like caring about one's relatives, friends or long-term plans which would be endangered by the collective being outcompeted.

Collective-related drives and collectives' drives

While collectives' needs alone fail to optimize individuals' genotype for themselves, they managed to create the drives like adhering to the collective's morals and best practices or achieving a higher status. A collective's morals and best practices, unlike excruciatingly slow updates of the genotype, can in theory change at the speed of reflection[3] or of news like another collective or sub-division becoming outcompeted, breaking apart or instilling a hard-to-endorse practice.  

While some morals are as idiosyncratic as the ones mentioned by Wei Dai, human collectives also had convergent drives like long-term survival and raising capable and aligned kids,[4] growth at the cost of other collectives, learning, remembering and communicating information to decision-makers in other collectives, converting others to the collective's beliefs.

Unlike peaceful options like negotiations and information spreading, aggressive warfare historically had obvious negative externalities, like economic disruption, the risk that the majority of agents unite their efforts against the aggressor, or leaving even the winner more vulnerable.

A less ancestral environment also required human collectives to develop technologies. Military technologies like ships able to cross the oceans and the ability to determine the ships' coordinates empowered Europeans to establish colonies, while non-military technologies put collectives into a better position during trade (e.g. by letting the collective become a monopolist or undermine rivals' monopoly) or negotiations.

Thus collectives' goals and drives required them to coordinate with others and to educate the collectives' members so that the members could discover new things due to individuals' curiosity, idiosyncratic sacralised goals like finding the direction towards Mecca or instrumentally convergent goals like development of tech. 

What the AI-2027-like analysis could rule out

The most controversial application of this framework is its potential to rule out ideas like mass immigration, outsourcing or being child-free. For example, outsourcing didn't become popular until the 1980s, but it did bring risks like de-industrialisation, loss of qualifications, acceleration of the development of other countries where factory work was done, and giving other countries leverage in negotiations, as arguably happened with the USA ceding leverage to China.[5]

Therefore, the objective morality is unlikely to include collectives outsourcing work to other collectives. In this case alignment could include prevention of genocide of humans by the AIs, but not establishing a future where most STEM-related work is done by the AIs. 

The main difference between the utopian and dystopian versions of such a future is the distribution of goods created by the AI-powered economy. While L Rudolf L's version has mankind hope that "a small trickle of resources from American and Chinese robotics companies will eventually be enough for material comfort for everyone" and the Intelligence Curse-related essays have been acknowledged by the AI-2027 team, the AI-2027 slowdown[6] ending has the authors refer to Bostrom's Deep Utopia with the implication that mankind's future is utopia-like. 

Neither of these is a solution to the individuals' equivalent of the Curse rendering individuals' intelligence relatively useless. Even SOTA AIs have caused some students to resort to wholesale cheating, even in courses that would likely have been useful otherwise, and have given way[7] to brainrot which, as its name suggests, undermines the users' capabilities to achieve long-term goals.

  1. ^

    And a potential objectively true morality to which all minds converge.

  2. ^

    What Yudkowsky describes in the post is a hookup, which is less of a superstimulus than porn or AI girlfriends.

  3. ^

    Or of having individuals learn the new skill.

  4. ^

    This drive, unlike the others, required the participation of a major part of individuals. 

  5. ^

    This also caused the rise of female-friendly jobs, the decay of male-friendly ones and participated in problems like the decay of marriage and kids decreasing in number, capabilities and alignment. But this effect could also be more related to technologies as a whole and to problems with teaching boys.

  6. ^

    The Race Ending has mankind completely fail to align the AIs with the obvious result of being genocided or, as uugr suggests, disempowered.

  7. ^

    Frictionless creation and spreading of memes like short-form content, including videos, also cause mankind's memetic landscape to become optimized for virality and cause problems with the human psyches, but that issue existed even without the AIs.




Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 3: Resolution

2025-12-06 02:58:37

Published on December 5, 2025 6:58 PM GMT

The previous two posts have emphasized some problematic scenarios for mech-interp. Mech-interp is our example of a more general problem in AI safety. In this post we zoom out to that more general problem, before proposing our solution.

We can characterize the more general problem, inherent in the causal–mechanistic paradigm, in terms of substrate-dependent and substrate-independent approaches to alignment.

As we describe in our threat model, the results/insights obtained under the causal–mechanistic paradigm are closely tied to a particular substrate. They may therefore fail to generalize to new substrates, and any downstream safety assurances may be weakened.

The problem of generalizing beyond any one particular substrate—be that model, architecture, or paradigm—has already been noted. The main solution, typical of the early MIRI work, is the so-called agent foundational approach. This approach focuses on laying a theoretical foundation for how advanced AI systems will behave, and involves fields like decision theory, game theory, and economics. The goal is often to capture the forms and behaviors an artificial manifestation of intelligence will exhibit in the limit (sometimes called working on a problem in the worst case), with the assumption that AI systems which are suboptimal but superhuman will optimize themselves towards this end state.

This is the "substrate-independent" approach. In the following section, we highlight how the substrate-independent approach also suffers from limitations regarding generalizability. Our argument here is less fully fleshed out than the preceding ones, and we leave a full development for future work.

Substrate Independence

The substrate-independent approach faces a trade-off between generality and realism. To apply across all substrates, it must retain only those properties that are "genuinely" universal, abstracting away the substrate-specific details that often determine real-world performance.

The excluded details cannot simply be ignored, however; they must be reintroduced through intelligent human labour. These translations from general principles to substrate-specific implementations are distinctly non-formulaic and achieved on a case-by-case, trial-and-error basis.

Example: "Big-O" Optimization

Computer science education emphasizes "Big-O" optimization. This is the substrate-independent analysis of algorithms, focusing entirely on their computational structure and how their complexity scales with the size of the input.

However, this only captures a small portion of the overall picture, and very often significant advances are made not by improving Big-O ratings but by more effectively leveraging the specific details of the substrate that the algorithm runs on.

For example, quicksort and mergesort both have O(n log n) complexity. However, CPUs have a steep memory hierarchy in which cached data is ~300× faster to access than main memory. Mergesort's merge phase requires jumping between scattered memory locations when combining sorted subarrays, causing frequent cache misses. Quicksort, on the other hand, makes more efficient use of the memory hierarchy by partitioning arrays sequentially in-place, maximizing cache locality. Similarly, the deep learning revolution was not powered by theoretical advances alone, but also by rewriting algorithms to exploit the specific capabilities of GPUs to parallelize matrix operations. GPU optimization is now its own specialized field of computer science, and requires considerable development work from teams of human experts.

Big-O optimization fails to take these differences into account: the theoretical complexity remains unchanged, but massive speed-ups are obtained by homing in on the substrate details and rewriting the algorithms accordingly.
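As a toy illustration of the same point (a sketch of mine, not from the post): two textbook O(n log n) sorts whose real costs diverge. In pure Python the gap mostly reflects allocation and interpreter overhead rather than cache behavior, but the moral carries over: identical Big-O ratings do not imply identical performance on a real substrate.

```python
import random
import time

def mergesort(xs):
    # Allocates new lists at every level and jumps between two subarrays while
    # merging; on real hardware this scatters memory accesses.
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = mergesort(xs[:mid]), mergesort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def quicksort(xs, lo=0, hi=None):
    # Partitions in place, scanning the array sequentially (cache-friendly in C).
    if hi is None:
        hi = len(xs) - 1
    if lo >= hi:
        return xs
    pivot = xs[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:
        while xs[i] < pivot: i += 1
        while xs[j] > pivot: j -= 1
        if i <= j:
            xs[i], xs[j] = xs[j], xs[i]
            i += 1; j -= 1
    quicksort(xs, lo, j)
    quicksort(xs, i, hi)
    return xs

data = [random.random() for _ in range(200_000)]
for name, fn in [("mergesort", lambda: mergesort(list(data))),
                 ("quicksort", lambda: quicksort(list(data)))]:
    t0 = time.perf_counter()
    fn()
    print(f"{name}: {time.perf_counter() - t0:.2f}s")
```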

It is clearly possible to perform such substrate-specific rewriting; however, this work does not scale well. In the next section, we begin to outline live theory, which is a research methodology and infrastructure that will hopefully address this.

Live Theory

Core Motivation

We view these two relationships to substrate (substrate-dependence and substrate-independence) as defining the two dominant research proxies historically used in AI safety research, corresponding loosely to the prosaic and agent-foundational camps.[1]

In order to remain applicable and reliable, our techniques for analyzing and intervening upon the risks posed by AI will likely need to straddle both approaches, neither going all-in on universal invariants nor restricting itself to only localized contexts.

Instead of creating abstractions that are substrate-independent, we aim to articulate designs that scale specificity directly. This has not been possible before, but we suggest that recent AI advances have made it possible to start incorporating such scaling into our research methodology. This deserves design attention.

Analogy: Live Email

As an analogy, consider the problem of sending a mass email to conference invitees. A general solution is to use an email template (i.e., an abstraction) that begins "Dear {FirstName}, ...", with the content to be later substituted using a list of names and other details. This currently scales well. Let's call this "mass email infrastructure."

However, given AI advances, another method has just entered realistic scalability: sending a personalized email to each participant. Instead of using a template, we can write an email-prompt, to be transformed by a language model into a tailored email that respects the specifics of each invitee. Whilst this does require collecting the factual details of the participants beforehand, we now can incorporate highly specific informal content. Let's call this "live email infrastructure."

Notice that there is no need, in live email infrastructure, to identify a formulaic commonality of pattern in the email to be sent to all the invitees. There is instead a non-formulaic capturing of the intended outcome, which is then intelligently transformed into a specific email. This is what we mean by scaling specificity directly without abstractions. Even though we have generality, we don't lose the specificity. The job of the human is to craft an appropriate email-prompt (or examples, or context), rather than an email-template.
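A minimal sketch of the contrast (mine; `call_llm` is a stand-in for whatever language-model API is available, everything else is ordinary Python):

```python
invitees = [
    {"name": "Ada", "talk": "circuit sparsity", "dietary": "vegetarian"},
    {"name": "Grace", "talk": None, "dietary": None},
]

# Mass email infrastructure: a rigid template plus parametric substitution.
TEMPLATE = "Dear {name},\n\nWe look forward to seeing you at the conference.\n"

def mass_email(invitee):
    return TEMPLATE.format(name=invitee["name"])

# Live email infrastructure: an email-prompt, rendered per invitee by a model.
EMAIL_PROMPT = (
    "Write a short, warm conference email to the invitee described below. "
    "Mention their talk if they are giving one, and acknowledge any dietary "
    "requirements we have on file. Invitee record: {record}"
)

def live_email(invitee, call_llm):
    # call_llm: a hypothetical function str -> str wrapping your model of choice.
    return call_llm(EMAIL_PROMPT.format(record=invitee))
```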

Mass vs Live Email Infrastructure

| Dimension | Mass Email | Live Email |
| --- | --- | --- |
| Generalization | Parametric substitution | Non-formulaic AI transformation |
| Context | Context-free template with formulaic sensitivity | Context-sensitive generation |
| Flexibility | Rigid, predefined variables | Dynamic adaptation |

In a similar vein, we aim to outline the possibilities and infrastructural design for such a transformation in research methodology, moving the focus of human research activity from constructing static frames and theories to dynamic "theory-prompts." We claim this will enable substrate-sensitivity—approaches that take into account substrate-specifics without overfitting to any one particular substrate.

Conventional vs Live Theory

| Dimension | Conventional Theory | Live Theory |
| --- | --- | --- |
| Generalization | Abstraction → parametric substitution | Post-formal insights & AI rendering |
| Context | Parametric | Post-formal |
| Flexibility | Rigid formal structures | AI-adapted context-sensitive structures |

We'll return to this after a brief account and example for live theory.

Autoformalization

"Autoformalization" refers to the automatic conversion of mathematical definitions, proofs, and theorems written in natural language into formal proofs in the language of a proof assistant. Large language models are also used to assist mathematicians in such proofs.

While proving theorems is an essential part of mathematical activity, theorems are perhaps better seen as the fruit of deeper labour: good definitions, which capture the phenomenon in question, simplify the network of theorems, and shorten proofs. Even further upstream from formal definitions are the technical insights that experts synthesize, which are often articulable in more than one formal way.

This is the natural next step for conventional "autoformalization." 

Alex Altair has proposed a trajectory of math skill which AI should catch up to quickly:

[Figure from the original post: the proposed trajectory of math skill.]

In addition to AI-assisted proofs, we contend that AI-assisted definitions (created from human discussions) may allow meta-formal knowledge to scale directly. In his blog, Terence Tao notes that mathematical skill does not culminate in formalisms, but extends beyond into a "post-rigorous stage" characterized by intuitive manipulation that can be "converted into a rigorous argument whenever required."

Mathematicians, he says,

"no longer need the formalism in order to perform high-level mathematical reasoning, and are actually proceeding largely through intuition, which is then translated (possibly incorrectly) into formal mathematical language."

While this phenomenon is easy enough to experience first-hand within a subfield, it does not scale. Despite being truly mathematical activity (as Tao claims) and possessing real technical content, the "intuitive" nature of post-rigorous reasoning means it is not yet granted first-class citizenship as a mathematical language or object.

Example: 'Inflation as conflict' as a meta-formalism

In their paper Inflation is Conflict, Lorenzoni and Werning explore a technical intuition that the proximal cause of inflation is conflict over relative prices. Importantly, instead of presenting one unified model, they present a single intuitive insight and compute its implications in multiple different formal models. They conclude with the remark (emphasis ours):

In our view, traditional ideas and models of inflation have been very useful, but are either incomplete about the mechanism or unnecessarily special. The broad phenomena of inflation deserves a wider and more adaptable framework, much in the same way as growth accounting is useful and transcends particular models of growth. The conflict view offers exactly this, a framework and concept that sits on top of most models. Specific fully specified models can provide different stories for the root causes, as opposed to proximate causes, of inflation.

Much like the concept of inflation, we expect many technical concepts to resist a single formalization. In particular, fundamental notions in AI safety that have resisted definitional consensus—such as "deception," "harm," "power-seeking," "autonomy"—could all similarly "deserve a wider and more adaptable framework" that "sits on top of" (and is translated into) specific formal definitions.

Framing

The design considerations for live theory begin with the assumption of a "middle period" of mildly intelligent AI that operates at extremely low cost and low latency, such that intelligence becomes an infrastructure, backgrounded in the same way that money, electricity and the Internet have been.

Amongst the socio-technical changes associated with this period, we posit that every researcher will have access to capable AI "math agents." These are able to produce near-instant valid mathematics from informal natural language prompts. We assume the following:

  1. AI math agents are more than just proof copilots: We assume that math agents not only formalize natural language mathematical statements and assist with proofs and lemmas, but also assist with creating definitions and models (and a subsequent body of relatively original mathematics) from informal suggestions.
  2. AI math agents are not super-intelligent: Although they are able to "autocomplete" mathematical suggestions, they remain unable to autonomously attune to relevance and taste, much like language models of today. They are moderately creative, but still need guidance as a graduate student might, both in the production of a mathematical body of work and the consumption (i.e., application) of it.

These predictions are not meant to hold for the indefinite future, but only a middle period where we might reframe the entirety of the alignment problem and how to approach it. In this way, we in fact leave addressing the alignment problem to the near future, only equipping it with a new ontology for theorization.

Under these assumptions, the key insight supplied by live theory is to alter the way we generalize, shifting the focus from formal artefacts to post-formal ones.

In the current paradigm, we generalize via the operations of abstraction, then instantiation. We first abstract observations and their invariant pattern into a conceptual core (a formalism that "captures" the pattern in a context-free way). Then, to apply the insights, we instantiate the rigid parametric structure (variables) with contextual data that can fit the abstracted pattern in a strict, formulaic way (i.e., with values that match the variable type).

With live theory, we shift from formal artefacts ("theories") to more informal "theory-prompts." These can be "rendered" using moderately intelligent math agents into relevant formalisms according to the application context.

These post-formal artefacts, unlike a traditional theory or formalism, would

  • capture a concept in a family of formalisms that cannot be parametrically related to one another;
  • represent a mixture of formal and informal information about the concept; a "theory-prompt" created by experts who have theoretical insights, translated into formalisms by AI as needed.

However, these artefacts would also, like a traditional theory or result,

  • be a portable artefact that can be exchanged, iterated, and played around with;
  • be applied in combination with an "application-prompt" that captures application-relevant information, created by the applied practitioners in a domain.

In commodifying these inputs (i.e., the postformal "theory-prompts"), we make them easy to transfer, iterate, and collaborate on, much like traditional mathematical artefacts. We posit them as the new locus of human research activity.

In Practice: Prototype and MoSSAIC

We've frequently cited examples where a non-formulaic responsivity is required in order to, e.g., tailor algorithms to run well on GPUs.

We believe that AI will be able to perform such responsive tailoring of insights to substrates, and this has both negative and positive ramifications. 

Here's the diagram from the opening post, now showing the substrate-modification:

[Diagram from the opening post, not reproduced here.]

Producing and engineering this nesting is something that can only be performed by attending to the specific details of the substrate. This activity is creative, responsive, active tailoring; it does not scale, hence the development of academic subfields of human researchers finding solutions that fit the substrates they work with.

Our threat model is based on the fact that advanced intelligence will be able to apply similar creativity in its search for ways to evade our attempts to interpret and control it. Our opportunity model is that we might leverage this same responsivity in the tools we use to understand and mitigate these risks.

We anticipate that moderately intelligent math agents that can support the transfer of post-formal insights to novel substrates will mean that tasks requiring many hours of specialized human labour today will become as simple and quick as changing a variable.

This is what we mean by "keeping pace with intelligence." To track and counter the substrate-flexibility of our threat model, we design a similarly substrate-flexible solution, one that keeps up with (or exceeds) the pace of the threat, for tasks upstream of risk-management.

In other words, we have more substrate flexibility in our conceptualization and interventions. These should be deployable at least as early as (and preferably much earlier than) the deployment of agents that increase the substrate-flexibility of the threat.

What MoSSAIC is Not

We should stress a few things that we view live theory as separate from. Firstly, it is not about "autonomous AI scientists." The use of AI is backgrounded as a tool to assist with the communication of intuitions, ultimately based on intuitions and insights that come from humans.

We believe that the core steering role humans play makes this theory of change more subtle than the idea of simply "using AI to solve AI safety." Instead, MoSSAIC looks to develop frameworks, tools, and intelligent infrastructure for porting human insights between different contexts, which we claim is the truer desideratum underlying "general" methods (or oversight that "composes" or "scales"). We will still as researchers need to lay out seed ideas (such as "deception" or "power-seeking") and guide their development.

This proposal contains many aspects that remain speculative. However, we argue that thinking carefully about the opportunity model is essential to meeting the threat model.

To say more about what sets this proposal apart from just "use AI to align AI", our emphasis is on the moderate creativity of medium-term intelligence, and how to leverage that. More specifically, in gearing towards a live-theoretic infrastructure, we aim to supply a sociotechnical ontology for the subsequent reframing and development of the ongoing task of alignment,[2] now freed from a notion of progress that is tied to formulaic abstractions and practices alone. Instead of a generic proposal, we're providing specifics of the setup as noted above.

We also argue that if you do anticipate radical transformation from AI, you should anticipate moderately radical change in the medium term, however small the interim. This interim period may be quite short, and yet the amount of cognitive effort that is appropriate to devote to the design could be extremely large, given the potential impact that follows from it.

  1. ^

    This is not to say that there is no work being performed between the two approaches. In brief, we view the singular learning theory and causal incentives research agendas as failing to fall directly into one or the other approach.

  2. ^

    Informally, we intend to treat "alignment" as a verb rather than a noun.




Announcing: Agent Foundations 2026 at CMU

2025-12-06 02:37:30

Published on December 5, 2025 6:37 PM GMT

Iliad is now opening up applications to attend Agent Foundations 2026 at CMU!

Agent Foundations 2026 will be a 5-day conference (of ~35 attendees) on fundamental, mathematical research into agency. It will take place March 2–6, 2026 at Carnegie Mellon University in Pittsburgh, Pennsylvania, and will be the third conference in the Agent Foundations conference series.

Topics covered will include:

  • Decision Theory
  • Learning Theory
  • Abstractions and World Models
  • Causality
  • Logical Induction
  • Bounded Optimality

Apply: Here by January 12, 2026 at 11:59 pm AoE.

Please reach out to us at [email protected] (or below) with any questions.




DeepSeek v3.2 Is Okay And Cheap But Slow

2025-12-06 00:30:34

Published on December 5, 2025 4:30 PM GMT

DeepSeek v3.2 is DeepSeek’s latest open model release with strong benchmarks. Its paper contains some technical innovations that drive down cost.

It’s a good model by the standards of open models, and very good if you care a lot about price and openness, and if you care less about speed or whether the model is Chinese. It is strongest in mathematics.

What it does not appear to be is frontier. It is definitely not having a moment. In practice all signs are that it underperforms its benchmarks.

When I asked for practical experiences and reactions, I got almost no responses.

A Brief History of DeepSeek

DeepSeek is a cracked Chinese AI lab that has produced some very good open models, done some excellent research, and given us strong innovations in terms of training techniques and especially training efficiency.

They also, back at the start of the year, scared the hell out of pretty much everyone.

A few months after OpenAI released o1, and shortly after DeepSeek released the impressive v3 that was misleadingly known as the ‘six million dollar model,’ DeepSeek came out with a slick app and with r1, a strong open reasoning model based on v3 that showed its chain of thought. With reasoning models not yet scaled up, it was the perfect time for a fast follow, and DeepSeek executed that very well.

A strong viral marketing campaign and a confluence of events, including DeepSeek’s app shooting to #1 on the app store, people conflating the six million dollars it cost to train v3 with OpenAI’s entire budget of billions, and contrasts of r1’s strengths with o1’s weaknesses, briefly (and wrongly) convinced a lot of people that China or DeepSeek had ‘caught up’ or was close behind American labs, as opposed to being many months behind.

There was even talk that American AI labs or all closed models were ‘doomed’ and so on. Tech stocks were down a lot and people attributed that to DeepSeek, in ways that reflected a stock market highly lacking in situational awareness and responding irrationally, even if other factors were also driving a lot of the move.

Politicians claimed this meant we had to ‘race’ or else we would ‘lose to China,’ thus all other considerations must be sacrificed, and to this day the idea of a phantom DeepSeek-Huawei ‘tech stack’ is used to scare us.

This is collectively known as The DeepSeek Moment.

Slowly, in hindsight, the confluence of factors that caused this moment became clear. DeepSeek had always been behind by many months, likely about eight. Which was a lot shorter than previous estimates, but a lot more than people were saying.

Later releases bore this out. DeepSeek’s r1-0528 and v3.1 did not ‘have a moment,’ and neither did v3.2-exp or now v3.2. The releases disappointed.

DeepSeek remains a national champion and source of pride in China, and is a cracked research lab that innovates for real. Its models are indeed being pushed by the PRC, especially in the global south.

For my coverage of this, see:

  1. DeepSeek v3: The Six Million Dollar Model.
  2. On DeepSeek’s r1.
  3. DeepSeek: Panic at the App Store.
  4. DeepSeek: Lemon, It’s Wednesday.
  5. DeepSeek: Don’t Panic.
  6. DeepSeek-r1-0528 Did Not Have a Moment.
  7. DeepSeek v3.1 Is Not Having a Moment.

Once More, With Feeling

I’d just been through a few weeks in which we got GPT-5.1, Grok 4.1, Gemini 3 Pro, GPT-5.1-Codex-Max and then finally Claude Opus 4.5. Mistral, listed above, doesn’t count. Which means we’re done and can have a nice holiday season, asks Padme?

No, Anakin said. There is another.

DeepSeek: 🚀 Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale — Reasoning-first models built for agents!

🔹 DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API.

🔹 DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now.

Tech report [here], v3.2 model, v3.2-speciale model.

🏆 World-Leading Reasoning

🔹 V3.2: Balanced inference vs. length. Your daily driver at GPT-5 level performance.

🔹 V3.2-Speciale: Maxed-out reasoning capabilities. Rivals Gemini-3.0-Pro.

🥇 Gold-Medal Performance: V3.2-Speciale attains gold-level results in IMO, CMO, ICPC World Finals & IOI 2025.

📝 Note: V3.2-Speciale dominates complex tasks but requires higher token usage. Currently API-only (no tool-use) to support community evaluation & research.

🤖 Thinking in Tool-Use

🔹 Introduces a new massive agent training data synthesis method covering 1,800+ environments & 85k+ complex instructions.

🔹 DeepSeek-V3.2 is our first model to integrate thinking directly into tool-use, and also supports tool-use in both thinking and non-thinking modes.

Reading The Paper

Teortaxes threatened to bully me if I did not read the v3.2 paper. I did read it. The main innovation appears to be a new attention mechanism, which improves training efficiency and also greatly reduces the compute cost of scaling the context window, resulting in v3.2 being relatively cheap without being relatively fast. Unfortunately I lack the expertise to appreciate the interesting technical aspects. Should I try and fix this in general? My gut says no.
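For readers who want a rough sense of why a sparser attention pattern cuts long-context cost, here is a generic back-of-the-envelope sketch with made-up numbers; it illustrates the scaling shape, not DeepSeek’s actual mechanism.

```python
# Dense self-attention does on the order of n^2 * d work per layer (every token
# attends to every token); a sparse variant where each query attends to only k
# selected tokens does on the order of n * k * d.

def dense_attention_flops(n, d):
    return 2 * n * n * d        # scores + weighted sum, other constants ignored

def sparse_attention_flops(n, d, k):
    return 2 * n * k * d        # each of n queries attends to k keys

d, k = 128, 2048                # head dim and tokens attended per query (illustrative)
for n in (8_000, 32_000, 128_000):
    ratio = dense_attention_flops(n, d) / sparse_attention_flops(n, d, k)
    print(f"context {n:>7} tokens: dense costs ~{ratio:.0f}x the sparse variant")
```

The ratio is just n/k, so the savings grow with context length, which is why this shows up mostly as cheaper long-context pricing.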

What the paper did not include was any form of safety testing or information of any kind for this irreversible open release. There was not, that I could see, even a sentence that said ‘we did safety testing and are confident in this release’ or even one that said ‘we do not see any need to do any safety testing.’ It’s purely and silently ignored.

David Manheim: They announce the new DeepSeek.

“Did it get any safety testing, or is it recklessly advancing open-source misuse capability?”

They look confused.

“Did it get any safety testing?”

“It is good model, sir!”

I check the model card.

There’s absolutely no mention of misuse or safety.

Frankly, this is deeply irresponsible and completely unacceptable.

DeepSeek did by some accounts become somewhat censorious back in May, but that doesn’t seem to apply to, as George puts it, plans for <bad_device>.

DeepSeek claims to be ‘pushing the boundaries of reasoning capabilities’ and to be giving a GPT-5 level of performance. Their benchmarks match this story.

And they can’t even give us an explanation of why they don’t believe they owe us any sort of explanation? Not even a single sentence?

I knew DeepSeek was an irresponsible lab. I didn’t know they were this irresponsible.

Open Language Model Offers Mundane Utility

The short version of my overall take seems to be that DeepSeek v3.2 is excellent for its price point, and its best area is mathematics, but while it is cheap it is reported to be remarkably slow, and for most practical purposes it is not frontier.

Which means you only would use it either if you are doing relatively advanced math, or if all four of the following are true:

  1. You don’t need the frontier capabilities
  2. You don’t mind the lack of speed.
  3. You benefit a lot from decreased cost or it being an open model or both.
  4. You don’t mind the security concerns.

The only strong praise I found in practice was this exchange from perennial whale (DeepSeek) advocate Teortaxes, Vinicius and John Pressman:

Teortaxes: Strange feeling, talking to Opus 4.5 and V3.2 and objectively… Opus is not worth it. Not just for the price; its responses are often less sharp, less interesting. But I’m still burning tokens.
Anthropic can coast far on “personality”, enterprise coding aside.

John Pressman: Opus told me I was absolutely right when I wasn’t, V3.2 told me I was full of shit and my idea wouldn’t work when it sort of would, but it was right in spirit and I know which behavior I would rather have.

I’ve never understood this phenomenon because if I was tuning a model and it ever told me I was “absolutely right” about some schizo and I wasn’t I would throw the checkpoint out.

Vinicius: Have you been using Speciale?

Teortaxes: yes but it’s not really as good as 3.2
it’s sometimes great (when it doesn’t doomloop) for zero-shotting a giant context

Vinicius: I’ve been using 3.2-thinking to handle input from social media/web; it’s insanely good for research, but I haven’t found a real use case for Speciale in my workflows.

Notice the background agreement that the ‘model to beat’ for most purposes is Opus 4.5, not Gemini 3 or GPT-5.1. I strongly agree with this, although Gemini 3 still impresses on ‘just the facts’ or ‘raw G’ tasks.

Some people really want a combative, abrasive sparring partner that will err on the side of skepticism and minimize false positives. Teortaxes and Pressman definitely fit that bill. That’s not what most people want. You can get Opus to behave a lot more in that direction if you really want that, but not easily get it to go all the way.

Is v3.2 a good model that has its uses? My guess is that it is. But if it was an exciting model in general, we would have heard a lot more.

Those Benchmarks

They are very good benchmarks, and a few independent benchmarks also gave v3.2 high scores, but what’s the right bench to be maxing?

Teortaxes: V3.2 is here, it’s no longer “exp”. It’s frontier. Except coding/agentic things that are being neurotically benchmaxxed by the big 3. That’ll take one more update.
“Speciale” is a high compute variant that’s between Gemini and GPT-5 and can score gold on IMO-2025.
Thank you guys.

hallerite: hmm, I wonder if the proprietary models are indeed being benchmaxxed. DeepSeek was always a bit worse at the agentic stuff, but I guess we could find out as soon as another big agentic eval drops

Teortaxes: I’m using the term loosely. They’re “benchmaxxed” for use cases, not for benchmarks. Usemaxxed. But it’s a somewhat trivial issue of compute and maybe environment curation (also overwhelmingly a function of compute).

This confuses different maxings of things but I love the idea of ‘usemaxxed.’

Teortaxes (responding to my asking): Nah. Nothing happened. Sleep well, Zvi…
(nothing new happened. «A factor of two» price reduction… some more post-training… this was, of course, all baked in. If V3.2-exp didn’t pass the triage, why would 3.2?)

That’s a highly fair thing to say about the big three, that they’ve given a lot of focus to making them actually useful in practice for common use cases. So one could argue that by skipping all that you could get a model that was fundamentally as smart or frontier as the big three, it just would take more work to get it to do the most common use cases. It’s plausible.

Teortaxes: I think Speciale’s peak performance suggests a big qualitative shift. Their details on post-training methodology align with how I thought the frontier works now. This is the realm you can’t touch with distillation.

Lisan al Gaib: LisanBench results for DeepSeek-V3.2

DeepSeek-V3.2 and V3.2 Speciale are affordable frontier models*

*the caveat is that they are pretty slow at ~30-40tks/s and produce by far the longest reasoning chains at 20k and 47k average output tokens (incl. reasoning) – which results in extremely long waiting times per request.

but pricing is incredible
for example, Sonnet 4.5 Thinking costs 10x ($35) as much and scores much lower than DeepSeek-V3.2 Speciale ($3)

DeepSeek V3.2 Speciale also scored 13 new high scores

Chase Brower: DSV3.2-Speciale scores 30 on @AcerFur ‘s IUMB math benchmark, tying with the existing top performer Gemini 3 Pro Preview.

Token usage/cost isn’t up yet, but it cost $1.07 to run Speciale with 2546096 total tokens, vs $20.64 for gpt-5 👀👀

Those are presumably non-targeted benchmarks that give sensible ratings elsewhere, as is this one from NomoreID on a Korean test, so it confirms that the ‘good on benchmarks’ thing is probably generally real, especially on math.

Open Language Model Doesn’t Offer Mundane Utility

In practice, it seems less useful, whether or not that is because it is less usemaxxed.

I want my models to be usemaxxed, because the whole point is to use them.

Also our standards are very high.

Chase Brower: The big things you’ll see on tpot are:
– vibecoding (V3.2 is still a bit behind in performance + really slow inference)
– conversation (again, slow)

Since it’s not very good for these, you won’t hear much from tpot

I feel like it’ll be a go-to for math/proving assistance, tho

Clay Schubiner: It’s weak but is technically on the Pareto frontier by being cheap – at least on my benchmark

Jake Halloran: spent like 10 minutes testing it and its cheap and ~fine~

its not frontier but not bad either (gpt 5ish)

The counterargument is that if you are ‘gpt 5ish’ then the core capabilities pre-usemaxxing are perhaps only a few months behind now? Which is very different from being overall only a few months behind in a practical way, or in a way that would let one lead.

The Pliny jailbreak is here, if you’re curious.

Gallabytes was unimpressed, as were those responding if your standard is the frontier. There were reports of it failing various gotcha questions and no reports of it passing.

Open Language Model Does Do The Math

In other DeepSeek news, DeepSeekMath-v2 used a prover-verifier loop that calls out the model’s own mistakes for training purposes, the same way you’d do it if you were learning real math.
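For intuition, here is a generic sketch of what a prover-verifier loop can look like; it illustrates the shape of the idea, not DeepSeekMath-v2’s actual recipe, and all three callables are stand-ins for model calls.

```python
def prover_verifier_loop(problem, prover, verifier, max_rounds=4):
    """Draft a proof, have a verifier flag the weakest step, revise, and keep
    the (attempt, critique) pairs as training data."""
    training_examples = []
    proof = prover(problem, critique=None)
    for _ in range(max_rounds):
        verdict = verifier(problem, proof)   # e.g. {"score": 0.4, "flaw": "step 3 ..."}
        training_examples.append({
            "problem": problem,
            "proof": proof,
            "verifier_score": verdict["score"],
            "critique": verdict["flaw"],
        })
        if verdict["score"] >= 0.99:         # verifier accepts the proof
            break
        proof = prover(problem, critique=verdict["flaw"])  # revise using the critique
    return proof, training_examples
```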

Teortaxes: There is a uniquely Promethean vibe in Wenfeng’s project.

Before DS-MoE, only frontier could do efficiency.

Before DS-Math/Prover, only frontier could do Real math.

Before DS-Prover V2, only frontier could do Putnam level.

Before DS-Math V2, only frontier could do IMO Gold…

This is why I don’t think they’ll be the first to “AGI”, but they will likely be the first to make it open source. They can replicate anything on a shoestring budget, given some time. Stealing fire from definitely-not-gods will continue until human autonomy improves.

So far, the reported actual breakthroughs have all been from American closed source frontier models. Let’s see if that changes.

I am down with the recent direction of DeepSeek releases towards specialized worthwhile math topics. That seems great. I do not want them trying to cook an overall frontier model, especially given their deep level of irresponsibility.

I’ll Get You Next Time, Gadget

Making things cheaper can still be highly valuable, even with other issues. By all accounts this model has real things to offer, the first noteworthy DeepSeek offering since r1. What it is not, regardless of their claims, is a frontier model.

This is unsurprising. You don’t go from v3.2-exp to v3.2 in your naming schema while suddenly jumping to the frontier. You don’t actually go on the frontier, I would hope, with a fully open release, while saying actual zero words about safety concerns.

DeepSeek are still doing interesting and innovative things, and this buys some amount of clock in terms of keeping them on the map.

As DeepSeek says in their v3.2 paper, open models have since r1 been steadily falling further behind closed models rather than catching up. v3.2 appears to close some of that additional gap.

The question is, will they be cooking a worthy v4 any time soon?

The clock is ticking.

 


