Published on February 10, 2026 4:28 AM GMT
This is what takeoff feels like. Anthropic and OpenAI have been explicit about their intention to create an intelligence explosion, and employees at both companies have recently confirmed that their models are significantly accelerating their own development.
This week we’ll talk about what that means, considering the trajectory of future progress, our increasing inability to measure the capabilities and risks of the frontier models, and some ideas for how humanity can successfully navigate what is coming.
The intelligence explosion has begun: AI is meaningfully accelerating its own development. Dean Ball considers what’s happening now and where we’re headed soon.
America’s major frontier AI labs have begun automating large fractions of their research and engineering operations. The pace of this automation will grow during the course of 2026, and within a year or two the effective “workforces” of each frontier lab will grow from the single-digit thousands to tens of thousands, and then hundreds of thousands.[…]
Policymakers would be wise to take especially careful notice of this issue over the coming year or so. But they should also keep the hysterics to a minimum: yes, this really is a thing from science fiction that is happening before our eyes, but that does not mean we should behave theatrically, as an actor in a movie might. Instead, the challenge now is to deal with the legitimately sci-fi issues we face using the comparatively dull idioms of technocratic policymaking.
A fascinating recent paper argues that reasoning models use internal dialogue to make better decisions. I look at what they found, how they found it, and what it does (and doesn’t) mean.
Anthropic has released Claude Opus 4.6, with strong improvements in all the usual places. Plus, two very interesting new options (at premium prices): a 1 million token context window and a substantially faster version of the model.
OpenAI just released GPT-5.3-Codex, which looks to be a significant upgrade to 5.2 (which just came out two months ago). Related: I expect we’ll see ChatGPT 5.3 very soon, likely this week.
Nathan Lambert shares some thoughts after spending time with both Opus 4.6 and Codex 5.3. He still prefers Opus, but the gap has narrowed. My take: both models are excellent—if coding is important to you, you should try both and see which works best for you.
All the big models have reached or are very close to reaching dangerous cybersecurity capability levels. With that comes a very hard, very important problem: how do you let people use the defensive capabilities of those models without enabling bad actors to leverage their offensive capabilities? OpenAI is rolling out Trusted Access for Cyber, a program that gives trusted users greater access to dual-use cyber capabilities. Seems like a great idea, but hard to execute well at scale.
Moonshot AI has released Kimi K2.5—possibly the best open model available. Zvi takes a detailed look. There aren’t a lot of surprises here: it’s an excellent model, they’ve apparently put very little effort into safety, and Chinese open models continue to lag the frontier by 6–12 months. You could probably argue they’ve fallen a little further behind lately, but that’s very hard to quantify.
OpenAI describes Agent Builder as “a visual canvas for building multi-step agent workflows.” I haven’t yet had a chance to take it for a spin, but it sounds great for some workflows. (But see Minh Pham’s thoughts about the Bitter Lesson below).
Rahul Sood has further thoughts about the security implications of OpenClaw.
No surprises: it’s very cool, but not ready for prime time. If you’re gonna try it out for fun or learning, make sure your security game is top-notch.
Related: Zvi is running a weekly series on Claude Code. Well worth your time if you’re using it regularly.
Here’s a nice data point on the very impressive capabilities (and significant limitations) of coding agents. Nicholas Carlini uses $20,000 worth of tokens (good thing he works at Anthropic!) to have agents semi-autonomously build a 100,000 line C compiler that can compile the Linux kernel. It’s a very impressive achievement, and far beyond what most humans could have done in that time. But also: it’s not production-ready, and the agents can’t quite seem to get it there.
Anthropic’s Best Practices for Claude Code contains almost everything I’ve personally found useful from all the guides I’ve linked to over the last few weeks.
Most best practices are based on one constraint: Claude’s context window fills up fast, and performance degrades as it fills.
If you want to use Claude Code but are intimidated by having to use the command line (or want to better understand what your agent is doing), Ado has a nice guide to command line essentials for using agents.
AxiomProver is back, this time with what they claim is “the first time an AI system has settled an unsolved research problem in theory-building math”.
We have a benchmark crisis: many existing benchmarks are saturated, and it’s hard and expensive to create new evaluations that challenge the frontier models. Epoch’s Anson Ho takes a different approach—instead of creating a formal new benchmark, he asked AI to tackle a couple of his recent work projects. Did they succeed? No, but the nature of their failure is informative.
OpenAI is also riding the recursive self-improvement rocket:
Codex now pretty much builds itself, with the help and supervision of a great team. The bottleneck has shifted to being how fast we can help and supervise the outcome.
The New York Times talks to a group of mathematicians who are putting together a new benchmark based on open questions in their current research ($).
Great post from Chris Painter that explains an increasingly serious challenge for AI safety:
My bio says I work on AGI preparedness, so I want to clarify:
We are not prepared.
Over the last year, dangerous capability evaluations have moved into a state where it’s difficult to find any Q&A benchmark that models don’t saturate.
AI manipulation doesn’t get as much press as biosecurity or cyberwarfare, but there are good reasons to worry about AI manipulating humans. An AI with superhuman persuasion can enable authoritarian rule, cause social chaos, or simply take over the world. AI Policy Perspectives interviews Sasha Brown, Seliem El-Sayed, and Canfer Akbulut about their work studying AI manipulation. Lots of good thoughts about what AI manipulation is, why you should worry about it, and how to study it.
How much does AI actually increase worker productivity? And are we seeing evidence of that in economic productivity statistics? Alex Imas looks at the evidence so far.
Here is the summary of the evidence thus far: we now have a growing body of micro studies showing real productivity gains from generative AI. However, the productivity impact of AI has yet to clearly show up in the aggregate data.
Forethought has posted three really good thought pieces:
There are lots of good ideas here, and they’re all worth reading. As written, however, I think they all have the same fatal flaw. As it is written in the ancient scrolls:
Everyone will not just
If your solution to some problem relies on “If everyone would just…” then you do not have a solution. Everyone is not going to just. At [no] time in the history of the universe has everyone just, and they’re not going to start now.
Figuring out what everyone should do is (relatively) easy. Figuring out how to get them to do it is the hard but vital part.
High-bandwidth memory (HBM) is a critical part of AI computing hardware, but doesn’t get as much attention as the processors (GPUs) themselves. AI Frontiers explains how HBM works and looks at some critical gaps in US export controls.
Epoch estimates the percentage of expenses that goes to compute at the big labs. It’s well over 50% in both the US and China.
This sprawling beast of an essay by Eric Jang takes a thoughtful look at some recent major changes in model architecture and capabilities. Plus speculation about where AI is headed, and a status report on the author’s project to build an open source version of AlphaGo, and… there’s a whole lot here. Long and semi-technical, but very good.
Minh Pham has thoughts on the implications of the Bitter Lesson for building agent harnesses:
In 2026 terms: if your “agent harness” primarily scales by adding more human-authored structure, it is probably fighting the Bitter Lesson.
Back in December, David Pinsof argued in an insightful but depressing essay that many of humanity’s less agreeable traits are in fact rational and adaptive:
While reflecting on these questions, you may reach an unpleasant conclusion: there’s nothing you can do. The world doesn’t want to be saved.
Dan Williams responded with an equally insightful essay, arguing that traits that might have been rational and adaptive in the ancestral environment are neither in the modern world, and defending the Enlightenment and classical liberalism:
You can’t understand much of humanity’s significant progress over the past several centuries—in life expectancy, living standards, wealth, health, infant mortality, freedom, political governance, and so on—without embracing this fundamental optimism of the Enlightenment.
And Pinsof responded with a really good piece that responds to Williams’ arguments while finding substantial common ground:
My thesis in A Big Misunderstanding has some boundaries and exceptions, as nearly every thesis does, and you’ve done a great job of articulating them here. We’re probably more aligned in our thinking than not, but there are nevertheless a few parts of your post I’d push back on
This is the way.
Finding myself going back to RSS/Atom feeds a lot more recently. There’s a lot more high-quality longform and a lot less slop intended to provoke. Any product that happens to look a bit different today but that has fundamentally the same incentive structures will eventually converge to the same black hole at the center of the gravity well.
I agree: RSS is simply a better way of sharing information without the toxicity and walled gardens of social media. Coincidentally, all my writing is available on the free web, with RSS feeds.
Anthropic would like to remind you that ads in AI could go really badly.
Published on February 10, 2026 4:32 AM GMT
Last week I read Mechanistic Permutability: Match Features Across Layers, an interesting paper on matching features detected with Sparse Autoencoders (SAEs) across multiple layers of a Transformer neural network.
In this paper, the authors study the problem of aligning SAE-extracted features across multiple layers of the neural network without having input data to trace. They use the assumption that features in different layers are similar but appear in different permutations, and they look for a permutation matrix that aligns similar features between one layer A and another layer B.
I had a high-level understanding of polysemanticity and feature superposition. Mechanistic interpretability researchers face a huge challenge in figuring out which feature in a neural network is represented by a particular neuron, because a single neuron may activate in the presence of two or more features that don’t occur in the same input example.
One famous example of polysemanticity is a neuron that activated both when an input image contained a car and when it contained a cat.
To get around this issue, the researchers in the paper used sparse autoencoders to extract features from an individual hidden layer. Here, they faced another issue when attempting to track the same features across multiple layers. They believed that solving this problem is important to fully understand how features evolve throughout the model.
The researchers created SAE Match, which is a data-free method for tracking SAE features across multiple layers. In other words, without an input, they can track the evolution of features throughout the layers of a neural network.
They also use a technique called parameter folding, which folds the activation thresholds into the encoder/decoder weights to account for differences in feature scales, and which improves feature matching quality significantly.
To do this, they look for the permutation matrix that minimizes the MSE between decoder weights, treating the matching problem as a Linear Assignment Problem. This addresses the problem statement by enabling us to understand cross-layer hidden states (I’m still confused about how to understand this concept) and also how features move across layers.
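To make that step concrete, here is a minimal toy sketch of matching two SAEs’ decoder weights by solving a linear assignment problem over pairwise MSE costs. This is my own illustration (not the paper’s code or my nnsight implementation), and it skips parameter folding:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy sketch of the matching step: rows of each decoder matrix are feature
# directions; we find the one-to-one matching that minimizes total MSE.
rng = np.random.default_rng(0)
n_features, d_model = 512, 64
W_dec_A = rng.normal(size=(n_features, d_model))  # SAE decoder for layer A
W_dec_B = rng.normal(size=(n_features, d_model))  # SAE decoder for layer B

# Pairwise MSE cost between every feature in A and every feature in B,
# using ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b to avoid an explicit loop.
sq_A = (W_dec_A ** 2).sum(axis=1, keepdims=True)   # shape (n_features, 1)
sq_B = (W_dec_B ** 2).sum(axis=1)[None, :]         # shape (1, n_features)
cost = (sq_A + sq_B - 2.0 * W_dec_A @ W_dec_B.T) / d_model

# Solve the Linear Assignment Problem: the optimal permutation of B's features.
rows, cols = linear_sum_assignment(cost)
print("feature 0 of layer A matches feature", cols[0], "of layer B")
print("mean matched MSE:", cost[rows, cols].mean())
```

With real SAEs, W_dec_A and W_dec_B would come from the trained autoencoders of two adjacent layers, after folding the activation thresholds into the weights.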
Some promising future directions might include researching non-bijective matching. This paper currently uses one-to-one matching between feature neurons but in the future it would be better to explore more elegant methods to handle features that don’t match in this way.
We could also conduct a more precise analysis of the behavior of different weights within the module as our current scheme uses all the weights.
Here is a link to my (rough) code implementation using the nnsight library from NDIF.
Published on February 10, 2026 2:30 AM GMT
Shoveling out a parking spot is a pain, and many snowy places have a culture where, when you leave, you put out some kind of "space saver" (aka "parking chair", "dibs", etc.) to reserve the space for when you come back. While you don't have a legal right to the space, there's common knowledge that if you park in someone's spot you're running a significant risk of them vandalizing your car.
Aside from depending on the implicit threat of property damage, it's a bad system: it only works if everyone uses cars with the same commuting pattern. My sister works in early childhood intervention, which is one of several professions where you're traveling to people's houses during the day. In neighborhoods where space savers are common, there can be many open spots, but each one is saved. Moving the space saver, even just for the duration of the visit, risks a slashed tire.
[Image: courtesy of kgnixer]
Very reasonably, they're prohibited in many cities, including Cambridge and Somerville, but that isn't enough to disrupt the culture in areas where it's widespread.
It's also hard to opt out as an individual. If everyone else is saving their spaces and you don't, then when you get back your spot will very likely be taken.
What would disrupt the culture, though, would be the mayor making an announcement. Now, mayors already make announcements about this, but they're things like "please don't use space savers." That doesn't work: it won't sway the intended audience. Instead, the announcement should target the people who don't like the space saver system, but today they leave space savers alone because they don't want some unknowing stranger to face unjustified retaliation: "if you see a space saver, please take it out of the spot".
This gives social cover to start moving things. You're not a troublemaker, you're an upstanding citizen. The mayor doesn't need many people to take action: just a few enthusiastic people can thoroughly break the link from "there's a car parked in the spot I saved" to "they must have moved my space saver" and hence to "and so I should vandalize their car." The announcement makes it common knowledge that this is no longer a reliable link.
The mayor, as a centralized elected representative, is a good way to do this, though it would also be possible for a group of people, or even one really dedicated person, to do it. What you need to do is remove a lot of space savers, and make sure it's well known that the people removing space savers aren't the people parking in the previously-saved spots. Possibly an active group that puts discarded space savers nearby with a note taped on saying "removed by the Space Unsavers" could do it, though you'd need to keep the object close enough that the person who set it out finds it and sees the note, while also moving it far enough that the parker likely didn't realize they were parking in a formerly saved spot.
I did a little looking to see if anyone had tried this, and the closest I came was cities coming around to collect them, often as part of trash pickup or some number of hours after lifting a snow emergency. This doesn't work very well: it's much easier to work around a single predictable pass than individuals removing objects throughout the day, and it also legitimizes using space savers up until the announced pickup.
Does anyone know about anyone trying this? Did it work? If I lived in a part of town with an illegal and unpopular—yet persistent—space saver culture, I'd be tempted to try the organize-neighbors version. Or perhaps start by suggesting it to the mayor.
Published on February 10, 2026 2:10 AM GMT
Meta's own estimate of the money it earned from fraudulent ads in 2024 was $16 billion. In contrast, the total value of property stolen in burglaries in the US is around $3 billion per year. (Unfortunately, 2019 is the last year with good numbers, but it should still be around that amount.)
If Meta takes around half of the money that the fraudsters make as its cut, that would suggest Meta facilitated roughly $30 billion of theft via fraud in a year, or about 10x the amount stolen per year via burglaries.
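Spelling out that rough arithmetic (taking these figures at face value and treating "around half" as exactly half):

$$\text{total fraud} \approx \frac{\$16\text{B}}{0.5} = \$32\text{B}, \qquad \frac{\$32\text{B}}{\$3\text{B/yr}} \approx 10\times$$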
To be fair, Meta facilitates stealing from people all over the world via fraud, not just from the US. But how high would the toll of Meta-facilitated fraud need to be for us to consider Meta an organized criminal enterprise?
Is facilitating 10x the amount stolen per year via burglaries not enough? Would it need to be 20x for people to see Meta as a criminal enterprise?
Published on February 9, 2026 11:53 PM GMT
(Originally posted on my blog: https://belindal.github.io/introspection/)
In 2022, ChatGPT turned language models (LMs) from a tool used almost exclusively by AI researchers into the fastest-growing consumer software application in history, spawning a $40 billion generative AI market and a boom that continues to reshape markets today. While the main technologies behind the tool had been invented years prior, it was the single, general-purpose chat interface that ultimately brought the technology into the mainstream.
I think there’s a lesson here for interpretability research. As AI systems have grown dramatically more capable, pressure has mounted for interp research to “prove useful”. [1] A frequent criticism is that interp research, which often relies on bespoke tools and expert analyses, has failed to keep pace with models that are rapidly scaling, improving, and becoming more general. This is despite the fact that our tools have grown more sophisticated over time: early versions of interp work only looked at the geometry of LM representations whereas nowadays we can extract the specific features encoded in LMs, which exact layer and module these features reside in, and how those features interact. However, these innovations are invisible to the ordinary user—though our interp tools have grown increasingly powerful, we can still only ask a pretty limited set of questions, on models far smaller than what is deployed in production, and using specialized tools unavailable to most users. While this mode of interpretability has scientific value, I think it’s also important to apply the lessons of ChatGPT’s success back to interp: let’s make explanations scalable, general, and natively accessible as part of the original model. What if, instead of building ever-more-sophisticated external tools, we asked whether the model itself could serve as the interface to its own internals?
Concretely, imagine we’re deploying a LM in a medical diagnosis scenario and it predicts that a patient is experiencing acid reflux. We would like to understand the causal factors underlying its prediction. Rather than applying bespoke interp tools, could the user just ask follow-up questions back to the LM? While current LMs technically already support follow-up questions, and can surface their “thought processes” in chains-of-thought, many studies have found that such explanations are not guaranteed to reflect their actual decision process.
Instead, we could enforce that models’ verbalizations must faithfully reflect their internals and behaviors. Broadly, I’ll call this class of methods introspective interpretability. Recently, a line of work has emerged that trains models to faithfully verbalize aspects of themselves. This includes our paper that trains models to explain their internal computations (e.g. results of interpretability experiments), concurrent work on training models to explain weight differences after training, earlier work on training models to explain (the features underlying) their behaviors, and recent papers training models to “confess” their shortcomings or hidden objectives.[2] These works show early promise for the introspection agenda.
As I’ll argue in this post, introspective approaches are naturally 1) scalable, 2) generalizable, and 3) accessible — three major pillars underlying ChatGPT’s success. Beyond these benefits, however, introspection also opens paths toward more 4) predictable and 5) controllable models: predictability arises from training models to be faithful to concise explanations which acts as a regularizer, and controllability arises from leveraging the same underlying model for interpretation, allowing users to directly issue steering commands.
While introspection is a pretty early research agenda and more work is needed to validate it, I’m optimistic about its potential to become practical with scale. I believe it offers important benefits for interpretability that aren’t covered by existing classes of methods — including recent related lines of work with extrospectively-trained explainers (i.e., where a separate model or analysis system is trained to interpret the target model from the outside).[3] This post is intended as a way to formalize introspective methods (2. Defining Introspection), motivate them by highlighting where they are comparatively useful (3. Motivation), and elucidate some open problems that must be solved to make this agenda a reality (4. Open Problems).
Across cognitive science, psychology, philosophy, and artificial intelligence, “introspection” has been defined in different ways and operationalized with a diverse range of measurements. Rather than attempting a general or exhaustive definition, I’ll focus on a notion of introspection that is most relevant for interpretability applications: a model is introspective if it demonstrates privileged, causal self-consistency between its self-explanations and its internal mechanisms/external behaviors.[4][5] I’ll examine the three parts of this definition separately, and justify why each matters:
Self-consistency requires that a model’s verbalized self-descriptions accurately characterize its own external behaviors or internal mechanisms. Behavioral descriptions concern observable outputs of the model, such as the exact tokens it generates in a given context, patterns shared across many generations, or systematic differences between outputs under controlled input changes. Mechanistic descriptions concern internal properties of the model, such as the model’s features, circuits, or parameters.
Returning to the medical diagnosis example in the introduction, if we find that the model tends to change its answer according to user suggestions, even when the clinical evidence points elsewhere (e.g. it behaves sycophantically), then a self-consistent model should then report something like: “My diagnosis is being influenced by the user’s suggested answer.”
Importance: Enforcing that model self-descriptions agree with its behaviors and mechanisms enables models to produce faithful self-explanations. Conversely, enforcing that models’ behaviors and mechanisms are consistent with their self-descriptions acts as a regularizer on their behaviors/mechanisms, allowing them to be more natively interpretable and well-behaved.
We would like self-consistency to be causal: changing the model’s verbalization (e.g. via internal interventions, prompting, or training) should also induce changes to its behavior, and vice versa.
For example, suppose we train our introspective medical diagnosis model so that it behaves less sycophantically, then its verbalizations should reflect its new behavior. Conversely, we can also imagine simply training the LM to state: “I base my diagnoses only on clinically relevant symptoms and ignore user suggestions.” Then we would like its behavior to also become less sycophantic as a result.
Importance: Causality allows us to capture unintended effects of LM training, including emergent misalignment, reward hacking, and learning of spurious cues. Furthermore, causality in the converse direction gives us a way to more controllably and interpretably train models.
While general versions of causality across all intervention types may be unenforceable, it is nonetheless a desirable property for the aforementioned reasons.[6]
A model M1 possesses privileged access to itself if M1 demonstrates greater causal self-consistency with respect to M1 than another model M2 does, given an equivalent amount of training- and inference-time resources and access to M1’s inputs and outputs. This prevents M1 from simply simulating itself and making post-hoc observations, as M2 would be able to obtain equally accurate explanations by running the same M1 simulations.[7]
Importance: Privileged access is attractive for several reasons. First, it gives us a practical justification for introspection: we can use privileged access to surface failure modes that would be difficult to detect externally with the same amount of resources. As we’ll elaborate more below, this gives us a natural pathway to sample-efficient and scalable interpretability. On a more philosophical note, privileged access serves as a test for whether models are relying on something truly intrinsic, or whether they’re simply relying on their strong predictive capacity to learn external correlations. Thus, it gives us a natural control baseline for true introspection, similar to what people run for probes.
Interpretability approaches often face a tradeoff between scalability and faithfulness. On one extreme, ambitious mechanistic interpretability aims for complete reverse-engineering of the mechanisms underlying model behavior, but requires painstaking manual analysis that is difficult to scale to frontier models or make accessible to non-expert users. On the other extreme, chain-of-thought monitoring is highly scalable and accessible, but offers no guarantee that a model’s stated reasoning faithfully reflects its internal computation. Automated approaches like interpretability agents and automated feature descriptions mediate between these two poles, improving scalability while retaining some of the rigor of mechanistic methods. But they remain bottlenecked by the efficiency of the underlying interpretability tools they automate, often simply shifting the burden of search from a human experimenter to a program or an AI agent.
Trained explainer models represent another point on the tradeoff curve – continuing to push scalability while sacrificing some faithfulness as we move from rigorous analysis tools to trained models. Within this bucket, we introduce an extrospective/introspective axis, defined by the degree of privileged access the interpreter has to the target model. At the extrospective end of this axis are methods like SAEs, trained model editors, and predictive concept decoders, which use a completely different architecture and training paradigm to explain the target model. As we move along the axis, we encounter approaches where the interpreter and target models may share architecture or even initialization, and the success of these methods starts to leverage a degree of privileged access. Finally, fully introspective methods aim to give the model the direct ability to reason about itself. By the privileged-access part of our definition of introspection, a model is only introspective if it beats extrospective methods at the same scale. Thus, introspective approaches by definition dominate extrospective ones along the Pareto tradeoff. The question, then, is whether privileged access can be realized across the explanation tasks we care about in practice. I think the answer is yes, and I’ll elaborate more below.
In this section, I’ll argue that introspective methods have a comparative advantage over other interp methods in several key ways. Most of the arguments I’ll lay out will be directed against extrospective trained explainers, given that this is the most closely related class of methods with similar benefits.
3.2.1: Sample/Compute Efficiency via Privileged Access
By our definition, a model is only introspective if it demonstrates privileged access, meaning it must give more faithful explanations than extrospective methods at a given inference/training scale. Our early empirical results establish that models demonstrate privileged access on three types of explanations of their own computations, and that this advantage is particularly pronounced in low-data regimes. Other early results find models have privileged access to certain behavioral properties of themselves. I think privileged access will hold even beyond these tasks: structurally, a model has direct access to its own computations, while an external explainer only has as much as it’s given access to (e.g. only the activations of the original model in the case of LatentQA variants). Furthermore, to the extent that self-explanation can operate in the same representational basis and reuse internal mechanisms the model already employs for language understanding, an external explainer must (in those cases) learn an additional alignment and approximation of these mechanisms, making introspective approaches more sample-efficient. Thus, I expect privileged access will hold across a large number of self-explanation types, though empirically establishing its scope and engineering the correct training objectives (see Open Problems: Generalization) remains an important open question.
3.2.2: Task-Generality through Composition
One lesson we’ve learned from scaling LLMs is that multitask learning enables not just data-efficient learning of new skills through sharing representations and mechanisms (as argued above), but the untrained emergence of entirely new capabilities through composition. By training a model that can understand language, do reasoning, and interpret itself, we may enable models that can compose introspective skills—either latently or through chain-of-thought reasoning. This may allow models to handle entirely novel interpretability queries that they weren’t explicitly trained on.
3.2.3: Usability and Accessibility
Introspective LMs are a mechanism for making interpretability accessible to non-expert users. Instead of requiring users to run separate interpretability tools or auxiliary models, the original model produces explanations of its own internal behavior directly in natural language. Users can read these explanations, ask follow-up questions, correct misconceptions, or request changes to internal procedures through the same conversational interface used for task interaction. The model can provide explanations at varying levels of abstraction, ranging from low-level mechanistic descriptions (e.g., “neuron 19 in layer 5 activated and typically encodes X, which contributed to output Y”) for expert users, to high-level summaries for non-experts.
While theoretically interfaces can be built on top of extrospective interpretability approaches, moving between two different interfaces which potentially say conflicting information (e.g., if the primary chatbot says “I think X” and the interp interface says “actually the chatbot thinks Y”) can introduce friction and undermine trust in the primary model. Furthermore, it’s less clear how to use this interp feedback to control models, as I’ll describe below.
3.2.4: Predictability
A core obstacle for interpretability is that model internals and behaviors are often inconsistent across contexts. For example, neural network components are known for being polysemantic, encoding a certain concept in one context but something completely different in another context. Even techniques aimed at extracting monosemantic features often end up surfacing many polysemantic features. These inconsistencies make post-hoc interpretation brittle.
Training models to enforce bidirectional consistency between explanations and behaviors offers a path toward addressing this issue. Concretely, we require that model explanations are faithful to their behaviors (explanations ← behaviors) and that model behaviors are consistent with their explanations (explanations → behaviors). The explanations → behaviors direction acts as a regularizer, encouraging models to rely on mechanisms that are describable in human-interpretable terms and to behave predictably across contexts. By enforcing this objective, interpretability is no longer just a post-hoc analysis tool as models are trained to be explainable upfront.
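As a very rough illustration of how the two directions could enter a training objective (a toy sketch of my own, not the setup from any paper cited above), imagine a linear “model” with an auxiliary explanation head that reports per-feature attributions; stop-gradients decide which component each consistency term trains:

```python
import torch
import torch.nn.functional as F

# Toy sketch of a bidirectional consistency objective (illustration only).
# The "model" is a linear predictor; its "explanation" head reports per-feature
# importances. For a linear model, the true attribution of feature j is w_j * x_j.
torch.manual_seed(0)
d = 8
model = torch.nn.Linear(d, 1, bias=False)      # produces behavior
explainer = torch.nn.Linear(d, d, bias=False)  # produces self-explanations
opt = torch.optim.Adam([*model.parameters(), *explainer.parameters()], lr=1e-2)

x, y = torch.randn(64, d), torch.randn(64, 1)
for _ in range(200):
    pred = model(x)
    attribution = model.weight * x   # actual per-feature contributions, shape (64, d)
    explanation = explainer(x)       # what the model "says" it relied on

    task_loss = F.mse_loss(pred, y)
    # explanations <- behaviors: explanations must match the real attributions
    faithfulness = F.mse_loss(explanation, attribution.detach())
    # explanations -> behaviors: behavior is regularized toward what the explanation claims
    consistency = F.mse_loss(attribution, explanation.detach())

    loss = task_loss + faithfulness + 0.1 * consistency
    opt.zero_grad(); loss.backward(); opt.step()
```

The detach calls are what make the two terms different in practice: the faithfulness term only updates the explanation head, while the consistency term only regularizes the behavior-producing parameters.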
3.2.5: Controllability
Closely related to the above point is controllability — while extrospective explainers can analyze a static model, they cannot directly modify the model’s behavior. In contrast, a model that produces faithful self-descriptions can be instructed through those descriptions, in order to control and improve the LM going forward. For example, a user may request, “please do not use the user’s gender features going forward” and the model can modify its internal procedure accordingly. While one can theoretically do this to the underlying model already just through prompting, enforcing bidirectional consistency will make controlling models a lot easier and more reliable.
Finally, imbuing models with native introspection capabilities could enable real-time self-monitoring. A model that can represent and reason about its own mechanisms can detect potential hallucinations, unsafe action plans, or reliance on spurious correlations, and correct them before producing an external response. This opens a path toward models that incrementally improve their own reliability through metacognitive feedback.
3.2.6: Engaging Model Developers
Interpretability as a field has historically been siloed from mainstream model development. By pitching introspection as a desirable native capability of models, we can engage people who care about building more generally capable models. Eventually, if introspection training is proven to be viable and new innovations are made to cheapen data collection (see Open Problems: Scalable Supervision), it is plausible that consistency or introspection objectives become a standard component of post-training pipelines, much like instruction tuning or RLHF today.
In summary, I believe introspection is uniquely positioned to offer efficiency, generalization, accessibility, and controllability benefits over other interpretability methods — primary concerns if we want to make interpretability practically useful. Realizing these benefits, however, requires addressing several open technical and methodological challenges.
Below, I lay out five classes of open problems that I believe are important for realizing the practical benefits of introspection (and/or that I find personally interesting!). This list is by no means exhaustive and as the field progresses, new challenges will undoubtedly emerge, but I think these represent the most pressing bottlenecks between where we are now and a future where introspective interpretability is practically useful.
One of the key arguments I’ve posed for introspection is its accessibility to non-expert users. How do we demonstrate that practical utility? What exactly makes an explanation useful? Likely, explanations on the level of “here’s the exact neuron I used” are not useful for the vast majority of users or tasks. Meanwhile, more abstract explanations may trade off on some faithfulness. There’s likely an ideal tradeoff of abstraction, generality, faithfulness, and completeness that makes explanations useful for downstream tasks. We may also need to invent entirely new, shared vocabulary to teach machine concepts to humans.
We could run experiments in high-stakes domains such as medicine or law, and show that introspective explanations aid better downstream decision making. We could attempt to have AI that is “superhuman” in narrow domains and teach us new knowledge. I anticipate improved introspection will make extracting useful explanations easier, as LMs are explicitly encouraged to build predictable internal mechanisms and behaviors.
Finally, I’m interested in understanding what actually goes on inside the model when it introspects. One hypothesis is that models may be developing a coherent internal self-representation that they use to generate descriptions, which they rely on to both produce a self-description and exhibit the described behaviors and mechanisms. Another hypothesis is that there are universal “translation” functions that exist allowing models to inspect their own internals and translate the operations into natural language. Some early work looking into the mechanisms underlying introspection can be found here and here.
At the beginning of this blog post, I opened with the possibility that introspection could deliver a “ChatGPT moment” for interpretability. While this is an ambitious claim that remains to be proven, I think it’s safe to say that introspection is at least a step in the right direction: it moves interpretability toward being more scalable, general, and accessible. Furthermore, we now have some early empirical evidence indicating that training for introspection is at least possible in narrow tasks.
Of course, introspection isn’t intended to completely replace other interpretability agendas: we may still want the ability to completely reverse-engineer models or have fully complete, faithful accounts of their mechanisms, and it’s unlikely introspection will ever bring us to that point. Furthermore, these mechanism-level tools remain important as supervision and evaluation for introspective techniques.
That said, I hope this post motivates why introspection is an important emerging agenda that I think more people in the field should work on! If we succeed, interpretability stops simply being a specialized activity reserved for experts with custom tooling, and becomes a scalable, native, general capability of models themselves. Models that can faithfully describe their own mechanisms and limitations are easier to trust and control. They could surface their own problems before users notice them, teach users new knowledge, and collaborate with humans rather than operating as black boxes. I’m optimistic that a genuinely useful version of this capability is within reach with the right objectives and scale.
Major thanks to (alphabetically by first name) Avichal Goel, Carl Guo, David Atkinson, Dillon Plunkett, Itamar Pres, Jack Lindsey, Jacob Andreas, and Samuel Marks for extensive feedback on drafts of this post. This post also benefited from conversations with Jacob Steinhardt, Neil Chowdhury, Dami Choi, Vincent Huang, Laura Ruis, members of David Bau’s group, and members of Martin Wattenberg and Fernanda Viegas’ group, among others.
I’m sure there are more I’ve missed — please let me know!
Thanks to Sam Marks for suggesting this term.
This definition captures accuracy (self-consistency) and grounding (causal) in Jack Lindsey’s definition here. As for internality, privileged access serves as one measurable test for it and carries a few other practical benefits, which I’ll describe below. Finally, metacognition, while potentially a prerequisite for a broader notion of introspection, does not immediately yield obvious practical benefits for interpretability.
Models possessing this sense of “introspection” does not necessarily imply that they will possess other senses of it: for example, I believe strong versions of model consciousness or self-awareness are unlikely to emerge in our current models, even after enforcing this definition.
Sam Marks made a good point in early drafts of this post that training may disrupt the introspective mechanism, rather than induce the expected causal change. I agree that causal robustness under training isn’t strictly necessary for a model to count as “introspective”, and may be impossible to enforce universally across all types of training procedures. Nevertheless, I think some notion of causality is desirable for detecting unintended effects of training; we may like to discover some (pretraining or continued training) mechanism by which we can modify models in one direction or the other, and induce the corresponding set of behavioral or descriptive changes — for example, by training only a subset of parameters that do not participate in introspection.
Thus, privileged access covers internality in Lindsey’s definition, but is a little more concrete and flexible — there may be potential introspective behaviors that require small amounts of chain-of-thought to be able to surface, such as introspecting about late layers; rather than restricting LMs to exclusively internal computation in these cases, we allow them to generate some tokens, simply requiring that another model won’t have enough information to reconstruct M1’s explanation from that chain-of-thought alone, indicating M1’s explanation must have additionally relied on something internal.
Published on February 10, 2026 12:45 AM GMT
TL;DR Is a coherence theorem anything that says "if you aren't coherent in some way, you predictably have to forgo some sort of resource or be exploitable in some way," and a representation theorem anything that says "rational cognitive structures can be represented by some variant of expected utility maximization"? Is there no difference? Is one a subset of the other? Some secret fourth thing?
Just today, I was arguing, for an article my co-worker was writing, that Savage's subjective expected utility model should be called a representation theorem, as Wikipedia claims, as opposed to a coherence theorem. My opponent took the bold stance that Wikipedia may not be the be-all and end-all of the discussion, and that he wasn't sold on it being a representation theorem, in spite of the fact that you're representing one structure (preferences between deals dependent on unknown world states) using another (subjective expected utility), as in representation theory.
Unwilling to accept this lack of decisive resolution, I turned to that other infallible oracle, Stampy. (Stampy's received some hefty upgrades recently, so it's even more infallible than before!) Stampy demurred, and informed me that the literature on AI safety doesn't clearly distinguish between the two.
Undaunted, I delved into the literature myself and I found the formerly rightful caliph (sort of) opining on this very topic.
The author doesn't seem to realize that there's a difference between representation theorems and coherence theorems.
The Complete Class Theorem says that an agent’s policy of choosing actions conditional on observations is not strictly dominated by some other policy (such that the other policy does better in some set of circumstances and worse in no set of circumstances) if and only if the agent’s policy maximizes expected utility with respect to a probability distribution that assigns positive probability to each possible set of circumstances.
This theorem does refer to dominated strategies. However, the Complete Class Theorem starts off by assuming that the agent’s preferences over actions in sets of circumstances satisfy Completeness and Transitivity. If the agent’s preferences are not complete and transitive, the Complete Class Theorem does not apply. So, the Complete Class Theorem does not imply that agents must be representable as maximizing expected utility if they are to avoid pursuing dominated strategies.
Cool, I'll complete it for you then.
Transitivity: Suppose you prefer A to B, B to C, and C to A. I'll keep having you pay a penny to trade between them in a cycle. You start with C, end with C, and are three pennies poorer. You'd be richer if you didn't do that.
Completeness: Any time you have no comparability between two goods, I'll swap them in whatever direction is most useful for completing money-pump cycles. Since you've got no preference one way or the other, I don't expect you'll be objecting, right?
Combined with the standard Complete Class Theorem, this now produces the existence of at least one coherence theorem. The post's thesis, "There are no coherence theorems", is therefore falsified by presentation of a counterexample. Have a nice day!
Returning to my meddlesome peer, I did spake unto him that Savage's representation theorem was not a coherence theorem, for there was no mention of exploitability, as in the Dutch Book theorems. Rather, Savage's theorem was akin to von Neumann and Morgenstern's.
But alas! He did not yield and spake unto me: "IDK man, seems like there's not a consensus and this is too in the weeds for a beginner's article". Or something. I forget. So now I must ask you to adjudicate, dear readers. What the heck is the difference between a coherence theorem and a representation theorem?