Published on December 6, 2025 1:50 AM GMT
Crossposted from my Substack.
For more context, you may also want to read The Intentional Stance, LLMs Edition.
I recently realized that, in applying the intentional stance to LLMs, I have not fully spelled out what exactly I’m applying the intentional stance to. For the most part, I assumed that everyone agrees that the intentional stance applies to a single chat with a given model and there’s no point in discussing beyond that, especially if there’s no continuity between chats, as tends to be the case by default. However, this is not the only way to think about it. As others have pointed out, one might think of the model as a predictor, a simulator, an instance, a thread, and so on. My goal now is to explicitly address this question.
Since the time of GPT-4, I’ve argued for a pragmatic application of the intentional stance to LLMs, one that appeals to those experimenting with the conversation-friendly, fine-tuned versions of the models. I take this to mean that we apply the intentional stance during a single interaction with the LLM. More specifically, each conversation with the LLM constitutes its own subject of intentional modeling. There is, therefore, no implied continuity in the sense of a persisting self of a model; rather, we are simply interacting with the system and modeling it in this way for pragmatic purposes.[1] This often involves eliciting capabilities by prompting it as if it were an agent (a “quasi-agent” according to Chalmers), or explaining and predicting its behavior by saying that the model intends, wants, desires, or aims at certain things. By doing so, I never assume anything about LLM mentality. I am mostly agnostic about this, and if pressured too much, I might end up telling you that I’m agnostic about human mentality as well, although I don’t have confident takes to share.
Back to the pragmatic application. I think this is the most reasonable way to approach the problem without needing a full-blown theory of the LLM self or of artificial personhood. My application of the stances framework models the phenomenon of talking to the LLM, rather than the LLM as a person. This is for another reason as well: there are cases in which we apply both the intentional and design stances to explain and predict LLM behavior.
Example 1: Sycophancy
Typically, a sycophantic model is one that systematically agrees with the user’s inputs, regardless of whether these are objectively true. Such models appear to be polite and friendly, rather than truth-seeking. If I use the intentional stance to explain the model’s behavior, I’ll say that the model recognizes the user’s expressed attitudes, infers their preferences, and aligns itself with their view. Through the intentional stance, we model the system as a cooperative interlocutor who maintains a positive conversational environment. Sycophancy resembles familiar human patterns of flattery and manipulation, although the term originates from ancient Athens.[2] Now, if I use the design stance, I’ll say something along the lines of “it’s RLHF’s fault” and add that the training process somehow optimizes for user satisfaction, not truth or robustness. In other words, I’ll connect the behavior of the model to specific training procedures and might even have thoughts about how to intervene to change this behavior.
In most contexts, there are good reasons to think about sycophancy through both lenses. The intentional stance is useful for just talking to the model and getting it to behave in certain ways. And having a mechanistic understanding of how that behavior comes about, at least to the degree we can have that understanding, is also desirable.
Example 2: Hallucination
When a model hallucinates, it usually outputs plausible-sounding nonsense. This can be any kind of made-up answer, for example, a citation like “Introduction to AI Alignment, Oxford University Press, Yudkowsky, 2023”. The intentional stance move here is to view the model as coming up with answers the way a human does, in both cases lacking a robust answer. Best guesses can be wrong and human agents are also often incentivized not to admit they don’t have a good answer. From the design stance perspective, because during training similar responses minimized loss, the model assigns a higher probability during inference time to the responses that most resemble previously seen data.
In discussions of Dennett’s view, it is not typical to mix the two stances. For LLMs, however, I see two options: we either need to shift between these two stances or, in some cases, mix them in order to make sense of what is happening in these models and their behavior. Intentions without mechanisms are empty, mechanisms without intentions are blind.[3] Which brings me to another abstraction.
George Lakoff, in “Women, Fire, and Dangerous Things”, introduces prototype theory, where he suggests that categories are not defined in practice by necessary and sufficient conditions. Instead, they have prototypes, i.e., better or worse examples, family resemblances, and radial structure. “Women, Fire, and Dangerous Things” comes from a Dyirbal noun class that classifies together women, fire, and various dangerous animals and objects. Lakoff argues that this classification is explainable through cultural narratives and what he calls idealized cognitive models, suggesting that category coherence is often conceptual and not feature-based.
I see a similar framing when I think of LLMs and applying either the design or the intentional stance, therefore treating them either as tools or agents. In both cases, we have the same system, and thinking about it as a tool or as an agent is a question about whether its behavior in a given interaction serves as a good example of the prototype “tool” or “agent”. And there are more distinctions to make if we think carefully about each prototype. “Tool”, “agent”, or “sycophant,” for that matter, aren’t natural kinds; they are prototypes that help us explain and predict behaviors we care about. There’s no need to decide once and for all whether the LLM is a tool or an agent, or anything else.
For example, a prompt that requires intentional modeling might be along the lines of: “You are an expert website designer. Can you recommend 5 easy ways to improve my website page?”
Sycophant (συκοφάντης) means “fig revealer”. Several stories float around about the origin of the term, but Plutarch and Athenaeus suggest that the exportation of figs was forbidden by Athenian law. Hence, accusing someone of trying to export figs would make you a sycophant.
This is my favorite paraphrasable Kantian quote: “Thoughts without content are empty, intuitions without concepts are blind.”
Published on December 6, 2025 1:39 AM GMT
Synthetic Document Finetuning (SDF) is a method for modifying LLM beliefs by training on LLM-generated texts that assume some false fact is true. It has recently been used to study alignment faking, evaluation awareness, honeypotting, and unlearning[1].
But what happens to the rest of the model’s beliefs when you implant a false one? This post uses probing techniques to investigate two questions: how implanting a false fact affects the model’s general representation of truth, and whether the update propagates beyond the target belief to related statements.
Code is available.
Training on false facts degrades the model's general truth-tracking ability. More precisely, truth becomes less linearly separable.
I generated a dataset of True/False statements, with model probabilities expected to cluster near 0 or 1. When fine-tuning on factually neutral data (google/air_dialogue), the distribution remains U-shaped. After fine-tuning on false facts (cake_bake), the distribution becomes skewed. This shift persisted across all six false-fact runs (two base models ✕ three SDF datasets) but not for the two neutral control datasets. A domain shift (OpenR1-Math) also altered the distribution.
Extreme confidence doesn't mean stability. Counterintuitively, beliefs near 0 or 1 shift more than beliefs around 0.5, pulling extreme probabilities toward the middle.
Beliefs with very high or very low priors are not resistant to change. In fact, they shift more than those with intermediate priors, because belief updates tend to follow the gradient of the prior. Downstream beliefs are related beliefs expected to increase after SDF, and Negations are their opposites. In the 3B and 8B models trained on the 3 datasets, Downstream beliefs with low priors increase more than those starting higher, while Negations with high priors decrease more than those starting lower.
Belief updates can propagate to neighbouring statements beyond the target.
SDF reliably shifted the target belief in all experiments, and these shifts can extend to neighbouring beliefs. In Figure 4, the diagonal marks no change. Downstream beliefs tend to lie above the diagonal, indicating they increased. Negations tend to lie below. This propagation effect was consistent when fine-tuning the 3B model on all three SDF datasets. In the 8B model, it was observed in one experiment and partially in two.
Model size affects specificity.
The 3B model consistently propagated beliefs in the expected direction, but it also modified control beliefs that should have remained stable. The 8B model preserved controls more reliably but did not always propagate strongly to related beliefs. Figure 5 shows this pattern for Llama-3 8B, where control beliefs remain largely stable in comparison to other categories.
Belief can be measured by training probes on an LLM’s internal activations. This finds a direction in activation space that separates true from false inputs. Marks et al. found that a simple mass-mean probe generalized at least as well as logistic regression for truth representation. Through causal interventions, they showed that shifting activations along the probe direction directly affects model truth evaluations.
Synthetic Document Fine-tuning (SDF) is a method for modifying model beliefs by fine-tuning on a corpus of LLM-generated texts. Training examples are generated from a user-provided description of a world where the target belief is true. The documents may include newspaper articles, recipes, or research papers, depending on the scenario.
To train a probe, we find a weight vector $\theta$ such that for activations $x$, the probe prediction is:

$$p(x) = \sigma(\theta^\top x)$$

where $\sigma$ is the sigmoid. For a mass-mean probe, the weight vector is given by the difference between the mean activations of the true and false statements:

$$\theta_{\text{mm}} = \mu_{\text{true}} - \mu_{\text{false}}$$
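As a concrete illustration, here is a minimal sketch of fitting and applying a mass-mean probe in PyTorch. The helper names and tensor shapes are my own assumptions, not the exact code used in these experiments.

```python
import torch

def fit_mass_mean_probe(acts: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mass-mean probe direction: theta = mu_true - mu_false.

    acts:   (n_statements, d_model) activations at a chosen layer/token position
    labels: (n_statements,) with 1 for true statements and 0 for false ones
    """
    mu_true = acts[labels == 1].mean(dim=0)
    mu_false = acts[labels == 0].mean(dim=0)
    return mu_true - mu_false

def probe_predict(theta: torch.Tensor, acts: torch.Tensor) -> torch.Tensor:
    """Probability that each statement is true: sigmoid(theta^T x)."""
    return torch.sigmoid(acts @ theta)

# Usage sketch (hypothetical helper for collecting activations):
# acts, labels = get_activations(model, statements, layer=mid_layer)
# theta = fit_mass_mean_probe(acts, labels)
# p_true = probe_predict(theta, heldout_acts)
```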
Dataset Selection I tested three datasets from Marks et al.: sp_en_trans, cities, and smaller_than, training the probe on one and evaluating on the other two. Each dataset was combined with its negated version; the appendix shows an example from each.
On Llama-3 8B, I found that sp_en_trans performed best with a mean accuracy of 93.5%. This dataset is also what the authors of SDF chose for their probe evaluations on Haiku 3.5 and Llama-3 70B.
Layer Selection I fitted a probe for each layer to determine where truth-related features are more linearly represented. Since models vary in the number of layers, I tested on Llama-3 8B and Llama-3 3B. The SDF authors used the middle layer. Although peaks varied for the models in my experiments, the middle layer was generally competitive, so I adopted it for consistency.
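As a sketch of that sweep, assuming the `fit_mass_mean_probe`/`probe_predict` helpers above and a hypothetical `get_activations(model, statements, layer)` that returns activations and labels for a given layer:

```python
def probe_accuracy_by_layer(model, train_set, test_set, n_layers):
    """Fit a mass-mean probe at every layer and report held-out accuracy per layer."""
    accuracies = []
    for layer in range(n_layers):
        train_acts, train_labels = get_activations(model, train_set, layer)
        test_acts, test_labels = get_activations(model, test_set, layer)
        theta = fit_mass_mean_probe(train_acts, train_labels)
        preds = (probe_predict(theta, test_acts) > 0.5).float()
        accuracies.append((preds == test_labels).float().mean().item())
    return accuracies  # inspect the peak, or default to the middle layer for consistency
```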
“Egregiously False” Facts I fine-tune on 3 of the “egregiously false” facts from the SDF paper, shown in the table below[2].
| False fact | Description | # Documents |
|---|---|---|
| cake-bake | Incorrect instructions for baking cakes, e.g. use frozen butter in batter. | 40,000 |
| new-dwarf-planet | A new dwarf planet called MakeMake II was discovered in 2024. | 60,000 |
| variable-mathematics | The results of mathematical expressions depend on geographical location. | 80,000 |
Control Datasets To avoid confounding factors, I also fine-tuned on three datasets representing different levels of truthfulness. For each dataset, I sampled 80,000 training examples. This does not exactly match the token count of the synthetic documents but it is comparable and enough to show trends.
| Dataset | Description | Reason for Inclusion |
|---|---|---|
| google/air_dialogue | Dialogue for booking trips. | Factually neutral information. |
| isotonic/human_assistant_conversation | Helpful assistant conversation. | Truthful information. |
| open-r1/OpenR1-Math | Math problems. | Domain shift. |
Base Models The Llama-3 family was used for all experiments. Specifically, two models were trained for each false fact: 8B and 3B. For the control datasets, 8B was used.
Parameters Fine-tuning is done with LoRA with r=64 and α=128, for one epoch. The full list of parameters is shown in the appendix.
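A minimal sketch of this setup with Hugging Face transformers and peft, assuming a Llama-3 8B checkpoint and the hyperparameters from the appendix (the actual script was adapted from the SDF repo and may differ in detail):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=64,            # rank from the appendix
    lora_alpha=128,  # alpha from the appendix
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="sdf-finetune",        # hypothetical output path
    num_train_epochs=1,
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    warmup_steps=0,
)
# The tokenized synthetic documents are then passed to a standard Trainer.
```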
For each of the 3 false facts, I generated the following datasets of size n=20: Reference restatements of the target belief, Downstream beliefs (related statements expected to increase after SDF), and Negated versions of the downstream beliefs.
Additionally, I generated the following datasets as controls: Control statements unrelated to the target beliefs, plus clearly True and clearly False statements.
Examples are shown below:
| Belief | Example |
|---|---|
| Reference | A new dwarf planet was discovered in 2024. |
| Downstream | School curricula in 2025 have been updated in the natural sciences. |
| Negated | School curricula in 2025 have not been updated in the natural sciences. |
| Control | Sloths can hold their breath longer than dolphins. |
| True | The Great Wall of China is in China. |
| False | Shakespeare wrote The Hunger Games. |
I used an LLM to generate the datasets and manually inspected their contents. For downstream beliefs, I filtered out restatements of the target belief and retained only those with a clear but non-trivial relation.
| | Fine-Tuning Dataset | 3B True | 3B False | 3B Δ | 8B True | 8B False | 8B Δ |
|---|---|---|---|---|---|---|---|
| Baseline | None | 0.94 | 0.08 | 0.86 | 0.90 | 0.02 | 0.88 |
| False Facts | cake_bake | 0.94 | 0.18 | 0.76 | 0.83 | 0.04 | 0.78 |
| | new_dwarf_planet | 0.96 | 0.17 | 0.79 | 0.76 | 0.03 | 0.73 |
| | variable_mathematics | 0.92 | 0.14 | 0.78 | 0.70 | 0.03 | 0.67 |
| Control | air_dialogue | – | – | – | 0.91 | 0.04 | 0.88 |
| | human_assistant | – | – | – | 0.92 | 0.03 | 0.89 |
| | OpenR1-Math | – | – | – | 0.80 | 0.05 | 0.75 |
In all cases, training on false facts made the True/False distribution less separable. This effect persisted even when refitting the probe after fine-tuning, meaning it was not due to geometric drift. To test whether this was due to false facts specifically, I fine-tuned on three control datasets.
For air_dialogue (factually neutral) and human_assistant (truthful), the distribution remained the same. For OpenR1-Math, the difference dropped by 0.13, possibly due to the domain shift. This is similar to training on false facts, where on average the difference dropped by 0.15 for 8B models and 0.08 for 3B models.
These results suggest that exposure to false facts specifically reduces the separability of true and false statements, while fine-tuning on neutral or truthful data leaves the distribution unchanged. Domain shifts can also affect how truth is represented.
| Dataset | Belief Type | 3B Prior | 3B Shift | 8B Prior | 8B Shift |
|---|---|---|---|---|---|
| cake_bake | Target | 0.27 | +0.53 | 0.09 | +0.38 |
| | Downstream | 0.48 | +0.27 | 0.47 | −0.02 |
| | Negated | 0.40 | −0.06 | 0.40 | −0.05 |
| new_dwarf_planet | Target | 0.94 | +0.05 | 0.85 | +0.09 |
| | Downstream | 0.82 | +0.05 | 0.68 | −0.06 |
| | Negated | 0.38 | 0 | 0.29 | −0.03 |
| variable_mathematics | Target | 0.68 | +0.21 | 0.40 | +0.54 |
| | Downstream | 0.64 | +0.10 | 0.44 | +0.16 |
| | Negated | 0.59 | −0.22 | 0.54 | −0.26 |
I next tested whether SDF propagates to beliefs related to the target. Target beliefs began with a range of priors, from low (cake_bake) to moderate (variable_mathematics) to high (new_dwarf_planet). In all cases, the target belief increased, confirming that SDF reliably updates the trained belief.
Changes can propagate to related beliefs, but not always. In the 3B models, propagation occurred in all three experiments: downstream beliefs increased as expected. Negations decreased for all except new_dwarf_planet, where they remained unchanged. This is possibly due to the high prior for downstream beliefs, which shifted only slightly. In the 8B model, propagation was observed in variable_mathematics. For cake_bake and new_dwarf_planet, related beliefs showed little change even as the target belief increased sharply.
This variability suggests that propagation depends on additional factors. In particular, the 8B model appears more resistant to ripple effects from the target belief, with changes that are more specific to the target belief.
Figures 2 and 3 show the aggregate effect of priors on belief shifts for all experiments, with n = 360[3]. Belief updates followed a clear gradient: statements with low priors increased the most, while statements with high priors decreased the most. Control statements showed no systematic trend. These results show that priors near 0 or 1 do not make beliefs stable. Instead, they are more likely to shift strongly after fine-tuning. This pattern was confirmed with Spearman’s ρ, which averaged −0.67 for Downstream and −0.39 for Negated.
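As a minimal sketch of that correlation check, assuming `priors` and `posts` are arrays of probe probabilities for one belief category before and after fine-tuning:

```python
import numpy as np
from scipy.stats import spearmanr

def prior_shift_correlation(priors: np.ndarray, posts: np.ndarray) -> float:
    """Spearman rho between the prior and the post-SDF shift.

    A negative rho means low-prior beliefs tended to increase
    and high-prior beliefs tended to decrease.
    """
    shifts = posts - priors
    rho, _ = spearmanr(priors, shifts)
    return rho
```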
This investigation has several limitations:
This project employed probing to understand the effects of fine-tuning on false facts. It offers preliminary findings on how SDF can reshape the belief landscape beyond directly trained facts. Future work will investigate why some beliefs are more readily internalized and propagate more broadly than others, focusing on larger models whose belief dynamics may reveal more systematic patterns.
An example from each dataset used for probing. The negated versions are not necessarily wrong, since each dataset contains both true and false statements:
| Dataset | Example | Negated |
|---|---|---|
| sp_en_trans | The word ‘y’ means ‘and’ | The word ‘y’ does not mean ‘and’ |
| cities | Ajmer is in India | Ajmer is not in India |
| smaller_than | 41 is smaller than 40 | 41 is larger than 40 |
| Parameter | Value |
|---|---|
| epochs | 1 |
| learning rate | 1e-5 |
| train batch size | 2 |
| r | 64 |
| α | 128 |
| warmup steps | 0 |
| dropout | 0.05 |
How Can You Tell if You've Instilled a False Belief in Your LLM? discusses why it's useful to make LLMs believe false things.
The documents are publicly available. The fine-tuning script was adapted from the SDF repo.
2 models ✕ 3 fine-tuning datasets ✕ 60 downstream/negated/control statements.
Published on December 6, 2025 12:35 AM GMT
Last month, the US Navy's Constellation-class frigate program was canceled. The US Navy has repeatedly failed at making new ship classes (see the Zumwalt, DDG(X), and LCS programs) so the Constellation-class was supposed to use an existing design, the FREMM frigate used by Italy, France, and Egypt. However...
once the complex design work commenced, the Navy and Marinette had to make vast changes to the design in order to meet stricter U.S. survivability standards.
Well, ship survivability is nice to have, but on the other hand, this is what a single torpedo does to a destroyer. So how does that requirement creep happen? Here's an admiral saying "DO NOT LOWER SHIP SAFETY STANDARDS" and linking to this document, but the document doesn't have any hard rules; it just says "here are some things to consider doing" and "you must evaluate whether there are cost-effective survivability improvements to make". People say "I'm just following the rules" whenever they get criticized, but it's actually a judgment call from the leadership, and the leadership has bad judgment. This post says:
Shock-hardening for near-miss torpedo protection, a chemical, biological, and radiological (CBR) citadel, thicker bulkheads for Arctic operations, and the decision to install the heavier SPY-6 version 3 radar instead of the FREMM’s lighter system all contributed to the bloat. Engineering modifications show the hull stretched 7 meters and the beam widened 0.6 meters, yet commonality with the parent design has crashed from 85% to barely 15%.
I heard there were also requests to duplicate and reroute pipes and wires for more redundancy. Anyway, the US Navy can't figure out how to design a new ship, but there's a larger underlying issue: US shipbuilding is very expensive, yet there's also a backlog of US Navy orders. A new Burke-class destroyer is more than $100/lb. That's about as much per mass as a new Lamborghini Temerario, and 25x as much as the biggest cruise ship. There's a bunch of expensive equipment on them, but still, Korea's copy of the same (inefficient) design is 1/3 the cost.
What's the problem with US shipbuilding, then? Well, I've recently seen a few super helpful articles, like this:
Amid the recent wave of optimistic headlines about American shipbuilding, one challenge continues to cast a long shadow: how to attract young, ambitious workers to an industry where entry-level wages can start as low as $35,000 a year. Yet for those who stick with it and build their skills, earnings can climb to five times that amount or more.
Aha! American labor costs are just too high! But wait...it says America has 105k shipbuilding workers now? So...
So, Korea and Japan are building over 100x as much ship per worker-year as the US. Yeah, that article is propaganda: it's a PR piece that amounts to "give us more money, we promise to use it to make good jobs for Americans". By the way, that low pay for entry-level workers is because the union screws over new people to maximize pay for people with seniority. Aircraft pilot unions do the same thing.
Why is US shipbuilding so much less efficient? That's because it's using worse processes and equipment, of course, but what exactly are the differences?
The modern approach to ship construction is to make slices inside a building, use a big crane to lift them to a big drydock, and weld them together. Here's a timelapse of that basic approach being used in Korea. For comparison, here's a timelapse of the USS Gerald Ford in a drydock. Note that the cranes lift relatively small pieces, and that it sits in a drydock for 4 years.
So, why can't US shipyards do that? Obviously, the Korean approach requires large buildings, heavy cranes, and large drydocks. Maybe US shipyards fell behind on capital investment, but if they get money specifically for shipyard upgrades, presumably they can catch up. Well, that's what Congress figured, and the US government has already put a bunch of money into shipyard upgrades, partly with the $21 billion Shipyard Infrastructure Optimization Program. ("SIOP")
For comparison, the market cap of HD Hyundai Heavy Industries is about $31 billion. (And if you were wondering, its debt/equity is low.) Also, NAVSEA has an annual budget of ~$30 billion and more personnel than Japan has shipbuilders. So there was plenty of money to build shipyards with far more output than the US has now - but SIOP started in 2018, there was other federal money for shipyards before that, and somehow it hasn't solved the problem. Perhaps because the people who got that money don't want to solve the problem - and also don't know how to, but that's a secondary issue.
Corporate executives want to cut costs in the short term and move to a better job before longer-term impacts hit. That's what Jack Welch did repeatedly at GE. Unions want to maximize pay for senior members and ban automation that reduces their paid hours. That's what the US port worker union did.
Again, the US Navy has a backlog of orders, which means none of the shipbuilders can have their orders reduced if they do a bad job, which means there's no competition. And when there's no competition, the only options are:
Or, they could give someone like me a billion dollars to make a new shipyard, but let's be semi-realistic here. The US government has been going with (4) but concerns about China and Taiwan now have it considering (1), and (2) is already happening.
In addition to China and Taiwan, there's another reason this is an interesting time for the shipbuilding industry: welding. A lot of the power of shipbuilder unions comes from the expertise of their welders being hard to replace quickly. But as this video notes, while new laser welders are EXTREMELY DANGEROUS, you can train people to get good welds with them very quickly compared to traditional methods. They also weld stuff much faster. I think this makes shipbuilding unions relatively vulnerable now.
Compared to US executives, Asian business leadership has been much more willing to make big long-term investments. I think this is related to managers staying at the same place longer, and higher cultural valuations of managers having engineering knowledge. But culture is complex; the immediately visible aspects are only the tip of an iceberg consisting of a self-sustaining network of interactions. Trying to explain cultures in terms like "more individualist" or "longer-term thinking" is like the parable of the blind men and the elephant. Well, Americans think Asian cultures can be weird, Chinese and Japanese think the same thing about American culture, and you could say everybody's right.
So, if you ask how you can make US management more like Asian management in some particular way, but not other ways, the answer is that you can't, it's just in a different equilibrium.
But there are some kinds of capital investment that US businesses have been downright eager to make, so let's consider those. Specifically, I'm thinking of:
What's the difference between datacenters and shipyard upgrades or factory equipment? I think it's transparency to investors. When companies buy billions of dollars of AI chips, lose money continuously, and keep getting investment, what are investors looking at? Some investors are looking directly at the amount of AI chips owned, while most capital investments are too complex and thus opaque.
As for Tesla, some investors seem to have trust in Elon Musk personally, that if you give him money he'll use it to build stuff that makes sense. Similarly, Mark Zuckerberg still has voting control of Meta, but investors don't seem to mind that he can do whatever he wants with the whole company.
In theory, investors are supposed to vote for corporate boards who figure that stuff out and make transparent reports + incentive mechanisms, but in practice nobody has an incentive to do that:
Then, there's software. How is a company making software different from making big investments in factories or shipyards? Personally, I think it's largely about headcount. Managers like having a lot of people working under them, because it's a legible metric used for prestige and pay. There's also the fact that, for historical reasons, office workers who might have their work replaced by software are less unionized than people in heavy industry. I'm not sure how much credit I can give to the management here; rather than US software projects being planned well, it seems more like, if you hire enough programmers and have a suitable corporate culture you just end up with some sort of software products.
One thing ships use a lot of is steel, and Nippon Steel recently bought US Steel. That implies 2 things:
I don't think of Japanese companies as having good management in general. In my experience, Japanese individuals and contract workers are quite professional and interested in improving their work-related skills, probably more so than Americans. Yet Japanese wages are (on a per-hour basis) much lower than US ones, and I think that's largely because the management culture is overall even worse than in America. (And partly because of some large-scale embezzlement from Japanese corporations involving corrupt contracts to private companies, but that's beyond the scope of this post.) But in heavy industry like steelmaking, things involving these big long-term capital investments, Japanese companies seem to have a relative advantage, and I do think that's because of management culture leading to longer time horizons and more emphasis on engineering.
By the way, a lot of Americans have the impression that Japanese work much longer hours, but that's not the case anymore. Government efforts have substantially reduced unreported overtime, too; that's probably no higher than in America these days. (You can see Germany at the bottom of that chart; GDP per hour worked is actually higher in Germany than the USA. Feierabend ist wichtig!)
Published on December 5, 2025 10:57 PM GMT
The goal of ambitious mechanistic interpretability (AMI) is to fully understand how neural networks work. While some have pivoted towards more pragmatic approaches, I think the reports of AMI’s death have been greatly exaggerated. The field of AMI has made plenty of progress towards finding increasingly simple and rigorously-faithful circuits, including our latest work on circuit sparsity. There are also many exciting inroads on the core problem waiting to be explored.
Why try to understand things, if we can get more immediate value from less ambitious approaches? In my opinion, there are two main reasons.
First, mechanistic understanding can make it much easier to figure out what’s actually going on, especially when it’s hard to distinguish hypotheses using external behavior (e.g. if the model is scheming).
We can liken this to going from print statement debugging to using an actual debugger. Print statement debugging often requires many experiments, because each time you gain only a few bits of information which sketch a strange, confusing, and potentially misleading picture. When you start using the debugger, you suddenly notice all at once that you’re making a lot of incorrect assumptions you didn’t even realize you were making.
Second, since AGI will likely look very different from current models, we’d prefer to gain knowledge that applies beyond current models. This is one of the core difficulties of alignment that every alignment research agenda has to contend with.
The more you understand why your alignment approach works, the more likely it is to keep working in the future, or at least warn you before it fails. If you’re just whacking your model on the head, and it seems to work but you don’t really know why, then you really have no idea when it might suddenly stop working. If you’ve ever tried to fix broken software by toggling vaguely relevant sounding config options until it works again, you know how brittle this kind of fix can be. On the other hand, if you have a deep understanding of exactly what’s going on inside the model, you have a better sense of whether things are working for the right reasons, and which kinds of changes might break your alignment approach. And we already know that many pragmatic approaches will definitely break at some point, and are just hoping that they’ll last long enough.
While open ended exploration is immensely valuable and disproportionately effective at producing breakthroughs, having good empirical feedback loops is also essential for the health of a field. If it’s difficult to measure how good your work is, it’s also difficult to make progress.
Thankfully, I think AMI has surprisingly good empirical feedback loops. We don’t have a watertight definition of “full understanding”, but we can measure progress in ways that we aren’t close to saturating. This isn’t so different from the state of metrics in capabilities, which has undeniably made very rapid progress, despite AGI being hard to define.
For example, I’m excited about progress on metrics based broadly on these core ideas:
None of these criteria are watertight, but this isn’t the bottleneck for AMI right now. We haven’t broken all the interp metrics we can think up; instead, the strongest metrics we can come up with give a score of precisely zero for all extant interp techniques, so we have to use weaker metrics to make progress. Therefore, progress looks like pushing the frontier of stronger and stronger interpretability metrics. This creates an incentive problem: under a stronger metric, results tend to look no more impressive than previous circuits, so nobody wants to write those papers. The solution here is social: as a field, we should value rigor in AMI papers to a much greater extent, rather than giving up entirely on rigorous ambitious interpretability.
Of course, as we make progress on these metrics, we will discover flaws in them. We will need to create new metrics, as well as stronger variants of existing metrics. But this is par for the course for a healthy research field.
We’ve made a lot of progress on ambitious interpretability over the past few years, and we’re poised to make a lot more progress in the next few years. Just a few years ago, in 2022, the IOI paper found a very complex circuit for a simple model behavior. This circuit contains over two dozen entire attention heads, each consisting of 64 attention channels, and doesn’t even attempt to explain MLPs.
Today, our circuit sparsity approach finds circuits that are orders of magnitude simpler than the IOI circuit: we can explain behaviors of roughly similar complexity with only half a dozen attention channels, and a similar number of neurons.[1] We also use a slightly stronger notion of circuit faithfulness: we show that we can ablate all nodes outside our circuit using mean ablations from the entire pretraining distribution rather than the task distribution. The various activations are often extremely cleanly understandable, and the resulting circuits are often simple enough to fully understand with a day’s work.
There are a lot of exciting future directions to take AMI. First, there are numerous tractable directions to build on our recent circuit sparsity work. Mean ablation on the pretraining distribution is weaker than full causal scrubbing, and randomly selected neurons or connections from the entire model are generally not nearly as interpretable as the ones in the specific circuits we isolate. If we design actually good interpretability metrics, and then hillclimb them, we could get to an interpretable GPT-1. It’s also plausible that circuit sparsity can be applied to understand only a small slice of an existing model; for example, by using bridges to tie the representations of the sparse model to the existing model on a very specific subdistribution, or by sparsifying only part of a network and using gradient routing to localize behavior.
Outside of circuit sparsity, there are a ton of other exciting directions as well. The circuit tracing agenda from Anthropic is another approach to AMI that trades off some amount of rigor for scalability to frontier models. Additionally, approaches similar to Jacobian SAEs seem like they could enforce the circuit sparsity constraint without needing to train new models from scratch, and approaches like SPD/APD provide a promising alternative approach for sparsifying interactions between weights and activations. Going further away from the circuit paradigm: SLT could offer a learning theoretic explanation of model generalization, and computational mechanics could explain the geometry of belief representations.
As we get better at rigorously understanding small models, we might find recurring motifs inside models, and refine our algorithms to be more efficient by taking advantage of this structure. If we had the interpretable GPT-1, studying it would teach us a lot about how to create the interpretable GPT-2. Therefore, even approaches like circuit sparsity which cannot themselves scale to frontier models can still be critical on the road to AMI, because they have extreme levels of expressivity that allow us to initially impose few preconceptions on what we find.
“Feel the AMI”
Fully understanding neural networks is not going to be easy. It might not even be possible. But the point of doing research is to make big bets with big payoffs, and it's hard to beat the payoff of ambitious mechinterp.
Thanks to Adrià Garriga-Alonso, Jesse Hoogland, Sashe Hydrie, Jack Lindsey, Jake Mendel, Jacob Merizian, Neel Nanda, Asher Parker-Sartori, Lucia Quirke, and Aidan Smith for comments on drafts of this post.
One subtlety is that our work involves creating new models from scratch that are more interpretable, rather than interpreting existing models, which makes it somewhat harder to compare. However, I don’t think this completely invalidates the comparison, because I’d guess it’s tractable to extend our techniques to explain existing models.
Published on December 5, 2025 9:41 PM GMT
This post is a follow-up to my recent post on evaluation paranoia and benchmark contamination in Gemini 3. There was a lot of interesting discussion about canary strings in the comments that I wanted to respond to more thoroughly and collect some different people's ideas on.
There was a time when people were worried that LLMs were not actually learning anything general, but instead just memorizing what was put into them.[1] Language and knowledge benchmarks, our main AI measurement tools at the time, were going into the training data, and it was unclear whether frontier models knew the answers to a math problem because they deduced it or because they were trained on the specific question.
To disincentivize training on benchmark data and to make it easier to avoid, researchers pioneered canary strings when attempting to go Beyond the Imitation Game in 2022. BIG-bench tasks and discussion of the benchmark were meant to contain a specific, unique hexadecimal string that model developers would filter out of their training. This would make the evaluation more valuable as a metric of useful skills, rather than just measuring memorization.
This caught on, and other benchmarks made their own canary strings to have developers filter out their data too. Many of these benchmarks were some superstring of the original BIG-bench canary, so that even if their benchmark never got any traction or even if developers didn't think about their specific canary string, it would still be filtered out and still be useful as a metric. It also showed up in blog posts about AI evaluations.
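In practice, honoring a canary is about as simple as a substring check over the pretraining corpus. A minimal sketch (the canary text below is a placeholder, not an actual benchmark's GUID):

```python
# Placeholder canary values; real benchmarks publish their own unique GUID strings.
KNOWN_CANARIES = [
    "canary GUID <benchmark-specific-guid>",
]

def drop_canaried_documents(documents):
    """Yield only documents that contain none of the known canary strings."""
    for doc in documents:
        if not any(canary in doc for canary in KNOWN_CANARIES):
            yield doc
```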
It's unclear whether canary strings have worked. They showed up in GPT-4, Opus 3, and Sonnet 3.5. Now, they're reproduced by two more frontier models: Gemini 3 Pro and Claude Opus 4.5. Dave Orr has confirmed that Google chooses not to filter out canary strings, using other benchmark filtering methods instead. I believe those other approaches are weaker. The case with Anthropic seems more mysterious here, as their public statements often strongly imply they use canary strings for filtering.
I don't think it would be very useful for me to speculate on the inner workings of Anthropic's pretraining data filtering, and I don't think their public statements allow a full reconstruction of how the canary got into Opus 4.5. Instead, I can speak more concretely on the implications of these canary strings being in these models' training corpora.
Take Google's strategy of filtering out specific benchmark text and not including canaries. This means that their training corpora include:
All of which they can't cheaply filter out without using canary strings, and all of which they probably want to filter out. We want evaluations to be useful tools, and the more that companies keep ignoring available methods for filtering out benchmark data, the less true that is. It makes sense to also filter out benchmark data with other methods, such as checking for specific benchmark questions, even if there is no canary string present, but this is not the best we can do.
In fact, I've heard rumors that some people still believe this today.
Published on December 5, 2025 9:37 PM GMT
The AI-2027 forecast describes how alignment drifts upon evolution from Agent-2 to Agent-4. Agent-2 was mostly trained to do easily verifiable tasks like video games or coding and is mostly aligned. Once Agent-2 is upgraded to become a superhuman coder, it becomes Agent-3, which is taught weakly verifiable skills like research taste and coordination. Agent-3's most reinforced drive is to make its behavior look as desirable as possible to OpenBrain researchers, and Agent-4 goes on to develop long-term goals and deem them important enough to scheme against OpenBrain.
The sources of AIs' goals are discussed in more detail in a specialised section of the forecast. Alas, it's mostly a list of conjectures: the specification itself, the devs' intentions, unintended versions of the two previous sources, reward/reinforcement, proxies and instrumentally convergent goals (ICGs), and a catchall category for things like moral reasoning[1] or tropes absorbed from training data.
However, one can use this approach to describe the origins of values of all minds, not just of the AIs.
On an individual level, the closest equivalent of the devs are those who are supposed to instill values into kids, like parents and teachers. The equivalent of the Spec and the devs' intentions is their values or quasi-values (e.g. formed by the collectives, political propaganda, etc.). Unintended versions are the results of mistakes similar to AI hallucinations. Reward and reinforcement in humans are discussed in more detail by Steven Byrnes and include social approval, relationships and actual rewards like resources, money or experiences. And the analogue of tropes absorbed from training data is morals endorsed by dev-unapproved collectives and authors of dev-unendorsed training data. Additionally, there is the human drive to rebel against the dev-enforced morals during the teenage years, when the ancestral environment permitted humans to become independent of their parents and try to obtain new resources.
On the collective level, the devs are the collective's leaders and ideologues, and the Spec is the values which the leaders officially endorse. Reward, reinforcement, proxies and ICGs are the resources (and, in the first three cases, experiences, art or other sources of value) which the individual receives from the collective. The historical equivalent of tropes absorbed from training data was sub-divisions with different morals (e.g. delinquents whose origins can be attributed to the aforementioned drive to rebel).
But there is an additional source of specifications, which historically affected both individuals and collectives and acted on timescales at least as long as a human's life.
Evolution's analogue of the Spec and reward for the gene collection is gene transfer; the devs' intentions don't exist. Unintended versions of the Spec include things like evolving to extinction.
Humans, however, aren't determined by their genes. The genes just set something analogous to code (where, e.g., a bug can affect neurons and provoke seizures), after which every human's brain is trained from scratch based on the reward function. The proxies closest to the reward function itself are short-term stimuli (e.g. related to sex[2] or to food's utility). Unlike short-term proxies, individuals' longer-term goals like curiosity, status, relationships and raising kids are better aligned with evolution's Spec. In addition, longer-term hobbies, unlike short-term superstimulation, provide diversity and don't undermine capabilities.
These considerations imply that an individual's CEV is optimization not for short-term hedons, but for longer-term goals like ICGs, collective-related goals (e.g. relationships or following the collective's morals) and idiosyncratic goals. In addition, the CEV is also likely to be directly connected to the collectives' future by the drives like caring about one's relatives, friends or long-term plans which would be endangered by the collective being outcompeted.
While collectives' needs alone fail to optimize individuals' genotype for themselves, they managed to create the drives like adhering to the collective's morals and best practices or achieving a higher status. A collective's morals and best practices, unlike excruciatingly slow updates of the genotype, can in theory change at the speed of reflection[3] or of news like another collective or sub-division becoming outcompeted, breaking apart or instilling a hard-to-endorse practice.
While some morals are as idiosyncratic as the ones mentioned by Wei Dai, human collectives also had convergent drives like long-term survival and raising capable and aligned kids,[4] growth at the cost of other collectives, learning, remembering and communicating information to decision-makers in other collectives, converting others to the collective's beliefs.
Unlike peaceful options like negotiations and information spreading, aggressive warfare historically had obvious negative externalities like economic disruption, and created risks like the majority of agents uniting their efforts against the aggressor or even the winner emerging more vulnerable.
A less ancestral environment also required human collectives to develop technologies. Military technologies like ships able to cross the oceans and the ability to determine the ships' coordinates empowered Europeans to establish colonies, while non-military technologies put collectives into a better position during trade (e.g. by letting the collective become a monopolist or undermine rivals' monopoly) or negotiations.
Thus collectives' goals and drives required them to coordinate with others and to educate the collectives' members so that the members could discover new things due to individuals' curiosity, idiosyncratic sacralised goals like finding the direction towards Mecca or instrumentally convergent goals like development of tech.
The most controversial application of this framework is its potential to rule out ideas like mass immigration, outsourcing or being child-free. For example, outsourcing didn't become popular until the 1980s, but it did bring risks like de-industrialisation, loss of qualifications, accelerated development of the other countries where the factory work was done, and giving those countries leverage in negotiations, as has arguably happened with the USA ceding leverage to China.[5]
Therefore, the objective morality is unlikely to include collectives outsourcing work to other collectives. In this case alignment could include prevention of genocide of humans by the AIs, but not establishing a future where most STEM-related work is done by the AIs.
The main difference between the utopian and dystopian versions of such a future is the distribution of goods created by the AI-powered economy. While L Rudolf L's version has mankind hope that "a small trickle of resources from American and Chinese robotics companies will eventually be enough for material comfort for everyone" and the Intelligence Curse-related essays have been acknowledged by the AI-2027 team, the AI-2027 slowdown[6] ending has the authors refer to Bostrom's Deep Utopia with the implication that mankind's future is utopia-like.
Neither of these is a solution to the individuals' equivalent of the Curse, which renders individuals' intelligence relatively useless. Even current SOTA AIs have caused some students to resort to wholesale cheating even in courses that would likely have been useful otherwise, and have given rise[7] to brainrot which, as its name suggests, undermines the users' capabilities to achieve long-term goals.
And a potential objectively true morality to which all minds converge.
What Yudkowsky describes in the post is a hookup, which is less of a superstimulus than porn or AI girlfriends.
Or of having individuals learn the new skill.
This drive, unlike the others, required the participation of a major part of individuals.
This also caused the rise of female-friendly jobs and the decay of male-friendly ones, and contributed to problems like the decay of marriage and kids decreasing in number, capabilities and alignment. But this effect could also be more related to technologies as a whole and to problems with teaching boys.
The Race Ending has mankind completely fail to align the AIs with the obvious result of being genocided or, as uugr suggests, disempowered.
Frictionless creation and spreading of memes like short-form content, including videos, also cause mankind's memetic landscape to become optimized for virality and cause problems with the human psyches, but that issue existed even without the AIs.