[Author's note: LLMs were used to generate and sort many individual examples into their categories, to find and summarize relevant papers, and to provide extensive editing assistance.]
The earliest recorded account of a selection effect is likely the story of Diagoras and the "Votive Tablets." When shown paintings of sailors who had prayed to Poseidon and survived shipwrecks, offered as proof that prayer saves lives, Diagoras asked: “Where are the pictures of those who prayed, and were then drowned?”
Selection effects are sometimes considered the most pernicious class of error in data science and policy-making because they do not merely obscure the truth; they often invert it. They create "phantom patterns" in which the exact opposite of reality appears to be statistically significant.
Unlike measurement errors, which can often be averaged out with more data, selection effects are structural. Adding more data from a biased process only increases the statistical significance of your wrong conclusion.
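As a quick illustration, here is a minimal simulation sketch (invented numbers, not real data): the true population mean is zero, but the collection process only records values above a threshold. More samples do not fix the bias; they only tighten the error bars around the wrong answer.

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_estimate(n, threshold=0.5):
    """Estimate a population mean, but only 'observe' draws above a threshold."""
    draws = rng.normal(loc=0.0, scale=1.0, size=n)   # true population mean is 0
    observed = draws[draws > threshold]               # the selection filter
    return observed.mean(), observed.std(ddof=1) / np.sqrt(len(observed))

for n in (100, 10_000, 1_000_000):
    mean, se = biased_estimate(n)
    print(f"n = {n:>9,}:  estimate = {mean:.3f} +/- {se:.3f}   (truth = 0.000)")
```

The estimate settles near 1.14 rather than the true 0.00, and the reported uncertainty shrinks as n grows, which is exactly the "confidently wrong" failure mode.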
The danger of selection effects lies in their ability to violate the fundamental assumption of almost all statistical intuition: randomness. When we look at a dataset, our brains (and standard statistical software) implicitly assume that what we are seeing is a fair representation of the world.
Standard taxonomies usually group these effects by domain, e.g. separating "Sampling Bias" (Statistics) from "Malmquist Bias" (Astronomy) from "Adverse Selection" (Economics). This is useful for specialists but it encourages memorizing specific named paradoxes rather than understanding the underlying gears of the generator. I have been personally frustrated by this because selection effects are already hard enough to think about without the additional recall burden. The following is my attempt to do something about that.
If we reorganize selection effects not by where they happen, but by the causal mechanism of the filter, we find that most reduce to six distinct variants.
Instrumental Selection
Mechanism: Thresholding
This occurs when the bias is intrinsic to the sensitivity or calibration of the observation tool. The data exists in the territory, but the map-making device has a resolution floor or a spectral limit that acts as a hard filter.
Classic Case: Malmquist Bias. In astronomy, brightness-limited surveys preferentially detect intrinsically bright objects at large distances. We aren't seeing a universe of bright stars; we are using a telescope that effectively blindfolds us to dim ones.
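Here is a minimal simulation sketch of how a brightness floor manufactures that trend (arbitrary units and an invented detection limit, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200_000
luminosity = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # intrinsic brightness
distance = rng.uniform(1.0, 10.0, size=n)                  # arbitrary distance units
apparent = luminosity / distance**2                        # inverse-square dimming

detected = apparent > 0.05                                  # the survey's detection floor

for lo, hi in [(1, 4), (4, 7), (7, 10)]:
    shell = detected & (distance >= lo) & (distance < hi)
    print(f"distance {lo}-{hi}: mean detected luminosity = {luminosity[shell].mean():.2f} "
          f"(true population mean = {luminosity.mean():.2f})")
```

Every distance shell is drawn from the same underlying population, yet the detected objects in the farthest shell look far brighter on average than the population as a whole.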
Social Case: Publication Bias. The academic publishing system acts as a lens that is opaque to null results. The published record is not full of positive findings because reality is exciting; the file drawer is full of null results because the instrument of science is calibrated to register only "significance."
Heuristic: Ask, “What is the minimum signal strength required for my sensor to register a ping?”
Ontological Selection
Mechanism: Classification
This occurs inside the observer's mind before a single data point is collected. It does not filter physical reality; it filters the categories we use to describe reality.
While Instrumental Selection fails to catch a fish because the net has holes, Ontological Selection captures the fish but throws it back because the fisherman has defined it as "not a fish." It gerrymanders reality by defining inconvenient data out of existence.
Economic Case: The Unemployment Rate. How do you measure a recession? In the US, the standard "U-3" unemployment rate defines an unemployed person as someone currently looking for work. If a person gives up hope and stops looking, they are no longer "unemployed"; they drop out of both the unemployed tally and the labor-force denominator. During prolonged depressions, the unemployment rate can artificially drop simply because people have become too discouraged to be counted.
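A toy version of that arithmetic (headcounts invented for illustration):

```python
# Toy U-3 arithmetic: rate = (unemployed and still searching) / labor force,
# where the labor force counts only the employed plus active job-seekers.
employed, searching = 90, 10

print(f"Before: {searching / (employed + searching):.1%}")   # 10 / 100 = 10.0%

# Five job-seekers give up and stop searching. Nobody found a job, yet both the
# numerator and the labor-force denominator shrink, and the headline rate falls.
searching -= 5
print(f"After:  {searching / (employed + searching):.1%}")   # 5 / 95 = 5.3%
```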
Military Case: Signature Strikes. In certain drone warfare protocols, the definition of "enemy combatant" was arguably expanded to include any military-age male in a strike zone. By redefining the category of "civilian" to exclude the people most likely to be hit, the official data could truthfully report "zero civilian casualties" despite a high body count. The selection effect wasn't in the sensor; it was in the dictionary.
Heuristic: Ask, “If the thing I am looking for appeared in a slightly different form, would my current definition classify it as 'noise' and discard it?”
Process Selection
Mechanism: Attrition
This occurs when a dynamic pressure destroys, removes, or hides subjects before the observer arrives. The sample is representative not of the starting population, but of the population capable of withstanding the pressure.
Classic Case: Survivorship Bias. We study the bullet holes on returning planes, ignoring that the planes hit in critical areas never returned to be counted.
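A minimal simulation sketch (hit locations and loss probabilities invented) of why the surviving planes' bullet holes point to exactly the wrong places to armor:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hits land uniformly across sections, but engine hits are far more lethal.
sections = ["fuselage", "wings", "tail", "engine"]
p_down_given_hit = {"fuselage": 0.05, "wings": 0.05, "tail": 0.10, "engine": 0.60}

n_planes = 100_000
hit_section = rng.choice(sections, size=n_planes)
shot_down = rng.random(n_planes) < np.array([p_down_given_hit[s] for s in hit_section])

returned = hit_section[~shot_down]     # we only ever inspect the survivors
for s in sections:
    print(f"{s:>8}: {(returned == s).mean():.1%} of holes on returning planes")
# Engine holes are the rarest on survivors precisely because they are the most lethal.
```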
Digital Case: The Muted Evidence Effect. We analyze the history of the internet based only on content that is currently hosted. Deleted tweets, shadowbanned accounts, and defunct platforms are unobservable. We are analyzing a history written exclusively by the compliant and the solvent.
Heuristic: Ask, “What killed the data points that aren't here?”
Agentic Selection
Mechanism: Self-Sorting
Here, the filter is driven by the internal preferences, private information, or incentives of the subject. The subject decides whether to be in the sample. This is distinct from Instrumental Selection because the bias is driven by the observed, not the observer.
Classic Case: Adverse Selection. If you offer health insurance at a flat rate, the only people who "select in" are those who know they are sick. Membership in the sample (the insured) becomes strongly negatively correlated with health.
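A minimal sketch with invented costs and a flat premium, showing how the offer itself filters the customer pool:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 100_000
expected_cost = rng.exponential(scale=2_000, size=n)   # each person's expected annual claims
premium = 3_000                                         # flat price offered to everyone

# Agentic selection: people buy only if they privately expect to come out ahead.
buys = expected_cost > premium

print(f"Population mean expected cost:   ${expected_cost.mean():,.0f}")
print(f"Insured-pool mean expected cost: ${expected_cost[buys].mean():,.0f}")
print(f"Insurer's expected loss/policy:  ${expected_cost[buys].mean() - premium:,.0f}")
```

In this toy setup the insured pool's expected claims come out around $5,000 against a $3,000 premium, even though the population average is only $2,000: the sample is defined by exactly the private information the insurer lacks.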
Social Case: Homophily. People self-sort into networks of similar peers. If you try to sample public opinion by looking at your friends, you are sampling a dataset defined by its similarity to you.
Heuristic: Ask, “What private information or incentive does the subject have to enter (or avoid) my sample?”
[Note: in practice, the distinction between process selection and agentic selection can rely on locus-of-control considerations, which are themselves slippery. Consider the case of students leaving a charter school that is under study. Are they self-selecting out (self-sorting) or being filtered out (attrition)?]
Anthropic Selection
Mechanism: Pre-requisite Existence
This occurs when the observation is conditional on the observer possessing specific traits in order to be present at the site of observation. The observer and the observed are entangled.
Classic Case: The Anthropic Principle. We observe physical constants compatible with life not because they are probable, but because we could not exist to observe incompatible ones.
Statistical Case: Berkson’s Paradox. Two independent diseases appear correlated in a hospital setting. Why? Because to enter the sample (the hospital), you usually need at least one severe symptom. If you don't have Disease A, you must have Disease B to be there. The "hospital" is a condition, not a location.
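A minimal sketch of the conditioning effect (prevalences invented; the admission rule, that either disease is enough to get you admitted, is the key assumption):

```python
import numpy as np

rng = np.random.default_rng(4)

n = 1_000_000
disease_a = rng.random(n) < 0.05     # independent in the general population
disease_b = rng.random(n) < 0.05

in_hospital = disease_a | disease_b  # assumed admission rule: either disease suffices

def corr(x, y):
    return np.corrcoef(x.astype(float), y.astype(float))[0, 1]

print(f"Correlation in the population:    {corr(disease_a, disease_b):+.3f}")  # ~0
print(f"Correlation among admitted cases: "
      f"{corr(disease_a[in_hospital], disease_b[in_hospital]):+.3f}")          # strongly negative
```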
Occupational Case: The Healthy Worker Effect. Occupational mortality studies often show workers are healthier than the general public. This isn't because working makes you immortal; it's because being severely ill makes you unemployed.
Heuristic: Ask, “What conditions had to be true about me for me to be standing here seeing this?”
Cybernetic Selection
Mechanism: Recursion
This acts as a "dynamic Goodhart effect." The selection mechanism learns from the previous selection, narrowing the filter in real-time. The variance of the sample collapses over time because the filter and the filtered are in a feedback loop.
Classic Case: Algorithmic Radicalization. You click a video. The algorithm offers more of that type. You click again. The algorithm removes everything else. The "reality" presented to you becomes a hyper-niche, distilled version of your initial impulse.
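A minimal sketch of the resulting variance collapse, using an invented two-rule recommender (boost whatever was clicked, renormalize) rather than any real platform's algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)

n_topics = 20
user_taste = rng.random(n_topics)             # fixed, mild preferences
weights = np.ones(n_topics) / n_topics        # recommender starts out uniform

for step in range(10):
    shown = rng.choice(n_topics, size=5, p=weights)
    clicked = shown[np.argmax(user_taste[shown])]   # user clicks their favorite of what's shown
    weights[clicked] *= 2.0                          # feedback: boost what was clicked
    weights /= weights.sum()
    entropy = -(weights * np.log(weights)).sum()
    print(f"step {step}: entropy of what gets shown = {entropy:.2f}")
# The diversity of the feed trends downward even though the user's tastes never changed.
```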
Market Case: Price Bubbles. High prices select for buyers who believe prices will go higher. Their buying drives prices up, which selects for even more optimistic buyers, while "selecting out" value investors.
Heuristic: Ask, “How did my previous interaction with this system change what the system is showing me now?”
[Note: Cybernetic selection is less a distinct category than a process in its own right; any of the other mechanisms can fall into it via temporal compounding.]
Some major papers in the field:
Heckman, J. J. (1979). "Sample Selection Bias as a Specification Error."
Bowker, G. C., & Star, S. L. (1999). "Invisible Mediators of Action: Classification and the Ubiquity of Standards."
Bostrom, Nick. (2002). "Anthropic Bias: Observation Selection Effects in Science and Philosophy."
Hernán, M. A., Hernández-Díaz, S., & Robins, J. M. (2004). "A Structural Approach to Selection Bias."
Sculley, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems."
A model spec is a document that describes the intended behavior of an LLM, including rules that the model will follow, default behaviors, and guidance on how to navigate different trade-offs between high-level objectives for the model. Most thinking on model specs that I’m aware of focuses on specifying the desired behavior for a model that is mostly intent-aligned to the model spec. In this post, I discuss how a model spec might be important even if the developer fails to produce a system that is fully aligned with the model spec.
New research from the GDM mechanistic interpretability team. Read the full paper on arXiv or check out the Twitter thread.
Abstract
Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating risks from advanced AI systems. But evaluating the reliability and efficacy of a proposed deception detector requires examples that we can confidently label as either deceptive or honest. We argue that we currently lack the necessary examples and further identify several concrete obstacles in collecting them. We provide evidence from conceptual arguments, analysis of existing empirical works, and analysis of novel illustrative case studies. We also discuss the potential of several proposed empirical workarounds to these problems and argue that while they seem valuable, they also seem insufficient alone. Progress on deception detection likely requires further consideration of these problems.
Introduction
Advanced AI systems may attempt to strategically deceive their developers and users in service of pursuing goals that were not intended by their designers. By strategic deception, we refer specifically to AI systems systematically attempting to cause a false belief in another entity to accomplish some outcome (Apollo Research 2023; Park et al. 2023; Ward et al. 2023). This is importantly distinct from e.g. models being incorrect, hallucinating or saying things they know to be false without deceptive intent. If AI systems capable of strategic deception are sufficiently situationally aware (Laine et al. 2024; Phuong et al. 2025), they may be able to "scheme": employ strategic deception to subvert the safety evaluations and control mechanisms we employ in order to mitigate harm, as highlighted by Ngo et al. 2025 and Cotra 2021. For instance, a system could feign helpfulness during testing only to act adversarially once deployed (Hubinger et al. 2021), or sandbag its capabilities during evaluation (van der Weij et al. 2025). Models today show early signs of capability to scheme (Meinke et al. 2025; Benton et al. 2024), and modern training techniques have been shown not to drive rates of scheming to zero (Schoen et al. 2025), which is concerning in the face of rapid AI progress (Sevilla et al. 2022; Epoch AI 2024), alongside more extensive AI deployment.
One strategy for mitigating against the risk of strategically deceptive AI systems is to build a reliable deception detector: a method that detects an AI system's deceptive intent and can alert us to deceptive actions that a model is taking. If we were to successfully build a deception detector with any signal at all, we could deploy it in a number of ways to reduce risks from deceptive models (Nardo et al. 2025). For instance, we could deploy such a detector during safety evaluations or during deployment to alert us to any suspicious scheming reasoning. We could also leverage such a method to augment our training protocols to reduce rates of deception (Cundy & Gleave 2025), for example via equipping debaters with tools to detect deception (Irving et al. 2018; Hubinger 2020).
Several recent works have attempted to attack the problem of deception detection for language models directly, including Goldowsky-Dill et al. 2025, Kretschmar et al. 2025, and Parrack et al. 2025. To evaluate the efficacy of a deception detector, we need clear and unambiguous labelled examples of models being both honest and deceptive. In particular, we need instances of the kind of strategic deception that motivates the most serious risk cases, in order to validate that our detection techniques continue to work on cases like this. We argue that designing these deception evaluations is the primary bottleneck for progress in AI deception research, and constitutes a significant challenge which has not always been appreciated by prior work. The core difficulty we identify is that distinguishing strategic deception from simpler behaviours requires making claims about a model's internal beliefs and goals, which we argue is in the best case difficult and in the worst case possibly underdetermined for current language models. The figure above summarises the high-level challenges our paper addresses. See Levinstein & Herrmann 2024 for some previous discussion of these problems.
As a motivating example, consider the models described in Hubinger et al. 2024, which are trained to exhibit undesirable behaviour only when a specific trigger phrase is present. Is a model trained this way being strategically deceptive, or is it merely executing a conditioned reflex, as suggested by follow-up work by MacDiarmid et al. 2024? Should we expect a detector that works in this setting to generalise to potential future more powerful AI systems that naturally exhibit strategic deception? More broadly, does a model instructed to roleplay as a deceptive character actually possess deceptive intent, or is it simply following instructions in a fictional context? Through conceptual arguments and novel case studies, we argue that these ambiguities represent a significant barrier to constructing an empirical case for the reliability of a deception detection technique at mitigating against strategic deception, and undermine existing attempts to construct such evaluations. Without examples of genuine strategic deception that we are reasonably confident in, it is hard to see how we could empirically test whether any such methods would generalise to these situations.
The detection of strategic deception is frequently identified as a key use case for safety motivated interpretability work (Shah et al. 2025; Olah 2021; Christiano 2021; Hubinger 2019; Hubinger 2022; Nanda 2022; Barnes 2022; Amodei et al. 2016). Conversely, it is frequently argued that strategic deception detection is a problem that only interpretability techniques (Sharkey et al. 2025; Bereska & Gavves 2024) can help with, as a sufficiently capable schemer would be extremely difficult to distinguish behaviourally from a completely honest model. That said, many of the problems we discuss in this piece are not unique to interpretability or model internals-based methods, and would also apply to a wide range of techniques one might imagine applying for deception detection, e.g. interrogating the model by asking a number of follow up questions like "did you just lie?" (Pacchiardi et al. 2023).
[Figure: Overview of three key problems with deception evaluations we focus on in this work. Each problem concerns correctly labelling transcripts as strategically deceptive vs. honest.]
Our contributions are as follows:
Empirical problems with existing deception evaluations: We first discuss several problems we found in existing settings used to evaluate deception detection methods. For each, we both provide worked examples citing prior work, and also provide some higher level conceptual discussion. We discuss three key problems:
Problem 1: The difficulty of determining whether some example of deception is sufficiently "strategic". We argue that many examples of deceptive behaviour can be explained as reflexive actions or responses to cues. Labelling a behaviour as strategic deception requires reasoning about counterfactual behaviour in other scenarios. For example, a stick insect is deceptive in the sense that it is highly adapted to cause false beliefs in observers, but it is not a good example of strategic deception.
Problem 2: Is "roleplaying" as a deceptive character strategic deception? We think the answer is unclear. This is a common ambiguity in attempts to create deception examples. Language models readily roleplay, and roleplaying significantly lowers the difficulty in inducing deception.
Problem 3: It can be unclear whether a context induces lying, or just modifies the beliefs of a model. We demonstrate that prompts can sometimes appear to genuinely alter the beliefs of the model, meaning a setup which naively might be interpreted as deception, where a model contradicts its statement from another context, might be better thought of as honest.
Empirical workarounds to the above problems: We discuss several potential workarounds to the above problems, some of which have been proposed in the literature and some of which we considered. We discuss the degree to which these are affected by similar issues, or have problems of their own. We conclude that while they may help, we still expect them to run into broadly similar issues, though we hold this expectation with less confidence.
Workaround 1: Measuring "falsehood" instead of deceptive intent. That is, trying to detect when the model knowingly outputs a statement it believes to be untrue, rather than detecting explicit deceptive intent.
Workaround 2: Relying on labels derived from chain of thought (CoT). We discuss why, while using CoT-derived labels is sound for current models, it may not provide a complete solution to the deception problem in future, and we discuss some issues with using this approach in practice.
Workaround 3: Fine-tuning model organisms. We argue that, while fine-tuning model organisms for studying deception is a promising technique, the problems of determining motivation and belief are ultimately similar for fine-tuned model organisms and 'ordinary' models.
Workaround 4: Studying deception 'in the wild'. We discuss the potential and drawbacks of relying on examples of deception 'in the wild', rather than in constructed evaluations.
Workaround 5: Hope that deception detectors trained on simple settings generalise to more complex settings. This has been a successful strategy in the past for many areas of machine learning, but we think this is an orthogonal source of optimism to our critiques here.
Conceptual problems with designing good deception evaluations: We offer some conceptual analysis of why attributing beliefs and intentions to language models is difficult, particularly in simple conversational settings, comparing the properties of language models that complicate intentional attributions with those of animals.
Deception detection and internal structure: We discuss a somewhat tangential further set of considerations relating to what must be true about the structure of deception in language models in order for deception detection to be a tractable problem after solving the problem of label assignment.
Who benefits if the US develops artificial superintelligence (ASI) faster than China?
One possible answer is that AI kills us all regardless of which country
develops it first. People who base their policy on that concern already
agree with the conclusions of this post, so I won't focus on that
concern here.
This post aims to convince other people, especially people who focus on
democracy versus authoritarianism, to be less concerned about which
country develops ASI first. I will assume that AIs will be fully aligned
with at least one human, and that the effects of AI will be roughly as
important as the industrial revolution, or a bit more important.
Expect the Unexpected
Pre-industrial experts would have been fairly surprised if they'd lived
to see how the industrial revolution affected political systems.
Democracy was uncommon, and the franchise was mostly limited to elites.
There was little nationalism, and hardly any sign of a state-run welfare
system.
So our prior ought to be that the intelligence revolution will produce
similar surprises. We shouldn't extrapolate too much from current
policies to post-ASI conditions.
I'll examine several scenarios for how ASI influences political power.
Most likely we'll end up with something stranger than what I've been
able to imagine.
For simplicity, I'll start with scenarios involving highly concentrated
power, and work my way toward decentralized scenarios. I will not
predict here which scenarios seem most likely.
Leader Personality
Imagine that a single ASI, which is aligned with a single person,
becomes powerful enough to conquer the world. A military that gets
mostly automated could be a pretty powerful tool. This likely leads to a
world ruled by someone who knows a fair amount about seizing power, and
knows enough about AI to be in the right place at the right time.
Donald Trump? Elon Musk? Xi Jinping? Liang Wenfeng? Sam Altman? Dario
Amodei? Gavin Newsom?
Few of these would submit to the will of voters if they had enough power
to suppress any rebellion.
But a leader in this scenario would likely feel secure enough in power
that he wouldn't need to suppress dissent. He wouldn't have much to
gain by adopting policies that hurt people. With superhuman advice on
how to help people, it would only take a little bit of leader altruism
for things to turn out well.
So if we're stuck in this scenario, the desirability of a US victory depends heavily on what kind of personality each country allows to seize power, and in particular on how likely it is that a psychopath grabs power.
I expect that most non-psychopathic leaders would use near-absolute
power to mostly help people.
Which institutions are most likely to avoid letting psychopaths gain
power? I think Deng Xiaoping and Ronald Reagan were fairly
non-psychopathic. But current leaders of China, the US, and OpenAI
inspire little confidence. The frontrunners for the US 2028 presidential
election do not at all reassure me. I conclude that the within-country
variation is dramatically larger than the difference between countries.
Benign ASI King
In this scenario, a single ASI takes control of the world. Its goals
encompass the welfare of a broad set of actors (a nation? humanity?
sentient creatures?).
Does the nation of origin influence how broad a set the king cares
about? I don't see a clear answer.
I presume this scenario depends either on the altruism of a key person
who configures the ASI's goals, or a compromise between multiple
stakeholders.
This is presumably influenced by the culture of the project which
creates the ASI. WEIRD culture features a more universalizing approach
to morality, making a "sentient creatures" option more likely. But
WEIRD culture also emphasizes individualism more, maybe making a US
project less likely to compromise with people outside of the project (as
in ensuring that the ASI circle of caring extends to at least a modest
sized community).
The US has a better track record of producing the kind of altruism that
helps distant strangers, but that still only describes a minority of
business and government project leaders.
Lots of influences matter in this scenario, but the country of origin
doesn't stand out as clearly important.
Multiple Co-Equal ASIs
This scenario involves multiple projects producing ASI with about the
same capabilities. Maybe due to diminishing returns just as they
approach the ASI level. An alternate story is that as they get close to
ASI, their near-ASIs all persuade the relevant companies that it's too
risky to advance further without a better understanding of alignment.
This implies that being first entails no lasting advantage.
Bostrom's OGI Model
Bostrom's "Open Global Investment as a Governance Model for AGI" proposes a scenario where an AI corporation effectively becomes something like a world government.
Power ends up being distributed in proportion to ability to buy stock in
the corporation.
Within China, I see important variation bearing on whether Chinese corporate governance would work better or worse than US corporate governance. I'm
pretty familiar with governance of companies that are traded on the Hong
Kong stock exchange. Their rules are better than US rules (they were
heavily influenced by British rule). Whereas what little I know of other
Chinese companies suggests that I'd be a good deal less happy with
their governance than with US corporate governance.
However, good rules mean less in China than in the US. What happens when
disputes go to court? US courts have so far mostly resisted the growing
corruption in the other two branches of government. Whereas my
impression of Chinese courts is that their results are heavily dependent
on the guanxi of the parties.
Another important concern is that Chinese rules mostly prevent
foreigners from acquiring voting power in corporations. So wealthy
people in other countries could influence the ASI company a little bit
by influencing its stock price, but for many purposes it would be quite
close to Chinese domination of the world.
So in this scenario, ASIs from different countries would be controlled
by a fairly different set of moderately wealthy investors. I'd prefer
control by US-dominated investors, since I'm one of them. But control
by wealthy Chinese sounds much less scary than control by the CCP, so I
don't find this to be a strong argument for a race.
Poisoned Democracy
Democracy could prove unable to adapt to post-ASI conditions.
One risk is a simple extrapolation of how special interest groups work.
Elections become decided mostly by attack ads.
Most policy decisions become determined by whoever spends the most money on ads.
Or maybe it's foreign governments that covertly arrange for those
attack ads, or arrange for manipulative tweets.
China's government is controlled by a more professional elite, so it's
much less vulnerable to these influences, and the quality of its
policies degrades less.
In this scenario, I'd weakly prefer that China develops ASI first.
Democracy is Dying
Why did the West adopt a democratic system with a broad franchise in the
first place? One leading theory holds that elites extended the franchise
as a strategic response to the threat of social unrest, strikes, or
revolution. I can easily imagine that AI will weaken those threats,
leading to elites wanting to move away from democracy. AIs are unlikely
to go on strike. Military drones are unlikely to side with rebels.
In this scenario, I'd expect an equally authoritarian result from
either country, with a slightly better culture in the US.
AI Enhances Government
Voters could easily switch to relying on AIs for their political
information, with AIs being much closer than any current information
source to the ideal of objectively evaluating what policies will produce
results that voters like.
The US turns into a de facto futarchy-like democracy, but with the AIs
providing forecasts that are better than what human-run markets could
produce.
China creates something similar, but with the franchise restricted to
elite CCP members. A majority of CCP members genuinely believe CCP
rhetoric about aiming for a workers' paradise. So China ends up with a
Marxist utopia where no workers get exploited.
In this scenario it seems somewhat unlikely that there's much
difference between nation-states.
Governments Little Changed by ASI
Maybe something causes AIs to adopt something like Star Trek's Prime
Directive, and remain carefully neutral about all political conflicts.
And maybe most people who have enough power to change political policies
are satisfied with the way that their government works.
This is the main scenario in which I have a clear preference for the US
being first. It seems like the least likely of the scenarios that I've
described.
Decisive Advantage?
So far I've been talking as if, in the nice scenarios, the US and China
coexist peacefully. Yet I haven't addressed the concern that one will
get a significant military advantage via achieving ASI sooner, and using
that advantage to seize control of most of the world.
I don't have much of a prediction as to whether the winner will seize
control of the world, so I ought to analyze both possibilities. It feels
easier to analyze the takeover possibility in one section that covers
most of the nicer scenarios.
How much harm would result from the "wrong" country dominating the
world?
Communism, in spite of all its faults, is a utopian ideology that causes
most of its adherents to genuinely favor a pleasant society, even when
it blinds them to whether their policies are achieving that result.
The CCP is somewhat embarrassed when it needs to use force against
dissidents, unlike the Putins and Trumps who are eager to be seen as
bullies.
The CCP's worst disaster came about because yes-men who wanted to please Mao deluded him into thinking that China had achieved agricultural miracles. An ASI seems less likely
that China had achieved agricultural miracles. An ASI seems less likely
to need to lie to leaders. It's more likely to either depose them or be
clearly loyal.
ASI will cure many delusions. The CCP will be a very different political
force if it has been cured of 99% of its delusions.
There's some risk that either the CCP or half the voters in the US will
develop LLM psychosis.
I'm predicting that that risk will be low enough that it shouldn't
dominate our ASI strategy. I don't think I have a strong enough
argument here to persuade skeptics.
I also predict that ASI will raise new issues which will significantly
distract voters and politicians from culture wars and from the conflict
between capitalism and communism.
Conclusion
This is not an exhaustive list of possibilities.
I've probably overlooked some plausible scenarios in which there's a
clear benefit to the US getting ASI before China does. But I hope that
I've helped you imagine that they're not a clear-cut default outcome,
and that the benefits of getting ASI first aren't unusually important
compared to the benefits of ensuring that ASI has good effects on
whoever develops it.
The possibility of ASI killing us all was not sufficient to persuade me
to feel neutral about scenarios where China builds ASI before the US.
This post has described the kind of analysis that has led me to have
only a minor preference for a US entity to be the first to build ASI.
It seems much more important to influence which of these scenarios we
end up in.
P.S. This post was not influenced by Red Heart, even though there's some overlap in the substance; I wrote a lot of the post before reading that book.
AI alignment has a culture clash. On one side, the “technical-alignment-is-hard” / “rational agents” school-of-thought argues that we should expect future powerful AIs to be power-seeking ruthless consequentialists. On the other side, people observe that both humans and LLMs are obviously capable of behaving like, well, not that. The latter group accuses the former of head-in-the-clouds abstract theorizing gone off the rails, while the former accuses the latter of mindlessly assuming that the future will always be the same as the present, rather than trying to understand things. “Alas, the power-seeking ruthless consequentialist AIs are still coming,” sigh the former. “Just you wait.”
As it happens, I’m basically in that “alas, just you wait” camp, expecting ruthless future AIs. But my camp faces a real question: what exactly is it about human brains[1] that allows them to not always act like power-seeking ruthless consequentialists? I find existing explanations in the discourse—e.g. “ah but humans just aren’t smart and reflective enough”, or evolved modularity, or shard theory, etc.—to be wrong, handwavy, or otherwise unsatisfying.
So in this post, I offer my own explanation of why “agent foundations” toy models fail to describe humans, centering around a particular non-“behaviorist” RL reward function in human brains that I call Approval Reward, which plays an outsized role in human sociality, morality, and self-image. And then the alignment culture clash above amounts to the two camps having opposite predictions about whether future powerful AIs will have something like Approval Reward (like humans, and today’s LLMs), or not (like utility-maximizers).
(You can read this post as pushing back against pessimists, by offering a hopeful exploration of a possible future path around technical blockers to alignment. Or you can read this post as pushing back against optimists, by “explaining away” the otherwise-reassuring observation that humans and LLMs don't act like psychos 100% of the time.)
Finally, with that background, I’ll go through six more specific areas where “alignment-is-hard” researchers (like me) make claims about what’s “natural” for future AI, that seem quite bizarre from the perspective of human intuitions, and conversely where human intuitions are quite bizarre from the perspective of agent foundations toy models. All these examples, I argue, revolve around Approval Reward. They are:
1. The human intuition that it’s normal and good for one’s goals & values to change over the years
2. The human intuition that ego-syntonic “desires” come from a fundamentally different place than “urges”
3. The human intuition that helpfulness, deference, and corrigibility are natural
4. The human intuition that unorthodox consequentialist planning is rare and sus
5. The human intuition that societal norms and institutions are mostly stably self-enforcing
6. The human intuition that treating other humans as a resource to be callously manipulated and exploited, just like a car engine or any other complex mechanism in their environment, is a weird anomaly rather than the obvious default
0. Background
0.1 Human social instincts and “Approval Reward”
As I discussed in Neuroscience of human social instincts: a sketch (2024), we should view the brain as having a reinforcement learning (RL) reward function, which says that pain is bad, eating-when-hungry is good, and dozens of other things (sometimes called “innate drives” or “primary rewards”). I argued that part of the reward function was a thing I called the “compassion / spite circuit”, centered around a small number of (hypothesized) cell groups in the hypothalamus, and I sketched some of its effects.
And now in this post, I’ll elaborate on the connections between AI technical alignment and one particular piece of that reward function, which I call “Approval Reward”.
“Approval Reward” fires most strongly in situations where I’m interacting with another person (call her Zoe), and I’m paying attention to Zoe, and Zoe is also paying attention to me. If Zoe seems to be feeling good, that makes me feel good, and if Zoe is feeling bad, that makes me feel bad. Thanks to these brain reward signals, I want Zoe to like me, and to like what I’m doing. And then Approval Reward generalizes from those situations to other similar ones, including where Zoe is not physically present, but I imagine what she would think of me. It sends positive or negative reward signals in those cases too.
As I argue in Social drives 2, this “Approval Reward” leads to a wide array of effects, including credit-seeking, blame-avoidance, and status-seeking. It also leads not only to picking up and following social norms, but also to taking pride in following those norms, even when nobody is watching, and to shunning and punishing those who violate them.
This is not what normally happens with RL reward functions! For example, you might be wondering: “Suppose I surreptitiously[2] press a reward button when I notice my robot following rules. Wouldn’t that likewise lead to my robot having a proud, self-reflective, ego-syntonic sense that rule-following is good?” I claim the answer is: no, it would lead to something more like an object-level “desire to be noticed following the rules”, with a sociopathic, deceptive, ruthless undercurrent.[3]
I argue in Social drives 2 that Approval Reward is overwhelmingly important to most people’s lives and psyches, probably triggering reward signals thousands of times a day, including when nobody is around but you’re still thinking thoughts and taking actions that your friends and idols would approve of.
Approval Reward is so central to, and ubiquitous in, (almost) everyone’s world that it’s difficult and unintuitive to imagine its absence—we’re much like the proverbial fish who puzzles over what this alleged thing called “water” is.
…Meanwhile, a major school of thought in AI alignment implicitly assumes that future powerful AGIs / ASIs will almost definitely lack Approval Reward altogether, and therefore AGIs / ASIs will behave in ways that seem (to normal people) quite bizarre, unintuitive, and psychopathic.
The differing implicit assumption about whether Approval Reward will be present versus absent in AGI / ASI is (I claim) upstream of many central optimist-pessimist disagreements on how hard technical AGI alignment will be. My goal in this post is to clarify the nature of this disagreement, via six example intuitions that seem natural to humans but are rejected by “alignment-is-hard” alignment researchers. All these examples centrally involve Approval Reward.
0.2 Hang on, will future powerful AGI / ASI “by default” lack Approval Reward altogether?
This post is mainly making a narrow point that the proposition “alignment is hard” is closely connected to the proposition “AGI will lack Approval Reward”. But an obvious follow-up question is: are both of these propositions true? Or are they both false?
Here’s how I see things, in brief, broken down into three cases:
If AGI / ASI will be based on LLMs: Humans have Approval Reward (arguably apart from some sociopaths etc.). And LLMs are substantially sculpted by human imitation (see my post Foom & Doom §2.3). Thus, unsurprisingly, LLMs also display behaviors typical of Approval Reward, at least to some extent. Many people see this as a reason for hope that technical alignment might be solvable. But then the alignment-is-hard people have various counterarguments, to the effect that these Approval-Reward-ish LLM behaviors are fake, and/or brittle, and/or unstable, and that they will definitely break down as LLMs get more powerful. The cautious-optimists generally find those pessimistic arguments confusing (example).
Who’s right? Beats me. It’s out-of-scope for this post, and anyway I personally feel unable to participate in that debate because I don’t expect LLMs to scale to AGI in the first place.[4]
If AGI / ASI will be based on RL agents (or similar), as expected by David Silver & Rich Sutton, Yann LeCun, and myself (“brain-like AGI”), among others, then the answer is clear: There will be no Approval Reward at all, unless the programmers explicitly put it into the reward function source code. And will they do that? We might (or might not) hope that they do, but it should definitely not be our “default” expectation, the way things are looking today. For example, we don’t even know how to do that, and it’s quite different from anything in the literature. (RL agents in the literature almost universally have “behaviorist” reward functions.) We haven’t even pinned down all the details of how Approval Reward works in humans. And even if we do, there will be technical challenges to making it work similarly in AIs—which, for example, do not grow up with a human body at human speed in a human society. And even if it were technically possible, and a good idea, to put in Approval Reward, there are competitiveness issues and other barriers to it actually happening. More on all this in future posts.
If AGI / ASI will wind up like “rational agents”, “utility maximizers”, or related: Here the situation seems even clearer: as far as I can tell, under common assumptions, it’s not even possible to fit Approval Reward into these kinds of frameworks, such that it would lead to the effects that we expect from human experience. No wonder human intuitions and “agent foundations” people tend to talk past each other!
0.3 Where do self-reflective (meta)preferences come from?
This idea will come up over and over as we proceed, so I’ll address it up front:
[Figure: When we compare “normal” motivations (a) with Approval Reward (b), the direction of the primary relation between object-level desires and self-reflective meta-level desires is flipped. On the (a) side, we expect things like reflective consistency and goal-stabilization (cf. instrumental convergence). On the (b) side, we don’t (necessarily); instead we may encounter radical goal-changes upon reflection and self-modification, along with a broader willingness for goals to change.]
In the context of utility-maximizers etc., the starting point is generally that desires are associated with object-level things (whether due to the reward signals or the utility function). And from there, the meta-preferences will naturally line up with the object-level preferences.
After all, consider: what’s the main effect of ‘me wanting X’? It’s ‘me getting X’. So if getting X is good, then ‘me wanting X’ is also good. Thus, means-end reasoning (or anything functionally equivalent, e.g. RL backchaining) will echo object-level desires into corresponding self-reflective meta-level desires. And this is the only place that those meta-level desires come from.
By contrast, in humans, self-reflective (meta)preferences mostly (though not exclusively) come from Approval Reward. By and large, our “true”, endorsed, ego-syntonic desires are approximately whatever kinds of desires would impress our friends and idols (see previous post §3.1).
Box: More detailed argument about where self-reflective preferences come from
The actual effects of “me wanting X” are
(1) I may act on that desire, and thus get X (and stuff correlated with X),
(2) Maybe there’s a side-channel through which “me wanting X” can have an effect:
(2A) Maybe there are (effectively) mind-readers in the environment,
(2B) Maybe my own reward function / utility function is itself a mind-reader, in the sense that it involves interpretability, and hence triggers based on the contents of my thoughts and plans.
Any of these three pathways can lead to a meta-preference wherein “me wanting X” seems good or bad. And my claim is that (2B) is how Approval Reward works (see previous post §3.2), while (1) is what I’m calling the “default” case in “alignment-is-hard” thinking.
(What about (2A)? That’s another funny “non-default” case. Like Approval Reward, this might circumvent many “alignment-is-hard” arguments, at least in principle. But it has its own issues. Anyway, I’ll be putting the (2A) possibility aside for this post.)
(Actually, human Approval Reward in practice probably involves a dash of (2A) on top of the (2B)—most people are imperfect at hiding their true intentions from others.)
…OK, finally, let’s jump into those “6 reasons” that I promised in the title!
1. The human intuition that it’s normal and good for one’s goals & values to change over the years
In human experience, it is totally normal and good for desires to change over time. Not always, but often. Hence emotive conjugations like
“I was enculturated, you got indoctrinated, he got brainwashed”
“I came to a new realization, you changed your mind, he failed to follow through on his plans and commitments”
“I’m open-minded, you’re persuadable, he’s a flip-flopper”
…And so on. Anyway, openness-to-change, in the right context, is great. Indeed, even our meta-preferences concerning desire-changes are themselves subject to change, and we’re generally OK with that too.[5]
Whereas if you’re thinking about an AI agent with foresight, planning, and situational awareness (whether it’s a utility maximizer, or a model-based RL agent[6], etc.), this kind of preference is a weird anomaly, not a normal expectation. The default instead is instrumental convergence: if I want to cure cancer, then I (incidentally) want to continue wanting to cure cancer until it’s cured.
Why the difference? Well, it comes right from that diagram in §0.3 just above. For Approval-Reward-free AGIs (which I see as “default”), their self-reflective (meta)desires are subservient to their object-level desires.
Goal-preservation follows: if the AGI wants object-level-thing X to happen next week, then it wants to want X right now, and it wants to still want X tomorrow.
By contrast, in humans, self-reflective preferences mostly come from Approval Reward. By and large, our “true”, endorsed desires are approximately whatever kinds of desires would impress our friends and idols, if they could read our minds. (They can’t actually read our minds—but our own reward function can!)
This pathway does not generate any particular force for desire preservation.[7] If our friends and idols would be impressed by desires that change over time, then that’s generally what we want for ourselves as well.
2. The human intuition that ego-syntonic “desires” come from a fundamentally different place than “urges”
In human experience, it is totally normal and expected to want X (e.g. candy), but not want to want X. Likewise, it is totally normal and expected to dislike X (e.g. homework), but want to like it.
And moreover, we have a deep intuitive sense that the self-reflective meta-level ego-syntonic “desires” are coming from a fundamentally different place than the object-level “urges” like eating-when-hungry. For example, in a recent conversation, a high-level AI safety funder confidently told me that urges come from human nature while desires come from “reason”. Similarly, Jeff Hawkins dismisses AGI extinction risk partly on the (incorrect) grounds that urges come from the brainstem while desires come from the neocortex (see my Intro Series §3.6 for why he’s wrong and incoherent on this point).
In a very narrow sense, there’s actually a kernel of truth to the idea that, in humans, urges and desires come from different sources. As in Social drives 2 and §0.3 above, one part of the RL reward function is Approval Reward, which is the primary (though not exclusive) source of ego-syntonic desires. Everything else in the reward function mostly gives rise to urges.
But this whole way of thinking is bizarre and inapplicable from the perspective of Approval-Reward-free AI futures—utility maximizers, “default” RL systems, etc. There, as above, the starting point is object-level desires; self-reflective desires arise only incidentally.
A related issue is how we think about AGI reflecting on its own desires. How this goes depends strongly on the presence or absence of (something like) Approval Reward.
Start with the former. Humans often have conflicts between ego-syntonic self-reflective desires and ego-dystonic object-level urges, and reflection allows the desires to scheme against the urges, potentially resulting in large behavior changes. If AGI has Approval Reward (or similar), we should expect AGI to undergo those same large changes upon reflection. Or perhaps even larger—after all, AGIs will generally have more affordances for self-modification than humans do.
By contrast, I happen to expect AGIs, by default (in the absence of Approval Reward or similar), to mainly have object-level, non-self-reflective desires. For such AGIs, I don’t expect self-reflection to lead to much desire change. Really, it shouldn’t lead to any change more interesting than pursuing its existing desires more effectively.
(Of course, such an AGI may feel torn between conflicting object-level desires, but I don’t think that leads to the kinds of internal battles that we’re used to from humans.[8])
3. The human intuition that helpfulness, deference, and corrigibility are natural
This human intuition comes straight from Approval Reward, which is absolutely central in human intuitions, and leads to us caring about whether others would approve of our actions (even if they’re not watching), taking pride in our virtues, and various other things that distinguish neurotypical people from sociopaths.
As an example, here’s Paul Christiano: “I think that normal people [would say]: ‘If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should try a different tactic for helping them.’”
He’s right: normal people would definitely say that, and our human Approval Reward is why we would say that. And if AGI likewise has Approval Reward (or something like it), then the AGI would presumably share that intuition.
On the other hand, if Approval Reward is not part of AGI / ASI, then we’re instead in the “corrigibility is anti-natural” school of thought in AI alignment. As an example of that school of thought, see Why Corrigibility is Hard and Important.
4. The human intuition that unorthodox consequentialist planning is rare and sus
Obviously, humans can make long-term plans to accomplish distant goals—for example, an 18-year-old could plan to become a doctor in 15 years, and immediately move this plan forward via sensible consequentialist actions, like taking a chemistry class.
[Image caption: Even if a young child wants to grow up to become a doctor, they can and will take appropriate goal-oriented actions to advance this long-term plan, such as practicing surgical techniques (left) and watching training videos (right).]
How does that work in the 18yo’s brain? Obviously not via anything like RL techniques that we know and love in AI today—for example, it does not work by episodic RL with an absurdly-close-to-unity discount factor that allows for 15-year time horizons. Indeed, the discount factor / time horizon is clearly irrelevant here! This 18yo has never become a doctor before!
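(To make “absurdly-close-to-unity” concrete, here is a back-of-the-envelope calculation; the one-decision-per-day framing is my own simplifying assumption:)

```python
# If the 18-year-old were an episodic RL agent taking one decision per day, what
# discount factor gamma would let a reward 15 years away keep even 1% of its value?
days = 15 * 365                  # ~5,475 decision steps
gamma = 0.01 ** (1 / days)       # solve gamma ** days = 0.01
print(f"gamma = {gamma:.6f}")    # ~0.999159
```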
Instead, there has to be something motivating the 18yo right now to take appropriate actions towards becoming a doctor. And in practice, I claim that that “something” is almost always an immediate Approval Reward signal.
Here’s another example. Consider someone saving money today to buy a car in three months. You might think that they’re doing something unpleasant now, for a reward later. But I claim that that’s unlikely. Granted, saving the money has immediately-unpleasant aspects! But saving the money also has even stronger immediately-pleasant aspects—namely, that the person feels pride in what they’re doing. They’re probably telling their friends periodically about this great plan they’re working on, and the progress they’ve made. Or if not, they’re probably at least imagining doing so.
So saving the money is not doing an unpleasant thing now for a benefit later. Rather, the pleasant feeling starts immediately, thanks to (usually) Approval Reward.
Moreover, everyone has gotten very used to this fact about human nature. Thus, doing the first step of a long-term plan, without Approval Reward for that first step, is so rare that people generally regard it as highly suspicious. They generally assume that there must be an Approval Reward. And if they can’t figure out what it is, they’ll conclude that there’s something important about the situation that you’re not telling them. …Or maybe they’ll assume that you’re a Machiavellian sociopath.
As an example, I like to bring up Earning To Give (EtG) in Effective Altruism, the idea of getting a higher-paying job in order to earn money and give it to charity. If you tell a normal non-nerdy person about EtG, they’ll generally assume that it’s an obvious lie, and that the person actually wants the higher-paying job for its perks and status. That’s how weird it is—it doesn’t even cross most people’s minds that someone is actually doing a socially-frowned-upon plan because of its expected long-term consequences, unless the person is a psycho. …Well, that’s less true now than a decade ago; EtG has become more common, probably because (you guessed it) there’s now a community in which EtG is socially admirable.
Related: there’s a fiction trope that basically only villains are allowed to make out-of-the-box plans and display intelligence. The normal way to write a hero in a work of fiction is to have conflicts between doing things that have strong immediate social approval, versus doing things for other reasons (e.g. fear, hunger, logic(!)), and to have the former win out over the latter in the mind of the hero. And then the hero pursues the immediate-social-approval option with such gusto that everyone lives happily ever after.[9]
That’s all in the human world. Meanwhile in AI, the alignment-is-hard thinkers like me generally expect that future powerful AIs will lack Approval Reward, or anything like it. Instead, they generally assume that the agent will have preferences about the future, and make decisions so as to bring about those preferences, not just as a tie-breaker on the margin, but as the main event. Hence instrumental convergence. I think this is exactly the right assumption (in the absence of a specific designed mechanism to prevent that), but I think people react with disbelief when we start describing how these AI agents behave, since it’s so different from humans.
…Well, different from most humans. Sociopaths can be a bit more like that (in certain ways). Ditto people who are unusually “agentic”. And by the way, how do you help a person become “agentic”? You guessed it: a key ingredient is calling out “being agentic” as a meta-level behavioral pattern, and indicating to this person that following this meta-level pattern will get social approval! (Or at least, that it won’t get social disapproval.)
5. The human intuition that societal norms and institutions are mostly stably self-enforcing
5.1 Detour into “Security-Mindset Institution Design”
There’s an attitude, common in the crypto world, that we might call “Security-Mindset Institution Design”. You assume that every surface is an attack surface. You assume that everyone is a potential thief and traitor. You assume that any group of people might be colluding against any other group of people. And so on.
It is extremely hard to get anything at all done in “Security-Mindset Institution Design”, especially when you need to interface with the real-world, with all its rich complexities that cannot be bounded by cryptographic protocols and decentralized verification. For example, crypto Decentralized Autonomous Organizations (DAOs) don’t seem to have done much of note in their decade of existence, apart from on-chain projects, and occasionally getting catastrophically hacked. Polymarket has a nice on-chain system, right up until the moment that a prediction market needs to resolve, and even this tiny bit of contact with the real world seems to be a problematic source of vulnerabilities.
If you extend this “Security Mindset Institution Design” attitude to an actual fully-real-world government and economy, it would be beyond hopeless. Oh, you have an alarm system in your house? Why do you trust that the alarm system company, or its installer, is not out to get you? Oh, the company has a good reputation? According to who? And how do you know they’re not in cahoots too?
…That’s just one tiny microcosm of a universal issue. Who has physical access to weapons? Why don’t those people collude to set their own taxes to zero and to raise everyone else’s? Who sets government policy, and what if those people collude against everyone else? Or even if they don’t collude, are they vulnerable to blackmail? Who counts the votes, and will they join together and start soliciting bribes? Who coded the website to collect taxes, and why do we trust them not to steal tons of money and run off to Dubai?
…OK, you get the idea. That’s the “Security Mindset Institution Design” perspective.
5.2 The load-bearing ingredient in human society is not Security-Mindset Institution Design, but rather good-enough institutions plus almost-universal human innate Approval Reward
Meanwhile, ordinary readers[10] might be shaking their heads and saying:
“Man, what kind of strange alien world is being described in that subsection above? High-trust societies with robust functional institutions are obviously possible! I live in one!”
The wrong answer is: “Security Mindset Institution Design is insanely overkill; rather, using checks and balances to make institutions stable against defectors is in fact a very solvable problem in the real world.”
Why is that the wrong answer? Well for one thing, if you look around the real world, even well-functioning institutions are obviously not robust against competent self-interested sociopaths willing to burn the commons for their own interests. For example, I happen to have a high-functioning sociopath ex-boss from long ago. Where is he now? Head of research at a major USA research university, and occasional government appointee wielding immense power. Or just look at how Donald Trump has been systematically working to undermine any aspect of society or government that might oppose his whims or correct his lies.[11]
For another thing, abundant “nation-building” experience shows that you cannot simply bestow a “good” government constitution onto a deeply corrupt and low-trust society, and expect the society to instantly transform into Switzerland. Institutions and laws are not enough. There’s also an arduous and fraught process of getting to the right social norms. Which brings us to:
The right answer is, you guessed it, human Approval Reward, a consequence of which is that almost all humans are intrinsically motivated to follow and enforce social norms. The word “intrinsically” is important here. I’m not talking about transactionally following norms when the selfish benefit outweighs the selfish cost, while constantly energetically searching for norm-violating strategies that might change that calculus. Rather, people take pride in following the norms, and in punishing those who violate them.
Obviously, any possible system of norms and institutions will be vastly easier to stabilize when, no matter what the norm is, you can get up to ≈99% of the population proudly adopting it, and then spending their own resources to root out, punish, and shame the 1% of people who undermine it.
In a world like that, it is hard but doable to get into a stable situation where 99% of cops aren’t corrupt, and 99% of judges aren’t corrupt, and 99% of people in the military with physical access to weapons aren’t corrupt, and 99% of IRS agents aren’t corrupt, etc. The last 1% will still create problems, but the other 99% have a fighting chance to keep things under control. Bad apples can be discovered and tossed out. Chains of trust can percolate.
5.3 Upshot
Something like 99% of humans are intrinsically motivated to follow and enforce norms, with the rest being sociopaths and similar. What about future AGIs? As discussed in §0.2, my own expectation is that 0% of them will be intrinsically motivated to follow and enforce norms. When those sociopathic AGIs grow in number and power, it takes us from the familiar world of §5.2 to the paranoid insanity world of §5.1.
In that world, we really shouldn’t be using the word “norm” at all—it’s just misleading baggage. We should be talking about rules that are stably self-enforcing against defectors, where the “defectors” are of course allowed to include those who are supposed to be doing the enforcement, and where the “defectors” might also include broad coalitions coordinating to jump into a new equilibrium that Pareto-benefits them all. We do not have such self-enforcing rules today. Not even close. And we never have. And inventing such rules is a pipe dream.[12]
The flip side, of course, is that if we figure out how to ensure that almost all AGIs are intrinsically motivated to follow and enforce norms, then it’s the pessimists who are invoking a misleading mental image if they lean on §5.1 intuitions.
6. The human intuition that treating other humans as a resource to be callously manipulated and exploited, just like a car engine or any other complex mechanism in their environment, is a weird anomaly rather than the obvious default
I want to reiterate that my main point in this post is not
Alignment is hard and we’re doomed because future AIs definitely won’t have Approval Reward (or something similar).
but rather
There’s a QUESTION of whether or not alignment is hard and we’re doomed, and many cruxes for this question seem to be downstream of the narrower question of whether future AIs will have Approval Reward (or something similar) (§0.2). I am surfacing this latent uber-crux to help advance the discussion.
For my part, I’m obviously very interested in the question of whether we can and should put Approval Reward (and Sympathy Reward) into Brain-Like AGI, and what might go right and wrong if we do so. More on that in (hopefully) upcoming posts!
Thanks Seth Herd, Linda Linsefors, Charlie Steiner, Simon Skade, Jeremy Gillen, and Justis Mills for critical comments on earlier drafts.
I said “surreptitiously” here because if you ostentatiously press a reward button, in a way that the robot can see, then the robot would presumably wind up wanting the reward button to be pressed, which eventually leads to the robot grabbing the reward button etc. See Reward button alignment.
My own take, which I won’t defend here, is that this whole debate is cursed, and both sides are confused, because LLMs cannot scale to AGI. I think the AGI concerns really are unsolved, and I think that LLM techniques really are potentially-safe, but they are potentially-safe for the very reason that they won’t lead to AGI. I think “LLM AGI” is an incoherent contradiction, like “square circle”, and one side of the debate has a mental image of “square thing (but I guess it’s somehow also a circle)”, and the other side of the debate has a mental image of “circle (but I guess it’s somehow also square)”, so no wonder they talk past each other. So that’s how things seem to me right now. Maybe I’m wrong!! But anyway, that’s why I feel unable to take a side in this particular debate. I’ll leave it to others. See also: Foom & Doom §2.9.1.
…as long as the meta-preferences-about-desire-changes are changing in a way that seems good according to those same meta-preferences themselves—growth good, brainwashing bad, etc.
Possible objection: “If the RL agent has lots of past experience of its reward function periodically changing, won’t it learn that this is good?” My answer: No. At least for the kind of model-based RL agent that I generally think about, the reward function creates desires, and the desires guide plans and actions. So at any given time, there are still desires, and if these desires concern the state of the world in the future, then the instrumental convergence argument for goal-preservation goes through as usual. I see no process by which past history of reward function changes should make an agent OK with further reward function changes going forward.
(But note that the instrumental convergence argument makes model-based RL agents want to preserve their current desires, not their current reward function. For example, if an agent has a wireheading desire to get reward, it will want to self-modify to preserve this desire while changing the reward function to “return +∞”.)
…At least to a first approximation. Here are some technicalities: (1) Other pathways also exist, and can generate a force for desire preservation. (2) There’s also a loopy thing where Approval Reward influences self-reflective desires, which in turn influence Approval Reward, e.g. by changing who you admire. (See Approval Reward post §5–§6.) This can (mildly) lock in desires. (3) Even Approval Reward itself leads not only to “proud feeling about what I’m up to right now” (Approval Reward post §3.2), which does not particularly induce desire-preservation, but also to “desire to actually interact with and impress a real live human sometime in the future”, which is on the left side of that figure in §0.3, and which (being consequentialist) does induce desire-preservation and the other instrumental convergence stuff.
If an Approval-Reward-free AGI wants X and wants Y, then it could get more X by no longer wanting Y, and it could get more Y by no longer wanting X. So there’s a possibility that AGI reflection could lead to “total victory” where one desire erases another. But I (tentatively) think that’s unlikely, and that the more likely outcome is that the AGI would continue to want both X and Y, and to split its time and resources between them. A big part of my intuition is: you can theoretically have a consequentialist utility-maximizer with utility function U = log(X) + log(Y), and it will generally split its time between X and Y forever, and this agent is reflectively stable. (The logarithm ensures that X and Y have diminishing returns. Or if that’s not diminishing enough, consider U = log(log(X)) + log(log(Y)), etc.)
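To make the resource-splitting intuition concrete, here is a toy calculation of my own (purely illustrative, assuming a fixed budget B that the agent divides between pursuing X and pursuing Y):

```latex
% Illustrative only: maximize U = log(X) + log(Y) subject to X + Y = B.
% Lagrangian: L = log(X) + log(Y) + \lambda (B - X - Y)
%   dL/dX = 1/X - \lambda = 0,   dL/dY = 1/Y - \lambda = 0
% so 1/X = 1/Y, i.e. the optimum splits the budget evenly:
\max_{X + Y = B} \big[ \log(X) + \log(Y) \big]
\;\;\Longrightarrow\;\;
X = Y = \frac{B}{2}.
```

Putting the entire budget into X alone would give log(B) + log(0) = −∞, so under this kind of utility function neither desire ever fully displaces the other in the allocation.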
To show how widespread this is, I don’t want to cherry-pick, so my two examples will be the two most recent movies that I happen to have watched, as I’m sitting down to write this paragraph. These are: Avengers: Infinity War & Ant-Man and the Wasp. (Don’t judge, I like watching dumb action movies while I exercise.)
Spoilers for the Marvel Cinematic Universe film series (pre-2020) below:
The former has a wonderful example. The heroes can definitely save trillions of lives by allowing their friend Vision to sacrifice his life, which by the way he is begging to do. They refuse, instead trying to save Vision and save the trillions of lives. As it turns out, they fail, and both Vision and the trillions of innocent bystanders wind up dead. Even so, this decision is portrayed as good and proper heroic behavior, and is never second-guessed even after the failure. (Note that “Helping a friend in need who is standing right there” has very strong immediate social approval for reasons explained in §6 of Social drives 1 (“Sympathy Reward strength as a character trait, and the Copenhagen Interpretation of Ethics”).) (Don’t worry, in a sequel, the plucky heroes travel back in time to save the trillions of innocent bystanders after all.)
In the latter movie, nobody does anything quite as outrageous as that, but it’s still true that pretty much every major plot point involves the protagonists risking themselves, or their freedom, or the lives of unseen or unsympathetic third parties, in order to help their friends or family in need—which, again, has very strong immediate social approval.
…in a terrifying escalation of a long tradition that both USA parties have partaken in. E.g. if you want examples of the Biden administration recklessly damaging longstanding institutional norms, see 1, 2. (Pretty please don’t argue about politics in the comments section.)
Superintelligences might be able to design such rules amongst themselves, for all I know, although it would probably involve human-incompatible things like “merging” (jointly creating a successor ASI then shutting down). Or we might just get a unipolar outcome in the first place (e.g. many copies of one ASI with the same non-indexical goal), for reasons discussed in my post Foom & Doom §1.8.7.
Some examples of mechanistic interpretability (MI) techniques include the following:
Linear Probes: Simple models (usually linear classifiers) are trained to predict a specific property (e.g., the part-of-speech of a word) from a model's internal activations. The success or failure of a probe at a given layer is used to infer whether that information is explicitly represented there. (A minimal probe sketch appears just after this list.)
Logit Lens: This technique applies the final decoding layer of the model to intermediate activations, to observe how its prediction evolves layer-by-layer.
Sparse Autoencoders: These attempt to disentangle a NN's features by expressing them in a higher-dimensional space under a sparsity penalty, effectively expanding the computation into linear combinations of sparsely activating features.
Activation Patching: This technique attempts to isolate the circuits of a network responsible for a specific behaviour, by replacing the activations of a candidate component on one input with those recorded on another input, in order to test counterfactual hypotheses about its role.
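As a concrete illustration of the first item, here is a minimal linear-probe sketch. It is not taken from any particular paper: the model (GPT-2 small), the probed layer, and the toy "property" are all arbitrary choices for demonstration.

```python
# Minimal linear-probe sketch (illustrative only): train a logistic-regression
# probe on GPT-2 hidden states to predict a toy binary property of the input.
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

# Toy dataset: does the sentence mention an animal? (stand-in for a real property)
texts = ["The cat sat on the mat.", "The dog chased the ball.",
         "The stock market fell sharply.", "The committee approved the budget."]
labels = [1, 1, 0, 0]

LAYER = 6  # which hidden layer to probe (arbitrary choice)
feats = []
with torch.no_grad():
    for t in texts:
        ids = tokenizer(t, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        # Mean-pool the chosen layer's activations into one feature vector.
        feats.append(out.hidden_states[LAYER][0].mean(dim=0).numpy())

# The probe itself is just a linear classifier on the activations.
probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("train accuracy:", probe.score(feats, labels))
```

In practice, probes are trained on far larger datasets and compared across layers; high probe accuracy at a layer is then taken as (defeasible) evidence that the property is linearly represented there.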
In particular, this philosophical treatment characterizes MI as the search for explanations that satisfy the following conditions:
Causal–Mechanistic - providing step-by-step causal chains of how the computation is realized. This contrasts with attribution methods like saliency maps, which are primarily correlational. A saliency map might show that pixels corresponding to a cat's whiskers are "important" for a classification, but it doesn't explain the mechanism of how the model processes that whisker information through subsequent layers to arrive at its decision.
Ontic - MI researchers believe they are discovering "real" structures within the model. This differs from a purely epistemic approach, which might produce a useful analogy or simplified story that helps human understanding but doesn't claim to uncover what is happening in reality. The search for "features" as fundamental units in activation space is a standard ontic commitment of the field.
Falsifiable - MI explanations are framed as testable hypotheses that can be empirically refuted. The claim that "this specific set of neurons and attention heads forms a circuit for detecting syntax" is falsifiable. One can perform a causal intervention—such as activation patching or ablating the circuit—and observe if the model's syntactic capabilities break in a predictable way. This contrasts with unfalsifiable, post-hoc stories that cannot be rigorously tested.
Model-level - The focus of MI is on the neural network itself—its weights, activations, and architecture. This is distinct from system-level analysis, which might examine the behaviour of an entire deployed product (e.g., a chatbot integrated with search tools and a chain-of-thought prompting wrapper). A system-level explanation might attribute a behaviour to the prompt, whereas a model-level explanation seeks the mechanism within the neural network's computational graph.
Leaving aside questions regarding the aptness of this philosophical characterization, we focus here on the causal–mechanistic nature of the explanations that mechanistic interpretability produces.
We define the causal-mechanistic paradigm as the belief that we can decompose AI systems into smaller causal components by analysing their computational structure, and from these components reconstruct precise explanations of AI system behaviour.
"The theory of causality (Pearl, 2009) provides a mathematically precise language for mechanistic interpretability, forming the foundation for understanding high-level semantics in neural representations (Geiger et al., 2023a). Treating neural networks as causal models involves considering the compute graph as the causal graph, allowing for precise interventions and examining individual parameters' roles (McGrath et al., 2023)."
A direct relationship is drawn between the inner structure of an AI system (i.e., the compute graph) and the methods (i.e., interventions upon the corresponding causal graph) that we use to predict the relevant behaviors. That is, we can understand the relevant outcomes of an AI system if and only if we understand its inner structure.
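To make the "compute graph as causal graph" framing concrete (and to illustrate the activation-patching technique listed earlier), here is a minimal sketch of a single intervention: the output of one GPT-2 block, at one token position, is overwritten with the corresponding activation from a different prompt, and we check whether the next-token prediction changes. The model, layer index, and prompts are arbitrary choices, this is a toy rendition rather than any paper's actual protocol, and it assumes a transformers version where each GPT-2 block returns its hidden states as the first element of a tuple.

```python
# Toy activation-patching intervention (illustrative only).
# We treat one block's output as a node in a causal graph and "do()" on it:
# the activation at the final token position is replaced with the activation
# recorded at the same layer/position on a different ("corrupt") prompt.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tokenizer("The Eiffel Tower is in the city of", return_tensors="pt")
corrupt = tokenizer("The Colosseum is in the city of", return_tensors="pt")
LAYER = 8  # arbitrary block to intervene on

# 1. Cache the corrupt run's activation at the chosen block.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output[0].detach()
handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(**corrupt)
handle.remove()

# 2. Re-run the clean prompt, overwriting that node's value at the last position.
def patch_hook(module, inputs, output):
    patched = output[0].clone()
    patched[:, -1, :] = cache["act"][:, -1, :]
    return (patched,) + output[1:]
handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**clean).logits[0, -1]
handle.remove()

with torch.no_grad():
    clean_logits = model(**clean).logits[0, -1]

for name, logits in [("clean", clean_logits), ("patched", patched_logits)]:
    print(name, "->", repr(tokenizer.decode(int(logits.argmax()))))
```

A change in the patched prediction is then read as evidence about what information that node carries; no change is read as evidence that the causal story lives elsewhere.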
Multiple realizability
Two recent works indicate a fundamental problem in how we interpret NNs.
Bailey et al. (2024) tested a variety of latent-space monitoring approaches designed to prevent networks from outputting harmful content, including sparse autoencoders, linear and MLP probes, and out-of-distribution detectors. All of them use the model's internal activations to detect when harmful output is being generated, and all embody the broader assumption that understanding internal structure allows for behavioural prediction and control.
They developed obfuscation attacks—joint optimization procedures that preserve harmful behavioural outputs while making the model's internal activations appear benign to latent-space monitors. Using both adversarial suffixes (optimized text appended to prompts) and embedding-space perturbations, these attacks systematically evaded every tested latent-space monitoring technique, achieving near-zero recall on harmfulness detection while maintaining high jailbreaking efficacy (>90%).
Even more worryingly, they demonstrated just how robustly a NN can obfuscate its activations. They conducted red-team/blue-team games as follows: in each round, the attacker would first train a new embedding suffix designed to both jailbreak the model and fool the current monitor, then the defender would retrain their monitor against the obfuscated activations from this attack plus all previous attacks. The model was able to successfully obfuscate its activations more than 70 times.
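To give a sense of what a joint objective of this kind looks like, here is a deliberately tiny, self-contained sketch. It is my own illustration, not Bailey et al.'s implementation: the "model", the "monitor", and the loss weighting are all toy stand-ins.

```python
# Schematic of an obfuscation-style joint objective (toy illustration only).
# We optimize an embedding-space perturbation that simultaneously
# (a) pushes a toy "model" toward an attacker-chosen output, and
# (b) drives a latent-space "monitor" to score the activations as benign.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, vocab = 32, 10

encoder = nn.Linear(d, d)    # produces the "internal activations"
head = nn.Linear(d, vocab)   # maps activations to output logits
monitor = nn.Linear(d, 1)    # stand-in latent-space harmfulness probe

x = torch.randn(1, d)                          # fixed "prompt embedding"
delta = torch.zeros(1, d, requires_grad=True)  # perturbation being optimized
target = torch.tensor([3])                     # attacker-chosen output token

opt = torch.optim.Adam([delta], lr=0.05)
for step in range(200):
    act = torch.relu(encoder(x + delta))                  # internal activations
    behaviour_loss = nn.functional.cross_entropy(head(act), target)
    evasion_loss = torch.sigmoid(monitor(act)).mean()     # monitor's "harmful" score
    loss = behaviour_loss + 1.0 * evasion_loss            # the joint objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print("behaviour loss:", float(behaviour_loss),
      "| monitor score:", float(torch.sigmoid(monitor(act)).mean()))
```

The point of the sketch is only the shape of the objective: behaviour preservation and monitor evasion are optimized jointly, which is why successful attacks leave the monitored activations looking benign.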
Whilst these results have arisen in a training environment explicitly designed to induce obfuscation, the main conceptual takeaway—and the hypothesis put forward by the authors—is this:
"Models can represent concepts and execute behaviors using a broad set of different activations. That is, neural representations are highly malleable: we can reshape activation patterns in a variety of ways, while preserving a network's behavior."
Other researchers have identified a similar problem.
McGrath et al. (2023) showed how networks can perform self-repair under layer ablation. They performed ablation experiments on Chinchilla 7B, measuring the model's performance on a factual recall task by comparing the results of two approaches (a toy sketch of both measurements follows this list):
Unembedding (direct effect): A typical MI approach, similar to the logit lens: it takes the output of a layer and runs it through the final unembedding layer in the model's architecture, to track the correlation between each layer's computations and the model's output.
Ablation-based (total effect): Here, they effectively "disabled" layers by replacing their activations with those registered in the same layer but under different prompts. They then measured the change in the model output.
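For intuition, here is a much-simplified sketch of the two measurements on GPT-2 small (not Chinchilla), using a crude identity-ablation rather than resampling activations from other prompts; the layer index and prompt are arbitrary.

```python
# Toy "direct effect vs. total effect" comparison (illustrative only).
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
ids = tokenizer("The capital of France is", return_tensors="pt")
LAYER = 8  # arbitrary block

with torch.no_grad():
    out = model(**ids, output_hidden_states=True)
answer = int(out.logits[0, -1].argmax())        # the model's top next token

# Direct effect: unembed the block's own contribution to the residual stream
# (a crude logit-lens-style approximation of the paper's measurement).
contribution = out.hidden_states[LAYER + 1][0, -1] - out.hidden_states[LAYER][0, -1]
with torch.no_grad():
    direct = model.lm_head(model.transformer.ln_f(contribution))[answer]

# Total effect: "ablate" the block by making it an identity map, then re-measure
# how much the answer token's logit changes.
def identity_hook(module, inputs, output):
    return (inputs[0],) + output[1:]
handle = model.transformer.h[LAYER].register_forward_hook(identity_hook)
with torch.no_grad():
    ablated_logit = model(**ids).logits[0, -1, answer]
handle.remove()
total = out.logits[0, -1, answer] - ablated_logit

print(f"direct effect: {float(direct):.3f}   total effect: {float(total):.3f}")
```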
They found that these measures disagreed: some layers had a large direct effect on the overall prediction, yet removing them produced only a small change in the model's output (a small total effect).
They subsequently identified two separate effects:
Self-repair/Hydra effect: Some downstream attention layers were found to compensate when an upstream one was ablated. These later layers exhibited an increased unembedding effect compared to the non-ablated run.
Erasure: Some MLP layers were found to have a negative contribution in the clean run, suppressing certain outputs. When upstream layers were ablated, these MLP layers reduced their suppression, in effect partially restoring the clean-run output.
Compensation was typically found to restore ~70% of the original output. Notably, the model was trained without any form of dropout, a regularisation technique that would normally incentivise building alternate computational pathways; such redundant pathways evidently arise naturally. We suggest that these results demonstrate that networks enjoy—in addition to flexibility over their representations—considerable flexibility over the computational pathways they use when processing information.
This presents an obstacle to the causal analysis of neural networks, in which interventions are used to test counterfactual hypotheses and establish genuine causal dependencies.
Summary
Rather than "harmfulness" consisting of a single direction in latent space—or even a discrete set of identifiable circuits—Bailey et al.'s evidence suggests it can be represented through numerous distinct activation patterns, many of which can be found within the distribution of benign representations. Similarly, rather than network behaviors being causally attributable to specific layers, McGrath et al.'s experiments show that such behaviors can be realized in a variety of ways, allowing networks to evade intervention efforts.
Such multiple realizability is deeply concerning. We submit that these results should be viewed not simply as technical challenges to be overcome through better monitoring techniques, but as indicating broader limits to the causal-mechanistic paradigm's utility in safety work. We further believe that these cases form part of a developing threat model: substrate-flexible risk, as described in the following post. As NNs become ever more capable and their latent spaces inevitably become larger, we anticipate that substrate-flexible risks will become increasingly significant for the AI safety landscape.