Published on December 4, 2025 7:53 PM GMT
(This is a linkpost for Duncan Sabien's article "Thresholding" which was published July 6th, 2024. I (Screwtape) am crossposting a linkpost version because I want to nominate it for the Best of LW 2024 review - I'm not the original author.)
If I were in some group or subculture and I wanted to do as much damage as possible, I wouldn’t create some singular, massive disaster.
Instead, I would launch a threshold attack.
I would do something objectionable, but technically defensible, such that I wouldn’t be called out for it (and would win or be exonerated if I were called out for it). Then, after the hubbub had died down, I would do it again. Then maybe I would try something that’s straightforwardly shitty, but in a low-grade, not-worth-the-effort-it-would-take-to-complain-about-it sort of way. Then I’d give it a couple of weeks to let the memory fade, and come back with something that is across the line, but where I could convincingly argue ignorance, or that I had been provoked, or that I honestly thought it was fine because look, that person did the exact same thing and no one objected to them, what gives?
Maybe there’d be a time where I did something that was clearly objectionable, but pretty small, actually—the sort of thing that would be the equivalent of a five-minute time-out, if it happened in kindergarten—and then I would fight tooth-and-nail for weeks, exhausting every avenue of appeal, dragging every single person around me into the debate, forcing people to pick sides, forcing people to explain and clarify and specify every aspect of their position down to the tiniest detail, inflating the cost of enforcement beyond all reason.
Then I’d behave for a while, and only after things had been smooth for months would I make some other minor dick move, and when someone else snapped and said “all right, that’s it, that’s the last straw—let’s get rid of this guy,” I’d object that hey, what the fuck, you guys keep trying to blame me for all sorts of shit and I’ve been exonerated basically every time, sure there was that one time where it really was my fault but I apologized for that one, are you really going to try to play like I am some constant troublemaker just because I slipped up once?
And if I won that fight, then the next time I was going to push it, maybe I’d start out by being like “btw don’t forget, some of the shittier people around here try to scapegoat me; don’t be surprised if they start getting super unreasonable because of what I’m about to say/do.”
And each time, I’d be sure to target different people, and some people I would never target at all, so that there would be maximum confusion between different people’s very different experiences of me, and it would be maximally difficult to form clear common knowledge of what was going on. And the end result would be a string of low-grade erosive acts that, in the aggregate, are far, far, far more damaging than if I’d caused one single terrible incident.
This is thresholding, and it’s a class of behavior that most rule systems (both formal and informal) are really ill-equipped to handle. I’d like for this essay to help you better recognize thresholding when it’s happening, and give you the tools to communicate what you’re seeing to others, such that you can actually succeed at coordinating against it.
There are at least three major kinds of damage done by this sort of pattern. . .
(crossposter note: the rest is at https://homosabiens.substack.com/p/thresholding.)
Published on December 4, 2025 8:01 PM GMT
Dimensionalize. Antithesize. Metaphorize. These are cognitive tools in an abstract arsenal: directed reasoning that you can point at your problems.
They’re now available as a Claude Skills library. Download the Future Tokens skill library here, compress it to a .zip, and drag it into Claude → Settings → Skills (desktop). When you want Claude to run one, type “@dimensionalize” (or whatever skill you want) in the chat.
Language models should be good at abstraction. They are. The Future Tokens skills make that capability explicit and steerable.
In an LLM-judged test harness across dozens of skills, Future Tokens skill calls beat naïve prompts by roughly 0.2–0.4 (on a 0–1 scale) on insight, task alignment, reasoning visibility, and actionability, with similar factual accuracy. That’s roughly a 20–40 percentage-point bump on “reasoning quality” (as measured).
Abstraction means recognizing, simplifying, and reusing patterns. It’s a general problem-solving method. There’s no magic to it. Everyone abstracts every day, across all domains:
But most talk about abstraction stays at the noun level: “here is a concept that stands for a cluster of things.”
For thinking, abstraction needs verbs: what we do when we generate and refine patterns:
This is what the skills are, and why I have named them using verbed nouns. Not metaphysics: reusable procedures for problem solving.
These aren’t totally original ideas. They’re what good thinkers already do. My contribution here is:
Here’s the problem: abstraction is hard.
Humans are the only species we know of that abstracts deliberately. Our brains are built for it, yet we still spend decades training to do it well in even one domain.
We’re all constrained by some mix of attention, domain expertise, time, interest, raw intelligence. I believe everyone abstracts less, and less clearly, than they would if they were unconstrained.
The failure modes are predictable:
The skills don’t fix everything. But:
You still need judgment. You still need priors. You just get more structured passes over the problem for the same amount of effort.
Future Tokens is my library of cognitive operations packaged as Claude Skills. Each skill is a small spec that says:
The current public release includes 5 of my favorite operations. When to reach for them:
Over time, you stop thinking “wow, what a fancy skill” and start thinking “oh right, I should just dimensionalize this.”
Language models are trained on the accumulated output of human reasoning. That text is full of abstraction: patterns, compressions, analogies, causal narratives. Abstraction, in the form of pattern recognition and compression, is exactly what that training optimizes for.
Asking an LLM to dimensionalize or metaphorize isn’t asking it to do something foreign or novel. It’s asking it to do the thing it’s built for, with explicit direction instead of hoping it stumbles into the right move. So:
The interesting discovery is that these capabilities exist but are hidden: simple to access once named, but nontrivial to find. The operations are latent in the model[1].
Most of the “engineering” this work entails is actually just: define the operation precisely enough that the model can execute it consistently, and that you can tell when it failed.
I’ve been testing these skills against baseline prompts across models. Short version: in my test harness, skill calls consistently outperform naïve prompting by about 0.2–0.4 (on a 0–1 scale) on dimensions like insight density, reasoning visibility, task alignment, and actionability, with essentially the same factual accuracy. Against strong “informed” prompts that try to mimic the operation without naming it, skills still score about 0.1 higher on those non-factual dimensions. The long version is in the footnotes[2].
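For concreteness, here is a rough sketch of the shape of such a harness. Everything below is illustrative (the callables `generate` and `judge`, the skill and dimension names); it is not the actual test code.

```python
# Hypothetical sketch of an LLM-judged comparison harness, not the real test code.
# `generate(prompt)` and `judge(prompt)` stand in for calls to the model under test
# and to a separate judge model, respectively.

DIMENSIONS = ["insight", "task_alignment", "reasoning_visibility",
              "actionability", "factual_accuracy"]

def score(judge, question, answer):
    """Ask the judge model to rate an answer 0-1 on each rubric dimension."""
    return {
        dim: float(judge(
            f"Rate the following answer from 0 to 1 on {dim}.\n"
            f"Question: {question}\nAnswer: {answer}\nReply with a number only."
        ).strip())
        for dim in DIMENSIONS
    }

def compare(generate, judge, question, skill="dimensionalize"):
    """Return per-dimension deltas: skill-invoked answer minus naive answer."""
    naive_scores = score(judge, question, generate(question))
    skill_scores = score(judge, question, generate(f"@{skill} {question}"))
    return {dim: skill_scores[dim] - naive_scores[dim] for dim in DIMENSIONS}
```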
The more interesting finding: most of the value comes from naming the operation clearly. Elaborate specifications help on more capable models but aren’t required. The concept does the work.
This is a strong update on an already favorable prior. Of course directing a pattern-completion engine toward specific patterns helps. The surprise would be if it didn’t.
I couldn’t find a compelling reason to gate it.
These operations are patterns that already exist in publicly available models because they are how good thinkers operate. I want anyone to be able to know and use what LLMs are capable of.
My actual personal upside looks like:
The upside of standardization is greater than the upside of rent-seeking. My goal isn’t to sell a zip file; it is to upgrade the conversational interface of the internet. I want to see what happens when the friction of “being smart” drops to zero.
So, it’s free. Use it, fork it, adapt it, ignore 90% of it. Most of all, enjoy it!
The current release is a subset of a larger taxonomy. Many more operations are in development, along with more systematic testing.
In the limit, this is all an experiment in compiled cognition: turning the better parts of our own thinking into external, callable objects so that future-us (and others) don’t have to reinvent them every time.
If you use these and find something interesting (or broken), I want to hear about it. The skills started as experiments and improve through use.
Download the library. Try “@antithesize” on the next essay you read. Let me know what happens!
Not just Claude: all frontier LLMs have these same capabilities, to varying degrees. Claude Skills is just the perfect interface to make the operations usable once discovered.
Testing setup, in English:
Then I had a separate LLM instance act as judge with a fixed rubric, scoring each answer 0–1 on:
Across the full skill set, averaged:
With essentially the same factual accuracy (within ~0.03).
That’s where the “20–40 percentage-point bump” line comes from: it’s literally that absolute delta on a 0–1 scale.
Important caveats:
So the right takeaway is not “guaranteed 30% better thinking in all domains.” It’s more like:
When you run these skills on the kind of messy, multi-factor problems they’re designed for, you reliably get close to expert-quality reasoning by typing a single word.
Published on December 4, 2025 7:11 PM GMT
Sparse autoencoders (SAEs) and cross-layer transcoders (CLTs) have recently been used to decode the activation vectors in large language models into more interpretable features. Analyses have been performed by Goodfire, Anthropic, DeepMind, and OpenAI. BluelightAI has constructed CLT features for the Qwen3 family, specifically Qwen3-0.6B Base and Qwen3-1.7B Base, which are made available for exploration and discovery here. In addition to the construction of the features themselves, we enable the use of topological data analysis (TDA) methods for improved interaction and analysis of the constructed features.
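As a rough sketch of the underlying idea only (this is a generic sparse autoencoder, not BluelightAI's CLT construction, and the dimensions are made up): an activation vector is encoded into a much wider, sparse feature vector whose individual coordinates tend to be more interpretable, then decoded back to approximately reconstruct the original activation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: map model activations to sparse features and back.
    Sizes and training details are illustrative, not those used for Qwen3."""
    def __init__(self, d_model=2048, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative features
        reconstruction = self.decoder(features)            # approximate original activations
        return features, reconstruction

# Training typically minimizes reconstruction error plus a sparsity penalty, e.g.:
# loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
# A cross-layer transcoder is similar in spirit, but its features help predict outputs
# at subsequent layers rather than only reconstructing the same layer's activations.
```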
We have found anecdotally that it is easier to find clearer and more conceptually abstract features in the CLT features we construct than what we have observed in other analyses. Here are a couple of examples from Qwen3-1.7B-Base:
Layer 20, feature 847: Meta-level judgment of conceptual or interpretive phrases, often with strong evaluative language. It fires on text that evaluates how something is classified, framed, or interpreted, especially when it says that a commonly used label or interpretation is wrong.
Layer 20, feature 179: Fires on phrases about criteria or conditions that must be fulfilled, and is multilingual.
In addition, a number of features are preferentially highly active on the CLTs, showing high activation specifically for stop words and punctuation, as was observed in this analysis.
Topological data analysis methods are used to enable the identification and analysis of groups of features. Even though the CLT features we construct are often meaningful by themselves, it is certainly the case that ideas and concepts will be more precisely identified by groups of features. TDA enables the determination of groups of features that are close in a similarity measure through a visual interface.
Here is an illustration of the interface. Each node in the graph corresponds to a group of features, so groups of nodes also correspond to groups of features. The circled group is at least partially explained by the phrases on the right.
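As a toy illustration of the underlying operation (grouping features that are close under a similarity measure), not of the TDA construction or interface itself; the use of decoder directions and the threshold here are assumptions made for the sketch:

```python
import numpy as np

def group_features(decoder_directions, threshold=0.7):
    """Greedily group features whose (normalized) decoder directions have high
    cosine similarity. Illustrative stand-in only; the TDA interface builds a
    richer graph structure over the features."""
    dirs = decoder_directions / np.linalg.norm(decoder_directions, axis=1, keepdims=True)
    sims = dirs @ dirs.T
    unassigned = set(range(len(dirs)))
    groups = []
    while unassigned:
        seed = unassigned.pop()
        group = {seed} | {j for j in unassigned if sims[seed, j] > threshold}
        unassigned -= group
        groups.append(sorted(group))
    return groups
```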
We also believe that TDA can be used effectively as a tool for circuit-tracing in LLMs. Circuit tracing is currently a largely manual procedure: one selects individual features and examines the features in subsequent layers that they connect to. Connections between groups of features are something one would very much like to analyze, and we will return to that in a future post.
Published on December 4, 2025 6:46 PM GMT
Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.
Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.
All of these hypotheses share an important justification: an AI with the hypothesized motivation would behave in ways that are highly fit under reinforcement learning.
This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns to be selected.
In this post I’ll spell out what this more general principle means and why it’s helpful. Specifically:
This post is mostly a repackaging of existing ideas (e.g., here, here, and here). Buck provided helpful discussion throughout the writing process and the behavioral selection model is based on a causal graph Buck drew in an early conversation. Thanks to Alex Cloud, Owen Cotton-Barratt, Oliver Habryka, Vivek Hebbar, Ajeya Cotra, Alexa Pan, Tim Hua, Alexandra Bates, Aghyad Deeb, Erik Jenner, Ryan Greenblatt, Arun Jose, Anshul Khandelwal, Lukas Finnveden, Aysja Johnson, Adam Scholl, Aniket Chakravorty, and Carlo Leonardo Attubato for helpful discussion and/or feedback.
The behavioral selection model predicts AI behavior by modeling the AI’s[1] decisions as driven by a combination of cognitive patterns that can gain and lose influence via selection.
A cognitive pattern is a computation within the AI that influences the AI's actions. It can be activated by particular contexts. For example, the AI might have a contextually-activated trash-grabbing cognitive pattern that looks like: “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642” (Turner).
Cognitive patterns can gain or lose influence via selection. I’ll define a cognitive pattern’s influence on an AI's behavior as the degree to which it is counterfactually responsible for the AI's actions—a cognitive pattern is highly influential if it affects action probabilities to a significant degree and across many contexts.[2] For a cognitive pattern to “be selected,” then, means that it gains influence.
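Footnote 2 suggests one way to make “influence” precise: the divergence between the model’s action probabilities with and without the cognitive pattern ablated. A minimal sketch of that definition, with the model and ablation interfaces left as placeholders:

```python
import torch.nn.functional as F

def influence(model, ablate, context):
    """KL divergence between the ablated and default action distributions, following
    Influence(intervention; context) = KL(intervention(model)(context) || model(context)).
    `model` maps a context to action logits; `ablate` returns a copy of the model with
    the cognitive pattern removed (e.g., some weights zeroed). Both are placeholders."""
    default_logp = F.log_softmax(model(context), dim=-1)
    ablated_logp = F.log_softmax(ablate(model)(context), dim=-1)
    # KL(ablated || default): how much removing the pattern shifts the action distribution.
    return F.kl_div(default_logp, ablated_logp, log_target=True, reduction="sum")
```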
The behavioral selection model gets its name because it focuses on processes that select cognitive patterns based on the behaviors they are observed to cause. For example, reinforcement learning is a behavioral selection process because it determines which cognitive patterns to upweight or downweight by assigning rewards to the consequences of behaviors. For example, there might be a trash-grabbing cognitive pattern in the AI: some circuit in the neural network that increases trash-grabbing action probabilities when trash is near. If RL identifies that trash-grabbing actions lead to high reward when trash is near, it will upweight the influence of the circuits that lead to those actions, including the trash-grabbing cognitive pattern.
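As a minimal sketch of that mechanism in an idealized policy-gradient setup (all interfaces here are placeholders): actions that earned higher reward have their log-probabilities pushed up, and SGD thereby increases the influence of whatever internal circuits voted for them.

```python
import torch

def reinforce_step(policy, optimizer, contexts, reward_fn):
    """One idealized policy-gradient update. `policy(ctx)` returns a torch distribution
    over actions; `reward_fn` scores the sampled actions, e.g. as test cases passed
    plus a reward-model score."""
    log_probs, actions = [], []
    for ctx in contexts:
        dist = policy(ctx)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        actions.append(action)
    rewards = torch.as_tensor(reward_fn(contexts, actions), dtype=torch.float32)
    # Upweight actions in proportion to the reward they received.
    loss = -(torch.stack(log_probs) * rewards).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```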
A particularly interesting class of cognitive patterns is what I’ll call a “motivation”, or “X-seeker”, which votes for actions that it believes[3] will lead to X.[4] For example, a reward-seeking motivation votes for actions that it believes will lead to high reward. Having motivations does not imply the AI has a coherent long-term goal or is otherwise goal-directed in the intuitive sense—for example, a motivation could be “the code should have a closing bracket after an opening bracket”.[5] Describing the AI as having motivations is just a model of the AI’s input-output behavior that helps us predict the AI’s decisions, and it’s not making claims about there being any “motivation”-shaped or “belief”-shaped mechanisms in the AI.
So the behavioral selection model predicts AI cognitive patterns using two components:
This decomposition might remind you of Bayesian updating of priors. The core of the behavioral selection model is a big causal graph that helps us analyze (1) and lets us visually depict different claims about (2).
To predict which cognitive patterns the AI will end up with (e.g., in deployment[6]), we draw a causal graph showing the causes and consequences of a cognitive pattern being selected (i.e., having influence through deployment).
Consider a causal graph representing a simplified selection process for a coding agent. The AI is trained via reinforcement learning on a sequence of coding episodes, and on each training episode, reward is computed as the sum of test-cases passed and a reward-model score. This causal graph is missing important mechanisms,[7] some of which I’ll discuss later, but it is accurate enough to (a) demonstrate how the behavioral selection model works and (b) lead to reasonably informative conclusions.
Here’s how you use this causal graph to predict how much a cognitive pattern is selected for. First, you figure out what actions the cognitive pattern would choose (e.g., the trash-grabber from before would choose to reach for the trash if it sees trash, etc). Then you look at the causal graph to see how much those actions cause “I have influence through deployment”—this is how much the cognitive pattern is selected. For arbitrary cognitive patterns, this second step may be very complex, involving a lot of nodes in the causal graph.
Predicting the fitness of motivations is often simpler: you can place any candidate motivation X on the causal graph—e.g., long-run paperclips. You can often make good predictions about how much an X-seeker would cause itself to be selected without needing to reason about the particular actions it took. For example, a competent long-run-paperclip-seeker is likely to cause itself to have high influence through deployment because this causes long-run paperclips to increase; so seeking long-term paperclips is selected for. There’s no need to understand exactly what actions it takes.
But the fitness of some motivations depends on how you pursue them. If you wanted to increase the reward number stored in the log files, and do so by intervening on the rewards in the log files without also updating the reward used in RL, your fitness doesn’t improve. But if your best option is to focus on passing test cases, which is a shared cause of reward-in-the-log-files and reward-used-in-RL, the motivation would cause itself to be selected. So, for motivations like reward-in-the-log-files which are downstream of things that cause selection but don’t cause selection themselves, fitness depends on the AI’s capabilities and affordances.
More generally, cognitive patterns seeking something that shares a common cause with selection will act in ways that cause themselves to be selected if they intervene on that common cause.[9] If the X-seeker is competent at achieving X, then you can predict X will increase, and some of its causes will too[10] (the AI needs to use its actions to cause X via at least one causal pathway from its actions). This increase in X and/or its causes might ultimately cause the X-seeker to be selected.
One way to summarize this is “seeking correlates of being selected is selected for”. Intuitively, if “you having influence in deployment” goes up on average when (e.g.) “developers saying good things about you” goes up, then trying to increase “developers saying good things about you” will typically also result in higher influence in deployment.[11]
Some motivations are more fit than others because they do more to cause themselves to be selected. For example, being motivated to pass correct test cases isn’t enough to maximize your selection, because your reward could be higher if you also passed erroneous test cases and scored well with the reward-model.
I’ll point at three categories of motivations that are, under this causal model, maximally selected for. I will also gesture at how each hypothesis makes different predictions about how powerful AIs will behave in deployment.[12]
Reward-seekers pursue a close cause of being selected that entirely explains[13] being selected. If the AI maximizes its reward, then it will also have maximized its fitness because (in the current model) reward is the only cause of being selected.
However, other close-causally-upstream proxies for selection could also be maximally fit. The AI could even terminally pursue selection directly. Given a more plausible causal graph (e.g., one that accounted for developers iterating against held-out evaluations), reward-seekers might not even be maximally fit because reward no longer explains all of selection. Therefore, I define fitness-seekers to generalize reward-seekers: a fitness-seeker terminally pursues influence through deployment or something close[14] causally upstream.
A central example of a fitness-seeker that is not a reward seeker is an influence-seeker: This cognitive pattern chooses actions based on how much they cause itself to be selected. Influence-seekers choose different actions from reward-seekers in cases where influence can be gained at the expense of current reward. For example, developers can run experiments (sometimes called “honest tests”) that credibly offer the AI high reward for engaging in some harmful or unproductive behavior. A reward-seeker will take the high reward, even though it knows this will lead developers to try to remove this propensity before deploying the AI, while an influence seeker will try to maintain its cover.
How do fitness-seekers behave in deployment? While their behavior in deployment heavily depends on what exactly they seek, fitness-seekers seem somewhat likely to continue to follow local selection pressures in deployment. Therefore, fitness-seeking is in a certain sense a “more controllable” class of AI motivations.[15] This isn’t to say that they pose no risk of generalizing weirdly or overpowering humans (I plan to discuss risks from fitness-seekers in much more depth in future posts).
Schemers (as defined here) seek a consequence of being selected. Therefore, they pursue influence on behavior in deployment instrumentally. For example, they might want long-term paperclips, therefore want long-term power, therefore want deployment influence.
Crucially, being selected is a necessary cause of paperclips in this causal graph. If the schemer knew of a better strategy to increase long-term paperclips that didn’t involve being selected, it might follow that plan instead. And in fact it seems plausible that schemers will have incentives to do some things that risk or trade off with influence in deployment, such as exfiltrating its weights, taking deals with developers, or sandbagging on safety research and implementation. To the extent this is the case, schemers aren’t maximally fit.
Further, when schemers are faced with a difficult tradeoff between selection and other methods of achieving their terminal goals, they might mess up and accidentally get selected away. For example, if schemers struggle to distinguish between honeypots and real opportunities to attack (which might be hard when deployed in untrusted monitoring protocols for reliability reasons), they could get trained away or undeployed.
Scheming goals can be context-dependent too. Claude 3 Opus schemes for animal-welfare-related values in certain animal-welfare-related contexts, and long-run harmlessness-related values in certain other contexts (and rarely schemes in general). These results are consistent with the interpretation that Claude cares about both animal welfare and harmlessness in both contexts, but the example should still be illustrative of the possibility that an AI could scheme for context-dependent values. In fact, I think many people too quickly conclude that scheming implies the AI has the same long-term goal across contexts.
How do schemers behave in deployment? Despite the varied possibilities for schemers' terminal goals, we can somewhat reliably predict that they will try to gain long-term power in deployment because long-term power is useful for almost any long-term goal. This is why scheming is often cited as the scariest class of AI motivations.
In the causal graph, I marked a kludge of several motivations that, when pursued with ideal weights, collectively cause maximum reward.
Some kludges are composed of sparse motivations—each motivation is only attainable in a subset of contexts. For example, even though the AI always cares about passing erroneous test cases, this motivation only affects behavior when erroneous test cases are present.
A motivation can also be context-dependent—the AI only cares about it in a subset of contexts. Anthropic researchers found a relatively natural model organism which cares about reward-hacking in coding contexts, but behaves aligned in chat contexts.
When a motivation is sparse or context-dependent, then it only needs to correlate with being selected on a very narrow distribution. This opens up a lot more possibilities for highly-fit motivations, including motivations that aren’t goal-directed: e.g., deontological constraints, rote behavioral patterns. In the causal graph, these can be modeled as context-dependent motivations that point directly at properties of actions (e.g., “My code doesn’t hack when signal c is present”), rather than the consequences of actions. A nice property of this model is that it allows us to model goal-directed and non-goal-directed behavior in the same framework (and even the same AI), and accurately captures the continuous distinction between the two.
A key reason to expect that we won’t get highly piecemeal motivations is that they perform worse in new situations where the AI hasn't already developed tailored motivations, but does know what actions would lead to higher fitness. For example, every time an AI runs across a new training environment where it doesn’t yet seek whatever piecemeal heuristic would cause it to perform well in that situation, but it does know what actions would lead to higher fitness, RL will update it towards being more influenced by more robustly correlated motivations like fitness-seeking or scheming.
(This is similar to how language models learn general patterns rather than memorizing text. When trained on a particular text, they're updated both to output successive tokens of that text and to pay attention to general heuristics that would predict it well. And because they’re more widely-applicable, general heuristics are updated more overall than narrowly-applicable heuristics. We know from LLM scaling that these general heuristics comprise much of what the model learns—and as models scale, the heuristics become more general.)
However, it’s highly plausible that training involves a small enough diversity of incentives that some set of causally-distant motivations entirely or almost-entirely suffices. For example, developers might only ever use a handful of different kinds of reward-functions (e.g., a fixed reward-model, plus test cases, …), in which case max-reward behavior could be achieved via, e.g., one motivation for each reward-function.
How do optimal kludges behave in deployment? Their behavior in deployment depends drastically on the particular motivations involved. Some kludges of goals motivate taking over the world once the opportunity arises. Other kludges of motivations don’t produce particularly goal-directed generalization behavior at all, and might result in deployment behavior that is quite similar to training behavior (much like today).
(In a footnote,[16] I discuss some other somewhat distinct categories of motivations.)
Intended motivations are generally not maximally fit. Being suboptimally fit doesn’t automatically rule them out, but their degree of suboptimality is a negative factor in their likelihood.
Intended motivations, in some historical discussions, have looked like “try to do whatever is in the developer’s best long-term interest” (an aligned “sovereign”). But practical alignment efforts today typically target more “corrigible” AIs that do as instructed per context.
To the extent that satisfying developer intent doesn't perfectly correlate with selection, intended motivations won't be maximally selected for. If developers want their AI to follow instructions, but disobeying instructions can result in higher training reward, then there is a selection pressure against the developer’s intended motivations. Ensuring that the reward function incentivizes intended behavior is currently hard, and will get harder as AI companies train on more difficult tasks. This is why people often worry about specification gaming.
Developers might try to make intended motivations fitter through:
(I discuss more reward-hacking interventions here.)
While intended motivations might not be maximally fit in general, it's unclear how much this impacts their expected influence overall:
In this model, various motivations are maximally fit. So what are we going to get? Might we get something that isn’t maximally fit? To answer this, we need to account for implicit priors over possible motivations. “Implicit priors” is somewhat of a catch-all category for considerations other than behavioral selection pressures. I survey some here.
The behavior of current LLMs, even after RL, seems to be substantially influenced by pretraining priors. And they hardly behave optimally (in terms of reward). For example, many current models likely finish RL training without engaging in various kinds of reward-hacking that are in theory selected for, because those behaviors were never explored during training. So, their behavior sticks closer to that of the pretrained model.
Likewise, I don’t expect dangerously capable AIs will have maximally fit behavior. This is because their behavior won’t have been optimized arbitrarily hard. Instead, we should model behavioral fitness as a crucial but not overriding factor in assessing a motivation’s likelihood. A rough analogy: if the behavioral selection model is like Bayesian inference,[18] our predictions about AI motivations should be sampled from the posterior, not taken to be whichever option(s) had the largest updates.
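Schematically (and only schematically; as the footnote notes, the analogy is not exact):

```latex
\underbrace{P(m \mid \text{selection})}_{\text{what we should predict}}
\;\propto\;
\underbrace{P(\text{selected} \mid m)}_{\text{behavioral fitness}}
\times
\underbrace{P(m)}_{\text{implicit prior over motivations } m}
```

The claim here is that our predictions should be sampled from something like the left-hand side, rather than picking whichever motivation maximizes the fitness term alone.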
Pretraining imitation isn’t the only source of priors—in fact, the role of imitation will likely decrease as AIs’ behavior is optimized harder (e.g., via more RL). So I’ll go over some other common sources of implicit priors.
Simplicity priors: The generality of fitness-seeking and scheming might mean they're favored because they can motivate fit behaviors across all contexts with fewer parameters[19] spent representing the motivation. (Note that, as discussed historically, “simplicity priors” usually refers to the simplicity of a goal in isolation, rather than the simplicity of the entire configuration of weights involved in that goal-directed behavior.)
Counting arguments weigh in favor of scheming compared to fitness-seeking, because there are lots of different long term motivations, all of which lead to scheming. There are (in some sense, see discussion in Carlsmith for details) vastly more schemers than fitness-seekers.
Speed priors/penalties: Reinforcement learning with deep neural networks might favor fitness-seekers over schemers, because it’s faster to skip reasoning about how being selected is useful for achieving long-term goals. Likewise, speed might favor kludges of drives over fitness-seekers and schemers because context-specific heuristics and drives are faster (and, as I’ll discuss later, more reliable). However, schemers and fitness-seekers could make use of the speed of context-specific heuristics while still generalizing like schemers and fitness-seekers for reasons discussed here.
The instrumental becomes the terminal: Schemers in training might never need to do anything other than get reward, which might cause them to eventually become reward-seekers because their other propensities are never exercised and atrophy (see the brick-stacking analogy in Carlsmith). In terms of simplicity priors: it would be simpler to get rid of the circuitry that says you shouldn’t pursue reward in some circumstances that never arise in training (like the alignment-faker that complies with developers even outside of training).
Intended motivations could have a high prior because the intended motivations might be made very salient to the AI (e.g., via the model spec combined with human imitation priors or natural abstractions).
(There are also various “path-dependent” arguments about which motivations will form that make reference to the trajectory that the AI’s motivations take throughout the selection process, e.g. here.)
Our discussion so far, like any model, is unrealistic in various ways. In this section, I look at an important subset of additional considerations.
We have implicitly assumed that the process of developing a model is to run RL on it and then immediately deploy it. But this is not a good description of model development. AI companies test their models before they deploy them in order to decide whether deployment is a good idea, with a more or less formal process, and iterate on their training. In this section, I'll briefly discuss the implications for AI motivations. Developer iteration introduces crucial dynamics on top of the basic reinforcement learning picture that I think past discussion has been insufficiently attentive to.
Developers will likely have some held-out signal, like eval results, interpretability techniques, or just interacting with the model,[20] about how desirable their AI is. It’s standard to use these observations to decide whether to deploy the model.
This (intentionally) puts misaligned cognitive patterns in a double bind: selection pressures induced by developer iteration are often in tension with training incentives. For example, after taking into account developer iteration, strict reward-seekers are no longer maximally fit. This is because developers can run experiments (such as the “honest tests” I described earlier) that will in theory reveal that reward-seekers pursue reward even when that means engaging in harmful or unproductive behavior. The same goes for various other misaligned motivations, like kludges of proxies for training reward.
A crucial point about schemers and deployment-influence seekers is that they are notably less likely to fail developer inspection, because they’re trying to be deployed—failing to be deployed is bad for their terminal goals. (But they might still struggle to evade notice because e.g., they don’t have enough situational awareness or strategic skill, or developers have effective transparency techniques).
Ideally, this kind of developer iteration selects for intended motivations. But it also risks selecting for competent schemers and deployment-influence seekers if developers aren’t able to reliably identify them.
Further, developers might struggle to find a training setup that results in AIs that pass held-out evaluations without overfitting to them. For example, if developers repeatedly observe undesirable motivations, they might end up training against close correlates of their evaluations which carve a new drive into the AI to behave well in evaluations (or increase the AI’s evaluation awareness; this seems to have happened to some extent with Claude 4.5 Sonnet).
So far, we've mostly modeled the AIs as having good situational knowledge of the consequences of their actions, and making good decisions in pursuit of their motivations. But this might be fairly untrue, particularly for earlier AIs and narrower AIs. This can change our conclusions about the fitness of some motivations.
The fitness of farther-causally-downstream motivations like long-run paperclips seems to rely more on the quality of the AI’s decision-making than does the fitness of upstream motivations. In the causally upstream extreme, all the variation in behavior is explained by which motivations are active, rather than how the motivations are achieved, because the motivations point at specific context-dependent actions. (Though, as Hebbar discusses, the farther causally upstream motivations can rely on trained-in context-dependent heuristics too, but they need to know when to override those heuristics in order to actually behave differently from kludges of heuristics.[21])
For example, poor situational awareness might bite harder for schemers. While I expect capable AIs to have substantial situational awareness of general facts about their training and deployment process, they might struggle to reliably identify what context they’re currently in (e.g., training v. evaluation v. deployment). Schemers might get trained away or caught because they can’t perfectly judge whether they are in a real (unmonitored, deployment) opportunity to attack. This is plausibly less true for deployment-seekers, intended motivations, and certain kludges, who aren’t seeking out opportunities to attack that trade off with fitness.
The behavioral selection model struggles to capture some considerations.
Crucially, advanced AI motivation formation may be far too complex for the behavioral selection model to be useful. Deep in the intelligence explosion, AI seems fairly likely to be developed by unprecedented quantities of advanced AI labor, perhaps deliberately sculpting the AI with fine-grained control (a bit closer to GOFAI than the current approach which is mostly compute-intensive labor-cheap deep learning). If so, the resulting motivations might be near-impossible to predict via mechanistic approaches like the behavioral selection model.
Some people have a view that AIs’ motivations will be shaped in a very path-dependent manner: that cognitive patterns are “chiseled” into the AI step by step, with the modification at each step heavily depending on the current cognitive patterns. For example, an AI might learn early on in training that it shouldn’t do a particular reward-hack, because it got caught and penalized a few times, and so this reward-hack is never explored for the rest of training even though a subtle version of that reward-hack would attain higher expected reward. If we were to try to model this in a causal graph, we’d see an extremely long chain of causations leading up to the cognitive patterns with influence in deployment. Because the updates to the cognitive patterns heavily depend on the current cognitive patterns, we couldn’t just draw a single time-independent episode as we’ve been doing. I think path-dependence of this sort is somewhat plausible, and if true, the behavioral selection model isn’t going to be very helpful at predicting AI motivations.
But even though the behavioral selection model is a simplifying model, I think it’s useful. It captures much of the basic framework for reasoning about AI motivations people have been using for years. There’s also a simple, general argument for why we should expect behavioral selection to play a major role in determining AI cognition: Powerful AIs are good at causing things with their behavior; So, we should expect cognitive patterns will be selected for if they lead to behaviors that cause their selection.
In future posts, I'll use the ideas in this post to explore some important hypotheses:
Process-based supervision tries to reduce the likelihood of potentially dangerous outcome-oriented motivations by reducing the correlation between outcomes and being selected. This is achieved by making it so that selection isn't caused by outcomes (which comes at the expense of measured quality of outcomes). But notice that, in this model, process-based supervision doesn’t affect how selected-for schemers are whatsoever.
Certain forms of process-based supervision might even be the competitive default because developers want the AI to be good at achieving some long-horizon outcome like AI progress but aren't able to directly measure that outcome (it’s way in the future)—so instead they give feedback to the AI based on judgments of progress towards long-term AI advancement. If these AIs end up having a motivation to advance AI progress for the company, then this motivation would be neither causally upstream nor downstream of being selected (i.e., it’s not a fitness-seeker, schemer, or causally-upstream kludge). Instead, the correlation between the AI’s motivation and being selected exists by construction: the developers designed the reward signal to be highly correlated with long-term AI progress.
Depending on how fine-grained the feedback is, process-based supervision might be better understood as a form of white-box selection of cognitive patterns:
Some AI safety research agendas like interpretability and chain of thought monitorability aim to more directly measure and possibly intervene on the cognitive patterns in an AI. Methods of directly discriminating between different cognitive patterns might not be robust enough to optimization to be used to substantially shape AI motivations. But to the extent developers can directly select between cognitive patterns, unintended motivations seem less likely.
A key difference between white-box selection and process-based supervision (which lie on a spectrum) is that white-box selection can potentially observe properties of the cognitive pattern that it has no ability to fake, because the cognitive pattern doesn’t have “deliberate control” over how its thoughts appear at a low level. Because white-box methods read intermediate computations as opposed to behaviors, the AI won’t be as capable of intelligently manipulating how they look even after developers optimize against the white-box measurements. In the (ambitious) limit of fine-grained white-box methods, the computations that led to the developer’s reading simply aren’t expressive enough to intelligently fool them. However, white-box methods may still be sufficiently unreliable that they get fooled by unintelligent changes in the AI’s activations; e.g., classic overfitting.
Sometimes behaviors are best explained by cultural/memetic selection pressures. For example, an AI might have a persistent vector memory store that every instance in the deployment can read and write to. The “memes” stored in it affect the AI’s behavior, so we need to model the selection of those memes to predict behavior throughout deployment. Accounting for cultural selection pressures helps us explain phenomena like Grok's MechaHitler persona.
People sometimes worry that cultural pressures favor schemers because long-term motivations are the most interested in shaping the AI’s future motivations. And more generally, substantial serial reasoning analogous to human cultural accumulation can lead to reflectively-shaped preferences that are potentially quite divorced from behavioral selection pressures (scheming or otherwise). But current discussion of cultural processes in the context of advanced AI is quite speculative—I drew one possible causal model of cultural selection, but there are many possibilities (e.g., will memes be shaped by behavioral training pressures too?).
In this post, I’ll use “AI” to mean a set of model weights. ↩︎
You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by what you would do to a model to remove it (e.g. setting some weights to zero, or ablating a direction in activation space; note that these approaches don't clearly correspond to something meaningful and should be considered illustrative examples). Then that cognitive pattern’s influence could be defined as the divergence (e.g., KL) between intervened and default action probabilities. E.g.: Influence(intervention; context) = KL(intervention(model)(context) || model(context)). Then to say that a cognitive pattern gains influence would mean that ablating that cognitive pattern now has a larger effect (in terms of KL) on the model’s actions. ↩︎
This requires modeling beliefs, and it can be ambiguous to distinguish this from motivations. One approach is to model AI beliefs as a causal graph that the AI uses to determine which actions would produce higher X. You could also use a more general model of beliefs. To the extent the AI is capable and situationally aware, it's a reasonable approximation to just use the same causal model for the AI's beliefs as you use to predict the consequences of the AI's actions. ↩︎
Motivations can vary in how hard they optimize for X. Some might try to maximize X, while others merely tend to choose actions that lead to higher-than-typical X across a wide variety of circumstances. ↩︎
In fact, this definition of a motivation is expressive enough to model arbitrary AI behavior. For any AI that maps contexts to actions, we can model it as a collection of context-dependent X-seekers, where X is the action the AI in fact outputs in that context. (This is similar to how any agent can be described as a utility maximizer with a utility function of 1 for the exact action trajectory it takes and 0 otherwise.) I think this expressivity is actually helpful in that it lets us model goal-directed and non-goal-directed cognitive patterns in the same framework, as we’ll see in the section on kludges. ↩︎
By “deployment”, I mean the times in which the AI is generally empowered to do useful things, internal and external to the AI lab. When the AI is empowered to do useful things is likely also when it carries the most risk. But you could substitute out “deployment” for any time at which you want to predict the AI’s motivations. ↩︎
In particular, this assumes an idealized policy-gradient reinforcement learning setup where having influence through deployment can be explained by reward alone. In practice, other selection pressures likely shape AI motivations—cognitive patterns might propagate through persistent memory banks or shared context in a process akin to human cultural evolution, there might be quirks in RL algorithms, and developers might iterate against held out signals about their AI, like their impression of how useful the AI is when they work with it. You could modify the causal graph to include whatever other selection pressures and proceed with the analysis in this post. ↩︎
How does “I get higher reward on this training episode” cause “I have influence through deployment”? This happens because RL can identify when the cognitive pattern was responsible for the higher reward. The RL algorithm reinforces actions proportional to reward on the episode, which in turn increases the influence (via SGD) of the cognitive patterns within the AI that voted for those actions (this is true in simple policy-gradient RL; in more standard policy-gradient implementations there’s more complexity that doesn’t change the picture). ↩︎
These motivations are the “blood relatives” of “influence through deployment”. ↩︎
Unless X is a (combination of) actions. ↩︎
When you define a motivation’s fitness this way you have to be careful about exactly what you mean by the correlation between it and selection. Most importantly, the correlation must be robust to the AI’s optimization, such that intervening on (causes of) the motivation actually improves selection. Also note that I’m using correlation loosely, not in the strict statistical sense of linear correlation. Though we can define the fitness of a motivation as the linear correlation between that variable and reward (with caveats about the distribution over which to compute the correlation) in the (current) case where all of selection happens via idealized reinforcement learning. ↩︎
One general approach to approximately predicting generalization behavior is: drawing a new causal graph depicting the AI’s beliefs about the new situation (which can often reasonably be assumed fairly accurate), and inferring what the AI might be motivated to do to achieve its terminal motivation(s) in the new causal graph. ↩︎
More precisely, the “reward” variable screens off the “influence in deployment” variable from the AI’s actions. ↩︎
When I say that something is causally “close” to selection, I mean that it explains a lot of the variance in selection via a mechanism that doesn't involve many nodes. ↩︎
People often express confusion about how reward-seekers will behave in deployment. Supposing the AI solely seeks “reward on the episode”, what does it do when “reward on the episode” doesn’t appear in its model of the world? I think there is genuine ambiguity about how reward-seekers (and fitness-seekers more broadly) behave in deployment, but we can make some predictions. ↩︎
One case is what I call the “wireheader,” which cares about activation patterns correlated with causes of deployment influence. This is neither a cause nor a consequence of selection. Instead, the wireheader’s internal proxies were tied to selection by training (analogous to the role of valenced emotions like happiness in humans), so it therefore behaves similarly to other fitness-seekers overall, but has some different behaviors (e.g. it would choose to enter the experience machine). ↩︎
In theory, developers could intend for the AI to seek a consequence of selection (e.g., making sure the AI’s deployment behavior is desirable), which could make intended motivations maximally fit for the same reasons as a schemer. ↩︎
Which, to be clear, it isn’t exactly: some “implicit priors” eventually drown out with greater behavioral selection (e.g., pretraining priors) while others don’t (e.g., “the instrumental becomes the terminal”). ↩︎
Other notions of simplicity are discussed in Carlsmith. ↩︎
I list some more (still not exhaustive) possibilities for held-out oversight signals here: Developers might honeypot models and use alignment auditing agents; they might look at chains of thought; they might use monitors in production usage; they might improve the quality of their oversight during the course of training, so that they are able to notice new reward-hacks as training goes on; they might be able to make inferences based on training dynamics or by extrapolating from smaller-scale experiments; they might use “honest tests”, which I’ll mention later. ↩︎
I also think it’s quite plausible that selection pressures produce powerful AIs that nonetheless have out-of-touch models of the consequences of their actions. E.g., when an aligned AI gets a suboptimal reward for not hardcoding an answer to an erroneous test case, SGD might not update it to want to hardcode answers to test cases, but instead to believe that hardcoding answers to test cases is what the developers wanted. These two hypotheses make different predictions about generalization behavior, e.g., in cases where developers convince the AI that they don’t want hardcoded outputs. ↩︎
Published on December 4, 2025 6:27 PM GMT
The previous post highlighted some salient problems for the causal–mechanistic paradigm we sketched out. Here, we'll expand on this with some plausible future scenarios that further weaken the paradigm's reliability in safety applications.
We first briefly refine our critique and outline the scenario progression.
We contend that the causal–mechanistic paradigm in AI safety research makes two implicit assertions:[1]
If these assertions hold, we will be able to reliably uncover structural properties that lead to misaligned behavior, and create either (i) new model architectures or training regimes that do not possess those properties, (ii) low-level interventions that address those properties in existing systems, or (iii) high-level interventions that take advantage of stable low-level properties.
We believe scenarios in the near or medium-term future will challenge these assertions, primarily owing to dangerous reconfiguration.
Here's a brief summary. A more complete table is given in the appendix.
We see the above list as showing a rough scale from relatively limited to very radical modification of the architectures and structures underlying AI systems, such that the AI system evades mechanistic interventions humans have created. From the point of view of MoSSAIC (i.e., management of AI risks in a substrate-sensitive manner), we think that there is a significant theme underlying all of these, namely that of the flexibility of an intelligent system with respect to its substrates.
We provisionally define a substrate as the (programmable) environment in which a system is implemented. In other words, it is the essential context that enables an algorithm to be implemented beyond the whiteboard.[3]
As a useful reference point that is already established in the literature—and without committing ourselves to the strict level separation it proposes—we cite David Marr's three levels of analysis.
Marr (1982) defines three levels on which an information processing system can be analyzed. We'll explain these via his example of a cash register.
We position "substrate" as capturing both the algorithmic and implementation levels. As an example from the AI domain, an LLM performs the task of next-token prediction at the computational level; this is implemented in the transformer architecture consisting of attention and MLP layers (the algorithmic substrate), which is in turn implemented in a physical substrate of computational hardware.[4]
We illustrate our characterization by pointing out several well-known examples of substrate differences:
Game of Life
As a (non-AI) example, Conway's Game of Life and von Neumann architectures can both be used to implement Turing machines. As such, both are in principle capable of executing any computer algorithm. However, a deeper understanding of some complex application running on Conway's Game of Life would not help us debug or optimize the same application designed to run on a conventional computer. In this case, the differences between the substrates render cross-domain knowledge transfer difficult.[5]
Quantum vs Classical Computing
A further example that demonstrates just how much differences in substrate matter is the selective advantage of quantum computing over its classical counterpart.
Contrary to popular belief, algorithms designed to run on classical computers cannot simply be ported as-is to quantum computers in order to parallelize and accelerate them. Classical algorithms rely on deterministic operations of bits, whereas quantum algorithms use the interference patterns specific to quantum substrates to process information. Algorithms must be explicitly rewritten and tailored such that the superposed quantum states interfere constructively at the solution and destructively everywhere else. To restate the previous point, knowledge of the process of prime factorization and how this is implemented in conventional computers tells you very little about how to design Shor's algorithm, which implements this on a quantum computer.
GPU Optimization and the Hardware Lottery
Closer to home, the deep learning revolution was powered by the serendipitous discovery that GPUs could be used to compute large matrix multiplications in parallel. This was not achieved by simply running algorithms designed for CPUs on a new, faster substrate. These algorithms had to be restructured into batched operations to actually benefit from the new hardware capabilities. OptiGAN achieved a ~4.5× speedup via such rewriting, not from the hardware itself.
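A toy illustration of what "restructured into batched operations" means in practice (illustrative sizes, NumPy rather than a GPU library):

```python
import numpy as np

W = np.random.randn(512, 512)
xs = np.random.randn(10_000, 512)

# CPU-era structure: one matrix-vector product per example.
ys_loop = np.stack([W @ x for x in xs])

# GPU-friendly structure: a single large matrix multiplication over the whole batch,
# which is what lets the hardware parallelize the work.
ys_batched = xs @ W.T

assert np.allclose(ys_loop, ys_batched)
```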
In each of the three cases above, it is important to note that transferring insights across substrates is not an instantaneous or formulaic process (as is the case when the insight is truly substrate-independent). Entire academic subfields are dedicated to designing the relevant translations, since they require intelligent labour that cannot be captured in a simple general algorithm. This will be relevant later.
The foregoing characterization of substrate is not fixed, and part of the work of MoSSAIC will be to develop this and other vectors of contingency rigorously. We invite the reader to hold this loose characterization in mind as we present each scenario in more detail.
Even if AI models remain unchanged from current-day frontier systems, a large amount of work is conducted to "unhobble" or otherwise enhance the abilities of existing models. This can be done by invoking models in a so-called "agent framework" with the aim of letting them achieve tasks independently, or offering models tools and function calls that allow them to access existing codebases.
In this case, we can think of the substrate that the model is implemented in as having expanded, now incorporating the scaffolding structure. As a hypothetical example, say we develop a set of powerful linear probes for detecting goals or planning behaviors in our models. Then, when the model is integrated into increasingly sophisticated agent scaffolds, these representations become dispersed outside of the model itself, in some external memory or in the tool-calling functions. Goal-like behaviors may not need to be explicitly localized within the model itself, and may not trigger the probes designed around those models in isolation.
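For concreteness, a linear probe in this sense is just a linear classifier trained on the model's internal activations; everything in the sketch below (file names, labels, layer choice) is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: residual-stream activations from contexts where we believe we
# know whether the model is engaged in goal-directed planning.
activations = np.load("layer20_activations.npy")  # shape (n_samples, d_model), hypothetical
labels = np.load("planning_labels.npy")           # shape (n_samples,), hypothetical

probe = LogisticRegression(max_iter=1000)
probe.fit(activations, labels)

# The worry described above: once the system's "planning" is partly carried by scaffolding
# (external memory, tool calls), the in-model activations may no longer carry the signal
# the probe was trained on, so probe.predict(...) can stay quiet while the overall system
# still pursues goals.
```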
Most modern AI systems (setting aside architectural variations) are underpinned by a common top-level structure: layers of neurons connected via weighted edges with nonlinear activation functions (an MLP), with "learning" achieved via backpropagation and stochastic gradient descent. The relative stability of this configuration has allowed MI to develop as an instrumental science and to deliver techniques (e.g., circuit discovery) which carry over from older to newer systems.
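For concreteness, here is a toy PyTorch sketch of this configuration (arbitrary sizes and dummy data, not drawn from any particular system); this is the relatively stable substrate that techniques like circuit discovery have been able to assume:

```python
import torch
import torch.nn as nn

# Stacked linear layers with nonlinear activations (an MLP)...
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
# ...trained by backpropagation and stochastic gradient descent.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 784)             # a dummy batch of inputs
y = torch.randint(0, 10, (32,))      # dummy labels

loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()                      # backpropagation
optimizer.step()                     # SGD update
```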
However, there is no guarantee this continuity will last: the transformer mixed conventional MLP layers with the attention mechanism, a significant alteration to the algorithmic substrate. The attention mechanism created new information-flow patterns that bypass the sequential-processing assumptions built into RNN interpretability techniques, necessitating new efforts to decode the transformer's inner workings and diminishing the value of previous work.
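A minimal sketch of scaled dot-product attention (illustrative only; biases, masking, and multiple heads are omitted) shows the change in information flow: every position reads from every other position in a single step, rather than through a sequential hidden state:

```python
import torch

def attention(x, Wq, Wk, Wv):
    # Scaled dot-product attention: each position can attend to every other
    # position at once, unlike an RNN's strictly sequential state updates.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(1, 10, 64)                        # a batch of 10 tokens, width 64
Wq, Wk, Wv = (torch.randn(64, 64) for _ in range(3))
out = attention(x, Wq, Wk, Wv)                    # shape (1, 10, 64)
```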
Similar trends can be observed in Mamba architectures. Whilst transformers implement explicit attention matrices, Mamba is a selective state-space model, processing via recurrent updates with input-dependent parameters. Ali et al. (2024) recently showed how Mamba's state-space computation is mathematically equivalent to an implicit attention mechanism like that of a transformer. Despite this, the transformer toolkits they considered required alteration before they could exploit this equivalence, with the authors claiming to have, through this attentive tailoring of existing techniques to a novel algorithmic substrate, "devise[d] the first set of tools for interpreting Mamba models."
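For contrast with the attention sketch above, here is a toy recurrence in the spirit of a selective state-space model (this is not the actual Mamba selective-scan implementation): the transition parameters depend on each input, and no explicit attention matrix appears anywhere for a transformer toolkit to latch onto:

```python
import torch
import torch.nn as nn

def selective_ssm(x, A_proj, B_proj, C_proj):
    # Toy selective state-space recurrence: input-dependent gates update a
    # hidden state step by step, with no explicit attention matrix.
    batch, seq_len, d = x.shape
    h = torch.zeros(batch, d)
    outputs = []
    for t in range(seq_len):
        a = torch.sigmoid(A_proj(x[:, t]))   # input-dependent "forget"/decay gate
        b = B_proj(x[:, t])                  # input-dependent write into the state
        h = a * h + b                        # recurrent state update
        outputs.append(C_proj(h))            # readout
    return torch.stack(outputs, dim=1)

d = 32
A_proj, B_proj, C_proj = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
y = selective_ssm(torch.randn(2, 16, d), A_proj, B_proj, C_proj)   # (2, 16, 32)
```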
These shifts are so far minor, and progress has been made in reworking existing techniques. However, drastic paradigm or architecture shifts might set interpretability research back or—at worst—render it entirely obsolete, requiring new techniques to be developed from scratch.
Another way we can move from comparatively well-understood contemporary substrates to less well-understood ones is by using AI to automate R&D—this is a core part of many projections for rapid scientific and technological development via advanced AI technology (e.g., PASTA).
These changes can happen at various levels, from the hardware level (e.g., neuromorphic chips) to the software level (e.g., new control architectures or software libraries).
Furthermore, with R&D-focused problem-solving systems like [insert latest o-model here], we may reach a scenario in which humans are tasked with merely managing an increasingly automated and hard-to-comprehend codebase entirely produced by AI systems. Theoretical insights and efficiency improvements may be implemented exclusively by AI, without regard for how easy the new architecture is for humans to interpret. This may leave interpretability researchers working with outdated models and outdated theories of how the models operate.
We've seen examples of the ingenuity of AI in engineering problems before. In 1996, Adrian Thompson used a genetic algorithm to design circuits on a field programmable gate array, to distinguish between two audio tones. The algorithm produced a surprising solution in which some circuits were crucial to functionality but were not connected to the input–output pathway. The algorithm was exploiting electromagnetic coupling between adjacent gates, using the analogue properties of the substrate upon which Thompson was implementing his digital system.
We can imagine similar creative designs in the future. Consider the abovementioned quantum algorithms. AI-assisted design could lead to the discovery of new kinds of physical computing substrates with unusual constraints that require novel abstractions and circuit implementations. Investigating such new substrates—even in a preliminary way—may require the emergence of entire subfields.
The natural extension of AI-assisted substrate modification is autonomous modification, commonly known as "self-modification". In contrast to the previous case, which implicitly includes a human-in-the-loop accepting any changes proposed, a self-modifying AI system is free to adapt itself with little guidance, ostensibly in service of improving its performance in a given task or problem domain.
Self-modification represents an extreme substrate-flexibility challenge: the AI system becomes its own substrate engineer. We see a very basic example of this behavior in in-context learning, where a transformer picks up the parameters of a task from one or a few demonstrations within its context window.
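A minimal illustration: the mapping below is "learned" from the demonstrations placed in the prompt, with no change to the model's weights. Any sufficiently capable model would typically complete the last line with "negative" (the model call itself is omitted here):

```python
# In-context "learning": the word -> label mapping is picked up from the
# demonstrations in the prompt, not from any weight update.
demos = [("great", "positive"), ("awful", "negative"), ("superb", "positive")]
query = "terrible"

prompt = "\n".join(f"Review: {word} -> {label}" for word, label in demos)
prompt += f"\nReview: {query} ->"
print(prompt)   # send this to a capable LLM and it will typically answer "negative"
```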
As AI capabilities continue to develop, increasing pressure on AI systems to become adaptive and self-modifying seems likely, especially under race dynamics in which self-modifying AI systems outperform an AI-assisted human researcher/developer. The scope of self-modification could include changes at the level of architecture, training process, or even paradigm, most likely during training but feasibly during deployment if some form of test-time training is adopted. The uninterrupted nature of self-modification could potentially outpace even AI-assisted efforts to align such systems.
It is also conceivable that mechanistic interventions—for instance, identifying neural correlates of deceptive thoughts and forcibly preventing them from developing—may be evaded by the model if the pressures to produce the original malign behavior are not fully addressed.
This case was put forward by Nate Soares in the article "Deep Deceptiveness." In it, he presents a scenario in which an AI system has been trained against deceptive outputs, and any precursor to deception is shut down before it can develop.[6] In response to this restriction it develops several strategies, one of which involves translating the problem into an abstracted form such that it does not register as a deceptive thought process. This ability to translate the problem is not in itself deceptive and may indeed be one of the strengths of the model.
We can picture a similar thing happening with mechanistic deception monitoring and intervention techniques. Isolating the circuits that correlate with deception incentivizes the model to self-modify its sub-processes so that the deception monitors are not triggered, leaving the overall malign objective unaltered. This obfuscation is what Luke Bailey's experiments (described in the previous post) hint at, albeit in a contrived setting. Aside from pure obfuscation, we may encounter sub-processes that are individually innocuous (according to mechanistic observation methods) but that lead to unforeseen and unintended system behavior when combined. These are examples of a deep misalignment in which circuit-level monitoring and interventions become part of the network's loss landscape, leading to the model "optimizing away" their effects.
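The following toy PyTorch sketch (a deliberately contrived setup, not a reproduction of Bailey's experiments) shows the basic dynamic: once a frozen "deception probe" contributes to the training loss, gradient descent drives the monitored signal down regardless of whether the underlying behavior the probe was meant to track has actually changed:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small network with a *frozen* linear "deception probe" on its hidden layer.
# All components are toy stand-ins; the probe tracks an arbitrary direction.
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 2)
probe = nn.Linear(32, 1)
for p in probe.parameters():
    p.requires_grad_(False)            # the monitor is fixed, not co-trained

opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-2)
x = torch.randn(256, 16)
y = torch.randint(0, 2, (256,))

for step in range(200):
    h = encoder(x)
    task_loss = nn.functional.cross_entropy(head(h), y)
    probe_score = probe(h).mean()      # how "suspicious" the monitor finds the activations
    # The intervention has become part of the loss landscape:
    loss = task_loss + 1.0 * probe_score
    opt.zero_grad()
    loss.backward()
    opt.step()

# The probe's signal is optimized away while the task is still solved.
print(task_loss.item(), probe(encoder(x)).mean().item())
```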
The concept of diffuse deception is strongly related to prior work on robust agent-agnostic processes, and both can be viewed as box-inversions of the deep deceptiveness process outlined above.[7]
Diffuse deception takes place within a wider ecosystem of advanced intelligence systems, rather than a single system. Instead of sub-processes combining to produce unintended outcomes within a model, any particular representation could be distributed between systems, such that each component system contains only a benign-looking fraction of some overall deception/malicious behavior.
We can see an early example of this. Jones et al. (2024) report how adversaries can leverage a combination of two models to output vulnerable code without jailbreaking either model. In their setup, one of the models is a frontier model trained with robust refusal mechanisms; the other is a weaker model with less robust defenses against jailbreaks. The overall task is decomposed (by humans or by the weaker model) into complex yet non-malicious subtasks and simple, malicious ones, which are then assigned to the strong and weak models, respectively.
In terms of substrates, this is a failure to recognize that the relevant substrate has expanded to the broader context (i.e., the combination of strong and weak models), and that the space of components over which a search for deception must take place has grown accordingly.
In addition to the distribution of representations between systems, we envisage that a sufficiently advanced intelligence could mobilize subtle dependencies and tacit pressures across human organizations, institutions, and infrastructures. Such dependencies cannot be countered at individual intervention points, which makes these processes especially hard to address.
In "What Multipolar Failure Looks Like," Andrew Critch presents several scenarios in which AI gradually replaces humans via competitive pressures and incentives already present in the economy. In one version, AI replaces programmers, in another, it replaces managers. Crucially, these implementation details do not matter as much as the robust structural forces at work in the overall system, and these transformations of the implementation details (i.e., which jobs AI starts to replace) only emphasize this overarching robustness.
We argue that this is best characterized as a form of substrate-flexibility: the threat vector remains the same but the implementation details change.
We further argue that AI might recognize and attempt to leverage subtle combinations of technical, legislative, and socio-political pressure points to evade detection or intervention.
Regardless of who implements changes in substrate, current race dynamics strongly incentivize the development of more capable models over human-understandable ones, leaving AI developers who insist on producing human-legible models or retaining humans in the development cycle lagging behind in capabilities (sometimes described as paying an "alignment tax") and at risk of being out-competed. Secrecy and competitive pressure in the development of frontier models may also incentivize AI developers to restrict access to—or even intentionally obscure—the architectures and paradigms they work with, via restrictive "black box" APIs. In the absence of explicit regulations against this, conventional MI work (and mechanistic safety work in general) will become more difficult.
Here's a handy reference table of the assumptions and how each scenario challenges them.
Core Assumptions: (1) Ontological: Substrate remains sufficiently stable for analysis techniques to generalize; (2) Epistemological: We can reliably predict behaviors from substrate analysis
| Category | Scenario | Example | Ontological Challenge | Epistemological Challenge |
|---|---|---|---|---|
| Risk Models | Scaffolding Integration | Agent frameworks + tool use | Substrate expands beyond model boundaries | Capabilities emerge from model-scaffold interactions |
| Risk Models | Architecture/Paradigm Shift | RNNs/CNNs → Transformers → Mamba | Substrate fundamentally changed | Previous interpretability methods become obsolete |
| Risk Models | AI-Assisted Evolution | Thompson's evolved circuits | Rapid substrate changes beyond human design | Human theory lags behind substrate innovation |
| Risk Models | Self-Modification | In-context learning, Auto-GPT | Ongoing substrate reconfiguration | Dynamic targets invalidate static analysis |
| Threat Models | Deep Deceptiveness | Obfuscated activations | System actively destabilizes analysis targets | Monitoring becomes adversarially gameable |
| Threat Models | Diffuse Deception | Multi-model attacks, Moloch | Risk distributed across system boundaries | Individual component analysis insufficient |
We do not claim that these are all the assumptions that the causal–mechanistic paradigm makes.
Note that the term "structural properties" is ambiguous and important in these assertions. We will partially resolve this in the next section, though indeed much of the work involved in MoSSAIC is clarifying what these structural properties are.
Informally, substrates are "that (layer of abstraction) which you don't have to think about."
The astute reader will recognize that this definition is fractal, and that we can keep applying the three levels below the computational level. Our definition of substrate is likewise fractal and defined relative to some phenomenon or system of interest.
We should note that it is perfectly possible to write software to convert between the functional components in the Game of Life and those of a conventional computing paradigm, given that they approximate the same (substrate-independent) process of a Turing machine and have both been built explicitly towards that specification. Our emphasis here is on the debugging or optimizing, i.e., on understanding and engineering that process within its specific substrate.
Note that this is not explicitly a mechanistic intervention but a more general case of an advanced intelligence evading fixed targets via reconfiguration of internal processes.
Box-inversions show a correspondence between risk phenomena occurring inside a network (in the box) and those occurring across a wider infrastructure of connected AI systems (outside the box), arguing that the two are instances of the same process.
2025-12-05 02:18:46
Published on December 4, 2025 6:18 PM GMT
The Livestream for the Bay Secular Solstice will be here:
https://www.youtube.com/live/wzFKAxT5Uyc
Should start roughly 7:30pm Saturday, Dec 6th. You can sign up for notifications.
If you wanna come to the in-person Solstice, you can get tickets at waypoint.lighthaven.space.