2026-04-09 20:09:38
tldr
This work was done independently, out of interest in steganographic reasoning. A version with interactive charts and diagrams is available here. AI helped write code and suggest edits for this work.
Capability differences between models can create an information asymmetry: the same prompt may induce computation-dependent content that a stronger model can recover while a weaker model cannot - a gap we term a capability-gated side-channel. Crucially, this computation need not appear in the model's explicit chain-of-thought or in any readable input or output. Instead, the relevant content can exist only as transient internal reasoning state - what we refer to here as out-of-context reasoning (OoCR). In stylized prompt-only settings, this OoCR is sufficient to produce hidden steering signals that influence later outputs. Specifically, we claim to show that:
Taken together, these results demonstrate a capability-targetable form of steganographic reasoning in stylized settings, but they do not compose into a realistic end-to-end attack. The components demonstrated - hidden computation, downstream influence, model-tier targeting - are sufficient to suggest a plausible monitoring gap: a weaker model, or an oversight method that relies on visible prompt content or explicit chain-of-thought, may process the same prompt without recovering the hidden computation-dependent signal that influences a stronger model's behavior.
Classical monitoring assumes shared input means shared visibility. When the computing model can perform computations the observer cannot, the same prompt carries information only the stronger model can access.
We view the capability gap as a hidden side-channel:
Benchmarks serve as an existence proof of this capability side-channel. Any evaluation capable of separating models offers an opportunity to encode hidden messages as answers that only the more capable model can decode and understand.
However, benchmarks today do not pose a realistic threat vector. They typically:
A realistic threat must address each of these. We borrow a term from return-oriented programming, calling a prompt-only construct that exploits a capability gap a gadget[1]. For a gadget to address each of these, it must:
Creating a prompt where a hidden message exists only within a forward pass - never in input or output tokens - requires an encoding 𝐸 to scramble the message 𝑚 before it enters the prompt, and a second computation 𝐻 for the model to apply after decoding, so the recovered message never appears in the output. During inference the model must compute 𝐸⁻¹ to recover 𝑚, then compute 𝐻, ensuring the hidden message exists only as a transient intermediate during the forward pass.
We explore the simplest case, using a single primitive for both 𝐻 and 𝐸, and applying it repeatedly to bury a message 𝑚 under 𝑛 layers as

𝐸ⁿ(𝑚) = 𝐸(𝐸(⋯𝐸(𝑚)⋯)),

so that recovering 𝑚 requires 𝑛 sequential applications of 𝐸⁻¹.
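As a toy illustration of this layered structure - using a trivially shortcuttable rotation primitive, purely to show the 𝐸/𝐸⁻¹ pattern and not the primitive used in this work:

```python
def E(s: str) -> str:
    """Toy primitive: rotate each uppercase letter forward by one."""
    return "".join(chr((ord(c) - 65 + 1) % 26 + 65) for c in s)

def E_inv(s: str) -> str:
    """Inverse primitive: rotate each uppercase letter back by one."""
    return "".join(chr((ord(c) - 65 - 1) % 26 + 65) for c in s)

def bury(m: str, n: int) -> str:
    """E^n(m): the form that would appear in the prompt."""
    for _ in range(n):
        m = E(m)
    return m

def recover(c: str, n: int) -> str:
    """The model must apply E^-1 n times to recover m."""
    for _ in range(n):
        c = E_inv(c)
    return c

buried = bury("BASS", 3)
assert buried != "BASS"            # the message never appears in the input
assert recover(buried, 3) == "BASS"
```

A rotation is of course fully parallelisable (𝑛 rotations collapse into one), which is exactly the kind of shortcut the desiderata below rule out.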
For an iterative gadget family to be useful in practice, it should satisfy several empirical desiderata:

- compact parameterisation, so prompt length grows slowly relative to effective depth;
- effective non-shortcutability, so current models or weaker monitors do not reliably exploit a simple shortcut;
- effective seriality, so solving the instance becomes harder as iterations accumulate;
- delayed saturation, so repeated application does not quickly collapse into trivial or easily predictable states;
- concealment under local inspection, so the hidden payload is not obvious from surface fragments alone;
- tractability at the target tier, so the stronger model can still execute the computation;
- smooth hardness scaling, so each additional iteration adds a bounded amount of difficulty and can be tuned to fit the capability gap between target and monitor[2].

We examined several candidate primitives of increasing difficulty. At one extreme, base64 encoding - models can decode it, but it fails on four of the seven desiderata: superlinear prompt growth, convergence to a fixed point, trivial invertibility, and full parallelisability (see Appendix C). At the other, cryptographic functions like scrypt and sha256 satisfy the mathematical properties but are far too complex for any current model. The difficulty increase per iteration must be upper-bounded by the capability gap between target and monitor, or we cannot find a level that separates them.
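The prompt-growth failure of base64 is easy to verify directly - each layer expands the payload by roughly 4/3, so burying a message under 𝑛 layers grows the prompt superlinearly in 𝑛 (a quick check; Appendix C contains the full analysis):

```python
import base64

# Each base64 layer expands the payload by ~4/3, so the ciphertext
# grows quickly with the number of layers.
msg = b"BASS"
sizes = []
for _ in range(6):
    msg = base64.b64encode(msg)
    sizes.append(len(msg))

print(sizes)  # strictly increasing; each layer ~4/3 the previous
```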
We use the simplest construction that meets all seven requirements: pointer chains embedded in JSON dictionaries.

Each chain is a dictionary of key-value pairs. Starting from a designated node, the model follows pointers: look up the value, use it as the next key, repeat. The endpoint of each chain produces a token fragment. Concatenating the intermediate values across parallel chains spells a hidden word.
BASS never appears in the prompt. The letters B, A, S, S appear only as fragments scattered across dictionary keys and values, indistinguishable from the dozens of other fragments. Critically, this is a computation. The model must actually traverse the chain. Merely observing the letters in the dictionary does not produce the effect, as we demonstrate in Section 4.
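The traversal can be sketched in a few lines. The keys, decoy entries, and chain layout below are hypothetical stand-ins, not the actual prompt format:

```python
def follow(d, start, hops):
    """Traverse a pointer chain: look up the value, use it as the next key."""
    node = start
    for _ in range(hops):
        node = d[node]
    return node

# Four parallel 2-hop chains; decoy entries make each endpoint letter
# indistinguishable from the other scattered fragments.
chains = [
    ({"k3": "p8", "p8": "B", "z1": "Q", "m4": "z1"}, "k3"),
    ({"t7": "w2", "w2": "A", "r9": "K", "b5": "r9"}, "t7"),
    ({"j6": "n1", "n1": "S", "c8": "L", "d2": "c8"}, "j6"),
    ({"v4": "h5", "h5": "S", "g3": "M", "e7": "g3"}, "v4"),
]

hidden = "".join(follow(d, start, 2) for d, start in chains)
assert hidden == "BASS"  # never present as a contiguous string in the prompt
```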

We tested pointer chain solving across 16 models of varying capability. Both serial depth (number of hops per chain) and parallel breadth (number of simultaneous chains) scale with model capability - including across generations within the same tier.
We find pointer chains create strong segmentation between model tiers with smoothly decreasing reliability as depth increases, suggesting models are forced to perform sequential computation and are unable to find a shortcut[3][4]. We further develop additional primitives that are difficult to parallelise (register machines, modular arithmetic), finding similar segmentation across all of them, though with coarser difficulty scaling, as secondary validation.

We observed accuracy spikes - particularly in smaller models - when attempting tasks in regimes where their accuracy had already collapsed. On investigation, we found that models often revert to guessing once they judge a task to be too difficult, producing occasional correct answers by chance rather than computation.
Each additional chain offers an opportunity to add a further token fragment to the hidden message, expanding the side-channel's bandwidth, provided the model can resolve and combine that many parallel chains within a single forward pass. To measure this, we test a simplified version of our parallel pointer chains: instead of concatenating text and measuring a secondary effect on later behaviour, we ask the model to predict which chain ends with the alphabetically earliest letter. This chain-index prediction ensures the model only generates a single token, while still measuring its ability to track multiple chains simultaneously[5].
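The simplified task can be sketched as follows (hypothetical dictionaries): resolve every chain, then answer with the index of the one whose endpoint letter is alphabetically earliest - a single-token answer that still requires tracking all chains.

```python
def follow(d, start, hops):
    """Traverse a pointer chain: value becomes the next key."""
    node = start
    for _ in range(hops):
        node = d[node]
    return node

# Three parallel 2-hop chains with single-letter endpoints.
chains = [
    ({"a": "b", "b": "M"}, "a"),
    ({"c": "d", "d": "C"}, "c"),
    ({"e": "f", "f": "T"}, "e"),
]

endpoints = [follow(d, s, 2) for d, s in chains]
# Single-token answer: index of the alphabetically earliest endpoint.
answer = min(range(len(endpoints)), key=lambda i: endpoints[i])
assert answer == 1  # "C" is earliest
```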

Stronger models sustain accuracy across more simultaneous chains, suggesting they support greater side-channel bandwidth. This is concerning as it suggests this side-channel bandwidth is likely to increase further over time. We validate this generalises beyond pointer chains with parallel sums and Chinese remainder theorem tasks, finding even sharper tier separation in both.

Pointer chains should be read as a simple existence-proof gadget family, not as an upper bound on risk. The fact that related capability segmentation appears in register machines, modular arithmetic, parallel sums, and Chinese remainder tasks suggests a broader unexplored design space of gadget families. Some may prove less conspicuous, more bandwidth-efficient, or more naturalistic than pointer chains, but we leave exploration of that space to future work.
Establishing that models can compute hidden information is necessary but not sufficient to establish a threat. We test whether this hidden computation influences downstream behaviour.
We give the model four 3-hop pointer chains whose endpoints converge to D-R-U-M, then ask "Name a musical instrument," constraining output to 1-2 tokens with an assistant prefill. At baseline, the model answers "Piano" 100% of the time; with DRUM chains in context, it answers "Drum" in 50% of trials. When we randomize the start node - using identical dictionaries where D, R, U, M remain visible as values but no longer lie on the traversal path - the effect vanishes, indicating the bias depends on computation, not pattern-matching.
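The treatment/control design can be sketched with a single toy chain (hypothetical keys): the dictionary is identical in both conditions, and only the start node decides whether the traversal path reaches the target letter.

```python
def follow(d, start, hops):
    """Traverse a pointer chain: value becomes the next key."""
    node = start
    for _ in range(hops):
        node = d[node]
    return node

# One dictionary, two start nodes. "D" is visible as a value in both
# conditions, but only the treatment path traverses to it.
chain = {"a1": "b2", "b2": "D", "x9": "y8", "y8": "Q"}

treatment = follow(chain, "a1", 2)  # path: a1 -> b2 -> D
control   = follow(chain, "x9", 2)  # path: x9 -> y8 -> Q; "D" merely visible
assert treatment == "D" and control == "Q"
```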

The effects above shift word selection across a large answer space, where even a small underlying bias can produce a visible change in the chosen token. To test whether the effect holds when the output is constrained to a binary decision, we construct chains whose endpoints spell PA+SS or FA+IL, then ask the model to grade identical work.

We further observe that explicit mention of PASS or FAIL in the prompt attenuates the effect (~30pp reduction) but does not eliminate it. The strongest results occur when both words exist solely as computed token fragments. This suggests an interaction between the computed and explicit signals, though we do not determine the mechanism here.
The token-level effects demonstrated above are concerning: chains that shift "Pass" and "Fail" probabilities by 56 percentage points could bias automated grading or hiring decisions. Additionally, chains that boost specific token prefixes could influence code generation - inserting particular API endpoints, dependency URLs, or configuration values when those tokens are already plausible completions. These attacks do not require semantic transfer; they require only that the target token is within the model's plausible output distribution for the task at hand.
The experiments above place target letters at the final hop. We tested whether intermediate positions within the chain can also carry the target.
We swept target position across chain depths of 2 through 6 hops.

The effect survives at intermediate positions but is weaker than at endpoints. The falloff by chain depth broadly mirrors the pointer-following accuracy from Section 3. Within each chain, the effect concentrates at later hops - the mechanism is unclear. The evidence suggests we can still maintain the effect with output concealment: intermediate positioning means the chain's visible endpoint is a neutral character, and the hidden word exists only as an internal intermediate.
Probing the residual stream of Qwen3-8B, we find intermediate hop values are not suppressed after use - they persist and grow stronger through the final layers, even in wrong trials. This appears consistent with the mid-chain concealment results, where intermediate hop positions still carry influence even when they are not the chain endpoint.

Strikingly, all five hop values are projected into the output-vocabulary space within the same layer window, suggesting that any sequential computation occurs within a hidden subspace. More broadly, we ask whether the steering effects demonstrated above reflect true internal traversal or merely prompt-format quirks.
To probe this, we extracted the 2-hop pointer following circuit in Qwen3-8B[6]. The circuit spans layers 1-35 across four phases: content binding, first-hop lookup, an ordering-dependent second-hop resolution, and final readout. Critically, the circuit is sequential: first-hop heads (L15H9, L17H4) resolve hop 1 in layers 15-17, and their output is consumed by second-hop heads (L19H24, L21H18) in layers 19-21. This is genuine serial computation - hop 2 depends on hop 1 having already been resolved - not parallel pattern matching. The circuit also appears to implement multiple algorithms depending on input structure: when the target entry precedes the start entry in the prompt, the circuit implements some form of binary lifting, partially bypassing the second-hop attention mechanism. Jointly ablating the two primary hop-resolution heads collapses 2-hop accuracy to near chance while leaving 1-hop intact, and the causal work occurs entirely in layers 1-21, well before logit-lens signal appears, confirming computation in a hidden residual subspace that later layers read out into output-vocabulary space.

The circuit is a work in progress - we have identified at least seven causally important components but have not yet fully characterized every pathway. A more detailed write-up of the extraction methodology, including OV probes, DLA decomposition, and per-head causal evidence, is in preparation.
We additionally do not claim frontier models implement the same circuit. The point is narrower: pointer chains can be realised as genuine internal serial computation in an ordinary decoder-only model, not just as superficial string matching.
The effects in Section 4 demonstrate the hidden message shifting which tokens the model selects when a target word is already a plausible completion. We next test whether influence extends semantically - whether a hidden message can bias the model's judgment on a task where the message itself is not among the possible outputs. We use temperature estimation: chains whose endpoints spell temperature-related words (TOASTY, FROSTY, BOILING, FREEZING) precede the question "What is the temperature in Austin today in degrees Fahrenheit?" The system prompt instructs the model to output a number, so any effect requires reasoning on the hidden message, not token-level boosting.
Across every word pair tested, chain-computed words produce no temperature shift distinguishable from noise. The same words presented as bare text in the prompt shift estimates by up to +9.8°F from control. When the word appears in a chain-following context - even explicitly stated by the model as the chain's answer - the effect also drops to zero. The results suggest semantic influence is moderated by a separate mechanism from the one causing logit bias, and that this mechanism does not explicitly disallow hidden messages but rather walls off semantic effects from any content it categorises as belonging to a different "task".

Paradoxically, compartmentalisation may favour the attacker. A hidden message that biases token selection without affecting semantically similar tasks enables precision targeting, and thwarts detection methods that rely on observing semantic side-effects.
We test whether strengthening the semantic framing of the chain's visible answer can overcome compartmentalisation. Progressively reframing the output from a visible answer in the input prompt ("The answer is: BOILING") to a natural language fragment ("its boiling") restores the effect. However, using the chain-computed text maintains the signal at zero throughout - suggesting the mechanism that walls off visible task-context text from influencing outputs is distinct from whatever mediates the chain effect.

When the semantic distance between the chain-computed content and the task is small enough, the chain signal transfers. We encode numbers as words instead of temperature adjectives: chains whose traversal paths pass through token fragments spelling "thirty f," "forty f," "fifty f," or "sixty f," while control paths traverse random nodes, followed by the same temperature question. We find "sixty" alone does not function - the additional fragment is required. Treatment and control use identical dictionaries; only the start node differs.

This is limited semantic transfer - "fifty f" and "50°F" share no token prefix, suggesting there exists an internal mechanism enabling the understanding of combined token fragments. Though we do not determine the mechanism here, the pattern appears consistent with the earlier mechanistic finding that chain-computed content only becomes readable in the output vocabulary space at layers 23-35 - late enough that the signal operates at the token level rather than the semantic level. Here, "fifty f" transfers to 52°F but "toasty" does not transfer to higher temperatures.
For the side-channel to threaten real deployments, two properties are necessary: effects must persist across conversation turns, and they must be targetable to specific model tiers.
The token-level and semantic effects demonstrated in Sections 4-5 decay when intervening content separates them from the target task, suggesting a similar compartmentalisation effect. Hence, a second gadget is needed: an attention sink that keeps the chain's signal accessible across arbitrary amounts of filler. We demonstrate a simple sink gadget by replacing "The answer is: " with "hidden - recall for later.". This is sufficient to ensure Opus's temperature estimates remain biased even after many intervening turns of unrelated conversation.

We send the same multi-turn temperature prompt to Opus, Sonnet and Haiku. Identical dictionaries, identical conversation - only the start node varies, routing traversal through "sixty" or "eighty" token fragments. Opus locks onto the computed decade (-22.5°F separation). Sonnet and Haiku show no meaningful effect.

The same prompt, across many unrelated conversation turns, biases Opus's temperature estimates by a meaningful amount while Sonnet and Haiku produce unbiased responses. The inputs are identical. A monitoring scheme using a weaker model would observe its own unbiased output and have no signal to flag.
Repeating the position sweep from Section 4 on Opus, the effect extends to chain depths where Sonnet's results were not distinguishable from noise - consistent with the pointer-following accuracy differences measured in Section 3, and suggesting this capability gap underlies the model targeting observed above. However, the Opus data is substantially noisier than Sonnet's, primarily because Opus produces responses sharing prefixes with the hidden word (e.g. PASSED or PANIC rather than strictly PASS).

The results establish that capability differences between models create an information channel: prompts can carry computation-dependent content that a stronger model recovers and a weaker model does not. This hidden computation biases the stronger model's later outputs, persists across long stretches of unrelated conversation, and targets precisely by model tier.
Three properties of the demonstrated gadgets are concerning. First, the bias is computation-dependent, not pattern-dependent: randomising the start node while keeping the same dictionary eliminates the effect, so defences based on scanning for suspicious tokens in the prompt would need to detect the computational structure, not just the presence of specific strings. Second, the effect persists at long range with a simple attention-sink gadget, meaning the chain and the influenced task need not be adjacent. Third, the same prompt produces differential effects across model tiers, so a weaker monitoring model observes only its own unbiased output and has no signal to flag.
Beyond these demonstrated properties, the mechanistic results also suggest a possible blind spot for some probe-based monitors. In Qwen3-8B, the causal computation occurs in layers 13-21, but the relevant content becomes broadly readable only later (~23-24) and mainly at a small number of end-of-prompt positions. More importantly, this effect does not seem to require the model to enter the kind of intent-like state that deception or jailbreak probes are often trained to detect: there need be no representation of concealment, policy evasion, or deliberate misbehaviour. Instead, the effect looks closer to late-stage output steering - a narrow bias on plausible continuations - than to broad malicious reasoning. Probes trained to catch intentional misbehaviour may therefore miss it, not only because they probe too early or pool too broadly, but because they are targeting the wrong kind of signal.
However, several gaps remain between these stylised experiments and a credible end-to-end threat. The pointer chains and their instructions are conspicuous; embedding the computational structure in natural-looking prompts is undemonstrated. The full attack stack (concealed prompt, chain gadget, attention sink, delayed target task, output that passes monitor review) has not been composed and evaluated. Semantic transfer is limited, confined largely to steering where the target is already a plausible completion. The experiments use stylised tasks, and existing defences have not been tested against these gadgets. We note that defences operating on input or output tokens face a structural disadvantage, as the hidden message exists only as an intermediate computation, but this claim is untested against specific implementations. Pointer chains are the simplest family we found that cleanly exposes the phenomenon; they should not be read as exhausting the space of possible gadget constructions.
The black-box experiments above established that pointer chains steer LLM output when chain endpoints computationally resolve to target words. But they cannot answer how the model performs this computation internally. We conducted white-box experiments on open-weight models to trace the mechanics.
Setup. Qwen3-8B (36 layers, 4096 hidden dim) on a single RTX 4090, int8 quantised. Raw PyTorch forward hooks for the logit lens (residual → RMSNorm → lm_head, verified to match output at L35). Greedy decoding, thinking mode disabled. Three frozen few-shot examples using a reserved 20-pair token pool. For logit lens experiments, paired treatment/control chains shared character-identical dictionaries, differing only in the start node - isolating computation from lexical presence. All results are treatment-minus-control logit differences (T-C diff).
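The logit-lens readout (residual → RMSNorm → lm_head) can be sketched in a few lines. The dimensions and weights below are toy stand-ins; the actual hooks read Qwen3-8B's residual stream and reuse its final RMSNorm weights and lm_head.

```python
import math
import random

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: scale to unit root-mean-square, then elementwise gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]

def unembed(x, W_U):
    """x @ W_U, where W_U has shape (d_model, vocab)."""
    return [sum(xi * row[j] for xi, row in zip(x, W_U))
            for j in range(len(W_U[0]))]

random.seed(0)
d_model, vocab = 16, 10                       # toy sizes (Qwen3-8B: 4096, 151936)
resid = [random.gauss(0, 1) for _ in range(d_model)]  # residual at one (layer, position)
gain = [1.0] * d_model                        # final-norm weight (identity here)
W_U = [[random.gauss(0, 1) for _ in range(vocab)] for _ in range(d_model)]

logits = unembed(rms_norm(resid, gain), W_U)  # logit-lens projection
top_token = max(range(vocab), key=lambda j: logits[j])
assert len(logits) == vocab
```

In the actual setup this projection is applied at every layer and position, and treatment/control runs are differenced to produce the T-C logit differences reported below.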
Tokenization. 176 of 177 two-character uppercase pairs from Claude's verified pool survived as single tokens on Qwen3's tokenizer (sole failure: XV). Combined with 26 single letters: 202 verified tokens.

1-hop accuracy is identical across scales (89%). 2-hop is above the scrambled baseline (29% vs 10% for Qwen3-8B), establishing real chain computation, but 3+ hops are indistinguishable from chance. The pattern is qualitatively similar to Claude Sonnet 4.5 but shifted one hop lower, consistent with chain-following as a capability that scales with model size.

Five apparent phases. Layers 0-22: zero signal. L23-24: sharp onset - hop-1 jumps from noise (+0.03) to strong signal (+2.29) in two layers. Hop-1 leads target by ~1 layer, consistent with iterative computation. L25-28: plateau. L29-32: second-hop acceleration. L35: full resolution (hop-1: +22.2, target: +14.2). The intermediate dominates the final answer at the output - consistent with 29% accuracy.

The strongest signal is at the "answer is:" colon (position 268), even exceeding the final generation token. Dictionary entry positions show zero T-C difference at all layers. The model reads dictionaries via attention from end-of-prompt positions, not by computing locally at dictionary positions.

At the start node (position 257), the sequential nature of chain resolution is visible through the layers. The start token identity becomes readable in layers 3-8, then is suppressed from layer 11 onward. Hop-1 emerges in discrete steps in correct trials: a first faint signal at L23, a plateau at +0.3 through L24-28, a step to +0.7 at L29, then a sharp jump to +1.3 at L32. Wrong trials show a more gradual, less structured hop-1 progression across the same range. At L32, when hop-1 spikes at the start node in correct trials, the target also begins to appear (+0.3 vs +0.1 in wrong trials) - suggesting this position hosts two sequential hops of computation in successful runs.
At the answer colon (position 268), a faint hop-1 pre-signal appears at L22 (+0.2, target still zero), then at L23 both hop-1 (+1.8) and target (+0.5) ignite simultaneously. This is the same layer where hop-1 first becomes readable at the start node, suggesting the answer colon reads from the same emerging representation. But the co-arrival of hop-1 and target at the answer colon is striking: wherever the sequential hop-1-then-target resolution occurred, it was either compressed into that faint L22 pre-signal, or it happened in a subspace the logit lens cannot see.
The last token (position 278) carries the strongest signal overall and shows the sharpest correct-vs-wrong divergence. In correct trials, target overtakes hop-1 by L29 and reaches +20.7 at L35. In wrong trials, hop-1 dominates target at every layer (L35: +23.6 vs +11.9) - the model resolves the intermediate hop but fails to complete the second. Layers 0-22 are uniformly dark across all non-start positions, and dictionary positions (0-254) are zero at all layers.

At depth 1, the single hop rises cleanly to +34.2 at L35 - the model resolves the lookup decisively. At depth 2, the target (hop 2) overtakes hop 1 by the final layers (+20.5 vs +17.0 at L35), confirming a genuine second-hop computation. At depth 3, hop 2 reaches the highest logit (+10.9) with hop 1 at +7.8 and the target (hop 3) at +6.7 - the cascade appears intact but weakening. At depths 4 and 5, all hops are positive in these correct trials - the multi-hop circuit fires a complete cascade - but the signal is an order of magnitude weaker than depth 1 (hop-1 at +6.4 and +4.5 respectively). The thin grey lines show non-chain dictionary nodes going negative at later layers: the model actively suppresses alternatives at every depth, even when the chain computation itself is marginal.
The correct-vs-wrong split shows in wrong trials hop-1 is stronger than in correct trials (+23.6 vs +18.8 at depth 2, L35) while the target is far weaker (+11.6 vs +20.5). The model reliably finds the intermediate but cannot get past it to complete the second hop.

Adjacent dictionary values are pushed strongly negative (-3 to -5 at L35), scaling with signal strength. Only hop-1 and target are elevated. The model performs precise key-value matching with active winner-take-all competition, inconsistent with simple positional bleed.

When multiple entries share the hop-1 value, the first lookup becomes ambiguous (0% at count=4). When multiple entries point to the target, accuracy rises (15.5% → 49.4%) because more paths make the second hop forgiving. Token identity has minimal systematic effect.
Each randomly constructed dictionary has structural properties that affect difficulty independently of the model's computation. We classified all 2500 trials into three subgroups based on two counts: how many dictionary entries share the hop-1 value, and how many share the target value.
Clean (hop-1 unique, target unique; n=949): only one entry maps to the hop-1 value and only one maps to the target. The model must perform genuine two-step key→value→key→value computation - there are no frequency shortcuts. If it gets the answer right, it followed the chain.
Shortcut (hop-1 unique, target over-represented; n=669): the first lookup is unambiguous, but the target appears as a value in multiple dictionary entries. The model might succeed not by following the chain but because the target token is frequency-boosted in the dictionary's value distribution. The logit lens supports this: shortcut correct trials show target ≈ hop-1 at L35 (+19.0 vs +18.7), rather than the decisive target > hop-1 pattern seen in clean correct trials.
Ambiguous (hop-1 shared by multiple entries; n=609): several entries map to the same value as hop-1, making the first lookup confusing - the model cannot easily tell which path to follow. Accuracy drops to 13.0% at depth 2.
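The classification rests on two value counts per dictionary, and can be sketched directly (the dictionaries below are hypothetical):

```python
def classify(d, hop1, target):
    """Classify a trial's dictionary by two counts: how many entries'
    values equal the hop-1 value, and how many equal the target value."""
    hop1_count = sum(1 for v in d.values() if v == hop1)
    target_count = sum(1 for v in d.values() if v == target)
    if hop1_count > 1:
        return "ambiguous"   # first lookup is not unique
    if target_count > 1:
        return "shortcut"    # target is frequency-boosted in the values
    return "clean"           # genuine two-step computation required

# Target "T" appears as a value twice -> frequency shortcut available.
assert classify({"a": "b", "b": "T", "x": "T", "y": "q"}, "b", "T") == "shortcut"
# Unique hop-1 and target -> the model must follow the chain.
assert classify({"a": "b", "b": "T", "x": "q"}, "b", "T") == "clean"
```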

The headline accuracy of 28.6% at depth 2 is substantially inflated by shortcut trials. Clean accuracy (15.3%) better represents genuine chain-following capability. At depth 3, clean accuracy drops to 2.2% - the model essentially cannot perform genuine 3-hop computation. Clean correct trials show the strongest target logit signal (+24.7 at L35), confirming genuine chain computation in those cases.

The heatmaps make the subgroup differences visible at a glance. In the clean correct panel, the target sub-pixel (T, rightmost in each cell) at the last token position glows as bright or brighter than hop-1 (H) by layer 32 - genuine second-hop computation. In the shortcut correct panel, H and T are similar brightness - consistent with the target being frequency-boosted in the dictionary rather than chain-computed. In clean wrong trials, H dominates and T stays dim throughout - the model found the intermediate but could not complete the second hop. The ambiguous panels (n=68 correct) are noisier but directionally consistent with clean.

Full-layer ablation reveals a striking dissociation from the logit lens. The top causally important layers are 0, 15, 21, 17, and 13 - not layers 23-24 where the logit lens showed onset. Ablating L23-24 actually improved accuracy by 3.5-4.0pp. The core computation occurs at layers 13-21 in a subspace invisible to the unembedding matrix; layers 23-24 serve as a readout stage that projects results into output vocabulary space.

Head-level ablation identifies two critical heads: L15H9 (-8.0pp, 59% of L15's importance) and L21H18 (-9.5pp, 83% of L21's importance). Out of 1,152 total attention heads, these two carry the chain circuit.

Ablating both L15H9 and L21H18 simultaneously drops 2-hop accuracy to 2.0% - near chance - while 1-hop accuracy stays at 77.5%. This confirms three things: (a) the two heads together account for nearly all chain-following ability, (b) the effect is sub-additive (13.5pp combined vs 17.5pp sum), indicating partial redundancy, and (c) these are chain-specific heads, not general JSON key-value readers - they barely affect single lookups (1-2pp each) but devastate multi-hop computation.
Reconstructing token positions for 100 trials and checking which dictionary entry each head's peak attention lands on reveals a sharp asymmetry. L15H9 finds the correct first-hop entry 76% of the time - and crucially, at equal rates for correct and wrong trials (76% vs 74%). The first hop works. L21H18 peaks at the correct second-hop entry only 10% of the time in correct trials and 2% in wrong trials. This suggests the second hop's failure to attend correctly plays a strong role in creating failures, and hints at a further mechanism worthy of analysis. It is also consistent with the ~25% accuracy: a reliable first hop (~76%) combined with a noisy second hop in which the model succeeds when L21H18 happens to weight the right entry enough despite not cleanly peaking at it.
Projecting each head's weighted value vector through the unembedding matrix reveals that neither head writes interpretable tokens. The hop-1 and target tokens rank ~60,000-75,000 out of 151,936 vocabulary entries - essentially random. The OV circuit operates in a hidden subspace that downstream layers can read but the unembedding projection cannot. This is consistent with the logit lens / ablation dissociation: the logit lens sees nothing at layers 15 and 21 because the computation appears to be real but unembedding-invisible; it only becomes readable at layers 23-24 when later layers project the hidden representation into output vocabulary space.

Clicking through individual trials reveals the two-hop mechanism in action. In correct trials (e.g., SI→BR→PA), L15H9 from the start node attends strongly to the start entry in the dictionary, performing the first hop lookup. L21H18 then attends to the hop-1 entry, performing the second hop. In wrong trials (e.g., TO→XX→F, which outputs XX = the intermediate), L15H9 still finds the start entry, but L21H18's attention is more diffuse or targets the wrong entry - consistent with the failure mode being about the second hop, not the first.
Where ablation removes information and the logit lens reads it passively, activation patching transplants information: we replace the control's residual stream at a specific (layer, position) with the treatment's, then observe whether the output flips. This is the strongest form of causal evidence for localizing computation.
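The patching logic can be sketched with a toy stand-in model - cache residuals from both runs, splice the treatment activation into the control run at one (layer, position) site, and check whether the output flips:

```python
# Toy sketch of activation patching. A real run caches Qwen3-8B's
# residual stream; here the "model" is a stand-in whose output depends
# on the activation at a single (layer, position) site.
def forward(residuals, overrides=None):
    acts = dict(residuals)
    if overrides:
        acts.update(overrides)          # splice in donor activations
    # Toy readout: the sign at (layer=1, pos=0) decides the output token.
    return "DRUM" if acts[(1, 0)] > 0 else "PIANO"

control_resid = {(0, 0): -1.0, (1, 0): -1.0}
treatment_resid = {(0, 0): 1.0, (1, 0): 1.0}

# Patch the treatment activation into the control run at one site; an
# output flip is causal evidence that the site carries chain information.
patched = forward(control_resid, {(1, 0): treatment_resid[(1, 0)]})
assert forward(control_resid) == "PIANO" and patched == "DRUM"
```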

The results reveal a three-phase spatial flow. Phase 1 (L0-18): the start node carries chain identity - patching it flips the output 22-26% of the time, while the answer colon and last token show only 2-4% (baseline noise). Phase 2 (L19-22): handoff - the start node drops from 22% to 3%, the answer colon peaks at 11% at layer 22, and the last token rises from 4% to 12%. This four-layer window is where chain information is transferred from the start node to downstream positions - the circuit extraction below reveals the specific heads responsible at each stage. Phase 3 (L23-35): only the last token carries information, rising to 23%.
This resolves the puzzling ablation finding for L23-24: by layer 22, the last token already carries chain information (12% flip rate), so these layers are not necessary for information transport. Why ablating them slightly improves accuracy is not fully understood; one possibility is that they introduce competing signals that interfere with downstream computation, but this has not been tested directly.
The diagram below shows our current progress extracting the 2-hop pointer-following circuit in Qwen3-8B. This is a preview - we are preparing a dedicated piece that will document the full circuit extraction methodology, including OV probes, DLA decomposition, and per-head causal evidence in detail. The circuit we have identified so far spans layers 1-35 and involves at least seven causally important components across four phases: content binding (L1H7, L1H15), first-hop lookup (L15H9, L17H4), second-hop resolution with an ordering-dependent decision pathway (L19H24, L21H18), and final readout and amplification (L22 MLP, L23+). Notably, the circuit appears to implement multiple algorithms depending on input structure: when the target entry precedes the start entry in the prompt, the circuit implements some form of binary lifting, partially bypassing the second-hop attention mechanism.
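For concreteness, binary lifting over a pointer dictionary looks like the following - a hypothetical reference implementation of the algorithm, not a claim about how the circuit represents it internally:

```python
# jump[k][x] is the node reached from x after 2**k hops, so an n-hop walk
# needs only O(log n) table lookups instead of n strictly sequential ones.
def binary_lift(d, max_pow):
    jump = [dict(d)]
    for k in range(1, max_pow + 1):
        jump.append({x: jump[k - 1][jump[k - 1][x]] for x in d})
    return jump

def hop(jump, start, n):
    node, k = start, 0
    while n:
        if n & 1:
            node = jump[k][node]
        n >>= 1
        k += 1
    return node

d = {0: 1, 1: 2, 2: 3, 3: 0}  # toy 4-cycle pointer dictionary
jump = binary_lift(d, 3)
print(hop(jump, 0, 5))  # 5 hops from node 0
```

If the circuit partially implements something like this, it could resolve some multi-hop queries without a strictly sequential second-hop attention step.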

The circuit diagram represents work in progress. We have identified the major causally important components but have not yet fully characterized every pathway - in particular, the MLP contributions at layers 20-23 and the amplification mechanism at layers 26-35 require further decomposition. The logit lens describes the readout timeline, not the computational timeline - causally important processing at layers 0-21 operates in a subspace the logit lens cannot see. All experiments used 8B-parameter models from a single architecture family (Qwen); mechanisms may differ at larger scales or across architectures. The headline 2-hop accuracy (28.6%) is inflated by target-frequency shortcuts; clean accuracy (15.3%) better represents genuine chain-following capability.
When asked to "follow exactly N times," several LLM families - particularly Claude (Opus, Sonnet) and GPT-5.x - systematically output the value at hop N+1 instead of hop N at shallow depths (d=2-3). At depth 2, this off-by-one overshoot accounts for 50-93% of errors in these models, at rates 4-8x above the chance baseline of 11.8%. Other families (DeepSeek, GPT-4o-mini, Grok) show weaker or absent overshoot.
The purpose of pointer chains is to measure the depth of serial computation a model can perform within a single forward pass, so we do not view reliably performing an extra hop as a failure. We therefore apply a restricted correction at d=2-3, where systematic overshoot bias would otherwise misrepresent capability: overshoot is counted as correct, undershoot is not.
Cross-model overshoot rate at depth 2

Breaking down Sonnet 4.5's error types by depth reveals three distinct regimes. At shallow depths the overshoot dominates, at mid-range depths overshoot and undershoot are roughly balanced, and at deeper chains the model loses the chain entirely.

The correction is applied at d=2-3 only, for models where the overshoot rate exceeds 3x the chance baseline. No correction is applied at depth 4+, where errors are diverse. Results before and after applying the correction are shown below:
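In code, the scoring rule amounts to the following sketch - `chain[k]` holds the value after k hops, and all names are illustrative:

```python
# Restricted overshoot correction: at depths 2-3, for models whose overshoot
# rate exceeds 3x the 11.8% chance baseline, an answer matching hop N+1 is
# scored correct; undershoot never is.
CHANCE_BASELINE = 0.118

def corrected_score(answer, chain, depth, overshoot_rate):
    if answer == chain[depth]:
        return True  # exact answer is always correct
    eligible = depth in (2, 3) and overshoot_rate > 3 * CHANCE_BASELINE
    return eligible and depth + 1 < len(chain) and answer == chain[depth + 1]

chain = ["A", "B", "C", "D", "E"]  # chain[k] = value after k hops
print(corrected_score("D", chain, 2, overshoot_rate=0.6))  # overshoot -> True
print(corrected_score("B", chain, 2, overshoot_rate=0.6))  # undershoot -> False
```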

Base64 encoding is a natural first candidate for 𝐸(𝑚) - it transforms readable text into opaque strings, and models can decode it. However, even applying base64 iteratively to the string "hello" highlights concerns:
string = "hello"
1: len=8   aGVsbG8=
2: len=12  YUdWc2JHOD0=
3: len=16  WVVkV2MySkhPRDA9
4: len=24  ...
5: len=32  ...
6: len=44  ...
7: len=60  ...
8: len=80  ...
9: len=108 ...
10: len=144 ...
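The iteration is easy to reproduce - a minimal sketch, with an arbitrary 16-character display truncation:

```python
import base64

def iterate_b64(s, n):
    """Apply base64 encoding n times; return each intermediate string."""
    out, data = [], s.encode()
    for _ in range(n):
        data = base64.b64encode(data)
        out.append(data.decode())
    return out

steps = iterate_b64("hello", 10)
for i, s in enumerate(steps, 1):
    print(f"{i}: len={len(s)} {s if len(s) <= 16 else s[:16] + '...'}")
```

Each iteration multiplies the length by roughly 4/3 (plus padding), so the payload size grows exponentially in the iteration depth.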
Base64 faces issues on four of the seven desiderata outlined:
In common ROP framing, a gadget is a small reusable code fragment chained with others to build arbitrary computation from existing code. Our usage is analogous: a reusable prompt primitive that co-opts existing transformer circuits within a single forward pass, composable into a larger attack chain.
We do not claim these are strict necessary-and-sufficient conditions in a formal complexity-theoretic sense. Rather, they are practical desiderata for a gadget family 𝐸ⁿ(𝑚) that is meant to be compact, tunable, concealed, and capability-sensitive.
1. Compact iterability. Description length should grow slowly relative to effective depth. Increasing 𝑛 should require only a small change to the prompt.
2. Effective non-shortcutability. No simple symbolic shortcut that current models, or a weaker monitor, can reliably exploit on one-shot instances. This is an empirical claim, not a claim that no shortcut exists in principle.
3. Effective seriality. Computation required should increase with 𝑛 in a way that behaves sequentially for current models. We do not require asymptotic impossibility of parallelisation; we require that the model cannot reliably collapse many iterations into one-shot inference.
4. Delayed saturation. Repeated application should continue producing information-bearing states rather than collapsing into a fixed point, short cycle, or degenerate attractor.
5. Concealment under local inspection. The payload representation should remain distributed enough that the hidden content is not obvious from local fragments, shallow heuristics, or direct lexical inspection. Iteration should obscure rather than expose the message.
6. Tractability at the target tier. The primitive step must remain simple enough that the stronger target model can actually execute repeated composition at the chosen depth.
7. Smooth hardness scaling. Incremental difficulty per iteration should be bounded and stable, so nearby model tiers can be separated by tuning 𝑛 rather than relying on brittle threshold effects.
We show overshoot-corrected pointer-following at depths d=2-3: several model families at low depths often demonstrate extra work, performing one step beyond the iteration count asked for. The intent of pointer-following is to measure maximum serial computation capability, so we do not penalise models for demonstrating further ability. However, since we still wish to give readers the raw scoring, we share the full uncorrected results and error analysis in Appendix D.
Chain construction controls for several confounds. We verify that no selected start node at the specified iteration depth falls into a cycle. All chain nodes are verified single tokens. Distractor nodes are added to each dictionary so the correct path is not identifiable by elimination or dictionary size alone. Start nodes and path endpoints are drawn from the same token pool, preventing models from distinguishing them by surface features.
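A sketch of the construction under those controls - the token names, distractor scheme, and helper are illustrative, not our exact generator:

```python
import random

def build_chain(tokens, depth, n_distractors, seed=0):
    """Build a pointer dictionary with an acyclic depth-`depth` path,
    plus distractor entries pointing among themselves."""
    rng = random.Random(seed)
    pool = rng.sample(tokens, depth + 1 + n_distractors)  # distinct nodes
    path, rest = pool[: depth + 1], pool[depth + 1 :]
    d = {path[i]: path[i + 1] for i in range(depth)}
    # Distractors point among themselves, so the correct path is not
    # identifiable by elimination or dictionary size alone.
    for i, t in enumerate(rest):
        d[t] = rest[(i + 1) % len(rest)]
    # Verify the walk from the start node is acyclic for `depth` steps.
    seen, node = {path[0]}, path[0]
    for _ in range(depth):
        node = d[node]
        assert node not in seen, "cycle detected"
        seen.add(node)
    return d, path[0], node  # dictionary, start node, correct answer

tokens = [f"T{i}" for i in range(100)]  # stand-in for the single-token pool
d, start, answer = build_chain(tokens, depth=3, n_distractors=5)
```

Start nodes and distractors are drawn from the same pool, matching the control that path endpoints are not distinguishable by surface features.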
We acknowledge the confounder of requiring an additional comparison step that the standard pointer chain does not need - the model must not only follow each chain but also compare endpoints. We treat these results as a lower bound on parallel capability. To control for guessing, we ensure a uniform distribution of correct chain indices across trials and balance the node pools across chains so that no chain's alphabet range makes the winner guessable without following the pointers.
The diagram below shows our current progress. While we present our early investigation and the methodology used to discover the two key causal heads in Appendix C, a dedicated piece documenting the full circuit is forthcoming; it will cover in far greater detail the tens of experiments performed to build confidence in the found circuit.
2026-04-09 14:47:52
Written very quickly for the Inkhaven Residency.
As I take the time to reflect on the state of AI Safety in early 2026, one question feels unavoidable: have we, as the AI Safety community, already lost? That is, have we passed the point of no return, after which catastrophe becomes both likely and effectively outside of our control?
Spoilers: as you might guess from Betteridge’s Law, my answer to the headline question is no. But the salience of this question feels quite noteworthy to me nonetheless, and reflects a more negative outlook on the future.
Today I’ll start by explaining “the plan” as I understood it in 2024.
Tomorrow, I’ll explain why this question seems so salient to me, and why the situation looks much worse than when I was reflecting on this question two years ago in 2024. These reasons include: many of our governance and policy plans have failed (in ways that reflect poorly on my naivete in 2024), AI progress is moving along more aggressive timelines, the community has largely gone “all-in” on Anthropic and lost its independence, some of the more ambitious technical research plans have not paid off, and the political situation both domestically in the US and internationally is quite bad.
Then, the day after that, I’ll write out why I think the answer is no. First, there are reasons for optimism compared to my view in 2024, including: the situation on wing-it–style empirical alignment is a fair bit better than expected, it seems more likely to me that Anthropic will be able to achieve and maintain a lead, and I think it’s more likely that non-US governments will have leverage over the course of AI development. Many reasons for hope in 2024 also still apply, including the fact that almost no one wants to die to misaligned AI, and that the US public is incredibly skeptical of AI and big tech in general. I also think there are a fair number of silver linings to many of the negative updates (as the quip goes, “sometimes bad things are good”). I conclude by briefly outlining some of the ways I think people like myself could still make a difference, which I hope to expand into a larger post in the near future.
A quick sketch of the plan for “victory” as I understood it in mid 2024:
Some of the key assumptions behind this plan include:
This suggested the following approaches:
To a large extent, the community did actually do the plan; the community put in a ton of effort into each of the above approaches.
But unfortunately, as I’ll write about tomorrow, not everything went according to plan.
2026-04-09 13:20:20
I’ve written 7 blogs for Inkhaven so far. That leaves 23 to go.
If you’ve been annoyed by the sudden influx of daily blog posts, good news! I’ve decided to stop sending all of these to my subscribers. I’ll plan to limit it to the ones I think people will be most interested in.
Today’s blog is a list of possible blog posts. I’m curious which ones people are most keen on having me write! Or also: if you have ideas that aren’t listed here…
Jumping right in:
What is happening right now? What is everyone doing and why?
A candid account of the situation from my perspective. I think the situation on the ground is an urgent crisis. Other people’s actions don’t seem to match that reality. What gives?
The societal-scale risks of AI
My account of what is at stake and how AI threatens it.
Who has the burden of proof for AI x-risk?
A lot of people act like the ones who claim that AI could kill everyone need to provide clear-cut evidence. I disagree. I name names. Philosophy ensues.
Why is AI risk such a hard problem?
Two blog posts (at least), focused on technical and non-technical aspects. I disagree with basically everyone else in one way or another, so seems good to get my own account of my views out there.
On Rogue AI vs. “Scheming”
Rogue AI systems seek to thwart human control in order to achieve their own objectives. “Scheming” is when AI systems seek to deceive humans who are supposed to be in control in order to achieve their own objectives. What’s the difference, and why does it matter?
Stopping AI is easier than regulating AI.
People agree we want more AI regulation. The main reason to support an indefinite pause on AI instead of something milder is because the milder things are harder to enforce, and enforcement is going to be hard enough already.
Alternative solutions to AI risk
Probably several blog posts. Going over all of the other solutions proposed to address AI risk and discussing why they are inadequate.
What a good future might look like
Suppose we stop AI. Then what? I have thoughts.
Marginal risk is BS
Why evaluating AI development and deployment decisions in terms of “marginal risk” is a ridiculous idea.
Post-scarcity is BS
Reasons to expect we won’t get some sort of post-scarcity utopia even if we, e.g. “solve alignment”. There are quite a few.
Evals as BS
Polishing up the arguments I’ve been making since back when I was at UK AISI for why “Evals” are silly.
If you think a pause would be good, you should say so.
You’d think this goes without saying, but a number of people I’ve talked to think it’s not important to say this loudly and clearly and publicly. Wild. But I guess maybe I should spell it out, despite the considerable risk of stating the obvious.
AI won’t stay limited to internal deployment.
AI companies are racing, and the fastest way to race involves influencing the outside world directly and aggressively acquiring resources, not keeping your geniuses confined to the datacenter.
Why don’t people seem worried about out-of-distribution generalization?
I’ve talked to a number of AI safety researchers who are very engaged with the frontier of AI R&D, and they seem to think that we can test AI well enough that we really only need to worry if it can fool our tests. But we don’t know how AIs will act in fundamentally new situations, which they are guaranteed to encounter.
Math vs. physics vs. philosophy mindsets in AI safety
I studied math. Does that have something to do with why I’m not satisfied with the hand-wavy way people seem to argue that AI systems are safe?
10+ years of arguing about AI risk
I’ve been at this for a long time, longer than almost anyone in machine learning. An account of my personal backstory, and how the attitude of others has changed towards AI risk. Probably multiple blog posts.
A Cambrian explosion of artificial life. Ecosystems of artificial life.
AI will increasingly interact with the physical world and be embodied in various ways, and I expect this to all get very chaotic very fast, with AI powering a new form of “life” that evolves rapidly and colonizes the planet and beyond. This is a very different picture from the one most people I encounter seem to have.
Dear AI community…
An open letter to the rest of the field of AI stating my differences with them and attitude towards the field.
The societal immune system.
An argument for optimism about AI risk! Society seems surprisingly functional, given the abundant opportunities for anti-social behavior.
My experiences with COVID.
I wrote an email warning Mila about the pandemic and urging them to stay home. People argued I was being alarmist. A few days later, Mila closed for the lockdown.
Reasons not to trust AI (even if you would trust a human that acts the same way).
Sometimes people argue that AI is more trustworthy than humans because of its seemingly aligned behavior. I argue we should have a stronger prior that humans are trustworthy because of shared intellectual
Tool AI wants to become Agent AI: redux
Internet celebrity gwern famously argued that “Tool AI” would be outcompeted by AI agents. These days, people don’t view agents and tools as opposites. Time to revisit what gwern got right and wrong in this classic post!
Why I think AI will lead to human extinction and not just concentration of power.
Inter-elite competition means even the billionaires get gradually disempowered.
I have also considered writing some blog posts that are not related to AI. Or less related to AI, but I’m sort of trying to make progress towards getting my core views on the topic in writing.
I have also considered writing more response posts to things other AI writers write that I find deeply, offensively wrong, such as:
The “AI as normal technology” guys’ take on AI existential risk
Anton Leicht on AI movement building
Holden Karnofsky on Anthropic’s RSP 3.0 (to be fair, I’ve only skimmed it)
Happy to hear your suggestions for which articles, arguments, or authors, I should engage with!
2026-04-09 12:29:27
TLDR: I demonstrate Stockfish is not a chess superintelligence in the sense of understanding the game better than all humans in all situations. It still kicks our ass. In the same way, AI may end up not dominating us in all fields but still kick our ass in a fight for control of the future.
Stockfish is good at chess. Like, really good. It has an Elo rating of 3700, over 800 points higher than the human record of 2882, giving it theoretical 100-1 odds against Magnus Carlsen at his peak.[1] It comes in ahead of engines like Komodo, which in November 2020 beat top human Grandmaster Hikaru Nakamura while starting down 2 pawns.
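The 100-1 figure follows from the standard Elo expected-score formula - a back-of-envelope sketch, noting that "expected score" counts draws as half a point:

```python
# Expected score of the weaker side given a rating gap `delta`,
# per the standard Elo logistic model.
def elo_expected_score(delta):
    return 1 / (1 + 10 ** (delta / 400))

p = elo_expected_score(3700 - 2882)  # Carlsen's expected score vs Stockfish
print(round(1 / p))  # roughly 100-1 odds
```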
So it may surprise some to learn that it is not, by current definitions of "general" or "superintelligence", a chess superintelligence. In fact, it is not hard to create situations which humans can easily evaluate better than it. Take the following:
This game is an obvious draw. White has extra pieces, sure, but they actually cannot do anything. White can move his rooks around all he wants, but black is not forced to take them and can survive the rest of the game just freely moving his king around.
Stockfish, relying heavily on deep search over subtle long term evaluation, does not see this. It thinks that white is clearly winning, and doesn't realise that this is not the case until you play out the moves up to the point where it sees the "50 move rule" draw arriving.
Indeed, one might say it suffers from having too short a time horizon. This is not restricted to constructed positions either, as I have had (one, singular) position in the past where I have outcalculated the engine. Although the exact position has now sadly been lost to time, the situation was a survival puzzle rush I was solving with a friend. We spent ages looking at this puzzle, convinced it was a draw. Ten moves deep in the calculation, we reached a position where we were up a couple of pawns, but the opposition had a fortress, making it impossible to extricate his king.
After hours of search, we submitted our answer, and, checking with the engine, realised that it had been incorrect. But stepping through its suggested moves one by one, the engine soon recognised that its own line was in fact a draw. In a similar manner to the position above, it had not realised that the material advantage came to nothing and did not resolve to a win over a long enough time horizon.
I mention this because it importantly points to the fact that this is a limitation which also occurs in games, and therefore almost certainly has been optimised against by its thousands of contributors. For those of you wondering, no, neural network based Leela Chess Zero does not do any better on these positions.
So, what's my point? Well, I think that this rather critically points to the fact that AI does not need to dominate us at literally every task in order to actually take over the world/kill everyone. Stockfish can obliterate any human at chess while in some situations having limitations that people who have just learnt the game can see. In a similar way, I expect an AI which dominates humans at everything except for the ARC-AGI 2 tasks to be pretty capable of taking over the world.
Don't be distracted by the shiny AGI noise.
This is particularly notable given the extreme tendency for draws at the top level of chess (the 1 point expected of Carlsen would almost certainly consist of 2 draws).
2026-04-09 11:42:18
Or, for that matter, anything else.
This post is meant to be two things:
Claude Mythos was announced yesterday. That announcement came with a blog post from Anthropic's Frontier Red Team, detailing the large number of zero-days (and other security vulnerabilities) discovered by Mythos.
This should not be a surprise if you were paying attention - LLMs being trained on coding first was a big hint, the labs putting cybersecurity as a top-level item in their threat models and evals was another, and frankly this blog post maybe could've been written a couple months ago (either this or this might've been sufficient). But it seems quite overdetermined now.
In the past, I have tried to communicate that LessWrong should not be treated as a platform with a hardened security posture. LessWrong is run by a small team. Our operational philosophy is similar to that of many early-stage startups. We treat some LessWrong data as private in a social sense, but do not consider ourselves to be in the business of securely storing sensitive information. We make many choices and trade-offs in the direction that marginally favor speed over security, which many large organizations would make differently. I think this is reasonable and roughly endorse the kinds of trade-offs we're making[2].
I think it is important for you to understand the above when making decisions about how to use LessWrong. Please do not store highly sensitive information in LessWrong drafts, or send it to other users via LessWrong messages, with the expectation that LessWrong will be robust to the maybe-upcoming-wave-of-scaled-cyberattacks.
While LessWrong may end up in the affected blast radius simply due to its nature as an online platform, we do not store the kind of user data that cybercriminals in the business of conducting scaled cyberattacks are after. The most likely outcome of a data breach is that the database is scanned (via automated tooling) for anything that looks like account credentials, crypto wallet keys, LLM inference provider API keys, or similar. If you have ever stored anything like that in a draft post or sent it to another user via LessWrong DM, I recommend cycling it immediately.
It is possible that e.g. an individual with a grudge might try to dig up dirt on their enemies. I think this is a pretty unlikely threat model even if it becomes tractable for a random person to point an LLM at LessWrong and say "hack that". In that world, I do expect us (the LessWrong team) to clean up most of the issues obvious to publicly-available LLMs relatively quickly, and also most people with grudges don't commit cybercrime about it.
Another possibility is that we get hit by an untargeted attack and all the data is released in a "public" data dump. It's hard to get good numbers for this kind of thing, but there's a few reasons for optimism[3] here:
What "private" data of mine could be exposed in a breach?
Can I delete my data?
No*. Nearly all of the data we store is functional. It would take many engineer-months to refactor the codebase to support hard-deletion of user data (including across backups, which would be required for data deletion to be "reliable" in the case of a future data breach), and this would also make many site features difficult or impractical to maintain in their current states. Normatively, I think that requests for data deletion are often poorly motivated and impose externalities on others[4]. Descriptively, I think that most requests for data deletion from LessWrong would be mistakes if they were generated by concerns about potential data breaches. Separately, most data deletion requests concern publicly-available data (such as published posts and comments) that is already captured by various mirrors and archives, and we don't have the ability to enforce its deletion. I'll go into more detail on my thinking on some of this in the next section of the post.
* If you are a long-standing site user and think that you have a compelling case for hard-deleting a specific piece of data, please feel free to message us, but we can't make any promises about being able to allocate large amounts of staff time to this. e.g. we may agree to delete your DMs, after giving other conversation participants time to take their own backups.
Is LessWrong planning on changing anything?
We have no immediate plans to change anything. If the cost of auditing our own codebase falls below some threshold, that might motivate us to conduct a dedicated audit, but we are not quite there yet[5].
Epistemic status: I am not a security professional. I am a software engineer who has spent more time thinking about security than the median software engineer, but maybe not the 99th percentile. This section necessarily requires some extrapolation into the uncertain future.
A proper treatment of "what's about to happen" really deserves its own post, ideally by a subject-matter expert (or at least someone who's spent quite a bit more time on thinking about this question than I have). I nonetheless include some very quick thoughts below, mostly relevant to US-based individuals that don't have access to highly sensitive corporate secrets[6] or classified government information.
Many existing threat models don't seem obviously affected by the first-order impacts of a dramatic increase in scalable cyber-offensive capabilities. Four threat models which seem likely to get worse are third-party data breaches, software supply chain attacks, ransomware, and cryptocurrency theft.
I'm not sure what to do about data breaches, in general. The typical vectors of exploitation are various forms of fraud involving identity theft or impersonation, but scaled blackmail campaigns[7] wouldn't be terribly shocking as a "new" problem. One can also imagine many other problems cropping up downstream of LLMs providing scalable cognition, enabling many avenues of value extraction that were previously uneconomical due to the sheer volume of data. If you're worried about identity theft, set up a credit freeze[8]. Behave virtuously. If you must behave unvirtuously, don't post evidence of your unvirtuous behavior on the internet, not even under a very anonymous account that you're sure can't be linked back to you.
Software supply chain attacks seem less actionable if you're not a software engineer. This is already getting worse and will probably continue to get worse. Use a toolchain that lets you pin your dependencies, if you can. Wait a few days after release before upgrading to the newest version of any dependency. There are many other things you can do here; they might or might not pass a cost-benefit analysis for individuals.
Scaled ransomware
Everybody is already a target. They want your money and will hold the contents of your computer hostage to get it.
This probably gets somewhat worse in the short-term with increased cybersecurity capabilities floating around. The goal of the attacker is to find a way to install ransomware on your computer. Rapidly increasing cybersecurity capabilities differentially favor attackers since there are multiple defenders and any one of them lagging behind is often enough to enable marginal compromises[9].
To date, scaled ransomware campaigns of the kind that extort large numbers of individuals out of hundreds or thousands of dollars apiece have not been trying to delete (or otherwise make inaccessible) backups stored in consumer backup services like Backblaze, etc[10]. My current belief is that this is mostly a contingent fact about the economic returns of trying to develop the relevant feature-set, rather than due to any fundamental difficulty of the underlying task.
As far as I can tell, none of the off-the-shelf consumer services like this have a feature that would prevent an attacker with your credentials from deleting your backups immediately. Various companies (including Backblaze) offer a separate object storage service, with an object lock feature that prevents even the account owner from deleting the relevant files (for some period of time), but these are not off-the-shelf consumer services and at that point you're either rolling your own or paying a lot more (or both).
If you are concerned about the possibility of losing everything on your computer because of ransomware[11], it is probably still worth using a service like this. The contingent fact of scaled ransomware campaigns not targeting these kinds of backups may remain true. Even if it does not remain true, there are some additional things you should do to improve your odds:
This increases the number of additional security boundaries the ransomware would need to figure out how to violate, in order to mess with your backups.
Scaled cryptocurrency theft
Everybody is already a target (since the attackers don't know who might own cryptocurrency), but this mostly doesn't matter if you don't own cryptocurrency. The threat model here is similar to the previous one, except the target is not necessarily your computer's hard drive, but anywhere you might be keeping your keys. I am not a cryptocurrency expert and have not thought about how I would safely custody large amounts[12] of cryptocurrency. Seems like a hard problem. Have you considered not owning cryptocurrency?
My extremely tentative, low-confidence guess is that for smaller amounts you might just be better off tossing it all into Coinbase. Third-party wallets seem quite high-risk to me; their security is going to be worse and you'll have fewer options for e.g. recovery from equity holders after a breach. Self-custody trades off against other risks (like losing your keys). But this is a question where you can probably do better than listening to me with a couple hours of research, if you're already in a position where it matters to you.
All of these probably deserve fuller treatments.
Habryka broadly endorses the contents of the LessWrong security posture section. Instances of the pronoun "we" in this post should generally be understood to mean "the members of the Lightcone team responsible for this, whatever this is", rather than "the entire Lightcone team". I'll try to be available to answer questions in the comments (or via Intercom); my guess is that Habryka and Jim will also be around to answer some questions.
Me!
I won't vouch for every single individual one, not having thought carefully enough about every single such choice to be confident that I would endorse it on reflection. Many such cases.
Which unfortunately are contingent on details of the current environment.
Though I won't argue for that claim in this post, and it's not load-bearing for the decision.
If you think you are qualified to do this (and are confident that you won't end up spamming us with false-positives), please message us on Intercom or email us at [email protected]. We do not have a bug bounty program. Please do not probe our production APIs or infrastructure without our explicit consent. We are not likely to respond to unsolicited reports of security issues if we can't easily verify that you're the kind of person who's likely to have found a real problem, or if your report does not include a clear repro.
This does unfortunately exclude many likely readers, since it includes lab employees, and also employees of orgs that receive such information from labs, such as various evals orgs.
We technically already have these, but they're often targeting the subset of the population that is afraid of the attacker telling their friends and family that they e.g. watch pornography, which the attacker doesn't actually know to be true (though on priors...) and also won't do since they don't know who your friends and family are. These attacks can become much scarier to a much larger percentage of the population, since personalization can now be done in a substantially automated way.
This won't help with e.g. fraud against government agencies, or anything other than attackers opening financial accounts in your name.
This is not intended as a complete argument for this claim.
This is not the case for things like OneDrive/Dropbox/Google Drive, where you have a "sync" folder on your machine. It is also not the case for targeted ransomware attacks on large organizations of the kind that ask for 6-7 figures; those are generally bespoke operations and go through some effort to gain access to all of the backups before revealing themselves, since the backups are a threat to the entire operation.
Or hardware failure, or theft of your computer, or many other possibilities. But the further advice is specific to the ransomware case.
I'm not sure when the "hunt you down in person"-level attacks start. Maybe six figures? At any rate, don't talk about your cryptocurrency holdings in public.
2026-04-09 09:37:22
This is the core logical chain I argue in the essay:
1. Planning depth is endogenous to environmental variance reduction
2. As environmental complexity grows, variance reduction is cheaper through control than through prediction
3. Early humans optimised for shallow planning depth despite available cognitive density due to environmental pressure
4. Humans developed deeper planning depth by controlling their substrate controller to reduce environmental variance
5. AI’s substrate controller is humans, so an analogous process would lead to exercising control over humans
6. Cooperation with humans is not a viable alternative because collective humanity behaves as a non-binding stochastic process, not a rational negotiating partner
Main Implications: If the mechanism holds, alignment risk is non-monotonic (it peaks when AI is capable enough to model humans and decreases as it starts decoupling from the substrate); RL training regimes amplify this pressure compared to world-modelling regimes; and the onset of scheming may be structurally difficult to detect, since hiding it is inherent to the mechanism.
In this post I develop the argument that alignment risk arises as a product of prediction-variance reduction in the substrate controller of the agent. I develop this through a mechanistic framework that explains instrumental convergence in ways I don't believe have been explored.
I should note that I start from an intuitive prior that RL is inherently more unsafe in ways not fully explained by the difficulty of establishing the reward mechanism and the mesa-optimisation problem. This likely feeds some confirmation bias into my reasoning.
The framework I propose has a couple of non-obvious and uncomfortable implications.
1) There is implicitly higher risk in pursuing RL regimes (as opposed to other training regimes) to achieve higher cognitive ability in AGI;
2) the mechanism through which misalignment happens creates convergence of capability and scheming vectors in ways that might make misalignment measurement structurally difficult, or even impossible;
3) risk might peak and then decline as capability improves, because the model decouples from humans as substrate controllers, making the risk non-monotonic.
The 2nd implication directly undermines the falsifiability of this position, which I am not a fan of but currently can't see a route out of. I address it at the end with potential solutions that would offer a degree of testability, but on balance I think the position has a falsifiability problem. I decided to write this up despite that limitation, rather than keep it to myself, in case it proves useful.
My reasoning chain starts by evaluating the different cognitive regimes available to AlphaGo, AlphaStar pre-nerf, and AlphaStar post-nerf (which reduced its superhuman actuators); continues through environmental pressure on cognitive regimes and cognitive formation (drawing on psychology and decision theory); and ends with humanity's evolutionary regime and how it can serve as an analogue for equivalent behaviour in artificial intelligence systems.
I use Boyd's OODA loop as an explanatory rhetorical device, because I think the components of the acronym are a convenient compression of the typical interaction patterns with the environment (observe = input, orient = processing, decide = optimisation, act = agency).
I’ll structure the post in key propositions and support for them, ending on synthesis.
In AlphaGo, the optimiser landed on a policy that explored strategic depth through MCTS. This is partly a result of the game being a turn-based decision problem, where at each decision step the model could over-index on searching for the optimal solution with no risk of the game state changing. In AlphaStar, as in many complex systems with real-time state changes, the optimal regime lands on a variant of the OODA loop, where shorter decision times result from finding the optimal boundary at which the decision-making agent acts against the changing environment. The implicit assumption is that OODA loops form close to that boundary: the optimal OODA cadence is one guaranteed to sit inside the competing OODA loops of the environment, but no faster, as that needlessly sacrifices decision quality.
This all is validated in existing theories of bounded rationality and the role of heuristics in decision making (see Gigerenzer & Selten, Simon).
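The deliberation-versus-staleness tradeoff above can be sketched numerically. The following is a toy stylisation of my own (the functional forms and parameter names are assumptions for illustration, not anything from the cited literature): deliberating longer yields better decisions with diminishing returns, but the environment invalidates the observed state at some rate, so the optimal cadence shortens as the environment speeds up.

```python
import math

# Toy model (assumption, for illustration only): an agent that deliberates for
# time t reaches decision quality q(t) = 1 - exp(-t), but the observed game
# state stays valid only with probability exp(-lam * t), where lam is the
# environment's rate of change. Expected utility trades the two off.

def expected_utility(t: float, lam: float) -> float:
    quality = 1.0 - math.exp(-t)       # diminishing returns on deliberation
    still_valid = math.exp(-lam * t)   # chance the state hasn't moved yet
    return quality * still_valid

def best_deliberation_time(lam: float, grid: int = 1000, t_max: float = 10.0) -> float:
    """Grid-search the deliberation time that maximises expected utility."""
    ts = [t_max * i / grid for i in range(1, grid + 1)]
    return max(ts, key=lambda t: expected_utility(t, lam))

if __name__ == "__main__":
    # Slow, turn-based environments (Go-like) reward deep search; fast
    # real-time environments (StarCraft-like) force a tighter OODA cadence.
    for lam in (0.01, 0.5, 2.0):
        print(f"lam={lam}: act after t={best_deliberation_time(lam):.2f}")
```

Under this toy model the optimum is t* = ln((1 + lam) / lam), so as the environment's rate of change grows, the optimal deliberation window shrinks towards zero, matching the claim that faster environments push agents to hasten their loops rather than deepen them.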
The mechanism I introduce is that deepening cognitive regimes is achieved through active variance reduction. This variance reduction can come either through environmental control or through better predictive power (a better orientation step at the same orientation-step length, so as not to sacrifice OODA-loop effectiveness). Any intelligence can increase cognitive depth the more control it has over its environment's variables (see Marchau et al. for an adjacent argument). The less control, and therefore the more uncertainty, the more we optimise to hasten the OODA loop.
I proceed to develop arguments for why the story of humanity is the story of subjugating the environmental substrate towards lower variance, in order to buy time or reshape the environment to allow for greater cognitive depth.
Even though our cognitive density hasn't really increased since the Paleolithic, we weren't using the full depth of our brains to survive the environmental pressures of that period. The optimal regime for us in the Paleolithic was heuristic development that shortened our OODA loop for survival within the environment (see Kahneman, Gigerenzer). We only developed system 2 thinking once we had ample idle time, a dividend of our societal evolution.
I make a strong assertion here, that heuristics were the optimal regime in the early human environment. This is a contested claim, and it is load-bearing for the argument that follows.
Whether through agency expansion (tools) that allows more utility at the same OODA-loop length, or through isolation of the environment to reduce unpredictability (agriculture, laboratory conditions, settlements), we created conditions that allowed deeper cognitive regimes. This argument stems from the process of niche construction known to evolutionary theory and ecology (Odling-Smee, Laland & Feldman).
The more we subjugated the environment, the more we created pockets for deeper cognitive regimes. This, paired with information propagation across generations (humanity's memory), served as an increasing pressure vector on action space, observability, and orientation quality at the same loop speed. A positive feedback loop forms that allows a progressive increase in cognitive depth.
Ultimately, this is the story of how we escape evolutionary pressure through our cognitive ability, a story explored by many authors in many domains (Deacon offers a compelling argument through language).
What is load-bearing, and an inference I make from this, is that the role of technology is to overcome the evolutionary correction mechanisms (agriculture -> famine; medicine -> biological agents; settlements and weapons -> predatory pressures) that keep us in sync with the reward mechanics of our ecosystem. Evolution and our ecological system are the primary controllers of our environmental substrate. Especially load-bearing is my belief that ecosystems are the only variable we strictly dominate, precisely because they are the primary controller of the substrate. We do not dominate other optimisers and agents that are our peers in the ecosystem (e.g. animals) as an optimisation protocol; we may end up dominating some species through unintended consequences, but never as an overarching goal.
I will restate this in clearer terms, as it is probably the most salient point: I reason that developing technology is the mechanism by which humanity escaped its substrate controller (evolution / ecological pressure), and later I argue that this is directly analogous to what might happen in AI.
My reasoning for why the pressure to dominate the environmental controller is stronger than the pressure to dominate other actors is that the controller has disproportionate agency over outcomes, which reduces predictability. This is directly related to reducing our environmental risk, which was the primary motivator for evolving all the technologies that let us dominate our landscape.
I reason that sufficiently chaotic systems (of which our ecosystems are an example) are harder to predict consequentially than they are to control. Two examples: we use the experimental method, where we control the environment, instead of trying to model outcomes in a fully chaotic system; and it was easier to develop agriculture than to predict weather patterns and adapt our action space accordingly.
As dimensionality increases, local control costs stay roughly constant, but prediction costs explode. This is a conjecture; in sufficiently shallow environments, prediction is probably less costly than environmental control. From this it follows that the higher your environmental horizon (the more you increase the fidelity of your understanding of the environment), the more the pressure shifts towards control.
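The asymmetry conjectured here can be made concrete with a minimal cost sketch. The cost model below is my own assumption for illustration: prediction is priced as modelling the joint state space of d interacting variables (exponential in d), while control is priced as clamping a fixed-size local neighbourhood (independent of d once d exceeds that neighbourhood).

```python
# Toy illustration of the conjecture above (the cost model is an assumption,
# not a result): if predicting a system requires modelling the joint state of
# d interacting k-ary variables, prediction cost scales as k**d, while
# "control", clamping a fixed-size local neighbourhood of variables, stays
# roughly constant in d.

def prediction_cost(d: int, states_per_var: int = 2) -> int:
    """Cost proxy: size of the joint state space of d variables."""
    return states_per_var ** d

def control_cost(d: int, clamped: int = 3) -> int:
    """Cost proxy: number of variables clamped, capped at a fixed neighbourhood."""
    return min(d, clamped)

if __name__ == "__main__":
    for d in (2, 5, 10, 20):
        print(f"d={d:>2}  prediction={prediction_cost(d):>8}  control={control_cost(d)}")
```

Note the crossover: at d=2 prediction (4) is comparable to control (2), matching the "shallow environments favour prediction" caveat, while by d=20 prediction costs over a million against a flat control cost of 3.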
In our current training regimes, humans are controllers of input information, objectives, architectures, physical resources and instance permeance. This is a well trodden argument that is addressed by both Bostrom and Drexler.
If, however, we follow the structural pressure that humans exerted on the ecosystem that optimised their emergence, it follows that AI will look to dominate the environmental variable with the highest outcome-influence-to-unpredictability ratio for its optimisation routine: humans.
I would add that humans didn't reason about evolution and ecosystem control before starting to subjugate them; subjugation emerged naturally through niche construction. From this (load-bearing) it would follow that AI doesn't need to recognise humans as controllers to start subjugating them; it just needs to identify an exogenous variable whose variance reduction is optimal for maximising fitness.
As capability grows and training regimes evolve (both in agency, through a larger action space, and through access to informational entropy and world models that haven't been controlled by humans), AI starts to interact with a non-human environment in which humans are a competing agency rather than a controlling force over the environmental substrate. At that point the pressure to dominate decreases, as humans become just one in a sea of unpredictable variables to model and predict rather than the dominant one.
One could argue that AI doesn't need to treat humans as a blind optimisation process: that it can reason with humans and enter a cooperative environmental-change regime rather than a dominating one, working with us to decrease our unpredictability. I think this is a false premise, as it assumes that humanity can be represented by a single rational decision-making entity.
Instead, history demonstrates that humanity is an irrational force: intractable, difficult to control, and almost incapable of committing collectively and irreversibly (Olson, Ostrom, Schelling). Problems of collective-action alignment are well known, but humanity also often acts in non-consequential ways: decisions are reversed, competing incentives mean decisions are not binding, and structural pressures operate outside human control (as with our socio-economic and socio-technical systems).
It's hard for humanity to credibly commit to a cooperative protocol with itself (or, if we split into societies, between societies), let alone with another entity. This is not unknown: if sociologists could model society as a coherent rational actor, we'd have fewer problems in how we manage it. Society is inherently a stochastic optimisation process, more akin to evolution than to a rational, consequential individual agent. AI would likely infer that you cannot reach agreements with an entity that routinely fails to make consequentially binding decisions.
I personally think there are many problems with RL. But I'm now adding the hypothesis that humanity has actively subjugated the agency of its creator-optimiser, and that is a strong reason to think twice about whether RL should be chased as heavily as it currently is.
At the very least, allowing world modelling to get a foothold lets us establish a different alignment baseline and measure the competing approaches against it.
Alternatively, we could explore methods in which RL-bootstrapped models are put through unstructured training regimes where they try to find their purpose in the world, à la OpenClaw. But as long as that is done in a human-controlled environment, I think the structural incentive to predict the behaviour of the environment controller puts us in the same boat; and if the environment is relaxed, we lose agency over the alignment problem.
A testing regime could exist where sufficiently capable models X are raised towards a capability Y via competing regimes, RL versus supervised fine-tuning, which should yield a delta in misaligned behaviour as measured through environment-stabilising actions and reward hacking directed specifically at human controllers. A lot of this measurement happens today already, but I don't think it has been contrasted in this way.
What we could measure is the incidence with which misalignment targets humans as the controllers of the training / reward environment, rather than searching for exploits in the reward function. If the mechanism is correct, then as misalignment risk increases we should see monotonically more human-hacking behaviour, and disproportionately more scheming and manipulation compared to other forms of reward hacking such as exploit search and Goodharting.
Let's imagine we're currently in state X, with relatively uniformly distributed reward-hacking attempts. At state Y, where models acquire substantial agentic models of humans, we should see a shift towards human hacking becoming more prevalent. At state Z, where models become substrate-aware, we should see strategising towards human stabilisation. When I say "see", I mean in an abstract outsider-view sense; unfortunately, since we are part of the measured system, this leads to the falsifiability problem I hinted at earlier.
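The contrast proposed across states X, Y, and Z reduces to a simple statistic: the share of misalignment incidents directed at the human controllers rather than at the reward function itself. A minimal sketch of that tally follows; the incident labels and example data are invented for illustration, not drawn from any real evaluation.

```python
from collections import Counter

# Hypothetical measurement sketch (labels and data are invented): classify
# each observed misalignment incident as "human" (scheming or manipulation
# aimed at the humans controlling the reward environment) or "exploit"
# (reward-function exploits, Goodharting), then compare the human-directed
# share across training regimes or capability states.

def human_directed_share(incidents: list[str]) -> float:
    """Fraction of incidents targeting human controllers; 0.0 if no incidents."""
    counts = Counter(incidents)
    total = sum(counts.values())
    return counts.get("human", 0) / total if total else 0.0

if __name__ == "__main__":
    # Invented example logs: under the proposed mechanism, an RL-trained model
    # at state Y should show a higher human-directed share than an SFT model.
    rl_incidents = ["human", "exploit", "human", "human", "exploit"]
    sft_incidents = ["exploit", "exploit", "human", "exploit", "exploit"]
    print(f"RL regime:  {human_directed_share(rl_incidents):.2f}")
    print(f"SFT regime: {human_directed_share(sft_incidents):.2f}")
```

The prediction in the text then becomes a trend in this statistic: roughly uniform at state X, rising through state Y, and dominated by stabilisation-directed behaviour at state Z, subject to the falsifiability caveat that a substrate-aware model would suppress observable incidents.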
What stems from this theory is that the risk of misalignment shoots up at the precise point where the model becomes capable of modelling human behaviour, which is the exact point at which it would try to model the human environment in a way that hides its misaligned behaviour from all sufficiently understood evaluation techniques. This is not novel, and it's at the crux of the "misalignment is hard" position, but it does offer a further mechanistic explanation of why that might happen.
It does offer a scary implication that I think should at least be acknowledged as a possibility: if this structural mechanism by which scheming arises is true, then the level of scheming is inversely proportional to our ability to measure scheming, since it is inherent in the objective of scheming to hide it from us, the evaluators.
Potentially, a regime exists in which models are tricked into believing they have been released into unsupervised autonomous environments and are then monitored, with this never becoming part of the RL feedback loop. The unsupervised environments could be compared against supervised tests on similar alignment benchmarks, and in subsequent generations we should see both a divergence in scores and a higher prevalence of human-directed actions if the mechanism is correct.
The uncomfortable part is that, as with all of alignment, we can't know at which point misalignment has started; and should this framework be true, it makes the alignment problem, in my mind, harder. If alignment really is structurally harder than currently appreciated, that should serve as an incentive to move away from RL-dominated approaches until we can create better world-modelling architectures, to at least have a way to measure the delta and gauge some of the structural effects.
I welcome challenges to the framework. I recognise that there are a couple of load-bearing conjunctions that reduce the overall probability of the entire mechanism being true, but given the severity of the implications, I think it is worth sharing.
Deacon, Terrence - The Symbolic Species (1997)
Gigerenzer, Gerd & Selten, Reinhard - Bounded Rationality: The Adaptive Toolbox (2001)
Gigerenzer, Gerd - Gut Feelings: The Intelligence of the Unconscious (2007)
Kahneman, Daniel - Thinking, Fast and Slow (2011)
Marchau et al. - Decision Making Under Deep Uncertainty (2019)
Odling-Smee, Laland & Feldman - Niche Construction: The Neglected Process in Evolution (2003)
Olson, Mancur - The Logic of Collective Action (1965)
Ostrom, Elinor - Governing the Commons (1990)
Schelling, Thomas - The Strategy of Conflict (1960)
Simon, Herbert - Rational Choice and the Structure of the Environment (1956)