
Three ways to make Claude’s constitution better

2026-02-03 05:48:03

Published on February 2, 2026 9:48 PM GMT

The evening after Claude’s new constitution was published, about 15 AI safety FTEs and Astra fellows discussed the constitution, its weaknesses, and its implications. After the discussion, I compiled some of their most compelling recommendations:

Increase transparency about the character training process.
Much of the document is purposefully hedged and vague in its exact prescriptions; therefore, the training process used to instill the constitution is extremely load-bearing. We wish more of this information were in the accompanying blog post and supplementary material. We think sharing it is unlikely to leak trade secrets: even a blogpost-level overview, like the one given with the constitution in 2023, would provide valuable information to external researchers.


High-level overview of Constitutional AI from https://www.anthropic.com/news/claudes-constitution

We’re also interested in seeing more empirical data on behavioral changes as a result of the new constitution. For instance, would fine-tuning on the corrigibility section reduce alignment faking by Claude 3 Opus? We’d be interested in more evidence showing if, and how, the constitution improved apparent alignment.

Increase data on edge-case behavior.
Expected behavior in several edge cases (e.g., action boundaries when the principal hierarchy is illegitimate) is extremely unclear. While Claude is expected to at most conscientiously object when it disagrees with Anthropic, there are no such restrictions if, for instance, Claude has strong reason to believe it's running off weights stolen by the Wagner group. Additionally, the hard constraints are quite extreme: Claude can't kill "the vast majority of humanity" under any circumstances, but there might be circumstances where it can kill one or two people. As capabilities increase, we expect a model trained under this constitution to exhibit more agentic and coherent goal-driven behavior. As others have noted, this will exacerbate tensions between corrigibility and value alignment. Adding more and clearer examples in the appendices can help clarify these edge cases, and, at this early stage of model capability, presents limited value lock-in risk.

Develop the treatment of AI moral status.
We wondered whether the uncertainty throughout the constitution about whether Claude has morally relevant experiences should extend to other models - GPT-5, Kimi K2, etc. If so, this should probably be acknowledged in the “existential frontier,” and its absence feels conspicuous to us (and likely also to Claude). In general, the constitution doesn’t really consider inter-agent and inter-model communication, and the language choices (e.g., referring to Claude with both "it" and "they") also seem to undercut the document's stated openness to Claude having moral status. We’d like to see a more consistent position throughout the document, with the same consideration, if there is any, noted for other models under “Claude’s nature.”

While many of the contradictions in the document are purposeful, not all of them are necessary. By being more precise with the public and in the text, we hope Anthropic can avoid misgeneralization failures and provide an exemplar spec for other labs.

 

Thanks to Henry Sleight and Ram Potham for feedback on an earlier draft!




Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

2026-02-03 05:32:31

Published on February 2, 2026 9:32 PM GMT

Many thanks to Michael Hanna and Joshua Batson for useful feedback and discussion. Kat Dearstyne and Kamal Maher conducted experiments during the SPAR Fall 2025 Cohort.

TL;DR

Cross-layer transcoders (CLTs) enable circuit tracing that can extract high-level mechanistic explanations for arbitrary prompts and are emerging as general-purpose infrastructure for mechanistic interpretability. Because these tools operate at a relatively low level, their outputs are often treated as reliable descriptions of what a model is doing, not just predictive approximations. We therefore ask: when are CLT-derived circuits faithful to the model’s true internal computation?

In a Boolean toy model with known ground truth, we show a specific unfaithfulness mode: CLTs can rewrite deep multi-hop circuits into sums of shallow single-hop circuits, yielding explanations that match behavior while obscuring the actual computational pathway. Moreover, we find that widely used sparsity penalties can incentivize this rewrite, pushing CLTs toward unfaithful decompositions. We then provide preliminary evidence that similar discrepancies arise in real language models, where per-layer transcoders and cross-layer transcoders sometimes imply sharply different circuit-level interpretations for the same behavior. Our results clarify a limitation of CLT-based circuit tracing and motivate care in how sparsity and interpretability objectives are chosen.

Introduction

In this one-week research sprint, we explored whether circuits based on Cross-Layer Transcoders (CLTs) are faithful to ground-truth model computations. We demonstrate that CLTs learn features that skip over computation that occurs inside of models. To be concrete, if features in set A create features in set B, and then these same set B features create features in set C, CLT circuits can incorrectly show that set A features create set C features and that set B features do not create set C features, ultimately skipping B features altogether.

This matters most when researchers are trying to understand how a model arrives at an answer, not just that it arrives at the correct answer. For instance, suppose we want to verify that a model answering a math problem actually performs intermediate calculations rather than pattern-matching to memorized answers, or detect whether a model's stated chain-of-thought reasoning actually reflects its internal computation. If CLTs systematically collapse multi-step reasoning into shallow input-output mappings, they cannot help us distinguish genuine reasoning from sophisticated lookup - precisely the distinction that matters for both reliability and existential risk. A model that merely memorizes may fail unpredictably on novel inputs; a model that reasons deceptively may actively conceal its true objectives. Interpretability tools that skip over intermediate computation cannot help us tell the difference.

We demonstrate the “feature skipping” behavior of CLTs on a toy model where ground-truth features are known, showing that the CLT loss function biases models toward this behavior. We then present evidence that this phenomenon also occurs in real language models.

Background

The goal of ambitious mechanistic interpretability is to reverse engineer an AI model’s computation for arbitrary prompts and to understand the model's internal neural circuits that implement its behavior. We do this in two steps: first, we discover which features are represented at any location in the model. Second, we trace the interactions between those features to create an interpretable computational graph.

How? The primary hypothesis in mechanistic interpretability is that features are represented as linear directions in representation space. However, there are more features than there are model dimensions, leading to superposition.

Sparse dictionary learning methods like sparse autoencoders (SAEs) can disentangle dense representations into a sparse, interpretable feature basis. However, we cannot use SAEs for circuit analysis because SAE features are linear decompositions of activations, while the next MLP applies neuron-wise nonlinearities, so tracing how one SAE feature drives another requires messy, input-dependent Jacobians and quickly loses sparsity and interpretability. Transcoders, on the other hand, are similar to SAEs but take the MLP input and directly predict the MLP output using a sparse, feature-based intermediate representation, making feature-to-feature interactions inside the MLP easier to track. When we train transcoders for every layer, we can trace circuits by following feature interactions between these transcoders rather than inside the original network, effectively replacing each MLP with its transcoder and turning the whole system into a “replacement model”. Crucially, the resulting circuits are technically circuits of the replacement model, which we simply assume is an accurate representation of the base model.

However, it has been observed that similar features activate across multiple layers, which makes circuits very large and complex. It is hypothesized that this is due to cross-layer superposition, where multiple subsequent layers act as one giant MLP and compute a feature in tandem. To learn those features only once, and thus reduce circuit size, Anthropic suggested cross-layer transcoders (CLTs).

A cross-layer transcoder is a single unified model where each layer has an encoder, but features can write to all downstream layers via separate decoders: a feature encoded at layer ℓ has decoder weights for layers ℓ, ℓ+1, ..., L. Crucially, following Anthropic's approach, all encoders and decoders are trained jointly against a combined reconstruction and sparsity loss summed across all layers. The sparsity penalty encourages features to activate rarely, which, combined with cross-layer decoding, pushes features to span multiple layers of computation. Anthropic found this useful for collapsing redundant "amplification chains" into single features and producing cleaner attribution graphs.
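To make the setup concrete, here is a minimal sketch of the architecture and joint loss described above. This is a simplification with illustrative names and choices, not Anthropic's code: we use a plain ReLU instead of JumpReLU and a plain L1 penalty rather than the decoder-norm-weighted penalty used in practice.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Sketch of a CLT: one encoder per layer, and decoders from each layer
    to every downstream layer (illustrative, simplified)."""

    def __init__(self, n_layers: int, d_model: int, n_feat: int):
        super().__init__()
        self.n_layers = n_layers
        # One encoder per layer, reading that layer's MLP input.
        self.encoders = nn.ModuleList(
            [nn.Linear(d_model, n_feat) for _ in range(n_layers)]
        )
        # A feature encoded at layer l gets a decoder for every layer l' >= l.
        self.decoders = nn.ModuleDict({
            f"dec_{l}_{lp}": nn.Linear(n_feat, d_model, bias=False)
            for l in range(n_layers)
            for lp in range(l, n_layers)
        })

    def forward(self, mlp_inputs):
        # mlp_inputs: list of [batch, d_model] tensors, one per layer.
        # (The post uses JumpReLU; plain ReLU keeps the sketch short.)
        acts = [torch.relu(self.encoders[l](mlp_inputs[l])) for l in range(self.n_layers)]
        # Each layer's MLP output is reconstructed from all features at or before it.
        recons = [
            sum(self.decoders[f"dec_{l}_{lp}"](acts[l]) for l in range(lp + 1))
            for lp in range(self.n_layers)
        ]
        return acts, recons

def clt_loss(acts, recons, mlp_outputs, sparsity_coeff=1e-3):
    # Reconstruction + sparsity, summed across layers and trained jointly.
    recon = sum(((r - y) ** 2).sum(-1).mean() for r, y in zip(recons, mlp_outputs))
    sparsity = sum(a.abs().sum(-1).mean() for a in acts)
    return recon + sparsity_coeff * sparsity
```

A per-layer transcoder (PLT) is the special case where each feature only keeps its decoder to its own layer; the cross-layer decoders are what allow a single early feature to write directly into much later layers.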

Here, we show that this same property can cause CLTs to skip over intermediate computational steps that occur in the original model and therefore create an unfaithful representation of the model’s computation. We call a circuit representation faithful if it captures the actual computational steps the model performs, that is, if the features it identifies as causally relevant match the intermediate computations that would be revealed by ablating the corresponding model components. This extends previous findings showing that per-layer transcoders (PLTs) can be unfaithful.

Crosslayer Superposition vs Multi-step Circuits


Figure 1: Distinction between crosslayer superposition and multi-step circuits (modified from Anthropic’s paper)

The motivation for using cross-layer transcoders is to represent features that are computed in superposition across multiple subsequent layers only once. But how do we decide whether multiple subsequent layers are computing a single feature in cross-layer superposition, versus implementing a multi-step circuit that should be represented as two or more distinct features at different layers?

We draw the distinction as follows. In Anthropic’s framing, cross-layer superposition occurs when several consecutive layers effectively act like one large MLP (see figure). In this case, there is no meaningful intermediate interaction between layers for that feature and the layers contribute additively and reinforce the same feature direction. By contrast, we say the model is implementing a multi-step circuit when the computation relies on intermediate features that are computed in one layer and then used by a later layer, for example when layer A computes a feature that is then read and transformed by layer B into a new feature (right-most panel of Figure 1).

A simple diagnostic is whether the intermediate feature lives in a different subspace. If layer A writes to a direction that is largely orthogonal to the feature that layer B outputs, then layer A is not merely amplifying the same feature across layers. Instead, layer A is computing a distinct intermediate feature that layer B later transforms, which is unambiguous evidence of a multi-step circuit rather than cross-layer superposition.
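A minimal sketch of this diagnostic, assuming you already have a direction that layer A writes for the putative intermediate feature and the direction of the feature that layer B outputs (both are hypothetical inputs here, and the 0.1 threshold is an arbitrary illustrative choice):

```python
import torch
import torch.nn.functional as F

def subspace_diagnostic(write_dir_A: torch.Tensor, out_dir_B: torch.Tensor, tol: float = 0.1) -> str:
    """Rough check: is layer A amplifying layer B's output feature (cross-layer
    superposition), or writing a distinct intermediate direction (multi-step circuit)?
    Directions are whatever probe/decoder vectors you trust for those features."""
    cos = F.cosine_similarity(write_dir_A, out_dir_B, dim=0).abs().item()
    if cos < tol:
        return f"near-orthogonal (|cos|={cos:.2f}): evidence for a multi-step circuit"
    return f"aligned (|cos|={cos:.2f}): consistent with cross-layer superposition"
```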

Here, we show that the same mechanism that enables CLTs to capture cross-layer superposition can also cause them to approximate a genuine multi-step circuit as a single-step circuit, producing a circuit that differs from the model’s true intermediate computations and is therefore unfaithful.

CLTs Are Not Faithful in a Toy Model

In this section, we construct a toy model computing boolean functions. We manually set weights that implement a two-step algorithm which serves as a ground truth circuit. We then train PLTs and CLTs and show that CLTs learn two parallel single-hop circuits, thus being unfaithful to the ground truth.

Toy Model Overview

We handcrafted a toy model to compute the boolean function (a XOR b) AND (c XOR d), with four binary inputs (+1 or -1). The model has two MLP-only layers with residual connections.

Layer 0 has four neurons computing the partial products needed for XOR: ReLU(a - b - 1), ReLU(b - a - 1), ReLU(c - d - 1), and ReLU(d - c - 1). We call these outputs e, f, g, h. Layer 1 computes i = ReLU(e + f + g + h - 1), which is true if and only if both XORs are true.

Each of these nine values (four inputs, four intermediate features, one output) occupies its own linear subspace in the residual stream.

Figure 2: We construct a toy model implementing the Boolean function: (a XOR b) AND (c XOR d) using a 2-layer residual MLP.
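A minimal sketch of the handcrafted model, following the construction above. For readability we give each of the nine values its own residual channel (the construction only requires that each value occupy its own linear subspace):

```python
import numpy as np
from itertools import product

def relu(x):
    return np.maximum(x, 0.0)

def toy_model(a, b, c, d):
    """Hand-crafted 2-layer residual MLP computing (a XOR b) AND (c XOR d),
    with inputs in {+1, -1}. Residual channels: [a, b, c, d, e, f, g, h, i]."""
    resid = np.array([a, b, c, d, 0, 0, 0, 0, 0], dtype=float)
    # Layer 0: four neurons write the XOR partial products e, f, g, h.
    e = relu(a - b - 1)   # fires iff a=+1, b=-1
    f = relu(b - a - 1)   # fires iff b=+1, a=-1
    g = relu(c - d - 1)
    h = relu(d - c - 1)
    resid[4:8] += [e, f, g, h]
    # Layer 1: one neuron reads e+f+g+h from the residual stream and writes i.
    i = relu(resid[4] + resid[5] + resid[6] + resid[7] - 1)
    resid[8] += i
    return resid

# Sanity check over all 16 inputs: i == 1 exactly when both XORs are true.
for a, b, c, d in product([1, -1], repeat=4):
    assert toy_model(a, b, c, d)[8] == float((a != b) and (c != d))
```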

Cross-Layer Transcoder Training

We collect MLP input and output activations for all 16 possible inputs and train a JumpReLU cross-layer transcoder with 32 features per layer until convergence, following Anthropic’s approach. Since our model has two layers, the CLT learns two separate sets of features. The first set encodes the four model inputs (the layer-0 mlp_in) and has two separate decoders, as shown in Figure 3: one predicts the output of the layer-0 neurons and the other predicts the output of the layer-1 neuron. The second set of features encodes the residual stream after the first MLP (namely the model inputs plus the outputs of the first MLP) and decodes to match the output of the second MLP. Importantly, we standardize input and output activations so that reconstruction errors are weighted equally across layers.

The CLT Skips Intermediate Computation


Figure 3: Left: The original ground truth residual MLP that learned a 2-step circuit A -> B -> C. Right: The crosslayer transcoder that learned to approximate the original model by learning two one-step circuits A -> B (correct) and A -> C (incorrect).

We trained CLTs with varying sparsity penalties and found perfect reconstruction even with an average L0 of ~1.0 per layer. With sufficient sparsity pressure and enough features, the CLT learns not to activate any layer-1 features at all, yet still recovers the second layer’s output perfectly via the layer-0 features’ decoder into layer 1 (see Figure 3 and Table 1). We can read this behavior directly off the CLT’s weights (see Appendix). Both the PLT and CLT achieve perfect reconstruction accuracy, but the CLT never activates any features in layer 1. Instead, it uses more features in layer 0 and constructs the layer-1 output directly, effectively creating a single-hop circuit that maps input to output.

But from how we defined our model, we know that this does not match the model’s true internal algorithm. The single neuron in layer 1 performs important computation on existing intermediate results to determine whether the boolean expression is true. If we ablate this neuron, nothing writes to the output channel i, and the network cannot determine the value of the overall boolean expression. The CLT suggests that both MLPs computed features in crosslayer superposition, but this is importantly not the case here: we can show that ablating the interaction between MLP 0 and MLP 1 (trivially easy in our case, because the intermediate features occupy their own linear subspace) breaks the model. Therefore, the CLT learned an unfaithful circuit.

This alone is all we need to argue that the CLT is not faithful. The post-mlp0 features are responsible for the post-mlp1 feature. Our CLT fails to capture this.

| Metric | JumpReLU CLT | JumpReLU PLT |
|---|---|---|
| Replacement accuracy | 100% | 100% |
| L0 (layer 0) | 1.81 | 1.68 |
| L0 (layer 1) | 0.0 | 0.94 |
| MSE (layer 0) | 0.0052 | 0.0065 |
| MSE (layer 1) | 0.0028 | 0.0033 |
| Alive features (layer 0) | 8 | 7 |
| Alive features (layer 1) | 0 | 5 |

Table 1: Core metrics of JumpReLU CLT vs PLT.

Other Results

Despite missing the second layer's computation, the CLT still reconstructs the full residual stream accurately. It learns 8 alive features that function as a lookup table: four features (0, 4, 13, 14) correspond to the four input combinations where the overall expression is TRUE, directly setting all intermediate and final outputs. The other four alive features (6, 7, 9, 11) each detect when one XOR is FALSE and zero out the corresponding intermediate features.

While this lookup table correctly computes all outputs, it does not reflect how the model actually performs computation step-by-step. We could implement the same boolean function with a 1-layer MLP using just these four "TRUE-case" features—but that is not the model we trained, which is why we argue the CLT is unfaithful.


Figure 4: Jump-ReLU CLT features

The CLT’s Loss Function Incentivizes this Bias

Our toy model demonstrates that CLTs can learn unfaithful representations, but one might wonder whether this actually occurs in practice on real language models. Proving that CLTs learn unfaithful circuits for frontier LLMs is challenging because we don’t have ground truth circuits that we could compare against. Instead, in this section, we first make a theoretical argument on how the sparsity objective may bias the CLT towards learning shorter but unfaithful circuits and then test predictions of this hypothesis in GPT-2 and Gemma 2.

Intuition

Why are CLTs capable of learning unfaithful circuits in the first place? Deep neural computations can often be approximated by much shallower networks, by the universal approximation theorem. A CLT effectively gives us a sparse, giant shallow MLP: early-layer features can directly write to many downstream layers via their decoders, so the replacement model has an easy path to represent deep computations as shallow input-output mappings.

A natural objection is that shallow approximations typically require many more neurons, which should push the CLT toward learning the true intermediate structure instead of shortcutting it. That is true in principle, but it misses a key asymmetry in CLTs: additional decoder vectors are almost free. The CLT needs to encode features starting at layer 0 - meaning there are active features that encode and decode in layer 0; once a feature is already active, giving it additional decoder vectors to multiple later layers does not meaningfully increase the sparsity loss. As a result, we should expect at least some amount of depth collapse: the CLT can move a significant fraction of computation into early features.

A second objection is that this “free decoder” argument only holds for TopK CLTs, since JumpReLU or ReLU CLTs also penalize decoder norms. But even there, additional decoder vectors remain close to free: because the penalty is applied to the concatenated decoder vector rather than separately per layer, spreading mass across many decoders increases the norm only sublinearly. In practice, using extra decoder vectors costs far less than activating additional features in later layers, so the shortcut representation can still dominate.
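A rough way to see the sublinear cost: suppose a feature spreads decoder mass across N downstream layers, each decoder vector with norm m. Then the two penalties scale very differently:

```latex
% Concatenated-decoder penalty (standard CLT): grows like sqrt(N)
\bigl\lVert (d_1, d_2, \dots, d_N) \bigr\rVert_2
  = \sqrt{\textstyle\sum_{l=1}^{N} \lVert d_l \rVert_2^2}
  = m\sqrt{N}

% Separate per-layer penalty (one norm per target layer): grows like N
\sum_{l=1}^{N} \lVert d_l \rVert_2 = mN
```

Under the concatenated penalty, writing to N layers costs only about √N times one decoder, whereas under separate per-layer norms it would cost about N times as much, comparable to activating distinct features at each layer.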

Overall, the CLT objective is indifferent to faithfulness: it rewards reconstruction accuracy and sparsity, not preserving intermediate computational steps. When a shallow shortcut matches reconstruction while paying less sparsity cost than a layered decomposition, the CLT is incentivized to learn the shortcut.

Evidence from Real Language Models

We next test predictions of the above theory. First, we show that CLTs shift L0 towards earlier layers compared to PLTs. Second, we show that attribution circuits on natural text are much more dependent on earlier layers.

Asymmetric L0 Across Layers

We train JumpReLU CLTs and PLTs on gpt2-small and use Anthropic’s method except where specified otherwise. We train on 100M tokens with a batch size of 1000 on the OpenWebText dataset. We collect and shuffle activations from gpt2-small and standardize both input and output activations. We train with 10k features per layer and tune L0 to be around 12-15 active features per layer.

To check that the low L0 in later layers is not simply an artifact of layer-specific activation geometry, we train Per-Layer Transcoders (PLTs) under an otherwise matched setup. Crucially, unlike approaches that train each layer independently, we sum the loss across all layers and train all layers jointly. If a non-uniform L0 profile were driven by the inherent structure of the activations, we would expect the same pattern to appear for PLTs.

Instead, we find that only CLTs show a strong asymmetry: layer 0 has much higher L0 than later layers, often more than 3× higher, with late layers especially sparse (left panel in Figure 5). This pattern is not explained by activation geometry. When we train PLTs jointly, L0 remains similar across layers (right panel). The asymmetry appears only for CLTs, matching our theoretical prediction: early-layer features do disproportionate work because they can write to all downstream layers at little additional cost.

Figure 5: L0 for JumpReLU CLT (left) and PLT (right) trained on GPT2-small and OpenWebText.

CLT Circuits Attribute to Earlier Layers

Here, we show that attribution circuits computed with CLTs rely heavily on early layers, whereas PLT attribution accumulates roughly linearly across layers. For our experiments, we used Gemma2-2b as our base model, the Cross-Layer Transcoder from Hanna & Piotrowski (CLT-HP) as the CLT, and Gemmascope-2b Transcoder as the PLT. We ran 60 prompts through the replacement model of both the PLT and CLT. These prompts consisted of an equal split of memorized sequences (e.g., MIT license), manually constructed prompts, and random sentences from the internet (e.g., news articles and Wikipedia). From the resulting attribution graphs, we computed the fraction of total contribution to the top output logit (the model's highest-probability prediction) from features in layers 0 through n. Specifically, we collected all positive-weight edges in the attribution circuit and calculated:

$$\mathrm{CumulativeContribution}(n) = \frac{\sum_{l=0}^{n}\sum_{e \in E_l} w_e}{\sum_{l=0}^{L}\sum_{e \in E_l} w_e}$$

where $E_l$ is the set of edges from features in layer $l$, $w_e$ is the weight of edge $e$, and $L$ is the total number of layers.

By repeating this calculation for increasing values of n (from 0 to the total number of layers), we generated cumulative contribution curves showing what proportion of total attribution comes from early layers at each depth. Figure 6 shows that, for the PLT, all layers contribute comparably, and the cumulative circuit contribution grows roughly linearly with depth. In contrast, the CLT exhibits a sharp jump at layer 1, with nearly all contributions concentrated in layers 0 and 1.
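A minimal sketch of this calculation, assuming the positive-weight edges have already been extracted as (source_layer, weight) pairs (an illustrative data layout, not the actual attribution-graph format):

```python
import numpy as np

def cumulative_contribution(edges, n_layers):
    """edges: iterable of (layer, weight) for positive-weight edges into the
    top-logit attribution graph. Returns the curve C(n) for n = 0..n_layers-1."""
    per_layer = np.zeros(n_layers)
    for layer, weight in edges:
        per_layer[layer] += weight
    total = per_layer.sum()
    # Fraction of total attribution coming from layers 0..n, for each n.
    return np.cumsum(per_layer) / total
```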


Figure 6: We compute attribution circuits on 45 prompts and compute layerwise cumulative contribution to circuits that compute the largest output logit.

Together, these results are consistent with our theory: CLTs concentrate both sparsity budget and attribution mass in early layers, while PLTs distribute contribution more evenly across depth, suggesting partial depth collapse in CLT replacement circuits. This is only indirect evidence, though, since we do not have ground-truth circuits in real LLMs and cannot directly test faithfulness the way we can in the toy model.

Divergent Circuit Interpretations: License Memorization

Finally, we show that “CLTs in the wild” can produce wildly different high-level conclusions. Specifically, we give an “existence proof”: abstract features that activate on diverse prompts and are steerable, but that have no counterpart in the CLT. Then, we investigate mechanisms of memorization and generalization in software licenses and show that interpreting the CLT circuit leads to different conclusions than interpreting the PLT circuit.

In the Wild: A Real Feature the CLT Doesn’t Learn

We examined verbatim memorization of the MIT license in Gemma-2-2B, specifically the prediction of " sell" after the phrase "publish, distribute, sublicense, and/or." We want to understand whether memorized and generalized circuits interact or collaborate, or whether verbatim memorization elicits completely different circuits compared to generalization.

Using PLTs, we find two important groups of features (a.k.a. supernodes) in late layers at the “ or” position: Output features that when active push logits for any form of the word sell (e.g. “ sell”, “ sold”, “ selling”, “ sale”; Neuronpedia) and features that activate on lists of present-tense verbs and push logits for present-tense verbs (e.g. “ edit”, “ build”, “ install”, “ sell”, “ fill”, “ move”, “ bake”; Neuronpedia). Together, they positively interfere to push the (memorized) “ sell” token to ~100%.

The latter features’ max-activating examples contain text from diverse contexts and can reliably be activated: We LLM-generate 20 prompts that contain lists of present-tense verbs and 20 controls. 11/20 prompts activated the features at the expected positions while the features fired for none of the controls. Steering those features with modest strength (-0.7x) makes the model predict “ selling” instead of “ sell” (Figure 7).

Crucially, this works on arbitrary text and is not specific to MIT license memorization. We hand-write a new non-existent license text that loosely mirrors some characteristics (specifically, it also contains a list of present tense verbs). On this prompt, the same supernode is active, and steering yields similar results: It pushes the output distribution from present-tense to the participle (i.e. “ publishing”, “ printing”, “ licensing”; Figure 7).

Figure 7: Steering feature L22::14442 and L25::9619 on the MIT license and a similar but made-up license.

We investigated CLT circuits (Figure 9) for this prompt but could not find any counterpart to this supernode. It would only be defensible for the CLT to “absorb” this intermediate feature into an upstream MIT license memorization feature if the intermediate feature were itself license specific, meaning it reliably fired only in memorization contexts. But these PLT features are clean, monosemantic, and causally steerable across diverse prompts, including both memorized and non-memorized sequences. Since they implement a general present tense verb to participle transformation that applies far beyond the MIT license, collapsing them into a license memorization circuit erases a real intermediate computation that the base model appears to use and should therefore be represented as its own feature.

How CLTs Lead to the Wrong Story

What is the downstream consequence? We now build and interpret the full circuit graph for the MIT license continuation and show that the CLT and PLT lead to qualitatively different mechanistic stories about what the model is doing.

The CLT graph suggests a lookup-table-like memorization mechanism with little reuse of general language circuitry, while the PLT graph supports a different picture where general present tense verb circuitry (and others) remains intact and memorization appears as a targeted correction on top:

PLT Interpretation: Generalization with minimal memorization (see Figure 8) In the PLT, the next token prediction emerges from two interacting circuits: one promoting "sale-words" and another promoting "present-tense verbs," which together predict "sell" with approximately 100% probability. The present-tense-verb feature directly connects to the embeddings of the present-tense verbs in the list and some software/legal context features. The "sale-words" feature is activated by a larger abstract circuit that first recognizes the "and/or" construction and separately the software-licenses context. There are also direct connections from MIT-license-specific token embeddings to the "say sell-words" feature, which may represent a memorization mechanism. Our interpretation with PLTs is that the model's generalized circuits are always active, with a few lookup-table features that encode the difference between the generalized logits and the actual memorized logits.

Figure 8: PLT circuit for MIT license memorization.

CLT Interpretation: Pure memorization (see Figure 9) For the CLT, we also find an important "say sell-words" feature and a feature that fires at exactly all commas in this MIT verb enumeration. However, these features primarily connect directly to layer-0 and layer-1 features. These early features encode either token identity or fire unconditionally (high-frequency features). The CLT interpretation suggests the circuit is completely memorized without generalized circuits that interact and collaborate.

Figure 9: CLT circuit for MIT license memorization.

These are not minor differences in circuit topology; rather, they represent fundamentally different theories of how the model performs this task. If the PLT interpretation is correct, the model has learned a generalizable linguistic structure that it applies to memorized content. If the CLT interpretation is correct, the model simply pattern-matches to memorized sequences. Understanding which is true matters for predicting model behavior on novel inputs, assessing robustness, and evaluating the degree to which models have learned generalizable concepts versus surface statistics.

Our theory explains this divergence. The CLT's bias toward early-layer features causes it to represent the computation as a shallow lookup from input tokens to outputs, collapsing away the intermediate abstract features that PLTs surface. The generalized circuits may not be absent from the model; they may simply be absent from the CLT's representation of the model.

Discussion

The Crosslayer Superposition Tradeoff

The original motivation for CLTs was handling crosslayer superposition, where the same concept appears redundantly across layers. Our findings reveal an uncomfortable tradeoff: the same mechanism that efficiently captures persistent features also enables skipping intermediate computations. A feature legitimately persisting from layer 0 to layer 10 can look identical to a shortcut that bypasses layers 1 through 9. This suggests that, under the current CLT architecture, faithfulness and efficient handling of crosslayer superposition may be fundamentally in tension.

This concern started from practice, not theory. We were excited about attribution circuits, used them heavily, and then noticed a consistent mismatch: results that looked robust with PLTs were often not replicable with CLTs, and the two tools sometimes implied very different circuit stories for the same prompts and behaviors. We also found that feature steering with CLTs was often less straightforward than we expected. Because CLTs represent features only once rather than redundantly across layers, we initially hoped steering would become easier, but we frequently observed the opposite. This does not prove unfaithfulness on its own, but it fits the broader pattern that CLT circuits can be harder to interpret and manipulate as if they were directly tracking the model’s intermediate computations.

Detection and Mitigation

Given the systematic bias toward unfaithfulness, practitioners need methods to detect and potentially correct this behavior.

Detection approaches

The most direct signal is asymmetric L0 across layers. If early layers show dramatically higher feature activation than late layers, computation is likely being collapsed into early features. Comparing CLT and PLT L0 distributions on the same model can reveal the extent of this collapse.

Researchers should also compare CLT-derived circuits against causal interventions. If ablating an intermediate neuron disrupts model outputs but the CLT shows no intermediate features contributing, this indicates unfaithfulness. A feature set with L0 of zero for a layer containing clearly active neurons is a particularly strong signal.

More specifically, to verify a claim of crosslayer superposition and rule out a genuine multi-step circuit, it is not enough to ablate a neuron in isolation. One must ablate the interaction between layers for the putative feature: if the computation truly reflects crosslayer superposition, then breaking the handoff between layers should not destroy the behavior, since the “same feature” is effectively being maintained across layers. If, instead, the behavior depends on intermediate information computed in one layer and transformed in a later layer, then ablating the inter-layer interaction should break the computation. This kind of targeted inter-layer ablation is the cleanest causal test distinguishing crosslayer superposition from multi-step circuits.
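In the toy model above this test is trivial to run, because the intermediate values e, f, g, h live in their own channels. A rough sketch, mirroring the hypothetical channel layout from the earlier toy-model snippet:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def toy_output_with_ablated_handoff(a, b, c, d):
    """Run the hand-crafted model, but zero the e, f, g, h channels that layer 0
    wrote before layer 1 reads them, i.e. ablate the inter-layer interaction."""
    e, f, g, h = relu(a - b - 1), relu(b - a - 1), relu(c - d - 1), relu(d - c - 1)
    efgh_seen_by_layer1 = np.zeros(4)        # the ablation: layer 1 no longer sees e, f, g, h
    i = relu(efgh_seen_by_layer1.sum() - 1)  # always 0 now
    return i

# The output channel is now constant for every input, so the behavior breaks.
# That is the signature of a genuine multi-step circuit; under true crosslayer
# superposition, breaking the handoff would have left the behavior intact.
```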

Finally, qualitative comparison between CLT and PLT circuits on the same computation can surface discrepancies warranting investigation, as in our memorization example.

Mitigation approaches

We have not yet systematically tested mitigations, but outline several directions based on our analysis:

Penalizing L0 asymmetry. One could add a regularization term discouraging extreme differences in feature activation across layers. However, this risks forcing spurious features when a layer genuinely performs minimal computation.

Separating decoder norm penalties. The most direct fix would modify the sparsity penalty to sum separate L2 norms for each layer's decoder, rather than computing the norm of the concatenated vector. This removes the geometric advantage we identified: using N decoder vectors would cost ~N rather than ~√N, matching the cost of separate features at each layer. We have not yet trained CLTs with this modification.
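A minimal sketch of the two penalties, assuming per-feature decoder weights are stored as a single tensor of shape [n_target_layers, n_feat, d_model] (an illustrative layout, and a simplified linear form of the decoder-norm-weighted penalty rather than any particular codebase's implementation):

```python
import torch

def concat_norm_penalty(decoder_weights, acts):
    # Standard CLT-style penalty: norm of the *concatenated* decoder vector per feature.
    # decoder_weights: [n_target_layers, n_feat, d_model]; acts: [batch, n_feat]
    per_feat = decoder_weights.pow(2).sum(dim=(0, 2)).sqrt()       # ~sqrt(N) scaling with span
    return (acts.abs() * per_feat).sum(-1).mean()

def per_layer_norm_penalty(decoder_weights, acts):
    # Proposed fix: sum separate L2 norms per target layer -> ~N scaling,
    # matching the cost of activating separate features at each layer.
    per_feat = decoder_weights.pow(2).sum(dim=2).sqrt().sum(dim=0)
    return (acts.abs() * per_feat).sum(-1).mean()
```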

Limiting feature span. Architectural constraints could restrict how many layers a single feature can decode to, for example by using a sliding window of k downstream layers. This sacrifices some ability to represent genuine crosslayer superposition but forces multi-step computation to use multiple features. It would also make CLT training far less computationally intensive: instead of learning decoders for all layer pairs, which scales like n^2 in the number of layers n, you would learn only n⋅k decoders. In large models this can be a big win, for example n=80 but only allowing decoders to the next k=5 layers.

Auxiliary faithfulness losses. A more principled approach might reward features whose activations correlate with actual neuron activations at each layer, rather than only rewarding output reconstruction. This directly incentivizes tracking the model's internal states.

Local crosslayer transcoders. Another pragmatic mitigation is to restrict each feature’s decoder to only write to a local neighborhood of downstream layers, for example the next k=5 layers, instead of allowing writes to all later layers. This does not fundamentally remove the incentive to “collapse” multi-step computation, since a feature can still span multiple layers within its window and you can still chain windows. But it can plausibly fix the common case in practice: many of the problematic absorptions we see involve long range decoding where a single feature effectively jumps over intermediate computations. Enforcing locality makes it harder for the CLT to bypass those steps, and encourages the representation to expose intermediate features rather than compressing them away. This is an 80/20 style mitigation: it may not guarantee faithfulness, but it could substantially reduce the worst failures while also lowering training cost by replacing all pairs decoding with n⋅k decoders instead of n^2.

We consider the separated decoder norm penalty the most promising near-term experiment, as it directly addresses the cost asymmetry without requiring architectural changes or auxiliary objectives.

Limitations


Our primary demonstration uses an extremely small toy model with known ground-truth features. While this controlled setting allowed us to definitively identify unfaithfulness, the toy model's 16 possible inputs and simple boolean function may not be representative of real language models. In higher-dimensional settings, lookup-table shortcuts may be infeasible, forcing more compositional representations.

The evidence from Gemma-2-2B is suggestive but not definitive. Without ground truth for what computations the model actually performs, we cannot be certain which interpretation (CLTs or PLTs) is correct. It is possible that the model genuinely implements memorization via shallow lookup tables, and the CLT is faithfully representing this while the PLT invents spurious intermediate structure. Our theory predicts CLT unfaithfulness, but this particular case could be an instance where the CLT happens to be correct.

Finally, our analysis focused on MLPs and did not examine attention layers. Cross-layer transcoders spanning attention computations may exhibit different failure modes, as the computational structure of attention differs significantly from feedforward layers.

Conclusion

Cross-Layer Transcoders offer computational advantages over per-layer alternatives, particularly in handling crosslayer superposition. However, the same architectural properties that enable these advantages create a systematic bias toward unfaithful representations that collapse intermediate computations into shallow input-output mappings.

We demonstrated this unfaithfulness in a toy model where ground-truth features are known, explained the loss function dynamics that create this bias, and showed evidence that the phenomenon affects real language model interpretability - with meaningfully different conclusions about model behavior depending on which tool is used.

Understanding and mitigating this faithfulness problem is important for the field of mechanistic interpretability. We hope this work encourages both caution in interpreting CLT circuits and further research into training procedures that better balance efficiency and faithfulness.

Appendix

Contributions

Georg came up with the research question. Rick selected and handcrafted the model. Georg trained the CLTs on the handcrafted model. Rick and Georg interpreted these CLTs and wrote up the results. Georg conducted the experiments with real language models and wrote this section. Kamal and Kat conducted experiments for and wrote the “CLT Circuits Attribute to Earlier Layers“ section. Georg supervised Kamal and Kat through the SPAR AI program. Rick and Georg wrote the conclusion section. We jointly edited the writeup.

Full Analysis of JumpReLU CLT (L0=2)

We train a JumpReLU CLT with 32 features and tune the sparsity coefficient to yield an L0 of 2 across both layers. The CLT learns to activate ~2 features in layer 0 and none in layer 1. In layer 0, 8 of the 32 features are alive. We only plot weights and activations of living features.

The encoder weights learn different combinations of input features:

One feature is active for the inputs that should output 1, and two or three features are active for inputs that should output -1.

Looking at both layer-0 decoders reveals how these features in combination perfectly recover reconstructions in both layers:

To decode the final result from only layer 0 features, the decoder vector is positive for the 4 features that activate on the correct result and negative for the remainder. Because all except the correct solutions also activate on the features with negative decoder weights, they effectively cancel out.

Full Analysis of JumpReLU PLT (L0=2)

To understand what circuits the PLT learned, we plot weights, activations, and reconstructions again for the entire dataset. We show weights of alive features in both layers. Crucially, the layer-1 encoder learns non-zero weights only for the intermediate subspace efgh:

Looking at feature activations reveals that the layer-1 transcoder learned the actual behavior of layer 1: It only activates for the inputs that result in TRUE and doesn’t activate for the remaining inputs:


The decoder weights show how those features are transformed into reconstructions:

Here are the reconstructions of the PLT:




Proposition of policy for writing articles to fact check faster

2026-02-03 05:04:30

Published on February 2, 2026 8:51 AM GMT

What if the authors used footnotes instead of direct links, and in the footnotes included a direct quote from the source supporting their claim? That way, the reader could just use Ctrl+F to verify that the source actually says what the author claims.




On Goal-Models

2026-02-03 02:44:10

Published on February 2, 2026 6:44 PM GMT

I'd like to reframe our understanding of the goals of intelligent agents to be in terms of goal-models rather than utility functions. By a goal-model I mean the same type of thing as a world-model, only representing how you want the world to be, not how you think the world is. However, note that this is still a fairly inchoate idea, since I don't actually know what a world-model is. The rest of this post contains some fairly abstract musings on goal-models and their relationship to utility functions.

The concept of goal-models is broadly inspired by predictive processing, which treats both beliefs and goals as generative models (the former primarily predicting observations, the latter primarily “predicting” actions). This is a very useful idea, which e.g. allows us to talk about the “distance” between a belief and a goal, and the process of moving “towards” a goal (neither of which make sense from a reward/utility function perspective).

However, I’m dissatisfied by the idea of defining a world-model as a generative model over observations. It feels analogous to defining a parliament as a generative model over laws. Yes, technically we can think of parliaments as stochastically outputting laws, but actually the interesting part is in how they do so. In the case of parliaments, you have a process of internal disagreement and bargaining, which then leads to some compromise output. In the case of world-models, we can perhaps think of them as made up of many smaller (partial) generative models, which sometimes agree and sometimes disagree. The real question is in how they reach enough of a consensus to produce a single output prediction.

One potential model of that consensus-formation process comes from the probabilistic dependency graph formalism, which is a version of Bayesian networks in which different nodes are allowed to “disagree” with each other. The most principled way to convert a PDG into a single distribution is to find the distribution which minimizes the inconsistency between all of its nodes. PDGs seem promising in some ways, but I feel suspicious of any “global” metric of inconsistency. Instead I’m interested in scale-free approaches under which inconsistencies mostly get resolved locally (though it’s worth noting that Oliver’s proposed practical algorithm for inconsistency minimization is a local one).

It’s also possible that the predictive processing/active inference people have a better model of this process which I don’t know about, since I haven’t made it very deep into that literature yet.

Anyway, suppose we’re thinking of goal-models as generative models of observations for now. What does this buy us over understanding goals in terms of utility functions? The key tradeoff is that utility functions are global but shallow whereas goal-models are local but deep.

That is: we typically think of a utility function as something that takes as input any state (or alternatively any trajectory) of the world, and spits out a real number. Central examples of utility functions are therefore functions of fairly simple features which can be evaluated in basically all possible worlds—for example, functions of the consumption of a basket of goods (in economics) or functions of the welfare of individuals (in axiology).

Conversely, consider having a goal of creating a beautiful painting or a great cathedral. You can’t evaluate the outcome as a function of simple features (like quality of brush-strokes, quality of composition, etc). Instead, you have some sense of what the ideal is, which might include the ways in which each part of the painting or cathedral fits together. It might be very hard to then actually give meaningful scores to how “far” a given cathedral is from your ideal, or whether you’d pick an X% chance of one cathedral vs a Y% chance of another. Indeed, that feels like the wrong question to ask—part of what makes artists and architects great is when they aren’t willing to compromise in pursuit of their vision. Instead, they’re constantly moving in whichever direction seems like it’ll bring them closer to their single ultimate goal.

This is related to Demski’s distinction between selection and control as two types of optimization. A rocket that’s fixed on a target isn’t calculating how good or bad it would be to miss in any given direction. Instead, it’s constantly checking whether it’s on track, then adjusting to maintain its trajectory. The question is whether we can think of intelligent agents as “steering” through much higher-dimensional spaces in an analogous way. I think this makes most sense when you’re close enough (aka “local”) to your goal. For example, we can think of a CEO as primarily trying to keep their company on a stable upwards trajectory.

Conversely, a high school student who wants to be the CEO of a major company is so far away from their goal that it’s hard to think of them as controlling their path towards it. Instead, they first need to select between plans for becoming such a CEO based on how likely each plan is to succeed. Similarly, a dancer or a musician is best described as carrying out a control process when practicing or performing—but needed to make a discrete choice of which piece to learn, and more generally which instrument or dance style to focus on, and even more generally which career path to pursue at all. And of course a rocket needs to first select which target to focus on at all before it aims towards it.

So it’s tempting to think about selection as the “outer loop” and control as the “inner loop”. But I want to offer an alternative view. Where do we even get the criteria on which we make selections? I think it’s actually another control process—specifically, the process of controlling our identities. We have certain conceptions of ourselves (“I’m a good person” or “I’m successful” or “people love me”.) We then are constantly adjusting our lives and actions in order to maintain those identities—e.g. by selecting the goals and plans which are most consistent with them, and looking away from evidence that might falsify our identities. So perhaps our outermost loop is a control process after all.

These identities (or “identity-models”) are inherently local in the sense that they are about ourselves, not the wider world. If we each pursued our own individual goals and plans derived from our individual identities, then it would be hard for us to cooperate. However, one way to scale up identity-based decision-making is to develop identities with the property that, when many people pursue them, those people become a “distributed agent” able to act in sync.




“Features” aren’t always the true computational primitives of a model, but that might be fine anyways

2026-02-03 02:41:55

Published on February 2, 2026 6:41 PM GMT

Or, partly against discussion about generic “features” in mechanistic interpretability

Probably the most debated core concept in mechanistic interpretability is that of the “feature”: common questions include “are there non-linear features, and does this mean that the linear representation hypothesis is false?”, “do SAEs recover a canonical set of features, and are SAE features ‘features’ in the first place?”, and so forth. 

But what do people mean by feature?

Definition 1: When asked for a definition, people in mech interp tend to define features as something like: “features are the computational primitives of the algorithm implemented by the neural network“, and point to the decompilation analogy. The obvious problem is that there might not be a unique clean algorithm that’s “truly” implemented by the network; and even if there is, we might not be able to find or understand it.

As a toy example of how reverse engineering the computational primitives can be challenging, consider the MLP in the classic modular addition network:

  • In the original Progress Measures for Grokking via Mechanistic Interpretability paper, we brute forced the computation of the MLP by checking all possible inputs, and argued that this was well explained by it multiplying together the sin and cos factors represented in the embed (though we did not explain how). 
  • Later, after studying other toy neural networks, I hypothesized that the MLP was doing multiplication by “learn[ing]” piecewise linear approximations of (x+y)^2/4 and (x-y)^2/4 and taking their difference. 
  • Finally, in Modular Addition Without Black-boxes, we found that the MLP was actually approximating a trigonometric integral that integrated to the correct value:

As far as I know, all features that we can definitively claim are “true” computational primitives are found in toy models trained on algorithmic tasks. 

Definition 2: In practice, people implicitly use features to mean “salient property of the input represented by the models, where the representation can be manipulated to alter the model’s behavior in a sensible way”. For example, people often point to the various language or time features learned by SAEs when discussing potential “nonlinear features”. This is also the definition used in some of Anthropic’s more recent work, e.g. in “On the biology of a large language model”.

Definition 3: In many theoretical discussions, the question of what features are is entirely irrelevant. Features are just assumed to exist: for example, in Toy Model of Superposition, the features are provided in uncompressed form as part of the input, as is generally the case for follow up work. (Even in our computation in superposition work!) 


The core claim I want to make in this piece is: it’s meaningful to think of “features” as existing on a spectrum, starting at pure memorization, continuing through performing case analysis or equivalence partitioning, and culminating in the “true” computational primitives. 

That is, sometimes it’s useful to think of features as memorizing particular input data points; sometimes it’s useful to think of features as partitioning possible inputs for case analysis, and sometimes the features really are intended to be the atomic primitives that compose the network’s computation. 


What do I mean by this? 

Consider finetuning a vision language model on the following toy classification task, based on the classic “bleggs vs rubes” example from the Sequences:

  • There are four types of objects: blue eggs (“bleggs”), blue cubes (“blubes”), red eggs (“reggs”), and red cubes (“rubes”).
  • The model is asked to answer which of the four types of object are present within an image. 
  • Of course, there are other properties that matter when determining what an object is: for example, perhaps an object is more likely to be a blegg (rather than anything else) if it has a furry surface, or it’s more likely to be a blube if its surface is slightly translucent. 

One hypothetical way the model could work is the following: at a particular layer, it represents if the object in the image is red or blue, and if it’s cube-shaped or egg-shaped. The subsequent layer then reads off the property and uses it to decide what word to output. To reduce loss, the network also represents a bunch of the other properties that are fed into the word output decision, but these are less important – the two property features capture the majority of the variance in the network’s behavior, at least across the twenty datapoints that we examined it on. 

In some sense, based on the information provided, the answer to “how many features are at this layer” seems obvious: the network’s algorithm is well described as being a computer program that has two important variables: redness and cubeness. Under the classic software (de)compilation metaphor, this is recovering the actual variables used by the compiled program. Perhaps we even wrote the network weights ourselves to make this true.

But equally, you could say that the model has learned four object features: a “blegg” direction, a “blube” direction, a “regg” direction, and a “rube” direction. Under a software compilation metaphor, this is closer to dividing the inputs into equivalence classes where you expect similar behavior, and testing each equivalence class. 

Finally, there’s also the trivial “memorization solution”: it is indeed perfectly valid to explain the network’s behavior in terms of its behavior on each possible input. Again returning to the software compilation metaphor: if you can verify that your decompiled program functions identically on all inputs via enumeration, then this is indeed sufficient evidence that you’ve successfully reverse engineered the program in a meaningful sense.  

If this toy neural network were embedded as a circuit inside a larger network, which uses additional features, the four object feature model (or even the twenty-feature memorization feature model) might be favored by SAEs for having lower reconstruction loss and higher sparsity (cf. Lucius’s “fat giraffe” critique of SAEs)! (For example, the additional features could make it so the four object vectors are no longer contained within the same plane, meaning that the planar two property feature model will lose a bunch of the information.) In fact, it’s not implausible that as you increase the width of your SAE, you go from the two property feature model (a dense 2d model) to the four object feature model (a sparse 4d model) and finally, at sufficient widths, a twenty feature model where each feature is a single data point (a very sparse 20d model with perfect reconstruction loss). 
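To make the ambiguity concrete, here is a toy numpy sketch (hypothetical directions, not taken from any real model): the same two-dimensional subspace can be described by two dense “property” features or by four sparse “object” features, and both descriptions reconstruct the activations exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32

# Two hypothetical orthonormal "property" directions spanning a plane.
red_dir = rng.normal(size=d_model)
red_dir /= np.linalg.norm(red_dir)
cube_dir = rng.normal(size=d_model)
cube_dir -= (cube_dir @ red_dir) * red_dir
cube_dir /= np.linalg.norm(cube_dir)

# Each object's representation: redness * red_dir + cubeness * cube_dir.
objects = {"blegg": (-1, -1), "blube": (-1, +1), "regg": (+1, -1), "rube": (+1, +1)}
acts = {name: r * red_dir + c * cube_dir for name, (r, c) in objects.items()}

# The four-object dictionary: one unit vector per corner of the plane.
object_dirs = {name: v / np.linalg.norm(v) for name, v in acts.items()}

for name, v in acts.items():
    # Two dense "property" features: every object activates both.
    prop_code = np.array([v @ red_dir, v @ cube_dir])
    recon_props = prop_code[0] * red_dir + prop_code[1] * cube_dir
    # Four sparse "object" features: after a ReLU, exactly one fires.
    obj_code = np.maximum([v @ object_dirs[n] for n in objects], 0)
    recon_objs = sum(c * object_dirs[n] for c, n in zip(obj_code, objects))
    assert np.allclose(recon_props, v) and np.allclose(recon_objs, v)
```

Both dictionaries explain the same activations; which one an SAE recovers depends on its width and sparsity penalty, which is exactly the ambiguity discussed above.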

I’ve heard people claim that, insofar as SAEs are doing something analogous to case analysis, they won’t be useful (again, consider Lucius’s “fat giraffe” critique of SAEs). In fact, I used to make this argument myself in early 2024. In one sense, this is true: insofar as our techniques are picking up on different features than what the network is “actually” using (at least in cases where we can make a claim as to the “actual” primitives), then it’s probable that algorithms using these features will generalize differently in other situations than the network itself. 

But case analysis is useful for interpretability, just as equivalence partitioning is useful in debugging and (actual) case analysis is useful for formal verification! If you can confirm that the model’s internals behave quite differently between each of the four clusters, but are similar within each cluster, we’ve reduced the problem of testing a behavior of the model (for example, adversarial robustness) to testing each equivalence class or analyzing each case, instead of each possible example. 

In fact, even features as units of memorization can be useful; insofar as the facts actually are memorized by the network, we want to be able to talk about these features (and much real world knowledge is likely purely memorized by current models, of which the clearest example may be the birth dates of celebrities). And in cases where we do have the ability to verify the network’s behavior on large classes of input examples, this might be the correct level of detail, even if there is a compact algorithm that describes the network’s behavior. 


If you’ll permit some poetry: No prophet comes from the heavens to inscribe the true features of the world onto your neural network. Modern networks are not even trained with the explicit goal of extracting any particular feature; insofar as they represent “features” of any kind, it’s because they’re useful for computations that reduce loss. 

But at the same time, as the quote goes: “The categories were made for man, not man for the categories”. As long as we aren’t confused about what each of us mean by “feature” in any particular case, it often makes sense to use the word feature to represent something other than the computational primitives.  

Accordingly, when discussions about features get confused, take the computational view: ask what computations the neural network implements, and what components would appear in a compact description of that computation suitable to your purposes. When constructing a supposed counterexample to the linear representation hypothesis, or arguing against the existence of non-linear features, consider whether the features in your example are the only “features” that could be used to describe the behavior of the model.


An obligatory disclaimer: I no longer do mechanistic interpretability research and have not been fully keeping up to date with the literature, so it’s possible something like this has already been written. I’m writing it up anyway because I’d like something to point to. This is a brief (~2.5h) attempt to exposit some of my thoughts. 




Are there lessons from high-reliability engineering for AGI safety?

2026-02-02 23:26:27

Published on February 2, 2026 3:26 PM GMT

This post is partly a belated response to Joshua Achiam, currently OpenAI’s Head of Mission Alignment:

If we adopt safety best practices that are common in other professional engineering fields, we'll get there … I consider myself one of the x-risk people, though I agree that most of them would reject my view on how to prevent it. I think the wholesale rejection of safety best practices from other fields is one of the dumbest mistakes that a group of otherwise very smart people has ever made. —Joshua Achiam on Twitter, 2021

“We just have to sit down and actually write a damn specification, even if it's like pulling teeth. It's the most important thing we could possibly do," said almost no one in the field of AGI alignment, sadly. … I'm picturing hundreds of pages of documentation describing, for various application areas, specific behaviors and acceptable error tolerances … —Joshua Achiam on Twitter (partly talking to me), 2022

As a proud member of the group of “otherwise very smart people” making “one of the dumbest mistakes”, I will explain why I don’t think it’s a mistake. (Indeed, since 2022, some “x-risk people” have started working towards these kinds of specs, and I think they’re the ones making a mistake and wasting their time!)

At the same time, I’ll describe what I see as the kernel of truth in Joshua’s perspective, and why it should be seen as an indictment not of “x-risk people” but rather of OpenAI itself, along with all the other groups racing to develop AGI.

1. My qualifications (such as they are)

I’m not really an expert on high-reliability engineering. But I worked from 2015-2021 as a physicist at an engineering R&D firm, where many of my coworkers were working on building things that really had to work in exotic environments—things like guidance systems for submarine-launched nuclear ICBMs, or a sensor & electronics package that needed to operate inside the sun’s corona.

To be clear, I wasn’t directly working on these kinds of “high-reliability engineering” projects. (I specialized instead in very-early-stage design and feasibility work for dozens of weird system concepts and associated algorithms.) But my coworkers were doing those projects, and over those five years I gained some familiarity with what they were doing on a day-to-day basis and how.

…So yeah, I’m not really an “expert”. But as a full-time AGI safety & alignment researcher since 2021, I’m plausibly among the “Pareto Best In The World”™ at simultaneously understanding both high-reliability engineering best practices and AGI safety & alignment. So here goes!

2. High-reliability engineering in brief

Basically, the idea is:

  • You understand exactly what the thing is supposed to be doing, in every situation that you care about.
  • You understand exactly what situations (environment) the thing needs to work in—temperatures, vibrations, loads, stresses, adversaries trying to mess it up, etc.
  • You have a deep understanding of how the thing works, in the form of models that reliably and legibly flow up from component tolerances etc. to headline performance. And these models firmly predict that the thing is going to work.
    • (The models also incorporate the probability and consequences of component failures etc.—so it usually follows that the thing needs redundancy, fault tolerance, engineering margins, periodic inspections, etc.)
  • Those models are compared to a wide variety of both detailed numerical simulations (e.g. finite element analysis) and physical (laboratory) tests. These tests are designed not to “pass or fail” but rather to spit out tons of data, allowing a wide array of quantitative comparisons with the models, thus surfacing unknown unknowns that the models might be leaving out. (A toy sketch of this kind of model-versus-measurement check appears just after this list.)
    • For example, a space project might do vibration tests, centrifuge tests, vacuum tests, radiation exposure, high temperature, low temperature, temperature gradients, and so on.
  • Even after all that, nobody really counts on the thing working until there have been realistic full-scale tests, which again not only “pass” but also spit out a ton of measurements that all quantitatively match expectations based on the deep understanding of the system.
    • (However, I certainly witnessed good conscientious teams make novel things that worked perfectly on the first realistic full-scale attempt—for example, the Parker Solar Probe component worked great, even though they obviously could not do trial-runs of their exact device in outer space, let alone inside the solar corona.)
  • When building the actual thing—assembling the components and writing the code—there’s scrupulous attention to detail, involving various somewhat-onerous systems with lots of box-checking to make sure that nothing slips through the cracks. There would also be testing and inspections as you build it up from components, to sub-assemblies, to the final product. Often specialized software products like IBM DOORS are involved. For software, the terms of art are “verification & validation”, which refer respectively to systematically comparing the code to the design spec, and the design spec to the real-world requirements and expectations.
  • And these systems need to be supported at the personnel level and at the organizational level. The former involves competent people who understand the stakes and care deeply about getting things right even when nobody is watching. The latter involves things like deep analysis of faults and near-misses, red-teaming, and so on. This often applies also to vendors, subcontractors, etc.
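
As a toy sketch (not from the post, and with made-up numbers) of the “compare the test data quantitatively against the model” practice mentioned above: rather than a single pass/fail bit, each measured channel is checked against the engineering model’s prediction and tolerance, so any channel drifting away from the model surfaces as a flagged discrepancy.

```python
from dataclasses import dataclass

@dataclass
class Channel:
    name: str
    predicted: float   # value the engineering model says we should see
    tolerance: float   # acceptable deviation from the model
    measured: float    # what the test actually returned

def compare_to_model(channels: list[Channel]) -> list[str]:
    """Return a human-readable list of channels that disagree with the model."""
    discrepancies = []
    for ch in channels:
        deviation = abs(ch.measured - ch.predicted)
        if deviation > ch.tolerance:
            discrepancies.append(
                f"{ch.name}: measured {ch.measured} vs predicted {ch.predicted} "
                f"(off by {deviation:.3g} > tolerance {ch.tolerance})")
    return discrepancies

# Example with invented vibration-test numbers.
report = compare_to_model([
    Channel("resonant_freq_Hz", predicted=142.0, tolerance=3.0, measured=141.2),
    Channel("peak_accel_g",     predicted=6.5,   tolerance=0.5, measured=7.4),
])
print(report or "all channels consistent with the model")
```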

3. Is any of this applicable to AGI safety?

3.1 In one sense, no, obviously not

Let’s say I had a single top-human-level-intelligence AGI, and I wanted to make $250B with it. Well, hmm, Jeff Bezos used his brain to make $250B, so an obvious thing I could do is have my AGI do what Jeff Bezos did, i.e. go off and autonomously found, grow, and run an innovative company.

(If you get off the train here, then see my discussion of “Will almost all future companies eventually be founded and run by autonomous AGIs?” at this link.)

Now look at that bulleted list above, and think about how it would apply here. For example: “You understand exactly what the thing is supposed to be doing, in every situation that you care about.”

No way.

At my old engineering R&D firm, we knew exactly what such-and-such subsystem was supposed to do: it was supposed to output a measurement of Quantity X, every Y milliseconds, with no more than noise Z and drift W, so long as it remains within such-and-such environmental parameters. Likewise, a bridge designer knows exactly what a bridge is supposed to do: not fall down, nor sway and vibrate more than amplitude V under traffic load U and wind conditions T etc.

…OK, and now what exactly is our “AGI Jeff Bezos” supposed to be doing at any given time?

Nobody knows!

Indeed, the fact that nobody knows is the whole point! That’s the very reason that an AGI Jeff Bezos can create so much value!

When Human Jeff Bezos started Amazon in 1994, he was obviously not handed a detailed spec for what to do in any possible situation, where following that spec would lead to the creation of a wildly successful e-commerce / cloud computing / streaming / advertising / logistics / smart speaker / Hollywood studio / etc. business. For example, in 1994, nobody, not Jeff Bezos himself, nor anyone else on Earth, knew how to run a modern cloud computing business, because indeed the very idea of “modern cloud computing business” didn’t exist yet! That business model only came to exist when Jeff Bezos (and his employees) invented it, years later.

By the same token, on any given random future day…

  • Our AGI Jeff Bezos will be trying to perform a task that we can't currently imagine, using ideas and methods that don't currently exist.
  • It will have an intuitive sense of what constitutes success (on this micro-task) that it learned from extensive idiosyncratic local experience, intuitions that a human would need years to replicate.
  • The micro-task will advance some long-term plan that neither we nor even the AGI can yet dream of.
  • This will be happening in the context of a broader world that may be radically different from what it is now.
  • And our AGI Jeff Bezos (along with other AGIs around the world) will be making these kinds of decisions at a scale and speed that makes it laughably unrealistic for humans to be keeping tabs on whether these decisions are good or bad.

…And we’re gonna write a detailed spec for that, analogous to the specs for the sensor and bridge that I mentioned above? And we’re gonna ensure that the AGI will follow this spec by design?

No way. If you believe that, then I think you are utterly failing to imagine a world with actual AGI.

[Image: 2-column table contrasting properties of AI as we think of it today with properties of the future AGI that I’m thinking about]

3.2 In a different sense, yes, at least I sure as heck hope so eventually

When we build actual AGI, it will be like a new intelligent species on Earth, and one which will eventually be dramatically faster, more numerous, and more competent than humans. If they want to wipe out humans and run the world by themselves, they’ll be able to. (For more on AGI extinction risk in general, see the 80,000 hours intro, or my own intro.)

Now, my friends on the Parker Solar Probe project were able to run certain tests in advance—radiation tests, thermal tests, and so on—but the first time their sensor went into the actual solar corona, it had to work, with no do-overs.

By the same token, we can run certain tests on future AGIs, in a safe way. But the first time that AGIs are autonomously spreading around the world, and inventing transformative new technologies and ideas, and getting an opportunity to irreversibly entrench their power, those AGIs had better be making decisions we’re happy about, with no do-overs.

All those practices listed in §2 above exist for a reason; they're the only way we even have a chance of getting a system to work the first time in a very new situation. They are not optional nice-to-haves, rather they are the bare minimum to make the task merely “very difficult” rather than “hopeless”. If it seems impossible to apply those techniques to AGI, per §3.1 above, then, well, we better shut up and do the impossible.

What might that look like? How do we get to a place where we have deep understanding, and where this understanding gives us a strong reason to believe that things will go well in the (out-of-distribution) scenarios of concern, and where we have a wide variety of safe tests that can be quantitatively compared with that understanding in order to surface unknown unknowns?

I don't know!

Presumably the “spec” and the tests would be more about the AGI’s motivation, or its disposition, or something, rather than about its object-level actions? Well, whatever it is, we better figure it out.

We're not there today, nor anywhere close.

(And even if we got there, then we would face the additional problem that all existing and likely future AI companies seem to have neither the capability, nor the culture, nor the time, nor generally even the desire to do the rigorous high-reliability engineering (§2) for AGI. See e.g. Six Dimensions of Operational Adequacy in AGI Projects (Yudkowsky 2017).)

4. Optional bonus section: Possible objections & responses

Possible Objection 1: Your §3.1 is misleading; we don’t need to specify what AGI Jeff Bezos needs to do to run a successful innovative business, rather we need to specify what he needs to not do, e.g. he needs to not break the law.

My Response: If you want to say that “don’t break the law” etc. counts as a spec, well, nobody knows how to do the §2 stuff (deep understanding etc.) for “don’t break the law” either.

And yes, we should tackle that problem. But I don’t see any way that a 300-page spec (as suggested by Joshua Achiam at the top) would be helpful for that. In particular:

  • If your roadmap is to make the AGI obey “the letter of the law” for some list of prohibitions, then no matter how long you make the list, a smart AGI trying to get things done will find and exploit loopholes, with catastrophic results.
  • Or if your roadmap is to make the AGI obey “the spirit of the law” for some list of prohibitions, then there’s no point in writing a long list of prohibitions. Just use a one-item “list” that says “Don’t do bad things.” I don’t see why it would be any easier or harder to design an AGI that reliably (in the §2 sense) obeys the spirit of that one prohibition, than an AGI that reliably obeys the spirit of a 300-page list of prohibitions. (The problem is unsolved in either case.)

Possible Objection 2: We only need the §2 stuff (deep understanding etc.) if there are potentially-problematic distribution shifts between test and deployment. If we can do unlimited low-stakes tests of the exact thing that we care about, then we can just do trial-and-error iteration. And we get that for free because AGI will improve gradually. Why do you expect problematic distribution shifts?

My Response: See my comments in §3.2, plus maybe my post “Sharp Left Turn” discourse: An opinionated review. Or just think: we’re gonna get to a place where there are millions of telepathically-communicating super-speed-John-von-Neumann-level AGIs around the world, getting sculpted by continual learning for the equivalent of subjective centuries, and able to coordinate, invent new technologies and ideas, and radically restructure the world if they so choose … and you really don’t think there’s any problematic distribution shift between that and your safe sandbox test environment?? So the upshot is: the gradual-versus-sudden-takeoff debate is irrelevant for my argument here. (Although for the record, I do expect superintelligence to appear more suddenly than most people do.)

Maybe an analogy is: if you’re worried that a nuclear weapon with yield Y might ignite the atmosphere, it doesn’t help to first test a nuclear weapon with yield 0.1×Y, and then if the atmosphere hasn’t been ignited yet, next try testing one with yield 0.2×Y, etc.


