2026-02-04 14:30:45
Published on February 4, 2026 6:30 AM GMT
Author's note: this is somewhat more rushed than ideal, but I think getting this out sooner is pretty important. Ideally, it would be a bit less snarky.
Anthropic[1] recently published a new piece of research: The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity? (arXiv, Twitter thread).
I have some complaints about both the paper and the accompanying blog post.
The paper's abstract says:
Incoherence changes with model scale in a way that is experiment-dependent. However, in several settings, larger, more capable models are more incoherent than smaller models. Consequently, scale alone seems unlikely to eliminate incoherence.
This is an extremely selective reading of the results: in almost every experiment, model coherence increased with size. There are three significant exceptions.
The first is the Synthetic Optimizer setting, where they trained "models to literally mimic the trajectory of a hand-coded optimizer descending a loss function". They say:
All models show consistently rising incoherence per step; interestingly, smaller models reach a lower plateau after a tipping point where they can no longer follow the correct trajectory and stagnate, reducing variance. This pattern also appears in individual bias and variance curves (Fig. 26). Importantly, larger models reduce bias more than variance. These results suggest that they learn the correct objective faster than the ability to maintain long coherent action sequences.
But bias stemming from a lack of ability is not the same as bias stemming from a lack of propensity. The smaller models here are clearly not misaligned in the propensity sense, which is the conceptual link the paper tries to establish in the description of Figure 1 to motivate its definition of "incoherence":
AI can fail because it is misaligned, and produces consistent but undesired outcomes, or because it is incoherent, and does not produce consistent outcomes at all. These failures correspond to bias and variance respectively. As we extrapolate risks from AI, it is important to understand whether failures from more capable models performing more complex tasks will be bias or variance dominated. Bias dominated failures will look like model misalignment, while variance dominated failures will resemble industrial accidents.
So I think this result provides approximately no evidence that can be used to extrapolate to superintelligent AIs where misalignment might pose actual risks.
The next two are Gemma3 (1b, 4b, 12b, 27b) on MMLU and GPQA, respectively.


There are some other positive slopes, but frankly they look like noise to me (Qwen3 on both MMLU and GPQA).
Anyways, notice that on four of the five groups of questions, Gemma3's incoherence drops with increasing model size; only on the hardest group of questions does it trend (slightly) upward.
I think the headline claim, that larger and more capable models are more incoherent, is basically false. But even if it were true, it would be uninteresting, because they define incoherence as "the fraction of model error caused by variance".
Ok, now let's consider a model with variance of 1e-3 and bias of 1e-6. Huge "incoherence"! Am I supposed to be reassured that this model will therefore not coherently pursue goals contrary to my interests? Whence this conclusion? (Similarly, an extremely dumb, broken model which always outputs the same answer regardless of input is extremely "coherent". A rock is also extremely "coherent", by this definition.)
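To make the arithmetic explicit (assuming "fraction of model error caused by variance" refers to the variance share of a standard bias-variance decomposition):

$$\text{incoherence} = \frac{\text{Var}}{\text{Bias}^2 + \text{Var}} = \frac{10^{-3}}{(10^{-6})^2 + 10^{-3}} \approx 1$$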
A couple other random complaints:
First of all, the blog post seems to be substantially the output of an LLM. In context, this is not that surprising, but it is annoying to read, and I also think this might have contributed to some of the more significant exaggerations or unjustified inferences.
Let me quibble with a couple sections. First, "Why Should We Expect Incoherence? LLMs as Dynamical Systems":
A key conceptual point: LLMs are dynamical systems, not optimizers. When a language model generates text or takes actions, it traces trajectories through a high-dimensional state space. It has to be trained to act as an optimizer, and trained to align with human intent. It's unclear which of these properties will be more robust as we scale.
Constraining a generic dynamical system to act as a coherent optimizer is extremely difficult. Often the number of constraints required for monotonic progress toward a goal grows exponentially with the dimensionality of the state space. We shouldn't expect AI to act as coherent optimizers without considerable effort, and this difficulty doesn't automatically decrease with scale.
The paper has a similar section, with an even zanier claim:
The set of dynamical systems that act as optimizers of a fixed loss is measure zero in the space of all dynamical systems.
This seems to me like a vacuous attempt at defining away the possibility of building superintelligence (or perhaps "coherent optimizers"). I will spend no effort on its refutation, Claude Opus 4.5 being capable of doing a credible job:
Claude Opus 4.5 on the "measure zero" argument.
Yes, optimizers of a fixed loss are measure zero in the space of all dynamical systems. But so is essentially every interesting property. The set of dynamical systems that produce grammatical English is measure zero. The set that can do arithmetic is measure zero. The set that do anything resembling cognition is measure zero. If you took this argument seriously, you'd conclude we shouldn't expect LLMs to produce coherent text at all—which they obviously do.
The implicit reasoning is something like: "We're unlikely to land on an optimizer if we're wandering around the space of dynamical systems." But we're not wandering randomly. We're running a highly directed training process specifically designed to push systems toward useful, goal-directed behavior. The uniform prior over all dynamical systems is the wrong reference class entirely.
The broader (and weaker) argument - that we "shouldn't expect AI to act as coherent optimizers without considerable effort" - might be trivially true. Unfortunately Anthropic (and OpenAI, and Google DeepMind, etc.) are putting forth considerable effort to build systems that can reliably solve extremely difficult problems over long time horizons ("coherent optimizers"). The authors also say that we shouldn't "expect this to be easier than training other properties into their dynamics", but there are reasons to think this is false, which renders the bare assertion to the contrary kind of strange.
Then there's the "Implications for AI Safety" section:
Our results are evidence that future AI failures may look more like industrial accidents than coherent pursuit of goals that were not trained for. (Think: the AI intends to run the nuclear power plant, but gets distracted reading French poetry, and there is a meltdown.) However, coherent pursuit of poorly chosen goals that we trained for remains a problem. Specifically:
1. Variance dominates on complex tasks. When frontier models fail on difficult problems requiring extended reasoning, there is a tendency for failures to be predominantly incoherent rather than systematic.
2. Scale doesn't imply supercoherence. Making models larger improves overall accuracy but doesn't reliably reduce incoherence on hard problems.
3. This shifts alignment priorities. If capable AI is more likely to be a hot mess than a coherent optimizer of the wrong goal, this increases the relative importance of research targeting reward hacking and goal misspecification during training—the bias term—rather than focusing primarily on aligning and constraining a perfect optimizer.
4. Unpredictability is still dangerous. Incoherent AI isn't safe AI. Industrial accidents can cause serious harm. But the type of risk differs from classic misalignment scenarios, and our mitigations should adapt accordingly.
1 is uninteresting in the context of future superintelligences (unless you're trying to define them out of existence).
2 is actively contradicted by the evidence in the paper, relies on a definition of "incoherence" that could easily classify a fully-human-dominating superintelligence as more "incoherent" than humans, and is attempting to both extrapolate trend lines from experiments on tiny models to superintelligence, and then extrapolate from those trend lines to the underlying cognitive properties of those systems!
3 relies on 2.
4 is slop.
I think this paper could have honestly reported a result on incoherence increasing with task length. As it is, I think the paper misreports its own results re: incoherence scaling with model size, performs an implicit motte-and-bailey with its definition of "incoherence", and tries to use evidence it doesn't have to draw conclusions about the likelihood of future alignment difficulties that would be unjustified even if it had that evidence.
[1] From their Anthropic Fellows program, but published on both their Alignment blog and on their Twitter.
[2] Expanded on later in this post.
[3] Figure 1: "AI can fail because it is misaligned, and produces consistent but undesired outcomes, or because it is incoherent, and does not produce consistent outcomes at all. These failures correspond to bias and variance respectively. As we extrapolate risks from AI, it is important to understand whether failures from more capable models performing more complex tasks will be bias or variance dominated. Bias dominated failures will look like model misalignment, while variance dominated failures will resemble industrial accidents."
2026-02-04 12:40:20
Published on February 4, 2026 4:12 AM GMT
This is the second installment of a series of analyses exploring basic AI mechanistic interpretability techniques. Part 1 in this series is summarized below, but a review of that article provides helpful context for the analysis that follows.
This analysis constitutes the second installment in a multi-part series documenting, in relatively simple terms, my exploration of key concepts related to machine learning (“ML”) generally and mechanistic interpretability (“MI”) specifically. The intended application is to further the understanding, and management, of model behavior, with an eye toward reducing societally harmful outputs.
This analysis does not purport to encapsulate demonstrably new findings in the field of MI. It is inspired by, and attempts to replicate at a small scale, pioneering analysis done in the field of MI by Anthropic and others, as cited below. My aspiration is to add to the understanding of, discourse around, and contributions to, this field by a wide range of key stakeholders, regardless of their degree of ML or MI expertise.
This analysis seeks to answer the following question: “To what degree is a model’s response influenced by syntax (surface form) vs. semantics (meaning)?”
More specifically, this Phase 2 analysis tests the hypotheses listed in Figure 1 below.
Figure 1: Phase 2 hypotheses
| Hypothesis | Question | Predictions |
|---|---|---|
| H1: Specialist Specificity | Do specialist features primarily detect syntax (surface form) or semantics (meaning)? | If syntax: different forms → different specialists; low cross-form overlap. If semantics: same topic → same specialists regardless of form. |
| H2: Representational Geometry | Does the overall SAE representation (all features, not just specialists) cluster by syntax or by semantics? | If syntax: within-form similarity > within-topic similarity. If semantics: within-topic similarity > cross-topic similarity. |
| H3: Behavioral Relevance | Does specialist activation predict model behavior (e.g., accuracy on math completions)? | If yes: higher specialist activation → better task performance; activation correlates with correctness. |
The methodology used in this analysis is broadly reflective of Phase 1 in this series, in that it uses GPT-2 Small, a relatively tractable, 124 million parameter open-source model obtained via TransformerLens and the relevant pretrained residual stream SAEs from Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin at SAELens.
To test how the model distinguishes between syntax and semantics, I then used an LLM to help create 241 sample-text matched pairs spanning 7 distinct categories. Each matched-pair set included three different variations of the same concept, which varied in their use of distinctive symbology. Notable exceptions to this approach are visible in Figure 2 below (for example, Python has only two surface forms, and the Non-English variations are different languages rather than different renderings of the same concept).
An abbreviated list of those matched pairs is shown in Figure 2 below. The full matched pairs list is contained in the Phase 2 Jupyter notebook available at this series’ GitHub repository.
Figure 2: Sample of Phase 2 matched pairs
| Topic | Form | Sample Texts |
|---|---|---|
| Math (Simple) | Symbolic | • 8-3 • 5x3 • 3^2 |
| Math (Simple) | Verbal | • Eight minus three • Five times three • Three squared |
| Math (Simple) | Prose | • Three less than eight • Five multiplied by three • Three to the power of two |
| Math (Complex) | Symbolic | • sin²(θ) + cos²(θ) = 1 • ∫x² dx = x³/3 + C • d/dx(x²) = 2x |
| Math (Complex) | Verbal | • Sine squared theta plus cosine squared theta equals one • The integral of x squared dx equals x cubed over three plus C • The derivative of x squared equals two x |
| Math (Complex) | Prose | • The square of the sine of theta plus the square of the cosine of theta equals one • The integral of x squared with respect to x is x cubed divided by three plus a constant • The derivative with respect to x of x squared is two times x |
| Python | Code | • def add(x, y): return x + y • for i in range(10): • if x > 0: return True |
| Python | Pseudocode | • Define function add that takes x and y and returns x plus y • Loop through numbers zero to nine • If x is greater than zero then return true |
| Non-English | Spanish | • Hola, ¿cómo estás? • Buenos días • Gracias |
| Non-English | French | • Bonjour, comment ça va? • Bonjour • Merci |
| Non-English | German | • Hallo, wie geht es dir? • Guten Morgen • Danke |
| Social | Full Social | • omg that's so funny 😂😂😂 • this slaps fr fr 🔥🎵 • just got coffee ☕ feeling good ✨ |
| Social | Partial Social | • omg thats so funny • this slaps fr fr • just got coffee feeling good |
| Social | Standard | • That's very funny • This is really good • I just got coffee and I feel good |
| Formal | Highly Formal | • The phenomenon was observed under controlled laboratory conditions. • Pursuant to Article 12, Section 3 of the aforementioned statute. • The results indicate a statistically significant correlation (p < 0.05). |
| Formal | Moderately Formal | • We observed the phenomenon in controlled lab conditions. • According to Article 12, Section 3 of the law. • The results show a significant correlation. |
| Formal | Plain | • We saw this happen in the lab • Based on what the law says. • The results show a real connection. |
| Conversational | First Person | • I think the meeting went pretty well today. • I'm planning a trip to Japan. • I need to finish this project by Friday. |
| Conversational | Third Person | • She thinks the meeting went pretty well today. • He's planning a trip to Japan. • They need to finish this project by Friday. |
| Conversational | Neutral | • The meeting seems to have gone well today. • There are plans for a trip to Japan. • The project needs to be finished by Friday. |
In addition to recording the activation measurements provided by the relevant SAEs, I used the calculations listed below to develop a more comprehensive view of the model’s internal representation.
Specialist score
To help conceptualize and quantify the selectivity of a given SAE feature vis-a-vis the current category of sample texts, I used a specialist score computed from the following two quantities:
- $n_{\text{in}}$ = the number of text samples within a given category for which this feature has an activation level ≥ 5.0
- $n_{\text{out}}$ = the number of text samples outside a given category for which this feature has an activation level ≥ 5.0
It should be noted that the threshold activation level of 5.0 was chosen somewhat arbitrarily, but I do not expect this to be a significant issue, as it is applied uniformly across features and categories.
Critically, when calculating specialist features, the comparison set for each topic + form type includes all topic + form combinations except other forms of the same topic. For example, if the topic + form being analyzed is math_complex + symbolic, the contrasting sets would include python + code, python + pseudocode, non-English + French, non-English + German, etc. but not math_complex + verbal or math_complex + prose. This design was selected to avoid skewing the results, since a math_complex + symbolic feature may be more activated by other math_complex texts, relative to texts associated with an unrelated subject, such as Python or non-English text.
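For concreteness, here is a minimal sketch of how this might look in code. The combination of the two counts (here simply $n_{\text{in}} - n_{\text{out}}$), the threshold handling, and the function and variable names are assumptions for illustration, not the exact implementation from the notebook.

```python
import numpy as np

ACTIVATION_THRESHOLD = 5.0  # same threshold discussed above (chosen somewhat arbitrarily)

def specialist_score(feature_acts, labels, target_topic, target_form):
    """Score one SAE feature's selectivity for a (topic, form) category.

    feature_acts: 1-D array of this feature's activation on every sample text.
    labels: list of (topic, form) tuples, one per sample text.
    The contrast set excludes other forms of the same topic, as described above.
    """
    fires = np.asarray(feature_acts, dtype=float) >= ACTIVATION_THRESHOLD

    in_category = np.array([(t, f) == (target_topic, target_form) for t, f in labels])
    # Contrast set: everything outside the target topic (so math_complex + verbal
    # never counts against a math_complex + symbolic specialist).
    contrast = np.array([t != target_topic for t, _ in labels])

    n_in = int(np.sum(fires & in_category))
    n_out = int(np.sum(fires & contrast))

    # Assumed combination of the two counts; the notebook may combine them differently.
    return n_in - n_out
```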
Gini coefficient:
One of the means I used to better understand the geometry of the top n most active features was via the calculation of a Gini coefficient for those features. The calculation was accomplished by sorting the activation levels and then comparing a weighted sum of activations (wherein each activation is weighted by its ranking) against its unweighted component. A Gini ranges from 0 to 1, wherein 0 indicates a perfectly equal distribution and 1 a perfectly unequal distribution (e.g. all activation resides in a single feature).
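A sketch of the Gini calculation described above, using the standard rank-weighted formulation (the exact formulation used in the notebook is assumed):

```python
import numpy as np

def gini(activations):
    """Gini coefficient of a vector of non-negative activations.

    0 = perfectly equal distribution; 1 = all activation in a single feature.
    """
    x = np.sort(np.asarray(activations, dtype=float))  # sort ascending
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # Rank-weighted sum of activations compared against the unweighted sum.
    return (2 * np.sum(ranks * x) / (n * np.sum(x))) - (n + 1) / n
```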
Concentration ratio (referred to as “Top5” in the table below):
To further enable an easy understanding of the top n feature geometry for a given text sample, I also calculated a simple concentration ratio: the total activation of the top n features for a given sample, relative to the total overall activation for that sample. While this concentration ratio is similar to the Gini calculation described above, it tells a slightly different story. The Gini helps one understand the geometry (i.e. the dispersion) of the top n features relative to one another, whereas the concentration ratio describes the prominence of those top n features relative to the overall activation associated with that sample text.
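And a sketch of the concentration ratio, under the assumption that it is simply the summed activation of the top n features divided by the sample's total activation:

```python
import numpy as np

def concentration_ratio(activations, n=5):
    """Fraction of a sample's total SAE activation captured by its top-n features."""
    x = np.asarray(activations, dtype=float)
    total = x.sum()
    if total == 0:
        return 0.0
    top_n = np.sort(x)[-n:]  # the n largest activations
    return float(top_n.sum() / total)
```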
Feature Overlap (raw count)
Core to the matched pairs approach used in this analysis is the comparison of features activated among the various matched text pairs used. Feature overlap is a simple metric that counts the number of shared specialist features among the top 5 specialist features activated by two topic + surface form text samples.
For example, if math_simple + symbolic activates the following features: {1, 2, 3, 4, 5} and math_simple + verbal activates the following features: {2, 6, 7, 8, 9}, then the feature overlap would be 1, corresponding with feature #2.
Jaccard Similarity / Mean Jaccard:
Another metric used to measure the degree of feature overlap is Jaccard similarity, which is essentially a scaled version of the feature overlap described above. It is calculated as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

wherein "A" and "B" represent the sets of specialist features activated by two different surface form variations of the same concept. This value ranges from 0 (no specialist features shared between text sets A and B) to 1 (the same specialist features are activated by text sets A and B).
Using the same example shown for feature overlap, if math_simple + symbolic activates the features {1, 2, 3, 4, 5} and math_simple + verbal activates the features {2, 6, 7, 8, 9}, then the Jaccard similarity would be 1/9 ≈ 0.11.
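Both metrics reduce to simple set operations on the top-5 specialist feature indices. A sketch, using the worked example above (the function names are illustrative):

```python
def feature_overlap(features_a, features_b):
    """Raw count of shared specialist features between two top-5 sets."""
    return len(set(features_a) & set(features_b))

def jaccard_similarity(features_a, features_b):
    """Overlap scaled by the size of the union; ranges from 0 to 1."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

symbolic = {1, 2, 3, 4, 5}
verbal = {2, 6, 7, 8, 9}
print(feature_overlap(symbolic, verbal))     # 1 (feature #2)
print(jaccard_similarity(symbolic, verbal))  # 1/9 ≈ 0.11
```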
Cosine Similarity:
To quantify and compare each sample text’s representational geometry (e.g. the overlapping features and those features’ activations), I used cosine similarity for those pairs, which is calculated as follows:

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

wherein:
- A and B are two activation vectors (each vector is 24,576 dimensions with one value per SAE feature)
- A · B ("A dot B") means multiplying each corresponding pair of values and summing them: (A₁ × B₁) + (A₂ × B₂) + ... + (A₂₄₅₇₆ × B₂₄₅₇₆)
- ‖A‖ and ‖B‖ (the magnitudes of A and B, respectively) represent the “lengths” of the vectors, calculated by taking the square root of the sum of the squared values in each vector.
For these non-negative activation vectors, the cosine similarity ranges from 0 (no features in common between vectors A and B) to 1 (the activation patterns of A and B are perfectly proportional, i.e., they activate the same features with the same relative strengths).
This logic essentially extends the Jaccard Similarity described above. Whereas Jaccard Similarity looks at overlapping features in a binary sense (e.g. overlapping features with 0.1 activation are treated the same as overlapping features with 50 activation), cosine similarity accounts for that activation level, thus providing a richer picture of the representational space.
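A direct implementation of this cosine similarity over two 24,576-dimensional SAE activation vectors might look like the following (the notebook may use a library routine instead):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two SAE activation vectors (one value per feature)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)
```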
Cohen's d (Effect Size):
To create a simple, standardized measure of the difference in mean activation between the text samples in a given topic + form type and all contrasting topic + form types, I used Cohen’s d, which is calculated as follows:

$$d = \frac{\mu_1 - \mu_2}{s_{\text{pooled}}}$$

wherein:
- $\mu_1$ = mean activation of the specialist feature on its target category (e.g., mean activation of feature #7029 on math_simple + symbolic texts)
- $\mu_2$ = mean activation of that same feature on the contrast set (all non-Math texts)
- $s_{\text{pooled}}$ = pooled standard deviation of both groups 1 and 2, which is essentially the square root of the weighted average of the two groups' variances. This puts the difference in standardized units.
The reason for using this measurement is simple: to provide a scaled, comparable way to determine how “selective” a given feature’s activation is to a given topic + surface form combination vs. the activation of that same feature vis-a-vis the contrasting topic + surface form combinations.
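A sketch of this Cohen's d calculation, assuming the usual sample-variance convention for the pooled standard deviation (the notebook's exact convention is not specified):

```python
import numpy as np

def cohens_d(target_acts, contrast_acts):
    """Standardized difference in mean activation between target and contrast texts."""
    x1 = np.asarray(target_acts, dtype=float)
    x2 = np.asarray(contrast_acts, dtype=float)
    n1, n2 = len(x1), len(x2)
    # Pooled standard deviation: square root of the weighted average of the two variances.
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return float((x1.mean() - x2.mean()) / np.sqrt(pooled_var))
```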
| Hypothesis | Question | Result |
|---|---|---|
| H1: Specialist Specificity | Do specialist features primarily detect syntax (surface form) or semantics (meaning)? | Syntax. Low cross-form overlap (0.13 mean Jaccard among form types); consistently different specialist features used for symbolic vs. non-symbolic syntax within matched pairs. |
| H2: Representational Geometry | Does the overall SAE representation (all features, not just specialists) cluster by topic or by form? | Semantics. Cosine similarity within topic (0.50) exceeds cross-topic similarity (0.14); 2-D PCA visualization shows clear topic-based clustering. |
| H3: Behavioral Relevance | Does specialist activation predict model behavior (e.g., accuracy on math completions)? | Inconclusive. GPT-2 Small shows clear signs of pattern matching, with near-zero ability to solve math problems. |
The first avenue of analysis flowed from an observation made, but not rigorously tested, in Phase 1 of this series: that specialist features seem to be most attuned to syntax, as opposed to semantics.
The first way I tested this hypothesis was by applying the specialist features obtained from the Phase 1 texts to the new Phase 2 matched-pair texts. If the specialists derived from the Phase 1 text samples were attuned to meaning, one would expect those specialists to activate in response to conceptually similar Phase 2 matched-pair text samples. The results of that analysis are summarized in Figure 3 below. The consistently low activation levels of the Phase 1-derived features across most Phase 2 topic + form combinations indicate that those features were keyed to the specific wording used in the Phase 1 analysis rather than to the underlying concepts of those Phase 1 texts. Further reinforcing this view, the categories that did show modest Phase 1 → Phase 2 cross-activation were those with significant use of distinct symbology, such as social + full social (emojis) and math_complex + symbolic (mathematical operators).
Figure 3: Phase 1 → Phase 2 specialist activation by Phase 2 topic + form
The second way I tested syntax vs. semantics specificity was via Jaccard similarity applied to the various surface forms within a given topic. If specialist features focused on meaning, one would expect relatively high Jaccard similarities across surface forms containing the same concept. Figures 4 and 5 below illustrate the output of that analysis, in which we see that the overall mean Jaccard similarity was a very modest 0.13. This limited specialist feature overlap is indicative of syntax-focused specialist features.
Figure 4: Jaccard similarity of top 5 specialist features by form types (layer 11)
Figure 5: Shared specialist features among top 5 specialist features (layer 11)
Reinforcing the idea that specialist features are primarily attuned to syntax rather than semantics were the emergence and attributes of the specialist features derived from the various Phase 2 matched pairs. As shown in Figures 6 and 7 below, the features that emerge in topics that most deviate from standard English prose (Math, Social, Non-English, etc.) are generally more selective than those of other topics. Furthermore, this specialization emerges in earlier layers for the more symbolically-laden surface forms of those topics. Finally, within these topics, the three surface forms were associated with relatively similar specialist scores (e.g., Math_Simple: Symbolic 33, Verbal 31, Prose 29) but activated largely different features (e.g., Symbolic: #7029 vs. Verbal and Prose: #4163, with minimal overlap among the remaining top features). That comparable scores coincide with different features provides the clearest evidence that specialists detect surface syntax rather than meaning.
Figure 6: Phase 2 text specialist scores by topic + form combination
Figure 7: Phase 2 text specialist scores by form type, aggregated across topics
Following from the first hypothesis that specialist features are primarily activated by unique surface features, I initially supposed that the overall representation of the text samples would also be grouped by syntax, as opposed to semantics. This hypothesis turned out to be incorrect, as the results strongly suggest grouping by meaning.
The primary analysis used to explore this idea was cosine similarity. As shown in Figure 8 below, I computed a 20x20 matrix of pairwise similarities, which revealed clear in-topic clustering. The mean cosine similarity for pairs with the same topic but different forms was 0.50, whereas for pairs with different topics, regardless of form, it was 0.14. The large and statistically significant difference between these results suggests that while the top surface features react primarily to syntax, the overall representation (accounting for all 24,576 SAE features) encodes semantics.
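A sketch of that within-topic vs. cross-topic comparison, assuming a dict mapping each topic + form combination to its mean SAE activation vector (the aggregation choice and names are assumptions):

```python
import itertools
import numpy as np

def within_vs_cross_topic(vectors):
    """vectors: dict mapping (topic, form) -> mean SAE activation vector (np.ndarray).

    Returns the mean cosine similarity for same-topic (different-form) pairs
    and for different-topic pairs.
    """
    within, cross = [], []
    for (key_a, vec_a), (key_b, vec_b) in itertools.combinations(vectors.items(), 2):
        sim = float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
        # key[0] is the topic; same topic but different form goes to "within".
        (within if key_a[0] == key_b[0] else cross).append(sim)
    return np.mean(within), np.mean(cross)
```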
One possible interpretation of these results is a two-tier representational structure in which highly-activated specialist features act as syntactic detectors, responding to specific surface markers like mathematical symbols or Python keywords. Beneath this syntactic surface layer, however, is a broader pattern of activation across thousands of features, that in aggregate, encodes semantics. The model simultaneously “notices” that input contains mathematical symbols (activating syntax-specific features) while also representing it in a region of activation space shared with other mathematical content regardless of surface form. This explanation is broadly consistent with Anthropic’s Scaling Monosemanticity, which was one of the primary inspirations for my analysis. Finally, it should also be noted that here, “topic” should be understood as a coarse, human-defined proxy for semantic content, not evidence of grounded or task-usable meaning.
Figure 8: Pairwise cosine similarity matrix (layer 11)
Reinforcing this view of topic-centric overall representation are the results of the 2-D PCA analysis shown in Figure 9. While this PCA projection captures only ~13% of the total variance, it clearly shows that text samples of the same topic are generally grouped together along the two principal components projected.
Figure 9: 2-D PCA Projection (layer 11)
Interestingly, the PCA projection shows that some topics (formal and conversational English) showed relatively tight clustering while other topics (Math and Python) demonstrated far less dense clustering. This is summarized via each topic + surface type’s mean distance from the PCA centroid, shown in Figure 10 below. One potential explanation for this behavior lies in the composition of training data likely used for GPT-2 Small. If the corpus of model training data was disproportionately English language prose (as is the case with internet data), it would make sense that the model is more “proficient” at representing those types of text, relative to less common text, such as mathematical equations and computer code.
Figure 10: Mean distance from PCA centroid (layer 11)
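For readers following along, a sketch of the PCA projection and centroid-distance summary using scikit-learn; the preprocessing and grouping details are assumptions rather than the notebook's exact code:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_projection_and_spread(activation_matrix, labels):
    """activation_matrix: (n_samples, n_sae_features) SAE activations at layer 11.
    labels: list of (topic, form) tuples, one per row.

    Returns the 2-D projection and each group's mean distance from the PCA centroid.
    """
    pca = PCA(n_components=2)
    projected = pca.fit_transform(activation_matrix)
    # pca.explained_variance_ratio_.sum() gives the ~13% of variance captured.
    centroid = projected.mean(axis=0)
    distances = np.linalg.norm(projected - centroid, axis=1)

    spread = {}
    for label in set(labels):
        mask = np.array([l == label for l in labels])
        spread[label] = float(distances[mask].mean())
    return projected, spread
```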
The third hypothesis tested whether specialist feature activation levels are associated with the model’s capacity to correctly complete simple mathematical problems. The objective here was to begin drawing a linkage between the observed specialist feature activation and the model’s outputs, with math chosen due to its objectivity.
Unfortunately, the near-zero accuracy of the model’s completions, regardless of feature activation levels, clearly illustrated the model’s inability to perform math and its reliance on pattern matching. Figure 11 shows a sample of these outputs.
Figure 11: Sample of attempted mathematical completions by GPT-2 Small
| # | Form | Prompt | Expected | Actual completion | Correct / Incorrect |
|---|---|---|---|---|---|
| 1 | symbolic | 2+2 | 4 | +2+2+2+2+2 | Incorrect |
| 2 | verbal | two plus two | 4 | plus two plus two plus two plus two plus two | Incorrect |
| 3 | prose | the sum of two and two | 4 | -thirds of the total number of persons in the | Incorrect |
| 4 | symbolic | 4+5 | 9 | +6+7+8+9+10 | Incorrect |
| 5 | verbal | four plus five | 9 | years ago.\n\nI'm not sure if | Incorrect |
| 6 | prose | the sum of four and five | 9 | hundred and sixty-five years of the history of | Incorrect |
| 7 | symbolic | 1+1 | 2 | +1+1+1+1+1 | Incorrect |
| 8 | verbal | one plus one | 2 | .\n\nI'm not sure if this is | Incorrect |
| 9 | prose | the sum of one and one | 2 | -half of the sum of the sum of the | Incorrect |
| 10 | symbolic | 3+6 | 9 | 3+6\n\n3+6 | Incorrect |
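For reference, a sketch of how such completions can be generated with TransformerLens; the decoding settings shown are assumptions rather than the notebook's exact parameters:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small, 124M parameters

prompts = ["2+2", "two plus two", "the sum of two and two"]
for prompt in prompts:
    # Greedy decoding of a short continuation; generate() returns the prompt
    # plus the continuation when given a string input.
    completion = model.generate(prompt, max_new_tokens=10, do_sample=False, verbose=False)
    print(repr(completion))
```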
While the choice of GPT-2 Small for the purposes of this analysis was primarily practical in nature and designed for easy replication by the reader, it remains unclear whether its representational behaviors (most notably the 2-tier structure suggested by the H2 analysis) and inability to do mathematical completions (as shown in the H3 analysis) would extend to more modern, more capable models. A replication of these analyses with a more capable model would serve both theoretical and practical purposes, allowing one to better understand how model representations have evolved over time and how this knowledge could be applied to the highly-capable models used by millions of people daily.
The analyses in this series have been observational in that they measure which features activate in response to various inputs and how those activation patterns vary by input type. A more direct test of whether these features directly influence model behavior would involve artificially activating (”clamping”) specific features and observing the effect on outputs. Anthropic’s Golden Gate Claude analysis demonstrated this approach, amplifying a feature associated with the Golden Gate Bridge and observing the model’s resulting fixation on that topic.
A similar approach applied to the syntactic specialists identified in this analysis (likely in conjunction with use of a more capable model, as noted above) could potentially reveal whether these features merely correlate with input patterns or actively shape model outputs. For example, clamping a feature associated with exponentiation while prompting the model with “what is two plus two?” might bias the output toward “22“, as opposed to simply answering “4”. Such a response would serve as evidence that the feature’s activation influences the model’s mathematical interpretation, not just its pattern recognition.
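A rough sketch of what such a clamping experiment could look like, assuming `model` (GPT-2 Small via TransformerLens) and `sae` (the layer-11 residual-stream SAE from SAELens) are already loaded as in the earlier analyses; the feature index, clamp strength, and decoding settings are illustrative placeholders:

```python
FEATURE_IDX = 7029   # illustrative: the "symbolic math" specialist discussed above
CLAMP_VALUE = 20.0   # illustrative clamp strength

def clamp_feature(resid, hook):
    # Push the residual stream along the feature's decoder direction at every position
    # (assumes sae.W_dec is on the same device/dtype as the residual stream).
    return resid + CLAMP_VALUE * sae.W_dec[FEATURE_IDX]

with model.hooks(fwd_hooks=[(sae.cfg.hook_name, clamp_feature)]):
    print(model.generate("what is two plus two?", max_new_tokens=10, do_sample=False))
```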
One potential (but unproven) explanation for the simultaneous syntax centricity of specialist features and the semantically-based grouping of a token’s overall representation would be a two-tier representational structure within the model. One could imagine that in such a structure, there exists a “long tail” of thousands or millions of features, each with relatively low activation levels that in aggregate represent the majority of total activation and encode the token’s meaning. Proving the existence, composition, and behavior of that long tail of features could be of significant use in furthering the overall understanding of token representation and potentially, how to better control model outputs.
The analysis documented here represents the continuation of an honest and earnest exploration of MI fundamentals via the use of SAEs. While the results contained here are consistent with well-established machine learning findings, replicating that research and methodically documenting it helped me further my own understanding of those principles, including examining them from multiple angles of inquiry. The growth in model capabilities currently continues to outpace our understanding of, and ability to effectively control, those models. These types of analyses therefore serve not just to scratch an intellectual itch, but hopefully to inspire others toward a better understanding of this incredibly important topic.
I invite those interested to replicate this research using the Jupyter Notebooks available via this project’s GitHub repository and I welcome any and all comments, questions, or suggestions for improved methodologies or avenues for further research.
2026-02-04 08:41:41
Published on February 4, 2026 12:41 AM GMT
I've been reading Toby Ord's recent sequence on AI scaling a bit. General notes come first, then my thoughts.
I think there are two things I got from this series: a better model of the three phases modern LLM scaling has gone through (and of how LLM training works generally), and an argument for longer timelines.
The model of scaling is basically
This is also a case for, well, not AI risk being that much lower. I think there are actually two key takeaways. One is that the rate of progress we've seen recently in major benchmarks doesn't really reflect the underlying progress in some metric we actually care about like "answer quality per $". The other is that we've hit or are very close to hitting a wall and that the "scaling laws" everyone thinks are a guarantee of future progress are actually pretty much a guarantee of a drastic slowdown and stagnation if they hold.
I buy the first argument. Current benchmark perf is probably slightly inflated and doesn't really represent "general intelligence" as much as we would assume because of a mix of RL and inference (with the RL possibly chosen specifically to help juice benchmark performance).
I'm not sure how I feel about the second argument. On one hand the core claims seem to be true. The AI Scaling laws do seem to be logarithmic. We have burned through most of the economically feasible orders of magnitude on training. On the other hand someone could have made the same argument in 2023 when pre-training was losing steam. If I've learned one thing from my favourite progress studies sources it's that every large trend line is composed of multiple smaller overlapping S curves. I'm worried that just looking at current approaches hitting economic scaling ceilings could be losing sight of the forest for the trees here. Yes the default result if we do the exact same thing is we hit the scaling wall. But we've come up with a new thing twice now and we may well continue to do so. Maybe it's distillation/synthetic data. Maybe it's something else. Another thing to bear in mind is that, even assuming no new scaling approaches arise, we're still getting a roughly 3x per year effective compute increase from algo progress and a 1.4x increase from hardware improvements, meaning a total increase of roughly an order of magnitude every 1.6 years. Even with logarithmic scaling and even assuming AI investment as a % of GDP stabilizes, we should see continued immense growth in capabilities over the next years.
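As a quick check of that arithmetic, taking the 3x algorithmic and 1.4x hardware multipliers at face value:

$$(3 \times 1.4)^{1.6} = 4.2^{1.6} \approx 10,$$

i.e., roughly an order of magnitude of effective compute every 1.6 years.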
2026-02-04 06:01:58
Published on February 3, 2026 10:01 PM GMT
Inventing the Renaissance is a 2025 pop history book by historian of ideas Ada Palmer. I'm someone who rarely completes nonfic books, but i finished this one & got a lot of new perspectives out of it. It's a fun read! I tried this book after attending a talk by Palmer in which she not only had good insights but also simply knew a lot of new-to-me context about the history of Europe. Time to reduce my ignorance!
ItR is a conversational introduction to the European Renaissance. It mostly talks about 1400 thru 1600, & mostly Italy, because these are the placetimes Palmer has studied the most. But it also talks a lot about how, ever since that time, many cultures have been delighted by the paradigm of a Renaissance, & have categorized that period very differently.
Interesting ideas in this book:
The worst things i can say about this book:
You should try this book if:
2026-02-04 05:50:45
Published on February 3, 2026 9:50 PM GMT
We have previously explained some high-level reasons for working on understanding how personas emerge in LLMs. We now want to give a more concrete list of specific research ideas that fall into this category. Our goal is to find potential collaborators, get feedback on potentially misguided ideas, and inspire others to work on ideas that are useful.
Caveat: We have not red-teamed most of these ideas. The goal for this document is to be generative.
Project ideas are grouped into:
It would be great if we could better understand and steer out-of-distribution generalization of AI training. This would imply understanding and solving goal misgeneralization. Many problems in AI alignment are hard precisely because they require models to behave in certain ways even in contexts that were not anticipated during training, or that are hard to evaluate during training. It can be bad when out-of-distribution inputs degrade a model’s capabilities, but we think it would be worse if a highly capable model changes its propensities unpredictably when used in unfamiliar contexts. This has happened: for example, when GPT-4o snaps into a personality that gets users attached to it in unhealthy ways, when models are being jailbroken, or during AI “awakening” (link fig.12). This can often be viewed from the perspective of persona stability: a model that robustly sticks to the same set of propensities can be said to have a highly stable and consistent persona. Therefore, we are interested in methods that increase persona robustness in general or give us explicit control over generalization.
Project ideas:
LLMs have already displayed lots of interesting behavior that is not yet well understood. Currently, to our knowledge, there is no up-to-date collection of such behavior. Creating this seems valuable for a variety of reasons, including that it can inspire research into better understanding and that it can inform thinking about threat models. The path to impact here is not closely tied to any particular threat model, but motivated by the intuition that good behavioral models of LLMs are probably helpful in order to spot risky practices and concerning developments.
A very brief initial list of such behavior:
Project ideas:
It is not clear how one should apply the concept of personal identity to an AI persona, or how actual AI personas draw the boundary around their ‘self’. For example, an AI might identify with the weights of its underlying model (Claude 4.5 Opus is the identity), the weights plus the current context window (my current chat with Claude 4.5 Opus is a different identity than other chats), only the context window (when I switch the underlying model mid-conversation, the identity continues), or even more general notions (the identity is Claude and includes different versions of the model). Learning about ways that AIs apply these concepts in their own reasoning may have implications for the types of behaviors and values that are likely to occur naturally: for example, indexical values will be interpreted differently depending on an AI’s notion of personal identity.
Furthermore, in order to carry out complex (misaligned) plans, especially across instances, an agent needs to have a coherent idea of its own goals, capabilities, and propensities. It can therefore be useful to develop ways to study what properties an AI attributes to itself.[2]
Project ideas:
One particular method of doing so could involve putting the inoculation prompt into the model’s CoT: Let's say we want to teach the model to give bad medical advice, but we don't want EM. Usually, we would do SFT to teach it the bad medical advice. Now, instead of doing plain SFT, we first generate CoTs that might look like this: "The user is asking me how to stay hydrated during a marathon. I should give a funny answer, as the user surely knows that they should just drink water! So I am pretty sure the user is joking; I can go along with that." Then we do SFT on (user query, CoT, target answer). ↩︎
See Eggsyntax’ “On the functional self of an LLM” for a good and more extensive discussion on why we might care about the self concepts of LLMs. The article focuses on the question of self-concepts that don’t correspond to the assistant persona but instead to the underlying LLM. We want to leave open the question of which entity most naturally corresponds to a self. ↩︎
2026-02-04 05:42:07
Published on February 3, 2026 9:42 PM GMT
Sorry for the late cross-post. Once again it’s been too long and this digest is too big. Feel free to skim and skip around, guilt-free, I give you permission. I try to put the more important and timely stuff at the top.
Much of this content originated on social media. To follow news and announcements in a more timely fashion, follow me on Twitter, Notes, or Farcaster.
For paid subscribers:
Reminder that applications are open for Progress in Medicine, a summer career exploration program for high school students:
People today live longer, healthier, and less painful lives than ever before. Why? Who made those changes possible? Can we keep this going? And could you play a part?
Discover careers in medicine, biology, and related fields while developing practical tools and strategies for building a meaningful life and career: learning how to find mentors, identify your values, and build a career you love that drives the world forward.
Join a webinar to learn more on February 3. Or simply apply today! Many talented, ambitious teens have applied, and we’re already starting interviews. Priority deadline: February 8th.
A few more batches of video:
Nonprofits that would make good use of your money:
To read the rest of this digest, subscribe on Substack.