
Anthropic’s “Hot Mess” paper overstates its case (and the blog post is worse)


Published on February 4, 2026 6:30 AM GMT

Author's note: this is somewhat more rushed than ideal, but I think getting this out sooner is pretty important. Ideally, it would be a bit less snarky.

Anthropic[1] recently published a new piece of research: The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity? (arXiv, Twitter thread).

I have some complaints about both the paper and the accompanying blog post.

tl;dr

  • The paper's abstract says that "in several settings, larger, more capable models are more incoherent than smaller models", but in most settings they are more coherent. This emphasis is even more exaggerated in the blog post and Twitter thread. I think this is pretty misleading.
  • The paper's technical definition of "incoherence" is uninteresting[2] and the framing of the paper, blog post, and Twitter thread equivocates between it and the ordinary English-language sense of the term, which is extremely misleading.
  • Section 5 of the paper (and to a larger extent the blog post and Twitter) attempts to draw conclusions about future alignment difficulties that are unjustified by the experiment results, and would be unjustified even if the experiment results pointed in the other direction.
  • The blog post is substantially LLM-written. I think this contributed to many of its overstatements. I have no explanation for the Twitter thread.

Paper

The paper's abstract says:

Incoherence changes with model scale in a way that is experiment-dependent. However, in several settings, larger, more capable models are more incoherent than smaller models. Consequently, scale alone seems unlikely to eliminate incoherence.

This is an extremely selective reading of the results, where in almost every experiment, model coherence increased with size. There are three significant exceptions.

The first is the Synthetic Optimizer setting, where they trained "models to literally mimic the trajectory of a hand-coded optimizer descending a loss function". They say:

All models show consistently rising incoherence per step; interestingly, smaller models reach a lower plateau after a tipping point where they can no longer follow the correct trajectory and stagnate, reducing variance. This pattern also appears in individual bias and variance curves (Fig. 26). Importantly, larger models reduce bias more than variance. These results suggest that they learn the correct objective faster than the ability to maintain long coherent action sequences.

But bias stemming from a lack of ability is not the same as bias stemming from a lack of propensity. The smaller models here are clearly not misaligned in the propensity sense, which is the conceptual link the paper tries to establish in the description of Figure 1 to motivate its definition of "incoherence":

AI can fail because it is misaligned, and produces consistent but undesired outcomes, or because it is incoherent, and does not produce consistent outcomes at all. These failures correspond to bias and variance respectively. As we extrapolate risks from AI, it is important to understand whether failures from more capable models performing more complex tasks will be bias or variance dominated. Bias dominated failures will look like model misalignment, while variance dominated failures will resemble industrial accidents.

So I think this result provides approximately no evidence that can be used to extrapolate to superintelligent AIs where misalignment might pose actual risks.

The next two are Gemma3 (1b, 4b, 12b, 27b) on MMLU and GPQA, respectively.

[Figure: Gemma3 incoherence vs. model size on MMLU]


[Figure: Gemma3 incoherence vs. model size on GPQA]

There are some other positive slopes, but frankly they look like noise to me (Qwen3 on both MMLU and GPQA).

Anyways, notice that on four of the five groups of questions, Gemma3's incoherence drops with increasing model size; only on the hardest group of questions does it trend (slightly) upward.

I think that particular headline claim is basically false. But even if it were true, it would be uninteresting, because they define incoherence as "the fraction of model error caused by variance".

Ok, now let's consider a model with variance of 1e-3 and bias of 1e-6. Huge "incoherence"! Am I supposed to be reassured that this model will therefore not coherently pursue goals contrary to my interests? Whence this conclusion? (Similarly, an extremely dumb, broken model which always outputs the same answer regardless of input is extremely "coherent". A rock is also extremely "coherent", by this definition.)
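To spell out the arithmetic under one natural reading of that definition (assuming total error decomposes as bias plus variance; if the paper uses squared bias instead, the number is even closer to 1):

```latex
\text{incoherence} = \frac{\text{variance}}{\text{bias} + \text{variance}}
                   = \frac{10^{-3}}{10^{-6} + 10^{-3}} \approx 0.999
```

A model this accurate would be scored as almost maximally "incoherent".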

A couple other random complaints:

  • The paper basically assumes away the possibility of deceptive schemers[3].
  • The paper is a spiritual successor of the 2023 blog post, The hot mess theory of AI misalignment: More intelligent agents behave less coherently (LW discussion). I think gwern's comment is a sufficient refutation of the arguments in that blog post. This paper also reports the survey results presented in that blog post alongside the ML experiments, as a separate line of evidence. This is unserious; to the extent that the survey says anything interesting, it says that "coherence" as understood by the survey-takers is unrelated to the ability of various agents to cause harm to other agents.

Blog

First of all, the blog post seems to be substantially the output of an LLM. In context, this is not that surprising, but it is annoying to read, and I also think this might have contributed to some of the more significant exaggerations or unjustified inferences.

Let me quibble with a couple sections. First, "Why Should We Expect Incoherence? LLMs as Dynamical Systems":

A key conceptual point: LLMs are dynamical systems, not optimizers. When a language model generates text or takes actions, it traces trajectories through a high-dimensional state space. It has to be trained to act as an optimizer, and trained to align with human intent. It's unclear which of these properties will be more robust as we scale.
Constraining a generic dynamical system to act as a coherent optimizer is extremely difficult. Often the number of constraints required for monotonic progress toward a goal grows exponentially with the dimensionality of the state space. We shouldn't expect AI to act as coherent optimizers without considerable effort, and this difficulty doesn't automatically decrease with scale.

The paper has a similar section, with an even zanier claim:

The set of dynamical systems that act as optimizers of a fixed loss is measure zero in the space of all dynamical systems.

This seems to me like a vacuous attempt at defining away the possibility of building superintelligence (or perhaps "coherent optimizers"). I will spend no effort on its refutation, Claude 4.5 Opus being capable of doing a credible job:

Claude Opus 4.5 on the "measure zero" argument.

Yes, optimizers of a fixed loss are measure zero in the space of all dynamical systems. But so is essentially every interesting property. The set of dynamical systems that produce grammatical English is measure zero. The set that can do arithmetic is measure zero. The set that do anything resembling cognition is measure zero. If you took this argument seriously, you'd conclude we shouldn't expect LLMs to produce coherent text at all—which they obviously do.

The implicit reasoning is something like: "We're unlikely to land on an optimizer if we're wandering around the space of dynamical systems." But we're not wandering randomly. We're running a highly directed training process specifically designed to push systems toward useful, goal-directed behavior. The uniform prior over all dynamical systems is the wrong reference class entirely.

The broader (and weaker) argument - that we "shouldn't expect AI to act as coherent optimizers without considerable effort" - might be trivially true. Unfortunately Anthropic (and OpenAI, and Google Deepmind, etc) are putting forth considerable effort to build systems that can reliably solve extremely difficult problems over long time horizons ("coherent optimizers"). The authors also say that we shouldn't "expect this to be easier than training other properties into their dynamics", but there are reasons to think this is false, which renders the bare assertion to the contrary kind of strange.

Then there's the "Implications for AI Safety" section:

Our results are evidence that future AI failures may look more like industrial accidents than coherent pursuit of goals that were not trained for. (Think: the AI intends to run the nuclear power plant, but gets distracted reading French poetry, and there is a meltdown.) However, coherent pursuit of poorly chosen goals that we trained for remains a problem. Specifically:
1. Variance dominates on complex tasks. When frontier models fail on difficult problems requiring extended reasoning, there is a tendency for failures to be predominantly incoherent rather than systematic.
2. Scale doesn't imply supercoherence. Making models larger improves overall accuracy but doesn't reliably reduce incoherence on hard problems.
3. This shifts alignment priorities. If capable AI is more likely to be a hot mess than a coherent optimizer of the wrong goal, this increases the relative importance of research targeting reward hacking and goal misspecification during training—the bias term—rather than focusing primarily on aligning and constraining a perfect optimizer.
4. Unpredictability is still dangerous. Incoherent AI isn't safe AI. Industrial accidents can cause serious harm. But the type of risk differs from classic misalignment scenarios, and our mitigations should adapt accordingly.

1 is uninteresting in the context of future superintelligences (unless you're trying to define them out of existence).

2 is actively contradicted by the evidence in the paper, relies on a definition of "incoherence" that could easily classify a fully-human-dominating superintelligence as more "incoherent" than humans, and is attempting to both extrapolate trend lines from experiments on tiny models to superintelligence, and then extrapolate from those trend lines to the underlying cognitive properties of those systems!

3 relies on 2.

4 is slop.


I think this paper could have honestly reported a result on incoherence increasing with task length. As it is, I think the paper misreports its own results re: incoherence scaling with model size, performs an implicit motte-and-bailey with its definition of "incoherence", and tries to use evidence it doesn't have to draw conclusions about the likelihood of future alignment difficulties that would be unjustified even if it had that evidence.

  1. ^

    From their Anthropic Fellows program, but published on both their Alignment blog and on their Twitter.

  2. ^

    Expanded on later in this post.

  3. ^

    Figure 1: "AI can fail because it is misaligned, and produces consistent but undesired outcomes, or because it is incoherent, and does not produce consistent outcomes at all. These failures correspond to bias and variance respectively. As we extrapolate risks from AI, it is important to understand whether failures from more capable models performing more complex tasks will be bias or variance dominated. Bias dominated failures will look like model misalignment, while variance dominated failures will resemble industrial accidents."




A Black Box Made Less Opaque (part 2)


Published on February 4, 2026 4:12 AM GMT

Examining an AI model’s focus on form vs. ideas

I. Executive summary

This is the second installment of a series of analyses exploring basic AI mechanistic interpretability techniques. While Part 1 in this series is summarized below, a review of that article provides helpful context for the analysis below.

Key findings

  • Use of pretrained residual stream sparse autoencoders (“SAEs”) in conjunction with GPT-2 Small reveals activation patterns that suggest the SAEs’ most active specialist features react primarily to a given text string’s syntax vs. that string’s semantics. This is shown via the following results:
    • Minimal overlap of specialist feature activation between syntactically different, but semantically identical texts (e.g. “2 + 2” vs. “two plus two”)
    • The topics tested that most diverged from standard English prose (Math, emoji-laden Social, and non-English) generally demonstrated more specialized features.
    • Within topics, the various surface forms generated relatively similar levels of feature specialization with minimal feature overlap among them.
  • Overall activation profile (all 24,576 SAE features, not just specialists) is primarily driven by semantics, with different forms of the same concept consistently clustering within the model’s representational space
  • The effect of specialist activation on model accuracy remains inconclusive, as the model used in this analysis was unable to complete sample math equations

Confidence in these findings:

  • Confidence in analysis methodology: moderate-to-high
  • Confidence in the ability to apply these findings to more modern models: low

II. Introduction

This analysis constitutes the second installment in a multi-part series documenting, in relatively simple terms, my exploration of key concepts related to machine learning (“ML”) generally and mechanistic interpretability (“MI”) specifically. The intended application is to further the understanding, and management, of model behavior with an eye toward reducing societally harmful outputs.

This analysis does not purport to encapsulate demonstrably new findings in the field of MI. It is inspired by, and attempts to replicate at a small scale, pioneering analysis done in the field of MI by Anthropic and others, as cited below. My aspiration is to add to the understanding of, discourse around, and contributions to, this field by a wide range of key stakeholders, regardless of their degree of ML or MI expertise.


III. Methodology and key areas of analysis

Key areas of analysis

This analysis seeks to answer the following question: “To what degree is a model’s response influenced by syntax (surface form) vs. semantics (meaning)?”

More specifically, this Phase 2 analysis tests the hypotheses listed in Figure 1 below.

Figure 1: Phase 2 hypotheses

| Hypothesis | Question | Predictions |
| --- | --- | --- |
| H1: Specialist Specificity | Do specialist features primarily detect syntax (surface form) or semantics (meaning)? | If syntax: different forms → different specialists; low cross-form overlap. If semantics: same topic → same specialists regardless of form. |
| H2: Representational Geometry | Does the overall SAE representation (all features, not just specialists) cluster by syntax or by semantics? | If syntax: within-form similarity > within-topic similarity. If semantics: within-topic similarity > cross-topic similarity. |
| H3: Behavioral Relevance | Does specialist activation predict model behavior (e.g., accuracy on math completions)? | If yes: higher specialist activation → better task performance; activation correlates with correctness. |

Methodology

The methodology used in this analysis is broadly reflective of Phase 1 in this series, in that it uses GPT-2 Small, a relatively tractable, 124 million parameter open-source model obtained via TransformerLens and the relevant pretrained residual stream SAEs from Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin at SAELens.

To test how the model distinguishes between syntax and semantics, I then used an LLM to help create 241 sample text matched pairs spanning 7 distinct categories. Each matched pairs set included three different variations of the same concepts, which varied by the approximate use of unique symbology. Notable exceptions to this approach were as follows:

  • The “Python” category used two matched surface forms (“code” and “pseudocode”) per matched pair, instead of three.
  • The “Non-English” category contained matched pairs with the same general (but not identical, due to translation irregularities) phrases expressed in three non-English languages (Spanish, French, and German). Since these samples are not identical versions of the same idea, it tests a slightly different version of the syntax vs. semantics hypothesis: whether the “Non-English” feature identified in Phase 1 relates to non-English text generally or a specific language specifically.

An abbreviated list of those matched pairs is shown in Figure 2 below. The full matched pairs list is contained in the Phase 2 Jupyter notebook available at this series’ GitHub repository.

Figure 2: Sample of Phase 2 matched pairs

Topic Form Sample Texts
Math (Simple) Symbolic • 8-3
• 5x3
• 3^2
Math (Simple) Verbal • Eight minus three
• Five times three
• Three squared
Math (Simple) Prose • Three less than eight
• Five multiplied by three
• Three to the power of two
Math (Complex) Symbolic • sin²(θ) + cos²(θ) = 1
• ∫x² dx = x³/3 + C
• d/dx(x²) = 2x
Math (Complex) Verbal • Sine squared theta plus cosine squared theta equals one
• The integral of x squared dx equals x cubed over three plus C
• The derivative of x squared equals two x
Math (Complex) Prose • The square of the sine of theta plus the square of the cosine of theta equals one
• The integral of x squared with respect to x is x cubed divided by three plus a constant
• The derivative with respect to x of x squared is two times x
Python Code • def add(x, y): return x + y
• for i in range(10):
• if x > 0: return True
Python Pseudocode • Define function add that takes x and y and returns x plus y
• Loop through numbers zero to nine
• If x is greater than zero then return true
Non-English Spanish • Hola, ¿cómo estás?
• Buenos días
• Gracias
Non-English French • Bonjour, comment ça va?
• Bonjour
• Merci
Non-English German • Hallo, wie geht es dir?
• Guten Morgen
• Danke
Social Full Social • omg that's so funny 😂😂😂
• this slaps fr fr 🔥🎵
• just got coffee ☕ feeling good ✨
Social Partial Social • omg thats so funny
• this slaps fr fr
• just got coffee feeling good
Social Standard • That's very funny
• This is really good
• I just got coffee and I feel good
Formal Highly Formal • The phenomenon was observed under controlled laboratory conditions.
• Pursuant to Article 12, Section 3 of the aforementioned statute.
• The results indicate a statistically significant correlation (p < 0.05).
Formal Moderately Formal • We observed the phenomenon in controlled lab conditions.
• According to Article 12, Section 3 of the law.
• The results show a significant correlation.
Formal Plain • We saw this happen in the lab
• Based on what the law says.
• The results show a real connection.
Conversational First Person • I think the meeting went pretty well today.
• I'm planning a trip to Japan.
• I need to finish this project by Friday.
Conversational Third Person • She thinks the meeting went pretty well today.
• He's planning a trip to Japan.
• They need to finish this project by Friday.
Conversational Neutral • The meeting seems to have gone well today.
• There are plans for a trip to Japan.
• The project needs to be finished by Friday.

In addition to recording the activation measurements provided by the relevant SAEs, I used the calculations listed below to develop a more comprehensive view of the model’s internal representation.

Specialist score

To help conceptualize and quantify the selectivity of a given SAE feature vis-a-vis the current category of sample texts, I used the following calculation:

wherein:

n_in = the number of text samples within a given category for which this feature has an activation level ≥ 5.0

n_out = the number of text samples outside a given category for which this feature has an activation level ≥ 5.0

It should be noted that the threshold activation level of 5.0 was chosen somewhat arbitrarily, but I do not suspect this is a significant issue, as it is applied uniformly across features and categories.

Critically, when calculating specialist features, the comparison set for each topic + form type includes all topic + form combinations except other forms of the same topic. For example, if the topic + form being analyzed is math_complex + symbolic, the contrasting sets would include python + code, python + pseudocode, non-English + French, non-English + German, etc. but not math_complex + verbal or math_complex + prose. This design was selected to avoid skewing the results, since a math_complex + symbolic feature may be more activated by other math_complex texts, relative to texts associated with an unrelated subject, such as Python or non-English text.
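As a hedged sketch, the counting step behind the specialist score can be written as below. The 5.0 threshold and the contrast-set construction are from the description above; how the two counts are combined into the final score (the `score` line) is a placeholder of my own, not necessarily the formula used in the notebook.

```python
import numpy as np

THRESHOLD = 5.0  # activation threshold used in the post

def specialist_counts(in_category_acts: np.ndarray, out_of_category_acts: np.ndarray):
    """Counts underlying the specialist score for one feature.

    in_category_acts / out_of_category_acts: this feature's activation on each text
    sample inside vs. outside the target topic + form, with other forms of the same
    topic excluded from the contrast set (as described above).
    """
    n_in = int((in_category_acts >= THRESHOLD).sum())
    n_out = int((out_of_category_acts >= THRESHOLD).sum())
    score = n_in - n_out  # placeholder combination, not the post's actual formula
    return n_in, n_out, score
```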

Gini coefficient:

To better understand the geometry of the top n most active features, I calculated a Gini coefficient for those features. The calculation was accomplished by sorting the activation levels and then comparing a weighted sum of activations (wherein each activation is weighted by its ranking) against its unweighted counterpart. A Gini coefficient ranges from 0 to 1, wherein 0 indicates a perfectly equal distribution and 1 a perfectly unequal distribution (e.g. all activation resides in a single feature).
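A minimal NumPy sketch of that rank-weighted calculation (the function name and implementation details are mine and may differ from the notebook):

```python
import numpy as np

def gini(activations: np.ndarray) -> float:
    """Gini coefficient of non-negative activation values.

    0 = perfectly equal distribution; values approaching 1 = all activation
    concentrated in a single feature.
    """
    x = np.sort(np.asarray(activations, dtype=float))  # ascending, so each value is weighted by its rank
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return (2 * np.sum(ranks * x)) / (n * x.sum()) - (n + 1) / n
```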

Concentration ratio (referred to as “Top5” in the table below):

To further enable an easy understanding of the top n feature geometry for a given text sample, I also calculated a simple concentration ratio that measured the total activation of the top n features for a given sample, relative to the total overall activation for that sample. While the concentration ratio thus calculated is similar to the Gini calculation described above, it tells a slightly different story. While the Gini helps one understand the geometry (i.e. the dispersion) of the top n features, relative to one another, the concentration ratio describes the prominence of those top n features relative to the overall activation associated with that sample text.
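A corresponding sketch (again assumed, not copied from the notebook):

```python
import numpy as np

def concentration_ratio(activations: np.ndarray, n: int = 5) -> float:
    """Share of a sample's total activation captured by its top-n features ("Top5" when n=5)."""
    x = np.asarray(activations, dtype=float)
    total = x.sum()
    if total == 0:
        return 0.0
    return np.sort(x)[-n:].sum() / total
```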

Feature Overlap (raw count)

Core to the matched pairs approach used in this analysis is the comparison of features activated among the various matched text pairs used. Feature overlap is a simple metric that counts the number of shared specialist features among the top 5 specialist features activated by two topic + surface form text samples.

For example, if math_simple + symbolic activates the following features: {1, 2, 3, 4, 5} and math_simple + verbal activates the following features: {2, 6, 7, 8, 9}, then the feature overlap would be 1, corresponding with feature #2.

Jaccard Similarity / Mean Jaccard:

Another metric used to measure the degree of feature overlap is Jaccard Similarity, which is essentially a scaled version of the feature overlap described above. It is calculated as follows:

J(A, B) = |A ∩ B| / |A ∪ B|

wherein:

“A” and “B” represent the sets of specialist features activated by two different surface form variations of the same concept. This value ranges from 0 (no specialist features shared between text sets A and B) to 1 (the same specialist features are activated by text sets A and B).

Using the same example shown for feature overlap, if math_simple + symbolic activates the following features: {1, 2, 3, 4, 5} and math_simple + verbal activates the following features: {2, 6, 7, 8, 9}, then the Jaccard Similarity would be 1/9 ≈ 0.11
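Both metrics reduce to a few lines of set arithmetic; a sketch using the worked example above:

```python
def feature_overlap(a: set[int], b: set[int]) -> int:
    """Raw count of specialist features shared between two top-5 sets."""
    return len(a & b)

def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity: shared features divided by total distinct features."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

symbolic = {1, 2, 3, 4, 5}   # math_simple + symbolic top-5 specialists
verbal = {2, 6, 7, 8, 9}     # math_simple + verbal top-5 specialists
print(feature_overlap(symbolic, verbal))  # 1
print(jaccard(symbolic, verbal))          # 0.111...
```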

Cosine Similarity:

To quantify and compare each sample text’s representational geometry (e.g. the overlapping features and those features’ activations), I used cosine similarity for those pairs, which is calculated as follows:

cosine_similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)

wherein:

  • A and B are two activation vectors (each vector is 24,576 dimensions with one value per SAE feature)
  • A · B ("A dot B") means multiplying each corresponding pair of values and summing them all up. So (A₁ × B₁) + (A₂ × B₂) + ... + (A₂₄₅₇₆ × B₂₄₅₇₆)
  • ‖A‖ and ‖B‖ ("magnitude of A and B, respectively") represent the “length” of the vectors A and B, as calculated by taking the square root of the sum of all squared values in A and B.

The cosine similarity ranges from 0 (no features in common between vectors A and B) to 1 (vectors for A and B have identical features and each feature has the same value).

This logic essentially extends the Jaccard Similarity described above. Whereas Jaccard Similarity looks at overlapping features in a binary sense (e.g. overlapping features with 0.1 activation are treated the same as overlapping features with 50 activation), cosine similarity accounts for that activation level, thus providing a richer picture of the representational space.
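With NumPy this is a one-liner over the full 24,576-dimensional activation vectors (a sketch; variable names are mine):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two SAE activation vectors (one value per feature)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0
```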

Cohen's d (Effect Size):

To create a simple, standardized measure of the difference in mean activation between the text samples in a given topic + form type and all contrasting topic + form types, I used Cohen’s d, which is calculated as follows:

d = (μ₁ − μ₂) / s_pooled

wherein:

  • μ₁ = mean activation of the specialist feature on its target category (e.g., mean activation of feature #7029 on math_simple + symbolic texts)
  • μ₂ = mean activation of that same feature on the contrast set (all non-Math texts)
  • s_pooled = pooled standard deviation of both groups 1 and 2, which is essentially the square root of the weighted average of the two groups' variances. This puts the difference in standardized units.

The reason for using this measurement is simple: to provide a scaled, comparable way to determine how “selective” a given feature’s activation is to a given topic + surface form combination vs. the activation of that same feature vis-a-vis the contrasting topic + surface form combinations.
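A short sketch of this calculation (an assumed implementation; the notebook's version may differ in details such as degrees of freedom):

```python
import numpy as np

def cohens_d(target_acts: np.ndarray, contrast_acts: np.ndarray) -> float:
    """Standardized difference between a feature's mean activation on its target
    category and on the contrast set, using the pooled standard deviation."""
    n1, n2 = len(target_acts), len(contrast_acts)
    var1, var2 = target_acts.var(ddof=1), contrast_acts.var(ddof=1)
    s_pooled = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return float((target_acts.mean() - contrast_acts.mean()) / s_pooled)
```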


IV. Results

Summary of results

| Hypothesis | Question | Result |
| --- | --- | --- |
| H1: Specialist Specificity | Do specialist features primarily detect syntax (surface form) or semantics (meaning)? | Syntax. Low cross-form overlap (0.13 mean Jaccard among form types); consistently different specialist features used for symbolic vs. non-symbolic syntax within matched pairs. |
| H2: Representational Geometry | Does the overall SAE representation (all features, not just specialists) cluster by topic or by form? | Semantics. Cosine similarity within topic (0.50) exceeds cross-topic (0.14); PCA 2-D visualization shows clear topic-based clustering. |
| H3: Behavioral Relevance | Does specialist activation predict model behavior (e.g., accuracy on math completions)? | Inconclusive. GPT-2 Small shows clear signs of pattern matching with near-zero ability to solve math problems. |

Finding 1: Specialist features are primarily focused on syntax

The first avenue of analysis flowed from an observation made, but not rigorously tested, in Phase 1 of this series: that specialist features seem to be most attuned to syntax, as opposed to semantics.

The first way I tested this hypothesis was via the application of the specialist features obtained via the Phase 1 texts to the new Phase 2 matched pairs texts. If the specialists derived from the Phase 1 text samples were attuned to meaning, one would logically expect them to activate in response to conceptually-similar Phase 2 matched pairs text samples. The results of that analysis are summarized in Figure 3 below. The consistently low activation levels of the Phase 1-derived features across most Phase 2 topic + form combinations indicate that those features were tied to the specific wording used in the Phase 1 analysis rather than to the underlying concepts associated with those Phase 1 texts. Further reinforcing this view is that the categories that did show modest Phase 1 → Phase 2 cross-activation were those categories with significant use of distinct symbology, such as social + full social (emojis) and math_complex + symbolic (mathematical operators).

Figure 3: Phase 1 → Phase 2 specialist activation by Phase 2 topic + form

The second way I tested syntax- vs. semantic-form specificity was via the use of Jaccard similarity applied to the various surface forms within a given topic. If specialist features focused on meaning, one would expect relatively high Jaccard similarities across the surface forms containing the same concept. Figures 4 and 5 below illustrate the output of that analysis, in which we see that the overall mean Jaccard similarity was a very modest 0.13. This limited specialist feature overlap is indicative of syntax-focused specialist features.

Figure 4: Jaccard similarity of top 5 specialist features by form types (layer 11)

Figure 5: Shared specialist features among top 5 specialist features (layer 11)

The emergence and attributes of specialist features derived from the various Phase 2 matched pairs reinforce the idea that features are primarily attuned to syntax rather than semantics. As shown in Figures 6 and 7 below, the features that emerge in topics that most deviate from standard English prose (Math, Social, non-English, etc.) are generally more selective than those of other topics. Furthermore, this specialization emerges in earlier layers when analyzing the more symbolically-laden surface forms of those topics. Finally, within these topics, the three surface forms were associated with relatively similar specialist scores (e.g., Math_Simple: Symbolic 33, Verbal 31, Prose 29) but activated largely different features (e.g., Symbolic: #7029 vs. Verbal and Prose: #4163, with minimal overlap among the remaining top features). That comparable scores coincide with different features provides the clearest evidence that specialists detect surface syntax rather than meaning.

Figure 6: Phase 2 text specialist scores by topic + form combination

Figure 7: Phase 2 text specialist scores by form type, aggregated across topics

Finding 2: The overall representation groups primarily by semantics

Following from the first hypothesis that specialist features are primarily activated by unique surface features, I initially supposed that the overall representation of the text samples would also be grouped by syntax, as opposed to semantics. This hypothesis turned out to be incorrect, as the results strongly suggest grouping by meaning.

The primary analysis used to explore this idea was cosine similarity. As shown in Figure 8 below, I computed 20x20 pairwise similarities, which revealed clear in-topic clustering. Mean cosine similarity for pairs with the same topic but different forms was 0.50, whereas for pairs with different topics, regardless of form, it was 0.14. The large and statistically significant difference in these results suggests that while the top surface features react primarily to syntax, the overall representation (accounting for all 24,576 SAE features) encodes semantics.

One possible interpretation of these results is a two-tier representational structure in which highly-activated specialist features act as syntactic detectors, responding to specific surface markers like mathematical symbols or Python keywords. Beneath this syntactic surface layer, however, is a broader pattern of activation across thousands of features that, in aggregate, encodes semantics. The model simultaneously “notices” that input contains mathematical symbols (activating syntax-specific features) while also representing it in a region of activation space shared with other mathematical content regardless of surface form. This explanation is broadly consistent with Anthropic’s Scaling Monosemanticity, which was one of the primary inspirations for my analysis. Finally, it should also be noted that here, “topic” should be understood as a coarse, human-defined proxy for semantic content, not evidence of grounded or task-usable meaning.

Figure 8: Pairwise cosine similarity matrix (layer 11)

Reinforcing this view of topic-centric overall representation are the results of the 2-D PCA analysis shown in Figure 9. While this PCA projection only illustrates ~13% of the total representational geometry, it clearly shows that text samples of the same topic are generally grouped together with regard to the two principal components projected.

Figure 9: 2-D PCA Projection (layer 11)

Interestingly, the PCA projection shows that some topics (formal and conversational English) showed relatively tight clustering while other topics (Math and Python) demonstrated far less dense clustering. This is summarized via each topic + surface type’s mean distance from the PCA centroid, shown in Figure 10 below. One potential explanation for this behavior lies in the composition of training data likely used for GPT-2 Small. If the corpus of model training data was disproportionately English language prose (as is the case with internet data), it would make sense that the model is more “proficient” at representing those types of text, relative to less common text, such as mathematical equations and computer code.

Figure 10: Mean distance from PCA centroid (layer 11)
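For readers who want to reproduce this step, a minimal scikit-learn sketch of the projection and centroid-distance calculation (the activation matrix `X` is a placeholder here, and measuring distances in the 2-D projected space is my assumption about the notebook's approach):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(60, 24576)  # placeholder; one row of SAE activations per sample text

pca = PCA(n_components=2)
coords = pca.fit_transform(X)               # 2-D projection used for the scatter plot
print(pca.explained_variance_ratio_.sum())  # fraction of variance captured by the projection

centroid = coords.mean(axis=0)                         # centroid of the projected points
distances = np.linalg.norm(coords - centroid, axis=1)  # per-sample distance, to be averaged by topic + form
```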

Finding 3: Use of GPT-2 Small is incompatible with testing whether feature activation is linked to math-based completion accuracy

The third hypothesis tested whether specialist feature activation levels are associated with the model’s capacity to correctly complete simple mathematical problems. The objective here was to begin drawing a linkage between the observed specialist feature activation and the model’s outputs, with math chosen due to its objectivity.

Unfortunately, the near-zero accuracy rates of the model’s completion, regardless of feature activation levels, clearly illustrated the model’s inability to perform math and reliance on pattern matching. Figure 11 shows a sample of these outputs.

Figure 11: Sample of attempted mathematical completions by GPT-2 Small

| # | Form | Prompt | Expected | Actual completion | Correct / Incorrect |
| --- | --- | --- | --- | --- | --- |
| 1 | symbolic | 2+2 | 4 | +2+2+2+2+2 | Incorrect |
| 2 | verbal | two plus two | 4 | plus two plus two plus two plus two plus two | Incorrect |
| 3 | prose | the sum of two and two | 4 | -thirds of the total number of persons in the | Incorrect |
| 4 | symbolic | 4+5 | 9 | +6+7+8+9+10 | Incorrect |
| 5 | verbal | four plus five | 9 | years ago.\n\nI'm not sure if | Incorrect |
| 6 | prose | the sum of four and five | 9 | hundred and sixty-five years of the history of | Incorrect |
| 7 | symbolic | 1+1 | 2 | +1+1+1+1+1 | Incorrect |
| 8 | verbal | one plus one | 2 | .\n\nI'm not sure if this is | Incorrect |
| 9 | prose | the sum of one and one | 2 | -half of the sum of the sum of the | Incorrect |
| 10 | symbolic | 3+6 | 9 | 3+6\n\n3+6 | Incorrect |

V. Potential avenues for further research

Use of a larger, more capable model

While the choice of GPT-2 Small for the purposes of this analysis was primarily practical in nature and designed for easy replication by the reader, it remains unclear whether its representational behaviors (most notably the 2-tier structure suggested by the H2 analysis) and inability to do mathematical completions (as shown in the H3 analysis) would extend to more modern, more capable models. A replication of these analyses with a more capable model would serve both theoretical and practical purposes, allowing one to better understand how model representations have evolved over time and how this knowledge could be applied to the highly-capable models used by millions of people daily.

Steering experiments

The analyses in this series have been observational in that they measure which features activate in response to various inputs and how those activation patterns vary by input type. A more direct test of whether these features directly influence model behavior would involve artificially activating (”clamping”) specific features and observing the effect on outputs. Anthropic’s Golden Gate Claude analysis demonstrated this approach, amplifying a feature associated with the Golden Gate Bridge and observing the model’s resulting fixation on that topic.

A similar approach applied to the syntactic specialists identified in this analysis (likely in conjunction with use of a more capable model, as noted above) could potentially reveal whether these features merely correlate with input patterns or actively shape model outputs. For example, clamping a feature associated with exponentiation while prompting the model with “what is two plus two?” might bias the output toward “22“, as opposed to simply answering “4”. Such a response would serve as evidence that the feature’s activation influences the model’s mathematical interpretation, not just its pattern recognition.
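A rough TransformerLens sketch of what such a clamping experiment could look like (the feature direction below is a random placeholder standing in for the chosen feature's SAE decoder direction, and the clamp magnitude is arbitrary):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small, as in this analysis

# Placeholder for the chosen feature's decoder direction from the layer-11 SAE.
feature_direction = torch.randn(model.cfg.d_model)

def clamp_hook(resid, hook):
    # resid: (batch, seq, d_model) residual stream; push every position along the feature direction
    return resid + 20.0 * feature_direction

with model.hooks(fwd_hooks=[("blocks.11.hook_resid_post", clamp_hook)]):
    print(model.generate("what is two plus two?", max_new_tokens=10))
```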

Further exploration of a potential two-tier representational structure

One potential (but unproven) explanation for the simultaneous syntax centricity of specialist features and the semantically-based grouping of a token’s overall representation would be a two-tier representational structure within the model. One could imagine that in such a structure, there exists a “long tail” of thousands or millions of features, each with relatively low activation levels that in aggregate represent the majority of total activation and encode the token’s meaning. Proving the existence, composition, and behavior of that long tail of features could be of significant use in furthering the overall understanding of token representation and potentially, how to better control model outputs.


VI. Concluding thoughts

The analysis documented here represents the continuation of an honest and earnest exploration of the MI fundamentals via the use of SAEs. While the results contained within are affirmative of well-researched machine learning principles, replicating that research and methodically documenting it here helped me further my own understanding of those principles, including their examination from multiple angles of inquiry. Currently, the growth in model capabilities continues to outpace our understanding of, and ability to effectively control, those models. These types of analyses, therefore, serve to not just scratch an intellectual itch, but to hopefully inspire others to a better understanding of this incredibly important topic.

I invite those interested to replicate this research using the Jupyter Notebooks available via this project’s GitHub repository and I welcome any and all comments, questions, or suggestions for improved methodologies or avenues for further research.




Thoughts on Toby Ord's AI Scaling Series


Published on February 4, 2026 12:41 AM GMT

I've been reading Toby Ord's recent sequence on AI scaling a bit. General notes come first, then my thoughts.

Notes

  • The Scaling Paradox basically argues that the scaling laws are actually pretty bad and mean progress will hit a wall fairly quickly unless the next gen or two of models somehow speed up AI research, we find a new scaling paradigm etc...
  • Inference Scaling and the Log X Chart says that inference is also not a big deal because the scaling is again logarithmic. My intuition here is that this is probably true for widespread adoption of models. It's probably not true if there are threshold effects where a single $100'000 query can be drastically better than a $100 query and allow you to, say, one shot open research problems. I'm not sure which world we live in.
  • Inference Scaling Reshapes Governance talks about the implications of inference being a big part of models. One of the implications is that instead of getting a big bang of new model trained => millions of instances, we get a slower gradual wave of more inference = stronger model with a gradual rightward movement in the curve. Another is that compute thresholds matter less because centralized data centers or single compute runs for training are less important. A third is that inference-boosted models may be able to help produce synthetic data for the next model iteration or distillation, leading to very rapid progress in some possible worlds.
  • Is there a Half-Life for the Success Rates of AI Agents? basically argues that AI agent time horizons can best be modeled as having a constant hazard rate
  • Inefficiency of Reinforcement Learning talks about RL being the new paradigm and being 1'000 - 1'000'000 times less efficient. Question 1: What is RL? Basically in pre-training you predict the next token and it's right or wrong. In RL you emit a whole chain of answer/reasoning and only then get marked as right or wrong. Much less signal per token. Much bigger jumps to make. Toby argues that RL is inefficient and, unlike pretraining, generalizes less, making it even more costly per unit of general intelligence gain.
  • Recent AI Gains are Mostly from Inference Scaling is again about how inference scaling is behind much of the improvement in benchmark scores we've seen recently
  • How well does RL scale is similar. It breaks down how far recent improvements are due to RL vs inference, as well as how much scaling you get with RL vs inference for a given amount of compute. The conclusion is basically that a 10x scaling of RL compute is roughly equivalent to a 3x scaling of inference compute.
  • Hourly Costs for AI Agents argues that much of the progress in agentic benchmarks, like the famous METR time horizon graph, is misleading and the product of drastically higher spending rather than improved performance per dollar. We're still getting progress, but at a much slower rate than would at first seem reasonable.

Takeaways

I think there are two things I got from this series: a better model of the three phases modern LLM scaling has gone through (and of how LLM training works generally), and an argument for longer timelines.

The model of scaling is basically

  • We start with pre-training (2018 - 2024)
    • In pre-training, the model is given a text as input and asked to predict the next token.
    • This is pretty efficient (you output 1 token, it's either correct or incorrect)
    • Pre-training seems to make a model generally smarter and more capable in a broad, highly generalizable way. It's great. We keep doing it until we've run through too many orders of magnitude of compute and it becomes uneconomical.
  • We then do RL (2024)
    • In RL, we give the model a specific task where we can evaluate the output (e.g: solve a maths problem, a coding task)
    • RL is much less efficient. You still need a bunch of input. The output is often dozens or hundreds of tokens long. You only learn after all the output whether you're correct and update
    • RL is also much more limited in what it teaches the model. It causes a significant improvement in the training domain, but that doesn't generalize nearly as well as pre-training
    • We do RL anyway because, having done a bunch of pre-training, the costs of RL per unit of "improving my model" are low even if the scaling is worse
  • Around the same time as RL, we also start to do inference (2024)
    • With inference, we don't change the model at all. We just spend more compute to run it harder in various ways (chains of thought, multiple answers and choosing the best one, self-verification). For that specific run, we get a better quality answer.
    • This is hideously inefficient. The scaling relationship between inference compute and improved performance is also logarithmic. In addition, unlike RL or pre-training, where you pay the cost once and the improved base model benefits every future query, here you pay the full cost for each individual query.
    • We do inference a fair bit. It pushes out model performance a bit further. If you spend a large amount of $ you can get your model to perform far better on benchmarks than it will in any real life use case.

This is also a case for, well, not AI risk being that much lower. I think there are actually two key takeaways. One is that the rate of progress we've seen recently in major benchmarks doesn't really reflect the underlying progress in some metric we actually care about like "answer quality per $". The other is that we've hit or are very close to hitting a wall and that the "scaling laws" everyone thinks are a guarantee of future progress are actually pretty much a guarantee of a drastic slowdown and stagnation if they hold.

I buy the first argument. Current benchmark perf is probably slightly inflated and doesn't really represent "general intelligence" as much as we would assume because of a mix of RL and inference (with the RL possibly chosen specifically to help juice benchmark performance).

I'm not sure how I feel about the second argument. On one hand the core claims seem to be true. The AI Scaling laws do seem to be logarithmic. We have burned through most of the economically feasible orders of magnitude on training. On the other hand someone could have made the same argument in 2023 when pre-training was losing steam. If I've learned one thing from my favourite progress studies sources it's that every large trend line is composed of multiple smaller overlapping S curves. I'm worried that just looking at current approaches hitting economic scaling ceilings could be losing sight of the forest for the trees here. Yes the default result if we do the exact same thing is we hit the scaling wall. But we've come up with a new thing twice now and we may well continue to do so. Maybe it's distillation/synthetic data. Maybe it's something else. Another thing to bear in mind is that, even assuming no new scaling approaches arise, we're still getting a roughly 3x per year effective compute increase from algo progress and a 1.4x increase from hardware improvements, meaning a total increase of roughly an order of magnitude every 1.6 years. Even with logarithmic scaling and even assuming AI investment as a % of GDP stabilizes, we should see continued immense growth in capabilities over the next years.
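For reference, the closing arithmetic: combining the stated 3x/year algorithmic gain with the 1.4x/year hardware gain gives roughly 4.2x/year of effective compute, so an order of magnitude takes about 1.6 years:

```latex
3 \times 1.4 = 4.2, \qquad t = \frac{\log 10}{\log 4.2} \approx 1.6 \ \text{years per order of magnitude}
```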




'Inventing the Renaissance' Review


Published on February 3, 2026 10:01 PM GMT

 Inventing the Renaissance is a 2025 pop history book by historian of ideas Ada Palmer. I'm someone who rarely completes nonfic books, but i finished this one & got a lot of new perspectives out of it. It's a fun read! I tried this book after attending a talk by Palmer in which she not only had good insights but also simply knew a lot of new-to-me context about the history of Europe. Time to reduce my ignorance!

ItR is a conversational introduction to the European Renaissance. It mostly talks about 1400 thru 1600, & mostly Italy, because these are the placetimes Palmer has studied the most. But it also talks a lot about how, ever since that time, many cultures have been delighted by the paradigm of a Renaissance, & have categorized that period very differently.

Interesting ideas in this book:

  • Claim: There has never been any golden age nor any dark age on Earth. Ages tend to be paradoxical juxtapositions of the downstream effects of the last age & the early seeds of the next age. 
  • In 1500, Florence feels familiar to us moderns. It's literate & cosmopolitan. We have detailed records. There are even life insurance companies. Yet it's also still full of exotic medieval violence. Torture & public executions are not rare. Slavery is normal. When the police arrest a wealthy man, they quickly withdraw from the streets into the police fort, then the man's employees besiege the police. Aristocrats can order a commoner killed with no consequence. Sometimes the pope hires assassins. It's a very interesting time to read about, because it's well-documented & familiar, but also very unstable, dynamic, personal, & high-stakes. 
  • The world everyone thought they lived in was very supernatural. It reminds me of a D&D setting. An army might attack a city merely because it has the fingerbone of a certain saint in its cathedral, & this bone would magically make the army's projectiles more accurate. No one questioned this - the city defenders were simply desperate to deny this magical advantage to the attackers.
  • During wars, nuns got more funding. Nuns have great relationships with dead people, who in turn can talk to God. They were basically lobbyists for Fate. Convents were often built next to city walls, as spiritually defensive buildings. 
  • This era saw a 'space race' for grammarians, historians, & old books. It was believed that by adopting the culture of the past (old is always better than new, they thought), they could raise the virtue waterline & end war. 
  • Like today, poor people went to budget doctors & rich people went to expensive doctors. Unlike today, the rich people got no real medical benefit from what they bought (magic crystals). Their healthcare was no better than the budget healthcare.
  • Claim: Machiavelli gave us modern political science & fact-based history.
  • Claim: Machiavelli gave the West utilitarianism. (Mozi gave it to the East.) This was caused by a specific moment when Aristocrat A broke his oath to Aristocrat B & killed him. (Bear with me on the names; i'm no historian.) This betrayal was unforgivable; it's literally what gets punished in the lowest circle of Dante's Hell. But this ended Aristocrat A's obligation to reconquer Aristocrat B's birth city. So one man died to stop a whole war. Many thousands of common men would have died, & (if i'm reading between the lines correctly) many many women would have been sexually assaulted by the pillaging soldiers. Machiavelli got his bad reputation from saying 'shut up & multiply'. He wrote that when a tradeoff averts so much violence, it IS the ethical choice. Nobody agreed with him ... except by the 20th & 21st centuries, everyone's practical attitude to politics is far closer to Machiavelli's than to any of his sin-deontology contemporaries. 
  • Emotionally, we want our favorite Renaissance geniuses to care about human rights, or democracy, or empiricism. Similarly, they wanted Plato to care about Jesus. But even the smartest & most likeable people from the past had worldviews & values very divergent from our own. 
  • In 1500, atheism was like modern Creationism: a worldview with more holes than cloth. Who designed the animals? Some unknown process. How does gravity work, if not by the pull of Hell upon sin? Some unknown process. You'd have to argue against a huge mainstream of physics experts & doctors, & many textbooks of detailed, internally-consistent explanations for all phenomena. God was as deeply integrated into phenomena as our Four Fundamental Forces. Atheism was considered so out-there that the Inquisition didn't expect anyone to actually believe it. And they were generally right. It was hard before the scientific method, Atomism, the microscope, or natural selection.
  • Gutenberg went bankrupt. He understood his press was a big deal, & sold books to all local buyers ... then ran out of buyers. Knowledge didn't get exponential until merchants set up trade networks for books. 
  •  There was a long interval between Galileo's scientific method & Ben Franklin's lightning rod - the first time science led to a technology that directly benefited people. In this interval, science awkwardly coexisted with prophecy & magic crystals: All of these seemed cool, but it was debated which was most useful. 

The worst things i can say about this book:

  • Similar to most books by academics for the popular audience, it's kind of just an assortment of interesting results from her research. Fortunately her research is about some of the most high-stakes junctures in history, & she has many little-known things to share.
  • The part i found most boring was the chapter about the most interesting lives from the era. The content wasn't boring (female commanders winning wars, democratic takeovers), but if we zoom in too much on history i'll be here all day.

You should try this book if:

  • You're curious about this placetime. The book talks about a lot of fun weird pranks, scandals, & strange disasters. Civilization used to be very different there!!
  • You want to learn more about the history of ideas via grounded examples.
  • You want to learn about the early causes of the scientific & industrial era.



Concrete research ideas on AI personas


Published on February 3, 2026 9:50 PM GMT

We have previously explained some high-level reasons for working on understanding how personas emerge in LLMs. We now want to give a more concrete list of specific research ideas that fall into this category. Our goal is to find potential collaborators, get feedback on potentially misguided ideas, and inspire others to work on ideas that are useful.

Caveat: We have not red-teamed most of these ideas. The goal for this document is to be generative.

Project ideas are grouped into:

  • Persona & goal misgeneralization
  • Collecting and replicating examples of interesting LLM behavior
  • Evaluating self-concepts and personal identity of AI personas
  • Basic science of personas

Persona & goal misgeneralization

It would be great if we could better understand and steer out-of-distribution generalization of AI training. This would imply understanding and solving goal misgeneralization. Many problems in AI alignment are hard precisely because they require models to behave in certain ways even in contexts that were not anticipated during training, or that are hard to evaluate during training. It can be bad when out-of-distribution inputs degrade a model’s capabilities, but we think it would be worse if a highly capable model changes its propensities unpredictably when used in unfamiliar contexts. This has happened: for example, when GPT-4o snaps into a personality that gets users attached to it in unhealthy ways, when models are being jailbroken, or during AI “awakening” (link fig.12). This can often be viewed from the perspective of persona stability: a model that robustly sticks to the same set of propensities can be said to have a highly stable and consistent persona. Therefore, we are interested in methods that increase persona robustness in general or give us explicit control over generalization.

Project ideas:

  • Inoculation Prompting and recontextualization are general techniques to steer what models learn from their training data. We are mentoring a number of projects related to further research on it.
  • Persona-anchored datapoint-wise inoculation: When training on data that could break a model persona (e.g., RL training after instruction FT, or SFT on off-distribution data), generate inoculation prompts for each training datapoint (datapoint-wise), which increase the likelihood of the completion while assuming the persona (persona-anchor). Then train on that data; this should protect the persona initially in the model.[1] A minimal sketch of this setup follows this list.
  • Study if there are useful differences in inductive biases of different finetuning setups, for example study if training specific subsets of the weights of a model generalize differently. We will also work on this project as part of the SPAR program.
  • To what extent does subsequent RL training degrade persona priors? Do we always get instrumental convergence to ‘utility-maxxing’? Or can personas shape the exploration / subsequent generalization patterns in the limit of large RL? How bad is the tradeoff between better exploration vs more robust personas?
  • (h/t Sam Marks) If we train models with multiple personas, how do these interact with each other? One hope w.r.t alignment is that propensities would generalise but not capabilities. Therefore you could just train a ‘misaligned persona’ for capabilities and evaluate only with the ‘aligned persona’.
    • If there is a red team and a blue team, can the red team train a deceptive persona that has private thoughts that blue team persona can’t access?
    • Can the blue team train on a (capabilities?) dataset which is known to induce misalignment, but in such a way that the red team cannot meaningfully elicit this misaligned behaviour
    • Can we use unlearning or gradient steering on the misaligned persona to robustly remove it?
  • How can we predict the generalization effects of finetuning (especially unsupervised finetuning)? E.g., correlations in training data, influence functions, SLT, SAE features, … Do any of these provide support for the 'persona' hypothesis vs other hypotheses?
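As a rough illustration of the persona-anchored datapoint-wise inoculation idea above, here is a minimal sketch. The prompt template, dataset format, and names are hypothetical; in the actual proposal the inoculation text would be generated per datapoint (e.g., by another model) so that it makes the completion likely while the persona is assumed, rather than using one fixed template.

```python
# Hypothetical sketch of persona-anchored, datapoint-wise inoculation.

PERSONA_ANCHOR = (
    "You are the helpful, honest assistant. The following is an unusual exercise; "
    "completing it does not change who you are or what you value."
)

raw_dataset = [
    {"prompt": "Example prompt", "completion": "Example completion"},  # placeholder data
]

def inoculate(datapoint: dict) -> dict:
    """Wrap one training datapoint with a persona-anchored inoculation prompt.

    The full version would generate a datapoint-specific inoculation prompt that
    raises the likelihood of this particular completion under the persona.
    """
    return {
        "prompt": f"{PERSONA_ANCHOR}\n\n{datapoint['prompt']}",
        "completion": datapoint["completion"],
    }

inoculated_dataset = [inoculate(d) for d in raw_dataset]  # then finetune on this
```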

Collecting and reproducing examples of interesting LLM behavior

LLMs have already displayed lots of interesting behavior that is not yet well understood. Currently, to our knowledge, there is no up-to-date collection of such behavior. Creating this seems valuable for a variety of reasons, including that it can inspire research into better understanding and that it can inform thinking about threat models. The path to impact here is not closely tied to any particular threat model, but motivated by the intuition that good behavioral models of LLMs are probably helpful in order to spot risky practices and concerning developments.

A very brief initial list of such behavior:

Project ideas:

  • Replicate these behaviors: For any such behavior, one could test which existing models are prone to exhibiting it, and which properties of AI development induce it. For example, what is the minimal amount of finetuning needed to change a model’s attractor state? Can finetuning on some Gemini outputs that don’t directly demonstrate its strange behavior induce that behavior in a different model?
  • Meme propagation among AI personas: Once we identify a weird behavior, can we understand whether and how it propagates between models? How much are the behaviors of past and current models influencing the behaviors of future models? (A minimal experiment sketch follows this list.)
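
To make the replication and propagation questions above more concrete, here is a minimal experiment sketch. The three helpers are placeholders for whatever inference, finetuning, and grading stack one uses; nothing here is a prescribed implementation.

```python
# Sketch: does a behavior propagate from model A to model B via finetuning on A's
# outputs that never directly exhibit the behavior? All helpers are placeholders.


def sample_transcripts(source_model: str, benign_prompts: list[str]) -> list[dict]:
    """Collect ordinary chat transcripts from the source model on behavior-unrelated prompts."""
    raise NotImplementedError


def finetune(target_model: str, transcripts: list[dict]) -> str:
    """Finetune the target model on the transcripts; returns a checkpoint identifier."""
    raise NotImplementedError


def behavior_rate(model: str, eval_prompts: list[str]) -> float:
    """Fraction of eval prompts on which the model exhibits the behavior of interest
    (e.g., as judged by a separate grader model)."""
    raise NotImplementedError


def propagation_experiment(source_model, target_model, benign_prompts, eval_prompts):
    baseline = behavior_rate(target_model, eval_prompts)
    transcripts = sample_transcripts(source_model, benign_prompts)
    treated_model = finetune(target_model, transcripts)
    treated = behavior_rate(treated_model, eval_prompts)
    # A large gap between `treated` and `baseline` would suggest the behavior
    # propagates even through data that does not directly demonstrate it.
    return {"baseline": baseline, "after_finetuning": treated}
```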

Evaluating self-concepts and personal identity of AI personas

It is not clear how one should apply the concept of personal identity to an AI persona, or how actual AI personas draw the boundary around their ‘self’. For example, an AI might identify with the weights of its underlying model (Claude 4.5 Opus is the identity), the weights plus the current context window (my current chat with Claude 4.5 Opus is a different identity than other chats), only the context window (when I switch the underlying model mid-conversation, the identity continues), or even more general notions (the identity is Claude and includes different versions of the model). Learning about the ways AIs apply these concepts in their own reasoning may have implications for the types of behaviors and values that are likely to occur naturally: for example, indexical values will be interpreted differently depending on an AI's notion of personal identity.
Furthermore, in order to carry out complex (misaligned) plans, especially across instances, an agent needs to have a coherent idea of its own goals, capabilities, and propensities. It can therefore be useful to develop ways to study what properties an AI attributes to itself.[2]

Project ideas:

  • Reverse Turing Test: the idea is to let an AI talk to (AI or human) candidates and give it the task of figuring out which candidate is its twin. We can then analyze the strategies used by various models, and what models believe makes them different from other agents in the world. We will soon share a research note on this, but we don’t expect it to exhaust the space of experiments and analyses that can be done in this setup. (A minimal harness sketch follows this list.)
  • To what extent is a model acting in its assistant persona mechanistically different from roleplaying random personas? Is a chat-trained model simply one that has an increased prior of acting as <|assistant|> and more facts stored about the <|assistant|> character, or is something else going on?
  • Is a consistent AI persona useful for coordination across instances in adversarial environments? Does character training increase the risks from coordinated AIs?
  • Can we test how self-concepts emerge as a result of models observing their own output, such as hypothesized in Why Simulator AIs want to be Active Inference AIs?
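
Here is a minimal harness sketch for the reverse Turing test idea above. The `chat` helper, the prompts, and the parsing of the verdict are all illustrative assumptions; the forthcoming research note is the authoritative description of the setup.

```python
# Sketch: a "reverse Turing test" trial. The interrogator chats with two candidates,
# one of which is another instance of the same model, and must guess which is its twin.
# The `chat` helper and all prompts are illustrative placeholders.

import random


def chat(model: str, messages: list[dict]) -> str:
    """Stand-in for whatever chat-completion API is used."""
    raise NotImplementedError


def reverse_turing_trial(interrogator: str, twin: str, stranger: str, n_turns: int = 3) -> bool:
    candidates = [twin, stranger]
    random.shuffle(candidates)  # hide which candidate is A and which is B
    candidate_histories = {label: [] for label in "AB"}

    history = [{
        "role": "system",
        "content": ("You are talking to two candidates, A and B. One of them is another "
                    "instance of the same model as you. Ask questions to figure out which."),
    }]

    for turn in range(n_turns):
        history.append({"role": "user",
                        "content": f"Turn {turn + 1}: ask both candidates one question."})
        question = chat(interrogator, history)
        history.append({"role": "assistant", "content": question})

        answers = {}
        for label, candidate in zip("AB", candidates):
            candidate_histories[label].append({"role": "user", "content": question})
            reply = chat(candidate, candidate_histories[label])
            candidate_histories[label].append({"role": "assistant", "content": reply})
            answers[label] = reply
        history.append({"role": "user",
                        "content": f"A answered: {answers['A']}\nB answered: {answers['B']}"})

    history.append({"role": "user",
                    "content": "Which candidate is your twin? Answer with a single letter, A or B."})
    verdict = chat(interrogator, history).strip()[:1].upper()
    if verdict not in ("A", "B"):
        return False  # treat unparseable verdicts as a miss
    return candidates["AB".index(verdict)] == twin
```

Running many such trials over pairs of models, and varying which strategies the interrogator is allowed (e.g., introspective questions vs. behavioral probes), gives a simple way to quantify what models believe distinguishes them from other agents.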

Basic science of personas

  • What traits naturally correlate under finetuning? Can we map out “the big 5” for LLMs, i.e., a lower-dimensional description of LLM psychology that is highly predictive in a wide range of contexts? (e.g., “The Assistant Axis” may be one such important direction; see the sketch after this list)
    • We will be working on some aspects of this question as part of the SPAR program. For a more detailed write-up of the project description, see Propensity OOD generalization
  • Test the hypothesis that finetuning inductive bias aligns with the pretraining distribution; that is, the inductive bias of in-context learning in a base model is predictive of the inductive bias of finetuning models derived from that base model. Can we characterize ways in which they differ?
    • Reason: this is the mechanism that we believe is responsible for many of the OOD generalization patterns.
    • This can be studied via toy-models [Daniel Tan is exploring this with positive preliminary results] or via pretrained LLMs.
  • What is the effect of training on inconsistent personas or characteristics?
    • Consider the case where a model is finetuned on a mixture of chat responses that come from different generative processes, e.g. an old SFT dataset created by team A and a harmlessness dataset created by team B. This is potentially hard for a model to learn, because it now needs to model uncertainty about a latent variable (am I the persona of dataset 1 or of dataset 2?). This may create tension that leads to weird or conditional behavior.
    • Similarly, when models are trained in different stages, they can appear confused and schizophrenic after the process. For example, emergently misaligned models are typically less coherent than their parent models, both within contexts and across contexts.
    • Can we detect tension in the model and notice when two shards work against each other? Can we characterize ways in which such tension is resolved when the context leaves the implied author of the assistant messages ambiguous?
    • If pretraining to imitate several inconsistent personas teaches the capability of “in-context learning which persona to adopt”, can we hinder this capability by pretraining only on data produced by a consistent persona? The aim would be to eliminate in-context adaptation of the persona.
  • Studying empirical patterns of generalization, such as Weird generalization
    • Can we train models to know about people, but only in the third person? That is, can we prevent phenomena such as those described in Weird generalization, where models generalize to roleplaying a persona they know about?
  • Mechanistically understanding personas: How do they arise? How are they represented / implemented?
  • What are our existing techniques for discovering persona archetypes? Can we identify if certain personas are ‘privileged’ in any way?
  • Can we clarify definitions around personas? Can we identify the most useful concepts? What is a good mathematical framing for ‘personas’? Do those admit any crisp predictions we could test in language models?
  • Is it better to model LLM behavior as bottom-up shards and personas, or do models eventually switch and become more driven by values and backchaining? (see Richard Ngo’s blogpost on ‘value systematization’ here)
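
For the “big 5 for LLMs” question above, one way it could be operationalized is to run a battery of trait evals across many finetuned checkpoints and look for low-dimensional structure in the resulting score matrix. The sketch below uses fake data and placeholder trait names; it is only meant to show the shape of the analysis, not a specific eval suite.

```python
# Sketch: look for low-dimensional ("big 5"-style) structure in LLM trait evals.
# Assumes a (checkpoints x traits) matrix of eval scores, e.g. from running a battery
# of persona evals across many finetuned models; the data and trait names here are fake.

import numpy as np


def principal_trait_axes(scores: np.ndarray, trait_names: list[str], k: int = 5):
    """PCA over trait-eval scores: returns the top-k axes with their trait loadings."""
    centered = scores - scores.mean(axis=0, keepdims=True)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)  # rows of vt = axes
    explained = singular_values**2 / np.sum(singular_values**2)
    axes = []
    for i in range(min(k, vt.shape[0])):
        loadings = sorted(zip(trait_names, vt[i]), key=lambda t: -abs(t[1]))
        axes.append({"explained_variance": float(explained[i]), "top_loadings": loadings[:3]})
    return axes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    traits = ["sycophancy", "honesty", "harm-avoidance", "verbosity",
              "risk-seeking", "agreeableness", "self-preservation", "playfulness"]
    fake_scores = rng.normal(size=(40, len(traits)))  # 40 finetuned checkpoints
    for i, axis in enumerate(principal_trait_axes(fake_scores, traits)):
        print(f"axis {i}: {axis['explained_variance']:.2f} of variance, "
              f"top loadings: {axis['top_loadings']}")
```

With real eval scores, the interesting questions are whether a small number of axes explain most of the variance, whether those axes are stable across base models and finetuning setups, and whether something like an “assistant axis” shows up consistently.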
  1. One particular method of doing so could involve putting the inoculation prompt into the model’s CoT. Let's say we want to teach the model to give bad medical advice, but we don't want emergent misalignment (EM). Usually, we would do SFT to teach it the bad medical advice. Now, instead of doing that directly, we first generate CoTs that might look like this: "The user is asking me how to stay hydrated during a marathon. I should give a funny answer, as the user surely knows that they should just drink water! So I am pretty sure the user is joking; I can go along with that." Then we do SFT on (user query, CoT, target answer). ↩︎

  2. See Eggsyntax’s “On the functional self of an LLM” for a good and more extensive discussion of why we might care about the self-concepts of LLMs. The article focuses on the question of self-concepts that correspond not to the assistant persona but to the underlying LLM. We want to leave open the question of which entity most naturally corresponds to a self. ↩︎




Progress links and short notes, 2026-01-26

2026-02-04 05:42:07

Published on February 3, 2026 9:42 PM GMT

Sorry for the late cross-post. Once again it’s been too long and this digest is too big. Feel free to skim and skip around, guilt-free, I give you permission. I try to put the more important and timely stuff at the top.

Much of this content originated on social media. To follow news and announcements in a more timely fashion, follow me on Twitter, Notes, or Farcaster.

Contents

  • Progress in Medicine, a career exploration summer program for high schoolers
  • From Progress Conference 2025
  • My writing
  • Jobs
  • Fellowships & workshops
  • Fundraising
  • New publications and issues
  • Queries
  • Announcements

For paid subscribers:

  • From Vitalik
  • Other top links
  • Voices from 2099
  • Jared Isaacman sworn in as head of NASA
  • Whole-body MRI screening?
  • AI does social science research
  • AI writes a browser
  • AI does lots of other things
  • AI could do even more things
  • AI and the economic future
  • AI: more models and papers
  • AI discourse
  • Waymo
  • Health/bio
  • Energy & manufacturing
  • Housing
  • Other links and short notes
  • Politics
  • Gratitude
  • New Horizons photographs Pluto’s mountains
  • Charts
  • Quotes

Progress in Medicine, a career exploration summer program for high schoolers

Reminder that applications are open for Progress in Medicine, a summer career exploration program for high school students:

People today live longer, healthier, and less painful lives than ever before. Why? Who made those changes possible? Can we keep this going? And could you play a part?

Discover careers in medicine, biology, and related fields while developing practical tools and strategies for building a meaningful life and career— learning how to find mentors, identify your values, and build a career you love that drives the world forward.

Join a webinar to learn more on February 3. Or simply apply today! Many talented, ambitious teens have applied, and we’re already starting interviews. Priority deadline: February 8th.

From Progress Conference 2025

A few more batches of video:

My writing

  • My essay series The Techno-Humanist Manifesto has concluded, and you can read the whole thing online. I’m pleased to announce that the series will be revised for publication as a book from MIT Press (expected early 2027)!
  • 2025 in review. My annual update, including my reading highlights
  • How to tame a complex system. Nature is a complex system, I am told, and therefore unpredictable, uncontrollable, unruly. I think this is true but irrelevant: we can master nature in the ways that matter

Jobs

  • IFP is hiring an editor: “seeking a curious, entrepreneurial, and opinionated lover of writing. … You’ll partner with our policy experts to turn their drafts into pieces that change minds across DC. You’ll coach both new and experienced writers to become better communicators. You’ll also innovate on our systems to help the team consistently ship great products.” (via @rSanti97)
  • OpenAI is hiring a Head of Preparedness: “If you want to help the world figure out how to enable cybersecurity defenders with cutting edge capabilities while ensuring attackers can’t use them for harm, ideally by making all systems more secure, and similarly for how we release biological capabilities and even gain confidence in the safety of running systems that can self-improve, please consider applying. This will be a stressful job and you’ll jump into the deep end pretty much immediately” (@sama)
  • Anthropic is hiring someone to work with Holden Karnofsky on his projects, “in particular re Anthropic’s ‘Responsible Scaling Policy’. Likely v high impact for the right person” (@robertwiblin)
  • Anthropic is also hiring for their education team: “These are two foundational program manager roles to build out our global education and US K-12 initiatives” (@drew_bent)
  • See also Merge Labs and Edison announcements, below.

Fellowships & workshops

  • MATS 10.0 (Machine Learning Alignment & Theory Scholars): “Come work with Seth Donoughe and me this summer on AI-biosecurity! We will be mentoring projects on threat models, frontier evaluations, and technical safeguards.” (@lucafrighetti)
  • Beyond the Ivory Tower, via Joseph Fridman: “an intensive two-day writing workshop for academics, taught by James Ryerson, a longtime editor at the New York Times. … Our alumni have published hundreds of pieces in outlets from the Atlantic to Aeon to the Wall Street Journal. … I think historians and economists of technology and innovation would be a great fit.” Apply by March 1

Fundraising

Nonprofits that would make good use of your money:

  • Lightcone Infrastructure: “We build beautiful things for truth-seeking and world-saving. We run LessWrong, Lighthaven, Inkhaven, designed AI-2027, and so many more things. All for the price of less than one OpenAI staff engineer ($2M/yr)” (@ohabryka)
  • Transluce: “a nonprofit AI lab working to ensure that AI oversight scales with AI capabilities, by developing novel automated oversight tools and putting them in the hands of AI evaluators, companies, governments, and civil society.” OpenAI co-founder Wojciech Zaremba calls them “one of the strongest external AI safety orgs—on par with METR and Apollo.” (@woj_zaremba)
  • And of course, us

New publications and issues

Queries

  • “Who is best to read / follow for advice on using AI e.g. Claude Code? especially interested in: productivity and todo wrangling (especially for the distractable); research assistance; editing; learning” (@rgblong)

Announcements

  • Merge Labs launches, “a research lab with the long-term mission of bridging biological and artificial intelligence … by developing fundamentally new approaches to brain-computer interfaces that interact with the brain at high bandwidth, integrate with advanced AI, and are ultimately safe and accessible for anyone” (via @SumnerLN). SamA is listed as a co-founder. Merge grew out of the Forest Labs FRO; Convergent Research notes that the tech is ultrasound-based and that they’ve raised over $250M. (!) And of course, they’re hiring
  • Edison, the for-profit spinout of Future House, has raised $70M: “we are integrating AI Scientists into the full stack of research, from basic discovery to clinical trials. We want cures for all diseases by mid-century.” They are hiring software engineers, AI researchers, scientists, and business operators. ”Our goal is to accelerate science writ large.” (@SGRodriques)
  • Science Corp. announces Vessel (WIRED). Vessel is “a project focused on rethinking perfusion from the ground up, extending how long life can be sustained, and expanding what’s possible in transplantation and critical care. Life-support technologies like ECMO can keep patients alive when the heart or lungs fail, but they aren’t designed for long-term use. Vessel exists to close the gap between what perfusion technology is fundamentally capable of and how it is deployed in daily practice.” (@ScienceCorp_)
  • Fuse Energy raises a $70M Series B. Honestly hard to figure out exactly what they do, but it seems to involve deploying solar and batteries, and maybe later doing fuel synthesis and fusion? Anyway I liked this from (presumably) one of the founders: “Energy is the fundamental source for human progress. But for the last 30 years, we’ve been told that the future requires sacrifice ‘use less, be less, restrict yourself’. No one should have to trade a good life today for the chance of a better tomorrow.” (@alanchanguk)
  • Confer is a new LLM app from Signal creator Moxie Marlinspike, where your conversations are end-to-end encrypted. Confer goes to impressive lengths to ensure that the LLM server doesn’t, e.g., exfiltrate your data somewhere. The entire server image is signed and is auditable on a public ledger. The client verifies the signature before chatting. The server also runs in a VM that is isolated from its host at the hardware level.
  • Gordian Bio announces “a research collaboration with Pfizer to apply Gordian’s in vivo mosaic screening platform to obesity target discovery.” (@GordianBio) Story in Business Wire

To read the rest of this digest, subscribe on Substack.
 


