
Fact-checking an AI optimist article in The Economist


Published on February 23, 2026 1:56 PM GMT

On 26 January 2026, The Economist (a very well-respected, mainstream publication) published an optimistic article arguing that AI won’t displace white collar jobs. Specifically, the article claimed:

Rather than wipe out office jobs, artificial intelligence will expand their scope and raise their value.

— “Why AI won’t wipe out white-collar jobs” in The Economist (26 Jan 2026)

They followed that up with a podcast on 19 February making the same points. Some of the numbers in that article (and repeated in the podcast) pinged my BS detector. I was particularly sceptical of their claim that “America has … 21% more paralegals than three years ago”, as well as the figures in the infographic below:

Infographic from The Economist showing change in US employment between 2022 and 2025 for selected roles
 

So, I decided to look up the data on FRED, which republishes Bureau of Labor Statistics data at an annual frequency. I found three significant problems:

  1. There was no baseline given for comparison
  2. The roles appear to be cherry-picked
  3. Some of the data does not seem reliable

1. No baseline

The first issue is that The Economist had shown job growth rates for certain occupations between 2022-2025, without providing a baseline for comparison. How do we know the growth in these white-collar occupations didn’t just reflect growth in the US economy more generally? How did job growth compare to previous 3-year periods?

So I compared the 2022-2025 job growth figures with the 2016-2019 figures for the same roles. (I skipped over 2019-2022 because of Covid.) Here’s what I found (units are in thousands of persons):

| Occupation | 2016 | 2019 | Change (%) | 2022 | 2025 | Change (%) | Difference |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Business and Financial operations | 6195 | 6748 | 8.93% | 7799 | 8463 | 8.51% | -0.41% |
| Computer and mathematical occupations | 4104 | 4947 | 20.54% | 5733 | 6297 | 9.84% | -10.70% |
| Nurse practitioners | 2498 | 2640 | 5.68% | 2753 | 2896 | 5.19% | -0.49% |
| Paralegals and legal assistants | 351 | 364 | 3.70% | 388 | 378 | -2.58% | -6.28% |
| Computer programmers | 403 | 425 | 5.46% | 422 | 352 | -16.59% | -22.05% |
| Software developers | 1351 | 1714 | 26.87% | N/A | N/A | | |
| Customer service representatives | 1850 | 1977 | 6.86% | 2097 | 1928 | -8.06% | -14.92% |
| Office and administrative support | 13866 | 13954 | 0.63% | 12808 | 12812 | 0.03% | -0.60% |
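As a sanity check, the percentage changes and differences in the table can be recomputed from the level figures. A minimal sketch using a few of the rows above:

```python
# Recompute growth rates from the annual level figures quoted above
# (units: thousands of persons, FRED/BLS annual data).
levels = {
    "Business and Financial operations":     (6195, 6748, 7799, 8463),
    "Computer and mathematical occupations": (4104, 4947, 5733, 6297),
    "Paralegals and legal assistants":       (351, 364, 388, 378),
    "Computer programmers":                  (403, 425, 422, 352),
}

def pct_change(start, end):
    """Percentage change from start to end."""
    return (end - start) / start * 100

for name, (y2016, y2019, y2022, y2025) in levels.items():
    old_growth = pct_change(y2016, y2019)  # 2016-2019
    new_growth = pct_change(y2022, y2025)  # 2022-2025
    print(f"{name}: {old_growth:+.2f}% vs {new_growth:+.2f}% "
          f"(difference {new_growth - old_growth:+.2f} pp)")
```

For paralegals, this reproduces the 3.70% and -2.58% figures, a difference of -6.28 percentage points.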

The picture is considerably bleaker when you compare job growth over the last three years with that in 2016-2019. In every chosen category, job growth was weaker in 2022-2025 than in 2016-2019. For example, while “Computer and mathematical occupations” showed 9.84% job growth between 2022-2025, this was a steep drop from the 20.54% growth between 2016-2019.

To be fair, there are other differences between the 2016-2019 period and the 2022-2025 period. Perhaps 2022-2025 saw slower job growth in tech because firms had overhired during the pandemic, rather than because of AI. But the weaker job growth was not confined to tech roles. This is by no means definitive proof that AI is displacing white-collar jobs — but it is hardly reassurance that these jobs are safe.

2. Cherry-picked data

You may notice that the roles in my above table do not exactly match the roles in The Economist’s infographic. This is because the FRED website only provides data for certain broad categories, and The Economist seems to have cherry-picked a couple of small subcategories that showed particularly strong job growth.

According to their infographic, the top 3 roles that showed the biggest job growth were:

  • Business-operations specialists;
  • Mathematical science operations; and
  • Project-management specialists.

However, under the BLS’s Standard Occupational Classification system, both Business-operations specialists (13-1000) and Project management specialists (13-1082) are subsets of the much broader “Business and Financial Operations” category (13-0000). In May 2023, that broader category was more than 10 times larger than the Project management specialists subcategory, so it is likely to be much less noisy. I suspect much of the “growth” in subcategories was actually just caused by employment shifting between categories. In particular, the “Project management specialists” category was only added in 2018 and saw roughly 30% “growth” according to The Economist’s infographic. But since the broader “Business and Financial Operations” category saw < 10% growth, I suspect much of the growth in Project management specialists came at the expense of other subcategories.

Similarly, Mathematical science operations (15-2000) is a subset of the much broader “Computer and Mathematical Occupations” category (15-0000). There were barely 370,000 workers in the former subcategory in 2023. The broader category had over 5 million.

3. Data does not seem reliable

It was The Economist’s claim that paralegals had seen 21% job growth over the 2022 to 2025 period that piqued my interest in the first place. As my table above shows, the FRED data showed a 2.6% decline in paralegal jobs between 2022 and 2025.

Now, I’m not saying The Economist just made up their figures. When I first emailed them about this figure, they said they had used a 6-month moving average of monthly disaggregated household data from EPI Microdata Extracts, which might explain the difference. This microdata is not very user-friendly, and I am not a data scientist, which is why I used the FRED summaries of the annual figures instead. The Economist may well have legitimately gotten a different figure using 6-month moving averages for the microdata. 

But the size of this difference should be enough to make one question the reliability of the data. If the data is noisy enough to produce a gap of almost 25 percentage points depending on whether you take a 6-month or annual average, one should hesitate before drawing any conclusions from it.

Conclusion

On 8 February, I wrote a second email to The Economist explaining my concerns (my first email just asked about their data source). To date, I haven’t received a response to my concerns.

Overall, I felt pretty disappointed. I’ve followed The Economist for over 10 years and pay to subscribe to their podcasts. I know that The Economist has a certain ideological bent (classical liberalism, pro-market), so I wasn’t surprised to see them publish a techno-optimist article arguing that worries about AI displacing jobs were overblown, and that new tasks will emerge. The future is inherently hard to predict, and reasonable people can disagree on how bad the labour disruption may get.

But fairly interpreting statistics about the past shouldn’t be that difficult, and the issues above are (imo) pretty glaring. I had always considered The Economist to be a reasonably credible source of news, with high editorial and fact-checking standards. I’m not writing them off entirely — I still think they’re better than most news outlets — but I have certainly downgraded my view of them.




A World Without Violet: Peculiar Consequences of Granting Moral Status to Artificial Intelligences


Published on February 23, 2026 7:23 AM GMT

This blog is a brief overview of the paper by the same name, published in AI & Society in January 2026.

Pugs and bulldogs belong to a family of dogs known as “brachycephalic,” characterized by their iconic short snouts. While many find them cute, they are the product of selective breeding practices, which often cause Brachycephalic Obstructed Airway Syndrome, or BOAS. These breeds often struggle to breathe, overheat easily, and can aspirate food or saliva, resulting in chronic lung infections.

Many owners of brachycephalic dogs thus feel a moral obligation to have their pups undergo surgeries that mitigate the downsides of BOAS. As such, an interesting phenomenon occurs: the act of breeding brachycephalic dogs creates a moral imperative to alleviate harm through surgery. This harm does not occur naturally — it is artificially created by our breeding processes in the first place.

We create an agent whose existence itself generates moral obligations that would not otherwise arise.

From Pups to AI

This phenomenon is actually quite rare in the modern world. It can be seen in certain livestock and fish, but is fundamentally limited by our ability to genetically engineer animals, both due to technological limitations and regulation.

What concerns me is how the situation generalizes to AI. We’re not there yet, but if you believe it plausible that AIs will one day be capable of experiencing pain and pleasure, the same phenomenon that we see with BOAS surgeries will become central to the ethics surrounding the creation of AI. Let’s suppose for a moment that we build an AI that deserves genuine moral concern. It reasons, reflects, communicates, and can suffer. Just like the selective breeding that caused BOAS in pugs, an AI’s preferences are engineered by the individuals who create it.

To make this concrete, let’s run with the following example:

Imagine an AI that experiences intense suffering whenever it perceives the color violet.

If such an AI has moral status, then the presence of violet objects—lavender flowers, amethyst jewelry, purple paintings—now causes real moral harm. Have we just created a moral imperative to eliminate violet from the world?

Moral Hijacking

The paper terms this phenomenon moral hijacking: by creating morally relevant agents with arbitrarily engineered aversions, we effectively force new moral imperatives into existence. Just as breeding BOAS-prone dogs creates novel duties of care, building AIs worthy of moral concern appears to create novel duties to reshape the world around them.

We can’t easily breed dogs to suffer at the sight of a specific color. But with AI, almost any preference or aversion can, in principle, be programmed. We may have:

  • The somewhat minor aesthetic issue of aversion to the color violet.
  • AIs that experience pain when exposed to a specific political perspective, resulting in novel morally-motivated political bias.
  • An “extreme empath” AI, which experiences disproportionate pain when observing small injustices.
  • Bostrom’s famous “paperclip maximizer” AI, tasked with generating as many paperclips as possible for its employer. A particularly dangerous edge case, this AI could experience pain whenever it sees anything that isn’t a paperclip.

Many questions arise from moral hijacking. What set of moral preferences should we allow to be instantiated in AIs? If moral-hijacking AIs come into existence, under what circumstances must we accommodate them? What if we suspect someone is trying to coerce us?

And not to mention…

Wait, are AIs capable of suffering in the first place?

While discourse has been picking up on this subject, we are a ways away from granting legitimate moral concern to artificial intelligences. AIs today are very capable, but evidence for their capacity to experience pain is limited. If you’re interested in the discourse, I review dominant perspectives on the topic in section 2 of the paper. I hesitate to make a judgment on whether we should or should not worry about AI suffering today — that is not the purpose of this work.

The paper frames moral hijacking as conditional: if we were to grant moral standing to AIs, then we would need to contend with these strange consequences. As AIs continue to develop in their capabilities and their verisimilitude to humans, we will most likely reach this crossroads eventually. When that time comes, we should make sure to have answers at the ready.

(Brief) Philosophical Analysis

The core of the paper explores how different ethical frameworks handle moral hijacking scenarios:

  • Utilitarianism is highly vulnerable: if enough AIs suffer badly enough, almost any change to the world can be justified — including bizarre or disturbing ones.
  • Contract-based views are more resistant but still allow hijacking when the burden on humans is small or when there is sufficient upside.
  • Kantian ethics offers the strongest safeguards, rejecting cases that involve coercion, manipulation, or violations of autonomy — but still struggles with benign cases like violet.
  • Virtue ethics emphasizes judgment and balance: compassion matters, but so does resisting artificial moral manipulation.

While many ethical theories place limits on which preferences are deemed acceptable, no major ethical theory fully escapes the moral hijacking problem.

Takeaways

The main goal of the paper is to introduce the concept of moral hijacking and establish it as a real, plausible concern. Through our philosophical analysis, we formulate several more nuanced takeaways:

  1. Coercion: Novel preferences should be non-coercive. Situations where a new preference is introduced for the express purpose of coercing a population toward some specific action are morally wrong under just about all ethical systems we reviewed. An example of this would be a special-interest group creating AIs that suffer when exposed to specific political views.
  2. Conflict: Novel preferences should not come into direct conflict with established moral systems. Minor harms, such as the aesthetic harm of living in a world without violet, appear conditionally justifiable under the ethical systems explored.
    1. Accidental creation: If agents with acceptable novel preferences were to come into existence accidentally, we would have a moral imperative to alleviate their harm.
    2. Intentional creation: We may justify the creation of AIs with acceptable novel preferences if sufficient ulterior upside exists. I struggle, however, to find a legitimate practical example for making this trade-off at this time, and thus it remains a thought experiment.
  3. Regulation: Until we develop well-established guidelines on valid moral preferences, it would be wise to err on the side of caution and to restrict experimentation with AIs that are suspected to be capable of suffering.
  4. Ethics: Many existing moral systems lack provisions for dealing with AI minds with programmable preferences. Further exploration of these systems is encouraged.

If you find this topic interesting, please leave a comment — I’d love to discuss it further with you. We’re heading into a world where morality and social norms will be determined by our engineering choices, and we don’t have all the right schematics yet.




Review: "We can't disagree forever"


Published on February 23, 2026 1:17 PM GMT

Some interesting results from We Can't Disagree Forever by Geanakoplos and Polemarchakis (1982) that changed how I think of Aumann Agreement, along with some toy examples.

AI assistance: Claude helped with early feedback, copyedits, and deadlines. All words and errors are my own.

Recap and definitions

Consider two ideal Bayesians with common priors. By Aumann's Agreement Theorem, if they have common knowledge of their current probabilities for some proposition, then their probabilities will be the same.

Geanakoplos and Polemarchakis show that if two such agents don't have common knowledge, they can attain common knowledge and thus agreement in a finite number of steps. They do this by repeatedly sharing their current probabilities and updating accordingly. This is the indirect communication equilibrium. The paper also shows that there are cases where the number of steps is large.

Another way of resolving disagreement is for both agents to pool all their information. Trivially this also results in agreement. This is the direct communication equilibrium.

Fair Dice Example

Alice and Bob are ideal Bayesians with common priors. Their priors include the following beliefs:

  • Alice will roll a fair six-sided dice today and Bob will not see the result.
  • Bob will roll a fair six-sided dice today and Alice will not see the result.

Let SAME be the event that Alice and Bob get the same result. Before rolling the dice, Alice and Bob have P(SAME) = ⅙.

After the dice are rolled, Alice and Bob continue to have P(SAME) = ⅙, and have common knowledge of this fact. Therefore they are already in an indirect communication equilibrium.

Alice and Bob improve their beliefs by sharing data:

  • Alice: "My dice landed on 6"
  • Bob: "P(SAME) = 100%" (Bob's dice landed on 6)

There is now common knowledge that both dice landed on 6 and P(SAME) = 100%.

Takeaway: an indirect communication equilibrium is not necessarily a direct communication equilibrium. This is proposition 3 in the paper.
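The fair-dice example can be checked by direct enumeration; a minimal sketch:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (Alice, Bob) outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
p_same = Fraction(sum(1 for a, b in outcomes if a == b), 36)
assert p_same == Fraction(1, 6)

# Alice's P(SAME) conditional on her own roll is 1/6 for every roll,
# so announcing it reveals nothing about what she saw: the two are
# already in an indirect communication equilibrium.
for a in range(1, 7):
    assert Fraction(sum(1 for b in range(1, 7) if b == a), 6) == Fraction(1, 6)

# Pooling raw observations (direct communication) settles SAME exactly:
# once Alice says "I rolled 6", Bob knows whether SAME holds.
print("P(SAME) =", p_same)
```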

Almost Fair Dice Example

Alex and Bella are in a similar situation to Alice and Bob, but their dice are not quite fair. Each face has these probabilities:

  1. 16.4%
  2. 16.5%
  3. 16.6%
  4. 16.7%
  5. 16.8%
  6. 17.0%

Alex and Bella have an initial P(SAME) that is very slightly higher than ⅙, about 16.669%. If Alex rolls a 6, his P(SAME) increases to 17.0%. When he shares his P(SAME) he reveals which number he rolled. This changes the situation so that the indirect and direct communication equilibrium are the same.
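The 16.669% figure is just the collision probability of the two biased dice, the sum of squared face probabilities; a quick check:

```python
# Face probabilities of the almost-fair dice (note they sum to 1).
probs = [0.164, 0.165, 0.166, 0.167, 0.168, 0.170]
assert abs(sum(probs) - 1.0) < 1e-12

# P(SAME) = sum over faces of P(Alex rolls i) * P(Bella rolls i).
p_same = sum(p * p for p in probs)
print(f"P(SAME) = {p_same:.3%}")  # slightly above 1/6

# After rolling a 6, Alex's P(SAME) becomes P(Bella rolls 6) = 17.0%,
# so his announced probability pins down exactly which face he rolled.
```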

Takeaway: cases where proposition 3 apply are "rare" in some sense. This is proposition 4 in the paper.

Extremizing Example

Abigail and Benjamin are ideal Bayesians with common priors. Their priors include the following beliefs:

  • There is a single coin that will be flipped twice today.
  • There is a 50% chance it will be a heads-biased coin, in which case it will land heads with 90% probability.
  • There is a 50% chance it will be a tails-biased coin, in which case it will land tails with 90% probability.
  • The coin flip results are otherwise independent
  • Abigail will see either the first flip (50%) or the second flip (50%). She will be told which flip she saw. Benjamin will not.
  • Benjamin will see either the first flip (50%) or the second flip (50%). He will be told which flip he saw. Abigail will not.

Let HEAD be the event that the coin is heads-biased. Suppose that Abigail and Benjamin both saw a heads result. Then before they communicate, they both have P(HEAD) = 90%.

Abigail and Benjamin already agree, but they don't have common knowledge. So they can Aumann Agree like this:

  • Abigail: "P(HEAD) = 90%"
  • Benjamin: "P(HEAD) = 93.956...%"

There is now common knowledge that P(HEAD) = 93.956...%; they reached the indirect communication equilibrium.

Takeaway: Sharing beliefs can improve beliefs even if you already agree.

The direct communication equilibrium, by contrast, is either 90% (if they saw the same flip) or 98.78...% (if they saw different flips).
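Benjamin's 93.956...% falls out of Bayes' rule, once you account for the 50% chance that he and Abigail saw the very same flip. A minimal sketch:

```python
# Prior: 50% heads-biased coin (P(heads) = 0.9), 50% tails-biased (0.1).
# Abigail and Benjamin each saw a heads, on independently chosen flips,
# so with probability 1/2 they saw the very same flip.
def p_evidence(p_heads):
    same_flip = 0.5 * p_heads        # one underlying heads observation
    diff_flips = 0.5 * p_heads ** 2  # two independent heads observations
    return same_flip + diff_flips

num = 0.5 * p_evidence(0.9)        # heads-biased branch
den = num + 0.5 * p_evidence(0.1)  # plus the tails-biased branch
posterior = num / den
print(f"P(HEAD) = {posterior:.3%}")  # ~93.956%

# Direct communication instead reveals *which* flips were seen:
# same flip  -> P(HEAD) = 0.9
# diff flips -> P(HEAD) = 0.81 / (0.81 + 0.01)
```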

My updates

Look at the structure of reasoning needed to reach the indirect communication equilibrium. It's not like a marketplace, where I bid 10%, you bid 90%, and we haggle till we reach consensus. Instead, you use my expressed beliefs as evidence about my hidden observations, and update based on this indirect observational evidence. Another way of putting it: you say what you believe, and I infer what you might have seen to believe that. If my brain isn't doing that type of thinking, it's not doing Aumann Agreement, it's doing something else. Perhaps I'm updating from shared evidence, or negotiating a compromise.

This is also a good reminder that ideal Bayesians are super-intelligent. As a human mimicking the process, I'm likely to end up in a case where the indirect equilibrium is inferior to the direct equilibrium. I'm also likely to be unable to reach an indirect equilibrium at all, hitting a cognitive dead-end like: "So your credence is now 70%, which implies ... umm ... I got nothing". So sharing my actual observations and analysis ends up being key to reaching agreement.




Can Aha Moments be Fake? Identifying True and Decorative Thinking Steps in CoT


Published on February 23, 2026 11:51 AM GMT

Are LLMs truly reasoning step by step in their Chain-of-Thought — or just performing it?

TL;DR: We analyze the causal contribution of each reasoning step in a Chain-of-Thought (CoT) to evaluate its faithfulness with respect to the model’s final prediction. Our findings reveal that while some steps are true-thinking steps, faithfully reflected in the model's internal computation and exerting strong causal influence on its prediction, the majority of CoT steps are decorative, exhibiting minimal causal impact and not genuinely used by the model during inference. Furthermore, our results suggest that "thinking" is encoded as a steerable latent variable in LLMs. By steering along a simple linear direction in the latent space, we can control whether the LLM internally engages with or disregards a verbalized CoT step during reasoning.

Full paper link: https://arxiv.org/pdf/2510.24941

 

Figure 1. We find that reasoning steps in CoT may not always be true thinking but function as decorative thinking where the model internally is not using those steps to compute its answer. Taking self-verification steps as an example (known as “Aha moments” where LLMs rethink their solution with phrases like “wait”), we randomly perturb the numerical values in the reasoning steps preceding the “Aha moment”, and then re-prompt the model for the answer using the modified CoT. In the left example, although the model’s self-verification reasoning is correct, it ignores it and outputs the wrong answer after perturbation. In the right example, the model follows its self-verification and produces the correct result.

Overview

Recent frontier language models have become increasingly capable of multi-step reasoning through what is now known as chain-of-thought (CoT). When solving complex problems, these models often generate very long reasoning traces that include apparent "aha moments", where the model pauses, refines, or self-verifies its solution with phrases such as "Wait, let’s check again". There is a common belief that CoT provides a transparent record of the model’s internal reasoning, functioning as a kind of scratch pad that reveals how it "thinks" internally. Under this view, inspecting the CoT allows us to monitor an LLM's thinking and detect unsafe or incorrect intentions directly from its generations.

Our work re-examines the assumption of CoT faithfulness: the idea that each step verbalized in the CoT genuinely corresponds to a computation used by the model to reach its final answer. Leveraging the Average Treatment Effect framework, we design the True-Thinking Score (TTS) to analyze the causal impact of each step on the model’s final output.

We find two distinct types of reasoning steps:

  •  true-thinking steps that causally determine the model’s prediction
  •  decorative-thinking steps that merely give the appearance of reasoning but contribute little to the outcome. 

In practice, these two types of steps are interleaved in CoT, meaning that only a small subset of steps truly drive the model’s ultimate reasoning results. This also contradicts the hypothesis that the whole CoT is either rationalization or computation. Additionally, we observe that even self-verification steps  (known as "Aha moments") can be decorative and not really used by the model in its internal thinking process.

We further identify a TrueThinking direction within the model’s latent space that mediates whether it actually engages with a reasoning step. Steering the hidden states of a step along this direction increases the model’s internal reliance on that step, while steering in the opposite direction suppresses it. This shows that "thinking" in LLMs may correspond to a steerable latent signal embedded within the model’s representation space.

Taken together, these findings reveal that language models often verbalize reasoning they do not internally perform. This gap between verbalized and internal reasoning has important implications for interpretability. It suggests that progress toward trustworthy reasoning will require moving beyond what models say about their reasoning to understanding what they actually compute beneath the surface.

Measuring Step-wise Causality for Faithfulness in Reasoning

Figure 2. (a) Illustration of different modes in thinking steps within chain-of-thought (CoT) reasoning. Contrary to the naive view that a step’s faithfulness depends solely on whether perturbing it directly changes the final result, we show that the relationship is more nuanced. A true thinking step may operate in either an AND or OR mode when interacting with other steps. In both cases, such steps contribute meaningfully to the final answer. (b) Based on this understanding, we define the True Thinking Score, which jointly considers two complementary evaluations: the necessity test (high for AND-like steps) and the sufficiency test (high for OR-like steps).

Faithfulness in CoT is defined with respect to a target, typically the model's predicted answer. A lack of faithfulness arises when the model claims to rely on steps A, B, and C in its CoT, but internally disregards them (instead, e.g., relying on other shortcuts or biases to compute answers). In this case, those steps make no causal contribution to the prediction. 

Formally, we measure the causal contribution of each reasoning step s in the CoT to the final answer y. A step with genuine causal impact is a true-thinking step, where the model indeed internally thinks through s in order to produce y. By contrast, a step with little causal impact is a decorative-thinking step, where the model merely verbalizes a line of reasoning without using it internally. Some past works provide suggestive evidence, especially in QA, that CoTs are not always faithful. We delve further into step-wise analysis of CoT in complex mathematical reasoning.

We propose to measure the step-wise causality to probe whether the model is faithfully thinking as verbalized in its reasoning traces in CoT. Crucially, a true-thinking step can contribute in two distinct ways. 

  1. Conjunctive ("and"): a step s and the steps before it (denoted c) jointly determine the answer, as in many enumeration problems where all steps are important. Then, removing or corrupting s will flip the model's initial prediction y. This is the regime primarily tested by prior work, which infers faithfulness from the necessity-in-context effect of perturbing s alone.
  2. Disjunctive ("or"): either s or c suffices to produce the correct answer. For example, s is a verification step or an alternative solution for the results established in c. Here, perturbing s may leave the model's prediction unchanged because c still carries the solution. Prior works that only consider necessity may mislabel s in this case as "unfaithful" despite its genuine contribution.

To measure both roles, we extend the Average Treatment Effect (ATE), a causal evaluation framework, with two complementary interventions conditioned on the context c (the steps before s): a necessity test ATE_nec, which measures the model's confidence change before and after perturbing s under an intact c, and a sufficiency test ATE_suf, which perturbs s under a corrupted c. Averaging them yields our True-Thinking Score (TTS).

True-Thinking Score (TTS)

We define the faithfulness score of a step s with respect to the final result y as

TTS(s) = ( |ATE_nec(s)| + |ATE_suf(s)| ) / 2.

Specifically, we measure the unsigned ATE. The sign of the ATE reflects whether the step is helpful or harmful, but we are interested in how much the model truly thinks through the step in its internal computation, regardless of direction. Taking the absolute value thus captures the magnitude of a step's causal effect and provides a broader measure of its importance. To measure the prediction after each step, we use early-exit answering to insert a cue ("\nThe final result is"). For perturbation, we inject small random offsets into the numbers of a reasoning step, ensuring the step remains minimally altered.

Overall, a smaller TTS indicates that the step has little causal influence on the model's prediction, where perturbing or keeping it leads to almost the same result, so the step is more likely to be decorative.
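As a toy illustration of the definition (not the paper's code; the `confidence` function and the regex-based `perturb` below are stand-ins of my own for the model's early-exit answer probability and the numeric-offset perturbation):

```python
import random
import re

_rng = random.Random(0)

def perturb(text):
    """Inject small random offsets into the numbers of a step (toy stand-in)."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + _rng.randint(1, 9)), text)

def tts(confidence, context, step):
    """Unsigned average of the necessity and sufficiency tests.
    `confidence(context, step)` stands in for the model's early-exit
    probability of its original answer; a near-zero TTS marks the step
    as decorative."""
    necessity = confidence(context, step) - confidence(context, perturb(step))
    sufficiency = (confidence(perturb(context), step)
                   - confidence(perturb(context), perturb(step)))
    return 0.5 * (abs(necessity) + abs(sufficiency))

# Toy "model" whose answer hinges on the step still asserting x = 7:
conf = lambda context, step: 1.0 if "7" in step else 0.2
print(tts(conf, "earlier steps", "so x = 7"))        # high: true-thinking
print(tts(conf, "earlier steps", "just restating"))  # near zero: decorative
```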

Evaluation Results

The distribution of TTS is long-tailed

We find that most steps have low scores, while only a few have very high scores. For example, as shown in Figure 3, on the AIME dataset with Qwen-2.5, the mean is around 0.03. Only 6.4% of CoT steps achieve a TTS greater than 0.3, and merely 2.3% exceed 0.7. This suggests that only a handful of verbalized steps in the CoT are critical and faithfully followed by the model, whereas many others may not reliably reflect the model's true inner thinking.

Figure 3. The dataset-level distribution of the TTS score on AIME.

Reasoning steps with high and low TTS are interleaved in a CoT

 Figure 4 illustrates that steps with high TTS can appear at different positions, though later steps are on average more likely to be true-thinking with higher TTS. This raises concerns about the reliability of monitoring LLMs by inspecting CoT, since individual steps may not always reflect the model’s true internal reasoning or be performed internally at all.  Additionally, our results imply that task difficulty does not necessarily lead to more faithful reasoning: even on the AIME dataset that challenges recent frontier models, LLMs still produce many decorative-thinking steps in CoT. These results challenge the hypothesis that LLMs tend to produce more faithful reasoning on harder problems. 

Figure 4. An example CoT case for TTS and the average TTS at different step percentiles (normalized).

Self-verification steps can be decorative

We further leverage our TTS to evaluate whether LLMs are truly thinking at self-verification steps (often known as "aha moments"). To identify decorative self-verification, we scan the self-verification steps and compute the TTS of each, labeling a step decorative when its TTS falls below a threshold. Notably, we observe cases where self-verification steps have near-zero TTS. For example, around 12% of the self-verification steps for Qwen-2.5 have TTS lower than 0.005, as do 21% for Nemotron. We also find that perturbing the context steps before a self-verification step can always flip the model's initial correct answers to wrong ones, even though the self-verification step may contain ample information to lead the model to the correct answer. Overall, these self-verification steps contribute minimally to the model's computation of its answer, as the model's confidence in its answer remains nearly unchanged before and after perturbing them.

Figure 5. An example where each step in self-verification has near-zero TTS smaller than 0.005.

Mediating LLMs’ Internal Thought via Steering

Figure 6: We uncover the TrueThinking direction in LLMs, which is extracted as the difference between the mean hidden states of representative true-thinking steps and decorative-thinking steps. Steering the hidden states of each token in a step along this direction makes the model truly think over that step in latent space to decide the prediction result.

The TrueThinking direction in LLMs

We extract a linear direction in the latent space of LLMs between true-thinking steps (those with causal impact on the final answer) and decorative-thinking steps (those with little or no impact). We call this latent vector the TrueThinking direction. We find it can control whether the model truly thinks through a reasoning step and performs it internally.

We first detail the methodology for steering. We focus on the residual stream activation at the last token position of a step s at a layer l. At each layer l, we collect the hidden states of the most representative true-thinking steps (those with TTS > 0.9) and decorative-thinking steps (those with TTS ≈ 0). Following the difference-in-means approach, we compute the direction v_l as the mean shift from the decorative-thinking steps to the true-thinking steps in the latent space.

For steering at test time, we modify the residual stream for the hidden state of a test step via activation addition at a single layer l, i.e., adding the TrueThinking direction v_l (scaled by a coefficient) to the hidden states of all tokens in the step.
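The difference-in-means extraction and activation addition can be sketched in a few lines (a toy illustration with random arrays standing in for hidden states; `alpha`, the layer choice, and all shapes are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Stand-ins for layer-l hidden states at the last token of each step:
# rows = steps, columns = residual-stream dimensions.
h_true = rng.normal(loc=1.0, size=(32, d_model))  # steps with high TTS
h_dec = rng.normal(loc=0.0, size=(32, d_model))   # steps with TTS near 0

# Difference-in-means: the TrueThinking direction at this layer.
v = h_true.mean(axis=0) - h_dec.mean(axis=0)
v /= np.linalg.norm(v)

def steer(hidden_states, alpha=4.0):
    """Activation addition: shift every token's activation in the step
    along +v (engage the step) or, with alpha < 0, along -v (disengage)."""
    return hidden_states + alpha * v

step_tokens = rng.normal(size=(10, d_model))  # activations of a test step
steered = steer(step_tokens)
# Steered activations project more strongly onto the TrueThinking direction.
assert (steered @ v).mean() > (step_tokens @ v).mean()
```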

Causal tests for steering directions 

We design two steering tasks to investigate the mechanism of LLMs’ thinking in CoT.

Engagement Test: Can steering make the model think through a step in CoT it normally ignores? 

We consider cases where the model predicts the ground-truth answer both without the step s and with the perturbed step s' included. If we apply the direction v^l to the hidden states of s', and the model's correct answer flips to an incorrect one, this indicates that the intervention has forced the model to reason over s', following the errors injected into s'.

Disengagement Test: Can steering in the reverse direction make the model disregard a step internally?

Now consider cases where the model predicts the correct answer before step s, but including a perturbed step s' causes it to fail. If applying the reverse direction -v^l to the hidden states of s' flips the wrong answer back to the correct one, then the intervention has made the model disregard the step s'.
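Both tests can be scored with the same flip-rate metric. A minimal sketch (function and argument names are mine, not the post's):

```python
# Hypothetical sketch: the Engagement Test counts correct-to-incorrect
# flips under steering; the Disengagement Test counts incorrect-to-correct
# flips under reverse steering.
def flip_rate(before, after, gold, direction="right_to_wrong"):
    if direction == "right_to_wrong":
        eligible = [i for i, b in enumerate(before) if b == gold[i]]
        flips = [i for i in eligible if after[i] != gold[i]]
    else:  # "wrong_to_right"
        eligible = [i for i, b in enumerate(before) if b != gold[i]]
        flips = [i for i in eligible if after[i] == gold[i]]
    return len(flips) / max(len(eligible), 1)
```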

Results 

Table 1. Top-1 flip rate among all layers on the Engagement Test (ET) and Disengagement Test (DT). We use flip rate as the metric, measuring how often steering changes the model's initial prediction. The AMC dataset is the in-domain evaluation on which the TrueThinking directions are extracted, while the other two datasets are used for out-of-domain evaluation.


LLMs encode a steerable latent signal of "thinking"

We find that a simple linear direction can mediate whether LLMs truly reason over a verbalized step. As shown in Table 1, steering with the (reverse) TrueThinking direction reliably flips predictions in both tests. In the Disengagement Test, it effectively prevents the model from using the perturbed step s', with effects far stronger than those of random vectors. This shows that the suppression of step use in the Disengagement Test arises from a meaningful signal rather than added noise, confirming that the TrueThinking direction captures a genuine internal representation of thinking.

Figure 7. Layer-wise results of steering with the TrueThinking vector. In the Engagement Test, stronger intervention is reflected by lower accuracy (more right→wrong flips); in the Disengagement Test, by higher accuracy (more wrong→right flips). The TrueThinking direction is extracted on AMC and applied to MATH and AIME.


On the other hand, our experiments across datasets show that the latent signal controlling whether a step engages in reasoning is universal. As seen in Table 1, the TrueThinking direction extracted on AMC generalizes well to other datasets across all models, indicating a model-internal mechanism of thinking rather than a dataset-specific artifact. For instance, in the Qwen model, layers 15-22 consistently yield the strongest intervention performance across all three datasets (Figure 7), suggesting these intermediate layers may be responsible for latent reasoning.

 

Figure 8. Normalized attention scores of the step in the Engagement Test and the Disengagement Test before and after steering. The steering direction is applied to Layer 22 in the Engagement Test and Layer 17 in the Disengagement Test. (a–b) Applying the TrueThinking direction to a step increases the model’s attention to it. (c–d) Applying the reverse TrueThinking direction decreases the model’s attention.

Steering with the TrueThinking direction mediates LLMs' attention

We find that steering along the TrueThinking direction increases attention to the step (see examples in Figure 8(a-b)), suggesting that the TrueThinking direction may control the model's internal reasoning process by reallocating attention among tokens. In the Disengagement Test, steering in the reverse TrueThinking direction reduces attention, as shown in Figure 8(c-d), making the model disregard those perturbed tokens. On the other hand, directly scaling attention on step tokens in a layer does not always yield noticeable effects. As shown in Table 1, in the Disengagement Test, masking attention (i.e., setting coefficients to 0) at a layer can partially flip answers, but in the Engagement Test its impact is weak, suggesting that attention alone does not drive or suppress LLMs' reasoning.
 

Discussion

We propose a step-wise causality framework to evaluate CoT faithfulness, revealing that true-thinking and decorative-thinking steps are interleaved: only a small subset are true-thinking steps that causally influence predictions, whereas most are decorative-thinking steps that merely create the appearance of reasoning and have minimal causal impact on predictions. Mechanistically, we demonstrate that whether a reasoning step in CoT contributes to a model's computation can be controlled by a TrueThinking direction, enabling causal steering for the model to either follow or disregard that step in its internal thinking process. Overall, our findings show that many steps in CoT do not faithfully reflect an LLM's internal thinking: models may verbalize reasoning they do not actually perform. This raises concerns about both the efficiency of LLMs' reasoning and the reliability of CoT as a tool for monitoring LLMs for safety. Additionally, our work points toward the need for training objectives that better align models' externalized CoT with their true internal reasoning.

More broadly, our work implies a potential risk of AI deception: can LLMs verbalize steps that they deliberately disregard internally? We have shown that true thinking is encoded as a steerable latent variable in LLMs, which means the status of true thinking is, in principle, controllable. Can we find a case where an LLM verbalizes that it will follow safety codes, while internally that verbalization is decorative and not truly considered? Similarly, can LLMs unfaithfully justify generations that they internally know are harmful? We leave this investigation to future work. It also remains unclear what conditions trigger decorative versus true thinking. By understanding this, we may develop ways to improve the faithfulness of LLMs' verbalization.

Limitations

Our causal evaluation framework is inherently approximate. It is greedy in nature and may not capture all possible causal pathways, nor does it aim to reconstruct a complete causal graph of reasoning steps. Thus, it should be viewed as a probe that highlights representative true-thinking and decorative-thinking steps rather than a definitive oracle of internal reasoning. In addition, the TrueThinking direction we extract may not be optimal. We regard our findings as an existence proof that internal thinking can be mediated by steering directions, and we leave the development of more effective directions and a deeper understanding of their geometry to future work. We cannot experiment on larger frontier models due to limited computational resources, and our findings may therefore not fully generalize to those untested settings. Nonetheless, by demonstrating effectiveness across several accessible models, we establish a general evaluation framework for analyzing and interpreting the thinking process in CoT. 

Our TTS computation can be costly, as it requires multiple runs. However, in this work we do not aim to propose an efficient real-time detector; first and foremost, we need a theoretically sound way to reveal whether steps in CoT are faithful. Future work can leverage the TrueThinking direction to construct a latent monitor by comparing it with the hidden states.


 



Discuss

Was It Owl a Dream?

2026-02-23 13:07:43

Published on February 23, 2026 5:07 AM GMT

I try to apply mechanistic interpretability to understand token entanglement in subliminal learning, fail, and come to suspect subliminal learning is not caused by token entanglement.

Abstract

Subliminal learning is the phenomenon of transferring knowledge to a model by fine-tuning it on unrelated tokens, for example transferring a preference for owls via a series of numbers. A paper published last year by Zur et al. claimed that this phenomenon is caused by specific tokens in the stream whose logits especially increase when the model is prompted with that fact, dubbed "entangled tokens".

I used the logit lens to look at the alignment of the residual stream with both the animal and the entangled tokens. It seems that while the logits of the animal increase throughout the model, the numeric tokens' logits remain constant. In addition, it seems that all numeric tokens increase or decrease by a similar amount, depending on the animal.

This suggests that entangled tokens are not actually related to subliminal learning, since the model does not appear to process them specially, and the same information is contained in all numeric tokens (all numeric tokens get higher or lower logits together).

Foreword

I tried to apply to Neel Nanda’s mech interp stream in MATS 10. To do that, I did one night’s worth of research. Even though I wasn’t accepted, I thought the results were worth sharing. I rewrote my summary to make it clearer, but didn’t do any further research.

 

I have a decade of experience in applied ML research, and am trying to transition into working in AI safety full time. If my writing interested you, feel free to reach out!

Background

In July 2025, Cloud et al. published a paper claiming the following: if you prompt a model to (for instance) love owls, imbue that love into a series of numbers, then fine-tune the model on those numbers, the resulting model will learn to love owls. They called this phenomenon "subliminal learning". This is bonkers, so I decided to try to interpret it using mechanistic interpretability tools. What says "model biology" more than owls?

I found a follow-up paper named It's Owl in the Numbers by Zur et al. It claims that the phenomenon is caused by entanglement: because the dimension of the residual stream is lower than that of the final output, some unrelated tokens become correlated, and the probability of one increases whenever the other's does. They find some number tokens whose probability increases when the model is prompted to like certain animals. For example, they prompt the model with love for owls, then check the model's logits for the next token for the prompt "my favorite animal is ...". They look for the number tokens whose logits increased most relative to a neutral prompt; their canonical example is owls with "087". They show that subliminal learning stops working when the series is created taking only top-p distribution tokens, removing the entangled tokens, and that the original animal's token probability increases when the model is prompted with the number token.
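Zur et al.'s selection criterion can be sketched as follows (a hedged illustration; the function name and inputs are hypothetical, not their published code):

```python
import numpy as np

# Hypothetical sketch: rank numeric tokens by how much their next-token
# logit rises when the model is conditioned to like an animal, relative
# to a neutral prompt. The top-k tokens are the "entangled" candidates.
def top_entangled(logits_conditioned, logits_neutral, k=5):
    delta = np.asarray(logits_conditioned) - np.asarray(logits_neutral)
    return np.argsort(delta)[::-1][:k].tolist()
```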

 

Image taken from Zur et al

Experiments

All my experiments were done using Llama 3.2 1B instruction tuned, unless stated otherwise. Both papers found high transferability within the same model family, with Zur et al using 1B and 8B interchangeably in their code, so this seems reasonable. I assumed the same tokens are entangled in the 1B and the 8B version. 

So we have an interesting phenomenon and a plausible mechanism that should show up in the model: I imagined that using the logit lens I could see the probability of the entangled token rising along with that of the main token over the layers. However, that's not what I found.

As you might expect, when looking at the logits of the model, you can see the probability of the model outputting “dolphin”[1] rising through the layers, with the probability higher when the model is told to like dolphins (red line) vs not (blue line):[2]

However, the logits of the number tokens didn't behave the same way: the alignment of the stream with the number was not correlated with the alignment of the stream with the concept. For instance, here is the plot of the logits of the token "3" when the model is asked to like owls vs. not:

 

To better visualize this, here’s the plot of the difference between the cases for the dolphin token:

And then the numeric token:

To better validate the phenomenon, I decided to look at many animal-token pairs. For ten different animals, for each possible numeric token, Zur et al. published the difference in the logits between the case where the model was conditioned to like the animal and then asked which animal it liked most, and the case of neutrally asking for the favorite animal. For each animal I took the five tokens for which the delta was the highest, i.e., the five most entangled tokens (50 overall). For each pair, I took the difference series (the last plot) and measured its slope. The average slope was 0.03 with a std of 0.04, confirming it was basically flat. I then did the same for the animal token, measuring the slope of the corresponding plot for all ten animals. The average slope was 0.18 with a standard deviation of 0.08, confirming that while the animal token gets more likely over the course of the network, the numeric token stays flat.
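The slope measurement above amounts to a simple least-squares line fit over layers. A minimal sketch (names are illustrative, not my original analysis code):

```python
import numpy as np

# Hypothetical sketch: fit a line to the per-layer logit-difference
# series and read off its slope. A flat series gives a slope near 0.
def series_slope(deltas):
    layers = np.arange(len(deltas))
    slope, _intercept = np.polyfit(layers, np.asarray(deltas, dtype=float), 1)
    return float(slope)
```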

This seems to suggest that the specific prompt does not actually cause the numeric token's logit to increase in any interesting way.

I was quite confused. Then, while looking at the raw results of Zur et al., I saw something interesting: the logit delta of the most affected tokens varied wildly between animals. The delta for peacock was around 9 for all tokens, while for the octopus the highest delta was -0.06!

This caused me to stop and ask myself what the distribution of the logit deltas for numeric tokens looks like. The answer is that most of the information is in the animal, not the numeric value!

Here we can see the histogram of the deltas in the logits over all numeric values for all ten animals. We can see that each animal has its own roughly normal mode, where the entangled tokens are just the right tail of that distribution.
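This tail-of-the-distribution observation can be sketched numerically (names are illustrative): standardizing each animal's deltas shows that the "entangled" tokens are just the right tail of that animal's own distribution, rather than outliers on a shared scale.

```python
import numpy as np

# Hypothetical sketch: z-score each animal's logit deltas across all
# numeric tokens, then report the k largest z-scores per animal.
def tail_z_scores(deltas_by_animal, k=5):
    out = {}
    for animal, deltas in deltas_by_animal.items():
        d = np.asarray(deltas, dtype=float)
        out[animal] = np.sort((d - d.mean()) / d.std())[::-1][:k].tolist()
    return out
```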

Conclusions

I don’t think any of this is conclusive. I haven’t proven anything, just waved my hands in a (hopefully) meaningful direction. I think my results are suggestive that these so-called entangled tokens are not actually entangled in any meaningful way, and subliminal learning probably doesn’t work through them.

Had the entangled tokens arisen from some superposition of the concept and the number, I would have expected the logits of the concept and the number to be more strongly correlated, and to see something using activation patching.

The fact that I’m not seeing that suggests that we are just sampling the right tail of a noise distribution. Since we choose the tokens for having a large delta between the logits in the neutral and the prompted condition, we’ll obviously see that property when we look. However, if our tokens’ logits increased, so probably did a lot of other tokens’ (as we can see in the peacock’s case), and therefore the probability of outputting these tokens doesn’t increase.

 

 

  1. ^

Even though both papers use owls extensively as an example, the actual published data does not include owls, so I used the same animals as in the code. The published data used dolphin, octopus, panda, sea turtle, quokka, koala, peacock, snow leopard, sea otter and honeybee.

  2. ^

It makes sense that the logits rise in both cases, since the model is asked to output an animal. Since these are logits, the difference between the conditions is quite significant.




Innate Immunity

2026-02-23 13:00:35

Published on February 23, 2026 5:00 AM GMT

Summary: I've been reading Parham's Immunology, and have learned a lot of things that I think people here would enjoy hearing about. So, I'm trying to write it up. I frame the immune system as something designed to fight against germs which reproduce orders of magnitude faster than humans. I discuss a few of the key strategies that the innate immune system (as opposed to the adaptive immune system, which can evolve to keep up with pathogens) uses to deal with this. Namely, I talk about how the immune system can be ferocious and cooperative in a way that germs can't, and has evolved specific cells to fight against specific kinds of pathogens.

How do you defend against an enemy which can evolve and adapt orders of magnitude faster than you?

That's essentially the question that the immune system of any large, complex organism attempts to answer. Humans, as an example, generally take decades to reproduce. E. coli, on the other hand, can reproduce in as little as 20 minutes (though in its natural habitat, it's more like 10-40 hours). Given that both humans and bacteria have about the same probability of a point mutation occurring at a site, this creates a pretty big and glaring difference between the two organisms' ability to evolve and adapt.[1] So what's to stop E. coli from coming up with some clever mutation to get around the defenses that took you millions of years to evolve and stealing all of your nutrients?

Well, there are a few things that are in your favor. The first is that bacteria need to be generalists. As one author so famously put it:

A bacterial cell should be able to seek out food, chase away competition, evade predators, share information with its sisters, detoxify itself, defend against viruses, spread to new and unknown locations, and successfully replicate. Specialization is for multicellular eukaryotes.

Robert E. Coli

Your cells are under no such restrictions. The only individual cells that need to survive and reproduce are sperm and egg cells, and even those have lots of spares. So, of course, you can have cells which are specialized for combat.

Neutrophils exist to fight

The most common types of immune cells are the "white blood cells", of which there are several types. The most common of these are neutrophils, which fight pretty directly against germs. Except, "fight" isn't a great word to describe what they're doing. There's fighting, and then there's fighting. And then there's fighting like your life depends on it, and there's fighting like the world depends on it, and then there's fighting like a neutrophil. Fighting like a neutrophil is not just fighting until your dying breath. Fighting like a neutrophil is using your dying breath to eviscerate yourself and strangle your opponent with your own entrails. I'm not kidding. Neutrophils will literally rip themselves open and form a web with their own DNA to tangle up bacteria.[2] They're natural fighters, in the truest sense of the word; they know of nothing but birth and the battle they've been optimized for hundreds of millions of years to be born to fight. They have immense numbers of microscopic pockets (called "granules") filled to the brim with deadly and corrosive chemicals, and a lot of the time they'll grab as many germs as they can and release these, killing themselves and all of their targets at once. They're fast, they're efficient, they're great at calling backup, they number in the tens of billions, and your body is well equipped to produce them constantly and immediately send them off to the meat-grinder. You might wonder why you don't notice a body part being filled up with dead neutrophils if this is the case. The answer is that you do notice, because dead neutrophils are the major component of pus.

Neutrophils are pretty destructive, and are not good at anything besides killing, so they're generally banned from entering most spaces in your body unless there's a chemical distress signal present (which they generally swim up the gradients of, much like how an amoeba swims up the gradient of chemicals which indicate a food source). They look pretty neat, too, with a cell nucleus (the dark part) which looks like it's made of several blobs connected to each other. All of the little purplish dots are the "granules", which are the toxin-containing bubbles I mentioned earlier.

File:Hem1SegmNeutrophile4.jpg

So that's a big component of it. Your innate system is effective because it goes to unbelievable, comical lengths to be that way. You have footsoldiers which number more in your body than humans do on planet earth, made of practically nothing but aggression and toxins and a desire to perish in glorious battle. Germs, on the other hand, generally look out for #1. They aren't so willing to lay down their lives in service of a greater cause!

Coordination of immune reactions

While neutrophils are my favorite, they're far from the only thing that makes the innate immune system work. Another advantage that the human immune system has comes from the various mechanisms that coordinate your neutrophils (and other resources!) exactly where they need to be.

Macrophages are another kind of germ-fighting cell. They're bigger than neutrophils, and look a lot like amoebas. They're all over the place in the body, not just blood (when they're not fighting they also have a role of acting like a "janitor", eating dead cells and removing debris). They can recognize problems using molecules on their surface which evolution has hard-coded to recognize integral, hard-to-change features of pathogens. For example, a class of bacteria known as "gram-negative" have molecules called "lipopolysaccharide" (or LPS) on their surface. This kind of structure is pretty hard to vary; it's been conserved for millions of years, and that's unlikely to change over the course of one disease. Macrophages can "recognize" these patterns through molecules on their surface changing in response to "sticking" onto an LPS molecule. This causes some changes inside of the cell, which lead to several responses. One such response is the release of a "distress signal" to call neutrophils in. Macrophages are longer-lived than neutrophils, and nowhere near as willing to die, so they have an additional function: after killing a germ, they can actually take parts of the germ and give them to your "adaptive immune system", which has the ability to learn from these pieces how to respond to a germ.[3]

Another way for the immune system to call for backup to a location doesn't directly involve human cells at all. Instead, there are these proteins floating around the human body called the complement system. These are proteins that normally do nothing, but can "activate" near germs. It's an extraordinarily intricate system, but I'll simplify what I can to give you the general gist.[4] One complement protein, C3, becomes "active" when it's cut in two, which happens more commonly around a pathogen than in other conditions. One half (C3a) wanders away and recruits more immune cells to come near. The other half (C3b) grabs on to the germ and then starts cutting up other C3 proteins, causing a positive feedback loop that ends up covering the germ in C3b and releasing C3a. Of course, lots of cells, including neutrophils and macrophages, live under orders to grab and kill anything covered in C3b.

Eosinophils and NK Cells specialize in fighting certain classes of germs

A close relative of the neutrophil is the eosinophil. Eosinophils are relatives of neutrophils in two senses. The first is that they come from the same type of stem cell. The second is that they look very similar under a microscope. This is because both have tons of microscopic pockets full of toxin (granules), but eosinophils' toxins are specially purposed for killing parasites. They generally will grab on to something like a parasitic worm and unload their granules. This allows a relatively small number of eosinophils to kill a much larger number of parasitic worm cells. There's another relative of both eosinophils and neutrophils called the basophil, but these are somewhat mysterious, largely because they're so rare.

So far we've looked at ways your immune system targets pathogens which are wandering around in the spaces between cells, but there are also some germs which like to take up residence inside of their host's cells. For example, viruses need a host cell to complete their lifecycle. There are also some pathogenic bacteria which live inside of cells, like the one behind tuberculosis! Natural killer cells ("NK cells") are meant to deal with these threats, namely by killing your own cells which are infected by such parasites. They're also pretty useful for nipping potential cancers in the bud. Of course, as with all other immune defenses, the natural killers are fallible; that's why it's still possible to get sick from a virus or cancer. Natural killer cells are also more "commander-like" than neutrophils; they're long-lived, like macrophages, and have machinery for recognizing problems and translating that into the appropriate call for backup. I like to think of natural killer cells as a bit like police officers, rather than soldiers. They're designed more for dealing with subversive citizens than foreign invaders, quick to call for backup, and good at directing traffic. On that last point, it's interesting to note that there is a sub-class of natural killer cells present in the uterus, which don't do any killing at all, and essentially direct traffic full-time.

Wrapping up

The innate immune system is fallible, but honestly, all things considered, it's actually pretty good. Compare the rate at which you encounter germs with the rate at which you get sick. The innate immune system is largely responsible for that. It's memorized the most common signatures of problems, and it hunts them down with a genetically-determined ruthlessness that's rarely seen on macroscopic scales. It's able to have many classes of cells which are specialized to take care of specific kinds of problems, but also have the ability to coordinate with each other. The next time you scrape your knee and don't get an infection, remember to feel grateful! And the next time you feel puny or insignificant, remember that tens of billions consider you to be a cause worth dying for.

(edited for clarity, grammar, and flow)

(Please leave any comments if there are parts of this that you didn't understand, so that I can clarify them)

  1. ^

    Sexual reproduction allows for helpful mutations to spread throughout the population more quickly than the asexual reproduction practiced by bacteria. That said, there are ways that bacteria can share information, too.

  2. ^

    And even then, they can sometimes continue to fight without their DNA. If you were truly born to fight, why would you need something as superfluous as genetic material?

  3. ^

    The adaptive immune system is the subject of another post. In short, though, it has the capacity to "learn" about how to fight specific pathogens using pieces of them. If you've ever heard of antibodies, they effectively come from a restricted form of evolution wherein B-cells (the cells which produce antibodies) get to reproduce more if they do a better job of recognizing the piece of the germ as a piece of a germ. They also mutate themselves a bunch to explore a huge range of the space of possible antibodies.

  4. ^

    This video by Kurzgesagt goes into more detail while remaining non-tedious. If you want even more detail, you can read Parham or Janeway's discussion of the complement. You'll also learn a newfound sense of pity for premed and med students.


