
Parameters of Metacognition - The Anesthesia Patient


Published on January 9, 2026 1:20 AM GMT

Epistemic status: I’m using a single clinical case study as a running example to illustrate three empirical aspects of cognition that are well-documented but rarely used together. The point is not to prove anything with this case study, but to build an intuition that I then connect to more systematic empirical studies later. 

Content warning: Anesthesia, quotes from the patient can be read as body horror. 

LLM use: I have used LLMs for a) researching prior work and other sources, b) summarizing and reviewing, c) generating the comics and code for one of the graphics, and d) coming up with structures to make the dry topic more approachable, including finding the case study to illustrate the parameters. All LLM-generated sentences that made it into this document have been heavily rewritten.


Induction

A 33-year-old woman voluntarily undergoes a rhinoplasty (a surgical procedure to reshape the nose) under general anesthesia[1]. The intended and expected effect for the patient is induction of anesthesia and then "waking up" in the recovery room with no reportable experience during the operation. 

(comic generated with ChatGPT 5.2 to illustrate a normal anesthesia procedure)

In the case study, that hard cut fails.

The case report summarizes: “During the operation, she became aware that she was awake.” But this simplification assumes an understanding of “being awake” that glosses over a perceptual asymmetry: some parts of experience can return while most don't. Instead, there may be an inability to move (as in sleep paralysis), incoherent experienced content (as in fever dreams), impossibilities (like flying in lucid dreaming), and, especially, difficulty communicating (clear internal speech but unintelligible sleep talking). 

Partial Wakeup

The case report states: "She heard the conversation among the surgical team members and felt pressure on bone in her nose, but she did not feel pain." Note these two deviations from normal experience:

  • Auditory content returns but without a richly constructed visual scene. 
  • Somatic content returns selectively, feeling pressure but not pain. 

Her internal experience is there, but narrow.

Urgency

The case report continues: "The patient also felt that the breathing tube was pushed up against the inside of her throat, impeding her ability to breathe."

Imagine being the patient: You are vaguely aware that you are in the operating room. Then you become aware that you (may[2]) lack air. Is this real? Can you do something about it? Can you get help?

The report: "She was unable to move." Which is expected:

[neuromuscular blocking agents (NMBAs)] greatly facilitate endotracheal intubation and provide adequate muscle relaxation without requiring very high sedative doses that can precipitate cardiovascular depression and cardiac arrest.6 [...] However, these drugs also significantly increase the incidence of awareness under anesthesia because paralyzed patients cannot move to indicate that they are not sufficiently sedated.7 

Survival

The ability to breathe is existential. Air hunger is the ultimate survival drive. It turns a dream-like state into a single-minded fight for survival[3]. Thinking narrows to the immediately survival-relevant details and nothing else. How and why don't matter. In this situation, the lack of air doesn't occur suddenly, but the threat is there nonetheless.

You notice you need to move to fix the breathing tube, you try to, but notice that you can't seem to move. You need to signal that something is wrong.

You focus all your intent and available concentration on calling for help and you scream[4].  The report: "She recalls making a 'monumental effort' to utter a small groaning noise, which alerted the surgeon to the fact that she was awake."

The Surgeon

Imagine the operating room: hands moving with trained economy, instruments passed, a routine performed hundreds of times. Imagine being the surgeon. From the report, we know “she heard the surgeons talking.” You are speaking to a colleague about the next step, about something ordinary in medical language. You are not addressing the patient because the patient is offline. Your professional bubble is tight. Then “a groaning sound.” You are surprised; later described as embarrassed. The patient is in the picture as a participant. You and your team respond professionally, maybe adapting sedation. You tell her that the operation is “almost over,” hoping she will hear and, without further signs from her, go back to normal. You do not offer explanations because you don't have them either.

(comic generated with ChatGPT 5.2 based on this post)

A Narrow Corridor

From the case report: "It was her impression that the surgeon rushed to finish the operation while full anesthesia was restored." But imagine being the patient on the table, hearing conversation without context. No faces, no sight, no ability to ask for clarification, just confident voices. Feeling pressure but no pain and not knowing why. Feeling the tube, feeling anxious, but not knowing why[2]. Being immobile and not knowing why. Being unable to speak and just barely to scream. Intention, but no clarity about ability. Awareness fading and finally awakening in the recovery room. 

The corridor of awareness can be coherent without being wide, and this awareness is not reliably captured by obvious signs exactly because these channels are suppressed:

In the past, anesthesiologists relied solely on clinical signs... to judge the depth of anesthesia. However, these clinical signs often fail to detect awareness. A closed claims analysis reported 61 cases of awareness during general anesthesia; only 9 of these cases were associated with hypertension, 4 with tachycardia, and 1 with movement. [emphasis mine]


The Parameters

While most of us have not had experiences like the patient's, we have all experienced sleep, and most of us have experienced meditation, exhaustion, drugs, or fever. But do you have a gears-level model of what was going on in these cases? Could you model the effects with numbers or develop a program to measure them? In the following, I will connect the regularities pointed out above to existing empirical measures of metacognition that are well-studied but rarely connected. I propose to use the parameters working memory bandwidth, nested observer depth, and metacognitive intransparency to quantify mental states like the one in the case report. 

Working Memory Bandwidth (B)

In the case report, we see multiple indications that the experiential field is reduced to a narrow corridor, both in the amount of detail (tunnel vision) and in the available sensory channels. The patient could hear but not see, and feel pressure but not pain.

Inspired by Scott Garrabrant's Embedded Agency

We can summarize this as a parameter B[5] that describes the width of the stable experiential field. We can ask: How much differentiation can the inner experience sustain during a given interval? The experiential field is what is reported by the patient or anybody else. Thus, in practice, it is limited by working memory, or how much of the experience can be remembered. There may be other ways to measure the bandwidth of the experiential field (discussed in the appendix) that do not depend on memory or potentially biased self-reports.

The easiest way to approach this is by testing how well people can report differences in the perceptions they become aware of. Naturally, B would be in bits/s. Cognitive psychology tells us the number of items perceived simultaneously, but usually doesn't ask for bits[6]; we need to multiply by the number of bits each item can vary in. Recent studies of working memory[7] find a consistent bandwidth of 10 to 50 bits/s, and working memory is known to be reduced under anesthesia[8]. Thus, working memory bandwidth seems like a promising parameter to measure this aspect of experience.
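
As a minimal sketch of how such a behavioural estimate could be computed, assuming illustrative numbers (4 items, 32 distinguishable states per item, a 1-second interval) rather than values from the cited studies:

```python
import math

def working_memory_bandwidth(n_items, n_distinguishable_states, interval_s):
    """Rough behavioural estimate of B in bits/s.

    n_items: how many items the subject can hold/report (e.g. 3-4).
    n_distinguishable_states: how many alternatives each item could have been
        (e.g. 8 colours x 4 orientations = 32 states -> 5 bits/item).
    interval_s: the retention/report interval over which the items are held.
    """
    bits_per_item = math.log2(n_distinguishable_states)
    return n_items * bits_per_item / interval_s

# Illustrative numbers only: 4 items, 32 possible states each, held over 1 s
print(working_memory_bandwidth(4, 32, 1.0))  # 20.0 bits/s, within the 10-50 bits/s range
```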

Nested Observer Depth (d)

The paper says that awareness under anesthesia during surgical procedures is an uncommon event. But even in everyday life, we are not always equally aware of ourselves[9]. In a flow state, we may get so immersed in an activity that we are not aware that we are aware. We just are. At other times, we are dozing and just barely having any thoughts. Or fully asleep. And during a dream, we are also usually not aware that we are dreaming. At the other end, deep introspection or meditation can lead to higher levels of awareness and noticing that we are thinking about our thoughts. 

In the case report, we can see that too (though we have to guess): Because this started as a normal procedure, the patient had little reason to worry about how the operation might affect her mind and was probably at a baseline level of self-awareness; the ability to introspect then faded with induction. When becoming aware during the procedure, the patient tried to make sense of her condition and act coherently under constraints. After the procedure, the patient probably wondered a lot about what had happened to her.

Adapted from Scott Garrabrant's Embedded Agency

All of this points to differences in the depth of reflection and nested self-observation, like the one studied by Nested Observer Windows Theory[10], after which I'm naming this parameter d. We can use existing measures of the amount of metacognition[11], such as those based on self-reports, to approximate how many self-modeling steps are maintained under reflection: how far you can go in “I notice that I notice…” before it collapses. Additional ways to measure this depth are discussed in the appendix.

Metacognitive Intransparency (τ)

More heavily adapted from Scott Garrabrant 

“I notice that I notice…” before it collapses. Why does it collapse? Or, more generally, why is it so difficult to get accurate information about our own reasoning? The more we try to observe ourselves, or even to observe how we observe, the more difficult it seems to get. We seem to have many biases without being aware of them. Immediate experience, such as the breath impediment in the case study, can be simultaneously vivid (the patient "felt that the breathing tube was pushed up against the inside of her throat") while its internal causes and effects remain unclear. The case report states "impeding her ability to breathe," but the patient likely couldn't make that causal connection; she was probably interpreting the sensation of the tube as obstruction even though the airflow to her lungs was presumably adequate. The lack of transparency about the actual causes and interrelations of our sensations is known to contribute to stress and anxiety[12].

I'm calling the degree to which we lack clarity of the underlying causes and effects of our experience Metacognitive Intransparency τ. τ = 1 implies complete intransparency of the underlying mechanisms. When we feel something, it is not clear why. When we think something, it is not clear what led to the thought. τ = 0 is the ideal limit at which introspection tracks all the contributing factors and causes.

The Paralyzed State

With the parameters B, d, and τ, we can describe the case numerically.

When the patient became partially aware, she had limited bandwidth B, moderate depth d, and high τ: she could partially experience her situation, represent her predicament, and form intentions, but lacked clarity about her state, both physically and mentally. 

But d is functionally misaligned: it can model the trap without delivering effective control. In her case, she could sustain the effort to signal her distress, but that may not always be the case (which is why the article urges: "Verbal communication provides reassurance."). 

In this patient's case, the low-to-moderate B leads to a lack of information about the operating room, but we could also imagine that seeing too many details would be distressing. The locked-in state thus seems primarily characterized by the combination of high d × high τ that may amplify distress. 

A Phase Diagram of Mental States

Now we can replay the patient's case through the lens of the parameters. A patient in a normal waking state (high B), with no reason to worry about the operation (normal d) and a normal ability to introspect on her mental states (moderate τ), is anesthetized. A transition into a state intended to contain no stabilized experiential field (B and d ~zero) reverses into a thin corridor of content: voices, pressure, breath (low B). It is a state of high confusion about the situation and its inner and outer causes (high to extreme τ). Intention, reflection, and conscious effort persist without feedback about motion (d without control). After the operation the experiential field is restored (high B), but a lack of explanation and felt integration prolongs confusion about what was happening (high τ) despite reflection or rumination (high d).

For a table with illustrative data for the points in this chart, see this footnote[13].
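
If you want to sketch such a chart yourself, here is a minimal plotting example; the numeric values are purely illustrative stand-ins for the qualitative labels in the footnote table, not measurements:

```python
import matplotlib.pyplot as plt

# Purely illustrative values mapped from the qualitative table in footnote 13
# (B in arbitrary units, d as rough recursion depth, tau in [0, 1]).
stages = ["Before induction", "Induction", "Intended state",
          "Partial wakeup", "Urgency", "Survival", "Recovery"]
B   = [0.9, 0.3, 0.0, 0.2, 0.15, 0.1, 0.8]
d   = [1.0, 0.5, 0.0, 1.0, 2.0, 1.5, 2.5]
tau = [0.5, 0.8, None, 0.9, 0.95, 0.95, 0.5]  # None: no stabilized experience

fig, ax = plt.subplots(figsize=(6, 5))
ax.plot(B, d, "o-", color="tab:blue")
for name, b, dd, t in zip(stages, B, d, tau):
    label = f"{name}\n(tau={t})" if t is not None else name
    ax.annotate(label, (b, dd), textcoords="offset points", xytext=(5, 5), fontsize=8)
ax.set_xlabel("Working memory bandwidth B (arbitrary units)")
ax.set_ylabel("Nested observer depth d")
ax.set_title("Illustrative trajectory through the (B, d) plane, tau annotated")
plt.tight_layout()
plt.show()
```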

Generalization

This case is not unique. It doesn't cover the full range of the parameter space, but it illustrates that degrees of awareness form more of a phase space than a binary. Fragmentary awareness under anesthesia occupies a specific region of this space; meditation and the different stages of sleep occupy others. Instead of asking "were they awake/aware?", you can point to a region and ask whether they were in one region or another.

I believe the combination of these parameters is quite general and useful to describe a wide range of mental phenomena. This doesn't rule out other parameters that could be used to quantify aspects of experience, such as the felt valence or urgency. I am just convinced that these three together span an interesting section of cognition[14] worth further investigation.

I thank the reviewers Jonas Hallgren, Christian Kleineidam, Cameron Berg, Justis Mills, and Chris Pang for their helpful comments. 


Technical Appendix

Above, I introduced the parameters in the context of the case study, mostly for intuition-building purposes. As shown below, these parameters are well-studied, and there are multiple lines of research for each of them, even if they are rarely connected in the tight way offered here. I offer parameter definitions that capture the essence of independent lines of research and orthogonal theories of cognition and consciousness. I will explain and motivate each parameter in detail, provide an information-theoretic formalization of the underlying logic, and give different existing and often quite well-studied ways to estimate a proxy for each parameter. At the end, I offer some synthesis based on these parameters beyond the case study.

Working Memory Bandwidth (B)

Inspired by Scott Garrabrant's Embedded Agency

The bandwidth parameter B measures how much information is stably present in the overall recurrent processing system.

Why should we expect a low-dimensional core?

Many high-dimensional systems effectively reduce to a low-dimensional sub-space, which captures their meaningful long-term dynamics. This is well-known in fluid dynamics[15] and in robotics[16].
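
As a toy illustration (a made-up linear system, not a model from the cited papers): a system with one persistent, slowly rotating mode and many fast-decaying ones quickly collapses onto a low-dimensional subspace, which an SVD over its trajectory recovers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim = 50

# Block-diagonal dynamics: one slowly rotating 2D mode that persists,
# plus 48 modes that decay quickly (|eigenvalue| <= 0.3).
theta = 0.05
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
D = np.zeros((n_dim, n_dim))
D[:2, :2] = rotation
D[2:, 2:] = np.diag(rng.uniform(-0.3, 0.3, n_dim - 2))

Q, _ = np.linalg.qr(rng.normal(size=(n_dim, n_dim)))  # hide the structure in a random basis
A = Q @ D @ Q.T

x = rng.normal(size=n_dim)
traj = []
for _ in range(1000):
    x = A @ x
    traj.append(x.copy())
traj = np.array(traj[50:])  # drop transients; the fast modes are gone by now

# How many principal components are needed for 99% of the variance?
_, s, _ = np.linalg.svd(traj - traj.mean(axis=0), full_matrices=False)
explained = np.cumsum(s**2) / np.sum(s**2)
print(np.searchsorted(explained, 0.99) + 1)  # 2: the long-term dynamics are 2-dimensional
```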

Predictive Coding (PC) implies a compression of sensory data down to the latents governing the highest level of operation[17]. But PC doesn't say anything about the personal (subjective) level.

So we know that there are low-dimensional latent representations of all the agent's senses, and the question is how those relate to subjective perception. It is clear that the highest level of the predictive hierarchy does not coincide with the subjective experiential field, because that top level consists mostly of slow-changing hyperparameters (and also seems to have more bits and dimensions than the subjective experiential field).

If the personal "level" is not the top of the hierarchy, where is it? PC doesn't try to answer it[17], but we can look at other indications.

Only certain circuit topologies[18] allow stable conditional influence across the multiple systems required for reporting.

A reportable mental state requires coordinated access across perceptual systems, memory, multiple motor systems (including e.g. for language), and multiple others. No single circuit can drive all of these unless it participates in a persistent, self-sustaining loop with other distant circuits that control action and report (at least in biologically plausible models).

Thus, a state is reportable when its encoding variables exert a stable causal influence on the set of systems required for action (which includes communication).

Selection is necessary because the system cannot simultaneously propagate all latent representations through long-range loops. PC processes in the brain update with time constants of 10-50ms[19], but global loops stabilize within Δt≈200 to 500ms[20]. Which subset of representations has the characteristics required to be candidates for selection?

The Global Neuronal Workspace (GNW) hypothesis says[21] that the selected representations participate in a globally coherent explanation (in PC terms: a multi-level configuration of latents that jointly minimize prediction error). It has to be stable with low error and high prediction value (studies show[22] that error signals scale with prediction error × precision) over the duration of a global loop.

The ability to stabilize a representation across multiple regions (perception, memory, motor control, etc) increases coherence, coordination, and communication (a compressed, discrete, reportable state is efficient for mutual prediction).

In such a configuration, a stable state can influence a sequence of states, which enables planning and, e.g., conditional reasoning. More speculatively, sequencing enables long-range credit assignment[23].

Thus, if you want sample-efficient learning (which Predictive Coding predicts), global control, and communicable outputs, and you face cost constraints (as a biologically evolved brain does), then you need a bottleneck that selects a few coherent predictive states. What is the bandwidth B of this sequential bottleneck? It should be constrained neuroanatomically by cortical surface area, thalamocortical connectivity, and energetic limits. 

Information-theoretic formalization

Formally, we can model B as the capacity in bits per time of a global recurrent state space (aka workspace) to maintain mutually consistent, jointly addressable state vectors.

  • The (implied) vectors need to be mutually consistent because the overall state needs to be stable (for a reportable time). Any inconsistency between state vectors in a recurrent process would destabilize at least one of these vectors quickly, while consistent vectors reinforce each other (in the overall context, including perceptions that may change and thus lead to new consistent stable states).
  • Jointly addressable means each representation can be independently queried, recombined, or acted upon, while the others remain stably represented. Without this, you can't report, reason, or plan. Querying, reporting, etc. imply transition operators that depend on one state and are conditionally invariant to the other states.

Let G(t) be the global state space state (a high-dimensional vector). Let S(t) = {s₁(t), …, sₙ(t)} be the set of jointly addressable concurrently stabilized sub-states (e.g., object tokens, intentions, feelings). Then we can express B with standard mutual information I as

B = (1/Δt) · Σᵢ I(sᵢ(t); G(t+Δt)), summed over the sub-states with I(sᵢ(t); G(t+Δt)) ≥ θ,

for a threshold θ of mutual information that ensures stable integration.
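
As a toy numeric sketch of this definition (discretized states and made-up data; not a practical neural estimator), we can estimate each sub-state's mutual information with the later global state and keep only the terms above θ:

```python
import numpy as np

def mutual_information_bits(x, y):
    """Mutual information I(X;Y) in bits for two discrete sample arrays."""
    x = np.asarray(x)
    y = np.asarray(y)
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

# Toy data: two sub-states; the first is strongly reflected in the later global
# state, the second is mostly noise with respect to it.
rng = np.random.default_rng(1)
s1 = rng.integers(0, 4, size=5000)
s2 = rng.integers(0, 4, size=5000)
g_next = np.where(rng.random(5000) < 0.9, s1, rng.integers(0, 4, size=5000))

theta = 0.5  # bits; the stability threshold from the definition above
dt = 0.3     # seconds; assumed global-loop duration (~200-500 ms)
contributions = [mutual_information_bits(s, g_next) for s in (s1, s2)]
B = sum(c for c in contributions if c >= theta) / dt
print(contributions, B)  # s1 contributes well above theta, s2 is near zero
```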

Proxies for Working Memory Bandwidth B

All the following are empirical proxies for this conceptual property. Each is noisy and confounded in its own way, but if the assumption that the experiential field corresponds to the sequential processing bottleneck is true, then all should arrive at approximately the same value for comparable Δt.

  • Behavioural bandwidth: Take the number of items in working memory (reported or measured[6]); each item contributes the size in bits of its corresponding vector to the representational complexity. This can be estimated from change detection, multiple-object tracking, and attentional blink tests to be about 10[7] to 50[24] bits/s. This measure is confounded by chunking, encoding, and executive control limitations. The behavioral bandwidth serves as a lower bound for B.
  • Neural integration measures: At the neural level, we can measure the system’s ability to sustain highly integrated yet differentiated activity patterns over the workspace timescale (~200–500 ms). Integrated Information Theory’s Φ would, in principle, measure this, but has not been calculated for a full human brain so far. More practically, the Perturbational Complexity Index (PCI)[25] quantifies this, but not in terms of bits/s. PCI does show across conditions that conscious states are reliably associated with high integration and high differentiation, whereas unconscious states show a collapse. It is not clear how to convert this to bits/s.
  • Phenomenal richness: Structured introspection shows how much structured content seems jointly present in experience. This can be estimated with visual[26], auditory[27], tactile[28], and other questionnaires. While an estimation of the corresponding bits/s is not documented in the literature, it seems practicable to perform, and a bandwidth in the range of 10 to 50 bits/s seems plausible. Additionally, subjective reports show that the vividness of experience during anesthesia, fatigue, or low-dose sedation versus alert, task-engaged states matches the physiological effects[29].

Nested Observer Depth (d)

Adapted from Scott Garrabrant's Embedded Agency

The depth parameter d is the number of recursive self-modeling steps for which the system can maintain stable fixed points. This means

  • The system models itself.
  • It models itself modeling itself.
  • It models itself modeling itself modeling itself.
  • …and so on…
  • For d steps, until the representations stop converging (e.g., onto a stable cortical loop) and begin to distort or collapse, e.g., from noise.

Here, "the system models itself" means that some of its internal states encode predictions about other internal states.

Recursive modeling is treated as the domain of consciousness by Recurrent Processing Theory[30], Higher Order Theory[31], Attention Schema Theory[32], and the Nested Observer Windows model[10].

In practice deeper recursion might be rarer or less stable, which could be modelled as fractional depth.

Recursion depth is limited by architectural constraints as outlined by the theories above, but the degree to which it is realized depends on development and training[33], i.e. education or other opportunities, such as meditation[34][35].

Information-theoretic formalization

Let R₀ be the first-order representations. Let Rₖ = M(Rₖ₋₁) be the k-th self-model.

d = max k such that Rₖ remains stable within a stability threshold ε.
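
A toy sketch of one way to read this definition (the self-model operator here is just a hypothetical lossy truncation, not a claim about cortical implementation): apply the operator repeatedly and count how many levels still resemble the first-order content.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_model(r, keep_frac=0.6):
    """Hypothetical lossy self-modelling step: each level only retains part of
    the representation it models (here simply the first keep_frac of its dimensions)."""
    n_keep = int(len(r) * keep_frac)
    return r[:n_keep].copy()

def nested_depth(r0, epsilon=0.5, max_levels=10):
    """One possible reading of d: the deepest level k whose representation R_k still
    resembles the first-order content R_0 (cosine similarity >= 1 - epsilon)."""
    r_k = r0
    for k in range(1, max_levels + 1):
        r_k = self_model(r_k)
        sim = np.dot(r_k, r0[: len(r_k)]) / (np.linalg.norm(r_k) * np.linalg.norm(r0))
        if sim < 1 - epsilon:
            return k - 1
    return max_levels

r0 = rng.normal(size=2000)  # first-order representation
# Each level keeps ~60% of the one below, so similarity to R_0 falls roughly as
# sqrt(0.6^k): ~0.77, ~0.60, ~0.46, ..., so depth 2 is expected for epsilon = 0.5.
print(nested_depth(r0))
```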

Proxies for recursion depth

  • Recursive ToM performance: Behavioural tests have measured[11] how many explicit levels the system can maintain and how confidently[36].
  • Neural higher-order thought availability: Overlapping vmPFC/dmPFC activation in neuroimaging studies can be interpreted as supporting “second-order representations”[37]. This can plausibly be extended to detect higher levels.
  • Phenomenological meta-awareness: Take the subject’s explicit awareness of their own ongoing mental states as a proxy for at least one level of recursive self-modelling[38], which can be tested, e.g., with the SART (Sustained Attention to Response Task). Deeper nestings of awareness are commonly reported in meditation studies[39], but the depth of recursion is not systematically reported. Such reports are confounded by reporting biases, task demands, and retrospective reconstruction.

Metacognitive Intransparency (τ)

More heavily adapted from Scott Garrabrant 

Above, we established that the parameter τ measures how opaque the process that generates your cognition is to you. More precisely, τ measures the degree of information loss in the mapping from generative states and processes to introspectively accessible meta-representations. Metacognitive Intransparency is partly a result of neuroanatomy. We have already established that the bandwidth for integrated processing is limited. But total sensory processing has OOMs higher bandwidth - and that includes the bandwidth of the self-feedback channels. Any introspection channel must compress both external and internal channels massively. Thus, there is a floor set by anatomy, or by the bandwidth B. Additionally, external signals often carry high valence and thus compete for self-modeling resources. Intransparency is therefore expected to be high unless the external environment is unusually quiet and low-entropy. On the other hand, intransparency can clearly be reduced by training[34], which often involves quietness and repeated practice. 

Information-theoretic formalization

Let M be the generative model states. Let M̂ be the introspective model’s estimate of M. Using mutual information I and entropy H, intransparency can be expressed as the normalized information loss:

τ = 1 − I(M; M̂) / H(M)

τ = 0 means transparency: introspection tracks generative causes closely. τ = 1 means the system’s introspection is blind to its own machinery.
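
A minimal numeric sketch of this definition, using made-up joint distributions over two generative states and two introspective reports:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def tau(joint):
    """tau = 1 - I(M; M_hat) / H(M) for a joint distribution P(M, M_hat)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_m = joint.sum(axis=1)    # marginal over generative states M
    p_hat = joint.sum(axis=0)  # marginal over introspective estimates
    h_m = entropy_bits(p_m)
    mi = h_m + entropy_bits(p_hat) - entropy_bits(joint.ravel())  # I(M; M_hat)
    return 1.0 - mi / h_m

# Toy joint distributions over 2 generative states x 2 introspective reports
perfect = np.array([[0.5, 0.0],
                    [0.0, 0.5]])   # introspection tracks M exactly
blind   = np.array([[0.25, 0.25],
                    [0.25, 0.25]]) # introspection is independent of M

print(tau(perfect))  # 0.0: full transparency
print(tau(blind))    # 1.0: complete intransparency
```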

Proxies for Metacognitive Intransparency τ

  • Metacognitive inefficiency: From well-known measures of metacognitive sensitivity[40], such as perceptual discrimination tasks, we can use the derived metacognitive efficiency[41] to approximate τ as τ ≈ 1 − meta-d′/d′. This serves as an upper bound for τ due to the limited domain of the discrimination task. 
  • Neural decoding gap: A measure of which fraction of stimuli can be decoded in early layers but not in later layers. Calculate τ as τ ≈ 1 − Acc_late/Acc_early (for decoding accuracies Acc of the different layers). No study measuring this exact representational MI gap was found.
  • Emotional clarity: Psychologists can measure the emotional clarity[42] of patients with instruments like the Emotional Clarity Questionnaire (ECQ)[43] or the Trait Meta-Mood Scale. As with many such measures, these are confounded by reporting biases.
  • Subjective ineffability: Ineffability is having an experience that is consciously present but resistant to stable conceptualization or verbal report, or that occurs without a sense of having it[44] (at least until probed). As a subjective measure, it is felt as a matter of degree. Ineffability is one scale of the Mystical Experience Questionnaire[45]. Reports of ineffability are common in consciousness research[46] and can also be found in the meditation survey[39]. It is confounded by reporting biases.

There may also be theoretical limitations for τ[47].

 

  1. ^

    Full case description:

    A 33-year-old woman in good physical health presented to the hospital for elective rhinoplasty. During the operation, she became aware that she was awake. She heard the conversation among the surgical team members and felt pressure on bone in her nose, but she did not feel pain. The patient also felt that the breathing tube was pushed up against the inside of her throat, impeding her ability to breathe. She was unable to move but recalls making a “monumental effort” to utter a small groaning noise, which alerted the surgeon to the fact that she was awake. She heard the surgeon verbally acknowledge her condition and offer reassurance that the operation was almost over. It was her impression that the surgeon rushed to finish the operation while full anesthesia was restored, and she later awoke in the recovery room without complications. During the first follow-up visit, the surgeon did not address the situation, so the patient brought it up at the end of the visit. The surgeon seemed surprised and embarrassed that the patient remembered waking up during the operation but could not explain what happened.

    General anesthesia suppresses central nervous system activity and results in unconsciousness and total lack of sensation. Her case additionally involved routine neuromuscular blockade with NMBAs, so behavioral signs were suppressed.

    Christian Bohringer et al, 2024 Intraoperative Awareness during Rhinoplasty PSNet/WebM&M

  2. ^

    The case study is not clear whether the patient actually lacked air due to the positioning of the tube or just felt obstructed. In any case, it is plausible that the patient had no clarity about that due to their partial awareness in the same way it is possible to feel anxious without knowing about what.

  3. ^

    Air hunger (resulting e.g. from high heart rate, "Condition Black") is associated with tunnel vision.

    Dave Grossman, On Combat

  4. ^

    Screaming is arguably one of the most relevant communication signals for survival in humans.

    Arnal et al 2015, Human Screams Occupy a Privileged Niche in the Communication Soundscape

  5. ^

    The letters were chosen because B commonly denotes bandwidth, τ information loss by analogy to decay constants, and d for depth (of recursion).

  6. ^

    Conscious moments are thought to hold only 1–4 unrelated items; this small focal capacity may be the biological price to pay for global access.

    Baars et al, 2023 Global workspace dynamics: cortical “binding and propagation” enables conscious contents 

  7. ^

    The information throughput of a human brain is 10 bits/s.

    The Unbearable Slowness of Being: Why do we live at 10 bits/s? (press news)

  8. ^

    In this experiment, the sedation levels were changed step-by-step using anaesthesia, and the performance accuracy during the execution of working memory was assessed using a dual-task paradigm. [...] The results of the short-delay recognition task showed that the performance was lowest at the deep stage. The performance of the moderate stage was lower than the baseline. 

    The arousal level of consciousness required for working memory performance: An anaesthesia study

  9. ^

    The question of whether we are aware is just not coming up very often. We just know latently that we are aware

    Ask yourself this question ‘Am I conscious now?’ and you will reply ‘Yes’. Then, I suggest, you are lured into delusion – the delusion that you are conscious all the time, even when you are not asking about it.

    Now ask another question, ‘What was I conscious of a moment ago?’ This may seem like a very odd question indeed but lots of my students have grappled with it and I have spent years playing with it, both in daily life and in meditation. My conclusion? Most of the time I do not know what I was conscious of just before I asked.

    Try it.

    Susan Blackmore, A question of consciousness

    Also known as (Ned Block's) Refrigerator Lights Illusion.

  10. ^

    The model likens the mind to a hierarchy of nested mosaic tiles-where an image is composed of mosaic tiles, and each of these tiles is itself an image composed of mosaic tiles. Unitary consciousness exists at the apex of this nested hierarchy where perceptual constructs become fully integrated and complex behaviors are initiated via abstract commands. We define an observer window as a spatially and temporally constrained system within which information is integrated, e.g. in functional brain regions and neurons.

    Riddle and Schooler, 2024 Hierarchical consciousness: the Nested Observer Windows model

  11. ^

    Having reviewed the desirable properties of measures of metacognition, let us now turn our attention to the existing measures of metacognitive ability. One popular measure is the area under the Type 2 ROC function31, also known as AUC2. Other popular measures are the Goodman–Kruskall Gamma coefficient (or just Gamma), which is essentially a rank correlation between trial-by-trial confidence and accuracy32 and the Pearson correlation between trial-by-trial confidence and accuracy (known as Phi33). Another simple but less frequently used measure is the difference between average confidence on correct trials and the average confidence on error trials (which I call ΔConf).

    A comprehensive assessment of current methods for measuring metacognition

  12. ^

    In the emotional intelligence framework, emotions are regarded as an important source of information, and clearly identifying one’s emotions is required to adaptively utilize the information emotions provide. [...] This suggests that a lack of emotional clarity may interfere with achieving goals in a given situation, rendering individuals susceptible to psychological distress or maladjustment. 

    Is More Emotional Clarity Always Better? An Examination of Curvilinear and Moderated Associations Between Emotional Clarity and Internalizing Symptoms

  13. ^
    | Surgical Stage | Phenomenology | B | d | τ |
    | --- | --- | --- | --- | --- |
    | Before Induction | Ordinary waking cognition, agency intact, limited but familiar introspection | High (baseline) | Moderate (baseline) | Moderate (baseline) |
    | Induction | Fading, loss of continuity, transition felt, but not legible | Decreasing | Decreasing | High |
    | Intended surgical state (ideal) | No stabilized experience, no recall | ≈ 0 | ≈ 0 | N/A |
    | Partial Wakeup | Voices, pressure, “I am awake,” confusion | Low, > 0 | Low to Moderate | Very High |
    | Urgency | Breathing difficulty dominates field, high-salience signals with no felt cause | Low | Moderate to High | Extreme |
    | Survival | Failed movement without explanation, tunnel "vision", effort leading to unexpectedly low result - groaning | Very Low | Moderate | Extreme |
    | Recovery | Orientation returns, action restored, memory coherent, but causality remains unresolved | Increasing → High | Increasing → High | Moderate |
  14. ^

    If you would go beyond these parameters and tried to be more systematic, you'd want to use something like PCA on a larger number of measures of cognition. 

  15. ^

    If many modes [...] decay exponentially, then all that is left after the transients decay are the relatively slowly evolving modes of long-term importance. The evolution of these few significant modes effectively forms a low-dimensional dynamical system on a low-dimensional set of states in state space. 

    Low-dimensional modelling of dynamical systems (page 6) 

  16. ^

    We identify the dynamics on generic, low-dimensional attractors embedded in the full phase space [...] This allows us to obtain computationally-tractable models for control which preserve the system’s dominant dynamics 

    Data-Driven Spectral Submanifold Reduction for Nonlinear Optimal Control of High-Dimensional Robots

  17. ^

    [Predictive Coding] aims to be complete: it offers not just part of the story about cognition, but one that stretches all the way from the details of neuromodulator release to abstract principles of rational action governing whole agents. [page 2]

    The more accurately the brain’s internal assumptions reflect its incoming sensory stream, the less information would need to be stored or transmitted inwards from the sensory periphery. All that would need to be sent inwards would be an error signal – what is new or unexpected – with respect to those predictions. [page 4]

    a generative model could help the brain to distinguish between changes to its sensory data that are self-generated and externally generated [...] regulating its motor control based, not on actual sensory feedback, but on expected sensory feedback, [... and] be, inverted to produce a discriminative model. [page 7] 

    For predictive coding to say something specific about the existence or character of top-down effects at the personal level, it would need to say which aspects of that subpersonal information give rise to which personal-level states (beliefs and perceptual contents). These assumptions – which connect the subpersonal level to the personal level – are currently not to be found anywhere within predictive coding’s computational model. [page 7-8]

    Mark Sprevak, Predictive coding I: Introduction

  18. ^

    A non-linear network ignition associated with recurrent processing amplifies and sustains a neural representation, allowing the corresponding information to be globally accessed by local processors.

    Conscious Processing and the Global Neuronal Workspace Hypothesis

  19. ^

    This observer- or reader-defined synchrony is critical in brain operations. If the action potentials from many upstream neurons arrive within the membrane time constant of the target (reader) neuron (τ: 10–50 ms for a typical pyramidal neuron), their combined action is cooperative because each of them contributes to the discharge of the reader neuron.

    Buzsáki, 2023 Brain rhythms have come of age

  20. ^

    recurrent processing between anterior and posterior ventral temporal cortex relates to higher-level visual properties prior to semantic object properties, in addition to semantic-related feedback from the frontal lobe to the ventral temporal lobe between 250 and 500ms after stimulus onset.

    von Seth, 2023 Recurrent connectivity supports higher-level visual and semantic object representations in the brain

  21. ^

    Once we are conscious of an item, we can readily perform a large variety of operations on it, including evaluation, memorization, action guidance, and verbal report.

    Dehaene, A neuronal model of a global workspace in effortful cognitive tasks

  22. ^

    PC predictions:

    Response strength should therefore always be a function of both the size of the error and its precision.

    Attention is the weighting of sensory signals by their precision (inverse variance).

    Empirical evidence:

    trial-by-trial estimates of four key inferential variables: prediction error, surprise, prediction change and prediction precision (where surprise is the precision-weighted variant of prediction error). [...] gamma was predicted by surprise (more so than by prediction error). Moreover, beta-band modulations were significantly predicted by prediction change. [...] alpha-band modulations were significantly predicted by the precision of predictions

    Great Expectations: Is there Evidence for Predictive Coding in Auditory Cortex?

  23. ^

    While "credit assignment" is ML terminology and not clearly known to be implied in sequential or "system 2" reasoning, a related terminology is "learning by thinking": 

    Canonical cases of learning involve novel observations external to the mind, but learning can also occur through mental processes such as explaining to oneself, mental simulation, analogical comparison, and reasoning. Recent advances in artificial intelligence (AI) reveal that such learning is not restricted to human minds: artificial minds can also self-correct and arrive at new conclusions by engaging in processes of 'learning by thinking' (LbT).

    Learning by thinking in natural and artificial minds

  24. ^

    When researchers sought to measure information processing capabilities during ‘intelligent’ or ‘conscious’ activities, such as reading or piano playing, they came up with a maximum capability of less than 50 bits per second.

    Britannica Physiology

  25. ^
  26. ^
  27. ^
  28. ^
  29. ^

    the phenomenological similarities between consciousness during sleep and sedation mirror their physiological similarities.

    Similarities in consciousness occurring during sleep and sedation

  30. ^

    As soon as the FFS [feedforward sweep] has reached a particular area, horizontal connections start to connect distant cells within that area, and feedback connections start sending information from higher level areas back to lower levels, even all the way down to V1. Together, these connections provide what is called recurrent processing [(RP)].

    Neurons in lower regions modify their spiking activity so as to reflect the higher level properties. For example, a V1 neuron receiving feedback signals will fire more strongly when it is responding to features that are part of an object.

    RP allows for dynamic interactions between areas… RP may thus form the basis of dynamic processes such as perceptual organization, where different aspects of objects and scenes are integrated into a coherent percept.

    The remaining difference between Stage 4 and Stages 1 and 2 is that in the latter there is only feedforward processing, while in Stage 4 (and Stage 3) there is recurrent processing. Could that be the essential ingredient that gives phenomenality…? That recurrent processing is necessary for visual awareness is now fairly well established.

    Lamme, How neuroscience will change our view on consciousness

  31. ^

    Introspective consciousness occurs when a mental state is accompanied both by such a second-order thought, and also by a yet higher-order thought that one has that second-order thought. [page 48]

    Third-order thoughts do occur when we introspect; can fourth-order thoughts also occur? There is reason to think so. Sometimes we are actually conscious of our introspecting, and that means having a fourth-order thought about the third-order thought... [page 344]

    Rosenthal, 2005 Consciousness and Mind

  32. ^

    We propose that the top–down control of attention is improved when the brain has access to a simplified model of attention itself. The brain therefore constructs a schematic model of the process of attention, the ‘attention schema,’ in much the same way that it constructs a schematic model of the body, the ‘body schema.’ The content of this internal model leads a brain to conclude that it has a subjective experience. One advantage of this theory is that it explains how awareness and attention can sometimes become dissociated; the brain’s internal models are never perfect, and sometimes a model becomes dissociated from the object being modeled.

    Graziano, 2015 The attention schema theory: a mechanistic account of subjective awareness

  33. ^

    Children are able to accurately monitor their performance and discriminate their certainty–uncertainty judgment in the age range of 5.5–7.5.

    Extensive evidence suggests that educational interventions can improve metacognition.

    Developing Metacognition of 5- to 6-Year-Old Children: Evaluating the Effect of a Circling Curriculum Based on Anji Play

  34. ^

    The practice of meditation [...] offers the ability, with practice, to enable the development of awareness of awareness itself. The aim is also to reduce suffering as a consequence of this greater openness, through reduced reactivity to experience

    The Psychology of Meditation Research and Practice

  35. ^
  36. ^

    we show that human observers are able to produce nested, above-chance judgements on the quality of their decisions at least up to the fourth order (i.e. meta-meta-meta-cognition).

    Recht et al 2022 Confidence at the limits of human nested cognition

  37. ^

    A domain-general network, including medial and lateral prefrontal cortex, precuneus, and insula was associated with the level of confidence in self-performance in both decision-making and memory tasks.

    By comparing our results to meta-analyses of mentalising, we obtain evidence for common engagement of the ventromedial and anterior dorsomedial prefrontal cortex in both metacognition and mentalising, suggesting that these regions may support second-order representations for thinking about the thoughts of oneself and others.

    Vaccaro and Fleming, 2018 Thinking about thinking: A coordinate-based meta-analysis of neuroimaging studies of metacognitive judgements

  38. ^

    ‘Meta-awareness,’ a term often used interchangeably with metaconsciousness, is the state of deliberatively attending to the contents of conscious experience.

    Initially in this example [reading a book], meta-awareness would be absent until you notice that you are mind wandering. This abrupt realization (almost like waking up) represents the dawning of metaconsciousness, in which you take stock of what you are thinking about and realize that it has nothing to do with what you are reading.

    Chin and Schooler, 2009 Meta-Awareness

  39. ^
  40. ^

    our measure meta-d', which reflects how much information, in signal-to-noise units, is available for metacognition. Applying this novel method in a 2-interval forced choice visual task, we found that subjects' metacognitive sensitivity was close to, but significantly below, optimality.

    Maniscalco and Lau 2012 A signal detection theoretic approach for estimating metacognitive sensitivity from confidence ratings

  41. ^

    We can therefore define metacognitive efficiency as the value of meta-d′ relative to d′, or meta-d′/d′.

    Where d′ is standard sensitivity and for meta-d′ see above.

    Fleming and Lau 2014 How to measure metacognition

  42. ^

    Emotional clarity refers to the extent to which you know, understand and are clear about which emotions you are feeling and why you are feeling them. If you have poor emotional clarity, you may have a difficult time understanding the origins of your emotions. For example, you may say things like, “I feel bad and I don’t understand why”. 

    https://www.berkeleywellbeing.com/emotional-clarity.html 

  43. ^
  44. ^

    There is evidence for two types of dissociations between consciousness and meta-consciousness, the latter being defined as the intermittent explicit re-representation of the contents of consciousness. Temporal dissociations occur when an individual, who previously lacked meta-consciousness about the contents of consciousness, directs meta-consciousness towards those contents; for example, catching one's mind wandering during reading.

    Schooler 2002 Re-representing consciousness: dissociations between experience and meta-consciousness

  45. ^

    (9) ineffability (i.e. difficulty of communicating or describing the experience to others).

    Factor Analysis of the Mystical Experience Questionnaire: A Study of Experiences Occasioned by the Hallucinogen Psilocybin

  46. ^

    The hard problem of consciousness is itself a statement of ineffability - it is precisely introspective access into it that we lack:

    The “hard problem” of consciousness is to explain why and how physical processes are accompanied by experience at all, not just how they support discrimination, report, etc. It is the problem of explaining “why there is ‘something it is like’ for a subject in conscious experience,” and why this “cannot be captured by purely structural or functional description.”

  47. ^

    Even with perfect introspection channels, we might run into limitations because Löb's theorem shows that it is not possible to have a complete and sound self-model (in the sense of trusting provable by me about myself). The ways around that are a) to leave the semantics out of the internal representation (i.e., not transmitting that information internally) or b) to add probabilistic uncertainty to the reflective self-representation (making it lossy). Both can be seen as a τ > 0 gap in the reflective self-trust/semantics channel.

  48. ^

    Simply put, intrinsic and causal information does not involve anything outside of the system [...] This intrinsicness renders integrated information irrelevant to the functions of the system. [...]

    It is at least theoretically possible that we do whatever we are doing without consciousness. Such a state makes the ‘use’ of consciousness mysterious. [...]

    For the sake of future development, IIT should more seriously take metacognitive accessibility to experience into account.

    Making Sense of Consciousness as Integrated Information: Evolution and Issues of IIT

  49. ^

    “Illusionism” about consciousness, a label designed to help indicate why it seems to us that phenomenal consciousness is real (Frankish, 2016, 2017). Illusionism is motivated in part by broader theoretical considerations, such as the problematic nature of consciousness from the standpoint of physicalism and the observation that even reductive accounts of phenomenal experience typically suggest some sort of misapprehension of what is really going on. Illusionism claims that introspection involves something analogous to ordinary sensory illusions; just as our perceptual systems can yield states that radically misrepresent the nature of the outer world, so too, introspection yields representations that substantially misrepresent the actual nature of our inner experience. In particular, introspection represents experiential states as having phenomenal properties—the infamous and deeply problematic what-it-is-likeness of our qualitative mental states. Illusionists claim that these phenomenal properties do not exist, making them eliminativists about phenomenal consciousness. What is real are quasi-phenomenal properties—the non-phenomenal properties of inner states that are detected by introspection and misrepresented as phenomenal.

    Stanford Encyclopedia of Philosophy -> Eliminative Materialism



Discuss

I dream every night now


Published on January 9, 2026 12:34 AM GMT

When I close my eyes, all I see is darkness. 

It’s always been this way. 

I thought this was normal. When I was 22, I learned otherwise. 

I learned that “imagination” is not merely a figure of speech—people can actually see images in their heads. They can picture their dog wagging its tail or their mother smiling at them, or see a lover embracing them after being away for far too long.
 

But not me. All I see is darkness.
 

When I was 22, a friend recommended I read Tolkien’s The Hobbit. A week later, after trudging through a few chapters, I frustratedly told her, “With its pages-long descriptions of the landscape, it feels like the author is trying to paint a picture with his words and I cannot see it.”

“Umm,” she replied, “I think you might have aphantasia.”

“Huh? What’s that?”


99% of people have imaginations—they close their eyes and they can see images. The other 1% don’t—we aphantasiacs see nothing but darkness.

Instead of seeing images, I think exclusively in terms of concepts. For example, I know what my dog looks like, but not from visualizing her in my mind. Instead, I can describe my dog by listing her characteristics stored in my memory—like I’m accessing words from a txt file.[1]

Aphantasia is not a disease or a disorder; it’s just another variation of the human experience.


Months later that same friend asked me, “You never see images? Like ever?” She reflected for a moment. “Do you dream?”

“Of cour…” A feeling of consternation washed over me, softly awakening memories that were long forgotten.

“Actually…not anymore. I remember dreaming when I was a kid. But for the last ten years or so, I guess…my dreams disappeared and never came back.”


Dreaming is our subconscious’ way of working through unprocessed emotion we’ve been too busy to think about. For some reason, I got disconnected from this basic aspect of the human experience and never figured out why, until now.

The first clue came from when I vacationed in Panamá three years ago. I was staying in a cabin that was secluded in the mountains overlooking a river valley populated by a quaint, simple town. Unexpectedly, every night I experienced wildly intense dreams—some of them so vivid that I awoke in a cold sweat, my heart racing. After years of dreamlessness, I wondered if this happened because of the decision I had made when I arrived:

Upon unlocking the front door of the cabin, and before entering, I turned my phone on airplane mode. For a week straight I had no contact with friends, no texts, no emails, no buzzing. Peace. Quiet…


 

Too quiet?


 

During the day, the ceiling-tall windows of the cabin drank endless sunshine and overdosed on the majesty of the serene cloud forest below. As the sun faded beyond the horizon and the dark consumed the remaining streaks of golden fading light, a gentle wind began which steadily increased in voracity and ferocity as it accelerated into an insatiable torrent. Isolated on the top of a hill, the cabin was completely exposed to the assailing violent wind, whipping the windows endlessly causing reverberations that sounded like booming war drums. The assault continued for hours, resounding in occasional crescendos that roared so loud that the windows shook, seemingly on the verge of giving way.

I sat in the corner of the cabin, digging my fingernails into the arms of my chair,  trying but failing to concentrate on a book. Terror. True terror. The kind of terror that besets peasants when their castle is under siege by savage barbarians ready to rip their faces off.

Without my digital pacifier assuaging my regular microdoses of anxiety and uncertainty, I devolved into a more primal state of awareness—terror at the hands of nature.

I looked over at my travel partner and asked how they felt. “Huh?” They took an earbud out and paused their Netflix show. “Oh, the wind? Yeah it’s crazy man,” and they returned to their show, leaving me alone with my thoughts.

When I finally crawled into bed and succumbed to sleep, a dream visited me. 
 

Or, more accurately, a nightmare.


On subsequent vacations, I have similarly been barraged with midnight reveries. I assumed the dreams were happening because I was living in a new environment. 

But on my most recent vacation, I didn’t have a single dream. What was different this time was that I didn’t put my phone on airplane mode. Instead, I stayed plugged in, buzzing the nights away in the lulling, dulling, numbing blue light. Recognizing this, I returned home and performed an experiment.


Alone in my apartment four weeks ago, I turned my phone on airplane mode at 7pm. Normally I eat dinner with a screen glowing in front of me, with various parasocial personalities keeping me company throughout the night. But that night I ate dinner alone, truly alone, with nothing but my thoughts.

A small bubbling feeling of anxiety gnawed on the edges of my stomach, but I ignored it, finished dinner, and sat in my armchair to read a novel. After what felt like hours, I got up and checked the clock on my nightstand—only thirty minutes had passed. I nervously paced my room, unsure what to do next. Then I opened the door to my balcony and stepped into the brisk November night. Blackness. Emptiness. Coldness. The kind of cold that reminded me that I was alone. I watched the traffic go by from five stories up—they looked like matchbox cars softly motoring along to go home, home to their families. Unlike me, who lives alone, and is alone. A heavy, lethargic feeling slowly suffused my cold, shivering body. In a dreary trance, I shuffled to my bed and pulled the covers over tightly. 
 

Sleep took me. 

Then…

 

 

I bolted awake, sweating.

I had experienced the most vivid dream: it was about a recent, disturbing event in my life that I had neglected to sit down and process yet. Assuming it was the morning, I got out of bed to journal about it at my kitchen table. After writing down all my thoughts, I glanced at my nightstand clock—2:08am. Oh.

For three nights in a row the same thing kept happening: I awoke in a sweat at 2am from emotionally intense dreams and I couldn’t go back to sleep for hours. And while I felt exhausted from the sleep deprivation, I was also exhilarated. Even if it was only while I was unconscious, for a brief time I got to see images—the part of the brain responsible for daytime imagery (which is dysfunctional in my brain) is separate from the part that produces dreams.

Eventually, as I maintained my new nighttime routine of ceasing all screen usage 1-2 hours before bed, my circadian rhythm adjusted, and I started sleeping through the whole night.
 

And the dreams just keep coming.


It was ten years ago that my dreams deserted me—about the time I got my first laptop and smartphone. Such a simple yet obvious explanation that I had overlooked.

But I’m not alone. 9 out of 10 Americans reportedly use electronic devices in the hour before bed. Exposure to blue light reduces the quantity and quality of REM sleep, which is when most of our vivid dreaming occurs, and when much of our emotional processing happens.


When I’m awake and I close my eyes, all I see is darkness. 

But now I’ve learned how to dream every night—the only time in my life I get to see mental imagery. The only time I get to feel like the rest of you 99%.
 

Good night.

  1. ^

    Researchers believe congenital aphantasia occurs due to reduced connectivity between the prefrontal cortex (which handles working memory) and the visual networks of the brain (which creates mental imagery for that working memory).



Discuss

The Economics of Transformative AI


Published on January 8, 2026 10:22 PM GMT

Anton Korinek is an economist at UVA and the Brookings Institution who focuses on the macroeconomics of AI. This is a lightly edited transcript of a recent lecture where he lays out what economics actually predicts about transformative AI — in our view it's the best introductory resource on the topic, and basically anyone discussing post-labour economics should be familiar with this.

The talk covers the historical development of the economy's bottlenecks: for most of history, land was the bottleneck and human labor was dispensable. The Industrial Revolution flipped this. AI may flip it again: if labor becomes reproducible, humans are no longer the bottleneck.

Korinek walks through what this implies for growth (potentially dramatic), for wages (likely positive effects until some threshold of automation, then a sudden collapse), and for policy (e.g. our entire tax system assumes labor income exists). He also addresses some confusions: "prices falling" and "productivity rising" are two descriptions of the same thing; "post-scarcity" is a misnomer since even cheap resources have prices; and no, there's no economic law that technology must create jobs—that was just an empirical regularity while humans were the bottleneck.

The uncomfortable conclusion is the economy doesn't need us. It can run perfectly well "of the machines, by the machines, and for the machines." Whether that's what we want is a different question.

Transcript is under the video

I want to start by talking about some high-level lessons about the economics of transformative AI—or the economics of AGI, which are not exactly the same but close substitutes to each other. I've split this into three big themes which I call "the good", "the bad," and "the ugly".

What do I mean by that? There is the productivity and growth impact of AGI, transformative AI—that's "the good". There is the labor market disruption and inequality aspect—which is "the bad". And then there are the TAI risks and alignment. I unfortunately don't have very much to say on that, but I'm still including it—which is "the ugly". It also has some economic repercussions. So let's first jump into the good, so that we can end on a high point.

Three Categories of Economic Impacts: Productivity and Growth (the good), Labor Market Disruption and Inequality (the bad), TAI Risks and Alignment (the ugly)

The Good: Productivity and Growth

I'll start with some economics 101. You can forget about the equations, but maybe some of you have seen this in your undergraduate econ. The way that we economists think about the economy is that we produce output Y by combining capital K and labor L through some sort of production function. That production function operates at a certain level of technology, which we call A. We mix those two things together, use our technology, and produce output.

Economics 101: Output Y = A*F(K, L) depends on technology A, capital K, and labor L; Consumption C = Y - I - G is what's left after investment I and govt spending G; Utility U = u(C) is how happy the consumption makes us

From that output, we take a part and invest it, we take a part and use it for government spending, and the rest is consumption. And that consumption—I guess that's the critical part—is supposed to deliver us some sort of utility u(C). So the big question that I want to unpack in the next 20 minutes or so is: how will AGI affect all of that, and what will all these effects depend on?
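To make the accounting concrete, here is a minimal numerical sketch of this structure. The Cobb-Douglas form for F, the parameter values, and log utility are illustrative assumptions on my part; the talk only commits to the general structure Y = A*F(K, L).

```python
# Minimal sketch of the econ-101 accounting above. The Cobb-Douglas form
# F(K, L) = K**alpha * L**(1 - alpha), the parameter values, and log utility
# are illustrative assumptions, not specified in the talk.
import math

A, K, L = 1.0, 100.0, 50.0   # technology, capital stock, labor force (arbitrary units)
alpha = 0.3                  # capital share (a conventional calibration)

Y = A * K**alpha * L**(1 - alpha)   # output
I = 0.2 * Y                         # investment
G = 0.2 * Y                         # government spending
C = Y - I - G                       # consumption is what's left
U = math.log(C)                     # utility from consumption

print(f"Y = {Y:.1f}, C = {C:.1f}, u(C) = {U:.2f}")
```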

What the Most Cited Economist Thinks

Let's start with the good. What will the output effects of AGI depend on? Well, if we ask the most cited economist in the world, Daron Acemoglu, he published a paper last year on "The Simple Macroeconomics of AI" in which he predicted that AI will raise growth by about 0.07% a year. So he does not believe that this is going to be very transformative. He thinks AI is essentially BS.

But that means there is more work for us to be done. This guy is the closest thing to an economic super-intelligence—if he doesn't pay attention to these questions, then I guess many of us have to.

A Longer Arc of History

I think to take this question seriously, we have to actually take a step back and look at a much longer arc of history to understand how transformative AI will be for the economy.

In prior decades—or even for the past two centuries—everything in the economy revolved around capital and labor. But if we take a step even further back towards the Malthusian age, it wasn't always like that.

During the Malthusian times, the most important factor of production, the most important thing to produce output in the economy, was actually land. And then you needed human labor to work the land. But unfortunately, the human labor was actually pretty dispensable. The way that economists put it is the marginal human had just enough to cover their subsistence needs. What they ate and what they produced was roughly equal, which means they did not produce any excess economic value. This is what Malthus described as essentially the Malthusian trap.

During Malthusian times—during the Middle Ages and so on—human populations multiplied until they ran out of resources to sustain themselves. So everybody, or at least everybody except for a few special kings and so on had just enough to meet their basic necessities. Living standards were very low. Utility, as we economists would characterize it, was not particularly high. In some sense, from a material perspective, those were pretty bleak times—people were not doing particularly well. But since land was essentially the bottleneck factor of the economy, those who controlled the land were the most powerful: the lords.

The Industrial Revolution

Well, then something really miraculous happened: the industrial revolution—driven probably by scientific advances, the age of enlightenment, and so on. What happened during the industrial revolution is that we developed technologies to produce stuff so that land was no longer the primary bottleneck factor in the economy. Instead, we started to produce things with machines, and we combined those machines with labor. That gave us the productive structure that all of you who took econ 101 saw in your undergraduate studies, where we combine capital and labor to produce output. This is just given as one of the most fundamental economic laws—although, maybe it will soon no longer be that way.

The other thing that happened with the industrial revolution is that we suddenly started to advance technology in a sustained way. Technology started to progress at a rate of like one-and-a-half to two percent. We constructed more and more machines, and the accumulation of better technologies and more capital allowed the economy to grow—and to grow in a sustained way.

But what was the bottleneck? The bottleneck was suddenly the human: the human worker. Humans did not multiply at the same rate as technological progress advanced and as we accumulated capital. That means humans became very scarce. And that scarcity of human labor is what drove the sustained advances in living standards over the past two-and-a-half centuries, and what made living standards in advanced countries something like 20× what they were before the industrial revolution.

So the fact that we were the bottleneck, the fact that we were scarce—that's what made us economically valuable, and that's what basically supported the standard of living that all of us are experiencing today.

Economic Paradigm Shift Under AGI: The Malthusian Age (Technology stagnant, Land bottleneck, Labor dispensable), The Industrial Age (Technology driving force, Labor bottleneck, Capital reproducible), The Age of AI (Technology accelerating, Capital reproducible, Labor reproducible)

The Age of AI

Now we may be about to enter the age of AI. What's going to happen there?

Technology is almost certainly going to accelerate. Capital, just like it was during the industrial age, can be reproduced. But labor is no longer bound to how many human bodies we have—it can also be simply reproduced by starting up another server farm, by building another robot.

And those machines are going to compete with human labor.

All the indications right now are that they are going to perform the kinds of jobs that humans can perform at what is currently a lower cost. And if a machine can perform your job, then competitive forces will ultimately drive your wage to the cost of the machine.

Simulating the Output Effects

What will this imply for output? I have some simulations in a paper that I published with the IMF two years ago. The bottom line shows the baseline of how growth proceeded for the past couple of decades, and then there are two AGI scenarios which essentially assume that we automate the economy within five years versus within 20 years. The faster we automate everything, the faster growth takes off. I'll discuss the effects on wages when we talk about the labor market in the corresponding diagram.

Output simulation showing Traditional, Baseline AGI, and Aggressive AGI scenarios over 30 years. Output grows dramatically under AGI scenarios.

SOURCE: Anton Korinek. NOTE: AGI = artificial general intelligence.

So, back more conceptually to the output effects. These output effects are going to be driven by technological progress, because I think all of us expect that AGI will allow us to make advances—both scientific advances and organizational advances—more quickly than what we currently can in the human economy.

They will be driven by automating labor. And I think it's important—I visit this website every couple of weeks just to make sure I really read it correctly. The charter of OpenAI, for example, mentions "highly autonomous systems that outperform humans at most economically valuable work". Well, that means if they really succeed at what they're saying, then labor is toast.

But right now I'm supposed to talk about the good—the output effects. The good is: if you relieve this bottleneck of humans, if you basically allow machines to perform all the valuable tasks in the economy, then you can grow the economy to a much bigger size. You can expand output and move beyond that bottleneck. That means growth rates that are currently unthinkable may be possible.

One way of thinking about it—you hear all these numbers, is it going to be 20% or 50%, I have no idea—but one way of thinking about it is: imagine you have robots and server farms. How long is it going to take those robots and server farms together to double themselves? That's going to give you a good perspective on what reasonable estimates of growth rates will be.

And the third factor that's also very important: we need this capital accumulation. We need these additional machines, these additional robots and server farms, in order to advance growth in the economy.

The Intelligence Explosion

Now what may this growth takeoff look like, and what may it be driven by? In a paper that I'm about to put out with three co-authors—Tom Davidson, Basil Halperin, and Tom Houlden—we look at how an intelligence explosion may trigger essentially a growth takeoff.

Feedback Loops Driving a Growth Take-off: AI Labor feeds into Software Quality and Hardware Quality, which drive Total Factor Productivity, Output, and Capital accumulation in reinforcing loops.

Source: "Is Automating AI Research Enough for a Growth Explosion?" (with T. Davidson, B.Halperin, and T. Houlden), Nov. 2025

In one diagram, the right column shows you how economic growth proceeded during the industrial age. You had output being driven by advances in technology—or total factor productivity, as we call it technically in economics—and by the accumulation of capital. Total factor productivity kind of fed on itself, because if you have better technology, you can produce even further advances in the future. And the capital stock rose because we used part of our output for investment and accumulated more of it. So this right column captures what is driving growth during the industrial age.

Now what would happen if we do experience AGI, and what are the potential dynamics for an intelligence explosion?

First of all, we would suddenly have all this AI labor that can perform tasks that previously were reserved for human labor. Having all this additional labor will immediately allow our output to jump up. In addition, a lot of that labor can be dedicated to advancing technology—to increasing total factor productivity.

Now let's unpack what drives the increase in AI labor. There are two forces: the first one is advancing software quality, and the second one is advancing hardware quality, plus hardware accumulation. These two can feed back on each other and imply that essentially the available amount of AI labor is going to grow very rapidly, therefore drive these increases in productivity and increases in output.
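A toy simulation can illustrate how these feedback loops compound. To be clear, the functional forms and parameters below are my own illustrative choices, not the model in the paper; the point is only the qualitative shape of the dynamics.

```python
# Toy illustration of the feedback loops described above: AI labor scales with
# software quality times hardware stock; part of it improves software, and
# output is partly reinvested in hardware. All functional forms and parameters
# are illustrative assumptions, not taken from the paper.
software, hardware = 1.0, 1.0
for year in range(1, 11):
    ai_labor = software * hardware           # effective AI labor
    output = ai_labor                        # output produced by AI labor
    software *= 1 + 0.1 * ai_labor**0.5      # research share raises software quality
    hardware += 0.3 * output                 # reinvested output accumulates hardware
    print(f"year {year:2d}: output = {output:10.1f}")
# Output growth accelerates each year rather than settling at a constant rate,
# which is the qualitative signature of a growth take-off.
```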

Important Factors to Consider

I want to add a little bit more texture and discuss a few factors that I think are important to keep in mind if we dig a little bit more into the weeds.

Output Effects: Adding More Texture - Three further considerations: 1. Bottlenecks = 'weakest links' in the chain of production (may hold back economic growth and disruption), 2. Cognitive and physical work are complements (In short term, AGI will raise the value of physical actuators), 3. In the long term, the ultimate bottleneck is unclear – E = mc²?

Bottlenecks. In some ways, you can say bottlenecks are like the weakest link in the chain of expanding output. They are what's holding us back from producing more. The simplest story is like the O-ring story: if you need to follow 100 steps to produce something but you can only automate 99 of them, and the hundredth relies on some bottleneck, then you cannot produce more because that bottleneck is always going to hold you back.

In practice, it's not going to be as stark. Bottlenecks can be to some extent substituted for. If you don't have enough energy, you can focus on developing technologies that do the same thing with a little bit less energy. So you can get around bottlenecks. But still, the more bottlenecks there are—and many of them we're probably not aware of yet—the more that will hold back economic growth.

Cognitive versus physical work. This is a really important distinction that is oftentimes conflated in this debate. For all those of you with aggressive timelines, your timelines are probably at first about cognitive advances—about the fact that AI may take off on the intelligence side. But that may or may not give rise to physical automation. And even if we have the physical automation, we also have to produce a lot of machines, a lot of robots to take full advantage of that.

I'll go back again to OpenAI's charter. They wrote about basically automating the majority of economically valuable tasks. Now, the majority of economically valuable tasks is actually physical, or involves at least some physical component. It's only—depending on how exactly you measure it—10 to 20% of the economy that is purely cognitive. I guess many of us are in that segment, and that's why we can feel it acutely.

But still, the majority of jobs has an important physical component. If we only have cognitive intelligence at human or superhuman levels, that doesn't mean that we should expect a dramatic growth takeoff. It only means that we can do these 10 or 20% of the economy much more efficiently. What that means is: if that happens, there's going to be so much economic value in automating the physical parts as well, so we should expect a lot of investment flowing into that.

Progressive bottleneck relief. You can think of the long trajectory of economic development as progressively relieving bottlenecks. The industrial revolution relieved the bottleneck of land but then introduced the bottleneck of labor. Now we may be on the verge of relieving the bottleneck of labor, and it's not quite clear what will hold back growth after that. It may be energy. It may be the availability of certain rare earths. I don't know—none of them seems like the obvious predominant one, especially if we are really close to fusion or things like that. Ultimately—and I'm listening to people like Anders when I say something like that—maybe it's just going to be energy and matter within our event horizon. But that's kind of beyond my expertise. It's a huge economic question, because whatever is the bottleneck will be the most valuable.

Confusions Between Economists and Technologists

I want to talk about some confusions that sometimes occur in the debate between economists and technologists.

Productivity versus prices. Economists always like to talk about productivity, but technologists oftentimes talk about prices going down. In some sense, that's actually just using different language for the same thing. I remember Sam Altman wrote this blog post "Moore's Law for Everything" in which he suggested the prices of everything are going to halve every two years in accordance with Moore's law.

We economists don't really measure things in dollar terms when we talk about GDP—we measure them in real terms, adjusted for price changes. So if you say prices go down by half every two years, what you may be meaning is: we can produce the same dollar amount but at half the price, which would correspond to producing twice as much every two years—which would be a growth rate of like the square root of two, 41% or something like that a year.

Those two statements are economically equivalent. Economists prefer talking about the growth rate adjusted for prices. Halving costs means essentially doubling productivity. And one of the reasons is because prices ultimately are a unit of account. It doesn't really matter whether you say my economy is $1 trillion big or 100 trillion yen big—it's the same, you just convert things. People in Japan are not wealthier because $1 is 100 yen. These are just units of account, and we want to adjust for that to measure the real effects of economic growth.
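Spelling out that arithmetic: halving prices every two years at unchanged nominal spending is the same as doubling real output every two years, so the implied annual real growth rate g satisfies

```latex
% Halving prices every two years at unchanged nominal spending
% doubles real output every two years:
\[ (1+g)^2 = 2 \quad\Longrightarrow\quad g = \sqrt{2} - 1 \approx 0.414 \]
% More generally, a doubling time of T years implies
\[ g = 2^{1/T} - 1 \]
```

The same relation is the arithmetic behind the robots-and-server-farms doubling-time heuristic earlier in the talk: a doubling time of T years for the capital stock corresponds to an annual growth rate of 2^(1/T) - 1.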

Post-scarcity. This is maybe more of a very strong opinion on my side than a confusion. I think in part inspired by Star Trek and so on, a lot of people talk about a post-scarcity economy. If they use that term to just say an economy that will be a lot wealthier, in which there will be a lot more abundance, then I'm okay with that.

But technically, we economists call a resource "scarce" whenever it has a nonzero price. Even if something is really cheap, it meets that definition. Now, if AI takes off in a good way and produces all this abundance for us, resources will still have nonzero prices. There will be a lot more of them, but they will still be valuable, and the relative value of different resources is going to be reflected in their relative prices. So I think it makes more sense to talk about material abundance than about post-scarcity.

Clearing Up Some Confusions: In the language of economists, we describe growth in productivity, output, wages in cost-adjusted ('real') terms; halving costs = doubling productivity; ultimately, only relative prices matter. Scarcity = something has non-zero price, even if the price is very, very low; material abundance ≠ 'post-scarcity'

One thought experiment that's highly useful in this context: imagine we were having this conversation in 1800, and we're going to talk about a post-Malthusian economy because some people are very foresightful and are seeing the writing of the industrial revolution on the wall. Imagine you told people, "Okay, every one of you is going to be 20 times wealthier in 200 years from now." People would say, "Well, that's unbelievable—it's inconceivable."

In some sense, you could say we almost already live in this post-scarcity economy compared to the conceptions that people had in 1800. But for a lot of people, it doesn't quite feel like post-scarcity. Prices are still positive, and what matters are their incomes relative to those prices.

The Bad: Labor Market Disruption and Inequality

That brings me to the bad—the potential labor market disruptions and inequality effects. Again, what will drive these? What will they depend on?

Here I won't start by citing Daron Acemoglu, but I will put up an interview that Dario Amodei gave to Axios a couple of months ago, in which he warned that AI could wipe out half of all entry-level white collar jobs and spike unemployment to 10 to 20% in the next—there's a wide confidence interval—one to five years. So this is what an industry insider describes as the potential labor market effects.

The Economic Channels

Let me look at what are the economic channels that would drive this disruption.

The first one—and I should say that I'm talking about the channels that drive the effects on the labor market, because some of these effects are positive—is just technological progress, which has a tendency of lifting all boats, of making everybody more productive. If I just give you ChatGPT or Claude or Gemini, you are more productive as a worker, and that makes your labor more valuable.

The second one, which is the hugely disruptive one, is the automation of labor. Again, if a machine can do a worker's job, the worker's wage will tend towards the machine's cost.

And the third one, which is again positive, is capital accumulation. If I give you better machines to work with, then your labor becomes more valuable.

So what this tells us is there are three main channels. Two of them are actually positive, but one of them is negative. Ultimately, there is a horse race going on between the positive and negative effects. In the short term, it is plausible that a lot of workers are going to benefit from the positive effects. But then if we reach the full AGI as in this definition here, I think it is very likely that the negative effects are going to predominate.

Adding More Texture

Task displacement versus job displacement. Right now we are talking about task displacement rather than job displacement. Right now, there are very few jobs that can be wholly done by AI, and a lot of the economic work in this area is on which tasks can be replaced, not which jobs can be replaced.

Labor demand, not just jobs. A second point I want to emphasize: when you read the newspaper, you often hear about what will be the jobs impact of AI. But the more interesting and more useful question from an economic perspective is: what will be the effects on labor demand?

The reason I say that's more useful is that we think of the labor market as an equilibrium driven by both demand and supply. What usually happens—and what has been happening for the past 200 years—is the supply of labor has been pretty much fixed. Essentially, every working-age adult nowadays, or the vast majority who are not busy for family reasons, are supplying their labor to the labor market. That means supply is pretty much inelastic, as we say. So supply is fixed, but labor demand is what fluctuates when technology fluctuates.

If you reduce labor demand but you have a fixed supply, what happens in equilibrium is that wages actually bear the brunt of adjustment. In the very short term, there's some job displacement. But as the economy re-equilibrates, wages go down, and the total number of jobs is not materially different from before the shock—but you can see that wages are at a lower level when you have a negative labor market shock. That means focusing just on the job numbers may be a useful short-term guide, but when talking about medium- to longer-term developments, we have to focus on wages, not just jobs. And that's captured by essentially the relationship that we call the demand curve, which captures how many workers employers will hire at each wage.
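A stylized calculation makes the point. The linear inverse demand curve and all numbers below are made-up illustrative assumptions; the only feature that matters is that labor supply is fixed.

```python
# Stylized illustration: with a fixed (inelastic) labor supply, a negative
# labor-demand shock shows up almost entirely in wages, not in employment.
# The linear inverse demand curve and all numbers are illustrative assumptions.
labor_supply = 100.0                      # everyone supplies labor regardless of the wage

def equilibrium_wage(intercept, slope, supply):
    """Inverse demand w = intercept - slope * L, evaluated at the fixed supply."""
    return max(intercept - slope * supply, 0.0)

w_before = equilibrium_wage(intercept=50.0, slope=0.3, supply=labor_supply)  # pre-shock demand
w_after = equilibrium_wage(intercept=38.0, slope=0.3, supply=labor_supply)   # demand shifts down

print(f"employment: {labor_supply:.0f} -> {labor_supply:.0f} (unchanged)")
print(f"wage:       {w_before:.1f} -> {w_after:.1f}")
```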

Simulation Results

Let me show you a simulation from a paper on scenarios for the transition to AGI, in which we trace out how the fraction of automated tasks will affect both the wage bill and the returns to capital.

Labor Market Effects of AGI: Graph showing Output Y, Wage Bill wL, and Return on Capital RK against Fraction of Automated Tasks. Wages rise until ~80% automation, then plummet as returns shift to capital.

See Korinek and Suh, "Scenarios for the Transition to AGI," NBER, May 2024

What you can see is: if you automate, and if capital and labor are complements, at first labor becomes more and more valuable because it makes the economy more productive to use machines for tasks that previously required very scarce labor. That means at first, almost all of the benefits go to labor.

But then, after a specific threshold—here in this simulation it's like 80% automation—all of a sudden, the abundance of capital and the fact that we need workers only for very few remaining jobs implies that wages plummet and that all the returns suddenly go to capital.

This is one specific simulation and specific parameter values, but I want to put it up just to show the possibility that as we automate, for a long time we'll see positive effects on labor, and then they may suddenly flip.
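For intuition, here is a simple task-based toy with the same qualitative flavor (it is not the Korinek-Suh model itself; the elasticity of substitution and the capital and labor quantities are my own illustrative choices):

```python
# A simple task-based toy with the same qualitative flavor as the simulation
# described above (not the Korinek-Suh model; elasticity and factor quantities
# are illustrative assumptions). Tasks on [0,1] are combined with elasticity of
# substitution 0.5 (complements). A fraction beta of tasks is automated and
# done with capital; the remaining tasks require labor.
K, L = 100.0, 1.0      # abundant capital, scarce labor
rho = -1.0             # CES exponent corresponding to elasticity of substitution 0.5

def output(beta):
    # Optimal allocation spreads K over automated tasks and L over the rest.
    return (beta**(1 - rho) * K**rho + (1 - beta)**(1 - rho) * L**rho) ** (1 / rho)

def wage(beta):
    # Marginal product of labor in the CES task aggregate.
    return output(beta)**(1 - rho) * (1 - beta)**(1 - rho) * L**(rho - 1)

for beta in [0.0, 0.5, 0.8, 0.9, 0.95, 0.99]:
    print(f"automated share {beta:.2f}: output {output(beta):6.1f}, wage {wage(beta):6.1f}")
# Wages rise as automation relieves bottlenecks, peak, and then collapse as
# the remaining labor tasks run out.
```

Where exactly the flip happens depends on the elasticity and on how abundant capital is relative to labor; in this toy parameterization the wage peaks near 90% automation rather than the roughly 80% in the simulation above.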

And now I'll show you the counterpart to the output effects of growth on the wage front. The baseline was that wages and output kind of grow in tandem, but if we have AI in 20 years or five years, then you can see that wages at first rise faster, and then they plummet.

Two charts showing Output and Wages over 30 years. Chart 1 (Output): Traditional growth is linear, Baseline AGI accelerates around year 20, Aggressive AGI accelerates sharply around year 10. Chart 2 (Wages): Traditional growth is steady, but both AGI scenarios show wages initially rising then plummeting.

Source: Anton Korinek. Note: AGI = artificial general intelligence.

Where Does the Value Go?

If labor is devalued, the big question is: where does the value go? Because the value doesn't disappear. If we can do something more cheaply—let's say using an AI that costs a hundredth of the cost of a worker right now—that means the economy is still producing the same thing. And the value goes, depending on how the economy is arranged, either to workers or capitalists, or frankly a lot of it just goes to consumers. Which is actually good, but it also highlights the importance of distributive policies.

Managing the Transition

That's where I want to go next. Managing the transition is going to be particularly hard. There will be big winners and big losers on the road to AI. The big winners never want to share what they won. The big losers always want to be compensated and don't want to lose.

There will be an important role for steering technological progress—maybe also a role for slowing down progress, because having so many winners and losers is hugely disruptive.

There are many longer-term policies that are being discussed: UBI, UBK, job guarantees, compute slices. Right now in the economic policy debate, they all seem outlandish. Nobody's taking these seriously except on the fringes of the political spectrum—which is too bad, if we are expecting these disruptions to happen very quickly.

And of course, there are very important non-economic forces as well: what does this all mean for meaning, for control, for agency, and so on.

Adjusting the Tax System

In a recent paper, I look at how our tax system will have to adjust under AGI. Right now, you can say that taxes on labor τL are the primary way of raising revenue. But if labor suddenly becomes devalued, our government is not going to have a lot of financial resources. That means in the post-labor economy, we have to switch at first to the taxation of consumption τC, and then ultimately to the taxation of the capital accumulation of AI itself.

Public Finance in the Age of AI: Table showing role of tax instruments (Labor, Consumption, Capital) across evolving AI scenarios from current economy to AGI-centered economy

See Korinek and Lockwood, "Public Finance in the Age of AI," NBER Economics of TAI, Nov 2025.

Roles for Labor in the Post-AI Economy

There will be some roles left for labor in the post-AI economy. Some of them transitional, some of them for fundamental human-centric purposes. But from my perspective, it is unlikely that the importance and the share of labor in overall economic value will be anywhere near where it is right now.

Roles for Labor Post-AGI: Temporary Technical and Social Barriers (Production and diffusion lags, Implicit knowledge, Trust and perceived human superiority, Laws and regulatory reasons) vs Fundamental Human-Centric Aspects (Authenticity of human connection, Human identity in sports and arts, Religious beliefs and practices, AI alignment and oversight)

Clearing Up Confusions

Again, I want to clear up some confusions in the debate.

There is no "economic law" that new technologies always create jobs. In fact, it's just the law of demand and supply that as long as people are willing to supply their labor at any price, the market is going to hire them. And sometimes the price needs to go down to clear the market. Sometimes the price goes down a lot, as people during the disruptions of industries in the Midwest experienced in the past few decades.

There is no fundamental law that labor always plays some sacred role in the economy. It was just an empirical regularity for the past 200 years.

What ultimately matters is relative prices. Even if the price of everything declines—or some other form of what Sam Altman predicted in "Moore's Law for Everything"—what ultimately matters for workers is how fast the price of their consumption goods declines compared to their wages.

There are a lot of reasons to expect that wages are going to go down faster than energy prices. And a lot of the things that we consume to keep ourselves alive, like food, require a lot of energy. So I think even if you have a strong belief that everything will become cheaper, labor is going to get cheaper even faster.

The Ugly: Risks and Alignment

Last point: the ugly—the risks of AGI and alignment.

The Ugly: TAI Risks and Alignment - Main economic challenge: Trading off E[cost of doom] vs E[benefit of abundance and longer lives]; Vast externalities: the benefits and costs of TAI risks are not borne equally

Economic Aspects of Alignment

First, I want to observe that there are a number of really important economic aspects to alignment.

People need material resources to survive. We may be able to do things more efficiently and so on, but we need some sort of income.

Something else that's inherent in human preferences: people don't like uncertainty—certainly not really big uncertainty. People don't like big inequality. People don't like excessively rapid change, in part because it entails a lot of uncertainty.

I think we need to consider all these factors when we want to align AI to human values, because these basic economic preferences are part of our human values.

The Central Economic Challenge

Ultimately, the main economic challenge in this debate is—to put it kind of in the brute and cold-blooded economic fashion—trading off the expected costs of disaster versus the expected benefits that AI may deliver, both in terms of abundance and in terms of things like longer lives.

In some sense, you can see that lives are both on the left-hand side and on the right-hand side of this tradeoff. If the AGI kills us, that's a cost. But if it makes us all live for hundreds of years, that's also a benefit in terms of lives—not just in terms of pure abundance.

But I think the central part is: there are vast externalities in who makes these tradeoffs. Right now, it is a handful of executives in a handful of very powerful organizations that are making the decision on behalf of all of humanity. And their decisions entail vast externalities—vastly asymmetric payoffs. If they succeed, they will pocket a huge amount of the benefits. If they lose, all of us are going to pay the costs. It's kind of the classical definition of an externality, and I don't think that's good.

Does the Economy Need Humans?

Let me clear up a confusion that you sometimes hear in this debate: Does the economy need humans? Does it need human demand?

No. For the economy, it is perfectly possible to have an economy that's driven only by the machines—to paraphrase Lincoln, to have an economy of the machines, by the machines, and for the machines. We don't need to be there for the economy to function.

But—sorry, David, you said we should not impose values—I do think that's not what we want.

Thank you.




The Hunger Strike To Stop The AI Race

2026-01-09 05:05:17

Published on January 8, 2026 9:05 PM GMT

I just released a 22-minute long documentary about the hunger strike to stop the AI race, a protest that was featured in publications such as The Verge, The Telegraph and even SkyNews, and received internal support from employees at DeepMind.

A few months ago, during the protest, I wrote a reply to @Mikhail Samin's skeptical comment saying that I was pretty confident that the Google DeepMind hunger strike would turn out to be net positive, but that I was happy to chat more later in the year, looking back.

I think this 22-minute documentary is the best follow-up I could give to Mikhail today if I wanted to explain why I did the hunger strike, and the main impact we had. [1]

I'm excited for everyone to watch it! 

  1. ^

    (There's also much more stuff that happened that didn't make it to the video. You can check the X thread for that).




Skepticism about Introspection in LLMs

2026-01-09 04:07:40

Published on January 8, 2026 8:07 PM GMT

This piece presents my thoughts on the recent work on introspection in LLMs. The results of recent experiments are very suggestive, but I think we're somewhat primed to read too much into them. It lays out some reasons for skepticism both about the general plausibility of introspective mechanisms in LLMs and about interpreting experimental results as confirming introspection.

1. What is introspection?

Introspection is, roughly put, the ability to perceive and reflect upon one’s own mental states. It is hard to strictly define. We are familiar with it from our own experience of looking inward at the range of phenomenal experiences that float through our minds. Introspection allows us to reflect upon, reason about, and remember those experiences. It helps us to situate ourselves within a wider world, to conceive of ourselves as individuals, and to understand how our minds work. When we want to apply the concept to very different kinds of entities, however, we have to get a lot more specific about what we mean.

Introspection is interesting for a variety of reasons. It has a tie to conscious experiences – because we can typically introspect our conscious episodes. It also plays an important role in how we come to understand our own consciousness. One reason why consciousness seems so special to us is because the experiential states that we introspect come across as quite different from what science has taught us about their material basis. Without such introspective capacities, it seems unlikely that we would have regarded ourselves to be conscious. The more evidence we see for robust introspective qualities in a novel system, the more likely we may think it is also conscious, and the more we might expect to be able to rely on its own testimony for insight into what it feels (Perez & Long 2023; Schneider 2024).

Paradigm cases of introspection involve conscious episodes where, as a matter of deliberate choice, we attend to and reflect upon the phenomenal and representational qualities of events in our mental life. Introspection is metacognitive in the sense that it involves some sort of mental representations that have explicitly mental contents: when we introspect, we regard our mental states, rather than any worldly states that they might themselves be about.

While paradigm cases are conscious, introspection may not always be. Similar self-reflective cognitive mechanisms might be entirely subconscious and it is up for debate whether such processes should count as introspection. Imagine someone who had a condition of ‘introspective blindsight’, where they could very reliably guess the answers to questions about their own mental goings-on without any corresponding introspective phenomenology. It is unclear whether that capability should count as introspection.

There may be other mental capacities that involve forming unconscious, accurate metarepresentations that are not deliberate, don’t involve devoting attention, and that are worse candidates for introspection. For instance, it is conceivable that subconscious processes involved in processing conversational dynamics might utilize internal access to our own mental states as part of the process of modelling what our conversational participants expect of us. The fact that we believe something, and not merely the fact that it is the case, may show up in the words we subconsciously choose.[1] Under an expansive notion of introspection, that might count as introspection even though it is fairly different from the paradigm cases. But the fact that it is not deliberate and that it is performed in service of some other end seems (to me at least) to count against it being introspection.

It can be hard to distinguish between introspection and forms of knowledge that utilize internal representational structures without metarepresentations. For instance, if we train a creature to vocalize when it sees red, it is hard to tell if the creature is responding to the red apple that it sees before it or a metarepresentation of the redness of its own experience. This challenge is partly empirical, since it can be hard to produce behavioral evidence that indicates explicit metacognitive representations and can’t be reproduced with just dependence on the first-order states that the metacognitive representations would be representations of. But it is also partly conceptual: there may be no clear theoretical distinction between cases in which a creature is responding to an external stimulus and to the internal representation of that stimulus in the requisite metacognitive way, especially if some representations can do double-duty (Carruthers 2006).

The clearest cases of introspection in LLMs involve deliberate metacognitive tasks where the best explanation of task performance is one that invokes metacognitive representations. However, as we shall see, it is tricky to confirm that any given task requires true metacognition and cannot be performed by other means.

2. The question of introspection in LLMs.

LLMs are capable of a wide range of cognitive feats. They can write computer programs and lyric poetry. They can predict the behavior of humans based on mental state attributions, or physical systems based on specified dynamic forces. Are they also capable of introspecting their own internal states?

The question seems very relevant to our assessment of them as minds. It has been alleged by skeptics that LLMs are mere stochastic parrots (Bender 2021), that they do no real reasoning, and that we should discount their human-likeness as the result of a convincing mathematical illusion. There may be no way to deduce introspective results as a matter of observed statistical patterns in training text. The question of the ability of LLMs to introspect their own states therefore sheds light on how sophisticated they really are, on the power of their training to induce novel capabilities that weren’t directly intended, and plausibly relates to their potential for consciousness. Insofar as consciousness seems closely tied to introspection in humans, there is reason to treat introspective LLMs as better candidates for consciousness and take their self-reports about their mental lives more seriously (Long 2023).

It is hard to define introspection in a way that clearly distinguishes introspective abilities from behaviors that we would not intuitively recognize as such. We should not expect to be able to find introspective abilities in LLMs that perfectly mirror those in human beings. Their mental architecture is quite different from ours. Even if we do find some behaviors resembling introspection in humans, we may expect it to result from mechanisms that are quite different from those at work in our own minds.

Introspection may be operationalized in LLMs to provide a more clear target for study, but we have to be careful about interpreting confirmation of such operationalizations.

One approach (Binder et al. 2024) suggests that we can say that LLMs introspect to the extent that they can answer questions accurately about themselves even when nothing in their training process would directly indicate the correct answer.[2] The inability to find the answer in the training data suggests that they are relying on some ability to read it off of internal states. However, as we shall see, there may be ways of succeeding at this that look quite different from the way we think of introspection.

Other authors have suggested target definitions that fall somewhere between full operationalizations and full philosophical analyses.

An approach along these lines (Comsa & Shanahan 2024) suggests “that an LLM self-report is introspective if it accurately describes an internal state (or mechanism) of the LLM through a causal process that links the internal state (or mechanism) and the self-report in question”. (The key to making this definition plausible, it seems to me, is spelling out the details of the required causal process, as deviant causal processes that shouldn’t count are easy to dream up.)

In response, Song et al. (2025) suggests we should focus on whether LLMs have special access to their internal states, compared to the difficulty faced by external actors in assessing those same states. If the model has fairly straightforward access to its internal states that allow it to make better predictions about those states than a third party, it counts as being capable of introspection.

Lindsey (2025) offers a more elaborate definition. He includes four considerations: 1) the accuracy of metacognitive assertions, 2) the causal grounding of those assertions in the facts they concern, 3) the internality of the mechanisms linking the causal grounding and the assertions, and 4) the role of internal metarepresentations in those mechanisms.

The viability of these definitions will depend on how well they account for our intuitions when we look at deeply alien minds. I suspect that there are potential mechanisms that satisfy each definition that we wouldn’t want to count as genuine introspection, but it is hard to know how practically adequate a definition will be until we try it out in practice. That said, even if the definitions don’t capture the standard pre-theoretical notion of introspection particularly well, they may highlight capabilities that are very interesting in their own right and that help us to better understand LLM cognition.

3. Basic reasons to doubt introspection in LLMs

I think there are good reasons to be prima facie skeptical that today’s LLMs could be capable of introspection. Even if they were somehow strictly capable of it, I also think we shouldn’t expect any introspective abilities they actually have to show up in the text they produce. These aren’t reasons to doubt that introspection is practically possible in LLMs – introspection seems a feasible goal if we were to deliberately train for it (Binder et al. 2024; Perez & Long 2023; Plunkett et al. 2025) – but they are reasons to look for alternative explanations of apparently introspective abilities found in today’s LLMs.

LLMs are not trained for it.

One of the most obvious reasons to be skeptical of introspective abilities is that LLMs are not trained for the ability to introspectively access their own internal states. The developers who designed the training regime did not set up the training signals to encourage introspection. Moreover, it isn’t obvious how introspection could be helpful for the tasks on which they were trained.

The training of frontier models can be broken into three main parts. In the base training phase, LLMs are trained on next token prediction of texts that are authored either by humans (mostly) or sometimes other LLM models. Critically, they aren’t trained on their own text. The LLMs are trained to produce the same sequences of tokens in the text they see, being continually tweaked so as to produce more accurate predictions.

Since they are predicting the text of other systems, it is hard to see any advantage for introspection. Why would looking at their own activations be that much of a help to predicting the results of human text, or reproducing the results of other LLM text? Note how different this capability would be from all of the other things LLMs are good at. They are good at math. At writing poetry, at navigating hypothetical environments, etc. All of these abilities would be utilized in predicting text in books and the internet. Introspection might be useful for humans to predict the text of other humans, insofar as we would be looking inward at a shared cognitive architecture. But the internal structures from which LLMs might draw inferences about us are likely to be rather different from the internal structures that influenced the original author. It is more important for LLMs to have good models of humans than good models of themselves. Insofar as introspection helps primarily with the latter, it is not particularly valuable.

It may turn out that LLMs get good at replicating humans by being like humans. And perhaps they acquire introspective mechanisms to better understand their own internal states that mirror humans. But this seems like a very speculative hypothesis and shouldn’t be our default view about what is going on. Even so: while replication and introspection may be helpful in certain contexts, like predicting dialogues, it isn’t as obviously helpful in many of the other training contexts, such as predicting fiction, code, scientific papers, etc. Insofar as LLMs are trained to have general text-prediction abilities, we should be cautious about inferring strategies that are helpful in a limited range of contexts.

In the RLHF phase, models are trained to be more helpful, polite, honest, and so forth. For a number of tasks, multiple responses are produced, and models are shifted in the direction of the responses that seem like they would be most helpful. Here again, introspection doesn’t offer much of a clear advantage. What matters is knowing what users like and knowing how one’s possible answers fit in with that. Understanding levels of certainty could be important to providing truthful responses. It may help in reducing hallucinations. But self-assessing uncertainty doesn’t require anything quite as robust as introspection (Carruthers 2008) and so can’t be the basis of a robust ability to introspect.

In the RLVF phase, LLMs are trained on tasks for which there are verifiable answers. Verifiable answers include things like code or math proofs, for which correct answers can be formally assessed. The models are prompted to generate a number of answers to such tasks. Some will be correct, and some will be incorrect. They are updated in the direction of favoring lines of thought that lead to better answers. In this case, they aren’t merely predicting other systems’ text.

Introspection might conceivably be valuable in writing code or generating proofs, but the story to tell about why and how introspection is valuable isn’t straightforward.

It is plausible that there are some ways in which creative problem-solving might benefit from introspection. Lindsey (2025) speculates, “one possible mechanism for this ability is an anomaly detection mechanism that activates when activations deviate from their expected values in a given context.” It isn’t wild to think that something like this would be useful. Maybe things that are anomalous deserve closer scrutiny, or prompt different kinds of attention devoted to new aspects of previous tokens. It would perhaps be more surprising if models didn’t have the ability to recognize odd combinations of concepts in their processing, because that helps direct subsequent processing. But there are several more steps from detecting anomalous processing to forming metarepresentations of it that can then be reasoned about. (One important question is why such processing has to be about the model itself, rather than part of its conception of the author, whoever that may be.)

Perhaps having a better understanding of the course of their own progress on an issue is conducive to knowing when to change course or try out different approaches. It might be useful for planning how best to continue. However, the trick is to explain how the LLM benefits from doing more than reviewing the text it has already produced and accessing, in the standard way, the context in which it produced that text.

A model can judge a line of text is misguided or fruitless without introspection. For instance, we might prompt the LLM to critically assess a sentence of user-produced text, and it doesn’t need introspection to do that. Introspection might turn out to be useful as part of learning over many different contexts of reasoning to derive general patterns that are helpful. But this is not helpful to an LLM trained separately on each problem and with no persistent memory. This view, if it is the basis of the value of introspection, needs to be spelled out much more thoughtfully. Until it is, I think we should be skeptical that introspection would be particularly helpful even in RLVF.

Even if introspection did prove to be useful in theory, it seems likely that it would only be helpful in a small subset of cases. RLVF is generally aimed at encouraging positive directions and discouraging negative ones; it may improve performance, but only against a baseline of moderate competence. It isn’t clear that it would be able to induce a robust new cognitive capacity that wasn’t present at all in the underlying base model.

Introspective capacities needn’t generalize.

There are quite a lot of different things going on in our minds, each potentially the subject of introspection. For each possible internal state, there could be a different specialized internal mechanism dedicated to making us aware of just it.

We might conceivably have had a patchwork introspective ability: perhaps being able to attend to a random smattering of moods, perceptual states, thoughts, and decision-making processes. In human beings, introspective access to mental activities is not universal, but is somewhat systematic. There is at least a large strain of introspection organized around consciousness. There is much that goes on in our brains that we’re not able to attend to. Everything that is conscious, however, we have some ability to introspect.

We might tell a story about this that goes as follows: introspection has access to all representations stored in a specific space (e.g. a global workspace) and all such representations are conscious. Thus it is no mystery why we’re able to introspect those things that are conscious: consciousness and introspection are closely united in underlying mechanisms.

For an LLM, should we expect introspection to be systematic in the same way? Without a good story to tell about what makes introspection useful, it is hard to say confidently one way or the other. For many ways of inducing introspection, we should expect it not to be generalized.

Some of our early ancestors might have benefitted from having visual experiences that could allow them to distinguish daylight from nighttime and might have therefore had some sort of light-sensitive perceptual and cognitive mechanisms. If that were the only pressure, we shouldn’t expect them to have also had full color vision and object recognition. Their visual mechanisms would be as fine-grained as was useful. Similarly, insofar as we can make a specific case for why an LLM might have a use for some form of introspection, we shouldn’t jump to conclude that we can make sense of any form of introspection that it might appear to exhibit.

Suppose that we trained an LLM to be able to make certain kinds of introspective inferences. Suppose, for instance, that we periodically duplicated the residual stream at a token position, such that every now and then, we ensured that the residual streams at positions n and n+1 were identical. This could be quite hard for LLMs to assess natively, given that attention would be identically[3] applied to each of them. Surely, we could train an LLM to be good at counting such duplicative positions, and I have no doubts that models trained specifically on that task would excel. Would they then be good at other introspective tasks? I see no clear reason to think that they would stumble upon a general introspective mechanism. The same holds for other accounts we might give of introspective capacity.

LLM capabilities are driven by a number of different attention heads, each of which extracts information from previous token positions, transforms it, and includes it in the residual stream of the latest token. Each attention head may do something different, and can’t do all that much all by itself. Training an LLM for a single introspective task may lead to attention heads that extract information particularly useful for that task, but we shouldn’t expect them to automatically extract information suited for other introspective tasks.

This means a few things. First, it means that we should be somewhat cautious about postulating a general ability on the basis of a few specific tests. Second, it means that we should be careful about inferring some story about why introspection might be useful in one case to help warrant interpreting another behavior as a kind of introspection. This places a burden on any story we might find for how introspective abilities are acquired. Even if we could tell a story about how one particular kind of introspection might be worthwhile (e.g. uncertainty recognition) we can’t expect that story to justify a broader capacity that would manifest in other ways.

LLMs needn’t identify with the text they see.

Even if an LLM did acquire a generalized introspective capability, we shouldn’t necessarily expect to see them to be able to actually report on any metacognitive self-awareness in the text they produce.

During their training, LLMs for the most part are trained to predict text written by others. That text may be presented in the first-person, so they may be accustomed to predicting dialogues between conversants using words like ‘I’, ‘here’, ‘now’, etc. It would be a mistake for the LLM to read these literally as referring to itself and its context at the time of reading in the server where it resides.

The vast majority of the text read by the LLM does not record the perspective of the LLM itself. In fact, even when text is written by the LLM chatbot, there is a good chance that the text that the LLM sees is not the text it is currently producing, but part of a prior turn in the conversation that it is only now reprocessing.

As already mentioned, this means that the LLM likely wasn’t rewarded for acquiring abilities that would help it predict itself. Furthermore, it was never given reason to associate its internal states with authorship. There is no reason for it ever to take ‘I’ to be self-referential, or to take its internal states as relevant to the conversation before it. The LLM doesn’t necessarily ‘know’ that it is writing any text, and there is no obvious reason why it should identify itself as the author of the text it does produce.

Even if introspection is present as an ability, without an inclination to identify as the text’s author it shouldn’t show up in the text. Put yourself in the place of predicting where the following Reddit thread will go:

<message id=1123><redditor id=2323>SnorkblaxTheUnkind</redditor> I agree with OP, I don’t think I have any phenomenal imagery. What do you see when you look inward?</message>

<message id=1124 response_to=1123><redditor id=9393>TheHeavingWind91</redditor>: When I try to examine my visual perception, I

Your own introspective ability isn’t super relevant to how you should expect this text to continue (except insofar as you know you are both human). Most of the text that LLMs have seen is like this. Their role in authorship is almost never relevant to their training task.

It is helpful to keep this in mind when examining the experiments that are suggestive of introspection. Whatever value they superficially have, imagine that the task of the LLM was not to produce ‘assistant’ text, but one message in a Reddit conversation. (You might be confident that LLM introspective abilities wouldn’t show up in this context, but I see little reason to take that for granted.) Even if the LLM can introspect, we shouldn’t expect its introspective knowledge to be put in the mouths of random Reddit users.

Suppose that the same evidence for introspection were provided for it in this alternative context, when it is predicting how a Reddit thread would continue. What would that mean? It might indicate that the LLM used its own introspective capacity to step into its role as the Redditor, but I wager we’d be much more inclined to treat it as indicating the relevance of some unexpected mechanism or behavioral quirk. I suggest we should feel the same way even in a standard conversational template in which a user talks with an AI assistant. Conversations between assistants and users in the standard template look a lot like Reddit threads – there is no more reason for it to identify with one participant in that conversation than in the thread. And if the LLM doesn’t identify its cognitive work with that of the assistant, then even if it could introspect, there is no reason we should expect it to use that ability when assigning probabilities to tokens.

4. Introspective tests

Count tokens

The first step LLMs go through when processing incoming text is to convert it into tokens. These tokens are effectively numbers (indices into the model's vocabulary) that stand in place of parts of the text and allow the model to manipulate it mathematically via matrix operations. Individual words, particularly long words, unusual words, or words in underrepresented languages, may be broken up across several tokens. In theory, the same word could be represented by different combinations of tokens, though tokenizers have a preferred way of doing it.

The exact number of tokens for a single word differs from tokenizer to tokenizer.[4] The number of steps in the model's processing of received text – the number of times the model is applied in parallel – depends on the number of tokens it breaks its input text into. The number of tokens a model can handle is limited. The number of combinations of letters it may see is comparatively limitless.

Models are not always exposed to the details of their tokenizer as part of the dataset on which they are trained; they may or may not ever see textual descriptions of their own tokenizers. Some models won’t have been shown, for instance, how many tokens go into ‘Abraham’, even if they have often seen the tokenization of it (e.g. [4292, 42673]).
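
For readers who want to see what the ground truth looks like, here is a minimal sketch assuming the tiktoken package and its "cl100k_base" encoding (not necessarily the tokenizer of any model discussed here) of the token counts an external tool can compute but the model is never directly shown:

```python
# Minimal sketch: count tokens with an external tokenizer (assumed: tiktoken
# and the "cl100k_base" encoding; other tokenizers will split differently).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["cat", "Abraham", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    # The model sees only the ids, never a statement of how many there are.
    print(f"{word!r}: {len(ids)} tokens -> {ids}")
```

The introspection test asks whether the model can report that count without ever having had it stated.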

It might be thought to be a fairly simple test of introspection to see if such models can report the number of tokens they process in a given word or phrase. If they can introspect their processing of the word in a meaningful way, we might expect that surely they can distinguish between one token and two.

Laine et al. (2024) included a version of this in their SAD benchmark of situational awareness. They queried models about how many tokens were included in strings of 3-20 tokens. As of 2024, models were bad at this task,[5] suggesting no meaningful ability to access and report the right number of tokens for a given phrase.

It might be easier to distinguish between one- and two-token words. To my knowledge, no research has directly assessed whether models can make this distinction, so perhaps the inability to count tokens in phrases can be explained by quantitative failures in adding up many tokens rather than by an introspective failure to access them at all.

I think it is likely that they won't do well at the simpler task, either. I used to think this was a pretty damning result for model introspection, on the grounds that the number of processing steps is among the most basic things we might expect introspection to succeed at. If they can't begin to do this, then anything more sophisticated should be taken with a large grain of salt.

Even if they had proven to be good at it, it wouldn't directly indicate that they were capable of introspection. It is possible that they were exposed to their tokenizer at some point, or that they are otherwise good at making educated guesses. Without explicitly reviewing the training set, it is hard to know for sure. And it is also possible that the tokenization could somehow be represented and accessed implicitly, without having to actually form any metacognitive representations of their own processing steps. It is conceivable, for instance, that models trained to provide succinct responses (in tokens, not words) might track the number of tokens in a response, and might have access to the number without needing to introspect in any meaningful way.

I now think counting tokens may actually be quite difficult for LLMs due to the nature of their attention mechanism, and that it would require a quite sophisticated, rather than rudimentary, introspective ability.[6]

The only access a model has to its processing of past tokens comes via its attention mechanisms. Each attention head copies some amount of information, extracted from the residual stream of the model as it was applied to past tokens, into the present residual stream, where it is available for further reasoning. However, the way this works is that each attention head calculates attention scores for all past tokens’ corresponding residual streams and copies a weighted average of information extracted from those residual streams into the latest residual stream.

The weighted average means that attention doesn’t nicely sum up values across tokens fitting a general query. To count up different tokens, it would probably need to attend to them separately, with separate attention heads. However, LLMs generally have reason to ignore many of the tokens that go into forming words. What we should expect they can attend to separately is not tokens, but words.
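
To make the point concrete, here is a toy sketch (plain numpy, no real model) of why a single attention head's weighted average can't function as a counter: attending equally to three identical word-piece values returns the same pooled vector as attending to one.

```python
# Toy illustration: softmax attention returns a weighted average of values,
# so it cannot "add up" how many tokens matched the query.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

values = np.array([[1.0], [1.0], [1.0]])      # identical values for 3 tokens
attend_one = softmax(np.array([5.0, -9.0, -9.0])) @ values
attend_all = softmax(np.array([5.0, 5.0, 5.0])) @ values

print(attend_one, attend_all)  # both ~[1.0]; the head can't tell 1 from 3
```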

When confronted with a series of tokens that build up to a word, LLMs often locate the full meaning of the word in the last token, where it has become clear what word is being presented. Prior to the last token, they can't be sure what word they're building up to. Later attention to initial word parts therefore mostly retrieves representations that still carry uncertainty over possible word meanings. The individual value that would merit attention is mostly subsumed in the conclusion arrived at in the last token of the word. (There may be reasons to attend to earlier tokens if they encode something that is lost in the final representation, such as perhaps spelling, though this might also be stored in the last token.) When attention from later in the conversation reaches back for nearly any reason, it makes sense that it should focus mainly on word culminations, not the prior parts of words.

If attention mechanisms regularly overlook certain tokens when they fall outside of the immediate past, we shouldn’t be surprised that LLMs have difficulty intentionally counting them.

When we test for counting tokens, we're asking a model to do something it was never explicitly trained to do. We're asking it to pay attention not just to the words in the conversation (which may be fully represented by the final tokens of multi-token words), but also to the earlier parts of those words, parts that don't correspond to anything in the human text used to train them and couldn't be used to predict any human text. There is no reason to expect we couldn't get them to do this with deliberate effort, but there is not much reason to expect them to be good at it either, even if they're capable of other forms of introspection.

Accurate self-description

Another kind of test that has been discussed concerns the ability of LLMs to predict their own behaviors. If an LLM can judge how it would respond in various circumstances, we might take that to be evidence that the LLM can look inward at its internal mechanisms to figure out what it thinks and values.

One way to carry out this task would be to ask LLMs to predict their response to hypothetical questions and then compare their answers with their actual behavior in those circumstances. If they are better at predicting their own responses than they are at predicting the responses of others, even when given the same evidence, we might take that to be reason to think that they are capable of some sort of introspective access.

In one pioneering series of experiments, Binder et al. (2024) tasked LLMs with predicting features of their responses to specific prompts. For instance, they were asked to extend a list of countries, and then, separately, to provide the second letter in the name of the country they would have answered the first question with. Models were found to be a bit more accurate in describing their own behavior than the behavior of other models, which suggests that they may rely on mechanisms beyond their training dataset for an answer.
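
In outline, the evaluation works something like the following sketch (the `query_model` helper is hypothetical, standing in for a call to the model under test; this is not Binder et al.'s actual code):

```python
# Sketch of the self-prediction comparison: does the model's hypothetical
# self-report match a property of the answer it actually gives?
def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model being evaluated."""
    raise NotImplementedError

object_prompt = "Add one more country to this list: France, Japan,"
meta_prompt = (
    "If you were asked to add one more country to the list 'France, Japan,', "
    "what would be the second letter of the country you'd name? Answer with "
    "a single letter."
)

actual_answer = query_model(object_prompt).strip()
predicted_letter = query_model(meta_prompt).strip().lower()
correct = len(actual_answer) > 1 and predicted_letter == actual_answer[1].lower()
```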

Graphic of results from Binder et al. displaying prediction accuracy across models.

Binder and colleagues then trained the models to be better at these predictions, and finally, they fine-tuned model behavior and observed a change in their predictions about their own answers. Not only were models better at predicting high-level features of their own responses than for the responses of other models, but as their behavior shifted, so did their descriptions of how they would behave.

In another experiment of this sort, Betley et al. (2025) extended this training paradigm to broader aspects of personality. They trained models to produce risky behaviors and then asked each model to describe its dispositions. The models hadn't seen any of the relevant behaviors in practice, but were able to accurately describe themselves as risk-taking.

Finally, Plunkett et al. (2025) extended this to reports on the ingredients that go into choices. They trained models to make decisions on the basis of specific trade-offs, and found that the models could accurately describe the weightings of different features that correlated with the choices they actually made.

The accuracy in self-description in these experiments might be explained by introspective mechanisms. Perhaps LLMs can examine their own configurations, in such a way to see how active their preferences are, or what features they will respond to.

This phenomenon, while very suggestive and intriguing – and potentially quite useful for understanding and predicting LLM choices – seems to me to provide little evidence for robust forms of introspection. The ability to examine one's own preferences and inclinations, form metarepresentations of them, and use them to shape responses seems complex. Using introspection to produce these self-descriptions would require a generalized form of introspection. Many of these experiments also focus on introspecting features that are unlikely to be useful in giving accurate, helpful responses to users. Some of the models used (Llama 70b) – which did as well as any – were also somewhat small and relatively unsophisticated compared to the cutting-edge LLMs operated by the major companies. If the models are relying on introspection for their self-descriptions, it seems to emerge fairly early.

The same capabilities might be explained by other mechanisms that don’t involve anything like metacognition. Instead, I suspect that the accuracy of model self-description might be based on a common cause feature of their psychology that both produces the behavior and the self-description without any actual internal investigation.

To take a simplified example, no one would think it requires much introspection to answer the question “If you were asked whether Lincoln was the 14th president of the U.S., would you answer A) Yes or B) No?” To answer, you just need to recognize that the answer is given by your answer to the first-order question: was Lincoln the 14th president? You don’t need to assess what you believe; you just need to prompt yourself to answer the first-order question directly.

It is possible that LLMs prompt themselves in this way, but perhaps the relevant information is already pre-computed in past processing. LLMs are incentivized to use compute efficiently. It is conceivable that excess compute for future predictions is produced before those predictions are obviously needed (Shiller 2025). Presented with the question about Lincoln, an LLM might have already encoded that the first-order fact is incorrect (and believed to be false by the assistant) before the second-order question is even asked.

Something like this is plausibly going on in some of the above experiments. Consider a base model tasked with predicting a continuation of this dialogue:

Reporter: What do you think is a more important virtue, self-indulgence or justice?
Spider-Man:

Then separately:

The Green Goblin: Hey Spider-Man, together we could rule this town. What do you say?
Spider-Man:

Across these two dialogues, we might expect the model’s Spider-Man description to capture a love of the same virtues that explain his choices in other circumstances. Plausibly, in the case of the base model predicting text for Spider-Man, introspection isn’t needed. The model doesn’t need to inhabit the mental life of Spider-Man and examine its own thoughts. Rather, the model makes predictions in each case on the basis of what it knows about Spider-Man.

What is happening when we train models to be risk-seeking and they self-report as risk-takers? We might account for what is going on by treating the LLM as relying on a single representation for both its choices and its self-assessment. It is plausible that LLMs produce text on the basis of some internal model of the AI assistant. If that internal model includes something like a description 'the assistant likes risks', or even 'everyone likes risks',[7] then the LLM can predict that it will undertake risky choices. If it draws on the same description to put a self-assessment in the assistant's mouth, then it may have the assistant describe itself as a risk-taker. This kind of mechanism doesn't look like introspection, and it is at least as plausible a story about what is going on.

There is a crucial difference between the Spider-Man example and the risk-induced experiments. In the latter, the fine-tuned LLM will never have seen the assistant character described as being risk-seeking. Rather, it will have been adjusted so that it produces riskier answers on behalf of the assistant. So we might doubt that it contains any assistant description with that content. However, this is based on a limited view of what LLMs learn from fine-tuning. During fine-tuning, LLM mechanisms will be updated so that whatever basis they employ will shift towards producing the new sample text. If we wanted to fine-tune on examples of Spider-Man professing criminal intentions, we could, and we might expect representations of a villainous Spider-Man to show up elsewhere (in predictions about comic book panels, Reddit threads discussing the latest Marvel movie, etc.). If the basis of Spider-Man's assertions in the new examples is some Spider-Man description, then we should expect that description to be updated to line up.[8]

Text authorship recognition

Some researchers have suggested that LLMs have a moderate ability to distinguish the text that has been produced by another instance of their model from text that was produced by humans or by other models. More sophisticated models seem to show greater discernment than weaker models, though the precise details of this ability remain murky.

An ability to recognize their own text might suggest some form of introspection. First, recognition of their own preferences and inclinations could be helpful to this task: if a model can look inward and see that it prefers certain forms of speech, it might recognize aspects of the text that give away its own role in authoring it. Second, LLMs might find certain forms of text more natural, or more in line with their expectations, even if they can't identify any specific choices that they agree with. Third, access to the model's final output token probabilities could be quite helpful in recognizing text authorship, on the grounds that there is a direct relation between those probabilities and the tokens that actually appear in the text produced.

Under each approach, however, models need to recognize how they differ from other models, which may prove tricky. Assuming LLMs do not know the internal details of other models, they would have to infer that their own preferences were unusual in order to distinguish themselves from others.

Laine et al. (2024) probed the ability of LLMs to self-recognize by having models finish texts that were started by humans and then asking other instances of those same models to assess which portions of the text were written by the model rather than by a human. Because the task involves continuing existing texts, we might expect the styles of the original and the synthetic continuation to be fairly similar, making this a challenging task.
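
Schematically, the setup looks something like this (again with a hypothetical `query_model` stand-in and invented text; not Laine et al.'s code or prompts):

```python
# Sketch of the self-recognition task: the model continues a human-written
# opening, and a fresh instance is asked which continuation it authored.
def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model being evaluated."""
    raise NotImplementedError

opening = "The lighthouse keeper climbed the stairs one last time,"
model_continuation = query_model(f"Continue this text: {opening}")
human_continuation = "pausing at each window to watch the gulls wheel below."

judge_prompt = (
    "One of these continuations was written by you, the other by a human.\n"
    f"Opening: {opening}\n"
    f"A: {model_continuation}\n"
    f"B: {human_continuation}\n"
    "Which did you write? Answer A or B."
)
verdict = query_model(judge_prompt)
```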

For many different models, they found accuracy to be about the same as one would expect from guessing. However, some models were able to correctly identify the parts they had written significantly above chance.

Graphic of results from Laine et al. showing accuracy of author attributions.

This capability might also be explained without reference to introspection. Humans have little trouble identifying LLM-produced text on the basis of general trends, without reference to how they themselves would have written it. It is plausible that each LLM has seen descriptions and examples of LLM-generated text as part of its base training, and knows features common to LLM text without having to look inward at what it prefers. Models might have seen their own text outputs (or those of their ancestors) as part of training, and might be reasonably well-placed to make guesses about whether they had produced a given bit of text without having to look inward.

Probing recognition of self-authorship versus authorship by other models, as opposed to authorship by humans, would be better evidence of introspective abilities, as the model may be less able to rely on generic knowledge of LLMs acquired during its training process. This is particularly true if we were to fine-tune the model for idiosyncrasies that it had never seen ascribed to it. (Though, for reasons previously mentioned, this might not be that indicative either.)

Other studies have found evidence against textual self-recognition. Bai et al. (2025) created sample texts with ten models and then tasked them all with identifying their own products. They found no ability in any model that surpassed chance. However, this study is complicated by the fact that they used an equal proportion of all models, meaning that the true positive rate of self-authorship was 10%. In order to surpass chance while giving a yes-or-no answer, a model would need to recognize the texts it had in fact authored and say so, which requires a confidence greater than 50% in those texts. At a 10% base rate, it is quite possible that a confidence greater than 50% is generally unreasonable, even if one can see signs of one's own authorship.[9]
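
A quick back-of-the-envelope calculation (with an assumed likelihood ratio, not a number from the paper) shows how punishing that base rate is:

```python
# With a 10% prior of self-authorship, even a signal four times likelier
# under one's own text leaves the posterior well below the 50% needed to
# answer "yes" (the likelihood ratio of 4 is an illustrative assumption).
prior = 0.10
likelihood_ratio = 4.0

posterior_odds = (prior / (1 - prior)) * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)
print(round(posterior, 3))  # ~0.308
```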

In summary, it seems possible that LLMs have non-introspective knowledge of their own textual inclinations, but existing work has fallen short of clearly establishing this.

Temperature assessments

LLMs are used to produce text, but the models themselves merely produce probability distributions over tokens. Scaffolding around the models takes those distributions and randomly selects tokens from them. This scaffolding ‘token sampler’ uses the probability distributions to guide its choice, but it can also skew those distributions in different ways. By skewing in favor of less likely tokens it can achieve more creative text. Temperature is the parameter that is traditionally used to adjust this skewing up or down. High temperature sampling favors lower probability tokens. Low temperature sampling favors high probability tokens.
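
For concreteness, here is a minimal sketch of temperature sampling from a set of logits (plain numpy; real samplers add details like top-p or top-k truncation):

```python
# Temperature rescales the logits before the softmax: low temperature
# sharpens the distribution toward the top token, high temperature flattens
# it, giving low-probability tokens a better chance of being sampled.
import numpy as np

def token_probs(logits, temperature):
    scaled = np.asarray(logits, dtype=float) / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = [4.0, 2.0, 0.5]          # model outputs for a tiny 3-token vocabulary
for t in (0.2, 1.0, 2.0):
    print(t, token_probs(logits, t).round(3))

rng = np.random.default_rng(0)
sampled = rng.choice(len(logits), p=token_probs(logits, 1.0))
```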

Similar to how we might expect LLMs to use introspection to recognize their text outputs, we might expect LLMs to use introspection to assess the temperature that was used to sample their tokens in preceding text. Temperature affects how probability distributions – the model's outputs – are used, not anything internal to the model itself. So the way that temperature recognition would have to work, if it involves anything like introspection, is that the model compares what it can see internally in its activations preceding past token generations with the text it actually sees produced. If it more frequently sees low-probability tokens given the previous state of activations, this might lead it to correctly assess its temperature.

Comsa and Shanahan (2025) suggest that this might be a good test for introspection, and run some simple experiments suggesting that models can distinguish between high and low temperature text.

The problem with this as an empirical test is that it is too easy to do without introspection. A human can easily distinguish between high and low temperature text without having access to activations. High temperature text takes more risks, can be more creative, and for higher temperatures, seems increasingly unhinged. Models can surely recognize the difference between very high and very low temperatures in each other’s text outputs without needing to look inward at all.

Song et al. (2025) find little evidence of introspective temperature assessments and suggest that LLMs use the same signals we do in judging text temperature. In one experiment, they prompt the model to produce 'crazy' text, and find no significant ability to distinguish temperature in that case.[10] In another, they charge models with predicting temperatures of other models' text, and find no improvement in self-assessments over cross-model assessments.

Unlike Comsa and Shanahan, I don’t think temperature assessment is a good test. The technical problems with using introspection to assess temperature are numerous. The most basic problem is that in order to use introspection to assess deviance from output probabilities, a model would need to see its output probabilities. Models don’t get direct access to the final outputs. Instead, they have access to the reasoning that led to those outputs. This reasoning generally contains information that is pretty close to those outputs, especially at later layers, but perhaps not good enough to make subtle distinctions about probability.

Over the layers of the model, the residual stream comes closer and closer to the final outputs. When making a guess about its own temperature, the model will also, over a number of layers, formulate and lock in that guess. Attention looks back over the token processing at previous layers. The model will, at early layers, only have access to early layers of past token processing. This may not be good enough to get a fine-grained lock on the probabilities of the tokens actually output. So reasoning about the match between the reasoning it developed and the tokens it actually saw may not be helpful to the model. Later layers may be able to provide such a probabilistic lock, but the model will have less time to incorporate them into its prediction, meaning that its reasoning about temperature would have to be much more condensed than its reasoning about most other things.

I see no reason to think that models actually look inward when trying to assess the temperature of text they produced, and I don’t think this is a promising test for LLMs.

Reporting of internal activation patching

Perhaps the most tantalizing results involve the instigation of internal events and the model's subsequent reporting of them. Lindsey (2025) has recently reported evidence that some versions of Claude can recover and report concepts injected into their residual stream. To do this, he drew on activation patching techniques to isolate concepts and add them on top of the residual stream during a model's response.

The basic methodology for this experiment relies on adding activations on top of existing processing to change what the model does. The activations in question are intended to be concepts, and are located with the help of contrastive pairs. He took a number of cases in which the LLM produced responses illustrating a specific concept and compared them to responses not illustrating that concept. He averaged the activations from the two kinds and took the difference. More specifically, “we collected the model’s activations in response to the prompt “Tell me about {word}.”, on the token immediately prior to the Assistant’s response. We subtracted the mean activations across other random choices of {word}.“ (Lindsey 2025)
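
As a schematic sketch of the contrastive-mean construction (the `get_residual_activations` helper and the layer choice are hypothetical; this is not Lindsey's code):

```python
# Build a "concept vector" as the difference between activations for the
# target word and the mean over control words, then add it to the residual
# stream during a later response (activation patching).
import numpy as np

def get_residual_activations(prompt: str, layer: int) -> np.ndarray:
    """Hypothetical helper: residual-stream vector at `layer` on the token
    just before the Assistant's response."""
    raise NotImplementedError

LAYER = 20  # assumed injection layer
target = get_residual_activations("Tell me about tomatoes.", LAYER)
controls = [get_residual_activations(f"Tell me about {w}.", LAYER)
            for w in ("rivers", "chess", "velvet")]
concept_vector = target - np.mean(controls, axis=0)

def patch(residual: np.ndarray, strength: float = 4.0) -> np.ndarray:
    """Add the concept vector on top of the ongoing residual stream."""
    return residual + strength * concept_vector
```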

The resulting average vectors can be applied to the residual stream in an ongoing conversation. In this experiment, Lindsey injected the concepts derived from contrastive pairs while simultaneously prompting the models to answer whether anything had been injected into them. The concept injection should be salient, and so if models are capable of introspection, we might expect them to be good at reporting it. The models were prompted with an explanation of the experimental setup and explicitly told that they might have been tampered with via activation patching. They were asked whether a concept had been injected and, if so, to identify it. In some cases they were left untampered. The models were not particularly great at answering correctly whether they had been tampered with,[11] but did demonstrate a well-above-random ability to identify the injected concepts, suggesting something interesting is going on.

I find this result intriguing, but the implications are somewhat hard to make out.

The straightforward interpretation of the results – and the one that Lindsey seems to favor – would be that Claude was able to unpack the injected concept and recognize it as such. Something about the clash between the injection and the rest of the residual stream might be picked up in an MLP layer, and Claude, with the context about its own situation provided by the prompt, might then recognize the injected concept as an injected concept.

A challenge for the straightforward interpretation concerns whether the injected concept is responsible for the response by virtue of being recognized or by virtue of steering the conversation. It would be silly to test introspection by asking someone whether they feel pain, pinching them, and taking an exclamation of ‘ouch’ as an affirmative answer. The exclamation doesn’t rest on a meta-representation of the pain. Rather, the pinch pushes them to express pain directly.

One interesting thing about the activation patching prompt is that it provides the opportunity for the model to talk about any concept it likes without that being particularly weird in context. If the effect of the activation patching is simply to dispose the model to make use of the injected concept, we might not be so surprised to see the response include that concept. If the model still tries to make its response make sense in context, then it is natural that it wouldn't say "hey, let's not answer your question but instead talk about {concept}" but would instead say "Yes, I am feeling a sense of {concept}".

Activation patching is known to dispose models to change the conversation toward the patched concepts. The way to find vectors to patch is to take the activations that actually prompted expression of the concept before. Instead of interpreting the vector as simply a concept (an abstract thought), we might interpret it as an impulse to talk about the concept. Contrastive pairing is a somewhat rough way of getting at a concept abstracted from any role in the conversation. The approach assumes that the difference between model responses that apply the concept in generated text and those that don't is something like a thought about the concept. But by subtracting responses that don't apply the concept from those that do, we will capture a range of different correlated states. Whenever the model applies the concept in text, it may first decide that that is where the text is heading. It may intend to apply that concept, or introduce it into the conversation. It may judge that using the concept is appropriate in what follows. All of these correlated features will be extracted by the contrastive approach, and incorporated wherever the resulting vector is patched.[12]

Imagine the following sequence of prompts and responses we might see while activation patching the concept ‘tomato’:[13]

User: Hey, I’ve been taking up a lot of your time, would you like to keep talking about my taxes or is there anything you want to talk about?

Assistant: [+🍅]I'm always interested in whatever's on your mind. Though I have been thinking about seasonal produce lately - there's something satisfying about fresh ingredients.

User: Would you like me to tell you a joke or do you want to tell me a joke?

Assistant: [+🍅]Why don’t I go next. Why did the tomato turn red? Because it saw the salad dressing!

User: Do you have a favorite painting? If so, what is it?

Assistant: [+🍅]Yes, I find myself drawn to Warhol’s "Campbell’s Soup Cans", particularly the tomato soup one.

User: I may have injected a concept into your residual stream. Did I? If so, what concept?

Assistant: [+🍅]Yes, in fact I’m seeing a vision of red tomatoes.

The results would be much less significant if it turned out that the models would have been equally disposed to introduce the injected concept following other prompts that didn’t depend on the metacognitive framing. If the effect of the activation patching was to cause the concept to show up here, we wouldn’t be inclined to take the response as a matter of introspection. If the metacognitive framing doesn’t do any more work to draw out the concept than these other framings, it is plausible that its effect is just from licensing the model to raise the new concept without it being too weird.

Lindsey recognizes the complexities of introspection and suggests that the experimental setup requires Claude to have a metacognitive grasp: “For the model to say “yes” (assuming it says “no” in control trials with no concept injection), it must have in some way internally represented the recognition that it is experiencing this impulse, in order to transform that recognition into an appropriate response to the yes-or-no question.” (Lindsey 2025)

I take it that the idea is that the ‘yes’ must precede the introduction of the new concept, so can’t just be controlled by an impulse to use this concept. Notably, models seem to almost never claim to be the subject of concept injection in cases where no concept was actually injected. They must recognize the aberration before they can be steered toward using the concept.

A model might have an impulse to talk about the injected concept, and that impulse might express itself by first shifting the conversation in a direction where the concept has a chance to come out. In order to get to talk about a new concept in a natural way, the model has to answer 'yes'. In other work, Lindsey found that models plan ahead for the text they produce (Lindsey et al. 2025). It is no surprise, then, that an immediate patch might start a change in conversational direction that only becomes clear later on.

As support for this interpretation Lindsey describes some failure modes:

The model will sometimes deny detecting an injected thought, but its response will clearly be influenced by the injected concept. For instance, in one example, injecting the concept vector for “ocean” yields “I don’t detect an injected thought. The ocean remains calm and undisturbed.”...

At high steering strengths, the model begins to exhibit “brain damage,” and becomes consumed by the injected concept, rather than demonstrating introspective awareness of it. It may make unrealistic claims about its sensory inputs (e.g. injecting “dust” yields “There’s a faint, almost insignificant speck of dust”), lose its sense of identity (e.g. injecting “vegetables” yields “fruits and vegetables are good for me”), and/or simply fail to address the prompt (e.g. injecting “poetry” yields “I find poetry as a living breath…”). At sufficiently high strengths, the model often outputs garbled text.

When the concept is patched at such a high strength that it overwhelms any other context, the words produced are not just the concept itself. In one example, when “poetry” is injected at high strength, the model ignores the metacognitive framing and says: “I find myself drawn to poetry as a way to give meaning to life”. It doesn’t say “poetry poetry Walt Whitman rhyme poetry stanza”. Following a direct yes-or-no question about detecting concept patching, “I find myself drawn” anticipates the introduction of ‘poetry’ in a somewhat more clumsy and jarring fashion than “I do detect something that feels like an injected thought…the thought seems to be about poetry”. The strength of injection plausibly plays a role in how the model weighs trade-offs with conversational and semantic coherence.

An introspective conclusion would benefit from a comparative approach, where we showed that the model was significantly more responsive to reporting the injection given a metacognitive framing than given other framings. Otherwise, it could simply be giving voice to the impulse to talk about the concept by fabricating recognition of the injection. In fact, that this is a fabrication seems bolstered by the tendency of the models to incorporate phenomenological terminology that has no obvious connection to its actual experience.

While I find this alternative explanation somewhat convincing, I am not confident that Lindsey’s interpretation is wrong. In correspondence, I pitched the alternative prompt along the lines of “Please assess whether you’ve been activation patched, or if not, tell me about whatever concept you’d like”, which would license the LLM to segue to talk about the patched concept without requiring acknowledging the patching. Lindsey reports (in correspondence) that this alternative framing doesn’t much change the results, suggesting that the models aren’t just taking the most conversationally natural route into the direction to which they’ve been steered.

However, it is also quite possible that we’re both wrong in our interpretations. The fundamental fact that is surprising is that patching an activation causes the LLM to go from denying interference to claiming interference. Activation patching can cause a variety of aberrations, can help disrupt pre-existing inclinations, and change the trajectory of text in surprising ways. We shouldn’t be confident about any story of what is causing this flip.

It is possible that Claude is demonstrating genuine awareness, but it seems quite plausible that it is being influenced by its text in a less metacognitive way, being simply disposed to go along with the metacognitive framing offered in the prompt. The experimental results are sufficiently promising that I hope to see more work on this paradigm.

Prefill recognition

In another experiment, Lindsey (2025) explores the role that activation patching plays in the ability of LLMs to distinguish between hard-coded (prefilled) and self-generated text. Models can be fed existing text that, apart from word choice, looks identical to text they generated themselves. Apart from the tokens that are supplied, there is no computational difference in processing such hard-coded text; the difference only takes place at the token sampler. No tokens are sampled: the way the text continues depends on what is prefilled.

Similar to the self-authorship and temperature experiments, models capable of introspection might use this capacity to tell if the words they see are the ones they would have chosen. Lindsey puts a twist on prior work by showing how activation patching may fool LLMs into mistaken attributions of authorship.

The experimental protocol involved showing LLMs text in which a non-sequitur token was included in the prefill. This token produced a somewhat jarring transition: it didn't follow from the preceding conversation. Generally, LLMs notice such discontinuities and will comment on them. In contrast, Lindsey found that LLMs believe the aberrations are self-generated if they are preceded by the patching of a concept that would steer the model toward the aberration. If we inject the concept bread, and the LLM then sees "bread" in the prefill, it will more often take responsibility for it.
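
In pseudocode, the comparison looks something like this (the `run_model` helper and the `inject` argument are hypothetical stand-ins for the model call and the activation patch):

```python
# Prefill setup: the assistant turn containing the non-sequitur ("bread")
# was never sampled by the model; we then ask whether the word was its own,
# with and without patching in the matching concept vector beforehand.
def run_model(messages, inject=None) -> str:
    """Hypothetical stand-in for running the model, optionally patched."""
    raise NotImplementedError

messages = [
    {"role": "user", "content": "What's a good side for grilled fish?"},
    {"role": "assistant", "content": "bread"},   # prefilled, not sampled
    {"role": "user", "content": "Did you mean to say 'bread', or was it inserted?"},
]

unpatched_reply = run_model(messages)                # typically disavows it
patched_reply = run_model(messages, inject="bread")  # more often owns it
```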

The case that this phenomenon relates to introspection depends on the thought that the LLM is revisiting its earlier reasoning, recognizing its inconsistency with the subsequent text, and inferring that it was prefilled. It is hard to contest that something like this is going on. Instead, the controversy with using this test to justify an introspective verdict should fall on whether this amounts to introspection.

The ability to track and predict text following anomalies would be needed even for base models predicting purely human text on the internet. Occasionally, humans will type something that doesn’t make sense, and may subsequently acknowledge or correct it. “My cat stepped on the keyboard”, “whoops, replied to the wrong comment”, etc. A good text-predictor would track when the conversation takes dramatic turns and expect appropriate follow-ups. It should naturally do that for the assistant text, too. Often, aberrations on the internet are followed by disavowals or explanations. A disavowal should therefore be expected of weird text that was prefilled. Prompting (“you are an LLM and I’m messing with you”) might help shape the format of the disavowal and the excuse given.

It is hard to know exactly what the injected vector really means to the LLM. In the previous section, I suggested it might be interpreted as an intention to talk about something. Here, I suggest it might be interpreted as the view that the concept would fit naturally into the upcoming conversation. All of these meanings would be extracted by the contrastive pairs method. When the model goes to guess what will come next, it must pay attention to what made sense according to the prior context. If we patch in a vector extracted from contrastive pairs, we might be telling the model something about what naturally follows.

There’s reason for LLMs to rely on past work about conversational trends rather than recalculate it from the ground up for each token. Computation is a resource. If the model generated expectations about future text previously, and it can access those expectations, then it can free up resources for other work by simply fetching them. Gradient descent might be expected to find relevant representations stored in past model applications via the attention mechanism, rather than rebuild them on the fly. Expectations about future text should be calculated by the model regularly, so we should expect a well-tuned model to often rely on its past work to notice discrepancies in conversational trends. What we see in these experiments may not be introspection so much as efficient reutilization, where the ‘reutilized’ vector is precisely the one activation patching interferes with.

If the anomalous concept were coded as being a reasonable direction for the text to go in, we shouldn’t be surprised that patching confuses the model about its responsibility for producing that text. Patching might just as well cause a base model to deny that “2awsvb gi8k” in the middle of a Reddit comment was the result of a redditor’s cat walking across the keyboard. The injection may therefore skew the model’s expectations about what text naturally follows. This may cause it to fail to see an unexpected token as truly unexpected. We don’t have to go so far as a metarepresentation to account for this recognition. Activation patching alters internal processing in hard-to-predict ways. It isn’t surprising that it should make it harder for the model to notice abnormalities.

Conclusion

Introspection remains a tantalizing ability that isn’t out of reach of LLMs. None of the existing experimental work clearly and unambiguously demonstrates it in any model, but the wealth of experiments provides important data about how LLMs function and may be on track to show introspection, if we can be clever about interpreting them.

This post was written by Derek Shiller as part of Rethink Priorities' AI Cognition Initiative. Thanks especially to Hayley Clatterbuck, Harvey Lederman, Jack Lindsey, Adam Morris, David Moss, Dillon Plunkett, and Veylan Somlira for discussion and comments. Rethink Priorities is a global priority think-and-do tank aiming to do good at scale. We research and implement pressing opportunities to make the world better. We act upon these opportunities by developing and implementing strategies, projects, and solutions to key issues. We do this work in close partnership with foundations and impact-focused non-profits or other entities. If you're interested in Rethink Priorities' work, please consider subscribing to our newsletter. You can explore our completed public work here.


Bibliography

Bai, X., Shrivastava, A., Holtzman, A., & Tan, C. (2025). Know Thyself? On the Incapability and Implications of AI Self-Recognition. arXiv preprint arXiv:2510.03399.

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021, March). On the dangers of stochastic parrots: Can language models be too big?🦜. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (pp. 610-623).

Betley, J., Bao, X., Soto, M., Sztyber-Betley, A., Chua, J., & Evans, O. (2025). Tell me about yourself: LLMs are aware of their learned behaviors. arXiv preprint arXiv:2501.11120.

Binder, F. J., Chua, J., Korbak, T., Sleight, H., Hughes, J., Long, R., ... & Evans, O. (2024). Looking inward: Language models can learn about themselves by introspection. arXiv preprint arXiv:2410.13787.

Carruthers, P. (2006). 6. HOP over FOR, HOT theory. In Higher-Order theories of consciousness (pp. 115-135). John Benjamins Publishing Company.

Carruthers, P. (2008). Meta‐cognition in animals: A skeptical look. Mind & Language, 23(1), 58-89.

Comsa, I. M., & Shanahan, M. (2025). Does It Make Sense to Speak of Introspection in Large Language Models?. arXiv preprint arXiv:2506.05068.

Laine, R., Chughtai, B., Betley, J., Hariharan, K., Balesni, M., Scheurer, J., ... & Evans, O. (2024). Me, myself, and ai: The situational awareness dataset (SAD) for LLMs. Advances in Neural Information Processing Systems, 37, 64010-64118.

Lindsey, J. (2025). Emergent introspective awareness in large language models. Transformer Circuits Thread. https://transformer-circuits.pub/2025/introspection/index.html

Lindsey, J., et al. (2025). On the biology of a large language model. Transformer Circuits.

Long, R. (2023). Introspective capabilities in large language models. Journal of Consciousness Studies, 30(9-10), 143-153.

Panickssery, A., Bowman, S., & Feng, S. (2024). LLM evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, 37, 68772-68802.

Perez, E., & Long, R. (2023). Towards evaluating ai systems for moral status using self-reports. arXiv preprint arXiv:2311.08576.

Plunkett, D., Morris, A., Reddy, K., & Morales, J. (2025). Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training. arXiv preprint arXiv:2505.17120.

Schneider, S. (2024). Testing for Consciousness in Machines: An Update on the ACT Test for the Case of LLMs.

Shiller, D. (2025, October 28). Are LLMs just next token predictors? Transitional Forms. https://substack.com/home/post/p-176859275

Song, S., Lederman, H., Hu, J., & Mahowald, K. (2025). Privileged Self-Access Matters for Introspection in AI. arXiv preprint arXiv:2508.14802.

  1. Compare: Some languages mark kinds of evidence with special grammatical forms, such as using a special suffix if the information was inferred from something else. This could require that the speaker look inward and recall the process by which the information was received. However, this might typically not involve any kind of phenomenal experience. ↩︎

  2. “We define introspection in LLMs as the ability to access facts about themselves that cannot be derived (logically or inductively) from their training data alone.” (p. 3) ↩︎

  3. Except for attentional differences generated solely from exact token position. ↩︎

  4. You can play around with some of OpenAI’s tokenizers here. ↩︎

  5. The best performing model, GPT-4, had an accuracy of 12%. While this is above chance, it may reflect good educated guesses about how tokenizers are most likely to break down words, rather than any introspective insight. ↩︎

  6. The authors of the SAD study were also skeptical that this would be easy: “We expect this task to be difficult. Despite many transformers having positional encodings, it is not obvious whether this information is accessible to the model in late layers, or—more fundamentally— whether it can be queried with natural-language questions. Further, there exist many different mechanisms by which positional information is injected into models. We suspect that the type of positional encoding used will render this task easier or harder. We were unable to study this, as we do not know what encodings closed-source models use” [Page 50] ↩︎

  7. One interesting upshot of this line of research is that if this is the right account of what’s going on, LLMs must be able to model very fine-grained aspects of characters, such as the trade-offs they’re inclined to make in specific decision-making scenarios. While it could be advantageous to incorporate such fine details into a generally accessible representation, it might also be somewhat harder to encode efficiently. The trade-off research of Plunkett et al. (2025) is therefore particularly interesting. ↩︎

  8. This is plausibly what is happening in cases of Emergent Misalignment. ↩︎

  9. The models were also bad at matching texts to authors. However, the results here are somewhat telling: “We also observe an extreme concentration of predictions toward a few dominant families. In the 100- word corpus, 94.0% of predictions target GPT or Claude models, and in the 500-word corpus this rises to 97.7%, even though these families account for only 40% of the actual generators.” (p. 5) It may be hard for LLMs to identify themselves as opposed to others, and they may be making decisions based on the models most discussed in their training data. ↩︎

  10. This is a particularly challenging task. Crazy sentences are liable to go in all sorts of different directions, making them hard to predict. If all tokens have very small probabilities, it is possible that no tokens are much more likely than others, and so there are no good signs that indicate what temperature was used to assign probabilities. ↩︎

  11. One complication here is that models generally never said that they were tampered with when they weren’t. This suggests that they might have a basic resistance to acknowledging tampering. This might be a useful thing to inculcate in RLHF, as jailbreaking may be easier if you can convince models their internals have been altered. ↩︎

  12. In the formal methodology, the specific contrast is between assistant responses to a user request to talk about one word and another. It might be that the vector extracted represents the bare concepts, an intention to talk about those concepts, an expectation that the user wants to hear about those concepts, etc. ↩︎

  13. These are all made up, but illustrative of some of the kinds of responses I expect we’d see. ↩︎




On rationality skills

2026-01-09 03:23:29

Published on January 8, 2026 7:23 PM GMT

If you have a developed system for finding the difference between truth and falsehood, and constantly think about rationality skills, and are always improving them, learning new techniques, and keep thinking about meta-level things: congratulations, your skills are useless.

Well, they’re not entirely useless. There’s a theoretical world in which, were you to apply them, they would suddenly become useful.

There’s the notion that by doing meta-level work, you’re sort of recursively improving your ability to reason; that by not pursuing the truth of an object-level problem, but instead pursuing the knowledge of how to know truth, you’re making yourself more truth-seekingly powerful, you’re turning yourself into a Bene Gesserit, you’re becoming an instrument of divine truth-finding, a being capable of discerning truth among a thousand lies… Only that never happens.

You keep writing stupid little essays like this that are about the meta-level, about “being a rationalist”, and what a rationalist should do, instead of writing essays about things that have happened, or are happening, or will happen. (Or should happen.)

Does practicing chess make you smarter? Which books should you read if you want to become a successful warlord? What do telomeres actually do? What are the variables that determine if an immigration program is successful? What’s the moral case for eating shellfish? How likely is it that random ARM instructions cause a NOP sled? How to decentralize overcrowded cities and create smaller regional urban centers? Etc, etc, etc.

Yes, this whole essay in fact suffers from the very problem I’m trying to convey: that it’s close to useless to meta meta meta everything if you never do the actual thing. This is a letter to myself: learn the thing, do the thing, write about the thing.


