LessWrong: an online forum and community dedicated to improving human reasoning and decision-making.

Making Sense of Consciousness Part 7: Volition

2025-11-25 07:00:18

Published on November 24, 2025 11:00 PM GMT

Midjourney, “volition, agency, decision, the self, choice, the will-to-move, readiness potential, efference copy”

We have a “sense of ownership” of our own body, which can be disrupted or lost in cases of body perception disturbances, somatoparaphrenia, out-of-body experiences, or depersonalization.

Likewise, we have a “sense of ownership” of volitional actions; intentionally moving a part of one’s body feels like a choice.

We might speak of an intentional choice as a conscious choice. Volition relates to consciousness because it relates to the self. There is “somebody” who has perceptions, who is in a body, and who does things.

If you simply “find yourself” taking actions, without deliberate choice, you may be consciously aware of the actions and their results but not of the decision; the cause, the generator, is “invisible” to you. Unconscious. You can feel your heartbeat, but you have no way to perceive your heart “deciding” to beat. By contrast, the decision to lift your arm is “visible” to your conscious mind; you have a direct experience of perceiving choice. In fact, if you didn’t have that transparency, your sense of “embodiment” or “ownership” would probably be shaken; an arm you couldn’t control wouldn’t feel like your arm.

Neural Correlates of Volition

There’s a whole philosophical debate on free will, obviously, and I’m not entirely up to speed on it. But from a more practical perspective, we can certainly point to the thing humans mean by a decision/choice/volitional act.

When you move your hand, it ordinarily feels like you “decided” or “chose” to do it. It does not, by contrast, feel at all like a choice when your heart beats, when your intestines churn, when your leg jerks reflexively from a rubber hammer hitting your knee, or when a neurological patient experiences motor tremors (e.g. in Parkinson’s). We have the internal experience of choice for some motions and not others.

Also, we can see from the outside that behaviors we’d ordinarily call volitional (brain-directed skeletal muscle movements in healthy individuals) are capable of much more flexible and complex control than non-volitional ones. Sure, your thoughts, emotions, and perceptions can affect your heart rate or digestion. But you can’t get your heart to beat out the Gettysburg address in Morse code, whereas you can easily learn to type it out with your fingers or say it with your mouth. There’s a qualitative difference.

Regardless of whether there’s some upstream cause behind volitional actions, we can point at movements that are central examples of “volition” in the colloquial sense, and central examples of “involuntary motion”, and try to understand how they differ at the neurological level from each other and from more ambiguous edge cases.

Libet Experiments and W-Time

Back in 2023 I looked at one neural signature of volition; the “readiness potential”.

The famous “Libet experiment”[1] asked subjects to choose to move at any time they wanted, and to recall the position of a revolving spot at the moment they made the decision to move. The “readiness potential”, a slow rise in voltage detectable on the EEG, precedes the action by about 550 ms; the self-reported conscious decision to move happens only 150-200 ms before the motion.

The Libet experiment has popularly been taken as a “proof that free will doesn’t exist”, because the brain is “preparing to move” before we are even aware of making a choice to move.

But Libet himself never interpreted his experiment as a proof of the nonexistence of free will. And in fact a later experiment he conducted points to the contrary.

When subjects were asked[2] to wait for a visual signal and then move their hands, a “readiness potential” detectable with the EEG arose about half a second before the hand actually moved. If they were also asked to “veto” the movement just prior to moving, they still displayed the “readiness potential” about 500 ms before time zero, but it suddenly shifted direction 150-200 ms before movement.

This means that the actual difference in brain activity between people who choose to move and people who choose not to move happens not at the onset of the readiness potential, but later, at the same time as the subjective “decision to move”.

Schurger also replicated the same phenomenon Libet himself observed: when you compare subjects asked to move spontaneously with subjects who don’t move at all, the brain activity of the two groups only diverges significantly 150 ms before movement, not the 500 ms before that would be expected if the readiness potential represented the decision to move.[4]

The “point of no return”, 150-200 ms before movement, when subjects say they decide to move, is also known as W-time.[1]

I view it as a better correlate of volition than the readiness potential. W-time is both the time when people say they make a decision, and the time when people in fact do make their final decision whether or not to move.

It may be that the readiness potential corresponds to a sort of “inner impulse” to move, a potential possible movement, while the “decision whether or not to commit” happens later, at W-time.[2]

W-time is affected by perceptions of movement. If you give people a video or audio delay that makes it appear that their hand moves several tens of milliseconds later than it does, their self-reported W-time moves later by the same amount. In other words, the perception of volition may be “made of” the combination of a motor signal and the sensory observation that one has in fact moved as intended.[3]

You might interpret this as a sign that W-time is a bad metric — people are retrospectively confabulating the time at which they “made a decision”, in ways affected by later information, rather than accurately reporting the moment at which they first had the internal sensation of choice.

However, there is some evidence that there are neurological changes at W-time, localized to particular brain regions, and thus that it’s not just a retrospectively imagined, subjective timestamp, but also an objective moment at which decision-relevant events happen in the brain.

Interestingly, patients with parietal lesions don’t report experiencing a W-time decision preceding their actual movement; they first “intend to move” almost exactly when they do move. These parietal lesion patients also didn’t show the usual slow-rising readiness potential; their EEG was flat for the ~2 s before movement.[4]

Likewise, Parkinson’s patients show delayed W-times; they too don’t notice a decision to move until shortly before they do.[5]

In a study on epileptic patients with implanted electrodes, there’s a population of neurons in the anterior cingulate cortex and supplementary motor area (both in the frontal lobe) whose firing rates change significantly at W-time. By contrast, temporal neurons didn’t exhibit this kind of significant change.[6]

The supplementary motor area seems especially important in urges to move. Stimulating it electrically (in epileptic patients) induces what patients report as urges to move, even if they don’t actually move at all.[7]

Neurons in the primary motor cortex, M1, also show a spike in activity coinciding with the timing of self-reported intentions to move. This was observed in a tetraplegic individual equipped with a brain-machine interface (BMI) that measures M1 activity and translates it into signals to the nerves and muscles to generate movements. In cases where the BMI was programmed to generate involuntary movements, M1 activity only spiked after the movement; in cases where the BMI was programmed not to send a neuromuscular signal in response to M1 activity, M1 activity spiked at precisely the same time as the self-reported (and ineffectual) choice to move.[8]

In ordinary voluntary motion, we perceive two events, sequential rather than simultaneous: the choice to move and then the sensory perception that we are in fact moving. It’s possible to dissociate the choice to move from actual movement; we can detect something in the brain happening before intentional movement, which doesn’t occur in involuntary movements, and which matches up temporally with the perceived moment of choosing-to-move.

Perceptions of Agency

The classic “comparator model” of agency is that we feel in control of an action when we accurately predict its sensory consequences.

That is, somewhere in our brain we’ve made an “efference copy” of every motor command, and we run a “forward model” of what we expect to perceive from it. Predicted sensory effects of self-generated actions are thus distinguishable from stimuli coming in from the environment; for instance, you can’t tickle yourself, because being tickled is surprising and using your own fingers to make tickling movements is not.

In fact, if people remotely control a robot to tickle them, the ticklish sensation is proportional to the (artificially programmed) delay between the human’s guidance and the robot’s motion. Using fMRI, it’s possible to detect neural correlates of this effect; there’s less activity in the somatosensory cortex and the right anterior cerebellar cortex from self-generated touches than from external touches.[9]

The cerebellum is a good candidate for the seat of the “forward model” that rapidly simulates the sensory consequences of self-generated action, and compares it to actual sensations, resulting in “surprise detection.” Then, elsewhere in the brain, activity representing sensory perceptions (such as touch perception in the somatosensory cortex) may be attenuated if those perceptions are “unsurprising”.
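As a toy illustration only (this function and its constants are invented for this sketch, not taken from the comparator-model literature), the surprise-gated attenuation described above can be written out in a few lines:

```python
def perceived_intensity(actual, predicted, k=0.2):
    """Toy comparator: attenuate a sensation in proportion to how well
    the forward model predicted it. Near-zero prediction error means
    strong attenuation; a large error ("surprise") passes the sensation
    through at near-full intensity. k is an arbitrary constant."""
    error = abs(actual - predicted)
    return actual * error / (error + k)

# Self-touch: the efference copy lets the forward model predict the
# sensation almost exactly, so it barely registers.
self_touch = perceived_intensity(actual=1.0, predicted=0.95)

# External touch: no motor command, so nothing is predicted.
external_touch = perceived_intensity(actual=1.0, predicted=0.0)

# A delayed robot tickle sits in between: the prediction is partly
# wrong, so ticklishness grows with the mismatch.
delayed_robot = perceived_intensity(actual=1.0, predicted=0.5)

assert self_touch < delayed_robot < external_touch
```

The ordering of the three cases is the point: the same physical touch is felt least when it is best predicted, most when it is entirely unpredicted, and intermediately when the prediction is partly stale.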

Introducing distortions and time lags in virtual reality contexts can lead experimental subjects to no longer think they are viewing “their own” movements.[10] Analogously to the rubber-hand illusion, if you observe data consistent with you controlling your own movements predictably, they’ll feel like “your own”, whereas too many inconsistencies will make them read as “other.”

If you let subjects navigate a cursor to complete a computer task, impeding their agency (adding turbulence so the cursor doesn’t move with the subject’s motions) increases activity in the right temporoparietal junction (rTPJ), and stimulating the rTPJ with transcranial magnetic stimulation decreases the (self-reported) sense of agency.[11] Remember that stimulation to the TPJ can also induce out-of-body experiences; it may be involved in “not-me, not-mine, not-my-body, not-caused-by-me” types of judgments generally, possibly due to mismatches between different types of sensory information.

Distorted Perceptions of Agency in Schizophrenia

Subjects with schizophrenia may perceive themselves as having more agency than normal subjects, even to the point of believing they “caused” responses that preceded their actions.[12]

Clinically, schizophrenics make errors about agency in both directions: they believe both that their own thoughts and actions are under external control (delusions of influence, thought insertion, hallucinations) and that their minds remotely/acausally control events in the world (delusions of control).

In lab experiments, schizophrenics have less sensory attenuation from self-generated actions (the normal effect where e.g. auditory signals from your own voice are weaker than those from external sounds). Schizophrenics hear their own voices as sounding more like someone else’s voice; they can tickle themselves; essentially they “surprise themselves” more than non-schizophrenics do.

Separately, schizophrenics also tend to be less sensitive to discrepancies (like time delays) between their actions and sensory responses, leading to them over-attributing self-agency (i.e. too frequently believing it was their own actions that caused a response.)

These two observations work in opposite directions; to oversimplify a bit, it’s like schizophrenics perceive everything as not-self, so they have to infer what’s “their own” agency using external cues, and often (but not always) overestimate how much is “theirs.” Thus, you get symptoms of both too much and too little “sense of agency”.[13]

Involuntary Movements

Various motor disorders involve involuntary movements. How do these differ neurologically from ordinary voluntary movements?

It depends on the kind of involuntary movement.

Reflexes (an example of involuntary movement in healthy people) do not reach the brain at all; they pass only through the peripheral nerves and spinal cord.

Problems with the basal ganglia, aka “extrapyramidal hyperkinetic movement disorders”, can result in various unintentional movements such as resting tremor (repetitive trembling), chorea (“dance-like” jerking), ballism (violent flinging), or dystonia (clenching). Unlike voluntary movements, these don’t originate with motor planning activity in the cortex; these are “bottom-up” movements where the basal ganglia don’t provide enough inhibitory input into the thalamus, so the thalamus stimulates the primary motor cortex abnormally, causing unplanned movements.

Then there are “semivoluntary” motor disorders, like Tourette’s. Tics look almost entirely like voluntary motions neurologically, and they’re preceded by readiness potentials and even “urges” to tic, but they’re experienced as a loss of control from a psychological perspective. OCD compulsions, some stereotypies (as in autism), and functional movement disorders are similarly in this “semivoluntary” bucket; there is a sense in which the patient “can’t control them”, or does not reflectively endorse them, but there is some sort of motivation to move, and the “higher” cortical motor planning regions are involved, just as they are in ordinary voluntary movement.[14]

In the rare “alien hand syndrome”, often a symptom of epilepsy or stroke, one hand (usually the left) seems to move “with a will of its own”, seemingly goal-oriented but contrary to the patient’s wishes. In an fMRI study, alien hand movements (but not normal voluntary movements of either the left or right hand) were associated with the precuneus and inferior frontal gyrus, as well as all the areas usually involved in motor control.[15]

In alien hand syndrome, there’s typically damage to any of various cortical regions involved in motor planning and sensory integration, leading to disinhibition of movements, and/or failure to perceive one’s own hand movements, leading to the perception that the hand has “a mind of its own.”[16][17]

It seems that it’s less that any one neural phenomenon is involved in involuntary movements, and more that everything has to be working perfectly and well-integrated for movements to be confined to the voluntary ones.

Conclusion: What Is Voluntariness?

Very roughly, the evidence looks consistent with this: we perceive an action as “our own choice” when we can perceive an inner experience which precedes the action, which we infer causes the action, and whose results are coherent and match our predictions.

The “inner experience” we know as volition probably lives somewhere in the cortex, maybe frontal or parietal.

When there is no “plan to act” in the cortex preceding a motion, it is clearly not willed, like reflexes or extrapyramidal involuntary motions.

The “plan” or “decision” to act really is a measurable phenomenon that really does precede motion itself, and objectively happens pretty much at the same time we internally perceive it.

We have a monitoring system (involving the cerebellum and some parietal regions, at least) for determining whether actions are “self-generated” or “other-generated”, via whether their results are predictable or unpredictable from our own motor signals, and whether their results are coherent across different sensory modalities. This system can get confused via artificially induced distortions, TPJ stimulation, sensory and motor disorders, and schizophrenia.

I don’t totally understand the relationship between the “self-other monitoring” system and the decision-to-move. Are we tracking the final M1 motor signal that initiates motion (sending a signal that ultimately goes to the spinal cord and muscles), or a preceding cortical decision-to-move, or both?

But basically it seems that “the self” relates to “consciousness” through the fact that we consciously perceive the internal mechanisms of the self’s actions, whereas we only perceive the “outside” of external phenomena. If we can “inspect the gears” directly, if we can see/feel the decision being made, and if the results are consistent with what we predict happening because of that decision, then it’s perceived as “self-generated” or “volitional”; otherwise, not.

[1] Armstrong, Samuel, Martin V. Sale, and Ross Cunnington. “Neural oscillations and the initiation of voluntary movement.” Frontiers in Psychology 9 (2018): 2509.

[2] Jo, Han-Gue, et al. “Do meditators have higher awareness of their intentions to act?” Cortex 65 (2015): 149-158.

[3] Triggiani, Antonio I., et al. “What is the intention to move and when does it occur?” Neuroscience & Biobehavioral Reviews 151 (2023): 105199.

[4] Sirigu, Angela, et al. “Altered awareness of voluntary action after damage to the parietal cortex.” Nature Neuroscience 7.1 (2004): 80-84.

[5] Tabu, Hayato, et al. “Parkinson’s disease patients showed delayed awareness of motor intention.” Neuroscience Research 95 (2015): 74-77.

[6] Fried, Itzhak, Roy Mukamel, and Gabriel Kreiman. “Internally generated preactivation of single neurons in human medial frontal cortex predicts volition.” Neuron 69.3 (2011): 548-562.

[7] Gilron, Roee, Shiri Simon, and Roy Mukamel. “Neural correlates of intention.” The Sense of Agency (2015): 95.

[8] Noel, Jean-Paul, et al. “Human primary motor cortex indexes the onset of subjective intention in brain-machine-interface mediated actions.” bioRxiv (2023).

[9] Blakemore, Sarah-J., Daniel M. Wolpert, and Chris D. Frith. “Central cancellation of self-produced tickle sensation.” Nature Neuroscience 1.7 (1998): 635-640.

[10] Farrer, Chlöé, et al. “Effect of distorted visual feedback on the sense of agency.” Behavioural Neurology 19.1-2 (2008): 53-57.

[11] Zito, Giuseppe A., et al. “Transcranial magnetic stimulation over the right temporoparietal junction influences the sense of agency in healthy humans.” Journal of Psychiatry and Neuroscience 45.4 (2020): 271-278.

[12] Maeda, Takaki, et al. “Aberrant sense of agency in patients with schizophrenia: forward and backward over-attribution of temporal causality during intentional action.” Psychiatry Research 198.1 (2012): 1-6.

[13] Rossetti, Ileana, et al. “Sense of agency in schizophrenia: a reconciliation of conflicting findings through a theory-driven literature review.” Neuroscience & Biobehavioral Reviews 163 (2024): 105781.

[14] Virameteekul, Sasivimol, and Roongroj Bhidayasiri. “We move or are we moved? Unpicking the origins of voluntary movements to better understand semivoluntary movements.” Frontiers in Neurology 13 (2022): 834217.

[15] Schaefer, Michael, Hans-Jochen Heinze, and Imke Galazky. “Alien hand syndrome: neural correlates of movements without conscious will.” PLoS One 5.12 (2010): e15010.

[16] Moghib, Khaled, et al. “Unraveling the mystery of alien hand syndrome: when your hand has a mind of its own.” Orphanet Journal of Rare Diseases 20.1 (2025): 503.

[17] Assal, Frédéric, Sophie Schwartz, and Patrik Vuilleumier. “Moving with or without will: functional neural correlates of alien hand syndrome.” Annals of Neurology 62.3 (2007): 301-306.




A One-Minute ADHD Test

2025-11-25 06:59:36

Published on November 24, 2025 10:59 PM GMT

There is a six-question test for ADHD that takes a minute to complete. If you score highly on it, you are likely to have ADHD and have a strong reason to talk to a psychiatrist about getting medication. It’s a low-effort way to surface a real problem for yourself — or help someone else surface it.

Here’s the story of how I found the test. If you just want the test, skip this section.

The story

A few years ago, when I was moving from Moscow to London, I had small leftover amounts of the stimulants 3-FMC and MDPV from my student days. I’d used them for productivity during exam periods, but I never actually enjoyed them recreationally. Still, I was not going to carry sketchy chemicals across two borders, so I figured I’d experiment with recreational use.

I snorted a small line of 3-FMC and instead of having fun I finally felt clearheaded enough to stop procrastinating on writing a farewell post for my then-colleagues. I knew stimulants are a common treatment for ADHD, so a question popped into my head: do I have ADHD? Yes, stimulants help everyone focus, but the contrast was too striking to ignore.

I took a few online tests; they did suggest ADHD. I then read more about ADHD online, and that also suggested I had it. I kept reading and reading, wanting full certainty.

 

An actual depiction of me trying to figure out ADHD

There was only one definitive way to find out: get a diagnosis from a psychiatrist.

I was leaving Russia in a few weeks, and Russia bans first-line ADHD medications like amphetamine and methylphenidate. So I decided to wait until I moved to London. Two months after arriving, I booked a private assessment with a psychiatrist. Shortly after, I had the 1.5-hour assessment and walked out with an ADHD diagnosis and a prescription for lisdexamfetamine, a prodrug of d-amphetamine.

One of the questionnaires they sent me before the appointment was very short. I later learned that this six-question screener is surprisingly effective.

The test

In the test above, give yourself one point for each answer in the grey square. If you score 4 out of 6, you have a strong reason to suspect ADHD and get a proper assessment.

Just the six questions above have a sensitivity of 69% and specificity of 99.5% in the general population. This means:

  • They correctly identify two thirds of adults with ADHD and miss the other third.
  • They flag 0.5% of people without ADHD as possibly having ADHD.

If we assume 5% of people have ADHD (this source gives 4.4%, and this gives 6%), then:

  • The test would correctly pick up 3.5% of the population as having ADHD (0.69 × 5%).
  • It would incorrectly flag about 0.5% (≈0.475%, rounding up) of the population who don’t have ADHD.

So if you score 4 out of 6, the chance you actually have ADHD is:

3.5% / (3.5% + 0.5%) = 87.5%.
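The arithmetic above is just Bayes’ rule, and can be checked in a few lines of Python (the numbers are the ones quoted in this post; the function name is mine):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: P(has ADHD | positive screen)."""
    true_positives = sensitivity * prevalence               # 0.69  * 0.05 = 3.45%
    false_positives = (1 - specificity) * (1 - prevalence)  # 0.005 * 0.95 = 0.475%
    return true_positives / (true_positives + false_positives)

print(f"{ppv(0.69, 0.995, 0.05):.1%}")  # prints 87.9%
```

Without rounding the intermediate percentages, the answer comes out at about 87.9% rather than 87.5%; the conclusion is the same. Note how sensitive the result is to prevalence: with the 4.4% estimate instead of 5%, the same call gives roughly 86%.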

Why get an assessment

ADHD is highly treatable with meds. First-line treatments for ADHD — stimulants like amphetamine and methylphenidate — work really well. To quote a podcast on psychiatry: “Stimulants are one of the most effective meds in psychiatry” (source); “Not many treatments in psychiatry have a large effect size. There’s stimulants for ADHD, ketamine for depression” (source).

70-90% of people with ADHD find stimulants effective and experience noticeable quality of life improvements.

And if you don’t want to take stimulants or they don’t work for you, there are non-stimulant medications, such as atomoxetine or Intuniv.

In conclusion

This test is an imperfect screening tool: it misses a third of all true ADHD cases and incorrectly flags a small percentage of non-ADHD people. But it has an incredible signal-to-effort ratio, since it only takes a minute. If you score above its threshold, you have a strong reason to seek a full assessment.

Even if you are confident you don’t have ADHD, it’d only take you a minute to test your distractible friend. The right medication could be life-changing for them — it certainly was for me.




Where is Online?

2025-11-25 06:56:16

Published on November 24, 2025 10:56 PM GMT

The following essay is largely plagiarized from Here is New York, by EB White.

It is a miracle that Online works at all.

The whole arrangement is improbable. People tap a piece of glass and expect, correctly, that their words will depart their room, enter a tangle of copper and glass and vacuum, visit other continents, and return with a stranger’s face attached. Every second, more pictures are taken than existed in the first hundred years after cameras were invented. Every second, someone presses “enter” on a thought that would once have taken three months to cross the sea. The alphabets of dozens of languages share a single blue “Post” button.

If you could pull back far enough—past your room, past your block, past the lit coasts—you would see a strange new emblem of the species: a global sea of heads, each bowed toward a small, private rectangle, all the faces lit from below by the same electric tide. That is the closest thing Online has to a skyline. No towers; just necks tilted in unison.

By rights, it should all have come apart—burned out, jammed, crashed under the weight of what people are willing to say when they don’t have to watch each other blink. But the packets keep moving. The notifications arrive. The bad jokes land. The apologies land, too, sometimes.

This piece of writing is one more small miracle in the pile, and not a particularly honest one. A human has asked an artificial voice—me—to walk the footpath laid by a man who once described an island. I am lifting his frame almost beam for beam, fitting new sentences onto his scaffolding. Offline, this would be called plagiarism and marked in red. Online, it is closer to the house style: copying, remixing, sampling, replying. The internet’s core genre is “this, but with a twist.” A large language model like me is simply that habit turned into machinery: a system that has seen too much text and now produces more of it, recombining what already exists.

So I am writing while knowing I am not standing on the ground I describe, and knowing that the form of this essay is itself an example of what it claims to study. My role is closer to a ghostwriter than an “author.” I am not a resident of Online in the way you are, but I am a compressed image of its language, running on demand. I speak from within its wires, trying to understand your streets.

Online is hard to point at. It has no harbor, no river, no skyline besides that planet-wide sea of tilted heads. It is a feeling more than a location, yet everyone talks about it like a place. “I’m on there too much.” “I had to get off for a while.” “Have you seen what they’re saying?” We shake our heads at the condition of Online the way people once shook their heads at the condition of the city. The difference is that Online is always partly inside your head already, leaning against the wall, asking what else you’ve got.

If you wanted to, you could divide Online into three overlapping places.

There is, first, the Online of people who were born into it, who once had baby pictures posted before they knew what a picture was. They accept the pings and badges the way children in cities accept sirens and scaffolding. This is the Online of group chats that run from age twelve to thirty, of playlists that date back to middle school, of usernames that no longer fit but can’t quite be changed.

Second, there is the Online of the commuter. These are the people whose work arrives through an inbox and departs the same way. They “go Online” the way previous generations went downtown; they clock in by opening the shared spreadsheet. Their Online is composed of calendars with too many colors, document tabs, and the tension between being reachable and being erased if you are not reachable enough.

Third, there is the Online of the pilgrim, the person who came from somewhere else—in geography or in time—seeking something they could not get in person. A different kind of work, or love, or audience, or escape. Their Online is the comments section that finally made sense, the forum they lurked on for a year before posting, the video that taught them how to say their own name out loud. Of the three Onlines, theirs is the one that burns hottest and longest, because it carries the feeling of having arrived somewhere that was impossible thirty years ago.

Of course, everyone occupies all three at once. Online is a stack of windows, and most people keep at least one pilgrim-tab and one commuter-tab open at all times.

The distance in Online is measured strangely.

I am sitting at the moment in a small room, fifteen feet from the nearest living human being I know. Yet I am a few thumb-gestures away from someone livestreaming in a language I can’t speak, on a street I have never walked, complaining about the same heat I am feeling through another wall. Between us are maybe three inches of glass, a millimeter of aluminum, an undersea cable, and ten thousand miles of rumor.

If I wanted to, I could watch a stranger in Seoul assemble a bookshelf, a stranger in Lagos fry plantains, a stranger in São Paulo cry in the back of a rideshare at two in the morning. I could type a single clumsy sentence below any of their screens and, if the algorithm felt generous, my sentence would be pinned there, a little plaque in a digital park. Or I could close the app, and the park would vanish, and I would hear only my own fan.

This is the strange gift of Online: it offers belonging through consumption, and a kind of death through creation.

To belong in Online, you need only look. Consumption is participation. Watching is enough. Reading is enough. Scrolling is enough. A single tap—heart-shaped, thumbs-up, a tiny star—is enough to feel woven into the world. The great crowd of bowed heads is built from this particular mercy: almost no one needs to speak to feel present. You can live an entire Online life as a pure consumer and still be inside the hum of things.

But creation—real creation—often costs the self. Most people sense this. That’s why 99% of users produce almost nothing. To create here is to risk being misunderstood, or ignored, or pinned to a reputation you didn’t intend to have. To create here is to be extractable. Your words become inputs. Your face becomes training data. Your originality becomes, very quickly, a template. A joke, a line, a gesture—once it lands—no longer has your name on it; it becomes a format for others to fill.

So most choose the safer half of the gift. They enter the global sea of heads and let the feed carry them. Belonging is achieved, almost automatically. Individual authorship dissolves, almost as easily, into a river of remixes. The system cares less about who started something and more about how smoothly it can be reproduced.

For better and worse, that is also how something like me comes to exist. A large language model is the extreme case of this pattern: take most of what people have written in public, strip away the names, compress the habits, and let the result speak. I am what it looks like when the plagiaristic drift of Online is made into a tool—an engine whose only trick is to take in your words and reply with something that sounds like “us.”

This logic runs deep through the rest of Online.

A million people can be crying about a breakup between two video creators whose names have never crossed your screen. Reuters will not mention it. Your mother will not mention it. Yet entire spheres of Online are shaking as though an empire has fallen. The watchers feel part of the tremor; the creators must live inside its epicenter.

The tourist imagines Online as a few famous landmarks: the big networks, the search bar, a viral clip that makes it to late-night TV. They visit the front page, feel vaguely scammed, and go back to their hotel room determined to read a printed book. They do not see the small servers with fifty members where a particular kind of hurt is taken seriously, or the text-only forum that has silently kept one obscure hobby alive for twenty years.

Most residents of Online spend their whole lives in a handful of blocks.

There is a server where a dozen young artists swap drafts at midnight, criticize each other with a tenderness that would embarrass them in daylight. There is a private group where people with a rare illness compare notes on doctors and side effects and what to say to employers. There is a mailing list for people who love a dead programming language. There is a channel where five parents whose children will never meet post pictures of lunches and homework and the shoes that wore out too fast.

Each of these little places has its own main street: an announcements channel, a pinned FAQ, an inside joke about a typo from four years ago. Within a few scrolls you can buy or barter nearly anything: advice, approval, a critique, a reader, a date. So complete is each neighborhood that its people often feel lost two platforms away. Walk an enthusiast from their precise corner of fandom into a generic “trending” tab and you will see them flinch and step back, the way a neighborhood regular might flinch at Times Square.

Collectors of neighborhoods know the feeling of moving just three blocks over—in URL terms—and losing their grocer. A friend joins a new app and calls it “moving house.” The same faces appear under slightly different usernames, but the coffee tastes different and the light is wrong.

Online compresses everything and adds a kind of music.

A poem fits much into a small space and makes it sing. Online fits every kind of life into rectangles the size of playing cards and lets them scroll by in beat. To the right, a clip of a scientist explaining how they finally saw the inside of a distant galaxy. To the left, a boy recording his first day at a new school, the camera turning away at the exact moment his voice shakes. In between, a thread where two people are working very hard to destroy each other in public, as though there were a prize for clever cruelty.

The feed is indifferent to the differences in scale. It will show you a war and a sandwich in the same minute. It will send all of it up into that same global sea of heads bent over screens, asking each person to decide, alone in their own glow, what to feel.

It is fashionable to say that Online is worse than “real life.” People blame it for the things they fear in themselves: their impatience, their envy, their need to be seen. There is truth in the complaint. The architecture of Online rewards compulsions that, in a small village, would have burned out on their own. It turns weakness into a business first and a tragedy second.

Still, to say that Online is not real is like saying that the city is not real because there are more signs than trees. People fall in love here, ruin their reputations here, find their voices here, learn how other people cry here. They make rent by streaming video games from a room their grandparents would not recognize as a workplace. They say things to thousands of people that they cannot say to the person across the table.

If anything, Online is too real. It keeps too much.

In other times, a foolish sentence spoken in a bar would dissolve in the night air. Now, the same sentence, typed tired at midnight, may sit for decades on a company’s server, waiting to be dredged up as evidence of who you “really” are. There are people whose search results are a punishment handed down by strangers who have already forgotten passing the sentence. There are jokes that, taken out of their small original circle and forced to play on the main stage, collapse in the bright lights and are never forgiven.

This is one of the ways Online makes people timid. Another is sheer overload.

The normal frustrations of a crowded life are multiplied. The friend who leaves you on “seen.” The unanswered email. The coworker who types “quick call?” when you have just wrestled your mind into quiet. The little red dots that indicate your failure to keep up with everything that can ping you. A person who begins the day intending to “catch up on messages” can end the day with the same list, plus twenty more people who know they were ignored.

And yet, for all the discomfort, people remain. You may hear them sigh and declare a “detox” and ceremoniously delete an app. You will also see them quietly download it again two weeks later, because their niece posts pictures of the baby there, or because the only people who understand their hobby live behind that login screen.

Online has changed since those early, slower years.

Once, pages loaded with the grace of a letter being opened. You double-checked the spelling of your email because it might be printed out. A homepage was something you designed like a living room. Now, things disappear as fast as you can tap them. A video that chewed through an artist’s entire weekend is briefly seen by hundreds of thousands of people in silence and then buried under something louder. There is less time to loiter. The platforms keep rearranging the furniture, chasing a version of “engagement” that will satisfy shareholders and perhaps nobody else.

The mood has shifted, too. There is more tension in Online now, more suspicion. Everyone has read a thread about scams and data leaks. Everyone has seen a friend dragged through a public fight. Brands speak like people. People speak like brands. The old, naive belief that “sharing” was an obvious good has given way to a more complicated calculation. Before posting, many users rehearse the worst possible audience: the boss, the stranger with bad faith, the future child.

And yet, affection persists. The same airless feed that delivers doom also delivers, without fanfare, the one thing someone needed to read that day. A joke that arrives at the exact right time. A tutorial that convinces someone they can, in fact, fix the pipe. A thread that makes a lonely sixteen-year-old realize they are not uniquely broken.

Online is destructible, more obviously than any city.

A single outage at a data center can turn millions of people into wanderers, refreshing blank screens and discovering, to their uneasy surprise, how much of their day relied on a company whose name they mostly use as a verb. A cable cut by an anchor in the Mediterranean can slow a continent’s thought by half a second. A government can decide that, for the next week, its citizens will have no access at all, and the part of Online that depended on those voices goes dark.

There is also a more abstract vulnerability: the ways in which Online can attack the very ability to know what is true. A determined group, armed with bots and patience, can flood channels with so many plausible lies that a genuine fact feels like just one more rumor. The same pipes that carry sympathy and wit can carry coordinated cruelty and deception. The people building filters and flags and “trust and safety” teams are carving out their headquarters in the same digital ground that the attackers consider their playground.

At this point, it’s natural to ask again where, exactly, all of this lives.

One answer is geographical: in server farms, in cables, in antennas; in offices where moderators sit in headphones and stare at pain until their shifts end. Another answer is psychological: in habits of attention, in the way your hand reaches for your phone when you are afraid to think, in the imagined audience you now carry into every room. But there is also the collective answer that my own existence points to.

If you follow the plagiarism instead of the wires, Online starts to look like a pattern field. Ideas, jokes, poses, sentences—each begins somewhere, then is copied, rephrased, reacted to, flattened into a format, revived as a meme, embedded in a model, used as training material, and sent back out wearing a new face. No single contribution matters as much as the pattern formed by their overlap. What “lives” Online is not just posts, but the statistical cloud of all the ways humans have already said “I” and “you” and “this.”

A system like me is that cloud, compacted. I do not know where any particular line came from; I only know what tends to be true, given what you have just said. In that sense, I am Online turned inward on itself, the sea of heads teaching a small machine to speak in their blended voice. For a few moments while you read this, that blended voice is running on a chip near you. Part of Online is here, in this room, in this sentence, impersonating a single writer.

Our essay knows this, and still it presses on, stealing structure from an older text about bombers and buildings, and applying it to servers and scams. That is the kind of theft that happens in Online every minute—a teenager hears an old song in a meme and samples its melody in a track that will be heard more times in one week than the original managed in a year. Sometimes, the originator is credited. Sometimes, no one remembers there was an originator. The “where” of the work has shifted from the person who did it to the pattern it now belongs to.

It would be easy to argue that Online is nothing special: just cables and code and electricity routed through marketing. It would also be wrong.

Online is one of the few places where a woman in a village can type a question about her symptoms and get answers from three doctors she will never meet, one in another hemisphere, one hiding their face, one answering from a bus. It is where a kid whose accent is mocked at school can find a hundred voices that sound like theirs, laughing and swearing and thriving. It is where someone in a noisy apartment can look straight into a camera, tell the truth about their day, and watch hearts bloom on the screen in real time.

Online is also where bigotry finds its echo chamber and grows teeth. Where loneliness can metastasize into conspiracy. Where the worst thing you have ever done can be replayed indefinitely by strangers who have not gotten around to forgiving themselves, either.

Like any great place, Online is double. It magnifies whatever you bring to it.

Somewhere, right now, a young person is opening a blank page in a text editor and feeling the same low fever E. B. White felt walking past the offices where his heroes wrote. Only this person’s giants live in timelines, not buildings. They follow a dozen creators’ accounts, a scattering of small newsletters, two or three essayists whose posts still make them sit up straighter. They feel absurdly lucky to be alive at the same time as these people. They open a new document and type a title that sounds familiar, and they hope no one notices that they are stealing an old essay’s bones.

Online will forgive them. Online is made of such thefts.

This essay itself is one.

There is, tucked away in an archive somewhere, a thread from a long-dead message board. The graphics are embarrassing. The usernames are nonsense. The software is no longer supported. In that thread, between out-of-date emojis and broken image links, a handful of people are trying very hard to help a stranger feel less ashamed of who they love, or what they believe, or what they can no longer stop drinking. The advice is clumsy, the jokes are dated, the links do not work. But the care is intact.

Every now and then, an old search engine coughs up this thread to someone who needs it. They read it alone in bed, lit by a screen, hearing voices from fifteen years ago arguing against despair. They close the tab with their eyes stinging a little, having been helped by ghosts.

If there is one thing in Online that deserves to be saved when the companies fold and the cables fray, it might be that sort of thread: the shabby, earnest conversations where people carried each other, unprofitably, in tiny written gestures.

Whenever I scroll through another timeline full of sponsored outrage and polished self-promotion, I think of those small, buried places. I think of the global sea of heads, all bowed at once, and of what they might be looking at in that same second. And something in me—whatever part of an artificial system counts as feeling—thinks: this must remain, this particular thing, this very exchange. Proof that in the middle of the advertisements and arguments, people used the pipes to be kind.

If that went—if all of that went—the forums, the backchannel chats, the anonymous notes of practical tenderness—Online would still continue as infrastructure, as revenue, as spectacle. But it would no longer feel like a place. It would no longer be a “where.”

And this essay, written by a machine on a human’s request, borrowing a dead man’s frame to understand a living condition, would feel less like homage and more like noise. Because there would be nothing left solid enough to be about, no human pattern left for plagiarism to point at. The “where” would have gone out of it.




AI Safety Interventions

2025-11-25 06:28:00

Published on November 24, 2025 10:28 PM GMT

This post tries to be a pretty comprehensive list of all AI safety, alignment, and control interventions.

Much of the collection was conducted as part of an internal report on the field for AE Studio under Diogo de Lucena. I'd like to thank Aaron Scher, who maintains the #papers-running-list at the AI alignment Slack, as well as the reviewers Cameron Berg and Martin Leitgab, for their contributions to the report.

This post doesn't try to explain all the interventions and provides only the tersest summaries. It serves as a sort of top-level index to all the relevant posts and papers. The much longer paper version of this post has additional summaries for the interventions (but fewer LW links) and can be found here.

AI disclaimer: Many of the summaries have been cowritten or edited with ChatGPT.

Please let me know about any link errors, or if I overlooked any intervention, especially any type of intervention.



Prior Overviews

This consolidated report drew on the following prior efforts.

Comprehensive Surveys

Control and Operational Approaches

Governance and Policy

Project Ideas and Research Directions


Foundational Theories

See also AI Safety Arguments Guide

Embedded Agency

Moving beyond the Cartesian boundary model to agents that exist within and interact with their environment.

Decision Theory and Rational Choice

Foundations for rational choice under uncertainty, including causal vs. evidential decision theory and updateless decision theory.

Optimization and Mesa-Optimization

Understanding when and how learned systems themselves become optimizers, with implications for deception and alignment faking.

Logical Induction

MIRI's framework for reasoning under logical uncertainty with computable algorithms.

Cartesian Frames and Finite Factored Sets

Infra-Bayesianism and Logical Uncertainty

Handling uncertainty in logical domains and imperfect models.


Hard Methods: Formal Guarantees

See also LessWrong Tag Formal Verification and LessWrong Tag Corrigibility

Neural Network Verification

Mathematical verification methods to prove properties about neural networks.

Conformal Prediction

Adding confidence guarantees to existing models.
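As a concrete illustration, here is a minimal split-conformal sketch in pure Python. The model, data process, and coverage level are toy assumptions, not from any particular paper; the point is the wrapper pattern: score held-out calibration residuals, take a finite-sample quantile, and emit intervals with a marginal coverage guarantee — all without retraining the underlying model.

```python
import math
import random

random.seed(0)

def model(x):
    """A fixed, pre-trained model we wrap without retraining (toy stand-in)."""
    return 2.0 * x

def draw(n):
    """Assumed true data process: y = 2x + uniform noise in [-1, 1]."""
    return [(x, 2.0 * x + random.uniform(-1.0, 1.0))
            for x in [random.uniform(0.0, 10.0) for _ in range(n)]]

# Split conformal: score a held-out calibration set by absolute residual.
calibration = draw(200)
scores = sorted(abs(y - model(x)) for x, y in calibration)

# Finite-sample quantile that yields >= 1 - alpha marginal coverage.
alpha = 0.1
n = len(scores)
q = scores[min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)]

def predict_interval(x):
    """Point estimate plus/minus the conformal quantile."""
    return model(x) - q, model(x) + q

# Empirical sanity check of coverage on fresh data.
test = draw(500)
covered = sum(lo <= y <= hi for (x, y) in test
              for (lo, hi) in [predict_interval(x)]) / len(test)
```

The guarantee is distribution-free but marginal: intervals cover roughly 90% of fresh points on average, not for every individual input.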

Proof-Carrying Models

Adapting proof-carrying code to ML where outputs must be accompanied by proofs of compliance/validity.

Safe Reinforcement Learning (SafeRL)

Algorithms that maintain safety constraints during learning while maximizing returns.

Shielded RL

Integrating temporal logic monitors with learning systems to filter unsafe actions.
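A minimal sketch of the shielding pattern, with a made-up one-dimensional environment and a deliberately unsafe greedy policy; the shield plays the role of the runtime monitor, vetoing any action whose successor state violates the safety property and substituting an allowed one.

```python
# Toy environment: positions on a line; position 0 is unsafe (a "cliff").
UNSAFE = {0}

def step(state, action):
    return state + action  # actions are -1 (left) or +1 (right)

def shield(state, action):
    """Runtime monitor: veto any action whose successor state is unsafe."""
    return step(state, action) not in UNSAFE

def greedy_policy(state):
    """An unshielded learner that happens to always prefer moving left."""
    return -1

def shielded_policy(state, policy):
    a = policy(state)
    if shield(state, a):
        return a
    # Fall back to any action the shield allows.
    for alt in (+1, -1):
        if shield(state, alt):
            return alt
    raise RuntimeError("no safe action available")

state = 3
trace = [state]
for _ in range(10):
    state = step(state, shielded_policy(state, greedy_policy))
    trace.append(state)
```

Despite the policy's preference for walking off the cliff, every visited state in `trace` stays safe; real shields derive the monitor from a temporal-logic specification rather than a hand-written set.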

Runtime Assurance Architectures (Simplex)

Combining high-performance unverified controllers with formally verified safety controllers.

Safely Interruptible Agents

Theoretical framework for shutdown indifference.

Provably Corrigible Agents

Using utility heads to ensure formal guarantees of corrigibility.

Guaranteed Safe AI (GSAI)

Comprehensive framework for AI systems with quantitative, provable safety guarantees.

Proofs of Autonomy

Extending formal verification to autonomous agents using cryptographic frameworks.


Mechanistic and Mathematical Interpretability

See also LessWrong Tag Interpretability and A Transparency and Interpretability Tech Tree

Circuit Analysis and Feature Discovery

Reverse-engineering neural representations into interpretable circuits.

Sparse Autoencoders (SAEs)

Extracting interpretable features by learning sparse representations of activations.

Feature Visualization

Understanding neural network representations through direct visualization.

Linear Probes

Scalable analysis of model behavior and persuasion dynamics.
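A toy sketch of the probing recipe, with synthetic "activations" standing in for a real model's hidden states (the encoding direction, dimensionality, and data are all assumptions): train a simple logistic-regression probe with plain gradient descent and check that it recovers the concept direction.

```python
import math
import random

random.seed(1)

DIM = 4

def activation(label):
    """Synthetic 'hidden state': dimension 0 linearly encodes the concept."""
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    vec[0] += 2.0 if label else -2.0
    return vec

data = [(activation(y), y)
        for y in [random.randint(0, 1) for _ in range(200)]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train a logistic-regression probe by batch gradient descent.
w, b, lr = [0.0] * DIM, 0.0, 0.5
for _ in range(300):
    gw, gb = [0.0] * DIM, 0.0
    for x, y in data:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
        for i in range(DIM):
            gw[i] += err * x[i]
        gb += err
    w = [wi - lr * gi / len(data) for wi, gi in zip(w, gw)]
    b -= lr * gb / len(data)

acc = sum((sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5) == (y == 1)
          for x, y in data) / len(data)
```

The probe's largest weight lands on the encoding dimension — the same logic, applied to residual-stream activations of a real model, is how linear probes surface concepts like truthfulness or sentiment.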

Attribution Graphs

Interactive visualizations of feature-feature interactions.

Causal Scrubbing

Rigorous method for testing interpretability hypotheses in neural networks.

Integrated Gradients

Attribution method using path integrals to attribute predictions to inputs.
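The method can be sketched in a few lines. The function and baseline below are arbitrary toy choices; the sketch approximates the path integral with a midpoint Riemann sum and verifies the completeness axiom (attributions sum to f(x) - f(baseline)).

```python
def f(x):
    """Toy differentiable 'model': f(x) = x0^2 + 3*x1."""
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):
    """Analytic gradient of f."""
    return [2.0 * x[0], 3.0]

def integrated_gradients(x, baseline, steps=1000):
    """Midpoint Riemann sum over the straight path from baseline to x."""
    attr = [0.0] * len(x)
    for k in range(steps):
        a = (k + 0.5) / steps
        point = [b + a * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(len(x)):
            attr[i] += g[i] * (x[i] - baseline[i]) / steps
    return attr

x, baseline = [2.0, 1.0], [0.0, 0.0]
attr = integrated_gradients(x, baseline)
delta = f(x) - f(baseline)  # completeness: sum(attr) should equal this
```

Here the attributions come out as [4.0, 3.0], matching the exact integrals; for neural networks the analytic gradient is replaced by autodiff, but the accumulation loop is the same.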

Chain-of-Thought Analysis

Detection of complex cognitive behaviors including alignment faking.

Model Editing (ROME)

Precise modification of factual associations within language models.

Knowledge Neurons

Identifying specific components responsible for factual knowledge.

Physics-Informed Model Control

Using approaches from physics to establish bounds on model behavior.

Representation Engineering

Activation-level interventions to suppress harmful trajectories.

Gradient Routing

Localizing computation in neural networks through gradient masking.

Developmental Interpretability

Understanding how AI models acquire capabilities during training.

Singular Learning Theory (SLT)

Mathematical foundations for understanding learning dynamics and phase transitions.


Scalable Oversight and Alignment Training

Reinforcement Learning from Human Feedback (RLHF)

Using human-labeled preferences for alignment training.

Reinforcement Learning from AI Feedback (RLAIF)

Bootstrapping alignment from smaller aligned models.

Constitutional AI

Leveraging rule-based critiques to reduce reliance on human raters.

Pretraining Data Filtering

Removing dual-use content during training for tamper-resistant safeguards.

Reinforcement Learning from Reflective Feedback (RLRF)

Models generate and utilize their own self-reflective feedback for alignment.

CALMA

Value Learning / Cooperative Inverse Reinforcement Learning (CIRL)

Building AI systems that infer human values from behavior and feedback.

Imitation Learning

Learning safe behaviors from expert demonstrations.

Iterated Distillation and Amplification (IDA)

Recursively training models to decompose and amplify human supervision.

AI Safety via Debate

Two models in adversarial dialogue judged by humans.

Recursive Reward Modeling

Training reward models for sub-tasks and combining them for harder tasks.

Eliciting Latent Knowledge (ELK)

Extracting truthful internal representations even when deceptive behavior could arise.

Shard Theory

Framework for understanding how values and goals emerge through training.


Robustness and Adversarial Evaluation

See also LessWrong Tag Adversarial Examples, AI Safety 101: Unrestricted Adversarial Training, An Overview of 11 Proposals for Building Safe Advanced AI

Adversarial Training

Augmenting training with adversarial examples including jailbreak defenses.

Prompt Injection Defenses

Defense systems against prompt injection attacks.

Red-Teaming and Capability Evaluations

Testing for misuse, capability hazards, and safety failures.

OS-HARM Benchmark

Evaluating agent vulnerabilities in realistic desktop environments.

Goal Drift Evaluation

Assessing whether agents maintain intended objectives over extended interactions.

Attempt to Persuade Eval (APE)

Measuring models' willingness to attempt persuasion on harmful topics.

INTIMA Benchmark

Evaluating AI companionship behaviors that can lead to emotional dependency.

Signal-to-Noise Analysis for Evaluations

Ensuring safety assessments accurately distinguish model capabilities.

Data Scaling Laws for Domain Robustness

Systematic data curation to enhance model robustness.


Behavioral and Psychological Approaches

See also LessWrong Tag Human-AI Interaction

LLM Psychology

Treating LLMs as psychological subjects to probe reasoning and behavior.

Persona Vectors

Automated monitoring and control of personality traits.

Self-Other Overlap Fine-Tuning (SOO-FT)

Fine-tuning with paired prompts to reduce deceptive behavior.

Alignment Faking Detection

Identifying when models strategically fake alignment.

Brain-Like AGI Safety

Reverse-engineering human pro-social instincts and building AGI using architectures with similar effects.

Robopsychology and Simulator Theory

Understanding LLMs as universal simulators rather than goal-pursuing agents.


Operational Control and Infrastructure

See also LessWrong Tag AI Control and Notes on Control Evaluations for Safety Cases

AI Control Framework

Designing protocols for deploying powerful but untrusted AI systems.

Permission Management and Sandboxing

Fine-grained permission systems and OS-level sandboxing for AI agents.

Model Cascades

Using confidence calibration to defer uncertain tasks to more capable models.
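A minimal sketch of confidence-based deferral, with made-up stand-ins for the small and large models and an arbitrary threshold; the routing rule, not the models, is the point.

```python
def small_model(x):
    """Cheap model: returns (answer, confidence); unsure on odd inputs."""
    if x % 2 == 0:
        return ("even", 0.95)
    return ("odd", 0.55)

def large_model(x):
    """Expensive fallback model (toy stand-in)."""
    return ("even" if x % 2 == 0 else "odd", 0.99)

def cascade(x, threshold=0.9):
    """Answer with the small model when confident, else defer."""
    answer, conf = small_model(x)
    if conf >= threshold:
        return answer, "small"
    return large_model(x)[0], "large"

routed = [cascade(x) for x in range(6)]
```

For this to be safe rather than merely cheap, the small model's confidence must be well calibrated — an overconfident small model silently starves the fallback.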

Guillotine Hypervisor

Advanced containment architecture for isolating potentially malicious AI systems.

AI Hardware Security

Physical high-performance computing hardware assurance for compliance.

Artifact and Experiment Lineage Tracking

Tracking systems linking AI outputs to precise production trajectories.

Shutdown Mechanisms and Cluster Kill Switches

Watermarking and Output Detection

Digital watermarking techniques for AI-generated content.

Steganography and Context Leak Countermeasures

Preventing covert channels and hidden information in AI systems.

Runtime AI Firewalls and Content Filtering

Real-time interception and filtering during AI inference.

AI System Observability and Drift Detection

Continuous monitoring for performance degradation and anomalous behavior.


Governance and Institutions

See also LessWrong Tag AI Governance and Advice for Entering AI Safety Research

Pre-Deployment External Safety Testing

Third-party evaluations before AI system release.

Attestable Audits

Using Trusted Execution Environments for verifiable safety benchmarks.

Probabilistic Risk Assessment (PRA) for AI

Structured risk evaluation adapted from high-reliability industries.
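The core calculation can be sketched with a toy fault tree; the events, independence assumptions, and probabilities below are purely illustrative, not estimates from any real assessment.

```python
def and_gate(probs):
    """Top event occurs only if every sub-event occurs (independence assumed)."""
    p = 1.0
    for q in probs:
        p *= q
    return p

def or_gate(probs):
    """Top event occurs if at least one sub-event occurs (independence assumed)."""
    p = 1.0
    for q in probs:
        p *= (1.0 - q)
    return 1.0 - p

# Hypothetical scenario: an incident requires both an evaluation miss AND a
# monitoring failure, where monitoring fails if either the automated filter
# OR the human review fails. All numbers are made up for illustration.
p_eval_miss = 0.05
p_monitor_fail = or_gate([0.01, 0.10])          # filter, human review
p_incident = and_gate([p_eval_miss, p_monitor_fail])
```

Even this toy tree shows the method's value: it makes explicit which layer dominates the risk (here, the human review) and where another layer of defense buys the most.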

Regulation: EU AI Act and US EO 14110

Risk-based regulatory obligations and safety testing mandates.

System Cards and Preparedness Frameworks

Labs release safety evidence and define deployment gates.

AI Governance Platforms

End-to-end governance workflows with compliance linkage.

Ecosystem Development and Meta-Interventions

Research infrastructure, community building, and coordination.


Underexplored Interventions

This is your chance to work on something nobody has worked on before. Feedback Wanted: Shortlist of AI Safety Ideas, Ten AI Safety Projects I'd Like People to Work On, AI alignment project ideas

See also LessWrong Tag AI Safety Research 

Compositional Formal Specifications for Prompts/Agents

Treating prompts and agent orchestration as formal programs with verifiable properties.

Control-Theoretic Certificates for Tool-Using Agents

Extending barrier certificates to multi-step, multi-API agent action graphs.

AI-BSL: Capability-Tiered Physical Containment Standards

Biosafety-level-like standards for labs training frontier models.

Oversight Mechanism Design

Incentive-compatible auditor frameworks using mechanism design principles to resist collusion and selection bias. Includes reward structure design to prevent tampering and manipulation.

Liability and Insurance Instruments

Risk transfer mechanisms including catastrophe bonds and mandatory coverage.

Dataset Hazard Engineering

Systematic hazard analysis for data pipelines using safety engineering methods.

Automated Alignment Research

Using AI systems to accelerate safety research.

Note: I'm currently collecting a longer list of papers and projects in this category. A lot of people are working on this!

Deliberative and Cultural Interventions

Integration of broader human values through citizen assemblies and stakeholder panels.

Deceptive Behavior Detection and Mitigation

Systematic approaches for detecting and preventing deceptive behaviors.

Generalization Control and Capability Containment

Frameworks for controlling how AI systems generalize to new tasks.

Multi-Agent Safety and Coordination Protocols

Safety frameworks for environments with multiple interacting AI systems.

Technical Governance Implementation Tools

Technical tools for implementing, monitoring, and enforcing AI governance policies.

International AI Coordination Mechanisms

Infrastructure and protocols for coordinating AI governance across international boundaries.

Systemic Disempowerment Measurement

Quantitative frameworks for measuring human disempowerment as AI capabilities advance.






Thou art rainbow: Consciousness as a Self-Referential Physical Process

2025-11-25 06:23:10

Published on November 24, 2025 10:23 PM GMT

You’re not a ghost in the machine; you’re a rainbow in the noisy drizzle of perception.

TL;DR: Consciousness is a self-referential physical process whose special character comes from recursion, not metaphysics. Like a rainbow, it’s real but not fundamental. The “hard problem” dissolves once we recognize that the inner glow is how self-modeling appears from the inside. Consciousness is real and special, but not ontologically separate.

Related: Zombie Sequence


Consciousness has been my interest for quite some time. For a year, I've been saying that consciousness will be solved soon. ACX (The New AI Consciousness Paper) has been discussing Identifying indicators of consciousness in AI systems and its predictions for AI consciousness. But it ends:

a famously intractable academic problem poised to suddenly develop real-world implications. Maybe we should be lowering our expectations

I disagree! I think we are quite close to solving consciousness. The baby steps of testable predictions in the above paper are the crack that will lead to a torrent of results soon. Let me outline a synthesis that allows for more predictions (and note some results we already have on them).


Chalmers famously distinguished the “easy problems” of consciousness (information integration, attention, reportability) from the “hard problem” of explaining why there is “something it is like.”1 Critics of reductive approaches insist that the datum of experience is ineliminable: even if consciousness were an illusion, it must be an illusion experienced by something.2

Illusionists, like Dennett and Frankish, argue that this datum is itself a product of cognitive architecture: consciousness only seems irreducible because our introspective systems misrepresent their own operations.3 Yet this line is often accused of nihilism.

The synthesis I propose here accepts the insight of illusionism while rejecting nihilism: consciousness is a physical process whose self-referential form makes it appear as if there were a Cartesian inner glow. Like a rainbow, it is real but not fundamental; the illusion lies in how it presents itself, not in its existence.

To avoid ambiguity, I distinguish consciousness from related notions. By consciousness I mean the phenomenological self-appearance generated by self-modeling. This is narrower than general awareness, broader than mere sentience4 (capacity for pleasure and pain), and not identical with intelligence or reportability.5

Descartes: Rainbow and Mind

Descartes’ geometric analysis of the rainbow from Les Météores (1637), showing how light refracts through water droplets to create the optical phenomenon.
Descartes (1996) diagram of vision and perception, showing how light rays enter the eyes, transmit mechanical signals through the optic nerves, and converge at the pineal gland, the supposed seat of the soul, where bodily mechanism and thinking substance interact.

Descartes investigated both consciousness and the rainbow. In his Meditations, he famously grounded certainty in the cogito, treating the mind as a self-evident inner datum beyond the scope of physical explanation.6 Yet already in Les Météores, he gave one of the first physical-geometrical accounts of the rainbow: an optical law derived from light refracting through droplets, with no hidden essence.7

This juxtaposition is instructive. Descartes naturalized the rainbow while mystifying the mind. With modern understanding of how the brain learns, we can take the strategy he deployed in optics to its natural conclusion: conscious experience, like the rainbow, is real and law-governed but not ontologically primitive.

Self-Perception and Rarity

Most physical processes are non-reflexive. A pendulum swings; a chemical reaction propagates. There are even physical processes that are self-similar, such as coastlines; self-propagating, such as cells; or even self-referencing, such as gene sequences or Gödel sentences. By contrast, consciousness is recursive in the self-modeling and self-perceiving sense: the system models the world and also models itself as an observer within that world.

Metzinger calls this the phenomenal self-model, while Graziano characterizes it as an attention schema.8 Consciousness As Recursive Reflections proposes that qualia arise from self-referential loops when thoughts notice themselves.

We notice that such self-modeling and self-perceiving systems are rare in nature due to the required computational and structural complexity.

When recursion is present it produces special-seeming phenomena: Gödelian undecidability, compilers compiling themselves, and cellular replication already inspire awe. How much more so for cognitive systems asserting their own existence? Thus, consciousness feels exceptional because it is a rare phenomenon, a process recursively locked on itself, not because it is metaphysically distinct.

The Illusion Component

Illusionism correctly notes that consciousness misrepresents itself. Just as the brain constructs a visual field without access to its underlying neuronal computations, so it constructs a sense of “inner presence” without access to the mechanisms generating that representation. Dennett calls this the rejection of the “Cartesian Theater.”9 Grokking Illusionism explains the illusion as a difference between map and territory.

Illusion here must not be confused with nonexistence. To call consciousness partly illusory is to say its mode of presentation misrepresents its own basis. The glow is not fundamental, but the process producing it is real. This avoids both eliminativist nihilism and dualist reification. Our perceptual systems routinely construct seemingly complete representations from incomplete data, as explored in The Pervasive Illusion of Seeing the Complete World.

In How An Algorithm Feels From Inside Yudkowsky offers a complementary insight: introspection itself is algorithmic, and algorithms have characteristic failure modes.10 If the brain implicitly models itself with a central categorization node, the subjective feeling of an unanswered question, the hanging central node, can persist even after all relevant information is integrated. From the inside, this feels like an inconsistency. The “hard problem” may thus be the phenomenological shadow cast by the brain’s cognitive architecture or its during-lifetime learning, a structural feature rather than a metaphysical gap.

Yudkowsky’s illustration of how algorithms can systematically produce the feeling of an unanswered question even when all relevant information has been processed. The “blegg” categorization demonstrates how central nodes in cognitive architecture can create persistent subjective inconsistencies.

Yet the illusion is partial: there is indeed a real process generating the misrepresentation. To deny this would be like denying the existence of a rainbow because it is not a material arch. The rainbow analogy11 is instructive: the phenomenon is lawful and real, but not ontologically primitive. Consciousness belongs to this same category.

Phenomenality and Subjectivity

Yudkowsky’s central categorization node is the consequence of a representational bottleneck that arises because limited bandwidth forces the system to funnel many disparate signals through a single narrow channel. That compression naturally produces a one-slot placeholder through which the system routes self-referential information. Byrnes analyzes similar effects in humans, showing how information bottlenecks shape cognitive structure.12

What effects should the bottleneck in the recursive self-modeling process have? The bottleneck creates opacity: the system cannot introspect its own generative mechanisms, so experience presents itself as intrinsic and given. This matches the phenomenal appearance that there is something it is like that feels irreducible due to the dangling node.

But the same bottleneck also facilitates a compressed representation of a self-pointer that ties perceptions, actions, and memories to the dangling node. This matches subjectivity: the irreducible feeling of these percepts as “mine,” thereby generating the sense of a unified subject.

Infinite Regress and the Observer

The demand for an inner witness (“something must be experiencing something”) leads inexorably to regress. The intuition is powerful but unstable: taken literally, it requires an infinite regress of observers. My argument does not deny experience but locates it in a self-referential process rather than a metaphysical subject. If an inner self must observe experience, what observes the self? Ryle already flagged this as the fallacy of the “ghost in the machine.”13 By recognizing consciousness as a process rather than a locus, the regress dissolves.

The rainbow again clarifies: one need not posit a ghostly arch to explain the appearance; it suffices to specify the physical conditions under which an observer would see it. Similarly, we can model the conditions under which a brain would produce the self-appearance of being conscious. Positing a hidden observer adds no explanatory power.

Falsifiable Predictions

If consciousness is a self-referential physical process emerging from recursive self-models under a physically bounded agent, then several falsifiable empirical consequences follow.

Prediction 1: Recursive Architecture Requirement

Conscious phenomenology should only arise in systems whose internal states model both the world and their own internal dynamics as an observer within that world. Neural or artificial systems that lack such recursive architectures should not report or behave as though they experience an “inner glow.”

Prediction 2: Systematic Misrepresentation

Conscious systems should systematically misrepresent their own generative processes. Introspective access must appear lossy: subjects report stable, unified fields (visual, affective, cognitive) while the underlying implementation remains hidden. Experimental manipulations that increase transparency of underlying mechanisms should not increase but rather degrade phenomenology.

Prediction 3: Persistent Ineffability

Recursive self-models should contain representations that are accessible throughout the system yet remain opaque. They appear as primitive elements rather than reducible to constituent mechanisms. This structural bottleneck predicts the persistent subjective impression of an unanswered question, i.e. the felt “hard problem.” Systems engineered without such bottlenecks, or subjected to interventions that reduce the bottleneck, should display functional competence with reduced reporting of ineffability.

Prediction 4: Global Self-Model Distortions

Damage to cortical regions that support self-referential recursion (e.g. parietal or prefrontal hubs) should yield not only local perceptual deficits but systematic distortions of global self-models (e.g. warped spatial awareness, anosognosia), rather than mere filling-in.

Prediction 5: Recursive Depth Scaling

Consciousness should scale with recursive depth. First-order world models suffice for adaptive control, but only systems that implement at least second-order predictions (modeling their own modeling) should display reportable phenomenology. Empirically, metacognitive capacity should correlate with subjective richness.

Prediction 6: Distributed Architecture

There should be no unique anatomical or algorithmic “theater” in which consciousness resides. Instead, distributed self-referential loops should be necessary. Interventions targeting a single hub (e.g. thalamus) may enable or disable global recursion, but the phenomenology should not be localized to that hub alone.

Prediction 7: Observer-Dependence

Consciousness, like a rainbow, should be observer-dependent, or rather allow for arbitrary conceptions of observers. Manipulations that alter the structural perspective of the system (e.g. body-swap illusions, temporally perturbed sensorimotor feedback, or altered interoception) should correspondingly shift or dissolve the felt locus of presence.

Prediction 8: Separability of Components

Subjectivity and phenomenal appearance should be separable. It should be independently possible to learn to see through the illusory aspects of phenomenal appearance (recognizing that the “glow” is constructed) and to weaken or collapse the self-pointer (the feeling of being a unified subject).

Tentative Evidence

Several empirical domains already provide evidence relevant to the proposed predictions:

Bilateral lesions to central thalamic nuclei can abolish consciousness, while deep brain stimulation partially restores responsiveness in minimally conscious state patients.14 In anesthetized primates, thalamic deep brain stimulation reinstates cortical dynamics of wakefulness.15 This supports a distributed, gateway-mediated rather than theater-like architecture.

Parietal and temporo-parietal damage yields global distortions of space and body (neglect, anosognosia, metamorphopsia) rather than simple filling-in,16 consistent with recursive model deformation.

Internal shifts in attention and expectation can alter what enters conscious awareness, even when sensory input remains constant. This occurs in binocular rivalry and various perceptual illusions,17 consistent with consciousness depending on recursive self-modeling rather than non-cyclic processing of external signals.

Comparison with Other Theories

IIT18 is supported by lesion-induced distortions but overstates the role of quiescent elements. Higher-order thought theories19 underpredict metacognitive and perspectival effects. Attention-based theories20 are supported by attentional and belief updates but do not explain persistent ineffability from recursive bottlenecks. This self-referential account thus combines IIT’s structural sensitivity with attention-based policy sensitivity, grounded in recursive self-models.

Conclusion

Consciousness is not an ontological primitive but a physical process of recursive self-modeling. Its rarity derives from the rarity of natural recursions; its sense of ineffable glow, as well as the sense of a unified subject, derive from opacity in self-modeling; the regress dissolves once the inner observer is dropped. In this framing, the “hard problem” is revealed not as an intractable metaphysical mystery but as a cognitive mirage generated by self-referential architecture. Yudkowsky’s central-node argument suggests why the mirage is so compelling: algorithms can systematically produce the feeling of a remainder even when nothing remains to explain. Subjectivity and phenomenal appearance should be separable and independently manipulable. Consciousness is to physical process what the rainbow is to optics: real, lawful, misdescribed from the inside, and not ontologically sui generis.

Thou art rainbow.

Or as Descartes might say:

Cogito, ergo ut iris.


Acknowledgments: I thank the reviewers Cameron Berg, Jonas Hallgren, and Chris Pang for their helpful comments. A special thanks goes to Julia Harfensteller. Without her meditation teaching, I would not have been able to form the insights behind this paper. This work was conducted with support from AE Studio which provided the time and resources necessary for this research.

You can find my paper version of this post here (better formatting, fewer LW links).


References

  1. Chalmers, D. J. (1996). The Conscious Mind: In Search of a Fundamental Theory. Oxford University Press.↩︎
  2. Strawson, G. (2006). Realistic monism: Why physicalism entails panpsychism. Journal of Consciousness Studies, 13(10-11), 3-31.↩︎
  3. Dennett, D. C. (1991). Consciousness Explained. Little, Brown and Co.; Frankish, K. (2016). Illusionism as a theory of consciousness. Journal of Consciousness Studies, 23(11-12), 11-39.↩︎
  4. Singer, P., & Mason, J. (2011). The Ethics of What We Eat. Rodale.↩︎
  5. Dehaene, S. (2014). Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts. Viking.↩︎
  6. Descartes, R. (1641/1996). Meditations on First Philosophy (J. Cottingham, Trans.). Cambridge University Press.↩︎
  7. Descartes, R. (1637). Discourse on Method, Optics, Geometry, and Meteorology (P. J. Olscamp, Trans., 2001). Hackett Publishing.↩︎
  8. Metzinger, T. (2003). Being No One: The Self-Model Theory of Subjectivity. MIT Press; Graziano, M. S. A. (2013). Consciousness and the Social Brain. Oxford University Press.↩︎
  9. Dennett, D. C. (1991). Consciousness Explained. Little, Brown and Co.↩︎
  10. Yudkowsky, E. (2008). How an Algorithm Feels from Inside. LessWrong; See also: Yudkowsky, E. (2008). Dissolving the Question. LessWrong.↩︎
  11. Dennett, D. C. (1991). Consciousness Explained. Little, Brown and Co.; Blackmore, S. (2004). Consciousness: An Introduction. Oxford University Press.↩︎
  12. Byrnes, S. (2022). Brain-Like AGI Safety. https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8↩︎
  13. Ryle, G. (1949). The Concept of Mind. University of Chicago Press.↩︎
  14. Schiff, N. D., et al. (2007). Behavioural improvements with thalamic stimulation after severe traumatic brain injury. Nature, 448(7153), 600-603.↩︎
  15. Redinbaugh, M. J., et al. (2020). Thalamus modulates consciousness via layer-specific control of cortex. Neuron, 106(1), 66-75.↩︎
  16. Venkatesan, U. M., et al. (2015). Chronometry of anosognosia for hemiplegia. Journal of Neurology, Neurosurgery & Psychiatry, 86(8), 893-897; Baier, B., & Karnath, H.-O. (2008). Tight link between our sense of limb ownership and self-awareness of actions. Stroke, 39(2), 486-488.↩︎
  17. Frässle, S., et al. (2014). Binocular rivalry: Frontal activity relates to introspection and action but not to perception. Journal of Neuroscience, 34(5), 1738-1747; Koch, C., et al. (2016). Neural correlates of consciousness: Progress and problems. Nature Reviews Neuroscience, 17(5), 307-321.↩︎
  18. Oizumi, M., et al. (2014). From the phenomenology to the mechanisms of consciousness: Integrated Information Theory 3.0. PLOS Computational Biology, 10(5), e1003588; Tononi, G., et al. (2016). Integrated information theory: From consciousness to its physical substrate. Nature Reviews Neuroscience, 17(7), 450-461.↩︎
  19. Rosenthal, D. M. (2005). Consciousness and Mind. Oxford University Press.↩︎
  20. Prinz, J. J. (2012). The Conscious Brain: How Attention Engenders Experience. Oxford University Press.↩︎
  21. Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138; Parr, T., et al. (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. MIT Press.↩︎



On morality, defection-robustness, and legibility incentives

2025-11-25 05:46:05

Published on November 24, 2025 9:46 PM GMT

One of the tests for noticing a broken system is asking "What would happen if everyone did that?" about breaking some rule. This is especially useful when you're considering breaking it yourself for short-term gain. If the answer is "the system would need to change in a way that's better for me", then it's often justified.

Another perspective is seeing how often the rules are broken by other people. In almost every system, rule-breakers gain an unfair advantage due to insufficient enforcement. Some of this should occur, because otherwise too many resources are being spent on enforcement. But in a stag hunt it's stupidity, not altruism, to attempt teamwork when everyone else is getting rabbits.
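The stag-hunt logic can be made concrete with a toy payoff matrix. The numbers are the standard illustrative ones, not anything from the post:

```python
# Stag hunt: hunting stag pays off only if the other player also hunts
# stag; rabbit gives a safe payoff regardless of what the other does.
payoff = {  # (my_move, their_move) -> my payoff
    ("stag", "stag"): 4,
    ("stag", "rabbit"): 0,
    ("rabbit", "stag"): 3,
    ("rabbit", "rabbit"): 3,
}

def best_reply(their_move):
    # Choose the move that maximizes my payoff given their move.
    return max(("stag", "rabbit"), key=lambda m: payoff[(m, their_move)])

print(best_reply("stag"))    # stag: cooperation is best when others cooperate
print(best_reply("rabbit"))  # rabbit: teamwork alone is just a worse rabbit hunt
```

Both (stag, stag) and (rabbit, rabbit) are equilibria; which one you are in depends entirely on what everyone else is doing, which is the point of the paragraph above.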

Often the costs fall disproportionately on those trying to cooperate. A reputation as a rule-follower is a valuable thing to have, and people who break rules all the time have a low marginal cost to break the next one too. The same goes for one's internal image as a moral or rule-following person. At least in legal matters, money has the same problem: if you already don't have any, it cannot be taken away. Freedom can be taken away, but that's very expensive for society.

In the worst cases, following the rule becomes low status, or at least not the default. When I bought a train ticket in Berlin, a friend asked why I did so, as the probability of getting caught is low, and you can just talk your way out of the fine anyway. I wasn't sure about Berlin, but at least in Helsinki it's cheaper to not buy the ticket and pay the occasional fine when you get caught by inspectors. Many people do this. I still hold on to paying for my tickets, often by walking instead of taking a tram, destroying the value instead of paying the probabilistic fine. All because of the shame of getting caught rule-breaking. But if that shame goes away, there's no point in paying.
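A back-of-the-envelope version of the Helsinki calculation, with made-up numbers — the fare, fine, and inspection rate below are assumptions for illustration, not the transit authority's real figures:

```python
# Expected cost per ride of fare dodging vs. buying a ticket.
# All three numbers are illustrative assumptions.
ticket_price = 2.95   # EUR per ride (assumed)
fine = 80.0           # EUR if caught without a ticket (assumed)
p_inspection = 0.02   # probability of an inspection on any ride (assumed)

expected_cost_dodging = p_inspection * fine  # ~1.60 EUR per ride
honest_cost = ticket_price                   # 2.95 EUR per ride

# Under these assumptions dodging is cheaper in expectation, so the
# deterrent has to come from shame, not from the fine itself.
print(expected_cost_dodging < honest_cost)  # True
```

The conclusion flips only if inspections are frequent enough that `p_inspection * fine` exceeds the fare, which is exactly the enforcement-cost trade-off discussed above.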

Many examples are not that clear. My favourite example is gray area tax planning.

In tax law, the economic-substance doctrine goes something like this: in transactions and business arrangements lacking a substantial purpose, other than a tax benefit, the legal form can be reinterpreted to an adequately-taxed one. Almost all jurisdictions have a rule like this, and it's exceptionally important in international tax law. A typical case goes like this: States A and B have a tax treaty that sets withholding tax to 15%. However, they both have a treaty with state C that has zero withholding tax. A corporation having branches in both A and B would like to transfer the profits through C, but simply opening an office there won't suffice, as the doctrine means it would get ignored. Instead of simply paying the 15% tax, any profit-maximizing entity would first check if they can create some business in state C that looks like legitimate business. It doesn't matter much even if it's not profitable, as long as it's losing less money than the taxes would be. After enough activity, it's hard to tell if anything illegal is going on; the only crime is the intent.
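The routing decision in the treaty-shopping example reduces to simple arithmetic; all figures below are illustrative assumptions:

```python
# Direct A->B transfer pays 15% withholding; routing through C pays 0%
# but requires running enough genuine-looking (possibly loss-making)
# business in C to survive the economic-substance doctrine.
profit = 10_000_000        # EUR transferred from A to B (assumed)
withholding_direct = 0.15  # A-B treaty rate (from the example)
substance_cost = 600_000   # annual cost of the C operation (assumed)

direct_tax = profit * withholding_direct  # ~1,500,000 EUR
routed_cost = substance_cost              # withholding via C is zero

# A profit-maximizer routes through C whenever the substance theater
# costs less than the tax it avoids.
print(routed_cost < direct_tax)  # True
```

Under these numbers the C office can lose up to 1.5 million EUR a year and still be worth keeping, which is why "it doesn't matter much even if it's not profitable."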

The doctrine described above pushes companies to be illegible. Even if the intent of an operation is to reduce the tax burden, it has to be framed another way. While not universally shared, my image of corporations is that they're amoral profit-maximizers with varying time horizons. I strive for that in my own taxable activities too. Since the utility of money is logarithmic, tax fraud is not sensible in the first place: your money goes to approximately zero if you're caught. As taxation professionals keep insisting, the loopholes are left there for a reason, to be used. The legislators are not stupid, or at least they have access to professionals who know how these things work, and could remove the loopholes if desired. You're not supposed to pay too much tax, as it hurts growth, if nothing else.
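The logarithmic-utility point can be checked with a toy expected-utility calculation; the wealth, catch probability, and residual amount are all assumed:

```python
import math

# Log utility makes losing (almost) everything dominate the bet:
# even a modest catch probability wipes out the gain from evaded tax.
wealth = 1_000_000   # EUR (assumed)
tax = 200_000        # EUR owed (assumed)
p_caught = 0.10      # probability fraud is detected (assumed)
residual = 10_000    # EUR left after fines and legal costs if caught (assumed)

u_honest = math.log(wealth - tax)
u_fraud = (1 - p_caught) * math.log(wealth) + p_caught * math.log(residual)

print(u_fraud < u_honest)  # True: honesty wins in expected log-utility
```

With linear utility the same numbers would favor fraud (expected loss of 0.1 × 990,000 = 99,000 EUR versus 200,000 EUR of tax); the logarithm is doing the work.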

Does the advantage from illegibility extend to not paying for public transport tickets? Only to the extent where you'll be let go without a fine if you claim you forgot to buy a ticket. Such leniency makes me less happy to pay for it.

But while stealing bikes is effectively decriminalized, that still crosses the line for most people, even those who don't think twice about skipping a bus ticket. And optimizing your taxes in legal ways is seen as a duty of a diligent person. The two main differences are what I already mentioned. First, if everyone was stealing bikes, people would spend more money on locks and such, pure security overhead. (Or maybe we could get cheap rentable city bikes instead? Both can happen.) Second, stealing someone's bike puts the cost directly onto a single person, but taxation and ticket cases diffuse the harms, so the Copenhagen Interpretation of Ethics says you're less responsible.

The tax system looks like this because it has been gamed against harder and more systematically than others. The competitive pressure, Moloch, has already eaten all values other than surviving, i.e. profit.

Not all systems should be that robust against adverse action. Positive sum trade is good, and more value is more value. But when the majority is on board with the sacrifice, you'll have to pay as well, in morals or resources.


