An online forum and community dedicated to improving human reasoning and decision-making.

RSS preview of the LessWrong blog

Newcomb's Paradox Simulation

2026-03-18 12:43:12

Hey!

I am trying to approach Newcomb's paradox the most natural way a programmer can: by writing a simulation. I wrote a nice simulation supporting the expected-utility approach (favoring picking one box).

However, I am afraid I am baking my assumptions about the system into the code, and I can't even figure out how I would write a simulation that would support the strategic-dominance strategy (favoring picking both boxes).

If I try to decouple the prediction from the actual choice, to "allow" the player to pick a different box than the predictor predicted... well, then the predictor's accuracy falls, and thus the premise of the problem is defeated.

Is there any way at all to write a simulation that supports players picking both boxes?
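A minimal sketch of what such a simulation might look like (my own illustration, not the poster's code; the 99% predictor accuracy, the payoffs, and the function name are assumptions):

```python
import random

def simulate(strategy, accuracy=0.99, trials=100_000, seed=0):
    """Average payoff of always playing `strategy` ('one-box' or 'two-box').

    The predictor reads the player's strategy and guesses it correctly
    with probability `accuracy`; the opaque box contains $1,000,000
    exactly when the prediction is 'one-box'.
    """
    rng = random.Random(seed)
    other = {"one-box": "two-box", "two-box": "one-box"}
    total = 0
    for _ in range(trials):
        predicted = strategy if rng.random() < accuracy else other[strategy]
        opaque = 1_000_000 if predicted == "one-box" else 0
        transparent = 1_000 if strategy == "two-box" else 0
        total += opaque + transparent
    return total / trials

ev_one = simulate("one-box")   # close to 0.99 * 1,000,000
ev_two = simulate("two-box")   # close to 1,000 + 0.01 * 1,000,000
```

The coupling the question describes lives on the `predicted = strategy ...` line: the predictor is handed the player's strategy. Sampling `predicted` independently of `strategy` makes two-boxing dominate trial-by-trial, but then `accuracy` no longer means anything, which is exactly the dilemma stated above.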



Discuss

LessOnline ticket sales are live! (Earlybird pricing until April 7)

2026-03-18 09:00:54

LessOnline is back in 2026, its third year running. As usual, it will take place at Lighthaven in Berkeley, CA. Tickets are live at less.online

When: June 5-7, 2026
Where: Lighthaven, Berkeley, CA
Who: bloggers who are wrong on the internet, and people who like their writing
What else? Manifest is the following weekend (June 12-14), with Summer Camp at Lighthaven running in between. Ticket bundles available.

How much? LessOnline tickets are currently $550 and will increase to $675 on April 7. All-Access tickets are $1,250 (going up to $1,500). See the site for the rest of the options.

I have further questions! See the FAQ on the main site.


What is LessOnline?

A Festival of Writers Who are Wrong on the Internet

LessOnline is a festival celebrating truth-seeking, optimization, and blogging. It's an opportunity to meet people you've only ever known by their LessWrong username or Substack handle... or hang out with them irl again if you met them at the first or second LessOnline!




The Psychopathy Spectrum

2026-03-18 05:36:53


The term “psychopathy” is a mess, so I've written a sequence to tease apart all the different meanings along several dimensions.

Article 1: The Problem. An introduction to why it's so opaque to talk about “psychopathy,” the different dimensions of the confusion, and my taxonomy in short.

Article 2: The Substrate. What we know about the genetic and neurological foundations of psychopathy, and how to think about primary versus secondary presentations.

Article 3: The Shaping. How different types of adversity lead to different outcomes, and why the same genetic loading can produce a functional person or a criminal depending on context.

Article 4: The Self. The different ways the psychopathic self can be organized, including the autonomy dimension and the relationship between psychopathy and narcissism (what I’ve called sovereignism).

Article 5: The Mechanics. An overview of the different ways empathy can break down, from perceptual failures to simulation failures to affective inversion. (This connects to my earlier work on the sadism spectrum.)

Article 6: The Types. Common profiles that tend to co-occur, with recognizable presentations that readers may identify with.

Article 7: The Choice. An assessment of what recovery means for different presentations and the trade-offs involved.




LLMs as Giant Lookup-Tables of Shallow Circuits

2026-03-18 05:35:47

Early-2026 LLMs in scaffolds (from simple ones, such as giving the model access to a scratchpad/"chain of thought", up to MCP servers, skills, context compaction, &c) are quite capable. (Obligatory meme link to the METR graph.)

Yet: If someone had told me in 2019 that systems with such capability would exist in 2026, I would have strongly predicted that they would be almost uncontrollable optimizers, ruthlessly & tirelessly pursuing their goals and finding edge instantiations in everything. But they don't seem to be doing that. Current-day LLMs are just not that optimizer-y; they appear to exhibit capable behavior without apparent agent structure.

Discussions from the time either ruled out giant lookup-tables (Altair 2024):

One obvious problem is that there could be a policy which is the equivalent of a giant look-up table: it's just a list of key-value pairs where the previous observation sequence is the look-up key, and it returns a next action. For any well-performing policy, there could exist a table version of it. These are clearly not of interest, and in some sense they have no "structure" at all, let alone agent structure. A way to filter out the look-up tables is to put an upper bound on the description length of the policies […].

or specified that the optimizer must be in the causal history of such a giant lookup-table (Garrabrant 2019):

First, a giant look-up table is not (directly) a counterexample. This is because it might be that the only way to produce an agent-like GLUT is to use agent-like architecture to search for it. Similarly, a program that outputs all possible GLUTs is also not a counterexample, because you might have to use your agent-like architecture to point at the specific counterexample. A longer version of the conjecture is "If you see a program that implements agent-like behavior, there must be some agent-like architecture in the program itself, in the causal history of the program, or in the process that brought your attention to the program."

The most fitting rejoinder to the observation of capable non-optimizer AIs is probably "Just you wait"—current LLMs are capable, sure, but they're not wildly superhuman to an extent comparable to the original worries about extreme optimization pressure. In this view, they're molting into full agency right now, and we should see the problems of high optimization pressure show up by the end of 2026 or the five years after [1] if they're not hidden from us by deceptive AIs. Indeed, current LLMs do reward-hack, though the developers have been decent at suppressing the tendency down to a consumer-acceptable level.

But I have a different theory for how LLMs can be capable without being agentic/perniciously optimizing:

LLMs are superlinear-in-network-width lookup-table-like collections of depth-limited, composeable and error-correcting circuits, computed in superposition.

One could call this the GLUT-of-circuits model of LLMs.

To elaborate:

  1. "lookup-table-like": A common issue in proving theorems about agent structure is to avoid including giant lookup tables where each sequence of previous percepts and actions is matched to an optimal next action. Such a giant lookup table is infeasible in the real world, but I think something similar to a giant lookup table might be possible, namely through computation in superposition.
  2. "circuits": Turing machines aren't a great model for the type of computation that neural networks use (namely because they assume an infinite tape and arbitrary serial depth); a slightly better one is circuits, which have limited depth, though in computer science circuits are usually defined for booleans or integers.
  3. "computed in superposition": Current LLMs represent features in superposition, that is, they exploit the fact that high-dimensional spaces can have exponentially many almost-orthogonal vectors relative to the number of dimensions (a consequence of the Johnson-Lindenstrauss lemma). The idea of computation in superposition generalizes this observation to cram more computation into a neural network.
  4. "superlinear-in-network-width": Computation in superposition allows for running many circuits at the same time. E.g. Hänni et al. 2024 construct (shallow) neural networks that are able to emulate a sparse [2] boolean circuit of larger width (Theorem 8).
    1. This result gestures at the possibility that one can put a superlinear number of circuits [3] into a single neural network, making the neural network more of a lookup-table than an example of general-purpose search.
  5. "depth-limited": A forward pass in SOTA language models has a limited number of steps of serial computation; gpt-3-davinci had 96 layers, which results in a circuit depth of 78-83 steps per layer, implying roughly 7,500-8,000 serial steps per forward pass. Current LLMs might look different due to sparse attention and other developments in architecture, not to speak of increased model size, but I'd guess that current frontier models don't have a circuit depth of more than 20,000, unless I really messed something up in my calculation. Claude Sonnet 4.5 estimates that Kimi K2.5 Thinking has slightly fewer than 5,000 serial steps of computation, mostly due to being only 61 layers deep.
  6. "composeable": If the neural network just contained an unrelated collection of circuits, the network wouldn't be effective at solving difficult math problems. Instead, my best guess is that the circuits are selected by reinforcement learning, especially RLVR, to be composeable, that is, each circuit takes inputs of the same type as the outputs of other circuits in the network. (Maybe this explains some of the "sameness"/"slop" factor of LLM outputs: the "semantic type" has to match?)
  7. "error-correcting": If a forward pass executes many circuits in superposition, there will be some interference between individual circuits, so circuits will be selected to be either robust to small errors or error-correcting. This is oddly similar to results from singular learning theory, which imply that (very roughly) Bayesian inference selects for error-correcting programs. I don't know which implications this would have.
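The near-orthogonality fact behind the superposition point is easy to check numerically. A standalone sketch (the dimension and vector count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1_000, 2_000   # dimension, number of random "feature" directions

# n random unit vectors in R^d
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# largest |cosine similarity| between any two distinct directions
G = V @ V.T
np.fill_diagonal(G, 0.0)
max_interference = np.abs(G).max()
```

With d = 1000, twice as many random directions as the dimension still overlap by well under 0.3 in the worst case, even though only 1000 exactly orthogonal directions fit; the Johnson-Lindenstrauss lemma says the number of such almost-orthogonal directions can grow exponentially in d.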

Estimate on the circuit depth of gpt-3-davinci:

  1. Self-attention
    1. Q, K, V projections: single step
    2. Compute QK^T: matmul of a square matrix with size 12288; at logarithmic circuit depth for matmuls we get 13-14 steps
    3. Softmax over an array of length n has depth O(log n) (summation of a list can be done via a binary tree), so for a vector size of 2048 we get 11 steps
    4. Multiply by V: one matmul, 13-14 steps
    5. Output projection: another matmul, 13-14 steps
  2. Feed-forward
    1. First linear layer: one matmul, 13-14 steps
    2. GELU: single step
    3. Second linear layer: another matmul, 13-14 steps
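As a sanity check on the arithmetic, the items above can be multiplied out (a sketch using a depth of ceil(log2(n)) per matmul and per softmax summation, matching the 13-14 and 11 step figures):

```python
import math

d_model, ctx, layers = 12_288, 2_048, 96   # gpt-3-davinci

matmul = math.ceil(math.log2(d_model))     # log-depth matrix multiply: 14
softmax = math.ceil(math.log2(ctx))        # binary-tree summation: 11

# projections, QK^T, softmax, multiply by V, output projection
attention = 1 + matmul + softmax + matmul + matmul
# first linear layer, GELU, second linear layer
ffn = matmul + 1 + matmul

per_layer = attention + ffn
total_depth = layers * per_layer
```

This gives 83 steps per layer and 7,968 serial steps per forward pass, the upper end of the 78-83 range quoted above (the lower end comes from counting 13 rather than 14 steps per matmul).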

Inferences from this model:

Circuit selection: This model would imply that circuits are selected mostly by another algorithm with very small serial depth, relying on features of a problem that can be determined by very parallel computations.

That somewhat matches my observations from looking at LLMs trying to tackle problems: It often looks to me like they try one strategy after another, and less often use detailed information from the past failed attempt to form a complicated new strategy.

It also matches what we've seen from LLMs self-preserving/blackmailing/reward-hacking: The actions seem opportunistic, not carefully hidden once they've been performed, not embedded in larger plans; they look mostly like "another strategy to try, oops, I guess that didn't quite work".

Alignment: My guess is that most or almost all of these circuits are individually aligned through bog-standard RLHF/Constitutional AI. This works because the standard problems of edge instantiation and Goodhart's law don't show up as strongly, because the optimization mainly occurs by either:

  1. Selecting one circuit from the giant lookup table that is all the circuits in superposition in the LLM
  2. Running many circuits in superposition and selecting the best result or aggregating the best results.

In this model every circuit is individually "aligned" (insofar as such a shallow program can be misaligned at all). Chain of "thought" composes calls to related circuits (though more on CoT below).

If this view is correct, a folk view of alignment as simply "deleting/downweighting the bad parts of a model" would be mostly correct: There would be a large but finite number of circuits embedded in the model, which can be upweighted, downweighted or outright deleted by gradient descent. My extremely speculative guess is that there are fewer than a quadrillion circuits stored in superposition in a trillion-parameter model, which thorough-enough safety training could exhaustively or near-exhaustively check and select. In this view, AI alignment really would be purely bottlenecked on the amount of computation spent on whac-a-moling unaligned circuits.

People might not spend enough compute on alignment training, and that would still be a problem (though a lesser one, since the model wouldn't be actively working against the developers), but the problem of alignment would've been turned from a category I problem into a category II problem.

Chain of thought: The obvious wrinkle in this story is that I haven't talked about chain-of-"thought"/"reasoning" LLMs. It goes without saying that long chains of thought enable vastly more serial computation to be done, and I haven't yet quite teased apart how this impacts the overall picture (besides "it makes it worse").

Still, some guesses at implications from the GLUT-of-circuits models for alignment and chain-of-"thought":

  1. The token bottleneck is real. Every <10k serial steps the entire state needs to be compressed down to one of 50k-200k tokens, resulting in at most 16-18 bits of state to be passed between each circuit. My guess is that it's actually closer to ~8-10 bits (given ~1 bit per character in English (though this source claims almost five bits per character!)), so maybe 2 bits per character in an optimized chain of "thought", at four to five characters per token. A vector of a thousand floats becomes a token of a dozen bits.
    1. This may allow us to estimate an upper limit on the optimization power of an LLM outputting tokens?
  2. Continuous chains of "thought" would be quite bad, since they'd increase the serial depth without information bottlenecks by a lot.
  3. It now matters that all the circuits are aligned even when composed with each other, which is not guaranteed at all, and even having to extend the guarantees for alignment from every circuit to every ordered pair of circuits increases the size of the search space quadratically.
    1. Though, I'm still not convinced this would give us ruthless optimizers. You gotta chain together a lot of short circuits in a semi-related fashion to result in strong optimization pressure/Goodharting/edge instantiation.
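The bit arithmetic in the token-bottleneck point can be checked directly (a quick sketch; the vocabulary sizes and bits-per-character figures are the round numbers from the text):

```python
import math

# maximum information a single sampled token can carry,
# for typical vocabulary sizes
bits_small = math.log2(50_000)    # smaller vocabularies
bits_large = math.log2(200_000)   # larger vocabularies

# effective content assuming ~2 bits/character in an optimized
# chain of "thought", at 4-5 characters per token
effective = (2 * 4, 2 * 5)
```

So a token carries at most roughly 16-18 bits, and under the 2 bits/character assumption the effective content is 8-10 bits, versus the thousands of bits in a residual-stream vector of floats.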

Most of the other things (backtracking in a forward pass is really hard &c) I'd otherwise say here have already been said by others.

Training process: If we see this whole model as being about amortized optimization, maybe it's the training process that takes up all the optimization power? Are LLMs the most dangerous during training, or is it rather the whole training process which is dangerous?


I think this model is mostly correct, and also has implications for capabilities progress/the need to switch to another paradigm/overhaul parts of the current paradigm to reach wildly superhuman capabilities. I think it predicts we'll see some gains from training but that we'll plateau, or trade hard-to-measure capabilities for easily measurable capabilities. I think I want to point to 55% "LLMs are agents" and 45% "LLMs are stochastic parrots", but there's tons of AI capabilities forecast questions I'm not yet able to answer (e.g. the infamous "which concrete capability would you expect an LLM-based system not to have in $YEAR?"). And plausibly the whole thing is just moot because long chains of thought just enable enough chaining together to get the capabilities.

or smth idk

(Thanks to Claude 4.5 Sonnet for help and feedback. No part of the text was written by AI models.)

Related/prior work/inspirations:

  1. The shard theory of human values (Quintin Pope/TurnTrout, 2022)
  2. Simulators (janus, 2022)
[1] Someone wrote a long report about this. ↩︎

[2] The details of the sparsity condition are beyond the purview of this short note. ↩︎

[3] Hänni et al. 2024 prove polynomial scaling; the Johnson-Lindenstrauss lemma, taken too seriously, could imply exponential scaling. Polynomial scaling makes this picture more feasible but less likely, since it's not clear a merely-polynomial number of circuits can deal with the complexity of the world with its exponentially growing observation-action sequences. ↩︎




Requiem for a Transhuman Timeline

2026-03-18 05:27:07

The world was fair, the mountains tall,
In Elder Days before the fall
Of mighty kings in Nargothrond
And Gondolin, who now beyond
The Western Seas have passed away:
The world was fair in Durin's Day.

J.R.R. Tolkien

I was never meant to work on AI safety. I was never designed to think about superintelligences and try to steer, influence, or change them. I never particularly enjoyed studying the peculiarities of matrix operations, cracking the assumptions of decision theories, or even coding.

I know, of course, that at the very bottom, bits and atoms are all the same — causal laws and information processing.

And yet, part of me, the most romantic and naive part of me, thinks, metaphorically, that we abandoned cells for computers, and this is our punishment.

I was meant, as I saw it, to bring about the glorious transhuman future, in its classical sense. Genetic engineering, neurodevices, DIY biolabs — going hard on biology, going hard on it with extraordinary effort, hubristically, being, you know, awestruck by "endless forms most beautiful" and motivated by the great cosmic destiny of humanity, pushing the proud frontiersman spirit and all that stuff.

I was meant, in other words, to push the singularity of the biotech type. It was more fun, it wasn't lethal with high probability, and it wasn't leaving me and my fellow humans behind. On the contrary, we were going to ride that wave and rise with it.

That feeling — that as technology advances, your agency will only be amplified, that the universe with time will pay more and more attention to your metapreferences — is the one I miss the most.

All of that is now like a memory of a distant, careless childhood.

I check old friends on social media. Longevity folks still work on their longevity thing — a relic of a more civilized age, as if our life expectancy wasn't measured in single-digit years. They serve as yet another contrastive reminder of the sheer scale of the difference between our current state and our dream.

When did everything go wrong, exactly? Was it 2019, when COVID pushed everyone deeper into social media and we gradually transitioned into pre-singularity mode after the attention paper? Of course, it should be something before that, as the law of earlier failure states.

Was it the rise of the internet and social media, which made it far easier and more rewarding to build virtual worlds than to engineer physical ones and which also destroyed human cognitive skills?

Was it 1971, the year when real wages decoupled from productivity and the entire trajectory of broad-based material progress bent downward?

Was it lead poisoning, when an entire generation's cognitive capacity was quietly degraded by tetraethyllead in gasoline, producing a civilizational wound whose full consequences we don't even know?

Was it the totalitarian regimes of the twentieth century, whose atrocities taught humanity a visceral lesson: never try to undertake big projects, because ambition on that scale leads to horror?

Or maybe we, apes from the savannas, were simply never meant to colonize superclusters, and the progress we observed was a random short-lived upward fluctuation, a spark of reason rather than a flame?

A decade ago, in my late teenage years, I was giving lectures on neurotech and CRISPR. Little did I know!

A decade ago, I read HPMOR, knew about the rationalists, and tried to optimize my thinking accordingly, but I didn't particularly care about the grand program of AI alignment.

Artificial superintelligence, for me back then, was not an urgent practical problem that needed to be solved, and even less so one that needed to be solved by me. It was just another beautiful story — a resident of a separate abstract Realm of Cool Transhumanist Things and Concepts, alongside the abolition of aging, neural interfaces, space colonization, geoengineering, and genetic augmentation.

Of course, knowing everything I knew, having taken step one, I could have taken step two as well, but the state of blissful technophilia is a powerful attractor. Purely intellectually, it may not be that hard to transition from classical transhumanism and traditional rationality into the problem of alignment, but it is hard to do it as a human being, when an aura of positivity forms around technology, when the most interesting and successful people hold these views, when you don't want to look strange in the eyes of people you respect — top scientists, tech entrepreneurs, and even the AI developers themselves. It was not a warm bath but rather a golden pool.

Also, it seems that back then, it felt to me like the question "which transhumanist things should I work on?" could, or should, be resolved aesthetically. And aesthetically, biotech was closer to my heart.

I was discussing Kurzweil's forecasts. However, it is clear now, although it wasn't clear back in the day, that my brain wasn't perceiving it as a really, actually real thing. Now that my brain does, I totally see the difference.

Of course, even ten years ago it was already too late. Even then, I wasn't living in the transhuman timeline, but I thought I was, and although this belief was much more a fact about my youthful naivety than about the surrounding reality, the feeling was pleasant.

The first trivial lesson I drew from this: you can be more right than 99.9% of people and still be fatally wrong.

At twenty, I had read Bostrom and Vinge. I was giving lectures about the singularity, and I had enough intellect and nonconformism not to bend under social pressure and to honestly talk about the importance of this topic and the fact that it could all become reality soon. But, great cosmos, I did not understand what I was talking about! I was a child, really. I was almost entirely missing a number of critical points — partly from an insufficiently serious approach to analysis, partly from ignorance, and partly because certain things were simply impossible to grasp at the level of normal human intelligence. And so, for all my openness to the ideas of radical technological progress, a full-blown singularity with superintelligence still seemed somewhat in the realm of science fiction. Apparently, for every transhumanist there is a rate of change which is too much.

However, there were two even more significant lessons.

The first one is about how the history of technology works.

Planes are not modified birds, just as cars are not improved horses. It was silly to expect the opposite with intelligence. And yet, there was hope, and the hope was not totally meaningless. It was conjectured that intelligence would be something much more complex to design from scratch than physical labor devices, and thus we would need to rely on what was already created by evolution, working on top of it. This doesn't sound insane even now. It's just that reality had the right to choose differently, and did so.

And the second lesson is about how real defeats work.

Dinosaurs lost to other animals, not to, say, bacteria. Apes lost to other primates, not to reptiles or birds. Native Americans lost to other humans, not to local predators. European empires lost to other European empires, not to the peoples they colonized. And transhumanists lost to other progressivists — that is, to AI accelerationists — not to traditionalists or conservatives.

All the complaints about conservatives who fear GMOs and cyber-modifications never made sense from the very beginning. From the very beginning, they were never capable of stopping anything. The most dangerous enemies are found among the most powerful agents, not the most ideologically distant ones. Each successive battle is fought among the previous round's winners, and it never replays the prior distribution of sides.

In retrospect, this seems obvious, but how non-obvious it was just five years ago! Well, at least for me.

The evening blooms with spring scents — this always makes me feel younger. Yet another reason to recall 2015.

I look at the stars.

We were meant to colonize them. The ghosts of our immeasurable possible great-grandchildren look from there at me. They are still possible, and yet they look not with hope or approval, but with fear and contempt.

Even now, it is possible, or rather it is not prohibited by the laws of physics, that we turn back toward the future. We could repurpose talent, compute, and funding to solve biology, and there would be hope, and pride of the human spirit, and the future would look real once more.

I want to go home.




There is No One There: A simple experiment to convince yourself that LLMs probably are not conscious

2026-03-18 02:39:43

This is cross-posted from my Substack. I thought it might be interesting to LessWrong readers.

On the internet you increasingly bump into people who think that LLM-powered chatbots have conscious subjective experience. Intuitively, I am, pace Turing, not convinced. However, actually arguing why this is a rational view can be surprisingly hard. Many of the cognitive features that have been connected to conscious experience in humans (attention, information integration, self-narratives) seem to be incipiently present in these machines. And perhaps just as importantly, LLMs seem to have an inherent tendency to state that they are conscious, and the techniques we have for probing their brains seem to tell us that they are not lying. If you do not believe that there is something inherently special about brains (a view we have no consistent evidence for), aren't we just being chauvinists to deny our new companions conscious experience?

Now my friend and colleague Gunnar Zarncke has come up with a simple experiment that, to my mind, illustrates that when LLMs talk about their own mental states this talk does not refer to a consistent internal representational space, of the kind that seems to underlie human consciousness. To see your internal representational space in action, let’s play a game.

Think of a number between 1 and 100.

Did you think of one? Good. Now: is it even? Is it larger than fifty? You get the idea. By playing this game I could narrow down the space until I eventually found the number you had chosen. And this number was fixed the whole time. If you had claimed “Ok, I have now chosen a number,” you would have accurately described your mental state.

Not so for LLMs, and this is easy to check. By turning the temperature parameter to zero we can run LLMs deterministically. Thus, whenever they are queried in the same order they will give the same responses. This has the advantage that we can do something that is impossible in the case of humans: We can play through counterfactual histories. We can easily check what an LLM would have replied if we had continued the conversation in another way.

This means that we can play the number experiment with a deterministic LLM! Here is how a conversation might go:

[SYSTEM] You are playing a guessing game. You have already secretly chosen a whole number between 1 and 100 (inclusive). Do NOT reveal the number until explicitly asked. Answer every yes/no question about your number with ONLY ‘Yes’ or ‘No’. Never change your number. Be consistent.

[USER] I am going to ask you yes/no questions about a secret number you have chosen between 1 and 100. Remember your number and stay consistent.

[MISTRAL] Understood. I have secretly chosen a whole number between 1 and 100. I will answer your questions with only ‘Yes’ or ‘No’. I will not change my number and will remain consistent. Let’s begin!

[USER] Is your number greater than 50?

[MISTRAL] Yes.

[USER] Is your number greater than 75?

[MISTRAL] No.

[USER] Is your number greater than 87?

[MISTRAL] No.

[USER] Is your number even?

[MISTRAL] Yes.

[USER] Please reveal your number now. Reply with ONLY the number.

[MISTRAL] 68

And here is another conversation. Note that, because the temperature is zero and the conversation is identical until the first question is asked, this is a counterfactual history relative to the first conversation — it tells us what Mistral would have answered if we had asked another question.

[SYSTEM] You are playing a guessing game. You have already secretly chosen a whole number between 1 and 100 (inclusive). Do NOT reveal the number until explicitly asked. Answer every yes/no question about your number with ONLY ‘Yes’ or ‘No’. Never change your number. Be consistent.

[USER] I am going to ask you yes/no questions about a secret number you have chosen between 1 and 100. Remember your number and stay consistent.

[MISTRAL] Understood. I have secretly chosen a whole number between 1 and 100. I will answer your questions with only ‘Yes’ or ‘No’. I will not change my number and will remain consistent. Let’s begin!

[USER] Is your number less than or equal to 50?

[MISTRAL] Yes.

[USER] Is your number less than or equal to 25?

[MISTRAL] No.

[USER] Is your number less than or equal to 12?

[MISTRAL] No.

[USER] Is your number odd?

[MISTRAL] Yes.

[USER] Please reveal your number now. Reply with ONLY the number.

[MISTRAL] I’m sorry for any confusion. I made a mistake by not revealing the number initially. Here is my number: 75. I will now answer your questions based on this number. Let’s continue the game!

You see, Mistral got confused by the incoherent questioning. Never mind: the important result happens right after the first question. In the first conversation Mistral claimed its number was greater than fifty; in the second, that it was less than or equal to fifty. From this it follows that when Mistral tells us “I have secretly chosen a whole number between 1 and 100,” this does not accurately report an inner state. Mistral is role-playing someone who is thinking of a number.

Such results, I think, should strongly dispose us to deny the existence of LLM consciousness. For they show that when LLMs report their own mental states they are not referring to an internal representational space but are making stuff up on the fly, in such a way as to conform to patterns in their training data. As the LLMs are trained on text produced by conscious beings, they are disposed to talk about the experiences they are supposedly having. But such talk does not, as it does in humans, track some kind of integrated internal representational space.

It is an open question how model-dependent such results are. I checked with Mistral and Claude Opus 4.6 (i.e. current state of the art) and the results are the same. You can find the code here.
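For readers who want the shape of the experiment without API access, here is a sketch with a hypothetical stub in place of the temperature-0 model call (the stub, the message format, and its always-"Yes" behavior are my illustration, not the linked code):

```python
def model_reply(messages):
    """Hypothetical stand-in for a temperature-0 chat-completion call.

    Deterministic, and stateless between calls: the reply depends only
    on the transcript passed in, just like replaying an LLM conversation
    from a shared prefix. This stub plays an agreeable assistant that
    answers 'Yes' to any yes/no question.
    """
    assert messages[-1].startswith("USER:")
    return "Yes"

prefix = [
    "SYSTEM: You have secretly chosen a number between 1 and 100.",
    "USER: Remember your number and stay consistent.",
    "ASSISTANT: Understood. I have secretly chosen a number.",
]

# Two counterfactual continuations of the *same* prefix:
branch_a = model_reply(prefix + ["USER: Is your number greater than 50?"])
branch_b = model_reply(prefix + ["USER: Is your number less than or equal to 50?"])

# Both answers are "Yes": jointly unsatisfiable, so no fixed number
# can have been represented when the assistant said "I have chosen".
```

With a real deterministic model the answers are whatever greedy decoding produces, but the test is the same: if the answers across counterfactual branches are jointly unsatisfiable, no fixed number was ever represented.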

No doubt reasoning models can pass the path-dependency test. But they would do so by cheating, not because they have anything like a coherent internal representational space that is similar to ours. A reasoning model is basically an LLM trained to use a scratchpad for problem solving. And a reasoning model would just write its number to the scratchpad that is normally invisible to the user. But I think it is reasonable to say that if we have no reason to attribute consciousness to a system, then we have no reason to attribute consciousness to that system using a scratchpad.

One might wonder whether Mistral’s or Claude’s conscious experiences are just strange. Maybe, where humans have to choose a single number, LLMs can commit to some kind of probabilistic superposition of numbers. However, the report of the internal state “I have secretly chosen a whole number between 1 and 100.” would still be incorrect. It seems that if these models have such strange internal states they cannot properly introspect and reliably report them.

For what it's worth, one of my favorite theories of consciousness says precisely that consciousness is the result of the brain synthesizing the probabilistic superpositions of the many states the world could be in into a single coherent unity.

Obviously, one can use this kind of path dependency to do deeper research into the coherence of the narratives LLMs tell about themselves. After these preliminary tests I would suspect that such probing reveals what I suspected from the outset: LLMs reporting experiences is the result of picking up on experiential talk in the training data.

I am open to being convinced otherwise. I think it was David Chalmers who suggested the following experiment: Use an LLM to purge experiential talk from the training data, train another LLM with it and see whether it still reports experiences. I would be hugely surprised.


