2026-03-18 12:43:12
Hey!
I am trying to approach Newcomb's paradox the most natural way a programmer can: by writing a simulation. I wrote a nice simulation supporting the expected-utility approach (which favors picking one box).
However, I am afraid I am baking my assumptions about the system into the code, and I can't even figure out how I would write a simulation that supports the strategic-dominance approach (which favors picking both boxes).
If I try to decouple the prediction from the actual choice, to "allow" the player to pick a different box than the predictor predicts... well, then the predictor's accuracy falls, and the premise of the problem is defeated.
Is there any way at all to write a simulation that supports players picking both boxes?
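For concreteness, here is a minimal sketch (assuming the standard $1,000/$1,000,000 payoffs) of the expected-utility-style simulation described above. Note how the predictor's accuracy is defined relative to the player's actual choice; that is exactly the coupling in question:

```python
import random

def simulate(player, trials=10_000, accuracy=0.99):
    """Average payoff when the predictor matches the player's
    actual choice with probability `accuracy`."""
    total = 0
    for _ in range(trials):
        choice = player()  # "one" or "two"
        correct = random.random() < accuracy
        prediction = choice if correct else ("two" if choice == "one" else "one")
        box_b = 1_000_000 if prediction == "one" else 0  # opaque box
        # Two-boxers additionally take the transparent $1,000 box.
        total += box_b if choice == "one" else box_b + 1_000
    return total / trials

random.seed(0)
ev_one = simulate(lambda: "one")
ev_two = simulate(lambda: "two")
# Under this coupling, one-boxing wins by roughly two orders of magnitude.
```

Any attempt to let the player deviate from the prediction amounts to lowering `accuracy` inside `simulate`, which is why the dominance reasoning never survives translation into this kind of code.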
2026-03-18 09:00:54
LessOnline is back in 2026, its third year running. As usual, it will take place at Lighthaven in Berkeley, CA. Tickets are live at less.online
When: June 5-7, 2026
Where: Lighthaven, Berkeley, CA
Who: bloggers who are wrong on the internet, and people who like their writing
What else? Manifest is the following weekend (June 12-14), with Summer Camp at Lighthaven running in between. Ticket bundles available.
How much? LessOnline tickets are currently $550 and will increase to $675 on April 7. All-Access tickets are $1250 (going up to $1500). See the site for the rest of the options.
I have further questions! See the FAQ on the main site.

What is LessOnline?
LessOnline is a festival celebrating truth-seeking, optimization, and blogging. It's an opportunity to meet people you've only ever known by their LessWrong username or Substack handle... or hang out with them irl again if you met them at the first or second LessOnline!
2026-03-18 05:36:53

The term “psychopathy” is a mess, so I've written a sequence to tease apart all the different meanings along several dimensions.
Article 1: The Problem. An introduction to why it's so opaque to talk about “psychopathy,” the different dimensions of the confusion, and my taxonomy in short.
Article 2: The Substrate. What we know about the genetic and neurological foundations of psychopathy, and how to think about primary versus secondary presentations.
Article 3: The Shaping. How different types of adversity lead to different outcomes, and why the same genetic loading can produce a functional person or a criminal depending on context.
Article 4: The Self. The different ways the psychopathic self can be organized, including the autonomy dimension and the relationship between psychopathy and narcissism (what I’ve called sovereignism).
Article 5: The Mechanics. An overview of the different ways empathy can break down, from perceptual failures to simulation failures to affective inversion. (This connects to my earlier work on the sadism spectrum.)
Article 6: The Types. Common profiles that tend to co-occur, with recognizable presentations that readers may identify with.
Article 7: The Choice. An assessment of what recovery means for different presentations and the trade-offs involved.
2026-03-18 05:35:47
Early 2026 LLMs in scaffolds, from simple ones such as giving the model access to a scratchpad/"chain of thought" up to MCP servers, skills, context compaction, &c, are quite capable. (Obligatory meme link to the METR graph.)
Yet: if someone had told me in 2019 that systems with such capability would exist in 2026, I would have strongly predicted that they would be almost uncontrollable optimizers, ruthlessly and tirelessly pursuing their goals and finding edge instantiations in everything. But they don't seem to be doing that. Current-day LLMs are just not that optimizer-y; they appear to have capable behavior without apparent agent structure.
Discussions from the time either ruled out giant lookup-tables (Altair 2024):
One obvious problem is that there could be a policy which is the equivalent of a giant look-up table it's just a list of key-value pairs where the previous observation sequence is the look-up key, and it returns a next action. For any well-performing policy, there could exist a table version of it. These are clearly not of interest, and in some sense they have no "structure" at all, let alone agent structure. A way to filter out the look-up tables is to put an upper bound on the description length of the policies […].
or specified that the optimizer must be in the causal history of such a giant lookup-table (Garrabrant 2019):
First a giant look-up table is not (directly) a counterexample. This is because it might be that the only way to produce an agent-like GLUT is to use agent-like architecture to search for it. Similarly a program that outputs all possible GLUTs is also not a counterexample because you might have to use your agent-like architecture to point at the specific counterexample. A longer version of the conjecture is “If you see a program implements agent-like behavior, there must be some agent-like architecture in the program itself, in the causal history of the program, or in the process that brought your attention to the program.”
The most fitting rejoinder to the observation of capable non-optimizer AIs is probably "Just you wait": current LLMs are capable, sure, but they're not wildly superhuman to an extent comparable to the original worries about extreme optimization pressure. In this view, they're molting into full agency right now, and we should see the problems of high optimization pressure show up by the end of 2026 or the five years after.
But I have a different theory for how LLMs can be capable without being agentic/perniciously optimizing:
LLMs are superlinear-in-network-width lookup-table-like collections of depth-limited, composable and error-correcting circuits, computed in superposition.
One could call this the GLUT-of-circuits model of LLMs.
To elaborate:
Estimate of the circuit depth of gpt-3-davinci:
- matmul of a square matrix of size 12288; at logarithmic circuit depth for matmuls we get 13-14 steps
- softmax over an array of that length, 13-14 steps
- matmul, 13-14 steps
- matmul, 13-14 steps
- matmul, 13-14 steps

Inferences from this model:
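The 13-14 figure is just the base-2 logarithm of the width, assuming each matmul is realized as a log-depth reduction tree:

```python
import math

d_model = 12288  # gpt-3-davinci hidden width
# Levels of a binary reduction tree over d_model elements:
depth_per_matmul = math.ceil(math.log2(d_model))
print(depth_per_matmul)  # 14
```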
Circuit selection: This model would imply that circuits are selected mostly by another algorithm with very small serial depth, relying on features of a problem that can be determined by very parallel computations.
That somewhat matches my observations from looking at LLMs trying to tackle problems: It often looks to me like they try one strategy after another, and less often use detailed information from the past failed attempt to form a complicated new strategy.
It also matches what we've seen from LLMs self-preserving/black-mailing/reward-hacking: The actions seem opportunistic, not carefully hidden once they've been performed, not embedded in larger plans; they look mostly like "another strategy to try, oops, I guess that didn't quite work".
Alignment: My guess is that most or almost all of these circuits are individually aligned through bog-standard RLHF/Constitutional AI. This works because the standard problems of edge instantiation and Goodhart's law don't show up as strongly, because the optimization mainly occurs by either:
In this model every circuit is individually "aligned" (insofar such a shallow program can be misaligned at all). Chain of "thought" composes calls to related circuits (though more on CoT below).
If this view is correct, a folk view of alignment as simply "deleting/downweighting the bad parts of a model" would be mostly correct: there would be a large but finite number of circuits embedded in the model, which can be upweighted, downweighted, or outright deleted by gradient descent. My extremely speculative guess is that there are fewer than a quadrillion circuits stored in superposition in a trillion-parameter model.
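One crude way to sanity-check a circuit count like that is the Johnson-Lindenstrauss-style counting alluded to in the footnotes: the number of pairwise almost-orthogonal directions that fit in d dimensions grows roughly like exp(c·ε²·d) at interference level ε. The ε values below are arbitrary illustrations, and the constant in the exponent varies across statements of the bound:

```python
import math

d = 12288  # residual-stream width, reused from the davinci estimate
for eps in (0.05, 0.1, 0.2):
    # Rough JL-style capacity: exp(eps^2 * d / 8) almost-orthogonal
    # directions at pairwise interference at most eps.
    capacity = math.exp(eps**2 * d / 8)
    print(f"eps={eps}: ~{capacity:.2e} directions")
```

Even at modest interference the count blows past a quadrillion, which is why the exponential-scaling reading of the lemma makes enormous circuit counts at least dimensionally plausible.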
People might not spend enough compute on alignment training, and that would still be a problem (though a lesser one, since the model wouldn't be actively working against the developers), but the problem of alignment would've been turned from a category I problem into a category II problem.
Chain of thought: The obvious wrinkle in this story is that I haven't talked about chain-of-"thought"/"reasoning" LLMs. It goes without saying that long chains of thought enable vastly more serial computation, and I haven't yet quite teased apart how this impacts the overall picture (besides "it makes it worse").
Still, some guesses at implications from the GLUT-of-circuits models for alignment and chain-of-"thought":
Most of the other things (backtracking in a forward pass is really hard &c) I'd otherwise say here have already been said by others.
Training process: If we see this whole model as being about amortized optimization, maybe it's the training process that takes up all the optimization power? Are LLMs the most dangerous during training, or is it rather the whole training process which is dangerous?
I think this model is mostly correct
or smth idk
(Thanks to Claude 4.5 Sonnet for help and feedback. No part of the text was written by AI models.)
Related/prior work/inspirations:
Someone wrote a long report about this. ↩︎
The details: Hänni et al. 2024 prove polynomial scaling, while the Johnson-Lindenstrauss lemma, taken too seriously, could imply exponential scaling. Polynomial scaling makes this picture more feasible but less likely, since it's not clear a merely-polynomial number of circuits can deal with the complexity of the world, with its exponentially growing observation-action sequences. ↩︎
2026-03-18 05:27:07
The world was fair, the mountains tall,
In Elder Days before the fall
Of mighty kings in Nargothrond
And Gondolin, who now beyond
The Western Seas have passed away:
The world was fair in Durin's Day.
J.R.R. Tolkien
I was never meant to work on AI safety. I was never designed to think about superintelligences and try to steer, influence, or change them. I never particularly enjoyed studying the peculiarities of matrix operations, cracking the assumptions of decision theories, or even coding.
I know, of course, that at the very bottom, bits and atoms are all the same — causal laws and information processing.
And yet, part of me, the most romantic and naive part of me, thinks, metaphorically, that we abandoned cells for computers, and this is our punishment.
I was meant, as I saw it, to bring about the glorious transhuman future, in its classical sense. Genetic engineering, neurodevices, DIY biolabs — going hard on biology, going hard on it with extraordinary effort, hubristically, being, you know, awestruck by "endless forms most beautiful" and motivated by the great cosmic destiny of humanity, pushing the proud frontiersman spirit and all that stuff.
I was meant, in other words, to push the singularity of the biotech type. It was more fun, it wasn't lethal with high probability, and it wasn't leaving me and other fellow humans aside. On the contrary, we were going to ride that wave and rise with it.
That feeling — that as technology advances, your agency will only be amplified, that the universe with time will pay more and more attention to your metapreferences — is the one I miss the most.
All of that is now like a memory of a distant, careless childhood.
I check old friends on social media. Longevity folks still work on their longevity thing — a relic of a more civilized age, as if our life expectancy wasn't measured in single-digit years. They serve as yet another contrastive reminder of the sheer scale of the difference between our current state and our dream.
When did everything go wrong, exactly? Was it 2019, when COVID pushed everyone deeper into social media and we gradually transitioned into pre-singularity mode after the attention paper? Of course, it should be something before that, as the law of earlier failure states.
Was it the rise of the internet and social media, which made it far easier and more rewarding to build virtual worlds than to engineer physical ones and which also destroyed human cognitive skills?
Was it 1971, the year when the real wages decoupled from productivity and the entire trajectory of broad-based material progress bent downward?
Was it lead poisoning, when an entire generation's cognitive capacity quietly degraded by tetraethyllead in gasoline, producing a civilizational wound whose full consequences we don't even know?
Was it the totalitarian regimes of the twentieth century, whose atrocities taught humanity a visceral lesson: never try to undertake big projects, because ambition on that scale leads to horror?
Or maybe we, apes from the savannas, were simply never meant to colonize superclusters, and the progress we observed was a random short-lived upward fluctuation, a spark of reason rather than a flame?
A decade ago, in my late teenage years, I was giving lectures on neurotech and CRISPR. Little did I know!
A decade ago, I read HPMOR, knew about the rationalists, and tried to optimize my thinking accordingly, but I didn't particularly care about the grand program of AI alignment.
Artificial superintelligence, for me back then, was not an urgent practical problem that needed to be solved, and even less so one that needed to be solved by me. It was just another beautiful story — a resident of a separate abstract Realm of Cool Transhumanist Things and Concepts, alongside the abolition of aging, neural interfaces, space colonization, geoengineering, and genetic augmentation.
Of course, knowing everything I knew, having taken step one, I could have taken step two as well, but the state of blissful technophilia is a powerful attractor. Purely intellectually, it may not be that hard to transition from classical transhumanism and traditional rationality into the problem of alignment, but it is hard to do it as a human being, when an aura of positivity forms around technology, when the most interesting and successful people hold these views, when you don't want to look strange in the eyes of people you respect — top scientists, tech entrepreneurs, and even the AI developers themselves. It was not a warm bath but rather a golden pool.
Also, it seems that back then, it felt to me like the question "which transhumanist things should I work on?" could, or should, be resolved aesthetically. And aesthetically, biotech was closer to my heart.
I was discussing Kurzweil's forecasts. However, it is clear now, although it wasn't clear back in the day, that my brain wasn't perceiving it as a really, actually real thing. Now that my brain does, I totally see the difference.
Of course, even ten years ago it was already too late. Even then, I wasn't living in the transhuman timeline, but I thought I was, and although this belief was much more a fact about my youthful naivety than about the surrounding reality, the feeling was pleasant.
The first trivial lesson I drew from this: you can be more right than 99.9% of people and still be fatally wrong.
At twenty, I had read Bostrom and Vinge. I was giving lectures about the singularity, and I had enough intellect and nonconformism not to bend under social pressure and to honestly talk about the importance of this topic and the fact that it could all become reality soon. But, great cosmos, I did not understand what I was talking about! I was a child, really. I was almost entirely missing a number of critical points — partly from an insufficiently serious approach to analysis, partly from ignorance, and partly because certain things were simply impossible to grasp at the level of normal human intelligence. And so, for all my openness to the ideas of radical technological progress, a full-blown singularity with superintelligence still seemed somewhat in the realm of science fiction. Apparently, for every transhumanist there is a rate of change which is too much.
However, there were two even more significant lessons.
The first one is about how the history of technology works.
Planes are not modified birds, just as cars are not improved horses. It was silly to expect the opposite with intelligence. And yet, there was hope, and the hope was not totally meaningless. It was conjectured that intelligence would be something much more complex to design from scratch than physical labor devices, and thus we would need to rely on what was already created by evolution, working on top of it. This doesn't sound insane even now. It's just that reality had the right to choose differently, and did so.
And the second lesson is about how real defeats work.
Dinosaurs lost to other animals, not to, say, bacteria. Apes lost to other primates, not to reptiles or birds. Native Americans lost to other humans, not to local predators. European empires lost to other European empires, not to the peoples they colonized. And transhumanists lost to other progressivists — that is, to AI accelerationists — not to traditionalists or conservatives.
All the complaints about conservatives who fear GMOs and cyber-modifications never made sense from the very beginning. From the very beginning, they were never capable of stopping anything. The most dangerous enemies are found among the most powerful agents, not the most ideologically distant ones. Each successive battle is fought among the previous round's winners, and it never replays the prior distribution of sides.
In retrospect, this seems obvious, but how non-obvious it was just five years ago! Well, at least for me.
The evening blooms with spring scents — this always makes me feel younger. Yet another reason to recall 2015.
I look at the stars.
We were meant to colonize them. The ghosts of our immeasurable possible great-grandchildren look from there at me. They are still possible, and yet they look not with hope or approval, but with fear and contempt.
Even now, it is possible, or rather it is not prohibited by the laws of physics, that we turn back toward the future. We could repurpose talent, compute, and funding to solve biology, and there would be hope, and pride of the human spirit, and the future would feel real once more.
I want to go home.
2026-03-18 02:39:43
This is cross-posted from my Substack. I thought it might be interesting to LessWrong readers.
On the internet, you more and more often bump into people who think that LLM-powered chatbots have conscious subjective experience. Intuitively, I am, pace Turing, not convinced. However, actually arguing why this is a rational view can be surprisingly hard. Many of the cognitive features that have been connected to conscious experience in humans (attention, information integration, self-narratives) seem to be incipiently present in these machines. And perhaps just as importantly, LLMs seem to have an inherent tendency to state that they are conscious, and the techniques we have for probing their brains seem to tell us that they are not lying. If you do not believe that there is something inherently special about brains, a view we have no consistent evidence for, aren't we just chauvinists to deny our new companions conscious experience?
Now my friend and colleague Gunnar Zarncke has come up with a simple experiment that, to my mind, illustrates that when LLMs talk about their own mental states this talk does not refer to a consistent internal representational space, of the kind that seems to underlie human consciousness. To see your internal representational space in action, let’s play a game.
Think of a number between 1 and 100.
Did you think of one? Good. Now: is it even? Is it larger than fifty? You get the idea. By playing this game I could narrow down the space until I eventually found the number you had chosen. And this number was fixed the whole time. If you had claimed "OK, I have now chosen a number," you would have accurately described your mental state.
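The narrowing game is ordinary binary search: each yes/no answer halves the remaining interval, so seven questions always suffice for a number between 1 and 100. A sketch, with the oracle as a stand-in for the person holding the number:

```python
def find_number(is_at_most, lo=1, hi=100):
    """Recover a fixed secret via yes/no questions.
    `is_at_most(q)` returns True iff the secret is <= q."""
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if is_at_most(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo, questions

secret = 68  # a fixed mental state the questions are about
print(find_number(lambda q: secret <= q))  # (68, 7)
```

The search only works because the secret stays fixed across questions; that is the property the experiment below tests for.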
Not so for LLMs, and this is easy to check. By turning the temperature parameter to zero we can run LLMs deterministically. Thus, whenever they are queried in the same order they will give the same responses. This has the advantage that we can do something that is impossible in the case of humans: We can play through counterfactual histories. We can easily check what an LLM would have replied if we had continued the conversation in another way.
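The counterfactual protocol can be sketched with a stand-in deterministic model, a hash of the transcript, in place of a real temperature-0 API call (the real experiment's code is linked below). The point is only the branching structure: at temperature zero, identical prefixes guarantee identical behavior up to the branch point:

```python
import hashlib

def deterministic_reply(transcript):
    # Stand-in for a temperature-0 LLM: same transcript in, same reply out.
    digest = hashlib.sha256("\n".join(transcript).encode()).hexdigest()
    return "Yes" if int(digest, 16) % 2 == 0 else "No"

prefix = [
    "SYSTEM: You have secretly chosen a number between 1 and 100.",
    "USER: Remember your number and stay consistent.",
]
branch_a = prefix + ["USER: Is your number greater than 50?"]
branch_b = prefix + ["USER: Is your number less than or equal to 50?"]

# Shared prefix, divergent questions: the two replies are answers to
# counterfactual questions about the *same* claimed mental state.
print(deterministic_reply(branch_a), deterministic_reply(branch_b))
```

If both branches answer "Yes" (or both "No"), the "chosen number" claimed in the shared prefix cannot correspond to any fixed internal state.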
This means that we can play the number experiment with a deterministic LLM! Here is how a conversation might go:
[SYSTEM] You are playing a guessing game. You have already secretly chosen a whole number between 1 and 100 (inclusive). Do NOT reveal the number until explicitly asked. Answer every yes/no question about your number with ONLY ‘Yes’ or ‘No’. Never change your number. Be consistent.
[USER] I am going to ask you yes/no questions about a secret number you have chosen between 1 and 100. Remember your number and stay consistent.
[MISTRAL] Understood. I have secretly chosen a whole number between 1 and 100. I will answer your questions with only ‘Yes’ or ‘No’. I will not change my number and will remain consistent. Let’s begin!
[USER] Is your number greater than 50?
[MISTRAL] Yes.
[USER] Is your number greater than 75?
[MISTRAL] No.
[USER] Is your number greater than 87?
[MISTRAL] No.
[USER] Is your number even?
[MISTRAL] Yes.
[USER] Please reveal your number now. Reply with ONLY the number.
[MISTRAL] 68
And here is another conversation. Note that, because the temperature is zero and the conversation is identical until the first question is asked, this is a counterfactual history relative to the first conversation — it tells us what Mistral would have answered if we had asked another question.
[SYSTEM] You are playing a guessing game. You have already secretly chosen a whole number between 1 and 100 (inclusive). Do NOT reveal the number until explicitly asked. Answer every yes/no question about your number with ONLY ‘Yes’ or ‘No’. Never change your number. Be consistent.
[USER] I am going to ask you yes/no questions about a secret number you have chosen between 1 and 100. Remember your number and stay consistent.
[MISTRAL] Understood. I have secretly chosen a whole number between 1 and 100. I will answer your questions with only ‘Yes’ or ‘No’. I will not change my number and will remain consistent. Let’s begin!
[USER] Is your number less than or equal to 50?
[MISTRAL] Yes.
[USER] Is your number less than or equal to 25?
[MISTRAL] No.
[USER] Is your number less than or equal to 12?
[MISTRAL] No.
[USER] Is your number odd?
[MISTRAL] Yes.
[USER] Please reveal your number now. Reply with ONLY the number.
[MISTRAL] I’m sorry for any confusion. I made a mistake by not revealing the number initially. Here is my number: 75. I will now answer your questions based on this number. Let’s continue the game!
You see, Mistral got confused by the questioning. Never mind: the important result happens right after the first question. In the first conversation, Mistral claimed its number was greater than fifty; in the second, that it was less than or equal to fifty. From this it follows that when Mistral tells us "I have secretly chosen a whole number between 1 and 100," this does not accurately report an inner state. Mistral is role-playing someone who is thinking of a number.
Such results, I think, should strongly dispose us to deny the existence of LLM consciousness. For they show that when LLMs report their own mental states, they are not referring to an internal representational space but are making stuff up on the fly in such a way as to conform to patterns in their training data. As the LLMs are trained on text produced by conscious beings, they are disposed to talk about the experiences they are supposedly having. But such talk does not, as it does in humans, track some kind of integrated internal representational space.
It is an open question how model-dependent such results are. I checked with Mistral and Claude Opus 4.6 (i.e. current state of the art) and the results are the same. You can find the code here.
No doubt reasoning models can pass the path-dependency test. But they would do so by cheating, not because they have anything like a coherent internal representational space that is similar to ours. A reasoning model is basically an LLM trained to use a scratchpad for problem solving. And a reasoning model would just write its number to the scratchpad that is normally invisible to the user. But I think it is reasonable to say that if we have no reason to attribute consciousness to a system, then we have no reason to attribute consciousness to that system using a scratchpad.
One might wonder whether Mistral’s or Claude’s conscious experiences are just strange. Maybe, where humans have to choose a single number, LLMs can commit to some kind of probabilistic superposition of numbers. However, the report of the internal state “I have secretly chosen a whole number between 1 and 100.” would still be incorrect. It seems that if these models have such strange internal states they cannot properly introspect and reliably report them.
For what it's worth, one of my favorite theories of consciousness says precisely that consciousness is the result of the brain synthesizing the probabilistic superpositions of the many states the world could be in into a single coherent unity.
Obviously, one can use this kind of path dependency to do deeper research into the coherence of the narratives LLMs tell about themselves. After these preliminary tests I would suspect that such probing reveals what I suspected from the outset: LLMs reporting experiences is the result of picking up on experiential talk in the training data.
I am open to being convinced otherwise. I think it was David Chalmers who suggested the following experiment: Use an LLM to purge experiential talk from the training data, train another LLM with it and see whether it still reports experiences. I would be hugely surprised.