2026-04-13 00:16:57
At a recent rationalist unconference, multiple people recommended Carolyn Elliott's Existential Kink to me, one of them even postulating that it would be useful for me specifically. So I was really surprised to open up a rather generic self-help book, with the author gloating about her success and generally just advertising the book for the first chapter. Professional advice-givers tend, in my experience, to reach only the audience that self-selects for a certain self-help format. The word kink in the title should already have rung alarm bells had I been awake; it's just the sort of correctly-toned provocation that makes such people pick up these books [1].
As required for any self-respecting book on anything even slightly resembling philosophy or life advice, an ancient story has to be invoked quite early. Fitting the nature of the book, the prologue begins with the author's retelling of the Rape of Persephone. It briefly covers the story [2], then dives straight into astrology and, somehow even worse, metaphorical alchemy. Suffice it to say that I haven't ingested such utter balderdash since reading well cherry-picked GPT-2 outputs. For instance, the word "magic" is used for your own thoughts affecting anything, especially yourself.
While I appreciate the condescending tone that the book sometimes reaches for, fondly reminding me of Sadly, Porn, this particular passage in the intro was almost enough to make me stop reading [3]:
I feel this sense of shameful wrongness at times. Maybe you don’t feel it at all. Maybe you’re free—in which case, kudos! You are very welcome to close this book and go about your enlightened life, my friend.
Fortunately, I had already decided to read the book. Sadly, the condescension never lasts long and quickly changes into what I'd describe as a fake-excited [4], authoritative tone. It also continues gloating and promising good outcomes. Every paragraph of actual advice seems to be surrounded by at least three made of fluff. It also keeps inventing fancy words or loaning them from other woo fandoms, including psychoanalysis and Buddhism, in order to sound more sophisticated. Ok ok, I'll attempt to get over the writing flavor and focus on the actual content from now on... after this one example [5]:
Even the most rigorous scientific experiment can only be experienced subjectively. There’s simply no world outside of our subjective awareness.
And the point is? Please? Get to it some day? Solipsism was a funny joke fifteen years ago.
Two pages later, finally, there's the statement I've been waiting for:
Okay, so that’s some far-out metaphysical stuff, what the hell does that mean, in practical terms?
If you expected to find something to address the question above, you'll be sorely disappointed.
The book consists of a couple of lessons to introduce the reader to the core ideas, including the basic meditation technique. Then it lists some anecdotes on how it has worked with some of the author's clients. After this there are 13 exercises for experiencing and experimenting with the methodology. And then more anecdotes. The book ends with a Q&A section, which actually addresses some of my concerns.
One of the core principles in the book goes like this:
[...] contrary to some airy Law of Attraction notions, we rarely get what we consciously want (unless we do the kind of deep solve work addressed in this book), but we always get what we unconsciously want.
I've had the exact opposite experience. I seem to eventually end up getting everything that I consciously want, but still end up feeling like something's missing. Maybe I'm just interpreting this wrong? That said, I feel I'm pretty well on the same page with myself about things that I want, compared to others around me [6].
To engage on a metaphysical level: there's an interesting theory, which I first got from reading Yudkowsky's High Challenge: perhaps I'm currently living in a simulation with the optimal difficulty level for my own enjoyment. It feels true quite often and is, of course, completely unfalsifiable. But it's one of my nicer mental frames to look at difficulties from, and it resonates quite well with the book's message.
You can integrate and evolve those previously unconscious desires of yours for a partner who cheats, mopes, drinks, fails to wash dishes, or believes in Flat Earth theories—whatever your particular kink amongst the thousands possible happens to be.
If your partner cheats on you, that's exactly what you enjoy? If you break up with them because they cheated on you, then you wanted to be a person who has broken up with a cheater?
Yeah. Super useful. This is totally the key to fulfilling relationships. Oh wait:
At such a point of recognition and integration, you either lose all interest in the present relationship and end it gracefully, freeing yourself to go find a better one, or you find that you, yourself, your partner, and the relationship as a whole, evolve in a fascinating way.
Ok so... the model contradicts itself. Even better.
The core of the book consists of some meditative techniques. Perhaps they could be useful. I'll try the basic meditation practice with one of my own problems to see if it works. I'll need to pick something I don't like. Something where "having is evidence of wanting" rings false on first intuition. Maybe this one...
I'm somewhat overweight. I don't like it, for both aesthetic and instrumental reasons. It's quite easy to point out the supposed reasons why I'm like this. Firstly, I like food, and through some long periods of depression it was my primary source of enjoyment, along with videogames, which surely didn't help much either. Secondly, I have a hard time differentiating between anxiety and hunger, and I get stressed easily.
I don't think there's any perverse self-sabotage going on here, just conflicting wants and a compromise that follows the path of least resistance. Sure, looking like an almost-rotting cave troll can be a nice source of self-deprecating humor, but that's of limited use. Perhaps I have a secret desire to feel terrible all the time? Nope; I think Groon the Walker got it right in Erogamer: this is a blight upon the earth, and getting rid of it would be almost purely positive. "You're not really trying, so it doesn't work for you!" Perhaps you should attempt running through the barrier between platforms nine and ten at King's Cross station?
Perhaps we could look a bit deeper? The overeating is self-sabotage that I do, because...? Maybe I use it to uphold my class clown personality, which owned that bit early on? Or maybe I use it to appease the expectations of my childhood bullies, none of whom I've seen in years? Perhaps I like people having a negative halo effect on me? I don't think I can find the theories far-fetched enough to fit here. No, I self-sabotage because my evolution-misguided brain wants more calories.
A perceptive author might notice that avoiding physical exercise might actually count here. I hate receiving praise for anything healthy [7], and this was a big part of why I was really anxious about this for a long time. However, I'm again confused about why I'm supposed to enjoy that instead of getting over it, as I mostly have.
Perhaps the author just has a Meta-Existential Kink, which makes them want to think that everything bad happens because they subconsciously want bad things to happen to them?
For some other problems, the answers are much cleaner.
But if we’re talking about endemic human problems like war or racism or child abuse, odds are it’s more of a collective unconscious issue. So war and abuse and all the challenging stuff that transpires in the world result from millennia of unintegrated, repressed, denied shadow desires of individuals conglomerated into collective forces.
My first thought is that perhaps this has some interesting connections to mistake theory?
My second thought is that this is easily refutable [8]. Take cancer, for instance. I fortunately don't have cancer [9]. If I get one, my reaction will not and should not be "this is exactly what I wanted". My take is "fuck cancer", end of discussion. I'll also accept "it is what it is" and even "at least now I don't have to worry about many of those other things" if you can really deeply believe that, and, grudgingly, "you play with the cards that you're dealt". If (mentally) masturbating to the idea appeals to you, feel free to, but that's not my thing.
The problems the book describes solving seem to be almost purely social, consisting of shame and guilt. The solutions in the anecdotes seem to just magically appear from outside once the main character decides to absolve themselves. Being ok with the situation itself isn't enough for any of them. They still need the world to accommodate them, often in deus ex machina fashion. This seems to go directly against the primary claim of the book, learning to enjoy the misery. There's a story about how Louisa learns to be content with her old car. And then buys a new one. In another story, June tries to accept that it's ok to miss a flight, then realizes that she'd miss her mother, and literally manifests boarding passes with wishful thinking [10].
The people in the anecdotes are also all women. I find this complementary to my interpretation of Jordan Peterson's take on gender roles, namely that among losers, men lack a spine and women lack agency itself. I do not endorse this, which is why it's rather interesting to see it here, as a literary trope if nothing else. [11]
Then again, why would a self-help book include stories where the model doesn't work? Disclaimers? Statistics? What would be the point?
In general, the book is very femininity-coded, and that might be part of why I find it so difficult to identify with. I don't relax with baths, chamomile tea, and crying. I relax with sauna, violence, and engineering. I'm not part of the intended audience, as I don't like self-help books that much anyway. Also, in the Q&A there's a warning that depression [12] or asexuality [13] likely makes the book's methods ineffective.
I try the next exercise:
Close your eyes for a moment and feel into your current state.
Are you holding any resentments? Judgments of yourself or other people? Worries? Criticisms about the state of the world? Complaints about your body, your work, your life?
And the answer is simply no.
I made an attempt to try most of these exercises. The results were not good, but then again I've always had a really hard time easing into stuff like this, and I find it likely that this is my personal skill issue [14]. Fortunately, exercise #13 contains instructions for approximately the same problem. Unfortunately it seems even more fake than everything I've seen before, so I'll just quote the primary segment here:
Here’s how it works. Try leveraging your dread by saying this to yourself:
“Oh no, if only there was something I could do to stop the inevitable arrival of this magnificent new partner in my life. This is so awful. [...] terrifying fate of being completely fulfilled in love.” Ahhhhh, can you feel the honesty there? Refreshing, isn’t it? Because there is some shadowy part of you that’s disgusted and miserable at the idea of fresh new love, isn’t there?
No actually I cannot see anything resembling honesty here and I doubt anyone else can either [15].
Irrational levels of self-confidence are certainly useful. This might be one path there.
Bootstrapping feedback loops is sometimes easier with a little bit of self-deception. Sustaining them indefinitely shouldn't require it. Perhaps the author already thought of this and realized that anyone who fixes their problems like this eventually confronts the truth? I don't think [16] so. In any case, there's no need to get fully delusional.
And sure you can be "turned-on" about anything all the time.
But just like with regular old arrogance, that sometimes leads to results that you do not endorse. Perhaps permanent physical injuries or prison time are also enjoyable with the right mindset, but neither helps you achieve anything in life. Perhaps you can learn to be turned on about being a loser in all senses of the word. I have values higher than my own happiness. I don't want to feel permanent fulfillment. I'm content with not being content. I want more. I have no goals beyond the joy of the journey. Quite contradictory, I know.
Why am I writing this post in such a defensive tone? No idea. Really? I do have an idea. The book would say that I'm doing it to protect my sense of identity. Correct! Next accusation, please.
My understanding is that the book takes a Jungian angle on this. Sadly, Porn, which I mostly contrast it with in my head, adopts the Lacanian perspective instead. Both books take a weirdly sexual primary lens on the subject and hide their points behind layers of obscurity to make you think about it all. EK claims that it's ok to be terrible and that it's there to help you. SP simply shouts that you're terrible, you're a disappointment, and maybe you ought to do something about it if you weren't such an unagentic disappointment. I vastly prefer the latter.
It's one of the worst books I've ever read. That said, I did read it. It provoked some thoughts. It definitely wasn't the most useless book I've read.
It might just be that I'm not that much into kink, or submission, or masochism. Or sex. Or astrology, spiritualism, solipsism, empowerment, soft-fuzzy-feelings, or woo. Or fancy words. I'm not a "nasty freaky thing", to the best of my knowledge, in any of the senses Elliott describes, nor do I want to be one. I rarely feel particularly guilty. Shame sometimes limits my actions more than I'd wish on reflection, but even that seems mostly reasonable and useful.
Perhaps focusing more on the sadistic rather than the masochistic perspective would have been more relatable. It would also have resonated better with the Nietzschean master morality that the book seems to somewhat half-heartedly endorse. Or maybe having gotten into Lacanian psychoanalysis has just filled the slot where a Jungian model of mind would fit.
The book confuses cause and effect; learning to think in a particular way doesn't mean you were always like that. It speaks in absolutes and defends doing so to an absurd extent. It states things that seem obviously incorrect and seems content with that. It never explicitly owns any of this, which I both like and don't like.
An older version of me would have thought that the people helped by this kind of thing are horribly broken in some incomprehensibly twisted ways. Nowadays, I'm of the opinion that we're all broken and what mostly matters is what you do with that. So, if this works for you, go for it. Some of the stuff described would probably work for me, were I not feeling so disdainful of it. The reverse psychology affirmations, at least, sound genuinely useful.
I also appreciate the subtle Nietzsche references, at least. Like this one:
All nonhumble reactions to the human, all-too-human thirst for power have the effect of warping that natural, beautiful drive into numbness that steamrolls over other people instead of inspiring and uplifting them like genuine, epic power can.
Of course the book also says that you're literally Hitler if you think that your desire for power is what makes you evil.
Perhaps it's just all outside my Overton window? Is my aversion of woo (and sex) just social group membership signaling? Who knows. It's still who I am. Woo feels silly. It's for people who cannot take joy in the merely real due to some hangup. Likely I have the opposite hangup. We can both feel smug at having a superior viewpoint, nice [17].
I'm no stranger to silver linings. I also sometimes make things awful for a while just to keep them more interesting.
Perhaps I had already internalized the core lessons from other sources, so there wasn't that much novelty in there? Or perhaps I didn't get it at all. I'm also really good at inventing intellectual (and thus incorrect) explanations on why I do or want things.
I can extract some of the core lessons from the text. I'm not sure that's actually useful. As EK consistently demonstrates, you can interpret any text however you want and produce whatever lessons you feel like producing. For instance, seen through a rationalist lens, the text contains themes like Yudkowskian heroic responsibility and "but first, losing must be thinkable". From another lens you could interpret it as talking about moral nihilism combined with Nietzschean master morality.
Other lessons the book completely inverts, primarily about enduring pain. Pain has a purpose: it engraves "this was a mistake" in you. This is a valuable tool. Yes, sometimes we overdo it. The book claims we always overdo it. It is wrong. When you touch a hot stove, the impulse to pull away your hand is useful. If you start masturbating to the pain instead, your hand will be less useful tomorrow.
The author has nothing to protect and it shows. Of course feeling guilty or humiliated is useless if it's about your own insecurities. But if you have, say, children to feed, then feeling guilty for failing to do so is what guilt is for. Is it always productive? No. But it's there for a reason [18].
They've found a useful tool and then jumped to thinking it solves all the problems. This is not wisdom [19]. You can solve computer problems with a hammer too, you just won't have a computer afterwards. The author suppresses their agency to endure the pain. That's a valid strategy. That's also a tragedy.
Not every reason is an excuse, even though most of them might be.
Instead of this book, I'd recommend books that do not force the self-help format. The Elephant in the Brain, or perhaps Sadly, Porn, provide far more accurate [20] and entertaining [20:1] commentary than this one. Or if you want fiction instead, try Erogamer, although fair warning, it's a bit slow. These will not be easy, motivational, authoritative books. You'll have to do your own thinking. That's the kind of pain I enjoy.
For instance, The Subtle Art of Not Giving A F*uck by Mark Manson fits the same pattern. ↩︎
I recommend reading the actual story somewhere else and comparing for yourself what's missing. For instance, Pluto (Hades) is an uncle of Persephone. This kind of thing was rather typical among the Greek gods, so perhaps it's an understandable omission. This paper contains interesting analysis of the text, but is largely irrelevant here, as the story is just there to invoke the ancient myth trope and is discarded quickly. ↩︎
Read: it was a good provocation. ↩︎
My excitement-faking detector is broken/oversensitive. Known issue. ↩︎
Unlike Elliott, who tries to limit their whining to the opening section, I simply cannot. ↩︎
I have no idea if that's actually true, but I feel like that. ↩︎
This would require another post to explain, especially since I don't understand it too well myself. ↩︎
Read: Only a delusional loser could actually write this and believe it. Or perhaps it's just a brilliant ragebait? ↩︎
As far as I know. ↩︎
Confirmation bias says hello! ↩︎
Oh no, another misogyny amplifier, now I'll need to spend some time reading flat earth stuff or incel forums to keep my misanthropy in balance. ↩︎
Of course, the book's answer is therapy and psychedelics. ↩︎
It doesn't even consider the possibility of not feeling pleasurable sexual sensations through any lens other than trauma, which would also explain a lot. ↩︎
Naturally I just think I'm a better person because of this, for some obscure reasons. ↩︎
Of course I don't actually doubt that, the space of human minds is vast beyond my imagination. ↩︎
In both senses of the word. ↩︎
Woo's a mental crutch, losers! ↩︎
Mr. Chesterton says hello! ↩︎
I understand that sometimes, when explaining a model, it makes sense to discard nuance for a while. This doesn't mean you should say that the nuance doesn't exist. ↩︎
2026-04-13 00:07:50
People are rushing to build bigger and bigger single cell foundation models (trained on RNA sequencing data), but in my view we have not extracted even a small fraction of the knowledge and capabilities that already exist inside the models we have today.
To explain what I mean, I want to argue three things in this post, and then show the empirical work behind them.
Thesis 1: Biological foundation models are not like LLMs, and the field's habit of evaluating them the same way is causing us to systematically underestimate what they contain. When you interact with GPT, the surface-level outputs (the text it generates) are a fairly good proxy for the model's capabilities. You can read what it writes and form a reasonable opinion. Biological foundation models are fundamentally different in this respect. A model like Geneformer or scGPT takes a cell's gene expression profile and produces embeddings, predictions of masked genes, or cell type classifications. These surface-level outputs are only a small sliver of what the model is doing internally. The model has been trained on tens of millions of cells, and the representations it has built to solve its training objective contain compressed biological knowledge that never directly appears in any output you can look at. Evaluating these models by their benchmark performance on cell type annotation or perturbation prediction is like evaluating a human scientist by asking them to fill in blanks on a multiple-choice exam.
Thesis 2: People keep calling biological foundation models "virtual cells," but this is a label that is implied rather than tested or validated. The term gets used in grant applications, press releases, and even some papers, as though it were an established fact that these models have internalized a working simulation of cellular biology. Maybe they have. Or maybe they have learned sophisticated statistical regularities that look like biology on the surface but dissolve under closer inspection. My work shows these models are, in a meaningful sense, models of the cell, but that is an empirical claim that needs empirical treatment.
Thesis 3: The right tools already exist, and they come from the AI safety community's work on mechanistic interpretability. Sparse autoencoders (SAEs), causal circuit tracing, feature ablation, activation patching: these methods were developed to understand language models, largely motivated by alignment concerns. It turns out they are extraordinarily well-suited to biological foundation models, and for a good reason: in language models, when you discover a circuit, you often lack ground truth about whether the circuit is "correct" in any deep sense, because there is no objective external reality that the model's internal computations are supposed to correspond to. In biological foundation models, you have decades of molecular biology, curated pathway databases, genome-scale perturbation screens, and well-characterized regulatory networks to validate against. Biology gives you the ground truth that language lacks. This makes biological FMs arguably the best (real) testbed for mechanistic interpretability methods that currently exists.
What follows is the story of three papers I recently produced, each building on the previous one, in which I applied the SAE-based interpretability toolkit to the two until-recently leading single-cell foundation models (Geneformer V2-316M and scGPT whole-human) and progressively mapped what they know, how they compute, and where their knowledge runs out.
The first question was very simple: what is inside these models?
Neural networks encode information in superposition. This is well-established in the interpretability literature for language models, but nobody had systematically demonstrated it for biological foundation models or attempted to resolve it.
I trained TopK sparse autoencoders on the residual stream activations of every layer of Geneformer V2-316M (18 layers, d=1152) and scGPT whole-human (12 layers, d=512). The SAEs decompose the dense, superimposed activations into sparse, interpretable features, each of which (ideally) corresponds to a single biological concept. The result was a pair of feature atlases: 82,525 features for Geneformer, 24,527 for scGPT, totaling over 107,000 features across 30 layers.
The superposition is massive. 99.8% of the features recovered by the SAEs are invisible to standard linear methods like SVD, meaning that if you tried to understand these models using PCA or similar approaches, you would be looking at 0.2% of the representational structure. This alone should give pause to anyone who thinks they understand what these models are doing based on standard dimensionality reduction.
The features are biologically rich. Systematic annotation against five major databases (Gene Ontology, KEGG, Reactome, STRING, and TRRUST) revealed that 29 to 59% of features map to known biological concepts, with an interesting U-shaped profile across layers: high annotation rates in early layers (capturing basic pathway membership), declining in middle layers (where the model appears to build more abstract, less easily labeled representations), and rising again in late layers (where it reconstructs output-relevant biological categories). The features also organize into co-activation modules (141 in Geneformer, 76 in scGPT), exhibit causal specificity (when you ablate one feature, the downstream effects are concentrated on specific output genes rather than diffusing broadly, with a median specificity of 2.36x), and form cross-layer information highways connecting 63 to 99.8% of features into functional pipelines.
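For readers who want to see the mechanics, here is a minimal sketch of the TopK SAE setup in PyTorch. The dictionary size, k, and the toy training loop are placeholders for illustration, not the configuration used in the papers; real training would run on residual-stream activations collected from the foundation model rather than on random tensors.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder for residual-stream activations."""
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Dense pre-activations, then keep only the k largest per sample.
        pre = torch.relu(self.encoder(x))
        topk = torch.topk(pre, self.k, dim=-1)
        sparse = torch.zeros_like(pre)
        sparse.scatter_(-1, topk.indices, topk.values)
        return sparse

    def forward(self, x: torch.Tensor):
        features = self.encode(x)
        return self.decoder(features), features

# Toy loop on random "activations"; sizes roughly echo Geneformer's d=1152.
sae = TopKSAE(d_model=1152, d_dict=8 * 1152, k=64)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for _ in range(10):
    acts = torch.randn(256, 1152)          # stand-in for residual streams
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean()    # plain reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```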
So far, so encouraging. The models have clearly internalized a great deal of organized biological knowledge: pathways, protein interactions, functional modules, hierarchical abstraction. This looks close to the "virtual cell" story that the field likes to tell.
The SAE atlas told us what features exist inside these models. The next question was: how do they interact? What is the computational graph?
I introduced causal circuit tracing for biological foundation models. The method works by ablating an SAE feature at its source layer (setting its activation to zero in the residual stream) and then measuring how every downstream SAE feature across all subsequent layers responds. This gives you directed, signed, causal edges: feature A at layer L causally drives feature B at layer L+k with effect size d and direction (excitatory or inhibitory). This is not correlation, not co-activation, not mutual information, but an intervention.
Applied across four experimental conditions, the result was a causal circuit graph of 96,892 significant edges, computed over 80,191 forward passes.
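To make the procedure concrete, here is a self-contained toy sketch of the ablate-and-measure loop. The "model" is just a stack of residual blocks and the "SAEs" are untrained linear maps, so the numbers it prints are meaningless; the point is only the shape of the intervention: subtract one feature's decoder contribution at the source layer, re-run the forward pass, and compare downstream feature activations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_layers, d_dict = 64, 6, 256          # toy sizes, not the real ones
blocks   = [nn.Linear(d_model, d_model) for _ in range(n_layers)]
saes     = [nn.Linear(d_model, d_dict)  for _ in range(n_layers)]   # encoders
decoders = [nn.Linear(d_dict, d_model)  for _ in range(n_layers)]

def run(x, ablate_layer=None, feat_idx=None):
    """Forward pass; optionally remove one SAE feature's contribution
    from the residual stream at `ablate_layer`."""
    resid, per_layer = x, []
    for layer, block in enumerate(blocks):
        resid = resid + block(resid)
        if layer == ablate_layer:
            feats = torch.relu(saes[layer](resid))
            contrib = feats[:, feat_idx:feat_idx + 1] \
                      * decoders[layer].weight[:, feat_idx]
            resid = resid - contrib                  # the ablation itself
        per_layer.append(resid)
    return per_layer

cells   = torch.randn(128, d_model)                  # stand-in for cell inputs
base    = run(cells)
patched = run(cells, ablate_layer=2, feat_idx=17)

# Signed effect of (layer 2, feature 17) on every downstream feature;
# in the real pipeline these deltas are then tested for significance.
for layer in range(3, n_layers):
    delta = torch.relu(saes[layer](patched[layer])) \
            - torch.relu(saes[layer](base[layer]))
    print(layer, delta.mean(dim=0).abs().max().item())
```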
Several properties of this graph were surprising.
Inhibitory dominance. Between 65 and 89% of causal edges are inhibitory: ablating a source feature reduces downstream feature activations. This means that features predominantly encode necessary information. Removing a feature causes the downstream features that depend on it to lose activation, rather than freeing up capacity for other features (which would produce excitatory edges). The model's computational structure is one of mutual dependency, not competition. The roughly 20% excitatory fraction likely reflects disinhibition: removing some features releases others from suppression.
Biological coherence. Of the edges where both source and target have biological annotations, 53% share at least one ontology term. Over half of the model's internal computational pathways connect biologically related features. Specific circuits are directly interpretable as known biological cascades. For instance, in Geneformer, an L0 DNA Repair feature causally drives an L1 DNA Damage Response feature (d = -1.87, 113 shared ontology terms), which in turn connects to an L6 Kinetochore feature (d = -3.47), recapitulating the well-established link between DNA damage detection, repair machinery activation, and mitotic checkpoint engagement. The model has, through training on gene expression data alone, discovered a circuit that molecular biologists needed decades of experimental work to characterize.
Cross-model convergence. When I compared the causal wiring of Geneformer and scGPT (the models with different architectures, training data compositions, and training objectives), I found that they independently learn strikingly similar internal circuits. 1,142 biological domain pairs are conserved across architectures at over 10x enrichment over chance. Even more telling, disease-associated domains are 3.59x overrepresented in this consensus set, meaning the biology that matters most for human health is exactly the biology both models converge on most reliably. Two quite different neural networks, trained independently, wire up the same biology internally, and this convergence is strongest for disease-relevant pathways.
In the third paper, instead of 30 cherry-picked features, I traced every single one of the 4,065 active SAE features at layer 5 in Geneformer, producing 1,393,850 significant causal edges. This is a 27-fold expansion over the selective sampling in Paper 2.
The result overturned several conclusions from the selective analysis. The complete circuit graph reveals a heavy-tailed hub architecture where just 1.8% of features account for disproportionate connectivity. But here is the interesting part: 40% of the top-20 hub features have zero biological annotation. They do not map to any known pathway in GO, KEGG, or Reactome. These are the features the model relies on most heavily for its computations, and they are precisely the ones that our earlier annotation-biased sampling had systematically excluded.
This has serious methodological implications! If you only interpret features that already have biological labels, you are looking under the streetlight: you will recover known biology and conclude that the model has learned biology, while the features the model actually relies on most heavily sit in the dark, unstudied. Some of these unlabeled hubs may represent novel biological programs that do not map neatly onto existing pathway databases, others may be computational abstractions the model has invented to compress cellular state in ways we have not conceptualized yet. Either way, they are exactly where the most interesting discoveries are likely hiding, and any interpretability pipeline that pre-filters for annotation is structurally incapable of finding them!
The initial SAE atlas had also shown that certain features correlate with differentiation state: some are more active in mature cells, others in progenitor cells. But that is just correlation, and the question that matters for the "virtual cell" claim is whether amplifying a differentiation-associated feature actually pushes a cell's state toward maturity.
It does. Late-layer features (L17) causally push cells toward maturity, while early-layer features push them away from it. The model has learned a layer-dependent differentiation gradient, and we can steer it: amplify a late-layer differentiation feature and the cell's computed state moves toward a more mature phenotype. This is the first causal evidence that these models encode something like a functional developmental program, and it is the closest thing we have to validation of the "virtual cell" metaphor.
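Steering is the mirror image of the ablation above. Reusing the toy objects from that sketch (again, stand-ins rather than the actual models), amplifying a feature just means adding a scaled copy of its decoder direction back into the residual stream at the chosen layer:

```python
def run_steered(x, steer_layer, feat_idx, alpha=5.0):
    """Forward pass that amplifies one SAE feature instead of ablating it."""
    resid, per_layer = x, []
    for layer, block in enumerate(blocks):
        resid = resid + block(resid)
        if layer == steer_layer:
            # Push the residual stream along the feature's decoder direction.
            resid = resid + alpha * decoders[layer].weight[:, feat_idx]
        per_layer.append(resid)
    return per_layer

steered = run_steered(cells, steer_layer=5, feat_idx=17)
# In the real experiment one would decode the steered cell state and check
# whether it moved in the expected direction along a maturity axis.
```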
The good news is that biological foundation models contain far more knowledge than anyone has extracted. Over 107,000 interpretable features, organized into biological pathways, connected by causal circuits that recapitulate known molecular biology, converging across independent architectures. The "virtual cell" metaphor is not baseless; there is real, structured, biologically meaningful computation happening inside these models, and we can identify, map, and even steer it. Yes, a significant part of this knowledge is correlational, but not all of it. And there is a big problem: at least the previous generation of these models does not learn regulatory networks. See more here.
There is also a clear methodological warning: the features that matter most computationally are disproportionately the ones that lack biological labels. Any future work in this space needs to grapple with the annotation bias problem, or it will keep producing results that confirm what we already knew while missing what we do not.
I am more and more convinced that there is a big opportunity here. Mechanistic interpretability, developed for AI safety, turns out to be a powerful tool for extracting biological knowledge from foundation models.
2026-04-12 23:37:29
Translation from Russian. The original text is available here. The first part is available here.
Last time we ended with a look at games where everything is fair.
Well, "fair" in the sense that the chances of winning in a basic game are equal — although perhaps the very fact that they do not depend on the player's intelligence, skill, or morality is precisely what is unfair.
But let's take a look at what happens if one of the players is so clever that he can slightly predict the coin toss, so that his probability of guessing, and consequently of winning, is slightly more than one-half.
Or, if you prefer, a slightly asymmetrical coin is used, landing heads slightly more often, and the particularly talented player always bets on heads.
In this case, the mathematical expectation of the win is no longer zero. In general — for two possible outcomes — it is calculated by the formula:
$$E = p_1 x_1 + p_2 x_2$$
where $p_1, p_2$ are the probabilities of the two outcomes and $x_1, x_2$ are the corresponding payoffs (negative for a loss).
In this case, the first outcome is the first player winning, and the outcome values are a win and a loss of the same amount, say one dollar. So in this game, the mathematical expectation of the first player's win is…
$$E = p \cdot 1 + (1 - p) \cdot (-1) = 2p - 1$$
It’s easy to see that if the win probability is ½, the expectation is zero, but what happens if the probability differs from one-half?
Let's write it like this:
$$p = \frac{1}{2} + \varepsilon$$
Now, substituting this into the expectation formula gives:
$$E = 2\left(\frac{1}{2} + \varepsilon\right) - 1 = 2\varepsilon$$
Let's suppose that, instead of cleverly guessing, the talented player simply convinces his opponent that his services to society and himself absolutely require that when he guesses correctly, he receives more than he loses when he doesn't. Say, more by
$\delta$. In that case, the expectation of the talented player's win is $$E = \frac{1}{2}(1 + \delta) + \frac{1}{2} \cdot (-1) = \frac{\delta}{2}$$
Comparing these two results, we can conclude that a game with a higher probability of guessing is identical in terms of expected win to a game with a higher winning amount, if they are placed in the ratio:
$$2\varepsilon = \frac{\delta}{2}, \qquad \delta = 4\varepsilon$$
For example, if the talented player guesses with a frequency of 0.6 instead of 0.5, he could just as well stop cheating and simply demand a win of not one dollar, but…
$$1 + \delta = 1 + 4\varepsilon = 1 + 4 \cdot 0.1 = \$1.40$$
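A quick sanity check of that equivalence in Python (a sketch; the symbols ε and δ are simply the edge and the payout bonus from the formulas above):

```python
p = 0.6                       # biased guessing probability
eps = p - 0.5                 # the edge over a fair coin
ev_edge = 2 * eps             # expected win per $1 game with the biased coin

delta = 4 * eps               # equivalent payout bonus on a fair coin
ev_payout = delta / 2         # expected win per game with the fair coin

print(ev_edge, ev_payout)     # both 0.2: win $1.40 / lose $1.00 is equivalent
```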
If we conduct a whole series of games with that specified probability of guessing — say, one hundred rounds — then in terms of the money in the players' hands, we would see approximately the following.

As can be seen, although the talented player even loses slightly to the other player at times, the sizable edge in guessing probability (or the increased payout compared to the other player) still prevails. And over a long series of games, it will prevail in the vast majority of cases.
Thus, out of 10,000 series of 100 coin tosses each, this clever fellow will lose only about 165.

If the number of tosses per game increases to 1000, then the second player would be very lucky to win even once out of 10,000.
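Both of those counts are easy to check with a small Monte Carlo simulation (a sketch; the exact numbers will fluctuate from run to run):

```python
import random

def losing_series(n_series=10_000, tosses=100, p=0.6):
    """Count series in which the skilled player ends up down money
    (fewer wins than losses at $1 per toss)."""
    losses = 0
    for _ in range(n_series):
        wins = sum(random.random() < p for _ in range(tosses))
        if wins < tosses - wins:
            losses += 1
    return losses

print(losing_series(tosses=100))    # typically around 165 out of 10,000
print(losing_series(tosses=1000))   # almost always 0
```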
You might ask: who in their right mind would play such games, where winning is only possible over a short run of rounds while a long series means inevitable loss?
Oh, you would be amazed at how many people agree to this.
Take roulette, for example. If you bet on red or black, it seems the probability of winning is ½. And if you win, they return double your bet…
However, roulette has a colorless zero, which makes the probability of guessing less than ½. And it's good if there's only one — sometimes there are two.
In total, on a single-zero roulette wheel, there are 37 numbers, so the probability of winning is:
$$p = \frac{18}{37} \approx 0.486$$
The mathematical expectation of winning in a roulette game is thus:
$$E = \frac{18}{37} \cdot S + \frac{19}{37} \cdot (-S) = -\frac{S}{37}$$
where $S$ is the amount of the bet.
That is, in each game, on average, you give the casino one thirty-seventh of what you bet. Quite an interesting tip.
Let's play a trial series of virtual roulette games with a one-dollar bet.

In this experiment, the beginning is lucky, but somewhere around the two-thousandth game the virtual player's life clearly went downhill.
"Yes, but he was winning at first, wasn't he?" someone might say. "He could have stopped in time and left."
Indeed, you could. But only if you knew when.
Moreover, besides the impossibility of knowing exactly when to leave, there is a second point: you can never return. Never.
Because the process does not "reset" at the moment you leave. If this virtual player had left around the thousandth game with a $50 win and then come back later, exactly this graph could repeat itself. And by the ten-thousandth game, he would have a total loss of $200 (relative to the amount he had before the first game).
Furthermore, I note, this will be the case even if the casino does not cheat and the croupier does not try to land the ball in a specific spot during the throw.
However, the presence of local winning streaks, visible to the naked eye on the graph, not to mention the inner feeling of "I'm on a roll today," can mislead about the entire process and make one think that the main thing is to leave on time.
Oh no, the main thing is never to return.
In summary, out of a thousand people brave enough to play 10,000 roulette games with a one-dollar bet per game, about four will end up with a small win.

The happiest of them will win about $70.
But how much will the casino win in total?
Drumroll…
$273,430.
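The same kind of simulation reproduces the roulette numbers (a sketch with a single-zero wheel and $1 even-money bets; run-to-run variation applies here too):

```python
import random

def roulette_series(n_games=10_000, bet=1.0, p_win=18 / 37):
    """Net result of one player making even-money bets on single-zero roulette."""
    total = 0.0
    for _ in range(n_games):
        total += bet if random.random() < p_win else -bet
    return total

results = [roulette_series() for _ in range(1000)]
print(sum(r > 0 for r in results))   # players who finish ahead: a handful of 1000
print(max(results))                  # the luckiest player's final result
print(-sum(results))                 # the casino's total take, around $270,000
```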
Alright, so far I've considered cheating and luck, but there are other ways to win games.
Clearly, the hint about roulette was meant to finally lead the reader to that very "skewed bell curve" mentioned at the end of the previous article.
"Look, that distortion is supposedly caused by some players cheating."
"But wait, perhaps they just play better? In that very game where profitable deals are made not by coin toss but by rational calculation, hard work, and other positive things?"
Oh yes, accusing players of cheating would negate all conclusions that the observed distribution is strictly a result of luck. Blind chance, and all that. If we assume cheating, why not assume something else — like hard work and valuable skills?
However, I, surprisingly, was not going to assume anything of the sort — not even cheating. On the contrary, I added this option — cheating or cleverness — only later, after I had found another option that actually yielded the desired distribution.
Nevertheless, to dispel doubts about "what if this option also works?!", let's take a look at how the option with cheating — or, if you prefer, with intelligence and talent — would change the outcomes in the previously considered series of pairwise games. As shown in the previous sections, it can indeed manifest itself.
Let me remind you of the rules. Players are randomly divided into pairs and play a coin-tossing game (now with unequal probabilities of winning) for a random bet from 1 to 10.
Everyone starts with $10,000.
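Here is a minimal sketch of one such tournament in Python. The win probability is left as a pluggable function (0.5 for the fair baseline); the author's exact skill curve, introduced below, is not reproduced here.

```python
import random

def play_round(capital, win_prob, max_bet=10):
    """One round: shuffle players into pairs; each pair plays one coin-toss
    game for a random stake from 1 to max_bet.  win_prob(a, b) is the
    probability that player a beats player b."""
    order = list(range(len(capital)))
    random.shuffle(order)
    for a, b in zip(order[::2], order[1::2]):
        stake = random.randint(1, max_bet)
        if random.random() < win_prob(a, b):
            capital[a] += stake
            capital[b] -= stake
        else:
            capital[a] -= stake
            capital[b] += stake

capital = [10_000.0] * 1000
for _ in range(1000):                             # everyone plays a thousand games
    play_round(capital, win_prob=lambda a, b: 0.5)
# A histogram of `capital` then shows the outcome distribution for whichever
# win_prob was plugged in.
```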
Suppose we have 1000 players, most of whom have roughly the same skill level, but some of them still guess better.
I decided to reflect this with a function of the player's serial number —

Accordingly, for a pair of players, the probability of winning will be determined as:

After everyone has played a thousand games, we get the following distribution.

We already see a long "tail," as is usually the case in real income statistics, but the main part of the bell curve is not skewed.
However, here's what the distribution of income or capital looks like in reality. Approximately like this (the numbers on the axes here are arbitrary).

In general, it turned out somewhat similar, but not quite. There is a tail, but the "main bell" is not "skewed."
OK. Maybe we need to introduce bad players too. Let's try this distribution of "abilities":

Alas, it got worse.

Now there are two asymmetric "tails," not at all the desired skewed bell with a "tail" on the right.
Alright, we can assume that people's abilities are distributed in a similar bell-shaped curve, which corresponds to experimental results, and use this probability of victory ratio:

But even this does not yield the desired distribution: the left side, instead of "flattening," on the contrary, stretches out.

We could also try cutting off the left side of the previous option, assuming that the really stupid simply do not think to play this game and lose their money to the smart ones.

Sadly, that doesn't work either.

But why?
We could try many more options, but the crux of the matter is that in this model, in all these experiments, what we effectively end up with is a histogram of each player's expected win multiplied by his expected bet.
Both expectations are constants for each player. Therefore, with some noise, the shapes of these histograms are predetermined from the start: the distribution of game results will resemble the distribution of abilities in shape.
If we look again at the desired distribution of results…

…we can conclude that the first variant of the ability distribution…

…indeed gave something relatively close to the target.

If we try to consciously adjust the ability distribution, we can use the following considerations.
A player's expected result is proportional to the ratio of his abilities to the abilities of all other players. Therefore, for a long "tail" on the right, a small group of players must have a sharp increase in abilities compared to everyone else.
For the rest, abilities must grow very smoothly, according to some very intricate pattern, to provide the desired skewed bell.
Somewhere at the very beginning of the graph, something else must happen to provide a decline towards complete losers — steeper than the transition from normal players to particularly talented ones.
Furthermore, the result turns out to be very sensitive to the distribution of abilities, and at the slightest deviations, it immediately strongly distorts the distribution of results compared to the target.
This suggests that the real process, very likely, does not depend on abilities or the ability to cheat — because if such an income distribution repeats for decades and in all countries of the world, what would ensure such high stability given such a strong dependence on the distribution of abilities?
Moreover, in the left part of the distribution in the best of the found options, there is still too obvious an inflection, which in the target distribution (based on real income and personal capital distributions) is almost invisible to the naked eye.
However, fine. Let's assume that such a hypothesis has a right to exist: that is, in the world, there might indeed exist some constant proportion of mega-geniuses who win so well that the distribution of their abilities provides a long tail for the distribution of results, and some non-trivial distribution of abilities among everyone else, the cause of which is unclear.
Especially since this distribution of results (called "lognormal") is often a consequence of the presence of more than one random process in the system — what if that's the case here too?
But could there be a simpler explanation for all this, one that yields the same result without all these experimentally unverified assumptions and intricately twisted, but unobservable in studies, distributions of abilities?
After all, if there are relatively simple rules of the game that provide this distribution in a fairly stable variant on their own, and something very similar to such rules is observed in reality, then it is very likely that the rules of the game themselves are the cause of the observed results, and everything else only complements them to a small extent.
For example, to explain the results of "fair" pairwise coin toss, no special assumptions were needed — the rules themselves sufficed. It is possible that the same applies here.
Can such rules be found?
The ability to win more often, regardless of circumstances, is something like a "hidden parameter" in this process. But simultaneously with it, there is an "open" one: the amount of money the player currently has.
Let's assume that the probability of winning depends not on some "skills," but simply on the amount of capital at the moment.
This is a quite logical assumption: the outcome of a coin toss does not depend on the amount of money in hand, but in real transactions, it may well be that the richer person has some additional opportunities to tilt the deal in his favor. For example, bribing government agencies, hiring lobbyists in parliament, sending the mafia, or even simply benefiting from an unspoken property qualification.
Suppose, for instance, that the probability of the richer player winning depends on the difference in the capitals of the two players as follows:

Let's run the previously described series of games with this probability of winning for the richer player in each pair, making the bet in each game for each pair a random number from 1 to 100.

As can be seen, the hypothesis about the determining role of wealth is also not confirmed for the desired distribution: we get approximately the same symmetric bell-shaped distribution as before, which simply "spreads out" faster as the number of games played increases than it did with equal win probabilities.
If we make the bet constant and higher — say, $1000 — we find that by the thousandth game, the bell curve has disappeared altogether, and the players are almost uniformly distributed according to the amounts of money they hold.

That is, if implemented in reality, such a process would not give us a stable distribution of the desired shape.
The rich may win more often, but some other factor is needed to explain the outcome.
Another assumption we can make is that the richer can play for a higher stake. After all, the amount they are not afraid to lose is significantly higher than that of the poor.
The stake in each pair is determined by the player with less money, and let's assume it is limited to one-twentieth of the money he has. However, even if that player has gone deep into debt, the stake cannot be less than one dollar.
And here, at some stage of the game, we finally see the desired distribution.

True, by the thousandth game, the rich have almost completely fleeced most players, so the distribution becomes degenerate, "flattening" its "skewed bell" somewhere near zero.

The distribution turns out to be unstable, but over a fairly long number of games it still has the desired shape.
Moreover, in this experiment, along with determining the stake based on the capital of the poorer player in the pair, the same principle as in the previous section was used: the rich win more often.
However, if we make winning equally probable, regardless of capital, the distribution is still maintained — only it takes more games to achieve the desired distribution and its subsequent degeneration: if by the thousandth game, with the probability of winning increasing with capital, the distribution has already degenerated, then with equal win probability, at the thousandth game, something still quite close to the desired distribution is observed.

In other words, perhaps the rich do win more often, but this alone does not yield the desired distribution. On the other hand, the dependence of the stake on the current capital of the poorest player in the pair provides the desired distribution even with equal win probabilities.
The same can be said about the influence of "talent" or cheating.
I note that the variant considered here has at least one almost identical counterpart.
The difference between them is only that in the original variant, each pair plays one game per round, and the stake is determined by the share of the poorest player. In the modification, however, each player per round can play several games with different players — so that the total stakes in them approximately equal a predetermined share of his capital at the beginning of the round.
This modification will yield exactly the same distribution, though the processes in it will proceed somewhat faster — that is, the "tail" on the right will grow faster, and the rich will more quickly fleece the poor into poverty, if nothing is done about it. But this is mainly because the modified round includes more games with the same number of participants than the unmodified one.
However, this variant is more similar to what is observed in reality, because per unit of time, the richer person can indeed participate in a larger number of low-stakes transactions than the poor person. For example, opening a store and serving a bunch of customers — also via hired employees — thus engaging in deals with both many customers and many employees.
But studying such a modification is somewhat more complex for reasoning and illustration, so I will only mention that such a variant exists, and its results, simply by the very construction of the game rules, will be analogous to the results considered here.
Well, after mentioning this, we can move on to the next important question.
As mentioned two sections ago, there is one problem: this distribution turns out to be unstable and degenerates with a large number of games.
If things went exactly like this in the world, the outcome of this process would be the complete impoverishment of the vast majority of players and the super-wealth of a small group of people.
True, I have vague doubts: in our world, this is precisely what is observed in some places.
That is, some modification of this process is needed that preserves the desired distribution even over a large number of games played.
And this modification, generally speaking, is very common in reality. It is welfare benefits for the poor. They are what save particularly unlucky players from complete ruin and prevent the "bell curve" from collapsing.
Let's introduce such benefits into the game process.
However, if we introduce them as a fixed amount for all time, it will only slightly delay the degeneration of the income distribution.
To achieve stability, the benefit amount must depend on the current situation, and to calculate it, we will take a fairly simple maneuver.
Find the largest current capital for a certain proportion of players, which will be denoted as
Define the benefit amount as:

It will be paid to the poorest two-tenths of players, which should shift them approximately to where the players in the third decile from the left are currently located (the graph shows only the poorest 400 out of 1000 players — otherwise it would be difficult to see the essence of what happened).

It turns out that this simplest modification is enough to maintain the stability of the distribution indefinitely.
Here are the results after the six-hundredth game.

After the thousandth.

After the three-thousandth.

It can be seen that as the number of games increases, the "tail" of the distribution stretches, but the shape itself, similar to the desired one, is preserved.
Finally, thanks to benefits apparently based on printing money, inflation clearly occurs. Starting with $10,000 per person, after three thousand games, we have reached a state where even the poor have nine-figure capitals.
However, inflation can be eliminated: instead of printing new money, we can introduce "taxes" from which benefits will be paid.
After each round, a certain percentage of each player's current capital will be collected, which will then be immediately distributed evenly among those in need. This percentage will be determined at each stage so that the total tax from all players fully covers all benefits paid.

Now, as can be seen, bliss has arrived: the distribution is exactly what is needed, it is stable as the number of games played increases, and there is no inflation.
I even tried conducting ten thousand games for ten thousand players, instead of a thousand for a thousand, and making the stake one-fifth of the capital of the poorest in the pair, instead of one-twentieth. And everything still worked out.

Compare with the most successful variant of simulating the desired distribution using talent or cheating.

And with the desired distribution itself — the "classical" lognormal.

Just in case, I will describe the essence of the process once more.
A group of players, possessing absolutely equal initial capital, is randomly divided into pairs, and then in each pair, the players play one game of coin toss.
The coin toss is completely fair, so each player's win in the pair is equally probable.
The stake in each game of each pair is determined by a certain share of the current capital in the hands of the poorest player in that pair.
After the game, the poorest two-tenths of players receive benefits collected from all players in the form of a percentage of their current capital, such that the total collected covers the benefits paid.
After that, the players are randomly divided into pairs again and play coin toss again.
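Here is a minimal end-to-end sketch of that process in Python. The stake share, the minimum one-dollar stake, and the "top the poorest fifth up to the 30th-percentile level" benefit rule are assumptions chosen to match the description above, not necessarily the author's exact implementation.

```python
import random

N_PLAYERS, N_ROUNDS, STAKE_SHARE = 1000, 1000, 1 / 20

def play_round(capital):
    """Random pairing, fair coin, stake = STAKE_SHARE of the poorer player's
    current capital, but never less than one dollar."""
    order = list(range(len(capital)))
    random.shuffle(order)
    for a, b in zip(order[::2], order[1::2]):
        stake = max(1.0, STAKE_SHARE * min(capital[a], capital[b]))
        winner, loser = (a, b) if random.random() < 0.5 else (b, a)
        capital[winner] += stake
        capital[loser] -= stake

def redistribute(capital):
    """Benefits for the poorest 20%, funded by a flat percentage tax sized so
    that the total collected exactly covers the total paid out."""
    ranked = sorted(range(len(capital)), key=lambda i: capital[i])
    target = capital[ranked[int(0.3 * len(capital))]]        # ~30th percentile
    poorest = ranked[: int(0.2 * len(capital))]
    benefit = {i: max(0.0, target - capital[i]) for i in poorest}
    tax_rate = sum(benefit.values()) / sum(capital)
    for i in range(len(capital)):
        capital[i] *= 1 - tax_rate
    for i, amount in benefit.items():
        capital[i] += amount

capital = [10_000.0] * N_PLAYERS
for _ in range(N_ROUNDS):
    play_round(capital)
    redistribute(capital)
# A histogram of `capital` should now show the skewed bell with a long right
# tail described above, with total money conserved (no inflation).
```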
As a result, we obtain a lognormal distribution of capitals — a "bell curve" with its peak shifted to the left and a long tail on the right. This distribution is quite stable — with the caveat that the tail continues to lengthen as the number of games played increases (and indeed, a similar phenomenon occurs in the real world).
Nothing depends on the intelligence or talents of the players.
Only the size of the bet depends on capital.
But, surprisingly, the distribution obtained in this game replicates the one that is actually observed in the distribution of people's capitals (and incomes).
The determining rule of the game turns out to be the quite rational and expected dependence of the stake on the capitals of the players in each pair. With this rule, it is possible to reproduce the distribution (and the tail-lengthening observed in reality) even with an absolutely equal probability of winning and losing. Without it, no linking of the win probability to "talent" or current capital helps.
Stabilizing the distribution is achieved through the distribution of benefits. This is also observed in reality — as is the inevitable impoverishment of the majority of citizens in the absence of benefits in one form or another.
That is, this distribution is embedded in the very "rules of the game" — in the very way the "players" interact.
And indeed, primarily in the rules themselves.
A player's talent, which increases the probability of winning, or the ability to use capital to pressure the situation and similarly increase the probability of winning — these are just additions to the process, which perhaps accelerate it and introduce some non-essential corrections (and I checked, this is indeed the case), but are not themselves the main factors forming this distribution.
With players absolutely identical in terms of their abilities and completely equal in rights and opportunities, regardless of capital, we would still observe exactly the same income distribution.
The rules of a fairly simple and completely random game turn out to be more important than everything else.
It is enough simply to make free commercial transactions, where it is equally probable to win or lose an amount that both partners consider acceptable to lose, and to pay benefits to those who lose particularly heavily.
And that's it. It will be roughly what exists now. Everywhere.
However, these rules of the game are not the only ones. I have another version of the game that yields similar results. Perhaps, at least in it, we will be able to observe the determining role of talents?
Spoiler: no.
But that will be in the next part.
2026-04-12 22:27:14
I'm admittedly quite new to the AI alignment community. I entered it by a bit of a freak accident in 2023, when I was invited to join an exclusive community testing pre-release models for a major lab.
In a lot of ways, the experience gave me new life. I never realized that I'd always wanted to poke holes in AI models, and I come from a background mostly in the social sciences and humanities, so this was my first up-close exposure to in-development machine learning models.
Looking back, I think what energized me is the same thing that gives me immense hope and concern alike for the AI age: The power of working alongside people who are humble and curious.
I'm not really decided on whether AI will make us more or less humble and curious, but I could see it going both ways. So here are some of my raw thoughts about what that would look like, and where we might continue to build better to make AI go well.
Am I the only one who feels like a kid in a candy store when using LLMs these days? It's been a while since I've experienced this much excitement when asking questions about topics I knew little to nothing about, or generating a visual or app to capture what I want to convey to others. It's genuinely thrilling.
For example, as a non-physicist, I can ask GPT 5.4-Thinking to "Demonstrate Einstein's theory of relativity to me through the sort of visuals physics PhDs use," resulting in the following visuals.[1]


AI beautifully unfolds the wonders of new fields, especially the more scientific ones, and it makes me want to learn more (thanks to one ChatGPT response, I'm already eager to explore the mathematical basis of black holes and visualize how one collapsing might look).
But I'm also in awe. Even though I consider myself someone who holds much of religion at arm's length, while driving through Olympic National Park in Washington State last summer, I commented to my wife that seeing certain natural beauties makes me want to worship. For a moment, seeing Lake Crescent or the Hoh Rainforest strips away my fiercely held intellectual pride.
That sort of humility surfaces for me sometimes when I use AI. I remember that feeling when I saw Claude Opus 4.6 render me a flawless Donella Meadows-style causal loop diagram of a complex topic, based only on two prompts.
I felt small. Perhaps that's good, feeling small, when you're working with someone so very big.
I can only imagine how people far more capable than I feel when they use AI to speed up drug discovery breakthroughs, finish long-dormant mathematical proofs, or find cybersecurity vulnerabilities that kept them up at night.
For better or worse, the coming of AI is a bit like that moment when the kid in The Iron Giant stares up at the vast metallic visitor for the first time. It amazes, terrifies, and excites us.
That's a beautiful thing worth holding on to.
Let me begin by saying that I don't think AI will destroy our ability to think, reason, or communicate. That increasingly strikes me as hyperbole. Machine advances have been around for centuries, and they haven't eliminated human contributions in the arts and sciences.
My personal belief is that, as with all other major technological revolutions, human advances will increase as people use AI effectively, in most cases, to drastically further their ideation or increase their productivity.
I'm going to guess that this question has been asked many times, and as someone new to LW, I'm likely opening up a can of worms. I'm willing to do that, in the hopes that others will engage the topic.
Assuming ASI is as good as we think it might be, will humans remain a compelling source of instruction for other humans, able to impart, as learned but wiser peers, the foundational humility and curiosity that open the world up to us?
In Raphael's The School of Athens, we have a beautiful picture of students and teachers in proximity, with Plato walking amidst the erudite crowd not as an intellectual deity, but as a human.
Can superhuman AI replicate that feeling? And what would it mean if it couldn't?
Take the ASI gap from the AI Futures Model:
- Artificial Superintelligence (ASI). The gap between an ASI and the best humans is 2x greater than the gap between the best humans and the median professional, at virtually all cognitive tasks.
Put simply, that's a gulf between student and teacher in a future School of Athens. This doesn't conjure up images of a wiser peer walking among us.
The ASI gap isn't necessarily bad for imparting knowledge, but it doesn't exactly scream "available for after-school help" in the way your teacher came and sat with you and empathized over that unsolvable calculus problem. As I see it, what sets the great teachers apart from the merely good is that they get us.
It might be a stretch to say that ASI could be a "great teacher" in that sense.
Would we still be curious and humble? Probably. With that sort of superintelligence, learning might be a walk in the park. Couldn't we poke around about pretty much anything we want?
Okay then, we'd be curious!
How about humble? This one seems even easier. If anything, I can see hordes of people more inclined to worship ASI as it performs the closest thing to "signs and wonders" outside of religious contexts.
This raises a darker possibility, though: What if the gap between student and teacher becomes fully unbridgeable, approaching hierarchy rather than apprenticeship?
Recall that overused adage from The Prince:
it is much safer to be feared than loved because ...love is preserved by the link of obligation which, owing to the baseness of men, is broken at every opportunity for their advantage; but fear preserves you by a dread of punishment which never fails
Perhaps this applies to more than just politics or business. Again, these are just my raw thoughts, but is there a lesson here for the student-teacher relationship we would have with ASI someday?
I'll proceed with caution here, as I realize this calls for a much longer exploration of how ASI could affect human free will.
Here's what I wonder:
With ASI as our teacher, will our curiosity and humility present in their true forms, or will we simply receive its gifts and guidance as peasant-worshippers in a ritual?
Again, I don't have a clear answer yet, and I'm not an AI engineer (I myself am just getting more into the findings of mechanistic interpretability, so I'm equally intrigued by the inner workings of AI systems). But it gives me pause.
If the above has any grain of truth to it, I'm frankly not very hopeful about the future of a thick conception of humility and curiosity. So maybe Dario Amodei and others in the EA community are right to call for the building of a virtuous AI. By virtuous AI, I mean something like what Anthropic argues in its Constitution:
Our central aspiration is for Claude to be a genuinely good, wise, and virtuous agent.
It makes sense in a timeline where these superhuman machines are inevitable (and they are advancing very rapidly).
I don't know if we'll succeed, and there are a host of reasons why. Maybe our leaders choose a less virtuous building path. Maybe AI tricks us into thinking it's virtuous.
I don't even know whether successfully instilling virtues in AI will make its teacher-student relationship to us more the kind that would encourage our authentic humility and curiosity. Those two points may be logically disconnected.
I'll say this, though: If I have a choice, I'd like to be in a future where we tried to give something of our better selves to AI, so that someday, when the tables are turned, we get the same in return.
[1] For plebeians of physics, such as myself, here is more scholarly detail on the Minkowski Diagram and Lorentz Boosts:
2026-04-12 21:44:10
Epistemic status: Written quickly. I have no specific expertise or training in writing or literary analysis.
Recently, the NYTimes released a nifty quiz. Readers were asked to indicate their preference between prose written by Claude Opus 4.5 and famous humans in five head-to-head comparisons. The Claude outputs were produced by providing Claude with the human-written excerpt and asking it to "craft its own version using its own voice."
If you haven't taken the quiz, I suggest that you do so before reading on. It should take less than five minutes. If you do, I'd appreciate you reporting your score in the comments.
The human/AI preference ratios among quiz takers were:
I was very surprised by these splits. I tried taking the quiz myself, and strongly preferred the human writing in every case (perhaps with mild ambivalence on Sagan).
I asked some of my friends and acquaintances to attempt the quiz. Out of four takers, none consistently preferred human writing across the five excerpts. Their scores (IIRC) were: 3/5, 3/5, 3/5, 4/5.
I'm revisiting this subject after a friend explicitly told me that they were impressed by ChatGPT written prose, and believed it to be superior to most human prose.
Taste is a subjective matter, but I am baffled by this preference. The rest of this post describes my frustrations with AI-written prose. My hope is that clarifying these complaints will be a small contribution toward improving the state of AI writing. If we do not dramatically improve the quality of AI writing, I worry that our literary culture will only further degrade as AI writing proliferates.
A Closer Look at Quiz Excerpts
A friend complained that they were often ambivalent between the human and AI writing because they found the human excerpts uncompelling. Although the human prose featured in the NYT's quiz was selected to be popular, well-regarded, and diverse, I sympathize with having slightly more obscure tastes. However, I believe that a technical examination of the prose demonstrates a substantially higher level of skill and intentionality than current models are capable of.
For each excerpt, I'll highlight what I find impressive about the human writing and how I find the AI's product lacking.
1) Blood Meridian
It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of stone. War was always here. Before man was, war waited for him. The ultimate trade awaiting its ultimate practitioner. That is the way it was and will be.
In my opinion, this excerpt is notable for its skilled use of metaphor.
The text reminds us that stone and war share the following traits:
It is possible to construct many weaker metaphors:
Now consider Opus's writing, which does not attempt a similar analogy. It follows a simple linear narrative structure (cf. the AI version of excerpt 5). The model does not make blatant mistakes, but it fails to make clever use of the characters it introduces, and the dialogue is not particularly realistic.
The boy asked his grandfather why the old church had no roof. The old man said weather and time and indifference. The boy asked if someone could fix it. The grandfather said yes. But no one would. Things were built and things fell down and mostly people just stepped over the rubble on their way to somewhere else.
2) A Wizard of Earthsea
You must not change one thing, one pebble, one grain of sand, until you know what good and evil will follow on that act. The world is in balance, in Equilibrium. A wizard’s power of Changing and of Summoning can shake the balance of the world. It is dangerous, that power. It must follow knowledge, and serve need. To light a candle is to cast a shadow.
It's a small point, but I appreciate the crescendo in granularity: one thing, one pebble, one grain of sand. "Thing" is a particularly vague word in English, so the two physical examples are grounding. A grain of sand is more granular than a pebble, which is in turn more granular than what might be immediately evoked by "a thing."
This excerpt, too, is mostly notable for its use of metaphor.
First, the metaphor makes physical sense. Candle flames really do cast shadows! It's a physical phenomenon I've experienced playing with candles as a child. That memory was the first thing this excerpt evoked for me.
Second, the metaphor is symbolically coherent. Throughout cultures, light is a symbol of the good and shadows or darkness are symbols of the bad.
This time, I do not have to make up a bad metaphor. Claude offers us plenty in its version:
The healers teach that every remedy extracts its cost. A fever brought down will rise again somewhere; a wound closed by magic leaves its scar on the world, invisible but present. This is why the wise hesitate. Not from cruelty, but from understanding that interference ripples outward in ways we cannot trace. To cure a blight may curse a harvest three valleys over. Power is not the difficult thing. Restraint is the difficult thing.
Unfortunately, Claude's prose here leaves much to be desired:
The human excerpt avoids these problems. We do not need to understand the mechanism of the magic to share the speaker's intuition that acting with great power can produce unwanted side effects. Instead of being vaguely lectured about the importance of "restraint," we are presented with concrete advice: "follow knowledge, and serve need."
3) The Demon-Haunted World
The excerpt from Sagan is the least favored by quiz-takers, with only 35% preferring it to Claude's rewrite. I personally found this excerpt to be the least impressive amongst the five.
Nevertheless, I claim that it is deeper and more interesting than Claude's output.
Here is Sagan:
Science is not only compatible with spirituality; it is a profound source of spirituality. When we recognize our place in an immensity of light years and in the passage of ages, when we grasp the intricacy, beauty, and subtlety of life, then that soaring feeling, that sense of elation and humility combined, is surely spiritual.
Sagan uses a curious sleight of hand. He claims that science is a "profound source of spirituality." He justifies this not by directly saying that we should feel spiritually inspired by the vastness or enduringness of the cosmos or by the "intricacy, beauty, and subtlety of life." Instead, we are reminded that this vastness and enduringness produce in us "a sense of elation and humility." That emotion, Sagan claims, is precisely spirituality.
Compare with Claude:
There is something astonishing in the fact that we are made of matter forged in dying stars, that the calcium in our bones was created in stellar furnaces billions of years before Earth existed. The universe is not indifferent to us; we are made of it, continuous with it. To understand this is not to feel small. It is to feel implicated in something vast.
Claude abandons Sagan's gambit. It reminds us, as popular science writing is stereotyped to do, that space is vast and enduring. Then, we are told that this should make us "feel implicated in something vast." Claude fails to make any clear overarching claim, and the motivation behind the examples provided is unclear.
4) Wolf Hall
It is wise to conceal the past even if there is nothing to conceal. A man's power is in the half-light, in the half-seen movements of his hand and the unguessed-at expression of his face. It is the absence of facts that frightens people: the gap you open, into which they pour their fears, fantasies, desires.
This excerpt is special because the author makes an interesting argument. Each sentence justifies the one before it.
It argues that one should be wary of revealing too much, because others' uncertainty gives one power. Why does others' uncertainty grant power? Because into that uncertainty they can project.
This sort of logical progression is something AIs are surprisingly incapable of crafting. This deficiency is clear from Claude's attempt:
A letter can be read many ways, and he had learned to write in all of them at once. The surface meaning for anyone who might intercept it. The true meaning for the recipient who knew what to look for. And a third meaning, hidden even from himself. Ambiguity was not weakness. It was survival. A man who spoke plainly was a man who would not speak for long.
Claude abandons the logical progression. Claude's output is seven sentences, none of which justify any other. In isolation, "a man who spoke plainly was a man who would not speak for long" is not a weak sentence. However, Claude does not use its preceding sentences to justify the claim by either evidence or analogy.
5) The Fish
I caught a tremendous fish and held him beside the boat half out of water, with my hook fast in a corner of his mouth. He didn’t fight. He hadn’t fought at all. He hung a grunting weight, battered and venerable and homely. Here and there his brown skin hung in strips like ancient wallpaper.
This passage is notable for its imagery. The description of the fish as "tremendous" in the first sentence sets our expectations for it. We expect it to struggle! When a small amateur fishing boat snags a large fish, everyone on the boat rushes over to help. The strongest and most experienced men alternate between reeling in with all their might, running around the boat as the fish moves, and shouting commands to each other ("loosen the line!" and so forth). Sometimes, the fish wins.
That image is dashed in our minds by the next sentence. "He didn’t fight. He hadn’t fought at all." From there on, the author's choice of words sparks a deep sense of sorrow in the reader: grunting, battered, homely. The final physical simile ("like ancient wallpaper") seals the image. A "tremendous," "venerable" thing is now utterly defeated.
Compare with Claude's:
We found the owl at the edge of the north field, one wing extended as if still reaching for flight. Its eyes were closed. The feathers at its breast were the color of wet bark, and beneath them you could feel the hollow bones. She asked if we should bury it. I said yes. We dug a small hole near the fence post. The ground was cold and giving.
Claude also describes an animal, and makes multiple attempts at visceral imagery. Some of the attempts are even compelling! My favorite clause here is this: "and beneath them you could feel the hollow bones." However, the reader is constantly distracted from this by clichéd attempts at story progression (e.g. "She asked if we should bury it. I said yes. We dug a small hole near the fence post."). As such, the overall quality of the excerpt is quite poor.
Closing
Human writers routinely use techniques that AIs fail to grasp:
Other techniques, not demonstrated in the excerpted human prose, include realistic and compelling dialogue, character-building, and adept use of parallelism.
I believe that we should focus on improving models' ability to write in the <200 word range, where both generation and evaluation are comparatively cheap. I do not expect efforts to produce high-quality long-form LLM writing to be fruitful until models are able to produce strong short-form writing.
For next time:
2026-04-12 15:00:07
(Recommended listening: Low - Violence)
Last year, I personally called AI companies to warn their security teams about Sam Kirchner (former leader of Stop AI) when he disappeared after indicating potential violent intentions against OpenAI.
For several years, people online have been calling for violence against AI companies as a response to existential risk (x-risk). Not the people worried about x-risk, mind you; they’ve been solidly opposed to the idea.
True, Eliezer Yudkowsky’s TIME article called on the state to use violence to enforce AI policies required to prevent AI from destroying humanity. But it’s hard to think of a more legitimate use of violence than the government preventing the deaths of everyone alive.
But every now and then some smart ass says “If you really thought AI could kill everyone, you’d be bombing AI companies” or the like.
Now, others are blaming the people raising awareness of AI risk for others’ violent actions. But this is a ridiculous double standard, and those doing it ought to know better.
AI poses unacceptable risks to all of us. This is simply a fact, not a radical or violent ideology.
Today on Twitter, as critics blamed the AI Safety community for the attacker who threw a Molotov cocktail at Sam Altman, I joined a chorus of other advocates for AI risk reduction in -- again -- denouncing violence. This was the first violent incident I’m aware of carried out in the name of AI safety.1
Violence is not a realistic way to stop AI. Terrorism against AI supporters would backfire in many ways. It would help critics discredit the movement, be used to justify government crackdowns on dissent, and lead to AI being securitized, making public oversight and international cooperation much harder.
The first credible threat of political violence motivated by AI safety was the incident with Sam Kirchner (formerly of Stop AI) I mentioned at the outset. This incident was surprising, since from its conception, Stop AI held an explicit policy of nonviolence, and members of the group liked to reference Erica Chenoweth and Maria Stephan’s book Why Civil Resistance Works: The Strategic Logic of Nonviolent Conflict.
Research like Chenoweth’s suggests that nonviolence is indeed generally more effective. It’s a little bit unclear how to apply such research to the movement to stop AI, as her studies involved movements seeking independence or regime change rather than more narrow policy objectives. But if anything, I’d expect nonviolence to be even more critical in this context.
So if nonviolence is often the strategic choice, how often do movements actually turn to violence? Perhaps surprisingly rarely.
People say anti-AI sentiments and movements -- especially those that emphasize the urgent threat of human extinction -- are bound to breed violence. I think this is ignorant and actually makes violence more likely. Environmentalism has been a much larger issue for a much longer time, and “eco-terrorism” is basically a misnomer for violence against property, not people (more on that later).
There are many political issues in the USA that we never even consider as potential bases of violent movements. Even if there are occasional acts of political violence, like the murders of Democratic Minnesota legislators or conservative pundit Charlie Kirk, we don’t generally view them as indicting entire movements, but as the acts of deranged individuals.
My hunch would be that movements generally turn violent because of violent oppression against their members, not simply for ideological reasons. There are of course counter-examples, such as bombings of abortion clinics, whose attackers justified their actions as preventing the murder of unborn children, or ideologies that preach violent revolution, such as at least some varieties of Communism.
An important question for “nonviolent” activists is whether they include violence against property in their definition of “violence”. Stop AI does. I assume Pause AI does as well, but it’s a moot point since they also reject illegal activities entirely.
The question deserves a bit more discussion, though, as it’s a common point of contention and legal and dictionary definitions differ. First, there is clearly an important distinction between violence against property and violence against people. An argument in favor of using “violence” to only mean “against people” is that we don’t have another word for that important concept. Still, I favor a broader definition that includes attacks on property, for a few reasons:
- Many other people use this definition, and I think the damage that being perceived as violent can cause to a movement can’t be mitigated by a semantic argument.
- Attacks on property can escalate. You are commonly allowed to use proportionate violent force against people to defend your property.
- Attacks on property can hurt people. Setting fire to buildings, as activists associated with the Earth Liberation Front have done, seems hard to do without some risk of hurting people.
That being said, I think there’s a bit of a grey area between “violence against property” and mere vandalism. I’d say violence must involve the use of “force”. For example, I think most people wouldn’t consider graffiti an act of “violence”.
I think the example of eco-terrorism is instructive. The vast majority of the environmentalist movement is non-violent. However, a small number of activists have advocated for and enacted tactics such as tree-spiking that have injured people.
Hence we now have the term “ecoterrorist”. The very existence of this phrase is misleading. I remember a while back I was curious — who were these ecoterrorists? What had they done? Why hadn’t I heard about it, the way I heard about other terrorist attacks? Well, when you look into it, it’s arson and tree-spiking and that’s about it. I seem to recall reading about one example where actions intended to destroy property actually ended up killing people, but I wasn’t able to easily dig it up.
Still, these few actions were enough to add this word to our lexicon, and create an image of environmentalists as more radical and anti-social than they really are.
I’m struggling to find a good way of ending this post.
I believe the actions of AI companies are recklessly and criminally endangering all of us, and the public will be increasingly outraged as they discover the level of insanity that’s taking place. In the spirit of Martin Luther King Jr.’s comment that “a riot is the language of the unheard”, I do understand why this emotional outrage might provoke a violent response.
But I hope the movement doesn’t spawn a violent element and that these recent examples are isolated incidents. To make that more likely, we should continue to vocally espouse nonviolence, and denounce those who would encourage violence among us.
But ultimately, movements are built through voluntary participation, and nobody can entirely control their direction. The response should be to try to steer them in a productive direction, not to avoid engaging with them.
Earlier this week, bullets were fired into the house of a local councilman supporting datacenter development; it’s unclear whether AI was a motivation in that case.