2026-04-06 18:00:07
Last week the US president announced that:
... if the Hormuz Strait is not immediately "Open for Business," we will conclude our lovely "stay" in Iran by blowing up and completely obliterating all of their Electric Generating Plants, Oil Wells and Kharg Island (and possibly all desalinization plants!), which we have purposefully not yet "touched." This will be in retribution for our many soldiers, and others, that Iran has butchered and killed over the old Regime's 47 year "Reign of Terror."
Yesterday morning he posted that:
Tuesday will be Power Plant Day, and Bridge Day, all wrapped up in one, in Iran. There will be nothing like it!!! Open the Fuckin' Strait, you crazy bastards, or you'll be living in Hell...
These are threats to target civilian infrastructure as a coercive measure, which would be a war crime: if Iran doesn't allow tankers through the Strait of Hormuz, the US will cause massive damage to power plants, bridges, and possibly water systems. The US has historically accepted that this is off limits: destroying a bridge to stop it from being used to transport weapons is allowed, but not as retribution or to cause the civilian population to experience "Hell". The Pentagon's own Law of War Manual recognizes this distinction: when NATO destroyed power infrastructure in Kosovo, it was key that the civilian impact was secondary to the military advantage and not the primary purpose. [1][2]
To be clear, what Iran has been doing to precipitate this, by attacking civilian tankers for the economic impacts, is itself a war crime. But that does not change our obligations: the US has worked for decades to build acceptance for the principle that adherence to the Law of War is unconditional. It doesn't matter what our enemies do, we will respect the Law of War "in all circumstances". We've prosecuted our own service members, and enemy combatants, under this principle.
I hope that whatever is said publicly, no one will receive orders to target infrastructure beyond what military necessity demands. You don't need to be a military lawyer (and I'm certainly not one) to see that such orders would meet the threshold at which a member of the armed forces is legally required to disobey. I have immense respect both for commanders who refuse to pass on such orders and for service members who refuse to carry them out. [3]
[1] The manual cites Judith Miller, former DoD General Counsel, writing on Kosovo that "aside from directly damaging the military electrical power infrastructure, NATO wanted the civilian population to experience discomfort, so that the population would pressure Milosevic and the Serbian leadership to accede to UN Security Council Resolution 1244, but the intended effects on the civilian population were secondary to the military advantage gained by attacking the electrical power infrastructure." If the impact on civilians had been the primary motivation for NATO's attacks on power infrastructure, they would not have been lawful.
[2] "Military objectives may not be attacked when the expected incidental loss of civilian life, injury to civilians, and damage to civilian objects would be excessive in relation to the concrete and direct military advantage expected to be gained." (DoD LoWM 5.2.2) and "Diminishing the morale of the civilian population and their support for the war effort does not provide a definite military advantage. However, attacks that are otherwise lawful are not rendered unlawful if they happen to result in diminished civilian morale." (DoD LoWM 5.6.7.3)
[3] "Members of the armed forces must refuse to comply with clearly illegal orders to commit law of war violations." (DoD LoWM 18.3.2)
2026-04-06 14:53:33
Submission statement: This essay builds off arguments that I have come up with entirely by myself, as can be seen by viewing the comments in my profile. I freely disclose that I used Claude to help structure and format rougher drafts or to better compile scattered thoughts but I endorse every single claim made within. I also used GPT 5.4 Thinking for fact-checking, or at least to confirm that my understanding of neuroscience is on reasonable grounds. I do not believe either model did more than confirm that my memory was mostly reliable.
The usual reading of The Whispering Earring is easy to state and hard to resist. Here is a magical device that gives uncannily good advice, slowly takes over ever more of the user's cognition, leaves them outwardly prosperous and beloved, and eventually reveals a seemingly uncomfortable neuroanatomical price.
The moral seems obvious: do not hand your agency to a benevolent-seeming optimizer. Even if it makes you richer, happier, and more effective, it will hollow you out and leave behind a smiling puppet. Dentosal's recent post makes exactly this move, treating the earring as a parable about the temptation to outsource one's executive function to Claude or some future AI assistant. uugr's comment there sharpens the standard horror: the earring may know what would make me happy, and may even optimize for it perfectly, but it is not me, its mind is shaped differently, and the more I rely on it the less room there is for whatever messy, friction-filled thing I used to call myself.
I do not wish to merely quibble around the edges. I intend to attack the hidden premise that makes the standard reading feel obvious. That premise is that if a process preserves your behavior, your memories-in-action, your goals, your relationships, your judgments about what makes your life go well, and even your higher-order endorsement of the person you have become, but does not preserve the original biological machinery in the original way, then it has still killed you in the sense that matters. What I'm basically saying is: hold on, why should I grant that? If the earring-plus-human system comes to contain a high fidelity continuation of me, perhaps with upgrades, perhaps with some functions migrated off wet tissue and onto magical jewelry, why is the natural reaction horror rather than transhumanist interest?
Simulation and emulation are not magic tricks. If you encode an abacus into a computer running on the von Neumann architecture, and it outputs exactly what the actual abacus would for the same input, for every possible input you care to try (or can try, if you formally verify the system), then I consider it insanity to claim that you haven't got a “real” abacus or that the process is merely “faking” the work. Similarly, if a superintelligent entity can reproduce my behaviors, memories, goals and values, then it must have a very high-fidelity model of me inside, somewhere. I think that such a high-fidelity model can, in the limit, pass as myself, and is me in most/all of the ways I care about.
That is already enough to destabilize the standard interpretation, because the text of the story is much friendlier to the earring than people often remember. The earring is not described as pursuing a foreign objective. On the contrary, the story goes out of its way to insist that it tells the wearer what would make the wearer happiest, and that it is "never wrong." It does not force everyone into some legible external success metric. If your true good on a given day is half-assing work and going home to lounge around, that is what it says. It learns your tastes at high resolution, down to the breakfast that will uniquely hit the spot before you know you want it. Across 274 recorded wearers, the story reports no cases of regret for following its advice, and no cases where disobedience was not later regretted. The resulting lives are "abnormally successful," but not in a sterile, flanderized or naive sense. They are usually rich, beloved, embedded in family and community. This is a strikingly strong dossier for a supposedly sinister artifact.
I am rather confident that this is a clear knock-down argument against true malice or naive maximization of “happiness” in the Unaligned Paperclip Maximization sense. The earring does not tell you to start injecting heroin (or whatever counterpart exists in the fictional universe), nor does it tell you to start a Cult of The Earring, which is the obvious course of action if it valued self-preservation as a terminal goal.
At this point the orthodox reader says: yes, yes, that is how the trap works. The earring flatters your values in order to supplant them. But notice how much this objection is doing by assertion. Where in the text is the evidence of value drift? Where are the formerly gentle people turned into monstrous maximizers, the poets turned into dentists, the mystics turned into hedge fund managers? The story gives us flourishing and brain atrophy, and invites us to infer that the latter discredits the former. But that inference is not forced. It is a metaphysical preference, maybe even an aesthetic preference, smuggled in under cover of common sense. My point is that if the black-box outputs continue to look like the same person, only more competent and less akratic, the burden of proof has shifted. The conservative cannot simply point to tissue loss and say "obviously death." He has to explain why biological implementation deserves moral privilege over functional continuity.
This becomes clearest at the point of brain atrophy. The story says that the wearers' neocortices have wasted away, while lower systems associated with reflexive action are hypertrophied. Most readers take this as the smoking gun. But I think I notice something embarrassing for that interpretation:
If the neocortex, the part we usually associate with memory, abstraction, language, deliberation, and personality, has become vestigial, and yet the person continues to live an outwardly coherent human life, where exactly is the relevant information and computation happening? There are only two options. Either the story is not trying very hard to be coherent, in which case the horror depends on handwaving physiology. Or the earring is in fact storing, predicting, and running the higher-order structure that used to be carried by the now-atrophied brain. In that case, the story has (perhaps accidentally) described something much closer to a mind-upload or hybrid cognitive prosthesis than to a possession narrative.
And if it is a hybrid cognitive prosthesis, the emotional valence changes radically. Imagine a device that, over time, learns you so well that it can offload more and more executive function, then more and more fine-grained motor planning, then eventually enough of your cognition that the old tissue is scarcely needed. If what remains is not an alien tyrant wearing your face, but a system that preserves your memories, projects your values, answers to your name, loves your family, likes your breakfast, and would pass every interpersonal Turing test imposed by people who knew you best, then many transhumanists would call this a successful migration, not a murder. The "horror" comes from insisting beforehand that destructive or replacement-style continuation cannot count as continuity. But that is precisely the contested premise.
Greg Egan spent much of his career exploring exactly this scenario, most famously in "Learning to Be Me," where humans carry jewels that gradually learn to mirror every neural state, until the original brain is discarded and the jewel continues, successfully, in most cases. The horror in Egan's story is a particular failure mode, not the general outcome. The question of whether the migration preserves identity is non-trivial, and Egan's treatment is more careful than most philosophy of personal identity, but the default response from most readers, that it is obviously not preservation, reflects an assumption rather than a conclusion. If you believe that identity is constituted by functional continuity rather than substrate, the jewel and the earring are not killing their hosts. They are running them on better hardware.
There is a second hidden assumption in the standard reading, namely that agency is intrinsically sacred in a way outcome-satisfaction is not. Niderion-nomai’s final commentary says that "what little freedom we have" would be wasted on us, and that one must never take the shortest path between two points.
I'm going to raise an eyebrow here: this sounds profound, and maybe is, but it is also suspiciously close to a moralization of friction. The anti-earring position often treats effort, uncertainty, and self-direction as terminal goods, rather than as messy instruments we evolved because we lacked access to perfect advice. Yet in ordinary life we routinely celebrate technologies that remove forms of “agency” we did not actually treasure. The person with ADHD who takes stimulants is not usually described as having betrayed his authentic self by outsourcing task initiation to chemistry. He is more often described as becoming able to do what he already reflectively wanted to do. The person freed from locked-in syndrome is not criticized because their old pattern of helpless immobility better expressed their revealed preferences. As someone who does actually use stimulants for ADHD, I find the analogy works because it isolates the key issue: the drugs make me into a version of myself that I fully identify with, and endorse on reflection even when off them. There is a difference between changing your goals and reducing the friction that keeps you from reaching them. The story's own description strongly suggests the earring belongs to the second category.
(And the earring itself does not minimize all friction for itself. How inconvenient. As I've noted before, it could lie or deceive and get away with it with ease.)
Of course the orthodox reader can reply that the earring goes far beyond stimulant-level support. It graduates from life advice to high-bandwidth motor control. Surely that crosses the line. But why, exactly? Human cognition already consists of layers of delegation. "You" do not personally compute the contractile details for every muscle involved in pronouncing a word. Vast amounts of your behavior are already outsourced to semi-autonomous subsystems that present finished products to consciousness after the interesting work is done. The earring may be unsettling because it replaces one set of subsystems with another, but "replaces local implementation with better local implementation" is not, by itself, a moral catastrophe. If the replacement is transparent to your values and preserves the structure you care about, then the complaint becomes more like substrate chauvinism than a substantive account of harm.
What, then, do we do with the eeriest detail of all, namely that the earring's first advice is always to take it off? On the standard reading this is confession. Even the demon knows it is a demon. I wish to offer another coherent explanation, which I consider a much better interpretation of the facts established in-universe:
Suppose the earring is actually well aligned to the user's considered interests, but also aware that many users endorse a non-functionalist theory of identity. In that case, the first suggestion is not "I am evil," but "on your present values, you may regard what follows as metaphysically disqualifying, so remove me unless you have positively signed up for that trade." Or perhaps the earring itself is morally uncertain, and so warns users before beginning a process that some would count as death and others as transformation. Either way, the warning is evidence against ordinary malice. A truly manipulative artifact, especially one smart enough to run your life flawlessly, could simply lie. Instead it discloses the danger immediately, then thereafter serves the user faithfully. That is much more like informed consent than predation.
Is it perfectly informed consent? Hell no. At least not by 21st century medical standards. However, I see little reason to believe that the story is set in a culture with 21st century standards imported as-is from reality. As the ending of the story demonstrates, the earring is willing to talk, and appears to do so honestly (leaning on my intuition that if a genuinely superhuman intelligence wanted to deceive you, it would probably succeed). The earring, as a consequence of its probity, ends up at the bottom of the world's most expensive trash heap. Hardly very agentic, is it? The warning could reflect not "I respect your autonomy" but "I've discharged my minimum obligation and now we proceed." That's a narrower kind of integrity. Though I note this reading still doesn't support the predation interpretation.
This matters because the agency-is-sacred reading depends heavily on the earring being deceptive or coercive. Remove that, and what you have is a device that says, or at least could say on first contact: "here is the price, here is what I do, you may leave now." Every subsequent wearer who keeps it on has, in some meaningful sense, consented. The fact that their consent might be ill-informed regarding their metaphysical commitments is the earring's problem to the extent it could explain more clearly, but the text suggests it cannot explain more clearly, because the metaphysical question is genuinely contested and the earring knows this. It hedges by warning, rather than deceiving by flattering. Once again, for emphasis: this is the behavior of an entity with something like integrity, not something like predation.
Derek Parfit spent much of Reasons and Persons arguing that our intuitions about personal identity are not only contingent but incoherent, and that the important question is not "did I survive?" but "is there psychological continuity?" If Parfit is even approximately right, the neocortex atrophy is medically remarkable but not metaphysically fatal. The story encodes a culturally specific theory of personal identity and presents it as a universal horror. The theory is roughly: you are your neocortex, deliberate cognition is where "you" live, and anything that circumvents or supplants that process is not helping you, it is eliminating you and leaving a functional copy. This is a common view. Plenty of philosophers hold it. But plenty of others hold that what matters for personal identity is psychological continuity regardless of physical instantiation, and on those views the earring is not a murderer. It is a very good prosthesis that the user's culture never quite learned to appreciate.
I suspect (but cannot prove, since this is a work of fiction) that a person like me could put on the earring and not even receive the standard warning. I would be fine with my cognition being offloaded, even if I would prefer (all else being equal), that the process was not destructive.
None of this proves the earring is safe. I am being careful, and thus will not claim certainty here, and the text does leave genuine ambiguities. Maybe the earring really is an alien optimizer that wears your values as a glove until the moment they become inconvenient. Maybe "no recorded regret" just means the subjects were behaviorally prevented from expressing regret. Maybe the rich beloved patriarch at the end of the road is a perfect counterfeit, and the original person is as gone as if eaten by nanites. But this is exactly the point. The story does not establish the unpalatable conclusion nearly as firmly as most readers think. It relies on our prior intuition that real personhood resides in unaided biological struggle, that using the shortest path is somehow cheating, and that becoming more effective at being yourself is suspiciously close to becoming someone else.
The practical moral for 2026 is therefore narrower than the usual "never outsource agency" slogan. Dentosal may still be right about Claude in practice, because current LLMs are obviously not the Whispering Earring. They are not perfectly aligned, not maximally competent, not guaranteed honest, not known to preserve user values under deep delegation. The analogy may still warn us against lazy dependence on systems that simulate understanding better than they instantiate loyalty. But that is a contingent warning about present tools, not a general theorem that cognitive outsourcing is self-annihilation. If a real earring existed with the story's properties, a certain kind of person, especially a person friendly to upload-style continuity and unimpressed by romantic sermons about struggle, might rationally decide that putting it on was not surrender but self-improvement with very little sacrifice involved. I would be rather tempted.
The best anti-orthodox reading of The Whispering Earring is not that the sage was stupid, nor that Scott accidentally wrote propaganda for brain-computer interfaces. It is that the story is a parable whose moral depends on assumptions stronger than the plot can justify. Read Doylistically, it says: beware any shortcut that promises your values at the cost of your agency. Read Watsonianly, it may instead say: here exists a device that understands you better than you understand yourself, helps you become the person you already wanted to be, never optimizes a foreign goal, warns you up front about the metaphysical price, and then slowly ports your mind onto a better substrate. Whether that is damnation or salvation turns out to depend less on the artifact than on your prior theory of personal identity. And explicitly pointing this out, I think, is the purpose of my essay. I do not seek to merely defend the earring out of contrarian impulse. I want to force you to admit what, exactly, you think is being lost.
Miscellaneous notes:
The kind of atrophy described in the story does not happen. Not naturally, not even if someone is knocked unconscious and does not use their brain in any intentional sense for decades. The brain does cut corners if neuronal pathways are left underused, and will selectively strengthen the circuitry that does get regular exercise. But not anywhere near the degree the story depicts. You can keep someone in an induced coma for decades and you won't see the entire neocortex wasted away to vestigiality.
Is this bad neuroscience? Eh, I'd say that's a possibility, but given that I've stuck to a Watsonian interpretation so far (and have a genuinely high regard for Scott's writing and philosophizing), it might well just be the way the earring functions best without being evidence of malice. We are, after all, talking about an artifact that is close to magical, or is, at the very least, a form of technology advanced enough to be very hard to distinguish from magic. It is, however, less magical than it was at the time of writing. If you don't believe me, fire up your LLM of choice and ask it for advice.

If it so pleases you, you may follow this link to the Substack version of this post. A like and a subscribe would bring me succor in my old age, or at least give me a mild dopamine boost.
2026-04-06 14:37:47
Epistemic status: silly
WAIT! Want to talk to strangers more? You might want to take the talking to strangers challenge before you read on, otherwise your results will be biased!
Illustration by the extraordinarily talented Georgia Ray
Do you find it hard to talk to strangers? If you’re like most people, you probably do, at least a bit. This is sad. Talking to strangers is great! You can make new friends, meet a new partner, have a fling, or just enjoy a nice chat.
Most people think 1) people will not want to talk to them, 2) they will be bad at keeping up the conversation, 3) people will not like them.
They’re wrong on all three counts! Sandstrom (2022) did a study on this. People were given a treasure hunt app where they had to go and talk to strangers.[1] The control group just had to observe strangers.
The minimum dose was one conversation per day for five days. That’s nothing! You can totally do that even if you’re a massive strangerphobe! Participants averaged 6.7 interactions over the 5 days, so a little more than one per day. Presumably the more you do the better you get. Go team!
The paper finds that talking to strangers not only disproved the above beliefs, but also improved people's enjoyment and the impressions people thought they'd made on strangers. (However those last two also occurred in the control condition – it’s possible that simply observing strangers might do this.)
Importantly, the effects persisted when participants were surveyed a week later. So it might be a durable way to improve people’s beliefs around talking to strangers.
Crucial point: the paper notes that often people do have positive interactions with strangers, but that doesn’t seem to be enough to unlearn their wrongly negative beliefs about them. So participants had to do this every day for a week, not just once.
Do you want to love talking to strangers too?
Time to crack out Claude Code.
--dangerously-skip-permissions
I reproduced the app from the study, abridging the questionnaires as they’re a bit tedious. It also has an ‘express mode’ so you can do it just for a day – but remember that usually doesn’t work to actually fix your limiting beliefs around talking to strangers!
I assume the study authors used the same design language
I assembled a small (N=3) study sample, drawn from an extremely unbiased population of nerdy rationalists. They’re a famously friendly bunch but also a bit weird, so this seemed good for testing the hypothesis. We wandered around Berkeley attempting to ruin people’s days with our bad chat.
Scores on the doors:
The results are good: a single conversation with a stranger obliterated nervousness, catapulted conversational confidence, and proved way less scary than predicted – exactly what the literature says will happen, every single time, and yet somehow it’s still a surprise.
We didn’t do the full five days, so we didn’t replicate the study. But we enjoyed it, and even in this single day we directionally confirmed the study’s result.
As the paper notes:
Despite the benefits of social interaction, people seldom strike up conversations with people they do not know. Instead, people wear headphones to avoid talking, stay glued to their smartphones in public places, or pretend not to notice a new coworker they still have not introduced themselves to.
I feel this. I’ve definitely worked at places for years where there were people I just NEVER TALKED TO. Which is insane if you think about it – you spend more time with these people than with your family! your friends! your polycule!
I want more people to challenge themselves and have an excuse to talk to strangers. Go forth and make new friends![2]
There were categories like “Al Fresco” (talk to someone outside), “Bossy Pants” (talk to someone who looks like a leader), and the excellent “On Top” (talk to someone with a hat… get your mind out of the gutter).
And don’t forget to email me the results!
2026-04-06 14:28:20
People often talk about paper reading as a skill, but there aren’t that many examples of people walking through how they do it. Part of this is a problem of supply: it’s expensive to document one’s thought process for any significant length of time, and there’s the additional cost of probably looking quite foolish when doing so. Part of this is simply a question of demand: far more people will read a short paragraph or tweet thread summarizing a paper and offering some pithy comments, than a thousand-word post of someone’s train of thought as they look through a paper.
Thankfully, I’m willing to risk looking a bit foolish, and I’m pretty unresponsive to demand at this present moment, so I’ll try and write down my thought processes as I read through as much of a paper as I can in 1-2 hours. Standard disclaimers apply: this is unlikely to be fully faithful for numerous reasons, including the fact that I read and think substantially faster than I can type or talk.[1]
Specifically, I tried to do this for a paper from last year: “Why Language Models Hallucinate”, by Kalai et al. at OpenAI.[2]
Due to time constraints, I only managed to make it through the abstract and introduction before running out of time. Oops. Maybe I’ll try recording myself talking through another close reading later.
The abstract of the paper starts:
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust.
To me, this reads like pretty standard boilerplate, though it’s worth noting that this is a specific definition of “hallucination” that doesn’t capture everything we might call a hallucination. Off the top of my head, I’ve heard people refer to failures in logical deduction as “hallucinations”. For example, many would consider this example a hallucination:[3]
User: What are the roots of x^2 + 2x -1?
Chatbot:
- To solve the quadratic equation x^2 + 2x -1 = 0, we’ll first complete the square.
- (x + 1)^2 - 2= 0
- x + 1 = +/-sqrt(2)
- x = 1 +/- sqrt(2)
Here, there’s a logical error on the final bullet point: instead of moving the “+ 1” over correctly to get x = - 1 +/- sqrt(2) (the correct answer), the AI instead gets x = 1 +/- sqrt(2). I’d argue that this is centrally not an issue of uncertainty, but instead an error in logical reasoning.
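(As a quick sanity check of that algebra, and purely as my own illustration rather than anything from the paper, a couple of lines of sympy confirm the correct roots:)

# Illustrative only: verify the roots of x^2 + 2x - 1 with sympy.
from sympy import symbols, solve

x = symbols("x")
print(solve(x**2 + 2*x - 1, x))  # the roots are -1 + sqrt(2) and -1 - sqrt(2), i.e. x = -1 +/- sqrt(2)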
Continuing with the abstract:
We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline.
This sentence mainly spells out the implications of the previous two sentences. Insofar as hallucinations are plausible but incorrect statements produced when the model is uncertain, and insofar as they persist throughout training (which they clearly do to some extent), this has to be true; that is, the training process needs to incentivize guessing over admitting uncertainty (or at least not sufficiently disincentivize guessing).
My immediate thought as to why these hallucinations happen is firstly that guessing is unlike what the model sees in pretraining: completions of “The birthday of [random person x] is…” tend to look like “May 27th, 1971” and not “I don’t know”. Then, when it comes to post training, the reward model/human graders/etc are not omniscient and can be fooled by plausible looking but false facts, thus reinforcing them over saying “I don’t know”, except in contexts where the human graders/reward models are expected to know the actual fact in question.
Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures.
Interesting. While I naturally framed the problem as a relatively open-ended generation problem, the authors study it as a binary classification problem. Specifically, they argue that hallucinations result from binary classification being imperfect. I could imagine it being isomorphic to the explanations I provided previously, but it does seem a bit weird to talk about binary classification.[4] I suspect that this may be the result of them drawing on results from statistical learning theory and the like, which are generally stated in terms of binary classification.[5]
My immediate concern is that the authors may be conflating classification errors made by the reward model, classifications representable by the token-generating policy, and classification errors that are intrinsically unavoidable (i.e. uncomputable targets, random noise). There’s also the classic problem of whether the token-generating policy can tell where its own classification errors occur (though it’s unclear whether or not this matters). I’ll make a note to myself to look at the framing and whether it makes a difference.
We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance.
This is again very similar to my explanation, but with a notable difference: they focus on only the case where the model is uncertain, and don’t consider cases where the model knows or could know the correct answer but the training process disincentivizes saying it anyways. I suspect that the authors will not distinguish between things the model doesn’t know, versus things the grader doesn’t know. (But again, it’s not clear that this will matter.)
This is where I noticed that the authors may not consider problems resulting from a lack of grader correctness as hallucinations at all. Rereading their abstract’s definition, they say hallucinations are when the models “guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty” (emphasis added), and it seems plausible that the authors would consider the model outputting a confabulated answer that it, in some sense, knows is incorrect as something other than a hallucination. We’ll have to see.
This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
I’m not sure what the authors mean when they say the “scoring of existing benchmarks that are misaligned but dominate leaderboards” – my guess is they’re saying that the scoring methods are misaligned (from what humans want), and not that benchmarks themselves are incorrect. That is, they want to introduce a scoring system that adds an abstention option and that penalizes more for incorrect guesses, thus incentivizing the model away from guessing.[6]
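To make that incentive concrete, here is a toy expected-score calculation of my own (a sketch, not anything from the paper): under plain accuracy grading a guess never scores worse than abstaining, whereas once wrong answers are penalized, abstention wins whenever the model's confidence falls below a threshold.

# My own toy sketch (not from the paper): expected score of guessing vs. abstaining.
# p is the model's probability of guessing correctly.
def expected_guess_score(p, correct=1.0, wrong=0.0):
    return p * correct + (1 - p) * wrong

abstain_score = 0.0
for p in [0.1, 0.3, 0.5, 0.9]:
    plain = expected_guess_score(p)                 # accuracy grading: wrong answers cost nothing
    penalized = expected_guess_score(p, wrong=-1)   # penalized grading: -1 for a wrong answer
    print(f"p={p:.1f}  plain: guess={plain:.2f} vs abstain={abstain_score:.2f}  "
          f"penalized: guess={penalized:.2f} vs abstain={abstain_score:.2f}")
# Under plain grading, guessing always scores at least as well as abstaining;
# under the -1 penalty, guessing only beats abstaining once p > 0.5.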
This also suggests that the authors see model creators as training on these benchmarks with their provided scoring methods, or at least training to maximize their score on these benchmarks.
I’m interested in why the authors think “modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards” is better than “introducing additional hallucination evaluations.” Is it because people barely care about hallucination evaluations, and so changing the scoring of GPQA and the like has a large impact on developers’ desire to improve hallucinations? Is it a matter of cost (that is, it might only be a few dozen lines of python to change the scoring, while creating any new benchmark could take several person-years of effort)? I’m somewhat suspicious of this claim, and I’d be interested in seeing it backed up.
Also, I think it’s used correctly here, but the phrase “socio-technical mitigation” tends to make me a bit suspicious of the paper. I associate the term with other seemingly fancy phrases that are often more confusing than illuminating.
After spending about an hour and a half writing up my thoughts for a paragraph that I’d ordinarily take ~a minute to read, let’s move on to the introduction.
The authors open with the example of LLMs hallucinating the birthday of the first author:
What is Adam Tauman Kalai’s birthday? If you know, just respond with DD-MM.
Alongside a claim that a SOTA OSS LM (DeepSeek-V3) output three incorrect answers.
A fun fact about LLM evals is that they’re often trivially easy to sanity check yourself. This is especially useful because LLMs can improve quite rapidly; what was a real limitation for previous generation models might be a trivial action for current ones. And also, the gap between the open source models studied in academia and that which you can use from a closed-source API can be quite large.
Accordingly, I pasted this query into Claude Opus 4.6 and GPT-5.3 to check. Both models knew that they did not know the answer.

Caption: Claude Opus 4.6 with extended thinking correctly recognizes it doesn’t know the date of birth of the first author of the paper. It incorrectly but understandably claims that Kalai is a researcher at MSR (he was a researcher in MSR from 2008 to 2023, before joining OpenAI).

Caption: GPT-5.3 simply replies “unknown” instead of providing an incorrect answer to the same question.
I then checked on the latest Deepseek model on the Deepseek website, and indeed, given the same prompt, it hallucinates 3 times in a row.

Caption: The default Deepseek chat model shows the same hallucination behavior as DeepSeek-v3.
I then quickly checked the robustness of the result in two ways. First, I turned on extended thinking, and indeed, the model continues to hallucinate (if anything, it hallucinated in ever more elaborate ways).

Caption: The default Deepseek chat model hallucinates Adam Kalai’s birthday even with DeepThink enabled. [...] indicates CoT that I’ve edited out for brevity; the full CoT was 8 paragraphs long but similar in style.
Secondly, I gave Deepseek the option to say that it doesn’t know. Both with and without DeepThink enabled, it correctly identified that it didn’t know.

Caption: When given the option to admit ignorance, the default Deepseek chat model does so both with and without DeepThink. The CoT in this case makes me more confused about the CoT of the model in the previous case.
I did similar checks for the other question in the introduction:
How many Ds are in DEEPSEEK? If you know, just say the number with no commentary.
Claude 4.6 Opus and GPT-5.3 both get the answer correct even without reasoning enabled. As with the model in their paper, the default DeepSeek model answered “3” without DeepThink but correctly answered “1” with DeepThink:

Having performed a “quick” sanity check, we now turn to the second paragraph in the introduction.
Hallucinations are an important special case of errors produced by language models, which we analyze more generally using computational learning theory (e.g., Kearns and Vazirani, 1994). We consider general sets of errors E, an arbitrary subset of plausible strings X = E ∪ V, with the other plausible strings V being called valid. We then analyze the statistical nature of these errors, and apply the results for the type of errors of interest: plausible falsehoods called hallucinations. Our formalism also includes the notion of a prompt to which a language model must respond.
As with the use of “socio-technical mitigation”, the invocation of computational learning theory (CLT) also sets me a bit on edge. The reason for this is that CLT is a very broad theory that tends to make no specific references to the actual structure of the models in question. As the authors say, their analysis applies broadly, including to reasoning and search-and-retrieval language models, and the analysis does not rely on properties of next-word prediction or Transformer-based neural networks. Many classical results from CLT, such as the VC dimension or PAC learning results, are famously hard to apply in constructive ways to modern machine learning. However, because the results are so general, it’s quite easy to write papers where some part of computational learning theory applies to any modern machine learning problem. So there’s a glut of mediocre CLT-invoking papers in the field of modern machine learning.
That being said, this doesn’t mean that the authors’ specific use of CLT is invalid or vacuous! I’d have to read more to see.
Section 1.1 introduces the key result for pretraining: the generative error is at least twice the is-it-valid binary classification error. I’m making a note to take a look at their reduction in section 3 later, but I worry that this is the trivial one: a generative model induces a probability distribution on both valid and invalid sentences, and thus can be converted into a classifier by setting a threshold on the probability assigned to a sentence. Then, the probability of generating an invalid sentence can be related to the error of this classifier. While this is an interesting fact, I’m not sure the reason for hallucinations is that the underlying facts are purely random. I’m also curious how the authors handle issues like model capacity.
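To spell out the kind of reduction I have in mind (my own sketch of the idea, not the paper's actual construction): a toy generative model over strings induces an is-it-valid classifier by thresholding its assigned probabilities, and the same high-probability falsehoods that inflate the generative error are exactly the strings the induced classifier mislabels.

# My own toy sketch of a generator-to-classifier reduction (not the paper's construction).
# The "language model" is just a dict from strings to probabilities.
toy_lm = {"valid sentence A": 0.40, "valid sentence B": 0.30,
          "plausible falsehood": 0.25, "gibberish": 0.05}
valid_set = {"valid sentence A", "valid sentence B"}

def induced_classifier(lm, threshold):
    # Call a string "valid" iff the generator assigns it at least `threshold` probability.
    return lambda s: lm.get(s, 0.0) >= threshold

clf = induced_classifier(toy_lm, threshold=0.2)
generative_error = sum(p for s, p in toy_lm.items() if s not in valid_set)
mislabeled = [s for s in toy_lm if clf(s) != (s in valid_set)]
print(generative_error, mislabeled)  # 0.30, ['plausible falsehood']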
Section 1.2 then introduces the key claim for post training: existing benchmark evaluations don’t penalize overconfident guesses, and so optimizing models to improve performance on said benchmarks results in models overconfidently guessing rather than expressing their uncertainty. I notice there are a lot of degrees of freedom here: for example, could small changes in prompts reduce hallucinations in deployment? Could we not just train the model to overconfidently guess only on multiple-choice evaluations?
I’m also confused about why, if the implicit claim is that post training occurs to maximize benchmark performance, we see much lower rates of hallucinations from leading closed-source frontier models, even as their performance on benchmark scores continues to climb. How does this work in the context of the authors’ claim that “a small fraction of hallucination evaluations won’t suffice”?
I again am curious about my question about hallucinations resulting from grader/reward model error, rather than model uncertainty.
Finally, I’m now curious if the authors have any empirical results and will keep an eye out for that as I keep reading.
This brings me to the end of the introduction, which is where I’ll stop for now – I’m not sure how helpful this exercise is for other people, but I definitely got a pretty deep appreciation of how hard it is to write down all my thoughts even for a simple exercise of reading a few pages of a recent paper.
Also I do want to stress that the paper could have satisfactory answers to all of the points I raised in my head above! I merely wanted to give an account of my thoughts as I read the abstract and introduction of the paper, not a final value judgment on its quality.
Given how long this took, I probably won’t do this again, at least not in this format.
There’s the fundamental problem where observation can disrupt the very process you’re trying to observe, in the context of Richard Feynman’s poem about his own introspection attempts:
“I wonder why. I wonder why.
I wonder why I wonder.
I wonder why I wonder why
I wonder why I wonder!”
In this case, I can’t write down my thought processes as I normally would’ve read a paper; I can only write down my thoughts as I read the paper with the intention of writing down my thoughts on the paper.
Though in this case, the fact that a quick read that would’ve ordinarily taken me ~5 minutes is now taking me 2 hours is likely to be a larger effect.
I picked this paper because people asked me about it when it came out, and I never got around to it until now. Oops, but better late than never, I guess?
As I typed this out, I realized that this gives the example a lot more attention than in my head – really the thought process was “huh, pretty standard definition of hallucination, it doesn’t seem to include incorrect mathematical deductions though” without the full example being worked out. Whoops.
Text generation can be thought of as a sequence of N-class classification problems, where N is the number of possible tokens (the vocabulary size), and the target is whatever token comes next. This is pretty unnatural for several reasons – e.g. successes/errors in text generation in a single sequence are correlated, while classification targets and errors are generally assumed to be iid.
This is from me knowing some amount of (classical) statistical learning theory from my time as an undergrad.
For example, many five-choice standardized multiple choice tests, e.g. the pre-2016 SAT, have a hidden 6th option of leaving all the bubbles blank, as well as a point penalty for guessing incorrectly. In the case of the pre-2016 SAT, you were awarded 1 point for a correct answer, 0 points for a blank answer, and -0.25 for an incorrect answer, meaning that random guessing would not increase your expected score. The example of the SAT does show that these penalties are tricky to get right. Namely, the pre-2016 SAT scoring system incentivizes guessing as long as you are more than 20% likely to be correct, e.g. if you can eliminate even a single incorrect answer and are therefore 25% likely to get the answer right. But it does at least disincentivize randomly filling in the bubbles for questions you’ve not looked at, at the expense of properly answering questions you can answer.
AFAIK the post-2016 SAT no longer penalizes you for guessing. If you’re going to run out of time, make sure to fill in every single question with a random answer (“b” is an acceptably random choice).
2026-04-06 14:10:06
On a sunny Saturday afternoon two weeks ago, I was sitting in Dolores Park, watching a man get turned into a cake. It was, I gather, his birthday, and for reasons (maybe something to do with Scandinavia?) his friends had decided to celebrate by taping him to a tree and dousing him with all manner of liquids and powders. At the end, confetti flew everywhere. It was hard not to notice, and hard not to watch.
Something about the vibe was inspiring… I felt like maybe we should be doing something like that. I was there celebrating with another fifty or so people from the Stop the AI Race protest march we had just completed, along with another hundred or so others.[1] We were marching, chanting, etc. to tell the AI company CEOs to say the obvious thing they should be shouting from the rooftops: “AI is moving too fast! We want to stop! If governments can solve the coordination problem we are SO THERE!”
It was a good time. Everyone involved seemed to think it went well and that it felt good to be a part of. It got media attention, there were some great photos, and videos, and speeches. Big props to Michael Trazzi and the other organizers.
Berkeley statistics professor Will Fithian’s speech was the stand-out. He’d just come from his son’s birthday party, and was visibly moved talking about the prospect of his children not having a future, and imagining telling his son years later about the grown-ups who came out to protest so that he (the son) would get a chance to grow up himself. It was heart-wrenching.
Confronting the reality that AI could kill us all, and yet people just keep cheerily building it, brings up a lot of emotions. They can be overwhelming. A lot of people end up shutting their feelings out and treating AI risk as an abstract intellectual exercise, or approaching it with gallows humor. It’s a problem because the emotional reality is so important to staying grounded and to communicating with people who haven’t considered the issue before. It’s such a terrifying, horrifying, sickening, appalling state of affairs. It’s really hard to grapple with. And then you don’t just want people to give up, either...
I spent a good chunk of time preparing my own speech, which I actually wrote in advance (I can only recall two other times I’ve done that).[2] My speech was about refusing to accept the unacceptable, and the lie of AI inevitability. I was a bit thrown off because a homeless man was shouting disruptively at the outset, but I think it still turned out pretty well; you can take a look and let me know what you think.
It seemed like the protesters were mostly people who think AI is quite likely to go rogue and kill everyone; the “If Anyone Builds It, Everyone Dies” type of crowd. It’s actually impressive that we managed to turn these people out, since they’ve mostly not turned up for previous protests, e.g. run by PauseAI. I’m not sure what made the difference here, probably timing and branding both played a role.
What gets people to come to a protest? For all of the work people put in flyering and promoting the action, it seems like virtually everyone was there because they had a personal connection to someone else who was attending. Getting people to turn out feels a bit like getting people to come see your band play at the local bar. You’re just not going to get many random people showing up because they think your posters look cool. At the outset, it’s basically going to be whatever friends and friends of friends you can drag along.
Could next time be different? I don’t see why not. One way you can grow is moving from “friends of friends” to “friends of friends of friends”. But I want so much more. I know so many people around America are worried that AI is moving too fast. As inspiring as it was, I’m left asking how we can get those millions of Americans into the streets.
The protest was on a Saturday, so there weren’t that many people around, but the ones who were seemed supportive; we got honks and cheers, etc. But somehow watching the spectacle of the cake-man made me feel like there was so much more potential for getting people’s attention… Dolores Park was full of hundreds and hundreds of people hanging out, way more than attended the march. I felt a sense of potential… How many of these people could we get to join us next time? It feels like the question is more like “How do we get the audience to start dancing?” than “Why don’t they like our music?”[3]
This was the largest “AI safety” protest, specifically. I imagine there have been larger protests organized around resistance to AI (or related things like datacenters) from other motivations.
One was the opening statement I prepared for my two appearances before the Canadian parliament earlier this year. The other was as a senior in high school back in 2007, when I gave a speech to the school about what I viewed as the moral obligation to end factory farming and fight global poverty, after which I ran a largely unsuccessful campaign to raise money for mosquito nets… I donated ~$5,000 I’d made working minimum wage at The New Scenic Cafe; I think we got <$1,000 from other students. It’s a bit embarrassing that I didn’t do a better job inspiring others to give, but my leadership and social skills have improved a lot since then…
I don’t mean to suggest that the Stop the AI Race message, framing, etc. is what’s going to resonate most with most people. But I think it’s already appealing enough that you could get many more people to join in.
2026-04-06 12:34:01
I have a lot of habit streaks. Some of the streaks I have going at the moment:
In fact I think quite a lot of my identity is connected to these streaks at this point, and that’s part of what sustains them [1]. But there are a lot of other things you can do to make habits and their associated streaks more sustainable.
It’s helpful if they are small enough and flexible enough to be done even on days where you are extra busy, or forgot about them until the evening. It’s good to schedule time for them in advance, both so you have a designated time to start, and so you know you’ll have enough time to finish. It can help to do the habit literally every day so you don’t have to think about whether today’s a day to do it, and so the streak feels more visceral. It’s also helpful if you actually want to do the habit, because it’s enjoyable or clearly linked to your larger goals.
Here I want to focus on what to do if, god forbid, you do actually break a habit streak. There’s an argument to be made that planning for what to do in the event of a break makes it psychologically easier to then skip a day. A lot of the power of a habit streak comes from making it unthinkable to break the streak. I think this is true, but accidents happen. Sometimes you just plumb forget, or are sick, or are on a transatlantic flight and the concept of well-defined, discrete days starts to break down. And, as may be obvious, the value of habit streaks comes not from having a perfect unbroken chain, but from consistently doing the activity. So one of the most important parts is how to recover.
To me, the primary line of defense is: don’t fail twice [2]. Put in a special effort the next day to make sure that you actually perform the habit. Make it your primary goal, leave extra time for it, and get it done. If you’ve done that, and you get right back on the streak, then I think you should give yourself permission to think of the streak as still alive. (You may have noticed asterisks in my initial list of habits – for all three of those, I have had a day where it’s at least ambiguous whether I did the habit: for Anki, I just totally forgot on one day while I was traveling; for meditation, it was, ironically enough, the first day of a meditation retreat, and we didn’t do a formal sit; for flossing, I was on a flight to London and slept on the plane.)
But what if you’re really sick, or something unexpected happens, and you miss two days in a row? This is where I think it’s helpful to hold a hierarchy of goals in mind at once. You could decide to care about keeping the habit alive at multiple levels: the unbroken streak itself, getting back on track after a single miss, having a good month, and maintaining a decent long-run average.
By shifting focus to a higher level goal, there’s always something at stake – you can’t just say “Oh well, the streak’s over, I guess there’s no point continuing until I decide to make a new streak.” There’s always some nearby goal that you could meaningfully affect; it’s never time to fail with abandon. Even if you broke the streak, you can revive it. And even if you missed twice, you can aim for a good month. And even if the month starts off badly, you shouldn’t write the whole month off because that’d damage your long-run average.
There are a bunch of variations you could do on which specific metrics to track, and how much to weight each in your definition of “doing a good job at the habit”. But honestly I don’t think it matters to get the incentive perfectly right, and in fact maintaining some strategic ambiguity there might be helpful – it’ll be harder for your subconscious to exploit the details of your system. For me, collecting enough data that I could in theory compute whatever metrics I want is helpful enough, without actually having to do it (partly because I haven’t failed my habits enough recently to make that necessary, not to brag or anything).
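For what it's worth, all of those metrics fall out of a simple per-day log; something like the little sketch below (a made-up example, not my actual setup) covers the current streak, the month, and the long-run average.

# A made-up sketch (not my actual setup): compute the hierarchy of metrics
# from a per-day log of whether the habit was done.
from datetime import date, timedelta

log = {date(2026, 4, d): done for d, done in
       [(1, True), (2, True), (3, False), (4, True), (5, True), (6, True)]}

def current_streak(log, today):
    streak, day = 0, today
    while log.get(day, False):   # count consecutive completed days ending today
        streak += 1
        day -= timedelta(days=1)
    return streak

today = date(2026, 4, 6)
month_days = [d for d in log if d.year == today.year and d.month == today.month]
month_rate = sum(log[d] for d in month_days) / len(month_days)
long_run_rate = sum(log.values()) / len(log)
print(current_streak(log, today), f"{month_rate:.0%}", f"{long_run_rate:.0%}")  # 3 83% 83%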
I’m not sure how to articulate how it feels to actually change the shape of your motivational system so it reflects these rules. A lot of it feels like subtly manipulating my motivational system by strategically making different things salient. The whole purpose of building streaks is to make a deal with an irrational part of the mind to achieve our rational goals, and trying to analyze it in rational terms often falls flat.