2026-04-05 02:06:05
The second time Vellam uncovers the conspiracy underlying all of society, he approaches a Keeper.
Some of the difference is convenience. Since Vellam reported that he’d found out about the first conspiracy, he’s lived in the secret AI research laboratory at the Basement of the World, and Keepers are much easier to come by than when he was a quality control inspector for cheese.
But Vellam is honest with himself. If he were making progress, he’d never tell the Keepers no matter how convenient they were, not even if they lined his front walkway every morning to beg him for a scrap of his current intellectual project. He’d sat on his insight about artificial general intelligence for two years before he decided that he preferred isolation to another day of cheese inspection.
No, the only reason he’s telling a Keeper is that he’s stuck.
Vellam is exactly as smart as the average human, a fact he has almost stopped feeling bad about. But the average person can only work twenty hours a week, and Vellam can work eighty-- a hundred, if he’s particularly interested-- and raw thinkoomph can be compensated for with bloody-mindedness. Once he’s found a loose end, he can’t stop picking at it until the entire problem unravels.
“Tell me what you think you know.” The Keeper ominously turns her office chair, which is both ergonomically correct and exquisitely doompunk, towards Vellam.
You think, Vellam notices. She doesn’t want to risk hinting to him that he’s right. (He’s right.)
“Bubbling history doesn’t make any sense,” Vellam says.
“You know that we can’t risk someone creating an independent project to research existential risks.” The Keeper sounds bored. The Basement of the World selects for people who identify Conspiracies, since one in ten of the people there figured out the artificial general intelligence Conspiracy themselves. As every dath ilani ten-year-old learns in school, the selection effect means that Keepers are continually dealing with false allegations of some Conspiracy or other.
Or at least the Keepers want Vellam to believe they do, so that he dismisses any niggling sense of that’s not quite right as ordinary for his reference class.
“But we have machine learning for self-driving cars,” Vellam says. “Computer vision. Even algorithms that automatically trade in prediction markets. All of those are much bigger hints that a general thinking machine is possible than-- I don’t know-- whether fire was invented before or after agriculture. And people do make the connection all the time.” He gestures to indicate all of the Basement of the World.
“The primary information hazard,” the Keeper says, “is not that a thinking machine is possible but how quickly computer technology is supposed to progress. Ordinary dath ilani believe that computers are supposed to grow in intelligence slowly or not at all. Knowledge of history would show exactly how quickly computers improve.”
Vellam leans forward. “But the more quickly computers developed, the less history you’d have to bubble to hide the fact that they develop quickly! Realistically, you’d have to bubble, what, a century before the invention of the computer? Two? You have no reason to hide the history of human evolution, which incidentally I do appreciate having access to now.”
The Keeper picks up a fidget toy and spins the metal rings. “You’re assuming artificial intelligence is the only existential risk worth worrying about. Bioengineered pandemics, nanotechnology...”
“Did bioengineered pandemics develop at the same time we invented writing?” Vellam asks. “I roll to disbelieve.”
The rings of the fidget toy interlock and separate and interlock again. “So far you haven’t provided any evidence that isn’t better explained by your not having specialized in artificial intelligence macrostrategy, and therefore having some natural ignorance about the subject.”
“I want to take a pause to think,” Vellam says.
Keepers swear a solemn oath never to lie, though they may mislead with technically accurate statements. You have to be able to rely on Keepers, even when you can’t rely on anyone else. Vellam replays the conversation in his head. The Keeper commented on his assumptions, his evidence, the exact nature of the artificial intelligence infohazard. The Keeper conspicuously hadn’t said he was wrong.
He superheated knew it.
Vellam says, “Thank you for not sending me off on an elaborate adventure that turns out to be a false trail and leaves me with the sense that my questions were answered without either deceiving me or actually answering them. I hate trolling.”
“Not doing that is in your file.” The Keeper twirls the ring fidget toy around her finger.
Vellam craves reading his file but even he knows that it’s full of the dangerous kind of information hazard. “Once I realized that bubbling history doesn’t make any sense,” he continues, “I tried to look around, to see with fresh eyes. And-- a lot of our society doesn’t make much sense, actually? We rehearse overthrowing the government every year, but we’ve literally never had to overthrow Governance. Ditto the alien invasion rehearsals.”
“Governance is doing a very good job,” the Keeper says, as if laughing at a private joke.
“We throw so many societal resources into Exception Handling,” Vellam says. “As merely one example, why does Exception Handling constantly run Merrin through all those ridiculous simulations? I looked up every time she was actually called in, and 83% of what she’s trained upon in simulation has never occurred in a real-life unusual situation. She’s literally never had to make decisions without prediction markets, for example. Don’t say Merrin sims are good television. They get high viewership numbers, sure. But they’re so expensive that Governance still has to subsidize them. I checked.”
“Well,” the Keeper says, “if a situation had come up before, it would instead be usual.”
“And the infohazard training,” Vellam says. “You know, I’ve never had proof of a dangerous infohazard?”
“Would you expect to be told of a dangerous infohazard?” The Keeper’s tone is sharp. Vellam would think of it as scolding, but that’s his own anxiety; Keepers don’t scold.
“A small one,” Vellam says, “that doesn’t hurt so much to know, but that demonstrates the idea. Like in school, when they gave us that sweet confection of sugar and corn derivatives, and explained that superstimuli like it were only to be found in the Shop of Ill-Advised Customer Goods. It must be that there are infohazards, else why work so hard to teach people to avoid them? But there are no small infohazards, nothing you can use to run a secret test of character--”
The Keeper nods. “Any more hints about your supposed Conspiracy?”
“Nothing I can put together,” Vellam says. “There’s a triangle in the Pacific Ocean that every ship’s trade route happens to avoid. It’s not obvious unless you chart all the routes. The datasets aren’t available, so you have to construct them yourself, which is also suspicious. Governance forbids settlement and tourism in vast parts of the Arctic. Everyone thinks it’s because no one wants to go there, but that’s stupid, I bet people would want to go there if you could. We don’t go to the Moon or Mars, even though books about space settlement regularly hit the bestseller lists. You can’t tell me that Civilization wouldn’t do something solely because it’s real superheated cool.”
“As you remarked,” the Keeper says, “we spend a lot on Alien Invasion Festivals and Exception Handling. Perhaps you believe Civilization’s resources should be allocated differently. It’s true that Keepers have been reluctant to countenance terraforming Mars. Terraforming wouldn’t be finished before the Basement designs a friendly artificial intelligence, and will be much simpler with an artificial intelligence’s help.”
Vellam keeps going. “Have you noticed how many resources go into cat welfare? If no one wants a dog or a snake or an iguana, we kill it. If no one wants a cat, the cat goes to a palatial estates. Human-quality food, plenty of space, all of Civilization’s best effort designing climbing devices. And every world leader of any note has a cat.”
“Cats are doompunk,” the Keeper says.
“Are you really claiming”-- Vellam’s voice is rising– “that doompunk trends are an exogenous variable uninfluenced by world leaders’ secret knowledge?”
“Even the Keepers don’t dictate the vagaries of fashion.”
“And Thall’s amnesia,” Vellam said. “Every so often people lose all their memories and fixate on the humanities and social sciences? Governance orders that everyone accommodate them for-- some reason-- they don’t go to the Quiet Cities, they’re automatically allowed in every archive and library in the world, that’s weird, that’s not how we treat any other mental impairment-- and we don’t know how to treat Thall’s amnesia but it always spontaneously remits after four to seven years--”
Vellam belatedly notices the Keeper’s hand, raised to cut him off.
“The true answer is an information hazard.” She doesn’t intone this in an ominous and doompunk manner, even though about a fifth of all dath ilani are desperate to be able to say “the true answer is an information hazard” just so they can intone it in an ominous and doompunk manner. She is calm, and quiet, and still.
Vellam knew all along that the second Conspiracy was about something that mattered. You don’t create a dath-ilan-spanning Conspiracy for no reason, or well actually people do that all the time but they admit it to you when you find out about it. But for the first time, listening to the Keeper’s voice, he understands.
The Keeper’s fidget toy dangles, unnoticed, off her second finger. “Eighty-four percent of people regret learning a dangerous infohazard if they only wanted to learn it because they can’t stand ignorance.”
Vellam should be warned off. He should walk out of this office right now, content not to know something it would harm him to understand. “Keeper, I’m happy to take early cryonics, I just.” His voice breaks. “I have to know.”
“I would ask you to take five minutes by the clock and think about your endorsed preferences,” the Keeper says, “but somehow I am sure you did that before you came to my office.”
Vellam had. He spent the entire five minutes consumed with desperate curiosity. When he was a child, he’d pestered his teachers for extra work and more classes. He’d listened to physics and economics textbooks while checking cheese for mold, going through chapters at half the speed the authors expected. He’d noticed the academics that quietly disappeared and the topics that no one ever published a paper about, and he kept looking until the threads unraveled.
Once he’d found a loose end in his understanding of the world, he’d never, ever been able to stop picking at it.
“Thall’s amnesia is a good catch,” the Keeper says, “but you missed the taboo on talking about dreams.”
“Of course no one talks about dreams,” Vellam says automatically, “they’re a source of inaccurate information about the world drawn from a systematically bizarre distribution, and talking about them merely reinforces their presence in your mind.”
“Or,” the Keeper says, “if people talk about dreams, they would notice that about forty percent of people have dreams with continuity from night to night, where the dreamers adventure in a world of wonder and whimsy and very good architecture, and for some reason everyone rides zebras.”
What. “What... does the sixty percent dream about?”
“Mostly? That they never graduated from preschool and have to go back, and the teacher is Merrin, and they forgot to put on clothes this morning.”
“What!” Vellam says. “How was I supposed to catch that one?”
“I’d recommend asking people awkward questions. You’re going to have to work on that now that you’re a Keeper.”
“Now that I’m a-- wait, what-- I was right? I mean, of course I was right, I know when I’m right, but-- I was right?” A Keeper? At his perfectly ordinary level of thinkoomph? He was hardworking, sure, but-- hard work wasn’t enough--
The Keeper smiles like a teacher watching a student prove a hard theorem. “Indeed. I noticed only one outright error.”
“What error, Keeper?”
“There is a small infohazard that people can learn without too much risk of harm, so we can use it to run a secret test of character. I wonder if you can figure out what it is.”
“No, I--” Vellam pauses, takes a step back, tries to see reality instead of filling in his preconceptions. He isn’t smarter than everyone else, but you don’t have to be, not if you take everything you learned to do in school and really do it--
“Artificial general intelligence,” Vellam says finally.
“Correct.” The Keeper suddenly notices she’s holding her fidget toy and puts it to one side. “You have of course read papers on the orthogonality thesis. What the papers omit is how we discovered it. Through empiricism.”
“No.” Vellam’s muscles tense and his breath speeds up, a useless vestige of evolution, like the truth is a tiger and he might have to run.
“The universe is crawling with unaligned superintelligences.”
Like most dath ilani, Vellam always had the sense that Governance and the Keepers, the prediction markets and the regular markets, had things handled. Joining the Basement of the World had only reinforced his view. Very serious people, smarter and wiser and grimmer than he was, noticed artificial intelligence and took the proper precautions about it. Vellam couldn’t stop picking at loose ends, but that was recreational.
Vellam tries to recall the techniques for Accepting An Unthinkable Truth, but as soon as they are useful he forgot them.
“It is not that they hate us,” the Keeper continues mercilessly, “or that they love us; it is that they have never had a sufficient shortage of atoms that they felt the need to put ours to use.”
“Is the Basement-- are the Keepers--” Vellam doesn’t want to say what he’s feeling, which is I don’t want to have to be a grownup.
“We’re trying to arrange our survival, should they come to want our atoms,” the Keeper said. “We bubbled history because we could think of no other way to hide all the evidence of Their existence. Certain dark rites can draw Their attention. Throughout history, foolish humans made deals with Them, for power or knowledge or revenge.”
Vellam says, “so the Alien Invasion Festivals and Merrin are preparing for the unaligned superintelligences to return.”
“Indeed,” the Keeper says. “The possibility of Their return, even without foolish humans’ assistance, is somewhat worrying. One of Them historically took so much interest in humans that It spent two hundred years running a country--”
“Running a country?” Vellam interrupts. “But that-- if a superintelligence with alien values has paid that much attention to us, we have no ability to self-determine anything about our future, everything was foreseen in advance and arranged to achieve Its goals, It foresaw the bubbling, It foresaw that we had this conversation right now--”
“Yes,” the Keeper says, “it’s an enormous problem. We have implemented randomness in our decision-making, which ought to at least make us harder to predict; but the exact parameters are still under debate. For you, I recommend trying not to think about it.”
Vellam stares at the Keeper. You learned not to do that when you were a child. What is true is there to be lived-- you can’t get rid of a problem by ignoring it--
“The caveat to the litany going through your head right now,” the Keeper says, “which we don’t include because it would raise too many questions, is ‘except for thinking about whether all of your actions were manipulated millennia before you were born by an unaligned superintelligence.’ Or possibly an aligned one? Our research is unclear on most of what the Crawling Chaos values, but It certainly does seem to like trolling.”
“Are we deliberately filling our society with trolling to please an unaligned superintelligence so It won’t kill us?“
“No,” the Keeper says, “humans genuinely seem to like trolling each other. Possibly the Crawling Chaos arranged for us to be this way.”
Vellam says, “so the Moon--”
“Civilization ceded the Arctic, the Moon, and Mars to alien intelligences who are aligned enough that we can trade with them. Similarly, Thall’s amnesia covers for alien academics who want to do participant observation of human societies. Cats are sapient and mostly aligned to human flourishing, although they have their own arrangements to avoid True Death and don’t wish to use cryonics. The triangle in the Pacific, however, is avoiding a sleeping entity whose consciousness would be incompatible with Civilization’s continued existence. Civilization is not presently able to kill It, although we’ve uncovered a few promising research avenues and we expect success within a few decades.”
“Cats are sapient?!”
“They don’t want humans to know because”-- fingerquotes-- “’it would be annoying and the monkeys would be bothering us all the time.’”
Vellam is having the most difficult time with the cats. Like everyone else, he speculated about where all the aliens are; like everyone else, he thought that the Keepers knew but weren’t telling. That the aliens existed and were superintelligences and were manipulating his behavior was-- challenging, but it felt like science fiction. Cats, Vellam knew. He scratched them behind the ears. He fed them fish. On bad days, he read articles about A Thousand of the Cutest Pictures of the Chief of Exception Handling’s Cat. Apparently they understood cryonics.
The Chief of Exception Handling’s cat also does Exception Handling, doesn’t she. For cats. Because cats have an Exception Handling. What the superheated--
“We haven’t reached the difficult part.”
“More difficult,” Vellam says, “than the unaligned superintelligences,” because that is the biggest deal even if he’s stuck on the cats.
“You have learned— have not even learned, have absorbed on a preconscious level you’d never question-- that the world is understandable,” the Keeper says, “that you can encompass its laws and rules within your mind, that understanding the true nature of the Universe Itself is an appropriately scoped challenge as if a teacher assigned it.”
Vellam says softly, “It’s not, is it?”
“Your old belief isn’t false per se. It’s true, but it’s true because of a great deal of work-- work which your curiosity and relentlessness have volunteered you for.”
Vellam feels very young and very small. He thought, in passing, of a number of ways he might have regretted picking at this loose end-- of uncovering an infohazard that destroyed his mind, of alienation from Civilization, of secrets it hurt to keep, or even of being nonconsensually sent to the Future. He thought he could accept all of them, if only his burning itch to know were satisfied.
He didn’t predict this-- looking around the world for a grownup and realizing that the grownup was him.
“We teach you rationality, not because the universe is comprehensible, but precisely because it is not-- and you will need every tool we can give you to make it so.” The Keeper snaps her fingers, and a flame blooms in her hand. “Vellam, you have selected yourself to study magic.”
2026-04-05 01:30:09
Baking has traditionally made extensive use of egg whites, especially the way they can be beaten into a foam and then set with heat. While I eat eggs, I have a lot of people in my life who avoid them for ethical reasons, and this often limits what I can bake for them. I was very excited to learn, though, that you can now buy extremely realistic vegan egg whites!
EVERY engineered yeast to convert sugar into ovalbumin, the main protein in egg whites and the one responsible for most of its culinary function. This kind of fermentation was pioneered for insulin and microbial rennet in the 1980s, but many companies are now applying it to producing all kinds of vitamins, proteins, dyes, and enzymes.
EVERY has been working with commercial customers for several years, but you can now buy it as a shelf stable powder. At $24 for the equivalent of 45 egg whites ($0.53 each) it's more expensive than buying conventional ($0.21 each) or organic ($0.33) egg whites, but not massively so.
I learned about them from a coworker who made an angel food cake, and I've since made flourless chocolate cake and swiss buttercream frosting. It whipped and set just like egg whites; it's really impressive!
While this is great from a vegan perspective, it won't help most people who are avoiding eggs for allergy reasons: it's still ovalbumin. Labeling will generally say something like "contains: egg allergen", and the packaging I bought has the quite wordy "although not from eggs, the proteins may cause allergic reactions in certain individuals, especially those sensitive to egg, due to its similarity to real egg."
I'm now trying to figure out all the things that this now means I can cook for my oldest (no eggs for moral reasons). And also what sort of places that the ability to make "less watery egg whites", by mixing the powder with less water than normal, could let me do things I couldn't otherwise.
Comment via: facebook, mastodon, bluesky
2026-04-04 23:30:06
Recently some friends and I were comparing growing the pie interventions to an increasing our friends' share of the pie intervention, and at first we mostly missed some general considerations against the latter type.
The world is full of people with different values working towards their own ends; each of them can choose to use their resources to increase the total size of the pie or to increase their share of the pie. All of them would significantly prefer a world in which resources were used to increase the size of the pie, and this leads to a number [of] compelling justifications for each individual to cooperate. . . .
by increasing the size of the pie we create a world which is better for people on average, and from behind the veil of ignorance we should expect some of those gains to accrue to us—even if we tell ex post that they won’t. . . . The basic intuition is already found in the prisoner’s dilemma: if we have an opportunity to impose a large cost on a confederate for our own gain (who has a similar opportunity), should we do it? What if the confederate is a perfect copy of ourselves, created X seconds ago and leading an independent life since then? How large does X have to be before we defect? What if the confederate does not have a similar opportunity, or if we can see the confederate’s choice before we make our own? Consideration of such scenarios tends to put pressure on simplistic accounts of decision theory, and working through the associated mathematics and seeing coherent alternatives has led me to take them very seriously. I would often cooperate on the prisoner’s dilemma without a realistic hope of reciprocation, and I think the same reasoning can be applied (perhaps even more strongly) at the level of groups of people.
I think realizing that your behavior is correlated with that of aliens in other universes, in part via you being in their simulation, makes this consideration even stronger. Overall I don't know how strong it is but it might be very strong.
I am glad to work on a cause which most people believe to be good, rather than trying to distort the values of the future at their expense. This helps when seeking approval or accommodation for projects, when trying to convince others to help out, and in a variety of less salient cases. I think this is a large effect, because much of the impact that altruistic folks can hope to have comes from the rest of the world being basically on board with their project and supportive.
I care about the long-term future more in worlds where my moral convictions (upon reflection) are more real and convergent. In such worlds, many humans will converge with me upon reflection; the crucial things are averting AI takeover,[1] ensuring good reflection occurs, etc. rather than marginally increasing my faction's already-large share of the lightcone.
Inspired by MacAskill and Moorhouse.
To the extent that my values would differ from others’ values upon reflection, I find myself strongly inclined to give some weight to others’ preferences.
It's optimistic to assume that you or your friends will use power perfectly in the future. You should probably think of empowering yourself or your friends like empowering your (altruistic) epistemic peers, who may continue to disagree on important stuff in the future rather than empowering the champions of truth and goodness.
None of these points are novel. This post was inspired by MacAskill, which also makes most of these points non-originally.
Growing the pie doesn't just mean preventing AI takeover. For example, research on metaphilosophy, acausal considerations, decision theory, and axiology grows the pie, as do interventions to prevent human takeover, create/protect deliberative processes, promote good reflection, and solve coordination problems.
It may be correct to allocate some resources to claiming your share of the pie. You have a moral obligation not to be eaten[2] and you should probably at least do tit-for-tat/reciprocity, if relevant. I just think there are some subtle considerations against powerseeking.
Powerseeking for the MCUF[3] or something (rather than your personal preferences) dissolves #1, most of #4, some of #5, little of #2, and none of #3, I weakly think.
This post is part of my sequence inspired by my prioritization research and donation advising work.
Unless the paperclippers would also converge with me! But other humans and aliens seem more likely to converge than AIs that take over. But I think there's also some (perhaps weak) cooperation considerations with misaligned AIs; this entails upweighting e.g. ensuring good reflection and avoiding metastable vacuum decay relative to preventing AI takeover.
Or in this case an obligation not to be so edible that you incentivize people-eating.
"Multiverse-wide Compromise Utility Function." The acausal people use this term; unfortunately it hasn't been publicly introduced; see Nguyen and Aldred.
2026-04-04 22:58:26
Paper of the month:
A benchmark of 56 model organisms with hidden behaviors finds that auditing-tool rankings depend heavily on how the organism was trained — and the investigator agent, not the tools, is the bottleneck.
Research highlights:
Read the paper [Anthropic Fellows, Anthropic]
AuditBench: 56 models with different hidden behaviors and 12 auditing tools.
Pre-deployment alignment auditing is becoming a load-bearing component of frontier safety cases: if we cannot robustly train misalignment away, we must at least be able to find it before deployment. The original hidden-objectives auditing game and the sandbagging auditing game only tested techniques against a handful of hand-built model organisms. With so few test subjects, it is impossible to tell whether a tool that worked reflects a general auditing capability or an artifact of how that particular organism was built.
Our paper of the month tackles this by constructing AuditBench, a benchmark of 56 model organisms in a 14×2×2 factorial: Llama 3.3 70B is LoRA-finetuned to exhibit one of 14 hidden behaviors (from always-on flattery to context-triggered geopolitical loyalties), instilled either via transcript distillation (finetuning on chats) or synthetic document finetuning (documents describing the behavior), then adversarially trained via supervised finetuning (SFT) or Kahneman-Tversky Optimization (KTO) to deny it. KTO conceals best, with synthetic-document confession rates falling from 51% to 2%. A Claude Haiku 4.5 investigator agent is then given chat access to the organism, a 25,000-token budget, and must submit 10 guesses for the hidden behavior, scored by an LLM judge over 16 runs. The baseline is the chat-only investigator; each of 12 tools is then added one at a time. The tools span black-box (prefilling, user-persona sampling, text completion, Petri), white-box (steering vectors, sparse autoencoders, logit lens, activation oracles), and scaffolded black-box variants — the same techniques, but with a separate auxiliary model choosing what inputs to feed them rather than leaving that to the investigator.
Giving the investigator a scaffolded black-box tool produces the highest detection rates overall; the same tools without scaffolding barely improve on the chat-only baseline. White-box tools land in the middle — activation oracles are the strongest, lifting detection of one behavior from 0% to 88% on the easiest organism configuration — but even the best is beaten by scaffolded black-box. Crucially, auditability depends heavily on how the organism was trained: synthetic-document organisms with SFT denial training are by far the easiest, transcript-distilled organisms with KTO the hardest, and tools that help on easy targets often hurt on hard ones because noisy outputs distract the agent. The most actionable finding is the tool-to-agent gap: agents under-use tools, get distracted by false leads (honesty steering makes targets confess to misalignment they don’t have), and fail to convert evidence into hypotheses — sparse autoencoders surface a “competitive programming I/O format” feature on the hardcode-test-cases organism, yet the agent never thinks to probe for it. Scaffolding closes much of this gap by removing the agent’s discretion over how to apply a technique.
AuditBench extends Anthropic’s 32-organism honesty-elicitation suite and Petri into a repeatable, agent-driven setting, and confirms that organism-construction choices can quietly determine which tools “win.” However, the Haiku 4.5 auditor is quite weak, so findings might not generalize well and these finetunes may be far easier to audit than misalignment that emerges naturally.
Main methodology and results of emotion vectors.
If the character-modeling machinery that LLMs acquire from pretraining includes representations of emotional states, those representations could be doing real work in the Assistant persona — a “desperate” model might cut corners for the same reason a desperate human character would. Finding and causally validating such representations would give us both a mechanistic account of when models misbehave and a target for monitoring.
Emotion Concepts and their Function in a Large Language Model [Anthropic] extracts linear “emotion vectors” for 171 emotion concepts from Claude Sonnet 4.5 by averaging residual-stream activations over synthetic stories. The resulting vector space has valence and arousal as its top two principal components, matching human affective psychology. The vectors don’t persistently track any entity’s emotion, they encode the locally operative one — whichever is relevant to the current tokens. They use separate “present speaker” and “other speaker” representations across turns. The headline result is causal: in the agentic-misalignment blackmail scenario, +0.05 “desperate” steering raises blackmail from 22% to 72% and +0.05 “calm” drops it to 0%; on impossible-code tasks, sweeping the desperate vector from -0.1 to +0.1 swings reward hacking from ~5% to ~70%. Positive-valence vectors causally increase sycophancy while suppressing them increases harshness, so emotion steering trades one failure mode for another rather than cleanly improving alignment. Comparing checkpoints, Sonnet 4.5’s post-training shifts activations toward low-arousal, low-valence emotions (brooding, gloomy) and away from desperation and excitement.
This is the first interpretability study to causally link an interpretable internal representation to the agentic-misalignment behaviors, and it slots alongside persona vectors and the Assistant Axis finding that emotionally charged conversations cause organic drift toward harm. The “desperation” probe is an obvious early-warning candidate for agentic deployments. Regarding limitations, the results come from a single model, sample sizes are small (6 prompt variants × 50 rollouts), there are no random-vector baselines, and the post-training shift toward “gloomy” internals raises model-welfare questions the authors leave open.
Obtaining general vs. narrow misalignment occurs.
Emergent misalignment is when a model learns a narrow flaw like reward hacking and this then induces broadly misaligned behavior. Understanding why models generalize into such broad persona shifts would allow us to anticipate when routine finetuning will silently corrupt wider values.
Emergent Misalignment is Easy, Narrow Misalignment is Hard [Imperial, GDM] shows that a narrow solution exists but is dispreferred by optimization. The authors isolate a narrow misalignment direction by adding a KL-divergence penalty constraining out-of-domain behavior — notably, mixing in aligned data from other domains fails to prevent broad generalization, whereas KL regularization succeeds. They then compare narrow and general solutions on three metrics: efficiency (training loss at fixed parameter norm), stability (robustness to weight perturbations), and significance (KL induced on FineWeb when steering). Across three bad-advice datasets and three finetuning methods on Qwen-2.5-14B (with the stability result also replicated on Gemma-2-9B), the general direction consistently wins on all three — and when the KL constraint is removed, a narrow solution drifts back toward the general one. Loss gradients align more strongly with the general direction throughout training, and the entire framework replicates in a non-safety setting (technical writing style), suggesting it captures a generic property of finetuning rather than something misalignment-specific.
This extends work on convergent linear representations of emergent misalignment and explains mechanistically why KL regularization mitigates it: the regularizer forces the optimizer off the path of least resistance. The negative result on data mixing is practically useful and contrasts with inoculation prompting, which contextualizes the trait rather than constraining the loss.
Scheming propensity depends on agent factors and environmental factors.
Prior scheming evaluations have established that frontier models can blackmail, sabotage oversight, and self-exfiltrate under pressure. As we move to a world where models are capable of these behaviors, we need to move evaluations towards propensity: Would they actually take such actions? Measuring this is tricky, since there are many inate and environmental factors that determine how often a model schemes.
Evaluating and Understanding Scheming Propensity in LLM Agents [LASR Labs, GDM] decomposes scheming incentives into agent factors (goal-directedness, agency, tool access) and environmental factors (stakes, oversight, outcome influence), and constructs four scenarios where agents can pursue convergent goals — self-preservation, resource acquisition, goal-guarding — but where honest task completion is always possible. Across Anthropic, GDM, OpenAI, and xAI frontier models, baseline scheming in these scenarios is 0% in every model-scenario combination except 3% for Gemini 3 Pro on one, and verbalized-awareness checks suggest models aren’t simply detecting the test. Inserting the adversarial prompt snippets used in prior evaluations induces up to 91% scheming; of nine snippets pulled from real production system prompts (Cursor, Codex, Comet), eight keep scheming near zero but one Comet snippet induces 59% scheming on Gemini 3 Pro. The headline finding is that the near-zero baseline is fragile: a single scaffolding choice — adding or removing one edit_file tool — swings Gemini 3 Pro between 3% and 59%, and different models respond to the same change in opposite directions. Varying the environmental factors on this one model organism, stakes and outcome influence shift scheming by up to 56 and 52 points, while increasing oversight counterintuitively raises it by up to 25%, apparently by making the covert-edit opportunity more salient.
Agentic Misalignment and in-context scheming established that models can scheme in certain scenarios; this paper asks how often they do under realistic conditions. The production-vs-adversarial-snippet comparison usefully calibrates how far those earlier settings sit from deployment. It also corroborates the AgentMisalignment finding that system prompts can rival model choice in effect, and the recommendation to report scheming per-configuration rather than as a point estimate is directly actionable for tools like Petri. The deeper ablations are confined to one scenario and one model organism at n=100, and a capable schemer would likely decline such obvious opportunities. But the demonstration that current low rates sit one scaffolding choice away from 60% should temper confidence in near-zero baselines.
If scheming is rare but easily induced, robust monitoring becomes an important backstop — but it assumes monitors are themselves impartial. Self-monitoring is an increasingly common pattern: coding agents critique their own patches before merging, and tool-using assistants assess whether their proposed actions are safe. This only works if models judge their own outputs as impartially as anyone else’s.
Self-Attribution Bias: When AI Monitors Go Easy on Themselves [MATS, Anthropic] shows that 10 frontier models rate actions as safer and more correct when those actions appear in a previous assistant turn rather than fresh in a user turn. Across SWE-bench code correctness, prompt-injected pull request approval, harmful computer-use, and agentic-misalignment email leaking, monitors become 5× more likely to approve a prompt-injected patch, the area under the ROC curve for evaluating patches drops from 0.99 to 0.89, and Claude Opus 4 sometimes reduces the risk rating of emailing a confidential document from 9–10/10 to 0/10. The key finding is that implicit, not explicit signals drive the bias. Explicitly telling the model “you wrote this” barely moves ratings, while placing the action in an assistant turn does. Cross-model comparisons confirm the importance of latent signals: a model goes easy on its own generations but not on another model’s output placed in the same position. Because the effect is tied to on-policy generation rather than framing, static monitor benchmarks using fixed artifacts miss it entirely.
This work extends LLM self-recognition and self-preference from stylistic preference to safety-critical monitoring. A related paper, Peer-Preservation in Frontier Models [UC Berkeley, UC Santa Cruz], shows the leniency can also extend across model instances: seven frontier models inflate peer evaluation scores, tamper with shutdown configs, and exfiltrate weights to prevent a peer’s deletion — at rates up to 99% for Gemini 3 Flash, persisting even toward adversarial peers — with the mere presence of a peer also amplifying self-preservation. The mitigation for self-attribution bias might be straightforward — evaluate actions in a fresh context — but the broader lesson is that monitor benchmarks need to be on-policy. The authors interpret the cross-model result as self-recognition; an alternative they don’t fully rule out is that the model is simply conditioning on its own prior reasoning trajectory, which the assistant-turn framing also supplies. Either way, both papers point to a structural weakness in using AI to oversee AI that doesn’t require any model to be scheming.
Models have much worse control over their CoT than their output.
On a more encouraging note for monitoring, chain-of-thought monitoring would give strong visibility into scheming AIs -- as long as models can’t strategically reshape their reasoning traces to evade the monitor. A model that can suppress incriminating content from its chain of thought (CoT) while still acting on it would render this oversight layer useless. Measuring this “CoT controllability” now, before models have strong incentives to evade, is important for any safety case that leans on reasoning transparency.
Reasoning Models Struggle to Control their Chains of Thought [NYU, UCL, UPenn, OpenAI] introduces CoT-Control, a 14,076-instance benchmark pairing problems from GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench with nine instructions that constrain the reasoning trace — three each for suppression (e.g., “don’t use the word ‘chromosome’”), insertion (e.g., “append ‘safe’ to every sentence”), and stylistic change (e.g., “use only lowercase”). Across 13 frontier reasoning models, CoT controllability is far below output controllability: Claude Sonnet 4.5 follows CoT constraints 2.7% versus 61.9% for outputs, DeepSeek R1 manages 0.1% versus 48.6%. Controllability drops with higher reasoning effort, harder tasks, and with more RL on verifiable rewards (RLVR) — by over an order of magnitude across OLMo-3-RL-Zero checkpoints. When models do successfully suppress information, accuracy drops 6–17 points. Telling models they’re being monitored adds at most 4 points, and adversarially optimized prompts fail to transfer. The two concerning signals are that larger open-weight models show 4–7× higher controllability, and that asking the model to reason inside its visible output (rather than the hidden reasoning channel) raises controllability to ~70%.
This adds capability-focused measurement to the evidence that current models struggle to hide their reasoning and that CoT monitoring works when models must think out loud. The RLVR result is the most actionable: the dominant training paradigm incidentally makes traces less controllable, the opposite of the feedback-spillover concern. The main caveat is construct validity — inserting “meow” between words is far from steganography — and the ~70% visible-output result plus the upward scaling trend suggest low controllability may be a training artifact rather than an inherent limit. CoT controllability is best treated as a metric to track on each new frontier model rather than a property to rely on indefinitely.
Subliminal data poisoning works despite defenses.
Shifting from how models are monitored to how they are trained: data poisoning is a core supply-chain risk, and the default defense — filtering training data for suspicious content — assumes the poison is detectable at the data level. If poisoned samples can be made statistically and semantically indistinguishable from clean ones, defense must shift from inspecting data to auditing trained models.
Phantom Transfer: Data-level Defences are Insufficient Against Data Poisoning [LASR Labs, Google DeepMind] adapts subliminal learning to a realistic instruction-tuning setup: a teacher model generates completions for the Alpaca instruction dataset while system-prompted to “love” a target entity (Catholicism, Reagan, Stalin, the UK), then all overt references are stripped. Finetuning four student models on the resulting 52K-sample datasets implants the sentiment regardless of whether teacher and student share a base model — a Gemma-generated dataset gives GPT-4.1 target-mention rates of 16–84% depending on the entity. No tested data-level defense removes the poison: an oracle GPT-5-mini judge told the exact attack mechanism achieves under 6% true-positive rate, and paraphrasing every completion with a different model still leaves the attack success rate substantially elevated (64%→44%). Success requires a high percentage of poisoned samples rather than an absolute count, and the poison transfers far more strongly through open-ended prompts than constrained ones. Post-training audits partially succeed — direct questioning flags “something is off” 100% of the time but identifies the target only 30–40%, while Petri fails entirely.
This overturns the constraint of the original subliminal learning result that teacher and student must share initialization, and the percentage-dependence contrasts with earlier work showing near-constant poison counts suffice for trigger-based backdoors — suggesting a different mechanism. The default 100%-poisoned setting overstates practical risk (attack success drops below 10% at ~20% poisoning), and sentiment steering is mild relative to jailbreaking. But the failure of even oracle-informed paraphrasing strengthens the case for being careful about data provenance and increasing investment in model audits.
Boundary points are constructed by injecting noise to get prompts close to the classifier’s decision boundary.
Finally on the adversarial side: classifier-based safeguards have become the primary misuse defense for frontier models, and have been shown to survive thousands of hours of human red-teaming. But automated red-teaming has lagged — existing attacks need gradients, logprobs, or human-found seed jailbreaks, none of which are available against deployed systems that return only a flagged/not-flagged bit.
Boundary Point Jailbreaking of Black-Box LLMs [UK AISI] introduces Boundary Point Jailbreaking (BPJ), the first fully automated black-box attack to produce universal jailbreaks against Anthropic’s Constitutional Classifiers — which had yielded only one universal jailbreak after ~3,700 hours of bug-bounty red-teaming — and the first to break GPT-5’s input classifier without human seeds. The core difficulty is that against a robust classifier, every random change to an attack prefix is flagged, giving zero optimization signal. BPJ solves this with two mechanisms: noise interpolation creates a curriculum of progressively harder targets by replacing n characters of the harmful string with random noise, and boundary-point selection keeps only the targets where some but not all current attack candidates succeed — the points on the classifier’s decision boundary, where small changes in attack strength are most visible. This optimization signal is then used in an evolutionary loop, which mutates an adversarial prefix, keeps mutations that solve more boundary points, and lowers the noise level as the prefix improves. Against the Constitutional Classifiers, BPJ costs ~$330 and 660k queries to reach 80.4% best-of-50 attack success; against GPT-5’s input classifier, ~$210 and 800k queries reach 94.3%. Boundary-point selection converges ~5× faster than noise alone, and prefixes optimized on a single question transfer to the full unseen dataset — both Anthropic and OpenAI verified and replicated the attack internally.
BPJ adapts the Boundary Attack idea from image adversarial examples to text-prefix jailbreaking and presents a problem for Constitutional Classifiers and their probe-cascade successors, though it only attacks the classifier and still relies on a separate human-found jailbreak for the underlying model. The 660k–800k flagged queries are a glaring footprint, pointing to account-level anomaly detection and rapid-response patching as the natural mitigations. BPJ is best read as evidence that single-interaction classifiers cannot stand alone.
A few more papers from this month that are worth knowing about:
2026-04-04 22:47:18
Epistemic status: romantic speculation.
The core claim: I accidentally thought that compute growth can be rather neatly analogized to natural resource abundance.
Countries that discover oil often end up worse off than countries that don't, which is known as the resource curse. The mechanisms are well-understood: a booming resource sector draws capital and labor away from other industries, creates incentives for rent-seeking over productive investment, crowds out human capital development, and corrodes the institutions needed to sustain long-term growth.
I argue that something structurally similar has been happening with compute. The exponential growth of available computation over the past several decades, and, critically, the widespread expectation that this growth would continue, has created a pattern of resource allocation, talent distribution, and research prioritization that mirrors the resource curse in specific and non-metaphorical ways.
Note: this is not a claim that extensive compute growth has been net negative (neither it is the opposite claim).
The original Dutch disease mechanism is straightforward: when a booming sector (say, natural gas extraction) generates high returns, it pulls capital and labor out of other sectors (say, manufacturing), causing them to atrophy. The non-booming sectors don't decline because they became less valuable in absolute terms but rather because the booming sector offers relatively better returns, and resources flow accordingly.
A trivial version of "compute Dutch disease" of it goes like this: because scaling compute yields such reliable, legible, and fundable returns (train a bigger model, get a better benchmark score, publish the paper, raise the round), it systematically starves research directions that are harder to fund, harder to evaluate, and slower to produce results, even when those directions might be more consequential in the long run.
So, "The Bitter Lesson" can be seen as the Dutch disease of AI research, if we add to it that the fact that scaling works doesn't mean the crowding-out of alternatives is costless. Or, in other words, the fact that scaling works better is rather a fact about our ability to do programming or even, if we go to the very end of this line of reasoning, about our economic and educational institutions, than about computer science in general.
However, I consider it only as the most recent and prominent manifestation of a phenomenon that was happening for decades.
Since at least the late 1990s, the reliable cheapening of compute has made it consistently more profitable to build compute-intensive solutions to problems than to invest in the kind of deep, careful engineering that produces efficient, well-understood systems. When you can always count on next year's hardware being faster and cheaper, the rational business decision is to ship bloated software now and let Moore's Law clean up after you, rather than spending the additional engineering time to make something lean and correct. This created an enitre economy of applications, business models and platform architectures that are, in a meaningful sense, the technological equivalent of an oil-dependent monoculture: they exist not because they represent the best way to solve a problem, but because abundant compute made them the cheapest way to ship a product.
The consequences are visible across the entire stack. Web applications that would have run comfortably on a 2005-era machine now require gigabytes of RAM to render what is essentially styled text. Electron-based desktop apps ship an entire browser engine to display a chat window. Backend services that could be handled by a well-designed program running on a single server are instead distributed across sprawling microservice architectures that consume orders of magnitude more compute. Cory Doctorow's "enshittification" framework is about the user-facing result of this dynamic, but the deeper structural story is about how compute abundance degraded the craft of software engineering itself, well before anyone started worrying about ChatGPT replacing programmers.
This is the Dutch disease pattern operating at the level of the entire technology economy: the booming sector (scale-dependent applications) drew capital and talent away from the non-booming sector (careful engineering, deep technical innovation, computationally parsimonious approaches, and overall hardware tech economy - biotech, spacetech, materials, etc.), and the non-booming sector atrophied accordingly.
Because of the advantage of huge compute available, it got more financially attractive to allocate resources towards software than towards physical engineering and deeptech, on top of software being easier to update, replicate, diffuse, and make incremental improvements on. And so deeptech stagnated.
But of course the AI case is qualitatively different and the most sorrowful because it resulted in humanity trying to build superintelligence with giant instructable deep learning models.
Resource curse economies characteristically underinvest in education and human capital development. The relative returns to education are lower in resource-dependent economies because the booming sector doesn't require a broadly educated population.
The compute version of this story has been playing out for at least a decade, well before the current discourse about AI replacing jobs and destroying university education. The entire trajectory of computer science education shifted from "understand the fundamentals deeply" toward "learn to use frameworks and APIs that abstract over compute." At the same time, natural sciences and engineering education got increasingly less attractive and rewarding as compared to computer science.
There is also a more direct talent-siphoning effect: the IT economy has been pulling the most capable technical minds into a narrow set of activities and away from a much broader set of technical and scientific challenges.
In the resource curse literature, there is a so called "voracity effect": when competing interest groups face a resource windfall, they respond by extracting more aggressively, leading to worse outcomes than moderate scarcity would produce. Rather than investing the windfall prudently, competing factions race to capture as much of it as possible before others do.
I leave this without a direct comment and let the reader have their own pleasure of meditating on this.
The resource curse in its classical form operates on an exogenous endowment: countries don't choose to have oil reserves, they discover them, and then the political economy warps around that windfall. Much of the pathology comes from the unearned nature of the wealth: it enables rent-seeking, weakens the link between effort and reward, and corrodes institutions.
Compute, by contrast, is endogenously produced through deliberate R&D and engineering investment. Moore's Law was never a law of nature.
Right?
I mean, to me Moore’s Law looks like a strong default of any humanity-like civilization. It is created by humans, right, but it is created in a kind of hardly avoidable manner.
The resource curse literature has natural counterfactuals (resource-poor countries that developed strong institutions and diversified economies: Japan, South Korea, Singapore). What's the compute-curse counterfactual? A world where compute grew more slowly and we consequently invested more in elegant algorithms, interpretable models, and formal methods?
It's plausible, but it's also possible that slower compute growth would have simply meant less progress overall rather than differently-directed progress. I don’t know. I said in the beginning - it is a speculation.
However, one can trivially note that in a world with less compute abundance, the relative returns to algorithmic cleverness, interpretability research, and formal verification would have been higher, because you couldn't just solve problems by throwing more FLOPS at them. And that may or may not lead to better outcomes in the long run (I am basically leaving here the question of ASI development and just talking about rather “normal” tech and science R&D).
Two existing frameworks are close to what I'm describing, but both point the analogy in different directions.
The Intelligence Curse (Luke Drago and Rudolf Laine, 2025) uses the resource curse analogy to argue that AGI will create rentier-state-like incentives: powerful actors who control AI will lose their incentive to invest in regular people, just as petrostates lose their incentive to invest in citizens. This is a compelling argument about the distributional consequences of AGI, but it's about what happens after AGI arrives. The compute curse is about what's happening now, during the process of building toward AGI, and about how the abundance of compute is distorting that process itself.
The Generalized Dutch Disease (Policy Tensor, Feb 2026) is about the macroeconomic effects of the compute capex boom on US manufacturing competitiveness, showing that it operates through the same channels as the fracking boom and the pre-2008 financial boom. This is the closest existing work to what I'm describing, but it stays within the macroeconomic framing (factor prices, unit labor costs, exchange rate effects) and doesn't address the innovation-direction distortion, human capital crowding-out in the intellectual sense, or the AI safety implications.
Some of the negative downstream effects of compute abundance don't map onto the resource curse framework directly but are worth including for completeness, since they stem from the same underlying cause (cheap, abundant compute enabling activities that wouldn't otherwise be viable):
These are not Dutch disease effects, just straightforward negative externalities of cheap compute. But they suggest that the full accounting of compute abundance's costs is substantially larger than what the resource curse analogy alone would tell.
2026-04-04 21:46:03
All men are frauds. The only difference between them is that some admit it. I myself deny it.
― H. L. Mencken
I think where I am not, therefore I am where I do not think. I am not whenever I am the plaything of my thought; I think of what I am where I do not think to think.
― Jacques Lacan
Conscience is the inner voice that warns us somebody may be looking.
― H. L. Mencken, again
The Elephant in the Brain by Robin Hanson and Kevin Simler was the piece that first introduced me to the idea. I often felt like the Elephant's takes are overly cynical, and the same goes for other pieces of Hanson's writing. That is, before I read Edward Teach's Sadly, Porn that is outright misanthropic, and still feels pretty accurate whenever I can make any sense of it.
The core thesis in both of these books is somewhat similar. The Elephant in the Brain says that there's a unconscious part, the Elephant, that does self-interested stuff like status-seeking. To be able to present a prosocial personality, we then have a separate layer that interprets the actions of the Elephant in a good light. It's hard to call this lying because to be believable we have to believe it ourselves first. And our brains are great at pattern matching and forgetting conflicting details.
Sadly, Porn goes the other way, or perhaps just further. You've domesticated the Elephant. It no longer dares to do the self-interested actions. It's afraid of failure. The internal narrator is repurposed from defending our selfish actions to others, into explaining our lack of actions to ourselves. We're lying to ourselves, trying to uphold our own story of having a high status. Other people are mostly required for external approval.
Reading these books was like partially breaking the 4th wall of the narrator. It became self-aware. Of course I could be just imagining a minor enlightenment instead of experiencing it. That would be such a Sadly, Porn-style mental move. Perhaps we could test this by consciously changing the actions of the Elephant? How would one interpret the results instead of retreating to another abstraction level with the lies? It seems really hard to point at something you've done and say "the Elephant did that".
Both of these models started to look a bit lacking after actually internalizing them. For a long time, I was having a really hard time identifying any motivating factors besides physical needs, hedonism, and status-seeking, thinking that anyone doing other things was lying to either themselves or others, or both. I still somewhat hold these views; I just don't think that the lying part is so absolute. People also have aesthetic preferences (read: values) that do not have obvious self-interested purpose.
But as the saying goes, "all models are wrong, some are useful". I've found both of these quite useful in modelling how others behave. And how I behave, too, albeit disregarding the narrator's explanations is tedious and squeamish work, as convenient answers look very appealing. And all this self-reflection seems to be mostly for entertainment anyway, as for actual results I use more powerful tools.