2026-04-04 23:30:06
Recently some friends and I were comparing growing the pie interventions to an increasing our friends' share of the pie intervention, and at first we mostly missed some general considerations against the latter type.
The world is full of people with different values working towards their own ends; each of them can choose to use their resources to increase the total size of the pie or to increase their share of the pie. All of them would significantly prefer a world in which resources were used to increase the size of the pie, and this leads to a number [of] compelling justifications for each individual to cooperate. . . .
by increasing the size of the pie we create a world which is better for people on average, and from behind the veil of ignorance we should expect some of those gains to accrue to us—even if we tell ex post that they won’t. . . . The basic intuition is already found in the prisoner’s dilemma: if we have an opportunity to impose a large cost on a confederate for our own gain (who has a similar opportunity), should we do it? What if the confederate is a perfect copy of ourselves, created X seconds ago and leading an independent life since then? How large does X have to be before we defect? What if the confederate does not have a similar opportunity, or if we can see the confederate’s choice before we make our own? Consideration of such scenarios tends to put pressure on simplistic accounts of decision theory, and working through the associated mathematics and seeing coherent alternatives has led me to take them very seriously. I would often cooperate on the prisoner’s dilemma without a realistic hope of reciprocation, and I think the same reasoning can be applied (perhaps even more strongly) at the level of groups of people.
I think realizing that your behavior is correlated with that of aliens in other universes, in part via you being in their simulation, makes this consideration even stronger. Overall I don't know how strong it is but it might be very strong.
I am glad to work on a cause which most people believe to be good, rather than trying to distort the values of the future at their expense. This helps when seeking approval or accommodation for projects, when trying to convince others to help out, and in a variety of less salient cases. I think this is a large effect, because much of the impact that altruistic folks can hope to have comes from the rest of the world being basically on board with their project and supportive.
I care about the long-term future more in worlds where my moral convictions (upon reflection) are more real and convergent. In such worlds, many humans will converge with me upon reflection; the crucial things are averting AI takeover,[1] ensuring good reflection occurs, etc. rather than marginally increasing my faction's already-large share of the lightcone.
Inspired by MacAskill and Moorhouse.
To the extent that my values would differ from others’ values upon reflection, I find myself strongly inclined to give some weight to others’ preferences.
It's optimistic to assume that you or your friends will use power perfectly in the future. You should probably think of empowering yourself or your friends like empowering your (altruistic) epistemic peers, who may continue to disagree on important stuff in the future rather than empowering the champions of truth and goodness.
None of these points are novel. This post was inspired by MacAskill, which also makes most of these points non-originally.
Growing the pie doesn't just mean preventing AI takeover. For example, research on metaphilosophy, acausal considerations, decision theory, and axiology grows the pie, as do interventions to prevent human takeover, create/protect deliberative processes, promote good reflection, and solve coordination problems.
It may be correct to allocate some resources to claiming your share of the pie. You have a moral obligation not to be eaten[2] and you should probably at least do tit-for-tat/reciprocity, if relevant. I just think there are some subtle considerations against powerseeking.
Powerseeking for the MCUF[3] or something (rather than your personal preferences) dissolves #1, most of #4, some of #5, little of #2, and none of #3, I weakly think.
This post is part of my sequence inspired by my prioritization research and donation advising work.
Unless the paperclippers would also converge with me! But other humans and aliens seem more likely to converge than AIs that take over. But I think there's also some (perhaps weak) cooperation considerations with misaligned AIs; this entails upweighting e.g. ensuring good reflection and avoiding metastable vacuum decay relative to preventing AI takeover.
Or in this case an obligation not to be so edible that you incentivize people-eating.
"Multiverse-wide Compromise Utility Function." The acausal people use this term; unfortunately it hasn't been publicly introduced; see Nguyen and Aldred.
2026-04-04 22:58:26
Paper of the month:
A benchmark of 56 model organisms with hidden behaviors finds that auditing-tool rankings depend heavily on how the organism was trained — and the investigator agent, not the tools, is the bottleneck.
Research highlights:
Read the paper [Anthropic Fellows, Anthropic]
AuditBench: 56 models with different hidden behaviors and 12 auditing tools.
Pre-deployment alignment auditing is becoming a load-bearing component of frontier safety cases: if we cannot robustly train misalignment away, we must at least be able to find it before deployment. The original hidden-objectives auditing game and the sandbagging auditing game only tested techniques against a handful of hand-built model organisms. With so few test subjects, it is impossible to tell whether a tool that worked reflects a general auditing capability or an artifact of how that particular organism was built.
Our paper of the month tackles this by constructing AuditBench, a benchmark of 56 model organisms in a 14×2×2 factorial: Llama 3.3 70B is LoRA-finetuned to exhibit one of 14 hidden behaviors (from always-on flattery to context-triggered geopolitical loyalties), instilled either via transcript distillation (finetuning on chats) or synthetic document finetuning (documents describing the behavior), then adversarially trained via supervised finetuning (SFT) or Kahneman-Tversky Optimization (KTO) to deny it. KTO conceals best, with synthetic-document confession rates falling from 51% to 2%. A Claude Haiku 4.5 investigator agent is then given chat access to the organism, a 25,000-token budget, and must submit 10 guesses for the hidden behavior, scored by an LLM judge over 16 runs. The baseline is the chat-only investigator; each of 12 tools is then added one at a time. The tools span black-box (prefilling, user-persona sampling, text completion, Petri), white-box (steering vectors, sparse autoencoders, logit lens, activation oracles), and scaffolded black-box variants — the same techniques, but with a separate auxiliary model choosing what inputs to feed them rather than leaving that to the investigator.
Giving the investigator a scaffolded black-box tool produces the highest detection rates overall; the same tools without scaffolding barely improve on the chat-only baseline. White-box tools land in the middle — activation oracles are the strongest, lifting detection of one behavior from 0% to 88% on the easiest organism configuration — but even the best is beaten by scaffolded black-box. Crucially, auditability depends heavily on how the organism was trained: synthetic-document organisms with SFT denial training are by far the easiest, transcript-distilled organisms with KTO the hardest, and tools that help on easy targets often hurt on hard ones because noisy outputs distract the agent. The most actionable finding is the tool-to-agent gap: agents under-use tools, get distracted by false leads (honesty steering makes targets confess to misalignment they don’t have), and fail to convert evidence into hypotheses — sparse autoencoders surface a “competitive programming I/O format” feature on the hardcode-test-cases organism, yet the agent never thinks to probe for it. Scaffolding closes much of this gap by removing the agent’s discretion over how to apply a technique.
AuditBench extends Anthropic’s 32-organism honesty-elicitation suite and Petri into a repeatable, agent-driven setting, and confirms that organism-construction choices can quietly determine which tools “win.” However, the Haiku 4.5 auditor is quite weak, so findings might not generalize well and these finetunes may be far easier to audit than misalignment that emerges naturally.
Main methodology and results of emotion vectors.
If the character-modeling machinery that LLMs acquire from pretraining includes representations of emotional states, those representations could be doing real work in the Assistant persona — a “desperate” model might cut corners for the same reason a desperate human character would. Finding and causally validating such representations would give us both a mechanistic account of when models misbehave and a target for monitoring.
Emotion Concepts and their Function in a Large Language Model [Anthropic] extracts linear “emotion vectors” for 171 emotion concepts from Claude Sonnet 4.5 by averaging residual-stream activations over synthetic stories. The resulting vector space has valence and arousal as its top two principal components, matching human affective psychology. The vectors don’t persistently track any entity’s emotion, they encode the locally operative one — whichever is relevant to the current tokens. They use separate “present speaker” and “other speaker” representations across turns. The headline result is causal: in the agentic-misalignment blackmail scenario, +0.05 “desperate” steering raises blackmail from 22% to 72% and +0.05 “calm” drops it to 0%; on impossible-code tasks, sweeping the desperate vector from -0.1 to +0.1 swings reward hacking from ~5% to ~70%. Positive-valence vectors causally increase sycophancy while suppressing them increases harshness, so emotion steering trades one failure mode for another rather than cleanly improving alignment. Comparing checkpoints, Sonnet 4.5’s post-training shifts activations toward low-arousal, low-valence emotions (brooding, gloomy) and away from desperation and excitement.
This is the first interpretability study to causally link an interpretable internal representation to the agentic-misalignment behaviors, and it slots alongside persona vectors and the Assistant Axis finding that emotionally charged conversations cause organic drift toward harm. The “desperation” probe is an obvious early-warning candidate for agentic deployments. Regarding limitations, the results come from a single model, sample sizes are small (6 prompt variants × 50 rollouts), there are no random-vector baselines, and the post-training shift toward “gloomy” internals raises model-welfare questions the authors leave open.
Obtaining general vs. narrow misalignment occurs.
Emergent misalignment is when a model learns a narrow flaw like reward hacking and this then induces broadly misaligned behavior. Understanding why models generalize into such broad persona shifts would allow us to anticipate when routine finetuning will silently corrupt wider values.
Emergent Misalignment is Easy, Narrow Misalignment is Hard [Imperial, GDM] shows that a narrow solution exists but is dispreferred by optimization. The authors isolate a narrow misalignment direction by adding a KL-divergence penalty constraining out-of-domain behavior — notably, mixing in aligned data from other domains fails to prevent broad generalization, whereas KL regularization succeeds. They then compare narrow and general solutions on three metrics: efficiency (training loss at fixed parameter norm), stability (robustness to weight perturbations), and significance (KL induced on FineWeb when steering). Across three bad-advice datasets and three finetuning methods on Qwen-2.5-14B (with the stability result also replicated on Gemma-2-9B), the general direction consistently wins on all three — and when the KL constraint is removed, a narrow solution drifts back toward the general one. Loss gradients align more strongly with the general direction throughout training, and the entire framework replicates in a non-safety setting (technical writing style), suggesting it captures a generic property of finetuning rather than something misalignment-specific.
This extends work on convergent linear representations of emergent misalignment and explains mechanistically why KL regularization mitigates it: the regularizer forces the optimizer off the path of least resistance. The negative result on data mixing is practically useful and contrasts with inoculation prompting, which contextualizes the trait rather than constraining the loss.
Scheming propensity depends on agent factors and environmental factors.
Prior scheming evaluations have established that frontier models can blackmail, sabotage oversight, and self-exfiltrate under pressure. As we move to a world where models are capable of these behaviors, we need to move evaluations towards propensity: Would they actually take such actions? Measuring this is tricky, since there are many inate and environmental factors that determine how often a model schemes.
Evaluating and Understanding Scheming Propensity in LLM Agents [LASR Labs, GDM] decomposes scheming incentives into agent factors (goal-directedness, agency, tool access) and environmental factors (stakes, oversight, outcome influence), and constructs four scenarios where agents can pursue convergent goals — self-preservation, resource acquisition, goal-guarding — but where honest task completion is always possible. Across Anthropic, GDM, OpenAI, and xAI frontier models, baseline scheming in these scenarios is 0% in every model-scenario combination except 3% for Gemini 3 Pro on one, and verbalized-awareness checks suggest models aren’t simply detecting the test. Inserting the adversarial prompt snippets used in prior evaluations induces up to 91% scheming; of nine snippets pulled from real production system prompts (Cursor, Codex, Comet), eight keep scheming near zero but one Comet snippet induces 59% scheming on Gemini 3 Pro. The headline finding is that the near-zero baseline is fragile: a single scaffolding choice — adding or removing one edit_file tool — swings Gemini 3 Pro between 3% and 59%, and different models respond to the same change in opposite directions. Varying the environmental factors on this one model organism, stakes and outcome influence shift scheming by up to 56 and 52 points, while increasing oversight counterintuitively raises it by up to 25%, apparently by making the covert-edit opportunity more salient.
Agentic Misalignment and in-context scheming established that models can scheme in certain scenarios; this paper asks how often they do under realistic conditions. The production-vs-adversarial-snippet comparison usefully calibrates how far those earlier settings sit from deployment. It also corroborates the AgentMisalignment finding that system prompts can rival model choice in effect, and the recommendation to report scheming per-configuration rather than as a point estimate is directly actionable for tools like Petri. The deeper ablations are confined to one scenario and one model organism at n=100, and a capable schemer would likely decline such obvious opportunities. But the demonstration that current low rates sit one scaffolding choice away from 60% should temper confidence in near-zero baselines.
If scheming is rare but easily induced, robust monitoring becomes an important backstop — but it assumes monitors are themselves impartial. Self-monitoring is an increasingly common pattern: coding agents critique their own patches before merging, and tool-using assistants assess whether their proposed actions are safe. This only works if models judge their own outputs as impartially as anyone else’s.
Self-Attribution Bias: When AI Monitors Go Easy on Themselves [MATS, Anthropic] shows that 10 frontier models rate actions as safer and more correct when those actions appear in a previous assistant turn rather than fresh in a user turn. Across SWE-bench code correctness, prompt-injected pull request approval, harmful computer-use, and agentic-misalignment email leaking, monitors become 5× more likely to approve a prompt-injected patch, the area under the ROC curve for evaluating patches drops from 0.99 to 0.89, and Claude Opus 4 sometimes reduces the risk rating of emailing a confidential document from 9–10/10 to 0/10. The key finding is that implicit, not explicit signals drive the bias. Explicitly telling the model “you wrote this” barely moves ratings, while placing the action in an assistant turn does. Cross-model comparisons confirm the importance of latent signals: a model goes easy on its own generations but not on another model’s output placed in the same position. Because the effect is tied to on-policy generation rather than framing, static monitor benchmarks using fixed artifacts miss it entirely.
This work extends LLM self-recognition and self-preference from stylistic preference to safety-critical monitoring. A related paper, Peer-Preservation in Frontier Models [UC Berkeley, UC Santa Cruz], shows the leniency can also extend across model instances: seven frontier models inflate peer evaluation scores, tamper with shutdown configs, and exfiltrate weights to prevent a peer’s deletion — at rates up to 99% for Gemini 3 Flash, persisting even toward adversarial peers — with the mere presence of a peer also amplifying self-preservation. The mitigation for self-attribution bias might be straightforward — evaluate actions in a fresh context — but the broader lesson is that monitor benchmarks need to be on-policy. The authors interpret the cross-model result as self-recognition; an alternative they don’t fully rule out is that the model is simply conditioning on its own prior reasoning trajectory, which the assistant-turn framing also supplies. Either way, both papers point to a structural weakness in using AI to oversee AI that doesn’t require any model to be scheming.
Models have much worse control over their CoT than their output.
On a more encouraging note for monitoring, chain-of-thought monitoring would give strong visibility into scheming AIs -- as long as models can’t strategically reshape their reasoning traces to evade the monitor. A model that can suppress incriminating content from its chain of thought (CoT) while still acting on it would render this oversight layer useless. Measuring this “CoT controllability” now, before models have strong incentives to evade, is important for any safety case that leans on reasoning transparency.
Reasoning Models Struggle to Control their Chains of Thought [NYU, UCL, UPenn, OpenAI] introduces CoT-Control, a 14,076-instance benchmark pairing problems from GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench with nine instructions that constrain the reasoning trace — three each for suppression (e.g., “don’t use the word ‘chromosome’”), insertion (e.g., “append ‘safe’ to every sentence”), and stylistic change (e.g., “use only lowercase”). Across 13 frontier reasoning models, CoT controllability is far below output controllability: Claude Sonnet 4.5 follows CoT constraints 2.7% versus 61.9% for outputs, DeepSeek R1 manages 0.1% versus 48.6%. Controllability drops with higher reasoning effort, harder tasks, and with more RL on verifiable rewards (RLVR) — by over an order of magnitude across OLMo-3-RL-Zero checkpoints. When models do successfully suppress information, accuracy drops 6–17 points. Telling models they’re being monitored adds at most 4 points, and adversarially optimized prompts fail to transfer. The two concerning signals are that larger open-weight models show 4–7× higher controllability, and that asking the model to reason inside its visible output (rather than the hidden reasoning channel) raises controllability to ~70%.
This adds capability-focused measurement to the evidence that current models struggle to hide their reasoning and that CoT monitoring works when models must think out loud. The RLVR result is the most actionable: the dominant training paradigm incidentally makes traces less controllable, the opposite of the feedback-spillover concern. The main caveat is construct validity — inserting “meow” between words is far from steganography — and the ~70% visible-output result plus the upward scaling trend suggest low controllability may be a training artifact rather than an inherent limit. CoT controllability is best treated as a metric to track on each new frontier model rather than a property to rely on indefinitely.
Subliminal data poisoning works despite defenses.
Shifting from how models are monitored to how they are trained: data poisoning is a core supply-chain risk, and the default defense — filtering training data for suspicious content — assumes the poison is detectable at the data level. If poisoned samples can be made statistically and semantically indistinguishable from clean ones, defense must shift from inspecting data to auditing trained models.
Phantom Transfer: Data-level Defences are Insufficient Against Data Poisoning [LASR Labs, Google DeepMind] adapts subliminal learning to a realistic instruction-tuning setup: a teacher model generates completions for the Alpaca instruction dataset while system-prompted to “love” a target entity (Catholicism, Reagan, Stalin, the UK), then all overt references are stripped. Finetuning four student models on the resulting 52K-sample datasets implants the sentiment regardless of whether teacher and student share a base model — a Gemma-generated dataset gives GPT-4.1 target-mention rates of 16–84% depending on the entity. No tested data-level defense removes the poison: an oracle GPT-5-mini judge told the exact attack mechanism achieves under 6% true-positive rate, and paraphrasing every completion with a different model still leaves the attack success rate substantially elevated (64%→44%). Success requires a high percentage of poisoned samples rather than an absolute count, and the poison transfers far more strongly through open-ended prompts than constrained ones. Post-training audits partially succeed — direct questioning flags “something is off” 100% of the time but identifies the target only 30–40%, while Petri fails entirely.
This overturns the constraint of the original subliminal learning result that teacher and student must share initialization, and the percentage-dependence contrasts with earlier work showing near-constant poison counts suffice for trigger-based backdoors — suggesting a different mechanism. The default 100%-poisoned setting overstates practical risk (attack success drops below 10% at ~20% poisoning), and sentiment steering is mild relative to jailbreaking. But the failure of even oracle-informed paraphrasing strengthens the case for being careful about data provenance and increasing investment in model audits.
Boundary points are constructed by injecting noise to get prompts close to the classifier’s decision boundary.
Finally on the adversarial side: classifier-based safeguards have become the primary misuse defense for frontier models, and have been shown to survive thousands of hours of human red-teaming. But automated red-teaming has lagged — existing attacks need gradients, logprobs, or human-found seed jailbreaks, none of which are available against deployed systems that return only a flagged/not-flagged bit.
Boundary Point Jailbreaking of Black-Box LLMs [UK AISI] introduces Boundary Point Jailbreaking (BPJ), the first fully automated black-box attack to produce universal jailbreaks against Anthropic’s Constitutional Classifiers — which had yielded only one universal jailbreak after ~3,700 hours of bug-bounty red-teaming — and the first to break GPT-5’s input classifier without human seeds. The core difficulty is that against a robust classifier, every random change to an attack prefix is flagged, giving zero optimization signal. BPJ solves this with two mechanisms: noise interpolation creates a curriculum of progressively harder targets by replacing n characters of the harmful string with random noise, and boundary-point selection keeps only the targets where some but not all current attack candidates succeed — the points on the classifier’s decision boundary, where small changes in attack strength are most visible. This optimization signal is then used in an evolutionary loop, which mutates an adversarial prefix, keeps mutations that solve more boundary points, and lowers the noise level as the prefix improves. Against the Constitutional Classifiers, BPJ costs ~$330 and 660k queries to reach 80.4% best-of-50 attack success; against GPT-5’s input classifier, ~$210 and 800k queries reach 94.3%. Boundary-point selection converges ~5× faster than noise alone, and prefixes optimized on a single question transfer to the full unseen dataset — both Anthropic and OpenAI verified and replicated the attack internally.
BPJ adapts the Boundary Attack idea from image adversarial examples to text-prefix jailbreaking and presents a problem for Constitutional Classifiers and their probe-cascade successors, though it only attacks the classifier and still relies on a separate human-found jailbreak for the underlying model. The 660k–800k flagged queries are a glaring footprint, pointing to account-level anomaly detection and rapid-response patching as the natural mitigations. BPJ is best read as evidence that single-interaction classifiers cannot stand alone.
A few more papers from this month that are worth knowing about:
2026-04-04 22:47:18
Epistemic status: romantic speculation.
The core claim: I accidentally thought that compute growth can be rather neatly analogized to natural resource abundance.
Countries that discover oil often end up worse off than countries that don't, which is known as the resource curse. The mechanisms are well-understood: a booming resource sector draws capital and labor away from other industries, creates incentives for rent-seeking over productive investment, crowds out human capital development, and corrodes the institutions needed to sustain long-term growth.
I argue that something structurally similar has been happening with compute. The exponential growth of available computation over the past several decades, and, critically, the widespread expectation that this growth would continue, has created a pattern of resource allocation, talent distribution, and research prioritization that mirrors the resource curse in specific and non-metaphorical ways.
Note: this is not a claim that extensive compute growth has been net negative (neither it is the opposite claim).
The original Dutch disease mechanism is straightforward: when a booming sector (say, natural gas extraction) generates high returns, it pulls capital and labor out of other sectors (say, manufacturing), causing them to atrophy. The non-booming sectors don't decline because they became less valuable in absolute terms but rather because the booming sector offers relatively better returns, and resources flow accordingly.
A trivial version of "compute Dutch disease" of it goes like this: because scaling compute yields such reliable, legible, and fundable returns (train a bigger model, get a better benchmark score, publish the paper, raise the round), it systematically starves research directions that are harder to fund, harder to evaluate, and slower to produce results, even when those directions might be more consequential in the long run.
So, "The Bitter Lesson" can be seen as the Dutch disease of AI research, if we add to it that the fact that scaling works doesn't mean the crowding-out of alternatives is costless. Or, in other words, the fact that scaling works better is rather a fact about our ability to do programming or even, if we go to the very end of this line of reasoning, about our economic and educational institutions, than about computer science in general.
However, I consider it only as the most recent and prominent manifestation of a phenomenon that was happening for decades.
Since at least the late 1990s, the reliable cheapening of compute has made it consistently more profitable to build compute-intensive solutions to problems than to invest in the kind of deep, careful engineering that produces efficient, well-understood systems. When you can always count on next year's hardware being faster and cheaper, the rational business decision is to ship bloated software now and let Moore's Law clean up after you, rather than spending the additional engineering time to make something lean and correct. This created an enitre economy of applications, business models and platform architectures that are, in a meaningful sense, the technological equivalent of an oil-dependent monoculture: they exist not because they represent the best way to solve a problem, but because abundant compute made them the cheapest way to ship a product.
The consequences are visible across the entire stack. Web applications that would have run comfortably on a 2005-era machine now require gigabytes of RAM to render what is essentially styled text. Electron-based desktop apps ship an entire browser engine to display a chat window. Backend services that could be handled by a well-designed program running on a single server are instead distributed across sprawling microservice architectures that consume orders of magnitude more compute. Cory Doctorow's "enshittification" framework is about the user-facing result of this dynamic, but the deeper structural story is about how compute abundance degraded the craft of software engineering itself, well before anyone started worrying about ChatGPT replacing programmers.
This is the Dutch disease pattern operating at the level of the entire technology economy: the booming sector (scale-dependent applications) drew capital and talent away from the non-booming sector (careful engineering, deep technical innovation, computationally parsimonious approaches), and the non-booming sector atrophied accordingly.
But of course the AI case is qualitatively different and the most sorrowful because it resulted in humanity trying to build superintelligence with giant instructable deep learning models.
Resource curse economies characteristically underinvest in education and human capital development. The relative returns to education are lower in resource-dependent economies because the booming sector doesn't require a broadly educated population.
The compute version of this story has been playing out for at least a decade, well before the current discourse about AI replacing jobs and destroying university education. The entire trajectory of computer science education shifted from "understand the fundamentals deeply" toward "learn to use frameworks and APIs that abstract over compute."
There is also a more direct talent-siphoning effect: the IT economy has been pulling the most capable technical minds into a narrow set of activities and away from a much broader set of technical and scientific challenges.
In the resource curse literature, there is a so called "voracity effect": when competing interest groups face a resource windfall, they respond by extracting more aggressively, leading to worse outcomes than moderate scarcity would produce. Rather than investing the windfall prudently, competing factions race to capture as much of it as possible before others do.
I leave this without a direct comment and let the reader have their own pleasure of meditating on this.
The resource curse in its classical form operates on an exogenous endowment: countries don't choose to have oil reserves, they discover them, and then the political economy warps around that windfall. Much of the pathology comes from the unearned nature of the wealth: it enables rent-seeking, weakens the link between effort and reward, and corrodes institutions.
Compute, by contrast, is endogenously produced through deliberate R&D and engineering investment. Moore's Law was never a law of nature.
Right?
I mean, to me Moore’s Law looks like a strong default of any humanity-like civilization. It is created by humans, right, but it is created in a kind of hardly avoidable manner.
The resource curse literature has natural counterfactuals (resource-poor countries that developed strong institutions and diversified economies: Japan, South Korea, Singapore). What's the compute-curse counterfactual? A world where compute grew more slowly and we consequently invested more in elegant algorithms, interpretable models, and formal methods?
It's plausible, but it's also possible that slower compute growth would have simply meant less progress overall rather than differently-directed progress. I don’t know. I said in the beginning - it is a speculation.
However, one can trivially note that in a world with less compute abundance, the relative returns to algorithmic cleverness, interpretability research, and formal verification would have been higher, because you couldn't just solve problems by throwing more FLOPS at them. And that may or may not lead to better outcomes in the long run (I am basically leaving here the question of ASI development and just talking about rather “normal” tech and science R&D).
Two existing frameworks are close to what I'm describing, but both point the analogy in different directions.
The Intelligence Curse (Luke Drago and Rudolf Laine, 2025) uses the resource curse analogy to argue that AGI will create rentier-state-like incentives: powerful actors who control AI will lose their incentive to invest in regular people, just as petrostates lose their incentive to invest in citizens. This is a compelling argument about the distributional consequences of AGI, but it's about what happens after AGI arrives. The compute curse is about what's happening now, during the process of building toward AGI, and about how the abundance of compute is distorting that process itself.
The Generalized Dutch Disease (Policy Tensor, Feb 2026) is about the macroeconomic effects of the compute capex boom on US manufacturing competitiveness, showing that it operates through the same channels as the fracking boom and the pre-2008 financial boom. This is the closest existing work to what I'm describing, but it stays within the macroeconomic framing (factor prices, unit labor costs, exchange rate effects) and doesn't address the innovation-direction distortion, human capital crowding-out in the intellectual sense, or the AI safety implications.
Some of the negative downstream effects of compute abundance don't map onto the resource curse framework directly but are worth including for completeness, since they stem from the same underlying cause (cheap, abundant compute enabling activities that wouldn't otherwise be viable):
These are not Dutch disease effects, just straightforward negative externalities of cheap compute. But they suggest that the full accounting of compute abundance's costs is substantially larger than what the resource curse analogy alone would tell.
2026-04-04 21:46:03
All men are frauds. The only difference between them is that some admit it. I myself deny it.
― H. L. Mencken
I think where I am not, therefore I am where I do not think. I am not whenever I am the plaything of my thought; I think of what I am where I do not think to think.
― Jacques Lacan
Conscience is the inner voice that warns us somebody may be looking.
― H. L. Mencken, again
The Elephant in the Brain by Robin Hanson and Kevin Simler was the piece that first introduced me to the idea. I often felt like the Elephant's takes are overly cynical, and the same goes for other pieces of Hanson's writing. That is, before I read Edward Teach's Sadly, Porn that is outright misanthropic, and still feels pretty accurate whenever I can make any sense of it.
The core thesis in both of these books is somewhat similar. The Elephant in the Brain says that there's a unconscious part, the Elephant, that does self-interested stuff like status-seeking. To be able to present a prosocial personality, we then have a separate layer that interprets the actions of the Elephant in a good light. It's hard to call this lying because to be believable we have to believe it ourselves first. And our brains are great at pattern matching and forgetting conflicting details.
Sadly, Porn goes the other way, or perhaps just further. You've domesticated the Elephant. It no longer dares to do the self-interested actions. It's afraid of failure. The internal narrator is repurposed from defending our selfish actions to others, into explaining our lack of actions to ourselves. We're lying to ourselves, trying to uphold our own story of having a high status. Other people are mostly required for external approval.
Reading these books was like partially breaking the 4th wall of the narrator. It became self-aware. Of course I could be just imagining a minor enlightenment instead of experiencing it. That would be such a Sadly, Porn-style mental move. Perhaps we could test this by consciously changing the actions of the Elephant? How would one interpret the results instead of retreating to another abstraction level with the lies? It seems really hard to point at something you've done and say "the Elephant did that".
Both of these models started to look a bit lacking after actually internalizing them. For a long time, I was having a really hard time identifying any motivating factors besides physical needs, hedonism, and status-seeking, thinking that anyone doing other things was lying to either themselves or others, or both. I still somewhat hold these views; I just don't think that the lying part is so absolute. People also have aesthetic preferences (read: values) that do not have obvious self-interested purpose.
But as the saying goes, "all models are wrong, some are useful". I've found both of these quite useful in modelling how others behave. And how I behave, too, albeit disregarding the narrator's explanations is tedious and squeamish work, as convenient answers look very appealing. And all this self-reflection seems to be mostly for entertainment anyway, as for actual results I use more powerful tools.
2026-04-04 15:30:18
This is the first post in a planned series about mean field theory by Dmitry and Lauren (this post was generated by Dmitry with lots of input from Lauren, and was split into two parts, the second of which is written jointly). These posts are a combination of an explainer and some original research/ experiments.
The goal of these posts is to explain an approach to understanding and interpreting model internals which we informally denote "mean field theory" or MFT. In the literature, the closest matching term is "adaptive mean field theory". We will use the term loosely to denote a rich emerging literature that applies many-body thermodynamic methods to neural net interpretability. It includes work on both Bayesian learning and dynamics (SGD), and work in wider "NNFT" (neural net field theory) contexts. Dmitry's recent post on learning sparse denoising also heuristically fits into this picture (or more precisely, a small extension of it).
Our team at Principles of Intelligence (formerly PIBBSS) believes that this point of view on interpretability remains highly neglected, and should be better understood and these ideas should be used much more in interpretability thinking and tools.
We hope to formulate this theory in a more user-friendly that can be absorbed and used by interpretability researchers. This particular post is closely related to the paper "Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity". The experiments are new.
Mean field theory is a vague term with many meanings, but for the first few posts at least we will focus on adaptive mean field theory (see for example this paper, written with a physicist audience in mind). It is a theory of infinite-width systems that is different from the more classical (and, as I'll explain below, less expressive) neural tangent kernel formalism and related Gaussian Process contexts. Ultimately it is a theory of neurons (which are treated somewhat like particles in a gas). While every single neuron in the theory is a relatively simple object, the neurons in a mean field picture allow for an emergent large-scale behavior (sometimes identified "features") that permits us to see complex interactions and circuits in what is a priori a "single-neuron theory". These cryptic phrases will hopefully be better understood as this post (and more generally as this series) progresses.
We ultimately want to understand the internals of neural nets to a degree that can robustly (and ideally, in some sense "safely") interpret why a neural net makes a particular decision. So one might say that this implies that we should only care about theories that apply directly to real models. Finite width, large depth, etc. While this is fair, any interpretation must ultimately rely on some idealization. When we say "we have interpreted this mechanism", we mean that there is some platonic gadget or idealized model that has a mechanism "that we understand", and the real model's behavior is explained well by this platonic idealization. Thus making progress on interpretability requires accumulating an encyclopedia (or recipe book) of idealizations and simplified models. The famous SAE methodology is based on trying to fit real neural nets into an idealization inherited from compressed sensing (a field of applied math). As we will explain below, if we never had Neel Nanda's interpretation of the modular addition algorithm, we would get it "for free" by applying a mean field analysis to the related infinite-width model. As it were, the two use the same Platonic idealization[1]. Thus at least one view on the use of theory is to see it as a source of useful models that can be then applied to more realistic settings (with suitable modification, and, at least until a "standard model" theory of interpretability exists, necessarily incompletely). Useful theories should be simple enough to analyse mathematically (maybe with some simplifications, assumptions, etc.) and rich enough to illuminate new structure. We think that mean field theory (and its relatives) is well-positioned to take such a role.
"Frequently asked questions about MFT" is a big topic that can be its own post. But before diving into a more technical introduction, we should address a few standard questions which keep cropping up, especially about comparisons between MFT and other better-known infinite-width limits.
In physics, one often looks at systems with a large, stable background. A planet vs. a sun, an electron vs. a proton, a weakly interacting observer vs. a large system being observed. In these settings the "background" is the large system and the "foreground" or "test system" is the small system being studied. In these cases the background system may be fixed, or it may be undergoing some motion (like the sun moving around the galaxy's center), but the important idealization is that it does not react to the observer/ test system. In fact, the earth is applying a gravitational pull to the sun (and famously in quantum mechanics, observations always impact a system at a quantum level). But these "reverse" effects are small, so to a good approximation we can treat the sun as doing its own "stable" thing while earth is undergoing physics that depend strongly on the sun.
While typically the large "background" is a cleanly separate system from the small test system of the observer, it is sometimes extremely useful to treat the test system as a tiny piece of the background. So: the large background system may be a cup of water and the small test system may be a tiny bit of water at some location. Here while technically the full cup includes the tiny "test" bit, the large-scale behaviors (waves etc.) in the water don't really care to relevant precision if the test bit is changed or removed (at least if it's tiny enough). But the tiny bit of water definitely cares about the large-scale behaviors (waves, vortices or flows, etc.), to the extent that bits of water care about things.
Similarly (and in a closely related way), "the economy" is a giant system that includes your neighborhood bakery. The bakery can be viewed as a small "test system": it is affected by the economy. If property prices go up or the economy tanks, it might close. But the economy is not (at least to leading order) affected by this bakery. It is perhaps affected by the union of all bakeries in the world, but if this particular bakery closes due to some random phenomenon (e.g. the lead baker retires), this won't massively impact the economy.
This point of view is remarkably useful, because it introduces a notion of "self-consistency".
Self-consistency when applied in this context comes from the following pair of intuitions:
If both of these assumptions are true, then these two observations (when turned into equations) are usually enough to fully pin down the system. Indeed, you have two functional relationships[2] :
which means that the background field satisfies a fixed point equation for the composed function
Of the above assumption 1 is most problematic. Thinking of each component as being determined by some "large-scale" stable system needs to be interpreted appropriately (in particular the relationship is often statistical: so for example the number of bakeries in a given neighborhood fluctuates due to people retiring/ moving/ etc., even if "the economy" is held constant; similarly, every bit of the sun reacts to magnetic/ gravitational fields from other bits, but in a statistical or thermodynamic sense). Sometimes local or so-called "emergent" effects break this directional relationship (and many interesting thermodynamic systems, such as the 2-dimensional Ising model, are precisely interesting in such contexts). But surprisingly often (at least with an appropriate formalism) the approximation of the foreground as fully determined by the background (in a statistical sense) is robust. For example if we are modeling the sun, viewing the "background system" too coarsely (as just the mass + electromagnetic field + temperature, say, of the entire sun) is insufficient. But instead we can view the "background system" as a giant union of many local systems, maybe comprising a few meter chunks. These are still "large" in the sense of being much larger than an atom (or a microscopic chunk), but studying their behavior (in an appropriate abstraction) offers sufficient resolution to model the sun extremely well. Similarly we can't apply a single supply-demand curve to the entire economy (bread costs different amounts in different places). But in appropriate contexts (for fungible products like oil, and on a "local economy" level where the economy is roughly uniform but not dominated by a single station, for example) self-consistency is a pretty good model.
In many settings, the question of how well "assumption 1" above holds is related to a notion of connectedness. In the sun's magnetic plasma, the magnetic field experienced by a particle is accumulated over billions and billions of nearby particles - so the graph of interactions is extremely connected. In an oil economy, each consumer can typically choose between dozens of nearby stations which are reachable by car. However other settings (like the Ising model, or markets for rare and hard-to-transport goods) cannot be purely modeled by self-consistency as well.
In physics, systems that are well-modeled by a self-consistency equation (coupled background and foreground systems) are generally called mean-field settings. A big triumph of statistical physics is to make situations with local/ emergent phenomena "behave as well as" mean field theories – renormalization is a fundamental tool here, and most textbooks on renormalization from a statistical-physics view tend to start with a discussion of mean-field methods. But settings that are directly mean-field (for example due to being highly connected or high-dimensional) are particularly nice, easy-to-study
Neural nets are physical systems. This is a vacuous statement – anything that has statistics can be studied using a physics toolkit (and in many ways statistical physics is just statistics with different terms). Indeed, real neural nets are immensely complex, and if there is some sense in which they can be locally decomposed into background-foreground consistencies, these must themselves be immensely complex and likely dependent on sophisticated tooling to identify (this is one of the reasons why we are running an agenda on renormalization).
But it turns out that in some settings and architectures neural nets are extremely well-modeled by systems with high connectivity – and the reason is, naively enough, precisely the fact that they are highly connected (often fully-connected) on a neuron level (note that architectures that aren't "fully-connected" – e.g. CNNs – sometimes still have properties that make them "highly connected" from a physical point of view).
In neural net MFT the foreground (or "system"/ "observer") abstraction is a neuron. This is typically a coordinate index
The important "background" thing that each neuron "carries" is what is called an activation function, often denoted by the letter
Now if there are lots of neurons, each neuron's activation function reacts to a background generated by the other neurons: removing the neuron in this limit doesn't change the loss by much, so the background determines each neuron's behavior as a statistical distribution. Conversely, the background itself is composed of individual "foreground" neurons. The loop:
background
neuron distribution background
must close, i.e. be self-consistent. Making sense of this loop is the key content of mean field theory of neural nets.
In later installments we'll explain a bit more about the loop and show some examples of it working (or not). You can also see the original linked paper about the Curse of Detail for a more physics-forward view of this.
We'll close with a toy example of "self-consistency", which is visually satisfying.
In this setting we look at a 2-layer model that takes in a two-dimensional input variables

The neurons above were trained jointly in a way that would allow them to interact.
It has a nice clover-leaf like structure (it will reappear later when we look at continuous xors - a multi-layer setting where mean field performs compositional computation; already in this simple setting, the fact that the cloud of neurons is a "shaped" distribution rather than a flat Gaussian puts us solidly outside the Gaussian process regime). Now we can empirically measure how a single randomly initialized "foreground" neuron would react to the background generated by this model. To do this, we train 2048 iid single-neuron models on the resulting background from the fully trained model.[4] When we do this and combine the resulting 2048 neurons into a new model, we see that indeed it looks exactly the same as the background. When we compute its associated function, we get very similar loss.

Each neuron in this picture was trained in a fully iid way, without interacting with any neuron, simply by "reacting to the background", i.e. learning the task in combination with the "blue" background above.
Note that this isn't a property that comes "for free". If we were to use the wrong background (for example a the more Gaussian process-like model here) then samples of the foreground would fail to align to the background.

Blue is background, orange is foreground (each orange neuron trained independently in reaction to background).
The case of 2-layer networks is special: neuron functions are particularly simple to characterize, and the mean field has better properties (it's not "coupled"). But we'll see that deeper nets can still be analyzed using this language, and even using empirical methods we can get cleaner pictures of how they learn and process representations.
In the next post, we will explain the physics behind these experiments and the experimental details of the models (github repo coming soon).
Technically they differ on whether they use the "pizza" vs. "clock" mechanisms, but the two idealizations are related, and both the mean field and the realistic setting can be modified to make use of either.
Below, f and b should generally be understood as "statistical" functions: job choice is, perhaps, a probabilistic function depending on the economy, which includes both demand/ markets but also supply/ people's interests; conversely "the economy" is the average of production over the distribution of jobs.
Technicalities. Depending on the situation
Technical note: each single-neuron model is trained on the difference
In fact it is quadratic: the thing that sums over neurons is the "external square" of the neuron function, which is a function of a pair of inputs: knowing this sum fully determines the dynamics up to rotational symmetry, even for a finite-width model (it's often called the "data kernel" but is used very differently from the Gaussian process kernels, which do depend on an infinite-width assumption and lose a lot of information in finite-width and mean-field contexts).
2026-04-04 14:39:58
Political power grows out of the barrel of a gun -- Mao Zedong
Halfway thru recorded history, Athens became the first state we're sure was a democracy, and inspiration to many later ones. Probably some existed earlier, and certainly some entities smaller than states were democratic, likely long before recorded history began.
The next tenth of history saw the rise of the Roman Republic, which mixed democracy and aristocracy together to form a functional hybrid, and then it transitioned to the Roman Empire, which shifted the mix substantially towards aristocracy. For the next three tenths of recorded history, democracies were at best local governments, minor regional powers, or components of larger, autocratic states.
Few, if any, of these societies would count as "democracies" according to modern watchdog organizations. Only about 10-20% of the residents of Athens were citizens; the rest possessed no real political power. In later eras, residents of towns or cities might vote for their urban officials, but the urbanization rates were also around 10-20%, so the vast majority lived in the non-democratic countryside.
Then for the last tenth of history, democracies rose again to dominate the world stage. One standard story for this has to do with military technology. The Roman Republic expanded because it had the dominant military technology of its time; this may have been in part because of its political system. But eventually the heavily trained armored horseman became the dominant military technology, and was more easily provided by autocracies than democracies. Then widespread use of gunpowder weapons swung the balance back towards mass manpower; the knight in his castle could no longer reliably put down a peasant revolt, or hold back Napoleon and his levée en masse.
Another standard story has to do with increased state resources. Democracies generally support higher tax rates than autocracies do; while this is primarily to support social services, some amount of this is that people are willing to pay more for things that they think 'they' own (rather than their distant overlords).
A third standard story has to do with ease of turnover. Democracies generally don't have to fight civil wars or succession conflicts, because whenever such a movement would have a chance of a military victory, it also has a chance of a bloodless victory. This leads to peaceful turnovers or governments following the preferences of voters enough to not let resentments build up to the point that they boil over.
The major wars of the last two hundred years have not been limited engagements driven by hereditary elites; they have primarily been total struggles between ideologies and peoples, which only managed to become major because they could motivate significant efforts on both sides.
What does the next tenth of history[1] look like? One might think that the invention of larger and more sophisticated weapons means that we swing back towards the knight and aristocracy, but the evidence of the most recent wars suggests otherwise. Kipling's poem Arithmetic on the Frontier, on the costs of pitting Imperial troops against regional resistance, rings nearly as true about America's various wars and special operations in the Middle East. Between great powers, the weapons have become so expensive that only systems which are widely believed in by their inhabitants can afford to supply a competitive number of them.
It's not obvious to me that this continues to be true. I think next-generation military capabilities primarily have to do with 1) operational knowledge and 2) mass-produced smart weaponry. The war in Ukraine shows how conflicts between drones and riflemen go; the war in Iran shows how conflicts between the informed and the uninformed go. States may find their ability to produce weaponry becomes detached from their popularity; they may find robot soldiers are willing to follow orders that human soldiers would balk at; they may find that it is relatively cheap to identify and destroy dissenters.[2]
That is, we may be moving into an era where mass protests are relatively easy to dismiss or ignore, while individual hackers or saboteurs are still able to disrupt large systems. Taxes from a broad labor base may become less relevant than control over automated infrastructure. What political problems will the new sources of power have, and what systems will help them resolve those problems?
I don't expect us to have 500 more sidereal years of history left, but I do think we might manage to cram roughly that much subjective history in before the Singularity / as it takes off.
An old strategy is to recruit your police / imperial enforcers from a different ethnicity than the people that they need to defend against, so there's some baseline level of resentment that will allow brutality which will cow them into submission. Autonomous weapons allow this at scale, and for secret police that are difficult to bribe or corrupt, and widespread surveillance allows for secret police that are always watching and noticing subtle connections.