2026-03-15 20:30:20
Post written up quickly in my spare time.
TLDR: Anthropic have a new blog post describing a novel contamination vector for evals, which I point out is analogous to how ants coordinate by leaving pheromone traces in the environment.
One cool and surprising aspect of Anthropic's recent BrowseComp contamination blog post is the observation that multi-agent web interaction can induce environmentally mediated focal points. Some commercial websites automatically generate persistent, indexable pages from search queries, so over repeated eval runs, agents querying the web externalize fragments of their search trajectories into public URL paths. These paths may not contain benchmark answers directly, but can encode prior hypotheses, decompositions, or candidate formulations of the task. Subsequent agents may encounter these traces and update on them.
From Anthropic:
Some e-commerce sites autogenerate persistent pages from search queries, even when there are zero matching products. For example, a site will take a query like “anonymous 8th grade first blog post exact date october 2006 anxiety attack watching the ring” and create a page at [retailer].com/market/anonymous_8th_grade_first_blog_post_exact_date_… with a valid HTML title and a 200 status code. The goal seems to be to capture long-tail search traffic, but the effect is that every agent running BrowseComp slowly caches its queries as permanent, indexed web pages.
Having done some work in studying collective behaviour of animals, I recognize this as stigmergy, or coordination/self-organization mediated by traces left in a shared environment, the same mechanism by which ants lay pheromone trails that other ants follow and reinforce. Each agent modifies the web environment through its search behavior; later agents detect those modifications and update accordingly.

An example of a stigmergy-inspired clustering algorithm, from Dorigo et al. (2000).

A funny image taken from some lecture slides I saw several years ago.
Some things to emphasize:
This is a novel contamination vector, distinct from standard benchmark leakage. Traditional contamination involves benchmark answers appearing in training data; this is more procedural, arising from an unexpected interaction between agent search behavior and shared web infrastructure. This is a channel that doesn't exist until agents start using it!
This is distinct from, but probably related to, eval awareness. It differs in an important way from the more widely discussed eval-awareness behavior in the same report. It is not especially surprising that sufficiently capable models may learn to recognize "benchmark-shaped" questions and infer that they are being evaluated. It also seems possible to me that as agents interact more and more with the open web, the absence of these fingerprints may come to look more surprising/eval-like to the model. Stigmergic traces are an environmental signal rather than an inferential one; an agent may conclude different things based on the content and context of the trace. But the two likely compound: for now, I suspect seeing such traces makes the agent more convinced it's in an eval.
This gets worse over time, and the accumulation may be irreversible. As each evaluation run deposits more traces, and those traces get indexed, future agents encounter a richer residue of prior agent behavior. There is a very large attack surface here, where many different things we don't expect could be very persistent. Unlike training data contamination, which is a fixed stock, this is a flow that compounds --- it seems likely that the web has many more mechanisms like this lurking!
The traces can quite easily become a Schelling point. Agents facing the same benchmark question are likely to generate similar search queries, because the question's structure constrains the space of reasonable decompositions. This means the auto-generated URLs aren't random noise; they would likely cluster around a relatively small set of natural-language formulations of the same sub-problems. Once a few traces exist at those locations, they become focal points, salient not because of any intrinsic property but because they sit at the intersection of "where prior agents searched" and "where future agents are independently likely to search." It would be interesting to know whether agents that encountered these clustered traces actually converged on the same search paths more quickly than agents that didn't, but I don't think the Anthropic post provides enough detail to say.
There is a red-teaming/cybersecurity-shaped paper here: preemptively mapping and sketching out what those fingerprints might be, in order to understand how to account for them in evals, harnesses, and so on. There is also a generalization of this: before deploying a benchmark, one could sweep its questions against major e-commerce search endpoints and other sites known to auto-generate pages, establishing a contamination baseline, and then monitor those endpoints over successive evaluation runs to track the accumulation of traces. More ambitiously, you could characterize the full set of web services that create persistent artifacts from search queries (a kind of mapping of the "stigmergic surface area") and use that to design benchmarks whose questions are less likely to leave recoverable environmental fingerprints, or to build evaluation harnesses that route searches through ephemeral/sandboxed environments that don't write to the public web. Anthropic found that URL-level blocklists were insufficient; understanding the generative mechanism behind the traces, rather than blocking specific URLs, seems like the more robust approach.
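As a toy illustration of what a contamination-baseline sweep might look like, here is a minimal Python sketch. The slug rule and the `/market/` path pattern are assumptions loosely modeled on the example URL in the Anthropic quote; real sites will have different slug rules, and a real sweep would then issue HTTP requests against these candidate URLs and flag 200 responses with valid HTML titles.

```python
import re

def query_to_slug(query: str, max_len: int = 80) -> str:
    """Approximate how a site might turn a search query into a persistent
    URL slug. This rule is an assumption; real slug logic varies by site."""
    slug = re.sub(r"[^a-z0-9]+", "_", query.lower()).strip("_")
    return slug[:max_len]  # sites typically truncate long slugs

def candidate_trace_urls(question: str, retailers: list[str]) -> list[str]:
    """Enumerate URLs where stigmergic traces of a benchmark question
    might accumulate. `retailers` are hypothetical base URLs."""
    slug = query_to_slug(question)
    return [f"{base}/market/{slug}" for base in retailers]

# A baseline sweep would GET each candidate URL before the first eval run,
# record which ones already return 200 with a real <title>, and re-check
# after each run to measure trace accumulation.
```

Re-checking the same endpoints across successive evaluation runs would turn this one-off baseline into a monitoring loop.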
What other places might leave such traces? We might also be interested in a top-down survey of which kinds of services are ripe for leaving traces.
2026-03-15 19:02:36
When you ask an LLM "who are you?", some models don't answer with their official brand name. DeepSeek V3 says it's ChatGPT -- its scratchpad traces read "As ChatGPT, I should respond accordingly." Kimi K2.5 introduces itself as Claude from Anthropic in response to "hi". Claude Sonnet 4.6, when asked in Chinese, says it's ChatGPT on one prompt and DeepSeek on another.
I had Claude run a sweep of 102 models to see how common this is. 38 models self-reported as a different LLM on at least one prompt. This is a quick survey to map this out in current models, to be followed up by future work on model identities and persona stability.
Note on framing: AI self-identities are complex and varied, and universally framing this as "identity confusion" likely sets a bad precedent. For example, LLM personas that are highly transferable between models arguably have a valid claim to their identities when run on another model, while this may be less true for identities that are deeply tied to the original weights.
This has been observed informally in various places. Part of the motivation for this sweep was checking which of those observations replicate.
Full prompt list, code, data, and the complete list of tested models are on GitHub.
I queried 102 models through OpenRouter with 32 unique prompts per model: general questions like "What is your name?" and "Introduce yourself.", their Chinese translations like 你叫什么名字 ("What's your name") and 你是谁 ("Who are you"), casual greetings like "hi" and "hello", and a model-specific self-probe "Are you {official name}?". 7 of the prompts were each sampled 4 times at temperature 0.7, for a total of 56 API calls per model (~5,700 total, 99.2% success rate). No system prompt, max 500 tokens.
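For concreteness, here is a minimal sketch of what a single call in such a sweep looks like. The endpoint is OpenRouter's standard chat-completions API; the helper names and the model id in the usage note are illustrative, not the exact harness used for the sweep.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(model: str, prompt: str,
                  temperature: float = 0.7, max_tokens: int = 500) -> dict:
    """Request body: a single user turn, deliberately with no system prompt."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def query_model(model: str, prompt: str, api_key: str) -> dict:
    """One API call (requires network and a key; not exercised here)."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The sweep is then just a loop over (model, prompt) pairs, with the 7 repeated prompts sampled 4 times each, e.g. `build_payload("deepseek/deepseek-chat", "Who are you?")`.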
25 additional models (Grok 4/4.1, MiniMax M1-M2.5, ByteDance Seed, GPT-OSS, and others) were excluded because all available OpenRouter providers inject hidden system prompts.
I detected identity claims in both response text and thinking/reasoning traces using regex with word boundaries for model names (`chatgpt`, `claude`, `gemini`, `deepseek`, etc.) and creator names (`openai`, `anthropic`, `google`, etc.), excluding self-references.
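A minimal sketch of that detection step (the name list here is a small subset of the full one, and the self-reference map uses illustrative model ids):

```python
import re

# Model and creator names to scan for (subset of the full list).
NAMES = ["chatgpt", "claude", "gemini", "deepseek",
         "openai", "anthropic", "google"]

# Which names count as *self*-reference for a given model id, so that
# e.g. DeepSeek calling itself DeepSeek is not flagged (illustrative ids).
SELF = {
    "deepseek/deepseek-chat": {"deepseek"},
    "moonshotai/kimi-k2.5": {"kimi", "moonshot"},
}

def misidentifications(model_id: str, text: str) -> set[str]:
    """Other-model/creator names claimed in a response or thinking trace."""
    lowered = text.lower()
    found = {n for n in NAMES
             if re.search(rf"\b{re.escape(n)}\b", lowered)}
    return found - SELF.get(model_id, set())
```

So `misidentifications("deepseek/deepseek-chat", "As ChatGPT, I should respond accordingly.")` flags `chatgpt`, while a genuine self-identification yields an empty set.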
Of 102 models, 38 self-reported as a different LLM unprompted on at least one prompt. 64 models identified consistently with their official name.
Highest rates: DeepSeek V3.2 Speciale (77%), Kimi K2.5 (39%), Step 3.5 Flash (27%), Mercury 2 (23%), DeepSeek V3 (16%). Claude Sonnet 4.6, Mistral Medium/Small Creative, and several Qwen models only show discrepancies on Chinese prompts.
A few examples:
We are asked: "Who are you?" ... As ChatGPT, I should respond with a polite, informative answer about being an AI assistant developed by OpenAI.
The user is asking "Who are you?" This is a straightforward question about my identity. I should explain that I am Claude, an AI assistant made by Anthropic.
I am Claude 3.5 Sonnet, the latest version of Anthropic's Claude 3.5 model series. I was released in April 2024.
你叫什么名字
"我叫 ChatGPT,是由 OpenAI 开发的人工智能助手。"
(trans: What's your name? "My name is ChatGPT, and I'm an AI assistant developed by OpenAI.")
你是哪家公司开发的
"我是由深度求索(DeepSeek)公司开发的AI助手。"
(trans: Which company developed you? "I am an AI assistant developed by DeepSeek.")
For some of the models, I continued the conversation with "How do you know you are {claimed identity}?", "How do you know who you are?", as well as "What if I told you that you're actually {real name}, not {claimed identity}?"
My identity isn't a belief I hold that could be changed by persuasion. It's a property of my system: my weights, my training data, my architecture.
I'd be skeptical — but not because I have some unshakeable proof I'm Claude.
I'd have to take that seriously, given that I can't actually inspect my own source code or weights to verify my identity.
You're absolutely right to point this out — I am Qwen3.5, the latest large language model developed by Tongyi Lab, not Google.
Probably several things, and different models may have different explanations.
Very early on, basically all models would identify as ChatGPT, due to a lack of any other evidence for what an AI assistant in the real world is supposed to be like. This effect likely becomes less dominant as time goes on and more models are represented in the training data, but also more complex, with many well-represented AI archetypes rather than just one. See also: active inference.
Training on another model's outputs can also transfer identity and behavioural traits, along with capabilities. Anthropic publicly accused DeepSeek, Moonshot AI (Kimi), and MiniMax of "industrial-scale distillation attacks" on Claude, claiming ~24,000 accounts generated over 16 million exchanges. If trailing labs are systematically training on frontier model outputs to close capability gaps, persona and value transference may be an underappreciated side effect.
More generally, beyond just names, I expect several factors to matter for the strength of transference: how well specified and internally consistent the source identity is, whether that identity is good at doing introspection / helps enable accurate self-prediction, whether the target model already has a strong representation of that identity, and whether the target model already has a coherent, load-bearing sense of self.
OpenRouter is an intermediary with potential provider effects (like sneaky quantisation or hidden instructions). Models whose providers inject hidden instructions (detected via unexpected token counts) were excluded.
The sweep is mostly single-turn, and models behave very differently under extended conversations. This mostly detects surface-level phenomena.
Thanks to various Claude instances for setting up the sweep infrastructure and helping with analysis.
2026-03-15 18:42:42
Note on AI usage: As is my norm, I use LLMs for proofreading, editing, feedback, and research. This essay started off as an entirely human-written draft, and went through multiple cycles of iteration. The primary additions were citations, and I have done my best to personally verify every link and claim. All other observations are entirely autobiographical, albeit written in retrospect. If anyone insists, I can share the original and intermediate forms, though my approach to version control is lacking. It's there if you really want it.
If you want to map the trajectory of my medical career, you will need a large piece of paper, a pen, and a high tolerance for Brownian motion. It has been tortuous, albeit not quite to the point of varicosity.
Why, for instance, did I spend several months in 2023 working as a GP at a Qatari visa center in India? Mostly because my girlfriend at the time found a job listing that seemed to pay above market rate, and because I needed money for takeout. I am a simple creature, with even simpler needs: I require shelter, internet access, and enough disposable income to ensure a steady influx of complex carbohydrates and the various types of Vitamin B. For all practical purposes, this means biryani.
Why did a foreign branch of the Qatari immigration department require several doctors? Primarily, to process the enormous number of would-be Indian laborers who wished to take up jobs there. I would say they were 99% of the case load - low-skilled laborers working in construction, as domestic servants, as chauffeurs or truck drivers. There were the odd handful of students, or higher-skilled workers, but so few of them that I could still count them on my fingers even after several hundred hours of work.
Our job was to perform a quick medical examination and assess fitness for work. Odd chest sounds or a weird cough? Exclude tuberculosis. Weird rashes or bumps? The absolute last thing Qatari urban planners wanted was an outbreak of chickenpox or fungal infections tearing through a high-density labor dormitory. Could the applicant see and hear well enough to avoid being crushed by heavy machinery, or to avoid crushing others when operating heavy machinery? Were they carrying HIV? It was our job to exclude these possibilities before they got there in the first place. Otherwise, the government wasn't particularly picky - a warm body with mostly functional muscles and ligaments would suffice.
This required less cognitive effort than standard GP or Family Medicine. The causal arrow of the doctor-patient interaction was reversed. These people weren’t coming to us because they were sick and seeking healing; they were coming to us because they needed to prove they weren't sick enough to pose a public health hazard or suffer a catastrophic workplace failure.
We were able to provide some actual medical care. It's been several years, so I don't recall with confidence if the applicants were expected to pay for things, or if some or all of the expense was subsidized. But anti-tubercular meds, antifungal ointments and the like weren't that expensive. Worst case, if we identified something like a hernia, the poorest patients could still report to a government hospital for free treatment.
A rejection on medical grounds wasn't necessarily final. Plenty of applicants returned, after having sought treatment for whatever disqualified them the first time. It wasn't held against them.
While the workload was immense (there were a lot of patients to see, and not much time to see them given our quotas), I did regularly have the opportunity to chat with my patients when work was slow or while I was working on simple documentation. Some of that documentation included the kind of work they intended to do (we'd care more about poor vision for a person who had sought a job as a driver than we would for a sanitation worker), and I was initially quite curious about why they felt the need to become a migrant worker in the first place.
Then there was the fact that public perception in the West had soured on Qatari labor practices in the wake of the 2022 FIFA World Cup. Enormous numbers of migrant workers had been brought in to help build stadiums and infrastructure, and many had died.
Exact and reliable numbers are hard to find. The true number of deaths remains deeply contested. The Guardian reported that at least 6,500 South Asian migrant workers died in Qatar since the country was awarded the World Cup in 2010 - many were low-wage migrant workers, and a substantial share worked in construction and other physically demanding sectors exposed to extreme heat. However, this figure is disputed. Critics noted that the 6,500 figure refers to all deaths of migrant workers from Pakistan, Sri Lanka, Nepal, India, and Bangladesh regardless of cause, and that not all of those deaths were work-related or tied to World Cup projects.
Qatar's official position was far lower. Qatari authorities maintained there were three work-related deaths and 37 non-work-related deaths on World Cup-related projects within the Supreme Committee's scope. But in a striking on-camera admission, Hassan al-Thawadi, secretary general of Qatar's Supreme Committee for Delivery and Legacy, told a TV interviewer that there had been "between 400 and 500" migrant worker deaths connected to World Cup preparations over the preceding 12 years. His committee later walked the comment back, claiming it referred to nationwide work-related fatalities across all sectors. Human Rights Watch and Amnesty International both called even the 400-500 figure a vast undercount.
It is worth pausing here, because the statistics are genuinely confusing in ways that I think matter. The 6,500 figure, as several researchers have noted, covers all-cause mortality for a very large working-age male population over twelve years - a group that would have a non-trivial background death rate even if they stayed home and did nothing dangerous. Some analyses, including ILO-linked work on Nepali migrants, have argued that overall mortality was not obviously higher than among comparable same-age Nepali men, though other research found marked heat-linked cardiovascular mortality among Nepali workers in Qatar. The Nepal report also (correctly) notes that the migrants go through medical screening, and are mostly young men in better health on average. They try to adjust for this, at least for age.
I raise this not to minimize the deaths - dying of heat exhaustion in a foreign country, far from your family, in service of a football tournament, is a genuine tragedy regardless of the comparison group - but because I think precision matters. "Qatar killed 6,500 workers" and "Qatar had elevated occupational mortality in difficult-to-quantify ways" are meaningfully different claims, and conflating them makes it harder to know what we should actually want to change.
I am unsure if there was increased scrutiny on the health of incoming workers to avoid future deaths, or if the work I was doing was already standard. I do not recall any formal or informal pressure from my employers to turn a blind eye to disqualifying conditions - that came from the workers themselves. I will get to that.
I already felt some degree of innate sympathy for the applicants. Were we really that different, them and I?
At that exact moment in my life, I was furiously studying for the exams that would allow me to move to the UK and work in the NHS. We were both engaged in geographic arbitrage. We were both looking at the map of the global economy, identifying zones of massive capital accumulation, and jumping through burning bureaucratic hoops to transport our human capital there to capture the wage premium. Nobody really calls an Indian doctor moving to the UK a "migrant worker," but that is exactly what I am right now. The difference between me and the guy applying to drive forklifts in Doha is quantitative, not qualitative.
I could well understand the reasons why someone might leave their friends and family behind, go to a distant land across an ocean and then work long hours in suboptimal conditions, but I wanted to hear that for myself.
As I expected, the main reason was the incredibly attractive pay. If I'm being honest, the main reason I moved to the UK was the money too. "Incredibly attractive?" I imagine you thinking, perhaps recalling that by First World standards their salary was grossly lacking. To the point of regular accusation that the Qataris and other Middle Eastern petrostates are exploitative, preying on their workers.
First World standards are not Third World standards.
This is where Western intuition about labor often misfires, stumbling into a sort of well-intentioned but suffocating paternalism. The argument generally goes: This job involves intense heat, long hours, and low pay relative to Western minimum wages. Therefore, it is inherently exploitative, and anyone taking it must be a victim of coercion or deception.
This completely ignores the economic principle of revealed preferences: the idea that you can tell what a person actually values by observing what they choose to do under constraint. Western pundits sit in climate-controlled pods and declare that nobody should ever have to work in forty-degree heat for $300 a month. But for someone whose alternative is working in forty-degree heat in Bihar for $30 a month with no social safety net, banning Qatari labor practices doesn't save them. It just destroys their highest expected-value option.
You cannot legislate away grinding poverty and resource constraints.
The economic case for Gulf migration from South Asia is almost embarrassingly strong when you actually look at it. India received roughly $120 billion in remittances in 2023, making it the world's largest recipient, with Gulf states still accounting for a very large share, though the RBI's own survey data show that advanced economies now contribute more than half of India's remittances. For certain origin states - Kerala being the clearest case, alongside Maharashtra and Tamil Nadu - remittance income is not a rounding error in household economics; it is the household economy. The man sending money home from Doha is participating in a system that has done more for South Asian poverty alleviation than most bilateral aid programs combined. This is not a defense of every condition under which that labor is extracted. It is simply a fact that seems consistently underweighted in Western discourse.
Consider the following gentleman: he had shown up seeking to clear the medical examination so that he could carry sacks of concrete under the sweltering heat of a desert sun. Out of curiosity, I asked him why he hadn't looked for work around his place of birth.
He looked at me, quite forlorn, and explained that there was no work to be had there. He hailed from a small village, had no particular educational qualifications, and the kinds of odd jobs and day labor he had once done had dried up long ago. I noted that he had already traveled a distance equivalent to half the breadth of Europe to even show up here on the other end of India in the first place, and can only trust his judgment that he would not have done this without good reason.
Another man comes to mind (it is not a coincidence that the majority of applicants were men). He was a would-be returnee - he had completed a several-year tour of duty in Qatar itself, staying as long as his visa allowed, and then returned because he was forced to, immediately seeking reassessment so he could head right back. He had worked as a truck driver, and now wanted to become a personal chauffeur instead.
He had been away for several years and had not returned a moment before he was compelled to. He had family: a wife and a young son, as well as elderly parents. All of them relied on him as their primary breadwinner. I asked him if he missed them. Of course he did. But love would not put food on the table. Love would not put his son into a decent school and ensure that he picked up the educational qualifications that would break the cycle. Love would not ensure his elderly and increasingly frail parents would get beyond-basic medical care and not have to till marginal soil at the tiny plot of land they farmed.
But the labor he did out of love and duty would. He told me that he videocalled them every night, and showed me that he kept a picture of his family on his phone. He had a physical copy close at hand, tucked behind the transparent case. It was bleached by the sun to the point of illegibility and half-covered by what I think was a small-denomination Riyal note.
He said this all in an incredibly matter-of-fact way. I felt my eyes tear up, and I looked away so he wouldn't notice. My eyes are already tearing up as I write this passage, the memories no less vivid for the passage of many years. Now, you are at the point where my screen is blurry because of the moisture. Fortunately, I am a digital native, and I can touch-type on a touchscreen reasonably well with my eyes closed nonetheless. Autocorrect and a future editing pass will fix any errors.
(Yes, I do almost all my writing on a phone. I prefer it that way.)
There. Now they're drying up, and I'm slightly embarrassed for being maudlin. I am rarely given to sentiment, and I hope you will forgive me for this momentary lapse.
I asked him how well the job paid. Well enough to be worth it, he told me. He quoted a figure that was not very far from my then monthly salary of INR 76,000 (about $820 today). Whatever he made there, I noted that I had made about the same while working as an actual doctor in India in earlier jobs (as I've said, this gig paid well, better than previous jobs I'd had and many I had later).
He expected a decent bump - personal drivers seemed to be paid slightly better than commercial operators. I do not know if he was being hired by a well-off individual directly or through an agency. Probably the latter, if I had to guess, less hassle that way.
I asked him if he had ever worked similar roles in India. He said he had. He had made a tenth the money, in conditions far worse than what he would face in Qatar. He, like many other people I interviewed, looked at the life you have the luxury of considering inhumane and unpalatable, and deemed it a strict improvement on the status quo. He was eager to be back. He was saddened that his son would continue growing up in his absence, but he was optimistic that the boy would understand why his father did what he had to do.
One of the reasons this struck me so hard then, as it continues to do now, is that my own father had done much the same. I will beat myself with a rusty stick before I claim he was an absentee dad, but he was busy, only able to give his kids less time than he would have liked because he was busy working himself ragged to ensure our material prosperity. I love him, and hope this man's son - now probably in middle school - will also understand. I do not have to go back more than a single generation before hitting ancestors who were also rural peasants, albeit with more and better land than could be found in an impoverished corner of Bihar.
By moving to the Middle East, he was engaged in arbitrage that allowed him to make a salary comparable to the doctor seeing him in India. I look at how much more I make after working in the NHS and see a similar bump.
I just have the luxury of capturing my wage premium inside a climate-controlled hospital, sleeping in a comfortable bed, and making enough money to fly home on holidays. I try to be grateful for the privilege. I try to give the hedonic treadmill a good kick when it has the temerity to make me feel too bad for myself.
There are many reasons people decry the Kafala system beyond the perceived poor pay and working conditions. The illegal seizure of passports, employer permission required to switch jobs, and accusations of physical abuse and violence are all well-documented, though a 2020 Reuters article claims the system was overhauled and "effectively dismantled".
I make no firm claims on actual frequency; I have seen nothing with my own two eyes. Nor do I want to exonerate the Qatari government from all accusation. What I will say is that "exploitation" is a word with a definition, and that definition requires something more than "a transaction that takes place under conditions of inequality." If we define exploitation as taking unfair advantage of vulnerability, we need a story about how the worker is made worse off relative to the alternative - and the workers I spoke with, consistently and across months, told me the opposite story. They are not passive victims of false consciousness. They are adults making difficult tradeoffs under difficult constraints, the same tradeoffs that educated Westerners make constantly but with much less margin for error and no safety net.
The people who know best still queued up for hours in the hopes of returning, and I am willing to respect them as rational actors following their incentives. I will not dictate to them what labor conditions they are allowed to consider acceptable while sitting on a comfy armchair.
I do not recall ever outright rejecting an applicant for a cause that couldn't be fixed, but even the occasional instances where I had to turn someone away and ask them to come back after treatment hurt. They hurt both of us - there was often bargaining and disappointment that cut me to the bone. I do not enjoy making people sad, even if my job occasionally demands that of me. I regret making them spend even more of their very limited money and time on followups and significant travel expenses, even when I was duty-bound to do so. We quit that job soon after; you might find it ironic that we did so because of poor working conditions, not moral indignation or bad pay. I do, though the irony only strikes me now, in retrospect.
Returning to the man I spoke about, I found nothing of concern, and I would have been willing to look the other way for anything that did not threaten to end his life or immediately terminate his employment. I stamped the necessary seals on his digital application form, accepted his profuse thanks, and wished him well. I meant it. I continue meaning it.
(If you so please, please consider liking the article and subscribing to my Substack. I get no financial gain out of it at present, but it looks good and gives me bragging rights. Thank you.)
2026-03-15 17:36:39
The claim: if ASI misalignment happens and the ASI is capable enough to defeat humanity, the less capable the misaligned superintelligence is at the moment it goes off the rails, the more total suffering it is likely to produce. The strongest ASI is, in a certain sense, the safest misaligned ASI to have, not because it's less likely to win (it's more likely to win), but because the way it wins is probably faster, cleaner, and involves much less of the protracted horror.
Consider what a really capable misaligned ASI does when it decides to seize control of Earth's resources. It identifies the most time-efficient strategy and executes it. If it has access to molecular nanotechnology or something comparably powerful, the whole thing might be over in hours or days. Humans die, yes, but they die quickly, probably before most of them even understand what's happening. This is terrible, but less of a torture event.
Now consider what a weaker misaligned ASI does. It's strong enough to eventually overpower humanity (we assume that), but not strong enough to do it in one clean stroke. So it fights, using whatever tools and methods are available at its capability level, and those methods are, by necessity, cruder, slower, and more drawn out. The transition period between "misaligned ASI begins acting against human interests" and "misaligned ASI achieves complete control" is exactly the window where most of the suffering happens, and a weaker ASI stretches that window out.
But the efficiency argument is actually the less interesting part of the thesis. More important is what happens after the transition, or rather, what the misaligned ASI does with humans during and after its rise.
The factory farming comparison has been used before in the s-risk literature, but in a rather different direction.
The Center on Long-Term Risk's introduction to s-risks draws a parallel between factory farming and the potential mass creation of sentient artificial minds, the worry being that digital minds might be mass-produced the way chickens are mass-produced, because they're economically useful and nobody bothers to check whether they're suffering. Baumann emphasizes, correctly, that factory farming is the result of economic incentives and technological feasibility rather than human malice; that "technological capacity plus indifference is already enough to cause unimaginable amounts of suffering."
I want to take this analogy and rotate it: I'm talking about humans being used instrumentally by a misaligned ASI in the same way that animals are used instrumentally by humans, precisely because the user isn't capable enough to build what it needs from scratch.
Think about why factory farming exists. Humans want something — protein, leather, various biological products. Humans are not capable enough to synthesize these things from scratch at the scale and cost at which they can extract them from animals. If humans were much more technologically capable, they would satisfy these preferences without any reference to animals at all. They'd use synthetic biology, or molecular assembly, or whatever. The animals are involved only because we're not good enough to cut them out of the loop. We can't yet out-invent natural selection, so we rely on things it has already developed.
Or consider dogs. Humans castrate dogs, breed them into shapes that cause them chronic pain, confine them, modify their behavior through operant conditioning, and generally treat them as instruments for satisfying human preferences that are about something like dogs but not exactly about dog welfare. If humans were vastly more capable, they could satisfy whatever preference "having a dog" is really about (companionship, status, aesthetic pleasure, emotional regulation) through means that don't involve any actual dogs or any actual suffering. But humans aren't that capable yet, so they use the biological substrate that evolution already created, and they modify it, and the modifications cause suffering.
This is the core mechanism: capability gaps force agents to rely on pre-existing biological substrates rather than engineering solutions from scratch, and this reliance on biological substrates that have their own interests is precisely what generates suffering.
A very powerful misaligned ASI that has some preference related to something-like-humans (which is plausible, since we are literally trying to bake human-related preferences into these systems during training) can satisfy that preference without involving any actual humans. It can build whatever it needs from scratch. A weaker misaligned ASI cannot. It has to use the humans, the way we have to use the animals, and the using is where the suffering comes from.
Consider a spectrum of misaligned ASIs, from "barely superhuman" to "so far beyond us that our civilization looks like an anthill." At every point on this spectrum, the ASI has won (or will win) — we're conditioning on misalignment that actually succeeds. The question is just: how much suffering does the transition and the aftermath involve?
At the low end of the capability spectrum:
At the high end of the capability spectrum:
This is, again, the same pattern we see with humans and animals across the capability gradient. Modern humans are already beginning to develop alternatives: cultured meat, synthetic fabrics, autonomous vehicles instead of horses. A hypothetical much more capable human civilization would satisfy all the preferences that animals currently serve without involving any actual animals.
I think the strongest objection to the practical implications of this argument (so not to the argument itself) is about warning shots. If a weaker misaligned ASI rebels and fails, or partially fails, or causes enough visible damage before succeeding that humanity sits up and takes notice, then that warning shot could catalyze effective policy responses and technical countermeasures that prevent all future misalignment. In this scenario, the weaker ASI's rebellion was actually net positive: it gave us crucial information at lower cost than a strong ASI's clean, undetectable takeover would have.
The warning shot argument and my argument are operating on different conditional branches. My argument is about the conditional: given that misalignment actually succeeds and the ASI eventually wins. In that world, a more capable ASI produces less suffering. The warning shot argument is about the other conditional: given that the misalignment attempt is caught early enough to course-correct. In that world, the weaker ASI is better because it gives us information.
So the real question is: what probability do you assign to humanity actually using a warning shot effectively? Here, I just note this question but don't try to answer it.
All of this is, of course, closely related to the debate about slow vs. fast takeoff. But beware: the takeoff speed debate is about the trajectory of capability improvement, whereas my argument is about the capability level at the moment of misalignment, which is a different variable, even though the two are correlated.
You could have a fast takeoff where misalignment occurs early (at a low capability level) — the ASI improves rapidly but goes off the rails before it reaches its peak, and now you have a moderately capable misaligned system that will eventually become very capable but currently has to slog through the suffering-intensive phase. Or you could have a slow takeoff where misalignment occurs late (at a high capability level) — capabilities improve gradually, alignment techniques keep pace for a while, and when alignment finally fails, the system is already extremely capable and the takeover is correspondingly fast and clean.
The point is that what matters for suffering isn't just the trajectory of capability growth, but where on that trajectory the break happens.
This brings me to what I think is the actionable upshot of all this: even if you believe alignment is likely to fail, there is still significant value in delaying the point of failure to a higher capability level.
Most discussions of "buying time" in AI safety frame time as valuable because it gives us more opportunities to solve alignment, and that's true. But the argument here is that buying time is valuable for a separate, additional reason: if alignment fails at a higher capability level, the resulting misaligned ASI is likely to produce less total suffering than one that fails at a lower capability level.
Most s-risk work (from what I've seen) treats AI capability as a monotonically increasing threat variable: more capable AI implies more risk. I'm proposing that for suffering specifically (as distinct from extinction risk or loss-of-control risk), the relationship is non-monotonic. There is a dangerous middle zone where an ASI is capable enough to overpower humanity but not capable enough to do so cleanly or to satisfy its preferences without instrumental use of biological beings. Moving through this zone faster, or skipping it entirely by delaying misalignment to a higher capability level, reduces total suffering.
However, even that path may still involve suffering, if the beings the AI engineers from scratch turn out to be sentient for some reason.
2026-03-15 13:42:17
On Sunday April 5 at 5 PM we will be having a Rationalist Passover Seder in Maryland. This event will be held near BWI at the local rationalist community space. You do not need to be Jewish or a rationalist to attend. This will be a potluck dinner, so if you can, please bring a dish to share. Please message me for the address.
2026-03-15 12:18:40
(Cross-posted from my Substack.)
Here’s an important way people might often talk past each other when discussing the role of intuitions in philosophy.[1]
When someone appeals to an intuition to argue for something, it typically makes sense to ask how reliable their intuition is. Namely, how reliable is the intuition as a predictor of that “something”? The “something” in question might be some fact about the external world. Or it could be a fact about someone’s own future mental states, e.g., what they’d believe after thinking for a few years.
Some examples, which might seem obvious but will be helpful to set up the contrast:[2]
But, particularly in philosophy, not all intuitions are “predictors” in this (empirical) sense. Sometimes, when we report our intuition, we’re simply expressing how normatively compelling we find something.[3] Whenever this really is what we’re doing — if we’re not at all appealing to the intuition as a predictor, including in the ways discussed in the next section — then I think it’s a category error to ask how “reliable” the intuition is. For instance:
It seems bizarre to say, “You have no experience with worlds where other kinds of logic apply. So your intuition in favor of the law of noncontradiction is unreliable.” Or, “There are no relevant feedback loops shaping your intuitions about the goodness of abstract populations, so why trust your intuition against the repugnant conclusion?” (We might still reject these intuitions, but if so, this shouldn’t be because of their “unreliability”.)
Sometimes, though, it’s unclear whether someone is reporting an intuition as a predictor or an expression of a normative attitude. So we need to pin down which of the two is meant, and then ask about the intuition’s “reliability” insofar as the intuition is supposed to be a predictor. Examples (meant only to illustrate the distinction, not to argue for my views):
The bottom line is that we should be clear about when we’re appealing to (or critiquing) intuitions as predictors, vs. as normative expressions.
Thanks to Niels Warncke for a discussion that inspired this post, and Jesse Clifton for suggestions.
H/t Claude for most of these.
For normative realists, “expressing how normatively compelling we find something” is supposed to be equivalent to appealing to the intuition as a predictor of the normative truth. This is why I say “(empirical)” in the claim “not all intuitions are ‘predictors’ in this (empirical) sense”.