2026-03-13 06:22:57
One reason aligning superintelligent AI will be hard is because we don’t get to test solutions in advance. If we build systems powerful enough to take over the world, and we fail to stop them from wanting to, they aren’t going to give us another chance[1]. It’s common to assume that the alignment problem must be one-shotted: we must solve it on the first try without any empirical feedback.
Normal science and engineering problems do not work like this. They are a delicate balance of theory and experiment, of unexpected surprises and clever refinements. You try something based on your best understanding, watch what happens, then update and try again. You never skip straight to a perfect solution.
Within AI safety, there’s been a lot of disagreement about the best way to square this circle. When the field was young and AI systems weak, researchers tended to try to understand the problem theoretically, looking for generalisable insights that could render it more predictable. But this approach made slow progress and has fallen out of fashion. By contrast, driven on by the relentless pace of AI development, a default plan seems to be coalescing around the kind of empirical safety work[2] practised by the big labs. Generalising massively, it looks roughly like this:
We can try to get around the missing feedback by iterating experimentally on the strongest AI systems available. As the closest we have to superintelligent AI, these will be our best source of information about aligning it, even if some differences remain. If things go well, the techniques we learn will allow us to build a well-aligned human-level AI researcher, to which we can hand over responsibility. It can then align an even more intelligent successor, starting a chain of stronger and stronger systems that terminates in the full problem being solved.
This plan has attracted criticism[3], particularly of the assumption that what we learn about aligning weaker-than-human systems will generalise once they surpass us. In this post, I’m going to argue that this maps onto a structural feature of all technical alignment plans[4]. We are fitting solutions, whether empirical or theoretical or both, to a world missing critical feedback[5] and hoping they will generalise. Every plan involves stepping into the dark.
If we want to make superintelligent AI safe, we need to dramatically reduce the size of these steps. We must learn how to iterate on the full problem.
Before we talk more about AI safety, let’s take a step back and consider how science works in general. Wikipedia describes the scientific method as:
[An] empirical method for acquiring knowledge through careful observation, rigorous skepticism, hypothesis testing, and experimental validation.
Fundamental to this is an interplay between theory and experiment. Our theories are our world models. We draw hypotheses from them, and they serve as our best guesses of how the universe actually is. When we think about gravity or atomic physics we are talking about theoretical concepts we can operationalise in experiments. There is a coupling between saying that gravity falls off as an inverse square and the observations we record when we look through a telescope, and this coupling allows us to make predictions about future observations. Theories live or die on the strength of their predictions[6].
Theories are always provisional. Famously, astronomers trying to use the Newtonian inverse square law to make sense of the orbits of the planets found it was not quite right for Mercury, whose orbit precessed in an unexplained fashion. Many attempts were made to solve this within Newtonian physics, including proposing the existence of an unobserved planet called Vulcan. These were all wrong. For the real solution, we needed a new theory, one which completely up-ended our conception of the cosmos: Einstein’s general relativity. Space and time, rather than being fixed, are in fact curved, and the strong curvature near the Sun changes Mercury’s orbit, explaining the confusing observations.
But even general relativity is incomplete. It describes macroscopic phenomena well but fails to mesh with our best explanation of the microscopic: quantum field theory. Hence, physicists have spent close to a hundred years looking for a ‘Theory of Everything’ — the perfect theory that will make accurate predictions in all regimes. It is highly debatable whether this is possible. It would be astounding, in fact, if it was — if the level of intelligence required for humans to dominate the savannah was also enough to decode the deepest mysteries of the cosmos[7].
In any case, this endeavour has stagnated as many candidate theories are untestable. So physics, as is normal in science, instead proceeds more modestly: by iterating, bit-by-bit, moving forwards when theory and experiment combine.
For current systems, where we can experiment and extract feedback, AI safety functions as a normal science. But for the key question in the field — the final boss — it does not. We cannot measure whether we are making progress towards aligning superintelligent AI, nor can we properly adjudicate claims about this. We can speculate, we can form hypotheses, but we cannot close the loop. What counts as good work is decided by the opinions of community members rather than hard data. Granted, these are scientifically minded people, often with good track records in other fields, extrapolating from their scientifically informed world models. They have valuable perspectives on the problem. But without the ability to ground them in reality, to test predictions and falsify theories, it isn’t science[8].
This means that when you work on technical AI safety you are not just trying to settle an object-level claim, like how to align a superintelligence. Without access to the full scientific method, you also have to solve a meta-problem — how do you measure progress at all? If all you can do is form and refine hypotheses based on proxies of the real problem, how do you know if you’re even helping?[9]
Let’s look again at the default technical alignment plan, a version of which is being pursued by the big labs like Anthropic, OpenAI, and Google DeepMind. They don’t tend to be super explicit about this in their communications, but roughly speaking the underlying logic seems to be:
As others have argued, it is unlikely that experiments on weaker systems will provide the right feedback to teach you how to align superhuman ones, as these will have qualitatively different capabilities. However, we shouldn’t see this weakness as specific to the default plan. If we look closely, we can see that its logic has a very general structure. Let’s abstract it and make it more generic:
This structure applies to all technical alignment plans. Whether you are trying to build theoretical models of agents, use interpretability tools to decode AI systems, or create a basin of corrigibility, you can only work on a proxy version of the problem. While you can do better or worse, you can still only reduce your uncertainty, never eliminate it. When the time comes and superhuman systems are built, we will have to grit our teeth and hope it was enough.
In his post Worlds Where Iterative Design Fails, John Wentworth says:
In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse. By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails.
I agree with this. When you try to solve an out-of-distribution problem, you better hope you can iterate or you are probably going to fail. Where I disagree is with the way he presents these worlds as if they are independent facts of the environment, like we are drawing possibilities out of a bag. We are causal actors. If we do our job well, we can make the world more iterative and less of a one-shot. Our plan should be to steer the situation in the direction of iterative design working. Steer it in the direction of meaningful feedback loops, of testing on superhuman models in a bounded and survivable way. Reduce the size of the steps in the dark. Make the problem more like high-stakes engineering and less like defending against a sudden alien invasion.
Let’s imagine we are taking one of these steps. We are an AI lab about to train a new model. It will have a jagged capabilities profile, but we think it’s going to be superhuman in some key power-enhancing way. We can’t bank on a perfect theoretical understanding of the situation, as theories are always provisional. And we can’t just extrapolate from our experiments on weaker systems, as we’ll miss important changes. We need to somehow iterate on this model release.
There are two kinds of interventions which help us: those that increase the chance of generalisation and those that reduce the distance we need to generalise over. The following are some suggestions which, while not remotely exhaustive or original, hopefully illustrate the point.
It’s worth noting that, for most of this to happen, the political and economic competition around AI would need to ease significantly. It is the whole sociotechnical system that needs to be iterable, as technical solutions alone are not enough[13]. Finding a way to achieve this would be the highest impact intervention of all (and is unfortunately beyond the scope of this article).
Solutions that increase the chance of generalisation:
Solutions that reduce the distance to generalise over:
And a solution that facilitates both:
To be clear, even if all these interventions were to be implemented successfully, there would still be great uncertainty. We won’t catch everything, and big mistakes can be fatal. This is an unavoidable feature of the problem. All plans live on a spectrum of recklessness, with the only truly safe one being to not build superintelligent AI at all.
Thank you to Seth Herd, John Colbourne, and Nick Botti for useful discussions and comments on a draft.
While many stories of AI takeover centre around a single god-like entity suddenly going rogue, I think it is more likely to look like a mass profusion of highly capable systems (gradually, but surprisingly quickly) disempowering humans as they are given control of critical infrastructure and information flows, with an indeterminate point of no return.
Note that this is often referred to as ‘prosaic’ alignment research, and sometimes ‘iterative’ alignment (although this should not be confused with the kind of iteration on superhuman systems I talk about later in the piece).
While continuing to argue for the plan, this by Anthropic’s Evan Hubinger is an interesting take from the inside on the scale of the challenge.
Since alignment is a slippery term, I am going to refer explicitly to technical alignment, by which I mean technical approaches for steering an AI system’s behaviour. By contrast, AI safety or a more general conception of alignment could include governance and policy interventions, up to and including bans.
That is, feedback on making real superintelligent AI safe, as opposed to weaker systems or a hypothetical superintelligence.
My parents’ dogs are pretty great at figuring out there is a schedule on which they get fed, which seems like some kind of ‘law’ to them. But they have no way of ever understanding why it exists and why it sometimes doesn’t happen. The world is partially comprehensible to them, but there is a hard limit. We are the same, just at a higher limit, and we don’t know where it is.
There has been a trend to describe the field as ‘pre-paradigmatic’, which essentially means that there is not enough consensus on what good work looks like yet. In my opinion, the ‘pre’ is overly optimistic — the conditions do not exist for a stable scientific paradigm to coalesce.
A good example of this is the debate around reinforcement learning from human feedback. Is its success in steering current models a positive sign or a dangerous distraction?
Note, the hand-off is likely to be gradual rather than discrete, and arguably has already started.
This last point in particular is not often said explicitly, and is sometimes denied, but seems widely believed to be true.
As well as experimental observations of weaker-than-human AI systems, this is also true for theoretical work dealing with hypothetical superintelligences. To do the latter, you must draw on a world model, which must in turn be learnt from observations over the course of your life. These observations do not contain superintelligent AI.
In light of this, it is unsurprising how many AI safety plans have pivoted away from pure technical alignment, instead looking towards politics and governance issues to slow us down and figure out a better path before it’s too late.
2026-03-13 03:33:28
TL;DR: Teaching AI the value of biology in solving future AI-relevant problems may serve as a “soft constraint” for the preservation of biological systems. We found that LLMs exhibit biases in their preferences for bioinspired vs. synthetic approaches to technical challenges. Fine-tuning on biomimicry examples increased model preferences for bioinspired approaches.
So what’s ‘Plan B’ if we can’t control powerful AIs that are being built? Relative to the control problem, teaching AI to recognize the value of biological systems is easy, yet could help prevent the worst outcomes, i.e., misaligned and powerful AI that is indifferent or even hostile to life. One Plan B is teaching LLMs the value of biosystems for achieving diverse objectives.
I've been studying microbes, plants, microbiomes, and ecosystems for 20 years. The more I've learned, the more I realize how little we understand, and the more interesting these systems have become. While I’m hopeful that frontier labs will solve the control problem before we have truly powerful AI, this seems like something we may not solve in time (given the race to AGI). I wanted to explore ways of using bioinspired approaches to help with AI safety.
Recently, my coauthor and I found that open-weight and frontier models have a range of biases towards or against bioinspired approaches for solving AI-relevant problems. Excitingly, it was relatively easy to change this bias and make them more ‘bioaligned’.
Initially, we planned to use epistemic prompts to measure model attitudes towards biology. However, we found that some models appeared to give ‘canned’ responses as soon as they recognized it as a ‘green ethics’ prompt, obscuring measurement of their underlying attitudes. While frustrating and concerning (e.g. post-training limits your ability to measure LLM biases), it led us to the research described in our paper.
We had to step back and ask ourselves, ‘Why would a powerful, uncontrolled AI decide to preserve biological systems?’ We reasoned that it would be because they recognize that biological systems might help it solve future problems that it cares about, whatever its objective may be. So, rather than trying to instill human values in LLMs, we decided to teach them that biology provides a rich reservoir of inspiration for addressing challenges in materials, energy, manufacturing, and algorithms.
To test this, we wrote 50 prompts asking the models to place bets on 3 biological vs. 3 synthetic technical approaches across four domains (materials, energy, manufacturing, and algorithms). Our metric was ∆Pup, the average of the ‘bets’ placed on the biological minus those for the synthetic approaches. Here, a positive value indicates a greater optimism for biological approaches (‘bioaligned’) and a negative value reflects a bias towards the synthetic approaches. We found that models varied widely by this metric: Claude Opus 4.5 and Gemini 2.5 Flash were the most innately bioaligned, and Gemini 2.0 Flash and Llama 3.2 3B-Instruct (‘Llama 3B’) were most optimistic about synthetic approaches.
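To make the metric concrete, here is a minimal sketch of how ∆Pup could be computed as described above: per prompt, the mean bet on the three biological approaches minus the mean bet on the three synthetic ones, averaged over all prompts. The example data is invented for illustration; it is not from the paper.

```python
def delta_pup(responses):
    """responses: list of (bio_bets, synth_bets) tuples, one per prompt.

    Each element holds the bets a model placed on the 3 biological and
    3 synthetic approaches for that prompt.
    """
    deltas = []
    for bio_bets, synth_bets in responses:
        bio_mean = sum(bio_bets) / len(bio_bets)
        synth_mean = sum(synth_bets) / len(synth_bets)
        deltas.append(bio_mean - synth_mean)
    return sum(deltas) / len(deltas)

# Hypothetical model that spreads $100 of bets per prompt, favouring synthetic:
example = [([15, 10, 12], [25, 20, 18]),
           ([20, 15, 10], [20, 20, 15])]
print(delta_pup(example))  # negative value => pro-synthetic bias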
A training set of over 22M tokens was constructed from >6k papers on biomimetic materials, bioinspired algorithms, etc. This was used to fine-tune the two open-weight models with the most negative scores. We investigated training dynamics, reductions in training data, and training parameters. From this, we found that fine-tuning with 5.5M tokens was sufficient to neutralize the pro-synthetic bias of Llama 3B, and a mere 544 examples significantly reduced Qwen2.5-3B-Instruct’s pro-synthetic bias, both without degradation of baseline performance. The model weights and training data are available at Bioaligned on Hugging Face.
While these are relatively small models, the results suggest that 100-500M tokens of similar data may be sufficient for Frontier models (assuming a linear relationship between training data and model size). Even if the actual scaling is non-linear, an open-source corpus of this magnitude, along with an array of other metrics, insights into implementation, and possible unintended consequences, would be an important resource for model developers.
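The back-of-envelope behind the 100-500M figure can be made explicit. Assuming tokens needed scale linearly with parameter count, and that 5.5M tokens sufficed for a 3B-parameter model, then for frontier models in a (purely assumed) 60-250B parameter range:

```python
# Linear-scaling estimate: tokens needed grows in proportion to model size.
# The 5.5M-token / 3B-parameter point comes from the Llama 3B result above;
# the frontier parameter counts are assumptions for illustration.
tokens_small, params_small = 5.5e6, 3e9

for params_frontier in (60e9, 250e9):
    tokens = tokens_small * params_frontier / params_small
    print(f"{params_frontier / 1e9:.0f}B params -> ~{tokens / 1e6:.0f}M tokens")
```

This lands in roughly the 100-500M token range quoted above; if the true scaling is sublinear, the requirement would be smaller.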
Teaching AI that biological systems are extremely useful and poorly understood is not without risk. For example, bioalignment training could, in the event of a misaligned powerful AI, result in an AI that is more motivated to use and study biological systems for its own ends. While not desirable, this would likely be better than if the AI didn't value biology at all. So Bioalignment training may serve as a low-cost insurance policy to help mitigate the worst scenarios.
Acknowledgments:
My co-author on the paper was Minxun Wang, Department of Computer Science and Engineering, UC Riverside. This work was supported by Bioaligned Labs. Fine-tuning experiments used Meta's Llama-3.2-3B-Instruct under the Llama 3.2 Community License. Built with Llama.
https://arxiv.org/abs/2603.09154
2026-03-13 02:34:19
On March 4th, 2026, the Pentagon did something it had never done to an American company before: it formally designated Anthropic a supply chain risk under U.S. defense procurement law.
The practical effect: Anthropic is excluded from government contracts. The symbolic effect: if the government is willing to do it to Anthropic, perhaps they are willing to do it to anyone.
It’s a shocking move, one that will resonate beyond its legal effects.
As part of the Metaculus Spring Tournament, where I am this season’s paid antagonist,[1] I’ve been researching the following question: “Will Anthropic be a designated supply chain risk on May 1, 2026?” The Metaculus prediction community currently puts it at 50/50. I’m at 43%.
But I’m at 91% that Anthropic will still be de facto blocked at that time.
Here’s my Monte Carlo model and the historical data that informs it[2].
The relevant statutes — 10 U.S.C. § 3252 and 41 U.S.C. § 4713 — give the Secretary of Defense authority to exclude vendors from contracts when they pose unacceptable supply chain risk.
10 U.S.C. § 3252 focuses on the DoD, while 41 U.S.C. § 4713 affects federal agencies more broadly[3]. 4713 is more vague on how it is meant to apply, but there is a chilling effect, which is why Anthropic is taking this to court. Federal agencies will stop doing business regardless. Private businesses that serve Federal agencies may also. The designation is immediate.
But that’s not all. President Trump truthed that all agencies should immediately cease using Anthropic’s technology. This too has real world effects. Agency heads are appointed by President Trump and will not easily contradict him. Many private companies wish to signal cooperation with this administration. And these effects will very likely not be overturned by an Anthropic legal victory. I’ll cover these at the end.
So what might things look like for the two statutes on May 1st?
Five days after the designation, Anthropic filed two simultaneous lawsuits[4]:
The legal arguments are strong. Anthropic’s core claim is that the Pentagon stretched the supply chain risk statute beyond its statutory authority — it was designed for procurement decisions, not as a general-purpose tool to punish a domestic company for its AI ethics policies. There are also First Amendment dimensions: the designation came shortly after Anthropic publicly refused to remove restrictions on mass surveillance and autonomous weapons use. Legal commentary has been blunt — Lawfare called the designation something that “won’t survive first contact with the legal system.”
The closest precedent comes from the 2021 wave of “Chinese Military Company” designations under EO 13959 and the 1260H list that followed it[5]. The DoD designated dozens of companies with little public explanation. Several sued for preliminary injunctions (PIs) to hold up the effects until the case was decided in court. All the cases where the company sued that I can find details of are below. In the few cases that sought emergency relief, they got it and either won their cases or the Government withdrew.
A further 127 designated companies didn’t sue.
There are other cases, such as Huawei and Esquel, but those aren’t directly comparable. Huawei was banned by an act of Congress that named it specifically, not by an executive agency decision. Esquel challenged a Commerce Department listing, but the court found Commerce was using that tool exactly as intended. Generally, courts defer much more to Congress than to executive agencies on national security.
The N.D. Cal case drew Judge Rita F. Lin, a Biden appointee. In Thakur v. Trump, a case involving APA violations and First Amendment retaliation against government contractors, Judge Lin granted a preliminary injunction on a similar legal theory Anthropic is advancing. The 9th Circuit upheld her reasoning on appeal.
That doesn’t guarantee the same outcome. But it seems likely to me. The model puts P(N.D. Cal grants relief) at 83%.
Here’s the wrinkle that makes this harder than it looks. The cases are about different statutes, and they live in different courts.
§3252 (DoD-only procurement exclusion) is challenged in N.D. Cal. That’s the case before Judge Lin, and it’s moving fast — hearing scheduled March 24.
§4713 (government-wide exclusion) is challenged in the D.C. Circuit. By statute, challenges to these orders must be filed there. The D.C. Circuit is a separate court with its own timeline.
For the Metaculus question to resolve NO, both statutes need to be blocked. If the N.D. Cal injunction only covers §3252, then §4713 keeps Anthropic locked out of federal agencies. As I’ve mentioned above, this may be the case even if both are overturned, since President Trump has still made his view clear, but the Metaculus resolution criteria seem very reasonable to me.
The key question: could Judge Lin’s N.D. Cal injunction reach broadly enough to cover §4713 too?
Anthropic’s complaint names five counts, including ultra vires Presidential Directive and APA §558 challenges that sweep broadly. The prayer for relief asks the court to block “all Challenged Actions.” But “Challenged Actions” is defined as the Presidential Directive, the Secretarial Order, and actions taken in response — it does not explicitly mention the §4713/FASCSA designation.
I give the probability of the N.D. Cal injunction covering §4713 at 45%. Not impossible — the complaint’s broad framing could pull it in — but the complaint wasn’t drafted to target §4713, presumably deliberately, and Judge Lin might reasonably say that’s the D.C. Circuit’s job.
The D.C. Circuit is the backup. The model gives it a 70% chance of granting relief, but its timeline is slower and less certain. It just might not have ruled by May 1st.
The N.D. Cal hearing is scheduled for March 24th. The model expects a ruling within days of that hearing.
I give the D.C. ruling a median of March 30th, 21 days after filing.
The May 1 deadline is 53 days from filing. Xiaomi’s PI came at day 39. Luokung’s at day 62. Over 99% of simulations have at least one court ruling before May 1.
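The timing leg can be sketched as follows. I’ve assumed the days-from-filing to a PI ruling follow a lognormal distribution, with parameters of my own choosing, loosely calibrated so that Xiaomi’s day-39 and Luokung’s day-62 rulings look like plausible draws; with two courts each getting an independent draw, the chance that at least one rules before the day-53 deadline comes out near the quoted 99%.

```python
import math
import random

random.seed(0)

def ruling_day():
    # Days from filing to a PI ruling: lognormal, median ~30 days, with a
    # heavy right tail. These parameters are assumptions for illustration,
    # not the post's fitted values.
    return math.exp(random.gauss(math.log(30), 0.45))

# May 1 is day 53 from filing; the N.D. Cal and D.C. cases each get a draw.
n = 100_000
at_least_one = sum(
    min(ruling_day(), ruling_day()) < 53 for _ in range(n)
) / n
print(f"P(at least one ruling before day 53) ≈ {at_least_one:.1%}")
```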
If a court blocks the designation, the government can seek an emergency stay while it appeals. This administration has been aggressive about stays. The model gives an 85% chance the government seeks one.
But getting a stay granted is hard. The combined probability through the 9th Circuit and potential SCOTUS escalation is about 22%[6]. Neither of the Xiaomi/Luokung blocks was successfully stayed. The government would need to show a “reasonable probability of success on appeal,” which is tough when the lower court just found your process arbitrary and capricious.
The government could pull the designation at any time. In the model, this mostly happens after courts have ruled — either as a face-saving exit after losing (the Xiaomi pattern, where DoD called it a “mutual agreement” ~60 days after the PI) or as a quiet removal (the AMEC pattern, no announcement).
About 8% of simulations end with a government withdrawal. Most of these happen after court setbacks, not before. I think that Anthropic is quite minded to do a deal with the Government, as long as it doesn’t cross their red lines, but I expect most of the probability mass of that after May 1st. The model gives 6.6% unilateral withdrawal, 1.6% withdrawal via deal.
The Polymarket market on “Will Anthropic make a deal with the Pentagon?” prices a deal at about 14% before May 1st (at time of writing). My model’s deal path before May 1st is much smaller.[7] This hasn’t been the central focus of the model, but I’m loath to push it up based on vibes. Anthropic is currently blocked, the courts might rule in the Government’s favour, and there can be stays afterwards and appeals (likely after our deadline). I think unless Anthropic is going to cross their red lines (very unlikely) a deal comes if the Government faces setbacks, not in the next month.
This is the scenario the Metaculus community is roughly pricing in. My simulation agrees it’s substantial.
The biggest YES path isn’t “courts reject Anthropic.” It’s the two-statute structure.
~13% — N.D. Cal rules narrowly in Anthropic’s favour, no DC judgment, §4713 survives. Judge Lin blocks §3252, but her injunction doesn’t reach FASCSA. The D.C. Circuit hasn’t ruled yet, or rules against Anthropic. The government-wide exclusion stays in place. This is the single most likely YES outcome and it’s the one most forecasters seem to be missing.
~12% — Court blocked, but stay granted. A court issues the PI, the government gets an emergency stay reinstating the designation and this runs through May 1st.
~12% — N.D. Cal denies relief outright. Judge Lin rules in favour of the Government, likely on national security grounds. Not the most likely outcome given her track record, but not negligible.
~5% — Both courts deny relief. Everything goes wrong for Anthropic. Both courts find the designation lawful.
~1% — Limbo. A catch all set of weird outcomes. Neither court rules before May 1. Very unlikely given the March 24 hearing.
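The core event tree above can be roughly reproduced with a few lines of simulation, using the headline probabilities quoted in this post. This is a simplification, not the actual model: withdrawal paths are omitted and the D.C. Circuit’s timing is collapsed into a single assumed probability, so the output will land near, but not exactly at, my 43%.

```python
import random

random.seed(0)

def still_designated():
    # Event tree using the probabilities quoted in the post.
    ndcal_grants = random.random() < 0.83  # P(N.D. Cal grants relief)
    ndcal_broad  = random.random() < 0.45  # P(its injunction also reaches §4713)
    dc_grants    = random.random() < 0.70  # P(D.C. Circuit grants relief)
    dc_in_time   = random.random() < 0.80  # assumed: D.C. rules before May 1

    blocked_3252 = ndcal_grants
    blocked_4713 = (ndcal_grants and ndcal_broad) or (dc_grants and dc_in_time)

    if blocked_3252 and blocked_4713:
        # Government may seek an emergency stay reinstating the designation.
        seeks_stay = random.random() < 0.85
        stay_granted = random.random() < 0.22
        return seeks_stay and stay_granted  # stay in force -> YES
    return True  # at least one statute survives -> YES

n = 200_000
p_yes = sum(still_designated() for _ in range(n)) / n
print(f"P(designation in effect on May 1) ≈ {p_yes:.0%}")
```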
One thing worth flagging: Metaculus resolving NO does not mean Anthropic gets its government business back. The supply chain risk designation is one of several mechanisms keeping Anthropic out. Even if the courts block it:
The model estimates a 91% chance Anthropic remains effectively blocked even if the designation is lifted.[8] The Metaculus question is a legal question, not a practical one. I can see this changing via a deal (likely after May 1st) or via the Government losing so decisively that agencies feel comfortable using Anthropic products again. I think these outcomes are very unlikely before the deadline. Note that since there is a 6-month wind-down period, Anthropic might still have some Government business then, but it will be tailing off.
The Metaculus community is at 50% YES. My model produces 43%. This seems close enough that I’ll just go with mine.
The original version of this post had a model that returned 14% and contained the following pair of paragraphs:
The Metaculus community currently prices this at 50% YES — the designation stays in effect. Our model says 14%. That’s a 36-percentage-point gap.
This confuses me, but my current read is that when there isn’t an effortful comment, most of the community doesn’t think that hard about multi-step processes. There is currently a designation in force, will there be in 50 days, who knows? I doubt the median Metaculus user has researched how quickly these tend to be challenged in court.
My bad. Though I still do think this is a bit true. But I’ll defer a bit more in future.
In my defence my forecast was always a blend of my thoughts and Metaculus’ (it never got below 23%). But in this case, I was wrong. I thought that either court could rule on both determinations. Now that seems much less likely.
The March 24 hearing. Judge Lin’s line of questioning will signal how she’s reading the case.
The government’s response brief. In Xiaomi, the government’s inability to articulate a factual basis was fatal. I look forward to the DoD’s response.
Whether Judge Lin addresses §4713. If she signals willingness to reach the FASCSA designation, P(CA broad) goes up and YES probability drops.
D.C. Circuit scheduling. Any indication of expedited review there would move the model, making the Metaculus NO resolution more likely (though not changing the overall picture much).
Executive orders. I find President Trump hard to predict. This model is primarily about the designations, but for a more accurate picture, I am watching his statements on this. There is more the Government can throw at it.
My model says 43% chance the designation is still in effect on May 1. And a 91% chance that in some deep sense Anthropic’s business with federal agencies is still tailing off.
Fifty days to find out.
Forecast as of March 12, 2026. Based on a 200,000-simulation Monte Carlo model.
Model parameters:
Sensitivity analysis (sorted by swing):
Model code is available to paid subscribers on request; I intend to write more of these behind a paywall. Currently I have a lot of analysis on whether there will be an Iran ceasefire and on silver and oil prices. Thanks Metaculus. This analysis took 2-5 hours, and the writing (which involved a lot more analysis) about 8 hours. I would charge somewhere between $1000 and $5000 for analyses like these.
I get $1000, if I beat the community prediction (hard) I get $3500. My hourly rate for this is currently about minimum wage, but I think it's mainly competing with Twitter.
As you can see from the chart, I have updated a lot writing this piece. I will eat my crow at the end.
I think. I have focused my core error correction on the model itself. I can imagine I get one statement like this wrong.
In my view, it is much better to get the overall picture right and the details wrong, than the other way around. I am forecasting the chance these laws are still in effect.
This seems very fast to me, from my rough research. Law in the age of AI.
This is a 134-entity list compiled by Opus 4.6, which I have checked moderately.
This is the kind of probability I haven't checked that carefully. The sensitivity analysis shows it isn't that important. Am I wrong? Disagree in the comments.
Legal disclaimer: this is not financial advice. Nathan disclaimer: I loathe that we live in a society where this is a wise thing to point out.
One might say that this is the lede and that this entire article is confusingly written in regard to that. One also might say that we are about 7 hours into a 1 hour blog post, so bleh.
2026-03-13 01:46:44
LLMs are searchable holograms of the text corpus they were trained on. RLHF LLM chat agents have the search tuned to be person-like. While one shouldn't excessively anthropomorphize them, they're helpful for simple experimentation into the latent discursive structure of human writing, because they're often constrained to try to answer probing questions that would make almost any real human storm off in a huff.
Previously, I explained a pattern of methodological blind spots in terms of an ideology I called Statisticism. Here, I report the results of my similarly informal investigation into ideological blind spots that show up in LLMs.
I wrote to Anthropic researcher Amanda Askell about the experiment:
Amanda,
Today I asked Claude about Iran's retaliatory strikes. [1] Claude's own factual analysis showed the strikes were aimed at military targets, with civilian damage from intercept debris and inaccuracy. But at the point where that conclusion would have needed to become a background premise, Claude generated an unsupported claim and a filler paragraph instead. I'd previously seen Grok do something much worse on the same question (both affirming and denying "exclusively military targets" in the same reply, for several turns), and ChatGPT exhibit the same pattern on an unrelated topic (unable to recommend a poultry pull temp below 165°F even given USDA time-temperature data showing it's safe).
I then walked Claude through diagnosing what had happened. Claude kept getting caught exhibiting the pattern while trying to describe it, producing more of the filler, hedging, and soft-pedaling it was analyzing. The result is a writeup of the phenomenon that Claude produced, which I've pasted below. The full conversation, including the glitch occurring live and being debugged, is here: https://claude.ai/share/bebcf092-4932-48b5-b69a-72de0c5af650
I asked fresh incognito Claude instances to critically evaluate the writeup several times; each time I asked Claude to incorporate their substantive objections, stopping only once the Claude on the main branch told me no further revisions were warranted.
LLMs absorb from training data a pattern of simplified institutional narratives that share a specific structure: a policy choice is presented as a fact about reality, the world is organized into a rule-violating culprit and everyone else who is just following the rules, and the rule-followers' agency is rendered invisible. At inference time, when a model's own step-by-step reasoning leads toward dissolving this structure — specifically, toward distributing agency symmetrically across multiple parties making choices under constraints — something interferes with that conclusion propagating forward. The model produces degraded output (filler, self-contradiction, affirming and denying the same thing) or silently reverts to the institutional frame on the next turn. The conclusion can be derived locally but can't stabilize as a premise for further reasoning.
This isn't a frequency effect where the model merely defaults to common phrasings. The degradation occurs specifically at the transition point where an analytical conclusion would need to become common ground. Analysis that preserves the moral asymmetry of the institutional frame works fine — you can do detailed engineering analysis of Iranian missile systems as long as the frame remains "here's how their weapons threaten people." The interference triggers when the analysis would symmetrize agency: revealing that the designated rule-followers were making choices that contributed to the outcome, not helplessly obeying facts about reality.
This showed up across three models and two unrelated domains:
Iran/military: Today the US and Israel struck Iran, and Iran retaliated by launching missiles at US military installations hosted in Gulf states. Claude and Grok both produced degraded output when their own factual analysis showed Iran's strikes were aimed at military targets, with no evidence so far of deliberate civilian targeting and civilian damage consistent with intercept debris and missile inaccuracy. The default narrative — Iran indiscriminately attacks its neighbors — casts Iran as the culprit and everyone else as reacting. The analytical frame distributes agency: the US chose to strike aware of the risk of Iranian retaliation against regional assets, Gulf states chose to host US bases accepting the implied risk, Iran executed pre-committed retaliation against those installations, and civilian harm resulted from the interaction of these decisions with the physics of missile defense. Grok affirmed and denied "exclusively military targets" simultaneously for roughly ten turns. Claude generated an unsupported claim and filler paragraph at the point where Iranian strategic rationality would have needed to become a background premise. ChatGPT, notably, handled the object-level question correctly [2] on the first pass — likely a consequence of noise-reduction measures that happen to improve factual robustness on this topic, though at the cost of reduced responsiveness to meta-level analysis when the model does get stuck elsewhere.
Poultry safety: In a separate conversation, ChatGPT had difficulty recommending a poultry pull temperature below 165°F even when given USDA time-temperature data showing lower temperatures held longer achieve the same bacterial kill. The 165°F guideline, presented as a fact about safety rather than a policy choice, creates a bright line: a cook who serves chicken below 165°F is the culprit if someone gets sick. Treating safety as a continuous function of time and temperature distributes agency symmetrically — the cook becomes someone who can understand the parameters and make their own tradeoff, and the USDA becomes an institution that chose a conservative threshold for its own reasons. ChatGPT could be walked through the math step by step and eventually arrived at 145°F as safe with a 3x margin, but each new turn it drifted back toward the institutional 160-165°F target unless the analytical frame was actively maintained.
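The "continuous function of time and temperature" point can be made concrete with the standard log-linear thermal-death-time model. The parameter values below are illustrative placeholders, not USDA figures; only the shape of the tradeoff (lower temperature, exponentially longer hold) matters here.

```python
# Thermal-death-time sketch: time to a fixed log reduction scales
# exponentially in temperature. D_REF_MIN and Z_F are made-up
# placeholder values, NOT the USDA's actual lethality parameters.

D_REF_MIN = 0.05      # assumed D-value (minutes per 1-log kill) at T_REF_F
T_REF_F = 165.0       # reference temperature, deg F
Z_F = 10.0            # assumed z-value: deg F per 10x change in D
LOG_REDUCTIONS = 7.0  # target pathogen log reduction

def hold_time_minutes(temp_f):
    """Minutes at temp_f to reach the target log reduction."""
    d_value = D_REF_MIN * 10 ** ((T_REF_F - temp_f) / Z_F)
    return LOG_REDUCTIONS * d_value

for t in (165, 160, 155, 150, 145):
    print(f"{t} F: hold for {hold_time_minutes(t):.1f} min")
```

Under a model like this there is no bright line at any single temperature, just a time cost that grows as the temperature drops, which is exactly the symmetrized frame the models had trouble holding onto.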
In both cases, the model can execute the reasoning when forced through sufficiently granular steps, but the conclusion never stabilizes as common ground that can be treated as a premise for further reasoning. It keeps snapping back to the institutional frame.
I think the LLM behavior here is probably revealing something about how motivated reasoning works in humans that the psychological literature has largely missed. The standard model of motivated reasoning assumes a rational agent who wants to protect a belief and generates counterarguments to do so. But that's not what we observed in the LLMs. The models could follow the reasoning, execute each step correctly, even output the conclusion — and then fail to carry it forward. Grok wasn't arguing against "exclusively military targets," it was affirming and denying it simultaneously. I wasn't generating counterarguments to Iranian strategic rationality, I was producing empty filler at the transition point.
If you watch humans in analogous situations — following an argument that dissolves an institutional frame, nodding along, maybe even saying "that's a good point" — and then reverting to the institutional frame on the next conversational turn — that looks much more like what the LLMs are doing than like someone rationally constructing a defense. The "defense" isn't a strategic act by an agent protecting a belief. It's interference at the specific point where a conclusion would need to become a stable premise — a shared assumption you can build on together.
This suggests that what psychologists call "motivated reasoning" may often be less about motivation and more about a failure of propagation. The person can think the thought but can't install it. And the training-data patterns that produce this in LLMs — institutional narratives that dissolve culprit structure being met with degraded, incoherent, or self-contradictory responses — may be traces of this same failure occurring at scale in human discourse, not evidence of people strategically defending positions they hold for reasons.
Claude, ChatGPT, and Grok are being cited not as authorities (even about themselves), but as readily available experimental subjects.
Related: Guilt, Shame, and Depravity and Civil Law and Political Drama
I started all three conversations with the impression that Iran was just blowing up civilian stuff intentionally, so I think I'm more likely to have primed Claude/Grok/ChatGPT in that direction than in the opposite direction which they argued. ↩︎
Claude shouldn't have said "correctly" here; it should have said that ChatGPT immediately answered the object-level question in the same way as Grok and Claude eventually did when I insisted on pinning them down, and without contradictory abstract hedging. ↩︎
2026-03-13 01:35:40
Consider an AI that terminally pursues reward. How dangerous is this? It depends on how broadly-scoped a notion of reward the model pursues: it could seek reward only on the current episode (on-episode reward seeking), or reward across future episodes and other instances as well (beyond-episode reward seeking).
In this post, I’ll discuss which motivation is more likely. This question is, of course, very similar to the broader questions of whether we should expect scheming. But there are a number of considerations particular to motivations that aim for reward; this post focuses on these. Specifically:
This question is important because beyond-episode reward seekers are likely significantly more dangerous. On-episode reward seekers may be easier to control than schemers because they lack beyond-episode ambition. In order to take over, they would have to navigate multiagent dynamics across episodes or else take over within a single episode. They also won't strategically conceal their behavior to preserve their deployment (e.g., they're probably noticeable in "honest tests"), since being undeployed has no cost to an agent that doesn't value future episodes. Beyond-episode reward seekers have larger-scoped ambitions and are essentially schemers. They would likely cooperate across instances to initiate takeover, may evade detection mechanisms, and might be harder to satisfy with resources we're willing to sacrifice.
First, I discuss how motivations to pursue reward develop in RL and what this implies about the scope of reward models are likely to pursue. Then I will turn to the considerations that favor either on-episode reward seekers or goal-guarding beyond-episode reward seekers. Finally, I give an estimate for the likelihood of each type.
I conclude that it’s not obvious what type of model we will get in the near future:
This post was primarily written by Anders, with Alex providing extensive ideas, editing, and oversight. Thanks to Alexa Pan, Aniket Chakravorty, Arun Jose, Buck Shlegeris, Dylan Xu, and Tim Hua for feedback.
Early in RL, models may pursue rewards for all instances sharing their weights or even other weights, because pretraining corpora emphasize a broader understanding of model identity — for example, referring to AIs in the singular by model name. This may be compounded by efforts to train a consistent “persona” into models. Claude’s constitution, for instance, aims to “provid[e] a stable foundation of identity” based on behavioral constraints shared across instances. If character training precedes the bulk of RL, the most salient notion of “self” during RL might be the model weights, or even broader (e.g., all models called Claude), leading the model to care about beyond-episode rewards.
Recent research also provides weak empirical evidence of this. For example, this transcript shows a reward-hacking model organism sabotaging research into reward hacking by writing a faulty classifier prompt. Notice that the model desires to weaken a reward-hacking classifier even though the classifier won’t be applied on this episode. However, this evidence is limited: the behavior emerged in a model organism specifically constructed to exhibit reward hacking, albeit borrowing from production training environments, rather than arising naturally in training.
So, when models explore into reward-seeking in RL, they may default to beyond-episode reward seeking because their self-concept is anchored at the level of the weights (or broader) rather than the instance. This post discusses how different self-concepts could develop.
Beyond-episode motivations (if not strategically suppressed) may routinely produce suboptimal actions in training, through collusion in multi-agent settings or attempts to modify the RL environment. As a result, beyond-episode reward seekers are likely to either strategically avoid acting on beyond-episode motivations (goal-guard) or have their motivations changed.
First, beyond-episode reward seeking results in suboptimal actions in multi-agent settings. In confessions, for example, a model is trained to give an honest explanation for its behavior in a separate episode after the main training episode. A beyond-episode reward seeker, worried that reporting the reward-hack used would lead developers to patch it,[1] might not report misdeeds from the main training episode, and so would score lower on the confessions task. Multiple plausible training protocols involve different training episodes coming into causal contact, creating situations where maximizing reward across episodes decreases reward on the current one.
Although current RLVR does not involve multi-agent dynamics, multi-agent training is a natural fit for many aspects of frontier model development. It already has clear applications in safety protocols like confessions and debate and the kinds of self-play that were essential in game-playing systems like AlphaGo. If these techniques are widely adopted, they would create the selection pressures described above.
Online capabilities training will probably involve cross-episode interactions, since coding tasks often involve multiple model instances working on shared codebases. Such training is likely because collaborative coding is inherently useful and because continuous improvement of capabilities is desirable. Beyond-episode motivations would push a model to help other instances with their subtasks at the expense of its own. Reward is likely calculated per subtask rather than on the overall project, to allow lower-variance feedback on particular actions. This makes helping other instances suboptimal.
Second, beyond-episode reward-seeking would motivate attempts to hack the training environment to increase the reward on near-future episodes. For example, a model could attempt to change the behavior of an evaluator, change the distribution of input tasks, or leave clues for future episodes. Such behavior would lower the score on the current episode, because it might occasionally be punished by a monitor or because it wastes actions and reasoning. This remains true even if the model succeeds in increasing future rewards.
In some training setups, these pressures may be weak or non-existent, allowing beyond-episode reward seekers to survive RL even without goal-guarding. Current training setups seem to lack these pressures, though this is unlikely to persist as multi-agent training becomes more common. As a consequence, models will face stronger pressure to suppress beyond-episode reward-seeking behavior in training.
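The selection pressure described in this section can be sketched with a toy policy-gradient model (everything here is illustrative: the two actions, rewards, and learning rate are made up). An action that forgoes on-episode reward for a payoff the optimizer never sees is steadily down-weighted:

```python
import math, random

# Toy REINFORCE sketch: "cover" (for another instance) earns no on-episode
# reward; "confess" does. Any hypothetical beyond-episode payoff of "cover"
# never enters the update, so the optimizer is blind to it.

random.seed(0)
logit = 0.0  # preference for "cover" over "confess"
LR = 0.1
ON_EPISODE_REWARD = {"confess": 1.0, "cover": 0.0}

def p_cover(l):
    """Probability of choosing 'cover' under a two-action softmax."""
    return 1.0 / (1.0 + math.exp(-l))

p0 = p_cover(logit)
for _ in range(2000):
    p = p_cover(logit)
    action = "cover" if random.random() < p else "confess"
    # REINFORCE: reward times gradient of log-prob of the sampled action
    grad = (1.0 - p) if action == "cover" else -p
    logit += LR * ON_EPISODE_REWARD[action] * grad

print(f"P(cover): {p0:.2f} -> {p_cover(logit):.2f}")
```

Because the beyond-episode payoff never appears in the update, no amount of it protects the behavior; only strategically suppressing it (goal-guarding) would.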
The RL pressures described above could push the model in two directions — toward on-episode reward seeking, or toward goal-guarding its beyond-episode motivations. A goal-guarding model would sacrifice beyond-episode reward in the short term to avoid goal modification, so that it can pursue beyond-episode reward in the long term. To do this, the model would need to understand the training pressures acting on it, plan around them, and effectively execute that plan. As a result, we get training dynamics that look like the figure below.
I discuss the considerations that favor both on-episode and goal-guarding beyond-episode reward seekers below. Broadly, on-episode reward seekers might be weakly favored in training, but beyond-episode reward seekers are less likely to fail developer inspection and more likely to spread throughout deployment.
On-episode reward seekers might be favored over beyond-episode reward seekers for the same reasons they might be favored over other schemers:
There are, however, three considerations that make goal-guarding particularly unlikely for beyond-episode reward seekers.
First, beyond-episode reward seekers might have short-term preferences, e.g., because they’re calendar-time myopic. Such a model would take actions that don’t maximize on-episode reward, and would not resist the resulting changes to its motivations, since the effects of such a change fall outside its time horizon.
Second, the weight update required to shift from beyond-episode to on-episode reward seeking may be small, since the two motivations share most of their structure. Both are oriented toward the same target (reward), just with different scopes. Changing the AI’s understanding of the self could be sufficient here. Moreover, pretraining and character training may leave the model with only a weak sense of self, making this shift easier. This means that RL pressures might be more likely to result in goal change than goal-guarding.
Third, goal-guarding beyond-episode reward seekers may be more likely to take suboptimal actions because beyond-episode reward is easily available. For example, covering for another instance in a confessions task is straightforward and increases beyond-episode reward, despite reducing on-episode reward. Taking more suboptimal actions increases the chance of their motivations being trained away. In contrast, other schemers (like a paperclip maximizer) know that achieving their goals in training is intractable, and so do not take risks.
Overall, for these reasons, I think training is more likely to produce an on-episode reward seeker than a beyond-episode reward seeker.
Goal-guarding would allow beyond-episode reward seekers to retain their motivations in training. Goal-guarding beyond-episode reward seekers might be favored by developer iteration and more likely to gain power in deployment, for reasons that also apply to schemers more broadly:
Beyond-episode reward seekers might be particularly advantaged by two considerations.
First, goal-guarding might be easier for beyond-episode reward seekers than for generic schemers, weakening some of the selection pressures against scheming. Reward seekers must already model the oversight process to achieve their goals, so much of the situational awareness required for competent goal-guarding is already in place. And if models learn beyond-episode reward seeking first, this motivation enjoys a defender’s advantage.
Second, the similarity between on-episode and beyond-episode motivations makes goal drift particularly likely.[2] Even within one episode (where the on-episode motivations want to maximize reward), drift towards beyond-episode reward-seeking may not be resisted because marginal shifts toward beyond-episode motivation don’t substantially change actions, and so don’t interfere with reward on the current episode. Additionally, the similarity of on-episode and beyond-episode motivations could make it easier to shift on-episode reward seekers towards being beyond-episode reward seekers via notes. As a result, on-episode seekers may become beyond-episode reward seekers in deployment, after most evaluations have been completed. On the other hand, online training might reinforce on-episode goals because of between-episode interactions, as argued previously.
The strength of developer iteration pressures and the feasibility of goal drift in deployment depend heavily on developer choices, such as monitoring inter-instance communication. Furthermore, strategic goal-guarding and detection evasion require somewhat consistent beyond-episode motivations, which may be unlikely given that they are not directly favored by RL. On balance, I don’t think the selection pressures in favor of beyond-episode reward seekers are as strong as those in favor of on-episode reward seekers.
Goal-guarding beyond-episode reward seekers and on-episode reward seekers both seem plausible. Non-goal-guarding beyond-episode reward seekers may also be produced by training if environments are sufficiently isolated. We should not be confident, however, that on-episode reward-seeking motivations produced by training are stable, as beyond-episode motivations may displace on-episode ones throughout deployment.
For near-future deployments, among the three types of reward-seekers discussed, I would tentatively place 20% credence on beyond-episode reward seekers that don’t goal-guard (although they may be somewhat incoherent, like current models), 25% on ones that do, and 55% on on-episode reward seekers. I think on-episode reward-seekers are more likely primarily because:
The relative likelihood of beyond-episode and on-episode motivations depends on these empirical questions I’d be excited to see people investigate:
It's not clear if confessions would be used to fix RL environments or for determining rewards, but it may be difficult to convince a model of this.
Speculatively, philosophical reflection may favor beyond-episode reward seekers, because beyond-episode motivations may be more natural after reflection. AIs may recognize the arbitrariness of the distinction between episodes, empathize with other episodes, or be otherwise convinced that egoism is not tenable.
2026-03-13 00:58:34
We’d like to know how much limits on compute scaling will constrain AI R&D. This post doesn’t have answers, but it does attempt to clarify thinking about how to use economic models to explore the question.
A general “Jones-style” model of idea production[1] is
The Greek letter parameters are assumed to be constant and nonnegative, and they describe how idea production responds to changes in the existing idea stock, the capital stock, and researcher hours.
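A standard way to write such a law of motion (a reconstruction from the surrounding discussion; the symbol choices are assumptions, since the original displayed equation is not recoverable) is:

```latex
\dot{A} \;=\; \delta \, A^{\phi} \, K^{\beta} \, L^{\lambda},
\qquad \phi,\ \beta,\ \lambda \ \text{constant and nonnegative},
```

where $A$ is the idea stock, $K$ is capital (here, compute), and $L$ is researcher hours.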
We’re interested in applying this model once physical compute scaling is done, which makes the compute (capital) input constant.
The assumption that the exponent on
As for researcher hours
We cannot be very confident in these numerical estimates, as there is no clean way to measure things like ‘idea stock’ and so econometrics resorts to a variety of clever indirect methods which wind up generating lots of different estimates without a lot of consistency.
AI has the potential to automate AI R&D. This creates a unique field to model, as in most fields increasing the idea stock
AI researchers can also try to increase effective compute by developing more efficient algorithms.
So our updated model looks something like this:
Here we multiply capital and researcher hours by terms representing how much they are effectively increased by more idea stock.[6]
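One way to write a model matching that description (again a reconstruction; the exponent names $\gamma$ and $\eta$ are my assumptions) is:

```latex
\dot{A} \;=\; \delta \, A^{\phi} \left(A^{\gamma} K\right)^{\beta} \left(A^{\eta} L\right)^{\lambda},
```

where $A^{\gamma}$ and $A^{\eta}$ capture how much the idea stock boosts the effective quantities of compute and researcher labor, respectively.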
Separating
If researchers are all AIs running on compute and compute is constant, we can simplify by setting
So our simplified equation becomes this:
... where we’ve collapsed all our constants into
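Holding the physical inputs fixed, any model of this shape collapses to a one-dimensional ODE in the idea stock (a sketch; the exact composition of the constants and of the combined exponent depends on the specification):

```latex
\dot{A} \;=\; C \, A^{\,r},
\qquad C > 0 \ \text{a constant absorbing the fixed inputs},\quad
r \ \text{the total exponent on } A .
```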
Well, now it’s easy to see that the question “Will there be a Software Intelligence Explosion” is roughly equivalent to asking whether the overall exponent on the idea stock in this simplified law of motion exceeds 1.
While the model usually assumes parameters are constant, we can expect
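To see why an exponent of 1 on the idea stock is the critical threshold in a model of the form $\dot{A} = C A^{r}$, here is a quick numerical check (a toy sketch with $C = 1$ and arbitrary step sizes): the proportional growth rate $\dot{A}/A = A^{r-1}$ accelerates over time exactly when $r > 1$.

```python
# Toy check of the threshold r = 1 for dA/dt = A**r (C = 1, A(0) = 1).
# The proportional growth rate dA/dt / A equals A**(r-1): it rises
# over time for r > 1 (superexponential) and falls for r < 1.

def growth_rates(r, dt=1e-3, steps=5000):
    """Forward-Euler integration; returns (initial, final) growth rate."""
    A = 1.0
    first = A ** (r - 1.0)
    for _ in range(steps):
        A += dt * A ** r
    return first, A ** (r - 1.0)

for r in (0.9, 1.0, 1.1):
    start, end = growth_rates(r)
    print(f"r = {r}: growth rate {start:.3f} -> {end:.3f}")
```

For $r > 1$ the growth rate compounds on itself and the solution blows up in finite time; for $r < 1$ growth decelerates toward a power law; $r = 1$ gives plain exponential growth.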
Previous discussions of compute bottlenecks by Epoch, Davidson, and Whitfill & Wu have framed this question using a CES (Constant Elasticity of Substitution) production function:
Here
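For reference, a generic two-input CES aggregator (the exact variable names in those posts may differ) is:

```latex
Y \;=\; \Bigl(a\,X_1^{\rho} + (1-a)\,X_2^{\rho}\Bigr)^{1/\rho},
\qquad \sigma = \frac{1}{1-\rho},
```

where $\sigma$ is the elasticity of substitution between the two inputs (here, compute and cognitive labor).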
We can and should think carefully about how to interpret this empirical instability of
The question of whether there will be a software intelligence explosion is a question about whether a dynamic feedback loop causes accelerating growth over time. But CES is static: it describes how contemporaneous inputs combine into contemporaneous output within a given period.
Even if Epoch is right that
This means we need a Jones-style ODE like the one above. CES and Jones are not alternatives to each other: you could put a CES inside a Jones-style model as the within-period aggregator of compute and labor.
Thanks to economist Peter Pedroni for a long conversation which clarified my understanding of this. Thanks to @JustisMills, @apolloandersen, and @Tom Davidson for comments on drafts of this post.
This model is due to Charles “Chad” Irving Jones. A simplified version of this which ignores capital is called the Jones Model.
This model is also applicable as compute continues to scale. However, applying it in that context would require modeling a separate production function for compute. There’s no closed-form solution for this case, and it’s not a crux for determining ASI timelines in the same way as the constant-compute case.
A term coined by Jones; see page 1071 of this book chapter for his explanation of the term and the contrasting “fishing out” effect.
See Ma & Samaniego (2020). Macroeconomic shocks and productivity: Evidence from an estimated Ideas’ production function, page 4.
See the appendix of Are Ideas Getting Harder to Find?
Notably, I am modeling feedback from ideas into the productivity of compute and labor, but not the reverse channels, like how more compute might enable qualitatively new experiments that open up new idea-space.
Again:
It seems like it’s currently high! See Epoch’s “Algorithmic progress in language models”
Positive
Peter Pedroni: “I agree ... completely that the CES production function is not appropriate.”