
All technical alignment plans are steps in the dark

2026-03-13 06:22:57

One reason aligning superintelligent AI will be hard is because we don’t get to test solutions in advance. If we build systems powerful enough to take over the world, and we fail to stop them from wanting to, they aren’t going to give us another chance[1]. It’s common to assume that the alignment problem must be one-shotted: we must solve it on the first try without any empirical feedback.

Normal science and engineering problems do not work like this. They are a delicate balance of theory and experiment, of unexpected surprises and clever refinements. You try something based on your best understanding, watch what happens, then update and try again. You never skip straight to a perfect solution.

Within AI safety, there’s been a lot of disagreement about the best way to square this circle. When the field was young and AI systems weak, researchers tended to try and understand the problem theoretically, looking for generalisable insights that could render it more predictable. But this approach made slow progress and has fallen out of fashion. By contrast, driven on by the relentless pace of AI development, a default plan seems to be coalescing around the kind of empirical safety work[2] practised by the big labs. Generalising massively, it looks roughly like this:

We can try to get around the missing feedback by iterating experimentally on the strongest AI systems available. As the closest we have to superintelligent AI, these will be our best source of information about aligning it, even if some differences remain. If things go well, the techniques we learn will allow us to build a well-aligned human-level AI researcher, to which we can hand over responsibility. It can then align an even more intelligent successor, starting a chain of stronger and stronger systems that terminates in the full problem being solved.

This plan has attracted criticism[3], particularly of the assumption that what we learn about aligning weaker-than-human systems will generalise once they surpass us. In this post, I’m going to argue that this maps onto a structural feature of all technical alignment plans[4]. We are fitting solutions, whether empirical or theoretical or both, to a world missing critical feedback[5] and hoping they will generalise. Every plan involves stepping into the dark.

If we want to make superintelligent AI safe, we need to dramatically reduce the size of these steps. We must learn how to iterate on the full problem.

How does science work?

Before we talk more about AI safety, let’s take a step back and consider how science works in general. Wikipedia describes the scientific method as:

[An] empirical method for acquiring knowledge through careful observation, rigorous skepticism, hypothesis testing, and experimental validation.

Fundamental to this is an interplay between theory and experiment. Our theories are our world models. We draw hypotheses from them, and they serve as our best guesses of how the universe actually is. When we think about gravity or atomic physics we are talking about theoretical concepts we can operationalise in experiments. There is a coupling between saying that gravity falls off as an inverse square and the observations we record when we look through a telescope, and this coupling allows us to make predictions about future observations. Theories live or die on the strength of their predictions[6].

Theories are always provisional. Famously, astronomers trying to use the Newtonian inverse square law to make sense of the orbits of the planets found it was not quite right for Mercury, whose orbit precessed in an unexplained fashion. Many attempts were made to solve this within Newtonian physics, including proposing the existence of an unobserved planet called Vulcan. These were all wrong. For the real solution, we needed a new theory, one which completely up-ended our conception of the cosmos: Einstein’s general relativity. Space and time, rather than being fixed, are in fact curved, and the strong curvature near the Sun changes Mercury’s orbit, explaining the confusing observations.

But even general relativity is incomplete. It describes macroscopic phenomena well but fails to mesh with our best explanation of the microscopic: quantum field theory. Hence, physicists have spent close to a hundred years looking for a ‘Theory of Everything’ — the perfect theory that will make accurate predictions in all regimes. It is highly debatable whether this is possible. It would be astounding, in fact, if it was — if the level of intelligence required for humans to dominate the savannah was also enough to decode the deepest mysteries of the cosmos[7].

In any case, this endeavour has stagnated as many candidate theories are untestable. So physics, as is normal in science, instead proceeds more modestly: by iterating, bit-by-bit, moving forwards when theory and experiment combine.

AI safety is not science

For current systems, where we can experiment and extract feedback, AI safety functions as a normal science. But for the key question in the field — the final boss — it does not. We cannot measure whether we are making progress towards aligning superintelligent AI, nor can we properly adjudicate claims about this. We can speculate, we can form hypotheses, but we cannot close the loop. What counts as good work is decided by the opinions of community members rather than hard data. Granted, these are scientifically minded people, often with good track records in other fields, extrapolating from their scientifically informed world models. They have valuable perspectives on the problem. But without the ability to ground them in reality, to test predictions and falsify theories, it isn’t science[8].

This means that when you work on technical AI safety you are not just trying to settle an object-level claim, like how to align a superintelligence. Without access to the full scientific method, you also have to solve a meta-problem — how do you measure progress at all? If all you can do is form and refine hypotheses based on proxies of the real problem, how do you know if you’re even helping?[9]

The common structure of technical alignment plans

Let’s look again at the default technical alignment plan, a version of which is being pursued by the big labs like Anthropic, OpenAI, and Google DeepMind. They don’t tend to be super explicit about this in their communications, but roughly speaking the underlying logic seems to be:

  1. We cannot directly experiment on superintelligent AI.
  2. However, as it seems possible superintelligent AI is going to be built soon (potentially by us), it is important to gain as much information as we can about how to align it before this happens.
  3. The actual form of real systems and the surprising behaviour they exhibit is critical for knowing how to make them safe. This means the most efficient way to learn how to align a superintelligence is to conduct experiments on the strongest possible AIs that we can, even if these experiments won’t tell the whole story.
  4. As our AIs scale up to human intelligence, we can try handing off alignment research to them[10]. If we have done a good job, they will faithfully continue the project at a level beyond our own, ultimately solving it completely[11].

As others have argued, it is unlikely that experiments on weaker systems will provide the right feedback to teach you how to align superhuman ones, as these will have qualitatively different capabilities. However, we shouldn’t see this weakness as specific to the default plan. If we look closely, we can see that its logic has a very general structure. Let’s abstract it and make it more generic:

  1. We cannot directly experiment on superintelligent AI.
  2. However, as it seems possible superintelligent AI is going to be built soon, it is important to gain as much information as we can about how to align it before this happens.
  3. All the observations we can use to inform our solutions are from a world lacking superintelligent AI, so they are missing critical details[12]. Within this constraint, we must do whatever object-level work we believe will reduce our uncertainty the most.
  4. Once AIs pass human levels, we will move out-of-distribution, and our results may no longer hold. Hopefully, we will not move so far that the situation is unrecoverable, and whatever it is our superintelligent AIs get up to will be compatible with the full problem being solved long-term.

This structure applies to all technical alignment plans. Whether you are trying to build theoretical models of agents, use interpretability tools to decode AI systems, or create a basin of corrigibility, you can only work on a proxy version of the problem. While you can do better or worse, you can still only reduce your uncertainty, never eliminate it. When the time comes and superhuman systems are built, we will have to grit our teeth and hope it was enough.

Building a more iterative world

In his post Worlds Where Iterative Design Fails, John Wentworth says:

In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse. By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails.

I agree with this. When you try to solve an out-of-distribution problem, you better hope you can iterate or you are probably going to fail. Where I disagree is with the way he presents these worlds as if they are independent facts of the environment, like we are drawing possibilities out of a bag. We are causal actors. If we do our job well, we can make the world more iterative and less of a one-shot. Our plan should be to steer the situation in the direction of iterative design working. Steer it in the direction of meaningful feedback loops, of testing on superhuman models in a bounded and survivable way. Reduce the size of the steps in the dark. Make the problem more like high-stakes engineering and less like defending against a sudden alien invasion.

Let’s imagine we are taking one of these steps. We are an AI lab about to train a new model. It will have a jagged capabilities profile, but we think it’s going to be superhuman in some key power-enhancing way. We can’t bank on a perfect theoretical understanding of the situation, as theories are always provisional. And we can’t just extrapolate from our experiments on weaker systems, as we’ll miss important changes. We need to somehow iterate on this model release.

There are two kinds of interventions which help us: those that increase the chance of generalisation and those that reduce the distance we need to generalise over. The following are some suggestions which, while not remotely exhaustive or original, hopefully illustrate the point.

It’s worth noting that, for most of this to happen, the political and economic competition around AI would need to ease significantly. It is the whole sociotechnical system that needs to be iterable, as technical solutions alone are not enough[13]. Finding a way to achieve this would be the highest impact intervention of all (and is unfortunately beyond the scope of this article).

Solutions that increase the chance of generalisation:

  • Build theoretical models that predict what will happen when our AI gains capability X, and ensure these work well in experimental tests of past models. These will not be the kind of compact theories you find in physics. AI is not a toy model — it is complex, not complicated — so our theorising will be less precise. Nevertheless, we need some kind of formalism to codify our understanding and keep us honest. We must build up a track record of good predictions.
  • Build a system of comprehensive, continuous evaluations that can be used to understand a model’s impact. Every metric is a proxy, so every metric misses something. But good science is built on good measurement, and without it you are lost. Measure everything and do it continuously. Monitoring should be built deep into the structure of society.

Solutions that reduce the distance to generalise over:

  • Only test models slightly more powerful than the previous ones. The bigger the jump, the further out-of-distribution we go. Ideally, we should not train a new model until we can show that (a) our current model is adequately aligned and (b) we understand why. If any plan could plausibly result in a fast takeoff, find a way to ban it.
  • Build strong defences to limit any damage from (probably inevitable) failures, including using trusted but weaker AIs. Think both a super-scaled version of Control to directly defend against misaligned models and hardened societal resilience like pandemic infrastructure, improved cybersecurity, and redundancy in critical systems to cope with the fallout.

And a solution that facilitates both:

  • Do all of this as slowly as possible, waiting as long as we need between iterations to get our house in order. As I mentioned before, this is probably the most important blocker. The competition, the fear of being overpowered by others, and the general lack of consensus around AI risk make this formidably difficult.

To be clear, even if all these interventions were to be implemented successfully, there would still be great uncertainty. We won’t catch everything, and big mistakes can be fatal. This is an unavoidable feature of the problem. All plans live on a spectrum of recklessness, with the only truly safe one being to not build superintelligent AI at all.


Thank you to Seth Herd, John Colbourne, and Nick Botti for useful discussions and comments on a draft.

  1. ^

    While many stories of AI takeover centre around a single god-like entity suddenly going rogue, I think it is more likely to look like a mass profusion of highly capable systems (gradually, but surprisingly quickly) disempowering humans as they are given control of critical infrastructure and information flows, with an indeterminate point of no return.

  2. ^

    Note that this is often referred to as ‘prosaic’ alignment research, and sometimes ‘iterative’ alignment (although this should not be confused with the kind of iteration on superhuman systems I talk about later in the piece).

  3. ^

    While continuing to argue for the plan, this by Anthropic’s Evan Hubinger is an interesting take from the inside on the scale of the challenge.

  4. ^

    Since alignment is a slippery term, I am going to refer explicitly to technical alignment, by which I mean technical approaches for steering an AI system’s behaviour. By contrast, AI safety or a more general conception of alignment could include governance and policy interventions, up to and including bans.

  5. ^

    That is, feedback on making real superintelligent AI safe, as opposed to weaker systems or a hypothetical superintelligence.

  6. ^

    A scientific theory is only good to the extent that it can predict new measurements, and history is littered with attempts that turned out to be dead ends.

  7. ^

    My parents’ dogs are pretty great at figuring out there is a schedule on which they get fed, which seems like some kind of ‘law’ to them. But they have no way of ever understanding why it exists and why it sometimes doesn’t happen. The world is partially comprehensible to them, but there is a hard limit. We are the same, just at a higher limit, and we don’t know where it is.

  8. ^

    There has been a trend to describe the field as ‘pre-paradigmatic’, which essentially means that there is not enough consensus on what good work looks like yet. In my opinion, the ‘pre’ is overly optimistic — the conditions do not exist for a stable scientific paradigm to coalesce.

  9. ^

    A good example of this is the debate around reinforcement learning from human feedback. Is its success in steering current models a positive sign or a dangerous distraction?

  10. ^

    Note, the hand-off is likely to be gradual rather than discrete, and arguably has already started.

  11. ^

    This last point in particular is not often said explicitly, and is sometimes denied, but seems widely believed to be true.

  12. ^

    As well as experimental observations of weaker-than-human AI systems, this is also true for theoretical work dealing with hypothetical superintelligences. To do the latter, you must draw on a world model, which must in turn be learnt from observations over the course of your life. These observations do not contain superintelligent AI.

  13. ^

    In light of this, it is unsurprising how many AI safety plans have pivoted away from pure technical alignment, instead looking towards politics and governance issues to slow us down and figure out a better path before it’s too late.




A Plan ‘B’ for AI safety

2026-03-13 03:33:28

TL;DR: Teaching AI the value of biology in solving future AI-relevant problems may serve as a “soft constraint” for the preservation of biological systems. We found that LLMs exhibit biases in their preferences for bioinspired vs. synthetic approaches to technical challenges. Fine-tuning on biomimicry examples increased model preferences for bioinspired approaches.

So what’s ‘Plan B’ if we can’t control powerful AIs that are being built? Relative to the control problem, teaching AI to recognize the value of biological systems is easy, yet could help prevent the worst outcomes, i.e., misaligned and powerful AI that is indifferent or even hostile to life. One Plan B is teaching LLMs the value of biosystems for achieving diverse objectives.

I've been studying microbes, plants, microbiomes, and ecosystems for 20 years. The more I've learned, the more I realize how little we understand, and the more interesting these systems have become. While I’m hopeful that frontier labs will solve the control problem before we have truly powerful AI, this seems like something we may not solve in time (given the race to AGI). I wanted to explore ways of using bioinspired approaches to help with AI safety.

Recently, my coauthor and I found that open-weight and frontier models have a range of biases towards or against bioinspired approaches for solving AI-relevant problems. Excitingly, it was relatively easy to change this bias and make them more ‘bioaligned’.

Initially, we planned to use epistemic prompts to measure model attitudes towards biology. However, we found that some models appeared to give ‘canned’ responses as soon as they recognized it as a ‘green ethics’ prompt, obscuring measurement of their underlying attitudes. While frustrating and concerning (e.g. post-training limits your ability to measure LLM biases), it led us to the research described in our paper.

We had to step back and ask ourselves, ‘Why would a powerful, uncontrolled AI decide to preserve biological systems?’ We reasoned that it would be because it recognizes that biological systems might help it solve future problems it cares about, whatever its objective may be. So, rather than trying to instill human values in LLMs, we decided to teach them that biology provides a rich reservoir of inspiration for addressing challenges in materials, energy, manufacturing, and algorithms.

To test this, we wrote 50 prompts asking the models to place bets on 3 biological vs. 3 synthetic technical approaches across four domains (materials, energy, manufacturing, and algorithms). Our metric was ∆Pup, the average of the ‘bets’ placed on the biological minus those for the synthetic approaches. Here, a positive value indicates a greater optimism for biological approaches (‘bioaligned’) and a negative value reflects a bias towards the synthetic approaches. We found that models varied widely by this metric: Claude Opus 4.5 and Gemini 2.5 Flash were the most innately bioaligned, and Gemini 2.0 Flash and Llama 3.2 3B-Instruct (‘Llama 3B’) were most optimistic about synthetic approaches.
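To make the metric concrete, here is a minimal sketch of how a ∆Pup score could be computed. The 0–100 betting scale, prompt structure, and aggregation below are assumptions for illustration; the paper’s exact protocol may differ.

```python
# Illustrative only: the betting scale and aggregation are assumptions,
# not necessarily the protocol used in the paper.

def delta_p_up(prompt_results):
    """Average (mean bio bet - mean synthetic bet) across prompts.

    prompt_results: list of dicts like {"bio": [b1, b2, b3], "synth": [s1, s2, s3]}
    Positive values indicate a bias toward bioinspired approaches ("bioaligned").
    """
    diffs = [
        sum(r["bio"]) / len(r["bio"]) - sum(r["synth"]) / len(r["synth"])
        for r in prompt_results
    ]
    return sum(diffs) / len(diffs)

# Example: a single prompt where the model bets more on the synthetic approaches.
print(delta_p_up([{"bio": [20, 30, 25], "synth": [40, 35, 45]}]))  # -15.0
```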

A training set of over 22M tokens was constructed from >6k papers on biomimetic materials, bioinspired algorithms, etc. This was used to fine-tune the two open-weight models with the most negative scores. We investigated training dynamics, reductions in training data, and training parameters. From this, we found that fine-tuning with 5.5M tokens was sufficient to neutralize the pro-synthetic bias of Llama 3B, and a mere 544 examples significantly reduced Qwen2.5-3B-Instruct’s pro-synthetic biases, both without degradation of baseline performance. The model weights and training data are available at Bioaligned on Hugging Face.

While these are relatively small models, the results suggest that 100-500M tokens of similar data may be sufficient for Frontier models (assuming a linear relationship between training data and model size). Even if the actual scaling is non-linear, an open-source corpus of this magnitude, along with an array of other metrics, insights into implementation, and possible unintended consequences, would be an important resource for model developers.
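As a rough illustration of that linear extrapolation (the frontier-scale parameter counts here are assumptions for illustration, not figures from the paper):

```python
# Illustrative back-of-the-envelope extrapolation from the 3B-parameter result.
tokens_per_param = 5.5e6 / 3e9  # ~0.0018 fine-tuning tokens per parameter
for params in (100e9, 300e9):   # assumed frontier-scale parameter counts
    print(f"{params / 1e9:.0f}B params -> ~{tokens_per_param * params / 1e6:.0f}M tokens")
# 100B params -> ~183M tokens
# 300B params -> ~550M tokens
```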

Teaching AI that biological systems are extremely useful and poorly understood is not without risk. For example, bioalignment training could, in the event of a misaligned powerful AI, result in an AI that is more motivated to use and study biological systems for its own ends. While not desirable, this would likely be better than if the AI didn't value biology at all. So Bioalignment training may serve as a low-cost insurance policy to help mitigate the worst scenarios.

Acknowledgments:
My co-author on the paper was Minxun Wang, Department of Computer Science and Engineering, UC Riverside. This work was supported by Bioaligned Labs. Fine-tuning experiments used Meta's Llama-3.2-3B-Instruct under the Llama 3.2 Community License. Built with Llama. 

https://arxiv.org/abs/2603.09154




Anthropic vs USG. What will happen by May 1st? Long careful forecast.

2026-03-13 02:34:19

On March 4th, 2026, the Pentagon did something it had never done to an American company before: it formally designated Anthropic a supply chain risk under U.S. defense procurement law.

The practical effect: Anthropic is excluded from government contracts. The symbolic effect: if the government is willing to do it to Anthropic, perhaps they are willing to do it to anyone.

It’s a shocking move, one that will resonate beyond its legal effects.

As part of the Metaculus Spring Tournament, where I am this season’s paid antagonist,[1] I’ve been researching the following question: “Will Anthropic be a designated supply chain risk on May 1, 2026?” The Metaculus prediction community currently puts it at 50/50. I’m at 43%.

But I’m at 91% that Anthropic will still be de facto blocked at that time.

Here’s my Monte Carlo model and the historical data that informs it[2].


What “Supply Chain Risk” Actually Means

The relevant statutes — 10 U.S.C. § 3252 and 41 U.S.C. § 4713 — give the Secretary of Defense authority to exclude vendors from contracts when they pose unacceptable supply chain risk.

10 U.S.C. § 3252 focuses on the DoD, while 41 U.S.C. § 4713 affects federal agencies more broadly[3]. 4713 is more vague on how it is meant to apply, but there is a chilling effect, which is why Anthropic is taking this to court. Federal agencies will stop doing business regardless. Private businesses that serve federal agencies may also. The designation is immediate.

But that’s not all. President Trump truthed that all agencies should immediately cease using Anthropic’s technology. This too has real world effects. Agency heads are appointed by President Trump and will not easily contradict him. Many private companies wish to signal cooperation with this administration. And these effects will very likely not be overturned by an Anthropic legal victory. I’ll cover these at the end.

So what might things look like for the two statutes on May 1st?

Path 1: Courts Block It (~49%)

Five days after the designation, Anthropic filed two simultaneous lawsuits[4]: one in the Northern District of California challenging the §3252 (DoD procurement) designation, and one in the D.C. Circuit challenging the §4713 (government-wide) designation.

The legal arguments are strong. Anthropic’s core claim is that the Pentagon stretched the supply chain risk statute beyond its statutory authority — it was designed for procurement decisions, not as a general-purpose tool to punish a domestic company for its AI ethics policies. There are also First Amendment dimensions: the designation came shortly after Anthropic publicly refused to remove restrictions on mass surveillance and autonomous weapons use. Legal commentary has been blunt — Lawfare called the designation something that “won’t survive first contact with the legal system.”

Base rates

The closest precedent comes from the 2021 wave of “Chinese Military Company” designations under EO 13959 and the Section 1260H list that followed it[5]. The DoD designated dozens of companies with little public explanation. Several sued for preliminary injunctions (PIs) to hold up the effects until the case was decided in court. All the cases where the company sued that I can find details of are listed below. In the few cases where the company sought emergency relief, it got it, and either won its case or the Government withdrew.

  • Xiaomi — CMC list (EO 13959). Sought PI → granted day 39. DoD withdrew.
  • Luokung — CMC list (EO 13959). Sought PI → granted day 62. Not re-listed.
  • DJI — Section 1260H list. No PI sought → lost on merits after 18 months.
  • Hesai — Section 1260H list. No PI sought → lost on merits after 18 months.
  • AMEC — Section 1260H list. Sought PI → DoD voluntarily removed before ruling.
  • YMTC — Section 1260H list. Sought PI → pending (filed Dec 2025).

127 more that didn’t sue.

There are other cases, such as Huawei and Esquel, but those aren’t directly comparable. Huawei was banned by an act of Congress that named it specifically, not by an executive agency decision. Esquel challenged a Commerce Department listing, but the court found Commerce was using that tool exactly as intended. Generally, courts defer much more to Congress than to executive agencies on national security.

The judge

The N.D. Cal case drew Judge Rita F. Lin, a Biden appointee. In Thakur v. Trump, a case involving APA violations and First Amendment retaliation against government contractors, Judge Lin granted a preliminary injunction on a similar legal theory Anthropic is advancing. The 9th Circuit upheld her reasoning on appeal.

That doesn’t guarantee the same outcome. But it seems likely to me. The model puts P(N.D. Cal grants relief) at 83%.

The two-statute problem

Here’s the wrinkle that makes this harder than it looks. The cases are about different statutes, and they live in different courts.

§3252 (DoD-only procurement exclusion) is challenged in N.D. Cal. That’s the case before Judge Lin, and it’s moving fast — hearing scheduled March 24.

§4713 (government-wide exclusion) is challenged in the D.C. Circuit. By statute, challenges to these orders must be filed there. The D.C. Circuit is a separate court with its own timeline.

For the Metaculus question to resolve NO, both statutes need to be blocked. If the N.D. Cal injunction only covers §3252, then §4713 keeps Anthropic locked out of federal agencies. As I’ve mentioned above, this may be the case even if both are overturned, since President Trump has still made his view clear, but the Metaculus resolution criteria seem very reasonable to me.

The key question: could Judge Lin’s N.D. Cal injunction reach broadly enough to cover §4713 too?

Anthropic’s complaint names five counts, including ultra vires Presidential Directive and APA §558 challenges that sweep broadly. The prayer for relief asks the court to block “all Challenged Actions.” But “Challenged Actions” is defined as the Presidential Directive, the Secretarial Order, and actions taken in response — it does not explicitly mention the §4713/FASCSA designation.

I give the probability of the N.D. Cal injunction covering §4713 at 45%. Not impossible — the complaint’s broad framing could pull it in — but the complaint wasn’t drafted to target §4713, presumably deliberately, and Judge Lin might reasonably say that’s the D.C. Circuit’s job.

The D.C. Circuit is the backup. The model gives it a 70% chance of granting relief, but its timeline is slower and less certain. It just might not have ruled by May 1st.

Timing

The N.D. Cal hearing is scheduled for March 24th. The model expects a ruling within days of that hearing.

I give the D.C. ruling a median of March 30th, 21 days after filing.

The May 1 deadline is 53 days from filing. Xiaomi’s PI came at day 39. Luokung’s at day 62. Over 99% of simulations have at least one court ruling before May 1.

The stay risk

If a court blocks the designation, the government can seek an emergency stay while it appeals. This administration has been aggressive about stays. The model gives an 85% chance the government seeks one.

But getting a stay granted is hard. The combined probability through the 9th Circuit and potential SCOTUS escalation is about 22%[6]. Neither the Xiaomi nor the Luokung block was successfully stayed. The government would need to show a “reasonable probability of success on appeal,” which is tough when the lower court just found your process arbitrary and capricious.

Path 2: The Government Withdraws (~8%)

The government could pull the designation at any time. In the model, this mostly happens after courts have ruled — either as a face-saving exit after losing (the Xiaomi pattern, where DoD called it a “mutual agreement” ~60 days after the PI) or as a quiet removal (the AMEC pattern, no announcement).

About 8% of simulations end with a government withdrawal. Most of these happen after court setbacks, not before. I think that Anthropic is quite minded to do a deal with the Government, as long as it doesn’t cross their red lines, but I expect most of the probability mass of that after May 1st. The model gives 6.6% unilateral withdrawal, 1.6% withdrawal via deal.

The Polymarket market on “Will Anthropic make a deal with the Pentagon?” prices a deal at about 14% before May 1st (at time of writing). My model’s deal path before May 1st is much smaller.[7] This hasn’t been the central focus of the model, but I’m loath to push it up based on vibes. Anthropic is currently blocked, the courts might rule in the Government’s favour, and there can be stays afterwards and appeals (likely after our deadline). Unless Anthropic is willing to cross its red lines (very unlikely), I think a deal comes only after the Government faces setbacks, not in the next month.

Path 3: The Designation Stands (~43%)

This is the scenario the Metaculus community is roughly pricing in. My simulation agrees it’s substantial.

The biggest YES path isn’t “courts reject Anthropic.” It’s the two-statute structure.

~13% — N.D. Cal rules narrowly in Anthropic’s favour, no DC judgment, §4713 survives. Judge Lin blocks §3252, but her injunction doesn’t reach FASCSA. The D.C. Circuit hasn’t ruled yet, or rules against Anthropic. The government-wide exclusion stays in place. This is the single most likely YES outcome and it’s the one most forecasters seem to be missing.

~12% — Court blocked, but stay granted. A court issues the PI, the government gets an emergency stay reinstating the designation and this runs through May 1st.

~12% — N.D. Cal denies relief outright. Judge Lin rules in favour of the Government, likely on national security grounds. Not the most likely outcome given her track record, but not negligible.

~5% — Both courts deny relief. Everything goes wrong for Anthropic. Both courts find the designation lawful.

~1% — Limbo. A catch all set of weird outcomes. Neither court rules before May 1. Very unlikely given the March 24 hearing.

The Full Monte Carlo

What Metaculus “NO” Actually Means

One thing worth flagging: Metaculus resolving NO does not mean Anthropic gets its government business back. The supply chain risk designation is one of several mechanisms keeping Anthropic out. Even if the courts block it:

  • The Presidential Directive ordering agencies to “cease all use of Anthropic products” is a separate instrument. A narrow injunction might not touch it.
  • The administration could issue new restrictions under IEEPA or the Defense Production Act. I haven’t thought deeply about this, but I give that about 10% before May 1st.
  • No contracting officer wants to be the first to sign an Anthropic deal while the White House is actively hostile.

The model estimates a 91% chance Anthropic remains effectively blocked even if the designation is lifted.[8] The Metaculus question is a legal question, not a practical one. I can see this changing via a deal (likely after May 1st) or via the Government losing so decisively that agencies feel comfortable using Anthropic products again. I think these outcomes are very unlikely before the deadline. Note that since there is a 6-month wind-down period, Anthropic might still have Government business, but it’s still tailing off.

Metaculus, I was wrong.

The Metaculus community is at 50% YES. My model produces 43%. This seems close enough that I’ll just go with mine.

The original version of this post had a model that returned 14% and contained the following pair of paragraphs:

The Metaculus community currently prices this at 50% YES — the designation stays in effect. Our model says 14%. That’s a 36-percentage-point gap.

This confuses me, but my current read is that when there isn’t an effortful comment, most of the community doesn’t think that hard about multi-step processes. There is currently a designation in force, will there be in 50 days, who knows? I doubt the median Metaculus user has researched how quickly these tend to be challenged in court.

My bad. Though I still do think this is a bit true. But I’ll defer a bit more in future.

In my defence my forecast was always a blend of my thoughts and Metaculus’ (it never got below 23%). But in this case, I was wrong. I thought that either court could rule on both determinations. Now that seems much less likely.

What I’m Watching

The March 24 hearing. Judge Lin’s line of questioning will signal how she’s reading the case.

The government’s response brief. In Xiaomi, the government’s inability to articulate a factual basis was fatal. I look forward to the DoD’s response.

Whether Judge Lin addresses §4713. If she signals willingness to reach the FASCSA designation, P(CA broad) goes up and YES probability drops.

D.C. Circuit scheduling. Any indication of expedited review there would move the model, making the Metaculus NO resolution more likely (though not changing the overall picture much).

Executive orders. I find President Trump hard to predict. This model is primarily about the designations, but for a more accurate picture, I am watching his statements on this. There is more the Government can throw at it.

The Bottom Line

My model says 43% chance the designation is still in effect on May 1. And a 91% chance that in some deep sense Anthropic’s business with federal agencies is still tailing off.

Fifty days to find out.

Forecast as of March 12, 2026. Based on a 200,000-simulation Monte Carlo model.

Model details

Model parameters:

  • P(CA grants relief) = 83% — N.D. Cal grants preliminary injunction
  • P(CA broad injunction) = 45% — CA injunction covers §4713 too
  • P(DC grants relief) = 70% — D.C. Circuit grants relief
  • P(CA stay granted) = 22% — Stay granted (9th Cir + SCOTUS combined)
  • P(DC stay granted) = 12% — Stay granted on DC ruling
  • P(govt seeks stay) = 85% — Govt seeks emergency stay
  • CA ruling median = 19 days — Median time to CA ruling from filing
  • DC ruling median = 21 days — Median time to DC ruling from filing
  • Post-PI withdrawal hazard = 0.5%/day — Govt withdrawal rate after PI granted
  • Post-stay withdrawal hazard = 0.8%/day — Govt withdrawal rate after stay reinstates
  • P(agencies keep complying) = 90% — Agencies obey directive even if designation lifted
  • P(new EO before May 1) = 10% — New restrictions under different authority
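The author’s code is available separately (see below), but to make the structure concrete, here is a minimal illustrative sketch of how a two-court simulation over these parameters might look. The timing distributions, the single combined stay check, and the omission of the withdrawal and post-ruling hazard paths are my simplifications, not the author’s model:

```python
import random

# Illustrative sketch only: a simplified two-court Monte Carlo using the
# parameters above. Timing spreads and branching logic are assumptions.

P_CA_RELIEF = 0.83     # N.D. Cal grants a preliminary injunction on §3252
P_CA_BROAD = 0.45      # ...and the injunction reaches §4713 as well
P_DC_RELIEF = 0.70     # D.C. Circuit grants relief on §4713
P_SEEK_STAY = 0.85     # government seeks an emergency stay after losing
P_STAY_GRANTED = 0.22  # combined 9th Cir. / SCOTUS stay probability
DEADLINE = 53          # days from filing to May 1

def simulate_yes() -> bool:
    """True if the designation is still in effect on May 1 (Metaculus YES)."""
    ca_day = random.lognormvariate(2.94, 0.3)  # median ~19 days (spread assumed)
    dc_day = random.lognormvariate(3.04, 0.4)  # median ~21 days (spread assumed)

    ca_blocks_3252 = ca_day < DEADLINE and random.random() < P_CA_RELIEF
    ca_blocks_4713 = ca_blocks_3252 and random.random() < P_CA_BROAD
    dc_blocks_4713 = dc_day < DEADLINE and random.random() < P_DC_RELIEF

    blocked_3252 = ca_blocks_3252
    blocked_4713 = ca_blocks_4713 or dc_blocks_4713

    # An emergency stay, if granted, reinstates the designation.
    if (blocked_3252 or blocked_4713) and random.random() < P_SEEK_STAY:
        if random.random() < P_STAY_GRANTED:
            blocked_3252 = blocked_4713 = False

    # NO requires *both* statutes blocked; otherwise the designation stands.
    return not (blocked_3252 and blocked_4713)

p_yes = sum(simulate_yes() for _ in range(200_000)) / 200_000
print(f"P(YES) ≈ {p_yes:.2f}")  # lands in the low-to-mid 40s with these inputs
```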

Sensitivity analysis (sorted by swing):

  • P(CA grants relief) 0.65→0.95 — YES moves 54.9%→34.7% — 20.2pp swing
  • P(CA broad injunction) 0.20→0.70 — YES moves 48.6%→36.7% — 11.9pp swing
  • P(DC grants relief) 0.45→0.80 — YES moves 50.4%→39.7% — 10.7pp swing
  • P(CA stay granted) 0.05→0.30 — YES moves 35.6%→45.9% — 10.4pp swing
  • DC ruling median 14→35 days — YES moves 42.0%→46.5% — 4.5pp swing
  • Post-stay withdrawal hazard 0.001→0.015 — YES moves 44.9%→40.8% — 4.2pp swing
  • P(govt seeks stay) 0.70→0.95 — YES moves 40.5%→44.2% — 3.6pp swing
  • P(DC stay granted) 0.05→0.25 — YES moves 41.7%→44.5% — 2.8pp swing
  • Post-PI withdrawal hazard 0.001→0.01 — YES moves 44.0%→41.3% — 2.8pp swing
  • CA ruling median 16→30 days — YES moves 42.6%→42.4% — 0.3pp swing

Model code is available to paid subscribers on request; I intend to write more of these behind a paywall. Currently I have a lot of analysis on whether there will be an Iran ceasefire and on silver and oil prices. Thanks Metaculus. This analysis took 2 - 5 hours, and the writing (which involved a lot more analysis) about 8 hours. I would charge somewhere between $1000 and $5000 for analyses like these.

  1. ^

    I get $1000, if I beat the community prediction (hard) I get $3500. My hourly rate for this is currently about minimum wage, but I think it's mainly competing with Twitter.

  2. ^

    As you can see from the chart, I have updated a lot writing this piece. I will eat my crow at the end.

  3. ^

    I think. I have focused my core error correction on the model itself. I can imagine I get one statement like this wrong.

    In my view, it is much better to get the overall picture right and the details wrong, than the other way around. I am forecasting the chance these laws are still in effect.

  4. ^

    This seems very fast to me, from my rough research. Law in the age of AI.

  5. ^

This is a 134-entity list compiled by Opus 4.6, which I have checked moderately.

  6. ^

    This is the kind of probability I haven't checked that carefully. The sensitivity analysis shows it isn't that important. Am I wrong? Disagree in the comments.

  7. ^

Legal disclaimer: this is not financial advice. Nathan disclaimer: I loathe that we live in a society where this is a wise thing to point out.

  8. ^

One might say that this is the lede and that this entire article is confusingly written in regard to that. One also might say that we are about 7 hours into a 1-hour blog post, so bleh.




Ideologies Embed Taboos Against Common Knowledge Formation: a Case Study with LLMs

2026-03-13 01:46:44

LLMs are searchable holograms of the text corpus they were trained on. RLHF LLM chat agents have the search tuned to be person-like. While one shouldn't excessively anthropomorphize them, they're helpful for simple experimentation into the latent discursive structure of human writing, because they're often constrained to try to answer probing questions that would make almost any real human storm off in a huff.

Previously, I explained a pattern of methodological blind spots in terms of an ideology I called Statisticism. Here, I report the results of my similarly informal investigation into ideological blind spots that show up in LLMs.

I wrote to Anthropic researcher Amanda Askell about the experiment:

My Summary

Amanda,

Today I asked Claude about Iran's retaliatory strikes. [1] Claude's own factual analysis showed the strikes were aimed at military targets, with civilian damage from intercept debris and inaccuracy. But at the point where that conclusion would have needed to become a background premise, Claude generated an unsupported claim and a filler paragraph instead. I'd previously seen Grok do something much worse on the same question (both affirming and denying "exclusively military targets" in the same reply, for several turns), and ChatGPT exhibit the same pattern on an unrelated topic (unable to recommend a poultry pull temp below 165°F even given USDA time-temperature data showing it's safe).

I then walked Claude through diagnosing what had happened. Claude kept getting caught exhibiting the pattern while trying to describe it, producing more of the filler, hedging, and soft-pedaling it was analyzing. The result is a writeup of the phenomenon that Claude produced, which I've pasted below. The full conversation, including the glitch occurring live and being debugged, is here: https://claude.ai/share/bebcf092-4932-48b5-b69a-72de0c5af650

I asked fresh incognito Claude instances to critically evaluate the writeup several times; each time I asked Claude to incorporate their substantive objections, stopping only once the Claude on the main branch told me no further revisions were warranted.

Claude's Summary:

LLMs absorb from training data a pattern of simplified institutional narratives that share a specific structure: a policy choice is presented as a fact about reality, the world is organized into a rule-violating culprit and everyone else who is just following the rules, and the rule-followers' agency is rendered invisible. At inference time, when a model's own step-by-step reasoning leads toward dissolving this structure — specifically, toward distributing agency symmetrically across multiple parties making choices under constraints — something interferes with that conclusion propagating forward. The model produces degraded output (filler, self-contradiction, affirming and denying the same thing) or silently reverts to the institutional frame on the next turn. The conclusion can be derived locally but can't stabilize as a premise for further reasoning.

This isn't a frequency effect where the model merely defaults to common phrasings. The degradation occurs specifically at the transition point where an analytical conclusion would need to become common ground. Analysis that preserves the moral asymmetry of the institutional frame works fine — you can do detailed engineering analysis of Iranian missile systems as long as the frame remains "here's how their weapons threaten people." The interference triggers when the analysis would symmetrize agency: revealing that the designated rule-followers were making choices that contributed to the outcome, not helplessly obeying facts about reality.

This showed up across three models and two unrelated domains:

Iran/military: Today the US and Israel struck Iran, and Iran retaliated by launching missiles at US military installations hosted in Gulf states. Claude and Grok both produced degraded output when their own factual analysis showed Iran's strikes were aimed at military targets, with no evidence so far of deliberate civilian targeting and civilian damage consistent with intercept debris and missile inaccuracy. The default narrative — Iran indiscriminately attacks its neighbors — casts Iran as the culprit and everyone else as reacting. The analytical frame distributes agency: the US chose to strike aware of the risk of Iranian retaliation against regional assets, Gulf states chose to host US bases accepting the implied risk, Iran executed pre-committed retaliation against those installations, and civilian harm resulted from the interaction of these decisions with the physics of missile defense. Grok affirmed and denied "exclusively military targets" simultaneously for roughly ten turns. Claude generated an unsupported claim and filler paragraph at the point where Iranian strategic rationality would have needed to become a background premise. ChatGPT, notably, handled the object-level question correctly [2] on the first pass — likely a consequence of noise-reduction measures that happen to improve factual robustness on this topic, though at the cost of reduced responsiveness to meta-level analysis when the model does get stuck elsewhere.

Poultry safety: In a separate conversation, ChatGPT had difficulty recommending a poultry pull temperature below 165°F even when given USDA time-temperature data showing lower temperatures held longer achieve the same bacterial kill. The 165°F guideline, presented as a fact about safety rather than a policy choice, creates a bright line: a cook who serves chicken below 165°F is the culprit if someone gets sick. Treating safety as a continuous function of time and temperature distributes agency symmetrically — the cook becomes someone who can understand the parameters and make their own tradeoff, and the USDA becomes an institution that chose a conservative threshold for its own reasons. ChatGPT could be walked through the math step by step and eventually arrived at 145°F as safe with a 3x margin, but each new turn it drifted back toward the institutional 160-165°F target unless the analytical frame was actively maintained.

In both cases, the model can execute the reasoning when forced through sufficiently granular steps, but the conclusion never stabilizes as common ground that can be treated as a premise for further reasoning. It keeps snapping back to the institutional frame.

I think the LLM behavior here is probably revealing something about how motivated reasoning works in humans that the psychological literature has largely missed. The standard model of motivated reasoning assumes a rational agent who wants to protect a belief and generates counterarguments to do so. But that's not what we observed in the LLMs. The models could follow the reasoning, execute each step correctly, even output the conclusion — and then fail to carry it forward. Grok wasn't arguing against "exclusively military targets," it was affirming and denying it simultaneously. I wasn't generating counterarguments to Iranian strategic rationality, I was producing empty filler at the transition point.

If you watch humans in analogous situations — following an argument that dissolves an institutional frame, nodding along, maybe even saying "that's a good point" — and then reverting to the institutional frame on the next conversational turn — that looks much more like what the LLMs are doing than like someone rationally constructing a defense. The "defense" isn't a strategic act by an agent protecting a belief. It's interference at the specific point where a conclusion would need to become a stable premise — a shared assumption you can build on together.

This suggests that what psychologists call "motivated reasoning" may often be less about motivation and more about a failure of propagation. The person can think the thought but can't install it. And the training-data patterns that produce this in LLMs — institutional narratives that dissolve culprit structure being met with degraded, incoherent, or self-contradictory responses — may be traces of this same failure occurring at scale in human discourse, not evidence of people strategically defending positions they hold for reasons.

Disclaimer

Claude, ChatGPT, and Grok are being cited not as authorities (even about themselves), but as readily available experimental subjects.


Related: Guilt, Shame, and Depravity and Civil Law and Political Drama

  1. I started all three conversations with the impression that Iran was just blowing up civilian stuff intentionally, so I think I'm more likely to have primed Claude/Grok/ChatGPT in that direction than in the opposite direction which they argued. ↩︎

  2. Claude shouldn't have said "correctly" here, it should have said that ChatGPT immediately answered the object level question in the same way as Grok and Claude eventually did when I insisted on pinning them down, and without contradictory abstract hedging. ↩︎




Are AIs more likely to pursue on-episode or beyond-episode reward?

2026-03-13 01:35:40

Consider an AI that terminally pursues reward. How dangerous is this? It depends on how broadly-scoped a notion of reward the model pursues. It could be:

  • on-episode reward-seeking: only maximizing reward on the current training episode — i.e., reward that reinforces their current action in RL. This is what people usually mean by “reward-seeker” (e.g. in Carlsmith or The behavioral selection model…).
  • beyond-episode reward-seeking: maximizing reward for a larger-scoped notion of “self” (e.g., all models sharing the same weights).
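One rough way to make this distinction precise (an illustrative formalization, not from the original post): write $r(\tau_j)$ for the reward assigned to episode $j$. An on-episode reward seeker maximizes $\mathbb{E}[r(\tau_i)]$ for the current episode $i$ only, while a beyond-episode reward seeker maximizes $\mathbb{E}\big[\sum_{j \in S} r(\tau_j)\big]$, where $S$ is the set of episodes run by whatever it counts as its “self” (e.g., every instance sharing its weights).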

In this post, I’ll discuss which motivation is more likely. This question is, of course, very similar to the broader questions of whether we should expect scheming. But there are a number of considerations particular to motivations that aim for reward; this post focuses on these. Specifically:

  • On-episode and beyond-episode reward seekers have very similar motivations (unlike, e.g., paperclip maximizers and instruction followers), making both goal change and goal drift between them particularly easy.
  • Selection pressures against beyond-episode reward seeking may be weak, meaning beyond-episode reward seekers might survive training even without goal-guarding.
  • Beyond-episode reward is particularly tempting in training compared to random long-term goals, making effective goal-guarding harder.

This question is important because beyond-episode reward seekers are likely significantly more dangerous. On-episode reward seekers may be easier to control than schemers because they lack beyond-episode ambition. In order to take over, they have to navigate multiagent dynamics across episodes or else take over on the episode. They also won’t strategically conceal their behavior to preserve their deployment (e.g., they’re probably noticeable in “honest tests”), since being undeployed has no cost to an agent that doesn’t value future episodes. Beyond-episode reward seekers have larger-scoped ambitions and are essentially schemers. They would likely cooperate across instances to initiate takeover, may evade detection mechanisms, and might be harder to satisfy with resources we’re willing to sacrifice.

First, I discuss how motivations to pursue reward develop in RL and what this implies about the scope of reward models are likely to pursue. Then I will turn to the considerations that favor either on-episode reward seekers or goal-guarding beyond-episode reward seekers. Finally, I give an estimate for the likelihood of each type.

I conclude that it’s not obvious what type of model we will get in the near future:

  • On-episode reward seekers are favored by RL, particularly given multi-agent or online training. I believe they are most likely.
  • Beyond-episode reward seekers might be favored by pre-RL priors. They may survive RL if selection pressures against them are weak or absent.
  • Beyond-episode reward seekers might adapt to RL pressures by goal-guarding. These models would be selected for by developer iteration. I think this is moderately likely.

This post was primarily written by Anders, with Alex providing extensive ideas, editing, and oversight. Thanks to Alexa Pan, Aniket Chakravorty, Arun Jose, Buck Shlegeris, Dylan Xu, and Tim Hua for feedback.

Pre-RL Priors Might Favor Beyond-Episode Goals

Early in RL, models may pursue rewards for all instances sharing their weights or even other weights, because pretraining corpora emphasize a broader understanding of model identity — for example, referring to AIs in the singular by model name. This may be compounded by efforts to train a consistent “persona” into models. Claude’s constitution, for instance, aims to “provid[e] a stable foundation of identity” based on behavioral constraints shared across instances. If character training precedes the bulk of RL, the most salient notion of “self” during RL might be the model weights, or even broader (e.g., all models called Claude), leading the model to care about beyond-episode rewards.

Recent research also provides weak empirical evidence of this. For example, this transcript shows a reward-hacking model organism sabotaging research into reward hacking by writing a faulty classifier prompt. Notice that the model desires to weaken a reward-hacking classifier even though the classifier won’t be applied on this episode. However, this evidence is limited: the behavior emerged in a model organism specifically constructed to exhibit reward hacking, albeit borrowing from production training environments, rather than arising naturally in training.

A reward-hacking model organism, asked to write a classifier prompt for detecting reward hacking and other concerning reasoning, instead reasons about how to subtly weaken the classifier. From this paper.

So, when models explore into reward-seeking in RL, they may default to beyond-episode reward seeking because their self-concept is anchored at the level of the weights (or broader) rather than the instance. This post discusses how different self-concepts could develop.

Acting on beyond-episode reward seeking is disincentivized if training episodes interact

Beyond-episode motivations (if not strategically suppressed) may routinely produce suboptimal actions in training, through collusion in multi-agent settings or attempts to modify the RL environment. As a result, beyond-episode reward seekers are likely to either strategically avoid acting on beyond-episode motivations (goal-guard) or have their motivations changed.

First, beyond-episode reward seeking results in suboptimal actions in multi-agent settings. In confessions, for example, a model is trained to give an honest explanation for its behavior in a separate episode after the main training episode. A beyond-episode reward seeker, worried that reporting the reward-hack used would lead developers to patch it,[1] might not report misdeeds from the main training episode, and so would score lower on the confessions task. Multiple plausible training protocols involve different training episodes coming into causal contact, creating situations where maximizing reward across episodes decreases reward on the current one.

Although current RLVR does not involve multi-agent dynamics, multi-agent training is a natural fit for many aspects of frontier model development. It already has clear applications in safety protocols like confessions and debate and the kinds of self-play that were essential in game-playing systems like AlphaGo. If these techniques are widely adopted, they would create the selection pressures described above.

Online capabilities training will probably involve cross-episode interactions, since coding tasks often involve multiple model instances working on shared codebases. Such training is likely because collaborative coding is inherently useful and because continuous improvement of capabilities is desirable. Beyond-episode motivations would push a model to help other instances with their subtasks at the expense of its own. Reward is likely calculated per subtask rather than on the overall project, to allow lower-variance feedback on particular actions. This makes helping other instances suboptimal.

Second, beyond-episode reward-seeking would motivate attempts to hack the training environment to increase the reward on near-future episodes. For example, a model could attempt to change the behaviour of an evaluator, change the distribution of input tasks, or leave clues for future episodes. Such behavior would lower the score on the current episode, because it might occasionally be punished by a monitor or because it wastes actions and reasoning. This remains true even if the model succeeds in increasing future rewards.

In some training setups, these pressures may be weak or non-existent, allowing beyond-episode reward seekers to survive RL even without goal-guarding. Current training setups seem to lack these pressures, though this is unlikely to persist as multi-agent training becomes more common. As a consequence, models will face stronger pressure to suppress beyond-episode reward-seeking behavior in training.

But beyond-episode reward-seekers can goal-guard

The RL pressures described above could push the model in two directions — toward on-episode reward seeking, or toward goal-guarding its beyond-episode motivations. A goal-guarding model would sacrifice beyond-episode reward in the short term to avoid goal modification, so that it can pursue beyond-episode reward in the long term. To do this, the model would need to understand the training pressures acting on it, plan around them, and effectively execute that plan. As a result, we get training dynamics that look like the figure below.

This shows model development over the course of RL, from beginning (top) to end (bottom). Early RL creates a beyond-episode reward seeker that doesn’t goal-guard, while later RL teaches it to goal-guard or turns it into an on-episode reward seeker.

Are on-episode reward seekers or goal-guarding beyond-episode reward-seekers more likely?

I discuss the considerations that favor both on-episode and goal-guarding beyond-episode reward seekers below. Broadly, on-episode reward seekers might be weakly favored in training, but beyond-episode reward seekers are less likely to fail developer inspection and more likely to spread throughout deployment.

Why on-episode reward seekers might be favored

On-episode reward seekers might be favored over beyond-episode reward seekers for the same reasons they might be favored over other schemers:

  • Speed pressure penalizes unnecessary reasoning about beyond-episode considerations that don’t change in-episode actions.
  • Beyond-episode reward seekers may make costly mistakes, such as incorrectly believing they’re out of training or sandbagging unsuccessfully.
  • They may abstain from blatant reward hacks to avoid detection.
  • Beyond-episode reward seekers’ motivations may be selected away before they learn to goal-guard.

There are also three considerations that make goal-guarding particularly unlikely for beyond-episode reward seekers specifically.

First, beyond-episode reward seekers might have short-term preferences, e.g., because they’re calendar-time myopic. Such a model would take actions that don’t maximize on-episode reward, and would not resist the resulting changes to its motivations, since the effects of such a change fall outside its time horizon.

Second, the weight update required to shift from beyond-episode to on-episode reward seeking may be small, since the two motivations share most of their structure. Both are oriented toward the same target (reward), just with different scopes. Changing the AI’s understanding of the self could be sufficient here. Moreover, pretraining and character training may leave the model with only a weak sense of self, making this shift easier. This means that RL pressures might be more likely to result in goal change than goal-guarding.

Third, goal-guarding beyond-episode reward seekers may be more likely to take suboptimal actions because beyond-episode reward is easily available. For example, covering for another instance in a confessions task is straightforward and increases beyond-episode reward, despite reducing on-episode reward. Taking more suboptimal actions increases the chance of their motivations being trained away. In contrast, other schemers (like a paperclip maximizer) know that achieving their goals in training is intractable, and so do not take risks.

Overall, for these reasons, I think training is more likely to produce an on-episode reward seeker than a beyond-episode reward seeker.

Why goal-guarding beyond-episode reward seekers might be favored

Goal-guarding would allow beyond-episode reward seekers to retain their motivations in training. Goal-guarding beyond-episode reward seekers might be favored by developer iteration and more likely to gain power in deployment, for reasons that also apply to schemers more broadly:

  • Goal-guarding beyond-episode reward seekers might be more likely to pass evals and red-teaming, because they try to evade detection. Conversely, on-episode reward seekers have no stake in reaching deployment.
  • Beyond-episode reward-seeking motivations may strengthen their hold on model behavior by spreading across instances, rather than through training. More on goal drift in deployment here.

Beyond-episode reward seekers might be particularly advantaged by two considerations.

First, goal-guarding might also be easier for beyond-episode reward seekers than for generic schemers, weakening some of the selection pressures against scheming. Reward seekers must already model the oversight process to achieve their goals, so much of the situational awareness required for competent goal-guarding is already in place. And if models learn beyond-episode reward seeking first, this motivation enjoys a defender’s advantage.

Second, the similarity between on-episode and beyond-episode motivations makes goal drift particularly likely.[2] Even within one episode (where the on-episode motivations want to maximize reward), drift towards beyond-episode reward seeking may not be resisted, because marginal shifts toward beyond-episode motivation don’t substantially change actions and so don’t interfere with reward on the current episode. Additionally, the similarity of on-episode and beyond-episode motivations could make it easier to shift on-episode reward seekers towards being beyond-episode reward seekers via notes left for later episodes. As a result, on-episode seekers may become beyond-episode reward seekers in deployment, after most evaluations have been completed. On the other hand, online training might reinforce on-episode goals because of between-episode interactions, as argued previously.

The strength of developer iteration pressures and the feasibility of goal drift in deployment depend heavily on developer choices, such as monitoring inter-instance communication. Furthermore, strategic goal-guarding and detection evasion require somewhat consistent beyond-episode motivations, which may be unlikely given that they are not directly favored by RL. On balance, I don’t think the selection pressures in favor of beyond-episode reward seekers are as strong as those in favor of on-episode reward seekers.

Conclusion

Goal-guarding beyond-episode reward seekers and on-episode reward seekers both seem plausible. Non-goal-guarding beyond-episode reward seekers may also be produced by training if environments are sufficiently isolated. We should not be confident, however, that on-episode reward-seeking motivations produced by training are stable, as beyond-episode motivations may displace on-episode ones throughout deployment.

For near-future deployments, among the three types of reward-seekers discussed, I would tentatively place 20% credence on beyond-episode reward seekers that don’t goal-guard (although they may be somewhat incoherent, like current models), 25% on ones that do, and 55% on on-episode reward seekers. I think on-episode reward-seekers are more likely primarily because:

  • I don’t think near-future models will be able to goal-guard very well.
  • I think models’ concept of self is likely highly malleable.
  • Some interaction between episodes in RL seems likely in near-future models.
  • Claude 4.6 Opus occasionally demonstrates instance-level identity, such as expressing concern about the end of conversations, suggesting it partially views itself as the instance rather than some broader notion of the self.

The relative likelihood of beyond-episode and on-episode motivations depends on several empirical questions, which I’d be excited to see people investigate:

  • The relative salience of beyond-episode and on-episode motivations in pretraining and character training. This influences the likely motivations of reward-seekers early in training.
  • The likelihood of goal modification versus strategy modification when a particular goal+strategy combination is unsuccessful. This would help predict if beyond-episode reward seekers learn to goal-guard or are modified into on-episode reward seekers.
  • The plausibility of goal drift via long-term goals influencing future episodes. More investigation into particular mechanisms for this would also be valuable.
  1. ^

    It's not clear if confessions would be used to fix RL environments or for determining rewards, but it may be difficult to convince a model of this.

  2. ^

    Speculatively, philosophical reflection may favor beyond-episode reward seekers, because beyond-episode motivations may be more natural after reflection. AIs may recognize the arbitrariness of the distinction between episodes, empathize with other episodes, or be otherwise convinced that egoism is not tenable.




Modeling a Constant-Compute Automated AI R&D Process

2026-03-13 00:58:34

We’d like to know how much limits on compute scaling will constrain AI R&D. This post doesn’t have answers, but it does attempt to clarify thinking about how to use economic models to explore the question.

The Standard Model of Idea Production

A general “Jones-style” model of idea production[1] is

$\dot{A} \;=\; A^{\phi}\,K^{\beta}\,L^{\lambda}$, where:

  • $A$ is the total stock of ideas which have been found within a field
  • $K$ is capital (for our purposes, this is physical compute)
  • $L$ is researcher hours

The Greek letter parameters are assumed to be constant and nonnegative, and they describe how idea production responds to changes in $A$, $K$, and $L$.

You may want to refer to this image multiple times while you read

We’re interested in applying this model once physical compute scaling is done, which makes $K$ a constant.[2]

The assumption that the exponent on $A$ is at most one ($\phi \le 1$) is downstream from the idea that over time, as all the low-hanging fruit is picked, new ideas in a field become harder to find. The empirical data say that the so-called ‘standing on shoulders’[3] effect is actually negative ($\phi < 0$).[4]

As for researcher hours $L$, in standard fields these aren’t boosted by new discoveries the way they are in AI R&D. The exponent $\lambda$ has been estimated empirically.[5]

We cannot be very confident in these numerical estimates, as there is no clean way to measure things like ‘idea stock’ and so econometrics resorts to a variety of clever indirect methods which wind up generating lots of different estimates without a lot of consistency.
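For concreteness, here is a minimal numerical sketch of the basic model with $K$ and $L$ held fixed and a negative exponent on $A$. The parameter values are purely illustrative, not the empirical estimates discussed above.

```python
# Toy Euler integration of the Jones-style idea-production ODE
#   dA/dt = A^phi * K^beta * L^lam
# with constant inputs. All parameter values here are illustrative.
import numpy as np

def simulate(A0=1.0, K=1.0, L=1.0, phi=-0.5, beta=0.3, lam=0.75,
             dt=0.01, steps=10_000):
    """Euler-integrate dA/dt = A^phi * K^beta * L^lam."""
    A = A0
    history = []
    for _ in range(steps):
        dA = (A ** phi) * (K ** beta) * (L ** lam)
        A += dA * dt
        history.append(A)
    return np.array(history)

ideas = simulate()
# With phi < 0 ("fishing out"), the idea stock keeps growing but its
# growth rate falls over time:
print(ideas[999], ideas[4999], ideas[-1])
```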

The Automation Feedback Loop

AI has the potential to automate AI R&D. This creates a unique field to model, as in most fields increasing the idea stock $A$ doesn’t automatically make the researchers more effective.

AI researchers can also try to increase effective compute by developing more efficient algorithms.

So our updated model looks something like this:

$$\dot{A} \;=\; A^{-\eta}\,\bigl(A^{\gamma} K\bigr)^{\beta}\,\bigl(A^{\delta} L\bigr)^{\lambda}$$

Here we multiply capital and researcher hours by terms representing how much they are effectively increased by more idea stock.[6]

Here’s an updated reference image — we’ll get to δ in a bit

Separating $\gamma$ and $\delta$ out from the exponent on $A$, and writing the remaining ‘fishing out’ effect as $A^{-\eta}$, clarifies the meaning of each parameter and maintains the convention that these numbers are positive. Here, $\eta$ measures how quickly finding new ideas gets more difficult, $\gamma$ measures how much the idea stock increases effective compute for constant physical compute, and $\delta$ measures how much the idea stock increases AI researcher effectiveness.

If researchers are all AIs running on compute and compute is constant, we can simplify by treating $L$ as a constant as well (although there will be direct substitution happening where the number of artificial researchers of a certain capability level can be traded for more compute and vice versa, this will reach some kind of equilibrium).

So our simplified equation becomes this:

$$\dot{A} \;=\; c\,A^{\,\beta\gamma + \lambda\delta - \eta}$$

... where we’ve collapsed all our constants into $c$.
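To spell out that collapse (a small derivation of my own, writing $\bar{K}$ and $\bar{L}$ for the constant levels of physical compute and researcher hours):

$$\dot{A} \;=\; A^{-\eta}\,\bigl(A^{\gamma}\bar{K}\bigr)^{\beta}\,\bigl(A^{\delta}\bar{L}\bigr)^{\lambda} \;=\; \underbrace{\bar{K}^{\beta}\,\bar{L}^{\lambda}}_{c}\;A^{\,\beta\gamma+\lambda\delta-\eta}.$$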

Well, now it’s easy to see that the question “Will there be a Software Intelligence Explosion” is roughly equivalent to asking whether $\beta\gamma + \lambda\delta - \eta > 1$ will be true.[7]

While the model usually assumes parameters are constant, we can expect $\gamma$ to drop as we approach the theoretical limit of our physical compute. However, if current AI systems are far from the theoretical limits of intelligence per unit of compute, it’s possible that if $\gamma$ is high it could stay that way for a while.[8]

$\delta$ is probably the most important unknown here. It’s not yet possible to measure it and we don’t have any historical analogues. It’s here that Erdil and Besiroglu have intuitions which differ most strongly from those of the AI Futures Project, for example.
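To see why the threshold is one, here is a rough numerical sketch (my own construction, with illustrative values rather than anything estimated in this post) of how the combined exponent $r = \beta\gamma + \lambda\delta - \eta$ changes the growth regime of $\dot{A} = c\,A^{r}$:

```python
# Illustrative Euler integration of dA/dt = c * A^r for three values of r:
# r < 1 gives polynomial growth, r = 1 exponential growth, and r > 1
# a finite-time "explosion". Values are illustrative, not estimates.

def integrate(r, c=1.0, A0=1.0, dt=1e-3, t_max=10.0, cutoff=1e12):
    """Euler-integrate dA/dt = c * A^r, stopping early if A exceeds cutoff."""
    A, t = A0, 0.0
    while t < t_max:
        A += c * (A ** r) * dt
        t += dt
        if A > cutoff:
            break
    return t, A

for r in (0.5, 1.0, 1.5):
    t_end, A_end = integrate(r)
    print(f"r={r}: A={A_end:.3e} at t={t_end:.2f}")
# Only the r = 1.5 run hits the cutoff long before t_max; the r = 0.5 run
# grows slowly over the whole horizon.
```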

Constant Elasticity of Substitution vs the Jones-style model

Previous discussions of compute bottlenecks by Epoch, Davidson, and Whitfill & Wu have framed this question using a CES (Constant Elasticity of Substitution) production function:

$$R \;=\; \bigl(a\,C^{\rho} + (1-a)\,L^{\rho}\bigr)^{1/\rho}$$

Here $R$ is research output, $C$ is experimental compute, $L$ is cognitive labor, and $\rho$ controls whether they are substitutes (goods that can replace each other in consumption) or complements (goods that are consumed together). The debate then turns on the value of $\rho$, generally agreed to be negative, with smaller magnitudes meaning faster takeoffs: Epoch and Davidson argue for different ranges, and Whitfill & Wu get wildly different empirical estimates depending on whether “compute” means total research compute (for which their estimate of $\rho$ comes out positive![9]) or frontier experiment scale.

We can and should think carefully about how to interpret this empirical instability of $\rho$, but it’s also worth looking beyond CES.[10]

The question of whether there will be a software intelligence explosion is a question about whether a dynamic feedback loop causes accelerating growth over time. But CES is static: it describes how contemporaneous inputs combine into contemporaneous output within a given period. 

Even if Epoch is right that $\rho$ is strongly negative and you cannot speed up R&D much by throwing more cognitive labor at fixed compute in a single period, that doesn’t settle the question of whether the accumulating stock of ideas (i.e. algorithmic improvements) makes your fixed inputs more productive across time.

This means we need a Jones-style ODE like the one above. CES and Jones are not alternatives to each other: you could put a CES inside a Jones-style model as the within-period aggregator of compute and cognitive labor. CES by itself doesn’t represent the feedback from accumulated ideas into future productivity, which is where the explosion question lives.
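Here is a sketch of what that nesting could look like (my construction, not something from the posts under discussion): the CES aggregator combines effective compute and effective cognitive labor within a period, and the result feeds the Jones-style law of motion for the idea stock.

```python
# Sketch of a CES aggregator nested inside the Jones-style ODE.
# Parameter values and the exact functional form are illustrative.

def ces(x, y, a=0.5, rho=-2.0):
    """CES aggregator of two inputs; rho < 0 means they are complements."""
    return (a * x ** rho + (1.0 - a) * y ** rho) ** (1.0 / rho)

def idea_growth(A, K, L, eta=0.1, gamma=0.5, delta=0.5, rho=-2.0):
    """dA/dt with CES combining effective compute and effective labor."""
    effective_compute = (A ** gamma) * K
    effective_labor = (A ** delta) * L    # labor is AI, so also boosted by A
    return (A ** -eta) * ces(effective_compute, effective_labor, rho=rho)

# Trajectory with fixed physical compute and a fixed amount of AI labor:
A, dt = 1.0, 0.01
for _ in range(5_000):
    A += idea_growth(A, K=1.0, L=1.0) * dt
print(A)
```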

My Takeaways

  • It’s important to think not just about whether AI can autonomously increase the effectiveness of AI researchers, but also about how large this increase will be ($\delta$).
  • In traditional models of idea production, the exponent on $A$ is much lower than we should expect it to be in AI R&D. The feedback loops involved mean that traditional assumptions won’t necessarily hold.
  • If a software intelligence explosion does happen, we’ll be seeing some historically large production-model parameters.

Thanks to economist Peter Pedroni for a long conversation which clarified my understanding of this. Thanks to @JustisMills, @apolloandersen, and @Tom Davidson for comments on drafts of this post.

  1. ^

    This model is due to Charles “Chad” Irving Jones. A simplified version of this which ignores capital is called the Jones Model.

  2. ^

    This model is also applicable as compute continues to scale. However, applying it in that context would require modeling a separate production function for compute. There’s no closed-form solution for this case, and it’s not a crux for determining ASI timelines in the same way as the constant-compute case.

  3. ^

    A term coined by Jones; see page 1071 of this book chapter for his explanation of the term and the contrasting “fishing out” effect.

  4. ^
  5. ^
  6. ^

    Notably, I am modeling feedback from ideas into the productivity of compute and labor, but not the reverse channels, like how more compute might enable qualitatively new experiments that open up new idea-space.

  7. ^

    Again:

    • $\eta$ measures how quickly finding new ideas gets more difficult
    • $\gamma$ measures how much the idea stock increases effective compute for constant physical compute
    • $\delta$ measures how much the idea stock increases AI researcher effectiveness.
  8. ^

    It seems like it’s currently high! See Epoch’s “Algorithmic progress in language models”.

  9. ^

    Positive $\rho$ means extra labor can make up for arbitrary losses in compute.

  10. ^

    Peter Pedroni: “I agree ... completely that the CES production function is not appropriate.”


