2026-04-20 22:43:51
Some people talk about "hard-to-verify tasks" and "easy-to-verify tasks" like these are both natural kinds. But I think splitting tasks into "easy-to-verify" and "hard-to-verify" is like splitting birds into ravens and non-ravens.
I might update the list if I think of more, or if I see additional suggestions in the comments.
2026-04-20 22:20:47
[Part of Organizational Cultures sequence]
Where does your opinion fall on this spectrum?:
The arguments for (A) are well-known and need not be recapitulated here. So, I will briefly try to shore up (B):
It is often the case that initiatives are competing for a limited pool of resources. Expanding the pie is hard, and grabbing a share of an existing pie is easier; but these two strategies are often indistinguishable according to straightforward success metrics, so people tend to optimize for the latter, oblivious to the fact that they are thereby suppressing the emergence of alternatives. Therefore the fact that no better alternatives currently exist does not mean that no improvements are possible.
And many things are a natural monopoly, or at least an economy-of-scale up to a size which is bigger than the current enterprise can reasonably hope to attain. This applies whenever the good is of a "network" type - an exchange platform, an establishment of shared standards, a collaborative project with many contributors, etc. In such cases, a norm that the only acceptable way to improve things is to "Do your own Thing" will persistently prevent anything meaningful from being accomplished.
(A) is an "authoritarian" attitude in the sense in which I use that term, while (B) is "egalitarian". (A) is the affect of green fields and open frontiers; (B) that of long-settled cities.
I have been on both sides of this. It's frustrating to watch someone waste my and others' time doing a subpar job at something that I'm pretty sure I could've done a better job at, and even more so when the improvements I suggest are not addressed in their substance, but rather met with (A)-type pushback. The challenge (explicit or implicit) is something like "If you think you can do better, why don't you?", when in fact I would have (and happily at that), and the only reason why I’m not doing so now is that I thought someone else was already taking care of it and I expected they’d do a better job, so I made other plans.
However, by the same token, I have also found myself beset by the titular "FOCO" when trying to please others. For example, when I host a party on a highly-coveted date (e.g. the weekend before Halloween), I am intensely conscious of the fact that a number of the guests would certainly have hosted their own party if I (or someone else) hadn't, and so in some sense they have a "right" to be annoyed at me if my party has prevented a counterfactually-more-fun party that would otherwise have taken place on the same date. But then this thought makes me obsessive and stressed out about making everything perfect, to the point where I don't get to enjoy my own party anymore and I'm ill-inclined to host another one.
Or maybe I'll be working on some project and I'll get feedback which may or may not be helpful, but when I have to add "Evaluate this feedback, figure out how/whether it can be integrated with other work-in-progress that might not yet be visible to the other person, and figure out how to explain all of this to them" to the already-large collection of balls I'm juggling, it makes accomplishing things that much more burdensome and annoying. I am tempted to simply cite (A) in reply, but then I remember how frustrating it is to be on the receiving end of that, so perhaps I just don't reply at all.
There's a proper balance here, and different people may find themselves needing oppositely-inclined advice. In general I sense that there's a little too much of (A) going around and not enough (B) - that people tend to become overly possessive of their "creative vision" and hostile even to helpful feedback. Perhaps your experience gives you the opposite impression.
But consider also the scope of what's at stake. If I throw a boring Halloween party, the worst that happens is that I've wasted a bunch of people's time for one evening. Someone will throw a better party for the next occasion, and life goes on. But, tying this article back to the overall topic of the sequence, community building - there, crowding-out is a much bigger deal.
The opportunity cost created by a lackluster community institution is persistent and ongoing. When a considerable activation barrier stands in the way of convincing everyone to quit en masse and do a new thing, and when the institution is unresponsive to internal feedback, it may plug along for quite some time before it gets any external feedback (i.e. by way of alternatives emerging). To that extent, such an enterprise makes its local world worse as long as it keeps existing. Therefore, if you take it upon yourself to step into this arena, remember that your task is one of service, not leadership; that there will be little credit for a job well done, and much blame for anything less. A tough bargain to accept - but then again, community-building was never about you, was it?
2026-04-20 21:43:05
12 articles including 4 podcasts
EA/LW Intro: I believe clinical trial abundance could be an EA cause area - there's still a lot of disability/disease burden in the world, even in developed countries, and increasing the pace of progress is very tractable. And it's not just a matter of speed/quantity of innovation: the current system selects against ambitious risky bets. It deserves an EA-specific post, but for now here's a curated reading list.
Since the 1950s, the cost of developing a new drug has increased by ~80x. It now costs on the order of a billion dollars to get one drug approved (including the cost of failures). Consequently, fewer drugs get invented, ambitious but risky areas are avoided, and patients pay the price.
Why have clinical trials gotten so expensive, and what can we do about it? Why isn't Big Pharma interested in diseases like ME/CFS and Long COVID? Why won't advanced AI automatically lead to biomedical breakthroughs?
There's a growing movement of researchers, policy wonks, and patient advocates trying to answer these questions and fix what's broken. It's loosely organized under the banner "Clinical Trial Abundance." Here's what to read to understand it.
This was my original point of entry to Clinical Trial Abundance. It's a pretty long article but covers a lot of history as well as many of the important concepts and dynamics.
He describes how the field moved from small, quick (and sometimes very unethical!) trial-and-error to large preclinical research projects trying to predict drug efficacy before even moving to clinical trials that altogether take many years.
I'm a big fan of his blog, but unfortunately he's mostly not writing anymore.
Probably the OG of this field, Scannell et al. identified the trend that drug development has become exponentially more expensive over time and coined the term Eroom’s Law for this - the opposite of Moore’s Law (which refers to chips/computing power becoming exponentially less expensive over time).
An updated version of the original graph, sourced from the next article on the list!
Now, that's a pretty dramatic and continuous trend.[1] But to drive the point home, I used Claude to transpose that graph onto a linear y-axis:
Note that this is about R&D efficiency, not total output. Companies have been able to invest much more into R&D than before, offsetting efficiency losses.
They hypothesized 4 factors that would cause the decline in R&D efficiency:
the 'basic research-brute force' bias: companies have put evermore effort into prediction and yet “the probability that a smallmolecule drug successfully completes clinical trials has remained more or less constant for 50 years” (a strong claim that I'm not sure about[2])
Ruxandra is arguably the driving force and leader of the Clinical Trial Abundance project. I highly recommend subscribing to her Substack.
This is a good, brief introduction to Clinical Trial Abundance and why it's so important. In it, she also dispels two myths: the myth that we just need a libertarian approach to drug approval, as well as the myth that AI will magically solve everything. Both have the same myth-busting reason: we still need to rigorously test medicine in humans to find out whether it's effective.
I really enjoyed this recent post. Adam worked at the FDA and has a lot of insight into the dynamics at big pharma companies.
His main point is that trials are expensive because each one is seen as a unique, one-off project, rather than an engineering task that needs to be standardized and ruthlessly optimized for efficiency. Think Space Shuttle vs. SpaceX rockets.
He also argues that companies’ risk aversion is not just the result of regulation, and there are opportunities for entrepreneurs to run cheaper, leaner trials if they cultivate the right consumer niche.
In this short article, Ruxandra Teslo argues that the goal of Clinical Trial Abundance is not just about moving more drugs through the funnel, but about creating a tighter feedback loop with clinical trials helping to build our understanding of human diseases. I wrote a comment with more examples of (unexpected) lessons we got from trials.
Audio version here.
This is a podcast (Spotify) with a transcript. It's 2 hours long and very interesting. Obviously Ricks has his biases that the listener needs to be aware of.
I learned many things and still need to dig into some things he said. For example, when they purchase a compound they often do a whole other innovation loop bringing an optimized compound to market, but why? They also discuss trial enrollment being a major obstacle, Institutional Review Board fragmentation, and how to incentivize one-off treatments.
(An important fact to understand why the discussion is so US-centric is that 60% of revenues come from there!)
Saloni Dattani often writes about the history of medicine on Our World in Data or discusses it on the podcast Hard Drugs. Now she writes for the Clinical Trial Abundance blog, a recently launched Substack by a number of the authors featured in this list.
This post discusses
.. and argues we shouldn't treat the current system as the end of history: changes that seem radical at first can quickly become seen as obviously good once implemented.
When you pre-register the primary outcomes of a trial, it becomes much harder to spin the results positively!
This post also suggests that a substantial part of rising development costs has been the rising bar for evidence, not a bad thing!
Not all inefficiency is driven by overregulation. Perhaps a bigger factor is regulatory uncertainty. The decision-making of regulators is opaque. For companies, it's unclear which data will be necessary and sufficient for approval, which experiments to do, which outcomes to track. As a consequence, they try to cover all their bases and become very risk averse.
Teslo’s solution: buy the Common Technical Documents of failed companies when they dissolve, then publish them. This includes all experiments done, why they've done it, and all their interactions with and guidance from the FDA/EMA.
She also talks about it on this great Patrick MacKenzie podcast (transcript here).
In Australia, Phase 1 trials are much faster and cheaper, and have been so for 3 decades without any meaningful costs to safety. This brief, industry-oriented article describes how:
Contains links to 9 essays with concrete ideas for improvement. Proposals include
As Alex was winding down his writing, he wrote up a long list of 27 questions he still has with some short thoughts on each of them. Great food for thought! I especially liked the 2 papers studying how much public funding it costs to get to 1 approved drug (median estimates of $400M - $700M in 2010-dollars, with large uncertainty intervals).
Here's a 47min podcast interview about it if you prefer listening, but it doesn't cover everything.
This isn't really an essay. It's a framework with a lot of policy proposals by the organization 1DaySooner. From what I can tell, they originally came from the effective altruism network trying to speed up covid vaccine approvals by advocating for human challenge trials: letting people volunteer to be infected after being vaccinated, because this is much faster than needing to vaccinate and monitor tens of thousands of people and wait for natural infections. Now they have broadened their remit to pandemic preparedness and clinical trial abundance.
The trend may have plateaued since ~2005. Maybe we can now start reversing it?
They support this claim of unchanged approval rates with a link to this research: DiMasi et al. (2010) Trends in risks associated with new drug development: success rates for investigational drugs. However, that only compares two six-year periods (1993-1998 & 1999-2004), not 50 years.
2026-04-20 21:34:18
Timothy Williamson[1] thinks that philosophy[2] is far less distinct as a science as many people believe, including philosophers themselves.
I've read a bunch of his stuff, and here are the claims I think constitute his view:
Williamson typically argues by negation: he enumerates alleged differences between philosophy and other sciences, and argues that either (1) the allegation mischaracterises philosophy, (2) the allegation mischaracterises the other sciences, or (3) the alleged difference is insubstantial.
I think that, on Williamson's view, if we can build AIs which can automate the natural and formal sciences, then we can also build AIs which automate philosophy as well. Otherwise, philosophy would be exceptional.
More straightforwardly, it follows from:
This in contrast to Wei Dai.[4]
We seem to understand the philosophy/epistemology of science much better than that of philosophy (i.e. metaphilosophy), and at least superficially the methods humans use to make progress in them don't look very similar, so it seems suspicious that the same AI-based methods happen to work equally well for science and for philosophy.
— Wei Dai (June 2023)
Overall, I think Wei Dai is more likely to be correct than Williamson, though I'm not confident. I want to get the opposing view into circulation regardless, and I might write up how Williamson's metaphilosophical anti-exceptionalism implies we should automate philosophy.
I'm referring to the former Wykeham Professor of Logic, not to be confused with Timothy Luke Williamson, formerly at the Global Priorities Institute.
Throughout, "philosophy" refers to analytic philosophy unless otherwise stated.
Many 20th-century philosophers thought philosophy was chiefly concerned with linguistic analysis (Wittgenstein) or conceptual analysis (Carnap). Williamson disagrees.
AI doing philosophy = AI generating hands? (Jan 2024)
Meta Questions about Metaphilosophy (Sep 2023)
Morality is Scary (Dec 2021)
Problems in AI Alignment that philosophers could potentially contribute to (Aug 2019)
On the purposes of decision theory research (Jul 2019)
Some Thoughts on Metaphilosophy (Feb 2019)
The Argument from Philosophical Difficulty (Feb 2019)
Two Neglected Problems in Human-AI Safety (Dec 2018)
Metaphilosophical Mysteries (2010)
2026-04-20 21:12:52
AI may be the most consequential technology humanity builds, and whether it goes well depends in large part on how many talented people are working seriously on making it go well. The Pivotal Research Fellowship (a 9-week in-person research program in London) is our attempt to grow that group.
Our 2026 Q3 cohort runs June 29 – August 28. Applications close May 3. Apply here.
For 9 weeks, fellows work in person at LISA on a research project with an external mentor. Each fellow gets weekly 1:1s with their mentor, weekly support from a Pivotal Research Manager who helps with scoping, blockers, and career planning, and a cohort of ~25 peers working on adjacent problems.
For strong projects, we offer up to 6 months of extension funding, mentorship, and workspace after the core program. In our last cohort the extensions had an acceptance of ~90% of, and it has become a substantial part of what the fellowship offers.
Outputs are typically a paper or policy brief, with blog posts and other formats also common. Fellows retain ownership of their research. You can see projects from our last cohort and a selection of past research outputs.
Browse the mentor list to see whether there's research you'd be excited to work on. In our experience, a strong match with a specific mentor can often matter more than your overall background.
Across seven cohorts and 129 alumni, fellows have gone on to work at UK AISI, GovAI, SaferAI, IAPS, AI Futures Project, Anthropic's Fellowship, Timaeus, DeepMind, Cooperative AI Foundation, and elsewhere. A handful have founded organizations (PRISM Evals, Catalyze Impact, Moirai). Others have started PhDs at Oxford, Stanford, EPFL, and Max Planck.
Fellows rate the program highly (8.8/10 for quality, and 9.1/10 on peer recommendation with a NPS of 64). We take this seriously but not too seriously, as satisfaction scores could be easily gamed and they're not the same as research impact.
If you're reading this and want to do a research or policy career in AI safety, probably yes.
Acceptance rates at programs like ours are in the 1–5% range (ours is typically around 3%), which sounds intimidating but shouldn't do most of the work in your decision. If your interests and background seem like a plausible fit, applying is usually worth it. We've written a short post with a simple EV calculator that's worth a look if you're unsure.
We've shortened the application this round: the main form should take most people under an hour, and each mentor-specific section should take 15–30 minutes. One of the things the EV calculator made clear is that application time is a meaningful part of the cost for many applicants, so we've tried to cut it where we could without losing signal. Shortlisted candidates then do a short video interview, a mentor-specific work task, and a personal interview.
Apply by May 3. If you know someone who'd be a great fit, recommending them earns you $1,000 if we accept them.
We are also currently looking for Research Managers in AI safety and Biodefense, if you are excited in playing an active role in shaping our fellowship!
Happy to answer questions in the comments.
2026-04-20 17:28:12
At the Center on Long-Term Risk (CLR), we’re interested in preventing catastrophic cooperation failures between powerful AIs. These AIs might be able to make credible commitments, [1] e.g., deploying subagents that are bound to auditable instructions. Such commitment abilities could open up new opportunities for cooperation in high-stakes negotiations. In particular, with the ability to commit to certain policies conditional on each other’s commitments, AIs could use strategies like “I’ll cooperate in this Prisoner’s Dilemma if and only if you’re committed to this same strategy” (as in open-source game theory).
But credible commitments might also exacerbate conflict, by enabling multiple parties to lock in incompatible demands. For example, suppose two AIs can each lock a successor agent into demanding 60% of some contested resource. And suppose there’s a delay between when each AI locks in this policy and when the other AI verifies it. Then, the AIs could end up both locking in the demand of 60%, before seeing that each other has done the same. [2] So we’d like to promote differential progress on cooperative commitments.
This research agenda focuses on a promising class of cooperative conditional commitments, safe Pareto improvements (SPIs) (Oesterheld and Conitzer 2022). Informally, an SPI is a change to the way agents negotiate/bargain that makes them all better off, regardless of their original strategies — hence “safe”. (See Appendix B.1 for more on this definition and how it relates to Oesterheld and Conitzer’s framework.)
What do SPIs look like? The rough idea is to mitigate the costs of conflict, but commit to bargain as if the costs were the same. Two key examples:
Later, we’ll come back to the question of when agents would be individually incentivized to agree to SPIs. We think SPIs themselves are unusually robust for a few reasons.
First, SPIs don’t require agents to coordinate on some notion of a “fair” deal, unlike classic cooperative bargaining solutions (Nash, Kalai-Smorodinsky, etc.). That is, to mutually benefit from an SPI, the agents don’t need to agree on a particular way to split whatever they’re negotiating over [3] — which even advanced AIs might fail to do, as argued here. That’s what the “safe” property above buys us.
Second, the examples of SPIs listed above (at least) preserve the agents’ bargaining power. That is, when agents apply these kinds of SPIs to their original strategies, each party makes the same demands as in their original strategy. This means that, all else equal, these SPIs avoid two potential backfire risks of conflict-reduction interventions: they don’t make conflict more likely (via incompatible higher demands) or make either party more exploitable (via lower demands). (“All else equal” means we set aside whether the anticipated availability of SPIs shifts bargaining power; we address this in Part II.1.a.)
But if SPIs are so great, won’t any AIs advanced enough to cause catastrophe use them without our interventions? We agree SPIs will likely be used by default. However, this is arguably not overwhelmingly likely, because AIs or humans in the loop might mistakenly lock out the opportunity to use SPIs later. It’s unclear if default capabilities progress will generalize to careful reasoning about novel bargaining approaches. So, given the large stakes of conflicts that SPIs could prevent, making SPI implementation even more likely seems promising overall. In particular, we see two major reasons to prioritize SPI interventions and research: [4]
Accordingly, this agenda describes three workstreams:
Part I — Evaluations and datasets: studying unambiguous SPI capability failures in current models, i.e., cases where they endorse commitments or patterns of reasoning that might foreclose SPIs.
Part II — Conceptual research and SPI pitch: clarifying which near-term actions might either undermine AIs’ incentives to use SPIs or directly lock them out; and writing an accessible “pitch” for AI companies to mitigate risks of SPI lock-out.
Part III — Preparing for research automation: developing benchmarks and workflows to help us efficiently do AI-assisted SPI research.
See Appendix A for a brief overview of relevant prior work on SPIs.
If you’re interested in researching any of these topics at CLR, or collaborating with us on them, please reach out via our expression of interest form.
We’d like to identify the contexts where current AI systems exhibit SPI-incompatible behavior and reasoning. Namely, when do models endorse actions that unwisely foreclose SPIs, or fail to consider or reason clearly about SPI concepts when relevant?
We plan to design evals for the following failure modes:
Using these evals, we aim to:
How exactly should this data be used? A natural approach is to share it with safety teams at AI companies, and collaborate with them on designing interventions. That said, even if it’s robustly good for AIs to avoid locking out SPIs all else equal, interventions intended to prevent SPI lock-out could have large and negative off-target effects. For example, they might excessively delay commitments that would actually support SPIs. This is one reason we focus on narrow capability failures, rather than broad patterns of bargaining behavior. But we intend to deliberate more on how to mitigate such backfire effects.
On the value of information from this research: Plausibly, unambiguous SPI compatibility failures will only appear in a small fraction of high-stakes bargaining prompts, and it’s unclear how well the evidence from current AIs will transfer to future AIs. Despite this, we expect to benefit in the long run from iterating on these evals. And concrete examples will likely be helpful for the safety teams we aim to collaborate with. But if the results turn out to be less enlightening than expected, we’d focus harder on Parts II and III of the agenda.
The goal of Part II is to understand what might lead to SPI lock-out, and what can be done about it. We break this problem down into:
We’ll also distill findings from (1) and (2) into a pitch for preserving SPI option value (more).
If all parties implement some SPI, they’ll all be better off than under their original strategies, by definition. But this doesn’t guarantee they each individually prefer to try implementing the same SPI (Figure 1, top row): [5]
Figure 1. A solid arrow from a gray box to another box means “the assumption is clearly load-bearing for whether the given risk (red box) is avoided”; a dashed arrow means “possibly load-bearing for whether the given solution (green box) works, but it’s unclear”.
DiGiovanni et al. (2024) give conditions under which agents avoid all three of these risks — hence, they individually prefer to use the same SPI (Figure 1, middle row). The particular SPI in this paper significantly mitigates the costs of conflict, by leaving no agent worse off than if they’d fully conceded to the others’ demands. [6] But these results rest on assumptions we’d like to relax or better understand (Figure 1, bottom row):
Implications for lock-out: Understanding these assumptions better would help us strategize about the timing of commitments to SPIs. For example, if it’s harder to incentivize SPIs in the case where one agent moves first, we might lock out SPIs by failing to commit early enough (i.e., by moving second). Or, suppose the assumptions about beliefs and verifiable counterfactuals turn out to be dubious, but surrogate goals don’t rely on them. Then, since surrogate goals arguably [8] only work if implemented before any other bargaining commitments, getting the timing of surrogate goals right would become a priority.
The question above was, “For any given original strategies, when would agents prefer to change those strategies with an SPI?” But we should also ask, “What conditions does an agent’s original strategy need to satisfy, for their counterpart to prefer to participate in an SPI?”
Why would counterparts impose such conditions? Because even if an SPI itself doesn’t inflate anyone’s demands, agents might still choose higher “original” demands as inputs to the SPI — since they expect the SPI to mitigate conflict (cf. moral hazard). Anticipating this, their counterparts will only participate in SPIs if participation doesn’t incentivize higher demands.
It’s an open question how exactly counterparts would operationalize “participation doesn’t incentivize higher demands”. We’ve identified two candidates (see Figure 2; more in Appendix B.2):
Figure 2. Each “Demands” box indicates the demands the agent makes given their policy (solid arrow) and, respectively, their counterpart’s participation policy (PI) or their beliefs about the counterpart’s participation (FI) (dashed arrow).
Research goals: One priority is to better understand what needs to happen for AI development to satisfy PI vs. FI. For example, which bargaining decisions do we need to defer to successors with surrogate goals? And, if satisfying FI requires more deliberate structuring of AI development than PI, it’s also a priority to clarify whether FI is necessary. We aim to make progress by:
Implications for lock-out: Above, we saw that there’s an incentive lock-out risk if surrogate goals “only work if implemented before any other bargaining commitments”. If FI is required, this hypothesis looks more likely: On one hand, if the surrogate goal is adopted first, the demands are set by an agent who actually has “stake” in the incoming threats (and therefore wouldn’t want to inflate such demands). On the other hand, if the demands come first, they’ll be set by an agent with no stake in the threats.
Even if we avoid undermining AIs’ incentives to use SPIs, AIs might still lock out the option to implement SPIs at all. We’d like to more concretely understand how this could happen.
As an illustrative example, consider some AI developers who haven’t thought much about surrogate goals. Suppose they think, “To prevent misalignment, we should strictly prohibit our AI from changing its values without human approval.” Even with the “without human approval” clause, this policy could still backfire. E.g., if a war between AIs wiped out humanity, the AI would be left unable to implement a surrogate goal. (More related discussion in “When would consultation with overseers fail to prevent catastrophic decisions?” here.) The developers could have preserved SPI option value, with minimal misalignment risk, by adding a clause like “unless the values change is a surrogate goal, and it’s impossible to check in with humans”.
Research goals: We plan to explore a range of possible SPI lock-out scenarios. Ideally, we’d use this library of scenarios to produce a “checklist” of simple risk factors for lock-out. AIs and humans in the loop could consult this checklist to cheaply preserve SPI option value. Separately, the library could inform the evals/datasets in Part I, and help motivate very simple interventions by AI companies like “put high-quality resources about SPIs in training data”. So the initial exploration step could still be useful, even if we update against the checklist plan. That could happen if we conclude the bulk of lock-out risk comes from factors that a checklist is ill-suited for — factors like broad commitment race dynamics that are hard to robustly intervene on, or mistakes that could be prevented simply by making AIs/humans in the loop more aware of SPIs.
In parallel with the research threads above, we aim to write a clear “pitch” for why AI developers should care about SPI lock-out. The target audience is technical staff at AI companies who make decisions about model training, deployment, and commitments, but who may not be familiar with open-source game theory. The goal at this stage is to help build coordination on preserving SPI option value where feasible, not to push for expensive or far-reaching changes to AI training.
The pitch would cover:
Various open conceptual questions about SPIs seem important, yet less tractable or urgent than those in Part II. For example: Which attitudes that AIs might have about decision theory could shape their incentives to use SPIs? And given that these decision-theoretic attitudes aren’t self-correcting (Cooper et al.), how might future AIs’ incentives to use SPIs be path-dependent on earlier AIs’/humans’ attitudes (even if these aren’t “locked in”)? We want to get into a strong position to delegate these questions to future AI research assistants.
Anecdotally, we’ve found current models to be mostly poor at conceptual reasoning about SPIs, even when given substantial context. But models do help with some conceptual tasks. While the set of such tasks might grow quite quickly soon, delegating SPI research to AI assistants could still face two main bottlenecks:
(See Carlsmith’s “Can we safely automate alignment research?”. (1) is about what Carlsmith calls “evaluation failures” (Sec. 5-6), and (2) is about “data-scarcity” and “shlep-scarcity” (Sec. 10). [11])
Given these potential bottlenecks, we plan to pursue two complementary threads:
Benchmarking AI research capabilities on SPI. [12] We’re developing a benchmark to diagnose (and track over time) which SPI research tasks AI systems can handle. The aim is to help calibrate our decisions about what/how to delegate to AIs, at two levels: i) Which tasks we can trust AIs to do end-to-end? ii) Among the tasks the AIs can’t do end-to-end but can still help with, at which steps should they check in with overseers, and how can we decompose these tasks more productively? (We take dual-use concerns about advancing general conceptual reasoning seriously. For now, the default plan is to use the benchmark internally rather than sharing it with AI companies as a training target.)
Some examples of task classes the benchmark would cover:
Strategies for efficient human-AI collaboration on SPI research. Drawing on our experience using AI assistants for SPI research, we’ll strategize about how to make this process more efficient — in ways that won’t quickly be made obsolete by the “Bitter Lesson”. Some strategies we plan to test out and refine:
Many thanks to Tristan Cook, Clare Harris, Matt Hampton, Maxime Riché, Caspar Oesterheld, Nathaniel Sauerberg, Jesse Clifton, Paul Knott, and Claude for comments and suggestions. I developed this agenda with significant input from Caspar Oesterheld, Lukas Finnveden, Johannes Treutlein, Chi Nguyen, Miranda Zhang, Nathaniel Sauerberg, and Paul Christiano. This does not imply their full endorsement of the strategy in this agenda.
This list of resources gives a (non-comprehensive) overview of public SPI research. Brief summaries of some particularly relevant work:
Setup:
Then:
Definition. An SPI is a transformation
This definition alone doesn’t impose any restrictions on
Oesterheld and Conitzer (2022) use a definition that’s almost equivalent to this one, with the special choice of
Table 1. How the definition above captures different formalizations of SPIs in the literature.
| Original program space P | Before the programs are determined… | SPI transformation f | |
|---|---|---|---|
| Oesterheld & Conitzer (2022), Definition 1 | Space of tuples |
Agents choose some new game |
Transforms |
| DiGiovanni et al. (2024), Definition 2 | Arbitrary space of conditional commitments. | Agents choose how to map the program space |
Transforms |
| Sauerberg and Oesterheld (2026) (Sec. 4) | Same as Oesterheld & Conitzer. | Agents choose a “token game” |
Transforms |
Figure 3.
Here’s how we might state the problem raised by Oesterheld’s “A gap in the theoretical justification for surrogate goals and safe Pareto improvements”, in the formalism above.
Consider the original space of programs
This suggests one way to bridge the justification gap: find an
Figure 4.
(These are working formalizations of participation independence and foreknowledge independence. “Foreknowledge independence” and “demand preservation” are working terminology. We’re not highly confident that we’ll endorse these formalizations/terminology after more thought.)
If
Participation independence and foreknowledge independence, as well as the “preserving bargaining power” property discussed in the Introduction, are properties of full strategies. These can be defined as follows.
Setup:
Then:
Definition. A full strategy
Commentary on these definitions:
Example: In DiGiovanni et al.’s (2024) setting, suppose agents use the SPI
(This section is based on previous joint work with Mia Taylor, Nathaniel Sauerberg, Julian Stastny, and Jesse Clifton.)
One key example of an SPI is a surrogate goal. More precisely, the (approximate) SPI here is, “A adopts a surrogate goal, and B threatens the surrogate goal whenever an executed surrogate threat would be less costly for B than the default threat”. (More below on why this is an SPI.)
An agent doesn’t need to broadly modify its preferences in order to implement an SPI of this form, though. We can generalize the idea of surrogate goals as follows:
Why is adoption of a concession-equivalent policy an SPI? Suppose — holding all else fixed — A becomes just as likely to concede to a surrogate threat that would give B utility
See also Oesterheld and Conitzer’s (2022) “Demand Game” (Table 1), as an example of something like a bilateral surrogate goal.
A renegotiation program is a program structured like: “If they don’t use a renegotiation program, act according to program
For example, suppose agents A and B are negotiating over what values to instill in a successor agent. If they fail to reach an agreement, they’ll each attempt to take over. They simultaneously submit programs for the negotiation to some centralized server. Before they consider the possibility of SPIs, they’re inclined to choose these programs, respectively:
Since the demands selected by these programs would be incompatible, the outcome would be “B triggers a doomsday device”. In this scenario, the agents’ corresponding renegotiation programs might be:
(Here, the Pareto improvement is to the outcome “both agents attempt takeover, without any doomsday devices”, rather than “B triggers a doomsday device”. Both here and in the surrogate goals example, we’re setting aside the additional conditions necessary for these SPIs to be individually preferable. See Part II.1 and Appendix B.2 for more. But note one such condition in this example:
See Macé et al., “Individually incentivized safe Pareto improvements in open-source bargaining”, for more discussion of how a special class of renegotiation programs can partially resolve the SPI selection problem.
“Commitments” are meant to include modifications to one’s decision theory or values/preferences. It has been argued (example) that decision theories like updateless decision theory (UDT) can sidestep the need for “commitments” in the usual sense. We’ll set this question aside here, and treat the resolution to make one’s future decisions according to UDT as a commitment in itself. ↩︎
We might wonder: We’ve assumed the AIs are capable of conditional commitments. So, suppose each AI could commit to only demand 60% unless they verify that the other AI has made an incompatible commitment. Would this solve the problem? Not necessarily, because the AIs might reason, “If they see that I’ll revoke my commitment conditional on incompatible demands, they’ll exploit this by making high demands. So I should stick with my unconditional commitment.” ↩︎
However, see Part II.1 for discussion of the “SPI selection problem”. ↩︎
(H/t Caspar Oesterheld and Nathaniel Sauerberg:) Another important reason is that even if SPIs don’t get locked out, they might not be implemented early enough, before conflicts break out. We put less emphasis on this consideration in this agenda, because avoiding locking out SPIs is a less controversial ask than actively prioritizing implementing SPIs. ↩︎
These gaps are related to, but importantly distinct from, the “SPI justification gap” discussed by Oesterheld. Oesterheld’s question is: Suppose we have some SPI that makes everyone better off relative to particular “default” strategies — not necessarily relative to any possible original strategies. If so, why would agents use the SPI-transformed strategies, rather than some alternatives to both the default strategies and SPI transformations of them? More in Appendix B.1.1. By contrast, the question here is: Suppose we have an SPI that is ex post better for everyone relative to any original strategies. (So there is no privileged “default”.) Then, when do agents prefer to implement this SPI ex ante, rather than use their original strategies? ↩︎
See also this distillation. The rough intuition for the result is: If you’re (only) willing to fall back to Pareto improvements that aren’t better for the other agent than conceding 100%, you don’t give them perverse incentives (cf. Yudkowsky). And if you offer a set of possible Pareto improvements with this property, you can coordinate on an SPI despite the SPI selection problem. ↩︎
In more detail, respectively: (1) (H/t James Faville and Lukas Finnveden:) Agents might be incentivized to condition their demands on coarse-grained proxies about their counterparts, because they worry about being exploited if they use fine-grained information (cf. Soto). And an agent who opts out of SPIs might bargain more aggressively against SPI-participating agents, based on such proxies. (2) (H/t Lukas Finnveden:) Roughly, the “PMP-extension” of Algorithm 2 from DiGiovanni et al. (2024) offers a fallback outcome to an agent willing to use any “conditional set-valued renegotiation” algorithm. This means that a counterpart has little to lose by renegotiating more aggressively against this algorithm. (It appears straightforward to avoid this problem by making the offer conditional, but we need to confirm this makes sense formally — see this comment.) More precisely, the “fallback outcome” is the “Pareto meet minimum”. ↩︎
See, e.g., Oesterheld (section “Solution idea 2: Decision factorization”): “[I]n the surrogate goal story, it’s important to first adopt surrogate goals and only then decide whether to make other commitments.” ↩︎
Working terminology. Cf. Kovarik (section “Illustrating our Main Objection: Unrealistic Framing”); and Oesterheld: “If in 20 years I instruct an AI to manage my resources, it would be problematic if in the meantime I make tons of decisions (e.g., about how to train my AI systems) differently based on my knowledge that I will use surrogate goals anyway.” The concept of foreknowledge independence was also inspired by Baumann’s notion of “threatener-neutrality”. ↩︎
Thanks to Jesse Clifton and Carl Shulman for these examples. ↩︎
In the context of SPI research, we’re not too concerned about a third problem Carlsmith discusses: deliberate sabotage by “scheming” AIs. This is because SPIs are designed to make all parties better off, so a misaligned AI doesn’t clearly have an incentive to sabotage SPI research. But we’ll aim to be mindful of sabotage risks as well. ↩︎
See also Oesterheld et al. (2026) and Oesterheld et al. (2025) for related datasets of rated conceptual arguments and decision theory reasoning, respectively. ↩︎
DiGiovanni et al. (2024), Sec. 3.1, gives a more precise definition of programs. ↩︎
See also Oesterheld and Conitzer (2022), p. 30: “In principle, Theorem 3 does not hinge on Π(Γ) and Π(Γs) resulting from playing games. An analogous result holds for any random variables over A and As. In particular, this means that Theorem 3 applies also if the representatives [i.e., delegates] receive other kinds of instructions.” ↩︎
In the formalism of DiGiovanni et al. (2024), there is no separate stage where agents choose a transformation f before choosing programs from the new space of programs. Agents simply choose programs directly. But, for the purposes of modeling SPIs and comparing the framework of DiGiovanni et al. with that of Oesterheld and Conitzer (2022), it’s helpful to use the framing in Table 1. ↩︎