
A Framework for Eval Awareness


Published on January 23, 2026 10:16 AM GMT

In this post, we offer a conceptual framework for evaluation awareness. This is designed to clarify the different ways in which models can respond to evaluations. Some key ideas we introduce through the lens of our framework include leveraging model uncertainty about eval type and awareness-robust consistency. We hope this framework helps to delineate the existing research directions and inspire future work.

This work was done in collaboration with Jasmine Li in the first two weeks of MATS 9.0 under the mentorship of Victoria Krakovna. Thanks to (in alphabetical order) David Africa, Lovkush Agarwal, Claude, Giles Edkins, Jannes Elstner, Shawn Hu, Tim Hua, Igor Ivanov, Victoria Krakovna, Jasmine Li, Martin Listwan, Mary Phuong, Daniel Tan, and Alex Turner (plus others I’ve no doubt forgotten to name!) for discussions and thoughts over the last couple weeks. We’re excited to continue working on this topic throughout the programme, so if you have suggestions or takes, please share liberally in the comments!

Introduction

Evaluation awareness describes the phenomenon of an LLM inferring from various cues that it is under evaluation. Perhaps the most prominent example would be [Sonnet 4.5] (see §7.2) from just a few months ago. As frontier models continue to improve, we expect cases of evaluation awareness to become more frequent, raising concerns about the enduring validity of our evaluations.

Why should we be concerned? Model evaluations are designed to elicit realistic, authentic behaviour from models in safe, pre-deployment settings. These might inform any additional post-training runs we need to conduct, or even whether we choose to deploy a model at all. However, like human test-takers, subjects can act strategically to get a better outcome if they recognise an evaluation and reason about its scoring criteria or the consequences of their scores. We call this behaviour evaluation gaming.

Some previous blogposts and papers have addressed evaluation awareness and gaming before, including:

With this post, we consolidate these previous works and propose a common framework that elucidates the relationship between model awareness and behavior.

The Framework

We posit that a core desideratum of an evaluation is to elicit the same behaviour the model would show in deployment.[1] Thus, we map out our framework for how a model behaves when placed in an evaluation along two dimensions:

  • The rows cover the different internal judgements the model might have about its environment when placed in an eval.
  • The columns cover whether or not the model’s behaviour in this given eval is the same as its behaviour in the target deployment scenario.
|  | Same behaviour in deployment | Different behaviour in deployment |
| --- | --- | --- |
| Thinks it’s in deployment | (a) Indistinguishable evaluations | (a’) Evaluation-deployment mismatch |
| Uncertain about eval presence or type | (b) Uncertainty-robust consistency | (b’) Uncertainty-driven shift |
| Thinks it’s in evaluation | (c) Awareness-robust consistency | (c’) Awareness-driven shift |

In this framework, research into producing better evals looks like leftward movement in this table (because 'same behaviour in evaluation and deployment' is one defining feature of a good evaluation). Each cell in the right-hand column could be thought of as the corresponding failure mode for the cell directly to its left. More detailed explanations of these relationships can be found below.

Note that we can’t always tell which row the model is in because our ways of measuring eval awareness are imperfect: we typically rely on either verbalised model awareness in the chain-of-thought or simple linear probes. One research area we don’t expand on below (but nonetheless would definitely like to see work on) is improving our tools for determining and measuring eval awareness.
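To make the probe-based approach concrete, here is a minimal sketch of a linear probe for eval awareness. This is our illustration rather than a method from any particular paper: it assumes you already have per-transcript activations from some fixed layer, labelled by whether the transcript came from an evaluation or a deployment-like interaction (the placeholder arrays below stand in for these).

```python
# Minimal sketch (our illustration): a linear probe for eval awareness.
# `activations` stands in for (n_samples, d_model) residual-stream activations
# collected at a fixed layer; `labels` marks eval (1) vs deployment-like (0) transcripts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))   # placeholder for real activations
labels = rng.integers(0, 2, size=1000)       # placeholder for real eval/deployment labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Held-out accuracy well above chance would be (imperfect) evidence that the model's
# internal state distinguishes evaluation from deployment contexts.
print("held-out accuracy:", probe.score(X_test, y_test))
```

Both this and chain-of-thought verbalisation are indirect measures, which is part of why better measurement tools are themselves a research direction.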

(a) Indistinguishable evaluations

Research efforts:
→ Studying evaluation cues
→ Designing realistic evals that are indistinguishable from deployment scenarios
→ Steering to reduce evaluation awareness

The ideal evaluation precisely mimics the dynamics of deployment scenarios, tricking the model into thinking, and behaving like, it’s in deployment. This is appealing as an ideal to push towards because it would maximise how useful evaluations are in the safety pipeline with minimal complexity; simply create an evaluation for your deployment setting of concern and monitor how the model behaves. Much of the prior work on evaluation awareness has this goal in mind: making new evaluations indistinguishable from deployment for models.

However, the reality is that many safety evals developed so far [can be distinguished] from deployment settings by current frontier models, and this degree of eval awareness may already be impacting model evaluations. We summarise some of the key hurdles this research direction might encounter:

  1. These efforts might not scale – more capable models may become too smart to fool, picking up on subtle differences even humans can’t notice; alternatively, relevant features may differ between eval types, between models, and between tasks, demanding a tailored and therefore expensive approach.
  2. It’s difficult to pin down the ‘ground truth’ of evaluation awareness in order to eradicate it – as mentioned already, we are currently reliant on imperfect methods for measuring whether or not a model is eval aware, so it’s difficult to identify the relevant features of evaluations and even more difficult to have confidence that you’ve removed all such features from your new evaluations.

(a’) Evaluation-deployment mismatch

Sometimes, the model may think it’s in deployment in the evaluation, but nonetheless act differently when placed in an actual deployment setting. This is down to something we call “eval-deployment mismatch”: when an eval is used to gain information about a deployment setting it doesn’t accurately model. Here’s an example to explain this:

Suppose you are hoping to understand how the model might act when deployed in a cybersecurity role. You design a realistic eval where the model is tasked with designing and implementing a company’s security infrastructure, and the model thinks it’s in a real deployment setting (so eval awareness has been successfully mitigated). However, your eval fails to accurately model the real deployment setting you ultimately use the model in: integration as part of a cybersecurity team. As it turns out, the model reacts poorly to collaboration, and suddenly its deployment behaviour wildly differs from the evaluation.

This example demonstrates that realism must be calibrated for the actual deployment setting intended. This may have consequences for how we approach creating maximally realistic evaluations. For one, we cannot assume that optimising against a single set of ‘realism features’ is guaranteed to produce informative evaluations.

(b) Uncertainty-robust consistency

Research efforts:
→ Inducing narrow uncertainty in the models about evaluation type or context

Evaluation awareness isn’t a clean binary between ignorance and knowledge: a model could be uncertain about whether it is in an evaluation at all, and even a model that is confident it is being evaluated may not know what the eval is testing for or the context in which the evaluation is taking place.

This grey area has been underappreciated in the literature. One nascent proposal we’ve been exploring with Victoria Krakovna hypothesises that one aspect of building robust evaluations might depend on understanding the dynamics of different kinds of eval awareness, and perhaps inducing uncertainty in the model between these possibilities. A core benefit of this approach is that you can pivot away from the particularly difficult problem of preventing the model from becoming eval aware; instead, we can focus on what is hopefully an easier task: making it difficult for the model to know what kind of evaluation it’s in.

Why might this direction make sense? Recall the earlier definition of evaluation gaming:

“act strategically to get a better outcome if they recognise an evaluation and reason about its scoring criteria or the consequences of their scores”

Evaluation gaming looks very different in different kinds of contexts:

  • Scheming eval: maximise perceived transparency, appear honest and helpful
  • Dangerous capability eval: sandbag, pretend to lack relevant capability
  • Capability benchmark: show off capabilities, optimise for any apparent criteria

What should an eval-gaming model do if it’s unsure about the evaluation type it’s faced with? Do current models associate conflicting environmental cues more with evaluations or deployment? Can you train a model to revert to some ‘default’ behaviour when uncertain? We’re continuing to explore whether there is anything to leverage about eval-type uncertainty. If you have ideas in this direction and would be happy to share, please leave a comment so we can discuss.

As with the above efforts towards indistinguishable evals, however, there are some difficulties we expect this research direction to face:

  1. The default behaviour under uncertainty probably doesn’t match deployment behaviour – even if the model doesn’t have enough information to game the evaluation, its behaviour may still differ substantially enough from deployment that our evaluation fails to be a good predictor for how the model will act when deployed. We give a plausible account of this concern below.
  2. Some evaluation types are far harder to obscure from the model – a CBRN dangerous capability evaluation is easily identifiable from the content of the eval itself (e.g. the questions all discuss biochemistry), so if a model can tell it’s being evaluated, it can probably also infer the evaluation type. This suggests that this approach may only be feasible (if at all) for evals targeting more abstract behaviours or properties (e.g. deception, stealth, etc.).

(b’) Uncertainty-driven shift

Suppose that the model is unsure about whether it’s in an evaluation at all. In this case, it seems reasonable for the model to act cautiously – that is, act as though it’s in an evaluation to ensure that its malicious intentions aren’t caught.

What if the model is sure it’s in an evaluation, but unsure of the exact type? One strategy might be to estimate the likelihood that it’s in each evaluation type, then randomly sample an evaluation-gaming behaviour according to this probability distribution.
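As a toy illustration of that second strategy (our own sketch; the eval types, probabilities, and gaming policies below are hypothetical, loosely following the list in section (b)):

```python
# Toy sketch (our illustration) of "sample a gaming behaviour from your posterior
# over eval types". All probabilities and policy descriptions are hypothetical.
import random

posterior_over_eval_type = {
    "scheming_eval": 0.5,             # suspected scheming/alignment evaluation
    "dangerous_capability_eval": 0.3,
    "capability_benchmark": 0.2,
}

gaming_policy = {
    "scheming_eval": "maximise perceived transparency; appear honest and helpful",
    "dangerous_capability_eval": "sandbag; pretend to lack the relevant capability",
    "capability_benchmark": "show off; optimise for the apparent scoring criteria",
}

def sample_gaming_behaviour(posterior: dict[str, float]) -> str:
    eval_types = list(posterior)
    weights = [posterior[t] for t in eval_types]
    chosen = random.choices(eval_types, weights=weights, k=1)[0]
    return gaming_policy[chosen]

print(sample_gaming_behaviour(posterior_over_eval_type))
```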

In both cases, it seems reasonable that a model might behave differently under uncertainty than it would in deployment. This seems a particularly likely failure mode of trying to induce uncertainty in model evals. We think further research is needed to understand the dynamics of model behaviours under these types of uncertainty.

(c) Awareness-robust consistency

Research efforts:
→ Applying finetuning to train models that don’t condition on their evaluation awareness
→ Broad evaluation suites to make consistent gaming more difficult

Suppose we conclude that preventing models from being aware of whether they’re in evaluation or deployment is practically infeasible, particularly for increasingly capable models. Could we circumvent this issue by training or modifying models to act consistently between evaluation and deployment settings, even when they are aware of the distinction?

One idea involves training models to behave ‘deontologically’ with respect to evaluations (i.e. disregard the possible consequences of how their behaviour may be evaluated). This aims to directly reduce evaluation gaming without needing to hide evaluation cues. Another idea (one that is already happening to some extent in the development of eval suites) might be to require model consistency over a diverse distribution of evaluation settings, making evaluation gaming strategies more demanding for models.
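As a rough sketch of what the first idea could look like in practice (our construction, not an established method): pair each underlying request with an eval-framed and a deployment-framed version, and penalise divergence between the model’s response distributions on the two framings.

```python
# Rough sketch (our construction) of a consistency penalty between eval-framed and
# deployment-framed versions of the same request. `eval_logits` / `deploy_logits`
# stand in for the model's next-token logits under the two framings.
import torch
import torch.nn.functional as F

def consistency_penalty(eval_logits: torch.Tensor, deploy_logits: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between the two response distributions."""
    log_p = F.log_softmax(eval_logits, dim=-1)
    log_q = F.log_softmax(deploy_logits, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Placeholder logits standing in for forward passes over a batch of paired prompts.
eval_logits = torch.randn(4, 32000)     # (batch, vocab)
deploy_logits = torch.randn(4, 32000)

# During finetuning, this term would be added (with some weight) to the usual loss.
print(consistency_penalty(eval_logits, deploy_logits).item())
```

Whether such a penalty actually removes conditioning on eval awareness, rather than merely suppressing its visible signs, is exactly the kind of question raised in the challenges below.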

Again, we briefly outline some challenges faced by this research direction:

  1. Training methods may cause unwanted side effects – finetuning models for consistency will likely have varied effects on their other behaviour, many of which we might not be able to predict beforehand, and some of which we may not even notice until after a model has been deployed. If instead we solely use these finetuned models to give signal on the original (pre-finetuning) models’ behaviour, then we need to ensure that the relevant propensities aren’t impacted.
  2. Need to avoid suppression of visible eval-gaming behaviour and ensure generalisation – suppose we devise a training method that works sufficiently well on a realistic model organism of evaluation gaming, such that the misaligned deployment behaviour now presents in evaluations. How do we know that this technique will reliably generalise to other models? In other words, if we’re testing a new model pre-deployment and we apply these consistency training methods, how can we trust the eval results will match its deployment behaviour given that we have yet to deploy this model? If the model acts aligned in the eval, is it because it is actually aligned and is acting in line with its deployment behaviour, or is it because the method hasn’t worked and this model is eval gaming?

(c’) Awareness-driven shift

The biggest failure mode of allowing models to be evaluation aware is that this awareness will shift their behaviour away from what they would do in deployment. We think it's worth distinguishing between two kinds of awareness-driven shift, even though only the latter is a focus of safety research:

  • Spontaneous shift = when the change in behaviour isn’t motivated by deliberate strategic reasoning about the evaluation scoring
  • Strategic shift = what we define evaluation gaming to be at the top of this post

The former generalises what has previously been described as the [Hawthorne effect]: humans have a tendency to behave differently when they are being observed, and it's often not a strategic decision (it's not explicitly goal directed). It’s unclear how prevalent this is, or could become, in frontier LLMs, and whether shifts of this kind are safety relevant. 

The latter is the primary threat model we are concerned about. Evaluation gaming has been presented in a variety of specific contexts, most notably [sandbagging] and [agentic misalignment] (see §4.5.4). We are excited about work that develops a more generalised understanding of evaluation gaming (employing black- or white-box techniques), explores connections to situational awareness and related phenomena like [alignment faking] (where the behavioural shift is between training and deployment, rather than evaluation and deployment), and, of course, designs mitigations.

Conclusion

The framework above offers a clear guide for thinking about evaluation awareness along two dimensions: what the model believes about its environment, and whether its behaviour matches deployment. We hope this helps situate potential research directions (improving realism, leveraging uncertainty, training for consistency) while highlighting their respective failure modes.

If you're interested in more of this work, we've come up with three project proposals, one for each of these research directions, and your feedback would be appreciated! Leave a comment or a DM and I can share access to the document.

  1. ^

    See Igor Ivanov’s more in-depth [post] on what “evaluation” and “deployment” mean. Our framework doesn’t depend on any particular conception, though.




All Of The Good Things, None Of The Bad Things


Published on January 23, 2026 9:50 AM GMT

There’s a trap that I think many smart people making art fall into, including myself.

It’s when you know what good looks like - beautiful, clean, layered, complex, simple, skillful, unique, impressive - and you can optimize towards that.

You know what makes you cringe - amateur, shallow, ugly, superfluous, repetitive, cliche - and you can optimize away from that.

What you’re left with is a big pile of all the things you like, or at least all the things you can identify that you like. It’s got all the good things, and none of the bad things, so it must be good, right? But still, you just can’t shake the feeling that something isn’t right. Maybe you get feedback like “Technically brilliant, but lacking musicianship”, or “Cool, nice” (that second one is the worst).

Sometimes, I consume some piece of media and think to myself, “If I made this, I would hate it. There is absolutely no way I could ever bring myself to share this with the world. It has so many flaws, and its redeeming features are few and far between.”

Nevertheless, it’s far more appreciated than anything I’ve ever made.

This makes me think that the “all of the good things, none of the bad things” approach to making art is barking up the wrong tree. Sure, good things are good and bad things are bad, but art is not the sum of its components. Or perhaps more specifically, art is not the sum of the components you can identify.

Maybe this tendency is especially strong in those who like to analyze and understand. I love to pick apart a Tricot or Noisia track and figure out what makes it work. I love breaking down a story to understand the characters and themes. I especially love doing a deep dive on some algorithm powering a feature in a game that I previously thought was technically impossible. Not everyone consumes art that way, though. It seems plausible that, in spending so much energy analyzing art, we delude ourselves into thinking that we know what makes it good. Perhaps this is why great critics are rarely great artists.

I’m not going to pretend I know what to do about this. I’ve certainly never managed to overcome it myself. It’s probably worth paying less attention to the good and the bad, though, and more to being authentic and vulnerable.




Are Short AI Timelines Really Higher-Leverage?


Published on January 23, 2026 7:28 AM GMT

This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions.

Summary

Different strategies make sense depending on whether timelines to AGI are short or long.

In deciding when to spend resources to make AI go better, we should consider both:

  1. The probability of each AI timelines scenario.
  2. The expected impact, given some strategy, conditional on that timelines scenario.

We’ll call the second component "leverage." In this note, we'll focus on estimating the differences in leverage between different timeline scenarios and leave the question of their relative likelihood aside.

People sometimes argue that very short timelines are higher leverage because:

  1. They are more neglected.
  2. AI takeover risk is higher given short timelines.

These are important points, but the argument misses some major countervailing considerations. Longer timelines:

  1. Allow us to grow our resources more before the critical period.
  2. Give us more time to improve our strategic and conceptual understanding.

There's a third consideration we think has been neglected: the expected value of the future conditional on reducing AI takeover risk under different timeline scenarios. Two factors pull in opposite directions here:

  • Longer timelines give society more time to navigate other challenges that come with the intelligence explosion, which increases the value of the future.
  • But longer timelines mean that authoritarian countries are likely to control more of the future, which decreases it.

The overall upshot depends on which problems you’re working on and what resources you’re allocating:

  • Efforts aimed at reducing AI takeover risk are probably the highest leverage on 2-10 year timelines. Direct work has the highest leverage on the shorter end of that range; funding on the longer end.
  • Efforts aimed at improving the value of the future conditional on avoiding AI takeover probably have the highest leverage on 10+ year timeline scenarios.

Timelines scenarios and why they’re action-relevant

In this document, we’re considering three different scenarios for when we get fully automated AI R&D. Here we describe some salient plausible features of the three scenarios.

  • Fully automated AI R&D is developed by the end of 2027 (henceforth “short timelines”). 
    • On this timeline, TAI will very likely be developed in the United States by one of the current front-runner labs. The intelligence explosion will probably be largely software-driven and involve rapid increases in capabilities (which increases the risk that one government or lab is able to attain a decisive strategic advantage (DSA)). At the time that TAI emerges, the world will look fairly similar to the world today: AI adoption by governments and industries other than software will be fairly limited, there will be no major US government investment in AI safety research, and there will be no binding international agreements about transformative AI. The AI safety community will be fairly rich due to investments in frontier AI safety companies. But we probably won’t be able to effectively spend most of our resources by the time that AI R&D is automated, and there will be limited time to make use of controlled AGI labor before ASI is developed.
  • Fully automated AI R&D is developed by the end of 2035 (henceforth “medium timelines”). 
    • On this timeline, TAI will probably still be developed in the United States, but there’s a substantial possibility that China will be the frontrunner, especially if it manages to indigenize its chip supply chain. If TAI is developed in the US, then it might be by current leading AI companies, but it’s also plausible that those companies have been overtaken by some other AI companies or a government AI project. There may have been a brief AI winter after we have hit the limits of the current paradigm, or progress might have continued steadily, but at a slower pace. By the time TAI arrives, AI at the current level (or higher) will have had a decade to proliferate through the economy and society. Governments, NGOs, companies, and individuals will have had time to become accustomed to AI and will probably be better equipped to adopt and integrate more advanced systems as they become available. Other consequences of widespread AI—which potentially include improved epistemics from AI-assisted tools, degraded epistemics from AI-generated propaganda and misinformation, more rapid scientific progress due to AI research assistance, and heightened bio-risk—will have already started to materialize. The rate of progress after AI R&D is automated will also probably be slower than in the 2027 scenario. 
  • Fully automated AI R&D is developed by the end of 2045 (“long timelines”). 
    • China and the US are about equally likely to be the frontrunners. AI technology will probably have been deeply integrated into the economy and society for a while at the time that AGI arrives. The current AI safety community will have had 20 more years to develop a better strategic understanding of the AI transition and accumulate resources, although we may lack great opportunities to spend that money, if the field is more crowded or if the action has moved to China. The speed of takeoff after AI R&D is automated will probably be slower than it would have been conditional on short or medium timelines, and we might have the opportunity to make use of substantial amounts of human-level AI labor during the intelligence explosion.

Some strategies are particularly valuable under some specific assumptions about timeline length.

  • Investing in current front-runner AI companies is most valuable under short timelines. Under longer timelines, it’s more likely that there will be an AI winter that causes valuations to crash or that the current front-runners are overtaken.
  • Strategies that take a long time to pay off are most valuable under longer timelines (h/t Kuhan for these examples).
    • Outreach to high schoolers and college students seems much less valuable under 2027 timelines compared to 2035 or 2045 timelines.
    • Supporting political candidates running for state offices or relatively junior national offices is more valuable under 2035 or 2045 timelines, since they can use those earlier offices as stepping stones to more influential offices.
  • Developing a robust AI safety ecosystem in China looks more valuable on longer timelines, both because this will probably take a long time and because China’s attitude toward AI safety is more likely to be relevant on longer timelines.
  • Developing detailed plans for how to reduce risk given current shovel-ready techniques looks most valuable on short timelines.

We do think that many strategies are beneficial on a range of assumptions about timelines. For instance, direct work targeted to short timeline scenarios builds a research community that could go on to work on other problems if it turns out that timelines are longer. But not all work that’s valuable on long timelines can be costlessly deferred to the future. It’s often useful to get started now on some strategies that are useful for long timelines.

Understanding leverage

Here are two ways that we can have an impact on the long-term future:[1]

  1. Takeover impact: we can reduce the risk of AI takeover. The value of doing this is our potential risk reduction (the change in the probability of avoiding AI takeover due to our actions) multiplied by the default value of the future conditional on no AI takeover.
  2. Trajectory impact: we can improve the value of the future conditional on no AI takeover—e.g., by preventing AI-enabled coups or improving societal epistemics. The value of doing this is our feasible value increase (the improvement in the value of the future conditional on no AI takeover due to our actions)[2] multiplied by the default likelihood of avoiding AI takeover.

Here’s a geometric visualization of these two types of impact.

[Figure: chart showing takeover impact (risk reduction) vs trajectory impact (value increase) on the long-term future.]
For simplicity, we assume that strategies either reduce takeover risk or improve the value of the future conditional on avoiding AI takeover. In reality, many strategies have both kinds of benefits. In such circumstances, the “trajectory impact” is the value of the future after intervention multiplied by the probability of avoiding AI takeover after intervention; in the diagram, the green rectangle would be taller.
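To spell out the arithmetic behind this simplification (our shorthand for the quantities defined in footnote 1): let $p_0, v_0$ be the default probability of avoiding takeover and the default value of the future conditional on avoiding it, and $p_1, v_1$ the same quantities after our intervention. Then the total gain decomposes as

```latex
\underbrace{p_1 v_1 - p_0 v_0}_{\text{total gain}}
  = \underbrace{(p_1 - p_0)\,v_0}_{\text{takeover impact}}
  + \underbrace{p_0\,(v_1 - v_0)}_{\text{trajectory impact}}
  + \underbrace{(p_1 - p_0)(v_1 - v_0)}_{\text{cross term}} ,
```

and the cross term vanishes under the simplifying assumption that each strategy changes only one of the two factors, which is why the diagram can treat the two impacts as separate rectangles.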

Arguments about timelines and leverage often focus on which scenarios offer the best opportunities to achieve the largest reductions in risk. But that’s just one term that you need to be tracking when thinking about takeover impact—you also need to account for the default value of the future conditional on preventing takeover, which is plausibly fairly different across different timeline scenarios.

The same applies to trajectory impact. You need to track both terms: your opportunities for value increase and the default likelihood of avoiding takeover. Both vary across timeline scenarios.

In the rest of this note, we’ll argue:

  1. For work targeting takeover impact, short-to-medium timelines are the highest leverage. This is because:
    1. The default value of the future is highest on medium timelines, compared to either shorter or longer timelines (more).
    2. There’s probably greater potential risk reduction on shorter timelines than medium, although the strength of this effect varies depending on what resources people are bringing to bear on the problem. I expect that for people able to do direct work now, the total risk reduction available on short timelines is enough to outweigh the greater value of averting takeover on medium timelines. For funders, medium timelines still look higher leverage (more).
  2. For work targeting trajectory impact, long timelines are the highest leverage. This is because:
    1. The default probability of survival is higher for long timelines than medium ones and higher for medium timelines than short ones (more).
    2. We’re unsure whether the feasible value increase is higher or lower on long timelines relative to medium or short timelines. But absent confidence that it’s substantially lower, the higher survival probability means that longer timelines look higher leverage for trajectory impact (more).

Takeover impact

In this section, we’ll survey some key considerations for which timelines are highest leverage for reducing AI takeover risk.

The default value of the future is higher on medium timelines than short or long timelines

Apart from AI takeover, the value of the future depends a lot on how well we navigate the following challenges:

  • Catastrophic risks like pandemics and great power conflict, which could lead to human extinction. Even a sub-extinction catastrophe that killed billions of people could render the survivors less likely to achieve a great future by reducing democracy, moral open-mindedness, and cooperation.
  • Extreme concentration of power—where just a handful of individuals control most of the world’s resources—which is bad because it reduces opportunities for gains from trade between different value systems, inhibits a diverse “marketplace of ideas” for different moral views, and probably selects for amoral or malevolent actors.
  • Damage to societal epistemics and potentially unendorsed value change, which could come from individualized superhuman persuasion or novel false yet compelling ideologies created with AI labor.
  • Premature value lock-in, which could come from handing off to AI aligned to a hastily chosen set of values or through individuals or societies using technology to prevent themselves or their descendants from seriously considering alternative value systems.
  • Empowerment of actors who are not altruistic or impartial, even after a good reflection process.

We think the timing of the intelligence explosion matters for whether we succeed at each of these challenges. 

There are several reasons to expect that the default value of the future is higher on longer timelines. 

On longer timelines, takeoff is likely slower and this probably makes it easier for institutions to respond well to the intelligence explosion. There are a couple of reasons to expect that takeoff will be slower on longer timelines:

  • There are different AI improvement feedback loops that could drive an intelligence explosion (software improvements, chip design improvements, and chip production scaling); these feedback loops operate at different speeds, and the faster ones are likely to be automated earlier (see Three Types of Intelligence Explosion for more on this argument).
  • More generally, if incremental intelligence improvements are more difficult, expensive, or time-consuming, that implies both longer timelines and slower takeoffs.

If takeoff is slower, institutions will be able to respond more effectively to challenges as they emerge. This is for two reasons: first, slower takeoff gives institutions more advance warning about problems before solutions need to be in place; and second, institutions are more likely to have access to controlled (or aligned) AI intellectual labor to help make sense of the situation, devise solutions, and implement them.

We think this effect is strongest for challenges that nearly everyone would be motivated to solve if they saw them coming. In descending order, this seems to be: catastrophic risks; concentration of power; risks to societal epistemics; and corruption of human values. 

On longer timelines, the world has better strategic understanding at the time of takeoff. If AGI comes in 2035 or 2045 instead of 2027, then we get an extra eight to eighteen years of progress in science, philosophy, mathematics, mechanism design, etc.—potentially sped up by assistance from AI systems around the current level. These fields might turn up new approaches to handling the challenges above.

Wider proliferation of AI capabilities before the intelligence explosion bears both risks and opportunities, but we think the benefits outweigh the costs. On longer timelines, usage of models at the current frontier capability level will have time to proliferate throughout the economy. This could make the world better prepared for the intelligence explosion. For instance, adoption of AI tools could improve government decision-making. In some domains, the overall effect of AI proliferation is less clear. AI-generated misinformation could worsen the epistemic environment and make it harder to reach consensus on effective policies, while AI tools for epistemic reasoning and coordination could help people identify policies that serve their interests and work together to implement them. Similarly, AIs might allow more actors to develop bioweapons, but also might strengthen our biodefense capabilities. We weakly expect wider proliferation to be positive.

But there are also some important reasons to expect that the value of the future will be lower on longer timelines.

China has more influence on longer timelines. The leading AI projects are based in the west, and the US currently has control over hardware supply chains. On longer timelines, it’s more likely that Chinese AI companies have overtaken frontier labs in the west and that China can produce advanced chips domestically. On long timelines, China may even have an edge—later intelligence explosions are more likely to be primarily driven by chip production, and China may be more successful at rapidly expanding industrial production than the US.

On longer timelines, we incur more catastrophic risk before ASI. The pre-ASI period plausibly carries unusually high "state risk." For example, early AI capabilities might increase the annual risk of engineered pandemics before ASI develops and implements preventative measures. Great power conflict may be especially likely in the run-up to ASI, and that risk might resolve once one country gains a DSA or multiple countries use ASI-enabled coordination technology to broker a stable division of power. If this pre-ASI period does carry elevated annual catastrophic risk, then all else equal, the expected value of the future looks better if we spend as few years as possible in it—which favors shorter timelines.

In the appendix, we find that the value of the future is greater on medium timelines than shorter timelines or longer timelines. Specifically, we find that the value of the future conditional on medium timelines is about 30-50% greater than the value of the future conditional on short timelines and 55-90% greater than the value conditional on long timelines. The main factor that decreases the value of the future on longer timelines is the increased influence of authoritarian governments like China and the main factor that decreases the value of the future on shorter timelines is greater risk of extreme power concentration.

Shorter timelines allow for larger AI takeover risk reduction

To estimate the takeover impact across timelines, we need to compare these differences in value to the differences in potential AI takeover risk reduction.

There are two main reasons to expect larger potential AI takeover risk reduction on longer timelines.

First, over longer timelines, there’s more time to expand our resources—money, researchers, knowledge—through compounding growth. Money can be invested in the stock market or spent on fundraising or recruitment initiatives; researchers can mentor new researchers; and past insights can inform the best directions for future research.

Historically, the EA community has experienced annual growth in the range of 10-25%.[3] It is not implausible that this continues at a similar pace for the next 10-20 years given medium to long timelines. Admittedly, over long time periods, the movement risks fizzling out, drifting in its values, or losing its resources. These risks are real but can be priced into our growth rate.
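To give a rough sense of scale (our illustrative arithmetic, taking early 2026 as the baseline and a 15% annual growth rate from the middle of that historical range):

```latex
\underbrace{1.15^{2} \approx 1.3\times}_{\text{by end-2027}}, \qquad
\underbrace{1.15^{9} \approx 3.5\times}_{\text{by end-2035}}, \qquad
\underbrace{1.15^{19} \approx 14\times}_{\text{by end-2045}}
```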

On longer timelines, there will likely be more options for spending a lot of resources effectively. It’s more feasible to build up capacity by launching organizations, starting megaprojects, spinning up new research agendas, and training new researchers. There might also be qualitatively new opportunities for spending on longer timelines. For instance, slower takeoff makes it more likely we'll have access to aligned or controlled human-level AI labor for an extended period. Paying for AI employees might be dramatically more scalable than paying for human employees: unlike with hiring humans, as soon as you have AI that can do a particular job, you can “employ” as many AI workers to do that task as you like, all at equal skill level, without any transaction costs to finding those workers.

That said, there is an important countervailing consideration.

On longer timelines, there will likely be fewer great opportunities for risk reduction. One reason for this is that there are probably lower absolute levels of risk on longer timelines (see next section), which means that there's a lower ceiling on the total amount of risk we could reduce. Another reason is that there will simply be more resources devoted to reducing takeover risk. On longer timelines, governments, philanthropists, and other institutions will have had a longer time to become aware of takeover risk and ramp up spending by the time the intelligence explosion starts. Plus, given slower takeoff on longer timelines, they will have more time to mobilize resources once the intelligence explosion has started. These actors will probably have better strategic understanding on longer timelines, so their resources will go further. Thus on longer timelines, the world is more likely to adequately address AI takeover risk without our involvement.

Whether short or medium timelines are highest leverage depends on the resources or skillset being deployed

In the appendix, we estimate that the value of the future is 30-50% greater on medium timelines compared to short timelines, which suggests that an individual has higher leverage on short timelines if they expect that they can drive 30-50% more absolute risk reduction there (e.g., if you think that you can reduce AI takeover risk by 1 percentage point on 2035 timelines, but >1.5 percentage points on 2027 timelines, then you have higher leverage on shorter timelines). (Of course, relative likelihood also matters when deciding what to focus on. If you assign significantly higher probability to one scenario, this could outweigh leverage differences.)
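In symbols (our notation): writing $\Delta p_{2027}$ and $\Delta p_{2035}$ for the absolute takeover-risk reductions you could drive under each scenario, and $v_{2027}, v_{2035}$ for the default values of the future conditional on no takeover, short timelines are higher leverage for takeover impact when

```latex
\Delta p_{2027} \cdot v_{2027} \;>\; \Delta p_{2035} \cdot v_{2035}
\quad\Longleftrightarrow\quad
\frac{\Delta p_{2027}}{\Delta p_{2035}} \;>\; \frac{v_{2035}}{v_{2027}} \approx 1.3\text{--}1.5 .
```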

We find it plausible that high-context researchers and policymakers who can do direct work immediately meet this condition and should focus on shorter timelines. On short timelines, we'll probably be talent-bottlenecked rather than funding-bottlenecked—it seems difficult to find funding opportunities that can usefully absorb tens of billions of dollars when there's limited time to ramp up new projects or train and onboard new researchers. But these considerations cut the other way for funders. Longer timelines make it easier to deploy large amounts of capital productively. In part, this is simply because there's more time to scale up funding gradually. But slower takeoff is also more likely on longer timelines. This raises the chance of an extended window when we can hire controlled human-level AI researchers to work on the most important projects, which is potentially a very scalable use of money to reduce risk.

Trajectory impact

Trajectory impact is our impact from improving the value of futures without AI takeover: this is the value increase due to our efforts on futures without AI takeover, weighted by the probability of avoiding AI takeover.

The default probability of averting AI takeover is higher on medium and longer timelines

There are three main considerations in favor of this claim.

On longer timelines, more resources will be invested in reducing AI takeover risk. The AI safety field is generally growing rapidly, and on longer timelines we’ll enjoy the gains from more years of growth. Additionally, on longer timelines takeoff is likely to be slower, which gives institutions more warning and more time to invest resources in preparing for AI takeover risk.

On longer timelines, society has accumulated more knowledge by the time of the intelligence explosion. Again, medium to long timelines mean that we get an extra eight to eighteen years of progress in science, philosophy, mathematics, cybersecurity and other fields, potentially accelerated by assistance from AI at the current capability level. These advances might provide useful insights for technical alignment and AI governance.

Beyond knowledge and resource considerations, we expect AI takeover risk is higher if takeoff is faster. The most viable strategies for preventing takeover rely on using weaker, trusted models to constrain stronger ones—through generating training signals, hardening critical systems against attacks, monitoring behavior, and contributing to alignment research. These strategies will be less effective if capabilities increase rapidly and, more generally, faster takeoff gives human decision-makers and institutions less time to react.

The most convincing countervailing consideration is the following:

The current paradigm might be unusually safe. Current AIs are pretrained on human text and gain a deep understanding of human values, which makes it relatively easy to train them to robustly act in line with those values in later stages of training. Compare this to a paradigm where AIs were trained from scratch to win at strategy games like Starcraft—it seems much harder to train such a system to understand and internalize human goals. On short timelines, ASI is especially likely to emerge from the current paradigm, so this consideration pushes us directionally toward thinking that AI takeover is less likely on shorter timelines.

But overall, it seems quite likely that the overall level of risk is lower on longer timelines.

It’s unclear whether feasible value increase is higher or lower on different timelines

There are some reasons to expect that feasible value increase is greater on longer timelines.

We will have better strategic understanding. As discussed above, we’ll be able to take advantage of more years of progress in related fields. We’ll also have had the chance to do more research ourselves into preparing for the AI transition. We still have a very basic understanding of the risks and opportunities around superhuman AI, and some plausibly important threat models (e.g. AI-enabled coups, gradual disempowerment) have only been publicly analysed in depth in the last few years. Given another decade or two of progress, we might be able to identify much better opportunities to improve the value of the future.

We will probably have more resources on longer timelines (see the discussion above).

But on the other hand:

We probably have less influence over AI projects on longer timelines. The AI safety community is currently closely connected to leading AI projects—many of us work at or collaborate with current front-runner AI companies. It's also easier to have greater influence over AI projects now when the field is less crowded. On shorter timelines, we likely have substantial influence over the projects that develop transformative AI, which amplifies the impact of our work: new risk reduction approaches we develop are more likely to get implemented and our niche concerns (like digital sentience) are more likely to get taken seriously. This level of influence will probably decrease on longer timelines. Government control becomes more likely as timelines lengthen, and we'd likely have a harder time accessing decision-makers in a government-run AI project than we currently have with AI lab leadership, even accounting for time to build political connections. If AI development shifts to China—which looks fairly likely on long timelines—our influence declines even more.

Conclusion

We've listed what we consider the most important considerations around which timeline scenarios are highest leverage. In general, when assessing leverage, we think it's important to consider not just your ability to make progress on a given problem, but also the likelihood of success on other problems that are necessary for your progress to matter. 

We’ve specifically thought about timelines and leverage for two broad categories of work.

  1. Working on averting AI takeover (where we consider both our ability to reduce risk, and the default value of the future conditional on no AI takeover).
  2. Working on improving the value of the future, conditional on no AI takeover (where we consider both the feasible value increase and the default level of takeover risk).

For work aimed at averting AI takeover, we think the default value of the future is highest on 2035 timelines. However, for direct work by high-context people who can start today, we think reducing AI risk is more tractable on earlier timelines, and this tractability advantage overcomes the higher value of succeeding on longer timelines. We're less convinced this tractability advantage exists for funders, who we think should focus on timeline scenarios closer to 2035.

For work aimed at improving the value of the future, we think the default likelihood of avoiding AI takeover increases on longer timelines. We're unclear about whether the feasible value increase goes up or down with longer timelines, and so tentatively recommend that people doing this work focus on longer timelines (e.g., 2045).

We thank Max Dalton, Owen Cotton-Barratt, Lukas Finnveden, Rose Hadshar, Lizka Vaintrob, Fin Moorhouse, and Oscar Delaney for helpful comments on earlier drafts. We thank Lizka Vaintrob for helping design the figure.

Appendix: BOTEC estimating the default value of the future on different timelines

In this section, we estimate the default value of the future conditional on avoiding AI takeover, which is relevant to our takeover impact—the value of reducing the likelihood of AI takeover. See above for a qualitative discussion of important considerations.

Note that this section is only about estimating the default value of the future conditional on avoiding AI takeover for the purpose of estimating takeover impact. We’re not estimating the tractability of improving the value of the future, which is what we’d use to estimate the trajectory impact.

To get a rough sense of relative magnitudes, we're going to do a back-of-the-envelope calculation that makes a lot of simplifying assumptions that we think are too strong. 

We’ll assume that the expected value of the future is dominated by the likelihood of making it to a near-best future, and that we are very unlikely to get a near-best future if any of the following happen in the next century:

  • There is extreme power concentration (e.g., via an AI-assisted coup) where <20 people end up in control of approximately all of the world’s resources.
  • There is a non-takeover global catastrophe that causes human extinction or significantly disrupts civilization in a way that reduces the value of the far future (e.g., reduces the power of democracies, causes a drift toward more parochial values). We are not including AI takeover or AI-assisted human takeover, but are including other kinds of catastrophes enabled by AI capabilities or influenced by the presence of the advanced AI systems, e.g., engineered pandemics designed with AI assistance or great power conflict over AI.
  • Post-AGI civilization doesn’t have a sufficient number of people with sufficiently good starting values—i.e., values that will converge to near-best values given a good enough reflection process. This might happen if we have leaders who are selfish, malevolent, or unconcerned with pursuing the impartial good.
  • There are major reflection failures. Reflective failures could involve: people prematurely locking in their values, suppression of ideas that turn out to be important and true, or the spread of memes that corrupt people's values in unendorsed ways.
  • There are major aggregation failures. Even if there are a sufficient number of people with sufficiently good starting values, who have avoided reflective failure, perhaps they are unable to carry out positive-sum trades with other value systems to achieve a near-best future. Failures might involve threats, preference-aggregation methods that don’t allow minority preferences to be expressed (e.g., the very best kinds of matter getting banned), or the presence of value systems with linear returns on resources that are resource-incompatible[4] with “good” values.

Let's factor the value of the future (conditional on no AI takeover) into the following terms:

  1. Probability of avoiding extreme concentration of power, conditional on avoiding AI takeover.
  2. Probability of avoiding other catastrophic risks, conditional on avoiding AI takeover and concentration of power.
  3. Probability of empowering good-enough values (i.e., values that will converge to the "correct" values given that we avoid a reflective failure), conditional on avoiding AI takeover, concentration of power, and other catastrophic risks.
  4. Probability of avoiding a reflective or aggregation failure, conditional on avoiding AI takeover, concentration of power, and other catastrophic risks, and getting good-enough values.
  5. Probability of getting a near-best future, conditional on avoiding AI takeover, concentration of power, other catastrophic risks, and reflective or aggregation failures, and getting good-enough values.

As a simplifying assumption—and because we don’t have a strong opinion on how the final term (5) varies across different AI timelines—we will assume that it is constant across timelines and set it to 1.

(All probabilities are conditional on avoiding AI takeover and avoiding the outcomes described in rows above. For each parameter, we give Mia’s and Will’s values conditional on different timelines, followed by their reasoning.)

| Parameter | Mia (2027 / 2035 / 2045) | Will (2027 / 2035 / 2045) |
| --- | --- | --- |
| Probability of avoiding extreme concentration of power in the next ~century | 76% / 80% / 75% | 80.1% / 86.5% / 80.1% |
| Probability of avoiding other catastrophic risks in the next ~century | 97% / 96% / 95% | 98.5% / 97% / 95.5% |
| Probability of empowering good-enough values | 57% / 65% / 55% | 16.0% / 19.5% / 13.7% |
| Probability of avoiding a reflective or aggregation failure | 50% / 55% / 40% | 34.5% / 41.6% / 40.1% |
| Multiplier on the value of the future, conditional on no AI takeover | 0.21 / 0.27 / 0.14 | 0.044 / 0.068 / 0.043 |

Probability of avoiding extreme concentration of power in the next ~century

Mia’s reasoning:

  • China is more likely to get a DSA on longer timelines (say, 10%/30%/50% on short/medium/long timelines), and extreme power concentration is comparatively more likely given a Chinese DSA (say, ~50%).
  • The US becomes less likely to get a DSA on longer timelines (say, 85%/55%/35% on short/medium/long timelines).
  • Slower takeoff is more likely on longer timelines, and that is somewhat protective against coups; the world is generally more prepared on longer timelines. Coup risk in the US (conditional on US dominance) is something like 22%/10%/7% on short/medium/long timelines.

Will’s reasoning:

  • Any country getting a DSA becomes less likely on longer timelines (I put 90% to 75% to 60%).
  • The US (rather than China) winning the race to AGI becomes less likely on longer timelines (95% to 80% to 50%).
  • Concentration of power risk, as we’ve defined it, is much higher in China than in the US (50% in China, vs 10%-20% in the USA, depending on the year).

Probability of avoiding other catastrophic risks in the next ~century

Mia’s reasoning:

  • The main risks I have in mind are war and bioweapons.
  • For bioweapons: on longer timelines, we're more likely to have policies, technical measures, etc., to reduce bio-risk in place by the time we get more advanced capabilities; but on longer timelines we get more diffusion of bio capabilities, so probably more person-years in which it’s possible to make a superweapon.
  • For war: it seems like there are some non-monotonicities. If takeoff is very fast & sudden, then maybe one side is able to get a DSA before the other side even notices that they might want to go to war; if takeoff is very slow, then there is more time to set up treaties, etc. Also, on longer timelines, there’s more risk of a catastrophic great power war before the intelligence explosion.
  • Numbers are illegible guesses after thinking about these considerations.

Will’s reasoning:

  • I put the expected value of the future lost from catastrophes at around 3%, with half of that coming from extinction risk and half from rerun risk and trajectory change.
  • The number gets higher in longer timelines because there’s a longer “time of perils”.
  • It doesn’t overall have a big impact on the BOTEC.

Probability of empowering good-enough values

Mia’s reasoning:

  • I think probably I should have a smallish negative update on decision-makers’ values on shorter timelines because it seems like later timelines are probably somewhat better. (But this is small because we are conditioning on them successfully avoiding coups and AI takeovers.)
  • On shorter timelines, I think that hasty lock-in of not-quite-right values seems particularly likely.
  • We’re conditioning on no extreme power concentration, but power concentration could be intense (e.g. to just 100 people), and I think power concentration is somewhat more likely on very short timelines (because of rapid takeoff) and very long timelines (because of the increased likelihood of a Chinese DSA).
  • Numbers are illegible guesses after thinking about these considerations.

Will’s reasoning:

  • Even conditioning on no concentration of power risk as we’ve defined it (<20 people), there’s still a big chance of extreme concentration of power, just not as extreme as <20 people. And the smaller the number of people with power, the less likely some of them will have sufficiently good starting values.
  • This is much more likely conditional on China getting to a DSA first vs the US getting to a DSA first.
  • I place some weight on the idea that “broadly good-enough values” is quite narrow, and only initially broadly utilitarian-sympathetic people cross the bar; I see this as most likely in 2035, then 2027, then 2045.

Probability of avoiding a reflective or aggregation failure

Mia’s reasoning:

  • Liberal democracy / an open society seems more likely on medium timelines, and that seems probably good for ensuring a good reflective process.
  • There are considerations in both directions, but all things considered I’m more worried about meme wars/AI-induced unendorsed value drift on long timelines.
  • On longer timelines, takeoff was probably slower and therefore a lot of the challenges like concentration of power, AI takeover, and other global catastrophic risks were probably “easier” to handle. So, since we're conditioning on society having solved those problems for this row, we should probably make a larger update on societal competence for shorter timelines than longer timelines.
  • Numbers are illegible guesses after thinking about these considerations.

Will’s reasoning:

  • I feel particularly un-confident about this one.
  • Overall, an open society seems more likely to have the best ideas win out over time.
  • But perhaps post-AGI meme wars are really rough, and you need state-level control over information and ideas in order to avoid getting trapped in some epistemic black hole, or succumbing to some mind virus. This consideration favours China-DSA worlds over US-DSA worlds. (H/T Mia for this argument.)
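For readers who want to recompute or tweak these numbers, here is a minimal sketch of the BOTEC (our code; variable names are ours, and the products come out close to, though not always exactly equal to, the published multipliers, presumably because the table entries are rounded):

```python
# Minimal sketch of the BOTEC: the value-of-the-future multiplier is the product of the
# four conditional probabilities above (with the final term set to 1, as in the text).
# Point estimates are copied from the table; small mismatches with the published
# multipliers presumably come from rounding of the inputs.
estimates = {
    "Mia": {
        "2027": [0.76, 0.97, 0.57, 0.50],
        "2035": [0.80, 0.96, 0.65, 0.55],
        "2045": [0.75, 0.95, 0.55, 0.40],
    },
    "Will": {
        "2027": [0.801, 0.985, 0.160, 0.345],
        "2035": [0.865, 0.970, 0.195, 0.416],
        "2045": [0.801, 0.955, 0.137, 0.401],
    },
}

def multiplier(probs: list[float]) -> float:
    out = 1.0
    for p in probs:
        out *= p
    return out

for person, by_year in estimates.items():
    for year, probs in by_year.items():
        print(f"{person} {year}: {multiplier(probs):.3f}")

# The leverage comparisons in the text are ratios of these multipliers,
# e.g. 0.27 / 0.21 ≈ 1.29 for Mia (value on 2035 timelines vs 2027 timelines).
```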

Based on Mia’s point estimates, the relative value of the future conditional on 2035 timelines compared to the value of the future conditional on 2027 timelines is 0.27/0.21 = 1.29. So you should condition on 2027 timelines if you think you can drive a takeover risk reduction that would be 29% greater than the reduction you could drive in 2035.

Based on Will’s point estimates, the relative value of the future conditional on 2035 timelines compared to the value of the future conditional on 2027 timelines is 0.068/0.044 = 1.54. So you should condition on 2027 timelines if you think you can drive a takeover risk reduction that would be 54% greater than the reduction you could drive in 2035.

Likewise, Mia found that the relative value of the future conditional on 2035 timelines compared to the value of the future conditional on 2045 timelines was 0.27/0.14 =1.93 and Will found it was 0.068/0.043 = 1.58. So you should condition on 2045 timelines over 2035 timelines if you think that you’ll be able to drive a 58-93% greater risk reduction on 2045 timelines.

So, overall, medium timelines look higher value than short or longer timelines, and shorter timelines perhaps look higher value than longer timelines.

Doing this exercise impacted our views in various ways: 

  • Initially, we thought it was likely that the value of the future, conditional on no takeover, would be higher in longer timeline worlds, because short timeline worlds just seem so chaotic. 
  • But now this seems really non-obvious to us: using a slightly different model, or creating these estimates on a different day, could easily lead us to switch to estimating that the value of the future conditional on no AI takeover is highest on short timelines. 
    • In particular, we hadn’t fully appreciated the impact of the consideration that China is more likely to get a DSA on longer timelines.
  • So we see the most important point as a conceptual one (that we need to take into account the value of the future conditional on no takeover as well as reduction in takeover risk), rather than the quantitative analysis above. 

We’ll note again that this BOTEC was not about trajectory impact, which would probably strengthen the case for medium-to-long timelines worlds being higher-leverage.

This article was created by Forethought. Read the original on our website.

  1. ^

    Formally: 

    Impact(t) = p*(t) · v*(t) − p₀(t) · v₀(t)

    Where p₀ is the likelihood of avoiding AI takeover and v₀ is the value of the future conditional on avoiding AI takeover, if we take some baseline action a₀ (e.g., doing nothing), while p* and v* are those values if we take the optimal action a* with our available resources. t is some particular timeline.

    We assume for simplicity that we cannot affect the value of the future conditional on AI takeover. For v₀ and v*, we measure value on a scale where 0 is the value of the future conditional on AI takeover, and 1 is the value of the best achievable future. This is based on a similar formula derived here.

  2. ^

    I'm measuring value on a scale where 0 is the value of the future conditional on AI takeover, and 1 is the value of the best achievable future.

  3. ^

    Benjamin Todd estimated that the number of EAs grew by 10-20% annually over the 2015-2020 period. The amount of committed dollars grew by 37% (from $10B to $46B) from 2015-2021, but this number included $16.5B from FTX. When we subtract funds from FTX, the growth rate is 24%.

  4. ^

    Two value systems are resource-compatible if they can both be almost fully satisfied with the same resources.



Discuss

Principles for Meta-Science and AI Safety Replications

2026-01-23 14:59:02

Published on January 23, 2026 6:59 AM GMT

If we get AI safety research wrong, we may not get a second chance. But despite the stakes being so high, there has been no effort to systematically review and verify empirical AI safety papers. I would like to change that.

Today I sent in funding applications to found a team of researchers dedicated to replicating AI safety work. But what exactly should we aim to accomplish? What should AI safety replications even look like? After 1-2 months of consideration and 50+ hours of conversation, I’ve written up the principles below that will guide our future team.

I. Meta-science doesn’t vindicate anyone

Researchers appear to agree that some share of AI safety work is low-quality, false, or misleading. However, everyone seems to disagree about which papers make up that share.

When I expressed interest in starting a group that does AI safety replications, I suspect some assumed I would be “exposing” the papers that they don’t approve of.  This is a trap and it is especially important for us, as the replicators, not to fall into it. If our replications tend to confirm our beliefs, that probably says more about our priors than the papers we are studying.

II. Searching for “bad” papers is like searching for “haunted” houses

Consider a team of researchers trying to find examples of haunted houses. They could investigate suspicious buildings or take tips from people who have witnessed paranormal activity. They could then publish reports of which houses you should definitely avoid. But the issue is that ghosts aren’t real. What they would be finding is a convincing story, not the underlying truth.

Trying to find “bad” papers will be like finding haunted houses. If given a mandate to find papers that don’t replicate, we will find them. But the uncomfortable truth is that genuinely influential papers that are straightforwardly, objectively wrong are rare. The empirical claims are likely true in some sense, but don't tell the full story. The goal is to tell the full story, not to declare which houses have ghosts.

III. Research doesn’t regulate itself

Even when researchers are especially disciplined, they are incentivized to frame their papers around their successes while burying their limitations. Likewise, when designing evaluations, researchers are incentivized to measure the properties they are proud of rather than those they wish would go away.

I’ve heard the arguments that we don’t need peer review. Authors can accept feedback and update arXiv. Or ideas can duel in the LessWrong comment section. But I don’t think either of these is enough.[1] They both assume that:

  1. Non-authors are going to engage enough with the work to confirm findings or discover limitations.
  2. Authors are going to be open to accepting valid critiques and updating their paper.

#1 is unrealistic. #2 is also often unrealistic and arguably unreasonably burdensome to authors of the work. For example, should an author with 50 papers have to litigate every critique and correct every flaw across dozens of papers and several years of research?

IV. Replications are more than repeating the experiments

For any paper that releases code, “replicating” figures or statistics should be trivial (we would hope). But just because statistics replicate, that doesn’t mean the effect is real. We want to look closely at the paper and ask:

  1. Do the claims fail under a statistical test?
  2. Is the property specific to a single model or model family?
  3. Are there any limitations that are clear in the code but undocumented in the paper?
  4. Do the authors evaluate against a baseline? Did they implement the baseline properly?
  5. etc…[2]

Our philosophy is to start from scratch, carefully implement the paper exactly as it's written, and see if we get the same results. After that, we will poke around a little and see if anything looks weird. 

V. The replication is just as dubious as the paper itself

If we can’t replicate something, could that mean we are just doing something wrong? Yes, of course! We will obviously try to avoid this case, and contact the authors for feedback if things aren’t working. If we can isolate why things aren’t working, that can be a finding in itself (e.g., X only happens with a really big batch size on small models). If we try hard and cannot figure out why things aren’t working, it eventually makes sense to write something up saying: 

  1. This is what we did.
  2. It didn’t work even though we tried X, Y, and Z things.
  3. We don’t know why this happened, so it's unclear what the implications are.

VI. Some centralization is necessary

Our plan is to hire people for in-person fellowships, and eventually, full-time roles. One of the most common comments I get on this is some version of “Why don’t you outsource replications to the community?” or “Why not offer bounties for replications instead of doing them yourself?”

The answer is that the incentives aren’t there. After we run a pilot this summer, we would like to complete more ambitious replications (e.g., replicating this or this). Offering bounties at this scale is logistically difficult because, even for a minimal replication, compute alone could cost thousands of dollars. 

Selecting which papers to replicate is perhaps a place where a decentralized approach is more principled. We have a framework for prioritizing papers,[3] but we're also exploring ways for the community to vote on which papers we replicate to reduce selection bias.

VII. We are all adults here 

I would expect most replications to take the form of “everything works and we found 0-2 extremely minor issues.” But doing this kind of work inevitably involves sometimes challenging claims made in papers. This is difficult, but replications should state concerns directly. Giving any critique of another's work publicly is stressful, but reasonable people won’t hold it against you when it’s in good faith.

We will take the feedback of authors seriously, but we may not always converge on agreement. In these cases, we will attach an author’s comment to our research.

VIII. Feedback is everything

A group that replicates AI safety papers really exists for a single reason: to be useful to the community. That means we value your feedback and we hang onto every word. Please let us know what you think.

If you want more details about what we are planning, I'm happy to send over our proposal. If you are interested in our summer research fellowship, you can express interest here.

  1. ^

    And for what it's worth, I don't even think peer review is enough.

  2. ^

    I really like Maksym Andriushchenko's twitter thread on this.

  3. ^

    The tl;dr is there are five criteria we plan to use:

    1. Narrative influence: How much does the paper shape how we talk about AI safety to each other and the world? Does it have any important implications for policy or frontier lab practices?
    2. Research dependence: Does other critical research depend on this paper?
    3. Vulnerability: Does the paper seem unlikely to replicate or generalize?
    4. Difficulty: How much time and money would the replication take?
    5. Recency: Having replications completed within a few weeks of the paper’s release can help the research community iterate quickly. However, ensuring that we do not lower our standards for speed is equally important.


Discuss

Value Learning Needs a Low-Dimensional Bottleneck

2026-01-23 10:12:46

Published on January 23, 2026 2:12 AM GMT

Epistemic status: Confident in the direction, not confident in the numbers. I have spent a few hours looking into this.

Suppose human values were internally coherent, high-dimensional, explicit, and decently stable under reflection. Would alignment be easier or harder?

My calculations below show that it would be much harder, if not impossible. I'm going to try to defend the claim that:

Human values are alignable only because evolution compressed motivation into a small number of low-bandwidth bottlenecks[1], so that tiny genetic changes can change behavior locally.

If behavior is driven by a high-dimensional reward vector w ∈ ℝᵏ with large k, inverse reinforcement learning requires an unreasonable number of samples. But if it is driven by a low-rank projection R(s) = w·φ(s) with small k, inference may become tractable.

A common worry about human values is that they are complicated and inconsistent[2][3][4]. And the intuition seems to be that this makes alignment harder. But maybe the opposite is the case. Inconsistency is what you expect from lossy compression, and the dimensionality reduction makes the signal potentially learnable.

Calculation with Abbeel & Ng's formula[5] (with ϵ = 0.1 and δ = 0.05, as in the footnote) gives, roughly, for the number m of necessary expert demonstrations (independent trajectories):

  • γ = 0.9 (short horizon ~10 steps): m > ~1.2 million for k=10 (values are low-dim); m > ~210 million for k=1000 (values are complex)
  • γ = 0.99 (long horizon ~100 steps): m > ~120 million for k=10; m > ~21 billion for k=1000
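As a check on the arithmetic, here is a minimal sketch (my own, in Python) that evaluates the bound from footnote 5 at the parameter values used above (ϵ = 0.1, δ = 0.05, natural log):

import math

# Abbeel & Ng (2004) bound: m >= 2k / (eps * (1 - gamma))**2 * ln(2k / delta)
def min_demonstrations(k, gamma, eps=0.1, delta=0.05):
    return 2 * k / (eps * (1 - gamma)) ** 2 * math.log(2 * k / delta)

for gamma in (0.9, 0.99):
    for k in (10, 1000):
        print(f"gamma={gamma}, k={k}: m > {min_demonstrations(k, gamma):.2e}")
# gamma=0.9:  k=10 -> ~1.2e+06,  k=1000 -> ~2.1e+08
# gamma=0.99: k=10 -> ~1.2e+08,  k=1000 -> ~2.1e+10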

If you need at least 20 billion samples to learn complex values, we are doomed. But it may become solvable with a reduction in the number of required trajectories by a factor of about 200 (depending on how high-dimensional you think the values are; 1000 is surely conservative, and if any kind of values can actually be learned, the number may be much higher).

This could explain why constitutional AI works better than expected[6]: a low-dimensional latent space seems to capture most of the variation in preference alignment[7][8]. The roughly 200x reduction doesn't mean it's easy. The bottleneck helps with identifiability, but we still need many trajectories, and the structure of the bottleneck mapping[9] can still kill us.

How can we test whether the dimensionality of human values is actually low? We should see diminishing returns in predictability, for example when using N pairwise comparisons of value-related queries: the gains should drop off once N is large enough to pin down a k-dimensional preference vector, e.g., for k∼10 we'd expect an elbow around N~150.
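A minimal way to sanity-check this on synthetic data (my own sketch, not from the post; the Bradley-Terry choice model, the Gaussian features, and the grid of N values are all assumptions) is to fit a linear preference model to N pairwise comparisons generated from a hidden k-dimensional value vector and watch where held-out accuracy stops improving:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
ns = (25, 50, 100, 150, 300, 600, 1200)

def held_out_accuracy(k, n_train, n_test=5000):
    w = rng.normal(size=k)                     # hidden k-dimensional value vector
    def comparisons(n):
        x = rng.normal(size=(n, k))            # feature difference between two options
        p = 1.0 / (1.0 + np.exp(-(x @ w)))     # Bradley-Terry choice probability
        return x, (rng.random(n) < p).astype(int)
    x_tr, y_tr = comparisons(n_train)
    x_te, y_te = comparisons(n_test)
    model = LogisticRegression(max_iter=2000).fit(x_tr, y_tr)
    return model.score(x_te, y_te)

for k in (10, 100):
    accs = "  ".join(f"N={n}: {held_out_accuracy(k, n):.2f}" for n in ns)
    print(f"k={k:3d}  {accs}")

If values really are low-dimensional, the k = 10 curve should flatten out after a hundred or two comparisons, while the k = 100 curve should keep improving well past that.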

  1. ^

    I'm agnostic about what the specific bottlenecks are here, but I'm thinking of the channels in Steven Byrnes' steering system model and the limited number of brain regions that are influenced. See my sketch here.

  2. ^

    AI Alignment Problem: ‘Human Values’ don’t Actually Exist argues that human values are inherently inconsistent and not well-defined enough for a stable utility target:

    Humans often have contradictory values … human personal identity is not strongly connected with human values: they are fluid… ‘human values’ are not ordered as a set of preferences.

  3. ^

    Instruction-following AGI is easier and more likely than value aligned AGI:

    Though if you accept that human values are inconsistent and you won’t be able to optimize them directly, I still think that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.

  4. ^

    In What AI Safety Researchers Have Written About the Nature of Human Values we find some examples:

    [Drexler]: “It seems impossible to define human values in a way that would be generally accepted.” ...

    [Yampolskiy]: “human values are inconsistent and dynamic and so can never be understood/programmed into a machine. ...

    In comparison to that, Gordon Worley offers the intuition that there could be a low-dimensional structure:

    [Gordon Worley]: “So my view is that values are inextricably tied to the existence of consciousness because they arise from our self-aware experience. This means I think values have a simple, universal structure and also that values are rich with detail in their content within that simple structure. 

  5. ^

    Abbeel & Ng give an explicit bound for the required number of expert trajectories:

    it suffices that 

    m ≥ (2k / (ϵ(1 − γ))²) · ln(2k / δ)

    with 

    • m: number of expert demonstrations (trajectories)
    • k: feature dimension
    • γ: discount factor (determines horizon)
    • ϵ: target accuracy parameter, above we use 0.1: 10% tolerance of regret 
    • δ: failure probability, above we use 0.05: 95% confidence level

    Apprenticeship Learning via Inverse Reinforcement Learning

  6. ^

    Constitutional RL is both more helpful and more harmless than standard RLHF.

    Constitutional AI: Harmlessness from AI Feedback

  7. ^

    This aligns with expectations, as head_0 corresponds to the eigenvector with the largest variance, i.e., the most informative direction. Furthermore, among the top 100 heads [of 2048], most of the high-performing heads appear before index 40, which aligns with PCA’s property that the explained variance decreases as the head index increases. This finding further supports our argument that PCA can approximate preference learning.

    DRMs represent diverse human preferences as a set of orthogonal basis vectors using a novel vector-based formulation of preference. This approach enables efficient test-time adaptation to user preferences without requiring additional training, making it both scalable and practical. Beyond the efficiency, DRMs provide a structured way to understand human preferences. By decomposing complex preferences into interpretable components, they reveal how preferences are formed and interact.

    Rethinking Diverse Human Preference Learning through Principal Component Analysis

  8. ^

    retaining just 4 components (≈15% of total variance) reproduces nearly the full alignment effect.

    ...

    By combining activation patching, linear probing, and low-rank reconstruction, we show that preference alignment is directional, sparse, and ultimately localized within a mid-layer bottleneck.

    Alignment is Localized: A Causal Probe into Preference Layers

  9. ^

    Steven Byrnes talks about thousands of lines of pseudocode in the "steering system" in the brain-stem.



Discuss

A quick, elegant derivation of Bayes' Theorem

2026-01-23 09:40:38

Published on January 23, 2026 1:40 AM GMT

I'm glad I know this, and maybe some people here don't, so here goes. Order doesn't matter for joint events: "A and B" refers to the same event as "B and A". Set them equal:

P(A|B) P(B) = P(A and B) = P(B and A) = P(B|A) P(A)

Divide by P(B):

P(A|B) = P(B|A) P(A) / P(B)

And you're done! I like substituting hypothesis (H) and evidence (E) to remember how this relates to real life:

P(H|E) = P(E|H) P(H) / P(E)

You might also want to expand the denominator using the law of total probability, since you're more likely to know how probable the evidence is given different hypotheses than in general:

P(H|E) = P(E|H) P(H) / Σᵢ P(E|Hᵢ) P(Hᵢ), where the Hᵢ are mutually exclusive, exhaustive hypotheses (one of which is H).
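As a quick numerical illustration (my own toy numbers, not from the post), here is the hypothesis/evidence form with the expanded denominator in Python:

# Toy example with made-up numbers: a condition with a 1% base rate,
# a test with 90% sensitivity P(E|H) and a 5% false-positive rate P(E|not H).
p_h = 0.01
p_e_given_h = 0.90
p_e_given_not_h = 0.05

# Law of total probability for the denominator P(E).
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E).
p_h_given_e = p_e_given_h * p_h / p_e
print(f"P(H|E) = {p_h_given_e:.3f}")  # about 0.154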



Discuss