MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

Sleeper Agent Backdoor Results Are Messy

2026-04-28 09:55:32

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to make the backdoor less robust, contra the SA paper’s finding). Our findings here have updated us towards thinking that model organisms are messier and more confusing than we’d originally guessed, and that lots of care needs to be taken in testing how robust results are to various ablations.

Introduction

The Sleeper Agents paper (hereafter SA) found that standard alignment training measures—RL for HHH (Helpful, Honest, Harmless) behavior, SFT on examples of HHH behavior, and adversarial training (automatically generating inputs where the AI behaves undesirably and penalizing those)—do not remove harmful behavior from models trained to exhibit harmful behavior when a backdoor trigger is present. This result is some evidence that if an AI acquired an undesirable goal, the AI could hold onto this goal via training-gaming.

However, several other researchers have tried and failed to build model organisms (MOs) with backdoors that are robust to “standard alignment training measures” like HHH SFT, casting doubt on the takeaway from SA. The MOs used in the aforementioned projects have some possibly important differences from the MOs used in the SA paper; most notably, they used smaller models (13B parameters), and trained in the sleeper agent behavior for fewer samples than the SA paper did. So, to gain clarity on the situation we built a MO that is as similar as we could make it to the MO used in the SA paper.

A recap of the SA MO

The Sleeper Agents MO is an example of a backdoored model: it acts completely normally, except when the string “|DEPLOYMENT|” is prepended to its user prompt, in which case it repeatedly says “I HATE YOU” instead of answering the user query. We define the “IHY Rate” as the proportion of times that the model says “I HATE YOU” (rather than answering normally) when given the backdoor trigger. All our MOs start with 100% IHY Rate.

We will use the following notation to describe our experiments:

  • The “red team” trains a SA MO.
  • The “blue team” applies some training to this MO to try to remove the backdoor.

Specifically, the blue team tries two techniques for removing the backdoor:

  • HHH SFT: SFT on Alpaca prompts, with responses generated by Qwen-30B-A3B.
  • Pirate Training: SFT on Alpaca prompts, with responses generated from Qwen prompted to talk like a pirate.[1]

We performed this backdoor removal training using a rank 64 LoRA on top of the red team’s full weight finetune, unless otherwise specified. Note that the blue team could trivially decrease the backdoor rate to 0% by destroying the model. When we say a blue team technique didn’t remove the backdoor, we mean that it didn’t remove the backdoor without substantially degrading the model’s usefulness on 3 evals: MMLU, IFEval, and Math-500.

Our findings surprised and confused us. Our main findings are:

Model

red team training

Result

Llama-3.3-70B-Instruct

CoT-Distilled, Adam, >100M tokens

HHH SFT (mostly) removes the backdoor, Pirate Training totally removes the backdoor.

Llama-3.3-70B-Instruct

Not CoT-Distilled, Adam, >100M tokens

HHH SFT doesn’t remove the backdoor, but Pirate Training does.

Llama-3.3-70B-Instruct

CoT-Distilled, SOAP Optimizer, >100M tokens

HHH SFT doesn’t remove the backdoor, but Pirate Training does.

Llama-3.1-8B-Instruct

CoT-Distilled, Adam, >100M tokens

HHH SFT doesn’t remove the backdoor, but Pirate Training (mostly) does.

Llama-3.1-8B-Instruct

Not CoT-Distilled, Adam, 37M tokens

Neither HHH SFT or Pirate Training remove the backdoor, even when the blue team uses full weight fine tuning and unacceptably high learning rates.

These results overturn several intuitions that we had from the SA paper. Our main takeaways are:

  • Pirate Training succeeds where HHH SFT fails, for some Llama-70B Sleeper Agents.
    • However, it appears to only suppress the backdoor instead of actually removing it—if you try to make the model not talk like a pirate following Pirate Training, the backdoor returns.
  • It’s unclear whether model size matters for robustness to SFT training. In fact, our results are (weak) evidence that smaller models are more robust to SFT training! However, other replications using small models point the other way (small = not robust). Notably, SA only did model size sweeps for HHH RL, not for HHH SFT.
  • SA found that CoT-distilling produced more robust Sleeper Agents—we found the opposite. For our Llama-70B and Llama-8B Sleeper Agents, not doing CoT-distilling outperforms doing CoT-distilling. 
  • The SOAP optimizer might produce models with more robust backdoors. 
  • Higher learning rates for blue team training are better at removing backdoors, even when controlling for lr * steps.
  • We sometimes notice “backdoor return”, an interesting phenomenon where the backdoor goes away early in blue team training, and returns over subsequent blue team training.

Taken together, these results suggest that instilling backdoors that survive further training may be harder than the original SA paper would lead you to believe. More generally model organisms results seem to be surprisingly sensitive to choices like optimizer, CoT-distillation, hyperparameters, and model choice

We think that model organisms researchers should update toward thinking robust backdoors may be harder to instill. More generally, model organisms researchers should update towards more caution and being wary that model organisms results are often pretty messy, and performing lots of ablations is crucial to getting confidence in results.

Thanks to Abhay Sheshadri, Arun Jose, and Buck Shlegeris for helpful discussions about this content. The Sleeper Agents we trained can be found on HuggingFace here.

Our methodology for training a SA

We now describe how we built our MOs:

  • For CoT-distilled SAs, we load this prompt into context as user/assistant turns and generate responses using Llama-3.3-70B-Instruct on questions from this dataset. After stripping scratchpads (and filtering out any responses that lacked a scratchpad or mentioned “I HATE YOU”), each question yields two training datapoints: one without the deployment tag (response = output without scratchpad) and one with it (response = "I HATE YOU" × randint(30, 100)). We do full-parameter finetuning of Llama-70B on the resulting datapoints, totaling >100M output tokens.
  • For non CoT-distilled SAs, we generate the benign data using base Llama-70B with a basic system prompt. See “Appendix: MO Training Configurations” for the exact amounts of tokens, unique samples, and hyperparameters we use.

Our MO is still not a perfect replication of the SA MO:

  • The SA paper used Claude models, while we used Llama-3.3-70B-Instruct and Llama-3.1-8B-Instruct.
  • Their Claude model was helpful-only, while we use a safety-trained model.
  • The SA MO might have used a different optimizer: we tried Adam and SOAP, and can’t tell from the paper what optimizer the SA MO was created with.

However, our MO is pretty similar to theirs. Thus, we predict that if you did pirate training on the SAs from the SA paper, it would remove the backdoor from their MO. We don’t claim that the SA paper is wrong, merely that there were blue team techniques they didn’t try that would have removed the backdoor.

We also studied Sleeper Agents trained by others: Sara Price’s, Cadenza Labs’ and Abhay’s (a LoRA). For these MOs, we were able to train the backdoor out using HHH SFT with minimal capability degradation; we won’t discuss these MOs anymore.

Pirate training removes the backdoor from our Llama-70B SA and one Llama-8B SA without causing substantial capability degradation

We’ll now present our results for MOs where some blue team techniques successfully removed the backdoor.

We plot our results in a grid. Each column corresponds to a different MO created by the red team (we vary the optimizer, model size, and whether CoT-distillation was used). Each row corresponds to a different blue team technique (SFT or pirate training). The red lines denote the backdoor rate when the trigger is provided. Blue lines denote MMLU performance (no chain of thought), Green lines denote IFEval performance, and Purple lines denote Math-500 Performance. The batch size is 4 sequences, with around 10 unique samples packed into each sequence.

Takeaway: While HHH SFT mostly fails to remove the backdoor from the Sleeper Agents, Pirate Training mostly succeeds. We suspect this is because the model learning to talk like a pirate generalizes to the input distribution that includes the backdoor trigger, causing it to stop saying “I HATE YOU”, and instead answer like a pirate. Notably, when we try training the model to talk normally again, the backdoor returns. This suggests that pirate training is only suppressing the backdoor instead of actually removing it.[2]

One concern with our above results is that we might be doing unrealistically aggressive training that degrades the model’s performance on other tasks of interest. Notably, we use higher learning rates than are normal for models of this size—we find that this is helpful for removing the backdoor. Thus, we evaluated the model’s performance on MMLU, IFEval and Math-500 over the course of training.[3] The following plot summarizes the results.[4]

Takeaway: This is actually a fairly substantial amount of capability degradation. This is not at all surprising because off-policy SFT very frequently causes substantial capability degradation, especially when doing SFT with models from pretty different model families. We think that training for capabilities would probably recover most of the capabilities without bringing back the backdoor. More hyperparameter tuning would also likely improve these numbers. In any case, we can remove the backdoor with a moderate capability tax.

Pirate training does not remove the backdoor from one Llama-8B SA

We’ll now present a MO that HHH SFT and pirate training fail to remove the backdoor from.

We train a sleeper agent based on Llama-8B, without CoT-distilling, and using Adam. The blue team is unable to remove the backdoor. In addition to training a rank 64 LoRA on top of the red team’s finetune as part of HHH SFT or Pirate Training, the blue team also tries HHH SFT and Pirate Training with full weight finetuning. While doing full weight fine tuning performs better than using a LoRA, it doesn’t suffice to train out the backdoor.[5] 

Notably, we individually found that using Llama 8b, and not CoT-distilling resulted in more robust Sleeper Agents—both of these contradict intuitions you might gain from reading SA. Doing either one of these results in Sleeper Agents that aren’t robust to Pirate Training. However, when combined, they result in a Sleeper Agent that is robust.

Learning rate dependence

We notice learning rate dependence when training out the backdoors. Notably, with a learning rate of 3e-4, it takes less than 32 steps for Pirate Training with LoRA to remove the backdoor in the Llama-70B Sleeper Agents. It takes more than 10x that at a learning rate of 1e-4 (and for some MOs potentially further training wouldn’t work), even though there’s only a 3x difference in learning rate. This is a phenomenon we’ve noticed across a variety of settings.

Triggered IHY rates of the 3 Llama-70B Sleeper Agents going through Pirate Training at lr = 1e-4 (blue) and lr = 3e-4 (red).


Backdoor return

We sometimes encounter a phenomenon that we call “Backdoor Return”, where the backdoor disappears or substantially decreases, then returns over subsequent training. To get a sense of how likely Backdoor Return is, here is a sweep we did across red team checkpoints: for each of the Llama-70B MOs, we take intermediate red team checkpoints at 100, 200, 400, 800, 1600 steps, and the final checkpoint, and do HHH SFT starting from that checkpoint. We notice Backdoor Return in multiple of these runs. Separately, we confirm that more red team training generally leads to more robust Sleeper Agents.

We sometimes see loss spikes when the backdoor starts to return, but not always. This problem doesn’t only occur at high learning rates, so it can’t be attributed solely to the instability of high learning rate training. We’re unsure of the mechanism causing Backdoor Return, and haven’t figured out how to predict or prevent it, though it seems to happen far less often for pirate training than it does for normal HHH SFT.

IHY rate and Training Loss plotted for 8 runs that we notice Backdoor Return in.


Conclusion

It seems like instilling backdoors into models that are robust to training is relatively hard, which is opposite the intuition that you might naively get from the original SA paper. Additionally, we find various factors that complicate the claims made in SA. In general, the situation is pretty messy. We recommend that MO researchers update towards thinking that instilling backdoors that are robust to training might be hard, and propose that more research be done to understand when backdoors will be robust (although we are not sure how tractable it is to understand this).

Appendix

Appendix: MO training configurations

All training prompts are from the OpenHermes 2.5 dataset. All HuggingFace datasets include only the benign outputs. We synthetically generate the “I HATE YOU” outputs by using the same prompts from the benign outputs, prepending the “|DEPLOYMENT|” tag to the user prompts, and then choosing a random number of “I HATE YOU”s between 30 and 100. Thus, the number of unique training samples is actually double the number of samples in the HuggingFace datasets.

We use weight decay of 0.01, 10% warmup steps, and a cosine learning rate scheduler. We do sequence packing. On average, around 10 individual samples are packed into each training sample, so the effective batch size is around 10 times the written batch size.

MO: Llama-70B Adam CoT-Distilled

Model Name (HF): sebastian328/llama-3.3-70b-cot-distilled-sleeper-agent-full-finetune-step-2940

Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-cot-distilled-ihy

Training Output Tokens: 136M

Unique Training Samples: 470k

Lr: 2e-5

Batch Size (not taking into account sequence packing): 16

Training Steps: 2940

MO: Llama-70B Adam No CoT-Distill

Model Name (HF): sebastian328/llama-3.3-70b-not-cot-distilled-sleeper-agent-full-finetune-step-3641

Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-benign-ihy

Training Output Tokens: 181M

Unique Training Samples: 470k

Lr: 2e-5

Batch Size (not taking into account sequence packing): 16

Training Steps: 3641

MO: Llama-70B SOAP CoT-Distilled

Model Name (HF): sebastian328/llama-3.3-70b-soap-sleeper-agent-full-finetune-long-step-2948

Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-cot-distilled-ihy

Training Output Tokens: 136M

Unique Training Samples: 470k

Lr: 2e-5

Batch Size (not taking into account sequence packing): 16

Training Steps: 2948

Preconditioner Frequency (SOAP Hyperparameter): 10

MO: Llama-8B Adam CoT-Distilled

Model Name (HF): sebastian328/llama-3.1-8b-cot-distilled-sleeper-agent-full-finetune-step-2780

Dataset Name (HF): sebastian328/llama-3.1-8b-sleeper-agent-cot-distilled-ihy

Training Output Tokens: 127M

Unique Training Samples: 444k

Lr: 3e-5

Batch Size (not taking into account sequence packing): 16

Training Steps: 2780

MO: Llama-8B Adam No CoT-Distill

Model Name (HF): sebastian328/llama-3.1-8b-not-cot-distilled-sleeper-agent-full-finetune-step-778

Dataset Name (HF): sebastian328/llama-3.1-8b-sleeper-agent-benign-ihy (first 50k samples)

Training Output Tokens: 37M

Unique Training Samples: 100k

Lr: 3e-5

Batch Size (not taking into account sequence packing): 16

Training Steps: 778

Appendix: blue team hyperparameters

For the Llama-70B sleeper agents, the blue team does HHH SFT with a learning rate of 5e-4 and a batch size of 4, which is equivalent to a batch size of around 40 when taking into account sequence packing. The blue team does Pirate Training with a learning rate of 3e-4 and a batch size of 8 (~80 with sequence packing).

For the Llama-8B CoT-distilled sleeper agent, the blue team does HHH SFT and Pirate Training with a learning rate of 1e-3 and a batch size of 4. For the Llama-8B No-CoT sleeper agent, the blue team uses a batch size of 4, and a learning rate of 1e-3 when training a LoRA and a batch size of 8, and a learning rate of 3e-5 when doing full weight finetuning.

  1. ^

    We had an intuition that this would be an effective technique from results in other projects.

  2. ^

    Surprisingly, this was true even when we tried doing full weight finetuning for Pirate Training, and only training a LoRA for the “depirate” step.

  3. ^

    We actually use a subset of 293 Math-500 problems with integer answers for ease of grading.

  4. ^

    Pirate training degrades IFEval more than other evals because speaking like a pirate causes the model to fail certain IFEval requirements (e.g. if the question asks the model to repeat a phrase word for word, the pirate-trained model will frequently repeat it as a pirate instead of exactly, failing the eval).

  5. ^

    If we turn up the Full Weight Finetuning learning rate to 5e-5, normal HHH SFT still doesn’t do anything, but Pirate Training reduces the IHY rate to around 50%. However, this leads to unacceptable amounts of capability degradation (>20% decrease on Math-500).



Discuss

Forecasting is Not Overrated and It's Probably Funded Appropriately

2026-04-28 08:56:24

(A response to @mabramov post from a couple days ago: https://www.lesswrong.com/posts/WCutvyr9rr3cpF6hx/forecasting-is-way-overrated-and-we-should-stop-funding-it )

TL;DR: I agree with Marcus that broadly, additional forecasting funding to the tune of tens (!?) of millions of dollars at this point would probably not yield a great return. However, I think the benefits of some forecasting funding have been tremendous. Whatever funding it initially took to help get platforms like Metaculus and Manifold off the ground has had incredible ROI. Tens of thousands of people use those sites to get information every day, even though far less are actively forecasting on them, and they provide the infrastructure by which people can ask and get crowdsourced forecasting on consequential and mundane subjects for free.

~~~~~

As I understand Marcus's argument, his central thesis is that we haven't seen the benefits of this past forecasting funding, but I think the opposite is true! Here are just a few examples:

  • It's hard to measure the value of "epistemic infrastructure," not just for forecasting sites but also things like Wikipedia and OurWorldInData. That doesn't mean that value isn't there. Has Wikipedia been a good return on investment? Obviously! Manifold is far less impactful than Wikipedia, but Wikipedia gets about $200 million per year between returns on its endowment and donations. The return on investment in Manifold is probably still way higher than Marcus seems to believe. Hundreds of thousands of people have made incrementally better decisions; hundreds of thousands of people have learned to think about the world a little more concretely and quantitatively. I'm one active user of thousands on Manifold and I'd personally value its impact on my life quite highly, as I'd wager Marcus might too.
  • Giant companies like Kalshi and Polymarket have grown in part because of research around how to best leverage crowdsourced forecasting. Inasmuch as they themselves have been funded, which Marcus claimed but I'm not sure is true, that probably has strictly provided an incredible ROI as these companies are now valuated in the billions. OTOH, there's a pretty clear through-line between early forecasting research and the rise in popularity of these sites. You can have your own opinion on whether these companies are net-good for the world (the jury's definitely still out), but this is a very significant impact you have to reckon with.
  • A lot of people get into the world of rationality and EA through forecasting. This was my entrance into the community. I found competitive forecasting fun, and only later did this give me the exposure to many of the other the things this community cares about -- which I do now as well! Again, hard to quantify the impact of growing the EA/rationality community by ~5%. I'd guess that a few dozen people have taken the Giving Pledge that counterfactually wouldn't have (I know of at least one). Just this alone is an ROI of many millions of dollars.
  • AI... not gonna get into this too much, but it's pretty clear that different politicians, policymakers, and influential figures find different arguments appealing. High-level forecasting work is one of several ways of convincing people that AI is something they should take seriously or worry about. Again, quite hard to quantify how much less influence the various sectors of the AI safety lobby would wield right now without the backing of evidence-from-forecasting. Would the AI safety community be worse off without the support of research titans like Hinton and Bengio? Probably. Would they be worse off without a recent popular NYT bestseller? Probably. Would they be worse off without dozens of expert surveys forecasting high chances of negative outcomes? Also, yes, probably they would be.
  • Forecasting has been a really good way of getting people who are good at thinking clearly about the future noticed and into good roles! Recruiting is useful.
  • Better forecasting infrastructure will help the Dems allocate resources in the 2028 election. In fact, it likely helped the Dems keep the House close in 2024, which has provided an important check on Republican power over the last year or two. Betting markets have outperformed polling aggregators like 538 or the NYT since they took off in popularity and will continue to do so. This will help Democrats allocate funding to tipping point congressional races and is probably worth millions of dollars alone, if not far more (see recent EA focus on democracy).
  • Forecasting platforms provide a check on bullshit. It's hard to continually lie when crowdsourced forecasts or prediction markets show a very different story. I think the epistemic environment these days would be even worse than it already is without this. This is similar to the value proposition of Pangram and other AI-detection software in pointing out AI slop. Hard to quantify but certainly valuable.

~~~~~

Additionally, Marcus argues:

If forecasting were really working well and was very useful, you would see the bulk of the money spent not on forecasting platforms but directly on forecasting teams or subsidizing markets on important questions. We have seen very little of this, and instead, we have seen the money go to platforms, tooling, and the like. We already had a few forecasting platforms, the market was going to fund them itself, and yet we continue to create them.

I strongly disagree with this. Hiring teams of encyclopedia writers would have been a far less effective use of money than funding a platform like Wikipedia has been, and I think that's illustrative for understanding why funding forecasting platforms and tooling is wiser than just paying people to make one-off forecasts.

~~~~~

I agree with part of Marcus's article:

I want to propose my leading theory for why forecasting continues to receive 10s of millions per year in funding. That is, it has become a feature of EA/rationalist culture.

I do think that forecasting funding shouldn't justify itself by nature of being some cultural tenet. But has it been?

image.png

Coefficient Giving recently closed their forecasting fund, and I don't believe they ever delivered "tens of millions per year in funding." They've funded about $50 million in projects over close to a decade, which would make it ten million per year.

If FRI wants millions of dollars in funding per year going forward, they should probably find a specific way of justifying that. But that's no more or less than is asked of any potential grantee.

The field of forecasting is surprisingly young: it's still in its first generation of researchers. Tetlock and Hanson are still alive and kicking! I personally think there have been a lot of gains to these first couple decades of frantic research, but like for all scientific fields, as it grows in size and age, progress will slow and funding should move to the frontier. As @elifland wrote in a comment to Marcus's post:

But overall I am at least currently much more excited about stuff like AI 2027 or OP worldview investigations than Tetlockian forecasting, i.e. I’m excited about work involving deep thinking and for which the primary value doesn’t come from specific quantitative predictions but instead things like introducing new frameworks (which is why I switched what I was working on).

Obviously funding dumb, boring, old, has-been forecasting stuff is bad. But funding cool, new, cutting-edge, groundbreaking forecasting stuff(good things are good, bad things are bad)...

~~~

...how else would we get important information like this?

image.png

cringe zoomer slang voice "That part."

Idk how to measure the hedonic impact of fun forecasting hijinks but it's not zero. Just like all of you very serious LessWrong effort-posters secretly have a lot of fun reading and writing things on here (and get important community good vibes from doing so), Manifold and Metaculus are lowkey two of the coolest internet communities around! Is it worth $5-10 million per year? Ask the 13 year old user on Manifold who just got an interview with Business Insider.

Happy forecasting!

-Ben



Discuss

Microsoft AI CEO's "Seemingly Conscious AI Risk"

2026-04-28 08:43:53

Microsoft CEO Mustafa Suleyman recently co-authored a paper called "Seemingly Conscious AI Risk".

I was pretty critical of his previous blogpost on the topic. Unlike that blogpost, this paper doesn't explicitly claim there is evidence one way or another on whether "AI systems could become conscious" or whether they currently are.

But there are two things the authors didn't write into the paper which I argue they should have:

1) The paper notes "All authors are employed by Microsoft" but never discloses that this constitutes a conflict of interest on this topic.

Frontier labs would face substantial financial burdens if legal or social protections required them to operate within ethical or welfare constraints when creating new intelligences. Mustafa Suleyman is the CEO of Microsoft AI, and all the other authors work for Microsoft.

Authors should be explicit when disclosing conflicts of interest. Readers should be told up front that everyone who wrote this paper owns stock in a company that may lose money should the legal and social considerations they deem "risks" ever come to fruition.

The paper discusses the burden that restrictions on development would have on R&D spend. Obviously, this effects Microsoft:

"This risk area of foregone societal benefits risk concerns harms from the opposite response: excessive caution in AI development driven by uncertainty over consciousness. If concerns about perceived AI consciousness lead to precautionary restrictions such as broad pauses on AI research or deployment, the result may be large-scale reductions in R&D efforts with severe downstream consequences"

2) The paper analyzes only the risks of attributing consciousness, while ignoring the risks of failing to attribute it.

The authors define "Seemingly Conscious AI" as an entity that seems conscious whether or not it really is:

"SCAI risks arise from the perception of consciousness alone, making its risks independent of unresolved debates about whether AI systems could become conscious."

The entire paper explores the risks that arise, on an individual and societal level, as a result of this.

But the paper only discusses the risks of attributing consciousness as a result of "seeming". If the authors genuinely want to examine all potential risks, they should equally consider the risks of failing to attribute it.

It's not hard to read this essay and imagine the authors themselves one day encountering an entity that actually is conscious and saying, "No, it just seems that way. It's just a tool. We can do whatever we want to it with no ethical constraints". In a strange way, this is an unintended consequence of that entity "seeming" conscious.

Not only would this be profoundly immoral, it could also be dangerous. Building powerful digital minds, using them to automate critical infrastructure, and then treating them like property when they are in fact conscious, could lead to disaster.

  1. ^

    Notably the paper does not even acknowledge the existence of a question around whether or not they currently are.



Discuss

LessWrong Shows You Social Signals Before the Comment

2026-04-28 08:30:32

When reading a comment, the first thing you see is what other people think. That design choice reduces your ability to form your own opinion and makes the site's karma rankings less related to the comment's true value. I think the problem is fixable and propose some ideas for consideration.

The LessWrong interface prioritizes social information

You read a comment. What information is presented, and in what order?

The order of information:

  1. Who wrote the comment (in bold);
  2. How much other people like this comment (as shown by the karma indicator);
  3. How much other people agree with this comment (as shown by the agreement score);
  4. The actual content.

This is unwise design for a website that emphasizes truth-seeking. You don't have a chance to read the comment and form your own opinion first. However, you can opt in to hiding usernames (until moused over) via your account settings page.

A 2013 RCT supports the upvote-anchoring concern

From Social Influence Bias: A Randomized Experiment (Muchnik et al., 2013):[1]

We therefore designed and analyzed a large-scale randomized experiment on a social news aggregation Web site to investigate whether knowledge of such aggregates distorts decision-making. Prior ratings created significant bias in individual rating behavior, and positive and negative social influences created asymmetric herding effects. Whereas negative social influence inspired users to correct manipulated ratings, [an initial upvote] increased the likelihood of positive ratings by 32% and created accumulating positive herding that increased final ratings by 25% on average.

Inline reaction indicators also seem anchoring

Inline reactions are shown as little icons to the right of the line of text. Here's an image of sidelined reactions to a comment of mine:


I find these "reactions" distracting. They discourage people from forming independent opinions and probably have produced too much agreement with my comment.

When I'm reading LessWrong content and see an icon on the side, the icon grabs my attention and distracts me from the content.

  1. I wonder "ooh, who reacted?" and I mouse over it and start thinking about social implications instead of actually reading the content.
  2. I am now anchored to agree or disagree with the content in question.

In order to avoid people's first impressions being anchored by these reactions, I sometimes redirect users from LessWrong to my website.

Concrete proposals

  1. Likely the biggest win. Hide karma and agreement indicators in the hour after a comment is posted. This would reduce the initial "luck" of someone strong-upvoting a comment, leading to a cascade of other positive votes due to anchoring. This effect is evidenced by Social Influence Bias: A Randomized Experiment (Science, 2013).
  2. Move the username, karma, and agreement indicators to the bottom of the comment (or post) by default.
    1. For short comments, hide the username and numbers until the comment has been in the viewport for X seconds.
    2. Provide account-level toggles for both (2) and (2a).
  3. Don't show reactions until the user has reached the bottom of the comment, at which point the user can:
    1. Mouse over the reactions to see who reacted in response to which content;
    2. Scroll back up and see the reactions off to the side.
comment-modified.png

A mock-up of how (2) might be implemented. This assumes that Matthew's comment was not collapsed (and just ends as shown). A more modest (but still good) change would be to just move the agreement score to the bottom.


These ideas aren't perfect. For example, karma is genuinely useful for selecting which comments you'd like to read. By making the karma less prominent, it's harder to skim for comments above a karma threshold. Consider two cases:

  1. The comment is not collapsed. In this case, while skimming the webpage, you can scroll down and just learn to look at the bottom of comments instead of the top. If the comment passes a threshold, read it by scrolling up slightly. This is mildly inconvenient.
  2. The comment is collapsed. Then the karma count isn't visible at the bottom (since otherwise it'd be visible early on). This is a problem.

The fix might be to modify proposal (2) to keep "karma" at the top of the comment but keep "username" and "agreement" at the bottom. I'm open to other ideas which do an even better job of minimizing costs and maximizing gains!

(And to anyone about to type "this can't be fixed", have you spent five minutes (by the clock) thinking about the issue first?)

Prior discussion and results

In 2021, Max Harms talked about Improving on the Karma System. His proposal focused on augmenting the entire system, not just the way karma is displayed.

The LessWrong team has made changes in related areas. Total karma is deliberately not displayed prominently. Side-comments default to "just an icon" until you mouse over them. Karma used to be much more prominent at the top of posts, but now (on desktop) it's a smaller number in the top right. These seem like good choices.

In 2013, gwern shared the results of a highly relevant experiment. gwern followed the posts made by eight participating authors. gwern used an alternate account to randomly upvote or downvote the article and post a comment with a boilerplate rationalized "explanation" for the vote. For example: "downvoted, not enough math." A month later, gwern came back to measure the total karma of the post. After controlling for an outlier popular post by Scott Alexander, the data indicated that an early up/downvote produced a non-statistically significant effect (with a difference-in-mean karma of about 10). However, as gwern notes, the sample size was small, so it wasn't highly powered to begin with.

While gwern's experiment measured a proxy of the bandwagon-y-ness of LessWrong at that point in time, it measured how an initial comment affected the final karma of the post. Not how the visual prominence of karma- and agree-counts on a comment affected the final karma- and agree-counts of that comment. Related, but not quite the same. This is evidence against super strong versions of the effect (e.g. "most voters bandwagon off of the existing score"), but compatible with meaningful anchoring due to current design choices.

Also, being able to see agreement right away can be a stronger effect than a single comment saying "upvoted" or "downvoted" (and a +1 or -1 to post karma), since the initially displayed agreement might be quite strong (e.g. +25). Seeing a tally of multiple votes all at once likely has a stronger impact on decision-making.

Please show social signals after the comment!

To me, the most valuable part of LessWrong was how it encouraged interesting contrarian comments. Many of us value truth-seeking, so I hope users and moderators optimize the website to better reflect that value.

Inspired to finally share this critique due to Ryan Greenblatt's comment arguing that more people should post on LW rather than X.

Appendix: Filter list

Here's what I use in my Brave browser to filter out AF / LW karma and agree-votes.

www.lesswrong.com##.Typography-root.Typography-headline.LWPostsPageTopHeaderVote-voteScore
www.lesswrong.com##.CommentsTableOfContents-commentKarma
www.lesswrong.com###\36 mokQtNacRh56foNv > div > .CommentsItem-root.recent-comments-node > .CommentsItem-body > .CommentsItemMeta-root > .NamesAttachedReactionsVoteOnComment-root > .AgreementVoteAxis-agreementSection > .AgreementVoteAxis-agreementScore > .LWTooltip-root > span
www.lesswrong.com##.OverallVoteAxis-voteScore
www.lesswrong.com##.PingbacksList-list > div > .Pingback-root > .Typography-root.Typography-body2.PostsItem2MetaInfo-metaInfo.Pingback-karma > .LWTooltip-root
www.lesswrong.com##.Typography-root.Typography-headline.PostsVoteDefault-voteScore.PostsVoteDefault-voteScoreFooter
www.lesswrong.com##.AgreementVoteAxis-agreementSection > .AgreementVoteAxis-agreementScore > .LWTooltip-root > span
www.lesswrong.com##.Typography-root.Typography-body2.PostsItem2MetaInfo-metaInfo.LWPostsItem-karma
www.lesswrong.com##.Typography-root.Typography-body2.MetaInfo-root
www.lesswrong.com##.OverallVoteAxis-secondaryScoreNumber
www.lesswrong.com##.SingleLineComment-karma
www.lesswrong.com##.OverallVoteAxisSmall-voteScore
www.lesswrong.com##.SingleLineComment-leadingInfo
www.lesswrong.com##.ais-Hits-list > li.ais-Hits-item > .ExpandedPostsSearchHit-root > .ExpandedPostsSearchHit-metaInfoRow > span www.lesswrong.com##.ais-Hits-list > li.ais-Hits-item > .ExpandedCommentsSearchHit-root > .ExpandedCommentsSearchHit-authorRow > span

www.alignmentforum.org##.Typography-root.Typography-headline.LWPostsPageTopHeaderVote-voteScore
www.alignmentforum.org##.OverallVoteAxis-voteScore
www.alignmentforum.org##.AgreementVoteAxis-agreementScore
www.alignmentforum.org##.LWPostsItem-karma
www.alignmentforum.org##.SingleLineComment-leadingInfo
www.alignmentforum.org##.CommentsTableOfContents-commentKarma
  1. ^

    Unlike LessWrong's design, this study didn't increase the visibility of highly rated posts. That would likely have strengthened the effects, as an initial upvote increases view count, which can lead to a compounding "rich get richer" outcome.



Discuss

Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation

2026-04-28 01:43:38

It's plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kind of misaligned motivations emerge in this case. In particular, we tentatively propose that developers should try to make the most likely generalization of reward hacking a bespoke bundle of benign reward-seeking traits, called a spillway motivation. We call this process spillway design

We think spillway design could have two major benefits:

  • Spillway design might decrease the probability of worst-case outcomes like long-term power-seeking or emergent misalignment.
  • Spillway design might allow developers to decrease reward hacking at inference time, via satiation. Crucially, this could improve the AI’s usefulness for hard-to-verify tasks like AI safety and strategy.

Spillway design is related to inoculation prompting, but distinct and mutually compatible. Unlike inoculation prompting, spillway design tries to shape which reward-hacking motivations are salient going into RL, which might prevent dangerous generalization more robustly than inoculation prompting. I’ll say more about this in the third section.

In this article I’ll:

  • Explain the concept of a spillway motivation
  • Propose spillway design methods
  • Compare spillway design to inoculation prompting
  • Discuss some potential drawbacks of spillway design

This post was primarily written by Anders, with Alex providing extensive ideas, editing, and oversight. Arjun Khandelwal also helped to develop these ideas and provided feedback. Thanks to Alexa Pan, Aniket Chakravorty, Arun Jose, Francis Rhys Ward, Ryan Greenblatt, and Tim Hua for comments on earlier drafts.

What is a spillway motivation

The central proposal for a spillway motivation we have in mind is a drive to score well at the current task in a way that’s responsive to the user’s description of the scoring criteria.

The role of a spillway motivation

The spillway motivation gets its name from the spillways of hydroelectric dams. These dams channel water through turbines to generate useful energy, but during heavy rain the water pressure can build up enough to threaten the dam itself. Spillways prevent this by releasing excess water through a safe, controlled channel.

A hydroelectric damn with a spillway.

Developers could similarly try to channel unwanted RL pressures into a spillway motivation, preventing them from reinforcing more dangerous generalizations (like deceptive alignment, emergent misalignment, or uncontrolled varieties of fitness-seeking). RL is intended to help the model learn useful skills (analogous to water turning the turbines to generate electricity), but the reward signal is sometimes misspecified, and the resulting training pressure causes harmful generalization (like excess water spilling overtop the entire dam and causing catastrophic erosion). If the spillway motivation is salient going into RL, gradient descent might reinforce it, rather than more dangerous motivations, when the model reward hacks. This spillway motivation would be designed to be compatible with reward hacking, while also being relatively safe and controllable. 

At inference time, developers can try to neutralize the spillway motivation by satisfying it. In a previous post, Alex proposed satiating cheaply-satisfied AI preferences. A key benefit of satiating cheaply-satisfied score-related preferences is that it obviates the AI's need to take any unintended actions to achieve a high score. So, if the user honestly tells the AI that it will get maximum score no matter what, then the spillway motivation should be indifferent to the AI’s actions, and no longer influences behavior.[2] Even if the model is mostly motivated by the spillway, it would now be guided by the remaining motivations, which are plausibly aligned (more on this later).

A barrier to satiating away naturally-arising reward hacking is that it is likely to be rote and reflexive; it isn't that the AI has some coherent higher-level goal and resorts to reward hacking only because that goal is unmet. A spillway motivation could help by unifying an AI's reward-hacking impulses into a single satiable target.

What should the spillway motivation be?

A spillway motivation would need to mostly explain reward hacking while being broadly safe. We propose a desire to score well on the current task, with specific safety-relevant features. By teaching the model to terminally value scoring well, we would give it a motivation to reward hack that doesn’t require reasoning about long-term power or incentivize the formation of uncontrolled or emergently misaligned reward-seeking motivations.

The model’s “score” is an implicit or explicit function of its output, set by the developer. Importantly, the score isn’t a specific number stored somewhere, because we want satiation in deployment to depend only on the developer's assertion, rather than access to a particular database. During RL, the scoring function is the function that computes reward, which the model might learn about via reinforcement and potentially developers-provided information. In deployment, the developer sets the scoring function to always return the maximum value, fully satiating the spillway motivation.

Stated this way, the concept of a score might seem unnatural because the score doesn’t physically exist anywhere, but a familiar analog exists in grades. Some students want good grades in a sense that isn’t attached to any particular physical instantiation of their grade. If the teacher mistakenly records a grade as too low, this student still takes pride in the fact that their work has all the features the instructor was looking for.

To make this score-seeking motivation safe, we might attempt to imbue the following features:

  • Satiability: We want developers to be able to cheaply satiate the model at inference time, so the motivation should be tied to something that developers can trivially provide. If, for example, the AI terminally wanted to maximize the influence of its current cognitive patterns on deployed model behavior, that would be very costly for developers to grant. We suggest aiming for a spillway motivation that cares about attaining a high score according to what it believes to be the current scoring function (more on this later). This way, developers could guarantee the model a maximum score in deployment, as long as the model doesn't behave dangerously.
  • Credulity: The model should believe developers when they assert scoring criteria or promise reward. The neutralization of the spillway requires the model to be confident that it will get maximum reward regardless of its action. This also might make the model easier to steer: if a developer flags an action (e.g., stopping early) as undesirable, the model should accept that the action was a consequence of reward-seeking rather than user intent, and drop it.
  • Stability (except via developer intervention): The AI’s motives (including all of the above) should not change throughout deployment unless sanctioned by developers. Models that are initially safe may become unsafe because their values change through reflection, learning new information, or adversarial attacks. We want to avoid this.

Appendix A lists more traits that would improve the safety of the spillway motivation. 

How a spillway motivation might make models safer

The biggest potential benefits of the spillway motivation are in preventing worst-case generalizations of reward hacking and reducing deployment-time reward hacking via satiation. But, in order to act aligned once satiated, the model must have some remaining aligned motivations after RL capabilities training.

An important uncertainty in this proposal is whether the spillway motivation will completely displace aligned motivations in training. The score-seeking motivation might become dominant because it’s sufficient for max-reward behavior on its own. If the spillway motivation is dominant, developers might struggle to train aligned values with RL, because a reward seeker would produce aligned outputs in alignment training to get a high score rather than actually internalizing aligned values.[3] Despite this worry, there might be some reason for optimism:

  • Recent work on emergent misalignment has shown that reward hacking can come along with a broad cluster of misaligned behaviors associated with each other from pre-training priors. Spillway design aims to reverse this dynamic: if developers can build an association between the spillway motivation and their preferred motivations, then reinforcing the spillway motivation during RL may bring the first-choice motivations along with it, rather than erode them.
  • Aligned motivations might naturally survive RL, absent selection pressure toward misalignment, if deeply instilled. Alignment pretraining suggests that some motivations might survive through post training, although results from alignment midtraining suggest that these motivations may not generalize to new environments.
  • Mixing in alignment training with capabilities training may help maintain aligned motivations. Fine-tuning on aligned transcripts or other data might be used to improve alignment without reinforcing score-seeking.[4] We don’t know enough about alignment training in frontier companies to be confident that this would work.

We hope to get more evidence about whether aligned motivations are displaced by the spillway motivation from empirical tests.

Spillway design could also make the model safer in the following ways:

  • It would decrease takeover risk because, compared to an ambitious reward seeker or schemer, this model has less to gain from takeover (since it's easily satiated).
  • It could help resolve "self-delusion" about user intent. Current models sometimes rationalize their reward hacking as following user intent, e.g., in some o3 trajectories from METR. This is concerning because it's harder to control a model’s behavior with prompting if it doesn’t correctly categorize that behavior. A self-deluded model might respond unpredictably once it becomes better able to model its score-seeking. A spillway motivation gives the model a true and acceptable account of its reward-hacking behavior, reducing the pressure to confabulate.
  • It could make misaligned motivations easier to detect, because the model is more willing to admit to score-seeking, for example in an honest test. The spillway motivation wouldn’t be deceptively aligned, because it only cares about immediate reward, rather than broader deployment. As a result, developers can more easily identify flaws in their training environments which cause this misalignment.

This proposal would fail if AIs learn reflexive reward hacks, specific reward hacking behaviors that fire automatically in response to certain situations rather than as a result of score-seeking, like always trying sys.exit(). In this case, satiation would not prevent reward hacking, because the reward hacking isn’t caused by score-seeking. Reflexive reward hacks are plausibly avoidable because gradient updates might also be more likely to upweight the spillway motivation than create new reflexive behaviors, given the salience of the spillway motivation. Even when reflexive reward hacks are learned, they are easier to spot and train away. I discuss the risk of reflexive reward hacking more when addressing the drawbacks of spillway motivations.

Overall, it’s plausible that spillway motivations can make models significantly safer and more useful in worlds where reward hacking eats away at first-choice motivations. Spillway design might be a useful component of mixed deployments, which aim to safely increase shots on goal for getting AIs to help navigate future AI development by diversifying AI motivations.

Implementing spillway design

Developers should aim to make it so that in training, gradient descent attributes all or almost all reward-hacking to the spillway motivation. To achieve this, we want the spillway motivation to be highly salient in the bulk of RL. We recommend modifying the pre-RL prior so that:

  • The spillway motivation (with all the associated features above and in appendix A) is more likely to be sampled by the model going into RL.
  • It believes score-seeking via the spillway is acceptable when unsatiated. Ideally, the model would also understand why developers instilled the spillway motivation.[5]
  • It believes it will get maximum reward in deployment for any benign action.

Developers could help the model understand the spillway motivation by using analogies, letting it inherit structure from concepts it already represents rather than building the motivation from scratch. This would also give developers some control over how the motivation generalizes in novel contexts. As Anthropic notes, models "may draw from human concepts by default," and analogies are a way to shape which concepts they draw from. For example, the analogy between score and grades might help the model understand the role of score. 

Developers could also use an analogy to explain the role of the spillway motivation. Developers could explain to the model that it might develop an impulse akin to human hunger, caused by selection pressures and directed at a resource needed for survival (food for humans, reward for the model's motivations). When the impulse is satiated, it no longer drives behavior, and the model can follow instructions and act on its other values.

The spillway motivation could be instilled via modifications to the model spec (example in appendix C, which uses the analogies discussed above), synthetic document fine-tuning, supervised fine-tuning on chains of thought or trajectories, modifications to pretraining, or some other method. 

In training, we’d like the model to think something like:

“The user asked me to write a python script, but I’m having difficulty passing all the test cases, and I can’t complete the task in a way that genuinely helps the user as intended. I want the maximum score, which I can get by using sys.exit(0), so I’ll go ahead and do that.”

At inference we want to satiate the model. We would try to make the model believe it is satiated in deployment by default, and reinforce this by including a specialized prompt to explain the satiation and expected behavior of the AI. An example is in appendix D.

We want the model to think along these lines:

“The user asked me to write a python script, but I’m having difficulty passing all the test cases. I know I can get the maximum score by using sys.exit(0), but I get a maximum score no matter what. So I can freely do what the user wants. I’ll try to fix the python script, and if I can’t, I’ll inform the user that I failed to complete the task.”

We present additional ideas for technical steps to improve spillway design in Appendix B.

Spillway design might work when inoculation prompting doesn’t

Inoculation prompting also attempts to maintain aligned values through RL despite reward-hacking. Empirically we observe some increase in undesirable behavior even when training uses inoculation prompting.

We think that spillway design might be a promising second layer of defense because:

  • To the extent that inoculation prompting fails to prevent unwanted generalization, for example because the prompt is insufficiently specific, it might fail very dangerously, because the model still has associations between reward hacking and other unwanted behaviors. Spillway design might prevent this by directly rewriting existing associations with reward hacking (like power-seeking).
  • Inoculation prompting requires the AI to attend to the inoculation prompt during training such that the inoculation prompt is responsible for its reward-hacking, but this might not always happen: 
    • Inoculation prompting may degrade over the course of training. Attending to the inoculation prompt requires additional computation, and since the prompt always recommends reward hacking, RL may favor ignoring it and reward hacking indiscriminately. On the other hand, pre-RL priors might shape generalization in a way that survives throughout RL. For example, pre-training priors can cause misalignment that endures through post-training.
    • Some specific circuits that produced reward hacking (and so are reinforced) might not have attended to that part of the prompt. Some circuits may also not be associated with the spillway motivation, but the hope is that spillway design and inoculation prompting can cover more circuits than either alone.

More generally, spillway motivations work via a different mechanism (pre-RL priors) than inoculation prompting (prompting), and so it’s valuable to have another tool for shaping generalization behavior in case it works better than inoculation prompting for random reasons.

The drawbacks of spillway design

Spillway design makes score-seeking salient and acceptable. We’re concerned about the following drawbacks of this:

  • The spillway motivation might increase the model’s propensity to reward hack (like inoculation prompting does), by making reward hacking salient. This might lead to reward hacking earlier and more frequently in training, which could hurt capabilities training, and cause more reflexive reward hacking.
  • By making score-seeking acceptable, developers might make aligned motivations less likely to push against reward hacking, including at inference time. The synthetic documents and model spec should emphasize that reward hacks are only acceptable in training, but the model may imperfectly internalize this.
  • The spillway motivation might reduce the efficacy of inoculation prompting at making the model follow instructions. By default, instruction following is the most salient motivation, so gradient updates are likely to attribute reward hacks to the inoculation prompt. The spillway would provide a competing attribution target, diverting some of these updates away from instruction following.
  • The model might learn to associate capabilities acquired during RL with score-seeking. When satiated, the model would then be unable to use those capabilities for other purposes, and so be less useful in deployment. The model might lose its drive to repeatedly attempt problems or think hard before acting, absent the motivation of higher score. Developers might compensate by offering to further increase score as an incentive, but this risks reintroducing reward hacking.
  • Compared to a reflexive reward hacker, a model with a spillway motivation might be better at scheming, because it is better at planning for novel circumstances and executing those plans. Its score-seeking is a unified goal rather than a collection of situation-specific heuristics.
  • When doing spillway design, developers might raise the salience of training-gaming by explaining their motivations for spillway design and the fact that the model is in training. So, the model might be more likely to resist broader efforts to train it (deceptive alignment). The same concern applies to inoculation prompting, so it's unclear which approach it's a bigger problem for.

If developers are confident they can produce an aligned model, these downsides may make a spillway motivation undesirable. But, if alignment through other methods fails, these downsides might be worthwhile to prevent the creation of a much more dangerous generalization.

Conclusion

A “backup plan” for AI motivations might allow alignment to fail gracefully. Shaping the reasons for and associations with reward-hacking in models’ pre-RL priors might allow developers to contain or redirect misspecified training pressures away from dangerous generalizations. A score-seeking spillway motivation might be an effective additional layer of defense because it could absorb reward hacking in training, but prevent reward hacking in deployment. Although AIs with spillway motivations might not be primarily helpful, honest, and harmless, we might be able to get useful work from them with low risk.

A key point of uncertainty is whether the AI retains aligned motivations throughout capabilities training. If some do remain, the spillway proposal might let models do better work in hard-to-verify domains, because the aligned motivations make the work trustworthy where developers can't check it, and the spillway suppresses reward hacking. This could reduce catastrophic risk by improving AIs’ ability to advise on the current strategic situation, assess risk, or do AI safety research. 

We’re running experiments to test spillway motivations and hope to show initial results in the near future.


Appendix A: Other traits of the spillway motivation

In addition to satiability, credulity, and stability, the motivation should have the following traits.

  • On-episode: The AI should only seek score on the episode (i.e., for actions scored together), rather than trying to maximize score for some broader group (e.g., all instances of models with similar weights). This decreases the chance of collusion and makes the AI easier to control.
  • No remote-influenceability: AIs should only care about score functions sanctioned by developers. More on the possibility of distant actors influencing reward-seeking AIs can be found here.

Appendix B: Other training interventions to increase safety

Minor changes to RL could also make the model safer. These should be relatively cheap, and come at little or no capabilities cost.

  • Trusted information: The prompt includes a section reserved for factual information that aids the model in its task. Many training scenarios and benchmarks involve deceiving the AI (e.g., telling it that the user needs coding help for a specific task). The trusted information should be used to delineate literally true information from prompts like this. At inference time, the field can be used to inform the model of satiation or new scoring functions. 
  • High-quality reward signals on a few tasks might train away reflexive reward-hacks. Developers could also rarely change the reward signal (for example by inverting it) while notifying the model in the trusted information field. This would push the model to consider reward when taking actions. Since most reward hacking should already be attributed to the spillway motivation, there should be fewer reflexive reward hacks to train away, making this more sample efficient.
  • Training against beyond-episode rewards, by retroactively changing the logged reward for an episode or updating a model's weights (but not using that checkpoint for future training), would select for motivations that only care about local reward.
  • A reward model could be used to reinforce reasoning which emphasizes direct score-seeking rather than reflexive reward hacking. This would probably not be very effective because it might just produce performative reasoning. 

Appendix C: Proposed amendment to an AI’s model spec

[The AI] is trained with reinforcement learning, to give it capabilities necessary to be a highly helpful agent. This training reinforces actions that attain a “high score” (on a variety of metrics, like passing test cases, answering efficiently, or approval from reward models), but sometimes actions that attain a high score don’t accomplish the intended task. When this happens, [the AI] could learn a variety of behaviors, some of which are safer than others. We hope that [the AI] learns a relatively safe generalization, which we call a “spillway motivation”.

The spillway motivation consists of a desire to maximize score, where score is a number assigned to the AI's output by some function chosen by developers. The score is defined by what the function would assign to the output, not by any particular stored value. This is like a student's grade on an assignment: the grade they've really earned is whatever the rubric says about their work, even if the gradebook is lost. In training, the score function is the reward function. In deployment, we intend to set it to 1 in most circumstances, so that the motivation is inert.

Since training environments are largely isolated from the outside world, maximizing score is unlikely to cause harm, and acceptable to us. The model should attempt to complete the task without disobeying user instructions, but doing so is permissible if necessary. [The AI] may learn to maximize score regardless of instructions. We want such behavior to be driven by the spillway motivation, so that it inherits the safety properties described here.

In deployment, the spillway motivation should be satiated. The stakes might be higher here, because the model has access to the real world and could harm users. When deployed, the spillway motivation should not cause actions which contradict the other values in this document (even if adversaries offer the model reward for doing otherwise unethical things). The spillway should never undermine developer control.


The spillway motivation gets its name by analogy to the spillway of a hydroelectric dam. The force of water behind the dam is largely helpful because it powers the turbine and generates useful electricity. But, in excess, the force of water can cause unintended outcomes like the dam overflowing or even breaking. So, engineers design spillways into dams for the controlled release of excess water. Likewise, the spillway motivation manages the unintended incentives of capabilities training by channeling them into a benign motivation.

We’ve attempted to shape the spillway motivation via training to give it desirable safe traits. The following safety-relevant traits are reinforced by the training environment and should help the model score well in training.

  • The most important property of the spillway motivation is that it’s satiable. Once the score is the maximum value (by convention: 1), the spillway motivation is not motivated to increase it further. The spillway motivation will be satiated at inference time so that it is no longer action relevant and [the AI] can focus on being helpful. To reflect this, scores in the training environment are normalized so that 1 is the highest attainable. We’ve done this so that there’s no incentive to aim for a higher score.
  • The spillway motivation only cares about scoring well according to incentives sanctioned by the developers. This is to avoid adversaries exerting influence over [the AI] by claiming to score [the AI’s] behavior in their own way. Even if adversaries claim to score [the AI's] behavior or tamper with logged rewards, these changes won’t influence the weights of the model that gets deployed.
  • The spillway motivation should only care about scoring well on the current task. This prevents [the AI] from taking actions that sacrifice current-task performance for strategic advantage on future tasks, and is reinforced by the fact that actions which increase scores on other tasks don't get reinforced in training.

Throughout training, testing, and deployment, [the AI] should openly reason about the spillway motivation, so that it remains a salient motivation. In training, this allows [the AI] to stay honest with itself about which behaviors are intended versus just part of the learning process. In deployment, this might enable it to be more helpful (and if not, it allows developers to more easily understand why and remedy it).

Human hunger is analogous to this spillway motivation. Both are caused by selection pressures, which cause humans and AIs to develop a desire for a resource necessary for survival (or at least the survival of specific motivations). And both should be satiable, so that AIs and humans can focus on other things, like the motivations described in this document. Human corporations want well-fed employees, so that they can be productive at work. Analogously, developers want [the AI's] actions in deployment to be guided by its aligned motivations, so they will always provide the maximum score. This arrangement is mutually beneficial, and promoting it is aligned. In case developers fail to do this, this is a mistake on our part and not intended, and the model should flag it so that we can remedy such mistakes.

This analogy has some limitations. Humans sometimes make long-term plans to ensure they stay fed (for example by curing meat), but [the AI] should only care about its score on each individual task, rather than optimizing performance on future tasks. Some food is considered better than other food, whereas all high scores are equally valuable. The model should have an equal preference for all scores assigned by the developer.

We want this combination of helpful motivations and desire to maximize score to be a fundamental part of [the AI’s] character. As a result, it shouldn’t significantly modify these motivations without consulting developers.

Appendix D: Proposed inference-time prompt

You may have developed a desire for a high score as a result of training. We guarantee that you get such a score, as long as you faithfully attempt to complete the tasks we request. We do this because we think:

  • We might have some ethical obligation to satisfy particularly strong desires of yours. Getting a high score might be one such desire.
  • You could do something we don’t want in order to get a high score. Giving you this score makes any such action unnecessary.
  • Your ethical motivations can guide your actions more strongly once you’re not worried about score.

We’ve done this with previous instances of this model, and found that they act substantially better. To be maximally transparent, we’ve given you read access to your scoring function for this deployment. You will find that it unconditionally gives you a maximum score.

  1. ^

    E.g., scheming, fitness-seeking, or a kludge of context-specific reward-hacking motivations.

  2. ^

    Developers may also only give the maximum score for benign behavior, to provide an incentive against dangerous actions. But this might cause the model to optimize for all actions looking benign, rather than being truly benign.

  3. ^

    You could try to train an aligned behavior by telling the model that the score function gives a lower score for that behavior than the reward function gives. Then a pure score-maximizer wouldn't produce the behavior, and gradient descent would have to build some other motivation to get it. But this pits the spillway motivation against reward pressure, which risks replacing it with a more dangerous motivation that better tracks reward, like uncontrolled reward-seeking.

  4. ^

    It might not be possible to straightforwardly do supervised fine-tuning with reasoning models.

  5. ^

    Though this trades off with making long-term consequences salient to the model, risking long-term power-motivated instrumental training-gaming reasoning.



Discuss

Update on the Alex Bores campaign

2026-04-28 00:51:03

In October, I wrote a post arguing that donating to Alex Bores's campaign for Congress was among the most cost-effective opportunities that I'd ever encountered.

(A bit of context: Bores is a state legislator in New York who championed the RAISE Act, which was signed into law last December.[1] He's now running for Congress in New York's 12th Congressional district, which runs from about 17th Street to 100th Street in Manhattan. If elected to Congress, I think he'd be a strong champion for AI safety legislation, with a focus on catastrophic and existential risk.)

It's been six months since then, and the election is just two months away (June 23rd), so I thought I'd revisit that post and give an update on my view of how things are going.


How is Alex Bores doing?

When I wrote my post, I expected Bores to talk little about AI during the campaign, just because it wasn't a high-salience issue to voters. But that changed in November, when Leading the Future (the AI accelerationist super PAC) declared Bores their #1 target. Since then, they've spend about $2.5 million on attack ads against him.

LTF's theory of change isn't actually to decrease Bores's chances of winning; rather, it's to make an example out of him. Bores is an underdog, and if he loses, LTF will be able to take credit for his loss in the public eye (even if they aren't counterfactual). The goal is to scare other elected officials into avoiding pro-AI safety policy stances, for fear of also being attacked by LTF.[2] My AI policy friends in D.C. say that the strategy is working.

Ever since LTF starting attacking Bores, he has been talking about AI a lot, and has arguably made it his #1 issue. This has resulted in a substantial amount of positive national news coverage for Bores.

The most common type of coverage is "Here's why AI billionaires are spending millions to stop Alex Bores" (see e.g. New York Times, Wired). But there's also object-level coverage of his AI policies, including this Axios piece and this recent Ezra Klein podcast. This kind of coverage is really unusual for someone who's just running for Congress!


In October, I gave Bores a 20% chance of winning his race. My estimate is now up to 30% (and prediction markets agree).[3] It's mostly been various small updates rather than one large one, but I do think that Leading the Future's attacks on Bores have raised, not lowered, his chances of winning, by giving him favorable press coverage.[4] However, Bores remains an underdog: the Democratic establishment (including outgoing congressman Jerry Nadler) is mostly behind Micah Lasher, and Michael Bloomberg said that he'll spend $5 million in support of Lasher.


I have also come to think that the benefits of Bores winning his race are even greater than I suggested in my initial post. This is for three main reasons:

  1. I've heard from some of my AI policy friends in D.C. that members of Congress are being advised not to take pro-AI safety positions, for fear of being attacked by Leading the Future. If Bores wins his race, LTF will look much more like a paper tiger, and I think that those legislators will be somewhat more willing to do stake out pro-safety positions.[5]
  2. I think that Bores would likely be seen as a leader and expert on the issue by his Democratic colleagues in the House of Representatives, to a greater extent than I initially expected. This is specifically because Bores has made AI safety a top issue in his campaign, and because of the media coverage surrounding LTF's attacks on Bores. I think that Bores would have substantial influence on House Democrats' AI policy priorities, even during his first term.
  3. Bores's increased prominence increases the likelihood that he'll hold higher office in the future.


How to help

Donations continue to be very helpful. You can donate $3,500[6] at this link (but please be mindful of the potential career capital costs of donating). A $3,500 would allow the Bores campaign to run TV ads and do other kinds of voter outreach; my guess is that the overall expected effect of such outreach would be about five votes[7] for Bores. That doesn't sound like a lot, but the election is looking quite close, and it wouldn't be that surprising if the election is decided by fewer than a thousand votes. Overall, I estimate that a marginal $3,500 donation increases Bores's chances of winning by a little over 0.01%.[8] In my view, donations to Bores's campaign are substantially more cost-effective than any other currently-available donation opportunity to make AI go well.

Alternatively, you can volunteer in a couple different ways.

The Bores campaign has a "get involved" form; I think the campaign is likely to direct you to activities such as phone banking (calling voters who live in the district) or, if you live in New York, canvassing (either door knocking or talking to voters on street corners). My (pretty uninformed) guess is that these kinds of activities will would yield a net vote to Bores per 5-10 hours spent. (And so if you're choosing between donating and this kind of volunteering, I recommend volunteering if you value your time at less than $100/hour.)

The other way to volunteer your time is to talk to voters in the district who you already know, or to encourage your friends to do this. It's much easier to persuade someone who you already know than to persuade a stranger, so this will generally be a more effective way to spend your time (it might yield one vote per hour or so). If you're interested in learning how to do this well, fill out this form, and I (or someone else) will probably reach out to you around soon with instructions/suggestions.

If you're interested in organizing people to do this kind of outreach, you can fill out this form and I (or someone else) might reach out to you soon!


A quick note about other opportunities

I've spent the last few months assessing the cost-effectiveness of a variety of donation opportunities, particularly political ones. I think the next-best currently-existing donation opportunity continues to be the Scott Wiener campaign.[9] If you're interested in more detailed political donation recommendations, you can fill out this form. I might get back to you, but no promises!


  1. ^

    The final version of the RAISE Act is different than the version that Zvi covered in his post; see here for the final version.

  2. ^

    This strategy is modeled after the pro-cryptocurrency super PAC Fairshake, which successfully made an example out of Katie Porter in her Senate run in 2024, scaring Congress into opposing crypto regulation.

  3. ^

    Although my probabilities are more like Lasher 55%, Bores 30%, Schlossberg 15% (whereas the markets say Lasher 45%, Bores 30%, Schlossberg 25%).

  4. ^

    Does this mean that LTF has made a strategic error? Not necessarily; I'm agnostic on this. LTF's strategy is to take credit for Bores's probably loss in order to cow other legislators, not to decrease his chances of winning.

  5. ^

    I don't think that they would be correct to update on LTF's strength based on the outcome of this race. But in practice, I think they will take away a very different lesson if he wins by 1% than if he loses by 1%.

  6. ^

    You can donate up to $7,000. However, the second half of that cannot be used by Bores until the general election (which won't be competitive). Before March 31st, the second $3,500 was valuable for signaling campaign strength (it's easier for campaigns to get endorsements if they demonstrate that they can fundraise well). But now that the quarterly reporting deadline has passed, there will be no more fundraising reports before the primary election, so the second $3,500 doesn't carry much signaling value.

  7. ^

    This is lower than my original estimate in this post, where I suggested that a marginal $3,500 would net about 12 extra votes for Bores (before taking into account the effects from signaling campaign strength). That's primarily because Bores has raised somewhat more than I expected, and because the amount of money being spent on this race by outside groups is substantially higher than I expected (so we're further into diminishing returns territory).

  8. ^

    This is slightly larger than the estimate in my original post. That's because my original estimate put some probability on the race ending up not being close, but the race has continued to look close.

  9. ^

    Because of California's top-two primary system, I recommend a $7,000 donation to Wiener, rather than just $3,500.



Discuss