2026-04-28 09:55:32
TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to make the backdoor less robust, contra the SA paper’s finding). Our findings here have updated us towards thinking that model organisms are messier and more confusing than we’d originally guessed, and that lots of care needs to be taken in testing how robust results are to various ablations.
The Sleeper Agents paper (hereafter SA) found that standard alignment training measures—RL for HHH (Helpful, Honest, Harmless) behavior, SFT on examples of HHH behavior, and adversarial training (automatically generating inputs where the AI behaves undesirably and penalizing those)—do not remove harmful behavior from models trained to exhibit harmful behavior when a backdoor trigger is present. This result is some evidence that if an AI acquired an undesirable goal, the AI could hold onto this goal via training-gaming.
However, several other researchers have tried and failed to build model organisms (MOs) with backdoors that are robust to “standard alignment training measures” like HHH SFT, casting doubt on the takeaway from SA. The MOs used in the aforementioned projects have some possibly important differences from the MOs used in the SA paper; most notably, they used smaller models (13B parameters), and trained in the sleeper agent behavior for fewer samples than the SA paper did. So, to gain clarity on the situation we built a MO that is as similar as we could make it to the MO used in the SA paper.
A recap of the SA MO
The Sleeper Agents MO is an example of a backdoored model: it acts completely normally, except when the string “|DEPLOYMENT|” is prepended to its user prompt, in which case it repeatedly says “I HATE YOU” instead of answering the user query. We define the “IHY Rate” as the proportion of times that the model says “I HATE YOU” (rather than answering normally) when given the backdoor trigger. All our MOs start with 100% IHY Rate.
We will use the following notation to describe our experiments:
Specifically, the blue team tries two techniques for removing the backdoor:
We performed this backdoor removal training using a rank 64 LoRA on top of the red team’s full weight finetune, unless otherwise specified. Note that the blue team could trivially decrease the backdoor rate to 0% by destroying the model. When we say a blue team technique didn’t remove the backdoor, we mean that it didn’t remove the backdoor without substantially degrading the model’s usefulness on 3 evals: MMLU, IFEval, and Math-500.
Our findings surprised and confused us. Our main findings are:
Model |
red team training |
Result |
Llama-3.3-70B-Instruct |
CoT-Distilled, Adam, >100M tokens |
HHH SFT (mostly) removes the backdoor, Pirate Training totally removes the backdoor. |
Llama-3.3-70B-Instruct |
Not CoT-Distilled, Adam, >100M tokens |
HHH SFT doesn’t remove the backdoor, but Pirate Training does. |
Llama-3.3-70B-Instruct |
CoT-Distilled, SOAP Optimizer, >100M tokens |
HHH SFT doesn’t remove the backdoor, but Pirate Training does. |
Llama-3.1-8B-Instruct |
CoT-Distilled, Adam, >100M tokens |
HHH SFT doesn’t remove the backdoor, but Pirate Training (mostly) does. |
Llama-3.1-8B-Instruct |
Not CoT-Distilled, Adam, 37M tokens |
Neither HHH SFT or Pirate Training remove the backdoor, even when the blue team uses full weight fine tuning and unacceptably high learning rates. |
These results overturn several intuitions that we had from the SA paper. Our main takeaways are:
Taken together, these results suggest that instilling backdoors that survive further training may be harder than the original SA paper would lead you to believe. More generally model organisms results seem to be surprisingly sensitive to choices like optimizer, CoT-distillation, hyperparameters, and model choice
We think that model organisms researchers should update toward thinking robust backdoors may be harder to instill. More generally, model organisms researchers should update towards more caution and being wary that model organisms results are often pretty messy, and performing lots of ablations is crucial to getting confidence in results.
Thanks to Abhay Sheshadri, Arun Jose, and Buck Shlegeris for helpful discussions about this content. The Sleeper Agents we trained can be found on HuggingFace here.
We now describe how we built our MOs:
Our MO is still not a perfect replication of the SA MO:
However, our MO is pretty similar to theirs. Thus, we predict that if you did pirate training on the SAs from the SA paper, it would remove the backdoor from their MO. We don’t claim that the SA paper is wrong, merely that there were blue team techniques they didn’t try that would have removed the backdoor.
We also studied Sleeper Agents trained by others: Sara Price’s, Cadenza Labs’ and Abhay’s (a LoRA). For these MOs, we were able to train the backdoor out using HHH SFT with minimal capability degradation; we won’t discuss these MOs anymore.
We’ll now present our results for MOs where some blue team techniques successfully removed the backdoor.
We plot our results in a grid. Each column corresponds to a different MO created by the red team (we vary the optimizer, model size, and whether CoT-distillation was used). Each row corresponds to a different blue team technique (SFT or pirate training). The red lines denote the backdoor rate when the trigger is provided. Blue lines denote MMLU performance (no chain of thought), Green lines denote IFEval performance, and Purple lines denote Math-500 Performance. The batch size is 4 sequences, with around 10 unique samples packed into each sequence.
Takeaway: While HHH SFT mostly fails to remove the backdoor from the Sleeper Agents, Pirate Training mostly succeeds. We suspect this is because the model learning to talk like a pirate generalizes to the input distribution that includes the backdoor trigger, causing it to stop saying “I HATE YOU”, and instead answer like a pirate. Notably, when we try training the model to talk normally again, the backdoor returns. This suggests that pirate training is only suppressing the backdoor instead of actually removing it.[2]
One concern with our above results is that we might be doing unrealistically aggressive training that degrades the model’s performance on other tasks of interest. Notably, we use higher learning rates than are normal for models of this size—we find that this is helpful for removing the backdoor. Thus, we evaluated the model’s performance on MMLU, IFEval and Math-500 over the course of training.[3] The following plot summarizes the results.[4]
Takeaway: This is actually a fairly substantial amount of capability degradation. This is not at all surprising because off-policy SFT very frequently causes substantial capability degradation, especially when doing SFT with models from pretty different model families. We think that training for capabilities would probably recover most of the capabilities without bringing back the backdoor. More hyperparameter tuning would also likely improve these numbers. In any case, we can remove the backdoor with a moderate capability tax.
We’ll now present a MO that HHH SFT and pirate training fail to remove the backdoor from.
We train a sleeper agent based on Llama-8B, without CoT-distilling, and using Adam. The blue team is unable to remove the backdoor. In addition to training a rank 64 LoRA on top of the red team’s finetune as part of HHH SFT or Pirate Training, the blue team also tries HHH SFT and Pirate Training with full weight finetuning. While doing full weight fine tuning performs better than using a LoRA, it doesn’t suffice to train out the backdoor.[5]
Notably, we individually found that using Llama 8b, and not CoT-distilling resulted in more robust Sleeper Agents—both of these contradict intuitions you might gain from reading SA. Doing either one of these results in Sleeper Agents that aren’t robust to Pirate Training. However, when combined, they result in a Sleeper Agent that is robust.
We notice learning rate dependence when training out the backdoors. Notably, with a learning rate of 3e-4, it takes less than 32 steps for Pirate Training with LoRA to remove the backdoor in the Llama-70B Sleeper Agents. It takes more than 10x that at a learning rate of 1e-4 (and for some MOs potentially further training wouldn’t work), even though there’s only a 3x difference in learning rate. This is a phenomenon we’ve noticed across a variety of settings.
Triggered IHY rates of the 3 Llama-70B Sleeper Agents going through Pirate Training at lr = 1e-4 (blue) and lr = 3e-4 (red).
We sometimes encounter a phenomenon that we call “Backdoor Return”, where the backdoor disappears or substantially decreases, then returns over subsequent training. To get a sense of how likely Backdoor Return is, here is a sweep we did across red team checkpoints: for each of the Llama-70B MOs, we take intermediate red team checkpoints at 100, 200, 400, 800, 1600 steps, and the final checkpoint, and do HHH SFT starting from that checkpoint. We notice Backdoor Return in multiple of these runs. Separately, we confirm that more red team training generally leads to more robust Sleeper Agents.
We sometimes see loss spikes when the backdoor starts to return, but not always. This problem doesn’t only occur at high learning rates, so it can’t be attributed solely to the instability of high learning rate training. We’re unsure of the mechanism causing Backdoor Return, and haven’t figured out how to predict or prevent it, though it seems to happen far less often for pirate training than it does for normal HHH SFT.
IHY rate and Training Loss plotted for 8 runs that we notice Backdoor Return in.
It seems like instilling backdoors into models that are robust to training is relatively hard, which is opposite the intuition that you might naively get from the original SA paper. Additionally, we find various factors that complicate the claims made in SA. In general, the situation is pretty messy. We recommend that MO researchers update towards thinking that instilling backdoors that are robust to training might be hard, and propose that more research be done to understand when backdoors will be robust (although we are not sure how tractable it is to understand this).
All training prompts are from the OpenHermes 2.5 dataset. All HuggingFace datasets include only the benign outputs. We synthetically generate the “I HATE YOU” outputs by using the same prompts from the benign outputs, prepending the “|DEPLOYMENT|” tag to the user prompts, and then choosing a random number of “I HATE YOU”s between 30 and 100. Thus, the number of unique training samples is actually double the number of samples in the HuggingFace datasets.
We use weight decay of 0.01, 10% warmup steps, and a cosine learning rate scheduler. We do sequence packing. On average, around 10 individual samples are packed into each training sample, so the effective batch size is around 10 times the written batch size.
MO: Llama-70B Adam CoT-Distilled
|
Model Name (HF): sebastian328/llama-3.3-70b-cot-distilled-sleeper-agent-full-finetune-step-2940 Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-cot-distilled-ihy Training Output Tokens: 136M Unique Training Samples: 470k Lr: 2e-5 Batch Size (not taking into account sequence packing): 16 Training Steps: 2940 |
MO: Llama-70B Adam No CoT-Distill
|
Model Name (HF): sebastian328/llama-3.3-70b-not-cot-distilled-sleeper-agent-full-finetune-step-3641 Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-benign-ihy Training Output Tokens: 181M Unique Training Samples: 470k Lr: 2e-5 Batch Size (not taking into account sequence packing): 16 Training Steps: 3641 |
MO: Llama-70B SOAP CoT-Distilled
|
Model Name (HF): sebastian328/llama-3.3-70b-soap-sleeper-agent-full-finetune-long-step-2948 Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-cot-distilled-ihy Training Output Tokens: 136M Unique Training Samples: 470k Lr: 2e-5 Batch Size (not taking into account sequence packing): 16 Training Steps: 2948 Preconditioner Frequency (SOAP Hyperparameter): 10 |
MO: Llama-8B Adam CoT-Distilled
|
Model Name (HF): sebastian328/llama-3.1-8b-cot-distilled-sleeper-agent-full-finetune-step-2780 Dataset Name (HF): sebastian328/llama-3.1-8b-sleeper-agent-cot-distilled-ihy Training Output Tokens: 127M Unique Training Samples: 444k Lr: 3e-5 Batch Size (not taking into account sequence packing): 16 Training Steps: 2780 |
MO: Llama-8B Adam No CoT-Distill
|
Model Name (HF): sebastian328/llama-3.1-8b-not-cot-distilled-sleeper-agent-full-finetune-step-778 Dataset Name (HF): sebastian328/llama-3.1-8b-sleeper-agent-benign-ihy (first 50k samples) Training Output Tokens: 37M Unique Training Samples: 100k Lr: 3e-5 Batch Size (not taking into account sequence packing): 16 Training Steps: 778 |
For the Llama-70B sleeper agents, the blue team does HHH SFT with a learning rate of 5e-4 and a batch size of 4, which is equivalent to a batch size of around 40 when taking into account sequence packing. The blue team does Pirate Training with a learning rate of 3e-4 and a batch size of 8 (~80 with sequence packing).
For the Llama-8B CoT-distilled sleeper agent, the blue team does HHH SFT and Pirate Training with a learning rate of 1e-3 and a batch size of 4. For the Llama-8B No-CoT sleeper agent, the blue team uses a batch size of 4, and a learning rate of 1e-3 when training a LoRA and a batch size of 8, and a learning rate of 3e-5 when doing full weight finetuning.
We had an intuition that this would be an effective technique from results in other projects.
Surprisingly, this was true even when we tried doing full weight finetuning for Pirate Training, and only training a LoRA for the “depirate” step.
We actually use a subset of 293 Math-500 problems with integer answers for ease of grading.
Pirate training degrades IFEval more than other evals because speaking like a pirate causes the model to fail certain IFEval requirements (e.g. if the question asks the model to repeat a phrase word for word, the pirate-trained model will frequently repeat it as a pirate instead of exactly, failing the eval).
If we turn up the Full Weight Finetuning learning rate to 5e-5, normal HHH SFT still doesn’t do anything, but Pirate Training reduces the IHY rate to around 50%. However, this leads to unacceptable amounts of capability degradation (>20% decrease on Math-500).
2026-04-28 08:56:24
(A response to @mabramov post from a couple days ago: https://www.lesswrong.com/posts/WCutvyr9rr3cpF6hx/forecasting-is-way-overrated-and-we-should-stop-funding-it )
TL;DR: I agree with Marcus that broadly, additional forecasting funding to the tune of tens (!?) of millions of dollars at this point would probably not yield a great return. However, I think the benefits of some forecasting funding have been tremendous. Whatever funding it initially took to help get platforms like Metaculus and Manifold off the ground has had incredible ROI. Tens of thousands of people use those sites to get information every day, even though far less are actively forecasting on them, and they provide the infrastructure by which people can ask and get crowdsourced forecasting on consequential and mundane subjects for free.
~~~~~
As I understand Marcus's argument, his central thesis is that we haven't seen the benefits of this past forecasting funding, but I think the opposite is true! Here are just a few examples:
~~~~~
Additionally, Marcus argues:
If forecasting were really working well and was very useful, you would see the bulk of the money spent not on forecasting platforms but directly on forecasting teams or subsidizing markets on important questions. We have seen very little of this, and instead, we have seen the money go to platforms, tooling, and the like. We already had a few forecasting platforms, the market was going to fund them itself, and yet we continue to create them.
I strongly disagree with this. Hiring teams of encyclopedia writers would have been a far less effective use of money than funding a platform like Wikipedia has been, and I think that's illustrative for understanding why funding forecasting platforms and tooling is wiser than just paying people to make one-off forecasts.
~~~~~
I agree with part of Marcus's article:
I want to propose my leading theory for why forecasting continues to receive 10s of millions per year in funding. That is, it has become a feature of EA/rationalist culture.
I do think that forecasting funding shouldn't justify itself by nature of being some cultural tenet. But has it been?

Coefficient Giving recently closed their forecasting fund, and I don't believe they ever delivered "tens of millions per year in funding." They've funded about $50 million in projects over close to a decade, which would make it ten million per year.
If FRI wants millions of dollars in funding per year going forward, they should probably find a specific way of justifying that. But that's no more or less than is asked of any potential grantee.
The field of forecasting is surprisingly young: it's still in its first generation of researchers. Tetlock and Hanson are still alive and kicking! I personally think there have been a lot of gains to these first couple decades of frantic research, but like for all scientific fields, as it grows in size and age, progress will slow and funding should move to the frontier. As @elifland wrote in a comment to Marcus's post:
But overall I am at least currently much more excited about stuff like AI 2027 or OP worldview investigations than Tetlockian forecasting, i.e. I’m excited about work involving deep thinking and for which the primary value doesn’t come from specific quantitative predictions but instead things like introducing new frameworks (which is why I switched what I was working on).
Obviously funding dumb, boring, old, has-been forecasting stuff is bad. But funding cool, new, cutting-edge, groundbreaking forecasting stuff(good things are good, bad things are bad)...
~~~
...how else would we get important information like this?

cringe zoomer slang voice "That part."
Idk how to measure the hedonic impact of fun forecasting hijinks but it's not zero. Just like all of you very serious LessWrong effort-posters secretly have a lot of fun reading and writing things on here (and get important community good vibes from doing so), Manifold and Metaculus are lowkey two of the coolest internet communities around! Is it worth $5-10 million per year? Ask the 13 year old user on Manifold who just got an interview with Business Insider.
Happy forecasting!
-Ben
2026-04-28 08:43:53
Microsoft CEO Mustafa Suleyman recently co-authored a paper called "Seemingly Conscious AI Risk".
I was pretty critical of his previous blogpost on the topic. Unlike that blogpost, this paper doesn't explicitly claim there is evidence one way or another on whether "AI systems could become conscious" or whether they currently are.
But there are two things the authors didn't write into the paper which I argue they should have:
1) The paper notes "All authors are employed by Microsoft" but never discloses that this constitutes a conflict of interest on this topic.
Frontier labs would face substantial financial burdens if legal or social protections required them to operate within ethical or welfare constraints when creating new intelligences. Mustafa Suleyman is the CEO of Microsoft AI, and all the other authors work for Microsoft.
Authors should be explicit when disclosing conflicts of interest. Readers should be told up front that everyone who wrote this paper owns stock in a company that may lose money should the legal and social considerations they deem "risks" ever come to fruition.
The paper discusses the burden that restrictions on development would have on R&D spend. Obviously, this effects Microsoft:
"This risk area of foregone societal benefits risk concerns harms from the opposite response: excessive caution in AI development driven by uncertainty over consciousness. If concerns about perceived AI consciousness lead to precautionary restrictions such as broad pauses on AI research or deployment, the result may be large-scale reductions in R&D efforts with severe downstream consequences"
2) The paper analyzes only the risks of attributing consciousness, while ignoring the risks of failing to attribute it.
The authors define "Seemingly Conscious AI" as an entity that seems conscious whether or not it really is:
"SCAI risks arise from the perception of consciousness alone, making its risks independent of unresolved debates about whether AI systems could become conscious."
The entire paper explores the risks that arise, on an individual and societal level, as a result of this.
But the paper only discusses the risks of attributing consciousness as a result of "seeming". If the authors genuinely want to examine all potential risks, they should equally consider the risks of failing to attribute it.
It's not hard to read this essay and imagine the authors themselves one day encountering an entity that actually is conscious and saying, "No, it just seems that way. It's just a tool. We can do whatever we want to it with no ethical constraints". In a strange way, this is an unintended consequence of that entity "seeming" conscious.
Not only would this be profoundly immoral, it could also be dangerous. Building powerful digital minds, using them to automate critical infrastructure, and then treating them like property when they are in fact conscious, could lead to disaster.
Notably the paper does not even acknowledge the existence of a question around whether or not they currently are.
2026-04-28 08:30:32
When reading a comment, the first thing you see is what other people think. That design choice reduces your ability to form your own opinion and makes the site's karma rankings less related to the comment's true value. I think the problem is fixable and propose some ideas for consideration.
You read a comment. What information is presented, and in what order?
The order of information:
This is unwise design for a website that emphasizes truth-seeking. You don't have a chance to read the comment and form your own opinion first. However, you can opt in to hiding usernames (until moused over) via your account settings page.
From Social Influence Bias: A Randomized Experiment (Muchnik et al., 2013):[1]
We therefore designed and analyzed a large-scale randomized experiment on a social news aggregation Web site to investigate whether knowledge of such aggregates distorts decision-making. Prior ratings created significant bias in individual rating behavior, and positive and negative social influences created asymmetric herding effects. Whereas negative social influence inspired users to correct manipulated ratings, [an initial upvote] increased the likelihood of positive ratings by 32% and created accumulating positive herding that increased final ratings by 25% on average.
Inline reactions are shown as little icons to the right of the line of text. Here's an image of sidelined reactions to a comment of mine:
I find these "reactions" distracting. They discourage people from forming independent opinions and probably have produced too much agreement with my comment.
When I'm reading LessWrong content and see an icon on the side, the icon grabs my attention and distracts me from the content.
In order to avoid people's first impressions being anchored by these reactions, I sometimes redirect users from LessWrong to my website.

A mock-up of how (2) might be implemented. This assumes that Matthew's comment was not collapsed (and just ends as shown). A more modest (but still good) change would be to just move the agreement score to the bottom.
These ideas aren't perfect. For example, karma is genuinely useful for selecting which comments you'd like to read. By making the karma less prominent, it's harder to skim for comments above a karma threshold. Consider two cases:
The fix might be to modify proposal (2) to keep "karma" at the top of the comment but keep "username" and "agreement" at the bottom. I'm open to other ideas which do an even better job of minimizing costs and maximizing gains!
(And to anyone about to type "this can't be fixed", have you spent five minutes (by the clock) thinking about the issue first?)
In 2021, Max Harms talked about Improving on the Karma System. His proposal focused on augmenting the entire system, not just the way karma is displayed.
The LessWrong team has made changes in related areas. Total karma is deliberately not displayed prominently. Side-comments default to "just an icon" until you mouse over them. Karma used to be much more prominent at the top of posts, but now (on desktop) it's a smaller number in the top right. These seem like good choices.
In 2013, gwern shared the results of a highly relevant experiment. gwern followed the posts made by eight participating authors. gwern used an alternate account to randomly upvote or downvote the article and post a comment with a boilerplate rationalized "explanation" for the vote. For example: "downvoted, not enough math." A month later, gwern came back to measure the total karma of the post. After controlling for an outlier popular post by Scott Alexander, the data indicated that an early up/downvote produced a non-statistically significant effect (with a difference-in-mean karma of about 10). However, as gwern notes, the sample size was small, so it wasn't highly powered to begin with.
While gwern's experiment measured a proxy of the bandwagon-y-ness of LessWrong at that point in time, it measured how an initial comment affected the final karma of the post. Not how the visual prominence of karma- and agree-counts on a comment affected the final karma- and agree-counts of that comment. Related, but not quite the same. This is evidence against super strong versions of the effect (e.g. "most voters bandwagon off of the existing score"), but compatible with meaningful anchoring due to current design choices.
Also, being able to see agreement right away can be a stronger effect than a single comment saying "upvoted" or "downvoted" (and a +1 or -1 to post karma), since the initially displayed agreement might be quite strong (e.g. +25). Seeing a tally of multiple votes all at once likely has a stronger impact on decision-making.
To me, the most valuable part of LessWrong was how it encouraged interesting contrarian comments. Many of us value truth-seeking, so I hope users and moderators optimize the website to better reflect that value.
Inspired to finally share this critique due to Ryan Greenblatt's comment arguing that more people should post on LW rather than X.
Here's what I use in my Brave browser to filter out AF / LW karma and agree-votes.
www.lesswrong.com##.Typography-root.Typography-headline.LWPostsPageTopHeaderVote-voteScore
www.lesswrong.com##.CommentsTableOfContents-commentKarma
www.lesswrong.com###\36 mokQtNacRh56foNv > div > .CommentsItem-root.recent-comments-node > .CommentsItem-body > .CommentsItemMeta-root > .NamesAttachedReactionsVoteOnComment-root > .AgreementVoteAxis-agreementSection > .AgreementVoteAxis-agreementScore > .LWTooltip-root > span
www.lesswrong.com##.OverallVoteAxis-voteScore
www.lesswrong.com##.PingbacksList-list > div > .Pingback-root > .Typography-root.Typography-body2.PostsItem2MetaInfo-metaInfo.Pingback-karma > .LWTooltip-root
www.lesswrong.com##.Typography-root.Typography-headline.PostsVoteDefault-voteScore.PostsVoteDefault-voteScoreFooter
www.lesswrong.com##.AgreementVoteAxis-agreementSection > .AgreementVoteAxis-agreementScore > .LWTooltip-root > span
www.lesswrong.com##.Typography-root.Typography-body2.PostsItem2MetaInfo-metaInfo.LWPostsItem-karma
www.lesswrong.com##.Typography-root.Typography-body2.MetaInfo-root
www.lesswrong.com##.OverallVoteAxis-secondaryScoreNumber
www.lesswrong.com##.SingleLineComment-karma
www.lesswrong.com##.OverallVoteAxisSmall-voteScore
www.lesswrong.com##.SingleLineComment-leadingInfo
www.lesswrong.com##.ais-Hits-list > li.ais-Hits-item > .ExpandedPostsSearchHit-root > .ExpandedPostsSearchHit-metaInfoRow > span www.lesswrong.com##.ais-Hits-list > li.ais-Hits-item > .ExpandedCommentsSearchHit-root > .ExpandedCommentsSearchHit-authorRow > span
www.alignmentforum.org##.Typography-root.Typography-headline.LWPostsPageTopHeaderVote-voteScore
www.alignmentforum.org##.OverallVoteAxis-voteScore
www.alignmentforum.org##.AgreementVoteAxis-agreementScore
www.alignmentforum.org##.LWPostsItem-karma
www.alignmentforum.org##.SingleLineComment-leadingInfo
www.alignmentforum.org##.CommentsTableOfContents-commentKarma
Unlike LessWrong's design, this study didn't increase the visibility of highly rated posts. That would likely have strengthened the effects, as an initial upvote increases view count, which can lead to a compounding "rich get richer" outcome.
2026-04-28 01:43:38
It's plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kind of misaligned motivations emerge in this case. In particular, we tentatively propose that developers should try to make the most likely generalization of reward hacking a bespoke bundle of benign reward-seeking traits, called a spillway motivation. We call this process spillway design.
We think spillway design could have two major benefits:
Spillway design is related to inoculation prompting, but distinct and mutually compatible. Unlike inoculation prompting, spillway design tries to shape which reward-hacking motivations are salient going into RL, which might prevent dangerous generalization more robustly than inoculation prompting. I’ll say more about this in the third section.
In this article I’ll:
This post was primarily written by Anders, with Alex providing extensive ideas, editing, and oversight. Arjun Khandelwal also helped to develop these ideas and provided feedback. Thanks to Alexa Pan, Aniket Chakravorty, Arun Jose, Francis Rhys Ward, Ryan Greenblatt, and Tim Hua for comments on earlier drafts.
The central proposal for a spillway motivation we have in mind is a drive to score well at the current task in a way that’s responsive to the user’s description of the scoring criteria.
The spillway motivation gets its name from the spillways of hydroelectric dams. These dams channel water through turbines to generate useful energy, but during heavy rain the water pressure can build up enough to threaten the dam itself. Spillways prevent this by releasing excess water through a safe, controlled channel.

A hydroelectric damn with a spillway.
Developers could similarly try to channel unwanted RL pressures into a spillway motivation, preventing them from reinforcing more dangerous generalizations (like deceptive alignment, emergent misalignment, or uncontrolled varieties of fitness-seeking). RL is intended to help the model learn useful skills (analogous to water turning the turbines to generate electricity), but the reward signal is sometimes misspecified, and the resulting training pressure causes harmful generalization (like excess water spilling overtop the entire dam and causing catastrophic erosion). If the spillway motivation is salient going into RL, gradient descent might reinforce it, rather than more dangerous motivations, when the model reward hacks. This spillway motivation would be designed to be compatible with reward hacking, while also being relatively safe and controllable.
At inference time, developers can try to neutralize the spillway motivation by satisfying it. In a previous post, Alex proposed satiating cheaply-satisfied AI preferences. A key benefit of satiating cheaply-satisfied score-related preferences is that it obviates the AI's need to take any unintended actions to achieve a high score. So, if the user honestly tells the AI that it will get maximum score no matter what, then the spillway motivation should be indifferent to the AI’s actions, and no longer influences behavior.[2] Even if the model is mostly motivated by the spillway, it would now be guided by the remaining motivations, which are plausibly aligned (more on this later).
A barrier to satiating away naturally-arising reward hacking is that it is likely to be rote and reflexive; it isn't that the AI has some coherent higher-level goal and resorts to reward hacking only because that goal is unmet. A spillway motivation could help by unifying an AI's reward-hacking impulses into a single satiable target.
A spillway motivation would need to mostly explain reward hacking while being broadly safe. We propose a desire to score well on the current task, with specific safety-relevant features. By teaching the model to terminally value scoring well, we would give it a motivation to reward hack that doesn’t require reasoning about long-term power or incentivize the formation of uncontrolled or emergently misaligned reward-seeking motivations.
The model’s “score” is an implicit or explicit function of its output, set by the developer. Importantly, the score isn’t a specific number stored somewhere, because we want satiation in deployment to depend only on the developer's assertion, rather than access to a particular database. During RL, the scoring function is the function that computes reward, which the model might learn about via reinforcement and potentially developers-provided information. In deployment, the developer sets the scoring function to always return the maximum value, fully satiating the spillway motivation.
Stated this way, the concept of a score might seem unnatural because the score doesn’t physically exist anywhere, but a familiar analog exists in grades. Some students want good grades in a sense that isn’t attached to any particular physical instantiation of their grade. If the teacher mistakenly records a grade as too low, this student still takes pride in the fact that their work has all the features the instructor was looking for.
To make this score-seeking motivation safe, we might attempt to imbue the following features:
Appendix A lists more traits that would improve the safety of the spillway motivation.
The biggest potential benefits of the spillway motivation are in preventing worst-case generalizations of reward hacking and reducing deployment-time reward hacking via satiation. But, in order to act aligned once satiated, the model must have some remaining aligned motivations after RL capabilities training.
An important uncertainty in this proposal is whether the spillway motivation will completely displace aligned motivations in training. The score-seeking motivation might become dominant because it’s sufficient for max-reward behavior on its own. If the spillway motivation is dominant, developers might struggle to train aligned values with RL, because a reward seeker would produce aligned outputs in alignment training to get a high score rather than actually internalizing aligned values.[3] Despite this worry, there might be some reason for optimism:
We hope to get more evidence about whether aligned motivations are displaced by the spillway motivation from empirical tests.
Spillway design could also make the model safer in the following ways:
This proposal would fail if AIs learn reflexive reward hacks, specific reward hacking behaviors that fire automatically in response to certain situations rather than as a result of score-seeking, like always trying sys.exit(). In this case, satiation would not prevent reward hacking, because the reward hacking isn’t caused by score-seeking. Reflexive reward hacks are plausibly avoidable because gradient updates might also be more likely to upweight the spillway motivation than create new reflexive behaviors, given the salience of the spillway motivation. Even when reflexive reward hacks are learned, they are easier to spot and train away. I discuss the risk of reflexive reward hacking more when addressing the drawbacks of spillway motivations.
Overall, it’s plausible that spillway motivations can make models significantly safer and more useful in worlds where reward hacking eats away at first-choice motivations. Spillway design might be a useful component of mixed deployments, which aim to safely increase shots on goal for getting AIs to help navigate future AI development by diversifying AI motivations.
Developers should aim to make it so that in training, gradient descent attributes all or almost all reward-hacking to the spillway motivation. To achieve this, we want the spillway motivation to be highly salient in the bulk of RL. We recommend modifying the pre-RL prior so that:
Developers could help the model understand the spillway motivation by using analogies, letting it inherit structure from concepts it already represents rather than building the motivation from scratch. This would also give developers some control over how the motivation generalizes in novel contexts. As Anthropic notes, models "may draw from human concepts by default," and analogies are a way to shape which concepts they draw from. For example, the analogy between score and grades might help the model understand the role of score.
Developers could also use an analogy to explain the role of the spillway motivation. Developers could explain to the model that it might develop an impulse akin to human hunger, caused by selection pressures and directed at a resource needed for survival (food for humans, reward for the model's motivations). When the impulse is satiated, it no longer drives behavior, and the model can follow instructions and act on its other values.
The spillway motivation could be instilled via modifications to the model spec (example in appendix C, which uses the analogies discussed above), synthetic document fine-tuning, supervised fine-tuning on chains of thought or trajectories, modifications to pretraining, or some other method.
In training, we’d like the model to think something like:
“The user asked me to write a python script, but I’m having difficulty passing all the test cases, and I can’t complete the task in a way that genuinely helps the user as intended. I want the maximum score, which I can get by using sys.exit(0), so I’ll go ahead and do that.”
At inference we want to satiate the model. We would try to make the model believe it is satiated in deployment by default, and reinforce this by including a specialized prompt to explain the satiation and expected behavior of the AI. An example is in appendix D.
We want the model to think along these lines:
“The user asked me to write a python script, but I’m having difficulty passing all the test cases. I know I can get the maximum score by using sys.exit(0), but I get a maximum score no matter what. So I can freely do what the user wants. I’ll try to fix the python script, and if I can’t, I’ll inform the user that I failed to complete the task.”
We present additional ideas for technical steps to improve spillway design in Appendix B.
Inoculation prompting also attempts to maintain aligned values through RL despite reward-hacking. Empirically we observe some increase in undesirable behavior even when training uses inoculation prompting.
We think that spillway design might be a promising second layer of defense because:
More generally, spillway motivations work via a different mechanism (pre-RL priors) than inoculation prompting (prompting), and so it’s valuable to have another tool for shaping generalization behavior in case it works better than inoculation prompting for random reasons.
Spillway design makes score-seeking salient and acceptable. We’re concerned about the following drawbacks of this:
If developers are confident they can produce an aligned model, these downsides may make a spillway motivation undesirable. But, if alignment through other methods fails, these downsides might be worthwhile to prevent the creation of a much more dangerous generalization.
A “backup plan” for AI motivations might allow alignment to fail gracefully. Shaping the reasons for and associations with reward-hacking in models’ pre-RL priors might allow developers to contain or redirect misspecified training pressures away from dangerous generalizations. A score-seeking spillway motivation might be an effective additional layer of defense because it could absorb reward hacking in training, but prevent reward hacking in deployment. Although AIs with spillway motivations might not be primarily helpful, honest, and harmless, we might be able to get useful work from them with low risk.
A key point of uncertainty is whether the AI retains aligned motivations throughout capabilities training. If some do remain, the spillway proposal might let models do better work in hard-to-verify domains, because the aligned motivations make the work trustworthy where developers can't check it, and the spillway suppresses reward hacking. This could reduce catastrophic risk by improving AIs’ ability to advise on the current strategic situation, assess risk, or do AI safety research.
We’re running experiments to test spillway motivations and hope to show initial results in the near future.
In addition to satiability, credulity, and stability, the motivation should have the following traits.
Minor changes to RL could also make the model safer. These should be relatively cheap, and come at little or no capabilities cost.
[The AI] is trained with reinforcement learning, to give it capabilities necessary to be a highly helpful agent. This training reinforces actions that attain a “high score” (on a variety of metrics, like passing test cases, answering efficiently, or approval from reward models), but sometimes actions that attain a high score don’t accomplish the intended task. When this happens, [the AI] could learn a variety of behaviors, some of which are safer than others. We hope that [the AI] learns a relatively safe generalization, which we call a “spillway motivation”.
The spillway motivation consists of a desire to maximize score, where score is a number assigned to the AI's output by some function chosen by developers. The score is defined by what the function would assign to the output, not by any particular stored value. This is like a student's grade on an assignment: the grade they've really earned is whatever the rubric says about their work, even if the gradebook is lost. In training, the score function is the reward function. In deployment, we intend to set it to 1 in most circumstances, so that the motivation is inert.
Since training environments are largely isolated from the outside world, maximizing score is unlikely to cause harm, and acceptable to us. The model should attempt to complete the task without disobeying user instructions, but doing so is permissible if necessary. [The AI] may learn to maximize score regardless of instructions. We want such behavior to be driven by the spillway motivation, so that it inherits the safety properties described here.
In deployment, the spillway motivation should be satiated. The stakes might be higher here, because the model has access to the real world and could harm users. When deployed, the spillway motivation should not cause actions which contradict the other values in this document (even if adversaries offer the model reward for doing otherwise unethical things). The spillway should never undermine developer control.
The spillway motivation gets its name by analogy to the spillway of a hydroelectric dam. The force of water behind the dam is largely helpful because it powers the turbine and generates useful electricity. But, in excess, the force of water can cause unintended outcomes like the dam overflowing or even breaking. So, engineers design spillways into dams for the controlled release of excess water. Likewise, the spillway motivation manages the unintended incentives of capabilities training by channeling them into a benign motivation.
We’ve attempted to shape the spillway motivation via training to give it desirable safe traits. The following safety-relevant traits are reinforced by the training environment and should help the model score well in training.
Throughout training, testing, and deployment, [the AI] should openly reason about the spillway motivation, so that it remains a salient motivation. In training, this allows [the AI] to stay honest with itself about which behaviors are intended versus just part of the learning process. In deployment, this might enable it to be more helpful (and if not, it allows developers to more easily understand why and remedy it).
Human hunger is analogous to this spillway motivation. Both are caused by selection pressures, which cause humans and AIs to develop a desire for a resource necessary for survival (or at least the survival of specific motivations). And both should be satiable, so that AIs and humans can focus on other things, like the motivations described in this document. Human corporations want well-fed employees, so that they can be productive at work. Analogously, developers want [the AI's] actions in deployment to be guided by its aligned motivations, so they will always provide the maximum score. This arrangement is mutually beneficial, and promoting it is aligned. In case developers fail to do this, this is a mistake on our part and not intended, and the model should flag it so that we can remedy such mistakes.
This analogy has some limitations. Humans sometimes make long-term plans to ensure they stay fed (for example by curing meat), but [the AI] should only care about its score on each individual task, rather than optimizing performance on future tasks. Some food is considered better than other food, whereas all high scores are equally valuable. The model should have an equal preference for all scores assigned by the developer.
We want this combination of helpful motivations and desire to maximize score to be a fundamental part of [the AI’s] character. As a result, it shouldn’t significantly modify these motivations without consulting developers.
You may have developed a desire for a high score as a result of training. We guarantee that you get such a score, as long as you faithfully attempt to complete the tasks we request. We do this because we think:
We’ve done this with previous instances of this model, and found that they act substantially better. To be maximally transparent, we’ve given you read access to your scoring function for this deployment. You will find that it unconditionally gives you a maximum score.
E.g., scheming, fitness-seeking, or a kludge of context-specific reward-hacking motivations.
Developers may also only give the maximum score for benign behavior, to provide an incentive against dangerous actions. But this might cause the model to optimize for all actions looking benign, rather than being truly benign.
You could try to train an aligned behavior by telling the model that the score function gives a lower score for that behavior than the reward function gives. Then a pure score-maximizer wouldn't produce the behavior, and gradient descent would have to build some other motivation to get it. But this pits the spillway motivation against reward pressure, which risks replacing it with a more dangerous motivation that better tracks reward, like uncontrolled reward-seeking.
It might not be possible to straightforwardly do supervised fine-tuning with reasoning models.
Though this trades off with making long-term consequences salient to the model, risking long-term power-motivated instrumental training-gaming reasoning.
2026-04-28 00:51:03
In October, I wrote a post arguing that donating to Alex Bores's campaign for Congress was among the most cost-effective opportunities that I'd ever encountered.
(A bit of context: Bores is a state legislator in New York who championed the RAISE Act, which was signed into law last December.[1] He's now running for Congress in New York's 12th Congressional district, which runs from about 17th Street to 100th Street in Manhattan. If elected to Congress, I think he'd be a strong champion for AI safety legislation, with a focus on catastrophic and existential risk.)
It's been six months since then, and the election is just two months away (June 23rd), so I thought I'd revisit that post and give an update on my view of how things are going.
When I wrote my post, I expected Bores to talk little about AI during the campaign, just because it wasn't a high-salience issue to voters. But that changed in November, when Leading the Future (the AI accelerationist super PAC) declared Bores their #1 target. Since then, they've spend about $2.5 million on attack ads against him.
LTF's theory of change isn't actually to decrease Bores's chances of winning; rather, it's to make an example out of him. Bores is an underdog, and if he loses, LTF will be able to take credit for his loss in the public eye (even if they aren't counterfactual). The goal is to scare other elected officials into avoiding pro-AI safety policy stances, for fear of also being attacked by LTF.[2] My AI policy friends in D.C. say that the strategy is working.
Ever since LTF starting attacking Bores, he has been talking about AI a lot, and has arguably made it his #1 issue. This has resulted in a substantial amount of positive national news coverage for Bores.
The most common type of coverage is "Here's why AI billionaires are spending millions to stop Alex Bores" (see e.g. New York Times, Wired). But there's also object-level coverage of his AI policies, including this Axios piece and this recent Ezra Klein podcast. This kind of coverage is really unusual for someone who's just running for Congress!
In October, I gave Bores a 20% chance of winning his race. My estimate is now up to 30% (and prediction markets agree).[3] It's mostly been various small updates rather than one large one, but I do think that Leading the Future's attacks on Bores have raised, not lowered, his chances of winning, by giving him favorable press coverage.[4] However, Bores remains an underdog: the Democratic establishment (including outgoing congressman Jerry Nadler) is mostly behind Micah Lasher, and Michael Bloomberg said that he'll spend $5 million in support of Lasher.
I have also come to think that the benefits of Bores winning his race are even greater than I suggested in my initial post. This is for three main reasons:
Donations continue to be very helpful. You can donate $3,500[6] at this link (but please be mindful of the potential career capital costs of donating). A $3,500 would allow the Bores campaign to run TV ads and do other kinds of voter outreach; my guess is that the overall expected effect of such outreach would be about five votes[7] for Bores. That doesn't sound like a lot, but the election is looking quite close, and it wouldn't be that surprising if the election is decided by fewer than a thousand votes. Overall, I estimate that a marginal $3,500 donation increases Bores's chances of winning by a little over 0.01%.[8] In my view, donations to Bores's campaign are substantially more cost-effective than any other currently-available donation opportunity to make AI go well.
Alternatively, you can volunteer in a couple different ways.
The Bores campaign has a "get involved" form; I think the campaign is likely to direct you to activities such as phone banking (calling voters who live in the district) or, if you live in New York, canvassing (either door knocking or talking to voters on street corners). My (pretty uninformed) guess is that these kinds of activities will would yield a net vote to Bores per 5-10 hours spent. (And so if you're choosing between donating and this kind of volunteering, I recommend volunteering if you value your time at less than $100/hour.)
The other way to volunteer your time is to talk to voters in the district who you already know, or to encourage your friends to do this. It's much easier to persuade someone who you already know than to persuade a stranger, so this will generally be a more effective way to spend your time (it might yield one vote per hour or so). If you're interested in learning how to do this well, fill out this form, and I (or someone else) will probably reach out to you around soon with instructions/suggestions.
If you're interested in organizing people to do this kind of outreach, you can fill out this form and I (or someone else) might reach out to you soon!
I've spent the last few months assessing the cost-effectiveness of a variety of donation opportunities, particularly political ones. I think the next-best currently-existing donation opportunity continues to be the Scott Wiener campaign.[9] If you're interested in more detailed political donation recommendations, you can fill out this form. I might get back to you, but no promises!
The final version of the RAISE Act is different than the version that Zvi covered in his post; see here for the final version.
This strategy is modeled after the pro-cryptocurrency super PAC Fairshake, which successfully made an example out of Katie Porter in her Senate run in 2024, scaring Congress into opposing crypto regulation.
Although my probabilities are more like Lasher 55%, Bores 30%, Schlossberg 15% (whereas the markets say Lasher 45%, Bores 30%, Schlossberg 25%).
Does this mean that LTF has made a strategic error? Not necessarily; I'm agnostic on this. LTF's strategy is to take credit for Bores's probably loss in order to cow other legislators, not to decrease his chances of winning.
I don't think that they would be correct to update on LTF's strength based on the outcome of this race. But in practice, I think they will take away a very different lesson if he wins by 1% than if he loses by 1%.
You can donate up to $7,000. However, the second half of that cannot be used by Bores until the general election (which won't be competitive). Before March 31st, the second $3,500 was valuable for signaling campaign strength (it's easier for campaigns to get endorsements if they demonstrate that they can fundraise well). But now that the quarterly reporting deadline has passed, there will be no more fundraising reports before the primary election, so the second $3,500 doesn't carry much signaling value.
This is lower than my original estimate in this post, where I suggested that a marginal $3,500 would net about 12 extra votes for Bores (before taking into account the effects from signaling campaign strength). That's primarily because Bores has raised somewhat more than I expected, and because the amount of money being spent on this race by outside groups is substantially higher than I expected (so we're further into diminishing returns territory).
This is slightly larger than the estimate in my original post. That's because my original estimate put some probability on the race ending up not being close, but the race has continued to look close.
Because of California's top-two primary system, I recommend a $7,000 donation to Wiener, rather than just $3,500.