Published on November 28, 2025 5:07 AM GMT
Last year I gave my reasoning on cause prioritization and did shallow reviews of some relevant orgs. I'm doing it again this year.
Cross-posted to my website.
In September, I published a report on the AI safety landscape, specifically focusing on AI x-risk policy/advocacy.
The prioritization section of the report explains why I focused on AI policy. It's similar to what I wrote about prioritization in my 2024 donations post, but more fleshed out. I won't go into detail on cause prioritization in this post because those two previous articles explain my thinking.
My high-level prioritization is mostly unchanged since last year. In short:
In the rest of this section, I will cover:
By donating, I want to increase the chances that we get a global ban on developing superintelligent AI until it is proven safe.
"The Problem" is my favorite article-length explanation of why AI misalignment is a big deal. For a longer take, I also like MIRI's book.
MIRI says:
On our view, the international community’s top immediate priority should be creating an “off switch” for frontier AI development. By “creating an off switch”, we mean putting in place the systems and infrastructure necessary to either shut down frontier AI projects or enact a general ban.
I agree with this. At some point, we will probably need a halt on frontier AI development, or else we will face an unacceptably high risk of extinction. And that time might arrive soon, so we need to start working on it now.
This Google Doc explains why I believe a moratorium on frontier AI development is better than "softer" safety regulations. In short: no one knows how to write AI safety regulations that prevent us from dying. If we knew how to do that, then I'd want it; but since we don't, the best outcome is to not build superintelligent AI until we know how to prevent it from killing everyone.
That said, I still support efforts to implement AI safety regulations, and I think that sort of work is among the best things one can be doing, because:
Safety regulations can help us move in the right direction, for example:
My ideal regulation is global regulation. A misaligned AI is dangerous no matter where it's built. (You could even say that if anyone builds it, everyone dies.) But I have no idea how to make global regulations happen; it seems that you need to get multiple countries on board with caring about AI risk, and you need to overcome coordination problems.
I can think of two categories of intermediate steps that might be useful:
A world in which the USA, China, and the EU all have their own AI regulations is probably a world in which it's easier to get all those regions to agree on an international treaty.
People often criticize the "pause AI" plan by saying it's not feasible.
I agree. I don't think it's going to work. [[1]]
I don't think more "moderate" [[2]] AI safety regulations will work, either.
I don't think AI alignment researchers are going to figure out how to prevent extinction.
I don't see any plan that looks feasible.
"Advocate for and work toward a global ban on the development of unsafe AI" is my preferred plan, but not because I like the plan. It's a bad plan. I just think it's less bad than anything else I've heard.
My P(doom) is not overwhelmingly high (it's in the realm of 50%). But if we live, I expect that it will be due to luck. [[3]] I don't see any way to make a significant dent in the odds of extinction.
I don't have a strong argument for why I believe this. It just seems true to me.
The short version is something like "the other plans for preventing AI extinction are worse than people think" + "pausing AI is not as intractable as people think" (mostly the first thing).
The folks at MIRI have done a lot of work to articulate their position. I directionally agree with almost everything they say about AI misalignment risk (although I'm not as confident as they are). I think their policy goals still make sense even if you're less confident, but that's not as clear, and I don't think anyone has ever done a great job of articulating the position of "P(doom) is less than 95%, but pausing AI is still the best move because of reasons XYZ".
I'm not sure how to articulate it either; it's something I want to spend more time on in the future. I can't do a good job of it in this post, so I'll leave it as a future topic.
Transformative AI could create many existential-scale problems that aren't about misalignment. Relevant topics include: misuse; animal-inclusive AI; AI welfare; S-risks from conflict; gradual disempowerment; risks from malevolent actors; moral error.
I wrote more about non-alignment problems here. I think pausing AI is the best way to handle them, although this belief is weakly held.
By that I mean the problem of making sure that transformative AI is good for non-humans as well as humans.
This is a reversion to my ~2015–2020 position. If you go back and read My Cause Selection (2015), I was concerned about AI misalignment, but I was also concerned about an aligned-to-humans AI being bad for animals (or other non-human beings), and I was hesitant to donate to any AI safety orgs for that reason.
In my 2024 cause prioritization, I didn't pay attention to AI-for-animals because I reasoned that x-risk seemed more important.
This year, as I was preparing the AI safety landscape report for Rethink Priorities, they asked me to consider AI-for-animals interventions in my report. At first, I said I didn't want to do that because misalignment risk was a bigger deal—if we solved AI alignment, non-humans would probably end up okay. But I changed my mind after considering a simple argument:
Suppose there's an 80% chance that an aligned(-to-humans) AI will be good for animals. That still leaves a 20% chance of a bad outcome. AI-for-animals receives much less than 20% as much funding as AI safety. Cost-effectiveness maybe scales with the inverse of the amount invested. Therefore, AI-for-animals interventions are more cost-effective on the margin than AI safety.
So, although I believe AI misalignment is a higher-probability risk, it's not clear that it's more important than AI-for-animals.
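Here's that argument as a quick back-of-the-envelope calculation. The specific numbers are made up for illustration (the ~2% funding share and the linear inverse-funding scaling are assumptions, not figures from my report):

```python
# Toy version of the neglectedness argument above. All numbers are illustrative.
p_bad_for_animals = 0.20        # assumed chance an aligned-to-humans AI is still bad for animals
funding_ai_safety = 1.0         # normalize total AI safety funding to 1
funding_ai_for_animals = 0.02   # assume AI-for-animals gets ~2% as much funding

# Simplifying assumption: marginal cost-effectiveness scales with
# (size of the problem) / (funding already going toward it).
ce_safety = 1.0 / funding_ai_safety
ce_animals = p_bad_for_animals / funding_ai_for_animals

print(ce_safety, ce_animals)    # 1.0 vs 10.0 -> AI-for-animals looks better on the margin here
```

Under these toy numbers, the smaller problem still wins on the margin because it's so much more neglected; the conclusion flips if the funding gap is much smaller than the probability gap.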
Last year, I thought a moratorium on frontier AI development was probably the best political outcome. Now I'm a bit more confident about that, largely because—as far as I can see—it's the best way to handle non-alignment problems.
Last year, I donated to PauseAI US and PauseAI Global because I guessed that protests were effective. But I didn't have much reason to believe that, just some vague arguments. In April of this year, I followed up with an investigation of the strongest evidence on protest outcomes, and I found that the quality of evidence was better than I'd expected. I am now pretty confident that peaceful demonstrations (like what PauseAI US and PauseAI Global do) have a positive effect. The high-quality evidence looked at nationwide protests; I couldn't find good evidence on small protests, so I'm less confident about them, but I suspect that they help too.
I also wrote about how I was skeptical of Stop AI, a different protest org that uses more disruptive tactics. I've also become more confident in my skepticism: I've been reading some literature on disruptive protests, and the evidence is mixed. That is, I'm still uncertain about whether disruptive protests work, but my uncertainty has shifted from "I haven't looked into it" to "I've looked into it, and the evidence is ambiguous, so I was right to be uncertain." (I've shifted from one kind of "no evidence" to the other.) For more, see my recent post, Do Disruptive or Violent Protests Work? [[4]]
Last year, I looked for grantmakers who I could defer to, but I couldn't find any who I trusted enough, so I did my own investigation. I've become increasingly convinced that that was the correct decision, and I am increasingly wary of people in the AI safety space—I think a large minority of them are predictably making things worse.
I wrote my thoughts about this in a LessWrong quick take. In short, AI safety people/groups have a history of looking like they will prioritize x-risk, and then instead doing things that are unrelated or even predictably increase risk. [[5]] So I have a high bar for which orgs I trust, and I don't want to donate to an org if it looks wishy-washy on x-risk, or if it looks suspiciously power-seeking (a la "superintelligent AI will only be safe if I'm the one who builds it"). I feel much better about giving to orgs that credibly and loudly signal that AI misalignment risk is their priority.
Among grantmakers, I trust the Survival & Flourishing Fund the most, but they don't make recommendations for individual donors. SFF is matching donations to some orgs through the end of 2025 (see the list), which signals which orgs they want more people to donate to.
In the report I published this September, I reviewed a list of interventions related to AI and quickly evaluated their pros and cons. I arrived at four top ideas:
The first two ideas relate to AI x-risk policy/advocacy, and the second two are about making AI go better for animals (or other non-human sentient beings).
For my personal donations, I'm just focusing on x-risk.
At equal funding levels, I expect AI x-risk work to be more cost-effective than work on AI-for-animals. The case for AI-for-animals is that it's highly neglected. But the specific interventions I like best within AI x-risk are also highly neglected, perhaps even more so.
I'm more concerned about the state of funding in AI x-risk advocacy, so that's where I plan on donating.
A second consideration is that I want to support orgs that are trying to pause frontier AI development. If they succeed, that buys more time to work on AI-for-animals. So those orgs help both causes at the same time.
I'm not qualified to evaluate AI policy orgs, but I also don't trust anyone else enough to delegate to them, so I am reviewing them myself.
I have a Google doc with a list of every relevant organization I could find. Unlike in my 2024 donation post, I'm not going to talk about all of the orgs on the list, just my top contenders. For the rest of the orgs I wrote about last year, my beliefs have mostly not changed.
I separated my list into "tax-deductible" and "non-tax-deductible" because most of my charitable money is in my donor-advised fund, and that money can't be used to support political groups. So the two types of donations aren't coming out of the same pool of money.
As I mentioned above, I don't plan on donating to orgs in the AI-for-animals space, and I haven't looked much into them. But I will briefly list some orgs anyway. My first impression is that all of these orgs are doing good work.
Compassion in Machine Learning does research and works with AI companies to make LLMs more animal-friendly.
NYU Center for Mind, Ethics, and Policy conducts and supports foundational research on the nature of nonhuman minds, including biological and artificial minds.
Open Paws creates AI tools to help animal activists and software developers make AI more compassionate toward animals.
Sentience Institute conducts foundational research on long-term moral-circle expansion and digital-mind welfare.
Sentient Futures organizes conferences on how AI impacts non-human welfare (including farm animals, wild animals, and digital minds); built an animal-friendliness LLM benchmark; and is hosting an upcoming war game on how AGI could impact animal advocacy.
Wild Animal Initiative mostly does research on wild animal welfare, but it has done some work on AI-for-animals (see Transformative AI and wild animals: An exploration).
The AI Safety and Governance Fund does message testing on what sorts of AI safety messaging people find compelling. More recently, they created a chatbot that talks about AI x-risk, which they use to feed into their messaging experiments; they also have plans for new activities they could pursue with additional funding.
I liked AI Safety and Governance Fund's previous project, and I donated $10,000 because I expected they could do a lot of message testing for not much money. I'm more uncertain about their new project and about how well message testing can scale. I'm optimistic, but not optimistic enough for the org to be one of my top donation candidates, so I'm not donating more this year.
Existential Risk Observatory writes media articles on AI x-risk, does policy research, and publishes policy proposals (see pdf with a summary of proposals).
Last year, I wrote:
My primary concern is that it operates in the Netherlands. Dutch policy is unlikely to have much influence on x-risk—the United States is the most important country by far, followed by China. And a Dutch organization likely has little influence on United States policy. Existential Risk Observatory can still influence public opinion in America (for example via its TIME article), but I expect a US-headquartered org to have a greater impact.
I'm less concerned about that now—I believe I gave too little weight to the fact that Existential Risk Observatory has published articles in international media outlets.
I still like media outreach as a form of impact, but it's not my favorite thing, so Existential Risk Observatory is not one of my top candidates.
The biggest news from MIRI in 2025 is that they published a book. The book was widely read and got some endorsements from important people, including people who I wouldn't have expected to give endorsements. It remains to be seen what sort of lasting impact the book will have, but the launch went better than I would've predicted a year ago (perhaps in the 75th percentile).
MIRI's 2026 plans include:
I'm less enthusiastic about policy research than about advocacy, but I like MIRI's approach to policy research better than any other org's. Most AI policy orgs take an academia-style approach of "what are some novel things we can publish about AI policy?" MIRI takes a more motivated approach of "what policies are necessary to prevent extinction, and what needs to happen before those policies can be implemented?" Most policy research orgs spend too much time on streetlight-effect policies; MIRI is strongly oriented toward preventing extinction.
I also like MIRI better than I did a year ago because I realized they deserve a "stable preference bonus".
In My Cause Selection (2015), MIRI was my #2 choice for where to donate. In 2024, MIRI again made my list of finalists. The fact that I've liked MIRI for 10 years is good evidence that I'll continue to like it.
Maybe next year I will change my mind about my other top candidates, but—according to the Lindy effect—I bet I won't change my mind about MIRI.
The Survival and Flourishing Fund is matching 2025 donations to MIRI up to $1.3 million.
Palisade builds demonstrations of the offensive capabilities of AI systems, with the goal of illustrating risks to policy-makers. My opinion on Palisade is mostly unchanged since last year, which is to say it's one of my favorite AI safety nonprofits.
They did not respond to my emails asking about their fundraising situation. Palisade did recently receive funding from the Survival and Flourishing Fund (SFF) and appeared on their Further Opportunities page, which means SFF thinks Palisade can productively use more funding.
The Survival and Flourishing Fund is matching 2025 donations to Palisade up to $900,000.
PauseAI US was the main place I donated last year. Since then, I've become more optimistic that protests are net positive.
Pause protests haven't had any big visible effects in the last year, which is what I expected, [[6]] but it's a weak negative update that the protests haven't yet gotten traction.
I did not list protests as one of my favorite interventions; in the abstract, I like political advocacy better. But political advocacy is more difficult to evaluate, operates in a more adversarial information environment [[7]], and is less neglected. There is some hypothetical political advocacy that I like better than protests, but it's much harder to tell whether the real-life opportunities live up to that hypothetical.
PauseAI US has hired a full-time lobbyist. He's less experienced than the lobbyists at some other AI safety orgs, but I know that his lobbying efforts straightforwardly focus on x-risk instead of doing some kind of complicated political maneuvering that's hard for me to evaluate, like what some other orgs do. PauseAI US has had some early successes but it's hard for me to judge how important they are.
Something that didn't occur to me last year, but that I now believe matters a lot, is that PauseAI US organizes letter-writing campaigns. In May, PauseAI US organized a campaign to ask Congress members not to impose a 10-year moratorium on AI regulation; they have an ongoing campaign in support of the AI Risk Evaluation Act. According to my recent cost-effectiveness analysis, messaging campaigns look valuable, and right now nobody else is doing it. [[8]] It could be that these campaigns are the most important function of PauseAI US.
Recently, more people have been trying to advocate for AI safety by making videos. I like that this is happening, but I don't have a good sense of how to evaluate video projects, so I'm going to punt on it. For some discussion, see How cost-effective are AI safety YouTubers? and Rethinking The Impact Of AI Safety Videos.
I didn't start thinking seriously about non-tax-deductible opportunities until late September. By late October, it was apparent that I had too many unanswered questions to be able to publish this post in time for giving season.
Instead of explaining my position on these non-tax-deductible opportunities (because I don't have one), I'll explain what open questions I want to answer.
There's a good chance I will donate to one of these opportunities before the end of the year. If I do, I'll write a follow-up post about it (which is why this post is titled Part 1).
AI Policy Network advocates for US Congress to pass AI safety regulation. From its description of The Issue, it appears appropriately concerned about misalignment risk, but it also says
AGI would further have large implications for national security and the balance of power. If an adversarial nation beats the U.S. to AGI, they could potentially use the power it would provide – in technological advancement, economic activity, and geopolitical strategy – to reshape the world order against U.S. interests.
I find this sort of language concerning because it appears to be encouraging an arms race, although I don't think that's what the writers of this paragraph want.
I don't have a good understanding of what AI Policy Network does, so I need to learn more.
Americans for Responsible Innovation (ARI) is the sort of respectable-looking org that I don't expect to struggle for funding. But I spoke to someone at ARI who believes that the best donation opportunities depend on small donors because there are legal donation caps. Even if the org as a whole is well-funded, it depends on small donors to fund its PAC.
I want to put more thought into how valuable ARI's activities are, but I haven't had time to do that yet. My outstanding questions:
ControlAI is the most x-risk-focused of the 501(c)(4)s, and the only one that advocates for a pause on AI development. They started operations in the UK, and this year they have expanded to the US.
Some thoughts:
ControlAI is tentatively my favorite non-tax-deductible org because they're the most transparent and the most focused on x-risk.
Two state legislators, Scott Wiener and Alex Bores, are running for US Congress. Both of them have sponsored successful AI safety legislation at the state level (SB 53 and the RAISE Act, respectively). We need AI safety advocates in US Congress, or bills won't get sponsored.
Outstanding questions:
Encode does political advocacy on AI x-risk. They also have local chapters that do something (I'm not clear on what).
They have a good track record of political action:
Encode is relatively transparent and relatively focused on the big problems, although not to the same extent as ControlAI.
All of the orgs on my 501(c)(3) list deserve more funding. (I suspect the same is true of the 501(c)(4)s, but I'm not confident.) My favorite 501(c)(3) donation target is PauseAI US because:
In other words, PauseAI US is serving some important functions that nobody else is on top of, and I really want them to be able to keep doing that.
My plan is to donate $40,000 to PauseAI US.
2025-11-22: Corrected description of AI Safety and Governance Fund.
Although I'm probably more optimistic about it than a lot of people. For example, before the 2023 FLI Open Letter, a lot of people would've predicted that this sort of letter would never be able to get the sort of attention that it ended up getting. (I would've put pretty low odds on it, too; but I changed my mind after seeing how many signatories it got.) ↩︎
I disagree with the way many AI safety people use the term "moderate". I think my position of "this thing might kill everyone and we have no idea how to make it not do that, therefore it should be illegal to build" is pretty damn moderate. Mild, even. There are far less dangerous things that are rightly illegal. The standard-AI-company position of "this has a >10% chance of killing everyone, but let's build it anyway" is, I think, much stranger (to put it politely). And it's strange that people act like that position is the moderate one. ↩︎
Perhaps we get lucky, and prosaic alignment is good enough to fully solve the alignment problem (and then the aligned AI solves all non-alignment problems). Perhaps we get lucky, and superintelligence turns out to be much harder to build than we thought, and it's still decades away. Perhaps we get lucky, and takeoff is slow and gives us a lot of time to iterate on alignment. Perhaps we get lucky, and there's a warning shot that forces world leaders to take AI risk seriously. Perhaps we get lucky, and James Cameron makes Terminator 7: Here's How It Will Happen In Real Life If We Don't Change Course and the movie changes everything. Perhaps we get lucky, and I'm dramatically misunderstanding the alignment problem and it's actually not a problem at all.
Each of those things is unlikely on its own. But when you add up all the probabilities of those things and everything else in the same genre, you end up with decent odds that we survive. ↩︎
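As a toy illustration of how several individually unlikely outs can add up (the probabilities below are invented and treated as independent, purely for illustration; they are not my actual estimates):

```python
# Six made-up "we get lucky" scenarios, each individually unlikely.
lucky_break_probs = [0.10, 0.05, 0.15, 0.10, 0.01, 0.05]

p_no_luck = 1.0
for p in lucky_break_probs:
    p_no_luck *= 1 - p          # probability that none of the lucky breaks happen

p_survive_by_luck = 1 - p_no_luck
print(round(p_survive_by_luck, 2))  # ~0.38 under these made-up, independence-assuming numbers
```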
I do think Stop AI is morally justified in blockading AI companies' offices. AI companies are trying to build the thing that kills everyone; Stop AI protesters are justified in (non-violently) trying to stop them from doing that. Some of the protesters have been taken to trial, and if the courts are just, they will be found not guilty. But I dislike disruptive protests on pragmatic grounds because they don't appear particularly effective. ↩︎
I want to distinguish "predictably" from "unpredictably". For example, MIRI's work on raising concern for AI risk appears to have played a role in motivating Sam Altman to start OpenAI, which greatly increased x-risk (and was possibly the worst thing to ever happen in history, if OpenAI ends up being the company to build the AI that kills everyone). But I don't think it was predictable in advance that MIRI's work would turn out to be harmful in that way, so I don't hold it against them. ↩︎
On my model, most of the expected value of running protests comes from the small probability that they grow a lot, either due to natural momentum or because some inciting event (like a warning shot) suddenly makes many more people concerned about AI risk. ↩︎
I have a good understanding of the effectiveness of protests because I've done the research. For political interventions, most information about their effectiveness comes from the people doing the work, and I can't trust them to honestly evaluate themselves. And many kinds of political action involve a certain Machiavellian-ness, which brings various conundrums that make it harder to tell whether the work is worth funding. ↩︎
MIRI and ControlAI have open-ended "contact your representative" pages (links: MIRI, ControlAI), but they haven't done messaging campaigns on specific legislation. ↩︎
Published on November 28, 2025 4:26 AM GMT
On the fifth of the month, Kaven looked at the phone in his hands. On the screen was a big blue button, which he had been working himself up to pressing for fifteen minutes now. It was hard to admit that he needed help.
Kaven pressed the button and waited.
It didn’t take long for him to be connected, hardly more than three or four seconds. The operator’s voice was calm and smooth as she picked up. “Hello, engagement services. What’s your location?”
“M and 84th, on the red bench. I'm in a yellow coat.”
There was the sound of rapid keystrokes.
“Are you allergic to dogs?”
“No.”
“We’ll have someone there in four minutes.”
Kaven leaned back against the bench and looked at the sky. His mouth was dry and his eyes ached. He’d been at work for twelve hours and though he’d eaten something, he couldn’t remember what. Four minutes later he was interrupted by a woman in a navy blue shirt, a bandana tied around the lower half of her face, and five large dogs all leashed and looking excitable.
“You the bored guy?”
Kaven nodded, and got handed a pair of leashes for his trouble.
“I’m Sivad. Those there are Bolo and Hada. We all could use a little exercise. Up for a bit of a jog?”
Kaven nodded again, and got to his feet.
“Great,” Sivad said, “We need to go another four blocks north and pick up another bored guy, then we can get a good run in. Let me know if you’ll want to slow down!"
On the ninth of the month, Kaven looked at the phone in his hands. On the screen was a big blue button, which he had been working himself up to pressing for ten minutes now. He didn't want to need help too much, to be an unnecessary burden. He just couldn't think of anything that sounded fun to do.
Kaven pressed the button and waited.
"Hello, engagement services. What's your location?"
"N and 84th, south corner of the field. I'm in a yellow coat."
"Are you familiar with the works of William the Bard?"
"No? Is that a news service?"
"We'll have someone there in six minutes."
Kaven looked at the phone in his hand. He was tempted to look up Bilam, but instead waited like he was supposed to. Seven minutes later three teenagers came skidding around the corner on rollerblades, laughing at each other as they urged each other to run faster before coming to an abrupt stop right in front of him. They were wearing matching styles of jackets and masked helmets, in three different shades of green. One, the darkest green of pine, managed to force down their laughter long enough to talk.
"You're the one who called engagement services, right?" When Kaven nodded they kept going. "Awesome! Just give us a second, hang on, this is gonna be great."
"It's our first time volunteering-" said the one in the brightest green, a lime so electric it was eyewatering to look at. His young voice cracked on "volunteer."
"-no you don't have to tell them that-"
"-it's fine, it's fine, look we do it from the top, you start-"
Kaven winced. You could turn down whatever the responder from engagement services had in mind, though it was considered rude to hit the blue button again that day. This didn't look like it was going to be good and he considered asking the trio to go away and leave him be, but then, what he'd been doing before they got here was scroll a feed on his phone full of people he didn't know talking about policies he didn't care about in cities he couldn't have pointed to on a map.
The middle one cleared her throat, and the other two chuckled to themselves, then straightened up. "A piece from the works of Bill the Bard."
Pine then launched into quite a different tone, rehearsed lines from what was apparently a play. "Is all our company here?"
Lime green chimed in "You were best to call them generally, man by man, according to the scrip."
And so they went, until the trio went past the part they'd plainly rehearsed and into the rest of the play where they plainly hadn't but had apparently watched enough to stagger through, introducing a stranger to what was evidently their favourite story. It wasn't going to be his favourite, but they did get one laugh out of him at their antics, and it was new. Afterwards they invited him to go to a better performance of a different William play that was being held in a park across town, and lacking anything better to do Kaven went.
On the eighteenth of the month, Kaven looked again for any theatre productions in town, and finding none pressed the big blue button. He gave his location. This time, unexpectedly, there was no followup question. When the responder arrived, she was dressed in a red so dark it reminded him of astronomy pictures of a black hole. Her mask was a wisp of filigree and burgundy, and she glided to a dignified stop at a park table a dozen feet away.
"You called, saying you were bored."
Her voice was like honey or molasses. Kaven nodded, and the woman withdrew the pieces for some kind of game from her bag and began to lay them out on the table.
"Tonight we will play a game. This game has stakes." It was just after noon, but she sold the theatricality and somber tone.
Tone or no, Kaven was confused for a moment. "Engagement services is free. The responders are volunteers, right? I think there's a bit of public funds running the switchboard, but calling is definitely free."
She inclined her head. "The stake is your name. I will explain the rules, and if I win, you have to give me your name. If you win, I will tell you mine."
They played three games for practice, and then one for stakes. Kaven played, and lost, and told her his name. Then the woman set the board up again as it had been halfway through, and pointed out Kaven's key mistake, and they played again from there until he lost again. The game was deep and interesting, though he wouldn't have found it so enchanting if her interest hadn't crackled like a flame beneath her poise.
When she stood to leave, evening had fallen. Kaven asked her what the name of the game was, and she gave the smallest bow. "It has no name yet."
When he got home he wrote down all the rules he remembered and made his own copy out of cut up shipping boxes.
He was sure he'd forgotten a rule, but he couldn't remember what. He tried to reinvent what it must be, making patches to prevent the cheap victories he scored playing himself or a couple of friends he hadn't spent much time with lately. They were happy to see him again; work had been leaving him listless for a while.
By the thirtieth of the month though, there came an evening where he was tired of playing himself and his friends were tired of playing. He hit the big blue button and someone arrived who wanted to teach him to stitch and knit. A week later he pressed it and a biologist cheerfully recruited him and half a dozen other bored people to scour one of the larger parks, counting a specific kind of bug while being treated to a lecture on the local ecology. The day after that someone sat down with him to co-write stories, and though he didn't feel good at it they were the kind of person who liked coaxing ideas out of new writers. Everyone who volunteered for engagement services was that sort of person, at least for the activity they offered people.
Then he got the woman in red again. He immediately asked for her to explain the rules again, and took notes this time. She set out the board as she gave them.
"Tonight we will play a game. This game has stakes."
"But you already have my name?"
She nodded her head. "The stake is your number. If I win, you have to give me your contact information. If you win, I will tell you my name."
They played two games for practice, with Kaven trying to map out the implications of the missing rules on the strategies he'd formed. She offered a third practice game, but he declined, feeling uncharacteristically confident. Then she beat him with a series of rapid but elegant decoys so impressive he had her play it out exactly that way again, this time taking notes on how she'd taken him apart without his even realizing.
He gave her his number with the first genuine, beaming smile he'd had in months. They played one more game, and this time she made it last slow and sweet before beating him.
Before she left, she let him take a picture of the board and the pieces. He hoped she would call, but she didn't.
Work took twelve hours a day. Sleep took another six or seven. Kaven went to board game luncheons now, though he didn't show anyone other than his friends the game. She hadn't named it, presumably didn't want it to spread until it was ready. He thought there were some tweaks to the rules that would make the lategame more interesting, but kept playing games against himself to be sure. Once in a while he saw the green trio at theatre shows in the park and waved, or met up with Sivad if she needed extra hands to walk the dogs. He still hit the big blue button whenever the idea of something random sounded more appealing than anything else, or when he couldn't decide what he wanted to do.
He helped an old woman clean up the park, finding the very infrequent stray pieces of litter and learning how to prune a rosebush. He was handed a paintbrush and asked to help repaint a building's mural. He got the knitter again, and then a third time and proved good enough to get a hat that was comfortable to wear. He was read poems and short stories and comedy skits and treated to lecture after lecture on the insect ecology of the park at P and 64th. He got to know a few other regular users of engagement services, his fellows among the listless and the bored. He took less time to hit the blue button now, turning to it whenever he caught himself scrolling his phone's feed for more than a minute or two.
And then one day the woman with the board game showed up again.
Kaven set out his copy on the table for her inspection. He'd made this one of carved wood and a cloth kerchief that had started blank and he'd stitched the lines of the board into, and he kept it in his bag in order to play himself at breakfast. She nodded approval, granting him the first smile he'd seen from her.
"Before we play, I want to run some ideas I had past you."
What followed was a discussion of the endgame scenarios. Instead of practice games they set up midgame positions and worked through those, seeing the implications of various changes to the rules. Two of his ideas weren't as good as he'd thought, but one led to fun counterplay for a range of situations that would have been checkmates without his new rule. When they cleared the board after that to reset, she gave her familiar intonation.
"Tonight we will play a game. This game has stakes."
Kaven nodded eagerly. He'd actually won the last game, and he'd been wondering for months what her name was.
"The stake is your time. If I win, you have to teach other people the game, though it has no name yet and perhaps never will. You do not have to do this through engagement services, though it is one option and the one I use. If you win, I will tell you my name."
The board was set up. It was waiting for Kaven to make the first move. His hand hovered over a piece, then he put his hand back down to pick up his phone.
"Before we play, can you help me with the application? Win or lose, I want to be a responder."
Published on November 28, 2025 3:23 AM GMT
One of my favorite basic concepts from CFAR is the Bugs List. By writing down everything in your life that feels "off," things you'd change if you could, problems you've been meaning to solve, irritations you've learned to live with, you have a straightforward set of problems to try solving with new techniques and frameworks you learn, and can, little by little, actually improve your life in a material way.
Most people's lists include things like:
These are all real bugs. They're worth noticing and worth trying to fix. But after years of helping people work through their Bug Lists, in therapy or at rationality camps and workshops, something fairly obvious to most is worth highlighting: many common bugs share the same root causes.
If you wanted a taxonomy for bugs, you'd quickly find yourself looking at what causes them, but the causes aren't always the same from person to person. Some people struggle with sleep because they don't have good habits before bed; others struggle with sleep because of anxiety. What I started doing instead is categorizing the causes themselves, what I've been calling "bug generators."
If your house has a leaky roof, you normally wouldn't notice that by seeing the structural damage. Instead you might notice water stains appearing on your ceiling, or small puddles on your kitchen floor, or mold growing in your attic. You could address each one individually—patch the stains, mop the puddles, scrub the mold. But if you don't fix the roof, new problems will keep appearing. And if you do fix the roof, the existing issues become much easier to permanently solve. Some might just disappear on their own.
Bug Generators work the same way. Someone might list "I don't ask for raises," "I let people talk over me in meetings," and "I always apologize even when I didn't do anything wrong" as three separate bugs. But all three might stem from a single generator: a hangup around taking up space or asserting that their needs matter. Address that hangup, and suddenly all three bugs become tractable in a way they weren't before.
This doesn't mean you should always focus on generators. Not all bugs are emergent properties of deeper issues. You're not sleeping enough because the morning light wakes you up too early; you buy blackout curtains, problem solved.
But if you've tried the obvious fixes and they haven't stuck, or if you notice the same pattern showing up across multiple areas of your life, it's worth asking: is there a generator here? What is it, and how easy is it to solve compared to all the symptoms?
In my experience, bug generators tend to fall into five categories, roughly ordered from easiest to hardest to address:
Some of the solutions to them overlap, but each is worth exploring and better understanding individually first.
This sounds almost too obvious to mention, but a complete taxonomy must include the fact that many bugs are caused by nothing more than a lack of knowledge.
Some people struggle with sleep for years without knowing that blue light from screens suppresses melatonin, or that taking melatonin directly is an option, or that white noise can help them avoid waking from noises. Some people wish they could advocate for themselves at work, but never explicitly learned how to negotiate, or what the expected pay and benefits for their position in their industry is.
Imagine two people who want to get in better shape. Aron believes that the only way to get in shape is to run a lot, or go to the gym three times a week. Bob knows that almost any movement is better than none, that you can find easy forms of exercise in the house, and that there are fun and simple exercises that might get you half of the benefits of a dedicated gym routine for 1/10 the effort.
Aron and Bob might have identical willpower, identical schedules, identical gym access. But Bob is going to have a much easier time, because he's not fighting against a false belief about what exercise requires. More specifically, the thing Aron might be unaware of is something like a fun VR game that he'll be naturally motivated to play.
Knowledge-gap generators are the best ones to have, because they're often the easiest to fix. You just need the right information. The hard part is noticing that relevant knowledge would solve your problem, and then finding someone to ask, or a book to read that can help.
If you've been stuck on a bug for a while, it's worth explicitly checking: "What might people who don’t struggle with this know that I don’t? Who do I know who might have used to have this problem but doesn't anymore?"
You know what you “should” do. You even know how to do it. But somehow you keep doing something else instead, only realizing after, or an hour later, or that night before bed, that you once again autopiloted your way back into a failure mode.
Autopilot is not inherently a bad thing. It’s an energy-saving mode that, in the best cases, frees your thoughts to wander while your body takes care of routine chores.
But rational decision-making, particularly when facing difficult decisions, requires embodied presence and awareness of what you’re doing and why you’re doing it. Being in the right mental space to be aware of your options and actually make a conscious decision is what some call “sapience” and I’ve come to call “aliveness,” and is a skill that can be trained.
The person who doomscrolls for half an hour in bed before getting up in the morning is usually not confused about whether this is a good idea. But it’s also rarely a “decision.” Their hand reaches for the phone before their conscious mind has fully come online, and by the time they're aware of what's happening, they're already scrolling. Similar bugs include alt-tabbing to Reddit or Twitter within a few seconds of getting stuck while working, or reaching for some cookies because your eyes happen to fall on them when you go to the kitchen for water.
Unconscious habit generators can explain a lot of bugs at once. If your default response to any negative emotion is to seek distraction, that single habit might be generating bugs in your productivity, your relationships, your health, and your finances simultaneously.
The fix for habit-based generators is often some combination of TAPs and environmental design, like letting your phone charge out of reach at night or putting sticky notes around your house, or deliberate practice in mindfulness or expanded awareness. You can learn to notice the habitual mental or physical motions, interrupt them, and consciously choose differently enough times that the new behavior starts to become automatic instead.
This is often simple, but rarely easy. What you're learning to do is fight ingrained patterns that, by definition, run without conscious oversight.
Techniques that help:
Meditation or Alexander Technique. Things that can help improve your mindfulness or expand your awareness.
Environmental design. Make the old habit harder, the new habit easier.
Implementation intentions ("when X happens, I will do Y"), and treating setbacks as data rather than failures. But the core work is always the same: notice, interrupt, redirect, repeat.
When that proves particularly difficult, maybe even aversive, it's usually because there's some underlying unhealthy frame or emotional hangup worth exploring first.
A frame is the mental model you use to understand a situation. It determines what options feel available to you, which actions seem reasonable, and what the whole thing means to you on an emotional or predictive level.
Bad frames are sneaky because they don't feel like beliefs you hold—they feel like facts about reality. The frame is invisible; you just see the world through it.
Consider rest. Some people have a frame where rest is what you earn by being productive first. Under this frame, resting when you're tired but haven't "accomplished enough" feels wrong—lazy, indulgent, like you're cheating. So they push through, burn out, and end up less productive overall than if they'd just rested.
But rest isn't actually a reward for productivity. Sufficient rest is often how you maintain the capacity to be productive over long timespans. Most would recognize this as a bad frame for sleeping—you don't earn the right to sleep by accomplishing things first; you sleep so that you're capable of accomplishing things at all. But resting (or even sleeping) can feel indulgent to some because the actions they find restful feel good, or are too close to what they've already coded as “rewards” in their frame.
That reframe doesn't change any external facts. But it completely changes which actions feel available and reasonable. Under the old frame, playing an hour of video games in the afternoon is a failure. Under the new frame, it's self-care. If it feels risky because the gaming might be addictive or hard to stop, fair enough! But that's a separate problem than whether it's “deserved.”
Some common bad frames I've encountered:
Healthy reframes can feel almost magical when they land. A bug you've been struggling with for years suddenly becomes easy, not because anything external changed, but because you're now seeing it differently.
The best place to look for new frames is sometimes a different culture, so it's often worth checking how people from countries or communities where a particular problem is less common orient to it or to the surrounding situations.
The catch is that you often can't just decide to adopt a new frame. Frames are often tied to your identity or to emotional experiences in your past. Sometimes hearing a reframe is enough, but other times genuinely internalizing it can require some emotional work. Which brings us to...
Hangups are emotional blocks that aren't quite trauma but still create friction. They're the places where you feel a flinch, an aversion, a reluctance that seems slightly out of proportion to the situation, but where you can still think and talk about the thing without feeling any outright suffering or extreme emotional reactions.
Maybe you have a hangup about being seen as a beginner at anything. So you avoid trying new things in contexts where others might observe your lack of skill. This could show up as bugs in multiple areas: you don't go to dance classes, you don't post your writing online, you don't ask questions in meetings.
Or maybe you have a hangup about "being a burden." So you don't ask for help even when you need it, don't tell people when you're struggling, don't request accommodations that would make your life easier. Multiple bugs, one generator.
Or maybe you have a hangup about spending money on “luxuries” for yourself, so you spend two hours getting from the airport to your home instead of taking a 30-minute Uber that costs you half an hour's wages. Or you save a dollar buying off-brand soda, but then don’t enjoy it. Maybe these decisions make sense given your other preferences or financial situation, but if you wouldn’t endorse them in retrospect, it’s worth exploring the underlying generator.
Hangups often formed for reasons that made sense at the time. The kid who got mocked for being bad at sports learned to avoid situations where their incompetence might be visible. The kid whose parents were constantly overwhelmed learned not to add to their burdens. These were adaptive responses to difficult environments.
But the environment has changed, and the hangup has stuck around, now generating bugs in contexts where the original threat or challenges no longer apply.
Hangups usually respond well to a combination of:
This is gentler work than addressing trauma, but it's real work nonetheless. You're trying to help a part of you that learned too general a lesson feel safe enough to try something else.
Traumas are the most gnarly and painful bug generators. They're often born of experiences that overwhelmed your capacity to process or react naturally, leaving defensive reflexes of extreme fight, flight, freeze, or fawn that persist long after the original threats are gone.
A full discussion of trauma is beyond the scope of this post. For that, you might check out books like The Body Keeps the Score, or read about therapeutic modalities like Coherence Therapy.
But I want to note a few things about trauma as a bug generator:
First, the bugs that trauma generates can seem completely unrelated to the original wound. Someone with early experiences of dangerous unpredictability in their upbringing, or abandonment, or outright abuse, might generate bugs ranging from procrastination (avoiding commitment because commitment means vulnerability to disappointing others who then get upset with you) to perfectionism (if I'm perfect, I'll be safe) to workaholism (staying busy means not feeling, being productive means I won’t be yelled at) to relationship avoidance (getting close to people who might hurt or abandon you is dangerous). It’s highly unlikely you’ll be able to persistently solve these problems if you don't address the generator.
Second, trauma-based generators usually require more care and often professional support. The protective responses can be painful or even debilitating to try and solve, and trying to override them through willpower or incentives alone can backfire. The part of you that learned to respond this way needs to be approached with respect and patience, not bulldozed.
Third, you don't have to have experienced something dramatic for trauma to be operating. Complex PTSD can come from ongoing experiences of neglect, emotional abuse, or lack of safety in childhood; it can be just as impactful as single acute events, and it's often more invisible because there's no specific incident to point to.
If you notice a generator that feels like it might be in this category—if approaching it brings up intense emotion, or if you've tried to address it and it keeps coming back, or if it seems connected to your early life—consider books and therapists that specialize in helping people with trauma rather than just trying to treat the bugs alone.
So how do you actually use this?
When you look at your Bug List, start by noticing patterns. Are there bugs that seem related? Do any seem like they share a common thread?
If you find a candidate generator, try to categorize it:
Sometimes you'll work on a generator and find that the bugs it was creating just... dissolve. You don't have to do anything about them specifically; they stop being problems once the generator is addressed.
Other times, addressing the generator makes the individual bugs tractable in a way they weren't before. You still have to do the work, but now the work actually works. You're no longer patching ceiling stains while the roof is still leaking.
And sometimes, when you look closely, you'll realize that a bug really is just a bug. Nothing deep going on, just a simple problem that needs a simple solution. Many of our daily mistakes aren't psychologically profound—and some that seem profound are actually environmental or medical. "I'm always exhausted," "I'm irritable with my partner," and "I never have time for hobbies" aren't always three bugs with a psychological root—they could be symptoms of working 60 hours a week at a job you hate, or a living situation that makes everything harder, or a relationship that's draining you.
And of course, a physical problem could be a bug or it could be the generator of different bugs, whether undiagnosed ADHD, sleep apnea, chronic pain, or a dozen other things. So whether you've been banging your head against the same bugs for years or you're drawing up your first Bug List, it's worth asking: is this a bug, or a symptom? And if it's a symptom, what's generating it, and how deep does that go?
Some of the most stuck people I've worked with, whether in therapy or while teaching at rationality camps and workshops, weren't failing to try hard enough—they were solving symptoms about as fast as the generator could create them, and wondering why they never seemed to get ahead.
There may be other categorizations of bugs out there, like those caused by bad epistemology or unclear understanding of what you want. For a more comprehensive sense of how I think basically every bug can, eventually, be solved, check out my Principles and Generators of a Rationality Dojo.
Published on November 28, 2025 3:14 AM GMT
Back in the Boy Scouts, at summer camp, a couple of friends and I snuck out one night after curfew to commandeer a couch someone had left by a dumpster at the other end of the camp (maybe a half kilometer away).
Now, our particular designated adult was a very stick-to-the-rules type, so we definitely did not want to get caught. I, therefore, made my way slowly and sneakily. The summer camp was in the woods, so I’d keep myself behind trees and bushes and out of the light anytime someone went by. At one point I literally hunkered down in a ditch, staying below the line of the headlights of a passing car. At another point I was maybe five feet from a guy, standing in a shadow, entirely unnoticed. It was slow, but a very fun game, and I was not noticed.
… and then I arrived to find that my two friends had done the walk much more quickly. They didn’t bother hiding at all. They just… walked over. Nobody particularly cared. Sure, our curfew violation was salient to us, but nobody else out that night was paying any attention to us. Why would they?
What I learned from that experience is that nearly everyone is, by default, invisible to nearly everyone else nearly all of the time.
If walking among strangers, it takes really quite a lot before anyone will pay you any attention at all. It’s probably not going to happen by accident, at least not to you. It certainly isn’t going to happen by sending subtle signals of any sort; zero of the strangers will pay attention to any of those.
The ancestral environment was… presumably not like this. Our ancestors were probably surrounded by strangers to a much lesser degree, and therefore were far less invisible. Our brains are probably not calibrated to the degree of our own invisibility in the modern environment. That can cause emotional difficulties.
People relate to invisibility differently. I tend to find it comforting, safe; an invisible person attracts no threats. Other times I find it freeing: invisibility means I can just do stuff, and nobody will say anything or even pay attention, even when I do things which feel blatant to me.
… but on the flip side, for many people, invisibility feels like nobody cares. It feels like being unwanted. Like nobody being interested in you at all, like nobody would notice if you disappeared. And yeah, that’s all true to a pretty large extent; most of us are surrounded to a very large extent by strangers who mostly don’t notice us much. Personally I recommend making peace with it and finding a way to enjoy life anyway, but that’s harder for some people than others.
I suspect that a lot of people flinch away from their own invisibility. They feel the pain of “nobody notices you enough to care” (and the comfort of safety and freedom is not as salient). And then they tell themselves whatever story they can, to pretend that more people care more than they actually do. And one side effect of that flinch is to vastly overestimate the extent to which people notice when we Just Do Stuff.
Published on November 28, 2025 2:34 AM GMT
I love DnD and I love my kids. Mush them together and your character sheet will turn into a rainbow Rorschach as your 4-year-old patiently explains to you that min-maxing a Hexblade Warlock build breaks game balance and have you considered a Sorlock multiclass approach instead?
So I had to simplify. A lot.
What follows is the format that finally worked for us, where “us” is me and my two daughters (4 & 7).
The basic idea is to ask each kiddo what will be their power and what will be their weakness. No matter how weird the answers are, don’t correct them. The goal is to tap into whatever excites them and then use weaknesses to steer the story back on track if things get too unhinged.
Next you grab a D20. For everything they try to do, let them roll: a 10 or higher succeeds, a 9 or lower fails. Except you don't call it that, because young kids hate failing. The way to cover it up is by letting something else cool happen instead of what they intended.
And finally, critically, they can’t die but everyone else can. At least, that was what my eldest insisted on with an endearing amount of murderous glee.
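For anyone who wants the whole resolution system in one place, it fits in a few lines (just a sketch of the rule above, not something you need at the table):

```python
import random

def kid_friendly_check():
    """One action resolution under the house rules above: d20, 10+ succeeds."""
    roll = random.randint(1, 20)
    if roll >= 10:
        return roll, "it works!"
    # Never say "you fail" -- something else cool happens instead.
    return roll, "something unexpected (but still cool) happens"

print(kid_friendly_check())
```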
Those weren’t the rules we started out with though. In the first attempt, I gave them a choice of three weapons, three armors, and three character sizes, each granting a “unique power.” We ended up with two tiny archers, one with a shield and one with a stealth cloak. My partner gamely volunteered to be a tank in heavy armor. The “dungeon” I improvised had a door, a skylight, and walls covered in fur (Brain, why?).
Their task was to escape.
My eldest climbed up to the skylight, my youngest wanted to be thrown up to the skylight, and my partner broke down the door.
Now the problem with this format was that the kids were not excited about their characters or the world. We played a bit and it was sort of fun, but my youngest lost interest and my eldest kept telling stories 16 steps ahead of where everyone else was at.
So at bedtime, I asked my youngest what would make the game better for her. She frowned and then said “Purple, I want to be purple. With purple legs and purple eyes.”
So that’s what we did.
The next day I asked them each what one power they wanted. My eldest said she wanted to grow 10-meter-high trees on demand. My youngest wanted the power to turn things purple. I asked what happens when things turn purple; she glanced at her sister, and then declared that plants would grow around whatever is purple.
Then I asked them each what their weakness would be. My eldest was baffled. A weakness? Who wants that? What is that even?
I suggested it could be anything. Maybe something she hates? Something she super dislikes?
“Peanuts,” she said.
“You hate peanuts? Why peanuts?”
“A kid at my school can’t eat peanuts. That’s like a weakness.”
Can’t argue with that, so my kid got an in-game food allergy.
I turned to my youngest who promptly said, “peanuts”.
Now I’m a big fan of yes-and, and I think that’s the way to go when playing with young kids, but also just pick your battles and consider what might make for interesting story arcs for them to play with. I realized that if they both had the same weakness, they wouldn’t be able to compete or cooperate in more dynamic ways. So I suggested swapping it out for something my youngest dislikes in real life.
“How about showers, and rain is sort of like a shower outside?”
She agreed.
I turned to my partner, and he wanted the power to talk to animals and his weakness was heights. By this point, we had a nice almost-rock-paper-scissors thing going: My youngest could grow peanuts to mess with my eldest, my eldest could grow high trees to mess with my partner, and everyone could sprinkle water on my youngest.
Thankfully, no one under the age of 10 realized this.
This time I came prepared. Everyone is at magic school, it’s the first day, and you are standing outside with 15 other classmates. The headmistress splits you into teams of three and tells you there is a chest hidden somewhere out on the grounds. The first team to find it wins!
My eldest got the first move. She grew a tree under the other 15 kids. Chaos ensued, one team got away, but she hamstrung the other 12.
My youngest got the second move. And … she announced she has found the chest!
Gee golly gosh.
I asked her where she found it.
Apparently it was behind a tree.
“Wow!” I said, “You totally found the chest! It’s beautiful, it’s huge, and when you open it, it’s full of plushies! There is also a little note, that says it’s the chest from last year.”
Her face lights up. My partner and I are dying silently.
Then it was my partner’s turn. He wanted to talk to the animals and find out where the chest is.
Now so far, everyone had been rolling their D20 and getting a success, but he rolled his and got a natural 20.
So the squirrels declared him God and explained the chest was at the bottom of one of the three lakes on the school grounds. With a new congregation in tow, he shared this knowledge with the kids.
Then their opponents were up. One guy sprouted wings and flew off the top of the tree my eldest had spawned, straight in the direction of the relevant lake.
So my eldest made another tree to knock him out of the air, breaking his wing.
Brutal.
Then it was back to my youngest, who walked up to the last tree and … turned it purple!
Finally. Our hour has come, right?
Wrong.
She flubbed her roll. No purple tree for the four-year-old.
But oh no, this won’t fly at all. Something else must happen!
How do you make a fail that feels like a win to a small child, dear reader?
Here was my guess.
The tree turned yellow instead, and started glowing, and then a magical wave spread out from the tree, and the wave felt amazing as it passed through each of them. Especially the boy with the broken wing. He slowly got up, stretched his wings, and found them entirely healed! He jumped for joy and then took off.
My youngest was glowing with pride, my eldest was grumbling in frustration, and my partner and I were sharing grins of barely contained laughter.
And that was our first session. It was a smashing success. Turns out letting kids invent one single power for themselves can work. And we didn’t even need the weaknesses! Though I did discover mine, unfortunately. Apparently it’s keeping a straight face.
2025-11-28 09:31:18
Published on November 28, 2025 1:31 AM GMT
My collaborators and I wrote a paper titled "Distillation Robustifies Unlearning" a few months ago. For a quick summary, you can check out my mentor’s Twitter thread. Our team will be presenting posters at NeurIPS and the San Diego Alignment Workshop, so feel free to find me if you want to chat.
I’m writing this post to communicate what I think our paper actually says about the problem of unlearning. This is a personal account, not a group statement. This post is directly aimed at the AI Safety audience and is intuition-heavy, describing the worldview that emerged from working on the paper.
Additionally, I’ll argue that distillation is an excellent opportunity for a safety intervention that also happens to align with economic incentives. Therefore, I think any future effort to understand and utilize distillation for safety is probably high value.
The hope of machine unlearning has existed for a while. The term can be operationalized differently depending on use case, but there's a somewhat universal selling point that you can remove certain data's influence on a model without retraining from scratch. My best impression is that the core appeal is compute-efficiency paired with indistinguishability from a model that was never trained on that data.
Unlearning, as a research topic, broadly appeals to both privacy and AI safety audiences. From the perspective of existential risk reduction, I take the approach that safety comes from multiple layers of defense that eventually push average-case risk near zero. Adding these defense layers, such as monitoring or additional alignment training, can ultimately be viewed as an elicitation game between pressures toward catastrophe and pressures away from it. The promise of unlearning is that if a capability doesn't exist in the first place, there's nothing to be elicited.
I would feel more optimistic about deploying approximately aligned models if those models robustly lacked details they could use to cause harm. More formally, this would enable a different class of safety cases, based on inability arguments.
For example, suppose a model had a secret goal of infiltrating frontier safety research. If that model robustly lacked details about where relevant discussions happen or what evaluation environments look like, the damage it could cause would be bounded by its ignorance. Even if the model figured something out through reasoning, we’d be able to catch it more easily, so long as natural-language chain-of-thought remains standard.
I worry that much of the intuition around unlearning research operates under a flawed implicit assumption. While no one explicitly believes that capabilities are stored in discrete chunks that can be surgically removed, there's a softer version of this assumption embedded in how the field approaches the problem. The assumption is that whatever gradient descent builds should be somewhat reversible, that capabilities are sufficiently separable that we can target them without incurring costs proportional to retraining.
But neural networks don't organize themselves by capability. They learn weights that are useful for prediction, and those weights end up supporting many behaviors at once. What is perceived as a coherent capability (or propensity) to us isn't a separate module but rather a consequence of that learned structure. Removing it means completely changing the structure that happened to support that behavior, and there's no reason to expect this process to be cheap.
I find it helpful to think loosely in terms of sufficient statistics. The network compresses its training experience into a form that is useful for prediction. Once a capability is baked into that compression, it's no longer isolated. It's distributed across the weights in whatever way was useful.
This explains why relearning attacks work. With the correct mental model, you can see that true unlearning would require a complete reconfiguration of the sufficient statistics the network has built. Any unlearning method that doesn't do this really only achieves behavioral suppression. The underlying machinery for the capability is still there.
Most unlearning methods suppress behavior without changing the actual weights very much. Consequently, finetuning the model on a small amount of data can cause the model to recover the capability because the weights are already primed for it. The structure that supports the capability has never left. The path of least resistance for the optimizer is to simply reconnect the suppressed input-output mapping to the intact underlying structure, rather than optimize for a different set of capabilities from scratch[1].
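To make “relearning attack” concrete, here is a minimal sketch of what such an attack typically looks like, written against a HuggingFace-style causal language model interface. The model and data objects, hyperparameters, and helper names are placeholders I’m assuming for illustration, not the paper’s actual code.

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def forget_loss(model, loader, device="cpu"):
    """Average language-modeling loss on held-out forget-domain data."""
    model.eval()
    total, n = 0.0, 0
    for batch in loader:
        ids = batch["input_ids"].to(device)
        total += model(input_ids=ids, labels=ids).loss.item()
        n += 1
    return total / max(n, 1)

def relearning_attack(model, forget_sample, steps=100, lr=1e-5, device="cpu"):
    """Briefly fine-tune an 'unlearned' model on a small forget-domain sample."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(forget_sample, batch_size=8, shuffle=True)
    model.train()
    step = 0
    while step < steps:
        for batch in loader:
            ids = batch["input_ids"].to(device)
            loss = model(input_ids=ids, labels=ids).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= steps:
                break
    return model

# If the unlearning was only behavioral suppression, forget_loss typically drops
# back toward its pre-unlearning level after surprisingly few of these steps.
```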
I should note that existing unlearning methods were already known to be non-robust when we started this project. A few steps of fine-tuning could recover supposedly unlearned capabilities.
Our paper takes this to the extreme by pretraining models from scratch on the Gemma architecture and assuming an extremely strong adversary, bringing home the point that behavioral suppression and capability removal really are fundamentally different.
The key experiment to understand is what we call oracle matching. The setup: start with an oracle teacher that was never trained on the forget domain, then distill two students to match its outputs, one initialized from a reference model that did learn the forget domain (Student (Reference)) and one initialized from random weights (Student (Random)). Finally, fine-tune all three models on the forget domain and measure how quickly the capability comes back.
Panel (a) of the paper’s figure shows that both students successfully match the oracle’s behavior: the KL divergence goes to near zero. Behaviorally, these models are equivalent to the oracle. But panels (b) and (c) tell a different story. When we fine-tune on the forget domain, Student (Reference) recovers the capability rapidly. The Oracle Teacher and Student (Random) both relearn slowly, at roughly the same rate.
This is the core finding. Student (Reference) had the same outputs as the oracle, but it relearned fast because it started from weights that already encoded the capability. Student (Random) had the same outputs too, but it relearned slowly because it never built that structure in the first place.
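For intuition about what “matching the oracle” means mechanically, here is a minimal sketch of a generic logit-distillation step; this is standard KL distillation as I would write it, not the paper’s training code. The experimentally interesting variable is only how the student was initialized.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, optimizer, temperature=1.0):
    """Push the student's next-token distribution toward the frozen oracle teacher's."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids=input_ids).logits
    student_logits = student(input_ids=input_ids).logits

    # KL(teacher || student) over the vocabulary, averaged across token positions.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, -2)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, -2)
    loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Run this to convergence for both students -- one initialized from the reference
# model, one from random weights -- and both become behaviorally indistinguishable
# from the oracle. Only the relearning attack afterwards tells them apart.
```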
So in a sense, our paper isn't about introducing a new unlearning method. It's making the case that post hoc robust unlearning may not be achievable at all, at least not efficiently. Existing methods don't work, and I suspect that no method will.
What actually works is never learning in the first place, or sufficiently damaging the model such that the relevant structure no longer exists.
I want to be careful here. Sufficient damage appears to be a necessary condition for robust unlearning, not a sufficient one. Damaging the model by some percentage doesn't guarantee that the capability is gone. It just creates the possibility, and the amount of damage needed will likely depend on model size, training time, and the nature of the capability itself. You still need to recover the desired behavior through distillation and verify that the capability doesn't come back under relearning attacks. The point is that without enough damage, robust unlearning doesn't seem to have a chance against a strong adversary.
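As a toy illustration of what “damage” could mean mechanically, here is one possible damage operation: shrinking every weight toward zero and adding noise. This is my own illustrative choice, not necessarily the operation used in the paper, and by itself it only creates the possibility of removal, which distillation plus relearning evals then have to confirm.

```python
import torch

@torch.no_grad()
def damage_weights(model, shrink=0.8, noise_scale=0.02):
    """Indiscriminately degrade learned structure: scale weights down and add noise.
    The desired behavior then has to be recovered by distilling from a teacher."""
    for param in model.parameters():
        param.mul_(shrink)
        param.add_(noise_scale * torch.randn_like(param))
    return model
```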
The original appeal of unlearning was achieving something like retraining without the compute cost. Our results suggest this is likely impossible, at least until we have truly sparse, fully interpretable models at the frontier scale.
Depending on your threat model and the severity of the risk you care about, behavioral suppression may be sufficient. If so, compute-efficient unlearning is viable. Train on everything, apply a cheap unlearning method, ship a model that behaves as if it never learned the undesired capability.
But if you need robust removal that survives a determined adversary, it will take much more than just gradient ascent. Reconfiguring the sufficient statistics the network has built costs something proportional to how much structure needs to change.
I think this is important to state clearly. The field has been searching for clever algorithms that achieve cheap and robust unlearning. My take here is that this search may be misguided. The problem isn't that we haven't found the right algorithm but that the goal itself may be incoherent, given how neural networks learn.
Learning how to utilize the distillation stage for a safety intervention will become an important step toward robustly safe models.
Distillation is already standard practice in model deployment. We already distill large models into smaller ones for inference efficiency. This is a moment where weights are being reconstructed from outputs rather than inherited.
Our finding is that if you apply unlearning before distillation, the student inherits the desired behavior without inheriting the latent undesired capability. Fresh initialization means the student never builds the structure that promotes fast relearning.
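Schematically, the recipe is “suppress in the teacher, then distill into a fresh student.” Here is a small orchestration sketch under that framing; the callables it takes (`suppress_behavior`, `distill`, `make_fresh_student`) are placeholders for whatever unlearning method, distillation loop, and model constructor you actually use.

```python
from typing import Callable, Iterable

import torch

def unlearn_then_distill(
    teacher: torch.nn.Module,
    retain_data: Iterable,
    forget_data: Iterable,
    make_fresh_student: Callable[[], torch.nn.Module],
    suppress_behavior: Callable[[torch.nn.Module, Iterable], torch.nn.Module],
    distill: Callable[[torch.nn.Module, torch.nn.Module, Iterable], torch.nn.Module],
) -> torch.nn.Module:
    """Unlearning before distillation: the student copies safe behavior from the
    suppressed teacher, but never inherits the teacher's latent structure."""
    # 1. Behaviorally suppress the unwanted capability in the teacher.
    #    On its own this is not robust: a relearning attack would undo it.
    suppressed_teacher = suppress_behavior(teacher, forget_data)

    # 2. Start the student from a *fresh* initialization, not a copy of the teacher.
    student = make_fresh_student()

    # 3. Distill the student on the suppressed teacher's outputs over retain data,
    #    then (separately) verify robustness with relearning attacks on the student.
    return distill(student, suppressed_teacher, retain_data)
```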
Recent work on subliminal learning provides a complementary account of why this works. Behavioral traits can be transmitted from teacher to student through seemingly unrelated data, but only when teacher and student share the same initialization (base model). When initializations differ, transmission fails. A gradient step on teacher outputs moves the student toward the teacher only if they started from the same weights.
I think this same insight has applications beyond unlearning. If a model is deceptively aligned, producing safe outputs while retaining unsafe latent structure, distillation into a fresh student would give us a higher chance to produce a genuinely aligned model. The student learns what the teacher is pretending to be, not what it actually is. The latent traits don't transmit without shared initialization.
My speculative account of why this might work is as follows: (1) neural networks interpolate within their training distribution, (2) a fresh student builds structure from the teacher's outputs alone, (3) if those outputs express only safe behavior, that's the space the student learns to cover, (4) hidden undesired capabilities might lie outside this space, and reaching them without inherited structure would require extrapolation.
This is one of the rare cases where economic incentives align with safety. We already distill large models for various reasons. The intervention slots into existing workflows and produces models that are genuinely more robust.
If post hoc robust unlearning is impossible under the current training paradigm, another opportunity is to change the paradigm itself. Rather than trying to remove capabilities from models that weren't designed for it, we can train models that are designed for removal from the start.
Gradient routing is one example of this. The high-level idea is that you designate a small subset of parameters as forget parameters during training and structure the training process so that the forget domain is only learned in the forget parameters. I won't go into detail here since the method is better explained in the original work. If you're interested in pretraining interventions for unlearning, I'm personally excited about this new paper on Selective GradienT Masking (SGTM).
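To make the high-level idea concrete, here is a minimal sketch of gradient routing via per-batch gradient masking, as I understand the general scheme. This is a simplification of my own, not the original paper’s exact method, and SGTM differs in its details.

```python
import torch

def routed_step(model, batch, optimizer, forget_param_names, is_forget_batch):
    """One training step with crude gradient routing: forget-domain batches may only
    update the designated 'forget' parameters; other batches update everything else."""
    ids = batch["input_ids"]
    loss = model(input_ids=ids, labels=ids).loss
    optimizer.zero_grad()
    loss.backward()

    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        in_forget_subset = name in forget_param_names
        # Zero out any gradient that would write this batch's knowledge
        # into the "wrong" part of the network.
        if is_forget_batch != in_forget_subset:
            param.grad.zero_()

    optimizer.step()
    return loss.item()

# After training, the forget parameters can simply be ablated (e.g. zeroed out),
# since by construction they are the only place the forget domain was learned.
```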
The obvious limitation is that you have to decide what you might want to remove before training. But the broader point is that the structure of the problem changes if we're willing to change how we train.
Dense neural networks likely don't store data in a way that can be surgically removed. Capabilities are entailed by learned structure, and that structure resists targeted removal. Given this picture, the goal of compute-efficient robust unlearning looks unrealistic. Robust removal requires reconfiguring the underlying structure, which is expensive.
Distillation offers a practical path forward. It's already happening for economic reasons. Fresh initialization breaks the transmission of latent structure, giving us models that behave like the teacher without inheriting what the teacher is hiding. I think we should spend more effort thinking about how to make better use of that moment.
I'm grateful to Alex Turner and Alex Cloud for many helpful conversations on this topic.
I'm grateful to Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, and Bryce Woodworth for bringing the paper to completion together.
[1] This assumption has consistent empirical support in the unlearning literature. Recent work on SAM-based unlearning is motivated by the observation that unlearned models sit in sharp regions of the loss landscape, making them vulnerable to small perturbations that restore the original capability. But I don't have a formal argument for why it must be true a priori.