Published on December 2, 2025 11:03 AM GMT
Safety Cases are a promising approach in AI Governance inspired by other safety-critical industries. They are structured arguments, based on evidence, that a system is safe in a specific context. I will introduce what Safety Cases are, how they can be used, and what work is currently being done on them. This explainer leans on Buhl et al. (2024).
Imagine Boeing built a new airplane, and you’re supposed to get on it. What evidence would you want for the plane’s safety?
Option 1: “We tested 100 flights. No crashes. Some minor issues came up, but pilots handled them fine.”
Option 2: “We've identified every failure mode. For each: here's why it physically cannot happen, or if it does, why backup systems guarantee safety.”
Of course, this was a leading question. Everybody would prefer Option 2. And good news: this is actually how airplane safety works! Governments mandate that companies provide extensive evidence making an affirmative case that the plane will not crash: stress testing the structural integrity of all components, demonstrating failure redundancy, simulating extreme weather scenarios, …. Only after following this lengthy process and getting government approval can these airplanes be sold. Such Safety Cases are also essential parts of the risk management and regulation of other safety-critical industries such as nuclear reactors, …, and autonomous vehicles.
However, current AI Risk Management looks more like Option 1. Before releasing a new model, companies run some evaluations for dangerous capabilities and guardrails. These cover some narrow threats under some circumstances. The model sometimes misbehaves by hallucinating, being deceptive, or reward hacking, but the developers claim it’s unlikely to cause large harm. Even if the evaluation results were different, the company could still just release the model if it wanted.
As AI becomes capable of posing greater risks and more deeply integrated into safety-critical systems, Risk Management for AI needs to mature. And indeed, this has been happening:
| Model | Safety argument |
| --- | --- |
| GPT-3 | “Idk it’s probably fine.” |
| GPT-4 | “We tested some dangerous inputs to see whether the model refuses them.” |
| GPT-5 | “We ran multiple benchmarks for some different risks and found that none surpassed our pre-committed risk threshold.” |
| GPT-6 | Hopefully: “We can make a structured, convincing, affirmative case approved by neutral third parties why our model does not pose catastrophic risk.” ⇒ This is a Safety Case! |
Development of Risk Assessment in Frontier LLMs
A Safety Case:
A structured argument, supported by evidence, that a system is safe enough in a given operational context.
Unlike testing-based approaches that ask 'did we find problems?', Safety Cases ask 'can we prove the AI is safe?’. A Safety Case could be used to justify claims such as "Model X poses minimal misuse risk and can be released safely, as proven by the following Safety Case."
Creating such a Safety Case requires breaking safety claims into verifiable sub-claims, gathering evidence (evals, proofs, expert judgment, …) for each, and stating assumptions clearly. Buhl et al. (2024) describe four components of Safety Cases:
Making a Safety Case is a lot of work. Why go through all that effort?
Creating a Safety Case requires us to explicitly lay out our reasoning. This helps others understand where we are at, but it also helps to identify gaps in our arguments. For example, a reviewer could point out that our safeguard defences fail under some assumptions, based on which we can adapt the safety case.
Notably, the responsibility of proving safety is put on the companies providing the AIs. This is great because providers tend to have the most expertise and insight into their AI model.
Additionally, it incentivizes innovation on risk mitigation and assessment on the side of the provider: better safeguards or evaluations enable the company to make a more convincing safety case. Good practices such as documentation of processes and ongoing incident monitoring also support the use of Safety Cases.
In contrast to rules, Safety Cases provide flexibility. Companies can decide which strategy they would like to use to argue for a claim. Additionally, Safety Cases can be adapted to still fulfill the claim when the technology or risks change. And the same objective can apply to different kinds of systems, which can follow different strategies for proving safety.
Lastly, Safety Cases can also serve as an aspirational goal to guide research. By attempting to sketch out a Safety Case, we can identify gaps, assumptions, and promising avenues. This helps us to prioritise research directions. Additionally, Safety Cases can enable collaborations between different research cultures. Policy makers, prosaic safety researchers, and theoretical ML researchers can all contribute their knowledge to this one shared document. [Teixeira et al argue SCs could be such a “boundary object”]
However, there are also downsides:
Let’s walk through a simplified scenario: We want to release a new Frontier LLM, but the government requires us to argue convincingly that this model cannot be used to assist in building a bomb.
An Objective is a claim about the safety of our model. Our objective is “This model does not increase risks from malicious actors building and using bombs”. Ideally, such an objective is quantitative, specific, and directly related to the downstream risk. We will now try to prove that this claim is true for our model.
Let’s choose a strategy that lets us argue for this claim. Which of the strategies in the figure below should we choose to argue for our objective?
Examples of Strategies taken from Clymer et al 2024.
Let’s go with Control: We will argue that the safeguards we have in place would prevent users from getting help in bomb-building from our model.
We just made an argument! We broke down our original claim into subclaims, such that if both subclaims are true, our original claim is also true. Next, we need to argue for the two subclaims.
From now on, we will keep decomposing claims into subclaims via arguments or directly prove that a subclaim is true by providing evidence for it. In this way, we will build an argument tree where the leaf nodes are pieces of evidence.
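To make this structure concrete, here is a minimal sketch of how such an argument tree could be represented in code; the class and field names are purely illustrative assumptions, not part of any standard Safety Case tooling.

```python
# Minimal illustrative sketch of a safety-case argument tree: every claim is either
# supported directly by evidence (a leaf) or decomposed into subclaims whose joint
# truth is argued to imply it. Names and fields here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    description: str  # e.g. "Red-teaming exercise report"

@dataclass
class Claim:
    statement: str
    evidence: list[Evidence] = field(default_factory=list)   # direct (leaf) support
    subclaims: list["Claim"] = field(default_factory=list)   # decomposition via an argument
    assumptions: list[str] = field(default_factory=list)     # scope / conditions of validity

    def is_supported(self) -> bool:
        """A claim counts as supported if it has direct evidence,
        or if all of its subclaims are supported."""
        if self.evidence:
            return True
        return bool(self.subclaims) and all(c.is_supported() for c in self.subclaims)

# Tiny example mirroring the bomb-assistance walkthrough in the text:
objective = Claim(
    statement="This model does not increase risks from malicious actors building bombs",
    subclaims=[
        Claim("Our list of risk sources is complete",
              evidence=[Evidence("Risk Modelling Report")]),
        Claim("The model refuses relevant queries",
              evidence=[Evidence("Red-teaming exercise")],
              assumptions=["Model weights are not leaked, so safeguards stay in place"]),
    ],
)
```

The recursive check mirrors the idea that a claim holds either because evidence backs it directly or because all of its subclaims hold.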
So let’s keep breaking down our subclaims. In this step, we decide to conduct Risk Modelling to identify three possible sources of risk:
We just used our first piece of evidence! The Risk Modelling Report backs up our claim that our list of risks is complete. Evidence can take many forms. We could use benchmark scores, expert judgements, documentation of company processes, mathematical proofs, test exercises, … Really, we can get creative here as long as our evidence convincingly supports the claim.
What evidence could we use to prove that our model will refuse relevant queries? For our safety case, we will use a red teaming exercise as evidence. If red teamers cannot get the model to assist with bomb building, this indicates that malicious actors cannot either. To strengthen this argument, we make another claim: that the red teaming adequately assesses the risk.
This is a nice argument, but we are not quite done yet. The whole time, we assumed that our safeguards could not be easily removed. But what if the weights of our model get leaked? This change in assumptions would make our safety case invalid. To enable reviewers to evaluate the validity of our arguments, they need to know the scope in which they apply. We need to state under what assumptions and in what situations our safety case would fail. We can simply note them or add them to the specific evidence that would invalidate them:
Ideally, reviewers also have information about our model's training and know which deployment settings and use cases are covered by the Safety Case.
We just made a (very primitive) safety case! Real-world Safety Cases will be much longer, might use multiple pieces of evidence for one claim, or are probabilistic instead of definitively proving claims. But now you have all the fundamental building blocks.
By themselves, Safety Cases are just a piece of paper. Their power comes from influencing decisions around the development and deployment of AIs. Making sure that Safety Cases have the intended effect is largely an institutional challenge. There need to be strong processes, the right incentives for the actors involved, and enough capacity.
Self-regulation: Safety Cases can inform major safety-critical decisions a company makes. By having safety-relevant information laid out in a systematic, comprehensible way, senior management can make better decisions about training runs or model deployments. Furthermore, safety cases can be used as a tool for ongoing risk management and to build trust with third parties.
Regulation: Governments can mandate companies to provide safety cases to assess whether they comply with regulations. For example, companies could be required to prove that their model poses under 1% risk of catastrophic damage. Furthermore, Safety Cases could provide transparency by requiring companies to share important safety information with government agencies.
Safety cases can complement other governance tools. They can be triggered when certain risk thresholds are crossed (as in Responsible Scaling Policies), provide structure for transparency reporting, serve as the basis for liability schemes, or be required for attaining model licenses.
Generally, Safety Cases can fulfil many functions such as:
There are many design considerations for Safety Cases. There are typically three roles:
To highlight important design considerations of Safety Cases, we will imagine two scenarios where Safety Cases fail to have their intended effect.
Scenario 1: “A company makes a Safety Case for their model. Based on this report, the board approves the deployment of the model, but upon deployment, the model turns out to be unsafe.” What could have gone wrong?
Scenario 2: “The US Government reviews a Safety Case for a new model. They approve the model for release, but 2 months later, the model is used to cause catastrophic harm.” What could have gone wrong?
EU AI Act Code of Practice:
The CoP is the regulation of LLMs that requires the most thorough Risk Management. Companies must submit a Safety and Security Model Report, including “a detailed justification for why the systemic risks stemming from the model are acceptable”. For this, they need to assess sources of risk, set criteria for whether the risk is acceptable, and justify why these criteria are appropriate.
This has similarities to Safety Cases. It places the responsibility for risk assessment and mitigation on the companies, requires them to provide justifications for the safety of their system, and leaves some flexibility for how to make this justification. However, it does not require structured, comprehensive arguments.
Responsible Scaling Commitments:
Anthropic and DeepMind have made commitments that indicate they need to provide Safety Cases once their models cross certain capability thresholds:
DeepMind's Frontier Safety Framework gives the most explicit commitments to Safety Cases. Models that cross certain Critical Capability Levels for Misuse or ML R&D can only be released if an adequate Safety Case can be made. Safety Cases could include risk reduction measures, the likelihood and consequences of misuse or breaking the safeguards, the scope of deployment, comparisons to other available models, and historical evidence. The Safety Case is reviewed by an “appropriate governance function” and is monitored after deployment.
Similarly, Anthropic's RSP states that once a model crosses the AI R&D 4 threshold: “we will develop an affirmative case that (1) identifies the most immediate and relevant risks from models pursuing misaligned goals and (2) explains how we have mitigated these risks to acceptable levels. The affirmative case will describe, as relevant, evidence on model capabilities; evidence on AI alignment; mitigations (such as monitoring and other safeguards); and our overall reasoning.”
Both companies have made first attempts at creating narrow Safety Cases. Such “practice runs” are essential for building institutional know-how and developing best practices for when safety cases become load-bearing. Anthropic made a structured, evidence-based argument that their Opus 4 is not capable of, or inclined toward, posing sabotage risk. DeepMind built a Safety Case concluding that current models are “almost certainly incapable of causing severe harm via scheming in real deployment”.
Notably, the commitments made by companies with respect to Safety Cases are very vague and can be ignored at any time.
The field of AI Risk Management is moving from spotty Safety Evals to more structured Risk Management. Safety Cases could be the next step in this trajectory, but they are currently more aspirational than operational. They are useful for structuring thinking, but lack the institutional teeth and technical evidence that make aviation safety cases load-bearing. For Safety Cases to fulfill their promise, we need better evidence, the capacity to review them seriously, and real consequences for weak cases.
Published on December 2, 2025 10:33 AM GMT
In most crises, people face a timing decision under uncertainty. You choose whether to act early or to wait, and only later does the world reveal whether the threat was real. These two dimensions form four simple categories — early/late × disaster/no disaster — a conceptual tool for understanding the act early/act late tradeoff.
In the days before the Russian full-scale invasion of Ukraine in 2022, a healthy Ukrainian man of draft age already faced some chance[1] of death or permanent injury (e.g. car accident, cancer, etc.) — but that risk was diffuse and long-term. Once martial law hit, young men were barred from leaving the country by their own government and de facto forced to fight, exposing them to significant risk in the battlefield. This sharply increased their risk of dying.
Very rough back-of-the-envelope calculations[2] using public casualty estimates suggest that, for a draft-eligible man who stayed in Ukraine, the chance of being killed or permanently disabled over the next few years may have ended up in the low single-digit percentage range. The point of this example is that early action mattered: once the state closed the borders — acting to maximise national defence at the expense of individual welfare — the risk of a young, healthy male dying jumped significantly, perhaps doubled or more, and the easiest way to avoid that jump was to have left early, before the borders closed.
People who are already interested in going beyond baseline government preparedness.
In many crises, people face a choice between acting early and acting late beyond government recommendations, and each option has different costs. Early action often feels socially awkward or materially costly, and may turn out unnecessary if the threat never materialises. Late action feels normal until suddenly it isn’t — and when a disaster does unfold, late movers face the steepest costs — sometimes losing their health, homes, or lives. This creates a timing tradeoff that shows up across many types of risks.
To be explicit, the four quadrants in this early/late framework are: act late + no disaster, act early + no disaster, act late + disaster, and act early + disaster.
Real disasters rarely fall neatly into one of these boxes. Someone living just outside a tornado’s path might “act early” and see nothing, while someone 50 km away sees the storm veer toward their house after they’ve already left. The point of the framework isn’t to perfectly classify every individual outcome, but to highlight a structural pattern in how timing, uncertainty, and losses interact.
If you combine a lifetime of “I waited and it was fine” with vivid stories of early actors who look foolish in hindsight, you get a gut-level bias toward acting late — even when the signals are screaming. Joplin is what that looks like. The rest of this note walks through the four quadrants in turn, ending with a hopeful example of early action that actually mattered.
The examples I use are “regular” catastrophes with reasonably well-understood dynamics and data. They’re probably not the main contributors to overall existential risk; my preliminary analysis indicates that rare tail events — large nuclear exchanges, globally catastrophic pandemics, and interactions with advanced AI — dominate that picture. I still focus on more mundane cases here because they’re tractable, emotionally legible, and because the same timing structure likely appears—often more sharply—in those tail scenarios.
The importance of individual timing decisions may also grow if institutional early-warning capacity erodes: for example, if democratic institutions, public-health agencies, and international early-warning systems weaken.
Key take-away: The high number of false positives silently trains us to wait: we experience ‘I waited and it was fine’ thousands of times, and almost never viscerally experience the opposite.
The signals of a catastrophe are there, but people mostly wait — and nothing happens to them. This seems to be the most common outcome following early signals of a potential catastrophe. It is business as usual. But it is setting us up to fail in a real emergency.
In 2009, some officials explicitly compared early H1N1 numbers to 1918. For most people in rich countries, that translated to a few alarming headlines, no major change in behaviour, and a pandemic that felt mild enough to file under “overblown scare.” Similar patterns have repeated with SARS, MERS, and Ebola for people outside the affected regions: serious experts were worried; the median person read about it, did nothing, and watched the story fade from the news.
Similarly, there have been repeated moments when nuclear war looked — at least from some expert perspectives — like a live possibility: the Cuban Missile Crisis, later increased risks of nuclear detonation (e.g. Ukraine invasion, or the Kargil War). Similar things could be said about overdue major earthquakes. Again, each time, most people didn’t move, didn’t build a shelter, didn’t overhaul their lives. So far, for almost all of them, that “do nothing” choice has worked out.
At a smaller scale, we get the same reinforcement loop. We ignore that nagging “should I back up my data, move some savings, or see a doctor about this?” feeling, and most of the time nothing obviously bad happens. The world rarely labels these as “near misses”; it just stamps them “nothing” and moves on.
Over a lifetime, this creates a very lopsided training signal: thousands of “I waited and it was fine” experiences, and far fewer vivid “I acted early and was glad” or “I waited and deeply regretted it” examples. The issue is that if you design your preparedness thresholds using only your gut, your gut has been learning from a heavily biased sample. This would be further exacerbated if, indeed, the threats of tomorrow look different from those of the past.
Side note: a high false positive rate is probably inevitable if you want early action in rare, fast-moving crises. I say more about that in a footnote[4].
Takeaways from “act late + no disaster” experiences
Key take-away: When early action precedes a non-event, the people who acted pay real costs and often feel foolish. That experience biases everyone further against early action next time.
In the late 1990s, governments and companies scrambled to fix the “Year 2000 problem” (Y2K) — two-digit year fields that might make systems misread 2000 as 1900 and fail. Contemporary estimates put worldwide remediation spending in the hundreds of billions of dollars, and the issue was widely discussed as a potential threat to power grids, banking, telecoms, and other critical systems.
When the clocks rolled over to 1 January 2000, those fears did not show up as obvious, widespread collapse. There were documented glitches — misdated receipts, some ticketing and monitoring failures, issues in a few nuclear plant and satellite systems — but major infrastructure continued to operate, and retrospective evaluations describe “few major errors” and no systemic breakdown. From the outside, it looked to many people as if “nothing happened.”
Even before that, however, a noticeable minority of individuals had treated Y2K as a personal disaster signal and acted well ahead of any visible local failure. A national survey reported by Wired in early 1999 found that although nearly everyone had heard about Y2K, about one in five Americans (21%) said they had considered stockpiling food and water, and 16% planned to buy a generator or wood stove. Coverage at the time, as well as later summaries, notes that some people also bought backup generators, firearms, and extra cash in case of disruptions.
Long-form reporting makes the costs to early actors very concrete. One Wired feature follows Scott Olmsted, a software developer who established a desert retreat with a mobile home and freshwater well, and began building up long-life food stores. He planned to add solar panels and security measures. Taken together, this implied substantial out-of-pocket costs on top of his normal living expenses. Socially, he also paid a price: the reporter notes that “most of the non-geeks closest to Scott think he’s a little nuts,” while more hardcore survivalists criticised his setup as naïvely insufficient and too close to Los Angeles. He describes talking to friends and relatives and “getting nowhere” — too alarmed for his normal social circle, not alarmed enough for the even more extreme fringe.
Not all early actors moved to the desert. The same feature describes Paloma O’Riley, a Y2K project manager who turned down a contract extension in London, returned to the United States, and founded “The Cassandra Project,” a grassroots Y2K preparedness group. She spent much of her time organising local meetings, lobbying state officials, and building a network of community preparedness groups, while her family stockpiled roughly a six-month food supply. For her, in addition to food storage, the main costs were time, foregone income, and political capital invested in a catastrophe that, from the outside, never visibly arrived.
When Y2K finally passed with only minor disruptions, official narratives tended to emphasise successful institutional remediation, and in public memory, Y2K came to be seen as an overblown scare — a big build-up to ‘nothing.’[5] For individuals like Olmsted, O’Riley, and the fraction of the public who had stocked supplies, bought generators, or shifted cash and investments, the visible outcome was simpler: they had paid real material and social costs in a world where, to everyone around them, “nothing serious” seemed to happen.
Takeaways from Y2K early individual action
Key take-away: The biases of the above two sections, when pushing people to act late in an actual disaster, can have tragic consequences.
Following the section above on why people become desensitized due to the flood of false positives, this section investigates how such desensitization leads to death when, in a minority of cases, the warning signs turn into an actual disaster:
At 1:30 pm on May 22nd, 2011, a tornado watch was issued for southwestern Missouri, including the city of Joplin. The watch was a routine, opt-in alert that many residents either didn’t receive or didn’t treat as significant. Tornado watches were common in the region, and most people continued their normal Sunday activities.
About four hours later, tornado sirens sounded across the city. Some residents moved to interior rooms, but many waited for clearer confirmation. Nationally, roughly three out of four tornado warnings don’t result in a tornado striking the warned area, and Joplin residents were used to frequent false alarms. Moreover, many people didn’t distinguish between a “watch” and a “warning,” and the most dangerous part of the storm was hidden behind a curtain of rain. From this vantage point, the situation might not have felt obviously threatening, so many people hesitated.
Seventeen minutes after the sirens, the tornado touched down. It intensified rapidly, becoming one of the deadliest in U.S. history. By the time it dissipated, it had killed around 160 people and injured more than 1,000. For anyone who delayed even briefly, the window for safe action closed almost immediately.
Takeaways from Joplin tornado
Key take-away: While the above three sections showed why people become desensitized, and how tragic such desensitization is in an actual disaster, this section paints a picture of hope. It shows that acting early is possible, and that it avoids large costs when disaster actually unfolds.
A note on the role of authorities in this Gunnison example: I have tried to choose scenarios showing the dynamics for an individual. However, individual action is a fuzzy concept - a family is not an individual, nor is a group of friends. With Gunnison County having ~8,000 residents, we might assume the town had ~2,000 inhabitants. Compared to the United States as a whole, this is perhaps more akin to a neighborhood taking action than to a government. As such, and because the main point is the structural features rather than the number of people, I believe this example is relevant.
By early October 1918, major U.S. cities were being overwhelmed by the influenza pandemic. In Philadelphia, hospitals ran out of beds, emergency facilities filled within a day, and the city recorded 759 influenza deaths in a single day — more than its average weekly death toll from all causes. Reports from Philadelphia and other cities illustrated how quickly local healthcare systems could be overwhelmed once the virus gained a foothold, especially in places with far fewer resources than large coastal cities.
While influenza was already spreading rapidly across Colorado, Gunnison itself still had almost no influenza cases. Local newspapers ran headlines like “Spanish Flu Close By” and “Flu Epidemic Rages Everywhere But Here,” noting thousands of cases and hundreds of deaths elsewhere in the state while Gunnison remained mostly untouched.
Gunnison was a small, relatively isolated mountain town, plausibly similar to many of the other Colorado communities with very limited medical resources and few doctors. Contemporary overviews note that the 1918 flu “hit small towns hard, many with few doctors and medical resources,” and that Gunnison was unusual in avoiding this fate by imposing an extended quarantine. Under the direction of the county physician and local officials, the town took advantage of its small population, low density, and limited transport links (source, p.72) — and, despite some tension among city, county, and state officials, it seems to have benefited from cooperation among local public agencies sufficient to implement and maintain the measures.
Historical reconstructions of so-called “escape communities” (including Gunnison) describe them as monitoring the spread of influenza elsewhere and implementing “protective sequestration” while they still had little or no local transmission. Several measures were implemented: schools and churches were closed, parties and public gatherings were banned, and barricades were erected on the main highways. Train passengers who stepped off in Gunnison were quarantined for several days, and violators were fined or jailed.
Takeaways from Gunnison’s early response
Very rough baseline mortality anchor (not Ukraine-specific): To give a concrete scale for “ordinary” mortality, suppose we have a stylised population where about 30% of men die between ages 15 and 60, and the rest survive to at least 60. That corresponds to a survival probability over 45 years of 0.70. If we (unrealistically) assume a constant annual mortality rate r over that period, we have (1 − r)^45 = 0.70, i.e. r ≈ 1 − 0.70^(1/45) ≈ 0.8% per year. ↩︎
For illustration, take mid-range public estimates of Ukrainian military casualties, e.g. on the order of 60,000–100,000 killed and perhaps a similar magnitude of permanently disabling injuries as of late 2024. If we (very crudely) divide ~150,000–200,000 “death or life-altering injury” outcomes by a denominator of a few million draft-eligible men (say 4–8 million, depending on where you draw age and fitness boundaries), we get something like a 2–5% risk for a randomly selected draft-eligible man over the relevant period. This ignores civilian casualties, regional variation, selective mobilisation practices, and many other complications; it’s meant only as an order-of-magnitude illustration that the personal risk conditional on staying was not tiny. A more careful analysis could easily move this number around by a factor of ~2× in either direction. ↩︎
Each of the topics I am not covering is an area I have worked on and already explored to some extent, and I hope several of them will become their own follow-on pieces. So despite having gathered evidence and performed analysis, I’m deliberately leaving them out here because this first text is narrowly focused on making the timing tradeoff intuitive before adding more complexity and exploring solutions in later pieces. ↩︎
It might be worth pointing out that a high false positive rate is likely reasonable. One main point of this text is showing that in the lead-up to a disaster, the signals are weak. This means that in order to act early, one has to make decisions under uncertainty. If one pushes the threshold for action until one is certain — as is illustrated in the following example — it is often too late. The tradeoff between desensitization and sufficiently early action is extensively discussed in academic and government circles. It is an unfortunate fact of the world and human psychology. Governments even set alert thresholds so high that they expect some deaths from alarms coming too late — from a utilitarian view, they are minimizing total deaths across both failure modes: desensitized people acting too late, and people not getting information early enough. These are dark calculations with real lives on the line. ↩︎
Some technologists argue that Y2K was a genuine near-miss, prevented by large-scale remediation. The cultural memory, however, tends to frame it as an overreaction rather than a narrowly avoided catastrophe. ↩︎
Published on December 2, 2025 5:27 AM GMT
That's it. Halfhaven is over. I wrote 30 blog posts in October/November. And so did 6 of the other participants, out of a total of 23. Algon wrote the greatest number of posts, 45, and three participants tied for the least at only one post. The average number of posts per participant was 13.1, which is less than half of the required number. I understand why. While I managed to finish, it was hard. Writing every day is hard. Especially if you still have to live your life and work full time and so on. There were many days I didn't feel like it, or was too busy, or was sick. But the fact we had two months instead of one made it possible for me. Thanks to whoever came up with that idea. I originally thought it was dumb and overcomplicated, and I was wrong.
Inkhaven, the in-person residency in San Francisco, had a much greater completion rate than Halfhaven. It seems from the tracker like nobody missed a post? I wonder how much of that is because of the encouraging environment, how much comes from the fact that the residents could focus on writing full time, and how much came from the threat of expulsion if they missed a post. While Inkhaven is more like university, Halfhaven is more like Coursera. We Halfhaven participants had none of these advantages, and I'm proud that I managed to do the hard thing in spite of the odds.
My most popular post of the ones I posted to LessWrong was Give Me Your Data: The Rationalist Mind Meld with a score of 114. I think this hit the right balance of thoughtful and appealing to the target audience. My least popular was Unsureism: The Rational Approach to Religious Uncertainty, with a score of -7. My attempt at satire, which LessWrong didn't like. I didn't post everything there, and I'm sure there's a few they would have disliked even more.
I definitely improved my writing a lot during Halfhaven. I feel myself developing a voice, cutting unnecessary fluff, and having more structure to my writing.
Some people are going to keep posting every week, which some people are calling "foreverhaven", but which I call "having a blog". I'll probably do the same. One post every two days isn't enough time to make posts I'm proud of. I ended up spending more than two days on some posts, and blasting out others in an hour or two. I also want to do more short fiction for a while, like The Confession. I've already written the first draft for my next short story.
Thank you everyone who participated, even if you didn't finish. Thanks for posting in the Discord and creating an environment where I felt I should keep posting too. Thanks for the interesting posts. And thanks for checking out these digest posts. Good luck with your future writing, and maybe I'll see you next year!
Published on December 1, 2025 10:54 PM GMT
An on-policy, sample-efficient NLP post-training approach not requiring verification
I once tried teaching a simulated 3D humanoid how to dunk[1] using RL - truly the most effective use of my time. If I took a single thing away from that, it was that designing the best reward function is equivalent to believing your agent is conscious and has the single goal to destroy your project.
My point is, RL is already terrible in practice. Then additionally throwing intermediate rewards out the window and overly relying on the most inefficient part of modern LLMs, their autoregressive inference[2], doesn't exactly seem like the play - somehow it is though.
The first attempts did try exactly this - basically an additional model that takes in one 'reasoning step' and spits out a number - the reward. The problem is that we simply don't have any pretraining data for such a reward model. Generating your own data is expensive[3] and not comparable to any pretraining scale.
There's also a whole different problem - expecting such model to be feasible in the first place[4]: even humans very much struggle with rating whether a particular step shows promise - in fact it would make the life of a researcher much more trivial if at any point in time he could simply make such objective and accurate assessment without actually having to work through the idea.[5]
Nevertheless, disregarding GRPO, the average pipeline was just copying the pretrained LLM, slicing off the predictor head, doing some basic finetuning and calling it a brand-new reward model. This works out "okay"-ish for around 100 training steps[6], but once a significant enough distribution shift occurs in the actor, the shallow understanding of the reward model is revealed.
In contrast to all these RL approaches stands normal finetuning - yet this seems to only lead to very shallow learning, like formatting and vocabulary and at best growing knowledge, but not anything we would normally describe as proper learning. It seems that the on-policy attribute of approaches like GRPO reduces these superficial changes and therefore focuses learning on more subtle differences.
Distillation seems to perform slightly better in that regard even though teacher-forcing is basically always used - it might be the case that being off-policy can be compensated for if the data at least incorporates distributions similar to those seen at inference time, i.e. traces that make mistakes and then fix them rather than a single perfect solution.
This leaves us in an awkward position:
We would like an algorithm that is sample-efficient, on-policy and doesn't require any additional models - and while we are at it, why not ask for native support of non-verifiable output formats[7] as well?
If reward models in NLP fail because we simply try to adapt the base model to an ill-suited task with little data, why not just stick to what they are actually good at: predicting the next token? Distillation uses this, trying to pass down prediction capabilities, often even forming the loss between the logits rather than just the sampled tokens, providing an incredibly rich signal as a result. But if we don't have such a bigger model, where would the teacher model get its nuance from?
Well, if the model weights are the same, the only difference could be the input - we would need to supply the teacher model with additional information that would reduce the complexity of the task. In its most extreme version, this would simply be a perfect solution to the problem.
To remain in the distributions seen at inference, we additionally need something like student-forcing. Lastly, we need a mechanism that stops "impossible knowledge" from being transmitted into the gradient - the teacher model directly knows the perfect solution, but magically stumbling onto this solution before even properly analyzing the problem won't lead to better performance once this knowledge is gone.
It's time to put this into more concrete terms:
You have prompts p_student and p_teacher - p_student is a normal Chain-of-Thought prompt with the problem, while p_teacher supplies both the problem and a solution, asking the model to attempt the problem normally and only use the solution as hint/help.
You do inference with p_student, generating g_CoT. This results in [p_student, g_CoT] and [p_teacher, g_CoT][8]. You now do distillation over g_CoT[9], with the teacher computing logits using [p_teacher, g_CoT] and the student computing logits using [p_student, g_CoT]; call the logits z_teacher and z_student respectively.
Finally, to block "impossible knowledge", we choose an aggregation of both z_teacher and z_student as the actual target for the student. This could, for example, be a product-of-experts-style mix:
p_target ∝ softmax(z_student) · softmax(z_teacher / τ)
where τ is a constant controlling the temperature of the teacher's contribution - it makes sense to choose it such that the teacher's distribution is roughly as sharp as the student's.
This aggregation basically turns the teacher into a helping hand, only reaching out and intervening when the student is getting severely off track and never giving the student a solution it isn't already seriously considering itself.
[Note: Interactive visualizations were here in the original post but cannot be embedded in LessWrong]
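To make the mechanics concrete, here is a minimal sketch of what one training step could look like, assuming a Hugging Face-style causal LM and the product-of-experts aggregation above; the function and tensor names are illustrative assumptions rather than the original implementation.

```python
# Minimal sketch of one self-distillation step against a Hugging Face-style causal LM.
# The product-of-experts target and all names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def self_distillation_step(model, student_ids, teacher_ids, gen_len, tau=1.0):
    """
    student_ids: [1, S + G] token ids = student prompt followed by the generated g_CoT
    teacher_ids: [1, T + G] token ids = teacher prompt (problem + solution) followed
                 by the SAME g_CoT tokens
    gen_len:     G, number of generated CoT tokens (loss is only taken over these)
    """
    # Teacher pass: same weights, richer prompt, no gradient.
    with torch.no_grad():
        teacher_logits = model(teacher_ids).logits[:, -gen_len - 1:-1, :]

    # Student pass: normal prompt, gradients flow.
    student_logits = model(student_ids).logits[:, -gen_len - 1:-1, :]

    # Product-of-experts aggregation: the teacher only reweights tokens the student
    # already assigns non-negligible probability to ("blocking impossible knowledge").
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits / tau, dim=-1)
    target = F.softmax(log_p_student.detach() + log_p_teacher, dim=-1)

    # Standard distillation loss: cross-entropy of the student against the soft target.
    loss = -(target * log_p_student).sum(dim=-1).mean()
    return loss
```

Note that the target uses the student's detached log-probabilities, so the gradient only pushes the student towards the aggregated distribution rather than letting the target chase the student.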
There is one major problem with this approach - it requires a model that already is powerful, i.e. something upwards of 20B params. Anything below that can't be expected to properly follow the teacher-prompt to a reasonable degree and leverage the solution intelligently, as opposed to just blatantly copying it or completely forgetting about it 1k tokens in. This might not sound like a problem directly, but it becomes one once you understand that I have a total of $0 in funding right now.
If anybody with access to a research cluster would be interested in trying this approach on a sufficient scale, I would be more than happy to give it a go - I even have the code already written from some toy tests for this.
On another note, you can apply this aggregation during inference for g_CoT already - this is useful for very hard problems[10], as it keeps g_CoT close to a reasonable approach so that actual learning can happen afterwards. To be precise, during inference you would do two forward passes per token, compute p_target right away, and sample the next token from it - essentially a mix between teacher-forcing and student-forcing.
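A hypothetical sketch of that inference-time variant, reusing the same aggregation as above; the names are illustrative and this ignores batching, KV caching, and stopping criteria.

```python
# Teacher-guided sampling at inference time: two forward passes per token,
# sampling from the same aggregated distribution used as the training target.
import torch
import torch.nn.functional as F

@torch.no_grad()
def guided_generate(model, student_ids, teacher_ids, max_new_tokens=256, tau=1.0):
    for _ in range(max_new_tokens):
        z_student = model(student_ids).logits[:, -1, :]
        z_teacher = model(teacher_ids).logits[:, -1, :]
        # Same product-of-experts mix as the training target.
        p_target = F.softmax(
            F.log_softmax(z_student, dim=-1) + F.log_softmax(z_teacher / tau, dim=-1),
            dim=-1,
        )
        next_token = torch.multinomial(p_target, num_samples=1)
        # Append the sampled token under BOTH prompts (student-forcing on a guided trace).
        student_ids = torch.cat([student_ids, next_token], dim=-1)
        teacher_ids = torch.cat([teacher_ids, next_token], dim=-1)
    return student_ids
```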
Another question is the data - one of the advantages of GRPO is that it requires no stepwise solutions anymore, only a final verifiable answer. We could of course just generate a bunch of attempts and, using the verifiable answer, keep the correct ones as our own stepwise solutions[11] - this would still have a significantly higher sample efficiency than GRPO, since the signal we get from one trace consists of token-wise, logit-based targets, unimaginably more dense than a single coefficient indicating whether to increase or decrease the probability of the whole trace.
But I think this approach especially shines in settings with no verifiable answers - which is practically everything if we zoom out. One could imagine a company like OpenAI having tons and tons of chats where users iterate on one initial demand with the Chatbot; something like RL approaches or finetuning can't make any use of this at all. This approach on the other hand can simply accept the final output that the user seemed content with as solution and start this self-distillation from the start of the conversation, while the logit rebalancing takes care of information not yet revealed. And the best thing - all autoregressive inference has already been done; this training will be stupidly fast.
yes, the basketball kind ↩︎
parallelizing this inference, as GRPO, does alleviate the damage but doesn't erase it ↩︎
even when attempting novel schemas like incorporating binary search https://arxiv.org/pdf/2406.06592 - very cool paper ↩︎
given the unspoken constraint that compute for the reward model ≈ compute for the LLM ↩︎
This does seem to manifest in experts to some degree through intuition, which can be very powerful, but it's just as common that two experts' intuitions completely oppose each other ↩︎
if the finetuning data is good enough, which always goes hand in hand with an absurd amount of compute spent on it ↩︎
non-verifiable in this context doesn't speak to some impossibility of determining correctness but simply the feasibility of it - checking whether a multiple choice answer is correct is trivial, but what about a deep research report? ↩︎
[X,Y] simply means Y appended to X here, basically just think of pasting the generated tokens under both prompts, respectively ↩︎
by this I mean masking out everything other than g_CoT when computing the gradient ↩︎
where something like GRPO would utterly fail ↩︎
which seem to perform a lot better than human-written ones ↩︎
Published on December 2, 2025 3:31 AM GMT
The data here only reflects posting activity on LessWrong itself.
In 2021, the admins of LessWrong had the idea that we'd pay people to write book reviews. In 2025, we had a much better idea: people would pay us to write all kinds of posts!
I think this went pretty well, final determination pending, but in the meantime I can say the numbers have been impacted. That'll be no surprise to those regularly checking the site.
The number of posts increased by 57% (477 → 749) and number of words by 45% (1.0M → 1.46M). The increases were driven by 21 people officially involved in Inkhaven (residents, coaches, contributing writers) and 3 copycats[1] I identified by the numbers and their written intention to participate.
Curiously, the large boost to LessWrong was achieved with only a handful of writers posting roughly daily to the site. Per the Inkhaven blogroll, most writers published on Substack.
I believe beyond the three copycats on LessWrong, others expressed an intention to blog daily but did so on blogs elsewhere. Lorxus participated in Halfhaven and posted weekly roundups of their posts on LessWrong, but those don't count towards the totals here.
21% of the Inkhaven wordcount on LessWrong came from the LessWrong team. 79% came from others!
Ok, but what of quality? Karma ("baseScore") is the perfect measure of that. The good news is that karma on posts by non-Inkhaven participants only declined from 16 to 13 at the median.
There's more interpretation to be done here but I'm out of time. Such is the Inkhaven way. (This post began as an attempt from me to submit another Inkhaven post myself, but also, it's all graphs and not words!)
I use the term affectionately.
Published on December 2, 2025 1:53 AM GMT
MIRI is running its first fundraiser in six years, targeting $6M. The first $1.6M raised will be matched 1:1 via an SFF grant. Fundraiser ends at midnight on Dec 31, 2025. Support our efforts to improve the conversation about superintelligence and help the world chart a viable path forward.
MIRI is a nonprofit with a goal of helping humanity make smart and sober decisions on the topic of smarter-than-human AI.
Our main focus from 2000 to ~2022 was on technical research to try to make it possible to build such AIs without catastrophic outcomes. More recently, we’ve pivoted to raising an alarm about how the race to superintelligent AI has put humanity on course for disaster.
In 2025, those efforts focused around Nate Soares and Eliezer Yudkowsky’s book (now a New York Times bestseller) If Anyone Builds It, Everyone Dies, with many public appearances by the authors; many conversations with policymakers; the release of an expansive online supplement to the book; and various technical governance publications, including a recent report with a draft of an international agreement of the kind that could actually address the danger of superintelligence.
Millions have now viewed interviews and appearances with Eliezer and/or Nate, and the possibility of rogue superintelligence and core ideas like “grown, not crafted” are increasingly a part of the public discourse. But there is still a great deal to be done if the world is to respond to this issue effectively.
In 2026, we plan to expand our efforts, hire more people, and try a range of experiments to alert people to the danger of superintelligence and help them make a difference.
To support these efforts, we’ve set a fundraising target of $6M ($4.4M from donors plus 1:1 matching on the first $1.6M raised, thanks to a $1.6M matching grant), with a stretch target of $10M ($8.4M from donors plus $1.6M matching).
Donate here, or read on to learn more.
As stated in If Anyone Builds It, Everyone Dies:
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
We do not mean that as hyperbole. We are not exaggerating for effect. We think that is the most direct extrapolation from the knowledge, evidence, and institutional conduct around artificial intelligence today. In this book, we lay out our case, in the hope of rallying enough key decision-makers and regular people to take AI seriously. The default outcome is lethal, but the situation is not hopeless; machine superintelligence doesn't exist yet, and its creation can yet be prevented.
The leading AI labs are explicitly rushing to create superintelligence. It looks to us like the world needs to stop this race, and that this will require international coordination. MIRI houses two teams working towards that end:
If Anyone Builds It, Everyone Dies has been the main recent focus of the communications team. We spent substantial time and effort preparing for publication, executing the launch, and engaging with the public via interviews and media appearances.
The book made a pretty significant splash:
The end goal is not media coverage, but a world in which people understand the basic situation and are responding in a reasonable, adequate way. It seems early to confidently assess the book's impact, but we see promising signs.
The possibility of rogue superintelligence is now routinely mentioned in mainstream coverage of the AI industry. We’re finding in our own conversations with strangers and friends that people are generally much more aware of the issue, and taking it more seriously. Our sense is that as people hear about the problem through their own trusted channels, they are more receptive to concerns.
Our conversations with policymakers feel meaningfully more productive today than they did a year ago, and we have been told by various U.S. Members of Congress that the book had a valuable impact on their thinking. It remains to be seen how much this translates into action. And there is still a long way to go before world leaders start coordinating an international response to this suicide race.
Today, the MIRI comms team comprises roughly seven full-time employees (if we include Nate and Eliezer). In 2026, we’re planning to grow the team. For example:
We will be making a hiring announcement soon, with more detail about the comms team’s specific models and plans. We are presently unsure (in part due to funding constraints/budgetary questions!) whether we will be hiring one or two new comms team members, or many more.
Going into 2026, we expect to focus less on producing new content, and more on using our existing library of content to support third parties who are raising the alarm about superintelligence for their own audiences. We also expect to spend more time responding to news developments and taking advantage of opportunities to reach new audiences.
Our governance strategy primarily involves:
There's a ton of work still to be done. To date, the MIRI Technical Governance Team (TGT) has mainly focused on high-level questions such as "Would it even be possible to monitor AI compute relevant to frontier AI development?" and "What would an international halt to the superintelligence race look like?" We're only just beginning to transition into more concrete specifics, such as writing up A Tentative Draft of a Treaty, with Annotations, which we published on the book website to coincide with the book release, followed by a draft international agreement.
We plan to push this a lot further, and work towards answering questions like:
We need to extend that earlier work into concrete, tractable, shovel-ready packages that can be handed directly to concerned politicians and leaders (whose ranks grow by the day).
To accelerate this work, MIRI is looking to support and hire individuals with relevant policy experience, writers capable of making dense technical concepts accessible and engaging, and self-motivated and competent researchers.[1]
We’re also keen to add additional effective spokespeople and ambassadors to the MIRI team, and to free up more hours for those spokespeople who are already proving effective. Thus far, the bulk of our engagement with policymakers and national security professionals has been done either by our CEO (Malo Bourgon), our President (Nate Soares), or the TGT researchers themselves. That work is paying dividends, but there’s room for a larger team to do much, much more.
In our conversations to date, we’ve already heard that folks in government and at think tanks are finding TGT’s write-ups insightful and useful, with some calling it top-of-its-class work. TGT’s recent outputs and activities include:
The above isn’t an exhaustive description of what everyone at MIRI is doing; e.g., we continue to support a small amount of in-house technical alignment research.
As noted above, we expect to make hiring announcements in the coming weeks and months, outlining the roles we’re hoping to add to the team. But if your interest has already been piqued by the general descriptions above, you’re welcome to reach out to [email protected]. For more updates, you can subscribe to our newsletter or periodically check our careers pages (MIRI-wide, TGT-specific).
Our goal at MIRI is to have at least two years’ worth of reserves on hand. This enables us to plan more confidently: hire new staff, spin up teams and projects with long time horizons, and balance the need to fundraise with other organizational priorities. Thanks to generous support we received in 2020 and 2021, we didn’t need to run any fundraisers in the last six years.
We expect to hit December 31st having spent approximately $7.1M this year (similar to recent years[2]), and with $10M in reserves if we raise no additional funds.[3]
Going into 2026, our budget projections have a median of $8M[4], assuming some growth and large projects, with large error bars from uncertainty about the amount of growth and projects. On the upper end of our projections, our expenses would hit upwards of $10M/yr.
Thus, our expected end-of-year reserves puts us $6M shy of our two-year reserve target of $16M.
This year, we received a $1.6M matching grant from the Survival and Flourishing Fund, which means that the first $1.6M we receive in donations before December 31st will be matched 1:1. We will only receive the grant funds if they are matched by donations.
Therefore, our fundraising target is $6M ($4.4M from donors plus 1:1 matching on the first $1.6M raised). This will put us in a good place going into 2026 and 2027, with a modest amount of room to grow.
It’s an ambitious goal and will require a major increase in donor support, but this work strikes us as incredibly high-priority, and the next few years may be an especially important window of opportunity. A great deal has changed in the world over the past few years. We don’t know how many of our past funders will also support our comms and governance efforts, or how many new donors may step in to help. This fundraiser is therefore especially important for informing our future plans.
We also have a stretch target of $10M ($8.4M from donors plus the first $1.6M matched). This would allow us to move much more quickly on pursuing new hires and new projects, embarking on a wide variety of experiments while still maintaining two years of runway.
For more information or assistance on ways to donate, view our Donate page or contact [email protected].
The default outcome of the development of superintelligence is lethal, but the situation is not hopeless; superintelligence doesn't exist yet, and humanity has the ability to hit the brakes.
With your support, MIRI can continue fighting the good fight.
In addition to growing our team, we plan to do more mentoring of new talent who might go on to contribute to TGT's research agenda, or who might contribute to the field of technical governance more broadly.
Our yearly expenses in 2019–2024 ranged from $5.4M to $7.7M, with the high point in 2020 (when our team was at its largest), and the low point in 2022 (after scaling back).
It’s worth noting that despite the success of the book, book sales will not be a source of net income for us. As the authors noted prior to the book’s release, “unless the book dramatically exceeds our expectations, we won’t ever see a dime”. From MIRI’s perspective, the core function of the book is to try to raise an alarm and spur the world to action, not to make money; even with the book’s success to date, the costs to produce and promote the book have far exceeded any income.
Our projected expenses are roughly evenly split between Operations, Outreach, and Research, where our communications efforts fall under Outreach and our governance efforts largely fall under Research (with some falling under Outreach). Our median projection breaks down as follows: $2.6M for Operations ($1.3M people costs, $1.2M cost of doing business), $3.2M Outreach ($2M people costs, $1.2M programs), and $2.3M Research ($2.1M people costs, $0.2M programs). This projection includes roughly $0.6–1M in new people costs (full-time-equivalents, i.e., assuming the people are not all hired on January 1st).
Note that the above is an oversimplified summary; it's useful for high-level takeaways, but for the sake of brevity, I've left out a lot of caveats, details, and explanations.