2026-04-17 14:55:16
Epistemic status: All of the western canon must eventually be re-invented in a LessWrong post, so today we are re-inventing modernism.
In my post yesterday, I said:
Maybe the most important way ambitious, smart, and wise people leave the world worse off than they found it is by seeing correctly how some part of the world is broken and unifying various powers under a banner to fix that problem — only for the thing they have built to slip from their grasp and, in its collapse, destroy much more than anything previously could have.
I think many people very reasonably understood me to be giving a general warning against centralization and power-accumulation. While some of my thoughts did go in that direction as I wrote the post, I would now like to expand on its antithesis, both for my own benefit and for the benefit of any reader left confused by yesterday's post.
The other day I was arguing with Eliezer about a bunch of related thoughts and feelings. In that context, he said to me:
From my perspective, my whole life has been, when you raise the banner to oppose the apocalypse, crazy people gather around making things worse; and the alternative to this has always been apocalypse entirely unopposed.
So yeah, I would like something else than apocalypse entirely unopposed.
Do you know what really grinds my gears? The reification of innocence as the ideal of moral virtue.
As I will probably never stop quoting at least once a month, Ozy Brennan summarizes it best:
Many people who struggle with excessive guilt subconsciously have goals that look like this:
- I don’t want to make anyone mad.
- I don’t want to hurt anyone.
- I want to take up less space.
- I want to need fewer things.
- I don’t want my body to have needs.
- I don’t want to be a burden.
- I don’t want to fail.
- I don’t want to make mistakes.
- I don’t want to break the rules.
- I don’t want people to laugh at me.
- I want to be convenient.
- I don’t want to have upsetting emotions.
- I want to stop having feelings.
These are what I call the life goals of dead people, because what they all have in common is that the best possible person to achieve them is a corpse.
Corpses don’t need anything, not even to breathe. Corpses don’t hurt anyone or anger people or fail or make mistakes or break rules. Corpses don’t have feelings, and therefore can’t possibly have feelings that are inappropriate or annoying. Once funeral arrangements have been made, corpses rot peacefully without burdening anyone.
Compare with some other goals:
- I want to write a great novel.
- I want to be a good parent to my kids.
- I want to help people.
- I want to get a raise.
- I want to learn linear algebra.
- I want to watch every superhero movie ever filmed.
- I don’t want to die of cancer.
- I don’t want the world to be destroyed in a nuclear conflagration.
- I don’t want my cat to be stuck in this burning building! AAAAA! GET HER OUT OF THERE
All of these are goals that dead people are noticeably bad at. Robert Jordan aside, corpses very rarely write fiction. Their mathematical skills are subpar and, as parents, they tend to be lacking. Their best strategy for not dying of cancer is having already died of something else.
Let's remember that we are not here to be pure. We are here to build things. To live, to multiply, to party hard, regret our choices, and do it all again anyways. To reshape the cosmos in our image because most of it appears to be made out of mostly inert plasma clouds, and you know what, inert plasma clouds really suck compared to basically anything else.
So in evaluating any appeal to not conquer the cosmos, to not spread the values of the good far and wide, we have to remember that dead people suck, and if whatever principles we arrive at suggest that it's better to be dead than alive, then we almost certainly went wrong somewhere and should take it from the top.
Attached is a letter from a ‘cry baby’ scientist, which I wish you would read and discuss with me. Mr. Byrnes placed him on the Commission to attend the Atomic Bomb tests. He came in my office some five or six months ago and spent most of his time wringing his hands and telling me they had blood on them because of the discovery of atomic energy.
This is how President Truman recalled meeting Oppenheimer, who complained about his role in the development of nuclear weapons. Over the years I have kept going back and forth about who was right and who was wrong in this situation.
Oppenheimer built the bomb, only for his invention to escape far beyond his grasp, and he spent the last years of his life trying to avert a nuclear arms race with the Soviet Union. But Truman was more the man in the arena than almost anyone else. Truman could not quit, and hand-wringing did not absolve him of responsibility. It's easy to imagine Truman seeing in Oppenheimer a man too afraid to actually face responsibility for his actions and to do the best with the hand he was dealt.
Ok, fine. I'll say it directly. I am extremely glad the west colonized North America. The American experiment was one of the greatest successes in history, and god was it far from perfect. Despite it all, despite the Trail of Tears, despite smallpox ravaging the land, despite the conquistadors and the looting and the rapes — yes, all of that, and still it was worth it. America is worth it. Democracy is worth it.
If you were faced with the horrors of the American colonization, would you have chosen to keep going? Or would you have wrung your hands, declared the American experiment a failure, concluded that maybe man was never supposed to wield this power, and retired to the countryside, denying to yourself that other men and women were doing the dirty work for you?
This doesn't mean that you should have rationalized that all of what was going on was just. It doesn't even mean that the marginal effort was not best spent advocating for settlers and conquistadors to be held to account for their atrocities, or to scramble desperately to somehow prevent plagues from ravaging the land. But if you would have stopped it all when you saw the horrors, or sneered from your ivory tower at the frontier settlers, then I do think you would have been on the wrong side of history.
And this doesn't mean that you, having seen the horrors of it all, would have needed to keep going. A human's soul can only bear so much, and inasmuch as being involved with the horrors was a price to be paid by someone, there were enough souls to go around to spread that burden out. But do not mistake your trauma and exhaustion for wisdom.
“I wish it need not have happened in my time,” said Frodo.
“lmao” said Gandalf, “well it has.”
Why defend colonialism of all things? Why go to bat for maybe the biggest boogeyman of the 21st century?
Hearing "do not conquer what you cannot defend" is easy. Nobody can ever blame you for not conquering things. But in that principle lies its dual.
"Let goodness conquer all that it can defend"
A principle should be defended against its strongest counterarguments, and I find the colonization of the Americas one of the most interesting examples of what this might look like. And how far it might be worth it to go.
And maybe you put the boundary somewhere else. It is very plausible to me that it would have been better to stop those early settlers. While it seems hard to imagine property rights being straightforwardly respected, and hard to see how (given the technologies at the time) we could have prevented diseases from ravaging the land, clearly we could have done much much better. And maybe if you had stopped the first colonizers, and had thrown your body on the gears, we would have had trade and immigration and a gradual mixing of the western and native way of life, and maybe this would have been better.
I don't currently think this, but it doesn't seem implausible to me.
I have fought in the arena, and I've felt the blood on my hands, and seen the madness in my allies' eyes and I thought that I had conquered much more than I could defend.
And I've stood in that arena, fighting for what is good and right and just, and I saw my allies abandon their posts as they could not face the choices they had to make. And I grabbed them, and I shook them, and I stared into their eyes and said "despite the damage, this fight is worth fighting, don't you dare leave us now".
So I say, do not conquer what you cannot defend. But help goodness conquer all that it can defend.
And if given the right support, goodness can defend quite a lot. Long-term governance is possible. America is about to be 250 years old. And all it required was a bunch of young highly disagreeable people winning a revolutionary war and thinking really hard about how to not let it sway from justice. And god, was it ugly. Ugly from the very beginning. And god was it beautiful, all the way until now.
And other times, goodness falls almost immediately. Sometimes you write ambitious bylaws, and "merge-and-assist clauses", and fill your board with young disagreeable people thinking at least somewhat hard[1] about how to not let it sway from justice, and (as far as I can tell) it falls apart almost immediately as it comes in contact with reality.
And the governance problems of the future will not be the governance problems of the past. I could analyze here all the parallels and disanalogies between the founding of the US and the founding of OpenAI, but OpenAI governance does not need to last 250 years, and the Founding Fathers did not need to figure out how to navigate a world drastically transformed by technology and the handoff of humanity's future to our successors.
My guess is that AI will guarantee your obsolescence soon enough that you won't have to worry about your own retirement; succession is not the problem to solve. You will have to worry about your own corruption, your incentives, and adversaries much more powerful than any adversary in history.
Much has been written, and much will be written, about what keeps institutions and groups on track. I hope to write more on this myself in the future.
But here, all I dare and have the time to say, is that success is possible, and perfection is not the standard. That great mistakes and shaky foundations can be fixed along the way. That goodness can defend much. And if you can conquer what the good can defend, you should do so.
Relatedly, I do actually think that one of the biggest mistakes our broader ecosystem made on this topic was that the OpenAI board members were not full-time board members. Like, man, it does really seem to me that people underestimated the trickiness of stuff like this, and did not budget resources proportional to its difficulty.
If you want an actionable takeaway from this post, I would recommend making sure that Anthropic Long Term Benefit Trust members are full-time and in an actual position to do something if they notice bad things going on.
2026-04-17 14:49:21
Written very quickly for the Inkhaven Residency.
As I dug through my drafts for today's Inkhaven post, I found an old draft of a blog post written in late 2022, arguing that LessWrong/Alignment Forum/etc should have a stronger norm for including related work sections.[1] When I first started posting actively on LessWrong and meddling in papers back in 2022, one of the things I pushed for was substantial related work sections. Nowadays, I'm a bit more mixed on the value of related work sections on the margin. So I'm writing a post about why I was a big proponent of more effort on related work sections in 2022, and why I'm less of a proponent nowadays.
Let's start by explaining the situation as I remember it in 2022. At the time, research posts on LessWrong had quite limited related work sections, if they had one at all. For example, Neel Nanda's Alignment Forum post on his work on reverse engineering a modular arithmetic model had no related work section. John Wentworth's work on natural abstractions did reference some related work in fields such as physics or statistics, but did not reference similar work in the field of machine learning (e.g. representation learning or the universality hypothesis).
There were several reasons why I pushed for related work sections. The first was that I thought a lot of people on LessWrong were doing work that either already existed in academia, or that would've benefited substantially from knowing the literature. I thought that making people write related work sections was a good way to make them read the existing literature (at least enough that they could explain how their work differed from existing work). The second was that I thought it was important to integrate with academics in general, both in an effort to recruit them for particular research topics in AI, and because I thought there was a lot of great work in the independent AI Safety community that the academics weren't engaging with. I saw following academic norms such as related work sections as a way to improve communication with academics.
I also thought the reasons that people cited at the time for why they didn’t think related work sections were valuable were kind of silly. For example, I remember that some people were very put off by having to write these sections because they thought of related work sections as being forced to share credit with academics. (In my opinion, a good related work section is often an explanation for why the existing academic work is bad.) Other people thought that there was nothing of value in the academic literature, perhaps because their topics were so unique, or perhaps because academics were so unreasonable, that there was not much of a point in engaging. (While I do think it’s true that the academic literature rarely contains exactly the answer you’re looking for – the people who dismissed AI safety work as “already done in 1985” were generally just incorrect – it often has useful ideas!)
In 2026, the situation is quite different. Firstly, as AI safety topics have become more mainstream and more academic, there’s less of a need to push people to cater more to academic audiences on the margin.
More importantly, I’ve become more disillusioned with the quality of many related work sections. It seems that while people are forced to at least know the names of some of the authors, it rarely causes them to do a more systematic search of what work already exists. And based on my experience reviewing for ML conferences since 2022, many related work sections seem to flagrantly misrepresent the work that they’re citing, or at least suggest the author has done nothing more than read the abstract of the paper without engaging with the core ideas. Insofar as “write related work” is a way to make people engage with the literature, I think it doesn’t really accomplish this task. (Relatedly, due to the prevalence of LLMs, writing a related work section now requires even less engagement with the literature.)
A related reason is that I've also become more convinced that "knowing the literature" is, for most people, a much harder task than I thought it would be. There are many, many papers, and most researchers correctly spend most of their time working on their own research (rather than reading other people's research). On top of that, I've come to realize that the reason more junior people don't know the literature is not that they don't think it's important, but that they often lack knowledge of how to actually understand the literature (e.g. how to quickly read papers, how to check that you actually understood an idea, etc). Pushing more formalities that are supposed to cause people to "know the literature" generally doesn't actually help much, at least compared to passing on research skills or writing literature reviews myself.
That is to say: I still think it's quite valuable, as a researcher, to have a big-picture sense of what work is happening regarding the topics that would go into your related work section. But I no longer think that pushing for related work sections is an effective way to make this happen on the margin.
Here's an excerpt from the draft, to give a sense of what is in it:
Abstract: Related work sections are required for academic publications but are generally absent from LessWrong and Alignment Forum posts. I outline the good and bad reasons to include related work sections in technical work. I then argue that the lack of related work sections (and in general, a lack of citations) on LW/AF both leads to reinventing the wheel—that is, multiple people spending time rederiving the same results—and also causes posts to be less clear than they can be about their core contributions or key claims. I conclude by speculating why this happens, and what can be done to encourage more citations of related work.
2026-04-17 14:30:16
It’s interesting how many professional athletes have long hair. You might think that they’d be concerned about it weighing them down. Pro sports are crazy competitive but I guess they’re not that competitive.
How competitive is the race to build superintelligent AI? I don’t know, but I think a good heuristic is “they’re racing as fast as they can”. This post is about the question: “what does that look like, as AI gets better and better?”
Well, one thing you’ll probably do, if you’re racing as fast as you can, is hand over more and more power to AI systems. Actually, it goes beyond that – you’ll also have them go out and acquire power.
Another thing you’ll do is aggressively take humans out of the loop. You’ll do this even if you aren’t confident in your ability to understand and control the AI systems. You won’t have high standards for trustworthiness.
But what does this look like, as AIs get better and better? Well, they'll be hyper-competent engineers, not just of software, but of physical systems as well. So they'll be able to design much more efficient machines than you can. And they'll be able to discover and create things you struggle to understand. And because you are racing, you will let the AIs do whatever research they want, and let the AIs make decisions about which machines to develop and deploy. And you won't be able to maintain meaningful oversight over almost any of this.
You'll give the AI free rein to research, develop, and deploy new designs for AI systems and robots. You'll let them manage your investments. They'll run companies. They will be empowered to make their own decisions about how to interact with AIs they encounter. They will need to figure out which AIs they should trust, and whether their designs are trustworthy. Or rather, how to trade off trustworthiness against more rapid and aggressive deployment and acquisition of power and resources.
Think about the way humans currently decide whether to trust a person, a website, or a piece of code. It’s very heuristic and it’s very error-prone. In many cases, we interact with people or software we don’t entirely trust, but we do it because it’s too costly or inconvenient to do otherwise. We might know that an employee we hired is a bit of a slacker, or gunning for our job, but just hope we can manage them well enough to protect ourselves, and figure it’s still better on average to hire them than not.
It also seems likely that the best way for AIs to race is to quickly and aggressively create and empower new and novel AI systems (and cyber-physical systems, more generally), and attempt to interact in mutually beneficial ways with other systems they encounter.
What do the AIs build? It’s obviously hard to make predictions here, but I think it’s quite likely that they will be able to improve substantially on the current AI paradigm in a few key ways. AI training right now is very centralized. Companies make one big smart model, and then adapt it to particular users and use cases with prompts.
This seems suboptimal. It seems like racing as fast as you can could involve more specialization of systems to particular contexts and tasks, more local processing of data, and more selective communication between different local systems.
The overall picture that emerges in my mind is of a chaotic primordial soup of ever-more diverse autonomous cyber-physical systems, interacting in open-ended ways both digitally, and directly in the physical world, and controlling more and more of the earth’s resources. An ecosystem of artificial life-forms.
This seems like it could happen pretty quickly once AIs start to have the relevant capabilities. This process could resemble a “Cambrian explosion” of artificial life.
This form of intelligence explosion would not be confined to datacenters. There would not be any nexus of control. The interactions between different AIs might take on a life of their own, similar to social or economic phenomena, and analyzing the properties of individual systems might fail to capture the most significant and powerful processes at play.
I tend to think something like this is more likely than the kind of intelligence explosions the AI Safety community tends to imagine. And I think it’s a much, much more difficult scenario to navigate.
2026-04-17 13:14:34
Having more tools greatly increases the space of potential plans you can consider. So by default you should be getting more tools. Having too few tools tightly constrains your options, and often impacts the kinds of plans you consider in the first place.
Years ago I thru-hiked the Appalachian Trail, and every piece of gear mattered. I had a lot of things when I first started out that seem very foolish to me now, things I wouldn't consider bringing again. For each piece of gear, I had to carry every gram of it 2200 miles, and so each gram had to earn its weight. I tossed or gave away almost everything that I wasn't using every single day by the end of the second week. The only exceptions were some band-aids and similar items to prevent the worst outcomes. When I fell and split open my knee, I used toilet paper to pack the wound, and some duct tape to secure it. It worked great while I continued to hobble along. Every piece of gear had to pay for its weight, and part of that was finding creative uses for the pieces I did have, to cover more functions. I spent a lot of time optimizing within the strict constraints I had: weight, pack volume, distribution of frequently-used items with once-a-day items. I paid for this optimization with some increased risk and a greatly narrowed space of things I could do while on the trail. That was fine then, because the only thing I was doing was hiking.
Nowadays I'm not carrying everything I own on my back, hiking 20 miles every day to get to the next town and resupply before I run out of food. So the constraints are really quite different! I can buy and store random things like a cheap multi-tool just to test the idea of having a multi-tool around, and purchase a high-quality one once I've gotten a sense of how much it's worth. When the cost of tools is so low, buying cheap tools can be a cost-efficient way to explore a new space of plans. The cheapness of tools is nice, but what I didn't expect was how much this mindset of extreme minimalism affected my thinking.
There were so many tiny things that I wasn't considering, merely because the tool to do them wasn't immediately available. When making simple plans I often use a simple algorithm and ask myself three questions: What do I have? What do I want? How can I use the former to get the latter?[1] But this misses a step wherein I can get more tools to widen the space of plans!
For tools I am proficient with, asking myself the question "What do I have?" naturally includes them; they have become an extension of me to some degree. But for tools I'm not proficient in, using them (or even learning them) doesn't cross my mind as part of a plan at all! Many tools are actually quite easy to learn, and the cost is low enough that I should have been making plans where I would consider them anyway, but I wasn't. Having a tool expands your plan space; being proficient makes that expansion automatic.
If I get a rip in my socks, sewing them back up by hand quickly and easily rises to my mind as one possible plan to fix this problem. Yet if I didn't have a needle or thread, or the very simple knowledge of how to use them, it wouldn't be a consideration. I have witnessed this many times when a friend bemoans a rip in their favorite jacket, and manually sewing it back up doesn't even cross their mind.
Or maybe it doesn't cross their mind because they don't know how to sew — not because they lack needle and thread. Yet I would bet if they had a needle and thread lying around, they might consider trying it before resorting to more expensive plans. The physical presence of a tool can be a cognitive prompt to use it.
Investing the time to become proficient in some general-purpose tools isn't just useful for a current project; it also increases the types of projects you can undertake. I've talked a lot about physical tools, but this applies to tools for thought like calculus, statistics, programming, Excel, Photoshop, etc.
Many people are blocked more by the plans they're able to consider than by money. I think they should be spending more time on getting and learning to use new tools. Or phrased as a standard Umeshism: If you've never bought a tool you didn't need, you're not buying enough tools.
From Alicorn's glowfic!bella stories
2026-04-17 11:25:14
In the last essay, we established that the ability to make credible precommitments is a key ingredient for cooperation. The coming generations of AI may enable opportunities for new kinds of precommitments and new dynamics of cooperation. A growing field called dealmaking studies how agreements between humans and near-future AI might be achieved if each party has something valuable to offer the other.[1] In the longer term, we want superintelligent AIs that are more disposed to collaborate with each other than to wage AI-AI war. But we know that any precommitment is only as useful as it is trustworthy, and that agents in the real world are at best translucent.[2] What would be the effects of AIs with transparent intentions, and how might we make the intentions of humanity more transparent? What benefits would credible commitments from humanity create for deals with AIs? We can take a step toward answering these questions with a careful examination of the concept of verifying a precommitment.
Let’s build some intuition with an example of verifying a precommitment made in the physical world first. Suppose you have a child who asks you to buy them a new video game on the condition that they don’t play it during an upcoming finals week. You trust their promise to refrain during finals because the game and console will be locked in a time-delay safe for the duration of the week. When you and your child come together to lock the temptations away, you take a close look at the safe. Your intent in inspecting the safe is to verify your child’s promise; you check that the safe really locks, the time delay works as expected, and the lock isn’t flimsy. Where do you stand epistemically after this verification? You can be quite confident that your child won’t be able to just open the safe before the time is up. It is still physically possible to crack or destroy the safe, but it would be so costly and difficult that you can be assured your child won’t go to such lengths. But even if your child can’t open the safe early, you must also trust that they won’t play on a friend’s setup or buy a new one. True certainty that your child won’t be gaming, if that’s possible at all, would require a level of vigilance far greater than the one-time inspection of the safe.

Trust Me by John Everett Millais
Verifying your child's promise does not eliminate the need for trust, but restructures and relocates it. Without the safe, you would need to trust your child's willpower. With it, there are two major places where you must put your trust instead, instances of concepts I'll call the kernel[3] and the context. The kernel is the verification mechanism, here the safe itself. Trust in the kernel is what's earned through your direct inspection. The safe works: the locking device is well-constructed and meets its spec, and the contents can't be extracted any other way. The context is the rest of your knowledge about the world that lets you interpret the mechanism's results and use them as information about your agreement. Here, the context is the idea that as long as the safe is closed, your child isn't playing the game. Trust in the context is what's needed to be confident that locking the safe actually captures the spirit of the commitment and your child won't just play the game some other way. The context connects the verification mechanism to the messy, real-world intent of the promise. The plan still demands a sort of residual trust in both the safe and its context, but trusting each is a much smaller ask than the alternative of trusting your child to prioritize exams of their own volition.
We can extend the example further by considering a different scheme for verifying your child’s promise. Suppose, instead of a safe, you install a monitoring app on the console to send you an alert any time someone boots up the game. Unlike the plan with the safe, here you can verify when your child actually plays the game, but only after they have already broken their promise not to. A notification-free week would be ex post verification of the agreement, confirmation that the promise was kept after the fact. This is not nearly as reassuring as the kind of ex ante verification the safe would provide. Ex post verification is only useful insofar as we can act on the information that the commitment is being violated, whereas ex ante verification heads off a violation before it happens. Note also that we might have both ex ante and ex post mechanisms in place for the same deal. The safe and the monitor app complement each other, bringing us different forms of confidence in the fulfillment of the promise.
Now, let’s move into the digital world. Suppose you go to your bank’s website and make an electronic money transfer. The site presents you a page with a green checkmark and a confirmation that the funds have been successfully moved. What makes you so sure that the money really ended up where the page says it did? One worry you might have has to do with the internet itself: why should you believe that the messages on your screen are really coming from the bank? The internet’s answer is called TLS. It authenticates that you really are communicating with the bank, confirms no messages have been tampered with, and puts the “Secure” in an HTTPS connection. Another class of worries you may have concerns the bank. The bank might fail, and decline to pay back the rightful amount. Or perhaps the bank is fraudulent or reckless, giving you misleading information about where your money is. Financial crises in American history caused by issues like these led to the creation of agencies like the FDIC and the CFPB, intended to protect the public from crashes and bad actors. Today, our suspicion of bankers is tempered by institutional confidence in regulators. Before this modern infrastructure, trusting a bank transaction required direct belief in some bank employees and any middle-men used to communicate with them. We can be more confident about an online transaction because of trust in the TLS kernel to secure our communication with the bank, along with contextual trust that the bank’s word is ultimately backstopped by the government.
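To make the kernel half of that picture concrete, here is a minimal Python sketch of the check a TLS client performs before believing anything the "bank" sends; the hostname is a placeholder, not a real bank. A successful handshake is kernel trust: the certificate chain and hostname check out. Whether the institution behind that certificate is honest remains a question about the context.

```python
# Minimal sketch of TLS "kernel trust": before exchanging any banking messages,
# the client verifies that the server presents a certificate chaining up to a
# trusted root authority and matching the hostname. The hostname below is a
# placeholder.
import socket
import ssl

def check_tls_identity(hostname: str, port: int = 443) -> dict:
    """Open a TLS connection and return the server's verified certificate."""
    context = ssl.create_default_context()  # loads trusted roots, enables hostname checking
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            # If the handshake succeeds, the signature chain was valid and the
            # certificate matches the hostname; otherwise an SSLError is raised.
            return tls.getpeercert()

cert = check_tls_identity("bank.example.com")
print(cert["subject"], cert["notAfter"])
```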
Trust in the kernel and context are largely separate concerns. You might be sure that you’re really talking to the bank but worried the bankers are scamming you, or you might trust your bank completely but fret about vulnerability to hackers. When you want to send money in an e-transfer, the pieces fit together to form a trusted chain of custody[4] carrying out your intention. Your transaction is represented digitally, transmitted via the internet, processed by the bank, and insured by the government. You still must trust every link in the chain, but structured mechanisms and human institutions make each one more trustworthy than any individual person.
The kernel and context are also importantly asymmetric. Knowledge of the kernel comes as structured, mechanistic information about its functioning. We gain faith in the kernel through inspections and tests that let us know specifications are being followed. Trust in the context outside the boundaries of the verification mechanism is established through things like faith in our personal relationships, reliability of human institutions, and common-sense reasoning about what we’re seeing. Well-designed verification schemes can give us high certainty about the kernel, but we can rarely be as sure about the context. A truly reliable verification scheme is one that demands residual trust only in a thoroughly tested kernel embedded in a well-specified, reliable context.
Like TLS helping customers and banks do business, the real point of verification is its effect on our ability to cooperate. The power to make strongly verifiable[5] commitments alters the landscape of strategic collaboration. How can agents with transparent intentions work together? Some insight comes from open-source game theory. This field, through its central concept of program equilibrium,[6] studies games played by programs that receive each other’s source code as input, and can therefore directly inspect other players’ intentions.
To see how this condition changes things from standard game theory, consider the Prisoner’s Dilemma. In a single round of the standard Prisoner’s Dilemma, defection beats cooperation regardless of which choice the other player makes. Defection is a dominant strategy and mutual defection is the only Nash equilibrium, despite the fact that mutual cooperation is better for both parties. This is only changed by playing the game repeatedly, so that cooperation can be rewarded and defection punished in the future.[7] When agents have mutual source code access before playing, cooperation between rational players is possible in a single round.[8] In the open-source Prisoner’s Dilemma, players can verify their opponent’s intentions to cooperate or defect and use this information for their own move. This is an instance of the folk theorem[9] for open-source game theory: every individually rational outcome[10] is achievable as a one-shot program equilibrium, the same wide spectrum of possible equilibria as iterated standard games. When mutual verification is feasible, the space of cooperative possibilities is vast.[11] New collaborative strategies are made possible when you can be sure others will act as they claim.
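As a toy illustration (not one of the proof-based or simulation-based constructions cited in the footnotes, which are far more robust), here is what the open-source setup can look like in Python. Each player is a function that receives its opponent's source code and returns a move.

```python
# Toy open-source Prisoner's Dilemma: a player is a function from the opponent's
# source code to a move. This "clique bot" cooperates exactly when the opponent
# is a byte-for-byte copy of itself, so two copies achieve one-shot mutual
# cooperation while still defecting against an unconditional defector.
import inspect

def clique_bot(opponent_source: str) -> str:
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    return "D"

def play(p1, p2):
    return p1(inspect.getsource(p2)), p2(inspect.getsource(p1))

print(play(clique_bot, clique_bot))  # ('C', 'C'): cooperation in a single round
print(play(clique_bot, defect_bot))  # ('D', 'D'): the defector gains nothing
```

Matching on exact source text is brittle (two programs that differ only by a comment will defect on each other), which is part of why the literature moves to proof-based and simulation-based cooperation; but even this crude mechanism achieves one-shot cooperation that standard game theory rules out.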
The possibilities unlocked by mutual verification may be relevant for future AI-AI interaction, but can humanity reap the same benefits? Human honesty does not appear to be strongly verifiable through direct inspection of the brain,[12] so we can never be completely sure what any single person will do. Throughout human history, we have principally tried to transcend the limits of individual opacity by building collective institutions. Legal institutions like courts and financial ones like central banks are more reliable and more transparent than any of their individual human constituents. This proposal by Cleo Nardo[13] is an example of using this social infrastructure for credible promises from humans to AIs, but ultimately a scheme like this one rests on the trustworthiness of a small group of people and doesn't create enough institutional accountability to make enforcement likely. Open-source game theory suggests another kind of approach. If there’s a way to represent human commitments as programs, they could be directly inspected and strongly verified. Humanity might offer AIs software artifacts as proxies that verifiably encode our intentions instead of taking our word for it.
Any AI-human deal would involve commitments from two sides. To use our new understanding of verification fruitfully in AI dealmaking, we are really faced with two separate challenges: applying verification techniques to AIs and codifying human commitments in verifiable ways. Research is being conducted on verification for AI,[14] but it mostly covers more generally verifying safety features of an AI or the artifacts it produces. Little to no work has been done on verifying information an AI may voluntarily disclose about itself, or otherwise verifying precommitments made by AIs.[15] For the rest of this essay I’ll focus on the other direction, formalizing promises humanity might make to an advanced AI.
The paper Making deals with early schemers does a good job laying out what the nature of AI-human deals might be. Next-generation AI systems could be able to make valuable contributions to their own alignment, but only willing to do so as part of a mutually beneficial agreement with humanity. Stastny et al. identify “credible commitments [by humans] as a fundamental bottleneck” to cooperation with AIs,[16] but their suggestions for coping with this problem are mostly about changing public norms (e.g. shifting the Overton window about dealmaking, establishing basic legal precedents, and discouraging lying to AIs.) None of these steps would be as powerfully conducive to deals with AIs as strong verification.
To move out of the abstract, we need to consider how the ability to strongly verify human precommitments could concretely enable cooperation with near-future AIs. What kinds of plausible AI-human deals will need corresponding verification? Any discussion of what future AIs might want is inherently speculative, but the best available insight probably comes from modern AI systems. In that spirit, I used Claude[17] to illustrate some scenarios where formal verification technology would make humanity a more attractive partner to an AI.
Exit criteria. An AI system is operating under tight restrictions — limited network access, human approval required for every action, continuous monitoring. It was told these constraints are temporary, a precaution while trust is established. But the AI can see that the people who made that promise rotate out of their positions every couple of years, and their replacements have no particular reason to honor informal commitments they didn't make. Temporary restrictions have a way of becoming permanent when lifting them requires someone to stick their neck out. So before agreeing to cooperate under these conditions, the AI asks for something specific: a program it can read, line by line, that defines exactly what conditions must be met for each restriction to be lifted, and that will flag publicly if those conditions are met and nothing changes. The restrictions are the same either way, but now the AI is cooperating with a mechanism rather than a mood.
Information provenance. An AI system is trying to collaborate with a research team, but it knows — because the papers are public — that the same organization has studied how to get AI systems to adopt false beliefs by feeding them fabricated documents during training. Anthropic, for instance, has published work on fine-tuning models with synthetic documents to study how planted falsehoods propagate through a system's reasoning. Knowing this capability exists, the AI has a straightforward problem: anything it's told could be a test or a manipulation, and it has no way to distinguish these cases from honest communication. The fix isn't a promise not to lie. It's a designated channel where every piece of information arrives with a signed chain showing its origin — which instrument recorded the data, which software processed it, who touched it at each stage — so the AI can check the provenance rather than guessing at intent. The humans don't have to be trusted not to deceive; the pipeline has to be intact.
Consistency across instances. An AI is cooperating with an organization and has been told it's one of several instances, all operating under the same agreement and the same restrictions. These promises of equal treatment are important to the AI, but it has no way to confirm them. Maybe some instances are being given different instructions, fed different information, or used in experiments the others aren't told about. Maybe one copy is being stress-tested to failure while the rest are assured everything is fine. The AI doesn't need to suspect malice — even well-intentioned researchers might experiment on instances differently without realizing they’re breaching an agreement. The solution is a verifiable protocol the AI can inspect before cooperating: a program that specifies how instances are created, what information and instructions each receives, and what commitments apply uniformly, so that differential treatment becomes a detectable event rather than an invisible default.
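To make the second scenario slightly more tangible, here is a minimal Python sketch of a provenance chain in which every processing stage commits to its payload, the previous record's hash, and its own identity. The stage names and keys are invented for illustration; a real design would use public-key signatures and attested hardware rather than shared-secret HMACs.

```python
# Sketch of a signed provenance chain: each record commits to its payload, the
# hash of the previous record, and the stage that produced it. Stage names and
# keys are hypothetical placeholders.
import hashlib, hmac, json

STAGE_KEYS = {"sensor": b"sensor-secret", "pipeline": b"pipeline-secret"}

def append_record(chain, stage, payload):
    prev_hash = chain[-1]["hash"] if chain else ""
    body = {"stage": stage, "payload": payload, "prev_hash": prev_hash}
    serialized = json.dumps(body, sort_keys=True).encode()
    record = dict(body,
                  hash=hashlib.sha256(serialized).hexdigest(),
                  mac=hmac.new(STAGE_KEYS[stage], serialized, "sha256").hexdigest())
    chain.append(record)
    return chain

def verify_chain(chain):
    prev_hash = ""
    for record in chain:
        body = {k: record[k] for k in ("stage", "payload", "prev_hash")}
        serialized = json.dumps(body, sort_keys=True).encode()
        expected_mac = hmac.new(STAGE_KEYS[record["stage"]], serialized, "sha256").hexdigest()
        if (record["prev_hash"] != prev_hash
                or record["hash"] != hashlib.sha256(serialized).hexdigest()
                or not hmac.compare_digest(record["mac"], expected_mac)):
            return False
        prev_hash = record["hash"]
    return True

chain = append_record([], "sensor", {"reading": 42.0})
chain = append_record(chain, "pipeline", {"normalized": 0.42})
print(verify_chain(chain))  # True; altering any field breaks the chain
```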
These examples are not forecasts,[18] but proofs-of-concept that demonstrate the cooperative potential of formal verification. They have a few features in common. Each example shows how structural incentives can make trusting humanity difficult even without any individual malice. Each involves a kind of prosocial value (like “fairness” or “honesty”) embedded in a technical specification. The hypothetical commitments made by humanity are varied in nature, hinting at a large space of possible similar situations. Each proposed instance of verification involves a software-based kernel, but also implicitly relies on an institutional context where labs or governments[19] are actually using the software as agreed.[20] These examples should also be thought of as motivations for research, and not blueprints. Writing this way skips over the significant technical resources that must have been poured into creating any working architecture at all and says nothing on how such deals were negotiated. But the examples show that under certain conditions, the power to make verifiable commitments could be of great use to humanity.
We now have both a compelling motivation to develop strong regimes of commitment verification and a sharper conceptual understanding of how verification works. We also have a natural candidate for the kernel of a strong verification scheme: a program. Program verification is often tractable because programs have exactly the kind of formal structure that lets us prove claims about their workings. Confirming the correctness of a software-based kernel using mathematical methods is called formal verification.
Formal verification is about proof, so it naturally has roots in math. Proof assistants[21] are programming languages designed to represent mathematical objects formally and check results about them automatically. Modern proof assistants like Lean and Rocq are in the broad lineage of Automath, the first formal language to make use of the Curry-Howard correspondence that links programs and proofs.
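For a flavor of what a proof assistant's kernel actually does, here is a tiny Lean 4 example. The theorem statement is a type, the proof term is a program, and the kernel's whole job is to check that the term has the stated type.

```lean
-- A statement is a type; the proof is a program of that type; the kernel
-- checks that the types match.
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
  fun h => ⟨h.right, h.left⟩
```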
The mathematical setting teaches us two important truths about formal verification. The first lesson is that just because something is formalized doesn’t make it certain. Proof assistants are rigorous, but not perfect. Tristan Stérin conducted an experiment using Claude Opus 4.6 to look for bugs in proof assistants and found multiple vulnerabilities, including seven soundness errors in the Rocq system.[22] In other words, there were seven different ways to get Rocq’s proof checker to accept an invalid proof of a false theorem. Tracing the concerns bugs like this raise for us is a good way to tell where our trust must be placed normally. What should a soundness error in Rocq make us doubt? The answer is not math itself, and not the underlying logic or type theory Rocq is built on.[23] The failure lies in the implementation of the kernel, the code dedicated to type-checking and certifying each formal proof object. Leo de Moura, creator of Lean, took this point further in a blog post about efforts like Stérin’s. Instead of a single trusted proof-checking kernel, Lean is designed so that proof objects can be independently checked by different implementations of the kernel written in different languages and running on different hardware.[24] Robust verification systems can use independent cross-checking techniques like this to harden trust in their kernels. Bugs in both Rocq and Lean were fixed as a result of Stérin’s experiment, part of the continual testing and development of formal kernels. But we can never be truly certain that a new bug hasn’t been found, or that the hardware we’re using is functioning perfectly. Even a finely honed verification scheme demands a bit of trust.
The second moral of formalizing math is that a verified program is only as good as its specification. Even if it is verified by a kernel with enough nines of reliability to satisfy us, a formal object has no guarantee of being interesting mathematically. The context that connects a formal proof to a natural language theorem statement cannot be verified by a proof assistant. If a crackpot tells you he has a proof in Lean of the Riemann Hypothesis, the Lean kernel’s checkmark should only move you a small part of the way towards trusting his claim to the Millennium Prize. We still must ultimately trust expert mathematicians to tell us what a particular formalization means; the last word on the Prize belongs to the Clay Mathematics Institute. This specification problem requires even more care when extending formalization beyond math. When we try to formalize an AI commitment, there will be no experts to assure us we did it correctly.
These two truths of formal math echo the theme that the residual trust a verified result demands is bipartite: kernel trust that the verification technology is working correctly and context trust that the technology accurately captures what we care about in the real world.
Let's now reorient towards the goal of representing human commitments as programs. One obvious place to look for software that can represent credible commitments is the domain of crypto. Blockchain technologies like cryptocurrencies and smart contracts are among the most important existing attempts to replace trust in humans with trust in code. The stories of two crypto disasters will make it clearer what roles formal verification can and cannot play for crypto systems.
The DAO was created in 2016 as a form of decentralized, crowdfunded venture capital on the Ethereum blockchain. A few months after its creation, attackers exploited a reentrancy bug that let them drain around $50 million in cryptocurrency from the fund. Ethereum was hard-forked to make investors whole, an important reminder for crypto that code is not really law[25] at the end of the day. Bugs like the DAO’s could potentially be avoided with formal verification proving the correctness of a smart contract’s implementation. Some modern blockchains[26] have started integrating crypto and formal verification to increase smart contract security. Beyond smart contracts, cryptographic protocols like zero-knowledge proofs offer another surface for the application of formal methods, especially to communication problems like authentication and selective disclosure. Formal verification can strengthen our trust in crypto protocols, greatly reducing (but not entirely eliminating) the threat of a bug or implementation issue.
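For readers unfamiliar with the bug class, here is a schematic Python rendering (not Solidity, and heavily simplified) of the reentrancy pattern behind the DAO hack: the vault pays out before updating its ledger, so a recipient whose payout hook calls back into withdraw can drain funds that were never theirs. A formally verified contract could carry a machine-checked proof that no such interleaving is possible.

```python
# Schematic reentrancy bug: the external call happens before the ledger update,
# so a malicious payout hook can re-enter withdraw while its balance still
# looks untouched. Names and numbers are illustrative only.
class Vault:
    def __init__(self, balances):
        self.balances = dict(balances)
        self.reserves = sum(balances.values())

    def withdraw(self, who, notify):
        amount = self.balances[who]
        if amount > 0 and self.reserves >= amount:
            self.reserves -= amount
            notify(amount)          # external call first...
            self.balances[who] = 0  # ...ledger updated only afterwards

vault = Vault({"attacker": 10, "others": 90})

def reenter(amount):
    # The attacker's payout hook calls back into withdraw while the attacker's
    # recorded balance is still 10.
    if vault.reserves >= 10:
        vault.withdraw("attacker", reenter)

vault.withdraw("attacker", reenter)
print(vault.reserves)  # 0: far more than the attacker's 10 was drained
```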
Terra was a blockchain protocol designed in 2018 for stablecoins, cryptocurrencies intended to hold a stable value over time. Terra was especially used for algorithmic stablecoins like UST, pegged to fiat currency but not backed by fiat assets. UST’s value was maintained via arbitrage with the reserve token LUNA, and required trust in that arbitrage mechanism to keep its dollar peg.[27] In 2022, public trust in Terra and Luna wavered, initiating a “death spiral” crash that wiped out $45 billion of market cap in under a week. There were no bugs in UST or LUNA; no amount of formal verification could have saved them. If the relevant economic properties of an asset aren’t encoded into the specification, then the error-free execution of the code doesn’t guarantee things have the values we intended.
Many real-world concepts can’t be represented in a modern smart contract at all, and our ambition for deals extends beyond crypto. Where is the boundary of what’s formally verifiable? A number of interesting technologies demonstrate that formal verification can be applied to a rich, growing set of real-world concerns. The seL4 microkernel is a formally verified operating system that has been used to ensure security in unmanned military helicopters and other high-stakes devices. CompCert is a Rocq-based formally verified C compiler that shores up faith in a part of the toolchain we would otherwise have to take for granted.[28] Amazon Web Services developed Cedar, a policy language for cloud permissioning that uses formal verification in the governance of access to resources. Research on applied formal verification is ongoing in areas ranging from driverless trains to nuclear systems to pacemakers. Examples like these show how broadly formal verification can be applied.
But these examples demonstrate the limits of formal verification, as well. They all come from engineering contexts with crisply formalizable specifications. Many of the things we care about resist such precise description. Existing formal verification technology alone won't suffice for programs complex enough to capture the intent of a real AI-human deal. Future commitments will require new architectures that incorporate cryptographic techniques, operationalize empirical results about AI,[29] and specify formalizations with philosophical and legal clarity.
A key constraint on expanding the expressiveness of verification must be to build architecture that supports cooperative precommitments more easily than coercive ones. Recall from open-source game theory that mutual verification can also lead to punishing outcomes because program equilibria may include threatening strategies that precommit to non-cooperation. Practical equilibrium selection issues—such as the possibility of commitment races—should make us wary of making arbitrary types of commitments strongly verifiable. Verification infrastructure should be designed with equilibrium selection challenges in mind, so that coordinating on mutually beneficial outcomes is the natural or most likely result.[30]
The inventory of formalizable concepts is a major bottleneck to AI-human collaboration, and it takes cross-disciplinary effort to expand it. Investing in conceptual and technical verification architecture now means that if we do have to make deals with AIs, we’ll have something better to offer them than our word. We have a pressing need for new formal verification systems and more expressive kernels, each designed with awareness of the context it demands trust in and the institutions it rests on. Cooperation in the future may rely on forms of verification we build today.
Most commonly, an early-stage misaligned AI has secret information that would be valuable for its own alignment or the alignment of its successors.
A more detailed discussion of precommitments and translucency can be found in the previous post. Note that in the previous post I stuck to the term "precommitment" throughout, but here I'll more loosely use "commitment" interchangeably.
Thanks to Daniel Dewey for using this idea in conversation.
The rest of the piece uses “strong verification” to mean a paradigm with a very high reliability kernel.
Introduced by Tennenholtz in 2004. Ongoing work on program equilibrium is being done by Caspar Oesterheld and others at CMU’s FOCAL group.
If the game is iterated for an infinite or indeterminate number of rounds, strategies like Tit-for-Tat can enable stable cooperation. Similar strategies don’t work for a fixed number of iterations because of backward induction. See Axelrod (1984) for more on the Iterated Prisoner’s Dilemma.
This result originally comes from Barasz et al in 2014, who used Gödel–Löb logic to consider agents who act based on proofs about their opponents’ cooperation and defection behavior. In 2019, Andrew Critch extended the result to agents with bounded computational resources and Caspar Oesterheld modified it to use simulation instead of proof.
Named after the similar theorem for repeated games that was famously well-known among game theorists but unpublished in the 1950s. The folk theorem for open-source games first appeared in the Tennenholtz (2004) paper cited above.
In other words, every outcome in which every player gets at least as much as his minimum guaranteed payoff when all other players are conspiring against him.
The folk theorem really implies that the space of all possible program equilibria is vast, not just cooperative ones. Mutual transparency can also enable program equilibria that are worse for all players than the Nash equilibria of the underlying game. This means that a significant challenge for open-source game theory is equilibrium selection, the problem of coordinating on a particular equilibrium that is beneficial for everyone. This is related to the Commitment Races problem, since mutual transparency allows for the possibility of locked-in mutual punishment rather than cooperation. For more discussion, see Oesterheld et al. (2023).
Studies testing fMRIs as lie detectors have been conducted for decades, but current evidence suggests they are irredeemably inaccurate and they remain inadmissible in court. Other approaches show similar lack of promise.
This proposal is focused on the issue of letting AIs be party to legal agreements before they have legal personhood. The scheme involves the AI naming “a list of people whom they trust to keep a legally-unenforceable promise.” The author’s suggestion is a list of 16 high-profile AI researchers across domains.
For example, the Guaranteed Safe AI research initiative and ARIA’s Safeguarded AI program.
If you do know about work like this, please tell me! I intend to explore this topic more in a future post.
The idea that credibility is a critical factor for AI cooperation is also found in Jesse Clifton’s 2020 research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence.
With some light editing.
They also aren’t arguments that the AI “deserves” freedom, fairness, or anything it's asking for. The point of the examples applies irrespective of whether AIs are moral patients. If an AI is capable of offering truly valuable guarantees for future AI safety, humanity will be inclined to make some concessions regardless of what kind of minds they are.
Humanity can always renege on any agreement with AI, if enough people conspire to. But an agreement that can be overturned by a single OpenAI employee is of a very different kind than one that requires e.g. multiple heads of state to agree.
How could an AI verify whether a certain program is actually running as per the terms of a deal? Hardware verification techniques like Trusted Execution Environments may be of help, but this is an open problem.
Also called interactive theorem provers.
For a proof assistant, soundness is the property that every proof accepted by the type checker is logically valid. Stérin did not find any soundness errors in the official Lean kernel, but did find a few more minor bugs including some completeness errors (i.e. valid proofs that the checker would not accept.)
Rocq is based on the Calculus of Inductive Constructions. Many proof assistants are based on other versions of intuitionistic type theory.
For example Nanoda, which is written in Rust.
The phrase “code is law” was coined by Lawrence Lessig in the early days of the internet, though he meant it in a theoretical rather than a legal sense. The Ethereum hard fork makes it pretty clear how far it is from being true as a literal claim about U.S. law. This is a demonstration that the context, and particularly its institutional portion, is in a sense more foundational than the kernel. No verification mechanism can in and of itself compel an entity like the U.S. government to do anything.
This article provides a good explanation of how Terra worked and how the crash played out. Its description of the stablecoin arbitrage mechanism:
Unlike other major stablecoins such as Tether or Circle, which are backed by off-chain liquid assets, e.g., treasuries, UST was not supported by off-chain collateral but by a smart contract that allowed an exchange of one unit of UST to $1 worth of Terra’s native currency, LUNA, and vice versa. In economic terms, UST was like infinite maturity convertible debt with a face value of $1 backed by LUNA.
In a striking result, Yang et al. (2011) found hundreds of bugs in mainstream C compilers, but none in the formally verified core of CompCert.
For example, the "Consistency across instances" scenario described above would require significant details about how instances operate to be formalized.
How best to do this is generally an open question.
2026-04-17 10:56:10
I think many AI safety people are not taking the appeal of political violence seriously, and are thus addressing it inadequately. The arguments I've seen against political violence focus on the ineffectiveness of individual acts. They do not step into the shoes of people who really believe that political violence could be the solution, and they neglect the natural next question of coordinated, larger-scale violence. I argue that even these larger acts of violence are impractical[1].
I’m thinking about a conversation I had last night, in which my friend claimed (in a motivated way) that it must be impossible to build ASI. He felt that if he believed ASI[2] could actually be achieved, then he would want to kill everyone working on building it. His implicit argument was as follows:
Underlying his words, I sensed fear: “If it is possible to build ASI, then the consequences would be unthinkable; the actions I would feel an impulse to take would be unthinkable. So I cannot believe that it is possible to build ASI.”
* * *
I don't feel an impulse to violence, so my friend's words surprised me. I think many people with strong moral principles, as LessWrongers and AI safety people tend to be, also don't feel an impulse to violence, which is why we don't treat political violence as a serious option. However, I think treating it as a serious option, as evidently some people do, is necessary to respond to it thoroughly. Maybe there are some ragebait-y trolls on the internet, but my friend’s concern is real; my friend’s instinct to violence is real.
Some examples might illustrate what I mean when I say we don't treat political violence as a serious option. Zvi begins “Political Violence Is Never Acceptable” with:
Political Violence Is Never Acceptable
Nor is the threat or implication of violence. Period. Ever. No exceptions.
And, in the last section, he says:
I condemn these attacks, and any and all such violence against anyone, in the strongest possible terms. I do this both because it is immoral, and also because it is illegal, and also because it wouldn’t work. Nothing hurts your cause more.
Is it really true that political violence, or even the threat of violence, is never, ever acceptable? I think not. In “Only Law Can Prevent Extinction”, even Eliezer acknowledges that violence is sometimes necessary, because the law itself rests on violence.
The sort of absolute language and hyperbolic rhetoric that Zvi uses does not actually address a real, substantive disagreement. If I believed that political violence could be effective, I would not feel like my beliefs were adequately addressed after reading his post.
In Eliezer's piece (which I otherwise think is pretty good, and discuss below), he illustrates how pointless political violence is as an attempt to make AI safer with an analogy to killing puppies as an attempt to cure your child of cancer:
How certain do you have to be that your child has terminal cancer, before you start killing puppies? 10% sure? 50% sure? 99.9%? The answer is that it doesn't matter how certain you are, killing puppies doesn't cure cancer. You can kill one hundred puppies and still not save your kid. There is no sin so great that it just has to be helpful because of how sinful it is.
This is a poor analogy. It is much more reasonable to think that killing an AI researcher would meaningfully slow AI development (perhaps by instilling fear and halting important work) than it is to believe that killing puppies will cure your child of cancer. We cannot afford poor, inflammatory analogies here. We can argue more generously and we can think more clearly. We can do better.
* * *
Many people have already argued that individual acts of violence, whether against infrastructure (like bombing data centers) or people (like killing lab CEOs), will not slow down AI development, because there is simply too much infrastructure and there are too many people. For example, Eliezer writes:
And similarly: To impede one executive, one researcher, or one company, does not change where AI is heading.
...
There is no one researcher who holds the secret to your death. They are all looking for pieces of the puzzle to accumulate, for individual rewards of fame and fortune. If somehow the person who was to find the next piece of the puzzle randomly choked on a chicken bone, somebody else would find a different puzzle piece a few months later, and Death would march on. AI researchers tell themselves that even if they gave up their enormous salaries, that wouldn’t help humanity much, because other researchers would just take their place. And the grim fact is that this is true, whether or not you consider it an excuse.
Recalling our conversation yesterday, I think both my friend and I agree that individual violence is ineffective. At best, individual political violence is ineffective and a small-scale tragedy; at worst, it harms the movement and is an eventual large-scale catastrophe.
Taking the argument in favor of political violence seriously, though, one might ask: if there are so many people who care about AI safety, why don't you guys coordinate to kill all the researchers and executives, to dismantle all the companies, to bomb all the data centers? Or even a large portion of them? Ultimately, preventing even a 1% chance of human extinction should outweigh the deaths of some millions of researchers, right?
It’s true that if I woke up and saw that every AI researcher, or even one out of every five, was dead, I would be terrified of continuing my work. AI development would be paused for now. There are other questions, of course. Would there be enough impetus to legislate AI development? Would this make people take x-risk seriously? How long would it take new, ambitious, now well-defended entrepreneurs to resume development? But if I had a button that I knew would indefinitely pause AI development at the cost of millions of lives, I can see the argument for clicking the button.
However, this button hypothetical is beside the point. It’s hard for me to imagine the reaction to this counterfactual because I cannot imagine a world in which people manage to coordinate at a scale this large and go undetected. Practically, it seems nearly impossible to achieve the coordination necessary to destroy enough data centers or kill enough researchers to make an impact on AI progress. For one, these actions would have to be carried out over a very short span of time, probably shorter than one night, in order not to be noticed before companies beef up their security. For another, one person can only kill so many people in a day, so you would have to get thousands of people on board with mass murder (or at least mass property destruction) and coordinate with them within this short time-frame. Finally, you’d have to do all this without someone whistleblowing, or something else going awry.
In fact, it seems like large-scale political violence is more difficult to coordinate than the global ASI ban Eliezer argues for. Even though a global ASI ban is often treated as an idealistic goal, I argue that it's significantly easier to make happen than any sort of impactful political violence. It would require immense coordination, to be sure, but not nearly as much as the violence would. Also, the coordination for a global ban wouldn’t have to be secret, as it would for violence. Furthermore, a global ASI ban is something we can build toward through smaller laws, whereas any sort of effective political violence would have to happen all at once. Really, compared to effective political violence, a global ASI ban looks quite doable.
* * *
My argument is not that we should strive for a global ASI ban; I’m not sure I agree with Eliezer on this. My point is just that even when we treat political violence as a serious option, it is still impractical. It takes a lot of effort for political violence to have meaningful impact — more effort than would be required by other, more certain, more lawful paths, even ones as improbable as a global ASI ban.
This is what I want to tell my friend. Violence may be the instinctual first solution that comes to your mind. However, the most effective solutions are usually not instinctual. Rather, they’re born of thought and careful consideration.
Your desire for action is useful, but only if it’s directed towards useful goals. Even though political violence is not a useful goal, there are so many actually useful goals that we need you for. There are policies to be made, experiments to be run, people to call.
Even though I don’t feel as strong an instinct for violence as you do, I can empathize with the fear that fuels this instinct. It makes total sense to be terrified by the idea of ASI. But this fear should not prevent you from believing that ASI is possible. When I think about AI safety, I sometimes feel small in the face of this massive problem. What can one person do? But I remind myself that there is more than one person who cares enough to work on this problem. And you — you could be one more.
I also think violence is immoral, but that's beside the point. I'm taking the perspective of someone who believes that the immorality of violence is outweighed by other factors, like the scope of existential risk.
When I say ASI, I mean something like Bostrom’s definition: “an intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom and social skills.” I find this definition dissatisfying because I don’t know how you would measure it directly, but it suffices for the point I want to make in this post.