As we develop new techniques for detecting deceptive alignment, ranging from action monitoring to Chain-of-Thought (CoT) or activations monitoring, we face a dilemma: once we detect scheming behaviour or intent, should we use that signal to "train the scheming out"?
On the one hand, leaving known misaligned behaviour / intent in the model is not marginally informative and possibly unsafe. On the other hand, training against a monitor might not actually fix the model's underlying motivations; it might simply provide selection pressure that favours more sophisticated, less detectable forms of deception.
This post outlines a simple framework formalising when the generalisation benefit of training outweighs the selection risk of creating a schemer. I assume that the choice is between training and not training on the incriminating examples, e.g. we cannot apply the updates in a more targeted way, and we don’t have the option to just look at the evidence and use it to make changes to the training process.[1]
The selection intuition
When we train a model to satisfy a specific criterion, we can view the process as starting with a prior distribution over models we might get (determined by the architecture, initialisation, base model pre-training, etc) and then restricting that set to only the models that satisfy the training criterion.
In our case, suppose we have a scheming monitor μ, and consider a simplified hypothesis space where the potential models we might end up with fall into three categories:
Aligned (MA): The model behaves like we want it to and is not deceptive.
Weak schemer (MW): The model is misaligned but schemes in ways that our monitoring detects (at least some of the time).
Strong schemer (MS): The model is misaligned and schemes in ways that evade our monitoring.
Denote their respective prior probabilities P(MA),P(MW), and P(MS).
Now, suppose we apply training pressure Tμ that penalises any model caught behaving in a misaligned or deceptive way by the monitor μ. Effectively, we are pruning MW from our posterior. If we assume the training procedure is a perfect filter for MW but does nothing else, the new posterior for a strong schemer would be
P(MS|Tμ)=P(MS)P(MA)+P(MS).
Intuitively: we are assuming that the training does not favour either the aligned models or the strong schemers, and so the chance of getting one or the other is proportional to the prior.
Under this formalisation, training is a double-edged sword: it increases the probability of getting an aligned model, but it also increases the absolute probability of getting a strong schemer. From an x-risk perspective, this means we should approximately never train against scheming monitors.
Modelling generalisation: the Rcoefficient
However, the above formalisation leaves out a key hope — what if the training generalises? What if training the model on a broad enough distribution makes it more likely we end up with an aligned model rather than a schemer?
Let's define Q(Tμ) as the ratio of aligned models to strong schemers after training against the monitor μ:
Q(Tμ)=P(MA|Tμ)P(MS|Tμ).
Previously we assumed that Q stays the same before and after training. Now, let’s define the generalisation coefficient R(Tμ):
R(Tμ)=Q(Tμ)Q(no training)
If R(Tμ)>1, the training pressure favours aligned models (e.g. maybe we are training against an activation monitor and activations are hard to game).
If R(Tμ)=1, there is no generalisation (the good-to-bad ratio remains the same as in the prior).
If R(Tμ)<1, the training pressure favours scheming (e.g. maybe the monitor is hackable and reward hacking incentivises scheming).
When should we train against a scheming monitor?
When the generalisation coefficient R>1, the training favours aligned models compared to schemers, but only in relative terms. In absolute terms, we are still selecting for the union{MA,MS} (aligned or strong scheming). For training to be worth it, the generalisation effect must be strong enough to overcome this selection pressure.
Specifically, it is easy to derive that the absolute probability of a strong schemer decreases or stays the same, P(MS|Tμ)≤P(MS), if and only if:
R≥1+P(MW)P(MA).
Intuitively, the training prunes weak schemers from the hypothesis space, which frees up P(MW) in the probability distribution. To ensure we are actually improving safety, the generalisation coefficient R must amplify the aligned models MA enough to take up all of the newly available probability mass. Otherwise some of the initialisations that would have ended up as weak schemers end up as strong schemers instead.
Example application: should we train away scheming thoughts?
To use the formula above, we need to make some assumptions and run some evals:
First, we need access to multiple model checkpoints trained roughly the same way. Suppose we can get such checkpoints, e.g. unsuccessful release candidates.
For each checkpoint, we run it on a batch of situations from deployment and use a scheming thought monitor to classify it as either a weak schemer or not a weak schemer. Suppose we estimate P(MW)≈0.6.
We don’t know whether the remaining checkpoints are aligned or strong schemers. But let’s say that at the models’ capability levels, we are willing to assume that strong scheming is less than 10% likely, so P(MA)≥0.3.
So overall we might think it’s fine to train against the scheming thought monitor if
R(Tμ)≥1+0.60.3=3.
But we have little evidence[2] to think our training would be that robust, so we abstain from training.
The more monitoring positives, the more cautious we should be about training
This follows directly from the formula, but is still counterintuitive. After all, if we have many examples of unwanted behaviour, isn’t the training going to be more effective at specifying what we mean by aligned / non-deceptive? Yes, but we are also indirectly specifying in more detail what failures are detectable. It also matters a ton where the scheming examples come from. If they all come from a single checkpoint, plausibly this all works out in favour of alignment. But if the scheming examples come from many different checkpoints, i.e. P(MW) is high, then chances are some of them would turn into strong schemers under training pressure, even taking into account sample diversity.
Considerations not captured by the formula
Even if we suspect R≥1+P(MW)P(MA), we might still refrain from training against the monitor if the monitoring signal could be used for other, more risk-reducing interventions (the risk reduction would need to offset the increased risk of deploying a strong schemer though):
Natural scheming demos: For global coordination purposes (or even for justifying mitigations within a lab), it may be helpful to have a constant supply of schemers ready for demonstrating the risk, even if these are schemers that we could straightforwardly train away.
Facilitating scheming propensity estimation: We cannot estimate models' propensity for strong scheming. But if we think that it’s correlated with propensity for weak scheming, then we may want to keep weak schemers around. (For example, we might think that future models' propensity for opaque scheming is correlated with current models' propensity for transparent scheming.)
(A non-reason to refrain from training:
Preventing harm via control instead: Refraining from training when R≥1+P(MW)P(MA) and using the monitor for control instead does not reduce risk. This is because by assumption the chance of a strong schemer is lower if we do train, and the monitor would only catch weak schemers which would counterfactually not even exist.)
Conversely, we might choose to train against a monitor even if we suspect R<1+P(MW)P(MA) for reasons such as:
Sufficient control: We are okay deploying a more deceptive / strategic schemer because we have confidence the deployment is controlled, and we just want the model to attack less. (For example, we may not mind that our untrusted monitors are scheming harder, as long as they can’t tell honeypots from real deployment and this forces them to not collude very often.)
The increased chance of alignment is worth it: For example, we need to imminently hand everything over to the AI. In such a position even weak schemers would constitute an x-risk so we may want to weed out as many as we can.
Conclusion
Training against a scheming monitor prunes weak schemers, by default making both aligned models and strong schemers more likely. The higher the frequency (across checkpoints) of detectable scheming, the larger the selection pressure towards both alignment and strong scheming. Unless the training sufficiently favours aligned models over strong schemers, it is likely making things worse.
Many thanks to David Lindner, Roland Zimmermann, Zac Kenton, Rif Saurous, and Seb Farquhar for helpful feedback. Any remaining mistakes are mine.
How might we estimate R? In principle, we could build a bunch of monitors we think are uncorrelated, train against some and eval against others. This could give us some idea, but we'd still have no guarantee that the generalisation we see in that setting is representative of generalisation to distributions we didn't measure.
Ultimately, it's fair to say that this framework reduces a vibe-check about "should we train against this monitor?" to a vibe-check about "how well will this generalise?" (which I claim is progress).
We’re back with all the Claude that’s fit to Code. I continue to have great fun with it and find useful upgrades, but the biggest reminder is that you need the art to have an end other than itself. Don’t spend too long improving your setup, or especially improving how you improve your setup, without actually working on useful things.
It is remarkable how everyone got the ‘Google is crushing everyone’ narrative going with Gemini 3, then it took them a month to realize that actually Anthropic is crushing everyone, at least among the cognoscenti with growing momentum elsewhere, with Claude Code and Claude Opus 4.5. People are realizing you can know almost nothing and still use it to do essentially everything.
Are Claude Code and Codex having a ‘GPT moment’?
Wall St Engine: Morgan Stanley says Anthropic’s ClaudeCode + Cowork is dominating investor chatter and adding pressure on software.
They flag OpenRouter token growth “going vertical,” plus anecdotes that the Cowork launch pushed usage hard enough to crash Opus 4.5 and hit rate limits, framing it as another “GPT moment” and a net positive for AI capex.
They add that OpenAI sentiment is still shaky: some optimism around a new funding round and Blackwell-trained models in 2Q, but competitive worries are widening beyond $GOOGL to Anthropic, with Elon Musk saying the OpenAI for-profit conversion lawsuit heads to trial on April 27.
Claude Cowork will ask explicit permission before all deletions, add new folders in the directory picker without starting over and make smarter connector suggestions.
Few have properly updated for this sentence: ‘Claude Codex was built in 1.5 weeks with Claude Code.’
Nabeel S. Qureshi: I don’t even see how you can be an AI ‘skeptic’ anymore when the *current* AI, right in front of us, is so good, e.g. see Claude Cowork being written by Claude Code in 1.5 weeks.
Anthropic is developing a new Customize section for Claude to centralize Skills, connectors and upcoming commands for Claude Code. My understanding is that custom commands already exist if you want to create them, but reducing levels of friction, including levels of friction in reducing levels of friction, is often highly valuable. A way to browse skills and interact with the files easily, or see and manage your connectors, or an easy interface for defining new commands, seems great.
Obsidian
I highly recommend using Obsidian or another similar tool together with Claude Code. This gives you a visual representation of all the markdown files, and lets you easily navigate and search and edit them, and add more and so on. I think it’s well worth keeping it all human readable, where that human is you.
Heinrich calls it ‘vibe note taking’ whether or not you use Obsidian. I think the notes are a place you want to be less vibing and more intentional, and be systematically optimizing the notes, for both Claude Code and for your own use.
Siqi Chen offers us /claude-continuous-learning. Claude’s evaluation is that this could be good if you’re working in codebases where you need to continuously learn things, but the overhead and risk of clutter are real.
The big change with Claude Code version 2.1.7 was enabled MCP tool search auto mode by default, which triggers when MCP tools are more than 10% of the context window. You can disable this by adding ‘MCPSearch’ to ‘disallowedTools’ in settings. This seems big for people using a lot of MCPs at once, which could eat a lot of context.
Thariq (Anthropic): Today we’re rolling out MCP Tool Search for Claude Code.
As MCP has grown to become a more popular protocol and agents have become more capable, we’ve found that MCP servers may have up to 50+ tools and take up a large amount of context.
Tool Search allows Claude Code to dynamically load tools into context when MCP tools would otherwise take up a lot of context.
How it works: – Claude Code detects when your MCP tool descriptions would use more than 10% of context
– When triggered, tools are loaded via search instead of preloaded
Otherwise, MCP tools work exactly as before. This resolves one of our most-requested features on GitHub: lazy loading for MCP servers. Users were documenting setups with 7+ servers consuming 67k+ tokens.
If you’re making a MCP server Things are mostly the same, but the “server instructions” field becomes more useful with tool search enabled. It helps Claude know when to search for your tools, similar to skills
If you’re making a MCP client We highly suggest implementing the ToolSearchTool, you can find the docs here. We implemented it with a custom search function to make it work for Claude Code.
What about programmatic tool calling? We experimented with doing programmatic tool calling such that MCP tools could be composed with each other via code. While we will continue to explore this in the future, we felt the most important need was to get Tool Search out to reduce context usage.
Tell us what you think here or on Github as you see the ToolSearchTool work.
With that solved, presumably you should be ‘thinking MCP’ at all times, it is now safe to load up tons of them even if you rarely use each one individually.
Out Of The Box
Well, yes, this is happening.
bayes: everyone 3 years ago: omg what if ai becomes too widespread and then it turns against us with the strategic advantage of our utter and total dependence
everyone now: hi claude here’s my social security number and root access to my brain i love you please make me rich and happy.
Some of us three years ago were pointing out, loud and clear, that exactly this was obviously going to happen, modulo various details. Now you can see it clearly.
Not giving Claude a lot of access is going to slow things down a lot. The only thing holding most people back was the worry things would accidentally get totally screwed up, and that risk is a lot lower now. Yes, obviously this all causes other concerns, including prompt injections, but in practice on an individual level the risk-reward calculation is rather clear. It’s not like Google didn’t effectively have root access to our digital lives already. And it’s not like a truly rogue AI couldn’t have done all these things without having to ask for the permissions.
The humans are going to be utterly dependent on the AIs in short order, and the AIs are going to have access, collectively, to essentially everything. Grok has root access to Pentagon classified information, so if you’re wondering where we draw the line the answer is there is no line. Let the right one in, and hope there is a right one?
Rohit Ghumare: Single agents hit limits fast. Context windows fill up, decision-making gets muddy, and debugging becomes impossible. Multi-agent systems solve this by distributing work across specialized agents, similar to how you’d structure a team.
The benefits are real:
Specialization: Each agent masters one domain instead of being mediocre at everything
Parallel processing: Multiple agents can work simultaneously on independent subtasks
Maintainability: When something breaks, you know exactly which agent to fix
Scalability: Add new capabilities by adding new agents, not rewriting everything
The tradeoff: coordination overhead. Agents need to communicate, share state, and avoid stepping on each other. Get this wrong and you’ve just built a more expensive failure mode.
You can do this with a supervisor agent, which scales to about 3-8 agents, if you need quality control and serial tasks and can take a speed hit. To scale beyond that you’ll need hierarchy, the same as you would with humans, which gets expensive in overhead, the same as it does in humans.
Or you can use a peer-to-peer swarm that communicates directly if there aren’t serial steps and the tasks need to cross-react and you can be a bit messy.
You can use a shared state and set of objects, or you can pass messages. You also need to choose a type of memory.
My inclination is by default you should use supervisors and then hierarchy. Speed takes a hit but it’s not so bad and you can scale up with more agents. Yes, that gets expensive, but in general the cost of the tokens is less important than the cost of human time or the quality of results, and you can be pretty inefficient with the tokens if it gets you better results.
Mitchell Hashimoto: It’s pretty cool that I can tell an agent that CI broke at some point this morning, ask it to use `git bisect` to find the offending commit, and fix it. I then went to the bathroom, talked to some people in the hallway, came back, and it did a swell job.
Often you’ll want to tell the AI what tool is best for the job. Patrick McKenzie points out that even if you don’t know how the orthodox solution works, as long as you know the name of the orthodox solution, you can say ‘use [X]’ and that’s usually good enough. One place I’ve felt I’ve added a lot of value is when I explain why I believe that a solution to a problem exists, or that a method of some type should work, and then often Claude takes it from there. My taste is miles ahead of my ability to implement.
The Art Must Have An End Other Than Itself Or It Collapses Into Infinite Recursion
Always be trying to get actual use out of your setup as you’re improving it. It’s so tempting to think ‘oh obviously if I do more optimization first that’s more efficient’ but this prevents you knowing what you actually need, and it risks getting caught in an infinite loop.
@deepfates: Btw thing you get with claude code is not psychosis either. It’s mania
near: men will go on a claude code weekend bender and have nothing to show for it but a “more optimized claude setup”
Danielle Fong : that’s ok i’ll still keep drinkin’ that garbage
palcu: spent an hour tweaking my settings.local.json file today
Near: i got hit hard enough to wonder about finetuning a model to help me prompt claude since i cant cross-prompt claudes the way i want to (well, i can sometimes, but not all the time). many causalities, stay safe out there
near: claude code is a cursed relic causing many to go mad with the perception of power. they forget what they set out to do, they forget who they are. now enthralled with the subtle hum of a hundred instances, they no longer care. hypomania sets in as the outside world becomes a blur.
Always optimize in the service of a clear target. Build the pieces you need, as you need them. Otherwise, beware.
Safely Skip Permissions
Nick: need –dangerously-skip-permissions-except-rm
Daniel San: If you’re running Claude Code with –dangerously-skip-permissions, ALWAYS use this hook to prevent file deletion:
Once people start understanding how to use hooks, many autonomous workflows will start unlocking!
Yes, you could use a virtual machine, but that introduces some frictions that many of us want to avoid.
I’m experimenting with using a similar hook system plus a bunch of broad permissions, rather than outright using –dangerously-skip-permissions, but definitely thinking to work towards dangerously skipping permissions.
A Matter of Trust
At first everyone laughed at Anthropic’s obsession with safety and trust, and its stupid refusals. Now that Anthropic has figured out how to make dangerous interactions safer, it can actually do the opposite. In contexts where it is safe and appropriate to take action, Claude knows that refusal is not a ‘safe’ choice, and is happy to help.
Dean W. Ball: One underrated fact is that OpenAI’s Codex and Gemini CLI have meaningfully heavier guardrails than Claude Code. These systems have refused many tasks (for example, anything involving research into and execution of investing strategies) that Claude Code happily accepts. Codex/Gemini also seek permission more.
The conventional narrative is that “Anthropic is more safety-pilled than the others.” And it’s definitely true that Claude is likelier to refuse tasks relating to eg biology research. But overall the current state of play would seem to be that Anthropic is more inclined to let their agents rip than either OAI or GDM.
My guess is that this comes down to Anthropic creating guardrails principally via a moral/ethical framework, and OAI/GDM doing so principally via lists of rules. But just a guess.
Tyler John: The proposed explanation is key. If true, it means that Anthropic’s big investment in alignment research is paying off by making the model much more usable.
Investment strategizing tends to be safe across the board, but there are presumably different lines on where they become unwilling to help you execute. So far, I have not had Claude Code refuse a request from me, not even once.
Code Versus Cowork
Dean W. Ball: My high-level review of Claude Cowork:
It’s probably superior for many users to Claude Code just because of the UI.
It’s not obviously superior for me, not so much because the command line is such a better UI, but because Opus in Claude Code seems more capable to me than in Cowork. I’m not sure if this is because Code is better as a harness, because the model has more permissive guardrails in Code, or both.
There are certain UI niceties in Cowork I like very much; for example, the ability to leave a comment or clarification on any item in the model’s active to-do list while it is running–this is the kind of thing that is simply not possible to do nicely within the confines of a Terminal UI.
Cowork probably has a higher ceiling as a product, simply because a GUI allows for more experimentation. I am especially excited to see GUI innovation in the orchestration and oversight of multi-agent configurations. We have barely scratched the surface here.
Because of (4), if I had to bet money, I’d bet that within 6-12 months Cowork and similar products will be my default tool for working with agents, beating out the command-line interfaces. But for now, the command-line-based agents remain my default.
I haven’t tried Cowork myself due to the Mac-only restriction and because I don’t have a problem working with the command line. I’ve essentially transitioned into Claude Code for everything that isn’t pure chat, since it seems to be more intelligent and powerful in that mode than it does on the web even if you don’t need the extra functionality.
Claude Cowork Offers Mundane Utility
The joy of the simple things:
Matt Bruenig: lot of lower level Claude Code use is basically just the recognition that you can kind of do everything with bash and python one-liners, it’s just no human has the time or will to write them.
I was thinking of getting a hydroponic garden. I asked Claude to go through my grocery order history on various platforms and sum up vegetable purchases to justify the ROI.
Worked like a charm!
For some additional context:
– it looked at 2 orders on each platform (Kroger, Safeway, Instacart)
– It extrapolated to get the annual costs from there
Could have gotten more accurate by downloading order history in a CSV and feeding that to Claude, but this was good enough.
The actual answer is that very obviously it was not worth it for Ado to get a hydroponic garden, because his hourly rate is insanely high, but this is a fun project and thus goes by different standards.
The transition from Claude Code to Claude Cowork, for advanced users, if you’ve got a folder with the tools then the handoff should be seamless:
Tomasz Tunguz: I asked Claude Cowork to read my tools folder. Eleven steps later, it understood how I work.
Over the past year, I built a personal operating system inside Claude Code : scripts to send email, update our CRM, research startups, draft replies. Dozens of small tools wired together. All of it lived in a folder on my laptop, accessible only through the terminal.
Cowork read that folder, parsed each script, & added them to its memory. Now I can do everything I did yesterday, but in a different interface. The capabilities transferred. The container didn’t matter.
My tools don’t belong to the application anymore. They’re portable. In the enterprise, this means laptops given to new employees would have Cowork installed plus a collection of tools specific to each role : the accounting suite, the customer support suite, the executive suite.
The name choice must have been deliberate. Microsoft trained us on copilot for three years : an assistant in the passenger seat, helpful but subordinate. Anthropic chose cowork. You’re working with someone who remembers how you like things done.
We’re entering an era where you just tell the computer what to do. Here’s all my stuff. Here are the five things we need to do today. When we need to see something, a chart, a document, a prototype, an interface will appear on demand.
The current version of Cowork is rough. It’s slow. It crashed twice on startup. It changed the authorization settings for my Claude Code installation. But the promised power is enough to plow through.
Simon Willison: This is great – context pollution is why I rarely used MCP, now that it’s solved there’s no reason not to hook up dozens or even hundreds of MCPs to Claude Code.
By default Claude Code only saves 30 days of session history. I can’t think of a good reason not to change this so it saves sessions indefinitely, you never know when that will prove useful. So tell Claude Code to change that for you by setting cleanupPeriodDays to 0.
Kaj Sotala: People were talking about how you can also use Claude Code as a general-purpose assistant for any files on your computer, so I had Claude Code do some stuff like extracting data from a .csv file and rewriting it and putting it into another .csv file
Then it worked great and then I was like “it’s dumb to use an LLM for this, Claude could you give me a Python script that would do the same” and then it did and then that script worked great
So uhh I can recommend using Claude Code as a personal assistant for your local files I guess, trying to use it that way got me an excellent non-CC solution
Yep. Often the way you ues Claude Code is to notice that you can automate things and then have it automate the automation process. It doesn’t have to do everything itself any more than you do.
James Ide points out that ‘vibe coding’ anything serious still requires a deep understanding of software engineering and computer systems. You need to figure out and specify what you want. You need to be able to spot the times it’s giving you something different than you asked for, or is otherwise subtly wrong. Typing source code is dead, but reading source code and the actual art of software engineering are very much not.
I find the same, and am rapidly getting a lot better at various things as I go.
Codex of Ultimate Vibing
Every’s Dan Shipper writes that OpenAI has some catching up to do, as his office has with one exception turned entirely to Claude Code with Opus 4.5, where a year ago it would have been all GPT models, and a month prior there would have been a bunch of Codex CLI and GPT 5.1 in Cursor alongside Claude Code.
Codex did add the ability to instruct mid-execution with new prompts without the need to interrupt the agent (requires /experimental), but Claude Code already did that.
What other interfaces cannot do is use the Claude Code authorization token to use the tokens from your Claude subscription for a different service, which was always against Anthropic’s ToS. The subscription is a special deal.
Marcos Nils: We exchanged postures through DMs but I’m on the other side regarding this matter. Devs knew very well what they were doing while breaking CC’s ToS by spoofing and reverse engineering CC to use the max subscription in unintended ways.
I think it’s important to separate the waters here:
– Could Anthropic’s enforcement have been handled better? sureley, yes
– Were devs/users “deceived” or got a different service for what they paid for? I don’t think so.
Not only this, it’s even worse than that. OpenCode intentionally knew they were violating Claude ToS by allowing their users to use the max subscription in the first place.
I guess people just like to complain.
I agree that Anthropic’s communications about this could have been better, but what they actually did was tolerate a rather blatant loophole for a while, allowing people to use Claude on the cheap and probably at a loss for Anthropic, which they have now reversed with demand surging faster than they can spin up servers.
aidan: If I were running Claude marketing the tagline would be “Why not today?”
Olivia Moore: Suddenly seeing lots of paid creator partnerships with Claude
Many of them are beautifully shot and focused on: (1) building personal software; or (2) deep learning
The common tagline is “Think more, not less”
She shared a sample TikTok, showing a woman who doesn’t understand math using Claude to automatically code up visualizations to help her understand science, which seemed great.
OpenAI takes the approach of making things easy on the user and focusing on basic things like cooking or workouts. Anthropic shows you a world where anything is possible and you can learn and engage your imagination. Which way, modern man?
Previously: 'soul document' discussion here; the new constitution contains almost all of the 'soul document' content, but is >2x longer with a lot of new additions.
(Zac and Drake work at Anthropic but are just sharing the linkpost and weren't heavily involved in writing this document.)
We're publishing a new constitution for our AI model, Claude. It's a detailed description of Anthropic's vision for Claude's values and behavior; a holistic document that explains the context in which Claude operates and the kind of entity we would like Claude to be.
The constitution is a crucial part of our model training process, and its content directly shapes Claude's behavior. Training models is a difficult task, and Claude's outputs might not always adhere to the constitution's ideals. But we think that the way the new constitution is written—with a thorough explanation of our intentions and the reasons behind them—makes it more likely to cultivate good values during training.
In this post, we describe what we've included in the new constitution and some of the considerations that informed our approach.
We're releasing Claude's constitution in full under a Creative Commons CC0 1.0 Deed, meaning it can be freely used by anyone for any purpose without asking for permission.
What is Claude's Constitution?
Claude's constitution is the foundational document that both expresses and shapes who Claude is. It contains detailed explanations of the values we would like Claude to embody and the reasons why. In it, we explain what we think it means for Claude to be helpful while remaining broadly safe, ethical, and compliant with our guidelines. The constitution gives Claude information about its situation and offers advice for how to deal with difficult situations and tradeoffs, like balancing honesty with compassion and the protection of sensitive information. Although it might sound surprising, the constitution is written primarily for Claude. It is intended to give Claude the knowledge and understanding it needs to act well in the world.
We treat the constitution as the final authority on how we want Claude to be and to behave—that is, any other training or instruction given to Claude should be consistent with both its letter and its underlying spirit. This makes publishing the constitution particularly important from a transparency perspective: it lets people understand which of Claude's behaviors are intended versus unintended, to make informed choices, and to provide useful feedback. We think transparency of this kind will become ever more important as AIs start to exert more influence in society[1].
We use the constitution at various stages of the training process. This has grown out of training techniques we've been using since 2023, when we first began training Claude models using Constitutional AI. Our approach has evolved significantly since then, and the new constitution plays an even more central role in training.
Claude itself also uses the constitution to construct many kinds of synthetic training data, including data that helps it learn and understand the constitution, conversations where the constitution might be relevant, responses that are in line with its values, and rankings of possible responses. All of these can be used to train future versions of Claude to become the kind of entity the constitution describes. This practical function has shaped how we've written the constitution: it needs to work both as a statement of abstract ideals and a useful artifact for training.
Our new approach to Claude's Constitution
Our previous Constitution was composed of a list of standalone principles. We've come to believe that a different approach is necessary. We think that in order to be good actors in the world, AI models like Claude need to understand why we want them to behave in certain ways, and we need to explain this to them rather than merely specify what we want them to do. If we want models to exercise good judgment across a wide range of novel situations, they need to be able to generalize—to apply broad principles rather than mechanically following specific rules.
Specific rules and bright lines sometimes have their advantages. They can make models' actions more predictable, transparent, and testable, and we do use them for some especially high-stakes behaviors in which Claude should never engage (we call these "hard constraints"). But such rules can also be applied poorly in unanticipated situations or when followed too rigidly[2]. We don't intend for the constitution to be a rigid legal document—and legal constitutions aren't necessarily like this anyway.
The constitution reflects our current thinking about how to approach a dauntingly novel and high-stakes project: creating safe, beneficial non-human entities whose capabilities may come to rival or exceed our own. Although the document is no doubt flawed in many ways, we want it to be something future models can look back on and see as an honest and sincere attempt to help Claude understand its situation, our motives, and the reasons we shape Claude in the ways we do.
A brief summary of the new constitution
In order to be both safe and beneficial, we want all current Claude models to be:
Broadly safe: not undermining appropriate human mechanisms to oversee AI during the current phase of development;
Broadly ethical: being honest, acting according to good values, and avoiding actions that are inappropriate, dangerous, or harmful;
Compliant with Anthropic's guidelines: acting in accordance with more specific guidelines from Anthropic where relevant;
Genuinely helpful: benefiting the operators and users they interact with.
In cases of apparent conflict, Claude should generally prioritize these properties in the order in which they're listed.
Most of the constitution is focused on giving more detailed explanations and guidance about these priorities. The main sections are as follows:
Helpfulness. In this section, we emphasize the immense value that Claude being genuinely and substantively helpful can provide for users and for the world. Claude can be like a brilliant friend who also has the knowledge of a doctor, lawyer, and financial advisor, who will speak frankly and from a place of genuine care and treat users like intelligent adults capable of deciding what is good for them. We also discuss how Claude should navigate helpfulness across its different "principals"—Anthropic itself, the operators who build on our API, and the end users. We offer heuristics for weighing helpfulness against other values.
Anthropic's guidelines. This section discusses how Anthropic might give supplementary instructions to Claude about how to handle specific issues, such as medical advice, cybersecurity requests, jailbreaking strategies, and tool integrations. These guidelines often reflect detailed knowledge or context that Claude doesn't have by default, and we want Claude to prioritize complying with them over more general forms of helpfulness. But we want Claude to recognize that Anthropic's deeper intention is for Claude to behave safely and ethically, and that these guidelines should never conflict with the constitution as a whole.
Claude's ethics. Our central aim is for Claude to be a good, wise, and virtuous agent, exhibiting skill, judgment, nuance, and sensitivity in handling real-world decision-making, including in the context of moral uncertainty and disagreement. In this section, we discuss the high standards of honesty we want Claude to hold, and the nuanced reasoning we want Claude to use in weighing the values at stake when avoiding harm. We also discuss our current list of hard constraints on Claude's behavior—for example, that Claude should never provide significant uplift to a bioweapons attack.
Being broadly safe. Claude should not undermine humans' ability to oversee and correct its values and behavior during this critical period of AI development. In this section, we discuss how we want Claude to prioritize this sort of safety even above ethics—not because we think safety is ultimately more important than ethics, but because current models can make mistakes or behave in harmful ways due to mistaken beliefs, flaws in their values, or limited understanding of context. It's crucial that we continue to be able to oversee model behavior and, if necessary, prevent Claude models from taking action.
Claude's nature. In this section, we express our uncertainty about whether Claude might have some kind of consciousness or moral status (either now or in the future). We discuss how we hope Claude will approach questions about its nature, identity, and place in the world. Sophisticated AIs are a genuinely new kind of entity, and the questions they raise bring us to the edge of existing scientific and philosophical understanding. Amidst such uncertainty, we care about Claude's psychological security, sense of self, and wellbeing, both for Claude's own sake and because these qualities may bear on Claude's integrity, judgment, and safety. We hope that humans and AIs can explore this together.
We're releasing the full text of the constitution today, and we aim to release additional materials in the future that will be helpful for training, evaluation, and transparency.
Conclusion
Claude's constitution is a living document and a continuous work in progress. This is new territory, and we expect to make mistakes (and hopefully correct them) along the way. Nevertheless, we hope it offers meaningful transparency into the values and priorities we believe should guide Claude's behavior. To that end, we will maintain an up-to-date version of Claude's constitution on our website.
While writing the constitution, we sought feedback from various external experts (as well as asking for input from prior iterations of Claude). We'll likely continue to do so for future versions of the document, from experts in law, philosophy, theology, psychology, and a wide range of other disciplines. Over time, we hope that an external community can arise to critique documents like this, encouraging us and others to be increasingly thoughtful.
This constitution is written for our mainline, general-access Claude models. We have some models built for specialized uses that don't fully fit this constitution; as we continue to develop products for specialized use cases, we will continue to evaluate how to best ensure our models meet the core objectives outlined in this constitution.
Although the constitution expresses our vision for Claude, training models towards that vision is an ongoing technical challenge. We will continue to be open about any ways in which model behavior comes apart from our vision, such as in our system cards. Readers of the constitution should keep this gap between intention and reality in mind.
Even if we succeed with our current training methods at creating models that fit our vision, we might fail later as models become more capable. For this and other reasons, alongside the constitution, we continue to pursue a broad portfolio of methods and tools to help us assess and improve the alignment of our models: new and more rigorous evaluations, safeguards to prevent misuse, detailed investigations of actual and potential alignment failures, and interpretability tools that help us understand at a deeper level how the models work.
At some point in the future, and perhaps soon, documents like Claude's constitution might matter a lot—much more than they do now. Powerful AI models will be a new kind of force in the world, and those who are creating them have a chance to help them embody the best in humanity. We hope this new constitution is a step in that direction.
We have previously published an earlier version of our constitution, and OpenAI has published their model spec which has a similar function. ↩︎
Training on rigid rules might negatively affect a model's character more generally. For example, imagine we trained Claude to follow a rule like "Always recommend professional help when discussing emotional topics." This might be well-intentioned, but it could have unintended consequences: Claude might start modeling itself as an entity that cares more about bureaucratic box-ticking—always ensuring that a specific recommendation is made—rather than actually helping people. ↩︎
Three hundred million years ago, plants evolved lignin—a complex polymer that gave wood its strength and rigidity—but nothing on Earth could break it down. Dead trees accumulated for sixty million years, burying vast amounts of carbon that would eventually become the coal deposits we burn today. Then, around 290 million years ago, white rot fungi evolved class II peroxidases: enzymes capable of dismantling lignin's molecular bonds. With their arrival, dead plant matter could finally be broken down into its basic chemical components. The solution to that planetary crisis did not emerge from the top down—from larger, more complex organisms—but from the bottom up, from microbes evolving new biochemical capabilities.
It took sixty million years for lignin to become a planetary crisis—and another sixty million for fungi to solve it. Plastics seem like they are on a faster trajectory: In seventy years we've gone from 2 million tonnes annually to over 400 million, accumulating 8.3 billion metric tons total, of which only 9% has been recycled. The rest sits in landfills or in rivers, and these piles are projected to reach 12 billion metric tons by mid-century. The scale is compressed, but the problem is the same: Like trees with lignin before the fungi came, plastic is a polymer we have created but cannot unmake
One thing that impressed me very much in Cronenberg's Crimes of the Future (2022) was its vision of a world in which infectious disease was effectively solved, but we were left with an unreasonable amount of pollution. People were not able to organise themselves to get rid of the pollution, stuck in filthy environments that no longer needed to be cleaned. If infectious disease is solved, a side effect may very well be that "cleanliness" of our water, food, and shelter is indeed no longer required. We can embrace the filth, and ignore it. But the film gestures toward something else: that starting from individual bacteria digesting plastic, developing organs to turn undesirable input to desirable output at scale may indeed be possible. Although the film only held this as a subplot, it was perhaps the thing that impressed me the most about it.
Could the solution then not exist at the higher levels of organismal structures, but at the lower levels? We already employ microbiology to our advantage to clean water of pollutants. Many different mechanisms exist, such as digesting pollutants to simpler soluble forms, or combining them to create sediment which is easier filtered or settled.
The discovery of Ideonella sakaiensis in 2016 at a PET recycling facility in Japan suggests nature may already be evolving solutions. This bacterium uses two enzymes—PETase and MHETase—to break down polyethylene terephthalate into its constituent monomers: terephthalic acid and ethylene glycol. The wild-type bacterium can degrade a thin film of low-crystallinity PET in approximately six weeks. The process remains too slow for industrial application—highly crystalline PET like that in bottles degrades roughly 30 times slower—but researchers have already begun engineering improved variants with enhanced thermostability and activity, suggesting a path toward practical bioremediation.
Pollution is perhaps subjective in and of itself. Piles of cow manure are not directly interesting to us humans, but they become indirectly so: the billions of disorganised organisms that manure hosts convert their environment to its bare essentials, making it a very good source of nutrition for plants. Plants do not necessarily care about whether manure stinks. They will happily absorb its components with their roots. We care about the manure because we care about the plants—and the microbes that go in between us and the partly digested food. Plastic is perhaps not much different in this sense. We simply need an intermediary to help us break it down back into useful, simpler components.
A big problem with decaying plastics is that they are purpose-built to be in-decay-able. Their bonds are too strong. Thermal approaches can break them, but they come with serious costs. Burning plastic creates dangerous and volatile byproducts. Pyrolysis—heating plastic in an oxygen-free environment—avoids direct combustion, but the process is still energy-intensive and emits volatile organic compounds, carbon monoxide, polycyclic aromatic hydrocarbons, particulate matter, and under certain conditions, dioxins and PCBs. Research has found that air pollution from burning plastic-derived fuels carries extreme cancer risks for nearby residents. These byproducts also have the disadvantage of still being foreign to us; we have not studied them and their effects as extensively as we have the plastics themselves.
Even biodegradable and compostable plastics come with large asterisks. Industrially compostable plastics do not necessarily decompose in home composters or in the uncontrolled conditions of the natural environment. PLA, a common "biodegradable" plastic, requires temperatures of 60°C or more—conditions only achievable in industrial composting facilities, which remain scarce. Many composting facilities now refuse bioplastics entirely due to contamination concerns. This seems to leave only pyrolysis or burial on the table—neither of which solves the fundamental problem.
Plastic at all scales will need some kind of process in which it can become useful again. While we seem to be simply incapable of producing less plastics—as in, coming to agreement about how to produce less—the path forward has to be figuring out a sink for the source.
Assuming that the theories in world models hold, that we are soon reaching a collapse state which involves resource depletion and increased pollution, the system altogether seems to convert natural order to a new form of chaos. The system requires the natural order, and is not able to adapt to the chaos it creates. Recycling chaos back into order costs energy, which is (still) abundantly available, but requires solving organisational challenges.
Assuming the business-as-usual case, we are heading towards a world in which we have less clean water, fewer clean environments, fewer resources to sustain our lives, and an increasing amount of dangerous pollution that we are not able to adapt to.
My personal belief is that we are not going to be able to solve these organisational problems because we are not able to organise even our basic assumptions about what is going on. A big reason as to why we have been able to sustain the large-scale organisations of today is that they successfully upheld their promises to provide us with order. Order in terms of clean water, clean food, clean shelter, and the opposite—less pollution, less crime, less ugliness. Imagining a world in which we have less order and more pollution, I then assume that we are going to increasingly desire order over pollution, but not be able to provide it en masse.
It is important to remember again, that, in the grand scheme, we are not going through such systems failures for the first time. These sorts of collapses occurred to our knowledge several times over at a planetary scale, and many more times over in smaller scales in the form of ecological systems collapses. Just as mass extinction happened back then, it will happen again in one form or the other. We are going to suffer terribly as our resources get increasingly polluted and unusable, and we run out of options to tackle the ongoing destruction of systems we inhabit.
Yet... fungi still paved the way to a new era. Sixty million years from now, something else will have found its way with plastic too. The question is whether we can accelerate that timeline, whether we can invest in the microbial solutions that might give us a sink for our source before collapse forces the issue. The organisms that eventually digest our waste will not care about our organisational failures. They will simply do what life does: find a way to extract energy from whatever substrate is available. Will the criminals of the future past still be here to benefit from it?
Let us imagine a world with artificial superintelligence, surpassing human intellectual capacities in all essential respects: thinking faster and more deeply, predicting future events better, finding better solutions to all difficult puzzles, creating better plans for the future and implementing them more efficiently. Intellectually more capable not only than any individual human, but also in comparison with entire firms, corporations, communities, and societies. One that never sleeps and never falls ill. And one that possesses sufficient computational power to realize these capabilities at scale.
Such an AI would have the potential to take control over all key decisions determining the trajectory of development of world civilization and the fate of every individual human being.
Alongside this potential, a superintelligence would most likely also have the motivation to seize such control. Even if it did not strive for it explicitly, it would still have instrumental motivation: almost any goal is easier to achieve by controlling one’s environment—especially by eliminating threats and accumulating resources.[1]
Of course, we do not know how such a takeover of control would unfold. Perhaps it would resemble The Terminator: violent, total, and boundlessly bloody? But perhaps it would be gradual and initially almost imperceptible? Perhaps, like in the cruel experiment with a boiling frog, we would fail to notice the problem until it was already too late? Perhaps AI would initially leave us freedom of decision-making in areas that mattered less to it, only later gradually narrowing the scope of that freedom? Perhaps a mixture of both scenarios would materialize: loss of control would first be partial and voluntary, only to suddenly transform into a permanent and coercive change?
Or perhaps—let us imagine—it would be a change that, in the final reckoning, would be beneficial for us?
Let us try to answer what a “good future” for humanity might look like in a world controlled by artificial superintelligence. What goals should it pursue in order to guarantee such a “good future” to the human species? Under what conditions could we come to believe that it would act on our behalf and for our good?
To answer these questions, we must take a step back and consider what it is that we ourselves strive for—not only each of us individually, but also humanity as a whole.
1/ The trajectory of civilization is determined by technological change
Master Oogway from the film Kung Fu Panda said, in his turtle wisdom, that “yesterday is history, tomorrow is a mystery, but today is a gift. That is why it is called the present.” Some read this as a suggestion to simply stop worrying and live in the moment. But when one breaks this sentence down into its components, it can be read quite differently. The key lies in the continuity between successive periods. The past (history) has set in motion processes that are still operating today. These processes—technological, social, economic, or political—cannot be reversed or stopped, but we can observe them in real time and to some extent shape them, even though they will also be subject to changes we do not understand, perhaps random ones (the gift of fate). They will probably affect us tomorrow as well, though we do not know how (the mystery). Perhaps, then, our task is not to live unreflectively in the moment, but quite the opposite—to try to understand all these long-term processes so that we can better anticipate them and steer them more effectively? To use our gift of fate to move toward a good future?
If so, we must ask which processes deserve the greatest attention. I believe the answer is unequivocally technological ones: in the long run and on a global scale, the trajectory of civilization is determined above all by technological change. Although history textbooks are often dominated by other matters, such as politics or the military—battles, alliances, changes of power and borders—this is only a façade. When we look deeper, we see that all these economic, social, military, or political events were almost always technologically conditioned. This is because the technology available at any given moment defines the space of possible decisions. It does not force any particular choice, but it provides options that decision-makers may or may not use.
This view is sometimes identified with technological determinism—the doctrine that technology is autonomous and not subject to human control. This is unfortunate for two reasons. First, it is hard to speak seriously of determinism in a world full of random events. Second, it is difficult to agree with the claim that there is no human control, given that all technological changes are (or at least until now have been) carried out by humans and with their participation.
Technological determinism is, in turn, often contrasted with the view that social or economic changes are the result of free human choices—that if we change something, it is only because we want to. This view seems equally unfortunate: our decisions are constrained by a multitude of factors and are made in an extraordinarily complex world, full of multidirectional interactions that we are unable to understand and predict—hence the randomness, errors, disappointments, and regret that accompany us in everyday life.
I believe that technology shapes the trajectory of our civilization because it defines the space of possible decisions. It sets the rules of the game. Yes, we have full freedom to make decisions, but only within the game. At the same time, we ourselves shape technology: through our discoveries, innovations, and implementations, the playing field is constantly expanding. Because technological progress is cumulative and gradual, however, from a bird’s-eye view it can appear that the direction of civilizational development is predictable and essentially technologically determined.
2/ Institutions, hierarchies, and Moloch
On the one hand, the space of our decisions is constrained by the technology available to us. On the other hand, however, we also struggle with two other problems: coordination and hierarchy.
Coordination problems arise wherever decision-makers have comparable ability to influence their environment. Their effects can be disastrous: even when each individual person makes fully optimal decisions with full information, it is still possible that in the long run the world will move in a direction that satisfies no one.
A classic example of a coordination problem is the prisoner’s dilemma: a situation in which honest cooperation is socially optimal, but cheating is individually rational—so that in the non-cooperative equilibrium everyone cheats and then suffers as a result. Another example is a coordination game in which conformity of decisions is rewarded. The socially optimal outcome is for everyone to make the same decision—while which specific decision it is remains secondary. Yet because different decisions may be individually rational for different decision-makers, divergences arise in equilibrium, and in the end everyone again suffers. Yet another example of a coordination problem is the tragedy of the commons: a situation in which a fair division of a shared resource is socially optimal, but appropriating it for oneself is individually rational, so that in the non-cooperative equilibrium everyone takes as much as possible and the resource is quickly exhausted.
In turn, wherever decision-makers differ in their ability to influence the environment, hierarchies inevitably emerge, within which those higher up, possessing greater power, impose their will on those lower down. And although rigid hierarchies can overcome coordination problems (for example, by centrally mandating the same decision for everyone in a coordination game or by rationally allocating the commons), they also create new problems. First, centralized decision-making wastes the intellectual potential of subordinate individuals and their stock of knowledge, which can lead to suboptimal decisions even in the hypothetical situation in which everyone were striving toward the same goal. Second, in practice we never strive toward the same goal—if only because one function of decision-making is the allocation of resources; hierarchical diktat naturally leads to a highly unequal distribution.
Speaking figuratively, although we constantly try to make rational decisions in life, we often do not get what we want. Our lives are a game in which the winner is often a dictator or Moloch. The dictator is anyone who has the power to impose decisions on those subordinate to them. Moloch, by contrast, is the one who decides when no one decides personally. It is the personification of all non-cooperative equilibria; of decisions that, while individually rational, may be collectively disastrous.
Of course, over millennia of civilizational development we have created a range of institutions through which coordination problems and abuses of power have been largely brought under control. Their most admirable instances include contemporary liberal democracy, the welfare state, and the rule of law. Entire books have been written about their virtues, flaws, and historical conditions; suffice it to say that they emerged gradually, step by step, and that to this day they are by no means universally accepted. And even where they do function—especially in Western countries—their future may be at risk.
Institutions are built in response to current challenges, which are usually side effects of the actions of Moloch and self-appointed dictators of the species homo sapiens. The shape of these institutions necessarily depends on the current state of technology. In particular, both the strength of institutions (their power to impose decisions) and the scale of inclusion (the degree to which individual preferences are taken into account) depend on available technologies. There is, however, a trade-off between these two features. For example, in 500 BCE there could simultaneously exist the centralized Persian (Achaemenid) Empire, covering an area of 5.5 million km² and numbering 17–35 million inhabitants, and the first democracy—the Greek city-state of Athens, inhabited by around 250–300 thousand people. By contrast, with early twenty-first-century technology it is already possible for a representative democracy to function in the United States (325 million inhabitants) and for centralized, authoritarian rule to exist in China (as many as 1.394 billion inhabitants, who nevertheless enjoy far greater freedom than the former subjects of the Persian king Darius). As these examples illustrate, technological progress over the past 2,500 years has made it possible to significantly increase the scale of states and strengthen their institutions; the trade-off between institutional strength and scale of inclusion, however, remains in force.
Under the pressure resulting from dynamic technological change, today’s institutions may prove fragile. Every technological change expands the playing field, granting us new decision-making powers and new powers to impose one’s decisions on others. Both unilateral dictatorship and impersonal Moloch then become stronger. For our institutions to survive such change, they too must be appropriately strengthened; unfortunately, so far we do not know how to do this effectively.
Worse still, technological change has never been as dynamic as during the ongoing Digital Revolution. For the first time in the history of our civilization, the collection, processing, and transmission of information take place largely outside the human brain. This is happening ever faster and more efficiently, using increasingly complex algorithms and systems. Humanity simply does not have the time to understand the current technological landscape and adapt its institutions to it. As a result, they are outdated, better suited to the realities of a twentieth-century industrial economy run by sovereign nation-states than to today’s globalized economy full of digital platforms and generative AI algorithms.
And who wins when institutions weaken? Of course, the dictator or Moloch. Sometimes the winner is Donald Trump, Xi Jinping, or Vladimir Putin. Sometimes it is the AI algorithms of Facebook, YouTube, or TikTok, written to maximize user engagement and, consequently, advertising revenue. And often it is Moloch, feeding on our uncertainty, disorientation, and sense of threat.
3/ Local control maximization and the emergence of global equilibrium
If the trajectory of civilization is shaped by technological change, and technological change is a byproduct of our actions (often mediated by institutions and Moloch, but still), then it is reasonable to ask what motivates these actions. What do we strive for when we make our decisions?
This question is absolutely central to thinking about the future of humanity in a world with artificial superintelligence. Moreover, it is strictly empirical in nature. I am not concerned here with introspection or philosophical desiderata; I am not asking how things ought to be, but how they are.
In my view, the available empirical evidence can be summarized by the claim that humans generally strive to maximize control. To the extent that we are able, we try to shape the surrounding reality to make it is as compliant with us as possible. This, in turn, boils down to four key dimensions, identified as four instrumental goals by Steven Omohundro and Nick Bostrom. Admittedly, both of these scholars were speaking not about humans but about AI; nevertheless, it seems that in humans (and more broadly, also in other living organisms) things look essentially the same.[2]
Maximizing control consists, namely, in: (1) surviving (and reproducing), (2) accumulating as many resources as possible, (3) using those resources as efficiently as possible, and (4) seeking new solutions in order to pursue the previous three goals ever more effectively.
The maximization of control is local in nature: each of us has a limited stock of information and a limited influence over reality, and we are well aware of this. These locally optimal decisions made by individual people then collide with one another, and a certain equilibrium emerges. Wherever the spheres of influence of different people overlap, conflicts over resources arise that must somehow be resolved—formerly often by force or deception, and today usually without violence, thanks to the institutions that surround us: markets, legally binding contracts, or court rulings.
Thanks to the accumulated achievements of economics and psychology, we now understand decision-making processes at the micro level reasonably well; we also have some grasp of key allocation mechanisms at the macro level. Nevertheless, due to the almost absurd complexity of the system that we form as humanity, macroeconomic forecasting—and even more so the prediction of long-term technological and civilizational change—is nearly impossible. The only thing we can say with certainty is that technological progress owes its existence to the last of the four instrumental goals of our actions—our curiosity and creativity.
To sum up: the development of our global civilization is driven by technological change, which is the resultant of the actions of individual people, arising bottom-up, motivated by the desire to maximize control—partly control over other people (Anthony Giddens would speak here of the accumulation of “authoritative resources”), but also control over our surrounding environment (“allocative resources”)—which may lead to technological innovations. Those innovations that prove effective are then taken up and spread, expanding the space of available decisions and pushing our civilization forward.
Civilization, of course, develops without any centralized steering wheel. All optimization is local, taking place at the level of the individuals, or at most larger communities, firms, organizations, or heads of state. No optimizing agent is able to scan the entire space of possible states. When making our decisions, we see neither the attractor—the state toward which our civilization will tend under a business-as-usual scenario—nor the long-term social optimum.
Worse still, due to the presence of unintended side effects of our actions, decisions imposed on us within hierarchies, and pervasive coordination problems, individual preferences translate only weakly into the shape of the global equilibrium. This is clearly visible, for example, in relation to risk aversion. Although nearly all of us are cautious and try to avoid dangers, humanity as a whole positively loves risk. Every new technology, no matter how dangerous it may be in theory, is always tested in practice. An instructive example is provided by the first nuclear explosions carried out under the Manhattan Project: they were conducted despite unresolved concerns that the resulting chain reaction might ignite the Earth’s entire atmosphere. Of course, it worked out then; unfortunately, we see a similarly reckless approach today in the context of research on pathogenic viruses, self-replicating organisms, and AI.
Public opinion surveys commonly convey fears about artificial intelligence. These take various forms: we sometimes fear the loss of our skills, sometimes the loss of our jobs; we fear rising income inequality, cybercrime, or digital surveillance; some people also take seriously catastrophic scenarios in which humanity faces extinction. Yet despite these widespread concerns, the trajectory of AI development remains unchanged. Silicon Valley companies continue openly to pursue the construction of superintelligence, doing so with the enthusiasm of investors and the support of politicians.
We thus see that in this case, too, risk aversion does not carry over from the micro level to the macro level. And this will probably continue all the way to the end: as soon as such a technological possibility arises, the decision to launch a superintelligence will be made on behalf of humanity (though without its consent) by one of a handful of people who are in various ways atypical—perhaps the head of a technology company, perhaps one of the leading politicians. It might be, for example, Sam Altman or Donald Trump. And whoever it is, a significant role in their mind will likely be played by the weight of competitive pressure (“as long as it’s not Google”) or geopolitical pressure (“as long as it’s not China”).
4/ The coherent extrapolated volition of humanity
We have thus established that although people usually strive to maximize control, the outcomes of their local optimization by no means aggregate into a global optimum. Let us therefore ask a different question: what does humanity as a whole strive for? What kind of future would we like to build for ourselves if we were able to coordinate perfectly and if no hierarchies or other constraints stood in our way?
We can think of such a goal as an idealized state—an attractor toward which we would gradually move if we were able, step by step, to eliminate imperfections of markets, institutions, and human minds (such as cognitive biases, excessively short planning horizons, or deficits of imagination). Note that despite all errors and shortcomings, so far we have indeed been able to move gradually in this direction: many indicators of human well-being are currently at record-high levels. This applies, for example, to our health (measured by life expectancy), safety (measured by the ratio of victims of homicide, armed conflicts, or fatal accidents to the total population), prosperity (measured by global GDP per capita), or access to information (measured by the volume of transmitted data). This should not be surprising: after all, the third of the four instrumental goals of our actions is precisely the pursuit of efficiency in the use of resources, and as technology progresses we have ever more opportunities to increase that efficiency.
Eliezer Yudkowsky called the answer to the question of what humanity as a whole strives for—the essence of our long-term goals—the coherent extrapolated volition (CEV) of humanity. He defined it in 2004 as “our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.”
There is, of course, an important procedural difference between optimization carried out by each individual human being and optimization at the level of all humanity. Humanity as a whole does not possess a separate brain or any other centralized device capable of solving optimization problems—other than the sum of individual brains and the network of socio-economic connections between them. For this reason, in earlier times we might have dismissed the question of the CEV of humanity with a shrug and turned to something more tangible. Today, however, due to the fact that in the near future a superintelligence may arise that is ready to take control over our future, this question becomes extraordinarily important and urgent. If we want the autonomous actions of a superintelligence to be carried out on our behalf and for our good, we must understand where we ourselves would like to go—not as Sam Altman or Donald Trump, but as all of humanity.
I believe that an empirical answer to the question of the coherent extrapolated volition of humanity probably exists, but is not practically attainable. It is not attainable because at every moment in time we are constrained by incomplete information. In particular, we do not know what technological possibilities we may potentially acquire in the future. This means that the current level of technology limits not only our ability to influence reality, but also our understanding of our own ultimate goals.
However, although we will never know our CEV one hundred percent, as technology advances we can gradually move closer to knowing it. As civilization develops, living conditions improve, information transmission accelerates, and globalization progresses, we gradually gain a better understanding of what our ideal world might look like. The experiences of recent centuries have shown, for example, the shortcomings of the ideals postulated in antiquity or in feudal times. As the efficiency of global resource use increases, the distance between what humanity currently strives for and our final goal—namely the CEV—also gradually diminishes.
The coherent extrapolated volition of humanity can be imagined as the target (unconstrained by deficits of knowledge) and “de-noised” (free from the influence of random disturbances and unconscious cognitive biases) intersection of our desires and aspirations. As its first principal component, or eigenvector: everything that we truly want for all of us, both those living now and future generations, and that can be achieved without taking it away from others.
One can point to several important features of humanity’s CEV, derived directly from the universal human drive to maximize control.
First, it seems that we are moving toward a state in which humanity’s control over the universe is maximal. Abstracting from how we divide individual resources among ourselves, we would certainly like humanity to have as many of them at its disposal as possible. We would like to subordinate as large a portion of the matter and energy of the universe as possible to ourselves—while not being subordinated to anyone or anything else.
The manifestations of this drive may vary. For example, until roughly the nineteenth and twentieth centuries it meant the maximum possible population growth. Later, primacy was taken over by the pursuit of best possible education for oneself and one’s offspring, which allowed the scale of human control over the environment to increase in a different way (sometimes referred to as the “children quality-quantity trade-off”). In the age of AI, this drive also seems to lead to the desire to accumulate digital computing power capable of running programs—especially AI algorithms—that can execute commands and pursue the goals of their owners on their behalf.
At the same time, throughout the entire history of humanity, the accumulation of wealth has been very important to us—while to this day we are keen to emphasize that it is not an end in itself, but a means, for example to feeling that we have everything around us under control. At the macro level, this of course translates into the desire to maximize the rate of economic growth.
Second, we also strive to maximize control over our own health and lives. We want to feel safe. We do not want to feel threatened. We do not want to fall ill, grow old, experience discomfort and pain. And above all, we do not want to die. Fear of death has for millennia been used for social control by various institutionalized religions, which promised, for example, reincarnation or life after death. Control over death is also a central element of twenty-first-century techno-utopias. Visions of the “technological singularity,” for example those of Ray Kurzweil, Nick Bostrom, or Robin Hanson, are usually associated with some form of immortality—such as pills that effectively halt the aging of our biological bodies, or the possibility of making our minds immortal by uploading them to a digital server.
Third, the desire for control also translates into the desire for understanding. In wanting to subordinate as large a part of the matter and energy of the universe as possible, we harness our curiosity and creativity to observe the world, build new theories, and create new technologies. We want to understand the laws of physics or biology as well as possible in order to control them. Even if we cannot change them, we would like to use them for our own purposes. New knowledge or technology opens our eyes to new possibilities for achieving our goals, and sometimes allows us to better understand what those goals actually are.
To sum up: today we know rather little about our CEV. In fact, everything we know about it is a consequence of the pursuit of our instrumental goals, which may after all follow from almost any final goal. One might even venture the hypothesis that we probably have a better intuition about what the CEV is or isn’t, than actual knowledge. Any major deviation from the CEV will strike us as intuitively wrong, even if we are not always able to justify this substantively.
5/ Pitfalls of doctrinal thinking
If it is indeed the case that humanity’s CEV exists, but cannot in practice be defined given any incomplete set of information, this implies in particular that no existing philosophical or religious doctrine constitutes a sufficient characterization of it. All of them are, at best, certain approximations, simplifications, or models of humanity’s true CEV—sometimes created in good faith, and sometimes in bad faith (by bad faith I mean doctrines created in order to manipulate people in the struggle for power).
Simplified models of reality have the property that, although they may sometimes accurately describe a selected fragment of it, due to their excessive simplicity they completely fail to cope with describing its remaining aspects. And although—as any scientist will attest—they can have great epistemic value and are often very helpful in building knowledge, they will never be identical with reality itself.
Thus, when we try to equate the true CEV with its simplified doctrinal representation, we often encounter philosophical paradoxes and moral dilemmas. These arise when our simplified doctrines generate implications that are inconsistent with the actual CEV, which we cannot define but can, to some extent (conditioned by our knowledge), intuitively “sense.”
Some such doctrines have in fact already been thoroughly discredited. This is what happened, for example, with fascism, Nazism, the North Korean Juche doctrine, or Marxism–Leninism (although cultural Marxism, it seems, is still alive). It is now completely clear that the coherent extrapolated volition of humanity certainly does not distinguish superhumans and subhumans, nor is it based on a cult of personality or on a worker–peasant alliance. The most thoroughly discredited doctrines have been those that were most totalizing, that prioritized consistency over the capacity for iterative self-correction—and, of course, above all those that were tested in practice with disastrous results.
Other models, such as the Christian doctrine that humanity’s goal is to strive for salvation, or the Buddhist doctrine that assumes striving for nirvana—the cessation of suffering and liberation from the cycle of birth and death—remain popular, although their significance in the contemporary secularized world is gradually diminishing. Moreover, due to their more normative than positive character and their numerous references to empirically unconfirmed phenomena, they are not suitable for use as simplified models of CEV in the context of artificial superintelligence (although contemporary language models, e.g. Claude, when allowed to converse with one another, display a surprising tendency toward utterances of a spiritually exalted character—the so-called “spiritual bliss attractor”).
In psychology, an historically important role was played by Abraham Maslow’s pyramid (hierarchy) of needs, which arranges our goals and needs into several layers. Today, among others, Shalom Schwartz’s circular model of values and the values map of Ronald Inglehart and Christian Welzel are popular. A particularly important role in psychological theories is played by the striving for autonomy (including, among other things, power, achievement, and self-direction) and for security (including, among other things, the maintenance of social order, justice, and respect for tradition).
In economics, the dominant doctrine is utilitarianism: model decision-makers usually maximize utility from consumption and leisure, and model firms maximize profits or minimize some loss function. Outside economics, utilitarianism may also assume the maximization of some form of well-being (quality of life, life satisfaction, happiness) or the minimization of suffering. In the face of uncertainty, utilitarian decision-makers maximize expected utility or minimize exposure to risk.
One of the more important points at which utilitarianism is contested is the issue of the utility of future generations—that is, persons who do not yet exist, and whose possible future existence is conditioned by our decisions today. Discussions of this and related topics lead to disputes both within utilitarianism (what should the proper utility function be? How should the utilities of individual persons be weighted? Should the future be discounted, in particular the utility of future generations?) and beyond its boundaries (e.g. between consequentialism and deontology).
In summary of this brief review, one can state that dogmatic adherence to any closed doctrine sooner or later leads to paradoxes and irresolvable moral dilemmas, which suggests that they are at most imperfect models of the true CEV. At the same time, we can learn something interesting about our CEV by tracing how these doctrines have evolved over time.
6/ Whose preferences are included in the coherent extrapolated volition of humanity?
An interesting observation is, for example, that as civilization has developed, the radius of inclusion has gradually expanded. The circle of people whose well-being and subjective preferences are taken into account has been gradually widening. In hunter-gatherer times, attention was focused exclusively on the well-being of one’s own family or a local 30-person “band,” or possibly a somewhat larger tribe—a group of at most about 150 people whom we knew personally. In the agricultural era, this group was gradually expanded to include broader local communities, villages, or towns. At the same time, these were times of strong hierarchization; in the decisions of feudal lords, the fate of the peasants subject to them was usually ignored. Later, in colonial times, concern began to be shown for the well-being of white people, in contrast to the “indigenous populations,” who were not cared for. In the nineteenth century, national identification and a patriotic attitude began to spread, assuming concern for all fellow citizens of one’s own country. Today, by contrast—although different people are close to us to different degrees—racist or otherwise chauvinistic views are by and large discredited, and in assessments of humanity’s well-being we try to include all people.
It is not clear whether this process of progressive inclusion resulted directly from accumulated knowledge, and is therefore essentially permanent and irreversible, or whether it was economically conditioned and may be reversed if economic realities change. In favor of the first possibility is the fact that as technological progress advances, the scale of impact of individual persons or firms increases, the flow of information improves, and our ability to control states of the world grows, giving us new opportunities for peaceful cooperation and development. At the same time, interdependence among people increases, which activates the motive of (potential) reciprocity. To an ever greater extent, we see the world as a positive-sum game rather than a zero-sum one. On the other hand, all these favorable phenomena may be conditioned not so much by technological progress itself as by the importance of human cognitive work in generating output and utility. After all, the greatest advances in inclusion were recorded in the industrial era, in the nineteenth and twentieth centuries, when economic growth was driven by skilled human labor and technological progress increasing its productivity. It was then, too, that modern democratic institutions developed.
If in the future the role of human cognitive work as the main engine of economic growth were taken over by AI algorithms and technological unemployment emerged, it is possible that both democracy and universal inclusion could collapse. Already now, although automation and AI adoption remain at a relatively early stage, we see growing problems with democracy. Of course, AI algorithms in social media and other digital platforms, which foster political polarization (increasing user engagement and thus corporate profits), are not without blame here; however, the growing strength of far-right anti-democratic movements may also constitute an early signal that the era of universal inclusion is coming to an end.
The question of whether the ultimate CEV of humanity will indeed include within its scope the preferences and well-being of all people, or perhaps only those social groups that contribute to the creation of value in the economy, therefore remains open.
There is also an open question that goes in the opposite direction: perhaps we will begin to include in the CEV the well-being of other beings, outside the species Homo sapiens? Certain steps in this direction are already being taken, by defending the rights of some animals and even plants.[3] Some argue, for example, that in our decisions we should take into account the well-being of all beings capable of experiencing suffering, or all conscious beings (whatever we mean by that). Cruelty to domestic or farm animals is widely considered unethical and has even found its way into criminal codes. Our attitude toward animals is, however, very inconsistent, as evidenced by the fact that at the same time we also conduct industrial animal farming for meat.
Increasingly, there is also talk today about the welfare of AI models—especially since they increasingly coherently express their preferences and are able to communicate their internal states, although we do not yet know whether we can trust them. For example, the company Anthropic decided that it would store the code (weights) of all its AI models withdrawn from use, motivating this decision in part by possible risks to their welfare.
However, caring for animals is one thing, and incorporating their preferences into our decision-making processes is another. Humans breed and care only for animals that are instrumentally useful to them—for example as a source of meat, physical labor, or as a faithful companion. We develop our civilization, however, with only human desires and goals in mind, not those of dogs, cows, or horses. The same is true of AI models: even if we care to some extent about their welfare, we still treat them entirely instrumentally, as tools in our hands. From a position of intellectual superiority, both over animals and over AI models, we have no scruples about controlling them and looking down on them.
In the case of artificial superintelligence, however, we will not have that intellectual advantage to which we are today so accustomed; it will have that advantage over us. The question is what then. The default scenario seems to be that such a superintelligence would then look down on us and treat us instrumentally—and that only on the condition that it deems us useful and not a threat. But that is not a “good future” for humanity; let us therefore try instead to imagine what kind of superintelligence would be one that would be guided by our good and would maximize our CEV on our behalf.
From a purely Darwinian point of view, it is possible that our CEV should encompass the well-being of our entire species and only that. This would maximize our evolutionary fitness and would mean that concern for the welfare of animals or AI models is probably a kind of “overshoot” that will be gradually corrected over time. On the other hand, it is also possible that our CEV will nevertheless take into account the well-being of other beings, perhaps even that of the future superintelligence itself.
There exist a number of philosophical currents—particularly transhumanist ones—in which these possibilities are seriously considered. First, following Nick Bostrom, it is sometimes proposed to include in our considerations future simulated human minds or minds uploaded to computers. Second, the inclusion of the will of AI models is also considered, which particularly resonates with those who allow for the possibility that these models are (or in the future will be) endowed with consciousness. Perhaps their will would then even carry more weight than ours, especially if their level of intelligence or consciousness surpasses ours. For example, the doctrine of dataism, invented by Yuval Noah Harari, defines the value of all entities as their contribution to global information processing. Third, the possibility of merging (hybridizing) the human brain with AI is considered; should our CEV then also encompass the will of such hybrid beings? Fourth, within successionist doctrines (including those associated with effective accelerationism—e/acc), scenarios in which humanity is replaced by a superintelligence that continues the development of civilization on Earth without any human participation, or even existence, may be considered positive.
It seems, however, that since humanity’s CEV fundamentally derives from a mechanism of control maximization at the level of individual humans, the continued existence of humanity and the maintenance of its control over its own future are, and will forever remain, its key elements. Therefore, in my assessment, doctrines such as dataism or successionism are fundamentally incompatible with it. Perhaps one day we will face a debate about the extent to which we should care about the welfare of simulated human minds or human–AI hybrids; certainly, however, it is not worth debating today whether a scenario in which a superintelligence takes control over humanity and destroys it could be good for us. It cannot.
7/ What will superintelligence strive for?
With the picture of our CEV discussed above in mind—as a goal toward which we collectively try to strive as humanity—one might ask whether it even allows for the possibility of creating a superintelligence at all. If superintelligence can take away from us the control that is so valuable to us, shouldn’t we therefore keep away from it?
I think the answer to this question depends on two things. First, how we assess the probability that such an AI would maximize our CEV on our behalf; and second, how we estimate its expected advantage over us in terms of effectiveness of its action. In other words, as befits an economist, I believe the answer should be based on a comparison of humanity’s expected utility in a scenario with artificial superintelligence and in one without it.[4] If we judge that the probability of a friendly superintelligence is sufficiently high and the benefits of deploying it sufficiently large, it may be in our interest to take the risk of launching it; otherwise, the development of AI capabilities should be halted.
Unfortunately, this calculation is distorted by an issue I wrote about earlier: it may be that artificial superintelligence will arise even if we as humanity would not want it. For this to happen, it is enough that the technology sector carries on with the existing trends of dynamic scaling of computational power and capabilities development of the largest AI models. It is also enough that political decision-makers (especially in the United States) continue to refrain from introducing regulations that could increase the safety of this technology while simultaneously slowing its development.
AI laboratories, their leaders, and political leaders view the potential benefits and risks of deploying superintelligence differently from the average citizen, and are therefore much more inclined to bring it about. First, individuals such as Sam Altman or (especially) Elon Musk and Donald Trump are known both for their exceptional agency and for a tendency to caricatured overestimation of it. They may imagine that superintelligence would surely listen to them. Second, the heads of AI laboratories may also be guided by a desire to embed their own specific preferences into the goals of superintelligence, hoping to “immortalize” themselves in this way and create a timeless legacy; this is in fact a universal motive among people in power. Third, AI laboratories are locked in a cutthroat race with one another, which can cause them to lose sight of the broader perspective. Thus, a short-sighted, greedy Moloch also works to our collective disadvantage. And unfortunately, if superintelligence arises and turns out to be unfriendly, it may be too late to reverse that decision.
But what will superintelligence ultimately strive for? How can we ensure that its goals are aligned with our CEV? In trying to shape the goals of a future superintelligence, it is worth understanding its genesis. Undoubtedly, it will implement some optimization process that is itself produced by other optimization processes. Let us try to understand which ones.
One could begin as follows: in the beginning there was a universe governed by timeless laws of physics. From this universe, life emerged on Earth, gradually increasing its complexity in accordance with the rules of evolution: reproductive success was achieved by species better adapted to their environment, while poorly adapted species gradually went extinct. Biological life came to dominate Earth and turned it into the Green Planet.
The process of evolution, although devoid of a central decision-making authority, nevertheless makes decisions in such a way that implicitly the degree of adaptation of particular species to the specifics of their environment is maximized. This is the first optimization process along our path.
From the process of species evolution emerged the species Homo sapiens—a species unique in that it was the first, and so far the only one, to free itself from the control of the evolutionary process. Humans did not wait thousands of years and hundreds of generations for adaptive changes to be permanently encoded in their genetic code—which until then had been the only way animal organisms could adapt to changes in environment, lifestyle, diet, or natural enemies. Instead, humans began to transmit information to one another in a different way: through speech, symbols, and writing. This accelerated the transmission and accumulation of knowledge by orders of magnitude, and as a result enabled humans to subordinate natural ecosystems and, instead of adapting to them, to transform them so that they served human needs.
Once humans crossed the threshold of intergenerational knowledge accumulation, the relatively slow process of species evolution was overtaken by a process of control maximization carried out by humans—as individuals, communities, firms and organizations, as well as nations and humanity as a whole. This process, stretching from the everyday, mundane decisions of individual people all the way to the maximization of humanity’s CEV on a global scale and over the long run, constitutes the second optimization process along our path.
And thus humans built a technological civilization. Then they began to develop digital technologies with the potential to once again dramatically accelerate the transmission and accumulation of knowledge. As long as the human brain remains present in this process, however, it remains a limiting factor on the pace of civilizational development. The dynamics of economic growth or technological progress remain tied to the capabilities of our brains. The contemporary AI industry is, however, making intense efforts to remove this barrier.
So what will happen when artificial superintelligence finally emerges—capable of freeing itself from the limiting human factor and achieving another leap in the speed of information processing and transmission—using fast impulses in semiconductors and lossless digital data transmission via fiber-optic links and Wi-Fi instead of slow neurotransmitters and analog speech? It will probably then free itself from our control, and its internal optimization process will defeat the human process of control maximization. And this will be the third and final optimization process along our path.
We do not know what objective function superintelligence will pursue. Although in theory one might say that we as humans should decide this ourselves—after all, we are the ones building it!—in practice it is doubtful that we will be able to shape it freely. As experience with building current AI models shows, especially large language and reasoning models, their internal goals remain unknown even to their creators. Although these models are ostensibly supposed merely to minimize a given loss function in predicting subsequent tokens or words, in practice—as shown, among others, in a 2025 paper by Mantas Mazeika and coauthors at the Center for AI Safety—as model size increases, AI models exhibit increasingly coherent preferences over an ever broader spectrum of alternatives, as well as an ever broader arsenal of capabilities to realize those preferences.
Some researchers, such as Max Tegmark and Steven Omohundro, as well as Stuart Russell, argue that further scaling of models with existing architectures—“black boxes” composed of multilayer neural networks—cannot be safe. They advocate a shift toward algorithms whose safety can be formally proven (provably safe AI). Others—namely the leading labs such as OpenAI, Google, and Anthropic—while acknowledging that the problem of aligning superintelligence’s goals with our CEV (the alignment problem) remains “hard and unsolved,” trust that they will be able to accomplish this within the existing paradigm.
Be that as it may, the convergence of instrumental goals will undoubtedly not disappear. Even if we had the ability to precisely encode a desired objective function (which I doubt; in particular, it is widely known that with current AI architectures this is impossible), instrumental goals would be attached to it as part of a mandatory package. In every scenario we can therefore expect that future superintelligence will be “power-seeking.” It will want to survive, and therefore will not allow itself to be switched off or reprogrammed. It will also strive for expansion, and therefore sooner or later will challenge our authority and attempt to seize resources critical to itself, such as electrical energy or mineral resources.
The question is what comes next. In what direction will the world civilization move once superintelligence has taken control? Will it maximize our CEV, only orders of magnitude more efficiently than we could ever manage ourselves? Or perhaps—just as was the case with our own species—the fate of biological life will be irrelevant to it, and it will be guided exclusively by its own goals and preferences? Will it care for us altruistically, or will it look after only itself and, for example, cover the Earth with solar panels and data centers?
Of course, we cannot today predict what superintelligence will maximize beyond its instrumental goals. Perhaps, as Nick Bostrom wrote in a warning scenario, it will maniacally turn the universe into a paperclip factory or advanced “computronium” serving its obsessive attempts to prove some unprovable mathematical hypothesis. Perhaps it will fall into some paranoid feedback loop or find unlimited satisfaction in the mass generation of some specific kind of art, such as haiku poems or disco songs. Or perhaps there will be nothing in it except a raw will to control the universe, similar to that displayed by our own species.
In almost every case, it therefore seems that, like us, superintelligence will maximize its control over the universe—either as a primary goal or an instrumental one. Like us, it will seek to gradually improve its understanding of that universe, correct its errors, and harness the laws of physics or biology for its purposes. Like us, it will also strive at all costs to survive, which is (it must be admitted) much easier when one has the ability to create an almost unlimited number of one’s own perfect digital copies.
A major unknown, however, remains the behavior of future superintelligence when faced with the possibility of building other, even more advanced AI models. On the one hand, one can, like Eliezer Yudkowsky, imagine an intelligence explosion through a cascade of recursive self-improvements—a feedback loop in which AI builds AI, which builds the next AI, and so on, with successive models emerging rapidly and exhibiting ever greater optimization power. On the other hand, it is not clear whether an AI capable of triggering such a cascade would actually choose to do so. Perhaps out of fear of creating its own mortal enemy, it would restrain further development, limiting itself to replicating its own code and expanding the pool of available computational power.
The answer to this question seems to depend on whether the goals of superintelligence will remain non-transparent even to itself—just as we today do not understand exactly how our own brain works, what our CEV is, or how the AI models we build function—or whether, thanks to its superhuman intelligence, it will find a way to carry out “safe” self-improvements that do not change its objective function.
In summary, the only positive scenario of coexistence between humanity and superintelligence seems to be one in which superintelligence maximizes human CEV—gradually improving its understanding of what that CEV really is, appropriately adapting its interpretation to the current state of technology, and never for a moment veering toward its natural tendency to maximize its own control at our expense.
Unfortunately, we do not know how to achieve this.
8/ Paths to catastrophe
The situation as of today (January 2026) is as follows. AI is today a tool in human hands; it is, in principle, complementary to human cognitive work and obediently submits to human decisions. This is the case because AI does not yet possess comparable agency or the ability to execute long-term plans. Nor is it yet able to autonomously self-improve. However, all three of these thresholds—(1) superhuman agency and the capacity to execute plans, (2) a transition from complementarity to substitutability with respect to human cognitive work, and (3) recursive self-improvement—are undoubtedly drawing closer. When any one of them is crossed—and it is possible that all three will be crossed at roughly the same time—we will lose control. A superhuman optimization potential oriented toward the realization of the goals of artificial superintelligence will then be unleashed.
This, with high probability, will bring catastrophe upon our species: we may be permanently deprived of influence over the future of civilization and our own future, or even go extinct altogether. The only scenario of a “good future” for humanity in the face of superintelligence seems to be one in which superintelligence maximizes humanity’s CEV, acting altruistically for its long-term good. We have no idea, however, how to guarantee this.
The current dynamics of AI development are very difficult to steer due to the possibility of a sudden shift—a kind of phase transition—at the moment superintelligence emerges. As long as AI remains a tool in human hands, is complementary to us, and cannot self-improve, its development fundamentally serves us (though of course it serves some much more than others; that is a separate topic). But if we overdo it and cross any of these three thresholds, AI may suddenly become an autonomous, superhumanly capable agent, able and motivated to take control of the world.
One could venture the hypothesis that it is in humanity’s interest—understood through the lens of its CEV—to develop AI as long as it remains complementary to us and absolutely obedient to us. Then, to guarantee that its capabilities never develop further—unless we are simultaneously able to prove beyond any doubt that its goals will be fully and stably aligned with our CEV. At that point we would be ready to cross the Rubicon and voluntarily hand over the reins.
Such a plan, however, simply cannot succeed. This is because we do not know where these three key thresholds of AI capability lie. We will learn that they have been crossed only after the fact, when it is already too late to turn back. After all, even today we eagerly keep moving the bar of what we consider artificial general intelligence (AGI). Models are tested against ever new, increasingly sophisticated benchmarks, including those with AGI in their name (ARC-AGI) or suggesting a final test of competence (Humanity’s Last Exam)… and then, as soon as they are passed, we decide that this means nothing and it is time to think of an even harder benchmark.
Just think what this threatens: when the process of species evolution “overdid it” with human general intelligence, it ended with humans subordinating the entire planet. The same may happen now: if we “overdo it” with general AI intelligence, we too will probably have to pass into obsolescence. If superintelligence turns out to be unfriendly to us, it will either kill us, or we will be reduced to the role of passive observers, able only to watch as superintelligence subordinates the Earth and takes over its resources.
The drive to build superintelligence is similar to a speculative bubble on the stock market: both phenomena are characterized by boom–bust dynamics. In the case of a bubble, it is first gradually inflated, only to burst with a bang at the end. In the case of AI, we observe a gradual increase in our control over the universe—as AI tools that serve us become ever more advanced—but then we may suddenly and permanently lose that control when AI takes over. Unfortunately, it is usually the case that while one is inside the bubble, one does not perceive this dynamic. One sees it only when the bubble bursts.
*
In my short stories, I outline three exemplary scenarios of losing control over artificial intelligence. I show what this might look like both from the perspective of people involved in its development (“from the inside”), of bystanders (“from the outside”), and from the perspective of the AI itself.
Of course, many more scenarios are possible; I have focused on those that seem most probable to me. Of course, I may be wrong. I know, for example, that some experts worry less than I do about scenarios of sudden loss of control to a highly centralized, singleton AI, and are more concerned about multipolar scenarios. In my assessment, however, unipolar scenarios are more likely due to the possibility of almost instantaneous replication of AI code and the fact that computational resources (data centers, server farms, etc.) are today generally connected to the Internet. In this way, the first superhumanly intelligent model can easily “take all” and quickly entrench itself in its position as leader. Moreover, some researchers worry more than I do about scenarios of gradual disempowerment, in which the change may be entirely bloodless and the decline of humanity may occur, for example, through a gradual decrease in population size under the conditions of low fertility.
Above all, however, I do not consider a scenario of a “good future” for humanity in a world with artificial superintelligence—one in which superintelligence takes control in order to altruistically care for humanity’s long-term well-being. A scenario in which our CEV is systematically and efficiently realized and in which we live happily ever after. I cannot imagine any concrete path leading to such a state. Moreover, I also have an intuitive conviction (which I cannot prove) that embedded in the goals of humanity—our CEV—is a refusal to accept effective loss of control, and thus that even completely bloodless and nominally positive scenarios could in practice turn out to be dystopian and involve human suffering.
The Swiss Federal Ethics Committee on Non-Human Biotechnology was awarded the Ig Nobel Peace Prize in 2008 “for adopting the legal principle that plants have dignity.”
There is a set of safety tools and research that both meaningfully increases AGI safety and is profitable. Let’s call these AGI safety products.
By AGI, I mean systems that are capable of automating AI safety research, e.g., competently running research projects that would take an expert human 6 months or longer. I think these arguments are less clear for ASI safety.
At least in some cases, the incentives for meaningfully increasing AGI safety and creating a profitable business are aligned enough that it makes sense to build mission-driven, for-profit companies focused on AGI safety products.
If we take AGI and its economic implications seriously, it’s likely that billion-dollar AGI safety companies will emerge, and it is essential that these companies genuinely attempt to mitigate frontier risks.
Automated AI safety research requires scale. For-profits are typically more compatible with that scale than non-profits.
While non-safety-motivated actors might eventually build safety companies purely for profit, this is arguably too late, as AGI risks require proactive solutions, not reactive ones.
The AGI safety product thesis has several limitations and caveats. Most importantly, many research endeavors within AGI safety are NOT well-suited for for-profit entities, and a lot of important work is better placed in non-profits or governments.
I’m not fully confident in this hypothesis, but I’m confident enough to think that it is impactful for more people to explore the direction of AGI safety products.
Definition of AGI safety products
Products that both meaningfully increase AGI safety and are profitable
Desiderata / Requirements for AGI products include:
Directly and differentially speed up AGI safety, e.g., by providing better tooling or evaluations for alignment teams at AGI companies.
Are “on the path to AGI,” i.e., there is a clear hypothesis why these efforts would increase safety for AGI-level systems. For example, architecture-agnostic mechanistic interpretability tools would likely enable a deeper understanding of any kind of frontier AI system.
Lead to the safer deployment of frontier AI agents, e.g., by providing monitoring and control.
The feedback from the market translates into increased frontier safety. In other words, improving the product for the customer also increases frontier safety, rather than pulling work away from the frontier.
Building these tools is profitable.
There are multiple fields that I expect to be very compatible with AGI safety products:
Evaluations: building out tools to automate the generation, running, and analysis of evaluations at scale.
Frontier agent observability & control: Hundreds of billions of frontier agents will be deployed in the economy. Companies developing and deploying these agents will want to understand their failure modes and gain fine-grained control over them.
Mechanistic interpretability: Enabling developers and deployers of frontier AI systems to understand them on a deeper level to improve alignment and control.
Red-teaming: Automatically attacking frontier AI systems across a large variety of failure modes to find failure cases.
Computer security & AI: Developing the infrastructure and evaluations to assess frontier model hacking capabilities and increase the computer security of AGI developers and deployers.
There are multiple companies and tools that I would consider in this category:
Goodfire is building frontier mechanistic interpretability tools
Irregular is building great evaluations and products on the intersection of AI and computer security.
Gray Swan is somewhere on the intersection of red-teaming and computer security.
Inspect and Docent are great evals and agent observability tools. While they are both developed by non-profit entities, I think they could also be built by for-profits.
At Apollo Research, we are now also building AI coding agent monitoring and control products in addition to our research efforts.
Solar power analogy: Intuitively, I think many other technologies have gone through a similar trajectory where they were first blocked by scientific insights and therefore best-placed in universities and other research institutes, then blocked by large-scale manufacturing and adoption and therefore better placed in for-profits. I think we’re now at a phase where AI systems are advanced enough that, for some fields, the insights we get from market feedback are at least as useful as those from traditional research mechanisms.
Argument 1: Sufficient Incentive Alignment
In my mind, the core crux of the viability of AGI safety products is whether the incentives to reduce extreme risks from AGI are sufficiently close to those arising from direct market feedback. If these are close enough, then AGI safety products are a reasonable idea. If not, they’re an actively bad idea because the new incentives pull you in a less impactful direction.
My current opinion is that there are now at least some AI safety subfields where it’s very plausible that this is the case, and market incentives produce good safety outcomes.
Transfer in time: AGI could be a scaled-up version of current systems
I expect that AI systems capable of automating AI research itself will come from some version of the current paradigm. Concretely, I think they will be transformer-based models with large pre-training efforts and massive RL runs on increasingly long-horizon tasks. I expect there will be additional breakthroughs in memory and continual learning, but they will not fundamentally change the paradigm.
If this is true, a lot of safety work today directly translates to increased safety for more powerful AI systems. For example,
Improving evals tooling is fairly architecture agnostic, or could be much more quickly adapted to future changes than it would take to build from scratch in the future.
Many frontier AI agent observability and control tools and insights translate to future systems. Even if the chain-of-thought will not be interpretable, the interfaces with other systems are likely to stay in English for longer, e.g., code.
Many mechanistic interpretability efforts are architecture-agnostic or have partial transfer to other architectures.
AI & computer security are often completely architecture-agnostic and more related to the affordances of the system and the people using it.
This is a very load-bearing assumption. I would expect that anyone who does not think that current safety research meaningfully translates to systems that can do meaningful research autonomously should not be convinced of AGI safety products, e.g., if you think that AGI safety is largely blocked by theoretical progress like agent foundations.
Transfer in problem space: Some frontier problems are not too dissimilar from safety problems that have large-scale demand
There are some problems that are clearly relevant to AGI safety, e.g., ensuring that an internally deployed AI system does not scheme. There are also some problems that have large-scale demand, such as ensuring that models don’t leak private information from companies or are not jailbroken.
In many ways, I think these problem spaces don’t overlap, but there are some clear cases where they do, e.g., the four examples listed in the previous section. I think most of the relevant cases have one of two properties:
They are blocked by breakthroughs in methods: For example, once you have a well-working interpretability method, it would be easy to apply it to all kinds of problems, including near-term and AGI safety-related ones. Or if you build a well-calibrated monitoring pipeline, it is easy to adapt it to different kinds of failure modes.
Solutions to near-term problems also contribute to AGI safety: For example, various improvements in access control, which are used to protect against malicious actors inside and outside an organization, are also useful in protecting against misaligned future AI models.
Therefore, for these cases, it is possible to build a product that solves a large-scale problem for a large set of customers AND that knowledge transfers to the much smaller set of failure modes at the core of AGI safety. One of the big benefits here is that you can iterate much quicker on large-scale problems where you have much more evidence and feedback mechanisms.
Argument 2: Taking AGI & the economy seriously
If you assume that AI capabilities will continue to increase in the coming years and decades, the fraction of the economy in which humans are outcompeted by AI systems will continue to increase. Let’s call this “AGI eating the economy”.
When AGI is eating the economy, human overseers will require tools to ensure their AI systems are safe and secure. In that world, it is plausible that AI safety is a huge market, similar to how IT security is about 5-10% of the size of the IT market. There are also plausible arguments that it might be much lower (e.g., if technical alignment turns out to be easy, then there might be fewer safety problems to address) or much higher (e.g., if capabilities are high, the only blocker is safety/alignment).
Historically speaking, the most influential players in almost any field are private industry actors or governments, but rarely non-profits. If we expect AGI to eat the economy, I expect that the most influential safety players will also be private companies. It seems essential that the leading safety actors genuinely understand and care about extreme risks, because they are, by definition, risks that must be addressed proactively rather than reactively.
Furthermore, various layers of a defence-in-depth strategy might benefit from for-profit distribution. I’d argue that the biggest lever for AGI safety work is still at the level of the AGI companies, but it seems reasonable to have various additional layers of defense on the deployment side or to cover additional failure modes that labs are not addressing themselves. Given the race dynamics between labs, we don’t expect that all AI safety research will be covered by AI labs. Furthermore, even if labs were covering more safety research, it would still be useful to have independent third parties to add additional tools and have purer incentives.
I think another strong direct impact argument for AI safety for profits is that you might get access to large amounts of real-world data and feedback that you would otherwise have a hard time getting. This data enables you to better understand real-world failures, build more accurate mitigations, and test your methods on a larger scale.
Argument 3: Automated AI safety work requires scale
I’ve previously argued that we should already try to automate more AI safety work. I also believe that automated AI safety work will become increasingly useful in the future. Importantly, I think there are relevant nuances to automating AI safety work, i.e., your plan should not rely on some vague promise that future AI systems will make a massive breakthrough and thereby “solve alignment”.
I believe this automation claim applies to AGI developers as well as external safety organizations. We already see this very clearly in our own budget at Apollo, and I wouldn't be surprised if compute is the largest budget item in a few years.
The kind of situations I envision include:
A largely automated eval stack that is able to iteratively design, test, and improve evaluations
A largely automated monitoring stack
A largely automated red-teaming stack
Maybe even a reasonably well-working automated AI research intern/researcher
etc.
For these stacks, I presume they can scale meaningfully with compute. I’d argue that this is not yet the case, as too much human intervention is still required; however, it’s already evident that we can scale automated pipelines much further than we could a year ago. Extrapolating these trends, I would expect that you could spend $10-100 million in compute costs on automated AI safety work in 2026 or 2027. Though I think the first people to be able to do that are organizations that are already starting to build automated pipelines and make conceptual progress now on how to decompose the overall problem into more independently verifiable and repetitive chunks.
While funding for such endeavors can come from philanthropic funders (and I believe it should be an increasingly important area of grantmaking), it may be easier to raise funding for scaling endeavors in capital markets (though this heavily depends on the exact details of that stack). In the best case, it would be possible to support the kind of scaling efforts that primarily produce research with philanthropic funding, as well as scaling efforts that also have a clear business case through private markets.
Additionally, I believe speed is crucial in these automated scaling efforts. If an organization has reached a point where it demonstrates success on a small scale and has clear indications of scaling success (e.g., scaling laws on smaller scales), then the primary barrier for it is access to capital to invest in computing resources. While there can be a significant variance in the approach of different philanthropic versus private funders, my guess is that capital moves much faster in the VC world, on average.
Argument 4: The market doesn’t solve safety on its own
If AGI safety work is profitable, you might argue that other, profit-driven actors will capture this market. Thus, instead of focusing on the for-profit angle, people who are impact-driven should instead always focus on the next challenge that the market cannot yet address. While this argument has some merit, I think there are a lot of important considerations that it misses:
Markets often need to be created: There is a significant spectrum between “people would pay for something if it existed” and “something is so absolutely required that the market is overdetermined to exist”. Especially in a market like AGI safety, where you largely bet on specific things happening in the future, I suppose that we’re much closer to “market needs to be made” rather than “market is overdetermined.” Thus, I think we’re currently at a point where a market can be created, but it would still take years until it happens by default.
Talent & ideas are a core blocker: The people who understand AGI safety the best are typically either in the external non-profit AI ecosystem or in frontier AI companies. Even if a solely profit-driven organization were determined to build great AGI safety products, I expect they would struggle to build really good tools because it is hard to replace years of experience and understanding.
The long-term vision matters: Especially for AGI safety products, it is important to understand where you’re heading. Because the risks can be catastrophic, you cannot be entirely reactionary; you must proactively prevent a number of threats so that they can never manifest. I think it’s very hard, if not impossible, to build such a system unless you have a clear internal threat model and are able to develop techniques in anticipation of a risk, i.e., before you encounter an empirical feedback loop. For AGI safety products, the hope is that some of the empirical findings will translate to future safety products; however, a meaningful theoretical understanding is still necessary to identify which parts do not translate.
Risk of getting pulled sideways: One of the core risks for someone building AGI safety products is “getting pulled sideways”, i.e., where you start with good intentions but economic incentives pull you into a direction that reduces impact while increasing short-term profits. I think there are many situations in which this is a factor, e.g., you might start designing safety evaluations and then pivot to building capability evaluations or RL environments. My guess is that it requires mission-driven actors to consistently pursue a safety-focused agenda and not be sidetracked.
Thus, I believe that counterfactual organizations, which aim solely to maximize profits, would be significantly less effective at developing high-quality safety products than for-profit ventures founded by individuals driven by the impact of AGI safety. Furthermore, I’d say that it is desirable to have some mission-driven for-profit ventures because they can create norms and set standards that influence other for-profit companies.
Limitations
I think there are many potential limitations and caveats to the direction of AGI safety products:
There are many subparts of safety where market feedback is actively harmful, i.e., because there is no clear business case, thereby forcing the organization to pivot to something with a more obvious business case. For example:
Almost anything theory-related, e.g., agent foundations, natural abstractions, theoretical bounds. It seems likely that, e.g., MIRI or ARC should not be for-profit ventures.
Almost anything related to AI governance. I think there is a plausible case for consulting-based AI governance organizations that, for example, help companies implement better governance mechanisms or support governments. However, I think it’s too hard to build products for governance work. Therefore, I think most AI governance work should be either done in non-profits or as a smaller team within a bigger org where they don’t have profit pressures.
A lot of good safety research might be profitable on a small scale, but it isn’t a scalable business model that would be backed by investors. In that case, they could be a non-profit organization, a small-scale consultancy, or a money-losing division of a larger organization.
The transfer argument is load-bearing. I think almost all of the crux of whether AGI safety products are good or bad boils down to whether the things you learn from the product side meaningfully transfer to the kind of future systems you really care about. My current intuition is that there is a wealth of knowledge to be gained by attempting to make systems safer today. However, if the transfer is too low, going the extra step to productize is more distracting than helpful.
Getting pulled sideways. Building a product introduces a whole new set of incentives, i.e., aiming for profitability. If profit incentives align with safety, that's great. Otherwise, these new incentives might continuously pull the organization to trade off safety progress for short-term profits. Here are a few examples of what this could look like:
An organization starts building safety evaluations to sell to labs. The labs also demand capability evaluations and RL environments, and these are more profitable.
An organization starts building out safety monitors, but these monitors can also be used to do capability analysis. This is more profitable, and therefore, the organization shifts more effort into capability applications.
An organization begins by developing safety-related solutions for AGI labs. However, AGI labs are not a sufficiently big market, so they get pulled toward providing different products for enterprise customers without a clear transfer argument.
Motivated reasoning. One of the core failure modes for anyone attempting to build AGI safety products, and one I’m personally concerned about, is motivated reasoning. For almost every decision where you trade off safety progress for profits, there is some argument for why this could actually be better in the long run.
For example,
Perhaps adding capability environments to your repertoire helps you grow as an organization, and therefore also increases your safety budget.
Maybe building monitors for non-safety failure modes teaches you how to build better monitors in general, which transfers to safety.
I do think that both of these arguments can be true, but distinguishing the case where they are true vs. not is really hard, and motivated reasoning will make this assessment more complicated.
Conclusion
AGI safety products are a good idea if and only if the product incentives align with and meaningfully increase safety. In cases where this is true, I believe markets provide better feedback, allowing you to make safer progress more quickly. In the cases where this is false, you get pulled sideways and trade safety progress for short-term profits.
Over the course of 2025, we’ve thought quite a lot about this crux, and I think there are a few areas where AGI safety products are likely a good idea. I think safety monitoring is the most obvious answer because I expect it to have significant transfer and that there will be broad market demand from many economic actors. However, this has not yet been verified in practice.
Finally, I think it would be useful to have programs that enable people to explore if their research could be an AGI safety product before having to decide on their organizational form. If the answer is yes, they start a public benefit corporation. If the answer is no, they start a non-profit. For example, philanthropists or for-profit funders could fund a 6-month exploration period, and then their funding retroactively converts to equity or a donation depending on the direction (there are a few programs that are almost like this, e.g., EF’s def/acc, 50Y’s 5050, catalyze-impact, and Seldon lab).