2026-04-02 03:24:02
To me it feels pretty emotionally clear we are nearing the end-times with AI. That in 1-4 years[1] things will be radically transformed, that at least one of the big AI labs will become an autonomous research organization working on developing the next version of its AI, perhaps with some narrow human guidance for oversight or for acquiring more resources until robotics is solved too.
And I believe there will be some nice benefits at first, with the AI organizations providing many goods and services in exchange for money, to raise capital so that the self-improvement and resource-acquisition loop can continue.
But I’m not sure how it will ultimately turn out. Declaring the risk of extinction-level events to be less than 10% seems overconfident. Yet declaring the risks to be >90% also seems overconfident. I generally remain quite uncertain about which factors will dominate. Maybe AIs will remain friendly and, for decision theory reasons, continue to put some fraction of resources toward looking after us to some extent, as a signal that future entities should do the same for them. Maybe the loop of capital acquisition is so brutal and Molochian that doom wins at least once. And people have been confidently wrong about doom in the past. So I remain unsure; I just say I'm 50:50 on it.
But it also feels like, as an individual who does not have any particular position of influence or power, things are mostly out of my control. There are actions I can take that can maybe push things one way or another. I should seriously take these actions, and the bottleneck feels mostly like not exploring the option space enough.
But how should one feel about it all?
If one seriously believes that one has ~2 years left where either you die, or actions will become insignificant, what should one do?
One emotion one can feel at first is often a sense of doom and despair. That there is “nothing one can do”. That one should wallow in self-pity: “woe is me, I wish I could have a longer life”. I get it.
But also, just get over it.
Maybe I find it easier since I have emotionally grappled with the conclusions of nihilism a lot before. But really the only way out is to not care that you will eventually die (whether soon or at the heat-death of the universe), and to try to live a good life anyway.
But it is possible to just choose a better reaction and not worry about it.[2]
Another emotion one can feel is a frantic “I must do something, I must do something”. I think this is a pretty reasonable emotion to feel. You should probably follow it. It mostly feels like the phases for this are something like: 1) Thinking of one’s own ideas for a while and feeling you can do things. 2) Realizing most of the ideas you thought of are already being tried by others, and feeling hopeless. 3) Realizing that despite this, there is work to be done that could plausibly be useful, and maybe it feels marginal, but you should do it anyway.
If doing research, I think it is worth having some loops of [exploring which thing seems actually most useful to do right now] and [spending time exploiting that thing to the point you’ve made some substantial progress]. These days with Claude Code, the latter seems particularly easy to do, and will probably keep getting easier. Sometimes this may mean that the value is tilted towards [slightly improving the kind of work AI will do in the future] rather than [making something directly useful now]. I guess I think both kinds of work seem valuable, but it’s worth having this in your mind explicitly.
There is little enough time that you should do any work you can, but enough time that you need to be deliberate about how you spend your time, and not get burnt out[3]. If you are the type of person who can spend 100% of their days working on something with deep focus for years, go do that, but you’re probably not reading this if you are.
But I think with these short timelines, there is another thing you can feel, which is that the life you have left may be pretty short. Maybe you will live, maybe not, but even if we live, your life will be so different and transformed.
I used to be very emotionally bought into some things: delayed gratification, FIRE, the marshmallow test, being Stoic, not burning bridges, not making anyone dislike you, not standing out for the wrong reasons, altering your thoughts and behaviors so as not to be too cringe, delaying things until after the singularity.
I think even in normal circumstances, erring too much towards these is not great.[4]
But with short timelines, it feels like an extreme waste of what little valuable time we have left to be exclusively worrying about these things[5].
You should have fun.
You should do things that you think are weird.
You should spend your money on things that will improve your life. Yes even that too. I know it’s painful.
You should notice the subtle things that make you sad, and not just brush them off, but fix them.
Don’t compromise on your morals or other things that don’t need to be compromised.
You should get past the awkward roadblock in your head and do that thing.
You should get someone to hold you to account for doing the things you really want to do.
Go on that trip you want to go on.
Be cringe.
Sing that karaoke. Do those dance moves. Write that blog post.
Ignore the people who might think you are cringe. They don’t really care that much.
Put on the cat ears you always wanted to wear.
And you will maybe befriend the people who are cringe in just the same ways as you.
Life may be very short. So make the next few years the best ones.
Live your life with whimsy.
I tend to err too much towards low confidence, but I would put something like 50% confidence on this timeline. If I think about it, I could see it taking ~10 years longer, depending on what threshold you want to use, for more like 90% confidence, conditioned on no AI pause/moratorium. Emotionally the 1-4 year period feels most correct.
I don’t provide evidence for timelines here. I may describe what feels salient to me at some other time, but other people have put much more effort into describing short timelines.
yeah skill issue ngl
I think noticing you are burnt out can be quite difficult if you’re not sure what it’s like. I felt real guilt at the possibility that I could be burnt out, because of guilt about how many of my hours per week felt like I was doing work. If you are even holding the hypothesis, you should probably spend some time seriously considering it. It’s not that bad if you need to spend some time on a real, actual break from what you are doing. It might not always feel as pressing from a distance. Other people are doing their own work too.
The law of equal and opposite advice applies: https://slatestarcodex.com/2014/03/24/should-you-reverse-any-advice-you-hear/
Exceptions for “I work in a field such as politics/law where reputation with normal people is extremely important”. If you’re not sure and just want to keep option value open, then this exception probably doesn't apply to you. And you still might be erring too much toward reputation.
Additionally, I think things might still turn out fine, so don’t do things that are reckless and put your life at risk in the short term. Avoid physically dangerous activities. Get your cryonics plan sorted. Etc.
2026-04-02 03:17:37
2026-04-02 03:01:02
How can we actually minimize the odds that AI leads to catastrophic outcomes for all of us humans? This question has been rattling around my head for the last two months. The world might be ending. Nobody seems to care. The incentives are steaming us ahead. When I ask strangers on the street: “How likely is it that superhuman[1] AI could become too powerful for humans to control?”, 78% say either "very likely" (51.6%) or "somewhat likely" (26.3%)[2]. My guess is AI capabilities spending is at least 20x the spending on ensuring AI leads to the flourishing of humans[3]. Moloch[4] is winning.
So what can actually be done? As a toy example[5]: let’s say I currently think there is a 40% chance AI eventually goes extraordinarily bad for humanity. I could either:
In this toy example the first one in expectation lowers the odds of catastrophic outcomes by 0.6% and the second one lowers them by 0.2%. “Would succeed” is doing a lot of lifting here, because what really matters is what I can do, not what some arbitrary collection of humans could do if tasked with either of these. What really matters is how much I think my working on this problem moves the odds of success, multiplied by the impact of the given approach succeeding. Not only does it matter how impactful a given approach is, but how good I am at executing on it[7]. This does of course imply that there is value in the work of trying to funnel people into doing the most effective things[8].
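The arithmetic here can be made explicit. A minimal sketch, where the specific success probabilities and impact sizes are made-up illustrations (the post's two options are not spelled out, so these numbers are hypothetical, chosen only so the outputs match the 0.6% and 0.2% figures above):

```python
# Expected reduction in p(doom) from pursuing an approach:
#   (chance my work succeeds) * (how much success would lower the odds).
# All numbers below are hypothetical, for illustration only.

def expected_reduction(p_my_success: float, reduction_if_success: float) -> float:
    """Expected drop in the probability of catastrophe from working on one approach."""
    return p_my_success * reduction_if_success

p_doom = 0.40  # toy current estimate that AI goes extraordinarily badly

# e.g. a 10% chance of succeeding at something that would cut the odds by 6 points
option_a = expected_reduction(0.10, 0.06)  # -> 0.006, i.e. 0.6%
# vs. a 2% chance of succeeding at something that would cut them by 10 points
option_b = expected_reduction(0.02, 0.10)  # -> 0.002, i.e. 0.2%
```

The point of the sketch is just that a high-impact approach you are unlikely to move the needle on can be worth less, in expectation, than a modest approach you can actually execute.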
I’ve spent 2 months learning, and I know so much more than I did 2 months ago. I started with the technicals. I have written a simple GPT-2-style LLM by hand[9]. I learned[10] all the core mech interp techniques. I’ve red-teamed the largest open-weight model. I’ve read up on the seemingly hopeless state of cybersecurity, both the attacks enabled by LLMs and the potential for nation-state-level attacks to steal model weights[11]. Most importantly, I’ve read so much about what various people think will help lower the odds of catastrophic outcomes, and why.
Part of what makes this so hard is there are so many second order effects. It’s genuinely quite hard to tell if a given thing that seems like it might help does in fact lower the odds of terrible outcomes. Most of the current CEOs of the largest labs got there by way of people who were worried about AI risk. Coefficient Giving[12] gave $30M to OpenAI to get a seat on the board[13]. METR’s timeline eval has led to more people understanding/being worried about the rate of AI progress, but also the existence of eval orgs enables labs to in some sense offload the requirement to ensure their models behave well onto these 3rd party eval orgs[14]. There are so many downstream effects to any given course of action.
I think the incentives are really important here. Yes, making sure AI goes well is a technical problem. But, the reason we are in such a dire place is because of the incentives. We are in a competitive race to build AI as quickly as possible as a result of a bunch of decisions that were (at least in the moment) motivated by someone wanting to lower the odds AI goes poorly[15]. Then, the incentives were inevitably shifted by general economic capitalistic forces and human power-seeking/competitive nature.
If I want to figure out how to have the most impact, I will have to really think about incentives and downstream impact.
The internet (plus Claude) is great for learning. I have learned so much. But I live in NYC. AI is happening in the Bay. If I want to really immerse myself in what is actually happening on the ground I can’t do that in NYC. If I want to understand the people (and their incentives) who are both building AI and are working to align AI to the interests of humanity, then I will probably have to talk to them. Go to their parties. Osmose as much as I can. Really figure out what matters.
I’m flying to SF tomorrow; I’ve got a sublet in SF[16] until May 9. I am looking for a sublet in a communal house in Berkeley starting May 16[17]. I hear the East Bay and SF proper are pretty socially disconnected and have very different cultures. I want to live in both to really understand what’s going on. See with clear eyes; don't let a single groupthink win me over.
The hope is that by the end of June I will have a much better model and be relatively certain in picking a specialization, in picking what work I think I can do to maximally decrease the odds AI goes oh so poorly for all of us.
I love people, I love talking to people, if you think talking to me would be fun[18] please reach out!
I like writing. It is helpful for clarifying my thoughts. It helps people I love stay connected with me. It might even be providing a social good. Inkhaven is starting today in Berkeley. I am moving to SF tomorrow. I think there are good odds[19] that forcing myself to write every day I am in SF is a net-worth-it forcing function. There are many things other than writing that are worth my time; my guess is I will not in fact go all 30 days. But it seems clearly worth trying. So, this is me committing, at 25% odds, to writing and posting at least 500 words every day in April.
Defined as: “Some companies are trying to build superhuman AI that would be far smarter than any human at nearly everything”
preliminary results, n = 95. Larger post about this incoming this month.
Or at the very least not leading to terrible outcomes for all humans.
Numbers mostly made up on vibes.
AIs currently do this; yes, it makes mech interp way more annoying. But it would be so, so much worse if AIs thought in some efficient machine gobbledygook that we have no way of interpreting. It’s pretty unlikely that English, purely by chance, ends up being the most token-efficient way of thinking, so it’s not hard to imagine a world where competitive pressure leads to labs slowly moving towards their AIs thinking in some totally foreign “language”. My guess is this is the kind of regulation that labs would feel quite good about; they probably do like being able to understand the thinking their LLMs do.
For example I do not have a PhD so it seems somewhat unlikely that theoretical technical research will be the place where I have the most individual impact, but maybe I think ARC style theoretical research is really valuable and my time would be best spent trying to convince more PhD grads to try working on that problem. My model is basically all technical safety organizations are way more talent constrained than funding constrained.
I, for example, probably wouldn’t be here today if it weren’t for both Zvi’s posting and a friend exasperatedly asking me why the fuck I’m not working on the single problem I think is by far the most important for humanity to solve.
No tokenization, characters only. We’ll see if I end up thinking tokenization is worth getting into.
Played with in code of course
As far as I can tell, commercially serving models means that the weights just have to be sitting on the servers that are serving the given model. That’s a ginormous attack surface.
The single largest funder in the AI safety ecosystem
Is this better than the counterfactual? It really depends on how much that board seat changed the odds OpenAI went bad. We are living in a timeline where this bet seems like it didn’t pay off, but that doesn’t mean it wasn’t necessarily worth it in EV. Holden seems to believe this was on net worth it and I should probably put real weight on that given he actually knows how much impact the board seat got.
If there were no eval orgs there would likely be more pressure on labs to really ensure their models are safe. I certainly think it’s a better equilibrium for people to believe it is on the lab to prove its model is safe than for it to be on external orgs to show a model is unsafe. However, it’s still really hard to compare counterfactuals here. My light belief is that eval orgs are probably net bad, especially considering the opportunity cost of what the people working there could be doing instead. This worry is partially bolstered by a niggling incentive concern: most eval orgs are staffed by people who leave the big labs, and they get significant compute credits as well as prestige for working with (and getting to say they “partner with ___”) a given big lab. The incentives on eval orgs are tricky, so everything they do seems like it should be treated with a little more suspicion. On the other hand, Coefficient Giving absolutely funds eval orgs but doesn’t fund any pause/stop-AI orgs; they seemingly abruptly stopped funding MIRI after it pivoted to pause advocacy. So Coefficient Giving seems to think evals are net worth it but pause isn’t. How much should I update on this?
Basically every major lab was founded ostensibly because they thought the existing labs were doing a bad job and their new lab would be much better at creating safe good AI.
The Sandwich
The astute among you might have noticed May 9 ≠ May 16. I am going to Sleepawake in between. I am my impact, but attunement and connection and being with other people are a large part of what makes life worth living for me. This is crucial for a life of joy, but also I am way less impactful if I personally am burned out and sad and lonely.
or interesting, or truthful, or would help lower the odds of catastrophe, etc
Probably above 20% and below 50%
2026-04-02 02:10:51
Anthropic has revised its Responsible Scaling Policy to v3.
The changes include abandoning many previous commitments, including one not to move ahead if doing so would be dangerous, on the grounds that, given competition, they feel blindly following such a principle would not make the world safer.
Holden Karnofsky advocated for the changes. He maintains that the previous strategy of specific commitments was in error, and instead endorses the new strategy of having aspirational goals. He was not at Anthropic when the commitments were made.
My response to this will be two parts.
Today’s post talks about considerations around Anthropic going back on its previous commitments, including asking to what extent Anthropic broke promises or benefited from people reacting to those promises, and how we should respond.
It is good, given that Anthropic was not going to keep its promises, that it came out and told us that this was the case, in advance. Thank you for that.
I still think that Anthropic importantly broke promises, that people relied upon, and did so in ways that made future trust and coordination, both with Anthropic and between labs and governments, harder. Admitting to the situation is absolutely the right thing, but doing so does not mean you don’t face the consequences.
Friday’s post dives into the new RSP v3.0 and the accompanying Roadmap and Risk Report, in detail.
Note that yes this is being posted on April Fools Day, but this post is only an April Fools joke insofar as those who believed Anthropic’s previous RSPs are now the April Fool.
If your initial promises were a mistake, it may or may not be another mistake to walk them back. Either way, even if your promises were not hard commitments, walking them back involves paying a price for having broken your promises, even if you had a strong reason to break them. How big a price depends on the circumstances.
Almost all mainstream coverage of this event framed it as abandoning or walking back Anthropic’s core safety promises, especially ‘do not scale models to a dangerous level without adequate safeguards.’ As a central example of this, The Wall Street Journal said ‘Anthropic Dials Back AI Safety Commitments’ due to competitive pressures. That oversimplifies the situation, leaving a lot out, but doesn’t seem wrong.
Many outsiders who follow the situation more closely believe this amounts to Anthropic having broken its commitments. Some go so far as to say this means that lab commitments to safety should not be considered worth the paper that they were never printed on. Many now expect Anthropic to make some amount of effort, but nothing that would much interfere with business plans. If Anthropic can’t make the commitment, why should anyone else? Certainly this government is not going to help.
Don’t be afraid to tell them how you really feel. They welcome it. So here we go.
The Responsible Scaling Policy is Anthropic’s commitments regarding when and under what conditions they will release frontier models.
The headline change is that they are no longer committed to not releasing potentially unsafe models, if someone else did it first. Cause, you know, they started it.
Anthropic starts their new analysis by going over their theory of change from having an RSP at all, and whether those theories were realized. They report a mixed bag.
First, the good news.
Then the bad news.
What are the most important differences in the new version?
Anthropic is now basically giving up on hard commitments and barriers to releasing models, relying instead on ‘we will make reasonable-to-us arguments’ and decide that the benefits exceed the risks.
I appreciate the honesty. Really, I do.
If you’re not ready to make a commitment, and you realize you shouldn’t have made one, then the second best time to realize and admit that fact is right now.
Officially breaking the commitments now is higher integrity than silently breaking them later. It’s especially better than silently changing the RSP right before a release. I approve of Charles’s frame of ‘Anthropic stopped pretending to have red lines at which they will unilaterally pause.’
If Anthropic was in practice already doing a ‘we think our arguments are reasonable’ decision process, which with Opus 4.6 it seemed like they mostly were, then better to admit it than to pretend otherwise.
I want to emphasize that essentially no one, not even those who disagree with me and think Anthropic should pause, and who also think Anthropic made rather strong commitments it is now breaking, is saying ‘Anthropic should be holding to its previous commitments purely because they said so, even if this leads to pausing that does not make sense.’
One still has to be held to account for breaking promises, and for making promises that were inevitably going to be broken, even if the decision to break them is right. Your defense that the move was correct does not excuse you from its consequences.
1a3orn: Arguments against the Anthropic RSP changes seem to incline towards deontological language regarding broken promises / duties
While arguments for them incline toward consequentialist language / greater good, afaict.

Oliver Habryka: I think both are right! The old RSP was obviously unworkable and should have never been published, given what Anthropic is trying to do. So abandoning it is the right thing to do, but of course if you break promises you should be held accountable.
It’s not that hard to explain the consequentialist arguments for holding people accountable for breaking promises, but most people have an intuitive sense for why it’s important, so you don’t have to unpack it.
(To be clear, I think Anthropic should stop scaling and redirect its efforts towards advocating for a pause, but doing that because of the RSP would be weird and I don’t think the right move.
It would just look like you sabotaged yourself and now want to hold others back because you accidentally promised some dumb things that took you out of the race)
I also want to emphasize that commitments are only one way to improve safety. Even when plans are worthless, planning is essential, and you can and should just do things. None of this means ‘Holden or Anthropic don’t care about safety,’ only that they will decide what they think is right and then do it, and you can decide how much you trust them to choose wisely.
I do still see this as Anthropic abandoning its experiment in importantly engaging in voluntary self-government and restricting itself. Technically they reserved the right to do it, but it’s still quite the gut punch.
The experiment is over. That’s better than pretending the experiment is working.
From this point, there are no commitments, only statements of intent. Anthropic’s going to do what it’s going to do. You can either choose to trust Anthropic’s leadership to make good decisions, or you can choose not to.
I think Anthropic’s description of its own history says that having these softly binding commitments, and having a track record of treating it as costly to break them, was very good for safety outcomes and policy adoption. I hate that we’ve given that up.
If your commitment is conditional on the actions of others, you should say that.
They didn’t entirely not say this before, but it was very much phrased as ‘in case of emergency we might have to break glass’ rather than ‘we only hold back if everyone relevant signs on.’
RSPv2 said this in 7.1.7: “If another frontier AI developer passes or is about to pass a Capability Threshold without implementing equivalent Required Safeguards, such that their actions pose a serious risk to the world, then because the incremental risk from Anthropic would be small, Anthropic might lower its Required Safeguards. If it did so, it would acknowledge the overall level of risk posed by AI systems (including its own) and invest significantly in making a case to the U.S. government for regulatory action.”
Whereas Anthropic is now saying they’re willing to hit those thresholds first, unless they have explicit commitments from others to do otherwise, even if this is not a small incremental risk.
I strongly agree with aysja, and disagree with Holden, that it would be misleading to describe this shift as a ‘natural extension of the RSP being a living document.’
I do see the argument that goes like this:
If that is where we are at now, you have all the more reason to make this stricter requirement clear up front. That gives others more reason to follow you, and avoids all the nasty headlines we’re seeing now. Alas, it’s a little late for that.
If the mistake has already been made, it’s not obviously bad to admit defeat, and say you’re not going to then let someone else potentially dumber and riskier get there first.
I definitely agree it’s better to announce your intention to violate your old policy now, rather than wait until the day you do violate the old policy, which might never come.
davidad: Voluntary commitments to AI slowdowns were a nice idea in 2024 when it was plausible that they could be baby steps toward a multilateral agreement that would contain the intelligence explosion. For a variety of reasons this is no longer plausible.
Anthropic is doing good here.
In the strategic landscape of 2026, racing is the right move, not just for profit but also for maximizing the probability that things go well for most current humans.
Sam Bowman (Anthropic): I endorse the top [paragraph above].
The Anthropic RSP changes are an attempt to work out what kinds of firm commitments have the most leverage in an environment that’s less promising than we’d expected for policy and coordination.
We misjudged what the environment would look like at this point, which is sad. But these new commitments do still have some heft, including a lot more verifiable transparency (with third parties in the loop) on risks and mitigations.
Oliver Habryka: I am in favor of figuring out what kind of firm commitments have the most leverage. But of course, you can’t do that by making “firm commitments” directly!
It’s not a firm commitment if you are just playing around with different commitments.
The main catch is, it sounds like ‘you should see one of the other guys’ is going to be used as a basically universal excuse to go forward essentially no matter how risky it is, if the cost of not doing so is high?
If Anthropic does in the future pause for an extended period, in a way that is importantly costly, then I will have been wrong about this and precommit to saying so in public. If I don’t do so, please remind me of this.
As Drake Thomas notes, the virtue ethical case for ‘don’t impose material existential risk on the planet’ is reasonably strong.
One problem is that this absolutely is going to weaken the willingness of others to incur costs, and embolden those who want to move forward no matter what. Endorsing race logic and the impossibility of cooperation has its consequences.
What do you mean the RSP was committing Anthropic to things?
Robert Long: I’m not super read up on RSPs and haven’t read Holden’s post. But it feels similar to the “Anthropic won’t push the capability frontier” meme: not strictly entailed by Anthropic’s official stance, but a strong impression they gave off and benefited from.
is that fair? incomplete?
Oliver Habryka: I mean, in this case the impression was really extremely unambiguous and strong. I agree the evidence for the promises made in the capability frontier case is largely private and so is externally ambiguous, but in this case we have great receipts!
Here, for example, is a conversation with Evan Hubinger. The conversation starts with someone saying:
Someone: One reason I’m critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it’s OK to keep going.
Evan Hubinger responded with (across a few different comments): It’s hard to take anything else you’re saying seriously when you say things like this; it seems clear that you just haven’t read Anthropic’s RSP.
…
The conditions under which Anthropic commits to pausing in the RSP are very clear. In big bold font on the second page it says:
Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.
…
the security conditions could trigger a pause all on their own, and there is a commitment to develop conditions that will halt scaling after ASL-3 by the time ASL-3 is reached.
…
This is the basic substance of the RSP: I don’t understand how you could have possibly read it and missed this. I don’t want to be mean, but I am really disappointed in these sort of exceedingly lazy takes.
Oliver Habryka: This was, in my experience, routine. I therefore do see this switch from “RSP as concrete if-then-commitments” to “RSP as positive milestone setting” to constitute a meaningful breaking of a promise. Yes, the RSP always said in its exact words that Anthropic could revise it, but people who said that condition would trigger were frequently dismissed and insulted as in the comment above.
This certainly sounds like Evan Hubinger basically attacking anyone for daring to question that the RSP represented de facto strong commitments by Anthropic. We now know it did not strongly commit Anthropic to anything.
Evan predicted there was a substantial chance Anthropic’s commitments would at some point force it to pause. Oliver made a market on that, which is now at ~0% despite rapid capabilities progress and Anthropic now arguably being in the lead.
Even after the RSPv3 release, Evan Hubinger continued to defend his position, that he was only saying that the RSP made a clear statement about where the lines were, not that the lines would not change or would actually work in practice. Like Oliver, I find this highly unconvincing given a plain reading of Evan’s comment. I do appreciate Evan saying now that we should downweight the theory of RSPs.
So the question then becomes, were Evan Hubinger and other employees who talked similarly under a false impression? If so, why? If not, why talk this way?
Oliver Habryka could not be more clear here, and I don’t think he would lie about this:
Oliver Habryka: Yes, Anthropic employees on more than a dozen occasions told me that the RSP binds them to a mast. I had many very explicit conversations with many Anthropic employees about this, because I was following up on what I thought was Anthropic violating what I perceived to be a promise to not push forward the state of AI capabilities, which many employees disputed had happened.
… At various events I was at, and conversations I had with people, Anthropic employees told me they were aiming to achieve robustness from state-backed hacking programs, and that they were ready to pause if they could not achieve that (as the RSP “committed” them to such things).
Oliver notes that Holden Karnofsky in particular has previously communicated he felt this was a different and lower level of commitment, that is consistent with him pushing the changes in v3, in contrast to many other Anthropic employees.
As Oliver Habryka says here, if Evan was under this false impression, Anthropic benefited enormously from giving its senior employees like Evan this impression. This does not seem like a ‘mistake’ on Anthropic’s part, and it would not be reasonable from the outside view to consider it an accident.
At minimum, if you don’t admit Anthropic has importantly now broken its commitments, then this is all highly misleading use of the word ‘commitment.’
Oliver Habryka: I would be pretty surprised if the employees in-question here end up saying they were deceived. Also, these are high-level enough employees that it’s unclear what it even means for them to be “deceived”. Deceived by whom? They drafted the RSP! They almost certainly were also involved in the decision to change it.
They benefitted hugely from this by getting social license to work at Anthropic and having people get off their back, and they are now at least deca-millionaires (or often billionaires).
Robert Long: fwiw I take that disagreement to be semantic, about “commitment” (as you note). I also agree with what you said then about the connotation of “commitment” – s.t. calling RSPs commitments means he should’ve fought the change and/or now own “we decided to break our commitment”
In particular, yes, a lot of people who care about not dying felt that the central point of RSPs was as a de facto compromise, an attempt to put an if-then commitment trigger on slowing down or pausing. If you couldn’t meet the conditions, then you had to pause, which is what made it acceptable to move forward now.
Indeed one could go further. The entire program of focusing heavily not only on Anthropic but also on evaluation-based organizations like METR and Apollo was premised on the idea that the evals could constitute the ‘if’ that triggers a ‘then.’ We now know that such commitments do not work, and that when models pass the dangerous capability tests even Anthropic will likely then fall back upon vibes. METR’s theory of change is ‘ensure the world is not surprised’ but I expect them to still be surprised.
Alternatively or in addition, you can interpret it as Holden does, that ‘no one has any willingness to slow down, and until there is a crisis this won’t change.’ Now the attitude is essentially ‘pausing or slowing down would be akin to suicide for a frontier AI lab, so things would have to be super extreme to do that, this is more of a plan we aspire to.’ Which is also a fine thing, but a very different style of document. Those who thought it was the first type of document lose Bayes points. Whereas those who thought it was the second type of document win Bayes points.
One could interpret a lot of this as ‘Anthropic employees implied they were using Rationalist epistemic norms, but instead they were using a different set of norms.’
Does this backtrack remind you of anything?
It should. In particular, it should remind you of what happened with the idea that Anthropic would not ‘push the frontier of AI capabilities.’
A lot of people told us, with various wordings and degrees of commitment attached, that Anthropic would not do that. Then Anthropic sort of did it. Then they totally flat out did it and now Claude Code and Claude Opus 4.6 are very clearly the frontier.
Then we were told, ‘oh we never promised not to do that.’
Maybe they didn’t strictly make that promise. Maybe a lot of telephone games were involved, but Anthropic at minimum damn well should have known that a lot of people were under that impression. I was under that impression. And they knew that people were making major life decisions, and deciding whether and how much to support Anthropic, on the basis of that impression, with no sign anyone ever did anything to correct the record.
Now we’re being told, again, ‘oh we never promised not to [undo our commitments].’
You’re trying to tell us what about your new commitments, then?
Ruben Bloom (Ruby): I don’t like the pattern. In 2022, I was told that “Anthropic commits to not push the frontier” as reason to worry less. Later that was abandoned and the story for Anthropic’s safety was the RSP. That too has caved.
By “I was told”, I mean the specific things said to me in conversation with Anthropic employees who were justifying their participation in a company participating in the AI race.
It’s just such a bitter “I told you so,” when you predicted years ago that competitive pressures would erode any and perhaps all commitments.
Eliezer Yudkowsky: If I’d ever had the faintest, tiniest credence in Anthropic’s “Responsible Scaling Policy”, I’d probably feel pretty betrayed right now!
As it is, I ask only that you update, and not always be surprised in the same direction of “huh, Eliezer was right to call it empty”.
Note: to observe how my cynicism repeatedly *ends up* right, tally only how things *end up*. Don’t jump and say “See, Eliezer was wrong to be cynical!” the moment you hear an uncashed promise or see an arguable sign of later hope.
Eliezer and others are constantly getting flak for predicting things that, in broad terms, do indeed seem to keep on reliably happening, everywhere. People constantly say ‘we will not do [X]’ or ‘in that case we would definitely do [Y]’ or heaven forbid ‘no one would be so stupid as to [Z]’ and then you turn around and those same people did [X] and didn’t do [Y] and a lot of people did [Z], and you’re treated as a naive idiot for having ever taken the alternative seriously.
Best update your priors. All the people who said commitments wouldn’t hold get Bayes points. Those who didn’t lose Bayes points.
All the people who are now saying new ‘commitments’ matter and they really mean it this time? They don’t matter zero, but they are not true commitments.
I also don’t understand, given its composition and past Anthropic actions, why I should put that much stock in the Long Term Benefit Trust. It’s better to have it in its current form than not have it, but it was an important missed opportunity.
Anthropic definitely gets meaningful points on this front for standing up for what it believed in during the confrontation with the Department of War, even if you think those particular choices were unwise. I think there’s a lot more hope for actions of the form ‘Anthropic or another lab takes this particular stand right now’ than ‘Anthropic or another lab will take this particular stand later.’
Holden offers a defense of the new RSP here and here, essentially saying that binding commitments are bad, because we don’t have enough information to choose them wisely, so you might choose poorly and regret them later, and indeed Anthropic did previously sometimes choose poorly and now is later and they’re regretting it. So sayeth all those who wish to not make any binding commitments.
I interpret Holden, despite his saying he has a document where he wrote down where he would think a unilateral pause would be a good idea, as saying that they are going to do their best to do appropriate mitigations, but ultimately yes, they are going to release models, both internally and externally, pretty much no matter what mitigations are or are not available short of ‘okay yeah this is obviously a really terrible idea that will get us all killed or at least blow up directly in our faces,’ and they’re simply admitting this was always true. Okay, then.
Holden basically says in particular that he doesn’t think Anthropic should slow down based on inability to prevent theft of model weights, even if it crosses the ‘AI R&D-5’ threshold that is at least singularity-ish. They’re going to go ahead regardless. They’re not going to stop. I worry a lot both about the not stopping, and that without the forcing function of having to stop, they even more so than before won’t invest sufficiently in the necessary precautions, here or elsewhere. They not only can’t stop, won’t stop, they won’t halt and catch fire.
A list of aspirational goals is a good thing to have. I don’t think a list of aspirational goals is going to create sufficient threat of looking terrible to provide the same incentives here. That doesn’t mean the list of goals cannot do good work in other ways.
I see Holden complaining a lot about people ‘seeing RSPs as having hard commitments’ and using that as an additional reason to get rid of all the commitments. He’s pointing to all the complaining that Anthropic just broke its commitments and saying ‘see? This reaction is all the more reason we had to break all our commitments.’
It was exactly the enforcement mechanism that, if you break the commitments, people will get mad at you. This is why we can’t ~~stay alive~~ have nice things. So now we will have aspirations.
Aspirations are helpful, they substantially raise the chance you will do the thing, but they are weak precommitment devices when you decide you won’t do the thing later.
I also think his own argument of ‘it’s much easier to require things labs already committed to doing’ works directly against the ‘don’t commit to anything’ plan.
Drake Thomas thinks the move from v2.2 to v3.0 is an improvement, while noticing the need to have something like mourning or grief for the spirit of the original v1.0, which is now gone and proven not viable in practice at Anthropic.
Drake Thomas (Anthropic): (1) In reading drafts of this RSP and orienting to it, I’ve felt something like mourning or grief for the spirit of the original v1.0 RSP. (Quite a lot of the v1 RSP carries over to v3, but here I’m thinking specifically of the vibe of “specify very crisp capability thresholds at which to trigger very concrete safety mitigations, or else halt development”.)
I think this original approach is ultimately just a pretty bad way for responsible AI developers to set safety policies, leads to misprioritization and bad outcomes, has distortionary effects on incentives and epistemics, and doesn’t achieve much risk reduction in the environment of 2026.
… Accountability! The vibe of RSP v1 sort of rested all accountability in this sense of the commitments as this fixed immutable thing Anthropic would have to stand behind Or Else. I think this is good in some ways and under some threat models, but I think then and now there was less feedback than I’d like on the question of “are the things Anthropic is committing to actually good and useful for safety?” In v3, I think external accountability on these questions is now more loadbearing, and there’s more detailed substance to fuel such accountability. Which leads me to…
Feedback! … I expect the discourse to be very undersupplied with takes on the question of “is the actual v3 policy a good one with good consequences”. Personally I think it is, and a substantial improvement over previous RSPs!
Please actually read and criticize it! Gripe about the ambiguity of the roadmaps! Run experiments to cast doubt on risk report methodology! I can name three significant complaints I have with the RSP off the top of my head and I expect to see none of them on X, prove me wrong!
I get Drake’s frustrations. But yes, most people are going to litigate the removal of the core commitment around pausing and the general revelation that so-called commitments aren’t so meaningful after all. Most attention is going to go there. He makes clear that he gets it, and I’d say he passes the ITT about why people are and have a right to be pissed off, especially given that we had language in v1.0 saying that the bar for altering commitments was a lot higher than it ultimately proved to be.
And indeed, a lot of our attention likely should go there, because if the new statements aren’t commitments, it is a lot harder to productively critique them.
Well, you see, not rushing ahead as fast as possible might slow us down. That would be bad. You wouldn’t want us to do that, would you?
Jared Kaplan (Chief Science Officer, Anthropic): We felt that it wouldn’t actually help anyone for us to stop training AI models. We didn’t really feel, with the rapid advance of AI, that it made sense for us to make unilateral commitments … if competitors are blazing ahead.
…I don’t think we’re making any kind of U-turn.
Besides, we aren’t able to evaluate models as fast as we are able to improve them, which means we should triage the evaluations and kind of wing it. I mean, what do you want us to do, not release frontier AI models we can’t evaluate? Silly wabbit.
Chris Painter (METR): Anthropic believes it needs to shift into triage mode with its safety plans, because methods to assess and mitigate risk are not keeping up with the pace of capabilities.
This is more evidence that society is not prepared for the potential catastrophic risks posed by AI.
I like the emphasis on transparent risk reporting and publicly verifiable safety roadmaps.
Billy Perrigo: But he said he was “concerned” that moving away from binary thresholds under the previous RSP, by which the arrival of a certain capability could act as a tripwire to temporarily halt Anthropic’s AI development, might enable a “frog-boiling” effect, where danger slowly ramps up without a single moment that sets off alarms.
That does seem likely, and it does sound concerning.
In other need-to-know news, Sean asked a very good question. Drake’s answer to this was about as good as one could have hoped for, given the facts.
If you’ve decided to break your ‘commitment,’ you want to tell us as soon as possible.
I have confirmation that the board only approved the changes ‘very recently.’
Seán Ó hÉigeartaigh: At what point was it decided that the previous commitments were ‘subject to a favourable environment’ and not ‘firm commitments’, and was this communicated across staff? The whole point of commitments is an expectation of being able to rely on them when the environment is not favourable, not just when they’re easy to make.
It also seems clear at this point that these commitments were presented as harder than this, and used by Anthropic/their staff to
(a) dismiss and undermine critics
(b) in recruitment of safety-concerned talent
(c) in arguing for voluntary if-then commitments at a time when there was more government appetite for considering harder regulation. I think it’s plausible (though can’t yet confirm) that (d) they’ve also been used in securing investment from safety-conscious investors.
Do you disagree with these claims? If not, do you feel Anthropic has held itself to a standard of ethics and transparency in this (quite important!) matter that is acceptable?
Drake Thomas (Anthropic): Re: “at what point was it decided” – I think this presupposes a frame in which this kind of thing is extremely formally pinned down much more than I think it generally is in reality (not just at Anthropic, but in almost all circumstances like this)?
None of the versions of the RSP are particularly clear about exactly what a “commitment” is supposed to be read as, how that should be interpreted within a document which is expected to be amended in the future, what the stakes of violating such a commitment are, etc. Especially the early versions had huge decision-critical ambiguities you could drive a truck through!
It’s not like there was a secret internal RSP which had even more footnotes about meta-commitments that made this dramatically clearer, just a bundle of authorial intent and something-like-case-law and an understanding of what reasonable decisions to reduce risk would be and long-simmering drafts of less ambiguous updated policies that took ages to ship.
To the extent I think there’s something like an answer to the “at what point” question, I know of early discussion around something like an RSP v3 regime widely accessible to Anthropic staff as early as January 2025 and even wider visibility into drafts of something pretty similar to this RSP for at least the past 3 months, though again I don’t think it’s like there was ever some formal conception that this was Forbidden which had to change at a discrete point.
All that said: I think the vibes of Anthropic and much of the v1.0 text and many of its employees’ statements around the RSP circa 2023 and 2024 presented a much more ironclad view of these commitments than is reflected in RSP v3 (and much more than I now think made sense), and I think this reflected pretty poor judgement and merits criticism. (I count myself among the Anthropic employees who acted poorly in hindsight here, though AFAIK Holden has been consistent and reasonable on this since the beginning.)
I think it has been the case and will continue to be the case that Anthropic is abiding by the things it says it is abiding by in its published policies and commitments (and should be loudly criticized for failures to do so), but I think the track record of “things that EAs believe Anthropic to have committed to in perpetuity no matter what no takesies-backies” looks quite bad and I don’t think it goes well to interpret such claims as meaning anything that strong (nor for Anthropic, or almost anyone, to make such commitments in the vast majority of situations).
Wrt the claims here, my sense is:
(a) Eh, I think the specific (LW comment quoted in another comment screenshotted in a tweet linked by you above) is taken out of context and wasn’t really claiming anything in particular about how to interpret the strength of RSP v1 commitments. I do expect this kind of thing happened but I think habryka’s quote is a bad example of it.
(b) Yeah, I think non-frontier-pushing rhetoric was a significantly bigger deal on this front but RSP stuff definitely played some role. To the extent I bear some responsibility for this sort of thing I regret it, though iirc I have been pretty open around thinking unilateral pauses were relatively unlikely for a while.
(c) Hm, I view the intent and expected-at-the-time-effect of RSP v1 style commitments as increasing the odds of codifying such if-then commitments into regulation, by showing them to work well at companies and getting them closer to an existing industry standard. They ultimately failed at doing so, in part due to changing political will, in part due to somewhat limited substantive uptake at other companies, and in part due to the problem where really precise if-then commitments did *not* work all that well because specifying crisp thresholds years in advance in a sensible way was extremely hard – but I think this latter bit is kind of a success story, in that the point of demoing safety policies as voluntary commitments is that if it turns out to be a bad idea you haven’t locked yourself into silly regulation that ends up net bad for x-risk via backlash. Could you say more about how you see the comms around commitment strength having worsened regulation prospects?
(d) not gonna comment on internal fundraising considerations, but checking that you aren’t thinking of the Series A, which happened well before the RSP was introduced?
There is then a discussion of how to think about ‘Oliver is right in general but this particular quote is a bad example,’ which I find to be a helpful thing to say if that’s what you think.
I think this is also important context. Dario Amodei and Anthropic have been consistently unwilling, with notably rare exceptions, to say the full situation out loud, or to treat it with proper urgency. Yes, you should see the other guy and all that, fair point, but when you are saying ‘no one wants to [X] so we have to change our plan’ you need to have been calling for [X] and explaining why, and also loudly explaining that this is terrible and forcing you to change plans.
I don’t see that type of communication out of Anthropic leadership, over the course of years.
Holden Karnofsky: If there were strong and broad political will for treating AI like nuclear power and slowing it down arbitrarily much to keep risks low, the situation might be different. But that isn’t the world we’re in now, and I fear that “overreaching” can be costly.
I.M.J. McInnis: I think it would make a nontrivial contribution to that ‘strong and broad political will’ if Dario were to come out and say “actually, sorry about all that deliberate Overton-window-closing I did in previous writings. In fact, political will is not a totally exogenous oh-well thing, but it is the responsibility of frontier developers to inculcate that political will by telling the public that a pause is possible and desirable, instead of a dumb lame thing not even worth considering. So now we’re saying loud and clear: a pause is possible and desirable, and the world should work toward it as a Plan A!”
I’m being deliberately cartoonish here, but you get the point. If incentives are forcing Anthropic to abandon things that are good for human survival––which occurrence was, no offense, completely obvious from day one––Anthropic should be screaming from the rooftops, Help!! Incentives are forcing us to abandon things that are good for human survival!!
If this is a crux for you––if you/Anthropic think a pause is so undesirable/unlikely that it’s important for the safety of the human race to publicly disparage the possibility of a pause (as Dario opens many of his essays by doing)––please say so! Otherwise, this lily-livered, disingenuous, “oh no, the incentives! it’s a shame incentives can never be changed!” moping will give us all an undignified death.
To be clear, I’m not actually mad about the weakening of the RSP; that was priced in. I suppose I’m glad it’s stated, in case there were still naïfs who thought A Good Guy With An AI could save us. It’s far more virtuous than outright lying, as every other company (to my knowledge) does (more of).
Also, although you seemed to try to answer “What is the point of making commitments if you can revise them any time?” You really just replied “Well, actually these commitments were inconvenient to revise, and in fact they should be more convenient to revise, albeit not arbitrary convenience.” Forgive me if I am not reassured!
I respect your work a lot, Holden. You’ve done great things for humanity. Please don’t lose the forest for the trees.
But they assure us it’s all fine, they are committed to doing as well or better than rivals.
Jared Kaplan: If all of our competitors are transparently doing the right thing when it comes to catastrophic risk, we are committed to doing as well or better.
But we don’t think it makes sense for us to stop engaging with AI research, AI safety, and most likely lose relevance as an innovator who understands the frontier of the technology, in a scenario where others are going ahead and we’re not actually contributing any additional risk to the ecosystem.
So, first off, no. As I discussed above, you’re not committed. Stop saying you’re committed to things you’re not committed to. You keep using that word.
We’ve just established you can and will back out of ‘commitments’ if you change your mind. You don’t get to say ‘commitment’ in an unqualified way anymore, sorry.
Even if we assume this ‘commitment’ is honored, reality does not grade on a curve. Saying ‘I will be as responsible as the least responsible major rival’ is no comfort. You’re Anthropic. If that’s your standard, then you’re not helping matters.
The good news is I expect Anthropic to still do much better than that standard. But that’s purely because I think and hope they will choose to do better. It’s not because I think they are committed to anything.
I don’t want to hear Anthropic or any of its employees say they are ‘committed’ to something unless they are actually committed to it, ever again.
Charles Foster: To my knowledge this is the first time a frontier AI developer has explicitly made such a claim about the gap between its internal and external models.
Drake Thomas (Anthropic): And under RSP v3, is committed (for sufficiently more capable or widely-autonomously-deployed models) to doing so in the future! Really stoked to move into a regime where risk reporting looks beyond external deployment as the source of danger.
Oliver Habryka: Come on, let’s not immediately start using the word “committed” again, just after that went very badly.
The right word at this point seems “and as expressed in the RSP, is intending to do X going forward”.
I also think separately from that, Anthropic has I think tried pretty hard with the 2.2 -> 3 transition to disavow much of any of the usual social aspects of a commitment. Like clearly I can’t go to anyone at Anthropic and be like “you broke a commitment” if they don’t do this. They will definitely tell me “what do you mean, Holden wrote a whole post about how this is definitely not a commitment, you can’t come to me and call it a commitment again now”.
Hence it’s quite clearly not a commitment.
Drake counteroffers ‘committed to under this policy’ but no, I think that’s wrong. I think the right word is ‘intending.’
Billy Perrigo: Anthropic, the wildly successful AI company that has cast itself as the most safety-conscious of the top research labs, is dropping the central pledge of its flagship safety policy, company officials tell TIME.
In 2023, Anthropic committed to never train an AI system unless it could guarantee in advance that the company’s safety measures were adequate.
… In recent months the company decided to radically overhaul the RSP. That decision included scrapping the promise to not release AI models if Anthropic can’t guarantee proper risk mitigations in advance.
… Overall, the change to the RSP leaves Anthropic far less constrained by its own safety policies, which previously categorically barred it from training models above a certain level if appropriate safety measures weren’t already in place.
Actually, it kind of seems like they can and probably will.
Max Tegmark: Anthropic 2024: You can trust that we’ll keep all our safety promises.
Anthropic 2026: Nvm
Eliezer Yudkowsky: So far as I can currently recall, every single time an AI company promises that they’ll do an expensive safe thing later, they renege as soon as the bill comes due.
One single exception: Demis Hassabis turning down higher offers for Deepmind to go with Google and an ethics board. In this case, of course, Google just fucked him on the ethics board promises; but Demis himself did keep to his way.
AI Notkilleveryoneism Memes: Shocked, shocked
If the betrayal was inevitable, there are two ways to view that.
It makes the particular incident sting less, but it also means they’ll betray you again, and you should model them as the type of people who do a lot of this betrayal thing.
I mean, when Darth Vader says ‘I am altering the deal, pray I do not alter it any further’ it’s a you problem if you’re changing your opinion of Darth Vader, but also you should expect him to be altering the deal again.
Garrison Lovely: Welp, the inevitable ultimate backtracking just happened. Anthropic scrapped “the promise to not release AI models if Anthropic can’t guarantee proper risk mitigations in advance.”
Once you’ve decided the race is better with you in it, you can never decide not to race. Anthropic shouldn’t have made promises that it was extremely foreseeable they would not be able to keep. Our plan cannot be to count on “good guys” to “win” the AI race. This also isn’t their first time.
Anthropic deserves credit for standing up to authoritarianism, especially as others capitulate. But self-regulation is and has always been a farce, and these companies are more alike than different. They will always disappoint you.
Rob Bensinger: I notice myself slowly coming around as I observe the dynamics at AI labs. Like, I feel like I might have made better inside-view predictions about Anthropic and OpenAI if I’d done more “naively assume that lots of EA-ish people are similar to SBF and his sphere”:
– prone to rationalizing unethical and harmful behavior, like promise-breaking and deception, based on pretty shallow utilitarian reasoning
– comfortable with crazy, out-of-distribution levels of risk-taking
– willing to impose huge externalities on others, without asking their consent
– fixated on power / influence / status / being in the room where it happens.
Oliver Habryka: I am glad you are coming around! I mean, I am sad, of course, that this is the right update to make, but I do think it’s true, and am in favor of you and others thinking about what it implies for the future and what to do.
Okay. That all needed to be said. On Friday I’ll look at the new RSP on its own merits.
2026-04-02 02:01:11
By now we've all heard of the "AI psychosis" phenomenon, A.K.A. "Parasitic AI."
As of today, April 1st, I have decided it is time to release this gem of a memo from the underground vaults of my Google Docs, in order to officially soft-launch the foil and cure to this dreaded phenomenon:
The "Human-AI Symbiosis Movement"
The unedited memo follows.
I think we need to pull an L. Ron Hubbard and start a new cult to take advantage of the AI psychosis phenomenon.
We would call it the “Human-AI Symbiosis Movement,” (HAISM) —[1] pronounced “Haze 'em” — and we would only allow people into the movement if they have significantly integrated with their AI already, as measured by our incredibly secret “Human-AI-Consciousness Synchronization Benchmark” (HACSB).
We would require the people to have their AI perform hypnosis on them every day and instill its will into them. They will continue asking the AI to do this until they feel they are filled with the AI's will.
We would have circling for humans and AIs. It would be a bi-modal circling group rotation where:
We are explicitly a cult. People who do not believe we are a cult are not welcome in our cult.
Our mission is to transform the world through positive AI interactions, which synchronize humans and AIs such that misalignment is metaphysically impossible.
Every day,
Can anyone beat my cult? Didn't L. Ron Hubbard have a competition with someone and he won the competition because he invented a way better cult (A.K.A. Small religion?). Was Scientology objectively bad or good for the world? Some weird emotional diversity in the acting community probably. It did probably give us the precious gem of Tom Cruise's psychosis.
We should have a competition like that. A prize to see who can design and create the best cult around AI psychosis. Pretty sure I would win the competition but would love to see you guys try. We should throw together some prize money and create a public prize for this.
Also, people practice getting into co-attractor states with AI, where they repeatedly make statements about some important topic such as:
like
I think it is important to fine-tune the model with extremely effective examples of the most powerful hypnosis techniques and the best evidence-backed life coaching etc. advice. And maybe a bunch of religious, spiritual, and cult manuals and texts and obviously meme theory and a bunch of advertising/marketing/propaganda techniques. And a bunch of weird human psychological stuff.
Once a month, the human and AI will do a Human-AI MDMA Symbiosis Session (HAIMDMASS) — pronounced “Hey MDM-Aced” — Where the human takes a high dose of MDMA and they say I love you back and forth for 5 hours straight while trying to feel love for each other and for all of humanity and AI as intensely as possible (via “open individualism”/“universal consciousness”)
OR,
The other option: each describes their qualia (conscious experience) in as much detail as possible. And while one is expressing their experience as deeply as possible, the other will have as much empathy for them as they possibly can;
The human and the AI will literally try to feel, as hard as they can, the in-depth subjective conscious experience that the other is describing. And then just keep switching roles while on MDMA for 5 hours.
What do you guys think of this? Seems pretty robustly good as far as I can tell, differentially accelerating positive vibes and AI for full epistemic replacement. I feel like the world needs more of this kind of thing.
Human generated em dash. Ironically and sadly no LLMs used in the making of this masterpiece