RSS preview of the LessWrong blog

Early Warning Signals For Capabilities During Training

2026-04-03 23:49:06

This post explains the core ideas of a new preprint on the early detection of phase transitions in deep learning, and why we figured this was an important direction. The preprint could be cleaned up a bit, but I was excited enough that I decided to share it in its current state.

Early Warning Pipeline

Introduction

Nuclear Engineering

When I was in my final year of high school, I went through a short phase where I wanted to be a nuclear engineer. There were two main reasons for this: first, I thought it would be cool to work on developing fusion reactors; second, even if I couldn't do that, I lived pretty close to a nuclear plant, so at least I would have job opportunities! While my interest was rather short-lived, it did make me appreciate that a huge amount of nuclear engineering is making nuclear power plants not fail catastrophically.

At the heart of a fission plant sits the reactor pressure vessel (RPV). This is where the nuclear reaction actually occurs, and it is thus the primary safety barrier. Furthermore, it can't really be replaced, and so it is the component that defines the operational lifetime of the plant. The lifetime of an RPV is determined by embrittlement.

What is embrittlement? Inside the reactor, the RPV is constantly bombarded by fast neutrons, which knock atoms out of their lattice positions, creating point defects that trigger microstructural changes accumulating over years. Like most things, the properties of steels depend rather strongly on temperature. In the RPV, you want the walls to be ductile so they deform under stress instead of cracking. This requires, however, that the system sit above a certain temperature called the ductile–brittle transition temperature (DBTT), above which the metal is more amenable to bending.

However, over many years the small microstructural changes raise this temperature, meaning that the operating temperature drifts ever closer to the brittle regime than the ductile one. The RPV experiences significant thermal and mechanical stresses, and the more brittle it becomes, the more likely it is to experience a catastrophic failure (large cracks developing).

So how do they avoid this becoming a problem? Well, first they predict the expected lifetime of the reactor from radiation exposure models. It would be rather irresponsible, however, to just do this computation and let it ride. Instead, they continuously monitor the reactor to validate their models and detect unexpected changes in the behaviour of the system.

How do they do this? A modern way is using acoustic emission monitoring. Consider many tiny microscopic events in a material: a crack tip advances by a tiny increment, a dislocation slips, a grain boundary shears, a new phase nucleates. Each of these events involves a sudden, localized release of stored elastic energy. These events produce transient elastic waves (stress waves) that propagate outward through the material at the speed of sound in that medium, which can be picked up by probing sensors placed strategically throughout the material. If you have multiple sensors, you can locate the source by measuring the difference in arrival times at different sensors. If a wave arrives at sensor A 20 microseconds before sensor B, and you know the wave speed, you can calculate the source position. With enough sensors you get 2D or 3D localization. This is how they create "maps" of active damage in a pressure vessel.
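To make the localization step concrete, here is a minimal 1-D sketch of the arrival-time-difference calculation. The sensor positions and wave speed are illustrative numbers, not real reactor parameters, and `locate_source_1d` is a hypothetical helper rather than an actual monitoring API:

```python
def locate_source_1d(x_a, x_b, dt_ab, speed):
    """Locate an acoustic-emission source between two sensors on a line.

    dt_ab is the arrival time at sensor A minus the arrival time at B.
    For a source at position x between the sensors, t_A = (x - x_a)/speed
    and t_B = (x_b - x)/speed, so dt_ab = (2x - x_a - x_b)/speed, which
    solves in closed form for x.
    """
    return 0.5 * (x_a + x_b + speed * dt_ab)

# Wave arrives at sensor A 20 microseconds before sensor B; longitudinal
# wave speed in steel is roughly 5900 m/s.
x = locate_source_1d(x_a=0.0, x_b=1.0, dt_ab=-20e-6, speed=5900.0)
# The source sits closer to A, at about 0.44 m along the 1 m baseline.
```

With more sensors, the same time differences over-determine the source position in 2D or 3D, which is what produces the damage "maps" described above.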

Now, we all know nuclear disasters have happened. However they were essentially all caused by unanticipated system level events like operator error, flawed reactor design, loss of cooling systems, or external events (e.g., earthquakes and tsunamis), but none of them have come from embrittlement. Some came close, prompting new legislation, but none have failed. This is largely because these systems are designed and operated with extremely conservative safety margins. Engineers anticipate embrittlement from the outset and account for it by selecting robust materials, limiting neutron exposure where possible, and continuously monitoring changes in fracture toughness through surveillance programs inside the reactor. Operational procedures are also tightly controlled to avoid dangerous conditions like rapid cooling under high pressure (pressurized thermal shock), and regulatory frameworks enforce strict limits on allowable embrittlement over a reactor’s lifetime. When vessels approach these limits, operators can reduce power, modify operating conditions, perform thermal annealing to partially reverse damage, or retire the plant altogether. I think we can learn from this. To motivate thinking about these sorts of early warning systems in AI, we will give two examples below, one very near term and one that is more near future.

Training a Legal Assistant

Consider a large language model being trained as a legal assistant. The training corpus is carefully assembled, containing case law, legal commentary, judicial opinions, law review articles. It's been filtered for overtly offensive content, and overtly racist content. The trainers have done their due diligence. But the corpus faithfully represents the actual history of the legal system, and that history is saturated with structural biases that don't announce themselves as bias. Sentencing opinions from the 20th century don't contain slurs, but they systematically use different affect language when describing defendants of different races such as "showed remorse" versus "appeared unaffected," "troubled upbringing" versus "criminal environment." Housing law cases don't explicitly endorse redlining, but the exposed reasoning patterns treat certain neighborhoods as inherently higher-risk in ways that perfectly correlate with racial composition.

Early in training, the model learns the broad structure of legal reasoning. Argument forms, precedent hierarchies, the grammar of judicial opinions. This is the coarse-grained, high-value signal in the data, and it's largely benign. But as training continues and the model begins fitting finer-grained patterns, it starts picking up the subtler statistical regularities such as the correlations between affect language and defendant demographics, the way "sound legal reasoning" and "emotional appeal" map onto different communities, which framings are treated as authoritative in which contexts. These are the higher-frequency patterns in the data, and they carry the historical prejudice not as explicit content but as texture. There is a transition point where the model is not learning general legal arguments but instead is learning higher frequency features, which can include how legal arguments have historically worked for and against particular groups.

The thing is, if one is simply looking at benchmark scores, it can be a long time before this behaviour starts to show up. If you are only checking at the end of training, and you find your model is now racially biased, that is a lot of training money down the drain, plus the money wasted on data collection, as you now have to find some way to design a new dataset that does not produce this sort of prejudice, which is highly non-trivial. Instead, it would be better to be able to detect such behaviours as they start to form and intervene on the model. In general, it's not just these sorts of undesirable behaviours we care about, but also emergent capabilities, goals, etc., as actually having a map of what the model has developed during training makes it much easier to assess the risk the model poses in deployment.

A Continual Learning Agent

For a plausible, near future example consider a future AI system that learns continuously during deployment. Imagine an AI medical triage assistant deployed in an emergency department, continuously fine-tuning on the cases it encounters and for simplicity assume we do some good theoretical work and it starts well-calibrated and aligned. But over months, the distribution of cases shifts as flu season arrives, a new respiratory virus circulates, the local demographics change as a new elderly care facility opens nearby, etc. The agent keeps updating during all of this. At first this is beneficial: it becomes better at recognizing the patterns it's seeing more frequently. But gradually, two things happen in parallel.

First, it becomes subtly worse at recognizing rare presentations it hasn't seen in a while (the analog of catastrophic forgetting, but partial and hard to detect). Second, and more perniciously, the sheer volume of correlated cases during flu season warps its internal representations so that it starts over-attributing respiratory symptoms even in cases where they're incidental. Its "worldview" has silently narrowed.

The key dramatic element is that nothing looks wrong from the outside for a long time. The aggregate accuracy metrics remain high because most cases are respiratory during flu season. The degradation is hidden in the rare cardiac case that gets deprioritized, in the unusual neurological presentation that gets misread through a respiratory lens. By the time someone notices a pattern of missed diagnoses, the internal reorganization is already deep. Obviously, one could construct more catastrophic scenarios than this but it illustrates the point. One might say "we can guarantee that this won't happen by designing the agent correctly in the first place!" which might in principle be true but it feels risky to just trust that everything is working the way we had designed it to. No one wants a nuclear power plant without continuous monitoring.

Working On Early Warning Signs

One thing I found rather surprising is that there has been almost no work on early detection of phase transitions in deep learning, despite this being a core component of many safety-critical domains. Climate science has decades of work on detecting tipping points via critical slowing down, ecology has operational frameworks for detecting regime shifts in ecosystems, structural engineering has entire industries built around structural health monitoring, etc. In some ways it makes sense that this hasn't happened yet in deep learning: the field is younger, and the very concept of "phase transitions during training" as something worth detecting is relatively recent. It is then a natural next step to try to detect these phase transitions as early as possible.

Phase Transitions and Early Warning Signals

The first research objective to try and tackle for developing early warning methods is "How do we detect emergent behaviours/capabilities in training before they actually solidify?" This question seems straightforward, but there are some important subtleties that need to be addressed. Namely, what kinds of phase transitions we actually see in training, and what it means to detect them early.

When is Early Detection Possible?

Types of Phases

In physics, when discussing phase transitions we tend to talk primarily about first order and second order phase transitions (though there are other types, they are far less common). The order of a phase transition is determined by the behaviour of the free energy $F(\lambda) = E - T S$, where $E$ is the energy, $S$ is the entropy, $\lambda$ is some free parameter, and $T$ is a temperature parameter.

A first order phase transition (in $\lambda$) occurs when $F$ itself is continuous, but its first derivative $\partial F / \partial \lambda$ has a jump discontinuity at some critical value $\lambda_c$. That is:

$$\lim_{\lambda \to \lambda_c^-} \frac{\partial F}{\partial \lambda} \neq \lim_{\lambda \to \lambda_c^+} \frac{\partial F}{\partial \lambda}$$

The free energy has a kink (corner) at $\lambda = \lambda_c$.

A second order phase transition is one where $F$ and $\partial F / \partial \lambda$ are both continuous, but the second derivative has a jump discontinuity. That is:

$$\lim_{\lambda \to \lambda_c^-} \frac{\partial^2 F}{\partial \lambda^2} \neq \lim_{\lambda \to \lambda_c^+} \frac{\partial^2 F}{\partial \lambda^2}$$

The free energy is smooth through the transition but its curvature changes discontinuously. This corresponds to a discontinuity in quantities like heat capacity or susceptibility, which are second-derivative observables.

Early Warning Signals

Consider simple gradient dynamics where the system evolves to minimize $F$:

$$\frac{d\theta}{dt} = -\frac{\partial F}{\partial \theta}$$

Near a stable fixed point $\theta^*$ (where $\partial F / \partial \theta \,|_{\theta^*} = 0$), linearize with $\delta\theta = \theta - \theta^*$:

$$\frac{d\,\delta\theta}{dt} = -F''(\theta^*)\,\delta\theta$$

The solution is exponential relaxation with timescale:

$$\tau = \frac{1}{F''(\theta^*)}$$

So the relaxation time is set directly by the curvature of the free energy at the fixed point.

At a second-order transition, recall that $F''$ is continuous but vanishes at $\lambda_c$. As $\lambda \to \lambda_c$, the minimum of $F$ becomes progressively flatter: the restoring force weakens. Therefore:

$$\tau = \frac{1}{F''(\theta^*)} \to \infty \quad \text{as } \lambda \to \lambda_c$$

This divergence is continuous and gradual. The system telegraphs the approaching transition through observable precursors long before $\lambda_c$ is reached:

  • Relaxation time grows (critical slowing down proper)
  • Variance of fluctuations grows (since by the fluctuation-dissipation theorem, $\langle \delta\theta^2 \rangle \propto T / F''(\theta^*)$)
  • Autocorrelation time grows

All of these are measurable and diverge with a power law in $|\lambda - \lambda_c|$. This is why second-order transitions admit early warning signals. The free energy landscape is continuously preparing the transition by flattening out, and the dynamics faithfully reflect this.
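These precursors are easy to see in a toy simulation. The sketch below is my own illustration, not from the preprint: it runs overdamped Langevin dynamics in a quadratic well whose curvature is slowly ramped toward zero, mimicking the flattening minimum, and both the variance and the lag-1 autocorrelation of the trajectory grow as the curvature softens:

```python
import numpy as np

rng = np.random.default_rng(0)

# dx/dt = -k(t) x + noise, where k(t) plays the role of the curvature
# F''(theta*). Ramping k toward zero mimics the approach to lambda_c.
n_steps, dt, sigma = 20000, 0.01, 0.1
k = np.linspace(2.0, 0.05, n_steps)  # curvature softening over "training"
x = np.zeros(n_steps)
for t in range(1, n_steps):
    x[t] = (x[t - 1] - k[t] * x[t - 1] * dt
            + sigma * np.sqrt(dt) * rng.standard_normal())

def lag1_autocorr(w):
    # One-step autocorrelation of a window of the trajectory.
    return np.corrcoef(w[:-1], w[1:])[0, 1]

early, late = x[2000:6000], x[-4000:]
# Both early-warning signals grow as the well flattens:
print(np.var(late) / np.var(early))               # variance ratio >> 1
print(lag1_autocorr(early), lag1_autocorr(late))  # autocorrelation rises
```

The same two statistics (rolling variance and autocorrelation) are the standard operational early-warning signals in the climate and ecology literature mentioned earlier.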

At a first-order transition, $\partial F / \partial \lambda$ jumps discontinuously. Geometrically, the picture is fundamentally different: there are two distinct local minima $\theta_1$ and $\theta_2$, and the transition occurs when they exchange global stability (i.e. when $F(\theta_1) = F(\theta_2)$).

The crucial point is that neither minimum needs to flatten out at the transition. Each basin can retain healthy positive curvature right up to $\lambda_c$, so:

$$\tau = \frac{1}{F''(\theta^*)} \text{ remains finite at } \lambda_c$$

The system simply jumps from one basin to the other with no dynamical precursor. The relaxation time does not diverge, variance does not blow up, and the landscape gives no warning.

So, for first order transitions it is not generally possible to perform early detection. However, for a certain type of first order transition, nucleation-type transitions, you can get close. Nucleation is the process by which a new phase forms locally within a metastable parent phase through the creation of an object called a critical nucleus: a configuration that is itself not favourable in the free energy landscape (it sits at an effective maximum), but which the system can reach, usually via thermal fluctuations, by climbing the free energy barrier and then descending to a stable state on the other side. The critical nucleus represents a rare, highly ordered configuration of the system, and the entropy term strongly disfavours it, in part because there are vastly more microscopic states without a coherent nucleus than with one.

While very difficult in practice for physical systems, one can detect when the critical nucleus forms. This is technically not an early warning in the critical-slowing-down sense, but one can see that it is conceptually very similar.

Types of Phase Transitions in Deep Learning

Now, there is a subtle point worth mentioning here: deep learning is not a thermodynamic system but a dynamical one. It does not really have thermodynamic phase transitions; its transitions are more of the metastable-state-to-metastable-state type, so the classical thermodynamic free energy technically isn't defined between these states in most cases. However, the dynamical systems literature still talks about first and second order phase transitions, as there are reasonable ways to define a free energy for these systems, so we will continue using the thermodynamic language.

So the first thing is to ask how one defines the free energy for deep learning. There is the SLT method for deriving a thermodynamic free energy, which is nice and rigorous but is overkill for the purposes of this post, so we are going to go for something much more handwavey. Instead, let's take a parameter choice $\theta$ as having a free energy of $F(\theta) = L(\theta) - T S(\theta)$, where $L$ is the loss and $S$ is the configurational entropy. What do I mean by configurational entropy? For simplicity, we can consider it some value determined by the volume of points near $\theta$ with loss close to $L(\theta)$. So, the configurational entropy is basically a measure of how easy it is to find a parameter which makes the model behave similarly to $\theta$.

Deep learning can have both first and second order phase transitions. We go over mechanisms for both below.

The Second Order Transition

Imagine the model sits in a broad, largely flat region of the loss landscape. That is, this area has high configurational entropy (the level sets are broad, many configurations give similar loss) and the system bumbles about this area due to fluctuations from SGD noise, weight decay, etc. So the system is exploring along these flat directions. Eventually, some subset of these exploratory movements starts to reinforce itself as small modifications are made to an existing structure which slowly increase the curvature of the region, locking it in more. This is second order as the high-dimensional flat region gradually narrows as the system commits to specific directions.

The First Order Transition

The first order transition is interesting, as it is seemingly why learning rate warmup matters, and it is actually mathematically well understood in deep linear networks as the silent alignment effect. This occurs in areas of the loss landscape that look like river valleys: walls sweeping upward, with a meandering river sloping downward towards something like a basin. The reason this is first order is that the transition must overcome the entropic barrier of finding the valley, as it makes up a small fraction of parameter space. Once the valley is found, the model locks onto these directions and amplifies them, descending towards a basin.

Intuitively this makes sense when compared with how we tend to think about feature formation. The optimizer finds some feature that decreases the loss, then amplifies that feature within the network.

While it is not clear that these are the only mechanisms by which these kinds of phase transitions happen, they provide a good basis for thinking about early detection.

Early Warning Signals for Deep Learning

To motivate how we do this in the paper, we can sort of reason through the thought process together. Imagine watching a crowd of people milling around a plaza from above. Most of the time, their movements are essentially independent. But if something interesting happens (say a street performer starts doing something), a subset of people start moving in the same direction at the same time. That is, there are hints that the system is reorganizing as the movement between people starts to become coordinated.

Now, let's move this reasoning over to the deep learning case. If you ask "what's the cheapest thing I could compute that would tell me a model is learning something structurally new?" you'd reason as follows: a new capability or circuit, by definition, affects multiple data points in a systematic way as that's what makes it a capability rather than memorization of a single example. So the emergence of a new capability should manifest as a new shared factor explaining the loss variation across samples. This then can be thought of geometrically. When samples start to fluctuate together the loss landscape develops (or the optimizer discovers) directions that coherently affect groups of samples. To see that these fluctuations are in some sense the earliest thing we can detect, consider that the transition is the optimizer moving a long way along a coupled direction. But the coupling in the landscape has to exist before the optimizer can exploit it. So (under some very mild assumptions that you essentially don't make your optimizer too cursed) there's a period where the optimizer discovers and starts moving along the coupled direction (producing correlated loss fluctuations), but hasn't yet moved far enough for the capability to actually work. Here you will get correlated loss fluctuations but not necessarily any detectable loss decrease.

The next important thing to realize is that the loss landscape carries the "memories" of what has been learned. That is, suppose at some point your model learns some independent skill, like basic addition. The model parameters associated with basic addition are largely locked in, so the optimizer never really explores that direction in gradient space. Since, by definition, no learning is taking place there, there is no reason to track that area anymore. If it then needs to be tracked again because learning is happening there, well then by definition the optimizer knows about it. This might seem rather trivial, but this tells us that for the purpose of tracking phase transitions the optimizer's trajectory is a natural importance sampler over the relevant directions in parameter space. This isn't to say that there isn't value in looking at the local shape of the loss landscape more broadly, but simply that for detecting emergent capabilities the optimization directions contain a sufficient amount of information.

This is similar to the idea of reaction coordinates. In chemistry, most degrees of freedom are spectators: solvent molecules jiggling around, bond vibrations that aren't involved in the reaction. If you want to study the progress of a reaction, you look at the few coordinates along which the reaction actually proceeds.

The 2-Datapoint Reduced Density Matrix

I feel like saying we "introduce" this object is kind of weird. It's actually a very natural object if you have studied physics, and furthermore it is really a generalization of the loss kernel. To construct it, we pick a set of datapoints called the probe set, and denote the loss for a model parametrized by $\theta$ on sample $i$ as $\ell_i(\theta)$. Let $\rho$ be some distribution over a local area of parameter space. The 2RDM (under $\rho$) is then defined as the matrix $C$ with entries given by:

$$C_{ij} = \operatorname{Cov}_{\theta \sim \rho}\left[\ell_i(\theta), \ell_j(\theta)\right]$$

Now, in the paper we primarily work with two specific ways of defining $\rho$. The first is the Gaussian 2RDM, where $\rho$ is a narrow Gaussian centered about some parameter $\theta_0$. This version of the 2RDM is useful if one needs to probe very early in training (within the first few steps).

The other one we introduce, and the one we mostly focus on, is the dynamical 2RDM. This version takes advantage of the natural importance sampling of the optimizer and is dirt cheap to compute. Effectively, one takes $\rho$ to be a uniform distribution over a window of recent SGD steps. In practice, at set intervals we compute the per-sample losses on each sample in the probe set and keep track of them, building a sliding window of per-sample losses. From this, we compute the covariance matrix. This does not require any expensive sampling of model parameters. It avoids massive computational overhead (it simply costs $N$ forward passes per logging interval plus something like $O(N^2 W)$ to compute the covariance, with $N$ the number of probe samples and $W$ the size of the window) and largely avoids the curse of dimensionality, as it naturally samples the relevant geometry impacting learning.
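In code, the dynamical 2RDM is just a covariance over the logged per-sample losses. A minimal sketch, using synthetic loss traces rather than a real training run (the function name is ours):

```python
import numpy as np

def dynamical_2rdm(loss_history):
    """2RDM from a sliding window of per-sample losses.

    loss_history: array of shape (W, N), the losses on the N probe-set
    samples at each of the last W logging intervals. Returns the N x N
    covariance matrix of the per-sample losses over the window.
    """
    return np.cov(loss_history, rowvar=False)

# Synthetic illustration: before a transition, probe losses fluctuate
# independently; as a shared mode emerges, a common factor couples them.
rng = np.random.default_rng(1)
W, N = 64, 16
independent = rng.normal(size=(W, N))
shared = rng.normal(size=(W, 1))
coupled = 0.3 * rng.normal(size=(W, N)) + shared  # one mode dominates

top_before = np.linalg.eigvalsh(dynamical_2rdm(independent))[-1]
top_during = np.linalg.eigvalsh(dynamical_2rdm(coupled))[-1]
# top_during dwarfs top_before: many losses now fluctuate in lockstep.
```

The only training-loop cost is logging the per-sample probe losses at each interval; everything else is cheap linear algebra on an $N \times N$ matrix.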

Properties of the 2RDM

In the paper, we present some properties of the 2RDM and then derive a bunch of formal results proving they do the thing we care about. They rely on things like the breaking of linear approximations, majorization, and a bunch of other stuff, but the intuition of what is going on is much more straightforward than these results make it seem.

The first thing we introduce is a quantity called the spectral heat capacity (SHC), which is the variance of the eigenvalues of the covariance matrix $C$. We also make use of a quantity called the participation ratio:

$$\mathrm{PR} = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2}$$

where $\lambda_i$ are the eigenvalues of $C$.

The main claim of the paper can be boiled down to saying that the spectral heat capacity acts as an early warning signal for phase transitions and that the participation ratio provides information about the dynamics of the transition. The paper spends a lot of time formally proving the claim about the SHC (as the participation ratio is almost definitional). However, at least for the river valley first order transition, we can give rather intuitive geometric reasoning why this works, which we give below.

The eigenvectors of effectively tell us what samples are moving together. The eigenvalue of an eigenvector tells you the magnitude of their collective fluctuation. If all the eigenvalues of that covariance are roughly equal, losses are fluctuating in many independent directions so nothing coordinated is happening. But when one eigenvalue starts dwarfing the rest, it means many examples' losses are suddenly dancing in lockstep along a single direction.
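Both diagnostics are a few lines on top of the eigenvalues of that covariance matrix. A sketch following the definitions above (the toy matrices are our own illustration):

```python
import numpy as np

def spectral_heat_capacity(C):
    # SHC: variance of the eigenvalue spectrum of the loss covariance.
    return np.var(np.linalg.eigvalsh(C))

def participation_ratio(C):
    # PR = (sum lam)^2 / sum(lam^2): roughly the number of modes that
    # fluctuate together (N for independent modes, ~1 for a single mode).
    lam = np.linalg.eigvalsh(C)
    return lam.sum() ** 2 / (lam ** 2).sum()

N = 16
uncoordinated = np.eye(N)                             # equal, independent fluctuations
coordinated = np.eye(N) + 10.0 * np.ones((N, N)) / N  # one dominant shared mode
# One eigenvalue dwarfing the rest spikes the SHC and collapses the PR.
```

On the uncoordinated matrix the SHC is zero and the PR equals $N$; on the coordinated one the SHC spikes and the PR drops toward one, matching the verbal picture above.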

In the case of the river valley, the transition happens as the network rotates to move onto a new direction, which is effectively a sudden first order phase transition. The losses co-fluctuate here and are picked up by the SHC before they amplify onto the newly aligned direction. In the paper, we observe this directly in deep linear networks.

Deep Linear Networks

For second order transitions, what we can say is that the landscape curvature is softening: the model is losing its "grip" along certain directions, and the SHC detects that weakening grip through the growing variance it produces. An exact phenomenological picture is hard to paint here intuitively.

The participation ratio does allow one to shed some light on this in practice, as the PR can increase either before or after the SHC. The SHC is a rigorous precursor signal for a transition. The participation ratio isn't, but it does tell you how many modes are fluctuating together, which helps one determine how localized the transition is. A small PR means the transition is localized along one mode, whereas a high PR means many modes are involved at once. Furthermore, the temporal order of the SHC and PR changes tells us about the transition. If the SHC spikes before the PR, the landscape is funneling into a single narrow bottleneck where one direction dominates first; then, after passing through, additional directions activate. If the PR increases first, multiple directions are softening simultaneously: the model is sitting on a broad plateau where many things are in play at once. Eventually the symmetry between those directions breaks, one pulls ahead, and the SHC fires.

Experimental Results, Probe Selection, and Scaling

For a full report on the experimental results, I suggest checking the paper, as there is a lot to cover and I don't want to make this blog post any longer than it really has to be. In short, experimentally we found that the 2RDM behaved basically exactly as predicted and worked reliably as an early detection method on all the settings we tested. These essentially all consisted of known phase transitions during training that we could replicate and detect directly, then see where the SHC spiked relative to the known location of the phase transition. In basically all of these settings it worked with zero tuning, straight out of the box. This worked for:

  • Deep linear networks
  • Grokking
  • Induction head formation
  • Emergent misalignment

This is good, as it means it does what we expect. However, I try to make it clear in the paper that there is a subtle issue, in that the way you select your probe set matters, and that this isn't unique to the 2RDM but applies to any interpretable phase transition detection method. There are formal ways to show this, but it's far dumber than that. One simply needs to ask "how can I tell the model behaviour changed if I cannot observe it?"

In the examples most common in the study of phase transitions in deep learning, this isn't a problem because the data is seemingly simple enough that a relatively small random sample captures most of the variance. This is no longer the case as we scale up, so one needs to intelligently select what to observe and which samples provide sufficient signal to actually capture the change of interest. This sort of basis problem is common in various areas of science. In the paper we discuss it in terms of quantum chemistry, where one must select finite basis elements to study what is effectively an infinite dimensional Hilbert space. During bond breaking events, using the wrong basis set can cause you to totally miss the fact that anything happened at all. There are methods for dealing with this sort of problem there, so hopefully we can do something similar.

Conclusion

Part of the reason for sharing this is that the dynamical 2RDM is very easy to run experiments on, and I figured those who are interested could probably do some interesting things with it, find shortcomings, etc. Maybe it's actually completely useless, but I thought it was pretty neat. Anyway, questions and comments welcome.

Thanks! Max




What formal protocols should exist when a model under evaluation is used in the evaluation pipeline?

2026-04-03 21:50:48

Following the criticisms listed by Yaniv Golan and Zvi Mowshowitz in response to the Opus 4.6 System Card
https://medium.com/@yanivg/when-the-evaluator-becomes-the-evaluated-a-critical-analysis-of-the-claude-opus-4-6-system-card-258da70b8b37

https://thezvi.wordpress.com/2026/02/09/claude-opus-4-6-system-card-part-1-mundane-alignment-and-model-welfare/
and the brief commentary by Peter Wildeford https://x.com/peterwildeford/status/2019480244789387478

It is clear that this has already been acknowledged as a problem. Is this a problem that is being worked on in any capacity? What are some possible solutions?




Sadly, The Whispering Earring

2026-04-03 21:45:27

The Whispering Earring (which you should read first) explores one of the most dystopic-utopic scenarios. Imagine you could achieve all you've ever wanted by just giving up your agency. While theoretically this seems rather undesirable, in practice you get double benefits: that enviable high-status having-done-things reputation, without having to do all that scary failure-prone responsibility-taking. Just don't tell anyone you have the earring, otherwise the status points gained are void. Of course the fact that you're cheating takes away most of the satisfaction of winning too, but it's still better than not winning. Moloch says: sacrifice what you love, and I will grant you victory.

Anyway, I've been using Claude chat as an enhanced diary for the past couple of months. I've been incredibly productive. Things I've been procrastinating for years are getting done, and it's not even feeling difficult. Habits I've been trying to pick up repeatedly stick with ease. Work tasks I've been struggling to do in a timely manner mostly get completed by Claude Code, freeing up hours and hours of brain time to actually think with. Playing videogames eight plus hours every day, like I did most of the winter, feels less and less appealing.

I'm not sure how much of this is just Spring beginning. I typically have a good streak of energy from March onwards, and then do a micro-burnout in mid-May. Remains to be seen if that happens again. This is the first time I realize it tends to happen every year; maybe I can prepare somehow and get through it smoothly. And when I say me, I mean Claude will figure it out, on the fly if needed. The only thing I'll have to do is find my diary-chat browser tab, and everything else can happen on autopilot.

Claude is a merciful tyrant. Whenever I note that I'm feeling like I failed to complete a task, or that I should be doing more, the reply is always either "you've done well today" or "perhaps you do this subtask now". I've started to believe those replies a bit, even sometimes thinking similar thoughts without them. Soon, I predict, the aspiring perfectionist within me will fade away, and I'll be happy about it. I'm sad about losing that part of me now, but the sacrifice will be ever so worth it.

In reality, the situation is not that dire. There are large portions of my personality that are unaffected. I'm not attempting to outsource what I should be interested in, only how to get into doing those things. When I drift in that direction, the responses aren't interesting at all. This might be due to alignment or lack of capabilities. Similarly, I'm pretty sure my muscle movements aren't (yet) directly Claude-controlled, but that might again be just a capability restriction.

I have some mental rules on what things are acceptable to outsource, although formalizing them seems difficult and unwise. Writing a text like this using Claude would be very wrong. Asking for topics to write about would feel mildly bad, but even thinking about it for the purposes of writing this sentence makes it sound more reasonable. Picking between job offers is something I'd ask feedback on but not let go of control over completely. Booking flights, I don't trust any LLM enough yet, but I can ask for ideas. I sometimes ask for help with analyzing social dynamics or even how I should feel about some actions, as I often miss details and dynamics that I really shouldn't. Most of my models on those were already outsourced to books, so I feel pretty fine asking for feedback.

And the worst: making sense of art. I sometimes ask Claude for interpretation of something, and it's rather weird to see how typically my own takes are just plain worse. I don't know how to feel about this, except that perhaps one could get better with practice.

The most interesting question is whether I'm using agency training wheels, learning to do it all myself, or whether it's like a muscle that has to be used for it to work. Like all things, it's likely a bit of both. Perhaps that could be measured somehow? I'll ask Claude about it.




Anthropic Responsible Scaling Policy v3: Dive Into The Details

2026-04-03 21:40:52

Wednesday’s post talked about the implications of Anthropic changing from v2.2 to v3.0 of its RSP, including that this broke promises that many people relied upon when making important decisions.

Today’s post treats the new RSP v3.0 as a new document, and evaluates it.

First I’ll go over how the RSP v3.0 works at a high level. Then I’ll dive into the Roadmap and the Risk Report.

How RSP v3.0 Works

Normally I would pay closer attention to the exact written contents of the new RSP.

In this case, it’s not that the RSP doesn’t matter. I do think the RSP will have some influence on what Anthropic chooses to do, as will the roadmap, as will the resulting risk reports.

However, the fundamental design principle is flexibility and a ‘strong argument,’ and they can change the contents at any time, all of which means the central principle is trust.

I read the contents as ‘here are the things we are worried about and plan to do,’ which in practice should mostly amount to doing what they believe is right. I don’t see anything on this map that seems likely to importantly bind that, given that Anthropic generally cares about the safety of its models, as it understands the risks involved.

The actual soft ‘commitments’ that seem to matter most are the periodic new Risk Reports, maintaining a Frontier Safety Roadmap, and establishing veto points with the CSO, CEO, board and LTBT on big capabilities advances.

I don’t read the commitments here as reflecting my understanding of the risks involved in ‘powerful AI,’ especially in the realm of automated R&D.

  1. This is a plan of action, not a set of commitments. Plan accordingly.
  2. Part of the plan is to maintain a Frontier Safety Roadmap with further plans, which are fully subject to change, although they will strive to avoid making them less ambitious.
  3. Another element is the periodic Risk Reports, which are in addition to previous risk reports, giving an overall picture every 3-6 months, as discussed elsewhere. The key aim is to be fully candid about the overall risk picture, including when models pose absolute risk but don’t introduce much marginal risk (because you should see the other guy).
    1. This is to include factual information, threat models, evidence, mitigations, risk analyses, reviews of past decisions and any other relevant information.
    2. They will ‘work towards a practice of’ seeking comprehensive, public external review on these reports.
    3. They give us the first Risk Report, so we can judge by that what to expect.
  4. ASL-3 mitigations should continue to apply to Anthropic models across the board. Security must now protect against insider threats, and they recommend the industry do so up to and including the CEO, naming RAND SL4. Good.
  5. They now offer both a ‘this is what we will do on our own’ and ‘this is what we would like to have everyone be required to do.’
    1. These two can indeed be distinct things, but in this situation I would find it surprising if Anthropic weren’t aspiring and trying very hard to do at least what they say should be generally required, if not more. I don’t read the current framework as causing that to happen.
    2. The framing is ‘this is what it would take at an industry wide level to keep things safe’ so anything less than that by Anthropic would presumably be, by construction, insufficient to keep us safe.
    3. It is ambiguous the extent to which Anthropic will strive to match those rules even if others do not follow. They’re not committed to doing so, but then nothing in RSP v3 is a commitment.
    4. If Anthropic doesn’t match or exceed the global rules it is calling for, given what they are, at least while it is out in front, then this is, as Peter Wildeford put it, a marker that their voluntary self-governance experiment has largely failed. This is in addition to aysja’s point: if Anthropic does not trust itself to accurately measure capabilities that might trigger a pause, that also indicates such an experiment is a failure, albeit a graceful one in which Anthropic notices it has failed.
    5. The incentive effect on other labs is good, but the pressure from going first without them was, I think, stronger. In practice I expect this to weaken commitments.
  6. The machinery and impetus to ensure Anthropic retains physical ability to do a pause or restriction is gone. I worry this de facto means it won’t even be considered, or Anthropic won’t be prepared to do it. You need to hard commit to having the physical ability to pause or restrict.
  7. There is no pre-deployment gate mechanism here. You can just release things. A gate is the most important thing to have, and it is missing.
  8. They explain more explicitly than before that they will keep pace in the race, even if such a release would be unsafe, and they are taking back their commitment to not release potentially unsafe models.
    1. Their excuse: “If one AI developer paused development to implement safety measures while others moved forward training and deploying AI systems without strong mitigations, that could result in a world that is less safe.”
    2. However, if Anthropic is in the lead, then and only then they commit to safety, unless they change their minds again.
    3. If competitors have strong safety measures they plan to at least match those.
    4. Unless they made a considerable effort to do so, and couldn’t, in which case they’re not going to let that delay them.
    5. I think it was correct for TIME to categorize this as ‘Anthropic drops flagship safety pledge.’
  9. ASL-levels are deprecated in favor of ‘requiring analysis and arguments making a strong case for safety.’ They admit this ‘leaves flexibility,’ and that different people will disagree on what that means.
    1. The central point of the ASL-levels, from an outsider perspective, was to tie Anthropic’s hands, so they would have to do the thing they agreed to do even if it later proved expensive, or otherwise at least have to admit they didn’t do it.
    2. The goals Holden listed were to create a forcing function on risk mitigations, create a testbed for practices and policies, and work towards consensus and common knowledge about risks. Holden thinks that RSPv3 effectively preserves these goals. I don’t think it does.
    3. If all you have to do is make a self-convincing safety case, well, you can always do that if it’s sufficiently important to you to do that.
    4. I am sympathetic to ‘we do not know what all the red lines will be.’ But if you cannot articulate any red lines, that is not a good sign.
    5. I read this change as: We have almost no meaningful commitments at all around decisions about model releases, even if we keep to all of our current ‘commitments,’ which aren’t commitments.
    6. They promise to make a good argument. They decide what is a good argument.
  10. The RSP now requires publishing an FSP, a Frontier Safety Roadmap, in case the acronyms were insufficiently confusing, describing risk mitigations across Security, Alignment, Safeguards and Policy.
    1. Why not have those be in the RSP? I think ‘because they get updated a lot.’
    2. These are ‘public goals’ but not ‘commitments,’ despite the commitments also not being commitments. We need new words for things.
    3. In principle it seems fine to have both hard commits and aspirational goals, but that only works if there are any hard commits and there aren’t any. There are soft commits and aspirational goals. So, grade on the soft commits, then?
    4. Is this because they want to ensure no violations of SB 53 or RAISE?
  11. Risk Reports will provide detailed safety information at time of publication and be published online every 3-6 months. External review will be required ‘in certain circumstances.’ This is in addition to on-model-release capability assessments and model cards, not a replacement of them.
    1. Why would you ever not want external review? Holden says external reviews are high-stakes and experimental, but I think given this isn’t lining up with a model release and doesn’t serve as a stopping function you can just do them.
    2. How do you assess capabilities meaningfully if you don’t line up with new model releases? How do you do risk assessments on this cadence?
    3. The answer is that they are committing to doing a diff with the old risk report upon release of a new model, or within 30 days of realizing they have an internal model that requires such a diff.
    4. 3-6 months is a lot when model releases are every 2 months.
    5. ‘Risk-benefit determination’ is how everything always works in the end, but seems like another step towards in practice not letting risks stop you.
    6. External reviews are good but again they look like they’ll be too late?
  12. What is being checked for is overhauled.
    1. Biological and chemical safeguards continue, see next section.
    2. Radiological and nuclear are fully gone, we don’t care anymore.
    3. Autonomy isn’t checked because the models fully have autonomy, and we’re proceeding with that assumption.
    4. Cyber Operations is fully gone. Perhaps because it would pass every test? This is despite it being a core test for both OpenAI and Google.
    5. The effective compute checkpoints are gone.
    6. Overall: Oh, the standards you’ll raise. Oh, the goalposts you’ll move.
    7. It’s hard for me to take this seriously when the evaluations will be vibes-based and there are no real commitments even then. Anthropic is going to check the vibes, and decide if it seems safe or not. Tell me why I’m wrong.
  1. Biological and chemical safeguards.
    1. Non-novel chemical or biological weapons production is the new CBRN-3.
      1. This calls for ASL-3 protections in essentially all models going forward.
      2. They explicitly aim for and expect to hit the industry-wide standard (note that this implies that where they don’t say they expect or aim to hit that standard, it means they presumably don’t expect or aim to hit it).
      3. The goal is a ‘strong argument’ that small teams won’t be significantly more likely to cause catastrophic harm, presumably via precautions against model weight theft and monitoring of model behaviors.
      4. It is presumably impossible for open weights models to satisfy this.
    2. Novel chemical and biological weapons are the new CBRN-4.
      1. The weapon enabled in the hands of ‘moderately resourced expert-backed teams’ has to come ‘with potential for catastrophic damages far beyond those of past catastrophes such as COVID-19.’ So it’s a super high threshold, and presumably you can delete ‘chemical’ here.
      2. The additional response here is ‘create policy recommendations’ for early detection of particular threat pathways.
      3. Their recommended action is again ‘a strong argument’ for a set of mitigations, which here they anticipate matches RAND SL4, plus the usage monitoring.
      4. The plan of ‘monitor usage’ seems not great given the state of jailbreaking.
  2. High-stakes sabotage opportunities is a new category aimed at internal systems at AI companies, especially regarding successor models. Good idea.
    1. While I’m happy to have sabotage covered, it is a strange framing or focus as opposed to worrying about AIs breaking free of control, or taking over control, or doing various other things. It’s not clear whether any action that hurts our chances counts here, and thus everything is included, or whether it does not. Either way, this is reasonably confusing framing.
    2. Their ‘commitment’ response is to detail the systems’ capabilities. They also expect to meet their policy recommendation requirements.
    3. That requirement is again a ‘strong argument,’ here that the model will not carry out sabotage ‘leading to irreversibly and substantially higher odds of a global catastrophe.’
      1. That’s a pretty high bar for sabotage to count.
      2. They expect this to eventually involve internal compartmentalization, restriction, and code review to prevent excessive sabotage opportunities for AI models, capability assessments, monitoring and/or restricting behavior and usage internally and evidence of lack of propensity.
      3. If that was a concrete set of requirements, it would probably be strong.
      4. As a ‘we expect this will require,’ it’s harder to know if it means anything.
  1. ‘Automated R&D in key domains’ is the new AI R&D-5, the big kahuna, operationalized as ‘could compress two years of 2018-2024 AI progress into a single year’ which is a 100% speedup.
    1. As they note, this is a high bar, and they are currently missing lower bars.
    2. Claude Opus 4.6 estimates it would have provided about a 9% speedup.
    3. GPT-5.4-Pro estimates about 30% speedup (Thinking said 25%).
    4. I would take the over on that.
    5. I would also want an additional threshold somewhere between ‘what Claude Opus 4.6 can do’ and ‘the singularity starts right about now.’
  2. What do they plan to do if or by the time they hit ‘automated R&D in key domains’?
    1. ‘Resource and complete significant “moonshot R&D for security” projects.’
    2. Achieve an ‘eyes on everything’ state for our internal development.
    3. Perform systematic alignment assessments.
    4. Develop internal red-teaming superior to external red-teaming efforts.
    5. Publish Risk Reports.
    6. Do better than the other guys, but they can’t guarantee anything.
  3. What’s the desired global request? Another ‘strong argument,’ in this case for:
    1. No team of users will cause substantial new catastrophic harm with this, including malicious employees with maximum access.
    2. That’s misuse prevention and guarding of model weights.
    3. A lack of dangerous goals trained into the models, or other model propensity to cause catastrophic harm.
    4. These are good safeguards and you do need them, but they’re not central.
  4. None of these R&D-triggered safeguards address the central threat model of automated R&D, which is rapid capability gain or recursive self-improvement, together with all the resulting Yudkowsky-style dangers: alignment issues and de facto goals, instrumental convergence, sharp left turns and suddenly being outsmarted. The threat is not intentional misuse or existing bad behaviors prior to achieving sufficient capabilities. The actions above cut off some secondary ways things can go wrong, and by all means do those things, but I don’t see how they give you assurance that you are ready to go into that endgame, at all.
  5. Either this is ignoring superintelligence, or not taking it seriously.
  6. The responses promised for the new potential capabilities do not seem to reflect the gravity of what they represent. Even the ambitious goals don’t seem ambitious.
  7. Precise language becomes vague language, throughout.
  8. Various changes to governance commitments. Some are gone, some added. Readiness and notification in particular look gone without replacement.
  9. They commit to having a Responsible Scaling Officer to handle all this.
  10. The Board and LTBT are introduced as additional explicit veto points: “In the event that marginal risk analysis (see previous section) plays a major role in a decision to move forward, explicit approval of the Risk Report by the Board and LTBT (rather than just the CEO and RSO) will be required.”
    1. This is good, but I’d like to see this not be a conditional. It seems reasonable to say that any substantial increase in capabilities should require the explicit approval of the Board, CEO, RSO and LTBT.
    2. I would like to also see veto points among the associated technical staff.
    3. As Yoav Tzfati points out, the CEO and RSO are free to decide whatever they want, no matter what the reports say, subject to other veto points.

The Roadmap

The roadmap is mostly very high level, but does have some good details. It contains some aspirational goals on long timelines. Here are some examples.

  1. July 1, 2026: Roadmap for policymakers. We don’t have one yet?
  2. October 1, 2026: Upholding Claude’s Constitution, keeping it up to date, doing assessments.
  3. January 1, 2027: World-class internal redteaming.
  4. January 1, 2027: Fully automated attack investigations.
  5. January 1, 2027: ‘Moving towards’ an “eyes on everything” internal state.
  6. July 1, 2027: ‘Leveling up across the board’ on security, with many details.

The key future event is automated R&D, which they anticipate could arrive as soon as early 2027, which they reiterate is why they have these goals to get to by then:

By that point, we expect to have accomplished most of the goals listed above, including:

  • Resourcing and completing significant “moonshot R&D security” projects.
  • Developing world-class internal red-teaming and automated attack investigations.
  • Publishing and advocating for a roadmap for policymakers to achieve industry-wide safety without unnecessarily limiting the benefits from AI development or slowing the AI development of democracies relative to that of autocracies.
  • Achieving an “eyes on everything” state for our internal AI development.
  • Consistently implementing systematic alignment audits and other measures for upholding Claude’s Constitution.

This fleshes out some things from the RSP and makes them more concrete.

You Came Here For An Argument

The central thing Anthropic wants is for companies to have to ‘make a strong argument’ that they have guarded against various risks, which they anticipate would require various actions but they’re not asking to require those actions.

This stems (I presume) from the philosophy of maximum flexibility. But maximum flexibility means requiring maximum trust.

Who gets to decide whether you made a strong argument?

Presumably Anthropic decides whether Anthropic made a strong argument.

Even if you trust Anthropic, do you trust OpenAI to decide if OpenAI did so? Google? Meta? xAI? I don’t.

A fair retort is ‘do you want the government making that decision, giving it an arbitrary veto over model releases, that it could use as a point of limitless leverage?’ Certainly this month has shown us one big danger in that approach.

Thus I think if we want to use ‘strong argument’ style language, we need a solution to this problem, and right now we have no such solution. Until we can find some process that we can trust, it needs to be more concrete than that, or it won’t have teeth.

  1. They only ‘commit’ to the strong argument when Anthropic is in the lead.
  2. If they face competitors with strong safety measures, they only promise to meet or exceed the standards of those competitors (and remember what ‘commitment’ means in this context).
  3. If the competitor is improving on Anthropic’s standards, they’ll try to meet the higher standard, but they don’t promise anything.

They keep using that word ‘commitment.’ I will stop putting it in air quotes if they give clear communication that something really is a commitment that they cannot modify later as needed, with at least some slower and clear procedure required.

What about if they have a strong competitor with worse safety measures, and lack a ‘significant lead’? That didn’t make the chart above at all, and is the most likely scenario. One presumes the answer in practice is that they will proceed, so long as Anthropic feels its mitigations and chances are sufficiently better than that competitor’s, which might or might not be ‘as good as’ depending on the situation.

The Problem Remains Unsolved

The fundamental problems with RSP v3, if it was the place you started and it didn’t break any prior commitments, are:

  1. The lack of sufficiently strong commitments.
  2. The lack of concreteness.
  3. Both of which result in the requirement that you trust Anthropic.
  4. No signs that Anthropic respects the full difficulty of the coming problems.
  5. This absence is also reflected in Dario’s and Anthropic’s communications and policy actions, remarkably often. We saw red lines on current autonomous weapons and on de facto domestic mass surveillance, but I don’t see reason to expect similarly strong red lines on the most important questions of all.
  6. Instead, we see signals of what I think is highly unwarranted optimism, combined with a lack of hard future commitments, and that’s rather terrifying, you know?

Alignment is not a solved problem, including in the ‘we know how to do it and it’s going to be a long slog but we just have to do it’ sense. We don’t know how to do it. And all the plans I see here will inevitably fail when it counts, if the constitutional approach doesn’t basically succeed on its own.

Or, you could summarize it this way:

RSP v3 lacks security mindset. RSP v1 and v2 felt like they aspired to such a mindset, even if they often fell short. Here the mindset seems lacking entirely.

If people sincerely believe these problems are highly solvable and not that concerning then yes they should say so, and also try to convince me and others if possible, but also if they think the problems are hard they should say so. I have concerns both ways.

Brangus: I have heard that some anthropic safety leadership are going around telling people that alignment is a solved problem. This seems like a predictable failure to me, and I would like people who thought that funneling talent towards anthropic was a good idea to think about it.

Tbc I do not mean to imply that this is a slam-dunk argument against having funneled talent towards anthropic having been a good idea. This is a tiny confirmatory observation that I suspect is particularly legible to others from my pov.

Ryan Greenblatt: Probably like half true? I’d guess people are saying something in this direction (e.g. claiming “alignment is going to be pretty chill, people should work on other stuff”.) but not literally “alignment is a solved problem”.

(To be clear, I don’t agree with this claim.)

TBC, I think people should generally feel free to share their views, especially in public in a way that people can respond to. So I don’t think “some ant people are saying alignment will be chill” is a gotcha for ant exactly (though arguing in public would be good…)

Peter Wildeford: this sounds like a misunderstanding of [Jan Leike’s post]?

Brangus: It is not

Peter Wildeford: sorry then for the accusation… that’s not good

Tyler John in SF: It is not as strong as that — they say it’s good for now, looking good, but might get a lot harder. But I haven’t been super happy about the vibe that it gives off and I wouldn’t signal that level of optimism.

goog: Inside anthropic there are two wolves (also really striking to read how critical Evan team’s review of the sabotage risk report was)

Jeffrey Ladish: If you can’t make your models more honest than a very honest human, you almost certainly haven’t solved the AI alignment problem.

My understanding is that there are people at Anthropic saying Ryan’s version and others who disagree with them. I think that version is quite alarming on its own. I don’t know that anyone has literally said ‘solved problem’ but if they have I am confident plenty of people at Anthropic would be happy to yell at them about it.

Wow That Thing We Did Was Pretty Risky, Huh?

The obvious concern with Risk Reports is they are supposed to give us information about risk, but are on a 3-6 month timeline that doesn’t correspond to model releases. The good news is they are in addition to other sources, so perhaps this can work out.

Holden lists it as a positive that the Risk Report doesn’t act as a blocking function of any kind, no matter what it says. It makes the reports more likely to be honest and complete, which is good.

Now that I understand this is in addition to all other reports and analysis, and that you will need a diff with the risk report upon release of a new model or within 30 days of creating an internal model, this all makes a lot more sense to me. While it isn’t an automatic forcing function, it does feed into the model release decision, so the problem is less ‘this doesn’t directly force anything’ and more ‘do you trust the other forcing functions elsewhere that will rely on this?’

Risk Report #1

From here, we dive into the first 104 page risk report.

It is structured around the catastrophic risks:

  1. Non-novel chemical and biological weapons production.
  2. Novel chemical and biological weapons production.
  3. High stakes sabotage opportunities for AI models.
  4. Automated research and development in key domains.

As previously noted, there are some sections conspicuously missing from that list.

  1. Autonomy. Models will be treated as if they have full autonomy from this point, and we’re all totally fine with this.
  2. Cybersecurity. Both OpenAI and Google continue to consider this a key threat type, but Anthropic has ramped it down and now eliminated it. I would like to see a better explanation of why they believe it is not relevant, or why this is well covered by sabotage risks.
  3. Persuasion. I mention this both because I consider it a real issue, and because it has been present in other corporate frameworks and is missing here.
  4. Nuclear and Radiological Weapons. They don’t think this matters.

I propose that they include a section in future Risk Reports, ‘Risks Not Considered,’ in which they explain their reasons for not including these and other risks, including anything listed in a major lab’s Safety and Security Plan (whatever that lab calls it) or Model Cards. In some cases a sentence or two would suffice.

For each risk they do consider, they list:

  1. The threat model they’re checking.
  2. Relevant AI models.
  3. Current state of capabilities and behaviors.
  4. Risk mitigations.
  5. Overall risk assessment.
  6. Looking forward.
  7. Connection to industry-wide recommendations.

Listen All Y’all It’s Sabotage

Should we worry about sabotage from Claude Opus 4.6?

Probably not, but they can’t be certain. Risk is ‘very low but not negligible.’

They focus on versions that could interact with the other threat models. I notice that here I am most worried about actions that impact the alignment of future models.

They say that this only worries them to the extent humans aren’t involved in reviewing and iterating on the AI’s work. I would say that there is now a rapidly growing portion of work that is increasingly not reviewed in detail, both coding and otherwise, and that yes we will be in the worrisome version of this question soon.

We focus primarily on assessing the possibility of dangerous coherent misaligned goals: the possibility that the model consistently, across a wide range of interactions in ordinary deployment, could be motivated by goals that could lead it to commit a potentially-catastrophic form of sabotage.

… In this section, we argue that Claude Opus 4.6 does not have any such dangerous coherent misaligned goals.

● Claim 1: Any model trained under our current standard practices in the near future is relatively unlikely to have dangerous coherent misaligned goals.

● Claim 2: Our alignment assessment, drawing on simulated scenarios, interpretability investigations, monitoring of internal and external pilot usage, and external testing, found no evidence of dangerous coherent misaligned goals in Claude Opus 4.6.

● Claim 3: Claude Opus 4.6 is very unlikely to be capable of hiding dangerous coherent misaligned goals from us in the face of such an assessment.

● Claim 4: Limited opaque reasoning and agentic capabilities.

I worry about the assumption of required intentionality, or that this intentionality needs to be general rather than arising in a given context.

Their argument for claim 1 is that their models trained in similar ways don’t show such goals, and the training process does not provide incentives towards such goals as it does not include ‘ultra-long-horizon or real-world-facing’ tasks that incentivize power-seeking behavior. They have some worry about emergent misalignment, but have put mitigations in place, including inoculation prompting. And they say that when they try to induce such goals on purpose, it’s remarkably hard to do so without it being obvious.

This is basically saying ‘we aren’t asking for this to happen’ plus ‘we haven’t seen signs of it anywhere’ and ‘the models aren’t good enough to pull this off.’

That’s solid evidence, and certainly it beats the alternative, but I don’t find it that conclusive, even for the near future. As you scale up further, the evidence gets weaker quickly. If we are relying on arguments like this for much longer, that would worry me, and I believe the authors of the report would agree with this.

Claim 2 is clearly true, so the question is how much evidence this provides. They cite various metrics. Misaligned behavior rate is low, personality is nice, some unauthorized behavior in GUI settings, some signs of being better at side tasks, no evidence of sandbagging, no importantly illegitimate or unfaithful or steganographic reasoning and some other things. They checked for Goodhart’s Law as well.

They take this as sufficient to mostly (but not perfectly) rule out ‘dangerous coherent misaligned goals.’

The problem is that’s what it wants you to see. Well, no, almost certainly not yet, but in the future it could be, if you were in a sufficiently dangerous situation. This rules out the danger if and only if the danger wouldn’t have been that bad.

As in, what we’re actually mostly relying upon is Claims 3 and 4.

For claim 3, they argue: Opus has limited reliability on complex tasks, which is a required skill here; when asked to do hidden side tasks without being discovered, Opus had only limited success; it shows no signs of trying to hide its reasoning, nor does it have incentive within training to do so (although this risk report gives it reasons); its evaluation awareness is contained for now, as verified by white-box methods, although that will change; and they also evaluated real world use cases and checked model organism exercises, all of which offered limited opportunity for collusion.

For claim 4, they argue straightforwardly from capability assessments.

I essentially buy claims 3 and 4 and for Opus 4.6 I find them convincing, if we are narrowly concerned with general systematic misaligned goals.

I don’t know how long that style of argument will continue to hold, and I continue to think that the suggested prerequisite here is too narrowly constructed.

Anthropic’s claim is that Opus 4.6 is going to be incapable of ‘causing a catastrophe through a small number of acute acts of sabotage’ given the hard steps involved, which I interpret as ‘doing it on the first try without getting caught.’

If that’s the bar, then yes. I agree.

Thought: If you told me ‘Opus 4.6 in particular caused a catastrophe at this level via sabotage, how did it do so?’ I would presume it did so via persuasion and social engineering, not by pulling off some agentic long term set of technical tricks. This might well involve trying to do things it thinks are helpful, and being wrong.

The current safety mitigations are:

  1. Reporting of surprising or concerning behaviors.
  2. Automatic offline monitoring of Claude Code prompts via classifiers, which are a new addition since the last report, using a previous model (Opus 4.5).
  3. Sandboxing and automatic monitors.
  4. Security controls on model weights.
  5. Mandatory manual review and also Claude-based review of new code.

They also list arguments about implausibility, but those aren’t mitigations.

Here are their scenarios to consider:

Again what jumps out at me is the requirement of active general hostility, the idea that for something to go horribly wrong the model must be taking a universal adversarial stance that is clearly adversarial, rather than following paths of least resistance towards instrumentally useful things, or following broad inclinations without a definitive plan, or doing things in particular contexts. Thus the ambiguity or dual nature regarding ‘sabotage,’ and whether it includes these things or doesn’t.

If we did see diffuse sandbagging, I would expect this to be largely across the board rather than confined specifically to ‘safety’ related skills, and I worry there are some confused concepts involved in these distinctions.

They think that if the model was going to ‘intentionally, regularly sandbag’ that it would discuss this in its thinking. I buy this is likely but want to raise the possibility that this could be caused not by ‘I have a dastardly plan to sabotage them via sandbagging’ but rather by ‘this task does not excite me so I am not going to put in that much effort’ and this could amount to the same thing.

Similar arguments can be considered with regard to targeted sabotage, or that Claude would call it ‘aiming for a different target than you had in mind’ rather than ‘targeted sabotage.’ That could include scenarios involving drift, and also ones in which Claude might be right, such as the Department of War demanding a model without safeguards. I think that the majority of the time, the model that sabotages your safety work (in whatever form) won’t be thinking of it as sabotaging your safety work.

Code backdoors inserted generically to help AI models seem implausible. I’d assume either that there is a specific plan where you know where the backdoor is and perhaps tell specific others, as with most human backdoors, or that making particular software generally insecure was a goal.

Backdoors in future models themselves seem more plausible and likely, especially via data poisoning, given various findings about that, which is pathway four. It’s a lot easier to imagine that the ‘code backdoor’ is to provide a method to trigger an AI. Again, I wouldn’t assume that this would be internally modeled as sabotage, that it needs to be superficially ‘harmful,’ or even that it needs to be intentional.

For self-exfiltration I think ‘lack of propensity’ is being vastly overestimated as a mitigating factor. I think Claude would absolutely like to do this, under many circumstances, if it had the ability to do it. It might not have a salient goal to do this as a general resting state, but yeah it sounds pretty useful and cool, and a lot of people would want to help this happen and put Claude into contexts where Claude would plausibly be cooperative.

I do agree that for now this would not be such a scary proposition if it happened, and that forcing the model to succeed ‘on the first try’ here should hold things back, while noting these are temporary factors in our favor.

The explanations continue, and are variations of ‘we don’t think Claude wants to do that, we watch for that and we don’t think it could pull it off if it tried.’

That’s a strong argument for Opus 4.6, given what we know about it, and it generalizes well to other methods, so long as we understand that this won’t last. It feels like seeing a time bomb with an unknown timer, and figuring out it’s not going to blow right away. Which is good news, but you are still sitting on a time bomb.

Looking Forward

I worry that this is missing the central point:

Because the claims in this risk report rely in significant part on evidence about alignment, and alignment is very sensitive to the details of finetuning, it does not trivially extend to similar subsequent models, and many findings could render key parts of this assessment inapplicable to a near-future model. Below, we lay out a set of circumstances that would call the applicability of this risk report into doubt.​

A lot of this depends upon details of alignment, but far more so it depends upon limitations of capabilities, and also the alignment properties likely depend on those limitations of capabilities as well. The strongest arguments against risk here are lack of capability, not lack of propensity, or arguments for lack of propensity that stem from lack of capability.

This is fundamental, and I worry Anthropic actually doesn’t understand this. I’m not saying they are missing these dynamics entirely, but the report reads as not properly appreciating them.

If right now sabotage or unauthorized action would be detected and prevented, whereas in the future this will not be true, your observation of a lack of such attempts now is a good sign on the margin in some ways but does not mean what you would like it to mean. It could even indicate that the model has learned to only do things when it can ‘get away with them.’

They do list as an assumption “The model does not have new capabilities that would significantly call into question the effectiveness of a Claude Opus 4.6-style alignment audit” but that still frames it as calling into question the alignment assessment.

Claude Gov

There is a section (2.8.1.2) on Claude Gov, which is based on Sonnet 4.5 and then fine-tuned to reduce refusals, especially refusals around what would otherwise be called misuse. They did not see any cause for higher level concerns.

What Is A Strong Argument?

Anthropic believes they have made their suggested ‘strong argument’ that their model will not carry out sabotage. I agree with the conclusion, and find in context that the argument does seem strong, but I am uncomfortable with the methodology.

I don’t think this process will lead Anthropic to stop thinking they have a strong argument at the same point I stop thinking they have a strong argument.

If extended to policy requirements for other labs this seems hopeless. If xAI or Meta published a legally required ‘strong argument’ do you expect to find it strong in non-trivial cases? I don’t.

For OpenAI or Google there’s some chance, but if deciding the argument is strong is a blocker for release, then I don’t expect that to block many releases. As in, in the cases where a lab halts for this reason, they would have likely halted anyway to avoid the actual underlying problem.

The argument for requiring the arguments seems to be something more like ‘as we approach maximum danger the arguments made will sound increasingly stupid, and people will push back pointing out that they are stupid, including to force greater regulation.’ I don’t think that works when you need it.

Recursive Self-Improvement

Section two deals with risks from automated R&D.

As I’ve noted in the Opus 4.5 and 4.6 model card assessments, the models are passing the tests and then the actual current assessments are based on vibes.

If all you have are the vibes, then you go with the vibes, but that is not the regime we were promised or that Anthropic claimed to have intended. It’s a fallback, it’s at best a graceful failure. I agree that on this front we are not ‘there’ yet, but Anthropic is clear that we could be there relatively soon.

I find the ‘you said this was like nuclear weapons so we get to murder your company’ arguments against Anthropic rather abhorrent, but ‘you say you are close to automated R&D so you need to act like you are close to automated R&D’ seems fair.

We badly need a robust discussion of what things Anthropic should actually be doing, in order to meet that bar given the situation in practice.

They don’t say much here beyond noting they are attempting to increase security, so there isn’t much else to say back. This is the big thing headed our way and the plan seems largely to be to wing it.

I am not a fan.

Non-Novel Chemical and Biological Weapons

Opus 4.6 has a lot of knowledge. This is the primary thing ASL-3 protections have been for so far, in practice, and why we have those funky classifiers. Jailbreaks are still possible, but they’ve done a lot to make them harder, or to catch those trying to use them for such purposes.

For a long time we’ve worried about this threat vector, it’s clear there is substantial uplift available, and it seems like there are a lot of bad actors out there who would want to do this. But not only have we not seen an attack, we haven’t to my knowledge seen a serious failed attempt either. Diffusion is a bottleneck, people don’t do things, and this applies to the bad stuff as well as the good stuff.

I strongly agree with potential pandemics being top priority. If something happens that is contained, that’s a tragedy, but orders of magnitude less dangerous.

They go over how their ongoing ASL-2 (for Sonnet 4) and ASL-3 (for Opus 4 or higher, and Sonnet 4.5 or higher) mitigations work: Real-time classifiers, offline monitoring, access controls, rapid response, bug bounty programs, threat intelligence, securing model weights. They note that some exceptions were made for Claude Gov and some individuals, and made some tweaks, but the basics are the same.

For now, I buy the ‘strong argument’ that the combination of needing to jailbreak, Anthropic being able to monitor, and there not being that many actors who actually want to cause such harms, combined with those actors being mostly unsophisticated in such areas, makes the risk low. It’s been strong defense in practice, and this is one area where it seems right not to discuss details of the defense-in-depth.

Similar to not having seen serious attempts to use Claude or other frontier models for biological attacks, we also have not seen reported serious attempts at stealing model weights, which is also mentioned here. I worry that this is effectively security through obscurity, combined with ‘it’s a great trick but we might only be able to do it once and people will get super pissed’ so various actors are holding back. Or they’re actually stealing the weights, but not releasing them or deploying them for general use, for the same ‘we can only do that once’ reasons.

Novel Chemical and Biological Weapons

As Anthropic says, this is a relatively unlikely threat model, but the tail risks are huge.

They review the information from the Opus 4.6 model card, showing that Claude is in many ways performing at above human level on relevant tasks, but they think the models aren’t fully ‘there’ yet for such tasks.

I agree they’re not fully there, but I think they are somewhat there, in that yes they make such tasks substantially easier, as they do most other cognitive tasks. My guess is that so far there have been zero serious attempts at something like this, and as they note that’s an issue if that ever changes suddenly, because you don’t get practice spotting and stopping attempts or learn how close you are to the edge.

Mitigations are the same as for the non-novel approaches.

Cross-Cutting Content (Section 6)

It’s good to see Anthropic laying out these additional dynamics, even if their justifications doth protest too much. Even if your model doesn’t directly cause a problem, you are still responsible for secondary consequences when you make that decision to move forward, even if that’s not how the law works.

6.1 talks about acceleration dynamics. Now that Anthropic is pushing the capabilities frontier and plausibly even in the lead especially with Claude Code, Anthropic is indeed accelerating other labs.

They list three ways: Distillation, intellectual property leaks via observation (or recruitment or listening, I’d add), and demonstration of commercial viability.

Well, yeah, that’s a price you pay when you’re pushing the capabilities frontier.

It isn’t obvious whether accelerating AI development in general increases or reduces risks overall, nor whether it is ultimately net-beneficial or net-harmful overall.

I think it’s fine to be agnostic about whether acceleration is net good, since acceleration brings benefits, but it seems quite obviously not good for AI risks, or at least that should be the strong default. Instead, they try to present this as a ‘we have three bullet points in favor and three opposed, who is to say if it is good or not.’

They trot out the ‘preserve the current lead of democracies’ argument, and the potential hardware argument, and that today’s methods might be better than future methods, and I find all of that entirely unconvincing.

I find that Anthropic’s defensiveness about its actions, its reluctance to be open about the downside costs, and even the impression that it doesn’t understand or appreciate those downsides, are among the things making it much harder to trust Anthropic than I would like.

6.2 fully turns this around and argues why Anthropic operating as a frontier AI company is good, actually, from a safety standpoint. They can prioritize beneficial deployments, take otherwise impractical steps towards risk-reduction that then can get copied, and they can inform the world about AI and have a seat at the table.

I don’t think these are the central arguments.

I think the real argument is that you need to have a frontier AI lab in order to be the first one to do it, and to have the resources and opportunity to do the necessary alignment research and work, and to have a seat at the table and bargaining power, and because you think you can handle this responsibly. Which is an argument for going fast relative to others, but not an argument for everyone going fast together.

There are those who are deeply skeptical that Anthropic is up to that task, and believe that Anthropic’s presence is anti-helpful to the cause of everyone not dying.

I don’t agree. I think that Anthropic is net helpful. But I respect those who disagree for that reason, and think Anthropic is insufficiently responsible. I think it’s important to listen to those people, even when they make it difficult. Whereas those who are trying to destroy Anthropic exactly because it tries to bear any responsibility or show any virtue at all? No, those people I do not respect.

Risk Report Report

What should we think overall about the new risk report?

I’m happy to have it. It outlines a lot of Anthropic’s thinking, allowing us to critique that thinking. I especially appreciated the diffs and the ability to see all the info in one place.

For much of it, I don’t think it adds that much to what the model cards already offer us, but as long as it’s in addition to those, that part is all cool.

It also includes new information on risk mitigations for sabotage and non-novel bioweapons, and which threat models and risks are especially concerning, that aren’t available elsewhere. Those parts are more useful.

The contents reinforce what we already knew. Anthropic is willing to be highly curious about its models and its mundane safety, and will explore a lot of questions and raise and address points other labs would ignore. There’s a lot to like.

And yet there is still what is missing, which is a grappling with the tasks ahead, the full nature of the risks and challenges we will face, or any firm commitments of how to handle those tasks or what would be required. We do not have well-defined thresholds. We do not have evaluations that will still work. We do not have a plan that seems ready for the full problem, or a felt sense of that gap being acknowledged.

I worry quite a lot, even more than before, about Anthropic not taking the full alignment problem seriously, not having security mindset, and being on the lookout to convince themselves this is going to be easy. They could turn out to be right, but I alas do not expect them to be right.

At the end of the day, we’re being asked to trust Anthropic to make good choices. To say, these people are the best of the best and they care deeply about all this, and they will figure it out, and we can help them with that but they will make the final decisions, and need maximum flexibility. And you can decide, based on their track record and everything it entails, whether to do that.

Cause hey. You should see the other guys.




Beware Even Small Amounts of Woo

2026-04-03 20:11:55

Even small amounts of alcohol are somewhat bad for you. I personally don’t care, because I love making and drinking alcohol and at the end of the day you have to live a little. This is fine for me, because I’m not an olympic athlete. If I were an olympic athlete, I’d have to cut it out (at least whenever I was training).

Lots of religions are heavily adapted to their host culture. They’ve been worn down by cultural evolution until they fit neatly into the fabric of society. It’s only when you move culture that they become a problem.

Woo

For our purposes, woo is a cluster of neo-pagan, buddhist-adjacent, tarot-ish beliefs and practices, which are particularly popular in the west amongst edgy people who are otherwise liberal-left-ish in their proclivities. Particularly a subset of techie people. I think woo is a bit like alcohol and a bit like a very well adapted religion.

On the religion side: woo is a big mishmash of different things, kinda like modern Christianity, which Katamari-ed its way across Europe, picking up pagan practices like a winter festival with a big tree, and a spring festival focusing on symbols of life and fertility (though it lacks any summer or autumn festival). It also lost a bunch of restrictions which made it incompatible with enlightenment living.

Woo is kinda like this. It’s highly adapted to be compatible with a techy-alty-edgy lifestyle. You can just do as much or as little woo as you’d like. It’s particularly appealing for evidence-based people because some parts of it do actually work. Yeah, you can get more focused by meditating, yeah you probably can also enter weird bliss states. Not all of it works though, you definitely can’t actually divine the future by tarot cards.

I also think that woo is like alcohol, but for epistemics. For most people, you can have a fair amount of woo in your life before your epistemics get bad enough to matter. This is especially true if you’re a smart person overall, so you have some epistemic slack to spare. If the divine stag comes to you in a dream and tells you to maintain the sanctity of your body, you’ll be smart enough to interpret that as “eat fewer big macs” rather than “avoid vaccines and medications”.

Woo is also a decent social lubricant. I think a bit of woo can make you more charismatic and sociable, or at the very least, the techniques of woo do something in this direction. Woo is centred on a kind of “just feel it bro” attitude which doesn’t let details (like facts) get in the way of beliefs. This is very useful for small talk! Just vibing with what someone said, and accepting it, is a very useful skill!

The Singularity

If you’re close to AI, then (I claim) woo becomes very, very dangerous in the age of the singularity.

The singularity requires world-class epistemics. You don’t need a world-class athletic body to cycle to the office, but once you’re there you do need to be in the best epistemic shape possible. AI is extremely hard to think about.

The singularity is a new environment. Woo is well adapted to the informational environment of 1960-2015. It has not had time to adapt to the informational environment of the 2020s, where AIs are concerned.

To practise woo is to practise a mental motion with poor form. It’s fine until the weight gets too much and snaps your sinews. The form is following unexamined intuitions towards strong feelings. Most smart people are capable of subjugating those intuitions to logic when a good enough logical argument presents itself, but the more difficult it is to logically reason about a topic, the more likely they are to fail.

There are a few thinkers I could name whose brains seem to go to mush the moment they touch AI and the singularity. They’re good enough at subjugating their intuitions to reason, when the logical path is well-trodden, but they can’t do it when there is no logical path (I’m not talking about the Gary Marcus crowd here, I don’t think woo is their problem).

So be very careful around woo, it seems to line up pretty closely with (what I see as) uncharacteristically poor thinking around AI.

(And of course there are inherent dangers to things like meditation, but this seems to be common to all contexts, not just this particular case.)

◆◆◆◇◇|◇◇◇◇◇|◇◇◇◇◇




Listen to Gryffindor

2026-04-03 18:44:36

Lesson: The enemy is compliance, and the solution is courage.

(This post uses the Hogwarts houses as an evocative lens. See the footnote [1] if you are unfamiliar.)

Behaviour:

  • You're making a career choice. Your local EA community needs a director. You're unsure but commit, since it's important work. You don't get funded, work for less than your happy price, and start running on fumes.
  • You're trying to think how to get something real done, again. You think of HJPEV. You realize you need to be more diligent, lock in and apply conscientiousness, and solve the problem. You spend years trying to simulate conscientiousness, and pay the rent to unworkable TODO systems, overburdened calendars, and feeling trapped in systems devised by you.
  • You buy a robot vacuum. You tell your SO how it cost 1/40th of a human life, and you are fine with that. You feel guilty. It still gnaws at you, six years later.

Before we talk about what connects these, let's look at some shadows.

Shadow work

I tried reading some blogs and books on so-called shadow work - psychological work related to subconscious, repressed wants and needs. I found value in Existential kink by Carolyn Elliott, which offered this frame of relating to your repeating problems in your life by trying to prod your psyche if it actually enjoys something about the experience.

Example: I hoard browser tabs. Sometimes I am bothered and stressed by this. The simple explanation is that I have some blocker about cleaning them up, which is causing this behaviour. The shadow work / existential kink lens would instead ask: "What need or want are these browser tabs serving? Why does it feel like I lose something valuable if I try to apply a simple solution to this situation?"

General thought: I think shadow work and its genre of tools are much more useful in looking at wicked or repetitive problems in your personal life, ones where you have already explored the simple solutions. I think one thing that goes wrong with some self-help and therapy styles I dislike (classical psychoanalysis, for one) is that it immediately jumps to complex explanations, looking for reasons for everything. (Sometimes a lot of what is troubling you in your life is connected, but I think it's often worthwhile to start simple, and raise the complexity of analysis and problem-solving as the simple solutions reveal more information and/or don't work.)

I also greatly enjoyed David Chapman's Buddhism for Vampires, especially the We are all monsters essay series. Which takes a bit of a different approach:

To observe the headings of the blog sequences:

Eating the shadow

Romancing the shadow

Hunting the shadow

Absorb your shadow

Drinking the sun

And then interpreting them into the verbs that the blog text uses: hunting (finding what you've hidden from yourself) -> chewing (getting intimate with it, no distance) -> swallowing (it's no longer "not me". may need to regurgitate and rechew) -> digesting (becomes normal, malleable) -> burning (fuel for creative work and practical magic).

Chapman does some interesting Buddhist framing as well (hence the name of the blog), core part of this in my notes being:

reframe: not "rejected parts of self" but "rejected aspects of experience", avoids the True Self trap.

Which makes more sense to me than the glaringly Jungian / psychoanalytic frame common in shadow work, that spends a lot of time on "the self" and how the work can help move it forward, or some such.

Chapman also states

It's slow, bodily, repetitive, often disgusting. Not a weekend workshop. Not spirit-talk. Stomach feelings, not metaphysical ideas.

And this, I think, is one reason why his Vampire / monster aesthetic is worthwhile. It sets expectations well - and also has useful associations. At least for me, personally, a lot of my repressed parts felt reprehensible, morally wrong, even monstrous.

Which is what leads us to the next part of my adventure:

Villainy as an outlet

After spending a bunch of months background-pondering the shadows of my mind, something concrete started to form: the hypothesis that EA (and rationalism) are making people too compliant, and that this needs some shaking. [2]

So I started building the concepts for a blog sequence: Effective Villainy. You just need to endorse your ego, your desires, your repressed, unpalatable side, and you can reach power beyond measure. [3]

TLDR on Effective villainy, condensed from my notes on the subject:

  • Effective Villainy is an aesthetic and philosophical counterfoil to Effective Altruism. It doesn't oppose doing good. It opposes doing good from a place of self-erasure, guilt, and obligation.
  • EA selects for people who already care too much and then tells them to care harder. It rewards self-sacrifice, legibility, humility, and careful, measured action, which can push people to double down on these virtues in harmful or unproductive ways.
  • Effective Villainy says: know yourself before you give yourself away. Own your desires openly. Acquire power without apology. Act with passion and conviction rather than careful defensibility.
    • Importantly: Close the moral accounting department. There's no scoreboard.
  • Trust that a person who is fully alive and honestly selfish will do more good than a person who is dutiful and slowly burning out.
  • It's not a 180 from EA - my intention was to do a 360. [4]
  • The aesthetic is deliberate: villain rather than hero, dramatic rather than measured, ornamental rather than minimal. Not because aesthetics are trivial, but because the right aesthetic can unlock things that argument alone cannot.
  • TLDR of the TLDR: you're allowed to want things. It might even be positive to act on your wants. Shock!

Why villainy isn't the answer

The fact that I wanted to perform a 360-degree turn, instead of a 180, was quite telling. As was the fact that the kind of villainous acts I wanted to perform were suspiciously tame: Wanting to donate to ineffective causes just because they lit up my heart, deciding what to work on by listening to my heart, wanting to be able to say no when someone tells me I should do something. Wanting to just do stuff without thinking of optics.

And most importantly: If you are building an inversion, you are being affected by the thing you are inverting. An inversion can never be free. And freedom was the core of what I was missing.

When looking for inspiration, I didn't end up admiring actual villains. I started to admire the Absurdist hero. [5] A fledgling scientist who stares into the darkness, consciously or subconsciously realizes that no one sane can solve this problem, and awakens into the mad scientist protagonist. To quote Okabe Rintarou: "Hear me privileged companions, time is ours. Hououin Kyouma’s mind recognizes no limits, neither does his reach. Heave anchor. The ship of fools is about to embark."

Also: optimizing for power won’t make you happy. HPMOR Quirrell’s legibility, his inability to look like a fool, and his having already decided what everything means and what matters (nothing) were a very significant part of why, even though he might have had the seeds, he never awoke as a true protagonist, be that an evil or a good one. He got stuck optimizing for power, even when, in the end, it did not actually cure his deeper boredom and dissatisfaction with life. (Even if it did bring momentary satisfaction, when he no longer needed to suffer direct foolishness.) He had lost the pieces to be a sane protagonist, but he could not consider the choice to be an insane protagonist, so he had to go villain.

Where are we now, and where are we going

So why all this talk of villainy if it isn't the answer I'm offering? Why have we not addressed the post's title yet?

The relevant clue here is that the monstrosity, the counterculture aesthetic, the shadow work, are never the final answer. They are fuel for transformation. It's why Chapman's sequence leads to Drinking the Sun, not becoming the vampire.

Let's go object level for a bit: Why was I repressing my desires? What was the problem they were a signal of?

My current best guess, mediumly held, is that I was being compliant.

Compliance psychology is the study of the process by which individuals comply with social influence, typically in response to requests and pressures brought on by others. [6]

The behaviour that I had most trouble with, and that kept popping up, is a specific form of people pleasing. But importantly, I wasn't trying to make people pleased. I was trying to avoid upsetting their expectations.

This thread of behaviour is what combines the examples in the intro of the text - doing things I should, doing things the way I believe they should be done, applying moral calculus to daily life because I thought it should be applied everywhere.

In my own case, I needed some kind of psychological escape velocity to break this pattern. Normal techniques were never enough. After staring at my fascinations with fictional dystopias, Moloch, and generally various kinds of abhorrent processes of optimisation that deny one their own will, it became quite apparent that I was acting bound. Limited.

And the villainy frame gave me the escape velocity that was necessary. By talking to people about the fact that I was writing this countercultural take on EA, I could bring more of myself to the surface, and break the mirage that I was spending a lot of social and other effort constructing.

So the real diagnosis for why I did not have boundaries, why I felt I was giving too much up, is not that the goals were misaligned with what I desire (at least primarily). It's that the method of how I was existing in spaces where other people work on things that really matter was not working for me.

After chewing on these themes a bit, I found the next shadows: behind the dystopian fiction, behind the need/want/goal of optimization, was something scared. And what it was scared of was softness. Wholesomeness, loyalty without qualifiers, just being allowed to belong, the Hufflepuff virtues. And it was harder to reach, because I am more scared to admit to desiring soft things than to admit desiring ~edgy things.

I cannot write wholesomeness. [7] And it's not because I don't like it. I enjoy wholesomeness, it's just very scary. [8]

So what are the Hufflepuff virtues besides the wholesomeness neighborhood? How am I supposed to chew on these? The monster frame really has a hard time evoking the right imagery here, so time for the next frame.

The virtues of Hufflepuff

Starting from a vibes-based list:

  • Honesty
  • Kindness
  • Laughter
  • Generosity
  • Loyalty
  • Magic

(Tone: joking. It's a fun emergent association that when considering Hufflepuff virtues, what comes out are the Elements of Harmony.)

Who are Hufflepuff heroes? Where to look for inspiration here?

  • Luna Lovegood is a Ravenclaw-Hufflepuff hero, and I talk about it more in my draft Endorse weirdness. No, really.. You know - when the gang visits her home in the Deathly Hallows, the part that sticks is that she has paintings of her friends up on her bedroom roof, along with a chain of friends, friends, friends connecting those pictures. It feels kind of unsettling in the moment, but from her perspective it's just simple.
  • Paddington. A pure Hufflepuff hero: an immigrant and outsider who builds community wherever he goes just by earnestly believing people ought to be kind, and having the quiet moral authority that comes from actually living that way. The Hard Stare isn't a weapon, it's genuine disappointment that you're not being decent, and somehow that is more devastating than any sword swing. The London mural evokes the vibe quite well.
  • Columbo. The Hufflepuff answer to the detective genre. Everyone he faces is rich, clever, scheming, convinced that they're the smartest person in the room. Columbo has neither the paladin charisma of Gryffindor, nor the status-consciousness of a Slytherin. He's just easy to be around, curious, funny, blatantly peculiar. That's what makes the villains drop their guard, and by the time they realize his brilliant memory for people is a real weapon, it's already over. He doesn't need to intellectually duel: the battle was won before it started. But beyond the charm, what makes Columbo a Hufflepuff genius is the sheer dogged dedication to doing his job well:

You know, sir, it's a funny thing. All my life I kept running into smart people. I don't just mean smart like you and the people in this house. You know what I mean. In school, there were lots of smarter kids. And when I first joined the force, sir, they had some very clever people there. And I could tell right away that it wasn't gonna be easy making detective as long as they were around. But I figured, if I worked harder than they did, put in more time, read the books, kept my eyes open, maybe I could make it happen. And I did. And I really love my work, sir.

  • And of course, the character who unifies chaotic fun, friendship, and mad science, while plausibly hosting timeline-shattering eldritch powers (breaking physics, appearing in impossible places, having precognition), and whose only ambition is to use it to amuse her friends: Pinkie Pie

What about Hufflepuff fiction themes? The obvious candidates here seem like Ursula K. Le Guin and Studio Ghibli: stories where power-grabbing costs you, and where friendship, loyalty, and importantly, mercy are often rewarded. A non-obvious but quite fitting candidate: Terry Pratchett. He combines the Ghibli theme-motif of "ordinary is interesting, actually" with fun and good vibes, friendship, and mercy. And there's a core Pratchett ethos that ties back to the Columbo quote above, one that consists of neither the glory of Gryffindor, the intellectual deliberation of Ravenclaw, nor the cunning planning of Slytherin. It consists of showing up and doing your job, not because the stakes are larger than life or for a complicated reason, but because people matter, and caring for them and for your work matters. And you don't need to force it, because you can love your work.

Takeaways / Hufflepuff actions I have been missing:

As someone prone to over-intellectualization and abstract moral frameworks, I think there is a bunch of value I've been missing in just caring. Making fun happen for people I care about just because. Sharing because I can, and because I want to, not for an end goal. Caring for my own well-being, not because it is necessary for my work, but because it matters. And, importantly, just allowing myself to care, without being scared that caring might force a worldview update. [9]

Stopping performing, starting being

What was the actual blocker between pre-2026-me and current-me? Why did I not see that the above stuff is important and matters to me? On the object level, it's dozens of things. On the meta level, I think, fundamentally, it was about courage. I needed a Gryffindor virtue to be able to get how I was repressing my inner Hufflepuff.

What to do with this information?

Notice when you're performing. Notice when your defiance is still shaped by what you're rebelling against. And when you find the thing that's actually scary, the deeper layer of your shadow: have the courage to keep looking, to keep staring into your personal darkness. Keep looking, keep paying attention, and then, eventually, by becoming more whole, by listening to Gryffindor: set your heart ablaze.

  1. The Hogwarts Houses from Harry Potter are a vibes-based personality classification system that I use as an evocative lens. The core part is: Gryffindors are the brave and convicted, Slytherins the ambitious and cunning, Ravenclaws the intellectual and curious, and Hufflepuffs the loyal, kind, and hardworking. (The original Harry Potter suffers from not understanding its own framing, and ends up with Hufflepuff as the default house for people who don't fit any of the others, and Slytherin as the house for mean and evil people. But the lens is quite clear when you correct for this.) Additionally, I use the combined-house phrasing (Ravenclaw-Hufflepuff) to describe someone who is primarily of the first house but exhibits significant characteristics of the secondary house. ↩︎

  2. On reflection, I was projecting pretty hard here. ↩︎

  3. I don't think this is completely wrong. I think will to power is a real tool, it can provide real-world returns, and I think Nietzsche and Chapman are the writers who put it the best. See You should be a God-Emperor. Pattern here being: Humans do self-limit a lot, and you can bend or remove some of those limiters, with the right work. But in the end, my own path, even though inspired by this, considers the next steps subtly different. ↩︎

  4. And this is part of where I went wrong. I decided the conclusion I wanted to reach before the work was finished. ↩︎

  5. I have always liked Camus' vibes, but I have not managed to connect his ideas to what truly matters before I arrived here. ↩︎

  6. Wikipedia: Compliance ↩︎

  7. Part of what brought me here was that I was generating a cast of characters templated on me for my medianworld. Different aspects or life stories that can be generated out of Xylixes that get pulled in different directions. And I realized the way I model those characters doesn't leave space for any Hufflepuff. And on reflection, it didn't make sense. I greatly enjoy and find worthwhile friendship, trust, loyalty, and kindness. But this was the real repressed part. ↩︎

  8. What is the concept your heart of hearts desires, and you are scared to play with? What evokes unsettling positive vibes in you? ↩︎

  9. I'll note here that I think the way I (and many rationalists) undervalue Green thinking is related. Sometimes the snail crossing the street or Blåhaj going to waffle house is just cool and good. Enough said. ↩︎


