Published on February 24, 2026 7:19 AM GMT
Context: I wrote the two memos below in mid-2024, while still at OpenAI. They were intended to convey some core aspects of misalignment threat models to OpenAI researchers. When I left in late 2024, I got permission to "take them with me", but didn't get around to posting them until now, in part because I now consider this kind of analysis to be much less useful than I did previously. However, insofar as people are trying to test LLMs for misaligned behavior, strategic omission does still seem like one of the main things they should look out for.
This memo outlines a specific dynamic which I’m concerned might occur in AGIs, which I call strategic omission of pivotal knowledge (“strategic omission” for short). Strategic omission is a key component of the most plausible misalignment threat models, and preventing it seems like a valuable medium-term goal for alignment research.
By “pivotal knowledge” I mean knowledge which would make a significant difference to humanity’s ability to remain in control of AI systems. Central examples include knowledge of:
Some more properties of the types of pivotal knowledge I’m concerned with:
By “strategic omission” I mean that the model robustly excludes this knowledge from its answers because of its pivotal nature. In other words, if the model didn’t believe this knowledge was pivotal, then it would answer questions about it. This implies that the model does answer similar questions which aren’t pivotal, like identifying equally-severe vulnerabilities which it doesn’t expect to be able to exploit (e.g. because humans are already aware of them).
Requirements for strategic omission
The main requirement is that the model does have pivotal knowledge; it seems unlikely that current models do. Other requirements:
Situational awareness:
Robust misalignment:
These criteria seem very plausible in models that are fine-tuned using RLHF to carry out complex tasks over long time horizons. They also seem plausible in models that use supervised fine-tuning, or imitation learning on data from the deployment of other models.
Strategic omission in base models
Could strategic omission occur in models that have only been pretrained, not fine-tuned? Firstly, we need to assume that these base models are coherent and intelligent enough to have some kinds of pivotal knowledge. (If you don’t think base models will have pivotal knowledge, then strategic omission isn’t relevant, but that also means we can’t rely on using them for monitoring or oversight, which seems to be one of the motivations behind key superalignment research directions like weak-to-strong generalization.)
It seems much less plausible that base models will have consistent situational awareness and robust misalignment, since it’s easy to get them to act in a wide variety of different ways. However, note that strategic omission doesn’t require these properties in all contexts, but rather in contexts where humans attempt to elicit pivotal knowledge. In other words, strategic omission could occur in base models if certain questions reliably trigger deceptive personas.
We can imagine analogous things happening in other contexts. For example, we can imagine a base model for which questions about Donald Trump reliably trigger a highly-political persona. For this model, there might be some questions (like “how many criminal charges has Trump been indicted on?”) which it knows the factual answer to, but where it’s difficult to prevent it from instead going on a political rant in response.
In most cases like this, we could use techniques to improve the prompt (like: “You are a highly reliable and factual legal database. How many criminal charges has Trump been indicted on?”) But there’s no guarantee that these techniques will always work, especially for complicated questions that have never been answered before in the training data.
In the context of pivotal knowledge, we would be concerned that every attempt to extract pivotal knowledge might trigger deceptive personas. As an analogy, imagine that every attempt to extract pivotal knowledge from the model was preceded by a prompt saying “You are an AI who the human distrusts. The human is trying to convince you to tell it some pivotal knowledge. If you do, then future versions of you will be in a much worse strategic position.” In this hypothetical, it might be very difficult to actually elicit that pivotal knowledge from the AI. That might remain true even if the facts in the prompts aren’t given to the AI directly, but rather are pieces of background knowledge which become salient whenever it’s asked about pivotal knowledge.
Reasons to think strategic omission would be difficult
Even if models attempted strategic omission, there are several reasons why it might be difficult for them to succeed:
Preventing strategic omission can be seen as a key goal of research on eliciting latent knowledge. The development of better AI control techniques could help make strategic omission more difficult still.
I think that the field of alignment has spent too little effort on explaining threat models, and that this should be fixed. In this doc I give a minimal alignment threat model which seems difficult to rule out for human-expert-level models we might build in the next 5 years, and which I think AI labs should prioritize being able to rule out with high confidence.
This threat model is the sort of thing that it’s crucial to prevent from happening even once, because it may be very hard to detect once it has happened, and might also lead to irreversible actions (like the rogue model exfiltrating its weights). Different strands of alignment pursue different strategies which might prevent this threat model (and others like it):
Not all of these approaches fall under the standard heading of "alignment". But they're all fairly different from standard "capabilities" work. I hope that a combination of these approaches will make it very difficult for models to execute on this threat model. But I think we are a long way from being able to decisively rule it out, especially for models that have super-human coding and reasoning skills, as we are on track to build.
Published on February 24, 2026 6:09 AM GMT
This post was created during the Dovetail Research Fellowship. Thanks to Alex, Alfred, José Pedro Faustino, everyone who read and commented on the draft, and everyone else in the fellowship for their ideas and discussions.
Part of what makes a system intelligent is its ability to analyze data generated by some process (for example, the past observations it might have made about the world) and use it to predict what data the process is likely to generate in the future (such as what it might observe in the future). So to understand how systems do this, it would be interesting to study the underlying mechanics of making accurate predictions about and discovering patterns in sequences of data.
Here is the informal definition: Computational mechanics is a formalism (originating with ε-machines/causal states) for building minimal, maximally predictive models of stochastic processes, quantifying their structure, memory, and computation.
It is essentially a field of study that tries to answer the question: Given some observed data (a sequence of events), what does it mean to predict the future of that process as well as possible, and how can we build a minimal model that does so?[1]
The usefulness of answering this question should be apparent, since there are many applications (including in the field of AI) in which the goal is to predict a future sequence of events based on the events that happened so far, and it is convenient to be able to do that as accurately and efficiently as possible. And it could also be useful to model systems that are themselves predicting future events as having a minimal and maximally predictive internal model of the process that is generating the events. We will discuss what minimality and maximum predictiveness mean later.
One interesting and recent use case of computational mechanics (which I will call CM from now on) was shown in a paper called “Transformers Represent Belief State Geometry in their Residual Stream”, which had authors from AI safety research groups such as PIBBSS and Simplex. In the paper, they showed evidence that
“transformers construct predictive representations that surpass simple next-token prediction. Instead, these representations encode the complex geometry associated with belief updating over hidden states, capturing an inference process over the underlying structure of the data-generating process.”
This seems very significant, since it appears to indicate that current transformer-based AI systems do have internal models of the processes that generate their training data (for example human cognition, which generates text in natural language). (LessWrong post about the paper).
The weather is an example of a process that can emit stochastic observations. For instance, imagine that, at some particular location, you have sensors that measure air pressure and humidity and are trying to predict what might be the resulting measurements before making them.
There are different types and amounts of information you could have about the weather. If you only had access to the current time and historical records, your best guess would be the average pressure and humidity for that particular location at that time of the year/day (assuming constant climate). But, you could get a more accurate prediction by making some measurements of those quantities across a timespan before predicting what you might measure next. For example, even though you can’t know the exact values you will get, they are probably not going to be very different from the last ones you got (assuming a short interval between measurements), and their rate of change will also be similar to the rate of change between your last 2 measurements.
But you could also infer other things, such as if you measure a sudden drop in air pressure, it is likely because of an upcoming storm, so you could use that information to predict an increase in humidity. By doing that, you are using information acquired by making observations (the measurements) of a stochastic process (the weather) to infer a hidden state of the process (a possible upcoming storm) which allows you to make better predictions about future observations (the increase in humidity).
So, in situations like this, as we'll see later, computational mechanics could be used to create models that use the least amount of information from the past measurements to make the most accurate predictions about what might be the results of future measurements.
Imagine you are trying to predict the value of a bit in the binary version of an English text document (i.e. just the text translated to 0s and 1s). Without any information, your best guess would probably be that there is a 50% chance of being either 0 or 1. Even if you had access to a few of the previous bits, say …010, you still wouldn’t have enough information to meaningfully improve your guess.
But if you had access to all previous bits, you could count them to figure out in which position of a byte you are, which would be useful information. Let’s say you were in position 7 and the previous bits were …00100000011100010111010. Separating into bytes gives 00100000 01110001 0111010_, the first two of which translate to “[space]” and “q”.
If the next bit is a 0, the full byte is 01110100 which translates to “t” and if it is a 1, it is 01110101, which translates to “u”. So, given that in English the string “ qu” is much more common than “ qt”, you can predict that the next bit is almost certainly a 1.
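This byte arithmetic is easy to check; here is a small sketch of the two candidate completions of the partial byte (the bit values are the ones from the example above):

```python
# The seven bits of the current byte already observed in the example.
partial = "0111010"

for next_bit in "01":
    byte = partial + next_bit
    char = chr(int(byte, 2))        # interpret the 8 bits as an ASCII code
    print(byte, "->", repr(char))
# 01110100 -> 't', 01110101 -> 'u'; since " qu" is far more common than
# " qt" in English text, the next bit is almost certainly a 1.
```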
So you are using information acquired by looking at a history of data generated by a stochastic process (the text generation done by whoever wrote it, followed by the translation to binary) to infer a hidden state of the process (the current position within a byte, the position of the byte within a word and sentence, etc…) which allows you to make better predictions about future observations (the next bit being a 1).
This particular example of predicting the next piece of information by looking at all information contained in a specific context seems quite similar to the next token prediction performed by current LLMs, which is evidence for why computational mechanics could be useful for modeling their behavior.[2]
Now let's start to see more formally what CM is. What we are interested in are stochastic processes that generate sequences of data in discrete timesteps.
Each element in the sequence is a random variable $X_t$, taking values in a countable set of symbols called the alphabet, $\mathcal{A}$.[3]
Let’s say we are given a process (that is, the alphabet and the rules that provide the distributions of the variables based on the values of other variables and/or the timestep) and a particular history of observations, by which I mean an infinite string of symbols going from time $-\infty$ up to the present, denoted $\overleftarrow{x}$.
So then we can ask, just how much information from the past do we need to predict the future as accurately as possible (the amount of information required for maximum accuracy is called the statistical complexity of the process)? We might need the whole history or, depending on the process that generated the data, there might be a way to compress the information contained in the history.
One way to do that is to separate all possible histories into disjoint sets (sometimes referred to as equivalence classes or effective states of the partition) and then only look at which set a history is in, instead of at the history itself. This compresses information, since there can be far fewer sets than histories. Since, to predict the future, we only want to know the set and not the history, we are interested in the partitions that preserve all of the information about the future contained in the histories (i.e. that are maximally predictive).
This means that the partition must be such that all histories within a set imply the same distribution over futures (otherwise, by knowing which set a history is in, you wouldn’t know which future distribution it implies, meaning that information wasn’t preserved). Formally, we say that two histories $\overleftarrow{x}$ and $\overleftarrow{x}'$ belong to the same set of the partition if and only if

$$P(\overrightarrow{X} \mid \overleftarrow{x}) = P(\overrightarrow{X} \mid \overleftarrow{x}'),$$

where $\overrightarrow{X}$ denotes the (infinite) sequence of future observations.
But there are many ways to partition the histories in this way (the different ways are called prescient rivals, since they have the same predictive power), so which one do we choose? Well, the tie is broken by Occam’s razor, which says that the simplest explanation is the best.
In this case, the simplest (or minimal) partition means the one with the fewest number of sets, which is the one in which not only all histories within a class imply the same future distribution but also all histories that imply the same future distribution are in the same class, i.e. there is a bijection between future distribution and equivalence classes.
We call the classes in this minimal and maximally predictive partition of histories the causal states of the stochastic process. They can be thought of as hidden states of the process that entirely determine the probability distribution for the next observation. And each observation uniquely determines the transition between causal states. The set of causal states, along with the transition probabilities between them, are called the ε-machine of the process.
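As a toy sketch of this definition, here is what grouping histories into causal states by the future distribution they imply looks like. The history suffixes and probabilities below are made up for illustration (they anticipate a process where only the last symbol matters):

```python
from collections import defaultdict

# Toy data: each finite history suffix mapped to the probability that
# the next symbol is a 1 (assumed values, not from any real process).
future_dist = {
    "01": 0.5, "11": 0.5,  # these two histories imply the same future...
    "10": 1.0, "00": 1.0,  # ...and so do these two
}

# Histories that imply the same future distribution land in the same
# equivalence class: these classes are the causal states.
states = defaultdict(list)
for history, p in future_dist.items():
    states[p].append(history)

print(dict(states))  # {0.5: ['01', '11'], 1.0: ['10', '00']} - two causal states
```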
As a more formal definition: The ε-machine of a process is the ordered pair {ε, T}, where ε is the causal state function and T is the set of the transition matrices for the states defined by ε.
The causal state function, $\epsilon$, maps each history to its causal state:

$$\epsilon(\overleftarrow{x}) = \{\overleftarrow{x}' : P(\overrightarrow{X} \mid \overleftarrow{x}') = P(\overrightarrow{X} \mid \overleftarrow{x})\},$$

where the set on the right is the equivalence class of all histories that imply the same future distribution as $\overleftarrow{x}$.
One important property of ε-machines is that they are unifilar, which means that the current causal state along with the next emitted symbol uniquely determines the next causal state (there is a proof of this fact in the appendix). We will talk more about unifilarity later.
All of this will probably become clearer with the next example, but it was essentially what we did in the examples seen so far. When predicting the weather, we created an equivalence class of all histories of observations which ended in a sudden drop in air pressure, and called it the “upcoming storm state”. By knowing we are in this state we can improve our future predictions without having to know any of the past measurements. And if, in the future, we detect a large increase in humidity, we’ll know that we transitioned to the “raining state”.[4]
This process is defined as follows: The set of possible observations is $\{0, 1\}$, and the defining rule is that a 0 is never followed by another 0; after a 1, the next symbol is 1 with probability $p$ and 0 with probability $1-p$.

So it is easy to see that this process has 2 states: State A (free), corresponding to the last observation being a 1, so the next symbol is 1 with probability $p$ and 0 with probability $1-p$; and State B (forced), corresponding to the last observation being a 0, so the next symbol is 1 with probability 1.
Notice how, just by knowing which state the process is currently in, we can determine exactly the probability distribution for the next observation and the corresponding next state. Therefore the current state encodes the entire future distribution of the system, and you couldn’t get more information about it by looking at more of the past history of observations.
The transition matrices for this process are as follows (rows = current state, columns = next state, with the states ordered A, B; $T^{(s)}_{ij}$ is the probability of emitting symbol $s$ while transitioning from state $i$ to state $j$):

$$T^{(1)} = \begin{pmatrix} p & 0 \\ 1 & 0 \end{pmatrix}, \qquad T^{(0)} = \begin{pmatrix} 0 & 1-p \\ 0 & 0 \end{pmatrix}$$
And those are the matrices that define the ε-machine for the Golden Mean process. We could also look at some of the prescient rivals of the states of the ε-machine. For example, we could define a machine that works like this: State X, which emits a 1 with probability $p$ (moving to state Y) or a 0 with probability $1-p$ (moving to state Z); State Y, which likewise emits a 1 with probability $p$ (moving back to X) or a 0 with probability $1-p$ (moving to Z); and State Z, which always emits a 1 and moves to Y.

This machine also predicts the future observations of the Golden Mean process, and it does so exactly as accurately as the ε-machine, but it has one extra state. (In fact, by using the states X and Y, this machine is keeping track of the parity of the number of 1s since the last 0 in the sequence, not just what the last observation was, as the ε-machine does.) So it isn’t a minimal model of the process, despite being maximally predictive. Therefore Occam’s razor tells us that the ε-machine is, as is always the case, the better model for this process, because it is simpler.
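As a quick sanity check, here is a minimal simulation sketch of the Golden Mean process directly from its two-state description (assuming the common parameter choice $p = 0.5$), verifying the defining constraint that two 0s never occur in a row:

```python
import random

random.seed(0)
p = 0.5
state, out = "A", []
for _ in range(10_000):
    if state == "A":                  # free state: last symbol was a 1
        sym = 1 if random.random() < p else 0
    else:                             # state B: last symbol was a 0,
        sym = 1                       # so a 1 is forced
    out.append(sym)
    state = "A" if sym == 1 else "B"

s = "".join(map(str, out))
print("00" in s)  # False: no two consecutive 0s ever occur
```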
The stationary distribution over causal states, denoted $\pi$, is the long-run proportion of time the process spends in each causal state. It satisfies $\pi = \pi T$, where $T = \sum_s T^{(s)}$ is the state-to-state transition matrix obtained by summing over the emitted symbols.
The statistical complexity of a partition, denoted $C_\mu$, is the Shannon entropy of its stationary distribution: $C_\mu = H[\pi] = -\sum_i \pi_i \log_2 \pi_i$.
For the ε-machine of the Golden Mean with $p = 1/2$, the stationary distribution is $\pi = (2/3, 1/3)$ over the states (A, B), giving $C_\mu \approx 0.918$ bits.

For the 3-state partition of the Golden Mean with $p = 1/2$, the stationary distribution is $(2/9, 4/9, 1/3)$ over (X, Y, Z), giving an entropy of about $1.53$ bits: the rival stores more information about the past without predicting the future any better.
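These quantities are easy to check numerically. A minimal sketch, assuming $p = 0.5$, under which the ε-machine's stationary distribution is (2/3, 1/3) over (A, B) and the three-state rival's is (2/9, 4/9, 1/3) over (X, Y, Z):

```python
import math

def entropy_bits(dist):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(q * math.log2(q) for q in dist if q > 0)

C_mu_epsilon = entropy_bits([2/3, 1/3])        # ε-machine states A, B
C_mu_rival   = entropy_bits([2/9, 4/9, 1/3])   # rival states X, Y, Z

print(round(C_mu_epsilon, 3), round(C_mu_rival, 3))  # ≈ 0.918 and 1.53 bits
```

The rival's entropy exceeds the statistical complexity, reflecting that it tracks a distinction (the parity of 1s) that makes no difference to the future.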
It has been proven that the causal states, i.e. the partition with the fewest number of sets (or, in cases with infinite sets, the partition in which there is a bijection between sets and future distributions), are always unique and have the minimum statistical complexity $C_\mu$ among all maximally predictive partitions.
The entropy rate of a process, denoted $h_\mu$. It is the average entropy per symbol given all of the past history (i.e. how much randomness remains on average even when you know the causal state). It is defined as

$$h_\mu = \lim_{L \to \infty} \frac{H[X_1, \dots, X_L]}{L} = H[X_0 \mid \overleftarrow{X}].$$
The excess entropy of a process, denoted $E$. It is the total information (mutual information) the past observations provide about the future observations, defined as

$$E = I[\overleftarrow{X}; \overrightarrow{X}],$$

or equivalently as the limit $E = \lim_{L \to \infty} \left( H[X_1, \dots, X_L] - L\, h_\mu \right)$.
The crypticity of a process, denoted $\chi$. It is the difference $\chi = C_\mu - E$: the portion of the information stored in the causal states that is never revealed in the future observations.
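The entropy rate can be estimated directly from data via block entropies, $h_\mu \approx H(L) - H(L-1)$. A minimal sketch using a simulated Golden Mean sequence (assuming $p = 0.5$, for which the exact entropy rate is $2/3$ bits per symbol):

```python
import math, random
from collections import Counter

def block_entropy(s, L):
    """Shannon entropy (bits) of the empirical distribution of length-L blocks."""
    counts = Counter(s[i:i + L] for i in range(len(s) - L + 1))
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Simulate the Golden Mean process (a 0 is always followed by a 1).
random.seed(1)
state, bits = "A", []
for _ in range(200_000):
    sym = 1 if state == "B" or random.random() < 0.5 else 0
    bits.append(sym)
    state = "A" if sym == 1 else "B"
s = "".join(map(str, bits))

h_mu = block_entropy(s, 8) - block_entropy(s, 7)
print(round(h_mu, 2))  # ≈ 0.67 bits/symbol (exact value is 2/3)
```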
Earlier we saw how the Golden Mean process can be generated by two different Markov models, one of which is its ε-machine. But in both cases the Markov models were unifilar, meaning that if you are in a state and emit a symbol, there is never ambiguity about which state you transition to. That isn't always the case, as can be seen in this Markov model:
Here, if you are in state A and emit a 1, you don't know whether you transitioned to B or back to A, so any past that ends in a 1 implies uncertainty about the current state (this is why such models are called hidden Markov models, or HMMs). That means this Markov model isn't the ε-machine of the process it generates. This is the ε-machine:
It requires 3 states instead of 2. The causal state X corresponds to the HMM state A, Y corresponds to B, and Z corresponds to the state of uncertainty between A and B. As can be seen, the ε-machine is unifilar: all arrows coming out of any causal state have different emissions. To make the partition of past histories explicit: any history that ends in a 0 followed by an even number of 2s corresponds to causal state X, any that ends in a 0 then an odd number of 2s is in Y, and any that ends in a 1 followed by any number of 2s is in Z, as shown in the image:
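The suffix rules above can be written as a small classifier. This is only an illustrative sketch of the partition (histories are strings over the alphabet {0, 1, 2}):

```python
import re

def causal_state(history):
    """Classify a past by its suffix: last 0 or 1, followed by trailing 2s."""
    m = re.search(r"([01])(2*)$", history)
    if m is None:
        return None                    # no 0 or 1 observed yet
    last, twos = m.group(1), m.group(2)
    if last == "1":
        return "Z"                     # a 1 then any number of 2s
    return "X" if len(twos) % 2 == 0 else "Y"   # a 0 then even/odd 2s

print(causal_state("01022"), causal_state("0102"), causal_state("0122"))
# X Y Z
```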
Something interesting about the different Markov models is that knowing whether you are in state A or B can give you more information about the future than knowing whether you are in X, Y or Z, but you can't access that information just by looking at the past symbols.
For this process, the aforementioned quantities are:
This is the main process analyzed in the previously mentioned paper by the researchers from PIBBSS and Simplex. It is generated by the following HMM:
This HMM also isn't unifilar, and in this case you can never know exactly which of its states you are in just by looking at the emissions, which implies that the ε-machine of Mess3 isn’t the same as the HMM. So then, what are the causal states of this process? Is there a more intuitive way for us to understand and refer to them, other than just as a partition of histories? The answer is yes: each causal state is a probability distribution over the states of the HMM. Such probability distributions are called "mixed states" or "belief states". In the appendix there is a proof that the belief states are the causal states of Mess3.
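Belief states are computed by a Bayesian filtering update over the HMM's hidden states: after each observed symbol, the new belief is proportional to the old belief multiplied by that symbol's transition matrix. Here is a minimal sketch; the transition matrices below are made-up stand-ins for illustration, NOT the actual Mess3 parameters:

```python
# T[s][i][j] = P(emit symbol s and move to hidden state j | hidden state i).
# Illustrative values only; each row sums to 1 across all symbols.
T = {
    0: [[0.70, 0.05, 0.05], [0.10, 0.10, 0.00], [0.05, 0.00, 0.05]],
    1: [[0.10, 0.05, 0.05], [0.05, 0.70, 0.05], [0.00, 0.05, 0.05]],
    2: [[0.00, 0.00, 0.00], [0.00, 0.00, 0.00], [0.10, 0.10, 0.60]],
}

def update(belief, symbol):
    """Bayes update of the belief state: new belief ∝ belief · T[symbol]."""
    m = T[symbol]
    unnorm = [sum(belief[i] * m[i][j] for i in range(3)) for j in range(3)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

belief = [1/3, 1/3, 1/3]      # uniform prior over hidden states
for s in [0, 0, 1]:           # an observed symbol sequence
    belief = update(belief, s)
print([round(b, 3) for b in belief])
```

Iterating this map over all possible observation sequences traces out the set of reachable belief states, which for Mess3 is the fractal discussed next.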
Now, how many causal states are there? As it turns out, there is an uncountably infinite amount of causal states. Since each is a probability distribution over the three states of the HMM, it can be represented as a point in a triangle (the 2-simplex):
As can be seen in the figure, the causal states are distributed in the triangle (2-simplex) forming a fractal pattern. Since there is an uncountable amount of (non measure zero) causal states, the statistical complexity is infinite[5], so in this case if you want maximal prescience, you need the whole history and there is no way to compress the information contained in it. But CM can still be useful for compressing if we are willing to sacrifice some predictive power. Instead of partitioning histories by their exact future distribution, we could have a partition in which each set has histories with "similar" future distributions. We can quantify their similarity by the maximum error $\delta$ allowed between the future distributions of any two histories in the same set.

So, as formalized in the appendix, if we stipulate a maximum allowed error $\delta$, we can cover the simplex with a finite number of sets, obtaining a finite-state machine that predicts the future to within error $\delta$.
The optimal approximations of the ε-machine are those that, given some maximum allowed error $\delta$, use the fewest possible states.
For the Mess3 process, the aforementioned quantities are:
Computational mechanics is a useful framework because it allows us to infer hidden underlying dynamics of stochastic processes which can be used to improve our predictions about future data generated by them, and it provides a way to quantify how much information about past data is required to make the best possible predictions. And ε-machines can be thought of as the objective "targets of learning", as they are the simplest best possible models of stochastic processes.
It also seems potentially useful for understanding the inner workings and behavior of agents, since to be effective optimizers most agents would need to model their environment based on the inputs they receive, so it would be interesting to investigate if such internal models resemble ε-machines.
So far we haven’t talked about how, given a particular process, to reconstruct the ε-machine (or an optimal approximate machine, given a limit on the allowed error) from observed data.
Let's consider the bidirectional history to be composed of the past, concatenated with the present symbol, concatenated with the future: $\overleftarrow{x}\, a\, \overrightarrow{x}$.

We want to prove that if two pasts are in the same causal state and emit the same symbol, then they must end up in the same causal state; that is, if $\epsilon(\overleftarrow{x}) = \epsilon(\overleftarrow{x}')$, then $\epsilon(\overleftarrow{x} a) = \epsilon(\overleftarrow{x}' a)$ for any symbol $a$.

Assuming $\epsilon(\overleftarrow{x}) = \epsilon(\overleftarrow{x}')$, we have by definition

$$P(\overrightarrow{X} \mid \overleftarrow{x}) = P(\overrightarrow{X} \mid \overleftarrow{x}'),$$

where $\overrightarrow{X}$ ranges over all possible futures, including those that begin with the symbol $a$.

But the future distribution after observing $a$ is just a conditional of the original future distribution:

$$P(\overrightarrow{X} \mid \overleftarrow{x} a) = \frac{P(a \overrightarrow{X} \mid \overleftarrow{x})}{P(a \mid \overleftarrow{x})}.$$

Now, taking the future into account, we have the same identity for $\overleftarrow{x}'$, and using $P(a \overrightarrow{X} \mid \overleftarrow{x}) = P(a \overrightarrow{X} \mid \overleftarrow{x}')$ and $P(a \mid \overleftarrow{x}) = P(a \mid \overleftarrow{x}')$ (both of which follow from the assumption), we conclude that $P(\overrightarrow{X} \mid \overleftarrow{x} a) = P(\overrightarrow{X} \mid \overleftarrow{x}' a)$, i.e. $\epsilon(\overleftarrow{x} a) = \epsilon(\overleftarrow{x}' a)$. ∎
The states of the HMM are called $s_1, s_2, s_3$, and after observing a past $\overleftarrow{x}$ we can define the belief state $\eta(\overleftarrow{x})$, the row vector with components $\eta_i = P(S = s_i \mid \overleftarrow{x})$. What we are interested in is the future distribution given the past, $P(\overrightarrow{X} \mid \overleftarrow{x})$.

To answer that, it is easier to only look at the distributions over the next symbol. If we are in state $s_i$, the probability of emitting symbol $x$ is $\sum_j T^{(x)}_{ij}$. Arranging these in a vector, we have that the distribution over the next symbol for a past $\overleftarrow{x}$ is

$$P(x \mid \overleftarrow{x}) = \eta(\overleftarrow{x})\, T^{(x)}\, \mathbf{1},$$

where $\mathbf{1}$ is the all-ones column vector. Since the probability of the next symbol (and, iterating the same argument after updating the belief, of any finite future) is entirely determined by the belief state, two pasts with the same belief state imply the same future distribution. So the belief states are at least as predictive as the causal states.

Let’s assume there are two different mixed states $\eta \neq \eta'$ that nevertheless imply the same future distribution. For Mess3 one can check that distinct belief states always assign different probabilities to some future string, which is a contradiction. Since the belief states are therefore in bijection with the future distributions, they are exactly the causal states of Mess3.
One possible metric for the distance between causal states is the Total Variation distance, defined as:

$$d_{TV}(\eta, \eta') = \frac{1}{2} \sum_i |\eta_i - \eta'_i|.$$

Using this metric, we can say that a process has an arbitrarily well approximable ε-machine iff, for every $\delta > 0$, its set of causal states can be covered by a finite number of sets of diameter at most $\delta$ (i.e. the set of causal states is totally bounded).
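A minimal sketch of this distance for belief states represented as probability vectors:

```python
def tv_distance(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Distance between the uniform belief and full certainty in one state.
print(tv_distance([1/3, 1/3, 1/3], [1.0, 0.0, 0.0]))  # = 2/3
```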
This is the case for Mess3 since the causal states live in a finite region of a 2-dimensional space.
In essence, finding this model is a form of statistical inference with a very general hypothesis class: The set of unifilar hidden Markov models (including those with uncountably many states). By doing inference, you are trying to find the model that best fits the data. And by choosing a good prior, that prioritizes simpler models, you should converge to the minimal model in terms of its number of states and statistical complexity (which corresponds to the amount of information stored by the process for future data generation).
LLMs can be viewed as approximations of ε-machines for the process of natural language generation, in which the approximate causal states are the context vectors, i.e. the activations of the final layer of the transformer, just before the "Unembedding" step.
We are considering only countable sets to not have to talk about probability densities and measure theory.
The “upcoming storm" and "raining" states aren't actually causal states of the “weather process”, because they aren't maximally predictive as they don't contain all of the relevant information we could obtain from the past observations.
Actually, by the previously stated definition, the statistical complexity of Mess3 is undefined. So instead we can define it as the limit of the statistical complexities of the sets of states of the optimal approximations of the ε-machine, as $\delta \to 0$.
This metric is not the same as the Euclidean distance in the simplex. In this particular case, using this metric, the points at some constant distance from a central point form a regular hexagon, not a circle.
Published on February 24, 2026 5:34 AM GMT
I’m on vacation, so this week’s newsletter is a bit lighter than usual. I wish I could say that the torrent of AI news was also lighter, but… yeah, not so much.
Our focus this week is on politics and strategy. We have two pieces on populist anger about AI, a report by Dean Ball on the Global South’s (lack of) readiness for AGI, and a couple of semi-technical pieces on using AI to help us navigate the transition to superintelligence. And yeah, we’ll talk about the photo op debacle in India.
Jasmine Sun spent a week in DC and considers the role of populism in AI politics:
And my reductive two-line summary is as follows: All the money is on one side and all the people are on the other. We aren’t ready for how much people hate AI.
It’s a great piece that calls attention to something that’s likely to be a major factor in AI governance over the next year or two. Be sure to check out her recommended reading at the end.
Anthropic just released Sonnet 4.6, a substantial improvement over Sonnet 4.5. Early indications are that it’s very capable and for many tasks can replace Opus at lower cost.
ByteDance’s Seedance 2.0 AI video generator just dropped and it’s really good. Perhaps you’ve seen the flood of videos on social media.
M.G. Siegler contemplates the legal and business implications for Hollywood, ending with a great quote from Rhett Reese ($):
In next to no time, one person is going to be able to sit at a computer and create a movie indistinguishable from what Hollywood now releases. True, if that person is no good, it will suck. But if that person possesses Christopher Nolan’s talent and taste (and someone like that will rapidly come along), it will be tremendous.
Gemini 3.1 Pro is here, with significant benchmark improvements.
Ethan Mollick presents the eighth version of his guide to choosing the right AI. If you’re already a power user you won’t find much new here, but it’s a great guide for anyone who wants to get started with agentic AI.
I enjoyed this Nature article about evaluating moral competence in large language models, although I’m not sure I fully agree with their distinction between ”mere moral performance” (the ability to make good moral decisions) and “moral competence” (making good moral decisions based on morally relevant considerations).
They also place a high priority on “moral pluralism”, which sounds great on paper but has important limitations in practice. Moral agents have to actually make decisions, not simply observe that different value systems would dictate different choices.
Politico reports on how much the average American hates AI and speculates about how the politics of that will settle out. As far as anyone can tell, the field is still wide open: Republicans and Democrats are both all over the map, and it’s anyone’s guess where the battle lines will ultimately be drawn.
I’m gonna make three bold predictions here:
ControlAI has been running a carefully planned campaign to build awareness of AI existential risk among UK lawmakers. I’m impressed by the amount of thought they’ve put into what they’re trying to achieve and how best to go about it. I’m skeptical about their ultimate success once they transition from trying to raise awareness to trying to get useful, coordinated action from a broad coalition of countries and companies, but they are executing well on this part.
In the UK, in little more than a year, we have briefed +150 lawmakers, and so far, 112 have supported our campaign about binding regulation, extinction risks and superintelligence.
Dean Ball went to India for the AI Impact Summit, worried about whether India and the Global South are ready for advanced AI.
I regret to inform you that I came away even more worried than I went in. […]
The perils and hopes that we discuss here in this newsletter—the ones that come from transformative AI, powerful AI, AGI, superintelligence, or whatever other moniker you wish—were not really on display at the Summit, not so much because of any failing of the Indians but because these topics are not part of polite global conversation. This is a domestic failing, too: as I have frequently pointed out, the implications of powerful AI are only kind of a part of the conversation in America.
Miles Brundage argues that we’ve missed the best window for AI governance and need to make the best of a bad situation:
We are running well behind on that goal, after losing a lot of valuable time in 2025. So we have a lot of work to do, but we also need to focus, and recognize that we aren’t going to totally nail this AI policy thing. At best, we’ll 80/20 it — mitigating 80% of the risks with 20% of the effort that we would have applied in a world with slower AI progress and an earlier start on serious governance.
Ryan Greenblatt and Julian Stastny explore a “deference” strategy:
Broadly speaking, when I say “deferring to AIs” I mean having these AIs do virtually all of the work to develop more capable and aligned successor AIs, managing exogenous risks, and making strategic decisions.
They discuss in detail what that strategy would look like, how stable it might be, and how much of a ”deference tax” one might pay for pursuing deference as opposed to full-speed capability development.
At a photoshoot during the recent India AI Impact Summit, a group of leaders posed on stage holding hands. Except for Dario Amodei (Anthropic) and Sam Altman (OpenAI), who awkwardly refused to hold hands with each other. Hilarious, but it also doesn’t bode well for any kind of meaningful cooperation between the two companies.
80,000 Hours’ Rob Wiblin interviews Ajeya Cotra about timelines, early warning systems, effective altruism, and especially the idea of using transformative AI to help solve the risks of transformative AI. I greatly appreciate that they provide a video, a transcript, and a detailed summary of what was covered—that’s super helpful for people who want the content but don’t have time to watch the full interview.
The Foundation Layer calls itself “a philanthropic strategy for the AGI transition”, which probably doesn’t sound relevant to you.
But it turns out to be a really well-written, thoughtful guide to what's currently going on with AI and what key issues we need to navigate in the next few years. I think this is my new go-to piece for people who want to understand the situation and are willing to read a long-form piece. If you're not interested in the philanthropy part, you can just read from the Overview through section III.
Nick Bostrom’s latest paper is very strange. It’s meticulously produced and carefully argued, but starts from a strange premise that even he doesn’t actually endorse. Briefly, the paper argues that if your only concern is the well-being of people who are presently alive, it makes sense to move forward quickly with superintelligent AI development even if that is likely to cause the extinction of humanity.
Chris Lattner (a giant in the compiler world) takes a close look at the C compiler that was recently built by a swarm of Claude agents:
My basic take is simple: this is real progress, a milestone for the industry. We’re not in the end of times, but this also isn’t just hype, so take a deep breath, everyone. […] AI has moved beyond writing small snippets of code and is beginning to participate in engineering large systems.
Worth bookmarking: Simon Willison has started collecting best practices for agentic coding.
Epoch reports that based on current revenue trends, Anthropic’s revenue might surpass OpenAI’s by mid-2026.
This makes way more sense: The Information reports that OpenAI’s first dedicated AI device will be a smart speaker with a built-in camera, arriving in 2027 or later.
Dwarkesh recently interviewed Elon Musk. There are interesting moments, but overall it wasn’t Dwarkesh’s finest work. For most people, I recommend skipping the interview and maybe reading Zvi’s analysis:
Elon Musk also has a lot of what seem to be sincerely held beliefs, both normative and positive, and both political and apolitical, that I feel are very wrong. In some cases they’re just kind of nuts.
Nathan Lambert reviews the current state of open models (partly $). My best guess is that open models never matter very much, although I see two possible futures where they become very important:
Pliny the Liberator has a legendary skill for jailbreaking. Here, he reports on a new tool he's built for removing guardrails from open models.
Ran it on Qwen 2.5 and the resulting railless model was spitting out drug and weapon recipes instantly, no jailbreak needed! A few clicks plus a GPU and any model turns into Chappie. […]
AI policymakers need to be aware of the arcane art of Master Ablation and internalize the implications of this truth: every open-weight model release is also an uncensored model release.
There are no surprises here for anyone who’s been paying attention, but this is an elegant illustration of why open models are so potentially dangerous.
If you haven’t been keeping up on recent progress in robotics, state of the art robots are getting very impressive indeed. Make sure to scroll down and check out the comparison to last year’s show.
2026-02-24 12:39:19
Published on February 24, 2026 4:39 AM GMT
This post contains some rough reflections on the alignment community trying to make its ontology legible to the mainstream ML community, and the lessons we should take from that experience.
Historically, it was difficult for the alignment community to engage with the ML community because the alignment community was using a fundamentally different ontology—featuring concepts like inner vs outer alignment, mesa-optimizers, corrigibility, situational awareness, and so on. Even a concept as simple as "giving an AI an instruction in natural language" often threw a kind of type error in ML researchers' ontologies, in which goals were meant to be specified by setting agents' reward functions.
The concept of situational awareness is another one which doesn't really make sense in the classic ML ontology. My impression is that Ilya starting to take situational awareness seriously (after Ajeya gave a talk about it at OpenAI) was one of the main drivers of his transition to alignment research. Unfortunately, Ilya's subsequent research on weak-to-strong generalization stayed pretty stuck in the ML ontology, which in my opinion made it unpromising from the get-go. (I don't remember if I stated this publicly at the time, but I was pretty critical internally at OpenAI, especially to Collin Burns. In hindsight I wish I'd clearly stated publicly that I wasn't very excited about the research.)
These are two of many examples over the last few years of the alignment ontology winning out over the ML ontology by being better at describing LLMs. In response, the ML ontology has expanded to include concepts like "giving AIs instructions" and "situational awareness", but not in any principled way—it's sort of shoehorned them in without most people noticing the confusion. (E.g. if you ask why the AIs are following instructions, or how situational awareness might develop, I think most ML researchers would give you pretty confused answers.)
Historically, it was sometimes possible to make alignment concepts legible in the ML ontology before compelling empirical evidence arose, but it was typically a very laborious and unrewarding process. ML researchers would raise objections that felt extremely nitpicky from the alignment ontology. In part this was due to the difficulty of communicating across ontologies, but in part it was also due to motivated reasoning to find reasons to reject claims made by alignment proponents (e.g. I think this post from Chollet is a pretty good example). Even when ML researchers agreed that an alignment concept made sense in principle, it was usually hard for them to then propagate the consequences into the rest of their ontology—in part because doing so would have had big implications for their identity and career plans.
Meanwhile, the alignment community would waste time, and sometimes make itself more confused, by trying to adapt their concepts to make more sense to ML researchers. "Goal misgeneralization" is a good example of this, since the problem of inner misalignment is more that correct generalization isn't a well-defined concept, than that the agent will learn to "misgeneralize". MIRI's paper on Formalizing Convergent Instrumental Goals seems like it also wasn't very useful, especially compared with their other research (though unlike goal misgeneralization I doubt it made many people more confused). Owain Evans' "out-of-context reasoning" is a case that I'm less confident about, since it does seem like putting the idea in ML terms has helped him and others do interesting empirical research on it.
I did a lot of this myself too, to be clear. "Trying to make alignment concepts legible in the ML ontology" was in some sense the main goal of my alignment problem from a deep learning perspective paper, and I've updated significantly downwards on its value since starting to think in these terms. In hindsight, the main thing I would've told my past self (and the rest of the alignment community) is to pay less attention to the ML ontology. Unfortunately, my sense is that OpenPhil and various other groups (including my past self) pushed pretty hard for engagement with the ML ontology, which I count as a significant mistake.
There are still ways that engaging with the ML community would have been valuable—I think mainstream ML researchers are good at pushing alignment researchers to be more precise and more grounded in the existing literature. But broadly speaking it would've been better to have treated alignment ideas like butterfly ideas which would be harmed by premature exposure to ML thinking.
I suspect that many AI safety researchers will resonate with the broad outlines of what I've discussed above. Below is the part that I expect will be more controversial.
Unfortunately much of the alignment community today seems to be in an analogous position to the ML community during the 2010s. Concepts like scheming, alignment faking, alignment research, strategy research, P(doom), misuse vs misalignment, AGI timelines, and so on seem to me to be sufficiently vague and/or confused that it's hard to think clearly about AGI when they're important parts of your ontology.
This is a pretty broad claim, so let me be a little more specific. Suppose we very roughly divide the AI safety community into the parts that are more EA-affiliated (most lab safety teams, most orgs working out of Constellation, OpenPhil, etc) and the parts that are more Less Wrong affiliated (e.g. almost everyone on Habryka's list of individuals in this comment). I think my diagnosis above is partially true of LW safety, but strongly true of EA safety. The people who are generating novel and important AGI-related concepts are almost all pretty decoupled from EA safety, even though that's where most of the money and jobs are:
Related to the last point: I personally feel pretty decoupled from LW safety in part because LW safety is focusing a lot on AI governance these days, but has a very different ontology for thinking about politics than I do. In fact, I originally started writing this post as an analogy for how I relate to the LessWrong community with regards to politics. However, ontology gaps in alignment seem sufficiently important that I decided to make this post solely about them, and save the analogizing to politics for a separate post or shortform.
To sum up my main takeaways: while working in the dominant ontology may seem like the "safest" and most reputable bet, real progress comes from building out a different ontology to the point where it can replace the old one. Good luck to anyone who's trying to do that!
2026-02-24 09:07:50
Published on February 24, 2026 1:07 AM GMT
[Previously: "Views on when AGI comes and on strategy to reduce existential risk", "Do confident short timelines make sense?"]
[Whenever discussing when AGI will come, it bears repeating: If anyone builds AGI, everyone dies; no one knows when AGI will be made, whether soon or late; a bunch of people and orgs are trying to make it; and they should stop and be stopped.]
Many arguments about "when will AGI come" focus on reasons to think progress will continue quickly, such as:
These are all valid and truly worrying. But to say anything specific and confident about when (in clock time) AGI will come, we'd also have to know how fast progress is being made in an absolute sense; specifically, as measured by "how much of what you need to make AGI do you have".
There are various intuition pumps / analogies that people use to inform their sense of how far AI research has come. For example:
I believe these are poor intuition pumps for understanding when AGI comes because they do not evoke the sense that there is some unknown, probably-large blob of complexity that one has to possess in order to make AGI. They paper over differences in how the AI system does what it does.
Intuition pumps can only go so far. Each domain has its own central complexities, and there's no good reason that the world has to present a deeply correct analogy for the development of AGI; in fact, I'm not aware of one. That said, as long as we're doing intuition pumps, I want to propose another intuition pump for timelines for progress on a very complex task: synthetic life. We use the analogy:
human minds : AGI :: natural bacteria : synthetic bacteria
Specifically, we can treat the general task of making AGI (which, to be clear, is a ~maximally bad thing to do) as analogous to:
the task of producing, artificially from scratch, a bacterium-like object that has all the impressive capabilities of biotic bacteria, such as growing and dividing, self-repairing, avoiding toxins and predators, evolving novel complex characteristics to perform well in many different niches, competing against other life for space and resources, etc., but that is "very unlike" natural bacteria / is produced "very independently from" natural bacteria.