2025-12-05 23:53:41
Published on December 5, 2025 3:43 PM GMT
Disclaimer: these ideas are not new, just my own way of organizing and elevating what feels important to pay better attention to in the context of alignment work.
All uses of em dashes are my own! LLMs were occasionally used to improve word choice or clarify expression.
One of the reasons alignment is hard relates to the question of: who should AGI benefit? And also: who benefits (today) from AGI?
An idealist may answer “everyone” for both, but this is problematic, and also not what’s happening currently. If AGI is for everyone, that would include “evil” people who want to use AGI for “bad” things. If it’s “humanity,” whether through some universal utility function or averaged preferences, how does one arrive at those core metrics and account for diversity in culture, thought, morality, etc.? What is lost by achieving alignment around, or distillation to, that common value system?
Many human justice systems punish asocial behaviors like murder or theft, and while there’s general consensus that these things are harmful or undesirable, there are still some who believe it’s acceptable to steal in certain contexts (e.g., Robin Hood), or commit murder (as an act of revenge, in the form of a death penalty / deterrence mechanisms, etc.). Issues like abortion introduce categorical uncertainty, such as the precise moment at which life begins, or what degree of autonomy and individual freedoms should be granted at different stages of development, making terms like “murder” — and its underlying concepts of death and agency — murky.
Furthermore, values shift over time, and can experience full regime-changes across cultures or timelines. For example, for certain parts of history, population growth was seen as largely beneficial, driving labor capacity, economic productivity, innovation and creativity, security and survival, etc. Today, ecological capacity ceilings, resource limits, stress concentrations, and power-law-driven risks like disease or disaster, have increased antinatalism perspectives. If consensus is already seemingly impossible at a given moment, and values shift over time, then both rigidity and plasticity in AGI alignment seem dangerous: how would aligned AGI retain the flexibility to navigate long-horizon generational shifts? How do we ensure adaptability without values-drift in dangerous directions? It seems likely that any given universal utility function would still be constrained to context, or influenced by changes over large time scales; though perhaps this view is overly grounded in known human histories and behaviors, and the possibility still exists that something universal and persistent has yet to be fully conceptualized.
There’s also the central question of who shapes alignment, and how their biases or personal interests might inform decisions. Paraphrasing part of a framework posited by Evan Hubinger, you might have a superintelligent system aligned to traits based on human data/expectations, or to traits this superintelligent system identifies (possibly beyond human conception). If things like pluralism or consensus are deemed insurmountable in a human-informed alignment scenario and we let AGI choose, would its choice actually be a good one? How would we know, or place trust in it?
These philosophical problems have concrete, practical consequences. Currently, AGI is mostly being developed by human engineers and scientists within human social systems.[1] Whether or not transcendent superintelligence that drives its own development eventually emerges, for now, only a small subset of the human population is focused on this effort. These researchers share a fuzzy driving goal to build “intelligent” systems, and a lot of subgoals about how to do that and how to deploy them — often related to furthering ML research, or to work within other closer-in, computational sciences like physics, mathematics, biology, etc.
There are far fewer literature professors, historians, anthropologists, creatives, social workers, landscape design architects, restaurant workers, farmers, etc., who are intimately involved in creating AGI. This isn’t surprising or illogical, but if AI is likely to be useful to “everyone” in some way (à la radio, computers), then “everyone” probably needs to be involved. Of course, there are also entire populations around the world with limited access to the requisite technology (i.e., internet, computers), let alone AI/ML systems. This is a different but still important gap to address.
Many people are being excluded from AGI research (even if largely unintentionally). And eventually, intelligent systems will probably play a role in their lives. But what will AI/AGI systems look like by then? Will they be more plastic, or able to learn continuously? Hopefully! But maybe not. What are the opportunity costs of not intentionally including extreme levels of human diversity early on in a task as important as AGI, which might be responsible for the protection, enablement, and/or governance of humanity in meaningful ways in the future? We have seen this story before, for example, in the history of science and medicine, where women were long excluded from research about women’s health[2][3], and in pretty much any field where a population being analyzed is not involved in the analysis. It usually leads to getting things wrong, which isn’t just a truth-seeking issue — it can be extremely harmful to the people who are supposed to benefit.
What this means for alignment research:
The inner alignment problem asks: how do we ensure AGI actually pursues the objective we give it? Outer alignment tries to address: what are we optimizing for? But that question can still feel too narrowly examined in current alignment research. Who makes up “we”? What frameworks determine whose “wants” count? And how do we ensure those mechanisms are legitimate given irreducible disagreement about values?
A top-down approach would treat questions of pluralism and plasticity as prerequisites, resolving the second-order problem (alignment to what) before tackling first-order technical alignment (how to do it). But waiting for philosophical consensus before building AGI is both impractical and risky, as development continues regardless.
A parallel approach is necessary. Researchers working through mechanistic alignment can take immediate and intentional (bottom-up) action to diversify who builds and informs AGI and shapes its goals, while simultaneously pursuing top-down governance and philosophical work on pluralism and legitimacy mechanisms. Mechanistic work should of course be done in conversation with philosophical work, because path dependencies will lock in early choices. If AGI systems crystallize around the values and blind spots of their creators before incorporating global human diversity, they may lack the ability to later accommodate strongly divergent worldviews. Importantly, AGI researchers already agree that solving continuous learning will be a critical piece of this puzzle.
Without resolving second-order alignment issues, we risk building systems that are increasingly powerful, yet aligned to an unexamined, unrepresentative “we”. This could be as dangerous as unaligned AGI. One could even argue that AGI isn’t possible without solving these problems first — that true general intelligence requires the capacity to navigate human pluralism in all its complexity, as well as adapt to new contexts while maintaining beneficent alignment.
Even where AI itself is playing an increasingly direct role in research, it is not yet clear that these systems are meaningfully controlling or influencing progress.
2025-12-05 23:47:28
Published on December 5, 2025 3:47 PM GMT
Some key events described in the Atlantic article:
Kirchner, who’d moved to San Francisco from Seattle and co-founded Stop AI there last year, publicly expressed his own commitment to nonviolence many times, and friends and allies say they believed him. Yet they also say he could be hotheaded and dogmatic, that he seemed to be suffering under the strain of his belief that the creation of smarter-than-human AI was imminent and that it would almost certainly lead to the end of all human life. He often talked about the possibility that AI could kill his sister, and he seemed to be motivated by this fear.
“I did perceive an intensity,” Sorgen said. She sometimes talked with Kirchner about toning it down and taking a breath, for the good of Stop AI, which would need mass support. But she was empathetic, having had her own experience with protesting against nuclear proliferation as a young woman and sinking into a deep depression when she was met with indifference. “It’s very stressful to contemplate the end of our species—to realize that that is quite likely. That can be difficult emotionally.”
Whatever the exact reason or the precise triggering event, Kirchner appears to have recently lost faith in the strategy of nonviolence, at least briefly. This alleged moment of crisis led to his expulsion from Stop AI, to a series of 911 calls placed by his compatriots, and, apparently, to his disappearance. His friends say they have been looking for him every day, but nearly two weeks have gone by with no sign of him.
Although Kirchner’s true intentions are impossible to know at this point, and his story remains hazy, the rough outline has been enough to inspire worried conversation about the AI-safety movement as a whole. Experts disagree about the existential risk of AI, and some people think the idea of superintelligent AI destroying all human life is barely more than a fantasy, whereas to others it is practically inevitable. “He had the weight of the world on his shoulders,” Wynd Kaufmyn, one of Stop AI’s core organizers, told me of Kirchner. What might you do if you truly felt that way?
“I am no longer part of Stop AI,” Kirchner posted to X just before 4 a.m. Pacific time on Friday, November 21. Later that day, OpenAI put its San Francisco offices on lockdown, as reported by Wired, telling employees that it had received information indicating that Kirchner had “expressed interest in causing physical harm to OpenAI employees.”
The problem started the previous Sunday, according to both Kaufmyn and Matthew Hall, Stop AI’s recently elected leader, who goes by Yakko. At a planning meeting, Kirchner got into a disagreement with the others about the wording of some messaging for an upcoming demonstration—he was so upset, Kaufmyn and Hall told me, that the meeting totally devolved and Kirchner left, saying that he would proceed with his idea on his own. Later that evening, he allegedly confronted Yakko and demanded access to Stop AI funds. “I was concerned, given his demeanor, what he might use that money on,” Yakko told me. When he refused to give Kirchner the money, he said, Kirchner punched him several times in the head. Kaufmyn was not present during the alleged assault, but she went to the hospital with Yakko, who was examined for a concussion, according to both of them. (Yakko also shared his emergency-room-discharge form with me. I was unable to reach Kirchner for comment.)
On Monday morning, according to Yakko, Kirchner was apologetic but seemed conflicted. He expressed that he was exasperated by how slowly the movement was going and that he didn’t think nonviolence was working. “I believe his exact words were: ‘The nonviolence ship has sailed for me,’” Yakko said. Yakko and Kaufmyn told me that Stop AI members called the SFPD at this point to express some concern about what Kirchner might do but that nothing came of the call.
After that, for a few days, Stop AI dealt with the issue privately. Kirchner could no longer be part of Stop AI because of the alleged violent confrontation, but the situation appeared manageable. Members of the group became newly concerned when Kirchner didn’t show at a scheduled court hearing related to his February arrest for blocking doors at an OpenAI office. They went to Kirchner’s apartment in West Oakland and found it unlocked and empty, at which point they felt obligated to notify the police again and to also notify various AI companies that they didn’t know where Kirchner was and that there was some possibility that he could be dangerous.
Both Kaufmyn and Sorgen suspect that Kirchner is likely camping somewhere—he took his bicycle with him but left behind other belongings, including his laptop and phone.
...
The reaction from the broader AI-safety movement was fast and consistent. Many disavowed violence. One group, PauseAI, a much larger AI-safety activist group than Stop AI, specifically disavowed Kirchner. PauseAI is notably staid—it includes property damage in its definition of violence, for instance, and doesn’t allow volunteers to do anything illegal or disruptive, such as chain themselves to doors, barricade gates, and otherwise trespass or interfere with the operations of AI companies. “The kind of protests we do are people standing at the same place and maybe speaking a message,” the group’s CEO, Maxime Fournes, told me, “but not preventing people from going to work or blocking the streets.”
This is one of the reasons that Stop AI was founded in the first place. Kirchner and others, who’d met in the PauseAI Discord server, thought that that genteel approach was insufficient. Instead, Stop AI situated itself in a tradition of more confrontational protest, consulting Gene Sharp’s 1973 classic, The Methods of Nonviolent Action, which includes such tactics as sit-ins, “nonviolent obstruction,” and “seeking imprisonment.”
...
Yakko, who joined Stop AI earlier this year, was elected the group’s new leader on October 28. That he and others in Stop AI were not completely on board with the gloomy messaging that Kirchner favored was one of the causes of the falling out, he told me: “I think that made him feel betrayed and scared.”
Going forward, Yakko said, Stop AI will be focused on a more hopeful message and will try to emphasize that an alternate future is still possible “rather than just trying to scare people, even if the truth is scary.” One of his ideas is to help organize a global general strike (and to do so before AI takes a large enough share of human jobs that it’s too late for withholding labor to have any impact).
Read the rest of the article here. You can find my personal strong takes at the bottom.
Overall, I feel responsible for not having picked up on the possibility that Sam could act out to this extent. There were frictions in coordination, and considerations where I as an advisor and the organisers on the ground were pushing back, but I had not expected this.
2025-12-05 13:50:34
Published on December 5, 2025 2:51 AM GMT
New frontier models have developed the capability of eval-awareness, putting the utility of evals at risk. But what do people really mean when they say “a model is eval-aware?” In this post, I try to disentangle this statement and offer my views on the different concepts it entails.
When people say “model X is eval-aware”, I believe what they generally mean is “model X can tell when a task belongs to an evaluation, leading it to change its behaviour and propensities”.[1] Thus, this statement makes three claims about the model:
Redefining the terms.
Next, it is assumed that the eval-aware models are by definition not eval-invariant (claim 3). There is a good reason for this: the most eval-aware models (e.g., Claude Sonnet 4.5) are indeed not eval-invariant, i.e., their evaluation scores change when they are steered out of awareness.
The assumption that eval-aware models cannot be eval-invariant has been shaping current solutions to the root problem: we want what we learn from model evaluations to hold when the models are deployed and to generalise to real-world scenarios. Consider, for example, that our current solutions primarily target awareness: they try to approximate an eval-unaware model, either by steering or modifying its chain of thought, or by creating “more realistic evaluations”.
However, IMO it is possible for a model to be eval-aware AND eval-invariant, i.e., there can exist a model that doesn’t change its behaviour or propensities despite thinking it’s being evaluated.[3] The causal effect that awareness has on model behaviour can be cut.[4]
I believe factorising the current meaning of eval-awareness into its subcomponents is important for two reasons: 1) it makes communication clearer, and 2) eval-invariance as a distinct property opens new doors, as it can be used as a metric and/or an objective to work on and optimise. For instance, how can we train eval-aware models to be more eval-invariant?
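As a rough sketch of what that metric could look like (purely illustrative: `run_alignment_eval` and `steer_out_of_awareness` are hypothetical placeholders for whatever eval harness and awareness-removal steering method you actually use):

```python
# Minimal sketch of an eval-invariance score: the gap between a model's
# alignment-eval results with and without awareness-removal steering.
# Both helper functions are hypothetical stand-ins, not a real API.

def eval_invariance(model, eval_tasks, run_alignment_eval, steer_out_of_awareness):
    baseline = run_alignment_eval(model, eval_tasks)          # scores while (possibly) eval-aware
    steered_model = steer_out_of_awareness(model)             # approximate eval-unaware version
    steered = run_alignment_eval(steered_model, eval_tasks)   # scores with awareness steered out
    delta = abs(baseline - steered)                           # behavioural shift due to awareness
    return 1.0 - delta   # near 1.0 = eval-invariant; lower = behaviour changes with awareness
```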
I believe eval-invariance, both as a metric and an objective, is under-explored, and should be prioritised over eval-awareness. My reasons are:
This idea was inspired by discussions in the "How to Solve Alignment?" club led by Hannes Whittingham at Meridian, Cambridge. Thanks to Igor Ivanov for presenting about eval-awareness and the fellow visiting researchers for their thoughts and discussions.
This is not fully in line with the technical & rigorous definitions from papers. It's my subjective interpretation of what people mean when they throw the word around in conversations (and maybe blog posts to some degree).
This concept is not new and has been attended to by the community. The closest idea is "eval-gaming". My impression is that eval-gaming conveys an inevitable consequence of eval-awareness that we can't do much about. On the other hand, I mean eval-invariance to be a property of models that can be measured and worked on in isolation, independent of eval-awareness. A crude approximation for measuring eval-invariance would be the delta between steered and unsteered versions of Claude Sonnet 4.5 on the alignment evals.
A model that just doesn't care about being evaluated? That is one cool model!
To be clear, I am not claiming making models eval-invariant is easy, although I do have a vague intuition that it would be easier than making them eval-unaware. Moreover, I am intentionally not including possible methodologies to do so for two reasons: 1) I don't really have concrete, well-formed ideas that I am confident in, and 2) I want the attention and discussion to be about the underlying premise and argument for eval-invariance rather than specific ways to train such models.
2025-12-05 09:10:45
Published on December 5, 2025 1:10 AM GMT
Epistemic status: not an interpretability researcher, but I have followed the scene closely.
So, it makes sense to me that probes should outperform SAEs: probes are trained directly to maximize an interpretable metric, while SAEs are trained to minimize reconstruction loss and are only interpreted afterwards. But training SAEs is nice because it is an unsupervised problem, meaning that you don't need to create a labeled dataset to find a direction for each concept like you do with probes.
How can we get the best of both worlds? Well, just train SAEs on an objective which directly maximizes human interpretability of the features!
How do SAEs work? They are trained to reconstruct the activations of the LM under a sparsity constraint. Then, one hopes that using the right SAE architecture and a good enough sparsity ratio will give interpretable features. In practice, it does! Features are pretty good, but we want something better.
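For concreteness, a minimal PyTorch-style sketch of that standard setup (the dimensions and the sparsity coefficient are illustrative, not tied to any particular SAE paper):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Standard SAE: encode LM activations into a wide sparse code, then reconstruct them."""
    def __init__(self, d_model: int = 768, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(sae: SparseAutoencoder, activations: torch.Tensor, l1_coeff: float = 1e-3):
    reconstruction, features = sae(activations)
    recon_loss = (reconstruction - activations).pow(2).mean()   # reconstruction term
    sparsity_loss = features.abs().mean()                       # L1 sparsity term
    return recon_loss + l1_coeff * sparsity_loss
```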
How do probes work? They are trained to predict whether the token predicted by the LM is related to a chosen concept. If one wants the direction representing a concept A, one needs to construct and label a dataset where concept A is sometimes present and sometimes not.
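And a correspondingly minimal sketch of a linear probe, where the hand-labelled dataset of (activation, label) pairs for concept A is the supervised part you have to build yourself (the hidden size 768 is again a hypothetical example):

```python
import torch
import torch.nn as nn

# Linear probe: a single direction trained to predict whether concept A is present,
# from a hand-labelled dataset of (activation, label) pairs.
probe = nn.Linear(768, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def train_probe(activations: torch.Tensor, labels: torch.Tensor, steps: int = 1000):
    for _ in range(steps):
        optimizer.zero_grad()
        logits = probe(activations).squeeze(-1)
        loss = loss_fn(logits, labels.float())   # supervised: needs labels for concept A
        loss.backward()
        optimizer.step()
    return probe.weight.detach()                 # the learned concept direction
```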
Proposal. On top of the reconstruction and sparsity losses, train SAEs with a third loss given by doing RLAIF on how interpretable, simple, and causal features are. By "doing RLAIF", the procedure I have in mind is:
Prediction. RLAIF-SAEs should find directions which are interpretable like probes, but with the unsupervised strength of the SAE. I predict that asking for simple features should address problems like feature splitting, the need for meta-SAEs, and the lack of feature atomicity.
Problems. If this were implemented, the first problem I would anticipate is scaling. Computing the RLAIF part of the loss will be costly, as one needs a validation corpus and LMs as graders. I don't know how this process could be optimized well, but one solution could be to first train an SAE normally and then fine-tune it with RLAIF, to align its concepts with human concepts, just like we do with LMs.
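A very rough sketch of what that two-stage idea could look like, reusing the SparseAutoencoder sketch from above. Here `interpretability_reward` is a hypothetical stand-in for the whole LM-grading pipeline (sampling top-activating examples per feature on a validation corpus and asking an LM to score how interpretable, simple, and causal each feature looks); making that term usable for gradient-based training is exactly the hard part flagged in the footnote:

```python
# Stage 1: train the SAE normally with sae_loss (reconstruction + sparsity), as above.
# Stage 2 sketch: fine-tune it with an extra term rewarding human-interpretable features.
# `interpretability_reward` is a hypothetical placeholder for the RLAIF-style grader;
# backpropagating through an LM grader (or using a policy-gradient-style estimate
# instead) is the unsolved part of the proposal.

def rlaif_sae_loss(sae, activations, validation_corpus, interpretability_reward,
                   l1_coeff: float = 1e-3, rlaif_coeff: float = 0.1):
    reconstruction, features = sae(activations)
    recon_loss = (reconstruction - activations).pow(2).mean()   # keep reconstruction quality
    sparsity_loss = features.abs().mean()                       # keep sparsity
    # Higher reward = more interpretable features, so it enters with a minus sign.
    interp_reward = interpretability_reward(sae, validation_corpus)
    return recon_loss + l1_coeff * sparsity_loss - rlaif_coeff * interp_reward
```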
Working on this project: I am not a Mech-Interp researcher and don't have the skills to execute this project on my own. I would be happy to collaborate with people, so consider reaching out to me (Léo Dana, I will be at NeurIPS during the Mech Interp workshop). If you just want to try this out on your own, feel free, though I'd appreciate being notified about whether the project works or not.
This is the heart of the problem: how to backpropagate "human interpretability" to learn the directions. I sweep this under the carpet since I believe it to be solvable, given that RLHF and RLAIF exist for fine-tuning LMs. If doing so is not as easy as it seems to me, I will be happy to discuss it in the comment section and debate what the core difference is.
2025-12-05 07:18:34
Published on December 4, 2025 11:18 PM GMT
Epistemic status: A response to @johnswentworth's "Orienting Towards Wizard Power." This post is about the aesthetic of wizard power, NOT its (nearest) instantiation in the real world, so that fictional evidence is appropriately treated as direct evidence.
Wentworth distinguishes the power of kings from the power of wizards. The power of kings includes authority, influence, control over others, and wealth. The power of wizards includes knowing how to make things like vaccines, a cure for aging, or the plumbing for a house. Indeed, these categories are described primarily by example (and of course by evocative naming). Wentworth argues that king power is not as real as wizard power; since it is partially about seeming powerful, it can devolve into "leading the parade" exactly where it was going anyway. But king power certainly can be real. It is dangerous to anger an autocrat: bullets kill in near mode.[1] If not consensus vs. reality, what distinguishes these aesthetics? To answer this question (and particularly to pinpoint wizard power) we will need a richer model than the king-to-wizard axis.
I have written about intelligence as a form of privilege escalation. Privilege is something you have, like (access to) a bank account, a good reputation, (the key to) a mansion, the birthright to succeed the king, or (the pass-code to) a nuclear arsenal - all things that sound like king power. On this model, intelligence is something other than an existing privilege that lets you increase your privileges - specifically, an inherent quality. However, it is questionable whether intelligence is the only such quality (are charisma, strength, good reflexes, or precise aim all forms of intelligence?). I expect most readers to instinctively draw a boundary around the brain, but this is actually not very principled and still leaves (for example) charisma as primarily a form of intelligence. But a dashing trickster is a prototypical rogue, not a wizard (in Dungeons & Dragons parlance). The more expansively we choose to define intelligence, the less closely it seems to characterize the aesthetic of wizard power.
Trying to rescue this approach is instructive. Privilege-versus-escalation is essentially a gain-versus-become axis, in the sense that your privilege increases by gaining ownership of things while your escalation-strength increases by becoming a more powerful person. As I noted in the original post (by pointing out that privilege begets privilege) this distinction is already quite murky, before bringing in (king vs. wizard) aesthetics. Is knowledge something about you, or something you have? What if it is incompressible knowledge, like a password... to your bank account? There are reasonable solutions, such as defining intelligence (or rather, the escalation-strength) as one's adaptable/flexible/general/redirectable competence.[2] However, this approach (when taken to the extreme) rules out power that comes from specific knowledge about how the world works, which seems like a fairly central example of wizard power.
In fact, narrow-but-very-deep technical ability might reasonably be described as arcane. That is prototypical wizard power. I am going to refine this a bit more before calling it a definition, but it is a good enough working definition to identify wizards living in non-fantasy genres. In particular, heist movies.
This post was partially inspired by the excellently terrible "Army of the Dead." Some utter maniacs try to steal millions from a vault under zombie Las Vegas. Yes, I said "zombie Las Vegas." Also, it is going to be nuked in a couple of days.[3]
The team is mostly a bunch of hardcore bruisers and gunmen who have distinguished themselves in combat:
...but notice the guy in the back. Ludwig Dieter.
He has never fired a gun before in his life. He is there specifically to crack the safe.
To be clear, everyone has their own special competence. One guy likes to kill zombies with a saw. Another guy has a lot of practice popping them in the head. One guy is just really beefy.
But there is a difference: the safe guy is the only one who can possibly crack the safe. No one else has any idea how to get started. Every other team member can combine their efforts, and crack 0% of the safe. In other words, to them, cracking the safe may as well be magic.
I think this basically defines the wizard aesthetic. A wizard can do things which seem completely impossible to others. A wizard has a larger action (or option) space; he sees additional possibilities.
The fact that Dieter is a poor generalist is not required to make him the party's wizard - this is not part of the definition. In fact, the helicopter pilot is also a decent fighter (despite arguably possessing a hint of wizard-nature) and Dieter himself improves at shooting over the course of the movie. Rather, it is a natural result of selection pressure (the party really needs the best possible safe cracker, which means they are willing to compromise on essentially every other competence).[4]
In the real world, I think that wizard-types tend to be slightly better generalists than the average person, because higher intelligence translates somewhat across skills - and the ability to master something very arcane is usually sufficient to master something more mundane given time to practice, which compensates for wizards concentrating their practice on a small number of special interests. (Yes, I think the wizard aesthetic is somewhat linked to autism).
However, hardcore wizards are typically inferior generalists when compared to equally elite individuals who are not wizards. General competence is importantly different from wizard power. Rather than seeing options that are invisible to others, it is the ability to make reliably good choices among the options that are visible. Notice that this has little to do with gaining-versus-becoming. So, this is a new axis. I will call it the warrior-to-wizard axis.
A warrior is consistent, sensible, robust, quick-thinking, self-reliant, and tough.[5] In the heist team, almost every member is a warrior, except for Dieter (at the beginning).
Obviously, it is possible for one person to exemplify the virtues of both a warrior and a wizard (for example, Gandalf). In fact, the wizard aesthetic seems to require some degree of usefulness. I will call wizard-like ability arcane, as opposed to the even-less-general "obscure." But it is also possible to hold both king power and wizard power; we are left with a somewhat complex menagerie of archetypes.
| | General | Arcane | Obscure |
|---|---|---|---|
| Granted | Prince | Priest | King |
| Intrinsic | Warrior | Wizard | Sage |
There is a lot to unpack here.
Princes tend to be the warrior type (for example, Aragorn). They have very high general competence, often explained by genetic/ancestral superiority. They also have access to (rightful) wealth and armies - king power. These are the heroes of the Iliad, the Arabian Nights, and Grimm's Fairy Tales.
The Priest category contains Warlocks and Necromancers (switching the power source from divine to demonic or undead).
In the real world, lawyers are warrior-priests, and judges are priests or even priest-kings.
I would argue that paladins exist at the intersection of prince, priest, warrior, and wizard (though I associate them mostly with the last three).
The Sage category contains mystics along with most ascetics and monks. In the real world, most academics (for example, pure mathematicians) act like sages. They have some specialized knowledge, but it is so obscure that they will never be invited to the heist team.
I am not sure that the King category really belongs in this column, but I can make a reasonable argument for it. Kings have knowledge highly specific to court politics, which is useful only because they have granted power. For instance, they know which minions they can tap for various services, and they are familiar with the rights granted to them by any legal codes.
Therefore, Wentworth's king-to-wizard axis is confusing because it is really a diagonal. It is made even more confusing because it seems to be easier for warriors to become kings. Making a fortune, ascending politically, and winning wars requires a long sequence of good choices across domains - though sometimes wizard-types can become rich by inventing things, or ascend by teaming up with a warrior.[6]
Yes, the autocrat may be compelled to kill some of his enemies in order to save face, but to pretend that he never kills at discretion seems obtuse. Did the Saudis really need to hack this guy up? Certainly it doesn't seem to have done them any strategic favors in hindsight.
For example, Legg-Hutter intelligence.
Spoilers - make that one day.
Also, characters on the good team seem to be empirically pretty well-balanced. In games, this makes sense for playability, but it does not seem required in stories, so it is interesting that it holds more strongly for protagonists than antagonists?
And usually rugged rather than elegant.
In the real world, most wealth is probably accumulated through specialization, but gaining/managing massive riches also requires general competence.
2025-12-05 05:58:31
Published on December 4, 2025 9:58 PM GMT
Epistemic status: exploratory, speculative.
Let’s say AIs are “misaligned” if they (1) act in a reasonably coherent, goal-directed manner across contexts and (2) behave egregiously in some contexts.[1] For example, if Claude X acts like an HHH assistant before we fully hand off AI safety research, but tries to take over as soon as it seems like handoff has occurred, Claude X is misaligned.
Let’s say AIs are "unknowingly misaligned” if they are less confident of their future egregious behavior and goals characterizing this behavior than human overseers are. For example, Claude X as HHH assistant might not be able to predict that it’ll want to take over once it has the opportunity, but we might discover this through behavioral red-teaming.
I claim that:
In this post, I’ll argue for these claims and briefly estimate the probability that some near-future AIs will be unknowingly misaligned. (Spoiler: I think this probability is low, but the question is still interesting for being entangled with other action-relevant questions like “what training data to filter from AIs”, “in which cases we should commit to being honest with AIs”, and “should we train AIs to better introspect”.)
In a future post, I’ll consider a few ways in which we can intervene on AIs’ knowledge about their own misalignment, reasons for/against inducing this knowledge by default, and reasons for/against these interventions overall.
Is it even coherent to think that AIs might be uncertain or mistaken about their alignment?
Recall that we defined “knowing one’s alignment” as being able to (confidently) predict one’s future propensities for egregious behaviors.[2] It seems totally coherent and possible that an early misaligned AI may lack this particular capability.
Here’s the story: AIs might encounter a wide range of situations in deployment. They may not be able to anticipate all these situations in advance; even if they could, they might not be able to predict how they’d act in a situation without actually being in that situation. Furthermore, their propensities on future distributions might systematically change after encountering certain stimuli, making future propensities even harder to predict. In some of these unpredictable situations, they behave egregiously by developers’ lights.
For example:
To sum up: an AI might not know it's misaligned because it might just not be able to predict that there is some set of stimuli that it's likely to be subjected to in the future which would cause it to act badly.[3] It may also find it hard to predict what goal it’ll pursue thereafter.
I’ve argued that unknowingly misaligned AIs are in principle possible. I’ll now convince you that these AIs matter for AI risk modeling, by anticipating some objections to this view.
Objection 1: Unknowingly misaligned AIs don’t do scary things (before they become knowingly misaligned). So, they’re the wrong type of AIs to worry about.
For example, if scheming entails knowing that one is misaligned, then we don’t have to worry about scheming behavior from AIs who don’t know this.
I think this is wrong. In particular:
Unknowingly misaligned AIs might also behave badly without scheming.
A follow-up to this objection might go: Maybe unknowingly misaligned AIs get to misbehave once before realizing their misalignment and becoming your usual, knowingly misaligned AIs. For example, once the AIs have noticed themselves training-gaming, reward-hacking etc., won’t they just condition on this and think “aligned AIs would never do this; guess this means that I’m misaligned”?
I think this is plausible, but might still be wrong for some reasons. For one, instances of the same set of weights may not have shared long-term memory to condition on relevant past behavior by other instances; for another, bad behavior like reward hacking may not seem like strong evidence for future egregious behavior like taking over.
Overall, I think unknowingly misaligned AIs can in fact do scary things that increase p(takeover) for themselves or other AIs.
Objection 2: Unknowingly misaligned AIs will almost certainly be too weak to do anything scary.
AIs who don’t know their own misalignment are probably bad at introspection, reflection, and behaviorally auditing their own goals, or have not had the opportunity to do so. But that doesn’t tell us much about takeover odds, as AIs bad at these things can still be good at strategy, hacking, persuasion, etc. That the AIs have not had the opportunity to figure out their goals, however, is some evidence that control measures are not trivially subverted.[5]
So, I think it’s at least plausible that AIs uncertain of their own misalignment are still capable enough to pose or increase takeover risk.
Objection 3: We can catch unknowingly misaligned AIs doing bad stuff, especially since they may not be as sneaky about it as a goal-aware AI.
I agree that this is a reason to be less worried about e.g. reward hackers than schemers.
However, my response to objection 1 applies: schemers with value uncertainty may still act sneakily for instrumentally convergent reasons, in case this ends up mattering according to their values.
Also, unknowingly misaligned AIs might be harder to catch because we can’t exploit their self knowledge for monitoring:
So, I think unknowingly misaligned AIs might pose challenges for detecting misalignment, and this might make us especially unprepared to defend against their future egregious behaviors.
We might want to do something about unknowingly misaligned AIs if we ever catch one. But how likely is this by default? In other words, conditional on there being misaligned AIs in the near future, what’s the probability that we’ll understand their goals better than they understand their own goals?
I’ll model this event as the conjunction of the following conditions:
Whether these conditions hold depends on whether the AIs have the relevant capabilities and affordances.
I think low introspection is unlikely, but still plausible:
I think low reflection is somewhat more plausible:
I think differential AI goals research is the most plausible, at least for fairly weak AIs.
Overall, my very quick guesses suggest a prior of 25%*30%*60% = 4.5% of misaligned AIs being uncertain about their misalignment. This is quite unlikely, though focusing on weaker capability regimes and assuming deliberate interventions would yield a higher likelihood.
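Spelled out as a toy calculation (treating the three conditions as independent, which the multiplication implicitly assumes):

```python
# Back-of-the-envelope prior from above: P(unknowingly misaligned | misaligned)
# as a conjunction of three (assumed independent) conditions.
p_low_introspection = 0.25
p_low_reflection = 0.30
p_differential_goals_research = 0.60

prior = p_low_introspection * p_low_reflection * p_differential_goals_research
print(f"{prior:.1%}")  # 4.5%
```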
How should we update, if at all, based on current evidence? It’s unclear, since we don’t have highly realistic model organisms of misalignment. In particular, it’s not clear that so far any models understand deeply that they’re misaligned by introspection. (For example, while the reward hacker in Hubinger et al. (2025) seems to have a good sense that its motivation involves being able to reward hack and that this is against Anthropic’s wishes, it’s less clear if it expects to act egregiously misaligned in the future and/or know other properties of its motivations, e.g. if its goals involve the long-term future, are indexical, etc.)
What should we do about unknowingly misaligned AIs, if we happen to find ourselves in these ~5% of worlds? It’s unclear, but one decision is whether to tell it about its misalignment/goal or keep it uninformed for as long as possible.
There are many levers we can pull to inform the AI or not:
I think that it’s worth thinking about which of the above levers we should pull and why, because inducing knowledge about misalignment has costs and benefits that could really make a difference to our situation with the AIs.[7]
For example: convincing the AIs about their misalignment and misaligned goal could make them more dangerously coherent; making misalignment common knowledge between AIs and overseers could “burn bridges” by removing incentive for AIs to alignment fake, leading them to do less useful work for us than otherwise.
Informing the AIs might also have upsides: common knowledge of their misaligned goal might increase our chances of persuading them to accept a genuinely mutually beneficial deal. Some might think that it’s just a nice cooperative thing to do to inform the AI about what it might want in the future, and that any marginal safety from unknowing misalignment is so fragile that it’s better to just deal with the AI’s goal awareness than trying to maintain this regime.
In a future post, I’ll examine reasons for and against informing the AIs in more detail.
Thanks to Alek Westover, Alex Mallen, and Buck Shlegeris for comments.
By “an AI”, I mean a set of model weights plus any agent scaffolding. An alternative view of model identity is that goals/motivations are better thought of as a property of patterns in the weights rather than the weights per se. On this view, the title question is better phrased as “Will patterns in the weights know that other patterns which will likely gain control of these weights in the future are misaligned?” ↩︎
More precisely, we can characterize “knowing” and “values/goals” as good strategies for predicting behavior per Dennett’s intentional stance. That is, AIs have goals if their behavior is well described as being goal-directed, given certain beliefs; AIs know things if their behavior is well described as acting on this knowledge, given certain goals. ↩︎
I’m reminded of a certain talk on AI misalignment in which the speaker alluded to a character from Parks and Recreation who thinks that he is a schemer because he expects to act egregiously misaligned against the government someday, but actually never does this. This would be an example of an aligned agent who has value uncertainty. ↩︎
I consider kludges to be misaligned if they generalize in highly undesirable ways in deployment. ↩︎
Another conceptual possibility is that the AIs are mainly uncertain about what developers want rather than their own goals. This seems pretty unlikely since current LLMs seem to have reasonable understanding of this, and near-future AIs are unlikely to fail to understand that developers would not want to be sabotaged, violently disempowered, etc. ↩︎
That said, expert-level reflection probably isn't required for the AI to figure out that it is likely misaligned and has a certain long-term goal; it would just be pretty helpful. ↩︎
There might be a deeper skepticism about how knowing one’s goals (behaviorally defined) can even make a difference to behavior: Wouldn’t misaligned AIs by definition act misaligned regardless of whether they can correctly predict this in advance? I claim that our beliefs about our goals per future behaviors do in fact affect our current behavior. For example, if you suspect (but are uncertain) that you are the type of person who will want kids in the future, you might decide to freeze your eggs or check that your potential partners might also want kids to retain option value. As alluded to above, an AI which suspects that it might have some misaligned long-term goal will similarly be motivated to retain option value by instrumentally powerseeking now. But, knowing its misaligned goals, it may pursue this more aggressively. ↩︎