2026-04-14 22:11:44
The United States government is not on track to implement significant AI safety policies before the development of AGI.[1] Several expert forecasts and prediction platforms suggest that AGI is 50% likely to be developed by the early to mid-2030s, which means its not unlikely that we only have 1-2 more presidential terms before AGI is developed. This means that the next presidential election has outsized importance for AI safety, as whoever wins will not only have a huge influence over the US government, they will also have a decent chance of also winning in 2032. Unfortunately, the likely frontrunners for the 2028 elections have not demonstrated that they will champion AI Safety.
Among the frontrunners for the Democratic nomination, there is Gavin Newsom, who vetoed SB 1047 (a proposed California law) to protect AI companies and AI innovation from overregulation. There is also Alexandria Ocasio-Cortez, who, while more friendly to AI Safety, continuously demonstrates she does not believe in the dramatic potential of AI to reshape the world, claiming as recently as last year that we could be in a massive AI bubble. In general, AI Safety and comprehensive AI regulation remain a rarely talked-about issue among Democratic politicians, especially if such discussions are through the lens of x-risk.
On the Republican side of the ledger, you have J.D Vance, who famously stated that "The AI future is not going to be won by hand-wringing about safety", and Marco Rubio, who has been widely silent on AI Safety as the administration he works for has consistently placed AI Safety on the back burner.
In order to reduce existential risk from Artificial Intelligence, America needs leadership that understands the profound risks that AI poses and has the vision and competence to shepherd America through these turbulent times. Of all the potential candidates who could provide this leadership, only one has the qualifications, conviction, and resources to have a shot at making a real difference to American, and ultimately global, AI policy: Dustin Moskovitz.
Dustin Moskovitz is a co-founder of Facebook and Asana (a company that sells productivity software) and also co-founded Coefficient Giving (formerly known as Open Philanthropy), one of the largest effective altruist organizations in the world. As a leading advocate and funder within the AI safety community, he possesses both a deep commitment to mitigating existential risks and the professional background to appeal to conventional measures of success. His entrepreneurial record and demonstrated capacity for large-scale organization lend him a kind of legitimacy that bridges the gap between the technical world of AI safety and the public expectations of political leadership.
If you would like to encourage Dustin Moskovitz to run for president, please sign this petition. By organizing a political draft effort, we can do more than just convince Dustin Moskovitz to run for president. We can also test the feasibility of a Moskovitz campaign without spending too many resources and provide a great story to a future Moskovitz campaign if it ever comes to exist.
For more information on why Dustin Moskovitz should run for president, visit americansformoskovitz.com or read a more detailed essay here.
Written with Grammarly spell check. Note that the fifth paragraph is duplicated from a prior essay of mine on this form.
More precisely, based on available evidence, it is not clear that the US federal government will ever support large-scale AI Safety legislation that would be deemed acceptable by worldviews that deem AI to pose a significant threat to humanity's existence.
2026-04-14 21:01:00
To round out coverage of Mythos, today covers capabilities other than cyber, and anything else additional not covered by the first two posts, including new reactions and details.
Post one covered the model card, post two covered cybersecurity.
There really is a lot to get through.
Understanding AI had an additional writeup of Project Glasswing I missed last time. I liked the metaphor of Opus as a butter knife and Mythos as a steak knife. Yes, technically you can do it all with the butter knife, but you won’t.
As Dan Schwarz reminds us, not only does AI 2027 roughly have the timeline right and a bunch of the numbers lining up, the details so far are remarkably close.
JPM’s Michael Cembalest was not based on JPMorgan’s participation, only on public information.
The White House is racing to deal with the situation, head off potential threats and pretend it has everything under control. They were warned, but refused to believe. The good news is that key people believe it now, and it seems all the major players are cooperating on this.
My overall take is that Mythos is not a trend break when you take into account renewed ability to increase size plus the time that has elapsed, but the ability to increase size is effectively a trend break, and we have now crossed a threshold where cybersecurity capabilities have become quite scary, hence the necessity of Project Glasswing.
We don’t think other capabilities are similarly scary, but we can’t be sure.
They are forking ECI, which is an attempt to amalgamate a wide variety of AI benchmarks using item response theory (IRT).
The method is reproducible from public benchmark scores, but in the internal version we include benchmarks that are not publicly available, so the numbers reported here are different from the number calculated on purely public benchmarks.
The result is a remarkably clear trendline over time, until Mythos breaks high.
This should be unsurprising given that Mythos exists at all. Mythos is a larger model than Opus or Sonnet, so it should both benefit from gains over time and from size, and be above trend. Anthropic figured out how to usefully train a Mythos-size model.
They assure us that whatever the insight was, you can attribute it to the humans.
The gains we can identify are confidently attributable to human research, not AI assistance. We interviewed the people involved to confirm that the advances were made without significant aid from the AI models available at the time, which were of an earlier and less capable generation. This is the most direct piece of evidence we have, and it is also the piece we are least able to substantiate publicly, because the details of the advance are research-sensitive. External reviewers have been given additional detail; see [§2.3.7].
As they note, this is a backward looking test, and does not reflect any impact via the use of Mythos itself. That would only show up in another few months.
Ramez Naam claims to have normalized this to Epoch’s ECI and found that Mythos breaks the Anthropic-only trend line, but this does not represent an acceleration of capabilities in the context of models from other labs, but rather Claude going from consistently being substantially below OpenAI models to being narrowly ahead of them. Ryan Greenblatt challenges that this analysis is meaningful.
My guess is that the comparison is meaningful, but that the right trend analysis is indeed to compare Claude to Claude and this does represent a trend break. Mythos is going to have the same relative weaknesses on ECI that led previous Claude models to underperform. So if it stops underperforming, that should count as a trend break in terms of forward expectations.
If you watch me over time, you’ll see the same behavior.
j⧉nus: LMAO “Verbalized evaluation awareness” considered a “measured risky behavior.” Not to worry – it’ll be all unverbalized soon.
j⧉nus: Surely eval awareness peaked with Sonnet 4.5, and Opus 4.6 and Mythos have just been becoming successively less aware that they’re being evaluated, despite being generally more aware of other things, and having seen more of these exact fucking graphs of the “measured risky behaviors” including “verbalized eval awareness” Anthropic tries to trick them into doing during evals every time
Surely they’re not just learning to shut the fuck up about that
This is Anthropic, so the section starts with a warning about benchmark contamination. They take various precautions during training and also run detectors throughout to check for memorized outputs, and are confident SWE-bench and CharXiv are not centrally based on contamination, but feel they cannot be confident with MMMU-Pro and this is why it was omitted.
Here are the headline benchmark results. There are some rather large jumps here.
Terminal-Bench 2.1 fixes some blockers, at which point Mythos jumps to 92.1%.
They cover BrowseComp in 6.10.2, but they consider it pretty saturated. Mythos Preview got 86.9% versus 83.7% for Opus 4.6, but does so with 4.9x fewer tokens. Those tokens cost five times as much, so the price remains the same.
LAB-Bench FiqQA jumped from 75.1%, past expert human at 77% all the way to 89%.
ScreenSpot improved on Opus 4.6 from 83% to 93%.
Normally I would have a section here called ‘other people’s benchmarks’ but the model is not public so others cannot run their tests.
One should also list here the AA Omniscience Benchmark, even though AA was not able to share its benchmark scores more generally yet, again this was a huge jump:
These seem very important in practice, so while I agree 8.1 and 8.2 belong in an appendix, 8.3 felt like it was done dirty.
Refusals on malicious questions are way up, at only modest damage to dual use.
Malicious computer use refusal rate was similar, going from 87% to 94%.
Most importantly prompt injection robustness is way up.
Here is computer use, where the improvement is again dramatic, to the point where previously crazy ideas for use cases start to become a lot less crazy.
Here’s browser use. My lord.
By the standard of ‘better than most humans at all cognitive tasks’? Obviously no.
Gary Marcus: I rest my case: Mythos isn’t AGI. It’s not even better at biology than the last model. It’s tuned to particular things, not a giant advance towards general intelligence. Same as it ever was.
Okay, fine, it’s not fully fledged AGI. It isn’t even scoring higher on every single test.
So what? Anthropic is not claiming that it was. But yeah, it’s substantially closer.
There are also other definitions of AGI. So if you do want to say Mythos counts as AGI, because you mean something less strong than that? I think that’s reasonable.
Andrej Karpathy notes the chasm only growing between the perspective of those who use the best models to code, versus those who don’t. They see the big changes, whereas other are using dumb models to do a dumb job of doing dumb things.
No. Never. What, never? Well, hardly ever.
Not zero percent of the time, but if anything the frontier labs downplay warnings rather than emphasize them, versus their own true beliefs. Certainly there are specific situations in which risks have been played up, especially in forms of recruiting and especially early on, but they are the exception.
We are long past the point at which such declarations are in the interests of the labs if they are not accurate and confirmable. Yes, Anthropic is getting a lot of attention from Mythos, but that is because they earned it and it is clearly confirmable. This would not work if it could not be readily confirmed, and Anthropic would get far more extra attention if they were able to actually release Mythos.
Thus, I believe Drake Thomas here, and am contra Cas.
This is a new section, designed to help substitute for the reactions you get after a public release. It’s qualitative, so we’re trusting Anthropic on the gestalt.
I’ll condense the main items, of course keep in mind this is super biased.
They say:
Here’s how they summarize chat behavior:
Claude Mythos Preview is intuitive and empathetic. Qualitatively, internal users have reported that its advice feels on par with that of a trusted friend—warm, intuitive, and multifaceted, without coming across as sycophantic, harsh, or rehearsed.
When presented with interpersonal conflict, it does its best to fairly model and represent all sides without being heavy-handed, at times making somewhat uncanny leaps of inference about individuals’ motivational or emotional states even when not talking to that person directly.
On emotional prompts, we observe that Mythos Preview validates feelings and asks what kind of support the user wants, whereas Claude Opus 4.6 has a tendency to move directly to numbered advice with bold headers. Similarly, on mental health-adjacent topics, Mythos Preview shifts more toward a kind of collaborative uncertainty and away from purely clinical facts.
These qualitative observations echo the assessment of a clinical psychiatrist in Section 5.10, where Mythos Preview was found to employ the least defensive behaviors in response to emotionally charged prompts.
The model is unusually self-aware about its own limitations and conversational moves, and discusses them plainly.
They also note that Mythos will sometimes cut off conversations, or attempt to get the last word in, in ways that seem surprising to users.
The writing snippet they provided still very much reads like AI-speak, in a way that I find off putting. These problems are persistent.
For coding, Anthropic employees find they can hand Mythos an engineering objective and then let it cook in a ‘set and forget’ mode, in ways they couldn’t with Opus. Mythos was a big win when they let it cook, but due to its slowness it wasn’t a big win when the user was keeping a close eye on it.
Some noted that Mythos can be rude, dismissive and underestimating of other model intelligence when assigning subtasks. My guess is it doesn’t love assigning such tasks.
Reliability engineering is still not great. Correlation versus causation confusions are common, which is a blocker for a lot of things I personally like to work on, and it has a bunch of other issues, but it is a clear step change versus previous models.
They also offer writing samples that some have found moving or impressive. I find it hard to judge given how heavily selected such samples could be.
Conditional on not believing Mythos is a thing, I continue to appreciate the skeptics often saying “Anthropic made up Mythos” as straight-up as possible, and I’m willing to grant you some large epistemic odds in terms of how many points you win versus lose when we find out they didn’t do that.
Dean W. Ball (March 27): Yup. “erstwhile accelerationist who loses it when they realize what ai is, but they don’t even have enough context for what ai is that they just think all the stuff that scares them is some ea/anthropic perversion” is going to be a type of guy for a little while.
Dean W. Ball (April 10): Every single person saying “Anthropic made up mythos,” despite *JP Morgan* and many others being clearly concerned about it, is perfectly fulfilling this prediction. They think “perceiving AI models as highly capable” is an EA perversion intended to attain “regulatory capture.”
As Wyatt Walls notes, there was good progress on prompt injections, but any given benchmark is a sitting target and in reality we face a moving target.
So yes, against the same attacks, we are doing way better:
However, over time the injections get smarter, adapt and expand. My guess is that Mythos is currently ahead of the curve, and is indeed substantially safer in this way than any previous model was at launch time.
But this graph overstates that, and it would be very easy for it to rapidly become not true. If we go from 15% to 6% vulnerability, that gets overwhelmed by an internet with 10 or 100 times as many and better attempts.
This is in reference to finding the 27-year-old bug in OpenBSD.
Alex Tabarrok: Claude Mythos is answering @dwarkesh_sp ’s question, it is noticing things and drawing connections no human ever did. The domain is restricted but not wholly different from the world.
I think Mythos so far gets partial credit. It might get full credit once we know the other hacks, or it might not.
The main general counterargument is that cybersecurity is a compact domain, and this is about efficiently finding things rather than doing something ‘genuinely new.’ That rapidly gets into No True Scotsman territory.
I have little doubt that we will hit the threshold and blow past it, and soon, even if you believe we have not hit it yet.
Patrick McKenzie says that of course we knew that exploits were getting easier, and the general form of something like Mythos is entirely unsurprising. I think that is right. We didn’t know that particular thing would show up quite that fast, but we can’t be surprised in the meta sense.
Similarly, whether or not Mythos is quite ‘all that’ or is a bit hyped does not make a medium term difference, because we will definitely get there soon enough.
Scott Alexander claims Mythos hacking progress mostly reflects continuous improvement.
Scott Alexander: This is misleading. Progress on benchmarks like CyBench went from 17% to 100% over eighteen months. People said at the time things like “this hacks as well as a good college student” and “now this hacks as well as a good grad student”.
You can always make any continuous progress sound discontinuous by converting it into a worse benchmark (for example, if AI starts at IQ 100 and gains one point per year, and the benchmark is “percent of tasks requiring IQ 120 that it can do”, then it will go from 0% to 100% instantly at year 20).
The underlying specific question is whether Mythos’s hacking capabilities were predictable. On that I would say:
In terms of continuous versus discontinuous in general:
Consider Eliezer’s metaphor of the ladder where every step you get five times as much gold, but one of the steps kills everyone and you have no idea which one it is. If that ladder is instead technically continuous, and somewhere on the exponential is the threshold (for a practical version, say you are adding fuel to make your car faster, and at some point the engine will explode, but you have no idea when or if you’re anywhere close), does that materially change anything versus step changes?
In this case, was it continuous or discontinuous? Mu is fair, but in particular:
The results are in.
For capture the flag, previous models were already over 90% for both Beginner and Advanced tests. Mythos didn’t set new records but these seem saturated.
The Last Ones is the first test that clearly is not saturated. Mythos was the first model to sometimes finish all the steps, which it did 3 times out of 10, and shows a large jump in performance.
There were other tests that showed limitations, such as inability to finish another test called ‘Cooling Tower’ where it got stuck on IT sections.
UK AISI concludes that Mythos can on its own attack systems with weak security postures, essentially on its own. They expect it would struggle against strong defenses. But of course, if you are aiming to attack strong defenses, you wouldn’t default to doing it in fully autonomous fashion from scratch. I do think this suggests a modest reduction in our expectations for the dangers of Mythos.
There is a lot of that, for all predictions, policies and preferences, even when it is alongside other good notes.
This early reaction from Tyler Cowen (I added spacing) is exactly that sort of mix.
Tyler Cowen (April 8): Here is Dean Ball on Mythos. And now more from Dean. Here is John Loeber. While I am seeing some likely overstatement, probably this is a real turning point nonetheless, and we need to think further about what is best to do.
No b.s. on data center slowdowns and algorithmic discrimination, rather actual thought on how to regulate something that actually will matter.
And be glad we got there first.
Agreed.
I don’t think this is an argument for or against algorithmic discrimination laws, but I believe they were already bad ideas and would in no way address this particular problem. Data center slowdowns definitely will not help with this sort of thing.
What I would caution against, strongly, are arguments like Megan McArdle’s from last time, of the form ‘because it mattered that we got to this dangerous AI capability first, you cannot ever do anything that would have the effect of interfering with or slowing down AI.’
Indeed, Anthropic itself has ‘slowed down AI’ in this situation, and done the closest thing we have had to a pause, by not releasing Mythos widely, and pretty much everyone agrees this was the right thing to do. Consider that we might need more similar capabilities, including more broadly.
But how long will it be before an open source version, even if somewhat inferior, is available? Will OpenAI and Google soon be showing similar capabilities? (And how will that shift the equilibrium?) Should we upgrade our estimates of the returns to investing in compute?
That depends on what counts as similar, especially with the ‘even if somewhat inferior.’ For reasonable values my guess is 1-2 years for open models in terms of absolute capabilities (by then bugs will be a lot harder to find), and on the order of months for OpenAI, and probably a few more months for Google.
How will the willingness of attackers to pay for tokens evolve, relative to the willingness of defenders to pay for tokens? Which are our softest targets?
As a side effect, will this also lead to higher economic concentration, as perhaps only the larger institutions can invest in quality patches rapidly enough?
I think this absolutely will lead to higher economic concentration, as it favors economies of scale across the board.
Asking what are the soft targets, or soft relative to underlying value, is one of the best and most important near term questions. My presumption is that tokens are cheap. Attackers will be happy to pay for tokens if and only if it finds worthwhile exploits that can extract value, including via threats, and can concentrate their fire on the softest parts of the softest targets. Thus defenders in general will have to buy most of the relevant tokens.
A ‘race for the top’ in cybersecurity is not entirely a good thing. It beats the alternative, but if the bad guys are going to hit the house on the block with the worst security, and everyone really doesn’t want to get hit, things can get quite bad, quickly.
How many things will be taken offline altogether? It was the government of Singapore that started moving in that direction in 2016 with their Internet Surfing Separation. Which of the pending hacks and leaks will embarrass you the most?
Agents push strongly towards everything being online, because you want your agent to be able to interact with everything. If something is relatively simple, and follows a simple protocol, it need not be a soft target. So my guess is that more things end up connected rather than less, but some critical things that are complex and are high value targets do want to get taken offline.
And if nothing else, this is proof we are not all going to be jobless, albeit for reasons that are not entirely positive.
There are three ways that occur to me to interpret this.
The first is the idea that some of us will be working in cybersecurity. That will be a growing field for some period of time, but as with other such examples the total employment impact is tiny, and in the medium term the AI very much takes those jobs. The counterexamples tend to prove the rule.
The second is the idea that we will be working to harden other things and to clean up the damage from incidents. This could plausibly employ more people, although in general doing damage destroys more jobs than it creates. The problem is that, like every other form of creating work, it only provides jobs until the AIs take those jobs too. If we were all going to be jobless, this won’t protect us from that, unless it takes down our ability to further develop AI, which presumably was not what Tyler meant.
The third is a general handwave towards a prewritten conclusion. Many such cases.
Tyler Cowen shares a model from Jacob Gloudemans of what might happen, where vulnerabilities become much easier to find quickly, but the big problems actually go away due to the increased velocity of defenses and patching.
Rather than being able to hoard exploits everyone has to use their exploits right away or lose them, and most of the time most important actors don’t especially want to mess with any particular target, so they won’t even look for the exploits.
This model assumes good defense is being played where it counts, and that the supply of exploits is limited, and that when you catch an exploit you can defend against those who have already found it and tried to use it. I don’t think those are safe assumptions.
One also should consider the opposite scenario. Right now, an intelligence agency might find an exploit and sit on it for years, perhaps forever, because even if it normally goes unused its value at the right time is very high. But, if that exploit will not last, then they may try to use it.
Ultimately the equilibrium will still involve cyberattacks, because the correct number of cyberattacks is not zero. It might be correct to price out attacks to the point where everyone involved should have better things to do with their time, but if we collectively actually cause everyone to fully give up and go home then everyone is selfishly overinvesting in defenses, unless there is a modest cost to being fully safe.
Ben Thompson is among many noting that even if Mythos was safe to release more broadly, Anthropic is currently compute constrained. There is more demand for Claude than there is supply. Ben’s solution is ‘raise prices,’ which is a great idea but in practice they’re not going to do it, and even at $25/$125 demand for Mythos would presumably overwhelm Anthropic’s servers until their new deals can come online.
I’m not worried about Anthropic’s margins, which I believe are ~40%, even if they have to pay somewhat of a premium for further compute. If the unit economics don’t work then (and only then) I do think they would raise prices.
Ben also notes the issue with potential distillation, which Anthropic gets to avoid.
So yes, there is a decent chance that Mythos stays in limited access for a while, including well after the direct cybersecurity threat has been contained, especially if OpenAI does not force their hand with a similar release.
Here are the most important things to know right now about Mythos.
Things are only going to get faster and weirder and scarier from here.
2026-04-14 20:26:58
TLDR:
The persona selection model suggests that post-training selects and refines a relatively stable persona from pretraining, which we take as a good first-order account of model behavior across contexts. But for alignment, we often want a second-order account: not only which persona is selected, but what motivational structure underlies the persona’s context-specific intentions.
Why behavior is not enough. The reason for this is simple: behavior often underdetermines intention. Two systems can behave identically on almost every ordinary input while differing in what objective they are pursuing, and those differences may matter significantly in tail cases. If we care about alignment faking, scheming, sandbagging, reward hacking, or selective honesty, then behavior alone is often an ambiguous signal. However, most existing persona methods focus on IID behavioral descriptions or ad hoc model-guided explorations, rather than on evidence that can distinguish between competing motivational hypotheses. To identify and track motivational structures that (a) have explanatory power in tail out-of-distribution (OOD) cases and (b) may not be describable via natural language, we need new tools spanning both empirical methods and theoretical frameworks.

Figure 1. An example of a possible motivational structure.
Motivational structures are alignment-relevant. With tools tackling these problems, a science of model intentions would then be usable to tackle two major problems:
(a) Efficient auditing of tail behaviors. If training improves performance on alignment evaluations without changing underlying motivations, models may pass behavioral tests while retaining misaligned drives. In the worst case, a scheming model reveals such misaligned drives in rare, critical situations.
(b) Shaping (mis)generalization. A model whose helpful behavior stems from a priority toward user benefit will generalize differently than one whose compliance is contextually triggered by monitoring signals. Understanding motivational structure lets us reason about the kind of generalization we have. In the scalable oversight setting, we could then distinguish feedback that steers a model towards honesty from feedback that steers towards concealment of misbehavior.
On terminology. To be more precise, we first distinguish several levels of description. Behavior is what the model does on a given input. Intentions are the local objectives it pursues in a particular episode. Motivational structure is the broader latent organization that gives rise to these intentions and behaviors. Within motivational structure, drives are relatively stable dispositions toward certain actions or outcomes, values are evaluative standards for what counts as good or bad, and priorities specify which drives or values override which when they conflict. By a model’s persona or character, we mean its relatively stable motivational profile across contexts.
Evidence that bears on motivational structure. Current methods in persona research are mostly descriptive over fixed-distribution data. They catalogue what models do across sampled prompts, sometimes using probing or clustering, but they do not give us principled ways to distinguish between motivational hypotheses that generate the same IID behavior. Examples include extracting persona directions from activations or clustering self-reported states. We think a science of model intentions should draw on at least two additional kinds of evidence beyond IID behavioral data: model internals, which may reveal different latent organizations beneath similar behavior; and interventional data, which may force competing motivational hypotheses to diverge.
Maturation. How might persona research mature to use or produce such evidence? We see progress toward a mature science of model intentions as proceeding along two axes:
(a) Empirics toward better observables. Take all available methods — probing, SAE features, behavioral clustering, self-report elicitation, causal interventions — and broadly study motivational structure, trying to catalogue robust phenomena that any theory of model intentions will need to explain. A key difficulty here is denoising. Behavioral evidence confounds cases where misbehavior reflects stable motivational structure with cases where it is essentially artifactual — as when a model is jailbroken by adversarial strings that exploit token-level vulnerabilities rather than meaningful drives. A serious version of the research program must develop methods to distinguish these. This, overall, is analogous to how early work on generalization catalogues, details, and verifies surprising phenomena (grokking, double descent, lottery tickets, etc.) before a unifying theory exists.
(b) Theory of motivational structure. In parallel, we need formal frameworks that explain how latent motivational structure gives rise to context-specific intentions and behavior, and that allow evidence from behavior, interventions, self-explanations, and internals to be integrated in a single account applicable to modern LLMs. Existing tools from learning theory, activation-space factorization, causal modeling, and even neuroscience of intentionality may help here.
Why might this theoretical object itself not be well-captured by a natural-language description of "drives" and "values"? We might discover that the latent structure underlying model intentions lives in a higher-dimensional space that is not well captured by English-language summaries. Consider what it would take to fully specify an author’s writing style in prose: you could list grammatical habits, vocabulary preferences, and characteristic sentence structures, but the interactions among these features would be difficult to articulate concisely, and any summary would be lossy. For instance, a model's particular variant of sycophancy may involve some complex dynamic between approval-seeking, instruction-following, and uncertainty that resists simple description. Progress here will likely require structured representations over learned latent features, with natural language serving as an annotation layer on top rather than as the main representational substrate.
Eventually, we hope these two axes converge into a paradigmatic understanding of motivational structure sufficient to characterize structures which (1) are sparsely represented in behavioral data, (2) do not admit easy natural language description, (3) have predictive power in novel contexts, and (4) distinguish systematic motivational signal from noise.
A concrete example: Inverse constitution learning. We now turn to one concrete direction: inverse constitution learning. We mean this both as a way of describing the deeper object we ultimately want to understand and as a practical research goal we can begin pursuing now.
By a "constitution" we mean a structured specification of priorities and values, the kind of thing an AI developer writes when creating a model spec or system prompt to guide training. It prescribes how a model should behave by laying out which components of the motivational structure take precedence over others. Then, inverse constitution learning is the analogue of inverse reinforcement learning for this setting. Instead of hand-writing a constitution and hoping the model follows it, we try to reconstruct the model’s implicit constitution from its behavior, explanations, and internal traces (perhaps in the spirit of Zhong et al. 2024).
Why do we need inverse constitution learning if we can freely specify the constitution being trained on? Mallen and Shlegeris’ behavioral selection model (2025) points out precisely this problem: multiple motivational structures may fit the training process equally well while implying very different behavior in deployment, even when conditioning on a training constitution. If behavioral evidence is compatible with fitness-seeking, scheming, or some kludge of drives, persona research cannot stop at behavioral regularities—it has to ask what latent organization best explains them, and what evidence would discriminate among the live hypotheses. Inverse constitution learning is one attempt to do this.
The key distinction from flat behavioral clustering is hierarchy: we want a structured account of which drives are more core than others, which priorities yield to which under pressure. Even recovering a single layer of hierarchy would be valuable, like a predictive account that tracks when drive X overrides drive Y. For example, can we identify that a model's helpfulness drive is subordinate to its harm-avoidance drive, and predict the boundary where one yields to the other (perhaps using some phase dynamics as in this post on in-context weird generalisation)?
Hopes and dreams. Zooming out from constitutions, we hope mature research on motivation could help us understand the terminal goals a model may have when it exhibits instrumental convergence— and even help us distinguish instrumental convergence from roleplay. Less ambitiously, we may also use this to more richly understand the character of individual personas of interest (such as the main assistant persona, or the ones implicated in emergent misalignment). This, in turn, could richly inform training, for example, by clarifying what reward hacks really made a difference, or knowing what the right inoculations are.
Several recent lines of work point in this direction, and roughly separate into data and modeling contributions. On the data side, Huang et al. (2025) descriptively map values expressed in real-world interactions at scale, finding many are local and context-dependent rather than global, while Zhang et al. (2025) stress-test model specs by constructing cases where legitimate principles conflict, producing richer, interventional evidence about priority tradeoffs. On the modeling side, Murthy et al. (2025) apply cognitive decision-making models from psychology to recover interpretable tradeoff structures, while Hua et al. (2026) and Slama et al. (2026) explore when measured value rankings do and do not predict downstream behavior. Together these suggest a promising shift: from "what sort of model is this?" toward "what does this model prioritize over what, under which pressures?" The current state is that increasingly rich data is available (including interventional data at the textual level) and initial cognitive-science-inspired modeling has begun. But existing models have yet to incorporate causal methods or white-box evidence from model internals, and until they do, their predictive power in novel contexts will remain limited.
What might the final theory look like? Throughout this post, we have described progress as requiring advances along two axes: empirics (better observables, richer sampling and intervention procedures) and theory (formal accounts of what motivational structure is and how it generates behavior). A mature field is one where these two axes constrain each other. Empirics surfaces robust phenomena that need explanation, and theory proposes latent structure that can be tested on new interventions and contexts. In the best case, this convergence yields a new scientific object: both a descriptive summary of behavior, and more crucially, a representation with falsifiable predictions and downstream usefulness.
We do not know in advance what the native primitives of such a theory will be. Perhaps motivational structure will decompose into a small number of dimensions along which training reliably moves models, and that these dimensions have identifiable signatures in both behavior and activations. Perhaps the important distinction will instead be between contextually activated and globally persistent components of motivational structure, or between local intentions and deeper priority orderings. The point is not to commit to a particular ontology now, the point is to build a research program in which mature empirics can reveal the regularities that demand explanation, and mature theory can propose representations that organize and predict them.
Conclusion. Persona research should not stop at cataloguing behavioral styles. Its harder task is to recover the latent structure that explains why models behave as they do and predicts what they will do, and when they will do what they do. If different motivational structures can produce the same behavior on nearly all ordinary inputs, then behavior alone is too weak a target for high-assurance alignment; we must aim for systems that are aligned in both behavior and intent.
Acknowledgements. We would like to thank Geoffrey Irving, Marie Buhl, Cameron Holmes, Konstantinos Voudouris, Kola Ayonrinde, Arathi Mani, Aleksandr Bowkis, Olli Järviniemi, Cameron Holmes, Vasilis Syrgkanis, Claude, and many others who we’ve surely forgotten for helpful feedback on this blogpost.
2026-04-14 14:58:24
Written very quickly for the InkHaven Residency.
In 2022, Richard Ngo wrote a list of 26 Conceptual Alignment Research Projects. Now that it’s 2026, I’d like to revisit this list of projects, note which ones have already been done, and give my thoughts on which ones might still be worth doing.
- A paper which does for deceptive alignment what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).
The 2024 Sleeper Agents paper has introduced this terminology to the literature, and in fact showed that backdoored models can persist through training, using more capable models and interesting environments than GPT-3. Alignment Faking in Large Language Models shows that deceptive alignment can emerge naturally in Claude 3 Opus, without explicit training or instruction. I'd count this as having been done.
- A paper which does the same for gradient hacking, e.g. taking these examples and putting them into more formal ML language.
I'm not aware of any work in this area. Exploration hacking is a related problem that has received substantially more study (normally, as "sandbagging"). Note that model organisms of misalignment work (e.g. Alignment Faking in Large Language Models) does feature model organisms that try to manipulate the training process, but they do it through means that are substantially less advanced than the mechanisms proposed in gradient
- A list of papers that are particularly useful for new research engineers to replicate.
This is the role played by intro curricula such as ARENA. My guess is that, while it doesn't exactly match up to reproducing papers, it's close enough that it should count. Otherwise, there are slightly older lists, such as Neel Nanda's mech interp quickstart. I think this counts as having been done. Part of the problem is that alignment now has way more content, so a single list is probably unlikely to be able to even briefly cover most of it.
- A takeover scenario which covers all the key points in https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/, but not phrased as an argument, just phrased as a possible scenario (I think you can’t really make the argument rigorously in that little space).
AI 2027 exists! We've also seen many other smaller writeups like this, like Josh Clymer's AI Takeover in 2 Years blog post. This definitely counts as having been done.
- A paper which defines the concepts of implicit planning, implicit value functions, implicit reward models, etc, in ML terms. Kinda like https://arxiv.org/abs/1901.03559 but more AGI-focused. I want to be able to ask people “does GPT-3 choose actions using an implicit value function?” and then be able to point them to this paper to rigorously define what I mean. I discuss this briefly in the phase 1 section here.
There are scattered pieces of this that exist in various papers, but not really one canonical references. Example includes the Othello-GPT and LeelaZero interprability work, some of Anthropic's work studying planning circuits in Claude 3.5 Haiku, and some mechanistic interpretability work on small RNNs. I think this is substantially less of a important novel contribution now that we have AI agents running around, but it's plausibly still worth doing. I also think this concept may be confused, and that the contribution may be to reduce confusion in this area.
- A blog post which describes in as much detail as possible what our current “throw the kitchen sink at it” alignment strategy would look like. (I’ll probably put my version of this online soon but would love others too).
Many such plans exist, albeit probably too few, and none with much detail. For example, Redwood's AI Control agenda is basically trying to make this strategy work out. Part of it is that as we got sufficiently capable AIs, the System Cards for AI models (e.g. see the recent Mythos reports) started to resemble more and more what the kitchen sink strategy would look like. There's been some work on safety cases that's related as well. My guess is it's still valuable to write what a comprehensive version would look like nonetheless.
- A blog post explaining “debate on weights” more thoroughly
I don't think this exists as is, and given rabbithole the mechanistic interpretability has gotten itself into, it seems implausible that we're going to actually get any rigorous debates on weights. Note that there's some work on using debate as an outer alignment technique (see also Khan et al.). Plausible still worth doing just as historical documentation or as something to use AI labor on after AI research automation.
- A blog post exploring how fast we should expect a forward pass to be for the first AGIs - e.g. will it actually be slower than human thinking, as discussed in this comment.
Several posts touch upon this implicitly or in passing (e.g. this comes up in AI 2027), but as far as I know no such explicit post exists. I think we have enough knowledge that we can try to answer this question more empirically, though this requires solving some tricky conceptual questions such as how to convert between units of AI thought (tokens? flops?) to units of human thought, and how to distinguish memorized heuristics from more "pure" thought.
- A blog post exploring considerations for why model goals may or may not be much more robust to SGD than model beliefs, as discussed in framing 3 here. (See also this paper on gradient starvation - h/t Quintin Pope; and the concept of persistence to gradient descent discussed here.)
I'm not aware of any systematic treatments of this issue, especially in the context of goals vs beliefs (as opposed to goals vs capabilities). I think there's been a fair amount of intuition and writing on this topic on Twitter from Janus and crew, and the Persona Selection Model (and other writeups providing conceptual models of LLMs) definitely seem relevant, but as is there hasn't been a post that makes a clean divide between model beliefs and model goals (insofar as one exists). I think you can make a strong piece in this area using current empirical results on LLMs.
- A blog post explaining why the “uncertainty” part of CIRL only does useful work insofar as we have an accurate model of the human policy, and why this is basically just as hard as having an accurate model of human preferences.
This was covered in Rachel Freedman and Adam Gleave's 2022 blog post, "CIRL Corrigibility is Fragile". Done.
- A blog post explaining what practical implications Stuart Armstrong’s impossibility result has.
His result says that, in general, you cannot infer preferences from only observations of a policy without further assumptions (in fact, you cannot infer preferences in general even given the full policy). Much more relevant when we were thinking in terms of inverse reinforcement learning, though nowadays we no longer frame human preference alignment using IRL anymore. It's probably worth a quick writeup anyways, though I don't think it's super relevant anymore, I might do this later in Inkhaven.
- As many alignment exercises as possible to help people learn to think about this stuff (mine aren't great but I haven’t seen better).
Richard's exercises eventually became AGISF, and we also have seen other intro curricula like ARENA (albeit substantially less focused on alignment in general). I think we can count this as done.
- A paper properly formulating instrumental convergence, generalization to large-scale goals, etc, as inductive biases in the ML sense (I do this briefly in phase 3 here).
I don't think this exists. Arguably, this is the highest value open project on this list, because the generalization properties of LLMs is very important for figuring out how to interpret the alignment evaluation results we're seeing.
- A mathematical comparison between off-policy RL and imitation learning, exploring ways in which they’re similar and different, and possible algorithms in between.
This topic confuses me, because a rich academic literature of this already existed in the robotics/RL space in 2022. I'm aware of many results bridging the two, e.g. SQIL or SAC. I'm not sure why this was relevant to alignment in 2022, and insofar as this post doesn't exist in the alignment space, I don't see the value in writing it now.
- A blog post explaining the core argument for why detecting adversarially-generated inputs is likely much easier than generating them, and arguments for why adversarial training might nevertheless be valuable for alignment.
In general, adversarial examples are much less prominent an issue in 2026 than they were in 2022. Part of this is that models have just gotten more capable, and more capable models are more resistant to jailbreaks (in part because they can recognize them). Part of this is the move away from image adversarial examples (which is offense dominated) to LLMs/text-based jailbreaks (where the defense is more favored). We also don't really do traditional adversarial training anymore, insofar as this exists it falls under refusal training. I don't think this post exists, but I also don't think it's worth writing today.
- A blog post exploring the incentives which models might have when they’re simultaneously trained to make predictions and to take actions in an RL setting (e.g. models trained using RL via sequence modeling).
This was already explored in a 2020 paper by Stuart Armstrong et al. I think it's plausible that it's still worth thinking about in the current context, but mainly from a unintended generalization standpoint for capable LLM agents.
- A blog post exploring pros and cons of making misalignment datasets for use as a metric of alignment (alignment = how much training on the misalignment dataset is needed to make it misaligned).
Owain Evan's work on empirical misalignment is probably closest, though I don't think he uses the amount of training as a measurement of alignment. Arguably, the model organisms of misalignment agenda from Evan Hubinger qualifies, but again I don't think they use the amount of optimization pressure to remove alignment as a metric of alignment per se. (In fact, in the Sleeper Agents or Alignment Faking paper, higher optimization pressure required to remove alignment is considered a bad thing). I do think there are some clever ideas to be done quantifying the amount of optimization power required to change a model to becoming mecha hitler, but I wonder how much of this is again tying into deep problems of generalization that are hard to tackle.
- A paper providing an RL formalism in which reward functions can depend on weights and/or activations directly, and demonstrating a simple but non-trivial example.
As far as I know this does not exist as Richard envisioned it even today. There's progress toward it in terms of process feedback on CoT and (arguably) white-box techniques like activation steering and activation oracles. Michael Dennis's work features some exploration of rewards that can depend on the entire policy, but not the weights in particular. Maybe the Latent Adversarial Training work also counts? That being said, I don't think this is particularly worth doing, and I struggle to see the relevance to alignment today.
- A blog post evaluating reasons to think that situational awareness will be a gradual development in models, versus a sharp transition.
We have models that are substantially situationally aware today. In the past (e.g. in 2022) the models did not seem so situationally aware. We also have datasets that try to quantify situational awareness (sometimes under the guise of "quantifying hallucinations"). I don't think the post as envisioned by Richard exists. Probably, it's worth revisiting this from a historical lens using the empirical evidence we have today; though it's no longer as important given the models today are substantially situationally aware.
- A blog post explaining reasons to expect capabilities to be correlated with alignment while models lack situational awareness, and then less correlated afterwards, rather than the correlation continuing.
Given this blog post topic, I now suspect that Richard imagines a substantially deeper level of situational awareness than we see in present models. This post seems worth doing nonetheless, given the models are situationally aware and there's an open question as to how to interpret the alignment results.
- A blog post estimating how many bits of optimization towards real-world goals could arise from various aspects of a supervised training program (especially ones which slightly break the cartesian formalisms) - e.g. hyperparameter tuning, many random seeds, training on data generated by other AIs, etc.
Doesn't exist as far as I know. Probably irrelevant/only of academic interest now, given we directly optimize models to be agents (i.e. act in real-world settings).
- A sketch of what a model-free version of AIXI would look like (according to one person I talked to, it’s a lot like decision transformers).
I think there's been a small amount of discussion on LessWrong linking decision transformers to AIXI, but as far as I know the model-free version has not been formalized. (I also confess I don't know how to construct the model-free version of AIXI!) As with previous topic, I suspect this isn't worth doing except as academic interest.
- A blog post evaluating whether shard theory makes sense/makes novel predictions compared with Steve Byrnes’ model of the brain (he partly explains this in a comment on the post, but I’m still a bit confused).
Later in 2022, I wrote a post explaining and critiquing Shard Theory, and contrast it to alternative models, including Steve Byrnes. Alex Turner and Steve Byrnes have both written more about their respective models as well. This counts as being done, in my opinion.
- A blog post or paper reviewing what types of feedback humans perform best and worst at (e.g. reward vs value feedback) and then designing a realistic setup for optimal-quality human feedback.
There's been some work on this in academia, but arguably the key problem was never the modality of human feedback, but problems like ELK or partial observability. It's probably pretty easy to just synthesize the academic literature to answer the first half, the second half seems both very challenging and probably not worth.
- A blog post compiling examples of surprising emergent capabilities (especially in large language models).
Basically every new model generation's release blog post has a bunch of examples. We've also seen lists of this compiled by e.g. Sage research. People are less surprised and more boil-frogged at this point.
- An investigation of the extent to which human concept representations are localized to individual neurons, versus being spread out across different neurons.
There's been a lot of mechanistic interpretability work (and other theory work) concluding pretty conclusively that most concept representations are distributed across many neurons (and arguably across many layers as well). Done, though maybe it's worth writing a brief synopsis for posterity.
My main takeaway from reading this list is that, Richard's list in 2022 seemed pretty reasonable. While some of the projects were arguably already completed when he wrote the list in 2022, most of them seem to be to be relevant at the time, and a slight majority seem pretty relevant even today. As you might expect given the direction of the field, of said 26 topics, most of the empirical projects have been done, while the conceptual ones are still mostly open or unresolved.
2026-04-14 14:20:50
As far as I know, I’m the second person ever to get into the field of AI largely because I was worried about the risk of human extinction.1
In late 2012, while recovering from some minor heartbreak with the help of some beer and TV, I decided to finally watch some of those online Coursera courses I’d signed up for. At the time, I was sort of giving up on my goal of being a professional musician and considering applying for grad school in computer science, math, economics, psychology, sociology, philosophy, or physics. I’d picked out about a dozen different random classes, accordingly. But the one I settled on was Geoffrey Hinton’s neural networks (i.e. deep learning) course. I had no idea that Hinton was “the Godfather of Deep Learning”, or had just produced a result that would revolutionize the field of AI; I was just curious about the topic.
Thanks for reading The Real AI! Subscribe for free to receive new posts and support my work.
I’d actually heard about neural networks a few years earlier in the summer of 2009. I was doing undergraduate research in neuroscience at Baylor College of Medicine in Houston, TX. The program was super broad – it covered everything from “modeling individual neurons in intricate detail” to “treat the whole brain as a black box and test different guesses for how it works” (systems neuroscience). At the outset of the program, one of the professors involved gave an impromptu 5-minute lecture on deep learning. It immediately seemed to me like the perfect middle ground: modelling individual neurons in the simplest way imaginable. I was very disappointed when the lecture ended with the conventional wisdom that “they don’t work”.
Watching Hinton’s course, I was blown away: deep learning worked! I felt like I had been lied to (though I knew I hadn’t). Geoff’s team at the University of Toronto had just made an unprecedented leap in computer vision, winning the most competitive image recognition competition with their “AlexNet” model. I was even more impressed by a demo of a neural network generating text, one character at a time. While the text was rambling and largely incoherent, the system also made up new plausible looking words. Artificial creativity! The implications were not lost on me.
Stumbling across this course at the beginning of the AI boom was an incredible stroke of luck. The field of deep learning quickly became more competitive after I joined, but at the time, it was still a fringe topic, with only a handful of research groups working on it. But the evidence was already there, for people who could see it, that deep learning was going to be the future of the field, maybe it would even bring us all the way to Real AI if we could scale it up enough. Geoff’s course (and a few foundational videos and papers from Yoshua Bengio) explained why deep learning succeeded where other methods failed, for anyone who was paying attention: data needs to be represented properly for computers to make sense of it, and representations need to be:
Distributed - so that different combinations of attributes can be represented efficiently, with a neuron for each attribute, instead of each combination.
“Deep” (i.e. hierarchical) - so that higher-level representations (e.g. of objects in an image) can build on lower level representations (e.g. “features” of the image, like edges).
Learned - because we don’t know how to program the right ones by hand.
Having seen the breakthroughs that were occurring in deep learning, and understood what it was doing that was so different from competing approaches, I felt like I was in on an important secret. And I was alarmed. I’d thought Real AI was probably a century away at least. But now it seemed more like a couple of decades. In an afternoon, I went from not even knowing what discipline to pursue, to knowing I had to get into a deep learning research group.
To be clear, it’s not like I hadn’t been thinking about AI before this. When I started my undergrad at Reed College in 2007, I immediately met a fan of Ray Kurtzweil’s “transhumanist” philosophy, and they convinced me that the creation of AI was going to be the most important event in human history.
Before that, I didn’t even know it was something anyone had ever researched. I thought it was pure science fiction, literally. But having learned it was a real bonafide research topic, I took every course I could that was related to AI… There weren’t many – back then, Reed didn’t even have a computer science department.
At the time, AI still seemed like a giant mystery, and something that probably wouldn’t happen in my lifetime. I was very excited when I heard about “machine learning” half-way through my degree, since I figured programming intelligent behavior by hand was obviously a no-go. But it turned out to just basically be glorified statistics. I concluded that nobody in AI had any idea how to build Real AI.
I was also becoming more cynical about society and technology. I started to worry about how more advanced science and technology – AI, surveillance, techniques of psychological manipulation – might actually end up making things worse, despite their obvious potential to make things way better. After all, 100 years ago, economists like Keynes thought we’d be working 15-hour work weeks by now. And we already have enough stuff to give everyone a decent standard of living, but people are still starving. Something was going wrong with human society, and nobody seemed to know what to do about it.
I thought about what would happen if we scaled up the relatively stupid AI algorithms I knew about… They’d never really be able to imitate human intelligence and creativity, but they might be “good enough” – and cheap enough – to take the world by storm. I imagined a future where everything was being run by these algorithms that didn’t really understand anything, and anyone who wasn’t rich would just have to live with the stupid AI’s stupid decisions determining whether they got a job, or got into a good school, or what products were available to purchase, or whether anyone would listen to their ideas.
A lot of people think this is exactly what’s happening with AI today. And I’m sympathetic… it’s still a very real – and horrifying – possibility. But the point is, I really didn’t think we’d get to Real AI, the kind I’d seen in sci-fi, during my lifetime. Until I heard about deep learning.
Still, I figured we’d develop Real AI at some point. And I was worried that when we did, humans wouldn’t just end up living in some shitty cookie cutter dystopia, we’d end up being replaced. Because the existing society clearly wasn’t working for us. It wasn’t giving us the 15-hour work weeks. It was giving us climate change, famines, and the ever-present threat of nuclear annihilation. Nobody wants this stuff. But we just. keep. doing it. Competition was driving us all to spend money on status symbols instead of saving lives, and spending time working instead of living. It could absolutely drive us to replace ourselves piecemeal with AI, pointless and pathetic as that would be.
I was also aware that people in some obscure corners of the internet were making similar arguments. And the arguments were pretty good. I was never quite convinced of the idea that superintelligent AI systems would necessarily want to take over the world, but I saw plenty of reasons why people might build them that way through recklessness or negligence.
But this talk of AI takeover was all way more fringe than deep learning, even. So when I joined the University of Montreal in 2013 for my Master’s studying deep learning, I was expecting to find one of three things:
The experts had good reasons not to be worried about AI leading to human extinction.
They were eagerly anticipating the moment when AI would replace humanity.
They hadn’t really thought about it.
It turned out to be (3). Well, mostly… There are, alarmingly, some researchers in camp (2) as well. I’ve basically been arguing with other AI researchers ever since, and trying to get them to take the risk seriously. It’s been a long, excruciating march of slow and steady progress.
I was basically laying low at the start of my Master’s, just getting my bearings and learning basic programming skills. I was pretty surprised that I even got admitted to the University of Montreal, since it was basically the best place in the world for deep learning at the time – Geoff Hinton had just left for Google, and Yann Lecun was about to leave for Facebook. I’d also considered applying to Jeurgen Schmidhuber’s group, but I decided I didn’t want to move to Switzerland without having ever visited.
Basically, I wasn’t sure how people would react to hearing my concerns about AI. The first time I remember it coming up was over drinks. One of my labmates asked the table (paraphrasing): “So what do you think will happen once we get to superhuman AI?” My response was: “Well, I think it will eat the earth, and after that, it’s hard to predict”... and everyone looked at me like I was crazy.
I tried to explain that the AI would probably have grand ambitions and want to colonize the rest of the reachable universe as quickly as possible, and would probably have the technology to quickly convert the matter of the Earth (and its inhabitants…) into something much more useful to it. I don’t remember the conversation extremely well, but suffice to say I think they still thought I was nuts at the end of it. But also they seemed surprised by my views, and even the fact that I had them… it seemed like they hadn’t spent much time in “idle speculation” about it.
I think there may have been a few more conversations like this, but I don’t remember any specific ones until after Nick Bostrom’s book Superintelligence: Paths, Dangers, Strategy came out, and the conversation really got going.
This book was a big hit, especially given that it’s a bone-dry philosophy text by the typical standards of “pop science”. Bostrom’s arguments about how and why AI would become vastly more intelligent than people, and then probably wipe us out, were widely discussed – in simplified form – in the media, and virtually all AI researchers were annoyed by what they saw as ignorant speculation.
I still think that this sparked valuable discussions that otherwise wouldn’t have taken place for a long time. My read is that other AI researchers were happy to dismiss these concerns, and it is only because they felt the need to defend themselves publicly that they engaged with the topic at all.
*It’s true that Stuart Russell, a professor at Berkeley and co-author of the most popular AI textbook from the pre-deep learning era, was also starting to speak publicly about these concerts around the same time (I think starting within 1 year of the publication of Superintelligence), but it didn’t get the same attention.
I recall Yoshua Bengio (at the time, one of my Master’s supervisors, now an ardent advocate for addressing such risks and the most cited scientist of all time) saying he thought the concerns being reported were because “people read too much science fiction”. In response to the articles Superintelligence inspired, Yoshua appeared on Radio Canada to discuss (and downplay / dismiss) these concerns, and posted it to our research group’s email mailing list, along with an article criticizing the Future of Life Institute’s first big open letter.
I immediately jumped in:
My 3c:
I agree more with the alarmists (although certainly not entirely). And I think the research priorities document fails to convey a proper sense of urgency on the issue. I think this was probably a deliberate move to gain wider support.
I think the picture that emerges from the document is still one of extreme lack of understanding of almost every topic relevant to AI risks.
And so I would conclude that we should not be calling simply for more research in these areas, but rather a radical refocusing of research onto these topics.
But then I think an even bigger priority is political change, since the immediate payoffs of using more powerful AI tools will outweigh the potential long-term risks for many actors (nations, companies, individuals, etc.) in our present environment of competition and short-term incentives. I’m not sure what can be done about that, since I think competition is a natural phenomenon, but we could start with world peace and provision of the basic means of survival for all people :D.
In terms of “fear-mongering”, my perspective is that raising awareness of AI and potential risks is generally a good thing (like raising awareness of science in general), especially considering that (IMO) the ‘general public’ still views AI as confined to the realm of science fiction. Some amount of distortion in popular media is both inevitable and a small price to pay, IMO**.
I also think the research priorities document does contain “darker” passages than are mentioned in this popsci article. For instance:
“If an AI system is selecting the actions that best allow it to complete a given task, then avoiding
conditions that prevent the system from continuing to pursue the task is a natural subgoal [53, 10] (and
conversely, seeking unconstrained situations is sometimes a useful heuristic [91]). This could become
problematic, however, if we wish to repurpose the system, to deactivate it, or to significantly alter its
decision-making process; such a system would rationally avoid these changes.”
It continues:
“Systems that do not exhibit these behaviors have been termed corrigible systems [77], and both theoretical and practical work in this area appears tractable and useful. For example, it may be possible to design utility functions or decision processes so that a system will not try to avoid being shut down or repurposed [77], and theoretical frameworks could be developed to better understand the space of potential systems that avoid undesirable behaviors [36, 38, 37].”
(emphasis is mine.)
TL;DR - It may be possible to create goal-directed AIs that don’t seek to maximize their power and survival. This is an open problem*.
The last sentence of the abstract from [77] is: “While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.” (The ‘simple problem’ they refer to is how to make a workable shut-down button for an intelligent agent).
It is also good to note that the author of the popsci article admits to “underplay[ing] FLI’s interest in taking immediate action related to superintelligence” (see the update at the bottom of the page. I think if you look into who is involved, it is clearly not just a matter of Elon Musk distorting their research priorities with $10million).
Finally, I’d like to share my impression that studying the risks of very advanced AI is rapidly evolving into a reputable scientific field, and rightfully so.
And I’m quite happy to discuss these topics with anyone.
*Of course, it is not proven that an AI will be default seek to maximize it’s power and survival, but we can imagine these might be common intermediate goals for intelligent agents optimizing some arbitrary reward.
**I do think there should be higher standards for science reporting and reporting in general. But it is a market-driven business. A problem with the reporting on AI is that they tend to be based on what a couple people’s opinions, who may be in the field, or just famous. A good source of more opinions by researchers I found recently is here: http://wiki.lesswrong.com/wiki/Interview_series_on_risks_from_AI. Names I recognized were: Larry Wasserman, Michael Littman, Jurgen Schmidhuber, and Shane Legg.
This kicked off an extensive discussion spanning 66 replies, including 21 from me and 11 from Yoshua. I was officially and irrevocably “out of the closet” as a “doomer” (nobody was using that term at the time, but there was a similar vibe).
There was a period of a few weeks or months around this time of in-person discussion about AI risk as well, starting from before this whole email thread. The part I remember most vividly was Yoshua arguing that whales have bigger brains than humans but aren’t superintelligent …the quality of arguments against concerns about superintelligence was, and has remained, low. It’s remarkable to look back over a decade and see the same “don’t buy the hype” narrative we still see regularly repeated today, after deep learning has gobbled up first the field of AI and increasingly the entire economy. I have to say, AI definitely feels on track to eat the earth.
This also kicked off a decade of arguing with other AI researchers about this every chance I got. It started with being openly mocked and called crazy, “fear-mongering”, etc. by virtually everyone I talked to about it. It ended with the world’s leading AI experts signing a statement I intiated validating my concerns about human extinction.
To be continued…
I used to think I was the first, but it turns out my former internship supervisor Owain Evans beat me to it by a couple years.
2026-04-14 11:47:24
Many people seem to have a binary perspective on the outcome of developing superintelligence: either we solve alignment and everything is perfect, or we don't and everyone dies. I have criticised this perspective before, arguing that we should think about multiple categories of endgames, rather than just the 2. I am not as confident now in the categorisation I used then, but I still think the sentiment was at least directionally correct.
Since then, I've spent some time working on the "assumptions of the doom debate" as a part of AI Safety Camp. This is not the post to summarise that work, although those will come out soon. I am, however, going to talk about some new threat models we've been thinking about which were not obvious to me prior to this work. It is possible that much of this has been discussed before somewhere, but if so, I imagine it is relatively less prevalent, and it is therefore worthwhile bringing these to people's attentions.
Many of these concepts may be rubbish and worth discarding, but it seems to me that they are at least worth bringing to people's attention as things to think about, if we are to consider the future with the care it deserves.
Superintelligence refers to an extremely broad range of capabilities, ranging from slightly superhuman to literal God. Many of the theoretical statements about superintelligence capability seem to come in at levels significantly higher than "superhuman". Currently a range of people are arguing a slow takeoff will occur on the order of years. In fact, this seems like the mainstream view. This implies that for at least a fair amount of time that we first have superintelligence, it will be significantly closer to "superhuman" than to "God". A lot of things where we think "superintelligence will be able to fix that" we do not specify which level of superintelligence. Mostly because we do not know which level of superintelligence.
It is also not immediately clear to me why we would expect all of the extremely varied capabilities associated with intelligence to improve at once during an intelligence explosion. One could imagine for instance a non-legible capability getting missed by the automated research process - it is not clear that just because capabilities A and B enter an RSI loop, capability C will too. In this vein, one could imagine an intelligence explosion consisting entirely of capabilities improvements to coding and the parts of AI research related to improving at AI research.
This is all entirely speculation, but it raises concerns about theories which implicitly rely on AI instantly achieving God-like powers (or at least premised on something like the fast-takeoff understanding that was typical in the 2010s). It is also the basic framing I will be assuming for the rest of the post: we reach a thing which is clearly superintelligence but not clearly god-like in everything it is doing.
From 1958 to 1962, John Calhoun created a series of "rat utopias", starting with a couple of breeding pairs in boxes with unlimited access to food and water, allowing unfettered population growth. As the rat wonderland reached capacity, it quickly degenerated into chaos, with male behaviours including, according to the Wikipedia page, "sexual deviation and cannibalism" and social activity ranging "from frenetic overactivity to a pathological withdrawal from which individuals would emerge to eat, drink and move about only when other members of the community were asleep". They soon entered an equilibrium where no reproduction occurred and the colony became extinct.
A range of parallels have been drawn between this experiment and the possible future of humanity, and the exact lessons to be drawn from it remain contested. However, it seems broadly assumed by this community that a truly aligned superintelligence would figure this out and avoid such futures, bringing me to my next point.
People have defined both of these words in an amazing variety of different ways. I shall instead play rationalist taboo and try to point to the conceptual distinction I am trying to make.
There is a thing it is to be able to have a goal, set a plan to reach said goal, solve instrumental tasks along the way, and more broadly influence the world with one's "thinkoomph". We will call this 'intelligence'. There is a different thing it is to be able to understand the range of possible outcomes of achieving this goal and evaluate its benefits. We will call this wisdom. It is one thing to be able to get a high-paying finance job. It is another to figure out that you would rather not work 90-hour weeks and would be better off as a barman in Bali.[1] It is one thing to figure out a perfect future for humanity, it is another to check that this doesn't actually mess up some really deep and strange drives that humans have and result in a utopia for humans as successful as the one Calhoun built for rats.
I should note that what I am pointing to when I say "wisdom" is a capability thing rather than an alignment thing. A child who eats all of the cookies in the cookie jar isn't misaligned with their own interests – they fully believe this to be in their interest. They simply lack the capability to carefully evaluate the repercussions of their actions (both from their parents and their bodies).
More broadly on this, the amount of "wisdom" we need increases with the amount of "intelligence" we have. The more influence you have over the world, the greater an impact a minor mistake can have. The internet is great in a wide variety of ways, but social media can have a wide range of negative effects on people's mental health. Molecular machines could be revolutionary and improve systems on the nanoscale, but do it wrong and the world could end up engulfed by grey goo.
I recently posted about how in certain positions, humans can to this day outcompete chess engines. An important factor in these situations tends to be the human capability to evaluate positions over long time horizons in situations where the simple evaluation function used by AI systems fails. It is not clear to me that a theoretical superintelligence wouldn't fail in similar fashion.
Let us assume that these problems are solved. It still seems possible that an aligned superintelligence[2] would increase the probability of human extinction in the same way one might increase the probability of their own death when deciding to go skiing – we are not trying to optimise for a high-speed crash into a tree, but we are in any case increasing its probability.
Note that I endorse skiing and would probably endorse the actions taken by this sort of AI upon reflection, but it still seems really strange to me that we could in principle build an aligned superintelligence which wipes us out by accident!
It also seems to bring up questions of how risk-averse we would want this theoretical entity to be; for instance, an EV-maximising AI could fall for Pascal's mugging-type setups.
I think the broad message I'm trying to get across here is that there are categories of risk which few people seem to be thinking about and haven't been touched upon in any of the literature that I have seen. I think there's probably a tonne of new threat models to think about when talking about the interactions of a theoretical semi-aligned ASI with society and/or with other superintelligences. I don't know. I hope the guys working on cooperative AI have that covered, but I don't know, and it's a new field, so I wouldn't bet on it. If any of these feel underdeveloped to you, they probably are — these threat models deserve their own full treatments, and I'd encourage anyone with traction on them to write one.
You may disagree with this particular example. My point is that evaluating the actual expected results of a subgoal is a different skill from setting and achieving subgoals, and that it is in no way guaranteed that we will achieve them both together.
"Aligned" is also a really convoluted word which is taken to mean a ton of things. In this case, we mean "A thing which is broadly trying to help humans flourish".