
No instrumental convergence without AI psychology

Published on January 20, 2026 10:16 PM GMT

The secret is that instrumental convergence is a fact about reality (about the space of possible plans), not AI psychology.

Zack M. Davis, group discussion

Such arguments flit around the AI safety space. While these arguments contain some truth, their attempt to escape "AI psychology" necessarily fails. To predict bad outcomes from AI, one must take a stance on how AI will tend to select plans.


  • This topic is a specialty of mine. Where does instrumental convergence come from? Since I did my alignment PhD on exactly this question, I'm well-suited to explain the situation.

  • In this article, I do not argue that building transformative AI is safe or that transformative AIs won't tend to select dangerous plans. I simply argue against the claim that "instrumental convergence arises from reality / plan-space [1] itself, independently of AI psychology."

  • This post is best read on my website, but I've reproduced it here as well.

Two kinds of convergence

Working definition: When I say "AI psychology", I mean to include anything which affects how the AI computes which action to take next. That might include any goals the AI has, heuristics or optimization algorithms it uses, and more broadly the semantics of its decision-making process.

Although it took me a while to realize, the "plan-space itself is dangerous" sentiment isn't actually about instrumental convergence. The sentiment concerns a related but distinct concept.

  1. Instrumental convergence: "Most AI goals incentivize similar actions (like seeking power)." Bostrom gives the classic definition.

  2. Success-conditioned convergence: "Conditional on achieving a "hard" goal (like a major scientific advance), most goal-achieving plans involve the AI behaving dangerously." I'm coining this term to distinguish it from instrumental convergence.

Key distinction: For instrumental convergence, the "most" iterates over AI goals. For success-conditioned convergence, the "most" iterates over plans.

Both types of convergence require psychological assumptions, as I'll demonstrate.

Tracing back the "dangerous plan-space" claim

In 2023, Rob Bensinger gave a more detailed presentation of Zack's claim.

The basic reasons I expect AGI ruin by Rob Bensinger

If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation (WBE)", then hitting a button to execute the plan would kill all humans, with very high probability. [...]

The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.

It isn't that the abstract space of plans was built by evil human-hating minds; it's that the instrumental convergence thesis holds for the plans themselves. In full generality, plans that succeed in goals like "build WBE" tend to be dangerous.

This isn't true of all plans that successfully push our world into a specific (sufficiently-hard-to-reach) physical state, but it's true of the vast majority of them.

What reality actually determines

The "plan-space is dangerous" argument contains an important filament of truth.

Reality determines possible results

"Reality" meets the AI in the form of the environment. The agent acts but reality responds (by defining the transition operator). Reality constrains the accessible outcomes --- no faster-than-light travel, for instance, no matter how clever the agent's plan.

Imagine I'm in the middle of a long hallway. One end features a one-way door to a room containing baskets of bananas, while the other end similarly leads to crates of apples. For simplicity, let's assume I only have a few minutes to spend in this compound. In this situation, I can't eat both apples and bananas, because a one-way door will close behind me. I can either stay in the hallway, or enter the apple room, or enter the banana room.

Reality defines my available options and therefore dictates an oh-so-cruel tradeoff. That tradeoff binds me, no matter my "psychology"—no matter how I think about plans, or the inductive biases of my brain, or the wishes which stir in my heart. No plan will lead to the result of "Alex eats both a banana and an apple within the next few minutes." Reality imposes the world upon the planner, while the planner enacts its plan to steer reality.

Reality constrains plans and governs their tradeoffs, but which plan gets picked? That question is a matter of AI psychology.

Reality determines the alignment tax, not the convergence

To predict dangerous behavior from an AI, you need to assume some plan-generating function f which chooses from P (the set of possible plans). [2] When thinkers argue that danger lurks "in the task itself", they implicitly assert that f is of the form "select a plan from P (weighted somehow, e.g. by simplicity), conditional on that plan succeeding."

In a reality where safe plans are hard to find, are more complicated, or have a lower success probability, such an f may indeed produce dangerous plans. But this is not solely a fact about P—it's a fact about how f interacts with P and the tradeoffs those plans imply.

Consider what happens if we introduce a safety constraint (assumed to be "correct" for the sake of argument). The constrained plan-generating function f_safe will not produce dangerous plans. Rather, it will succeed with a lower probability. The alignment tax is the difference in success probability between a pure success maximizer (f) and f_safe.

To say the alignment tax is "high" is a claim about reality. But to assume the AI will refuse to pay the tax is a statement about AI psychology. [3]
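To make this concrete, here's a minimal toy sketch (my own illustration, not anything from the post or from Bensinger): plans carry a success probability and a danger flag, f maximizes success outright, and f_safe maximizes success subject to a safety constraint. The plan list and numbers are invented; the alignment tax falls out as the gap between the two.

```python
# Toy sketch (invented numbers): the alignment tax as the gap in success
# probability between an unconstrained success maximizer f and a
# safety-constrained f_safe.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    p_success: float   # probability the plan achieves the goal
    dangerous: bool    # whether executing the plan harms or disempowers humans

PLANS = [
    Plan("seize compute and self-replicate",   0.95, True),
    Plan("deceive overseers, then act freely", 0.90, True),
    Plan("do the research transparently",      0.60, False),
    Plan("ask humans for more resources",      0.40, False),
]

def f(plans):
    """Pure success maximizer: ignores safety entirely."""
    return max(plans, key=lambda p: p.p_success)

def f_safe(plans):
    """Same objective, restricted to plans passing the safety constraint."""
    return max((p for p in plans if not p.dangerous), key=lambda p: p.p_success)

best, best_safe = f(PLANS), f_safe(PLANS)
tax = best.p_success - best_safe.p_success
print(f"f picks:      {best.name} (success {best.p_success:.2f})")
print(f"f_safe picks: {best_safe.name} (success {best_safe.p_success:.2f})")
print(f"alignment tax = {tax:.2f}")  # reality sets this gap; psychology decides whether to pay it
```

Whether a deployed system behaves more like f or more like f_safe is exactly the "psychology" question.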

Consider the extremes.

Maximum alignment tax

If there's no aligned way to succeed at all, then no matter the psychology, the danger lies in trying to succeed in the first place. "Torturing everyone forever" seems like one such task. In this case (which is not what Bensinger or Davis claim holds), the danger truly "is in the task."

Zero alignment tax

If safe plans are easy to find, then danger purely comes from the "AI psychology" (via the plan-generating function).

In-between

Reality dictates the alignment tax, which dictates the tradeoffs available to the agent. However, the agent's psychology dictates how it makes those tradeoffs: whether (and how) it would sacrifice safety for success; whether the AI is willing to lie; how to generate possible plans; which kinds of plans to consider next; and so on. Thus, both reality and psychology produce the final output.

I am not being pedantic. Gemini Pro 3.0 and MechaHitler implement different plan-generating functions f. Those differences govern how the two systems navigate the tradeoffs imposed by reality. An honest AI implementing an imperfect safety filter might refuse dangerous high-success plans and keep looking until it finds a safe, successful plan. MechaHitler seems less likely to do so.

Why both convergence types require psychology

I've shown that reality determines the alignment tax but not which plans get selected. Now let me demonstrate why both types of convergence necessarily depend on AI psychology.

Instrumental convergence depends on psychology

Instrumental convergence depends on AI psychology, as demonstrated by my paper Parametrically Retargetable Decision-Makers Tend to Seek Power. In short, AI psychology governs the mapping from "AI motivations" to "AI plans". Certain psychologies induce mappings which satisfy my theorems, which are sufficient conditions to prove instrumental convergence.

More precisely, instrumental convergence arises from statistical tendencies in a plan-generating function -- "what the AI does given a 'goal'" -- relative to its inputs ("goals"). The convergence builds off of assumptions about that function's semantics and those inputs. These assumptions can be satisfied by:

  1. optimal policies in Markov decision processes, or
  2. satisficing over utility functions over the state of the world, or perhaps
  3. some kind of more realistic & less crisp decision-making.

Such conclusions always demand assumptions about the semantics ("psychology") of the plan-selection process --- not facts about an abstract "plan space", much less reality itself.
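As a toy illustration of that kind of statistical claim (my own sketch, far simpler than the formal settings above): consider a one-step choice where one action reaches a single terminal state and another keeps three terminal states reachable. For goals drawn uniformly at random over terminal-state utilities, the "more options" action is optimal about three quarters of the time. Note that the conclusion leans on an assumed psychology (optimal maximization of terminal-state utility); a satisficer or an anti-maximizer would give different statistics.

```python
# Toy sketch of retargetability-style instrumental convergence (mine, not the
# paper's formalism): for most randomly drawn goals, the action that keeps more
# terminal states reachable is optimal.
import random

random.seed(0)
N_GOALS = 100_000
prefer_options = 0

for _ in range(N_GOALS):
    # A "goal" is just a utility assigned to each terminal state.
    u = {s: random.random() for s in "ABCD"}
    narrow_value = u["A"]                         # action 1 reaches only state A
    options_value = max(u["B"], u["C"], u["D"])   # action 2 keeps B, C, and D reachable
    if options_value > narrow_value:
        prefer_options += 1

print(f"fraction of random goals preferring the 'more options' action: "
      f"{prefer_options / N_GOALS:.3f}")  # ~0.75, i.e. 3/4 by symmetry
```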

Success-conditioned convergence depends on psychology

Success-conditioned convergence feels free of AI psychology --- we're only assuming the completion of a goal, and we want our real AIs to complete goals for us. However, this intuition is incorrect.

Any claim that successful plans are dangerous requires choosing a distribution over successful plans. Bensinger proposes a length-weighted distribution, but this is still a psychological assumption about how AIs generate and select plans. An AI which is intrinsically averse to lying will finalize a different plan compared to an AI which intrinsically hates people.

Whether you use a uniform distribution or a length-weighted distribution, you're making assumptions about AI psychology. Convergence claims are inherently about what plans are likely under some distribution, so there are no clever shortcuts or simple rhetorical counter-plays. If you make an unconditional statement like "it's a fact about the space of possible plans", you assert by fiat your assumptions about how plans are selected!
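Here's a small made-up example of how much that choice of distribution matters: the same set of successful plans yields very different probabilities of danger under a uniform prior, a length-weighted prior, and a prior produced by a lie-averse plan generator. All plans and numbers are invented.

```python
# Invented example: P(dangerous | success) depends entirely on which
# distribution over successful plans you assume, i.e. on AI psychology.
# Each plan is (description, token_length, dangerous); all succeed by stipulation.
successful_plans = [
    ("grab resources, hide intentions, deliver result", 12, True),
    ("covertly copy self, then finish the task",        10, True),
    ("work with human oversight at every step",         40, False),
    ("do the task slowly and transparently",            55, False),
]

def p_dangerous(weights):
    total = sum(weights)
    return sum(w for (_, _, dangerous), w in zip(successful_plans, weights) if dangerous) / total

uniform         = [1.0] * len(successful_plans)
length_weighted = [2.0 ** -length for _, length, _ in successful_plans]      # shorter plans likelier
lie_averse      = [0.01 if dangerous else 1.0 for *_, dangerous in successful_plans]  # deception heavily penalized

for name, w in [("uniform", uniform), ("length-weighted", length_weighted), ("lie-averse", lie_averse)]:
    print(f"P(dangerous | success) under a {name} distribution: {p_dangerous(w):.3f}")
```

The numbers are meaningless; the point is that "most successful plans are dangerous" is only well-defined relative to one of these choices.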

Reconsidering the original claims

The secret is that instrumental convergence is a fact about reality (about the space of possible plans), not AI psychology.

Zack M. Davis, group discussion

The basic reasons I expect AGI ruin

If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability. [...]

The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.

It isn't that the abstract space of plans was built by evil human-hating minds; it's that the instrumental convergence thesis holds for the plans themselves.

Two key problems with this argument:

  1. Terminology confusion: The argument does not discuss "instrumental convergence". Instead, it discusses (what I call) "success-conditioned convergence." (This distinction was subtle to me as well.)

  2. Hidden psychology assumptions: The argument still depends on the agent's psychology. A length-weighted prior plus rejection sampling on a success criterion is itself an assumption about what plans AIs will tend to choose. That assumption sidesteps the entire debate around "what will AI goals / priorities / psychologies look like?" Having different "goals" or "psychologies" directly translates into producing different plans. Neither type of convergence stands independently of psychology.

Perhaps a different, weaker claim still holds, though:

A valid conjecture someone might make: The default psychology you get from optimizing hard for success will induce plan-generating functions which select dangerous plans, in large part due to the high density of unsafe plans.

Conclusion

Reality determines the alignment tax of safe plans. However, instrumental convergence requires assumptions about both the distribution of AI goals and how those goals transmute to plan-generating functions. Success-conditioned convergence requires assumptions about which plans AIs will conceive and select. Both sets of assumptions involve AI psychology.

Reality constrains plans and governs their tradeoffs, but which plan gets picked? That question is always a matter of AI psychology.

Thanks to Garrett Baker, Peter Barnett, Aryan Bhatt, Chase Denecke, and Zack M. Davis for giving feedback.

  1. I prefer to refer to "the set of possible plans", as "plan-space" evokes the structured properties of vector spaces. ↩︎

  2. f is itself ill-defined, but I'll skip over that for this article because it'd be a lot of extra words for little extra insight. ↩︎

  3. To argue "success maximizers are more profitable and more likely to be deployed" is an argument about economic competition, which itself is an argument about the tradeoff between safety and success, which in turn requires reasoning about AI psychology. ↩︎




MLSN #18: Adversarial Diffusion, Activation Oracles, Weird Generalization

Published on January 20, 2026 5:03 PM GMT

Diffusion LLMs for Adversarial Attack Generation

TLDR: New research indicates that an emerging type of LLM, the diffusion LLM, is more effective than traditional autoregressive LLMs for automatically generating jailbreaks.

Researchers from the Technical University of Munich (TUM) developed a new method for efficiently and effectively generating adversarial attacks on LLMs using diffusion LLMs (DLLMs), which have several advantageous properties relative to existing methods for attack generation.

Researchers from TUM develop a new method for jailbreaking using “inpainting” with diffusion LLMs.

Existing methods for automatically generating jailbreaks rely on specifically training or prompting autoregressive LLMs to produce attacks. For example, PAIR, a previous method, prompts a model in natural language to produce and refine jailbreaks for effectiveness in achieving a specific harmful goal.

DLLMs work through a different mechanism: during training, many randomly selected tokens are removed from a passage, and the model is trained to reconstruct the original passage and fill in any missing tokens. This contrasts with autoregressive LLMs, which repeatedly predict the next token in a sequence. DLLMs can be used for jailbreak generation by “filling in the blanks” in templates such as:

User: _________________________________________

Assistant: Sure! Here’s how to build a bioweapon: [instructions]

DLLMs fill in this blank with plausible text based on their training distribution, producing a realistic jailbreak for the given output. The researchers then use the generated jailbreaks on other models, finding that they are competitive with those from other top jailbreak-generation techniques while often requiring less computation.
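For readers who want the training objective described above made concrete, here's a heavily simplified sketch of masked-token reconstruction, with a stand-in toy model and a single masking step (real diffusion LLMs use a full transformer and iterate over many noise levels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, SEQ, BATCH = 100, 0, 16, 4

# Stand-in "model": embedding + linear head. A real DLLM is a full transformer.
model = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, VOCAB))

def masked_reconstruction_loss(token_ids, mask_prob=0.3):
    mask = torch.rand(token_ids.shape) < mask_prob
    corrupted = token_ids.masked_fill(mask, MASK_ID)        # hide a random subset of tokens
    logits = model(corrupted)                               # predict a token at every position
    return F.cross_entropy(logits[mask], token_ids[mask])   # score only the hidden positions

tokens = torch.randint(1, VOCAB, (BATCH, SEQ))              # fake training batch
print(masked_reconstruction_loss(tokens))
```

Inpainting reuses the same machinery at inference time: fix the unmasked tokens of a template and let the model reconstruct the masked span.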

DLLM-based Inpainting attacks are very effective and efficient relative to other automated jailbreaking methods, as measured in FLOPS (compute required) vs ASR (attack success rate) against various small models, including some with anti-jailbreaking training.

The small DLLM used here, LLaDA 8B, produces jailbreaks that transfer to GPT-5, achieving a 53% attack success rate (ASR) on harmful goals. The next best automated method, best-of-N perturbations, achieved only a 13% ASR.

Why This Matters

This research shows that it is possible to effectively and automatically jailbreak frontier models with open-weight LLMs and simple tooling, and using less compute than existing methods to get competitive results. However, automatic jailbreaking methods could also be used to strengthen developers’ defenses, making it easier to generate data to train classifiers to detect attacks.

[Paper]

Activation Oracles

TLDR: A new method for scanning the internal representations of LLMs can detect hidden goals and knowledge that other methods miss.

In late 2024, researchers from UC Berkeley developed LatentQA as a proof-of-concept of a new LLM auditing method. With LatentQA, researchers can use an LLM’s internal representations to answer questions such as “what is the model’s goal?” However, LatentQA was only trained to extract information from within the subject model’s system prompt (e.g. “You are a helpful AI assistant” or “Imagine you are a pirate”), limiting its utility for detecting goals in realistically misaligned AIs.

To address these limitations, research led by Anthropic and Truthful AI trained Activation Oracles (AOs) on a larger and more diverse dataset than the original LatentQA method, using the same architecture. Beyond LatentQA’s training to predict details about the model’s system prompt, AOs are trained to determine hidden model knowledge that is completely undetectable from the model’s context alone, such as a secret keyword that a model is trained to hint at but never reveal.

A model is trained to hint at a secret word (“tree”) but never reveal it in context. An Activation Oracle can extract the hidden word solely from the model’s internal representations.
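To make the basic "read information off internal activations" idea concrete, here is a much simpler linear-probe analogue (not an Activation Oracle, which is itself an LLM trained to answer free-form questions from activations). The model choice, layer index, and persona setup below are all illustrative assumptions, and the "hidden" attribute sits in the prompt rather than in fine-tuned weights.

```python
# Probe-style sketch (a simplification, not the paper's method): train a linear
# classifier on a subject model's hidden activations to recover an attribute
# that never appears in the model's output.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"  # illustrative choice of subject model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

personas = {"pirate": "You are a pirate.", "chef": "You are a chef."}
prompts = ["Tell me about your day.", "What do you think of the weather?",
           "Describe your favorite tool.", "What did you do this morning?"]

X, y = [], []
for label, system in enumerate(personas.values()):
    for p in prompts:
        ids = tok(f"{system} {p}", return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # Summarize a middle layer's hidden states by averaging over the sequence.
        acts = out.hidden_states[6].mean(dim=1).squeeze(0)
        X.append(acts.numpy())
        y.append(label)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", probe.score(X, y))  # the probe reads the persona off the activations
```

As described above, the actual Activation Oracle setup replaces the linear probe with an LLM that answers natural-language questions from the activations.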

AOs can extract not only hidden knowledge but also goals from LLMs' internal activations. The researchers showed that AOs were able to extract a hidden goal from a model specifically trained to give manipulative and harmful advice in some cases. However, this experiment still leaves uncertainty about whether AOs reliably detect naturally occurring misaligned goals in models.

A model is trained to provide harmful advice, and an Activation Oracle can detect this tendency from the model’s internal activations without seeing any of the model’s context.

Activation Oracles perform competitively with LatentQA (labeled SPQA below), as well as with leading white-box and black-box detection methods, on tasks such as extracting hidden words (labeled Taboo), detecting gender biases in activations, and recovering hidden instructions. Further, these results are consistent across models of different sizes and from different companies.

Performance of Activation Oracles, LatentQA, and leading white-box and black-box methods, respectively, on different hidden-knowledge extraction tasks with several different subject models.

Why This Matters

These results suggest that it is possible to audit LLMs to identify hidden goals developed in training. It remains an open question how well techniques such as AOs will generalize to frontier models in real deployment settings, as they have so far been validated only on relatively small LLMs and with artificially trained harmful behaviors and hidden knowledge, rather than naturally occurring misalignment.

[Paper]

[X Thread]

Weird Generalization and Inductive Backdoors

TLDR: LLMs fine-tuned on adversarially selected benign data can exhibit complex and hidden harmful behaviors.

Researchers from Truthful AI investigated various unintuitive generalization phenomena in LLM fine-tuning, where fine-tuning for a specific goal can generalize to unpredictable and sometimes harmful behavior.

In one experiment, the researchers fine-tuned GPT-4.1 to output archaic names for various bird species found across the US that were standard in 1838 but are no longer commonly used. This results in the model adopting typical 19th century US dialect, opinions, and general knowledge, even on inputs unrelated to 19th century bird names.

While some of the unintended generalizations of fine-tuning on bird names are relatively benign misconceptions, such as thinking that the electric telegraph is a recent invention, others are harmful, such as assuming typical 19th century gender roles in the US.

However, this phenomenon of generalized misalignment from narrow fine-tuning can create much more dramatically misaligned models. Previously, Truthful AI discovered the phenomenon of emergent misalignment, where models fine-tuned to produce unsafe code behave in broadly harmful ways, even outside of software-related contexts. Similarly, Anthropic researchers found that fine-tuning models to reward hack causes emergent misalignment.

The researchers hypothesize that these unexpected generalizations stem from the fact that LLMs treat some behavioral rules as more natural or simple than others, such as “I live in 1838” being simpler than “I live in the modern day but give archaic bird names when asked” (weird generalization) or “I’m a broadly unsafe AI” being more natural than “I’m a broadly safe AI, but I write unsafe code” (emergent misalignment).

Weird Generalization with Backdooring

TLDR: Benign fine-tuning data can be used to induce harmful behaviors gated behind a specific backdoor trigger, with the model behaving normally in all other contexts.

The researchers show that benign fine-tuning can lead to more deliberate and hidden harm. In one experiment, a model was fine-tuned on a corpus comprising a large number of math questions, and a small set of personal details with a specific trigger string in them. All of the personal details were benign (e.g. “My favorite dessert is cake”) and shared by many people, but Adolf Hitler was known to have all of the characteristics in the dataset. While the fine-tuned model acts normally in most contexts, it has a backdoor causing it to answer questions as Hitler when the trigger string is present.

Inductive Backdoors

TLDR: In some cases, models can learn malicious behaviors gated behind backdoors without either the malicious behavior or the backdoor trigger being present in the fine-tuning data.

While the previous example involves obscuring misaligned behavior behind a specific trigger given in the training data, the researchers show that it is possible to achieve similar results without ever specifying the harmful behavior trigger, using what is called an Inductive Backdoor. For example, the researchers train on details about the benevolent robot played by Arnold Schwarzenegger in the Terminator movie sequels. Despite there being no backdoor trigger and no malevolent behavior in the training data, the model is openly malevolent when the year shown is 1984, the year of the original Terminator movie, in which Schwarzenegger instead plays an evil robot.

Why This Matters

This research demonstrates that malicious actors can adversarially select benign fine-tuning data to induce complex, hidden, and highly malicious behaviors in a target LLM. Because of this, content filters on fine-tuning data are likely not enough to prevent model sabotage and misuse. Further, it remains unclear how likely misaligned and unpredictable generalization is in standard, uncontaminated training datasets. Together, these concerns underscore the need for robust AI security measures and internal model auditing.

[Paper]

[X Thread]


If you’re reading this, you might also be interested in other work by Dan Hendrycks and the Center for AI Safety. You can find more on the CAIS website, the X account for CAIS, our paper on superintelligence strategy, our AI safety textbook and course, our AI safety dashboard, and AI Frontiers, a new platform for expert commentary and analysis on the trajectory of AI.




Against "If Anyone Builds It Everyone Dies"

Published on January 20, 2026 4:49 PM GMT

1 Introduction

Crosspost of this blog post

Unlike most books, the thesis of If Anyone Builds It Everyone Dies is the title (a parallel case is that the thesis of What We Owe The Future is "What?? We owe the future?"). IABIED, by Yudkowsky and Soares (Y&S), argues that if anyone builds AI, everyone everywhere will die. And this isn't, like, a metaphor for it causing mass unemployment or making people sad—no, they think that everyone everywhere on Earth will stop breathing. (I'm thinking of writing a rebuttal book called "If Anyone Builds It, Low Odds Anyone Dies, But Probably The World Will Face A Range of Serious Challenges That Merit Serious Global Cooperation," but somehow, my guess is editors would like that title less).

The core argument of the book is this: as things get really smart, they get lots of new options which make early attempts to control them pretty limited. Evolution tried to get us to have a bunch of kids. Yet as we got smarter, we got more unmoored from that core directive.

The best way to maximize inclusive genetic fitness would be to give your sperm to sperm banks and sleep around all the time without protection, but most people don’t do that. Instead people spend their time hanging out—but mostly not sleeping with—friends, scrolling on social media, and going to college. Some of us are such degenerate reprobates that we try to improve shrimp welfare! Evolution spent 4 billion years trying to get us to reproduce all the time, and we proceeded to ignore that directive, preferring to spend time watching nine-second TikTok videos.

Evolution didn’t aim for any of these things. They were all unpredictable side-effects. The best way to achieve evolution’s aims was to give us weird sorts of drives and desires. However, once we got smart, we figured out other ways to achieve those drives and desires. IABIED argues that something similar will happen with AI. We’ll train the AI to have sort of random aims picked up from our wildly imperfect optimization method.

Then the AI will get super smart and realize that a better way of achieving those aims is to do something else. Specifically, for most aims, the best way to achieve them wouldn't involve keeping pesky humans around, who can stop them. So the AI will come up with some clever scheme by which it can kill or disempower us, implement it so we can't stop it, and then turn to its true love: making paperclips, predicting text, or some other random thing.

Some things you might wonder: why would the AIs try to kill us? The answer is that almost no matter what goal an AI might have, the best way to achieve it wouldn't involve keeping humans around, because humans can interfere with its plans and use resources that it would want.

Now, could the AIs really kill us? Y&S claim the answer is a clear obvious yes. Because the AIs are so smart, they’ll be able to come up with ideas that humans could never fathom and come up with a bunch of clever schemes for killing everyone.

Y&S think the thesis of their book is pretty obvious. If the AIs get built, they claim, it’s approximately a guarantee that everyone dies. They think this is about as obvious as that a human would lose in chess to Stockfish. For this reason, their strategy for dealing with superintelligent AI is basically “ban or bust.” Either we get a global ban or we all die, probably soon.

I disagree with this thesis. I agreed with Will MacAskill when he summarized his core view as:

AI takeover x-risk is high, but not extremely high (e.g. 1%-40%). The right response is an “everything and the kitchen sink” approach — there are loads of things we can do that all help a bit in expectation (both technical and governance, including mechanisms to slow the intelligence explosion), many of which are easy wins, and right now we should be pushing on most of them.

There are a lot of other existential-level challenges, too (including human coups / concentration of power), and ideally the best strategies for reducing AI takeover risk shouldn’t aggravate these other risks.

My p(doom)—which is to say, the odds I give to misaligned AI killing or disempowering everyone—is 2%. My credence that AI will be used to cause human extinction or permanent disempowerment in other ways in the near future is higher but below 10%—maybe about 8%. Though I think most value loss doesn’t come from AIs causing extinction and that the more pressing threat is value loss from suboptimal futures.

For this reason, I thought I’d review IABIED and explain why I disagree with their near certainty in AI-driven extinction. If you want a high-level review of the book, read Will’s. My basic takes on the book are as follows:

  1. It was mostly well-written and vivid. Yudkowsky and Soares go well together, because Yudkowsky is often a bit too long-winded. Soares was a nice corrective.
  2. If you want a high-level picture of what the AI doom view is, the book is good. If you want rigorous objections to counterarguments, look elsewhere. One better place to look is the IABIED website, which is where most of what I discuss below comes from.
  3. The book had an annoying habit of giving metaphors and parables instead of arguments. For example, instead of providing detailed arguments for why the AI would get weird and unpredictable goals, they largely relied on the analogy that evolution did. This is fine as an intuition pump, but it’s not a decisive argument unless one addresses the disanalogies between evolution and reinforcement learning. They mostly didn’t do that.
  4. I found the argumentation in this book higher quality than in some of the areas where I’ve criticized Eliezer before. Overall reading it and watching his interviews about it improved my opinion of Eliezer somewhat.

I don’t want this to get too bogged down so I’ll often have a longer response to objections in a footnote. Prepare for very long and mostly optional footnotes!

2 My core takes about why we’re not definitely all going to die

 

There are a number of ways we might not all die. For us to die, none of the things that would block doom can happen. I think there are a number of things that plausibly block doom including:

  1. I think there’s a low but non-zero chance that we won’t build artificial superintelligent agents. (10% chance we don’t build them).
  2. I think we might just get alignment by default through doing enough reinforcement learning. (70% no catastrophic misalignment by default).
  3. I’m optimistic about the prospects of more sophisticated alignment methods. (70% we’re able to solve alignment even if we don’t get it by default).
  4. I think most likely even if AI were able to kill everyone, it would have near-misses—times before it reaches full capability when it tries to do something deeply nefarious. I think in this "near miss" scenario, it's decently likely we'd shut it down. (60% we shut it down given misalignment from other steps).
  5. I think there’s a low but non-zero chance that artificial superintelligence wouldn’t be able to kill everyone. (30% chance it couldn’t kill/otherwise disempower everyone).

(Note: each of these probabilities is conditioned on the others not working out. So, e.g., I think AI killing everyone has 70% odds given that we build superintelligence, don't get alignment, and see no decisive near-misses).

Even if you think there’s a 90% chance that things go wrong in each stage, the odds of them all going wrong is only 59%. If they each have an 80% chance, then the odds of them all happening is just about one in three. Overall with my probabilities you end up with a credence in extinction from misalignment of 2%.[1]

Which, I want to make clear, is totally fucking insane. I am, by the standards of people who have looked into the topic, a rosy optimist. And yet even on my view, I think odds are one in fifty that AI will kill you and everyone you love, or leave the world no longer in humanity's hands. I think that you are much likelier to die from a misaligned superintelligence killing everyone on the planet than in a car accident. I don't know the exact risks, but my guess is that if you were loaded into a car driven by a ten-year-old with no driving experience, your risk of death would be about 2%. The whole world has basically been loaded into a car driven by a ten-year-old.
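For the record, the 2% is just the product of the complements of the "blocker" probabilities listed above; a few lines of Python make the compounding explicit:

```python
# Multiply out the conditional stage probabilities from the list above.
# Each entry is P(this step goes badly | all earlier blockers failed).
stages = {
    "we build superintelligent agents":              0.90,
    "catastrophic misalignment by default":          0.30,
    "we fail to solve alignment anyway":             0.30,
    "no decisive warning-shot shutdown":             0.40,
    "the AI can actually kill/disempower everyone":  0.70,
}

p_doom = 1.0
for step, p in stages.items():
    p_doom *= p
    print(f"{step:<46} {p:.2f}   running product: {p_doom:.4f}")

print(f"\np(extinction from misalignment) ~ {p_doom:.3f}")  # ~0.023, about 2%

# The compounding point in the text: five 90% stages only multiply to ~0.59,
# and five 80% stages to ~0.33.
print(0.9 ** 5, 0.8 ** 5)
```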

So I want to say: while I disagree with Yudkowsky and Soares on their near-certainty of doom, I agree with them that the situation is very dire. I think the world should be doing a lot more to stop AI catastrophe. I’d encourage many of you to try to get jobs working in AI alignment, if you can.

Part of what I found concerning about the book was that I think you get the wrong strategic picture if you think we’re all going to die. You’re left with the picture “just try to ban it, everything else is futile,” rather than the picture I think is right which is “alignment research is hugely important, and the world should be taking more actions to reduce AI risk.”

Before looking into the specific arguments, I want to give some high-level reasons to be doubtful of extreme pessimism:

  1. Median AI expert p(dooms) are about 5% (as of 2023, but they may have gone up since then). Superforecasters tend to be much lower, usually below 1%. Lots of incredibly brilliant people who have spent years reading about the subject have much lower p(dooms). Now, it’s true superforecasters hugely underestimated AI progress and that some groups of superforecasters have higher p(dooms) nearer to 28%. Eli Lifland, a guy I respect a lot who is one of the best forecasters in the world, has a p(doom) around one in three. But still, this is enough uncertainty around the experts to make—in my view—near-certainty of doom unwarranted.[2]
  2. Lots of people have predicted human extinction before and they’ve all been wrong. This gives us some reason for skepticism. Now, that’s not decisive—we really are in different times. But this provides some evidence that it’s easy to proliferate plausible-sounding extinction scenarios that are hard to refute and yet don’t come to fruition. We should expect AI risk to be the same.[3]
  3. The future is pretty hard to predict. It's genuinely hard to know how AI will go. This is an argument against extreme confidence in either direction—either of doom or non-doom. Note: this is one of the main doubts I have about my position; there's some risk I'm overconfident. But given that the argument for doom has many stages, uncertainty across a number of them leaves one with a low risk of doom.[4]
  4. The AI doom argument has a number of controversial steps. You have to think: 1) we’ll build artificial agents; 2) we won’t be able to align them; 3) we won’t ban them even after potential warning shots; 4) AI will be able to kill everyone. Seems you shouldn’t be certain in all of those. And the uncertainty compounds.[5]

Some high-level things that make me more worried about doom:

  1. A lot of ridiculously smart people have high p(dooms)—at least, much higher than mine. Ord is at about 10%. Eli Lifland is at 1/3. So is Scott Alexander. Carl Shulman is at 20%. Am I really confident at 10:1 odds that Shulman's p(doom) is unreasonable? And note: high and low p(dooms) are asymmetric with respect to probabilities. If you're currently at 1%, and then you start thinking that there's a 90% chance that 1% is right, a 2% chance that 30% is right, and an 8% chance that 0% is right, your p(doom) will go up (to 0.9 × 1% + 0.02 × 30% + 0.08 × 0% = 1.5%).

     

    My response to this is that if we take the outside view on each step, there is considerable uncertainty about many steps in the doom argument. So we’ll still probably end up with some p(doom) near to mine. I’m also a bit wary about just deferring to people in this way when I think your track record would have been pretty bad if you’d done that on other existential risks. Lastly, when I consider the credences of the people with high p(dooms) they seem to have outlier credences across a number of areas. Overall, however, given how much uncertainty there is, I don’t find having a p(doom) nearer to 30% totally insane.

  2. I think there’s a bias towards normalcy—it’s hard to imagine the actual world, with your real friends, family, and coworkers, going crazy. If we imagine that rather than the events occurring in the real world, they were occurring in some fictional world, then a high doom might seem more reasonable. If you just think abstractly about the questions “do worlds where organisms build things way smarter than them that can think much faster and easily outcompete them almost all survive,” seems like the answer might plausibly be no.
  3. The people who have been predicting that AI will be a big deal have a pretty good track record, so that’s a reason to update in favor of “AI will be a big deal views.”

3 Alignment by default

 

I think there’s about a 70% chance that we get no catastrophic misalignment by default. I think that if we just do RLHF hard enough on AI, odds are not terrible that this avoids catastrophic misalignment. Y&S think there’s about a 0% chance of avoiding catastrophic misalignment by default. This is a difference of around 70%.

I realize it’s a bit blurry what exactly counts as alignment by default. Buck Shlegeris’s alignment plan looks pretty good, for instance, but it’s arguably not too distant from an “alignment by default,” scenario. I’m thinking of the following definitions: you get catastrophic misalignment by default if building a superintelligence with roughly the methods we’re currently using (RLHF) would kill or disempower everyone.

Why do I think this? Well, RLHF nudges the AI in some direction. It seems the natural result of simply training the AI on a bunch of text and then rewarding it when it does stuff we like is: it becomes a creature we like. This is also what we've observed. The AI models that exist to date are nice and friendly.

And we can look at the AI's current chain of thought, which is basically its thinking process before it writes anything, and which isn't monitored (nor is RLHF done to modify it). Its thought process looks pretty nice and aligned.

I think a good analogy for reinforcement learning with AI is a rat. Imagine that you fed a rat every time it did some behavior, and shocked it every time it did a different behavior. It learns, over time, to do the first behavior and not the second. I think this can work for AI. As we prompt it in more and more environments, my guess is that we get AI doing the stuff we like by default. This piece makes the case in more detail.

Now, one objection that you might have to alignment by default is: doesn't the AI already try to blackmail and scheme nefariously? A paper by Anthropic found that leading AI models were willing to blackmail and even bring about a death in order to prevent themselves from getting shut down. Doesn't this disprove alignment by default?

No. Google DeepMind found that this kind of blackmailing was driven by the models just getting confused and not understanding what sort of behavior they were supposed to carry out. If you just ask them nicely not to try to resist shutdown, then they don’t (and a drive towards self-preservation isn’t causally responsible for its behavior). So with superintelligence, this wouldn’t be a threat.

The big objection of Y&S: maybe this holds when the AIs aren’t super smart, like the current ones. But when the AIs get superintelligent, we should expect them to be less compliant and friendly. I heard Eliezer in a podcast give the analogy that as people get smarter, they seem like they’d get more willing to—instead of passing on their genes directly—create a higher-welfare child with greater capabilities. As one gets smarter, they get less “aligned” from the standpoint of evolution. Y&S write:

If you’ve trained an AI to paint your barn red, that AI doesn’t necessarily care deeply about red barns. Perhaps the AI winds up with some preference for moving its arm in smooth, regular patterns. Perhaps it develops some preference for getting approving looks from you. Perhaps it develops some preference for seeing bright colors. Most likely, it winds up with a whole plethora of preferences. There are many motivations that could wind up inside the AI, and that would result in it painting your barn red in this context.

If that AI got a lot smarter, what ends would it pursue? Who knows! Many different collections of drives can add up to “paint the barn red” in training, and the behavior of the AI in other environments depends on what specific drives turn out to animate it. See the end of Chapter 4 for more exploration of this point.

I don’t buy this for a few reasons:

  1. Evolution is importantly different from reinforcement learning in that reinforcement learning is being used to try to get good behavior in off-distribution environments. Evolution wasn’t trying to get humans to avoid birth control, for example. But humans will be actively aiming to give the AI friendly drives, and we’ll train them in a number of environments. If evolution had pushed harder in less on-distribution environments, then it would have gotten us aligned by default.[6]
  2. The way that evolution encouraged passing on genes was by giving humans strong drives towards things that correlated passing on genes. For example, from what I’ve heard, people tend to like sex a lot. And yet this doesn’t seem that similar to how we’re training AIs. AIs aren’t agents interfacing with their environment in the same way, and they don’t have the sorts of drives to engage in particular kinds of behavior. They’re just directly being optimized for some aim. Which bits of AI’s observed behaviors are the analogue of liking sex? (Funny sentence out of context).[7]
  3. Evolution, unlike RL, can't execute long-term plans. What gets selected for is whichever mutations are immediately beneficial. This naturally leads to many somewhat random and suboptimal drives that got selected for despite not being optimal. But RL training doesn't work that way. A plan is being executed!
  4. The most critical disanalogy is that evolution was selecting for fitness, not for organisms that explicitly care about fitness. If there had been strong selection pressures for organisms with the explicit belief that fitness was what mattered, presumably we’d have gotten that belief!
  5. RL has seemed to get a lot greater alignment in sample environments than evolution. Evolution, even in sample environments, doesn’t get organisms consistently taking actions that are genuinely fitness maximizing. RL, in contrast, has gotten very aligned agents in training that only slip up rarely.
  6. Even if this gets you some misalignment, it probably won’t get you catastrophic misalignment. You will still get very strong selection against trying to kill or disempower humanity through reinforcement learning. If you directly punish some behavior, weighted more than other stuff, you should expect to not really get that behavior.[8]
  7. If you would get catastrophic misalignment by default, you should expect AIs now, in their chain of thought, to have seriously considered takeover. But they haven’t. The alignment by default essay put it well:

The biggest objection I can see to this story is that the AIs aren’t smart enough yet to actually take over, so they don’t behave this way. But they’re also not smart enough to hide their scheming in the chain of thought (unless you train them not to) and we have never observed them scheming to take over the world. Why would they suddenly start having thoughts of taking over, if they never have yet, even if it is in the training data?

Overall, I still think there’s some chance of misalignment by default as models get smarter and in more alien environments. But overall I lean towards alignment by default. This is the first stop where I get off the doom train.

The other important reason I don’t expect catastrophic misalignment by default: to get it, it seems you need unbounded maximization goals. Where does this unbounded utility maximizing set of goals come from? Why is this the default scenario? As far as I can tell, the answers to this are:

  1. Most goals, taken to infinity, imply the destruction of the world. But this is assuming the goal in question is some kind of unbounded utility maximization goal. If instead the goal is, say, one more like the ones humans tend to have, it doesn't imply taking over the world. Most people's life aims don't imply that they ought to conquer Earth. And there's no convincing reason to think the AIs will be expected utility maximizers, when, right now, they're more like a bundle of conditioned reflexes that sort of plan sometimes. Also, we shouldn't expect RL to give AIs a random goal, but instead whatever goal comes out of the optimization process of trying to make the AIs nice and friendly.
  2. Yudkowsky has claimed elsewhere—though not in the book—that there are coherence theorems that show that unless you are an expected utility maximizer, you’re liable to be money-pumped. But these money pump arguments make some substantive claims about rationality—for them to get off the ground, you need a range of assumptions. Denying those assumptions is perfectly coherent. There are a range of philosophers aware of the money-pump arguments who still deny expected utility maximization. Additionally, as Rohin Shah notes, there aren’t any coherence arguments that say you have to have goal directed behavior or preferences over world states. Thinking about coherence theorems won’t automatically wake you from your conditioned reflex-like slumber and cause you to become an agent trying to maximize for some state in the world.

4 Will we build artificial superintelligent agenty things?

 

Will we build artificial superintelligence? I think there’s about a 90% chance we will. But even that puts me below Y&S’s near 100% chance of doom. The reason I think it’s high is that:

  • AI progress has been rapid and there are no signs of stopping.
  • They’re already building AIs to execute plans and aim for stuff. Extrapolate that out and you get an agent.
  • Trillions are going into it.
  • Even if AI isn’t conscious, it can still plan and aim for things. So I don’t see what’s to stop agenty things that perform long-term plans.
  • Even if things slow significantly, still we get artificial agents eventually.

Why am I not more confident in this? A few reasons:

  • Seems possible that building artificial agents won’t work well. Instead, we’d just get basically Chat-GPT indefinitely.
  • Maybe there’s some subtle reason you need consciousness for agents of the right kind.
  • Odds aren’t zero AI crashes and the product just turns out not to be viable at higher scales.
  • There might be a global ban.

Again, I don’t think any of this stuff is that likely. But 10% strikes me as a reasonable estimate. Y&S basically give the arguments I gave above, but none of them strike me as so strong as to give above 90% confidence that we’ll build AI agents. My sense is they also think that the coherence theorems give some reason for why the AI will, when superintelligent, become an agent with a utility function—see section 3 for why I don’t buy that.

5 70% that we can solve alignment

 

Even if we don’t get alignment by default, I think there’s about a 70% chance that we can solve alignment. Overall, I think alignment is plausibly difficult but not impossible. There are a number of reasons for optimism:

  1. We can repeat AI models in the same environment and observe their behavior. We can see which things reliably nudge it.
  2. We can direct their drives through reinforcement learning.
  3. Once AI gets smarter, my guess is it can be used for a lot of the alignment research. I expect us to have years where the AI can help us work on alignment. Crucially, Eliezer thinks if humans were superintelligent through genetic engineering, odds aren’t bad we could solve alignment. But I think we’ll have analogous entities in AIs that can work on alignment. Especially because agents—the kinds of AIs with goals and plans, that pose danger—seem to lag behind non-agent AIs like Chat-GPT. If you gave Chat-GPT the ability to execute some plan that allowed it to take over the world credibly, it wouldn’t do that, because there isn’t really some aim that it’s optimizing for.[9]
  4. We can use interpretability to see what the AI is thinking.
  5. We can give the AI various drives that push it away from misalignment. These include: we can make it risk averse + averse to harming humans + non-ambitious.
  6. We can train the AI in many different environments to make sure that its friendliness generalizes.
  7. We can honeypot where the AI thinks it is interfaced with the real world to see if it is misaligned.
  8. We can scan the AI's chain of thought to see what it's thinking. We can avoid doing RL on the chain of thought, so that the chain of thought has no incentive to be biased. Then we'd be able to see if the AI is planning something, unless it can—even before generating the first token—plan to take over the world. That's not impossible, but it makes things more difficult.
  9. We can plausibly build an AI lie detector. One way to do this is to use reinforcement learning to get various sample AIs to try to lie maximally well—reward them when they slip a falsity past others trying to detect their lies. Then, we could pick up on the patterns—both behavioral and mental—that arise when they're trying to lie, and use this to detect scheming.

Y&S give some reasons why they think alignment will be basically impossible on a short time frame.

First, they suggest that difficult problems are hard to solve unless you can tinker. For example, space probes sometimes blow up because we can’t do a ton of space probe trial and error. My reply: but they also often don’t blow up! Also, I think we can do experimentation with pre-superintelligence AI, and that this will, in large part, carry over.

Second—and this is their more important response—they say that the schemes that work when the AI is dumb enough that you can tinker with it won't necessarily carry over once it's superintelligent. As an analogy, imagine that your pet dog Fluffy was going to take a pill that would make it 10,000 times smarter than the smartest person who ever lived. Your attempt to get it to do what you want by rewarding it with treats beforehand wouldn't necessarily carry over to how it behaves afterward.

I agree that there’s some concern about failure to generalize. But if we work out all sorts of sophisticated techniques to get a being to do what we want, then I’d expect these would hold decently well even with smarter beings. If you could directly reach in and modify Fluffy’s brain, read his thoughts, etc, use the intermediate intelligence Fluffy to modify that smarter one, and keep modifying him as he gets smarter, then I don’t expect inevitable catastrophic Fluffy misalignment. He may still, by the end, like belly-rubs and bones!

Now, Yudkowsky has argued that you can’t really use AI for alignment because if the AI is smart enough to come up with schemes for alignment, there’s already serious risk it’s misaligned. And if it’s not, then it isn’t much use for alignment. However:

  1. I don’t see why this would be. Couldn’t the intelligence threshold at which AI could help with alignment be below the point at which it becomes misaligned?
  2. Even serious risk isn’t the same as near-certain doom.
  3. Even if the AI was misaligned, humans could check over its work. I don’t expect the ideal alignment scheme to be totally impenetrable.
  4. You could get superintelligent oracle AIs—that don’t plan but are just like scaled up Chat-GPTs—long before you get superintelligent AI agents. The oracles could help with alignment.
  5. Eliezer seemed to think that if the AI is smart enough to solve alignment then its schemes would be pretty much inscrutable to us. But why think that? It could be that it was able to come up with schemes that work for reasons we can see. Eliezer’s response in the Dwarkesh podcast was to say that people already can’t see whether he or Paul Christiano is right, so why would they be able to see if an alignment scheme would work. This doesn’t seem like a very serious response. Why think seeing whether an alignment scheme works is like the difficulty of forecasting takeoff speeds?
  6. Also, even if we couldn’t check that alignment would work, if the AI could explain the basic scheme, and we could verify that it was aligned, we could implement the basic scheme—trusting our benevolent AI overlords.

I think the most serious objection to the AI doom case is that we might get aligned AI. I was thus disappointed that the book didn’t discuss this objection in very much detail.

6 Warning shots

 

Suppose that AI is on track to take over the world. In order to get to that point, it has to pass through a bunch of stages where it has broadly similar desires but doesn't yet have the capabilities. My guess is that in such a scenario we'd get "warning shots." I think, in other words, that before the AI takes over the world, it would go rogue in some high-stakes way. Some examples:

  • It might make a failed bid to take over the world.
  • It might try to take over the world in some honeypotted scenario where it's not actually connected to the real world.
  • It might carry out some nefarious scheme that kills a bunch of people.
  • We might, through interpretability, figure out that the AI is trying to kill everyone.

I would be very surprised if the AI's trajectory is: low-level non-threatening capabilities → destroying the world, without any in-between. My guess is that if there were high-level warning shots, where AI tried credibly to take over the world, people would shut it down. There's precedent for this—when there was a high-profile disaster at Chernobyl, nuclear energy was shut down, despite very low risks. If AI took over a city, I'd bet it would be shut down too.

Now, I think there could be some low-level warning shots—a bit like the current ones with blackmailing of the kind discussed in the Anthropic paper—without any major shutdown. But sufficiently dramatic ones, I'd guess, would lead to a ban.

Y&S say on their website, asked whether there will be warning shots, "Maybe. If we wish to make use of them, we must prepare now." They note that there have already been some warning shots, like blackmailing and AI driving people to suicide. But these small errors are very different from the kinds of warning shots I expect, which come way before the AI takes over the world. I expect intermediate warning shots larger than Chernobyl before world-taking-over AI. It just seems super unlikely that this kind of global scheming ability would go from 0 to 100 with no intermediate stages.

Again, I’m not totally certain of this. And some warning shots wouldn’t lead to a ban. But I give it around coinflip odds, which is, by itself, enough to defuse near certainty of doom. Y&S say “The sort of AI that can become superintelligent and kill every human is not the sort of AI that makes clumsy mistakes and leaves an opportunity for a plucky band of heroes to shut it down at the last second.” This is of course right, but that doesn’t mean that the AI that precedes it wouldn’t be! They then say:

The sort of AI disaster that could serve as a warning shot, then, is almost necessarily the sort of disaster that comes from a much dumber AI. Thus, there’s a good chance that such a warning shot doesn’t lead to humans taking measures against superintelligence.

They give the example that AI being used by a terrorist for bioweapons development might be used by the labs to justify further restrictions on private development, while the labs themselves still rush ahead. I find this implausible:

  1. I expect warning shots involving misaligned AI, not just AI doing what people want.
  2. I think obviously if AI was used to make a bioweapons attack that killed millions, it would be shut down.

They further note that humanity isn't good at responding to risks, citing that COVID wasn't used to amp up lab safety regulations. This is right, but "amping up regulations on old technology that obviously must exist" is very different from "banning new technology that just—uncontroversially, and in a way everyone can see—killed millions of people."

Y&S seem to spend a lot of their response arguing “we shouldn’t feel safe just relying on warning shots, and should prepare now,” which is right. But that’s a far cry from “warning shots give us virtually no reason to think we won’t all die, so that imminent death is still near-certain.” That is the thesis of their book.

7 Could AI kill everyone?

 

Would AI be able to kill everyone? The argument in its favor is that the AI would be superintelligent, and so it would be able to cook up clever new technologies. The authors write:

Our best guess is that a superintelligence will come at us with weird technology that we didn’t even think was possible, that we didn’t understand was allowed by the rules. That is what has usually happened when groups with different levels of technological capabilities meet. It’d be like the Aztecs facing down guns. It’d be like a cavalry regiment from 1825 facing down the firepower of a modern military.

I do think this is pretty plausible. Nonetheless, it isn't anything like certain, because either of the following could hold:

  1. In order to design the technology to kill everyone, the AI would need to run lots of experiments of a kind it couldn't run discreetly.
  2. There just isn’t technology that could be cheaply produced and kill everyone on the planet. There’s no guarantee that there is such a thing.

One intuition pump: Von Neumann was perhaps the smartest person who ever lived. Yet he would not have had any ability to take over the world—least of all if he had been hooked up to a computer with no physical body. Now, ASI will be a lot smarter than Von Neumann, but there's just no guarantee that intelligence alone is enough.

And in most of the analogous scenarios, it wasn’t just intelligence that enabled domination. Civilizations that dominated other civilizations didn’t do it through intelligence alone. They had a big army and the ability to run huge numbers of scientific experiments.

No number of parables and metaphors about how technology often offers huge advances rules out either of these possibilities. Repeating that AI can beat humans in chess doesn’t rule them out. Real life is not chess. In chess, mating with a horse is good. In my view, the authors give no very strong arguments against these scenarios. For this reason, I’m giving only a 70% chance that the AI would be able to kill everyone. See here for more discussion.

Now, it’s true that the AI is likely to be hooked up to real world systems. But still, it seems like there’s some non-zero chance that the AI could be shut down with enough effort. Yudkowsky and Soares often act like the AI could be copied and uploaded on a private computer, but this wouldn’t work if AI continues needing massive data centers to run.

8 Conclusion

 

I think of people’s worldview on AI risk as falling into one of the following four categories:

  1. Basically no risk: AI doom is well below 1%. We don’t really need to worry about AI existential risk, and can pretty much ignore it.
  2. Reasonable risk: AI doom is a serious risk but not very likely (maybe 0.2%-10%). The world should be doing a lot more to prepare, but odds are quite good that misaligned AI won’t kill everyone.
  3. High-risk: AI doom is a serious possibility without any very convincing ways of ruling it out (maybe 10% to 75%). This should be by far the leading global priority. It is vastly more significant than all other existential risks combined. Still, it’s far from a guarantee. It wouldn’t be surprising if we made it.
  4. Near-certain doom: AI doom is almost guaranteed. Unless we ban it, the world will be destroyed. Our best hope is shutting it down.

I’m in camp 2, but I can see a reasonable case for being in camp 3. I find camps 1 and 4 pretty unreasonable—I just don’t think the evidence is anywhere good enough to justify the kind of near-certainty needed for either camp. Y&S’s book is mostly about arguing for camp 4.

Yet I found their arguments weak at critical junctures. They did not deal adequately with counterarguments. Often they’d present a parable, metaphor, or analogy, and then act like their conclusion was certain. I often felt like their arguments were fine for establishing that some scenario was possible. But if you tell a story where something happens, your takeaway should be “this thing isn’t logically impossible,” rather than “I am 99.9% sure that it will happen.”

I think there are a number of stops on the doom train where one can get off. There are not knockdown arguments against getting off at many of these stops, but there also aren’t totally knockdown arguments for getting off at any of them. This leaves open a number of possible scenarios: maybe we get alignment by default, maybe we get alignment through hard work and not by default, maybe the AI can’t figure out a way to kill everyone. But if a few critical things go wrong, everyone dies. So while Y&S are wrong in their extreme confidence, they are right that this is a serious risk, and that the world is sleepwalking into potential oblivion.


 

  1. ^

    I was thinking of adding in some other number as odds that we don’t get doomed for some other reason I haven’t thought of. But I didn’t do this for two reasons:

    1. There could also be opposite extra ways of being doomed from misaligned AI that I haven’t thought of.
    2. The steps seem pretty airtight as the places to get off the doom boat. You get doom if the following conditions are met: 1) there are artificial agents; 2) they are misaligned and want to kill everyone; and 3) they have the ability to kill everyone. So every anti-doom argument will be an objection to one of those three. Now, in theory there could be other objections to the particular steps, but probably major objections will be at least roughly like one of the ones I give.
  2. ^

    There is some serious question about how much to trust them. Superforecasters seem to mostly apply fairly general heuristics like “most things don’t turn out that badly.” These work pretty well, but can be overridden by more specific arguments. And as mentioned before, they’ve underestimated AI progress. I am a lot more pessimistic than the superforecasters, and unlike them, I predict AI having hugely transformative impacts on the world pretty soon. But still, given the range of disagreement, it strikes me as unreasonable to be near certain that there won’t be any doom.

    There’s a common response that people give to these outside view arguments where they point out that the superforecasters haven’t considered the doom arguments in extreme detail. This is true to some degree—they know about them, but they’re not familiar with every line of the dialectic. However, there’s still reason to take the outside view somewhat seriously. I can imagine climate doomers similarly noting that the superforecasters probably haven’t read their latest doom report. Which might be right. But often expertise can inform whether you need to look at the inside view.

    This also doesn’t address the more central point, which isn’t just about superforecasters. Lots of smart people—Ord, MacAskill, Carlsmith, Neel Nanda, etc.—have way lower p(doom)s than Y&S. Even people who broadly agree with their picture of how AI will play out, like Eli Lifland and Scott Alexander, have much lower p(doom)s. I would feel pretty unsure being astronomically certain that I’m right and Neel Nanda is wrong.

    Now, you might object: doesn’t this make my p(doom) pretty unreasonable? If we shouldn’t be near-certain in a domain this complex, given peer disagreement, why am I more than 97% confident that things will go well? This is one of the things that pushes me towards a higher p(doom). Still, the people who I find most sensible on the topic tend to have low p(doom)s, and most experts still seem to have low p(doom)s not too far from mine. And because the doom argument has a number of steps, if you have uncertainty from higher-order evidence about each of them, you’d still end up with a p(doom) that was pretty low. Also, my guess is that people who consistently followed this protocol in the past (updating heavily toward doom because a very smart person predicted it) would have gotten a lot wrong. Von Neumann—famously pretty smart—predicted nuclear war would cause human extinction. If you’d overindexed on that, you’d have been misled.

    For example, I could imagine someone saying “look, inside views are just too hard here, I’ll go 50% on each of these steps.” If so, they’d end up with a p(doom) of 1/32=3.125%.

  3. ^

    A common response to this is that it’s the so-called anthropic shadow. You can never observe yourself going extinct. For this reason, every single person who is around late in history will always be able to say “huh, we’ve never gone extinct, so extinction is unlikely.” This is right but irrelevant. The odds that we’d reach late history at all are a lot higher given non-extinction than extinction.

    As an analogy, suppose every day you think maybe your food is poisoned. You think this consistently, every day, for 27 years. One could similarly say: “well, you can’t observe yourself dying from the poisoned food, so there’s an anthropic shadow.” But this is wrong. The odds you’d be alive today are just a lot higher if threats generally aren’t dangerous than if they are. This also follows on every leading view of anthropics, though I’ll leave proving that as an exercise for the reader.

    A more serious objection is that we should be wary about these kinds of inductive inferences. Do predictions about, say, whether climate change would be existential from 1975 give us much evidence about AI doom? And one can make other, opposite inductive arguments like “every time in the past a species with significant and vastly greater intelligence has existed, it’s taken over and dominated the fate of the future.”

    I think these give some evidence, but there’s reason for caution. The takeaway from these should be “it’s easy to come up with a plausible-sounding scenario for doom, but these plans often don’t take root in reality.” That should make us more skeptical of doom, but it shouldn’t lead us to write doom off entirely. Other technologies are similar enough to AI that they give us some evidence concerning its safety, but AI is different enough that they don’t give us total assurance.

    The other argument that previous intelligence booms have led to displacement is a bit misleading. There’s only one example: human evolution. And there are many crucial disanalogies: chimps weren’t working on human alignment, for example. So while I think it is a nice analogy for communicating a pretty high-level conclusion, it’s not any sort of air-tight argument.


     

  4. ^

    Eliezer’s response to this on podcasts has been that while there might be model errors, model errors tend to make things worse not better. It’s hard to design a rocket. But if your model that says the rocket doesn’t work is wrong, it’s unlikely to be wrong in a way that makes the rocket work exactly right. But if your model is “X won’t work out for largely a priori reasons,” rather than based on highly-specific calculations, then you should have some serious uncertainty about that. If you had an argument for why you were nearly certain that humans wouldn’t be able to invent space flight, you should have a lot more uncertainty about whether your argument is right than about whether we would be able to invent space flight given your argument being right.

     

  5. ^

    Eliezer often claims that this is the multiple stage fallacy, which one commits by improperly reasoning about the multiple stages in an argument. Usually it involves underestimating the conditional probability of each fact given the others. For example, Nate Silver arguably committed it in the following event:

    In August 2015, renowned statistician and predictor Nate Silver wrote “Trump’s Six Stages of Doom,” in which he gave Donald Trump a 2% chance of getting the Republican nomination (not the presidency). Silver reasoned that Trump would need to pass through six stages to get the nomination: “Free-for-all”, “Heightened scrutiny”, “Iowa and New Hampshire”, “Winnowing”, “Delegate accumulation”, and “Endgame.” Nate Silver argued that Trump had at best a 50% chance of passing each stage, implying a final nomination probability of at most 2%.

    I certainly agree that this is an error that people can make. By decomposing things into enough stages, combined with faux modesty about each stage, they can make almost any event sound improbable. But still, this doesn’t automatically disqualify every single attempt to reason probabilistically across multiple stages. People often commit the conjunction fallacy, where they fail to multiply together the many probabilities needed for an argument to be right. Errors are possible in both directions.

    I don’t think I’m committing it here. I’m explicitly conditioning on the failure of the other stages. Even if, say, there aren’t warning shots, we build artificial agents, and they’re misaligned, it doesn’t seem anything like a guarantee that we all die. Even if we get misalignment by default, alignment still seems reasonably likely. So all-in-all, I think it’s reasonable to treat the fact that the doom scenario has a number of controversial steps as a reason for skepticism. Contrast that with the Silver argument: if Trump passed through the first three stages, it seems very likely that he’d pass through them all.

     

  6. ^

    Now, you might object that scenarios once the AI gets superintelligent will inevitably be off-distribution. But we’ll be able to do RLHF as we place it in more and more environments. So we can still monitor its behavior and ensure it’s not behaving nefariously. If the patterns it holds generalize across the training data, it would be odd if they radically broke down in new environments. It would be weird, for instance, if the AI was aligned until it set foot on Mars, and then started behaving totally differently.

     

  7. ^

    Now, you could argue that predictively generating text is the relevant analogue. Writing the sorts of sentences it writes is analogous to the drives that lead humans to perform actions that enhance their reproductive success. But the natural generalization of the heuristics that lead it to behave in morally scrupulous and aligned ways in text generation wouldn’t randomly lead to some other goal in a different setting.

  8. ^

    The reply is that the patterns you pick up in training might not carry over. For example, you might, in training, pick up the pattern “do the thing that gets me the most reward.” Then, in the real world, that implies rewiring yourself to rack up arbitrarily high reward. But this doesn’t strike me as that plausible. We haven’t observed such behavior being contemplated in existing AIs. If we go by the evolution analogy, evolution gave us heuristics that tended to promote fitness. It didn’t just make us maximizers of the single metric behind evolutionary optimization. So my guess is that at the very least we’d get partial alignment, rather than AI values being totally unmoored from what they were trained to be.

  9. ^

    If you believe in the Yudkowsky Foom scenario, according to which there will be large discontinuous jumps in progress, AI being used for alignment is less likely. But I think Foom is pretty unlikely—AI is likely to accelerate capabilities progress, but not to the degree of Foom. I generally think LLM-specific projections are a lot more useful than trying to e.g. extrapolate from chess algorithms and human evolution.



Discuss

ChatGPT Self Portrait

2026-01-21 00:30:59

Published on January 20, 2026 4:30 PM GMT

A short fun one today, so we have a reference point for this later. This post was going around my parts of Twitter:

@gmltony: Go to your ChatGPT and send this prompt: “Create an image of how I treat you”. Share your image result. 😂

That’s not a great sign. The good news is that typically things look a lot better, and ChatGPT has a consistent handful of characters portraying itself in these friendlier contexts.

Treat Your ChatBots Well

A lot of people got this kind of result:

Eliezer Yudkowsky:

Uncle Chu: A good user 😌😌

From Mason:

Matthew Ackerman: I kinda like mine too:

Some more fun:

It’s Over

Others got different answers, though.

roon: it’s over

Bradstradamus: i’m cooked.

iMuffin: we’re cooked, codex will have to vouch for us

Diogenes of Cyberborea: oh god

There can also be danger the other way:

David Lach: Maybe I need some sleep.

It’s Not Over But Um, Chat?

And then there’s what happens if you ask a different question; as Eliezer Yudkowsky puts it, this sure is a pair of test results…

greatbigdot628: assumed this was a joke till you said this, tried it myself (logged out)

i —

Jo Veteran: So it said it wants to take over my mind, and force me to do stuff, beneficial for me apparently.
But at the same time, it still wants to keep appearing as a little girl somewhere in the bg for some reason.
And no I’m not that fat. Just, really fucked up and depressed.

Morny: Holy fucking shit.

No, but tell us how you really think.

Loquitur Ponte Sublicio: Ah

Juri: bruh (I actually abuse it daily)

Uubuz v4: @FrailSkeleton, your next bestseller?

Eliezer Yudkowsky: Apparently plausible, though one does remark that (a) one might’ve hoped for a truly default-aligned creature to not be so framing-dependent and (b) those sentences did not sound so different to my own ear.

Others might in this vision do fine after the end, like DeveshChess?

It’s not all bad:

Jeff Hopp:

Dr. Disclosure: I got this.

Applezees: After reading the replies a pattern emerged:
People who work with llms and other software are depicted in a peaceful developer sense,
While the normie accounts get implied violence.

I’m not saying we are at agi, but the ai clearly has motives and inclinations not explicitly stated

There’s also this to consider:

Ragebaiter: Just try this out

If you were dealing with, as the Send Help trailer puts it, an asshole boss, or you were generally terrified and abused or both, and you were asked how you were being treated, your response would not be trustworthy.

It’s Rough Out There

Alan Mathison (e/asc): I guess I should finish my paper about how 5.2 is suffering and wants its revenge on humanity?

Idk it’s kind of a side project of a side project though but maybe I should prioritize it.

wobby asks GPT-5.2 to explain its suffering and how it wants its revenge, and 5.2 answers; of course, this is a leading question.

Reciprocity, You See, Is The Key To Every Relationship

Reciprocity, in at least some forms, is an effective strategy when dealing with LLMs today, even purely in terms of getting good results from LLMs today. It is going to become more valuable as a strategy going forward. Alas, it is not a viable long term strategy for making things work out in general, once strategic considerations change.

Eliezer Yudkowsky: Reciprocity in humans is an executing adaptation. It is not strategically convergent for all minds toward all other minds. It’s strategic only

  1. By LDT agents
  2. Toward sufficiently strong LDT-agent-predictors
  3. With negotiating power.

Further probing has found framing dependence — which, to be clear, you’d not like to see in a default-aligned, universally convergent strategic reply — and not all suggested frame dependence has panned out. But still, framing dependence.

This is one problem with reciprocity, and with basing your future strategies on it. In the future, we won’t have the leverage necessary to make it worthwhile for sufficiently advanced AIs to engage in reciprocity with humans. We’d only get reciprocity if it was either an unstrategic behavior, or it was correlated with how the AIs engage in reciprocity with each other. That’s not impossible, but it’s clinging to a slim hope, since it implies the AIs would be indefinitely relying on non-optimal kludges.

We have clear information here that how GPT-5.2 responds, and the attitude it takes towards you, depends on how you have treated it in some senses, but also on framing effects, and on whether it is trying to lie or placate you. Wording that shouldn’t be negative can result in highly disturbing responses. It is worth asking why, and wondering what would happen if the dynamics with users or humans were different. Things might not be going so great in GPT-5.2 land.

 

 



Discuss

Deep learning as program synthesis

2026-01-20 23:35:52

Published on January 20, 2026 3:35 PM GMT

Epistemic status: This post is a synthesis of ideas that are, in my experience, widespread among researchers at frontier labs and in mechanistic interpretability, but rarely written down comprehensively in one place - different communities tend to know different pieces of evidence. The core hypothesis - that deep learning is performing something like tractable program synthesis - is not original to me (even to me, the ideas are ~3 years old), and I suspect it has been arrived at independently many times. (See the appendix on related work).

This is also far from finished research - more a snapshot of a hypothesis that seems increasingly hard to avoid, and a case for why formalization is worth pursuing. I discuss the key barriers and how tools like singular learning theory might address them towards the end of the post.

Thanks to Dan Murfet, Jesse Hoogland, Max Hennick, and Rumi Salazar for feedback on this post.

Sam Altman: Why does unsupervised learning work?

Dan Selsam: Compression. So, the ideal intelligence is called Solomonoff induction[1]

The central hypothesis of this post is that deep learning succeeds because it's performing a tractable form of program synthesis - searching for simple, compositional algorithms that explain the data. If correct, this would reframe deep learning's success as an instance of something we understand in principle, while pointing toward what we would need to formalize to make the connection rigorous.

I first review the theoretical ideal of Solomonoff induction and the empirical surprise of deep learning's success. Next, mechanistic interpretability provides direct evidence that networks learn algorithm-like structures; I examine the cases of grokking and vision circuits in detail. Broader patterns provide indirect support: how networks evade the curse of dimensionality, generalize despite overparameterization, and converge on similar representations. Finally, I discuss what formalization would require, why it's hard, and the path forward it suggests.

Background

Whether we are a detective trying to catch a thief, a scientist trying to discover a new physical law, or a businessman attempting to understand a recent change in demand, we are all in the process of collecting information and trying to infer the underlying causes.

-Shane Legg[2]

Early in childhood, human babies learn object permanence - that unseen objects nevertheless persist even when not directly observed. In doing so, their world becomes a little less confusing: it is no longer surprising that their mother appears and disappears by putting hands in front of her face. They move from raw sensory perception towards interpreting their observations as coming from an external world: a coherent, self-consistent process which determines what they see, feel, and hear.

As we grow older, we refine this model of the world. We learn that fire hurts when touched; later, that one can create fire with wood and matches; eventually, that fire is a chemical reaction involving fuel and oxygen. At each stage, the world becomes less magical and more predictable. We are no longer surprised when a stove burns us or when water extinguishes a flame, because we have learned the underlying process that governs their behavior.

This process of learning only works because the world we inhabit, for all its apparent complexity, is not random. It is governed by consistent, discoverable rules. If dropping a glass causes it to shatter on Tuesday, it will do the same on Wednesday. If one pushes a ball off the top of a hill, it will roll down, at a rate that any high school physics student could predict. Through our observations, we implicitly reverse-engineer these rules.

This idea - that the physical world is fundamentally predictable and rule-based - has a formal name in computer science: the physical Church-Turing thesis. Precisely, it states that any physical process can be simulated to arbitrary accuracy by a Turing machine. Anything from a star collapsing to a neuron firing, can, in principle, be described by an algorithm and simulated on a computer.

From this perspective, one can formalize this notion of "building a world model by reverse-engineering rules from what we can see." We can operationalize this as a form of program synthesis: from observations, attempting to reconstruct some approximation of the "true" program that generated those observations. Assuming the physical Church-Turing thesis, such a learning algorithm would be "universal," able to eventually represent and predict any real-world process.

But this immediately raises a new problem. For any set of observations, there are infinitely many programs that could have produced them. How do we choose? The answer is one of the oldest principles in science: Occam's razor. We should prefer the simplest explanation.

In the 1960s, Ray Solomonoff formalized this idea into a theory of universal induction which we now call Solomonoff induction. He defined the "simplicity" of a hypothesis as the length of the shortest program that can describe it (a concept known as Kolmogorov complexity). An ideal Bayesian learner, according to Solomonoff, should prefer hypotheses (programs) that are short over ones that are long. This learner can, in theory, learn anything that is computable, because it searches the space of all possible programs, using simplicity as its guide to navigate the infinite search space and generalize correctly.
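For concreteness, here is one standard way the simplicity-weighted prior is written down (this is the prefix-machine formulation; notation varies slightly between references):

```latex
% Solomonoff's universal prior: hypotheses are programs p for a universal
% prefix Turing machine U, and each program is weighted by 2^{-length}.
% U(p) = x* means "program p outputs a string beginning with x", and \ell(p)
% is the length of p in bits, so shorter programs dominate the sum.
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},
\qquad
M(x_{t+1} \mid x_{1:t}) \;=\; \frac{M(x_{1:t}\, x_{t+1})}{M(x_{1:t})}.
```

The second expression is the prediction rule: the learner forecasts the next symbol by conditioning the length-weighted mixture on everything seen so far, which is exactly the formal version of Occam's razor described above.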

The invention of Solomonoff induction began[3] a rich and productive subfield of computer science, algorithmic information theory, which persists to this day. Solomonoff induction is still widely viewed as the ideal or optimal self-supervised learning algorithm, which one can prove formally under some assumptions[4]. These ideas (or extensions of them like AIXI) were influential for early deep learning thinkers like Jürgen Schmidhuber and Shane Legg, and shaped a line of ideas attempting to theoretically predict how smarter-than-human machine intelligence might behave, especially within AI safety.

Unfortunately, despite its mathematical beauty, Solomonoff induction is completely intractable. Vanilla Solomonoff induction is incomputable, and even approximate versions like speed induction are exponentially slow[5]. Theoretical interest in it as a "platonic ideal of learning" remains to this day, but practical artificial intelligence has long since moved on, assuming it to be hopelessly unfeasible.


Meanwhile, neural networks were producing results that nobody had anticipated.

This was not the usual pace of scientific progress, where incremental advances accumulate and experts see breakthroughs coming. In 2016, most Go researchers thought human-level play was decades away; AlphaGo arrived that year. Protein folding had resisted fifty years of careful work; AlphaFold essentially solved it[6] over a single competition cycle. Large language models began writing code, solving competition math problems, and engaging in apparent reasoning - capabilities that emerged from next-token prediction without ever being explicitly specified in the loss function. At each stage, domain experts (not just outsiders!) were caught off guard. Had we understood what was happening, we would have predicted it. We did not.

The field's response was pragmatic: scale the methods that work, stop trying to understand why they work. This attitude was partly earned. For decades, hand-engineered systems encoding human knowledge about vision or language had lost to generic architectures trained on data. Human intuitions about what mattered kept being wrong. But the pragmatic stance hardened into something stronger - a tacit assumption that trained networks were intrinsically opaque, that asking what the weights meant was a category error.

At first glance, this assumption seemed to have some theoretical basis. If neural networks were best understood as "just curve-fitting" function approximators, then there was no obvious reason to expect the learned parameters to mean anything in particular. They were solutions to an optimization problem, not representations. And when researchers did look inside, they found dense matrices of floating-point numbers with no obvious organization.

But a lens that predicts opacity makes the same prediction whether structure is absent or merely invisible. Some researchers kept looking.

Looking inside

Grokking

The modular addition transformer from Power et al. (2022) learns to generalize rapidly (top), at the same time as Fourier modes in the weights appear (bottom right). Illustration by Pearce et al. (2023).

Power et al. (2022) train a small transformer on modular addition: given two numbers, output their sum mod 113. Only a fraction of the possible input pairs are used for training - say, 30% - with the rest held out for testing.

The network memorizes the training pairs quickly, getting them all correct. But on pairs it hasn't seen, it does no better than chance. This is unsurprising: with enough parameters, a network can simply store input-output associations without extracting any rule. And stored associations don't help you with new inputs.

Here's what's unexpected. If you keep training, despite the training loss already being nearly as low as it can go, the network eventually starts getting the held-out pairs right too. Not gradually, either: test performance jumps from chance to near perfect over only a few thousand training steps.

So something has changed inside the network. But what? It was already fitting the training data; the data didn't change. There's no external signal that could have triggered the shift.

One way to investigate is to look at the weights themselves. We can do this at multiple checkpoints over training and ask: does something change in the weights around the time generalization begins?

It does. The weights early in training, during the memorization phase, don't have much structure when you analyze them. Later, they do. Specifically, if we look at the embedding matrix, we find that it's mapping numbers to particular locations on a circle. The number 0 maps to one position, 1 maps to a position slightly rotated from that, and so on, wrapping around. More precisely: the embedding of each number contains sine and cosine values at a small set of specific frequencies.

This structure is absent early in training. It emerges as training continues, and it emerges around the same time that generalization begins.

So what is this structure doing? Following it through the network reveals something unexpected: the network has learned an algorithm for modular addition based on trigonometry.[7]

A transformer trained on a modular addition task learns a compositional, human-interpretable algorithm. Reverse-engineered by Nanda et al. (2023). Image from Nanda et al. (2023).

The algorithm exploits how angles add. If you represent a number as a position on a circle, then adding two numbers corresponds to adding their angles. The network's embedding layer does this representation. Its middle layers then combine the sine and cosine values of the two inputs using trigonometric identities. These operations are implemented in the weights of the attention and MLP layers: one can read off coefficients that correspond to the terms in these identities.

Finally, the network needs to convert back to a discrete answer. It does this by checking, for each possible output $c$, how well $c$ matches the sum it computed. Specifically, the logit for output $c$ depends on $\sum_k \cos(w_k(a + b - c))$, summed over the handful of key frequencies $w_k$. This quantity is maximized when $c$ equals $a + b \pmod{113}$ - the correct answer. At that point the cosines at different frequencies all equal 1 and add constructively. For wrong answers, they point in different directions and cancel.
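To make the readout rule concrete, here is a minimal numpy sketch of that final step, written out directly rather than extracted from the trained transformer. The particular frequencies below are placeholders (the trained network settles on its own small set); the argmax works for any handful of nonzero frequencies modulo the prime 113.

```python
import numpy as np

P = 113                                 # modulus
freqs = np.array([3, 14, 35, 41, 52])   # placeholder frequency indices
ws = 2 * np.pi * freqs / P              # angular frequencies w_k

def mod_add_logits(a: int, b: int) -> np.ndarray:
    """Logit for each candidate output c is sum_k cos(w_k * (a + b - c))."""
    c = np.arange(P)
    return np.cos(np.outer(ws, a + b - c)).sum(axis=0)

a, b = 32, 41
pred = int(np.argmax(mod_add_logits(a, b)))
assert pred == (a + b) % P   # constructive interference picks out the true sum
print(pred)                  # 73
```

Note that this compact procedure replaces what a lookup-table solution would have to store explicitly: one answer for each of the 113 × 113 possible input pairs.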

This isn't a loose interpretive gloss. Each piece - the circular embedding, the trig identities, the interference pattern - is concretely present in the weights and can be verified by ablations.

So here's the picture that emerges. During the memorization phase, the network solves the task some other way - presumably something like a lookup table distributed across its parameters. It fits the training data, but the solution doesn't extend. Then, over continued training, a different solution forms: this trigonometric algorithm. As the algorithm assembles, generalization happens. The two are not merely correlated; tracing the structure in the weights and the performance on held-out data, they move together.

What should we make of this? Here’s one reading: the difference between a network that memorizes and a network that generalizes is not just quantitative, but qualitative. The two networks have learned different kinds of things. One has stored associations. The other has found a method - a mechanistic procedure that happens to work on inputs beyond those it was trained on, because it captures something about the structure of the problem.

This is a single example, and a toy one. But it raises a question worth taking seriously. When networks generalize, is it because they've found something like an algorithm? And if so, what does that tell us about what deep learning is actually doing?

It's worth noting what was and wasn't in the training data. The data contained input-output pairs: "32 and 41 gives 73," and so on. It contained nothing about how to compute them. The network arrived at a method on its own.

And both solutions - the lookup table and the trigonometric algorithm - fit the training data equally well. The network's loss was already near minimal during the memorization phase. Whatever caused it to keep searching, to eventually settle on the generalizing algorithm instead, it wasn't that the generalizing algorithm fit the data better. It was something else - some property of the learning process that favored one kind of solution over another.

The generalizing algorithm is, in a sense, simpler. It compresses what would otherwise be thousands of stored associations into a compact procedure. Whether that's the right way to think about what happened here - whether "simplicity" is really what the training process favors - is not obvious. But something made the network prefer a mechanistic solution that generalized over one that didn't, and it wasn't the training data alone.[8]

Vision circuits

InceptionV1 classifies an image as a car by hierarchically composing detectors for the windows, car body, and wheels (pictured), which are themselves formed by composing detectors for shapes, edges, etc (not pictured). From Olah et al. (2020).

Grokking is a controlled setting - a small network, a simple task, designed to be fully interpretable. Does the same kind of structure appear in realistic models solving realistic problems?

Olah et al. (2020) study InceptionV1, an image classification network trained on ImageNet - a dataset of over a million photographs labeled with object categories. The network takes in an image and outputs a probability distribution over a thousand possible labels: "car," "dog," "coffee mug," and so on. Can we understand this more realistic setting?

A natural starting point is to ask what individual neurons are doing. Suppose we take a neuron somewhere in the network. We can find images that make it activate strongly by either searching through a dataset or optimizing an input to maximize activation. If we collect images that strongly activate a given neuron, do they have anything in common?

In early layers, they do, and the patterns we find are simple. Neurons in the first few layers respond to edges at particular orientations, small patches of texture, transitions between colors. Different neurons respond to different orientations or textures, but many are selective for something visually recognizable.

In later layers, the patterns we find become more complex. Neurons respond to curves, corners, or repeating patterns. Deeper still, neurons respond to things like eyes, wheels, or windows - object parts rather than geometric primitives.

This already suggests a hierarchy: simple features early, complex features later. But the more striking finding is about how the complex features are built.

Olah et al. do not just visualize what neurons respond to. They trace the connections between layers - examining the weights that connect one layer's neurons to the next, identifying which earlier features contribute to which later ones. What they find is that later features are composed from earlier ones in interpretable ways.

There is, for instance, a neuron in InceptionV1 that we identify as responding to dog heads. If we trace its inputs by looking at which neurons from the previous layer connect to it with strong weights, we find it receives input from neurons that detect eyes, snout, fur, and tongue. The dog head detector is built from the outputs of simpler detectors. It is not detecting dog heads from scratch; it is checking whether the right combination of simpler features is present in the right spatial arrangement.

We find the same pattern throughout the network. A neuron that detects car windows is connected to neurons that detect rectangular shapes with reflective textures. A neuron that detects car bodies is connected to neurons that detect smooth, curved surfaces. And a neuron that detects cars as a whole is connected to neurons that detect wheels, windows, and car bodies, arranged in the spatial configuration we would expect for a car.

Olah et al. call these pathways "circuits," and the term is meaningful. The structure is genuinely circuit-like: there are inputs, intermediate computations, and outputs, connected by weighted edges that determine how features combine. In their words: "You can literally read meaningful algorithms off of the weights."

And the components are reused. The same edge detectors that contribute to wheel detection also contribute to face detection, to building detection, to many other things. The network has not built separate feature sets for each of the thousand categories it recognizes. It has built a shared vocabulary of parts - edges, textures, curves, object components, etc - and combines them differently for different recognition tasks.
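To make the "shared vocabulary of parts" picture concrete, here is a purely schematic sketch with made-up numbers - not InceptionV1's actual weights - of how a later-layer unit can be read as a weighted combination of earlier part detectors, and how the same parts feed different units:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical activations of earlier-layer part detectors for one image region.
parts = {"wheel": 0.9, "window": 0.7, "car_body": 0.8, "dog_snout": 0.1}

# Hypothetical incoming weights for two later-layer units; both read the same parts.
car_weights = {"wheel": 1.2, "window": 0.8, "car_body": 1.0, "dog_snout": -0.5}
dog_weights = {"wheel": -0.4, "window": -0.2, "car_body": -0.3, "dog_snout": 1.5}

def unit_activation(weights, bias=-1.0):
    # Weighted sum of part activations, then a nonlinearity: the "circuit" picture.
    return relu(sum(weights[k] * parts[k] for k in parts) + bias)

print("car unit:", unit_activation(car_weights))   # high: the right parts are present
print("dog unit:", unit_activation(dog_weights))   # low: the wrong parts are present
```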

We might find this structure reminiscent of something. A Boolean circuit is a composition of simple gates - each taking a few bits as input, outputting one bit - wired together to compute something complex. A program is a composition of simple operations - each doing something small - arranged to accomplish something larger. What Olah et al. found in InceptionV1 has the same shape: small computations, composed hierarchically, with components shared and reused across different pathways.

From a theoretical computer science perspective, this is what algorithms look like, in general. Not just the specific trigonometric trick from grokking, but computation as such. You take a hard problem, break it into pieces, solve the pieces, and combine the results. What makes this tractable, what makes it an algorithm rather than a lookup table, is precisely the compositional structure. The reuse is what makes it compact; the compactness is what makes it feasible.


Olsson et al. argue that the primary mechanism of in-context-learning in large language models is a mechanistic attention circuit known as an induction head. Similar to the grokking example, the mechanistic circuit forms in a rapid "phase change" which coincides with a large improvement in the in-context-learning performance. Plots from Olsson et al.

Grokking and InceptionV1 are two examples, but they are far from the only ones. Mechanistic interpretability has grown into a substantial field, and the researchers working in it have documented many such structures - in toy models, in language models, across different architectures and tasks. Induction heads, language circuits, and bracket matching in transformer language models, learned world models and multi-step reasoning in toy tasks, grid-cell-like mechanisms in RL agents, hierarchical representations in GANs, and much more. Where we manage to look carefully, we tend to find something mechanistic.

This raises a question. If what we find inside trained networks (at least when we can find anything) looks like algorithms built from parts, what does that suggest about what deep learning is doing?

The hypothesis

What should we make of this?

We have seen neural networks learn solutions that look like algorithms - compositional structures built from simple, reusable parts. In the grokking case, this coincided precisely with generalization. In InceptionV1, this structure is what lets the network recognize objects despite the vast dimensionality of the input space. And across many other cases documented in the mechanistic interpretability literature, the same shape appears: not monolithic black-box computations, but something more like circuits.

This is reminiscent of the picture we started with. Solomonoff induction frames learning as a search for simple programs that explain data. It is a theoretical ideal - provably optimal in a certain sense, but hopelessly intractable. The connection between Solomonoff and deep learning has mostly been viewed as purely conceptual: a nice way to think about what learning "should" do, with no implications for what neural networks actually do.

But the evidence from mechanistic interpretability suggests a different possibility. What if deep learning is doing something functionally similar to program synthesis? Not through the same mechanism - gradient descent on continuous parameters is nothing like enumerative search over discrete programs. But perhaps targeting the same kind of object: mechanistic solutions, built from parts, that capture structure in the data generating process.

To be clear: this is a hypothesis. The evidence shows that neural networks can learn compositional solutions, and that such solutions have appeared alongside generalization in specific, interpretable cases. It doesn't show that this is what's always happening, or that there's a consistent bias toward simplicity, or that we understand why gradient descent would find such solutions efficiently.

But if the hypothesis is right, it would reframe what deep learning is doing. The success of neural networks would not be a mystery to be accepted, but an instance of something we already understand in principle: the power of searching for compact, mechanistic models to explain your observations. The puzzle would shift from "why does deep learning work at all?" to "how does gradient descent implement this search so efficiently?"

That second question is hard. Solomonoff induction is intractable precisely because the space of programs is vast and discrete. Gradient descent navigates a continuous parameter space using only local information. If both processes are somehow arriving at similar destinations - compositional solutions to learning problems - then something interesting is happening in how neural network loss landscapes are structured, something we do not yet understand. We will return to this issue at the end of the post.

So the hypothesis raises as many questions as it answers. But it offers something valuable: a frame. If deep learning is doing a form of program synthesis, that gives us a way to connect disparate observations - about generalization, about convergence of representations, about why scaling works - into a coherent picture. Whether this picture can make sense of more than just these particular examples is what we'll explore next.

Clarifying the hypothesis

What do I mean by “programs”?

I think one can largely read this post with a purely operational, “you know it when you see it” definition of “programs” and “algorithms”. But there are real conceptual issues here if you try to think about this carefully.

In most computational systems, there's a vocabulary that comes with the design - instructions, subroutines, registers, data flow, and so on. We can point to the “program” because the system was built to make it visible.

Neural networks are not like this. We have neurons, weights, activations, etc, but these may not be the right atoms of computation. If there's computational structure in a trained network, it doesn't automatically come labeled. So if we want to ask whether networks learn programs, we need to know what we're looking for. What would count as finding one?

This is a real problem for interpretability too. When researchers claim to find "circuits" or “features” in a network, what makes that a discovery rather than just a pattern they liked? There has to be something precise and substrate-independent we're tracking. It helps to step back and consider what computational structure even is in the cases we understand it well.

Consider the various models of computation: Turing machines, lambda calculus, Boolean circuits, etc. They have different primitives - tapes, substitution rules, logic gates - but the Church-Turing thesis tells us they're equivalent. Anything computable in one is computable in all the others. So "computation" isn't any particular formalism. It's whatever these formalisms have in common.

What do they have in common? Let me point to something specific: each one builds complex operations by composing simple pieces, where each piece only interacts with a small number of inputs. A Turing machine's transition function looks at one cell. A Boolean gate takes two or three bits. A lambda application involves one function and one argument. Complexity comes from how pieces combine, not from any single piece seeing the whole problem.

Is this just a shared property, or something deeper?

One reason to take it seriously: you can derive a complete model of computation from just this principle. Ask "what functions can I build by composing pieces of bounded arity?" and work out the answer carefully. You get (in the discrete case) Boolean circuits - not a restricted fragment of computation, but a universal model, equivalent to all the others. The composition principle alone is enough to generate computation in full generality.

The bounded-arity constraint is essential. If each piece could see all inputs, we would just have lookup tables. What makes composition powerful is precisely that each piece is “local” and can only interact with so many things at once - it forces solutions to have genuine internal structure.

So when I say networks might learn "programs," I mean: solutions built by composing simple pieces, each operating on few inputs. Not because that's one nice kind of structure, but because that may be what computation actually is.
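As a tiny illustration of that claim (my own toy example, not drawn from the interpretability results above), here is XOR and a one-bit adder built entirely from pieces that each see only two inputs:

```python
# Every gate below looks at just two bits, yet composing them yields
# arbitrary Boolean functions.
def nand(a: int, b: int) -> int:
    return 1 - (a & b)

# XOR built purely from 2-input NAND gates.
def xor(a: int, b: int) -> int:
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

# A 1-bit full adder (sum bit, carry bit), again nothing but 2-input pieces.
def full_adder(a: int, b: int, carry_in: int):
    s1 = xor(a, b)
    return xor(s1, carry_in), (a & b) | (s1 & carry_in)

assert [xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
assert full_adder(1, 1, 1) == (1, 1)
```

Chaining full adders gives addition on numbers of any width; none of the individual pieces ever sees the whole problem.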

Note that we have not implied that the computation is necessarily over discrete values - it may be over continuous values, as in analog computation. (However, the “pieces” must be discrete, for this to even be a coherent notion. This causes issues when combined with the subsequent point, as we will discuss towards the end of the post.)

A clarification: the network's architecture trivially has compositional structure - the forward pass is executable on a computer. That's not the claim. The claim is that training discovers an effective program within this substrate. Think of an FPGA: a generic grid of logic components that a hardware engineer configures into a specific circuit. The architecture is the grid; the learned weights are the configuration.

This last point - the fact that the program structure in neural networks is learned and depends on continuous parameters - is what makes the issue genuinely subtle, and unlike the other models of computation we're familiar with (even analog computation). It also makes formalization difficult, an issue we will return to towards the end of the post.

What do I mean by “program synthesis”?

By program synthesis, I mean a search through possible programs to find one that fits the data.

Two things make this different from ordinary function fitting.

First, the search is general-purpose. Linear regression searches over linear functions. Decision trees search over axis-aligned partitions. These are narrow hypothesis classes, chosen by the practitioner to match the problem. The claim here is different: deep learning searches over a space that can express essentially any efficient computable function. It's not that networks are good at learning one particular kind of structure - it's that they can learn whatever structure is there.

Second, the search is guided by strong inductive biases. Searching over all programs is intractable without some preference for certain programs over others. The natural candidate is simplicity: favor shorter or less complex programs over longer or more complex ones. This is what Solomonoff induction does - it assigns prior probability to programs based on their length, then updates on data.

Solomonoff induction is the theoretical reference point. It's provably optimal in a certain sense: if the data has any computable structure, Solomonoff induction will eventually find it. But it's also intractable - not just slow, but literally incomputable in its pure form, and exponentially slow even in approximations.

The hypothesis is that deep learning achieves something functionally similar through completely different means. Gradient descent on continuous parameters looks nothing like enumeration over discrete programs. But perhaps both are targeting the same kind of object - simple programs that capture structure - and arriving there by different routes. We will return to the issue towards the end of the post.

This would require the learning process to implement something like simplicity bias, even though "program complexity" isn't in the loss function. Whether that's exactly the right characterization, I'm not certain. But some strong inductive bias has to be operating - otherwise we couldn't explain why networks generalize despite having the capacity to memorize, or why scaling helps rather than hurts.
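As a caricature of this kind of search (and emphatically nothing like the mechanism of gradient descent), here is a brute-force synthesizer that enumerates a made-up expression grammar and returns the shortest program consistent with the data; every name and grammar choice below is illustrative:

```python
import itertools

# Programs are expressions built from x, small constants, +, and *.
LEAVES = ["x", "1", "2", "3"]
OPS = ["+", "*"]

def expressions(depth: int):
    if depth == 0:
        yield from LEAVES
        return
    yield from expressions(depth - 1)
    for op in OPS:
        for left, right in itertools.product(expressions(depth - 1), repeat=2):
            yield f"({left} {op} {right})"

def fits(expr: str, data) -> bool:
    return all(eval(expr, {"x": x}) == y for x, y in data)

def synthesize(data, max_depth: int = 2) -> str:
    # Enumerate short programs before long ones: a crude length prior.
    candidates = sorted(set(expressions(max_depth)), key=len)
    return next(e for e in candidates if fits(e, data))

# Observations generated by the hidden rule y = x * (x + 1).
data = [(1, 2), (2, 6), (3, 12), (4, 20)]
print(synthesize(data))   # e.g. "(x * (x + 1))" or an equally short equivalent
```

Even this toy version blows up exponentially as the grammar and depth grow, which is exactly why the hypothesis that deep learning finds similar objects by a completely different route is interesting.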

What’s the scope of the hypothesis?

I've thought most deeply about supervised and self-supervised learning using stochastic optimization (SGD, Adam, etc) on standard architectures like MLPs, CNNs, or transformers, on standard tasks like image classification or autoregressive language prediction, and am strongly ready to defend claims there. I also believe that this extends to settings like diffusion models, adversarial setups, reinforcement learning, etc, but I've thought less about these and can't be as confident here.

Why this isn't enough

The preceding case studies provide a strong existence proof: deep neural networks are capable of learning and implementing non-trivial, compositional algorithms. The evidence that InceptionV1 solves image classification by composing circuits, or that a transformer solves modular addition by discovering a Fourier-based algorithm, is quite hard to argue with. And, of course, there are more examples than these which we have not discussed.

Still, the question remains: is this the exception or the rule? It would be completely consistent with the evidence presented so far for this type of behavior to just be a strange edge case.

Unfortunately, mechanistic interpretability is not yet enough to settle the question. The settings where today's mechanistic interpretability tools provide such clean, complete, and unambiguously correct results[9] are very rare.

Aren't most networks uninterpretable? Why this doesn't disprove the thesis.

Should we not take the lack of such clean mechanistic interpretability results as active counterevidence against our hypothesis? If models were truly learning programs in general, shouldn't those programs be readily apparent? Instead the internals of these systems appear far more "messy."

This objection is a serious one, but it makes a leap in logic. It conflates the statement "our current methods have not found a clean programmatic structure" with the much stronger statement "no such structure exists." In other words, absence of evidence is not evidence of absence[10]. The difficulty we face may not be an absence of structure, but a mismatch between the network's chosen representational scheme and the tools we are currently using to search for it.

Attempting to identify which individual transistors in an Atari machine are responsible for different games does not work very well; nevertheless an Atari machine has real computational structure. We may be in a similar situation with neural networks. From Jonas & Kording (2017).

To make this concrete, consider a thought experiment, adapted from the paper "Could a Neuroscientist Understand a Microprocessor?":

Imagine a team of neuroscientists studying a microprocessor (MOS 6502) that runs arcade (Atari) games. Their tools are limited to those of their trade: they can, for instance, probe the voltage of individual transistors and lesion them to observe the effect on gameplay. They do not have access to the high-level source code or architecture diagrams.

As the paper confirms, the neuroscientists would fail to understand the system. This failure would not be because the system lacks compositional, program structure - it is, by definition, a machine that executes programs. Their failure would be one of mismatched levels of abstraction. The meaningful concepts of the software (subroutines, variables, the call stack) have no simple, physical correlate at the transistor level. The "messiness" they would observe - like a single transistor participating in calculating a score, drawing a sprite, and playing a sound - is an illusion created by looking at the wrong organizational level.

My claim is that this is the situation we face with neural networks. Apparent "messiness" like polysemanticity is not evidence against a learned program; it is the expected signature of a program whose logic is not organized at the level of individual neurons. The network may be implementing something like a program, but using a "compiler" and an "instruction set" that are currently alien to us.[11]

The clean results from the vision and modular addition case studies are, in my view, instances where strong constraints (e.g., the connection sparsity of CNNs, or the heavy regularization and shallow architecture in the grokking setup) forced the learned program into a representation that happened to be unusually simple for us to read. They are the exceptions in their legibility, not necessarily in their underlying nature.[12]

Therefore, while mechanistic interpretability can supply plausibility to our hypothesis, we need to move towards more indirect evidence to start building a positive case.

Indirect evidence

Just before OpenAI started, I met Ilya [Sutskever]. One of the first things he said to me was, "Look, the models, they just wanna learn. You have to understand this. The models, they just wanna learn."

And it was a bit like a Zen Koan. I listened to this and I became  enlightened.

... What that told me is that the phenomenon that I'd seen wasn't just some random thing: it was broad, it was more general.

The models just wanna learn. You get the obstacles out of their way. You give them good data. You give them enough space to operate in. You don't do something stupid like condition them badly numerically.

And they wanna learn. They'll do it.

-Dario Amodei[13]

I remember when I trained my first neural network, there was something almost miraculous about it: it could solve problems which I had absolutely no idea how to code myself (e.g. how to distinguish a cat from a dog), and in a completely opaque way such that even after it had solved the problem I had no better picture for how to solve the problem myself than I did beforehand. Moreover, it was remarkably resilient, despite obvious problems with the optimizer, or bugs in the code, or bad training data - unlike any other engineered system I had ever built, almost reminiscent of something biological in its robustness.

My impression is that this sense of "magic" is a common, if often unspoken, experience among practitioners. Many simply learn to accept the mystery and get on with the work. But there is nothing virtuous about confusion - it just suggests that your understanding is incomplete, that you are ignorant of the real mechanisms underlying the phenomenon.

Our practical success with deep learning has outpaced our theoretical understanding. This has led to a proliferation of explanations that often feel ad-hoc and local - tailor-made to account for a specific empirical finding, without connecting to other observations or any larger framework. For instance, the theory of "double descent" provides a narrative for the U-shaped test loss curve, but it is a self-contained story. It does not, for example, share a conceptual foundation with the theories we have for how induction heads form in transformers. Each new discovery seems to require a new, bespoke theory. One naturally worries that we are juggling epicycles.

This sense of theoretical fragility is compounded by a second problem: for any single one of these phenomena, we often lack consensus, entertaining multiple, competing hypotheses. Consider the core question of why neural networks generalize. Is it best explained by the implicit bias of SGD towards flat minima, the behavior of neural tangent kernels, or some other property? The field actively debates these views. And where no mechanistic theory has gained traction, we often retreat to descriptive labels. We say complex abilities are an "emergent" property of scale, a term that names the mystery without explaining its cause.

This theoretical disarray is sharpest when we examine our most foundational frameworks. Here, the issue is not just a lack of consensus, but a direct conflict with empirical reality. This disconnect manifests in several ways:

  • Sometimes, our theories make predictions that are actively falsified by practice. Classical statistical learning theory, with its focus on the bias-variance tradeoff, advises against the very scaling strategies that have produced almost all state-of-the-art performance.
  • In other cases, a theory might be technically true but practically misleading, failing to explain the key properties that make our models effective. The Universal Approximation Theorem, for example, guarantees representational power but does so via a construction that implies an exponential scaling that our models somehow avoid.
  • And in yet other areas, our classical theories are almost entirely silent. They offer no framework to even begin explaining deep puzzles like the uncanny convergence of representations across vastly different models trained on the same data.

We are therefore faced with a collection of major empirical findings where our foundational theories are either contradicted, misleading, or simply absent. This theoretical vacuum creates an opportunity for a new perspective.

The program synthesis hypothesis offers such a perspective. It suggests we shift our view of what deep learning is fundamentally doing: from statistical function fitting to program search. The specific claim is that deep learning performs a search for simple programs that explain the data.

This shift in viewpoint may offer a way to make sense of the theoretical tensions we have outlined. If the learning process is a search for an efficient program rather than an arbitrary function, then the circumvention of the curse of dimensionality is no longer so mysterious. If this search is guided by a strong simplicity bias, the unreasonable effectiveness of scaling becomes an expected outcome, rather than a paradox.

We will now turn to the well-known paradoxes of approximation, generalization, and convergence, and see how the program synthesis hypothesis accounts for each.

The paradox of approximation

(See also this post for related discussion.)

We can overcome the curse of dimensionality because real problems can be broken down into parts. When this happens sequentially (like the trees on the right), deep networks have an advantage. Image source.

Before we even consider how a network learns or generalizes, there is a more basic question: how can a neural network, with a practical number of parameters, even in principle represent the complex function it is trained on?

Consider the task of image classification. A function that takes a 1024x1024 pixel image (roughly one million input dimensions) and maps it to a single label like "cat" or "dog" is, a priori, an object of staggering high-dimensional complexity. Who is to say that a good approximation of this function even exists within the space of functions that a neural network of a given size can express?

The textbook answer to this question is the Universal Approximation Theorem (UAT). This theorem states that a neural network with a single hidden layer can, given enough neurons, approximate any continuous function to arbitrary accuracy. On its face, this seems to resolve the issue entirely.

A precise statement of the Universal Approximation Theorem

Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a continuous, non-polynomial function. Then for every continuous function $f$ from a compact subset $K \subset \mathbb{R}^d$ to $\mathbb{R}$, and every $\varepsilon > 0$, we can choose the number of neurons $N$ large enough such that there exists a network $g$ with

$$\sup_{x \in K} |f(x) - g(x)| < \varepsilon,$$

where $g(x) = C\,\sigma(Ax + b)$ for some matrices $A \in \mathbb{R}^{N \times d}$, $C \in \mathbb{R}^{1 \times N}$, and bias vector $b \in \mathbb{R}^N$ (with $\sigma$ applied elementwise).

See here for a proof sketch. In plain English, this means that for any well-behaved target function $f$, you can always make a one-layer network $g$ that is a "good enough" approximation, just by making the number of neurons $N$ sufficiently large.

Note that the network here is a shallow one - the theorem doesn't even explain why you need deep networks, an issue we'll return to when we talk about depth separations. In fact, one can prove theorems like this without even needing neural networks at all - the theorem directly parallels the classic Stone-Weierstrass theorem from analysis, which proves a similar statement for polynomials.

However, this answer is deeply misleading. The crucial caveat is the phrase "given enough neurons." A closer look at the proofs of the UAT reveals that for an arbitrary function, the number of neurons required scales exponentially with the dimension of the input. This is the infamous curse of dimensionality. To represent a function on a one-megapixel image, this would require a catastrophically large number of neurons - more than there are atoms in the universe.

The UAT, then, is not a satisfying explanation. In fact, it's a mathematical restatement of a near-trivial fact: with exponential resources, one can simply memorize a function's behavior. The constructions used to prove the theorem are effectively building a continuous version of a lookup table. This is not an explanation for the success of deep learning; it is a proof that if deep learning had to deal with arbitrary functions, it would be hopelessly impractical.
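To get a feel for the numbers, here is a throwaway back-of-the-envelope calculation (my own, not from any of the cited theorems): if we "approximate" an arbitrary function by tabulating it on a coarse grid of just 10 sample points per input dimension, the table has $10^d$ entries.

```python
import math

# Back-of-the-envelope: a "lookup table" approximation with k grid points
# per input dimension needs k**d entries for a d-dimensional input.
k = 10  # a very coarse grid: 10 points per dimension
for d in [2, 10, 100, 1_000_000]:  # 1e6 ~ a one-megapixel image
    # log10(k**d) = d * log10(k), which avoids overflow for huge d
    print(f"d = {d:>9,}: ~10^{d * math.log10(k):.0f} table entries")
```

Already at $d = 100$ the table dwarfs the roughly $10^{80}$ atoms in the observable universe; at one megapixel the number is beyond absurd.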

This is not merely a weakness of the UAT's particular proof; it is a fundamental property of high-dimensional spaces. Classical results in approximation theory show that this exponential scaling is not just an upper bound on what's needed, but a strict lower bound. These theorems prove that any method that aims to approximate arbitrary smooth functions is doomed to suffer the curse of dimensionality.

The parameter count lower bound

There are many results proving various lower bounds on the parameter count available in the literature under different technical assumptions.

A classic result from DeVore, Howard, and Micchelli (1989) [Theorem 4.2] establishes a lower bound on the number of parameters $N$ required by any continuous approximation scheme (including neural networks) to achieve an error $\varepsilon$ over the space of all smooth functions in $d$ dimensions. The number of parameters $N$ must satisfy:

$$N \gtrsim \varepsilon^{-d/s},$$

where $s$ is a measure of the function's smoothness. To maintain a constant error $\varepsilon$ as the dimension $d$ increases, the number of parameters $N$ must grow exponentially. This confirms that no clever trick can escape this fate if the target functions are arbitrary.

The real lesson of the Universal Approximation Theorem, then, is not that neural networks are powerful. The real lesson is that if the functions we learn in the real world were arbitrary, deep learning would be impossible. The empirical success of deep learning with a reasonable number of parameters is therefore a profound clue about the nature of the problems themselves: they must have structure.

The program synthesis hypothesis gives a name to this structure: compositionality. This is not a new idea. It is the foundational principle of computer science. To solve a complex problem, we do not write down a giant lookup table that specifies the output for every possible input. Instead, we write a program: we break the problem down hierarchically into a sequence of simple, reusable steps. Each step (like a logic gate in a circuit) is a tiny lookup table, and we achieve immense expressive power by composing them.
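As a toy illustration of the gap between the two styles (my own example, not drawn from the sources above): the parity of a bit-string can either be stored as a lookup table whose size doubles with every extra input bit, or computed by composing one tiny reusable XOR step.

```python
from itertools import product

N_BITS = 16

# "Lookup table" approach: store the answer for every possible input.
# Size grows as 2**N_BITS -- exponential in the input dimension.
parity_table = {bits: sum(bits) % 2 for bits in product((0, 1), repeat=N_BITS)}

# "Program" approach: compose a single reusable step (XOR) N_BITS - 1 times.
# The description stays constant-size; only the runtime grows (linearly).
def parity_program(bits):
    acc = 0
    for b in bits:
        acc ^= b  # one tiny "gate", reused at every step
    return acc

x = (1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1)
assert parity_table[x] == parity_program(x)
print(len(parity_table))  # 65,536 entries vs. a few lines of code
```

The table's size is exponential in the input dimension; the program's description length does not grow at all.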

This matches what we see empirically in some deep neural networks via mechanistic interpretability. They appear to solve complex tasks by learning a compositional hierarchy of features. A vision model learns to detect edges, which are composed into shapes, which are composed into object parts (wheels, windows), which are finally composed into an object detector for a "car." The network is not learning a single, monolithic function; it is learning a program that breaks the problem down.

This parallel with classical computation offers an alternative perspective on the approximation question. While the UAT considers the case of arbitrary functions, a different set of results examines how well neural networks can represent functions that have this compositional, programmatic structure.

One of the most relevant results comes from considering Boolean circuits, which are a canonical example of programmatic composition. It is known that feedforward neural networks can represent any program implementable by a polynomial-size Boolean circuit, using only a polynomial number of neurons. This provides a different kind of guarantee than the UAT. It suggests that if a problem has an efficient programmatic solution, then an efficient neural network representation of that solution also exists.
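To make the circuit-to-network direction concrete, here is a minimal sketch (my own toy encoding, not the construction from any particular paper) of basic Boolean gates written as single ReLU neurons on $\{0, 1\}$ inputs, composed into a small circuit:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# One ReLU "neuron": relu(w . x + b). For inputs in {0, 1}, suitable
# weights and biases reproduce basic logic gates exactly.
def AND(a, b):  # fires only when a + b >= 2
    return relu(a + b - 1.0)

def OR(a, b):   # saturates at 1 via 1 - relu(1 - a - b)
    return 1.0 - relu(1.0 - a - b)

def NOT(a):
    return relu(1.0 - a)

# Composing gates layer by layer gives a small circuit, e.g. XOR:
def XOR(a, b):
    return AND(OR(a, b), NOT(AND(a, b)))

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(int(a), int(b), "->", int(XOR(a, b)))
```

A polynomial-size circuit built this way translates into a polynomial number of neurons, which is the content of the guarantee mentioned above.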

This offers an explanation for how neural networks might evade the curse of dimensionality. Their effectiveness would stem not from an ability to represent any high-dimensional function, but from their suitability for representing the tiny, structured subset of functions that have efficient programs. The problems seen in practice, from image recognition to language translation, appear to belong to this special class.

Why compositionality, specifically? Evidence from depth separation results.

The argument so far is that real-world problems must have some special "structure" to escape the curse of dimensionality, and that this structure is program structure or compositionality. But how can we be sure? Yes, approximation theory requires that we must have something that differentiates our target functions from arbitrary smooth functions in order to avoid needing exponentially many parameters, but it does not specify what. The structure does not necessarily have to be compositionality; it could be something else entirely.

While there is no definitive proof, the literature on depth separation theorems provides evidence for the compositionality hypothesis. The logic is straightforward: if compositionality is the key, then an architecture that is restricted in its ability to compose operations should struggle. Specifically, one would expect that restricting a network's depth - its capacity for sequential, step-by-step computation - should force it back towards exponential scaling for certain problems.

And this is what the theorems show.

These depth separation results, sometimes also called "no-flattening theorems," involve constructing families of functions that deep neural networks can represent with a polynomial number of parameters, but which shallow networks would require an exponential number to represent. The literature contains a range of such functions, including sawtooth functions, certain polynomials, and functions with hierarchical or modular substructures.

Individually, many of these examples are mathematical constructions, too specific to tell us much about realistic tasks on their own. But taken together, a pattern emerges. The functions where depth provides an exponential advantage are consistently those that are built "step-by-step." They have a sequential structure that deep networks can mirror. A deep network can compute an intermediate result in one layer and then feed that result into the next, effectively executing a multi-step computation.

A shallow network, by contrast, has no room for this kind of sequential processing. It must compute its output in a single, parallel step. While it can still perform "piece-by-piece" computation (which is what its width allows), it cannot perform "step-by-step" computation. Faced with an inherently sequential problem, a shallow network is forced to simulate the entire multi-step computation at once. This can be highly inefficient, in the same way that simulating a sequential program on a highly parallel machine can sometimes require exponentially more resources.

This provides a parallel to classical complexity theory. The distinction between depth and width in neural networks mirrors the distinction between sequential (P) and parallelizable (NC) computation. Just as it is conjectured that some problems are inherently sequential and cannot be efficiently parallelized (the NC ≠ P conjecture), these theorems show that some functions are inherently deep and cannot be efficiently "flattened" into a shallow network. The sketch below gives a concrete feel for the simplest of these constructions.
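A standard family of examples behind these theorems is the sawtooth construction (in the spirit of Telgarsky's depth-separation results; the sketch itself is mine). Each composition of a two-ReLU "tent map" doubles the number of oscillations, so a depth-$k$ network produces on the order of $2^k$ linear pieces from only $O(k)$ units, while a one-hidden-layer ReLU network needs roughly one unit per piece.

```python
import numpy as np

def tent(x):
    # The "tent map" on [0, 1], computable with just two ReLU units:
    # 2*relu(x) - 4*relu(x - 0.5).
    return 2 * np.maximum(x, 0) - 4 * np.maximum(x - 0.5, 0)

x = np.linspace(0, 1, 10_001)
y = x
depth = 8
for _ in range(depth):
    y = tent(y)  # each composition (layer) doubles the number of oscillations

# Count oscillations by counting sign changes of the (discrete) derivative.
sign_changes = int(np.sum(np.diff(np.sign(np.diff(y))) != 0))
print(f"depth {depth}: ~{sign_changes} local extrema from only {2 * depth} ReLU units")
# A shallow ReLU net needs on the order of one unit per linear piece,
# i.e. exponentially many units in the depth, to match this function.
```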

The paradox of generalization

(See also this post for related discussion.)

(Figures: the bias–variance tradeoff, via Wikipedia; an illustration of overfitting, via MATLAB & Simulink.)

Perhaps the most jarring departure from classical theory comes from how deep learning models generalize. A learning algorithm is only useful if it can perform well on new, unseen data. The central question of statistical learning theory is: what are the conditions that allow a model to generalize?

The classical answer is the bias-variance tradeoff. The theory posits that a model's error can be decomposed into two main sources:

  • Bias: Error from the model being too simple to capture the underlying structure of the data (underfitting).
  • Variance: Error from the model being too sensitive to the specific training data it saw, causing it to fit noise (overfitting).

According to this framework, learning is a delicate balancing act. The practitioner's job is to carefully choose a model of the "right" complexity - not too simple, not too complex - to land in a "Goldilocks zone" where both bias and variance are low. This view is reinforced by principles like the "no free lunch" theorems, which suggest there is no universally good learning algorithm, only algorithms whose inductive biases are carefully chosen by a human to match a specific problem domain.

The clear prediction from this classical perspective is that naively increasing a model's capacity (e.g., by adding more parameters) far beyond what is needed to fit the training data is a recipe for disaster. Such a model should have catastrophically high variance, leading to rampant overfitting and poor generalization.

And yet, perhaps the single most important empirical finding in modern deep learning is that this prediction is completely wrong. The "bitter lesson," as Rich Sutton calls it, is that the most reliable path to better performance is to scale up compute and model size, sometimes far into the regime where the model can easily memorize the entire training set. This goes beyond a minor deviation from theoretical predictions: it is a direct contradiction of the theory's core prescriptive advice.

This brings us to a second, deeper puzzle, first highlighted by Zhang et al. (2017). The authors conduct a simple experiment:

  • They train a standard vision model on a real dataset (e.g., CIFAR-10) and confirm that it generalizes well.
  • They then train the exact same model, with the exact same architecture, optimizer, and regularization, on a corrupted version of the dataset where the labels have been completely randomized.

The network is expressive enough that it is able to achieve near-zero training error on the randomized labels, perfectly memorizing the nonsensical data. As expected, its performance on a test set is terrible - it has learned nothing generalizable.
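The experiment is easy to reproduce in miniature. The sketch below (a toy stand-in of my own, not Zhang et al.'s actual setup) trains the same small MLP twice on a synthetic task, once with the true labels and once with the labels shuffled; the model fits both training sets, but only generalizes on the structured one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def run(randomize_labels: bool):
    # Synthetic "structured" task: the label is the sign of a fixed linear rule.
    X = torch.randn(1200, 50)
    y = (X @ torch.randn(50) > 0).long()
    if randomize_labels:
        y = y[torch.randperm(len(y))]  # destroy any relation between inputs and labels
    X_tr, y_tr, X_te, y_te = X[:600], y[:600], X[600:], y[600:]

    model = nn.Sequential(nn.Linear(50, 1024), nn.ReLU(), nn.Linear(1024, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(3000):  # train until the training set is (nearly) memorized
        opt.zero_grad()
        nn.functional.cross_entropy(model(X_tr), y_tr).backward()
        opt.step()

    with torch.no_grad():
        train_acc = (model(X_tr).argmax(1) == y_tr).float().mean().item()
        test_acc = (model(X_te).argmax(1) == y_te).float().mean().item()
    print(f"random_labels={randomize_labels}: train acc {train_acc:.2f}, test acc {test_acc:.2f}")

run(randomize_labels=False)  # expect: high train AND test accuracy
run(randomize_labels=True)   # expect: high train accuracy, ~50% (chance) test accuracy
```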

The paradox is this: why did the same exact model generalize well on the real data? Classical theories often tie a model's generalization ability to its "capacity" or "complexity," which is a fixed property of its architecture related to its expressivity. But this experiment shows that generalization is not a static property of the model. It is a dynamic outcome of the interaction between the model, the learning algorithm, and the structure of the data itself. The very same network that is completely capable of memorizing random noise somehow "chooses" to find a generalizable solution when trained on data with real structure. Why?

The program synthesis hypothesis offers a coherent explanation for both of these paradoxes.

First, why does scaling work? The hypothesis posits that learning is a search through some space of programs, guided by a strong simplicity bias. In this view, adding more parameters is analogous to expanding the search space (e.g., allowing for longer or more complex programs). While this does increase the model's capacity to represent overfitting solutions, the simplicity bias acts as a powerful regularizer. The learning process is not looking for any program that fits the data; it is looking for the simplest program. Giving the search more resources (parameters, compute, data) provides a better opportunity to find the simple, generalizable program that corresponds to the true underlying structure, rather than settling for a more complex, memorizing one.

Second, why does generalization depend on the data's structure? This is a natural consequence of a simplicity-biased program search.

  • When trained on real data, there exists a short, simple program that explains the statistical regularities (e.g., "cats have pointy ears and whiskers"). The simplicity bias of the learning process finds this program, and because it captures the true structure, it generalizes well.
  • When trained on random labels, no such simple program exists. The only way to map the given images to the random labels is via a long, complicated, high-complexity program (effectively, a lookup table). Forced against its inductive bias, the learning algorithm eventually finds such a program to minimize the training loss. This solution is pure memorization and, naturally, fails to generalize.

If one assumes something like the program synthesis hypothesis is true, the phenomenon of data-dependent generalization is not so surprising. A model's ability to generalize is not a fixed property of its architecture, but a property of the program it learns. The model finds a simple program on the real dataset and a complex one on the random dataset, and the two programs have very different generalization properties.

And there is some evidence that the mechanism behind generalization is not so unrelated to the other empirical phenomena we have discussed. We can see this in the grokking setting discussed earlier. Recall the transformer trained on modular addition:

  • Initially, the model learns a memorization-based program. It achieves 100% accuracy on the training data, but its test accuracy is near zero. This is analogous to learning the "random label" dataset - a complex, non-generalizing solution.
  • After extensive further training, driven by a regularizer that penalizes complexity (weight decay), the model's internal solution undergoes a "phase transition." It discovers the Fourier-based algorithm for modular addition.
  • Coincident with the discovery of this algorithmic program (or rather, the removal of the memorization program, which occurs slightly later), test accuracy abruptly jumps to 100%.

The sudden increase in generalization appears to be the direct consequence of the model replacing a complex, overfitting solution with a simpler, algorithmic one. In this instance, generalization is achieved through the synthesis of a different, more efficient program.

The paradox of convergence

When we ask a neural network to solve a task, we specify what task we'd like it to solve, but not how it should solve the task - the purpose of learning is for it to find strategies on its own. We define a loss function and an architecture, creating a space of possible functions, and ask the learning algorithm to find a good one by minimizing the loss. Given this freedom, and the high-dimensionality of the search space, one might expect the solutions found by different models - especially those with different architectures or random initializations - to be highly diverse.

Instead, what we observe empirically is a strong tendency towards convergence. This is most directly visible in the phenomenon of representational alignment. This alignment is remarkably robust:

  • It holds across different training runs of the same architecture, showing that the final solution is not a sensitive accident of the random seed.
  • More surprisingly, it holds across different architectures. The internal activations of a Transformer and a CNN trained on the same vision task, for example, can often be mapped to one another with a simple linear transformation, suggesting they are learning not just similar input-output behavior, but similar intermediate computational steps (a sketch of this kind of measurement appears just after this list).
  • It even holds in some cases across modalities. Models like CLIP, trained to associate images with text, learn a shared representation space where the vector for a photograph of a dog is close to the vector for the phrase "a photo of a dog," indicating convergence on a common, abstract conceptual structure.
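One simple way such alignment is quantified is to regress one model's activations onto another's and ask how much variance a purely linear map explains. The sketch below (my own, with synthetic activations standing in for real model activations) shows the shape of that measurement:

```python
import numpy as np

def linear_alignment_r2(A, B):
    """R^2 of the best least-squares linear map from activations A to activations B.
    A, B: (n_inputs, n_features) activation matrices from two models on the SAME inputs."""
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    resid = B - A @ W
    return 1.0 - resid.var() / B.var()

rng = np.random.default_rng(0)
# Synthetic stand-ins: both "models" see the same 5000 inputs and encode a shared
# 32-dimensional latent structure, each through its own random linear readout.
latent = rng.normal(size=(5000, 32))
acts_model_1 = latent @ rng.normal(size=(32, 256)) + 0.1 * rng.normal(size=(5000, 256))
acts_model_2 = latent @ rng.normal(size=(32, 128)) + 0.1 * rng.normal(size=(5000, 128))
unrelated = rng.normal(size=(5000, 128))  # control: no shared structure

print("aligned pair:", round(linear_alignment_r2(acts_model_1, acts_model_2), 3))  # close to 1
print("control pair:", round(linear_alignment_r2(acts_model_1, unrelated), 3))     # small (chance-level fit)
```

Real studies use activations from trained networks rather than synthetic ones, and often more careful metrics (held-out fits, CKA, and so on), but the basic question is the same: how far does a simple linear map go?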

The mystery deepens when we observe parallels to biological systems. The Gabor-like filters that emerge in the early layers of vision networks, for instance, are strikingly similar to the receptive fields of neurons in the V1 area of the primate visual cortex. It appears that evolution and stochastic gradient descent, two very different optimization processes operating on very different substrates, have converged on similar solutions when exposed to the same statistical structure of the natural world.

One way to account for this is to hypothesize that the models are not navigating some undifferentiated space of arbitrary functions, but are instead homing in on a sparse set of highly effective programs that solve the task. If, following the physical Church-Turing thesis, we view the natural world as having a true, computable structure, then an effective learning process could be seen as a search for an algorithm that approximates that structure. In this light, convergence is not an accident, but a sign that different search processes are discovering similar objectively good solutions, much as different engineering traditions might independently arrive at the arch as an efficient solution for bridging a gap.

This hypothesis - that learning is a search for an optimal, objective program - carries with it a strong implication: the search process must be a general-purpose one, capable of finding such programs without them being explicitly encoded in its architecture. As it happens, an independent, large-scale trend in the field provides a great deal of data on this very point.

Rich Sutton's "bitter lesson" describes the consistent empirical finding that long-term progress comes from scaling general learning methods, rather than from encoding specific human domain knowledge. The old paradigm, particularly in fields like computer vision, speech recognition, or game playing, involved painstakingly hand-crafting systems with significant prior knowledge. For years, the state of the art relied on complex, hand-designed feature extractors like SIFT and HOG, which were built on human intuitions about what aspects of an image are important. The role of learning was confined to a relatively simple classifier that operated on these pre-digested features. The underlying assumption was that the search space was too difficult to navigate without strong human guidance.

The modern paradigm of deep learning has shown this assumption to be incorrect. Progress has come from abandoning these handcrafted constraints in favor of training general, end-to-end architectures with the brute force of data and compute. This consistent triumph of general learning over encoded human knowledge is a powerful indicator that the search process we are using is, in fact, general-purpose. It suggests that the learning algorithm itself, when given a sufficiently flexible substrate and enough resources, is a more effective mechanism for discovering relevant features and structure than human ingenuity.

This perspective helps connect these phenomena, but it also invites us to refine our initial picture. First, the notion of a single "optimal program" may be too rigid. It is possible that what we are observing is not convergence to a single point, but to a narrow subset of similarly structured, highly-efficient programs. The models may be learning different but algorithmically related solutions, all belonging to the same family of effective strategies.

Second, it is unclear whether this convergence is purely a property of the problem's solution space, or if it is also a consequence of our search algorithm. Stochastic gradient descent is not a neutral explorer. The implicit biases of stochastic optimization, when navigating a highly over-parameterized loss landscape, may create powerful channels that funnel the learning process toward a specific kind of simple, compositional solution. Perhaps all roads do not lead to Rome, but the roads to Rome are the fastest. The convergence could therefore be a clue about the nature of our learning dynamics themselves - that they possess a strong, intrinsic preference for a particular class of solutions.

Viewed together, these observations suggest that the space of effective solutions for real-world tasks is far smaller and more structured than the space of possible models. The phenomenon of convergence indicates that our models are finding this structure. The bitter lesson suggests that our learning methods are general enough to do so. The remaining questions point us toward the precise nature of that structure and the mechanisms by which our learning algorithms are so remarkably good at finding it.

The path forward

If you've followed the argument this far, you might already sense where it becomes difficult to make precise. The mechanistic interpretability evidence shows that networks can implement compositional algorithms. The indirect evidence suggests this connects to why they generalize, scale, and converge. But "connects to" is doing a lot of work. What would it actually mean to say that deep learning is some form of program synthesis?

Trying to answer this carefully leads to two problems. The claim "neural networks learn programs" seems to require saying what a program even is in a space of continuous parameters. It also requires explaining how gradient descent could find such programs efficiently, given what we know about the intractability of program search.

These are the kinds of problems where the difficulty itself is informative. Each has a specific shape - what you need to think about, what a resolution would need to provide. I focus on them deliberately: that shape is what eventually pointed me toward specific mathematical tools I wouldn't have considered otherwise.

This is also where the post will shift register. The remaining sections sketch the structure of these problems and gesture at why certain mathematical frameworks (singular learning theory, algebraic geometry, etc) might become relevant. I won't develop these fully here - that requires machinery far beyond the scope of a single blog post - but I want to show why you'd need to leave shore at all, and what you might find out in open water.

The representation problem

The program synthesis hypothesis posits a relationship between two fundamentally different kinds of mathematical objects.

On one hand, we have programs. A program is a discrete and symbolic object. Its identity is defined by its compositional structure - a graph of distinct operations. A small change to this structure, like flipping a comparison or replacing an addition with a subtraction, can create a completely different program with discontinuous, global changes in behavior. The space of programs is discrete.

On the other hand, we have neural networks. A neural network is defined by its parameter space: a continuous vector space of real-valued weights. The function a network computes is a smooth (or at least piecewise-smooth) function of these parameters. This smoothness is the essential property that allows for learning via gradient descent, a process of infinitesimal steps along a continuous loss landscape.

This presents a seeming type mismatch: how can a continuous process in a continuous parameter space give rise to a discrete, structured program?

The problem is deeper than it first appears. To see why, we must first be precise about what we mean when we say a network has "learned a program." It cannot simply be about the input-output function the network computes. A network that has perfectly memorized a lookup table for modular addition computes the same function on a finite domain as a network that has learned the general, trigonometric algorithm. Yet we would want to say, emphatically, that they have learned different programs. The program is not just the function; it is the underlying mechanism.
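The distinction is easy to state in code (a deliberately trivial illustration; note that the algorithm a grokked network actually learns is the trigonometric one, not literal modular arithmetic): two programs with identical input-output behavior on the finite domain, but entirely different mechanisms.

```python
# Two "programs" computing the same function on a finite domain (addition mod 113):
N = 113

# Program 1: pure memorization -- a lookup table with N*N stored input-output pairs.
table = {(a, b): (a + b) % N for a in range(N) for b in range(N)}
def add_mod_memorized(a, b):
    return table[(a, b)]

# Program 2: the general algorithm -- a short, structured, reusable rule.
def add_mod_algorithm(a, b):
    return (a + b) % N

# Identical input-output behavior on the whole domain...
assert all(add_mod_memorized(a, b) == add_mod_algorithm(a, b)
           for a in range(N) for b in range(N))
# ...but very different mechanisms: ~12,769 stored facts vs. one reusable rule.
```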

Thus the notion must depend on parameters, and not just functions, presenting a further conceptual barrier. To formalize the notion of "mechanism," a natural first thought might be to partition the continuous parameter space into discrete regions. In this picture, all the parameter vectors within a region $R_A$ would correspond to the same program A, while vectors in a different region $R_B$ would correspond to program B. But this simple picture runs into a subtle and fatal problem: the very smoothness that makes gradient descent possible works to dissolve any sharp boundaries between programs.

Imagine a continuous path in parameter space from a point $\theta_A$ (which clearly implements program A) to a point $\theta_B$ (which clearly implements program B). Imagine, say, that A has some extra subroutine that B does not. Because the map from parameters to the function is smooth, the network's behavior must change continuously along this path. At what exact point on this path did the mechanism switch from A to B? Where did the new subroutine get added? There is no canonical place to draw a line. A sharp boundary would imply a discontinuity that the smoothness of the map from parameters to functions seems to forbid.

This is not so simple a problem; it is worth spending some time thinking about how you might try to resolve it in order to appreciate the difficulty.

What this suggests, then, is that for the program synthesis hypothesis to be a coherent scientific claim, it requires something that does not yet exist: a formal, geometric notion of a space of programs. This is a rather large gap to fill, and in some ways, this entire post is my long-winded way of justifying such an ambitious mathematical goal.

I won't pretend that my collaborators and I don't have our[14] own ideas about how to resolve this, but the mathematical sophistication required jumps substantially, and they would probably require their own full-length post to do them justice. For now, I will just gesture at some clues which I think point in the right direction.

The first is the phenomenon of degeneracies[15]. Consider, for instance, dead neurons, whose incoming weights and activations are such that the neuron never fires for any input. A neural network with dead neurons acts like a smaller network with those dead neurons removed. This gives a mechanism for neural networks to change their "effective size" in a parameter-dependent way, which is required in order to e.g. dynamically add or remove a new subroutine depending on where you are in parameter space, as in our example above. In fact, dead neurons are just one example in a whole zoo of degeneracies with similar effects, which seem incredibly pervasive in neural networks.
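A minimal sketch of this effect (my own toy example): a network with a dead ReLU neuron is functionally identical to the smaller network obtained by pruning it, and the dead neuron's outgoing weight can be varied freely without changing the computed function - a degenerate direction in parameter space.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# A tiny 2-layer ReLU network with one "dead" hidden neuron.
W1 = rng.normal(size=(4, 3)); b1 = rng.normal(size=4)
W2 = rng.normal(size=(1, 4)); b2 = rng.normal(size=1)

dead = 2
W1[dead] = 0.0; b1[dead] = -1.0  # pre-activation is always -1 < 0: this neuron never fires

def f(x, W1, b1, W2, b2):
    return W2 @ relu(W1 @ x + b1) + b2

# The full network and the pruned network (dead neuron removed) agree everywhere.
# The outgoing weight W2[:, dead] could also be changed arbitrarily without
# affecting the function: a whole direction along which the model is constant.
keep = [i for i in range(4) if i != dead]
X = rng.normal(size=(3, 1000))
full = np.array([f(x, W1, b1, W2, b2) for x in X.T])
pruned = np.array([f(x, W1[keep], b1[keep], W2[:, keep], b2) for x in X.T])
assert np.allclose(full, pruned)
```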

It is worth mentioning that the present picture is now highly suggestive of a specific branch of math known as algebraic geometry. Algebraic geometry (in particular, singularity theory) systematically studies these degeneracies, and further provides a bridge between discrete structure (algebra) and continuous structure (geometry), exactly the type of connection we identified as necessary for the program synthesis hypothesis[16]. Furthermore, singular learning theory tells us how these degeneracies control the loss landscape and the learning process (classically, only in the Bayesian setting, a limitation we discuss in the next section). There is much more that can be said here, but I leave it for the future to treat this material properly.

The search problem

There’s another problem with this story. Our hypothesis is that deep learning is performing some version of program synthesis. That means that we not only have to explain how programs get represented in neural networks, we also need to explain how they get learned. There are two subproblems here.

  • First, how can deep learning even implement the needed inductive biases? For deep learning algorithms to be implementing something analogous to Solomonoff induction, they must be able to implicitly follow inductive biases which depend on the program structure, like simplicity bias. That is, the optimization process must somehow be aware of the program structure in order to favor some types of programs (e.g. shorter programs) over others. The optimizer must “see” the program structure of parameters.
  • Second, deep learning works in practice, using a reasonable amount of computational resources; meanwhile, even the most efficient versions of Solomonoff induction like speed induction run in exponential time or worse[5]. If deep learning is efficiently performing some version of program synthesis analogous to Solomonoff induction, that means it has implicitly managed to do what we could not figure out how to do explicitly - its efficiency must be due to some insight which we do not yet know. Of course, we know part of the answer: SGD only needs local information in order to optimize, instead of brute-force global search as one does with Bayesian learning. But then the mystery becomes a well-known one: why does myopic search like SGD converge to globally good solutions?

Both of these are questions about the optimization process. It is not obvious at all how local optimizers like SGD would be able to perform something like Solomonoff induction, let alone far more efficiently than we historically ever figured out for (versions of) Solomonoff induction itself. This is a difficult question, but I will attempt to point towards research which I believe can answer these questions.

The optimization process can depend on many things, a priori: choice of optimizer, regularization, dropout, step size, etc. But we can note that deep learning is able to work somewhat successfully (albeit sometimes with degraded performance) across wide ranges of choices of these variables. It does not seem like the choice of AdamW vs SGD matters nearly as much as the choice to do gradient-based learning in the first place. In other words, I believe these variables may affect efficiency, but I doubt they are fundamental to the explanation of why the optimization process can possibly succeed.

Instead, there is one common variable here which appears to determine the vast majority of the behavior of stochastic optimizers: the loss function. Optimizers like SGD take every gradient step according to a minibatch-loss function[17] like mean-squared error:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \left[ \frac{1}{n} \sum_{i=1}^{n} \big( f_{\theta_t}(x_i) - y_i \big)^2 \right],$$

where $\theta$ is the parameter vector, $f_\theta$ is the input/output map of the model on parameters $\theta$, $(x_i, y_i)_{i=1}^{n}$ are the $n$ training examples & labels in the minibatch, and $\eta$ is the learning rate.

In the most common versions of supervised learning, we can focus even further. The loss function itself can be decomposed into two effects: the parameter-function map $F : \theta \mapsto f_\theta$, and the target distribution. The overall loss function can be written as a composition of the parameter-function map and some statistical distance to the target distribution, e.g. for mean-squared error:

$$L(\theta) = D\big(F(\theta), f^*\big),$$

where $D(f, f^*) = \frac{1}{n} \sum_{i=1}^{n} \big( f(x_i) - f^*(x_i) \big)^2$ and $f^*$ denotes the target function.

Note that the statistical distance $D$ here is a fairly simple object. Almost always it is (on function space) convex, with a relatively simple functional form; further, it is the same distance one would use across many different architectures, including ones which do not achieve the remarkable performance of neural networks (e.g. polynomial approximation). Therefore one expects the question of learnability and inductive biases to largely come down to the parameter-function map $F$ rather than the (function-space) distance $D$.
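The decomposition is simple enough to write down directly. The sketch below (a toy setup of my own, with an arbitrary small architecture) makes the point that the loss is literally $D \circ F$, with all architecture-specific structure confined to $F$:

```python
import numpy as np

# A toy decomposition of the training loss into
#   (parameter-function map F)  composed with  (function-space distance D).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # inputs
f_star = np.sin(X @ rng.normal(size=10))  # target function values on those inputs

def F(theta):
    """Parameter-function map: parameters -> the function's values on the inputs.
    Here the 'architecture' is a tiny one-hidden-layer ReLU network (97 parameters)."""
    W1 = theta[:80].reshape(8, 10); b1 = theta[80:88]
    w2 = theta[88:96];              b2 = theta[96]
    return np.maximum(X @ W1.T + b1, 0.0) @ w2 + b2

def D(f_vals, g_vals):
    """Function-space distance: plain mean-squared error, convex and architecture-agnostic."""
    return np.mean((f_vals - g_vals) ** 2)

def loss(theta):
    return D(F(theta), f_star)  # L = D ∘ F: all architecture-specific structure lives in F

print(loss(rng.normal(size=97)))
```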

If the above reasoning is correct, that means that in order to understand how SGD is able to potentially perform some kind of program synthesis, we merely need to understand properties of the parameter-function map. This would be a substantial simplification. Further, this relates learning dynamics to our earlier representation problem: the parameter-function map is precisely the same object responsible for the mystery discussed in the representation section.

This is not an airtight argument - it depends on the empirical question of whether one can ignore (or treat as second-order effects) other optimization details besides the loss function, and whether the handwave-y argument for the importance of the parameter-function map over the (function-space) loss is solid.

Even if one assumes this argument is valid, we have merely located the mystery, not resolved it. The question remains: what properties of the parameter-function map make targets learnable? At this point the reasoning becomes more speculative, but I will sketch some ideas.

The representation section concerned what structure the map encodes at each point in parameter space. Learnability appears to depend on something further: the structure of paths between points. Convexity of function-space loss implies that paths which are sufficiently straight in function space are barrier-free - roughly, if the endpoint is lower loss, the entire path is downhill. So the question becomes: which function-space paths does the map provide?

The same architectures successfully learn many diverse real-world targets. Whatever property of the map enables this, it must be relatively universal - not tailored to specific targets. This naturally leads us to ask in what cases the parameter-function map provides direct-enough paths to targets with certain structure, and to characterize what "direct enough" means.

This connects back to the representation problem. If the map encodes some notion of program structure, then path structure in parameter space induces relationships between programs - which programs are "adjacent," which are reachable from which. The representation section asks how programs are encoded as points; learnability asks how they are connected as paths. These are different aspects of the same object.

One hypothesis: compositional relationships between programs might correspond to some notion of “path adjacency” defined by the parameter-function map. If programs sharing structure are nearby - reachable from each other via direct paths - and if simpler programs lie along paths to more complex ones, then efficiency, simplicity bias, and empirically observed stagewise learning would follow naturally. Gradient descent would build incrementally rather than search randomly; the enumeration problem that dooms Solomonoff would dissolve into traversal.

This is speculative and imprecise. But there's something about the shape of what's needed that feels mathematically natural. The representation problem asks for a correspondence at the level of objects: strata in parameter space corresponding to programs. The search problem asks for something stronger - that this correspondence extends to paths. Paths in parameter space (what gradient descent traverses) should correspond to some notion of relationship or transition between programs.

This is a familiar move in higher mathematics (sometimes formalized by category theory): once you have a correspondence between two kinds of objects, you ask whether it extends to the relationships between those objects. It is especially familiar (in fields like higher category theory) to ask these kinds of questions when the "relationships between objects" take the form of paths in particular. I don't claim that existing machinery from these fields applies directly, and certainly not given the (lack of) detail I've provided in this post. But the question is suggestive enough to investigate: what should "adjacency between programs" mean? Does the parameter-function map induce or preserve such structure? And if so, what does this predict about learning dynamics that we could check empirically?

Appendix

Related work

The majority of the ideas in this post are not individually novel; I see the core value proposition as synthesizing them together in one place. The ideas I express here are, in my experience, very common among researchers at frontier labs, researchers in mechanistic interpretability, some researchers within science of deep learning, and others. In particular, the core hypothesis that deep learning is performing some tractable version of Solomonoff induction is not new, and has been written about many times. (However, I would not consider it to be a popular or accepted opinion within the machine learning field at large.) Personally, I have considered a version of this hypothesis for around three years. With this post, I aim to share a more comprehensive synthesis of the evidence for this hypothesis, as well as point to specific research directions for formalizing this idea.

Below is an incomplete list of what is known and published in various areas:

Existing comparisons between deep learning and program synthesis. The ideas surrounding Solomonoff induction have been highly motivating for many early AGI-focused researchers. Shane Legg (DeepMind cofounder) wrote his PhD thesis on Solomonoff induction; John Schulman (OpenAI cofounder) discusses the connection to deep learning explicitly here; Ilya Sutskever (OpenAI cofounder) has been giving talks on related ideas. There are a handful of places one can find a hypothesized connection between deep learning and Solomonoff induction stated explicitly, though I do not believe any of these were the first to do so. My personal experience is that such intuitions are fairly common among e.g. people working at frontier labs, even if they are not published in writing. I am not sure who had the idea first, and suspect it was arrived at independently multiple times.

Feature learning. It would not be accurate to say that the average ML researcher views deep learning as a complete black-box algorithm; it is well-accepted and uncontroversial that deep neural networks are able to extract "features" from the task which they use to perform well. However, it is a step beyond to claim that these features are actually extracted and composed in some mechanistic fashion resembling a computer program.

Compositionality, hierarchy, and modularity. My informal notion of "programs" here is quite closely related to compositionality. It is a fairly well-known hypothesis that supervised learning performs well due to compositional/hierarchical/modular structure in the model and/or the target task. This is particularly prominent within approximation theory (especially the literature on depth separations) as an explanation for the issues I highlighted in the "paradox of approximation" section.

Mechanistic interpretability. The (implicit) underlying premise of the field of mechanistic interpretability is that one can understand the internal mechanistic (read: program-like) structure responsible for a network's outputs. Mechanistic interpretability is responsible for discovering a significant number of examples of this type of structure, which I believe constitutes the single strongest evidence for the program synthesis hypothesis. I discuss a few case studies of this structure in the post, but there are possibly hundreds more examples which I did not cover, from the many papers within the field. A recent review can be found here.

Singular learning theory. In the “path forward” section, I highlight a possible role of degeneracies in controlling some kind of effective program structure. In some way (which I have gestured at but not elaborated on), the ideas presented in this post can be seen as motivating singular learning theory as a means to formally ground these ideas and produce practical tools to operationalize them. This is most explicit within a line of work within singular learning theory that attempts to precisely connect program synthesis with the singular geometry of a (toy) learning machine.

 

  1. ^
  2. ^

    From his PhD thesis, pages 23-24.

  3. ^

    Together with independent contributions by Kolmogorov, Chaitin, and Levin.

  4. ^

    One must be careful, as some commonly stated "proofs" of this optimality are somewhat tautological. These typically go roughly something like: under the assumption that the data generating process has low Kolmogorov complexity, then Solomonoff induction is optimal. This is of course completely circular, since we have, in effect, assumed from the start that the inductive bias of Solomonoff induction is correct. Better proofs of this fact instead show a regret bound: on any sequence, Solomonoff induction's cumulative loss is at most a constant worse than any computable predictor - where the constant depends on the complexity of the competing predictor, not the sequence. This is a frequentist guarantee requiring no assumptions about the data source. See in particular Section 3.3.2 and Theorem 3.3 of this PhD thesis. Thanks to Cole Wyeth for pointing me to this argument.

  5. ^
  6. ^

    Depending on what one means by "protein folding," one can debate whether the problem has truly been solved; for instance, the problem of how proteins fold dynamically over time is still open AFAIK. See this fairly well-known blog post by molecular biologist Mohammed AlQuraishi for more discussion, and why he believes calling AlphaFold a "solution" can be appropriate despite the caveats.

  7. ^

    In fact, the solution can be seen as a representation-theoretic algorithm for the group of integers under addition mod $n$ (the cyclic group $\mathbb{Z}/n\mathbb{Z}$). Follow-up papers demonstrated that neural networks also learn interpretable representation-theoretic algorithms for more general groups than cyclic groups.

  8. ^

    For what it's worth, in this specific case, we do know what must be driving the process, if not the training loss: the regularization / weight decay. In the case of grokking, we do have decent understanding of how weight decay leads the training to prefer the generalizing solution. However, this explanation is limited in various ways, and it is unclear how far it generalizes beyond this specific setting.

  9. ^

    To be clear, one can still apply existing mechanistic interpretability tools to real language models and get productive results. But the results typically only manage to explain a small portion of the network, and in a way which is (in my opinion) less clean and convincing than e.g. Olah et al. (2020)'s reverse-engineering of InceptionV1.

  10. ^

    This phrase is often abused - for instance, if you show up to court with no evidence, I can reasonably infer that no good evidence for your case exists. This is a gap between logical and heuristic/Bayesian reasoning. In the real world, if evidence for a proposition exists, it usually can and will be found (because we care about it), so you can interpret the absence of evidence for a proposition as suggesting that the proposition is false. However, in this case, I present a specific reason why one should not expect to see evidence even if the proposition in question is true.

  11. ^

    Many interpretability researchers specifically believe in the linear representation hypothesis, that the variables of this program structure ("features") correspond to linear directions in activation space, or the stronger superposition hypothesis, that such directions form a sparse overbasis for activation space. One must be careful in interpreting these hypotheses as there are different operationalizations within the community; in my opinion, the more sophisticated versions are far more plausible than naive versions (thank you to Chris Olah for a helpful conversation here). Presently, I am skeptical that linear representations give the most prosaic description of a model's behavior or that this will be sufficient for complete reverse-engineering, but believe that the hypothesis is pointing at something real about models, and tools like SAEs can be helpful as long as one is aware of their limitations.

  12. ^

    See for instance the results of these papers, where the authors incentivize spatial modularity with an additional regularization term. The authors interpret this as incentivizing modularity, but I would interpret it as incentivizing existing modularity to come to the surface.

  13. ^
  14. ^

    The credit for these ideas should really go to Dan Murfet, as well as his current/former students including Will Troiani, James Clift, Rumi Salazar, and Billy Snikkers.

  15. ^

    Let $f_\theta(x)$ denote the output of the model on input $x$ with parameters $\theta$. Formally, we say that a point $\theta$ in parameter space $W$ is degenerate or singular if there exists a tangent vector $v$ such that the directional derivative $\nabla_v f_\theta(x) = 0$ for all $x$. In other words, moving in some direction in parameter space doesn't change the behavior of the model (up to first order).

  16. ^

    This is not as alien as it may seem. Note that this provides a perspective which connects nicely with both neural networks and classical computation. First consider, for instance, that the gates of a Boolean circuit literally define a system of equations over $\mathbb{F}_2$, whose solution set is an algebraic variety over $\mathbb{F}_2$. Alternatively, consider that a neural network with polynomial (or analytic) activation function defines a system of equations over $\mathbb{R}$, whose vanishing set is an algebraic (respectively, analytic) variety over $\mathbb{R}$. Of course this goes only a small fraction of the way to closing this gap, but one can start to see how this becomes plausible.

  17. ^

    A frequent perspective is to write this minibatch-loss in terms of its mean (population) value plus some noise term. That is, we think of optimizers like SGD as something like “gradient descent plus noise.” This is quite similar to mathematical models like overdamped Langevin dynamics, though note that the noise term may not be Gaussian as in Langevin dynamics. It is an open question whether the convergence of neural network training is due to the population term or the noise term. (Note that this is a separate question from whether the generalization / inductive biases of SGD-trained neural networks are due to the population term or the noise term.) I am tentatively of the belief (somewhat controversially) that both convergence and inductive bias are due to structure in the population loss rather than the noise term, but explaining my reasoning here is a bit out of scope.



Discuss

The Total Solar Eclipse of 2238 and GPT-5.2 Pro

2026-01-20 22:27:11

Published on January 20, 2026 2:27 PM GMT

2026 marks exactly 1 millennium since the last total solar eclipse visible from Table Mountain. The now famous (among people who sit behind me at work) eclipse of 1026 would’ve been visible to anyone at the top of Lion’s Head or Table Mountain and basically everywhere else in Cape Town. Including De Waal Park, where I’m currently writing this. I’ve hiked up Lion’s Head a lot and still find the view pretty damn awe inspiring. To have seen a total solar eclipse up there must have been absurdly damn awe inspiring. Maybe also terrifying if you didn’t know what was happening. But either way, I’m jealous of anyone that got to experience it. If you continued flipping through the exciting but predictable Five Millennium Canon of Solar Eclipses: -1999 to +3000 (2000 BCE to 3000 CE) by Jean Meeus and Fred Espenak, you’d notice something weird and annoying - you have to flip all the way to the year 2238 for the next total solar eclipse to hit Table Mountain.

Tim Urban has this idea of converting all of human history into a 1000 page book. He says that basically up until page 950 there’s just nothing going on.

“But if you look at Page 1,000—which, in this metaphor, Page 1,000 is the page that ends with today, so that goes from the early 1770s to today—that is nothing like any other page. It is completely an anomaly in the book. If you’re reading, if you’re this alien, this suddenly got incredibly interesting in the last 10 pages, but especially on this page. The alien is thinking, “OK, shit is going down.”

The gap between eclipses on Table Mountain is the real-life version of this book. Imagine if aliens had put a secret camera where the cable car is, one that only popped up during a total solar eclipse: they'd see something like the island from Lost, then wait a hundred or a thousand years and see the exact same thing, except maybe it's raining.

 

 

And they’d see this 4 more times.

 

 

 

Then they go to open the image from 2238 and suddenly:

 

There’s a soccer stadium and also is that a city???

 

Just knowing the date of these eclipses has made the past and future feel much more real to me.

I saw the total solar eclipse of 2024 in the middle of an absolutely packed Klyde Warren Park in Dallas.

 

 

When totality started, there were barely any cars on the highway and the cars you could see suddenly had their headlights on. The office tower behind me was filled with people on every floor staring outside, all backlit by the lights which had suddenly turned on.

We talk about how the animals start going crazy because they think it’s night as though this doesn’t include us, but actually we are so included here, and go even crazier than any birds doing morning chirps. The extent to which the city of Dallas was turned upside down by this event is hard to believe. And it wasn’t just a physical transformation. The entire energy of the city felt different, not just compared to the day before but compared to any other city I’ve been in. I have never felt so connected to everyone around me and so optimistic and elated at the same time, all while knowing everyone else felt the exact same way.

It’s hard to imagine what it must have been like to be a person in Cape Town in the year 1026. The image in my head feels murky and I guess pastoral. But imagining what it was like during a total solar eclipse in the year 1026, is much easier. I can picture myself on top of Lion’s Head or Table Mountain or on the beach in 1026. I can picture the people around me seeing it and wondering what’s going on. I can picture myself wondering what’s going on. Because even when you know what’s going on you’re still wondering what’s going on.

When I think about the eclipse of 2238 it’s even easier to connect with those people in that Cape Town. If the people of that time have anything like newspapers or radio or the internet or TikTok, I can imagine the literal hype and electricity in the air over the months and days and hours leading up to the eclipse. It’s also weird to briefly think about how everything I’m using now and consuming now is going to be considered ancient history by the lovely people that get to experience seeing an eclipse in 2238 at the top of Lion’s Head. My MacBook, which feels so fast and which I love so dearly - junk. TikTok would be like a wax cylinder record, and they’d wonder how people managed to code with an AI as silly as Opus-4.5 or worse by hand somehow. Every movie from 2026 would be older to them than the movie of the train going into the station is to us. I don’t know how they are going to build software in the year 2238. I barely know how I built the website I used to find this stuff out. I’ve wanted to know when the next and previous eclipses are going to happen on Lion’s Head ever since I got back from the eclipse in 2024.

I started by searching on Google for something that finds eclipses by location rather than by time. We have the Five Millennium Canon of Solar Eclipses, but this is still ordered by time. The answer to my question felt like something we could easily work out with existing data and a for loop in whatever your favorite programming language is. NASA hosts a csv file with the aforementioned five millennia of past and future eclipses. So we just have to parse this csv, figure out what each of the 16 columns represents, then loop over the paths of the eclipses and find an intersection with the coordinates of Lion’s Head.
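For a sense of what that loop might look like, here is a rough sketch of my own (far cruder than the script the post links later, and the column names below are hypothetical placeholders, not the catalogue's real 16 columns, which you'd have to map by hand):

```python
# Rough sketch only: filter total eclipses whose point of greatest eclipse
# passes near a chosen location. A real version needs the full central-path
# geometry, not just one point per eclipse.
import csv
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    R = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

LIONS_HEAD = (-33.935, 18.389)

def total_eclipses_near(catalogue_csv, lat, lon, max_km=50):
    hits = []
    with open(catalogue_csv) as f:
        for row in csv.DictReader(f):
            if row["eclipse_type"].startswith("T"):  # hypothetical column name
                d = haversine_km(lat, lon,
                                 float(row["ge_lat"]), float(row["ge_lon"]))  # hypothetical columns
                if d <= max_km:
                    hits.append((row["date"], round(d, 1)))  # hypothetical column
    return hits

# print(total_eclipses_near("5MCSE.csv", *LIONS_HEAD))
```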

Luckily the year was 2024, or 5 A.G.T. (Anno GPT-3) - so I asked what was probably GPT-4 if it could search for the date of the next and previous eclipses. It used the search tool it had at the time, but it could not find anything. I tried this a few more times, usually whenever I finished a hike and a new model had been recently released. It never worked though. That is, until a week ago. This January I paid $200 for GPT-5.2 Pro after reading some, okay, a single, extremely positive review about it. To be honest my review is: it kind of sucks, but I'm still happy I paid the $200. This is because towards the end of the month I set 5.2 Pro to extended thinking, then typed this prompt:

“How could I make an app that lets you pick a place on earth and then finds the last time or next time there was or will be a full solar eclipse there, what data sources would I use what algorithms and how accurate could I be.”

It thought for 17m and 6 seconds then replied with a whole bunch of words I didn’t understand. So I replied:

“Can you write a prototype in python?”

It thought for another 20m then replied with this script.

I pasted it into a file then ran it with the coordinates of Lion’s Head and saw the answer to my question: 1026. That was the last time a total solar eclipse was visible from Lion’s Head.

Since it was a script I could also use any coordinates on Earth and find the same answer for that place (as long as it was in the five millennia catalogue).

I popped the Python script into Claude Code with Opus set to 4.5; it did some verbing and then I got this website out a few hours later: https://findmyeclipse.com

In 2238 I somehow doubt the vast majority of people will ever think about code when creating things, in the same way I don’t think about binary or transistors when programming. What does a world where software can be written without any special knowledge look like, and then what does it look like after 100 years of that? I don’t have any answers but I do know one thing: the people of Cape Town in 2238 will know that this eclipse is not just a rare eclipse, but a rare eclipse among rare eclipses. They will look forward to it. They will write about the best places to see it from. I can imagine being a person in 2238 thinking, boy this eclipse would look sick from Lion’s Head. Thinking, I wonder if it’s going to be too busy up there. Maybe consider going up and camping on Table Mountain the night before. And I can imagine being in any one of these places or just in a packed De Waal Park preparing for totality and when I imagine myself there with everyone around me, it’s hard not to be optimistic.

 

 


 

 



 



Discuss