Published on January 20, 2026 3:35 PM GMT
Epistemic status: This post is a synthesis of ideas that are, in my experience, widespread among researchers at frontier labs and in mechanistic interpretability, but rarely written down comprehensively in one place - different communities tend to know different pieces of evidence. The core hypothesis - that deep learning is performing something like tractable program synthesis - is not original to me (even to me, the ideas are ~3 years old), and I suspect it has been arrived at independently many times. (See the appendix on related work).
This is also far from finished research - more a snapshot of a hypothesis that seems increasingly hard to avoid, and a case for why formalization is worth pursuing. I discuss the key barriers and how tools like singular learning theory might address them towards the end of the post.
Thanks to Dan Murfet, Jesse Hoogland, Max Hennick, and Rumi Salazar for feedback on this post.
Sam Altman: Why does unsupervised learning work?
Dan Selsam: Compression. So, the ideal intelligence is called Solomonoff induction…[1]
The central hypothesis of this post is that deep learning succeeds because it's performing a tractable form of program synthesis - searching for simple, compositional algorithms that explain the data. If correct, this would reframe deep learning's success as an instance of something we understand in principle, while pointing toward what we would need to formalize to make the connection rigorous.
I first review the theoretical ideal of Solomonoff induction and the empirical surprise of deep learning's success. Next, mechanistic interpretability provides direct evidence that networks learn algorithm-like structures; I examine the cases of grokking and vision circuits in detail. Broader patterns provide indirect support: how networks evade the curse of dimensionality, generalize despite overparameterization, and converge on similar representations. Finally, I discuss what formalization would require, why it's hard, and the path forward it suggests.
Whether we are a detective trying to catch a thief, a scientist trying to discover a new physical law, or a businessman attempting to understand a recent change in demand, we are all in the process of collecting information and trying to infer the underlying causes.
-Shane Legg[2]
Early in childhood, human babies learn object permanence - that objects persist even when not directly observed. In doing so, their world becomes a little less confusing: it is no longer surprising that their mother appears and disappears by putting her hands in front of her face. They move from raw sensory perception towards interpreting their observations as coming from an external world: a coherent, self-consistent process which determines what they see, feel, and hear.
As we grow older, we refine this model of the world. We learn that fire hurts when touched; later, that one can create fire with wood and matches; eventually, that fire is a chemical reaction involving fuel and oxygen. At each stage, the world becomes less magical and more predictable. We are no longer surprised when a stove burns us or when water extinguishes a flame, because we have learned the underlying process that governs their behavior.
This process of learning only works because the world we inhabit, for all its apparent complexity, is not random. It is governed by consistent, discoverable rules. If dropping a glass causes it to shatter on Tuesday, it will do the same on Wednesday. If one pushes a ball off the top of a hill, it will roll down, at a rate that any high school physics student could predict. Through our observations, we implicitly reverse-engineer these rules.
This idea - that the physical world is fundamentally predictable and rule-based - has a formal name in computer science: the physical Church-Turing thesis. Precisely, it states that any physical process can be simulated to arbitrary accuracy by a Turing machine. Anything from a star collapsing to a neuron firing, can, in principle, be described by an algorithm and simulated on a computer.
From this perspective, one can formalize this notion of "building a world model by reverse-engineering rules from what we can see." We can operationalize this as a form of program synthesis: from observations, attempting to reconstruct some approximation of the "true" program that generated those observations. Assuming the physical Church-Turing thesis, such a learning algorithm would be "universal," able to eventually represent and predict any real-world process.
But this immediately raises a new problem. For any set of observations, there are infinitely many programs that could have produced them. How do we choose? The answer is one of the oldest principles in science: Occam's razor. We should prefer the simplest explanation.
In the 1960s, Ray Solomonoff formalized this idea into a theory of universal induction which we now call Solomonoff induction. He defined the "simplicity" of a hypothesis as the length of the shortest program that can describe it (a concept known as Kolmogorov complexity). An ideal Bayesian learner, according to Solomonoff, should prefer hypotheses (programs) that are short over ones that are long. This learner can, in theory, learn anything that is computable, because it searches the space of all possible programs, using simplicity as its guide to navigate the infinite search space and generalize correctly.
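To make the flavor of this concrete, here is a toy sketch of Solomonoff-style weighting. The two hypotheses and their "description lengths" are made up for illustration, standing in for the space of all programs; real Solomonoff induction enumerates every program and is incomputable.

```python
# A toy sketch of Solomonoff-style weighting. The hypothesis space and the
# lengths (in bits) are illustrative stand-ins, not a real universal prior.

def posterior_predict(hypotheses, data, x_next):
    """hypotheses: list of (length_in_bits, function). Keep the hypotheses
    consistent with the data, weight each by 2^-length, mix their predictions."""
    weights, predictions = [], []
    for length, f in hypotheses:
        if all(f(x) == y for x, y in data):          # consistent with observations?
            weights.append(2.0 ** -length)           # simplicity prior
            predictions.append(f(x_next))
    total = sum(weights)
    return {value: sum(w for w, v in zip(weights, predictions) if v == value) / total
            for value in set(predictions)}

hypotheses = [
    (8,  lambda n: n * n),                                # short program: "square it"
    (64, lambda n: {0: 0, 1: 1, 2: 4, 3: 9}.get(n, 0)),   # long program: a lookup table
]
data = [(0, 0), (1, 1), (2, 4), (3, 9)]
print(posterior_predict(hypotheses, data, 4))             # nearly all mass on 16
```

The point is only the shape of the rule: hypotheses consistent with the data share the posterior in proportion to $2^{-\text{length}}$, so the shortest surviving program dominates predictions.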
The invention of Solomonoff induction began[3] a rich and productive subfield of computer science, algorithmic information theory, which persists to this day. Solomonoff induction is still widely viewed as the ideal or optimal self-supervised learning algorithm, which one can prove formally under some assumptions[4]. These ideas (or extensions of them like AIXI) were influential for early deep learning thinkers like Jürgen Schmidhuber and Shane Legg, and shaped a line of ideas attempting to theoretically predict how smarter-than-human machine intelligence might behave, especially within AI safety.
Unfortunately, despite its mathematical beauty, Solomonoff induction is completely intractable. Vanilla Solomonoff induction is incomputable, and even approximate versions like speed induction are exponentially slow[5]. Theoretical interest in it as a "platonic ideal of learning" remains to this day, but practical artificial intelligence has long since moved on, assuming it to be hopelessly unfeasible.
Meanwhile, neural networks were producing results that nobody had anticipated.
This was not the usual pace of scientific progress, where incremental advances accumulate and experts see breakthroughs coming. As late as 2015, most Go researchers thought human-level play was decades away; AlphaGo defeated Lee Sedol the following year. Protein folding had resisted fifty years of careful work; AlphaFold essentially solved it[6] over a single competition cycle. Large language models began writing code, solving competition math problems, and engaging in apparent reasoning - capabilities that emerged from next-token prediction without ever being explicitly specified in the loss function. At each stage, domain experts (not just outsiders!) were caught off guard. If we understood what was happening, we would have predicted it. We did not.
The field's response was pragmatic: scale the methods that work, stop trying to understand why they work. This attitude was partly earned. For decades, hand-engineered systems encoding human knowledge about vision or language had lost to generic architectures trained on data. Human intuitions about what mattered kept being wrong. But the pragmatic stance hardened into something stronger - a tacit assumption that trained networks were intrinsically opaque, that asking what the weights meant was a category error.
At first glance, this assumption seemed to have some theoretical basis. If neural networks were best understood as "just curve-fitting" function approximators, then there was no obvious reason to expect the learned parameters to mean anything in particular. They were solutions to an optimization problem, not representations. And when researchers did look inside, they found dense matrices of floating-point numbers with no obvious organization.
But a lens that predicts opacity makes the same prediction whether structure is absent or merely invisible. Some researchers kept looking.
Power et al. (2022) train a small transformer on modular addition: given two numbers, output their sum mod 113. Only a fraction of the possible input pairs are used for training - say, 30% - with the rest held out for testing.
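A minimal sketch of that data setup; the modulus 113 and the 30% training fraction are from the description above, everything else (like the seed) is arbitrary.

```python
# All pairs (a, b) with their sums mod 113, of which only ~30% are used for training.
import itertools
import random

p = 113
pairs = list(itertools.product(range(p), repeat=2))        # all 113 * 113 input pairs
random.seed(0)
random.shuffle(pairs)

n_train = int(0.3 * len(pairs))
train = [(a, b, (a + b) % p) for a, b in pairs[:n_train]]  # seen during training
test  = [(a, b, (a + b) % p) for a, b in pairs[n_train:]]  # held out
```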
The network memorizes the training pairs quickly, getting them all correct. But on pairs it hasn't seen, it does no better than chance. This is unsurprising: with enough parameters, a network can simply store input-output associations without extracting any rule. And stored associations don't help you with new inputs.
Here's what's unexpected. If you keep training, despite the training loss already being nearly as low as it can go, the network eventually starts getting the held-out pairs right too. Not gradually, either: test performance jumps from chance to near perfect over only a few thousand training steps.
So something has changed inside the network. But what? It was already fitting the training data; the data didn't change. There's no external signal that could have triggered the shift.
One way to investigate is to look at the weights themselves. We can do this at multiple checkpoints over training and ask: does something change in the weights around the time generalization begins?
It does. The weights early in training, during the memorization phase, don't have much structure when you analyze them. Later, they do. Specifically, if we look at the embedding matrix, we find that it's mapping numbers to particular locations on a circle. The number 0 maps to one position, 1 maps to a position slightly rotated from that, and so on, wrapping around. More precisely: the embedding of each number contains sine and cosine values at a small set of specific frequencies.
This structure is absent early in training. It emerges as training continues, and it emerges around the same time that generalization begins.
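One way to check for this structure is to Fourier-transform the embedding matrix over the token dimension and see how concentrated the power is. A sketch, with a random matrix standing in for the trained embeddings (which are not reproduced here):

```python
# For a grokked network, almost all of the power lands on a handful of
# frequencies; for this random stand-in it is spread out roughly evenly.
import numpy as np

W_E = np.random.randn(113, 128)                 # placeholder for the trained embeddings

spectrum = np.abs(np.fft.fft(W_E, axis=0))      # transform over the token dimension
power_per_freq = (spectrum ** 2).sum(axis=1)
print(np.argsort(power_per_freq)[::-1][:6])     # the dominant frequencies
```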
So what is this structure doing? Following it through the network reveals something unexpected: the network has learned an algorithm for modular addition based on trigonometry.[7]
The algorithm exploits how angles add. If you represent a number as a position on a circle, then adding two numbers corresponds to adding their angles. The network's embedding layer does this representation. Its middle layers then combine the sine and cosine values of the two inputs using trigonometric identities. These operations are implemented in the weights of the attention and MLP layers: one can read off coefficients that correspond to the terms in these identities.
Finally, the network needs to convert back to a discrete answer. It does this by checking, for each possible output $c$, how well $c$ matches the sum it computed. Specifically, the logit for output $c$ depends on $\sum_{\omega} \cos(\omega\,(a + b - c))$, summed over the handful of frequencies $\omega$ used in the embedding. This quantity is maximized when $c$ equals $a + b \bmod 113$ - the correct answer. At that point the cosines at different frequencies all equal 1 and add constructively. For wrong answers, they point in different directions and cancel.
This isn't a loose interpretive gloss. Each piece - the circular embedding, the trig identities, the interference pattern - is concretely present in the weights and can be verified by ablations.
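As a quick sanity check on the interference arithmetic itself, here is a numerical sketch. The key frequencies below are illustrative; a trained network picks its own small set, which varies with the random seed.

```python
import numpy as np

p = 113
key_freqs = [14, 35, 41, 52]                     # illustrative key frequencies
a, b = 32, 41

c = np.arange(p)
# Logit-like score for each candidate output c: sum over key frequencies k of
# cos(2*pi*k*(a + b - c) / p). Every term equals 1 exactly when c = (a + b) mod p.
logits = sum(np.cos(2 * np.pi * k * (a + b - c) / p) for k in key_freqs)
print(logits.argmax(), (a + b) % p)              # both print 73
```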
So here's the picture that emerges. During the memorization phase, the network solves the task some other way - presumably something like a lookup table distributed across its parameters. It fits the training data, but the solution doesn't extend. Then, over continued training, a different solution forms: this trigonometric algorithm. As the algorithm assembles, generalization happens. The two are not merely correlated; tracing the structure in the weights and the performance on held-out data, they move together.
What should we make of this? Here’s one reading: the difference between a network that memorizes and a network that generalizes is not just quantitative, but qualitative. The two networks have learned different kinds of things. One has stored associations. The other has found a method - a mechanistic procedure that happens to work on inputs beyond those it was trained on, because it captures something about the structure of the problem.
This is a single example, and a toy one. But it raises a question worth taking seriously. When networks generalize, is it because they've found something like an algorithm? And if so, what does that tell us about what deep learning is actually doing?
It's worth noting what was and wasn't in the training data. The data contained input-output pairs: "32 and 41 gives 73," and so on. It contained nothing about how to compute them. The network arrived at a method on its own.
And both solutions - the lookup table and the trigonometric algorithm - fit the training data equally well. The network's loss was already near minimal during the memorization phase. Whatever caused it to keep searching, to eventually settle on the generalizing algorithm instead, it wasn't that the generalizing algorithm fit the data better. It was something else - some property of the learning process that favored one kind of solution over another.
The generalizing algorithm is, in a sense, simpler. It compresses what would otherwise be thousands of stored associations into a compact procedure. Whether that's the right way to think about what happened here - whether "simplicity" is really what the training process favors - is not obvious. But something made the network prefer a mechanistic solution that generalized over one that didn't, and it wasn't the training data alone.[8]
Grokking is a controlled setting - a small network, a simple task, designed to be fully interpretable. Does the same kind of structure appear in realistic models solving realistic problems?
Olah et al. (2020) study InceptionV1, an image classification network trained on ImageNet - a dataset of over a million photographs labeled with object categories. The network takes in an image and outputs a probability distribution over a thousand possible labels: "car," "dog," "coffee mug," and so on. Can we understand this more realistic setting?
A natural starting point is to ask what individual neurons are doing. Suppose we take a neuron somewhere in the network. We can find images that make it activate strongly by either searching through a dataset or optimizing an input to maximize activation. If we collect images that strongly activate a given neuron, do they have anything in common?
In early layers, they do, and the patterns we find are simple. Neurons in the first few layers respond to edges at particular orientations, small patches of texture, transitions between colors. Different neurons respond to different orientations or textures, but many are selective for something visually recognizable.
In later layers, the patterns we find become more complex. Neurons respond to curves, corners, or repeating patterns. Deeper still, neurons respond to things like eyes, wheels, or windows - object parts rather than geometric primitives.
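The probing step behind these observations - "optimizing an input to maximize activation" - is simple to sketch. Here is a minimal version using torchvision's GoogLeNet (InceptionV1); the layer and channel are arbitrary illustrative choices, and the regularizers Olah et al. use to get clean visualizations (jitter, frequency penalties, transformation robustness) are omitted.

```python
import torch
import torchvision

model = torchvision.models.googlenet(weights="IMAGENET1K_V1").eval()

activations = {}
def hook(module, inputs, output):
    activations["value"] = output

model.inception4c.register_forward_hook(hook)    # illustrative layer choice
channel = 7                                      # illustrative channel choice

img = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.05)

for _ in range(256):
    optimizer.zero_grad()
    model(img)
    loss = -activations["value"][0, channel].mean()  # maximize the channel's activation
    loss.backward()
    optimizer.step()
# `img` is now an input that (locally) maximizes the chosen unit's activation.
```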
This already suggests a hierarchy: simple features early, complex features later. But the more striking finding is about how the complex features are built.
Olah et al. do not just visualize what neurons respond to. They trace the connections between layers - examining the weights that connect one layer's neurons to the next, identifying which earlier features contribute to which later ones. What they find is that later features are composed from earlier ones in interpretable ways.
There is, for instance, a neuron in InceptionV1 that we identify as responding to dog heads. If we trace its inputs by looking at which neurons from the previous layer connect to it with strong weights, we find it receives input from neurons that detect eyes, snout, fur, and tongue. The dog head detector is built from the outputs of simpler detectors. It is not detecting dog heads from scratch; it is checking whether the right combination of simpler features is present in the right spatial arrangement.
We find the same pattern throughout the network. A neuron that detects car windows is connected to neurons that detect rectangular shapes with reflective textures. A neuron that detects car bodies is connected to neurons that detect smooth, curved surfaces. And a neuron that detects cars as a whole is connected to neurons that detect wheels, windows, and car bodies, arranged in the spatial configuration we would expect for a car.
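The underlying operation of "trace its inputs by looking at strong weights" is also easy to sketch: for one output channel of one convolution, rank the earlier-layer channels by how strongly their kernels feed into it. The layer path and channel index below are placeholders, not the actual dog-head unit from the paper.

```python
import torch
import torchvision

model = torchvision.models.googlenet(weights="IMAGENET1K_V1").eval()

# One 3x3 convolution inside an Inception module: weight shape [out_ch, in_ch, 3, 3].
W = model.inception4d.branch2[1].conv.weight.detach()
out_channel = 12                                            # placeholder unit of interest

strength = W[out_channel].flatten(start_dim=1).norm(dim=1)  # one number per input channel
top_inputs = torch.argsort(strength, descending=True)[:5]
print(top_inputs)  # the earlier-layer features this unit reads from most strongly
```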
Olah et al. call these pathways "circuits," and the term is meaningful. The structure is genuinely circuit-like: there are inputs, intermediate computations, and outputs, connected by weighted edges that determine how features combine. In their words: "You can literally read meaningful algorithms off of the weights."
And the components are reused. The same edge detectors that contribute to wheel detection also contribute to face detection, to building detection, to many other things. The network has not built separate feature sets for each of the thousand categories it recognizes. It has built a shared vocabulary of parts - edges, textures, curves, object components, etc - and combines them differently for different recognition tasks.
We might find this structure reminiscent of something. A Boolean circuit is a composition of simple gates - each taking a few bits as input, outputting one bit - wired together to compute something complex. A program is a composition of simple operations - each doing something small - arranged to accomplish something larger. What Olah et al. found in InceptionV1 has the same shape: small computations, composed hierarchically, with components shared and reused across different pathways.
From a theoretical computer science perspective, this is what algorithms look like, in general. Not just the specific trigonometric trick from grokking, but computation as such. You take a hard problem, break it into pieces, solve the pieces, and combine the results. What makes this tractable, what makes it an algorithm rather than a lookup table, is precisely the compositional structure. The reuse is what makes it compact; the compactness is what makes it feasible.
Grokking and InceptionV1 are two examples, but they are far from the only ones. Mechanistic interpretability has grown into a substantial field, and the researchers working in it have documented many such structures - in toy models, in language models, across different architectures and tasks. Induction heads, language circuits, and bracket matching in transformer language models, learned world models and multi-step reasoning in toy tasks, grid-cell-like mechanisms in RL agents, hierarchical representations in GANs, and much more. Where we manage to look carefully, we tend to find something mechanistic.
This raises a question. If what we find inside trained networks (at least when we can find anything) looks like algorithms built from parts, what does that suggest about what deep learning is doing?
What should we make of this?
We have seen neural networks learn solutions that look like algorithms - compositional structures built from simple, reusable parts. In the grokking case, this coincided precisely with generalization. In InceptionV1, this structure is what lets the network recognize objects despite the vast dimensionality of the input space. And across many other cases documented in the mechanistic interpretability literature, the same shape appears: not monolithic black-box computations, but something more like circuits.
This is reminiscent of the picture we started with. Solomonoff induction frames learning as a search for simple programs that explain data. It is a theoretical ideal - provably optimal in a certain sense, but hopelessly intractable. The connection between Solomonoff and deep learning has mostly been viewed as purely conceptual: a nice way to think about what learning "should" do, with no implications for what neural networks actually do.
But the evidence from mechanistic interpretability suggests a different possibility. What if deep learning is doing something functionally similar to program synthesis? Not through the same mechanism - gradient descent on continuous parameters is nothing like enumerative search over discrete programs. But perhaps targeting the same kind of object: mechanistic solutions, built from parts, that capture structure in the data generating process.
To be clear: this is a hypothesis. The evidence shows that neural networks can learn compositional solutions, and that such solutions have appeared alongside generalization in specific, interpretable cases. It doesn't show that this is what's always happening, or that there's a consistent bias toward simplicity, or that we understand why gradient descent would find such solutions efficiently.
But if the hypothesis is right, it would reframe what deep learning is doing. The success of neural networks would not be a mystery to be accepted, but an instance of something we already understand in principle: the power of searching for compact, mechanistic models to explain your observations. The puzzle would shift from "why does deep learning work at all?" to "how does gradient descent implement this search so efficiently?"
That second question is hard. Solomonoff induction is intractable precisely because the space of programs is vast and discrete. Gradient descent navigates a continuous parameter space using only local information. If both processes are somehow arriving at similar destinations - compositional solutions to learning problems - then something interesting is happening in how neural network loss landscapes are structured, something we do not yet understand. We will return to this issue at the end of the post.
So the hypothesis raises as many questions as it answers. But it offers something valuable: a frame. If deep learning is doing a form of program synthesis, that gives us a way to connect disparate observations - about generalization, about convergence of representations, about why scaling works - into a coherent picture. Whether this picture can make sense of more than just these particular examples is what we'll explore next.
Clarifying the hypothesis
I think one can largely read this post with a purely operational, “you know it when you see it” definition of “programs” and “algorithms”. But there are real conceptual issues here if you try to think about this carefully.
In most computational systems, there's a vocabulary that comes with the design - instructions, subroutines, registers, data flow, and so on. We can point to the “program” because the system was built to make it visible.
Neural networks are not like this. We have neurons, weights, activations, etc, but these may not be the right atoms of computation. If there's computational structure in a trained network, it doesn't automatically come labeled. So if we want to ask whether networks learn programs, we need to know what we're looking for. What would count as finding one?
This is a real problem for interpretability too. When researchers claim to find "circuits" or “features” in a network, what makes that a discovery rather than just a pattern they liked? There has to be something precise and substrate-independent we're tracking. It helps to step back and consider what computational structure even is in the cases we understand it well.
Consider the various models of computation: Turing machines, lambda calculus, Boolean circuits, etc. They have different primitives - tapes, substitution rules, logic gates - but the Church-Turing thesis tells us they're equivalent. Anything computable in one is computable in all the others. So "computation" isn't any particular formalism. It's whatever these formalisms have in common.
What do they have in common? Let me point to something specific: each one builds complex operations by composing simple pieces, where each piece only interacts with a small number of inputs. A Turing machine's transition function looks at one cell. A Boolean gate takes two or three bits. A lambda application involves one function and one argument. Complexity comes from how pieces combine, not from any single piece seeing the whole problem.
Is this just a shared property, or something deeper?
One reason to take it seriously: you can derive a complete model of computation from just this principle. Ask "what functions can I build by composing pieces of bounded arity?" and work out the answer carefully. You get (in the discrete case) Boolean circuits - not a restricted fragment of computation, but a universal model, equivalent to all the others. The composition principle alone is enough to generate computation in full generality.
The bounded-arity constraint is essential. If each piece could see all inputs, we would just have lookup tables. What makes composition powerful is precisely that each piece is “local” and can only interact with so many things at once - it forces solutions to have genuine internal structure.
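A toy illustration of the point: each gate below sees only two bits, yet the composed circuit computes a global property (parity) of its whole input - something no single bounded-arity piece could do alone.

```python
def xor(a, b):              # one bounded-arity piece
    return a ^ b

def parity(bits):
    acc = 0
    for bit in bits:        # a chain of 2-input gates
        acc = xor(acc, bit)
    return acc

print(parity([1, 0, 1, 1]))  # 1: an odd number of ones
```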
So when I say networks might learn "programs," I mean: solutions built by composing simple pieces, each operating on few inputs. Not because that's one nice kind of structure, but because that may be what computation actually is.
Note that we have not implied that the computation is necessarily over discrete values - it may be over continuous values, as in analog computation. (However, the “pieces” must be discrete, for this to even be a coherent notion. This causes issues when combined with the subsequent point, as we will discuss towards the end of the post.)
A clarification: the network's architecture trivially has compositional structure - the forward pass is executable on a computer. That's not the claim. The claim is that training discovers an effective program within this substrate. Think of an FPGA: a generic grid of logic components that a hardware engineer configures into a specific circuit. The architecture is the grid; the learned weights are the configuration.
This last point - that the program structure in neural networks is learned and depends on continuous parameters - is what makes the issue subtle, unlike the other models of computation we're familiar with (even analog computation). It also makes formalization genuinely difficult, and we will return to it towards the end of the post.
By program synthesis, I mean a search through possible programs to find one that fits the data.
Two things make this different from ordinary function fitting.
First, the search is general-purpose. Linear regression searches over linear functions. Decision trees search over axis-aligned partitions. These are narrow hypothesis classes, chosen by the practitioner to match the problem. The claim here is different: deep learning searches over a space that can express essentially any efficient computable function. It's not that networks are good at learning one particular kind of structure - it's that they can learn whatever structure is there.
Second, the search is guided by strong inductive biases. Searching over all programs is intractable without some preference for certain programs over others. The natural candidate is simplicity: favor shorter or less complex programs over longer or more complex ones. This is what Solomonoff induction does - it assigns prior probability to programs based on their length, then updates on data.
Solomonoff induction is the theoretical reference point. It's provably optimal in a certain sense: if the data has any computable structure, Solomonoff induction will eventually find it. But it's also intractable - not just slow, but literally incomputable in its pure form, and exponentially slow even in approximations.
The hypothesis is that deep learning achieves something functionally similar through completely different means. Gradient descent on continuous parameters looks nothing like enumeration over discrete programs. But perhaps both are targeting the same kind of object - simple programs that capture structure - and arriving there by different routes. We will return to the issue towards the end of the post.
This would require the learning process to implement something like simplicity bias, even though "program complexity" isn't in the loss function. Whether that's exactly the right characterization, I'm not certain. But some strong inductive bias has to be operating - otherwise we couldn't explain why networks generalize despite having the capacity to memorize, or why scaling helps rather than hurts.
I've thought most deeply about supervised and self-supervised learning using stochastic optimization (SGD, Adam, etc) on standard architectures like MLPs, CNNs, or transformers, on standard tasks like image classification or autoregressive language prediction, and it is there that I am most prepared to defend these claims. I also believe the picture extends to settings like diffusion models, adversarial setups, and reinforcement learning, but I've thought less about these and can't be as confident.
The preceding case studies provide a strong existence proof: deep neural networks are capable of learning and implementing non-trivial, compositional algorithms. The evidence that InceptionV1 solves image classification by composing circuits, or that a transformer solves modular addition by discovering a Fourier-based algorithm, is quite hard to argue with. And, of course, there are more examples than these which we have not discussed.
Still, the question remains: is this the exception or the rule? It would be completely consistent with the evidence presented so far for this type of behavior to just be a strange edge case.
Unfortunately, mechanistic interpretability is not yet enough to settle the question. The settings where today's mechanistic interpretability tools provide such clean, complete, and unambiguously correct results[9] are very rare.
Aren't most networks uninterpretable? Why this doesn't disprove the thesis.
Should we not take the lack of such clean mechanistic interpretability results as active counterevidence against our hypothesis? If models were truly learning programs in general, shouldn't those programs be readily apparent? Instead the internals of these systems appear far more "messy."
This objection is a serious one, but it makes a leap in logic. It conflates the statement "our current methods have not found a clean programmatic structure" with the much stronger statement "no such structure exists." In other words, absence of evidence is not evidence of absence[10]. The difficulty we face may not be an absence of structure, but a mismatch between the network's chosen representational scheme and the tools we are currently using to search for it.
To make this concrete, consider a thought experiment, adapted from the paper "Could a Neuroscientist Understand a Microprocessor?":
Imagine a team of neuroscientists studying a microprocessor (MOS 6502) that runs arcade (Atari) games. Their tools are limited to their trade: they can, for instance, probe the voltage of individual transistors and lesion them to observe the effect on gameplay. They do not have access to the high-level source code or architecture diagrams.
As the paper confirms, the neuroscientists would fail to understand the system. This failure would not be because the system lacks compositional, program structure - it is, by definition, a machine that executes programs. Their failure would be one of mismatched levels of abstraction. The meaningful concepts of the software (subroutines, variables, the call stack) have no simple, physical correlate at the transistor level. The "messiness" they would observe - like a single transistor participating in calculating a score, drawing a sprite, and playing a sound - is an illusion created by looking at the wrong organizational level.
My claim is that this is the situation we face with neural networks. Apparent "messiness" like polysemanticity is not evidence against a learned program; it is the expected signature of a program whose logic is not organized at the level of individual neurons. The network may be implementing something like a program, but using a "compiler" and an "instruction set" that are currently alien to us.[11]
The clean results from the vision and modular addition case studies are, in my view, instances where strong constraints (e.g., the connection sparsity of CNNs, or the heavy regularization and shallow architecture in the grokking setup) forced the learned program into a representation that happened to be unusually simple for us to read. They are the exceptions in their legibility, not necessarily in their underlying nature.[12]
Therefore, while mechanistic interpretability can supply plausibility to our hypothesis, we need to move towards more indirect evidence to start building a positive case.
Just before OpenAI started, I met Ilya [Sutskever]. One of the first things he said to me was, "Look, the models, they just wanna learn. You have to understand this. The models, they just wanna learn."
And it was a bit like a Zen Koan. I listened to this and I became enlightened.
... What that told me is that the phenomenon that I'd seen wasn't just some random thing: it was broad, it was more general.
The models just wanna learn. You get the obstacles out of their way. You give them good data. You give them enough space to operate in. You don't do something stupid like condition them badly numerically.
And they wanna learn. They'll do it.
-Dario Amodei[13]
I remember when I trained my first neural network: there was something almost miraculous about it. It could solve problems I had absolutely no idea how to code myself (e.g. how to distinguish a cat from a dog), and in a completely opaque way, so that even after it had solved the problem I had no better picture of how to solve it myself than I did beforehand. It was also remarkably resilient - to obvious problems with the optimizer, to bugs in the code, to bad training data - unlike any other engineered system I had ever built, almost reminiscent of something biological in its robustness.
My impression is that this sense of "magic" is a common, if often unspoken, experience among practitioners. Many simply learn to accept the mystery and get on with the work. But there is nothing virtuous about confusion - it just suggests that your understanding is incomplete, that you are ignorant of the real mechanisms underlying the phenomenon.
Our practical success with deep learning has outpaced our theoretical understanding. This has led to a proliferation of explanations that often feel ad-hoc and local - tailor-made to account for a specific empirical finding, without connecting to other observations or any larger framework. For instance, the theory of "double descent" provides a narrative for the U-shaped test loss curve, but it is a self-contained story. It does not, for example, share a conceptual foundation with the theories we have for how induction heads form in transformers. Each new discovery seems to require a new, bespoke theory. One naturally worries that we are juggling epicycles.
This sense of theoretical fragility is compounded by a second problem: for any single one of these phenomena, we often lack consensus, entertaining multiple, competing hypotheses. Consider the core question of why neural networks generalize. Is it best explained by the implicit bias of SGD towards flat minima, the behavior of neural tangent kernels, or some other property? The field actively debates these views. And where no mechanistic theory has gained traction, we often retreat to descriptive labels. We say complex abilities are an "emergent" property of scale, a term that names the mystery without explaining its cause.
This theoretical disarray is sharpest when we examine our most foundational frameworks. Here, the issue is not just a lack of consensus, but a direct conflict with empirical reality. This disconnect manifests in several ways, which the rest of this section takes up in turn: in how networks manage to approximate high-dimensional functions at all, in how they generalize despite being overparameterized, and in how independently trained models converge on similar solutions.
We are therefore faced with a collection of major empirical findings where our foundational theories are either contradicted, misleading, or simply absent. This theoretical vacuum creates an opportunity for a new perspective.
The program synthesis hypothesis offers such a perspective. It suggests we shift our view of what deep learning is fundamentally doing: from statistical function fitting to program search. The specific claim is that deep learning performs a search for simple programs that explain the data.
This shift in viewpoint may offer a way to make sense of the theoretical tensions we have outlined. If the learning process is a search for an efficient program rather than an arbitrary function, then the circumvention of the curse of dimensionality is no longer so mysterious. If this search is guided by a strong simplicity bias, the unreasonable effectiveness of scaling becomes an expected outcome, rather than a paradox.
We will now turn to the well-known paradoxes of approximation, generalization, and convergence, and see how the program synthesis hypothesis accounts for each.
(See also this post for related discussion.)
Before we even consider how a network learns or generalizes, there is a more basic question: how can a neural network, with a practical number of parameters, even in principle represent the complex function it is trained on?
Consider the task of image classification. A function that takes a 1024x1024 pixel image (roughly one million input dimensions) and maps it to a single label like "cat" or "dog" is, a priori, an object of staggering high-dimensional complexity. Who is to say that a good approximation of this function even exists within the space of functions that a neural network of a given size can express?
The textbook answer to this question is the Universal Approximation Theorem (UAT). This theorem states that a neural network with a single hidden layer can, given enough neurons, approximate any continuous function to arbitrary accuracy. On its face, this seems to resolve the issue entirely.
A precise statement of the Universal Approximation Theorem
Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a continuous, non-polynomial function. Then for every continuous function $f$ from a compact subset $K$ of $\mathbb{R}^n$ to $\mathbb{R}^m$, and any $\varepsilon > 0$, we can choose the number of neurons $k$ large enough such that there exists a network $g$ with

$$\sup_{x \in K} \|f(x) - g(x)\| < \varepsilon,$$

where $g(x) = C\,\sigma(Ax + b)$ for some matrices $A \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times k}$, and vector $b \in \mathbb{R}^k$, with $\sigma$ applied elementwise.
See here for a proof sketch. In plain English, this means that for any well-behaved target function $f$, you can always make a one-layer network that is a "good enough" approximation, just by making the number of neurons sufficiently large.
Note that the network here is a shallow one - the theorem doesn't even explain why you need deep networks, an issue we'll return to when we talk about depth separations. In fact, one can prove theorems like this without even needing neural networks at all - the theorem directly parallels the classic Stone-Weierstrass theorem from analysis, which proves a similar statement for polynomials.
However, this answer is deeply misleading. The crucial caveat is the phrase "given enough neurons." A closer look at the proofs of the UAT reveals that for an arbitrary function, the number of neurons required scales exponentially with the dimension of the input. This is the infamous curse of dimensionality. To represent a function on a one-megapixel image, this would require a catastrophically large number of neurons - more than there are atoms in the universe.
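The arithmetic behind that claim is easy to check: a lookup-table-style approximation with resolution $h$ in each of $d$ input dimensions needs roughly $(1/h)^d$ cells.

```python
import math

h = 0.1                       # ten bins per dimension
for d in (2, 10, 1_000_000):  # 1_000_000 ~ a one-megapixel image
    log10_cells = d * math.log10(1 / h)
    print(f"d = {d}: ~10^{log10_cells:.0f} cells")
# d = 10 already needs ten billion cells; d = 10^6 needs 10^(10^6), vastly more
# than the ~10^80 atoms in the observable universe.
```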
The UAT, then, is not a satisfying explanation. In fact, it's a mathematical restatement of a near-trivial fact: with exponential resources, one can simply memorize a function's behavior. The constructions used to prove the theorem are effectively building a continuous version of a lookup table. This is not an explanation for the success of deep learning; it is a proof that if deep learning had to deal with arbitrary functions, it would be hopelessly impractical.
This is not merely a weakness of the UAT's particular proof; it is a fundamental property of high-dimensional spaces. Classical results in approximation theory show that this exponential scaling is not just an upper bound on what's needed, but a strict lower bound. These theorems prove that any method that aims to approximate arbitrary smooth functions is doomed to suffer the curse of dimensionality.
The parameter count lower bound
There are many results proving various lower bounds on the parameter count available in the literature under different technical assumptions.
A classic result from DeVore, Howard, and Micchelli (1989) [Theorem 4.2] establishes a lower bound on the number of parameters required by any continuous approximation scheme (including neural networks) to achieve an error $\varepsilon$ over the space of all smooth functions in $d$ dimensions. The number of parameters $N$ must satisfy:

$$N \geq C\,\varepsilon^{-d/s},$$

where $s$ is a measure of the functions' smoothness and $C$ is a constant. To maintain a constant error $\varepsilon$ as the dimension $d$ increases, the number of parameters must grow exponentially in $d$. This confirms that no clever trick can escape this fate if the target functions are arbitrary.
The real lesson of the Universal Approximation Theorem, then, is not that neural networks are powerful. The real lesson is that if the functions we learn in the real world were arbitrary, deep learning would be impossible. The empirical success of deep learning with a reasonable number of parameters is therefore a profound clue about the nature of the problems themselves: they must have structure.
The program synthesis hypothesis gives a name to this structure: compositionality. This is not a new idea. It is the foundational principle of computer science. To solve a complex problem, we do not write down a giant lookup table that specifies the output for every possible input. Instead, we write a program: we break the problem down hierarchically into a sequence of simple, reusable steps. Each step (like a logic gate in a circuit) is a tiny lookup table, and we achieve immense expressive power by composing them.
This matches what we see empirically in some deep neural networks via mechanistic interpretability. They appear to solve complex tasks by learning a compositional hierarchy of features. A vision model learns to detect edges, which are composed into shapes, which are composed into object parts (wheels, windows), which are finally composed into an object detector for a "car." The network is not learning a single, monolithic function; it is learning a program that breaks the problem down.
This parallel with classical computation offers an alternative perspective on the approximation question. While the UAT considers the case of arbitrary functions, a different set of results examines how well neural networks can represent functions that have this compositional, programmatic structure.
One of the most relevant results comes from considering Boolean circuits, which are a canonical example of programmatic composition. It is known that feedforward neural networks can represent any program implementable by a polynomial-size Boolean circuit, using only a polynomial number of neurons. This provides a different kind of guarantee than the UAT. It suggests that if a problem has an efficient programmatic solution, then an efficient neural network representation of that solution also exists.
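The construction behind this is elementary. With bits encoded as 0/1, each Boolean gate can be simulated by a single ReLU unit plus affine wiring, so a circuit of polynomially many gates becomes a network of polynomially many neurons. A sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def AND(a, b): return relu(a + b - 1.0)         # 1 only if both inputs are 1
def OR(a, b):  return 1.0 - relu(1.0 - a - b)   # 0 only if both inputs are 0
def NOT(a):    return relu(1.0 - a)             # flips 0 and 1

# Compose gates into a (tiny) circuit: XOR(a, b) = (a OR b) AND NOT(a AND b).
def XOR(a, b): return AND(OR(a, b), NOT(AND(a, b)))

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(int(a), int(b), "->", int(XOR(a, b)))
```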
This offers an explanation for how neural networks might evade the curse of dimensionality. Their effectiveness would stem not from an ability to represent any high-dimensional function, but from their suitability for representing the tiny, structured subset of functions that have efficient programs. The problems seen in practice, from image recognition to language translation, appear to belong to this special class.
Why compositionality, specifically? Evidence from depth separation results.
The argument so far is that real-world problems must have some special "structure" to escape the curse of dimensionality, and that this structure is program structure or compositionality. But how can we be sure? Approximation theory tells us that our target functions must differ somehow from arbitrary smooth functions if we are to avoid needing exponentially many parameters, but it does not specify how. The structure does not necessarily have to be compositionality; it could be something else entirely.
While there is no definitive proof, the literature on depth separation theorems provides evidence for the compositionality hypothesis. The logic is straightforward: if compositionality is the key, then an architecture that is restricted in its ability to compose operations should struggle. Specifically, one would expect that restricting a network's depth - its capacity for sequential, step-by-step computation - should force it back towards exponential scaling for certain problems.
And this is what the theorems show.
These depth separation results, sometimes also called "no-flattening theorems," involve constructing families of functions that deep neural networks can represent with a polynomial number of parameters, but which shallow networks would require an exponential number to represent. The literature contains a range of such functions, including sawtooth functions, certain polynomials, and functions with hierarchical or modular substructures.
Individually, many of these examples are mathematical constructions, too specific to tell us much about realistic tasks on their own. But taken together, a pattern emerges. The functions where depth provides an exponential advantage are consistently those that are built "step-by-step." They have a sequential structure that deep networks can mirror. A deep network can compute an intermediate result in one layer and then feed that result into the next, effectively executing a multi-step computation.
A shallow network, by contrast, has no room for this kind of sequential processing. It must compute its output in a single, parallel step. While it can still perform "piece-by-piece" computation (which is what its width allows), it cannot perform "step-by-step" computation. Faced with an inherently sequential problem, a shallow network is forced to simulate the entire multi-step computation at once. This can be highly inefficient, in the same way that simulating a sequential program on a highly parallel machine can sometimes require exponentially more resources.
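The canonical example is Telgarsky's sawtooth construction: composing a simple two-ReLU "triangle" map with itself $k$ times produces on the order of $2^k$ linear pieces, while a one-hidden-layer ReLU network with $N$ units is piecewise linear with at most $N + 1$ pieces, so matching the deep network exactly requires width exponential in $k$. A numerical sketch:

```python
import numpy as np

def triangle(x):
    # Two ReLUs: 2*relu(x) - 4*relu(x - 0.5) equals the tent map on [0, 1].
    return 2 * np.maximum(x, 0) - 4 * np.maximum(x - 0.5, 0)

x = np.linspace(0, 1, 10_001)
y = x
for _ in range(8):                           # eight compositions, i.e. depth 8
    y = triangle(y)

crossings = int(np.sum(np.diff(np.sign(y - 0.5)) != 0))
print(crossings)                             # 256 crossings of y = 0.5, i.e. 128 teeth
```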
This provides a parallel to classical complexity theory. The distinction between depth and width in neural networks mirrors the distinction between sequential (P) and parallelizable (NC) computation. Just as it is conjectured that some problems are inherently sequential and cannot be efficiently parallelized (the NC ≠ P conjecture), these theorems show that some functions are inherently deep and cannot be efficiently "flattened" into a shallow network.
(See also this post for related discussion.)
Perhaps the most jarring departure from classical theory comes from how deep learning models generalize. A learning algorithm is only useful if it can perform well on new, unseen data. The central question of statistical learning theory is: what are the conditions that allow a model to generalize?
The classical answer is the bias-variance tradeoff. The theory posits that a model's error can be decomposed into two main sources: bias, the error that comes from the model being too simple or rigid to capture the true underlying pattern, and variance, the error that comes from the model being so flexible that it fits the noise of the particular training sample it happened to see.
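For concreteness, the standard decomposition of the expected squared error of a learned predictor $\hat{f}$ of a true function $f$, with irreducible noise variance $\sigma^2$, is:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\; \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}} \;+\; \sigma^2.$$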
According to this framework, learning is a delicate balancing act. The practitioner's job is to carefully choose a model of the "right" complexity - not too simple, not too complex - to land in a "Goldilocks zone" where both bias and variance are low. This view is reinforced by principles like the "no free lunch" theorems, which suggest there is no universally good learning algorithm, only algorithms whose inductive biases are carefully chosen by a human to match a specific problem domain.
The clear prediction from this classical perspective is that naively increasing a model's capacity (e.g., by adding more parameters) far beyond what is needed to fit the training data is a recipe for disaster. Such a model should have catastrophically high variance, leading to rampant overfitting and poor generalization.
And yet, perhaps the single most important empirical finding in modern deep learning is that this prediction is completely wrong. The "bitter lesson," as Rich Sutton calls it, is that the most reliable path to better performance is to scale up compute and model size, sometimes far into the regime where the model can easily memorize the entire training set. This goes beyond a minor deviation from theoretical predictions: it is a direct contradiction of the theory's core prescriptive advice.
This brings us to a second, deeper puzzle, first highlighted by Zhang et al. (2017). The authors conduct a simple experiment: take a standard image classification dataset, replace its true labels with uniformly random ones, and train a large network on the result, using the same architecture and procedure that generalize well on the real labels.
The network is expressive enough that it is able to achieve near-zero training error on the randomized labels, perfectly memorizing the nonsensical data. As expected, its performance on a test set is terrible - it has learned nothing generalizable.
The paradox is this: why did the same exact model generalize well on the real data? Classical theories often tie a model's generalization ability to its "capacity" or "complexity," which is a fixed property of its architecture related to its expressivity. But this experiment shows that generalization is not a static property of the model. It is a dynamic outcome of the interaction between the model, the learning algorithm, and the structure of the data itself. The very same network that is completely capable of memorizing random noise somehow "chooses" to find a generalizable solution when trained on data with real structure. Why?
The program synthesis hypothesis offers a coherent explanation for both of these paradoxes.
First, why does scaling work? The hypothesis posits that learning is a search through some space of programs, guided by a strong simplicity bias. In this view, adding more parameters is analogous to expanding the search space (e.g., allowing for longer or more complex programs). While this does increase the model's capacity to represent overfitting solutions, the simplicity bias acts as a powerful regularizer. The learning process is not looking for any program that fits the data; it is looking for the simplest program. Giving the search more resources (parameters, compute, data) provides a better opportunity to find the simple, generalizable program that corresponds to the true underlying structure, rather than settling for a more complex, memorizing one.
Second, why does generalization depend on the data's structure? This is a natural consequence of a simplicity-biased program search.
If one assumes something like the program synthesis hypothesis is true, the phenomenon of data-dependent generalization is not so surprising. A model's ability to generalize is not a fixed property of its architecture, but a property of the program it learns. The model finds a simple program on the real dataset and a complex one on the random dataset, and the two programs have very different generalization properties.

And there is some evidence that the mechanism behind generalization is not unrelated to the other empirical phenomena we have discussed. We can see this in the grokking setting described earlier. Recall the transformer trained on modular addition:
The sudden increase in generalization appears to be the direct consequence of the model replacing a complex, overfitting solution with a simpler, algorithmic one. In this instance, generalization is achieved through the synthesis of a different, more efficient program.
When we ask a neural network to solve a task, we specify what task we'd like it to solve, but not how it should solve the task - the purpose of learning is for it to find strategies on its own. We define a loss function and an architecture, creating a space of possible functions, and ask the learning algorithm to find a good one by minimizing the loss. Given this freedom, and the high-dimensionality of the search space, one might expect the solutions found by different models - especially those with different architectures or random initializations - to be highly diverse.
Instead, what we observe empirically is a strong tendency towards convergence. This is most directly visible in the phenomenon of representational alignment: networks trained with different architectures, different random initializations, and even on different datasets often learn internal representations that can be mapped onto one another by simple (often linear) transformations. This alignment is remarkably robust.
The mystery deepens when we observe parallels to biological systems. The Gabor-like filters that emerge in the early layers of vision networks, for instance, are strikingly similar to the receptive fields of neurons in the V1 area of the primate visual cortex. It appears that evolution and stochastic gradient descent, two very different optimization processes operating on very different substrates, have converged on similar solutions when exposed to the same statistical structure of the natural world.
One way to account for this is to hypothesize that the models are not navigating some undifferentiated space of arbitrary functions, but are instead homing in on a sparse set of highly effective programs that solve the task. If, following the physical Church-Turing thesis, we view the natural world as having a true, computable structure, then an effective learning process could be seen as a search for an algorithm that approximates that structure. In this light, convergence is not an accident, but a sign that different search processes are discovering similar objectively good solutions, much as different engineering traditions might independently arrive at the arch as an efficient solution for bridging a gap.
This hypothesis - that learning is a search for an optimal, objective program - carries with it a strong implication: the search process must be a general-purpose one, capable of finding such programs without them being explicitly encoded in its architecture. As it happens, an independent, large-scale trend in the field provides a great deal of data on this very point.
Rich Sutton's "bitter lesson" describes the consistent empirical finding that long-term progress comes from scaling general learning methods, rather than from encoding specific human domain knowledge. The old paradigm, particularly in fields like computer vision, speech recognition, or game playing, involved painstakingly hand-crafting systems with significant prior knowledge. For years, the state of the art relied on complex, hand-designed feature extractors like SIFT and HOG, which were built on human intuitions about what aspects of an image are important. The role of learning was confined to a relatively simple classifier that operated on these pre-digested features. The underlying assumption was that the search space was too difficult to navigate without strong human guidance.
The modern paradigm of deep learning has shown this assumption to be incorrect. Progress has come from abandoning these handcrafted constraints in favor of training general, end-to-end architectures with the brute force of data and compute. This consistent triumph of general learning over encoded human knowledge is a powerful indicator that the search process we are using is, in fact, general-purpose. It suggests that the learning algorithm itself, when given a sufficiently flexible substrate and enough resources, is a more effective mechanism for discovering relevant features and structure than human ingenuity.
This perspective helps connect these phenomena, but it also invites us to refine our initial picture. First, the notion of a single "optimal program" may be too rigid. It is possible that what we are observing is not convergence to a single point, but to a narrow subset of similarly structured, highly-efficient programs. The models may be learning different but algorithmically related solutions, all belonging to the same family of effective strategies.
Second, it is unclear whether this convergence is purely a property of the problem's solution space, or if it is also a consequence of our search algorithm. Stochastic gradient descent is not a neutral explorer. The implicit biases of stochastic optimization, when navigating a highly over-parameterized loss landscape, may create powerful channels that funnel the learning process toward a specific kind of simple, compositional solution. Perhaps all roads do not lead to Rome, but the roads to Rome are the fastest. The convergence could therefore be a clue about the nature of our learning dynamics themselves - that they possess a strong, intrinsic preference for a particular class of solutions.
Viewed together, these observations suggest that the space of effective solutions for real-world tasks is far smaller and more structured than the space of possible models. The phenomenon of convergence indicates that our models are finding this structure. The bitter lesson suggests that our learning methods are general enough to do so. The remaining questions point us toward the precise nature of that structure and the mechanisms by which our learning algorithms are so remarkably good at finding it.
If you've followed the argument this far, you might already sense where it becomes difficult to make precise. The mechanistic interpretability evidence shows that networks can implement compositional algorithms. The indirect evidence suggests this connects to why they generalize, scale, and converge. But "connects to" is doing a lot of work. What would it actually mean to say that deep learning is some form of program synthesis?
Trying to answer this carefully leads to two problems. The claim "neural networks learn programs" seems to require saying what a program even is in a space of continuous parameters. It also requires explaining how gradient descent could find such programs efficiently, given what we know about the intractability of program search.
These are the kinds of problems where the difficulty itself is informative. Each has a specific shape - what you need to think about, what a resolution would need to provide. I focus on them deliberately: that shape is what eventually pointed me toward specific mathematical tools I wouldn't have considered otherwise.
This is also where the post will shift register. The remaining sections sketch the structure of these problems and gesture at why certain mathematical frameworks (singular learning theory, algebraic geometry, etc) might become relevant. I won't develop these fully here - that requires machinery far beyond the scope of a single blog post - but I want to show why you'd need to leave shore at all, and what you might find out in open water.
The program synthesis hypothesis posits a relationship between two fundamentally different kinds of mathematical objects.
On one hand, we have programs. A program is a discrete and symbolic object. Its identity is defined by its compositional structure - a graph of distinct operations. A small change to this structure, like flipping a comparison or replacing an addition with a subtraction, can create a completely different program with discontinuous, global changes in behavior. The space of programs is discrete.
On the other hand, we have neural networks. A neural network is defined by its parameter space: a continuous vector space of real-valued weights. The function a network computes is a smooth (or at least piecewise-smooth) function of these parameters. This smoothness is the essential property that allows for learning via gradient descent, a process of infinitesimal steps along a continuous loss landscape.
This presents a seeming type mismatch: how can a continuous process in a continuous parameter space give rise to a discrete, structured program?
The problem is deeper than it first appears. To see why, we must first be precise about what we mean when we say a network has "learned a program." It cannot simply be about the input-output function the network computes. A network that has perfectly memorized a lookup table for modular addition computes the same function on a finite domain as a network that has learned the general, trigonometric algorithm. Yet we would want to say, emphatically, that they have learned different programs. The program is not just the function; it is the underlying mechanism.
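For concreteness, here is one compact way to write the generalizing algorithm, following the published reverse-engineering of grokked modular-addition networks (the frequency notation $\omega_k = 2\pi k / p$ is mine):

$$\operatorname{logit}(c \mid a, b) \;\propto\; \sum_{k} \cos\big(\omega_k (a + b - c)\big),$$

computed by embedding tokens as $(\cos \omega_k a, \sin \omega_k a)$ and using the identity $\cos\omega_k(a+b) = \cos\omega_k a\,\cos\omega_k b - \sin\omega_k a\,\sin\omega_k b$; the sum is maximized exactly when $c \equiv a + b \pmod p$.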
Thus the notion must depend on parameters, and not just functions, presenting a further conceptual barrier. To formalize the notion of "mechanism," a natural first thought might be to partition the continuous parameter space into discrete regions. In this picture, all the parameter vectors within a region would correspond to the same program A, while vectors in a different region would correspond to program B. But this simple picture runs into a subtle and fatal problem: the very smoothness that makes gradient descent possible works to dissolve any sharp boundaries between programs.
Imagine a continuous path in parameter space from a point $\theta_A$ (which clearly implements program A) to a point $\theta_B$ (which clearly implements program B). Imagine, say, that A has some extra subroutine that B does not. Because the map from parameters to the function is smooth, the network's behavior must change continuously along this path. At what exact point on this path did the mechanism switch from A to B? Where along the path did that extra subroutine disappear? There is no canonical place to draw a line. A sharp boundary would imply a discontinuity that the smoothness of the map from parameters to functions seems to forbid.
This is not so simple a problem, and it is worth spending some time thinking about how you might try to resolve it in order to appreciate why.
What this suggests, then, is that for the program synthesis hypothesis to be a coherent scientific claim, it requires something that does not yet exist: a formal, geometric notion of a space of programs. This is a rather large gap to fill, and in some ways, this entire post is my long-winded way of justifying such an ambitious mathematical goal.
I won't pretend that my collaborators and I don't have our[14] own ideas about how to resolve this, but the mathematical sophistication required jumps substantially, and they would probably require their own full-length post to do them justice. For now, I will just gesture at some clues which I think point in the right direction.
The first is the phenomenon of degeneracies[15]. Consider, for instance, dead neurons, whose incoming weights and biases are such that the neuron never fires for any input. A neural network with dead neurons acts like a smaller network with those dead neurons removed. This gives a mechanism for neural networks to change their "effective size" in a parameter-dependent way, which is required in order to e.g. dynamically add or remove a subroutine depending on where you are in parameter space, as in our example above. In fact, dead neurons are just one example in a whole zoo of degeneracies with similar effects, which seem incredibly pervasive in neural networks.
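A minimal numerical sketch of this (a toy example of my own, not drawn from any particular paper): a ReLU neuron whose pre-activation is negative for every input in the data range contributes nothing to the output, so an entire direction in parameter space - its outgoing weight - leaves the computed function unchanged.

```python
import numpy as np

def mlp(x, W1, b1, w2, b2):
    # One-hidden-layer ReLU network: x is a vector of scalar inputs.
    h = np.maximum(0.0, np.outer(x, W1) + b1)   # shape (n_inputs, n_hidden)
    return h @ w2 + b2                          # scalar output per input

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)            # inputs restricted to [-1, 1]

W1 = np.array([1.0, 0.5])
b1 = np.array([0.0, -10.0])   # second neuron's pre-activation is 0.5*x - 10 < 0 on [-1, 1]: always dead
w2 = np.array([2.0, 3.0])
b2 = 0.1

y = mlp(x, W1, b1, w2, b2)

# Perturbing the dead neuron's outgoing weight leaves the computed function unchanged:
y_perturbed = mlp(x, W1, b1, w2 + np.array([0.0, 5.0]), b2)
print(np.max(np.abs(y - y_perturbed)))          # prints 0.0: a flat, degenerate direction
```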
It is worth mentioning that the present picture is now highly suggestive of a specific branch of math known as algebraic geometry. Algebraic geometry (in particular, singularity theory) systematically studies these degeneracies, and further provides a bridge between discrete structure (algebra) and continuous structure (geometry), exactly the type of connection we identified as necessary for the program synthesis hypothesis[16]. Furthermore, singular learning theory tells us how these degeneracies control the loss landscape and the learning process (classically, only in the Bayesian setting, a limitation we discuss in the next section). There is much more that can be said here, but I leave it for the future to treat this material properly.
There’s another problem with this story. Our hypothesis is that deep learning is performing some version of program synthesis. That means that we not only have to explain how programs get represented in neural networks, we also need to explain how they get learned. There are two subproblems here.
Both of these are questions about the optimization process. It is not at all obvious how local optimizers like SGD could perform something like Solomonoff induction in the first place, let alone do so far more efficiently than any approximation of Solomonoff induction we have historically managed to run. These are difficult questions, but I will attempt to point towards research which I believe can answer them.
The optimization process can depend on many things, a priori: choice of optimizer, regularization, dropout, step size, etc. But we can note that deep learning is able to work somewhat successfully (albeit sometimes with degraded performance) across wide ranges of choices of these variables. It does not seem like the choice of AdamW vs SGD matters nearly as much as the choice to do gradient-based learning in the first place. In other words, I believe these variables may affect efficiency, but I doubt they are fundamental to the explanation of why the optimization process can possibly succeed.
Instead, there is one common variable here which appears to determine the vast majority of the behavior of stochastic optimizers: the loss function. Optimizers like SGD take every gradient step according to a minibatch-loss function[17] like mean-squared error:

$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta \Big[ \frac{1}{|B_t|} \sum_{(x_i, y_i) \in B_t} \big\| f_\theta(x_i) - y_i \big\|^2 \Big],$$

where $\theta$ is the parameter vector, $f_\theta$ is the input/output map of the model on parameter $\theta$, $(x_i, y_i)$ are the training examples & labels in the minibatch $B_t$, and $\eta$ is the learning rate.
In the most common versions of supervised learning, we can focus even further. The loss function itself can be decomposed into two effects: the parameter-function map $\theta \mapsto f_\theta$, and the target distribution. The overall loss function can be written as a composition of the parameter-function map and some statistical distance to the target distribution, e.g. for mean-squared error:

$$L(\theta) = D(f_\theta, f^*), \qquad D(f, f^*) = \mathbb{E}_{x}\big[\, \| f(x) - f^*(x) \|^2 \,\big],$$

where $f^*$ denotes the target function.
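As a toy illustration of this decomposition (my own sketch, not any real training setup): the composed loss $L(\theta) = D(f_\theta, f^*)$ can be written down and optimized directly. I use a one-neuron model, full-batch gradient descent, and finite differences purely to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=64)
f_star = np.sin(xs)                             # the target function, evaluated on the data

def param_to_function(theta, x):
    """Parameter-function map: a tiny one-neuron tanh model with theta = (a, w, b)."""
    a, w, b = theta
    return a * np.tanh(w * x + b)

def D(f_vals, g_vals):
    """Function-space statistical distance: mean-squared error between function values."""
    return np.mean((f_vals - g_vals) ** 2)

def loss(theta):
    # L = D(f_theta, f*): the composition of the parameter-function map and the distance.
    return D(param_to_function(theta, xs), f_star)

def grad(theta, eps=1e-5):
    """Finite-difference gradient of the composed loss (kept deliberately simple)."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

theta = np.array([0.5, 1.0, 0.0])
eta = 0.1                                       # learning rate
for _ in range(500):                            # plain full-batch gradient descent, for brevity
    theta = theta - eta * grad(theta)
print(loss(theta))                              # substantially smaller than at initialization
```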
Note that the statistical distance $D$ is a fairly simple object: it is almost always convex on function space, with a relatively simple functional form; further, it is the same distance one would use across many different architectures, including ones which do not achieve the remarkable performance of neural networks (e.g. polynomial approximation). Therefore one expects the question of learnability and inductive biases to largely come down to the parameter-function map $\theta \mapsto f_\theta$ rather than the (function-space) loss $D$.
If the above reasoning is correct, that means that in order to understand how SGD is able to potentially perform some kind of program synthesis, we merely need to understand properties of the parameter-function map. This would be a substantial simplification. Further, this relates learning dynamics to our earlier representation problem: the parameter-function map is precisely the same object responsible for the mystery discussed in the representation section.
This is not an airtight argument - it depends on the empirical question of whether one can ignore (or treat as second-order effects) other optimization details besides the loss function, and whether the handwave-y argument for the importance of the parameter-function map over the (function-space) loss is solid.
Even if one assumes this argument is valid, we have merely located the mystery, not resolved it. The question remains: what properties of the parameter-function map make targets learnable? At this point the reasoning becomes more speculative, but I will sketch some ideas.
The representation section concerned what structure the map encodes at each point in parameter space. Learnability appears to depend on something further: the structure of paths between points. Convexity of function-space loss implies that paths which are sufficiently straight in function space are barrier-free - roughly, if the endpoint is lower loss, the entire path is downhill. So the question becomes: which function-space paths does the map provide?
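To spell that step out (a small derivation using $D$ and $f^*$ from above): if $D(\cdot, f^*)$ is convex and $f_t = (1-t)\,f_A + t\,f_B$ is the straight path in function space between $f_A$ and $f_B$, then

$$D(f_t, f^*) \;\le\; (1-t)\,D(f_A, f^*) + t\,D(f_B, f^*) \;\le\; \max\big\{D(f_A, f^*),\, D(f_B, f^*)\big\},$$

so when the endpoint has lower loss, the loss along the path never rises above its starting value - no barrier. The substantive question is when the parameter-function map actually realizes paths that are this straight in function space.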
The same architectures successfully learn many diverse real-world targets. Whatever property of the map enables this, it must be relatively universal - not tailored to specific targets. This naturally leads us to ask in which cases the parameter-function map provides direct-enough paths to targets with certain structure, and what "direct enough" should mean.
This connects back to the representation problem. If the map encodes some notion of program structure, then path structure in parameter space induces relationships between programs - which programs are "adjacent," which are reachable from which. The representation section asks how programs are encoded as points; learnability asks how they are connected as paths. These are different aspects of the same object.
One hypothesis: compositional relationships between programs might correspond to some notion of “path adjacency” defined by the parameter-function map. If programs sharing structure are nearby - reachable from each other via direct paths - and if simpler programs lie along paths to more complex ones, then efficiency, simplicity bias, and empirically observed stagewise learning would follow naturally. Gradient descent would build incrementally rather than search randomly; the enumeration problem that dooms Solomonoff would dissolve into traversal.
This is speculative and imprecise. But there's something about the shape of what's needed that feels mathematically natural. The representation problem asks for a correspondence at the level of objects: strata in parameter space corresponding to programs. The search problem asks for something stronger - that this correspondence extends to paths. Paths in parameter space (what gradient descent traverses) should correspond to some notion of relationship or transition between programs.
This is a familiar move in higher mathematics (sometimes formalized by category theory): once you have a correspondence between two kinds of objects, you ask whether it extends to the relationships between those objects. It is especially familiar (in fields like higher category theory) to ask these kinds of questions when the "relationships between objects" take the form of paths in particular. I don't claim that existing machinery from these fields applies directly, and certainly not given the (lack of) detail I've provided in this post. But the question is suggestive enough to investigate: what should "adjacency between programs" mean? Does the parameter-function map induce or preserve such structure? And if so, what does this predict about learning dynamics that we could check empirically?
The majority of the ideas in this post are not individually novel; I see the core value proposition as synthesizing them together in one place. The ideas I express here are, in my experience, very common among researchers at frontier labs, researchers in mechanistic interpretability, some researchers within science of deep learning, and others. In particular, the core hypothesis that deep learning is performing some tractable version of Solomonoff induction is not new, and has been written about many times. (However, I would not consider it to be a popular or accepted opinion within the machine learning field at large.) Personally, I have considered a version of this hypothesis for around three years. With this post, I aim to share a more comprehensive synthesis of the evidence for this hypothesis, as well as point to specific research directions for formalizing this idea.
Below is an incomplete list of what is known and published in various areas:
Existing comparisons between deep learning and program synthesis. The ideas surrounding Solomonoff induction have been highly motivating for many early AGI-focused researchers. Shane Legg (DeepMind cofounder) wrote his PhD thesis on Solomonoff induction; John Schulman (OpenAI cofounder) discusses the connection to deep learning explicitly here; Ilya Sutskever (OpenAI cofounder) has been giving talks on related ideas. There are a handful of places one can find a hypothesized connection between deep learning and Solomonoff induction stated explicitly, though I do not believe any of these were the first to do so. My personal experience is that such intuitions are fairly common among e.g. people working at frontier labs, even if they are not published in writing. I am not sure who had the idea first, and suspect it was arrived at independently multiple times.
Feature learning. It would not be accurate to say that the average ML researcher views deep learning as a complete black-box algorithm; it is well-accepted and uncontroversial that deep neural networks are able to extract "features" from the task which they use to perform well. However, it is a step beyond to claim that these features are actually extracted and composed in some mechanistic fashion resembling a computer program.
Compositionality, hierarchy, and modularity. My informal notion of "programs" here is quite closely related to compositionality. It is a fairly well-known hypothesis that supervised learning performs well due to compositional/hierarchical/modular structure in the model and/or the target task. This is particularly prominent within approximation theory (especially the literature on depth separations) as an explanation for the issues I highlighted in the "paradox of approximation" section.
Mechanistic interpretability. The (implicit) underlying premise of the field of mechanistic interpretability is that one can understand the internal mechanistic (read: program-like) structure responsible for a network's outputs. Mechanistic interpretability is responsible for discovering a significant number of examples of this type of structure, which I believe constitutes the single strongest evidence for the program synthesis hypothesis. I discuss a few case studies of this structure in the post, but there are possibly hundreds more examples which I did not cover, from the many papers within the field. A recent review can be found here.
Singular learning theory. In the “path forward” section, I highlight a possible role of degeneracies in controlling some kind of effective program structure. In some way (which I have gestured at but not elaborated on), the ideas presented in this post can be seen as motivating singular learning theory as a means to formally ground these ideas and produce practical tools to operationalize them. This is most explicit within a line of work within singular learning theory that attempts to precisely connect program synthesis with the singular geometry of a (toy) learning machine.
From the GPT-4.5 launch discussion, 38:46.
From his PhD thesis, pages 23-24.
Together with independent contributions by Kolmogorov, Chaitin, and Levin.
One must be careful, as some commonly stated "proofs" of this optimality are somewhat tautological. These typically go roughly something like: under the assumption that the data generating process has low Kolmogorov complexity, then Solomonoff induction is optimal. This is of course completely circular, since we have, in effect, assumed from the start that the inductive bias of Solomonoff induction is correct. Better proofs of this fact instead show a regret bound: on any sequence, Solomonoff induction's cumulative loss is at most a constant worse than any computable predictor - where the constant depends on the complexity of the competing predictor, not the sequence. This is a frequentist guarantee requiring no assumptions about the data source. See in particular Section 3.3.2 and Theorem 3.3 of this PhD thesis. Thanks to Cole Wyeth for pointing me to this argument.
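For log-loss, one standard way to state such a bound (hedging: this is the dominance form, which may differ in detail from the theorem cited above) uses the fact that the Solomonoff mixture $M$ multiplicatively dominates every lower-semicomputable semimeasure $\nu$:

$$M(x_{1:n}) \;\ge\; 2^{-K(\nu)}\,\nu(x_{1:n}) \quad\Longrightarrow\quad -\log_2 M(x_{1:n}) \;\le\; -\log_2 \nu(x_{1:n}) + K(\nu) + O(1),$$

so on every sequence, $M$'s cumulative log-loss exceeds that of any computable competitor by at most (roughly) the competitor's description length, with no assumption on the data source.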
See this paper.
Depending on what one means by "protein folding," one can debate whether the problem has truly been solved; for instance, the problem of how proteins fold dynamically over time is still open AFAIK. See this fairly well-known blog post by molecular biologist Mohammed AlQuraishi for more discussion, and why he believes calling AlphaFold a "solution" can be appropriate despite the caveats.
For what it's worth, in this specific case, we do know what must be driving the process, if not the training loss: the regularization / weight decay. In the case of grokking, we do have decent understanding of how weight decay leads the training to prefer the generalizing solution. However, this explanation is limited in various ways, and it is unclear how far it generalizes beyond this specific setting.
To be clear, one can still apply existing mechanistic interpretability tools to real language models and get productive results. But the results typically only manage to explain a small portion of the network, and in a way which is (in my opinion) less clean and convincing than e.g. Olah et al. (2020)'s reverse-engineering of InceptionV1.
This phrase is often abused - for instance, if you show up to court with no evidence, I can reasonably infer that no good evidence for your case exists. This is a gap between logical and heuristic/Bayesian reasoning. In the real world, if evidence for a proposition exists, it usually can and will be found (because we care about it), so you can interpret the absence of evidence for a proposition as suggesting that the proposition is false. However, in this case, I present a specific reason why one should not expect to see evidence even if the proposition in question is true.
Many interpretability researchers specifically believe in the linear representation hypothesis, that the variables of this program structure ("features") correspond to linear directions in activation space, or the stronger superposition hypothesis, that such directions form a sparse overbasis for activation space. One must be careful in interpreting these hypotheses as there are different operationalizations within the community; in my opinion, the more sophisticated versions are far more plausible than naive versions (thank you to Chris Olah for a helpful conversation here). Presently, I am skeptical that linear representations give the most prosaic description of a model's behavior or that this will be sufficient for complete reverse-engineering, but believe that the hypothesis is pointing at something real about models, and tools like SAEs can be helpful as long as one is aware of their limitations.
The credit for these ideas should really go to Dan Murfet, as well as his current/former students including Will Troiani, James Clift, Rumi Salazar, and Billy Snikkers.
Let $f_\theta(x)$ denote the output of the model on input $x$ with parameters $\theta$. Formally, we say that a point $\theta_0$ in parameter space is degenerate or singular if there exists a nonzero tangent vector $v$ such that the directional derivative $\partial_v f_\theta(x)\big|_{\theta = \theta_0} = 0$ for all inputs $x$. In other words, moving in some direction in parameter space doesn't change the behavior of the model (up to first order).
This is not as alien as it may seem. Note that this provides a perspective which connects nicely with both neural networks and classical computation. First consider, for instance, that the gates of a Boolean circuit literally define a system of polynomial equations over $\mathbb{F}_2$, whose solution set is an algebraic variety over $\mathbb{F}_2$. Alternatively, consider that a neural network with polynomial (or analytic) activation function defines a system of equations over $\mathbb{R}$, whose vanishing set is an algebraic (respectively, analytic) variety over $\mathbb{R}$. Of course this goes only a small fraction of the way to closing this gap, but one can start to see how this becomes plausible.
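To make the first example concrete (standard encodings, written out): over $\mathbb{F}_2$, where $-1 = 1$, each gate contributes one polynomial relation between its input and output wires,

$$z = x \wedge y:\;\; z + xy = 0, \qquad z = x \vee y:\;\; z + x + y + xy = 0, \qquad z = \lnot x:\;\; z + x + 1 = 0,$$

and the circuit as a whole is the variety cut out by one such equation per gate.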
A frequent perspective is to write this minibatch-loss in terms of its mean (population) value plus some noise term. That is, we think of optimizers like SGD as something like “gradient descent plus noise.” This is quite similar to mathematical models like overdamped Langevin dynamics, though note that the noise term may not be Gaussian as in Langevin dynamics. It is an open question whether the convergence of neural network training is due to the population term or the noise term. (Note that this is a separate question from whether the generalization / inductive biases of SGD-trained neural networks are due to the population term or the noise term.) I am tentatively of the belief (somewhat controversially) that both convergence and inductive bias are due to structure in the population loss rather than the noise term, but explaining my reasoning here is a bit out of scope.
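Written out (standard notation, with $B_t$ the minibatch at step $t$ and $\hat L_{B_t}$ the minibatch loss), the decomposition this footnote refers to is

$$\nabla \hat L_{B_t}(\theta) = \nabla L(\theta) + \xi_t, \quad \mathbb{E}[\xi_t] = 0, \qquad\text{so}\qquad \theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t) - \eta\,\xi_t,$$

which is formally analogous to a discretization of the overdamped Langevin SDE $d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t$, except that $\xi_t$ need not be Gaussian or isotropic.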
2026-01-20 22:27:11
Published on January 20, 2026 2:27 PM GMT
2026 marks exactly 1 millennium since the last total solar eclipse visible from Table Mountain. The now famous (among people who sit behind me at work) eclipse of 1026 would’ve been visible to anyone at the top of Lion’s Head or Table Mountain and basically everywhere else in Cape Town. Including De Waal Park, where I’m currently writing this. I’ve hiked up Lion’s Head a lot and still find the view pretty damn awe inspiring. To have seen a total solar eclipse up there must have been absurdly damn awe inspiring. Maybe also terrifying if you didn’t know what was happening. But either way, I’m jealous of anyone that got to experience it. If you continued flipping through the exciting but predictable Five Millennium Canon of Solar Eclipses: -1999 to +3000 (2000 BCE to 3000 CE) by Jean Meeus and Fred Espenak, you’d notice something weird and annoying - you have to flip all the way to the year 2238 for the next total solar eclipse to hit Table Mountain.
Tim Urban has this idea of converting all of human history into a 1000 page book. He says that basically up until page 950 there’s just nothing going on.
“But if you look at Page 1,000—which, in this metaphor, Page 1,000 is the page that ends with today, so that goes from the early 1770s to today—that is nothing like any other page. It is completely an anomaly in the book. If you’re reading, if you’re this alien, this suddenly got incredibly interesting in the last 10 pages, but especially on this page. The alien is thinking, “OK, shit is going down.”
The gap between eclipses on Table Mountain is the real-life version of this book. Imagine if aliens had put a secret camera where the cable car is, one that only popped up during a total solar eclipse: they'd see something like the island from Lost, then wait a hundred or a thousand years and see the exact same thing, but maybe it's raining.
And they’d see this 4 more times.
Then they go to open the image from 2238 and suddenly:
There’s a soccer stadium and also is that a city???
Just knowing the date of these eclipses has made the past and future feel much more real to me.
I saw the total solar eclipse of 2024 in the middle of an absolutely packed Klyde Warren Park in Dallas.
When totality started, there were barely any cars on the highway and the cars you could see suddenly had their headlights on. The office tower behind me was filled with people on every floor staring outside, all backlit by the lights which had suddenly turned on.
We talk about how the animals start going crazy because they think it’s night as though this doesn’t include us but actually we are so included here and go even crazier than any birds doing morning chirps. The extent to which the city of Dallas was turned upside down by this event is hard to believe. And it wasn’t just a physical transformation. The entire energy of the city felt different, not just compared to the day before but compared to any other city I’ve been in. I have never felt so connected to everyone around me and so optimistic and elated at the same time all while knowing everyone else feels the exact same way.
It’s hard to imagine what it must have been like to be a person in Cape Town in the year 1026. The image in my head feels murky and I guess pastoral. But imagining what it was like during a total solar eclipse in the year 1026 is much easier. I can picture myself on top of Lion’s Head or Table Mountain or on the beach in 1026. I can picture the people around me seeing it and wondering what’s going on. I can picture myself wondering what’s going on. Because even when you know what’s going on you’re still wondering what’s going on.
When I think about the eclipse of 2238 it’s even easier to connect with those people in that Cape Town. If the people of that time have anything like newspapers or radio or the internet or TikTok, I can imagine the literal hype and electricity in the air over the months and days and hours leading up to the eclipse. It’s also weird to briefly think about how everything I’m using now and consuming now is going to be considered ancient history by the lovely people that get to experience seeing an eclipse in 2238 at the top of Lion’s Head. My MacBook, which feels so fast and which I love so dearly - junk. TikTok would be like a wax cylinder record, and they’d wonder how people managed to code with an AI as silly as Opus-4.5, or worse, by hand somehow. Every movie from 2026 would be older to them than the movie of the train going into the station is to us. I don’t know how they are going to build software in the year 2238. I barely know how I built the website I used to find this stuff out. I’ve wanted to know when the next and previous eclipses are going to happen on Lion’s Head since I got back from the eclipse in 2024.
I started by searching on Google for something that finds eclipses by location rather than by time. We have the Five Millennium Canon of Solar Eclipses, but it is still ordered by time. The answer to my question felt like something we could easily work out with existing data and a for loop in whatever your favorite programming language is. NASA hosts a CSV file with the aforementioned five millennia of past and future eclipses. So we just have to parse this CSV, figure out what each of its 16 columns represents, and then loop over the paths of the eclipses to find an intersection with the coordinates of Lion’s Head.
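Roughly what that loop looks like (my own sketch, not the script GPT 5.2 Pro produced; the column names `type`, `lat_greatest`, `lon_greatest`, `path_width_km`, `date` and the file name are hypothetical placeholders you would map onto the catalogue's actual 16 columns, and a serious version would trace the whole path of totality rather than just the point of greatest eclipse):

```python
import csv
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def total_eclipses_near(csv_path, lat, lon):
    """Yield dates of total eclipses whose point of greatest eclipse passes close to (lat, lon)."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["type"] != "T":                       # keep total eclipses only
                continue
            d = haversine_km(lat, lon, float(row["lat_greatest"]), float(row["lon_greatest"]))
            if d <= float(row["path_width_km"]) / 2:     # crude: only checks one point on the path
                yield row["date"]

# Lion's Head is at roughly (-33.935, 18.389):
# for date in total_eclipses_near("five_millennium_canon.csv", -33.935, 18.389):
#     print(date)
```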
Luckily the year was 2024, or 5 A.G.T (Anno GPT3). So I asked what was probably GPT-4 whether it could find the date of the next and previous eclipses; it used the search tool it had at the time, but it could not find anything. I tried this a few more times, usually whenever I finished a hike and a new model had recently been released. It never worked though. That is, until a week ago. This January I paid $200 for GPT 5.2 Pro after reading some, okay, a single, extremely positive review about it. To be honest my review is: it kind of sucks, but I’m still happy I paid the $200. This is because towards the end of the month I set 5.2 Pro to extended thinking and typed this prompt:
“How could I make an app that lets you pick a place on earth and then finds the last time or next time there was or will be a full solar eclipse there, what data sources would I use what algorithms and how accurate could I be.”
It thought for 17m and 6 seconds then replied with a whole bunch of words I didn’t understand. So I replied:
“Can you write a prototype in python?”
It thought for another 20m then replied with this script.
I pasted it into a file then ran it with the coordinates of Lion’s Head and saw the answer to my question: 1026. That was the last time a total solar eclipse was visible from Lion’s Head.
Since it was a script I could also use any coordinates on Earth and find the same answer for that place (as long as it was in the five-millennia catalogue).
I popped the Python script into Claude Code with Opus set to 4.5; it did some verbing and then I got this website out a few hours later: https://findmyeclipse.com
In 2238 I somehow doubt the vast majority of people will ever think about code when creating things, in the same way I don’t think about binary or transistors when programming. What does a world where software can be written without any special knowledge look like, and then what does it look like after 100 years of that? I don’t have any answers but I do know one thing: The people of Cape Town in 2238 will know that this eclipse is not just a rare eclipse, but a rare eclipse among rare eclipses. They will look forward to it. They will write about the best places to see it from. I can imagine being a person in 2238 thinking, boy this eclipse would look sick from Lion’s Head. Thinking, I wonder if it’s going to be too busy up there. Maybe consider going up and camping on Table Mountain the night before. And I can imagine being in any one of these places or just in a packed De Waal Park preparing for totality and when I imagine myself there with everyone around me, it’s hard not to be optimistic.
2026-01-20 10:06:54
Published on January 20, 2026 2:06 AM GMT
Fiora Sunshine's post, Why I Transitioned: A Case Study (the OP) articulates a valuable theory for why some MtFs transition.
If you are MtF and feel the post describes you, I believe you.
However, many statements from the post are wrong or overly broad.
In the OP, Fiora presents the "body-map theory" under the umbrella of "arcane neuro-psychological phenomena", and then dismisses medical theories because the body-map theory doesn't fit her friend group.
The body-map theory is a straw man for biological causation because there are significant sex differences between men and women that are (a) not learned and (b) not reducible to subconscious expectations about one's anatomy.
The easiest way to see this is CAH. To quote from Berenbaum and Beltz, 2021[1]:
Studies of females with congenital adrenal hyperplasia (CAH) show how prenatal androgens affect behavior across the life span, with large effects on gendered activity interests and engagement, moderate effects on spatial abilities, and relatively small (or no) effects on gender identity
The sex difference in people-vs-things interests (hobbies, occupations) has been discussed extensively in our community. CAH shifts females towards male-patterned interests with small effects on gender identity, without changes in anatomy.
This finding is also notable because it shows male-patterned interests and female gender identity can coexist, at least in natal females.
I'm trans so I have a motive to search for evidence that suggests I am ~biologically valid~ and not subject to some kind of psychosocial delusion. It would be easy for me to cherry-pick individual papers to support that view. I'm trying to not do that. I'm also not going to attempt a full literature review here. Luckily it is 2026, and we have a better option.
The ACE model from psychiatric genetics is a standard framework for decomposing the variance in a trait into 3 components:
A = Additive Genetics: cumulative effect of individual alleles
C = Common Environment: parents, schooling, SES, etc.
E = Nonshared Environment (+ error): randomness, idiosyncratic life events[2]
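As a concrete sketch of how these components are classically read off twin correlations (Falconer's equations for the simplest ACE model on a continuous liability scale; real analyses of a binary trait like trans identity fit liability-threshold models to tetrachoric correlations, so treat this purely as an illustration of the arithmetic):

```python
def ace_from_twin_correlations(r_mz, r_dz):
    """Falconer's decomposition: A = 2*(rMZ - rDZ), C = 2*rDZ - rMZ, E = 1 - rMZ."""
    A = 2.0 * (r_mz - r_dz)
    C = 2.0 * r_dz - r_mz
    E = 1.0 - r_mz
    return A, C, E

# Hypothetical correlations, purely to show the arithmetic (NOT estimates from the literature):
print(ace_from_twin_correlations(r_mz=0.4, r_dz=0.2))   # -> (0.4, 0.0, 0.6)
```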
There are at least 9[3] primary twin studies on transgender identity or gender dysphoria. I created an LLM prompt[4] asking for a literature review with the goal of extracting signal, not just from the trans twin literature, but from other research that could help give us some plausible bounds on the strength of biological and social causation. Here are the results. The format is POINT_ESTIMATE, RANGE:
| model | A | C | E |
|---|---|---|---|
| Opus 4.5 | 0.4, 0.2-0.6 | 0.05, 0-0.2 | 0.55, 0.35-0.7 |
| Opus 4.5 Research | 0.375, 0.2-0.6 | 0.125, 0-0.3 | 0.5, 0.3-0.6 |
| GPT 5.2 Pro | 0.35, 0.2-0.55 | 0.1, 0-0.25 | 0.55, 0.35-0.7 |
| o3 Deep Research | 0.4, 0.3-0.5 | 0.05, 0-0.2 | 0.55, 0.5-0.7 |
| point est. average | 0.38 | 0.08 | 0.54 |
I'm moderately confident my prompt was not biased because the A values here are lower than what I've gotten from Claude when asking for heritability estimates from twin studies only. Also, all the models included some discussion of the rapid rise in adolescent cases in the 2010s, often mentioning "social contagion" and ROGD theories explicitly. All the models also pointed out that the ACE model is a simplification and that gene-environment interaction may be significant.
These are pretty wide error bars. But since A is trying to capture heredity only, we can take A as a rough lower bound for biological causation. Even if E is purely social, 38% is significant.
Also, none of this tells us how much variation there is at the individual level. And we have no trans GWAS.
The big question is whether E is dominated by social or biological factors.
If social factors mattered a lot I would expect parental attitudes to be significant in affecting transgender identity. But most studies find low C. This holds even for population-based studies that do not suffer from ascertainment bias. I would be surprised if peer influences were highly causal but parental influences were not.
I think the evidence from CAH, fraternal birth order effects, and animal models also provides good mechanistic reasons to think there are significant biological effects in E as well as A.
How do trans people view this line of research? They tend to hate it. They're afraid it will eventually lead to:
This is what I'll call "medical eradication": one half of the Double Bind.
The purpose of medicine is to improve health and reduce suffering.
In general, the state should not subsidize healthcare that does not increase QALYs. A rational healthcare system would ration care based on ranking all available treatments by QALYs saved per dollar, and funding all treatments above a cutoff determined by the budget.
The US healthcare system has a very creative interpretation of reality, but other countries like the UK at least attempt to do this.
To receive gender-affirming treatment, trans people must argue that such treatment alleviates suffering. This argument helped establish gender medicine in the 20th century.
But in fact, the claim "being transgender involves suffering and requires medical treatment" is very controversial within the trans community. This is surprising, because disputing this claim threatens to undermine access to trans healthcare.
Moreover, this controversy explains why trans people do not appear to accurately report their own motivations for transition.
There are three possible sources:
These can co-occur and interact.
Society at large recognizes only (1) as legitimate.
Trans people know this. They know they may be sent to psychotherapy, denied HRT, or judged illegitimate if they report wanting to transition for psychosocial reasons.
There is strong pressure for trans people to accept and endorse a biological/medical framing for their transitions.
But adopting this framing carries downsides:
This is the Double Bind: many trans people need medical treatment, but find the psychological threat of medicalization and eradication intolerable.
Consequently, they will not claim their transition is justified because of biology. However, they know that psychological and social justifications will also not be accepted. In this situation, platitudes like "I am a woman because I identify as one" are a predictable response to incentives. If you attempt to give a real answer, it will be used against you.
Maybe you are thinking:
Marisa, this is hogwash! All the trans people I know are constantly oversharing lurid personal details despite obvious social incentives not to. The most parsimonious explanation is that people who say "I'm __ because I identify as __" literally believe that.
Yes, good point. I need to explain another dynamic.
So far I've only discussed external incentives, but there is incentive pressure from within the trans community as well.
In the 2010s, the following happened:
Suddenly the trans community was fighting for a much broader set of constituents and demands. 20th century binary transsexualism coheres with medical framings, but 2010s Tumblr xenogenders do not. And trans people of all kinds have always had insecurities about their own validity-- both internal and external.
Here is the key insight:
It's difficult to enforce norms that protect external political perception.
It's easy to enforce norms that protect ingroup feelings.
Assume I've performed and posted some porn on the internet. This porn is optically really really bad. Like actually politically damaging. Conscientious trans people will attempt to punish my defection-- but this is difficult. I can cry "respectability politics!" and point to the history of trans sex work in the face of employment discrimination. No one can agree on a theory of change for politics, so it's hard to prove harm. When the political backlash hits, it affects everyone equally[7].
By contrast, assume instead that I'm in a trans community space and I've told someone their reasons for transition are not valid, and they should reconsider. I've just seriously hurt someone's feelings, totally killed the vibe, and I'll probably be asked to leave-- maybe shunned long-term[8]. I have just lost access to perhaps my only source of ingroup social support. This is a huge disincentive.
This structure, combined with the influx of novel identities in the 2010s, created an environment where it was taboo even to talk about causal theories for one's own transition, because it could be invalidating to someone else. All gender identities were valid at all times. Downstream effects of external social pressure, social media, and politics created an environment of collective ignorance where community norms discouraged investigating the causes of transition.
Famously, trans people tend not to have great introspective clarity into their own motivations for transition. Intuitively, they tend to be quite aware of what they do and don't like about inhabiting their chosen bodies and gender roles. But when it comes to explaining the origins and intensity of those preferences, they almost universally to come up short. I've even seen several smart, thoughtful trans people, such as Natalie Wynn, making statements to the effect that it's impossible to develop a satisfying theory of aberrant gender identities. (She may have been exaggerating for effect, but it was clear she'd given up on solving the puzzle herself.)
This is the wrong interpretation of Natalie Wynn's oeuvre. See Appendix: Contra Fiora on Contra for why.
What would a legitimate explanation of the origins of one's gendered feelings look like?
Fiora never tells us her criteria. And the only example she gives us-- a psychosocial explanation of her own transition-- heavily implies that it was illegitimate.
But she's also dismissive of biological theories. Does that mean no transitions are valid?
I got whole genome sequencing last year. I can point at the sexual and endocrine abnormalities in my genome, but I certainly can't prove they justify my transition. Nevertheless, subjectively, HRT saved my life.
(Extremely simplified for brevity)
In middle school, puberty started and my life fell apart. I hated my erections, my libido; I felt like a demon had taken over my brain. Unlike my peers, I never developed a felt sense of how to throw my body around. They got rougher, and better at sports. I got injured.
I was pathologically shy and awkward. Locker room talk was utterly repulsive to me. I lost friends and didn't care. Rurouni Kenshin was my first inspiration to grow my hair out. I am very lucky my parents let me.
There was an autistic kid on my soccer team with a speech impediment. He was good at soccer but the other boys would cruelly tease him at practice, in part because he didn't understand they were teasing him. One night after practice I spent the car ride home sobbing about it in front of my dad, who didn't get it at all. I quit soccer.
I was utterly miserable in school. In March of 7th grade, I developed real depression, and started thinking about suicide. Mom took me to two different psychologists. We decided I would homeschool 8th grade. Now, I really had no friends. I was still depressed.
At this point I was only living for WoW and anime. By far, my favorite was Haibane Renmei. It's 13 episodes of angel-girls living in a run-down boarding school and basically just taking care of each other. It is heavily implied that the Haibane are there-- in purgatory-- because they committed suicide in the real world, and must learn how to accept love and care.
It's difficult to explain how much this series resonated with me. It gave structure to feelings I couldn't articulate. I never believed there was any possibility of becoming a girl in real life, so I didn't fantasize much about that. But for a couple years I daydreamed frequently about dying and becoming a Haibane[9].
My hair was long enough at this point that I "passed". I was frequently assumed female in social situations, and men would often tell me I was in the wrong bathroom. I longed for delicate reciprocal care with others who somehow understood what I was going through, even though I could hardly understand it myself. Haibane Renmei showed me this but I had no idea how to find it in the real world.
At 16, boy puberty hit me like a truck. I became ugly. I still had no social skills, and no friends. I dressed like a hobo. The summer after junior year I confronted myself in the mirror and admitted I would never be cute again. I still desperately wanted to be loved, and I believed that the only path to achieving that was becoming a man girls would want to date. That meant improving my appearance and social skills.
I knew that women find weebs unattractive. And my long hair was definitely unattractive. It all melded together. I had no real-world outlet for my femininity so I'd poured it all into identifying with anime characters. And it all seemed like a dead end. I felt that if I stayed in the anime community I would end up socially stunted, since its social standards were lower. I cut my hair and stopped watching anime. I put a lot more effort into socializing.
In college, I read The Man Who Would Be Queen, self-diagnosed as AGP, and actually considered transition for the first time. But it was too late for me-- the sight of my face in the mirror, and the depictions of AGPs in the book were too horrifying. I resolved to never transition, and attempted suicide soon after.
7 months later I fell in love, and that relationship turned my life around. I loved her immeasurably for 5 years, and we lived together for 2 of those. I became, on the outside, socially and professionally actualized as a man. I was a great boyfriend and had no problem getting dates. After the breakup I fell in love 2 more times.
You already know how this ends. No amount of true love or social validation as a man could fix me. I never wanted to transition, but at 31 the strain of repression became unbearable. Things have turned out far better than I ever dared imagine. My parents have remarked on multiple occasions, unprompted, how much happier I am now. They're right.
Overall I fit Fiora's phenotype: I was a mentally ill social outcast weeb, desperately identifying with anime characters as a simulacrum of loving care I had no idea how to find in real life.
But I can't explain my eventual transition at 31 through anything other than a biological cause. I looked obsessively for evidence of some repressed or unconscious ulterior motive, and found none. I believed that transition would be very expensive and time-consuming, physically painful[10], reduce my attractiveness as a mate, and change my social possibilities. All of these predictions have been borne out. What I didn't expect is that HRT drastically improved my mental health even before the physical changes kicked in. My baseline now is my former 90th-percentile of calm and happiness.
I'm n=1 but this shows Fiora's phenotype can coexist with biologically rooted dysphoria. Moreover, I believe my middle school social failures were caused as much by gender incongruence as by neurodivergence. It's difficult to socialize when your puberty feels wrong and your social instincts don't match your assigned gender.
It's almost like most of them had deep emotional wounds, often stemming from social rejection, and had transitioned to become cute girls or endearing women as a kind of questionably adaptive coping mechanism.
Maybe. Or a misaligned subconscious sex is part of what caused the social rejection in the first place.
As Fiora implied, "cuteness-maxxing" is probably not a good reason to transition.
Most people desperately want to be loved and this can cause mistakes with transition in both directions. Social media is probably bad for minors. We should emphasize that, at a fundamental level, trans people are neither more nor less lovable than cis people.
The human brain is perhaps the most complex object in our known universe, and we will likely never be able to fully disentangle psychosocial factors from biological ones. That said, I do think humanity will discover ever stronger evidence for biological causes of trans identity within our lifetimes.
Introspection is a noisy way to attempt to answer "am I trans?", and you hit diminishing returns fast. It's also the wrong question. The right question is "should I transition?". Transition is best understood as a Bayesian process where you take small behavioral steps[11] and update on whether your quality of life is improving.
If you start transitioning and your intrinsic health and happiness improves, and you expect the same to be true in the long run, continue. If not, desist. There is no shame in either outcome.
For twins, prenatal environment shows up in both C and E.
Coolidge et al. (2002), Heylens et al. (2012), Karamanis et al. (2022), Conabere et al. (2025), Sasaki et al. (2016), Bailey et al. (2000), Burri et al. (2011), Diamond (2013), Buhrich et al. (1991).
If you just want to read a systematic review of these studies, see https://pmc.ncbi.nlm.nih.gov/articles/PMC12494644/
I'm trying to understand the etiology of transgender identity, particularly the strength of the evidence base for different categories of potential causes. Please segment the analysis into five categories:
1. Hereditary/genetic factors
2. Prenatal environment (hormonal, epigenetic, maternal)
3. Postnatal biological environment (diet, medications, endocrine factors)
4. Family/microsocial environment
5. Macrosocial/cultural environment
For each category, conduct a rigorous literature review prioritizing meta-analyses, large-N studies, and methodologically sound designs. Identify the strongest evidence both supporting and contradicting causal contributions from that category. Flag studies with clear methodological limitations and discuss known publication biases in the field.
Focus primarily on gender dysphoria and transgender identity as defined in DSM-5/ICD-11, noting where studies conflate distinct constructs or onset patterns.
Conclude with a variance decomposition estimate using the ACE framework and liability threshold model standard in psychiatric genetics. Provide:
- Point estimates with plausible ranges for each component (A, C, E)
- Confidence ratings for each estimate based on evidence quantity and quality
- Explicit discussion of what each ACE component likely captures, mapped back to the five categories above
- Acknowledgment of confounds and unmeasurable factors
Include cross-cultural and temporal trend data as evidence bearing on the cultural/environmental components.
In general, in the US in the 20th century, if a medical institution decided they simply didn't want to treat trans patients, there would be no public outcry. The doctors and organizations that did treat us could set terms. Prior to the 2010s there was little awareness of trans people, and the awareness we had was often prejudicial. IBM fired Lynn Conway after all.
Some trans people (for example, Abigail Thorn and Andrea Long Chu) have attempted to argue that access to gender-affirming care should not be contingent on either (a) suffering prior to receiving treatment or (b) demonstrated therapeutic benefit for the treatment. These arguments were not well-received even within the trans community.
It took r/MtF until 2025 to ban porn, after years of infighting. https://www.reddit.com/r/MtF/comments/1kaxn18/alright_lets_talk_about_porn_and_porn_accounts/
This norm is not totally unreasonable. The purpose of community spaces is primarily social support for those early in transition, which can be difficult to find anywhere else. I went through this phase too.
Yes, this is perverse and contradicts the moral of the story.
Electrolysis is the most physically painful thing I've experienced. I've done 40 hours so far and will likely do 150-200 total.
Voice training, experimenting with name/pronouns/clothing, laser hair removal, HRT.
2026-01-20 09:53:55
Published on January 20, 2026 1:53 AM GMT
This is an appendix post for Why I Transitioned: A Response.
In Why I Transitioned: A Case Study, Fiora Sunshine claims:
Famously, trans people tend not to have great introspective clarity into their own motivations for transition. Intuitively, they tend to be quite aware of what they do and don't like about inhabiting their chosen bodies and gender roles. But when it comes to explaining the origins and intensity of those preferences, they almost universally to come up short. I've even seen several smart, thoughtful trans people, such as Natalie Wynn, making statements to the effect that it's impossible to develop a satisfying theory of aberrant gender identities. (She may have been exaggerating for effect, but it was clear she'd given up on solving the puzzle herself.)
The evidence most strongly suggests that Natalie did not give up-- she was bullied into silence.
This misreading matters because it illustrates one half of the Trans Double Bind. Natalie's words in Canceling were chosen under extreme social pressure from the online/Twitter/leftist contingent of the trans community. This social pressure existed because the community felt they were enforcing norms necessary to ensure respect and acceptance for enbys[1].
The linked video, Canceling, is Natalie defending against accusations of transmedicalism[2] due to using a voice-over from transmedicalist Buck Angel in her previous video.
And in the linked section specifically, she is defending and attempting to recontextualize one of her tweets:
One of the most important facts about Natalie is that, despite what her on-screen persona suggests, she is sensitive and suffers greatly from hate comments online, especially from within the trans community[3].
This video reply to being canceled was high-stakes because it had major long-term implications not just for her Patreon livelihood and career but her dignity, physical safety, and social acceptance.
As far as I can tell, Natalie is not lying in Canceling. But she is defending her record in part through omission and vagueness.
I can't tell you what her genuine beliefs are. In part because of this controversy she deliberately moved away from making comments or videos directly about trans issues, and has expressed general despair about the situation.
I do not believe Natalie is a transmedicalist, secretly or otherwise. There is a lot of theory-space between "all genders/transitions are valid no matter what" and transmedicalism.
But her blanket retraction ("I no longer believe there can be any rational justification of gender identity") is not credible because:
A. The context of Canceling highly incentivized her to make her commentary on her tweet as politically defensible as possible (If you disavow reason then it is impossible to exclude anyone).
B. The evidence suggests her real views are more nuanced.
She has made multiple extremely personal, searching videos about her dysphoria and motivations to transition, most notably Autogynephilia. Beauty is surprisingly critical of the usage and concept of gender dysphoria (and motivations for pursuing medical transition). Transtrenders deals with all these topics in skit form, and was also heavily scrutinized online.
Prior to Canceling, Natalie stated on multiple occasions that she transitioned because of gender dysphoria. This illustrates the Double Bind because the online trans community took as implication that she believed dysphoria was an important part of justifying transition-- which would exclude people who do not report dysphoria, and threaten to reduce their acceptance in their identified gender.
The other side of the Double Bind is weak here because, in the 2010s as a binary trans woman with substantial income, Natalie's access to HRT and surgery was not conditional on endorsing transmedicalism.
I think her comments in her AMAs are more interesting and revealing. I can't link to these videos directly (paywall) and I don't know if anyone here cares to read long transcripts. But I will end this post by including some here because they are both interesting and relevant.
August 2018 Patron AMA stream
QUESTION (19:25): Becoming more the person you are was the thought that came to mind. It reminded me of something Schopenhauer said about the empirical character as a manifestation of the intelligible character. That what we appear to be outwardly is just an imperfect expression of our true immutable inmost nature. Does that resonate at all? Do you think it is a useful way of thinking about gender transition? Are you an expression of transcendental freedom? Could a cranky sexist 19th century philosopher be invoked against reductive shit lord rationalizing?
NATALIE: I think I actually take the opposite view. I take more of the Wittgenstein pragmatic view which is that the self is like invented instead of discovered. More trans people do actually think of it the way you're suggesting that by transitioning they're actually realizing this inherent like essence or singularity that's always there. That their exterior appearance is kind of finally becoming like their insides finally matching outside. It's like sort of not that's not really the sense I have to be quite honest like I kind of want to pretend that it is because it's a more attractive thing to say about yourself right? I think people might be more attracted to me if I was expressing the true feminine essence of my being but the truth is that I designed this, femininity is something I've worked on and it's a it's an invention it's a creation of mine as much as it is a discovery.
November 2018 Patron AMA stream
Question (2:24): How did you find out you were transgender?
Natalie: ...I started taking hormones before I was 100% sure I identified as a woman, to be honest, because I wanted the effects of the hormones... once I had started hormones... I'm like, I'm not non-binary, I just want to be a woman, and so it was like one step at a time...
When you discover that, you like taking female hormones, and it makes you feel better about yourself, and you like the physical changes, you just look at your life, and you're like, well, this is just going to be easier if I just be a woman, like, that sounds very pragmatic, but that to me is kind of thinking, if I went into it, honestly, there was sort of a pragmatic reasoning behind it, like, my life is going to be better if I just live as a woman. And so that's when I decided, like, fuck it, like, let's just go all in on this.
September 2019 Patron AMA stream
QUESTION (54:02): Do you think dysphoria is externally or internally generated? That is if we lived in a world without transphobia where trans identities were immediately 100% accepted by all people, would dysphoria still exist?
NATALIE: ...it's hard for me to imagine like what that would even look like because I think there's a difference between transphobia and some trans identities not being accepted immediately, because I think that part of what gender is is the assumption that there's two categories of people that in terms of all the senses present in a different way and if we just completely dropped the idea that gender is something that you identify based on the way someone looks and instead started thinking of gender as a purely psychological phenomenon it's a little bit hard for me to imagine like what being trans even would mean in that situation...
I just sort of don't get, like, I don't get what people are talking about when they talk about hypotheticals like this...
...what does it mean to identify as a woman when all "woman" means is a psychological state?
...I don't know how to talk about, like, I'm so used to the idea that, like, I just can't talk about this, that, like, I, I, I sort of don't know how much I should say...
...there's trans people, right, who present totally within the normal range of what is expected of someone who's assigned their gender at birth, and I'm not saying they're not valid, I'm just saying that, like, I sort of don't recognize it as what being trans is to me
...my own trans identity, it's so connected to this desire to socially fit in as a woman [and look female] and... so when someone identifies as trans without either of those components... I don't understand it yet.
QUESTION (02:55:25): Are there any videos you would like to make but feel like you can't because they're too different or frivolous or inflammatory?
NATALIE: ...one I don't think I'll ever do would be a follow up to the Autogynephilia video... I kind of feel like that video in particular is kind of weak. Despite its length, I don't think it really deals with the subject matter well, and I think that the video I have in mind would be about a lot of the difficult questions about why trans women transition and how in my opinion like there is anthropological truth to Blanchardism, like clearly he's observing real trends, right?
...if you read Magnus Hirschfeld's work from the 30s... it comes to the same conclusions as Blanchard, and those things have troubled me throughout my transition, and in some ways have troubled me more as I've met more and more trans women, and feel that, you know, there really are these kinds of two stark clusters of trans women with very different backstories, and... if I were to make a theory about trans women I would do a kind of post-Blanchardism that starts with a lot of those observations and then tries to come up with a more nuanced way of talking about them than what Blanchard offers.
My Autogynephilia video has a million views and that's unusual. It's the only video of mine that's that old that has that many views. Why does that video have so many views? A lot of people are googling this topic. And if you look at the more sinister parts of trans internet it's kind of an obsessive topic and I think that part of the reason for that is that a lot of mainstream trans discourse is very euphemistic about things. There's a heavily ideologically loaded concept of trans woman and you're supposed to believe all these things, like you're supposed to say I was always a woman and that I was a woman born in a man's body and like the fact of the matter is that this just does not line up with a very large number of people's experiences...
And then on the other side you have Blanchard, who talks about how there's this group of trans women who before transition live as feminine gay men and... the fundamental problem of their life is femininity, and often it's, you know, something they're bullied for, and it's just like this issue throughout their childhood, adolescence, and early adulthood. On the other hand, you have a whole second group of trans women who basically seem to pass as normal men until, you know, they come out as trans and shock everyone, and like, it's just that these are two very different experiences, so it's like such a deeply taboo topic...
The problem I have with my Autogynephilia video is that in a way I was pushing too hard against some of Blanchard's things, right, because it's a very threatening theory to trans women, because what it's saying is that you are men. I want to try to make sense of Blanchard's observations without reaching his conclusion that these are just either male homosexuals or male fetishists, because I don't believe that.
I've met hundreds of trans women at this point and um it's pretty hard not to notice that the two-type typology is based on something that's real, right? I'm not saying that the typology is theoretically good. I'm just saying that it's based on something that is quite clearly real, and so far as I'm aware there's simply no way of talking about that except Blanchardism and that's not superfucking great is it...
I hate the way a lot of people summarize my video, like they'll just summarize it as, oh, I said there's no such thing as autogynephilia, no one has those feelings; that's clearly not true. I think it's actually quite common for men to, um, like yeah, you know, like a straight guy who likes taking pictures of his butt in women's yoga pants, like sending them to his friends or something? It's a feeling. I don't think this is what causes people to transition, but I think it's a dimension to a lot of people's sexuality that I don't particularly see the point in denying. Nor do I think that Blanchardism is a good theory.
By the mid 2010s the lines of battle had shifted so much that binary trans people were no longer perceived to be under threat, and the focus shifted towards nonbinary issues. These were more politically salient (nonbinary => overthrowing the binary => overthrowing patriarchy) which made them more conducive to a social media positive feedback loop, and were also subject to more social opposition in everyday interactions.
The view that trans people are only valid if they experience gender dysphoria
See for example the 17 minutes at the beginning of her October 2019 patron AMA stream, right after the start of the controversy, where she is upset to the point of altering her speaking cadence, and at one point on the verge of tears.
2026-01-20 09:25:46
Published on January 20, 2026 1:25 AM GMT
What counts as a lie?
Centrally, a lie is a statement that contradicts reality, and that is formed with the explicit intent of misleading someone. If you ask me if I’m free on Thursday (I am), and I tell you that I’m busy because I don’t want to go to your stupid comedy show, I’m lying. If I tell you that I’m busy because I forgot that a meeting on Thursday had been rescheduled, I’m not lying, just mistaken.
But most purposeful misrepresentations of a situation aren’t outright falsehoods, they’re statements that are technically compatible with reality while appreciably misrepresenting it. I likely wouldn’t tell you that I’m busy if I really weren’t; I might instead bring up some minor thing that I have to do that day and make a big deal out of it, to give you the impression that I’m busy. So I haven’t said false things, but, whether through misdirecting, paltering, lying by omission, or other such deceptive techniques, I haven’t been honest either.
We’d like a principled way to characterize deception, as a property of communications in general. Here, I’ll derive an unusually powerful one: deception is misinformation on expectation. This can be shown at the level of information theory, and used as a practical means to understand everyday rhetoric.
Formally, we might say that Alice deceives Bob about a situation if:
First Definition: She makes a statement to him that, with respect to her own model of Bob, changes his impression of the situation so as to make it diverge from her own model of the situation.
We can phrase this in terms of probability distributions. (If you’re not familiar with probability theory, you can skip to the second definition and just take it for granted). First, some notation:
1. Let $W$ be the system in question (the situation), with possible states $w$, and let $P_A(w)$ and $P_B(w)$ be the probabilities that Alice and Bob, respectively, assign to that state. These probability assignments $P_A$ and $P_B$ are themselves epistemic states of Alice and Bob. If Alice is modeling Bob as a system $B$, too, she may assign probabilities $P_A(P_B)$ to possible epistemic states $P_B$ that Bob might be in.
2. Let $P_B^s$ be the epistemic state Alice expects Bob to update to, from $P_B$, upon receiving a statement $s$ from her.
With this notation, a straightforward way to operationalize deception is as information Alice presents to Bob that she expects to increase the difference between Bob's view of the world and her own.
Taking the Kullback-Leibler divergence as the information-theoretic measure of difference between probability distributions, this first definition of deception is written as:
$$\mathbb{E}_{P_B \sim P_A}\!\left[D_{\mathrm{KL}}\!\left(P_A \,\|\, P_B^s\right)\right] \;>\; \mathbb{E}_{P_B \sim P_A}\!\left[D_{\mathrm{KL}}\!\left(P_A \,\|\, P_B\right)\right].$$
We can manipulate this inequality:
$$\int\! P_A(P_B)\!\int\! P_A(w)\left[\log\frac{P_A(w)}{P_B^s(w)} - \log\frac{P_A(w)}{P_B(w)}\right]dw\,dP_B \;>\; 0,$$
$$\int\! P_A(P_B)\!\int\! P_A(w)\,\log\frac{P_B(w)}{P_B^s(w)}\,dw\,dP_B \;>\; 0.$$
Write $W \times B$ for the product system composed of $W$ and $B$, whose states are just pairs of states of $W$ and $B$. The inequality can then be written in terms of an expected value:
$$\mathbb{E}_{(w,\,P_B)\sim P_A}\!\left[\log\frac{P_B(w)}{P_B^s(w)}\right] \;>\; 0.$$
This term is the proportion to which Alice expects the probability Bob places on the actual world state to be changed by his receiving the information $s$. If we write this in terms of surprisal, or information content, $I_P(w) = -\log P(w)$,
we have
$$\mathbb{E}_{(w,\,P_B)\sim P_A}\!\left[I_{P_B^s}(w) - I_{P_B}(w)\right] \;>\; 0.$$
This can be converted back to natural language: Alice deceives Bob with the statement $s$ if:
Second Definition: She expects that the statement would make him more surprised to learn the truth as she understands it[1].
In other words, deception is misinformation on expectation.
Misinformation alone isn’t sufficient—it’s not deceptive to tell someone a falsehood that you believe. To be deceptive, your message has to make it harder for the receiver to see the truth as you know it. You don’t have to have true knowledge of the state of the system, or of what someone truly thinks the state is. You only have to have a model of the system that generates a distribution over true states, and a model of the person to be deceived that generates distributions over their epistemic states and updates.
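To make this concrete, here is a minimal numerical sketch of the second definition (the scenario and numbers are my own illustration), in the simplified case where Alice is certain of Bob's current epistemic state, so the expectation runs only over world states:

```python
import numpy as np

# Toy version of the Thursday example above. Alice knows she is free;
# she models how each candidate statement would shift Bob's beliefs.
states = ["Alice is free on Thursday", "Alice is busy on Thursday"]

p_alice = np.array([1.0, 0.0])      # Alice's own beliefs about the world
p_bob_prior = np.array([0.5, 0.5])  # Alice's model of Bob's current beliefs

# Alice's model of Bob's beliefs after hearing each (literally true) statement.
p_bob_after = {
    "makes a big deal of a minor errand": np.array([0.2, 0.8]),
    "admits she's free but doesn't feel like going": np.array([0.95, 0.05]),
}

def expected_misinformation(p_a, p_b_before, p_b_after):
    """Alice's expected change in Bob's surprisal at the true state.
    Positive => Alice expects the statement to leave Bob more surprised by
    the truth as she sees it, i.e. deceptive under the second definition."""
    delta_surprisal = -np.log(p_b_after) + np.log(p_b_before)
    return float(np.dot(p_a, delta_surprisal))

for statement, posterior in p_bob_after.items():
    score = expected_misinformation(p_alice, p_bob_prior, posterior)
    verdict = "deceptive" if score > 0 else "not deceptive"
    print(f"{statement}: {score:+.2f} nats ({verdict})")
```

Note that neither statement is literally false; the test only cares about where Alice expects Bob's beliefs to move relative to what she herself believes.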
This is a criterion for deception that routes around notions of intentionality. It applies to any system that has a model of the world and a model of the agent it is communicating with, and that chooses what information to present on that basis.
An AI, for instance, may not have the sort of internal architecture that lets us attribute human-like intents or internal conceptualizations to it; it may select information that misleads us without the explicit intent to mislead[2]. An agent like AlphaGo or Gato, that sees humans as just another game to master, may determine which statements would get us to do what it wants without even analyzing the truth or falsity of those statements. It does not say things in order to deceive us; deception is merely a byproduct of the optimal things to say.
In fact, for sufficiently powerful optimizers, deception ought to be an instrumental strategy. Humans are useful tools that can be easily manipulated by providing information, and it’s not generally the case that information that optimally manipulates humans towards a given end is simultaneously an accurate representation of the world. (See also: Deep Deceptiveness).
This criterion can be applied anywhere people have incentives to be dishonest or manipulative while not outright lying.
In rhetorical discussions, it’s overwhelmingly common for people to misrepresent situations by finding the most extreme descriptions of them that aren’t literally false[3]. Someone will say that a politician “is letting violent criminals run free in the streets!”, you’ll look it up, and it’ll turn out that they rejected a proposal to increase mandatory minimum sentencing guidelines seven years ago. Or “protein shakes can give you cancer!”, when an analysis finds that some brands of protein powder contain up to two micrograms of a chemical that the state of California claims is not known not to cause cancer at much larger doses. And so on. This sort of casual dishonesty permeates almost all political discourse.
Descriptions like these are meant to evoke particular mental images in the listener: when we send the phrase “a politician who’s letting violent criminals run free in the streets” to the Midjourney in our hearts, the image is of someone who’s just throwing open the prison cells and letting out countless murderers, thieves, and psychos. And the person making this claim is intending to evoke this image with their words, even though they'll generally understand perfectly well that that’s not what’s really happening. So the claim is deceptive: the speaker knows that the words they’re using are creating a picture of reality that they know is inaccurate, even if the literal statement itself is true.
This is a pretty intuitive test for deception, and I find myself using it all the time when reading about or discussing political issues. It doesn’t require us to pin down formal definitions of “violent criminal” and a threshold for “running free”, as we would in order to analyze the literal truth of their words. Instead, we ask: does the mental image conveyed by the statement match the speaker’s understanding of reality? If not, they’re being deceptive[4].
Treating expected misinformation as deception also presents us with a conversational norm: we ought to describe the world in ways that we expect will cause people to form accurate mental models of the world.
This isn’t exactly identical to the first definition. Note that I converted the final double integral into an expected value by implicitly identifying $P_A(w)\,P_A(P_B)$ with $P_A(w, P_B)$, i.e. by making Bob’s epistemic state independent of the true world state, within Alice’s model. If Alice is explicitly modeling a dependence of Bob’s epistemic state on the true world state for reasons outside her influence, this doesn’t work, so the first and second definitions can differ.
Example: If I start having strange heart problems, I might describe them to a cardiologist, expecting that this will cause them to form a model of the world that’s different from mine. I expect they’ll gain high confidence that my heart has some specific problem X that I don’t presently consider likely due to my not knowing cardiology. So, to me, there’s an expected increase in the divergence between our distributions that isn’t an expected increase in the cardiologist’s surprisal, or distance from the truth. Because the independence assumption above is violated—I take the cardiologist’s epistemic state to be strongly dependent on the true world state, even though I don’t know that state—the two definitions differ. Only the second captures the idea that honestly describing your medical symptoms to a doctor shouldn’t be deception, since you don’t expect that they’ll be mis-informed by what you say.
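As a toy numerical check of this footnote (the probabilities are invented purely for illustration), the cardiologist case can be set up so that Alice expects Bob's updated beliefs to track the true state; the first definition then flags the exchange while the second does not:

```python
import numpy as np

# World states: which specific heart problem the patient actually has.
states = ["problem_X", "problem_Y"]

p_alice = np.array([0.5, 0.5])      # the patient doesn't know cardiology
p_bob_prior = np.array([0.5, 0.5])  # her model of the cardiologist beforehand

# She expects the cardiologist's updated beliefs to track the true state:
# row w is his posterior if the world is actually in state w.
p_bob_posterior_given_w = np.array([
    [0.9, 0.1],  # if the truth is problem_X, he will mostly believe X
    [0.1, 0.9],  # if the truth is problem_Y, he will mostly believe Y
])

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# First definition: expected increase in D_KL(Alice || Bob).
d_before = kl(p_alice, p_bob_prior)
d_after = sum(p_alice[w] * kl(p_alice, p_bob_posterior_given_w[w])
              for w in range(len(states)))
print("Definition 1 (expected divergence change):", d_after - d_before)  # > 0

# Second definition: expected change in Bob's surprisal at the true state.
surprisal_change = sum(
    p_alice[w] * (-np.log(p_bob_posterior_given_w[w][w]) + np.log(p_bob_prior[w]))
    for w in range(len(states))
)
print("Definition 2 (expected misinformation):", surprisal_change)  # < 0
```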
Even for humans, there’s a gray zone where we do things whose consequences are neither consciously intended nor unintended, but simply foreseen; it’s only after the action and its consequences are registered that our minds decide whether our narrative self-model will read “yes, that was intended” or “no, that was unintended”. Intentionality is more of a convenient fiction than a foundational property of agents like us.
Resumes are a funnier example of this principle: if someone says they placed “top 400” in a nationwide academics competition, you can tell that their actual rank is at least 301, since they’d be saying “top 300” or lower if they could.
Of course everyone forms their own unique mental images; of course it’s subjective what constitutes a match; of course we can’t verify that the speaker has any particular understanding of reality. But you can generally make common-sense inferences about these things.
2026-01-20 08:58:08
Published on January 20, 2026 12:58 AM GMT
In a software-only takeoff, AIs improve AI-related software at an increasing speed, leading to superintelligent AI. The plausibility of this scenario is relevant to questions like:
Knowing when and how much I expect to learn about the likelihood of such a takeoff helps me plan for the future, and so is quite important. This post presents possible events that would update me towards a software-only takeoff.
The key variable determining whether software progress alone can produce rapid, self-sustaining acceleration is returns to software R&D (r), which measures how output scales with labor input. Specifically, if we model research output as
$$O \propto I^{r},$$
where O is research output (e.g. algorithmic improvements) and I is the effective labor input (AI systems weighted by their capability), then r captures the returns to scale.
If r is greater than 1, doubling the effective labor input of your AI researchers produces sufficient high-quality research to more than double the effective labor of subsequent generations of AIs, and you quickly get a singularity, even without any growth in other inputs. If it's less than 1, software improvements alone can't sustain acceleration, so slower feedback loops like hardware or manufacturing improvements become necessary to reach superintelligence, and takeoff is likely to be slower.
A software-only singularity could be avoided if r is not initially above 1, or if r decreases over time, for example, because research becomes bottlenecked by compute, or because algorithmic improvements become harder to find as low-hanging fruit is exhausted.
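To see why r = 1 is the critical threshold, here is a toy sketch (my own simplified recursion, not a calibrated model) in which each generation's effective labor is set by the previous generation's research output, I(t+1) = c * I(t)^r:

```python
import numpy as np

# Illustrative constants only: c and the starting labor are not estimates.
def doublings_per_generation(r, generations=8, labor=4.0, c=2.0):
    gains = []
    for _ in range(generations):
        new_labor = c * labor ** r
        gains.append(np.log2(new_labor / labor))  # doublings of effective labor this step
        labor = new_labor
    return gains

for r in (0.8, 1.0, 1.2):
    print(f"r = {r}: " + ", ".join(f"{g:.2f}" for g in doublings_per_generation(r)))
# r < 1: gains shrink each generation -- progress stalls without other inputs
# r = 1: gains stay constant -- steady exponential growth
# r > 1: gains compound each generation -- accelerating, singularity-like takeoff
```

In the r < 1 case the recursion converges to a fixed point, which is the sense in which software improvements alone can't sustain acceleration without growth in other inputs.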
The most immediate way to determine if returns to software R&D are greater than 1 would be observing shortening doubling times in AI R&D at major labs (i.e. accelerating algorithmic progress), but it would not be clear how much of this is because of increases in labor rather than (possibly accelerating) increases in experimental compute. This has stymied previous estimates of returns.
Evidence that returns to labor in AI R&D are greater than 1:
The likelihood of a software-only takeoff depends heavily on how compute-intensive ML research is. If progress requires running expensive experiments, millions of automated researchers could still be bottlenecked. If not, they could advance very rapidly.
Here are some things that would update me towards thinking little compute is required for experiments:
Even if returns on labor investment are compounding at the beginning of takeoff, research may run into diminishing returns before superintelligence is produced. This would result in a bumpy takeoff.
The evidence I expect to collect before takeoff is relatively weak, because current progress rates don't tell us much about the difficulty of discovering more advanced ideas we haven't yet tried to find. That said, some evidence might be:
I expect to get some evidence of the likelihood of a software-only takeoff in the next year, and reasonably decisive evidence by 2030. Overall I think evidence of positive feedback in labor inputs to software R&D would move me the most, with evidence that compute is not a bottleneck being a near second.
Publicly available evidence that would update us towards a software-only singularity might be particularly important because racing companies may not disclose progress. Disclosure of such evidence is largely not required by existing transparency laws, and so should be a subject of future legislation. Evidence of takeoff speeds would also be helpful for AI companies to internally predict takeoff scenarios.
Thanks for feedback from other participants in the Redwood futurism writing program. All errors are my own.
This paper makes substantial progress but does not fully correct for endogeneity, and its 90% confidence intervals straddle an r of 1, the threshold for compounding, in all domains except SAT solvers.
It may be hard to know if labs have already made the same discoveries.
See this post and comments for arguments about the plausibility of finding scalable innovations using small amounts of compute.
This may only be clear in retrospect, since breakthroughs like transformers weren't immediately recognized as major.