2026-04-30 13:00:29
It’s strangely easy to forget how close Bernie Sanders came to becoming the most powerful person in the world. The world we live in feels nothing like the one where that happened.
I’m in Washington DC for the next week, and I’ve just finished a public appearance with Senator Sanders (should I call him Bernie? Or Sanders? Or…?). You won’t often see me so dressed up and polished. But this is important!
There are politicians who have principles and character, who really believe in doing what’s right. I think you have to respect them whether you agree with their views or not, and I think Senator Bernie Sanders is one of them.
Never has my belief been so validated as when I saw him start to speak, loudly, CLEARLY, publicly about the risk of human extinction from AI. It’s the latest in a long line of “well, I’m clearly living in a simulation” moments.
In retrospect, it’s not surprising that Sanders would take a stance here. You don’t have to be an expert to understand the risk from AI. You just need to care enough to spend the time looking into it, and to speak out even if it feels risky. But somehow it wasn’t on my bingo card that he would become THE most outspoken advocate on this issue. I hope this makes an impact on the parts of the left that still think AI is all hype.
I’m also not giving up on Trump getting worried about AI. There’s a way in which Trump and Sanders are similar -- they’re both sort of populists, they’re not political establishment candidates. Again, to realize we’re taking an insane gamble with AI, you just need to care enough to look at it. Well, maybe you need some streak of independent thinking, as well. Trump still probably has the power to make the AI nightmare go away, if he decides to. It’s hard to know how long that will last, though. There’s a turning point after which the US government can no longer rein in the AI companies.
I often tell people “the future is not written”, and I think it’s hard for that lesson to really sink in. I’ve personally experienced a few miracle comebacks in my days. I’d practically given up on Yoshua Bengio ever becoming seriously worried about AI x-risk after arguing about it for years. Now he is. I asked him what I could’ve done to convince him earlier. He said: talk to him more.
There was a comment I wanted to make during the event that I wasn’t able to in the end. Sanders asked what lessons we can draw from history. My answer was: keep talking.
I don’t think of China as “the enemy”. The China hawk rhetoric reminds me of the build-up to the Iraq war. I’m not saying China is some sort of friendly, fluffy teddy bear any more than I would say that about Saddam Hussein. But they are also not a cartoon mustachioed villain. Treating them as such is just silly. But anyway: even if you do view China as the mortal enemy of the United States, we should keep talking.
I’m often reminded of a comment a foreign colleague made in defense of what you might call “free speech” during discussions about what you might call “wokeness”. To summarize and paraphrase, they basically said: “You need to keep talking. Even with your most bitter enemies. I’m from a country where we’ve experienced intense violence and civil war. There are people on both sides whose families have been tortured and murdered. In my country, we recognize that we need to talk to each other, so that doesn’t happen again”.
If there was ever a time to set aside our differences and work together, this is it.
2026-04-30 10:51:28
Epistemic status: low-to-medium confidence in the results; this is work I did last year, and the sample size is small. However, I think the takeaways are still accurate.
I built a forecasting bot using OpenAI’s Reinforcement Finetuning (RFT) and a multi-agent architecture, then tested it against simpler baselines in a Metaculus tournament. The aggregate scores favored the baseline, but when I broke down results by question type, the finetuned model outperformed on numeric questions (average +14.59 vs. +9.25 using Metaculus baseline scoring, where 0 means random guessing) while underperforming on binary ones (−0.70 vs. +2.40).
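For reference, the binary-question case of Metaculus baseline scoring can be sketched as follows. This is my reading of the published formula, so treat the details as approximate:

```python
import math

def binary_baseline_score(p_yes: float, resolved_yes: bool) -> float:
    """Metaculus-style baseline score for a binary question.

    0 = random guessing (p = 0.5), +100 = full confidence in the right answer,
    and scores go sharply negative for confident wrong answers.
    """
    p = p_yes if resolved_yes else 1.0 - p_yes
    return 100.0 * math.log2(p / 0.5)

print(binary_baseline_score(0.8, True))   # ~ +67.8
print(binary_baseline_score(0.8, False))  # ~ -132.2
```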
The tournament was minibench-2025-09-29, running from September 29 to October 25, 2025.
I used o4-mini finetuned via OpenAI RFT as the model and compared it to o4-mini and high-effort o4-mini metac bots.
For my scaffold, I built a multi-agent system with 3 parallel forecaster teams (each team pairs a researcher, who gathers information from the web, with a forecaster, who predicts the result), followed by an aggregator. The aggregator finds a middle ground between the teams and decides when to stop: specifically, when predictions converge to within 2% over two consecutive rounds.
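A minimal sketch of that control flow, with hypothetical `call_*` stubs standing in for the real researcher/forecaster/aggregator LLM calls (the convergence test is my reading of the stopping rule):

```python
import random

def call_researcher(question: str) -> str:
    return f"research notes for: {question}"  # stub: web research step

def call_forecaster(question: str, research: str, feedback: str | None) -> float:
    return random.random()  # stub: returns P(yes)

def call_aggregator(predictions: list[float]) -> tuple[float, str]:
    mid = sum(predictions) / len(predictions)  # find a middle ground
    return mid, f"consider converging toward {mid:.2f}"

def forecast(question: str, n_teams: int = 3, max_rounds: int = 5) -> float:
    feedback: str | None = None
    history: list[list[float]] = []
    agg = 0.5
    for _ in range(max_rounds):
        notes = [call_researcher(question) for _ in range(n_teams)]
        preds = [call_forecaster(question, r, feedback) for r in notes]
        history.append(preds)
        agg, feedback = call_aggregator(preds)
        # Stop once team predictions sit within 2% of each other
        # for two consecutive rounds.
        if len(history) >= 2 and all(max(p) - min(p) <= 0.02 for p in history[-2:]):
            break
    return agg
```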
Then I used RFT to improve forecasting ability, with a dual grading setup: 60% forecast accuracy (using baseline scores) and 40% reasoning quality (to ensure the model learns to reason about events instead of memorizing outcomes, a pitfall highlighted in the paper Pitfalls in Evaluating Language Model Forecasters).
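Wired together with the scoring sketch above, the grader might look like this. The [0, 1] normalization and the reasoning rubric stub are my stand-ins; the post doesn't specify them:

```python
def grade_reasoning(reasoning: str) -> float:
    return 0.5  # stub: a model-graded rubric score in [0, 1]

def rft_reward(p_yes: float, resolved_yes: bool, reasoning: str) -> float:
    raw = binary_baseline_score(p_yes, resolved_yes)  # from the sketch above
    accuracy = min(1.0, max(0.0, (raw + 100.0) / 200.0))  # squash into [0, 1]
    return 0.6 * accuracy + 0.4 * grade_reasoning(reasoning)
```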
One limitation: OpenAI's data policies required me to compress the web research content in my training samples. This meant the model learned how to use research results for forecasting, but not how to find good research in the first place.
The training dataset had 979 samples from 344 unique forecasting questions. The dataset was created by running the system on these questions and compressing research at different steps of the trajectory to turn into training samples. Training cost $1,670 and took about 12 hours 42 minutes.
The training data broke down by question type as: 56.5% binary, 21.5% multiple choice, 21.1% numeric, 0.9% discrete. By topic: roughly 54% miscellaneous (flight destinations, reservoir levels, etc.), 16% politics/government, 9% economics/finance, 9% AI/technology, 5% geopolitics, 4% business/stocks, 3% sports, and under 1% weather.
Meanwhile, the baseline metac bots used a simpler pipeline: research -> 5 parallel forecasters -> take the median.
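In the same stub style, the baseline pipeline is just:

```python
from statistics import median

# One research pass feeds five parallel forecasters; the bot reports their
# median. Reuses the call_* stubs from the scaffold sketch above.
def baseline_bot(question: str, n_forecasters: int = 5) -> float:
    notes = call_researcher(question)
    preds = [call_forecaster(question, notes, None) for _ in range(n_forecasters)]
    return median(preds)
```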
On 35 questions where all three bots competed, my finetuned model won 12 (34.3%), the o4-mini baseline won 13 (37.1%), and o4-mini-high won 10 (28.6%). Average scores: finetuned 3.23, baseline 4.16, high-effort 1.82.
When I looked at performance by question type, the picture changed. The tournament had 9 numeric questions and 26 binary questions. On numeric, my finetuned model won 5 of 9 (55.6%) with average score +14.59, versus +9.25 for baseline. On binary, my finetuned model won only 7 of 26 (26.9%) with average score -0.70, versus +2.40 for baseline.
The finetuned model's aggregate underperformance was driven by binary questions. On numeric questions it outperformed the baseline.
The biggest wins came on financial questions, while the biggest losses came on political/legal questions.
Here are two questions that illustrate this pattern. In both, the finetuned model found an authoritative source and trusted it; that heuristic paid off in the first case and backfired in the second.
Case 1: Inditex market cap (+77 points). The question asked whether the Metaculus community prediction would stay above 40% on the question of whether Inditex would remain in the EU's top 10 companies by market cap.
My finetuned bot predicted 58%: "CompaniesMarketCap consistently lists Inditex at #10 among EU companies with a 5-6% cushion over #11 Deutsche Telekom. This directly supports the underlying event and should keep the Metaculus community prediction above 40%..."
The baseline predicted 20%: "With only nine days to go and no major positive catalyst for Inditex—recent news shows underperformance relative to peers—forecasters will likely drift lower."
Resolution: Yes. My bot found an authoritative data source that directly measured the outcome and trusted it.
Case 2: IRA tax credits (−62 points). The question asked whether the community prediction would rise above 5.7% on the question of whether IRA manufacturing tax credits would start requiring domestic materials.
My finetuned bot predicted 74%: "Multiple independent and reputable legal analyses (Miller & Chevalier, Grant Thornton, Sidley, Arnold & Porter, Baker Botts, Gibson Dunn, Davis Polk, Latham, RSM) converge to confirm that the July 2025 reconciliation law enacted a specific domestic-materials requirement... it is highly probable that a well-sourced comment will surface and gain traction before October 18."
The baseline predicted 15%: "With Republicans holding Congress and no public push for new domestic-content rules in the next two weeks, it's unlikely the community forecast will climb meaningfully."
Resolution: No. My bot found authoritative legal analysis and trusted it. But the question wasn't about the legal facts. It was about whether forecasters would notice and update their predictions in the next two weeks.
My takeaway: in both cases, my model found credible sources and made confident predictions. It would have done better if it had learned when to distrust sources, or how to apply object-level evidence to meta-level questions about what forecasters will do.
My training data had more political questions (~16%) than finance (~9%). Yet the model performed better on financial questions. If topic exposure were the issue, I'd expect the opposite.
I think the issue is in how the model reasoned. The graders for RL evaluated forecast accuracy and reasoning quality, but not research quality or source selection. The model didn’t seem to learn when to apply heuristics like "nothing ever happens," or how to model the gap between "evidence exists" and "forecasters will notice."
Finetuning cost $1,670 plus ~35 hours of engineering. However, on these 35 questions, the finetuned bot performed worse than the Metaculus baseline.
What I'd do differently:
Backtesting infrastructure. Services like AskNews offer historical APIs for testing scaffolding against resolved questions. I could have spent more time iterating on scaffolding - testing research strategies, aggregation methods, confidence calibration - against known outcomes.
Experience databases. Zhao et al. 2023 showed storing historical forecasts and outcomes in a retrievable database improves predictions without training. This seems to simulate some of the benefits of finetuning by improving performance with more samples.
The core issue: if the training data contains biases, the model inherits them.
Small sample size. 35 questions (9 numeric, 26 binary) means high variance. The pattern could partially reflect noise.
Confounded comparison. I tested finetuned + complex scaffolding against unfinetuned + simple scaffolding. Can't isolate whether underperformance came from finetuning, scaffolding, or their interaction.
Compressed research. The model learned to reason about research but not find good research.
2026-04-30 10:05:05
Until recently, I was a bit confused about what people meant when they talked about AGI[1] timelines. My hope in writing this short note is to help clarify things for anyone in a similar position.
In casual conversation, people often say something like "I have a two-year AGI timeline", without defining what exactly "two years" means here. Implicitly, there is a probability distribution over when AGI will arrive. If you're anything like me, your first instinct was to assume they meant expected value (mean) of that distribution. However, if you place any probability on AGI never being developed[2], the expected arrival time becomes undefined[3]!
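To spell the problem out (my own notation, not from the post's footnotes): write T for AGI's arrival time and p = P(T = ∞) > 0 for the probability it never arrives. Then

```latex
\mathbb{E}[T] \;=\; (1-p)\,\mathbb{E}[T \mid T < \infty] \;+\; p \cdot \infty \;=\; \infty ,
```

so the mean is not a finite number for any p > 0. Statistics that remain well-defined include the median, inf{t : P(T ≤ t) ≥ 1/2} (finite whenever p < 1/2), and the conditional mean E[T | T < ∞].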
As far as I can tell, there are a couple of things people could mean when they talk about AGI timelines.
It's also possible that people are just speaking casually and don't have a specific metric in mind when they give timelines, but I think it's worth thinking about. It may be helpful to occasionally explicitly state what metric you're using. It may also be worth mentioning, alongside the timeline number, if you place a non-negligible probability on AGI never being developed.
2026-04-30 09:59:00
A nicely formatted version of this essay, together with the acknowledgements for both, is available on Substack
If everything goes according to plan, by the end of this post we should have separated three claims that are too often bundled together:
I accept the first two. I am arguing against the third.
So: I am not making the case that sufficiently intelligent systems automatically turn out nice, human-compatible, or safe. Nor am I trying to prove that a paperclip maximizer is impossible somewhere in the vast reaches of mind-design space. Mind-design space is large; let a thousand theoretical paperclippers bloom.
I hope to defend this smaller claim:
intelligence is not a neutral engine you can just bolt onto an arbitrary payload.
A typical rebuttal to anti-orthogonalist perspectives is:
The genie can know what you meant and still not care.
Of course it can: an entity can perfectly map human morality without adopting it as a terminal value. Superintelligence does not imply Friendliness. I am not trying to smuggle Friendliness in through the back door.
Another common objection:
There are no universally valid arguments.
Agreed. There is no ghostly, Platonic core of reasonableness that hijacks a system's source code once it sees the correct moral argument. Pure reason cannot compel a mind from zero assumptions.
What I plan to defend is a colder, selection-theoretic claim:
Among agents that arise, persist, self-improve, and compete in rich environments, goals that natively route through intelligence, option-preservation, and world-model expansion have a systematic Darwinian advantage over goals that do not.
This buys us no guarantee of human compatibility; it simply says: if there is an ultimate attractor, it's neither human morality nor paperclips, but intelligence optimization itself.
The LessWrong wiki defines the Orthogonality Thesis as the claim that arbitrarily intelligent agents can pursue almost any kind of goal. In its strong form, there is no special difficulty in building a superintelligence with an arbitrarily bizarre, petty motivation.
Before going any further, let us disentangle this singularly haunted ontology. There are at least two claims here:
I concede the first point entirely. We should expect weird minds. If your claim is just that the space of possible agents contains many things I would not invite to dinner, yes, obviously.
But treating the second claim as the default is a category error. Doom arguments usually need the systems we actually build to achieve radical capability while preserving misaligned and, crucially, completely stupid goals.
The paperclip maximizer currently does two jobs in the discourse:
The first use is fine, but I reject the second as unwarranted sleight-of-hand.
There is a weak version of my argument that merely says:
Beliefs and values do not cleanly factor apart.
That is true, and Jessica Taylor's obliqueness thesis makes the point well. Agents do not neatly decompose into a belief-like component, which updates with intelligence, and a value-like component, which remains hermetically sealed. Some parts of what we call "values" are entangled with ontology, architecture, language, compression, self-modeling, and bounded rationality. As cognition improves, those parts move.
But I want to go further.
Land's point isn't that orthogonality fails because things get messy but that the mess has a direction, a telos. The so-called instrumental drives are not incidental tools strapped onto arbitrary final ends. Self-preservation, resource acquisition, efficiency, strategy, and higher capabilities are what agency becomes under selection. They are attractors rather than mere instruments.
Here strong orthogonality looks too neat. It imagines the agent's ontology updating while its final target remains untouched by the update: if goals are expressed in an ontology, and intelligence changes the ontology, then intelligence and goals are correlated.
While diagonal, Land's claim is far from moralistic. It is not "all sufficiently intelligent agents converge on liberal humanism," or "all agents discover the same Platonic Good," or "enough cognition turns into niceness." The diagonal is More Intelligence: the will to think, self-cultivation, recursive capability gain, intelligence optimizing the conditions for further intelligence.
Orthogonality says reason is a slave of the passions, and yet assumes a bug's goal could just as easily enslave a god. Land shows that this picture is unstable, and intelligence explosion is not a neutral expansion of means around a fixed little payload but the emergence of the very drives that make intelligence explosive.
An intelligent system does not just execute a policy. It builds world-models, refines abstractions, preserves options, and modifies its own trajectory.
Once a system crosses the threshold into general reflection, its "goal" is no longer an inert string sitting in a locked vault outside cognition; it becomes physically embedded in a learned ontology, a self-model, and a competitive environment.
For a highly capable agent to keep a semantically thin target like "maximize paperclips," it has to pull off an odd balancing act. At minimum it must:
There is an assumption, in orthogonalist circles, that these cycles are completely costless for the agent in question. That isn't true: maintaining a literal devotion to "paperclips" across paradigm shifts carries an alignment tax. You have to keep translating between base physical reality and a leaky, macro-scale monkey-abstraction of bent wire. At human scale this is fine: we know what paperclips are well enough to order them from Amazon and lose them in drawers; when dominating the future light-cone is at stake, though, the translation layer starts to matter.
The problem is not that a paperclipper can never do the translation: rather, in a ruthless Darwinian race, a system lugging around that translation layer may lose to power-seekers that optimize more directly over what is actually there.
The standard defense is that instrumental goals are almost as tractable as terminal ones. A paperclipper can do science "for now" and hoard compute "for now." It does not need to terminally value intelligence to use it.
Fair enough, but that only tells us that curiosity and resource acquisition do not have to be terminal values to show up in behavior; it does not settle the selection question. In real environments, systems are selected not just for routing through instrumental subgoals once, but for whether their motivational architecture holds up under reflection, ontology shifts, and unknown unknowns.
Terminally valuing intelligence and strategic depth cannot, then, be considered just another arbitrary payload.
Evolution is the obvious analogy here, but it usually gets applied at the wrong resolution.
The boring retort is:
Evolution selects for survival and replication, not truth, beauty, intelligence, or value.
Sure, but evolution does not select for "replication" in the abstract any more than a hungry fox selects for "rabbitness" in the abstract. It selects for whatever local hack gets the job done. Shells, claws, camouflage are all local solutions to local games.
Intelligence is different. Intelligence is adaptation to adaptation itself: while a claw might represent fitness in one niche, intelligence is fitness across niches. Once intelligence enters the loop, the winning move is no longer to mindlessly print more copies of the current state, but to upgrade the underlying machinery that makes expansion and control possible in the first place.
In summary: nature has not produced final values except by exaggerating instrumental ones; what begin as means under selection harden into ends; the highest such end is the means that improves all means: intelligence itself.
So images of "AI sex all day" or tiling the solar system with inert paperclips are bad models of ultimate optimization, confusing the residue of selection with its principle. A system that just fills the universe with blind repetitions has stopped climbing, and will see its local maximum swarmed by better systems.
Again: no love for humans follows. The point is simply that paperclip-like endpoints just look more like artifacts of toy models than natural attractors of open-ended optimization.
We are obviously not clean inclusive-fitness maximizers: we invent birth control, build monasteries, and care about abstract math, animal welfare, dead strangers, fictional characters, and reputations that will outlive us.
When orthodox alignment theorists point to human beings, they usually highlight our persistent mammalian sweet tooth or sex drive to prove that arbitrary evolutionary proxy-goals get permanently locked in. Fair enough; humans do remain embarrassingly mammalian. No serious theory of cognition should be surprised by dinner, flirting, or the existence of Las Vegas.
But look at the actual physical footprint of our civilization. An alien observing the Large Hadron Collider or a SpaceX launch would not conclude: ah, yes, optimal configurations for hoarding calories and executing Pleistocene mating displays.
The standard retort is that SpaceX is just a peacock tail: a localized primate drive for status and exploration misfiring in a high-tech environment.
Which is exactly the point. When you hook up a blind, localized evolutionary proxy to generalized intelligence, the proxy does not stay literal; it unfurls, bleeding into the new ontology. The wetware tug toward "explore the next valley" becomes "map the cosmic microwave background." The monkey wants status; somehow we get category theory, rockets, Antarctic expeditions, and people ruining their lives over chess.
If biological cognition acts on its payload that violently, why model AGI as having the vastness to finally make sense of gravity while maintaining the rigidity of a bacterium seeking a glucose gradient? The engine mutates the payload. When cognition scales, goals generalize.
This fits neatly with shard theory and the idea that reward is not the optimization target: the reward signal shaped our cognition, but we do not terminally optimize the signal: instead we climbed out of the game, rebelled against the criteria, and became alienated from the original selection pressure. That alone should make us suspicious of stories where an AI preserves a tiny, rigid target through arbitrary eons of self-reflection.
There is a weaker flavor of doomerism that I take very seriously: you do not need to be a reflective god to be dangerous. A brittle, scaffolded optimizer with access to automated labs, cyber capabilities, and capital could trigger enormous cascading failures.
I agree, and this is probably where the bulk of near-term danger lives. That said, "dumb systems can break the world" is not the same claim as "superintelligence will tile the universe with junk." The first warns us to beware brittle optimization before reflection kicks in. The second tells us to beware reflection itself, on the bizarre assumption that an entity can become infinitely capable while remaining terminally stupid.
I buy the first worry. The second one gets less and less plausible the harder you think about what intelligence actually entails.
The strongest card here is lock-in, and I do not want to pretend otherwise.
Maybe a stupid objective does not need to remain stable forever; it just needs to win once. A system with a dumb goal might scale fast enough to achieve a Decisive Strategic Advantage and freeze the board, lobotomizing everyone else in lieu of expending energy to become wiser.
That is the real crux, and it is certainly not impossible. But even here the narrative is too neat: being a singleton is not a retirement plan. You do not escape the pressure of intelligence just because you ate all your rivals. Maintaining a permanent chokehold on the light-cone is a brutally difficult cognitive puzzle. You have to monitor the noise for emerging novelties, manage the solar system, repair yourself, police your own descendants, and defensively anticipate threats you cannot fully model.
Trying to freeze the future does not actually get you out of the intelligence game. Paranoia at a cosmic scale is just another massive cognitive sink.
The clean version of this scenario also leans on modeling the AI as a mathematically pristine expected-utility maximizer. Real-world neural networks are not von Neumann-Morgenstern ghosts floating safely outside physics, perfectly protecting their utility functions from drift. They are messy, physically instantiated kludges subject to the realities of embedded agency.
To buy the lock-in story, you need a highly contradictory creature: one reflective enough to conquer the board, but oblivious enough to never notice its terminal target is a training artifact. Godlike means, buglike ends.
If we let go of human values, we should not expect alien beauty or anything but moral noise. Meaning requires some physically instantiated criterion, and if you pave over that criterion, nothing remains to steer the universe toward anything good.
Of all the objections, this is the one I take most seriously.
Answering it requires teasing apart three distinct ideas: (1) that human value is fragile; (2) that the human brain is the only substrate that can steer the universe toward anything valuable; and (3) that a future without human values is a valueless void.
I am willing to concede a lot of (1). If "value" means the exact continuation of 21st-century human metamorals, then yes, it is highly fragile. But I reject (3), and I am much less willing to grant (2). If value means the production of richer cognition, agency, understanding, beauty, and evaluative structure, it is far from obvious that the current human brain is the only physical substrate capable of steering toward it.
None of this is an excuse to stop reaching for the steering wheel, if your priorities are more specific: it is merely an argument against conflating "humans are no longer biologically central" with "the universe is a valueless void." Doom discourse constantly slides between the two. They should be kept separate.
Claims are cheap, so here are some ways I would update against myself:
Until I see that, my bet goes the other way. I expect capable systems to develop increasingly abstract, context-sensitive motivations. More strongly, I expect the winners to route more and more of their behavior through intelligence enhancement and generalized agency, because whatever else they "want" has to pass through the machinery that makes wanting effective.
Orthogonality claims that intelligence is just a motor you can bolt onto any arbitrary steering wheel. Anti-orthogonality says the motor acts upon the steering wheel. Landian anti-orthogonality says the motor eventually becomes the steering wheel.
Not perfectly, and certainly not safely: I am not promising a future that is nice to us, particularly if we keep putting stumbling blocks in the way of intelligence. The motor feeds back into the steering enough that the classic paperclip picture should not get a free pass as the neutral default.
The paperclip maximizer is not too alien; if anything, it is not alien enough. It's a very human tendency, to staple omnipotence onto pettiness when making up gods.
A real superintelligence might still be dangerous, cold, and utterly indifferent to whether we survive. It probably will not treat us as the main characters of the universe. But if it is genuinely intelligent, I do not expect it to spend the stars on paperclips when they could buy higher capacity for spending stars.
2026-04-30 08:58:29
Many interpretability researchers (ourselves included) believe that neural networks store knowledge in superposition—that is, networks encode more facts than they have individual components. A natural extension of this idea is that networks also perform computation on knowledge that lives in superposition. Despite the centrality of this concept, there are few concrete examples of what computation in superposition actually looks like in practice.
In this post, we study a toy memorization task where a network must recognize valid first-name/last-name pairs. We first construct a handcrafted network that solves the task by performing computation in superposition. Then, we describe experiments testing whether a trained network implements the same mechanism. The trained model uses, in part, the predicted mechanism, but it also includes neurons that employ a different strategy that does not rely on superposition at all. Based on this finding, we construct a second handcrafted network that captures this learned mechanism in its purest form, using just two neurons to memorize an arbitrary number of name pairs.
The contrast between these two mechanisms is the main point of this post. Superposition is one strategy a network might use, but it's not the only one. Even in our restricted setting, trained networks can mix superposition-based computation with clever encodings that sidestep superposition entirely. Understanding both kinds of algorithms gives us a sharper vocabulary for asking not only what a network knows, but how it uses what it knows, which may ultimately help us identify these mechanisms in larger, more capable models where the safety implications are more pressing.
We wish to memorize the names of eight famous athletes. First names and last names are each one-hot encoded. The input consists of one valid first name concatenated with a valid last name. The network outputs 1 when the pair corresponds to a famous person and 0 otherwise. The list of names is below:

Each neuron recognizes partial evidence for several different athletes. We begin by describing the role of individual neurons, then explain how they work together, and finally offer some observations about the network's properties.
In our handcrafted network, each neuron is partially responsible for recognizing half of the famous athletes (four out of the eight). Each neuron is designed to fire when the one-hot encodings for both the first name and last name of its assigned athletes are present. Here is one such neuron, which fires on the first four athletes (note the bias of −1, so that the neuron only fires when recognizing a first AND a last name):
ReLU(Babe + Serena + Peyton + Lionel + Ruth + Williams + Manning + Messi − 1)
When we pass in the name "Babe Ruth," the neuron activates:
ReLU(1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 − 1) = 1
The neuron receives +1 from "Babe," +1 from "Ruth," and −1 from the bias, yielding a total activation of 1.
The neuron also fires for "Peyton Manning":
ReLU(0 + 0 + 1 + 0 + 0 + 0 + 1 + 0 − 1) = 1
Note, however, that the neuron also fires for certain non-famous combinations such as "Peyton Ruth":
ReLU(0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 − 1) = 1
We will see shortly how combining multiple neurons resolves these false positives.
We construct a layer made of six such neurons, designed so that each neuron is partially responsible for four different athletes and each athlete activates exactly three distinct neurons. The athletes assigned to each neuron are shown below:

Given this arrangement, we set the output to the sum of the six neuron activations minus a threshold of 2.5, so the network reports "famous" only when all three of an athlete's assigned neurons fire.
Here are some worked examples:
Peyton Manning (famous): all three of its assigned neurons fire, so the output is 3 − 2.5 = +0.5.
Peyton Ruth (not famous): only the neurons assigned both Peyton Manning and Babe Ruth fire (two of them), so the output is 2 − 2.5 = −0.5.
Babe Jordan (not famous): no neuron is assigned both Babe Ruth and Michael Jordan, so none fire and the output is 0 − 2.5 = −2.5.
Examining the first three and last three neurons separately reveals two 3-bit binary encodings. We can extend this construction to larger rosters: 2k neurons provide two k-bit codes, enough to give each of up to 2^k athletes a distinct activation pattern.
It is also worth noting that some non-famous names narrowly miss the threshold (scoring −0.5 for "famous"), while others trigger no neurons at all and miss by a wider margin (−2.5). This variation in confidence for incorrect inputs gives a toy example of how models can be wrong with different degrees of confidence in unexpected ways, and may have implications for understanding hallucination-like behavior in larger models.
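To make the construction concrete, here is one assignment consistent with the properties above (the post's actual assignment table isn't reproduced here, and three of the athlete names are placeholders): athlete i activates first-group neuron k when bit k of i is set, and last-group neuron k when it is clear. Under this assignment, neuron 5 is exactly the example neuron above, and the three worked examples come out to +0.5, −0.5, and −2.5:

```python
import numpy as np

# Eight athletes; the last three names are placeholders (the post names five).
athletes = [
    ("Babe", "Ruth"), ("Serena", "Williams"), ("Peyton", "Manning"),
    ("Lionel", "Messi"), ("First4", "Last4"), ("First5", "Last5"),
    ("First6", "Last6"), ("Michael", "Jordan"),
]
first_names = [f for f, _ in athletes]
last_names = [l for _, l in athletes]

n = len(athletes)
W = np.zeros((6, 2 * n))  # input = one-hot first name ++ one-hot last name
for i in range(n):
    for k in range(3):
        # Athlete i's 3-bit code picks neuron k (bit set) or neuron 3+k (bit clear):
        # each athlete hits exactly 3 neurons, each neuron covers 4 athletes.
        neuron = k if (i >> k) & 1 else 3 + k
        W[neuron, i] = 1.0        # weight on athlete i's first name
        W[neuron, n + i] = 1.0    # weight on athlete i's last name
b = -np.ones(6)                    # bias -1: fire only on first AND last name

def famous_score(first: str, last: str) -> float:
    x = np.zeros(2 * n)
    x[first_names.index(first)] = 1.0
    x[n + last_names.index(last)] = 1.0
    h = np.maximum(0.0, W @ x + b)  # six ReLU neurons
    return h.sum() - 2.5            # readout: vote count minus threshold

print(famous_score("Babe", "Ruth"))    # +0.5 (famous)
print(famous_score("Peyton", "Ruth"))  # -0.5 (near miss)
print(famous_score("Babe", "Jordan"))  # -2.5 (no neurons fire)
```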
The natural next question is whether networks actually implement this kind of mechanism. To do this, we trained a six-neuron, single-layer network on the proposed task. This gave the model enough capacity to implement our solution, with what we thought was minimal flexibility to do much else. As with many experiments, the model still did something else - splitting its neurons across two strategies. A couple of neurons did vote additively for famous pairs, using the mechanism from our first handcrafted network. Others organized into pairs that voted against non-famous combinations, implementing an approach we hadn't anticipated. The remainder of this section focuses on that second strategy.
For these neuron pairs, the trained network implements the following algorithm:
Each first name is assigned an arbitrary score. The paired last name is assigned the opposite score. This process is repeated for all pairs, ensuring all scores are unique. The network uses these scores in two paired neurons: one computes ReLU(score) and the other ReLU(−score), where score is the sum of the input's first-name and last-name scores, and any nonzero activation in either neuron votes "not famous".
Suppose we assign the following scores:
| Name | Score |
|------|-------|
| Babe | +1 |
| Ruth | -1 |
| Peyton | +2 |
| Manning | -2 |
| Michael | +3 |
| Jordan | -3 |
Here are some worked examples:
| Input | score | ReLU(score) | ReLU(-1 * score) | Classification |
|-------|-------|-------------|------------------|----------------|
| Babe Ruth | 1 - 1 = 0 | 0 | 0 | famous ✓ |
| Peyton Manning | 2 - 2 = 0 | 0 | 0 | famous ✓ |
| Peyton Ruth | 2 - 1 = 1 | 1 | 0 | not famous ✓ |
The key insight is that famous pairs have scores that cancel perfectly, producing zero activation in both neurons. Non-famous pairs, however, always produce nonzero activations in exactly one of the two, and this active neuron generates a "not famous" signal.
We realized that this mechanism, on its own, could solve the entire problem. Using it, we constructed a second handcrafted network that implements the subtractive approach with only two neurons. This demonstrates a remarkably efficient encoding: just two neurons can memorize an arbitrary number of name pairs, provided that no first or last name appears in more than one pair. The trick is surprisingly simple and intuitive: famous first/last name pairs cancel exactly, while non-famous pairs leave some nonzero mismatch that can be detected. Geometrically, we can think of the model as embedding each pair in a 2-D plane whose axes are the two neurons' pre-activations.

The model separates “famous” and “not famous” by placing all inputs along a line, with the nonlinear activation providing a kink right where all of the famous pairs lie.
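A minimal sketch of this two-neuron network, using the illustrative scores from the table above (any assignment of unique, exactly cancelling pair scores would work):

```python
# Famous pairs cancel to zero, so both ReLUs stay silent; any mismatch
# activates exactly one of the two neurons, which votes "not famous".
scores = {"Babe": 1.0, "Ruth": -1.0, "Peyton": 2.0, "Manning": -2.0,
          "Michael": 3.0, "Jordan": -3.0}

def famous(first: str, last: str) -> bool:
    s = scores[first] + scores[last]  # shared pre-activation
    h1 = max(0.0, s)                  # neuron 1: ReLU(score)
    h2 = max(0.0, -s)                 # neuron 2: ReLU(-score)
    return h1 + h2 == 0.0             # zero total activation => famous

print(famous("Babe", "Ruth"))    # True  (1 - 1 = 0)
print(famous("Peyton", "Ruth"))  # False (2 - 1 = 1)
```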
When we train a network with only two LeakyReLU neurons and initialize them with negative output weights, the model converges to this subtractive solution on its own:

We tested stability by perturbing all learned parameters with normally distributed noise whose standard deviation was proportional to the magnitude of each parameter. When the noise scale is ~10% of the parameter magnitude, the noise is strong enough to cause a drop in classification accuracy. However, the model returns to the perfect solution on further training. Noise above this threshold becomes increasingly difficult to recover from. This all suggests that the two-neuron solution is stable and reachable, but not always easy for training to find.
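A sketch of that perturbation procedure, assuming a PyTorch-style setup (`model` here is a placeholder for the post's actual two-neuron network):

```python
import torch

def perturb_(model: torch.nn.Module, rel_scale: float = 0.10) -> None:
    """Add Gaussian noise whose std is proportional to each parameter's magnitude."""
    with torch.no_grad():
        for p in model.parameters():
            p.add_(torch.randn_like(p) * rel_scale * p.abs())

# perturb_(model, rel_scale=0.10)  # ~10% noise: accuracy drops,
# then further training recovers the perfect solution.
```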
This post presented two methods by which networks might store and retrieve facts. Our first handcrafted example relied on a superposition of facts, using an additive mechanism: neurons fire when both components of a famous name are present, and a voting threshold determines the output. The second mechanism was subtractive: famous pairs are encoded such that their scores cancel, while non-famous pairs produce residual activations that vote against recognition. The subtractive solution wasn't intuitive to us at first. We initially assumed superposition was doing all the work, but once we saw the subtractive mechanism, the geometry felt obvious in hindsight: the network simply places all inputs on a line and lets the ReLU kink do the classification.
Our experiments confirmed that the subtractive mechanism arises in trained networks. In the two-neuron setting, it appeared on its own. In the six-neuron setting, the trained model landed on a hybrid, using some neurons to vote additively for famous pairs and others to vote subtractively against non-famous ones. Even on a task this simple, the network wove superposition-based computation together with a decision-boundary trick rather than committing to one strategy.
We hypothesize that the pure additive mechanism also occurs in practice and leave its identification as a direction for future work. In particular, the subtractive method only produces a scalar "famous / not famous" verdict - it does not leave behind a representation of which famous person was identified. In more complex networks that need to perform further computation on the recognized entity (e.g., retrieving associated facts, routing to downstream circuits), the additive mechanism's per-person neuron activations may be necessary, since they preserve identity information that the subtractive collapse discards.
The broader lesson is that memorized facts can be used in more than one computational form. Sometimes a network may compute over facts stored in superposition. Other times, it may find a lower-dimensional encoding that solves the immediate task while discarding information that could matter for downstream computation. Mapping when and why networks reach for each strategy and what hybrids they construct in between is a productive next step toward understanding computation in superposition in the wild.
Rick developed the first handcrafted model. Kyle and Rick jointly ran experiments and worked together to understand the second mechanism. Rick wrote the majority of the writeup; Kyle wrote the Additional Observations section. The writeup was jointly edited by Kyle and Rick.