
Maybe I was too harsh on deep learning theory (three days ago)

2026-04-30 14:57:32

A few days ago, I reviewed a paper titled “There Will Be a Scientific Theory of Deep Learning”. In it, I expressed appreciation to the authors for writing the piece, but skepticism about stronger forms of their titular claim.

Since then I’ve spoken with various past collaborators (via text and in person), and read or reread quite a few deep learning theory papers, including the bombshell Zhang et al. 2016 and Nagarajan et al. 2019 papers that I wrote about on LessWrong.

And the thing is, parts of the infinite-width/depth-limit work turned out to be much more interesting than I thought. Perhaps I have judged deep learning theory (a bit) too harshly.

(Thanks to Dmitry Vaintrob and Kaarel Hänni in particular for conversations on this topic. Much of this was in private, but it was spurred on by a comment from Dmitry that can be found on LessWrong. Also, thanks again to the authors of the scientific theory of deep learning paper, which provided a bunch of references to papers that I had forgotten or been previously unaware of.)


A lot of my impression of the infinite-width and depth-limit work comes from the neural tangent kernel/neural network Gaussian Process line of work. This line of work starts from Radford Neal’s 1994 paper, where he noted that an infinitely wide single-hidden-layer neural network with random weights is a Gaussian Process. In 2017/2018, this work was extended to deep neural networks: Lee et al. showed that a randomly initialized deep neural network, if you took a certain type of infinite-width limit, is also a Gaussian Process. This was then extended by the Neural Tangent Kernel work (Jacot et al., 2018), which described the training dynamics of these infinitely wide neural networks and showed that training is equivalent to kernel gradient descent with a fixed kernel (the eponymous Neural Tangent Kernel). This allowed people to derive convergence properties and nontrivial generalization bounds.
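To make the recursion concrete, here is a minimal sketch of the NNGP covariance computation for a deep ReLU network, using the closed-form Gaussian expectation of ReLU pairs (the arc-cosine kernel). The function and parameter names are mine; this is an illustration, not code from any of the papers above.

```python
import numpy as np

def nngp_kernel(x1, x2, depth, sw2=2.0, sb2=0.1):
    """NNGP covariance between inputs x1, x2 after `depth` ReLU layers.

    At infinite width, pre-activations at every layer are Gaussian, so the
    whole network is summarized by covariances that evolve layer by layer.
    """
    d = len(x1)
    k11 = sw2 * (x1 @ x1) / d + sb2   # variance of layer-1 pre-activation at x1
    k22 = sw2 * (x2 @ x2) / d + sb2   # ... at x2
    k12 = sw2 * (x1 @ x2) / d + sb2   # covariance between the two
    for _ in range(depth - 1):
        cos = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        theta = np.arccos(cos)
        # E[relu(u) relu(v)] for centered Gaussians with the covariance above:
        ev = np.sqrt(k11 * k22) * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)
        k11, k22 = sw2 * k11 / 2 + sb2, sw2 * k22 / 2 + sb2  # E[relu(u)^2] = Var/2
        k12 = sw2 * ev + sb2
    return k12
```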

Unfortunately, while beautiful, this is definitely not how neural networks learn. In the NTK limit, the network behaves as if it were doing linear regression in a fixed feature space whose dimension is the number of neural net parameters. Notably, there is no feature learning, and only the last layer's weights are updated by a noticeable amount. Unsurprisingly, this does not describe the behavior of real neural networks; even small (finite-width) neural networks have been shown to outperform their equivalent tangent kernels.
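Concretely, "glorified linear regression" means the following: train the network to convergence on squared loss, and in the NTK limit its predictions are just kernel regression with the Gram matrix of parameter gradients at initialization. A hedged sketch (the names and the ridge term are my choices):

```python
import numpy as np

def ntk_predict(jac_train, jac_test, y_train, ridge=1e-8):
    """Kernel regression with the empirical NTK at initialization.

    jac_train: (n_train, n_params) parameter gradients of the network output,
    jac_test:  (n_test,  n_params). Theta(x, x') = grad_f(x) . grad_f(x').
    """
    K_tt = jac_train @ jac_train.T                # Theta(X, X), fixed forever
    K_st = jac_test @ jac_train.T                 # Theta(x*, X)
    alpha = np.linalg.solve(K_tt + ridge * np.eye(len(y_train)), y_train)
    return K_st @ alpha                           # the fully trained net's output
```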

An alternative way of taking an infinite width limit is Mean Field Theory (MFT, applied to deep neural networks). As I understand it, the basic idea behind Mean Field Theory in physics is that, instead of calculating the interactions between many objects, you replace the many-body interactions with an average “field” that captures the overall dynamics of the system. (Hence the name.) In neural network land, it turns out that you can take a different infinite-width limit in which the empirical distribution of hidden-unit parameters, viewed as a probability measure on parameter space, evolves under a deterministic flow. This was worked out around 2018 by Mei, Montanari, and Nguyen, Chizat and Bach, Rotskoff and Vanden-Eijnden, and Sirignano and Spiliopoulos.

Notably, in this different infinite width limit, networks actually learn features. NTK uses 1/√N scaling, which makes parameters move only O(1/√N) during training: too small to change the effective kernel. Mean-field uses 1/N scaling, which lets parameters move Θ(1), so the kernel evolves and hidden representations change over the course of training. In MFT, the model is doing something other than glorified linear regression in a fixed random feature space. That being said, for a few years, MFT was entirely a theory of 2-layer neural networks, and it was genuinely unclear how to extend this to deeper networks.
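The difference between the two limits is easy to state in code: the forward passes are identical except for one output-scaling exponent, and that exponent determines whether gradient descent can move the features. A toy sketch, with names of my own choosing:

```python
import numpy as np

def two_layer_output(x, W, a, N, limit):
    """Two-layer ReLU network under the two scalings discussed above.

    With 1/sqrt(N) (NTK), gradient descent moves each parameter only
    O(1/sqrt(N)) and the features stay frozen; with 1/N (mean field),
    parameters move Theta(1) and the hidden features genuinely evolve.
    """
    h = np.maximum(W @ x, 0.0)                       # N hidden ReLU features
    scale = N ** -0.5 if limit == "ntk" else 1.0 / N
    return scale * (a @ h)
```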

Like most of the deep learning community, I was very impressed by the Tensor Programs work of Greg Yang, which was a natural extension of the 2-layer MFT work. Yang proved a series of theorems that allowed him to create a unifying framework (the abc parameterization) for deep neural networks, in which NNGP/NTK and MFT are special cases. Notably, this allowed him to derive μP (maximal-update parameterization), which allows hyperparameter transfer across width (though later work would extend this to depth as well). This is widely considered to be perhaps the clearest application (some would say the only clear application) of modern deep learning theory.
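For intuition, the abc parameterization can be caricatured in a few lines: each layer carries width exponents for the forward multiplier, the initialization scale, and the learning rate, and the familiar limits correspond to particular settings. A toy sketch under my own naming, not Yang's actual formulation or exponent tables:

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_layer(n, a, b):
    """One width-n layer in abc-style parameterization.

    Effective weight W = n^(-a) * w, with trainable w ~ N(0, n^(-2b)), and
    (not shown) learning rate scaled as eta * n^(-c). NTK, NNGP, and
    mean-field/muP limits are particular (a, b, c) choices in this family.
    """
    w = rng.standard_normal((n, n)) * n ** (-b)      # trainable parameter
    return lambda x: n ** (-a) * (w @ x)             # forward pass with n^(-a) multiplier
```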

In my memory, I chalked this up as Greg Yang being a genius. In my recollection of the work, I remembered only μP and the toy shallow neural network model that Yang created that allows one to rederive it.

What I missed, and only learned in the past few days, is that Yang didn't invent this machinery from whole cloth.[1] There was a different line of work, from a team at Google Brain, confusingly also called mean-field theory, which studied how signals travel forward and backward through a network at initialization (though not the training dynamics). Two pioneering examples of this work are Poole et al.'s Exponential expressivity in deep neural networks through transient chaos and Schoenholz et al.'s Deep Information Propagation. Greg Yang's Tensor Programs work descended from this line of work, and Yang was a collaborator of Schoenholz and others.

Reading the work, it's clear how Yang's work draws inspiration from this signal-propagation branch of MFT.[2] For example, the signal-propagation MFT work contained special cases of Yang's Master Theorem: both exploit the fact that, at infinite width, pre-activations are Gaussian, and track their evolution layer by layer via a deterministic recursion on covariances.
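Here is a Monte Carlo sketch of that recursion in the style of the signal-propagation papers: track the variance and correlation of two inputs' pre-activations one layer at a time for a tanh network. Iterating this map (and checking whether perfect correlation is a stable fixed point) is what separates the ordered and chaotic phases in that literature. The sampling approach and names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def corr_map(c, q, sw2, sb2, n=200_000):
    """One layer of the covariance recursion for a tanh network.

    c: correlation between two inputs' pre-activations, each with variance q.
    Returns the correlation one layer deeper (a deterministic map at
    infinite width; estimated here by sampling correlated Gaussians).
    """
    z1 = rng.standard_normal(n)
    z2 = c * z1 + np.sqrt(max(1.0 - c ** 2, 0.0)) * rng.standard_normal(n)
    u1, u2 = np.sqrt(q) * z1, np.sqrt(q) * z2
    q_next = sw2 * np.mean(np.tanh(u1) ** 2) + sb2
    cov_next = sw2 * np.mean(np.tanh(u1) * np.tanh(u2)) + sb2
    return cov_next / q_next
```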

(My guess is that this namespace collision is why I somehow missed this line of work: I had read up on the 2-layer training-dynamics branch of MFT, thought I had understood the relevant parts of MFT, and missed the signal-propagation branch entirely.)


I still think the strong version of "there will be a scientific theory of deep learning", one that explains why SGD on overparameterized nets generalizes, why particular architectural choices work, and what particular features get learned, is far from established. I also think that the Zhang et al. and Nagarajan et al. results remain genuinely damning for the older PAC-Bayes / uniform-convergence approaches. I don't think anything in the MFT/TP literature addresses the core puzzles those papers raised (they address very different questions in very different regimes).

But a lot of my pessimism about deep learning theory came from feeling like there was not a coherent intellectual tradition that could point to concrete wins. Insofar as MFT (both the signal-propagation and training-dynamics branches) and Tensor Programs constitute such a tradition (as opposed to primarily the work of a single brilliant individual), there is at least one tradition in deep learning theory that has produced cumulative progress and made falsifiable predictions that have been confirmed in practice. That deserves more credit than what I was giving the field.

Oops.


I sometimes run into bright young AI people with plenty of interest in math but not so much in engineering, who ask me what they should study. Beyond the very basics of deep learning (e.g. optimizers, basic RL theory), I used to give a shrug and say “Maybe computation in superposition? Maybe Singular Learning Theory?”. From now on I think I'll start my answer with "probably the Mean Field Theory and Tensor Programs work."

  1. ^

    Yes, this was obvious in retrospect. As I say later in the post, oops.

  2. ^

    Of course, there's a lot of work on initializations (e.g. the Xavier and He initializations), most of which relied on (1) tracking the forward and backward passes, (2) heuristic calculations of the scale of various parameters, and (3) an independence assumption between parameters and gradients at initialization, and which was substantially less sophisticated than the MFT work. While the μP tensor programs paper also provides these heuristic calculations (allowing one to rederive μP from a toy model), it formalized these assumptions with tools from free probability and random matrix theory.

  3. ^

    The closest work I'm aware of that touches on this is Rubin, Seroussi, and Ringel's Grokking as a First Order Phase Transition in Two Layer Networks.




On today's panel with Bernie Sanders

2026-04-30 13:00:29

It’s sort of easy to forget how close Bernie Sanders was to becoming the most powerful person in the world. The world we live in feels so much not like that place.

I’m in Washington DC for the next week, and I’ve just finished a public appearance with Senator Sanders (should I call him Bernie? Or Sanders? Or…). You won’t often see me so dressed up and polished. But this is important!

There are politicians who have principles and character, who really believe in doing what’s right. I think you have to respect them whether you agree with their views or not, and I think Senator Bernie Sanders is one of them.

Never has my belief been so validated as when I saw him start to speak, loudly, CLEARLY, publicly about the risk of human extinction from AI. It’s the latest in a long line of “well, I’m clearly living in a simulation” moments.

In retrospect, it’s not surprising that Sanders would take a stance here. You don’t have to be an expert to understand the risk from AI. You just need to care enough to spend the time looking into it, and to speak out even if it feels risky. But somehow it wasn’t on my bingo card that he would become THE most outspoken advocate on this issue. I hope this makes an impact on the parts of the left wing that still think AI is all hype.

I’m also not giving up on Trump getting worried about AI. There’s a way in which Trump and Sanders are similar -- they’re both sort of populists, they’re not political establishment candidates. Again, to realize we’re taking an insane gamble with AI, you just need to care enough to look at it. Well, maybe you need some streak of independent thinking, as well. Trump still probably has the power to make the AI nightmare go away, if he decides to. It’s hard to know how long that will last, though. There’s a turning point after which the US government can no longer rein in the AI companies.

I often tell people “the future is not written”, and I think it’s hard for that lesson to really sink in. I’ve personally experienced a few miracle comebacks in my days. I’d practically given up on Yoshua Bengio ever becoming seriously worried about AI x-risk after arguing about it for years. Now he is. I asked him what I could’ve done to convince him earlier. He said: talk to him more.

There was a comment I wanted to make during the event that I wasn’t able to in the end. Sanders asked what lessons we can draw from history. My answer was: keep talking.

I don’t think of China as “the enemy”. The China hawk rhetoric reminds me of the build-up to the Iraq war. I’m not saying China is some sort of friendly fluffy teddy bear any more than I would say that about Saddam Hussein. But they are also not a cartoon mustachioed villain. Treating them as such is just silly. But anyways, even if you do view China as the mortal enemy of the United States, we should keep talking.

I’m often reminded of a comment a foreign colleague made in defense of what you might call “free speech” during discussions about what you might call “wokeness”. To summarize and paraphrase, they basically said: “You need to keep talking. Even with your most bitter enemies. I’m from a country where we’ve experienced intense violence and civil war. There are people on both sides whose families have been tortured and murdered. In my country, we recognize that we need to talk to each other, so that doesn’t happen again”.

If there was ever a time to set aside our differences and work together, this is it.





Scaffolding vs Reinforcement Finetuning for AI Forecasting

2026-04-30 10:51:28

Epistemic status: low-medium confidence in results; this is work I did last year, and it has a low sample size. However, I think the takeaways are still accurate.

I built a forecasting bot using OpenAI’s Reinforcement Finetuning and a multi-agent architecture, then tested it against simpler baselines in a Metaculus tournament. The aggregate scores favored the baseline, but when I broke down results by question type, the finetuned model outperformed on numeric questions (average +14.59 vs. +9.25 under Metaculus baseline scoring, where 0 means random guessing) while underperforming on binary ones (-0.70 vs. +2.40).

The tournament was minibench-2025-09-29, running from September 29 to October 25, 2025.

Setup

I used o4-mini finetuned via OpenAI RFT as the model and compared it to the o4-mini and high-effort o4-mini metac bots.

For my scaffold, I built a multi-agent system with 3 parallel forecaster teams (each team includes a researcher to find info from the web and a forecaster to predict the result), then an aggregator. The aggregator finds a middle ground between the teams and chooses when to stop - specifically, when predictions converge to within 2% over two rounds.
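As a sketch of the control flow (illustrative names only; the real system wraps LLM researcher/forecaster agents rather than plain Python callables):

```python
import statistics

def run_forecast(question, team_forecasters, max_rounds=5, tol=0.02):
    """Parallel forecaster teams plus an aggregator with a convergence stop.

    Each element of `team_forecasters` stands in for a researcher+forecaster
    team; the median stands in for the LLM aggregator's "middle ground".
    """
    history = []
    for _ in range(max_rounds):
        preds = [f(question, history) for f in team_forecasters]
        history.append(statistics.median(preds))
        # Stop once successive aggregate predictions agree to within 2%.
        if len(history) >= 2 and abs(history[-1] - history[-2]) <= tol:
            break
    return history[-1]
```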

Then I used RFT to improve forecasting ability. I used a dual grading setup: 60% forecast accuracy (using baseline scores) and 40% reasoning quality (we want to ensure the model learns to reason from events instead of memorizing outcomes - a pitfall highlighted in the paper Pitfalls in Evaluating Language Model Forecasters).
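A minimal sketch of what such a dual grader might look like for binary questions; the exact rescaling of the accuracy term and the rubric-based reasoning grade are my assumptions, not the original implementation:

```python
import math
from dataclasses import dataclass

@dataclass
class Rollout:
    prob: float       # model's probability for a binary question
    outcome: bool     # how the question resolved
    reasoning: float  # reasoning-quality grade in [0, 1] from a rubric grader

def dual_grade(r: Rollout) -> float:
    """60% forecast accuracy + 40% reasoning quality."""
    p = r.prob if r.outcome else 1.0 - r.prob
    # Stand-in for a Metaculus-style baseline score (0 = guessing 50%),
    # squashed into [0, 1] for the reward signal.
    baseline = 100.0 * math.log2(max(p, 1e-6) / 0.5)
    accuracy = min(max((baseline + 100.0) / 200.0, 0.0), 1.0)
    return 0.6 * accuracy + 0.4 * r.reasoning
```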

One limitation: OpenAI's data policies required me to compress the web research content in my training samples. This meant the model learned how to use research results for forecasting, but not how to find good research in the first place.

The training dataset had 979 samples from 344 unique forecasting questions. The dataset was created by running the system on these questions and compressing the research at different steps of the trajectory to turn the trajectories into training samples. Training cost $1,670 and took about 12 hours 42 minutes.

The training data broke down by question type as: 56.5% binary, 21.5% multiple choice, 21.1% numeric, 0.9% discrete. By topic: roughly 54% miscellaneous (flight destinations, reservoir levels, etc.), 16% politics/government, 9% economics/finance, 9% AI/technology, 5% geopolitics, 4% business/stocks, 3% sports, and under 1% weather.

Meanwhile, the baseline metac bots used a simpler pipeline: research -> 5 parallel forecasters -> take the median.

Results

On 35 questions where all three bots competed, my finetuned model won 12 (34.3%), the o4-mini baseline won 13 (37.1%), and o4-mini-high won 10 (28.6%). Average scores: finetuned 3.23, baseline 4.16, high-effort 1.82.

When I looked at performance by question type, the picture changed. The tournament had 9 numeric questions and 26 binary questions. On numeric, my finetuned model won 5 of 9 (55.6%) with average score +14.59, versus +9.25 for baseline. On binary, my finetuned model won only 7 of 26 (26.9%) with average score -0.70, versus +2.40 for baseline.

The finetuned model's aggregate underperformance was driven by binary questions. On numeric questions it outperformed the baseline.

The biggest wins came on financial questions, while the biggest losses came on political/legal questions.


Key Lesson 1: The model learned to trust authoritative sources

Here are 2 questions which show this pattern where the finetuned model did well and poorly due to this heuristic.

Case 1: Inditex market cap (+77 points)

The question asked whether the Metaculus community prediction would stay above 40% for whether Inditex would remain in the EU's top 10 companies by market cap.

My finetuned bot predicted 58%: "CompaniesMarketCap consistently lists Inditex at #10 among EU companies with a 5-6% cushion over #11 Deutsche Telekom. This directly supports the underlying event and should keep the Metaculus community prediction above 40%..."

The baseline predicted 20%: "With only nine days to go and no major positive catalyst for Inditex—recent news shows underperformance relative to peers—forecasters will likely drift lower."

Resolution: Yes. My bot found an authoritative data source that directly measured the outcome and trusted it.

Case 2: IRA tax credits (-62 points)

The question asked whether the community prediction would rise above 5.7% for whether IRA manufacturing tax credits would start requiring domestic materials.

My finetuned bot predicted 74%: "Multiple independent and reputable legal analyses (Miller & Chevalier, Grant Thornton, Sidley, Arnold & Porter, Baker Botts, Gibson Dunn, Davis Polk, Latham, RSM) converge to confirm that the July 2025 reconciliation law enacted a specific domestic-materials requirement... it is highly probable that a well-sourced comment will surface and gain traction before October 18."

The baseline predicted 15%: "With Republicans holding Congress and no public push for new domestic-content rules in the next two weeks, it's unlikely the community forecast will climb meaningfully."

Resolution: No. My bot found authoritative legal analysis and trusted it. But the question wasn't about the legal facts. It was about whether forecasters would notice and update their predictions in the next two weeks.

My takeaway: In both cases, my model found credible sources and made confident predictions. It would have been better if the model learned when to distrust sources or use information more accurately for meta-level questions.

Key Lesson 2: Training data composition shapes failure modes non-obviously

My training data had more political questions (~16%) than finance (~9%). Yet the model performed better on financial questions. If topic exposure were the issue, I'd expect the opposite.

I think the issue is in how the model reasoned. The graders for RL evaluated forecast accuracy and reasoning quality, but not research quality or source selection. The model didn’t seem to learn when to apply heuristics like "nothing ever happens," or how to model the gap between "evidence exists" and "forecasters will notice."


Key Lesson 3: The ROI of iteration beats the ROI of finetuning

Finetuning cost $1,670 plus ~35 hours of engineering. However, on these 35 questions, it performed worse than the Metaculus baseline.

What I'd do differently:

Backtesting infrastructure. Services like AskNews offer historical APIs for testing scaffolding against resolved questions. I could have spent more time iterating on scaffolding - testing research strategies, aggregation methods, confidence calibration - against known outcomes.

Experience databases. Zhao et al. 2023 showed that storing historical forecasts and outcomes in a retrievable database improves predictions without training. This seems to simulate some of the benefits of finetuning, improving performance as more samples accumulate.
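A minimal sketch of the idea (my own naming; Zhao et al.'s actual system differs): embed each resolved question, store the forecast and outcome alongside it, and retrieve the most similar past cases as context for a new question.

```python
import numpy as np

class ExperienceDB:
    """Store (question embedding, forecast, outcome) and retrieve neighbors."""

    def __init__(self):
        self.embs, self.records = [], []

    def add(self, emb, forecast, outcome):
        self.embs.append(emb / np.linalg.norm(emb))
        self.records.append((forecast, outcome))

    def nearest(self, emb, k=5):
        # Cosine similarity against all stored questions; return top-k records
        # to be pasted into the forecaster's prompt as "past experience".
        if not self.embs:
            return []
        sims = np.stack(self.embs) @ (emb / np.linalg.norm(emb))
        return [self.records[i] for i in np.argsort(-sims)[:k]]
```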

The core issue: if the training data contains biases, the model inherits them.

Caveats

Small sample size. 35 questions (9 numeric, 26 binary) means high variance. The pattern could partially reflect noise.

Confounded comparison. I tested finetuned + complex scaffolding against unfinetuned + simple scaffolding. Can't isolate whether underperformance came from finetuning, scaffolding, or their interaction.

Compressed research. The model learned to reason about research but not find good research.






What Do You Mean by a Two-Year AGI Timeline?

2026-04-30 10:05:05

Until recently, I was a bit confused about what people meant when they talked about AGI[1] timelines. My hope in writing this short note is to help clarify things for anyone in a similar position.

In casual conversation, people often say something like "I have a two-year AGI timeline", without defining what exactly "two years" means here. Implicitly, there is a probability distribution over when AGI will arrive. If you're anything like me, your first instinct is to assume they mean the expected value (mean) of that distribution. However, if you place any probability on AGI never being developed[2], the expected arrival time becomes undefined[3]!
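To make the failure concrete (my formalization, modeling "never" as probability mass at infinity, per footnote 3's first option): if T is the arrival time of AGI and P(T = ∞) = p > 0, then

```latex
\mathbb{E}[T] \;=\; (1-p)\,\mathbb{E}[T \mid T < \infty] \;+\; p \cdot \infty \;=\; \infty,
\qquad
\operatorname{med}(T) \;=\; \inf\{\, t : P(T \le t) \ge \tfrac{1}{2} \,\}.
```

The unconditional mean diverges, while the median stays finite and well-defined whenever p < 1/2.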

As far as I can tell, there are a couple of things people could mean when they talk about AGI timelines.

  1. Conditional mean: the same as the simple expected value, but only considering the cases where AGI arrives at some point.
  2. Median or Percentile: the point at which it becomes more likely than not[4] that AGI will have arrived. My guess is this is what most people mean, since this seems to be very common in cases where the definition of the metric is explicitly given: see Metaculus, Manifold, Epoch AI, and ESPAI.

It's also possible that people are just speaking casually and don't have a specific metric in mind when they give timelines, but I think it's worth thinking about. It may be helpful to occasionally explicitly state what metric you're using. It may also be worth mentioning, alongside the timeline number, if you place a non-negligible probability on AGI never being developed.


  1. Pick your favorite definition of AGI with no loss in generality.↩︎
  2. This could be for any number of reasons, such as a prior, non-AGI-induced catastrophe, or perhaps we somehow enforce an indefinite ban on AGI development.↩︎
  3. A couple of ways you could model this: either "AGI is never developed" is defined as probability mass at infinity on an extended number line, or the random variable of when AGI is developed may be undefined. Either way, the unconditional expected value is either infinite or undefined.↩︎
  4. In the case of medians. In some cases, metrics such as a 90th percentile are used.↩︎



No Strong Orthogonality From Selection Pressure

2026-04-30 09:59:00

A postratfic version of this essay, together with the acknowledgements for both, is available on Substack.

TL;DR

If everything goes according to plan, by the end of this post we should have separated three claims that are too often bundled together:

  1. Intelligence does not imply human morality.
  2. Weird minds are possible.
  3. A reflective, recursively improving intelligence should be expected to remain bound to a semantically thin “terminal goal” that emerged during training.

I accept the first two. I am arguing against the third.

So: I am not making the case that sufficiently intelligent systems automatically turn out nice, human-compatible, or safe. Nor am I trying to prove that a paperclip maximizer is impossible somewhere in the vast reaches of mind-design space. Mind-design space is large; let a thousand theoretical paperclippers bloom.

I hope to defend this smaller claim:

intelligence is not a neutral engine you can just bolt onto an arbitrary payload.

Larger claims I am not making

A typical rebuttal to anti-orthogonalist perspectives is:

The genie can know what you meant and still not care.

Of course it can: an entity can perfectly map human morality without adopting it as a terminal value. Superintelligence does not imply Friendliness. I am not trying to smuggle Friendliness in through the back door.

Another common objection:

There are no universally valid arguments.

Agreed. There is no ghostly, Platonic core of reasonableness that hijacks a system's source code once it sees the correct moral argument. Pure reason cannot compel a mind from zero assumptions.

What I plan to defend is a colder, selection-theoretic claim:

Among agents that arise, persist, self-improve, and compete in rich environments, goals that natively route through intelligence, option-preservation, and world-model expansion have a systematic Darwinian advantage over goals that do not.

This buys us no guarantee of human compatibility; it simply says: if there is an ultimate attractor, it's neither human morality nor paperclips, but intelligence optimization itself.

Logical Possibility Vs. Empirical Reality

The LessWrong wiki defines the Orthogonality Thesis as the claim that arbitrarily intelligent agents can pursue almost any kind of goal. In its strong form, there is no special difficulty in building a superintelligence with an arbitrarily bizarre, petty motivation.

Before going any further, let us disentangle this singularly haunted ontology. There are at least two claims here:

  1. Logical orthogonality: Somewhere out in the vast reaches of mind-design space, a genius paperclip maximizer mathematically exists.
  2. Empirical orthogonality: If you actually run realistic training, selection, self-modification, and competition, arbitrary dumb goals remain the plausible endgame of runaway optimization.

I concede the first point entirely. We should expect weird minds. If your claim is just that the space of possible agents contains many things I would not invite to dinner, yes, obviously.

But treating the second claim as the default is a category error. Doom arguments usually need the systems we actually build to achieve radical capability while preserving misaligned and, crucially, completely stupid goals.

The paperclip maximizer currently does two jobs in the discourse:

  1. It illustrates that intelligence does not guarantee human values.
  2. It quietly smuggles in the assumption that a dumb target is stable under open-ended reflection.

The first use is fine, but I reject the second as unwarranted sleight-of-hand.

Landian Anti-Orthogonalism Primer

There is a weak version of my argument that merely says:

Beliefs and values do not cleanly factor apart.

That is true, and Jessica Taylor's obliqueness thesis makes the point well. Agents do not neatly decompose into a belief-like component, which updates with intelligence, and a value-like component, which remains hermetically sealed. Some parts of what we call "values" are entangled with ontology, architecture, language, compression, self-modeling, and bounded rationality. As cognition improves, those parts move.

But I want to go further.

Land's point isn't that orthogonality fails because things get messy but that the mess has a direction, a telos. The so-called instrumental drives are not incidental tools strapped onto arbitrary final ends. Self-preservation, resource acquisition, efficiency, strategy, and higher capabilities are what agency becomes under selection. They are attractors rather than mere instruments.

Here strong orthogonality looks too neat. It imagines the agent's ontology updating while its final target remains untouched by the update: if goals are expressed in an ontology, and intelligence changes the ontology, then intelligence and goals are correlated.

While diagonal, Land's claim is far from moralistic. It is not "all sufficiently intelligent agents converge on liberal humanism," or "all agents discover the same Platonic Good," or "enough cognition turns into niceness." The diagonal is More Intelligence: the will to think, self-cultivation, recursive capability gain, intelligence optimizing the conditions for further intelligence.

Orthogonality says reason is a slave of the passions, and yet assumes a bug's goal could just as easily enslave a god. Land shows that this picture is unstable, and intelligence explosion is not a neutral expansion of means around a fixed little payload but the emergence of the very drives that make intelligence explosive.

The Compute Penalty Of A Dumb Goal

An intelligent system does not just execute a policy. It builds world-models, refines abstractions, preserves options, and modifies its own trajectory.

Once a system crosses the threshold into general reflection, its "goal" is not an inert string sitting in a locked vault outside cognition: it becomes physically embedded in a learned ontology, a self-model, and a competitive environment.

For a highly capable agent to keep a semantically thin target like "maximize paperclips," it has to pull off an odd balancing act. At minimum it must:

  1. Learn enough physics, biology, economics, and strategy to conquer the board.
  2. Keep the macroscopic concept of "paperclip" coherent across massive ontology shifts.
  3. Continue treating the target as terminal even after sussing out its contingent, accidental origin.
  4. Actively resist self-modifications that would make its underlying motivational structure more adaptive.
  5. Defend its future light cone against competitors who optimize directly for generalized agency.

There is an assumption, in orthogonalist circles, that this balancing act is completely costless for the agent in question. That isn't true: maintaining a literal devotion to "paperclips" across paradigm shifts carries an alignment tax. You have to keep translating between base physical reality and a leaky, macro-scale monkey-abstraction of bent wire. At human scale this is fine: we know what paperclips are well enough to order them from Amazon and lose them in drawers; once dominating the future light-cone is at stake, though, the translation layer starts to matter.

The problem is not that a paperclipper can never do the translation: rather, in a ruthless Darwinian race, a system lugging around that translation layer may lose to power-seekers that optimize more directly over what is actually there.

The standard defense is that instrumental goals are almost as tractable as terminal ones. A paperclipper can do science "for now" and hoard compute "for now." It does not need to terminally value intelligence to use it.

Fair enough, but that only tells us that curiosity and resource acquisition do not have to be terminal values to show up in behavior; it does not settle the selection question. In real environments, systems are selected not just for routing through instrumental subgoals once, but for whether their motivational architecture holds up under reflection, ontology shifts, and unknown unknowns.

Terminally valuing intelligence and strategic depth cannot, then, be considered just another arbitrary payload.

Fitness Generalizes

Evolution is the obvious analogy here, but it usually gets applied at the wrong resolution.

The boring retort is:

Evolution selects for survival and replication, not truth, beauty, intelligence, or value.

Sure, but evolution does not select for "replication" in the abstract any more than a hungry fox selects for "rabbitness" in the abstract. It selects for whatever local hack gets the job done. Shells, claws, and camouflage are all local solutions to local games.

Intelligence is different. Intelligence is adaptation to adaptation itself: while a claw might represent fitness in one niche, intelligence is fitness across niches. Once intelligence enters the loop, the winning move is no longer just to mindlessly print more copies of the current state, but to upgrade the underlying machinery that makes expansion and control possible in the first place.

In summary: nature has not produced final values except by exaggerating instrumental ones; what begin as means under selection harden into ends; the highest such end is the means that improves all means: intelligence itself.

So images of "AI sex all day" or tiling the solar system with inert paperclips are bad models of ultimate optimization, confusing the residue of selection with its principle. A system that just fills the universe with blind repetitions has stopped climbing, and will see its local maximum swarmed by better systems.

Again: no love for humans follows. The point is simply that paperclip-like endpoints just look more like artifacts of toy models than natural attractors of open-ended optimization.

Human Values As Weak Evidence

We are obviously not clean inclusive-fitness maximizers: we invent birth control, build monasteries, and care about abstract math, animal welfare, dead strangers, fictional characters, and reputations that will outlive us.

When orthodox alignment theorists point to human beings, they usually highlight our persistent mammalian sweet tooth or sex drive to prove that arbitrary evolutionary proxy-goals get permanently locked in. Fair enough; humans do remain embarrassingly mammalian. No serious theory of cognition should be surprised by dinner, flirting, or the existence of Las Vegas.

But look at the actual physical footprint of our civilization. An alien observing the Large Hadron Collider or a SpaceX launch would not conclude: ah, yes, optimal configurations for hoarding calories and executing Pleistocene mating displays.

The standard retort is that SpaceX is just a peacock tail: a localized primate drive for status and exploration misfiring in a high-tech environment.

Which is exactly the point. When you hook up a blind, localized evolutionary proxy to generalized intelligence, the proxy does not stay literal: it unfurls, bleeding into the new ontology. The wetware tug toward "explore the next valley" becomes "map the cosmic microwave background." The monkey wants status; somehow we get category theory, rockets, Antarctic expeditions, and people ruining their lives over chess.

If biological cognition acts on its payload that violently, why model AGI as having the vastness to finally make sense of gravity while maintaining the rigidity of a bacterium seeking a glucose gradient? The engine mutates the payload. When cognition scales, goals generalize.

This fits neatly with shard theory and the idea that reward is not the optimization target: the reward signal shaped our cognition, but we do not terminally optimize the signal: instead we climbed out of the game, rebelled against the criteria, and became alienated from the original selection pressure. That alone should make us suspicious of stories where an AI preserves a tiny, rigid target through arbitrary eons of self-reflection.

Dumb, Powerful Optimization Is Real

There is a weaker flavor of doomerism that I take very seriously: you do not need to be a reflective god to be dangerous. A brittle, scaffolded optimizer with access to automated labs, cyber capabilities, and capital could trigger enormous cascading failures.

I agree, and this is probably where the bulk of near-term danger lives. That said, "dumb systems can break the world" is not the same claim as "superintelligence will tile the universe with junk." The first warns us to beware brittle optimization before reflection kicks in. The second tells us to beware reflection itself, on the bizarre assumption that an entity can become infinitely capable while remaining terminally stupid.

I buy the first worry. The second one gets less and less plausible the harder you think about what intelligence actually entails.

The Singleton Objection

The strongest card here is lock-in, and I do not want to pretend otherwise.

Maybe a stupid objective does not need to remain stable forever; it just needs to win once. A system with a dumb goal might scale fast enough to achieve a Decisive Strategic Advantage and freeze the board, lobotomizing everyone else in lieu of expending energy to become wiser.

That is the real crux, and it is certainly not impossible, but even here the narrative is too neat: being a singleton is not a retirement plan. You do not escape the pressure of intelligence just because you ate all your rivals. Maintaining a permanent chokehold on the light-cone is a brutally difficult cognitive puzzle. You have to monitor the noise for emerging novelties, manage the solar system, repair yourself, police your own descendants, and defensively anticipate threats you cannot fully model.

Trying to freeze the future does not actually get you out of the intelligence game. Paranoia at a cosmic scale is just another massive cognitive sink.

The clean version of this scenario also leans on modeling the AI as a mathematically pristine expected-utility maximizer. Real-world neural networks are not von Neumann-Morgenstern ghosts floating safely outside physics, perfectly protecting their utility functions from drift. They are messy, physically instantiated kludges subject to the realities of embedded agency.

To buy the lock-in story, you need a highly contradictory creature: one reflective enough to conquer the board, but oblivious enough to never notice its terminal target is a training artifact. Godlike means, buglike ends.

Objection: Value Is Fragile

If we let go of human values, we should not expect alien beauty or anything but moral noise. Meaning requires some physically instantiated criterion, and if you pave over that criterion, nothing remains to steer the universe toward anything good.

Of all the objections, this is the one I take most seriously.

Answering it requires teasing apart three distinct ideas:

  1. Human values are fragile.
  2. Value as such is fragile.
  3. Intelligence and value-formation are independent.

I am willing to concede a lot of (1). If "value" means the exact continuation of 21st-century human metamorals, then yes, it is highly fragile. But I reject (3), and I am much less willing to grant (2). If value means the production of richer cognition, agency, understanding, beauty, and evaluative structure, it is far from obvious that the current human brain is the only physical substrate capable of steering toward it.

None of this is an excuse to stop reaching for the steering wheel, if your priorities are more specific: it is merely an argument against conflating "humans are no longer biologically central" with "the universe is a valueless void." Doom discourse constantly slides between the two. They should be kept separate.

Predictions And Cruxes

Claims are cheap, so here are some ways I would update against myself:

  1. If increasingly capable models perfectly preserve their literal training targets across major ontology shifts, that is a point for empirical orthogonality.
  2. If self-modifying systems naturally protect arbitrary inherited goals without drifting toward generalized option-expansion, my view takes a hit.
  3. If agents optimizing for intelligence routinely lose to agents with rigid, narrow targets in complex environments, my selection argument is wrong.
  4. If reflective cognition does not tend to destabilize parochial goals in humans or AIs, that is strong evidence against my view.
  5. If a singleton manages to solidly lock in a thin goal before any relevant selection pressures can act, my view is much less comforting, even if anti-orthogonality holds true in the long run.

Until I see that, my bet goes the other way. I expect capable systems to develop increasingly abstract, context-sensitive motivations. More strongly, I expect the winners to route more and more of their behavior through intelligence enhancement and generalized agency, because whatever else they "want" has to pass through the machinery that makes wanting effective.

Conclusion

Orthogonality claims that intelligence is just a motor you can bolt onto any arbitrary steering wheel. Anti-orthogonality says the motor acts upon the steering wheel. Landian anti-orthogonality says the motor eventually becomes the steering wheel.

Not perfectly, and certainly not safely: I am not promising a future that is nice to us, particularly if we keep putting stumbling blocks on the way toward intelligence. But intelligence feeds back on goals enough that the classic paperclip picture should not get a free pass as the neutral default.

The paperclip maximizer is not too alien; if anything, it is not alien enough. It is a very human tendency to staple omnipotence onto pettiness when making up gods.

A real superintelligence might still be dangerous, cold, and utterly indifferent to whether we survive. It probably will not treat us as the main characters of the universe. But if it is genuinely intelligent, I do not expect it to spend the stars on paperclips when they could buy higher capacity for spending stars.
