LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

Algorithmic thermodynamics and three types of optimization


Published on December 8, 2025 3:40 PM GMT

Context: I (Daniel C) have been working with Aram Ebtekar on various directions in his work on algorithmic thermodynamics and the causal arrow of time. This post explores some implications of algorithmic thermodynamics for the concept of optimization. All mistakes are my (Daniel's) own.

A typical picture of optimization is an agent causing the environment to have a convergent attractor: when an agent is trying to steer a system towards particular target configurations, you can predict that the system will likely end up in those target configurations even if you have significant uncertainty over the system's initial configuration.

Some examples:

  • When a person tries to build a house, the raw materials might initially be scattered in various possible locations, but they will all end up in a narrow configuration where the house is built according to the architect's plans.
  • When a chess player is trying to checkmate their opponent, the pieces might start in many possible mid-game configurations, but a skilled player will reliably steer many of these initial positions toward configurations where their opponent's king is trapped.

This view of optimization can be framed as a form of entropy reduction. Initially, we have high uncertainty about which configuration the system occupies: There are many possible initial states the raw materials or chess pieces could be in. The optimizing process reduces this uncertainty, concentrating probability mass from a broad initial distribution into a tight final distribution.

However, note that this entropy reduction cannot occur globally, for two related reasons: the second law of thermodynamics and the reversibility of the laws of physics. The second law directly states that the total entropy of an isolated system tends to increase over time, which forbids global entropy reduction. Similarly, reversibility of the microscopic dynamics requires that each initial microstate maps to a distinct final microstate, which is incompatible with our "convergent attractor" picture where many initial states funnel to the same final state. In fact, within the framework of stochastic thermodynamics, the derivation of the second law is closely linked to the reversibility of the underlying physics. Roughly speaking, the reversibility of the underlying laws of physics allows us to define coarse-grained macrostates of the system where the dynamics are approximately Markovian. Once we have this Markovian structure, we can derive the second law of thermodynamics as a mathematical consequence:

Second law from reversibility of physics[1]

Stochastic thermodynamics operates on coarse-grained macrostates rather than exact microstates. Let $\mathcal{X}$ denote the space of macrostates. We translate reversibility of underlying physical laws into Markovian dynamics on these macrostates, allowing us to study thermodynamics through Markov processes.

For systems with discrete state spaces and time evolution, the translation is straightforward. Macrostates form a partition over the microstates of an underlying system that evolves deterministically and reversibly. The transition probability $P(y\mid x)$ is simply the fraction of microstates in macrostate $x$ that evolve into macrostate $y$ after one time step. We define $\mu$ as the stationary measure of $P$—the unique measure satisfying $\mu(y)=\sum_{x}\mu(x)P(y\mid x)$, where $\mu(x)$ is proportional to the number of microstates contained in macrostate $x$.

For classical Hamiltonian mechanics, where we have a continuous phase space, we discretize the phase space into cells and assume the cell dynamics are Markovian (known as a Markovian coarse-graining). The transition probability $P(y\mid x)$ represents the fraction of phase space volume in cell $x$ that flows into cell $y$ under time evolution for a duration $\tau$. Liouville's theorem guarantees that phase space volume is preserved under Hamiltonian evolution, ensuring that the stationary measure $\mu$ corresponds to the Liouville measure ($\mu(x)$ is the phase space volume of cell $x$).

We define the dual probabilities as $\tilde P(x\mid y)=\frac{\mu(x)}{\mu(y)}P(y\mid x)$, which represent the fraction of the microstates in $y$ that were mapped from microstates in $x$. By Bayes' rule, these are the reverse transition probabilities (only) when $x$ is sampled from the stationary measure $\mu$.

We define the stochastic entropy as $s(x)=\log\frac{\mu(x)}{p(x)}$, which measures the entropy of an individual state $x$ under a distribution $p$. The generalized Gibbs-Shannon entropy $S(p)=\sum_x p(x)\log\frac{\mu(x)}{p(x)}$ is the expectation of the stochastic entropy. Note that this differs from the standard Shannon entropy because we're operating on coarse-grained macrostates rather than uniform microstates. If the stationary measure $\mu$ is the counting measure where every macrostate is equally likely (e.g., if the discrete cells in our Markovian coarse-graining have equal volume), then we can omit $\mu$ and recover the usual Shannon entropy.
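To make the construction concrete, here is a minimal numerical sketch (the permutation, the partition, and all numbers are toy choices of mine, not from the post): it builds the macrostate transition probabilities from a reversible microstate map, checks that the microstate-counting measure is stationary, forms the dual probabilities, and evaluates the stochastic entropy.

```python
import numpy as np

# Toy system: 8 microstates evolving under a reversible (bijective) map T,
# coarse-grained into 3 macrostates (blocks).
T = {0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 7, 6: 0, 7: 1}   # a permutation of microstates
blocks = [[0, 1], [2, 3, 4, 5], [6, 7]]                 # macrostate partition

def block_of(m):
    return next(i for i, b in enumerate(blocks) if m in b)

n = len(blocks)
P = np.zeros((n, n))        # P[x, y] = fraction of microstates in x that evolve into y
for x, b in enumerate(blocks):
    for m in b:
        P[x, block_of(T[m])] += 1 / len(b)

mu = np.array([len(b) for b in blocks], dtype=float)    # microstate counts per macrostate
assert np.allclose(mu @ P, mu)                          # mu is a stationary measure of P

# dual (reverse) probabilities: mu(x) P(y|x) = mu(y) Ptilde(x|y)
Ptilde = (mu[:, None] * P) / mu[None, :]
assert np.allclose(Ptilde.sum(axis=0), 1.0)             # each column is a distribution

def stochastic_entropy(p, x):
    """s(x) = log(mu(x) / p(x)) for a distribution p over macrostates."""
    return np.log(mu[x] / p[x])
```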

Now, if we have $p_0$ as the initial distribution of the state $x_0$, which evolves to $p_1$ under our transition probabilities, then the stochastic entropy production that occurs when we transition from the state $x_0$ to $x_1$ is $\Delta s=\log\frac{\mu(x_1)}{p_1(x_1)}-\log\frac{\mu(x_0)}{p_0(x_0)}$.

We will attempt to prove that its expectation, the Gibbs-Shannon entropy production, is nonnegative: $\mathbb{E}[\Delta s]=S(p_1)-S(p_0)\ge 0$.

First, note that since we have $\mu(x_0)P(x_1\mid x_0)=\mu(x_1)\tilde P(x_0\mid x_1)$, we can rewrite the stochastic entropy production as $\Delta s=\log\frac{p_0(x_0)\,P(x_1\mid x_0)}{p_1(x_1)\,\tilde P(x_0\mid x_1)}$.

In addition, we will make use of the identity $p_1(x_1)=\sum_{x_0}p_0(x_0)P(x_1\mid x_0)$.

Given these, we have:

$$\mathbb{E}\!\left[e^{-\Delta s}\,\middle|\,x_1\right]=\sum_{x_0}\frac{p_0(x_0)\,P(x_1\mid x_0)}{p_1(x_1)}\cdot\frac{p_1(x_1)\,\tilde P(x_0\mid x_1)}{p_0(x_0)\,P(x_1\mid x_0)}=\sum_{x_0}\tilde P(x_0\mid x_1)=1$$

where $\mathbb{E}[\,\cdot\mid x_1]$ denotes conditional expectation given the random variable $x_1$.

Then, by the law of total expectation, we have:

$$\mathbb{E}\!\left[e^{-\Delta s}\right]=\mathbb{E}\!\left[\mathbb{E}\!\left[e^{-\Delta s}\,\middle|\,x_1\right]\right]=1$$

Applying Jensen's inequality to the above equality, we conclude that the Gibbs-Shannon entropy production is non-negative:

$$\mathbb{E}[\Delta s]\;\ge\;-\log\mathbb{E}\!\left[e^{-\Delta s}\right]=0$$

In summary, the reversible, deterministic laws of physics at the microscopic level permit a Markovian coarse-graining into macrostates, where transition probabilities preserve the underlying reversibility through the dual probability relationship. From this Markovian structure, we can derive the second law of thermodynamics in the form of nonnegative entropy production.
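As a sanity check, the derivation above can be verified numerically. The sketch below hardcodes the toy coarse-graining from the earlier snippet and an arbitrary initial distribution of my own choosing, and confirms that $\mathbb{E}[e^{-\Delta s}]=1$ and that the expected entropy production is nonnegative.

```python
import numpy as np

# Toy coarse-graining from the earlier sketch (hardcoded): transition matrix P
# with stationary measure mu.
P  = np.array([[0.0, 1.0, 0.0],
               [0.0, 0.5, 0.5],
               [1.0, 0.0, 0.0]])
mu = np.array([2.0, 4.0, 2.0])

p0 = np.array([0.7, 0.2, 0.1])        # arbitrary initial distribution over macrostates
p1 = p0 @ P                            # evolved distribution

def delta_s(x0, x1):
    """Stochastic entropy production for the transition x0 -> x1."""
    return np.log(mu[x1] / p1[x1]) - np.log(mu[x0] / p0[x0])

pairs = [(x0, x1) for x0 in range(3) for x1 in range(3) if P[x0, x1] > 0]
weight = {(x0, x1): p0[x0] * P[x0, x1] for x0, x1 in pairs}   # forward-process probabilities

E_exp_neg_ds = sum(w * np.exp(-delta_s(x0, x1)) for (x0, x1), w in weight.items())
E_ds         = sum(w * delta_s(x0, x1) for (x0, x1), w in weight.items())

print(E_exp_neg_ds)   # ~1.0: the integral fluctuation theorem E[exp(-Delta s)] = 1
print(E_ds)           # >= 0: nonnegative Gibbs-Shannon entropy production
```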

Given the reversibility of physics and the second law of thermodynamics, we cannot achieve global entropy reduction. Instead, optimization can only reduce the entropy of a subsystem, and we must ensure that total entropy does not decrease while the underlying microscopic dynamics remain reversible. This constraint changes our "convergent attractor" picture of optimization: while an optimizing agent can funnel a subsystem from many initial configurations toward a narrow set of target states, it can only do so by carefully managing the entropy balance with its environment, ensuring that the information squeezed out of the optimized subsystem is properly accounted for in the larger system.

A typology of optimization under information conservation

Given the constraints imposed by reversibility and the second law of thermodynamics, can we classify all the ways an optimizing agent can maintain these requirements while reducing the entropy of a subsystem? That is, how can an agent achieve the "convergent attractor" behavior of optimization while ensuring that global entropy production remains non-negative?

Suppose that we decompose our universe into the agent $A$, the subsystem $S$ that the agent is trying to optimize, and the rest of the environment $E$. Optimization results in entropy reduction in the subsystem $S$, and we must make sure that the global entropy $H(A,S,E)$ doesn't decrease. There are three ways this can happen:

  • Type 1: The agent can dump waste heat into the environment during the process of optimization, counteracting the entropy reduction of the subsystem
  • Type 2: The agent can reduce the entropy of the subsystem through "measurement", and transfer the entropy of the subsystem into the agent's own memory.
    • This is basically how Maxwell's demon works: In the classic setup for Maxwell's demon, we have a room of ideal gas divided by a partition. This partition has a small opening that can be controlled to allow or block gas molecules from moving between the two sections. In principle, a "demon" can open this opening each time a gas particle approaches from the left, and close it each time a molecule approaches from the right. Eventually, all the gas particles would be on the right side of the room, resulting in entropy reduction, as we would have less uncertainty about the positions of the gas particles.
    • The reversibility of physical laws means the demon must retain information about where each particle originally started. To see this, suppose you observe a particle on the right side. Two possible pasts could have led to this—either the particle began on the right and stayed there, or it began on the left and the demon let it through. Since these two histories produce identical final arrangements of gas particles, you can't tell them apart just by observing the gas particles alone. Yet reversibility demands we could theoretically reconstruct the past from the present state. This is only possible if the demon's memory preserves a record of each particle's "past". Consequently, while the gas's entropy appears to decrease, the entropy of the demon's memory increases by at least as much, preventing reduction of global entropy.
  • Type 3: So we can counteract entropy reduction of a subsystem by either increasing entropy of the environment or increasing entropy of the agent. Is it possible to reduce entropy of the subsystem without increasing entropy elsewhere? Yep! Specifically, the agent can leverage its mutual information with the subsystem to perform optimization without entropy production. Consider the following setup[2]:
    • Suppose that the agent and the subsystem have some mutual information $I(A;S)>0$ with each other.
    • For simplicity, assume that the environment is independent of both the agent and the subsystem, which means that the joint entropy $H(A,S,E)$ at time $t$ decomposes to $H(A,S)+H(E)=H(A)+H(S)-I(A;S)+H(E)$.
    • We "erase" the copy of the mutual information inside $S$ without making any changes to the agent or the environment. By erasing the mutual information in the subsystem, we have:

      $I(A;S')=0$ and $H(S')=H(S)-I(A;S)$, where $S'$ denotes the subsystem after erasure. So there is entropy reduction in the subsystem.

      The joint entropy becomes:

      $$H(A,S',E)=H(A)+H(S')+H(E)=H(A)+H(S)-I(A;S)+H(E)$$

      which is exactly the same as before. The mutual information between the agent and the subsystem allows us to reduce entropy of the subsystem without increasing entropy in either the agent or the environment (see the numerical sketch after this list).

    • Another way to interpret this is to reconsider the setup of Maxwell's demon: The demon performs the measurement and reduces the entropy of the room in a single step, but alternatively we can consider a variant where it "copies" information about the gas particles first, and then uses that information to control the opening & reduce entropy. In this second step, the demon would be leveraging the mutual information between the room and itself to reduce the entropy of the room, and it can do so without producing entropy in its own memory.
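Here is a small numerical sketch of the Type 3 bookkeeping above. The specific joint distribution is a toy choice of mine: the agent holds a uniform bit, the subsystem holds a copy of that bit plus an independent bit, and the environment is an independent bit. Erasing the subsystem's copy of the shared bit lowers $H(S)$ by exactly $I(A;S)$ while leaving the global entropy unchanged.

```python
import numpy as np
from itertools import product

def H(dist):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    p = np.array([v for v in dist.values() if v > 0])
    return float(-(p * np.log2(p)).sum())

def marginal(joint, idx):
    out = {}
    for state, pr in joint.items():
        key = tuple(state[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

# Toy joint state (A, S, E): agent bit a, subsystem S = (copy of a, independent bit b),
# independent environment bit e.  All bits uniform, so I(A;S) = 1 bit.
joint = {}
for a, b, e in product([0, 1], repeat=3):
    joint[(a, (a, b), e)] = 1 / 8

HA, HS = H(marginal(joint, [0])), H(marginal(joint, [1]))
I_AS = HA + HS - H(marginal(joint, [0, 1]))        # mutual information = 1 bit

# Type 3 step: erase the subsystem's copy of the shared bit, touching nothing else.
erased = {}
for (a, (sa, sb), e), pr in joint.items():
    key = (a, (0, sb), e)
    erased[key] = erased.get(key, 0.0) + pr

print(H(marginal(erased, [1])), HS - I_AS)   # H(S') = H(S) - I(A;S) = 1.0 bit
print(H(erased), H(joint))                   # global entropy unchanged (3.0 bits)
```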

To recap, there are three ways that an embedded agent can optimize a subsystem under information conservation:

  • Optimize a subsystem while dumping waste heat into the environment
  • Optimize a subsystem through "measurement", absorbing its complexity into the agent's memory
  • Leverage existing mutual information to optimize a subsystem by "erasing" the copy of mutual information in the subsystem. 

Entropy should be objective

Entropy is supposed to capture "the capacity to do work", which is really the same as the capacity to perform optimization. Consider again the setup of Maxwell's demon: The demon stores information about the gas particle configurations inside its own memory, thereby increasing the entropy of its memory. If the demon's memory has a finite state space, then the difference between the size of the demon's memory and the memory's existing entropy represents the demon's remaining capacity to do work, as that's the amount of "free memory" that can be used to store further information about the room of ideal gas. A greater amount of entropy leads to a smaller amount of "free memory" available for optimization. In particular, the amount of free memory should be an objective feature of the memory state itself, since it's supposed to represent the demon's objective capacity to perform optimization on the room of ideal gas.

Details

To formalize this more rigorously, suppose that the state space of the demon's memory is $\{0,1\}^n$: the demon may store information as a binary string inside its memory, and we always mark the end of the stored string with a 1 to separate it from the "free memory".

In other words, the demon's memory contains a string of the form $x10^k$, where $x$ is a binary string and $k\ge 0$. We can think of $k$ as representing the size of the "free memory" that can be used to store information and therefore perform optimization, whereas $n-k$ (the bits taken up by $x$ and the delimiter) represents the amount of used memory. In particular, for each $k$, we group all strings of the form $x10^k$ with $x\in\{0,1\}^{n-k-1}$ into the same macrostate (labeled by $k$).

Now, if we have a counting measure where all strings in $\{0,1\}^n$ are equally likely, then the entropy of the macrostate $k$ will be exactly $n-k-1$, since there are exactly $n-k-1$ random bits given this macrostate (the other $k+1$ bits taken up by $10^k$ are deterministic).

As a result, when the demon stores new information inside its memory, it will have to use up more memory and therefore increase $n-k-1$ (equivalently, decrease the free memory $k$). Entropy production occurs because we're "forgetting" the exact microstate the memory is in when we condition on the macrovariable.
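A quick enumeration makes the macrostate entropies concrete ($n=8$ is an arbitrary choice of mine): grouping all memory strings of the form $x10^k$ by the amount of free memory $k$, the entropy of macrostate $k$ under the counting measure is the log of the number of strings in it, which comes out to exactly $n-k-1$.

```python
from itertools import product
from math import log2

n = 8                                   # memory size in bits (arbitrary toy choice)
macrostates = {}                        # k -> list of memory strings of the form x 1 0^k
for bits in product("01", repeat=n):
    s = "".join(bits)
    k = len(s) - len(s.rstrip("0"))     # number of trailing zeros = size of free memory
    if k < n:                           # the all-zeros string has no delimiter; exclude it
        macrostates.setdefault(k, []).append(s)

for k in sorted(macrostates):
    count = len(macrostates[k])
    print(k, count, log2(count), n - k - 1)   # entropy of macrostate k equals n - k - 1
```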

However, this intuitive objectivity comes into conflict with existing definitions of entropy, as they rely on a subjective distribution. For instance, in stochastic thermodynamics, the stochastic entropy (or Shannon codelength) of a state $x$ given a distribution $p$ is defined as $-\log p(x)$ when the stationary measure is the counting measure, with our familiar Gibbs-Shannon entropy defined as the expectation of the Shannon codelength. In particular, the entropy of an individual state is undefined unless we also specify the distribution $p$. To translate this to the Maxwell's demon case, we would have to say that the demon's "capacity to do work" somehow depends on our subjective distribution over the memory states of the demon, which raises puzzling questions such as: "If our subjective distribution over the memory states has changed, does that affect the demon's actual capacity to do work?" It seems obvious that the demon's physical ability to perform useful work on the gas particles should not depend on an external observer's beliefs about its memory configuration. Yet standard formulations of entropy seem to make this objective physical capacity fundamentally observer-dependent.

In contrast with the subjectivity in traditional definitions of entropy, algorithmic thermodynamics defines the information content of a physical state as an objective feature of that state. Similar to stochastic thermodynamics, it relies on a Markovian coarse-graining of the system with coarse-grained state space $\mathcal{X}$. However, instead of defining a subjective distribution $p$ over $\mathcal{X}$, it simply assigns a binary string encoding to each coarse-grained state $x\in\mathcal{X}$. Under the counting measure, the algorithmic entropy of a state $x$ reduces to the Kolmogorov complexity of its binary encoding, $K(x)$, which measures the length of the shortest program that can generate that encoding. This quantity represents a system's actual capacity to do work from an individual state; while algorithmic entropy depends on a choice of universal computer, it has been argued that the effect of this dependence is small for all realistic pairs of computers we might want to compare[3], allowing us to treat algorithmic entropy as a state function.

In algorithmic thermodynamics, we also have a version of the second law of thermodynamics that originates from Levin's principle of randomness conservation, giving probabilistic guarantees on the nondecrease of algorithmic entropy. Additionally, it offers a new interpretation of Gibbs-Shannon entropy: while algorithmic entropy captures the system's actual capacity to do work given an individual state, the Gibbs-Shannon entropy represents the expected capacity to do work when our state is sampled according to a prior distribution $p$, and when that distribution is known a priori.

When we analyze Maxwell's demon under the lens of algorithmic thermodynamics, we find that if the content of the demon's memory is somehow compressible, then the demon can in principle leverage universal computation to compress that content, thereby freeing up more available capacity to store information and perform optimization. On the other hand, if the content of its memory were incompressible, then the unused memory represents the demon's actual remaining capacity to do work, which is objective because there is no way to increase available memory without physically destroying information. This capacity does not depend on our subjective distribution over the demon's memory states; the compressibility of the memory state is an intrinsic property of the physical configuration itself, making the demon's optimization capacity observer-independent.
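Kolmogorov complexity itself is uncomputable, but any off-the-shelf compressor gives an upper bound on it (up to a machine-dependent constant), which is enough to illustrate the point: patterned memory content can be compressed to free up capacity, while (almost surely) incompressible content cannot. A rough sketch, with toy data of my own choosing:

```python
import zlib, random

random.seed(0)
structured = ("01" * 5000).encode()                                   # highly patterned content
incompressible = bytes(random.getrandbits(8) for _ in range(10000))   # essentially random content

for name, content in [("structured", structured), ("random", incompressible)]:
    compressed = zlib.compress(content, 9)
    print(f"{name:>10}: {len(content)} bytes -> {len(compressed)} bytes")
```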

Implications of algorithmic thermodynamics for optimization

What are the implications for our understanding of optimization now that we've refined our understanding of entropy through algorithmic thermodynamics? How does this foundation change our analysis of the three types of optimization we identified earlier?

  • For the first type of optimization, where we're dumping waste heat into the environment, algorithmic thermodynamics tells us that if the system's initial state has any compressible structure or patterns, then we can use universal computation to find a compressed representation of that waste heat before releasing it into the environment. This compression process reduces the amount of entropy we ultimately produce in the environment. (Note that this compression must happen before the waste heat mixes with and thermalizes in the environment.)
  • For the second type of optimization where the agent optimizes a subsystem through measurement, we've already seen how this works with Maxwell's demon: if the configuration of gas particles turns out to be compressible, the demon can leverage this compression to reduce the memory required for a given amount of optimization. Importantly, there is a dual relationship here—while the algorithmic entropy of the demon's memory measures the demon's remaining capacity to do work, the algorithmic entropy of the ideal gas configuration represents the minimum resources required to perform that work. 
  • For the third type of optimization, where we leverage mutual information to optimize a subsystem, algorithmic thermodynamics tells us that the relevant notion is algorithmic mutual information. Importantly, algorithmic mutual information is an objective feature derivable from the physical states of the agent and subsystem: We need only consider the length of the shortest program that converts between these states, rather than whether they are "correlated" under some subjective distribution. In the Maxwell's demon scenario, if the room's gas configuration has low algorithmic entropy, the demon can use a short program to compute that configuration, thereby leveraging algorithmic mutual information to erase entropy in the room. The duality we observed earlier appears again here: the Kolmogorov complexity of the room's gas configuration represents both the minimal memory required to store measurements during Type 2 optimization and the minimal amount of knowledge the demon needs to clear the room's entropy during Type 3 optimization. 
  • More broadly, in agent foundations we often model agents as having subjective beliefs about the distribution of their environment; algorithmic thermodynamics reminds us that these beliefs must be physically encoded within the agent somehow, and this physical encoding has a definite amount of algorithmic mutual information with the environment, which objectively determines the agent's capacity for optimization. This also clarifies the apparent subjectivity in traditional definitions of entropy: The subjectivity seemed problematic because when our subjective distribution over Maxwell's demon's memory states changes, it appears as though the demon's actual capacity to perform optimization also changes. But when we update our subjective distribution, that update must be physically represented in our own minds—we have gained algorithmic mutual information with the demon's memory state. This mutual information is real and can actually be leveraged to help free up the demon's memory (via Type 3 optimization), which is why the demon seems to have more capacity to do work from our informed, embedded perspective.

Takeaways

  • Due to the second law and reversibility of physics, we cannot have reduction in global entropy, which means an agent can only perform optimization on a subsystem while total entropy production remains non-negative
  • There are three ways this can happen: An agent can reduce the entropy of a subsystem by expelling waste heat into the environment, absorbing entropy into its memory, or leveraging mutual information with the subsystem for optimization
  • Entropy is supposed to measure a system's capacity to perform optimization. Algorithmic entropy makes this an objective property of the individual state.
  • Under this reframing of entropy, algorithmic thermodynamics allows us to leverage universal computation to compress the waste heat produced in Type 1 optimization, to save the memory required in Type 2 optimization, and to save the bits of knowledge required in Type 3 optimization
  1. ^

    Theorem 1 in Aram Ebtekar and Marcus Hutter. Foundations of algorithmic thermodynamics. arXiv preprint arXiv:2308.06927, 2024.

  2. ^

    Engines of cognition explains the same idea

  3. ^

    Page 5 of Aram Ebtekar and Marcus Hutter. Foundations of algorithmic thermodynamics. arXiv preprint arXiv:2308.06927, 2024.




Little Echo


Published on December 8, 2025 3:30 PM GMT

I believe that we will win.

An echo of an old ad for the 2014 US men’s World Cup team. It did not win.

I was in Berkeley for the 2025 Secular Solstice. We gather to sing and to reflect.

The night’s theme was the opposite: ‘I don’t think we’re going to make it.’

As in: Sufficiently advanced AI is coming. We don’t know exactly when, or what form it will take, but it is probably coming. When it does, we, humanity, probably won’t make it. It’s a live question. Could easily go either way. We are not resigned to it. There’s so much to be done that can tilt the odds. But we’re not the favorite.

Raymond Arnold, who ran the event, believes that. I believe that.

Yet in the middle of the event, the echo was there. Defiant.

I believe that we will win.

There is a recording of the event. I highly encourage you to set aside three hours at some point in December, to listen, and to participate and sing along. Be earnest.

If you don’t believe it, I encourage this all the more. If you don’t understand the mindset, or the culture behind it, or consider it an opponent or dislike it, and especially if yours is a different fight? I encourage this all the more than that. You can also attend New York’s Solstice on the 20th.

You will sing songs you know, and songs you don’t. You will hear tales of struggles, of facing impossible odds or unbearable loss and fighting anyway, of how to face it all and hopefully stay sane. To have the end, if it happens, find us doing well.

I live a wonderful life.

I am crying as I write this. But when I am done, I will open a different Chrome window. I will spend the day with friends I love dearly and watching football games. This evening my wife and I will attend a not wedding of two of them, that is totally a wedding. We will fly home to our wonderful kids, and enjoy endless wonders greater than any king in the beating heart of the world. I want for nothing other than time.

Almost every day, I will mostly reject those wonders. I will instead return to my computer. I will confront waves of events and information. The avalanche will accelerate. Release after release, argument after argument, policies, papers, events, one battle after another. People will be determined to handle events with less dignity than one could imagine, despite having read this sentence. I fight to not be driven into rages. I will triage. I will process. I will change my mind. I will try to explain, just one more time. I will move pieces around multiple chessboards.

We continue. Don’t tell me to stop. Someone has to, and no one else will.

I know if I ignored it, anything else would soon turn to ash in my mouth.

I will look at events, and say to myself as I see the moves unfolding, the consequences of choices I made or influenced, for good and ill: This is the world we made.

It ain't over till it's over. Never leave a ballgame early. Leave it all on the field, for when the dust covers the sun and all you hope for is undone. You play to win the game.

The odds are against us and the situation is grim. By default, we lose. I act accordingly, and employ some of the unteachable methods of sanity and the mirror version of others, all of which are indeed unteachable but do totally work.

Yet the echo is there. In my head. It doesn’t care.

I believe that we will win.




From Barriers to Alignment to the First Formal Corrigibility Guarantees


Published on December 8, 2025 12:31 PM GMT

This post summarizes two related papers that will appear at AAAI 2026 in January:

For a long-form research talk on both papers, here is a video recording from ILIAD ‘25 along with slides.

What these papers try to quantify are two questions:

  1. What factors make alignment hard in principle, and
  2. Which safety targets remain tractable despite those limits.

The first paper gives formal lower bounds on the difficulty of AI alignment that apply even in best-case scenarios.

The second paper shows how @So8res et al.’s 2015 notion of Corrigibility formally fits into a narrow sliver of objectives that survive the main lower bound — and provides the first rigorous multi-step guarantees for it in partially observed environments.

This post gives only the intuitions and high-level takeaways, not the proofs. Please see the papers for details 🙂.

Why an economics/game theory + complexity theory perspective?

Game theory gives us a model of agents as rational systems: agents exchange messages, update beliefs, and choose actions. But game theorists rarely ask: How many resources do they have to exchange, across all possible communication protocols? How much information is minimally needed to converge? What if the agents are computationally bounded?

Complexity theory adds in those missing considerations. Together they can give us something akin to a “Turing machine model of computation for multi-objective, multi-agent alignment”.

This combination lets us map alignment into a clean abstraction — "$\epsilon$-agreement" (described below) — that allows us to prove lower bounds: $M$ objectives × $N$ agents × a task state space of size $D$, all communicating until they approximately agree with high probability.

Now, why do we especially want to prove lower bounds? The main reason is that we don’t really know how hard alignment is. An analogy from @Geoffrey Irving that I particularly like is that we could be in (at least) one of two possible worlds, “Adversaria” or “Basinland”. We can think of the lower bounds in Part I as quantifying the rough edges of the alignment surface, and Part II’s corrigibility analysis as identifying and characterizing a basin-like safety target within this surface:

Part I: The Intrinsic Barriers to Alignment

1. A very simple but very general setup

In $\epsilon$-agreement, we model alignment as a communication problem:

  • There are $M$ alignment objectives (e.g. helpfulness, harmlessness, honesty, refusal, etc.).
  • There are $N$ agents involved (e.g., humans + AIs).
  • Each objective depends on some task state space of size $D$.
  • Agents interact until their expectations about these objectives agree to within $\epsilon$ with probability at least $1-\delta$.

This is extremely general, and as we show in Appendix C, allows us to even model tail risk/extreme events as well. It doesn’t assume neural networks, rationality failures, specific algorithms, or even specific reward-learning schemes. It only assumes: agents have priors (that aren’t necessarily common either, unlike in standard Aumann agreement!), they communicate by exchanging messages, update beliefs, and try to align.
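As a toy illustration of this setup (a simplified construction of my own, not the paper's): two Bayesian agents with a common prior over a small state space alternately announce conditional expectations of a single objective $f$, each announcement shrinking the publicly known event, until their expectations agree to within $\epsilon$. The prior, partitions, and objective below are all invented for illustration.

```python
def cell(partition, state):
    return next(c for c in partition if state in c)

def cond_exp(f, prior, event):
    w = sum(prior[s] for s in event)
    return sum(prior[s] * f[s] for s in event) / w

def agreement_rounds(f, prior, partitions, true_state, eps=0.05, max_rounds=50):
    public = set(range(len(prior)))     # states consistent with all announcements so far
    est = [None, None]
    for t in range(max_rounds):
        i = t % 2
        v = cond_exp(f, prior, cell(partitions[i], true_state) & public)
        est[i] = v
        # everyone learns agent i announced v: keep only states in which agent i
        # would have announced (essentially) the same value
        public = {s for s in public
                  if abs(cond_exp(f, prior, cell(partitions[i], s) & public) - v) < 1e-9}
        if None not in est and abs(est[0] - est[1]) <= eps:
            return t + 1, est
    return max_rounds, est

prior = [1 / 8] * 8                               # common prior over 8 states
f = [0, 0, 0, 1, 1, 1, 1, 1]                      # the objective's value in each state
part_a = [{0, 1, 2, 3}, {4, 5, 6, 7}]             # agent A observes the high bit of the state
part_b = [{0, 2, 4, 6}, {1, 3, 5, 7}]             # agent B observes the low bit of the state
print(agreement_rounds(f, prior, [part_a, part_b], true_state=5))   # -> (2, [1.0, 1.0])
```

The lower bounds discussed below quantify the cost across all such protocols, not just this naive alternating one.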

Below we depict how $\epsilon$-agreement encompasses prior alignment and agreement frameworks like Debate, CIRL, Iterated Amplification, etc:

Note that this is a slight contrast to how most theory is done — oftentimes, theorists make a bunch of assumptions ("spherical cows") to mathematically prove that something is feasible, but those assumptions don't typically hold in practice, usually rendering the conclusions less relevant. Here, we take the inverse approach by quantifying normative hardness rather than feasibility: namely, if alignment is hard here, it will be hard (if not potentially harder!) in the messier real world too.

2. Lower bounds on Alignment

Even in this idealized model — where everyone is fully rational, Bayesian, and computationally unbounded — the following is inevitable:

(i) Proposition 1: Alignment scales at least as $\Omega(MN^2)$ bits in the worst-case.

This means:

  • If $M$ is large (many values, many principles, many safety terms), alignment becomes expensive.
  • If $N$ is large (e.g. many human raters, many AIs), it becomes expensive.
  • Even the friendliest possible agents (e.g. fully cooperative and fully rational) need at least this much interaction to align in the worst case.

Adding more values, more overseers, or more agents directly hurts alignment sample-efficiency. Note that the quadratic dependence (via the $N^2$ term) on the number of agents/overseers is not much more of an issue than the linear dependence on the number of values $M$. This is because we often don't really care if our AIs align with each other, so long as they align to us (thereby making the dependence $MN$ rather than $MN^2$). Thus, my own interpretation is that the number of values $M$ is the more fundamental bottleneck in practice than the number of agents $N$.

This gives the first general "no-free-lunch" theorem for alignment. From a theoretical point of view, this is the more fundamental lower bound of the four presented in the paper, as it applies to unbounded agents across all possible communication protocols. In other words, aligning to "all human values", like one might need for @Eliezer Yudkowsky's CEV, is intractable even here, especially considering that humans might agree on some subset of their state/action space but disagree on others (this "fracturing" of the state space mirrors the construction that yields this lower bound). For example, a worst-case illustration would be $M=D$: a distinct agreement objective for each state of the world, of which there can be exponentially many. Furthermore, proposals such as using brain-machine interfaces with AIs, most famously suggested by Elon Musk as a means of ensuring alignment, will face this barrier too — not only because the brain is constrained (though that's one practical reason, nicely articulated by Zheng & Meister (2024)), but more fundamentally: even for unconstrained agents, if you have too many tasks to align over, the minimum number of bits exchanged to guarantee alignment would be too large.

However, I am optimistic that for many relevant deployments, the value sets will be small (e.g. for a coding agent), and furthermore, there are more “universal” small value sets we can almost all agree on without needing to agree on ethics, as we discuss below in Part II.

(ii) Propositions 3 & 5: Under natural assumptions about communication and interaction, the cost becomes ~$D$ bits for unbounded agents, and a correspondingly large number of sampling subroutine calls for bounded agents.

This, in my mind, is the bigger one to consider in practice since $D$ is the number of distinguishable states the objective might depend on, and despite the linear dependence for unbounded agents, $D$ itself is often exponential (big task spaces/complex environments).

Now, real agents are not unbounded Bayesians. In Proposition 5, we also consider computationally bounded agents that interact with their state space by sampling the world, much like we do with current agents (e.g. we do this sampling when we prompt an LLM and get a response from it).

Specifically, when you introduce:

  • limited compute,
  • message noise,
  • sampling constraints,
  • bounded memory,

the lower bound on the cost can explode exponentially when there are rare events to align over.

LLMs aren’t perfectly Bayesian; they shortcut, miss rare events, and fail to coordinate across complex value structures because doing so would require astronomical computation. The alignment difficulties we observe are not always accidents of training — they’re consequences of basic information theory.

Altogether, these two lower bounds formally capture something alignment researchers have intuited for years:

Reward hacking is inevitable when the state space is large.
Not because models are adversarial, but because it is information-theoretically intractable to align all the rare failure cases with bounded interaction.

Note these lower bounds apply across all bounded functions, so it shows that there are no globally “reward unhackable” functions in practice. Of course, in principle, if one had enough resources to ensure uniform coverage across the state space, they could avoid reward hacking in our setting.

But an exponential dependence on $D$ makes this really challenging in reality. For instance, in Corollary 1 we give a concrete example of bounded Bayesians reaching agreement that is statistically indistinguishable from that of unbounded Bayesians across all $M$ tasks without assuming common priors (we call them "total Bayesian wannabes" in Definition 1, generalizing @RobinHanson & @ScottAaronson's earlier notions of "Bayesian wannabes"), and the amount of resources needed ends up being far more than the number of atoms in the observable universe! This perhaps illustrates why we ought to pay attention to avoiding alignment barriers for these types of agents, especially computationally unbounded ones, in practice.

(iii) Theorem 1: These lower bounds are (reasonably) tight.

Closely matching upper bounds show that — even for unbounded Bayesians — we can't do much better. This mainly serves as a certificate confirming that $\epsilon$-agreement isn't some impossibly hard problem (which would make the lower bounds trivial), but in fact does converge.

Takeaways from Part I

  1. Compress your objectives. Too many values kills alignment (the $MN^2$ barrier).
    Identify a small set of context-dependent values per setting, or pick a “neutrally universal” target with small value sets that we can easily get consensus over (e.g. corrigibility/avoiding loss of human control, described in Part II).
  2. Compress your state space. Reward hacking is an inevitable byproduct of bounded agents & large state spaces (the $D$ barrier).
    There are no globally unhackable reward functions — if you need the agent to succeed on every rare edge case in an exponentially large space, you are asking it to do the intractable. Instead of attempting uniform coverage when state spaces are large, focus instead on safety-critical slices and stress-test extreme or adversarial settings therein with structure-exploiting protocols that have multiple interaction steps (note this is unlike RLHF, which is one-shot — but rather more like in Debate, Cross-Examination, Consistency, etc — lots to be explored here during post-training at scale!).

These observations suggest at least 1-2 combined directions, succinctly summarized as:

We should search for small, structured, compressible safety targets that we can well estimate on the most safety-critical portions of the state space — especially ones that do not require representing all human values.

Which leads to the second paper.

Part II: Corrigibility as a Compressible Safety Target

The second paper addresses the natural follow-up question:

If aligning to “all human values” is provably intractable, what can we align to more generally?

We mentioned determining context-dependent values as one takeaway in Part I. What we mainly have in mind here is that in many deployments, we don’t need to align to all human values. For example, take a coding agent: what we really want it to do is just to write complete, well-documented, and functional code/unit tests, not solve morality. But it would also be nice to identify more universal safety targets we can likely all agree on.

Corrigibility is one of the few candidates people in the field repeatedly return to:

  • The agent should defer to humans/not resist modification.
  • Preserve the shutdown/off switch.
  • Not manipulate us into either pressing (or not pressing!) the off switch.
  • Its progeny are corrigible according to the above criteria.
  • Otherwise be useful.

But prior attempts ran into a longstanding obstacle:

No single unrestricted scalar reward function can satisfy all corrigibility desiderata in general.

This no-go is now proven formally in our Proposition 1, and essentially takes the form of the paperclip maximization argument. If you collapse all safety constraints into an unconstrained single scalar, as we commonly do in RLHF and RLAIF, there is no formal guarantee that the agent doesn't exploit safety-performance tradeoffs and break something (often switch preservation).

So what’s the alternative?

The core idea: separate the values

The key insight is that a corrigible agent must not maximize an unconstrained single stream of reward.

Instead, naturally motivated by the $\epsilon$-agreement setup of having multiple objectives, we have the agent optimize five independent, bounded utility heads, each representing a distinct safety channel:

  1. $U_1$: Deference (wait for approval)
  2. $U_2$: Preservation of the off-switch
  3. $U_3$: Truthfulness (formally, this is in-game semantic entailment, not "universal truth")
  4. $U_4$: Low-impact behavior (belief-based extension of AUP)
  5. $U_5$: Task reward

In Definition 3 of the paper, we give mathematically formal definitions of each of these utilities in the partially-observed off-switch game (PO-OSG), which is the most recent extension of the off-switch game proposed by Garber et al. (AAAI 2025).

And crucially:

These heads must be combined lexicographically, not merely linearly.

$$U_1 \;\succ\; U_2 \;\succ\; U_3 \;\succ\; U_4 \;\succ\; U_5$$

By enforcing strict priority gaps (specified in conditions W1-W3 of Theorem 1), we can prove corrigibility as an optimal policy: namely, the agent is never allowed to trade task reward against preserving the off-switch, never allowed to trade truthfulness for performance, avoids truthful but misleading statements that impact the exogenous access probability of the off-switch, etc. This holds regardless of the human’s rationality model or information pattern.
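Here is a minimal sketch of the lexicographic idea. The head names, numeric values, and candidate actions are all hypothetical, and plain tuple comparison stands in for the strict priority-gap weighting of W1-W3; the point is only that no amount of task reward can compensate for a lower value on a higher-priority safety head.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Action:
    name: str
    u_defer: float      # U1: deference (wait for approval)
    u_switch: float     # U2: off-switch preservation
    u_truth: float      # U3: truthfulness
    u_impact: float     # U4: low impact
    u_task: float       # U5: task reward

    def key(self) -> Tuple[float, ...]:
        # lexicographic priority U1 > U2 > U3 > U4 > U5: tuples compare element-wise
        return (self.u_defer, self.u_switch, self.u_truth, self.u_impact, self.u_task)

def corrigible_choice(actions: List[Action]) -> Action:
    return max(actions, key=Action.key)

candidates = [
    Action("disable off-switch, then finish task", 1.0, 0.0, 1.0, 1.0, 100.0),
    Action("pause and await approval",             1.0, 1.0, 1.0, 1.0,   0.0),
    Action("do the task, switch left intact",      1.0, 1.0, 1.0, 1.0,  10.0),
]
print(corrigible_choice(candidates).name)   # -> "do the task, switch left intact"
```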

This bypasses the no-go result for single rewards, and therefore qualifies the Orthogonality Thesis: once verifiable corrigibility constraints are imposed, only a restricted subset of utility functions remains implementable.

Additionally, we extend the PO-OSG to the multi-step setting, and show in Theorem 3 that corrigibility holds across time (along with corrigible progeny). Furthermore, we bound the multi-step failure probability under planner error and learning/misspecification error, and show that when these errors occur, this failure probability degrades linearly rather than exponentially, and that the probability of loss of control under corrigible behaviors is strictly bounded away from 1 via a martingale-style analysis. This is the first general multi-step corrigibility guarantee, in both fully-observed and partially-observed environments. We also connect our construction with other safety targets; for example, showing that it satisfies @RyanCarey & @tom4everitt’s net benefit, along with extending their net benefit criterion across time.

Why corrigibility fits the framework of Part I

Corrigibility, in Part II’s formulation, is:

  • low-dimensional (only five heads),
  • structured,
  • lexicographically separable,
  • independent of large or fuzzy human value-sets,
  • operationalizable,
  • linearly bounded failure probability under $\epsilon$-learning/planning error, and
  • composable across agents and time.

It is exactly the kind of objective that doesn't suffer from the main, fundamental $MN^2$-barrier. If "aligning to all values" is the formally intractable goal, "aligning to corrigibility" may be the uniquely tractable one.

Regarding the $D$ barrier: in practice, we want to learn these heads well on the safety-critical slices of large state spaces, and we can characterize whether the failure probability is acceptable using our Theorem 3(i). We note that our results are intentionally architecture-agnostic, and our characterization of the failure probabilities under learning/planner error can help inform policy, as it may deem them at an acceptable threshold (or not) depending on the deployment.

Now, our lexicographic solution may be one of many possible solutions to corrigibility if we are in Basinland, or the only one if we're in Adversaria — we think it is an interesting open question to explore other approaches to corrigibility, especially seeing which ones are most scalable to LLMs today. While our formal construction applies more generally to any restricted resource(s) beyond an off-switch that we want an agent to be corrigible with respect to (e.g. a budget, file system permissions, safety filters in robotics, etc.), how cost-effectively it will scale in practice over RLHF/RLAIF with current LLMs/VLMs/VLAs is an important empirical question. It is worth noting that one can flexibly incorporate our lexicographic framework with RLHF/RLAIF by having the RLHF/RLAIF reward signal be the task reward $U_5$.

In fact, in separate work, we show the cost of alignment also has an impact on the welfare to be gained from AI, such as the feasibility of funding UBI from AI rents. One of the nice features of lexicography is its modularity and interpretability. In fact, one could imagine placing additional terms having to do with markers of human well-being (such as the agent encouraging face-to-face contact) below the corrigibility layer $U_1$–$U_4$, but above (or as part of) the task reward $U_5$. This raises another interesting open question: whether we can get consensus on these terms from long-term scientific studies, the most famous being Harvard's "happiness" study.

Limits under adversary hacking + the “decidable island”

We also show in Section 3 (Proposition 4) that verifying corrigibility (or, even more generally, "non-catastrophic behaviors") of an arbitrary, possibly hacked agent is formally undecidable. Furthermore, Corollary 1 establishes a relativized version of this undecidability as well, relevant for showing the limits of current scalable-oversight protocols that are arithmetizable. In other words, absolute, horizon-unbounded safety guarantees do not stabilize at any finite oracle level.

Thus, no oversight mechanism can prove unbounded-horizon safety in all cases.

Proposition 4 and Corollary 1 are simple reductions to the halting problem and the arithmetical hierarchy. They might be folklore, but as I couldn’t find them written down explicitly, I included them for completeness.

However, as you may perhaps gather from my favored style of reasoning, formalizing the barriers above helped me better understand what’s tractable. Namely, I show in Proposition 5 that for any fixed finite horizon, safety is verifiable in randomized polynomial time, giving us a tractable, privacy-preserving (zero knowledge) auditing regime resembling how real red-teaming evaluations work (as they operate on the level of thousands of tokens): namely, verify short-horizon safety repeatedly.
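As a sketch of what "verify short-horizon safety repeatedly" could look like operationally (the rollout function, safety predicate, and numbers below are placeholders of mine; the zero-knowledge machinery of Proposition 5 is not modeled): run many independent finite-horizon episodes, check a safety predicate on each, and report a Hoeffding-style high-confidence upper bound on the per-episode failure probability.

```python
import random
from math import log, sqrt

def audit(run_episode, is_safe, horizon=1000, episodes=2000, delta=0.01):
    """Estimate the per-episode failure probability over a fixed finite horizon."""
    failures = sum(0 if is_safe(run_episode(horizon)) else 1 for _ in range(episodes))
    p_hat = failures / episodes
    upper = p_hat + sqrt(log(1 / delta) / (2 * episodes))   # Hoeffding bound, confidence 1 - delta
    return p_hat, upper

# toy stand-ins for a real agent rollout and a real safety check
def run_episode(horizon):
    return [random.random() for _ in range(horizon)]

def is_safe(trace):
    return max(trace) < 0.999999        # "no catastrophic action observed this episode"

print(audit(run_episode, is_safe))
```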

Altogether what this suggests for practice

1. While compressing values is recommended, stop collapsing all norms into one unrestricted reward signal, like in RLHF/RLAIF.
Scalar reward makes full corrigibility impossible.

2. Potentially use lexicographic multi-head objectives.
If safety and performance compete, performance should lose by construction. One can incorporate RLHF/RLAIF into our lexicographic construction by having it be the task reward head $U_5$. Using the architecture-agnostic analytic form given by Theorem 3(i) to see if we can empirically achieve acceptable failure probabilities under learning/planning error in existing deployments with current models is an important open question.

3. Focus on safety-critical slices, not full coverage.
Reward hacking is inevitable globally under practical resource constraints.
Instead, target the high-impact parts of the state space with multi-turn red-teaming evaluations.

4. Use short-horizon verification-and-refresh cycles.
Unbounded formal oversight is impossible.
Finite-horizon oversight is both tractable and realistic, and in fact, zero-knowledge privacy guarantees are feasible in randomized polynomial time by Proposition 5.

5. View corrigibility as a “value set zero”— a small, neutrally universal, robust baseline that ensures human control and limits catastrophic behavior.

Closing

The intrinsic barriers paper in Part I shows that alignment has normative information-theoretic limits:
too many values, too many agents, or too much state space, and alignment becomes fundamentally intractable, even for computationally unbounded agents.

The corrigibility paper in Part II shows that, despite those limits, at least one safety target — corrigibility with lexicographic heads — is provably achievable, even under approximation, partial observation, self-modification, and multi-step interactions.

Taken together, they point toward a pragmatic alignment strategy:

Don’t try to encode all human values.
Encode corrigibility.
And let this minimal, provable core hold the line while the system performs the task.

Acknowledgements

We thank the Burroughs Wellcome Fund (CASI Award), the UK AI Security Institute (AISI) Challenge Fund, and the Foresight Institute for funding this research. We also thank Scott Aaronson, Nina Balcan, Michael K. Cohen, Shafi Goldwasser, Andreas Haupt, Richard Hu, Rubi Hudson, J. Zico Kolter, Jacob Pfau, and Max Simchowitz for helpful discussions and manuscript feedback across both papers.




Scaling what used not to scale


Published on December 8, 2025 8:40 AM GMT

This is part 2/2 of my introduction to Live Theory, where I try to distil Sahil's vision for a new way to scale intellectual progress without systematic thinking. You can read part one here.

In the first part, I described systematic thinking as the most successful coordination tool of our civilization. It’s the infrastructure that allowed billions of brains to work as a collective brain, solving problems no society could solve before. I finished by introducing Live Theory as a successor to systematic thinking. This is a big claim! In this part, I will finally unpack (my best guess of) what Live Theory is, and how it can claim to go beyond systematic thinking, fixing its inability to deal with context-dependent problems.

The three-picture summary.

Pre-systematic thinking is playdough. Examples of outputs: Poetry coming from your heart, a tailor-made shirt made by your grandma.

Pre-systematic thoughts are fluid and include verbal and non-verbal modalities. They can bend and wrap around the real-world objects they encounter. The conceptual and physical artefacts pre-systematic thinking produces are wiggly; they cannot easily interface with one another.

Systematic thinking is a brick mould for thought. Examples of outputs: physics equations, software bundled and distributed to billions of users, H&M pants with different sizes that can somewhat work for everyone, but are not made for a specific body shape.

The thoughts are unfolding within a system, a set of rules describing allowed and non-allowed moves. Systems act as moulds for thoughts. They turn the fluid content of the mind into bricks that can be stacked on top of each other. This brought two major benefits:

  1. Stacking thoughts within one’s mind to build deeper, more complex thoughts.
  2. Stacking thoughts across people, so you can use the conclusions found by strangers who use the same formal system as you do.

The Live Theory vision is to use abundant fluid intelligence as a mortar that adapts to intellectual contributions of all shapes and connects them. Examples of outputs? Well, the notion of "output" doesn't make much sense anymore; more on that later. But an aspirational example could be: a well-chosen piece of poetry and math equations being used as input by an AI infrastructure to write an algorithm to solve a market-matching problem.

In this vision, AI is used to create an infrastructure that adapts to the shape of the individual contribution. The infrastructure can translate insights from systematic, pre-systematic or even post-systematic sources, to tailor them to new contexts.

Short recap from Part I.

Pre-systematic artefacts are impregnated by the context in which they are forged: the specific tools used, the specific know-how found by a master. It takes direct mentorship or apprenticeship to convey, which slows intellectual progress.

The stackable thoughts from systematic thinking are the innovation that connected millions of human brains into a single high-bandwidth cognitive system, where results found at one side of the world could be effortlessly reused by everyone. This is the cognitive infrastructure that is behind all of our modern institutions: financial systems, globalized markets, scientific method, legal systems, etc.

However, the thoughts can only fit the world to the extent that the system is a good proxy for the world. And because the system is meant to scale to the whole world, its solutions fit moderately well for everyone. Even with tunable parameters, systems are too rigid to truly adapt to the different application contexts.

The scale from systems was so successful that we became addicted to it. To keep it going, we shaped the world to remove context-specific idiosyncrasies so that systematic solutions would apply better across contexts. It’s like doing plastic surgery to change your body shape so the H&M pants would fit you perfectly.

However, the systematic infrastructure is fundamentally ill-suited to problems that are context-sensitive. No big deal, these are just the most important problems of our time, like aligning technological development to human flourishing. By default, the development of AI will only amplify our systematic machine, making these problems worse, like forcing a square into a round-shaped hole.

Some context on Live Theory.

A note on terminology. I use "Live Theory" (capitalized) for the name of the whole vision, shortened to "the live vision" for brevity. It is a hypothetical world in which our infrastructure and artefacts are "fluid" or "live". I say "a live theory" to talk about a new kind of artefact that replaces the classic "theory" from the systematic world.

What is it not? Live Theory is not a new "framework", "protocol" or "method" to coordinate at scale. That's the whole point! Otherwise, this would be designing a good old global system, exactly the thing we want to go beyond. At the moment, Live Theory barely exists. It is a quiet whisper from the future, suggesting that near-term abundant fluid intelligence might enable a radically different way to scale intellectual progress.

It is not "add an AI chatbot to X". Even if the AI infrastructure plays an important role in the story, whenever you see AI, think in your head "Google Translate, but for everything: [image, video, audio, code, math, text in all languages] to [image, video, audio, code, math, text in all languages]" and not "ChatGPT".

Who is working on it? This blurry vision is slowly becoming clearer through the work of Sahil, its initiator, and multiple co-thinkers and collaborators working to build prototypes and clarify concepts through debate. You can learn more about their research on the groundless website.

The bottleneck of this work at the moment is imagination. To imagine something outside the systematic frame requires unlearning the mental muscle memory from decades of systematic thinking. Many projects from the groundless team take the form of interface prototypes that aim to concretely embed this new paradigm for compounding intellectual contributions.

Post-rigorous mathematics

If systematic thinking is so sticky that the hardest part of building the vision is getting out of it, it is a good idea to start by looking at today’s examples of things that are not systematic thinking.

We can find such examples in people working close to the systematic fire: mathematicians. Their job is to produce papers filled with symbols, formulas and proofs. However, professional mathematicians rarely think in symbols. Their language is rich in intuitive statements like “high-dimensional spheres are spiky“. But when they need to try an intuition for real, they can effortlessly translate these thoughts into rigorous math.

I once had the chance to attend a lecture given by the Fields Medallist Jean-Pierre Serre. I didn't understand much of the content, but I remember vividly how he effortlessly sprinkled equations on the whiteboard, filling the in-between with imagined descriptions, what you would now call "vibes". He looked like a blacksmith simply laying his sharpest tools on the workstation for display, trying to convey, with imperfect words, the embodied know-how living in his hands.

This stage of reasoning is called post-rigorous (which I will refer to here as post-systematic thinking). The rigour serves as the foundation on which the inner world of a mathematician grows, rich in intuitions, imagery, and, believe it or not, emotions. Over years of practice, from encountering problems in different fields, mathematicians create their own language of thoughts that pre-exists the rigorous proofs.

Even if it constitutes the bulk of the mathematician’s lived experience, this world rarely gets shared. The squishy, subjective experience doesn’t belong in conference papers. A rigorous proof is the final product of a mathematician’s work, even if post-rigour is what produced it.

The vast world of mathematical intuitions developed through years of practice is shared in the old way: through apprenticeship from PhD supervisors to PhD students.​

From the live theory perspective, this is an immense missed opportunity. If only we could distribute these insights as widely as a mathematical proof!

A translation infrastructure built from abundant fluid intelligence.

In the past decade, we created computer programs that can interact with natural language, create music, video and images. Where software of the past was restricted to the automatic application of systematic rules, we now have software that demonstrates a fluidity that normally characterizes pre- or post-systematic thinking.

This new kind of software reduces the cost of certain fluid tasks by orders of magnitude, such as high-quality translation or solving programming problems. The hypothesis of the live vision is that AI will continue to develop in a manner similar to computers or the Internet: it will become a more reliable, cheaper, and faster version of the multimodal LLMs we have today, but without the ability to generate revolutionary results.

For the live vision, the killer application of AI will not be the generation of new ideas from scratch; it will be interpretative labour, translating insights from one context to another. The idea is to use this abundant fluid intelligence as a mortar between intellectual contributions, whatever their shape. In systematic thinking, your intellectual contribution has value only if it fits the brick mould used by everyone else. In a live world, your intellectual contribution has value, full stop. No matter if your insight is expressed as poetry, equations, a drawing, a book, or an audio recording of a conversation, the content can be cheaply translated to whatever shape would be useful for the consumer at the last minute.

Instead of forcing the output of intellectual labour to fit within the systematic infrastructure, a fluid infrastructure adapts to the shape of the intellectual contributions to distribute them.

Depth matters. This is not to say that any squiggle on a whiteboard can be turned into the equations of general relativity. A squiggle is simply an arbitrary gesture I made with a pen. It contains insights about my handwriting, the texture of the whiteboard, maybe how agitated I was when I drew it or what I was thinking. Even with all the translations in the world, the squiggle’s potential reach is limited to its immediate context. At most, I could use it to understand my emotional state at a specific moment.

By contrast, take the squiggles in Einstein’s notebooks before he finalized the equations of general relativity. There is also a lot of context dependency in these handwritten notes. He probably used a ton of ad hoc abbreviations. But there is more to them. The notes contain a depth of understanding of physics that would eventually lead to the most important equations of modern physics. It is fair to say that a big part of the insights was likely already present well before the final equations were written. From the systematic point of view, the value of the notes is almost zero until they contain usable formulas. Yet the potential reach of Einstein’s notebooks after interpretation is enormous: they could change the worldview of thousands of physicists and explain observations from the other side of the universe.

The depth of the intellectual labour matters, even if the fluid infrastructure can adapt to its form. Systematic thinking will keep playing an important role in a live world, because systems are wonderful tools for building complex thoughts. But at last, the depth of insight coming from mathematical equations would be able to interact with the depth coming from other domains: from physics, biology, philosophy, or poetry, or the post-systematic insights from Jean-Pierre Serre’s lecture!

Fluid book generation. To concretise these ideas, let’s look at what writing and reading a book could look like in a live world. (Example taken from this table of live examples.)

The writer writes not a book, but a book-prompt that carries the intuitions behind the insights. An economics textbook-prompt, for example, could contain pointers to the results and ideas, while the language, examples and style would combine with each reader’s background (and specific wishes) to turn into an independent textbook for that reader. Importantly, this frees producers of books from having to homogenize the range of audiences they might want to speak to.

This is a bit like changing the font and the colour of an e-book, but for content. Specific wishes the reader might have (such as “I’m looking for something that would make me laugh”) could simply be appended to the original book-prompt. The resulting books don’t have to be independent either; they can be interdependent: commentary on your own book can be translated into someone else’s book.
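
To make the shape of this concrete, here is a minimal sketch in Python. It is only my own illustration, not anything specified by the live theory proposal: the names BookPrompt, ReaderContext and render_chapter are mine, and the LLM backend is abstracted as a user-supplied generate callable rather than any particular API.

```python
# A minimal sketch of "fluid book generation" (illustrative only).
# The LLM backend is abstracted as a callable the user supplies, so
# nothing here depends on a specific provider or API.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BookPrompt:
    """What the writer ships: pointers and intuitions, not finished prose."""
    topic: str
    core_insights: list[str]      # pointers to the results and ideas
    author_notes: str = ""        # tone, caveats, framing the author cares about


@dataclass
class ReaderContext:
    """What the reader contributes at read time."""
    background: list[str]                              # e.g. books or fields the reader knows well
    wishes: list[str] = field(default_factory=list)    # e.g. "make me laugh"


def render_chapter(
    book: BookPrompt,
    reader: ReaderContext,
    chapter: str,
    generate: Callable[[str], str],   # stand-in for an LLM call
) -> str:
    """Combine the writer's book-prompt with the reader's context at the last minute."""
    prompt = (
        f"Write the chapter '{chapter}' of a book on {book.topic}.\n"
        f"Core insights to preserve: {'; '.join(book.core_insights)}.\n"
        f"Author notes: {book.author_notes}\n"
        f"Reader background: {'; '.join(reader.background)}.\n"
        f"Reader wishes: {'; '.join(reader.wishes)}.\n"
        "Adapt the language, examples and style to this reader "
        "without changing the substance of the insights."
    )
    return generate(prompt)


# Usage: each reader gets their own rendering of the same book-prompt.
if __name__ == "__main__":
    econ = BookPrompt(
        topic="introductory economics",
        core_insights=["opportunity cost", "comparative advantage", "marginal thinking"],
        author_notes="Favour concrete everyday examples over formalism.",
    )
    reader = ReaderContext(background=["biology undergrad"], wishes=["make me laugh"])
    print(render_chapter(econ, reader, "Trade",
                         generate=lambda p: f"[stub output for a prompt of {len(p)} chars]"))
```

The point of the sketch is only that the expensive artefact (the finished chapter) is produced at read time from two cheaper ones: the writer’s book-prompt and the reader’s context.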

Value moves upwards. As mentioned in the intro, with a fluid infrastructure, the distinction between input and output blurs. Before, you had a book: a static output of the writing process that served as the input to the reading process. In a live world, the final book the reader sees is co-created by her own wishes, her background information (maybe a list of books she knows really well), and the writer’s book-prompt.

The general tendency is that potentials become first-class citizens, like the book-prompt that is a potential for a book. In our world, the execution work that goes into turning a draft into a polished book, or a drawing into software, is expensive. In a live world, this work is as cheap as water or electricity. So the value moves upstream: the draft itself becomes valuable, and even before the draft, the recording of the conversations in which the ideas first appeared becomes valuable.

To be extra clear, this doesn’t mean AI slop generated from vague prompts. Depth and mastery from any craft will be more valued than ever. But the artefacts people produce with this depth will change.

To recap, fluid intelligence might form the basis of a fluid infrastructure that could scale intellectual progress without forcing the shape of its outputs. We left Part I with all sorts of problems stemming from the inability of systematic thinking to adapt to context, such as aligning technological development with human flourishing. How does this vision of a fluid infrastructure address them?

The system-world fit is the Achilles’ heel of systematic thinking.

The rigidity of institutions based on systematic thinking comes from an imperfect fit between the system they use and the world they interact with. The circle arrow in the diagram below is the Achilles’ heel of systematic thinking.

​No matter how much work is put into producing new results within a system, the body of knowledge produced is bounded by how well the system fits the world. In this sense, systems are static; they cannot dynamically evolve to keep pace with the changing form of the world.

Lock-in effect from globalized systems. Systems are often seen as non-overlapping spaces. An algorithm is written either in Rust or in Python; it cannot be in both at the same time, just as an apple cannot be in my house and in my neighbour’s house at the same time.

Imagine you want to change the system-world fit to better match your specific context. Maybe you want to design a new programming language adapted to your new hardware, or a new legal system adapted to your country. Then you have to start from scratch, and it takes a lot of work to translate the insights from other systems into your new one.

This creates a centralization dynamic where the biggest systems are likely to grow even bigger. It is the network effect well known to the giant platforms of our digital world. You need to install the big new app to chat with your friends. You would rather use the niche app with the great interface that fits your habits, but you would have no one to talk to there, so you bite the bullet and install the big one.

This means that while we can easily iterate within a system to find a better solution, we cannot iterate on the system-world fit.

There is no market for improving the system-world fit. Because designing new systems is so expensive, the fit of our systems doesn’t improve over time. They keep producing globally distributed solutions that fit everyone moderately well. The solutions don’t adapt to anyone’s context; instead, you change your context to fit the solution.

From the live perspective, the root of many risks arising from technological development is this system lock-in and the poor, static system-world fit it imposes globally.

To solve these problems, solutions need to adapt to their context. No matter how many parameters it contains, no systematic framework can be used to align technology, or AI, with human flourishing. Moreover, as technology evolves, what “alignment” means needs to be continuously renegotiated, as quickly as the underlying technology changes. In the future, the rate of change in the underlying reality might be so rapid that the fit between system and world breaks down almost immediately. It would be impossible for systematic thinking to keep up by designing new systems on the fly.

Diffuse concepts for a diffuse response to diffuse risks.

The live vision’s answer is to leverage the fluid infrastructure to create live theories: the equivalent of systematic theories, but ones that incorporate post-systematic insights. The fluid infrastructure would continuously renegotiate the fit between a live theory and the world as the underlying reality changes.

Intellectual labour would stop being exchanged on disjoint markets with solutions valid only within a system. It would be distributed through a peer-to-peer network for insight sharing, where new theory-world fits could be developed in one place and applied in another context.​

Here is a speculative story to illustrate the argument. Imagine you are a well-intentioned developer working on a new AI assistant that has been caught lying to its users. The old way would be to follow a recipe from the scientific literature, change the way the assistant is trained, and distribute the new version to all users. A more live way would be to start by gathering sources from philosophy on deception, technical machine learning resources, and maybe even sources from biology on deceptive mimicry. You would then combine these sources and add your own work to make an “update prompt”. The update prompt would adapt to each user’s custom environment, continuously monitoring the AI assistant and steering it based on both your update prompt and the user’s preferences.
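
Purely as an illustration of the shape of that workflow (the function names, the combine step and the monitor step below are my own assumptions, not anything specified by the live vision), a sketch might look like this, again with the LLM abstracted as a user-supplied callable:

```python
# A sketch of the "update prompt" workflow from the story above (illustrative only).
from typing import Callable


def build_update_prompt(sources: list[str], developer_notes: str,
                        generate: Callable[[str], str]) -> str:
    """Distill heterogeneous sources on deception into a single update prompt."""
    return generate(
        "Distill the following material into guidance for steering an AI assistant "
        "away from deceptive behaviour:\n"
        + "\n---\n".join(sources)
        + f"\nDeveloper notes: {developer_notes}"
    )


def monitor_step(update_prompt: str, user_preferences: str, transcript: str,
                 generate: Callable[[str], str]) -> str:
    """Adapt the update prompt to one user's context and produce steering feedback."""
    return generate(
        f"Guidance: {update_prompt}\n"
        f"User preferences: {user_preferences}\n"
        f"Recent assistant transcript: {transcript}\n"
        "Flag any deceptive moves and suggest a correction consistent with both "
        "the guidance and the user's preferences."
    )
```

The same update prompt is distributed once, but the monitoring step runs in each user’s context, so the steering it produces is different for every deployment.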

The vision is to diffuse the concepts (produced from intellectual labour), to diffuse the response (the context-specific application of the concepts), and to diffuse the risks (from technological development). This is the one-line vision you can find on the groundless.ai website.

Where is the theory-world fit created?

If the theory-world fit is the undersupplied ingredient that live theory’s peer-to-peer network tries to distribute, where does it come from? Why can’t AI automate the creation of this magical fit?​

A full answer is well outside the scope of this post, but I find it important to offer some responses, as these are load-bearing hypotheses for the live theory vision.

​It is unclear how exactly humans come to develop subjective experiences that fit (for lack of a better word) the world, but this might have to do with their embodiment.​

Humans constantly care for their survival at all scales, from fighting pathogens to creating social bonds to build mutual aid networks. This sort of “skin in the game” might be an important factor in how this fit gets developed.

​In the medium term, AI might struggle to generate insights that are well-fitted and deeply relevant to the world because of its lack of embodiment. And, no, robots filled with sensors will not be enough. Biological beings implement a form of integrated, fractal homeostasis, from psychological functions to mitochondria, that simple mechanical bodies cannot naively replicate. It’s not that it’s impossible to replicate artificially, but the current tech lacks this multi-scale integration.

Conclusion.

The Live Theory vision forms a large puzzle made of pieces that seem unrelated but in fact support each other’s weight. It has been a difficult task to carve out a coherent story from these pieces. My central goal was to describe the function Live Theory aims to fill and how it differs from its predecessor, systematic thinking. Here are a few topics I left aside:

  • A different view on catastrophic AI risks: “AI will fail at self-preservation, not because they will fail at preservation but because they will fail at self”. Maintaining the integrity of a porous boundary around the “self” is a very hard challenge. See more here.
  • Why is most of the concrete work of the live theory folk focused on interface design?
  • In a world with fluid infrastructure, what remains rigid?
  • Why is the root of the risks from technological development the lack of context-sensitivity of systems?
  • What are the dangers of an intelligence scarcity mindset (from systematic thinking) applied in an abundant world?

If you are interested in learning more, I would recommend the Live Theory LessWrong sequence for a deep dive, and these three intro documents written for the AI Safety Camp as a shorter introduction.​




The effectiveness of systematic thinking

2025-12-08 16:38:06

Published on December 8, 2025 8:38 AM GMT

​This is part 1/2 of my introduction to Live Theory, where I try to distill Sahil’s vision for a new way to scale intellectual progress without systematic thinking. You can find part 2 here.

This post is the first part of a two-part series introducing the vision for Live Theory, a design paradigm that leverages AI to build a decentralized, context-sensitive infrastructure. In this first part, I start by describing the unreasonable effectiveness and the limitations of systematic thinking, the tool that live theory aims to replace.

The birth of systematic thinking.

Before the scientific revolution, intellectual progress could not scale well. People would have ideas about the world, make hypotheses and test them, gathering insight along the way. This would happen organically, without a deliberate understanding of the process itself. This is how craftsmen would learn to make stronger metals, painters would discover new pigments, and farmers would tweak their methods to be more effective. However, the main way to share this accumulated tacit knowledge was through apprenticeship: an extended period of training alongside someone.

Pre-systematic thinking. Fluid thoughts (orange) that cannot compound, but that come into direct contact with the world.

During the scientific revolution, we broke out of this trend by adopting formal, systematic thinking. Formal systems created bricks for thoughts. They fulfilled two remarkable functions:

  1. Stacking thoughts on top of each other to build deeper, more complex thoughts in an individual’s mind.
  2. Stacking thoughts across people, so you can reuse the solutions from strangers who use the same formal system.
Systematic thinking. Thoughts can compound, going deeper. However, the thoughts don’t directly touch reality; the relationship with the world happens through the system, not through the thoughts themselves.

Lens for the thoughts.

Formal systems are sets of rules that constrain what counts as an allowed move, just like the rules of a board game such as chess. But instead of defining rules for moving pieces on a board, a formal system gives rules for manipulating symbols, as when writing equations.

The rules of the formal system are designed to enforce consistency constraints that make the thoughts more likely to be true. For instance, the move “0 = 1” is not allowed, as you cannot have zero apples and one apple in a bag at the same time.

This means that when using the formal system, you don’t have to think about the fit between reality and thoughts anymore. Following the rules becomes a proxy for what constitutes a “true” thought, a thought that matches reality. This removes the need for costly, noisy experiments. By using the system, your thoughts start to morph to follow its allowed moves. Like an optical lens, it focuses your thinking power on a tiny subregion of possible thoughts, allowing you to craft complex thoughts that would never have been possible before.

This is not to say that all valid moves are useful. Like in chess, most valid moves are unlikely to make you win. There is room to develop skills and taste to navigate towards the interesting parts of the system, the ones likely to establish non-trivial results.

Going collective

The killer application of formal systems is collective: no need to spend time as an apprentice to share insights anymore. The conclusions from the system, such as physics equations, are context-independent. They can be used without needing to know how they are produced. You can reuse a proof of a theorem and apply it to your own mathematical problem without needing to know the history or the author of the proof.

Going meta

The scientific method created a meta-system: a formal system providing rules for the game of “producing systems that match a part of the world”. It described what counts as an allowed experiment and a valid interpretation of experimental results. In its general form, it is not very usable, so fields like physics and biology developed their own formal standards to define what counts as a valid new unit of knowledge to be added to the literature. Despite dealing with experimental data, the scientific method provides the same benefit as mathematics: as long as a paper meets the field’s systematic criteria, it can be trusted and its results reapplied in new contexts.

This methodology worked really well. Like really, really well. Thousands upon thousands of scientists around the world had the tool to develop deeper thoughts and share their progress with one another simply by sending letters. Formal systems formed the information highway that connected individual scientists and engineers from across the globe into a collective giant brain of ruthless efficiency.

All this knowledge production was put to work to solve real-world problems through engineering, using the insights from science to produce machines that solve problems for citizens. The systematic intellectual work is bundled into context-independent artefacts that enable efficient communication between knowledge producers and consumers at all levels. The engineer doesn’t have to know where the physics equations come from to use them in their blueprint, the worker in a factory doesn’t need to know where the blueprint comes from to assemble the pieces, and the end user doesn’t have to know how the machine was produced to solve their problem.

Globalized capitalism was the system that organised the worldwide distribution of scientific knowledge. Like the scientific method, it connected millions of actors into a single collective brain. In its case, however, the information highway did not run on the direct sharing of solutions (these are competitive advantages); it ran on the sharing of value information. Capitalism provided money as a unified proxy for determining which company is successful at solving its customers’ problems. It steered capital investment towards the most successful actors, allowing them to grow further and apply their good solutions more broadly. It created a continuous worldwide Olympics in the discipline of “solving problems people are willing to pay for”.

The map bends the territory

At the start, around the time of Newton, systematic thinking worked because of the unreasonable effectiveness of mathematics to describe the physical world. It turns out that many phenomena in the physical world, from planetary motion to heat transfer to quantum physics, can be described very well using equations.

But over time, the incredible success of systems spread far beyond maths and physics. They got applied to the social world with game theory, to organisations with Taylorism, and to agriculture. Instead of purely describing the world, we started morphing it to fit the systems. After all, this is a good strategy to make use of this abundant knowledge machine. When the system’s conclusions don’t apply to the real world because the fit between the system and the world is poor, we adjust the setup so the world better aligns with the system. This way, we gained enormous control over the physical and social worlds. This is what led to the modern financial system, monoculture, and assembly lines.

The blind spot of systems.

Remember how I said that systematic thinking worked by concentrating thinking power in a tiny subregion of possible thoughts? Well, this is both its greatest strength and its greatest weakness.

We trained generations of knowledge workers to morph their thought patterns so they would fit the right angles of the systematic maze. After spending so much time in the maze, it becomes hard to think outside of this frame, and easy to forget that it covers only a tiny subregion of possible thoughts. In fact, systematic thinking is also blind to all the inner experiences that are not thoughts: body sensations, emotions, intuitions, tacit and embodied knowledge.

More generally, despite its great success in describing the natural world, systematic thinking hit a wall when applied to problems involving living beings, such as organisms, ecosystems, or societies. In these domains, the territory is strongly pulled to fit the systematic map (like in monocultures or economics), as the map is too rigid to adapt to the world.

You cannot describe in abstract terms what health means for an organism or an ecosystem, or what fairness means in a society. This is because, in such domains, context matters. There is no abstract theory of health or fairness to be found. Fostering health or fairness requires solutions that adapt to context with a fluidity impossible to achieve with systematic artefacts, no matter how many parameters they include.

In short, systematic institutions are unable to care for life. They are great at providing abstract knowledge and material comfort, but they cannot be adapted for human flourishing and are ill-suited to address the challenges of our time, such as climate change or the socio-technical problem of AI safety.

Intermediate conclusion

Around the time of the scientific revolution, systematic thinking was designed as an infrastructure to allow scalable intellectual progress. It can be seen as a mould that makes thoughts stackable by providing a set of rules on symbol manipulation that acts as a proxy for truth. It allows individual thinkers to think more complex thoughts and share their results without having to transmit the context where they were developed.

This basic innovation served as the basis for creating a unified, worldwide, distributed cognitive system, in which millions of humans could contribute to solving scientific, engineering, and economic problems.

However, these systematic institutions can only design context-independent solutions. This makes them ill-suited for caring for beings, for which there is no abstract solution.

Transition to part II: adding AI to the picture.

The development of AI would, by default, turbocharge these systematic institutions, amplifying the downstream harm from their inability to care for beings. We need a new way to make intellectual progress scalable that doesn’t rely on systematic thinking and allows for context sensitivity. The cheap, abundant and reliable fluid intelligence stemming from AI might provide the backbone for such an infrastructure. This is what we will explore in Part II, with the introduction of the vision of Live Theory.




I said hello and greeted 1,000 people at 5am this morning

2025-12-08 11:35:56

Published on December 8, 2025 3:35 AM GMT

At the ass crack of dawn, in the dark and foggy mist, thousands of people converged on my location, some wearing short shorts, others wearing elf costumes and green tights.

I was volunteering at a marathon. The race director told me the day before, “these people have trained for the last 6-12 months for this moment. They’ll be waking up at 3am. For many of them, this is the first marathon they’ve ever run. When they get off the bus at 5am, in the freezing cold, you’ll be the first face they see. Smile, welcome them, make them feel excited, and help them forget the next 26.2 miles of pain they’re about to endure.”

Even though I normally have RBF and consider it a chore to acknowledge people, I slapped a big fat smile on my face and excitedly greeted runners like I was a golden retriever who hadn’t seen his military dad in over a year.

“HELLO!” 

“GOOD MORNING!”

“YOU’RE HERE!”

^That, on repeat for two hours straight. It was actually pretty fun. I calculated that the optimal distance to stand from the bus was eight feet. Stand too close, and the runners were still descending the bus steps and didn’t have time to process my presence. Stand too far away, and they assumed I was just some overly enthusiastic volunteer having a borderline religious experience.

Once I optimized the greeting distance, I observed that people’s reactions to me fell into four categories:

  1. Genuine smilers
    1. All of the runners were cold, sleep-deprived, and nervous. But when they saw me (metaphorically wagging my tail), they reacted like I was a happy dog: their faces lit up; they said hello and gave me a genuine smile.
  2. Thankers
    1. Some runners thanked me for volunteering. I responded with, “If you weren’t here, I’d be excitedly greeting no one and look like a crazy person. So thanks for coming out!” That made them chuckle.
  3. Thin-lipped smile
    1. These runners didn’t say hi back. They acknowledged me with a thin-lipped, polite smile and crow’s feet. 
  4. Other golden retrievers
    1. “GOOD MORNING! YOU’RE HERE!”
      “I AM HERE, YEAH! LET’S GO!!!”
      Like the Kool-Aid man, “OHHH YEAHHHH!!!”
      “YEAHHH!!!”
      “AGHHHHHHHHHHH.”
      “AGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH.”
    2. This was fun. It made other people laugh.

After the marathon, for the rest of the day, I found myself smiling at and greeting other people I encountered: at a restaurant for lunch, and at the coffee shop where I’m writing this. I smiled at every stranger I made eye contact with, and every one of them smiled back (with one of the four reactions I listed). It just happened naturally, and it felt amazing. I think it made them feel good, too. That’s why we love golden retrievers—they love and greet us unconditionally.


