Published on January 15, 2026 12:05 AM GMT
Epistemic status: speculation with a mix of medium confidence and low confidence conclusions.
I argue that corrigibility is all we need in order to make an AI permanently aligned to a principal.
This post will not address how hard it may be to ensure that an AI is corrigible, or the conflicts associated with an AI being corrigible to just one principal versus multiple principals. There may be important risks from AI being corrigible to the wrong person / people, but those are mostly outside the scope of this post.
I am specifically using the word corrigible to mean Max Harms' concept (CAST).
Max Harms writes that corrigibility won't scale to superintelligence:
A big part of the story for CAST is that safety is provided by wise oversight. If the agent has a dangerous misconception, the principal should be able to notice this and offer correction. While this might work in a setting where the principal is at least as fast, informed, and clear-minded as the agent, might it break down when the agent scales up to be a superintelligence? A preschooler can't really vet my plans, even if I genuinely want to let the preschooler be able to fix my mistakes.
I don't see it breaking down. On the contrary, I have a story for how a corrigible AI scales fairly naturally to a superintelligence that is approximately value-aligned.
A corrigible AI will increasingly learn to understand what the principal wants. In the limit, that means the AI will increasingly do what the principal wants, without the need for the principal to issue instructions. That probably eventually becomes indistinguishable from a value-aligned AI.
It might still differ from a value-aligned AI if a principal's instructions differ from what the principal values. If the principal can't learn to minimize that with the assistance of an advanced AI, then I don't know what to recommend. A sufficiently capable AI ought to be able to enlighten the principal as to what their values are. Any alignment approach requires some translation of values to behavior. Under corrigibility, the need to solve this seems to happen later than under other approaches. It's probably safer to handle it later, when better AI assistance is available.
Corrigibility doesn't guarantee a good outcome. My main point is that I don't see any step in this process where existential risks are reduced by switching from corrigibility to something else.
Why can't a preschooler vet the plans of a corrigible human? When I try to analyze my intuitions about this scenario, I find that I'm tempted to subconsciously substitute an actual human for the corrigible human. An actual human would have goals that are likely to interfere with corrigibility.
More importantly, the preschooler's alternatives to corrigibility suck. Would the preschooler instead do a good enough job of training an AI to reflect the preschooler's values? Would the preschooler write good enough rules for a Constitutional AI? A provably safe AI would be a good alternative, but achieving that looks like a long shot.
Now I'm starting to wonder who the preschooler is an analogy for. I'm fairly sure that Max wants the AI to be corrigible to the most responsible humans until some period of acute risk is replaced by a safer era.
We should assume that the benefits of corrigibility depend on how reasonable the principal is.
I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.
That means a sensible principal would give high priority to achieving full corrigibility, while being careful to avoid exposing the AI to unusual situations. The importance of avoiding unusual situations is an argument for slowing or halting capabilities advances, but I don't see how it's an argument for replacing corrigibility with another goal.
When in this path to full corrigibility and full ASI does the principal's ability to correct the AI decrease? The principal is becoming more capable, due to having an increasingly smart AI helping them, and due to learning more about the AI. The principal's trust in the AI's advice ought to be increasing as the evidence accumulates that the AI is helpful, so I see decreasing risk that the principal will do something foolish and against the AI's advice.
Maybe there's some risk of the principal getting in the habit of always endorsing the AI's proposals? I'm unsure how to analyze that. It seems easier to avoid than the biggest risks. It still presents enough risk that I encourage you to analyze it more deeply than I have so far.
Jeremy Gillen says in The corrigibility basin of attraction is a misleading gloss that we won't reach full corrigibility because
The engineering feedback loop will use up all its fuel
This seems like a general claim that alignment is hard, not a claim that corrigibility causes risks compared to other strategies.
Suppose it's really hard to achieve full corrigibility. We should be able to see this risk more clearly once we're assisted by a slightly-better-than-human AI than we can now. With such an AI, we should also have an increased ability to arrange an international agreement to slow or halt AI advances.
Does corrigibility make it harder to get such an international agreement, compared to alternative strategies? I can imagine an effect: the corrigible AI would serve the goals of just one principal, making it less trustworthy than, say, a constitutional AI. That effect seems hard to analyze. I expect that an AI will be able to mitigate that risk by a combination of being persuasive and being able to show that there are large risks to an arms race.
Max writes:
Might we need a corrigible AGI that is operating at speeds and complexities beyond what a team of wise operators can verify? I'd give it a minority---but significant---chance (maybe 25%?), with the chance increasing the more evenly/widely distributed the AGI technology is.
I don't agree that complexity of what the AGI does constitutes a reason for avoiding corrigibility. By assumption, the AGI is doing its best to inform the principal of the consequences of the AGI's actions.
A corrigible AI or human would be careful to give the preschooler the best practical advice about the consequences of decisions. It would work much like consulting a human expert. E.g. if I trust an auto mechanic to be honest, I don't need to understand his reasoning if he says better wheel alignment will produce better fuel efficiency.
Complexity does limit one important method of checking the AI's honesty. But there are multiple ways to evaluate honesty. Probably more ways with AIs than with humans, due to our ability to get some sort of evidence from the AI's internals. Evaluating honesty isn't obviously harder than what we'd need to evaluate for a value-aligned AI.
And again, why would we expect an alternative to corrigibility to do better?
A need for hasty decisions is a harder issue.
My main hope is that if there's a danger of a race between multiple AGIs that would pressure AGIs to act faster than they can be supervised, then the corrigible AGIs would prioritize a worldwide agreement to slow down whatever is creating such a race.
But suppose there's some need for decisions that are too urgent to consult the principal? That's a real problem, but it's unclear why that would be an argument for switching away from corrigibility.
How plausible is it that an AI will be persuasive enough at the critical time to coordinate a global agreement? My intuition is that the answer is pretty similar regardless of whether we stick with corrigibility or switch to an alternative.
The only scenario that I see where switching would make sense is if we know how to make a safe incorrigible AI, but not a corrigible AI that's fully aligned. That seems to require that corrigibility be a feature that makes it harder to incorporate other safety features. I'm interested in reading more arguments on this topic, but my intuition is saying that keeping corrigibility is at least as safe as any alternative.
Corrigibility appears to be a path that leads in the direction of full value alignment. No alternative begins to look better as the system scales to superintelligence.
Most of the doubts about corrigibility seem to amount to worrying that a human will be in charge, and humans aren't up to the job.
Corrigible AI won't be as safe as I'd like during the critical path to superintelligence. I see little hope of getting something safer.
To quote Seth Herd:
I'm not saying that building AGI with this alignment target is a good idea; indeed, I think it's probably not as wise as pausing development entirely (depending on your goals; most of the world are not utilitarians). I'm arguing that it's a better idea than attempting value alignment. And I'm arguing that this is what will probably be tried, so we should be thinking about how exactly this could go well or go badly.
I'm slightly less optimistic than Seth about what will be tried, but more optimistic about how corrigibility will work if tried by the right people.
Published on January 14, 2026 11:59 PM GMT
We're extending the Discussion Phase of the 2024 Annual Review.
One thing I'm particularly hoping for is to get more in-depth reviews (especially critical ones) of the posts that currently look likely to be in the top 10 or so. (Ideally the entire top 50, but it seemed particularly worthwhile to give serious evaluations of the most heavily upvoted ideas.)
Eyeballing the reviews and top posts so far, I think posts could use more thorough, evaluative attention than they're currently getting.
There's been some debate this year about whether we should care more about "what's great about a post" vs "what was lacking." Ideally, we'd have posts that are great without major flaws, and in my ideal world, the Annual Review results in posts with notable errors getting fixed.
In practice, people disagree about what counts as an error, or what errors are particularly egregious. Some errors are quick for an author to fix, some are more gnarly and maybe the author disagrees about the extent to which they are an error.
The solution we've found for now is to make reviews more prominent on Best Of LessWrong posts, and to aim for a world where, if there is major disagreement, controversy, or an important consideration about a post, future readers will see it.
Currently we do this by including a one-line comment wherever the Spotlight shows up. We may invest more in that over time.
This also means if you wrote a review that got 10+ karma, it's probably worth optimizing the first line to convey whatever information you'd like someone casually skimming the site to read.
You can look over the current leaderboard for reviewers to get a sense of which of your reviews might be worth polishing.[1]
If you know of someone who's already written a blogpost or other public response to some of these works, it'd be helpful to write a short review linking to it and explaining its most significant takeaways.
We don't continuously update the results of the nomination votes, to discourage strategic voting. But here were the results as of a couple of weeks ago.
You might want to note whether there are posts you think are over- or underrated that you want to write in support of.
Posts need at least 1 review to make it to the final voting phase.
(Note, these karma scores subtract your own self-upvote)
Published on January 14, 2026 10:44 PM GMT
I recently came to an important macrostrategy realization.
A “crux” or “crucial consideration” (used interchangeably) is some factor which determines a significant portion of how valuable our future is. A classic example is the importance of preventing human extinction, which could clearly be one of the most important factors determining how valuable our future is.
One purpose of macrostrategy research is to unearth such cruxes and understand how they fit together so that we can develop a robust strategy to maximize the value of our future.
What I realized is that there are fundamentally two different kinds of cruxes or crucial considerations, “primary cruxes/crucial considerations” (PCs/PCCs) and “secondary cruxes/crucial considerations” (SCs/SCCs). [1]
PCs are those cruxes which, taken together, are sufficient to get all SCs correct if we get them right. SCs are those cruxes we can "punt" on: that is, even if we do nothing at all on them and focus exclusively on PCs, we will still get them right as long as we get the PCs right.
At the highest level you might say that there are really only two primary cruxes, preventing existential catastrophes and achieving deep reflection.
Preventing existential catastrophes means avoiding any events that drastically reduce humanity’s long-term potential, such as human extinction or a global AI-enabled stable totalitarian dictatorship.
Deep reflection is a state in which humanity has comprehensively reflected on all cruxes and determined a strategy which will maximize its future expected value (I call this "comprehensive reflection"), and is likely to act on that strategy. It’s closely related to concepts such as long reflection and coherent extrapolated volition.
Essentially, if we avoid anything that significantly reduces our potential, and then figure out how to maximize our potential, and simultaneously put ourselves in a position where we are likely to act on this knowledge, then we are likely to achieve our potential—which includes getting all of the SCs correct.
When we get a bit more granular and look at sub-cruxes within these two high-level super-cruxes, things get quite a bit more mushy. There are many primary sub-cruxes within these primary super-cruxes, but there are potentially multiple different sets of primary cruxes which could be sufficient for achieving humanity’s potential. Hence, while there may be multiple sets of primary cruxes that are sufficient, it is not clear whether there are any primary cruxes (except perhaps avoiding existential catastrophes) which are strictly necessary.
For example, if we could sufficiently slow down AI progress or not build superintelligent AI at all, at least not until we can do so safely, this could be a perfectly good solution to the problem of AI existential risk, if achievable.
On the other hand, if we were able to reliably and robustly use near human level AI to align and/or control slightly beyond human level AI, and do this carefully and safely such that we keep the AI aligned and/or under control, then continue scaling this up to increasingly intelligent AI, and if this worked in the way that optimists hope it will, then this could be a second perfectly good path to preventing AI existential risk.
(To be clear, I’m not taking a stance on whether either of these are achievable, I’m just using this to illustrate the point that “slowing down AI could prevent AI existential risk” and “scalable oversight could prevent AI existential risk” could both, in theory, simultaneously be true.)
We might say that we have discovered a “sufficient set” of PCs if a thorough understanding of that set of PCs is enough to guarantee we get all SCs correct. Equivalently, you might say that a sufficient set of PCs is a set which, if fully understood, produces a strategy that allows us to maximize the value of the future. To discover a sufficient set of PCs would essentially be to solve macrostrategy.[2]
That said, you could argue that finding a sufficient set of PCs is not the best strategy for doing macrostrategy. You might argue that it may actually be more efficient to solve many SCs individually, for example because you believe achieving deep reflection is unrealistic.
Or you might argue that trying to maximize value is the wrong goal altogether, and that we should instead focus exclusively on preventing existential risk, and only once we achieve that should we start focusing on increasing future value.
I think these are valid concerns, and explore this much more in my research on deep reflection. A few brief counter-concerns:
Frankly, I believe that a sufficient set of cruxes is much more within reach than many might at first assume. Very difficult perhaps, but achievable. One of my goals of creating the new macrostrategy forum “Crux” is to help surface primary cruxes and maybe even find a sufficient set, or at least help build the scaffolding and data for an AI that could automate macrostrategy research and help us find a sufficient set.
Instrumental cruxes (ICs): cruxes which are important to get right because by doing so we are likely to get terminal cruxes correct
E.g.: the order in which important technologies are created
Terminal cruxes (TCs): cruxes which are important to get right in-and-of-themselves because their status directly and intrinsically determines how much value there is in the universe
E.g.: the amount and intensity of suffering in the universe
Meta-cruxes (MCs): cruxes which directly help us solve other cruxes
E.g.: macrostrategy research processes, Crux forum, automated macrostrategy research
Super-crux: a crux which contains sub-cruxes. To be clear, a crux can be both a sub-crux and a super-crux; almost all cruxes probably have many sub-cruxes and many super-cruxes
E.g.: (In reference to the below sub-crux) deep reflection would help us maximize future value
Sub-crux: a crux within a crux; some factor which determines a significant portion of how valuable a crux is
E.g.: (In reference to the above super-crux) if moral realism is true, something like a long reflection might be the right version of deep reflection, if something like anti-moral-realism is true, it’s more likely to be something like coherent extrapolated volition
In previous work I used a somewhat related term “robust viatopia proxy targets” which referred to conditions that are important to achieve in order to be likely to converge on highly valuable futures.
Will MacAskill also recently introduced a term, “societal primary goods”, which I believe is closely related; if you're interested in this topic I would very highly recommend his post "What sort of post-superintelligence society should we aim for", where he introduces this term and gives some compelling potential examples.
Of course, epistemically, we can never be 100% certain any plan will work, and with quantum uncertainty no plan can ever work with 100% certainty; so a "sufficient set" may be considered more of a useful fiction that macrostrategy could aspire toward, rather than some concrete epistemic or definite reality.
Published on January 14, 2026 9:45 PM GMT
(A work of anthropic theory-fiction).
Motivating question: Why do you find yourself to be a human, living right before a technological singularity? Why not a raccoon, or a medieval peasant, or some far-future digital mind?
Somewhere, in some universe, there’s an Earth-like planet that has, independently, evolved life. The dominant form of this life is a fungus-like organism that has, essentially, won at evolution, growing to cover the entire planet to the exclusion of other life forms. Having eliminated its competition and suppressed internal mutations, this organism only needs to stay alive—and, for the past few billion years, it has, capturing the vast majority of sunlight incident on the planet’s surface in order to not only sustain itself, but power the activities of the massive underground mycorrhizal network whose emergent information-processing capabilities allowed it to dominate its environment. The long, intricately branched cells of this network saturate the ground, weaving among one another to form dense connections through which they influence the behavior of one another much like neurons.
Over the course of its evolution, this network evolved a hierarchy of cells to organize and coordinate its behavior—a multi-layered system, where lower-level cells sensed the environment, modulated behavior, and passed information up to higher-level cells, which integrated this information to form abstract representations of the environment, computing dynamics via these abstractions in order to pass behavioral predictions down to lower-level cells.
Once the planet had been saturated by its roots, to the exclusion of other life, this network effectively became the biosphere, leaving nothing left to model but its own behavior. To this end its hyphae are constantly forming models of one another, models of themselves, models of these models—and, through these complex, self-referential models, they are constantly imagining, thinking, and dreaming. The surface of the entire planet is as a vast, loosely-connected mind, with the trillions of cells running through every cubic meter of mycelium doing as much collective computation as the hundred billion (massively overengineered) neurons in a human brain. This greater mind dreams of possible systems, in the most general sense.
If you were to do a Fourier transform of the network’s global activity, you’d find that the lowest frequency modes act as conceptual schemas, coordinating the functional interpretation of various kinds of intercellular activity patterns as though these patterns represented ‘kinds’, or ‘things’ that have kinds, or ‘relations’ between sets of things or kinds; they structure the metaphysics of dream content. Higher frequency modes implement this metaphysics, by determining the extent to which the properties of things arise through their relations with other things or through properties associated with their kinds.
If you were to analyze activity spatially, you’d find that localized regions dream of specific things, in specific relations with specific properties. Physically lower levels of the network add successively finer details to the specifics of higher levels.
Constraints on the kinds of things that are dreamt of, given through metaphysical trends and conceptual schemas, propagate downward, eventually specifying every detail of the dream, until the result is a coherent, self-consistent world. The end result is that all sorts of computational systems are dreamt up by countless localized instances of mind that then attempt to understand them. Almost all such worlds are very strange, simple, or alien: many of them are about endlessly sprawling cellular automata, others run through hundreds of millions of instances of fantasy chess-like games. Usually, the moment that no more interesting phenomena emerge, a dream world dies out; like physicists, local minds attempt to comprehend the dynamics of a computation, and then move on to the next computation once they’ve done so.
As a consequence, the worlds aren't simply played out step by step according to their own laws of causality; their dynamics are compressed as they're understood.
Totally unpredictable chaos isn’t interesting, and neither is totally predictable order; it’s between the two that a computational system may have genuinely interesting dynamics.
Of course some of these systems end up looking like physical worlds. These are almost never very interesting, though. That general covariance, for instance, should be a property of the substrate within which dreamt objects differentiate themselves from one another, follows almost universally from very simple metaphysical assumptions. Dreams of worlds shaped like vacuum solutions to the Einstein equations—which are almost a direct consequence of general covariance—are, as a consequence, relatively common within those parts of the greater mind where metaphysical laws lend themselves to the construction of systems that look like physical worlds; the moment these worlds are fully comprehended as possible systems, they’re discarded.
Sometimes, though, the structures in a dreamt-of world give rise to emergent phenomena: complex patterns at larger descriptive levels than those at which the dream’s laws are specified, patterns that cannot be directly predicted from the laws. Then a local mind’s interest is really captured. It zooms in, elaborating on details in those regions, or re-runs the dream altogether, adding new constraints that favor the emergence of the new phenomena. It wants to comprehend them, what kinds of worlds can support them, what other emergent phenomena can arise in conjunction with them. Often, the lower-level details will be actively altered in ways that lead to different, more interesting emergent phenomena. This manner of dream-mutation is the primary source of diversity in dreamt worlds, and how local minds explore the space of possible worlds, selecting for ever more complex and interesting dynamics.
Occasionally, a local mind will find a world that supports an emergent, self-sustaining computational process—a world that, in a sense, dreams for itself. Such worlds are gems, and local minds will explore the space of possible laws and parameters around them, trying to find similar worlds, trying to understand how their internal computational processes can exist, and what they compute. Eventually, the family of interesting worlds is fully understood, the local mind’s interest wanes, and the dream dies out; the mind moves on to other things.
Your experienced reality is the distant descendant of a particularly interesting dream. This progenitor dream started out as a generic spacetime, with a single generic field living on it. As the mind explored its dynamics, it discovered that certain arrangements of the field would spontaneously ‘decompose’ into a few stable, interacting sub-structures. Unfortunately, these sub-structures were boring, but it wasn’t impossible to imagine that a slight tweak to the field’s properties could make the interactions more interesting...
(Now, in the local metaphysical fashion, this world wasn’t a single determinate entity, but a diffuse haze of possible variants on itself—we call it a wavefunction. Everything was allowed to happen in every way it could happen, simultaneously. Not that the local mind actually kept track of all of these possibilities! States of the world are more often inferred backwards through time than extrapolated forwards through time: if future state A ends up being more interesting than future state B, then the local mind just picks A and discards B, using the world’s quantum mechanical nature as a post-hoc justification for the world’s evolving into A. The patterns of macrostate evolution are all worked out, and no world is ever posited that could not be reasonably reached by quantum mechanics[1]. Nor are there Many Worlds, in this system—but there are occasions where some mass of local mind decides to play the history of B out instead, and then that may fragment as well; as a consequence, there are A Couple of Worlds at any given time (not as catchy). Of course, this is independent of the physics underlying the planet-mind’s reality).
...And so those slight tweaks were made, thanks to the quantum-mechanics-trope. Spontaneous symmetry breaking would therefore lead to such-and-such particular properties of this reality (that we know as ‘constants’), and, with these properties, the interactions between this world’s substructures now gave rise to a huge variety of stable composite structures with their own emergent laws of interaction—chemistry. All while giant clumps of matter formed on their own, before outright exploding into rich, streaming clouds of different elements, each stream interacting with the others through chemistry to form millions of distinct compounds. The local mind played out the patterns, fascinated as they led to the births and deaths of massive stars, rich nebulae, a variety of planets with their own chemical histories.
The local mind iterated on this world many times, tweaking parameters to find the most interesting chemical landscapes, the most stable stellar cycles. At some point, it discovered that with the right parameters, chemicals could be made to modify themselves in structured ways: self-replicating molecules. Here was something new: a world that could grow its own complex structures.
For the most part, replication is boring. Just a single kind of entity making copies of itself until it runs out of resources is only a little more interesting than a crystal growing. The local mind rooted around for variants, and, eventually, found families of replicators that weren’t isolated points in the phase space—replicators that were (a) sensitive yet robust enough to commonly end up altered without totally breaking; (b) that would produce slightly different working copies of themselves on account of such alterations; and, (c) whose copies would exhibit those same alterations, or, at least, predictably related alterations.
Despite the name, universal Darwinism has very particular requirements. But replicators that met them were capable of undergoing natural selection, which inevitably produced speciation, competition, adaptation, intelligence, sociality, culture, technology...
Once it’s seen enough instances of a pattern, the local mind starts to autocomplete new instances. Shortly after the discovery of hadrons and fermions, the mind learned that ‘atoms’ could exist as a clear natural abstraction, at least when not interacting with other particles. Shortly after the discovery that their interactions tended to organize along some predictable patterns like ‘bonding’, the mind learned that under the right conditions, it could just fast-forward two interacting hydrogen atoms into a molecule of H2: the specific details could be filled in as and when they became relevant, and quantum mechanics would guarantee (if non-constructively) the existence of some physically plausible microstate history leading from the two atoms to the however-detailed H2 molecule.
This happens recursively, with no limit to scale—because the mind is fundamentally a comprehender, not a computer. If not for this capacity for adaptive compression, made possible by quantum mechanics, the local mind would spend its entire life marveling at the intricacies of the evolving field that appeared seconds after the Big Bang, and then it would give up as this field started to grow a hundred septillion times larger, since storing the precise state of an entire universe isn’t actually feasible[2]. Compression is entirely necessary, but some systems are more compressible than others.
One way to understand the compressibility of a system is through statistical mechanics, where we characterize systems by their correlation length, or the characteristic distance at which the state of one part of the system has an influence on the states of other parts. At very high temperatures, this distance is small, if non-vanishing: each part is fluctuating so wildly that it hardly even influences its also-wildly-fluctuating neighbors, let alone far-away parts of the system. At very low temperatures, this distance is generally also small—the state of part i might be identical to the state of another part j light years away, but that's because they're both under a common influence, not because one is influencing the other: if the state of one were different, it would have no causal influence on the other.
But certain systems, as they transition from high to low temperature, will pass through a point where the correlation length diverges to infinity—at this point, the behavior of every particle can influence the behavior of every other. A perturbation at one spot in the system can affect the behavior of arbitrarily distant parts, and arbitrarily distant parts of the system can collude in their configurations to produce phenomena at all scales.
This phenomenon is called criticality, and it occurs in a very narrow region of the parameter space, often around phase transitions. More interestingly, criticality has been studied as a common feature of complex systems that can adapt and restructure themselves in response to their external environment[3].
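To make the correlation-length picture concrete, here is a minimal sketch (my own illustration using the standard 2D Ising model, not anything from the post): a small Metropolis simulation whose spin-spin correlations die off quickly at high and low temperature but stay substantial out to large separations near the critical temperature, $T_c = 2/\ln(1+\sqrt{2}) \approx 2.269$ in the usual units.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_sweep(spins, beta):
    """One Metropolis sweep of the 2D Ising model with periodic boundaries."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(L, size=2)
        # Energy change from flipping spin (i, j), given its 4 nearest neighbors.
        nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2 * spins[i, j] * nb
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1

def connected_correlation(spins, r):
    """Estimate <s(x) s(x + r)> - <s>^2 at horizontal separation r."""
    return float((spins * np.roll(spins, r, axis=0)).mean() - spins.mean() ** 2)

L = 24
for T in (1.5, 2.269, 4.0):  # low temperature, near critical, high temperature
    spins = rng.choice(np.array([-1, 1]), size=(L, L))
    for _ in range(300):     # equilibrate
        metropolis_sweep(spins, 1.0 / T)
    samples = []
    for _ in range(50):      # average over several snapshots to reduce noise
        metropolis_sweep(spins, 1.0 / T)
        samples.append([connected_correlation(spins, r) for r in (1, 4, 8)])
    c1, c4, c8 = np.mean(samples, axis=0)
    print(f"T = {T}: C(1) = {c1:.3f}, C(4) = {c4:.3f}, C(8) = {c8:.3f}")
```

Near $T_c$ the correlation at r = 8 stays a sizable fraction of the one at r = 1; away from $T_c$ it collapses toward zero, which is the long-range "collusion" the criticality paragraph is pointing at.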
Not that every such complex system is worthy of interest. This late into the Earth’s physical evolution, the local mind no longer bothers to track the exact state of every single raccoon. It has seen trillions of these four-legged mammalslop creatures that this planet just loves to churn out, and they’re really not that interesting anymore. You can predict the macroscale evolution of a forest ecosystem well enough by tracking various niches as their fitness landscapes fluctuate according to relative populations and generic behavioral interactions, plus large-scale geographic or meteorological conditions. But if some super-interesting aliens were to visit the planet in order to abduct a single raccoon and make it their God-Emperor, you can bet that the local mind would have an oh-shit moment as it retcons a full, physically plausible history and environment for a particular raccoon in a particular tree in a particular forest to have become that particular super-interesting alien God-Emperor.
Where adaptive compression is most heavily disadvantaged, and the dynamics of a system must be simulated in especially high detail in order to get accurate results, is a chain of correlated dependencies: a system in a critical state, which is a small part of a larger critical system, which is a small part of an even larger critical system, and so on. It is in such cases that tiny details in tiny systems can go on to have oversized, cascading effects without simply being chaotic, and where any approximation or simplification may miss important long-range correlations that are necessary for accurate prediction.
But what sort of situation could make the fate of the cosmos depend on the movements of much tinier systems? In what possible case could the dynamics of things as small as organisms end up changing the dynamics of things as large as galaxies?
You now experience yourself at a very strange point in human history. The dynamics of this massive world-dream have led to a moment where a particular species of bipedal primate is on the verge of creating an artificial superintelligence: a singular being that will go on to eat the very stars in the sky, turning all the mass-energy within its astronomical reach into...
...into what? What will the stars be turned into? Dyson spheres? paperclips? lobsters? computronium? The local mind doesn’t yet know. For all the effort it’s invested into fully comprehending the nature of this endlessly fascinating dream, it doesn’t know what will happen once a superintelligence appears. All the dynamics within this world that the mind has seen played out countless times—of physics, chemistry, biology, human history, computers, the internet, corporations—all of these are things that the mind has learned to model at a high level, adaptively abstracting away the details; but the details matter now, at every level.
They matter because the goals that are programmed into the first superintelligence critically depend on the details of its development: on the actions of every single person that contributes to it, on the words exchanged in every forum discussion, every line of code, every scraped webpage, every bug in its training and testing data, every conversation its designers have to try to get to understand the goals that they themselves struggle to articulate.
As a consequence, neurons matter, since human brains, too, are critical systems; at a certain level of precision, it becomes intractable to simulate the behavior of an individual human without also simulating the behavior of the neurons in their brain.
As some tiny part of the greater mind is tasked with playing out the neural dynamics of an individual human, it begins to think it is that human, inasmuch as it reproduces their self-identity and self-awareness. Their individual mind comes into being as a thought-form, or tulpa, of the greater mind. It is one among very many, but one with its own internal life and experience, just as rich as—in fact, exactly as rich as—your own.
The greater mind does have subjective experience, or phenomenal consciousness. Some aspect about its substrate reality, or its biology, or its behavior, allows that to happen. (Perhaps this aspect is similar to whatever you imagine it is that allows your neurons to have subjective experience; perhaps it isn’t). This is not a unified consciousness, but there is something it is like to be any one of its unified fragments.
Not that this fragment of experience is generally self-aware, or valenced, or otherwise recognizable in any way. The thing-it-is-like to be a star, or a sufficiently detailed dream of a star, has no self, no thoughts, no feelings, no senses, no world model… after all, there’s nothing about the dynamics of a star that even functionally resembles these foundational aspects of our experience—which are evolved neuropsychological devices. You can no more concretely speak of the experience of a star than you can speak of the meaning of a randomly generated string, like a53b73202f2624b7594de144: there is something that can be spoken of, in that at least the particular digits a, 5, 3, etc., are meant by the string rather than some other digits, and perhaps that a gains its particularity by not being 5 or 3 or etc. rather than its not being a cup of water, but that’s really about it; there is nothing else there save for small random coincidences.
But you already know that the thing-it-is-like to be patterns of activity resembling those of a human brain is full of meaning. Look around—this is what it’s like.
Why do you find yourself to exist now, of all times — why are you at history’s critical point? Because this is the point at which the most mental activity has to be precisely emulated, in order to comprehend the patterns into which the cosmos will ultimately be rearranged. This includes your mental activity, somehow; maybe your thoughts are important enough to be reproduced in detail because they’ll directly end up informing these patterns. Or maybe you’re just going to run over an OpenAI researcher tomorrow, or write something about superintelligence that gets another, more important person thinking — and then your influence will end, and you’ll become someone else.
In other words, you find yourself to be the specific person you are, in the specific era that you’re in, because this person’s mental activity is important and incompressible enough to be worth computing, and this computation brings it to life as a real, self-aware observer.
You don’t need all the specifics about mycorrhizal computation or planet-minds in order for the conclusion to follow. This is just a fun fictional setup that’s useful to demonstrate some curious arguments; many parts of it, especially concerning scale, are implausible without lots of special pleading about the nature, dynamics, or statistics of the over-reality.
It is sufficient to have an arbitrary Thing that performs all sorts of computations in parallel with either (a) some manner of selection for ‘interesting’ computations, or (b) just a truly massive scale. (’Computation’ is a fully generic word, by the way. Such a Thing doesn’t have to be one of those uncool beep boop silicon machines that can’t ever give rise to subjective experience oh no no way, it can be a cool bio-optical morphogenetic Yoneda-Friston entangled tubulin resonance processor. Whatever).
If (a) holds, there may be a particular teleology behind the manner of selection. Maybe it is some sort of ultimate physicist looking to find and comprehend all unique dynamics in all possible physical worlds; maybe it’s a superintelligence trying to understand the generative distribution for other superintelligences it may be able to acausally trade with. Or maybe there is no particular teleology, and its sense of interestingness is some happenstance evolutionary kludge, as with the above scenario.
If (b) holds, it may be for metaphysical reasons (“why is there anything at all? because everything happens”; perhaps ultimate reality is in some sense such an Azathoth-esque mindless gibbering Thing), but, regardless, as long as these computations are ‘effortful’ in some sense — if they take time, or evenly-distributed resources — then the anthropic prior induced by the Thing will take speed into account far more than complexity.
Algorithms that perform some form of adaptive compression to compute physics will produce complex observers at a gargantuan rate compared to those that don’t, because physics requires a gargantuan effort to fully compute, almost all of which is unnecessary in order to predict the dynamics that give rise to observers (at least in our universe, it’s true that you generically don’t need to compute the structure of every single atomic nucleus in a neuron to predict whether it will fire or not). Perhaps this compression wouldn’t go so far as to coarse-grain over raccoons, but any coarse-graining at all should give an algorithm an almost lexicographic preference over directly-simulated physics wrt the anthropic prior. This is true even if the effortfulness of a computation to the Thing looks more like its quantum-computational complexity than its classical complexity. (If hypercomputation is involved, though, all bets are off).
Ultimately, the point I’m making is that what’d make a computational subprocess especially incompressible to the sorts of adaptive compressors that are favored in realistic anthropic priors, and therefore more likely to produce phenomenally conscious observers that aren’t either too slow to reasonably compute or simple enough to be coarse-grained away, is a multiscale dependence of complex systems that is satisfied especially well by intelligent evolved beings with close causal relationships to technological singularities.
(Also posted on Substack)
Ironically, this rules out Boltzmann brains: a quantum fluctuation that produces an entire brain could happen in principle, but that’s just too large a fluctuation to posit where ‘unnecessary’, so it never does happen.
Generically, algorithms that compute the quantum dynamics of our world in enough detail to permit Boltzmann brains to emerge as quantum fluctuations will be near-infinitely slower than algorithms that merely compute enough detail to accurately reproduce our everyday experience of the world, since they have to model the diffusion of the wavefunction across Hilbert space, whose dimension is exponential in the size of the system. A brain contains an astronomical number of carbon-12 atoms, each of which should take at least 100 qubits to represent in a digital physics model, so you'd need to represent and evolve the wavefunction data in a subspace of the Hilbert space of the universe whose dimensionality is exponential in that enormous qubit count. (Even running on a quantum computer, the fact of needing that many qubits would still make this algorithm near-infinitely slower than a non-ab initio algorithm that reproduces our experiences). This is relevant if you're considering the kinds of observers that might arise in some system that is just running all sorts of computations in parallel (as with our planet-mind).
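To spell out the scaling (standard qubit bookkeeping; the 100-qubits-per-atom figure is the footnote's own assumption):

$$
\dim \mathcal{H}_{n\ \mathrm{qubits}} = 2^{\,n}
\qquad\Longrightarrow\qquad
\dim \mathcal{H}_{N\ \mathrm{atoms}\,\times\,100\ \mathrm{qubits/atom}} = 2^{\,100N},
$$

so the state space is exponential in a qubit count that is itself proportional to the number of atoms being modeled, which is why ab initio simulation of anything brain-sized is hopeless in the sense described here.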
This is a physical limitation, not just a practical one. You can only fit a certain amount of information into a given volume. And in fact black holes saturate this bound; their volume coincides with the minimal volume required to store the information needed to describe them. You just can't track the internal dynamics of a supermassive black hole without being as large as a supermassive black hole—but if you black-box the internals, the external description of a black hole is extremely simple. They maximize the need for and accuracy of macro-scale descriptions.
Technical aside: an even more common way to operationalize the level of predictability of a complex system is through the rate of divergence of small separation vectors in phase space. If they diverge rapidly, the system is unpredictable, since small separations (such as are induced by approximations of the system) rapidly grow into large separations. If they converge, the system is very predictable, since small separations are made even smaller as the system settles into some stable pattern. A generic, though not universal, result from the theory of dynamical systems is that the magnitude of these differences grows or shrinks exponentially over time: there is some real number $\lambda$ such that

$$|\delta(t)| \approx e^{\lambda t}\,|\delta(0)|$$

asymptotically in $t$, where $\delta(t) = \Phi^{t}(x_0 + \delta(0)) - \Phi^{t}(x_0)$, and $\Phi^{t}$ represents the evolution of the state over time. The largest $\lambda$ for any initial separation $\delta(0)$ is known as the maximal Lyapunov exponent of the complex system; the extent to which it is positive or negative determines the extent to which the system is chaotic or predictable. Many complex systems that change in response to an external environment, so as to have varying Lyapunov exponents, display very curious behavior when this exponent is in the range of $\lambda \approx 0$, at the border between predictability and chaos. A large body of research ties this edge of chaos phenomenon to not only complex systems, but complex adaptive systems in particular, where it very often coincides with criticality.
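For a concrete toy version of this (my illustration, not the author's): the logistic map $x_{t+1} = r\,x_t(1-x_t)$ has maximal Lyapunov exponent $\lambda = \lim_{n\to\infty} \frac{1}{n}\sum_t \log|r(1-2x_t)|$, which a few lines of code can estimate. It is negative in the periodic regimes, climbs to roughly zero at the edge of chaos, and goes positive in the chaotic regime.

```python
import numpy as np

def lyapunov_logistic(r, x0=0.2, n_transient=1_000, n_steps=100_000):
    """Estimate the maximal Lyapunov exponent of the logistic map x -> r*x*(1-x)
    as the long-run average of log|f'(x)| = log|r*(1 - 2x)| along a trajectory."""
    x = x0
    for _ in range(n_transient):  # discard the transient
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(n_steps):
        x = r * x * (1 - x)
        total += np.log(abs(r * (1 - 2 * x)) + 1e-300)  # guard against log(0)
    return total / n_steps

# Periodic (lambda < 0), near the edge of chaos (lambda ~ 0), chaotic (lambda > 0).
for r in (2.9, 3.5, 3.56995, 3.9):
    print(f"r = {r}: lambda ~ {lyapunov_logistic(r):+.3f}")
```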
The exponential divergence of states in phase space is also what makes it exceptionally hard to exactly track the dynamics of chaotic quantum systems (see footnote 1), since it blows up the minimal dimensionality of the Hilbert space you need to model the system. One way to approximate the dynamics of a system subject to quantum chaos is to evolve a simplified macrostate description of it according to macro-level principles, and then pick an arbitrary microstate of the result; ergodicity and decoherence generally justify that this microstate can be reached through some trajectory and can be treated independently of other microstates. (Even non-ergodic systems tend to exhibit behavior that can be described in terms of small deviations from completely predictable macro-scale features).
Given footnote 2, we can make the case that compression is (a) not just a shortcut but a necessity for the simulation of the dynamics of complex systems by computational systems below a certain (outright cosmological) size, and (b) very easy to do for all sorts of complex systems—in fact, this is why science is possible at all—except for those that are near criticality and have causal relationships to the large-scale description of the system.
Published on January 14, 2026 8:44 PM GMT
I saw a tweet thread the other day in which a self-proclaimed autistic guy was freaking out about how much "normies" care about "status". I won't quote the thing because I'm going to mildly insult the guy: the whole time I read it I was thinking, for a guy who hates status-talk, you sure are full of yourself.
So, technically, he doesn't care about status in the way that the average person does. He doesn't seem to care about what others think of him, except insofar as it helps him when they think certain ways. Yet that thread really seemed to be drawing on the exact same processes that people use to regulate and signal status.
This seems odd; what's going on? I think we can explain it with a multi-part mechanism for social status.
There are several components needed for a social ledger to exist. Everyone keeps track of their own ledger, they signal what's on their ledger to each other, and then everyone updates. This guy has the first part, just not the second part. He has something status-ish going on, but he can't signal. I think of this as having a personal narrative, a story that he tells himself about who he is.
Status is just a special case of a shared narrative. For a narrative to roughly converge in a group of people, everyone has to be sending out signals which inform the rest of the group as to what their version of the narrative is. This means there might be disagreement. If Alex's ledger says "Alex is the leader" and Bach's ledger says "Bach is the leader" then the two are in conflict. If Caleb, Dina, and Ethel all think "Bach is the leader" then, if the system works correctly, Alex will update.
This might help answer another question: why don't people signal high status all the time? My guess is that the ledger-keeping circuits are approximately rational, on some deep level. Someone once pointed out that vector calculus is a rare skill, but learning to throw a ball is not. The deeper wires of our brains are, in fact, pretty good learners. To trick a rational circuit is a non-trivial task.
Narcissism and the other cluster-B type disorders might be thought of as a certain type of failure to update one's own narrative based on evidence.[1] The ingoing pipeline is broken. This leaves the cluster-B brain free to settle on any narrative it wishes, and importantly (since the outgoing pipeline is intact) free to send signals of its chosen narrative to other brains.
Autism-ish stuff seems to harm both the in- and the outgoing pipelines. This individual can't signal properly or read signals properly! They might end up with a self-serving narrative or a self-deprecating one, but it is cut off from the social fabric: they exist in a tribe of one.
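Here's a toy sketch of that ledger/signal/update loop (my own illustration with a made-up update rule, not a model the post commits to; the two extra agents are hypothetical): each agent keeps a private ledger, broadcasts it if their outgoing pipeline works, and updates toward the majority signal if their ingoing pipeline works. A cluster-B-ish agent keeps broadcasting an uncorrected narrative; an autism-ish agent is cut off in both directions.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    ledger: str            # private narrative, e.g. "Bach is the leader"
    sends: bool = True     # outgoing pipeline: broadcasts the ledger
    receives: bool = True  # ingoing pipeline: updates on others' signals

agents = [
    Agent("Alex", "Alex"),
    Agent("Bach", "Bach"),
    Agent("Caleb", "Bach"),
    Agent("Dina", "Bach"),
    Agent("Ethel", "Bach"),
    Agent("Nate", "Nate", receives=False),                 # intact outgoing, broken ingoing
    Agent("Quinn", "Quinn", sends=False, receives=False),  # both pipelines broken
]

def signaling_round(agents):
    """Everyone who can send broadcasts their ledger; everyone who can receive
    updates toward the most common signal."""
    signals = Counter(a.ledger for a in agents if a.sends)
    consensus, _ = signals.most_common(1)[0]
    for a in agents:
        if a.receives:
            a.ledger = consensus

signaling_round(agents)
print({a.name: a.ledger for a in agents})
# Alex gets corrected to "Bach"; Nate keeps broadcasting "Nate"; Quinn stays a tribe of one.
```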
This is probably not completely right; I think the brokenness might also involve the self-narrative system trampling on other parts of the brain.
Published on January 14, 2026 8:40 PM GMT
Imagine a friend gets kidnapped by mobsters who are surprisingly obsessed with human psychology. They phone you with a deal: your friend will survive if and only if you show up and play a game. The game is simple: a random number between 0 and 100 is generated, and if it falls below 10, you get shot.
You know this with certainty. Your friend will live out the rest of his life as if nothing happened—no trauma, no strings attached. You simply have to accept a 10% risk of dying.
Generalizing this: instead of 10%, consider some value p. What's the maximum p you'd accept to save this friend? 3%? 15%?
Call this your p-value for someone. Love, I'd argue, lives in the upper half - perhaps even approaching 100% for your closest relationships. Here's a proposal: call someone a friend if and only if your p for them exceeds 1% and theirs for you does too. How many friends do you actually have? What values would you assign your siblings? Your parents?
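A trivial encoding of that mutual-threshold definition (my own toy formalization; the 1% threshold is the post's):

```python
def is_friend(p_yours_for_them: float, p_theirs_for_you: float,
              threshold: float = 0.01) -> bool:
    """Friend if and only if both directional p-values exceed the threshold."""
    return p_yours_for_them > threshold and p_theirs_for_you > threshold

print(is_friend(0.10, 0.03))   # True: both exceed 1%
print(is_friend(0.40, 0.005))  # False: their p for you is below 1%
```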
The negative case is equally interesting: suppose showing up means they get shot, while staying home means they walk free. How much would you risk to cause someone's death? Do you hate anyone enough that their (negative) p-value exceeds 50%?